JavaOne2016_JavaFlameGraphs.pdf

JavaOne 2016: Java Performance Analysis on Linux with Flame Graphs

Talk for JavaOne 2016 by Brendan Gregg

Description: "Flame graphs are a simple stack trace visualization that have been used for many Java performance wins at Netflix. They simplify an everyday problem: how is software consuming resources, especially CPUs, and how has this changed? They are used by many companys, and were published in the June 2016 issue of CACM.

With Oracle's help these are now possible with -XX:+PreserveFramePointer and the Linux perf tool, and show everything: Java methods, GC, JVM internals, system libraries, and the kernel.

This talk has instructions for CPU Flame Graphs on Linux, and summarizes other types: memory page faults, TCP events, I/O, and more. New BPF stack and tracing capabilities from 2016 will be summarized, and Java package flame graphs introduced."

	next prev 1/71
	next prev 2/71
	next prev 3/71
	next prev 4/71
	next prev 5/71
	next prev 6/71
	next prev 7/71
	next prev 8/71
	next prev 9/71
	next prev 10/71
	next prev 11/71
	next prev 12/71
	next prev 13/71
	next prev 14/71
	next prev 15/71
	next prev 16/71
	next prev 17/71
	next prev 18/71
	next prev 19/71
	next prev 20/71
	next prev 21/71
	next prev 22/71
	next prev 23/71
	next prev 24/71
	next prev 25/71
	next prev 26/71
	next prev 27/71
	next prev 28/71
	next prev 29/71
	next prev 30/71
	next prev 31/71
	next prev 32/71
	next prev 33/71
	next prev 34/71
	next prev 35/71
	next prev 36/71
	next prev 37/71
	next prev 38/71
	next prev 39/71
	next prev 40/71
	next prev 41/71
	next prev 42/71
	next prev 43/71
	next prev 44/71
	next prev 45/71
	next prev 46/71
	next prev 47/71
	next prev 48/71
	next prev 49/71
	next prev 50/71
	next prev 51/71
	next prev 52/71
	next prev 53/71
	next prev 54/71
	next prev 55/71
	next prev 56/71
	next prev 57/71
	next prev 58/71
	next prev 59/71
	next prev 60/71
	next prev 61/71
	next prev 62/71
	next prev 63/71
	next prev 64/71
	next prev 65/71
	next prev 66/71
	next prev 67/71
	next prev 68/71
	next prev 69/71
	next prev 70/71
	next prev 71/71

PDF: JavaOne2016_JavaFlameGraphs.pdf

Keywords (from pdftotext):

slide 1:

Sep 2016
Java Performance Analysis
on Linux with Flame Graphs
Brendan Gregg Senior Performance Architect

slide 2:

Complete Visibility
Java Mixed-Mode Flame Graph via Linux perf_events
C++
(JVM)
Java
Java
(Inlined)
(User)
(Kernel)

slide 3:

Cloud
OpJonal Apache,
memcached,
Node.js, …
Atlas, Vector, S3
logs, sar, trace,
perf, perf-tools,
(BPF soon)
Java (JDK 8)
GC and
thread
dump
logging
Tomcat
hystrix, servo
CloudWatch, servo
• Tens of thousands of AWS EC2 instances
• Mostly Java (Oracle JVM)
Auto Scaling
Group
Instance: usually Ubuntu Linux
Scaling Policy
loadavg,
latency, …
Instance
Instance
Instance

slide 4:

The Problem with Profilers

slide 5:

Java Profilers
Kernel,
libraries,
JVM
Java

slide 6:

Java Profilers
• Visibility
– Java method execution
– Object usage
– GC logs
– Custom Java context
• Typical problems:
– Sampling often happens at safety/yield points (skew)
– Method tracing has massive observer effect
– Misidentifies RUNNING as on-CPU (e.g., epoll)
– Doesn't include or profile GC or JVM CPU time
– Tree views not quick (proportional) to comprehend
• Inaccurate (skewed) and incomplete profiles

slide 7:

System Profilers
Kernel
TCP/IP
Java
JVM
Locks
Time
epoll
Idle
thread

slide 8:

System Profilers
• Visibility
– JVM (C++)
– GC (C++)
– libraries (C)
– kernel (C)
• Typical problems (x86):
– Stacks missing for Java
– Symbols missing for Java methods
• Other architectures (e.g., SPARC) have fared better
• Profile everything except Java

slide 9:

Workaround
• Capture both:
Java
System
• An improvement, but system stacks are missing Java
context, and therefore hard to interpret

slide 10:

Solution
Java Mixed-Mode Flame Graph
Kernel
Java
JVM

slide 11:

Solution
• Fix system profiling,
see everything:
– Java methods
– JVM (C++)
– GC (C++)
– libraries (C)
– kernel (C)
– Other apps
Kernel
Java
JVM
• Minor Problems:
– 0-3% CPU overhead to enable frame pointers (usually

slide 12:

Saving 13M CPU Minutes Per Day
• eu
hXp://techblog.neZlix.com/2016/04/saving-13-million-computaJonal-minutes.html

slide 13:

System Example
Exception handling consuming CPU

slide 14:

Profiling GC
GC internals, visualized:

slide 15:

CPU Profiling

slide 16:

CPU Profiling
• Record stacks at a timed interval: simple and effective
– Pros: Low (deterministic) overhead
– Cons: Coarse accuracy, but usually sufficient
stack
samples:
syscall
on-CPU
time
off-CPU
block
interrupt

slide 17:

Stack Traces
• A code path snapshot. e.g., from jstack(1):
$ jstack 1819
[…]
"main" prio=10 tid=0x00007ff304009000
nid=0x7361 runnable [0x00007ff30d4f9000]
java.lang.Thread.State: RUNNABLE
at Func_abc.func_c(Func_abc.java:6)
at Func_abc.func_b(Func_abc.java:16)
at Func_abc.func_a(Func_abc.java:23)
at Func_abc.main(Func_abc.java:27)
running
parent
g.parent
g.g.parent

slide 18:

System Profilers
• Linux
– perf_events (aka "perf")
• Oracle Solaris
– DTrace
• OS X
– Instruments
• Windows
– XPerf, WPA (which now has flame graphs!)
• And many others…

slide 19:

Linux perf_events
• Standard Linux profiler
– Provides the perf command (multi-tool)
– Usually pkg added by linux-tools-common, etc.
• Many event sources:
– Timer-based sampling
– Hardware events
– Tracepoints
– Dynamic tracing
• Can sample stacks of (almost) everything on CPU
– Can miss hard interrupt ISRs, but these should be near-zero. They can
be measured if needed (I wrote my own tools)

slide 20:

perf Profiling
# perf record -F 99 -ag -- sleep 30
[ perf record: Woken up 9 times to write data ]
[ perf record: Captured and wrote 2.745 MB perf.data (~119930 samples) ]
# perf report -n -stdio
[…]
# Overhead
Samples Command
Shared Object
Symbol
# ........ ............ ....... ................. .............................
20.42%
bash [kernel.kallsyms] [k] xen_hypercall_xen_version
--- xen_hypercall_xen_version
call tree
check_events
summary
|--44.13%-- syscall_trace_enter
tracesys
|--35.58%-- __GI___libc_fcntl
|--65.26%-- do_redirection_internal
do_redirections
execute_builtin_or_function
execute_simple_command
[… ~13,000 lines truncated …]

slide 21:

Full perf report Output

slide 22:

… as a Flame Graph

slide 23:

Flame Graphs
• Flame Graphs:
– x-axis: alphabetical stack sort, to maximize merging
– y-axis: stack depth
– color: random (default), or a dimension
• Currently made from Perl + SVG + JavaScript
– Multiple d3 versions are being developed
• References:
– http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html
– http://queue.acm.org/detail.cfm?id=2927301
– "The Flame Graph" CACM, June 2016
• Easy to make
– Converters for many profilers

slide 24:

Flame Graph Interpretation
g()
e()
f()
d()
c()
i()
b()
h()
a()

slide 25:

Flame Graph Interpretation (1/3)
Top edge shows who is running on-CPU,
and how much (width)
g()
e()
f()
d()
c()
i()
b()
h()
a()

slide 26:

Flame Graph Interpretation (2/3)
Top-down shows ancestry
e.g., from g():
g()
e()
f()
d()
c()
i()
b()
h()
a()

slide 27:

Flame Graph Interpretation (3/3)
Widths are proportional to presence in samples
e.g., comparing b() to h() (incl. children)
g()
e()
f()
d()
c()
i()
b()
h()
a()

slide 28:

Mixed-Mode Flame Graphs
• Hues:
Mixed-Mode
– green == Java
– aqua == Java (inlined)
• if included
– red == system
– yellow == C++
• Intensity:
– Randomized to
differentiate frames
– Or hashed on
function name
Kernel
Java
JVM

slide 29:

Differential Flame Graphs
• Hues:
Differential
– red == more samples
– blue == less samples
• Intensity:
– Degree of difference
• Compares two profiles
• Can show other
metrics: e.g., CPI
• Other types exist
– flamegraphdiff
more
less

slide 30:

Flame Graph Search
• Color: magenta to show matched frames
search
button

slide 31:

Flame Charts
• Final note: these are useful, but are not flame graphs
• Flame charts: x-axis is time
• Flame graphs: x-axis is population (maximize merging)

slide 32:

Stack Tracing

slide 33:

Broken Java Stacks on x86
These stacks are
1 or 2 levels deep,
with junk values
• On x86 (x86_64),
hotspot uses the
frame pointer
register (RBP) as
general purpose
# perf record –F 99 –a –g – sleep 30
# perf script
[…]
java 4579 cpu-clock:
7f417908c10b [unknown] (/tmp/perf-4458.map)
java
4579 cpu-clock:
7f41792fc65f [unknown] (/tmp/perf-4458.map)
a2d53351ff7da603 [unknown] ([unknown])
[…]
This "compiler optimization" breaks (RBP-based) stack walking
Once upon a time, x86 had fewer registers, and this made more
sense
gcc provides -fno-omit-frame-pointer to avoid doing this, but
the JVM had no such option…

slide 34:

… as a Flame Graph
Broken Java stacks
(missing frame pointer)

slide 35:

Fixing Stack Walking
Possibilities:
A. Fix frame pointer-based stack walking (the default)
– Pros: simple, supported by many tools
– Cons: might cost a little extra CPU
B. Use a custom walker (likely needing kernel support)
– Pros: full stack walking (incl. inlining) & arguments
– Cons: custom kernel code, can cost more CPU when in use
C. Try libunwind and DWARF
– Even feasible with JIT?
Our current preference is (A)

slide 36:

-XX:+PreserveFramePointer
• I hacked OpenJDK x86_64 to support frame pointers
– Taking RBP out of register pools, and adding function prologues. It
worked, I shared the patch.
– It became JDK-8068945 for JDK 9 and JDK-8072465 for JDK 8
• Zoltán Majó (Oracle) rewrote it, and it is now:
– -XX:+PreserveFramePointer in JDK 9 and JDK 8 u60b19
– Thanks to Zoltán, Oracle, and the other hotspot engineers for helping
get this done!
• It might cost 0 – 3% CPU, depending on workload

slide 37:

Fixed Java Stacks
# perf script
[…]
java 4579 cpu-clock:
7f417908c10b [unknown] (/tmp/…
java
4579 cpu-clock:
7f41792fc65f [unknown] (/tmp/…
a2d53351ff7da603 [unknown] ([unkn…
[…]
# perf script
[…]
java 8131 cpu-clock:
7fff76f2dce1 [unknown] ([vdso])
7fd3173f7a93 os::javaTimeMillis() (/usr/lib/jvm…
7fd301861e46 [unknown] (/tmp/perf-8131.map)
7fd30184def8 [unknown] (/tmp/perf-8131.map)
7fd30174f544 [unknown] (/tmp/perf-8131.map)
7fd30175d3a8 [unknown] (/tmp/perf-8131.map)
7fd30166d51c [unknown] (/tmp/perf-8131.map)
7fd301750f34 [unknown] (/tmp/perf-8131.map)
7fd3016c2280 [unknown] (/tmp/perf-8131.map)
7fd301b02ec0 [unknown] (/tmp/perf-8131.map)
7fd3016f9888 [unknown] (/tmp/perf-8131.map)
7fd3016ece04 [unknown] (/tmp/perf-8131.map)
7fd30177783c [unknown] (/tmp/perf-8131.map)
7fd301600aa8 [unknown] (/tmp/perf-8131.map)
7fd301a4484c [unknown] (/tmp/perf-8131.map)
7fd3010072e0 [unknown] (/tmp/perf-8131.map)
7fd301007325 [unknown] (/tmp/perf-8131.map)
7fd301007325 [unknown] (/tmp/perf-8131.map)
7fd3010004e7 [unknown] (/tmp/perf-8131.map)
7fd3171df76a JavaCalls::call_helper(JavaValue*,…
7fd3171dce44 JavaCalls::call_virtual(JavaValue*…
7fd3171dd43a JavaCalls::call_virtual(JavaValue*…
7fd31721b6ce thread_entry(JavaThread*, Thread*)…
7fd3175389e0 JavaThread::thread_main_inner() (/…
7fd317538cb2 JavaThread::run() (/usr/lib/jvm/nf…
7fd3173f6f52 java_start(Thread*) (/usr/lib/jvm/…
7fd317a7e182 start_thread (/lib/x86_64-linux-gn…

slide 38:

Fixed Stacks Flame Graph
Java stacks
(but no symbols)

slide 39:

Stack Depth
• perf had a 127 frame limit
• Now tunable in Linux 4.8
– sysctl -w kernel.perf_event_max_stack=512
– Thanks Arnaldo Carvalho de Melo!
A Java microservice
with a stack depth
of >gt; 900
broken stacks
perf_event_max_stack=1024

slide 40:

Symbols

slide 41:

Fixing Symbols
For JIT'd code, Linux perf already looks for an
externally provided symbol file: /tmp/perf-PID.map, and
warns if it doesn't exist
# perf script
Failed to open /tmp/perf-8131.map, continuing without symbols
[…]
java 8131 cpu-clock:
7fff76f2dce1 [unknown] ([vdso])
7fd3173f7a93 os::javaTimeMillis() (/usr/lib/jvm…
7fd301861e46 [unknown] (/tmp/perf-8131.map)
[…]
This file can be created by a Java agent

slide 42:

Java Symbols for perf
• perf-map-agent
– https://github.com/jrudolph/perf-map-agent
– Agent attaches and writes the /tmp file on demand (previous versions
attached on Java start, wrote continually)
– Thanks Johannes Rudolph!
• Use of a /tmp symbol file
– Pros: simple, can be low overhead (snapshot on demand)
– Cons: stale symbols
• Using a symbol logger with perf instead
– Stephane Eranian contributed this to perf
– See lkml for "perf: add support for profiling jitted code"
# perf script
java 14025 [017] 8048.157085: cpu-clock:
HashMap;::get (/tmp/perf-12149.map)
[…]
7fd781253265 Ljava/util/
ﬁxed
symbols

slide 43:

Stacks & Symbols
Java Mixed-Mode Flame Graph
Kernel
Java
JVM

slide 44:

Stacks & Symbols (zoom)

slide 45:

Inlining
• Many frames may be missing (inlined)
– Flame graph may still make enough sense
• Inlining can be tuned
– -XX:-Inline to disable, but can be 80% slower!
– -XX:MaxInlineSize and -XX:InlineSmallCode
can be tuned a little to reveal more frames
• Can even improve performance!
• perf-map-agent can un-inline (unfoldall)
– Adds inlined frames to symbol dump
– flamegraph.pl --color=java will color these aqua
– Thanks Johannes Rudolph, T Jake Luciani, and
Nitsan Wakart

slide 46:

Instructions

slide 47:

Instructions
1. Check Java version
2. Install perf-map-agent
3. Set -XX:+PreserveFramePointer
4. Profile Java
5. Dump symbols
6. Generate Mixed-Mode Flame Graph
Note these are unsupported: use at your own risk
Reference: http://techblog.netflix.com/2015/07/java-in-flames.html

slide 48:

1. Check Java Version
• Need JDK8u60 or better
– for -XX:+PreserveFramePointer
$ java -version
java version "1.8.0_60"
Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
• Upgrade Java if necessary

slide 49:

2. Install perf-map-agent
• Check https://github.com/jrudolph/perf-map-agent for the
latest instructions. e.g.:
$ sudo bash
# apt-get install -y cmake
# export JAVA_HOME=/usr/lib/jvm/java-8-oracle
# cd /usr/lib/jvm
# git clone --depth=1 https://github.com/jrudolph/perf-map-agent
# cd perf-map-agent
# cmake .
# make

slide 50:

3. Set -XX:+PreserveFramePointer
• Needs to be set on Java startup
• Check it is enabled (on Linux):
$ ps wwp `pgrep –n java` | grep PreserveFramePointer
$ jcmd `pgrep –n java` VM.flags | grep PreserveFramePointer
• Measure overhead (should be small)

slide 51:

4. Profile Java
• Using Linux perf_events to profile all processes, at 99
Hertz, for 30 seconds (as root):
# perf record -F 99 -a -g -- sleep 30
• Just profile one PID (broken on some older kernels):
# perf record -F 99 -p PID -g -- sleep 30
• These create a perf.data file

slide 52:

5. Dump Symbols
• See perf-map-agent docs for updated usage
• e.g., as the same user as java:
$ cd /usr/lib/jvm/perf-map-agent/out
$ java -cp attach-main.jar:$JAVA_HOME/lib/tools.jar \
net.virtualvoid.perf.AttachOnce PID
• perf-map-agent contains helper scripts. I wrote my own:
– https://github.com/brendangregg/Misc/blob/master/java/jmaps
• Dump symbols quickly after perf record to minimize stale
symbols. How I do it:
# perf record -F 99 -a -g -- sleep 30; jmaps

slide 53:

6. Generate a Mixed-Mode Flame Graph
• Using my FlameGraph software:
# perf script >gt; out.stacks01
# git clone --depth=1 https://github.com/brendangregg/FlameGraph
# cat out.stacks01 | ./FlameGraph/stackcollapse-perf.pl | \
./FlameGraph/flamegraph.pl --color=java --hash >gt; flame01.svg
– perf script reads perf.data with /tmp/*.map
– out.stacks01 is an intermediate file; can be handy to keep
• Finally open flame01.svg in a browser
• Check for newer flame graph implementations (e.g., d3)

slide 54:

Optimizations
• Linux 2.6+, via perf.data and perf script:
git clone --depth 1 https://github.com/brendangregg/FlameGraph
cd FlameGraph
perf record -F 99 -a –g -- sleep 30
perf script | ./stackcollapse-perf.pl |./flamegraph.pl >gt; perf.svg
• Linux 4.5+ can use folded output
– Skips the CPU-costly stackcollapse-perf.pl step; see:
http://www.brendangregg.com/blog/2016-04-30/linux-perf-folded.html
• Linux 4.9+, via BPF:
git clone --depth 1 https://github.com/brendangregg/FlameGraph
git clone --depth 1 https://github.com/iovisor/bcc
./bcc/tools/profile.py -dF 99 30 | ./FlameGraph/flamegraph.pl >gt; perf.svg
– Most efficient: no perf.data file, summarizes in-kernel

slide 55:

Linux Proﬁling OpJmizaJons
Linux 2.6
Linux 4.5
Linux 4.9
capture stacks
capture stacks
count stacks (BPF)
perf record
perf record
profile.py
write samples
perf.data
reads samples
perf script
write text
stackcollapse-perf.pl
folded output
flamegraph.pl
write samples
perf.data
reads samples
perf report
–g folded
folded
output
folded report
awk
folded output
flamegraph.pl
flamegraph.pl

slide 56:

Automation

slide 57:

Netflix Vector

slide 58:

Netflix Vector
Select Instance
Select
Metrics
Flame Graphs
Near real-time,
per-second metrics

slide 59:

Netflix Vector
• Open source, on-demand, instance analysis tool
– https://github.com/netflix/vector
• Shows various real-time metrics
• Flame graph support currently in development
– Automating previous steps
– Using it internally already
– Also developing a new d3 front end

slide 60:

Advanced Analysis

slide 61:

Linux perf_events Coverage
… all possible with Java stacks

slide 62:

Advanced Flame Graphs
• Any event can be flame graphed, provided it is issued in
synchronous Java context
– Java thread still on-CPU, and event is directly triggered
– On-CPU Java context is valid
• Synchronous examples:
– Disk I/O requests issued directly by Java à yes
• direct reads, sync writes, page faults
– Disk I/O completion interrupts à no*
– Disk I/O requests triggered async, e.g., readahead à no*
* can be made yes by tracing and associating context

slide 63:

Page Faults
• Show what triggered main memory (resident) to grow:
# perf record -e page-faults -p PID -g -- sleep 120
• "fault" as (physical) main memory is allocated ondemand, when a virtual page is first populated
• Low overhead tool to solve some types of memory leak
RES column in top(1)
grows
because

slide 64:

Context Switches
• Show why Java blocked and stopped running on-CPU:
# perf record -e context-switches -p PID -g -- sleep 5
• Identifies locks, I/O, sleeps
– If code path shouldn't block and looks random, it's an involuntary context switch. I
could filter these, but you should have solved them beforehand (CPU load).
• e.g., was used to understand framework differences:
rxNetty
Tomcat
epoll
futex
sys_poll
futex

slide 65:

Disk I/O Requests
• Shows who issued disk I/O (sync reads & writes):
# perf record -e block:block_rq_insert -a -g -- sleep 60
• e.g.: page faults in GC? This JVM has swapped out!:

slide 66:

TCP Events
• TCP transmit, using dynamic tracing:
# perf probe tcp_sendmsg
# perf record -e probe:tcp_sendmsg -a -g -- sleep 1; jmaps
# perf script -f comm,pid,tid,cpu,time,event,ip,sym,dso,trace >gt; out.stacks
# perf probe --del tcp_sendmsg
• Note: can be high overhead for high packet rates
– For the current perf trace, dump, post-process cycle
• Can also trace TCP connect & accept (lower overhead)
• TCP receive is async
– Could trace via socket read
TCP sends

slide 67:

CPU Cache Misses
• In this example, sampling via Last Level Cache loads:
# perf record -e LLC-loads -c 10000 -a -g -- sleep 5; jmaps
# perf script -f comm,pid,tid,cpu,time,event,ip,sym,dso >gt; out.stacks
• -c is the count (samples
once per count)
• Use other CPU counters to
sample hits, misses, stalls

slide 68:

CPI Flame Graph
• Cycles Per
Instruction!
– red == instruction
heavy
– blue == cycle
heavy (likely mem
stall cycles)
zoomed:

slide 69:

Java Package Flame Graph
• Sample on-CPU instruction pointer only (no stack)
– Don't need -XX:+PreserveFramePointer
• y-axis: package name hierarchy
– java / util / ArrayList / ::size
Linux 2.6+ (pre-BPF):
no -g (stacks)
# perf record -F 199 -a -- sleep 30; ./jmaps
# perf script | ./pkgsplit-perf.sh | ./flamegraph.pl >gt; out.svg

slide 70:

Links & References
Flame Graphs
– http://www.brendangregg.com/flamegraphs.html
– http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html
– http://queue.acm.org/detail.cfm?id=2927301
– "The Flame Graph" CACM, Vol. 56, No. 6 (June 2016)
– http://techblog.netflix.com/2015/07/java-in-flames.html
– http://techblog.netflix.com/2016/04/saving-13-million-computational-minutes.html
– http://techblog.netflix.com/2014/11/nodejs-in-flames.html
– http://www.brendangregg.com/blog/2014-11-09/differential-flame-graphs.html
Linux perf_events
– https://perf.wiki.kernel.org/index.php/Main_Page
– http://www.brendangregg.com/perf.html
– http://www.brendangregg.com/blog/2015-02-27/linux-profiling-at-netflix.html
– Linux 4.5: http://www.brendangregg.com/blog/2016-04-30/linux-perf-folded.html
Netflix Vector
– https://github.com/netflix/vector
– http://techblog.netflix.com/2015/04/introducing-vector-netflixs-on-host.html
hprof: http://www.brendangregg.com/blog/2014-06-09/java-cpu-sampling-using-hprof.html

slide 71:

Sep 2016
Thanks
Questions?
http://techblog.netflix.com
http://slideshare.net/brendangregg
http://www.brendangregg.com
[email protected]
@brendangregg