JavaOne 2016: Java Performance Analysis on Linux with Flame Graphs
Talk for JavaOne 2016 by Brendan GreggDescription: "Flame graphs are a simple stack trace visualization that have been used for many Java performance wins at Netflix. They simplify an everyday problem: how is software consuming resources, especially CPUs, and how has this changed? They are used by many companys, and were published in the June 2016 issue of CACM.
With Oracle's help these are now possible with -XX:+PreserveFramePointer and the Linux perf tool, and show everything: Java methods, GC, JVM internals, system libraries, and the kernel.
This talk has instructions for CPU Flame Graphs on Linux, and summarizes other types: memory page faults, TCP events, I/O, and more. New BPF stack and tracing capabilities from 2016 will be summarized, and Java package flame graphs introduced."
next prev 1/71 | |
next prev 2/71 | |
next prev 3/71 | |
next prev 4/71 | |
next prev 5/71 | |
next prev 6/71 | |
next prev 7/71 | |
next prev 8/71 | |
next prev 9/71 | |
next prev 10/71 | |
next prev 11/71 | |
next prev 12/71 | |
next prev 13/71 | |
next prev 14/71 | |
next prev 15/71 | |
next prev 16/71 | |
next prev 17/71 | |
next prev 18/71 | |
next prev 19/71 | |
next prev 20/71 | |
next prev 21/71 | |
next prev 22/71 | |
next prev 23/71 | |
next prev 24/71 | |
next prev 25/71 | |
next prev 26/71 | |
next prev 27/71 | |
next prev 28/71 | |
next prev 29/71 | |
next prev 30/71 | |
next prev 31/71 | |
next prev 32/71 | |
next prev 33/71 | |
next prev 34/71 | |
next prev 35/71 | |
next prev 36/71 | |
next prev 37/71 | |
next prev 38/71 | |
next prev 39/71 | |
next prev 40/71 | |
next prev 41/71 | |
next prev 42/71 | |
next prev 43/71 | |
next prev 44/71 | |
next prev 45/71 | |
next prev 46/71 | |
next prev 47/71 | |
next prev 48/71 | |
next prev 49/71 | |
next prev 50/71 | |
next prev 51/71 | |
next prev 52/71 | |
next prev 53/71 | |
next prev 54/71 | |
next prev 55/71 | |
next prev 56/71 | |
next prev 57/71 | |
next prev 58/71 | |
next prev 59/71 | |
next prev 60/71 | |
next prev 61/71 | |
next prev 62/71 | |
next prev 63/71 | |
next prev 64/71 | |
next prev 65/71 | |
next prev 66/71 | |
next prev 67/71 | |
next prev 68/71 | |
next prev 69/71 | |
next prev 70/71 | |
next prev 71/71 |
PDF: JavaOne2016_JavaFlameGraphs.pdf
Keywords (from pdftotext):
slide 1:
Sep 2016 Java Performance Analysis on Linux with Flame Graphs Brendan Gregg Senior Performance Architectslide 2:
Complete Visibility Java Mixed-Mode Flame Graph via Linux perf_events C++ (JVM) Java Java (Inlined) (User) (Kernel)slide 3:
Cloud OpJonal Apache, memcached, Node.js, … Atlas, Vector, S3 logs, sar, trace, perf, perf-tools, (BPF soon) Java (JDK 8) GC and thread dump logging Tomcat hystrix, servo CloudWatch, servo • Tens of thousands of AWS EC2 instances • Mostly Java (Oracle JVM) Auto Scaling Group Instance: usually Ubuntu Linux Scaling Policy loadavg, latency, … Instance Instance Instanceslide 4:
The Problem with Profilersslide 5:
Java Profilers Kernel, libraries, JVM Javaslide 6:
Java Profilers • Visibility – Java method execution – Object usage – GC logs – Custom Java context • Typical problems: – Sampling often happens at safety/yield points (skew) – Method tracing has massive observer effect – Misidentifies RUNNING as on-CPU (e.g., epoll) – Doesn't include or profile GC or JVM CPU time – Tree views not quick (proportional) to comprehend • Inaccurate (skewed) and incomplete profilesslide 7:
System Profilers Kernel TCP/IP Java JVM Locks Time epoll Idle threadslide 8:
System Profilers • Visibility – JVM (C++) – GC (C++) – libraries (C) – kernel (C) • Typical problems (x86): – Stacks missing for Java – Symbols missing for Java methods • Other architectures (e.g., SPARC) have fared better • Profile everything except Javaslide 9:
Workaround • Capture both: Java System • An improvement, but system stacks are missing Java context, and therefore hard to interpretslide 10:
Solution Java Mixed-Mode Flame Graph Kernel Java JVMslide 11:
Solution • Fix system profiling, see everything: – Java methods – JVM (C++) – GC (C++) – libraries (C) – kernel (C) – Other apps Kernel Java JVM • Minor Problems: – 0-3% CPU overhead to enable frame pointers (usuallyslide 12: Saving 13M CPU Minutes Per Day • eu hXp://techblog.neZlix.com/2016/04/saving-13-million-computaJonal-minutes.htmlslide 13:System Example Exception handling consuming CPUslide 14:Profiling GC GC internals, visualized:slide 15:CPU Profilingslide 16:CPU Profiling • Record stacks at a timed interval: simple and effective – Pros: Low (deterministic) overhead – Cons: Coarse accuracy, but usually sufficient stack samples: syscall on-CPU time off-CPU block interruptslide 17:Stack Traces • A code path snapshot. e.g., from jstack(1): $ jstack 1819 […] "main" prio=10 tid=0x00007ff304009000 nid=0x7361 runnable [0x00007ff30d4f9000] java.lang.Thread.State: RUNNABLE at Func_abc.func_c(Func_abc.java:6) at Func_abc.func_b(Func_abc.java:16) at Func_abc.func_a(Func_abc.java:23) at Func_abc.main(Func_abc.java:27) running parent g.parent g.g.parentslide 18:System Profilers • Linux – perf_events (aka "perf") • Oracle Solaris – DTrace • OS X – Instruments • Windows – XPerf, WPA (which now has flame graphs!) • And many others…slide 19:Linux perf_events • Standard Linux profiler – Provides the perf command (multi-tool) – Usually pkg added by linux-tools-common, etc. • Many event sources: – Timer-based sampling – Hardware events – Tracepoints – Dynamic tracing • Can sample stacks of (almost) everything on CPU – Can miss hard interrupt ISRs, but these should be near-zero. They can be measured if needed (I wrote my own tools)slide 20:perf Profiling # perf record -F 99 -ag -- sleep 30 [ perf record: Woken up 9 times to write data ] [ perf record: Captured and wrote 2.745 MB perf.data (~119930 samples) ] # perf report -n -stdio […] # Overhead Samples Command Shared Object Symbol # ........ ............ ....... ................. ............................. 20.42% bash [kernel.kallsyms] [k] xen_hypercall_xen_version --- xen_hypercall_xen_version call tree check_events summary |--44.13%-- syscall_trace_enter tracesys |--35.58%-- __GI___libc_fcntl |--65.26%-- do_redirection_internal do_redirections execute_builtin_or_function execute_simple_command [… ~13,000 lines truncated …]slide 21:Full perf report Outputslide 22:… as a Flame Graphslide 23:Flame Graphs • Flame Graphs: – x-axis: alphabetical stack sort, to maximize merging – y-axis: stack depth – color: random (default), or a dimension • Currently made from Perl + SVG + JavaScript – Multiple d3 versions are being developed • References: – http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html – http://queue.acm.org/detail.cfm?id=2927301 – "The Flame Graph" CACM, June 2016 • Easy to make – Converters for many profilersslide 24:Flame Graph Interpretation g() e() f() d() c() i() b() h() a()slide 25:Flame Graph Interpretation (1/3) Top edge shows who is running on-CPU, and how much (width) g() e() f() d() c() i() b() h() a()slide 26:Flame Graph Interpretation (2/3) Top-down shows ancestry e.g., from g(): g() e() f() d() c() i() b() h() a()slide 27:Flame Graph Interpretation (3/3) Widths are proportional to presence in samples e.g., comparing b() to h() (incl. children) g() e() f() d() c() i() b() h() a()slide 28:Mixed-Mode Flame Graphs • Hues: Mixed-Mode – green == Java – aqua == Java (inlined) • if included – red == system – yellow == C++ • Intensity: – Randomized to differentiate frames – Or hashed on function name Kernel Java JVMslide 29:Differential Flame Graphs • Hues: Differential – red == more samples – blue == less samples • Intensity: – Degree of difference • Compares two profiles • Can show other metrics: e.g., CPI • Other types exist – flamegraphdiff more lessslide 30:Flame Graph Search • Color: magenta to show matched frames search buttonslide 31:Flame Charts • Final note: these are useful, but are not flame graphs • Flame charts: x-axis is time • Flame graphs: x-axis is population (maximize merging)slide 32:Stack Tracingslide 33:Broken Java Stacks on x86 These stacks are 1 or 2 levels deep, with junk values • On x86 (x86_64), hotspot uses the frame pointer register (RBP) as general purpose # perf record –F 99 –a –g – sleep 30 # perf script […] java 4579 cpu-clock: 7f417908c10b [unknown] (/tmp/perf-4458.map) java 4579 cpu-clock: 7f41792fc65f [unknown] (/tmp/perf-4458.map) a2d53351ff7da603 [unknown] ([unknown]) […] This "compiler optimization" breaks (RBP-based) stack walking Once upon a time, x86 had fewer registers, and this made more sense gcc provides -fno-omit-frame-pointer to avoid doing this, but the JVM had no such option…slide 34:… as a Flame Graph Broken Java stacks (missing frame pointer)slide 35:Fixing Stack Walking Possibilities: A. Fix frame pointer-based stack walking (the default) – Pros: simple, supported by many tools – Cons: might cost a little extra CPU B. Use a custom walker (likely needing kernel support) – Pros: full stack walking (incl. inlining) & arguments – Cons: custom kernel code, can cost more CPU when in use C. Try libunwind and DWARF – Even feasible with JIT? Our current preference is (A)slide 36:-XX:+PreserveFramePointer • I hacked OpenJDK x86_64 to support frame pointers – Taking RBP out of register pools, and adding function prologues. It worked, I shared the patch. – It became JDK-8068945 for JDK 9 and JDK-8072465 for JDK 8 • Zoltán Majó (Oracle) rewrote it, and it is now: – -XX:+PreserveFramePointer in JDK 9 and JDK 8 u60b19 – Thanks to Zoltán, Oracle, and the other hotspot engineers for helping get this done! • It might cost 0 – 3% CPU, depending on workloadslide 37:Fixed Java Stacks # perf script […] java 4579 cpu-clock: 7f417908c10b [unknown] (/tmp/… java 4579 cpu-clock: 7f41792fc65f [unknown] (/tmp/… a2d53351ff7da603 [unknown] ([unkn… […] # perf script […] java 8131 cpu-clock: 7fff76f2dce1 [unknown] ([vdso]) 7fd3173f7a93 os::javaTimeMillis() (/usr/lib/jvm… 7fd301861e46 [unknown] (/tmp/perf-8131.map) 7fd30184def8 [unknown] (/tmp/perf-8131.map) 7fd30174f544 [unknown] (/tmp/perf-8131.map) 7fd30175d3a8 [unknown] (/tmp/perf-8131.map) 7fd30166d51c [unknown] (/tmp/perf-8131.map) 7fd301750f34 [unknown] (/tmp/perf-8131.map) 7fd3016c2280 [unknown] (/tmp/perf-8131.map) 7fd301b02ec0 [unknown] (/tmp/perf-8131.map) 7fd3016f9888 [unknown] (/tmp/perf-8131.map) 7fd3016ece04 [unknown] (/tmp/perf-8131.map) 7fd30177783c [unknown] (/tmp/perf-8131.map) 7fd301600aa8 [unknown] (/tmp/perf-8131.map) 7fd301a4484c [unknown] (/tmp/perf-8131.map) 7fd3010072e0 [unknown] (/tmp/perf-8131.map) 7fd301007325 [unknown] (/tmp/perf-8131.map) 7fd301007325 [unknown] (/tmp/perf-8131.map) 7fd3010004e7 [unknown] (/tmp/perf-8131.map) 7fd3171df76a JavaCalls::call_helper(JavaValue*,… 7fd3171dce44 JavaCalls::call_virtual(JavaValue*… 7fd3171dd43a JavaCalls::call_virtual(JavaValue*… 7fd31721b6ce thread_entry(JavaThread*, Thread*)… 7fd3175389e0 JavaThread::thread_main_inner() (/… 7fd317538cb2 JavaThread::run() (/usr/lib/jvm/nf… 7fd3173f6f52 java_start(Thread*) (/usr/lib/jvm/… 7fd317a7e182 start_thread (/lib/x86_64-linux-gn…slide 38:Fixed Stacks Flame Graph Java stacks (but no symbols)slide 39:Stack Depth • perf had a 127 frame limit • Now tunable in Linux 4.8 – sysctl -w kernel.perf_event_max_stack=512 – Thanks Arnaldo Carvalho de Melo! A Java microservice with a stack depth of >gt; 900 broken stacks perf_event_max_stack=1024slide 40:Symbolsslide 41:Fixing Symbols For JIT'd code, Linux perf already looks for an externally provided symbol file: /tmp/perf-PID.map, and warns if it doesn't exist # perf script Failed to open /tmp/perf-8131.map, continuing without symbols […] java 8131 cpu-clock: 7fff76f2dce1 [unknown] ([vdso]) 7fd3173f7a93 os::javaTimeMillis() (/usr/lib/jvm… 7fd301861e46 [unknown] (/tmp/perf-8131.map) […] This file can be created by a Java agentslide 42:Java Symbols for perf • perf-map-agent – https://github.com/jrudolph/perf-map-agent – Agent attaches and writes the /tmp file on demand (previous versions attached on Java start, wrote continually) – Thanks Johannes Rudolph! • Use of a /tmp symbol file – Pros: simple, can be low overhead (snapshot on demand) – Cons: stale symbols • Using a symbol logger with perf instead – Stephane Eranian contributed this to perf – See lkml for "perf: add support for profiling jitted code" # perf script java 14025 [017] 8048.157085: cpu-clock: HashMap;::get (/tmp/perf-12149.map) […] 7fd781253265 Ljava/util/ fixed symbolsslide 43:Stacks & Symbols Java Mixed-Mode Flame Graph Kernel Java JVMslide 44:Stacks & Symbols (zoom)slide 45:Inlining • Many frames may be missing (inlined) – Flame graph may still make enough sense • Inlining can be tuned – -XX:-Inline to disable, but can be 80% slower! – -XX:MaxInlineSize and -XX:InlineSmallCode can be tuned a little to reveal more frames • Can even improve performance! • perf-map-agent can un-inline (unfoldall) – Adds inlined frames to symbol dump – flamegraph.pl --color=java will color these aqua – Thanks Johannes Rudolph, T Jake Luciani, and Nitsan Wakartslide 46:Instructionsslide 47:Instructions 1. Check Java version 2. Install perf-map-agent 3. Set -XX:+PreserveFramePointer 4. Profile Java 5. Dump symbols 6. Generate Mixed-Mode Flame Graph Note these are unsupported: use at your own risk Reference: http://techblog.netflix.com/2015/07/java-in-flames.htmlslide 48:1. Check Java Version • Need JDK8u60 or better – for -XX:+PreserveFramePointer $ java -version java version "1.8.0_60" Java(TM) SE Runtime Environment (build 1.8.0_60-b27) Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode) • Upgrade Java if necessaryslide 49:2. Install perf-map-agent • Check https://github.com/jrudolph/perf-map-agent for the latest instructions. e.g.: $ sudo bash # apt-get install -y cmake # export JAVA_HOME=/usr/lib/jvm/java-8-oracle # cd /usr/lib/jvm # git clone --depth=1 https://github.com/jrudolph/perf-map-agent # cd perf-map-agent # cmake . # makeslide 50:3. Set -XX:+PreserveFramePointer • Needs to be set on Java startup • Check it is enabled (on Linux): $ ps wwp `pgrep –n java` | grep PreserveFramePointer $ jcmd `pgrep –n java` VM.flags | grep PreserveFramePointer • Measure overhead (should be small)slide 51:4. Profile Java • Using Linux perf_events to profile all processes, at 99 Hertz, for 30 seconds (as root): # perf record -F 99 -a -g -- sleep 30 • Just profile one PID (broken on some older kernels): # perf record -F 99 -p PID -g -- sleep 30 • These create a perf.data fileslide 52:5. Dump Symbols • See perf-map-agent docs for updated usage • e.g., as the same user as java: $ cd /usr/lib/jvm/perf-map-agent/out $ java -cp attach-main.jar:$JAVA_HOME/lib/tools.jar \ net.virtualvoid.perf.AttachOnce PID • perf-map-agent contains helper scripts. I wrote my own: – https://github.com/brendangregg/Misc/blob/master/java/jmaps • Dump symbols quickly after perf record to minimize stale symbols. How I do it: # perf record -F 99 -a -g -- sleep 30; jmapsslide 53:6. Generate a Mixed-Mode Flame Graph • Using my FlameGraph software: # perf script >gt; out.stacks01 # git clone --depth=1 https://github.com/brendangregg/FlameGraph # cat out.stacks01 | ./FlameGraph/stackcollapse-perf.pl | \ ./FlameGraph/flamegraph.pl --color=java --hash >gt; flame01.svg – perf script reads perf.data with /tmp/*.map – out.stacks01 is an intermediate file; can be handy to keep • Finally open flame01.svg in a browser • Check for newer flame graph implementations (e.g., d3)slide 54:Optimizations • Linux 2.6+, via perf.data and perf script: git clone --depth 1 https://github.com/brendangregg/FlameGraph cd FlameGraph perf record -F 99 -a –g -- sleep 30 perf script | ./stackcollapse-perf.pl |./flamegraph.pl >gt; perf.svg • Linux 4.5+ can use folded output – Skips the CPU-costly stackcollapse-perf.pl step; see: http://www.brendangregg.com/blog/2016-04-30/linux-perf-folded.html • Linux 4.9+, via BPF: git clone --depth 1 https://github.com/brendangregg/FlameGraph git clone --depth 1 https://github.com/iovisor/bcc ./bcc/tools/profile.py -dF 99 30 | ./FlameGraph/flamegraph.pl >gt; perf.svg – Most efficient: no perf.data file, summarizes in-kernelslide 55:Linux Profiling OpJmizaJons Linux 2.6 Linux 4.5 Linux 4.9 capture stacks capture stacks count stacks (BPF) perf record perf record profile.py write samples perf.data reads samples perf script write text stackcollapse-perf.pl folded output flamegraph.pl write samples perf.data reads samples perf report –g folded folded output folded report awk folded output flamegraph.pl flamegraph.plslide 56:Automationslide 57:Netflix Vectorslide 58:Netflix Vector Select Instance Select Metrics Flame Graphs Near real-time, per-second metricsslide 59:Netflix Vector • Open source, on-demand, instance analysis tool – https://github.com/netflix/vector • Shows various real-time metrics • Flame graph support currently in development – Automating previous steps – Using it internally already – Also developing a new d3 front endslide 60:Advanced Analysisslide 61:Linux perf_events Coverage … all possible with Java stacksslide 62:Advanced Flame Graphs • Any event can be flame graphed, provided it is issued in synchronous Java context – Java thread still on-CPU, and event is directly triggered – On-CPU Java context is valid • Synchronous examples: – Disk I/O requests issued directly by Java à yes • direct reads, sync writes, page faults – Disk I/O completion interrupts à no* – Disk I/O requests triggered async, e.g., readahead à no* * can be made yes by tracing and associating contextslide 63:Page Faults • Show what triggered main memory (resident) to grow: # perf record -e page-faults -p PID -g -- sleep 120 • "fault" as (physical) main memory is allocated ondemand, when a virtual page is first populated • Low overhead tool to solve some types of memory leak RES column in top(1) grows becauseslide 64:Context Switches • Show why Java blocked and stopped running on-CPU: # perf record -e context-switches -p PID -g -- sleep 5 • Identifies locks, I/O, sleeps – If code path shouldn't block and looks random, it's an involuntary context switch. I could filter these, but you should have solved them beforehand (CPU load). • e.g., was used to understand framework differences: rxNetty Tomcat epoll futex sys_poll futexslide 65:Disk I/O Requests • Shows who issued disk I/O (sync reads & writes): # perf record -e block:block_rq_insert -a -g -- sleep 60 • e.g.: page faults in GC? This JVM has swapped out!:slide 66:TCP Events • TCP transmit, using dynamic tracing: # perf probe tcp_sendmsg # perf record -e probe:tcp_sendmsg -a -g -- sleep 1; jmaps # perf script -f comm,pid,tid,cpu,time,event,ip,sym,dso,trace >gt; out.stacks # perf probe --del tcp_sendmsg • Note: can be high overhead for high packet rates – For the current perf trace, dump, post-process cycle • Can also trace TCP connect & accept (lower overhead) • TCP receive is async – Could trace via socket read TCP sendsslide 67:CPU Cache Misses • In this example, sampling via Last Level Cache loads: # perf record -e LLC-loads -c 10000 -a -g -- sleep 5; jmaps # perf script -f comm,pid,tid,cpu,time,event,ip,sym,dso >gt; out.stacks • -c is the count (samples once per count) • Use other CPU counters to sample hits, misses, stallsslide 68:CPI Flame Graph • Cycles Per Instruction! – red == instruction heavy – blue == cycle heavy (likely mem stall cycles) zoomed:slide 69:Java Package Flame Graph • Sample on-CPU instruction pointer only (no stack) – Don't need -XX:+PreserveFramePointer • y-axis: package name hierarchy – java / util / ArrayList / ::size Linux 2.6+ (pre-BPF): no -g (stacks) # perf record -F 199 -a -- sleep 30; ./jmaps # perf script | ./pkgsplit-perf.sh | ./flamegraph.pl >gt; out.svgslide 70:Links & References Flame Graphs – http://www.brendangregg.com/flamegraphs.html – http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html – http://queue.acm.org/detail.cfm?id=2927301 – "The Flame Graph" CACM, Vol. 56, No. 6 (June 2016) – http://techblog.netflix.com/2015/07/java-in-flames.html – http://techblog.netflix.com/2016/04/saving-13-million-computational-minutes.html – http://techblog.netflix.com/2014/11/nodejs-in-flames.html – http://www.brendangregg.com/blog/2014-11-09/differential-flame-graphs.html Linux perf_events – https://perf.wiki.kernel.org/index.php/Main_Page – http://www.brendangregg.com/perf.html – http://www.brendangregg.com/blog/2015-02-27/linux-profiling-at-netflix.html – Linux 4.5: http://www.brendangregg.com/blog/2016-04-30/linux-perf-folded.html Netflix Vector – https://github.com/netflix/vector – http://techblog.netflix.com/2015/04/introducing-vector-netflixs-on-host.html hprof: http://www.brendangregg.com/blog/2014-06-09/java-cpu-sampling-using-hprof.htmlslide 71:Sep 2016 Thanks Questions? http://techblog.netflix.com http://slideshare.net/brendangregg http://www.brendangregg.com [email protected] @brendangregg