Solaris Off-CPU Analysis
Off-CPU analysis is a performance methodology where high resolution off-CPU time is measured and studied. This reveals which code-paths led to time spent waiting (blocked), and can be a quick and generic way to root-cause a wide range of performance issues.
Note: This page was last updated in 2014 and is no longer maintained. It is now part of my Crypt, and kept here for reference. The new Linux version is here.
Studying off-CPU time differs from traditional profiling, which often samples the activity of threads at a given interval, and (usually) only examines threads that are executing work on-CPU. Here, the targets are threads that, while processing a workload request, have blocked and been context-switched off-CPU. This method also differs from tracing techniques that instrument the various application functions that commonly block, since it targets the kernel functions that perform the blocking, and so doesn't rely on having the foresight to instrument all the right places in the application.
Threads can leave the CPU for a number of reasons, including waiting for file system or network I/O, waiting to acquire a synchronization lock, paging/swapping (virtual memory), explicit timers, and signals. They can also leave the CPU for reasons somewhat unrelated to the current thread's execution, including involuntary context switching due to high demand for CPU resources, and interrupts. Whatever the reason, if this occurs during a workload request (a synchronous code-path), then it is introducing latency.
I'll summarize profiling techniques for off-CPU analysis, and introduce off-CPU as a metric. I'll then use DTrace to measure it, with MySQL as an example target to study. This methodology is suitable for any profiler that can trace kernel events, including perf, SystemTap, and ktap on Linux.
Profiling Techniques
While there are many tools that focus on CPU runtime analysis, the focus here is time spent blocked and off-CPU. I'll summarize sampling and tracing techniques, to show how they compare for identifying this off-CPU time.
CPU Sampling
Many traditional profiling tools sample activity across all CPUs, collecting snapshots of the current instruction address (program counter) or entire stack back trace at a certain rate (eg, 1000 Hertz). This will give counts of either the running function or the stack trace, allowing reasonable estimates to be calculated of where the CPU cycles are being spent. DTrace can sample too (dtrace -n 'profile-1001 /execname == "mysqld"/ { @[ustack()] = count(); }').
Consider application function A() calls function B(), which makes a blocking system call:
CPU Sampling ----------------------------------------------->
 |  |  |  |  |  |  |  |  |  |  |  |
 A  A  A  A  B  B  B  B  A  A  A  A
            A(---------.                .----------)
                        |               |
                        B(--------.     .--)
                                   |    |
user-land - - - - - - - - - - syscall - - - - - - - - - - - - - - - - -
                                   |    |
kernel                             X     Off-CPU
                                   |
                                 block . . . . . interrupt
While this can be very effective for studying on-CPU issues, including hot code-paths and adaptive mutex spins, it doesn't gather data when the application has blocked and is waiting off-CPU.
Application Tracing
App Tracing ------------------------------------------------>
            |           |              |          |
            A(          B(             B)         A)
            A(---------.                .----------)
                        |               |
                        B(--------.     .--)
                                   |    |
user-land - - - - - - - - - - syscall - - - - - - - - - - - - - - - - -
                                   |    |
kernel                             X     Off-CPU
                                   |
                                 block . . . . . interrupt
Here, functions are instrumented so that timestamps are collected when they begin "(" and end ")", allowing the time spent in each function to be calculated. If the timestamps include both elapsed time and CPU time (eg, using times(2) or getrusage(3C)), then it is also possible to calculate which functions were slow on-CPU vs which functions were slow because they were blocked off-CPU. Unlike sampling, these timestamps can have a very high resolution (nanoseconds).
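As a rough sketch of this approach (not from the original page), DTrace's pid provider can collect both kinds of timestamps at function entry and return; here "B" and PID are placeholders for a target application function and process, and vtimestamp provides the thread's on-CPU time:

# dtrace -n '
    /* time a hypothetical application function B(): elapsed vs on-CPU */
    pid$target::B:entry { self->ts = timestamp; self->vts = vtimestamp; }
    pid$target::B:return /self->ts/ {
        @["elapsed (ns)"] = quantize(timestamp - self->ts);
        @["on-CPU (ns)"] = quantize(vtimestamp - self->vts);
        self->ts = 0; self->vts = 0;
    }' -p PID

A large gap between the elapsed and on-CPU distributions suggests the function spent much of its time blocked off-CPU.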
While this works, a disadvantage is that you either trace all application functions, which can have a significant performance impact (and affect the performance you are trying to measure), or you pick the functions that are likely to block, and hope you didn't miss any. My file system latency tools were examples of this, which traced just the file system I/O functions in MySQL (and I hope I found them all).
Off-CPU Tracing
I'll summarize this here, then explain it in more detail.
Off-CPU Tracing -------------------------------------------->
                                   |    |
                                   B    B
                                   A    A
            A(---------.                .----------)
                        |               |
                        B(--------.     .--)
                                   |    |
user-land - - - - - - - - - - syscall - - - - - - - - - - - - - - - - -
                                   |    |
kernel                             X     Off-CPU
                                   |
                                 block . . . . . interrupt
With this approach, only the kernel functions that switch the thread off-CPU are traced, along with timestamps and user-land stack back traces. This focuses on the off-CPU events, without needing to trace all of the application functions, or needing to know what the application is. This approach works for any blocking event, for any application: MySQL, Apache, etc.
Off-CPU tracing captures all wait events for any application.
Later on this page I'll trace kernel off-CPU events and include some application level instrumentation to filter out asynchronous wait times (eg, threads waiting for work). Unlike application level instrumentation, I don't need to hunt down every place that may block off-CPU and instrument it; I just need to identify that the application is in a time sensitive code path (eg, during a MySQL query), so that the latency is synchronous to the workload.
Off-CPU
If it is possible to trace the kernel scheduler functions that context switch a thread off-CPU and back on again, with visibility into the user-land context (stack trace), then the approach pictured above can be implemented. DTrace can not only trace these kernel functions, but also has a stable kernel provider ("sched") for tracing the exact events of interest:
# dtrace -ln 'sched:::*cpu'
   ID   PROVIDER            MODULE                          FUNCTION NAME
18268      sched              unix                             swtch remain-cpu
18289      sched              unix                  resume_from_intr on-cpu
18290      sched              unix                resume_from_zombie on-cpu
18291      sched              unix                            resume on-cpu
18323      sched              unix                  resume_from_intr off-cpu
18324      sched              unix                resume_from_zombie off-cpu
18325      sched              unix                            resume off-cpu
The time spent off-CPU can be calculated as the timestamp delta between the sched:::off-cpu probe and the sched:::on-cpu probe, for the same thread. By itself, that time is not particularly useful, even when summarized as a neat distribution plot:
# dtrace -n 'sched:::off-cpu { self->ts = timestamp; }
    sched:::on-cpu /self->ts/ { @["ns"] = quantize(timestamp - self->ts); self->ts = 0; }'
dtrace: description 'sched:::off-cpu ' matched 6 probes
^C

  ns
           value  ------------- Distribution ------------- count
            4096 |                                         0
            8192 |@@@                                      55
           16384 |@@@@@@@@@@@@@                            243
           32768 |@@@@                                     75
           65536 |                                         8
          131072 |@                                        24
          262144 |                                         3
          524288 |                                         0
         1048576 |                                         0
         2097152 |                                         1
         4194304 |                                         0
         8388608 |                                         2
        16777216 |@@@@@@                                   105
        33554432 |@@@                                      66
        67108864 |@@@                                      55
       134217728 |@@@                                      61
       268435456 |@                                        28
       536870912 |@@                                       32
      1073741824 |                                         0
That shows two groupings, the first showing threads leaving CPU between 8 and 65 us, and the second showing between 16 ms and 1 second. But that alone is not very useful to know. What we need is context - who the application is (or if it's the kernel), what it's doing, and why it's blocking and leaving CPU.
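A first small step toward that context (a sketch, not one of the original one-liners) is to key the same quantize() by process name, showing who is leaving CPU:

# dtrace -n 'sched:::off-cpu { self->ts = timestamp; }
    sched:::on-cpu /self->ts/ { @[execname, "ns"] = quantize(timestamp - self->ts); self->ts = 0; }'

That identifies the application, but not what it was doing or why it blocked; for that we need stack traces.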
Context
Adding context is pretty easy: DTrace provides the stack() and ustack() functions, which retrieve the kernel and user-level stacks - the ancestry of the code-path taken. Adding these to the previous one-liner, and also filtering to only match our target application, "mysqld":
# dtrace -x ustackframes=100 -n 'sched:::off-cpu /execname == "mysqld"/ { self->ts = timestamp; }
    sched:::on-cpu /self->ts/ { @[stack(), ustack(), "ns"] = quantize(timestamp - self->ts); self->ts = 0; }'
dtrace: description 'sched:::off-cpu ' matched 6 probes
^C
[...]

              genunix`cv_wait+0x61                          <- kernel-level stack trace
              zfs`zio_wait+0x5d
              zfs`dmu_buf_hold_array_by_dnode+0x21e
              zfs`dmu_buf_hold_array+0x71
              zfs`dmu_read_uio+0x4d
              zfs`zfs_read+0x19a
              genunix`fop_read+0x6b
              genunix`pread+0x22c
              unix`sys_syscall+0x17a

              libc.so.1`__pread+0xa                         <- user-level stack trace
              mysqld`os_file_pread+0x8e
              mysqld`os_file_read+0x3b                      <- filesystem I/O
              mysqld`fil_io+0x2b0
              mysqld`buf_read_page_low+0x14e
              mysqld`buf_read_page+0x81
              mysqld`buf_page_get_gen+0x143
              mysqld`btr_cur_search_to_nth_level+0x4dd
              mysqld`row_sel_get_clust_rec_for_mysql+0x9a
              mysqld`row_search_for_mysql+0xdf5
              mysqld`_ZN11ha_innobase13general_fetchEPcjj+0x5b
              mysqld`_Z19join_read_next_sameP14st_read_record+0x32
              mysqld`_Z10sub_selectP4JOINP13st_join_tableb+0x68
              mysqld`_Z9do_selectP4JOINP4ListI4ItemEP8st_tableP9Procedure+0x1fd
              mysqld`_ZN4JOIN4execEv+0x12a9
              mysqld`_Z12mysql_selectP3THDPPP4ItemP10TABLE_LISTjR4ListIS1_ES2_jP8...
              mysqld`_Z13handle_selectP3THDP6st_lexP13select_resultm+0x145
              mysqld`_Z21mysql_execute_commandP3THD+0xe6e
              mysqld`_Z11mysql_parseP3THDPKcjPS2_+0x116
              mysqld`_Z16dispatch_command19enum_server_commandP3THDPcj+0xfc1
              mysqld`_Z10do_commandP3THD+0xb8
              mysqld`handle_one_connection+0x7f7
              libc.so.1`_thrp_setup+0x8d
              libc.so.1`_lwp_start
  ns
           value  ------------- Distribution ------------- count
          262144 |                                         0
          524288 |                                         1
         1048576 |@                                        2
         2097152 |@@@@@                                    10
         4194304 |@@@@@@@@@@@@@@@@@@@@@@@@@@               53
         8388608 |@@@@@@@@                                 16
        16777216 |                                         1
        33554432 |                                         0
The output is many pages long, consisting of hundreds of kernel + user stack traces, plus their time spent waiting off-CPU as distribution plots ("value" == nanoseconds, "count" == number of occurrences in that range). To make the screenful of text a little easier on the eyes, I've added some color (red for kernel, blue for user-land stacks). I've also truncated the output to only show one off-CPU event: file system I/O.
This output is amazing, and is made possible by a simplifying principle:
Application stack traces don't change while off-CPU.
This DTrace one-liner is only measuring the stack traces on the sched:::on-cpu event; it doesn't need to gather and save them on the sched:::off-cpu event, or during the time spent off-CPU, since the application stack doesn't change (and the kernel stack may not change much either). It's the application (user-level) stack that usually provides the most helpful context.
Interpreting the stack traces takes a little familiarity with the source code: how much depends on how complex the application is and on its language. While MySQL is quite complex and has a lot of functionality, the function names are usually intuitive and the code is often fairly straightforward.
The file system I/O latency shown in the distribution, from 1 to 16 ms, matches expectation for random disk I/O latency. It also shows that the file system (while I was tracing) was running at a 0% cache hit rate for this code-path; all I/O were at disk speeds. (This is probably due to this MySQL using application-level caches to cache the frequent requests, leaving a trickle of un-cached and infrequent requests for the file system.)
That stack trace is great - it's showing why the application was blocking and waiting off-CPU, and how long for. This is exactly the sort of information I'm usually looking for. However, this one-liner also catches many stack traces that aren't interesting, including:
[...]

              genunix`cv_wait_sig_swap_core+0x170
              genunix`cv_wait_sig_swap+0x18
              genunix`cv_waituntil_sig+0x13c
              genunix`lwp_park+0x157
              genunix`syslwp_park+0x31
              unix`sys_syscall+0x17a

              libc.so.1`__lwp_park+0x17
              libc.so.1`cond_wait_queue+0x68
              libc.so.1`__cond_wait+0x7b
              libc.so.1`cond_wait+0x23
              libc.so.1`pthread_cond_wait+0x9               <- event waiting
              mysqld`_Z10end_threadP3THDb+0xba
              mysqld`handle_one_connection+0x3e8
              libc.so.1`_thrp_setup+0x8d
              libc.so.1`_lwp_start
  ns
           value  ------------- Distribution ------------- count
            2048 |                                         0
            4096 |                                         1
            8192 |                                         0
           16384 |                                         0
           32768 |                                         0
           65536 |                                         1
          131072 |                                         3
          262144 |                                         1
          524288 |@                                        9
         1048576 |@@                                       17
         2097152 |@@@@                                     34
         4194304 |@@@@@                                    43
         8388608 |@@@@@@@                                  68
        16777216 |@@@@@@                                   58
        33554432 |@@@@@@                                   58
        67108864 |@@@@                                     41
       134217728 |@@                                       21
       268435456 |@                                        12
       536870912 |@                                        9
      1073741824 |                                         0
I believe this stack trace is from a thread ending and waiting in the MySQL thread cache for more work. This isn't so interesting; I'm usually looking to explain why queries are slow, and this latency happens outside a query. I'm not (usually) interested in latencies that occur asynchronously to the queries, such as threads waiting for work.
It would be ideal to only include off-CPU stacks from those wait events that occur synchronously during a query.
Synchronous Context
The DTrace pid provider allows the application to be traced and synchronous workload code-paths to be identified. Using this on MySQL, we can identify whether a thread is processing a query, such as by checking that the thread is within the dispatch_command() function. The following script does this, and includes some usability features: it produces a rolling status of the gathered latency until Ctrl-C is hit, and it applies a minimum latency threshold, set to 10 ms, to only include off-CPU wait events longer than this time:
$ cat mysqld_pid_offcpu.d
#!/usr/sbin/dtrace -s
/*
 * mysqld_pid_offcpu.d    Trace off-CPU time during queries, showing stacks.
 *
 * USAGE: ./mysqld_pid_offcpu.d -p mysqld_PID
 *
 * TESTED: these pid-provider probes may only work on some mysqld versions.
 *    5.0.51a: ok
 *    5.1.13: ok
 *
 * 27-Feb-2011    Brendan Gregg    Created This.
 */

#pragma D option quiet

dtrace:::BEGIN
{
    min_ns = 10000000;
    printf("Tracing PID %d for queries longer than %d ms\n",
        $target, min_ns / 1000000);
    printf("Hit Ctrl-C when ready for off-CPU report.\n\n");
}

pid$target::*dispatch_command*:entry
{
    self->start = timestamp;
}

sched:::off-cpu
/self->start/
{
    self->off = timestamp;
}

sched:::on-cpu
/self->off && (timestamp - self->off) > min_ns/
{
    @offcpu[stack(), ustack()] = quantize(timestamp - self->off);
    self->off = 0;
}

pid$target::*dispatch_command*:return
/self->start && (timestamp - self->start) > min_ns/
{
    @time = quantize(timestamp - self->start);
}

pid$target::*dispatch_command*:return
{
    self->start = 0;
    self->off = 0;
}

profile:::tick-1s,
dtrace:::END
{
    printf("MySQL query latency (ns):");
    printa(@time);
    clear(@time);
}

dtrace:::END
{
    printf("Top 10 off-CPU user & kernel stacks, by wait latency (ns):");
    trunc(@offcpu, 10);
}
The tick-1s probe prints status every second showing the distribution of query latency beyond 10 ms. The intent is that you watch this rolling status until you catch the latency you are looking for (outliers), and then hit Ctrl-C. The real output is in the @offcpu aggregation, which includes the stack traces and latency distribution (beyond 10 ms) for events during queries.
Here's an example (MySQL 5.0.51a):
# ./mysqld_pid_offcpu.d -p 3475
Tracing PID 3475 for queries longer than 10 ms
Hit Ctrl-C when ready for off-CPU report.

MySQL query latency (ns):
           value  ------------- Distribution ------------- count
         4194304 |                                         0
         8388608 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@        257
        16777216 |@@@@@                                    35
        33554432 |@                                        10
        67108864 |@                                        5
       134217728 |                                         0
       268435456 |                                         1
       536870912 |                                         0

[...waiting to see latency of interest: beyond 1s...]

MySQL query latency (ns):
           value  ------------- Distribution ------------- count
         4194304 |                                         0
         8388608 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 2
        16777216 |                                         0

MySQL query latency (ns):
           value  ------------- Distribution ------------- count
         4194304 |                                         0
         8388608 |@@@@@@@@@@@@@                            1
        16777216 |@@@@@@@@@@@@@                            1
        33554432 |                                         0
        67108864 |                                         0
       134217728 |                                         0
       268435456 |                                         0
       536870912 |                                         0
      1073741824 |@@@@@@@@@@@@@                            1     <- over 1 second
      2147483648 |                                         0

[got it! Ctrl-C now]
^C
[...]

              genunix`cv_timedwait_sig_hires+0x1db
              genunix`cv_waituntil_sig+0xba
              genunix`poll_common+0x56d
              genunix`pollsys+0xe4
              unix`sys_syscall+0x17a

              libc.so.1`__pollsys+0xa
              libc.so.1`pselect+0x18c
              libc.so.1`select+0x70
              mysqld`os_thread_sleep+0x3f                   <- innodb thread concurrency
              mysqld`srv_conc_enter_innodb+0xc4
              mysqld`_ZN11ha_innobase10index_readEPcPKcj16ha_rkey_function+0x28a
              mysqld`_Z20join_read_always_keyP13st_join_table+0xc7
              mysqld`_Z10sub_selectP4JOINP13st_join_tableb+0xb7
              mysqld`_Z9do_selectP4JOINP4ListI4ItemEP8st_tableP9Procedure+0x1fd
              mysqld`_ZN4JOIN4execEv+0x12a9
              mysqld`_Z12mysql_selectP3THDPPP4ItemP10TABLE_LISTjR4ListIS1_ES2_jP8...
              mysqld`_Z13handle_selectP3THDP6st_lexP13select_resultm+0x145
              mysqld`_Z21mysql_execute_commandP3THD+0xe6e
              mysqld`_Z11mysql_parseP3THDPKcjPS2_+0x116
              mysqld`_Z16dispatch_command19enum_server_commandP3THDPcj+0xfc1
              mysqld`_Z10do_commandP3THD+0xb8
              mysqld`handle_one_connection+0x7f7
              libc.so.1`_thrp_setup+0x8d
              libc.so.1`_lwp_start

           value  ------------- Distribution ------------- count
         4194304 |                                         0
         8388608 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 124
        16777216 |                                         0

              genunix`cv_wait_sig_swap_core+0x170
              genunix`cv_wait_sig_swap+0x18
              genunix`cv_waituntil_sig+0x13c
              genunix`lwp_park+0x157
              genunix`syslwp_park+0x31
              unix`sys_syscall+0x17a

              libc.so.1`__lwp_park+0x17
              libc.so.1`cond_wait_queue+0x68
              libc.so.1`__cond_wait+0x7b
              libc.so.1`cond_wait+0x23
              libc.so.1`pthread_cond_wait+0x9
              mysqld`os_event_wait+0x27
              mysqld`srv_suspend_mysql_thread+0x29b
              mysqld`row_mysql_handle_errors+0x6e           <- DB lock wait
              mysqld`row_search_for_mysql+0x909
              mysqld`_ZN11ha_innobase10index_readEPcPKcj16ha_rkey_function+0x193
              mysqld`_ZN7handler16read_range_firstEPK12st_key_rangeS2_bb+0x99
              mysqld`_ZN7handler22read_multi_range_firstEPP18st_key_multi_rangeS1...
              mysqld`_ZN18QUICK_RANGE_SELECT8get_nextEv+0x100
              mysqld`_Z8rr_quickP14st_read_record+0x1d
              mysqld`_Z12mysql_updateP3THDP10TABLE_LISTR4ListI4ItemES6_PS4_jP8st_...
              mysqld`_Z21mysql_execute_commandP3THD+0x3581
              mysqld`_Z11mysql_parseP3THDPKcjPS2_+0x116
              mysqld`_Z16dispatch_command19enum_server_commandP3THDPcj+0xfc1
              mysqld`_Z10do_commandP3THD+0xb8
              mysqld`handle_one_connection+0x7f7

           value  ------------- Distribution ------------- count
         4194304 |                                         0
         8388608 |@@@@@@@@@@                               1
        16777216 |@@@@@@@@@@@@@@@@@@@@                     2
        33554432 |                                         0
        67108864 |                                         0
       134217728 |                                         0
       268435456 |                                         0
       536870912 |                                         0
      1073741824 |@@@@@@@@@@                               1
      2147483648 |                                         0
I've included the last two off-CPU types and my interpretations: the first is innodb thread concurrency, where threads are put to sleep as part of the algorithm that keeps a certain number of threads running. And by sleep I mean sleep in the programming sense - it's waiting for 10 ms regardless.
The second stack I believe is the DB lock wait code-path, based on DTracing row_mysql_handle_errors() separately.
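That kind of separate check can be a simple pid-provider one-liner (a sketch of how it could be done; the wildcards allow for differences in how the symbol appears in a given mysqld build):

# dtrace -n 'pid$target::*row_mysql_handle_errors*:entry { @[ustack()] = count(); }' -p 3475

which counts how often that function is reached, and from which code-paths.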
Since the distribution plot summarizes latency from 10 ms onwards, this data could be used to provide early warning that a problem is developing which could slow queries to unacceptable levels (eg, the typical 1 second "slow query").
This is one of the most powerful DTrace scripts I've ever written.
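As a hedged sketch of what such an early-warning metric might look like (not part of the original script; PID is a placeholder), the same probes can be reduced to a per-second count of long off-CPU events that occur during queries:

# dtrace -qn '
    pid$target::*dispatch_command*:entry { self->q = 1; }
    pid$target::*dispatch_command*:return { self->q = 0; self->ts = 0; }
    sched:::off-cpu /self->q/ { self->ts = timestamp; }
    sched:::on-cpu /self->ts && (timestamp - self->ts) > 10000000/ { @ = count(); }
    sched:::on-cpu /self->ts/ { self->ts = 0; }
    profile:::tick-1s { printa("off-CPU events > 10 ms during queries: %@d\n", @); clear(@); }' -p PID

This trades the detail of the stack traces for a single number that could be graphed or alarmed on.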
Being able to identify these latencies so quickly is awesome. However, there are a couple of caveats to bear in mind (at least): dispatcher queue latency and involuntary context switching.
Dispatcher Queue Latency
Something that's missing from these stacks is whether the latency includes time spent waiting on a CPU dispatcher queue. If the CPUs are running at saturation, then any time a thread steps off-CPU there may be additional time before it gets its turn to step back on, and that will be included in the reported time. DTrace can trace those dispatcher queue events separately if needed; in practice, CPU saturation is pretty easy to spot, so you are unlikely to be spending much time looking at off-CPU latency when you have a known CPU saturation issue to deal with.
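If you do want to quantify that queue time, a rough sketch (assuming the sched provider's enqueue probe and its usual arguments; details can vary by Solaris version) is to time from enqueue to on-cpu for mysqld threads:

# dtrace -n '
    sched:::enqueue /args[1]->pr_fname == "mysqld"/ {
        /* thread placed on a dispatcher queue: record the time, keyed by pid/lwpid */
        start[args[1]->pr_pid, args[0]->pr_lwpid] = timestamp;
    }
    sched:::on-cpu /start[pid, curlwpsinfo->pr_lwpid]/ {
        @["dispatcher queue latency (ns)"] =
            quantize(timestamp - start[pid, curlwpsinfo->pr_lwpid]);
        start[pid, curlwpsinfo->pr_lwpid] = 0;
    }'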
Involuntary Context Switching
If you see a user-level stack trace that doesn't look like it should have blocked, but it did, it could be because the kernel context switched the thread off involuntarily. The kernel-level stack trace may show this, eg:
              unix`preempt+0xd7                             <- involuntary context switch
              unix`trap+0x14e8
              unix`sys_rtt_common+0x68
              unix`_sys_rtt_ints_disabled+0x8

              mysqld`_ZN4ItemC2Ev+0xae
              mysqld`_Z21execute_sqlcom_selectP3THDP10TABLE_LIST+0x124
              mysqld`_Z21mysql_execute_commandP3THD+0x2400
              mysqld`_ZN18Prepared_statement7executeEP6Stringb+0x2e9
              mysqld`_ZN18Prepared_statement12execute_loopEP6StringbPhS2_+0x70
              mysqld`_Z19mysqld_stmt_executeP3THDPcj+0xa1
              mysqld`_Z16dispatch_command19enum_server_commandP3THDPcj+0xf85
              mysqld`_Z10do_commandP3THD+0xc2
              mysqld`_Z24do_handle_one_connectionP3THD+0x24e
              mysqld`handle_one_connection+0x37
              mysqld`pfs_spawn_thread+0x50
              libc.so.1`_thrp_setup+0x7e
              libc.so.1`_lwp_start

           value  ------------- Distribution ------------- count
        67108864 |                                         0
       134217728 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
       268435456 |                                         0
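To gauge how often this is happening and where, one rough approach (a sketch; preempt() is visible in the stack above, though fbt probe availability varies between kernel builds) is to count preemptions by user stack:

# dtrace -n 'fbt::preempt:entry /execname == "mysqld"/ { @[ustack(5)] = count(); }'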
Summary
Off-CPU analysis can be an effective way to locate the types of latency where threads block and wait for some other event. By tracing this from the kernel scheduler functions that context switch threads, all off-CPU latency types can be analyzed in the same way, without needing to trace multiple sources. To see the context for an off-CPU event and understand why it occurred, both the user and kernel stack back traces can be inspected. To show only the latencies that affect the application workload, additional instrumentation can be used to identify that context and filter out the asynchronous latency events.
On this page these principles were demonstrated using DTrace on MySQL, observing different types of blocking events and summarizing the off-CPU time as distribution plots in nanoseconds. The off-CPU approach can be used for any application that blocks for a variety of reasons, where there isn't already clear visibility into every reason why it may block.
For more on off-CPU analysis, see the visualizations in Off-CPU Flame Graphs and Hot/Cold Flame Graphs.