I originally posted this at http://dtrace.org/blogs/brendan/2011/02/18/dtrace-pid-provider-overhead.
I've been posting about the DTrace pid provider and its various features to share my real-world experiences in this area. Here I'll discuss an important, tricky, and generally undocumented topic: the overhead of using the pid provider. I'll provide some rough expectations for overhead based on some simple stress tests, and at the very least, show the units used to express the overhead of a fired probe (milliseconds, microseconds or nanoseconds?).
Answer: microseconds.
To cut to the chase, see the Expressing Overhead and Summary sections.
Zero Overhead
When the DTrace pid provider is not enabled, the overhead on applications is zero. You only pay a CPU cost when you actually use it.
The non-enabled overhead of the pid provider is zero.
When I first heard this, I was skeptical, wondering if it was more of a marketing or technical statement. Does zero overhead really mean zero, as in, zero extra instructions? Wouldn't a compare and jump instruction, at least, be necessary at some point, so that the operating system can check whether it should be calling into DTrace or not? If there are no such instructions, how can I walk up to already-running software and instrument it with DTrace? Wouldn't it just continue as-is, and never know that it should be branching into some DTrace code?
I only really understood this properly when I learned what happened at the instruction level, which I'll demo here.
Here is the strlen() function assembly (truncated), which I've been tracing in previous posts:
libc.so.1`strlen:       movl   0x4(%esp),%eax
libc.so.1`strlen+4:     testl  $0x3,%eax
libc.so.1`strlen+9:     jne    +0x28    <libc.so.1`strlen+0x31>
libc.so.1`strlen+0xb:   nop
libc.so.1`strlen+0xc:   movl   (%eax),%edx
libc.so.1`strlen+0xe:   movl   $0x7f7f7f7f,%ecx
libc.so.1`strlen+0x13:  andl   %edx,%ecx
[...]
libc.so.1`strlen+0x44:  ret
Now I'll use the DTrace pid provider to instrument the entry to this function:
# dtrace -n 'pid$target::strlen:entry' -p 12072
And while DTrace is still attached, I'll dump the same assembly from this process:
libc.so.1`strlen:       int    $0x3
libc.so.1`strlen+1:     incl   %esp
libc.so.1`strlen+2:     andb   $0x4,%al
libc.so.1`strlen+4:     testl  $0x3,%eax
libc.so.1`strlen+9:     jne    +0x28    <libc.so.1`strlen+0x31>
libc.so.1`strlen+0xb:   nop
libc.so.1`strlen+0xc:   movl   (%eax),%edx
libc.so.1`strlen+0xe:   movl   $0x7f7f7f7f,%ecx
libc.so.1`strlen+0x13:  andl   %edx,%ecx
[...]
libc.so.1`strlen+0x44:  ret
It's changed - while the process was running! DTrace has dynamically patched the live instructions in this process's address space. This is the dynamic part of Dynamic Tracing. DTrace can dynamically instrument the kernel, too, via the fbt provider.
DTrace can dynamically patch live instructions in memory.
How DTrace dynamically instruments code depends on the location (user or kernel) and the architecture. This example was a user process on AMD64, where the first instruction was replaced by an "int" that raises a soft interrupt handled by a DTrace routine. The kernel is dynamically instrumented in a different way.
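If you'd like to verify this on your own system, one way (a sketch, not from the original output, reusing the PID from above) is to disassemble the function with mdb before and after enabling the probe:

# echo 'libc.so.1`strlen::dis' | mdb -p 12072

Note that mdb stops the target while it is attached, so it is a heavier-weight way to inspect a process than DTrace itself.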
This seems terrifying.
The process is still running, and could be in production. Thankfully, the instrumentation is applied in a safe manner, so that the process can never execute partially-patched instructions. So, while it seems terrifying, it is actually safe - which is a primary goal of DTrace: safe for production use.
This is why there is zero overhead when dynamic probes are not enabled. The code runs as usual.
When DTrace Isn't D-Trace
Before I move on, I want to note that dynamic tracing is only used by the pid and fbt providers. Many of the other DTrace providers use "statically defined tracing", including io, syscall, tcp, perl, python, javascript, etc. So, DTrace isn't just about dynamic tracing, as in the examples above; it uses static tracing a lot too.
Statically-defined providers are more effort to create and deploy - the source code must be instrumented before compilation, and that instrumented version must be executing. The advantage of static over dynamic providers is that they have a stable interface, and DTrace scripts written for these providers should continue working across different software versions and OSes. Stability was discussed in my first post in this series.
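To illustrate that extra effort, here is a sketch of what a statically defined user-land (USDT) provider declaration looks like; the provider and probe names are hypothetical, not from any real application:

/* myapp_provider.d: hypothetical USDT provider declaration */
provider myapp {
	/* becomes myapp<pid>:::request-start when the application is running */
	probe request__start(char *url);
	probe request__done(char *url, int status);
};

The application source then calls probe macros generated with "dtrace -h", and the object files are post-processed with "dtrace -G" before linking - which is why static providers need to be built in before the software runs, while the pid provider can be pointed at anything already running.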
... Perhaps DTrace should have been called "Dynamic and Static Tracing", or DasTrace for short. (and coded on your DasKeyboard.)
Overhead
Usually when I'm considering overhead, it's about the running overhead of DTrace - the performance cost to the target application while the probes are enabled and collecting data. There is also some additional overhead when you first run DTrace and initialize pid provider probes, which can be proportional to the number of probes initialized. The running overhead is proportional to the rate of probes - the more they fire per second, the higher the overhead.
I'll run some overhead tests to provide rough figures to consider. These stress test the pid provider, so that the overhead of probes becomes noticeable and measurable. I'd like to show the overhead from tracing a real-world application, but the overhead is often so small that it is lost in the noise (caching effects and normal workload variability).
Stress Testing
I'll stress test the pid provider by running and DTracing "looper" - a short C program that calls a simple function 100 million times. I'll time the program running normally, then while using the pid provider to instrument that function.
# cat looper.c
#include <strings.h>

int
func(char *input)
{
	return (strlen(input) + 1);
}

int
main()
{
	int i, j, sum = 0;

	for (i = 0; i < 1000000; i++) {
		for (j = 0; j < 100; j++) {
			sum += func("hello DTrace");
		}
	}

	return (0);
}
Since func() is called 100 million times and does very little, the execution of this program may become dominated by the overhead of executing the DTrace probes, which could take more CPU cycles than the function itself. This is a form of stress test, magnifying something minute so that it becomes observable.
Control
Without any DTrace enabled, I'll use ptime to measure the run time of the program ("real" time):
# ptime ./looper

real        1.513169989
It normally takes 1.5 seconds to run.
DTrace startup
Since I'm timing the overall command, including DTrace, I need to determine the time for DTrace to initialize itself. This time includes allocating the per-CPU output buffers. I've tested this by tracing a function that wasn't called, so additional time is almost entirely the cost of startup:
# ptime dtrace -n 'pid$target:libc:strncpy:entry { @ = count(); }' -c ./looper
dtrace: description 'pid$target:libc:strncpy:entry ' matched 1 probe
dtrace: pid 5475 has exited

real        2.722916042
2.7 seconds. So, about 1.2 seconds is DTrace startup.
pid provider probe
The following traces the function that is called 100 million times. The action is simply to count ("@ = count()"), a commonly used action which also lets us confirm that we traced all 100 million invocations:
# ptime dtrace -n 'pid$target:a.out:func:entry { @ = count(); }' -c ./looper
dtrace: description 'pid$target:a.out:func:entry ' matched 1 probe
dtrace: pid 5524 has exited
        100000000

real     1:02.519046129
The time has now increased to 62.5 seconds. (And the "100000000" confirms we traced all 100 million function calls.)
probe + copyinstr()
The copyinstr() action adds additional overhead, as it copies data from the user-land address space to the kernel. Testing this:
# ptime dtrace -n 'pid$target:a.out:func:entry { @[copyinstr(arg0)] = count(); }' -c ./looper
dtrace: description 'pid$target:a.out:func:entry ' matched 1 probe
dtrace: pid 9672 has exited
  hello DTrace                                                100000000

real     2:33.046645951
The time now was 153 seconds.
Calculating Overhead
The last example of DTrace made this program run 100 times slower. Ouch! However, this is an extreme test - instrumenting 100 million function calls.
The overhead of each probe is fixed (about as fixed as anything can be, considering CPU and MMU cache effects). For the first function tracing test, tracing 100 million probes added 59.8 seconds (62.5 - 2.7s). That means each pid provider probe is costing about 0.598 microseconds. For this rough test, on this hardware.
This enabled pid provider probe was costing 0.6 microseconds per call.
When I added a copyinstr(), which does more CPU work (including MMU work to bridge address spaces), the cost per probe was about 1.5 microseconds. (These calculations have been updated thanks to a blog comment!)
Sanity Testing
As with any performance test, it's worth sanity checking via other means. Here I'll estimate pid provider probe cost a different way.
I'll start by timing strlen() (using vtimestamp to reduce the inclusion of DTrace probe cost):
# dtrace -n 'pid$target::strlen:entry,pid$target::strlen:return { trace(vtimestamp); }' -p 21607
dtrace: description 'pid$target::strlen:entry,pid$target::strlen:return ' matched 4 probes
CPU     ID                    FUNCTION:NAME
 13  75978                     strlen:entry           3479933
 13  75977                    strlen:return           3482396
DTrace measures this as 2463 nanoseconds (3482396 - 3479933).
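The same subtraction can be done in D rather than by hand. Here's a sketch using a thread-local variable to record the entry time (the aggregation key is just a label I've chosen):

pid$target::strlen:entry
{
	self->ts = vtimestamp;
}

pid$target::strlen:return
/self->ts/
{
	/* on-CPU nanoseconds spent in this strlen() call */
	@["strlen vtime (ns)"] = quantize(vtimestamp - self->ts);
	self->ts = 0;
}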
Now I'll trace each instruction of strlen() and print a timestamp (switching to normal timestamps to include DTrace probe cost):
# dtrace -n 'pid$target::strlen: { trace(timestamp); }' -p 21607
dtrace: description 'pid$target::strlen: ' matched 80 probes
CPU     ID                    FUNCTION:NAME
  1  75978                     strlen:entry  10165584038261648
  1  75979                         strlen:0  10165584038263818
  1  75980                         strlen:4  10165584038266909
  1  75981                         strlen:6  10165584038269029
  1  75982                         strlen:9  10165584038271405
  1  75983                         strlen:b  10165584038273856
  1  75984                         strlen:e  10165584038275964
  1  75985                        strlen:10  10165584038278132
  1  75986                        strlen:11  10165584038280403
  1  75987                        strlen:14  10165584038282483
  1  75988                        strlen:16  10165584038284388
  1  75989                        strlen:19  10165584038286323
  1  76015                        strlen:51  10165584038288842
  1  76016                        strlen:55  10165584038290819
  1  75977                    strlen:return  10165584038292203
^C
The time delta for strlen() was now 30555 nanoseconds (...38292203 - ...38261648), which is 28092 ns slower than measured before. This extra time is from the 14 probes, which makes the cost about 2.0 microseconds per probe.
This enabled pid provider probe was costing 2 microseconds per call.
As a sanity test, this looks ok. The result is in the same ballpark. It is a different probe type, instruction instead of function boundary, and could have a different cost associated (I'd have to check the implementation to confirm).
Sanity Testing #2
As a different form of sanity test, I traced libmozjs calls (JS_*()) on the Firefox web browser on my laptop. I left it running in the background with various tabs open, and found its CPU usage was consistent across a 60 second average (too variable across a 1 second average to draw conclusions). I'll summarize the details here:
- Firefox CPU Utilization for 60 seconds: 22.7%
- ... while tracing all JS_* calls: 46.7%
- Number of JS_* calls during those 60 seconds: 7867128
- Average number of JS_* calls during 1 second: 131119
- Percent CPU cost for each probe ((46.7 - 22.7) / 131119): 0.000183039%
- Time for each probe (when "100%" CPU == 1000 ms per second): ~1.8 microseconds
Again, this is pretty rough, and is just intended as a sanity test. The result of 1.8 us is a little faster than the 2 us result, and may be due in part to testing this on a different platform with slightly faster CPUs.
Expressing Overhead
I'd suggest considering pid provider overhead as follows:
Don't worry too much about pid provider probe cost at < 1000 events/sec.
At > 10,000 events/sec, pid provider probe cost will be noticeable.
At > 100,000 events/sec, pid provider probe cost may be painful.
Let's discuss these using a round figure of 6 us per probe - several times the 0.6 to 2 us measured in the previous tests, leaving headroom for slower systems and heavier probe actions.
- At 1000 events/sec, that could cost 6 milliseconds of CPU time. If the application was 100% on-CPU, it may slow by 0.6% - i.e., don't worry.
- At 10,000 events/sec, that could cost 60 milliseconds of CPU time. For a 100% on-CPU application, the slowdown may be 6%. If it was idle some of the time (systems often are), then the overhead may be less, as there are spare cycles to run DTrace (though this really depends on the location of the probe in the code path).
- At 100,000 events/sec, the CPU cost could be 600 milliseconds. This can significantly slow down an application. Whether that's a problem depends on the app and its function - how sensitive it is to latency.
This was described in terms of a single thread of execution. If an application is spreading these events concurrently across multiple CPUs, then the performance impact should be less.
My previous stress test was an extreme case: I traced a function that was called 100 million times in 1.5 seconds, a rate of over 60 million events per second. By these suggestions, that is well off the painful end of the scale. And the result was indeed painful - the application ran two orders of magnitude slower.
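To apply these suggestions you need to know the event rate, which can itself be measured cheaply before leaving a heavier enabling running. A sketch, reusing the strlen() probe and PID from the sanity test above (substitute your own probe of interest):

# dtrace -n 'pid$target::strlen:entry { @ = count(); } tick-1s { printa("events/sec: %@d\n", @); clear(@); }' -p 21607

Adding cpu as an aggregation key (@[cpu] = count()) also shows whether those events are spread across CPUs or concentrated on one thread, which matters for the single-thread discussion above.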
From Experience
When investigating real-world applications, I can often find key information by tracing events in the 10 to 10,000 events/sec range. The overhead is usually small, and hard to measure over the noise of workload variability. I'm also often using DTrace for brief periods, less than 30 seconds, to collect information that I can then analyze offline.
The actual overhead of my typical pid provider investigations is so small it's difficult to measure.
I have used DTrace plenty of times for higher event rates where it has significantly slowed down the target application, because there was no other way to get the data. If I'm using DTrace to analyze a performance meltdown, then the system is already in hell, and a little more pain so that we can fix it sooner is usually worth it.
Reducing Overhead
Here are some tips for reducing overhead when using the pid provider.
Always use entry and return
Be careful with the last field of the probe name. Left blank (a wildcard), it will match all of the instruction probes as well - creating many more probes than were probably wanted. With the pid provider, if you only meant entry and return, remember to specify them explicitly.
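For example (a sketch, reusing the PID from earlier), this enables every probe in libc, including one per instruction - a very large number of probes:

# dtrace -n 'pid$target:libc.so.1::' -p 12072

while this enables only the function entry and return probes:

# dtrace -n 'pid$target:libc.so.1::entry,pid$target:libc.so.1::return' -p 12072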
Trace for short intervals
If you expect a DTrace script to have high overhead, you may be able to collect sufficient data by running it for a very short interval. High overhead usually means many thousands of events per second, so a fraction of a second may capture more than enough events anyway. The shortest I've run a pid provider enabling for is 100 ms, by adding the probe:
profile:::tick-100ms { exit(0); }
This quits DTrace after 100ms of tracing.
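For example, applying it to the earlier stress test (a sketch; the probe is the func() from looper.c above):

# dtrace -n 'pid$target:a.out:func:entry { @ = count(); } profile:::tick-100ms { exit(0); }' -c ./looper

dtrace prints the aggregation and exits after about 100 ms of tracing.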
Find higher level functions
Higher level functions are called less frequently than lower level functions. If you already know your way around the source code, candidates may be obvious. If not, you can try using stack back traces to find them.
Let's say I wanted to trace what a bash shell was doing, but didn't know where to start. For this example, I'll start with a high overhead experiment: tracing all of bash's functions and frequency counting them:
# dtrace -n 'pid$target:bash::entry { @[probefunc] = count(); }' -p 11290
dtrace: description 'pid$target:bash::entry ' matched 1957 probes
^C

  _rl_replace_text                                                  1
  check_mail                                                        1
[...]
  _rl_reset_argument                                              108
  readline_internal_char                                          108
  rl_get_char                                                     108
  rl_getc                                                         108
  rl_read_key                                                     108
  rl_redisplay                                                    108
  sh_malloc                                                       108
  sh_xmalloc                                                      108
  update_line                                                     108
  sh_free                                                         110
  sh_xfree                                                        110
  _rl_block_sigint                                                112
  _rl_release_sigint                                              112
  internal_free                                                   120
  internal_malloc                                                 128
The output was many pages long, truncated here to fit.
I'll pick one of the more frequently called functions and then count its stack back trace. Trying readline_internal_char():
# dtrace -n 'pid$target::readline_internal_char:entry { @[ustack()] = count(); }' -p 11290
dtrace: description 'pid$target::readline_internal_char:entry ' matched 1 probe
^C

              bash`readline_internal_char
              bash`readline+0x49
              bash`yy_readline_get+0x52
              bash`shell_getc+0xe1
              bash`read_token+0x6f
              bash`yyparse+0x4b9
              bash`parse_command+0x67
              bash`read_command+0x52
              bash`reader_loop+0xa5
              bash`main+0xaff
              bash`_start+0x83
               48
By reading down the stack, I'd guess that functions like read_command() may be a better place to start - if I was just interested in a high level view of what bash was doing. It is indeed called less often - six times when tracing earlier (although truncated from the above output).
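Having found a higher level function, it can be traced directly at a far lower event rate. For example (a sketch using the same bash PID):

# dtrace -n 'pid$target::read_command:entry { @["read_command calls"] = count(); }' -p 11290

From there, arguments, timestamps or stack traces can be added without much overhead, since this probe fires only a handful of times rather than hundreds.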
Future Note
For future readers, I should note that these numbers are based on current CPU hardware. The "looper" program was run on a Joyent SmartMachine on the following hardware:
# psrinfo -vp
The physical processor has 4 cores and 8 virtual processors (1 3 5 7 9 11 13 15)
[...]
    x86 (GenuineIntel 106A5 family 6 model 26 step 5 clock 2400 MHz)
      Intel(r) Xeon(r) CPU E5530  @ 2.40GHz
As CPUs become faster, the fixed per-probe cost calculated above should reduce. A few microseconds sounds fast now (2011), but may sound slow a decade from now.
Summary
The DTrace pid provider has zero overhead when not enabled, and a low overhead per probe invocation when enabled - on the order of microseconds (roughly 0.6 to 2 us when tested here). When choosing pid provider probes to enable, consider the rate at which they will fire - the rate of events to trace. If this is less than a thousand per second, the overhead is unlikely to be noticeable; if the rate is over 100,000 per second, the overhead is likely to be significant, and could noticeably slow the target application. Using particular DTrace actions, such as copyin() and copyinstr(), increases the probe cost further.
I measured the overhead by taking things to the extreme, with a stress test that called a fast function 100 million times. When tracing all 100 million calls and performing copyinstr()s, this test ran two orders of magnitude slower. While this was useful for estimating individual probe cost, it may serve another role: showing that you really can shoot yourself in the foot with the pid provider. While DTrace probes are individually fast, multiply by millions and the cost adds up.
The overhead is manageable if you know what to expect. If you'd like to be really sure before using the pid provider in production, set up a test environment and run a simple benchmark with and without the pid provider probes enabled.