DTrace Troubleshooting
This is a demonstration of DTrace analysis versus other tools
on Solaris 10.
Note: This document is not written by Sun.
DTrace vs truss
DTrace is the new comprehensive system analysis tool that is designed
to be less intrusive than other tools, and safe to run in production.
truss is a standard Solaris tool to analyse system calls and signals,
more recently enhanced to follow user-level function calls as well (-u).
It's often said that DTrace is less of a burden than other tools such as truss - but is that true?
DTrace provides over 30,000 probes, and has the capability to drill down to each instruction of a process; truss by default only looks at system calls and signals. Should we believe that DTrace can do such amazing things at little CPU cost? The following demonstration puts both DTrace and truss to the test.
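(A quick check of that probe claim: you can count the probes available on your own system, though the total varies by release, patch level, and loaded modules, so expect a different number,

# dtrace -l | wc -l

Each line of dtrace -l output describes one available probe, plus one header line.)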
The following shows a system suffering a classic performance problem. Although this fault has been deliberately created for this demonstration, this type of problem is common.
A few commands are run to show the system is under heavy CPU load,
# uptime
  1:12pm  up 1 day(s), 16:36,  3 users,  load average: 2.12, 1.97, 1.26
#
# vmstat 1
 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf    pi po fr de sr cd s0 -- --   in   sy   cs us sy id
 0 0 0 1248800 249416   3    31  2  0  0  0  1  1  0  0  0  344 1197  324  1  2 98
 0 0 0 1132928 110196 737  9965 10  0  0  0  0  0  0  0  0  415 7133  557 13 87  0
 1 0 0 1132936 110200 784 10586  0  0  0  0  0  3  0  0  0  411 8488  585 14 86  0
 1 0 0 1132920 110192 841 11375  0  0  0  0  0  1  0  0  0  407 8016  608 14 86  0
 0 0 0 1132984 110252 852 11473  0  0  0  0  0  0  0  0  0  408 8105  609 14 86  0
 1 0 0 1132888 110176 823 11136  0  0  0  0  0  0  0  0  0  403 8907  603 15 85  0
^C
#
This is a single CPU system, so a load average of 2.12 would indicate potential CPU saturation, with 100% utilisation. The output of vmstat supports this: it shows no idle time, and some threads queueing up in the run queue for these 1 second averages.
Let's run prstat to identify the process responsible,
# prstat
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
 25969 root     2564K 1144K run      0    0   0:01:30 9.3% bash/1
   912 brendan    93M   69M run     49    0   0:15:45 1.8% mozilla-bin/4
   845 brendan    23M   19M sleep   59    0   0:03:44 0.5% acroread/1
 17640 root       11M 7044K sleep   59    0   0:04:18 0.4% dtrace/1
 25667 brendan   106M   54M sleep   49    0   0:00:28 0.3% soffice.bin/5
 26084 root     4328K 4060K cpu0    49    0   0:00:00 0.2% prstat/1
  1470 brendan    94M   28M sleep   59    0   0:02:09 0.1% java/14
   412 root       89M   19M sleep   59    0   0:01:41 0.1% poold/8
   496 brendan    53M   65M sleep   59    0   0:03:57 0.0% Xorg/1
   166 root     1860K 1000K sleep   59    0   0:00:32 0.0% in.routed/1
 17778 brendan  8612K 1948K sleep   59    0   0:00:00 0.0% sshd/1
 17863 root     2564K 1796K sleep   49    0   0:00:00 0.0% bash/1
     7 root       10M 7976K sleep   59    0   0:00:12 0.0% svc.startd/13
   125 root     1228K  836K sleep   59    0   0:00:00 0.0% powerd/2
   265 root     1844K 1096K sleep   59    0   0:00:00 0.0% ttymon/1
    95 daemon   3688K 2000K sleep   59    0   0:00:00 0.0% kcfd/3
   385 root     3304K 1200K sleep   59    0   0:00:00 0.0% sshd/1
Total: 83 processes, 191 lwps, load averages: 2.11, 2.03, 1.40
Hmm, looking at the CPU column doesn't explain our 100% utilised system: the processes we can see add up to only around 15% CPU. Where has the other 85% of CPU gone?
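(As an aside, and jumping ahead to DTrace for a moment: one quick way to see who is on the CPU is to sample at a fixed rate. This is a minimal sketch; the 997 Hz rate is an arbitrary choice, picked to avoid sampling in lockstep with clocked activity,

# dtrace -n 'profile-997 { @[execname] = count(); }'

For now, though, let's continue the investigation with conventional tools.)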
It's possible the CPU is consumed by various kernel threads that are not represented as processes. For example, CPU consumed by interrupt threads can now be checked in Solaris 10 by the intrstat command (which uses DTrace),
# intrstat 5

      device |      cpu0 %tim
-------------+---------------
       ata#1 |         0  0.0
      rtls#0 |       421  0.3
      uhci#0 |       421  0.1
      uhci#1 |         0  0.0

      device |      cpu0 %tim
-------------+---------------
       ata#1 |         0  0.0
      rtls#0 |       471  0.4
      uhci#0 |       471  0.1
      uhci#1 |         0  0.0
^C
This shows I'm losing 0.3% CPU to rtls0 - my network card. This hasn't explained the missing 85%, but it's really useful to be able to get details on interrupt CPU usage anyway.
Another place the CPU can vanish is in satisfying TLB and TSB misses - virtual to physical address translation. trapstat can show us,
# trapstat -T 1
trapstat: not implemented on i86pc
Oops, that's what I get for using an x86 based system! Not to worry, testing the same fault on SPARC based systems shows about 2% of CPU is consumed here - which can't explain the missing 85%.
Well, for good measure we should check other areas of the system: memory, disk I/O, and network usage. I'll use swapinfo to check memory,
# swapinfo
RAM  _____Total  511.6 Mb
RAM    Unusable    8.0 Mb
RAM      Kernel  108.7 Mb
RAM      Locked    0.0 Mb
RAM        Used  287.4 Mb
RAM       Avail  107.4 Mb

Disk _____Total 1023.7 Mb
Disk      Alloc    0.0 Mb
Disk       Free 1023.7 Mb

Swap _____Total 1355.6 Mb
Swap      Alloc  185.8 Mb
Swap    Unalloc   64.5 Mb
Swap      Avail 1105.3 Mb
Swap (MinFree)    62.9 Mb
#
So plenty of available RAM.
Now iostat to check disk activity,
# iostat -xnmpz 5
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.3    0.3    2.3    0.6  0.0  0.0    4.5    2.0   0   0 c0d0
    0.0    0.0    0.2    0.0  0.0  0.0    0.0   11.2   0   0 mars:/var/tmp
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.2    7.4    1.6    7.6  0.0  0.0    1.0    0.7   0   0 c0d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
^C
Not much disk activity there.
Now nicstat to check network usage,
# nicstat 1
    Time   Int   rKb/s   wKb/s   rPk/s   wPk/s    rAvs    wAvs  %Util    Sat
13:44:10 rtls0    0.29    0.15    1.29    0.74  229.95  210.02   0.00   0.00
13:44:11 rtls0    5.66   32.51   92.98  178.04   62.38  186.98   0.31   0.00
13:44:12 rtls0    2.26   21.85   36.67   67.40   62.97  331.88   0.20   0.00
13:44:13 rtls0    4.08   28.11   68.29  132.62   61.13  217.04   0.26   0.00
13:44:14 rtls0    5.19   32.12   87.81  172.63   60.52  190.53   0.31   0.00
13:44:15 rtls0    6.49   36.43  110.06  218.14   60.41  171.02   0.35   0.00
^C
And not much network activity either.
After all that, we know the CPU is fairly busy but have little idea why.
truss needs to either run the program, or be given a PID to examine. We already checked processes using prstat, and no process explained where all the CPU had gone. The highest we saw was a bash shell at 9.3% with PID 25969. Let's run truss on that,
# truss -p 25969
lwp_sigmask(SIG_SETMASK, 0x00020002, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
schedctl()                                      = 0xD25A4000
lwp_sigmask(SIG_SETMASK, 0x00000000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
    Received signal #18, SIGCLD [caught]
      siginfo: SIGCLD CLD_EXITED pid=2387 status=0x0000
lwp_sigmask(SIG_SETMASK, 0x00020000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
waitid(P_ALL, 0, 0x080476E0, WEXITED|WTRAPPED|WNOHANG) = 0
waitid(P_ALL, 0, 0x080476E0, WEXITED|WTRAPPED|WNOHANG) Err#10 ECHILD
setcontext(0x080475E0)
lwp_sigmask(SIG_SETMASK, 0x00020000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
lwp_sigmask(SIG_SETMASK, 0x00000000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
lwp_sigmask(SIG_SETMASK, 0x00020000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
sigaction(SIGINT, 0x08047AD0, 0x08047B40)       = 0
lwp_sigmask(SIG_SETMASK, 0x00000000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
sigaction(SIGINT, 0x08047AC0, 0x08047B30)       = 0
lwp_sigmask(SIG_SETMASK, 0x00020000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
lwp_sigmask(SIG_SETMASK, 0x00000000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
lwp_sigmask(SIG_SETMASK, 0x00020002, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
fork1()                                         = 2389
lwp_sigmask(SIG_SETMASK, 0x00020002, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
[...]
The above output scrolled very fast with much the same. We can see a fork1() and a SIGCLD, so we ought to follow children,
# truss -fp 25969
25969:  sigaction(SIGINT, 0x08047AC0, 0x08047B30)       = 0
25969:  lwp_sigmask(SIG_SETMASK, 0x00020000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
25969:  lwp_sigmask(SIG_SETMASK, 0x00000000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
25969:  lwp_sigmask(SIG_SETMASK, 0x00020002, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
25969:  fork1()                                         = 26938
26938:  fork1()         (returning as child ...)        = 25969
25969:  lwp_sigmask(SIG_SETMASK, 0x00020002, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
[...]
26938:  sigaction(SIGTERM, 0x08047A60, 0x08047AD0)      = 0
26938:  sigaction(SIGCLD, 0x08047A60, 0x08047AD0)       = 0
26938:  execve("/usr/bin/date", 0x08101FA8, 0x08100E08)  argc = 1
26938:  resolvepath("/usr/bin/date", "/usr/bin/date", 1023) = 13
26938:  sysconfig(_CONFIG_PAGESIZE)                     = 4096
26938:  resolvepath("/usr/lib/ld.so.1", "/lib/ld.so.1", 1023) = 12
26938:  xstat(2, "/usr/bin/date", 0x08047B38)           = 0
26938:  open("/var/ld/ld.config", O_RDONLY)             Err#2 ENOENT
[...]
The above output shows the child is the date command. Let's concentrate on those execs only,
# truss -ftexec -p 25969
15670:  execve("/usr/bin/date", 0x08101FA8, 0x08100E08)  argc = 1
15672:  execve("/usr/bin/date", 0x08100BE8, 0x08100E08)  argc = 1
15674:  execve("/usr/bin/date", 0x08101FE8, 0x08100E08)  argc = 1
15676:  execve("/usr/bin/date", 0x08101F88, 0x08100E08)  argc = 1
15678:  execve("/usr/bin/date", 0x08101B08, 0x08100E08)  argc = 1
15680:  execve("/usr/bin/date", 0x080FE168, 0x08100E08)  argc = 1
15682:  execve("/usr/bin/date", 0x08100BC8, 0x08100E08)  argc = 1
15684:  execve("/usr/bin/date", 0x08102108, 0x08100E08)  argc = 1
15686:  execve("/usr/bin/date", 0x08101FA8, 0x08100E08)  argc = 1
15688:  execve("/usr/bin/date", 0x08100BE8, 0x08100E08)  argc = 1
[...]
This is also scrolling rather fast. So it appears the problem is hundreds of short lived processes consuming the CPU, each disappearing before being sampled by prstat.
This fault was achieved by running the following command in the background,
# while :; do date; done
It's an infinite loop that runs date over, and over, and over...
DTrace could have found this problem easily by using the execsnoop tool from the DTraceToolkit. execsnoop is a good first tool to run for many problems, as it quickly reveals short lived processes that per-interval tools such as prstat miss.
# execsnoop -v
STRTIME                  UID   PID  PPID ARGS
2005 Sep 11 14:21:39       0 22085 25969 date
2005 Sep 11 14:21:39       0 22086 25969 date
2005 Sep 11 14:21:39       0 22087 25969 date
2005 Sep 11 14:21:39       0 22088 25969 date
2005 Sep 11 14:21:39       0 22089 25969 date
2005 Sep 11 14:21:39       0 22090 25969 date
2005 Sep 11 14:21:39       0 22091 25969 date
2005 Sep 11 14:21:39       0 22092 25969 date
2005 Sep 11 14:21:39       0 22093 25969 date
2005 Sep 11 14:21:39       0 22094 25969 date
2005 Sep 11 14:21:39       0 22095 25969 date
2005 Sep 11 14:21:39       0 22096 25969 date
[...]
This shows many date commands run every second.
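Under the hood, execsnoop is built on the DTrace proc provider. A minimal sketch of the same idea, without execsnoop's timestamps and formatting,

# dtrace -n 'proc:::exec-success { trace(curpsinfo->pr_psargs); }'

This fires once for every successful exec(), printing the new process's arguments - which is exactly what makes short lived processes visible.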
So what? date is a fast command - maybe it doesn't explain the missing 85% CPU after all. We need to measure how much CPU was consumed. We'll try using truss first,
# truss -cf -p 25969
^C
signals ------------
SIGCLD            730
total:            730

syscall               seconds   calls  errors
sigaction                .092    4891
setcontext               .045     730
waitid                   .151    3176     730
fork1                   1.520    2445
lwp_sigmask              .371   22740
schedctl                 .040    2446
                     --------  ------   ----
sys totals:             2.222   36428    730
usr time:                .650
elapsed:              100.540
This shows around 2.2% system time and 0.6% user time for the 100 second sample - under 3% CPU in total. Hmm, that doesn't seem right for a system with no idle time.
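For comparison, DTrace can produce a similar per-syscall count without stopping the target at each call. A minimal sketch, assuming the same shell PID as above,

# dtrace -n 'syscall:::entry /pid == 25969/ { @[probefunc] = count(); }'

Note that this matches PID 25969 only, not its children; catching the children as well would need a predicate on, say, ppid or execname.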
Now we will try DTrace. The shortlived.d script from the DTraceToolkit is run for the same interval,
# shortlived.d
Sampling.. Hit Ctrl-C to stop.
^C
short lived processes:     73.750 secs
total sample duration:    100.833 secs

Total time by process name,
       sendmail          10 ms
           date       63939 ms

Total time by PPID,
            415           5 ms
            414           5 ms
          25969       63939 ms
This shows around 73% of the sample (73.750 of 100.833 seconds) was consumed by short lived processes - confirming that these date commands are responsible for the CPU utilisation. This value goes further than any previous measurement to explain the missing 85%.
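The core technique in shortlived.d is simple: record a timestamp when a process execs, and sum the elapsed time when it exits. A minimal sketch of that idea (the real script also breaks down totals by PPID and prints the summary seen above),

#!/usr/sbin/dtrace -s

proc:::exec-success
{
        /* the new process image starts its life here */
        self->start = timestamp;
}

proc:::exit
/self->start/
{
        /* sum the process lifetime, keyed by process name */
        @times[execname] = sum(timestamp - self->start);
        self->start = 0;
}

Both probes fire in the context of the traced process, so a thread-local variable (self->start) carries the start time from exec to exit.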
So far both tools have pointed to the problem, and DTrace provided more accurate timing info. What we really would like to see is the effect of running each tool.
normal - The following is a "sar -c 1 10" with no other analysis tool running,
# sar -c 1 10

SunOS jupiter 5.10 Generic i86pc    09/11/2005

14:44:43 scall/s sread/s swrit/s  fork/s  exec/s rchar/s wchar/s
14:44:44    9437     127     124   91.09   91.09   76216    6832
14:44:45    8395     122     122   92.08   92.08   76763    6517
14:44:46    8437     126     123   93.07   92.08   77978    7304
14:44:47    9348     125     122   91.09   91.09   75384    6635
14:44:48    8351     120     120   91.18   92.16   75241    6425
14:44:49    8489     121     121   93.07   93.07   76762    6516
14:44:50    9456     124     123   92.00   92.00   76137    6768
14:44:51    8305     120     120   91.18   91.18   75241    6425
14:44:52    8417     122     122   92.08   92.08   75987    6489
14:44:53    9356     125     122   91.09   91.09   76287    6662

Average     8797     123     122   91.79   91.79   76198    6657
Ok, so normally our test application is running around 92 new processes per second - those date commands. This value is our "control".
truss - Now we run truss on the fault as we did previously, and check the effect using sar. We send the output of truss to /dev/null as we are interested in the effect of how truss operates, not that of updating the screen with the output,
# truss -ftexec -p 25969 2> /dev/null &
[2] 17953
#
# sar -c 1 10

SunOS jupiter 5.10 Generic i86pc    09/11/2005

14:55:39 scall/s sread/s swrit/s  fork/s  exec/s rchar/s wchar/s
14:55:40    9683     909     885   66.34   33.66  331066   19767
14:55:41   10584     890     865   66.34   32.67  326276   19360
14:55:42    9553     896     873   65.69   33.33  327000   19376
14:55:43    9488     893     870   65.69   33.33  329688   19191
14:55:44   10533     893     867   66.34   32.67  323564   19547
14:55:45    9638     901     879   66.67   33.33  330474   19533
14:55:46    9508     890     866   65.69   33.33  328134   19139
14:55:47    9428     779     757   57.43   28.71  282797   17500
14:55:48    9576     898     876   66.34   32.67  327450   19472
14:55:49    9727     901     884   67.00   33.00  333738   19789

Average     9771     885     862   65.35   32.68  324028   19267
Yikes! Now we have dropped from around 92 processes per second to around 33 - (91.79 - 32.68) / 91.79 = 64% - which means truss is causing our application to run 64% slower.
DTrace - Now let's try the same with DTrace's execsnoop,
# execsnoop -v > /dev/null &
[2] 17021
#
# sar -c 1 10

SunOS jupiter 5.10 Generic i86pc    09/11/2005

15:00:35 scall/s sread/s swrit/s  fork/s  exec/s rchar/s wchar/s
15:00:36    9519     129     127   92.00   92.00   77230   11111
15:00:37    8318     124     124   90.20   90.20   75244   10486
15:00:38    8392     119     120   92.08   92.08   75208   10649
15:00:39    9469     127     124   92.00   92.00   76266   10933
15:00:40    8416     121     122   92.08   92.08   75986   10633
15:00:41    8317     118     119   91.26   90.29   75272   10498
15:00:42    9452     125     124   91.00   92.00   76138   10933
15:00:43    8409     119     120   92.08   92.08   75208   10557
15:00:44    8325     120     120   90.20   90.20   75244   10573
15:00:45    9455     124     123   92.00   92.00   75353   10904

Average     8803     122     122   91.49   91.49   75711   10726
Phwaorr! We have dropped from 91.79 processes per second to 91.49 - DTrace has slowed our application by 0.3%!
So why is DTrace so much less of a burden than truss?
Running mpstat while each of truss and DTrace is tracing gives us a big clue,
truss
# mpstat 1 8
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0   593   0    0  350  250  356    8    0    0    0  1606    2   6   0  92
  0 10862   0    0  405  304 1320   96    0    0    0  9255   16  84   0   0
  0 10687   0    0  403  303 1317  142    0    0    0 10262   17  83   0   0
  0 10775   0    0  407  307 1321   98    0    5    0  9182   17  83   0   0
  0 10719   0    0  399  300 1312   97    0    3    0  9119   17  83   0   0
  0 10733   0    0  406  306 1315  142    0    0    0 10239   17  83   0   0
  0 10778   0    0  406  306 1321   97    0    4    0  9161   17  83   0   0
  0 10924   0    0  407  306 1332   98    0    1    0  9270   16  84   0   0
DTrace
# mpstat 1 8
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0   599   0    0  350  250  356    8    0    0    0  1610    2   6   0  92
  0 11105   0    0  411  310  597   76    0    1    0  8853   15  85   0   0
  0 11209   0    0  403  303  597   76    0    0    0  7892   15  85   0   0
  0 11118   0    0  399  300  579   75    0    0    0  7782   15  85   0   0
  0 11199   0    0  410  309  607   83    0    0    0  8944   15  85   0   0
  0 11145   0    0  399  300  585   77    0    2    0  7808   15  85   0   0
  0 11259   0    0  407  306  591   82    0    0    0  7932   14  86   0   0
  0 11131   0    0  409  309  600   79    0    1    0  8874   15  85   0   0
There are fewer than half as many context switches (csw) when DTrace is running.
Let's use DTrace to see what is context switching onto the CPU so much. The following is while truss is running,
# dtrace -n 'sched:::on-cpu { @[execname] = count(); } profile:::tick-20s { exit(0); }'

  ttymon                                                            1
  in.routed                                                         2
  sac                                                               2
  snmpd                                                             3
  miniserv.pl                                                       5
  fmd                                                               5
  svc.configd                                                       6
  inetd                                                             6
  sendmail                                                         12
  fsflush                                                          21
  Xorg                                                             54
  svc.startd                                                       57
  soffice.bin                                                     120
  dtrace                                                          438
  poold                                                           631
  sched                                                           770
  java                                                            817
  acroread                                                        995
  mozilla-bin                                                    2534
  date                                                           3508
  bash                                                           4087
  truss                                                         11573
In this 20 second sample, truss jumped onto the CPU 11573 times. truss operates by sending control messages to procfs's ctl and lwpctl files, causing the process to freeze for every system call so that truss can print a line of output. Hence truss's behaviour is synchronous with the system calls.
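You can watch this from the other side by counting truss's own system calls while it is running. A hedged one-liner sketch,

# dtrace -n 'syscall:::entry /execname == "truss"/ { @[probefunc] = count(); }'

Expect a steady stream of procfs reads and control writes, roughly tracking the system call rate of the target.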
Now the same test is done while using DTrace to analyse the problem,
# dtrace -n 'sched:::on-cpu { @[execname] = count(); } profile:::tick-20s { exit(0); }'

  automountd                                                        1
  utmpd                                                             2
  init                                                              2
  in.routed                                                         2
  nfsmapid                                                          3
  miniserv.pl                                                       4
  fmd                                                               5
  svc.configd                                                       5
  inetd                                                             6
  sendmail                                                         11
  fsflush                                                          21
  Xorg                                                             50
  svc.startd                                                       61
  soffice.bin                                                     118
  poold                                                           612
  dtrace                                                          653
  sched                                                           738
  java                                                            801
  acroread                                                        994
  date                                                           1054
  mozilla-bin                                                    2560
  bash                                                           4100
DTrace barely steps onto the CPU. This is because DTrace uses a per CPU buffer in the kernel that is read at comfortable intervals by the user level DTrace consumer, /usr/sbin/dtrace. Often this interval is once per second, which would mean the DTrace program is stepping onto the CPU once per second rather than for every system call. This behaviour is asynchronous to the system calls, and results in minimal performance impact.
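That read interval is tunable. The following sketch shows the relevant options; the values here are illustrative, not recommendations,

#!/usr/sbin/dtrace -s

#pragma D option quiet
/* principal buffer size, per CPU */
#pragma D option bufsize=4m
/* consumer reads the buffers 10 times per second */
#pragma D option switchrate=10hz

proc:::exec-success
{
        printf("%d %s\n", pid, curpsinfo->pr_psargs);
}

switchrate controls how often the consumer reads the per-CPU buffers (the default is once per second), and bufsize sets how much data can accumulate in the kernel between reads.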
For completeness, other tools could also have been used to help analyse the short lived process problem.
Conclusion
In the above demonstration, both truss and DTrace identified the problem: hundreds of short lived date processes devouring the CPU, each gone before prstat could sample it. The cost of the two tools differed enormously: truss slowed the application by around 64%, while DTrace slowed it by around 0.3%. The reason is how they operate: truss stops the target at every system call, while DTrace buffers its data in the kernel and reads it asynchronously.
I hope this has been a useful demonstration of DTrace.