DTrace Troubleshooting
This is an example of performance analysis using DTrace
from Solaris 10.
Note: This document is not written by Sun.
prstat vs top
prstat (Solaris 8) and top (freeware) are two tools used to monitor system performance, in particular process activity. There has been some minor debate over the years about the performance impact of each, especially whether top is a burden on the system. Here we use DTrace to measure what is really happening.
I've heard some sysadmins describe top as a CPU hog, while others say they have never had a problem with the tool. Some have moved to prstat, which is more accurate and supposedly less of a burden; others swear by top, especially for the extra information it prints. Over the years I've heard various stories relating these tools to performance problems.
So is top really a hog, or is this another "admin myth" like "sync; sync; sync; reboot"? DTrace can help us find out.
top was written by William LeFebvre, first in 1984 (inspired by VMS's top 10 tool). top has been one of the most useful tools ever created for Unix, making the most of a 24x80 text display to show a system summary. A sample output is below,
$ top
last pid:  4018;  load averages:  1.07,  1.04,  1.02            09:16:41
95 processes: 90 sleeping, 2 running, 2 stopped, 1 on cpu
CPU states:  0.0% idle, 98.6% user,  1.4% kernel,  0.0% iowait,  0.0% swap
Memory: 512M real, 170M free, 213M swap in use, 1169M swap free

  PID USERNAME LWP PRI NICE  SIZE   RES STATE    TIME    CPU COMMAND
 3929 brendan    1  27    1   18M   17M run    414:06 96.19% setiathome
 4011 root       1  59    0 3052K 1188K cpu      0:00  1.52% top
 3597 brendan    1  59    0   71M   66M sleep    5:23  0.45% Xorg
 4019 brendan    5  49    0  108M   60M sleep    0:48  0.08% soffice.bin
 3730 brendan    3  49    0   77M   53M sleep    4:27  0.03% mozilla-bin
 5185 brendan    1  59    0 3816K 2860K sleep    0:03  0.02% xterm
 5637 brendan    1  49    0 3632K 2660K sleep    0:04  0.02% xterm
 5199 root       1  49    0 2812K 2032K sleep    0:00  0.02% bash
 3909 brendan    1  49    0 6228K 3604K sleep    0:22  0.01% dtterm
 3642 brendan    1  59    0 3692K 2280K sleep    0:18  0.01% afterstep
  135 root       1  59    0 1852K 1020K sleep    0:15  0.01% in.routed
    7 root      12  59    0   10M 8112K run      0:08  0.01% svc.startd
   84 root      24  59    0 3084K 2112K sleep    0:03  0.00% nscd
  354 root       1  59    0 5428K 1828K sleep    0:02  0.00% sendmail
 5642 brendan    1  59    0 1928K  668K sleep    0:00  0.00% rlogin
The bulk of the output is a process listing sorted on CPU usage, along with other useful server statistics. There is a "Memory:" summary, which can help identify RAM or virtual memory shortages, and a "last pid" value, which may help identify problems with short-lived processes.
prstat was added in Solaris 8. The following is the default output,
$ prstat
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
  3929 brendan    18M   17M run      7    1   6:55:29  97% setiathome/1
  3597 brendan    71M   66M sleep   59    0   0:05:24 1.3% Xorg/1
  3642 brendan  3692K 2280K sleep   59    0   0:00:18 0.4% afterstep/1
  4014 root     4352K 4144K cpu0    59    0   0:00:00 0.2% prstat/1
  5185 brendan  3816K 2868K sleep   59    0   0:00:03 0.2% xterm/1
  3909 brendan  6236K 3612K sleep   49    0   0:00:22 0.1% dtterm/1
  4019 brendan   108M   60M sleep   49    0   0:00:48 0.1% soffice.bin/5
  3648 brendan  3580K 2008K sleep   59    0   0:00:01 0.1% Pager/1
  5199 root     2812K 2032K sleep   49    0   0:00:00 0.0% bash/1
  3644 brendan  2684K 1432K sleep   59    0   0:00:00 0.0% Animate/1
  3913 brendan  2120K 1312K sleep   49    0   0:00:00 0.0% rlogin/1
   135 root     1852K 1020K sleep   59    0   0:00:15 0.0% in.routed/1
  3914 brendan  1928K  664K sleep   49    0   0:00:00 0.0% rlogin/1
     7 root       10M 8112K sleep   59    0   0:00:08 0.0% svc.startd/12
  3730 brendan    77M   53M sleep   49    0   0:04:27 0.0% mozilla-bin/3
  3649 brendan  3616K 2136K sleep   49    0   0:00:00 0.0% WinList/1
    84 root     3084K 2112K sleep   59    0   0:00:03 0.0% nscd/24
  5644 brendan  3624K 2668K sleep   49    0   0:00:00 0.0% xterm/1
  5190 brendan  3624K 2656K sleep   49    0   0:00:03 0.0% xterm/1
   354 root     5428K 1828K sleep   59    0   0:00:02 0.0% sendmail/1
   234 root     1788K 1032K sleep   59    0   0:00:00 0.0% ttymon/1
Total: 95 processes, 202 lwps, load averages: 1.06, 1.03, 1.01
The prstat output is almost entirely a process listing, missing some of the server details that top provides. There is an "NLWP" column for the number of Light Weight Processes, which can be useful for identifying multithreaded processes (and in Solaris 9 onwards this value equals the number of user threads).
prstat also has other output styles. The following is microstate accounting,
$ prstat -m
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/NLWP
  3929 brendan   98 0.0 0.1 0.0 0.0 0.0 0.0 1.4   0 416  68   0 setiathome/1
  3597 brendan  0.2 0.2 0.0 0.0 0.0 0.0 100 0.1 157   6 688  26 Xorg/1
  4032 root     0.0 0.2 0.0 0.0 0.0 0.0 100 0.0  23   0 240   0 prstat/1
   135 root     0.0 0.1 0.0 0.0 0.0 0.0 100 0.0   2   0  37   0 in.routed/1
  5185 brendan  0.0 0.0 0.0 0.0 0.0 0.0 100 0.1   2   3  32   0 xterm/1
  3642 brendan  0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   4   0  84   0 afterstep/1
  4019 brendan  0.0 0.0 0.0 0.0 0.0  40  60 0.2  42  10  43   0 soffice.bin/5
  3648 brendan  0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   3   0  15   0 Pager/1
   304 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   3   0   3   0 miniserv.pl/1
   354 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   3   0  10   0 sendmail/1
  5190 brendan  0.0 0.0 0.0 0.0 0.0 0.0 100 0.1   1   1   4   0 xterm/1
   244 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   3   0   2   0 utmpd/1
   234 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 ttymon/1
   218 daemon   0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 lockd/2
   110 daemon   0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 kcfd/3
   224 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 sac/1
   214 daemon   0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 statd/1
    84 root     0.0 0.0 0.0 0.0 0.0 4.2  96 0.0  30   0  69   0 nscd/24
   329 root     0.0 0.0 0.0 0.0 0.0  64  36 0.0   3   0   1   0 fmd/11
    98 root     0.0 0.0 0.0 0.0 0.0  25  75 0.0   0   0   0   0 picld/4
  1965 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 ttymon/1
Total: 95 processes, 202 lwps, load averages: 1.06, 1.04, 1.04
Other output styles include by user (-a), by project (-J), by task (-T) and by zone (-Z, Solaris 10).
There are many ways DTrace can analyse these tools. We begin by checking for system calls. Each tool is run with an interval of 1 second, and updates the screen 10 times.
prstat system calls
"prstat 1 10" was run during the following dtrace command to frequency count system calls,
# dtrace -n 'syscall:::entry /execname == "prstat"/ { @num[probefunc] = count(); }' dtrace: description 'syscall:::entry ' matched 226 probes ^C [...] pollsys 9 sigaction 10 getloadavg 10 ioctl 11 mmap 19 llseek 19 getdents64 20 write 32 brk 34 close 102 open 105 doorfs 700 pread 960 |
Most are pread()s and doorfs()s, plus about a hundred open()s and close()s each.
top system calls
"top -s1 -d10" was run during the following dtrace command,
# dtrace -n 'syscall:::entry /execname == "top"/ { @num[probefunc] = count(); }' dtrace: description 'syscall:::entry ' matched 228 probes ^C [...] munmap 9 uadmin 10 gtime 10 write 11 getuid 11 sysconfig 13 llseek 19 getdents64 20 pread64 21 brk 24 mmap 31 ioctl 43 read 994 close 1000 open 1002 |
Now there are about a thousand read()s, open()s and close()s each - clearly many more system calls.
top file opens
Let's try to determine what top is opening so many times,
# dtrace -n 'syscall::open:entry /execname == "top"/ { @name[copyinstr(arg0)] = count(); }' dtrace: description 'syscall::open:entry ' matched 1 probe ^C /var/ld/ld.config 1 /usr/share/lib/zoneinfo/Australia/NSW 1 /dev/kstat 1 /usr/share/lib/terminfo//x/xterm 1 [...] 3911/psinfo 10 3914/psinfo 10 3920/psinfo 10 3928/psinfo 10 5645/psinfo 10 5644/psinfo 10 5642/psinfo 10 1/psinfo 10 |
The output shows that there isn't a single file opened hundreds of times; instead, the psinfo files from /proc are each opened ten times, once per screen update.
prstat file opens
Trying the same for prstat gives,
dtrace -n 'syscall::open:entry /execname == "prstat"/ { @name[copyinstr(arg0)] = count(); }' dtrace: description 'syscall::open:entry ' matched 1 probe dtrace: error on enabled probe ID 1 (ID 15: syscall::open:entry): invalid address (0xd27b20af) in action #2 at DIF offset 28 ^C /proc/4092/psinfo 1 /proc/3919/psinfo 1 /proc/3929/psinfo 1 [...] /proc/211/psinfo 1 /proc/1/psinfo 1 /dev/tty 1 /lib/libcurses.so.1 1 /var/ld/ld.config 2 /lib/libc.so.1 2 /usr/share/lib/terminfo//x/xterm 2 |
The above list shows that each psinfo file was only opened once.
top dtruss
A little truss output may highlight what is happening. Here a DTrace version of truss called "dtruss" is used, a lightweight, safer version of truss (although in this case it would make little difference if the original truss command were used). We focus on the system calls from the subsequent screen updates that gather the psinfo data,
# dtruss -n top
[...]
 4299:  open("3920/psinfo\0", 0, 805853a)                = 9 0
 4299:  read(9, "\0", 150)                               = 336 0
 4299:  close(9)                                         = 0 0
 4299:  open("3914/psinfo\0", 0, 805853a)                = 9 0
 4299:  read(9, "\0", 150)                               = 336 0
 4299:  close(9)                                         = 0 0
 4299:  open("4296/psinfo\0", 0, 805853a)                = 9 0
 4299:  read(9, "\0", 150)                               = 336 0
 4299:  close(9)                                         = 0 0
 4299:  open("3911/psinfo\0", 0, 805853a)                = 9 0
 4299:  read(9, "\0", 150)                               = 336 0
 4299:  close(9)                                         = 0 0
 4299:  open("4200/psinfo\0", 0, 805853a)                = 9 0
 4299:  read(9, "\0", 150)                               = 336 0
 4299:  close(9)                                         = 0 0
[...]
So top does an open, read and close for every process, on every update.
prstat dtruss
Checking the updates for prstat,
# dtruss -n prstat
[...]
 4295:  pread(11, "\0", 150)             = 336 0
 4295:  pread(12, "\0", 150)             = 336 0
 4295:  pread(13, "\0", 150)             = 336 0
 4295:  pread(14, "\0", 150)             = 336 0
 4295:  pread(15, "\0", 150)             = 336 0
[...]
prstat keeps the file handles open, so for each update it only needs a pread() per process to fetch fresh data.
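To make the difference concrete, here is a minimal C sketch of the two access patterns (illustrative only - this is not the actual top or prstat source; it assumes the Solaris psinfo_t structure from <procfs.h>),

/* Two ways to sample /proc/<pid>/psinfo on Solaris. */
#include <sys/types.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <procfs.h>

/* top-style: open, read and close the psinfo file on every screen update */
int
sample_open_read_close(pid_t pid, psinfo_t *psp)
{
        char path[64];
        int fd;

        (void) snprintf(path, sizeof (path), "/proc/%d/psinfo", (int)pid);
        if ((fd = open(path, O_RDONLY)) == -1)
                return (-1);
        if (read(fd, psp, sizeof (*psp)) != sizeof (*psp)) {
                (void) close(fd);
                return (-1);
        }
        (void) close(fd);
        return (0);
}

/* prstat-style: open the file once, cache the fd, then pread() each update */
int
sample_pread(int cached_fd, psinfo_t *psp)
{
        if (pread(cached_fd, psp, sizeof (*psp), 0) != sizeof (*psp))
                return (-1);
        return (0);
}

A real tool would loop over the /proc directory entries; the point is that the first pattern pays for a path lookup, an open() and a close() per process per update, while the second pays only for a pread() once the descriptor is cached - exactly the difference visible in the syscall counts above.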
So far, DTrace has been used to show that top makes extra system calls compared to prstat. Maybe that is a performance problem. While DTrace is a very powerful tool, it is often used to produce output similar to what the original truss could provide; DTrace simply makes the data easier and safer to generate - a case of DTrace making the difficult things easy.
Maybe the extra system calls performed by top are a problem, maybe they are not. What we need are measurements of the CPU overhead of these extra calls.
prstat syscall overhead
The following DTrace script is used to measure the CPU overhead of system calls,
# cat time.d
#!/usr/sbin/dtrace -s

syscall:::entry
/execname == $$1/
{
        self->start = vtimestamp;
}

syscall:::return
/self->start/
{
        this->time = vtimestamp - self->start;
        @Time[probefunc] = sum(this->time);
        @Time["TOTAL:"] = sum(this->time);
        self->start = 0;
}
#
# ./time.d prstat
dtrace: script './time.d' matched 458 probes
^C
[...]
  read                                                          50544
  llseek                                                       100493
  munmap                                                       102172
  xstat                                                        117387
  resolvepath                                                  146191
  brk                                                          236784
  mmap                                                         264860
  ioctl                                                        328747
  memcntl                                                      427722
  pollsys                                                      637806
  close                                                        910030
  write                                                       1647629
  exece                                                       2052044
  open                                                        2339907
  doorfs                                                      6649380
  getdents64                                                  7143790
  pread                                                      57319190
  TOTAL:                                                     80679670
The above shows that prstat's system calls consumed around 81 ms of CPU time in total, 57 ms of which was pread(); the remaining 23 ms covers everything else, with the open()s themselves costing only about 2.3 ms.
top syscall overhead
Now the same script is run for top,
# ./time.d top
dtrace: script './time.d' matched 458 probes
^C
[...]
  doorfs                                                        59614
  open64                                                       113472
  resolvepath                                                  123631
  stat64                                                       132480
  brk                                                          138305
  xstat                                                        147503
  memcntl                                                      318990
  pread64                                                      646356
  pollsys                                                      723641
  write                                                        795618
  ioctl                                                       1778097
  munmap                                                      4559626
  getdents64                                                  7034697
  close                                                      10806237
  mmap                                                       16958762
  read                                                       57622258
  open                                                      209996158
  TOTAL:                                                    312266191
In the above we can see 58 ms for the read()s - a similar value to prstat's pread()s - yet 210 ms for the open()s. That is roughly an extra 190 ms of system call overhead, or about 19 ms per one-second screen update - around 2% of one CPU.
It should be noted that these millisecond values are relative. The server these tests were performed on is an 867 MHz i86pc; on faster servers the millisecond values will be smaller, and on multi-CPU servers the impact on the overall system will be smaller too. However, the reason for this extra overhead is the iteration over processes in /proc, so on servers running hundreds of processes the overhead will be greater (this server was only running 95 processes).
A more portable observation is that these extra opens account for around 60% of top's overall CPU time for system calls; that proportion is more likely to hold on servers with different CPUs.
So far we have measured the extra overhead of top's system calls. However, not all of the CPU overhead is system calls. Here we accurately measure the time each tool spends on the CPU, whether in a system call or not,
prstat CPU usage
The following DTrace script is used to measure how the process uses the CPUs,
# cat total.d
#!/usr/sbin/dtrace -qs

dtrace:::BEGIN
{
        total = 0;
}

sched:::on-cpu
/execname == $$1/
{
        self->start = vtimestamp;
}

sched:::off-cpu
/self->start/
{
        this->time = vtimestamp - self->start;
        total += this->time;
        @times = quantize(this->time);
        self->start = 0;
}

dtrace:::END
{
        printf("Total Time on CPU: %d ns (%d ms)\n",total,total/1000000);
        printa(@times);
}
#
# ./total.d prstat
^C
Total Time on CPU: 132799771 ns (132 ms)

           value  ------------- Distribution ------------- count
            8192 |                                         0
           16384 |@@                                       9
           32768 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@          180
           65536 |@@@@                                     21
          131072 |@@                                       11
          262144 |                                         0
          524288 |                                         1
         1048576 |                                         0
         2097152 |                                         0
         4194304 |@@                                       9
         8388608 |@                                        4
        16777216 |                                         1
        33554432 |                                         0
So in total the prstat command spends 132 ms on the CPU. The above distribution plot tells us that most of the times prstat was on the CPU were for around 32 us (microseconds) each - a short interval.
top CPU usage
The same DTrace script is run for top,
# ./total.d top
^C
Total Time on CPU: 399404874 ns (399 ms)

           value  ------------- Distribution ------------- count
            8192 |                                         0
           16384 |@@@@@                                    7
           32768 |@@@@                                     5
           65536 |                                         0
          131072 |@@@                                      4
          262144 |@@@@                                     5
          524288 |@@@                                      4
         1048576 |@@@                                      4
         2097152 |@                                        1
         4194304 |@@                                       2
         8388608 |@@@@@@@@@                                12
        16777216 |@@@@                                     5
        33554432 |@                                        1
        67108864 |@                                        1
       134217728 |                                         0
top spends 399 ms in total on the CPU, and quite often stays on the CPU for long stretches (around 8 ms). We can already attribute around 190 ms of this to the extra file opens, and another 30 ms or so to related extra system calls (close(), mmap(), munmap()). That leaves around 50 ms of extra CPU time unaccounted for, most likely due to differences in the rest of the code.
Rather than writing the DTrace scripts above, we can use the procsystime tool, found on the DTrace Tools website. The following demonstrates it,
prstat and top procsystime
The following DTrace tool was run on both prstat and top to measure syscall details,
# procsystime -a -n prstat
Hit Ctrl-C to stop sampling...
^C

Elapsed Times for process prstat,

         SYSCALL          TIME (ns)
          fxstat              15933
[...]
             brk             511507
           ioctl             583028
           write            1503880
           close            1654823
           exece            2073095
            open            3218842
      getdents64            7339480
          doorfs           16485536
           pread           63929093
         pollsys         9028601228

CPU Overhead Times for process prstat,

         SYSCALL          TIME (ns)
       sysconfig               2758
[...]
           ioctl             337347
         memcntl             429263
         pollsys             790727
           close            1056745
           write            1344893
           exece            2057874
            open            2473774
          doorfs            4130442
      getdents64            7176109
           pread           57809119

Syscall Counts for process prstat,

         SYSCALL              COUNT
           rexit                  1
[...]
           ioctl                 11
          llseek                 19
            mmap                 19
      getdents64                 20
           write                 22
             brk                 32
           close                104
            open                107
          doorfs                420
           pread                980
# procsystime -a -n top
Hit Ctrl-C to stop sampling...
^C

Elapsed Times for process top,

         SYSCALL          TIME (ns)
          getpid               7321
[...]
         pread64             828000
           write             859977
           ioctl            2029376
          munmap            4702569
      getdents64            7198911
            mmap           17038719
           close           17724485
            read           65008599
            open          217945945
         pollsys         9041990505

CPU Overhead Times for process top,

         SYSCALL          TIME (ns)
          getgid                917
[...]
         pread64             652969
           write             782977
         pollsys             786389
           ioctl            1687791
          munmap            4620398
      getdents64            7051767
           close           11669988
            mmap           16809776
            read           58720211
            open          211224762

Syscall Counts for process top,

         SYSCALL              COUNT
           rexit                  1
[...]
       sysconfig                 13
          llseek                 19
      getdents64                 20
         pread64                 21
             brk                 24
            mmap                 31
           ioctl                 43
            read               1014
           close               1020
            open               1022
Much of what was discovered in the previous sections can be seen in the above output.
Since we have now seen that the number of system calls generated is related to the number of processes running, an experiment is performed to exaggerate the results: an extra 3,000 processes (/usr/bin/sleep) are run, and then we check how top and prstat perform.
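(The original doesn't show how these extra processes were created; the following is one hypothetical way to generate such load, sketched in C - a simple shell loop over /usr/bin/sleep would work just as well.)

/* spawnsleep.c - hypothetical load generator: fork N children that each
 * exec /usr/bin/sleep for a long interval. Usage: ./spawnsleep [count] */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

int
main(int argc, char *argv[])
{
        int i, n = (argc > 1) ? atoi(argv[1]) : 3000;

        for (i = 0; i < n; i++) {
                pid_t pid = fork();
                if (pid == -1) {
                        perror("fork");         /* hit a process limit? stop */
                        break;
                }
                if (pid == 0) {
                        /* child: become a long-lived sleeping process */
                        (void) execl("/usr/bin/sleep", "sleep", "86400",
                            (char *)0);
                        _exit(1);
                }
        }
        return (0);
}

Running "./spawnsleep 3000" would add roughly 3,000 sleeping processes, subject to the system's per-user process limits.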
prstat with 3096 processes
The command output,
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
  3929 brendan    18M   16M run      0    1  13:02:09  88% setiathome/1
 13457 root     6480K 6284K cpu0    45    0   0:00:07 9.3% prstat/1
  3597 brendan    79M   51M sleep   59    0   0:07:40 0.3% Xorg/1
  4019 brendan   108M 9312K sleep   49    0   0:01:07 0.1% soffice.bin/5
  5185 brendan  4120K 2248K sleep   49    0   0:00:06 0.1% xterm/1
  3909 brendan  6268K 2636K sleep   49    0   0:01:13 0.1% dtterm/1
    84 root     3084K 1632K sleep   59    0   0:00:05 0.0% nscd/24
  3642 brendan  3692K 1888K sleep   59    0   0:00:22 0.0% afterstep/1
   135 root     1852K  980K sleep   59    0   0:00:21 0.0% in.routed/1
     7 root       10M 3132K sleep   59    0   0:00:11 0.0% svc.startd/12
  3730 brendan    78M   25M sleep   49    0   0:05:29 0.0% mozilla-bin/3
  3648 brendan  3580K 1512K sleep   59    0   0:00:01 0.0% Pager/1
   354 root     5428K 1380K sleep   59    0   0:00:03 0.0% sendmail/1
  4016 brendan  2492K 1504K sleep   59    0   0:00:00 0.0% bash/1
   244 root     1048K  600K sleep   59    0   0:00:00 0.0% utmpd/1
   234 root     1788K  860K sleep   59    0   0:00:00 0.0% ttymon/1
   218 daemon   2040K  832K sleep   60  -20   0:00:00 0.0% lockd/2
   304 root       10M 1876K sleep   59    0   0:00:02 0.0% miniserv.pl/1
   110 daemon   3592K  828K sleep   59    0   0:00:00 0.0% kcfd/3
   224 root     1680K  788K sleep   59    0   0:00:00 0.0% sac/1
   214 daemon   2464K 1068K sleep   59    0   0:00:00 0.0% statd/1
Total: 3096 processes, 3203 lwps, load averages: 1.29, 7.14, 7.98
OK, so prstat is now consuming a noticeable amount of CPU: 9.3%.
top with 3096 processes
The command output,
last pid: 13454;  load averages:  5.24, 13.32,  9.79            15:35:44
3096 processes: 3091 sleeping, 1 running, 3 stopped, 1 on cpu
CPU states:  0.0% idle, 84.2% user, 15.8% kernel,  0.0% iowait,  0.0% swap
Memory: 512M real, 8108K free, 506M swap in use, 770M swap free

  PID USERNAME LWP PRI NICE  SIZE   RES STATE    TIME    CPU COMMAND
 3929 brendan    1   7    1   18M   16M run    778:57 82.49% setiathome
10448 root       1  42    0 5268K 2328K cpu      1:46 13.95% top
 3597 brendan    1  59    0   79M   51M sleep    7:39  0.74% Xorg
 3730 brendan    3  49    0   78M   26M sleep    5:29  0.54% mozilla-bin
 4019 brendan    5  49    0  108M 9492K sleep    1:07  0.08% soffice.bin
 3642 brendan    1  59    0 3692K 1904K sleep    0:22  0.08% afterstep
 3909 brendan    1  49    0 6260K 2416K sleep    1:13  0.03% dtterm
  135 root       1  59    0 1852K  984K sleep    0:21  0.03% in.routed
 5185 brendan    1  49    0 4120K 2196K sleep    0:06  0.03% xterm
    7 root      12  59    0   10M 3188K sleep    0:11  0.02% svc.startd
 3648 brendan    1  59    0 3580K 1512K sleep    0:01  0.02% Pager
   84 root      24  59    0 3084K 1644K sleep    0:05  0.01% nscd
    1 root       1  59    0 1992K  644K sleep    0:09  0.00% init
 5190 brendan    1  49    0 3624K 2088K sleep    0:06  0.00% xterm
  354 root       1  59    0 5428K 1380K sleep    0:03  0.00% sendmail
And top is using 14% of the CPU.
DTrace has already explained what is happening; this little experiment has shown that the overhead can become noticeable on a system with many processes.
Conclusion
The following are some observations from the above DTrace measurements,

- top opens, reads and closes the psinfo file of every process on every screen update, while prstat opens each psinfo file once and then uses pread() for subsequent updates.
- On this server, the extra open()s cost top roughly 190 ms of additional system call CPU time over ten updates - around 2% of one CPU.
- The overhead scales with the number of processes: with around 3,000 processes running, prstat consumed around 9% of the CPU and top around 14%.
- Neither tool is a serious burden in normal use; the difference only becomes significant in certain scenarios, such as many processes or many simultaneous monitoring sessions.
I hope this has been a useful demonstration of using DTrace from the command line, as well as from DTrace scripts.
Confessions
Well, I do have a confession to make. I've also written tools that use the /proc procfs data. Both prstat and top use this /proc data, prstat in a clever manner (preads), and top in a simple way (open, read, close). And when I've written tools, so far it's been in the same way as top! open, read, close, for every process.
As a programmer you try to think of every possible problem, especially bugs in the code that may cause it to crash or print incorrect data. A lot of effort goes into ensuring the program will work on every possible environment. And rightfully so, the program must be reliable! Less effort may go into benchmarking the program on every possible environment, and it's easy to not think of a particular scenario - like ten managers all running top on a system with thousands of processes.
So while top is more of a burden than prstat, it's not due to an obvious mistake or oversight; it's due to a programming style that many of us would have chosen - one that only becomes a problem in certain scenarios.