This is an example of performing troubleshooting using DTrace
from Solaris 10.
Note: This document is not written by Sun, and has been based on beta Solaris 10.
DTracing Lost CPU
A common command to gauge CPU activity is uptime,
# uptime 1:01pm up 6 day(s), 22:42, 1 user, load average: 0.80, 0.33, 0.18 |
This shows that there is some degree of load.
To check this in more detail we may try vmstat,
# vmstat 1 kthr memory page disk faults cpu r b w swap free re mf pi po fr de sr cd s0 -- -- in sy cs us sy id 0 0 0 629532 173496 4 22 2 0 0 0 3 0 0 0 0 315 443 246 2 1 97 1 0 51 288804 12028 2320 12220 512 0 0 0 0 148 0 0 0 539 10864 886 20 68 12 1 0 51 288604 11828 1950 10457 733 0 0 0 0 172 0 0 0 591 9443 856 15 59 26 1 0 51 288444 11652 2374 12616 467 0 0 0 0 156 0 0 0 554 11079 889 19 75 6 1 0 51 288544 11748 2426 12901 569 0 0 0 0 172 0 0 0 584 11206 915 19 72 10 2 0 51 288128 11336 1646 8621 1432 0 0 0 0 324 0 0 0 902 9613 1078 13 53 34 0 0 51 287968 10708 2081 11100 650 0 0 0 0 179 0 0 0 605 10059 872 17 62 21 2 0 51 287456 10308 1650 8663 808 0 0 0 0 192 0 0 0 634 8564 833 13 50 37 1 0 51 287336 9944 2212 11513 661 0 0 0 0 174 0 0 0 590 10532 871 17 65 18 1 0 51 287180 9704 2377 12571 475 0 0 0 0 155 0 0 0 555 11031 889 19 71 11 0 0 51 286936 9480 2324 12723 524 0 0 0 0 166 0 0 0 587 11028 931 18 70 12 2 0 51 286576 9412 43 380 923 0 0 0 0 232 0 0 0 707 2492 697 13 7 80 0 0 51 286448 9616 13 196 1525 0 0 0 0 380 0 0 0 1005 4066 1067 31 8 61 0 0 51 286452 10412 58 365 3307 0 0 0 0 827 0 0 0 1892 7517 1832 5 14 81 |
Although 1 second samples are rather short, idle (id) really seems to be see-sawing, and there is a high ratio of system time (cpu sy) versus user time. There is also an alarmingly high number of system calls (faults sy), and some pressure on the memory system (sr - although the column isn't lining up properly on this beta Solaris).
We now try prstat, which is especially good at summarising system activity,
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP 443 brendan 347M 194M sleep 59 0 1:01:48 3.7% Xsun/1 584 brendan 26M 9872K sleep 59 0 0:15:58 3.2% gnome-terminal/1 6076 root 1128K 892K sleep 39 0 0:00:04 3.0% catman/1 576 brendan 19M 6728K sleep 59 0 0:05:13 0.6% metacity/1 580 brendan 24M 8332K sleep 59 0 0:35:25 0.4% gnome-panel/1 8774 root 4420K 4212K cpu0 49 0 0:00:00 0.2% prstat/1 1 root 1992K 360K sleep 59 0 0:00:01 0.1% init/1 646 brendan 6148K 1912K sleep 49 0 0:00:53 0.1% vncviewer/1 15578 brendan 2200K 268K sleep 59 0 0:00:05 0.1% rlogin/1 595 brendan 22M 5892K sleep 49 0 0:03:17 0.0% mixer_applet2/1 167 root 1924K 768K sleep 59 0 0:02:22 0.0% in.routed/1 593 brendan 16M 4376K sleep 49 0 0:01:08 0.0% galf-server/1 578 brendan 30M 10M sleep 59 0 0:01:48 0.0% nautilus/5 9315 root 1092K 820K run 19 0 0:00:00 0.0% sh/1 560 brendan 4512K 1984K sleep 59 0 0:00:28 0.0% xscreensaver/1 574 brendan 3412K 540K sleep 59 0 0:01:10 0.0% gnome-smproxy/1 1435 root 2516K 1268K sleep 49 0 0:00:00 0.0% bash/1 15579 brendan 2008K 108K sleep 59 0 0:00:04 0.0% rlogin/1 8509 root 1004K 712K sleep 59 0 0:00:00 0.0% sleep/1 535 brendan 1672K 400K sleep 59 0 0:00:28 0.0% dsdm/1 429 root 1196K 532K sleep 59 0 0:00:44 0.0% upsmonitor/1 Total: 92 processes, 174 lwps, load averages: 0.89, 0.49, 0.25 |
The above does little to explain the load average of 0.89 - there simply doesn't seem to be any processes using that much CPU!?
Now we try prstat -m, for microstate accounting,
# prstat -m PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/NLWP 23333 root 0.1 16 0.0 0.0 0.0 0.0 98 1.1 22 3 239 0 prstat/1 20818 root 0.6 3.5 0.0 0.0 0.0 0.0 82 14 251 10 2K 0 catman/1 24751 root 0.2 1.0 0.0 0.0 0.0 0.0 98 0.0 13 0 445 0 sleep/1 584 brendan 0.7 0.1 0.0 0.0 0.0 0.0 99 0.1 12 1 44 0 gnome-termin/1 443 brendan 0.7 0.1 0.0 0.0 0.0 0.0 99 0.1 20 1 62 0 Xsun/1 580 brendan 0.5 0.1 0.0 0.0 0.0 0.0 99 0.2 14 1 51 0 gnome-panel/1 429 root 0.0 0.1 0.0 0.0 0.0 0.0 99 0.4 2 0 15 0 upsmonitor/1 646 brendan 0.0 0.1 0.0 0.0 0.0 0.0 100 0.1 5 5 40 0 vncviewer/1 595 brendan 0.0 0.0 0.0 0.0 0.0 0.0 100 0.0 14 0 12 0 mixer_applet/1 249 root 0.0 0.0 0.0 0.0 0.0 0.0 100 0.0 22 0 64 0 nscd/25 576 brendan 0.0 0.0 0.0 0.0 0.0 0.0 100 0.2 4 0 20 0 metacity/1 578 brendan 0.0 0.0 0.0 0.0 0.0 100 100 0.0 3 0 4 0 nautilus/5 574 brendan 0.0 0.0 0.0 0.0 0.0 0.0 100 0.2 2 0 12 0 gnome-smprox/1 593 brendan 0.0 0.0 0.0 0.0 0.0 0.0 100 0.2 2 0 10 0 galf-server/1 167 root 0.0 0.0 0.0 0.0 0.0 0.0 100 0.1 1 0 3 0 in.routed/1 535 brendan 0.0 0.0 0.0 0.0 0.0 0.0 100 0.2 2 0 8 0 dsdm/1 560 brendan 0.0 0.0 0.0 0.0 0.0 0.0 100 0.1 1 0 6 0 xscreensaver/1 406 root 0.0 0.0 0.0 0.0 0.0 0.0 100 0.0 0 0 0 0 snmpdx/1 293 root 0.0 0.0 0.0 0.0 0.0 0.0 100 0.0 0 0 0 0 smcboot/1 294 root 0.0 0.0 0.0 0.0 0.0 0.0 100 0.0 0 0 0 0 smcboot/1 283 root 0.0 0.0 0.0 0.0 0.0 0.0 100 0.0 0 0 0 0 utmpd/1 Total: 86 processes, 169 lwps, load averages: 0.80, 0.36, 0.32 PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/NLWP 26535 root 0.0 15 0.0 0.0 0.0 0.0 99 0.4 24 2 239 0 prstat/1 25415 root 1.2 7.5 0.0 0.0 0.0 0.0 89 2.6 1K 33 28K 0 catman/1 443 brendan 0.3 0.0 0.0 0.0 0.0 0.0 100 0.1 53 1 151 0 Xsun/1 580 brendan 0.3 0.0 0.0 0.0 0.0 0.0 100 0.1 27 2 111 0 gnome-panel/1 584 brendan 0.2 0.0 0.0 0.0 0.0 0.0 100 0.0 30 1 67 0 gnome-termin/1 [...] PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/NLWP 1149 root 0.0 15 0.0 0.0 0.0 0.0 99 0.6 24 2 250 0 prstat/1 1248 root 1.1 8.9 0.0 0.0 0.0 0.0 0.0 47 0 580 5K 0 sh/1 1122 root 1.6 1.1 0.0 0.0 0.0 0.0 90 7.8 74 76 2K 0 sh/1 443 brendan 0.5 0.1 0.0 0.0 0.0 0.0 99 0.1 58 2 165 0 Xsun/1 580 brendan 0.3 0.0 0.0 0.0 0.0 0.0 100 0.0 28 2 109 0 gnome-panel/1 [...] PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/NLWP 1616 root 0.1 15 0.0 0.0 0.0 0.0 99 0.7 22 4 262 0 prstat/1 2368 root 7.1 5.2 0.0 0.0 0.0 0.0 87 0.0 1K 17 12K 0 getNAME/1 21 root 0.6 3.0 0.0 0.0 0.0 0.0 91 5.9 315 23 6K 0 catman/1 584 brendan 2.6 0.3 0.0 0.0 0.0 0.0 92 4.7 34 63 306 0 gnome-termin/1 2350 root 1.5 1.0 0.0 0.0 0.0 0.0 91 5.8 42 14 1K 0 sh/1 443 brendan 1.7 0.1 0.0 0.0 0.0 0.0 98 0.1 75 1 221 0 Xsun/1 [...] |
Depending on the sample, prstat -m sometimes identifies "catman" as the culprit for all those system calls, and other times identifies shell processes, getNAME, or other programs. When we try to analyse most of these processes further (such as PID 1248, 1122 or 2368) we find that they no longer exist. prstat is catching glimpses and is giving us clues, but it's hard to see the whole picture.
There are many ways DTrace can investigate what is happening here. We begin by checking what all those system calls are. The following dtrace command is used to frequency count the system calls; I let dtrace sample for several seconds then hit Ctrl-C,
# dtrace -n 'syscall:::entry { @Calls[probefunc] = count(); }' dtrace: description 'syscall:::entry ' matched 226 probes ^C [...] exece 26 getcwd 26 memcntl 26 chdir 50 rexit 50 fsat 51 lwp_self 51 fork1 51 sysi86 52 schedctl 52 munmap 52 sysconfig 53 setcontext 54 xstat 55 resolvepath 78 writev 78 fcntl 101 lwp_sigmask 106 brk 108 getdents64 110 getpid 128 mmap 157 pollsys 209 waitsys 228 write 237 ioctl 325 lseek 1685 llseek 1710 open 1737 fstat64 1787 close 1888 read 6861 |
Most of these are read()s, and many are open()s with other related calls for file access. Reading carefully there are actually many fork()s and exec()s for this interval, an indication of many new processes being created.
We will begin by examining who is responsible for all those reads. The following is run for several seconds,
# dtrace -n 'syscall::read:entry { @Execs[execname] = count(); }' dtrace: description 'syscall::read:entry ' matched 1 probe ^C xscreensaver 3 galf-server 4 gnome-smproxy 4 dsdm 8 metacity 16 gnome-panel 20 vncviewer 34 nawk 50 fgrep 50 head 50 grep 50 init 100 sh 128 dirname 175 Xsun 192 gnome-terminal 492 catman 639 sgml2roff 778 getNAME 1427 |
getNAME, sgml2roff, ... They don't seem like familiar daemons that are always running, rather an uncommon event.
Now lets try to fetch more info on the read()s themselves. Probably a good place to start is the open()s. We try the following,
# dtrace -n 'syscall::open:entry { @Open[copyinstr(arg0)] = count(); }' dtrace: description 'syscall::open:entry ' matched 1 probe ^C /usr/man/sman3c/__flbf.3c 1 /usr/man/sman3c/__fbufsize.3c 1 /usr/man/sman3c/FD_ZERO.3c 1 /usr/man/sman3c/FD_SET.3c 1 /usr/man/sman3c/FD_ISSET.3c 1 /usr/man/sman3c/FD_CLR.3c 1 /usr/man/sman1/u370.1 1 /usr/man/sman1/typeset.1 1 /usr/man/sman1/type.1 1 /usr/man/sman1/tty.1 1 /usr/man/sman1/tsort.1 1 /usr/man/sman1/truss.1 1 /usr/man/sman1/true.1 1 /usr/man/sman1/troff.1 1 /usr/man/sman1/trap.1 1 [...] /usr/man/man3/TIFFcodec.3t 6 /usr/man/man3/TIFFRGBAImage.3t 8 /usr/man/man3/TIFFswab.3t 10 /usr/man/man3/TIFFstrip.3t 10 /usr/man/man3/TIFFmemory.3t 12 /usr/man/man3/TIFFtile.3t 14 /usr/man/man3/TIFFquery.3t 24 /var/ld/ld.config 36 /lib/libc.so.1 36 |
This output is usually good at spotting a large number of open()s on one particular file. Here that wasn't the case, the output was many pages long and most of the files were only opened once. However, what is useful is the types of files themselves - man pages. We already suspect the catman command is involved, and now we have confirmed the man pages themselves are being opened.
Now lets try to identify what those exec() are for by using DTrace,
# dtrace -n 'syscall::exec*:entry { trace(execname); }' dtrace: description 'syscall::exec*:entry ' matched 2 probes CPU ID FUNCTION:NAME 0 113 exece:entry catman 0 113 exece:entry sh 0 113 exece:entry sgml2roff 0 113 exece:entry sgml2roff 0 113 exece:entry sgml2roff 0 113 exece:entry sgml2roff 0 113 exece:entry sgml2roff 0 113 exece:entry dirname 0 113 exece:entry catman 0 113 exece:entry sh 0 113 exece:entry sh 0 113 exece:entry catman 0 113 exece:entry sh 0 113 exece:entry sgml2roff 0 113 exece:entry sgml2roff 0 113 exece:entry sgml2roff 0 113 exece:entry sgml2roff 0 113 exece:entry sgml2roff 0 113 exece:entry dirname 0 113 exece:entry catman 0 113 exece:entry sh 0 113 exece:entry sgml2roff [...] |
The above looks like many formatting commands, perhaps processing documents. I should add that this was scrolling rather rapidly.
It is a bit difficult to determine exactly what is going on without viewing the command arguments. The following DTrace script does just that, and demonstrates that DTrace can easily dive into complex structures,
# dtrace -qn 'syscall::exec*:entry { printf("%5d %s\n",pid,stringof(curpsinfo->pr_psargs)); }' 16554 catman 16555 sh -c cd /usr/share/man; /usr/lib/sgml/sgml2roff /usr/share/man/sman9f/ddi_task 16556 /bin/sh /usr/lib/sgml/sgml2roff /usr/share/man/sman9f/ddi_taskq_wait.9f 16558 /bin/sh /usr/lib/sgml/sgml2roff /usr/share/man/sman9f/ddi_taskq_wait.9f 16557 /bin/sh /usr/lib/sgml/sgml2roff /usr/share/man/sman9f/ddi_taskq_wait.9f 16559 /bin/sh /usr/lib/sgml/sgml2roff /usr/share/man/sman9f/ddi_taskq_wait.9f 16560 /bin/sh /usr/lib/sgml/sgml2roff /usr/share/man/sman9f/ddi_taskq_wait.9f 16560 /usr/bin/sh /usr/bin/dirname /usr/share/man/sman9f/ddi_taskq_wait.9f 16561 catman 16562 sh -c cd /usr/share/man; rm -f /usr/share/man/cat9f/ddi_taskq_wait.9f; ln -s .. 16563 sh -c cd /usr/share/man; rm -f /usr/share/man/cat9f/ddi_taskq_wait.9f; ln -s .. 16564 catman 16565 sh -c cd /usr/share/man; /usr/lib/sgml/sgml2roff /usr/share/man/sman1as/add-res 16566 /bin/sh /usr/lib/sgml/sgml2roff /usr/share/man/sman1as/add-resources.1as 16568 /bin/sh /usr/lib/sgml/sgml2roff /usr/share/man/sman1as/add-resources.1as 16567 /bin/sh /usr/lib/sgml/sgml2roff /usr/share/man/sman1as/add-resources.1as 16569 /bin/sh /usr/lib/sgml/sgml2roff /usr/share/man/sman1as/add-resources.1as 16570 /bin/sh /usr/lib/sgml/sgml2roff /usr/share/man/sman1as/add-resources.1as 16570 /usr/bin/sh /usr/bin/dirname /usr/share/man/sman1as/add-resources.1as 16571 catman 16572 sh -c cd /usr/share/man; /usr/lib/sgml/sgml2roff /usr/share/man/sman1as/create- [...] |
It is now clear that catman is processing all the man pages, creating a series of short lived processes to do so.
The execution of processes is now analysed using the DTrace tool execsnoop, which has options to provide further details,
# ./execsnoop -v STRTIME UID PID PPID ARGS 2005 Jan 22 14:07:01 0 25253 24072 sh -c cd /usr/share/man; /usr/lib/sgml/sgml2roff /usr/share 2005 Jan 22 14:07:01 0 25254 25253 /bin/sh /usr/lib/sgml/sgml2roff /usr/share/man/sman1/gst-xm 2005 Jan 22 14:07:01 0 25255 25254 grep -i <refentry[ >] 2005 Jan 22 14:07:01 0 25257 25256 head -10 /usr/share/man/sman1/gst-xmlinspect.1 2005 Jan 22 14:07:01 0 25256 25254 fgrep SHADOW_PAGE 2005 Jan 22 14:07:01 0 25258 25254 nawk /SYSTEM/ {gsub("\"","",$4);gsub(">","",$4);print $4 2005 Jan 22 14:07:01 0 25259 25254 /usr/bin/sh /usr/bin/dirname /usr/share/man/sman1/gst-xmlin 2005 Jan 22 14:07:01 0 25259 25254 /usr/bin/expr /usr/share/man/sman1/gst-xmlinspect.1/ : \(/\ 2005 Jan 22 14:07:01 0 25260 24072 sh -c cd /usr/share/man; rm -f /usr/share/man/cat1/gst-xmli 2005 Jan 22 14:07:01 0 25261 25260 rm -f /usr/share/man/cat1/gst-xmlinspect.1 2005 Jan 22 14:07:01 0 25262 25260 ln -s ../cat1/gst-xmlinspect-0.8.1 /usr/share/man/cat1/gst- 2005 Jan 22 14:07:01 0 25263 24072 sh -c cd /usr/share/man; /usr/lib/sgml/sgml2roff /usr/share 2005 Jan 22 14:07:01 0 25264 25263 /bin/sh /usr/lib/sgml/sgml2roff /usr/share/man/sman1/gst-xm 2005 Jan 22 14:07:01 0 25265 25264 grep -i <refentry[ >] 2005 Jan 22 14:07:01 0 25267 25266 /usr/lib/sgml/nsgmls -gl -m/usr/share/lib/sgml/locale/C/dtd 2005 Jan 22 14:07:01 0 25266 25264 /usr/lib/sgml/instant -d -c/usr/share/lib/sgml/locale/C/tra 2005 Jan 22 14:07:01 0 25268 25266 sh -c 2005 Jan 22 14:07:01 0 25269 24072 sh -c cat /tmp/sman_rcbabV | tbl | eqn | nroff -u0 -Tlp -m 2005 Jan 22 14:07:01 0 25271 25270 cat /tmp/sman_rcbabV 2005 Jan 22 14:07:01 0 25272 25270 tbl 2005 Jan 22 14:07:01 0 25273 25270 eqn 2005 Jan 22 14:07:01 0 25274 25270 nroff -u0 -Tlp -man - 2005 Jan 22 14:07:01 0 25270 25269 col -x 2005 Jan 22 14:07:01 0 25275 24072 sh -c cd /usr/share/man; /usr/lib/sgml/sgml2roff /usr/share 2005 Jan 22 14:07:01 0 25276 25275 /bin/sh /usr/lib/sgml/sgml2roff /usr/share/man/sman1/gst-xm 2005 Jan 22 14:07:01 0 25277 25276 grep -i <refentry[ >] 2005 Jan 22 14:07:01 0 25279 25278 head -10 /usr/share/man/sman1/gst-xmllaunch.1 2005 Jan 22 14:07:01 0 25278 25276 fgrep SHADOW_PAGE 2005 Jan 22 14:07:01 0 25280 25276 nawk /SYSTEM/ {gsub("\"","",$4);gsub(">","",$4);print $4 2005 Jan 22 14:07:01 0 25281 25276 /usr/bin/sh /usr/bin/dirname /usr/share/man/sman1/gst-xmlla 2005 Jan 22 14:07:01 0 25281 25276 /usr/bin/expr /usr/share/man/sman1/gst-xmllaunch.1/ : \(/\) 2005 Jan 22 14:07:01 0 25282 24072 sh -c cd /usr/share/man; rm -f /usr/share/man/cat1/gst-xmll [...] |
The timestamps highlight the rate at which these processes are created and destroyed; their lifetime is so short they are not sampled by traditional tools such as ps or prstat.
And now back to the open()s, we could have used opensnoop,
# ./opensnoop -v STRTIME UID PID COMM FH PATH 2005 Jan 22 14:17:17 0 9831 catman 5 /usr/share/man/man4/ipnat.conf.4 2005 Jan 22 14:17:17 0 9831 catman 5 /usr/share/man/man4/ipnat.4 2005 Jan 22 14:17:17 0 11504 sh -1 /var/ld/ld.config 2005 Jan 22 14:17:17 0 11504 sh 3 /lib/libc.so.1 2005 Jan 22 14:17:17 0 11505 rm -1 /var/ld/ld.config 2005 Jan 22 14:17:17 0 11505 rm 3 /lib/libc.so.1 2005 Jan 22 14:17:17 0 11179 dtrace 4 /usr/share/lib/zoneinfo/Australia/NSW 2005 Jan 22 14:17:17 0 11506 ln -1 /var/ld/ld.config 2005 Jan 22 14:17:17 0 11506 ln 3 /lib/libc.so.1 2005 Jan 22 14:17:17 0 9831 catman 5 /usr/share/man/man4/ipnodes.4 2005 Jan 22 14:17:17 0 9831 catman 5 /usr/share/man/man4/ippool.4 2005 Jan 22 14:17:17 0 9831 catman 5 /usr/share/man/man4/ippool.conf.4 2005 Jan 22 14:17:17 0 9831 catman 5 /usr/share/man/man4/ippool.4 2005 Jan 22 14:17:17 0 11507 sh -1 /var/ld/ld.config 2005 Jan 22 14:17:17 0 11507 sh 3 /lib/libc.so.1 2005 Jan 22 14:17:17 0 11508 rm -1 /var/ld/ld.config 2005 Jan 22 14:17:17 0 11508 rm 3 /lib/libc.so.1 2005 Jan 22 14:17:17 0 11509 ln -1 /var/ld/ld.config 2005 Jan 22 14:17:17 0 11509 ln 3 /lib/libc.so.1 2005 Jan 22 14:17:17 0 9831 catman 5 /usr/share/man/man4/isa.4 2005 Jan 22 14:17:17 0 9831 catman 5 /usr/share/man/man4/sysbus.4 [...] |
Which has identified a rapid number of processes, the catman process, and man pages being opened.
The following are some observations from the above DTrace measurements,
I hope this has been a useful demonstration of using DTrace from the command line, as well as from DTrace scripts.
Back to Brendan Gregg's Homepage