DTrace Troubleshooting
This is a demonstration of DTrace analysis versus other tools
on Solaris 10.
Note: This document is not written by Sun.
DTrace vs truss
DTrace is the new comprehensive system analysis tool that is designed
to be less intrusive than other tools, and safe to run in production.
truss is a standard Solaris tool to analyse system calls and signals,
more recently enhanced to follow user-level function calls as well (-u).
It's often said that DTrace is less of a burden than other tools such as truss - but is that true?
DTrace provides over 30,000 probes, and has the capability to drill down to each instruction of a process; truss by default only looks at system calls and signals. Should we believe that DTrace can do such amazing things at little CPU cost? The following demonstration puts both DTrace and truss to the test.
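(A quick check of that probe claim: you can count the probes available on your own system, though the total varies by release, patch level, and loaded modules, so expect a different number,

# dtrace -l | wc -l

Each line of dtrace -l output describes one available probe, plus one header line.)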
The following shows a system suffering a classic performance problem. Although this fault has been deliberately created for this demonstration, this type of problem is common.
A few commands are run to show the system is under heavy CPU load,
# uptime
  1:12pm  up 1 day(s), 16:36,  3 users,  load average: 2.12, 1.97, 1.26
#
# vmstat 1
 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf    pi po fr de sr cd s0 -- --   in   sy   cs us sy id
 0 0 0 1248800 249416   3    31  2  0  0  0  1  1  0  0  0  344 1197  324  1  2 98
 0 0 0 1132928 110196 737  9965 10  0  0  0  0  0  0  0  0  415 7133  557 13 87  0
 1 0 0 1132936 110200 784 10586  0  0  0  0  0  3  0  0  0  411 8488  585 14 86  0
 1 0 0 1132920 110192 841 11375  0  0  0  0  0  1  0  0  0  407 8016  608 14 86  0
 0 0 0 1132984 110252 852 11473  0  0  0  0  0  0  0  0  0  408 8105  609 14 86  0
 1 0 0 1132888 110176 823 11136  0  0  0  0  0  0  0  0  0  403 8907  603 15 85  0
^C
#
This is a single CPU system, so a load average of 2.12 would indicate potential CPU saturation, with 100% utilisation. The output of vmstat supports this: it shows no idle time, and some threads queueing up in the run queue for these 1 second averages.
Let's run prstat to identify the process responsible,
# prstat
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
 25969 root     2564K 1144K run      0    0   0:01:30 9.3% bash/1
   912 brendan    93M   69M run     49    0   0:15:45 1.8% mozilla-bin/4
   845 brendan    23M   19M sleep   59    0   0:03:44 0.5% acroread/1
 17640 root       11M 7044K sleep   59    0   0:04:18 0.4% dtrace/1
 25667 brendan   106M   54M sleep   49    0   0:00:28 0.3% soffice.bin/5
 26084 root     4328K 4060K cpu0    49    0   0:00:00 0.2% prstat/1
  1470 brendan    94M   28M sleep   59    0   0:02:09 0.1% java/14
   412 root       89M   19M sleep   59    0   0:01:41 0.1% poold/8
   496 brendan    53M   65M sleep   59    0   0:03:57 0.0% Xorg/1
   166 root     1860K 1000K sleep   59    0   0:00:32 0.0% in.routed/1
 17778 brendan  8612K 1948K sleep   59    0   0:00:00 0.0% sshd/1
 17863 root     2564K 1796K sleep   49    0   0:00:00 0.0% bash/1
     7 root       10M 7976K sleep   59    0   0:00:12 0.0% svc.startd/13
   125 root     1228K  836K sleep   59    0   0:00:00 0.0% powerd/2
   265 root     1844K 1096K sleep   59    0   0:00:00 0.0% ttymon/1
    95 daemon   3688K 2000K sleep   59    0   0:00:00 0.0% kcfd/3
   385 root     3304K 1200K sleep   59    0   0:00:00 0.0% sshd/1
Total: 83 processes, 191 lwps, load averages: 2.11, 2.03, 1.40
Hmm, looking at the CPU column doesn't explain our 100% utilised system: the processes we can see add up to only around 15% CPU. Where has the other 85% of CPU gone?
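(As an aside, and jumping ahead to DTrace for a moment: one quick way to see who is on the CPU is to sample at a fixed rate. This is a minimal sketch; the 997 Hz rate is an arbitrary choice, picked to avoid sampling in lockstep with clocked activity,

# dtrace -n 'profile-997 { @[execname] = count(); }'

For now, though, let's continue the investigation with conventional tools.)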
It's possible the CPU is consumed by various kernel threads that are not represented as processes. For example, CPU consumed by interrupt threads can now be checked in Solaris 10 by the intrstat command (which uses DTrace),
# intrstat 5

      device |      cpu0 %tim
-------------+---------------
       ata#1 |         0  0.0
      rtls#0 |       421  0.3
      uhci#0 |       421  0.1
      uhci#1 |         0  0.0

      device |      cpu0 %tim
-------------+---------------
       ata#1 |         0  0.0
      rtls#0 |       471  0.4
      uhci#0 |       471  0.1
      uhci#1 |         0  0.0
^C
This shows I'm losing 0.3% CPU to rtls0 - my network card. This hasn't explained the missing 85%, but it's really useful to be able to get details on interrupt CPU usage anyway.
Another place the CPU can vanish is in satisfying TLB and TSB misses - virtual to physical address translation. trapstat can show us,
# trapstat -T 1
trapstat: not implemented on i86pc
Oops, that's what I get for using an x86 based system! Not to worry, testing the same fault on SPARC based systems shows about 2% of CPU is consumed here - which can't explain the missing 85%.
Well, for good measure we should check other areas of the system: memory, disk I/O, and network usage. I'll use swapinfo to check memory,
# swapinfo
RAM  _____Total  511.6 Mb
RAM    Unusable    8.0 Mb
RAM      Kernel  108.7 Mb
RAM      Locked    0.0 Mb
RAM        Used  287.4 Mb
RAM       Avail  107.4 Mb

Disk _____Total 1023.7 Mb
Disk      Alloc    0.0 Mb
Disk       Free 1023.7 Mb

Swap _____Total 1355.6 Mb
Swap      Alloc  185.8 Mb
Swap    Unalloc   64.5 Mb
Swap      Avail 1105.3 Mb
Swap (MinFree)    62.9 Mb
#
So plenty of available RAM.
Now iostat to check disk activity,
# iostat -xnmpz 5
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.3    0.3    2.3    0.6  0.0  0.0    4.5    2.0   0   0 c0d0
    0.0    0.0    0.2    0.0  0.0  0.0    0.0   11.2   0   0 mars:/var/tmp
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.2    7.4    1.6    7.6  0.0  0.0    1.0    0.7   0   0 c0d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
^C
Not much disk activity there.
Now nicstat to check network usage,
# nicstat 1
    Time   Int   rKb/s   wKb/s   rPk/s   wPk/s    rAvs    wAvs  %Util    Sat
13:44:10 rtls0    0.29    0.15    1.29    0.74  229.95  210.02   0.00   0.00
13:44:11 rtls0    5.66   32.51   92.98  178.04   62.38  186.98   0.31   0.00
13:44:12 rtls0    2.26   21.85   36.67   67.40   62.97  331.88   0.20   0.00
13:44:13 rtls0    4.08   28.11   68.29  132.62   61.13  217.04   0.26   0.00
13:44:14 rtls0    5.19   32.12   87.81  172.63   60.52  190.53   0.31   0.00
13:44:15 rtls0    6.49   36.43  110.06  218.14   60.41  171.02   0.35   0.00
^C
And not much network activity either.
After all that, we know the CPU is fairly busy but have little idea why.
truss needs to either run the program, or be given a PID to examine. We already checked processes using prstat, and no process explained where all the CPU had gone. The highest we saw was a bash shell at 9.3% with PID 25969. Let's run truss on that,
# truss -p 25969
lwp_sigmask(SIG_SETMASK, 0x00020002, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
schedctl()                                      = 0xD25A4000
lwp_sigmask(SIG_SETMASK, 0x00000000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
    Received signal #18, SIGCLD [caught]
      siginfo: SIGCLD CLD_EXITED pid=2387 status=0x0000
lwp_sigmask(SIG_SETMASK, 0x00020000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
waitid(P_ALL, 0, 0x080476E0, WEXITED|WTRAPPED|WNOHANG) = 0
waitid(P_ALL, 0, 0x080476E0, WEXITED|WTRAPPED|WNOHANG) Err#10 ECHILD
setcontext(0x080475E0)
lwp_sigmask(SIG_SETMASK, 0x00020000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
lwp_sigmask(SIG_SETMASK, 0x00000000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
lwp_sigmask(SIG_SETMASK, 0x00020000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
sigaction(SIGINT, 0x08047AD0, 0x08047B40)       = 0
lwp_sigmask(SIG_SETMASK, 0x00000000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
sigaction(SIGINT, 0x08047AC0, 0x08047B30)       = 0
lwp_sigmask(SIG_SETMASK, 0x00020000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
lwp_sigmask(SIG_SETMASK, 0x00000000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
lwp_sigmask(SIG_SETMASK, 0x00020002, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
fork1()                                         = 2389
lwp_sigmask(SIG_SETMASK, 0x00020002, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
[...]
The above output scrolled very fast with much the same. We can see a fork1() and a SIGCLD, so we ought to follow children,
# truss -fp 25969
25969:  sigaction(SIGINT, 0x08047AC0, 0x08047B30)       = 0
25969:  lwp_sigmask(SIG_SETMASK, 0x00020000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
25969:  lwp_sigmask(SIG_SETMASK, 0x00000000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
25969:  lwp_sigmask(SIG_SETMASK, 0x00020002, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
25969:  fork1()                                         = 26938
26938:  fork1()         (returning as child ...)        = 25969
25969:  lwp_sigmask(SIG_SETMASK, 0x00020002, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
[...]
26938:  sigaction(SIGTERM, 0x08047A60, 0x08047AD0)      = 0
26938:  sigaction(SIGCLD, 0x08047A60, 0x08047AD0)       = 0
26938:  execve("/usr/bin/date", 0x08101FA8, 0x08100E08)  argc = 1
26938:  resolvepath("/usr/bin/date", "/usr/bin/date", 1023) = 13
26938:  sysconfig(_CONFIG_PAGESIZE)                     = 4096
26938:  resolvepath("/usr/lib/ld.so.1", "/lib/ld.so.1", 1023) = 12
26938:  xstat(2, "/usr/bin/date", 0x08047B38)           = 0
26938:  open("/var/ld/ld.config", O_RDONLY)             Err#2 ENOENT
[...]
The above output shows the child is the date command. Let's concentrate on those execs only,
# truss -ftexec -p 25969
15670:  execve("/usr/bin/date", 0x08101FA8, 0x08100E08)  argc = 1
15672:  execve("/usr/bin/date", 0x08100BE8, 0x08100E08)  argc = 1
15674:  execve("/usr/bin/date", 0x08101FE8, 0x08100E08)  argc = 1
15676:  execve("/usr/bin/date", 0x08101F88, 0x08100E08)  argc = 1
15678:  execve("/usr/bin/date", 0x08101B08, 0x08100E08)  argc = 1
15680:  execve("/usr/bin/date", 0x080FE168, 0x08100E08)  argc = 1
15682:  execve("/usr/bin/date", 0x08100BC8, 0x08100E08)  argc = 1
15684:  execve("/usr/bin/date", 0x08102108, 0x08100E08)  argc = 1
15686:  execve("/usr/bin/date", 0x08101FA8, 0x08100E08)  argc = 1
15688:  execve("/usr/bin/date", 0x08100BE8, 0x08100E08)  argc = 1
[...]
This is also scrolling rather fast. So it appears the problem is hundreds of short lived processes consuming the CPU, each disappearing before being sampled by prstat.
This fault was achieved by running the following command in the background,
# while :; do date; done
It's an infinite loop that runs date over, and over, and over...
DTrace could have found this problem easily by using the execsnoop tool from the DTraceToolkit. execsnoop is a good first tool to run for many problems, as it quickly reveals short lived processes that per-interval tools such as prstat miss.
# execsnoop -v
STRTIME                  UID   PID  PPID ARGS
2005 Sep 11 14:21:39       0 22085 25969 date
2005 Sep 11 14:21:39       0 22086 25969 date
2005 Sep 11 14:21:39       0 22087 25969 date
2005 Sep 11 14:21:39       0 22088 25969 date
2005 Sep 11 14:21:39       0 22089 25969 date
2005 Sep 11 14:21:39       0 22090 25969 date
2005 Sep 11 14:21:39       0 22091 25969 date
2005 Sep 11 14:21:39       0 22092 25969 date
2005 Sep 11 14:21:39       0 22093 25969 date
2005 Sep 11 14:21:39       0 22094 25969 date
2005 Sep 11 14:21:39       0 22095 25969 date
2005 Sep 11 14:21:39       0 22096 25969 date
[...]
This shows many date commands run every second.
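Under the hood, execsnoop is built on the DTrace proc provider. A minimal sketch of the same idea, without execsnoop's timestamps and formatting,

# dtrace -n 'proc:::exec-success { trace(curpsinfo->pr_psargs); }'

This fires once for every successful exec(), printing the new process's arguments - which is exactly what makes short lived processes visible.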
So what? date is a fast command - maybe it doesn't explain the missing 85% CPU after all. We need to measure how much CPU was consumed. We'll try using truss first,
# truss -cf -p 25969
^C
signals ------------
SIGCLD            730
total:            730

syscall               seconds   calls  errors
sigaction                .092    4891
setcontext               .045     730
waitid                   .151    3176     730
fork1                   1.520    2445
lwp_sigmask              .371   22740
schedctl                 .040    2446
                     --------  ------   ----
sys totals:             2.222   36428    730
usr time:                .650
elapsed:              100.540
This shows around 2.2% system time and 0.6% user time for the 100 second sample - under 3% CPU in total. Hmm, that doesn't seem right for a system with no idle time.
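For comparison, DTrace can produce a similar per-syscall count without stopping the target at each call. A minimal sketch, assuming the same shell PID as above,

# dtrace -n 'syscall:::entry /pid == 25969/ { @[probefunc] = count(); }'

Note that this matches PID 25969 only, not its children; catching the children as well would need a predicate on, say, ppid or execname.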
Now we will try DTrace. The shortlived.d script from the DTraceToolkit is run for the same interval,
# shortlived.d
Sampling.. Hit Ctrl-C to stop.
^C
short lived processes:     73.750 secs
total sample duration:    100.833 secs

Total time by process name,
       sendmail          10 ms
           date       63939 ms

Total time by PPID,
            415           5 ms
            414           5 ms
          25969       63939 ms
This shows around 73% of the sample (73.750 of 100.833 seconds) was consumed by short lived processes - confirming that these date commands are responsible for the CPU utilisation. This value goes further than any previous measurement to explain the missing 85%.
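The core technique in shortlived.d is simple: record a timestamp when a process execs, and sum the elapsed time when it exits. A minimal sketch of that idea (the real script also breaks down totals by PPID and prints the summary seen above),

#!/usr/sbin/dtrace -s

proc:::exec-success
{
        /* the new process image starts its life here */
        self->start = timestamp;
}

proc:::exit
/self->start/
{
        /* sum the process lifetime, keyed by process name */
        @times[execname] = sum(timestamp - self->start);
        self->start = 0;
}

Both probes fire in the context of the traced process, so a thread-local variable (self->start) carries the start time from exec to exit.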
So far both tools have pointed to the problem, and DTrace provided more accurate timing info. What we really would like to see is the effect of running each tool.
normal - The following is a "sar -c 1 10" with no other analysis tool running,
# sar -c 1 10

SunOS jupiter 5.10 Generic i86pc    09/11/2005

14:44:43 scall/s sread/s swrit/s  fork/s  exec/s rchar/s wchar/s
14:44:44    9437     127     124   91.09   91.09   76216    6832
14:44:45    8395     122     122   92.08   92.08   76763    6517
14:44:46    8437     126     123   93.07   92.08   77978    7304
14:44:47    9348     125     122   91.09   91.09   75384    6635
14:44:48    8351     120     120   91.18   92.16   75241    6425
14:44:49    8489     121     121   93.07   93.07   76762    6516
14:44:50    9456     124     123   92.00   92.00   76137    6768
14:44:51    8305     120     120   91.18   91.18   75241    6425
14:44:52    8417     122     122   92.08   92.08   75987    6489
14:44:53    9356     125     122   91.09   91.09   76287    6662

Average     8797     123     122   91.79   91.79   76198    6657
Ok, so normally our test application is running around 92 new processes per second - those date commands. This value is our "control".
truss - Now we run truss on the fault as we did previously, and check the effect using sar. We send the output of truss to /dev/null as we are interested in the effect of how truss operates, not that of updating the screen with the output,
# truss -ftexec -p 25969 2> /dev/null &
[2] 17953
#
# sar -c 1 10

SunOS jupiter 5.10 Generic i86pc    09/11/2005

14:55:39 scall/s sread/s swrit/s  fork/s  exec/s rchar/s wchar/s
14:55:40    9683     909     885   66.34   33.66  331066   19767
14:55:41   10584     890     865   66.34   32.67  326276   19360
14:55:42    9553     896     873   65.69   33.33  327000   19376
14:55:43    9488     893     870   65.69   33.33  329688   19191
14:55:44   10533     893     867   66.34   32.67  323564   19547
14:55:45    9638     901     879   66.67   33.33  330474   19533
14:55:46    9508     890     866   65.69   33.33  328134   19139
14:55:47    9428     779     757   57.43   28.71  282797   17500
14:55:48    9576     898     876   66.34   32.67  327450   19472
14:55:49    9727     901     884   67.00   33.00  333738   19789

Average     9771     885     862   65.35   32.68  324028   19267
Yikes! Now we have dropped from around 92 processes per second to around 33 - (91.79 - 32.68) / 91.79 = 64% - which means truss is causing our application to run 64% slower.
DTrace - Now let's try the same with DTrace's execsnoop,
# execsnoop -v > /dev/null &
[2] 17021
#
# sar -c 1 10

SunOS jupiter 5.10 Generic i86pc    09/11/2005

15:00:35 scall/s sread/s swrit/s  fork/s  exec/s rchar/s wchar/s
15:00:36    9519     129     127   92.00   92.00   77230   11111
15:00:37    8318     124     124   90.20   90.20   75244   10486
15:00:38    8392     119     120   92.08   92.08   75208   10649
15:00:39    9469     127     124   92.00   92.00   76266   10933
15:00:40    8416     121     122   92.08   92.08   75986   10633
15:00:41    8317     118     119   91.26   90.29   75272   10498
15:00:42    9452     125     124   91.00   92.00   76138   10933
15:00:43    8409     119     120   92.08   92.08   75208   10557
15:00:44    8325     120     120   90.20   90.20   75244   10573
15:00:45    9455     124     123   92.00   92.00   75353   10904

Average     8803     122     122   91.49   91.49   75711   10726
Phwaorr! We have dropped from 91.79 processes per second to 91.49 - DTrace has slowed our application by 0.3%!
So why is DTrace so much less of a burden than truss?
Running mpstat while each of truss and DTrace is tracing gives us a big clue,
truss
# mpstat 1 8
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0   593   0    0  350  250  356    8    0    0    0  1606    2   6   0  92
  0 10862   0    0  405  304 1320   96    0    0    0  9255   16  84   0   0
  0 10687   0    0  403  303 1317  142    0    0    0 10262   17  83   0   0
  0 10775   0    0  407  307 1321   98    0    5    0  9182   17  83   0   0
  0 10719   0    0  399  300 1312   97    0    3    0  9119   17  83   0   0
  0 10733   0    0  406  306 1315  142    0    0    0 10239   17  83   0   0
  0 10778   0    0  406  306 1321   97    0    4    0  9161   17  83   0   0
  0 10924   0    0  407  306 1332   98    0    1    0  9270   16  84   0   0
DTrace
# mpstat 1 8
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0   599   0    0  350  250  356    8    0    0    0  1610    2   6   0  92
  0 11105   0    0  411  310  597   76    0    1    0  8853   15  85   0   0
  0 11209   0    0  403  303  597   76    0    0    0  7892   15  85   0   0
  0 11118   0    0  399  300  579   75    0    0    0  7782   15  85   0   0
  0 11199   0    0  410  309  607   83    0    0    0  8944   15  85   0   0
  0 11145   0    0  399  300  585   77    0    2    0  7808   15  85   0   0
  0 11259   0    0  407  306  591   82    0    0    0  7932   14  86   0   0
  0 11131   0    0  409  309  600   79    0    1    0  8874   15  85   0   0
There are fewer than half as many context switches (csw) when DTrace is running.
Let's use DTrace to see what is context switching onto the CPU so much. The following is while truss is running,
# dtrace -n 'sched:::on-cpu { @[execname] = count(); } profile:::tick-20s { exit(0); }'

  ttymon                                                            1
  in.routed                                                         2
  sac                                                               2
  snmpd                                                             3
  miniserv.pl                                                       5
  fmd                                                               5
  svc.configd                                                       6
  inetd                                                             6
  sendmail                                                         12
  fsflush                                                          21
  Xorg                                                             54
  svc.startd                                                       57
  soffice.bin                                                     120
  dtrace                                                          438
  poold                                                           631
  sched                                                           770
  java                                                            817
  acroread                                                        995
  mozilla-bin                                                    2534
  date                                                           3508
  bash                                                           4087
  truss                                                         11573
In this 20 second sample, truss jumped onto the CPU 11573 times. truss operates by sending control messages to procfs's ctl and lwpctl files, causing the process to freeze for every system call so that truss can print a line of output. Hence truss's behaviour is synchronous with the system calls.
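You can watch this from the other side by counting truss's own system calls while it is running. A hedged one-liner sketch,

# dtrace -n 'syscall:::entry /execname == "truss"/ { @[probefunc] = count(); }'

Expect a steady stream of procfs reads and control writes, roughly tracking the system call rate of the target.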
Now the same test is done while using DTrace to analyse the problem,
# dtrace -n 'sched:::on-cpu { @[execname] = count(); } profile:::tick-20s { exit(0); }'

  automountd                                                        1
  utmpd                                                             2
  init                                                              2
  in.routed                                                         2
  nfsmapid                                                          3
  miniserv.pl                                                       4
  fmd                                                               5
  svc.configd                                                       5
  inetd                                                             6
  sendmail                                                         11
  fsflush                                                          21
  Xorg                                                             50
  svc.startd                                                       61
  soffice.bin                                                     118
  poold                                                           612
  dtrace                                                          653
  sched                                                           738
  java                                                            801
  acroread                                                        994
  date                                                           1054
  mozilla-bin                                                    2560
  bash                                                           4100
DTrace barely steps onto the CPU. This is because DTrace uses a per CPU buffer in the kernel that is read at comfortable intervals by the user level DTrace consumer, /usr/sbin/dtrace. Often this interval is once per second, which would mean the DTrace program is stepping onto the CPU once per second rather than for every system call. This behaviour is asynchronous to the system calls, and results in minimal performance impact.
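That read interval is tunable. The following sketch shows the relevant options; the values here are illustrative, not recommendations,

#!/usr/sbin/dtrace -s

#pragma D option quiet
/* principal buffer size, per CPU */
#pragma D option bufsize=4m
/* consumer reads the buffers 10 times per second */
#pragma D option switchrate=10hz

proc:::exec-success
{
        printf("%d %s\n", pid, curpsinfo->pr_psargs);
}

switchrate controls how often the consumer reads the per-CPU buffers (the default is once per second), and bufsize sets how much data can accumulate in the kernel between reads.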
For completeness, other tools could also have been used to help analyse the short lived process problem.
Conclusion
In the above demonstration, both truss and DTrace identified the problem: hundreds of short lived date processes devouring the CPU, each gone before prstat could sample it. The cost of the two tools differed enormously: truss slowed the application by around 64%, while DTrace slowed it by around 0.3%. The reason is how they operate: truss stops the target at every system call, while DTrace buffers its data in the kernel and reads it asynchronously.
I hope this has been a useful demonstration of DTrace.