I originally posted this at http://blogs.sun.com/brendan/entry/dtrace_bourne_shell_sh_provider1.
Alan Hargreaves has prototyped a DTrace Bourne shell provider, and is looking ahead to see how this can be integrated with Solaris and OpenSolaris. This is great news for anyone who would like to analyze or troubleshoot their Bourne shell scripts. It is also good news for people (especially system administrators) wanting to learn DTrace, as it may provide a familiar path (sh scripting) for you to learn user-land DTracing.
Why sh, and not csh, tcsh, ksh, bash or zsh? Alan could have done any, we wanted to start with the Bourne shell to both get it done and as an example for the other shell providers to follow (watch this space). So if you really just care about bash or whatever, then learning about the sh provider now should be a useful heads-up of what should be possible.
hello.sh
When learning to DTrace software, it can be best to start with the simple stuff. Confirm that what you think will happen, does indeed happen, then move to progressivly more complex scripts.
Here is hello.sh, a dead simple script to start with:
# cat -n hello.sh 1 #!./sh 2 3 echo "Hello World!" 4 echo "Hello World!" 5 echo "Hello World!" 6 echo "Hello World!" 7 echo "Hello World!"
The following DTrace one-liner will run hello.sh (-c), trace all sh provider probes (sh*:::), and frequency count their probe names (@[probename] = count()).
# dtrace -Zn 'sh$target::: { @[probename] = count(); }' -c ./hello.sh dtrace: description 'sh$target::: ' matched 9 probes Hello World! Hello World! Hello World! Hello World! Hello World! dtrace: pid 248847 has exited script-begin 1 script-end 1 builtin-entry 5 builtin-return 5 line 5
So builtin-entry and builtin-return were called 5 times, one for each echo. The line probe was also called 5 times, once for each line that the shell executed.
Frequency counting builtins
Each probe provides useful information in their args, such as the script name, builtin or exec name, line number, etc. Here I'll use the builtin-entry probe to frequency count the builtin names:
# dtrace -Zn 'sh$target:::builtin-entry { @[copyinstr(arg1)] = count(); }' -c ./hello.sh dtrace: description 'sh$target:::builtin-entry ' matched 1 probe Hello World! [...] dtrace: pid 248851 has exited echo 5
That found that echo was called 5 times. Nothing suprising there, this is just confirming that the probes behave as they should.
Line number tracing
The line probe allows us to trace line execution. This one liner counts how many times each line was run:
# dtrace -Zn 'sh$target:::line { @[arg1] = count(); }' -c ./hello.sh dtrace: description 'sh$target:::line ' matched 1 probe Hello World! [...] dtrace: pid 248855 has exited 3 1 4 1 5 1 6 1 7 1
Each of the lines 3 to 7 were called once, as expected. This one-liner should be useful for finding loop hotspots.
And the following one-liner traces line execution live:
# dtrace -Zn 'sh$target:::line { trace(arg1); }' -c ./hello.sh dtrace: description 'sh$target:::line ' matched 1 probe Hello World! [...] CPU ID FUNCTION:NAME 0 54539 execute:line 3 0 54539 execute:line 4 0 54539 execute:line 5 0 54539 execute:line 6 0 54539 execute:line 7 dtrace: pid 248869 has exited
Great, that works. Is this getting too easy? Lets move on...
func_abc.sh
Now for somthing a little harder (and more interesting). In the following script, func_a() calls func_b() which calls func_c().
# cat -n func_abc.sh 1 #!./sh 2 3 func_c() 4 { 5 echo "Function C" 6 sleep 1 7 } 8 9 func_b() 10 { 11 echo "Function B" 12 sleep 1 13 func_c 14 } 15 16 func_a() 17 { 18 echo "Function A" 19 sleep 1 20 func_b 21 } 22 23 func_a
What order will the lines be executed now? Try thinking about it before checking the answer below.
Tracing line execution
# dtrace -Zn 'sh$target:::line { trace(arg1); }' -c ./func_abc.sh dtrace: description 'sh$target:::line ' matched 1 probe Function A CPU ID FUNCTION:NAME 0 54539 execute:line 23 0 54539 execute:line 18 0 54539 execute:line 19 pause Function B 0 54539 execute:line 20 0 54539 execute:line 11 0 54539 execute:line 12 pause Function C 0 54539 execute:line 13 0 54539 execute:line 5 0 54539 execute:line 6 pause dtrace: pid 248873 has exited
There were some conspicuous one-second pauses during the output, which are sadly lost in this screen shot. I've added the word "pause" so that you know when that happened.
The order of the lines does make sense. What might be confusing is that we see the output text "Function B" followed by the trace of line 20 -- this is due to DTrace output buffering.
Tracing function flow: sh_flow.d
Since we have function-entry and function-return probes, we can trace function execution and indent the output based on the function depth. The following demonstrates sh_flow.d: a script which does just that for any script running on the system (with this sh provider enabled binary), and also traces builtin calls and command executions. sh_flow.d is run in one window, and func_abc.sh runs in another.
# ./sh_flow.d C TIME FILE -- NAME 0 2007 Aug 10 18:41:10 func_abc.sh -> func_a 0 2007 Aug 10 18:41:10 func_abc.sh > echo 0 2007 Aug 10 18:41:10 func_abc.sh | sleep 0 2007 Aug 10 18:41:11 func_abc.sh -> func_b 0 2007 Aug 10 18:41:11 func_abc.sh > echo 0 2007 Aug 10 18:41:11 func_abc.sh | sleep 0 2007 Aug 10 18:41:12 func_abc.sh -> func_c 0 2007 Aug 10 18:41:12 func_abc.sh > echo 0 2007 Aug 10 18:41:12 func_abc.sh | sleep 0 2007 Aug 10 18:41:13 func_abc.sh <- func_c 0 2007 Aug 10 18:41:13 func_abc.sh <- func_b 0 2007 Aug 10 18:41:13 func_abc.sh <- func_a ^C
The legend is:
-> | function entry |
<- | function return |
> | builtin entry |
| | command entry |
Great! the output makes it clear who is calling who.
The sh_flow.d script is:
#!/usr/sbin/dtrace -Zs /* * sh_flow.d - snoop Bourne shell execution showing function flow. * Written for the Solaris sh DTrace provider. * * This traces shell activity from all Bourne shells on the system that are * running with sh provider support. * * USAGE: sh_flow.d # hit Ctrl-C to end * * This watches shell function entries and returns, and indents child * function calls. Shell builtins are also printed. * * FIELDS: * C CPU-id * TIME Time of execution * FILE Filename that this function belongs to * NAME Shell function, builtin or command name * * If a name isn't available at the time of tracing, "<null>" is printed. * * WARNING: Watch the first column carefully, it prints the CPU-id. If it * changes, then it is very likely that the output has been shuffled. */ #pragma D option quiet #pragma D option switchrate=10 dtrace:::BEGIN { depth = 0; printf("%s %-20s %-22s -- %s\\n", "C", "TIME", "FILE", "NAME"); } sh*:::function-entry { depth++; printf("%d %-20Y %-22s %\*s-> %s\\n", cpu, walltimestamp, basename(copyinstr(arg0)), depth*2, "", copyinstr(arg1)); } sh*:::function-return { printf("%d %-20Y %-22s %\*s<- %s\\n", cpu, walltimestamp, basename(copyinstr(arg0)), depth*2, "", copyinstr(arg1)); depth--; } sh*:::builtin-entry { printf("%d %-20Y %-22s %\*s > %s\\n", cpu, walltimestamp, basename(copyinstr(arg0)), depth*2, "", copyinstr(arg1)); } sh*:::command-entry { printf("%d %-20Y %-22s %\*s | %s\\n", cpu, walltimestamp, basename(copyinstr(arg0)), depth*2, "", copyinstr(arg1)); }
Tracing function flow with times: sh_flowtime.d
A great feature of DTrace is the availability of high resolution timestamps. Here I've enhanced the previous script to print delta times:
# ./sh_flowtime.d C TIME FILE DELTA(us) -- NAME 0 2007 Aug 10 18:52:51 func_abc.sh 0 -> func_a 0 2007 Aug 10 18:52:51 func_abc.sh 54 > echo 0 2007 Aug 10 18:52:52 func_abc.sh 1022880 | sleep 0 2007 Aug 10 18:52:52 func_abc.sh 34 -> func_b 0 2007 Aug 10 18:52:52 func_abc.sh 44 > echo 0 2007 Aug 10 18:52:53 func_abc.sh 1029963 | sleep 0 2007 Aug 10 18:52:53 func_abc.sh 44 -> func_c 0 2007 Aug 10 18:52:53 func_abc.sh 43 > echo 0 2007 Aug 10 18:52:54 func_abc.sh 1029863 | sleep 0 2007 Aug 10 18:52:54 func_abc.sh 33 <- func_c 0 2007 Aug 10 18:52:54 func_abc.sh 14 <- func_b 0 2007 Aug 10 18:52:54 func_abc.sh 7 <- func_a ^C
The legend for understanding the delta times is:
-> | function entry | time from the previous event, to the entry of this function |
<- | function return | time from the previous event, to the completion of this function |
> | builtin entry | time from the previous event, to when this builtin completed execution |
| | command entry | time from the previous event, to when this command completed execution |
Hey, we found the latency in the script, it's those 1.02 second calls to "sleep"! :-)
Being able to present this information in such a compelling way is awsome stuff.
I won't include the source to that script just yet (this blog entry will get too long), but I'll encourage Alan to bundle these scripts along with the patch/binary of the sh provider (when he makes that available).
Tracing type times: sh_typetime.d
If your script is rather long, the above output from sh_flowtime.d may get rather long. This condensed summary may be more practical:
# ./sh_typetime.d Tracing... Hit Ctrl-C to end. ^C FILE TYPE NAME TIME(us) func_abc.sh builtin echo 82 func_abc.sh func func_c 260 func_abc.sh func func_b 285 func_abc.sh func func_a 478 func_abc.sh exec sleep 3080418
It's pretty obvious from that output where the time was spent.
I was experimenting with more exciting forms of that type of script:
# ./sh_typetime2.d Tracing... Hit Ctrl-C to end. ^C FILE TYPE NAME COUNT ELAPSED ON-CPU func_abc.sh func func_c 1 296 31 func_abc.sh func func_b 1 346 44 func_abc.sh func func_a 1 529 61 func_abc.sh builtin echo 3 87 77 func_abc.sh exec sleep 3 3085831 53984
Lots more will be possible, Alan only wrote the provider two days ago, so I haven't had much of a chance to write scripts yet!
Real world tracing
Alan literally wrote this provider from scratch in the last few days, so we haven't had a chance to point it at something big yet. Here's at least one example of something familiar, /etc/profile:
# ./sh_typetime2.d Tracing... Hit Ctrl-C to end. ^C FILE TYPE NAME COUNT ELAPSED ON-CPU profile builtin [ 1 3 2 profile builtin umask 1 4 3 profile builtin export 1 5 3 profile builtin trap 2 30 22
Great. See Alan's blog for updates.