I originally posted this at http://dtrace.org/blogs/brendan/2011/05/24/file-system-latency-part-4.
This is part 4 on file system latency, a series on storage I/O performance from the application perspective (see part 1, part 2 and part 3). In the previous post I showed how to trace file system latency from within MySQL using the pid provider. In this post I'll show how similar data can be retrieved using the DTrace syscall and fbt providers. These allow us to trace at the system call layer, and deeper in the kernel at both the Virtual File System (VFS) interface and the specific file system itself.
Syscall Tracing
By tracing at the system call layer, the file system can be traced system-wide - examining all applications simultaneously (no "-p PID"). DTrace has the "syscall" provider for tracing system calls:
           +---------------+
           |  Application  |
           |    mysqld     |
           +---------------+
              reads ^ | writes
                    | |             user-land
HERE -->  ---------system calls---------------------------
                    | |             kernel
           +--------------------------------+
           |           File System          |
           |                                |
           |                |               |
           |                V               |
           |/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\|
          [...]
Another advantage is that the syscalls are well understood and documented in the man pages. They are also much less likely to change than the mysql functions traced earlier (I would have said they never change - until I saw the exec() family of syscalls change slightly during the Solaris 10 releases).
syscall-read-zfs.d
To demonstrate syscall tracing, this DTrace script shows the latency of read()s to ZFS by application name:
# ./syscall-read-zfs.d
dtrace: script './syscall-read-zfs.d' matched 2 probes
^C

  httpd                                               (ns):
           value  ------------- Distribution ------------- count
             512 |                                         0
            1024 |@@@@@@                                   1072
            2048 |@@@@@@@                                  1276
            4096 |@@@@@                                    890
            8192 |@@@@@@@@@@@@@@@@@@@@                     3520
           16384 |@                                        152
           32768 |                                         10
           65536 |                                         2
          131072 |                                         0

  mysqld                                              (ns):
           value  ------------- Distribution ------------- count
             512 |                                         0
            1024 |@@@                                      1268
            2048 |@@@@@@@@@@@@@@@@@                        7710
            4096 |@@@@@@@@@@@@@                            5773
            8192 |@@@@@                                    2231
           16384 |@                                        446
           32768 |                                         186
           65536 |                                         26
          131072 |                                         7
          262144 |                                         0
As seen previously with mysqld_pid_fslatency.d, file system reads are extremely fast - most likely returning out of DRAM. The slowest seen above reached only the 131 to 262 microsecond range (less than 0.3 ms).
Tracing syscalls has been made dramatically easier with the introduction of the fds[] array, which allows file descriptor numbers to be converted into descriptive details, such as the file system type. The array is indexed by file descriptor number, which for the read() syscall is the first argument: read(fd, buf, size). Here the fi_fs (file system) member is checked on line 4, to only match reads to ZFS:
 1  #!/usr/sbin/dtrace -s
 2
 3  syscall::read:entry
 4  /fds[arg0].fi_fs == "zfs"/
 5  {
 6          self->start = timestamp;
 7  }
 8
 9  syscall::read:return
10  /self->start/
11  {
12          @[execname, "(ns):"] = quantize(timestamp - self->start);
13          self->start = 0;
14  }
This script can be modified to include other syscall types, and other file systems. See fsrwtime.d from the DTrace book for a version that matches more syscall types, and prints latency by file system, operation and mount point.
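As a minimal sketch of such a modification (this is not the book's fsrwtime.d), the script above could be extended to write() and keyed by file system type and mount point; the ZFS/UFS match here is just an example:

#!/usr/sbin/dtrace -s

syscall::read:entry, syscall::write:entry
/fds[arg0].fi_fs == "zfs" || fds[arg0].fi_fs == "ufs"/
{
        self->start = timestamp;
        self->fs = fds[arg0].fi_fs;
        self->mount = fds[arg0].fi_mount;
}

syscall::read:return, syscall::write:return
/self->start/
{
        /* key the latency distribution by fs type, mount point and syscall */
        @[self->fs, self->mount, probefunc, "(ns):"] =
            quantize(timestamp - self->start);
        self->start = 0;
        self->fs = 0;
        self->mount = 0;
}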
Syscall analysis with DTrace is easy and effective.
When doing amazing things by tracing the internals of an application, it can be easy to forget that syscall tracing may be good enough - and a lot simpler. For this reason we put it early in the Strategy section of the File System chapter of the DTrace book.
Drawbacks of the syscall approach are:
- Can't currently execute in a Solaris zone or Joyent SmartMachine (only because the fds[] array isn't currently available; the syscall provider does work in those environments).
- No query context. Expressing file system latency as a portion of query latency (as was done with mysqld_pid_fslatency_slowlog.d) isn't possible. (Unless this is inferred from syscall activity, such as via socket related syscalls; it might be possible - I haven't tried yet.)
Stack Fishing
Another use of the syscall provider is to investigate how applications are using the file system - the calling stack trace. This approach is how I initially found the functions traced earlier from the mysql source - my_read(), my_write(), etc. - which took me to the right place to start reading the code. You can also try this approach if the mysqld_pid_fslatency.d script from part 3 fails:
# ./mysqld_pid_fslatency.d -p 16060
dtrace: failed to compile script ./mysqld_pid_fslatency.d: line 23: probe description
    pid16060::os_file_read:entry does not match any probes
First, make sure that the PID really is mysqld. Then, you can use "stack fishing" to find out what is being called instead of os_file_read() (in that case).
This one-liner demonstrates the approach, frequency counting the syscall type and user stack frames for the given process calling into the ZFS file system:
# dtrace -x ustackframes=100 -n 'syscall::*read:entry, syscall::*write:entry
    /pid == $target && fds[arg0].fi_fs == "zfs"/ { @[probefunc, ustack()] = count(); }' -p 29952
dtrace: description 'syscall::*read:entry,syscall::*write:entry ' matched 4 probes
^C

  pread
              libc.so.1`__pread+0xa
              mysqld`os_file_pread+0x8e
              mysqld`os_file_read+0x3b
              mysqld`fil_io+0x2b0
              mysqld`buf_read_page_low+0x14e
              mysqld`buf_read_page+0x81
              mysqld`buf_page_get_gen+0x143
              mysqld`fsp_reserve_free_extents+0x6d
              mysqld`btr_cur_pessimistic_delete+0x96
              mysqld`row_purge_remove_sec_if_poss_low+0x31c
              mysqld`row_purge_step+0x8e1
              mysqld`que_run_threads+0x7c6
              mysqld`trx_purge+0x3cb
              mysqld`srv_master_thread+0x99d
              libc.so.1`_thrp_setup+0x8d
              libc.so.1`_lwp_start
                1
[...output truncated...]

  pwrite
              libc.so.1`__pwrite+0xa
              mysqld`os_file_write+0x97
              mysqld`fil_io+0x2b0
              mysqld`log_group_write_buf+0x34f
              mysqld`log_write_up_to+0x566
              mysqld`trx_commit_off_kernel+0x72f
              mysqld`trx_commit_for_mysql+0x9f
              mysqld`_Z15innobase_commitP3THDb+0x116
              mysqld`_Z19ha_commit_one_phaseP3THDb+0x95
              mysqld`_Z15ha_commit_transP3THDb+0x136
              mysqld`_Z9end_transP3THD25enum_mysql_completiontype+0x191
              mysqld`_Z21mysql_execute_commandP3THD+0x2172
              mysqld`_Z11mysql_parseP3THDPKcjPS2_+0x116
              mysqld`_Z16dispatch_command19enum_server_commandP3THDPcj+0xfc1
              mysqld`_Z10do_commandP3THD+0xb8
              mysqld`handle_one_connection+0x7f7
              libc.so.1`_thrp_setup+0x8d
              libc.so.1`_lwp_start
              904

  write
              libc.so.1`__write+0xa
              mysqld`my_write+0x3e
              mysqld`my_b_flush_io_cache+0xdd
              mysqld`_ZN9MYSQL_LOG14flush_and_syncEv+0x2a
              mysqld`_ZN9MYSQL_LOG5writeEP3THDP11st_io_cacheP9Log_event+0x209
              mysqld`_Z16binlog_end_transP3THDP11st_io_cacheP9Log_event+0x25
              mysqld`_ZN9MYSQL_LOG7log_xidEP3THDy+0x51
              mysqld`_Z15ha_commit_transP3THDb+0x24a
              mysqld`_Z9end_transP3THD25enum_mysql_completiontype+0x191
              mysqld`_Z21mysql_execute_commandP3THD+0x2172
              mysqld`_Z11mysql_parseP3THDPKcjPS2_+0x116
              mysqld`_Z16dispatch_command19enum_server_commandP3THDPcj+0xfc1
              mysqld`_Z10do_commandP3THD+0xb8
              mysqld`handle_one_connection+0x7f7
              libc.so.1`_thrp_setup+0x8d
              libc.so.1`_lwp_start
              923

  read
              libc.so.1`__read+0xa
              mysqld`my_read+0x4a
              mysqld`_my_b_read+0x17d
              mysqld`_ZN9Log_event14read_log_eventEP11st_io_cacheP6StringP14_pthread_mutex+0xf4
              mysqld`_Z17mysql_binlog_sendP3THDPcyt+0x5dc
              mysqld`_Z16dispatch_command19enum_server_commandP3THDPcj+0xc09
              mysqld`_Z10do_commandP3THD+0xb8
              mysqld`handle_one_connection+0x7f7
              libc.so.1`_thrp_setup+0x8d
              libc.so.1`_lwp_start
             1496

  read
              libc.so.1`__read+0xa
              mysqld`my_read+0x4a
              mysqld`_my_b_read+0x17d
              mysqld`_ZN9Log_event14read_log_eventEP11st_io_cacheP6StringP14_pthread_mutex+0xf4
              mysqld`_Z17mysql_binlog_sendP3THDPcyt+0x35e
              mysqld`_Z16dispatch_command19enum_server_commandP3THDPcj+0xc09
              mysqld`_Z10do_commandP3THD+0xb8
              mysqld`handle_one_connection+0x7f7
              libc.so.1`_thrp_setup+0x8d
              libc.so.1`_lwp_start
             2939
The DTrace scripts shown earlier take the file system functions (as seen in the above stack traces) and measure latency. There are a lot more functions that DTrace can also inspect (any of the lines above), along with the function entry arguments and return values.
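For example, here is a sketch of timing one of them - fil_io(), picked arbitrarily from the stacks above - with the pid provider (with the same caveats about pid tracing overhead from part 3):

# dtrace -n 'pid$target::fil_io:entry { self->ts = timestamp; }
    pid$target::fil_io:return /self->ts/ {
        @["fil_io (ns)"] = quantize(timestamp - self->ts); self->ts = 0; }' -p 29952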
Stack traces show functions that can be individually traced with DTrace.
Note that this one-liner includes all file system I/O, not just I/O during a query. The very first stack trace looks like an asynchronous database thread (srv_master_thread() -> trx_purge()), whereas all the rest look like they are indeed during a query (handle_one_connection() -> do_command()). The number at the bottom of each stack shows how many times that entire stack trace led to the syscall being called during tracing (I let it run for several seconds).
VFS Tracing
Apart from the application itself and system calls, DTrace can also drill down into the kernel. The first location of interest is the Virtual File System abstraction layer, which is an interface that file systems are called from. Such a common interface is good fodder for DTracing. There isn't a vfs provider for DTrace (at least, one that exposes the latency of events), and so to do this we can currently use the fbt provider to trace kernel internals.
Introducing VFS into our I/O stack:
           +---------------+
           |  Application  |
           |    mysqld     |
           +---------------+
              reads ^ | writes
                    | |
                    | |             user-land
          -------------system calls---------------------------
                    | |             kernel
              +-------------------+
HERE -->      |        VFS        |
  +-----------|                   |------------+
  | File      +-------------------+            |
  | Systems      / /          \ \              |
  |             / /            \ \             |
  |     +------------+    +------------+       |
  |     |    ZFS     |    |    UFS     |       |
  |     |            |    |            |       |
  |     |      |     |    |      |     | ...   |
  |     |      V     |    |      V     |       |
  |     |/\/\/\/\/\/\|    |/\/\/\/\/\/\|       |
 [...]
Advantages of tracing at the VFS level include:
- All file system types can be traced from one place.
- The VFS interface functions are more "stable" than other parts of the kernel.
- Kernel context is available, including more information than fds[] makes available.
You can find examples of VFS tracing in Chapter 5 of the DTrace book, which can be downloaded as a sample chapter (PDF). Here is an example, solvfssnoop.d, which traces all VFS I/O on Solaris:
# ./solvfssnoop.d -n 'tick-10ms { exit(0); }'
TIME(ms)       UID    PID PROCESS      CALL       KB  PATH
18844835237    104  29952 mysqld       fop_read    0  <null>
18844835237    104  29952 mysqld       fop_write   0  <null>
18844835238      0  22703 sshd         fop_read   16  /devices/pseudo/clone@0:ptm
18844835237    104  29008 mysqld       fop_write  16  /z01/opt/mysql5-64/data/xxxxx/xxxxx.ibd
18844835237    104  29008 mysqld       fop_write  32  /z01/opt/mysql5-64/data/xxxxx/xxxxx.ibd
18844835237    104  29008 mysqld       fop_write  48  /z01/opt/mysql5-64/data/xxxxx/xxxxx.ibd
18844835237    104  29008 mysqld       fop_write  16  /z01/opt/mysql5-64/data/xxxxx/xxxxx.ibd
18844835237    104  29008 mysqld       fop_write  16  /z01/opt/mysql5-64/data/xxxxx/xxxxx.ibd
18844835237    104  29008 mysqld       fop_write  32  /z01/opt/mysql5-64/data/xxxxx/xxxxx.ibd
I've had to redact the filename info (replaced portions with "xxxxx"), but you should still get the picture. This has all the useful details except for the latency, which can be added to the script by tracing the "return" probe as well as the "entry" probes, and comparing timestamps (similar to how the syscalls were traced earlier).
Since VFS I/O can be very frequent (thousands of I/O per second), I added an action to exit after 10 milliseconds when I invoked the script above. The script also accepts a process name as an argument, e.g., "mysqld", to trace only VFS I/O from mysqld processes.
VFS Latency
To demonstrate fetching latency info, here are VFS reads on Solaris traced via fop_read():
# dtrace -n 'fbt::fop_read:entry { self->start = timestamp; }
    fbt::fop_read:return /self->start/ { @[execname, "ns"] =
    quantize(timestamp - self->start); self->start = 0; }'
dtrace: description 'fbt::fop_read:entry ' matched 2 probes
^C
[...]
  mysqld                                              ns
           value  ------------- Distribution ------------- count
             512 |                                         0
            1024 |@@                                       725
            2048 |@@@@@@@@@@@@@@@@                         5928
            4096 |@@@@@@@@@                                3319
            8192 |@@                                       708
           16384 |                                         80
           32768 |                                         17
           65536 |                                         130
          131072 |@                                        532
          262144 |@                                        492
          524288 |@                                        489
         1048576 |@@                                       862
         2097152 |@@@                                      955
         4194304 |@@                                       602
         8388608 |@                                        271
        16777216 |                                         102
        33554432 |                                         27
        67108864 |                                         14
       134217728 |                                         2
       268435456 |                                         0
Yikes! Wasn't this system running with a 99.9% cache hit rate earlier? The second group in the distribution shows VFS reads taking between 1 and 8 ms, which sounds a lot like disk I/O cache misses. They aren't - and this is a disadvantage of tracing at the VFS level: it includes other things that use the VFS interface but aren't really file systems, such as socket I/O.
Filtering just for ZFS:
# dtrace -n 'fbt::fop_read:entry /args[0]->v_op->vnop_name == "zfs"/ {
    self->start = timestamp; } fbt::fop_read:return /self->start/ {
    @[execname, "ns"] = quantize(timestamp - self->start); self->start = 0; }'
dtrace: description 'fbt::fop_read:entry ' matched 2 probes
^C
[...]
  mysqld                                              ns
           value  ------------- Distribution ------------- count
             512 |                                         0
            1024 |@@@@@@@                                  931
            2048 |@@@@@@@@                                 1149
            4096 |@@@@@@@                                  992
            8192 |@@@@@@@@@@@@@@@@                         2266
           16384 |@@                                       320
           32768 |                                         20
           65536 |                                         2
          131072 |                                         0
That's better.
Drawbacks of VFS tracing:
- It can include other kernel components that use VFS, such as sockets.
- Application context not available.
- It is not possible to do this from a Solaris zone or Joyent SmartMachine environment via direct use of the fbt provider. The fbt provider allows inspection of kernel internals, which has the potential to share privileged data between zones, and so is unlikely to ever be available from within a zone. (But indirect use via a secure means may be possible - more on this in part 5).
- The fbt provider is considered an "unstable" interface, since it exposes thousands of raw kernel functions. Any scripts written to use it may stop working on kernel updates, should the kernel engineer rename or modify functions you are tracing.
File System Tracing
The second location in the kernel to consider is the file system itself: ZFS, UFS, etc. This exposes file system specific characteristics, which can be the origin of many file system latency issues. DTrace can examine these using the fbt provider.
File systems in the I/O stack:
           +---------------+
           |  Application  |
           |    mysqld     |
           +---------------+
              reads ^ | writes
                    | |
                    | |             user-land
          -------------system calls---------------------------
                    | |             kernel
              +-------------------+
              |        VFS        |
  +-----------|                   |------------+
  | File      +-------------------+            |
  | Systems      / /          \ \              |
  |             / /            \ \             |
  |     +------------+    +------------+       |
HERE -->|    ZFS     |    |    UFS     |       |
  |     |            |    |            |       |
  |     |      |     |    |      |     | ...   |
  |     |      V     |    |      V     |       |
  |     |/\/\/\/\/\/\|    |/\/\/\/\/\/\|       |
 [...]
Advantages for tracing at the file system:
- File system specific behaviour can be examined.
- Kernel context is available, including more information than fds[] makes available.
After the VFS scripts, there are examples of file system tracing scripts in Chapter 5 of the DTrace book. One of my favourites is zfsslower.d, which takes a millisecond argument and shows any I/O that was slower than that time:
# ./zfsslower.d 10
TIME                    PROCESS      D   KB   ms  FILE
2011 May 17 01:23:12    mysqld       R   16   19  /z01/opt/mysql5-64/data/xxxxx/xxxxx.ibd
2011 May 17 01:23:13    mysqld       W   16   10  /z01/var/mysql/xxxxx/xxxxx.ibd
2011 May 17 01:23:33    mysqld       W   16   11  /z01/var/mysql/xxxxx/xxxxx.ibd
2011 May 17 01:23:33    mysqld       W   16   10  /z01/var/mysql/xxxxx/xxxxx.ibd
2011 May 17 01:23:51    httpd        R   56   14  /z01/home/xxxxx/xxxxx/xxxxx/xxxxx/xxxxx
^C
Again, I've redacted the filename info (replaced portions with "xxxxx"), but the output should still make sense. This traces the POSIX requests of the ZFS file system, via functions including zfs_read() and zfs_write(), and shows details - including latency - for any I/O slower than the specified time.
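For a flavour of how such a script works (this is not the book's zfsslower.d, just a stripped-down sketch that assumes the zfs_read()/zfs_write() function names and skips the filename lookup), the core logic is an entry/return pair with a threshold test:

#!/usr/sbin/dtrace -s
/* sketch: print ZFS reads/writes slower than $1 milliseconds (no filename) */
#pragma D option quiet
#pragma D option defaultargs

fbt::zfs_read:entry, fbt::zfs_write:entry
{
        self->start = timestamp;
}

fbt::zfs_read:return, fbt::zfs_write:return
/self->start && (timestamp - self->start) >= $1 * 1000000/
{
        printf("%-20Y %-16s %s %6d ms\n", walltimestamp, execname,
            probefunc == "zfs_read" ? "R" : "W",
            (timestamp - self->start) / 1000000);
}

fbt::zfs_read:return, fbt::zfs_write:return
/self->start/
{
        self->start = 0;
}

The book's version also extracts the file path and I/O size, which means walking kernel structures - part of what makes fbt-based scripts fragile across kernel builds.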
Drawbacks of File System tracing:
- Application context not available.
- It is not possible to do this from a Solaris zone or Joyent SmartMachine environment via direct use of the fbt provider. The fbt provider allows inspection of kernel internals, which has the potential to share privileged data between zones, and so is unlikely to ever be available from within a zone. (But indirect use via a secure means may be possible - more on this in part 5).
- The fbt provider is considered an "unstable" interface, since it exposes thousands of raw kernel functions. Any scripts written to use it may stop working on kernel updates, should the kernel engineer rename or modify functions you are tracing.
- File Systems get complex.
ZFS
The zfsslower.d script mentioned above only traces requests to ZFS. DTrace can continue drilling and expose all of the internals of ZFS, pinpointing file system induced latency. This could be the subject of a separate series of blog posts, including:
- Lock contention latency
- ZIO pipeline latency
- Compression latency
- Allocation latency
- vdev queue latency
- and lots more...
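To give a flavour of just the first item: kernel lock contention suffered by mysqld threads (in ZFS or elsewhere) can be summarized with the lockstat provider, whose adaptive-block probe fires with the blocked time in nanoseconds as arg1. A rough sketch (the five-frame stack depth is arbitrary):

# dtrace -n 'lockstat:::adaptive-block /execname == "mysqld"/
    { @[stack(5)] = sum(arg1); }'

Contended ZFS locks would show up in the output by their kernel call path.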
You may be able to skip this part if the latency can be traced at a lower level than the file system - i.e., originating from the disk subsystem and being passed up the stack. That can be a practical approach: digging into file system internals can be very time consuming, and isn't necessary for every issue.
Lower Level
By tracing down to the disk device, you can identify exactly where the latency is originating. The full kernel stack would include (showing Solaris ZFS in this example):
-------------------system calls---------------------------
                 | |                  kernel
           +-------------------+
+----------|        VFS        |------------+
| File     +-------------------+            |
| Systems        | |                        |
|     +-----------------------+             |
|     |          ZFS          |             |
|     |          | |          |             |
|     |     +---------+       |             |
|     |     |   ZIO   |       |             |
|     |     +---------+       |             |
|     |     |  vdevs  |       |             |
|     +-----+---------+-------+             |
|                | |                        |
+--------------------------------------------+
                 | |
         +----------------+
         |  block device  |
         +----------------+
                 | |
           +------------+
           |    SCSI    |
           +------------+
                 | |
           +------------+
           |    SAS     |
           +------------+
                 | |
                 V |
            .-----------.
            `-----------'
            |    disk   |
            `-----------'
Latency at each of these layers can be traced: VFS, ZFS (including ZIO pipeline and vdevs), block device (sd), SCSI and SAS. If the latency is originating from any of these locations, you can identify it by comparing between the layers.
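As a small sketch of that kind of comparison, here are just two of those layers traced together: ZFS reads at the top via the unstable fbt provider (assuming the zfs_read() function name), and block device I/O at the bottom via the stable io provider:

#!/usr/sbin/dtrace -s
/* sketch: compare ZFS read latency with block device I/O latency */

fbt::zfs_read:entry
{
        self->start = timestamp;
}

fbt::zfs_read:return
/self->start/
{
        @["zfs_read (ns)"] = quantize(timestamp - self->start);
        self->start = 0;
}

/* block device I/O: key the start time by the buf pointer (arg0) */
io:::start
{
        start_ts[arg0] = timestamp;
}

io:::done
/start_ts[arg0]/
{
        @["block device I/O (ns)"] = quantize(timestamp - start_ts[arg0]);
        start_ts[arg0] = 0;
}

If slow block device I/O also appears in the zfs_read() distribution, the disks are hurting the application; if not, the latency is either elsewhere or asynchronous to the application.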
To show what this can look like, here is an experimental script that shows latency from multiple layers at the same time - for comparison:
# ./zfsstacklatency.d
dtrace: script './zfsstacklatency.d' matched 25 probes
^C
CPU     ID                    FUNCTION:NAME
 15      2                             :END

  zfs_read time (ns)
           value  ------------- Distribution ------------- count
             512 |                                         0
            1024 |@@@@                                     424
            2048 |@@@@@@@@                                 768
            4096 |@@@@                                     375
            8192 |@@@@@@@@@@@@@@@@                         1548
           16384 |@@@@@@@@                                 763
           32768 |                                         35
           65536 |                                         4
          131072 |                                         12
          262144 |                                         1
          524288 |                                         0

  zfs_write time (ns)
           value  ------------- Distribution ------------- count
            2048 |                                         0
            4096 |@@@                                      718
            8192 |@@@@@@@@@@@@@@@@@@@                      5152
           16384 |@@@@@@@@@@@@@@@                          4085
           32768 |@@@                                      731
           65536 |@                                        137
          131072 |                                         23
          262144 |                                         3
          524288 |                                         0

  zio_wait time (ns)
           value  ------------- Distribution ------------- count
             512 |                                         0
            1024 |@@@@@@@@@@@@@                            6188
            2048 |@@@@@@@@@@@@@@@@@@@@@@@                  11459
            4096 |@@@@                                     2026
            8192 |                                         60
           16384 |                                         37
           32768 |                                         8
           65536 |                                         2
          131072 |                                         0
          262144 |                                         0
          524288 |                                         1
         1048576 |                                         0
         2097152 |                                         0
         4194304 |                                         0
         8388608 |                                         0
        16777216 |                                         0
        33554432 |                                         0
        67108864 |                                         0
       134217728 |                                         0
       268435456 |                                         1
       536870912 |                                         0

  zio_vdev_io_done time (ns)
           value  ------------- Distribution ------------- count
            2048 |                                         0
            4096 |@                                        8
            8192 |@@@@                                     56
           16384 |@                                        17
           32768 |@                                        13
           65536 |                                         2
          131072 |@@                                       24
          262144 |@@                                       23
          524288 |@@@                                      44
         1048576 |@@@                                      38
         2097152 |                                         1
         4194304 |                                         4
         8388608 |                                         4
        16777216 |                                         4
        33554432 |@@@                                      43
        67108864 |@@@@@@@@@@@@@@@@@@@@@                    315
       134217728 |                                         0
       268435456 |                                         2
       536870912 |                                         0

  vdev_disk_io_done time (ns)
           value  ------------- Distribution ------------- count
           65536 |                                         0
          131072 |@                                        12
          262144 |@@                                       26
          524288 |@@@@                                     47
         1048576 |@@@                                      40
         2097152 |                                         1
         4194304 |                                         4
         8388608 |                                         4
        16777216 |                                         4
        33554432 |@@@                                      43
        67108864 |@@@@@@@@@@@@@@@@@@@@@@@@@                315
       134217728 |                                         0
       268435456 |                                         2
       536870912 |                                         0

  io:::start time (ns)
           value  ------------- Distribution ------------- count
           32768 |                                         0
           65536 |                                         3
          131072 |@@                                       19
          262144 |@@                                       21
          524288 |@@@@                                     45
         1048576 |@@@                                      38
         2097152 |                                         0
         4194304 |                                         4
         8388608 |                                         4
        16777216 |                                         4
        33554432 |@@@                                      43
        67108864 |@@@@@@@@@@@@@@@@@@@@@@@@@                315
       134217728 |                                         0
       268435456 |                                         2
       536870912 |                                         0

  scsi time (ns)
           value  ------------- Distribution ------------- count
           16384 |                                         0
           32768 |                                         2
           65536 |                                         3
          131072 |@                                        18
          262144 |@@                                       20
          524288 |@@@@                                     46
         1048576 |@@@                                      37
         2097152 |                                         0
         4194304 |                                         4
         8388608 |                                         4
        16777216 |                                         4
        33554432 |@@@                                      43
        67108864 |@@@@@@@@@@@@@@@@@@@@@@@@@                315
       134217728 |                                         0
       268435456 |                                         2
       536870912 |                                         0

  mega_sas time (ns)
           value  ------------- Distribution ------------- count
           16384 |                                         0
           32768 |                                         2
           65536 |                                         5
          131072 |@@                                       20
          262144 |@                                        16
          524288 |@@@@                                     50
         1048576 |@@@                                      33
         2097152 |                                         0
         4194304 |                                         4
         8388608 |                                         4
        16777216 |                                         4
        33554432 |@@@                                      43
        67108864 |@@@@@@@@@@@@@@@@@@@@@@@@@                315
       134217728 |                                         0
       268435456 |                                         2
       536870912 |                                         0
mega_sas is the SAS disk device driver - which shows the true latency of the disk I/O (about as deep as the operating system can go). The first distribution printed was for zfs_read() latency, which is the read requests to ZFS.
It's hugely valuable to be able to pluck this sort of latency data out from different layers of the operating system stack, to narrow down the source of the latency. Comparing all I/O in this way can also quickly identify the origin of outliers (a few I/O with high latency), which can be hit-or-miss if individual I/O are picked and traced as they execute through the kernel.
Latency at different levels of the OS stack can be examined and compared to identify the origin.
The spike of slow disk I/O seen in the mega_sas distribution (315 I/O with a latency between 67 and 134 ms), which is likely due to queueing on the disk, propagates up the stack to a point and then vanishes. That latency is not visible in the zfs_read() and zfs_write() interfaces, meaning, no application was affected by that latency (at least via read/write). The spike corresponded to a ZFS TXG flush - which is asynchronous to the application, and queues a bunch of I/O to the disks. If that spike were to propagate all the way up into zfs_read()/zfs_write(), then this output would have identified the origin - the disks.
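One way to check that theory is to time the TXG syncs themselves and see whether they line up with the disk I/O spike. A rough sketch, assuming spa_sync() is the TXG sync entry point on this ZFS version (traced via the unstable fbt provider):

# dtrace -n 'fbt::spa_sync:entry { self->start = timestamp; }
    fbt::spa_sync:return /self->start/ {
        printf("%Y TXG sync: %d ms", walltimestamp,
        (timestamp - self->start) / 1000000); self->start = 0; }'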
zfsstacklatency.d
I wrote zfsstacklatency.d as a demonstration-only script, to show what is technically possible. The script breaks a rule I learned the hard way: keep it simple. zfsstacklatency.d is not simple: it traces at multiple stack layers using the unstable fbt provider and is over 100 lines long. This makes it brittle and unlikely to run on kernel builds other than the system I wrote it on (there is little point including it here, since it almost certainly won't run for you). To trace at these layers, it can be more reliable to run small scripts that trace individual layers separately, and to maintain those individual scripts if and when they break on newer kernel versions. Chapter 4 of the DTrace book does this via scripts such as scsilatency.d, satalatency.d, mptlatency.d, etc.
Comparing File System Latency
By examining latency at different levels of the I/O stack, its origin can be identified. DTrace provides the ability to trace latency from the application right down to the disk device driver - leaving no stone unturned. This can be especially useful for the cases where latency is not caused by the disks, but by other issues in the kernel.
In the next and final post in this series, I'll show other useful presentations of file system latency as a metric.