I originally posted this at http://dtrace.org/blogs/brendan/2011/05/24/file-system-latency-part-4.
This is part 4 on file system latency, a series on storage I/O performance from the application perspective (see part 1, part 2 and part 3). In the previous post I showed how to trace file system latency from within MySQL using the pid provider. In this post I'll show how similar data can be retrieved using the DTrace syscall and fbt providers. These allow us to trace at the system call layer, and deeper in the kernel at both the Virtual File System (VFS) interface and the specific file system itself.
Syscall Tracing
By tracing at the system call layer, the file system can be traced system-wide - examining all applications simultaneously (no "-p PID"). DTrace has the "syscall" provider for tracing system calls:
           +---------------+
           |  Application  |
           |    mysqld     |
           +---------------+
              reads ^ | writes
                    | |             user-land
HERE -->  ---------system calls---------------------------
                    | |             kernel
           +--------------------------------+
           |           File System          |
           |                                |
           |                |               |
           |                V               |
           |/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\|
          [...]
Another advantage is that the syscalls are well understood and documented in the man pages. They are also much less likely to change than the mysql functions traced earlier (I would have said they never change - until I saw the exec() family of syscalls change slightly during the Solaris 10 releases).
syscall-read-zfs.d
To demonstrate syscall tracing, this DTrace script shows the latency of read()s to ZFS by application name:
# ./syscall-read-zfs.d
dtrace: script './syscall-read-zfs.d' matched 2 probes
^C

  httpd                                               (ns):
           value  ------------- Distribution ------------- count
             512 |                                         0
            1024 |@@@@@@                                   1072
            2048 |@@@@@@@                                  1276
            4096 |@@@@@                                    890
            8192 |@@@@@@@@@@@@@@@@@@@@                     3520
           16384 |@                                        152
           32768 |                                         10
           65536 |                                         2
          131072 |                                         0

  mysqld                                              (ns):
           value  ------------- Distribution ------------- count
             512 |                                         0
            1024 |@@@                                      1268
            2048 |@@@@@@@@@@@@@@@@@                        7710
            4096 |@@@@@@@@@@@@@                            5773
            8192 |@@@@@                                    2231
           16384 |@                                        446
           32768 |                                         186
           65536 |                                         26
          131072 |                                         7
          262144 |                                         0
As seen previously with mysqld_pid_fslatency.d, file system reads are extremely fast - most likely returning out of DRAM. The slowest seen above reached only the 131 to 262 microsecond range (less than 0.3 ms).
Tracing syscalls has been made dramatically easier with the introduction of the fds[] array, which allows file descriptor numbers to be converted into descriptive details, such as the file system type. The array is indexed by file descriptor number, which for the read() syscall is the first argument: read(fd, buf, size). Here the fi_fs (file system) member is checked on line 4, to only match reads to ZFS:
 1  #!/usr/sbin/dtrace -s
 2
 3  syscall::read:entry
 4  /fds[arg0].fi_fs == "zfs"/
 5  {
 6          self->start = timestamp;
 7  }
 8
 9  syscall::read:return
10  /self->start/
11  {
12          @[execname, "(ns):"] = quantize(timestamp - self->start);
13          self->start = 0;
14  }
This script can be modified to include other syscall types, and other file systems. See fsrwtime.d from the DTrace book for a version that matches more syscall types, and prints latency by file system, operation and mount point.
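As a minimal sketch of such a modification (this is not the book's fsrwtime.d), the script above could be extended to write() and keyed by file system type and mount point; the ZFS/UFS match here is just an example:

#!/usr/sbin/dtrace -s

syscall::read:entry, syscall::write:entry
/fds[arg0].fi_fs == "zfs" || fds[arg0].fi_fs == "ufs"/
{
        self->start = timestamp;
        self->fs = fds[arg0].fi_fs;
        self->mount = fds[arg0].fi_mount;
}

syscall::read:return, syscall::write:return
/self->start/
{
        /* key the latency distribution by fs type, mount point and syscall */
        @[self->fs, self->mount, probefunc, "(ns):"] =
            quantize(timestamp - self->start);
        self->start = 0;
        self->fs = 0;
        self->mount = 0;
}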
Syscall analysis with DTrace is easy and effective.
When doing amazing things by tracing the internals of an application, it can be easy to forget that syscall tracing may be good enough - and a lot simpler. For this reason we put it early in the Strategy section of the File System chapter of the DTrace book.
Drawbacks of the syscall approach are:
- Can't currently execute in a Solaris zone or Joyent SmartMachine (only because the fds[] array isn't currently available; the syscall provider does work in those environments).
- No query context. Expressing file system latency as a portion of query latency (as was done with mysqld_pid_fslatency_slowlog.d) isn't possible. (Unless this is inferred from syscall activity, such as via socket related syscalls; it might be possible - I haven't tried yet.)
Stack Fishing
Another use of the syscall provider is to investigate how applications are using the file system - the calling stack trace. This approach is how I initially found the functions traced earlier from the mysql source - my_read(), my_write(), etc. - which took me to the right place to start reading the code. You can also try this approach if the mysqld_pid_fslatency.d script from part 3 fails:
# ./mysqld_pid_fslatency.d -p 16060
dtrace: failed to compile script ./mysqld_pid_fslatency.d: line 23: probe description
    pid16060::os_file_read:entry does not match any probes
First, make sure that the PID really is mysqld. Then, you can use "stack fishing" to find out what is being called instead of os_file_read() (in that case).
This one-liner demonstrates the approach, frequency counting the syscall type and user stack frames for the given process calling into the ZFS file system:
# dtrace -x ustackframes=100 -n 'syscall::*read:entry, syscall::*write:entry
    /pid == $target && fds[arg0].fi_fs == "zfs"/ { @[probefunc, ustack()] = count(); }' -p 29952
dtrace: description 'syscall::*read:entry,syscall::*write:entry ' matched 4 probes
^C

  pread
              libc.so.1`__pread+0xa
              mysqld`os_file_pread+0x8e
              mysqld`os_file_read+0x3b
              mysqld`fil_io+0x2b0
              mysqld`buf_read_page_low+0x14e
              mysqld`buf_read_page+0x81
              mysqld`buf_page_get_gen+0x143
              mysqld`fsp_reserve_free_extents+0x6d
              mysqld`btr_cur_pessimistic_delete+0x96
              mysqld`row_purge_remove_sec_if_poss_low+0x31c
              mysqld`row_purge_step+0x8e1
              mysqld`que_run_threads+0x7c6
              mysqld`trx_purge+0x3cb
              mysqld`srv_master_thread+0x99d
              libc.so.1`_thrp_setup+0x8d
              libc.so.1`_lwp_start
                1
[...output truncated...]

  pwrite
              libc.so.1`__pwrite+0xa
              mysqld`os_file_write+0x97
              mysqld`fil_io+0x2b0
              mysqld`log_group_write_buf+0x34f
              mysqld`log_write_up_to+0x566
              mysqld`trx_commit_off_kernel+0x72f
              mysqld`trx_commit_for_mysql+0x9f
              mysqld`_Z15innobase_commitP3THDb+0x116
              mysqld`_Z19ha_commit_one_phaseP3THDb+0x95
              mysqld`_Z15ha_commit_transP3THDb+0x136
              mysqld`_Z9end_transP3THD25enum_mysql_completiontype+0x191
              mysqld`_Z21mysql_execute_commandP3THD+0x2172
              mysqld`_Z11mysql_parseP3THDPKcjPS2_+0x116
              mysqld`_Z16dispatch_command19enum_server_commandP3THDPcj+0xfc1
              mysqld`_Z10do_commandP3THD+0xb8
              mysqld`handle_one_connection+0x7f7
              libc.so.1`_thrp_setup+0x8d
              libc.so.1`_lwp_start
              904

  write
              libc.so.1`__write+0xa
              mysqld`my_write+0x3e
              mysqld`my_b_flush_io_cache+0xdd
              mysqld`_ZN9MYSQL_LOG14flush_and_syncEv+0x2a
              mysqld`_ZN9MYSQL_LOG5writeEP3THDP11st_io_cacheP9Log_event+0x209
              mysqld`_Z16binlog_end_transP3THDP11st_io_cacheP9Log_event+0x25
              mysqld`_ZN9MYSQL_LOG7log_xidEP3THDy+0x51
              mysqld`_Z15ha_commit_transP3THDb+0x24a
              mysqld`_Z9end_transP3THD25enum_mysql_completiontype+0x191
              mysqld`_Z21mysql_execute_commandP3THD+0x2172
              mysqld`_Z11mysql_parseP3THDPKcjPS2_+0x116
              mysqld`_Z16dispatch_command19enum_server_commandP3THDPcj+0xfc1
              mysqld`_Z10do_commandP3THD+0xb8
              mysqld`handle_one_connection+0x7f7
              libc.so.1`_thrp_setup+0x8d
              libc.so.1`_lwp_start
              923

  read
              libc.so.1`__read+0xa
              mysqld`my_read+0x4a
              mysqld`_my_b_read+0x17d
              mysqld`_ZN9Log_event14read_log_eventEP11st_io_cacheP6StringP14_pthread_mutex+0xf4
              mysqld`_Z17mysql_binlog_sendP3THDPcyt+0x5dc
              mysqld`_Z16dispatch_command19enum_server_commandP3THDPcj+0xc09
              mysqld`_Z10do_commandP3THD+0xb8
              mysqld`handle_one_connection+0x7f7
              libc.so.1`_thrp_setup+0x8d
              libc.so.1`_lwp_start
             1496

  read
              libc.so.1`__read+0xa
              mysqld`my_read+0x4a
              mysqld`_my_b_read+0x17d
              mysqld`_ZN9Log_event14read_log_eventEP11st_io_cacheP6StringP14_pthread_mutex+0xf4
              mysqld`_Z17mysql_binlog_sendP3THDPcyt+0x35e
              mysqld`_Z16dispatch_command19enum_server_commandP3THDPcj+0xc09
              mysqld`_Z10do_commandP3THD+0xb8
              mysqld`handle_one_connection+0x7f7
              libc.so.1`_thrp_setup+0x8d
              libc.so.1`_lwp_start
             2939
The DTrace scripts shown earlier take the file system functions (as seen in the above stack traces) and measure latency. There are a lot more functions that DTrace can also inspect (any of the lines above), along with the function entry arguments and return values.
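For example, here is a sketch of timing one of them - fil_io(), picked arbitrarily from the stacks above - with the pid provider (with the same caveats about pid tracing overhead from part 3):

# dtrace -n 'pid$target::fil_io:entry { self->ts = timestamp; }
    pid$target::fil_io:return /self->ts/ {
        @["fil_io (ns)"] = quantize(timestamp - self->ts); self->ts = 0; }' -p 29952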
Stack traces show functions that can be individually traced with DTrace.
Note that this one-liner includes all file system I/O, not just I/O during a query. The very first stack trace looks like an asynchronous database thread (srv_master_thread() -> trx_purge()), whereas all the rest look like they are indeed during a query (handle_one_connection() -> do_command()). The number at the bottom of each stack shows how many times that entire stack trace led to the syscall being called during tracing (I let it run for several seconds).
VFS Tracing
Apart from the application itself and system calls, DTrace can also drill down into the kernel. The first location of interest is the Virtual File System abstraction layer, which is an interface that file systems are called from. Such a common interface is good fodder for DTracing. There isn't a vfs provider for DTrace (at least, one that exposes the latency of events), and so to do this we can currently use the fbt provider to trace kernel internals.
Introducing VFS into our I/O stack:
           +---------------+
           |  Application  |
           |    mysqld     |
           +---------------+
              reads ^ | writes
                    | |
                    | |             user-land
          -------------system calls---------------------------
                    | |             kernel
              +-------------------+
HERE -->      |        VFS        |
  +-----------|                   |------------+
  | File      +-------------------+            |
  | Systems      / /          \ \              |
  |             / /            \ \             |
  |     +------------+    +------------+       |
  |     |    ZFS     |    |    UFS     |       |
  |     |            |    |            |       |
  |     |      |     |    |      |     | ...   |
  |     |      V     |    |      V     |       |
  |     |/\/\/\/\/\/\|    |/\/\/\/\/\/\|       |
 [...]
Advantages of tracing at the VFS level include:
- All file system types can be traced from one place.
- The VFS interface functions are more "stable" than other parts of the kernel.
- Kernel context is available, including more information than fds[] makes available.
You can find examples of VFS tracing in Chapter 5 of the DTrace book, which can be downloaded as a sample chapter (PDF). Here is an example, solvfssnoop.d, which traces all VFS I/O on Solaris:
# ./solvfssnoop.d -n 'tick-10ms { exit(0); }'
TIME(ms)       UID    PID PROCESS      CALL       KB  PATH
18844835237    104  29952 mysqld       fop_read    0  <null>
18844835237    104  29952 mysqld       fop_write   0  <null>
18844835238      0  22703 sshd         fop_read   16  /devices/pseudo/clone@0:ptm
18844835237    104  29008 mysqld       fop_write  16  /z01/opt/mysql5-64/data/xxxxx/xxxxx.ibd
18844835237    104  29008 mysqld       fop_write  32  /z01/opt/mysql5-64/data/xxxxx/xxxxx.ibd
18844835237    104  29008 mysqld       fop_write  48  /z01/opt/mysql5-64/data/xxxxx/xxxxx.ibd
18844835237    104  29008 mysqld       fop_write  16  /z01/opt/mysql5-64/data/xxxxx/xxxxx.ibd
18844835237    104  29008 mysqld       fop_write  16  /z01/opt/mysql5-64/data/xxxxx/xxxxx.ibd
18844835237    104  29008 mysqld       fop_write  32  /z01/opt/mysql5-64/data/xxxxx/xxxxx.ibd
I've had to redact the filename info (replaced portions with "xxxxx"), but you should still get the picture. This has all the useful details except for the latency, which can be added to the script by tracing the "return" probe as well as the "entry" probes, and comparing timestamps (similar to how the syscalls were traced earlier).
Since VFS I/O can be very frequent (thousands of I/O per second), I added an action to exit after 10 milliseconds when I invoked the script above. The script also accepts a process name as an argument, e.g., "mysqld", to trace only VFS I/O from mysqld processes.
VFS Latency
To demonstrate fetching latency info, here are VFS reads on Solaris traced via fop_read():
# dtrace -n 'fbt::fop_read:entry { self->start = timestamp; }
    fbt::fop_read:return /self->start/ { @[execname, "ns"] =
    quantize(timestamp - self->start); self->start = 0; }'
dtrace: description 'fbt::fop_read:entry ' matched 2 probes
^C
[...]
  mysqld                                              ns
           value  ------------- Distribution ------------- count
             512 |                                         0
            1024 |@@                                       725
            2048 |@@@@@@@@@@@@@@@@                         5928
            4096 |@@@@@@@@@                                3319
            8192 |@@                                       708
           16384 |                                         80
           32768 |                                         17
           65536 |                                         130
          131072 |@                                        532
          262144 |@                                        492
          524288 |@                                        489
         1048576 |@@                                       862
         2097152 |@@@                                      955
         4194304 |@@                                       602
         8388608 |@                                        271
        16777216 |                                         102
        33554432 |                                         27
        67108864 |                                         14
       134217728 |                                         2
       268435456 |                                         0
Yikes! Wasn't this system running with a 99.9% cache hit rate earlier? The second group in the distribution shows VFS reads taking between 1 and 8 ms, which sounds a lot like disk I/O cache misses. They aren't - and this is a disadvantage of tracing at the VFS level: it includes other things that use the VFS interface but aren't really file systems, such as socket I/O.
Filtering just for ZFS:
# dtrace -n 'fbt::fop_read:entry /args[0]->v_op->vnop_name == "zfs"/ {
    self->start = timestamp; } fbt::fop_read:return /self->start/ {
    @[execname, "ns"] = quantize(timestamp - self->start); self->start = 0; }'
dtrace: description 'fbt::fop_read:entry ' matched 2 probes
^C
[...]
  mysqld                                              ns
           value  ------------- Distribution ------------- count
             512 |                                         0
            1024 |@@@@@@@                                  931
            2048 |@@@@@@@@                                 1149
            4096 |@@@@@@@                                  992
            8192 |@@@@@@@@@@@@@@@@                         2266
           16384 |@@                                       320
           32768 |                                         20
           65536 |                                         2
          131072 |                                         0
That's better.
Drawbacks of VFS tracing:
- It can include other kernel components that use VFS, such as sockets.
- Application context not available.
- It is not possible to do this from a Solaris zone or Joyent SmartMachine environment via direct use of the fbt provider. The fbt provider allows inspection of kernel internals, which has the potential to share privileged data between zones, and so is unlikely to ever be available from within a zone. (But indirect use via a secure means may be possible - more on this in part 5).
- The fbt provider is considered an "unstable" interface, since it exposes thousands of raw kernel functions. Any scripts written to use it may stop working on kernel updates, should the kernel engineer rename or modify functions you are tracing.
File System Tracing
The second location in the kernel to consider is the file system itself: ZFS, UFS, etc. This exposes file system specific characteristics, which can be the origin of many file system latency issues. DTrace can examine these using the fbt provider.
File systems in the I/O stack:
           +---------------+
           |  Application  |
           |    mysqld     |
           +---------------+
              reads ^ | writes
                    | |
                    | |             user-land
          -------------system calls---------------------------
                    | |             kernel
              +-------------------+
              |        VFS        |
  +-----------|                   |------------+
  | File      +-------------------+            |
  | Systems      / /          \ \              |
  |             / /            \ \             |
  |     +------------+    +------------+       |
HERE -->|    ZFS     |    |    UFS     |       |
  |     |            |    |            |       |
  |     |      |     |    |      |     | ...   |
  |     |      V     |    |      V     |       |
  |     |/\/\/\/\/\/\|    |/\/\/\/\/\/\|       |
 [...]
Advantages for tracing at the file system:
- File system specific behaviour can be examined.
- Kernel context is available, including more information than fds[] makes available.
After the VFS scripts, there are examples of file system tracing scripts in Chapter 5 of the DTrace book. One of my favourites is zfsslower.d, which takes a millisecond argument and shows any I/O that was slower than that time:
# ./zfsslower.d 10
TIME                    PROCESS      D   KB   ms  FILE
2011 May 17 01:23:12    mysqld       R   16   19  /z01/opt/mysql5-64/data/xxxxx/xxxxx.ibd
2011 May 17 01:23:13    mysqld       W   16   10  /z01/var/mysql/xxxxx/xxxxx.ibd
2011 May 17 01:23:33    mysqld       W   16   11  /z01/var/mysql/xxxxx/xxxxx.ibd
2011 May 17 01:23:33    mysqld       W   16   10  /z01/var/mysql/xxxxx/xxxxx.ibd
2011 May 17 01:23:51    httpd        R   56   14  /z01/home/xxxxx/xxxxx/xxxxx/xxxxx/xxxxx
^C
Again, I've redacted the filename info (replaced portions with "xxxxx"), but the output should still make sense. This traces the POSIX requests of the ZFS file system, via functions including zfs_read() and zfs_write(), and shows details - including latency - for any I/O slower than the specified time.
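For a flavour of how such a script works (this is not the book's zfsslower.d, just a stripped-down sketch that assumes the zfs_read()/zfs_write() function names and skips the filename lookup), the core logic is an entry/return pair with a threshold test:

#!/usr/sbin/dtrace -s
/* sketch: print ZFS reads/writes slower than $1 milliseconds (no filename) */
#pragma D option quiet
#pragma D option defaultargs

fbt::zfs_read:entry, fbt::zfs_write:entry
{
        self->start = timestamp;
}

fbt::zfs_read:return, fbt::zfs_write:return
/self->start && (timestamp - self->start) >= $1 * 1000000/
{
        printf("%-20Y %-16s %s %6d ms\n", walltimestamp, execname,
            probefunc == "zfs_read" ? "R" : "W",
            (timestamp - self->start) / 1000000);
}

fbt::zfs_read:return, fbt::zfs_write:return
/self->start/
{
        self->start = 0;
}

The book's version also extracts the file path and I/O size, which means walking kernel structures - part of what makes fbt-based scripts fragile across kernel builds.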
Drawbacks of File System tracing:
- Application context not available.
- It is not possible to do this from a Solaris zone or Joyent SmartMachine environment via direct use of the fbt provider. The fbt provider allows inspection of kernel internals, which has the potential to share privileged data between zones, and so is unlikely to ever be available from within a zone. (But indirect use via a secure means may be possible - more on this in part 5).
- The fbt provider is considered an "unstable" interface, since it exposes thousands of raw kernel functions. Any scripts written to use it may stop working on kernel updates, should the kernel engineer rename or modify functions you are tracing.
- File Systems get complex.
ZFS
The zfsslower.d script mentioned above only traces requests to ZFS. DTrace can continue drilling and expose all of the internals of ZFS, pinpointing file system induced latency. This could be the subject of a separate series of blog posts, including:
- Lock contention latency
- ZIO pipeline latency
- Compression latency
- Allocation latency
- vdev queue latency
- and lots more...
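To give a flavour of just the first item: kernel lock contention suffered by mysqld threads (in ZFS or elsewhere) can be summarized with the lockstat provider, whose adaptive-block probe fires with the blocked time in nanoseconds as arg1. A rough sketch (the five-frame stack depth is arbitrary):

# dtrace -n 'lockstat:::adaptive-block /execname == "mysqld"/
    { @[stack(5)] = sum(arg1); }'

Contended ZFS locks would show up in the output by their kernel call path.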
You may be able to skip this part if the latency can be traced at a lower level than the file system - i.e., originating from the disk subsystem and being passed up the stack. That can be a practical approach: digging into file system internals can be very time consuming, and isn't necessary for every issue.
Lower Level
By tracing down to the disk device, you can identify exactly where the latency is originating. The full kernel stack would include (showing Solaris ZFS in this example):
-------------------system calls---------------------------
                 | |                  kernel
           +-------------------+
+----------|        VFS        |------------+
| File     +-------------------+            |
| Systems        | |                        |
|     +-----------------------+             |
|     |          ZFS          |             |
|     |          | |          |             |
|     |     +---------+       |             |
|     |     |   ZIO   |       |             |
|     |     +---------+       |             |
|     |     |  vdevs  |       |             |
|     +-----+---------+-------+             |
|                | |                        |
+--------------------------------------------+
                 | |
         +----------------+
         |  block device  |
         +----------------+
                 | |
           +------------+
           |    SCSI    |
           +------------+
                 | |
           +------------+
           |    SAS     |
           +------------+
                 | |
                 V |
            .-----------.
            `-----------'
            |    disk   |
            `-----------'
Latency at each of these layers can be traced: VFS, ZFS (including ZIO pipeline and vdevs), block device (sd), SCSI and SAS. If the latency is originating from any of these locations, you can identify it by comparing between the layers.
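As a small sketch of that kind of comparison, here are just two of those layers traced together: ZFS reads at the top via the unstable fbt provider (assuming the zfs_read() function name), and block device I/O at the bottom via the stable io provider:

#!/usr/sbin/dtrace -s
/* sketch: compare ZFS read latency with block device I/O latency */

fbt::zfs_read:entry
{
        self->start = timestamp;
}

fbt::zfs_read:return
/self->start/
{
        @["zfs_read (ns)"] = quantize(timestamp - self->start);
        self->start = 0;
}

/* block device I/O: key the start time by the buf pointer (arg0) */
io:::start
{
        start_ts[arg0] = timestamp;
}

io:::done
/start_ts[arg0]/
{
        @["block device I/O (ns)"] = quantize(timestamp - start_ts[arg0]);
        start_ts[arg0] = 0;
}

If slow block device I/O also appears in the zfs_read() distribution, the disks are hurting the application; if not, the latency is either elsewhere or asynchronous to the application.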
To show what this can look like, here is an experimental script that shows latency from multiple layers at the same time - for comparison:
# ./zfsstacklatency.d
dtrace: script './zfsstacklatency.d' matched 25 probes
^C
CPU     ID                    FUNCTION:NAME
 15      2                             :END

  zfs_read time (ns)
           value  ------------- Distribution ------------- count
             512 |                                         0
            1024 |@@@@                                     424
            2048 |@@@@@@@@                                 768
            4096 |@@@@                                     375
            8192 |@@@@@@@@@@@@@@@@                         1548
           16384 |@@@@@@@@                                 763
           32768 |                                         35
           65536 |                                         4
          131072 |                                         12
          262144 |                                         1
          524288 |                                         0

  zfs_write time (ns)
           value  ------------- Distribution ------------- count
            2048 |                                         0
            4096 |@@@                                      718
            8192 |@@@@@@@@@@@@@@@@@@@                      5152
           16384 |@@@@@@@@@@@@@@@                          4085
           32768 |@@@                                      731
           65536 |@                                        137
          131072 |                                         23
          262144 |                                         3
          524288 |                                         0

  zio_wait time (ns)
           value  ------------- Distribution ------------- count
             512 |                                         0
            1024 |@@@@@@@@@@@@@                            6188
            2048 |@@@@@@@@@@@@@@@@@@@@@@@                  11459
            4096 |@@@@                                     2026
            8192 |                                         60
           16384 |                                         37
           32768 |                                         8
           65536 |                                         2
          131072 |                                         0
          262144 |                                         0
          524288 |                                         1
         1048576 |                                         0
         2097152 |                                         0
         4194304 |                                         0
         8388608 |                                         0
        16777216 |                                         0
        33554432 |                                         0
        67108864 |                                         0
       134217728 |                                         0
       268435456 |                                         1
       536870912 |                                         0

  zio_vdev_io_done time (ns)
           value  ------------- Distribution ------------- count
            2048 |                                         0
            4096 |@                                        8
            8192 |@@@@                                     56
           16384 |@                                        17
           32768 |@                                        13
           65536 |                                         2
          131072 |@@                                       24
          262144 |@@                                       23
          524288 |@@@                                      44
         1048576 |@@@                                      38
         2097152 |                                         1
         4194304 |                                         4
         8388608 |                                         4
        16777216 |                                         4
        33554432 |@@@                                      43
        67108864 |@@@@@@@@@@@@@@@@@@@@@                    315
       134217728 |                                         0
       268435456 |                                         2
       536870912 |                                         0

  vdev_disk_io_done time (ns)
           value  ------------- Distribution ------------- count
           65536 |                                         0
          131072 |@                                        12
          262144 |@@                                       26
          524288 |@@@@                                     47
         1048576 |@@@                                      40
         2097152 |                                         1
         4194304 |                                         4
         8388608 |                                         4
        16777216 |                                         4
        33554432 |@@@                                      43
        67108864 |@@@@@@@@@@@@@@@@@@@@@@@@@                315
       134217728 |                                         0
       268435456 |                                         2
       536870912 |                                         0

  io:::start time (ns)
           value  ------------- Distribution ------------- count
           32768 |                                         0
           65536 |                                         3
          131072 |@@                                       19
          262144 |@@                                       21
          524288 |@@@@                                     45
         1048576 |@@@                                      38
         2097152 |                                         0
         4194304 |                                         4
         8388608 |                                         4
        16777216 |                                         4
        33554432 |@@@                                      43
        67108864 |@@@@@@@@@@@@@@@@@@@@@@@@@                315
       134217728 |                                         0
       268435456 |                                         2
       536870912 |                                         0

  scsi time (ns)
           value  ------------- Distribution ------------- count
           16384 |                                         0
           32768 |                                         2
           65536 |                                         3
          131072 |@                                        18
          262144 |@@                                       20
          524288 |@@@@                                     46
         1048576 |@@@                                      37
         2097152 |                                         0
         4194304 |                                         4
         8388608 |                                         4
        16777216 |                                         4
        33554432 |@@@                                      43
        67108864 |@@@@@@@@@@@@@@@@@@@@@@@@@                315
       134217728 |                                         0
       268435456 |                                         2
       536870912 |                                         0

  mega_sas time (ns)
           value  ------------- Distribution ------------- count
           16384 |                                         0
           32768 |                                         2
           65536 |                                         5
          131072 |@@                                       20
          262144 |@                                        16
          524288 |@@@@                                     50
         1048576 |@@@                                      33
         2097152 |                                         0
         4194304 |                                         4
         8388608 |                                         4
        16777216 |                                         4
        33554432 |@@@                                      43
        67108864 |@@@@@@@@@@@@@@@@@@@@@@@@@                315
       134217728 |                                         0
       268435456 |                                         2
       536870912 |                                         0
mega_sas is the SAS disk device driver - which shows the true latency of the disk I/O (about as deep as the operating system can go). The first distribution printed was for zfs_read() latency, which is the read requests to ZFS.
It's hugely valuable to be able to pluck this sort of latency data out from different layers of the operating system stack, to narrow down the source of the latency. Comparing all I/O in this way can also quickly identify the origin of outliers (a few I/O with high latency), which can be hit-or-miss if individual I/O are picked and traced as they execute through the kernel.
Latency at different levels of the OS stack can be examined and compared to identify the origin.
The spike of slow disk I/O seen in the mega_sas distribution (315 I/O with a latency between 67 and 134 ms), which is likely due to queueing on the disk, propagates up the stack to a point and then vanishes. That latency is not visible in the zfs_read() and zfs_write() interfaces, meaning, no application was affected by that latency (at least via read/write). The spike corresponded to a ZFS TXG flush - which is asynchronous to the application, and queues a bunch of I/O to the disks. If that spike were to propagate all the way up into zfs_read()/zfs_write(), then this output would have identified the origin - the disks.
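One way to check that theory is to time the TXG syncs themselves and see whether they line up with the disk I/O spike. A rough sketch, assuming spa_sync() is the TXG sync entry point on this ZFS version (traced via the unstable fbt provider):

# dtrace -n 'fbt::spa_sync:entry { self->start = timestamp; }
    fbt::spa_sync:return /self->start/ {
        printf("%Y TXG sync: %d ms", walltimestamp,
        (timestamp - self->start) / 1000000); self->start = 0; }'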
zfsstacklatency.d
I wrote zfsstacklatency.d as a demonstration-only script, to show what is technically possible. The script breaks a rule I learned the hard way: keep it simple. zfsstacklatency.d is not simple: it traces at multiple stack layers using the unstable fbt provider and is over 100 lines long. This makes it brittle and unlikely to run on kernel builds other than the system I wrote it on (there is little point including it here, since it almost certainly won't run for you). To trace at these layers, it can be more reliable to run small scripts that trace individual layers separately, and to maintain those individual scripts if and when they break on newer kernel versions. Chapter 4 of the DTrace book does this via scripts such as scsilatency.d, satalatency.d, mptlatency.d, etc.
Comparing File System Latency
By examining latency at different levels of the I/O stack, its origin can be identified. DTrace provides the ability to trace latency from the application right down to the disk device driver - leaving no stone unturned. This can be especially useful for the cases where latency is not caused by the disks, but by other issues in the kernel.
In the next and final post in this series, I'll show other useful presentations of file system latency as a metric.