Active Benchmarking: Bonnie++
This is an example story of active benchmarking.
You're creating new virtual machines in a cloud. Your provider asks you to pick the operating system type for your instances. Not sure which to pick, you decide to benchmark disk I/O performance on the different OSes to see how they compare. A search for "disk benchmarks" turns up the commonly used tools bonnie and bonnie++. As the bonnie++ homepage says:
Bonnie++ is a benchmark suite that is aimed at performing a number of simple tests of hard drive and file system performance.
Sounds good!
This example isn't really about bonnie++ itself; there are many storage benchmark tools to choose from, and you may already have a favorite such as sysbench, fio, iometer, etc. Here I'm using bonnie++ to illustrate this type of active benchmarking analysis.
Things are going to go awry, but this can happen with any tool.
Passive Benchmarking
With passive benchmarking, you simply follow the tool's instructions and let it do its job.
Running bonnie++ on Fedora/Linux (KVM virtualization):
# bonnie++ [...]
Version 1.03e       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
9d219ce8-cf52-40 2G 52384  23 47334   3 31938   3 74866  67 1009669  61 +++++ +++
[...]
Running bonnie++ on SmartOS/illumos (OS virtualization):
# bonnie++
Version 1.03e       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
smartos1.local   2G 162464  99 72027  86 65222  99 251249  99 2426619  99 +++++ +++
[...]
The first results line shows that SmartOS is much faster: for "Sequential Output Per Chr K/sec", SmartOS is 3.1x faster (162464 vs 52384 for Fedora).
Are we done?
Running a benchmark fire-and-forget, as was done here, is passive benchmarking.
Active Benchmarking
While a benchmark is running, which may take minutes or hours, you could be using other tools to investigate its own performance. This is active benchmarking. Tools you can use include vmstat, iostat, top, prstat, strace, truss, DTrace, ps, pidstat, tcpdump, snoop, perf, cpustat, etc. For performance analysis tool lists, see my Linux and Solaris USE Method checklists.
The goals of active benchmarking are to:
- Check that the benchmark is doing what you think it is doing
- Determine what the performance limiters really are
Benchmarking is also an ideal time to develop your performance analysis skills, because you are testing a known workload, and can tune it and study the effects of that tuning.
Take 1
I'll start on SmartOS, checking disk I/O during the sequential output "Per Chr" test. What do you think it would be doing? Per-character disk I/O? Does that mean 1 byte per disk I/O? bonnie++ reported 158 Mbytes/sec for this test. What exactly does that mean? Running iostat:
$ iostat -xnz 1
[...]
    r/s    w/s   kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
                    extended device statistics
    r/s    w/s   kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
                    extended device statistics
    r/s    w/s   kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0  668.9    0.0 82964.3  0.0  6.0    0.0    8.9   1  60 c0t1d0
                    extended device statistics
    r/s    w/s   kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0  419.0    0.0 53514.5  0.0 10.0    0.0   23.8   0 100 c0t1d0
[...]
iostat shows that there was no disk I/O for a few seconds, followed by varying write throughput at a rate much less than the bonnie++ result. The disk I/O size is around 128 Kbytes (kw/s divided by w/s; eg, 82964.3 / 668.9 ≈ 124 Kbytes), which isn't remotely close to the 1 byte you might have expected for "Per Chr" (or even 512 bytes or 4 Kbytes, which are commonly used disk sector sizes).
Turns out that bonnie++ isn't testing disk I/O – it's testing file I/O. This may correspond to disk I/O, depending on whether the I/O is read or write, cached, has synchronous flags, matches the file system record size, etc.
Take 2
So if this isn't "Per Chr" to the disk, is it "Per Chr" to the file system?
Here is VFS-level I/O, as reported by vfsstat(1):
$ vfsstat 1
  r/s   w/s   kr/s     kw/s ractv wactv read_t writ_t  %r  %w    d/s  del_t zone
[...]
 45.3 1514.7   4.5 193877.3   0.0   0.1    0.0    0.0   0   6  412.4    5.5 b8b2464c (21)
 45.3 1343.6   4.5 171979.4   0.0   0.1    0.0    0.1   0   7 1343.6   14.0 b8b2464c (21)
 45.3 1224.8   4.5 156776.9   0.0   0.1    0.0    0.1   0   6 1157.9   12.2 b8b2464c (21)
 45.3 1224.8   4.5 156776.9   0.0   0.1    0.0    0.1   0   6 1157.9   12.2 b8b2464c (21)
 45.3 1299.5   4.5 166338.6   0.0   0.1    0.0    0.0   0   5    0.0    0.0 b8b2464c (21)
 45.3 1258.3   4.5 161065.5   0.0   0.1    0.0    0.0   0   5    0.0    0.0 b8b2464c (21)
 45.3 1317.7   4.5 168661.2   0.0   0.1    0.0    0.0   0   6  455.6    6.3 b8b2464c (21)
The throughput is consistent with the bonnie++ result; however, the write I/O size is still averaging around 128 Kbytes per I/O, not 1 byte. Not "Per Chr".
Also note that d/s (delays/sec) is non-zero – this SmartOS instance is encountering I/O throttling, a ZFS feature used for performance isolation of disk I/O on the SmartOS cloud. These delays will have throttled the performance of bonnie++.
Running bonnie++ via truss(1) (SmartOS syscall tracer) to see what's happening for each I/O:
write(4, "\001020304050607\b\t\n\v".., 131072) = 131072
write(4, "\001020304050607\b\t\n\v".., 131072) = 131072
write(4, "\001020304050607\b\t\n\v".., 131072) = 131072
write(4, "\001020304050607\b\t\n\v".., 131072) = 131072
So, by the syscall interface, bonnie++ is calling 128 Kbyte writes. Where does "Per Chr" come from?
Take 3
To find out more about these 128 Kbyte writes, DTrace can pull out the user-level stack traces:
# dtrace -n 'syscall::write:entry /execname == "bonnie++"/ { @[ustack()] = count(); }'
dtrace: description 'syscall::write:entry ' matched 1 probe
^C

              libc.so.1`__write+0x15
              libc.so.1`_xflsbuf+0xcf
              libc.so.1`_flsbuf+0x103
              bonnie++`_ZN7CFileOp16write_block_putcEv+0x55
              bonnie++`_Z11TestFileOpsiR12CGlobalItems+0x160
              bonnie++`main+0x8be
              bonnie++`_start+0x83
             1923
The stack includes C++ signatures. Browsing the source code shows that bonnie++ is using its own wrapper to putc():
int CFileOp::write_block_putc()
{
  for(int i = 0; i < m_chunk_size; i++)
  {
    if (putc(i & 0x7f, m_stream[m_file_ind]) == EOF)
    {
      fprintf(stderr, "Can't putc() - disk full?\n");
      return -1;
    }
  }
[...]
And it calls putc() in a loop.
putc() is #define'd:
#define putc(x, p)      (--(p)->_cnt >= 0 ?\
        (int)(*(p)->_ptr++ = (unsigned char)(x)) :\
        (((p)->_flag & _IOLBF) && -(p)->_cnt < (p)->_bufsiz ?\
                ((*(p)->_ptr = (unsigned char)(x)) != '\n' ?\
                        (int)(*(p)->_ptr++) :\
                        _flsbuf(*(unsigned char *)(p)->_ptr, p)) :\
                _flsbuf((unsigned char)(x), p)))
#endif
So, putc() buffers data, eventually calling _flsbuf() (flush output buffer), which may also do buffering:
/* The routine _flsbuf may or may not actually flush the output buffer.  If
 * the file is line-buffered, the fact that iop->_cnt has run below zero
 * is meaningless: it is always kept below zero so that invocations of putc
 * will consistently give control to _flsbuf, even if the buffer is far from
 * full.  _flsbuf, on seeing the "line-buffered" flag, determines whether the
 * buffer is actually full by comparing iop->_ptr to the end of the buffer
 * iop->_base + iop->_bufsiz.  If it is full, or if an output line is
 * completed (with a newline), the buffer is flushed.  (Note: the character
 * argument to _flsbuf is not flushed with the current buffer if the buffer
 * is actually full -- it goes into the buffer after flushing.)
 */

int
_flsbuf(unsigned char c, FILE *iop)
{
[...]
Checking using DTrace:
# dtrace -n 'pid$target::_ZN7CFileOp16write_block_putcEv:entry,pid$target::_flsbuf:entry,
    syscall::write:entry /pid == $target/ { @[probefunc] = count(); }
    tick-1s { printa(@); trunc(@); }' -c 'bonnie++'
dtrace: description 'pid$target::_ZN7CFileOp16write_block_putcEv:entry,...' matched 4 probes
^C
CPU     ID                    FUNCTION:NAME
  9  73066                        :tick-1s
  _flsbuf                                                        1390
  write                                                          1395
  _ZN7CFileOp16write_block_putcEv                               22233

  9  73066                        :tick-1s
  _flsbuf                                                        1253
  write                                                          1253
  _ZN7CFileOp16write_block_putcEv                               20046
[...]
I'd like to DTrace putc() as well, but that is inlined and harder to trace (though still possible, if I want to dig out the instruction offset).
We know that the write() syscalls are 128 Kbytes each, so we can piece together the rest: write_block_putc() performs 8192 putc()s per call, and after about 16 such calls have filled the 128 Kbyte putc() buffer, _flsbuf() and then the write() syscall are called. This matches the roughly 16:1 ratio of write_block_putc() to write() counts in the DTrace output above (eg, 22233 vs 1395).
So, ultimately, the "Per Chr" test calls putc(): a user-level, inlined, buffered thingy that eventually issues massive (131072x larger) file system writes.
Strange as that may sound – especially if you had earlier assumed "Per Chr" for bonnie++ means 1 byte disk or file system I/O – it is still a means to benchmark the file system, even if it ultimately tests 128 Kbyte writes.
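To make that mechanism concrete, here is a minimal C sketch (my own illustration, not bonnie++ source) of the same pattern: a loop of putc() calls whose output stdio accumulates and eventually issues as large write() syscalls. The file name, chunk size, and explicit 128 Kbyte buffer are assumptions chosen to mirror the behavior observed above; running this under truss or strace should show a single large write() rather than one write per character.

/*
 * putcbuf.c: a minimal sketch (not bonnie++ source) of per-character putc()
 * calls that stdio aggregates into large write() syscalls. The file name,
 * CHUNK, and the explicit 128 Kbyte buffer are assumptions chosen to mirror
 * what was observed on SmartOS above.
 */
#include <stdio.h>
#include <stdlib.h>

#define CHUNK   8192            /* characters written per "block", as bonnie++ does */
#define BUFSZ   (128 * 1024)    /* assumed stdio buffer size */

int
main(void)
{
        FILE *fp = fopen("testfile", "w");
        if (fp == NULL) {
                perror("fopen");
                exit(1);
        }

        /* fully buffer the stream with a 128 Kbyte buffer (must precede I/O) */
        if (setvbuf(fp, NULL, _IOFBF, BUFSZ) != 0) {
                perror("setvbuf");
                exit(1);
        }

        /* 16 chunks x 8192 putc()s fill one 128 Kbyte buffer */
        for (int c = 0; c < 16; c++) {
                for (int i = 0; i < CHUNK; i++) {
                        if (putc(i & 0x7f, fp) == EOF) {
                                perror("putc");
                                exit(1);
                        }
                }
        }

        /* the flush is one large write(); truss/strace should show 131072 bytes */
        fclose(fp);
        return (0);
}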
Take 4
Now for Linux (Fedora), checking at the syscall level using strace(1):
write(3, "\0\1\2\3\4\5\6\7\10\t\n\v\f\r\16\17\20\21\22\23\24\25\26\27\30"..., 4096) = 4096
write(3, "\0\1\2\3\4\5\6\7\10\t\n\v\f\r\16\17\20\21\22\23\24\25\26\27\30"..., 4096) = 4096
write(3, "\0\1\2\3\4\5\6\7\10\t\n\v\f\r\16\17\20\21\22\23\24\25\26\27\30"..., 4096) = 4096
write(3, "\0\1\2\3\4\5\6\7\10\t\n\v\f\r\16\17\20\21\22\23\24\25\26\27\30"..., 4096) = 4096
Oh, boy. So Linux is buffering to 4 Kbytes, while SmartOS is using 128 Kbytes. If this were a file system benchmark, it's not testing the same thing on both OSes.
I suppose at this point you could say that it is a valid putc() test, and finding these OS differences is a positive result of benchmarking putc(). Although this is a far cry from the description of bonnie++ as "a number of simple tests of hard drive and file system performance". We're above the syscall event horizon, not below it, so that description should include system library performance as well.
Fortunately, the buffer size can be tuned using the setbuffer() call. If I can change the source code of the application in question, then this is an example of finding an easy/dumb thing to tune. It wouldn't be fair to rule out Fedora/Linux because of its 4 Kbyte default, if it can easily be tuned to 128 Kbytes by changing the application source.
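Here is a hedged sketch of the kind of change involved; the helper name tune_stream_buffer() and the static buffer are my own illustration rather than bonnie++'s actual code, which (as in Take 5, below) essentially just calls setbuffer() on its output stream.

/*
 * A sketch of enlarging the stdio buffer of the benchmark's output stream
 * before the putc() loop runs. Names here are illustrative only.
 */
#define _DEFAULT_SOURCE                 /* glibc: expose setbuffer() */
#include <stdio.h>

#define IO_BUF_SIZE     (128 * 1024)    /* match the 128 Kbyte buffering seen on SmartOS */

static char io_buf[IO_BUF_SIZE];

void
tune_stream_buffer(FILE *stream)
{
        /*
         * setbuffer() is a BSD extension available on Linux and illumos; the
         * portable equivalent is setvbuf(stream, io_buf, _IOFBF, IO_BUF_SIZE).
         * Either must be called before any I/O is performed on the stream.
         */
        setbuffer(stream, io_buf, IO_BUF_SIZE);
}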
Take 5
A little hacking of bonnie++ (setbuffer()), and it now buffers to 128 Kbytes on Fedora/Linux:
write(3, "\0\1\2\3\4\5\6\7\10\t\n\v\f\r\16\17\20\21\22\23\24\25\26\27\30"..., 131072) = 131072
write(3, "\0\1\2\3\4\5\6\7\10\t\n\v\f\r\16\17\20\21\22\23\24\25\26\27\30"..., 131072) = 131072
write(3, "\0\1\2\3\4\5\6\7\10\t\n\v\f\r\16\17\20\21\22\23\24\25\26\27\30"..., 131072) = 131072
write(3, "\0\1\2\3\4\5\6\7\10\t\n\v\f\r\16\17\20\21\22\23\24\25\26\27\30"..., 131072) = 131072
Results:
Version 1.03e       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
9d219ce8-cf52-40 2G 61952  38 65644   5 64024   8 72127  64 984495  70 +++++ +++
Performance improved by 18%, in this case.
This particular type of tuning assumes you can modify your application source. You may be stuck with putc() and the system defaults.
Take 6
You know who else tests putc()? The original Bonnie, by Tim Bray.
Testing on SmartOS:
              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
          100 178380 100.0 1012047 100.0 910667 100.0 128642 100.0 3153097 100.0 250595.2 299.7
So Bonnie's putc() result is 10% faster than Bonnie++ (this is repeatable)!
Bonnie is C. Bonnie++ is C++.
From the homepage of bonnie++:
After it became apparent that Tim Bray and I will probably never agree on the relative merits of C vs C++ Tim suggested that I release my code independantly under the name Bonnie++
Is it possible that, to some degree, this is benchmarking the benchmarks?
Take 7
The version of bonnie++ used here was 1.03e, which is currently at the top of the bonnie++ homepage. By going to the experimental releases and running 1.96, I found that the first test has changed. It now does this:
write(3, "d", 1) = 1 write(3, "e", 1) = 1 write(3, "f", 1) = 1 write(3, "g", 1) = 1
The "Per Chr" test really is doing 1 byte I/O. To the syscall level, and from there the file system.
The description while running bonnie++ has also changed. It originally had "Writing with putc()...done" (truncated earlier, but was an early clue that putc() was on the scene), which now says "Writing a byte at a time...done".
Was the putc() behavior deemed a bug, and so this is the fix? The more important point here is that benchmarks are software, and all software can have bugs. This includes the analysis tools you are using to double-check, and the kernel-supplied statistics that the tools use (I talk about one such case involving bad network device driver statistics here). It's important to be mindful of this when using benchmarks and using tools to check them.
Take 8
Another goal of active benchmarking is to determine what the performance limiter really is. The limiter may be the target of the test (eg, a random disk I/O benchmark being limited by random disk I/O performance), or it may be something else. Sometimes the limiter is something dumb and easily tunable (eg, setbuffer()), allowing you to tune it and then more seriously investigate the performance of the target.
I like to start with microstate accounting (if available), or at least the USR/SYS breakdown per-thread. Here is prstat from SmartOS:
# prstat -mLcp `pgrep bonnie` 1
Please wait...
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
 59992 nobody    84  16 0.0 0.0 0.0 0.0 0.0 0.0   1   6  1K   0 bonnie++/1
Total: 1 processes, 1 lwps, load averages: 0.15, 0.03, 0.01
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
 59992 nobody    85  15 0.0 0.0 0.0 0.0 0.0 0.0   1  13  1K   0 bonnie++/1
Total: 1 processes, 1 lwps, load averages: 0.16, 0.04, 0.01
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
 59992 nobody    85  15 0.0 0.0 0.0 0.0 0.0 0.0   0   6  1K   0 bonnie++/1
[...]
First observation: bonnie++ is single-threaded. Second is that it's on-CPU 100% of the time, most of which (85%) in user-level code (USR). If I had begun with this instead of the previous analysis, this would immediately tell me something was amiss: bonnie++ is hot on-CPU in itself, not blocked on disk (which would show up as SLP) or in file system code (SYS).
Here are the on-CPU instructions for that user-level time:
# dtrace -n 'profile-297 /execname == "bonnie++" && arg1/ { @[uaddr(arg1)] = count(); }'
dtrace: description 'profile-297 ' matched 1 probe
^C

  bonnie++`_ZN7CFileOp16write_block_putcEv+0x1b                     1
  bonnie++`_ZN7CFileOp16write_block_putcEv+0x33                     1
  bonnie++`_ZN7CFileOp16write_block_putcEv+0x30                    75
  bonnie++`_ZN7CFileOp16write_block_putcEv+0x18                    76
  bonnie++`_ZN7CFileOp16write_block_putcEv+0x3e                    76
  bonnie++`_ZN7CFileOp16write_block_putcEv+0x40                    86
  bonnie++`_ZN7CFileOp16write_block_putcEv+0x39                   125
  bonnie++`_ZN7CFileOp16write_block_putcEv+0x3b                   297
  bonnie++`_ZN7CFileOp16write_block_putcEv+0x22                   347
This is bonnie++'s write_block_putc() function, which I assume includes the inlined putc().
So the limiters for this test are: CPU speed, write_block_putc() and putc() instructions, and that it is single-threaded.
We can drill a little further to find out what the CPU is actually doing, at the instruction and cycle level. One way is to disassemble the write_block_putc() function and compare the instruction offsets with the DTrace sample counts above. The CPU cycles-per-instruction could also be examined: I'd assume it would be pretty good, considering the locality and cacheability of putc()'s behavior.
It's basically a CPU test.
Summary
Summarized results for the bonnie++ ver 1.03e "Per Chr" streaming output follow.
With passive benchmarking, we learned:
- SmartOS/illumos was 3.1x faster than Fedora/Linux
With active benchmarking, we learned:
- To some degree, the benchmark was artificially limited by ZFS I/O throttling (vfsstat d/s).
- This was not testing disk I/O directly (synchronously).
- The resulting disk I/O write size was 128 Kbytes (not "Per Chr").
- The file system saw the same volume of data as the reported result (the vfsstat throughput matched bonnie++'s).
- The file system write size at this level was 128 Kbytes (not "Per Chr").
- The "Per Chr" test was referring to the use of putc().
- On SmartOS/illumos, putc() was buffered by the system library into 128 Kbyte writes.
- On Fedora/Linux, putc() was buffered by the system library into 4 Kbyte writes.
- putc() is faster on SmartOS, due in some part to the larger buffer size.
- The buffer size can be tuned using setbuffer(), which improved Fedora/Linux perf by 18%.
- Bonnie++ is 10% slower than the original Bonnie for this test, which means roughly 10% of the Bonnie++ result is due to the tool itself, not the target of the test.
- Newer (experimental) versions of bonnie++ changed "Per Chr" to actually perform 1 byte FS I/O.
- This test is limited by: CPU speed, write_block_putc() and putc() instructions, and by the fact that it is single-threaded.
You might not have guessed any of this from the raw benchmark data. I've led many benchmark postmortems during the past five years, and these kinds of findings are not uncommon (and certainly not isolated to bonnie++). Benchmarking is treacherous and confusing, and often done poorly - which means that you need to take any benchmark results with a large grain of salt.
Next Steps
The example scenario I gave was to choose a cloud instance type based on disk I/O benchmarking. Since putc() doesn't test disk or file system I/O performance very well, your next step might be to move on to the next test from bonnie++, and to perform active benchmarking with that. The next test was "Sequential Output Block": so, is it performing disk I/O? What size does "Block" mean? Does disk throughput match bonnie++'s results? What's happening at the file system level? System call level? Where is the bonnie++ thread spending its time? And so on...
As you work through benchmark tests and gather results, you should also note information on what the found or suspected limiter was. It'll take a while to step through the bonnie++ result set and understand what it has really measured. I've described this before as:
If you've spent less than a week studying a benchmark result, it's probably wrong.
Running a benchmark is the easy part. Understanding a benchmark can take much longer.
My analysis of bonnie++'s first result has taken most of a day so far, and it may take a few more days to get to the bottom of it, and be able to fully show why one OS is a certain percentage faster than another. This analysis would include examining kernel and virtualization internals, to see and quantify what other factors are affecting runtime.
Conclusion
Here, I picked the first test from bonnie++ and examined it using active benchmarking. What it ultimately turned out to be testing (CPU speed and putc() buffering) may not be what you would assume, given the role of bonnie++. This kind of surprise isn't uncommon with benchmarking, and is why in-flight analysis is so important.
Sometimes you need to analyze the benchmark source code, right down to instructions, to really understand what it is doing. I would have guessed most of bonnie++'s behavior for the first test, but the 128k vs 4k putc() buffer difference on SmartOS and Fedora was a surprise.
Bonnie and bonnie++ are not unusually bad – there are many aspects I like about them, and they have served people well in many cases. The take-away here is to analyze whatever you do use.
References
References and resources:
- Bonnie homepage, which explains what each test does.
- Bonnie++ homepage, the C++ version that supports larger files, concurrency, and more.
- Fishworks Analytics by Tim Bray shows active benchmarking analysis of Bonnie by using an Analytics tool to examine its performance while running (I helped develop this tool with Bryan.)
- Decoding Bonnie++ by Roch Bourbonnais analyses all the tests of Bonnie++, and goes deeper with active benchmarking using the same Analytics tool. (I was also working with Roch at the time.)
- Benchmarking the ZFS Storage Appliance, parts 1, 2, and 3, which provide more benchmarking advice.
- 7410 hardware update, and analyzing the HyperTransport is an example of my previous benchmark and analysis work at Sun Microsystems, which includes many screenshots to double check results.
Thanks to Deirdré Straughan for helping with this, and to Tim and Roch for their analysis and blog posts.