What is the distribution of sent packet sizes for my Linux system?
# ./perf-stat-hist net:net_dev_xmit len 10
Tracing net:net_dev_xmit, power-of-4, max 16384, for 10 seconds...

            Range          : Count    Distribution
                0          : 0        |                                      |
            1 -> 3         : 0        |                                      |
            4 -> 15        : 0        |                                      |
           16 -> 63        : 6        |#                                     |
           64 -> 255       : 385      |######################################|
          256 -> 1023      : 133      |##############                        |
         1024 -> 4095      : 155      |################                      |
         4096 -> 16383     : 0        |                                      |
        16384 ->           : 0        |                                      |
Great! So over half are between 64 and 255 bytes, and most of the rest are between 256 and 4095 bytes.
How about the requested size of read() syscalls?
# time ./perf-stat-hist syscalls:sys_enter_read count 10
Tracing syscalls:sys_enter_read, power-of-4, max 1048576, for 10 seconds...

            Range          : Count    Distribution
                0          : 0        |                                      |
            1 -> 3         : 1361     |#                                     |
            4 -> 15        : 2        |#                                     |
           16 -> 63        : 8        |#                                     |
           64 -> 255       : 60       |#                                     |
          256 -> 1023      : 1933859  |######################################|
         1024 -> 4095      : 59       |#                                     |
         4096 -> 16383     : 146      |#                                     |
        16384 -> 65535     : 21       |#                                     |
        65536 -> 262143    : 554007   |###########                           |
       262144 -> 1048575   : 0        |                                      |
      1048576 ->           : 0        |                                      |

real	0m10.056s
user	0m0.012s
sys	0m0.008s
Neat! The most common are in the 256 - 1023 byte range.
This time I added a time command, to show that extracting this information from the kernel costs little: the user and sys CPU times are only a few milliseconds out of the 10 seconds of tracing.
The script I'm using, perf-stat-hist, is demonstrating a custom distribution capability that is bread-and-butter for more advanced tracers like SystemTap, ktap, and DTrace. However, I'm not using those.
I'm using Linux perf_events on the 3.2 kernel. Aka, the perf command.
To do in-kernel histograms.
Stock, standard, perf_events.
Via user-space
Yes, for the current version of perf_events (3.16 and earlier) this is supposed to be impossible. perf_events can do in-kernel tracepoint counts, but anything beyond that requires dumping data to user-space for post-processing, like this:
# perf record -e 'syscalls:sys_enter_read' -a sleep 5
[ perf record: Woken up 25 times to write data ]
[ perf record: Captured and wrote 132.355 MB perf.data (~5782677 samples) ]
Now I have two problems. This perf.data file has over 5 million entries, which will cost some CPU time to process. How much CPU just to read it? Let's dump it using perf script to /dev/null:
window1# time perf script > /dev/null
...hang...

window2# top
...hang...
Both windows have frozen. Now I have four problems.
When top finally runs, I can see what's wrong:
# top
top - 23:21:58 up 25 days,  2:56,  2 users,  load average: 1.68, 1.42, 1.09
Tasks: 142 total,   2 running, 136 sleeping,   0 stopped,   4 zombie
Cpu(s): 18.9%us, 54.1%sy,  0.0%ni, 27.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   3839240k total,  3813020k used,    26220k free,      448k buffers
Swap:        0k total,        0k used,        0k free,   167712k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
27328 root      20   0 3437m 3.3g  68m R  100 88.8   0:06.02 perf
   34 root      20   0     0    0    0 S   26  0.0   0:45.04 kswapd0
26289 root      20   0     0    0    0 S   10  0.0   0:00.04 kworker/u4:0
26290 root      20   0 17344  712  340 R    5  0.0   0:00.90 top
[...]
perf has not only eaten one CPU, but also exhausted the memory on this system.
perf_events does have an excellent architecture for passing data from the kernel to user-level programs, minimizing overhead and CPU costs. What I'm testing is an extreme case. In many other cases, a perf report and perf script cycle will work fine, and the overhead will be negligible. A perf-stat-hist script using that cycle would merely bucketize the perf script output and print a report: a trivial program.
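As a rough illustration of that post-processing approach (a sketch only, not the actual perf-stat-hist, and it assumes perf script prints the syscall arguments as "count: 0x..." fields, which can vary between kernel versions), the perf.data from the earlier perf record could be bucketized into power-of-2 ranges with gawk:

# perf script | gawk '
    # pick out the hex value that follows the "count:" field on each event line
    {
        for (i = 1; i < NF; i++) {
            if ($i == "count:") {
                v = $(i + 1); sub(/,/, "", v)
                val = strtonum(v)
                # find the power-of-2 bucket this value falls into
                b = 0
                while (val >= 2 ^ (b + 1)) b++
                if (val == 0) zero++
                else { bucket[b]++; if (b > max) max = b }
            }
        }
    }
    END {
        # print the bucket counts; drawing the ASCII bars is left out for brevity
        printf "%19s : %d\n", "0", zero
        for (b = 0; b <= max; b++)
            printf "%19s : %d\n", 2 ^ b " -> " (2 ^ (b + 1) - 1), bucket[b]
    }'

That works, but it still pays the cost of writing and then re-reading every event, which is exactly what hurt in the example above.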
I could reduce overheads further by reading the binary perf.data file directly, or even better, calling perf_event_open() and mmap() to read the binary buffer, and process data without a trip via the file system. But there's also another way...
The hacktogram
This is based on perf stat, which does efficient in-kernel counts. I gave a quick tour of basic perf Counting capabilities in my previous post.
perf stat lets you instrument the same tracepoint multiple times, with different filters. The trick is to use a tracepoint and filter pair for each histogram bucket. For example:
# perf stat -e syscalls:sys_enter_read --filter 'count < 1024' \
    -e syscalls:sys_enter_read --filter 'count >= 1024 && count < 1048576' \
    -e syscalls:sys_enter_read --filter 'count > 1048576' -a sleep 5

 Performance counter stats for 'system wide':

         1,522,160 syscalls:sys_enter_read                                     [100.00%]
           401,805 syscalls:sys_enter_read                                     [100.00%]
                18 syscalls:sys_enter_read

       5.001822069 seconds time elapsed
This shows that there were 1,522,160 read() syscalls requesting less than 1 Kbyte, 401,805 requesting between 1 Kbyte and 1 Mbyte, and 18 requesting over 1 Mbyte.
That's the approach I used in perf-stat-hist. Tracing the same tracepoint multiple times does incur additional overhead, so this approach is not ideal, and can slow my target by up to 50% when using over a dozen tracepoints (buckets). It's a hack.
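To give an idea of how such a command line can be generated (a simplified sketch of the idea, not the actual perf-stat-hist script, which also formats the output as a histogram), a shell loop can build one tracepoint and filter pair per power-of-4 bucket:

#!/bin/sh
# Sketch: build a perf stat command with one tracepoint+filter pair per
# power-of-4 bucket, then run it. Bucket boundaries match the earlier output.
tracepoint=syscalls:sys_enter_read
var=count
max=1048576
secs=10

opts="-e $tracepoint --filter \"$var == 0\""
low=1
while [ "$low" -lt "$max" ]; do
	high=$(( low * 4 ))
	opts="$opts -e $tracepoint --filter \"$var >= $low && $var < $high\""
	low=$high
done
opts="$opts -e $tracepoint --filter \"$var >= $max\""

# run it; the per-bucket counts then only need formatting into a histogram
eval perf stat $opts -a sleep $secs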
As for the variable I'm filtering on, in this case "count": it comes from the tracepoint itself. See the end of my previous post on perf Counting, and the contents of the /sys/.../format file.
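For example, the fields that a tracepoint provides (and which can therefore be used in --filter expressions) can be listed from its format file; this assumes debugfs is mounted in its usual location:

# grep field: /sys/kernel/debug/tracing/events/syscalls/sys_enter_read/format

Apart from the common_* fields, sys_enter_read provides fd, buf, and count.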
Ideal
What would be ideal is for perf stat to provide a histogram option. E.g.:
# perf stat -e syscalls:sys_enter_read --hist "pow2 count"
For a power-of-2 histogram of the count variable.
I think it's likely perf_events will get this capability in the future, especially thanks to recent kernel developments (more on this soon). So my perf-stat-hist workaround has a limited lifespan.
For more on perf_events, see my perf_events examples page and the perf_events wiki.