I originally posted this at http://dtrace.org/blogs/brendan/2012/01/09/activity-of-the-zfs-arc
Disk I/O is still a common source of performance issues, despite modern cloud environments, modern file systems and huge amounts of main memory serving as file system cache. Understanding how well that cache is working is a key task while investigating disk I/O issues. In this post, I'll show the activity of the ZFS file system Adaptive Replacement Cache (ARC).
There are often more statistics available than you realize (or than have been documented), and that is certainly true of the ARC. Apart from showing these statistics, I'll also show how to extend observability using dynamic tracing (DTrace). These tracing techniques are also applicable to any kernel subsystem. This is an advanced topic, where I'll sometimes dip into kernel code.
Architecture
For background on the ZFS ARC, see the paper ARC: A Self-Tuning, Low Overhead Replacement Cache, by Nimrod Megiddo and Dharmendra S. Modha. In a nutshell, the ARC achieves a high cache hit rate by using multiple cache algorithms at the same time: most recently used (MRU) and most frequently used (MFU). Main memory is balanced between these algorithms based on their performance, which is known by maintaining extra metadata (in main memory) to see how each algorithm would perform if it ruled all of memory. Such extra metadata is held on "ghost lists".
The ZFS ARC has some changes beyond this design, as described in the block comment at the top of uts/common/fs/zfs/arc.c. These changes include the ability to lock pages, vary the size of the cache, and to cache buffers of different sizes.
Lists
+----------------------------------------------------------+
|                          ZFS ARC                          |
|   +---------------+----------------------------------+    |
|   |      MRU      |            MRU ghost             |    |
|   +---------------+---------------+------------------+    |
|   |              MFU              |    MFU ghost     |    |
|   +-------------------------------+------------------+    |
|     <--- available main memory --->                       |
+----------------------------------------------------------+
The MRU + MFU lists refer to the data cached in main memory; the MRU ghost + MFU ghost lists hold only the list metadata itself (no cached data), used to track how each algorithm would perform.
This is a simplification to convey the basic principle. The current version of the ZFS ARC splits the lists above into separate data and metadata lists, and also has a list for anonymous buffers and one for L2ARC-only buffers (which I added when I developed the L2ARC).
The actual lists are these, from arc.c:
typedef struct arc_state {
	list_t	arcs_list[ARC_BUFC_NUMTYPES];	/* list of evictable buffers */
	uint64_t arcs_lsize[ARC_BUFC_NUMTYPES];	/* amount of evictable data */
	uint64_t arcs_size;	/* total amount of data in this state */
	kmutex_t arcs_mtx;
} arc_state_t;

/* The 6 states: */
static arc_state_t ARC_anon;
static arc_state_t ARC_mru;
static arc_state_t ARC_mru_ghost;
static arc_state_t ARC_mfu;
static arc_state_t ARC_mfu_ghost;
static arc_state_t ARC_l2c_only;
These lists exhibit MRU- and MFU-like behavior, but aren't strictly MRU/MFU. This can be understood from a lifecycle of an ARC buffer: on the first access, it is created and moved to the head of the MRU list. On the second access, it is moved to the head of the MFU list. On the third access, it moves back to the start of the MFU list. (Other lifecycles are possible, this is just one example.) So, the most recently accessed buffer may be at the start of the MFU list, not the MRU list. And, the most frequently accessed buffer may not be at the very start of the MFU list.
Locks
Data exists in the cache as buffers, where the primary structures are the arc_buf_hdr_t (header struct, defined in arc.c) and arc_buf_t (buffer struct, defined in arc.h). Access to these is protected by a hash table keyed on the 128-bit ZFS data virtual address (DVA). The hash table's buffer chains are protected by 256 locks (BUF_LOCKS, which may vary based on your ZFS version), each padded to avoid false sharing. From arc.c:
#define	HT_LOCK_PAD	64

struct ht_lock {
	kmutex_t	ht_lock;
#ifdef _KERNEL
	unsigned char	pad[(HT_LOCK_PAD - sizeof (kmutex_t))];
#endif
};

#define	BUF_LOCKS 256
typedef struct buf_hash_table {
	uint64_t ht_mask;
	arc_buf_hdr_t **ht_table;
	struct ht_lock ht_locks[BUF_LOCKS];
} buf_hash_table_t;
These are optimized for performance since ARC buffers can be accessed, modified and moved between lists frequently.
For more details on ARC lists and locks, see the block comments in arc.c, and the overview by Joerg Moellenkamp.
Sizing
The ARC grows to fill available memory on the system, on the principle that if there is free memory, use it. It shouldn't do this at the expense of applications, ie, it shouldn't push out application memory (at least, in any large and sustained way). It keeps its size in check via:
- allocation: once the ARC size has grown to its expected maximum, it will begin evicting buffers during new allocations. There is also some logic in arc_evict() to "recycle" a buffer of equal size, an optimization to avoid doing an evict-free-alloc path of the same size.
- reclaim thread: this is arc_reclaim_thread(), which wakes up every second (or sooner if signaled via the arc_reclaim_thr_cv condition variable) and attempts to reduce the size of the ARC to the target size. It calls arc_kmem_reap_now() to clean up the kmem caches, and arc_adjust() to resize the ARC lists. If arc_shrink() is called by arc_kmem_reap_now(), the target ARC size is reduced by arc_shrink_shift (or by needfree), which works out to shrinking the ARC by about 3% (a worked example follows this list). If you plot the ARC size, you sometimes see these arc_shrink() steps appearing as teeth on a saw - a sharp drop followed by a gradual increase.
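To make the 3% concrete, here is a rough sketch of mine (not from the ARC code) that reads the target size "c" from kstat and computes the step arc_shrink() would take, assuming the usual arc_shrink_shift default of 5 (a right-shift by 5 is 1/32, about 3%); check arc.c on your ZFS version:

# step size of arc_shrink(), assuming arc_shrink_shift = 5 (1/32 ~= 3%)
arc_c=$(kstat -p zfs:0:arcstats:c | awk '{ print $2 }')
echo "arc_shrink() step: $((arc_c >> 5)) bytes out of a $arc_c byte target"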
This is a brief summary, and includes keywords so you can find the right places in the source to start reading. I should note that the ARC did have sizing issues in the past, where it did seem to push out application memory; those have since been fixed. (One issue was that it didn't account for its own footprint accurately, missing a source of metadata in its size calculation, which meant the ARC was reaping later than it should have.)
Statistics
On Solaris-based systems, ARC statistics are available from kstat (kernel statistics), the same resource used by tools such as vmstat(1M) and iostat(1M). kstats are global (entire system, not individual zones) and accessible from non-root users. On the down side, they usually are not documented and are not considered a stable interface.
On FreeBSD, the same kstats for the ARC are available via sysctl (kstat.zfs.misc.arcstats).
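For example, a quick interactive check of the hit and miss counters on FreeBSD (a one-liner of mine, not from the original tools):

$ sysctl kstat.zfs.misc.arcstats.hits kstat.zfs.misc.arcstats.misses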
ARC Hit/Miss Rate
ARC hit or miss rate can be determined from the kstats zfs::arcstats:hits and zfs::arcstats:misses. To watch a rate over time, they can be processed using a little awk (example for Solaris-based systems):
# cat -n archits.sh
     1  #!/usr/bin/sh
     2
     3  interval=${1:-5}        # 5 secs by default
     4
     5  kstat -p zfs:0:arcstats:hits zfs:0:arcstats:misses $interval | awk '
     6  BEGIN {
     7          printf "%12s %12s %9s\n", "HITS", "MISSES", "HITRATE"
     8  }
     9  /hits/ {
    10          hits = $2 - hitslast
    11          hitslast = $2
    12  }
    13  /misses/ {
    14          misses = $2 - misslast
    15          misslast = $2
    16          rate = 0
    17          total = hits + misses
    18          if (total)
    19                  rate = (hits * 100) / total
    20          printf "%12d %12d %8.2f%%\n", hits, misses, rate
    21  }
    22  '
This program could be shorter; I've spent some extra lines to write it more clearly. You could also write this in Perl (see my Sun::Solaris::Kstat examples), or C via libkstat.
$ ./archits.sh 1
        HITS       MISSES   HITRATE
651329528960    370490565    99.94%
       22600           11    99.95%
       17984            6    99.97%
        8978            8    99.91%
       87041           28    99.97%
       89861           10    99.99%
[...]
The first line is the summary since boot, then interval summaries. These counters are system wide. The hit rate on this system is impressive (99.94% since boot), although hit rates can be misleading. I'm usually studying the MISSES column, as a linear measure of pain.
arcstat.pl
Neelakanth Nadgir wrote arcstat.pl (Solaris), which prints various statistics including reads, misses and the size of the ARC. Mike Harsch developed arcstat.pl further, including L2ARC statistics.
$ ./arcstat.pl 1
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
04:45:47     0     0      0     0    0     0    0     0    0    14G   14G
04:45:49   15K    10      0    10    0     0    0     1    0    14G   14G
04:45:50   23K    81      0    81    0     0    0     1    0    14G   14G
04:45:51   65K    25      0    25    0     0    0     4    0    14G   14G
04:45:52   30K    11      0    11    0     0    0     3    0    14G   14G
[...]
Instead of hit rates, this tool uses miss rates.
In Neel's version the first line is the summary since boot; this isn't the case in Mike's current L2ARC version: an extra snap_stats() for an early L2ARC check means that by the time the statistics loop is reached the first iteration is comparing "now with now" instead of "now with boot".
Jason Hellenthal has created a FreeBSD version.
All statistics
All the kstats from the arcstat group (which feed the tools seen above) can be listed using:
$ kstat -pn arcstats
zfs:0:arcstats:c                            15730138449
zfs:0:arcstats:c_max                        50447089664
zfs:0:arcstats:c_min                        6305886208
zfs:0:arcstats:class                        misc
zfs:0:arcstats:crtime                       95.921230719
zfs:0:arcstats:data_size                    13565817856
zfs:0:arcstats:deleted                      388469245
zfs:0:arcstats:demand_data_hits             611277816567
zfs:0:arcstats:demand_data_misses           258220641
zfs:0:arcstats:demand_metadata_hits         40050025212
zfs:0:arcstats:demand_metadata_misses       88523590
zfs:0:arcstats:evict_skip                   5669994
zfs:0:arcstats:hash_chain_max               20
zfs:0:arcstats:hash_chains                  248783
zfs:0:arcstats:hash_collisions              2106095400
zfs:0:arcstats:hash_elements                971654
zfs:0:arcstats:hash_elements_max            5677254
zfs:0:arcstats:hdr_size                     188240232
zfs:0:arcstats:hits                         651328694708
[...l2arc statistics truncated...]
zfs:0:arcstats:memory_throttle_count        0
zfs:0:arcstats:mfu_ghost_hits               55377634
zfs:0:arcstats:mfu_hits                     649347616033
zfs:0:arcstats:misses                       370489546
zfs:0:arcstats:mru_ghost_hits               127477329
zfs:0:arcstats:mru_hits                     1980639328
zfs:0:arcstats:mutex_miss                   11530337
zfs:0:arcstats:other_size                   1967741376
zfs:0:arcstats:p                            14713329404
zfs:0:arcstats:prefetch_data_hits           21342
zfs:0:arcstats:prefetch_data_misses         20782630
zfs:0:arcstats:prefetch_metadata_hits       831587
zfs:0:arcstats:prefetch_metadata_misses     2962685
zfs:0:arcstats:recycle_miss                 27036925
zfs:0:arcstats:size                         15721799464
zfs:0:arcstats:snaptime                     29379870.8764106
More of the activity related statistics will be discussed in the next sections.
Demand/Prefetch
Hits and misses can be broken down into four components, such that:
hits = demand_data_hits + demand_metadata_hits + prefetch_data_hits + prefetch_metadata_hits
And similar for misses. Prefetch and demand refer to how the ARC request was initiated; data and metadata refer to the type of data requested.
Prefetch is the ZFS read-ahead feature, to predict and pre-cache blocks for streaming (sequential) workloads. All the prefetch statistics refer to ARC requests that have originated from the ZFS prefetch algorithm - which happens before the ARC - and without knowing whether the data is already cached in the ARC. So, a prefetch hit means that ZFS initiated a prefetch, which was then found in the ARC. A prefetch miss happens when that prefetch request was not in the ARC, and so initiated a disk I/O request (normal behavior). Demand is the opposite of prefetch: direct requests to the ARC, not predicted requests.
Another way to understand prefetch statistics is to follow the code. In dbuf.c, see the ARC_PREFETCH flag set in dbuf_prefetch(), which is then checked in arc.c via the ARCSTAT_CONDSTAT macro to determine which kstat to increment.
You can also add these up in other ways; eg:
streaming ratio = prefetch_* / (hits + misses)
At least, this shows the ratio of the workload that ZFS has identified as streaming. This can be turned into a kstat tool (awk/Perl/C), as with hits/misses earlier, to show both the summary since boot and interval summaries (current activity).
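As a rough sketch of that idea (since-boot summary only, and assuming kstat instance 0 and the arcstats names listed above):

#!/usr/bin/sh
# streamratio.sh: rough since-boot streaming ratio from arcstats
kstat -p zfs:0:arcstats | awk '
        $1 ~ /prefetch_(data|metadata)_(hits|misses)$/  { prefetch += $2 }
        $1 ~ /arcstats:(hits|misses)$/                  { total += $2 }
        END { if (total) printf "streaming ratio: %.2f%%\n", prefetch * 100 / total }
'

Interval summaries could be added in the same way as archits.sh earlier.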
Data/Metadata
Metadata describes the ZFS dataset (file system or volume) and the objects within it. The data is the contents of those objects, including file, directory and volume blocks.
metadata ratio = *_metadata_* / (hits + misses)
This may be useful when considering the effect of picking a small recsize setting (which increases the proportion of metadata), or the effect of setting primarycache to metadata only.
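The same kstat-plus-awk pattern gives a rough since-boot metadata ratio (same assumptions as the streaming ratio sketch above):

kstat -p zfs:0:arcstats | awk '
        $1 ~ /(demand|prefetch)_metadata_(hits|misses)$/  { meta += $2 }
        $1 ~ /arcstats:(hits|misses)$/                    { total += $2 }
        END { if (total) printf "metadata ratio: %.2f%%\n", meta * 100 / total }
'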
Others
Some other activity related kstats worth mentioning for the ARC:
- mru_hits, mru_ghost_hits, mfu_hits, mfu_ghost_hits, p: Comparing the mru_hits and mfu_hits statistic with misses can determine the performance of each ARC list type (it's not comparing performance of the MRU/MFU algorithms alone, since these aren't strictly MRU/MFU, as mentioned in Architecture). By adding _hits + _ghost_hits for each type, and then comparing the ratio of each type over time, you can also identify if the workload changes in terms of ARC MRU/MFU. And you can also see how quickly the ARC adapts to the workload, by watching the "p" statistic (ARC parameter) change.
- hash_chain_max, hash_collisions: These show how well the DVA hash table is hashing (a kstat one-liner for watching them follows this list). hash_chain_max is the longest chain length seen, when DVAs hash to the same table entry, and is usually less than 10. If it were much higher, performance may degrade as the hash locks are held longer while the chains are walked - assuming the max is representative and not an anomaly caused by some short event, which can be double-checked by studying the hash_collisions rate. If an issue is found, the number of hash locks (BUF_LOCKS) could be increased in arc.c and ZFS recompiled (this isn't a regular tunable), although I wouldn't expect to need to tune this for a while.
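A kstat one-liner is enough to watch the hash statistics over time (5 second interval shown):

$ kstat -p zfs:0:arcstats:hash_chain_max zfs:0:arcstats:hash_collisions 5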
Other kstats in the arcstats group describe sizes of the ARC, and the L2ARC.
arc_summary.pl
Another Perl Sun::Solaris::Kstat-based ARC tool worth mentioning is Ben Rockwood's arc_summary.pl, which prints a neat summary of the hit/miss rate and many of the other counters. Jason Hellenthal also ported the tool to FreeBSD.
$ ./arc_summary.pl
System Memory:
         Physical RAM:  49134 MB
         Free Memory :  1925 MB
         LotsFree:      767 MB

ZFS Tunables (/etc/system):
         set zfs:zil_disable=1
         set zfs:zfs_prefetch_disable=1
         set zfs:zfs_nocacheflush=1

ARC Size:
         Current Size:             15172 MB (arcsize)
         Target Size (Adaptive):   15256 MB (c)
         Min Size (Hard Limit):    6013 MB (zfs_arc_min)
         Max Size (Hard Limit):    48110 MB (zfs_arc_max)

ARC Size Breakdown:
         Most Recently Used Cache Size:          77%    11865 MB (p)
         Most Frequently Used Cache Size:        22%    3391 MB (c-p)

ARC Efficency:
         Cache Access Total:             654018720316
         Cache Hit Ratio:      99%       653646329407   [Defined State for buffer]
         Cache Miss Ratio:      0%       372390909      [Undefined State for Buffer]
         REAL Hit Ratio:       99%       653645890054   [MRU/MFU Hits Only]

         Data Demand Efficiency:    99%
         Data Prefetch Efficiency:   0%

        CACHE HITS BY CACHE LIST:
          Anon:                       --%        Counter Rolled.
          Most Recently Used:          0%        1989696958 (mru)       [ Return Customer ]
          Most Frequently Used:       99%        651656193096 (mfu)     [ Frequent Customer ]
          Most Recently Used Ghost:    0%        128471495 (mru_ghost)  [ Return Customer Evicted, Now Back ]
          Most Frequently Used Ghost:  0%        55618357 (mfu_ghost)   [ Frequent Customer Evicted, Now Back ]
        CACHE HITS BY DATA TYPE:
          Demand Data:                93%        613371468593
          Prefetch Data:               0%        21342
          Demand Metadata:             6%        40274007879
          Prefetch Metadata:           0%        831593
        CACHE MISSES BY DATA TYPE:
          Demand Data:                69%        259735783
          Prefetch Data:               5%        20782630
          Demand Metadata:            23%        88909678
          Prefetch Metadata:           0%        2962818
---------------------------------------------
Percentages and raw counters are provided, and the four breakdowns of hit/miss statistics (which I documented above; Ben's been bugging me to document the arcstats for a while).
Tracing
Apart from statistics, the activity of the ARC can also be observed by tracing function points and probes in the kernel. While statistics are always enabled and collected, tracing is enabled when needed, and comes at a much higher overhead. This overhead is relative to the frequency of the traced events, which for the ARC can be very frequent (hundreds of thousands of events per second). I usually only trace the ARC for short periods (seconds or minutes) to gather debug data.
There isn't a stable DTrace provider for the ARC (and there probably never will be - other areas make much more sense), but there are sdt-provider probes in the ARC code:
# dtrace -ln 'sdt:zfs::arc-*'
   ID   PROVIDER            MODULE                          FUNCTION NAME
19307      sdt                  zfs                   arc_read_nolock arc-miss
19310      sdt                  zfs                   arc_evict_ghost arc-delete
19311      sdt                  zfs                         arc_evict arc-evict
19312      sdt                  zfs                   arc_read_nolock arc-hit
19313      sdt                  zfs                   arc_buf_add_ref arc-hit
If these didn't exist, you could use the fbt provider. I'd begin by inspecting the functions listed in the FUNCTION column.
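For example, this fbt one-liner (a sketch; the function names are taken from the listing above and may differ on your ZFS version) frequency counts calls to the functions that contain those probes:

# dtrace -n 'fbt::arc_buf_add_ref:entry,fbt::arc_read_nolock:entry { @[probefunc] = count(); }'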
Note that neither of these providers (sdt or fbt) is available from within Solaris zones - they must be traced from the global zone. They are also both considered unstable interfaces, meaning the one-liners and scripts that follow may not work on future versions of the ARC without maintenance to match the code changes.
ARC accesses by application
Checking which applications are (directly) using the ARC:
# dtrace -n 'sdt:zfs::arc-hit,sdt:zfs::arc-miss { @[execname] = count() }'
dtrace: description 'sdt:zfs::arc-hit,sdt:zfs::arc-miss ' matched 3 probes
^C

  sendmail                                                          1
  qmgr                                                              3
[...]
  nscd                                                             81
  httpd                                                           243
  imapd                                                          1417
  python2.6                                                      2572
  awstats.pl                                                     4285
  php                                                            6934
  mysqld                                                       105901
This frequency counts the execname during ARC access. mysqld was the heaviest user, with 105,901 accesses while tracing.
The kernel will show up as "sched", for activities including ZFS transaction group flushes (TXG flush).
ARC accesses by kernel call path
For more details on why the ARC is being accessed, the kernel calling stack can be frequency counted:
# dtrace -n 'sdt:zfs::arc-hit,sdt:zfs::arc-miss { @[execname, probefunc, stack()] = count(); }'
dtrace: description 'sdt:zfs::arc-hit,sdt:zfs::arc-miss ' matched 3 probes
^C
[...]
  sched                                               arc_buf_add_ref
              zfs`dbuf_hold_impl+0xea
              zfs`dbuf_hold+0x2e
              zfs`dmu_buf_hold+0x75
              zfs`zap_lockdir+0x67
              zfs`zap_update+0x5b
              zfs`uidacct+0xc4
              zfs`zfs_space_delta_cb+0x112
              zfs`dmu_objset_do_userquota_callbacks+0x151
              zfs`dsl_pool_sync+0xfe
              zfs`spa_sync+0x32b                        ← spa sync
              zfs`txg_sync_thread+0x265
              unix`thread_start+0x8
               26
[...]
  python2.6                                           arc_buf_add_ref
              zfs`dbuf_hold_impl+0xea
              zfs`dbuf_hold+0x2e
              zfs`dmu_buf_hold+0x75
              zfs`zap_get_leaf_byblk+0x56
              zfs`zap_deref_leaf+0x78
              zfs`fzap_cursor_retrieve+0xa7
              zfs`zap_cursor_retrieve+0x152
              zfs`zfs_readdir+0x2b8
              genunix`fop_readdir+0xab                  ← read directory
              genunix`getdents64+0xbc
              unix`_sys_sysenter_post_swapgs+0x149
             2130
[...]
  mysqld                                              arc_buf_add_ref
              zfs`dbuf_hold_impl+0xea
              zfs`dbuf_hold+0x2e
              zfs`dmu_buf_hold_array_by_dnode+0x1a7
              zfs`dmu_buf_hold_array+0x71
              zfs`dmu_read_uio+0x4d
              zfs`zfs_read+0x19a
              genunix`fop_read+0x6b                     ← read
              genunix`read+0x2b8
              genunix`read32+0x22
              unix`_sys_sysenter_post_swapgs+0x149
           101955
The output was many pages long; I've truncated to include a few different stacks, and added annotations.
ARC misses by user-land call path
Here's another view of ARC access call paths, this time for misses only, and the user-land stack trace that led to the miss. I've filtered on "mysqld" processes only:
# dtrace -n 'sdt:zfs::arc-miss /execname == "mysqld"/ { @[execname, probefunc, ustack()] = count(); }'
dtrace: description 'sdt:zfs::arc-miss ' matched 1 probe
^C
[...]
  mysqld                                              arc_read_nolock
              libc.so.1`__read+0x15
              mysqld`my_read+0x43
              mysqld`_Z7openfrmP3THDPKcS2_jjjP8st_table+0x95
              mysqld`_ZL17open_unireg_entryP3THDP8st_tablePKcS4_S4_P10TABLE_LISTP1...
              mysqld`_Z10open_tableP3THDP10TABLE_LISTP11st_mem_rootPbj+0x6d7
              mysqld`_Z11open_tablesP3THDPP10TABLE_LISTPjj+0x1b0
              mysqld`_Z30open_normal_and_derived_tablesP3THDP10TABLE_LISTj+0x1b
              mysqld`_Z14get_all_tablesP3THDP10TABLE_LISTP4Item+0x73b
              mysqld`_Z24get_schema_tables_resultP4JOIN23enum_schema_table_state+0x18
              mysqld`_ZN4JOIN4execEv+0x59e
              mysqld`_Z12mysql_selectP3THDPPP4ItemP10TABLE_LISTjR4ListIS1_ES2_jP8s...
              mysqld`_Z13handle_selectP3THDP6st_lexP13select_resultm+0x102
              mysqld`_Z21mysql_execute_commandP3THD+0x51c6
              mysqld`_Z11mysql_parseP3THDPKcjPS2_+0x1be
              mysqld`_Z16dispatch_command19enum_server_commandP3THDPcj+0x895
              mysqld`handle_one_connection+0x318
              libc.so.1`_thrp_setup+0x7e
              libc.so.1`_lwp_start
              124
The kernel stack trace could be included as well, showing the complete call path from user-land to a kernel event.
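For example (a sketch; the output gets long quickly), both stacks can be aggregated together:

# dtrace -n 'sdt:zfs::arc-miss /execname == "mysqld"/ { @[stack(), ustack()] = count(); }'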
ARC access sizes
Digging a bit deeper: the sdt probes used previously were declared as:
DTRACE_PROBE1(arc__hit, arc_buf_hdr_t *, hdr);
This means arg0 is an arc_buf_hdr_t pointer. The struct is declared in arc.c, and contains various members, including:
struct arc_buf_hdr {
[...]
	arc_buf_t		*b_buf;
	uint32_t		b_flags;
[...]
	arc_buf_contents_t	b_type;
	uint64_t		b_size;
	uint64_t		b_spa;
[...]
	clock_t			b_arc_access;
[...]
Let's pick out the size, and trace ARC accesses by buffer size:
# dtrace -n 'sdt:zfs::arc-hit,sdt:zfs::arc-miss { @["bytes"] = quantize(((arc_buf_hdr_t *)arg0)->b_size); }'
dtrace: description 'sdt:zfs::arc-hit,sdt:zfs::arc-miss ' matched 3 probes
^C

  bytes
           value  ------------- Distribution ------------- count
               0 |                                         0
               1 |                                         82
               2 |                                         1
               4 |                                         0
               8 |                                         1
              16 |                                         1
              32 |                                         1
              64 |                                         1
             128 |                                         0
             256 |                                         0
             512 |@                                        1526
            1024 |                                         605
            2048 |                                         780
            4096 |                                         913
            8192 |@                                        1094
           16384 |@@                                       4386
           32768 |                                         618
           65536 |@@                                       4196
          131072 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@         60811
          262144 |                                         0
Here I used a power-of-2 quantization, which showed that most of the buffers were in the 128 Kbyte range. (Which is also the default recsize for the datasets on this system.) Smaller buffers will exist due to cases including files and directories that are smaller than 128k.
Other members of arc_buf_hdr_t can be retrieved and inspected in similar ways.
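For example, accesses can be split into data vs metadata using b_type. This sketch assumes ARC_BUFC_DATA is enum value 0, as it is in the current arc.h (check your version):

# dtrace -n 'sdt:zfs::arc-hit,sdt:zfs::arc-miss { @[probename, ((arc_buf_hdr_t *)arg0)->b_type == 0 ? "data" : "metadata"] = count(); }'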
ARC buffer age
Here's a neat use of the b_arc_access member, which tracks the time that the buffer was last accessed in terms of clock ticks. This time the fbt provider is used, to trace arc_access() before and after it updates b_arc_access:
# cat -n arcaccess.d
     1  #!/usr/sbin/dtrace -s
     2
     3  #pragma D option quiet
     4
     5  dtrace:::BEGIN
     6  {
     7          printf("lbolt rate is %d Hertz.\n", `hz);
     8          printf("Tracing lbolts between ARC accesses...");
     9  }
    10
    11  fbt::arc_access:entry
    12  {
    13          self->ab = args[0];
    14          self->lbolt = args[0]->b_arc_access;
    15  }
    16
    17  fbt::arc_access:return
    18  /self->lbolt/
    19  {
    20          @ = quantize(self->ab->b_arc_access - self->lbolt);
    21          self->ab = 0;
    22          self->lbolt = 0;
    23  }
Running for 10 seconds:
# ./arcaccess.d -n 'tick-10s { exit(0); }'
lbolt rate is 100 Hertz.
Tracing lbolts between ARC accesses...

           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  729988
               1 |                                         3805      ← 10 ms
               2 |                                         3038
               4 |                                         2028
               8 |                                         1428
              16 |                                         1398
              32 |                                         1618
              64 |                                         2883      ← 1 second
             128 |                                         738
             256 |                                         681
             512 |                                         338
            1024 |                                         569
            2048 |                                         166
            4096 |                                         607       ← 1 minute
            8192 |                                         632
           16384 |                                         808
           32768 |                                         373
           65536 |                                         110
          131072 |                                         142
          262144 |                                         39        ← 1 hour
          524288 |                                         5
         1048576 |                                         97
         2097152 |                                         10
         4194304 |                                         44
         8388608 |                                         617       ← 1 day
        16777216 |                                         1
        33554432 |                                         0
This is interesting data. It shows that most buffers were accessed less than one clock tick apart (10 ms), with 729,988 accesses in the 0 to 1 tick range. The oldest buffer accessed was in the 16777216+ range, which (converting lbolts @100 Hertz into time) means it is at least 46 hours since last access. The above output has been annotated to show where times fall in the lbolt ranges (eg, 1 second falls in the 64 - 127 lbolt range).
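As a quick sanity check of that conversion (lbolts at 100 Hertz into hours):

$ awk 'BEGIN { printf "%.1f hours\n", 16777216 / 100 / 3600 }'
46.6 hours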
This gives us an insight into the age of the oldest buffers in the ARC (at least, in terms of access rate - not birth), and, of its churn rate. This particular ARC is 25 Gbytes, and has been running with a 99.94% hit rate as shown earlier - which may be less surprising now we know that it is so large that it can contain buffers accessed 40+ hrs apart.
ARC hash lock
To get a handle on ARC hash lock contention (instead of using more heavyweight tools like lockstat(1M)), you can try tracing the time for arc_buf_add_ref(), since it grabs the buffer hash lock:
# dtrace -n 'arc_buf_add_ref:entry { self->s = timestamp; } arc_buf_add_ref:return /self->s/ { @["ns"] = quantize(timestamp - self->s); self->s = 0; }'
dtrace: description 'arc_buf_add_ref:entry ' matched 2 probes
^C

  ns
           value  ------------- Distribution ------------- count
             256 |                                         0
             512 |@                                        2123
            1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@          44784
            2048 |@@@@@                                    7556
            4096 |@@                                       2267
            8192 |                                         385
           16384 |                                         4
           32768 |                                         0
           65536 |                                         1
          131072 |                                         0
Most of the times were in the 1 - 2 us range, with only a single occurrence passing 65 us.
ARC reap
Here's a simple script to provide insight into the ARC reclaim thread - an asynchronous task that keeps the size of the ARC in check.
# cat -n arcreap.d
     1  #!/usr/sbin/dtrace -s
     2
     3  fbt::arc_kmem_reap_now:entry,
     4  fbt::arc_adjust:entry
     5  {
     6          self->start[probefunc] = timestamp;
     7  }
     8
     9  fbt::arc_shrink:entry
    10  {
    11          trace("called");
    12  }
    13
    14  fbt::arc_kmem_reap_now:return,
    15  fbt::arc_adjust:return
    16  /self->start[probefunc]/
    17  {
    18          printf("%Y %d ms", walltimestamp,
    19              (timestamp - self->start[probefunc]) / 1000000);
    20          self->start[probefunc] = 0;
    21  }
Different functions are traced: arc_kmem_reap_now(), to see the time taken to reap the ARC kmem caches; arc_adjust(), for resizing the ARC lists; and arc_shrink(), to know when the ARC size has been stepped down (this isn't timed, since any real work will be done by arc_adjust()).
# ./arcreap.d
dtrace: script './arcreap.d' matched 5 probes
CPU     ID                    FUNCTION:NAME
  0  64929                 arc_shrink:entry   called
  0  62414                arc_adjust:return 2012 Jan 9 23:10:01 18 ms
  9  62420         arc_kmem_reap_now:return 2012 Jan 9 23:10:03 1511 ms
  0  62414                arc_adjust:return 2012 Jan 9 23:10:24 0 ms
  6  62414                arc_adjust:return 2012 Jan 9 23:10:49 0 ms
This isn't the only way the ARC keeps its size sane; it will also evict/recycle buffers during allocation, as mentioned in the Architecture section. This reclaim thread is the more aggressive method, so if you have occasional odd ARC behavior it may be handy to check if it is related to reclaims.
Evicts by list and type
Tracing the function that does eviction, with details:
# cat -n arcevict.d
     1  #!/usr/sbin/dtrace -s
     2
     3  #pragma D option quiet
     4
     5  dtrace:::BEGIN
     6  {
     7          trace("Tracing ARC evicts...\n");
     8  }
     9
    10  fbt::arc_evict:entry
    11  {
    12          printf("%Y %-10a %-10s %-10s %d bytes\n", walltimestamp, args[0],
    13              arg4 == 0 ? "data" : "metadata",
    14              arg3 == 0 ? "evict" : "recycle", arg2);
    15  }
Sample output:
# ./arcevict.d
Tracing ARC evicts...
2012 Jan 8 08:13:03 zfs`ARC_mru   data       evict      812181411 bytes
2012 Jan 8 08:13:03 zfs`ARC_mfu   data       evict      5961212 bytes
2012 Jan 8 08:13:03 zfs`ARC_mfu   data       recycle    131072 bytes
2012 Jan 8 08:13:04 zfs`ARC_mfu   data       recycle    131072 bytes
2012 Jan 8 08:13:07 zfs`ARC_mfu   data       recycle    131072 bytes
2012 Jan 8 08:13:07 zfs`ARC_mfu   data       recycle    131072 bytes
2012 Jan 8 08:13:08 zfs`ARC_mfu   metadata   recycle    16384 bytes
2012 Jan 8 08:13:08 zfs`ARC_mfu   data       recycle    131072 bytes
[...]
The output begins by catching an 800 Mbyte evict from the ARC MRU data list, followed by a 6 Mbyte evict from the MFU data list. After that, buffers were evicted due to the recycle code path, which recycles buffers when the ARC is getting full instead of allocating new ones.
To understand (and maintain) the arg mappings above, see the invocations of arc_evict() in arc.c. Eg, from arc_adjust():
	if (adjustment > 0 && arc_mru->arcs_lsize[ARC_BUFC_DATA] > 0) {
		delta = MIN(arc_mru->arcs_lsize[ARC_BUFC_DATA], adjustment);
		(void) arc_evict(arc_mru, NULL, delta, FALSE, ARC_BUFC_DATA);
		adjustment -= delta;
	}
This is the first arc_evict() in arc_adjust(), which is why the ARC MRU data list is hit up first.
And more
The previous tracing examples show the sort of additional information that can be obtained using static tracing (the sdt provider) and dynamic tracing (the fbt provider). With dynamic tracing, a lot more can be seen as needed. Every function that makes up the ARC can be traced, along with their arguments.
One detail that is actually difficult to trace is the file names during ARC accesses, since vnode pointers are not passed down to the ARC layer. It is possible, and has been done before (I don't have an example on-hand though). You could more easily cache them from upper layers (eg, VFS; see the sample chapter from the DTrace book).
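As a rough sketch of that upper-layer approach (assumptions: a Solaris-era zfs_read() whose first argument is a vnode_t pointer, and a populated v_path; both vary by OS and ZFS version), this frequency counts ZFS reads by file name at the VFS/ZPL layer, which can then be correlated with ARC activity:

# dtrace -n 'fbt::zfs_read:entry { @[args[0]->v_path != NULL ? stringof(args[0]->v_path) : "<unknown>"] = count(); }'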
Conclusion
In this post, I examined ZFS ARC activity in detail, starting with statistics provided by kstat and then tracing provided by DTrace. Apart from calculating hit and miss rates, I discussed other statistics including prefetch and metadata ratios. I then used tracing to observe information from the ARC - including who is using the ARC and why, ARC buffer sizes, the age of the ARC buffers, lock contention timings and eviction details. More can be traced as needed: ZFS with DTrace provides great performance and observability.
I've spent much time on kernel internals, but I haven't really blogged about the deeper areas. I'm trying to change that, at least occasionally, starting with this post on ARC activity. I hope it is useful.
Thanks to the original ZFS team - especially Mark Maybee - for writing the ARC and explaining details to me, and to Bryan Cantrill for kstat-ifying the ARC statistics and creating DTrace.