USE Method: FreeBSD Performance Checklist
This page contains an example USE Method-based performance checklist for FreeBSD, for identifying common bottlenecks and errors. It is intended to be used early in a performance investigation, before moving on to more time-consuming methodologies. This should be helpful for anyone using FreeBSD, especially system administrators.
This was developed on FreeBSD 10.0 alpha, and focuses on tools shipped by default. With DTrace, I was able to create a few new one-liners to answer some of these metrics. See the notes below the tables.
Physical Resources
component | type | metric |
---|---|---|
CPU | utilization | system-wide: vmstat 1, "us" + "sy"; per-cpu: vmstat -P; per-process: top, "WCPU" for weighted and recent usage; per-kernel-process: top -S, "WCPU" |
CPU | saturation | system-wide: uptime, "load averages" > CPU count; vmstat 1, "procs:r" > CPU count; per-cpu: DTrace to profile CPU run queue lengths [1]; per-process: DTrace of scheduler events [2] |
CPU | errors | dmesg; /var/log/messages; pmcstat for PMC and whatever error counters are supported (eg, thermal throttling) |
Memory capacity | utilization | system-wide: vmstat 1, "fre" is main memory free; top, "Mem:"; per-process: top -o res, "RES" is resident main memory size, "SIZE" is virtual memory size; ps -auxw, "RSS" is resident set size (Kbytes), "VSZ" is virtual memory size (Kbytes) |
Memory capacity | saturation | system-wide: vmstat 1, "sr" for scan rate, "w" for swapped threads (was saturated, may not be now); swapinfo, "Capacity" also for evidence of swapping/paging; per-process: DTrace [3] |
Memory capacity | errors | physical: dmesg?; /var/log/messages?; virtual: DTrace failed malloc()s |
Network Interfaces | utilization | system-wide: netstat -i 1, assume one very busy interface and use input/output "bytes" / known max (note: includes localhost traffic); per-interface: netstat -I interface 1, input/output "bytes" / known max |
Network Interfaces | saturation | system-wide: netstat -s, for saturation related metrics, eg netstat -s | egrep 'retrans|drop|out-of-order|memory problems|overflow'; per-interface: DTrace |
Network Interfaces | errors | system-wide: netstat -s | egrep 'bad|checksum', for various metrics; per-interface: netstat -i, "Ierrs", "Oerrs" (eg, late collisions), "Colls" |
Storage device I/O | utilization | system-wide: iostat -xz 1, "%b"; per-process: DTrace io provider, eg, iosnoop or iotop (DTT, needs porting) |
Storage device I/O | saturation | system-wide: iostat -xz 1, "qlen"; DTrace for queue duration or length [4] |
Storage device I/O | errors | DTrace io:::done probe when /args[0]->b_error != 0/ |
Storage capacity | utilization | file systems: df -h, "Capacity"; swap: swapinfo, "Capacity"; pstat -T also shows swap space |
Storage capacity | saturation | not sure this one makes sense - once it's full, writes fail with ENOSPC |
Storage capacity | errors | DTrace; /var/log/messages for file system full messages |
Storage controller | utilization | iostat -xz 1, sum IOPS & tput metrics for devices on the same controller, and compare to known limits [5] |
Storage controller | saturation | check utilization, and use DTrace to look for kernel queueing |
Storage controller | errors | DTrace the driver |
Network controller | utilization | system-wide: netstat -i 1, assume one busy controller and examine input/output "bytes" / known max (note: includes localhost traffic) |
Network controller | saturation | see network interface saturation |
Network controller | errors | see network interface errors |
CPU interconnect | utilization | pmcstat (PMC) for CPU interconnect ports, tput / max |
CPU interconnect | saturation | pmcstat and relevant PMCs for CPU interconnect stall cycles |
CPU interconnect | errors | pmcstat and relevant PMCs for whatever is available |
Memory interconnect | utilization | pmcstat and relevant PMCs for memory bus throughput / max, or, measure CPI and treat, say, 5+ as high utilization |
Memory interconnect | saturation | pmcstat and relevant PMCs for memory stall cycles |
Memory interconnect | errors | pmcstat and relevant PMCs for whatever is available |
I/O interconnect | utilization | pmcstat and relevant PMCs for tput / max if available; inference via known tput from iostat/netstat/... |
I/O interconnect | saturation | pmcstat and relevant PMCs for I/O bus stall cycles |
I/O interconnect | errors | pmcstat and relevant PMCs for whatever is available |
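To make the io:::done check from the storage device errors row concrete, here is a minimal sketch; it assumes the io provider's translated arguments (args[0]->b_error and args[1]->dev_name, from the io.d translators) are available on your FreeBSD build:

```
# Count failed storage I/O by device name and error code.
# Hedged: assumes Solaris-style io provider translators, so that
# args[0]->b_error (error code) and args[1]->dev_name (device) exist.
dtrace -n 'io:::done /args[0]->b_error != 0/ {
    @[args[1]->dev_name, args[0]->b_error] = count();
}'
```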
- [1] eg, using per-CPU run queue length as the saturation metric: dtrace -n 'profile-99 { @[cpu] = lquantize(`tdq_cpu[cpu].tdq_load, 0, 128, 1); } tick-1s { printa(@); trunc(@); }' where > 1 is saturation. If you're using the older BSD scheduler, profile runq_length[]. There are also the sched:::load-change and other sched probes.
- [2] For this metric, let's use time spent in TDS_RUNQ as a per-thread saturation (latency) metric. Here is an (unstable) fbt-based one-liner: dtrace -n 'fbt::tdq_runq_add:entry { ts[arg1] = timestamp; } fbt::choosethread:return /ts[arg1]/ { @[stringof(args[1]->td_name), "runq (ns)"] = quantize(timestamp - ts[arg1]); ts[arg1] = 0; }'. This would be more stable if rewritten to use the sched probes. It would also be great if there were simply high-resolution thread state times in struct rusage or rusage_ext, eg, cumulative times for each state in td_state and more, which would make reading this metric easier and lower overhead. See the Thread State Analysis Method from my Velocity talk for suggested states.
- [3] eg, for swapping: dtrace -n 'fbt::cpu_thread_swapin:entry, fbt::cpu_thread_swapout:entry { @[probefunc, stringof(args[0]->td_name)] = count(); }' (NOTE: I would trace vm_thread_swapin() and vm_thread_swapout(), but their probes don't exist). Tracing paging is trickier until the vminfo provider is added; you could try tracing from swap_pager_putpages() and swap_pager_getpages(), but I didn't see an easy way to walk back to a thread struct; another approach may be via vm_fault_hold(). Good luck. See thread states [2], which could make this much easier.
- [4] eg, sampling GEOM queue length at 99 Hertz: dtrace -n 'profile-99 { @["geom qlen"] = lquantize(`g_bio_run_down.bio_queue_length, 0, 256, 1); }'
- [5] This approach is different from storage device (disk) utilization. For controllers, percent busy has much less meaning, so we're calculating utilization based on throughput (bytes/sec) and IOPS instead. Controllers typically have limits for these based on their buses and processing capacity. If you don't know them, you can determine them experimentally.
- PMC == Performance Monitoring Counters, aka CPU Performance Counters (CPC), Performance Instrumentation Counters (PICs), and more. These are processor hardware counters that are read via programmable registers on each CPU. The availability of these counters is dependent on the processor type. See pmc(3) and pmcstat(8).
- pmcstat(8): the FreeBSD tool for instrumenting PMCs. You might need to run kldload hwpmc before use. Figuring out which PMCs to use, and how, usually takes serious time (days) with the processor vendor manuals; eg, the Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3B, Appendix A-E: Performance-Monitoring Events.
- DTT == DTraceToolkit scripts. These are in the FreeBSD source under cddl/contrib/dtracetoolkit, and dtruss is under /usr/sbin. As features are added to DTrace (see the freebsd-dtrace mailing list), more scripts can be ported.
- CPI == Cycles Per Instruction (others use IPC == Instructions Per Cycle).
- I/O interconnect: this includes the CPU to I/O controller buses, the I/O controller(s), and device buses (eg, PCIe).
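To make the CPI approach from the memory interconnect row concrete, here is a hedged sketch using pmcstat's system-wide counting mode. The "instructions" and "unhalted-cycles" event aliases may not exist on every processor, so run pmccontrol -L to see what yours supports:

```
# Count instructions and cycles system-wide, printing each second;
# CPI = unhalted-cycles / instructions. Event aliases are CPU-dependent:
# check pmccontrol -L for what your processor offers.
kldload hwpmc        # if not already loaded
pmcstat -s instructions -s unhalted-cycles -w 1
```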
Software Resources
component | type | metric |
---|---|---|
Kernel mutex | utilization | lockstat -H (held time); DTrace lockstat provider |
Kernel mutex | saturation | lockstat -C (contention); DTrace lockstat provider [6]; spinning shows up with dtrace -n 'profile-997 { @[stack()] = count(); }' |
Kernel mutex | errors | lockstat -E (errors); DTrace and fbt provider for return probes and error status |
User mutex | utilization | DTrace pid provider for hold times; eg, pthread_mutex_*lock() return to pthread_mutex_unlock() entry |
User mutex | saturation | DTrace pid provider for contention; eg, pthread_mutex_*lock() entry to return times |
User mutex | errors | DTrace pid provider for EINVAL, EDEADLK, ... see pthread_mutex_lock(3) etc. |
Process capacity | utilization | current/max using: ps -ax | wc -l vs sysctl kern.maxproc; top, "Processes:" also shows current |
Process capacity | saturation | not sure this makes sense |
Process capacity | errors | "can't fork()" messages |
File descriptors | utilization | system-wide: pstat -T, "files"; sysctl kern.openfiles / sysctl kern.maxfiles; per-process: can figure out using fstat -p PID and ulimit -n |
File descriptors | saturation | I don't think this one makes sense: if the kernel can't allocate or expand the fd array, it returns an error; see fdalloc() |
File descriptors | errors | truss, dtruss, or custom DTrace to look for errno == EMFILE on syscalls returning fds (eg, open(), accept(), ...) |
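For the user mutex saturation row, here is a minimal pid-provider sketch that measures pthread_mutex_lock() entry-to-return time for one process. PID is a placeholder, and it assumes the lock functions are visible to the pid provider (on FreeBSD they live in libthr):

```
# Distribution of time (ns) spent in pthread_mutex_lock(), per lock call;
# includes fast uncontended acquisitions, so look at the distribution tail.
# Replace PID with the process of interest.
dtrace -n '
    pid$target::pthread_mutex_lock:entry { self->ts = timestamp; }
    pid$target::pthread_mutex_lock:return /self->ts/ {
        @["lock wait (ns)"] = quantize(timestamp - self->ts);
        self->ts = 0;
    }' -p PID
```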
- lockstat: you may need to run kldload ksyms before lockstat will work (otherwise: "lockstat: can't load kernel symbols: No such file or directory").
- [6] eg, showing adaptive lock block time totals (in nanoseconds) by calling function name: dtrace -n 'lockstat:::adaptive-block { @[caller] = sum(arg1); } END { printa("%40a%@16d ns\n", @); }'
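And for the file descriptor errors row, a custom DTrace sketch along the suggested lines: count fd-returning syscalls that fail with EMFILE (24 in sys/errno.h). The wildcarded probe names are an assumption, intended to match the open/openat and accept variants:

```
# Count syscalls failing with EMFILE (errno 24), by process name.
# Hedged: assumes the errno built-in reflects the just-completed syscall
# at the return probe (it is cleared to 0 on success).
dtrace -n 'syscall::open*:return, syscall::accept*:return
    /errno == 24/ { @[execname] = count(); }'
```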
Other Tools
I didn't include procstat, sockstat, gstat, or others, as here I'm beginning with questions (the methodology) and only including tools that answer them, rather than the other way around: listing all the tools and trying to find a use for each. Those other tools are useful for other methodologies, which can be applied after this one.
What's Next
See the USE Method for the follow-up methodologies after identifying a possible bottleneck. If you complete this checklist but still have a performance issue, move on to other methodologies: drill-down analysis and latency analysis.
Acknowledgements
Resources used:
- FreeBSD source code and man pages
- FreeBSD Wiki PmcTools
- FreeBSD Handbook
- The Rosetta Stone for Unix is always handy, and also gave me the idea of adding some color backgrounds. Like it?
Filling in this checklist has required a lot of research, testing, and experimentation. Please reference back to this post if it helps you develop related material.
It's quite possible I've missed something or included the wrong metric somewhere (sorry); I'll update the post to fix these as they are understood, and note the update date at the top.
Also see my USE Method performance checklists for Solaris, SmartOS, Linux, and Mac OS X.