USE Method: Mac OS X Performance Checklist
This is my example USE Method-based performance checklist for the Apple Mac OS X operating system, for identifying common bottlenecks and errors. This draws upon both command line and graphical tools for coverage, focusing where possible on those that are provided with the OS by default, or by Apple (eg, Instruments). Further notes about tools are provided after this table.
Some of the metrics are easy to find in various GUIs or from the command line (eg, using Terminal; if you've never used Terminal before, follow my instructions at the top of this post). Many others require some math, inference, or quite a bit of digging. This should get easier in the future, as tools add a USE method wizard or expose the required metrics directly.
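As a quick start, here is a Terminal sweep using some of the system-wide commands from the tables below (a sketch; exact column names vary a little between OS X versions):

```
# CPU utilization and saturation (system-wide)
uptime          # "load averages" > CPU count suggests saturation
iostat 1        # CPU utilization = "us" + "sy"

# Main memory
vm_stat 1       # free main memory = "free" + "inactive" pages; watch "pageout"

# Network and storage
netstat -i 1    # per-interval bytes in/out (includes localhost traffic)
df -h           # file system capacity
```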
Physical Resources, Standard
component | type | metric |
---|---|---|
CPU | utilization | system-wide: iostat 1, "us" + "sy"; per-cpu: DTrace [1]; Activity Monitor → CPU Usage or Floating CPU Window; per-process: top -o cpu, "%CPU"; Activity Monitor → Activity Monitor, "%CPU"; per-kernel-thread: DTrace profile stack() |
CPU | saturation | system-wide: uptime, "load averages" > CPU count; latency, "SCHEDULER" and "INTERRUPTS"; per-cpu: dispqlen.d (DTT), non-zero "value"; runocc.d (DTT), non-zero "%runocc"; per-process: Instruments → Thread States, "On run queue"; DTrace [2] |
CPU | errors | dmesg; /var/log/system.log; Instruments → Counters, for PMC and whatever error counters are supported (eg, thermal throttling) |
Memory capacity | utilization | system-wide: vm_stat 1, main memory free = "free" + "inactive", in units of pages; Activity Monitor → Activity Monitor → System Memory, "Free" for main memory; per-process: top -o rsize, "RSIZE" is resident main memory size, "VSIZE" is virtual memory size; ps -alx, "RSS" is resident set size, "SZ" is virtual memory size; ps aux similar (legacy format) |
Memory capacity | saturation | system-wide: vm_stat 1, "pageout"; per-process: anonpgpid.d (DTT), DTrace vminfo:::anonpgin [3] (frequent anonpgin == pain); Instruments → Memory Monitor, high rate of "Page Ins" and "Page Outs"; sysctl vm.memory_pressure [4] |
Memory capacity | errors | System Information → Hardware → Memory, "Status" for physical failures; DTrace failed malloc()s |
Network Interfaces | utilization | system-wide: netstat -i 1, assume one very busy interface and use input/output "bytes" / known max (note: includes localhost traffic); per-interface: netstat -I interface 1, input/output "bytes" / known max; Activity Monitor → Activity Monitor → Network, "Data received/sec" "Data sent/sec" / known max (note: includes localhost traffic); atMonitor, interface percent |
Network Interfaces | saturation | system-wide: netstat -s, for saturation related metrics, eg netstat -s | egrep 'retrans|overflow|full|out of space|no bufs'; per-interface: DTrace |
Network Interfaces | errors | system-wide: netstat -s | grep bad, for various metrics; per-interface: netstat -i, "Ierrs", "Oerrs" (eg, late collisions), "Colls" [5] |
Storage device I/O | utilization | system-wide: iostat 1, "KB/t" and "tps" are rough usage stats [6]; DTrace could be used to calculate a percent busy, using io provider probes; atMonitor, "disk0" is percent busy; per-process: iosnoop (DTT), shows usage; iotop (DTT), has -P for percent I/O |
Storage device I/O | saturation | system-wide: iopending (DTT) |
Storage device I/O | errors | DTrace io:::done probe when /args[0]->b_error != 0/
Storage capacity | utilization | file systems: df -h; swap: sysctl vm.swapusage, for swap file usage; Activity Monitor → Activity Monitor → System Memory, "Swap used" |
Storage capacity | saturation | not sure this one makes sense - once it's full, writes fail with ENOSPC
Storage capacity | errors | DTrace; /var/log/system.log file system full messages |
- [1] eg: dtrace -x aggsortkey -n 'profile-100 /!(curthread->state & 0x80)/ { @ = lquantize(cpu, 0, 1000, 1); } tick-1s { printa(@); clear(@); }'. Josh Clulow also wrote a simple C program to dig out per-CPU utilization: cpu_usage.c. This one-liner is expanded into a commented script after these notes.
- [2] Until there are sched:::enqueue/dequeue probes, I suspect this could be done using fbt tracing of thread_*(). I haven't tried yet. It might be worth seeing what Instruments uses for its "On run queue" thread state trace, and DTracing that.
- [3] eg: dtrace -n 'vminfo:::anonpgin { printf("%Y %s", walltimestamp, execname); }'.
- [4] the kernel source under bsd/vm/vm_unix.c describes this as "Memory pressure indicator", although I've yet to see this as non-zero.
- [5] the netstat(1) man page reads: "BUGS: The notion of errors is ill-defined."
- [6] it would be great if Mac OS X iostat added a -x option to include utilization, saturation, and error columns, like Solaris "iostat -xnze 1".
- atMonitor is a 3rd party tool that provides various statistics; I'm running version 2.7b, although it crashes if you leave the "Top Window" open for more than 2 seconds.
- Activity Monitor is a default Apple performance monitoring tool with a graphical interface.
- Instruments is an Apple performance analysis product with a graphical interface. It is comprehensive, consuming performance data from multiple frameworks, including DTrace. Instruments also includes functionality that was provided by previously separate performance analysis products, like CHUD and Shark, making it a one-stop shop. It'd be wonderful if it included latency heat maps as well :-).
- Temperature Monitor: 3rd party software that can read various temperature probes.
- PMC == Performance Monitor Counters, aka CPU Performance Counters (CPC), Performance Instrumentation Counters (PICs), and more. These are processor hardware counters that are read via programmable registers on each CPU.
- DTT == DTraceToolkit scripts, many of which were ported by the Apple engineers and shipped by default with Mac OS X. ie, you should be able to run these immediately, eg, sudo runocc.d.
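For convenience, here is the footnote [1] one-liner expanded into a commented script (same logic; the predicate skips idle threads):

```
#!/usr/sbin/dtrace -s
/* per-CPU utilization: sample at 100 Hz, skipping idle threads */
#pragma D option aggsortkey

profile-100
/!(curthread->state & 0x80)/	/* 0x80 == TH_IDLE in xnu */
{
	@ = lquantize(cpu, 0, 1000, 1);
}

tick-1s
{
	printa(@);
	clear(@);
}
```

Run it with sudo; each one-second histogram shows samples by CPU id, and a CPU with close to 100 samples was busy for nearly the entire interval.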
Physical Resources, Advanced
component | type | metric |
---|---|---|
GPU | utilization | directly: DTrace [7]; atMonitor, "gpu"; indirect: Temperature Monitor; atMonitor, "gput" |
GPU | saturation | DTrace [7]; Instruments → OpenGL Driver, "Client GLWait Time" (maybe) |
GPU | errors | DTrace [7] |
Storage controller | utilization | iostat 1, compare to known IOPS/tput limits per-card |
Storage controller | saturation | DTrace and look for kernel queueing |
Storage controller | errors | DTrace the driver |
Network controller | utilization | system-wide: netstat -i 1, assume one busy controller and examine input/output "bytes" / known max (note: includes localhost traffic) |
Network controller | saturation | see network interface saturation |
Network controller | errors | see network interface errors |
CPU interconnect | utilization | for multi-processor systems, try Instruments → Counters, and relevant PMCs for CPU interconnect port I/O, and measure throughput / max
CPU interconnect | saturation | Instruments → Counters, and relevant PMCs for stall cycles
CPU interconnect | errors | Instruments → Counters, and relevant PMCs for whatever is available
Memory interconnect | utilization | Instruments → Counters, and relevant PMCs for memory bus throughput / max, or measure CPI and treat, say, 5+ as high utilization (a worked example is in the notes below); Shark had "Processor bandwidth analysis" as a feature, which either was or included memory bus throughput, but I never used it
Memory interconnect | saturation | Instruments → Counters, and relevant PMCs for stall cycles
Memory interconnect | errors | Instruments → Counters, and relevant PMCs for whatever is available
I/O interconnect | utilization | Instruments → Counters, and relevant PMCs for throughput / max if available; inference via known throughput from iostat/...
I/O interconnect | saturation | Instruments → Counters, and relevant PMCs for stall cycles
I/O interconnect | errors | Instruments → Counters, and relevant PMCs for whatever is available
- [7] I haven't found a shipped tool to provide GPU statistics easily. I'd like a gpustat that behaved like mpstat, with at least the columns: utilization, saturation, errors. Until there is such a tool, you could trace GPU activity (at least the scheduling of activity) using DTrace on the graphics drivers. It won't be easy. I imagine Instruments will at some point add a GPU instrument set (other than the OpenGL instruments), otherwise, 3rd party tools can be used, like atMonitor.
- CPI == Cycles Per Instruction (others use IPC == Instructions Per Cycle).
- I/O interconnect: this includes the CPU to I/O controller busses, the I/O controller(s), and device busses (eg, PCIe).
- Using PMCs is typically a lot of work. This involves researching the processor manuals to see what counters are available and what they mean, and then collecting and interpreting them. I've used them on other OSes, but haven't used them all under Instruments → Counters, so I don't know if there's a hitch with anything there. Good luck.
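- As a worked example of the CPI approach mentioned above: CPI = cycles / instructions retired over the same interval. If Counters showed roughly 4,000M cycles but only 500M retired instructions during a one-second window, then CPI = 4000 / 500 = 8, well above the ~5 rule of thumb, suggesting the CPUs spend most of their cycles stalled (often waiting on memory). The exact cycle and instruction counter names are processor-specific, so check the processor's PMC documentation.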
Software Resources
component | type | metric |
---|---|---|
Kernel mutex | utilization | DTrace and lockstat provider for held times |
Kernel mutex | saturation | DTrace and lockstat provider for contention times [8] |
Kernel mutex | errors | DTrace and fbt provider for return probes and error status |
User mutex | utilization | plockstat -H (held time); DTrace plockstat provider |
User mutex | saturation | plockstat -C (contention); DTrace plockstat provider |
User mutex | errors | DTrace plockstat and pid providers, for EDEADLK, EINVAL, ... see pthread_mutex_lock(3)
Process capacity | utilization | current/max using: ps -e | wc -l / sysctl kern.maxproc; top, "Processes:" also shows current |
Process capacity | saturation | not sure this makes sense |
Process capacity | errors | "can't fork()" messages |
File descriptors | utilization | system-wide: sysctl kern.num_files / sysctl kern.maxfiles; per-process: can figure out using lsof and ulimit -n |
File descriptors | saturation | I don't think this one makes sense, as if it can't allocate or expand the array, it errors; see fdalloc() |
File descriptors | errors | dtruss or custom DTrace to look for errno == EMFILE on syscalls returning fds (eg, open(), accept(), ...) |
- [8] eg, showing adaptive lock block time totals (in nanoseconds) by calling function name: dtrace -n 'lockstat:::adaptive-block { @[caller] = sum(arg1); } END { printa("%40a%@16d ns\n", @); }'
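As a quick way to check the process capacity and file descriptor rows above from Terminal (a sketch; PID is a placeholder for a process you care about):

```
# Process capacity: current vs maximum
ps -e | wc -l                          # approximate current process count
sysctl kern.maxproc                    # the limit

# File descriptors, system-wide: current vs maximum
sysctl kern.num_files kern.maxfiles

# File descriptors, per-process: open files vs the per-process limit
sudo lsof -p PID | wc -l               # PID is hypothetical; pick a real one
ulimit -n                              # limit for the current shell

# Errors: count syscalls failing with EMFILE (too many open files)
sudo dtrace -n 'syscall::open*:return,syscall::accept:return
    /errno == EMFILE/ { @[execname] = count(); }'
```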
Other Tools
I didn't include fs_usage, sc_usage, sample, spindump, heap, vmmap, malloc_history, leaks, and other useful Mac OS X performance tools, as here I'm beginning with questions (the methodology) and only including tools that answer them. This is instead of the other way around: listing all the tools and trying to find a use for them. Those other tools are useful for other methodologies, which can be used after this one.
What's Next
See the USE Method for the follow-up methodologies after identifying a possible bottleneck. If you complete this checklist but still have a performance issue, move onto other methodologies: drill-down analysis and latency analysis.
For more performance analysis, also see my earlier post on Top 10 DTrace Scripts for Mac OS X.
Acknowledgements
Resources used:
- Instruments User Guide and Instruments User Reference
- Apple's Performance Tools summary
- man pages
- xnu source code (kernel)
- Mac OS X Internals, by Amit Singh, and his online list of performance tools
Filling in this checklist has required a lot of research, testing, and experimentation. Please reference back to this post if it helps you develop related material.
It's quite possible I've missed something or included the wrong metric somewhere (sorry); I'll update the post to fix these as they are understood, and note the update date at the top.
Also see my USE method performance checklists for Solaris, SmartOS, Linux, and FreeBSD.