USE Method: Solaris Performance Checklist
The USE Method provides a strategy for performing a complete check of system health, identifying common bottlenecks and errors. For each system resource, metrics for utilization, saturation and errors are identified and checked. Any issues discovered are then investigated using further strategies.
This is an example USE-based metric list for the Solaris family of operating systems. I'm writing this for later versions of Solaris 10, Oracle Solaris 11, and illumos-based systems such as SmartOS and OmniOS. It is primarily intended for system administrators of the physical systems (not tenants of cloud or zone instances; for those users, see my SmartOS performance checklist).
Physical Resources
component | type | metric |
---|---|---|
CPU | utilization | per-cpu: mpstat 1, "usr" + "sys"; system-wide: vmstat 1, "us" + "sy"; per-process: prstat -c 1 ("CPU" == recent), prstat -mLc 1 ("USR" + "SYS"); per-kernel-thread: lockstat -Ii rate, DTrace profile stack() |
CPU | saturation | system-wide: uptime, load averages; vmstat 1, "r"; DTrace dispqlen.d (DTT) for a better "vmstat r"; per-process: prstat -mLc 1, "LAT" |
CPU | errors | fmadm faulty; cpustat (CPC) for whatever error counters are supported (eg, thermal throttling) |
Memory capacity | utilization | system-wide: vmstat 1, "free" (main memory), "swap" (virtual memory); per-process: prstat -c, "RSS" (main memory), "SIZE" (virtual memory) |
Memory capacity | saturation | system-wide: vmstat 1, "sr" (bad now), "w" (was very bad); vmstat -p 1, "api" (anon page ins == pain), "apo"; per-process: prstat -mLc 1, "DFL"; DTrace anonpgpid.d (DTT), vminfo:::anonpgin on execname (one-liner below the table) |
Memory capacity | errors | fmadm faulty and prtdiag for physical failures; fmstat -s -m cpumem-retire (ECC events); DTrace failed malloc()s |
Network Interfaces | utilization | nicstat (use the latest version); kstat; dladm show-link -s -i 1 interface |
Network Interfaces | saturation | nicstat; kstat for whatever custom statistics are available (eg, "nocanputs", "defer", "norcvbuf", "noxmtbuf"); netstat -s, retransmits |
Network Interfaces | errors | netstat -i, error counters; dladm show-phys; kstat for extended errors, look in the interface and "link" statistics (there are often custom counters for the card); DTrace for driver internals |
Storage device I/O | utilization | system-wide: iostat -xnz 1, "%b"; per-process: DTrace iotop |
Storage device I/O | saturation | iostat -xnz 1, "wait"; DTrace iopending (DTT), sdqueue.d (DTB) |
Storage device I/O | errors | iostat -En; DTrace I/O subsystem, eg, ideerr.d (DTB), satareasons.d (DTB), scsireasons.d (DTB), sdretry.d (DTB) |
Storage capacity | utilization | swap: swap -s; file systems: df -h; plus other commands depending on FS type |
Storage capacity | saturation | not sure this one makes sense - once it's full, ENOSPC |
Storage capacity | errors | DTrace; /var/adm/messages file system full messages |
Storage controller | utilization | iostat -Cxnz 1, compare to known IOPS/tput limits per-card |
Storage controller | saturation | look for kernel queueing: sd (iostat "wait" again), ZFS zio pipeline |
Storage controller | errors | DTrace the driver, eg, mptevents.d (DTB); /var/adm/messages |
Network controller | utilization | infer from nicstat and known controller max tput |
Network controller | saturation | see network interface saturation |
Network controller | errors | kstat for whatever is there / DTrace |
CPU interconnect | utilization | cpustat (CPC) for CPU interconnect ports, tput / max (eg, see the amd64htcpu script) |
CPU interconnect | saturation | cpustat (CPC) for stall cycles |
CPU interconnect | errors | cpustat (CPC) for whatever is available |
Memory interconnect | utilization | cpustat (CPC) for memory busses, tput / max; or CPI greater than, say, 5; CPC may also have local vs remote counters |
Memory interconnect | saturation | cpustat (CPC) for stall cycles |
Memory interconnect | errors | cpustat (CPC) for whatever is available |
I/O interconnect | utilization | busstat (SPARC only); cpustat for tput / max if available; inference via known tput from iostat/nicstat/... |
I/O interconnect | saturation | cpustat (CPC) for stall cycles |
I/O interconnect | errors | cpustat (CPC) for whatever is available |
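For example, the memory capacity saturation check above (vminfo:::anonpgin) can be done as a DTrace one-liner. This is a minimal sketch: the probe fires once per anonymous page-in, so any process that shows up here is waiting on the (much slower) swap devices and is likely suffering memory pressure.

```
# count anonymous page-ins by process name; Ctrl-C prints the summary
dtrace -n 'vminfo:::anonpgin { @[execname] = count(); }'
```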
- CPU utilization: a single hot CPU can be caused by a single hot thread, or a mapped hardware interrupt. Relieving the bottleneck usually involves tuning to use more CPUs in parallel.
- lockstat and plockstat are DTrace-based since Solaris 10 FCS.
- vmstat "r": this is coarse as it is only updated once per second.
- CPC == CPU Performance Counters (aka "Performance Instrumentation Counters" (PICs), or "Performance Monitoring Events"), read via programmable registers on each CPU, by cpustat(1M) or the DTrace "cpc" provider. These have traditionally been hard to work with due to differences between CPUs, but are getting much easier with the PAPI standard. Still, expect to spend some quality time (days) with the processor vendor manuals (what "cpustat -h" tells you to read), and to post-process cpustat with awk or perl (see the sketch after this list). See my short talk (video) about CPC (2010). (Many years ago, I made a toolkit including CPC scripts - CacheKit - that was too much work to maintain.)
- Memory capacity utilization: interpreting vmstat's "free" has been tricky across different Solaris versions (we documented it in the Perf & Tools book), due to the different ways it was calculated, and tunables that affect when the system will kick off the page scanner. It'll also typically shrink as the kernel uses unused memory for caching (ZFS ARC).
- Be aware that kstat can report bad data (so can any tool); there isn't really a test suite for kstat data, and engineers can add new code paths and forget to add the counters.
- DTT == DTraceToolkit scripts, DTB == DTrace book scripts.
- CPI == Cycles Per Instruction (others use IPC == Instructions Per Cycle).
- I/O interconnect: this includes the CPU to I/O controller busses, the I/O controller(s), and device busses (eg, PCIe).
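Putting the CPC and CPI notes together, here is a hedged sketch of post-processing cpustat with awk to print CPI per CPU. The pic event names are only examples (they happen to match some UltraSPARC processors); run "cpustat -h" to see what your CPUs actually support, and check that the column positions match your output.

```
# compute CPI (cycles per instruction) per CPU from cpustat samples;
# the pic event names are processor-specific examples -- see "cpustat -h"
cpustat -c pic0=Cycle_cnt,pic1=Instr_cnt 1 5 | \
    awk '$3 == "tick" && $5 > 0 { printf "CPU %-3s CPI: %.2f\n", $2, $4 / $5 }'
```

A high CPI (say, over 5) suggests stall cycles, often waiting on memory I/O, rather than a compute-bound workload.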
Software Resources
component | type | metric |
---|---|---|
Kernel mutex | utilization | lockstat -H (held time); DTrace lockstat provider |
Kernel mutex | saturation | lockstat -C (contention); DTrace lockstat provider; spinning shows up with dtrace -n 'profile-997 { @[stack()] = count(); }' |
Kernel mutex | errors | lockstat -E, eg, recursive mutex enter (other errors can cause kernel lockup/panic; debug with mdb -k) |
User mutex | utilization | plockstat -H (held time); DTrace plockstat provider |
User mutex | saturation | plockstat -C (contention); prstat -mLc 1, "LCK"; DTrace plockstat provider |
User mutex | errors | DTrace plockstat and pid providers, for EAGAIN, EINVAL, EPERM, EDEADLK, ENOMEM, EOWNERDEAD, ... see pthread_mutex_lock(3C) |
Process capacity | utilization | sar -v, "proc-sz"; kstat, "unix:0:var:v_proc" for max, "unix:0:system_misc:nproc" for current; DTrace (`nproc vs `max_nprocs) |
Process capacity | saturation | not sure this makes sense; you might get queueing on pidlinklock in pid_allocate(), as it scans for available slots once the table gets full |
Process capacity | errors | "can't fork()" messages |
Thread capacity | utilization | user-level: kstat, "unix:0:lwp_cache:buf_inuse" for current, prctl -n zone.max-lwps -i zone ZONE for max; kernel: mdb -k or DTrace, "nthread" for current, limited by memory |
Thread capacity | saturation | threads blocking on memory allocation; at this point the page scanner should be running (vmstat "sr"), else examine using DTrace/mdb. |
Thread capacity | errors | user-level: pthread_create() failures with EAGAIN, EINVAL, ...; kernel: thread_create() blocks for memory but won't fail. |
File descriptors | utilization | system-wide (no limit other than RAM); per-process: pfiles vs ulimit or prctl -t basic -n process.max-file-descriptor PID; a quicker check than pfiles is ls /proc/PID/fd \| wc -l |
File descriptors | saturation | does this make sense? I don't think there is any queueing or blocking, other than on memory allocation. |
File descriptors | errors | truss or DTrace (better) to look for errno == EMFILE on syscalls returning fds (eg, open(), accept(), ...). |
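As a sketch of the file descriptor errors check from the last row: this one-liner counts EMFILE by process name and syscall. The probe names are assumptions for the fd-returning syscalls of interest; confirm what exists on your system first with dtrace -ln 'syscall::accept*:return' (and similar).

```
# count file descriptor exhaustion (EMFILE) by process name and syscall
dtrace -n 'syscall::open*:return,syscall::accept*:return
    /errno == EMFILE/ { @[execname, probefunc] = count(); }'
```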
- lockstat/plockstat often drop events due to load; to avoid this I often roll my own using the DTrace lockstat/plockstat providers (there are examples in the DTrace book, and a sketch after this list).
- File descriptor utilization: while other OSes have a system-wide limit, Solaris doesn't (at least at the moment, this could change; see my writeup about it).
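As an example of rolling your own (see the lockstat note above): a minimal sketch using the lockstat provider's adaptive-block probe, which fires when a thread blocks on an adaptive mutex, with arg1 as the sleep time in nanoseconds.

```
# total time blocked on kernel adaptive mutexes (ns), keyed by kernel stack;
# aggregating in-kernel avoids the dropped events lockstat -C can suffer
dtrace -n 'lockstat:::adaptive-block { @[stack()] = sum(arg1); }'
```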
What's Next
See the USE Method for the follow-up strategies after identifying a possible bottleneck. If you complete this checklist but still have a performance issue, move on to other strategies: drill-down analysis and latency analysis.