USE Method: SmartOS Performance Checklist

The USE Method provides a strategy for performing a complete check of system health, identifying common bottlenecks and errors. For each system resource, metrics for utilization, saturation and errors are identified and checked. Any issues discovered are then investigated using further strategies.

This is an example of a USE-based metric list for use within a SmartOS SmartMachine (Zone), such as those provided by the Joyent Public Cloud. SmartMachines run on the illumos kernel, so this list should also be mostly relevant for OmniOS Zones and, to a lesser degree (due to some missing features), Solaris Zones. This is primarily intended for users of the zones. For the system administrators of the physical systems (via the Global Zone), also see the Solaris checklist, which has greater visibility.

Cloud limits (software resource controls) are listed first, as they are usually encountered before the physical limits.

Cloud Limits

These cover CPU, memory, disk I/O (file system), and network.

component	type	metric
CPU cap	utilization	`sm-cpuinfo` (previously `jinf -c`); raw counters: `kstat -p caps::cpucaps_zone*:`, "usage" == current CPU used, "value" == CPU cap
CPU cap	saturation	`uptime` load averages are zone-aware; per-process: `prstat -mLc 1`, "LAT"; rough counter: `kstat -p caps::cpucaps_zone*:above_sec`
CPU cap	errors	N/A
Memory cap	utilization	`sm-meminfo rss` for main memory (previously `jinf -m`); `sm-meminfo swap` for virtual memory; `zonememstat`, "RSS" vs "CAP"; `prstat -Z`, zone "RSS", "SIZE" (VM); raw counters: `kstat -p memory_cap:::`, "rss" vs "physcap", "swap" vs "swapcap"
Memory cap	saturation	`zonememstat`, increasing "NOVER" (# over) and "POUT" (paged out); per-process: `prstat -mLc 1`, "DFL"; some raw counters: `kstat -p memory_cap:::anonpgin`
Memory cap	errors	DTrace failed malloc()s; raw counters: `kstat -p memory_cap:::anon_alloc_fail`
FS I/O throttle	utilization	N/A - it kicks in only when needed (see saturation)
FS I/O throttle	saturation	`vfsstat`, "d/s" (delays/sec), and magnitude of "del_t" (average delay time, us)
FS I/O throttle	errors	N/A
FS capacity	utilization	`df -h`, "used" / "size"
FS capacity	saturation	once it's full, ENOSPC
FS capacity	errors	DTrace errno for FS syscalls; /var/adm/messages file system full messages
Network cap	utilization	`dladm show-linkprop -p maxbw` for max bandwidth (if set); `dladm show-link -s -i 1 net0`, for current throughput; `nicstat` can also show throughput
Network cap	saturation	not available from within a zone (need to DTrace mac_bw_state & SRS_BW_ENFORCED)
Network cap	errors	N/A

For the Joyent Public Cloud, the CPU cap is the bursting limit. You are bursting when your CPU usage is over the kstat "caps::cpucaps_zone*:baseline". If everyone bursts at the same time, you should still be able to get at least the baseline, which is provided by the Fair Share Scheduler (FSS).
sm-cpuinfo is from the smtools package.
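
As a rough illustration of the CPU cap rows and the bursting note above, the raw counters can be turned into ratios. This is only a sketch: it assumes `kstat -p` prints one `module:instance:name:statistic` key and its value per line, that a single cap kstat is visible in the zone, and that "usage", "value" and "baseline" share the same units. The function names and the sample numbers are hypothetical.

```python
def parse_kstat(text):
    """Map statistic name -> raw string value from `kstat -p` style output."""
    stats = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # skip blank lines and statistics printed without a value
        key, value = parts
        # The statistic name is the last colon-separated field of the key.
        stats[key.rsplit(":", 1)[-1]] = value.strip()
    return stats


def cpu_cap_summary(text):
    """Return CPU cap utilization (usage / cap) and whether usage exceeds baseline."""
    stats = parse_kstat(text)
    usage = float(stats["usage"])
    cap = float(stats["value"])
    # Assumption: if no "baseline" statistic is present, treat the cap as the baseline.
    baseline = float(stats.get("baseline", cap))
    return {
        "utilization": usage / cap if cap else 0.0,
        "bursting": usage > baseline,
    }


# Made-up sample in the assumed `kstat -p caps::cpucaps_zone*:` format.
SAMPLE = (
    "caps:5:cpucaps_zone_5:above_sec\t0\n"
    "caps:5:cpucaps_zone_5:baseline\t50\n"
    "caps:5:cpucaps_zone_5:usage\t120\n"
    "caps:5:cpucaps_zone_5:value\t400\n"
)

if __name__ == "__main__":
    print(cpu_cap_summary(SAMPLE))
```

On a live zone the same functions could be fed the output of the `kstat -p` command from the table instead of the sample string.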

Storage devices (disks) are not listed, since limits for storage I/O are imposed at the file system layer.
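
The memory cap rows in the table above combine point-in-time ratios ("rss" vs "physcap", "swap" vs "swapcap") with counters that only matter when they increase ("anonpgin", "anon_alloc_fail"). A minimal sketch of that reading, assuming two snapshots of the `memory_cap` kstats have already been parsed into dictionaries of integers; the function name and the sample values are hypothetical.

```python
def memory_cap_report(before, after):
    """Compare two snapshots of memory_cap statistics (name -> integer value)."""
    report = {
        # Utilization: current usage as a fraction of each cap.
        "rss_util": after["rss"] / after["physcap"],
        "swap_util": after["swap"] / after["swapcap"],
        # Counters are cumulative, so only the change between snapshots matters.
        "anonpgin_delta": after["anonpgin"] - before["anonpgin"],
        "anon_alloc_fail_delta": after["anon_alloc_fail"] - before["anon_alloc_fail"],
    }
    report["saturated"] = report["anonpgin_delta"] > 0
    report["errors"] = report["anon_alloc_fail_delta"] > 0
    return report


# Made-up snapshots taken some interval apart.
BEFORE = {"rss": 700 * 2**20, "physcap": 2**30, "swap": 512 * 2**20,
          "swapcap": 2 * 2**30, "anonpgin": 100, "anon_alloc_fail": 0}
AFTER = {"rss": 768 * 2**20, "physcap": 2**30, "swap": 512 * 2**20,
         "swapcap": 2 * 2**30, "anonpgin": 140, "anon_alloc_fail": 0}

if __name__ == "__main__":
    print(memory_cap_report(BEFORE, AFTER))
```

Treating any increase in "anonpgin" as saturation is a deliberately simple threshold for this sketch; in practice the rate of increase is more informative than a yes/no flag.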

Physical Resources

Since Zones are OS-Virtualization (OS partitioning), the physical resources are not emulated or virtualized, and many of the observability tools will show you the entire physical system. This can be both good – you can really understand what's going on, and confusing – why are the resources busy when my system is idle? (it's someone else; you can't see their process address space).

component	type	metric
CPU	utilization	per-cpu: `mpstat 1`, "usr" + "sys"; system-wide: `vmstat 1`, "us" + "sy"; per-process: `prstat -c 1` ("CPU" == recent), `prstat -mLc 1` ("USR" + "SYS"); per-kernel-thread: not available from within a zone
CPU	saturation	system-wide: `vmstat 1`, "r"; per-process: `prstat -mLc 1`, "LAT"
CPU	errors	`fmdump`
Memory capacity	utilization	system-wide: `vmstat 1`, "free" (main memory), "swap" (virtual memory); per-process: `prstat -c`, "RSS" (main memory), "SIZE" (virtual memory)
Memory capacity	saturation	system-wide: `vmstat 1`, "sr" (bad now), "w" (was very bad); `vmstat -p 1`, "api" (anon page ins == pain), "apo"; per-process: `prstat -mLc 1`, "DFL"
Memory capacity	errors	`fmdump`; DTrace failed malloc()s
Network Interfaces	utilization	`nicstat` (see notes below); `kstat` (look for physical interface kstats, eg, `kstat -p | grep ifspeed` to find their names, and then `kstat -p ixgbe::mac:` for ixgbe interfaces)
Network Interfaces	saturation	`nicstat`; `kstat` for whatever custom statistics are available (eg, "nocanputs", "defer", "norcvbuf", "noxmtbuf"); `netstat -s`, retransmits
Network Interfaces	errors	`netstat -i`, error counters; `kstat` for extended errors, look in the interface and "link" statistics (there are often custom counters for the card); driver internals not available from within a zone
Storage device I/O	utilization	system-wide: `iostat -xnz 1`, "%b"
Storage device I/O	saturation	`iostat -xnz 1`, "wait"
Storage device I/O	errors	`iostat -En`; driver internals not available from within a zone
Storage capacity	utilization	swap: `swap -s`; file systems: `df -h`
Storage capacity	saturation	once it's full, ENOSPC
Storage capacity	errors	DTrace errno on FS syscalls; /var/adm/messages file system full messages
Storage controller	utilization	`iostat -Cxnz 1`, compare to known IOPS/tput limits per-card
Storage controller	saturation	look for kernel queueing: sd (iostat "wait" again)
Storage controller	errors	/var/adm/messages; driver internals not available from within a zone
Network controller	utilization	infer from `kstat` or `nicstat` and known controller max tput
Network controller	saturation	see network interface saturation
Network controller	errors	`kstat` for whatever is there; driver internals not available from within a zone
CPU interconnect	utilization	not available from within a zone
CPU interconnect	saturation	not available from within a zone
CPU interconnect	errors	not available from within a zone
Memory interconnect	utilization	not available from within a zone
Memory interconnect	saturation	not available from within a zone
Memory interconnect	errors	not available from within a zone
I/O interconnect	utilization	not available from within a zone
I/O interconnect	saturation	not available from within a zone
I/O interconnect	errors	not available from within a zone

For Joyent SmartMachines, Cloud Analytics (both the GUI and API) provide additional details from the physical system (global zone) that are not directly visible from within the SmartMachine.
CPU utilization: a single hot CPU can be caused by a single hot thread, or by a mapped hardware interrupt. Relief of the bottleneck usually involves tuning to use more CPUs in parallel.
nicstat may already be available in the SmartMachine; if not, there are both C and Perl versions (either may need a little tweaking to work properly).
vmstat "r": this is coarse as it is only updated once per second.
Memory capacity utilization: interpreting vmstat's "free" has been tricky across different Solaris versions (we documented it in the Perf & Tools book), due to the different ways it was calculated, and tunables that affect when the system will kick off the page scanner. It'll also typically shrink as the kernel uses unused memory for caching (ZFS ARC).
Be aware that kstat can report bad data (so can any tool); there isn't really a test suite for kstat data, and engineers can add new code paths and forget to add the counters.
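
When nicstat is not available, network interface utilization can be estimated from the kstat byte counters and "ifspeed" mentioned in the table. The sketch below assumes the interface exposes "rbytes64" and "obytes64" byte counters alongside "ifspeed" in bits per second, and that two snapshots were taken a known number of seconds apart; those statistic names, the function name, and the sample values are assumptions rather than something every driver guarantees.

```python
def nic_utilization(prev, curr, interval_s):
    """Fraction of link speed used per direction between two kstat snapshots."""
    speed_bps = curr["ifspeed"]
    if speed_bps <= 0 or interval_s <= 0:
        raise ValueError("need a positive link speed and interval")
    # Convert byte deltas to bits per second, then divide by the link speed.
    rx = 8 * (curr["rbytes64"] - prev["rbytes64"]) / interval_s / speed_bps
    tx = 8 * (curr["obytes64"] - prev["obytes64"]) / interval_s / speed_bps
    # For a full-duplex link, the busier direction is the one that saturates first.
    return {"rx": rx, "tx": tx, "max": max(rx, tx)}


# Made-up snapshots one second apart on a 10 Gbit/s link.
PREV = {"ifspeed": 10_000_000_000, "rbytes64": 1_000_000_000, "obytes64": 2_000_000_000}
CURR = {"ifspeed": 10_000_000_000, "rbytes64": 1_125_000_000, "obytes64": 2_625_000_000}

if __name__ == "__main__":
    print(nic_utilization(PREV, CURR, 1.0))
```

Given the caution above about kstat data, it is worth sanity-checking the result against a known transfer before trusting it.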

Software Resources

component	type	metric
Kernel mutex	utilization	not available from within a zone
Kernel mutex	saturation	`mpstat` "smtx"
Kernel mutex	errors	not available from within a zone
User mutex	utilization	`plockstat -H` (held time); DTrace plockstat provider
User mutex	saturation	`plockstat -C` (contention); `prstat -mLc 1`, "LCK"; DTrace plockstat provider
User mutex	errors	DTrace plockstat and pid providers, for EAGAIN, EINVAL, EPERM, EDEADLK, ENOMEM, EOWNERDEAD, ... see pthread_mutex_lock(3C)
Process capacity	utilization	system-wide: `kstat`, "unix:0:var:v_proc" for the max; current usage isn't available in a zone, but "unix:0:process_cache:slab_alloc" gives a rough idea; zone: `kstat`, "unix:0:system_misc:nproc" for current zone usage; `prctl -n zone.max-processes -i zone ZONE`, "privileged/system" for zone max, and "usage" for current usage
Process capacity	saturation	queueing on pidlinklock in pid_allocate(), as it scans for available slots once the table gets full.
Process capacity	errors	"can't fork()" messages
Thread capacity	utilization	user-level: `prctl -n zone.max-lwps -i zone ZONE`, "privileged/system" for zone max, and "usage" for current zone usage; kernel: limited by system memory - see memory usage.
Thread capacity	saturation	threads blocking on memory allocation - see memory cap usage.
Thread capacity	errors	user-level: pthread_create() failures with EAGAIN, EINVAL, ...; kernel: not available from within a zone
File descriptors	utilization	system-wide (no limit other than RAM); per-process: `pfiles` vs `ulimit` or `prctl -t basic -n process.max-file-descriptor PID`; a quicker check than pfiles is `ls /proc/PID/fd | wc -l`
File descriptors	saturation	I don't think there is any queueing or blocking, other than on memory allocation.
File descriptors	errors	`truss` or DTrace (better) to look for errno == EMFILE on syscalls returning fds (eg, open(), accept(), ...).

plockstat often drops events due to load; I often roll my own to avoid this using the DTrace plockstat provider (examples in the DTrace book).
File descriptor utilization: while other OSes have a system-wide limit, Solaris doesn't (at least at the moment, this could change; see my writeup about it).
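
The per-process file descriptor check from the table (entries under `/proc/PID/fd` compared with the `ulimit` value) can also be done by a process for itself. A minimal sketch, assuming `/proc/self/fd` is readable and taking the soft `RLIMIT_NOFILE` value as the limit; the function name is hypothetical.

```python
import os
import resource


def fd_utilization():
    """Return (open_fds, soft_limit, utilization) for the current process."""
    # Listing the directory briefly opens one extra descriptor, so the count
    # may be high by one.
    open_fds = len(os.listdir("/proc/self/fd"))
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft_limit == resource.RLIM_INFINITY or soft_limit <= 0:
        return open_fds, soft_limit, None  # no finite limit to compare against
    return open_fds, soft_limit, open_fds / soft_limit


if __name__ == "__main__":
    count, limit, util = fd_utilization()
    if util is None:
        print(f"{count} descriptors open, no finite soft limit")
    else:
        print(f"{count} of {limit} descriptors in use ({util:.1%})")
```

This only covers the calling process; for another process, the `pfiles` and `prctl` commands in the table remain the way to read its count and limit.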

What's Next

See the USE Method for the follow-up strategies after identifying a possible bottleneck. If you complete this checklist but still have a performance issue, move on to other strategies: drill-down analysis and latency analysis.

Also see the Solaris Performance Checklist if you have access to the physical host (global zone).