I originally posted this at http://blogs.sun.com/brendan/entry/amd64_pics.
I recently wanted to gather some numbers on CPU and memory system performance, for AMD64 CPUs. I reached a point where I searched the Internet for other Solaris AMD64 PIC (Performance Instrumentation Counters) analysis, and found little. I hope to improve this with some blog entries. In this part I'll introduce PIC observability, and demonstrate measuring CPI (cycles per instruction) for different workloads. Note that PICs are also known as Performance Monitoring Counters (PMCs).
To see why PICs are important, the following are the sort of questions that PIC analysis can answer:
- What is the Level 2 cache hit rate?
- What is the Level 2 cache miss volume?
- What is the hit rate and miss volume for the TLB?
- What is my memory bus utilization?
Questions 1 and 2 relate to the CPU hardware cache, where Level 2 is the E$ (meaning either "external" cache or "embedded" cache, depending on the CPU architecture). For optimal performance we want to see a high hit rate, and more importantly, a low miss volume.
Question 3 concerns a component of the memory management unit: the translation lookaside buffer (TLB). This processes and caches virtual to physical memory page translations. It can consume a lot of CPU (the worst I've seen is 60%), and it can be tuned. A good document for understanding this further is Taming Your Emu by Richard McDougall.
Question 4 seems obvious, the memory bus can be a bottleneck for system performance, so, how utilized is it? Answering this isn't easy, but it is usually possible by examining CPU PICs.
cpustat
There are many AMD64 CPU PICs available, which can be viewed using tools such as cpustat and cputrack. Running cpustat -h dumps the list:
# cpustat -h
Usage:
        cpustat [-c events] [-p period] [-nstD] [interval [count]]

        -c events specify processor events to be monitored
        -n        suppress titles
        -p period cycle through event list periodically
        -s        run user soaker thread for system-only events
        -t        include tsc register
        -D        enable debug mode
        -h        print extended usage information

        Use cputrack(1) to monitor per-process statistics.

        CPU performance counter interface: AMD Opteron & Athlon64

        event specification syntax:
        [picn=]<eventn>[,attr[n][=<val>]][,[picn=]<eventn>[,attr[n][=<val>]],...]

        event[0-3]:
        FP_dispatched_fpu_ops FP_cycles_no_fpu_ops_retired
        FP_dispatched_fpu_ops_ff LS_seg_reg_load
        LS_uarch_resync_self_modify LS_uarch_resync_snoop
        LS_buffer_2_full LS_locked_operation LS_retired_cflush
        LS_retired_cpuid DC_access DC_miss DC_refill_from_L2
        DC_refill_from_system DC_copyback DC_dtlb_L1_miss_L2_hit
        DC_dtlb_L1_miss_L2_miss DC_misaligned_data_ref
        DC_uarch_late_cancel_access DC_uarch_early_cancel_access
        DC_1bit_ecc_error_found DC_dispatched_prefetch_instr
        DC_dcache_accesses_by_locks BU_memory_requests
        BU_data_prefetch BU_system_read_responses
        BU_quadwords_written_to_system BU_cpu_clk_unhalted
        BU_internal_L2_req BU_fill_req_missed_L2 BU_fill_into_L2
        IC_fetch IC_miss IC_refill_from_L2 IC_refill_from_system
        IC_itlb_L1_miss_L2_hit IC_itlb_L1_miss_L2_miss
        IC_uarch_resync_snoop IC_instr_fetch_stall
        IC_return_stack_hit IC_return_stack_overflow
        FR_retired_x86_instr_w_excp_intr FR_retired_uops
        FR_retired_branches_w_excp_intr FR_retired_branches_mispred
        FR_retired_taken_branches FR_retired_taken_branches_mispred
        FR_retired_far_ctl_transfer FR_retired_resyncs
        FR_retired_near_rets FR_retired_near_rets_mispred
        FR_retired_taken_branches_mispred_addr_miscop
        FR_retired_fpu_instr FR_retired_fastpath_double_op_instr
        FR_intr_masked_cycles FR_intr_masked_while_pending_cycles
        FR_taken_hardware_intrs FR_nothing_to_dispatch
        FR_dispatch_stalls FR_dispatch_stall_branch_abort_to_retire
        FR_dispatch_stall_serialization FR_dispatch_stall_segment_load
        FR_dispatch_stall_reorder_buffer_full
        FR_dispatch_stall_resv_stations_full
        FR_dispatch_stall_fpu_full FR_dispatch_stall_ls_full
        FR_dispatch_stall_waiting_all_quiet
        FR_dispatch_stall_far_ctl_trsfr_resync_branc_pend
        FR_fpu_exception FR_num_brkpts_dr0 FR_num_brkpts_dr1
        FR_num_brkpts_dr2 FR_num_brkpts_dr3
        NB_mem_ctrlr_page_access NB_mem_ctrlr_page_table_overflow
        NB_mem_ctrlr_turnaround NB_mem_ctrlr_bypass_counter_saturation
        NB_ECC_errors NB_sized_commands NB_probe_result
        NB_gart_events NB_ht_bus0_bandwidth NB_ht_bus1_bandwidth
        NB_ht_bus2_bandwidth NB_sized_blocks NB_cpu_io_to_mem_io
        NB_cache_block_commands

        attributes: edge pc inv cmask umask nouser sys

        See Chapter 10 of the "BIOS and Kernel Developer's Guide
        for the AMD Athlon 64 and AMD Opteron Processors,"
        AMD publication #26094
There are over eighty event names above, such as "FP_dispatched_fpu_ops", which describe the PICs available. On my AMD Opteron CPUs you can measure four of these at a time; the chosen events are provided as arguments to cpustat, eg:
# cpustat -c IC_fetch,DC_access,DC_dtlb_L1_miss_L2_hit,DC_dtlb_L1_miss_L2_miss 0.25
   time cpu event      pic0      pic1      pic2      pic3
  0.257   0  tick   6406429   8333198     45826      5515
  0.257   1  tick   3333442   3942694     24682      4409
  0.507   1  tick   6450964   8229104     44046      5713
  0.507   0  tick   2359697   2828683     14365      4415
  0.757   0  tick   2490406   3060416     16458      4901
  0.757   1  tick   7292986   9530806     68956      6490
  1.007   0  tick   2514008   3063049     15037      3863
  1.007   1  tick   6057048   7747580     42415      6083
^C
In the above example I printed four PICs every 0.25 seconds for each CPU (this is a 2 x virtual CPU server). The cpu column shows that the output is slightly shuffled: a harmless side effect of the way cpustat was coded (it pbinds a libcpc consumer onto each CPU in the available processor set, and those threads write to STDOUT in whatever order they run). The PICs are provided by a small number of programmable hardware registers, so there is no ideal way around the four-at-a-time limit; however, cpustat does support cycling measurements through different sets of PICs.
Reference Documentation
Since different CPUs provide different PICs, the guide mentioned at the bottom of the cpustat -h output will list what PICs your CPU type provides. It is important to read these guides carefully. For example, PICs that track cache misses may have some exceptions to what is considered a "miss".
I spent a while with AMD's #26094 guide, but found that the PIC descriptions raised more questions than they answered (try finding something as basic as an instruction count). If you find yourself in a similar situation, it can help to create known workloads and then examine which metrics move by a matching amount. I used this approach to confirm which PICs provide cycle counts and instruction counts.
I did eventually find two good resources on AMD PICs,
- Basic Performance Measurements by Paul J. Drongowski of AMD: a summary of the most important PICs and suggestions for interpretation.
- Opteron Hardware Performance Counters by Richard Smith of Sun: a crash course in AMD Architecture from a PICs perspective.
You may notice some really interesting PICs mentioned, such as memory locality observability in the newer revs of AMD CPUs.
If you are interested in PIC analysis for any CPU type, see chapter 8 "Performance Counters" in Solaris Performance and Tools, by Richard McDougall, Jim Mauro and myself. One of the metrics we made sure to include in the book was CPI (cycles per instruction), as it proves to be a useful starting point for understanding CPU behavior.
Example: CPI
The cycles per instruction metric (sometimes measured as IPC: instructions per cycle) is a useful ratio, and (depending on the CPU type) it is fairly easy to measure. If the measured CPI ratio is low, more instructions can be dispatched in a given time, which usually means higher performance. High CPI means instructions are stalling, usually on main memory bus activity.
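To make the ratio concrete, here is a quick arithmetic sketch using awk. The counter values are hypothetical, standing in for the BU_cpu_clk_unhalted and FR_retired_x86_instr_w_excp_intr columns that cpustat reports:

```shell
# CPI = unhalted cycles / retired instructions.
# These counts are hypothetical, not measured from a live system.
cycles=5610000000
instructions=16500000000
awk -v c="$cycles" -v i="$instructions" \
    'BEGIN { printf "CPI = %.2f\n", c / i }'
# prints: CPI = 0.34
```

Note that BU_cpu_clk_unhalted counts only unhalted cycles, which is why the script later in this post also samples the TSC (via cpustat -t) to derive utilization.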
The output of cpustat can be formatted with a little scripting. I wrote the following script, amd64cpiu, which uses a little shell and Perl to aggregate and print the output:
#!/usr/bin/sh
#
# amd64cpiu - measure CPI and Utilization on AMD64 processors.
#
# USAGE:    amd64cpiu [interval]
#      eg,
#           amd64cpiu 0.1       # for 0.1 second intervals
#
# CPI is cycles per instruction, a metric that increases due to activity
# such as main memory bus lookups.
#
# ident "@(#)amd64cpiu.sh 1.1 07/02/17 SMI"

interval=${1:-1}                # default interval, 1 second
set -- `kstat -p unix:0:system_misc:ncpus`      # assuming no psets,
cpus=$2                         # number of CPUs

pics='BU_cpu_clk_unhalted'                      # cycles
pics=$pics,'FR_retired_x86_instr_w_excp_intr'   # instructions

/usr/sbin/cpustat -tc $pics $interval | perl -e '
        printf "%16s %8s %8s\n", "Instructions", "CPI", "%CPU";
        while (<>) {
                next if ++$lines == 1;
                split;
                $total += $_[3];
                $cycles += $_[4];
                $instructions += $_[5];
                if ((($lines - 1) % '$cpus') == 0) {
                        printf "%16u %8.2f %8.2f\n", $instructions,
                            $cycles / $instructions,
                            $total ? 100 * $cycles / $total : 0;
                        $total = 0;
                        $cycles = 0;
                        $instructions = 0;
                }
        }
'
This script prints a column for CPI and for percent CPU utilization. I've used the PICs that were suggested in the AMD article, and from testing they do appear to be the best ones for measuring CPI.
Here amd64cpiu is used to examine a CPU-bound workload of fast register-based instructions:
# ./amd64cpiu.sh
    Instructions      CPI     %CPU
     16509657954     0.34    97.56
     16550162001     0.33    98.54
     16523746049     0.34    98.41
     16510783100     0.34    98.32
     16497135723     0.34    98.29
^C
The CPI is around 0.34. This is about the best (lowest) CPI to expect from this AMD64 architecture, which can dispatch up to three instructions per clock cycle (1/3 ≈ 0.33).
Now for a memory bound workload of sequential 1 byte memory reads:
# ./amd64cpiu.sh
    Instructions      CPI     %CPU
      4883935299     1.12    97.60
      4852961204     1.12    97.03
      4884120645     1.13    97.69
      4898818096     1.12    97.92
      4895064839     1.12    97.80
^C
Things are starting to slow down due to the extra overhead of the memory requests. Many reads will be satisfied from the Level 1 cache, some from the slower Level 2 cache, and occasionally a cache line will be read from main memory. This additional overhead slows us to about 1.12 CPI, and we complete fewer instructions for the same %CPU.
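One way to reason about a figure like this is a simple blended-cost model. Every rate and latency below is an illustrative assumption, not a measurement from this system; tuning them against the DC_* cache and refill PICs would be needed to make the model meaningful:

```shell
# Toy model: CPI ~= base + L1miss*L2lat + L2miss*memlat.
# All inputs are illustrative assumptions.
awk 'BEGIN {
    base   = 0.34     # register-bound CPI, as measured earlier
    l1miss = 0.05     # assumed fraction of instructions missing L1
    l2lat  = 12       # assumed L2 hit latency, in cycles
    l2miss = 0.002    # assumed fraction missing L2 (to main memory)
    memlat = 200      # assumed main memory latency, in cycles
    printf "modeled CPI = %.2f\n", base + l1miss * l2lat + l2miss * memlat
}'
# prints: modeled CPI = 1.34
```

With these made-up inputs the model lands near the measured range, which is the point: a modest miss rate against slow main memory dominates CPI.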
Watch what happens when our memory workload performs 1 byte scattered reads (100 Kbytes apart):
# ./amd64cpiu.sh
    Instructions      CPI     %CPU
       653300388     8.53    98.36
       648496314     8.53    98.37
       644163952     8.54    97.75
       648941939     8.53    98.35
       648507176     8.53    98.37
^C
Many of the reads will not be in the CPU caches, so most now require a memory bus lookup. Our CPI is around 8.53, some 25 times slower than the register-based CPU instructions. %CPU is still about the same, but it now buys far fewer instructions in total.
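The 25x figure falls straight out of the two measured CPIs:

```shell
# Ratio of memory-bound CPI to register-bound CPI, from the runs above.
awk 'BEGIN { printf "slowdown = %.0fx\n", 8.53 / 0.34 }'
# prints: slowdown = 25x
```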
As you can see, CPI sheds light on memory bus activity: considerable insight from such a simple metric.
Now for a real application: Here I watch as Sun's C compiler chews through a source tree:
# ./amd64cpiu.sh
    Instructions      CPI     %CPU
      2624028943     1.26    58.52
      2992167837     1.19    63.17
      2327129316     1.26    52.08
      2046997158     1.27    46.14
      2414376864     1.23    52.80
      3305351199     1.23    70.72
^C
That's not so bad: any memory access instructions must be hitting the caches fairly often (something we could confirm by measuring other PICs).
Beware of output such as the following:
# ./amd64cpiu.sh
    Instructions      CPI     %CPU
        22695257     1.82     0.73
        22197894     1.75     0.69
        49626271     2.16     1.90
       102731779     2.21     4.04
       104795796     1.49     2.78
^C
The CPUs are fairly idle (less than 5% utilized), so each CPI sample covers very little work and is less likely to be useful for indicating performance issues.
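To guard against being misled, amd64cpiu output can be post-processed to flag samples where utilization is too low for CPI to mean much. This sketch uses a here-document of sample rows in place of live output, and an arbitrary 10% threshold:

```shell
# Flag rows whose %CPU (third column) is below 10: at such low
# utilization the CPI sample covers too little work to trust.
awk 'NR == 1 { print; next }
     { print $0 ($3 < 10 ? "   <- idle, CPI unreliable" : "") }' <<'EOF'
    Instructions      CPI     %CPU
        22695257     1.82     0.73
     16509657954     0.34    97.56
EOF
```

In practice you would pipe the live script output through the awk filter instead of feeding it a here-document.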
Suggested Actions: CPI
While many PICs produce interesting measurements, it's much more useful if there is some action we can take based on the results. The following is a list of suggestions based on CPI.
Firstly, to even be considering this list you need a known and measured performance issue. If one or more CPUs are 100% busy, then that may be a performance issue and checking CPI can be useful; if your CPUs are idle, then checking CPI probably won't help. As for measurement, it is especially helpful to quantify an issue: eg, average latency is 150 ms. Tools such as DTrace can take these measurements.
- Measure other PICs. CPI is a high level metric, and there are many other PICs that will explain why CPI is high or low, such as cache hits and misses, TLB hits and misses, and memory locality.
- If CPI is low:
- Examine the application for unnecessary CPU work (eg, using DTrace)
- Get faster CPUs
- Get more CPUs
- If CPI is high:
- Examine the application for unnecessary memory work
- Recompile your application with optimization, and with Sun's C compiler
- Consider using processor sets to improve memory locality
- Get CPUs with larger caches
- Test different CPU architectures (multi-core/multi-thread may improve performance as bus distance is shorter)
I hope this has been helpful. And there are many more useful metrics to observe on AMD64 CPUs: CPI is just the beginning.