I originally posted this at http://blogs.sun.com/brendan/entry/amd64_pics.
I recently wanted to gather some numbers on CPU and memory system performance, for AMD64 CPUs. I reached a point where I searched the Internet for other Solaris AMD64 PIC (Performance Instrumentation Counters) analysis, and found little. I hope to improve this with some blog entries. In this part I'll introduce PIC observability, and demonstrate measuring CPI (cycles per instruction) for different workloads. Note that PICs are also known as Performance Monitoring Counters (PMCs).
To see why PICs are important, the following are the sort of questions that PIC analysis can answer:
- What is the Level 2 cache hit rate?
- What is the Level 2 cache miss volume?
- What is the hit rate and miss volume for the TLB?
- What is my memory bus utilization?
Questions 1 and 2 relate to the CPU hardware cache, where Level 2 is the E$ (meaning either "external" cache or "embedded" cache, depending on the CPU architecture). For optimal performance we want to see a high hit rate, and more importantly, a low miss volume.
Question 3 concerns a component of the memory management unit: the translation lookaside buffer (TLB). This processes and caches virtual to physical memory page translations. It can consume a lot of CPU (the worst I've seen is 60%), and it can be tuned. A good document for understanding this further is Taming Your Emu by Richard McDougall.
Question 4 seems obvious, the memory bus can be a bottleneck for system performance, so, how utilized is it? Answering this isn't easy, but it is usually possible by examining CPU PICs.
cpustat
There are many AMD64 CPU PICs available, which can be viewed using tools such as cpustat and cputrack. Running cpustat -h dumps the list:
# cpustat -h
Usage:
        cpustat [-c events] [-p period] [-nstD] [interval [count]]

        -c events specify processor events to be monitored
        -n        suppress titles
        -p period cycle through event list periodically
        -s        run user soaker thread for system-only events
        -t        include tsc register
        -D        enable debug mode
        -h        print extended usage information

        Use cputrack(1) to monitor per-process statistics.

        CPU performance counter interface: AMD Opteron & Athlon64

        event specification syntax:
        [picn=]<eventn>[,attr[n][=<val>]][,[picn=]<eventn>[,attr[n][=<val>]],...]

        event[0-3]:
        FP_dispatched_fpu_ops FP_cycles_no_fpu_ops_retired
        FP_dispatched_fpu_ops_ff LS_seg_reg_load
        LS_uarch_resync_self_modify LS_uarch_resync_snoop
        LS_buffer_2_full LS_locked_operation LS_retired_cflush
        LS_retired_cpuid DC_access DC_miss DC_refill_from_L2
        DC_refill_from_system DC_copyback DC_dtlb_L1_miss_L2_hit
        DC_dtlb_L1_miss_L2_miss DC_misaligned_data_ref
        DC_uarch_late_cancel_access DC_uarch_early_cancel_access
        DC_1bit_ecc_error_found DC_dispatched_prefetch_instr
        DC_dcache_accesses_by_locks BU_memory_requests
        BU_data_prefetch BU_system_read_responses
        BU_quadwords_written_to_system BU_cpu_clk_unhalted
        BU_internal_L2_req BU_fill_req_missed_L2 BU_fill_into_L2
        IC_fetch IC_miss IC_refill_from_L2 IC_refill_from_system
        IC_itlb_L1_miss_L2_hit IC_itlb_L1_miss_L2_miss
        IC_uarch_resync_snoop IC_instr_fetch_stall
        IC_return_stack_hit IC_return_stack_overflow
        FR_retired_x86_instr_w_excp_intr FR_retired_uops
        FR_retired_branches_w_excp_intr FR_retired_branches_mispred
        FR_retired_taken_branches FR_retired_taken_branches_mispred
        FR_retired_far_ctl_transfer FR_retired_resyncs
        FR_retired_near_rets FR_retired_near_rets_mispred
        FR_retired_taken_branches_mispred_addr_miscop
        FR_retired_fpu_instr FR_retired_fastpath_double_op_instr
        FR_intr_masked_cycles FR_intr_masked_while_pending_cycles
        FR_taken_hardware_intrs FR_nothing_to_dispatch
        FR_dispatch_stalls FR_dispatch_stall_branch_abort_to_retire
        FR_dispatch_stall_serialization FR_dispatch_stall_segment_load
        FR_dispatch_stall_reorder_buffer_full
        FR_dispatch_stall_resv_stations_full
        FR_dispatch_stall_fpu_full FR_dispatch_stall_ls_full
        FR_dispatch_stall_waiting_all_quiet
        FR_dispatch_stall_far_ctl_trsfr_resync_branc_pend
        FR_fpu_exception FR_num_brkpts_dr0 FR_num_brkpts_dr1
        FR_num_brkpts_dr2 FR_num_brkpts_dr3
        NB_mem_ctrlr_page_access NB_mem_ctrlr_page_table_overflow
        NB_mem_ctrlr_turnaround NB_mem_ctrlr_bypass_counter_saturation
        NB_ECC_errors NB_sized_commands NB_probe_result
        NB_gart_events NB_ht_bus0_bandwidth NB_ht_bus1_bandwidth
        NB_ht_bus2_bandwidth NB_sized_blocks NB_cpu_io_to_mem_io
        NB_cache_block_commands

        attributes: edge pc inv cmask umask nouser sys

        See Chapter 10 of the "BIOS and Kernel Developer's Guide
        for the AMD Athlon 64 and AMD Opteron Processors,"
        AMD publication #26094
There are over eighty event names above, such as "FP_dispatched_fpu_ops", which describe the PICs available. On my AMD Opteron CPUs you can measure four of these at a time; the chosen events are provided as arguments to cpustat, eg:
# cpustat -c IC_fetch,DC_access,DC_dtlb_L1_miss_L2_hit,DC_dtlb_L1_miss_L2_miss 0.25
   time cpu event      pic0      pic1      pic2      pic3
  0.257   0  tick   6406429   8333198     45826      5515
  0.257   1  tick   3333442   3942694     24682      4409
  0.507   1  tick   6450964   8229104     44046      5713
  0.507   0  tick   2359697   2828683     14365      4415
  0.757   0  tick   2490406   3060416     16458      4901
  0.757   1  tick   7292986   9530806     68956      6490
  1.007   0  tick   2514008   3063049     15037      3863
  1.007   1  tick   6057048   7747580     42415      6083
^C
In the above example I printed four PICs every 0.25 seconds for each CPU (this is a 2 x virtual CPU server). The cpu column shows that the output is slightly shuffled: a harmless side effect of the way cpustat was coded (it pbinds a libcpc consumer onto each CPU in the available processor set, and those threads write to STDOUT in whatever order they run). The PICs are provided by a small number of programmable hardware registers, so there is no ideal way around the four-at-a-time limit; however, cpustat does support cycling measurements through different sets of PICs.
Reference Documentation
Since different CPUs provide different PICs, the guide mentioned at the bottom of the cpustat -h output will list what PICs your CPU type provides. It is important to read these guides carefully. For example, PICs that track cache misses may have some exceptions to what is considered a "miss".
I spent a while with AMD's #26094 guide, but found that the PIC descriptions raised more questions than they answered (try finding something as basic as an instruction count). If you find yourself in a similar situation, it can help to create known workloads and then examine which metrics move by a matching amount. I used this approach to confirm which PICs provide cycle counts and instruction counts.
I did eventually find two good resources on AMD PICs,
- Basic Performance Measurements by Paul J. Drongowski of AMD: a summary of the most important PICs and suggestions for interpretation.
- Opteron Hardware Performance Counters by Richard Smith of Sun: a crash course in AMD Architecture from a PICs perspective.
You may notice some really interesting PICs mentioned, such as memory locality observability in the newer revs of AMD CPUs.
If you are interested in PIC analysis for any CPU type, see chapter 8 "Performance Counters" in Solaris Performance and Tools, by Richard McDougall, Jim Mauro and myself. One of the metrics we made sure to include in the book was CPI (cycles per instruction), as it proves to be a useful starting point for understanding CPU behavior.
Example: CPI
The cycles per instruction metric (sometimes measured as IPC: instructions per cycle) is a useful ratio, and (depending on the CPU type) it is fairly easy to measure. If the measured CPI ratio is low, more instructions can be dispatched in a given time, which usually means higher performance. High CPI means instructions are stalling, usually on main memory bus activity.
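To make the ratio concrete, here is a quick arithmetic sketch using awk. The counter values are hypothetical, standing in for the BU_cpu_clk_unhalted and FR_retired_x86_instr_w_excp_intr columns that cpustat reports:

```shell
# CPI = unhalted cycles / retired instructions.
# These counts are hypothetical, not measured from a live system.
cycles=5610000000
instructions=16500000000
awk -v c="$cycles" -v i="$instructions" \
    'BEGIN { printf "CPI = %.2f\n", c / i }'
# prints: CPI = 0.34
```

Note that BU_cpu_clk_unhalted counts only unhalted cycles, which is why the script later in this post also samples the TSC (via cpustat -t) to derive utilization.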
The output of cpustat can be formatted with a little scripting. I wrote the following script, amd64cpiu, which uses a little shell and Perl to aggregate and print the output:
#!/usr/bin/sh
#
# amd64cpiu - measure CPI and Utilization on AMD64 processors.
#
# USAGE:    amd64cpiu [interval]
#      eg,
#           amd64cpiu 0.1       # for 0.1 second intervals
#
# CPI is cycles per instruction, a metric that increases due to activity
# such as main memory bus lookups.
#
# ident "@(#)amd64cpiu.sh 1.1 07/02/17 SMI"

interval=${1:-1}                # default interval, 1 second
set -- `kstat -p unix:0:system_misc:ncpus`      # assuming no psets,
cpus=$2                         # number of CPUs

pics='BU_cpu_clk_unhalted'                      # cycles
pics=$pics,'FR_retired_x86_instr_w_excp_intr'   # instructions

/usr/sbin/cpustat -tc $pics $interval | perl -e '
        printf "%16s %8s %8s\n", "Instructions", "CPI", "%CPU";
        while (<>) {
                next if ++$lines == 1;
                split;
                $total += $_[3];
                $cycles += $_[4];
                $instructions += $_[5];
                if ((($lines - 1) % '$cpus') == 0) {
                        printf "%16u %8.2f %8.2f\n", $instructions,
                            $cycles / $instructions,
                            $total ? 100 * $cycles / $total : 0;
                        $total = 0;
                        $cycles = 0;
                        $instructions = 0;
                }
        }
'
This script prints a column for CPI and for percent CPU utilization. I've used the PICs that were suggested in the AMD article, and from testing they do appear to be the best ones for measuring CPI.
Here amd64cpiu is used to examine a CPU-bound workload of fast register-based instructions:
# ./amd64cpiu.sh
    Instructions      CPI     %CPU
     16509657954     0.34    97.56
     16550162001     0.33    98.54
     16523746049     0.34    98.41
     16510783100     0.34    98.32
     16497135723     0.34    98.29
^C
The CPI is around 0.34. This is about the best (lowest) CPI to expect from this AMD64 architecture, which can dispatch up to three instructions per clock cycle (1/3 ≈ 0.33).
Now for a memory bound workload of sequential 1 byte memory reads:
# ./amd64cpiu.sh
    Instructions      CPI     %CPU
      4883935299     1.12    97.60
      4852961204     1.12    97.03
      4884120645     1.13    97.69
      4898818096     1.12    97.92
      4895064839     1.12    97.80
^C
Things are starting to slow down due to the extra overhead of the memory requests. Many reads will be satisfied from the Level 1 cache, some from the slower Level 2 cache, and occasionally a cache line will be read from main memory. This additional overhead slows us to about 1.12 CPI, and we complete fewer instructions for the same %CPU.
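One way to reason about a figure like this is a simple blended-cost model. Every rate and latency below is an illustrative assumption, not a measurement from this system; tuning them against the DC_* cache and refill PICs would be needed to make the model meaningful:

```shell
# Toy model: CPI ~= base + L1miss*L2lat + L2miss*memlat.
# All inputs are illustrative assumptions.
awk 'BEGIN {
    base   = 0.34     # register-bound CPI, as measured earlier
    l1miss = 0.05     # assumed fraction of instructions missing L1
    l2lat  = 12       # assumed L2 hit latency, in cycles
    l2miss = 0.002    # assumed fraction missing L2 (to main memory)
    memlat = 200      # assumed main memory latency, in cycles
    printf "modeled CPI = %.2f\n", base + l1miss * l2lat + l2miss * memlat
}'
# prints: modeled CPI = 1.34
```

With these made-up inputs the model lands near the measured range, which is the point: a modest miss rate against slow main memory dominates CPI.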
Watch what happens when our memory workload performs 1 byte scattered reads (100 Kbytes apart):
# ./amd64cpiu.sh
    Instructions      CPI     %CPU
       653300388     8.53    98.36
       648496314     8.53    98.37
       644163952     8.54    97.75
       648941939     8.53    98.35
       648507176     8.53    98.37
^C
Many of the reads will not be in the CPU caches, so most now require a memory bus lookup. Our CPI is around 8.53, some 25 times slower than the register-based CPU instructions. %CPU is still about the same, but it now buys far fewer instructions in total.
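The 25x figure falls straight out of the two measured CPIs:

```shell
# Ratio of memory-bound CPI to register-bound CPI, from the runs above.
awk 'BEGIN { printf "slowdown = %.0fx\n", 8.53 / 0.34 }'
# prints: slowdown = 25x
```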
As you can see, CPI sheds light on memory bus activity: considerable insight from such a simple metric.
Now for a real application: Here I watch as Sun's C compiler chews through a source tree:
# ./amd64cpiu.sh
    Instructions      CPI     %CPU
      2624028943     1.26    58.52
      2992167837     1.19    63.17
      2327129316     1.26    52.08
      2046997158     1.27    46.14
      2414376864     1.23    52.80
      3305351199     1.23    70.72
^C
That's not so bad: any memory access instructions must be hitting the caches fairly often (something we could confirm by measuring other PICs).
Beware of output such as the following:
# ./amd64cpiu.sh
    Instructions      CPI     %CPU
        22695257     1.82     0.73
        22197894     1.75     0.69
        49626271     2.16     1.90
       102731779     2.21     4.04
       104795796     1.49     2.78
^C
The CPUs are fairly idle (less than 5% utilized), so each CPI sample covers very little work and is less likely to be useful for indicating performance issues.
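To guard against being misled, amd64cpiu output can be post-processed to flag samples where utilization is too low for CPI to mean much. This sketch uses a here-document of sample rows in place of live output, and an arbitrary 10% threshold:

```shell
# Flag rows whose %CPU (third column) is below 10: at such low
# utilization the CPI sample covers too little work to trust.
awk 'NR == 1 { print; next }
     { print $0 ($3 < 10 ? "   <- idle, CPI unreliable" : "") }' <<'EOF'
    Instructions      CPI     %CPU
        22695257     1.82     0.73
     16509657954     0.34    97.56
EOF
```

In practice you would pipe the live script output through the awk filter instead of feeding it a here-document.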
Suggested Actions: CPI
While many PICs produce interesting measurements, it's much more useful if there is some action we can take based on the results. The following is a list of suggestions based on CPI.
Firstly, to even be considering this list you need a known and measured performance issue. If one or more CPUs are 100% busy, then that may be a performance issue and checking CPI can be useful; if your CPUs are idle, then checking CPI probably won't help. As for measurement, it is especially helpful to quantify an issue: eg, average latency is 150 ms. Tools such as DTrace can take these measurements.
- Measure other PICs. CPI is a high level metric, and there are many other PICs that will explain why CPI is high or low, such as cache hits and misses, TLB hits and misses, and memory locality.
- If CPI is low:
- Examine the application for unnecessary CPU work (eg, using DTrace)
- Get faster CPUs
- Get more CPUs
- If CPI is high:
- Examine the application for unnecessary memory work
- Recompile your application with optimization, and with Sun's C compiler
- Consider using processor sets to improve memory locality
- Get CPUs with larger caches
- Test different CPU architectures (multi-core/multi-thread may improve performance as bus distance is shorter)
I hope this has been helpful. And there are many more useful metrics to observe on AMD64 CPUs: CPI is just the beginning.