Solaris Off-CPU Analysis
Off-CPU analysis is a performance methodology where high resolution off-CPU time is measured and studied. This reveals which code-paths led to time spent waiting (blocked), and can be a quick and generic way to root-cause a wide range of performance issues.
Note: This page was last updated in 2014 and is no longer maintained. It is now part of my Crypt, and kept here for reference. The new Linux version is here.
Studying off-CPU time differs from traditional profiling, which often samples the activity of threads at a given interval, and (usually) only examines threads that are executing work on-CPU. Here, the targets are threads that, while processing a workload request, have blocked and been context-switched off-CPU. This method also differs from tracing techniques that instrument the various application functions that commonly block, since it targets the kernel functions that perform the blocking, and so doesn't rely on having the foresight to instrument all the right places in the application.
Threads can leave the CPU for a number of reasons, including waiting for file system or network I/O, waiting to acquire a synchronization lock, paging/swapping (virtual memory), explicit timers, and signals. They can also leave the CPU for reasons somewhat unrelated to the current thread's execution, including involuntary context switching due to high demand for CPU resources, and interrupts. Whatever the reason, if this occurs during a workload request (a synchronous code-path), then it is introducing latency.
I'll summarize profiling techniques for off-CPU analysis, and introduce off-CPU as a metric. I'll then use DTrace to measure it, with MySQL as an example target to study. This methodology is suitable for any profiler that can trace kernel events, including perf, SystemTap, and ktap on Linux.
Profiling Techniques
While there are many tools that focus on CPU runtime analysis, the focus here is time spent blocked and off-CPU. I'll summarize sampling and tracing techniques, to show how they compare for identifying this off-CPU time.
CPU Sampling
Many traditional profiling tools sample activity across all CPUs, collecting snapshots of the current instruction address (program counter) or entire stack back trace at a certain rate (eg, 1000 Hertz). This will give counts of either the running function or the stack trace, allowing reasonable estimates to be calculated of where the CPU cycles are being spent. DTrace can sample too (dtrace -n 'profile-1001 /execname == "mysqld"/ { @[ustack()] = count(); }').
Consider application function A() calls function B(), which makes a blocking system call:
CPU Sampling ----------------------------------------------->
 |  |  |  |  |  |  |  |  |  |  |  |
 A  A  A  A  B  B  B  B  A  A  A  A
            A(---------.                .----------)
                        |               |
                        B(--------.     .--)
                                   |    |
user-land - - - - - - - - - - syscall - - - - - - - - - - - - - - - - -
                                   |    |
kernel                             X     Off-CPU
                                   |
                                 block . . . . . interrupt
While this can be very effective for studying on-CPU issues, including hot code-paths and adaptive mutex spins, it doesn't gather data when the application has blocked and is waiting off-CPU.
Application Tracing
App Tracing ------------------------------------------------>
            |           |              |          |
            A(          B(             B)         A)
            A(---------.                .----------)
                        |               |
                        B(--------.     .--)
                                   |    |
user-land - - - - - - - - - - syscall - - - - - - - - - - - - - - - - -
                                   |    |
kernel                             X     Off-CPU
                                   |
                                 block . . . . . interrupt
Here, functions are instrumented so that timestamps are collected when they begin "(" and end ")", allowing the time spent in each function to be calculated. If the timestamps include both elapsed time and CPU time (eg, using times(2) or getrusage(3C)), then it is also possible to calculate which functions were slow on-CPU vs which functions were slow because they were blocked off-CPU. Unlike sampling, these timestamps can have a very high resolution (nanoseconds).
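As a rough sketch of this approach (not from the original page), DTrace's pid provider can collect both kinds of timestamps at function entry and return; here "B" and PID are placeholders for a target application function and process, and vtimestamp provides the thread's on-CPU time:

# dtrace -n '
    /* time a hypothetical application function B(): elapsed vs on-CPU */
    pid$target::B:entry { self->ts = timestamp; self->vts = vtimestamp; }
    pid$target::B:return /self->ts/ {
        @["elapsed (ns)"] = quantize(timestamp - self->ts);
        @["on-CPU (ns)"] = quantize(vtimestamp - self->vts);
        self->ts = 0; self->vts = 0;
    }' -p PID

A large gap between the elapsed and on-CPU distributions suggests the function spent much of its time blocked off-CPU.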
While this works, a disadvantage is that you either trace all application functions, which can have a significant performance impact (and affect the performance you are trying to measure), or you pick the functions that are likely to block, and hope you didn't miss any. My file system latency tools were examples of this, which traced just the file system I/O functions in MySQL (and I hope I found them all).
Off-CPU Tracing
I'll summarize this here, then explain it in more detail.
Off-CPU Tracing -------------------------------------------->
                                   |    |
                                   B    B
                                   A    A
            A(---------.                .----------)
                        |               |
                        B(--------.     .--)
                                   |    |
user-land - - - - - - - - - - syscall - - - - - - - - - - - - - - - - -
                                   |    |
kernel                             X     Off-CPU
                                   |
                                 block . . . . . interrupt
With this approach, only the kernel functions that switch the thread off-CPU are traced, along with timestamps and user-land stack back traces. This focuses on the off-CPU events, without needing to trace all of the application functions, or needing to know what the application is. This approach works for any blocking event, for any application: MySQL, Apache, etc.
Off-CPU tracing captures all wait events for any application.
Later on this page I'll trace kernel off-CPU events and include some application level instrumentation to filter out asynchronous wait times (eg, threads waiting for work). Unlike application level instrumentation, I don't need to hunt down every place that may block off-CPU and instrument it; I just need to identify that the application is in a time sensitive code path (eg, during a MySQL query), so that the latency is synchronous to the workload.
Off-CPU
If it is possible to trace the kernel scheduler functions that context switch a thread off-CPU and back on again, with visibility into the user-land context (stack trace), then the approach pictured above can be implemented. DTrace can not only trace these kernel functions, but also has a stable kernel provider ("sched") for tracing the exact events of interest:
# dtrace -ln 'sched:::*cpu'
   ID   PROVIDER            MODULE                          FUNCTION NAME
18268      sched              unix                             swtch remain-cpu
18289      sched              unix                  resume_from_intr on-cpu
18290      sched              unix                resume_from_zombie on-cpu
18291      sched              unix                            resume on-cpu
18323      sched              unix                  resume_from_intr off-cpu
18324      sched              unix                resume_from_zombie off-cpu
18325      sched              unix                            resume off-cpu
The time spent off-CPU can be calculated as the timestamp delta between the sched:::off-cpu probe and the sched:::on-cpu probe, for the same thread. By itself, that time is not particularly useful, even when summarized as a neat distribution plot:
# dtrace -n 'sched:::off-cpu { self->ts = timestamp; }
    sched:::on-cpu /self->ts/ { @["ns"] = quantize(timestamp - self->ts); self->ts = 0; }'
dtrace: description 'sched:::off-cpu ' matched 6 probes
^C

  ns
           value  ------------- Distribution ------------- count
            4096 |                                         0
            8192 |@@@                                      55
           16384 |@@@@@@@@@@@@@                            243
           32768 |@@@@                                     75
           65536 |                                         8
          131072 |@                                        24
          262144 |                                         3
          524288 |                                         0
         1048576 |                                         0
         2097152 |                                         1
         4194304 |                                         0
         8388608 |                                         2
        16777216 |@@@@@@                                   105
        33554432 |@@@                                      66
        67108864 |@@@                                      55
       134217728 |@@@                                      61
       268435456 |@                                        28
       536870912 |@@                                       32
      1073741824 |                                         0
That shows two groupings, the first showing threads leaving CPU between 8 and 65 us, and the second showing between 16 ms and 1 second. But that alone is not very useful to know. What we need is context - who the application is (or if it's the kernel), what it's doing, and why it's blocking and leaving CPU.
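A first small step toward that context (a sketch, not one of the original one-liners) is to key the same quantize() by process name, showing who is leaving CPU:

# dtrace -n 'sched:::off-cpu { self->ts = timestamp; }
    sched:::on-cpu /self->ts/ { @[execname, "ns"] = quantize(timestamp - self->ts); self->ts = 0; }'

That identifies the application, but not what it was doing or why it blocked; for that we need stack traces.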
Context
Adding context is pretty easy: DTrace provides the stack() and ustack() functions, which retrieve the kernel and user-level stacks - the ancestry of the code-path taken. Adding these to the previous one-liner, and also filtering to only match our target application, "mysqld":
# dtrace -x ustackframes=100 -n 'sched:::off-cpu /execname == "mysqld"/ { self->ts = timestamp; }
    sched:::on-cpu /self->ts/ { @[stack(), ustack(), "ns"] = quantize(timestamp - self->ts); self->ts = 0; }'
dtrace: description 'sched:::off-cpu ' matched 6 probes
^C
[...]

              genunix`cv_wait+0x61                          <- kernel-level stack trace
              zfs`zio_wait+0x5d
              zfs`dmu_buf_hold_array_by_dnode+0x21e
              zfs`dmu_buf_hold_array+0x71
              zfs`dmu_read_uio+0x4d
              zfs`zfs_read+0x19a
              genunix`fop_read+0x6b
              genunix`pread+0x22c
              unix`sys_syscall+0x17a

              libc.so.1`__pread+0xa                         <- user-level stack trace
              mysqld`os_file_pread+0x8e
              mysqld`os_file_read+0x3b                      <- filesystem I/O
              mysqld`fil_io+0x2b0
              mysqld`buf_read_page_low+0x14e
              mysqld`buf_read_page+0x81
              mysqld`buf_page_get_gen+0x143
              mysqld`btr_cur_search_to_nth_level+0x4dd
              mysqld`row_sel_get_clust_rec_for_mysql+0x9a
              mysqld`row_search_for_mysql+0xdf5
              mysqld`_ZN11ha_innobase13general_fetchEPcjj+0x5b
              mysqld`_Z19join_read_next_sameP14st_read_record+0x32
              mysqld`_Z10sub_selectP4JOINP13st_join_tableb+0x68
              mysqld`_Z9do_selectP4JOINP4ListI4ItemEP8st_tableP9Procedure+0x1fd
              mysqld`_ZN4JOIN4execEv+0x12a9
              mysqld`_Z12mysql_selectP3THDPPP4ItemP10TABLE_LISTjR4ListIS1_ES2_jP8...
              mysqld`_Z13handle_selectP3THDP6st_lexP13select_resultm+0x145
              mysqld`_Z21mysql_execute_commandP3THD+0xe6e
              mysqld`_Z11mysql_parseP3THDPKcjPS2_+0x116
              mysqld`_Z16dispatch_command19enum_server_commandP3THDPcj+0xfc1
              mysqld`_Z10do_commandP3THD+0xb8
              mysqld`handle_one_connection+0x7f7
              libc.so.1`_thrp_setup+0x8d
              libc.so.1`_lwp_start
  ns
           value  ------------- Distribution ------------- count
          262144 |                                         0
          524288 |                                         1
         1048576 |@                                        2
         2097152 |@@@@@                                    10
         4194304 |@@@@@@@@@@@@@@@@@@@@@@@@@@               53
         8388608 |@@@@@@@@                                 16
        16777216 |                                         1
        33554432 |                                         0
The output is many pages long, consisting of hundreds of kernel + user stack traces, plus their time spent waiting off-CPU as distribution plots ("value" == nanoseconds, "count" == number of occurrences in that range). To make the screenful of text a little easier on the eyes, I've added some color (red for kernel, blue for user-land stacks). I've also truncated the output to only show one off-CPU event: file system I/O.
This output is amazing, and is made possible by a simplifying principle:
Application stack traces don't change while off-CPU.
This DTrace one-liner is only measuring the stack traces on the sched:::on-cpu event; it doesn't need to gather and save them on the sched:::off-cpu event, or during the time spent off-CPU, since the application stack doesn't change (and the kernel stack may not change much either). It's the application (user-level) stack that usually provides the most helpful context.
Interpreting the stack traces takes a little familiarity with the source code: how much depends on how complex the application is and on its language. While MySQL is quite complex and has a lot of functionality, the function names are usually intuitive and the code is often fairly straightforward.
The file system I/O latency shown in the distribution, from 1 to 16 ms, matches expectation for random disk I/O latency. It also shows that the file system (while I was tracing) was running at a 0% cache hit rate for this code-path; all I/O were at disk speeds. (This is probably due to this MySQL using application-level caches to cache the frequent requests, leaving a trickle of un-cached and infrequent requests for the file system.)
That stack trace is great - it's showing why the application was blocking and waiting off-CPU, and how long for. This is exactly the sort of information I'm usually looking for. However, this one-liner also catches many stack traces that aren't interesting, including:
[...]

              genunix`cv_wait_sig_swap_core+0x170
              genunix`cv_wait_sig_swap+0x18
              genunix`cv_waituntil_sig+0x13c
              genunix`lwp_park+0x157
              genunix`syslwp_park+0x31
              unix`sys_syscall+0x17a

              libc.so.1`__lwp_park+0x17
              libc.so.1`cond_wait_queue+0x68
              libc.so.1`__cond_wait+0x7b
              libc.so.1`cond_wait+0x23
              libc.so.1`pthread_cond_wait+0x9               <- event waiting
              mysqld`_Z10end_threadP3THDb+0xba
              mysqld`handle_one_connection+0x3e8
              libc.so.1`_thrp_setup+0x8d
              libc.so.1`_lwp_start
  ns
           value  ------------- Distribution ------------- count
            2048 |                                         0
            4096 |                                         1
            8192 |                                         0
           16384 |                                         0
           32768 |                                         0
           65536 |                                         1
          131072 |                                         3
          262144 |                                         1
          524288 |@                                        9
         1048576 |@@                                       17
         2097152 |@@@@                                     34
         4194304 |@@@@@                                    43
         8388608 |@@@@@@@                                  68
        16777216 |@@@@@@                                   58
        33554432 |@@@@@@                                   58
        67108864 |@@@@                                     41
       134217728 |@@                                       21
       268435456 |@                                        12
       536870912 |@                                        9
      1073741824 |                                         0
I believe this stack trace is from a thread ending and waiting in the MySQL thread cache for more work. This isn't so interesting; I'm usually looking to explain why queries are slow, and this latency happens outside a query. I'm not (usually) interested in latencies that occur asynchronously to the queries, such as threads waiting for work.
It would be ideal to only include off-CPU stacks from those wait events that occur synchronously during a query.
Synchronous Context
The DTrace pid provider allows the application to be traced and synchronous workload code-paths to be identified. Using this on MySQL, we can identify whether a thread is processing a query, such as by checking that the thread is within the dispatch_command() function. The following script does this, and includes some usability features: it produces a rolling status of the gathered latency until Ctrl-C is hit, and it applies a minimum latency threshold, set to 10 ms, to only include off-CPU wait events longer than this time:
$ cat mysqld_pid_offcpu.d
#!/usr/sbin/dtrace -s
/*
 * mysqld_pid_offcpu.d    Trace off-CPU time during queries, showing stacks.
 *
 * USAGE: ./mysqld_pid_offcpu.d -p mysqld_PID
 *
 * TESTED: these pid-provider probes may only work on some mysqld versions.
 *    5.0.51a: ok
 *    5.1.13: ok
 *
 * 27-Feb-2011    Brendan Gregg    Created This.
 */

#pragma D option quiet

dtrace:::BEGIN
{
    min_ns = 10000000;
    printf("Tracing PID %d for queries longer than %d ms\n",
        $target, min_ns / 1000000);
    printf("Hit Ctrl-C when ready for off-CPU report.\n\n");
}

pid$target::*dispatch_command*:entry
{
    self->start = timestamp;
}

sched:::off-cpu
/self->start/
{
    self->off = timestamp;
}

sched:::on-cpu
/self->off && (timestamp - self->off) > min_ns/
{
    @offcpu[stack(), ustack()] = quantize(timestamp - self->off);
    self->off = 0;
}

pid$target::*dispatch_command*:return
/self->start && (timestamp - self->start) > min_ns/
{
    @time = quantize(timestamp - self->start);
}

pid$target::*dispatch_command*:return
{
    self->start = 0;
    self->off = 0;
}

profile:::tick-1s,
dtrace:::END
{
    printf("MySQL query latency (ns):");
    printa(@time);
    clear(@time);
}

dtrace:::END
{
    printf("Top 10 off-CPU user & kernel stacks, by wait latency (ns):");
    trunc(@offcpu, 10);
}
The tick-1s probe prints status every second showing the distribution of query latency beyond 10 ms. The intent is that you watch this rolling status until you catch the latency you are looking for (outliers), and then hit Ctrl-C. The real output is in the @offcpu aggregation, which includes the stack traces and latency distribution (beyond 10 ms) for events during queries.
Here's an example (MySQL 5.0.51a):
# ./mysqld_pid_offcpu.d -p 3475
Tracing PID 3475 for queries longer than 10 ms
Hit Ctrl-C when ready for off-CPU report.

MySQL query latency (ns):
           value  ------------- Distribution ------------- count
         4194304 |                                         0
         8388608 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@        257
        16777216 |@@@@@                                    35
        33554432 |@                                        10
        67108864 |@                                        5
       134217728 |                                         0
       268435456 |                                         1
       536870912 |                                         0

[...waiting to see latency of interest: beyond 1s...]

MySQL query latency (ns):
           value  ------------- Distribution ------------- count
         4194304 |                                         0
         8388608 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 2
        16777216 |                                         0

MySQL query latency (ns):
           value  ------------- Distribution ------------- count
         4194304 |                                         0
         8388608 |@@@@@@@@@@@@@                            1
        16777216 |@@@@@@@@@@@@@                            1
        33554432 |                                         0
        67108864 |                                         0
       134217728 |                                         0
       268435456 |                                         0
       536870912 |                                         0
      1073741824 |@@@@@@@@@@@@@                            1     <- over 1 second
      2147483648 |                                         0

[got it! Ctrl-C now]
^C
[...]

              genunix`cv_timedwait_sig_hires+0x1db
              genunix`cv_waituntil_sig+0xba
              genunix`poll_common+0x56d
              genunix`pollsys+0xe4
              unix`sys_syscall+0x17a

              libc.so.1`__pollsys+0xa
              libc.so.1`pselect+0x18c
              libc.so.1`select+0x70
              mysqld`os_thread_sleep+0x3f                   <- innodb thread concurrency
              mysqld`srv_conc_enter_innodb+0xc4
              mysqld`_ZN11ha_innobase10index_readEPcPKcj16ha_rkey_function+0x28a
              mysqld`_Z20join_read_always_keyP13st_join_table+0xc7
              mysqld`_Z10sub_selectP4JOINP13st_join_tableb+0xb7
              mysqld`_Z9do_selectP4JOINP4ListI4ItemEP8st_tableP9Procedure+0x1fd
              mysqld`_ZN4JOIN4execEv+0x12a9
              mysqld`_Z12mysql_selectP3THDPPP4ItemP10TABLE_LISTjR4ListIS1_ES2_jP8...
              mysqld`_Z13handle_selectP3THDP6st_lexP13select_resultm+0x145
              mysqld`_Z21mysql_execute_commandP3THD+0xe6e
              mysqld`_Z11mysql_parseP3THDPKcjPS2_+0x116
              mysqld`_Z16dispatch_command19enum_server_commandP3THDPcj+0xfc1
              mysqld`_Z10do_commandP3THD+0xb8
              mysqld`handle_one_connection+0x7f7
              libc.so.1`_thrp_setup+0x8d
              libc.so.1`_lwp_start

           value  ------------- Distribution ------------- count
         4194304 |                                         0
         8388608 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 124
        16777216 |                                         0

              genunix`cv_wait_sig_swap_core+0x170
              genunix`cv_wait_sig_swap+0x18
              genunix`cv_waituntil_sig+0x13c
              genunix`lwp_park+0x157
              genunix`syslwp_park+0x31
              unix`sys_syscall+0x17a

              libc.so.1`__lwp_park+0x17
              libc.so.1`cond_wait_queue+0x68
              libc.so.1`__cond_wait+0x7b
              libc.so.1`cond_wait+0x23
              libc.so.1`pthread_cond_wait+0x9
              mysqld`os_event_wait+0x27
              mysqld`srv_suspend_mysql_thread+0x29b
              mysqld`row_mysql_handle_errors+0x6e           <- DB lock wait
              mysqld`row_search_for_mysql+0x909
              mysqld`_ZN11ha_innobase10index_readEPcPKcj16ha_rkey_function+0x193
              mysqld`_ZN7handler16read_range_firstEPK12st_key_rangeS2_bb+0x99
              mysqld`_ZN7handler22read_multi_range_firstEPP18st_key_multi_rangeS1...
              mysqld`_ZN18QUICK_RANGE_SELECT8get_nextEv+0x100
              mysqld`_Z8rr_quickP14st_read_record+0x1d
              mysqld`_Z12mysql_updateP3THDP10TABLE_LISTR4ListI4ItemES6_PS4_jP8st_...
              mysqld`_Z21mysql_execute_commandP3THD+0x3581
              mysqld`_Z11mysql_parseP3THDPKcjPS2_+0x116
              mysqld`_Z16dispatch_command19enum_server_commandP3THDPcj+0xfc1
              mysqld`_Z10do_commandP3THD+0xb8
              mysqld`handle_one_connection+0x7f7

           value  ------------- Distribution ------------- count
         4194304 |                                         0
         8388608 |@@@@@@@@@@                               1
        16777216 |@@@@@@@@@@@@@@@@@@@@                     2
        33554432 |                                         0
        67108864 |                                         0
       134217728 |                                         0
       268435456 |                                         0
       536870912 |                                         0
      1073741824 |@@@@@@@@@@                               1
      2147483648 |                                         0
I've included the last two off-CPU types and my interpretations: the first is innodb thread concurrency, where threads are put to sleep as part of the algorithm that keeps a certain number of threads running. And by sleep I mean sleep in the programming sense - it's waiting for 10 ms regardless.
The second stack I believe is the DB lock wait code-path, based on DTracing row_mysql_handle_errors() separately.
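That kind of separate check can be a simple pid-provider one-liner (a sketch of how it could be done; the wildcards allow for differences in how the symbol appears in a given mysqld build):

# dtrace -n 'pid$target::*row_mysql_handle_errors*:entry { @[ustack()] = count(); }' -p 3475

which counts how often that function is reached, and from which code-paths.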
Since the distribution plot summarizes latency from 10 ms onwards, this data could be used to provide early warning that a problem is developing which could slow queries to unacceptable levels (eg, the typical 1 second "slow query").
This is one of the most powerful DTrace scripts I've ever written.
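As a hedged sketch of what such an early-warning metric might look like (not part of the original script; PID is a placeholder), the same probes can be reduced to a per-second count of long off-CPU events that occur during queries:

# dtrace -qn '
    pid$target::*dispatch_command*:entry { self->q = 1; }
    pid$target::*dispatch_command*:return { self->q = 0; self->ts = 0; }
    sched:::off-cpu /self->q/ { self->ts = timestamp; }
    sched:::on-cpu /self->ts && (timestamp - self->ts) > 10000000/ { @ = count(); }
    sched:::on-cpu /self->ts/ { self->ts = 0; }
    profile:::tick-1s { printa("off-CPU events > 10 ms during queries: %@d\n", @); clear(@); }' -p PID

This trades the detail of the stack traces for a single number that could be graphed or alarmed on.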
Being able to identify these latencies so quickly is awesome. However, there are a couple of caveats to bear in mind (at least): dispatcher queue latency and involuntary context switching.
Dispatcher Queue Latency
Something that's missing from these stacks is whether the latency includes time spent waiting on a CPU dispatcher queue. If the CPUs are running at saturation, then any time a thread steps off-CPU there may be additional time before it gets its turn to step back on, and that will be included in the reported time. DTrace can trace those dispatcher queue events separately if needed; in practice, CPU saturation is pretty easy to spot, so you are unlikely to be spending much time looking at off-CPU latency when you have a known CPU saturation issue to deal with.
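If you do want to quantify that queue time, a rough sketch (assuming the sched provider's enqueue probe and its usual arguments; details can vary by Solaris version) is to time from enqueue to on-cpu for mysqld threads:

# dtrace -n '
    sched:::enqueue /args[1]->pr_fname == "mysqld"/ {
        /* thread placed on a dispatcher queue: record the time, keyed by pid/lwpid */
        start[args[1]->pr_pid, args[0]->pr_lwpid] = timestamp;
    }
    sched:::on-cpu /start[pid, curlwpsinfo->pr_lwpid]/ {
        @["dispatcher queue latency (ns)"] =
            quantize(timestamp - start[pid, curlwpsinfo->pr_lwpid]);
        start[pid, curlwpsinfo->pr_lwpid] = 0;
    }'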
Involuntary Context Switching
If you see a user-level stack trace that doesn't look like it should have blocked, but it did, it could be because the kernel context switched the thread off involuntarily. The kernel-level stack trace may show this, eg:
              unix`preempt+0xd7                             <- involuntary context switch
              unix`trap+0x14e8
              unix`sys_rtt_common+0x68
              unix`_sys_rtt_ints_disabled+0x8

              mysqld`_ZN4ItemC2Ev+0xae
              mysqld`_Z21execute_sqlcom_selectP3THDP10TABLE_LIST+0x124
              mysqld`_Z21mysql_execute_commandP3THD+0x2400
              mysqld`_ZN18Prepared_statement7executeEP6Stringb+0x2e9
              mysqld`_ZN18Prepared_statement12execute_loopEP6StringbPhS2_+0x70
              mysqld`_Z19mysqld_stmt_executeP3THDPcj+0xa1
              mysqld`_Z16dispatch_command19enum_server_commandP3THDPcj+0xf85
              mysqld`_Z10do_commandP3THD+0xc2
              mysqld`_Z24do_handle_one_connectionP3THD+0x24e
              mysqld`handle_one_connection+0x37
              mysqld`pfs_spawn_thread+0x50
              libc.so.1`_thrp_setup+0x7e
              libc.so.1`_lwp_start

           value  ------------- Distribution ------------- count
        67108864 |                                         0
       134217728 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
       268435456 |                                         0
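To gauge how often this is happening and where, one rough approach (a sketch; preempt() is visible in the stack above, though fbt probe availability varies between kernel builds) is to count preemptions by user stack:

# dtrace -n 'fbt::preempt:entry /execname == "mysqld"/ { @[ustack(5)] = count(); }'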
Summary
Off-CPU analysis can be an effective way to locate the types of latency where threads block and wait for some other event. By tracing this from the kernel scheduler functions that context switch threads, all off-CPU latency types can be analyzed in the same way, without needing to trace multiple sources. To see the context for an off-CPU event and understand why it occurred, both the user and kernel stack back traces can be inspected. To show only the latencies that affect the application workload, additional instrumentation can be used to identify that context and filter out the asynchronous latency events.
On this page these principles were demonstrated using DTrace on MySQL, observing different types of blocking events and summarizing the off-CPU time as distribution plots in nanoseconds. The off-CPU approach can be used for any application that blocks for a variety of reasons, where there isn't already clear visibility into every reason why it may block.
For more on off-CPU analysis, see the visualizations in Off-CPU Flame Graphs and Hot/Cold Flame Graphs.