Performance Analysis Methodology
A performance analysis methodology is a procedure that you can follow to analyze system or application performance. Methodologies generally provide a starting point, and then guidance for finding the root cause or causes. Different methodologies are suited to solving different classes of issues, and you may try more than one before accomplishing your goal.
Analysis without a methodology can become a fishing expedition, where metrics are examined ad hoc until the issue is found, if it is found at all.
Methodologies documented in more detail on this site are:
- The USE Method: for finding resource bottlenecks
- The TSA Method: for analyzing application time
- Off-CPU Analysis: for analyzing any type of thread wait latency
- Active Benchmarking: for accurate and successful benchmarking
The following briefly summarizes methodologies I've either created or encountered. You can print them all out as a cheatsheet/reminder.
Summaries
I first summarized and named various performance methodologies (mostly developed by me) for my USENIX LISA 2012 talk: Performance Analysis Methodology (PDF, slideshare, youtube, USENIX), then later documented them in my Systems Performance book, and the ACMQ article Thinking Methodically about Performance, which was also published in Communications of the ACM, Feb 2013. More detailed references are at the end of this page.
The following is my most up-to-date summary list of methodologies. It begins with anti-methodologies, which are included for comparison and are not to be followed.
Anti-Methodologies
Blame-Someone-Else Anti-Method
1. Find a system or environment component you are not responsible for
2. Hypothesize that the issue is with that component
3. Redirect the issue to the responsible team
4. When proven wrong, go to 1
Streetlight Anti-Method
1. Pick observability tools that are:
   - familiar
   - found on the Internet
   - found at random
2. Run tools
3. Look for obvious issues
Drunk Man Anti-Method
- Change things at random until the problem goes away
Random Change Anti-Method
1. Measure a performance baseline
2. Pick a random attribute to change (eg, a tunable)
3. Change it in one direction
4. Measure performance
5. Change it in the other direction
6. Measure performance
7. Were the step 4 or 6 results better than the baseline? If so, keep the change; if not, revert
8. Goto step 1
Passive Benchmarking Anti-Method
- Pick a benchmark tool
- Run it with a variety of options
- Make a slide deck of the results
- Hand the slides to management
Traffic Light Anti-Method
- Open dashboard
- All green? Assume everything is good.
- Something red? Assume that's a problem.
Methodologies
Ad Hoc Checklist Method
1..N. Run A, if B, do C
Problem Statement Method
- What makes you think there is a performance problem?
- Has this system ever performed well?
- What has changed recently? (Software? Hardware? Load?)
- Can the performance degradation be expressed in terms of latency or run time?
- Does the problem affect other people or applications (or is it just you)?
- What is the environment? What software and hardware is used? Versions? Configuration?
RTFM Method
(Read The Fine Manual) How to research performance tools and metrics:
- Man pages
- Books
- Web search
- Co-workers
- Prior talk slides/video
- Support services
- Source code
- Experimentation
- Social Media
Scientific Method
- Question
- Hypothesis
- Prediction
- Test
- Analysis
OODA Loop
- Observe
- Orient
- Decide
- Act
Workload Characterization Method
- Who is causing the load? (PID, UID, IP addr, ...)
- Why is the load called? (code path)
- What is the load? (IOPS, tput, type)
- How is the load changing over time? (time series line graph)
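As a minimal sketch of this method, the following groups hypothetical access-log records (the client, endpoint, and bytes fields are invented for illustration) to answer the who/why/what questions; the how-over-time question would add a timestamp bucket and a line graph:

```python
from collections import Counter

# Hypothetical access-log records; in practice these would come from your
# logging or tracing system.
records = [
    {"client": "10.0.0.5", "endpoint": "/api/list",   "bytes": 512},
    {"client": "10.0.0.5", "endpoint": "/api/list",   "bytes": 480},
    {"client": "10.0.0.9", "endpoint": "/api/upload", "bytes": 1048576},
]

who  = Counter(r["client"] for r in records)    # who is causing the load?
why  = Counter(r["endpoint"] for r in records)  # why is it called? (code path proxy)
what = sum(r["bytes"] for r in records)         # what is the load? (total bytes here)

print("top clients:  ", who.most_common(3))
print("top endpoints:", why.most_common(3))
print("total bytes:  ", what)
```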
Drill-Down Analysis Method
1. Start at highest level
2. Examine next-level details
3. Pick most interesting breakdown
4. If problem unsolved, go to 2
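A minimal sketch of drill-down over a hypothetical hierarchical latency breakdown (all names and numbers are invented for illustration), following the largest component at each level:

```python
# Hypothetical hierarchical latency breakdown, in ms: (name, value, children).
breakdown = ("request", 100.0, [
    ("app logic", 30.0, []),
    ("database", 70.0, [
        ("query execution", 10.0, []),
        ("lock wait", 60.0, []),
    ]),
])

def drill_down(node, depth=0):
    """Start at the highest level, then repeatedly follow the largest
    (most interesting) breakdown until there is nothing left to expand."""
    name, value, children = node
    print("  " * depth + f"{name}: {value:.0f} ms")
    if children:
        drill_down(max(children, key=lambda c: c[1]), depth + 1)

drill_down(breakdown)
```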
Process of Elimination
1. Divide the target into components
2. Choose a test which:
   - Can exonerate many untested components (ideally, half of those remaining)
   - Is quick to perform
3. Perform test
4. Were the tested components exonerated?
   - Yes: go to 2
   - No: problem found?
     - Yes: done
     - No: how many components were tested?
       - one: target = tested component; go to 1
       - multiple: go to 2
   - Not sure: consider components untested; go to 2 and choose a different test
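The halving step can be sketched as a bisection over components, assuming a hypothetical test_passes() predicate that reports whether the problem reproduces with only a given subset active:

```python
def process_of_elimination(components, test_passes):
    """Bisect the component list until one suspect remains.

    test_passes(subset) is a hypothetical predicate: True means the problem
    does NOT reproduce with only `subset` active, i.e. that subset is
    exonerated and the problem lies in the remaining components.
    """
    remaining = list(components)
    while len(remaining) > 1:
        half = remaining[:len(remaining) // 2]
        if test_passes(half):
            remaining = remaining[len(remaining) // 2:]  # first half exonerated
        else:
            remaining = half                             # problem is in first half
    return remaining[0]

# Toy example: the problem is (hypothetically) in the "cache" component.
suspect = process_of_elimination(
    ["web", "cache", "db", "queue"],
    test_passes=lambda subset: "cache" not in subset,
)
print("suspect component:", suspect)
```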
Time Division Method
- Measure operation time (or latency)
- Divide time into logical synchronous components
- Continue division until latency origin is identified
- Quantify: estimate speedup if problem fixed
(I previously called this the "Latency Analysis Method")
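A minimal sketch of the division and the quantify step, using an invented breakdown of a request into synchronous components and estimating the speedup if the largest component were (hypothetically) halved:

```python
# Hypothetical breakdown of one operation's time into synchronous components (ms).
components = {"parse": 2.0, "db query": 85.0, "render": 28.0, "serialize": 5.0}

total = sum(components.values())
worst, worst_ms = max(components.items(), key=lambda kv: kv[1])

# Quantify: estimated speedup if the largest component were halved (an assumption
# for illustration; the real gain depends on the actual fix).
fixed_total = total - worst_ms / 2
print(f"total = {total:.1f} ms; largest component: {worst} ({worst_ms:.1f} ms)")
print(f"estimated speedup if {worst} were halved: {total / fixed_total:.2f}x")
```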
5 Whys Performance Method
1. Given delivered performance, ask, "why?", then answer this question
2..5. Given previous answer, ask, "why?", then answer this question
By-Layer Method
Measure latency in detail (eg, as a histogram) from:
- Dynamic languages
- Executable
- Libraries
- Syscalls
- Kernel: FS, network
- Device drivers
Investigate the lowest layer where latency is introduced
Tools Method
- List available performance tools (optionally add more)
- For each tool, list its useful metrics
- For each metric, list possible interpretation
- Run selected tools and interpret selected metrics.
USE Method
For every resource, check:
- Utilization
- Saturation
- Errors
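As a rough sketch for one resource (CPU) on Linux, utilization can be derived from /proc/stat and a saturation proxy from the runnable-thread count in /proc/loadavg; errors are not covered here (they would come from sources such as MCE logs or PMU counters):

```python
import os, time

def cpu_use_snapshot(interval=1.0):
    """A rough USE check for the CPU resource on Linux.

    Utilization: busy fraction from /proc/stat over a short interval.
    Saturation:  runnable threads vs. CPU count, from /proc/loadavg
                 (the count includes this process itself).
    Errors:      not read here.
    """
    def cpu_times():
        with open("/proc/stat") as f:
            fields = [int(x) for x in f.readline().split()[1:]]
        idle = fields[3] + fields[4]  # idle + iowait
        return idle, sum(fields)

    idle0, total0 = cpu_times()
    time.sleep(interval)
    idle1, total1 = cpu_times()
    utilization = 1.0 - (idle1 - idle0) / (total1 - total0)

    with open("/proc/loadavg") as f:
        runnable = int(f.read().split()[3].split("/")[0])
    saturation = runnable / os.cpu_count()
    return utilization, saturation

util, sat = cpu_use_snapshot()
print(f"CPU utilization: {util:.0%}, saturation (runnable/CPUs): {sat:.2f}")
```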
RED Method
For every service or microservice, check:
- Request rate
- Errors
- Duration
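A minimal sketch computing the three RED metrics from hypothetical per-request records (the field names and the 60-second window are invented for illustration):

```python
import statistics

# Hypothetical per-request records collected over a 60-second window.
WINDOW_S = 60
requests = [
    {"ok": True,  "duration_ms": 12.0},
    {"ok": True,  "duration_ms": 15.5},
    {"ok": False, "duration_ms": 230.0},
]

rate = len(requests) / WINDOW_S                                        # Request rate
error_ratio = sum(1 for r in requests if not r["ok"]) / len(requests)  # Errors
durations = sorted(r["duration_ms"] for r in requests)
p50 = statistics.median(durations)                                     # Duration
p99 = durations[min(len(durations) - 1, int(len(durations) * 0.99))]

print(f"rate={rate:.2f} req/s  errors={error_ratio:.1%}  p50={p50} ms  p99={p99} ms")
```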
CPU Profile Method
- Take a CPU profile (especially a flame graph)
- Understand all software in profile > 1%
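As a stand-in for a system profiler and flame graph, the following sketch uses Python's cProfile to list functions whose self time exceeds 1% of the total (the workload functions are invented):

```python
import cProfile, math, pstats

def busy():
    return sum(math.sqrt(i) for i in range(200_000))

def workload():
    for _ in range(20):
        busy()

profiler = cProfile.Profile()
profiler.runcall(workload)

# pstats' internal table maps (file, line, function) to
# (call count, non-recursive calls, self time, cumulative time, callers).
stats = pstats.Stats(profiler).stats
total = sum(entry[2] for entry in stats.values())
for (filename, line, name), (cc, nc, selftime, cumtime, callers) in stats.items():
    if selftime > 0.01 * total:  # report functions with > 1% of total self time
        print(f"{selftime / total:6.1%}  {name} ({filename}:{line})")
```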
Off-CPU Analysis
- Profile per-thread off-CPU time with stack traces
- Coalesce times with like stacks
- Study stacks from largest to shortest time
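A minimal sketch of the coalesce-and-rank steps over hypothetical off-CPU events; in practice the (stack, blocked time) pairs would come from tracing scheduler off-CPU events with stacks, e.g. with BPF:

```python
from collections import defaultdict

# Hypothetical off-CPU events: (stack trace frames, blocked time in ms).
events = [
    (("main", "handle_req", "db_read", "pread64"), 42.0),
    (("main", "handle_req", "db_read", "pread64"), 61.0),
    (("main", "handle_req", "lock_wait", "futex"), 15.0),
]

by_stack = defaultdict(float)
for stack, ms in events:
    by_stack[stack] += ms  # coalesce times with like stacks

# Study stacks from largest to smallest total off-CPU time.
for stack, total_ms in sorted(by_stack.items(), key=lambda kv: -kv[1]):
    print(f"{total_ms:8.1f} ms  {' -> '.join(stack)}")
```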
Stack Profile Method
- Profile thread stack traces, on- and off-CPU
- Coalesce
- Study stacks bottom-up
TSA Method
- For each thread of interest, measure time in operating system thread states. Eg:
  - Executing
  - Runnable
  - Swapping
  - Sleeping
  - Lock
  - Idle
- Investigate states from most to least frequent, using appropriate tools
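A minimal sketch of the final step, given hypothetical per-thread state times (the numbers are invented; real values would come from scheduler tracing or delay accounting):

```python
# Hypothetical time one thread spent in each state (ms) over a one-second interval.
thread_states = {
    "Executing": 120.0,
    "Runnable": 310.0,
    "Swapping": 0.0,
    "Sleeping": 450.0,
    "Lock": 90.0,
    "Idle": 30.0,
}

# Investigate states in order of time spent; each state points to different
# follow-up tools (eg, Runnable -> CPU/scheduler analysis, Lock -> lock profiling).
for state, ms in sorted(thread_states.items(), key=lambda kv: -kv[1]):
    print(f"{ms:7.1f} ms  {state}")
```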
Active Benchmarking Method
- Configure the benchmark to run for a long duration
- While running, analyze performance using other tools, and determine limiting factors
Method R
1. Select user actions that matter for the business workload
2. Measure causes of response time for user actions
3. Calculate best net-payoff optimization activity:
   - If sufficient gain, tune
   - If insufficient gain, suspend tuning until something changes
4. Goto 1
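A minimal sketch of the net-payoff calculation, with invented estimates of time saved per action, call frequency, and implementation cost:

```python
# Hypothetical candidate optimizations: estimated saving per call (ms), how
# often the user action runs, and the cost to implement (days).
candidates = [
    {"name": "add index",     "saving_ms": 400,  "calls_per_day": 50_000,    "cost_days": 2},
    {"name": "cache lookups", "saving_ms": 30,   "calls_per_day": 2_000_000, "cost_days": 5},
    {"name": "rewrite batch", "saving_ms": 1200, "calls_per_day": 200,       "cost_days": 15},
]

for c in candidates:
    seconds_saved_per_day = c["saving_ms"] * c["calls_per_day"] / 1000
    c["net_payoff"] = seconds_saved_per_day / c["cost_days"]  # saving per day of effort

best = max(candidates, key=lambda c: c["net_payoff"])
print(f"best net-payoff activity: {best['name']} "
      f"({best['net_payoff']:,.0f} s/day saved per day of effort)")
```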
Performance Evaluation Steps
- State the goals of the study and define system boundaries
- List system services and possible outcomes
- Select performance metrics
- List system and workload parameters
- Select factors and their values
- Select the workload
- Design the experiments
- Analyze and interpret the data
- Present the results
- If necessary, start over
Capacity Planning Process
- Instrument the system
- Monitor system usage
- Characterize workload
- Predict performance under different alternatives
- Select the lowest cost, highest performance alternative
Intel Hierarchical Top-Down Performance Characterization Methodology
- Are UOPs issued?
  - If yes:
    - Are UOPs retired?
      - If yes: retiring (good)
      - If no: investigate bad speculation
  - If no:
    - Allocation stall?
      - If yes: investigate back-end stalls
      - If no: investigate front-end stalls
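The level-1 decision tree can be sketched as a toy classifier; the boolean inputs are placeholders for what would really be derived from PMU counter ratios (not reproduced here):

```python
def classify_slot(uops_issued: bool, uops_retired: bool, allocation_stall: bool) -> str:
    """Walk the level-1 decision tree above for one pipeline slot."""
    if uops_issued:
        return "retiring (good)" if uops_retired else "investigate bad speculation"
    return ("investigate back-end stalls" if allocation_stall
            else "investigate front-end stalls")

for issued, retired, stalled in [(True, True, False), (True, False, False),
                                 (False, False, True), (False, False, False)]:
    print(f"issued={issued!s:5} retired={retired!s:5} alloc_stall={stalled!s:5} "
          f"-> {classify_slot(issued, retired, stalled)}")
```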
Performance Mantras
- Don't do it
- Do it, but don't do it again
- Do it less
- Do it later
- Do it when they're not looking
- Do it concurrently
- Do it cheaper
Benchmarking Checklist
- Why not double?
- Did it break limits?
- Did it error?
- Does it reproduce?
- Does it matter?
- Did it even happen?
References
- The Blame-Someone-Else Anti-Method and the USE Method were developed by me and first appeared in print in [Gregg 13a]: Gregg, B., "Thinking Methodically about Performance", Communications of the ACM, Volume 56 Issue 2, Feb 2013. This article also included the first appearances in print of the Streetlight Anti-Method and the Problem Statement Method, also developed by me but based on earlier work (the streetlight effect, https://en.wikipedia.org/wiki/Streetlight_effect, and initial performance checklists used by Sun Microsystems support).
- The Random Change Anti-Method, Passive Benchmarking Anti-Method, Ad Hoc Checklist Method, Tools Method, TSA Method, and Active Benchmarking Method were developed by me and first appeared in print in [Gregg 13b]: Gregg, B., Systems Performance: Enterprise and the Cloud, Prentice Hall, Oct 2013.
- My USENIX LISA 2012 talk: "Performance Analysis Methodology", first summarized and named many of these (before they were in print).
- Method R is from [Millsap 03]: Millsap, C., Holt, J., Optimizing Oracle Performance, O'Reilly, 2003.
- Performance Evaluation Steps and Capacity Planning Process are from pages 26 and 124 of [Jain 91]: Jain, R., The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling, Wiley, 1991.
- Workload Characterization Method, and Drill-Down Analysis Method, were documented as specific methodologies in [Gregg 13a] and [Gregg 13b], but the general process has been known in IT for many years. You could glean it from [Jain 91], at least. I don't yet have an earlier reference.
- Intel Hierarchical Top-Down Performance Characterization Methodology was developed by Intel and documented in B.3.2 of the Intel 64 and IA-32 Architectures Optimization Reference Manual, 248966-030, Sep 2014.
- OODA Loop is from John Boyd of the US Air Force, and was popularized in technology by Roy Rapoport of Netflix.
- Performance Mantras are from Craig Hanson and Pat Crain.
- Benchmarking Checklist was developed by me.
- RED Method is from Tom Wilkie.