South Bay SRE meetup 2017: Netflix Performance Engineering
Talk by the Netflix Performance Engineering team for SBSRE 2017.
Video: https://www.youtube.com/watch?v=i5Ml9uY2rBw
Description: "A look into how Netflix measures and tunes performance for our clients and the streaming service."
This includes a section starting on slide 61 by Brendan Gregg titled "Netflix PMCs on the Cloud" showing low-level CPU performance analysis using PMCs on AWS EC2. (PMCs are performance monitoring counters from the performance monitoring unit [PMU]).
PDF: SBSRE_perf_meetup_aug2017.pdf
Keywords (from pdftotext):
slide 1:
Netflix Performance Meetup
slide 2:
Global Client Performance: Fast Metrics
slide 3:
3G in Kazakhstan
slide 4:
Making the Internet fast is slow. Global Internet: faster (better networking), slower (broader reach, congestion). Don't wait for it; measure it and deal. Working app > Feature-rich app
slide 5:
We need to know what the Internet looks like, without averages, seeing the full distribution.
slide 6:
Logging anti-patterns: Averages (can't see the distribution; outliers heavily distort) and Sampling (missed data; rare events: ∞, 0, negatives, errors). Problems aren't equal across the population. Instead, use the client as a map-reducer and send up aggregated data, less often.
slide 7:
Sizing up the Internet.
slide 8:
Infinite (free) compute power!
slide 9:
slide 10:
Get median, 95th, etc. Calculate the inverse empirical cumulative distribution function by math, or just use R, which is free and knows how to do it already: > library(HistogramTools) > iecdf (see the IECDF sketch below)
slide 12:
slide 13:
Data > Opinions.
slide 14:
Better than debating opinions. "We live in a 50ms world!" "No one really minds the spinner." "Why should we spend time on that instead of COOLFEATURE?" "There's no way that the client makes that many requests." Architecture is hard. Make it cheap to experiment where your users really are.
slide 15:
We built Daedalus. (chart labels: Fast, Slow, DNS Time, Elsewhere)
slide 16:
Interpret the data. Visual → Numerical: need the IECDF for percentiles, ƒ(0.50) = 50th (median), ƒ(0.95) = 95th. Cluster to get similar experiences, pretty colors (k-means, hierarchical, etc.).
slide 17:
slide 18:
slide 19:
slide 20:
slide 21:
Practical Teleportation. Go there! Abstract analysis is hard; feeling reality is much simpler than looking at graphs. Build!
slide 22:
Make a Reality Lab.
slide 23:
slide 24:
Don't guess. Developing a model based on production data, without missing the distribution of samples (network, render, responsiveness), will lead to better software. Global reach doesn't need to be scary. @gcirino42 http://blogofsomeguy.com
slide 25:
Icarus. Martin Spier, @spiermar, Performance Engineering @ Netflix
slide 26:
slide 27:
Problem & Motivation: Real-user performance monitoring solution. More insight into the App performance (as perceived by real users). Too many variables to trust synthetic tests and labs. Prioritize work around App performance. Track App improvement progress over time. Detect issues, internal and external.
slide 28:
Device Diversity ● Netflix runs on all sorts of devices ● Smart TVs, Gaming Consoles, Mobile Phones, Cable TV boxes, ... ● Consistently evaluate performance
slide 29:
slide 30:
What are we monitoring? User Actions (or things users do in the App): App Startup, User Navigation, Playing a Title. Internal App metrics.
slide 31:
What are we measuring? ● When does the timer start and stop? ● Time-to-Interactive (TTI) ○ Interactive, even if some items were not fully loaded and rendered ● Time-to-Render (TTR) ○ Everything above the fold (visible without scrolling) is rendered ● Play Delay ● Meaningful for what we are monitoring
slide 32:
High-dimensional Data ● Complex device categorization ● Geo regions, subregions, countries ● Highly granular network classifications ● High volume of A/B tests ● Different facets of the same user action ○ Cold, suspended and backgrounded App startups ○ Target view/page on App startup
slide 33:
slide 34:
slide 35:
slide 36:
Data Sketches ● Data structures that approximately resemble a much larger data set ● Preserve essential features! ● Significantly smaller! ● Faster to operate on!
slide 37:
t-Digest ● t-Digest data structure ● Rank-based statistics (such as quantiles) ● Parallel friendly (can be merged!) ● Very fast! ● Really accurate! https://github.com/tdunning/t-digest (see the merge sketch below)
slide 38:
+ t-Digest sketches
slide 39:
slide 40:
iOS Median Comparison, Break by Country
slide 41:
iOS Median Comparison, Break by Country + iPhone 6S Plus
slide 42:
CDFs by UI Version
slide 43:
Warm Startup Rate
slide 44:
A/B Cell Comparison
slide 45:
Anomaly Detection
slide 46:
Going Forward ● Resource utilization metrics ● Device profiling ○ Instrumenting client code ● Explore other visualizations ○ Frequency heat maps ● Connection between perceived performance, acquisition and retention. @spiermar
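Slides 10-16 compute percentiles from client-aggregated histograms via the inverse empirical CDF (IECDF); the R snippet above (HistogramTools) is truncated in the extraction. As a rough illustration only, not the talk's code, here is a minimal Python sketch that assumes clients upload bucketed counts as hypothetical (upper_bound_ms, count) pairs:

# Percentiles from client-aggregated histograms via the inverse empirical CDF,
# as sketched on slides 10-16 (the talk shows an R version using HistogramTools).
# The bucket layout and counts below are hypothetical, for illustration only.
from bisect import bisect_left
from itertools import accumulate

buckets = [(50, 1200), (100, 3400), (200, 2100), (400, 900), (800, 300), (1600, 100)]

def iecdf(buckets):
    """Return f(q) -> value: the inverse empirical CDF of a bucketed histogram."""
    bounds = [b for b, _ in buckets]
    cum = list(accumulate(c for _, c in buckets))
    total = cum[-1]
    def f(q):
        # First bucket whose cumulative count covers quantile q; report its
        # upper bound (a conservative estimate, no interpolation).
        return bounds[bisect_left(cum, q * total)]
    return f

f = iecdf(buckets)
print("median =", f(0.50), "ms   p95 =", f(0.95), "ms")   # f(0.50) = 50th, f(0.95) = 95th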
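Slides 36-44 build rank statistics (medians, 95th percentiles, CDFs) from mergeable t-Digest sketches. A minimal sketch of that workflow, assuming the third-party Python tdigest package (the talk references Ted Dunning's Java implementation) and made-up lognormal TTI samples:

# Mergeable quantile sketches, as described on slides 36-38.
# Assumes the third-party Python "tdigest" package (pip install tdigest);
# the sample data is simulated, not Netflix's.
import random
from tdigest import TDigest

def device_digest(samples_ms):
    """Build a per-device-type digest of TTI samples (milliseconds)."""
    d = TDigest()
    d.batch_update(samples_ms)
    return d

smart_tv = device_digest([random.lognormvariate(7.0, 0.4) for _ in range(10000)])
mobile = device_digest([random.lognormvariate(6.5, 0.5) for _ in range(10000)])

# Digests can be merged anywhere in the pipeline without keeping raw samples.
combined = smart_tv + mobile

for label, d in [("smart_tv", smart_tv), ("mobile", mobile), ("combined", combined)]:
    print("%-9s median=%7.1f ms  p95=%7.1f ms" % (label, d.percentile(50), d.percentile(95)))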
slide 47:
Netflix Autoscaling for experts. Vadim
slide 48:
Savings! ● Mid-tier stateless services are ~2/3rd of the total ● Savings: 30% of mid-tier footprint (roughly 30K instances) ○ Higher savings if we break it down by region ○ Even higher savings on services that scale well
slide 49:
Why we autoscale - philosophical reasons
slide 50:
Why we autoscale - pragmatic reasons: Encoding, Precompute, Failover, Red/black pushes, Curing cancer**, and more... (** Hack-day project)
slide 51:
Should you autoscale? Benefits ● On-demand capacity: direct $$ savings ● RI capacity: re-purposing spare capacity. However, for each server group, beware of ● Uneven distribution of traffic ● Sticky traffic ● Bursty traffic ● Small ASG sizes
slide 52:
Autoscaling impacts availability - true or false? False* (* if done correctly). Under-provisioning, however, can impact availability ● Autoscaling is not a problem ● The real problem is not knowing the performance characteristics of the service
slide 53:
AWS autoscaling mechanics: ASG, scaling policy, CloudWatch alarm, aggregated metric feed, notification. Tunables: ● Metric ● Threshold ● # of eval periods ● Scaling amount ● Warmup time
slide 54:
What metric to scale on? Resource utilization vs. throughput. Pros: tracks a direct measure of work; linear scaling; predictable; requires less adjustment over time. Cons: thresholds tend to drift over time; prone to changes in request mixture; less predictable; more oscillation / jitter.
slide 55:
Autoscaling on multiple metrics: proceed with caution ● Harder to reason about scaling behavior ● Different metrics might contradict each other, causing oscillation. Typical Netflix configuration: ● Scale-up policy on throughput ● Scale-down policy on throughput ● Emergency scale-up policy on CPU, aka "the hammer rule"
slide 56:
Well-behaved autoscaling
slide 57:
Common mistakes - "no rush" scaling. Problem: scaling amounts too small, cooldown too long. Effect: scaling lags behind the traffic flow; not enough capacity at peak, capacity wasted in the trough. Remedy: increase scaling amounts, migrate to step policies
slide 58:
Common mistakes - twitchy scaling. Problem: scale-up policy is too aggressive. Effect: unnecessary capacity churn. Remedy: reduce the scale-up amount, increase the # of eval periods
slide 59:
Common mistakes - should I stay or should I go. Problem: scale-up and scale-down thresholds are too close to each other. Effect: constant capacity oscillation. Remedy: move the thresholds farther apart
slide 60:
AWS target tracking - your best bet! Think of it as a step policy with auto-steps; you can also think of it as a thermostat. Accounts for the rate of change in the monitored metric. Pick a metric, set the target value and warmup time - that's it! (chart: Step vs. Target-tracking; see the policy sketch below)
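Slide 60's target-tracking advice ("pick a metric, set the target value and warmup time") amounts to a single scaling-policy call. A minimal sketch using boto3; the group name, 60% CPU target, and 300-second warmup are illustrative placeholders, not Netflix settings (slide 55 notes Netflix typically scales on throughput, with an emergency CPU rule):

# Target-tracking scaling policy, illustrating slide 60
# ("pick a metric, set the target value and warmup time - that's it!").
# ASG name, target value, and warmup below are placeholders, not Netflix settings.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

autoscaling.put_scaling_policy(
    AutoScalingGroupName="myservice-v042",   # hypothetical server group
    PolicyName="cpu-target-tracking",
    PolicyType="TargetTrackingScaling",
    EstimatedInstanceWarmup=300,             # seconds before a new instance counts toward the metric
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization",
        },
        "TargetValue": 60.0,                 # hold average CPU near 60%
    },
)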
slide 61:
Netflix PMCs on the Cloud. Brendan
slide 62:
90% CPU utilization: Busy, Waiting ("idle")
slide 63:
90% CPU utilization: Busy, Waiting ("idle"). Reality: Busy, Waiting ("stalled"), Waiting ("idle")
slide 64:
# perf stat -a -- sleep 10 on a typical cloud instance: "Performance Monitoring Counters (PMCs) in most clouds". Only the software counters report values (task-clock, context-switches, cpu-migrations, page-faults); cycles, stalled-cycles-frontend, stalled-cycles-backend, instructions, branches and branch-misses show <not supported>.
slide 65:
# perf stat -a -- sleep 10 on AWS EC2 m4.16xl: the hardware counters now report, e.g. 641320.173626 task-clock (msec), 64.122 CPUs utilized, 655,419,788,755 cycles (1.022 GHz), 536,830,399,277 instructions.
slide 66:
Continued: 0.82 insns per cycle, 97,103,651,128 branches, 1,230,478,597 branch-misses (1.27% of all branches); stalled-cycles-frontend and stalled-cycles-backend remain <not supported>. Interpreting IPC & Actionable Items. IPC: Instructions Per Cycle (the inverse of CPI). IPC < 1.0: likely stall cycle bound. IPC > 1.0: likely instruction bound: reduce code execution, eliminate unnecessary work, cache operations, improve algorithm order; can analyze using CPU flame graphs; faster CPUs. (See the IPC calculation below.)
slide 67:
Intel Architectural PMCs (Event Name, Umask, Event Select, Example Event Mask Mnemonic): UnHalted Core Cycles (00H, 3CH, CPU_CLK_UNHALTED.THREAD_P); Instruction Retired (00H, C0H, INST_RETIRED.ANY_P); UnHalted Reference Cycles (01H, 3CH, CPU_CLK_THREAD_UNHALTED.REF_XCLK); LLC Reference (4FH, 2EH, LONGEST_LAT_CACHE.REFERENCE); LLC Misses (41H, 2EH, LONGEST_LAT_CACHE.MISS); Branch Instruction Retired (00H, C4H, BR_INST_RETIRED.ALL_BRANCHES); Branch Misses Retired (00H, C5H, BR_MISP_RETIRED.ALL_BRANCHES). Now available in AWS EC2 on full dedicated hosts (e.g., m4.16xl, ...)
slide 68:
# pmcarch 1: per-interval columns CYCLES, INSTRUCTIONS, IPC, BR_RETIRED, BR_MISPRED, BMR%, LLCREF, LLCMISS, LLC% (IPC samples of 0.67-0.82). https://github.com/brendangregg/pmc-cloud-tools. Also tiptop (Tasks: 96 total, 3 displayed): per-process columns PID, %CPU, %SYS, Mcycle, Minstr, IPC, %MISS, %BMIS, %BUS, COMMAND, for processes such as java, nm-applet, dbus-daemon.
slide 69:
Netflix Performance Meetup
slide 70:
Netflix Performance Meetup
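Slides 64-68 derive a few ratios (IPC, branch misprediction rate) from the raw PMCs that perf(1) and pmcarch print. A small Python sketch of that arithmetic, plugging in the perf stat counters shown for the m4.16xl; the IPC rule of thumb follows slides 66-67:

# Ratios that perf(1) and pmcarch derive from raw PMCs (slides 64-68).
# Counter values are the perf stat numbers shown for the m4.16xl above.
def summarize(cycles, instructions, branches, branch_misses):
    ipc = instructions / cycles                 # instructions per cycle (inverse of CPI)
    bmr = 100.0 * branch_misses / branches      # branch misprediction rate, %
    verdict = "likely instruction bound" if ipc > 1.0 else "likely stall cycle bound"
    return ipc, bmr, verdict

ipc, bmr, verdict = summarize(
    cycles=655_419_788_755,
    instructions=536_830_399_277,
    branches=97_103_651_128,
    branch_misses=1_230_478_597,
)
print("IPC = %.2f (%s), branch mispredict rate = %.2f%%" % (ipc, verdict, bmr))
# Matches the perf output above: 0.82 insns per cycle, 1.27% of all branches missed.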