DockerCon2017_performance

DockerCon 2017: Container Performance Analysis

Slides for DockerCon 2017 by Brendan Gregg

Video: https://www.youtube.com/watch?v=bK9A5ODIgac

Description: "Containers pose interesting challenges for performance monitoring and analysis, requiring new analysis methodologies and tooling. Resource-oriented analysis, as is common with systems performance tools and GUIs, must now account for both hardware limits and soft limits, as implemented using resource controls including cgroups. The interaction between containers can also be examined, and noisy neighbors either identified or exonerated. Performance tooling can also need special usage or workarounds to function properly from within a container or on the host, to deal with different privilege levels and namespaces. At Netflix, we're using containers for some microservices, and care very much about analyzing and tuning our containers to be as fast and efficient as possible. This talk will show how to successfully analyze performance in a Docker container environment, and navigate differences encountered."

PDF: DockerCon2017_performance_analysis.pdf

Keywords (from pdftotext):

slide 1:

Container Performance
Analysis
Brendan Gregg
Sr. Performance Architect,
Netflix

slide 2:

Take Aways
Identify bottlenecks:
1. In the host vs container, using system metrics
2. In application code on containers, using CPU flame graphs
3. Deeper in the kernel, using tracing tools
Focus of this talk is how containers work in Linux (will demo on 4.9)
I will include some Docker specifics, and start with a Netflix summary (Titus)

slide 3:

1. Titus
Containers at Netflix
Summary slides from the Titus team

slide 4:

Titus
• Cloud runtime platform for container jobs
• Scheduling
Service & batch job management
Advanced resource management across elastic shared resource pool
• Container Execution
• Docker and AWS EC2 Integration
Adds VPC, security groups, EC2 metadata, IAM roles, S3 logs, …
• Integration with Netflix infrastructure
• In depth: http://techblog.netflix.com/2017/04/the-evolution-of-container-usage-at.html
[Diagram labels: Service, Batch; Job Management; Resource Management & Optimization; Container Execution; Integration]

slide 5:

Current Titus Scale
• Deployed across multiple AWS accounts & three regions
• Over 2,500 instances (Mostly M4.4xls & R3.8xls)
• Over a week period launched over 1,000,000 containers

slide 6:

Titus Use Cases
• Service
Stream Processing (Flink)
UI Services (Node.JS single core)
Internal dashboards
• Batch
Algorithm training, personalization & recommendations
Adhoc reporting
Continuous integration builds
• Queued worker model
Media encoding

slide 7:

Container Performance @Netflix
• Ability to scale and balance workloads with EC2 and Titus
Can already solve many perf issues
• Performance needs:
• Application analysis: using CPU flame graphs with containers
• Host tuning: file system, networking, sysctl's, …
• Container analysis and tuning: cgroups, GPUs, …
• Capacity planning: reduce over provisioning

slide 8:

2. Container Background
And Strategy

slide 9:

Namespaces
Restricting visibility
Namespaces:
• cgroup
• ipc
• mnt
• net
• pid
• user
• uts
PID namespaces
Host
PID 1
PID namespace 1
1 (1238)
2 (1241)
Kernel

slide 10:

Control Groups
Restricting usage
cgroups:
• blkio
• cpu,cpuacct
• cpuset
• devices
• hugetlb
• memory
• net_cls,net_prio
• pids
• …
CPU cgroups
container
container
container
cpu
cgroup 1
CPUs

slide 11:

Linux Containers
Container = combination of namespaces & cgroups
Host
Container 1
Container 2
Container 3
(namespaces)
(namespaces)
(namespaces)
cgroups
cgroups
cgroups
Kernel

slide 12:

cgroup v1
cpu,cpuacct:
cap CPU usage (hard limit). e.g. 1.5 CPUs.
CPU shares. e.g. 100 shares.
usage statistics (cpuacct)
Docker: --cpus (1.13), --cpu-shares
memory:
limit and kmem limit (maximum bytes)
OOM control: enable/disable
usage statistics
Docker: --memory, --kernel-memory, --oom-kill-disable
blkio (block I/O):
weights (like shares)
IOPS/tput caps per storage device
statistics
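The hard cap above is expressed in cgroup v1 through the cpu.cfs_quota_us and cpu.cfs_period_us files (both appear in the directory listing on slide 40). The following Python sketch shows the arithmetic that relates a cap such as 1.5 CPUs to those two values. The function names are hypothetical, and the 100 ms period is an assumption (it is the usual default); check cpu.cfs_period_us on the system being analyzed.

```python
from typing import Optional


def cpus_to_cfs_quota(cpus: float, period_us: int = 100_000) -> int:
    """Return the cpu.cfs_quota_us value that caps a cgroup at `cpus` CPUs.

    period_us is assumed to be 100 ms unless the caller passes the real value.
    """
    if cpus <= 0:
        raise ValueError("cpus must be positive")
    return int(round(cpus * period_us))


def cfs_quota_to_cpus(quota_us: int, period_us: int) -> Optional[float]:
    """Return the CPU cap implied by quota and period, or None if uncapped.

    A quota of -1 means no cap is configured for the cgroup.
    """
    if quota_us < 0:
        return None
    return quota_us / period_us


if __name__ == "__main__":
    quota = cpus_to_cfs_quota(1.5)
    print("quota for 1.5 CPUs:", quota)
    print("cap for that quota:", cfs_quota_to_cpus(quota, 100_000))
```

Reading the two files for a container and passing them to cfs_quota_to_cpus() gives the cap to compare against measured CPU usage.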

slide 13:

CPU Shares
Container's CPU limit = 100% x (container's shares / total busy shares)
This lets a container use other tenants' idle CPU (aka "bursting"), when available.
Container's minimum CPU limit = 100% x (container's shares / total allocated shares)
Can make analysis tricky. Why did perf regress? Less bursting available?
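A small worked example of the two share formulas on this slide may help. This is a sketch in Python; the tenant names and share values are invented for illustration, and the function name is hypothetical.

```python
def share_limit_pct(shares: int, total_shares: int) -> float:
    """Return 100% x (shares / total_shares), as in the slide's formulas."""
    if total_shares <= 0:
        raise ValueError("total_shares must be positive")
    return 100.0 * shares / total_shares


# Hypothetical host with three tenants, each allocated 100 shares.
allocated = {"a": 100, "b": 100, "c": 100}

# Minimum limit: every tenant is busy, so divide by all allocated shares.
minimum = share_limit_pct(allocated["a"], sum(allocated.values()))

# Bursting: tenant "c" is idle, so only the busy tenants' shares count.
busy = {name: s for name, s in allocated.items() if name != "c"}
burst = share_limit_pct(allocated["a"], sum(busy.values()))

if __name__ == "__main__":
    print(f"minimum limit: {minimum:.1f}%")
    print(f"limit while c is idle: {burst:.1f}%")
```

The gap between the two numbers is the bursting headroom, which shrinks as neighbors become busy. That is one way performance can regress without any change inside the container.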

slide 14:

cgroup v2
• Major rewrite has been happening: cgroups v2
Supports nested groups, better organization and consistency
Some already merged, some not yet (e.g. CPU)
• See docs/talks by maintainer Tejun Heo (Facebook)
• References:
https://www.kernel.org/doc/Documentation/cgroup-v2.txt
https://lwn.net/Articles/679786/

slide 15:

Container OS Configuration
File systems
• Containers may be setup with aufs/overlay on top of another FS
• See "in practice" pages and their performance sections from
https://docs.docker.com/engine/userguide/storagedriver/
Networking
• With Docker, can be bridge, host, or overlay networks
• Overlay networks have come with significant performance cost

slide 16:

Analysis Strategy
Performance analysis with containers:
• One kernel
• Two perspectives
• Namespaces
• cgroups
Methodologies:
• USE Method
• Workload characterization
• Checklists
• Event tracing

slide 17:

USE Method
For every resource, check:
1. Utilization
2. Saturation
3. Errors
Resource
Utilization
(%)
For example, CPUs:
• Utilization: time busy
• Saturation: run queue length or latency
• Errors: ECC errors, etc.
Can be applied to hardware resources and software resources (cgroups)

slide 18:

3. Host Tools
And Container Awareness
… if you have host access

slide 19:

Host Analysis Challenges
• PIDs in host don't match those seen in containers
• Symbol files aren't where tools expect them
• The kernel currently doesn't have a container ID

slide 20:

CLI Tool
Disclaimer
I'll demo CLI tools
It's the lowest common
denominator
You may usually use GUIs
(like we do). They source
the same metrics.

slide 21:

3.1. Host Physical Resources
A refresher of basics... Not container specific.
This will, however, solve many issues!
Containers are often not the problem.

slide 22:

Linux
Perf
Tools
Where
can we
begin?

slide 23:

Host Perf Analysis in 60s
uptime                  load averages
dmesg | tail            kernel errors
vmstat 1                overall stats by time
mpstat -P ALL 1         CPU balance
pidstat 1               process usage
iostat -xz 1            disk I/O
free -m                 memory usage
sar -n DEV 1            network I/O
sar -n TCP,ETCP 1       TCP stats
top                     check overview
http://techblog.netflix.com/2015/11/linux-performance-analysis-in-60s.html

slide 24:

USE Method: Host Resources
Resource         Utilization                           Saturation                                   Errors
CPU              mpstat -P ALL 1, sum non-idle fields  vmstat 1, "r"                                perf
Memory Capacity  free -m, "used"/"total"               vmstat 1, "si"+"so"; dmesg | grep killed     dmesg
Storage I/O      iostat -xz 1, "%util"                 iostat -xnz 1, "avgqu-sz" > 1                /sys/…/ioerr_cnt; smartctl
Network          nicstat, "%Util"                      ifconfig, "overruns"; netstat -s "retrans…"  ifconfig, "errors"
These should be in your monitoring GUI. Can do other resources too (busses, ...)

slide 25:

Event Tracing: e.g. iosnoop
Disk I/O events with latency (from perf-tools; also in bcc/BPF as biosnoop)
# ./iosnoop
Tracing block I/O... Ctrl-C to end.
COMM         PID    TYPE DEV      BLOCK        BYTES     LATms
supervise                202,1
supervise                202,1
tar          14794  RM   202,1
tar          14794  RM   202,1
tar          14794  RM   202,1
tar          14794  RM   202,1
tar          14794  RM   202,1
tar          14794  RM   202,1
tar          14794  RM   202,1
tar          14794  RM   202,1

slide 26:

Event Tracing: e.g. zfsslower
# /usr/share/bcc/tools/zfsslower 1
Tracing ZFS operations slower than 1 ms
TIME     COMM     PID    T BYTES  OFF_KB  LAT(ms) FILENAME
23:44:40 java     31386  O 0                 8.02 solrFeatures.txt
23:44:53 java     31386  W 8190             36.24 solrFeatures.txt
23:44:59 java     31386  W 8192             20.28 solrFeatures.txt
23:44:59 java     31386  W 8191             28.15 solrFeatures.txt
23:45:00 java     31386  W 8192             32.17 solrFeatures.txt
23:45:15 java     31386  O 0                27.44 solrFeatures.txt
23:45:56 dockerd         S 0                 1.03 .tmp-a66ce9aad…
23:46:16 java     31386  W 31               36.28 solrFeatures.txt
• This is from our production Titus system (Docker).
• File system latency is a better pain indicator than disk latency.
• zfsslower (and btrfs*, etc) are in bcc/BPF. Can exonerate FS/disks.

slide 27:

Latency Histograms: e.g. btrfsdist
# ./btrfsdist
Tracing btrfs operation latency... Hit Ctrl-C to end.
operation = 'read'
     usecs          : count     distribution
       0 -> 1       : 192529   |****************************************|
       2 -> 3       : 72337    |***************
       4 -> 7       : 5620
       8 -> 15      : 1026
      16 -> 31      : 369
      32 -> 63      : 239
      64 -> 127     : 53
     128 -> 255     : 975
     256 -> 511     : 524
     512 -> 1023    : 128
    1024 -> 2047    : 16
    2048 -> 4095    : 7
    4096 -> 8191    : 2
Slide annotations: the low-latency mode is "probably cache reads"; the mode around 128 -> 1023 usecs is "probably cache misses (flash reads)".

slide 28:

Latency Histograms: e.g. btrfsdist
[…]
operation = 'write'
     usecs          : count     distribution
       0 -> 1       : 1
       2 -> 3       : 276
       4 -> 7       : 32125    |***********
       8 -> 15      : 111253   |****************************************|
      16 -> 31      : 59154    |*********************
      32 -> 63      : 5463
      64 -> 127     : 612
     128 -> 255     : 25
     256 -> 511     : 2
     512 -> 1023    : 1
• From a test Titus system (Docker).
• Histograms show modes, outliers. Also in bcc/BPF (with other FSes).
• Latency heat maps: http://queue.acm.org/detail.cfm?id=1809426

slide 29:

3.2. Host Containers & cgroups
Inspecting containers from the host

slide 30:

Namespaces
Worth checking namespace config before analysis:
# ./dockerpsns.sh
CONTAINER    NAME                 PID   PATH             CGROUP     IPC        MNT        NET        PID        USER       UTS
host         titusagent-mainvpc-m 1     systemd          4026531835 4026531839 4026531840 4026532533 4026531836 4026531837 4026531838
b27909cd6dd1 Titus-1435830-worker 37280 svscanboot       4026531835 4026533387 4026533385 4026532931 4026533388 4026531837 4026533386
dcf3a506de45 Titus-1392192-worker 27992 /apps/spaas/spaa 4026531835 4026533354 4026533352 4026532991 4026533355 4026531837 4026533353
370a3f041f36 Titus-1243558-worker 98602 /apps/spaas/spaa 4026531835 4026533290 4026533288 4026533223 4026533291 4026531837 4026533289
af7549c76d9a Titus-1243553-worker 97972 /apps/spaas/spaa 4026531835 4026533216 4026533214 4026533149 4026533217 4026531837 4026533215
dc27769a9b9c Titus-1243546-worker 97356 /apps/spaas/spaa 4026531835 4026533142 4026533140 4026533075 4026533143 4026531837 4026533141
e18bd6189dcd Titus-1243517-worker 96733 /apps/spaas/spaa 4026531835 4026533068 4026533066 4026533001 4026533069 4026531837 4026533067
ab45227dcea9 Titus-1243516-worker 96173 /apps/spaas/spaa 4026531835 4026532920 4026532918 4026532830 4026532921 4026531837 4026532919
• A POC "docker ps --namespaces" tool. NS shared with root in red.
• https://github.com/docker/docker/issues/32501

slide 31:

systemd-cgtop
A "top" for cgroups:
# systemd-cgtop
Control Group
/docker
/docker/dcf3a...9d28fc4a1c72bbaff4a24834
/docker/370a3...e64ca01198f1e843ade7ce21
/system.slice
/system.slice/daemontools.service
/docker/dc277...42ab0603bbda2ac8af67996b
/user.slice
/user.slice/user-0.slice
/user.slice/u....slice/session-c26.scope
/docker/ab452...c946f8447f2a4184f3ccff2a
/docker/e18bd...26ffdd7368b870aa3d1deb7a
[...]
Tasks
%CPU
Memory
45.9G
42.1G
24.0G
3.0G
4.1G
2.8G
2.3G
34.5M
15.7M
13.3M
6.3G
2.9G
Input/s Output/s

slide 32:

docker stats
A "top" for containers. Resource utilization. Workload characterization.
# docker stats
CONTAINER     CPU %    MEM USAGE / LIMIT      MEM %   NET I/O    BLOCK I/O       PIDS
353426a09db1  526.81%  4.061 GiB / 8.5 GiB    47.78%  0 B / 0 B  2.818 MB / 0 B
6bf166a66e08  303.82%  3.448 GiB / 8.5 GiB    40.57%  0 B / 0 B  2.032 MB / 0 B
58dcf8aed0a7  41.01%   1.322 GiB / 2.5 GiB    52.89%  0 B / 0 B  0 B / 0 B
61061566ffe5  85.92%   220.9 MiB / 3.023 GiB  7.14%   0 B / 0 B  43.4 MB / 0 B
bdc721460293  2.69%    1.204 GiB / 3.906 GiB  30.82%  0 B / 0 B  4.35 MB / 0 B
6c80ed61ae63  477.45%  557.7 MiB / 8 GiB      6.81%   0 B / 0 B  9.257 MB / 0 B
337292fb5b64  89.05%   766.2 MiB / 8 GiB      9.35%   0 B / 0 B  5.493 MB / 0 B
b652ede9a605  173.50%  689.2 MiB / 8 GiB      8.41%   0 B / 0 B  6.48 MB / 0 B
d7cd2599291f  504.28%  673.2 MiB / 8 GiB      8.22%   0 B / 0 B  12.58 MB / 0 B
05bf9f3e0d13  314.46%  711.6 MiB / 8 GiB      8.69%   0 B / 0 B  7.942 MB / 0 B
09082f005755  142.04%  693.9 MiB / 8 GiB      8.47%   0 B / 0 B  8.081 MB / 0 B
bd45a3e1ce16  190.26%  538.3 MiB / 8 GiB      6.57%   0 B / 0 B  10.6 MB / 0 B
[...]
Loris Degioanni demoed a similar sysdigcloud view yesterday (needs the sysdig kernel agent)

slide 33:

top
In the host, top shows all processes. Currently doesn't show a container ID.
# top - 22:46:53 up 36 days, 59 min, 1 user, load average: 5.77, 5.61, 5.63
Tasks: 1067 total,
1 running, 1046 sleeping,
0 stopped, 20 zombie
%Cpu(s): 34.8 us, 1.8 sy, 0.0 ni, 61.3 id, 0.0 wa, 0.0 hi, 1.9 si, 0.1 st
KiB Mem : 65958552 total, 12418448 free, 49247988 used, 4292116 buff/cache
KiB Swap:
0 total,
0 free,
0 used. 13101316 avail Mem
PID USER
28321 root
97712 root
98306 root
96511 root
5283 root
2079 root
5272 titusag+
[…]
VIRT
RES
0 33.126g 0.023t
0 11.445g 2.333g
0 12.149g 3.060g
0 15.567g 6.313g
0 1643676 100092
0 10.473g 1.611g
SHR S %CPU %MEM
TIME+ COMMAND
37564 S 621.1 38.2 35184:09 java
37084 S
3.1 3.7 404:27.90 java
36996 S
2.0 4.9 194:21.10 java
37112 S
1.7 10.0 168:07.44 java
94184 S
1.0 0.2 401:36.16 mesos-slave
12 S
0.7 0.0 220:07.75 rngd
23488 S
0.7 2.6
1934:44 java
… remember, there is no container ID in the kernel yet.

slide 34:

htop
htop can add a CGROUP field, but, can truncate important info:
CGROUP
PID USER
PRI NI VIRT
RES
SHR S CPU% MEM%
TIME+ Command
:pids:/docker/ 28321 root
0 33.1G 24.0G 37564 S 524. 38.2
672h /apps/java
:pids:/docker/
9982 root
0 33.1G 24.0G 37564 S 44.4 38.2 17h00:41 /apps/java
:pids:/docker/
9985 root
0 33.1G 24.0G 37564 R 41.9 38.2 16h44:51 /apps/java
:pids:/docker/
9979 root
0 33.1G 24.0G 37564 S 41.2 38.2 17h01:35 /apps/java
:pids:/docker/
9980 root
0 33.1G 24.0G 37564 S 39.3 38.2 16h59:17 /apps/java
:pids:/docker/
9981 root
0 33.1G 24.0G 37564 S 39.3 38.2 17h01:32 /apps/java
:pids:/docker/
9984 root
0 33.1G 24.0G 37564 S 37.3 38.2 16h49:03 /apps/java
:pids:/docker/
9983 root
0 33.1G 24.0G 37564 R 35.4 38.2 16h54:31 /apps/java
:pids:/docker/
9986 root
0 33.1G 24.0G 37564 S 35.4 38.2 17h05:30 /apps/java
:name=systemd:/user.slice/user-0.slice/session-c31.scope? 74066 root
0 27620
:pids:/docker/
9998 root
0 33.1G 24.0G 37564 R 28.3 38.2 11h38:03 /apps/java
:pids:/docker/ 10001 root
0 33.1G 24.0G 37564 S 27.7 38.2 11h38:59 /apps/java
:name=systemd:/system.slice/daemontools.service?
5272 titusagen 20
0 10.5G 1650M 23
:pids:/docker/ 10002 root
0 33.1G 24.0G 37564 S 25.1 38.2 11h40:37 /apps/java
Can fix, but that would be Docker + cgroup-v1 specific. Still need a kernel CID.

slide 35:

Host PID -> Container ID
… who does that (CPU busy) PID 28321 belong to?
# grep 28321 /sys/fs/cgroup/cpu,cpuacct/docker/*/tasks | cut -d/ -f7
dcf3a506de453107715362f6c9ba9056fcfc6e769d28fc4a1c72bbaff4a24834
• Only works for Docker, and that cgroup v1 layout. Some Linux commands:
# ls -l /proc/27992/ns/*
lrwxrwxrwx 1 root root 0 Apr 13 20:49 cgroup -> cgroup:[4026531835]
lrwxrwxrwx 1 root root 0 Apr 13 20:49 ipc -> ipc:[4026533354]
lrwxrwxrwx 1 root root 0 Apr 13 20:49 mnt -> mnt:[4026533352]
[…]
# cat /proc/27992/cgroup
11:freezer:/docker/dcf3a506de453107715362f6c9ba9056fcfc6e769d28fc4a1c72bbaff4a24834
10:blkio:/docker/dcf3a506de453107715362f6c9ba9056fcfc6e769d28fc4a1c72bbaff4a24834
9:perf_event:/docker/dcf3a506de453107715362f6c9ba9056fcfc6e769d28fc4a1c72bbaff4a24834
[…]
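The grep and cat commands above can also be scripted. The following Python sketch extracts a Docker container ID from the text of /proc/PID/cgroup. Like the one-liner on this slide, it assumes the Docker cgroup v1 layout, where the cgroup path ends in /docker/ followed by the container ID. The function name is hypothetical, and the sample input is the output shown on this slide.

```python
import re
from typing import Optional

# Assumes the Docker + cgroup v1 path layout shown on this slide.
DOCKER_PATH = re.compile(r"/docker/([0-9a-f]{12,})$")


def docker_container_id(cgroup_text: str) -> Optional[str]:
    """Return the container ID from /proc/PID/cgroup text, or None."""
    for line in cgroup_text.splitlines():
        # Each line is "hierarchy-ID:controller-list:cgroup-path".
        parts = line.strip().split(":", 2)
        if len(parts) != 3:
            continue
        match = DOCKER_PATH.search(parts[2])
        if match:
            return match.group(1)
    return None


SAMPLE = (
    "11:freezer:/docker/dcf3a506de453107715362f6c9ba9056fcfc6e769d28fc4a1c72bbaff4a24834\n"
    "10:blkio:/docker/dcf3a506de453107715362f6c9ba9056fcfc6e769d28fc4a1c72bbaff4a24834\n"
)

if __name__ == "__main__":
    print(docker_container_id(SAMPLE))
```

On a live host the input would come from reading /proc/PID/cgroup for the PID of interest. A process outside any Docker cgroup returns None.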

slide 36:

nsenter Wrapping
… what hostname is PID 28321 running on?
# nsenter -t 28321 -u hostname
titus-1392192-worker-14-16
• Can namespace enter: -m: mount, -u: uts, -i: ipc, -n: net, -p: pid, -U: user
• Bypasses cgroup limits, and seccomp profile (allowing syscalls)
For Docker, you can enter the container more completely with: docker exec -it CID command
• Handy nsenter one-liners:
nsenter -t PID -u hostname      container hostname
nsenter -t PID -n netstat -i    container netstat
nsenter -t PID -m -p df -h      container file system usage
nsenter -t PID -p top           container top

slide 37:

nsenter: Host -> Container top
… Given PID 28321, running top for its container by entering its namespaces:
# nsenter -t 28321 -m -p top
top - 18:16:13 up 36 days, 20:28, 0 users, load average: 5.66, 5.29, 5.28
Tasks:
6 total,
1 running,
5 sleeping,
0 stopped,
0 zombie
%Cpu(s): 30.5 us, 1.7 sy, 0.0 ni, 65.9 id, 0.0 wa, 0.0 hi, 1.8 si, 0.1 st
KiB Mem: 65958552 total, 54664124 used, 11294428 free,
164232 buffers
KiB Swap:
0 total,
0 used,
0 free. 1592372 cached Mem
PID USER
301 root
1 root
87888 root
VIRT
RES
0 33.127g 0.023t
SHR S %CPU %MEM
37564 S 537.3 38.2
1812 S
0.0 0.0
1348 R
0.0 0.0
TIME+ COMMAND
40269:41 java
4:15.11 bash
0:00.00 top
Note that it is PID 301 in the container. Can also see this using:
# grep NSpid /proc/28321/status
NSpid:
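The NSpid lookup can be scripted as well. This Python sketch parses the NSpid line from /proc/PID/status text. The line lists the task's PID in each nested PID namespace, outermost first. The sample text is an assumption built from the two PIDs this slide mentions (28321 on the host, 301 in the container); it is not captured output, and the function name is hypothetical.

```python
from typing import List


def nspids(status_text: str) -> List[int]:
    """Return the PIDs on the NSpid line, outermost namespace first."""
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(field) for field in line.split()[1:]]
    return []  # kernels without the NSpid field


# Illustrative input only: built from the PIDs named on this slide.
SAMPLE_STATUS = "Name:\tjava\nNSpid:\t28321\t301\n"

if __name__ == "__main__":
    pids = nspids(SAMPLE_STATUS)
    print("host PID:", pids[0], "container PID:", pids[-1])
```

On a host, the input would be the contents of /proc/28321/status.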

slide 38:

perf: CPU Profiling
Can run system-wide (-a), match a pid (-p), or cgroup (-G, if it works)
# perf record -F 49 -a -g -- sleep 30
# perf script
Failed to open /lib/x86_64-linux-gnu/libc-2.19.so, continuing without symbols
Failed to open /tmp/perf-28321.map, continuing without symbols
• Current symbol translation gotchas (up to 4.10-ish):
perf can't find /tmp/perf-PID.map files in the host, and the PID is different
perf can't find container binaries under host paths (what /usr/bin/java?)
• Can copy files to the host, map PIDs, then run perf script/report:
http://blog.alicegoldfuss.com/making-flamegraphs-with-containerized-java/
http://batey.info/docker-jvm-flamegraphs.html
• Can nsenter (-m -u -i -n -p) a "power" shell, and then run "perf -p PID"
• perf should be fixed to be namespace aware (like bcc was, PR#1051)

slide 39:

CPU Flame Graphs
git clone --depth 1 https://github.com/brendangregg/FlameGraph
cd FlameGraph
perf record -F 49 -a -g -- sleep 30
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > perf.svg
• See previous slide for getting perf symbols to work
• From the host, can study all containers, as well as container overheads
Kernel TCP/IP stack
Look in areas like this to find
and quantify overhead (cgroup
throttles, FS layers, networking, etc).
It's likely small and hard to find.
Java, missing stacks (need
-XX:+PreserveFramePointer)

slide 40:

/sys/fs/cgroup (raw)
The best source for per-cgroup metrics. e.g. CPU:
# cd /sys/fs/cgroup/cpu,cpuacct/docker/02a7cf65f82e3f3e75283944caa4462e82f8f6ff5a7c9a...
# ls
cgroup.clone_children cpuacct.usage_all
cpuacct.usage_sys
cpu.shares
cgroup.procs
cpuacct.usage_percpu
cpuacct.usage_user cpu.stat
cpuacct.stat
cpuacct.usage_percpu_sys
cpu.cfs_period_us
notify_on_release
cpuacct.usage
cpuacct.usage_percpu_user cpu.cfs_quota_us
tasks
# cat cpuacct.usage
# cat cpu.stat
nr_periods 507
nr_throttled 74
throttled_time 3816445175
throttled_time: total time throttled (nanoseconds). Saturation metric.
Average throttle time = throttled_time / nr_throttled
https://www.kernel.org/doc/Documentation/cgroup-v1/, ../scheduler/sched-bwc.txt
https://blog.docker.com/2013/10/gathering-lxc-docker-containers-metrics/
Note: grep cgroup /proc/mounts to check where these are mounted
These metrics should be included in performance monitoring GUIs
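As a worked example of the cpu.stat arithmetic on this slide, the following Python sketch parses the three counters and derives the average throttle time. It also computes the fraction of enforcement periods that were throttled, which is an extra ratio added here for illustration and is not on the slide. The function names are hypothetical; the input values are the ones shown above.

```python
def parse_cpu_stat(text: str) -> dict:
    """Parse cgroup v1 cpu.stat text into a {name: int} dictionary."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            stats[fields[0]] = int(fields[1])
    return stats


def avg_throttle_ms(stats: dict) -> float:
    """Average throttle time = throttled_time / nr_throttled, in ms."""
    if stats.get("nr_throttled", 0) == 0:
        return 0.0
    return stats["throttled_time"] / stats["nr_throttled"] / 1e6


def throttled_fraction(stats: dict) -> float:
    """Fraction of enforcement periods in which the cgroup was throttled."""
    if stats.get("nr_periods", 0) == 0:
        return 0.0
    return stats["nr_throttled"] / stats["nr_periods"]


SAMPLE = "nr_periods 507\nnr_throttled 74\nthrottled_time 3816445175\n"

if __name__ == "__main__":
    stats = parse_cpu_stat(SAMPLE)
    print(f"average throttle: {avg_throttle_ms(stats):.1f} ms")
    print(f"periods throttled: {100 * throttled_fraction(stats):.1f}%")
```

For the values on this slide, the average throttle works out to roughly 52 ms, with about 15% of periods throttled. Because the counters are cumulative, a monitoring tool would normally take deltas between two reads.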

slide 41:

Netflix Atlas
Cloud-wide monitoring of
containers (and instances)
Fetches cgroup metrics via
Intel snap
https://github.com/netflix/Atlas

slide 42:

Netflix Vector
Our per-instance analyzer
Has per-container metrics
https://github.com/Netflix/vector

slide 43:

Intel snap
A metric collector used by
monitoring GUIs
https://github.com/intelsdi-x/snap
Has a Docker plugin to read
cgroup stats
There's also a collectd plugin:
https://github.com/bobrik/collectddocker

slide 44:

3.3. Let's Play a Game
Host or Container?
(or neither?)

slide 45:

Game Scenario 1
Container user claims they have a CPU performance issue
Container has a CPU cap and CPU shares configured
There is idle CPU on the host
Other tenants are CPU busy
/sys/fs/cgroup/.../cpu.stat -> throttled_time is increasing
/proc/PID/status nonvoluntary_ctxt_switches is increasing
Container CPU usage equals its cap (clue: this is not really a clue)

slide 46:

Game Scenario 2
Container user claims they have a CPU performance issue
Container has a CPU cap and CPU shares configured
There is no idle CPU on the host
Other tenants are CPU busy
/sys/fs/cgroup/.../cpu.stat -> throttled_time is not increasing
/proc/PID/status nonvoluntary_ctxt_switches is increasing

slide 47:

Game Scenario 3
Container user claims they have a CPU performance issue
Container has CPU shares configured
There is no idle CPU on the host
Other tenants are CPU busy
/sys/fs/cgroup/.../cpu.stat -> throttled_time is not increasing
/proc/PID/status nonvoluntary_ctxt_switches is not increasing much
Experiments to confirm conclusion?

slide 48:

Methodology: Reverse Diagnosis
Enumerate possible outcomes, and work backwards to the metrics needed for diagnosis.
e.g. CPU performance outcomes:
A. physical CPU throttled
B. cap throttled
C. shares throttled (assumes physical CPU limited as well)
D. not throttled
Game answers: 1. B, 2. C, 3. D

slide 49:

CPU Bottleneck Identification
[Flowchart. Decision nodes: throttled_time increasing? nonvol…switches increasing? host has idle CPU? all other tenants idle? Outcomes: cap throttled; not throttled (but dig further); share throttled; physical CPU throttled.]
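The CPU bottleneck flowchart can be sketched as a small function. The arrows of the flowchart did not survive in this transcript, so the branch order below is an assumption: it was chosen because it reproduces the game answers from the previous slide (1. B, 2. C, 3. D). The function and parameter names are hypothetical.

```python
def classify_cpu_bottleneck(
    throttled_time_increasing: bool,
    nonvol_switches_increasing: bool,
    host_has_idle_cpu: bool,
    other_tenants_idle: bool,
) -> str:
    """Classify a container CPU complaint into one of the four outcomes."""
    if throttled_time_increasing:
        return "cap throttled"
    if not nonvol_switches_increasing:
        return "not throttled"  # but dig further
    if host_has_idle_cpu:
        return "not throttled"  # but dig further
    if other_tenants_idle:
        return "physical CPU throttled"
    return "share throttled"


if __name__ == "__main__":
    # Game scenarios 1 to 3, using the clues given on their slides.
    print(classify_cpu_bottleneck(True, True, True, False))
    print(classify_cpu_bottleneck(False, True, False, False))
    print(classify_cpu_bottleneck(False, False, False, False))
```

The inputs map to metrics already covered: throttled_time from cpu.stat, nonvoluntary_ctxt_switches from /proc/PID/status, and host idle CPU from tools such as mpstat.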

slide 50:

4. Guest Tools
And Container Awareness
… if you only have guest access

slide 51:

Guest Analysis Challenges
• Some resource metrics are for the container, some for the host. Confusing!
• May lack system capabilities or syscalls to run profilers and tracers

slide 52:

CPU
Can see host's CPU devices, but only container (pid namespace) processes:
container# uptime
20:17:19 up 45 days, 21:21, 0 users, load average: 5.08, 3.69, 2.22
container# mpstat 1
Linux 4.9.0 (02a7cf65f82e) 04/14/17
_x86_64_
(8 CPU)
busy CPUs
20:17:26
CPU
%usr
%nice
%sys %iowait
%irq
20:17:27
all
20:17:28
all
Average:
all
container# pidstat 1
Linux 4.9.0 (02a7cf65f82e) 04/14/17
_x86_64_
(8 CPU)
load!
%soft
%steal
%guest
%gnice
%idle
20:17:33
UID
PID
%usr %system
%guest
%CPU
CPU
Command
20:17:34
UID
PID
%usr %system
%guest
%CPU
CPU
Command
20:17:35
[...]
UID
PID
%usr %system
%guest
%CPU
CPU
Command
but this container
is running nothing
(we saw CPU usage
from neighbors)

slide 53:

Memory
Can see host's memory:
container# free -m
              total        used        free      shared  buff/cache   available
Mem:
Swap:
(host memory; this container is --memory=1g)
container# perl -e '$a = "A" x 1_000_000_000'
Killed
(tries to consume ~2 Gbytes)

slide 54:

Disks
Can see host's disk devices:
container# iostat -xz 1
avg-cpu: %user %nice %system %iowait %steal
0.00 16.94
%idle
host disk I/O
Device: rrqm/s wrqm/s
r/s
w/s
rkB/s
wkB/s avgrq-sz avgqu-sz
xvdap1
xvdb
0.00 200.00
0.00 3080.00
xvdc
0.00 185.00
0.00 2840.00
md0
0.00 385.00
0.00 5920.00
[...]
container# pidstat -d 1
Linux 4.9.0 (02a7cf65f82e) 04/18/17
_x86_64_
(8 CPU)
await r_await w_await svctm %util
2.00 2.00 0.40
0.00 0.20 4.00
0.00 0.24 4.40
0.00 0.00 0.00
22:41:13
UID
PID
kB_rd/s
kB_wr/s kB_ccwr/s iodelay
Command
22:41:14
UID
PID
kB_rd/s
kB_wr/s kB_ccwr/s iodelay
Command
22:41:15
[...]
UID
PID
kB_rd/s
kB_wr/s kB_ccwr/s iodelay
Command
but no
container I/O

slide 55:

Network
Can't see host's network interfaces (network namespace):
container# sar -n DEV,TCP 1
Linux 4.9.0 (02a7cf65f82e) 04/14/17
21:45:07
21:45:08
21:45:08
21:45:07
21:45:08
21:45:08
21:45:09
21:45:09
21:45:08
21:45:09
[...]
IFACE
eth0
_x86_64_
(8 CPU)
rxpck/s
txpck/s
rxkB/s
active/s passive/s
iseg/s
oseg/s
rxpck/s
txpck/s
rxkB/s
active/s passive/s
iseg/s
oseg/s
IFACE
eth0
txkB/s
rxcmp/s
txcmp/s
rxmcst/s
%ifutil
txkB/s
rxcmp/s
txcmp/s
rxmcst/s
%ifutil
host has heavy network I/O,
container sees itself (idle)

slide 56:

Metrics Namespace
This confuses apps too: trying to bind on all CPUs, or using 25% of memory
Including the JDK, which is unaware of container limits, covered yesterday by Fabiane Nardon
We could add a "metrics" namespace so the container only sees itself
Or enhance existing namespaces to do this
If you add a metrics namespace, please consider adding an option for:
• /proc/host/stats: maps to host's /proc/stat, for CPU stats
• /proc/host/diskstats: maps to host's /proc/diskstats, for disk stats
As those host metrics can be useful, to identify/exonerate neighbor issues

slide 57:

perf: CPU Profiling
Needs capabilities to run from a container:
container# ./perf record -F 99 -a -g -- sleep 10
perf_event_open(..., PERF_FLAG_FD_CLOEXEC) failed with unexpected error 1 (Operation not permitted)
perf_event_open(..., 0) failed unexpectedly with error 1 (Operation not permitted)
Error: You may not have permission to collect system-wide stats.
Consider tweaking /proc/sys/kernel/perf_event_paranoid,
which controls use of the performance events system by
unprivileged users (without CAP_SYS_ADMIN).
Helpful message
The current value is 2:
-1: Allow use of (almost) all events by all users
>= 0: Disallow raw tracepoint access by users without CAP_IOC_LOCK
>= 1: Disallow CPU event access by users without CAP_SYS_ADMIN
>= 2: Disallow kernel profiling by users without CAP_SYS_ADMIN
Although, after setting perf_event_paranoid to -1, it prints the same error...

slide 58:

perf & Container Debugging
Debugging using strace from the host (as ptrace() is also blocked):
host# strace -fp 26450    (bash PID, from which I then ran perf)
[...]
[pid 27426] perf_event_open(0x2bfe498, -1, 0, -1, 0) = -1 EPERM (Operation not permitted)
[pid 27426] perf_event_open(0x2bfe498, -1, 0, -1, 0) = -1 EPERM (Operation not permitted)
[pid 27426] perf_event_open(0x2bfc1a8, -1, 0, -1, PERF_FLAG_FD_CLOEXEC) = -1 EPERM (Operation not permitted)
Many different ways to debug this.
https://docs.docker.com/engine/security/seccomp/#significant-syscalls-blocked-by-the-default-profile:

slide 59:

perf, cont.
• Can enable perf_event_open() with: docker run --cap-add sys_admin
Also need (for kernel symbols): echo 0 > /proc/sys/kernel/kptr_restrict
• perf then "works", and you can make flame graphs. But it sees all CPUs!?
perf needs to be "container aware", and only see the container's tasks.
patch pending: https://lkml.org/lkml/2017/1/12/308
• Currently easier to run perf from the host (or secure "monitoring" container)
Via a secure monitoring agent,
e.g. Netflix Vector -> CPU Flame Graph
See earlier slides for steps

slide 60:

5. Tracing
Advanced Analysis
… a few more examples
(iosnoop, zfsslower, and btrfsdist shown earlier)

slide 61:

Built-in Linux Tracers
ftrace (2008+)
perf_events (2009+)
eBPF (2014+)
Some front-ends:
• ftrace: https://github.com/brendangregg/perf-tools
• perf_events: used for CPU flame graphs
• eBPF (aka BPF): https://github.com/iovisor/bcc (Linux 4.4+)

slide 62:

ftrace: Overlay FS Function Calls
Using ftrace via my perf-tools to count function calls in-kernel context:
# funccount '*ovl*'
Tracing "*ovl*"... Ctrl-C to end.
FUNC
COUNT
ovl_cache_free
ovl_xattr_get
[...]
ovl_fill_merge
ovl_path_real
ovl_path_upper
ovl_update_time
ovl_permission
ovl_d_real
ovl_override_creds
Ending tracing...
Each can be a target for further study with kprobes

slide 63:

ftrace: Overlay FS Function Tracing
Using kprobe (perf-tools) to trace ovl_fill_merge() args and stack trace
# kprobe -s 'p:ovl_fill_merge ctx=%di name=+0(%si):string'
Tracing kprobe ovl_fill_merge. Ctrl-C to end.
bash-16633 [000] d... 14390771.218973: ovl_fill_merge: (ovl_fill_merge+0x0/0x1f0
[overlay]) ctx=0xffffc90042477db0 name="iostat"
bash-16633 [000] d... 14390771.218981: <stack trace>
=> ovl_fill_merge
=> ext4_readdir
=> iterate_dir
=> ovl_dir_read_merged
=> ovl_iterate
=> iterate_dir
=> SyS_getdents
=> do_syscall_64
=> return_from_SYSCALL_64
[…]
Good for debugging, although dumping all events can cost too much overhead. ftrace has
some solutions to this, BPF has more…

slide 64:

Enhanced BPF Tracing Internals
[Diagram labels. Observability Program: BPF bytecode, event config, output (per-event data, statistics). Kernel: load, verifier, BPF program, attach, static tracing (tracepoints), dynamic tracing (kprobes, uprobes), sampling and PMCs (perf_events), maps, async copy.]

slide 65:

BPF: Scheduler Latency 1
host# runqlat -p 20228 10 1
Tracing run queue latency... Hit Ctrl-C to end.
     usecs          : count     distribution
       0 -> 1       : 0
       2 -> 3       : 4
       4 -> 7       : 368      |****************************************|
       8 -> 15      : 151      |****************
      16 -> 31      : 22       |**
      32 -> 63      : 14
      64 -> 127     : 19       |**
     128 -> 255     : 0
     256 -> 511     : 2
     512 -> 1023    : 1
This is an app in a Docker container on a system with idle CPU
Tracing scheduler events can be costly (high rate), but this BPF program reduces cost by using
in-kernel maps to summarize data, and only emits the "count" column to user space.
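The summarizing step described above amounts to counting events into power-of-two buckets. The following Python sketch reproduces that bucketing and the bar rendering in user space, purely to illustrate the output format. It is not the bcc implementation, which keeps the counts in an in-kernel BPF map. All function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def log2_bucket(value: int) -> int:
    """Bucket index: 0 covers 0-1, 1 covers 2-3, 2 covers 4-7, and so on."""
    return max(value.bit_length() - 1, 0)


def bucket_range(idx: int) -> Tuple[int, int]:
    """Return the inclusive (low, high) range printed for a bucket."""
    low = 0 if idx == 0 else 1 << idx
    high = (1 << (idx + 1)) - 1
    return low, high


def histogram(values: Iterable[int]) -> List[int]:
    """Count non-negative integer latencies per bucket."""
    counts = Counter(log2_bucket(int(v)) for v in values)
    if not counts:
        return []
    return [counts.get(i, 0) for i in range(max(counts) + 1)]


def render(counts: List[int], width: int = 40) -> List[str]:
    """Render one text row per bucket, with bars scaled to the peak count."""
    peak = max(counts, default=0)
    rows = []
    for idx, count in enumerate(counts):
        low, high = bucket_range(idx)
        stars = "*" * (count * width // peak) if peak else ""
        rows.append(f"{low:>8} -> {high:<8} : {count:<8} |{stars:<{width}}|")
    return rows


if __name__ == "__main__":
    # The first four counts from the runqlat output on this slide.
    for row in render([0, 4, 368, 151]):
        print(row)
```

Only the per-bucket counts need to cross from kernel to user space, which is why this approach stays cheap even at high event rates.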

slide 66:

BPF: Scheduler Latency 2
host# runqlat -p 20228 10 1
Tracing run queue latency... Hit Ctrl-C to end.
     usecs            : count     distribution
       0 -> 1         : 0
       2 -> 3         : 0
       4 -> 7         : 7        |**
       8 -> 15        : 14       |*****
      16 -> 31        : 0
      32 -> 63        : 0
      64 -> 127       : 0
     128 -> 255       : 0
     256 -> 511       : 0
     512 -> 1023      : 0
    1024 -> 2047      : 0
    2048 -> 4095      : 5        |**
    4096 -> 8191      : 6        |**
    8192 -> 16383     : 28       |***********
   16384 -> 32767     : 59       |***********************
   32768 -> 65535     : 99       |****************************************|
   65536 -> 131071    : 6        |**
  131072 -> 262143    : 2
  262144 -> 524287    : 1
Now other tenants are using more CPU, and this PID is throttled via CPU shares
8 - 65ms delays

slide 67:

BPF: Scheduler Latency 3
host# runqlat --pidnss -m
Tracing run queue latency... Hit Ctrl-C to end.
pidns = 4026532870
     msecs          : count     distribution
       0 -> 1       : 264      |****************************************|
       2 -> 3       : 0
       4 -> 7       : 0
       8 -> 15      : 0
      16 -> 31      : 0
      32 -> 63      : 0
      64 -> 127     : 2
[…]
pidns = 4026532382
     msecs          : count     distribution
       0 -> 1       : 646      |****************************************|
       2 -> 3       : 18
       4 -> 7       : 48       |**
       8 -> 15      : 17
      16 -> 31      : 150      |*********
      32 -> 63      : 134      |********
Per-PID namespace histograms (I added this yesterday)

slide 68:

BPF: Namespace-ing Tools
Walking from the task_struct to the PID namespace ID:
task_struct->nsproxy->pid_ns_for_children->ns.inum
This is unstable, and could break between kernel versions. If it becomes a problem, we'll add a bpf_get_current_pidns()
Does need a *task, or bpf_get_current_task() (added in 4.8)
Can also pull out cgroups, but gets trickier…

slide 69:

bcc
(BPF)
Perf
Tools

slide 70:

Docker Analysis & Debugging
If needed, dockerd can also be analyzed using:
• go execution tracer
• GODEBUG with gctrace and schedtrace
• gdb and Go runtime support
• perf profiling
• bcc/BPF and uprobes
Each has pros/cons. bcc/BPF can trace user & kernel events.

slide 71:

BPF: dockerd Go Function Counting
Counting dockerd Go calls in-kernel using BPF that match "*docker*get":
# funccount '/usr/bin/dockerd:*docker*get*'
Tracing 463 functions for "/usr/bin/dockerd:*docker*get*"... Hit Ctrl-C to end.
FUNC
COUNT
github.com/docker/docker/daemon.(*statsCollector).getSystemCPUUsage
github.com/docker/docker/daemon.(*Daemon).getNetworkSandboxID
github.com/docker/docker/daemon.(*Daemon).getNetworkStats
github.com/docker/docker/daemon.(*statsCollector).getSystemCPUUsage.func1
github.com/docker/docker/pkg/ioutils.getBuffer
github.com/docker/docker/vendor/golang.org/x/net/trace.getBucket
github.com/docker/docker/vendor/golang.org/x/net/trace.getFamily
github.com/docker/docker/vendor/google.golang.org/grpc.(*ClientConn).getTransport
github.com/docker/docker/vendor/github.com/golang/protobuf/proto.getbase
github.com/docker/docker/vendor/google.golang.org/grpc/transport.(*http2Client).getStream
Detaching...
# objdump -tTj .text /usr/bin/dockerd | wc -l
35,859 functions can be traced!
Uses uprobes, and needs newer kernels. Warning: will cost overhead at high function rates.

slide 72:

BPF: dockerd Go Stack Tracing
Counting stack traces that led to this ioutils.getBuffer() call:
# stackcount 'p:/usr/bin/dockerd:*/ioutils.getBuffer'
Tracing 1 functions for "p:/usr/bin/dockerd:*/ioutils.getBuffer"... Hit Ctrl-C to end.
github.com/docker/docker/pkg/ioutils.getBuffer
github.com/docker/docker/pkg/broadcaster.(*Unbuffered).Write
bufio.(*Reader).writeBuf
bufio.(*Reader).WriteTo
io.copyBuffer
io.Copy
github.com/docker/docker/pkg/pools.Copy
github.com/docker/docker/container/stream.(*Config).CopyToPipe.func1.1
runtime.goexit
dockerd [18176]
means this stack was seen 110 times
Detaching...
Can also trace function arguments, and latency (with some work)
http://www.brendangregg.com/blog/2017-01-31/golang-bcc-bpf-function-tracing.html

slide 73:

Summary
Identify bottlenecks:
1. In the host vs container, using system metrics
2. In application code on containers, using CPU flame graphs
3. Deeper in the kernel, using tracing tools

slide 74:

References
http://techblog.netflix.com/2017/04/the-evolution-of-container-usage-at.html
http://techblog.netflix.com/2016/07/distributed-resource-scheduling-with.html
https://www.slideshare.net/aspyker/netflix-and-containers-titus
https://docs.docker.com/engine/admin/runmetrics/#tips-for-high-performance-metric-collection
https://blog.docker.com/2013/10/gathering-lxc-docker-containers-metrics/
https://www.slideshare.net/jpetazzo/anatomy-of-a-container-namespaces-cgroups-some-filesystem-magic-linuxcon
https://www.youtube.com/watch?v=sK5i-N34im8 Cgroups, namespaces, and beyond
https://jvns.ca/blog/2016/10/10/what-even-is-a-container/
https://blog.jessfraz.com/post/containers-zones-jails-vms/
http://blog.alicegoldfuss.com/making-flamegraphs-with-containerized-java/
http://www.brendangregg.com/USEmethod/use-linux.html full USE method list
http://www.brendangregg.com/blog/2017-01-31/golang-bcc-bpf-function-tracing.html
http://techblog.netflix.com/2015/11/linux-performance-analysis-in-60s.html
http://queue.acm.org/detail.cfm?id=1809426 latency heat maps
https://github.com/brendangregg/perf-tools ftrace tools, https://github.com/iovisor/bcc BPF tools

slide 75:

Thank You!
http://techblog.netflix.com
http://slideshare.net/brendangregg
http://www.brendangregg.com
@brendangregg
Titus team: @aspyker @anwleung @fabiokung @tomaszbak1974
@amit_joshee @sargun @corindwyer …
#dockercon