DockerCon 2017: Container Performance Analysis

Slides for DockerCon 2017 by Brendan Gregg

Video: https://www.youtube.com/watch?v=bK9A5ODIgac

Description: "Containers pose interesting challenges for performance monitoring and analysis, requiring new analysis methodologies and tooling. Resource-oriented analysis, as is common with systems performance tools and GUIs, must now account for both hardware limits and soft limits, as implemented using resource controls including cgroups. The interaction between containers can also be examined, and noisy neighbors either identified of exonerated. Performance tooling can also need special usage or workarounds to function properly from within a container or on the host, to deal with different privilege levels and name spaces. At Netflix, we're using containers for some microservices, and care very much about analyzing and tuning our containers to be as fast and efficient as possible. This talk will show how to successfully analyze performance in a Docker container environment, and navigate differences encountered."


PDF: DockerCon2017_performance_analysis.pdf

Keywords (from pdftotext):

slide 1:
    Container Performance
    Analysis
    Brendan Gregg
    Sr. Performance Architect,
    Netflix
    
slide 2:
    Take Aways
    Identify bottlenecks:
    1. In the host vs container, using system metrics
    2. In application code on containers, using CPU flame graphs
    3. Deeper in the kernel, using tracing tools
    Focus of this talk is how containers work in Linux (will demo on 4.9)
    I will include some Docker specifics, and start with a Netflix summary (Titus)
    
slide 3:
    1. Titus
    Containers at Netflix
    Summary slides from the Titus team
    
slide 4:
    Titus
    • Cloud runtime platform for container jobs
    • Scheduling
      Service & batch job management
      Advanced resource management across elastic shared resource pool
    • Container Execution
      Docker and AWS EC2 Integration
      Adds VPC, security groups, EC2 metadata, IAM roles, S3 logs, …
    • Integration with Netflix infrastructure
    • In depth: http://techblog.netflix.com/2017/04/the-evolution-of-container-usage-at.html
    [architecture diagram: Service and Batch -> Job Management -> Resource Management & Optimization -> Container Execution -> Integration]
    
slide 5:
    Current Titus Scale
    • Deployed across multiple AWS accounts & three regions
    • Over 2,500 instances (Mostly M4.4xls & R3.8xls)
    • Launched over 1,000,000 containers in a one-week period
    
slide 6:
    Titus Use Cases
    • Service
    Stream Processing (Flink)
    UI Services (Node.JS single core)
    Internal dashboards
    • Batch
    Algorithm training, personalization & recommendations
    Adhoc reporting
    Continuous integration builds
    • Queued worker model
    Media encoding
    
slide 7:
    Container Performance @Netflix
    • Ability to scale and balance workloads with EC2 and Titus
    Can already solve many perf issues
    • Performance needs:
    • Application analysis: using CPU flame graphs with containers
    • Host tuning: file system, networking, sysctl's, …
    • Container analysis and tuning: cgroups, GPUs, …
    • Capacity planning: reduce over provisioning
    
slide 8:
    2. Container Background
    And Strategy
    
slide 9:
    Namespaces
    Restricting visibility
    Namespaces:
    • cgroup
    • ipc
    • mnt
    • net
    • pid
    • user
    • uts
    PID namespaces
    [diagram: the host's PID namespace sees PID 1 and all other processes; inside PID namespace 1,
    container PID 1 is host PID 1238 and container PID 2 is host PID 1241; one shared kernel]
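    A quick way to inspect these from the host (not from the original slide; lsns is part of
    util-linux, and PID is a placeholder):
    host# lsns -p PID            # list the namespaces a process belongs to
    host# ls -l /proc/PID/ns     # the same namespaces, as symlinks to their inode IDs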
    
slide 10:
    Control Groups
    Restricting usage
    cgroups:
    • blkio
    • cpu,cpuacct
    • cpuset
    • devices
    • hugetlb
    • memory
    • net_cls,net_prio
    • pids
    • …
    CPU cgroups
    [diagram: multiple containers, each in its own cpu cgroup, sharing the host's CPUs]
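    As a rough sketch of the cgroup v1 interface itself (not from the slide; "demo" is an
    arbitrary cgroup name, and this assumes the v1 cpu controller is mounted at the usual path):
    host# mkdir /sys/fs/cgroup/cpu/demo                              # create a cpu cgroup
    host# echo 150000 > /sys/fs/cgroup/cpu/demo/cpu.cfs_quota_us     # 1.5 CPUs per period
    host# echo 100000 > /sys/fs/cgroup/cpu/demo/cpu.cfs_period_us    # 100 ms period
    host# echo PID > /sys/fs/cgroup/cpu/demo/tasks                   # move a task into the cgroup
    Docker and Titus do the equivalent of this for you when a container is launched.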
    
slide 11:
    Linux Containers
    Container = combination of namespaces & cgroups
    Host
    [diagram: Containers 1-3, each a set of namespaces plus cgroups, all on one shared kernel]
    
slide 12:
    cgroup v1
    cpu,cpuacct:
      cap CPU usage (hard limit), e.g. 1.5 CPUs
      CPU shares, e.g. 100 shares
      usage statistics (cpuacct)
      Docker: --cpus (1.13), --cpu-shares
    memory:
      limit and kmem limit (maximum bytes)
      OOM control: enable/disable
      usage statistics
      Docker: --memory --kernel-memory, --oom-kill-disable
    blkio (block I/O):
      weights (like shares)
      IOPS/tput caps per storage device
      statistics
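    Putting those Docker flags together (a hedged example; IMAGE is a placeholder, and --cpus
    needs Docker 1.13+):
    host# docker run --cpus 1.5 --cpu-shares 100 --memory 1g --kernel-memory 512m IMAGE
    This sets a hard cap of 1.5 CPUs, 100 CPU shares for bursting, and a 1 Gbyte memory limit,
    which Docker implements via the cpu,cpuacct and memory cgroups shown above.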
    
slide 13:
    CPU Shares
    Container's CPU limit = 100% x (container's shares / total busy shares)
    This lets a container use other tenants' idle CPU (aka "bursting"), when available.
    Container's minimum CPU limit = 100% x (container's shares / total allocated shares)
    Can make analysis tricky. Why did perf regress? Less bursting available?
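    A worked example with made-up numbers: a container allotted 200 shares, on a host where the
    busy containers hold 800 shares in total, can use up to 100% x 200/800 = 25% of the host's
    CPUs; if the total allocated shares are 1600, its guaranteed minimum is 100% x 200/1600 = 12.5%.
    When other tenants go idle, "total busy shares" shrinks and the same container can burst higher,
    which is why a regression can simply mean there was less idle CPU available to borrow.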
    
slide 14:
    cgroup v2
    • Major rewrite has been happening: cgroups v2
    Supports nested groups, better organization and consistency
    Some already merged, some not yet (e.g. CPU)
    • See docs/talks by maintainer Tejun Heo (Facebook)
    • References:
    https://www.kernel.org/doc/Documentation/cgroup-v2.txt
    https://lwn.net/Articles/679786/
    
slide 15:
    Container OS Configuration
    File systems
    • Containers may be set up with aufs/overlay on top of another FS
    • See "in practice" pages and their performance sections from
    https://docs.docker.com/engine/userguide/storagedriver/
    Networking
    • With Docker, can be bridge, host, or overlay networks
    • Overlay networks have come with significant performance cost
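    For reference, the Docker network modes mentioned above are chosen at run time (a hedged
    example; IMAGE is a placeholder and "myoverlay" is a pre-created overlay network):
    host# docker run --network bridge IMAGE       # default: veth pair + docker0 bridge (NAT)
    host# docker run --network host IMAGE         # share the host's network namespace (lowest overhead)
    host# docker run --network myoverlay IMAGE    # attach to an overlay network (extra encapsulation cost)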
    
slide 16:
    Analysis Strategy
    Performance analysis with containers:
    • One kernel
    • Two perspectives
    • Namespaces
    • cgroups
    Methodologies:
    • USE Method
    • Workload characterization
    • Checklists
    • Event tracing
    
slide 17:
    USE Method
    For every resource, check:
    1. Utilization
    2. Saturation
    3. Errors
    [diagram: for each resource, check utilization (%), saturation, and errors]
    For example, CPUs:
    • Utilization: time busy
    • Saturation: run queue length or latency
    • Errors: ECC errors, etc.
    Can be applied to hardware resources and software resources (cgroups)
    
slide 18:
    3. Host Tools
    And Container Awareness
    … if you have host access
    
slide 19:
    Host Analysis Challenges
    • PIDs in host don't match those seen in containers
    • Symbol files aren't where tools expect them
    • The kernel currently doesn't have a container ID
    
slide 20:
    CLI Tool Disclaimer
    I'll demo CLI tools: it's the lowest common denominator.
    You may usually use GUIs (like we do). They source the same metrics.
    
slide 21:
    3.1. Host Physical Resources
    A refresher of basics... Not container specific.
    This will, however, solve many issues!
    Containers are often not the problem.
    
slide 22:
    Linux Perf Tools: where can we begin?
    [diagram: Linux performance observability tools mapped to kernel and system components]
    
slide 23:
    Host Perf Analysis in 60s
    uptime                  load averages
    dmesg | tail            kernel errors
    vmstat 1                overall stats by time
    mpstat -P ALL 1         CPU balance
    pidstat 1               process usage
    iostat -xz 1            disk I/O
    free -m                 memory usage
    sar -n DEV 1            network I/O
    sar -n TCP,ETCP 1       TCP stats
    top                     check overview
    http://techblog.netflix.com/2015/11/linux-performance-analysis-in-60s.html
    
slide 24:
    USE Method: Host Resources
    Resource          Utilization                       Saturation                            Errors
    CPU               mpstat -P ALL 1,                  vmstat 1, "r"                         perf
                      sum non-idle fields
    Memory Capacity   free -m, "used"/"total"           vmstat 1, "si"+"so";                  dmesg
                                                        dmesg | grep killed
    Storage I/O       iostat -xz 1, "%util"             iostat -xnz 1, "avgqu-sz" > 1         /sys/…/ioerr_cnt; smartctl
    Network           nicstat, "%Util"                  ifconfig, "overruns";                 ifconfig, "errors"
                                                        netstat -s "retrans…"
    These should be in your monitoring GUI. Can do other resources too (busses, ...)
    
slide 25:
    Event Tracing: e.g. iosnoop
    Disk I/O events with latency (from perf-tools; also in bcc/BPF as biosnoop)
    # ./iosnoop
    Tracing block I/O... Ctrl-C to end.
    COMM          PID    TYPE  DEV    BLOCK   BYTES   LATms
    supervise     …      …     202,1  …       …       …
    supervise     …      …     202,1  …       …       …
    tar           14794  RM    202,1  …       …       …
    tar           14794  RM    202,1  …       …       …
    tar           14794  RM    202,1  …       …       …
    tar           14794  RM    202,1  …       …       …
    [...]
    
slide 26:
    Event Tracing: e.g. zfsslower
    # /usr/share/bcc/tools/zfsslower 1
    Tracing ZFS operations slower than 1 ms
    TIME      COMM     PID    T  BYTES  OFF_KB  LAT(ms)  FILENAME
    23:44:40  java     31386  O  0      …       8.02     solrFeatures.txt
    23:44:53  java     31386  W  8190   …       36.24    solrFeatures.txt
    23:44:59  java     31386  W  8192   …       20.28    solrFeatures.txt
    23:44:59  java     31386  W  8191   …       28.15    solrFeatures.txt
    23:45:00  java     31386  W  8192   …       32.17    solrFeatures.txt
    23:45:15  java     31386  O  0      …       27.44    solrFeatures.txt
    23:45:56  dockerd  …      S  0      …       1.03     .tmp-a66ce9aad…
    23:46:16  java     31386  W  31     …       36.28    solrFeatures.txt
    • This is from our production Titus system (Docker).
    • File system latency is a better pain indicator than disk latency.
    • zfsslower (and btrfs*, etc) are in bcc/BPF. Can exonerate FS/disks.
    
slide 27:
    Latency Histograms: e.g. btrfsdist
    # ./btrfsdist
    Tracing btrfs operation latency... Hit Ctrl-C to end.
    operation = 'read'
         usecs          : count     distribution
         0 -> 1         : 192529    |****************************************|
         2 -> 3         : 72337     |***************                         |
         4 -> 7         : 5620      |*                                       |   <- probably cache reads
         8 -> 15        : 1026      |                                        |
         16 -> 31       : 369       |                                        |
         32 -> 63       : 239       |                                        |
         64 -> 127      : 53        |                                        |
         128 -> 255     : 975       |                                        |   <- probably cache misses
         256 -> 511     : 524       |                                        |      (flash reads)
         512 -> 1023    : 128       |                                        |
         1024 -> 2047   : 16        |                                        |
         2048 -> 4095   : 7         |                                        |
         4096 -> 8191   : 2         |                                        |
    
slide 28:
    Latency Histograms: e.g. btrfsdist
    […]
    operation = 'write'
         usecs          : count     distribution
         0 -> 1         : 1         |                                        |
         2 -> 3         : 276       |                                        |
         4 -> 7         : 32125     |***********                             |
         8 -> 15        : 111253    |****************************************|
         16 -> 31       : 59154     |*********************                   |
         32 -> 63       : 5463      |*                                       |
         64 -> 127      : 612       |                                        |
         128 -> 255     : 25        |                                        |
         256 -> 511     : 2         |                                        |
         512 -> 1023    : 1         |                                        |
    • From a test Titus system (Docker).
    • Histograms show modes, outliers. Also in bcc/BPF (with other FSes).
    • Latency heat maps: http://queue.acm.org/detail.cfm?id=1809426
    
slide 29:
    3.2. Host Containers & cgroups
    Inspecting containers from the host
    
slide 30:
    Namespaces
    Worth checking namespace config before analysis:
    # ./dockerpsns.sh
    CONTAINER     NAME                   PID    PATH              CGROUP      IPC         MNT         NET         PID         USER        UTS
    host          titusagent-mainvpc-m   1      systemd           4026531835  4026531839  4026531840  4026532533  4026531836  4026531837  4026531838
    b27909cd6dd1  Titus-1435830-worker   37280  svscanboot        4026531835  4026533387  4026533385  4026532931  4026533388  4026531837  4026533386
    dcf3a506de45  Titus-1392192-worker   27992  /apps/spaas/spaa  4026531835  4026533354  4026533352  4026532991  4026533355  4026531837  4026533353
    370a3f041f36  Titus-1243558-worker   98602  /apps/spaas/spaa  4026531835  4026533290  4026533288  4026533223  4026533291  4026531837  4026533289
    af7549c76d9a  Titus-1243553-worker   97972  /apps/spaas/spaa  4026531835  4026533216  4026533214  4026533149  4026533217  4026531837  4026533215
    dc27769a9b9c  Titus-1243546-worker   97356  /apps/spaas/spaa  4026531835  4026533142  4026533140  4026533075  4026533143  4026531837  4026533141
    e18bd6189dcd  Titus-1243517-worker   96733  /apps/spaas/spaa  4026531835  4026533068  4026533066  4026533001  4026533069  4026531837  4026533067
    ab45227dcea9  Titus-1243516-worker   96173  /apps/spaas/spaa  4026531835  4026532920  4026532918  4026532830  4026532921  4026531837  4026532919
    • A POC "docker ps --namespaces" tool. NS shared with root in red.
    • https://github.com/docker/docker/issues/32501
    
slide 31:
    systemd-cgtop
    A "top" for cgroups:
    # systemd-cgtop
    Control Group                                 Tasks   %CPU   Memory   Input/s   Output/s
    /docker                                           …      …        …         …          …
    /docker/dcf3a...9d28fc4a1c72bbaff4a24834          …      …        …         …          …
    /docker/370a3...e64ca01198f1e843ade7ce21          …      …        …         …          …
    /system.slice                                     …      …        …         …          …
    /system.slice/daemontools.service                 …      …        …         …          …
    /docker/dc277...42ab0603bbda2ac8af67996b          …      …        …         …          …
    /user.slice                                       …      …        …         …          …
    /user.slice/user-0.slice                          …      …        …         …          …
    /user.slice/u....slice/session-c26.scope          …      …        …         …          …
    /docker/ab452...c946f8447f2a4184f3ccff2a          …      …        …         …          …
    /docker/e18bd...26ffdd7368b870aa3d1deb7a          …      …        …         …          …
    [...]
    
slide 32:
    docker stats
    A "top" for containers. Resource utilization. Workload characterization.
    # docker stats
    CONTAINER     CPU %    MEM USAGE / LIMIT      MEM %   NET I/O     BLOCK I/O       PIDS
    353426a09db1  526.81%  4.061 GiB / 8.5 GiB    47.78%  0 B / 0 B   2.818 MB / 0 B
    6bf166a66e08  303.82%  3.448 GiB / 8.5 GiB    40.57%  0 B / 0 B   2.032 MB / 0 B
    58dcf8aed0a7  41.01%   1.322 GiB / 2.5 GiB    52.89%  0 B / 0 B   0 B / 0 B
    61061566ffe5  85.92%   220.9 MiB / 3.023 GiB  7.14%   0 B / 0 B   43.4 MB / 0 B
    bdc721460293  2.69%    1.204 GiB / 3.906 GiB  30.82%  0 B / 0 B   4.35 MB / 0 B
    6c80ed61ae63  477.45%  557.7 MiB / 8 GiB      6.81%   0 B / 0 B   9.257 MB / 0 B
    337292fb5b64  89.05%   766.2 MiB / 8 GiB      9.35%   0 B / 0 B   5.493 MB / 0 B
    b652ede9a605  173.50%  689.2 MiB / 8 GiB      8.41%   0 B / 0 B   6.48 MB / 0 B
    d7cd2599291f  504.28%  673.2 MiB / 8 GiB      8.22%   0 B / 0 B   12.58 MB / 0 B
    05bf9f3e0d13  314.46%  711.6 MiB / 8 GiB      8.69%   0 B / 0 B   7.942 MB / 0 B
    09082f005755  142.04%  693.9 MiB / 8 GiB      8.47%   0 B / 0 B   8.081 MB / 0 B
    bd45a3e1ce16  190.26%  538.3 MiB / 8 GiB      6.57%   0 B / 0 B   10.6 MB / 0 B
    [...]
    Loris Degioanni demoed a similar sysdigcloud view yesterday (needs the sysdig kernel agent)
    
slide 33:
    top
    In the host, top shows all processes. Currently doesn't show a container ID.
    # top - 22:46:53 up 36 days, 59 min,  1 user,  load average: 5.77, 5.61, 5.63
    Tasks: 1067 total,   1 running, 1046 sleeping,   0 stopped,  20 zombie
    %Cpu(s): 34.8 us,  1.8 sy,  0.0 ni, 61.3 id,  0.0 wa,  0.0 hi,  1.9 si,  0.1 st
    KiB Mem : 65958552 total, 12418448 free, 49247988 used,  4292116 buff/cache
    KiB Swap:        0 total,        0 free,        0 used. 13101316 avail Mem
      PID USER      VIRT     RES     SHR   S  %CPU %MEM      TIME+   COMMAND
    28321 root      33.126g  0.023t  37564 S 621.1 38.2   35184:09   java
    97712 root      11.445g  2.333g  37084 S   3.1  3.7   404:27.90  java
    98306 root      12.149g  3.060g  36996 S   2.0  4.9   194:21.10  java
    96511 root      15.567g  6.313g  37112 S   1.7 10.0   168:07.44  java
     5283 root      1643676  100092  94184 S   1.0  0.2   401:36.16  mesos-slave
     2079 root      …        …          12 S   0.7  0.0   220:07.75  rngd
     5272 titusag+  10.473g  1.611g  23488 S   0.7  2.6     1934:44  java
    […]
    … remember, there is no container ID in the kernel yet.
    
slide 34:
    htop
    htop can add a CGROUP field, but can truncate important info:
    CGROUP                                     PID USER       VIRT  RES   SHR   S CPU% MEM%    TIME+  Command
    :pids:/docker/                           28321 root       33.1G 24.0G 37564 S 524. 38.2     672h  /apps/java
    :pids:/docker/                            9982 root       33.1G 24.0G 37564 S 44.4 38.2 17h00:41  /apps/java
    :pids:/docker/                            9985 root       33.1G 24.0G 37564 R 41.9 38.2 16h44:51  /apps/java
    :pids:/docker/                            9979 root       33.1G 24.0G 37564 S 41.2 38.2 17h01:35  /apps/java
    :pids:/docker/                            9980 root       33.1G 24.0G 37564 S 39.3 38.2 16h59:17  /apps/java
    :pids:/docker/                            9981 root       33.1G 24.0G 37564 S 39.3 38.2 17h01:32  /apps/java
    :pids:/docker/                            9984 root       33.1G 24.0G 37564 S 37.3 38.2 16h49:03  /apps/java
    :pids:/docker/                            9983 root       33.1G 24.0G 37564 R 35.4 38.2 16h54:31  /apps/java
    :pids:/docker/                            9986 root       33.1G 24.0G 37564 S 35.4 38.2 17h05:30  /apps/java
    :name=systemd:/user.slice/user-0.slice/session-c31.scope? 74066 root  27620 …
    :pids:/docker/                            9998 root       33.1G 24.0G 37564 R 28.3 38.2 11h38:03  /apps/java
    :pids:/docker/                           10001 root       33.1G 24.0G 37564 S 27.7 38.2 11h38:59  /apps/java
    :name=systemd:/system.slice/daemontools.service?  5272 titusagen  10.5G 1650M 23 …
    :pids:/docker/                           10002 root       33.1G 24.0G 37564 S 25.1 38.2 11h40:37  /apps/java
    Can fix, but that would be Docker + cgroup-v1 specific. Still need a kernel CID.
    
slide 35:
    Host PID -> Container ID
    … who does that (CPU busy) PID 28321 belong to?
    # grep 28321 /sys/fs/cgroup/cpu,cpuacct/docker/*/tasks | cut -d/ -f7
    dcf3a506de453107715362f6c9ba9056fcfc6e769d28fc4a1c72bbaff4a24834
    • Only works for Docker, and that cgroup v1 layout. Some Linux commands:
    # ls -l /proc/27992/ns/*
    lrwxrwxrwx 1 root root 0 Apr 13 20:49 cgroup -> cgroup:[4026531835]
    lrwxrwxrwx 1 root root 0 Apr 13 20:49 ipc -> ipc:[4026533354]
    lrwxrwxrwx 1 root root 0 Apr 13 20:49 mnt -> mnt:[4026533352]
    […]
    # cat /proc/27992/cgroup
    11:freezer:/docker/dcf3a506de453107715362f6c9ba9056fcfc6e769d28fc4a1c72bbaff4a24834
    10:blkio:/docker/dcf3a506de453107715362f6c9ba9056fcfc6e769d28fc4a1c72bbaff4a24834
    9:perf_event:/docker/dcf3a506de453107715362f6c9ba9056fcfc6e769d28fc4a1c72bbaff4a24834
    […]
    
slide 36:
    nsenter Wrapping
    … what hostname is PID 28321 running on?
    # nsenter -t 28321 -u hostname
    titus-1392192-worker-14-16
    • Can namespace enter:
    -m: mount  -u: uts  -i: ipc  -n: net  -p: pid  -U: user
    • Bypasses cgroup limits, and seccomp profile (allowing syscalls)
    For Docker, you can enter the container more completely with: docker exec -it CID command
    • Handy nsenter one-liners:
      nsenter -t PID -u hostname          container hostname
      nsenter -t PID -n netstat -i        container netstat
      nsenter -t PID -m -p df -h          container file system usage
      nsenter -t PID -p top               container top
    
slide 37:
    nsenter: Host -> Container top
    … Given PID 28321, running top for its container by entering its namespaces:
    # nsenter -t 28321 -m -p top
    top - 18:16:13 up 36 days, 20:28,  0 users,  load average: 5.66, 5.29, 5.28
    Tasks:   6 total,   1 running,   5 sleeping,   0 stopped,   0 zombie
    %Cpu(s): 30.5 us,  1.7 sy,  0.0 ni, 65.9 id,  0.0 wa,  0.0 hi,  1.8 si,  0.1 st
    KiB Mem:  65958552 total, 54664124 used, 11294428 free,   164232 buffers
    KiB Swap:        0 total,        0 used,        0 free.  1592372 cached Mem
      PID USER      VIRT     RES     SHR   S  %CPU %MEM     TIME+  COMMAND
      301 root      33.127g  0.023t  37564 S 537.3 38.2  40269:41  java
        1 root      …        …        1812 S   0.0  0.0   4:15.11  bash
    87888 root      …        …        1348 R   0.0  0.0   0:00.00  top
    Note that it is PID 301 in the container. Can also see this using:
    # grep NSpid /proc/28321/status
    NSpid:  28321   301
    
slide 38:
    perf: CPU Profiling
    Can run system-wide (-a), match a pid (-p), or cgroup (-G, if it works)
    # perf record -F 49 -a -g -- sleep 30
    # perf script
    Failed to open /lib/x86_64-linux-gnu/libc-2.19.so, continuing without symbols
    Failed to open /tmp/perf-28321.map, continuing without symbols
    • Current symbol translation gotchas (up to 4.10-ish):
    perf can't find /tmp/perf-PID.map files in the host, and the PID is different
    perf can't find container binaries under host paths (what /usr/bin/java?)
    • Can copy files to the host, map PIDs, then run perf script/report:
    http://blog.alicegoldfuss.com/making-flamegraphs-with-containerized-java/
    http://batey.info/docker-jvm-flamegraphs.html
    • Can nsenter (-m -u -i -n -p) a "power" shell, and then run "perf -p PID"
    • perf should be fixed to be namespace aware (like bcc was, PR#1051)
    
slide 39:
    CPU Flame Graphs
    git clone --depth 1 https://github.com/brendangregg/FlameGraph
    cd FlameGraph
    perf record -F 49 -a -g -- sleep 30
    perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > perf.svg
    • See previous slide for getting perf symbols to work
    • From the host, can study all containers, as well as container overheads
    [flame graph callouts: kernel TCP/IP stack; look in areas like this to find and quantify
    overhead (cgroup throttles, FS layers, networking, etc.), which is likely small and hard
    to find; Java shows missing stacks (need -XX:+PreserveFramePointer)]
    
slide 40:
    /sys/fs/cgroups (raw)
    The best source for per-cgroup metrics. e.g. CPU:
    # cd /sys/fs/cgroup/cpu,cpuacct/docker/02a7cf65f82e3f3e75283944caa4462e82f8f6ff5a7c9a...
    # ls
    cgroup.clone_children  cpuacct.stat          cpuacct.usage_percpu_sys   cpu.cfs_period_us  cpu.stat
    cgroup.procs           cpuacct.usage         cpuacct.usage_percpu_user  cpu.cfs_quota_us   notify_on_release
    cpuacct.usage_all      cpuacct.usage_percpu  cpuacct.usage_sys          cpu.shares         tasks
    cpuacct.usage_user
    # cat cpuacct.usage
    # cat cpu.stat
    nr_periods 507
    nr_throttled 74
    throttled_time 3816445175      <- total time throttled (nanoseconds): a saturation metric
                                      average throttle time = throttled_time / nr_throttled
    https://www.kernel.org/doc/Documentation/cgroup-v1/, ../scheduler/sched-bwc.txt
    https://blog.docker.com/2013/10/gathering-lxc-docker-containers-metrics/
    Note: grep cgroup /proc/mounts to check where these are mounted
    These metrics should be included in performance monitoring GUIs
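    A one-liner sketch to turn those counters into an average throttle time (assumes the cgroup v1
    cpu.stat format shown above, with throttled_time in nanoseconds):
    # awk '/nr_throttled/ {n=$2} /throttled_time/ {t=$2} END {if (n) printf "avg throttle: %.1f ms\n", t/n/1e6}' cpu.stat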
    
slide 41:
    Netflix Atlas
    Cloud-wide monitoring of
    containers (and instances)
    Fetches cgroup metrics via
    Intel snap
    https://github.com/netflix/Atlas
    
slide 42:
    Netflix Vector
    Our per-instance analyzer
    Has per-container metrics
    https://github.com/Netflix/vector
    
slide 43:
    Intel snap
    A metric collector used by
    monitoring GUIs
    https://github.com/intelsdi-x/snap
    Has a Docker plugin to read
    cgroup stats
    There's also a collectd plugin:
    https://github.com/bobrik/collectd-docker
    
slide 44:
    3.3. Let's Play a Game
    Host or Container?
    (or neither?)
    
slide 45:
    Game Scenario 1
    Container user claims they have a CPU performance issue
    Container has a CPU cap and CPU shares configured
    There is idle CPU on the host
    Other tenants are CPU busy
    /sys/fs/cgroup/.../cpu.stat -> throttled_time is increasing
    /proc/PID/status nonvoluntary_ctxt_switches is increasing
    Container CPU usage equals its cap (clue: this is not really a clue)
    
slide 46:
    Game Scenario 2
    Container user claims they have a CPU performance issue
    Container has a CPU cap and CPU shares configured
    There is no idle CPU on the host
    Other tenants are CPU busy
    /sys/fs/cgroup/.../cpu.stat -> throttled_time is not increasing
    /proc/PID/status nonvoluntary_ctxt_switches is increasing
    
slide 47:
    Game Scenario 3
    Container user claims they have a CPU performance issue
    Container has CPU shares configured
    There is no idle CPU on the host
    Other tenants are CPU busy
    /sys/fs/cgroup/.../cpu.stat -> throttled_time is not increasing
    /proc/PID/status nonvoluntary_ctxt_switches is not increasing much
    Experiments to confirm conclusion?
    
slide 48:
    Methodology: Reverse Diagnosis
    Enumerate possible outcomes, and work backwards to the metrics needed for diagnosis.
    e.g. CPU performance outcomes:
    A. physical CPU throttled
    B. cap throttled
    C. shares throttled (assumes physical CPU limited as well)
    D. not throttled
    Game answers: 1. B, 2. C, 3. D
    
slide 49:
    CPU Bottleneck Identification
    [decision tree:]
    throttled_time increasing?          -> yes: cap throttled
    nonvol…switches increasing?         -> no:  not throttled (but dig further)
    host has idle CPU?                  -> yes: not throttled (but dig further)
    all other tenants idle?             -> yes: physical CPU throttled
                                        -> no:  share throttled
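    A rough shell sketch of that decision tree (not from the slides; a hypothetical helper that
    assumes the Docker cgroup v1 layout from earlier, takes a full container ID and an app PID,
    and uses crude thresholds):
    #!/bin/bash
    # Usage: cpu_diagnose.sh FULL_CONTAINER_ID APP_PID
    CG=/sys/fs/cgroup/cpu,cpuacct/docker/$1
    t0=$(awk '/^throttled_time/ {print $2}' $CG/cpu.stat)
    c0=$(awk '/^nonvoluntary_ctxt_switches/ {print $2}' /proc/$2/status)
    sleep 10
    t1=$(awk '/^throttled_time/ {print $2}' $CG/cpu.stat)
    c1=$(awk '/^nonvoluntary_ctxt_switches/ {print $2}' /proc/$2/status)
    idle=$(vmstat 1 2 | tail -1 | awk '{print $15}')     # host %idle over the last second
    if [ "$t1" -gt "$t0" ]; then
        echo "cap throttled: throttled_time grew by $((t1 - t0)) ns"
    elif [ "$c1" -le "$c0" ]; then
        echo "not throttled (but dig further)"
    elif [ "$idle" -gt 10 ]; then
        echo "host has idle CPU: likely not throttled (but dig further)"
    else
        echo "host is CPU busy: share throttled if other tenants are busy, else physical CPU throttled"
    fi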
    
slide 50:
    4. Guest Tools
    And Container Awareness
    … if you only have guest access
    
slide 51:
    Guest Analysis Challenges
    • Some resource metrics are for the container, some for the host. Confusing!
    • May lack system capabilities or syscalls to run profilers and tracers
    
slide 52:
    CPU
    Can see host's CPU devices, but only container (pid namespace) processes:
    container# uptime
     20:17:19 up 45 days, 21:21,  0 users,  load average: 5.08, 3.69, 2.22          <- load! (host-wide)
    container# mpstat 1
    Linux 4.9.0 (02a7cf65f82e)   04/14/17   _x86_64_   (8 CPU)
    20:17:26   CPU   %usr  %nice  %sys  %iowait  %irq  %soft  %steal  %guest  %gnice  %idle
    20:17:27   all   [...]                                                           <- busy CPUs
    20:17:28   all   [...]
    Average:   all   [...]
    container# pidstat 1
    Linux 4.9.0 (02a7cf65f82e)   04/14/17   _x86_64_   (8 CPU)
    20:17:33   UID   PID   %usr  %system  %guest  %CPU  CPU  Command
    20:17:34   UID   PID   %usr  %system  %guest  %CPU  CPU  Command
    20:17:35   [...]
    but this container is running nothing
    (we saw CPU usage from neighbors)
    
slide 53:
    Memory
    Can see host's memory:
    container# free -m
                  total    used    free    shared    buff/cache    available
    Mem:          [...]                                            <- host memory (this container is --memory=1g)
    Swap:         [...]
    container# perl -e '$a = "A" x 1_000_000_000'                  <- tries to consume ~2 Gbytes
    Killed
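    To see the container's own limit rather than the host's, read the memory cgroup directly
    (a hedged sketch; assumes cgroup v1 with the memory controller mounted inside the container
    at the usual path):
    container# cat /sys/fs/cgroup/memory/memory.limit_in_bytes    # e.g. 1073741824 for --memory=1g
    container# cat /sys/fs/cgroup/memory/memory.usage_in_bytes
    container# cat /sys/fs/cgroup/memory/memory.stat              # detailed breakdown, incl. RSS and cache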
    
slide 54:
    Disks
    Can see host's disk devices:
    container# iostat -xz 1
    avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
              [...]
    Device:  rrqm/s  wrqm/s  r/s  w/s  rkB/s  wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
    xvdap1   [...]
    xvdb     [...]                                                       <- host disk I/O
    xvdc     [...]
    md0      [...]
    [...]
    container# pidstat -d 1
    Linux 4.9.0 (02a7cf65f82e)   04/18/17   _x86_64_   (8 CPU)
    22:41:13   UID   PID   kB_rd/s   kB_wr/s   kB_ccwr/s   iodelay   Command
    22:41:14   UID   PID   kB_rd/s   kB_wr/s   kB_ccwr/s   iodelay   Command
    22:41:15   [...]
    but no container I/O
    
slide 55:
    Network
    Can't see host's network interfaces (network namespace):
    container# sar -n DEV,TCP 1
    Linux 4.9.0 (02a7cf65f82e)   04/14/17   _x86_64_   (8 CPU)
    21:45:07   IFACE   rxpck/s   txpck/s   rxkB/s   txkB/s   rxcmp/s   txcmp/s   rxmcst/s   %ifutil
    21:45:08   eth0    [...]
    21:45:07   active/s   passive/s   iseg/s   oseg/s
    21:45:08   [...]
    [...]
    host has heavy network I/O, container sees itself (idle)
    
slide 56:
    Metrics Namespace
    This confuses apps too: trying to bind on all CPUs, or using 25% of memory
    Including the JDK, which is unaware of container limits, covered yesterday by Fabiane Nardon
    We could add a "metrics" namespace so the container only sees itself
    Or enhance existing namespaces to do this
    If you add a metrics namespace, please consider adding an option for:
    • /proc/host/stats: maps to host's /proc/stats, for CPU stats
    • /proc/host/diskstats: maps to host's /proc/diskstats, for disk stats
    As those host metrics can be useful, to identify/exonerate neighbor issues
    
slide 57:
    perf: CPU Profiling
    Needs capabilities to run from a container:
    container# ./perf record -F 99 -a -g -- sleep 10
    perf_event_open(..., PERF_FLAG_FD_CLOEXEC) failed with unexpected error 1 (Operation not permitted)
    perf_event_open(..., 0) failed unexpectedly with error 1 (Operation not permitted)
    Error: You may not have permission to collect system-wide stats.
    Consider tweaking /proc/sys/kernel/perf_event_paranoid,
    which controls use of the performance events system by
    unprivileged users (without CAP_SYS_ADMIN).
    Helpful message
    The current value is 2:
    -1: Allow use of (almost) all events by all users
    >= 0: Disallow raw tracepoint access by users without CAP_IOC_LOCK
    >= 1: Disallow CPU event access by users without CAP_SYS_ADMIN
    >= 2: Disallow kernel profiling by users without CAP_SYS_ADMIN
    Although, after setting perf_event_paranoid to -1, it prints the same error...
    
slide 58:
    perf & Container Debugging
    Debugging using strace from the host (as ptrace() is also blocked):
    host# strace -fp 26450
    bash PID, from which I then ran perf
    [...]
    [pid 27426] perf_event_open(0x2bfe498, -1, 0, -1, 0) = -1 EPERM (Operation not permitted)
    [pid 27426] perf_event_open(0x2bfe498, -1, 0, -1, 0) = -1 EPERM (Operation not permitted)
    [pid 27426] perf_event_open(0x2bfc1a8, -1, 0, -1, PERF_FLAG_FD_CLOEXEC) = -1 EPERM (Operation not permitted)
    Many different ways to debug this.
    https://docs.docker.com/engine/security/seccomp/#significant-syscalls-blocked-by-the-default-profile:
    
slide 59:
    perf, cont.
    • Can enable perf_event_open() with: docker run --cap-add sys_admin
    Also need (for kernel symbols): echo 0 > /proc/sys/kernel/kptr_restrict
    • perf then "works", and you can make flame graphs. But it sees all CPUs!?
    perf needs to be "container aware", and only see the container's tasks.
    patch pending: https://lkml.org/lkml/2017/1/12/308
    • Currently easier to run perf from the host (or secure "monitoring" container)
    Via a secure monitoring agent,
    e.g. Netflix Vector -> CPU Flame Graph
    See earlier slides for steps
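    Putting the two workarounds above together (a hedged example; IMAGE is a placeholder, and
    granting sys_admin weakens container isolation, so prefer running perf from the host):
    host# echo 0 > /proc/sys/kernel/kptr_restrict
    host# docker run --cap-add sys_admin -it IMAGE bash
    container# perf record -F 49 -a -g -- sleep 30
    container# perf script > out.stacks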
    
slide 60:
    5. Tracing
    Advanced Analysis
    … a few more examples
    (iosnoop, zfsslower, and btrfsdist shown earlier)
    
slide 61:
    Built-in Linux Tracers
    ftrace (2008+)
    perf_events (2009+)
    eBPF (2014+)
    Some front-ends:
    • ftrace: https://github.com/brendangregg/perf-tools
    • perf_events: used for CPU flame graphs
    • eBPF (aka BPF): https://github.com/iovisor/bcc (Linux 4.4+)
    
slide 62:
    ftrace: Overlay FS Function Calls
    Using ftrace via my perf-tools to count function calls in-kernel context:
    # funccount '*ovl*'
    Tracing "*ovl*"... Ctrl-C to end.
    FUNC
    COUNT
    ovl_cache_free
    ovl_xattr_get
    [...]
    ovl_fill_merge
    ovl_path_real
    ovl_path_upper
    ovl_update_time
    ovl_permission
    ovl_d_real
    ovl_override_creds
    Ending tracing...
    Each can be a target for further study with kprobes
    
slide 63:
    ftrace: Overlay FS Function Tracing
    Using kprobe (perf-tools) to trace ovl_fill_merge() args and stack trace
    # kprobe -s 'p:ovl_fill_merge ctx=%di name=+0(%si):string'
    Tracing kprobe ovl_fill_merge. Ctrl-C to end.
    bash-16633 [000] d... 14390771.218973: ovl_fill_merge: (ovl_fill_merge+0x0/0x1f0
    [overlay]) ctx=0xffffc90042477db0 name="iostat"
    bash-16633 [000] d... 14390771.218981: <stack trace>
     => ovl_fill_merge
     => ext4_readdir
     => iterate_dir
     => ovl_dir_read_merged
     => ovl_iterate
     => iterate_dir
     => SyS_getdents
     => do_syscall_64
     => return_from_SYSCALL_64
    […]
    Good for debugging, although dumping all events can cost too much overhead. ftrace has
    some solutions to this, BPF has more…
    
slide 64:
    Enhanced BPF Tracing Internals
    Observability Program
    Kernel
    load
    BPF
    bytecode
    BPF
    program
    event config
    verifier
    tracepoints
    a\ach
    dynamic tracing
    BPF
    output
    per-event
    data
    sta=s=cs
    sta=c tracing
    kprobes
    uprobes
    async
    copy
    sampling, PMCs
    maps
    perf_events
    
slide 65:
    BPF: Scheduler Latency 1
    host# runqlat -p 20228 10 1
    Tracing run queue latency... Hit Ctrl-C to end.
         usecs          : count    distribution
         0 -> 1         : 0        |                                        |
         2 -> 3         : 4        |                                        |
         4 -> 7         : 368      |****************************************|
         8 -> 15        : 151      |****************                        |
         16 -> 31       : 22       |**                                      |
         32 -> 63       : 14       |*                                       |
         64 -> 127      : 19       |**                                      |
         128 -> 255     : 0        |                                        |
         256 -> 511     : 2        |                                        |
         512 -> 1023    : 1        |                                        |
    This is an app in a Docker container on a system with idle CPU
    Tracing scheduler events can be costly (high rate), but this BPF program reduces cost by using
    in-kernel maps to summarize data, and only emits the "count" column to user space.
    
slide 66:
    BPF: Scheduler Latency 2
    host# runqlat -p 20228 10 1
    Tracing run queue latency... Hit Ctrl-C to end.
         usecs              : count    distribution
         0 -> 1             : 0        |                                        |
         2 -> 3             : 0        |                                        |
         4 -> 7             : 7        |**                                      |
         8 -> 15            : 14       |*****                                   |
         16 -> 31           : 0        |                                        |
         32 -> 63           : 0        |                                        |
         64 -> 127          : 0        |                                        |
         128 -> 255         : 0        |                                        |
         256 -> 511         : 0        |                                        |
         512 -> 1023        : 0        |                                        |
         1024 -> 2047       : 0        |                                        |
         2048 -> 4095       : 5        |**                                      |
         4096 -> 8191       : 6        |**                                      |
         8192 -> 16383      : 28       |***********                             |   <- 8 - 65 ms delays
         16384 -> 32767     : 59       |***********************                 |
         32768 -> 65535     : 99       |****************************************|
         65536 -> 131071    : 6        |**                                      |
         131072 -> 262143   : 2        |                                        |
         262144 -> 524287   : 1        |                                        |
    Now other tenants are using more CPU, and this PID is throttled via CPU shares
    
slide 67:
    BPF: Scheduler Latency 3
    host# runqlat --pidnss -m
    Tracing run queue latency... Hit Ctrl-C to end.
    pidns = 4026532870
         msecs          : count    distribution
         0 -> 1         : 264      |****************************************|
         2 -> 3         : 0        |                                        |
         4 -> 7         : 0        |                                        |
         8 -> 15        : 0        |                                        |
         16 -> 31       : 0        |                                        |
         32 -> 63       : 0        |                                        |
         64 -> 127      : 2        |                                        |
    […]
    pidns = 4026532382
         msecs          : count    distribution
         0 -> 1         : 646      |****************************************|
         2 -> 3         : 18       |*                                       |
         4 -> 7         : 48       |**                                      |
         8 -> 15        : 17       |*                                       |
         16 -> 31       : 150      |*********                               |
         32 -> 63       : 134      |********                                |
    Per-PID namespace histograms (I added this yesterday)
    
slide 68:
    BPF: Namespace-ing Tools
    Walking from the task_struct to the PID namespace ID:
    task_struct->nsproxy->pid_ns_for_children->ns.inum
    This is unstable, and could break between kernel versions. If it becomes a problem, we'll add a
    bpf_get_current_pidns()
    Does need a *task, or bpf_get_current_task() (added in 4.8)
    Can also pull out cgroups, but gets trickier…
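    That inum is the same ID exposed under /proc, so you can map runqlat --pidnss output back to
    a container from the host (a hedged example; PID is a placeholder):
    host# readlink /proc/PID/ns/pid     # prints e.g. pid:[4026532870], matching the histogram keys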
    
slide 69:
    bcc (BPF) Perf Tools
    [diagram: bcc tracing tools mapped to Linux subsystems]
    
slide 70:
    Docker Analysis & Debugging
    If needed, dockerd can also be analyzed using:
    • go execution tracer
    • GODEBUG with gctrace and schedtrace
    • gdb and Go runtime support
    • perf profiling
    • bcc/BPF and uprobes
    Each has pros/cons. bcc/BPF can trace user & kernel events.
    
slide 71:
    BPF: dockerd Go Function Counting
    Counting dockerd Go calls in-kernel using BPF that match "*docker*get":
    # funccount '/usr/bin/dockerd:*docker*get*'
    Tracing 463 functions for "/usr/bin/dockerd:*docker*get*"... Hit Ctrl-C to end.
    FUNC
    COUNT
    github.com/docker/docker/daemon.(*statsCollector).getSystemCPUUsage
    github.com/docker/docker/daemon.(*Daemon).getNetworkSandboxID
    github.com/docker/docker/daemon.(*Daemon).getNetworkStats
    github.com/docker/docker/daemon.(*statsCollector).getSystemCPUUsage.func1
    github.com/docker/docker/pkg/ioutils.getBuffer
    github.com/docker/docker/vendor/golang.org/x/net/trace.getBucket
    github.com/docker/docker/vendor/golang.org/x/net/trace.getFamily
    github.com/docker/docker/vendor/google.golang.org/grpc.(*ClientConn).getTransport
    github.com/docker/docker/vendor/github.com/golang/protobuf/proto.getbase
    github.com/docker/docker/vendor/google.golang.org/grpc/transport.(*http2Client).getStream
    Detaching...
    # objdump -tTj .text /usr/bin/dockerd | wc -l
    35,859 functions can be traced!
    Uses uprobes, and needs newer kernels. Warning: will cost overhead at high function rates.
    
slide 72:
    BPF: dockerd Go Stack Tracing
    Counting stack traces that led to this ioutils.getBuffer() call:
    # stackcount 'p:/usr/bin/dockerd:*/ioutils.getBuffer'
    Tracing 1 functions for "p:/usr/bin/dockerd:*/ioutils.getBuffer"... Hit Ctrl-C to end.
    github.com/docker/docker/pkg/ioutils.getBuffer
    github.com/docker/docker/pkg/broadcaster.(*Unbuffered).Write
    bufio.(*Reader).writeBuf
    bufio.(*Reader).WriteTo
    io.copyBuffer
    io.Copy
    github.com/docker/docker/pkg/pools.Copy
    github.com/docker/docker/container/stream.(*Config).CopyToPipe.func1.1
    runtime.goexit
    dockerd [18176]
      110                          <- means this stack was seen 110 times
    Detaching...
    Can also trace function arguments, and latency (with some work)
    http://www.brendangregg.com/blog/2017-01-31/golang-bcc-bpf-function-tracing.html
    
slide 73:
    Summary
    Identify bottlenecks:
    1. In the host vs container, using system metrics
    2. In application code on containers, using CPU flame graphs
    3. Deeper in the kernel, using tracing tools
    
slide 74:
    References
    http://techblog.netflix.com/2017/04/the-evolution-of-container-usage-at.html
    http://techblog.netflix.com/2016/07/distributed-resource-scheduling-with.html
    https://www.slideshare.net/aspyker/netflix-and-containers-titus
    https://docs.docker.com/engine/admin/runmetrics/#tips-for-high-performance-metric-collection
    https://blog.docker.com/2013/10/gathering-lxc-docker-containers-metrics/
    https://www.slideshare.net/jpetazzo/anatomy-of-a-container-namespaces-cgroups-some-filesystem-magic-linuxcon
    https://www.youtube.com/watch?v=sK5i-N34im8 Cgroups, namespaces, and beyond
    https://jvns.ca/blog/2016/10/10/what-even-is-a-container/
    https://blog.jessfraz.com/post/containers-zones-jails-vms/
    http://blog.alicegoldfuss.com/making-flamegraphs-with-containerized-java/
    http://www.brendangregg.com/USEmethod/use-linux.html full USE method list
    http://www.brendangregg.com/blog/2017-01-31/golang-bcc-bpf-function-tracing.html
    http://techblog.netflix.com/2015/11/linux-performance-analysis-in-60s.html
    http://queue.acm.org/detail.cfm?id=1809426 latency heat maps
    https://github.com/brendangregg/perf-tools ftrace tools, https://github.com/iovisor/bcc BPF tools
    
slide 75:
    Thank You!
    http://techblog.netflix.com
    http://slideshare.net/brendangregg
    http://www.brendangregg.com
    @brendangregg
    Titus team: @aspyker @anwleung @fabiokung @tomaszbak1974
    @amit_joshee @sargun @corindwyer …
    #dockercon