
SCaLE12x (2014): What Linux can learn from Solaris performance and vice-versa

Keynote for the Southern California Linux Expo 12x (2014) by Brendan Gregg.

Video: http://www.youtube.com/watch?v=6TYC5h4yz1o

What Linux can learn from Solaris performance and vice-versa.


PDF: SCaLE_Linux_vs_Solaris_Performance2014.pdf

Keywords (from pdftotext):

slide 1:
    What Linux can learn from
    Solaris performance
    and vice-versa
    Brendan Gregg
    Lead Performance Engineer
    [email protected]
    @brendangregg
    SCaLE12x
    February, 2014
    
slide 2:
    Linux vs Solaris Performance Differences
    [diagram: the generic kernel stack (applications, system libraries,
    syscall interface, VFS, sockets, file systems, TCP/UDP, volume
    managers, block device interface, Ethernet, scheduler, virtual
    memory, virtualization, device drivers) annotated with each side's
    features. SmartOS/Solaris: DTrace, libumem, symbols, microstate
    accounting, mature fully preemptive kernel, CPU scalability, ZFS,
    Zones, Crossbow, FireEngine, MPSS, process swapping. Linux:
    up-to-date packages, futex, likely()/unlikely(), CONFIGurable, RCU,
    DynTicks, SLUB, btrfs, I/O scheduler, KVM, more device drivers,
    overcommit & OOM killer, lazy TLB]
    
slide 3:
    whoami
    • Lead Performance Engineer at Joyent
    • Work on Linux and SmartOS performance
    • Work/Research: tools, visualizations, methodologies
    • Did kernel engineering at Sun Microsystems; worked on
    DTrace and ZFS
    
slide 4:
    Joyent
    • High-Performance Cloud Infrastructure
    • Compete on cloud instance/OS performance
    • OS Virtualization for bare metal performance (Zones)
    • Core developers of illumos/SmartOS and Node.js
    • Recently launched Manta: a high performance object store
    • KVM for Linux guests
    • Certified Ubuntu on Joyent cloud now available!
    
slide 5:
    SCaLE11x: Linux Performance Analysis
    [diagram: Linux performance observability tools mapped onto the
    system stack: perf, dtrace, stap, lttng, ktap (kernel tracers);
    strace, pidstat, top, ps, mpstat, vmstat, slabtop, free, iostat,
    iotop, blktrace, tcpdump, nicstat, swapon, sar, dstat, /proc;
    various: ping, traceroute]
    Latest version: http://www.brendangregg.com/linuxperf.html
    
slide 6:
    SCaLE12x: Linux and Solaris Performance
    • Also covered in my new book,
    Systems Performance
    (Prentice Hall, 2013)
    • Focus is on understanding
    systems and the methodologies
    to analyze them. Linux and
    Solaris are used as examples
    
slide 7:
    Agenda
    • Why systems differ
    • Specific differences
    • What Solaris could learn from Linux
    • What Linux could learn from Solaris
    • What both can learn
    • Results
    
slide 8:
    Terminology
    • For convenience in this talk:
    • Linux = an operating system distribution which uses the Linux kernel.
    eg, Ubuntu.
    • Solaris = a distribution from the family of operating systems whose
    kernel code originated from Sun Solaris.
    • SmartOS = a Solaris-family distribution developed by Joyent, based on
    the illumos kernel, which was based on the OpenSolaris kernel, which
    was based on the Solaris kernel
    • System = everything you get when you pick a Linux or Solaris
    distribution: the kernel, libraries, tools, and package repos
    • Opinions in this presentation are my own, and I do not
    represent Oracle Solaris. I'll actually be talking about SmartOS.
    
slide 9:
    Why Systems Differ
    
slide 10:
    Why Systems Differ
    • Does the system even matter?
    • Will your application perform the same on Linux and Solaris?
    
slide 11:
    Example
    • Let's start with this simple program:
    perl -e 'for ($i = 0; $i < [...]
slide 12:
    Example Results
    • One of these is Linux, the other SmartOS. Same hardware:
    systemA$ time perl -e 'for ($i = 0; $i < [...]
slide 13:
    Possible Differences: Versioning
    • Different versions of Perl
    • Applications improve performance from release to release
    • Linux and SmartOS distributions use entirely different
    package repos; different software versions are common
    • Different compilers used to build Perl
    • Compilers come from package repos, too. I've seen 50%
    performance improvements by gcc version alone
    • Different compiler options used to build Perl
    • Application Makefile: #ifdef Linux -O3 #else -O0. ie, the
    performance difference is due to a Makefile decision
    • 32-bit vs 64-bit builds
    
slide 14:
    Possible Differences: OS
    • Different system libraries
    • If any library calls are used. eg: strcmp(), malloc(),
    memcpy(), ... These implementations vary between Linux
    and Solaris, and can perform very differently
    • Robert Mustacchi enhanced libumem to provide improved
    malloc() performance on SmartOS. This can make a
    noticeable difference for some workloads
    • Different background tasks
    • Runtime could be perturbed by OS daemons doing async
    housekeeping. These differ between Linux and Solaris
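    • A sketch of the kind of allocator micro-benchmark that exposes such
      library differences (sizes and counts are arbitrary; on SmartOS the
      same binary can be compared against libumem via LD_PRELOAD):
        #include <stdio.h>
        #include <stdlib.h>
        #include <time.h>

        /* Hammer malloc/free with small allocations and time the loop.
         * CLOCK_MONOTONIC is assumed available (true on modern Linux
         * and Solaris). */
        int
        main(void)
        {
                struct timespec t0, t1;
                clock_gettime(CLOCK_MONOTONIC, &t0);
                for (int i = 0; i < 10000000; i++) {
                        char *p = malloc(64);
                        if (p == NULL)
                                return 1;
                        *(volatile char *)p = 1;  /* defeat dead-alloc elimination */
                        free(p);
                }
                clock_gettime(CLOCK_MONOTONIC, &t1);
                printf("%.3f s\n", (t1.tv_sec - t0.tv_sec) +
                    (t1.tv_nsec - t0.tv_nsec) / 1e9);
                return 0;
        }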
    
slide 15:
    Possible Differences: Observability
    • Can the 14% be root caused?
    • Observability tools differ. These don't cause the 14%, but
    can make the difference as to whether you can diagnose
    and fix it – or not.
    • DTrace has meant that anything can be solved; without an
    equivalent on Linux, you may have to live with that 14%
    • Although, Linux observability is getting much better...
    
slide 16:
    Possible Differences: Kernel
    • Can the kernel make a difference? ... As a reminder:
    perl -e 'for ($i = 0; $i < [...]
slide 17:
    Possible Differences: Kernel, cont.
    • Can the kernel make a difference? ... As a reminder:
    perl -e 'for ($i = 0; $i < [...]
slide 18:
    Possible Differences: Kernel, cont.
    • During a perturbation, the kernel CPU scheduler may
    migrate the thread to another CPU, which can hurt
    performance (cold caches, memory locality)
    • Sure, but would that happen for this simple Perl program?
    
slide 19:
    Possible Differences: Kernel, cont.
    • During a perturbation, the kernel CPU scheduler may
    migrate the thread to another CPU, which can hurt
    performance (cold caches, memory locality)
    • Sure, but would that happen for this simple Perl program?
    # dtrace -n 'profile-99 /pid == $target/ { @ = lquantize(cpu, 0, 16, 1); }' -c ...
    value ------------- Distribution ------------- count
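    • One way to test the migration hypothesis: pin the process to a single
      CPU and compare runtimes. A minimal sketch using Linux's
      sched_setaffinity() (the Solaris equivalent is processor_bind(2) or
      pbind(1M)):
        #define _GNU_SOURCE
        #include <sched.h>
        #include <stdio.h>

        /* Pin the calling process to CPU 0, then run the workload;
         * comparing pinned vs unpinned runtimes quantifies the cost of
         * migrations (cold caches, memory locality). */
        int
        main(void)
        {
                cpu_set_t set;
                CPU_ZERO(&set);
                CPU_SET(0, &set);
                if (sched_setaffinity(0, sizeof(set), &set) != 0) {
                        perror("sched_setaffinity");
                        return 1;
                }
                /* ... workload under test goes here ... */
                return 0;
        }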
    
slide 20:
    Kernel Myths and Realities
    • Myth: "The kernel gets out of the way for applications"
    • The only case where the kernel gets out of the way is
    when your software calls halt() or shutdown()
    • The performance difference between kernels may be
    small, eg, 5% – but I have already seen a 5x difference this
    year
    arch/ia64/kernel/smp.c:
        void
        cpu_die(void)
        {
                max_xtp();
                local_irq_disable();
                cpu_halt();
                /* Should never be here */
                BUG();
                for (;;);
        }
    unintentional kernel humor...
    
slide 21:
    Other Differences
    • The previous example was simple. Any applications that do
    I/O (ie, everything) encounter more differences:
    • Different network stack implementations, including support
    for different TCP/IP features
    • Different file systems, storage I/O stacks
    • Different device drivers and device feature support
    • Different resource control implementations
    • Different virtualization technologies
    • Different community support: stackoverflow, meetups, ...
    
slide 22:
    Types of Differences
    [diagram: difference types placed on the system stack: app versions
    from system repos; compiler options; observability tools; daemons;
    system library implementations (malloc(), str*(), ...); syscall
    interface; file systems (ZFS, btrfs, ...); I/O scheduling; TCP/IP
    stack and features; device driver support and features; scheduler
    classes and behavior; memory allocation and locality; network device
    CPU fanout; virtualization technologies; resource controls]
    
slide 23:
    Specific Differences
    
slide 24:
    Specific Differences
    • Comparing systems is like comparing countries
    • I'm often asked: how's Australia different from the US?
    • Where do I start!?
    • I'll categorize performance differences into big or small, based
    on their engineering cost, not their performance effect
    • If one system is 2x faster than another for a given workload,
    the real question for the slower system is:
    • Is this a major undertaking to fix?
    • Is there a quick fix or workaround?
    • Using SmartOS for specific examples...
    
slide 25:
    Big Differences
    • Major bodies of perf work and other big differences, include:
    • Linux
    • up-to-date packages, large community, more device
    drivers, futex, RCU, btrfs, DynTicks, SLUB, I/O scheduling
    classes, overcommit & OOM killer, lazy TLB, likely()/
    unlikely(), CONFIGurable
    • SmartOS
    • Mature: Zones, ZFS, DTrace, fully pre-emptable kernel
    • Microstate accounting, symbols by default, CPU scalability,
    MPSS, libumem, FireEngine, Crossbow, binary /proc,
    process swapping
    
slide 26:
    Big Differences: Linux
    Up-to-date packages   Latest application versions, with the latest
                          performance fixes
    Large community       Weird perf issue? May be answered on
                          stackoverflow, or discussed at meetups
    More device drivers   There can be better coverage for high
                          performing network cards or driver features
    futex                 Fast user-space mutex (see the sketch below)
    RCU                   Fast-performing read-copy updates
    btrfs                 Modern file system with pooled storage
    DynTicks              Dynamic ticks: tickless kernel, reduces
                          interrupts and saves power
    SLUB                  Simplified version of the SLAB kernel memory
                          allocator, improving performance
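
    • To make the futex entry concrete, a minimal futex-backed mutex
      sketch in C (Linux-only, illustrative; error handling omitted, and
      production locks use a three-state word so uncontended unlock also
      avoids the syscall):
        #include <linux/futex.h>
        #include <stdatomic.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        static atomic_int lock_word;            /* 0 = free, 1 = held */

        /* Uncontended path is a single user-space atomic; the kernel
         * is entered only to sleep on, or wake, a contended lock. */
        static void
        futex_lock(void)
        {
                int expected = 0;
                while (!atomic_compare_exchange_strong(&lock_word, &expected, 1)) {
                        /* Contended: sleep until the word leaves 1. */
                        syscall(SYS_futex, &lock_word, FUTEX_WAIT, 1, NULL, NULL, 0);
                        expected = 0;
                }
        }

        static void
        futex_unlock(void)
        {
                atomic_store(&lock_word, 0);
                syscall(SYS_futex, &lock_word, FUTEX_WAKE, 1, NULL, NULL, 0);
        }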
    
slide 27:
    Big Differences: Linux, cont.
    I/O scheduling classes  Block I/O classes: deadline, anticipatory, ...
    Overcommit & OOM        Doing more with less main memory
    killer
    Lazy TLB                Higher performing munmap()
    likely()/unlikely()     Kernel is embedded with compiler information
                            for branch prediction, improving runtime perf
    CONFIGurable            Lightweight kernels possible by disabling
                            features
    
slide 28:
    Big Differences: SmartOS
    Mature Zones            OS virtualization for high-performing server
                            instances
    Mature ZFS              Fully-featured and high-performing modern
                            integrated file system with pooled storage
    Mature DTrace           Programmable dynamic and static tracing for
                            performance analysis
    Mature fully pre-       Support for real-time systems was an early
    emptable kernel         Sun differentiator
    Microstate accounting   Numerous high-resolution thread state times
                            for performance debugging
    Symbols                 Symbols available for profiling tools by
                            default
    CPU scalability         Code is often tested, and bugs fixed, for
                            large SMP servers (mainframes)
    MPSS                    Multiple page size support (not just
                            hugepages)
    
slide 29:
    Big Differences: SmartOS, cont.
    libumem            High-performing memory allocation library, with
                       per-thread CPU caches
    FireEngine         High-performing TCP/IP stack enhancements,
                       including vertical perimeters and IP fanout
    Crossbow           High-performing virtualized network interfaces,
                       as used by OS virtualization
    binary /proc       Process statistics are binary (slightly more
                       efficient) by default
    Process swapping   Apart from paging (what Linux calls swapping),
                       Solaris can still swap out entire processes
    
slide 30:
    Big Differences: Linux vs SmartOS
    [diagram: the same annotated kernel stack as slide 2, placing each
    big difference beside the subsystem it affects, with resource
    controls added: DTrace, libumem, symbols, microstate accounting,
    Zones, ZFS, Crossbow, FireEngine, MPSS, process swapping,
    CPU scalability (SmartOS); futex, up-to-date packages,
    likely()/unlikely(), CONFIGurable, RCU, DynTicks, SLUB, btrfs,
    I/O scheduler, more device drivers, overcommit & OOM killer,
    lazy TLB (Linux)]
    
slide 31:
    Small Differences
    • Smaller performance-related differences, tunables, bugs
    • Linux
    • glibc, better TCP defaults, better CPU affinity, perf stat, a
    working sar, htop, splice(), fadvise(), ionice, /usr/bin/time,
    mpstat %steal, voluntary preemption, swappiness, various
    accounting frameworks, tcp_tw_reuse/recycle, TCP tail
    loss probe, SO_REUSEPORT, ...
    • SmartOS
    • perf tools by default, kstat, vfsstat, iostat -e, ptime -m,
    CPU-only load averages, some STREAMS leftovers, ZFS
    SCSI cache flush by default, different TCP slow start
    default, ...
    
slide 32:
    Small Differences, cont.
    • Small differences change frequently: a feature is added to one
    kernel, then the other a year later; a difference may only exist
    for a short period of time.
    • These small kernel differences may still make a significant
    performance difference, but are classified as "small" based on
    engineering cost.
    
slide 33:
    System Similarities
    • It's important to note that many performance-related features
    are roughly equivalent:
    • Both are Unix-like systems: processes, kernel, syscalls,
    time sharing, preemption, virtual memory, paged virtual
    memory, demand paging, ...
    • Similar modern features: unified buffer cache, memory
    mapped files, multiprocessor support, CPU scheduling
    classes, CPU sets, 64-bit support, memory locality,
    resource controls, PIC profiler, epoll, ...
    
slide 34:
    Non Performance Differences
    • Linux
    • Open source (vs Oracle Solaris), "everyone knows it",
    embedded Linux, popular and well supported desktop/
    laptop use...
    • SmartOS
    • SMF/FMA, contracts, privileges, mdb (postmortem
    debugging), gcore, crash dumps by default, ...
    
slide 35:
    WARNING
    The next sections are not suitable for those suffering
    Not Invented Here (NIH) syndrome,
    or those who are easily trolled
    
slide 36:
    What Solaris can learn from Linux performance
    
slide 37:
    What Solaris can learn from Linux performance
    • Packaging
    • Community
    • Compiler Options
    • likely()/unlikely()
    • Tickless Kernel
    • Process Swapping
    • Overcommit & OOM Killer
    • SLUB
    • Lazy TLB
    • TIME_WAIT Recycling
    • sar
    • KVM
    • Either learning what to do, or learning what not to do...
    
slide 38:
    Packaging
    • Linux package repositories are often well stocked and updated
    • Convenience aside, this can mean that users run newer
    software versions, along with the latest perf fixes
    • They find "Linux is faster", but the real difference is the version
    of: gcc, openssl, mysql, ... Solaris is unfairly blamed
    
slide 39:
    Packaging, cont.
    • Packaging is important and needs serious support
    • Dedicated staff, community
    • eg, Joyent has dedicated staff for the SmartOS package repo,
    which is based on pkgsrc from NetBSD
    • It's not just the operating system that matters; it's the
    ecosystem
    
slide 40:
    Community
    • A large community means:
    • Q/A sites have performance tips: stackoverflow, ...
    • Conference talks on performance (this one!), slides, video
    • Weird issues more likely found and fixed by someone else
    • More case studies shared: what tuning/config worked
    • A community helps people hear about the latest tools, tuning,
    and developments, and adopt them
    
slide 41:
    Community, cont.
    • Linux users expect to Google a question and find an answer
    on stackoverflow
    • Either foster a community to share content on tuning, tools,
    configuration, or, have staff to create such content.
    • Hire a good community manager!
    
slide 42:
    Compiler Options
    • Apps may compile with optimizations for Linux only. eg:
    • #ifdef Linux -O3 #else -O0   ("Oh, ha ha ha")
    • Developers are often writing software on Linux, and that
      platform gets the most attention. (Works on my system.)
    • I've also seen 64-bit vs 32-bit. #ifdef Linux USE_FUTEX would
      be fine, since Solaris doesn't have them yet.
    • Last time I found compiler differences using Flame Graphs:
    [flame graphs: Linux vs SmartOS builds of the same app; the SmartOS
    build runs an extra function, UnzipDocid()]
    
slide 43:
    Compiler Options, cont.
    • Can be addressed by tuning packages in the repo
    • Also file bugs/patches with developers to tune Makefiles
    • Someone has to do this, eg, package repo staff/community
    who find and do the workarounds anyway
    
slide 44:
    likely()/unlikely()
    • These become compiler hints (__builtin_expect) for branch
      prediction, and appear throughout the Linux kernel. eg,
      net/ipv4/tcp_output.c, tcp_transmit_skb():
        [...]
        if (likely(clone_it)) {
                if (unlikely(skb_cloned(skb)))
                        skb = pskb_copy(skb, gfp_mask);
                else
                        skb = skb_clone(skb, gfp_mask);
                if (unlikely(!skb))
                        return -ENOBUFS;
        }
        [...]
    • The Solaris kernel doesn't do this yet
    • If the kernel is built using profile feedback instead – which should
      be even better – I don't know about it
    • The actual perf difference is likely to be small
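    • For reference, the hints are thin macros over GCC's
      __builtin_expect (these definitions match the kernel's
      include/linux/compiler.h); fast_path()/slow_path() below are
      hypothetical stand-ins:
        #define likely(x)       __builtin_expect(!!(x), 1)
        #define unlikely(x)     __builtin_expect(!!(x), 0)

        extern void fast_path(int);     /* hypothetical */
        extern void slow_path(int);     /* hypothetical */

        /* The compiler lays out the likely branch as the straight-line
         * fall-through path, helping branch prediction and I-cache. */
        void
        process(int len)
        {
                if (likely(len > 0))
                        fast_path(len);
                else
                        slow_path(len);
        }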
    
slide 45:
    likely()/unlikely(), cont.
    • Could be adopted by kernel engineering
    • Might help readability, might not
    
slide 46:
    Tickless Kernel
    • Linux does this already (DynTicks), which reduces interrupts
    and improves processor power saving (good for laptops and
    embedded devices)
    • Solaris still has a clock() routine, to perform various kernel
    housekeeping functions
    • Called by default at 100 Hertz
    • If hires_tick=1, at 1000 Hertz
    • I've occasionally encountered perf issues involving 10 ms
    latencies, that don't exist on Linux
    • ... which become 1 ms latencies after setting hires_tick=1
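    • A quick way to see the quantization from user-land: time a batch of
      short sleeps. A sketch (1 ms requests; on a 100 Hz kernel these
      tend toward 10 ms, with hires_tick=1 toward 1 ms, and on Linux
      hrtimers keep them near 1 ms):
        #include <stdio.h>
        #include <time.h>

        /* Request 100 x 1 ms sleeps and report the average achieved,
         * which exposes timer quantization from the clock rate. */
        int
        main(void)
        {
                struct timespec req = { 0, 1000000 };   /* 1 ms */
                struct timespec t0, t1;
                clock_gettime(CLOCK_MONOTONIC, &t0);
                for (int i = 0; i < 100; i++)
                        nanosleep(&req, NULL);
                clock_gettime(CLOCK_MONOTONIC, &t1);
                double s = (t1.tv_sec - t0.tv_sec) +
                    (t1.tv_nsec - t0.tv_nsec) / 1e9;
                printf("avg sleep: %.2f ms (requested 1 ms)\n", s * 10.0);
                return 0;
        }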
    
slide 47:
    Tickless Kernel, cont.
    • Sun/Oracle did start work on this years ago...
    
slide 48:
    Process Swapping
    • Linux doesn't do it. Linux "swapping" means paging.
    • Process swapping made sense on the PDP-11/20, where the
      maximum process size was 64 Kbytes
    • Paging was added later in BSD, but the swapping code remained
    
slide 49:
    Process Swapping, cont.
    • Consider ditching it
    • All that time learning what swapping is could be spent learning
    more useful features
    
slide 50:
    Overcommit & OOM Killer
    • On Linux, malloc() may never fail
    • No virtual memory limit (main memory + swap) like Solaris
    by default. Tunable using vm.overcommit_memory
    • More user memory can be allocated than can be stored.
    May be great for small devices, running applications that
    sparsely use the memory they allocate
    • Don't worry, if Linux runs very low on available main memory,
    a sacrificial process is identified by the kernel and killed
    by the Out Of Memory (OOM) Killer, based on an OOM score
    • OOM score was just added to htop (1.0.2, Jan 2014): [htop screenshot]
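    • A small C sketch of the behavior (the 64 GB figure is arbitrary;
      whether malloc() succeeds depends on vm.overcommit_memory and
      system size, and touching the pages is what can summon the OOM
      killer):
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>

        int
        main(void)
        {
                size_t sz = (size_t)64 << 30;   /* 64 GB of virtual memory */
                char *p = malloc(sz);
                if (p == NULL) {                /* the expected result on Solaris */
                        perror("malloc");
                        return 1;
                }
                printf("malloc() of 64 GB succeeded; touching pages...\n");
                memset(p, 1, sz);               /* pages are backed here; this
                                                   may trigger the OOM killer */
                return 0;
        }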
    
slide 51:
    Overcommit & OOM Killer, cont.
    • Solaris can learn why not to do this (cautionary tale)
    • If an important app depended on this, and couldn't be fixed,
    the kernel could have an overcommit option that wasn't default
    • ... this is why so much new code doesn't check for ENOMEM
    
slide 52:
    SLUB
    • Linux integrated the Solaris kernel SLAB allocator, then later
    simplified it: The SLUB allocator
    • Removed object queues and per-CPU caches, leaving NUMA
    optimization to the page allocator's free lists
    • Worth considering?
    
slide 53:
    Lazy TLB
    • Lazy TLB mode: a way to delay TLB updates (shootdowns)
    • munmap() heavy workloads on Solaris can experience heavy
    HAT CPU cross calls. Linux doesn't seem to have this problem.
                          TLB         Lazy TLB
      As seen by Solaris  Correct     Reckless
      As seen by Linux    Paranoid    Fast
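    • An munmap()-heavy loop of the kind that triggers these cross calls
      (a sketch; on Solaris, watch the xcal column in mpstat while it
      runs):
        #include <stdio.h>
        #include <sys/mman.h>

        /* Map, touch, and unmap a page in a loop: each munmap() can
         * require TLB shootdowns (cross calls) to other CPUs.
         * MAP_ANON is assumed available on both systems. */
        int
        main(void)
        {
                for (int i = 0; i < 1000000; i++) {
                        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANON, -1, 0);
                        if (p == MAP_FAILED) {
                                perror("mmap");
                                return 1;
                        }
                        p[0] = 1;               /* fault the page in */
                        munmap(p, 4096);
                }
                return 0;
        }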
    
slide 54:
    Lazy TLB, cont.
    • This difference needs to be investigated, quantified, and
    possibly fixed (tunable?)
    
slide 55:
    TIME_WAIT Recycling
    • A localhost HTTP benchmark on Solaris:
    # netstat -s 1 | grep ActiveOpen
        tcpActiveOpens      =728004     tcpPassiveOpens     =726547
        tcpActiveOpens      =  4939     tcpPassiveOpens     =  4939     <- fast
        tcpActiveOpens      =  5849     tcpPassiveOpens     =  5798
        tcpActiveOpens      =  1341     tcpPassiveOpens     =  1292     <- slow
        tcpActiveOpens      =  1006     tcpPassiveOpens     =  1008
        [...]
    • Connection rate drops by 5x due to sessions in TIME_WAIT
    • Linux avoids this by recycling better (tcp_tw_reuse/recycle)
    • Usually doesn't hurt production workloads, as it must be a
    flood of connections from a single host to a single port. It
    comes up in benchmarks/evaluations.
    
slide 56:
    TIME_WAIT Recycling, cont.
    • Improve tcp_time_wait_processing()
    • This is being fixed for illumos/SmartOS
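    • For benchmark clients on either OS, one blunt workaround is an
      abortive close, which skips TIME_WAIT entirely (a sketch; fine for
      load generators, wrong for correctness-sensitive production code):
        #include <sys/socket.h>

        /* SO_LINGER with l_linger = 0 makes close() send a RST,
         * so the socket never enters TIME_WAIT. */
        static void
        set_abortive_close(int fd)
        {
                struct linger lg = { 1, 0 };    /* l_onoff = 1, l_linger = 0 */
                setsockopt(fd, SOL_SOCKET, SO_LINGER, &lg, sizeof (lg));
        }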
    
slide 57:
    sar
    • Linux sar is awesome, and has extra options:
    $ sar -n DEV -n TCP -n ETCP 1
    11:16:34 PM    IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s
    11:16:35 PM     eth0       ...
    11:16:35 PM     eth1       ...
    11:16:35 PM  ip6tnl0       ...
    11:16:35 PM  ip_vti0       ...
    11:16:35 PM     sit0       ...
    11:16:35 PM    tunl0       ...
    11:16:34 PM  active/s passive/s    iseg/s    oseg/s
    11:16:35 PM       ...
    11:16:34 PM  atmptf/s  estres/s retrans/s isegerr/s   orsts/s
    11:16:35 PM       ...
    (values lost in extraction)
    • -n DEV: network interface statistics
    • -n TCP: TCP statistics
    • -n ETCP: TCP error statistics
    • Linux sar's other metrics are also far less buggy
    
slide 58:
    sar, cont.
    • Sar must be fixed for the 21st century
    • Use the Linux sar options and column names, which follow a
    neat convention
    
slide 59:
    KVM
    • The KVM type 2 hypervisor originated for Linux
    • While Zones are faster, KVM can run different kernels (Linux)
    • vs Type 1 hypervisors (Xen):
    • KVM has better perf observability, as it can use the regular OS tools
    • KVM can use OS resource controls, just like any other process
    
slide 60:
    KVM, cont.
    • illumos/SmartOS learned this, Joyent ported KVM!
    • Oracle Solaris doesn't have it yet
    
slide 61:
    What Linux can learn from Solaris performance
    
slide 62:
    What Linux can learn from Solaris performance
    • ZFS
    • Zones
    • STREAMS
    • Symbols
    • prstat -mLc
    • vfsstat
    • DTrace
    • Culture
    • Either learning what to do, or learning what not to do...
    
slide 63:
    ZFS
    • More performance features than you can shake a stick at:
    • Pooled storage, COW, logging (batching writes), ARC,
    variable block sizes, dynamic striping, intelligent prefetch,
    multiple prefetch streams, snapshots, ZIO pipeline,
    compression (lzjb can improve perf by reducing I/O load),
    SLOG, L2ARC, vdev cache, data deduplication (possibly
    better cache reach)
    • The Adaptive Replacement Cache (ARC) can make a big
    difference: it can resist perturbations (backups) and stay warm
    • ZFS I/O throttling (in illumos/SmartOS) throttles disk I/O at the
    VFS layer, to solve cloud noisy neighbor issues
    • ZFS is Mature. Widespread use in critical environments
    
slide 64:
    ZFS, cont.
    • Linux has been learning about ZFS for a while
    • http://zfsonlinux.org/
    • btrfs
    
slide 65:
    Zones
    • Ancestry: chroot -> FreeBSD jails -> Solaris Zones
    • OS Virtualization. Zero I/O path overheads.
    
slide 66:
    Zones, cont.
    • Compare to HW Virtualization:
    • This shows the initial I/O control flow. There are optimizations/
    variants for improving the HW Virt I/O path, esp for Xen.
    
slide 67:
    Zones, cont.
    • Comparing 1 GB instances on Joyent
    • Max network throughput:
    • KVM: 400 Mbits/sec
    • Zones: 4.54 Gbits/sec (over 10x)
    • Max network IOPS:
    • KVM: 18,000 packets/sec
    • Zones: 78,000 packets/sec (over 4x)
    • Numbers go much higher for larger instances
    • http://dtrace.org/blogs/brendan/2013/01/11/virtualization-performance-zones-kvm-xen
    
slide 68:
    Zones, cont.
    • Performance analysis for Zones is also easy. Analyze the
    applications as usual:
    [diagram: a zone runs directly on the kernel, bounded only by
    resource controls; the usual analysis tools see the whole stack,
    from the applications down to metal]
    
slide 69:
    Zones, cont.
    • Compared to HW Virt (KVM):
    [diagram: guest applications and the Linux kernel run above virtual
    devices provided by QEMU on the host kernel; host-side analysis hits
    an observability boundary at the guest, so guest and host views must
    be correlated]
    
slide 70:
    Zones, cont.
    • Linux has been learning: LXC & cgroups, but not widespread
    adoption yet. Docker will likely drive adoption.
    
slide 71:
    STREAMS
    • AT&T modular I/O subsystem
    • Like Unix shell pipes, but for kernel messages. Can push
    modules into the stream to customize processing
    • Introduced (fully) in 8th Edition Research Unix, became SVr4
    STREAMS, and was used by Solaris for the network TCP/IP stack
    • With greater demands for TCP/IP performance, the overheads
    of STREAMS reduced scalability
    • Sun switched high-performing paths to be direct function calls
    
slide 72:
    STREAMS, cont.
    • A cautionary tale: not good for high performance code paths
    
slide 73:
    Symbols
    • Compilers on Linux strip symbols by default, making perf
    profiler output inscrutable without the dbgsym packages
    57.14%  sshd  libc-2.15.so  [.] connect
            --- connect
               |--25.00%-- 0x7ff3c1cddf29
               |--25.00%-- 0x7ff3bfe82761
               |           0x7ff3bfe82b7c        ("What??")
               |--25.00%-- 0x7ff3bfe82dfc
                --25.00%-- [...]
    • Linux compilers also drop frame pointers, making stacks hard
    to profile. Please use -fno-omit-frame-pointer to stop this.
    • as a workaround, perf_events has "-g dwarf" for libunwind
    • Solaris keeps symbols and stacks, and often has CTF too,
    making Mean Time To Flame Graph very fast
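    • The build-side fix is cheap. A stand-in program with the relevant
      gcc flags shown in comments (the flags are standard gcc; the
      program is just something to profile):
        /* Build with symbols and frame pointers kept, eg:
         *   gcc -O2 -g -fno-omit-frame-pointer -o demo demo.c
         * -g keeps debug symbols for profilers; -fno-omit-frame-pointer
         * preserves frame-pointer stack walking, so profilers can get
         * full stacks without the "-g dwarf"/libunwind workaround. */
        #include <stdio.h>

        int
        main(void)
        {
                printf("profile me\n");
                return 0;
        }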
    
slide 74:
    Symbols, cont.
    Flame Graphs need
    symbols and stacks
    
slide 75:
    Symbols, cont.
    • Keep symbols and frame pointers. Faster resolution for
    performance analysis and troubleshooting.
    
slide 76:
    prstat -mLc
    • Per-thread time broken down into states, from a top-like tool:
    $ prstat -mLc 1
       PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
     63037 root      83  16 0.1 0.0 0.0 0.0 0.0 0.5  30 243 45K   0 node/1
     12927 root      14  49 0.0 0.0 0.0 0.0  34 2.9  6K 365 .1M   0 ab/1
     63037 root     0.5 0.6 0.0 0.0 0.0 3.7  95 0.4  1K   0  1K   0 node/2
    [...]
    • These columns narrow an investigation
    immediately, and have been critical for
    solving countless issues. Unsung hero
    of Solaris performance analysis
    • Well suited for the Thread State Analysis
    (TSA) methodology, which I've taught
    in class, and has helped students get
    started and fix unknown perf issues
    • http://www.brendangregg.com/tsamethod.html
    
slide 77:
    prstat -mLc, cont.
    • Linux has various thread states: delay accounting, I/O
    accounting, schedstats. Can they be added to htop?
    See TSA Method for use case and desired metrics.
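    • Some raw material already exists. eg, a sketch reading
      /proc/self/schedstat (needs scheduler stats compiled into the
      kernel; the fields are on-CPU time, run-queue wait time, and
      timeslices), a rough analogue of prstat's LAT column:
        #include <stdio.h>

        int
        main(void)
        {
                unsigned long long oncpu, runqwait, slices;
                FILE *f = fopen("/proc/self/schedstat", "r");
                if (f == NULL) {
                        perror("fopen");
                        return 1;
                }
                /* times are nanoseconds on recent kernels */
                if (fscanf(f, "%llu %llu %llu", &oncpu, &runqwait,
                    &slices) != 3) {
                        fclose(f);
                        return 1;
                }
                fclose(f);
                printf("on-cpu %llu ns, run-queue wait %llu ns, %llu slices\n",
                    oncpu, runqwait, slices);
                return 0;
        }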
    
slide 78:
    vfsstat
    • VFS-level iostat (added to SmartOS, not Solaris):
    $ vfsstat -M 1
       r/s     w/s   Mr/s  Mw/s  ractv  wactv  read_t  writ_t    d/s  del_t  zone
     761.0   267.1   15.4   ...    ...    ...     ...     ...    ...   23.5  5716a5b6
    4076.8  2796.0   41.7   2.3    0.1    ...     ...     ...    ...    0.0  5716a5b6
    4945.1  2807.4  157.1   2.3    0.1    ...     ...     ...    ...    0.0  5716a5b6
    3550.9  1910.4  109.7   1.6    0.4    ...     0.0   112.9    ...    0.0  5716a5b6
    [...]
    (some values lost in extraction)
    • Shows what the applications request from
    the file system, and the true performance
    that they experience
    • iostat includes asynchronous I/O
    • vfsstat sees issues iostat can't:
    • lock contention
    • resource control throttling
    [diagram: vfsstat measures at the VFS layer, just below the syscall
    interface and above the file systems; iostat measures at the block
    device interface, below the volume managers]
    
slide 79:
    vfsstat, cont.
    • Add vfsstat, or VFS metrics to sar.
    
slide 80:
    DTrace
    • Programmable, real-time, dynamic and static
    tracing, for performance analysis and
    troubleshooting, in dev and production
    • Used on Solaris, illumos/SmartOS,
    Mac OS X, FreeBSD, ...
    • Solve virtually any perf issue. eg,
    fix the earlier Perl 15% delta,
    no matter where the problem is.
    Without DTrace's capabilities, you
    may have to wear that 15%.
    • Users can write their own custom DTrace
    one-liners and scripts, or use/modify others
    (eg, mine).
    
slide 81:
    DTrace: illumos Scripts
    • Some of my DTrace scripts from the DTraceToolkit, DTrace book...
    [diagram: scripts mapped onto the system stack]
    Services: cifs*.d, iscsi*.d, nfsv3*.d, nfsv4*.d, ssh*.d, httpd*.d
    Language providers: node*.d, erlang*.d, j*.d, js*.d, php*.d, pl*.d,
        py*.d, rb*.d, sh*.d
    Databases: mysql*.d, postgres*.d, redis*.d, riak*.d
    Applications/libraries: hotuser, umutexmax.d, lib*.d
    Syscalls: opensnoop, statsnoop, errinfo, dtruss, rwtop, rwsnoop,
        mmap.d, kill.d, shellsnoop, zonecalls.d, weblatency.d, fddist
    VFS/file systems: fswho.d, fssnoop.d, sollife.d, solvfssnoop.d,
        dnlcsnoop.d, zfsslower.d, ziowait.d, ziostacks.d, spasync.d,
        metaslab_free.d
    Block device/disk: iosnoop, iotop, disklatency.d, satacmds.d,
        satalatency.d, scsicmds.d, scsilatency.d, sdretry.d, sdqueue.d,
        ide*.d, mpt*.d
    Sockets/TCP/IP: soconnect.d, soaccept.d, soclose.d, socketio.d,
        so1stbyte.d, sotop.d, soerror.d, ipstat.d, ipio.d, ipproto.d,
        ipfbtsnoop.d, ipdropper.d, tcpstat.d, tcpaccept.d, tcpconnect.d,
        tcpioshort.d, tcpio.d, tcpbytes.d, tcpsize.d, tcpnmap.d,
        tcpconnlat.d, tcp1stbyte.d, tcpfbtwatch.d, tcpsnoop.d,
        tcpconnreqmaxq.d, tcprefused.d, tcpretranshosts.d,
        tcpretranssnoop.d, tcpsackretrans.d, tcpslowstart.d,
        tcptimewait.d, udpstat.d, udpio.d, icmpstat.d, icmpsnoop.d
    Ethernet/drivers: macops.d, ixgbecheck.d, ngesnoop.d, ngelink.d
    Scheduler: priclass.d, pridist.d, cv_wakeup_slow.d, displat.d,
        capslat.d
    Virtual memory: minfbypid.d, pgpginbypid.d
    
slide 82:
    DTrace, cont.
    • What Linux needs to learn about DTrace:
    Feature #1 is production safety
    • There should be NO risk of panics or freezes. It should be an
    everyday tool like top(1).
    • Related to production safety is the minimization of overheads,
    which can be done with in-kernel summaries. Some of the
    Linux tools need to learn how to do this, too, as the overheads
    of dump & post-analysis can get too high.
    • Features aren't features if users don't use them
    
slide 83:
    DTrace, cont.
    • Linux might get DTrace-like capabilities via:
    • dtrace4linux
    • perf_events
    • ktap
    • SystemTap
    • LTTng
    • The Linux kernel has the necessary frameworks which are
    sourced by these tools: tracepoints, kprobes, uprobes
    • ... and another thing Linux can learn:
    • DTrace has a memorable unofficial mascot (the ponycorn
    by Deirdré Straughan, using General Zoi's pony creator).
    She's created some for the Linux tools too...
    
slide 84:
    dtrace4linux
    • Two DTrace ports in development for Linux:
    • 1. dtrace4linux
    • https://github.com/dtrace4linux/linux
    • Mostly by Paul Fox
    • Not safe for production use yet;
    I've used it to solve issues by
    first reproducing them in the lab
    • 2. Oracle Enterprise Linux DTrace
    • There has been steady progress: Oracle Linux 6.5 featured
    "full DTrace integration" (Dec 2013)
    
slide 85:
    dtrace4linux: Example
    • Tracing ext4 read/write calls with size distributions (bytes):
        #!/usr/sbin/dtrace -s

        fbt::vfs_read:entry, fbt::vfs_write:entry
        /stringof(((struct file *)arg0)->f_path.dentry->d_sb->s_type->name) == "ext4"/
        {
                @[execname, probefunc + 4] = quantize(arg2);
        }

        dtrace:::END
        {
                printa("\n  %s %s (bytes)%@d", @);
        }

    # ./ext4rwsize.d
    dtrace: script './ext4rwsize.d' matched 3 probes
    CPU     ID                    FUNCTION:NAME
                                           :END
    [...]
      vi read (bytes)
               value  ------------- Distribution ------------- count
                 128 |
                 256 |
                 512 |@@@@@@@
                1024 |@
                2048 |
                4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
                8192 |
    
slide 86:
    dtrace4linux: Example
    • Tracing TCP retransmits (tcpretransmit.d for 3.11.0-17):
        #!/usr/sbin/dtrace -qs

        dtrace:::BEGIN { trace("Tracing TCP retransmits... Ctrl-C to end.\n"); }

        fbt::tcp_retransmit_skb:entry {
                this->so = (struct sock *)arg0;
                this->d = (unsigned char *)&this->so->__sk_common; /* 1st is skc_daddr */
                printf("%Y: retransmit to %d.%d.%d.%d, by:", walltimestamp,
                    this->d[0], this->d[1], this->d[2], this->d[3]);
                stack(99);
        }

    # ./tcpretransmit.d
    Tracing TCP retransmits... Ctrl-C to end.
    1970 Jan 1 12:24:45: retransmit to 50.95.220.155, by:    ("that used to work...")
              kernel`tcp_retransmit_skb
              kernel`dtrace_int3_handler+0xcc
              kernel`dtrace_int3+0x3a
              kernel`tcp_retransmit_skb+0x1
              kernel`tcp_retransmit_timer+0x276
              kernel`tcp_write_timer_handler+0xa0
              kernel`tcp_write_timer+0x6c
              kernel`call_timer_fn+0x36
              kernel`run_timer_softirq+0x1fd
              kernel`__do_softirq+0xf7
              kernel`call_softirq+0x1c
    [...]
    
slide 87:
    perf_events
    • In the Linux tree. perf-tools package. Can do sampling, static
    and dynamic tracing, with stack traces and local variables
    • Often involves an enable -> collect -> dump -> analyze cycle
    • A powerful profiler, loaded with
    features (eg, libunwind stacks!)
    • Isn't programmable, and so has
    limited ability for processing data
    in-kernel. Does counts.
    • You can post-process in user-land, but passing all the event
    data out of the kernel is itself costly; captures can be
    Gbytes of data
    
slide 88:
    perf_events: Example
    • Dynamic tracing of tcp_sendmsg() with size:
    # perf probe --add 'tcp_sendmsg size'
    [...]
    # perf record -e probe:tcp_sendmsg -a
    ^C[ perf record: Woken up 1 times to write data ]
    [ perf record: Captured and wrote 0.052 MB perf.data (~2252 samples) ]
    # perf script
    # ========
    # captured on: Fri Jan 31 23:49:55 2014
    # hostname : dev1
    # os release : 3.13.1-ubuntu-12-opt
    [...]
    # ========
    sshd  1301 [001]  502.424719: probe:tcp_sendmsg: (ffffffff81505d80) size=b0
    sshd  1301 [001]  502.424814: probe:tcp_sendmsg: (ffffffff81505d80) size=40
    sshd  2371 [000]  502.952590: probe:tcp_sendmsg: (ffffffff81505d80) size=27
    sshd  2372 [000]  503.025023: probe:tcp_sendmsg: (ffffffff81505d80) size=3c0
    sshd  2372 [001]  503.203776: probe:tcp_sendmsg: (ffffffff81505d80) size=98
    sshd  2372 [001]  503.281312: probe:tcp_sendmsg: (ffffffff81505d80) size=2d0
    [...]
    
slide 89:
    ktap
    • A new static/dynamic tracing tool for Linux
    • Lightweight, simple, based on Lua. Uses bytecode for
    programmable and safe tracing
    • Suitable for use on embedded Linux
    • http://www.ktap.org
    • Features are limited (still in
    development), but I've been
    impressed so far
    • In development, so I can't recommend
    production use yet
    
slide 90:
    ktap: Example
    • Summarize read() syscalls by return value (size/err):
        # ktap -e 's = {}; trace syscalls:sys_exit_read { s[arg2] += 1 }
                   trace_end { histogram(s); }'
                 value  ------------- Distribution ------------- count
                   -11 |@@@@@@@@@@@@@@@@@@@@@@@@
                    18 |@@@@@@
                    72 |@@
                  1024 |@
                     0 |
                     2 |
                   446 |
                   515 |
                    48 |
        (a histogram of a key/value table)
    • Write scripts (excerpt from syslatl.kp, highlighting time delta):
        trace syscalls:sys_exit_* {
                if (self[tid()] == nil) { return }
                delta = (gettimeofday_us() - self[tid()]) / (step * 1000)
                if (delta > max) { max = delta }
                lats[delta] += 1
                self[tid()] = nil
        }
    
slide 91:
    ktap: Setup
    • Installing on Ubuntu (~5 minutes):
    # apt-get install git gcc make
    # git clone https://github.com/ktap/ktap
    # cd ktap
    # make
    # make install
    # make load
    • Example dynamic tracing of tcp_sendmsg() and stacks:
    # ktap -e 's = ptable(); trace probe:tcp_sendmsg { s[backtrace(12, -1)] 
slide 92:
    SystemTap
    • Sampling, static and dynamic tracing, fully programmable
    • The most featured of all the tools. Does some things that
    DTrace can't (eg, loops).
    • http://sourceware.org/systemtap
    • Has its own tracing language,
    which is compiled (gcc) into
    kernel modules (slow; safe?)
    • I used it a lot in 2011, and had
    problems with panics/freezes;
    never felt safe to run it on my
    customer's production systems
    • Needs vmlinux/debuginfo
    
slide 93:
    SystemTap: Setup
    • Setting up SystemTap on Ubuntu (2014):
        # ./opensnoop.stp
        semantic error: while resolving probe point: identifier 'syscall' at ./opensnoop.stp:11:7
        source: probe syscall.open
        semantic error: no match
        Pass 2: analysis failed. [man error::pass2]
        Tip: /usr/share/doc/systemtap/README.Debian should help you get started.
        # more /usr/share/doc/systemtap/README.Debian
        [...]
        supported yet, see Debian bug #691167). To use systemtap you need to
        manually install the linux-image-*-dbg and linux-header-* packages
        that match your running kernel. To simplify this task you can use the
        stap-prep command. Please always run this before reporting a bug.
        # stap-prep
        You need package linux-image-3.11.0-17-generic-dbgsym but it does not seem
        to be available
        Ubuntu -dbgsym packages are typically in a separate repository
        Follow https://wiki.ubuntu.com/DebuggingProgramCrash to add this repository
    ("helpful tips...")
    
slide 94:
    SystemTap: Setup, cont.
    • After following Ubuntu's DebuggingProgramCrash site:
        # apt-get install linux-image-3.11.0-17-generic-dbgsym
        Reading package lists... Done
        Building dependency tree
        Reading state information... Done
        The following NEW packages will be installed:
          linux-image-3.11.0-17-generic-dbgsym
        0 upgraded, 1 newly installed, 0 to remove and 0 not upgraded.
        Need to get 834 MB of archives.
        After this operation, 2,712 MB of additional disk space will be used.
        Get:1 http://ddebs.ubuntu.com/ saucy-updates/main linux-image-3.11.0-17-generic-dbgsym amd64 3.11.0-17.31 [834 MB]
        0% [1 linux-image-3.11.0-17-generic-dbgsym 1,581 kB/834 MB 0%]  215 kB/s  1h 4min 37s
    ("but my perf issue is happening now...")
    • In fairness:
    • 1. The Red Hat SystemTap developers' primary focus is to
      get it working on Red Hat (where they say it works fine)
    • 2. Lack of CTF isn't SystemTap's fault, as said earlier
    
slide 95:
    SystemTap: Example
    • opensnoop.stp:
        #!/usr/bin/stap

        probe begin
        {
                printf("\n%6s %6s %16s %s\n", "UID", "PID", "COMM", "PATH");
        }

        probe syscall.open
        {
                printf("%6d %6d %16s %s\n", uid(), pid(), execname(), filename);
        }
    • Output:
        # ./opensnoop.stp
           UID    PID             COMM PATH
             0  11108             sshd <...>
             0  11108             sshd <...>
             0  11108             sshd /lib/x86_64-linux-gnu/libwrap.so.0
             0  11108             sshd /lib/x86_64-linux-gnu/libpam.so.0
             0  11108             sshd /lib/x86_64-linux-gnu/libselinux.so.1
             0  11108             sshd /usr/lib/x86_64-linux-gnu/libck-connector.so.0
        [...]
    
slide 96:
    LTTng
    • Profiling, static and dynamic tracing
    • Based on Linux Trace Toolkit (LTT), which dabbled with
    dynamic tracing (DProbes) in 2001
    • Involves an enable -> start -> stop -> view cycle
    • Designed to be highly efficient
    • I haven't used it properly yet,
    so I don't have an informed
    opinion (sorry LTTng, not
    your fault)
    
slide 97:
    LTTng, cont.
    • Example sequence:
    # lttng create session1
    # lttng enable-event sched_process_exec -k
    # lttng start
    # lttng stop
    # lttng view
    # lttng destroy session1
    
slide 98:
    DTrace, cont.
    • 2014 is an exciting year for dynamic tracing and Linux –
    one of these may reach maturity and win!
    
slide 99:
    DTrace, final word
    • What Oracle Solaris can learn from dtrace4linux:
    • Dynamic tracing is crippled without source code
    • Oracle could give customers scripts to run, but customers
    lose any practical chance of writing them themselves
    • If the dtrace4linux port is completed, it will be more useful
      than Oracle Solaris DTrace (unless they open source it again)
    
slide 100:
    Culture
    • Sun Microsystems, out of necessity, developed a performance
    engineering culture that had an appetite for understanding and
    measuring the system: data-driven analysis
    • If your several-million-dollar Ultra Enterprise 10000
    doesn’t perform well and your company is losing nontrivial sums of money every minute because of it, you call
    Sun Service and start demanding answers.
    – System Performance Tuning [Musumeci 02]
    • Includes the diagnostic cycle:
    • hypothesis -> instrumentation -> data -> hypothesis
    • Some areas of Linux are already learning this (and some
    areas of Solaris never did)
    
slide 101:
    Culture, cont.
    • Linux perf issues are often debugged using only top(1), *stat,
    sar(1), strace(1), and tcpdump(8). These leave many areas
    not measured.
    [diagram: the kernel drawn as just three neat layers, a "top layer",
    an "strace layer", and a "tcpdump layer": if only it were this
    simple...]
    • What about the other tools and metrics that are part of Linux?
    perf_events, tracepoints/kprobes/uprobes, schedstats, I/O
    accounting, blktrace, etc.
    
slide 102:
    Culture, cont.
    • Understand the system, and measure if at all possible
    • Hypothesis -> instrumentation -> data -> hypothesis
    • Use perf_events (and others once they are stable/safe)
    • strace(1) is intermediate, not advanced
    • High performance doesn't just mean hardware, system, and
    config. It foremost means analysis of performance limiters.
    
slide 103:
    What Both can Learn
    
slide 104:
    What Both can Learn
    • Get better at benchmarking
    
slide 105:
    Benchmarking
    • How Linux vs Solaris performance is often compared
    • Results incorrect or misleading almost 100% of the time
    • Get reliable benchmark results by active benchmarking:
    • Analyze performance of all components during the
    benchmark, to identify limiters
    • Takes time: benchmarks may take weeks, not days
    • Opposite of passive benchmarking: fire and forget
    • If SmartOS loses a benchmark, my management demands
    answers, and I almost always overturn the result with analysis
    • Test variance as well as steady workloads. Measure jitter.
    These differ between systems as well.
    
slide 106:
    Results
    
slide 107:
    Out-of-the-Box
    • Out-of-the-box, which is faster, Linux or Solaris?
    
slide 108:
    Out-of-the-Box
    • Out-of-the-box, which is faster, Linux or Solaris?
    • There are many differences, it's basically a crapshoot.
    I've seen everything from 5% to 5x differences either way
    • It really depends on workload and platform characteristics:
    • CPU usage, FS I/O, TCP I/O, TCP connect rates, default
    TCP tunables, synchronous writes, lock calls, library calls,
    multithreading, multiprocessor, network driver support, ...
    • From prior Linux vs SmartOS comparisons, it's hard to pick a
    winner, but in general my expectations are:
    • SmartOS: for heavy file system or network I/O workloads
    • Linux: for CPU-bound workloads
    
slide 109:
    Out-of-the-Box Relevance
    • Out-of-the-box performance isn't that interesting: if you care
    about performance, why not do some analysis and tuning?
    
slide 110:
    In Practice
    • With some analysis and tuning, which is faster?
    
slide 111:
    In Practice
    • With some analysis and tuning, which is faster?
    • Depends on the workload, and which differences matter to you
    • With analysis, I can usually make SmartOS beat Linux
    • DTrace and microstate accounting give me a big
    advantage: I can analyze and fix all the small differences
    (which sometimes exist as apps are developed for Linux)
    • Although, perf/ktap/... are catching up
    • I can do the same and make Linux beat SmartOS, but it's
    much more time-consuming without an equivalent DTrace
    • On the same hardware, it's more about the performance
    engineer than the kernel. DTrace doesn't run itself.
    
slide 112:
    At Joyent
    • Joyent builds high performance SmartOS instances that
    frequently beat Linux, but that's due to more than just the OS.
    We use:
    • Config: OS virtualization (Zones), all instances on ZFS
    • Hardware: 10 GbE networks, plenty of DRAM for the ARC
    • Analysis: DTrace to quickly root-cause issues and tune
    • With SmartOS (ZFS, DTrace, Zones, KVM) configured for
    performance, and with some analysis, I expect to win most
    head-to-head performance comparisons
    
slide 113:
    Learning From Linux
    • Joyent has also been learning from Linux to improve SmartOS.
    • Package repos: dedicated staff
    • Community: dedicated staff
    • Compiler options: dedicated repo staff
    • KVM: ported!
    • TCP TIME_WAIT: fixed localhost; more fixes to come
    • sar: fix understood
    • More to do...
    
slide 114:
    References
    • General Zoi's Awesome Pony Creator:
      http://generalzoi.deviantart.com/art/Pony-Creator-v3-397808116
    • More perf examples: http://www.brendangregg.com/perf.html
    • More ktap examples: http://www.brendangregg.com/ktap.html
    • http://www.ktap.org/doc/tutorial.html
    • Flame Graphs: http://www.brendangregg.com/flamegraphs.html
    • Diagnostic cycle: http://dtrace.org/resources/bmc/cec_analytics.pdf
    • http://www.brendangregg.com/activebenchmarking.html
    • https://blogs.oracle.com/OTNGarage/entry/doing_more_with_dtrace_on
    
slide 115:
    Thank You
    • More info:
    • illumos: http://illumos.org
    • SmartOS: http://smartos.org
    • DTrace: http://dtrace.org
    • Joyent: http://joyent.com
    • Systems Performance book:
    http://www.brendangregg.com/sysperf.html
    • Me: http://www.brendangregg.com, http://dtrace.org/blogs/brendan,
      @brendangregg, [email protected]