These are rough notes.
I run the Slack messaging application on Ubuntu Linux, and it recently started mysteriously crashing. I'd Alt-Tab and find it was no longer there. No error message, no dialog, just gone. It usually happened when locking and unlocking the screen. A quick internet search revealed nothing.
These are rough notes for how I debugged it, in case it's useful for someone searching on this topic. I spend many hours documenting advanced debugging stories for books, talks, and blog posts, but many things I never have time to share. I'm experimenting with this "rough notes" format as a way to quickly share things. No editing, spell checking, or comments. Mostly screenshots. Dead ends included.
Note that I don't know anything about Slack internals, and there may be better ways to solve this.
1. Enabling core dumps
I'm guessing it's core dumping and Ubuntu's apport is eating them. Redirecting them to the file system so I can then do core dump analysis using gdb(1), as root:
# cat /proc/sys/kernel/core_pattern |/usr/share/apport/apport %p %s %c %d %P # mkdir /var/cores # echo "/var/cores/core.%e.%p.%h.%t" > /proc/sys/kernel/core_pattern [...another crash...] # ls /var/cores #
This didn't work: No core file showed up. I may need to increase the core file size ulimits for Slack, but that might mean mucking around with its startup scripts; I'll try some other tracing first.
2. exitsnoop
Using an eBPF/bcc tool to look for exit reasons:
# exitsnoop -t TIME-AEST PCOMM PID PPID TID AGE(s) EXIT_CODE 13:51:19.432 kworker/dying 3663305 2 3663305 1241.59 0 13:51:30.948 kworker/dying 3663626 2 3663626 835.76 0 13:51:33.296 systemd-udevd 3664149 2054939 3664149 3.55 0 13:53:09.256 kworker/dying 3662179 2 3662179 2681.15 0 13:53:25.636 kworker/dying 3663520 2 3663520 1122.60 0 13:53:30.705 grep 3664239 6009 3664239 0.08 0 13:53:30.705 ps 3664238 6009 3664238 0.08 0 13:53:40.297 slack 3663135 1786 3663135 1459.54 signal 6 (ABRT) 13:53:40.298 slack 3663208 3663140 3663208 1457.86 0 13:53:40.302 slack 3663140 1786 3663140 1459.18 0 13:53:40.302 slack 3663139 1786 3663139 1459.18 0 13:53:40.303 slack 3663171 1786 3663171 1458.22 0 13:53:40.317 slack 3663197 1786 3663197 1458.03 0 13:53:44.827 gdm-session-wor 3664269 1778 3664269 0.02 0 [...]
This traced a Slack SIGABRT which happened around the same time as a crash. A strong lead.
3. killsnoop
Running killsnoop (eBPF/bcc) to get more info:
# killsnoop TIME PID COMM SIG TPID RESULT 13:45:01 2053366 systemd-journal 0 1024 0 13:45:01 2053366 systemd-journal 0 3663525 -3 13:45:01 2053366 systemd-journal 0 3663528 -3 13:49:00 2054939 systemd-udevd 15 3664053 0 13:51:33 2054939 systemd-udevd 15 3664149 0 13:53:44 2053366 systemd-journal 0 4265 -1 13:53:44 2053366 systemd-journal 0 972 0 13:53:44 2053366 systemd-journal 0 1778 0 13:53:44 2053366 systemd-journal 0 6414 -1 [...]
A crash happened, but killsnoop(8) didn't see it. A quick look at the killsnoop(8) source reminded me that I wrote it back in 2015, which is practically ancient in eBPF years. Back then there wasn't tracepoint support yet so I was using kprobes for everything. Kprobes aren't a stable interface, which might be the problem.
4. perf trace
Nowadays this can be done as a perf one-liner:
# perf list syscalls:sys_enter_*kill List of pre-defined events (to be used in -e): syscalls:sys_enter_kill [Tracepoint event] syscalls:sys_enter_tgkill [Tracepoint event] syscalls:sys_enter_tkill [Tracepoint event] # perf trace -e 'syscalls:sys_enter_*kill' 15755.483 slack/3684015 syscalls:sys_enter_tgkill(tgid: 3684015 (slack), pid: 3684015 (slack), sig: ABRT)
Ok, so there's our slack SIGABRT, sent via tgkill(2). (And I filed an issue to update bcc killsnoop(8) to use tracepoints.) This output doesn't really tell me much more about it though. I want to see a stack trace. I can use perf record or bpftrace...and that reminds me, didn't I write another signal tool using bpftrace?
5. signals.bt
The signals.bt bpftrace tool from my BPF book traces the signal:signal_generate tracepoint, which should catch every type of generated signal, including tgkill(2). Trying it out:
# bpftrace /home/bgregg/Git/bpf-perf-tools-book/originals/Ch13_Applications/signals.bt Attaching 3 probes... Counting signals. Hit Ctrl-C to end. ^C @[SIGNAL, PID, COMM] = COUNT @[SIGPIPE, 1883, Xorg]: 1 @[SIGCHLD, 1797, dbus-daemon]: 1 @[SIGINT, 3665167, bpftrace]: 1 @[SIGTERM, 3665198, gdm-session-wor]: 1 @[SIGCHLD, 3665197, gdm-session-wor]: 1 @[SIGABRT, 3664940, slack]: 1 @[SIGTERM, 3665197, gdm-session-wor]: 1 @[SIGKILL, 3665207, dbus-daemon]: 1 @[SIGWINCH, 859450, bash]: 2 @[SIGCHLD, 1778, gdm-session-wor]: 2 @[, 3665201, gdbus]: 2 @[, 3665199, gmain]: 2 @[SIGWINCH, 3665167, bpftrace]: 2 @[SIGWINCH, 3663319, vi]: 2 @[SIGCHLD, 1786, systemd]: 6 @[SIGALRM, 1883, Xorg]: 106
Ok, there's the SIGABRT for slack. (There's a new sigsnoop(8) tool for bcc that uses this tracepoint as well.)
6. Signal Stacks
Moving from signals.bt to a bpftrace one-liner:
# bpftrace -e 't:signal:signal_generate /comm == "slack"/ { printf("%d, %s%s\n", args->sig, kstack, ustack); }' Attaching 1 probe... 6, __send_signal+579 __send_signal+579 send_signal+221 do_send_sig_info+81 do_send_specific+110 do_tkill+171 __x64_sys_tgkill+34 do_syscall_64+73 entry_SYSCALL_64_after_hwframe+68 0x7f4a2e2e2f95
This was supposed to print both the kernel and user stack traces that led to the SIGABRT, but the user stack is broken, showing 0x7f4a2e2e2f95 only. Grumble. There's ways to fix this, but it usually gets time consuming, so let me try something else first. Logs!
7. Logs
Does Slack have logs? I have no idea. Maybe they contain the error message.
# lsof -p `pgrep -n slack` | grep -i log lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/1000/gvfs Output information may be incomplete. lsof: WARNING: can't stat() fuse file system /run/user/1000/doc Output information may be incomplete.
Ignore the lsof(8) warnings. There's no output here, nothing containing "log". Although I'm looking at the most recent process called "slack" and maybe that's the wrong one.
# pstree -ps `pgrep -n slack` systemd(1)───systemd(1786)───slack(3666477)───slack(3666481)───slack(3666548)─┬─{slack}(3666549) ├─{slack}(3666551) ├─{slack}(3666552) ├─{slack}(3666553) ├─{slack}(3666554) ├─{slack}(3666555) ├─{slack}(3666556) ├─{slack}(3666557) ├─{slack}(3666558) ├─{slack}(3666559) ├─{slack}(3666560) ├─{slack}(3666564) ├─{slack}(3666566) ├─{slack}(3666568) └─{slack}(3666609)
Ok, how about I try the great-grandparent, PID 3666477:
# lsof -p 3666477 | grep -i log lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/1000/gvfs Output information may be incomplete. lsof: WARNING: can't stat() fuse file system /run/user/1000/doc Output information may be incomplete. slack 3666477 bgregg 37r REG 253,1 32768 140468 /home/bgregg/.local/share/gvfs-metadata/home-8fd8d123.log slack 3666477 bgregg 40r REG 253,1 32768 131314 /home/bgregg/.local/share/gvfs-metadata/trash:-85854456.log slack 3666477 bgregg 71w REG 253,1 15566 1707316 /home/bgregg/.config/Slack/Local Storage/leveldb/013430.log slack 3666477 bgregg 72w REG 253,1 347 1704816 /home/bgregg/.config/Slack/Local Storage/leveldb/LOG slack 3666477 bgregg 73w REG 253,1 2324236 1718407 /home/bgregg/.config/Slack/logs/browser.log slack 3666477 bgregg 90w REG 253,1 363600 1713625 /home/bgregg/.config/Slack/Service Worker/Database/000004.log slack 3666477 bgregg 92w REG 253,1 274 1704249 /home/bgregg/.config/Slack/Service Worker/Database/LOG slack 3666477 bgregg 108w REG 253,1 4182513 1749672 /home/bgregg/.config/Slack/logs/webapp-service-worker-console.log slack 3666477 bgregg 116w REG 253,1 259 1704369 /home/bgregg/.config/Slack/Session Storage/LOG slack 3666477 bgregg 122w REG 253,1 31536 1749325 /home/bgregg/.config/Slack/Session Storage/000036.log slack 3666477 bgregg 126w REG 253,1 3970909 1704566 /home/bgregg/.config/Slack/logs/webapp-console.log slack 3666477 bgregg 127w REG 253,1 2330006 1748923 /home/bgregg/.config/Slack/IndexedDB/https_app.slack.com_0.indexeddb.leveldb/023732.log slack 3666477 bgregg 131w REG 253,1 330 1704230 /home/bgregg/.config/Slack/IndexedDB/https_app.slack.com_0.indexeddb.leveldb/LOG slack 3666477 bgregg 640r REG 253,1 32768 140378 /home/bgregg/.local/share/gvfs-metadata/root-7d269acf.log (deleted)
Lots of logs in ~/config/Slack/logs!
# cd ~/.config/Slack/logs # ls -lrth total 26M -rw-rw-r-- 1 bgregg bgregg 5.0M Aug 20 07:54 webapp-service-worker-console1.log -rw-rw-r-- 1 bgregg bgregg 5.1M Aug 23 19:30 webapp-console2.log -rw-rw-r-- 1 bgregg bgregg 5.1M Aug 25 16:07 webapp-console1.log drwxrwxr-x 2 bgregg bgregg 4.0K Aug 27 14:34 recorded-trace/ -rw-rw-r-- 1 bgregg bgregg 4.0M Aug 27 14:46 webapp-service-worker-console.log -rw-rw-r-- 1 bgregg bgregg 2.3M Aug 27 14:46 browser.log -rw-rw-r-- 1 bgregg bgregg 3.9M Aug 27 14:46 webapp-console.log
Ok, so how about this one:
# cat webapp-console.log [...] [08/27/21, 14:46:36:238] info: [API-Q] (TKZ41AXQD) 614b3789-1630039595.801 dnd.teamInfo is RESOLVED [08/27/21, 14:46:36:240] info: [API-Q] (TKZ41AXQD) 614b3789-1630039595.930 users.interactions.list is RESOLVED [08/27/21, 14:46:36:242] info: [DND] (TKZ41AXQD) Fetched DND info for the following member: ULG5H012L [08/27/21, 14:46:36:251] info: [DND] (TKZ41AXQD) Checking for changes in DND status for the following members: ULG5H012L [08/27/21, 14:46:36:254] info: [DND] (TKZ41AXQD) Will check for changes in DND status again in 5 minutes [08/27/21, 14:46:36:313] info: [DND] (TKZ41AXQD) Fetched DND info for the following members: UL0US3455 [08/27/21, 14:46:36:313] info: [DND] (TKZ41AXQD) Checking for changes in DND status for the following members: ULG5H012L,UL0US3455 [08/27/21, 14:46:36:314] info: [DND] (TKZ41AXQD) Will check for changes in DND status again in 5 minutes [08/27/21, 14:46:37:337] info: [FOCUS-EVENT] Client window blurred [08/27/21, 14:46:40:022] info: [RTM] (T029N2L97) Processed 1 user_typing event(s) in channel(s) C0S9267BE over 0.10ms [08/27/21, 14:46:40:594] info: [RTM] (T029N2L97) Processed 1 message:message_replied event(s) in channel(s) C0S9267BE over 2.60ms [08/27/21, 14:46:40:595] info: [RTM] Setting a timeout of 37 ms to process more rtm events [08/27/21, 14:46:40:633] info: [RTM] Waited 37 ms, processing more rtm events now [08/27/21, 14:46:40:653] info: [RTM] (T029N2L97) Processed 1 message event(s) in channel(s) C0S9267BE over 18.60ms [08/27/21, 14:46:44:938] info: [RTM] (T029N2L97) Processed 1 user_typing event(s) in channel(s) C0S9267BE over 0.00ms
No, I don't see any errors jumping out at me. How about searching for errors:
# egrep -i 'error|fail' webapp-console.log [08/25/21, 16:07:13:051] info: [DESKTOP-SIDE-EFFECT] (TKZ41AXQD) Reacting to {"type":"[39] Set a value that represents whether we are currently in the boot phase","payload":false,"error":false} [08/25/21, 16:07:13:651] info: [DESKTOP-SIDE-EFFECT] (T7GLTMS0P) Reacting to {"type":"[39] Set a value that represents whether we are currently in the boot phase","payload":false,"error":false} [08/25/21, 16:07:14:249] info: [DESKTOP-SIDE-EFFECT] (T0DS04W11) Reacting to {"type":"[39] Set a value that represents whether we are currently in the boot phase","payload":false,"error":false} [08/25/21, 16:07:14:646] info: [DESKTOP-SIDE-EFFECT] (T0375HBGA) Reacting to {"type":"[39] Set a value that represents whether we are currently in the boot phase","payload":false,"error":false} [...] # egrep -i 'error|fail' browser.log [07/16/21, 08:18:27:621] error: Cannot override webPreferences key(s): webviewTag, nativeWindowOpen, nodeIntegration, nodeIntegrationInWorker, nodeIntegrationInSubFrames, enableRemoteModule, contextIsolation, sandbox [07/16/21, 08:18:27:653] error: Failed to load empty window url in window "error": { "stack": "Error: ERR_ABORTED (-3) loading 'about:blank'\n at rejectAndCleanup (electron/js2c/browser_init.js:217:1457)\n at Object.navigationListener (electron/js2c/browser_init.js:217:1763)\n at Object.emit (events.js:315:20)\n at Object.EventEmitter.emit (domain.js:467:12)", [07/16/21, 08:18:31:355] error: Cannot override webPreferences key(s): webviewTag, nativeWindowOpen, nodeIntegration, nodeIntegrationInWorker, nodeIntegrationInSubFrames, enableRemoteModule, contextIsolation, sandbox [07/16/21, 08:18:31:419] error: Cannot override webPreferences key(s): webviewTag, nativeWindowOpen, nodeIntegration, nodeIntegrationInWorker, nodeIntegrationInSubFrames, enableRemoteModule, contextIsolation, sandbox [07/24/21, 09:00:52:252] error: Failed to load calls-desktop-interop.WindowBorderPanel "error": { "stack": "Error: Module did not self-register: '/snap/slack/42/usr/lib/slack/resources/app.asar.unpacked/node_modules/@tinyspeck/calls-desktop-interop/build/Release/CallsDesktopInterop.node'.\n at process.func [as dlopen] (electron/js2c/asar_bundle.js:5:1846)\n at Object.Module._extensions..node (internal/modules/cjs/loader.js:1138:18)\n at Object.func [as .node] (electron/js2c/asar_bundle.js:5:2073)\n at Module.load (internal/modules/cjs/loader.js:935:32)\n at Module._load (internal/modules/cjs/loader.js:776:14)\n at Function.f._load (electron/js2c/asar_bundle.js:5:12684)\n at Module.require (internal/modules/cjs/loader.js:959:19)\n at require (internal/modules/cjs/helpers.js:88:18)\n at bindings (/snap/slack/42/usr/lib/slack/resources/app.asar/node_modules/bindings/bindings.js:112:48)\n at Object.(/snap/slack/42/usr/lib/slack/resources/app.asar/node_modules/@tinyspeck/calls-desktop-interop/lib/index.js:1:34)", [07/24/21, 09:00:52:260] warn: Failed to install protocol handler for slack:// links [07/24/21, 09:00:52:440] error: Cannot override webPreferences key(s): webviewTag [...]
I browsed the logs for a while but didn't see a smoking gun. Surely it spits out some error message when crashing, like to STDERR...
8. STDERR Tracing
Where is STDERR written?
# lsof -p 3666477 [...] slack 3666477 bgregg mem REG 7,16 141930 7165 /snap/slack/44/usr/lib/slack/chrome_100_percent.pak slack 3666477 bgregg mem REG 7,16 165680 7433 /snap/slack/44/usr/lib/slack/v8_context_snapshot.bin slack 3666477 bgregg 0r CHR 1,3 0t0 6 /dev/null slack 3666477 bgregg 1w CHR 1,3 0t0 6 /dev/null slack 3666477 bgregg 2w CHR 1,3 0t0 6 /dev/null slack 3666477 bgregg 3r FIFO 0,12 0t0 29532192 pipe slack 3666477 bgregg 4u unix 0x00000000134e3c45 0t0 29526717 type=SEQPACKET slack 3666477 bgregg 5r REG 7,16 10413488 7167 /snap/slack/44/usr/lib/slack/icudtl.dat [...]
/dev/null? Like that's going to stop me. I could trace writes to STDERR, but I think my old shellsnoop(8) tool (another from eBPF/bcc) already does that:
# shellsnoop 3666477 [...] [08/27/21, 14:46:36:314] info: [DND] (TKZ41AXQD) Will check for changes in DND status again in 5 minutes [08/27/21, 14:46:37:337] info: [FOCUS-EVENT] Client window blurred [08/27/21, 14:46:40:022] info: [RTM] (T029N2L97) Processed 1 user_typing event(s) in channel(s) C0S928EBE over 0.10ms [08/27/21, 14:46:40:594] info: [RTM] (T029N2L97) Processed 1 message:message_replied event(s) in channel(s) C0S928EBE over 2.60ms [08/27/21, 14:46:40:595] info: [RTM] Setting a timeout of 37 ms to process more rtm events [08/27/21, 14:46:40:633] info: [RTM] Waited 37 ms, processing more rtm events now [08/27/21, 14:46:40:653] info: [RTM] (T029N2L97) Processed 1 message event(s) in channel(s) C0S928EBE over 18.60ms [08/27/21, 14:46:44:938] info: [RTM] (T029N2L97) Processed 1 user_typing event(s) in channel(s) C0S928EBE over 0.00ms (slack:3666477): Gtk-WARNING **: 14:46:45.525: Could not load a pixbuf from icon theme. This may indicate that pixbuf loaders or the mime database could not be found. ** Gtk:ERROR:../../../../gtk/gtkiconhelper.c:494:ensure_surface_for_gicon: assertion failed (error == NULL): Failed to load /usr/share/icons/Yaru/16x16/status/image-missing.png: Unable to load image-loading module: /snap/slack/42/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so: /snap/slack/42/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so: cannot open shared object file: No such file or directory (gdk-pixbuf-error-quark, 5)
Ah-ha! The last message printed is an error about a .png file and a .so file. As it's Slack's final mesage before crashing, this is a smoking gun.
Note that this was not in any log!:
# grep image-missing.png * grep: recorded-trace: Is a directory
It's the .so file that is missing, not the .png:
# ls -lh /usr/share/icons/Yaru/16x16/status/image-missing.png -rw-r--r-- 1 root root 535 Nov 6 2020 /usr/share/icons/Yaru/16x16/status/image-missing.png # ls -lh /snap/slack/42/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so ls: cannot access '/snap/slack/42/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so': No such file or directory
But there is a .so file with a similar path:
# ls -lh /snap/slack/ total 0 drwxrwxr-x 8 root root 123 Jul 14 02:49 43/ drwxrwxr-x 8 root root 123 Aug 18 10:27 44/ lrwxrwxrwx 1 root root 2 Aug 24 09:48 current -> 44/ # ls -lh /snap/slack/44/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so -rw-r--r-- 1 root root 27K Aug 18 10:27 /snap/slack/44/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so
Hmm, I wonder...
9. Workaround
This is obviously a hack and is not guaranteed to be safe:
# cd /snap/slack # ln -s current 42 # ls -lh total 0 lrwxrwxrwx 1 root root 7 Aug 27 15:01 42 -> current/ drwxrwxr-x 8 root root 123 Jul 14 02:49 43/ drwxrwxr-x 8 root root 123 Aug 18 10:27 44/ lrwxrwxrwx 1 root root 2 Aug 24 09:48 current -> 44/ # ls -lh /snap/slack/42/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so -rw-r--r-- 1 root root 27K Aug 18 10:27 /snap/slack/42/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so
I don't know why Slack was looking up this library via the old directory version, but linking the new version to the old path did the trick. Slack has stopped crashing!
I'm guessing this is a problem with how the snap is built. Needs more debugging.
10. Other debugging
In case you're wondering what I'd do if I didn't find the error in STDERR, I'd go back to setting ulimits to see if I could get a core dump, and if that still didn't work, I'd try to run Slack from a gdb(1) session. I'd also work on fixing the user stack trace and symbols to see what that revealed.
11. Bonus: opensnoop
I often wonder how I could have debugged things sooner, and I'm kicking myself I didn't run opensnoop(8) as I usually do. Tracing just failed opens:
# opensnoop -Tx TIME(s) PID COMM FD ERR PATH [...] 11.412358000 3673057 slack -1 2 /var/lib/snapd/desktop/mime/subclasses 11.412360000 3673057 slack -1 2 /var/lib/snapd/desktop/mime/icons 11.412363000 3673057 slack -1 2 /var/lib/snapd/desktop/mime/generic-icons 11.412495000 3673057 slack -1 2 /snap/slack/42/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so 11.412527000 3673057 slack -1 2 /usr/share/locale/en_AU/LC_MESSAGES/gdk-pixbuf.mo 11.412537000 3673057 slack -1 2 /usr/share/locale/en/LC_MESSAGES/gdk-pixbuf.mo 11.412559000 3673057 slack -1 2 /usr/share/locale-langpack/en/LC_MESSAGES/gdk-pixbuf.mo 11.412916000 3673057 slack -1 2 /snap/slack/42/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so 11.425405000 1786 systemd -1 2 /sys/fs/cgroup/memory/user.slice/user-1000.slice/[email protected]/snap.slack.slack.402dde03-7f71-48a0-98a5-33fd695ccbde.scope/memory.events
That shows its last failed open was to the .so file. Which would have been a good lead. But the best clue was the secret STDERR messages Slack sends to /dev/null, rescued using shellsnoop(8).