SEGV/GPF with MALLOC_PERTURB_ in all processes that include gdbus since ~2.76, causes full-desktop crashes
For the past 3-4 months, random programs, including my entire desktop session (xsession), have been crashing every few days. Further investigation points to glib2 as the culprit, specifically its "gdbus" thread, in combination with my use of MALLOC_PERTURB_. This is a glibc environment variable that makes malloc fill memory with a byte pattern when it is allocated and again when it is freed, which helps expose memory bugs such as this one. Note that the variable name ends with an underscore.
# journalctl -k -g "general protection"
Nov 09 23:27:36 $HOST kernel: traps: gdbus[2506] general protection fault ip:7f2a84deafd9 sp:7f2a821fc930 error:0 in libglib-2.0.so.0.7800.1[7f2a84dd1000+99000]
Nov 10 00:53:17 $HOST kernel: traps: gdbus[241659] general protection fault ip:7fa1fb9a9fd9 sp:7fa1f8dfc930 error:0 in libglib-2.0.so.0.7800.1[7fa1fb990000+99000]
Nov 10 00:53:39 $HOST kernel: traps: gdbus[358338] general protection fault ip:7f35df080fd9 sp:7f35d9dfc8f0 error:0 in libglib-2.0.so.0.7800.1[7f35df067000+99000]
Nov 10 01:41:29 $HOST kernel: traps: xfce4-terminal[358590] general protection fault ip:7f884a800fd9 sp:7ffc89a01b10 error:0 in libglib-2.0.so.0.7800.1[7f884a7e7000+99000]
Nov 10 02:01:38 $HOST kernel: traps: xfce4-terminal[449960] general protection fault ip:7f604b8e9fd9 sp:7fff362ad1d0 error:0 in libglib-2.0.so.0.7800.1[7f604b8d0000+99000]
Nov 10 06:38:40 $HOST kernel: traps: gdbus[358509] general protection fault ip:7f0ef15a4fd9 sp:7f0eee5fc970 error:0 in libglib-2.0.so.0.7800.1[7f0ef158b000+99000]
# ls -ltr /var/lib/systemd/coredump/
total 6184
-rw-r-----+ 1 root root 746747 Nov 9 23:27 core.xfce4-session.1000.d00aedc029654c04829c8abc00b55f60.2305.1699572456000000.zst
-rw-r-----+ 1 root root 752648 Nov 10 00:53 core.xfce4-session.1000.d00aedc029654c04829c8abc00b55f60.241545.1699577597000000.zst
-rw-r-----+ 1 root root 537884 Nov 10 00:53 core.xfsettingsd.1000.d00aedc029654c04829c8abc00b55f60.358310.1699577619000000.zst
-rw-r-----+ 1 root root 1734502 Nov 10 01:41 core.xfce4-terminal.1000.d00aedc029654c04829c8abc00b55f60.358590.1699580489000000.zst
-rw-r-----+ 1 root root 1609384 Nov 10 02:01 core.xfce4-terminal.1000.d00aedc029654c04829c8abc00b55f60.449960.1699581698000000.zst
-rw-r-----+ 1 root root 915050 Nov 10 06:38 core.panel-17-power-.1000.d00aedc029654c04829c8abc00b55f60.358482.1699598320000000.zst
Examining all of these coredumps leads us to the same problem:
(gdb) bt
#0 g_datalist_id_dup_data (datalist=datalist@entry=0x7f0ee4015210, key_id=56, dup_func=dup_func@entry=0x0, user_data=user_data@entry=0x0) at ../../../glib/gdataset.c:979
#1 0x00007f0ef15a503d in g_datalist_id_get_data (datalist=datalist@entry=0x7f0ee4015210, key_id=<optimized out>) at ../../../glib/gdataset.c:917
#2 0x00007f0ef16ce470 in toggle_refs_notify (object=0x7f0ee4015200, is_last_ref=1) at ../../../gobject/gobject.c:3612
#3 0x00007f0ef16ce95a in g_object_unref (_object=<optimized out>) at ../../../gobject/gobject.c:3830
#4 0x00007f0ef1838fc7 in _g_dbus_worker_queue_or_deliver_received_message (message=0x0, worker=0x5650780c0fc0) at ../../../gio/gdbusprivate.c:521
#5 _g_dbus_worker_do_read_cb (input_stream=<optimized out>, res=<optimized out>, user_data=0x5650780c0fc0) at ../../../gio/gdbusprivate.c:805
#6 0x00007f0ef17c9ee3 in g_task_return_now (task=task@entry=0x7f0ee401cb10) at ../../../gio/gtask.c:1371
#7 0x00007f0ef17c9f1d in complete_in_idle_cb (task=0x7f0ee401cb10) at ../../../gio/gtask.c:1385
#8 0x00007f0ef15c40d9 in g_main_dispatch (context=context@entry=0x5650780c1320) at ../../../glib/gmain.c:3476
#9 0x00007f0ef15c7317 in g_main_context_dispatch_unlocked (context=0x5650780c1320) at ../../../glib/gmain.c:4284
#10 g_main_context_iterate_unlocked (context=0x5650780c1320, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at ../../../glib/gmain.c:4349
#11 0x00007f0ef15c7c1f in g_main_loop_run (loop=0x5650780c1450) at ../../../glib/gmain.c:4551
#12 0x00007f0ef1836eaa in gdbus_shared_thread_func (user_data=0x5650780c12f0) at ../../../gio/gdbusprivate.c:284
#13 0x00007f0ef15f4a41 in g_thread_proxy (data=0x565078057af0) at ../../../glib/gthread.c:831
#14 0x00007f0ef14133ec in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:444
#15 0x00007f0ef1493a4c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
(gdb) info locals
val = 0x0
retval = 0x0
d = 0x303030303030300
data = 0x303030303030308
data_end = <optimized out>
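A pointer value of 0x30303030303XXXX (the byte 0x03 repeated, with the low flag bits masked off) indicates that it was read from memory filled in by MALLOC_PERTURB_, i.e. memory that is uninitialised or has already been freed.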
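To make that reading of the locals concrete, here is a minimal sketch in C that recovers the fill byte from the observed pointer. The helper names are hypothetical and not part of GLib. The three-bit flag mask is taken from the `and $0xfffffffffffffff8` in the disassembly, and the fill semantics follow the M_PERTURB description in mallopt(3): malloc fills with the complement of the configured value, free fills with the value itself.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption taken from the disassembly (and $0xfffffffffffffff8):
 * the low three bits of the datalist word are flag/lock bits. */
#define FLAG_BITS_MASK UINT64_C(0x7)

/* Hypothetical helper: drop the flag bits, as the crashing code does. */
static uint64_t strip_flag_bits(uint64_t raw)
{
    return raw & ~FLAG_BITS_MASK;
}

/* Hypothetical helper: true if bytes 1..7 of the value are identical,
 * i.e. the word looks like a memset() fill. Byte 0 is skipped because
 * its low bits are lost to the flag mask. Stores the fill byte. */
static bool repeated_fill_byte(uint64_t value, uint8_t *fill)
{
    uint8_t first = (uint8_t)(value >> 8);
    for (int i = 2; i < 8; i++) {
        if ((uint8_t)(value >> (8 * i)) != first)
            return false;
    }
    *fill = first;
    return true;
}

/* mallopt(3), M_PERTURB: free() fills with the value itself ... */
static uint8_t perturb_setting_if_freed(uint8_t fill) { return fill; }
/* ... and malloc() fills with its bitwise complement. */
static uint8_t perturb_setting_if_fresh(uint8_t fill) { return (uint8_t)~fill; }
```

Applied to d = 0x303030303030300, the fill byte comes out as 0x03. Under the documented semantics that would match MALLOC_PERTURB_=3 if the memory had been freed, or MALLOC_PERTURB_=252 if it was freshly allocated and never written. Which of the two applies depends on the value set in the environment, which is not recorded above.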
../../../glib/gdataset.c
972
973 g_datalist_lock (datalist);
974
975 d = G_DATALIST_GET_POINTER (datalist);
976 if (d)
977 {
978 data = d->data;
> 979 data_end = data + d->len;
980 do
981 {
982 if (data->key == key_id)
983 {
984 val = data->data;
985 break;
986 }
987 data++;
0x7f0ef15a4fba <g_datalist_id_dup_data+26> mov $0x2,%esi
0x7f0ef15a4fbf <g_datalist_id_dup_data+31> call 0x7f0ef1593250 <g_pointer_bit_lock>
0x7f0ef15a4fc4 <g_datalist_id_dup_data+36> mov 0x0(%rbp),%rax
0x7f0ef15a4fc8 <g_datalist_id_dup_data+40> cmp $0x7,%rax
0x7f0ef15a4fcc <g_datalist_id_dup_data+44> jbe 0x7f0ef15a4ff9 <g_datalist_id_dup_data+89>
0x7f0ef15a4fce <g_datalist_id_dup_data+46> and $0xfffffffffffffff8,%rax
0x7f0ef15a4fd2 <g_datalist_id_dup_data+50> mov %rax,%rdx
0x7f0ef15a4fd5 <g_datalist_id_dup_data+53> lea 0x8(%rax),%rax
> 0x7f0ef15a4fd9 <g_datalist_id_dup_data+57> mov (%rdx),%edx
0x7f0ef15a4fdb <g_datalist_id_dup_data+59> lea (%rdx,%rdx,2),%rdx
0x7f0ef15a4fdf <g_datalist_id_dup_data+63> lea (%rax,%rdx,8),%rcx
0x7f0ef15a4fe3 <g_datalist_id_dup_data+67> jmp 0x7f0ef15a4ff1 <g_datalist_id_dup_data+81>
0x7f0ef15a4fe5 <g_datalist_id_dup_data+69> nopl (%rax)
0x7f0ef15a4fe8 <g_datalist_id_dup_data+72> add $0x18,%rax
0x7f0ef15a4fec <g_datalist_id_dup_data+76> cmp %rcx,%rax
0x7f0ef15a4fef <g_datalist_id_dup_data+79> jae 0x7f0ef15a5028 <g_datalist_id_dup_data+136>
0x7f0ef15a4ff1 <g_datalist_id_dup_data+81> cmp %ebx,(%rax)
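As a side note on why the kernel logs a general protection fault instead of an ordinary segfault: the faulting instruction at +57 loads d->len through the masked pointer, and a pointer made of repeated 0x03 bytes is not a canonical x86-64 address. The sketch below checks this, assuming 48-bit virtual addresses (4-level paging). The function name is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, assuming 48-bit virtual addresses: an address
 * is canonical when bits 63..47 are all zero or all one. */
static bool is_canonical_48(uint64_t addr)
{
    uint64_t top = addr >> 47;          /* the 17 bits 63..47 */
    return top == 0 || top == UINT64_C(0x1ffff);
}
```

The perturbed pointer 0x0303030303030300 fails this check, while the valid datalist address 0x7f0ee4015210 from frame #0 passes it.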
The stack trace and disassembly are from Debian glib2 2.78.1-2 (amd64, amd64 dbgsym), but the problem has been occurring for several months, since roughly version 2.76.
I'll leave it to you to investigate further, but presumably if (d) only checks for a NULL pointer, when d could also be an uninitialised pointer.