tests/mainloop-test appears to be flaky

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=909150

glib2.0 2.58.0-4 failed to build on Debian's armel (ARM softfloat EABI) autobuilder: https://buildd.debian.org/status/fetch.php?pkg=glib2.0&arch=armel&ver=2.58.0-4&stamp=1537225902&raw=0

FAIL mainloop-test (exit status: 139)

(That's a segmentation fault, with no other output.)

I was able to reproduce this failure on the armel porterbox (Debian-developer-accessible test machine), abel, by doing a build (which succeeded) and then running:

bash -euc 'for i in `seq 1 1000`; do time debian/build/deb/tests/mainloop-test; done'

You can also run this test as an installed binary without needing to build it: it's /usr/lib/glib2.0/installed-tests/glib/mainloop-test in the libglib2.0-tests package.

On my 81st attempt on abel, the test segfaulted.

% debian/build/deb/libtool --mode=execute gdb debian/build/deb/tests/mainloop-test core
...
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00459634 in cleanup_crawlers (context=0xb5101218) at ../../../../tests/mainloop-test.c:364
364           if (g_source_get_context (crawler_array->pdata[i]) == context)
[Current thread is 1 (Thread 0xb5afe460 (LWP 29988))]
(gdb) thread apply all bt

Thread 2 (Thread 0xb6f503e0 (LWP 29972)):
#0  0xb6d2b750 in ?? () from /lib/arm-linux-gnueabi/libc.so.6
#1  0x00459c8c in adder_callback (source=<optimized out>, condition=<optimized out>, data=0x0)
    at ../../../../tests/mainloop-test.c:122
#2  0x00000010 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 1 (Thread 0xb5afe460 (LWP 29988)):
#0  0x00459634 in cleanup_crawlers (context=0xb5101218) at ../../../../tests/mainloop-test.c:364
#1  adder_thread (data=<optimized out>) at ../../../../tests/mainloop-test.c:192
#2  0xb6e87628 in g_thread_proxy (data=0x2416c60) at ../../../../glib/gthread.c:784
#3  0xb6c1f1ac in start_thread () from /lib/arm-linux-gnueabi/libpthread.so.0
#4  0xb6d96ca8 in ?? () from /lib/arm-linux-gnueabi/libc.so.6
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
...
(gdb) p *crawler_array
$1 = {pdata = 0x2415178, len = 0}
(gdb) p i
$2 = 0
(gdb) p (GSource *) crawler_array->pdata[0]
$5 = (GSource *) 0x2415190
(gdb) p *(GSource *) crawler_array->pdata[0]
$6 = {callback_data = 0x2415160, callback_funcs = 0x0, source_funcs = 0xffffffff, ref_count = 4294967295,
  context = 0x0, priority = 0, flags = 37835528, source_id = 0, poll_fds = 0xb78731fa, prev = 0x182, next = 0x0,
  name = 0x0, priv = 0x0}
(gdb) p *context
$8 = {mutex = {p = 0x0, i = {0, 0}}, cond = {p = 0x0, i = {1, 0}}, owner = 0x0, owner_count = 0, waiters = 0x0, 
  ref_count = 1, sources = 0xb6302078, pending_dispatches = 0xb6302e50, timeout = 1, next_id = 108, 
  source_lists = 0xb6303790, in_check_or_prepare = 0, poll_records = 0xb6303740, n_poll_records = 1, 
  cached_poll_array = 0xb51030e8, cached_poll_array_size = 2, wakeup = 0xb6301160, wake_up_rec = {fd = 21, 
    events = 1, revents = 0}, poll_changed = 1, poll_func = 0xb6e6cc78 <g_poll>, time = 1660937509354, 
  time_is_fresh = 1}

I'm not sure what is going on here; the stack traces don't look complete. The last functional change to this test was in 2013.

There also seems to be a different failure mode (seen on the 200th attempt when I re-ran the loop) where it logs this warning:

(process:11074): GLib-CRITICAL **: 22:31:14.520: g_rand_int_range: assertion 'end > begin' failed

and does not terminate. I was also able to reproduce that one on x86_64, again by running it in a loop, after 7299 successes.

I've left the armel porterbox re-running that test 10,000 times in a loop that does not terminate on failure, with G_DEBUG=fatal-criticals, in the hope that it will give me (a) a better idea of how often this happens, and (b) a core dump for the critical.

We've seen something similar before (although not been able to reproduce it): in 2014, GLib 2.40 failed to build on x86_64, with the same critical followed by:

GLib (gthread-posix.c): Unexpected error from C library during 'pthread_mutex_lock': Invalid argument.  Aborting.