futex based g_mutex_lock sometimes splatters errno with EAGAIN on contended locks
In libvirt we've been seeing rare failures of our test suite where a trivial 'g_ascii_strtoll(...)' call will inexplicably fail with errno == EAGAIN. This can be demonstrated with this code:
$ cat > demo.c <<EOF
#include <glib.h>
static gpointer run(gpointer data)
{
errno = 0;
int i = g_ascii_strtoll("10", NULL, 10);
g_assert(errno == 0);
g_assert(i == 10);
return NULL;
}
int main(int argc, char **argv) {
gsize i;
GThread *t[5];
for (i = 0; i < G_N_ELEMENTS(t); i++) {
t[i] = g_thread_new("demo", run, NULL);
}
for (i = 0; i < G_N_ELEMENTS(t); i++) {
g_thread_join(t[i]);
}
g_printerr(".");
}
EOF
$ gcc -Wall `pkg-config --cflags --libs glib-2.0` -o demo demo.c
$ for i in `seq 1 200` ; do LD_LIBRARY_PATH=`pwd`/build/glib ./demo ; done
...**
ERROR:demo.c:7:run: assertion failed: (errno == 0)
Bail out! ERROR:demo.c:7:run: assertion failed: (errno == 0)
Aborted (core dumped)
.........................................................................................**
ERROR:demo.c:7:run: assertion failed: (errno == 0)
Bail out! ERROR:demo.c:7:run: assertion failed: (errno == 0)
**
ERROR:demo.c:7:run: assertion failed: (errno == 0)
Aborted (core dumped)
..........**
I first isolated this to the get_C_locale call splattering errno.
Then I narrowed it further to the g_once_init_enter call splattering errno.
Finally I isolated it to g_mutex_lock, with the native futex-based impl leaving errno set on success.
This can be demonstrated with the following program:
$ cat fdemo.c
#include <glib.h>
#include <unistd.h>
#include <assert.h>
static gpointer run(gpointer data)
{
GMutex *m = data;
int i;
for (i = 0; i < 1000; i++) {
write(2, ".", 1);
errno = 0;
g_mutex_lock(m);
g_usleep(1);
assert(errno == 0);
g_mutex_unlock(m);
assert(errno == 0);
}
return NULL;
}
int main(int argc, char **argv) {
gsize i;
GThread *t[5];
GMutex m;
g_mutex_init(&m);
for (i = 0; i < G_N_ELEMENTS(t); i++)
{
t[i] = g_thread_new("demo", run, &m);
}
for (i = 0; i < G_N_ELEMENTS(t); i++)
{
g_thread_join(t[i]);
}
g_printerr(".");
}
$ ./fdemo
...............................................................fdemo: fdemo.c:14: run: Assertion `errno == 0' failed.
Aborted (core dumped)
It won't always fail. As it's a race condition, you sometimes need to run it a few times.
If I strace the fdemo program, we see the "failure":
p.log.4063236:futex(0x7fffd85dd868, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
Of course this is NOT actually a failure. It is totally normal / expected / harmless for FUTEX_WAIT_PRIVATE to sometimes return EAGAIN.
The native glibc pthread_mutex_t impl will exhibit the same behaviour of EAGAIN return from the futex() syscall.
The difference is that glib GMutex is using the public syscall(2) function exported by glibc that will set errno when the syscall returns failure.
The glibc pthread_mutex_t impl is using an internal syscall helper that does NOT set errno when the syscall returns failure.
IOW, despite doing the same logic/syscall, glib will splatter errno on successful mutex lock, but glibc will not.
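The errno behaviour of the public syscall(2) wrapper can be reproduced without any threads or races. This is a minimal sketch, assuming Linux with glibc; the helper names futex_wait_mismatch and demo_errno_after_mismatch are made up for illustration and are not part of glib. It waits on a futex word whose value differs from the expected one, so the kernel refuses to sleep and the wrapper reports EAGAIN through errno:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper, not part of glib: FUTEX_WAIT on a word whose
 * current value (0) differs from the expected value (2). The kernel
 * returns immediately instead of sleeping. */
static long futex_wait_mismatch(void)
{
    uint32_t word = 0;

    errno = 0;
    return syscall(SYS_futex, &word, FUTEX_WAIT_PRIVATE, 2, NULL, NULL, 0);
}

/* Illustrative check: the public wrapper returns -1 and leaves
 * errno set to EAGAIN, even though nothing actually went wrong. */
static void demo_errno_after_mismatch(void)
{
    long res = futex_wait_mismatch();

    assert(res == -1);
    assert(errno == EAGAIN);
}
```

In a contended g_mutex_lock the same return path is hit whenever another thread changes the lock word between the userspace check and the futex call, which is why errno is only splattered occasionally.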
I think the gthreadprivate.h code for g_futex_simple needs to save errno before the syscall, and restore errno after the syscall if the new errno is EAGAIN.
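The save/restore idea could look roughly like the following. This is only a sketch under assumptions, not the actual gthreadprivate.h code: the function names futex_simple_keep_errno and demo_errno_is_preserved are hypothetical, and it assumes Linux with SYS_futex available.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical wrapper, not the real g_futex_simple: issue a futex
 * operation via syscall(2) but hide a benign EAGAIN from the caller. */
static long futex_simple_keep_errno(uint32_t *uaddr, int op, uint32_t val)
{
    int saved_errno = errno;    /* errno as the caller left it */
    long res = syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);

    /* EAGAIN only means the futex word changed before we could sleep.
     * Any other error is left in errno untouched. */
    if (res < 0 && errno == EAGAIN)
        errno = saved_errno;
    return res;
}

/* Illustrative check: a mismatched wait still returns -1, but the
 * caller's errno value survives the call. */
static void demo_errno_is_preserved(void)
{
    uint32_t word = 0;

    errno = 0;
    long res = futex_simple_keep_errno(&word, FUTEX_WAIT_PRIVATE, 2);
    assert(res == -1);
    assert(errno == 0);
}
```

Restoring only on EAGAIN keeps genuinely unexpected errors visible while making the common contended-lock path behave like the glibc pthread_mutex_t path from the caller's point of view.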