disposing a non-cancelled inotify GFileMonitor causes deadlocks
Cockpit builds started to fail on various architectures in Debian a few weeks ago, in a unit test that essentially wraps GFileMonitor. I investigated this in https://github.com/cockpit-project/cockpit/issues/13146 and found that g_file_monitor() initialization hangs forever if it is the second (or third...) instance of a monitor (the first one always seems to work) and the first monitor saw some event.
I wrote a little standalone gfilemon.c reproducer that illustrates this. It has a do_one() function which creates a temp dir, creates a file monitor on it, creates a file in it, and ensures that the monitor picks up the event. main() then calls do_one() in a loop. In Debian sid this now reliably hangs at the second, or at most third, iteration:
$ gcc ~/gfilemon.c -Wall $(pkg-config --cflags --libs glib-2.0 gio-2.0) && ./a.out
** Message: 22:28:56.699: creating GFileMonitor for /tmp/.XOQ0A0
** Message: 22:28:56.699: created GFileMonitor for /tmp/.XOQ0A0
** Message: 22:28:56.700: filemon changed: file /tmp/.XOQ0A0/test.txt.D7P0A0, type 3
** Message: 22:28:56.700: filemon changed: file /tmp/.XOQ0A0/test.txt.D7P0A0, type 1
** Message: 22:28:56.700: filemon changed: file /tmp/.XOQ0A0/test.txt.D7P0A0, type 0
** Message: 22:28:56.700: filemon changed: file /tmp/.XOQ0A0/test.txt.D7P0A0, type 1
** Message: 22:28:56.700: filemon changed: file /tmp/.XOQ0A0/test.txt.D7P0A0, type 2
** Message: 22:28:56.700: filemon changed: file /tmp/.XOQ0A0/test.txt, type 3
** Message: 22:28:56.700: filemon changed: file /tmp/.XOQ0A0/test.txt, type 1
** Message: 22:28:56.700: creating GFileMonitor for /tmp/.GVL0A0
and it hangs here (no "created" message). When running it through strace, I see:
Message: 22:35:47.358: creating GFileMonitor for /tmp/.5JC2A0
) = 85
[pid 101278] lstat("/tmp/.5JC2A0", {st_dev=makedev(0, 0x20), st_ino=1009329, st_mode=S_IFDIR|0700, st_nlink=2, st_uid=1000, st_gid=1000, st_blksize=4096, st_blocks=0, st_size=40, st_atime=1573770947 /* 2019-11-14T22:35:47.357645967+0000 */, st_atime_nsec=357645967, st_mtime=1573770947 /* 2019-11-14T22:35:47.357645967+0000 */, st_mtime_nsec=357645967, st_ctime=1573770947 /* 2019-11-14T22:35:47.357645967+0000 */, st_ctime_nsec=357645967}) = 0
[pid 101278] futex(0x7f823a74a300, FUTEX_WAIT_PRIVATE, 2, NULL
This happens with both glib 2.62.2 in Debian unstable and glib 2.63.1 in Debian experimental. I also tried this in Fedora 31 (glib 2.62.2, same as in sid) and in a rawhide mock (glib 2.63.0), but it does not happen there.
Debian configures glib a bit differently than Fedora; in particular, Debian uses -Dfam=false (except on the non-Linux ports), where Fedora uses -Dfam=true. Might that be related?
Does that ring a bell? What can I do to debug this further?