g_poll on macOS fails with EBADF instead of setting POLLNVAL since 2.72.1
With the GLib 2.72.1 release, the semantics of g_poll on macOS have changed when faced with invalid file descriptors. Previously an invalid FD would result in G_IO_NVAL / POLLNVAL being set in the revents field of the GPollFD struct, but g_poll would return success.
Now an invalid FD results in g_poll failing with errno set to EBADF on return, and the revents field of the GPollFD struct left untouched.
The change was caused by the merging of !2571 (merged), which fixed the setting of 'BROKEN_POLL' on macOS, which in turn causes GLib to use select() to emulate poll. Failing with EBADF is all that select() is capable of doing, so it is understandable that this leaks through to GLib's g_poll implementation.
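The difference can be reproduced without GLib. The following is a minimal plain-POSIX sketch, assuming Linux-like semantics: poll() on a closed descriptor succeeds and reports POLLNVAL in revents, while select() on the same descriptor fails outright with EBADF and says nothing about which descriptor was bad. The helper names make_closed_fd, poll_revents and select_one are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical helper: returns an fd number that is closed right now, by
 * creating a pipe and closing both ends. Nothing may open a new fd between
 * this call and its use, or the number could be reused. */
static int make_closed_fd(void)
{
    int p[2];
    int rc = pipe(p);
    assert(rc == 0);
    close(p[1]);
    close(p[0]);
    return p[0];
}

/* poll() a single fd with a zero timeout; returns its revents, or -1 if
 * poll() itself failed. */
static int poll_revents(int fd)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
    if (poll(&pfd, 1, 0) < 0)
        return -1;
    return pfd.revents;
}

/* select() a single fd with a zero timeout; returns select()'s result and
 * stores errno (or 0 on success) in *err. */
static int select_one(int fd, int *err)
{
    fd_set rfds;
    struct timeval tv = { 0, 0 };
    int rc;

    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    rc = select(fd + 1, &rfds, NULL, NULL, &tv);
    *err = (rc < 0) ? errno : 0;
    return rc;
}
```

Calling both on the result of make_closed_fd() shows the two behaviours side by side: poll_revents() returns a value with POLLNVAL set, while select_one() returns -1 with EBADF.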
This behaviour, however, also alters the behaviour of GMainContext, which triggers a warning message every time a bad FD is encountered, instead of propagating the condition up to the application code:
(process:50968): GLib-WARNING **: 04:59:12.981: poll(2) failed due to: Bad file descriptor.
This tripped up libvirt's test suite, since we had G_DEBUG=fatal-warnings set when running unit tests, and the code is a little racy when closing FDs and removing them from the GMainContext.
A select()-based implementation of poll() is inherently poor for a number of reasons, most notably the FD_SETSIZE limit. The question is how far we should expect GLib to go to hide the differences?
At the very minimum, IMHO, the above warning message should be skipped when errno is EBADF, as it is a needless behaviour difference that GLib is consciously creating.
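As a rough sketch of that minimal change, the check could filter on errno before warning. This is not GLib's actual code: poll_failure_should_warn and report_poll_result are hypothetical names, and treating EINTR as benign alongside EBADF is an assumption made here for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical filter (not GLib's code): should a poll() result, or the
 * result of a select()-based emulation, produce a warning? EINTR is treated
 * as benign by assumption; EBADF is the exemption proposed above. */
static bool poll_failure_should_warn(int ret, int err)
{
    if (ret >= 0)
        return false;                 /* not a failure at all */
    return err != EINTR && err != EBADF;
}

/* Hypothetical call site: only print the warning when the filter allows. */
static void report_poll_result(int ret, int err, FILE *log)
{
    assert(log != NULL);
    if (poll_failure_should_warn(ret, err))
        fprintf(log, "poll(2) failed due to: %s.\n", strerror(err));
}
```

With this filter, a caller of report_poll_result() stays silent for EBADF and only warns on genuinely unexpected errors such as ENOMEM.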
Actually filling in POLLNVAL in the GPollFD could be done by calling fcntl(F_GETFL) on each FD to see which one(s) caused select() to fail with EBADF. That would make the behaviour much closer to poll, at the cost of significant inefficiency when EBADF occurs, due to the O(N) fcntl calls that would be required. I'm not sure whether that's justifiable in the name of consistent API semantics.
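A minimal sketch of that fcntl(F_GETFL) recovery, written against plain POSIX rather than GLib's GPollFD: select_poll is a hypothetical stand-in for a select()-based poll(), not GLib's implementation. It maps only POLLIN, POLLOUT and POLLPRI, rejects descriptors at or above FD_SETSIZE, and on EBADF spends one fcntl() call per descriptor to mark the invalid ones with POLLNVAL.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical select()-based stand-in for poll() (not GLib's code) that
 * recovers POLLNVAL by probing each fd with fcntl(F_GETFL) when select()
 * fails with EBADF. Only POLLIN, POLLOUT and POLLPRI are mapped. */
static int select_poll(struct pollfd *fds, nfds_t nfds, int timeout_ms)
{
    fd_set rset, wset, xset;
    struct timeval tv, *tvp = NULL;
    int maxfd = -1, rc;

    assert(fds != NULL || nfds == 0);
    FD_ZERO(&rset);
    FD_ZERO(&wset);
    FD_ZERO(&xset);
    for (nfds_t i = 0; i < nfds; i++) {
        int fd = fds[i].fd;
        fds[i].revents = 0;
        if (fd < 0)
            continue;                 /* poll() ignores negative fds too */
        if (fd >= FD_SETSIZE) {
            errno = EINVAL;           /* cannot be represented in an fd_set */
            return -1;
        }
        if (fds[i].events & POLLIN)
            FD_SET(fd, &rset);
        if (fds[i].events & POLLOUT)
            FD_SET(fd, &wset);
        if (fds[i].events & POLLPRI)
            FD_SET(fd, &xset);
        if (fd > maxfd)
            maxfd = fd;
    }

    if (timeout_ms >= 0) {            /* negative timeout: block forever */
        tv.tv_sec = timeout_ms / 1000;
        tv.tv_usec = (timeout_ms % 1000) * 1000;
        tvp = &tv;
    }

    rc = select(maxfd + 1, &rset, &wset, &xset, tvp);
    if (rc < 0 && errno == EBADF) {
        /* O(N) recovery: one fcntl() per fd to find the invalid one(s).
         * Readiness of the remaining fds is not reported in this pass. */
        int nbad = 0;
        for (nfds_t i = 0; i < nfds; i++) {
            if (fds[i].fd >= 0 && fcntl(fds[i].fd, F_GETFL) < 0 && errno == EBADF) {
                fds[i].revents = POLLNVAL;
                nbad++;
            }
        }
        if (nbad > 0)
            return nbad;              /* like poll(): count of non-zero revents */
        errno = EBADF;                /* e.g. the fd was reopened meanwhile */
        return -1;
    }
    if (rc <= 0)
        return rc;                    /* timeout, or another select() error */

    rc = 0;
    for (nfds_t i = 0; i < nfds; i++) {
        int fd = fds[i].fd;
        if (fd < 0)
            continue;
        if ((fds[i].events & POLLIN) && FD_ISSET(fd, &rset))
            fds[i].revents |= POLLIN;
        if ((fds[i].events & POLLOUT) && FD_ISSET(fd, &wset))
            fds[i].revents |= POLLOUT;
        if ((fds[i].events & POLLPRI) && FD_ISSET(fd, &xset))
            fds[i].revents |= POLLPRI;
        if (fds[i].revents != 0)
            rc++;
    }
    return rc;
}
```

One design choice worth noting: when the recovery path runs, only the POLLNVAL entries are reported, so a caller sees the bad descriptor first and picks up readiness of the others on its next call. Reporting both in one pass would need a second select() over the surviving descriptors, which makes the EBADF case even more expensive.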
FYI, it is relatively easy to trigger a bad file descriptor scenario with GLib's event loop, on any platform, with the following program:
#include <glib.h>
#include <glib-unix.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

/* Event-loop thread: iterates the default main context forever. */
G_GNUC_NORETURN static void *eventThreadLoop(void *data G_GNUC_UNUSED)
{
    GMainContext *ctx = g_main_context_default();
    while (1) {
        g_printerr("EV tick\n");
        g_main_context_iteration(ctx, TRUE);
    }
    abort(); /* not reached */
}

/* GUnixFDSourceFunc callbacks must return a gboolean. */
static gboolean gotEvent(gint fd, GIOCondition cond, gpointer data G_GNUC_UNUSED)
{
    g_printerr("FD %d cond %x\n", fd, cond);
    return G_SOURCE_CONTINUE;
}

int main(int argc G_GNUC_UNUSED, char **argv G_GNUC_UNUSED)
{
    pthread_t eventThread;
    GMainContext *ctx = g_main_context_default();
    int i;

    pthread_create(&eventThread, NULL, eventThreadLoop, NULL);

    for (i = 0; i < 100000; i++) {
        GSource *src;
        int fd = open("/dev/null", O_RDONLY);
        g_printerr("Iterate %d: fd=%d\n", i, fd);
        g_assert(fd > -1);

        src = g_unix_fd_source_new(fd, G_IO_IN);
        g_source_set_callback(src, (GSourceFunc)gotEvent, src, NULL);
        g_source_attach(src, ctx);

        /* Remove the source, then immediately close the FD from this
         * (non-loop) thread: the loop thread may still be inside poll(). */
        g_source_destroy(src);
        g_source_unref(src);
        close(fd);
    }
    return 0;
}
Running that under strace on Linux shows a small fraction (2-3%) of poll() syscalls returning POLLNVAL.
On macOS it periodically emits the warning '(process:50968): GLib-WARNING **: 04:59:12.981: poll(2) failed due to: Bad file descriptor.'
The problem with this demo is that the GSource API is inherently racy as implemented. g_source_destroy will remove the FD from the list of FDs to be polled by the GMainContext, and will call g_wakeup_signal to poke the thread that's asleep in g_main_context_iteration. Some percentage of the time, however, the thread that called g_source_destroy will get on to running close(fd) before the event loop thread has been scheduled long enough to get out of the poll() (or select()) syscall. So the event loop thread is still using the just-closed FD for a fraction of a second. I can't see how to solve this other than never close()ing FDs in any thread that isn't the one running the GMainContext.
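One way to express "only the loop thread closes FDs" is a small deferred-close queue, sketched here in plain POSIX under stated assumptions; it is not a GLib facility. The struct deferred_close type and the deferred_close_* functions are hypothetical. Other threads destroy the source as before but hand the FD to the queue instead of calling close(), and the loop thread drains the queue between iterations, when it is not blocked in poll(). Within GLib itself, g_main_context_invoke() could arguably serve a similar purpose by running the close() on the context's own thread, though that is not what the sketch below does.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <unistd.h>

/* Hypothetical deferred-close queue (not part of GLib). Worker threads
 * never close() an fd that might still be in the loop's poll set; they
 * enqueue it, and only the event-loop thread performs the close(). */
#define DEFERRED_MAX 64

struct deferred_close {
    pthread_mutex_t lock;
    int fds[DEFERRED_MAX];
    int count;
};

static void deferred_close_init(struct deferred_close *q)
{
    int rc = pthread_mutex_init(&q->lock, NULL);
    assert(rc == 0);
    q->count = 0;
}

/* Any thread: call after removing the fd's source, instead of close(fd).
 * Returns 0 on success, or -1 if the queue is full. */
static int deferred_close_request(struct deferred_close *q, int fd)
{
    int rc = -1;
    pthread_mutex_lock(&q->lock);
    if (q->count < DEFERRED_MAX) {
        q->fds[q->count++] = fd;
        rc = 0;
    }
    pthread_mutex_unlock(&q->lock);
    return rc;
}

/* Event-loop thread only, between poll() calls, so none of these fds can
 * still be inside an in-progress poll(). Returns how many were closed. */
static int deferred_close_drain(struct deferred_close *q)
{
    int n;
    pthread_mutex_lock(&q->lock);
    n = q->count;
    for (int i = 0; i < n; i++)
        close(q->fds[i]);
    q->count = 0;
    pthread_mutex_unlock(&q->lock);
    return n;
}
```

The fixed-size array keeps the sketch short; a real implementation would want to grow the queue or fall back to blocking rather than returning -1 when it is full.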