We might repeatedly get si_pid == 0 for a child that hasn't exited, meaning we won't get a correct exit status. This seems to happen when the glib application tracks a ptrace():ed child process. The correct exit status of the process can be checked using e.g. a BPF program, where one can observe that glib appears to get it wrong.
I've tried to write a test case for this, but the reproducer I have is somewhat hard to translate to a test case. This is what seems to happen:
- Run `mutter` inside `catch` (a small utility that ptrace()es all child processes (and subprocesses) and generates backtraces for every `SIGABRT` and `SIGSEGV`).
- Make something spawn Xwayland
- `kill -SIGKILL` that Xwayland
- Observe that mutter gets the wrong result from `g_subprocess_get_success()` (it returns `TRUE`).
For example:
dbus-run-session -- catch mutter --nested weston-terminal
In weston-terminal run xterm
In another terminal, use `ps ax | grep Xwayland` to find the correct Xwayland process ID, and run `kill -SIGKILL <PID>`.
In the terminal where mutter was run, one should see `X Wayland crashed; attempting to recover`, but here 2 out of 3 times, it won't, meaning `g_subprocess_get_success()` returned `TRUE`.
Another way to observe it is to run the `exitsnoop` BPF program (https://github.com/iovisor/bcc/blob/master/tools/exitsnoop.py), and see that Xwayland will have an error exit status.
I tried to write a test case that:
- Spawns a `GSubprocess`
- Calls `fork()`
- Runs `sleep(2); kill(subprocess_pid, SIGKILL); exit(0);` in the fork
- Uses `ptrace()`/`waitpid()` etc. to imitate `catch`
The problem with this is that `waitpid()` "consumes" the exit status, meaning `waitid()` in `g_child_watch_check()` fails instead of succeeding while setting `info.si_pid` to `0`.
Also, I can't really find any documentation about what `si_pid` being `0` should mean, nor whether `POLLIN` with an "empty" `info` really means the process exited or not.