Skip to content

ssh-agent: avoid deadlock when agent process dies before we connect to it

Steven Luo requested to merge steven676/gnome-keyring:master into master

One of our users reported that git pushes to GitHub were hanging on their workstation. A look at the state of the workstation showed that the user's gnome-keyring-daemon was not responding to SSH agent requests and consuming 100% of a CPU core on that machine.

Looking at the stuck thread with gdb showed the following (this is with gnome-keyring 3.28.2-1 as shipped by RHEL8):

(gdb) bt
#0  0x00007fcc52ef5138 in g_mutex_unlock () from /lib64/libglib-2.0.so.0
#1  0x00007fcc52eadccd in g_main_context_iterate.isra ()
   from /lib64/libglib-2.0.so.0
#2  0x00007fcc52eadf40 in g_main_context_iteration ()
   from /lib64/libglib-2.0.so.0
#3  0x000055f32b0971d9 in gkd_ssh_agent_process_connect (self=0x55f32c5ad400, 
    cancellable=0x55f32c5c0610, error=error@entry=0x7fcc4d52a4c8)
    at daemon/ssh-agent/gkd-ssh-agent-process.c:232
#4  0x000055f32b095a78 in on_run (service=<optimized out>, 
    connection=0x55f32c5ef720, source_object=<optimized out>, 
    user_data=<optimized out>) at daemon/ssh-agent/gkd-ssh-agent-service.c:297
#5  0x00007fcc515ff17e in ffi_call_unix64 () from /lib64/libffi.so.6
#6  0x00007fcc515feb2f in ffi_call () from /lib64/libffi.so.6
#7  0x00007fcc5318b386 in g_cclosure_marshal_generic_va ()
   from /lib64/libgobject-2.0.so.0
#8  0x00007fcc5318a616 in _g_closure_invoke_va ()
   from /lib64/libgobject-2.0.so.0
#9  0x00007fcc531a6525 in g_signal_emit_valist ()
   from /lib64/libgobject-2.0.so.0
#10 0x00007fcc531a70e3 in g_signal_emit () from /lib64/libgobject-2.0.so.0
#11 0x00007fcc53462ebc in g_threaded_socket_service_func ()
   from /lib64/libgio-2.0.so.0
#12 0x00007fcc52ed6ef3 in g_thread_pool_thread_proxy ()
   from /lib64/libglib-2.0.so.0
#13 0x00007fcc52ed64ea in g_thread_proxy () from /lib64/libglib-2.0.so.0
#14 0x00007fcc51fd51ca in start_thread () from /lib64/libpthread.so.0
#15 0x00007fcc51c41e73 in clone () from /lib64/libc.so.6

Looking at the other threads of the process, I found one doing the following:

#0  0x00007fcc51c419bd in syscall () from /lib64/libc.so.6
#1  0x00007fcc52ef487c in g_mutex_lock_slowpath () from /lib64/libglib-2.0.so.0
#2  0x000055f32b096ec7 in on_child_watch (pid=161639, status=256, 
    user_data=<optimized out>) at daemon/ssh-agent/gkd-ssh-agent-process.c:133
#3  0x00007fcc52eaa418 in g_child_watch_dispatch ()
   from /lib64/libglib-2.0.so.0
#4  0x00007fcc52eadaed in g_main_context_dispatch ()
   from /lib64/libglib-2.0.so.0
#5  0x00007fcc52eadea8 in g_main_context_iterate.isra ()
   from /lib64/libglib-2.0.so.0
#6  0x00007fcc52eae1d2 in g_main_loop_run () from /lib64/libglib-2.0.so.0
#7  0x000055f32b0719fa in main (argc=<optimized out>, argv=<optimized out>)
    at daemon/gkd-main.c:1165

gkd_ssh_agent_process_connect() in the first thread is running the GLib main loop while holding self->lock, waiting for on_output_watch() to set self->ready. However, if the SSH agent has already exited, on_child_watch() will be executed on a second thread, which tries to take self->lock -- that creates a deadlock. The timeout in gkd_ssh_agent_process_connect() seems to be ineffective because it's also triggered by an event and the entire event-handling flow is stuck due to the deadlock.

This deadlock may be the cause of several bug reports over the years (#25, #70, https://bugzilla.gnome.org/show_bug.cgi?id=794848, https://bugzilla.redhat.com/show_bug.cgi?id=1841855).

Fix this by having gkd_ssh_agent_process_connect() release self->lock before entering the main loop. While we're at it, check self->pid before each iteration -- if the SSH agent process dies, on_output_watch() should set that to 0, and we don't need to keep waiting for self->ready to be set.

Merge request reports