RFE - reduce "jank". Maybe need to avoid calling fsync() in the compositor's GUI thread :-).
I also get the same result from the quick fsync reproducer when using a Fedora 29 VM. That means -
EDIT: quick proof
To reproduce/observe that fsync() is called on the gnome-shell main thread:
- Open two gnome-terminal windows.
- In the first window, find the PID of gnome-shell: ps -aux | grep gnome-shell
- Attach strace to it: strace -f -p <PID-OF-GNOME-SHELL>
- Switch to the second gnome-terminal window.
- Type sleep 5, hit enter, and then switch back to the first gnome-terminal window before the sleep finishes.
Result: a GUI notification appears. At the same time, the strace output shows an fsync() call in the thread whose TID equals the PID, i.e. the first thread of the gnome-shell process.
Based on the hang below, I believe this first thread is the main thread of the compositor. Or at least, I believe this fsync() call causes the entire compositor to block, freezing the entire GUI including the mouse cursor until fsync() returns.
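To make the fsync() calls easier to spot in a busy trace, the strace output can be filtered. This is only a rough sketch: the function name main_thread_fsyncs is made up for illustration, and it assumes strace prefixes each line with "[pid N]", which it does when following several threads with -f. The -T flag (time spent in each syscall) and -e trace= (restrict to named syscalls) are standard strace options.

```shell
# Keep only fsync()/fdatasync() lines issued by the thread whose TID
# equals the given PID (the first thread of the process).
# Assumption: input lines look like "[pid  1234] fsync(5) = 0 <0.5>".
main_thread_fsyncs() {
    awk -v pid="$1" '
        $1 == "[pid" && $2 == (pid "]") && $3 ~ /^(fsync|fdatasync)\(/ { print }
    '
}

# Possible usage (strace writes its trace to stderr, hence 2>&1):
#   strace -f -T -e trace=fsync,fdatasync -p <PID-OF-GNOME-SHELL> 2>&1 \
#       | main_thread_fsyncs <PID-OF-GNOME-SHELL>
```

With -T, each matching line ends with the time the call took, which gives a rough idea of how long that thread was blocked.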
I reproduced the following hang on a system with a spinning hard drive. The system also has 8GB of RAM.
This is a temporary hang, which occurs while a simple file copy command is running. If the file copy command is killed or completes, gnome-shell becomes usable again.
The existence of a complete hang suggests that a lighter IO load could cause "jank", i.e. stuttering. Firefox has done a lot of work to avoid the browser being blocked by IO on the main thread. (Can't find a good link, but there's a mention here). Now that gnome-shell is the Wayland compositor, it is even responsible for moving the mouse cursor, so it is even more desirable for gnome-shell to avoid jank.
So I'm reporting this as a lead on some possible jank. If I'd only observed jank, it would be harder to report objectively. But I'm hoping there's room to improve jank issue(s), and then maybe the simplest I/O load won't freeze the mouse pointer every time. I counted up to thirty seconds before getting bored. (And the new BFQ IO scheduler does not save you - I first reproduced this while testing BFQ, though it also happens on the default CFQ).
Reproduction is as follows
1. Fedora Workstation 28. Spinning hard drive. 8GB RAM. 2GB swap. Default desktop: GNOME Shell on Wayland. My kernel and hardware settings should all be defaults. Kernel is
2. My hard drive identifies as a WDC WD5000LPLX-7. Exact IO behaviour might vary surprisingly depending on the disk model, sorry :-(.
3. Install xterm. I also recommend enabling the kernel SysRq.
4. Open gnome-terminal. Optional: for monitoring purposes, I ran a second maximized terminal window with
5. mkdir tmp && cd tmp
6. Generate large file: dd if=/dev/zero of=largefile bs=1M count=10k conv=fsync status=progress. (Yes, 10GB on a spinning hard drive. This is a long test, sorry.)
7. Start file copy: dd if=largefile of=largefile.copy bs=128k status=progress. This is exactly equivalent to cp, but with a progress indicator.
8. Notice: the system is now a bit sluggish. But have a little bit of patience. You can still open new terminal windows/tabs, or switch to the Linux text console and back.
9. In another terminal window, run time xterm -e echo. That starts an xterm and closes it again. This is not too sluggish either :-). Now keep trying it. After a while it might start taking longer. After a while, it hangs completely. And the hang applies to the entire gnome-shell, including the mouse cursor.
10. gnome-shell seems to hang until the copy finishes. Or, if you have enabled SysRq, you can use alt+sysrq+R, switch to a Linux text console, and kill the dd process. If you do this, it is best to log out of your GNOME session as well, because after using sysrq+R, pressing ctrl+C inside that GNOME session will kill the entire GNOME session.
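Re-running time xterm -e echo by hand gets tedious, so the repeated timing could be scripted. This is a minimal sketch, not what I actually ran: time_repeat is a made-up helper name, and it assumes GNU date (for the %N nanoseconds format), which Fedora ships.

```shell
# Run a command N times, printing how long each run took in milliseconds.
# Assumption: GNU date, so that +%s%N yields nanoseconds since the epoch.
time_repeat() {
    n=$1
    shift
    i=1
    while [ "$i" -le "$n" ]; do
        start=$(date +%s%N)
        "$@"
        end=$(date +%s%N)
        echo "run $i: $(( (end - start) / 1000000 )) ms"
        i=$((i + 1))
    done
}

# Possible usage, in another terminal while the copy is running:
#   time_repeat 20 xterm -e echo
```

If the hang reproduces, the loop simply stalls on one run, and the time printed for that run shows how long the stall lasted once the copy finishes or is killed.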
Yes, step 9 (repeatedly running time xterm -e echo) really does seem necessary to reproduce this. This test comes out of IO testing, where I've gone through basically the same thing several times, except without step 9, and none of those tests hung like this.
My intuition is that gnome-shell tries to save some configuration information about window positions, in response to the windows being opened or closed. That means writing to the disk. And then you've lost. The smoothness of your mouse cursor now depends on a spinning lump of metal, and on a kernel IO stack which admits to having some very ugly behaviour on spinning metal and fast flash.
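One way to get a feel for how long a small synced write can take under this load, independent of gnome-shell, is a tiny probe like the one below. It is only a sketch under assumptions: fsync_probe and the scratch file name fsync-probe.tmp are made up, GNU dd and GNU date are assumed, and it says nothing about what gnome-shell itself writes.

```shell
# Write one 4k block with a forced fsync() and print the elapsed time in ms.
# Assumptions: GNU dd (conv=fsync) and GNU date (+%s%N);
# "fsync-probe.tmp" is a hypothetical scratch file name.
fsync_probe() {
    dir=${1:-.}
    start=$(date +%s%N)
    dd if=/dev/zero of="$dir/fsync-probe.tmp" bs=4k count=1 conv=fsync 2>/dev/null
    end=$(date +%s%N)
    rm -f "$dir/fsync-probe.tmp"
    echo $(( (end - start) / 1000000 ))
}

# Possible usage, while the large copy is in progress:
#   while sleep 1; do fsync_probe ~/tmp; done
```

If the printed numbers climb into the seconds while the copy runs, a thread that calls fsync() on the same filesystem could plausibly be blocked for a similar length of time.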
Why does the thing hang for so darn long? I dunno! I'm sure there's something that's more complex than all my assumptions.
Nov 25 19:44:27 alan-laptop gnome-shell: failed to commit changes to dconf: Timeout was reached
I saw the quoted message when reproducing the hang after switching to the BFQ IO scheduler. I then reproduced the hang with the default IO scheduler (CFQ), but I did not reproduce the message. But the message helped my thinking a bit, so I'm including it as a possibility.
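When comparing runs under different IO schedulers, it helps to record which one was actually active. The kernel lists the available schedulers in /sys/block/<device>/queue/scheduler and marks the active one with square brackets. A small sketch for pulling that name out follows; active_scheduler is a made-up helper, and sda in the usage line is an assumed device name.

```shell
# Print the active IO scheduler, i.e. the name inside [brackets],
# from a sysfs scheduler file such as "noop deadline [cfq]".
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p' "$1"
}

# Possible usage (sda is an assumption; substitute your disk):
#   active_scheduler /sys/block/sda/queue/scheduler
```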