Popup wayland surfaces are destroyed on surface enter
This bug was initially reported to Firefox in [1], and the observable effect is that popups frequently don't appear on scaled outputs. Firefox's WebRender (WR) Wayland backend uses gtk to place the popups, but renders to a subsurface that it creates and parents to the popup surface. It was determined that the popup isn't rendered because sway doesn't send a frame event to the parent surface, so sway issue [2] was opened.
I find that sway doesn't respond to the request because the surface has been destroyed in gtk by a call to gdk_window_hide [3] when sway sends the enter event for that surface. Firefox doesn't seem to know that the surface is dead, nor does there seem to be any gtk event it can listen to if it wanted to know. Is there a way for firefox to watch for this, or if not, can we make one?
I tried to determine what should be done about this, but it left me confused. It looks like the hide/show behavior on configure was introduced to fix [4], where popups sometimes have the wrong size. TBH, I don't think I understand the race that is stated to be the culprit. The patch apparently introduced another bug, [5], and a solution was attempted where gtk would avoid destroying its surface by attaching a NULL buffer instead [6], but that approach was scrapped because apparently mutter rejects that with a protocol error in its implementation of xdg-shell v6 [5] (comment 14).
Well, I think the NULL buffer approach sounds fine. AFAICT only mutter rejects this, and the rejection doesn't seem to be mandated by the (v6) protocol. I think wlroots used to allow this, but its v6 support was dropped years ago. Attaching a NULL buffer is allowed in xdg-shell stable, and everyone supports xdg-shell stable.
I tried the "minimal reproducer" offered in [4] (comment 17) in order to better understand the original bug and... I can still reproduce the bug from [4] in gtk 3.24.30. It's not very easy, maybe 1/100 clicks, but I still occasionally get extra space below the last option just like in the image from the original report. I'm not sure it was ever totally fixed.
So I'm hoping we can either:
1. Use the NULL buffer approach, or
2. Stop doing the hide/show thing introduced for [4], since it's kind of a hassle and doesn't seem to actually fix the issue it was meant to fix.
If not, is there at least some way for firefox to tell when the surface it requested a callback on was destroyed? If it's simple enough to revert the changes, I can probably send a patch for options 1 and 2. I don't think they're mutually exclusive either. I briefly tried option 1 and it does fix the firefox issue for me.
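To make the difference between the two hiding strategies concrete, here is a toy model in Python. It is not gtk, Firefox, or sway code: `ToySurface`, `run`, and every method name are hypothetical, and the rule that a repaint only fires callbacks for a live surface with a buffer attached is a simplifying assumption about compositor behavior. It only illustrates why a frame callback requested before the hide is lost when the surface is destroyed, but survives when a NULL buffer is attached instead.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ToySurface:
    """Hypothetical stand-in for a wl_surface as the compositor sees it."""
    alive: bool = True
    has_buffer: bool = True
    frame_callbacks: List[Callable[[], None]] = field(default_factory=list)

    def request_frame(self, callback: Callable[[], None]) -> None:
        # Models a frame request: queue a callback for a later repaint.
        self.frame_callbacks.append(callback)

    def hide_by_destroy(self) -> None:
        # Models the current behavior: hiding destroys the surface, so the
        # compositor has nothing left to send a frame event for.
        self.alive = False
        self.frame_callbacks.clear()

    def hide_by_null_buffer(self) -> None:
        # Models the proposed behavior: the surface is unmapped but kept,
        # so pending callbacks stay queued.
        self.has_buffer = False

    def show(self) -> None:
        # In the destroy case gtk would create a fresh surface here; the
        # callback queued on the old one is gone either way.
        if self.alive:
            self.has_buffer = True

    def repaint(self) -> int:
        # Toy compositor repaint (assumption): only a live surface with
        # content gets its frame callbacks fired.
        if not (self.alive and self.has_buffer):
            return 0
        pending, self.frame_callbacks = self.frame_callbacks, []
        for callback in pending:
            callback()
        return len(pending)


def run(hide: Callable[[ToySurface], None]) -> int:
    """Request a frame, hide on 'enter', show again, repaint; count callbacks."""
    surface = ToySurface()
    fired: List[str] = []
    surface.request_frame(lambda: fired.append("done"))
    hide(surface)      # what happens when the enter event arrives
    surface.show()     # the popup is shown again
    surface.repaint()
    return len(fired)
```

Calling `run(ToySurface.hide_by_destroy)` models the current hide/show path, and `run(ToySurface.hide_by_null_buffer)` models option 1; in this model only the latter ever delivers the callback that was requested before the hide.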
- [1] https://bugzilla.mozilla.org/show_bug.cgi?id=1722767
- [2] https://github.com/swaywm/sway/issues/6426
- [3] https://gitlab.gnome.org/GNOME/gtk/-/blob/d4e2d05cd9518ba04d6fbe1cbcec27142788ac95/gdk/wayland/gdkwindow-wayland.c#L1227
- [4] https://bugzilla.gnome.org/show_bug.cgi?id=772505
- [5] https://bugzilla.gnome.org/show_bug.cgi?id=773686
- [6] 0f2e19c0