WIP: Automatic frame-clock-dispatch and frame-callback delaying for reducing input- and presentation latency

Ivan Molodetskikh requested to merge YaLTeR/mutter:⏲ into master

Includes and builds on top of !1484 (merged).

Prior art:

  • In Weston, compositing time is fixed at 7 ms and there's no frame callback delaying.
  • In sway both compositing time and frame callback delay are exposed as settings, although frame callbacks can only be delayed per-view (≈ toplevel surface).

COPR with this MR on top of stable Mutter which I'm using for dogfooding: https://copr.fedorainfracloud.org/coprs/yalter/mutter-repaint-scheduling/

Testing is appreciated! Ideally under different workloads and setups, look for:

  • no more frame drops than usual

  • lower latency if you can somehow measure / feel that

  • launch weston-presentation-shm and check f2p; it should be much lower than usual, ideally below one frame, which is 1000 / refresh rate in ms

    Example output on my laptop (60 Hz, one frame is 16.7 ms):

    387: f2c  0 ms, c2p  9 ms, f2p  9 ms, p2p 16643 us, t2p   9530, [sce_], seq 8073
    388: f2c  0 ms, c2p 10 ms, f2p 10 ms, p2p 16643 us, t2p   9459, [sce_], seq 8074
    389: f2c  0 ms, c2p 11 ms, f2p 11 ms, p2p 16643 us, t2p  10554, [sce_], seq 8075
    390: f2c  0 ms, c2p  9 ms, f2p  9 ms, p2p 16643 us, t2p   9336, [sce_], seq 8076
    391: f2c  0 ms, c2p 10 ms, f2p 10 ms, p2p 16643 us, t2p   9521, [sce_], seq 8077
    392: f2c  0 ms, c2p  9 ms, f2p  9 ms, p2p 16643 us, t2p   9247, [sce_], seq 8078
    393: f2c  0 ms, c2p  8 ms, f2p  8 ms, p2p 16643 us, t2p   8511, [sce_], seq 8079
    394: f2c  0 ms, c2p 10 ms, f2p 10 ms, p2p 16643 us, t2p   9677, [sce_], seq 8080
    395: f2c  0 ms, c2p  9 ms, f2p  9 ms, p2p 16643 us, t2p   9462, [sce_], seq 8081
    396: f2c  0 ms, c2p  9 ms, f2p  9 ms, p2p 16643 us, t2p   9233, [sce_], seq 8082

    When this MR is disabled:

    2343: f2c  0 ms, c2p 29 ms, f2p 29 ms, p2p 16656 us, t2p  28687, [sce_], seq 18899
    2344: f2c  0 ms, c2p 30 ms, f2p 30 ms, p2p 16656 us, t2p  29800, [sce_], seq 18900
    2345: f2c  0 ms, c2p 29 ms, f2p 29 ms, p2p 16656 us, t2p  29442, [sce_], seq 18901
    2346: f2c  0 ms, c2p 30 ms, f2p 30 ms, p2p 16657 us, t2p  30015, [sce_], seq 18902
    2347: f2c  0 ms, c2p 28 ms, f2p 28 ms, p2p 16656 us, t2p  28249, [sce_], seq 18903
    2348: f2c  1 ms, c2p 28 ms, f2p 29 ms, p2p 16656 us, t2p  28583, [sce_], seq 18904
    2349: f2c  0 ms, c2p 30 ms, f2p 30 ms, p2p 16656 us, t2p  30019, [sce_], seq 18905
    2350: f2c  0 ms, c2p 30 ms, f2p 30 ms, p2p 16656 us, t2p  30029, [sce_], seq 18906
    2351: f2c  0 ms, c2p 30 ms, f2p 30 ms, p2p 16657 us, t2p  30261, [sce_], seq 18907
    2352: f2c  0 ms, c2p 30 ms, f2p 30 ms, p2p 16656 us, t2p  30156, [sce_], seq 18908

    Example output on my desktop (144 Hz, one frame is 6.9 ms):

    863: f2c  0 ms, c2p  5 ms, f2p  5 ms, p2p  6944 us, t2p   5612, [sce_], seq 450109
    864: f2c  0 ms, c2p  5 ms, f2p  5 ms, p2p  6945 us, t2p   5591, [sce_], seq 450110
    865: f2c  0 ms, c2p  6 ms, f2p  6 ms, p2p  6944 us, t2p   6740, [sce_], seq 450111
    866: f2c  0 ms, c2p  5 ms, f2p  5 ms, p2p  6945 us, t2p   5604, [sce_], seq 450112
    867: f2c  0 ms, c2p  6 ms, f2p  6 ms, p2p  6944 us, t2p   5758, [sce_], seq 450113
    868: f2c  0 ms, c2p  6 ms, f2p  6 ms, p2p  6945 us, t2p   5701, [sce_], seq 450114
    869: f2c  1 ms, c2p  5 ms, f2p  6 ms, p2p  6944 us, t2p   5587, [sce_], seq 450115
    870: f2c  1 ms, c2p  5 ms, f2p  6 ms, p2p  6945 us, t2p   5483, [sce_], seq 450116
    871: f2c  0 ms, c2p  7 ms, f2p  7 ms, p2p  6944 us, t2p   6891, [sce_], seq 450117
    872: f2c  0 ms, c2p  6 ms, f2p  6 ms, p2p  6945 us, t2p   5687, [sce_], seq 450118

    When this MR is disabled:

    3986: f2c  0 ms, c2p 11 ms, f2p 11 ms, p2p  6945 us, t2p  10665, [sce_], seq 639159
    3987: f2c  0 ms, c2p 11 ms, f2p 11 ms, p2p  6944 us, t2p  10914, [sce_], seq 639160
    3988: f2c  0 ms, c2p 11 ms, f2p 11 ms, p2p  6945 us, t2p  10922, [sce_], seq 639161
    3989: f2c  0 ms, c2p 10 ms, f2p 10 ms, p2p  6944 us, t2p  10743, [sce_], seq 639162
    3990: f2c  0 ms, c2p  9 ms, f2p  9 ms, p2p  6945 us, t2p   9124, [sce_], seq 639163
    3991: f2c  0 ms, c2p 11 ms, f2p 11 ms, p2p  6944 us, t2p  10859, [sce_], seq 639164
    3992: f2c  1 ms, c2p 10 ms, f2p 11 ms, p2p  6945 us, t2p  10799, [sce_], seq 639165
    3993: f2c  0 ms, c2p 11 ms, f2p 11 ms, p2p  6944 us, t2p  10998, [sce_], seq 639166
    3994: f2c  0 ms, c2p 10 ms, f2p 10 ms, p2p  6945 us, t2p  10595, [sce_], seq 639167
    3995: f2c  0 ms, c2p 11 ms, f2p 11 ms, p2p  6944 us, t2p  10648, [sce_], seq 639168

To compare with the old behavior, or to just disable this MR if you run into an issue, open Looking Glass (press Alt+F2, type lg, press Enter) and run the following:

Meta.add_clutter_debug_flags(0, Clutter.DrawDebugFlag.DISABLE_DYNAMIC_MAX_RENDER_TIME, 0)
Meta.add_debug_paint_flag(Meta.DebugPaintFlag.DISABLE_FRAME_CALLBACK_DELAYING)

To re-enable this MR, use remove instead of add in the above commands.

Note that this MR is primarily for Wayland. The new version of this MR is currently Wayland-only and should have no effect on Xorg, but testing on Xorg is still welcome to make sure nothing's broken.

Max render time shows how early the frame clock needs to be dispatched
to make it to the predicted next presentation time. Before this commit
it was set to the refresh interval minus 2 ms. This means Mutter would
always start compositing 14.7 ms before a display refresh on a 60 Hz
screen or 4.9 ms before a display refresh on a 144 Hz screen. However,
Mutter frequently does not need that much time to finish compositing
and submit the buffer to KMS:

      max render time
      /------------\
---|---------------|---------------|---> presentations
      D----S          D--S

      D - frame clock dispatch
      S - buffer submission
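
For reference, the old fixed value boils down to a one-liner (the function name is illustrative, not Mutter's actual code):

```python
def old_max_render_time_us(refresh_interval_us):
    # Before this commit: fixed at the refresh interval minus 2 ms.
    return refresh_interval_us - 2_000

print(old_max_render_time_us(16_667))  # 60 Hz  -> 14667 us (~14.7 ms)
print(old_max_render_time_us(6_944))   # 144 Hz -> 4944 us  (~4.9 ms)
```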

This commit aims to automatically compute a shorter max render time,
making Mutter start compositing as late as possible while still making
it in time for the presentation:

         max render time
             /-----\
---|---------------|---------------|---> presentations
             D----S          D--S

Why is this better? First of all, Mutter samples application contents
at the moment compositing starts. If a new application buffer arrives
after compositing has started, but before the next presentation, it
won't make it on screen:

---|---------------|---------------|---> presentations
      D----S          D--S
        A-------------X----------->

                   ^ doesn't make it for this presentation

        A - application buffer commit
        X - application buffer sampled by Mutter

Here the application committed just a few milliseconds too late, so its
buffer didn't make it on screen until the next presentation. If
compositing starts later in the frame cycle, applications can commit
buffers closer to the presentation. These buffers are more up-to-date,
thereby reducing input latency.

---|---------------|---------------|---> presentations
             D----S          D--S
        A----X---->

                   ^ made it!

Moreover, applications are recommended to render their frames on frame
callbacks, which Mutter sends right after compositing is done. Since
this commit delays the compositing, it also reduces the latency for
applications drawing on frame callbacks. Compare:

---|---------------|---------------|---> presentations
      D----S          D--S
           F--A-------X----------->
              \____________________/
                     latency

---|---------------|---------------|---> presentations
             D----S          D--S
                  F--A-------X---->
                     \_____________/
                      less latency

           F - frame callback received, application starts rendering

So how do we actually estimate max render time? We want it to be as low
as possible, but still large enough so as not to miss any frames by
accident:

         max render time
             /-----\
---|---------------|---------------|---> presentations
             D------S------------->
                   oops, took a little too long

For a successful presentation, the frame needs to be submitted to KMS
and the GPU work must be completed before the vblank. This deadline can
be computed by subtracting the vblank duration (calculated from display
mode) from the predicted next presentation time.
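
As a rough sketch, assuming the vblank duration is the fraction of the refresh cycle spent outside the active scanout lines (names and the exact formula are illustrative, not Mutter's actual code):

```python
def vblank_duration_us(refresh_interval_us, vtotal, vdisplay):
    """Time spent in vertical blanking: the fraction of the refresh
    cycle outside the active scanout lines."""
    return refresh_interval_us * (vtotal - vdisplay) // vtotal

def deadline_us(next_presentation_us, refresh_interval_us, vtotal, vdisplay):
    """Latest time by which the buffer must be submitted to KMS and the
    GPU work finished in order to hit next_presentation_us."""
    return next_presentation_us - vblank_duration_us(
        refresh_interval_us, vtotal, vdisplay)

# A common 1080p@60 timing: 1125 total lines, 1080 of them visible.
print(deadline_us(100_000, 16_667, 1125, 1080))  # -> 99334
```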

We don't know how long compositing will take, and we also don't know how
long the GPU work will take, since clients can submit buffers with
unfinished GPU work. So we measure and estimate these values.

The frame clock dispatch can be split into two phases:
1. From start of the dispatch to all GPU commands being submitted (but
   not finished)—until the call to eglSwapBuffers().
2. From eglSwapBuffers() to submitting the buffer to KMS and to GPU
   work completing. These happen in parallel, and we want the latest of
   the two to be done before the vblank.

We measure these three durations and store them for the last 16 frames.
The estimate for each duration is the maximum of these last 16
durations. Usually even taking just the last frame's durations as the
estimates works well enough, but I found that screen capturing with OBS
Studio increases duration variability enough to cause frequent missed
frames with that method. Taking the maximum of the last 16 frames
smooths out this variability.

The durations are naturally quite variable and the estimates aren't
perfect. To take this into account, an additional constant 2 ms is added
to the max render time.
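
Put together, the estimation described above can be sketched like this (illustrative names, not Mutter's actual code; the 16-frame history and the 2 ms slack are the values from the text):

```python
from collections import deque

HISTORY_FRAMES = 16   # number of past frames to remember
SLACK_US = 2_000      # constant 2 ms added to absorb estimation error

class DurationEstimator:
    """Remembers the last HISTORY_FRAMES samples of one measured duration
    and estimates the next one as their maximum, smoothing out per-frame
    variability (e.g. from screen capture)."""

    def __init__(self):
        self.samples = deque(maxlen=HISTORY_FRAMES)

    def add(self, duration_us):
        self.samples.append(duration_us)

    def estimate(self):
        return max(self.samples) if self.samples else 0

def max_render_time_us(dispatch_to_swap, swap_to_kms_submit, swap_to_gpu_done):
    """Phase 1, plus the longer of the two parallel phase-2 paths
    (KMS submission vs. GPU completion), plus fixed slack."""
    return (dispatch_to_swap.estimate()
            + max(swap_to_kms_submit.estimate(), swap_to_gpu_done.estimate())
            + SLACK_US)
```

For example, with phase-1 samples peaking at 1.5 ms, a 0.5 ms KMS submission estimate and 0.9 ms of GPU work, this yields 1.5 + 0.9 + 2 = 4.4 ms.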

How does it perform in practice? On my desktop with 144 Hz monitors I
get a max render time of 4–5 ms instead of the default 4.9 ms (I had
1 ms manually configured in sway) and on my laptop with a 60 Hz screen I
get a max render time of 4.8–5.5 ms instead of the default 14.7 ms (I
had 5–6 ms manually configured in sway). Weston [1] went with a 7 ms
default.

The main downside is that if there's a sudden heavy batch of compositing
work which would've made it in the default 14.7 ms but doesn't make it
in the reduced 6 ms, a frame is delayed that otherwise would not have
been. Arguably, this happens rarely enough to be a good trade-off for
the reduced latency. One possible solution is a "next frame is expected
to be heavy" hint which manually increases max render time for the next
frame. This would avoid the single dropped frame at the start of complex
animations.

[1]: https://www.collabora.com/about-us/blog/2015/02/12/weston-repaint-scheduling/

Note that even in absence of a "next frame is expected to be heavy" function the heuristics can automatically increase max render time to beyond its previous value, potentially making heavy animations or app grid scrolling smoother past their first heavy frame.
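
Such a hint could be a one-shot boost, roughly like this (entirely hypothetical; neither this class nor this API exists in Mutter):

```python
class RenderTimeHint:
    """Hypothetical "next frame is expected to be heavy" hint: when the
    compositor knows the next frame will be unusually expensive (e.g. a
    complex animation is starting), fall back to a conservative max
    render time for that one frame only."""

    def __init__(self, conservative_us):
        self.conservative_us = conservative_us
        self.pending = False

    def expect_heavy_frame(self):
        self.pending = True

    def apply(self, estimated_us):
        if self.pending:
            self.pending = False
            return max(estimated_us, self.conservative_us)
        return estimated_us
```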

Remember this graph from last time?

    ---|---------------|---------------|---> presentations
                 D----S          D--S
                      F--A-------X---->
                         \_____________/
                          less latency

       D - frame clock dispatch
       S - buffer swap (we are essentially here in time)
       F - frame callback
       A - application commits a new buffer
       X - the buffer is sampled by the next compositing

Note that there's some time to be shaved off between A and X. This is
exactly what this commit does by delaying the frame callback. The end
result looks like this:

    ---|---------------|---------------|---> presentations
                 D----S          D--S
                             F--AX---->
                                \______/
                           even lower latency
                      \______/
                       delay

The delay takes into account the compositing time estimate from before
as well as the frame-to-commit duration estimate.
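
A rough sketch of that delay computation, under the assumption that the callback should fire just early enough for the client's typical frame-to-commit time to elapse before the next frame clock dispatch (names and the exact formula are illustrative):

```python
def frame_callback_delay_us(refresh_interval_us, max_render_time_us,
                            frame_to_commit_us):
    """Send the frame callback late enough that a client which typically
    takes frame_to_commit_us from callback to commit still gets its
    buffer in before the next frame clock dispatch. Clamped at zero so a
    slow client is never pushed past the dispatch."""
    return max(refresh_interval_us - max_render_time_us - frame_to_commit_us, 0)

print(frame_callback_delay_us(16_667, 5_000, 4_000))  # -> 7667
```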