Use primary GPU to copy for a secondary output
Mutter uses one device, the primary GPU, for all compositing. Other DRM devices are called secondary GPUs, regardless of whether they actually have a GPU. Content for outputs connected to secondary GPUs is copied from the primary GPU. This is done because a DRM device can often scan out only from its own memory, and rendering to another device's memory may be slow or not work at all.
The foremost path for the copy is using the secondary GPU to do a hardware accelerated copy straight from the primary GPU's memory. When that is not possible (driver support lacking, or secondary GPU is display-only), a `glReadPixels`-based CPU copy path is used, and the copy targets DRM dumb buffers allocated from the secondary GPU.
Naturally the CPU copy path is slow. This causes quite a lot of overhead in driving e.g. DisplayLink outputs, where the kernel DRM driver called EVDI is actually a virtual display-only driver. The CPU copy path also blocks Mutter for the copy duration.
This MR adds a third copy path: copy using the primary GPU writing into memory of the secondary GPU.
This attempts a somewhat notorious operation: using a GPU to render into a DRM dumb buffer. Normally that is illegal, but in this case we already have a working fallback just in case it doesn't work. The DRM dumb buffer is allocated on the secondary GPU device to ensure it will work for KMS on the secondary's outputs.
The primary GPU copy is only attempted if the secondary GPU copy path has already been deemed unavailable. Therefore this MR cannot regress any "PRIME" use cases that are already using the secondary GPU to do the copy. The primary GPU copy path can only be used where previously the CPU copy path was used, and there, when it works, it should be a major performance win.
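The fallback order described above can be sketched as follows. This is a minimal illustration, not code from the patch series: the enum, the struct and `choose_copy_mode` are hypothetical names, and the two capability flags stand in for whatever probing the renderer actually does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names for illustration; not Mutter's actual identifiers. */
typedef enum
{
  COPY_MODE_SECONDARY_GPU, /* secondary GPU reads from primary GPU memory */
  COPY_MODE_PRIMARY_GPU,   /* primary GPU writes into a secondary dumb buffer */
  COPY_MODE_CPU            /* glReadPixels into a secondary dumb buffer */
} CopyMode;

/* Assumed result of probing which accelerated copies work on this setup. */
typedef struct
{
  bool secondary_gpu_copy_works;
  bool primary_gpu_copy_works;
} CopyCapabilities;

static CopyMode
choose_copy_mode (const CopyCapabilities *caps)
{
  assert (caps != NULL);

  /* Existing PRIME setups keep using the secondary GPU copy. */
  if (caps->secondary_gpu_copy_works)
    return COPY_MODE_SECONDARY_GPU;

  /* Only tried where the CPU copy would have been used before. */
  if (caps->primary_gpu_copy_works)
    return COPY_MODE_PRIMARY_GPU;

  /* Working fallback if rendering into the dumb buffer fails. */
  return COPY_MODE_CPU;
}
```

Because the primary GPU branch is reached only after the secondary GPU copy has been ruled out, a setup that already uses the secondary GPU copy never changes behaviour.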
My performance tests were performed on a Haswell desktop machine, with two normal monitors and a third monitor at 1080p 60 Hz through a DisplayLink dock D6000. I was running Mutter standalone in Wayland mode on all three monitors. For graphical load I used the infamous `glxgears` in its default size on the DL output, and it achieved roughly 58 fps in both cases. The results are:
| copy path | wall time per call | Mutter CPU | DLM CPU |
|---|---|---|---|
| CPU | `CopySharedFramebufferCpu` = 5 - 9 ms | 30 - 40 % | 35 - 45 % |
| prim. GPU | `CopySharedFramebufferPrimaryGpu` = 0.1 ms | 6 % | 50 % |
- wall time per call: See the patch adding Cogl tracing to renderer/native. This is the time Mutter spends blocked in the copy function. I didn't actually get Sysprof working, so I wrote a replacement `COGL_TRACE_BEGIN_SCOPED` that just `fprintf`s the elapsed time. The numbers are an eye-ball average from the trace log.
- Mutter CPU: CPU spent in Mutter according to `top`.
- DLM CPU: CPU spent in DisplayLinkManager according to `top`. DLM is the DisplayLink userspace daemon that relays the images from EVDI kernel driver to the USB device (the dock).
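For readers who want to reproduce the wall-time measurement without Sysprof, a scoped timer that prints the elapsed time can be sketched roughly as below. This is not the replacement macro actually used for the numbers above, which is not shown in this description. `TRACE_BEGIN_SCOPED`, `ScopedTrace` and the helper functions are hypothetical names, and the sketch assumes GCC or Clang (for the `cleanup` variable attribute) and a POSIX `clock_gettime`.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L /* for clock_gettime() */
#endif
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical stand-in for a scoped trace; not the Cogl implementation. */
typedef struct
{
  const char *name;
  struct timespec begin;
} ScopedTrace;

static double
elapsed_ms (const struct timespec *begin, const struct timespec *end)
{
  return (end->tv_sec - begin->tv_sec) * 1000.0 +
         (end->tv_nsec - begin->tv_nsec) / 1e6;
}

static ScopedTrace
scoped_trace_begin (const char *name)
{
  ScopedTrace trace = { .name = name };

  assert (name != NULL);
  clock_gettime (CLOCK_MONOTONIC, &trace.begin);
  return trace;
}

/* Runs automatically when the traced variable goes out of scope. */
static void
scoped_trace_end (ScopedTrace *trace)
{
  struct timespec now;

  clock_gettime (CLOCK_MONOTONIC, &now);
  fprintf (stderr, "%s: %.3f ms\n",
           trace->name, elapsed_ms (&trace->begin, &now));
}

#define TRACE_BEGIN_SCOPED(var, label) \
  __attribute__ ((cleanup (scoped_trace_end))) \
  ScopedTrace var = scoped_trace_begin (label)

/* Usage: everything until the end of the function body is timed. */
static void
copy_example (void)
{
  TRACE_BEGIN_SCOPED (trace, "CopyExample");
  /* ... the copy being measured would go here ... */
}
```

A monotonic clock is used so that wall-clock adjustments do not distort the measured duration. Each call to `copy_example` writes one line with the label and the elapsed milliseconds to stderr.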
Note: The CPU consumption in the CPU copy case varied wildly, so the numbers are inaccurate. Also, CPU frequency scaling was enabled, which may make the two paths not directly comparable. Another factor is that in the primary GPU copy path, DLM might be able to push through more frames, but it won't affect Mutter or `glxgears` fps because of EVDI. However, the ratio of Mutter vs. DLM CPU consumption should be reliable enough, and the significant difference between CPU and primary GPU paths leaves no room for doubt.
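To put the wall-time column in perspective, the share of one refresh period that Mutter spends blocked in the copy can be worked out from the figures in the table and the 60 Hz refresh rate of the test monitor. The helper below is a hypothetical illustration of that arithmetic only; it says nothing about how the CPU percentages from `top` were obtained.

```c
#include <assert.h>

/* Fraction of one refresh period spent blocked in a copy taking copy_ms. */
static double
blocked_fraction (double copy_ms, double refresh_hz)
{
  assert (refresh_hz > 0.0);

  double frame_ms = 1000.0 / refresh_hz; /* about 16.7 ms at 60 Hz */

  return copy_ms / frame_ms;
}
```

At 60 Hz, a 5 ms copy blocks for 0.30 of the frame period and a 9 ms copy for 0.54, while the 0.1 ms primary GPU copy blocks for 0.006.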
The patch series has these parts:
- Trivial clean-up.
- Preparing to make `cogl_blit_framebuffer` usable between onscreen and offscreen framebuffers, by lifting restrictions that the ANGLE extension has for `glBlitFramebuffer`. Support for that one ANGLE extension is removed.
- `cogl_blit_framebuffer` is exposed for renderer/native to use.
- Consolidating with `meta_egl_create_dmabuf_image` and Wayland dma-buf.
- Renderer/native preparation for the new copy mode.
- The new copy path, and copy mode renames.
- Trace annotations.