hwaccel-nvidia: Reduce global memory access in BGRX_TO_YUV420 kernel

Quoting the main commit here:

Global memory access (accessing data on the heap) on the GPU is slow
during compute operations.
Although it cannot be avoided in most situations, the number of
accesses can still be reduced.

Every pixel of every frame received via PipeWire has a size of
4 bytes. In the BGRX_TO_YUV420 CUDA kernel, read each pixel once as a
uint32_t instead of reading it three times as a uint8_t, once per
component.
Then extract the component values from the local uint32_t variable.
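
The access pattern described above can be sketched roughly as follows. This is a minimal illustration, not the kernel from this merge request: the kernel name `bgrx_to_luma`, its parameters, and the BT.601 limited-range luma formula are assumptions, and only the Y plane is computed to keep the example short.

```cuda
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel (not the one from this merge request): converts
// BGRX pixels to luma only, to show the single 32-bit load per pixel.
__global__ void bgrx_to_luma(const uint32_t *src, uint8_t *dst_y,
                             int width, int height, int src_stride_px)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height)
        return;

    // One 32-bit global memory load instead of three 8-bit loads.
    uint32_t pixel = src[y * src_stride_px + x];

    // NVIDIA GPUs are little-endian, so B (the first byte in memory)
    // ends up in the lowest byte of the loaded word.
    uint32_t b = pixel & 0xFF;
    uint32_t g = (pixel >> 8) & 0xFF;
    uint32_t r = (pixel >> 16) & 0xFF;

    // Assumed conversion: BT.601 limited-range integer approximation.
    dst_y[y * width + x] =
        (uint8_t)(((66 * r + 129 * g + 25 * b + 128) >> 8) + 16);
}

int main()
{
    const int width = 4, height = 2;
    uint32_t h_src[width * height];
    for (int i = 0; i < width * height; ++i)
        h_src[i] = 0x00FFFFFFu; // white pixel, X byte set to zero

    uint32_t *d_src = nullptr;
    uint8_t *d_y = nullptr;
    cudaMalloc(&d_src, sizeof(h_src));
    cudaMalloc(&d_y, width * height);
    cudaMemcpy(d_src, h_src, sizeof(h_src), cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    bgrx_to_luma<<<grid, block>>>(d_src, d_y, width, height, width);

    // The blocking copy also waits for the kernel to finish.
    uint8_t h_y[width * height];
    cudaMemcpy(h_y, d_y, sizeof(h_y), cudaMemcpyDeviceToHost);
    printf("Y[0] = %u\n", h_y[0]); // prints "Y[0] = 235" for white

    cudaFree(d_src);
    cudaFree(d_y);
    return 0;
}
```

The shifts and masks operate on a register-resident value, so they add no further global memory traffic. The source pointer must be 4-byte aligned for the `uint32_t` load, which holds for buffers returned by `cudaMalloc`.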

For FullHD frames on a GTX 660, this reduces the overall runtime of the
BGRX_TO_YUV420 CUDA kernel by about 140-150µs at the first and third
quartiles: from about 330µs to 182µs (first quartile) and from 412µs to
266µs (third quartile).
The average runtime drops from about 377µs to 228µs and the median
drops from about 394.5µs to 226µs.

For easier testing, this is based on !75 (merged)
Depends on !75 (merged) (commits included here)
Depends on !77 (merged) (commit included here)

Edited by Pascal Nowack
