Improve inline-ability of gsk_gpu_node_processor_add_glyph_node()
It looks like GTK is commonly compiled with `-O2` and without LTO. Some distributions do a bit better (like Fedora), but they seem to be the exception. GNOME OS, for example, uses plain `-O2`.
In that configuration, many calls in the hot paths of gskgpunodeprocessor.c are not being inlined.
Right now, `gsk_gpu_node_processor_add_glyph_node()` accounts for about 20% of gnome-text-editor samples. Looking at it, none of the following calls are inlined (verified with `disassemble gsk_gpu_node_processor_add_glyph_node` in gdb):
- `gsk_text_node_has_color_glyphs()`
- `gsk_gpu_frame_get_device()`
- `gsk_text_node_get_color()`
- `gsk_text_node_get_num_glyphs()`
- `gsk_text_node_get_glyphs()`
- `gsk_text_node_get_font()`
- `gsk_text_node_get_offset()`
- `graphene_vec2_get_x()` (×2 for PLT)
- `graphene_vec2_get_y()` (×2 for PLT)
- `gsk_gpu_frame_should_optimize()`
- `gsk_font_get_hint_style()`
- `roundf()` (×2 for PLT, inside inner loop)
- `gsk_gpu_device_lookup_glyph_image()` (inside inner loop)
- `gsk_gpu_image_get_height()` (inside inner loop)
- `gsk_gpu_image_get_width()` (inside inner loop)
- `gsk_gpu_node_processor_add_image()` (inside inner loop)
- `gsk_gpu_clip_get_shader_clip()` (inside inner loop)
- `gsk_gpu_texture_op()` (inside inner loop)
- `graphene_rect_scale()` (×2 for PLT, inside inner loop, though unlikely)
Only the calls inside the inner loop really matter.
Contrast this with `gsk_gl_render_job_visit_text_node()`, which has only a handful of calls into the uniform-buffering code inside its inner loop. That was deliberate: it made a difference in getting gnome-text-editor to a consistent 144 Hz while scrolling.
Currently, with NGL, I can only reach about 120 fps, with frequent dips to around 90 fps, whereas GL stays locked at 140+ fps.
`GskGpuImage` uses a Private (PIMPL) indirection because it has two subclasses (GL and Vulkan). However, the instance struct is already defined in the header, so the Private struct doesn't seem to add much. Additionally, reaching the private data requires a runtime pointer + offset computed from a dynamic load (the type's private offset), so it is not a candidate for LTO either (at least from what I've tried here with GCC). This matters for the `gsk_rect_scale()` usage.
Percentage-wise, `lookup_glyph_image()` is by far the most expensive of the inner-loop calls.