Profiling gjs performance with perf and hotspot
It's a known fact that gjs is not very well optimized for performance, so let's change that! Here's an introduction to using perf and hotspot to get fancy flamegraphs, and to analyzing them.
How to profile
- Install `perf` and `hotspot`, luckily both are in Fedora's repos.
- Now allow everyone to access perf events: `sudo sysctl kernel.perf_event_paranoid=-1`
- That's pretty much it, now you can gather data using perf: `perf record --call-graph dwarf,65528 gjs expensivestuff.js`

  Those options tell perf to record a call graph, using the DWARF method to unwind stacks and with an increased maximum call-stack depth. Note that this creates very large files (10 seconds of profiling and you're at 1 GB), so keep profiling sessions short. The higher you choose this number, the fewer frames with an unknown entry function will appear.

  You can also profile your running gnome-shell process by passing perf a pid: `perf record --call-graph dwarf,65528 -p PID`
- Look at your results using `hotspot perf.data` and select the "Flame Graph" view to get a fancy graph.
Things to profile and how to analyze results
Function call on a boxed-type object
To carefully analyze the bottlenecks gjs currently has, it's best to profile simple applications. Let's start with the performance of calling a function on a boxed-type object:
```js
const Graphene = imports.gi.Graphene;

function test() {
    const box = new Graphene.Box();
    for (let i = 0; i < 1000000; i++)
        box.get_width();
}

test();
```
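Before reaching for perf, it can be handy to put a rough number on the per-call cost with a plain timer. Here's a generic sketch (the `benchmark` helper is made up for illustration, not part of gjs, and the loop body is a stand-in for the real `box.get_width()` call):

```javascript
// Hypothetical micro-benchmark helper: runs fn() n times and
// returns the elapsed wall-clock time in milliseconds.
function benchmark(fn, n) {
    const start = Date.now();
    for (let i = 0; i < n; i++)
        fn();
    return Date.now() - start;
}

// Stand-in for box.get_width(); swap in the real call under gjs.
let width = 0;
const elapsed = benchmark(() => { width += 1; }, 1000000);
console.log(`1M calls took ${elapsed} ms`);
```

Dividing `elapsed` by the iteration count gives a ballpark per-call overhead you can compare before and after a fix; the flamegraph then tells you where that time actually goes.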
The flamegraph for it should look something like this:
What you can see here are a few interesting things:
- We spend most of the time executing the JS (the whole "??" block)
- We're using the Ion JIT compiler (`js::jit::IonGetProperty` is called; with the Baseline compiler it would be `js::jit::DoGetPropFallback`)
- Most of the time spent in JS goes into the JS->C machinery, `gjs_invoke_c_function()`, which is apparently very slow because it calls `g_registered_type_info_get_g_type()` twice. Looking at the code, this can easily be fixed by adding the GType to the argument cache.
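The fix is essentially memoization: resolve the GType once and keep it in the per-argument cache instead of asking girepository on every call. The actual gjs internals are C++, but the idea can be sketched in a few lines of JS (all names here, like `resolveGType` and `ArgCache`, are invented for illustration):

```javascript
// Hypothetical stand-in for g_registered_type_info_get_g_type(),
// counting how often the expensive lookup actually runs.
let lookups = 0;
function resolveGType(typeInfo) {
    lookups++;
    return `GType<${typeInfo}>`; // pretend this is an expensive call
}

// Sketch of an argument cache that stores the GType on first use.
class ArgCache {
    constructor(typeInfo) {
        this.typeInfo = typeInfo;
        this.gtype = null; // filled lazily, then reused
    }

    getGType() {
        if (this.gtype === null)
            this.gtype = resolveGType(this.typeInfo);
        return this.gtype;
    }
}

const cache = new ArgCache('GrapheneBox');
const first = cache.getGType();
const second = cache.getGType(); // served from the cache, no second lookup
```

With the cached value in place, repeated invocations of the same C function pay the lookup cost only once instead of twice per call.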
Getting a boxed-type property from a GObject
Now let's look at something else and get a boxed-type property from a GObject:
```js
const Clutter = imports.gi.Clutter;

function test() {
    const actor = new Clutter.Actor();
    for (let i = 0; i < 1000000; i++)
        actor.allocation;
}

test();
```
This time we get a flamegraph like this:
Some quick observations:
- We're using the Ion JIT compiler again, nice!
- A bit of time is spent on GC, freeing the boxed-type allocation property we retrieved
- Most of the time is spent getting the property, although not, as one might expect, inside `g_object_get_property()`, but in `gjs_value_from_g_value_internal()`, building a new JSObject for the C struct. What immediately comes to mind here would be introducing a C-struct -> JSObject cache.
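Such a cache could map the C struct's address to the wrapper object that was already built for it, so repeated reads of the same property hand back the same JSObject instead of constructing a new one each time. A toy sketch (addresses modeled as plain numbers, `wrapBoxed` invented as a stand-in for the wrapper-building path):

```javascript
// Toy model: "addresses" are numbers, wrappers are plain objects.
const wrapperCache = new Map();
let wrappersBuilt = 0;

// Stand-in for the wrapper-building step inside
// gjs_value_from_g_value_internal(): only construct a new JSObject
// if we haven't seen this struct address before.
function wrapBoxed(address) {
    let wrapper = wrapperCache.get(address);
    if (wrapper === undefined) {
        wrappersBuilt++;
        wrapper = { address }; // pretend this is the expensive construction
        wrapperCache.set(address, wrapper);
    }
    return wrapper;
}

const first = wrapBoxed(0xdeadbeef);
const second = wrapBoxed(0xdeadbeef); // cache hit, same object returned
```

In reality this is trickier than the sketch: the cache must not keep wrappers alive forever (it would need weak references), and it only helps when the underlying struct's identity and contents are stable between reads.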
gnome-shell overview animation layout process
Now that we know what getting properties and calling C functions looks like in a flamegraph, it's time to look at some real-world data. Let's have a look at the layout process of gnome-shell while opening the overview.
You might ask yourself "how do you know the layout process when opening the overview is something interesting to look at?", so it's time for a quick shout-out to sysprof: with sysprof marks, you can easily spot that the "Layout" mark takes about 6 ms of the 8 ms it takes us to produce a frame (the "Frame Clock (frame)" mark) during the overview animation with a lot of open windows. So obviously that's something to look into if we want to consistently stay inside the time budget we have for a single frame.
Now open a lot of windows, run perf attached to the gnome-shell process, and open and close the overview a bunch of times.
This is what the flamegraph will look like:
What we care about here is `handle_frame_clock_frame()`, which is called on every new frame (surprise!), and especially `clutter_stage_maybe_relayout()`, which is the entry point of the layout process. As you can see, recursion gets pretty deep in the layout phase, and there's a large chunk of JS we can't figure out an entry point for. That doesn't mean we have to ignore this chunk though: by looking at the C functions it calls into, most notably `clutter_actor_allocate()` and `clutter_actor_get_preferred_*` (hotspot's search feature comes in handy here, as it can be hard to spot those functions), we find out that most of it consists of traces from the layout phase, so the size of the `clutter_stage_maybe_relayout()` block here is probably misleading (which makes sense given that sysprof reported 6 out of 8 ms).
Now let's zoom in a little on the layout phase and look at the large block that goes `clutter_stage_maybe_relayout()` -> `st_widget_allocate()` -> JS -> `clutter_actor_allocate_available_size()`. With a little knowledge of the gnome-shell codebase we can figure out that this path must be allocating the `Workspace` actor, which is the actor responsible for laying out all the windows in the overview. Here's the upper part of that flamegraph:

What we see is that we enter a JS allocation function (it's the allocation function of the `WorkspaceLayout` layout manager), and a lot of the time in there is spent getting properties (including another round trip through JS land; it appears to be a boxed-type GObject property, which means we're probably looking at `WindowPreview.boundingBox`).
Now what can we learn here? Well, the most important lesson is that getting properties from GObjects seems to be very expensive and a major bottleneck. In gnome-shell we might want to avoid accessing such properties multiple times, and instead keep our own reference or copy of the property and use that. Also, generally (and as expected), recursive processes like Clutter's allocation machinery round-tripping between C and JS land are a bad idea and should probably avoid JS land altogether (also for the sake of stack-trace readability :P).
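The "keep your own copy" advice looks like this in practice. Here `FakePreview` is a made-up stand-in whose getter counts how often the expensive property machinery would run:

```javascript
// Stand-in for a GObject with an expensive boxed-type property getter;
// the counter models the C round trip each access would cost.
let getterCalls = 0;
class FakePreview {
    get boundingBox() {
        getterCalls++;
        return { x: 0, y: 0, width: 100, height: 100 };
    }
}

const preview = new FakePreview();

// Bad: every access pays the full property-machinery cost.
let slowArea = 0;
for (let i = 0; i < 1000; i++)
    slowArea = preview.boundingBox.width * preview.boundingBox.height;

// Better: read the property once and reuse the local copy.
const box = preview.boundingBox;
let fastArea = 0;
for (let i = 0; i < 1000; i++)
    fastArea = box.width * box.height;
```

The bad loop above triggers the getter 2000 times, the good one exactly once, for the same result. The caveat is that the cached copy can go stale, so this only works when you know the property doesn't change during the code path in question (an allocation run, for instance).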
More cool stuff you can do with perf and hotspot
Another neat feature I think is worth knowing about is that hotspot allows filtering the flamegraph by a specific timespan or thread using the "Time Line" in the bottom part of the window.
That was a quick introduction to the most important things. There are a lot more cool things you can do with perf and hotspot, for example off-CPU profiling to analyze synchronous IO or `sleep()` calls; look at the hotspot repo for how to do that. Maybe I'll do another writeup on that in the future, but I'll probably have to learn more about computers before I do that :)
Here's a very interesting talk about using heaptrack (another great profiling tool from the KDE community) and hotspot by David Faure.
What now?
Now go and have fun playing around with this and making GNOME faster :)
Before you start creating patches for gjs: I've already fixed the `g_registered_type_info_get_g_type()` occurrences when calling `gjs_invoke_c_function()` and the sometimes unnecessary calls into `gjs_get_string_id()` in `GIWrapperBase::resolve()`.
That was a lot more than I was planning to write, maybe I'll just reuse it for a blog post after all...