Improve performance of argument marshalling
Submitted by Giovanni Campagna
In the past few days, I did some profiling on gjs. I knew that gjs is not very well optimized, but from what I saw, there is a lot to gain in the argument marshalling path.
So I decided to refactor it completely. I took the idea from PyGObject, and implemented an "argument cache", which is a system based on small data structures built once per function and vfuncs that handle the actual marshalling, without long switches.
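To make the idea concrete, here is a minimal sketch of what such an argument cache could look like. It is not the code from the branch: ScriptValue, NativeArg, TypeTag, ArgCache and all function names are hypothetical stand-ins (real code would deal with JS values, GIArgument and GITypeInfo). The point it tries to show is that the switch on the type runs once per function, when the cache is built, and each call then only goes through the stored function pointers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for a JS value and a native argument slot. */
typedef enum { VAL_INT, VAL_BOOL } ValueKind;
typedef struct { ValueKind kind; int64_t i; } ScriptValue;
typedef union { int32_t v_int32; bool v_boolean; } NativeArg;

/* Hypothetical type tags, as introspection data might report them. */
typedef enum { TAG_INT32, TAG_BOOLEAN, TAG_OTHER } TypeTag;

typedef struct ArgCache ArgCache;
typedef bool (*MarshalInFn)(const ArgCache *self,
                            const ScriptValue *in, NativeArg *out);

/* One small structure per argument, built once per function. */
struct ArgCache {
    MarshalInFn marshal_in; /* the "vfunc" chosen at build time */
    TypeTag tag;            /* kept only for the generic fallback */
};

static bool marshal_in_int32(const ArgCache *self,
                             const ScriptValue *in, NativeArg *out)
{
    (void) self;
    if (in->kind != VAL_INT || in->i < INT32_MIN || in->i > INT32_MAX)
        return false;
    out->v_int32 = (int32_t) in->i;
    return true;
}

static bool marshal_in_boolean(const ArgCache *self,
                               const ScriptValue *in, NativeArg *out)
{
    (void) self;
    out->v_boolean = in->i != 0;
    return true;
}

/* Stands in for the old generic path: anything without a dedicated
 * marshaller ends up here. This sketch simply reports failure. */
static bool marshal_in_fallback(const ArgCache *self,
                                const ScriptValue *in, NativeArg *out)
{
    (void) self; (void) in; (void) out;
    return false;
}

/* The only switch on the type: it runs once, when the cache is built. */
static void arg_cache_build(ArgCache *cache, const TypeTag *tags, size_t n_args)
{
    assert(cache != NULL && tags != NULL);
    for (size_t i = 0; i < n_args; i++) {
        cache[i].tag = tags[i];
        switch (tags[i]) {
        case TAG_INT32:   cache[i].marshal_in = marshal_in_int32;    break;
        case TAG_BOOLEAN: cache[i].marshal_in = marshal_in_boolean;  break;
        default:          cache[i].marshal_in = marshal_in_fallback; break;
        }
    }
}

/* Per-call path: no switch, just one indirect call per argument. */
static bool arg_cache_marshal_all(const ArgCache *cache, size_t n_args,
                                  const ScriptValue *in, NativeArg *out)
{
    for (size_t i = 0; i < n_args; i++)
        if (!cache[i].marshal_in(&cache[i], &in[i], &out[i]))
            return false;
    return true;
}
```

A caller would run arg_cache_build() the first time a function is resolved, keep the array next to the function object, and call arg_cache_marshal_all() on every invocation.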
So far, I have converted the GjsParamTypes handling in function.c and most of the regular in arguments. There is a generic fallback to arg.c for whatever is not handled yet, so the test suite passes. But before I go on with out arguments, and then the new arg cache builders for trampolines, I'd like to hear from you. Do you think this optimization is worthwhile? Or is it better to have the data structures (to avoid continuous allocations of GBaseInfos) but keep the giant switch (for instruction cache locality)?
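For comparison, the alternative asked about here (cached data structures, but still one big switch per call) could look roughly like the sketch below. Again, this is only an illustration under assumptions: CachedTag, CachedArgInfo, SlotValue and marshal_in_switch are invented names, and the JS value is reduced to a plain double. The type information is resolved once and stored, so nothing is allocated per call, but every call still dispatches through a single switch that stays hot in the instruction cache.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pre-resolved type information, stored once per argument
 * so that no introspection objects are allocated on each call. */
typedef enum { CACHED_INT32, CACHED_DOUBLE } CachedTag;
typedef struct { CachedTag tag; } CachedArgInfo;
typedef union { int32_t v_int32; double v_double; } SlotValue;

/* Single marshalling function: all types share one switch. */
static bool marshal_in_switch(const CachedArgInfo *info,
                              double js_number, SlotValue *out)
{
    assert(info != NULL && out != NULL);
    switch (info->tag) {
    case CACHED_INT32:
        /* Written so that NaN fails the range check too. */
        if (!(js_number >= INT32_MIN && js_number <= INT32_MAX))
            return false;
        out->v_int32 = (int32_t) js_number;
        return true;
    case CACHED_DOUBLE:
        out->v_double = js_number;
        return true;
    }
    return false;
}
```

Which of the two layouts wins depends on how indirect calls and branch prediction behave on the target machine, so it would have to be measured rather than guessed.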
Also, the profile shows that a non-trivial amount of time (~7% with the branch applied) is spent in ffi_call itself. I figured that it might make sense to codegen asm trampolines, at least for the major architectures. Obviously, this is a very dirty and quite dangerous approach, so before I attempt it, I'd like to know that it wouldn't be wasted work.
In any case, the branch is at https://github.com/gcampax/gjs/commits/arg-cache
The benchmark I used is at http://people.gnome.org/~gcampagna/gjs_optimization/bench2.js
Callgrind results with the branch: http://people.gnome.org/~gcampagna/gjs_optimization/callgrind.best
Callgrind results with master: http://people.gnome.org/~gcampagna/gjs_optimization/callgrind.comparison
TL;DR: a 43% speedup on invoke-intensive applications