Even faster box blur
I replaced the generic convolution box blur with an optimized implementation.
The main optimization comes from not having to recompute the whole weighted sum for each pixel, which is possible because a box blur gives every pixel in the window the same weight. This brought the rendering time for the phone down from 29 seconds to 18 seconds on my PC.
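To illustrate the idea, here is a minimal one-dimensional sketch of the running-sum trick, not the actual code from the project. The function names, the use of f64 samples, and the edge handling (averaging only over in-bounds pixels) are all assumptions made for the example. The naive version re-adds the whole window for every pixel, while the running-sum version adds the one pixel that enters the window and subtracts the one that leaves it.

```rust
/// Naive 1D box blur: recomputes the full window sum for every output pixel.
/// Edge handling is an assumption for this sketch: the window is clipped to
/// the slice, and the average is taken over the in-bounds pixels only.
fn box_blur_naive(src: &[f64], radius: usize) -> Vec<f64> {
    let n = src.len();
    let mut dst = vec![0.0; n];
    for i in 0..n {
        let lo = i.saturating_sub(radius);
        let hi = i.saturating_add(radius).min(n - 1);
        let sum: f64 = src[lo..=hi].iter().sum();
        dst[i] = sum / (hi - lo + 1) as f64;
    }
    dst
}

/// Running-sum 1D box blur: keeps the window sum between pixels and only
/// adds the pixel entering on the right and subtracts the one leaving on
/// the left. This works because every pixel in the window has equal weight.
fn box_blur_running_sum(src: &[f64], radius: usize) -> Vec<f64> {
    let n = src.len();
    let mut dst = vec![0.0; n];
    if n == 0 {
        return dst;
    }

    // Window for pixel 0 covers indices lo..=hi.
    let mut lo = 0usize;
    let mut hi = radius.min(n - 1);
    let mut sum: f64 = src[lo..=hi].iter().sum();
    dst[0] = sum / (hi - lo + 1) as f64;

    for i in 1..n {
        // The right edge advances by at most one pixel per step.
        let new_hi = i.saturating_add(radius).min(n - 1);
        if new_hi > hi {
            sum += src[new_hi];
            hi = new_hi;
        }
        // The left edge advances by at most one pixel per step.
        let new_lo = i.saturating_sub(radius);
        if new_lo > lo {
            sum -= src[lo];
            lo = new_lo;
        }
        dst[i] = sum / (hi - lo + 1) as f64;
    }
    dst
}
```

The per-pixel cost of the running-sum version no longer depends on the radius, which is where the savings come from when blur radii are large. A real two-dimensional blur would apply the same pass over rows and then over columns.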
Next, I checked the WebKit code to see if I was doing anything suboptimally, and realized that I could move the division around a bit to reduce the number of divisions, and of floating-point operations in general. This brought the phone down further, from 18 seconds to 16 seconds, for a total reduction in rendering time of about 45%.
I also ran rsvg-bench with -p 1 -r 100 on two other files: the first dropped from 23.30 seconds to 2.82 seconds (!), and the second from 74.50 seconds to 55.47 seconds.
I also switched the benchmarks over to the Criterion crate, which provides very nice analysis and works on stable Rust, and added box blur benchmarks (which I used throughout these optimizations).