Discussion: utf8 validation optimization
Sorry for using GitLab for discussions instead of Discourse, but I just can't keep track of these things across all the web properties, and given the engineering rigor needed for a proper conversation on this, GitLab seems best.
Currently, our UTF-8 validator is straight out of 1999. It is full of data-dependent branches, though it performs "okay" because compilers and CPUs optimize for code like it, assuming mostly-ASCII input.

Throughput-wise, it's probably safe to assume it achieves 1/4 to 1/10 of what it could for strings longer than a couple dozen bytes.
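For illustration, here is a minimal sketch of the kind of branch-per-byte loop being described. It's not our actual code, and it omits the overlong-sequence, surrogate, and range checks a real validator like `g_utf8_validate()` must do; the point is just that every byte takes data-dependent branches:

```cpp
// Sketch only: a byte-at-a-time validator with a branch per byte.
// Deliberately incomplete (no overlong/surrogate/range rejection).
static bool branchy_validate_utf8(const unsigned char *p, size_t len) {
  const unsigned char *end = p + len;
  while (p < end) {
    if (*p < 0x80) { p++; continue; }            // ASCII fast path
    size_t n;
    if      ((*p & 0xE0) == 0xC0) n = 1;         // 2-byte sequence
    else if ((*p & 0xF0) == 0xE0) n = 2;         // 3-byte sequence
    else if ((*p & 0xF8) == 0xF0) n = 3;         // 4-byte sequence
    else return false;                           // invalid lead byte
    if ((size_t)(end - p) <= n) return false;    // truncated sequence
    for (size_t i = 1; i <= n; i++)              // continuation checks:
      if ((p[i] & 0xC0) != 0x80) return false;   // yet more branches
    p += n + 1;
  }
  return true;
}
```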
WebKit has already transitioned to https://github.com/simdutf/simdutf, which may be useful for us to look at (or crib from, if a C++11 compiler were a required dependency).
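For reference, simdutf's validation entry point is a single call that dispatches to the best SIMD implementation it detects at runtime. A minimal sketch, assuming we vendored the library and accepted the C++11 dependency:

```cpp
// Sketch of calling simdutf's public validation API.
#include "simdutf.h"

bool is_valid_utf8(const char *buf, size_t len) {
  // Runtime-dispatched (AVX2, NEON, etc.) UTF-8 validation.
  return simdutf::validate_utf8(buf, len);
}
```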
Just moving to a branchless decoder is not necessarily going to yield improvements in our average case (say, GTK); in fact, in many common cases I've tested here it ended up slower. For cases like libsoup, though, it very well could, especially if we were to implement a GConverter on top of it instead of iconv.
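One way to get the best of both, if benchmarks bore it out, would be to dispatch on length rather than replacing the scalar path outright. Everything below is hypothetical: the function name, the 64-byte cutoff, and the assumption that the two paths from the earlier sketches exist.

```cpp
// Hedged sketch: dispatch between scalar and SIMD validation by length.
// The 64-byte threshold is illustrative, not measured.
static bool utf8_validate(const char *buf, size_t len) {
  if (len < 64)
    // Short strings (the GTK-ish average case): SIMD setup cost can
    // dominate, so the branchy scalar loop may win here.
    return branchy_validate_utf8((const unsigned char *)buf, len);
  // Bulk data (the libsoup-ish case): hand off to the vectorized path.
  return simdutf::validate_utf8(buf, len);
}
```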
So to keep the discussion on-topic:
- Do we have an appetite for improving UTF-8 performance?
- Do we want to defer it to somewhere else in the stack (ICU, etc.)?
- If we do improve it from the GLib API, can we tolerate a C++11 compiler dependency? All our supported platforms currently have one.
- Do we trust something that is in WebKit to be of high enough quality for GLib? (Presumably yes.)