glib/utf8: Use SIMD for UTF-8 validation

This is based on the c-utf8 project (https://github.com/c-util/c-utf8) and has been adapted for portability and integration into GLib. c-utf8 is dual-licensed under Apache-2.0 and LGPLv2.1+; the latter matches GLib's license.

Notably, GNU case-range labels in the style of case 0x01 ... 0x7F: have been converted to if/else chains. Case ranges are a GCC/Clang extension, so if/else is more portable to other compilers, and it generates the same assembly, at least on x86_64 with GCC.
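
To illustrate the conversion, here is a minimal sketch of a lead-byte classifier written with plain range comparisons. The function name utf8_sequence_length() and the simplified ranges are hypothetical and are not taken from this merge request or from c-utf8; the comments show the GNU case-range label each comparison would replace.

```c
#include <stddef.h>

/* Hypothetical helper, for illustration only: returns the length of the
 * UTF-8 sequence introduced by lead byte `c`, or 0 if `c` cannot start
 * a sequence. Second-byte restrictions (e.g. after 0xE0 or 0xF4) are
 * deliberately left out to keep the sketch short. */
static size_t
utf8_sequence_length (unsigned char c)
{
  if (c >= 0x01 && c <= 0x7F)        /* GNU form: case 0x01 ... 0x7F: */
    return 1;
  else if (c >= 0xC2 && c <= 0xDF)   /* GNU form: case 0xC2 ... 0xDF: */
    return 2;
  else if (c >= 0xE0 && c <= 0xEF)   /* GNU form: case 0xE0 ... 0xEF: */
    return 3;
  else if (c >= 0xF0 && c <= 0xF4)   /* GNU form: case 0xF0 ... 0xF4: */
    return 4;
  else
    return 0; /* NUL, continuation bytes, 0xC0/0xC1, 0xF5 and above */
}
```

The if/else chain uses only standard C, so it compiles unchanged with MSVC and other compilers that lack the case-range extension.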

Additionally, __attribute__((aligned(n))) is used instead of __builtin_assume_aligned() because it maps directly onto MSVC's __declspec(align(n)) and generates the same assembly as __builtin_assume_aligned() with GCC.
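
As a rough sketch of why the attribute form ports more easily, a single macro can select between the GCC/Clang and MSVC spellings. The names EXAMPLE_ALIGNED, example_buffer and example_buffer_is_aligned() are hypothetical and do not appear in this merge request; the actual patch may apply the attribute differently (for example to a type rather than a variable).

```c
#include <stdint.h>

/* Hypothetical portability macro: both spellings request an alignment
 * at declaration time, so one macro can cover both compiler families.
 * __builtin_assume_aligned() has no such one-to-one MSVC counterpart. */
#if defined(_MSC_VER)
# define EXAMPLE_ALIGNED(n) __declspec(align(n))
#else
# define EXAMPLE_ALIGNED(n) __attribute__((aligned(n)))
#endif

/* A buffer the compiler must place on a 16-byte boundary. */
static EXAMPLE_ALIGNED(16) unsigned char example_buffer[64];

/* Returns non-zero if the buffer really starts on a 16-byte boundary. */
static int
example_buffer_is_aligned (void)
{
  return ((uintptr_t) example_buffer % 16) == 0;
}
```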

With GCC on x86_64 Linux, on a Xeon 4214, this improves the throughput of g_utf8_validate() for ASCII input from 750 MB/s to around 10,000 MB/s (13x).

On GCC aarch64 Linux with an Apple Silicon M2 Pro we go from about 2,200 MB/s to 26,700 MB/s (12x).
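
The ASCII gains come from examining several bytes per iteration instead of one. The following is a minimal word-at-a-time sketch of that idea, not the code in this merge request: ascii_prefix_length() is a hypothetical name, the sketch takes an explicit length, and it does not treat NUL bytes specially the way g_utf8_validate() does.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns how many leading bytes of `str` have
 * the high bit clear (i.e. are ASCII), checking 8 bytes at a time. */
static size_t
ascii_prefix_length (const char *str, size_t len)
{
  const uint64_t high_bits = UINT64_C (0x8080808080808080);
  size_t i = 0;

  /* Fast path: a word is all-ASCII if no byte has bit 7 set. */
  while (len - i >= sizeof (uint64_t))
    {
      uint64_t word;

      /* memcpy() avoids unaligned-access and aliasing problems;
       * compilers typically turn it into a single load. */
      memcpy (&word, str + i, sizeof word);
      if (word & high_bits)
        break;
      i += sizeof word;
    }

  /* Slow path: finish the tail, or locate the first non-ASCII byte
   * inside the word that failed the check above. */
  while (i < len && ((unsigned char) str[i] & 0x80) == 0)
    i++;

  return i;
}
```

A validator built this way would fall back to full multi-byte decoding only from the offset this function returns, which is why pure-ASCII input benefits the most.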

Closes: #3481

Edited by Christian Hergert
