
g_utf8_normalize: don't read past the end of the buffer

Todd Carson requested to merge toc/glib:normalize-utf8-bounds-checking into main

_g_utf8_normalize_wc() could read past the end of the provided buffer if it ends with a truncated multibyte character. If max_len is -1, it can continue reading until it encounters either a NUL or unreadable memory. Avoid this with extra bounds checks prior to g_utf8_get_char() to ensure that it does not read past either max_len or a NUL terminator.
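
As a rough illustration of the kind of check described above, here is a minimal sketch in plain C. It is not the patch itself: the helper names `utf8_expected_len` and `utf8_seq_is_complete` are hypothetical, and `end == NULL` stands in for the `max_len == -1` (NUL-terminated) case. It only confirms that a whole sequence is readable before a decoder would touch it; it does not validate the continuation bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Expected length of a UTF-8 sequence judged from its lead byte,
 * or 0 if the byte cannot start a sequence. */
static size_t
utf8_expected_len(unsigned char lead)
{
        if (lead < 0x80)
                return 1;
        if ((lead & 0xe0) == 0xc0)
                return 2;
        if ((lead & 0xf0) == 0xe0)
                return 3;
        if ((lead & 0xf8) == 0xf0)
                return 4;
        return 0;
}

/* Return true if the sequence starting at p can be read in full without
 * passing `end` (when end is non-NULL) or stepping over a NUL terminator.
 * Hypothetical helper; assumes p points into a readable buffer. */
static bool
utf8_seq_is_complete(const char *p, const char *end)
{
        size_t i, n;

        assert(p != NULL);
        if (end != NULL && p >= end)
                return false;
        n = utf8_expected_len((unsigned char)*p);
        if (n == 0)
                return false;
        for (i = 1; i < n; i++) {
                /* Check the bound first so p[i] is never read past end. */
                if (end != NULL && p + i >= end)
                        return false;
                if (p[i] == '\0')
                        return false;
        }
        return true;
}
```

A decoding loop would call `utf8_seq_is_complete()` before decoding each character and stop (or report an error) when it returns false, so a string ending in a lone `0xe2` is rejected without reading the bytes that follow it.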

If the result of _g_utf8_normalize_wc() were directly returned to the caller then this could be an exploitable infoleak in some applications, but the result is transformed from UCS-4 back to UTF-8 by g_ucs4_to_utf8(), which bails on invalid encodings rather than continuing as _g_utf8_normalize_wc() does. So in cases where _g_utf8_normalize_wc() read off the end of the buffer, g_ucs4_to_utf8() bails out and returns NULL. There's a potential to return a few bytes from past the end of a buffer in cases where g_utf8_normalize() is called on a string with no NUL terminator and length set by the max_len argument, and the next bytes past the end of the string are valid UTF-8 continuation bytes. I think that's sufficiently low-probability to treat this as a normal bug report and not a security report.
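
To make the leak scenario concrete, the following is a simplified stand-in for a decoder, not GLib's code. The names `decode3_unbounded`, `decode3_bounded`, and `DECODE_ERROR` are hypothetical. It handles only 3-byte sequences and shows how bytes that sit just past the caller's logical end of string get folded into the decoded code point when they happen to be valid continuation bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DECODE_ERROR UINT32_MAX

/* Decode one 3-byte UTF-8 sequence at p. The continuation bytes are
 * validated, but nothing stops the reads of p[1] and p[2] from landing
 * beyond the caller's logical end of string. */
static uint32_t
decode3_unbounded(const unsigned char *p)
{
        assert((p[0] & 0xf0) == 0xe0);
        if ((p[1] & 0xc0) != 0x80 || (p[2] & 0xc0) != 0x80)
                return DECODE_ERROR;
        return ((uint32_t)(p[0] & 0x0f) << 12) |
            ((uint32_t)(p[1] & 0x3f) << 6) |
            (uint32_t)(p[2] & 0x3f);
}

/* Same decoder, but refuse to look at anything beyond len bytes. */
static uint32_t
decode3_bounded(const unsigned char *p, size_t len)
{
        if (len < 3)
                return DECODE_ERROR;
        return decode3_unbounded(p);
}
```

Given the bytes `e2 82 ac` where only the first byte belongs to the caller's string (length 1), `decode3_unbounded()` yields U+20AC, so the two neighbouring bytes end up in the output, while `decode3_bounded()` with `len` 1 reports an error. If the neighbouring bytes are not continuation bytes, the unbounded version fails as well, which corresponds to the common case described above where the caller simply gets NULL.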

Discovered by fuzzing the mail indexer mu, which calls g_utf8_normalize() on non-validated strings and will crash on inputs with MIME parts that end with truncated multibyte characters.

Bug reproduced and patch tested on OpenBSD/amd64, macOS/arm64, and Linux/x86_64. Patch passes glib/tests/unicode-normalize and holds up against a day or so of fuzzing with AFL++. See also !3342 (merged) for a fuzzing harness.

Example program to reproduce below. Run it on a 4096-byte test file ending with a truncated multibyte UTF-8 character, for example the output of perl -e 'print(("A" x 4095) . "\x{e2}")'.

#include <sys/mman.h>
#include <sys/stat.h>

#include <err.h>
#include <fcntl.h>
#include <glib.h>
#include <stdio.h>
#include <stdlib.h>

int
main(int argc, char **argv)
{
        struct stat st;
        const char *path;
        char *in, *res;
        size_t len;
        int fd;

        if (argc != 2) {
                fprintf(stderr, "usage: %s <file>\n", argv[0]);
                return 2;
        }
        path = argv[1];

        if (0 > (fd = open(path, O_RDONLY)))
                err(1, "%s: %s", path, "open");
        if (0 != fstat(fd, &st))
                err(1, "%s: %s", path, "fstat");
        /* Round the mapping length up to a whole number of 4096-byte pages. */
        len = ((st.st_size + 4095) / 4096) * 4096;

        if (MAP_FAILED == (in = mmap(NULL, len, PROT_READ,
            MAP_PRIVATE, fd, 0)))
                err(1, "%s: %s", path, "mmap");

        /* max_len of -1: treat the input as NUL-terminated. */
        res = g_utf8_normalize(in, -1, G_NORMALIZE_ALL);
        if (!res)
                errx(1, "g_utf8_normalize returned NULL");

        return 0;
}
Edited by Todd Carson
