xmlParseCharData will fall back to xmlParseCharDataComplex unnecessarily if a newline is the first byte in an input chunk
When SAX-parsing an ASCII document using an IO parser context (i.e.,
xmlInputReadCallback), and the first character of an input chunk is
\n or "newline" or "linefeed") then
xmlParseCharData() will fall back to
xmlParseCharDataComplex(). The fallback method has different memory and performance characteristics which can lead to an unnecessary parse failure.
Specifically, if this happens in the middle of a node whose text exceeds
XML_MAX_LOOKUP_LIMIT then, then the parsing will raise a fatal error, "Huge input lookup" (from
xmlGrow()). However, without the newline the document would be parsed successfully.
Reproducing the issue
I've created a small git repository containing code to reproduce this. First clone the repo:
git clone https://gist.github.com/flavorjones/7164487b5af273bef73fcb1a79f89d2d
Then compile the program and generate the "ok" document and the "fail" document:
Finally run the program:
You should see:
parse: file ok.xml parse: file fail.xml errorCallback: internal error: Huge input lookup
When SAX-parsing a document using an IO parser context, bytes will be read in chunks of 4000 (
xmlParseCharData gets to the end of a chunk, it peeks at the next byte (https://gitlab.gnome.org/GNOME/libxml2/-/blob/7929f05710134b9b243952019b6c14066cd3ac9e/parser.c#L4509) and if it doesn't appear to be ASCII, falls back to
However, this peek only looks for
(((*in >= 0x20) && (*in <= 0x7F)) || (*in == 0x09)), omitting other valid whitespace like
The example provided above demonstrates that this fallback is unnecessary and silent, and as a result can cause parse failures at worst and slower parsing at best.
This exact same problem also exists in
I will submit a PR shortly (and link to it in a comment) that replaces the
0x09 check with
IS_BLANK_CH which I think is more correct.