xmlParseCharData will fall back to xmlParseCharDataComplex unnecessarily if a newline is the first byte in an input chunk
Brief Description
When SAX-parsing an ASCII document using an IO parser context (i.e., xmlInputReadCallback
), and the first character of an input chunk is 0x0a
(a.k.a. \n
or "newline" or "linefeed") then xmlParseCharData()
will fall back to xmlParseCharDataComplex()
. The fallback method has different memory and performance characteristics which can lead to an unnecessary parse failure.
Specifically, if this happens in the middle of a node whose text exceeds XML_MAX_LOOKUP_LIMIT
then, then the parsing will raise a fatal error, "Huge input lookup" (from xmlGrow()
). However, without the newline the document would be parsed successfully.
Reproducing the issue
I've created a small git repository containing code to reproduce this. First clone the repo:
git clone https://gist.github.com/flavorjones/7164487b5af273bef73fcb1a79f89d2d
Then compile the program and generate the "ok" document and the "fail" document:
make
Finally run the program:
./run-me
You should see:
parse: file ok.xml
parse: file fail.xml
errorCallback: internal error: Huge input lookup
Longer description
When SAX-parsing a document using an IO parser context, bytes will be read in chunks of 4000 (MINLEN
in xmlIO.c
). When xmlParseCharData
gets to the end of a chunk, it peeks at the next byte (https://gitlab.gnome.org/GNOME/libxml2/-/blob/7929f05710134b9b243952019b6c14066cd3ac9e/parser.c#L4509) and if it doesn't appear to be ASCII, falls back to xmlParseCharDataComplex
.
However, this peek only looks for (((*in >= 0x20) && (*in <= 0x7F)) || (*in == 0x09))
, omitting other valid whitespace like 0x0A
and 0x0D
.
The example provided above demonstrates that this fallback is unnecessary and silent, and as a result can cause parse failures at worst and slower parsing at best.
This exact same problem also exists in xmlParseComment
(https://gitlab.gnome.org/GNOME/libxml2/-/blob/7929f05710134b9b243952019b6c14066cd3ac9e/parser.c#L4990).
Suggested fix
I will submit a PR shortly (and link to it in a comment) that replaces the 0x09
check with IS_BLANK_CH
which I think is more correct.