Skip to content
GitLab
  • Menu
Projects Groups Snippets
  • Help
    • Help
    • Support
    • Community forum
    • Submit feedback
    • Contribute to GitLab
  • Sign in / Register
  • L libxml2
  • Project information
    • Project information
    • Activity
    • Labels
    • Members
  • Repository
    • Repository
    • Files
    • Commits
    • Branches
    • Tags
    • Contributors
    • Graph
    • Compare
  • Issues 92
    • Issues 92
    • List
    • Boards
    • Service Desk
    • Milestones
  • Merge requests 4
    • Merge requests 4
  • CI/CD
    • CI/CD
    • Pipelines
    • Jobs
    • Schedules
  • Deployments
    • Deployments
    • Environments
    • Releases
  • Packages & Registries
    • Packages & Registries
    • Package Registry
    • Container Registry
    • Infrastructure Registry
  • Monitor
    • Monitor
    • Incidents
  • Analytics
    • Analytics
    • Value stream
    • CI/CD
    • Repository
  • Wiki
    • Wiki
  • Activity
  • Graph
  • Create a new issue
  • Jobs
  • Commits
  • Issue Boards
Collapse sidebar
  • GNOME
  • libxml2
  • Issues
  • #192
Closed
Open
Created Oct 10, 2020 by Mike Dalessio@flavorjonesContributor

xmlParseCharData will fall back to xmlParseCharDataComplex unnecessarily if a newline is the first byte in an input chunk

Brief Description

When SAX-parsing an ASCII document using an IO parser context (i.e., xmlInputReadCallback), and the first character of an input chunk is 0x0a (a.k.a. \n or "newline" or "linefeed") then xmlParseCharData() will fall back to xmlParseCharDataComplex(). The fallback method has different memory and performance characteristics which can lead to an unnecessary parse failure.

Specifically, if this happens in the middle of a node whose text exceeds XML_MAX_LOOKUP_LIMIT then, then the parsing will raise a fatal error, "Huge input lookup" (from xmlGrow()). However, without the newline the document would be parsed successfully.

Reproducing the issue

I've created a small git repository containing code to reproduce this. First clone the repo:

git clone https://gist.github.com/flavorjones/7164487b5af273bef73fcb1a79f89d2d

Then compile the program and generate the "ok" document and the "fail" document:

make

Finally run the program:

./run-me

You should see:

parse: file ok.xml
parse: file fail.xml
errorCallback: internal error: Huge input lookup

Longer description

When SAX-parsing a document using an IO parser context, bytes will be read in chunks of 4000 (MINLEN in xmlIO.c). When xmlParseCharData gets to the end of a chunk, it peeks at the next byte (https://gitlab.gnome.org/GNOME/libxml2/-/blob/7929f05710134b9b243952019b6c14066cd3ac9e/parser.c#L4509) and if it doesn't appear to be ASCII, falls back to xmlParseCharDataComplex.

However, this peek only looks for (((*in >= 0x20) && (*in <= 0x7F)) || (*in == 0x09)), omitting other valid whitespace like 0x0A and 0x0D.

The example provided above demonstrates that this fallback is unnecessary and silent, and as a result can cause parse failures at worst and slower parsing at best.

This exact same problem also exists in xmlParseComment (https://gitlab.gnome.org/GNOME/libxml2/-/blob/7929f05710134b9b243952019b6c14066cd3ac9e/parser.c#L4990).

Suggested fix

I will submit a PR shortly (and link to it in a comment) that replaces the 0x09 check with IS_BLANK_CH which I think is more correct.

Assignee
Assign to
Time tracking