GNOME / libxml2 / Issues / #339
Issue created Feb 21, 2022 by Mike Dalessio (@flavorjones, Contributor)

798bdf13 changes the HTML parser's recovery from '<' characters

Summary

Before v2.9.13, the HTML parser in "recovery" mode would parse a string containing a bare < character and convert that character into the &lt; entity.

Starting in v2.9.13, the parser behaves identically with and without the "recovery" parse option: everything from the < character up to the next start tag is dropped from the parsed document.

The commit log message for 798bdf13 appears to say that the < should be emitted as text in this case. I'd love to better understand whether the new behavior was intended.

In particular, when parsing ill-formed HTML4 documents, the v2.9.12 behavior is what most users will probably expect.

Reproduction

Create a file test/HTML/entities3.html containing:

<html>
<body>
<div>this < that</div>
<div>second element</div>
</body>
</html>

With libxml2 v2.9.12:

$ ./xmllint --version --html --recover test/HTML/entities3.html
/home/flavorjones/code/oss/libxml2/.libs/xmllint: using libxml version 20912-GITv2.9.12
   compiled with: Threads Tree Output Push Reader Patterns Writer SAXv1 FTP HTTP DTDValid HTML Legacy C14N Catalog XPath XPointer XInclude Iconv ISO8859X Unicode Regexps Automata Schemas Schematron Modules Debug Zlib Lzma 
test/HTML/entities3.html:3: HTML parser error : htmlParseStartTag: invalid element name
<div>this < that</div>
           ^
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<body>
<div>this &lt; that</div>
<div>second element</div>
</body>
</html>

With libxml2 v2.9.13:

$ ./xmllint --version --html --recover test/HTML/entities3.html
/home/flavorjones/code/oss/libxml2/.libs/xmllint: using libxml version 20913-GITv2.9.13
   compiled with: Threads Tree Output Push Reader Patterns Writer SAXv1 FTP HTTP DTDValid HTML Legacy C14N Catalog XPath XPointer XInclude Iconv ISO8859X Unicode Regexps Automata Schemas Schematron Modules Debug Zlib Lzma 
test/HTML/entities3.html:3: HTML parser error : htmlParseStartTag: invalid element name
<div>this < that</div>
           ^
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<body>
<div>this 
<div>second element</div>
</div>
</body>
</html>
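
To make the comparison easy to repeat across builds, here is a minimal Python sketch that writes the same input, runs xmllint with the same --html --recover flags, and labels the serialized result. The names SAMPLE, classify, and run_xmllint are hypothetical helpers introduced for illustration, and the script assumes an xmllint binary can be found on PATH (or is passed in explicitly); it is not part of libxml2's test suite.

```python
import re
import shutil
import subprocess
import tempfile
from pathlib import Path
from typing import Optional

# Same input as test/HTML/entities3.html in the reproduction above.
SAMPLE = """<html>
<body>
<div>this < that</div>
<div>second element</div>
</body>
</html>
"""


def classify(serialized: str) -> str:
    """Label serialized output as 'escaped', 'dropped', or 'other'."""
    # v2.9.12-style recovery: the bare '<' survives as the &lt; entity.
    if "this &lt; that</div>" in serialized:
        return "escaped"
    # v2.9.13-style: the text after "this" is gone and the next
    # start tag follows directly inside the first div.
    if re.search(r"<div>this\s*<div>second element</div>", serialized):
        return "dropped"
    return "other"


def run_xmllint(xmllint: str = "xmllint") -> Optional[str]:
    """Run xmllint --html --recover on SAMPLE; return stdout, or None if not found."""
    exe = shutil.which(xmllint)
    if exe is None:
        return None
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "entities3.html"
        path.write_text(SAMPLE)
        # Parser errors go to stderr; the serialized document goes to stdout.
        proc = subprocess.run(
            [exe, "--html", "--recover", str(path)],
            capture_output=True,
            text=True,
        )
        return proc.stdout


if __name__ == "__main__":
    out = run_xmllint()
    print("xmllint not found" if out is None else classify(out))
```

Pointing run_xmllint at each build's ./xmllint in turn should distinguish the two behaviors shown above without diffing the full output by hand.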