798bdf13 changes the HTML parser's recovery from '<' characters
Summary
Before v2.9.13, the HTML parser in "recovery" mode would parse a string containing a bare '<' character and convert that character into the &lt; entity.
Starting in v2.9.13, the parser behaves identically with and without the "recovery" parse option: everything from the '<' character up to the next start tag is dropped from the parsed document.
The commit log message for 798bdf13 appears to say that the '<' should be emitted as text in this case, so I'd love to better understand whether this was the intended behavior.
In particular, when parsing ill-formed HTML4 documents, the v2.9.12 behavior is what most users will probably expect.
Reproduction
Create a file test/HTML/entities3.html containing:
<html>
<body>
<div>this < that</div>
<div>second element</div>
</body>
</html>
With libxml 2.9.12:
$ ./xmllint --version --html --recover test/HTML/entities3.html
/home/flavorjones/code/oss/libxml2/.libs/xmllint: using libxml version 20912-GITv2.9.12
compiled with: Threads Tree Output Push Reader Patterns Writer SAXv1 FTP HTTP DTDValid HTML Legacy C14N Catalog XPath XPointer XInclude Iconv ISO8859X Unicode Regexps Automata Schemas Schematron Modules Debug Zlib Lzma
test/HTML/entities3.html:3: HTML parser error : htmlParseStartTag: invalid element name
<div>this < that</div>
^
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<body>
<div>this &lt; that</div>
<div>second element</div>
</body>
</html>
With libxml 2.9.13:
$ ./xmllint --version --html --recover test/HTML/entities3.html
/home/flavorjones/code/oss/libxml2/.libs/xmllint: using libxml version 20913-GITv2.9.13
compiled with: Threads Tree Output Push Reader Patterns Writer SAXv1 FTP HTTP DTDValid HTML Legacy C14N Catalog XPath XPointer XInclude Iconv ISO8859X Unicode Regexps Automata Schemas Schematron Modules Debug Zlib Lzma
test/HTML/entities3.html:3: HTML parser error : htmlParseStartTag: invalid element name
<div>this < that</div>
^
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<body>
<div>this
<div>second element</div>
</div>
</body>
</html>
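For convenience, the reproduction steps above can be scripted roughly as follows. This is only a sketch under a few assumptions that are not part of the report: it writes the input to a temporary directory rather than test/HTML/entities3.html, and it uses a hypothetical XMLLINT environment variable to select which build to run (for example XMLLINT=./xmllint from a source tree). The xmllint flags are the ones used above.

```shell
#!/bin/sh
# Sketch only: the temporary directory is an assumption; the report itself
# places the file at test/HTML/entities3.html inside a libxml2 checkout.
workdir=$(mktemp -d)
input="$workdir/entities3.html"

# Quoted heredoc delimiter so the bare '<' is written literally.
cat > "$input" <<'EOF'
<html>
<body>
<div>this < that</div>
<div>second element</div>
</body>
</html>
EOF

# XMLLINT is a hypothetical override for choosing a specific build;
# it falls back to whatever xmllint is on PATH.
xmllint_bin=${XMLLINT:-xmllint}

if command -v "$xmllint_bin" >/dev/null 2>&1; then
    # Same flags as in the report. The exit status is ignored here because
    # only the serialized output matters for the comparison.
    "$xmllint_bin" --version --html --recover "$input" || true
else
    echo "xmllint not found; wrote $input only" >&2
fi
```

Running it once against a v2.9.12 build and once against a v2.9.13 build should make the difference in the first div visible side by side.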