After b167c731, HTML parser does not recover from encoding errors
Given an input containing bytes that are invalid in its declared encoding (Shift_JIS):
<html>
<head>
<title>テスト</title>
<meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS">
</head>
<body>
<div>hello</div>
</body>
</html>
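To illustrate why this input counts as badly encoded, here is a minimal Python sketch. It assumes the file is actually saved as UTF-8; the report does not say so explicitly, but the error bytes 0x86 0xE3 0x82 0xB9 in the xmllint output are consistent with that. The helper name `decodes_as` is made up for this illustration and is not part of libxml2.

```python
# Assumption: the <title> text is stored on disk as UTF-8,
# while the <meta> tag declares charset=Shift_JIS.
title_bytes = "テスト".encode("utf-8")


def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True if `data` is a valid byte sequence in `encoding`."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    print("raw bytes:", title_bytes.hex(" "))
    print("valid UTF-8:    ", decodes_as(title_bytes, "utf-8"))
    print("valid Shift_JIS:", decodes_as(title_bytes, "shift_jis"))
```

Under that assumption, the title is valid UTF-8 but not valid Shift_JIS, so a parser that trusts the declared charset hits a conversion error partway through the title.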
libxml2 v2.10.4 with the HTML_PARSE_RECOVER flag set recovers fully and parses the whole document:
$ ./xmllint --version
/home/flavorjones/code/oss/libxml2/.libs/xmllint: using libxml version 21004-GITv2.10.4
compiled with: Threads Tree Output Push Reader Patterns Writer SAXv1 HTTP DTDValid HTML C14N Catalog XPath XPointer XInclude Iconv ISO8859X Unicode Regexps Automata Schemas Schematron Modules Debug Zlib Lzma
$ ./xmllint --html --recover --memory ../nokogiri/613-encoding-issue.html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<title>�e�X�g</title>
<meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS">
</head>
<body>
<div>hello</div>
</body>
</html>
libxml2 immediately prior to b167c731 still returns a document, though parsing stops at the point where the error was encountered:
$ ./xmllint --version
/home/flavorjones/code/oss/libxml2/.libs/xmllint: using libxml version 21100-GITv2.10.0-469-gb167c731
compiled with: Threads Tree Output Push Reader Patterns Writer SAXv1 HTTP DTDValid HTML C14N Catalog XPath XPointer XInclude Iconv ISO8859X Unicode Regexps Automata Schemas Schematron Modules Debug Zlib Lzma
$ ./xmllint --html --recover --memory ../nokogiri/613-encoding-issue.html
encoding error : input conversion failed due to input error, bytes 0x86 0xE3 0x82 0xB9
encoding error : input conversion failed due to input error, bytes 0x86 0xE3 0x82 0xB9
I/O error : encoder error
../nokogiri/613-encoding-issue.html:3: parser error : Growing input buffer
繝
^
../nokogiri/613-encoding-issue.html:3: parser error : Growing input buffer
繝
^
../nokogiri/613-encoding-issue.html:3: parser error : Growing input buffer
繝
^
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<title>繝</title>
</head>
</html>
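As a rough cross-check of the "input conversion failed" message above, the following Python sketch locates the reported bytes inside the title text. It again assumes the file is saved as UTF-8, and the variable names are illustrative only.

```python
# Assumption: the title "テスト" is stored as UTF-8 in the input file.
title_bytes = "テスト".encode("utf-8")

# Bytes quoted in the xmllint "encoding error" message.
reported = bytes([0x86, 0xE3, 0x82, 0xB9])

# Offset of the reported bytes within the title.
offset = title_bytes.find(reported)

# Everything before that offset still converts cleanly from Shift_JIS.
decoded_prefix = title_bytes[:offset].decode("shift_jis")

if __name__ == "__main__":
    print("offset of reported bytes:", offset)
    print("prefix decoded as Shift_JIS:", decoded_prefix)
```

With these assumptions, the reported bytes start two bytes into the title, and the two bytes before them decode from Shift_JIS to the single character 繝, which matches the truncated title in the output above.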
libxml2 after b167c731 does not recover; the parse function (e.g., htmlReadMemory) returns NULL:
$ ./xmllint --version
/home/flavorjones/code/oss/libxml2/.libs/xmllint: using libxml version 21100-GITv2.10.0-469-gb167c731
compiled with: Threads Tree Output Push Reader Patterns Writer SAXv1 HTTP DTDValid HTML C14N Catalog XPath XPointer XInclude Iconv ISO8859X Unicode Regexps Automata Schemas Schematron Modules Debug Zlib Lzma
$ ./xmllint --html --recover --memory ../nokogiri/613-encoding-issue.html
../nokogiri/613-encoding-issue.html:3: parser error : Invalid bytes in character encoding
繝
^
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<title></title>
</head>
</html>