Regression in CDATA handling in lxml with libxml2 2.9.11
Commit 173a0830 (released in 2.9.11) introduced a regression that affects lxml; we spotted it through soupsieve's test suite.
CDATA sections in HTML input now get passed to the SAX2 callbacks as blocks of character data.
Attached are:
- A minimal reproducer in Python (lxml): repro.py
- A minimal reproducer in C showing the underlying change in behaviour: repro.c
When parsing the HTML input <html><body><![CDATA[test]]><p>test2</p></body></html>
with an HTML push parser, the character callbacks behave differently between versions:
With 2.9.12+dfsg-5 (Debian) libxml2, I get:
HTML parser error : htmlParseTryOrFinish: invalid element name
<html><body><![CDATA[test]]><p>test2</p></body></html>
^
char: < (1)
char: ![CDATA[test]]> (15)
char: test2 (5)
With 2.9.10+dfsg-6.7 (Debian) libxml2, I get:
:1: HTML parser error : htmlParseStartTag: invalid element name
<html><body><![CDATA[test]]><p>test2</p></body></html>
^
char: test2 (5)
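For readers without the attachments, the comparison above can be approximated from Python. This is a minimal sketch, not the attached repro.py: the names CharCollector, collect_char_events and cdata_leaked are made up for illustration, and it assumes that lxml's feed interface with a parser target drives libxml2's HTML push parser and forwards the character callbacks as data() calls. The exact chunking of the output may differ from the C reproducer.

```python
"""Sketch: record the character data an HTML feed parser reports via lxml."""

HTML = b"<html><body><![CDATA[test]]><p>test2</p></body></html>"


class CharCollector:
    """Hypothetical lxml parser target that records only character data."""

    def __init__(self):
        self.chunks = []

    def start(self, tag, attrib):
        pass  # element starts are not relevant here

    def end(self, tag):
        pass  # element ends are not relevant here

    def data(self, text):
        # Called for each block of character data the parser reports.
        self.chunks.append(text)

    def comment(self, text):
        pass  # ignore comments

    def close(self):
        # lxml returns this value from parser.close().
        return self.chunks


def collect_char_events(html):
    """Feed `html` to an lxml HTML parser and return the character chunks."""
    from lxml import etree  # third-party; imported lazily on purpose

    parser = etree.HTMLParser(target=CharCollector())
    parser.feed(html)
    return parser.close()


def cdata_leaked(chunks):
    """Return True if CDATA markup shows up inside the character data."""
    return "CDATA[" in "".join(chunks)


if __name__ == "__main__":
    from lxml import etree

    chunks = collect_char_events(HTML)
    print("libxml2 runtime version:", etree.LIBXML_VERSION)
    for chunk in chunks:
        print("char: %s (%d)" % (chunk, len(chunk)))
    print("CDATA markup leaked into character data:", cdata_leaked(chunks))
```

Running the script against both libxml2 builds should make the difference visible: going by the output quoted above, the CDATA markup would be expected among the reported chunks with 2.9.12 but not with 2.9.10.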
Related bugs:
- https://github.com/facelessuser/soupsieve/issues/220
- https://bugs.launchpad.net/beautifulsoup/+bug/1930164
- https://bugs.launchpad.net/lxml/+bug/1930224
FYI: @scoder