segfault on in-context parsing (HTML)
Note: This issue is essentially the same as the one reported at #597 (closed), except that it affects HTML documents.
Use case
The use case I describe may not be one that the libxml2 maintainer wishes to support, but I wanted to write about it as it seems relevant to evolving libxml2 towards more support of HTML5 concepts.
In this case, imagine we have a tree that is constructed to resemble an HTML5 document with a MathML element (e.g., Nokogiri does this by parsing HTML5 content with libgumbo, then constructing a libxml2 tree based on the gumbo tree). HTML5 treats MathML as foreign content which has a namespace:
char *html = "<html><body><math></math></body></html>";
xmlDocPtr document = htmlReadMemory(html, strlen(html), NULL, NULL, HTML_PARSE_RECOVER|HTML_PARSE_NOERROR);
xmlNodePtr math = xmlDocGetRootElement(document)->children->children;
assert(!strcmp((const char *)math->name, "math"));
xmlNsPtr ns = xmlNewNs(math, (const xmlChar *)"http://www.w3.org/1998/Math/MathML", NULL);
xmlSetNs(math, ns);
That tree corresponds to the following HTML5 document:
<html><body>
<math xmlns="http://www.w3.org/1998/Math/MathML"></math>
</body></html>
Then we parse additional math content "in context" of the math node:
char *mathml = "<mrow></mrow>";
xmlNodePtr nodes;
xmlParserErrors error;
error = xmlParseInNodeContext(math, mathml, strlen(mathml), HTML_PARSE_RECOVER|HTML_PARSE_NOERROR, &nodes);
fprintf(stderr, "return code: %d\n", error);
This segfaults (line numbers correspond to libxml2 code at tag v2.12.4):
#0 xmlParserNsGrow (ctxt=0x5555556e32a0) at parser.c:1642
#1 0x000055555556d16a in xmlParserNsPush (ctxt=0x5555556e32a0, prefix=0x7fffffffcd70, uri=0x7fffffffcd80,
saxData=0x5555556e5ae0, defAttr=1) at parser.c:1679
#2 0x000055555558c09f in xmlParseInNodeContext (node=0x5555556e6080, data=0x55555569c0a3 "<mrow></mrow>",
datalen=13, options=12321, lst=0x7fffffffcdc8) at parser.c:13267
#3 0x000055555556af12 in main ()
The issue appears to be that ctxt->nsdb is not initialized, since xmlInitSAXParserCtxt isn't called for HTML documents.
Note that before e0dd330b, this in-context parsing failed cleanly, returning an error code of XML_HTML_UNKNOWN_TAG.
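To illustrate the suspected failure mode (a namespace database that is never allocated on the HTML path), here is a minimal, self-contained sketch. It is not libxml2 code: ToyCtxt, ToyNsDb, and the toy_* functions are hypothetical stand-ins invented for this illustration. The only thing it models is that one init path allocates nsdb and the other leaves it NULL. The guarded push shows one way such a call could fail cleanly instead of crashing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for illustration only; NOT libxml2's real types. */
typedef struct {
    const char **uris;  /* stack of namespace URIs */
    size_t len;         /* number of entries in use */
    size_t cap;         /* allocated capacity */
} ToyNsDb;

typedef struct {
    ToyNsDb *nsdb;      /* stays NULL when the XML-only init step is skipped */
} ToyCtxt;

/* Models an XML-side init that allocates the namespace database. */
static int toy_init_xml(ToyCtxt *ctxt) {
    ctxt->nsdb = calloc(1, sizeof(*ctxt->nsdb));
    return ctxt->nsdb != NULL ? 0 : -1;
}

/* Models an HTML-side init that never sets up nsdb. */
static void toy_init_html(ToyCtxt *ctxt) {
    ctxt->nsdb = NULL;
}

/* Guarded push: returns -1 rather than dereferencing a NULL nsdb. */
static int toy_ns_push(ToyCtxt *ctxt, const char *uri) {
    assert(ctxt != NULL);
    ToyNsDb *db = ctxt->nsdb;
    if (db == NULL)
        return -1;  /* an unguarded version would dereference NULL here */
    if (db->len == db->cap) {
        /* Grow the stack geometrically, starting at 4 slots. */
        size_t cap = db->cap ? db->cap * 2 : 4;
        const char **tmp = realloc(db->uris, cap * sizeof(*tmp));
        if (tmp == NULL)
            return -1;
        db->uris = tmp;
        db->cap = cap;
    }
    db->uris[db->len++] = uri;
    return 0;
}

/* Releases whatever the init step allocated. */
static void toy_free(ToyCtxt *ctxt) {
    if (ctxt->nsdb != NULL) {
        free(ctxt->nsdb->uris);
        free(ctxt->nsdb);
        ctxt->nsdb = NULL;
    }
}
```

In this toy model, pushing a namespace onto a context set up by toy_init_html returns -1, while the same push after toy_init_xml succeeds. Whether libxml2 should reject the call, or initialize the namespace database for HTML contexts as well, is a design decision for the maintainers.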
Repro
A complete reproduction is:
#include <libxml/HTMLparser.h>
#include <stdio.h>
#include <string.h>
#include <assert.h>
int main(int argc, char **argv) {
fprintf(stderr, "using libxml2 version %s\n", xmlParserVersion);
char *html = "<html><body><math></math></body></html>";
xmlDocPtr document = htmlReadMemory(html, strlen(html), NULL, NULL, HTML_PARSE_RECOVER|HTML_PARSE_NOERROR);
xmlNodePtr math = xmlDocGetRootElement(document)->children->children;
assert(!strcmp((const char *)math->name, "math"));
xmlNsPtr ns = xmlNewNs(math, (const xmlChar *)"http://www.w3.org/1998/Math/MathML", NULL);
xmlSetNs(math, ns);
char *mathml = "<mrow></mrow>";
xmlNodePtr nodes;
xmlParserErrors error;
error = xmlParseInNodeContext(math, mathml, strlen(mathml), HTML_PARSE_RECOVER|HTML_PARSE_NOERROR, &nodes);
fprintf(stderr, "return code: %d\n", error);
}
or, in Ruby using Nokogiri:
doc = Nokogiri::HTML5::Document.parse("<html><body><math>")
math = doc.at_css("math")
math.parse("mrow")