xmlParseChunk has inconsistent behavior when chunks split closing tags
I recently noticed the following behavior with xmlParseChunk when blank-node removal is enabled via xmlKeepBlanksDefault(0): if a chunk ends with the '<' of a closing tag and the element contains only white space, the white space is not saved. With every other split point I tested, or if the element contains non-white-space characters, the text is saved.
The following code demonstrates the inconsistency:
#include <stdio.h>
#include <string.h>
#include <libxml/parser.h>

int main(void)
{
    /* Case A: the first chunk ends with "</" */
    const char *firstA = "<?xml version='1.0'?> <roottag> <tag>\n\t</";
    const char *secondA = "tag> </roottag>";
    /* Case B: the first chunk ends with "<"; the '/' is now in the second chunk */
    const char *firstB = "<?xml version='1.0'?> <roottag> <tag>\n\t<";
    const char *secondB = "/tag> </roottag>";
    xmlNodePtr text;
    xmlDocPtr doc;
    xmlParserCtxtPtr ctxt;

    xmlKeepBlanksDefault(0);

    /** normal whitespace case **/
    ctxt = xmlCreatePushParserCtxt(NULL, NULL, NULL, 0, "unknown");
    xmlParseChunk(ctxt, firstA, (int)strlen(firstA), 0);
    xmlParseChunk(ctxt, secondA, (int)strlen(secondA), 0);
    xmlParseChunk(ctxt, NULL, 0, 1);
    doc = ctxt->myDoc;
    text = doc->children->children->children; /* get the "tag" node's text */
    printf("text ptr: %p\n", (void *)text); /* should exist */
    printf("content: %02x %02x\n", text->content[0], text->content[1]);
    xmlFreeParserCtxt(ctxt);
    xmlFreeDoc(doc); /* the parser context does not free the document */

    printf("----------------\n");

    /** Split whitespace case **/
    ctxt = xmlCreatePushParserCtxt(NULL, NULL, NULL, 0, "unknown");
    xmlParseChunk(ctxt, firstB, (int)strlen(firstB), 0);
    xmlParseChunk(ctxt, secondB, (int)strlen(secondB), 0);
    xmlParseChunk(ctxt, NULL, 0, 1);
    doc = ctxt->myDoc;
    text = doc->children->children->children; /* get the "tag" node's text */
    printf("text ptr: %p\n", (void *)text); /* is now null */
    xmlFreeParserCtxt(ctxt);
    xmlFreeDoc(doc);

    return 0;
}
This code yields the following output:
text ptr: 0x55d78b6fe280
content: 0a 09
----------------
text ptr: (nil)
I tested this code with a few different versions of libxml2, including 2.9.14, all with the same results (the pointer address changes, of course; it is included only to show that the text node exists). I do not know how the parser works internally, but it appears that the parser may assume the '<' belongs to the start of a new tag rather than to the closing tag. That would make the white space look like formatting between elements, causing it to be dropped. That said, if this is the case, I am not sure why the parser does not wait to make the decision, since it works properly when the break falls one character to either side (that is, with the first chunk ending in the white space or in "</").
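In case it helps anyone hitting the same thing, here is a rough sketch of a possible caller-side workaround. It relies only on the observation above that a chunk ending in the white space parses correctly: trailing '<' bytes are held back and prepended to the next chunk, so no forwarded chunk ends in '<'. The names chunk_feeder, chunk_sink, feeder_init, feeder_push, feeder_finish, recorder and record_sink are all made up for this illustration and are not part of libxml2. The sink is a plain callback so the sketch stays independent of the library, and I have not verified it against the affected versions.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sink: receives bytes that are safe to hand to the parser. */
typedef int (*chunk_sink)(void *user, const char *data, int len);

/* Hypothetical helper state (not a libxml2 type). */
typedef struct {
    chunk_sink sink;
    void *user;
    size_t held; /* number of trailing '<' bytes held back so far */
} chunk_feeder;

static void feeder_init(chunk_feeder *f, chunk_sink sink, void *user)
{
    f->sink = sink;
    f->user = user;
    f->held = 0;
}

/* Forward a chunk, holding back trailing '<' bytes until more data arrives. */
static int feeder_push(chunk_feeder *f, const char *data, size_t len)
{
    size_t keep = len; /* bytes of this chunk that can be forwarded now */
    size_t total;
    char *buf;
    int rc;

    while (keep > 0 && data[keep - 1] == '<')
        keep--;
    if (keep == 0) { /* empty chunk, or nothing but '<' bytes */
        f->held += len;
        return 0;
    }
    if (f->held == 0) { /* nothing pending: forward the prefix directly */
        f->held = len - keep;
        return f->sink(f->user, data, (int)keep);
    }
    /* Rebuild the pending '<' bytes in front of the new data. */
    total = f->held + keep;
    buf = malloc(total);
    if (buf == NULL)
        return -1;
    memset(buf, '<', f->held);
    memcpy(buf + f->held, data, keep);
    rc = f->sink(f->user, buf, (int)total);
    free(buf);
    f->held = len - keep;
    return rc;
}

/* Flush anything still held back; call once before terminating the parse. */
static int feeder_finish(chunk_feeder *f)
{
    int rc = 0;
    if (f->held > 0) {
        char *buf = malloc(f->held);
        if (buf == NULL)
            return -1;
        memset(buf, '<', f->held);
        rc = f->sink(f->user, buf, (int)f->held);
        free(buf);
        f->held = 0;
    }
    return rc;
}

/* Example sink for demonstration only: records what would be parsed. */
typedef struct {
    char data[256];
    size_t len;
    int chunks_ending_in_lt;
} recorder;

static int record_sink(void *user, const char *data, int len)
{
    recorder *r = user;
    if (len <= 0 || r->len + (size_t)len >= sizeof r->data)
        return -1;
    memcpy(r->data + r->len, data, (size_t)len);
    r->len += (size_t)len;
    r->data[r->len] = '\0';
    if (data[len - 1] == '<')
        r->chunks_ending_in_lt++;
    return 0;
}
```

With libxml2, the sink would presumably just wrap xmlParseChunk(ctxt, data, len, 0) and return its result, and after feeder_finish the caller would still issue the final xmlParseChunk(ctxt, NULL, 0, 1). This only sidesteps the split point; it does not explain why the parser decides early.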