bug: Encoding issue in `htmlReadMemory` (libxml2 2.11)
There appears to be an encoding issue related to `htmlReadMemory` in libxml2 2.11.
This was originally reported against the Crystal language's XML library, which links against libxml2 (https://github.com/crystal-lang/crystal/issues/13703).
A basic reproduction in C looks like this:
#include <stdio.h>
#include <string.h>
#include <libxml/HTMLparser.h>
#include <libxml/HTMLtree.h>

int main(void) {
    /* UTF-8 encoded input containing non-ASCII characters */
    const char *content = "<p>České psaní</p>";
    htmlDocPtr doc;

    fprintf(stdout, "libxml2 version: %s\n", LIBXML_DOTTED_VERSION);

    /* No URL, no explicit encoding (NULL), default options */
    doc = htmlReadMemory(content, (int)strlen(content), NULL, NULL, 0);
    if (doc == NULL) {
        fprintf(stderr, "Failed to parse document\n");
        return(1);
    }

    /* Serialize the parsed document to stdout */
    htmlDocDump(stdout, doc);
    xmlFreeDoc(doc);
    return(0);
}
I'm getting the following different outputs:
libxml2 version: 2.10.4
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>České psaní</p></body></html>
libxml2 version: 2.11.4
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>České psaní</p></body></html>
As far as I can tell, the behaviour in 2.10.4 is correct and 2.11.4 is broken. `htmlReadMemory` is called with a NULL encoding, which means the source should be interpreted as UTF-8. Apparently 2.11 assumes a different encoding?