Update xmlParseBalancedChunkMemoryRecover to handle parseFlags from xmlNewDoc
The following Merge Request (MR) has been forwarded from GitHub so that the GNOME Project does not lose contributions arriving through unofficial channels, and so that contributors see their valuable contributions accounted for.
Relevant information:
GitHub handle: ramzes642
MR URL: https://github.com/GNOME/libxml2/pull/21
Patch URL: https://github.com/GNOME/libxml2/pull/21.patch
Body of the MR:
Hello, I am proposing a small change to support parse options in xmlParseBalancedChunkMemory. I am not sure this is the best way to do it, but this small fix helps; perhaps you can suggest a better approach.
Here is the bug I am trying to fix. Please bear with me if I am doing something wrong: this is my first attempt at contributing to a large open source project.
Loading a big XML file into a PostgreSQL database failed with a "Segmentation fault" error, using Ubuntu's libxml2 package "libxml2:amd64 2.9.4+dfsg1-6.1ubuntu1.3" and postgresql 12.2-2.pgdg18.04+1. (This is odd, because simplexml_load_file in php7.2, which uses the same libxml2, does not need any flags and opens the entire (bigger) file successfully.)
My next step was to find out what was going wrong. I built "libxml2.so.2.9.10" from source and got a proper error saying that libxml could not parse the document because a tag had too many children (line 74032: internal error: Huge input lookup). Searching for this error led me to a solution: use xmlReadMemory instead of xmlParseMemory and pass it the XML_PARSE_HUGE flag. I opened the PostgreSQL sources and patched all occurrences of xmlParseMemory to a customizable variant, but one place used another function (xmlParseBalancedChunkMemoryRecover, see usage below) to which I was unable to pass a flag. I found that the doc has a field holding the parse flags (doc->parseFlags), but further investigation showed that the flags stored in the doc are not used in the parser context.
I think my PostgreSQL patch could be put behind a configure flag with a version dependency to support huge XML documents, but only after this small patch to libxml.
PostgreSQL uses libxml2 like this:
    doc = xmlNewDoc(version);
    Assert(doc->encoding == NULL);
    doc->encoding = xmlStrdup((const xmlChar *) "UTF-8");
    doc->standalone = standalone;
    doc->parseFlags |= XML_PARSE_HUGE; // <--------- propose to add flag here

    /* allow empty content */
    if (*(utf8string + count))
    {
        res_code = xmlParseBalancedChunkMemory(doc, NULL, NULL, 0,
                                               utf8string + count, NULL);
        if (res_code != 0 || xmlerrcxt->err_occurred)
            xml_ereport(xmlerrcxt, ERROR, ERRCODE_INVALID_XML_CONTENT,
                        "invalid XML content");
    }
With these changes, my huge document is inserted correctly, XPath queries work, and 100 GB of XML files were imported without any new database crashes.
List of relations
Schema | Name | Type | Owner | Size | Description
--------+----------------+-------+-------+---------+-------------
public | egrip | table | egrip | 16 GB |
public | egrip_test | table | egrip | 13 MB |
public | egrip_versions | table | egrip | 17 GB |
public | egrul | table | egrip | 33 GB |
public | egrul_versions | table | egrip | 45 GB |
If you would be so kind as to apply my patch in a new release, that would make my further patch to PostgreSQL possible. Thank you in advance; I hope you can help with this.
Notes: Here are my commits that fix the XML_PARSE_HUGE problem in PostgreSQL:
- https://github.com/ramzes642/postgres/commit/6eae093d9d1331fa9de92e41f463c263aaf3b641 (commit that needs no libxml2 modification)
- https://github.com/ramzes642/postgres/commit/b59459a16b13de718dde21642452dbdbb253c316 (commit that needs the libxml2 modification)