XPath evaluator (xmlXPathEvalExpression) errors when passed expressions containing Unicode characters supported by XML 1.0 Fifth Edition
- The XML parser (xmlParseFile) supports all Unicode characters permitted by the XML 1.0 Fifth Edition Specification, while the XPath evaluator (xmlXPathEvalExpression) only supports the characters permitted by the XML 1.0 Fourth Edition Specification. XML parser support for Fifth Edition characters was added in this commit.
- The XPath 1.0 Specification defines the characters allowed in names through NCName, from the Namespaces in XML 1.0 Third Edition Specification, which in turn points to Name, from the XML 1.0 Fifth Edition Specification. The XML 1.0 Fifth Edition Name production states:
Almost all characters are permitted in names, except those which either are or reasonably could be used as delimiters. The intention is to be inclusive rather than exclusive, so that writing systems not yet encoded in Unicode can be used in XML names.
The specification cites the Unicode 5.0 specification. This seems to imply that a compliant XPath evaluator should support characters at least up to those contained in Unicode 5.0.
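To make the Fifth Edition rule concrete, here is a minimal sketch of a NameStartChar check built from the ranges listed in the XML 1.0 Fifth Edition Name production. The helper names (is_fifth_edition_name_start, is_ncname_start) are hypothetical and are not libxml2 functions; the sketch only illustrates which code points a Fifth Edition-aware XPath tokenizer would presumably need to accept at the start of a name.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Inclusive range of Unicode code points. */
typedef struct {
    uint32_t lo;
    uint32_t hi;
} cp_range;

/* NameStartChar ranges from the XML 1.0 Fifth Edition Name production. */
static const cp_range fifth_edition_name_start[] = {
    {':', ':'},       {'A', 'Z'},       {'_', '_'},       {'a', 'z'},
    {0xC0, 0xD6},     {0xD8, 0xF6},     {0xF8, 0x2FF},    {0x370, 0x37D},
    {0x37F, 0x1FFF},  {0x200C, 0x200D}, {0x2070, 0x218F}, {0x2C00, 0x2FEF},
    {0x3001, 0xD7FF}, {0xF900, 0xFDCF}, {0xFDF0, 0xFFFD}, {0x10000, 0xEFFFF},
};

/* Hypothetical helper: true if cp may start an XML 1.0 Fifth Edition Name. */
bool is_fifth_edition_name_start(uint32_t cp) {
    size_t n = sizeof(fifth_edition_name_start) /
               sizeof(fifth_edition_name_start[0]);
    for (size_t i = 0; i < n; i++) {
        if (cp >= fifth_edition_name_start[i].lo &&
            cp <= fifth_edition_name_start[i].hi) {
            return true;
        }
    }
    return false;
}

/* Hypothetical helper: an NCName is a Name without colons, so ':' is excluded. */
bool is_ncname_start(uint32_t cp) {
    return cp != ':' && is_fifth_edition_name_start(cp);
}
```

Under this sketch, U+13B2 falls inside the range 0x37F to 0x1FFF, so is_ncname_start(0x13B2) returns true, which is the behavior the report expects from the XPath evaluator.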
It would be helpful to have consistent Unicode support between the XML parser and the XPath evaluator.
Reproduction
The character Ꮂ (U+13B2 CHEROKEE LETTER HV) was introduced in Unicode 3.0. While the XML parser successfully reads a file that contains the Ꮂ character, the XPath evaluator reports an error.
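The reproduction reads a file named unicode.xml whose contents are not shown in this report. As a hedged sketch, the helper below writes one plausible input: a UTF-8 document whose root element is named Ꮂ. Both the helper name (write_unicode_xml) and the file contents are assumptions made for illustration, not the file originally used.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: writes a minimal XML document whose root element
 * is named U+13B2 (UTF-8 bytes E1 8E B2). The contents are an assumption;
 * the report does not show the original unicode.xml.
 * Returns 0 on success and -1 on failure.
 */
int write_unicode_xml(const char *path) {
    static const char content[] =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        "<\xE1\x8E\xB2/>\n";
    size_t len = sizeof(content) - 1; /* exclude the terminating NUL */

    FILE *fp = fopen(path, "wb");
    if (fp == NULL) {
        return -1;
    }

    size_t written = fwrite(content, 1, len, fp);

    /* Always close the file, then report any write or close failure. */
    if (fclose(fp) != 0 || written != len) {
        return -1;
    }
    return 0;
}
```

Calling write_unicode_xml("unicode.xml") before running the reproduction should give the parser a document that contains the character in an element name.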
The following code demonstrates that the XPath evaluator will error when passed an XPath expression containing the character.
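    #include <libxml/parser.h>
    #include <libxml/xpath.h>

    int main(int argc, char* argv[]) {
        xmlDocPtr doc;
        xmlXPathContextPtr xpathContext;
        xmlXPathObjectPtr xpathObj;

        xmlInitParser();

        // xmlChar is unsigned char, so the UTF-8 literal needs BAD_CAST.
        const xmlChar* xpathExpression = BAD_CAST "/Ꮂ";
        // xmlParseFile takes a plain const char* filename.
        const char* filename = "unicode.xml";

        // Parsing succeeds even though the document contains U+13B2.
        doc = xmlParseFile(filename);
        xpathContext = xmlXPathNewContext(doc);

        // This call reports an XPath error.
        xpathObj = xmlXPathEvalExpression(xpathExpression, xpathContext);

        // Release everything that was allocated before exiting.
        xmlXPathFreeObject(xpathObj);
        xmlXPathFreeContext(xpathContext);
        xmlFreeDoc(doc);
        xmlCleanupParser();
        return 0;
    }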
This results in the following error:
XPath error : Invalid expression
/Ꮂ[1]
^
Operating system: Debian 10 Buster
libxml2 version: 2.9.8