parse+save of HTML href attributes escapes them when it shouldn't
Today I stumbled upon something in PHP:
$doc = new DOMDocument();
$doc->loadHTML(
    '<a href="http://example.org/{{hi}}">Hi</a>',
    LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
);
print($doc->saveHTML());
Expected results: <a href="http://example.org/{{hi}}">Hi</a>
Actual results: <a href="http://example.org/%7B%7Bhi%7D%7D">Hi</a>
This only occurs for href attributes. It won't occur for, say, an hrefx attribute.
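A minimal sketch for checking the href/hrefx contrast on a given installation. The roundTrip() helper is a hypothetical name introduced here for illustration. The output depends on the libxml2 version PHP was built against, so no expected output is shown.

```php
<?php
// Hypothetical helper: parse an HTML fragment and serialize it straight back.
function roundTrip(string $html): string
{
    $doc = new DOMDocument();
    $doc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    return trim($doc->saveHTML());
}

$url = 'http://example.org/{{hi}}';

// Same value, two attribute names: only the name differs.
$asHref  = roundTrip('<a href="' . $url . '">Hi</a>');
$asHrefx = roundTrip('<a hrefx="' . $url . '">Hi</a>');

echo 'libxml2 ', LIBXML_DOTTED_VERSION, "\n";
echo $asHref, "\n";
echo $asHrefx, "\n";
```

Comparing the two printed lines shows whether the serializer treats the attribute value differently based on the attribute name alone.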
This is a bug because %7B is not the same thing as {. These are bytes passed to web servers in HTTP requests. The servers can tell the difference. Therefore, parse+save of HTML breaks URLs.
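To make the "different bytes" point concrete, here is a small sketch that compares the two URLs as strings and builds the request line each would produce. The requestLine() helper is hypothetical and only strips the fixed http://example.org prefix from the example URL.

```php
<?php
$raw     = 'http://example.org/{{hi}}';
$escaped = 'http://example.org/%7B%7Bhi%7D%7D';

// The two strings are different byte sequences.
var_dump($raw === $escaped);               // bool(false)
var_dump(strlen($raw), strlen($escaped));  // int(25) int(33)

// They coincide only after percent-decoding, a step a server may or may not apply.
var_dump(rawurldecode($escaped) === $raw); // bool(true)

// Hypothetical helper: the request line a client would send for each URL.
function requestLine(string $url): string
{
    $path = substr($url, strlen('http://example.org'));
    return 'GET ' . $path . ' HTTP/1.1';
}

echo requestLine($raw), "\n";     // GET /{{hi}} HTTP/1.1
echo requestLine($escaped), "\n"; // GET /%7B%7Bhi%7D%7D HTTP/1.1
```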
Sure, one could add { and } to the list of "not-escaped" characters in htmlAttrDumpOutput() -- and maybe [ and ] while we're at it. But the fundamental problem is that HTMLtree.c calls xmlURIEscapeStr() on href attributes when it has no business doing so.
I understand that XSLT 1.0 escapes URIs. But XSLT 2.0 adds an escape-uri-attributes option, so I guess that means one can't build a full-featured XSLT 2.0 engine atop libxml2?
This bug has existed for 19 years. I expect tons of code in the wild (e.g., libxslt-dependent code) relies on this libxml2 bug. Outside of libxslt, PHP users invent wacky workarounds. I'm sure it's the same in other environments.
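For illustration, one such workaround might look like the sketch below: post-process the serialized output and turn the escaped braces back into literal ones. The restoreBraces() name is hypothetical, and the approach is lossy, which is why it only counts as a workaround.

```php
<?php
// Hypothetical workaround: undo the percent-encoding of braces after saveHTML().
// Lossy: it also rewrites any %7B / %7D that the author wrote deliberately,
// and it touches the whole document, not just href attributes.
function restoreBraces(string $html): string
{
    return str_ireplace(array('%7B', '%7D'), array('{', '}'), $html);
}

$saved = '<a href="http://example.org/%7B%7Bhi%7D%7D">Hi</a>';
echo restoreBraces($saved), "\n";
// <a href="http://example.org/{{hi}}">Hi</a>

// The failure mode: an intentionally escaped brace is changed as well.
$intentional = '<a href="/q?x=%7Bliteral%7D">Hi</a>';
echo restoreBraces($intentional), "\n";
// <a href="/q?x={literal}">Hi</a>
```

The second call shows why this cannot replace a real fix: once the serializer has escaped the value, the original bytes can no longer be recovered reliably.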
Where does one go from here? Is there a path to truly fixing this bug and passing through HTML attributes unaltered, in PHP, by default, without breaking the world?