libxml2 cannot parse its own ASCII-escaped output
When I encode a document with non-ASCII tag/attribute names to ASCII, the names get character-escaped, but then libxml2 cannot parse them any more.
$ echo "<älämänt öttrib='Атрибут'></älämänt>" | xmllint --encode ascii -
Output is:
<?xml version="1.0" encoding="ascii"?>
<&#228;l&#228;m&#228;nt &#246;ttrib="&#1040;&#1090;&#1088;&#1080;&#1073;&#1091;&#1090;"/>
Now passing this back into libxml2:
$ echo "<älämänt öttrib='Атрибут'></älämänt>" | xmllint --encode ascii - | xmllint -
Output is:
-:2: parser error : StartTag: invalid element name
<&#228;l&#228;m&#228;nt &#246;ttrib="&#1040;&#1090;&#1088;&#1080;&#1073;&#1091;&
^
-:2: parser error : Extra content at the end of the document
<&#228;l&#228;m&#228;nt &#246;ttrib="&#1040;&#1090;&#1088;&#1080;&#1073;&#1091;&
^
I would expect libxml2 to either parse this back in without complaining, or to refuse to write the output, but not to happily write output that it cannot process itself.
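To illustrate the "refuse to write the output" option, here is a minimal sketch of a check that could run before serialisation. It uses Python's xml.etree.ElementTree rather than libxml2, and the helper name unencodable_names is made up for this example; it is not an existing API in either library.

```python
import xml.etree.ElementTree as ET


def unencodable_names(root, encoding="ascii"):
    """Return element/attribute names that cannot be written in `encoding`.

    Hypothetical helper for illustration only. Attribute values and text
    are ignored on purpose, since those can be escaped as character
    references without breaking well-formedness.
    """
    bad = []
    for elem in root.iter():
        if not isinstance(elem.tag, str):
            continue  # skip nodes without a real name (comments, PIs)
        for name in (elem.tag, *elem.attrib):
            try:
                name.encode(encoding)
            except UnicodeEncodeError:
                bad.append(name)
    return bad


root = ET.fromstring("<älämänt öttrib='Атрибут'></älämänt>")
print(unencodable_names(root))
```

A serialiser could raise an error when this list is non-empty instead of emitting character references inside names.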
For comparison, I tried expat and it also rejects the ASCII-escaped output:
Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyexpat
>>> p = pyexpat.ParserCreate()
>>> p.Parse("""\
... <?xml version="1.0" encoding="ascii"?>
... <&#228;l&#228;m&#228;nt &#246;ttrib="&#1040;&#1090;&#1088;&#1080;&#1073;&#1091;&#1090;"/>
... """)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 2, column 1
So rejecting this kind of input is correct, which also matches my reading of the XML spec: character references are allowed in content and attribute values, but not in names. But then libxml2 shouldn't produce such output in the first place.
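A small sketch to back this up with expat: the same kind of character reference is accepted in an attribute value or in text, and rejected in an element or attribute name. The helper is_well_formed and the sample documents are made up for this check.

```python
from xml.parsers import expat


def is_well_formed(doc: str) -> bool:
    """Return True if expat parses `doc` without a well-formedness error."""
    parser = expat.ParserCreate()
    try:
        parser.Parse(doc, True)  # True marks this as the final chunk
    except expat.ExpatError:
        return False
    return True


# Character references in an attribute value and in text content.
in_values = '<a b="&#1040;">&#228;</a>'
# Character reference in an element name.
in_element_name = '<&#228; b="x"/>'
# Character reference in an attribute name.
in_attribute_name = '<a &#246;="x"/>'

for doc in (in_values, in_element_name, in_attribute_name):
    print(doc, is_well_formed(doc))
```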
I'm not sure if there is any use case for this, but some users out there probably did make use of it, so turning this into an error would likely break someone's code. But it seems wrong to allow silently generating non-XML output during XML serialisation.
Other encodings end up writing their respective replacement character if a character cannot be represented, also without producing an error. That is similar but not quite the same thing, since the issue here is position-specific: escaped names are a problem, escaped text is not.
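As a rough analogy (not libxml2's actual code path), Python's xmlcharrefreplace error handler shows why a fallback applied at the encoding layer cannot be position-aware: it only sees a character stream, so names and attribute values are escaped alike.

```python
doc = "<älämänt öttrib='Атрибут'/>"

# The codec error handler replaces every unencodable character with a
# decimal character reference, regardless of where it sits in the markup.
escaped = doc.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(escaped)
```

Getting this right would need the serialiser itself to tell names apart from values before the encoder runs.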