[html serializer] Domain names in link href attributes are incorrectly URL-encoded
When using the HTML serializer, all link attributes are URL-encoded. This makes sense for the URL path, but not for the domain part. Domain names may only contain ASCII characters. Some registrars allow Punycode representations of Unicode characters in domain names, but no other representations (RFC 5895). Most browsers will probably open links like https://www.baf%C3%B6g.de without issues, but other libraries might not be that forgiving; for example, Python's requests library will throw an error.
To reproduce this problem:
xmllint --html <(echo "<a href='https://www.bafög.de'>https://www.bafög.de</a>")
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<a href="https://www.baf%C3%83%C2%B6g.de">https://www.bafög.de</a>
</body></html>
This doesn't change when UTF-8 output encoding is requested:
xmllint --html --encode utf-8 <(echo "<a href='https://www.bafög.de'>https://www.bafög.de</a>")
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><a href="https://www.baf%C3%83%C2%B6g.de">https://www.bafög.de</a>
</body></html>
In these examples, the URL encoding is even more broken than when libxml2 is used via lxml (https://www.baf%C3%83%C2%B6g.de decodes to https://www.bafÃ¶g.de), but maybe that's due to the usage via the CLI.
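The doubled escape sequence %C3%83%C2%B6 is consistent with the UTF-8 bytes of "ö" being misread as Latin-1 and then encoded a second time. The sketch below only simulates that hypothesis in Python; I have not confirmed that this is what xmllint actually does internally.

```python
from urllib.parse import quote, unquote

original = "ö"

# Expected percent-encoding: the UTF-8 bytes of "ö" are C3 B6.
correct = quote(original)

# Simulated double encoding (an assumption about the cause, not verified
# against libxml2): the UTF-8 bytes are misread as Latin-1, giving two
# characters, which are then UTF-8 encoded and percent-encoded again.
misread = original.encode("utf-8").decode("latin-1")
double_encoded = quote(misread)

print(correct)                   # what lxml produced for the host
print(double_encoded)            # what the xmllint examples show
print(unquote(double_encoded))   # the mojibake the doubled form decodes to
```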
In my opinion, libxml2 should either parse the URL correctly and exclude the domain part from URL encoding (or apply Punycode encoding to it), as it already does for the protocol, or provide an option to turn off URL encoding of links altogether.
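To illustrate the behavior I have in mind, here is a minimal Python sketch, not a proposal for the actual libxml2 implementation. The helper name `encode_href` is hypothetical, it ignores userinfo in the authority, and it relies on Python's built-in "idna" codec (IDNA 2003), which may differ from what libxml2 would want to use.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def encode_href(url: str) -> str:
    """Hypothetical helper: Punycode the host, percent-encode everything else."""
    parts = urlsplit(url)

    # Convert the host to its ASCII (Punycode) form instead of percent-encoding it.
    host = parts.hostname or ""
    netloc = host.encode("idna").decode("ascii") if host else ""
    if parts.port is not None:
        netloc = f"{netloc}:{parts.port}"

    # Percent-encode the remaining components; "%" stays safe so that
    # already-escaped sequences are not escaped a second time.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&+%")
    fragment = quote(parts.fragment, safe="%")

    return urlunsplit((parts.scheme, netloc, path, query, fragment))


print(encode_href("https://www.bafög.de"))
print(encode_href("https://www.bafög.de/förderung?x=ä"))
```

With this approach the scheme is left untouched, the host becomes plain ASCII that libraries such as requests should accept, and only the path, query, and fragment are percent-encoded.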
I'm sorry if I missed something and there already exists a similar ticket or it isn't a valid bug for other reasons.
For reference, here is the ticket I initially opened on lxml's side: https://bugs.launchpad.net/lxml/+bug/2051597
And an issue from my project where this problem arose: https://github.com/digitalfabrik/integreat-cms/issues/2274