HTML Namespace Parsing Regression
Introduced with 21ca8829
As I understand it, that change removed namespace resolution when parsing HTML, as an optimization. While HTML shouldn't include namespaced elements, unfortunately it occasionally does, and not parsing them can disrupt visual formatting in certain common scenarios.
Our use case involves email, which unfortunately means working with a lot of poorly formed HTML. Much of that poorly formed HTML comes from Microsoft Outlook, which for versions 2019 and older used Word as its internal engine for formatting. What that means for us is that many of the emails we see include XML-ish tags that would be valid in a Word document's XML but are not valid HTML. For example, this type of content is very common:
<p>Some Images: <img src="cid:123"><o:p></o:p><img src="cid:unknown"></p>
While `<o:p>` shouldn't be valid, without namespace parsing it's indistinguishable from a `<p>` tag. That means when we process these email bodies (to, for example, replace the img's `src="cid:123"` with a URL), a parse/serialize round trip ends up converting `<o:p>` tags into `<p>` tags. Our main issue with this behavior is that browsers ignore `<o:p>`, but `<p>` adds a paragraph break, which messes up the formatting.
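For anyone who wants to try this locally, here is a minimal sketch of the kind of round trip described above, assuming `lxml` is installed. The function name `roundtrip`, the `cid_urls` mapping, and the example URL are hypothetical and only there for illustration. How `<o:p>` is serialized depends on the `libxml2` version that `lxml` is linked against, so no expected output is shown.

```python
from lxml import etree, html

# The sample from this report: Outlook-style markup with an <o:p> element.
SAMPLE = '<p>Some Images: <img src="cid:123"><o:p></o:p><img src="cid:unknown"></p>'


def roundtrip(markup, cid_urls):
    """Parse an HTML fragment, rewrite known cid: image sources, serialize it back.

    cid_urls is a hypothetical mapping from content IDs to replacement URLs.
    """
    # Wrap the fragment in a <div> so arbitrary fragments have a single root.
    root = html.fragment_fromstring(markup, create_parent="div")
    for img in root.iter("img"):
        src = img.get("src", "")
        if src.startswith("cid:") and src[4:] in cid_urls:
            img.set("src", cid_urls[src[4:]])
    return html.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # The libxml2 version in use at runtime decides how <o:p> comes back out.
    print("libxml2 runtime version:", etree.LIBXML_VERSION)
    print(roundtrip(SAMPLE, {"123": "https://example.com/image-123.png"}))
```

Comparing the printed markup under the old and new system `libxml2` should show whether the `<o:p>` elements survive the round trip.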
Some other contextual information:
- We can (and do) recommend that our customers not use Outlook 2019 or older, but we have no control over the email clients of the people our customers are talking to
- This behavior breaking formatting is especially common due to people's signatures which often include inline images. That combined with quoting previous emails in a thread makes it quite frequent.
- Why report this now, nearly 3 years later? We're upgrading from Ubuntu 20.04 to 22.04, which bumps the system `libxml2` from v2.9.10 to v2.9.13, which now includes this change.
- Our application processes in Python, using `lxml`. We install it using the OS-provided `libxml2` package.
- We have double checked in production that we still quite frequently rely on the namespace parsing behavior to avoid breaking formatting
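- We've been looking for other workarounds but haven't been able to come up with anything that seems reasonable. Options we've considered:
  - breaking image formatting for our customers (unfortunately it's too common to be viable for us)
  - forking `libxml2` and reverting just this patch, then installing from source
  - forking `lxml` and forcing `html` to false (though I'm not sure how feasible that is or how it impacts other parsing logic, since we are after all parsing HTML here)
  - doing a regex replace on these types of tags to "fix" them during the parse/serialize round trip, perhaps at the cost of our sanity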
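To make the regex option concrete, here is a rough sketch of what we have in mind, using only the standard library. The helper names and the `__ns__` placeholder are hypothetical, and the sketch assumes the placeholder survives the HTML parser unchanged and never occurs in real tag names, which we have not verified. It also ignores edge cases such as tag-like text inside attribute values or comments.

```python
import re

# Hypothetical placeholder standing in for the ":" of a namespaced tag name.
# Assumption: it never appears in real tag names and survives the HTML parser.
PLACEHOLDER = "__ns__"

# Matches the start of an opening or closing tag of the form <prefix:name or </prefix:name.
_NAMESPACED_TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*):([A-Za-z][A-Za-z0-9]*)")

# Matches the same tags after the ":" has been swapped for the placeholder.
_PROTECTED_TAG = re.compile(
    r"<(/?)([A-Za-z][A-Za-z0-9]*)" + re.escape(PLACEHOLDER) + r"([A-Za-z][A-Za-z0-9]*)"
)


def protect_namespaced_tags(markup):
    """Rewrite <o:p>-style tags to <o__ns__p> before handing markup to the parser."""
    return _NAMESPACED_TAG.sub(
        lambda m: f"<{m.group(1)}{m.group(2)}{PLACEHOLDER}{m.group(3)}", markup
    )


def restore_namespaced_tags(markup):
    """Undo protect_namespaced_tags on the serialized output."""
    return _PROTECTED_TAG.sub(
        lambda m: f"<{m.group(1)}{m.group(2)}:{m.group(3)}", markup
    )
```

The idea would be to call `protect_namespaced_tags` before parsing and `restore_namespaced_tags` after serializing. Since the HTML parser may lowercase or otherwise normalize tag names, the restored markup is not guaranteed to match the input byte for byte.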
to false (though I'm not sure how feasible that is or how it impacts other parsing logic, since we are after all parsing html here) - doing a regex replace on these types of tags to "fix" them during the parse/serialize round trip, perhaps at the cost of our sanity