Mike Dalessio requested to merge flavorjones/libxml2:flavorjones/handle-incorrectly-closed-comments-master into master Aug 04, 2020

What problem is this PR trying to solve?

The WHATWG "living spec" document provides guidance on how to handle "incorrectly closed" comments. Specifically, it suggests that parsers treat comments closed by --!> as if they were closed by -->.

This guidance is non-normative, however popular modern browsers do follow this guidance (I've tested on Chrome and Firefox), and unfortunately the difference in behavior between libxml2 and these browsers introduces an opportunity for XSS attack vectors to exist.

As an example, take this HTML fragment:

<!-- --!><script>alert(1)</script><!-- -->

This is interpreted by libxml2 as one long comment with contents --!><script>alert(1)</script><!--. However, it's interpreted by modern browsers as three elements: a comment, a script node, and a comment. This is the essence of the potential XSS attack vector: libxml2 thinks the script node is part of a well-formed comment, but user agents parse it as part of the DOM.

Why should we solve this problem?

There's lots of advice in the WHATWG "living spec" that libxml2 doesn't follow when it encounters parsing errors; and that seems totally fine as far as I can tell for the majority of the enumerated cases in https://html.spec.whatwg.org/multipage/parsing.html#parse-errors.

However, this particular edge case means the difference between text being parsed as part of the DOM or being essentially ignored as CDATA. Many web applications stacks rely on libxml2-based libraries to perform HTML sanitization on user content and may be impacted; and although the safe thing to do is to strip comments out of sanitized HTML, this is certainly something that libxml2 could be handling better.

I can't think of a downside to adopting WHATWG's guidance here.

What approach was used?

This merge request contains first a commit to introduce baseline test coverage around "incorrectly closed" comments. Then the method htmlParseComment is modified to check for the 4-character sequence --!> and if found:

emit a non-fatal error
close the comment

Effort was made to make this patch as compact and unlikely to cause merge conflicts as possible. Obviously other approaches might be considered, and I'm open to feedback.

Other context

Apologies that I've publicly touched on something that might be considered a vulnerability report, but I wasn't sure if there was another way to report potentially-sensitive behavior.

This behavior was first reported (to me, at least) on the Loofah Hackerone project by user mayflower and that user should receive the reporting credit.

Edited Aug 04, 2020 by Mike Dalessio

Handle incorrectly closed comments

What problem is this PR trying to solve?

Why should we solve this problem?

What approach was used?

Other context

Merge request reports