HTML5 support
This probably won't be completed soon, but here's an outline.
Tokenization
HTML5 specifies exactly how to parse broken HTML. For the most part, this means that the handling of errors and other corner cases has to be checked and possibly adjusted. Some work in this direction has already been completed. This shouldn't cause major regressions since the parsing of valid HTML isn't affected. The changes can be implemented directly in the current HTML4 parser. An immediate benefit is improved security for the several HTML sanitization libraries that are based on libxml2, often through language bindings.
Some specifics that have to be addressed:
- HTML5 allows CDATA sections.
- Tag and attribute names.
- HTML5 treats U+000C FORM FEED (FF) as whitespace.
- Named character references
- Special content modes
  - Script data
  - RCDATA
  - raw text
- Many quirks of the parsing rules, see for example https://htmlparser.info/parser/
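Two of the items above, the whitespace definition and the special content modes, can be illustrated with a small sketch. It is not libxml2 code: the names html5IsWhitespace, html5ContentMode and html5ModeForStartTag are hypothetical and chosen only for this example. The sketch assumes that tag names have already been lowercased by the tokenizer.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical names for illustration only; none of this is libxml2 API. */
typedef enum {
    HTML5_MODE_DATA,        /* regular markup */
    HTML5_MODE_RCDATA,      /* character references are decoded, tags are not */
    HTML5_MODE_RAWTEXT,     /* neither character references nor tags */
    HTML5_MODE_SCRIPT_DATA, /* like raw text, plus script-specific escapes */
    HTML5_MODE_PLAINTEXT    /* everything up to the end of input is text */
} html5ContentMode;

/* HTML5 whitespace: TAB, LF, FF, CR and SPACE. Note that FF (0x0C) is
 * included while VT (0x0B) is not. */
static int
html5IsWhitespace(int c) {
    return c == 0x09 || c == 0x0A || c == 0x0C || c == 0x0D || c == 0x20;
}

/* Select the content mode the tokenizer switches to after a start tag.
 * The name is assumed to be lowercase. "noscript" is left out because
 * its handling depends on whether scripting is enabled. */
static html5ContentMode
html5ModeForStartTag(const char *name) {
    static const struct {
        const char *name;
        html5ContentMode mode;
    } modes[] = {
        { "title",     HTML5_MODE_RCDATA },
        { "textarea",  HTML5_MODE_RCDATA },
        { "style",     HTML5_MODE_RAWTEXT },
        { "xmp",       HTML5_MODE_RAWTEXT },
        { "iframe",    HTML5_MODE_RAWTEXT },
        { "noembed",   HTML5_MODE_RAWTEXT },
        { "noframes",  HTML5_MODE_RAWTEXT },
        { "script",    HTML5_MODE_SCRIPT_DATA },
        { "plaintext", HTML5_MODE_PLAINTEXT }
    };
    size_t i;

    for (i = 0; i < sizeof(modes) / sizeof(modes[0]); i++) {
        if (strcmp(name, modes[i].name) == 0)
            return modes[i].mode;
    }

    return HTML5_MODE_DATA;
}
```

A tokenizer would call html5ModeForStartTag after emitting a start tag and stay in the returned mode until the matching end tag is found. A linear scan is used here for brevity; a real implementation would more likely use a sorted table or a hash lookup.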
Tree builder
At some point, a separate database of HTML5 elements has to be added. Tag omission and content model can probably be handled similarly to the HTML4 implementation. Tree construction and handling of misnested elements might require more extensive changes.
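As a rough idea of what such an element database could look like, here is a minimal sketch. The type html5ElemDesc, the flag names and html5LookupElement are hypothetical and not part of libxml2. The table holds only a handful of entries for illustration and is kept sorted by name so that it can be searched with bsearch.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical flags and types for illustration; not libxml2 API. */
#define HTML5_ELEM_VOID          (1 << 0) /* no content, no end tag */
#define HTML5_ELEM_END_OMISSIBLE (1 << 1) /* end tag may be omitted */

typedef struct {
    const char *name;
    unsigned flags;
} html5ElemDesc;

/* A few sample entries. Must stay sorted by name for bsearch. */
static const html5ElemDesc html5Elements[] = {
    { "br",  HTML5_ELEM_VOID },
    { "div", 0 },
    { "img", HTML5_ELEM_VOID },
    { "li",  HTML5_ELEM_END_OMISSIBLE },
    { "p",   HTML5_ELEM_END_OMISSIBLE },
    { "td",  HTML5_ELEM_END_OMISSIBLE }
};

/* bsearch comparator: the key is a C string, the element a table entry. */
static int
html5CompareElem(const void *key, const void *elem) {
    return strcmp((const char *) key, ((const html5ElemDesc *) elem)->name);
}

/* Return the descriptor for a lowercase element name, or NULL if the
 * name is not in the table. */
static const html5ElemDesc *
html5LookupElement(const char *name) {
    return bsearch(name, html5Elements,
                   sizeof(html5Elements) / sizeof(html5Elements[0]),
                   sizeof(html5Elements[0]), html5CompareElem);
}
```

A tree builder could consult the flags when it sees a start tag, for example to avoid pushing a void element such as "br" onto the stack of open elements. Content model information would need additional fields that are left out here.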