HTML5 support
IMPORTANT NOTE FROM MAINTAINERS: The HTML parser in libxml2 was written 20+ years ago. It does not implement HTML5. Maybe it will some day, maybe it won't. Don't use libxml2 to parse HTML for anything serious. If you maintain a downstream project that uses libxml2's HTML parser, please forward this message to your users.
This probably won't be completed soon, but here's an outline.
Tokenization
HTML5 specifies exactly how to parse broken HTML. For the most part, the handling of errors and other corner cases has to be checked and possibly adjusted. Some work in this direction has already been completed. This shouldn't cause major regressions since valid HTML isn't affected. Changes can be implemented directly in the current HTML4 parser. An immediate benefit is that the security of several HTML sanitization libraries based on libxml2 (often through language bindings) is improved.
Some specifics that have to be addressed:
- Tag and attribute names.
- HTML5 treats U+000C FORM FEED (FF) as whitespace.
- Named character references
- Doctype declaration
- CDATA sections in foreign content.
- Special content modes
  - Script data
  - RCDATA
  - Raw text
- Many quirks of the parsing rules, see for example https://htmlparser.info/parser/
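Three of the items above can be illustrated with a rough Python sketch. It is not libxml2 code and all helper names (`split_on_html5_whitespace`, `content_mode`, `decode_character_references`) are made up for illustration. The content mode table is deliberately partial, and the character reference helper simply delegates to Python's `html.unescape`, which is documented to follow the HTML5 rules.

```python
import html
import string

# ASCII whitespace per the HTML Standard: TAB, LF, FF, CR, SPACE.
# U+000B (vertical tab) is deliberately absent.
HTML5_WHITESPACE = "\t\n\f\r "

# Tag names are compared ASCII case-insensitively.
ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)

# Start tags that switch the tokenizer out of the normal "data" state.
# Partial table: noscript and plaintext are left out for brevity.
CONTENT_MODES = {
    "script": "script data",
    "title": "RCDATA",
    "textarea": "RCDATA",
    "style": "RAWTEXT",
    "xmp": "RAWTEXT",
    "iframe": "RAWTEXT",
    "noembed": "RAWTEXT",
    "noframes": "RAWTEXT",
}


def split_on_html5_whitespace(text):
    """Split text on HTML5 ASCII whitespace only (hypothetical helper)."""
    tokens, current = [], ""
    for ch in text:
        if ch in HTML5_WHITESPACE:
            if current:
                tokens.append(current)
                current = ""
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens


def content_mode(tag_name):
    """Return the tokenizer state entered after this start tag."""
    return CONTENT_MODES.get(tag_name.translate(ASCII_LOWER), "data")


def decode_character_references(text):
    """Decode named and numeric references using Python's HTML5 rules."""
    return html.unescape(text)
```

With these definitions, `split_on_html5_whitespace("checked\fdisabled")` yields two names because form feed separates them, while a vertical tab would not split anything. `content_mode("TEXTAREA")` returns `"RCDATA"`, and `decode_character_references("&notit;")` decodes only the `&not` prefix, which is one of the named character reference quirks an HTML4-era parser typically gets wrong.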
Tree builder
At some point, a separate database of HTML5 elements has to be added. Tag omission and content models can probably be handled similarly to the HTML4 implementation. Tree construction and handling of misnested elements might require more extensive changes.
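As a minimal sketch of what such an element database could drive, the Python below keeps two flags per element and uses them for one tag omission rule: a start tag that implicitly closes an open `p`. The names `ElementInfo`, `ELEMENTS` and `nest_start_tags` are hypothetical, the table is a small subset, and the check ignores the "button scope" condition of the real HTML5 algorithm, so this is an illustration rather than a proposed design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ElementInfo:
    void: bool = False      # element has no content and no end tag
    closes_p: bool = False  # start tag implicitly closes an open <p>


# Small illustrative subset of an HTML5 element table.
ELEMENTS = {
    "br": ElementInfo(void=True),
    "img": ElementInfo(void=True),
    "hr": ElementInfo(void=True, closes_p=True),
    "p": ElementInfo(closes_p=True),
    "div": ElementInfo(closes_p=True),
    "ul": ElementInfo(closes_p=True),
    "span": ElementInfo(),
}


def nest_start_tags(tags):
    """Return (depth, name) pairs for a sequence of start tags.

    Simplified: only start tags are processed, and an open <p> anywhere
    on the stack is closed (the real algorithm checks button scope).
    """
    stack = []   # currently open elements
    placed = []  # (depth, name) in document order
    for name in tags:
        info = ELEMENTS.get(name, ElementInfo())
        if info.closes_p and "p" in stack:
            # Pop everything up to and including the innermost open <p>.
            last_p = len(stack) - 1 - stack[::-1].index("p")
            del stack[last_p:]
        placed.append((len(stack), name))
        if not info.void:
            stack.append(name)
    return placed
```

For the input `["p", "span", "br", "div", "p"]`, the `div` closes the first `p` (and the `span` inside it) and becomes its sibling, while `br` never enters the stack of open elements because it is void.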