GNOME / libxml2 / Issues / #211
Issue created Dec 10, 2020 by Nick Wellnhofer (@nwellnhof, Maintainer)

HTML5 support

This probably won't be completed soon but here's an outline.

Tokenization

HTML5 specifies exactly how to parse broken HTML. For the most part, the handling of errors and other corner cases has to be checked and, where necessary, adjusted. Some work in this direction has already been completed. This shouldn't cause major regressions since the parsing of valid HTML isn't affected. The changes can be implemented directly in the current HTML4 parser. An immediate benefit is improved security for the HTML sanitization libraries that build on libxml2, often through language bindings.

Some specifics that have to be addressed:

  • HTML5 allows CDATA sections, but only in foreign content (SVG and MathML).
  • Tag and attribute names.
  • HTML5 treats U+000C FORM FEED (FF) as whitespace.
  • Named character references
  • Special content modes
    • Script data
    • RCDATA
    • raw text
  • Many quirks of the parsing rules, see for example https://htmlparser.info/parser/
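A few of the items above can be made concrete with a small sketch. This is Python for illustration only, not libxml2 code; the names `HTML5_WHITESPACE`, `CONTENT_MODES`, `ascii_lower`, `content_mode` and `decode_references` are hypothetical. It assumes the element-to-mode mapping from the HTML5 tree construction rules (ignoring `noscript` and `plaintext`) and relies on Python's `html.unescape`, which follows the HTML5 named character reference table.

```python
import html

# ASCII whitespace per the HTML5 standard: TAB, LF, FORM FEED, CR, SPACE.
# Note that U+000B (vertical tab) is not included.
HTML5_WHITESPACE = frozenset("\t\n\f\r ")

# Elements whose start tag switches the tokenizer into a special content mode.
# Simplified: <noscript> (scripting-dependent) and <plaintext> are left out.
CONTENT_MODES = {
    "script": "script data",
    "title": "RCDATA",
    "textarea": "RCDATA",
    "style": "RAWTEXT",
    "xmp": "RAWTEXT",
    "iframe": "RAWTEXT",
    "noembed": "RAWTEXT",
    "noframes": "RAWTEXT",
}


def ascii_lower(name):
    """Lowercase only A-Z, since HTML5 tag names are ASCII case-insensitive."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in name)


def content_mode(tag_name):
    """Return the content mode entered after the given start tag."""
    return CONTENT_MODES.get(ascii_lower(tag_name), "data")


def decode_references(text):
    """Decode character references using the HTML5 rules in the stdlib."""
    return html.unescape(text)
```

For example, `content_mode("TEXTAREA")` gives `"RCDATA"`, and `decode_references("&notit;")` gives `"¬it;"` because HTML5 accepts the legacy reference `&not` without a semicolon.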

Tree builder

At some point, a separate database of HTML5 elements has to be added. Tag omission and content models can probably be handled similarly to the HTML4 implementation. Tree construction and the handling of misnested elements might require more extensive changes.
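As a rough illustration of what an element database with tag omission data could look like, here is a minimal Python sketch. Everything in it is an assumption made for illustration: `ElementInfo`, `ELEMENTS` and `start_tag` are hypothetical names, the table covers only a handful of elements, and the rule only inspects the innermost open element, which is far simpler than the scope rules HTML5 actually specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ElementInfo:
    name: str
    void: bool = False                    # no content and no end tag, e.g. <br>
    implied_end: frozenset = frozenset()  # open elements this start tag closes


# Tiny hypothetical table; a real database would list every HTML5 element.
ELEMENTS = {e.name: e for e in (
    ElementInfo("br", void=True),
    ElementInfo("img", void=True),
    ElementInfo("p", implied_end=frozenset({"p"})),
    ElementInfo("div", implied_end=frozenset({"p"})),
    ElementInfo("ul", implied_end=frozenset({"p"})),
    ElementInfo("li", implied_end=frozenset({"li"})),
    ElementInfo("dt", implied_end=frozenset({"dt", "dd"})),
    ElementInfo("dd", implied_end=frozenset({"dt", "dd"})),
)}


def start_tag(stack, name):
    """Apply a start tag to the stack of open elements and return the stack."""
    info = ELEMENTS.get(name, ElementInfo(name))
    # Simplification: only the innermost open element is checked.
    if stack and stack[-1] in info.implied_end:
        stack.pop()
    # Void elements never stay open.
    if not info.void:
        stack.append(name)
    return stack
```

With this table, feeding `ul`, `li`, `li` into `start_tag` leaves `["ul", "li"]` open, since the second `li` implicitly closes the first. Misnested elements are not covered by the sketch at all.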

Edited Jul 17, 2022 by Nick Wellnhofer