Opened Dec 10, 2020 by Nick Wellnhofer (@nwellnhof, Developer)

HTML5 support

This probably won't be completed soon but here's an outline.

Syntax

HTML5 specifies exactly how to parse broken HTML. For the most part, the handling of errors and other corner cases has to be checked and possibly adjusted. Some work in this direction has already been completed. This shouldn't cause major regressions since valid HTML isn't affected. Changes can be implemented directly in the current HTML4 parser. An immediate benefit is improved security for the several HTML sanitization libraries based on libxml2, often through language bindings.

Some specifics that have to be addressed:

  • HTML5 allows CDATA sections, but only in foreign content (SVG and MathML).
  • Tag and attribute names.
  • HTML5 treats U+000C FORM FEED (FF) as whitespace.
  • Script data parsing rules.
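To illustrate the form feed item, here is a minimal Python sketch (not libxml2 code) that contrasts the XML whitespace set with HTML5's "ASCII whitespace" set. The helper `split_on` and the sample string are made up for illustration. A real tokenizer is a state machine, not a splitter.

```python
# XML whitespace (the S production): space, tab, LF, CR.
XML_WHITESPACE = frozenset(" \t\n\r")
# HTML5 "ASCII whitespace" additionally includes U+000C FORM FEED.
HTML5_WHITESPACE = XML_WHITESPACE | {"\f"}


def split_on(text, whitespace):
    """Split text into tokens separated by runs of the given characters.

    Hypothetical helper, used only to make the difference visible.
    """
    tokens, current = [], []
    for ch in text:
        if ch in whitespace:
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


# Two unquoted attributes separated only by a form feed.
attr_source = "class=a\fid=b"
print(split_on(attr_source, XML_WHITESPACE))    # FF is not a separator here
print(split_on(attr_source, HTML5_WHITESPACE))  # FF separates the attributes
```

With the XML set, the form feed ends up inside the first attribute's value. With the HTML5 set, it ends the value, which is the kind of difference a sanitizer could trip over.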

Semantics

At some point, a separate database of HTML5 elements has to be added. Tag omission and content model can probably be handled similarly to the HTML4 implementation. Tree construction and handling of misnested elements might require more extensive changes.
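As a rough idea of what such an element database could look like, here is a small Python sketch. The names `ElementInfo`, `ELEMENTS`, `lookup` and `implied_p_end` are hypothetical, the table is deliberately incomplete, and the check ignores the "button scope" rule that the HTML5 tree construction algorithm applies. It does not reflect how libxml2 stores or would store this data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ElementInfo:
    name: str
    void: bool = False      # element never has content or an end tag
    closes_p: bool = False  # start tag implicitly closes an open <p>


# Deliberately tiny and incomplete; a real table would cover every element.
ELEMENTS = {e.name: e for e in (
    ElementInfo("br", void=True),
    ElementInfo("img", void=True),
    ElementInfo("p", closes_p=True),
    ElementInfo("div", closes_p=True),
    ElementInfo("ul", closes_p=True),
    ElementInfo("span"),
)}


def lookup(name):
    """Return element info case-insensitively; unknown names get defaults."""
    key = name.lower()
    return ELEMENTS.get(key, ElementInfo(key))


def implied_p_end(open_elements, start_tag):
    """True if start_tag should implicitly close a currently open <p>.

    Simplification: checks the whole stack instead of "button scope".
    """
    return "p" in open_elements and lookup(start_tag).closes_p


print(implied_p_end(["html", "body", "p"], "div"))   # True
print(implied_p_end(["html", "body", "p"], "span"))  # False
```

Keeping tag omission data in a table like this is close in spirit to the HTML4 approach. Handling of misnested elements would need more logic than a lookup.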

Edited Feb 07, 2021 by Nick Wellnhofer
Reference: GNOME/libxml2#211