html.lang: Update tag and attribute name regular expressions

This updates the regular expressions for tag and attribute names to follow section 12.2, "Parsing HTML documents", of the HTML Living Standard, specifically:

  • 12.2.3.5 Preprocessing the input stream
  • 12.2.5.6 Tag open state
  • 12.2.5.8 Tag name state
  • 12.2.5.32 Before attribute name state
  • 12.2.5.33 Attribute name state

These characters are flagged as parse errors during preprocessing: surrogates, noncharacters, and control characters other than ASCII whitespace and U+0000 NULL.

(Including surrogate characters in a GRegex regular expression causes the compilation error "disallowed Unicode code point (>= 0xd800 && <= 0xdfff)", so they are not included or checked.)

Null characters are also flagged as errors later in the parsing process. Other characters (whitespace, /, >, etc.) trigger state changes and so cannot be part of the tag / attribute name.
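
As a rough illustration of these rules (not the expressions from this merge request), the sketch below builds tag and attribute name patterns with Python's `re` module. The names `EXCLUDED`, `ATTR_EXTRA`, `TAG_NAME` and `ATTR_NAME` are hypothetical. It assumes that only BMP noncharacters need excluding, leaves surrogates out as described above, and additionally rejects `"`, `'` and `<` in attribute names, which the standard reports as parse errors but still accepts.

```python
import re

# Characters that may not appear in a name (illustrative, not the MR's regex):
#   \x00-\x20      NULL, C0 controls, ASCII whitespace and space
#   \x7f-\x9f      DELETE and C1 controls
#   / and >        trigger a state change in the tokenizer
#   \ufdd0-\ufdef, \ufffe, \uffff   BMP noncharacters (others omitted here)
EXCLUDED = r"\x00-\x20\x7f-\x9f/>\ufdd0-\ufdef\ufffe\uffff"

# Extra characters kept out of attribute names in this sketch:
# "=" ends the name; the quotes and "<" are parse errors in the standard.
ATTR_EXTRA = "=\"'<"

# A tag name must start with an ASCII letter (tag open state).
TAG_NAME = re.compile("[A-Za-z][^" + EXCLUDED + "]*")
ATTR_NAME = re.compile("[^" + EXCLUDED + ATTR_EXTRA + "]+")

if __name__ == "__main__":
    print(TAG_NAME.match("my-element/>").group())  # my-element
    print(ATTR_NAME.match("data-x=1").group())     # data-x
    print(TAG_NAME.match("1abc"))                  # None
```

The patterns stop at the first excluded character, which mirrors how whitespace, `/` and `>` end a name in the tokenizer rather than becoming part of it.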

Fixes #87 (closed).