html.lang: Update tag and attribute name regular expressions
This updates the regular expressions for tag and attribute names to follow section 12.2 "Parsing HTML documents" of the HTML Living Standard, specifically:
- 12.2.3.5 Preprocessing the input stream
- 12.2.5.6 Tag open state
- 12.2.5.8 Tag name state
- 12.2.5.32 Before attribute name state
- 12.2.5.33 Attribute name state
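These characters are flagged as parse errors during preprocessing (12.2.3.5): surrogates, noncharacters, and control characters other than ASCII whitespace and U+0000 NULL.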
(Including surrogate characters in GRegex regular expressions leads to the compilation error `disallowed Unicode code point (>= 0xd800 && <= 0xdfff)`, so they are not included/checked.)
Null characters are also flagged as errors later in the parsing process.
Other characters (whitespace, `/`, `>`, etc.) trigger state changes and so cannot be part of the tag / attribute name.
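
For illustration only, the rules above can be approximated with the following sketch. It is not the pattern committed to html.lang: it uses Python's `re` module rather than GRegex, and the names `TAG_NAME` and `ATTRIBUTE_NAME` are hypothetical. It assumes a tag name starts with an ASCII letter (tag open state) and that `=` ends an attribute name. As in this change, surrogates are left out of the character classes.

```python
import re

# Characters that end a tag or attribute name by switching tokenizer state.
DELIMITERS = r"\t\n\f\r />"

# Control characters: C0 (including NULL), DEL, and C1.
CONTROLS = r"\x00-\x1f\x7f-\x9f"

# Noncharacters: U+FDD0..U+FDEF plus the last two code points of each plane.
NONCHARACTERS = r"\ufdd0-\ufdef" + "".join(
    rf"\U{plane:04X}FFFE-\U{plane:04X}FFFF" for plane in range(0x11)
)

# Surrogates (U+D800..U+DFFF) are deliberately not listed here.
EXCLUDED = DELIMITERS + CONTROLS + NONCHARACTERS

# Assumption: an ASCII letter opens the name, then anything not excluded.
TAG_NAME = re.compile(rf"[A-Za-z][^{EXCLUDED}]*")

# Assumption: "=" also terminates an attribute name.
ATTRIBUTE_NAME = re.compile(rf"[^={EXCLUDED}]+")
```

With these patterns, `TAG_NAME.match("div>")` stops before `>`, and `ATTRIBUTE_NAME.match("data-x=1")` stops before `=`.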
Fixes #87.