This updates the regular expressions for tag and attribute names, following the 12.2 Parsing HTML documents[1] section of the HTML Living Standard, specifically:

* 12.2.3.5 Preprocessing the input stream
* 12.2.5.6 Tag open state
* 12.2.5.8 Tag name state
* 12.2.5.32 Before attribute name state
* 12.2.5.33 Attribute name state

These characters are flagged as parse errors during preprocessing:

* Surrogates[2]
* Noncharacters[3]
* Controls[4]

(Including surrogate characters in GRegex regular expressions leads to compilation errors "disallowed Unicode code point (>= 0xd800 && <= 0xdfff)", so they are not included/checked.)

Null characters are also flagged as errors later in the parsing process.

Other characters (whitespace, "/", ">", etc.) trigger state changes and so cannot be part of the tag / attribute name.

Fixes #87.

[1]: https://html.spec.whatwg.org/multipage/parsing.html#parsing
[2]: https://infra.spec.whatwg.org/#surrogate
[3]: https://infra.spec.whatwg.org/#noncharacter
[4]: https://infra.spec.whatwg.org/#control
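
For illustration, the exclusions described in this message can be sketched roughly as below. This is a Python sketch, not the GRegex patterns from this change. The names `CONTROLS`, `NONCHARACTERS`, `excluded_class`, `TAG_NAME` and `ATTR_NAME` are hypothetical. It assumes a tag name starts with an ASCII letter (per the tag open state) and, as a simplification, treats "=" as always ending an attribute name. Surrogates are left out, mirroring the note above.

```python
import re

# Controls per the Infra Standard: C0 controls (U+0000-U+001F) and U+007F-U+009F.
# U+0000 (NULL) and the whitespace characters tab, LF and FF fall in the C0 range.
CONTROLS = [(0x00, 0x1F), (0x7F, 0x9F)]

# Noncharacters: U+FDD0-U+FDEF plus the last two code points of each of the 17 planes.
NONCHARACTERS = [(0xFDD0, 0xFDEF)] + [
    (plane << 16 | 0xFFFE, plane << 16 | 0xFFFF) for plane in range(0x11)
]


def excluded_class(delimiters):
    """Build a negated character class that rejects controls, noncharacters
    and the given state-changing delimiter characters."""
    ranges = "".join(
        f"\\U{lo:08X}-\\U{hi:08X}" for lo, hi in CONTROLS + NONCHARACTERS
    )
    return "[^" + ranges + re.escape(delimiters) + "]"


# Assumption: space, "/" and ">" end a tag name; "=" additionally ends an
# attribute name (simplified relative to the tokenizer states).
TAG_NAME = re.compile("[A-Za-z]" + excluded_class(" />") + "*")
ATTR_NAME = re.compile(excluded_class(" />=") + "+")
```

With these hypothetical patterns, `TAG_NAME.fullmatch("my-element")` matches, while names containing a space, U+0000, or a noncharacter such as U+FDD0 do not.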
5385eb27