Quadratic behavior in HTML push parser with unquoted attribute values
OSS-Fuzz found another case in the HTML parser where htmlLookupSequence
doesn't agree with the rest of the parsing code:
<elem attr=value">"
The closing bracket terminates the start tag but htmlLookupSequence
thinks it's part of an attribute value.

We really have to implement a part of the parser's state machine in
htmlLookupSequence when handling attributes. This would be easier if we
first started parsing element and attribute names according to the HTML5
spec, which allows all characters in names. At least, we wouldn't have
to care about name start states.

In addition, parsing of end tags should be fixed. In HTML5, end tags are
parsed like start tags, allowing attributes which will be ignored.
States we have to handle:
- tag name
- before attribute name
- attribute name
- after attribute name
- before attribute value
- attribute value quoted
- attribute value unquoted
- after attribute value quoted
- self-closing start tag
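
The states above can be sketched as a small resumable scanner. This is
only an illustration of the idea, not the code in htmlLookupSequence:
the names TagState, TagScanner, tag_scanner_init and
tag_scanner_find_end are hypothetical, the transitions follow a
simplified reading of the HTML5 tokenizer, and the buffer is assumed to
start right after the opening '<'.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names; one enumerator per state listed above. */
typedef enum {
    TAG_NAME, BEFORE_ATTR_NAME, ATTR_NAME, AFTER_ATTR_NAME,
    BEFORE_ATTR_VALUE, ATTR_VALUE_QUOTED, ATTR_VALUE_UNQUOTED,
    AFTER_ATTR_VALUE_QUOTED, SELF_CLOSING
} TagState;

/* Scanner state kept between calls so no byte is examined twice. */
typedef struct {
    TagState state;
    char quote;   /* active quote character in ATTR_VALUE_QUOTED */
    size_t pos;   /* next unscanned offset in the buffer */
} TagScanner;

static int is_space(char c) {
    return c == ' ' || c == '\t' || c == '\n' || c == '\f' || c == '\r';
}

void tag_scanner_init(TagScanner *s) {
    s->state = TAG_NAME;
    s->quote = 0;
    s->pos = 0;
}

/*
 * Returns the offset of the '>' that ends the start tag, or -1 if the
 * buffer ends first. buf must hold the same data as in earlier calls,
 * possibly with more bytes appended.
 */
long tag_scanner_find_end(TagScanner *s, const char *buf, size_t len) {
    assert(s != NULL && buf != NULL);
    while (s->pos < len) {
        char c = buf[s->pos];
        switch (s->state) {
        case TAG_NAME:
            if (c == '>') return (long) s->pos;
            if (is_space(c)) s->state = BEFORE_ATTR_NAME;
            else if (c == '/') s->state = SELF_CLOSING;
            break;
        case BEFORE_ATTR_NAME:
            if (is_space(c)) break;
            if (c == '/' || c == '>') {
                s->state = AFTER_ATTR_NAME;
                continue;               /* reconsume c in the new state */
            }
            s->state = ATTR_NAME;       /* c is the first name character */
            break;
        case ATTR_NAME:
            if (is_space(c) || c == '/' || c == '>') {
                s->state = AFTER_ATTR_NAME;
                continue;               /* reconsume c */
            }
            if (c == '=') s->state = BEFORE_ATTR_VALUE;
            break;
        case AFTER_ATTR_NAME:
            if (c == '>') return (long) s->pos;
            if (is_space(c)) break;
            if (c == '/') s->state = SELF_CLOSING;
            else if (c == '=') s->state = BEFORE_ATTR_VALUE;
            else s->state = ATTR_NAME;  /* a new attribute name starts */
            break;
        case BEFORE_ATTR_VALUE:
            if (c == '>') return (long) s->pos;
            if (is_space(c)) break;
            if (c == '"' || c == '\'') {
                s->quote = c;
                s->state = ATTR_VALUE_QUOTED;
            } else {
                s->state = ATTR_VALUE_UNQUOTED;
            }
            break;
        case ATTR_VALUE_QUOTED:
            /* '>' is ordinary data here; only the quote ends the value */
            if (c == s->quote) s->state = AFTER_ATTR_VALUE_QUOTED;
            break;
        case ATTR_VALUE_UNQUOTED:
            /* quotes are ordinary data here; '>' ends the tag */
            if (c == '>') return (long) s->pos;
            if (is_space(c)) s->state = BEFORE_ATTR_NAME;
            break;
        case AFTER_ATTR_VALUE_QUOTED:
            if (c == '>') return (long) s->pos;
            if (is_space(c)) s->state = BEFORE_ATTR_NAME;
            else if (c == '/') s->state = SELF_CLOSING;
            else { s->state = BEFORE_ATTR_NAME; continue; }
            break;
        case SELF_CLOSING:
            if (c == '>') return (long) s->pos;
            s->state = BEFORE_ATTR_NAME;
            continue;                   /* reconsume c */
        }
        s->pos++;
    }
    return -1;
}
```

For the input from the report, scanned from just after the '<', the
value is unquoted, so the '"' is treated as data and the first '>' ends
the tag. Because the state and offset are stored in the scanner, a call
made after more data arrives continues where the previous one stopped
instead of rescanning from the start, which is what keeps the total
work linear in this sketch.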