Auto-closing anchor element containing table elements
When the libxml HTML parser sees a <table> start tag (also <td>/<th>) inside an <a> anchor element, it always closes the anchor element. For example, <a><table></table></a> linted with libxml becomes <a></a><table></table>.
This is likely because in HTML 4.0 Transitional and in XHTML it is invalid to put a table inside an anchor, as the anchor only allows inline (phrasing) content. However, HTML 5 allows it, and putting either entire tables or just table cells inside anchors is not an uncommon practice even under the older standards (browsers permit it).
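The auto-closing described above can be pictured as a lookup table: each new start tag lists the open elements it implicitly closes. The following is a minimal Python sketch of that idea, not libxml's actual code. It assumes only the element on top of the open-element stack is checked, and it reproduces only the table, th and td entries from HTMLparser.c; the names START_CLOSE, START_CLOSE_WITHOUT_A and open_elements_after are hypothetical.

```python
# Simplified model: new start tag -> open elements it implicitly closes.
# Only the entries relevant to this report are reproduced here.
START_CLOSE = {
    "table": {"p", "head", "h1", "h2", "h3", "h4", "h5", "h6", "pre",
              "listing", "xmp", "a"},
    "th": {"th", "td", "p", "span", "font", "a", "b", "i", "u"},
    "td": {"th", "td", "p", "span", "font", "a", "b", "i", "u"},
}

# Same entries with "a" removed, i.e. anchors are no longer auto-closed.
START_CLOSE_WITHOUT_A = {
    tag: closes - {"a"} for tag, closes in START_CLOSE.items()
}


def open_elements_after(start_tags, table):
    """Return the stack of open elements after feeding start tags in order.

    Before a new tag is pushed, every element on top of the stack that the
    table lists for that tag is popped (auto-closed).
    """
    stack = []
    for tag in start_tags:
        closes = table.get(tag, set())
        while stack and stack[-1] in closes:
            stack.pop()  # implicit end tag for the element on top
        stack.append(tag)
    return stack


print(open_elements_after(["a", "table"], START_CLOSE))            # ['table']
print(open_elements_after(["a", "table"], START_CLOSE_WITHOUT_A))  # ['a', 'table']
```

In this model, the first call shows the anchor being closed before the table is opened, while the second keeps the table nested inside the anchor.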
I would like to propose changing this behavior with the following reasoning:
- Even though, to my understanding, libxml doesn't support HTML 5, it does try to handle broken HTML 4 and to behave in the most helpful way, ideally similar to web browsers or other renderers
- Supporting HTML 5 where possible would be very nice to have
- Auto-closing the anchor causes more problems than it solves. I think that deliberately wrapping a table or table cell in an anchor is far more likely than accidentally omitting the anchor's end tag before a table
- I encountered this after it was reported that our lxml-based email parser spuriously removed links from content generated by a WYSIWYG email editor called BeeFree, which can generate XHTML 1.0 Transitional code containing table elements inside anchors
- The current behavior is extremely obscure. I had to use a debugger to find out why this happened in libxml, and I suspected an unintentional bug the whole time. It would be much easier to understand why <a> wasn't auto-closed when its closing tag was missing than to understand why it was auto-closed when the HTML fragment was otherwise valid
Backward compatibility might be an issue, but only a marginal one in my opinion: I find it unlikely that anyone relies on the current behavior. The behavior could be made configurable with a parser flag, but that would be overkill IMHO.
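To gauge how many existing documents would be affected, one option is to scan the source markup for table-related start tags that appear while an anchor is still open. The sketch below uses Python's standard html.parser, which reports tags as written and does no auto-closing; AnchorTableFinder and find_tables_in_anchors are hypothetical names, and the check is a rough heuristic rather than a validator.

```python
from html.parser import HTMLParser

TABLE_TAGS = {"table", "td", "th"}


class AnchorTableFinder(HTMLParser):
    """Record table-related start tags that occur while an <a> is open."""

    def __init__(self):
        super().__init__()
        self.anchor_depth = 0
        self.hits = []  # (tag, (line, column)) pairs

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1
        elif tag in TABLE_TAGS and self.anchor_depth > 0:
            self.hits.append((tag, self.getpos()))

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1


def find_tables_in_anchors(markup):
    """Return the table/td/th start tags found inside an open anchor."""
    finder = AnchorTableFinder()
    finder.feed(markup)
    finder.close()
    return finder.hits


hits = find_tables_in_anchors("<a><table></table></a>")
print([tag for tag, _pos in hits])  # ['table']
```

Any document for which this returns a non-empty list contains the pattern that the current parser splits apart.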
As the proposed patch is small, I'm pasting it here. If needed, I will open an MR.
diff --git a/HTMLparser.c b/HTMLparser.c
index 9e60e27..a73262a 100644
--- a/HTMLparser.c
+++ b/HTMLparser.c
@@ -1091,9 +1091,9 @@ static const char * const htmlStartClose[] = {
"colgroup", "caption", "colgroup", "col", "p", NULL,
"col", "caption", "col", "p", NULL,
"table", "p", "head", "h1", "h2", "h3", "h4", "h5", "h6", "pre",
- "listing", "xmp", "a", NULL,
-"th", "th", "td", "p", "span", "font", "a", "b", "i", "u", NULL,
-"td", "th", "td", "p", "span", "font", "a", "b", "i", "u", NULL,
+ "listing", "xmp", NULL,
+"th", "th", "td", "p", "span", "font", "b", "i", "u", NULL,
+"td", "th", "td", "p", "span", "font", "b", "i", "u", NULL,
"tr", "th", "td", "tr", "caption", "col", "colgroup", "p", NULL,
"thead", "caption", "col", "colgroup", NULL,
"tfoot", "th", "td", "tr", "caption", "col", "colgroup", "thead",