DOMDocument::loadHTML() - textContent property has different value with newer libxml2 versions
We are seeing a change in the behavior of DOMDocument::loadHTML()
in PHP when it runs with newer libxml2 versions.
Let's have an example PHP script:
<?php
$document = <<<EOD
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /></head>
<body><</body>
</html>
EOD;
$dom = new \DOMDocument();
$dom->loadHTML($document, LIBXML_NOBLANKS);
echo $dom->textContent . PHP_EOL;
Running this script in PHP with libxml2 2.9.10, the textContent
property of the document is empty and PHP emits a warning:
# php libxml.php
PHP Warning: DOMDocument::loadHTML(): htmlParseStartTag: invalid element name in Entity, line: 4 in /mnt/libxml.php on line 10
root@74899afe88eb:/mnt# php --ri dom
dom
DOM/XML => enabled
DOM/XML API Version => 20031129
libxml Version => 2.9.10
HTML Support => enabled
XPath Support => enabled
XPointer Support => enabled
Schema Support => enabled
RelaxNG Support => enabled
root@74899afe88eb:/mnt# php -v
PHP 8.2.0 (cli) (built: Feb 16 2023 18:16:46) (NTS)
Copyright (c) The PHP Group
Zend Engine v4.2.0, Copyright (c) Zend Technologies
with Zend OPcache v8.2.0, Copyright (c), by Zend Technologies
Running this script in PHP with libxml2 2.9.14 (or 2.11.5), the textContent
property of the document is "<":
$ php libxml.php
<
$ php --ri dom
dom
DOM/XML => enabled
DOM/XML API Version => 20031129
libxml Version => 2.11.5
HTML Support => enabled
XPath Support => enabled
XPointer Support => enabled
Schema Support => enabled
RelaxNG Support => enabled
TL;DR: when the body contains only a literal < character (not the &lt; entity), the textContent property of the document has different values depending on the libxml2 version. This is causing test failures in Drupal: https://www.drupal.org/project/drupal/issues/3397882
I cannot say which exact version of libxml2 caused this change, as I am unable to test each specific version, but there is a difference between 2.9.10 and 2.9.14.
This was previously reported to the PHP team (https://github.com/php/php-src/issues/11469#issuecomment-1699885043), but according to further testing and answers from the PHP team, it seems to be caused by changes in libxml2 (the behavior depends on the libxml2 version).
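
To help narrow down which libxml2 release introduced the change, a small probe script along the following lines could be run against each build. This is only a sketch: the function name probeLoadHtml() and the reduced input document are my own assumptions, not part of the original report. It prints the libxml2 version PHP reports next to the resulting textContent and any parser messages, so outputs from different environments can be compared directly.

```php
<?php
// Hypothetical helper (not from the original report): parse an HTML string
// and return the libxml2 version together with the resulting textContent
// and any messages the parser produced.
function probeLoadHtml(string $html): array
{
    // Collect libxml errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $dom = new DOMDocument();
    $dom->loadHTML($html, LIBXML_NOBLANKS);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = trim($error->message);
    }

    // Restore the previous error-handling state.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return [
        'libxml'      => LIBXML_DOTTED_VERSION,
        'textContent' => (string) $dom->textContent,
        'messages'    => $messages,
    ];
}

// Reduced version of the document from the report: the body holds a bare "<".
$result = probeLoadHtml('<html><body><</body></html>');
echo json_encode($result, JSON_PRETTY_PRINT) . PHP_EOL;
```

The script deliberately makes no assumption about what textContent should be, since that is exactly the value that differs between libxml2 versions.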