`htmlReadMemory()` not resolving relative links with a nonempty url argument: bug? missing feature?
I initially reported this issue to Emacs, but then decided to redirect it here because Emacs doesn't do anything special: as far as I can tell, all it does is pass the supplied arguments to `htmlReadMemory()`.
I have an HTML document containing various types of links (see below for one example), and I call `htmlReadMemory()` on this document, passing in the `url` parameter. I hoped to get a parse tree that contains normalized link URLs, but currently this is not the case.
My expectation came from reading the Emacs docstring for the function `libxml-parse-html-region`, which directly calls `htmlReadMemory()` with little transformation of its arguments. See the end of this issue for its full docstring.
Here is a concrete example to explain what I want.
#include <stdio.h>
#include <libxml/HTMLparser.h>
#include <libxml/parser.h>
#include <libxml/tree.h>

int main() {
  char const html[] = "<html>\
<body>\
<a href=\"/hello\">1</a>\
<a href=\"../world\">2</a>\
<a href=\"good\">3</a>\
<a href=\"morning/or/night\">4</a>\
</body>\
</html>\
";
  xmlDocPtr doc = htmlReadMemory(
      html, sizeof html, "https://example.com/good/day", "utf-8",
      HTML_PARSE_RECOVER | HTML_PARSE_NONET | HTML_PARSE_NOWARNING |
          HTML_PARSE_NOERROR | HTML_PARSE_NOBLANKS);
  xmlDocPtr doc2 =
      htmlReadMemory(html, sizeof html, "https://example.com/good/day", "utf-8",
                     HTML_PARSE_RECOVER | HTML_PARSE_NOWARNING |
                         HTML_PARSE_NOERROR | HTML_PARSE_NOBLANKS);
  return doc != doc2;
}
Using libxml2 version 2.10.4, and gcc version 12.2.1, and the following makefile:
CC = gcc
CFLAGS += -I /usr/include/libxml2 -lxml2 -Og -g
Now, gdb reports that both `doc->last->last->children->next->next->next->properties->children` and `doc2->last->last->children->next->next->next->properties->children` evaluate to `"../world"`, the URL for the second `<a>`'s `href` property. In particular, I expect it to be a full URL that reads `"https://example.com/good/world"`.
I'm at a loss in determining what is at fault here. The Emacs docstring? My misinterpretation of this docstring? Or is it that `htmlReadMemory()` should, but fails to, consult `url` for normalizing links?
I'm also looking at this piece of documentation, but I can't come to any conclusion with it.
Thanks in advance.
Docstring of the Emacs function `libxml-parse-html-region`; note especially its second paragraph. In addition, note that when this function gets a nil `BASE-URL`, `htmlReadMemory()` sees `url` as the empty string `""`.
Parse the region as an HTML document and return the parse tree.
If START is nil, it defaults to `point-min'. If END is nil, it
defaults to `point-max'.

If BASE-URL is non-nil, it is used to expand relative URLs.

If you want comments to be stripped, use the `xml-remove-comments'
function to strip comments before calling this function.