<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
    "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
  <meta http-equiv="Content-Type" content="text/html">
  <style type="text/css"></style>
<!--
TD {font-family: Verdana,Arial,Helvetica}
BODY {font-family: Verdana,Arial,Helvetica; margin-top: 2em; margin-left: 0em; margin-right: 0em}
H1 {font-family: Verdana,Arial,Helvetica}
H2 {font-family: Verdana,Arial,Helvetica}
H3 {font-family: Verdana,Arial,Helvetica}
A:link, A:visited, A:active { text-decoration: underline }
  </style>
-->
  <title>Libxml2 XmlTextReader Interface tutorial</title>
</head>

<body bgcolor="#fffacd" text="#000000">
<h1 align="center">Libxml2 XmlTextReader Interface tutorial</h1>

<p></p>

<p>This document describes the use of the XmlTextReader streaming API added
to libxml2 in version 2.5.0. This API is closely modeled after the <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">XmlTextReader</a>
and <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlReader.html">XmlReader</a>
classes of the C# language.</p>

<p>This tutorial will present the key points of this API, and working
examples using both C and the Python bindings:</p>

<p>Table of contents:</p>
<ul>
  <li><a href="#Introducti">Introduction: why a new API</a></li>
  <li><a href="#Walking">Walking a simple tree</a></li>
  <li><a href="#Extracting">Extracting information for the current
  node</a></li>
  <li><a href="#Extracting1">Extracting information for the
  attributes</a></li>
  <li><a href="#Validating">Validating a document</a></li>
  <li><a href="#Entities">Entity substitution</a></li>
  <li><a href="#L1142">Relax-NG Validation</a></li>
  <li><a href="#Mixing">Mixing the reader and tree or XPath
  operations</a></li>
</ul>

<p></p>

<h2><a name="Introducti">Introduction: why a new API</a></h2>

<p>Libxml2's <a href="http://xmlsoft.org/html/libxml-tree.html">main API is
tree based</a>, where the parsing operation results in a document loaded
completely in memory and exposes it as a tree of nodes all available at the
same time. This is very simple and quite powerful, but has the major
limitation that the size of the document that can be handled is limited by
the size of the memory available. Libxml2 also provides a <a
href="http://www.saxproject.org/">SAX</a> based API, but that version was
designed upon one of the early <a
href="http://www.jclark.com/xml/expat.html">expat</a> versions of SAX, and SAX
is also not formally defined for C. SAX basically works by registering
callbacks which are called directly by the parser as it progresses through
the document stream. The problem is that this programming model is relatively
complex, is not well standardized, cannot provide validation directly, and
makes entity, namespace and base processing relatively hard.</p>

<p>The <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">XmlTextReader
API from C#</a> provides a far simpler programming model. The API acts as a
cursor going forward on the document stream and stopping at each node on the
way. The user's code keeps control of the progress and simply calls a
Read() function repeatedly to progress to each node in sequence in document
order. There is direct support for namespaces, xml:base and entity handling,
and adding DTD validation on top of it was relatively simple. This API is
really close to the <a href="http://www.w3.org/TR/DOM-Level-2-Core/">DOM Core
specification</a>. This provides a far more standard, easy to use and powerful
API than the existing SAX. Moreover, integrating extension features based on
the tree seems relatively easy.</p>

<p>In a nutshell, the XmlTextReader API provides a simpler, more standard and
more extensible interface to handle large documents than the existing SAX
version.</p>

<h2><a name="Walking">Walking a simple tree</a></h2>

<p>Basically the XmlTextReader API is a forward-only tree-walking interface.
The basic steps are:</p>
<ol>
  <li>prepare a reader context operating on some input</li>
  <li>run a loop iterating over all nodes in the document</li>
  <li>free up the reader context</li>
</ol>

<p>Here is a basic C sample doing this:</p>
<pre>#include &lt;stdio.h&gt;
#include &lt;libxml/xmlreader.h&gt;

void processNode(xmlTextReaderPtr reader) {
    /* handling of a node in the tree */
}

/* returns 0 on success, -1 if the file cannot be opened or parsed */
int streamFile(const char *filename) {
    xmlTextReaderPtr reader;
    int ret;

    reader = xmlNewTextReaderFilename(filename);
    if (reader != NULL) {
        ret = xmlTextReaderRead(reader);
        while (ret == 1) {
            processNode(reader);
            ret = xmlTextReaderRead(reader);
        }
        xmlFreeTextReader(reader);
        if (ret != 0) {
            printf("%s : failed to parse\n", filename);
        }
    } else {
        printf("Unable to open %s\n", filename);
        ret = -1;
    }
    return ret;
}</pre>

<p>A few things to notice:</p>
<ul>
  <li>the include file needed: <code>libxml/xmlreader.h</code></li>
  <li>the creation of the reader using a filename</li>
  <li>the repeated call to xmlTextReaderRead() and how any return value
    different from 1 should stop the loop</li>
  <li>that a negative return means a parsing error</li>
  <li>how xmlFreeTextReader() should be used to free up the resources used by
    the reader.</li>
</ul>


<p>Here is similar code in Python for exactly the same processing:</p>
<pre>import libxml2

def processNode(reader):
    pass

def streamFile(filename):
    try:
        reader = libxml2.newTextReaderFilename(filename)
    except:
        print("unable to open %s" % (filename))
        return

    ret = reader.Read()
    while ret == 1:
        processNode(reader)
        ret = reader.Read()

    if ret != 0:
        print("%s : failed to parse" % (filename))</pre>

<p>The only things worth adding are that the <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">xmlTextReader
is abstracted as a class like in C#</a> with the same method names (but the
properties are currently accessed with methods) and that one doesn't need to
free the reader at the end of the processing. It will get garbage collected
once all references have disappeared.</p>
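<p>The return-code convention of Read() (1 for a node, 0 at the end of the
document, negative on error) can be wrapped once in a small generator so that
callers write a plain <code>for</code> loop. This is only a sketch:
<code>iter_nodes()</code> and <code>ReaderError</code> are hypothetical helper
names, not part of the libxml2 bindings, and the helper relies only on the
Read() method shown above.</p>

```python
class ReaderError(Exception):
    """Hypothetical exception raised when Read() reports a parsing error."""


def iter_nodes(reader):
    """Yield the reader once for every node, in document order.

    `reader` is any object with the Read() method shown in this tutorial,
    for example the result of libxml2.newTextReaderFilename(filename).
    """
    ret = reader.Read()
    while ret == 1:          # 1: the cursor is on a new node
        yield reader
        ret = reader.Read()
    if ret != 0:             # 0 is the normal end of the document
        raise ReaderError("failed to parse (Read() returned %d)" % ret)


def streamFile(filename, processNode):
    """Same processing as above, written with the generator."""
    import libxml2           # assumes the libxml2 Python bindings are installed
    reader = libxml2.newTextReaderFilename(filename)
    for node in iter_nodes(reader):
        processNode(node)
```

<p>The same reader object is yielded at every step, so it has to be queried
inside the loop body: the next iteration moves the cursor.</p>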

<h2><a name="Extracting">Extracting information for the current node</a></h2>

<p>So far the example code did not indicate how information was extracted
from the reader. It was abstracted as a call to the processNode() routine,
with the reader as the argument. At each invocation, the parser is stopped on
a given node and the reader can be used to query that node's properties. Each
<em>Property</em> is available at the C level as a function taking a single
xmlTextReaderPtr argument whose name is
<code>xmlTextReader</code><em>Property</em>. If the return type is an
<code>xmlChar *</code> string then it must be deallocated with
<code>xmlFree()</code> to avoid leaks. For the Python interface, there is a
<em>Property</em> method to the reader class that can be called on the
instance. The list of the properties is based on the <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">C#
XmlTextReader class</a> set of properties and methods:</p>
<ul>
  <li><em>NodeType</em>: The node type, 1 for start element, 15 for end of
    element, 2 for attributes, 3 for text nodes, 4 for CData sections, 5 for
    entity references, 6 for entity declarations, 7 for PIs, 8 for comments,
    9 for the document nodes, 10 for DTD/Doctype nodes, 11 for document
    fragment and 12 for notation nodes.</li>
  <li><em>Name</em>: the <a
    href="http://www.w3.org/TR/REC-xml-names/#ns-qualnames">qualified
    name</a> of the node, equal to (<em>Prefix</em>:)<em>LocalName</em>.</li>
  <li><em>LocalName</em>: the <a
    href="http://www.w3.org/TR/REC-xml-names/#NT-LocalPart">local name</a> of
    the node.</li>
  <li><em>Prefix</em>: a  shorthand reference to the <a
    href="http://www.w3.org/TR/REC-xml-names/">namespace</a> associated with
    the node.</li>
  <li><em>NamespaceUri</em>: the URI defining the <a
    href="http://www.w3.org/TR/REC-xml-names/">namespace</a> associated with
    the node.</li>
  <li><em>BaseUri:</em> the base URI of the node. See the <a
    href="http://www.w3.org/TR/xmlbase/">XML Base W3C specification</a>.</li>
  <li><em>Depth:</em> the depth of the node in the tree, starts at 0 for the
    root node.</li>
  <li><em>HasAttributes</em>: whether the node has attributes.</li>
  <li><em>HasValue</em>: whether the node can have a text value.</li>
  <li><em>Value</em>: provides the text value of the node if present.</li>
  <li><em>IsDefault</em>: whether an Attribute node was generated from the
    default value defined in the DTD or schema (<em>not supported
  yet</em>).</li>
  <li><em>XmlLang</em>: the <a
    href="http://www.w3.org/TR/REC-xml#sec-lang-tag">xml:lang</a> scope
    within which the node resides.</li>
  <li><em>IsEmptyElement</em>: check if the current node is empty, this is a
    bit bizarre in the sense that <code>&lt;a/&gt;</code> will be considered
    empty while <code>&lt;a&gt;&lt;/a&gt;</code> will not.</li>
  <li><em>AttributeCount</em>: provides the number of attributes of the
    current node.</li>
</ul>

<p>Let's first look at a small example to put this into practice by redefining
the processNode() function in the Python example:</p>
<pre>def processNode(reader):
    print("%d %d %s %d" % (reader.Depth(), reader.NodeType(),
                           reader.Name(), reader.IsEmptyElement()))</pre>

<p>and look at the result of calling streamFile("tst.xml") for various
content of the XML test file.</p>

<p>For the minimal document "<code>&lt;doc/&gt;</code>" we get:</p>
<pre>0 1 doc 1</pre>

<p>Only one node is found: its depth is 0, type 1 indicates an element start,
its name is "doc" and it is empty. Trying now with
"<code>&lt;doc&gt;&lt;/doc&gt;</code>" instead leads to:</p>
<pre>0 1 doc 0
0 15 doc 0</pre>

<p>The document root node is not flagged as empty anymore and both a start
and an end of element are detected. The following document shows how
character data are reported:</p>
<pre>&lt;doc&gt;&lt;a/&gt;&lt;b&gt;some text&lt;/b&gt;
&lt;c/&gt;&lt;/doc&gt;</pre>

<p>We modify the processNode() function to also report the node Value:</p>
<pre>def processNode(reader):
    print("%d %d %s %d %s" % (reader.Depth(), reader.NodeType(),
                              reader.Name(), reader.IsEmptyElement(),
                              reader.Value()))</pre>

<p>The result of the test is:</p>
<pre>0 1 doc 0 None
1 1 a 1 None
1 1 b 0 None
2 3 #text 0 some text
1 15 b 0 None
1 3 #text 0

1 1 c 1 None
0 15 doc 0 None</pre>

<p>There are a few things to note:</p>
<ul>
  <li>the increase of the depth value (first column) as child nodes are
    explored</li>
  <li>the text node child of the b element, of type 3 and its content</li>
  <li>the text node containing the line return between elements b and c</li>
  <li>that elements have the Value None (or NULL in C)</li>
</ul>
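<p>The numeric node types in the second column are easier to read once mapped
to names. The sketch below builds a lookup table from the NodeType list given
earlier; <code>NODE_TYPE_NAMES</code> and <code>describe_node()</code> are
hypothetical helper names chosen for this illustration, and the helper only
uses the Depth(), NodeType(), Name() and Value() methods already shown.</p>

```python
# Names for the NodeType() values listed earlier in this tutorial.
NODE_TYPE_NAMES = {
    1: "Element",
    2: "Attribute",
    3: "Text",
    4: "CDATA",
    5: "EntityReference",
    6: "EntityDeclaration",
    7: "ProcessingInstruction",
    8: "Comment",
    9: "Document",
    10: "DocumentType",
    11: "DocumentFragment",
    12: "Notation",
    15: "EndElement",
}


def describe_node(reader):
    """Return an indented one-line description of the current node."""
    node_type = reader.NodeType()
    # Fall back to the raw number for any type not listed in the table.
    kind = NODE_TYPE_NAMES.get(node_type, "NodeType-%d" % node_type)
    line = "%s%s %s" % ("  " * reader.Depth(), kind, reader.Name())
    value = reader.Value()
    if value is not None and value.strip():
        # Only show values that are not pure whitespace.
        line = '%s "%s"' % (line, value)
    return line


def processNode(reader):
    print(describe_node(reader))
```

<p>Indenting by depth makes the nesting of the document visible directly in
the trace, at the cost of hiding whitespace-only text values.</p>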

<p>The equivalent routine for <code>processNode()</code> as used by
<code>xmllint --stream --debug</code> is the following and can be found in
the xmllint.c module in the source distribution:</p>
<pre>static void processNode(xmlTextReaderPtr reader) {
    xmlChar *name, *value;

    name = xmlTextReaderName(reader);
    if (name == NULL)
        name = xmlStrdup(BAD_CAST "--");
    value = xmlTextReaderValue(reader);

    printf("%d %d %s %d",
            xmlTextReaderDepth(reader),
            xmlTextReaderNodeType(reader),
            name,
            xmlTextReaderIsEmptyElement(reader));
    xmlFree(name);
    if (value == NULL)
        printf("\n");
    else {
        printf(" %s\n", value);
        xmlFree(value);
    }
}</pre>

<h2><a name="Extracting1">Extracting information for the attributes</a></h2>

<p>The previous examples don't indicate how attributes are processed. The
simple test "<code>&lt;doc a="b"/&gt;</code>" provides the following
result:</p>
<pre>0 1 doc 1 None</pre>

<p>This proves that attribute nodes are not traversed by default. The
<em>HasAttributes</em> property allows detecting their presence. To check
their content the API has special instructions. Basically two kinds of operations
are possible:</p>
<ol>
  <li>to move the reader to the attribute nodes of the current element, in
    which case the cursor is positioned on the attribute node</li>
  <li>to directly query the element node for the attribute value</li>
</ol>

<p>In both cases the attribute can be designated either by its position in the
list of attributes (<em>MoveToAttributeNo</em> or <em>GetAttributeNo</em>) or
by its name (and namespace):</p>
<ul>
  <li><em>GetAttributeNo</em>(no): provides the value of the attribute with
    the specified index no relative to the containing element.</li>
  <li><em>GetAttribute</em>(name): provides the value of the attribute with
    the specified qualified name.</li>
  <li><em>GetAttributeNs</em>(localName, namespaceURI): provides the value of
    the attribute with the specified local name and namespace URI.</li>
  <li><em>MoveToAttributeNo</em>(no): moves the position of the current
    instance to the attribute with the specified index relative to the
    containing element.</li>
  <li><em>MoveToAttribute</em>(name): moves the position of the current
    instance to the attribute with the specified qualified name.</li>
  <li><em>MoveToAttributeNs</em>(localName, namespaceURI): moves the position
    of the current instance to the attribute with the specified local name
    and namespace URI.</li>
  <li><em>MoveToFirstAttribute</em>: moves the position of the current
    instance to the first attribute associated with the current node.</li>
  <li><em>MoveToNextAttribute</em>: moves the position of the current
    instance to the next attribute associated with the current node.</li>
  <li><em>MoveToElement</em>: moves the position of the current instance to
    the node that contains the current Attribute  node.</li>
</ul>
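<p>The second kind of operation, querying the element without moving the
cursor, can be sketched as follows. <code>attribute_values()</code> is a
hypothetical helper name; it only combines the AttributeCount property with
the GetAttributeNo and GetAttribute methods listed above, and the attribute
name "a" is the one used in the small test document.</p>

```python
def attribute_values(reader):
    """Return the values of all attributes of the current element, in order.

    The element node itself is queried, so the cursor does not move.
    """
    return [reader.GetAttributeNo(i) for i in range(reader.AttributeCount())]


def processNode(reader):
    if reader.NodeType() == 1 and reader.HasAttributes():
        # "a" is just the attribute name used in the test document above
        print("%s: a=%s all=%s" % (reader.Name(), reader.GetAttribute("a"),
                                   attribute_values(reader)))
```

<p>Because the reader stays on the element, properties such as Depth or
IsEmptyElement still describe the element after the query.</p>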

<p>After modifying the processNode() function to show attributes:</p>
<pre>def processNode(reader):
    print("%d %d %s %d %s" % (reader.Depth(), reader.NodeType(),
                              reader.Name(), reader.IsEmptyElement(),
                              reader.Value()))
    if reader.NodeType() == 1: # Element
        while reader.MoveToNextAttribute() == 1:
            print("-- %d %d (%s) [%s]" % (reader.Depth(), reader.NodeType(),
                                          reader.Name(), reader.Value()))</pre>

<p>The output for the same input document reflects the attribute:</p>
<pre>0 1 doc 1 None
-- 1 2 (a) [b]</pre>

<p>There are a couple of things to note on the attribute processing:</p>
<ul>
  <li>Their depth is that of the carrying element plus one.</li>
  <li>Namespace declarations are seen as attributes, as in DOM.</li>
</ul>
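<p>After a MoveToNextAttribute() loop the cursor is left on the last
attribute, so properties such as Depth or Name describe that attribute rather
than the element. A helper that gathers the attributes can put the cursor back
with MoveToElement() before returning. The sketch below assumes nothing beyond
the methods listed above; <code>collect_attributes()</code> is a hypothetical
name.</p>

```python
def collect_attributes(reader):
    """Return (name, value) pairs for the attributes of the current element.

    The cursor visits each attribute in turn and is moved back to the
    containing element before returning.
    """
    pairs = []
    while reader.MoveToNextAttribute() == 1:
        pairs.append((reader.Name(), reader.Value()))
    reader.MoveToElement()   # harmless if the cursor never left the element
    return pairs
```

<p>Called on the element of the test document above, the helper would be
expected to return a single pair for the attribute a with the value b.</p>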

<h2><a name="Validating">Validating a document</a></h2>

<p>The libxml2 implementation adds some extra features on top of the
XmlTextReader API. The main one is the ability to validate the parsed
document against its DTD progressively. This is simply the activation of the
associated feature of the parser used by the reader structure. There are a
few options available, defined as the enum xmlParserProperties in the
libxml/xmlreader.h header file:</p>
<ul>
  <li>XML_PARSER_LOADDTD: force loading the DTD (without validating)</li>
  <li>XML_PARSER_DEFAULTATTRS: force attribute defaulting (this also implies
    loading the DTD)</li>
  <li>XML_PARSER_VALIDATE: activate DTD validation (this also implies loading
    the DTD)</li>
  <li>XML_PARSER_SUBST_ENTITIES: substitute entities on the fly, entity
    reference nodes are not generated and are replaced by their expanded
    content.</li>
  <li>more settings might be added, those were the ones available at the 2.5.0
    release...</li>
</ul>

<p>The GetParserProp() and SetParserProp() methods can then be used to get
and set the values of those parser properties of the reader. For example:</p>
<pre>def parseAndValidate(file):
    reader = libxml2.newTextReaderFilename(file)
    reader.SetParserProp(libxml2.PARSER_VALIDATE, 1)
    ret = reader.Read()
    while ret == 1:
        ret = reader.Read()
    if ret != 0:
        print("Error parsing and validating %s" % (file))</pre>

<p>This routine will parse and validate the file. Error messages can be
captured by registering an error handler. See python/tests/reader2.py for
more complete Python examples. At the C level the equivalent call to activate
the validation feature is just:</p>
<pre>ret = xmlTextReaderSetParserProp(reader, XML_PARSER_VALIDATE, 1);</pre>

<p>and a return value of 0 indicates success.</p>
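<p>Capturing the error messages mentioned above can be sketched as follows.
The sketch assumes the module-level
<code>libxml2.registerErrorHandler(f, ctx)</code> function of the Python
bindings, which calls back <code>f(ctx, message)</code>;
<code>ErrorLog</code> and <code>parse_and_validate()</code> are hypothetical
names introduced for the illustration, and the exact text and splitting of
the messages depend on the library version.</p>

```python
class ErrorLog(object):
    """Accumulates the text passed to a libxml2 error callback."""

    def __init__(self):
        self.chunks = []

    def record(self, ctx, message):
        # Called back as f(ctx, message); one report may arrive in pieces.
        self.chunks.append(message)

    def text(self):
        return "".join(self.chunks)


def parse_and_validate(filename):
    """Return (status, messages) for a DTD-validating parse of `filename`.

    status is the last value returned by Read(): 0 means the end of the
    document was reached without Read() reporting an error.
    """
    import libxml2           # assumes the libxml2 Python bindings are installed
    log = ErrorLog()
    libxml2.registerErrorHandler(log.record, None)
    reader = libxml2.newTextReaderFilename(filename)
    reader.SetParserProp(libxml2.PARSER_VALIDATE, 1)
    ret = reader.Read()
    while ret == 1:
        ret = reader.Read()
    return ret, log.text()
```

<p>Keeping the messages in an object rather than printing them lets the
caller decide how to report them, for instance only when the returned status
is not 0.</p>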

<h2><a name="Entities">Entity substitution</a></h2>

<p>By default the xmlReader will report entities as such and not replace them
with their content. This default behaviour can however be overridden using:</p>

<p><code>reader.SetParserProp(libxml2.PARSER_SUBST_ENTITIES,1)</code></p>
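<p>One way to observe the difference is to count the entity reference nodes
(NodeType 5) seen while streaming a document, once with the default settings
and once after the call above. The helper below is a sketch;
<code>count_entity_references()</code> is a hypothetical name and it uses only
the Read() and NodeType() methods.</p>

```python
ENTITY_REFERENCE = 5   # NodeType() value for entity references


def count_entity_references(reader):
    """Read the whole stream and count the entity reference nodes seen."""
    count = 0
    ret = reader.Read()
    while ret == 1:
        if reader.NodeType() == ENTITY_REFERENCE:
            count += 1
        ret = reader.Read()
    if ret != 0:
        raise ValueError("failed to parse the document")
    return count
```

<p>With substitution enabled the count would be expected to drop to 0, since
entity reference nodes are then replaced by their expanded content.</p>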

<h2><a name="L1142">Relax-NG Validation</a></h2>

<p style="font-size: 10pt">Introduced in version 2.5.7</p>

<p>Libxml2 can now validate the document being read by the xmlReader against
Relax-NG schemas. While the Relax-NG validator can't always work in a
streamable mode, only subsets which cannot be reduced to regular expressions
need to have their subtree expanded for validation. In practice it means
that, unless the schema for the top level element content cannot be expressed
as a regexp, only chunks of the document need to be expanded while
validating.</p>

<p>The steps to do so are:</p>
<ul>
  <li>create a reader working on a document as usual</li>
  <li>before any call to read, associate it with a Relax-NG schema, either a
    preparsed schema or the URL of the schema to use</li>
  <li>errors will be reported the usual way, and the validity status can be
    obtained using the IsValid() interface of the reader like for DTDs.</li>
</ul>

<p>Example, assuming the reader has already been created and that the schema
string contains the Relax-NG schema:</p>
<pre>rngp = libxml2.relaxNGNewMemParserCtxt(schema, len(schema))
rngs = rngp.relaxNGParse()
reader.RelaxNGSetSchema(rngs)
ret = reader.Read()
while ret == 1:
    ret = reader.Read()
if ret != 0:
    print("Error parsing the document")
if reader.IsValid() != 1:
    print("Document failed to validate")</pre>

<p>See <code>reader6.py</code> in the sources or documentation for a complete
example.</p>
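<p>The steps above also allow the schema to be given by URL instead of as a
preparsed schema. The sketch below assumes that the Python reader exposes the
C function xmlTextReaderRelaxNGValidate() as a
<code>RelaxNGValidate(url)</code> method returning 0 on success, which is
worth checking against the installed bindings;
<code>validate_against_rng()</code> is a hypothetical helper name.</p>

```python
def validate_against_rng(reader, schema_url):
    """Stream the document and validate it against a Relax-NG schema.

    `schema_url` is the path or URL of the schema. Returns True when the
    document was read to its end and the reader reports it as valid.
    """
    # The schema must be associated before the first call to Read().
    if reader.RelaxNGValidate(schema_url) != 0:
        raise ValueError("unable to load the schema %s" % schema_url)
    ret = reader.Read()
    while ret == 1:
        ret = reader.Read()
    return ret == 0 and reader.IsValid() == 1
```

<p>The reader passed in is created as usual, for example with
<code>libxml2.newTextReaderFilename()</code>, and must not have been read from
yet.</p>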

<h2><a name="Mixing">Mixing the reader and tree or XPath operations</a></h2>

<p style="font-size: 10pt">Introduced in version 2.5.7</p>

<p>While the reader is a streaming interface, its underlying implementation
is based on the DOM builder of libxml2. As a result it is relatively simple
to mix operations based on both models under some constraints. To do so the
reader has an Expand() operation which grows the subtree under the
current node. It returns a pointer to a standard node which can be
manipulated in the usual ways. The node will get all its ancestors and the
full subtree available. Usual operations like XPath queries can be used on
that reduced view of the document. Here is an example extracted from
reader5.py in the sources which extracts and prints the bibliography entry for
the "Dragon" compiler book from the XML 1.0 recommendation:</p>
<pre>f = open('../../test/valid/REC-xml-19980210.xml')
input = libxml2.inputBuffer(f)
reader = input.newTextReader("REC")
res = ""
while reader.Read() == 1:                 # stop at the end or on error
    while reader.Name() == 'bibl':
        node = reader.Expand()            # expand the subtree
        if node.xpathEval("@id = 'Aho'"): # use XPath on it
            res = res + node.serialize()
        if reader.Next() != 1:            # skip the subtree
            break</pre>

<p>Note, however, that the node instance returned by the Expand() call is only
valid until the next Read() operation. The Expand() operation does not
affect the Read() ones; however, once processed, the full subtree is usually
no longer useful, and the Next() operation allows skipping it completely and
proceeding to the successor, or returns 0 if the document end is reached.</p>
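<p>The same pattern can be written as a reusable helper that keeps only the
serialized text of each match, so that no expanded node is used after the
cursor has moved. This is a sketch under the constraints described above;
<code>serialize_matching()</code> is a hypothetical name, and it relies only
on the Read(), Next(), NodeType(), Name() and Expand() reader methods and on
the xpathEval() and serialize() node methods used in the example.</p>

```python
def serialize_matching(reader, name, xpath):
    """Return the serialized subtrees of the `name` elements matching `xpath`.

    `xpath` is a boolean XPath expression evaluated on each expanded element,
    for example "@id = 'Aho'".
    """
    chunks = []
    ret = reader.Read()
    while ret == 1:
        if reader.NodeType() == 1 and reader.Name() == name:
            node = reader.Expand()      # only valid until the cursor moves
            if node.xpathEval(xpath):
                chunks.append(node.serialize())   # copy the text out now
            ret = reader.Next()         # skip the subtree just processed
        else:
            ret = reader.Read()
    if ret != 0:
        raise ValueError("failed to parse the document")
    return chunks
```

<p>With the reader of the example above,
<code>serialize_matching(reader, "bibl", "@id = 'Aho'")</code> performs the
same extraction and returns the matches as a list of strings.</p>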

<p><a href="mailto:xml@gnome.org">Daniel Veillard</a></p>

<p>$Id$</p>

<p></p>
</body>
</html>