Pull parsing XML in PHP

Create memory-efficient stream processing

Discover the XMLReader library, which is bundled with PHP 5 and enables PHP pages to process XML documents in an efficient streaming mode.


Elliotte Rusty Harold (elharo@metalab.unc.edu), Adjunct Professor, Polytechnic University

Elliotte Rusty Harold is originally from New Orleans, to which he returns periodically in search of a decent bowl of gumbo. However, he resides in the Prospect Heights neighborhood of Brooklyn with his wife Beth and cats Charm (named after the quark) and Marjorie (named after his mother-in-law). He's an adjunct professor of computer science at Polytechnic University, where he teaches Java and object-oriented programming. His Cafe au Lait Web site has become one of the most popular independent Java sites on the Internet, and his spin-off site, Cafe con Leche, has become one of the most popular XML sites. His most recent book is Java I/O, 2nd edition. He's currently working on the XOM API for processing XML, the Jaxen XPath engine, and the Jester test coverage tool.



11 January 2008 (First published 30 January 2007)


11 Jan 2008 update: The third code section in Initialize the parser and load the document was revised to: $reader->open('http://www.cafeaulait.org/today.atom'); from $reader->XML('http://www.cafeaulait.org/today.atom');

libxml

The XMLReader API described here sits on top of the Gnome Project's libxml library for C and C++. XMLReader is really just a thin PHP layer on top of libxml's xmlTextReader API. xmlTextReader is itself modeled after (although it shares no code with) .NET's XmlTextReader and XmlReader classes.

PHP 5 introduced XMLReader, a new class for reading Extensible Markup Language (XML). Unlike SimpleXML or the Document Object Model (DOM), XMLReader operates in streaming mode. That is, it reads the document from start to finish. You can begin to work with the content at the beginning before you see the content at the end. This makes it very fast, very efficient, and very parsimonious with memory. The larger the documents you need to process, the more important this is.

Unlike the Simple API for XML (SAX), XMLReader is a pull parser rather than a push parser. This means that your program is in control. Rather than being told what the parser sees when the parser sees it, you tell the parser when to go fetch the next piece of the document. You request content rather than react to it. Another way of thinking about it: XMLReader is an implementation of the Iterator design pattern rather than the Observer design pattern.
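To make the contrast concrete, here is a minimal sketch that collects element names from the same small document in both styles. It assumes PHP's xml extension (xml_parser_create() and related functions) stands in for the SAX-style push parser; the variable names $pushed and $pulled are mine, chosen only for illustration.

```php
<?php
// A tiny document to parse both ways.
$xml = '<methodCall><methodName>sqrt</methodName></methodCall>';

// Push (SAX style): register callbacks; the parser decides when to call them.
$pushed = array();
$sax = xml_parser_create();
xml_parser_set_option($sax, XML_OPTION_CASE_FOLDING, 0); // keep names as written
xml_set_element_handler(
    $sax,
    function ($parser, $name, $attrs) use (&$pushed) { $pushed[] = $name; },
    function ($parser, $name) { /* end tags are ignored in this sketch */ }
);
xml_parse($sax, $xml, true);

// Pull (XMLReader): the loop below decides when to fetch the next node.
$pulled = array();
$reader = new XMLReader();
$reader->XML($xml);
while ($reader->read()) {
    if ($reader->nodeType == XMLReader::ELEMENT) {
        $pulled[] = $reader->name;
    }
}
$reader->close();
```

Both arrays end up holding the same element names. The difference is where the control flow lives: in the handlers for the push parser, in your own while loop for the pull parser.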

A sample problem

Let's begin with a simple example. Suppose you're writing a PHP script that receives XML-RPC requests and generates responses. More specifically, suppose the requests look like Listing 1. The root element of the document is methodCall, which contains a methodName element and a params element. The method name is sqrt. The params element contains one param element that contains a double whose square root is desired. Namespaces aren't used.

Listing 1. An XML-RPC request
<?xml version="1.0"?>
<methodCall>
  <methodName>sqrt</methodName>
  <params>
    <param>
      <value><double>36.0</double></value>
    </param>
  </params>
</methodCall>

Here's what the PHP script needs to do:

  1. Check the method name, and generate a fault response if it's not sqrt (the only method this script knows how to handle).
  2. Find the argument, and generate a fault response if it's not present or has the wrong type.
  3. Otherwise, calculate the square root.
  4. Return the result in the form shown in Listing 2.

Listing 2. An XML-RPC response
<?xml version="1.0"?>
<methodResponse>
  <params>
    <param>
      <value><double>6.0</double></value>
    </param>
  </params>
</methodResponse>

Let's develop this step by step.


Initialize the parser and load the document

The first step is to create a new parser object. Doing so is straightforward:

$reader = new XMLReader();

Populate the raw post data

If you find that $HTTP_RAW_POST_DATA is empty, add the following line to your php.ini file:

always_populate_raw_post_data = On

Next, you need to give it some data to parse. For XML-RPC, this is the raw body of the Hypertext Transfer Protocol (HTTP) request. This string can then be passed to the reader's XML() function:

$request = $HTTP_RAW_POST_DATA;
$reader->XML($request);
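$HTTP_RAW_POST_DATA was deprecated in PHP 5.6 and removed in PHP 7, so on newer installations the request body has to come from somewhere else. One possible approach, sketched below, reads it from the php://input stream. The helper names read_request_body() and reader_for_request() are hypothetical, and the $source parameter exists only so the functions can be tried against a local file.

```php
<?php
// Hypothetical helper: return the raw request body as a string.
// php://input is PHP's read-only stream for the body of the current request.
function read_request_body($source = 'php://input') {
    $body = file_get_contents($source);
    return $body === false ? '' : $body;
}

// Hypothetical helper: build a reader for the request, or return null when
// the body is empty (XML() rejects an empty string).
function reader_for_request($source = 'php://input') {
    $request = read_request_body($source);
    if ($request === '') {
        return null;
    }
    $reader = new XMLReader();
    $reader->XML($request);
    return $reader;
}
```

In a page that handles a real POST, calling reader_for_request() with no argument would take the place of the two statements above.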

You can parse any string, wherever you get it. For instance, it can be a string literal in the program or read from a local file. You can also load data from an external URL with the open() function. For example, this statement prepares to parse one of my Atom feeds:

$reader->open('http://www.cafeaulait.org/today.atom');

Wherever you get your raw data, the reader is now set up and ready to parse.


Read the document

The read() function advances the parser to the next token. The simplest approach is to iterate through the entire document in a while loop:

while ($reader->read()) {
  // processing code goes here...
}

After you're finished, close the parser to release any resources it's holding onto and reset it for the next document:

$reader->close();

Inside the loop, the parser is positioned on a particular node: the start of an element, the end of an element, a text node, a comment, and so forth. You can find out what the parser is looking at right now by inspecting these properties:

  • localName is the local, unprefixed name of the node.
  • name is the possibly prefixed name of the node. For nodes such as comments that don't have names, it's #comment, #text, #document, and so forth, as in DOM.
  • namespaceURI is the Uniform Resource Identifier (URI) for the node's namespace.
  • nodeType is an integer representing the node type -- for example, 2 for an attribute node and 7 for a processing instruction.
  • prefix is the node's namespace prefix.
  • value is the node's text content.
  • hasValue is true if the node has a text value or false otherwise.

Of course, not all node types have all these properties. For instance, text nodes, CDATA sections, comments, processing instructions, attributes, whitespace, document types, and XML declarations have values. Other node types (most significantly, elements and documents) don't. Generally, a program uses the nodeType property to figure out what it's looking at and then responds appropriately. Listing 3 shows a simple while loop that uses these properties to print what it sees. Listing 4 shows the output from this program when Listing 1 is fed into it.

Listing 3. What the parser sees
while ($reader->read()) {
  echo $reader->name;
  if ($reader->hasValue) {
    echo ": " . $reader->value;
  }
  echo "\n";
}

Listing 4. Output from Listing 3
methodCall
#text: 
  
methodName
#text: sqrt
methodName
#text: 
  
params
#text: 
    
param
#text: 
      
value
double
#text: 36.0
double
value
#text: 
    
param
#text: 
  
params
#text: 

methodCall
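The output above doesn't distinguish a start-tag from an end-tag, because both report the same name. A small variation, sketched here, prints a label for the node type next to each name. The functions node_type_label() and describe_nodes() are hypothetical helpers of my own; the constants are the ones XMLReader defines.

```php
<?php
// Hypothetical helper: map the node types discussed here to readable labels.
function node_type_label($type) {
    $labels = array(
        XMLReader::ELEMENT                => 'ELEMENT',
        XMLReader::END_ELEMENT            => 'END_ELEMENT',
        XMLReader::TEXT                   => 'TEXT',
        XMLReader::CDATA                  => 'CDATA',
        XMLReader::COMMENT                => 'COMMENT',
        XMLReader::PI                     => 'PI',
        XMLReader::WHITESPACE             => 'WHITESPACE',
        XMLReader::SIGNIFICANT_WHITESPACE => 'SIGNIFICANT_WHITESPACE',
    );
    return isset($labels[$type]) ? $labels[$type] : 'OTHER(' . $type . ')';
}

// Hypothetical helper: list "LABEL name" for every node in the document.
function describe_nodes($xml) {
    $reader = new XMLReader();
    $reader->XML($xml);
    $lines = array();
    while ($reader->read()) {
        $lines[] = node_type_label($reader->nodeType) . ' ' . $reader->name;
    }
    $reader->close();
    return $lines;
}
```

For example, describe_nodes('<double>36.0</double>') reports an ELEMENT, a TEXT node, and an END_ELEMENT, in that order.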

Most programs aren't so generic. They accept input in a particular form and process it in some way. In the XML-RPC example, you need to read only one thing in the input: the double element, of which there should be exactly one. To do that, you look for the start of an element with the name double:

if ($reader->name == "double" 
  && $reader->nodeType == XMLReader::ELEMENT) {
    // ...
}

This element likely has a single text node child, which you can read by advancing the parser to the next node like so:

if ($reader->name == "double" && $reader->nodeType == XMLReader::ELEMENT) {
    $reader->read();
    respond($reader->value);
}

Here the respond() function builds the XML-RPC response and sends it to the client. However, before I show that, there's something else I need to address. It's not absolutely guaranteed that the double element in the request document contains exactly one text node. It might contain several, as well as comments and processing instructions. For instance, it could look like this:

 <value><double>
  <!--value follows-->6.<!--fractional part next-->0
</double></value>

Nested elements

This scheme has one potential flaw. Nested double elements such as <double>6<double>1.2</double></double> would break this algorithm. However, that would be invalid XML-RPC; and shortly you'll see how to use RELAX NG validation to reject all such documents. In document types such as Extensible Hypertext Markup Language (XHTML) that allow the same elements inside each other (such as a table inside a table), you also need to keep track of the depth of the elements to make sure you're matching the right end-tag to the right start-tag.
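For document types that do allow such nesting, one way to keep track of depth is the reader's depth property, which has the same value on an end-tag as on its matching start-tag. The following is only a sketch under that assumption; text_until_matching_end() is a hypothetical helper, not part of XMLReader.

```php
<?php
// Hypothetical helper: with the reader positioned on a start-tag, gather the
// text inside that element, stopping only at the end-tag at the same depth.
function text_until_matching_end(XMLReader $reader) {
    $name  = $reader->name;
    $depth = $reader->depth;
    $text  = "";
    if ($reader->isEmptyElement) {
        return $text; // <table/> has no separate end-tag to wait for
    }
    while ($reader->read()) {
        if ($reader->nodeType == XMLReader::END_ELEMENT
            && $reader->name == $name
            && $reader->depth == $depth) {
            break; // this is the end-tag that matches the start-tag
        }
        if ($reader->nodeType == XMLReader::TEXT
            || $reader->nodeType == XMLReader::CDATA) {
            $text .= $reader->value;
        }
    }
    return $text;
}
```

Without the depth comparison, a table nested inside a table would end the loop at the inner </table>. With it, the loop runs on to the outer one.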

A robust solution needs to get all the text node children of the double element, concatenate them, and only then convert the result to a double. It needs to carefully avoid any comments or other non-text nodes that might appear. This is a little more complex, but not excessively so, as Listing 5 shows.

Listing 5. Accumulate all text content from an element
  // start with an empty string so that .= has something to append to
  $input = "";
  while ($reader->read()) {
    if ($reader->nodeType == XMLReader::TEXT
      || $reader->nodeType == XMLReader::CDATA
      || $reader->nodeType == XMLReader::WHITESPACE
      || $reader->nodeType == XMLReader::SIGNIFICANT_WHITESPACE) {
       $input .= $reader->value;
    }
    else if ($reader->nodeType == XMLReader::END_ELEMENT
      && $reader->name == "double") {
        break;
    }
  }
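To try this accumulation on the comment-laden example shown earlier, the loop can be wrapped in a function that first skips ahead to the element of interest. This is a self-contained sketch; element_text() is a hypothetical name, and trimming the result before converting it to a number is my own choice.

```php
<?php
// Hypothetical helper: return the concatenated text of the first element
// named $name, skipping comments and processing instructions inside it.
function element_text($xml, $name) {
    $reader = new XMLReader();
    $reader->XML($xml);
    $text = "";
    $inside = false;
    while ($reader->read()) {
        $type = $reader->nodeType;
        if (!$inside) {
            // Skip ahead until the start-tag of the wanted element.
            if ($type == XMLReader::ELEMENT && $reader->name == $name) {
                if ($reader->isEmptyElement) {
                    break; // e.g. <double/> has no content at all
                }
                $inside = true;
            }
        } elseif ($type == XMLReader::TEXT
            || $type == XMLReader::CDATA
            || $type == XMLReader::WHITESPACE
            || $type == XMLReader::SIGNIFICANT_WHITESPACE) {
            $text .= $reader->value;
        } elseif ($type == XMLReader::END_ELEMENT && $reader->name == $name) {
            break;
        }
    }
    $reader->close();
    return $text;
}

// Usage: the pieces on either side of the comments are joined back together.
$xml = "<value><double>\n  <!--value follows-->6.<!--fractional part next-->0\n</double></value>";
$number = trim(element_text($xml, "double"));
```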

You can ignore everything else in the document for the moment. (I'll add more error-handling later.)


Build the response

As its name implies, XMLReader is purely for reading. A corresponding XMLWriter class is in development but isn't yet ready for production. Fortunately, writing XML is much easier than reading it. First, you should set the media type of the response using the header() function. For XML-RPC, this is application/xml. For example:

header('Content-type: application/xml');

The content can usually be echoed straight onto the page, as shown in the respond() function in Listing 6.

Listing 6. Echo XML
function respond($input) {
  echo "<?xml version='1.0'?>
<methodResponse>
  <params>
    <param>
      <value><double>" .
       sqrt($input)
  . "</double></value>
    </param>
  </params>
</methodResponse>";
}

You can even embed the literal parts of the response directly in the PHP page, just as you would with HTML. Listing 7 demonstrates this technique.

Listing 7. Literal XML
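function respond($input) {
  // Echo the XML declaration rather than writing it literally, so that
  // "<?xml" is never mistaken for a PHP short open tag.
  echo "<?xml version='1.0'?>\n";
  ?>
<methodResponse>
  <params>
    <param>
      <value><double><?php echo sqrt($input); ?></double></value>
    </param>
  </params>
</methodResponse>
<?php
}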

Error handling

Until now, I implicitly assumed that the input document was well-formed. However, there's no guarantee of that. Like any XML parser, XMLReader is required to stop processing as soon as it detects a well-formedness error. If it does so, the read() function returns false.

Theoretically, the parser could report data up to the first error it finds. In my experiments with small documents, however, it errors out almost immediately. The underlying parser is preparsing a large chunk of the document, caching it, and then doling it out a piece at a time. Thus it tends to detect errors prematurely. For safety's sake, don't assume you'll be able to parse content before the first well-formedness error. Furthermore, don't assume you won't see any content before the parser error. If you want to accept only complete, well-formed documents, then make sure your script doesn't do anything irreversible until the end of the document is seen.

If the parser detects a well-formedness error, then the read() function echoes an error message such as this one (if verbose error reporting is turned on, as it should be on a development server):

<br />
<b>Warning</b>:  XMLReader::read() [<a href='function.read'>function.read</a>]:       
< value><double>10</double></value> in <b>/var/www/root.php</b> 
on line <b>35</b><br />

You probably don't want to copy this into the page the user sees. A better approach is to capture the error message in the $php_errormsg variable. To do this, you need to turn on the track_errors configuration option in your php.ini file:

track_errors = On

The track_errors option is off by default; this is explicitly specified in php.ini, so make sure you change that line. If you add the previous line early in php.ini, as I initially did, the later track_errors = Off line will override it.

This program should send responses only to complete, well-formed input. (Valid too, but I'll get to that.) Thus you need to wait until you're finished parsing the document (you've broken out of the while loop). At that point, you check to see whether $php_errormsg is set. If it isn't, the document is well-formed, and you send an XML-RPC response message. If the variable is set, the document is not well-formed, and you instead send an XML-RPC fault response. You also send a fault response if someone requests the square root of a negative number. Listing 8 demonstrates.

Listing 8. Check for well-formedness
     // set up the request
    $request = $HTTP_RAW_POST_DATA;
    error_reporting(E_ERROR | E_WARNING | E_PARSE);
    if (isset($php_errormsg)) unset($php_errormsg);
    // create the reader
    $reader = new XMLReader();
    // $reader->setRelaxNGSchema("request.rng");
    $reader->XML($request);

    $input = "";
    while ($reader->read()) {
      if ($reader->name == "double" && $reader->nodeType == XMLReader::ELEMENT) {

          while ($reader->read()) {
            if ($reader->nodeType == XMLReader::TEXT
              || $reader->nodeType == XMLReader::CDATA
              || $reader->nodeType == XMLReader::WHITESPACE
              || $reader->nodeType == XMLReader::SIGNIFICANT_WHITESPACE) {
               $input .= $reader->value;
            }
            else if ($reader->nodeType == XMLReader::END_ELEMENT
              && $reader->name == "double") {
                break;
            }
          } 
          break;
      }
    } 

    // make sure the input was well-formed
    if (isset($php_errormsg) ) fault(21, $php_errormsg);
    else if ($input < 0) fault(20, "Cannot take square root of negative number");
    else respond($input);
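$php_errormsg and the track_errors option were deprecated in PHP 7.2 and removed in PHP 8, so the check above won't work there. A possible alternative, sketched below, asks libxml to collect its errors with libxml_use_internal_errors() and inspects them after the loop with libxml_get_errors(). The function read_double() and its array($text, $error) return shape are assumptions of mine. Unlike the listing above, this sketch reads all the way to the end of the document before deciding.

```php
<?php
// Hypothetical helper: returns array($text, $error). $error is null when the
// whole document was read without libxml reporting a problem.
function read_double($request) {
    if ($request === '') {
        return array(null, 'Empty request'); // XML() rejects an empty string
    }
    $previous = libxml_use_internal_errors(true); // collect errors, don't print
    libxml_clear_errors();

    $reader = new XMLReader();
    $reader->XML($request);

    $input = "";
    $inside = false;
    // "@" hides any extra warning that read() itself may raise on failure.
    while (@$reader->read()) {
        $type = $reader->nodeType;
        if ($type == XMLReader::ELEMENT && $reader->name == "double") {
            $inside = !$reader->isEmptyElement;
        } elseif ($type == XMLReader::END_ELEMENT && $reader->name == "double") {
            $inside = false;
        } elseif ($inside
            && ($type == XMLReader::TEXT || $type == XMLReader::CDATA)) {
            $input .= $reader->value;
        }
    }
    $reader->close();

    $errors = libxml_get_errors();
    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the caller's setting

    if (count($errors) > 0) {
        return array(null, trim($errors[0]->message));
    }
    return array($input, null);
}
```

The caller would then send a fault response when the second array entry is not null, and otherwise check the sign of the first entry before calling respond().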

This is a simple version of a common pattern in streaming processing of XML. The parser fills a data structure that is acted on when the document is finished. Usually the data structure is simpler than the document itself. Here the data structure is especially simple: a single string.


Validation

libxml version

RELAX NG had some serious bugs in earlier versions of libxml, the library on which XMLReader depends. Make sure you're using at least version 2.6.26. Many systems, including Mac OS X Tiger, bundle an earlier, buggy release.

Until now, I've been cavalier about verifying that the data was where I thought it was. The easiest way to accomplish this verification is to check the document against a schema. XMLReader supports the RELAX NG schema language; Listing 9 shows a simple RELAX NG schema for this specific form of XML-RPC request.

Listing 9. A RELAX NG schema for the XML-RPC request
<element name="methodCall" xmlns="http://relaxng.org/ns/structure/1.0" 
 datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <element name="methodName">
    <value>sqrt</value>
  </element>
  <element name="params">
    <element name="param">
      <element name="value">
        <element name="double">
          <data type="double"/>
        </element>
      </element>
    </element>
  </element>
</element>

You can embed the schema directly in the PHP script as a string literal using setRelaxNGSchemaSource() or read it from an external file or URL using setRelaxNGSchema(). For example, assuming Listing 9 is in the file sqrt.rng, here's how you load the schema:

$reader->setRelaxNGSchema("sqrt.rng");

Do this after you load the document with XML() or open() but before the first call to read(); the schema is attached to the loaded document, and loading a new document discards it. The parser checks the document against the schema as it reads. To check whether the document is valid, you call isValid(), which returns true if the document is valid (so far) and false if it isn't. Listing 10 demonstrates the complete finished program, including all error handling. This should accept any legal input and return a correct value, and reject all incorrect requests. I've also added a fault() function that sends an XML-RPC fault response when something goes wrong.
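Before the full program, the validation step can be sketched in isolation. The helper is_valid_request() below is hypothetical. It assumes the schema is passed in as a string (for instance, the text of Listing 9) and that libxml's internal error collection is used to catch well-formedness errors alongside isValid().

```php
<?php
// Hypothetical helper: true only if $request is well-formed and valid
// against the RELAX NG grammar held in the string $schema.
function is_valid_request($request, $schema) {
    $previous = libxml_use_internal_errors(true); // collect errors quietly
    libxml_clear_errors();

    $reader = new XMLReader();
    $reader->XML($request);
    // Attach the schema once the document is loaded, before the first read().
    $valid = $reader->setRelaxNGSchemaSource($schema);

    // "@" hides any extra warning that read() itself may raise on failure.
    while ($valid && @$reader->read()) {
        $valid = $reader->isValid();
    }
    // read() also returns false on a well-formedness error, so check for those.
    $valid = $valid && count(libxml_get_errors()) == 0;

    $reader->close();
    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the caller's setting
    return $valid;
}
```

A request that asks for sqrt with a double argument should pass; a request naming any other method should fail as soon as the reader gets past the methodName element.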

Listing 10. The complete XML-RPC square root server
<?php
header('Content-type: application/xml');

// try grammar
$schema = "<element name='methodCall' 
                   xmlns='http://relaxng.org/ns/structure/1.0' 
                   datatypeLibrary='http://www.w3.org/2001/XMLSchema-datatypes'>
  <element name='methodName'>
    <value>sqrt</value>
  </element>
  <element name='params'>
    <element name='param'>
      <element name='value'>
        <element name='double'>
          <data type='double'/>
        </element>
      </element>
    </element>
  </element>
</element>";


if (!isset($HTTP_RAW_POST_DATA)) {
   fault(22, "Please make sure always_populate_raw_post_data = On in php.ini");
}
else {

    // set up the request
    $request = $HTTP_RAW_POST_DATA;
    error_reporting(E_ERROR | E_WARNING | E_PARSE);
    // create the reader
    $reader = new XMLReader();
    $reader->XML($request);
    // attach the embedded schema after loading and before the first read()
    $reader->setRelaxNGSchemaSource($schema);

    $input = "";
    while ($reader->read()) {
      if ($reader->name == "double" && $reader->nodeType == XMLReader::ELEMENT) {

          while ($reader->read()) {
            if ($reader->nodeType == XMLReader::TEXT
              || $reader->nodeType == XMLReader::CDATA
              || $reader->nodeType == XMLReader::WHITESPACE
              || $reader->nodeType == XMLReader::SIGNIFICANT_WHITESPACE) {
               $input .= $reader->value;
            }
            else if ($reader->nodeType == XMLReader::END_ELEMENT
              && $reader->name == "double") {
                break;
            }
          } 
          break;
      }
    } 

    if (isset($php_errormsg) ) fault(21, $php_errormsg);
    else if (! $reader->isValid()) fault(19, "Invalid request");
    else if ($input < 0) fault(20, "Cannot take square root of negative number");
    else respond($input);

    $reader->close();
}


function respond($input)
{
?>
<methodResponse>
  <params>
    <param>
      <value><double><?php 
 echo      sqrt($input);
?></double></value>
    </param>
  </params>
</methodResponse>
  <?php
}


function fault($code, $message)
{

  echo "<?xml version='1.0'?>
<methodResponse>
  <fault>
    <value>
      <struct>
        <member>
          <name>faultCode</name>
          <value><int>" . $code . "</int></value>
        </member>
        <member>
          <name>faultString</name>
          <value>
             <string>" . $message . "</string>
          </value>
        </member>
      </struct>
    </value>
  </fault>
</methodResponse>";
  
}
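One caution about fault(): the message is copied into the response as is, and a parser error message can itself contain characters such as < or &, which would make the response malformed. A variant that escapes the message first might look like the sketch below. The name fault_xml() is hypothetical, and returning the string instead of echoing it is my own choice to make the function easier to try out.

```php
<?php
// Hypothetical variant of fault(): builds the response as a string and
// escapes the message so that <, >, & and quotes can't break the markup.
function fault_xml($code, $message) {
    $safe = htmlspecialchars($message, ENT_QUOTES | ENT_XML1, 'UTF-8');
    return "<?xml version='1.0'?>
<methodResponse>
  <fault>
    <value>
      <struct>
        <member>
          <name>faultCode</name>
          <value><int>" . (int) $code . "</int></value>
        </member>
        <member>
          <name>faultString</name>
          <value>
             <string>" . $safe . "</string>
          </value>
        </member>
      </struct>
    </value>
  </fault>
</methodResponse>";
}
```

Calling echo fault_xml(21, $message) would then replace the direct call to fault().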

Attributes

Attributes aren't seen during the normal course of pull parsing. To read attributes, you stop at the start of an element and request a specific attribute, either by name or number.

Pass the name of the attribute you want to getAttribute() to find the value of that attribute on the current element. For example, this statement asks for the id attribute of the current element:

$id = $reader->getAttribute("id");

If the attribute is in a namespace -- for example, xlink:href -- call getAttributeNS(), passing the local name and namespace URI as the first and second arguments, respectively. (The prefix doesn't matter.) For example, this statement requests the value of the xlink:href attribute in the http://www.w3.org/1999/xlink namespace:

$href = $reader->getAttributeNS("href", "http://www.w3.org/1999/xlink");

Attribute order

Attribute order isn't significant in XML documents and isn't preserved by the parser. The numbers used here to index attributes are purely conveniences. There is no guarantee that the first attribute in the start-tag will be attribute 1, the second will be attribute 2, and so on. Don't write code that depends on attribute order.

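Both of these methods return an empty string if the attribute doesn't exist, at least in the PHP 5 releases current when this was written. (This is wrong. They should return null. That design makes it hard to distinguish between an attribute whose value is the empty string and one that isn't present at all. Current PHP documentation specifies a null return for a missing attribute, so check the behavior of the version you deploy on.)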
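Here is a short sketch of both lookups against a made-up element. It also shows one way to tell a missing attribute from an empty one without relying on the return value: moveToAttribute() returns false when the attribute isn't there. The helper has_attribute() and the sample markup are my own assumptions.

```php
<?php
// Hypothetical helper: true if the current element carries the attribute.
// The reader is moved back to the element afterward.
function has_attribute(XMLReader $reader, $name) {
    $found = $reader->moveToAttribute($name);
    if ($found) {
        $reader->moveToElement();
    }
    return $found;
}

// A made-up element with one plain and one namespaced attribute.
$xml = '<link xmlns:xlink="http://www.w3.org/1999/xlink"'
     . ' id="p1" xlink:href="today.atom"/>';

$reader = new XMLReader();
$reader->XML($xml);
$reader->read(); // position on the start of the link element

$id      = $reader->getAttribute("id");
$href    = $reader->getAttributeNS("href", "http://www.w3.org/1999/xlink");
$hasId   = has_attribute($reader, "id");
$hasLang = has_attribute($reader, "lang");
$reader->close();
```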

If you just want to know all the attributes on an element, and you don't know their names in advance, then call moveToNextAttribute() when the reader is positioned on the element. Once the parser is positioned on an attribute node, you can read its name, namespace, and value with the same properties used for elements. For example, this code fragment prints out all the attributes of the current element:

  if ($reader->hasAttributes and $reader->nodeType == XMLReader::ELEMENT) {
    while ($reader->moveToNextAttribute()) {
      echo $reader->name . "='" . $reader->value . "'\n";
    }
    echo "\n";
  }

Very unusually for an XML API, XMLReader lets you read the attributes from either the beginning or the end of the element. To avoid double counting, it's important to check that the node type is XMLReader::ELEMENT and not XMLReader::END_ELEMENT, which can also have attributes.
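If you'd rather have the attributes in an array than print them, the same loop can be wrapped as in the sketch below. The function attributes_of() is a hypothetical helper. It applies the node type check described above and calls moveToElement() afterward so the reader is left on the element rather than on its last attribute.

```php
<?php
// Hypothetical helper: collect the current element's attributes into an
// associative array of name => value. Returns an empty array on any node
// that is not a start-tag.
function attributes_of(XMLReader $reader) {
    $attributes = array();
    if ($reader->nodeType == XMLReader::ELEMENT && $reader->hasAttributes) {
        while ($reader->moveToNextAttribute()) {
            $attributes[$reader->name] = $reader->value;
        }
        $reader->moveToElement(); // step back from the last attribute
    }
    return $attributes;
}
```

Because attribute order isn't significant, code that uses the result should look values up by name rather than by position.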


In conclusion

XMLReader is a useful addition to the PHP programmer's toolkit. Unlike SimpleXML, it's a full XML parser that handles all documents, not just some of them. Unlike DOM, it can handle documents larger than available memory. Unlike SAX, it puts your program in control. If your PHP programs need to accept XML input, XMLReader is well worth your consideration.
