XML for PHP developers, Part 2

Advanced XML parsing techniques

PHP5's XML parsing techniques for large or complex XML documents

PHP5 offers an improved variety of XML parsing techniques. The SAX parser, which was built on James Clark's Expat in PHP4 and is now based on libxml2, is no longer the only fully functional game in town. Parsing with the DOM, fully compliant with the W3C standard, is a familiar option. Both SimpleXML, which you saw in Part 1 (see Related topics), and XMLReader, which is easier and faster than SAX, offer additional parsing approaches. All the XML extensions are now based on the libxml2 library by the GNOME project, and this unified library allows for interoperability between the different extensions. This article covers PHP5's XML parsing techniques, focusing on large or complex XML documents. It offers some background about parsing techniques, explains which method is best suited to which types of XML documents, and, where you have a choice, suggests the criteria for choosing.

SimpleXML

Part 1 provided essential information on XML and focused on quick-start Application Programming Interfaces (APIs). It demonstrated how SimpleXML, combined with the Document Object Model (DOM) as necessary, is the ideal choice if you work with straightforward, predictable, and relatively basic XML documents.

XML and PHP5

Extensible Markup Language (XML) is described as both a markup language and a text-based data storage format; it offers a text-based means to apply and describe a tree-based structure to information.

PHP5 has new and completely rewritten extensions for parsing XML. Those that load the entire XML document into memory include SimpleXML, the DOM, and the XSLT processor. Those that provide you with one piece of the XML document at a time include the Simple API for XML (SAX) and XMLReader. SAX functions the same way it did in PHP4, but it is now based on the libxml2 library rather than the expat library. If you are familiar with the DOM from other languages, you will have an easier time coding with the DOM in PHP5 than in previous versions.

XML parsing fundamentals

There are two basic ways to parse XML: tree and stream. Tree-style parsing loads the entire XML document into memory. The tree structure allows random access to the document's elements and editing of the XML. Examples of tree-type parsers include the DOM and SimpleXML, which hold the tree in memory in different but interoperable formats. Stream parsing, by contrast, does not load the entire XML document into memory. The term stream here corresponds closely to its use in streaming audio: in both cases a small amount of data is delivered at a time to conserve bandwidth and memory. In stream parsing, only the node currently being parsed is accessible, and editing the XML, as a document, is not possible. Examples of stream parsers include XMLReader and SAX.
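
To make the contrast concrete, here is a minimal sketch that reads the same small document both ways. The inline XML and the variable names are invented for illustration and are not part of the article's sample files. The tree version can reach the books in any order, while the stream version only ever sees them in document order.

```php
<?php
// Hypothetical sample document, small enough to inline
$xml = '<library><book id="1">Alpha</book><book id="2">Beta</book></library>';

// Tree style: the whole document is built in memory first,
// so any node can be reached in any order.
$doc = new DOMDocument();
$doc->loadXML($xml);
$books = $doc->getElementsByTagName('book');
$lastFirst = $books->item(1)->nodeValue . ', ' . $books->item(0)->nodeValue;
echo "Tree: $lastFirst\n";

// Stream style: nodes arrive one at a time, in document order only.
$reader = new XMLReader();
$reader->XML($xml);
$seen = array();
while ($reader->read()) {
    if ($reader->nodeType == XMLReader::ELEMENT && $reader->localName == 'book') {
        $seen[] = $reader->getAttribute('id');
    }
}
$reader->close();
echo "Stream: " . implode(', ', $seen) . "\n";
```

On a document this small the difference is only one of style; the memory argument starts to matter once the file is too large to hold comfortably as a tree.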

Tree-based parsers

Tree-based parsers are so named because they load the entire XML document into memory with the document root being the trunk, and all children, grandchildren, subsequent generations, and attributes being the branches. The most familiar tree-based parser is the DOM. The easiest tree-based parser to code is SimpleXML. You will look at both.

Parsing with the DOM

The DOM standard, according to the W3C, is "... a platform and language neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents." The libxml2 library by the GNOME project implements the DOM, along with all its methods, in C. Since all of the PHP5 XML extensions are based on libxml2, there is complete interoperability between the extensions. This interoperability greatly enhances their functionality. You can, for instance, use XMLReader, a stream parser, to get an element, import it into the DOM, and extract data using XPath. That is a lot of flexibility. You'll see this in Listing 5.

The DOM is easy to understand and use because its structure in memory mirrors the original XML document. It passes information to the application by creating a tree of objects that duplicates the tree of elements in the XML file, with every XML element being a node in the tree. Because the DOM is a W3C standard, its API is consistent with DOM implementations in other programming languages, which gives it a lot of authority with developers. Because the DOM builds a tree of the entire document, it uses a lot of memory and processor time.

The DOM in action

If your design or any other constraint forces you to be a one-trick pony in the area of parsers, this is where you want to be, for flexibility alone. With the DOM, you can build, modify, query, validate, and transform XML documents. You can use all DOM methods and properties. Most DOM level 2 methods are implemented with properties properly supported. Documents parsed with the DOM can be as complex as they come because of its tremendous flexibility. Remember, however, that this flexibility comes at a price when you load a large XML document into memory all at once.

The example in Listing 1 uses the DOM to parse the document and retrieves an element with getElementById. It's necessary to validate the document by setting validateOnParse=true before referring to the ID. According to the DOM standard, this requires a DTD which defines the attribute ID to be of type ID.

Listing 1. Using the DOM with a basic document
<?php

$doc = new DOMDocument();

// We must validate the document before referring to the id,
// because only the DTD tells the parser which attribute is an ID
$doc->validateOnParse = true;
$doc->load('basic.xml');

echo "The element whose id is myelement is: " .
     $doc->getElementById('myelement')->tagName . "\n";

?>
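
The contents of basic.xml are not shown in this article, so the following is only a sketch of what such a document might look like. It assumes a hypothetical root and item structure with the DTD inlined in the document itself. The ATTLIST declaration is what marks the id attribute as being of type ID, which is what getElementById relies on.

```php
<?php
// Hypothetical stand-in for basic.xml, with its DTD inlined
$xml = <<<XML
<?xml version="1.0"?>
<!DOCTYPE root [
  <!ELEMENT root (item*)>
  <!ELEMENT item (#PCDATA)>
  <!ATTLIST item id ID #REQUIRED>
]>
<root>
  <item id="myelement">Hello</item>
</root>
XML;

$doc = new DOMDocument();
// Validate while parsing so that the ID declaration from the DTD is honored
$doc->validateOnParse = true;
$doc->loadXML($xml);

$el = $doc->getElementById('myelement');
echo "The element whose id is myelement is: " . $el->tagName . "\n";
```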

The getElementsByTagName() function returns a new instance of class DOMNodeList containing the elements with a given tag name. The list, of course, has to be walked through. Altering the document structure while iterating the NodeList returned by getElementsByTagName() affects the NodeList you are iterating (see Listing 2). There is no validation requirement.

Listing 2. DOM getElementsByTagName method
DOMDocument {
  DOMNodeList getElementsByTagName(string name);
}
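
As a rough illustration of why the live list matters, the sketch below removes every book element from a small, made-up document. Copying the list into a plain array with iterator_to_array() first is one possible way to keep the loop stable; it is a suggestion, not something the DOM requires.

```php
<?php
// Hypothetical sample document
$doc = new DOMDocument();
$doc->loadXML('<books><book>A</book><book>B</book><book>C</book></books>');

$books = $doc->getElementsByTagName('book');
$before = $books->length;

// Iterate over a snapshot so that removing nodes does not
// shift the live list underneath the loop
foreach (iterator_to_array($books) as $book) {
    $book->parentNode->removeChild($book);
}

// The original list is live, so it now reflects the removals
$after = $books->length;
echo "Before: $before, after: $after\n";
```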

The example in Listing 3 uses the DOM with XPath.

Listing 3. Using the DOM and parsing with XPath
<?php

$doc = new DOMDocument();

// We don't want to bother with white spaces
$doc->preserveWhiteSpace = false;

$doc->load('book.xml');

$xpath = new DOMXPath($doc);

// The leading // matches book elements anywhere in the document
$query = '//book/chapter/para/informaltable/tgroup/tbody/row/entry[. = "en"]';

$entries = $xpath->query($query);

// Each matching entry is preceded by two sibling entries in the same row
foreach ($entries as $entry) {
   echo "Found {$entry->previousSibling->previousSibling->nodeValue}," .
        " by {$entry->previousSibling->nodeValue}\n";
}
?>
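
Listing 3 depends on book.xml, which is not reproduced here. The sketch below uses a small, invented table fragment to show why preserveWhiteSpace is switched off: with the indentation discarded at parse time, previousSibling lands on the neighboring entry element rather than on a whitespace text node.

```php
<?php
// Hypothetical fragment shaped like one table row from book.xml
$xml = <<<XML
<tbody>
  <row>
    <entry>Example Title</entry>
    <entry>Example Author</entry>
    <entry>en</entry>
  </row>
</tbody>
XML;

$doc = new DOMDocument();
// Drop the indentation so siblings are elements, not blank text nodes
$doc->preserveWhiteSpace = false;
$doc->loadXML($xml);

$xpath = new DOMXPath($doc);
$found = array();
foreach ($xpath->query('//row/entry[. = "en"]') as $entry) {
    $author = $entry->previousSibling;
    $title  = $author->previousSibling;
    $found[] = $title->nodeValue . ' by ' . $author->nodeValue;
}
echo implode("\n", $found), "\n";
```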

Having said all of those nice things about the DOM, I'll finish with an example of what not to do with it, and then, in the next example, show how to save yourself. Listing 4 loads a large file into the DOM simply to extract the data from a single element with DOMXPath.

Listing 4. Using the DOM with XPath the wrong way, on a large XML document
<?php

// Parsing a large document with DOM and DOMXPath
// First create a new DOM document to parse
$dom = new DOMDocument();

// This document is huge and we don't really need anything from the tree,
// yet loading it builds the whole tree and uses a huge amount of memory
$dom->load("tooBig.xml");
$xp = new DOMXPath($dom);
$result = $xp->query("/blog/entries/entry[@ID = 5225]/title");

// item(0) is null when nothing matched, so check before dereferencing
if ($result->length > 0) {
    print $result->item(0)->nodeValue . "\n";
}

?>

This final, follow-up example in Listing 5 uses the DOM with XPath in the same way, except the data is passed one element at a time by XMLReader using expand(). With this method, you can convert a node passed by XMLReader to a DOMElement.

Listing 5. Using the DOM with XPath the right way, on a large XML document
<?php

// Parsing a large document with XMLReader and expand(), then DOM/DOMXPath
$reader = new XMLReader();

$reader->open("tooBig.xml");

while ($reader->read()) {
    // Only the opening tag of the matching <entry> is of interest
    if ($reader->nodeType == XMLReader::ELEMENT
            && $reader->localName == "entry"
            && $reader->getAttribute("ID") == 5225) {
        // Copy just this element, with its children, into a small DOM document
        $node = $reader->expand();
        $dom = new DOMDocument();
        $n = $dom->importNode($node, true);
        $dom->appendChild($n);
        $xp = new DOMXPath($dom);
        $res = $xp->query("/entry/title");
        echo $res->item(0)->nodeValue;
        // Stop here rather than reading the rest of the file
        break;
    }
}
$reader->close();

?>
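
If you want to try the hand-off without a large file at hand, the following sketch runs the same steps against a short inline string. The blog structure is guessed from the XPath used in Listing 4, and the entry contents are invented. It also passes the imported node on to SimpleXML, which is the step Listing 9 adds later.

```php
<?php
// Hypothetical miniature of tooBig.xml
$xml = '<blog><entries>'
     . '<entry ID="5224"><title>Skip me</title></entry>'
     . '<entry ID="5225"><title>Found me</title></entry>'
     . '</entries></blog>';

$reader = new XMLReader();
$reader->XML($xml);

$title = null;
$sxe = null;
while ($reader->read()) {
    if ($reader->nodeType == XMLReader::ELEMENT
            && $reader->localName == 'entry'
            && $reader->getAttribute('ID') == '5225') {
        // Expand only the current element and move it into its own document
        $dom = new DOMDocument();
        $node = $dom->importNode($reader->expand(), true);
        $dom->appendChild($node);

        // The small document can now be queried with XPath ...
        $xp = new DOMXPath($dom);
        $title = $xp->query('/entry/title')->item(0)->nodeValue;

        // ... or handed to SimpleXML
        $sxe = simplexml_import_dom($node);
        break;
    }
}
$reader->close();

echo $title, "\n";
echo $sxe->title, "\n";
```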

Parsing with SimpleXML

The SimpleXML extension is another choice for parsing an XML document. The SimpleXML extension requires PHP5 and includes built-in XPath support. SimpleXML works best with uncomplicated, basic XML data. Provided that the XML document isn't too complicated or too deep and contains no mixed content, SimpleXML is simpler to use than the DOM, as its name implies. It is more intuitive if you are working with a known document structure.

SimpleXML in action

SimpleXML shares many of the advantages of the DOM and is more easily coded. It allows easy access to an XML tree, has built-in validation and XPath support, and is interoperable with the DOM, giving it read and write support for XML documents. Code that parses documents with SimpleXML is quick and simple to write. Remember, however, that, like the DOM, SimpleXML pays for its ease and flexibility in memory when you load a large XML document.
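
The XPath support and DOM interoperability mentioned here can be sketched in a few lines. The document below is a trimmed, inline version of the book sample used in this section, and the lang attribute is invented purely to show a change made through the DOM becoming visible in SimpleXML.

```php
<?php
// Trimmed inline version of the book sample
$xml = new SimpleXMLElement(
    '<books><book><title>Great American Novel</title>'
  . '<success type="bestseller">4</success>'
  . '<success type="bookclubs">9</success></book></books>'
);

// Built-in XPath: returns an array of SimpleXMLElement objects
$hits = $xml->xpath('//success[@type="bestseller"]');
echo "Bestseller months: ", (string) $hits[0], "\n";

// Interoperability: obtain the DOM view of the same node
$domNode = dom_import_simplexml($xml->book[0]);
echo "DOM node name: ", $domNode->nodeName, "\n";

// A change made through the DOM shows up in SimpleXML,
// because both views share one underlying tree
$domNode->setAttribute('lang', 'en'); // hypothetical attribute
echo "lang attribute: ", (string) $xml->book[0]['lang'], "\n";
```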

The following code in Listing 6 extracts <plot> from the example XML.

Listing 6. Extracting the plot text
<?php
$xmlstr = <<<XML
<?xml version='1.0' standalone='yes'?>
<books>
   <book>
      <title>Great American Novel</title>
      <plot>
         Cliff meets Lovely Woman. Loyal Dog sleeps, but
         wakes up to bark at mailman.
      </plot>
      <success type="bestseller">4</success>
      <success type="bookclubs">9</success>
   </book>
</books>
XML;
?>
<?php

$xml = new SimpleXMLElement($xmlstr);
echo $xml->book[0]->plot; // "Cliff meets Lovely Woman. ..."
?>

You might also want to extract the same element from every record, for example the plot of each book. When multiple instances of an element exist as children of a single parent element, normal iteration techniques apply. The following code in Listing 7 demonstrates this functionality.

Listing 7. Extracting multiple instances of an element
<?php
$xmlstr = <<<XML
<?xml version='1.0' standalone='yes'?>
<books>
   <book>
      <title>Great American Novel</title>
      <plot>
         Cliff meets Lovely Woman.
      </plot>
      <success type="bestseller">4</success>
      <success type="bookclubs">9</success>
   </book>
   <book>
      <title>Man Bites Dog</title>
      <plot>
         Reporter invents a prize-winning story.
      </plot>
      <success type="bestseller">22</success>
      <success type="bookclubs">3</success>
   </book>
</books>
XML;
?>
<?php

$xml = new SimpleXMLElement($xmlstr);

foreach ($xml->book as $book) {
   echo $book->plot, '<br />';
}
?>

In addition to reading element names and their values, SimpleXML can also access element attributes. In the code shown in Listing 8, you access attributes of an element just as you would elements of an array.

Listing 8. Demonstrating SimpleXML accessing the attributes of an element
<?php
$xmlstr = <<<XML
<?xml version='1.0' standalone='yes'?>
<books>
   <book>
      <title>Great American Novel</title>
      <plot>
         Cliff meets Lovely Woman.
      </plot>
      <success type="bestseller">4</success>
      <success type="bookclubs">9</success>
   </book>
   <book>
      <title>Man Bites Dog</title>
      <plot>
         Reporter invents a prize-winning story.
      </plot>
      <success type="bestseller">22</success>
      <success type="bookclubs">3</success>
   </book>
</books>
XML;
?>
<?php

$xml = new SimpleXMLElement($xmlstr);

foreach ($xml->book[0]->success as $success) {
   switch((string) $success['type']) {
   case 'bestseller':
      echo $success, ' months on bestseller list<br />';
      break;
   case 'bookclubs':
      echo $success, ' bookclub listings<br />';
      break;
   }
}

?>

This final example (see Listing 9) uses SimpleXML and the DOM with XMLReader. With XMLReader, the data is passed one element at a time using expand(). With this method, you can convert a node passed by XMLReader to a DOMElement, and then to SimpleXML.

Listing 9. Using SimpleXML with the DOM and XMLReader to parse a large XML document
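<?php

// Parsing a large document with expand() and SimpleXML
$reader = new XMLReader();

$reader->open("tooBig.xml");

while ($reader->read()) {
    // Only the opening tag of the matching <entry> is of interest
    if ($reader->nodeType == XMLReader::ELEMENT
            && $reader->localName == "entry"
            && $reader->getAttribute("ID") == 5225) {
        // Copy just this element into a small DOM document ...
        $node = $reader->expand();
        $dom = new DOMDocument();
        $n = $dom->importNode($node, true);
        $dom->appendChild($n);
        // ... and hand the imported node to SimpleXML
        $sxe = simplexml_import_dom($n);
        echo $sxe->title;
        // Stop here rather than reading the rest of the file
        break;
    }
}
$reader->close();

?>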

Stream-based parsers

Stream-based parsers are so named because they parse the XML in a stream, with much the same rationale as streaming audio: they work with one node at a time and, when they are finished with that node, forget it entirely. XMLReader is a pull parser, and you code for it much as you would step through a database query result with a cursor. This makes it easier to work with unfamiliar or unpredictable XML files.

Parsing with XMLReader

The XMLReader extension is a stream-based parser of the type often referred to as a cursor-type or pull parser: it pulls information from the XML document on request. Its API is derived from the C# XmlTextReader. It is based on libxml2 and is included and enabled by default as of PHP 5.1. Before PHP 5.1, the XMLReader extension was not enabled by default but was available from PECL (see Related topics for a link). XMLReader supports namespaces and validation, including DTD and RELAX NG.

XMLReader in action

XMLReader, as a stream parser, is well-suited to parsing large XML documents; it is a lot easier to code than SAX and usually faster. This is your stream parser of choice.

This example in Listing 10 parses a large XML document with XMLReader.

Listing 10. XMLReader with a large XML file
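<?php

$reader = new XMLReader();
$reader->open("tooBig.xml");

while ($reader->read()) {
   // Look for the opening tag of the entry whose ID is 5225
   if ($reader->nodeType == XMLReader::ELEMENT
         && $reader->localName == "entry"
         && $reader->getAttribute("ID") == 5225) {
      // Scan forward inside this entry for its <title>
      while ($reader->read()) {
         if ($reader->nodeType == XMLReader::ELEMENT) {
            if ($reader->localName == "title") {
               // Advance to the text node inside <title>
               $reader->read();
               echo $reader->value;
               break;
            }
            if ($reader->localName == "entry") {
               // Reached the next entry without finding a title
               break;
            }
         }
      }
      // The entry has been handled; stop reading the rest of the file
      break;
   }
}
$reader->close();
?>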

Parsing with SAX

The Simple API for XML (SAX) is a stream parser. Events are associated with the XML document being read, so SAX is coded in callbacks. There are events for element opening and closing tags, for the content of elements, for entities, and for parsing errors. The primary reason to use the SAX parser rather than XMLReader is that SAX is sometimes more efficient and usually more familiar. A major disadvantage is that SAX parser code is more complex and more difficult to write than XMLReader code.
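
The callback style can be sketched on a small scale as follows. The TitleCollector class and the inline document are invented for this illustration. The handlers are passed as array callables rather than registered with xml_set_object(), which is simply one of the ways PHP accepts a callback. Character data may arrive in several pieces, so the sketch buffers it until the closing tag.

```php
<?php
// Hypothetical helper that gathers the text of every <title> element
class TitleCollector {
    public $titles = array();
    private $inTitle = false;
    private $buffer = '';

    // Called for each opening tag
    public function start($parser, $name, $attributes) {
        if ($name == 'title') {
            $this->inTitle = true;
            $this->buffer = '';
        }
    }

    // Called for each closing tag
    public function end($parser, $name) {
        if ($name == 'title') {
            $this->titles[] = trim($this->buffer);
            $this->inTitle = false;
        }
    }

    // Called for text content, possibly several times per element
    public function text($parser, $data) {
        if ($this->inTitle) {
            $this->buffer .= $data;
        }
    }
}

$xml = '<blog><entries>'
     . '<entry ID="1"><title>First</title></entry>'
     . '<entry ID="2"><title>Second</title></entry>'
     . '</entries></blog>';

$collector = new TitleCollector();
$parser = xml_parser_create();
// Keep element names in their original case
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
xml_set_element_handler($parser, array($collector, 'start'), array($collector, 'end'));
xml_set_character_data_handler($parser, array($collector, 'text'));

// The whole string is passed at once, so this is also the final chunk
xml_parse($parser, $xml, true);

echo implode(', ', $collector->titles), "\n";
```

Note how much state the class has to track by hand; that bookkeeping is the complexity the paragraph above refers to.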

SAX in action

SAX is likely familiar to those who worked with XML in PHP4, and the SAX extension in PHP5 is compatible with the version they're used to. Since it's a stream parser, it's a good choice for large files, but not as good a choice as XMLReader.

This example in Listing 11 parses a large XML document with SAX.

Listing 11. Using SAX to parse a large XML file
<?php

//This class contains all the callback methods that will actually
//handle the XML data.
class SaxClass {
   private $hit = false;
   private $titleHit = false;

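   //callback for the start of each element
   function startElement($parser_object, $elementname, $attribute) {
      if ($elementname == "entry") {
         //remember whether this is the entry we are looking for
         $this->hit = isset($attribute['ID']) && $attribute['ID'] == 5225;
      }
      //only a <title> inside the matching entry is of interest
      $this->titleHit = ($this->hit && $elementname == "title");
   }

   //callback for the end of each element
   function endElement($parser_object, $elementname) {
      //stop echoing text once the title closes, so that the
      //whitespace following </title> is not printed
      if ($elementname == "title") {
         $this->titleHit = false;
      }
      if ($elementname == "entry") {
         $this->hit = false;
      }
   }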

   //callback for the content within an element
   function contentHandler($parser_object,$data)
   {
      if ($this->titleHit) {
         echo trim($data)."<br />";
      }
   }
}

//Function to start the parsing once all values are set and
//the file has been opened
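function doParse($parser_object) {
   if (!($fp = fopen("tooBig.xml", "r"))) {
      die("Cannot open tooBig.xml");
   }

   //loop through data
   while ($data = fread($fp, 4096)) {
      //parse the fragment and stop on the first parsing error
      if (!xml_parse($parser_object, $data, feof($fp))) {
         die(xml_error_string(xml_get_error_code($parser_object)));
      }
   }
   fclose($fp);
}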

$SaxObject = new SaxClass();
$parser_object = xml_parser_create();
xml_set_object ($parser_object, $SaxObject);

//Don't alter the case of the data
xml_parser_set_option($parser_object, XML_OPTION_CASE_FOLDING, false);

xml_set_element_handler($parser_object,"startElement","endElement");
xml_set_character_data_handler($parser_object, "contentHandler");

doParse($parser_object);

?>

Summary

PHP5 offers an improved variety of parsing techniques. Parsing with the DOM, now fully compliant with the W3C standard, is a familiar option, and is your choice for complex but relatively small documents. SimpleXML is the way to go for basic and not-too-large XML documents, and XMLReader, easier and faster than SAX, is the stream parser of choice for large documents.


Related topics


Zone=XML, Open source
ArticleID=199776
ArticleTitle=XML for PHP developers, Part 2: Advanced XML parsing techniques
publish-date=03062007