Reading and writing the XML DOM with PHP

Using the DOM library, SAX parser and regular expressions

Myriad techniques are available for reading and writing XML in PHP. This article presents three methods for reading XML: using the DOM library, using the SAX parser, and using regular expressions. Writing XML using DOM and PHP text templating will also be covered.

Jack Herrington (jherr@pobox.com), Senior Software Engineer, Studio B

Jack D. Herrington is a senior software engineer with more than 20 years of experience. He's the author of three books: Code Generation in Action, Podcasting Hacks, and PHP Hacks. He has also written more than 30 articles.



06 December 2005


Reading and writing Extensible Markup Language (XML) in PHP may seem a little frightening. In fact, XML and all its related technologies can be intimidating. However, reading and writing XML in PHP doesn't have to be a daunting task. First, you need to learn a little about XML -- what it is and what it's used for. Then, you need to learn how to read and write XML in PHP, which you can do in many ways.

This article provides a short primer on XML, then explains how to read and write XML in PHP.

What is XML?

XML is a data storage format. It doesn't define what data is being stored or the structure of that data. XML simply defines tags and attributes for those tags. A properly formed XML tag looks like this:

<name>Jack Herrington</name>

This <name> tag contains some text: Jack Herrington.

An XML tag that contains no text looks like this:

<powerUp />

There may be more than one way to code something in XML. For instance, this tag is equivalent to the previous one:

<powerUp></powerUp>

You can also add attributes to an XML tag. For example, this <name> tag contains first and last attributes:

<name first="Jack" last="Herrington" />
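As a rough preview of the DOM library covered later in this article, the attributes on that tag could be read back as sketched below. The readNameAttributes helper is a hypothetical name used only for illustration, and the scalar type declarations assume PHP 7 or later.

```php
<?php
// readNameAttributes() is a hypothetical helper, not part of the article's code.
function readNameAttributes( string $xml ): array
{
    $doc = new DOMDocument();
    $doc->loadXML( $xml );

    // documentElement is the root element, here the <name> tag itself.
    $name = $doc->documentElement;
    return array(
        'first' => $name->getAttribute( 'first' ),
        'last'  => $name->getAttribute( 'last' )
    );
}

$attrs = readNameAttributes( '<name first="Jack" last="Herrington" />' );
echo $attrs['first'] . ' ' . $attrs['last'] . "\n";
```

The getAttribute method returns an empty string when the attribute is absent, so a caller that must tell "missing" from "empty" would check hasAttribute first.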

You can encode special characters in XML, too. For instance, an ampersand is encoded like this:

&amp;

An XML document that contains tags and attributes formatted like the examples provided is well formed, which means the tags are balanced, and the characters are encoded properly. Listing 1 is an example of well-formed XML.

Listing 1. An XML book list example
  <books>
    <book>
      <author>Jack Herrington</author>
      <title>PHP Hacks</title>
      <publisher>O'Reilly</publisher>
    </book>
    <book>
      <author>Jack Herrington</author>
      <title>Podcasting Hacks</title>
      <publisher>O'Reilly</publisher>
    </book>
  </books>

The XML in Listing 1 contains a list of books. The parent <books> tag includes a set of <book> tags that each contain <author>, <title>, and <publisher> tags.

XML documents are valid when the structure of the tags and their content is validated by an external schema file. Schema files can be specified in a variety of formats. For the purposes of this article, all you need is well-formed XML.
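One way to check that a string is well-formed XML is to hand it to a parser and see whether the parse succeeds. The following is a minimal sketch that assumes the DOM extension is installed; isWellFormed is a hypothetical helper name, not a built-in function.

```php
<?php
// isWellFormed() is a hypothetical helper name used only for this sketch.
function isWellFormed( string $xml ): bool
{
    // Collect parse errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors( true );

    $doc = new DOMDocument();
    $ok = $doc->loadXML( $xml );   // returns false when the parse fails

    // Discard the collected errors and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors( $previous );
    return $ok;
}

// Balanced tags: well formed.
var_dump( isWellFormed( '<books><book><title>PHP Hacks</title></book></books>' ) );
// The <title> tag is never closed: not well formed.
var_dump( isWellFormed( '<books><book><title>PHP Hacks</book></books>' ) );
```

This only tests well-formedness. Checking validity against a schema is a separate step that this sketch does not attempt.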

If you think XML looks a lot like Hypertext Markup Language (HTML), you're right. Both XML and HTML are tag-based languages, and they have many similarities. However, it's important to note that while XML documents can be well-formed HTML, not all HTML documents are well-formed XML. The break tag (br) is an excellent example of the differences between XML and HTML. This line break is well-formed HTML, but not well-formed XML:

<p>This is a paragraph<br>
With a line break</p>

This line break is well-formed XML and HTML:

<p>This is a paragraph<br />
With a line break</p>

If you want to write HTML that is well-formed XML, follow the Extensible Hypertext Markup Language (XHTML) standard from the World Wide Web Consortium (W3C) (see Resources). All modern browsers render XHTML. Plus, it's possible to use XML tools to read XHTML and to find data in the documents, which is far easier than parsing through HTML.
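To illustrate the point about reading XHTML with XML tools, here is a small sketch that pulls the links out of an XHTML fragment with the DOM library. The markup and the extractLinks helper are made up for the example, and the code assumes the DOM extension is available.

```php
<?php
// Sample XHTML invented for this sketch. Note the XML-style <br />.
$xhtml = <<<XHTML
<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <p>See <a href="http://www.php.net/">PHP.net</a><br />
    and <a href="http://www.w3.org/">the W3C</a>.</p>
  </body>
</html>
XHTML;

// extractLinks() is a hypothetical helper: it maps each href to its link text.
function extractLinks( string $xhtml ): array
{
    $doc = new DOMDocument();
    $doc->loadXML( $xhtml );

    $links = array();
    foreach( $doc->getElementsByTagName( 'a' ) as $a )
    {
        $links[ $a->getAttribute( 'href' ) ] = $a->nodeValue;
    }
    return $links;
}

$links = extractLinks( $xhtml );
foreach( $links as $href => $text )
{
    echo "$text -> $href\n";
}
```

The same fragment written with a bare <br> would fail to load, because loadXML accepts only well-formed XML.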

Reading XML using the DOM library

The easiest way to read a well-formed XML file is to use the Document Object Model (DOM) library compiled into some installations of PHP. The DOM library reads the entire XML document into memory and represents it as a tree of nodes, as illustrated in Figure 1.

Figure 1. XML DOM tree for the books XML

The books node at the top of the tree has two child book tags. Within each book, there are author, publisher, and title nodes. The author, publisher, and title nodes each have child text nodes that contain the text.
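The tree structure described above can be made visible by walking the nodes recursively. The sketch below is only an illustration: dumpTree is a hypothetical helper, and the document is a cut-down version of the books XML with a single book and title.

```php
<?php
// dumpTree() is a hypothetical helper written for this sketch.
function dumpTree( DOMNode $node, int $depth = 0 ): string
{
    $indent = str_repeat( '  ', $depth );

    if ( $node->nodeType == XML_TEXT_NODE )
    {
        // Whitespace between tags also shows up as text nodes; skip those.
        $text = trim( $node->nodeValue );
        return $text === '' ? '' : $indent . '"' . $text . "\"\n";
    }

    // Print the node's name, then recurse into its children one level deeper.
    $out = $indent . $node->nodeName . "\n";
    if ( $node->hasChildNodes() )
    {
        foreach( $node->childNodes as $child )
        {
            $out .= dumpTree( $child, $depth + 1 );
        }
    }
    return $out;
}

$doc = new DOMDocument();
$doc->loadXML( '<books><book><title>PHP Hacks</title></book></books>' );
$tree = dumpTree( $doc->documentElement );
echo $tree;
```

Each level of indentation in the output corresponds to one level of the tree, with the quoted text node sitting beneath the title element that contains it.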

The code to read the books XML file and display the contents using the DOM is shown in Listing 2.

Listing 2. Reading books XML with the DOM
  <?php
  // Load the whole document into memory as a tree of nodes.
  $doc = new DOMDocument();
  $doc->load( 'books.xml' );

  // Find every <book> element, at any depth.
  $books = $doc->getElementsByTagName( "book" );
  foreach( $books as $book )
  {
    // Each lookup returns a node list; item(0) is the first match.
    $authors = $book->getElementsByTagName( "author" );
    $author = $authors->item(0)->nodeValue;

    $publishers = $book->getElementsByTagName( "publisher" );
    $publisher = $publishers->item(0)->nodeValue;

    $titles = $book->getElementsByTagName( "title" );
    $title = $titles->item(0)->nodeValue;

    echo "$title - $author - $publisher\n";
  }
  ?>

The script starts by creating a new DOMDocument object and loading the books XML into that object using the load method. After that, the script uses the getElementsByTagName method to get a list of all of the elements with the given name.

Within the loop of the book nodes, the script uses the getElementsByTagName method to find the author, publisher, and title tags, then reads the nodeValue of the first match for each. The nodeValue is the text within the node. The script then displays those values.
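The same steps can be tried without a books.xml file on disk by parsing a string with loadXML instead. The following is a sketch rather than the article's own code: readBooks is a hypothetical helper, and the guard for a missing child element is an addition that Listing 2 does not have.

```php
<?php
$xml = <<<XML
<books>
  <book>
    <author>Jack Herrington</author>
    <title>PHP Hacks</title>
    <publisher>O'Reilly</publisher>
  </book>
  <book>
    <author>Jack Herrington</author>
    <title>Podcasting Hacks</title>
    <publisher>O'Reilly</publisher>
  </book>
</books>
XML;

// readBooks() is a hypothetical helper: it returns one array per <book>.
function readBooks( string $xml ): array
{
    $doc = new DOMDocument();
    $doc->loadXML( $xml );

    $result = array();
    foreach( $doc->getElementsByTagName( 'book' ) as $book )
    {
        $row = array();
        foreach( array( 'title', 'author', 'publisher' ) as $field )
        {
            $nodes = $book->getElementsByTagName( $field );
            // item(0) is null when the tag is missing, so check length first.
            $row[$field] = $nodes->length > 0
                ? $nodes->item(0)->nodeValue
                : null;
        }
        $result[] = $row;
    }
    return $result;
}

$books = readBooks( $xml );
foreach( $books as $book )
{
    echo $book['title'] . ' - ' . $book['author'] . ' - ' . $book['publisher'] . "\n";
}
```

Returning plain arrays keeps the parsing separate from the display, which makes the function easier to reuse than a script that echoes as it goes.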

You can run the script from Listing 2 (saved here as e1.php) on the command line like this:

% php e1.php
PHP Hacks - Jack Herrington - O'Reilly
Podcasting Hacks - Jack Herrington - O'Reilly
%

As you can see, a line is printed for each book block. That's a good start. However, what if you don't have access to the XML DOM library?


Reading XML using the SAX parser

Another way to read XML is to use the Simple API for XML (SAX) parser. Most installations of PHP include the SAX parser. The SAX parser runs on a callback model. Every time a tag is opened or closed, or any time the parser sees some text, it makes callbacks to some user-defined functions with the node or text information.

The advantage of a SAX parser is that it's really lightweight. The parser doesn't keep anything in memory for very long, so it can be used for extremely large files. The disadvantage is that writing SAX parser callbacks is a big nuisance. Listing 3 shows the code to read the books XML file and display the contents using SAX.

Listing 3. Reading books XML with the SAX parser
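  <?php
  $g_books = array();  // every book parsed so far
  $g_elem = null;      // name of the element currently open, or null

  function startElement( $parser, $name, $attrs ) 
  {
    global $g_books, $g_elem;
    // Names arrive in uppercase because case folding is on by default.
    if ( $name == 'BOOK' ) $g_books []= array();
    $g_elem = $name;
  }

  function endElement( $parser, $name ) 
  {
    global $g_elem;
    $g_elem = null;
  }

  function textData( $parser, $text )
  {
    global $g_books, $g_elem;
    if ( $g_elem == 'AUTHOR' ||
      $g_elem == 'PUBLISHER' ||
      $g_elem == 'TITLE' )
    {
      // The parser may deliver one element's text in several pieces
      // (for example, around an entity), so append rather than overwrite.
      $last = count( $g_books ) - 1;
      if ( !isset( $g_books[ $last ][ $g_elem ] ) )
        $g_books[ $last ][ $g_elem ] = '';
      $g_books[ $last ][ $g_elem ] .= $text;
    }
  }

  $parser = xml_parser_create();

  xml_set_element_handler( $parser, "startElement", "endElement" );
  xml_set_character_data_handler( $parser, "textData" );

  $f = fopen( 'books.xml', 'r' );

  while( $data = fread( $f, 4096 ) )
  {
    xml_parse( $parser, $data );
  }
  fclose( $f );

  // Tell the parser that no more data is coming.
  xml_parse( $parser, '', true );

  xml_parser_free( $parser );

  foreach( $g_books as $book )
  {
    echo $book['TITLE']." - ".$book['AUTHOR']." - ";
    echo $book['PUBLISHER']."\n";
  }
  ?>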

The script starts by setting up the g_books array, which holds all the books and their information in memory, and a g_elem variable, which stores the name of the tag the script is currently processing. The script then defines the callback functions. In this example, the callback functions are startElement, endElement, and textData. The startElement and endElement functions are called when tags are opened and closed, respectively. The textData function is called on the text between the start and end of the tags.

In this example, the startElement function looks for the book tag to start a new element in the books array. Because PHP's XML parser folds element names to uppercase by default, the callbacks compare against BOOK, AUTHOR, PUBLISHER, and TITLE. The textData function then looks at the current element to see if it's a publisher, title, or author tag. If so, the function puts the current text into the current book.

To get the parsing going, the script creates the parser with the xml_parser_create function. Then, it sets the callback handlers. After that, the script reads in the file and sends off chunks of the file to the parser. After the file is read, the xml_parser_free function deletes the parser. The end of the script dumps out the contents of the g_books array.
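The callback flow described above can also be written without global variables by passing closures to the handler functions. This is a sketch under a few assumptions: parseBooksSax is a hypothetical name, the input is a string fed in 16-byte chunks to imitate the fread loop, and the code targets PHP 7 or later (for the ?? operator).

```php
<?php
// parseBooksSax() is a hypothetical helper, not part of the article's code.
function parseBooksSax( string $xml ): array
{
    $books = array();   // one array per <book>
    $elem = null;       // name of the element currently open

    $parser = xml_parser_create();
    xml_set_element_handler(
        $parser,
        function( $parser, $name, $attrs ) use ( &$books, &$elem ) {
            // Names are uppercase because case folding is on by default.
            if ( $name == 'BOOK' ) $books[] = array();
            $elem = $name;
        },
        function( $parser, $name ) use ( &$elem ) {
            $elem = null;
        }
    );
    xml_set_character_data_handler(
        $parser,
        function( $parser, $text ) use ( &$books, &$elem ) {
            if ( in_array( $elem, array( 'AUTHOR', 'PUBLISHER', 'TITLE' ), true ) )
            {
                // Text can arrive in pieces, so append to what is there.
                $last = count( $books ) - 1;
                $books[$last][$elem] = ( $books[$last][$elem] ?? '' ) . $text;
            }
        }
    );

    // Feed small chunks to mimic reading a file piece by piece.
    foreach( str_split( $xml, 16 ) as $chunk )
    {
        xml_parse( $parser, $chunk );
    }
    xml_parse( $parser, '', true );   // signal the end of the input

    // On PHP 8 the parser is an object and is released automatically;
    // older versions would call xml_parser_free( $parser ) here.
    return $books;
}

$sample = '<books><book><author>Jack Herrington</author>'
    . '<title>PHP Hacks</title>'
    . '<publisher>O\'Reilly &amp; Associates</publisher></book></books>';

foreach( parseBooksSax( $sample ) as $book )
{
    echo $book['TITLE'] . ' - ' . $book['AUTHOR'] . ' - ' . $book['PUBLISHER'] . "\n";
}
```

The publisher in the sample contains an entity on purpose: the parser reports the text around it in separate callbacks, which is why the handler appends instead of assigning.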

As you can see, this is much tougher code to write than the DOM equivalent. What if you don't have the DOM library or the SAX library? Is there another alternative?


Parsing XML with regular expressions

I'm certain to be vilified by some engineers for even mentioning this approach, but you can parse XML with regular expressions. Listing 4 shows an example of using the preg_ functions to read the books file.

Listing 4. Reading books XML with regular expressions
  <?php
  // Read the whole file into one string.
  $xml = "";
  $f = fopen( 'books.xml', 'r' );
  while( $data = fread( $f, 4096 ) ) { $xml .= $data; }
  fclose( $f );

  // The s modifier lets . match newlines, so a block can span lines.
  preg_match_all( "/<book>(.*?)<\/book>/s", 
    $xml, $bookblocks );

  foreach( $bookblocks[1] as $block )
  {
    preg_match_all( "/<author>(.*?)<\/author>/", 
      $block, $author );
    preg_match_all( "/<title>(.*?)<\/title>/", 
      $block, $title );
    preg_match_all( "/<publisher>(.*?)<\/publisher>/", 
      $block, $publisher );
    echo( $title[1][0]." - ".$author[1][0]." - ".
      $publisher[1][0]."\n" );
  }
  ?>

Notice how short that code is. It starts by reading the file into one big string. It then uses one regex function to read in each book item. Finally, using the foreach loop, the script loops around each book block and picks out the author, title, and publisher.

So, what are the shortcomings? The problem with using regular expression code to read XML is that it doesn't first check that the XML is well formed, so you may start extracting data from a broken document without knowing it. Also, some well-formed XML may not match your regular expressions (for example, a book tag that carries an attribute or extra whitespace), so you will have to modify them later.

I never recommend using regular expressions to read XML, but sometimes it's the most compatible way because the regular expression functions are always available. Don't use regular expressions to read XML that comes directly from users; you don't control the form or structure of that XML. Always read XML from users using a DOM library or SAX parser.
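The fragility is easy to demonstrate. In the sketch below, a hypothetical id attribute is added to the book tag: a pattern that expects a bare book tag finds nothing, while the DOM still finds the element. The sample XML is made up for the example.

```php
<?php
// Well-formed XML, but the <book> tag now carries an attribute.
$xml = '<books><book id="1"><title>PHP Hacks</title></book></books>';

// A pattern that expects a bare <book> tag no longer matches.
$regexCount = preg_match_all( "/<book>(.*?)<\/book>/s", $xml, $blocks );

// The DOM parser is not affected by the extra attribute.
$doc = new DOMDocument();
$doc->loadXML( $xml );
$domCount = $doc->getElementsByTagName( 'book' )->length;

echo "regex found $regexCount, DOM found $domCount\n";
// prints "regex found 0, DOM found 1"
```

The pattern could be loosened to allow attributes, but each such patch handles one more case by hand that a real parser already covers.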


Writing XML with the DOM

Reading XML is only one part of the equation. What about writing it? The best way to write XML is to use the DOM. Listing 5 shows how the DOM builds the books XML file.

Listing 5. Writing books XML with the DOM
  <?php
  $books = array();
  $books [] = array(
    'title' => 'PHP Hacks',
    'author' => 'Jack Herrington',
    'publisher' => "O'Reilly"
  );
  $books [] = array(
    'title' => 'Podcasting Hacks',
    'author' => 'Jack Herrington',
    'publisher' => "O'Reilly"
  );

  $doc = new DOMDocument();
  $doc->formatOutput = true;  // indent the generated XML

  // Create the root <books> element and attach it to the document.
  $r = $doc->createElement( "books" );
  $doc->appendChild( $r );

  foreach( $books as $book )
  {
    $b = $doc->createElement( "book" );

    $author = $doc->createElement( "author" );
    $author->appendChild(
      $doc->createTextNode( $book['author'] )
    );
    $b->appendChild( $author );

    $title = $doc->createElement( "title" );
    $title->appendChild(
      $doc->createTextNode( $book['title'] )
    );
    $b->appendChild( $title );

    $publisher = $doc->createElement( "publisher" );
    $publisher->appendChild(
      $doc->createTextNode( $book['publisher'] )
    );
    $b->appendChild( $publisher );

    // Attach the finished <book> to the root.
    $r->appendChild( $b );
  }

  echo $doc->saveXML();
  ?>

At the top of the script, the books array is loaded with some example books. That data could come from the user or from a database.

After the example books are loaded, the script creates a new DOMDocument and adds the root books node to it. Then, for each book, the script creates a book element, creates author, title, and publisher elements, adds a text node to each of them, and appends them to the book. The final step for each book node is to attach it to the root books node.

The end of the script dumps the XML to the console using the saveXML method. (You can also use the save method to create a file from the XML.) The output of the script is shown in Listing 6.

Listing 6. Output from the DOM build script
  % php e4.php 
  <?xml version="1.0"?>
  <books>
    <book>
      <author>Jack Herrington</author>
      <title>PHP Hacks</title>
      <publisher>O'Reilly</publisher>
    </book>
    <book>
      <author>Jack Herrington</author>
      <title>Podcasting Hacks</title>
      <publisher>O'Reilly</publisher>
    </book>
  </books>
  %
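As a small illustration of the save method mentioned earlier, the sketch below writes a one-element document to a temporary file and reads it back. The publisher name is made up, and tempnam is used only so the example does not overwrite a real file.

```php
<?php
$doc = new DOMDocument();

// The text contains an ampersand; the DOM escapes it when serializing.
$publisher = $doc->createElement( 'publisher' );
$publisher->appendChild( $doc->createTextNode( 'Barnes & Noble' ) );
$doc->appendChild( $publisher );

// save() returns the number of bytes written, or false on failure.
$path = tempnam( sys_get_temp_dir(), 'books' );
$bytes = $doc->save( $path );

$written = file_get_contents( $path );
echo $written;
unlink( $path );   // remove the temporary file
```

The file contains &amp; in place of the raw ampersand, even though the script never encoded anything itself.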

The real value of using the DOM is that the XML it creates is always well formed. But what can you do if you don't have access to the DOM to create XML?


Writing XML with PHP

If the DOM isn't available, you can use PHP text templating to write XML. Listing 7 shows how PHP builds the books XML file.

Listing 7. Writing books XML with PHP
  <?php
  $books = array();
  $books [] = array(
    'title' => 'PHP Hacks',
    'author' => 'Jack Herrington',
    'publisher' => "O'Reilly"
  );
  $books [] = array(
    'title' => 'Podcasting Hacks',
    'author' => 'Jack Herrington',
    'publisher' => "O'Reilly"
  );
  ?>
  <books>
  <?php

  foreach( $books as $book )
  {
  ?>
    <book>
      <title><?php echo( $book['title'] ); ?></title>
      <author><?php echo( $book['author'] ); ?>
      </author>
      <publisher><?php echo( $book['publisher'] ); ?>
      </publisher>
    </book>
  <?php
  }
  ?>
  </books>

The top of the script is similar to the DOM script. The bottom of the script opens the books tag, then iterates through each book, creating the book tag and all the internal title, author, and publisher tags.

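The problem with this approach is encoding the entities. To make sure special characters are properly encoded, an escaping function must be called on each item. The htmlentities function is often used for this, but it can emit HTML-only named entities (such as &eacute;) that XML does not define, so Listing 8 uses the htmlspecialchars function with the ENT_XML1 flag (available as of PHP 5.4) instead.

Listing 8. Using the htmlspecialchars function to encode entities
  <books>
  <?php

  foreach( $books as $book )
  {
    // Escape &, <, >, and both kinds of quotes for use in XML.
    $flags = ENT_XML1 | ENT_QUOTES;
    $title = htmlspecialchars( $book['title'], $flags );
    $author = htmlspecialchars( $book['author'], $flags );
    $publisher = htmlspecialchars( $book['publisher'], $flags );
  ?>
    <book>
      <title><?php echo( $title ); ?></title>
      <author><?php echo( $author ); ?> </author>
      <publisher><?php echo( $publisher ); ?>
      </publisher>
    </book>
  <?php
  }
  ?>
  </books>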

This is why it's annoying to write XML in basic PHP. You think that you're creating perfect XML, but then you find that certain characters aren't encoded properly when you try to run data through it.
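The effect of a missed encoding step can be checked by feeding the templated output back through a parser. The sketch below is illustrative only: xmlEscape is a hypothetical wrapper, the publisher name is made up, and htmlspecialchars with ENT_XML1 is swapped in as the escaping function because it limits itself to the characters XML treats as special.

```php
<?php
// xmlEscape() is a hypothetical wrapper used only for this sketch.
function xmlEscape( string $text ): string
{
    return htmlspecialchars( $text, ENT_XML1 | ENT_QUOTES, 'UTF-8' );
}

$publisher = 'Barnes & Noble';

// Templated without and with escaping.
$raw = '<publisher>' . $publisher . '</publisher>';
$escaped = '<publisher>' . xmlEscape( $publisher ) . '</publisher>';

// Keep parse errors out of the output while testing both strings.
libxml_use_internal_errors( true );

$rawDoc = new DOMDocument();
$rawOk = $rawDoc->loadXML( $raw );           // false: bare ampersand

$escapedDoc = new DOMDocument();
$escapedOk = $escapedDoc->loadXML( $escaped );   // true

libxml_clear_errors();

echo $escaped . "\n";
// prints "<publisher>Barnes &amp; Noble</publisher>"
```

A round-trip check like this is a cheap way to catch a missed escape before the XML reaches a consumer that rejects it.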


Conclusions

XML has always had a lot of hype and confusion surrounding it. However, it's not as difficult as you think it is -- especially in a great language like PHP. When you understand and implement XML properly, you'll find there are a lot of powerful tools you can use. XPath and XSLT are two such tools that are worth checking out.

Resources

Learn

Get products and technologies

  • Visit PHP.net to learn the latest news about PHP, find downloads, and learn from other users.
  • Learn about Expat XML Parser, the parser that is used to provide the SAX parser functionality for PHP.
  • Innovate your next open source development project with IBM trial software, available for download or on DVD.
