Reading and writing the XML DOM with PHP
Using the DOM library, SAX parser and regular expressions
Reading and writing Extensible Markup Language (XML) in PHP may seem a little frightening. In fact, XML and all its related technologies can be intimidating. However, reading and writing XML in PHP doesn't have to be a daunting task. First, you need to learn a little about XML -- what it is and what it's used for. Then, you need to learn how to read and write XML in PHP, which you can do in many ways.
This article provides a short primer on XML, then explains how to read and write XML in PHP.
What is XML?
XML is a data storage format. It doesn't define what data is being stored or the structure of that data. XML simply defines tags and attributes for those tags. A properly formed XML tag looks like this:
<name>Jack Herrington</name>
This <name> tag contains some text: Jack Herrington.
An XML tag that contains no text looks like this:
<powerUp />
There may be more than one way to code something in XML. For instance, this tag is equivalent to the previous one:
<powerUp></powerUp>
You can also add attributes to an XML tag. For example, this <name> tag contains first and last attributes:
<name first="Jack" last="Herrington" />
You can encode special characters in XML, too. For instance, an ampersand is encoded like this:
&amp;
An XML document that contains tags and attributes formatted like the examples provided is well formed, which means the tags are balanced, and the characters are encoded properly. Listing 1 is an example of well-formed XML.
Listing 1. An XML book list example
<books>
  <book>
    <author>Jack Herrington</author>
    <title>PHP Hacks</title>
    <publisher>O'Reilly</publisher>
  </book>
  <book>
    <author>Jack Herrington</author>
    <title>Podcasting Hacks</title>
    <publisher>O'Reilly</publisher>
  </book>
</books>
The XML in Listing 1 contains a list of books. The parent <books> tag includes a set of <book> tags that each contain <author>, <title>, and <publisher> tags.
XML documents are valid when the structure of the tags and their content is validated by an external schema file. Schema files can be specified in a variety of formats. For the purposes of this article, all you need is well-formed XML.
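To make the idea of well-formedness concrete, here is a minimal sketch of a check in PHP. It assumes the DOM extension is available; the isWellFormed function name and the two sample strings are made up for illustration and are not part of the article's listings.

```php
<?php
// Hypothetical helper: returns true when $xml parses as well-formed XML.
function isWellFormed( string $xml ): bool
{
  // Collect parse errors internally instead of emitting PHP warnings.
  $previous = libxml_use_internal_errors( true );

  $doc = new DOMDocument();
  $ok = $doc->loadXML( $xml );

  // Restore the previous error-handling mode.
  libxml_clear_errors();
  libxml_use_internal_errors( $previous );

  return $ok;
}

$good = '<books><book><title>PHP Hacks</title></book></books>';
$bad  = '<books><book><title>PHP Hacks</book></books>'; // <title> is never closed

var_dump( isWellFormed( $good ) );
var_dump( isWellFormed( $bad ) );
```

This only tests that the tags balance and the characters are encoded properly. It does not validate the document against a schema.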
If you think XML looks a lot like Hypertext Markup Language (HTML), you're right. Both XML and HTML are tag-based languages, and they have many similarities. However, it's important to note that while an XML document can also be valid HTML, not every HTML document is well-formed XML. The break tag (br) is an excellent example of the differences between XML and HTML. This line break is valid HTML, but not well-formed XML:
<p>This is a paragraph<br>
With a line break</p>
This line break is well-formed XML and HTML:
<p>This is a paragraph<br />
With a line break</p>
If you want to write HTML that is well-formed XML, follow the Extensible Hypertext Markup Language (XHTML) standard from the World Wide Web Consortium (W3C). All modern browsers render XHTML. Plus, it's possible to use XML tools to read XHTML and to find data in the documents, which is far easier than parsing through HTML.
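As a rough illustration of reading XHTML with XML tools, the sketch below loads a small XHTML fragment with PHP's DOMDocument class and collects the link targets. The fragment and the example.com URL are invented for this example.

```php
<?php
// A made-up XHTML fragment; every tag is closed, so it is well-formed XML.
$xhtml = '<p>This is a paragraph<br />'
       . 'with a <a href="http://example.com/">link</a></p>';

$doc = new DOMDocument();
$doc->loadXML( $xhtml );

// Collect the href attribute of every <a> element.
$links = array();
foreach ( $doc->getElementsByTagName( 'a' ) as $anchor )
{
  $links[] = $anchor->getAttribute( 'href' );
}

print_r( $links );
```

The same code would fail on the HTML version with a bare <br> tag, because the XML parser stops at the first unbalanced tag.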
Reading XML using the DOM library
The easiest way to read a well-formed XML file is to use the Document Object Model (DOM) library compiled into some installations of PHP. The DOM library reads the entire XML document into memory and represents it as a tree of nodes, as illustrated in Figure 1.
Figure 1. XML DOM tree for the books XML

The books node at the top of the tree has two child book tags. Within each book, there are author, publisher, and title nodes. The author, publisher, and title nodes each have child text nodes that contain the text.
The code to read the books XML file and display the contents using the DOM is shown in Listing 2.
Listing 2. Reading books XML with the DOM
<?php
$doc = new DOMDocument();
$doc->load( 'books.xml' );

$books = $doc->getElementsByTagName( "book" );
foreach( $books as $book )
{
  $authors = $book->getElementsByTagName( "author" );
  $author = $authors->item(0)->nodeValue;

  $publishers = $book->getElementsByTagName( "publisher" );
  $publisher = $publishers->item(0)->nodeValue;

  $titles = $book->getElementsByTagName( "title" );
  $title = $titles->item(0)->nodeValue;

  echo "$title - $author - $publisher\n";
}
?>
The script starts by creating a new DOMDocument object and loading the books XML into that object using the load method. After that, the script uses the getElementsByTagName method to get a list of all of the elements with the given name.
Within the loop of the book nodes, the script uses the getElementsByTagName method to get the nodeValue for the author, publisher, and title tags. The nodeValue is the text within the node. The script then displays those values.
You can run the PHP script on the command line like this:
% php e1.php
PHP Hacks - Jack Herrington - O'Reilly
Podcasting Hacks - Jack Herrington - O'Reilly
%
As you can see, a line is printed for each book block. That's a good start. However, what if you don't have access to the XML DOM library?
Reading XML using the SAX parser
Another way to read XML is to use the Simple API for XML (SAX) parser. Most installations of PHP include the SAX parser. The SAX parser runs on a callback model. Every time a tag is opened or closed, or any time the parser sees some text, it makes callbacks to some user-defined functions with the node or text information.
The advantage of a SAX parser is that it's really lightweight. The parser doesn't keep anything in memory for very long, so it can be used for extremely large files. The disadvantage is that writing SAX parser callbacks is a big nuisance. Listing 3 shows the code to read the books XML file and display the contents using SAX.
Listing 3. Reading books XML with the SAX parser
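<?php
$g_books = array();
$g_elem = null;

// Called when a tag opens. Names arrive in uppercase because the
// parser folds case by default.
function startElement( $parser, $name, $attrs )
{
  global $g_books, $g_elem;
  if ( $name == 'BOOK' ) $g_books []= array();
  $g_elem = $name;
}

// Called when a tag closes.
function endElement( $parser, $name )
{
  global $g_elem;
  $g_elem = null;
}

// Called for the text between tags. The parser may deliver one run of
// text in several pieces, so append instead of assigning.
function textData( $parser, $text )
{
  global $g_books, $g_elem;
  if ( $g_elem == 'AUTHOR' ||
    $g_elem == 'PUBLISHER' ||
    $g_elem == 'TITLE' )
  {
    $last = count( $g_books ) - 1;
    if ( !isset( $g_books[ $last ][ $g_elem ] ) )
      $g_books[ $last ][ $g_elem ] = '';
    $g_books[ $last ][ $g_elem ] .= $text;
  }
}

$parser = xml_parser_create();

xml_set_element_handler( $parser, "startElement", "endElement" );
xml_set_character_data_handler( $parser, "textData" );

$f = fopen( 'books.xml', 'r' );
while( $data = fread( $f, 4096 ) )
{
  // The third argument tells the parser when the last chunk arrives.
  xml_parse( $parser, $data, feof( $f ) );
}
fclose( $f );

xml_parser_free( $parser );

foreach( $g_books as $book )
{
  echo $book['TITLE']." - ".$book['AUTHOR']." - ";
  echo $book['PUBLISHER']."\n";
}
?>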
The script starts by setting up the g_books array, which holds all the books and their information in memory, and a g_elem variable, which stores the name of the tag the script is currently processing. The script then defines the callback functions. In this example, the callback functions are startElement, endElement, and textData. The startElement and endElement functions are called when tags are opened and closed, respectively. The textData function is called on the text between the start and end of the tags.
In this example, the startElement function looks for the book tag to start a new element in the book array. The parser folds tag names to uppercase by default, which is why the code compares against BOOK rather than book. Then, the textData function looks at the current element to see if it's a publisher, title, or author tag. If so, the function stores the current text in the current book.
To get the parsing going, the script creates the parser with the xml_parser_create function. Then, it sets the callback handlers. After that, the script reads in the file and sends off chunks of the file to the parser. After the file is read, the xml_parser_free function deletes the parser. The end of the script dumps out the contents of the g_books array.
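The listing does not look at the value that xml_parse returns, so a malformed file would pass silently. A minimal sketch of error reporting with the same parser follows; the saxError helper name and the sample strings are assumptions made for this example.

```php
<?php
// Hypothetical helper: parses $xml with the SAX parser and returns an
// error message, or null when the input parsed cleanly.
function saxError( string $xml ): ?string
{
  $parser = xml_parser_create();
  $error = null;

  // The third argument marks this chunk as the final one.
  if ( !xml_parse( $parser, $xml, true ) )
  {
    $error = sprintf(
      "%s at line %d",
      xml_error_string( xml_get_error_code( $parser ) ),
      xml_get_current_line_number( $parser )
    );
  }

  return $error;
}

$good = '<books><book><title>PHP Hacks</title></book></books>';
$bad  = '<books><book><title>PHP Hacks</book></books>'; // unbalanced tags

var_dump( saxError( $good ) );
var_dump( saxError( $bad ) );
```

In a chunked read loop like the one in Listing 3, the same check would go around each xml_parse call.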
As you can see, this is much tougher code to write than the DOM equivalent. What if you don't have the DOM library or the SAX library? Is there another alternative?
Parsing XML with regular expressions
I'm certain to be vilified by some engineers for even mentioning this approach, but you can parse XML with regular expressions. Listing 4 shows an example of using the preg_ functions to read the books file.
Listing 4. Reading books XML with regular expressions
<?php
$xml = "";
$f = fopen( 'books.xml', 'r' );
while( $data = fread( $f, 4096 ) ) { $xml .= $data; }
fclose( $f );

preg_match_all( "/\<book\>(.*?)\<\/book\>/s", $xml, $bookblocks );

foreach( $bookblocks[1] as $block )
{
  preg_match_all( "/\<author\>(.*?)\<\/author\>/", $block, $author );
  preg_match_all( "/\<title\>(.*?)\<\/title\>/", $block, $title );
  preg_match_all( "/\<publisher\>(.*?)\<\/publisher\>/", $block, $publisher );
  echo( $title[1][0]." - ".$author[1][0]." - ".
    $publisher[1][0]."\n" );
}
?>
Notice how short that code is. It starts by reading the file into one big string. It then uses one regex function to read in each book item. Finally, using the foreach loop, the script loops around each book block and picks out the author, title, and publisher.
So, what are the shortcomings? The problem with using regular expression code to read XML is that it doesn't check first to make sure that the XML is well formed. That means you may not discover that the XML is malformed until the output is wrong. Also, some well-formed XML will not match your regular expressions (for example, a <title> tag that carries an attribute), so you will have to modify them later.
I never recommend using regular expressions to read XML, but sometimes it's the most compatible way because the regular expression functions are always available. Don't use regular expressions to read XML that comes directly from users; you don't control the form or structure of that XML. Always read XML from users using a DOM library or SAX parser.
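A small sketch shows how fragile the patterns are. The XML string below is a made-up variant of the books file in which the tags carry attributes; the regular expression from Listing 4 finds nothing, while the DOM still finds the title.

```php
<?php
// Well-formed XML, but the tags now carry attributes (invented for this example).
$xml = '<books><book id="1"><title lang="en">PHP Hacks</title></book></books>';

// The Listing 4 style pattern expects a bare <title> tag, so it matches nothing.
preg_match_all( "/<title>(.*?)<\/title>/", $xml, $matches );
$regexTitles = $matches[1];

// The DOM matches by tag name and ignores the attributes.
$doc = new DOMDocument();
$doc->loadXML( $xml );
$domTitles = array();
foreach ( $doc->getElementsByTagName( 'title' ) as $node )
{
  $domTitles[] = $node->nodeValue;
}

echo count( $regexTitles ), " regex match(es), ";
echo count( $domTitles ), " DOM match(es)\n";
```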
Writing XML with the DOM
Reading XML is only one part of the equation. What about writing it? The best way to write XML is to use the DOM. Listing 5 shows how the DOM builds the books XML file.
Listing 5. Writing books XML with the DOM
<?php
$books = array();
$books [] = array(
  'title' => 'PHP Hacks',
  'author' => 'Jack Herrington',
  'publisher' => "O'Reilly"
);
$books [] = array(
  'title' => 'Podcasting Hacks',
  'author' => 'Jack Herrington',
  'publisher' => "O'Reilly"
);

$doc = new DOMDocument();
$doc->formatOutput = true;

$r = $doc->createElement( "books" );
$doc->appendChild( $r );

foreach( $books as $book )
{
  $b = $doc->createElement( "book" );

  $author = $doc->createElement( "author" );
  $author->appendChild(
    $doc->createTextNode( $book['author'] )
  );
  $b->appendChild( $author );

  $title = $doc->createElement( "title" );
  $title->appendChild(
    $doc->createTextNode( $book['title'] )
  );
  $b->appendChild( $title );

  $publisher = $doc->createElement( "publisher" );
  $publisher->appendChild(
    $doc->createTextNode( $book['publisher'] )
  );
  $b->appendChild( $publisher );

  $r->appendChild( $b );
}

echo $doc->saveXML();
?>
At the top of the script, the books array is loaded with some example books. That data could come from the user or from a database.
After the example books are loaded, the script creates a new DOMDocument and adds the root books node to it. Then the script creates an element for the author, title, and publisher for each book and adds a text node to each of those nodes. The final step for each book node is to attach it to the root books node.
The end of the script dumps the XML to the console using the saveXML method. (You can also use the save method to create a file from the XML.) The output of the script is shown in Listing 6.
Listing 6. Output from the DOM build script
% php e4.php
<?xml version="1.0"?>
<books>
  <book>
    <author>Jack Herrington</author>
    <title>PHP Hacks</title>
    <publisher>O'Reilly</publisher>
  </book>
  <book>
    <author>Jack Herrington</author>
    <title>Podcasting Hacks</title>
    <publisher>O'Reilly</publisher>
  </book>
</books>
%
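The sample data contains no characters that need escaping, so it is worth seeing what the DOM does when it meets one. The sketch below uses an invented title containing an ampersand, and also tries the save method mentioned above with a temporary file path chosen only for this example.

```php
<?php
$doc = new DOMDocument();
$root = $doc->createElement( 'books' );
$doc->appendChild( $root );

$book = $doc->createElement( 'book' );
$root->appendChild( $book );

// The ampersand in this made-up title is escaped when the document is serialized.
$title = $doc->createElement( 'title' );
$title->appendChild( $doc->createTextNode( 'Tips & Tricks' ) );
$book->appendChild( $title );

$xml = $doc->saveXML();
echo $xml;

// save() writes the XML to a file and returns the number of bytes
// written, or false on failure. The path is a throwaway temporary file.
$path = tempnam( sys_get_temp_dir(), 'books' );
$bytes = $doc->save( $path );
$written = file_get_contents( $path );
unlink( $path );
```

The text node holds the raw string, and the escaping happens only on output, so the calling code never has to encode anything by hand.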
The real value of using the DOM is that the XML it creates is always well formed. But what can you do if you don't have access to the DOM to create XML?
Writing XML with PHP
If the DOM isn't available, you can use PHP text templating to write XML. Listing 7 shows how PHP builds the books XML file.
Listing 7. Writing books XML with PHP
<?php
$books = array();
$books [] = array(
  'title' => 'PHP Hacks',
  'author' => 'Jack Herrington',
  'publisher' => "O'Reilly"
);
$books [] = array(
  'title' => 'Podcasting Hacks',
  'author' => 'Jack Herrington',
  'publisher' => "O'Reilly"
);
?>
<books>
<?php
foreach( $books as $book )
{
?>
  <book>
    <title><?php echo( $book['title'] ); ?></title>
    <author><?php echo( $book['author'] ); ?>
    </author>
    <publisher><?php echo( $book['publisher'] ); ?>
    </publisher>
  </book>
<?php
}
?>
</books>
The top of the script is similar to the DOM script. The bottom of the script opens the books tag, then iterates through each book, creating the book tag and all the internal title, author, and publisher tags.
The problem with this approach is encoding the entities. To make sure the entities are properly encoded, the htmlspecialchars function must be called on each item with the ENT_XML1 flag (available in PHP 5.4 and later), as shown in Listing 8. The related htmlentities function is not a safe substitute, because it also emits HTML-only named entities such as &eacute; that XML does not define.
Listing 8. Using the htmlspecialchars function to encode entities
<books>
<?php
foreach( $books as $book )
{
  $title = htmlspecialchars( $book['title'], ENT_XML1 | ENT_QUOTES, 'UTF-8' );
  $author = htmlspecialchars( $book['author'], ENT_XML1 | ENT_QUOTES, 'UTF-8' );
  $publisher = htmlspecialchars( $book['publisher'], ENT_XML1 | ENT_QUOTES, 'UTF-8' );
?>
  <book>
    <title><?php echo( $title ); ?></title>
    <author><?php echo( $author ); ?>
    </author>
    <publisher><?php echo( $publisher ); ?>
    </publisher>
  </book>
<?php
}
?>
</books>
This is why it's annoying to write XML in basic PHP. You think that you're creating perfect XML, but then you find that certain elements aren't encoded properly when you try to run data through it.
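The choice of escaping function matters here. The following sketch compares PHP's htmlentities and htmlspecialchars (the latter with the ENT_XML1 flag, which assumes PHP 5.4 or later) on a made-up string that contains an accented character and an ampersand.

```php
<?php
// "Café & Bar" in UTF-8; the \u{e9} escape (PHP 7 and later) avoids
// depending on the encoding of this source file.
$name = "Caf\u{e9} & Bar";

// htmlentities() also converts the accented letter to &eacute;,
// a named entity that HTML defines but XML does not.
$htmlStyle = htmlentities( $name, ENT_QUOTES, 'UTF-8' );

// htmlspecialchars() with ENT_XML1 escapes only the markup characters.
$xmlStyle = htmlspecialchars( $name, ENT_XML1 | ENT_QUOTES, 'UTF-8' );

echo $htmlStyle, "\n";
echo $xmlStyle, "\n";
```

XML predefines only five named entities (amp, lt, gt, apos, and quot), so the first result would need a DTD that declares eacute before an XML parser could read it back.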
Conclusions
XML has always had a lot of hype and confusion surrounding it. However, it's not as difficult as you think it is -- especially in a great language like PHP. When you understand and implement XML properly, you'll find there are a lot of powerful tools you can use. XPath and XSLT are two such tools that are worth checking out.
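As a small taste of XPath, here is a sketch that uses PHP's DOMXPath class to pull the titles out of the books XML from Listing 1. It assumes the DOM extension is available and embeds the XML as a string to stay self-contained.

```php
<?php
$xml = '<books>'
     . '<book><author>Jack Herrington</author><title>PHP Hacks</title>'
     . '<publisher>O\'Reilly</publisher></book>'
     . '<book><author>Jack Herrington</author><title>Podcasting Hacks</title>'
     . '<publisher>O\'Reilly</publisher></book>'
     . '</books>';

$doc = new DOMDocument();
$doc->loadXML( $xml );

// One XPath expression replaces the nested getElementsByTagName calls.
$xpath = new DOMXPath( $doc );
$titles = array();
foreach ( $xpath->query( '/books/book/title' ) as $node )
{
  $titles[] = $node->nodeValue;
}

print_r( $titles );
```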