
SAX-like apps in PHP

Use streaming XML data in PHP

Nicholas Chase (nicholas@nicholaschase.com), President, Chase and Chase, Inc.
Nicholas Chase has been involved in Web site development for companies such as Lucent Technologies, Sun Microsystems, Oracle, and the Tampa Bay Buccaneers. Nick has been a high school physics teacher, a low-level radioactive waste facility manager, an online sci-fi magazine editor, a multimedia engineer, and an Oracle instructor. Recently, he was the Chief Technology Officer for Site Dynamics Interactive Communications in Clearwater, FL, and is the author of three books on Web development, including Java and XML From Scratch (Que). He loves to hear from readers and can be reached at nicholas@nicholaschase.com.

Summary:  While there is no official implementation of the Simple API for XML (SAX) in PHP, PHP does provide a SAX-like method for working with both local and remote XML files. In this article, author Nicholas Chase shows you how to work with XML files in PHP by building and setting handler functions and creating a parser. He demonstrates SAX in PHP with a page-building exercise in which he crafts a page based on the result of an Amazon Web Services query.

Date:  01 Mar 2003
Level:  Introductory

PHP compensates for having no official implementation of the Simple API for XML, or SAX, by providing a SAX-like method for working with both local and remote XML files. This article demonstrates how to work with XML files in PHP by building and setting handler functions and creating a parser. It also provides a SAX-related page-building exercise in which you can see how to build a page based on the result of an Amazon Web Services query.

To follow along with this article, you should be familiar with both PHP and XML (see Resources for links to get you started). To enable support for XML, simply configure PHP using the --with-xml option. (You'll find that in many situations it's already enabled.)

Let's start with a quick look at SAX in the Java environment. (You don't need to understand Java technology to follow along.)

SAX and streams

In a SAX application, a parser sends events, such as "start element" or "end document," to a handler, which takes the supplied data and acts on it. SAX was originally developed for the Java environment, so a SAX application typically involves three objects:

  • the XMLReader
  • a ContentHandler
  • an ErrorHandler

The developer creates the XMLReader, sets the handler object, and parses the file. For example:


...
          XMLReader reader = XMLReaderFactory.createXMLReader(parserClass);

          reader.setContentHandler(new DataProcessor());
          reader.setErrorHandler(new ErrorProcessor());

          InputSource file = new InputSource(fileName);
          reader.parse(file);
...

The ContentHandler object (in this case, an instance of DataProcessor) has methods that handle each of the callbacks sent by the parser. For example:


...
   public void startDocument() {
      //Executed at the start of the document
      ...
   }

   public void endDocument() {
      //Executed at the end of the document
      ...
   }

   public void startElement (String namespaceUri, String localName,
		            String qualifiedName, Attributes attributes)
   {
      //Executed at the start of each element
      ...
   }

   public void endElement (String namespaceUri, String localName,
		            String qualifiedName) throws SAXException
   {
      //Executed at the end of each element
      ...
   }

   StringBuffer thisText = new StringBuffer();
   public void characters (char ch[], int start, int length)
   {
       //A text node can be sent in one or more callbacks to this method
       thisText.append(ch, start, length);
       ...
   }

...

PHP parses an XML stream using the expat parser, which uses many of these same concepts, but in a slightly different way.


Creating handler functions

While Java technology uses a single ContentHandler object, PHP can use individual functions added to (or included in) the file. For example, functions can simply output the name of elements and their character data:


<?php

function start_element($parser, $name, $attrs) {
    print "<b>Start Element:</b> $name<br />";
    print "<b>---Attributes:</b> <br />";
    foreach ($attrs as $key => $value) {
        print "$key = $value<br />";
    }
    print "<br />";
}

function end_element($parser, $name) {
    print "<b>End Element:</b> $name<br /><br />";
}

function characters($parser, $chars) {
    print "<p><i>$chars</i></p>";
}

?>

The functions above are standard PHP functions that act on the arguments they receive, and the parser supplies those arguments. Every handler receives the parser itself as its first argument. Beyond that, start_element() receives the name of the element and an associative array of its attributes, keyed by attribute name, while characters() receives the character data.


Creating the parser

To use these functions, the application must create a parser and tell it where to send various events. These tasks are accomplished through functions that are part of PHP, such as xml_parser_create():


...
function characters($parser, $chars) {
    print "<p><i>$chars</i></p>";
}
 
$book_parser = xml_parser_create();
xml_set_element_handler($book_parser, "start_element", "end_element");
xml_set_character_data_handler($book_parser, "characters");


xml_parser_free($book_parser);
...

The ultimate goal is the parsing of book data, so create a variable called $book_parser and populate it with the resource returned by the xml_parser_create() function.

The newly-created parser needs to know where to send the information about each event. In other words, you must register handlers with the parser. The xml_set_element_handler() function takes the parser and registers a function for both the element start and end events. Note that the function names are completely arbitrary; it doesn't matter what the functions are actually called, as long as they're properly registered.

The xml_set_element_handler() function only takes care of the element itself. To deal with any character data that may be contained within it, use xml_set_character_data_handler() to register a handler for any text content. If no character data handler were registered, character data would simply be ignored, because the parser would have no place to send it.
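To make the effect of registration concrete, here is a minimal sketch that records events in an array instead of printing them. The collect_events() helper and the one-element sample document are hypothetical and used only for illustration; the parser functions are the ones described above.

```php
<?php
// Illustrative sketch: collect_events() and the sample markup are hypothetical.
function collect_events(string $xml, bool $with_text_handler): array
{
    $events = [];

    $parser = xml_parser_create();

    // Register one function for element starts and one for element ends.
    xml_set_element_handler(
        $parser,
        function ($parser, $name, $attrs) use (&$events) {
            $events[] = "start:$name";
        },
        function ($parser, $name) use (&$events) {
            $events[] = "end:$name";
        }
    );

    // Without this registration, text content is silently dropped.
    if ($with_text_handler) {
        xml_set_character_data_handler(
            $parser,
            function ($parser, $chars) use (&$events) {
                $events[] = "text:$chars";
            }
        );
    }

    xml_parse($parser, $xml, true);
    xml_parser_free($parser);

    return $events;
}

$sample = '<name>Stormy</name>';
print implode(", ", collect_events($sample, true)) . "\n";
print implode(", ", collect_events($sample, false)) . "\n";
```

The second call registers no character data handler, so its event list contains the element events but no text event.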

In fact, xml_set_element_handler() and xml_set_character_data_handler() are only the two most common handler registration functions. PHP also includes:

  • xml_set_external_entity_ref_handler(). This determines the function that resolves and potentially loads and parses an external entity encountered within the original XML file.

  • xml_set_notation_decl_handler() and xml_set_processing_instruction_handler(). These two set the functions called when notation declarations or processing instructions are encountered, respectively.

  • xml_set_start_namespace_decl_handler() and xml_set_end_namespace_decl_handler(). These functions set the functions to call at the start and end of a namespace declaration's scope.

  • xml_set_default_handler(). This sets the function to process any events for which there is no other handler.

Once the parse is complete, destroy the parser object by calling xml_parser_free().
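As a rough illustration of two of the less common registration functions listed above, the following sketch registers a processing instruction handler and a default handler. The sample document and the $seen array are hypothetical. Exactly which pieces of the document reach the default handler can depend on the underlying XML library, so treat that part as an assumption to verify on your own installation.

```php
<?php
// Illustrative sketch: the sample document and $seen array are hypothetical.
$seen = ['pi' => [], 'default' => []];

$parser = xml_parser_create();

// Called for each processing instruction with its target and its data.
xml_set_processing_instruction_handler(
    $parser,
    function ($parser, $target, $data) use (&$seen) {
        $seen['pi'][] = "$target: $data";
    }
);

// Called for events that have no more specific handler registered.
xml_set_default_handler(
    $parser,
    function ($parser, $data) use (&$seen) {
        $seen['default'][] = $data;
    }
);

$xml = '<?xml version="1.0"?><doc><?render mode="fast"?><!-- note --></doc>';
xml_parse($parser, $xml, true);
xml_parser_free($parser);

print_r($seen['pi']);
```

The XML declaration on the first line of the sample is not a processing instruction, so only the render instruction reaches the first handler.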


Parsing the file

Once the parser exists and has had its handlers set, actually parsing the file is a matter of sending each piece of data within it to the parser:


...
xml_set_character_data_handler($book_parser, "characters");

$file = "data.xml";
if ($file_stream = fopen($file, "r")) {

   // Read the file in 4 KB chunks until fread() has nothing left to return.
   while ($data = fread($file_stream, 4096)) {

       // The third argument tells the parser whether this is the final chunk.
       $this_chunk_parsed = xml_parse($book_parser, $data, feof($file_stream));
       if (!$this_chunk_parsed) {
           // Build a message from the parser's own error information.
           $error_code = xml_get_error_code($book_parser);
           $error_text = xml_error_string($error_code);
           $error_line = xml_get_current_line_number($book_parser);

           $output_text = "Parsing problem at line $error_line: $error_text";
           die($output_text);
       }

   }
   fclose($file_stream);

} else {

    die("Can't open XML file.");

}
xml_parser_free($book_parser);
?>

First, create a stream by opening the file for reading. The stream is read in chunks, so the while loop continues as long as chunks are available. Each chunk is passed to the parser using the xml_parse() function, which takes the parser, the data, and a value that tells it whether this is the last piece of data. In this case, testing for the end of the file using feof() supplies that information: once the file has been completely read, the parser knows it has received the last piece.
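The chunk-by-chunk pattern can be tried without a file. The sketch below splits a string into deliberately tiny pieces and feeds them to the parser one at a time; parse_in_chunks() and the 8-byte chunk size are hypothetical, chosen only to force several xml_parse() calls. Because a single text node may reach the character data handler in more than one call, the handler appends rather than overwrites.

```php
<?php
// Illustrative sketch: parse_in_chunks() and the tiny chunk size are
// hypothetical, chosen only to force several xml_parse() calls.
function parse_in_chunks(string $xml, int $chunk_size): array
{
    $text = '';
    $calls = 0;

    $parser = xml_parser_create();
    xml_set_character_data_handler(
        $parser,
        function ($parser, $chars) use (&$text, &$calls) {
            // One text node may arrive in several pieces, so append.
            $text .= $chars;
            $calls++;
        }
    );

    $chunks = str_split($xml, $chunk_size);
    $last = count($chunks) - 1;
    foreach ($chunks as $i => $chunk) {
        // The third argument is true only for the final chunk.
        if (!xml_parse($parser, $chunk, $i === $last)) {
            break;
        }
    }
    xml_parser_free($parser);

    return ['text' => $text, 'calls' => $calls];
}

$result = parse_in_chunks('<name>Stormy the missing trooper</name>', 8);
print $result['text'] . " (" . $result['calls'] . " handler calls)\n";
```

The number of handler calls is not fixed; what matters is that the accumulated text matches the original content.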

If the chunk is successfully parsed, the information is sent off to the appropriate handlers and nothing more will need to be done with it here.

If a problem does occur, such as an XML file that is not well formed, xml_parse() returns a false value. The if statement checks for this condition and, if necessary, builds an error string and outputs it as processing stops.
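To see the error path in isolation, consider this small sketch. The check_well_formed() helper is hypothetical: it returns the error message (or null on success) rather than calling die(), which makes it easier to experiment with. The error functions are the same ones used in the listing above.

```php
<?php
// Illustrative sketch: check_well_formed() is a hypothetical helper that
// returns an error message instead of calling die().
function check_well_formed(string $xml): ?string
{
    $parser = xml_parser_create();
    $message = null;

    // Passing true marks this single string as the final piece of data.
    if (!xml_parse($parser, $xml, true)) {
        $error_code = xml_get_error_code($parser);
        $error_text = xml_error_string($error_code);
        $error_line = xml_get_current_line_number($parser);
        $message = "Parsing problem at line $error_line: $error_text";
    }

    xml_parser_free($parser);
    return $message;
}

$good = "<person><name>Stormy</name></person>";
$bad  = "<person>\n<name>Stormy</person>";   // <name> is never closed

var_dump(check_well_formed($good));
print check_well_formed($bad) . "\n";
```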

For a simple XML file such as this:


<?xml version="1.0"?>
<person status="missing">
    <name>Stormy</name>
    <opnumber>TK421</opnumber>
</person>

the resulting page is shown in Figure 1.


Figure 1. Parsing a simple file

Case issues

You may have noticed that all of the element and attribute names are output in uppercase, even though they were lowercase in the original XML document. This change is known as case folding and is the default for a PHP XML parser. You can turn it off by setting the case folding option:


...
$book_parser = xml_parser_create();
xml_parser_set_option($book_parser, XML_OPTION_CASE_FOLDING, 0);

xml_set_element_handler($book_parser, "start_element", 
                                      "end_element");
...

Now the document will be represented more accurately, as shown in Figure 2.


Figure 2. Turning off case folding results in accurate element and attribute names
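If you want to compare the two settings side by side without a browser, a sketch along these lines may help. The element_names() helper is hypothetical; it collects element and attribute names with case folding switched on or off.

```php
<?php
// Illustrative sketch: element_names() is a hypothetical helper.
function element_names(string $xml, bool $case_folding): array
{
    $names = [];

    $parser = xml_parser_create();
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, $case_folding ? 1 : 0);
    xml_set_element_handler(
        $parser,
        function ($parser, $name, $attrs) use (&$names) {
            // Record the element name followed by its attribute names.
            $names[] = $name;
            foreach (array_keys($attrs) as $key) {
                $names[] = $key;
            }
        },
        function ($parser, $name) {
            // End tags are not needed for this comparison.
        }
    );
    xml_parse($parser, $xml, true);
    xml_parser_free($parser);

    return $names;
}

$xml = '<person status="missing"><name>Stormy</name></person>';
print implode(" ", element_names($xml, true)) . "\n";
print implode(" ", element_names($xml, false)) . "\n";
```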

Using a handler object

To get even closer to the SAX structure, you can group your handler functions into a single handler object. For example:


<?php

class Content_Handler {

   function Content_Handler(){}

   function start_element($parser, $name, $attrs) {
       print "<b>Start Element:</b> $name<br />";
       print "<b>---Attributes:</b> <br />";
       foreach ($attrs as $key => $value) {
           print "$key = $value<br />";
       }
       print "<br />";
   }

   function end_element($parser, $name) {
       print "<b>End Element:</b> $name<br /><br />";
   }

   function characters($parser, $chars) {
       print "<p><i>$chars</i></p>";
   }

}

$handler = new Content_Handler();
$book_parser = xml_parser_create();
...

In this case, the object doesn't have to do anything special on instantiation, so the constructor function is left empty. Instantiating the handler is simply a matter of creating a new instance of the Content_Handler class.

You still need to take one step before the object can be used; the parser has to know where to find these functions. It may seem like a simple matter of changing the registration functions, as in:


xml_set_element_handler($book_parser, "$handler->start_element", 
                                      "$handler->end_element");
xml_set_character_data_handler($book_parser, "$handler->characters");

But the parser won't be able to find them this way. Instead, you need to set a handler object for the parser:


...
$handler = new Content_Handler();
$book_parser = xml_parser_create();
xml_parser_set_option($book_parser, XML_OPTION_CASE_FOLDING, 0);

xml_set_object($book_parser, $handler);
xml_set_element_handler($book_parser, "start_element", 
                                      "end_element");
xml_set_character_data_handler($book_parser, "characters");

$file = "data.xml";
if ($file_stream = fopen($file, "r")) {
...

With the object set this way, the $book_parser parser knows that any handler functions it calls are actually methods of the $handler object.

Encapsulating the handler into a class provides easier maintainability through modularization. The class definition can be stripped out into a separate file and used in multiple places, and changes to processing won't affect the main application.

What's more, because this is a class, other classes can inherit these methods for specialized parsing situations.
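As an alternative sketch, not the approach used in this article, the handler registration functions also accept an array that pairs an object with a method name. The Name_Collector class below is hypothetical and exists only to show that form of registration.

```php
<?php
// Illustrative sketch: Name_Collector is a hypothetical handler class.
class Name_Collector
{
    public $names = [];

    public function start_element($parser, $name, $attrs)
    {
        $this->names[] = $name;
    }

    public function end_element($parser, $name)
    {
        // Nothing to do at the end of an element in this sketch.
    }
}

$collector = new Name_Collector();
$parser = xml_parser_create();
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

// Each callback is an [object, method name] pair, so no separate
// xml_set_object() call is involved.
xml_set_element_handler(
    $parser,
    [$collector, 'start_element'],
    [$collector, 'end_element']
);

xml_parse($parser, '<person><name>Stormy</name></person>', true);
xml_parser_free($parser);

print implode(", ", $collector->names) . "\n";
```

Either form keeps the handler logic inside the class; the difference is only in how the parser is told where the methods live.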


Creating the list

Now that the basic structure is in place, you can add the functionality to analyze an XML file. In this case, the file is a remote resource, but the application doesn't care, as long as your installation of PHP is configured to allow URLs to be opened as if they were local files (the allow_url_fopen setting). This is typically the case.
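If you are unsure how your installation is configured, a quick check along the following lines can tell you. The url_fopen_enabled() helper is hypothetical; it reads the allow_url_fopen configuration setting, which controls whether fopen() may open URLs.

```php
<?php
// Illustrative sketch: url_fopen_enabled() is a hypothetical helper.
function url_fopen_enabled(): bool
{
    // ini_get() returns the setting as a string, so normalize it to a bool.
    return filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN);
}

if (!url_fopen_enabled()) {
    print "fopen() cannot open URLs with this configuration.\n";
}
```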

The URL in question is an Amazon search that returns the last 10 books in the catalog that were published in 2002 on the topic (subject) of XML:

http://xml.amazon.com/onca/xml2?PowerSearch=subject:xml%20and%20pubdate:2002&mode=books&dev-t=00000000000000&t=vanguardsc-20&type=lite&f=xml&sort=%2bdaterank

(To make this URL work, you'd need to replace the zeroes with a developer token, available at https://associates.amazon.com/exec/panama/associates/join/developer/application.html.)

The results of this query are shown in Figure 3.


Figure 3. The raw XML to be processed

The goal is to provide a page that lists each book. Each entry is built from a Details element: the title (in the ProductName element) becomes a link to the book's page (as given by the url attribute of the Details element), followed by arbitrary text and the author's name (in the Author element).

Fortunately, the information is included in the order in which it is to be presented, so all that's necessary is to alter the handler functions so that they output information when required:


<?php

class Content_Handler {

   function Content_Handler(){}

   function start_element($parser, $name, $attrs) {
       global $print_characters;
       if ($name == "Details"){
          print "<a href=\"".$attrs["url"]."\">";
       }
       if ($name == "Author"){
          print "</a>, by ";
       }
       if ($name == "ProductName" || $name == "Author") {
          $print_characters = true;
       }
   }

   function end_element($parser, $name) {
       if ($name == "Details") {
          print "<br />";
       }
   }

   function characters($parser, $chars) {
       global $print_characters;
       if ($print_characters) {
          print "$chars";
          $print_characters = false;
       }
   }
}

$print_characters = false;

$handler = new Content_Handler();
$book_parser = xml_parser_create();
...

Starting at the bottom of the listing, notice the $print_characters variable. The only data output in any form is the url attribute, the ProductName element's text child, and the Author element's text child, but the characters() function has no way of knowing which element contains the text it is currently receiving.

The only way to control when text is output, then, is a global variable; in this case, $print_characters. The variable starts with a false value, and when the start_element() function encounters an element that it should print, it is set to true and then is turned off once the text has been output by the characters() function.

Aside from turning output on and off, the start_element() and end_element() functions simply provide the arbitrary text that surrounds the data being output, such as the <a></a> tags. The result of parsing the file is a simple list, complete with links, as shown in Figure 4.


Figure 4. The final results
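One caveat is worth a sketch. As the comment in the Java characters() method earlier notes, a single text node can be delivered in more than one callback, so switching output off inside characters() could cut a title short. The variation below is only one possible way to handle that, not the article's own code: it keeps the flag in the handler object and clears it in end_element(). The Book_List_Handler class and the sample data (including the ProductInfo root element) are hypothetical; only the Details, ProductName, and Author names and the url attribute come from the query results described above. It also escapes the text with htmlspecialchars(), which is an addition, and collects the output in a string so it can be inspected.

```php
<?php
// Illustrative sketch: Book_List_Handler and the sample data are
// hypothetical; only the Details, ProductName, and Author names and the
// url attribute come from the article.
class Book_List_Handler
{
    public $html = '';
    private $print_characters = false;

    public function start_element($parser, $name, $attrs)
    {
        if ($name == "Details") {
            $this->html .= "<a href=\"" . $attrs["url"] . "\">";
        }
        if ($name == "Author") {
            $this->html .= "</a>, by ";
        }
        if ($name == "ProductName" || $name == "Author") {
            $this->print_characters = true;
        }
    }

    public function end_element($parser, $name)
    {
        // Stop printing only when the element closes, so text delivered
        // in several characters() calls is kept whole.
        if ($name == "ProductName" || $name == "Author") {
            $this->print_characters = false;
        }
        if ($name == "Details") {
            $this->html .= "<br />";
        }
    }

    public function characters($parser, $chars)
    {
        if ($this->print_characters) {
            // Escape the text so characters such as & stay valid in HTML.
            $this->html .= htmlspecialchars($chars);
        }
    }
}

$sample = '<ProductInfo><Details url="http://example.com/book1">'
        . '<ProductName>XML &amp; PHP Basics</ProductName>'
        . '<Author>A. Writer</Author></Details></ProductInfo>';

$handler = new Book_List_Handler();
$parser = xml_parser_create();
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
xml_set_element_handler($parser, [$handler, 'start_element'], [$handler, 'end_element']);
xml_set_character_data_handler($parser, [$handler, 'characters']);
xml_parse($parser, $sample, true);
xml_parser_free($parser);

print $handler->html . "\n";
```

Keeping the flag inside the object also removes the need for a global variable, at the cost of a slightly longer class.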

Parsing this article

Through its use of the expat parser, PHP provides a means for simulating SAX processing by assigning handlers to a parser. The parser calls these handlers when specific events, such as the start of an element or character content, are encountered. These handlers come in the form of functions, and can be used individually or grouped into an object for easier maintenance.



Download

Description: Code sample for article
Name: wa-php4sax/saxphp.zip
Size: 1 KB
Download method: HTTP



Resources

  • To get a basic grounding in XML, take the developerWorks "Introduction to XML" tutorial (August 2002).

  • For an introduction to PHP, read "PHP by example, Part 1" (developerWorks, December 2000) and "Part 2" (developerWorks, January 2001).

  • For more information on SAX, take the developerWorks tutorial, "Understanding SAX" (August, 2002).

  • For more information on PHP's XML parser, see the official documentation.

  • Find great resources for PHP at IBM's developerWorks Web Architecture zone, and for XML at the XML zone.

  • The author's book, XML and Java from Scratch (Que, 2001), illustrates the principles of XML by building a Web site and application for a fictitious furniture company as it covers cascading style sheets, XSL processors, DTDs, parsers, manipulating vendor data with JDOM, organizing inventory structure with namespaces and DOM, and using Java technology to access legacy SQL databases in conjunction with XML. The author offers sample chapters from his book.

