
Tip: Stop a SAX parser when you have enough data

Use SAX data without having to parse the entire document

Nicholas Chase (nicholas@nicholaschase.com), President, Chase and Chase, Inc.
Nicholas Chase has been involved in Web site development for companies such as Lucent Technologies, Sun Microsystems, Oracle, and the Tampa Bay Buccaneers. Nick has been a high school physics teacher, a low-level radioactive waste facility manager, an online science fiction magazine editor, a multimedia engineer, and an Oracle instructor. More recently, he was the Chief Technology Officer of Site Dynamics Interactive Communications in Clearwater, FL, USA, and is the author of three books on Web development, including Java and XML From Scratch (Que) and the upcoming Primer Plus XML Programming (Sams). He loves to hear from readers and can be reached at nicholas@nicholaschase.com.

Summary:  A SAX parser can be instructed to stop midway through a document without losing the data already collected. This is one of the most commonly mentioned advantages of a SAX parser over a DOM parser, which generally creates an in-memory structure of the entire document. In this tip, you'll parse a list of recently updated weblogs, stopping when you've displayed all those within a particular time range.


Date:  01 Jun 2002
Level:  Intermediate

Note: This tip uses JAXP. The classes are also part of the Java 2 SDK 1.4, so if you have 1.4 installed, you don't need any additional software. You can download the source file for this article (see Download).

How a SAX parser works

The Simple API for XML (SAX) is an event-based API. It examines an XML file, character by character, and translates it into a series of events, such as startDocument() and endElement(). A ContentHandler object processes these events, taking appropriate action. An ErrorHandler object takes care of any warnings or errors that arise during the parsing. The main application (see Listing 1) assigns these objects to the XMLReader object:

The parse() method simply sends the events to the content object, which then deals with them.
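The wiring described above can be sketched as follows. This is a minimal illustration rather than the tip's own MainSaxApp: the class name SaxWiringSketch and the CountingHandler class are hypothetical, the reader is obtained through JAXP's SAXParserFactory instead of a named parser class, and the XML is read from a string so that the example is self-contained.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class SaxWiringSketch {

    // Hypothetical handler. DefaultHandler implements both ContentHandler
    // and ErrorHandler, so one object can play both roles in a sketch.
    static class CountingHandler extends DefaultHandler {
        int elements = 0;
        boolean sawEndDocument = false;

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes attributes) {
            elements++;              // one event per opening tag
        }

        @Override
        public void endDocument() {
            sawEndDocument = true;   // fires only if the whole document was parsed
        }

        @Override
        public void warning(SAXParseException e) {
            System.err.println("Warning: " + e.getMessage());
        }
    }

    // Parses the XML text and returns the handler so the caller can inspect it.
    static CountingHandler parse(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);   // so that localName is populated
        XMLReader reader = factory.newSAXParser().getXMLReader();

        CountingHandler handler = new CountingHandler();
        reader.setContentHandler(handler); // receives startElement() and so on
        reader.setErrorHandler(handler);   // receives warnings and errors

        // parse() drives the events into the handlers and returns when done.
        reader.parse(new InputSource(new StringReader(xml)));
        return handler;
    }

    public static void main(String[] args) throws Exception {
        CountingHandler h = parse(
            "<weblogUpdates><weblog name=\"a\" when=\"28\"/></weblogUpdates>");
        System.out.println(h.elements + " elements, endDocument seen: "
                           + h.sawEndDocument);
    }
}
```

The application code never loops over the document itself; it only registers the handlers and calls parse(), and everything else happens in the callbacks.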


The handlers

For this application, all of the work will be done by the WeblogHandler object, which processes the XML file. The changes.xml file itself is fairly simple, with all of the actual data contained in attributes:


Listing 2. A portion of the data file

<?xml version="1.0"?>
<weblogUpdates version="1" 
              updated="Sat, 15 Jun 2002 22:25:06 GMT" 
              count="592697">
  <weblog name="Enigmatic Mermaid" 
             url="http://pombostrans.blogspot.com" 
             when="28"/>
  <weblog name="The Vanguard Science Fiction Report" 
             url="http://www.vanguardreport.com" when="852"/>
  <weblog name="Flummox.com" 
             url="http://www.flummox.com/" when="10713"/>
</weblogUpdates>

This is just a snippet of the actual file, but it shows the structure: Attributes include the name, the URL, and the time since the weblog was updated, in seconds. The content handler takes some of that information and outputs it to the window:


Listing 3. The content handler

import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.Attributes;

public class WeblogHandler extends DefaultHandler
{
  public WeblogHandler ()
  {
      super();
  }

  int numLogs = 0;
  public void startElement (String namespaceUri, String localName, 
          String qualifiedName, Attributes attributes) {
      
      if (localName.equals("weblog")) {
          String logName = attributes.getValue("name");
          String secsAgo = attributes.getValue("when");
          numLogs = numLogs + 1;
          System.out.println(numLogs + ") " + logName 
                  + " updated " + secsAgo + " seconds ago.");
      }
  }

  public void endDocument(){
       System.out.println();
       System.out.println("All recorded logs displayed.");
       System.out.println("More may have been updated within"
                  + " the appropriate timeframe.");
  }

}

In this case the error handler is trivial: it simply alerts you to the presence of an error or warning. The source download includes the complete file.
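The error handler's source is not shown in this tip, so the following is only a sketch of what a trivial one might look like. The class name ErrorProcessorSketch, the counters, and the check() helper are assumptions made for illustration; the three callback methods are the ones defined by the org.xml.sax.ErrorHandler interface.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;

class ErrorProcessorSketch implements ErrorHandler {
    // Counters are an assumption of this sketch; they make the handler easy to inspect.
    int warnings = 0;
    int errors = 0;
    int fatalErrors = 0;

    public void warning(SAXParseException e) {
        warnings++;
        report("Warning", e);
    }

    public void error(SAXParseException e) {
        errors++;
        report("Error", e);
    }

    public void fatalError(SAXParseException e) throws SAXException {
        fatalErrors++;
        report("Fatal error", e);
        throw e;   // a document that is not well-formed cannot be parsed further
    }

    private static void report(String kind, SAXParseException e) {
        System.err.println(kind + " at line " + e.getLineNumber()
                           + ": " + e.getMessage());
    }

    // Hypothetical helper: parses the XML text and returns the handler.
    static ErrorProcessorSketch check(String xml) throws Exception {
        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        ErrorProcessorSketch errors = new ErrorProcessorSketch();
        reader.setErrorHandler(errors);
        try {
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException e) {
            // Already reported by the handler above.
        }
        return errors;
    }

    public static void main(String[] args) throws Exception {
        ErrorProcessorSketch result = check("<a><b></a>");  // mismatched tags
        System.out.println("Fatal errors seen: " + result.fatalErrors);
    }
}
```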


Running the application

When you actually run MainSaxApp, all of the data in changes.xml is passed through to content, which outputs the appropriate information, as seen in Figure 1.


Figure 1. All of the weblogs are displayed.

Notice that the entire file has been parsed, as evidenced by the execution of the endDocument() method.


Stopping the parser

As you can see, a significant number of weblogs have been updated in the three-hour period that changes.xml tracks. Suppose that you want to allow the user to enter a number of seconds representing the interval in which he or she is interested. To do that, you'll read the first argument on the command line and pass it to the content object. (You'll look at the corresponding changes to WeblogHandler.java in a moment.)


Listing 4. Changes to MainSaxApp.java
...
   String parserClass = "org.apache.crimson.parser.XMLReaderImpl";
   XMLReader reader = XMLReaderFactory.createXMLReader(parserClass);

   WeblogHandler content = new WeblogHandler();
   int numSecs = Integer.parseInt(args[0]);  // interval of interest, in seconds
   content.setNumSecs(numSecs);

   ErrorProcessor errors = new ErrorProcessor(); 

   reader.setContentHandler(content);
   reader.setErrorHandler(errors);
...

Of course, these changes won't mean anything unless you change the WeblogHandler class:


Listing 5. Changes to WeblogHandler.java

import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;

public class WeblogHandler extends DefaultHandler
{
 public WeblogHandler ()
 { super(); }

 //-------------
 //UTILITY METHODS
 //-------------
 int numSecs = 0;
 public void setNumSecs(int arg) {
      numSecs = arg;
 }

 //-------------
 //EVENT METHODS
 //-------------
 int numLogs = 0;
 public void startElement (String namespaceUri, String localName,
                           String qualifiedName, Attributes attributes)
                        throws SAXException {

  if (localName.equals("weblog")) {
   String logName = attributes.getValue("name");
   String logURL = attributes.getValue("url");

   // Convert the when attribute to an int so it can be compared
   int secsAgo = Integer.parseInt(attributes.getValue("when"));

   if (secsAgo > numSecs) {
    // Past the requested interval: throw to stop the parser
    throw new SAXException("\nLimit reached after "+numLogs+" entries.");
   } else {
    numLogs = numLogs + 1;
    System.out.println(numLogs + ") " + logName +
                       " updated " + secsAgo + " seconds ago.");
   }
  }
 }

 public void endDocument(){
      System.out.println();
      System.out.println("All recorded logs displayed.");
      System.out.println("More may have been updated within"
                            + " the appropriate timeframe.");
 }

}

First, add the setNumSecs() method for the argument. Next, retrieve the when attribute as an int rather than as a String. Fortunately, changes.xml is sorted based on the when attribute, so all you have to do is compare the current secsAgo to numSecs; if secsAgo exceeds numSecs, you want to stop parsing.

In order to stop parsing, you throw a new SAXException, creating it with a message that includes the number of logs processed so far. So what happens when you run it?


Running the new application

Now, if you run the new application with an argument of, say, five minutes (for example, using java MainSaxApp 300) you can see the difference, as shown in Figure 2.


Figure 2. The first five minutes.

So what is actually happening here? You entered an argument of 300 seconds, so when the parser reaches the first weblog that was updated more than 300 seconds ago, the startElement() method throws the SAXException. Because startElement() has no try-catch block, the exception propagates to its caller, the reader's parse() method. Nothing handles it there either, so parse() ends and the exception reaches MainSaxApp's main() method, where a try-catch block catches it and prints its message.

The main point is this: Because the application threw the exception, the parser stopped -- as evidenced by the fact that the endDocument() method was never executed -- but you still had all of the information it had already encountered.
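The whole round trip can be tried in one self-contained sketch. The names StopEarlySketch, LimitHandler, and parseUpTo are hypothetical, the reader comes from JAXP's SAXParserFactory, and the data is the three-entry snippet from Listing 2 held in a string; the stopping logic mirrors Listing 5.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class StopEarlySketch {

    // Hypothetical handler: collects names until an entry is older than numSecs.
    static class LimitHandler extends DefaultHandler {
        final int numSecs;
        final List<String> names = new ArrayList<>();
        boolean endDocumentCalled = false;

        LimitHandler(int numSecs) { this.numSecs = numSecs; }

        @Override
        public void startElement(String uri, String localName, String qName,
                                 Attributes attributes) throws SAXException {
            if (!"weblog".equals(localName)) return;
            int secsAgo = Integer.parseInt(attributes.getValue("when"));
            if (secsAgo > numSecs) {
                // Throwing from a callback is what stops the parser.
                throw new SAXException(
                    "Limit reached after " + names.size() + " entries.");
            }
            names.add(attributes.getValue("name"));
        }

        @Override
        public void endDocument() { endDocumentCalled = true; }
    }

    static final String SAMPLE =
          "<weblogUpdates>"
        + "<weblog name=\"Enigmatic Mermaid\" when=\"28\"/>"
        + "<weblog name=\"The Vanguard Science Fiction Report\" when=\"852\"/>"
        + "<weblog name=\"Flummox.com\" when=\"10713\"/>"
        + "</weblogUpdates>";

    static LimitHandler parseUpTo(String xml, int numSecs) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();
        LimitHandler handler = new LimitHandler(numSecs);
        reader.setContentHandler(handler);
        try {
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException e) {
            // The exception thrown in startElement() surfaces here.
            System.out.println(e.getMessage());
        }
        return handler;   // data collected before the stop is still available
    }

    public static void main(String[] args) throws Exception {
        LimitHandler h = parseUpTo(SAMPLE, 300);
        System.out.println("Collected: " + h.names);
        System.out.println("endDocument called: " + h.endDocumentCalled);
    }
}
```

One caveat of this simple form: the catch block cannot tell a deliberate stop from a genuine parse error, because both arrive as a SAXException.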


Next steps

This tip demonstrates a simple application that includes a SAX parser that stops when it encounters a particular condition. Here, you have simply used a generic SAXException, but there's nothing to stop you from creating your own exceptions for different business conditions and building their use into your logic. (You'd also want to perform a lot more error checking when using the command-line argument!)
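As one possible way to follow both suggestions, the sketch below uses a dedicated exception type for the "limit reached" condition and validates the command-line argument. All of the names here (CustomStopSketch, LimitReachedException, run, parseSeconds) are hypothetical and not part of the tip's source code.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class CustomStopSketch {

    // Hypothetical exception: signals a deliberate stop, not a parse failure.
    static class LimitReachedException extends SAXException {
        final int entries;
        LimitReachedException(int entries) {
            super("Limit reached after " + entries + " entries.");
            this.entries = entries;
        }
    }

    static class Handler extends DefaultHandler {
        final int numSecs;
        int numLogs = 0;
        Handler(int numSecs) { this.numSecs = numSecs; }

        @Override
        public void startElement(String uri, String localName, String qName,
                                 Attributes attributes) throws SAXException {
            if (!"weblog".equals(localName)) return;
            int secsAgo = Integer.parseInt(attributes.getValue("when"));
            if (secsAgo > numSecs) throw new LimitReachedException(numLogs);
            numLogs++;
        }
    }

    // Returns a short description of how parsing ended.
    static String run(String xml, int numSecs) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();
        Handler handler = new Handler(numSecs);
        reader.setContentHandler(handler);
        reader.setErrorHandler(handler);   // DefaultHandler rethrows fatal errors
        try {
            reader.parse(new InputSource(new StringReader(xml)));
            return "complete";
        } catch (LimitReachedException e) {
            return "stopped:" + e.entries;   // the business condition
        } catch (SAXException e) {
            return "error";                  // a real parsing problem
        }
    }

    // Validates the command-line argument instead of trusting it.
    static int parseSeconds(String arg) {
        int n;
        try {
            n = Integer.parseInt(arg.trim());
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException(
                "Not a whole number of seconds: " + arg, e);
        }
        if (n < 0) {
            throw new IllegalArgumentException("Seconds must not be negative: " + arg);
        }
        return n;
    }

    public static void main(String[] args) throws Exception {
        int numSecs = parseSeconds(args.length > 0 ? args[0] : "300");
        String xml = "<weblogUpdates><weblog name=\"a\" when=\"28\"/>"
                   + "<weblog name=\"b\" when=\"852\"/></weblogUpdates>";
        System.out.println(run(xml, numSecs));
    }
}
```

Because LimitReachedException is a subclass of SAXException, it can be thrown from startElement() without changing the method signature, and the more specific catch clause must come first.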



Download

Description: Source code for this tip
Name: x-tipsaxstopsource.zip
Size: 11.52 KB
Download method: HTTP


