Tip: Screen XML documents efficiently with StAX

Retrieve the information you want, then stop the parsing process

Berthold Daum (berthold.daum@bdaum.de), President, BDaum Industrial Communications
Berthold Daum is a consultant and writer based in Lützelbach, Germany. For information on his recent books, System Architecture with XML and Modeling Business Objects with XML Schema (both from Morgan Kaufmann), see http://www.bdaum.de. You can contact Berthold at berthold.daum@bdaum.de.

Summary:  With the Streaming API for XML (StAX), you can screen XML documents efficiently without the drawbacks of traditional push parsers. This tip shows you how to retrieve specific information from XML documents and how to stop the parsing process once this information is collected.


Date:  11 Dec 2003
Level:  Intermediate


The screening or classification of XML documents is a common problem, especially in XML middleware. Routing XML documents to specific processors may require analysis of both the document type and the document content. The problem here is obtaining the required information from the document with the least possible overhead. Traditional parsers such as DOM or SAX are not well suited to this task. DOM, for example, parses the whole document and constructs a complete document tree in memory before it returns control to the client. Even DOM parsers that employ deferred node expansion, and thus are able to parse a document partially, have high resource demands because the document tree must be at least partially constructed in memory. This is simply not acceptable for screening purposes.

Like DOM parsers, SAX parsers control the complete parsing process. By default, a SAX parser starts parsing at the beginning of a document and continues until the end, informing client event handlers about parsing events through callbacks. To avoid unnecessary overhead during document screening, an event handler may want to stop the parsing process once it has gathered the required information. A common technique for achieving this in SAX is to throw an exception from the handler, as discussed in the developerWorks tip "Stop a SAX parser when you have enough data" by Nicholas Chase. The exception causes SAX to stop the parsing process. The information gathered by the event handler must be encoded in an error message that is wrapped in an exception object and passed to the parser's client. A special error handler in the client receives this exception and must then parse the error message to retrieve the required information. This may be a solution to the screening problem, but it's a complicated one.
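To make the exception technique concrete, here is a minimal sketch of it using the JAXP SAX parser. The class name SaxEarlyExit, the exception type StopParsing, and the in-memory input are assumptions made for illustration; they are not taken from the tip by Nicholas Chase.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class SaxEarlyExit {

   // Hypothetical exception type, used only to signal "enough data"
   static class StopParsing extends SAXException {
      StopParsing(String message) {
         super(message);
      }
   }

   // Return the "kind" attribute of the root element, or null if absent
   static String getKind(String xml) throws Exception {
      DefaultHandler handler = new DefaultHandler() {
         public void startElement(String uri, String localName,
               String qName, Attributes atts) throws SAXException {
            // The first start tag is the root element; the result
            // travels back to the client in the exception message
            throw new StopParsing(atts.getValue("kind"));
         }
      };
      try {
         SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
      } catch (StopParsing stop) {
         // The client has to unpack the result from the exception
         return stop.getMessage();
      }
      // Not reached for well-formed input, which always has a root element
      return null;
   }

   public static void main(String[] args) throws Exception {
      System.out.println(
         getKind("<order kind=\"invoice\"><item/></order>"));
   }
}
```

Even in this small form, the result has to leave the handler through the error path, which is the complication described above.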

Enter StAX

StAX offers a pull parser that gives client applications full control over the parsing process. A client application may decide at any time to discontinue the parsing process, and no tricks are required to stop the parser. This is ideal for screening purposes.
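As a rough illustration of what "full control" means, the following sketch pulls events one at a time and simply returns at the first start tag. It assumes a StAX implementation is available through XMLInputFactory.newInstance() (as it is on Java 6 and later); the class name RootNameProbe and the sample document are hypothetical.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class RootNameProbe {

   // Return the local name of the root element, or null if none is found
   static String rootName(String xml) throws XMLStreamException {
      XMLStreamReader reader = XMLInputFactory.newInstance()
         .createXMLStreamReader(new StringReader(xml));
      try {
         while (reader.hasNext()) {
            // The client asks for each event; nothing is pushed to it
            if (reader.next() == XMLStreamConstants.START_ELEMENT) {
               // Stopping is just a return; no exception is needed
               return reader.getLocalName();
            }
         }
         return null;
      } finally {
         reader.close();
      }
   }

   public static void main(String[] args) throws XMLStreamException {
      System.out.println(
         rootName("<!-- note --><invoice><line/></invoice>"));
   }
}
```

The rest of the document after the root start tag is never requested from the parser, so it is never processed.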

Listing 1 shows what a simple document classifier might look like. I use the cursor-based StAX API for this example. At the very first start tag of the document (the root element tag), I retrieve the kind attribute from this element. The value of this attribute is then passed back to the client and the parsing process is discontinued. The client may now act upon this returned value.


Listing 1. Screening documents
import java.io.*;

import javax.xml.stream.*;

public class Classifier {

   // Holds factory instance
   private XMLInputFactory xmlif;

   public static void main(String[] args)
      throws IOException, XMLStreamException {
      Classifier router = new Classifier();
      String kind1 = router.getKind("somefile.xml");
      String kind2 = router.getKind("otherfile.xml");
   }

   /**
    * Return the document kind
    * @param filename - the XML file to screen
    * @return the value of the "kind" attribute of the root element,
    *         or null if the root element has no such attribute
    */
   private String getKind(String filename)
      throws IOException, XMLStreamException {
      // Create input factory lazily
      if (xmlif == null) {
         // Use reference implementation; omit this call on Java 6 or
         // later, where the JDK ships its own StAX implementation
         System.setProperty(
            "javax.xml.stream.XMLInputFactory",
            "com.bea.xml.stream.MXParserFactory");
         xmlif = XMLInputFactory.newInstance();
      }
      // Open the file; the stream reader does not close it for us
      FileReader in = new FileReader(filename);
      try {
         // Create stream reader
         XMLStreamReader xmlr = xmlif.createXMLStreamReader(in);
         // Main event loop
         while (xmlr.hasNext()) {
            // Process single event
            switch (xmlr.getEventType()) {
               // Process start tags
               case XMLStreamReader.START_ELEMENT :
                  // Check attributes for first start tag
                  for (int i = 0; i < xmlr.getAttributeCount(); i++) {
                     // Get attribute name (getAttributeName() returns a QName)
                     String localName = xmlr.getAttributeLocalName(i);
                     if (localName.equals("kind")) {
                        // Return value
                        return xmlr.getAttributeValue(i);
                     }
                  }
                  return null;
            }
            // Move to next event
            xmlr.next();
         }
         return null;
      } finally {
         // Release the file whether or not a kind was found
         in.close();
      }
   }
}

Note that I use an instance field to hold the XMLInputFactory instance. This is done to improve efficiency. Compared to the actual parsing process (which is blazingly fast), executing XMLInputFactory.newInstance() and xmlif.createXMLStreamReader() causes considerable overhead. While createXMLStreamReader() must be executed once for each new document, you can reuse the XMLInputFactory instance and thus avoid repeated calls to XMLInputFactory.newInstance().
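The factory reuse described here can be sketched as follows: one XMLInputFactory is created up front and shared, while each document gets its own stream reader. The class name KindScreener, the method kindOf(), and the in-memory sample documents are assumptions for illustration, and the sketch relies on the StAX implementation bundled with Java 6 and later rather than the reference implementation.

```java
import java.io.Reader;
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class KindScreener {

   // Created once and shared by every call to kindOf()
   private final XMLInputFactory factory = XMLInputFactory.newInstance();

   // Return the root element's "kind" attribute, or null if it has none
   String kindOf(Reader source) throws XMLStreamException {
      // A fresh stream reader is still needed for each document
      XMLStreamReader reader = factory.createXMLStreamReader(source);
      try {
         while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT) {
               // A null namespace URI matches by local name only
               return reader.getAttributeValue(null, "kind");
            }
         }
         return null;
      } finally {
         reader.close();
      }
   }

   public static void main(String[] args) throws XMLStreamException {
      KindScreener screener = new KindScreener();
      String[] documents = {
         "<doc kind=\"invoice\"><total>10</total></doc>",
         "<doc kind=\"order\"/>",
         "<doc/>"
      };
      // All three documents are screened with the same factory
      for (String xml : documents) {
         System.out.println(screener.kindOf(new StringReader(xml)));
      }
   }
}
```

Accepting a Reader instead of a file name leaves opening and closing the underlying source to the caller, which may be convenient when documents arrive from a message queue or socket rather than from disk.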


Next steps

This tip demonstrated the use of StAX parsers for screening and classification of XML documents. In the next tip, I will show how XML documents can be created through the StAX API.



Download

Name: x-tipstx3screening.zip
Size: 2KB
Download method: HTTP

