Tip: Parsing XML documents partially with StAX

Apply event filters and stream filters to StAX parsers

The Streaming API for XML (StAX) provides not only an XML parser that is fast, easy to use, and has a low memory footprint, but one that also provides a filter interface that allows programmers to hide unnecessary document detail from the application's business logic. This tip shows how to apply event filters and stream filters to StAX parsers. As with the first tip, I will demonstrate and explain this using both the iterator-style API and the cursor-based API.

Share:

Berthold Daum (berthold.daum@bdaum.de), President, BDaum Industrial Communications

Berthold Daum is a consultant and writer based in Lützelbach, Germany. For information on his recent books, System Architecture with XML and Modeling Business Objects with XML Schema (both from Morgan Kaufman) see http://www.bdaum.de. You can contact Berthold at berthold.daum@bdaum.de.



02 December 2003

When parsing an XML document, an XMLEventReader instance delivers event objects to the client application through its next() method -- one for each syntactical unit in the document. However, applications are not always interested in receiving all event classes; an application that only looks at XML elements and their attributes doesn't care about events that represent comments or processing instructions. Fortunately, StAX allows you to skip certain event classes by implementing an event filter.

Listing 1 shows an event filter that skips all XML processing instructions. These events are not passed to the event reader's hasNext(), next(), or peek() methods. To add a filter to a given event reader, you must construct a new reader. This is done with the factory method createFilteredReader(). This method accepts the original reader and an EventFilter as parameters. I will then use this new filtered event reader to parse the document.

Listing 1. Filtering XML events
import java.io.*;
import javax.xml.stream.*;
import javax.xml.stream.events.XMLEvent;

public class ParseFilteredByEvent {

   public static void main(String[] args)
      throws FileNotFoundException, XMLStreamException {
      // Use  reference implementation
      System.setProperty(
         "javax.xml.stream.XMLInputFactory",
         "com.bea.xml.stream.MXParserFactory");
      // Create the XML input factory
      XMLInputFactory factory = XMLInputFactory.newInstance();
      // Create event reader
      FileReader reader = new FileReader("somefile.xml");
      XMLEventReader eventReader = factory.createXMLEventReader(reader);
      // Create a filtered reader
      XMLEventReader filteredEventReader =
         factory.createFilteredReader(eventReader, new EventFilter() {
         public boolean accept(XMLEvent event) {
            // Exclude PIs
            return (!event.isProcessingInstruction());
         }
      });
      // Main event loop
      while (filteredEventReader.hasNext()) {
         XMLEvent e = filteredEventReader.next();
         System.out.println(e);
      }
   }
}

You can hide other event classes from the main application logic in the same way. You can even combine several EventFilters in a layered fashion by constructing filtered event readers on top of each other.

Hiding document branches

In the next example, I'll show a filter that skips a whole branch of an XML document. This time I'll be using the cursor-based API and a filtered stream reader instead of an event reader, as I have found that complex filters are best implemented as stream filters. Similar to the example above, a new filtered stream reader is constructed on top of a base stream reader:

Listing 2. Creating a filtered stream reader
      // Create stream reader
      XMLStreamReader xmlr =
         xmlif.createXMLStreamReader(new FileReader("somefile.xml"));

      // Create a filtered stream reader
      XMLStreamReader xmlfr = xmlif.createFilteredReader(xmlr, filter);

The StreamFilter used here in the second parameter is shown in Listing 3. It acts upon the start and end of XML elements and compares the name of the respective elements with a path segment. The path specifies which sections of the document should be skipped, and is implemented as a QName array. In this example, all elements in the path invoice/item will be skipped.

When implementing such a filter, you need to be aware of the fact that the filter's accept() method is called whenever a hasNext(), next(), or peek() method is invoked. Consequently, the accept() method may be called several times for the same event. Here, I made sure that the filter logic is only executed once for each event; it is only executed when the character position within the document has changed.

Listing 3. A stream filter
   // Exclusion path
   private static QName[] exclude = new QName[] { 
      new QName("invoice"), new QName("item")};

   private static StreamFilter filter = new StreamFilter() {
      // Element level
      int depth = -1;
      // Last matching path segment
      int match = -1;
      // Filter result
      boolean process = true;
      // Character position in document
      int currentPos = -1;
      
      public boolean accept(XMLStreamReader reader) {
         // Get character position
         Location loc = reader.getLocation();
         int pos = loc.getCharacterOffset();
         // Inhibit double execution
         if (pos != currentPos) {
            currentPos = pos;
            switch (reader.getEventType()) {
               case XMLStreamConstants.START_ELEMENT :
                  // Increment element depth
                  if (++depth < exclude.length && match == depth - 1) {
                     // Compare path segment with current element
                     if (reader.getName().equals(exclude[depth]))
                        // Equal - set segment pointer
                        match = depth;
                  }
                  // Process all elements not in path
                  process = match < exclude.length - 1;
                  break;
               // End of XML element
               case XMLStreamConstants.END_ELEMENT :
                  // Process all elements not in path
                  process = match < exclude.length - 1;
                  // Decrement element depth
                  if (--depth < match)
                     // Update segment pointer
                     match = depth;
                  break;
            }
         }
         return process;
      }
   };

Next steps

This tip demonstrated the use of filters in StAX parsers. In the next tip, I will show how these and other techniques can be used to screen XML documents efficiently.


Download

DescriptionNameSize
Code samplex-tipstx2filters.zip2KB

Resources

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into XML on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=XML
ArticleID=12349
ArticleTitle=Tip: Parsing XML documents partially with StAX
publish-date=12022003