Parsing XML documents partially with StAX

Apply event filters and stream filters to StAX parsers

When parsing an XML document, an XMLEventReader instance delivers event objects to the client application through its nextEvent() method (or next(), inherited from Iterator), one for each syntactical unit in the document. However, applications are not always interested in receiving every class of event; an application that only looks at XML elements and their attributes doesn't care about events that represent comments or processing instructions. Fortunately, StAX allows you to skip certain event classes by implementing an event filter.

Listing 1 shows an event filter that skips all XML processing instructions. The filtered reader never surfaces these events through its hasNext(), next(), nextEvent(), or peek() methods. To add a filter to a given event reader, you must construct a new reader with the factory method createFilteredReader(), which accepts the original reader and an EventFilter as parameters. I then use this new filtered event reader to parse the document.

Listing 1. Filtering XML events
import java.io.*;
import javax.xml.stream.*;
import javax.xml.stream.events.XMLEvent;

public class ParseFilteredByEvent {

   public static void main(String[] args)
      throws FileNotFoundException, XMLStreamException {
      // Use the reference implementation (it must be on the classpath;
      // omit this property to use the platform's default StAX parser)
      System.setProperty(
         "javax.xml.stream.XMLInputFactory",
         "com.bea.xml.stream.MXParserFactory");
      // Create the XML input factory
      XMLInputFactory factory = XMLInputFactory.newInstance();
      // Create event reader
      FileReader reader = new FileReader("somefile.xml");
      XMLEventReader eventReader = factory.createXMLEventReader(reader);
      // Create a filtered reader
      XMLEventReader filteredEventReader =
         factory.createFilteredReader(eventReader, new EventFilter() {
         public boolean accept(XMLEvent event) {
            // Exclude PIs
            return (!event.isProcessingInstruction());
         }
      });
      // Main event loop
      while (filteredEventReader.hasNext()) {
         // nextEvent() returns an XMLEvent; next() returns a plain Object
         XMLEvent e = filteredEventReader.nextEvent();
         System.out.println(e);
      }
   }
}

You can hide other event classes from the main application logic in the same way. You can even combine several EventFilters in a layered fashion by constructing filtered event readers on top of each other.
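The layering described above can be sketched as follows. This is a minimal illustration rather than part of the original listings: the class name LayeredFilterSketch, the helper filteredEventTypes(), and the sample document are assumptions made for the example, and the code relies on whatever StAX implementation XMLInputFactory.newInstance() finds instead of the reference implementation.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.EventFilter;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

class LayeredFilterSketch {

   // Hypothetical helper: returns the types of the events left after both filters
   static List<Integer> filteredEventTypes(String xml) throws XMLStreamException {
      XMLInputFactory factory = XMLInputFactory.newInstance();
      XMLEventReader base = factory.createXMLEventReader(new StringReader(xml));
      // First layer: hide processing instructions
      XMLEventReader withoutPIs =
         factory.createFilteredReader(base, new EventFilter() {
         public boolean accept(XMLEvent event) {
            return !event.isProcessingInstruction();
         }
      });
      // Second layer, stacked on the first: hide comments as well
      XMLEventReader withoutPIsOrComments =
         factory.createFilteredReader(withoutPIs, new EventFilter() {
         public boolean accept(XMLEvent event) {
            return event.getEventType() != XMLStreamConstants.COMMENT;
         }
      });
      // Main event loop: collect the type of every event that gets through
      List<Integer> types = new ArrayList<Integer>();
      while (withoutPIsOrComments.hasNext()) {
         types.add(withoutPIsOrComments.nextEvent().getEventType());
      }
      return types;
   }

   public static void main(String[] args) throws XMLStreamException {
      // Sample document (an assumption for this sketch)
      String xml = "<?xml version=\"1.0\"?><!-- note --><?render fast?>"
         + "<invoice><item>1</item></invoice>";
      System.out.println(filteredEventTypes(xml));
   }
}
```

Each layer stays small and does one job, at the cost of one extra accept() call per event and per layer. A single filter that tests both conditions would be an equally reasonable choice.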

Hiding document branches

In the next example, I'll show a filter that skips a whole branch of an XML document. This time I'll be using the cursor-based API and a filtered stream reader instead of an event reader, as I have found that complex filters are best implemented as stream filters. Similar to the example above, a new filtered stream reader is constructed on top of a base stream reader:

Listing 2. Creating a filtered stream reader
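      // Create the XML input factory
      XMLInputFactory xmlif = XMLInputFactory.newInstance();

      // Create stream reader
      XMLStreamReader xmlr =
         xmlif.createXMLStreamReader(new FileReader("somefile.xml"));

      // Create a filtered stream reader (the filter is defined in Listing 3)
      XMLStreamReader xmlfr = xmlif.createFilteredReader(xmlr, filter);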

The StreamFilter passed here as the second argument is shown in Listing 3. It acts upon the start and end of XML elements and compares the name of each element with a path segment. The path specifies which sections of the document should be skipped, and is implemented as a QName array. In this example, all elements in the path invoice/item will be skipped.

When implementing such a filter, be aware that the filter's accept() method is called whenever hasNext() or next() is invoked on the filtered reader (or peek(), in the case of a filtered event reader). Consequently, accept() may be called several times for the same event. Because the filter in Listing 3 keeps state between calls, I made sure that its logic runs only once per event: it runs only when the character position within the document has changed.
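How often accept() is really called depends on the StAX implementation in use. The following sketch is one way to find out for a given parser. It is an illustration only: the class name AcceptCallCounter, the helper count(), and the sample document are assumptions, and the filter accepts every event so that only the call count is measured.

```java
import java.io.StringReader;
import javax.xml.stream.StreamFilter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class AcceptCallCounter {

   // Hypothetical helper: returns { accept() calls, events delivered by next() }
   static int[] count(String xml) throws XMLStreamException {
      XMLInputFactory factory = XMLInputFactory.newInstance();
      XMLStreamReader base =
         factory.createXMLStreamReader(new StringReader(xml));
      final int[] acceptCalls = { 0 };
      // A filter that accepts everything and only counts how often it is asked
      StreamFilter countingFilter = new StreamFilter() {
         public boolean accept(XMLStreamReader reader) {
            acceptCalls[0]++;
            return true;
         }
      };
      XMLStreamReader filtered =
         factory.createFilteredReader(base, countingFilter);
      // Main loop: count the events the filtered reader hands out
      int delivered = 0;
      while (filtered.hasNext()) {
         filtered.next();
         delivered++;
      }
      return new int[] { acceptCalls[0], delivered };
   }

   public static void main(String[] args) throws XMLStreamException {
      int[] result = count("<invoice><item>1</item></invoice>");
      System.out.println("accept() calls: " + result[0]
         + ", events delivered: " + result[1]);
   }
}
```

If the first number is larger than the second, the parser consults the filter more than once for at least some events, and a stateful filter needs a guard such as a position check.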

Listing 3. A stream filter
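   // Requires: import javax.xml.namespace.QName;
   //           import javax.xml.stream.*;

   // Exclusion path (names without a namespace; elements that are
   // in a namespace will not match these QNames)
   private static final QName[] exclude = new QName[] {
      new QName("invoice"), new QName("item")};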

   private static StreamFilter filter = new StreamFilter() {
      // Element level
      int depth = -1;
      // Last matching path segment
      int match = -1;
      // Filter result
      boolean process = true;
      // Character position in document
      int currentPos = -1;
      
      public boolean accept(XMLStreamReader reader) {
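         // Get character position (this relies on the parser reporting
         // a real offset; getCharacterOffset() returns -1 if none is available)
         Location loc = reader.getLocation();
         int pos = loc.getCharacterOffset();
         // Inhibit double execution
         if (pos != currentPos) {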
            currentPos = pos;
            switch (reader.getEventType()) {
               case XMLStreamConstants.START_ELEMENT :
                  // Increment element depth
                  if (++depth < exclude.length && match == depth - 1) {
                     // Compare path segment with current element
                     if (reader.getName().equals(exclude[depth]))
                        // Equal - set segment pointer
                        match = depth;
                  }
                  // Process all elements not in path
                  process = match < exclude.length - 1;
                  break;
               // End of XML element
               case XMLStreamConstants.END_ELEMENT :
                  // Process all elements not in path
                  process = match < exclude.length - 1;
                  // Decrement element depth
                  if (--depth < match)
                     // Update segment pointer
                     match = depth;
                  break;
            }
         }
         return process;
      }
   };
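One way to check which elements a path filter like this lets through is to drive it by hand against a small document. The sketch below is a simplified variant, not the listing itself: the names BranchFilterSketch, PathFilter, and visibleElements(), as well as the sample invoice, are assumptions made for the example. It calls accept() exactly once per event on an unfiltered reader, so the position check is left out.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.namespace.QName;
import javax.xml.stream.StreamFilter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class BranchFilterSketch {

   // Same path-matching idea as the stream filter above, minus the position check
   static final class PathFilter implements StreamFilter {
      private final QName[] exclude;
      private int depth = -1;          // Element level
      private int match = -1;          // Last matching path segment
      private boolean process = true;  // Filter result

      PathFilter(QName... exclude) {
         this.exclude = exclude;
      }

      public boolean accept(XMLStreamReader reader) {
         switch (reader.getEventType()) {
            case XMLStreamConstants.START_ELEMENT:
               // One level deeper; extend the match if the name fits the path
               if (++depth < exclude.length && match == depth - 1
                     && reader.getName().equals(exclude[depth])) {
                  match = depth;
               }
               process = match < exclude.length - 1;
               break;
            case XMLStreamConstants.END_ELEMENT:
               process = match < exclude.length - 1;
               // One level up; shorten the match if we left a matched segment
               if (--depth < match) {
                  match = depth;
               }
               break;
            default:
               // Other events inherit the result of the last element event
               break;
         }
         return process;
      }
   }

   // Hypothetical helper: local names of the start elements the filter accepts
   static List<String> visibleElements(String xml, QName... exclude)
         throws XMLStreamException {
      XMLStreamReader reader = XMLInputFactory.newInstance()
         .createXMLStreamReader(new StringReader(xml));
      StreamFilter filter = new PathFilter(exclude);
      List<String> names = new ArrayList<String>();
      while (true) {
         // Ask the filter exactly once about the current event
         if (filter.accept(reader) && reader.isStartElement()) {
            names.add(reader.getLocalName());
         }
         if (!reader.hasNext()) {
            break;
         }
         reader.next();
      }
      return names;
   }

   public static void main(String[] args) throws XMLStreamException {
      // Sample document (an assumption for this sketch)
      String xml = "<invoice><customer>ACME</customer>"
         + "<item><sku>1</sku></item><total>9</total></invoice>";
      System.out.println(
         visibleElements(xml, new QName("invoice"), new QName("item")));
   }
}
```

For the sample invoice, the item element and its sku child should be absent from the result, while invoice, customer, and total remain. Testing the filter this way keeps the path logic separate from how a particular parser schedules its accept() calls.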

Next steps

This tip demonstrated the use of filters in StAX parsers. In the next tip, I will show how these and other techniques can be used to screen XML documents efficiently.


ArticleTitle=Tip: Parsing XML documents partially with StAX
publish-date=12022003