
Tip: Parsing XML documents partially with StAX

Apply event filters and stream filters to StAX parsers


Level: Intermediate

Berthold Daum (berthold.daum@bdaum.de), President, BDaum Industrial Communications

02 Dec 2003

The Streaming API for XML (StAX), introduced in the previous tip, provides an XML parser that is fast, easy to use, and light on memory. It also provides a filter interface that lets programmers hide unnecessary document detail from the application's business logic. This tip shows how to apply event filters and stream filters to StAX parsers. As with the first tip, I will demonstrate and explain this using both the iterator-style API and the cursor-based API.

When parsing an XML document, an XMLEventReader instance delivers event objects to the client application through its next() and nextEvent() methods, one for each syntactical unit in the document. However, applications are not always interested in every event type: an application that only looks at XML elements and their attributes does not care about events that represent comments or processing instructions. Fortunately, StAX lets you skip selected event types by implementing an event filter.

Listing 1 shows an event filter that skips all XML processing instructions. The filtered reader never returns these events from next(), nextEvent(), or peek(), and hasNext() does not count them. To add a filter to a given event reader, you must construct a new reader with the factory method createFilteredReader(). This method accepts the original reader and an EventFilter as parameters. The new filtered event reader is then used to parse the document.


Listing 1. Filtering XML events
import java.io.*;
import javax.xml.stream.*;
import javax.xml.stream.events.XMLEvent;

public class ParseFilteredByEvent {

   public static void main(String[] args)
      throws FileNotFoundException, XMLStreamException {
      // Optional: select the reference implementation, which must then be
      // on the class path. Remove this call to use the default implementation.
      System.setProperty(
         "javax.xml.stream.XMLInputFactory",
         "com.bea.xml.stream.MXParserFactory");
      // Create the XML input factory
      XMLInputFactory factory = XMLInputFactory.newInstance();
      // Create event reader
      FileReader reader = new FileReader("somefile.xml");
      XMLEventReader eventReader = factory.createXMLEventReader(reader);
      // Create a filtered reader
      XMLEventReader filteredEventReader =
         factory.createFilteredReader(eventReader, new EventFilter() {
         public boolean accept(XMLEvent event) {
            // Exclude PIs
            return (!event.isProcessingInstruction());
         }
      });
      // Main event loop
      while (filteredEventReader.hasNext()) {
         // nextEvent() returns a typed XMLEvent; next() returns Object
         XMLEvent e = filteredEventReader.nextEvent();
         System.out.println(e);
      }
   }
}

You can hide other event classes from the main application logic in the same way. You can even combine several EventFilters in a layered fashion by constructing filtered event readers on top of each other.
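
The layering described above can be sketched as follows. This is a minimal illustration rather than part of the sample code: the class name LayeredFilters, the helper visibleEventTypes(), and the inline test document are all assumptions made for the example, and it runs against whichever StAX implementation is the default on your class path. Each call to createFilteredReader() wraps the reader returned by the previous call, so an event reaches the application only if every filter accepts it.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.EventFilter;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;

// Hypothetical class name, used only for this illustration
class LayeredFilters {

   // Parses the XML text through two stacked filters and returns the
   // type codes of the events that reach the application.
   static List<Integer> visibleEventTypes(String xml) throws XMLStreamException {
      XMLInputFactory factory = XMLInputFactory.newInstance();
      XMLEventReader base = factory.createXMLEventReader(new StringReader(xml));

      // First layer: hide processing instructions
      EventFilter noPIs = event -> !event.isProcessingInstruction();
      XMLEventReader withoutPIs = factory.createFilteredReader(base, noPIs);

      // Second layer, built on top of the first: hide comments
      EventFilter noComments =
         event -> event.getEventType() != XMLStreamConstants.COMMENT;
      XMLEventReader filtered =
         factory.createFilteredReader(withoutPIs, noComments);

      List<Integer> types = new ArrayList<>();
      while (filtered.hasNext()) {
         types.add(filtered.nextEvent().getEventType());
      }
      filtered.close();
      return types;
   }

   public static void main(String[] args) throws XMLStreamException {
      String xml = "<root><!-- note --><?target data?><a>text</a></root>";
      // Prints the numeric XMLStreamConstants codes of the visible events
      System.out.println(visibleEventTypes(xml));
   }
}
```

Because both filters here are stateless, it does not matter how often the reader calls accept() for the same event. A single filter that tests both conditions would work equally well; layering is mainly useful when the filters come from different parts of an application.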





Hiding document branches

In the next example, I'll show a filter that skips a whole branch of an XML document. This time I'll be using the cursor-based API and a filtered stream reader instead of an event reader, as I have found that complex filters are best implemented as stream filters. Similar to the example above, a new filtered stream reader is constructed on top of a base stream reader:


Listing 2. Creating a filtered stream reader
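      // Create the XML input factory
      XMLInputFactory xmlif = XMLInputFactory.newInstance();

      // Create stream reader
      XMLStreamReader xmlr =
         xmlif.createXMLStreamReader(new FileReader("somefile.xml"));

      // Create a filtered stream reader (filter is defined in Listing 3)
      XMLStreamReader xmlfr = xmlif.createFilteredReader(xmlr, filter);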
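
To show the same call in a complete program, here is a minimal sketch with a deliberately simple, stateless StreamFilter. The class name ElementsOnly, the constant TAGS_ONLY, the helper describe(), and the inline document are assumptions made for this illustration and are not part of the sample code. The filter accepts the document boundary events so that the usual hasNext()/next() loop behaves as it does for an unfiltered reader; how a rejected initial event is handled may differ between implementations.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.StreamFilter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical class name, used only for this illustration
class ElementsOnly {

   // Accepts document boundaries and element tags; rejects text,
   // comments, processing instructions, and everything else
   static final StreamFilter TAGS_ONLY = reader -> {
      switch (reader.getEventType()) {
         case XMLStreamConstants.START_DOCUMENT:
         case XMLStreamConstants.END_DOCUMENT:
         case XMLStreamConstants.START_ELEMENT:
         case XMLStreamConstants.END_ELEMENT:
            return true;
         default:
            return false;
      }
   };

   // Returns a readable trace of the events the filtered reader delivers
   static List<String> describe(String xml) throws XMLStreamException {
      XMLInputFactory xmlif = XMLInputFactory.newInstance();
      XMLStreamReader xmlr = xmlif.createXMLStreamReader(new StringReader(xml));
      XMLStreamReader xmlfr = xmlif.createFilteredReader(xmlr, TAGS_ONLY);

      List<String> seen = new ArrayList<>();
      while (xmlfr.hasNext()) {
         int type = xmlfr.next();
         if (type == XMLStreamConstants.START_ELEMENT) {
            seen.add("<" + xmlfr.getLocalName() + ">");
         } else if (type == XMLStreamConstants.END_ELEMENT) {
            seen.add("</" + xmlfr.getLocalName() + ">");
         } else {
            seen.add("event " + type);
         }
      }
      xmlfr.close();
      return seen;
   }

   public static void main(String[] args) throws XMLStreamException {
      String xml = "<invoice><!-- c --><item>pen</item></invoice>";
      System.out.println(describe(xml));
   }
}
```

A filter like this needs no bookkeeping because its decision depends only on the reader's current event. Skipping a whole branch is different: the filter has to remember where it is in the document, which is what Listing 3 does.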

The StreamFilter passed as the second argument in Listing 2 is shown in Listing 3. It reacts to the start and end of XML elements and compares each element's name with the corresponding segment of an exclusion path. The path specifies which branch of the document to skip and is implemented as a QName array. In this example, every item element that is a child of the root element invoice is skipped, together with everything inside it.

When implementing a stateful filter such as the one in Listing 3, be aware that the reader may call the filter's accept() method from hasNext() as well as from next() (and, for event readers, from peek()). Consequently, accept() may be called several times for the same event. Here, I made sure that the filter logic is executed only once for each event: it runs only when the character position within the document has changed. This relies on the parser reporting positions; Location.getCharacterOffset() returns -1 when no offset is available.


Listing 3. A stream filter
   // Exclusion path
   private static QName[] exclude = new QName[] { 
      new QName("invoice"), new QName("item")};

   private static StreamFilter filter = new StreamFilter() {
      // Element level
      int depth = -1;
      // Last matching path segment
      int match = -1;
      // Filter result
      boolean process = true;
      // Character position in document
      int currentPos = -1;
      
      public boolean accept(XMLStreamReader reader) {
         // Get character position
         Location loc = reader.getLocation();
         int pos = loc.getCharacterOffset();
         // Inhibit double execution
         if (pos != currentPos) {
            currentPos = pos;
            switch (reader.getEventType()) {
               case XMLStreamConstants.START_ELEMENT :
                  // Increment element depth
                  if (++depth < exclude.length && match == depth - 1) {
                     // Compare path segment with current element
                     if (reader.getName().equals(exclude[depth]))
                        // Equal - set segment pointer
                        match = depth;
                  }
                  // Process all elements not in path
                  process = match < exclude.length - 1;
                  break;
               // End of XML element
               case XMLStreamConstants.END_ELEMENT :
                  // Process all elements not in path
                  process = match < exclude.length - 1;
                  // Decrement element depth
                  if (--depth < match)
                     // Update segment pointer
                     match = depth;
                  break;
            }
         }
         return process;
      }
   };
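
How often accept() is actually invoked depends on the StAX implementation in use. The following probe is a rough sketch for finding out on your own setup; the class name AcceptCallProbe and the helper probe() are assumptions made for this illustration. It installs a filter that accepts every event and counts its own invocations, then compares that count with the number of events the filtered reader delivered.

```java
import java.io.StringReader;
import java.util.concurrent.atomic.AtomicInteger;
import javax.xml.stream.StreamFilter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical class name, used only for this illustration
class AcceptCallProbe {

   // Returns { number of accept() calls, number of events delivered by next() }
   static int[] probe(String xml) throws XMLStreamException {
      XMLInputFactory xmlif = XMLInputFactory.newInstance();
      XMLStreamReader xmlr = xmlif.createXMLStreamReader(new StringReader(xml));

      // Filter that hides nothing and only counts how often it is consulted
      AtomicInteger calls = new AtomicInteger();
      StreamFilter countingFilter = reader -> {
         calls.incrementAndGet();
         return true;
      };
      XMLStreamReader xmlfr = xmlif.createFilteredReader(xmlr, countingFilter);

      int delivered = 0;
      while (xmlfr.hasNext()) {
         xmlfr.next();
         delivered++;
      }
      xmlfr.close();
      return new int[] { calls.get(), delivered };
   }

   public static void main(String[] args) throws XMLStreamException {
      int[] result = probe("<invoice><item>pen</item><total>9</total></invoice>");
      System.out.println("accept() calls:   " + result[0]);
      System.out.println("events delivered: " + result[1]);
   }
}
```

If the first number is larger than the second, the implementation consults the filter more than once per event (or for events that next() never returns, such as the initial document start), and a stateful filter needs a guard like the position check in Listing 3.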





Next steps

This tip demonstrated the use of filters in StAX parsers. In the next tip, I will show how these and other techniques can be used to screen XML documents efficiently.






Download

Name                  Size  Download method
x-tipstx2filters.zip  2KB   HTTP





About the author

Berthold Daum is a consultant and writer based in Lützelbach, Germany. For information on his recent books, System Architecture with XML and Modeling Business Objects with XML Schema (both from Morgan Kaufmann), see http://www.bdaum.de. You can contact Berthold at berthold.daum@bdaum.de.



