The screening or classification of XML documents is a common problem, especially in XML middleware. Routing XML documents to specific processors may require analysis of both the document type and the document content. The problem here is obtaining the required information from the document with the least possible overhead. Traditional parsers such as DOM or SAX are not well suited to this task. DOM, for example, parses the whole document and constructs a complete document tree in memory before it returns control to the client. Even DOM parsers that employ deferred node expansion, and thus are able to parse a document partially, have high resource demands because the document tree must be at least partially constructed in memory. This is simply not acceptable for screening purposes.
Like DOM, SAX parsers control the complete parsing process. By default, a SAX parser starts parsing at the beginning of a document and continues until the end. Client event handlers are informed through callbacks about the events during this parsing process. To avoid unnecessary overhead during document screening, such an event handler may want to stop the parsing process once it has gathered the required information. A common technique for achieving this in SAX is throwing an exception, which is discussed in the developerWorks tip "Stop a SAX parser when you have enough data" by Nicholas Chase. This will cause SAX to stop the parsing process. The information gathered by the event handler must be encoded in an error message that's wrapped in an exception object and posted to the parser's client. A special error handler in the client receives this exception and must parse the parser's error message to retrieve the required information! This may be a solution to the screening problem, but it's a complicated one.
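To make the workaround concrete, here is a minimal sketch of the exception technique using the JAXP SAX API. The class names `SaxScreeningSketch`, `StopParsingException`, and `KindHandler` are hypothetical and do not come from the tip referenced above. For brevity the sketch catches the exception directly in the calling method rather than in a separate error handler, and it assumes the result is the value of a `kind` attribute on the root element.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class SaxScreeningSketch {

    // Hypothetical marker exception; the result travels in its message
    static class StopParsingException extends SAXException {
        StopParsingException(String kind) {
            super(kind);
        }
    }

    // Aborts parsing at the first start tag, which is the root element
    static class KindHandler extends DefaultHandler {
        @Override
        public void startElement(String uri, String localName,
                String qName, Attributes atts) throws SAXException {
            // Look up the attribute by its qualified name
            throw new StopParsingException(atts.getValue("kind"));
        }
    }

    // Returns the root element's "kind" attribute, or null if it is absent
    static String kindOf(String xml) throws Exception {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)),
                       new KindHandler());
        } catch (StopParsingException e) {
            // The client has to dig the result out of the exception
            return e.getMessage();
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(kindOf("<order kind=\"invoice\"><item/></order>"));
    }
}
```

Even in this reduced form, the control flow runs through an exception and the result has to be recovered from the exception message, which is the awkwardness described above.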
StAX offers a pull parser that gives client applications full control over the parsing process. A client application may decide at any time to discontinue the parsing process, and no tricks are required to stop the parser. This is ideal for screening purposes.
Listing 1 shows what a simple document classifier might look like. I use the cursor-based StAX API for this example. At the very first start tag of the document (the root element tag), I retrieve the kind attribute from this element. The value of this attribute is then passed back to the client and the parsing process is discontinued. The client may now act upon this returned value.
Listing 1. Screening documents
import java.io.*;
import javax.xml.stream.*;

public class Classifier {

    // Holds factory instance
    private XMLInputFactory xmlif;

    public static void main(String[] args)
            throws IOException, XMLStreamException {
        Classifier router = new Classifier();
        String kind1 = router.getKind("somefile.xml");
        String kind2 = router.getKind("otherfile.xml");
    }

    /**
     * Return the document kind
     * @param filename - the name of the XML file to screen
     * @return the value of the "kind" attribute of the root element,
     *         or null if the root element has no such attribute
     */
    private String getKind(String filename)
            throws IOException, XMLStreamException {
        // Create input factory lazily
        if (xmlif == null) {
            // Use reference implementation (must be on the classpath)
            System.setProperty(
                "javax.xml.stream.XMLInputFactory",
                "com.bea.xml.stream.MXParserFactory");
            xmlif = XMLInputFactory.newInstance();
        }
        Reader in = new FileReader(filename);
        try {
            // Create stream reader
            XMLStreamReader xmlr = xmlif.createXMLStreamReader(in);
            // Main event loop
            while (xmlr.hasNext()) {
                // Process single event
                switch (xmlr.getEventType()) {
                    // Process start tags
                    case XMLStreamReader.START_ELEMENT :
                        // Check attributes for first start tag
                        for (int i = 0; i < xmlr.getAttributeCount(); i++) {
                            // Get attribute name (local part only)
                            String localName = xmlr.getAttributeLocalName(i);
                            if (localName.equals("kind")) {
                                // Return value
                                return xmlr.getAttributeValue(i);
                            }
                        }
                        // Root element has no "kind" attribute
                        return null;
                }
                // Move to next event
                xmlr.next();
            }
            return null;
        } finally {
            // Release the file even though parsing stops early
            in.close();
        }
    }
}
Note that I use an instance field to hold the XMLInputFactory instance. This is done to improve efficiency. Compared to the actual parsing process (which is blazingly fast), executing XMLInputFactory.newInstance() and xmlif.createXMLStreamReader() causes considerable overhead. While createXMLStreamReader() must be executed once for each new document, you can reuse the XMLInputFactory instance and thus avoid calling XMLInputFactory.newInstance() repeatedly.
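The reuse pattern can be sketched in isolation as follows. This is a minimal illustration and not part of the downloadable source: the class name `ReusedFactorySketch` and the in-memory sample documents are assumptions, and the sketch relies on whatever StAX implementation `XMLInputFactory.newInstance()` finds by default. The factory is created once, while a new reader is created (and closed) for every document.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class ReusedFactorySketch {

    // Created once and shared by all calls to kindOf()
    private static final XMLInputFactory FACTORY =
        XMLInputFactory.newInstance();

    // Returns the root element's "kind" attribute, or null if it is absent
    static String kindOf(String xml) throws XMLStreamException {
        // A new reader is still required for each document
        XMLStreamReader reader =
            FACTORY.createXMLStreamReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    // First start tag is the root element; stop here.
                    // A null namespace URI means the namespace is not checked.
                    return reader.getAttributeValue(null, "kind");
                }
            }
            return null;
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        // Hypothetical sample documents
        String[] docs = {
            "<order kind=\"invoice\"><item/></order>",
            "<?xml version=\"1.0\"?><report kind=\"summary\"/>"
        };
        for (String doc : docs) {
            System.out.println(kindOf(doc));
        }
    }
}
```

If a classifier like this is called from several threads, it is worth checking whether the StAX implementation in use allows a factory to be shared, since the API documentation does not state this explicitly.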
This tip demonstrated the use of StAX parsers for screening and classification of XML documents. In the next tip, I will show how XML documents can be created through the StAX API.
| Name | Size | Download method |
|---|---|---|
| x-tipstx3screening.zip | 2KB | HTTP |
- Download the source files for this tip.
- Get more information on the Streaming API for XML (StAX) at the Java Community Process site.
- Learn how to apply event filters and stream filters to StAX parsers in the second in this series of StAX tips, "Tip: Parsing XML documents partially with StAX" (December 2003).
- Find out how to stop a SAX parser midway through a document without losing the data already collected, in this tip by Nicholas Chase (developerWorks, June 2002).
- Find more XML resources on the developerWorks XML zone. For a complete list of XML tips to date, check out the tips summary page.
- Find out how you can become an IBM Certified Developer in XML and related technologies.
Berthold Daum is a consultant and writer based in Lützelbach, Germany. For information on his recent books, System Architecture with XML and Modeling Business Objects with XML Schema (both from Morgan Kaufmann), see http://www.bdaum.de. You can contact Berthold at berthold.daum@bdaum.de.



