
The Geronimo renegade: Using integrated packages: Codehaus' Woodstox

Michael Galpin (mike.sr@gmail.com), Developer, Adomo, Inc.
Michael Galpin has been developing Java software professionally since 1998. He currently works at Adomo, Inc., a start-up in Mountain View, CA. He holds a degree in mathematics from the California Institute of Technology.

Summary:  An XML parser is often the key to a high-performance, robust application. Traditional XML parsing techniques include Document Object Model (DOM) and Simple API for XML (SAX). Now there’s an innovative new parsing technique called Streaming API for XML (StAX) that’s so beneficial it’s integrated with the Java™ Platform, Enterprise Edition (Java EE) 5 specification. Apache Geronimo 2.0, a full implementation of Java EE 5, includes a StAX parser — Codehaus' Woodstox. In this installment, learn the benefits of StAX and why the Geronimo team chose Woodstox as the StAX parser.

Date:  24 Jul 2007
Level:  Intermediate

The importance of XML

XML was introduced in 1996 by Tim Bray and Michael Sperberg-McQueen. Its potential was widely recognized, but it's hard to imagine that anyone back then could know what an essential technology XML would become. Enterprise Java developers use XML for configuration, as a data store, and most commonly as a format for data exchange. It's the foundation for Web services and SOAP, and thus for the modern Service-Oriented Architecture (SOA) design pattern. But XML doesn't stop there. It also puts the X in Ajax (Asynchronous JavaScript + XML) and is key to the richer-than-ever experiences delivered by modern Web applications.

XML isn't exactly a panacea, though; there's a dark side to it. XML documents tend to be large. They share a general tree structure, but because XML is extensible, their schemas can vary tremendously from one document to the next. Both aspects make it hard to parse XML efficiently. There have been two traditional approaches to the challenge of XML parsing: DOM and SAX.

XML processing: DOM and SAX

DOM and SAX are the two classic strategies for parsing XML. They are, in many ways, polar opposite strategies. DOM provides a straightforward object model for XML documents. A DOM parser turns an XML document into an easy-to-use object representing all the data from the XML document. However, there's a price to pay for so faithfully representing an XML document: DOM parsing tends to be memory intensive.

Memory isn't a problem for SAX. SAX parsers produce a series of parsing events. It's up to a handler to register callbacks for these events and then perform some kind of logic on the data from these events. It's fast and efficient but requires a complicated programming model.

The easiest way to understand the differences between using DOM and SAX — and therefore the motivations and benefits of StAX — is to look at a specific example.


Parsing example using Flickr

It's not hard to find some XML to parse. It's used everywhere. Most Web sites these days offer some kind of XML-based Web service. Flickr is a popular photo sharing site owned by Yahoo that has a powerful and flexible API. Let's take a look at some simple code for accessing Flickr's "interesting" photos. (See Downloads for all of the source code used in this article, and make sure you either put Woodstox in your class path or use JDK 1.6.) The code is shown in Listing 1.


Listing 1. Using the Flickr API
String apiKey = "c4579586f41a90372f762cb65c78be5d";
String urlStr = "http://api.flickr.com/services/rest/?" +
     "method=flickr.interestingness.getList&per_page=20&api_key=" + apiKey;
URL request = new URL(urlStr);
InputStream input = request.openStream();

This code uses Flickr's Representational State Transfer (REST) API. (See the Resources section for more about Flickr's APIs and the REST format.) Some sample output from the above call is shown in Listing 2.


Listing 2. XML from Flickr
                <?xml version="1.0" encoding="utf-8" ?>
<rsp stat="ok">
<photos page="1" pages="25" per_page="20" total="500">
     <photo id="469774979" owner="35373726@N00" secret="c8a1be2012" server="183" 
farm="1" title="Where will it lead me......?" ispublic="1" isfriend="0" 
isfamily="0" />
     <photo id="470281793" owner="73955226@N00" secret="49612a2794" server="212" 
farm="1" title="Island Beauty" ispublic="1" isfriend="0" isfamily="0" />
     <photo id="469808244" owner="43568064@N00" secret="26b71544a3" server="227" 
farm="1" title="" ispublic="1" isfriend="0" isfamily="0" />
</photos>
</rsp>

Note that Listing 2 shows only three photos. The API call would actually return 20 (the value of the per_page parameter in the URL string). The results are pretty straightforward, so take a look at how to parse this XML. In the example, you parse out the title of each photo and its ID. The ID can be used to create the URL for the photo, so it's not hard to imagine a Web application (perhaps a mashup) using just this information. First you use DOM to extract this data.


DOM example

To use DOM, you parse the document into a document object. This is an in-memory tree structure representing the XML document that was parsed. You then walk the DOM tree looking for the title and ID of each photo. Put this data into a simple map. The code for doing this is shown in Listing 3.


Listing 3. Parsing with DOM
Map<String,String> map = new HashMap<String,String>();
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document dom = builder.parse(input);
Element root = dom.getDocumentElement();
// Find the photos element among the children of the root rsp element
NodeList childNodes = root.getChildNodes();
Node photosNode = null;
for (int i=0;i<childNodes.getLength();i++){
     Node node = childNodes.item(i);
     if (node.getNodeName().equalsIgnoreCase("photos")){
          photosNode = node;
          break;
     }
}
// The response may lack a photos element (for example, on an error), so guard against null
if (photosNode != null){
     childNodes = photosNode.getChildNodes();
     for (int i=0;i<childNodes.getLength();i++){
          Node node = childNodes.item(i);
          if (node.getNodeName().equalsIgnoreCase("photo")){
               // Store each photo's title, keyed by its ID
               String title = node.getAttributes().getNamedItem("title").getTextContent();
               String id = node.getAttributes().getNamedItem("id").getTextContent();
               map.put(id,title);
          }
     }
}

DOM is popular because it's so easy to use. You just pass in your input source to the parser, and it gives you a document object. You can then go through the child nodes of the root until you find the photos node. Each photo node is a child of the photos node, so you go through each of its children. Then you access the title and id attributes of each photo node and store them in your map.

However, there are some obvious inefficiencies with DOM. You're storing lots of data that you might not care about, such as the owner of each photo. You're also reading through all the data twice: once for reading it into the document object, then again when walking through the document object. The traditional way to avoid these inefficiencies was to use SAX.
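The manual tree walk in Listing 3 can be shortened without leaving DOM. The following is a minimal sketch, not part of the article's download: it uses the standard getElementsByTagName method to collect every photo element directly, and it parses a trimmed copy of the Listing 2 response held in a string instead of calling Flickr. The class name DomPhotoTitles, the SAMPLE constant, and the method names are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper class, for illustration only
class DomPhotoTitles {
    // Trimmed copy of the Listing 2 response, used in place of a live Flickr call
    static final String SAMPLE =
        "<rsp stat=\"ok\"><photos page=\"1\" pages=\"25\" per_page=\"20\" total=\"500\">"
        + "<photo id=\"469774979\" title=\"Where will it lead me......?\"/>"
        + "<photo id=\"470281793\" title=\"Island Beauty\"/>"
        + "<photo id=\"469808244\" title=\"\"/>"
        + "</photos></rsp>";

    // Returns photo titles keyed by photo ID, in document order
    static Map<String, String> parse(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document dom = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> titlesById = new LinkedHashMap<String, String>();
            // Collects all photo elements anywhere in the tree
            NodeList photos = dom.getElementsByTagName("photo");
            for (int i = 0; i < photos.getLength(); i++) {
                Element photo = (Element) photos.item(i);
                titlesById.put(photo.getAttribute("id"), photo.getAttribute("title"));
            }
            return titlesById;
        } catch (Exception e) {
            // Wrap the checked parser exceptions to keep the sketch short
            throw new IllegalStateException("Could not parse photo list", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parse(SAMPLE));
    }
}
```

This is shorter to write, but the memory cost is unchanged: the whole document is still built in memory before the first title is read.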


SAX example

A SAX parser doesn't give back a nice document object like a DOM parser. Instead it gives a series of events as it rips through the XML document. A handler has to be created for these events by either implementing an interface or extending the DefaultHandler class and overriding its methods as needed. Listing 4 demonstrates a SAX parsing of the Flickr XML document.


Listing 4. Parsing with SAX
SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
DefaultHandler handler = new DefaultHandler(){
     @Override
     public void startElement(String uri, String localName, 
     String qName, Attributes attributes) throws SAXException {
          if (qName.equalsIgnoreCase("photo")){
               String title = attributes.getValue("title");
               String id = attributes.getValue("id");
               // map is static so we can access it here
               map.put(id, title);
          }
     }
};
parser.parse(input, handler);

The code shown in Listing 4 is definitely a little harder to understand than the DOM code you saw in Listing 3. You needed a ContentHandler to handle the SAX events, so you created a DefaultHandler and overrode its startElement callback method. You checked to see if it was a photo element, and if so, you accessed its title and id attributes.

The code is fairly succinct and is very efficient when it runs. It stores only the data you care about, and you only pass through the document once. However, it's more complicated code, because you have to extend a class to register an event listener. It would be nice to be able to parse the XML efficiently, but with a more intuitive programming model. That's where StAX comes in.
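One way to see where the extra complexity comes from is to write the handler as a named class that owns its own results, rather than relying on a static map. The sketch below makes that assumption explicit: SaxPhotoTitles and PhotoHandler are hypothetical names, and the XML is passed in as a string rather than read from Flickr.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, for illustration only
class SaxPhotoTitles {
    // The handler keeps its own state because the parser drives the control flow
    static class PhotoHandler extends DefaultHandler {
        final Map<String, String> titlesById = new LinkedHashMap<String, String>();

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes attributes) {
            // The default factory is not namespace aware, so match on qName
            if ("photo".equals(qName)) {
                titlesById.put(attributes.getValue("id"), attributes.getValue("title"));
            }
        }
    }

    // Returns photo titles keyed by photo ID, in document order
    static Map<String, String> parse(String xml) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            PhotoHandler handler = new PhotoHandler();
            parser.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
            return handler.titlesById;
        } catch (Exception e) {
            // Wrap the checked parser exceptions to keep the sketch short
            throw new IllegalStateException("Could not parse photo list", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<rsp stat=\"ok\"><photos>"
            + "<photo id=\"470281793\" title=\"Island Beauty\"/>"
            + "</photos></rsp>";
        System.out.println(parse(xml));
    }
}
```

Note that the results only become available after parse returns. Any state that has to survive between callbacks must live in fields of the handler, which is the part of the push model that tends to grow awkward as the parsing logic gets richer.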


The StAX alternative

The complexity in SAX comes from the Observer design pattern it implements. It's a push model, in that the parser pushes events to observers who then act on the events. The StAX model is similar to SAX. It streams data and events from the XML document, allowing it to be fast and efficient like SAX. The big difference is that it uses a pull model. This allows the application code to pull events from the parser.

This may sound like a subtle difference, but it allows for a much simpler programming model. Take a look at Listing 5 to see StAX in action.


Listing 5. Parsing with StAX
Map<String,String> map = new HashMap<String,String>();
XMLInputFactory inputFactory = XMLInputFactory.newInstance();
QName qId = new QName("id");
QName qTitle = new QName("title");
QName qPhoto = new QName("photo");
XMLEventReader reader = inputFactory.createXMLEventReader(input);
while (reader.hasNext()){
     XMLEvent event = reader.nextEvent();
     if (event.isStartElement()){
          StartElement element = event.asStartElement();
          if (element.getName().equals(qPhoto)){
               String id = element.getAttributeByName(qId).getValue();
               String title = element.getAttributeByName(qTitle).getValue();
               map.put(id,title);
          }
     }
}
reader.close();

First of all, you didn't have to extend any classes. That's because you don't need to register for events. With StAX, you control the flow of events, because you pull them from the parser. You're able to use a familiar iterator-like syntax to search through the document to find the data you want. You're still storing only the data you want, and you only have to go through the XML document once. You get the same efficiencies as with SAX, but the code is far more intuitive.
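Because the application drives the loop, it can also decide when to stop. The following sketch illustrates that point under some assumptions: StaxFirstPhotos and firstPhotos are hypothetical names, the limit parameter is not part of the article's code, and the XML comes from a string instead of a Flickr call. The loop simply stops pulling events once enough photos have been collected, so the rest of the document is never examined.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper class, for illustration only
class StaxFirstPhotos {
    private static final QName Q_ID = new QName("id");
    private static final QName Q_TITLE = new QName("title");
    private static final QName Q_PHOTO = new QName("photo");

    // Returns at most 'limit' photo titles keyed by photo ID, in document order
    static Map<String, String> firstPhotos(String xml, int limit) {
        Map<String, String> titlesById = new LinkedHashMap<String, String>();
        try {
            XMLEventReader reader =
                XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
            try {
                // Stop pulling events as soon as enough photos have been found
                while (reader.hasNext() && titlesById.size() < limit) {
                    XMLEvent event = reader.nextEvent();
                    if (!event.isStartElement()) {
                        continue;
                    }
                    StartElement element = event.asStartElement();
                    if (!element.getName().equals(Q_PHOTO)) {
                        continue;
                    }
                    Attribute id = element.getAttributeByName(Q_ID);
                    Attribute title = element.getAttributeByName(Q_TITLE);
                    if (id != null) {
                        // A missing title attribute is stored as an empty string
                        titlesById.put(id.getValue(), title == null ? "" : title.getValue());
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse photo list", e);
        }
        return titlesById;
    }

    public static void main(String[] args) {
        String xml = "<rsp stat=\"ok\"><photos>"
            + "<photo id=\"469774979\" title=\"Where will it lead me......?\"/>"
            + "<photo id=\"470281793\" title=\"Island Beauty\"/>"
            + "<photo id=\"469808244\" title=\"\"/>"
            + "</photos></rsp>";
        System.out.println(firstPhotos(xml, 2));
    }
}
```

Getting the same early exit from a SAX handler usually means throwing an exception out of a callback to abort the parse, which is much less natural than ending a loop.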


Woodstox as Geronimo's StAX provider

Now you've seen the benefits of StAX parsing. It's widely recognized as an important advancement in XML technology. Thus, it wasn't surprising when it became part of the Java EE 5 specification. (It's also being included with Java Platform, Standard Edition [Java SE] 6.) Because it's part of Java EE 5, it must be implemented by Geronimo 2.0.

Luckily for the Geronimo team, there were several open source StAX implementations to choose from. The team picked Woodstox as the StAX parser to include with Geronimo. Woodstox is regarded as one of the best performing StAX implementations. (See Resources for a comparison of various StAX parsers out there.) In addition, Woodstox is dual-licensed under both the Lesser General Public License (LGPL) and the Apache 2.0 license. So you can include Woodstox and its source code with Geronimo without restrictions.
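Because application code is written against the standard javax.xml.stream interfaces, nothing in it names Woodstox directly. If you want to confirm which implementation the factory lookup actually selected, a small check like the following sketch can help. StaxProviderCheck is a hypothetical name, and the class name printed depends entirely on your class path and runtime: with Woodstox available you would expect a class in the com.ctc.wstx package, while a plain JDK reports its own built-in implementation.

```java
import javax.xml.stream.XMLInputFactory;

// Hypothetical helper class, for illustration only
class StaxProviderCheck {
    // Returns the class name of whichever StAX input factory the lookup selected
    static String providerClassName() {
        return XMLInputFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        System.out.println("StAX input factory: " + providerClassName());
    }
}
```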

Performance-tuning your application: Getting the most out of Woodstox

Performance is definitely one of the advantages that Woodstox brings to Geronimo. Just as with other high-performance technologies, it's important to understand how to use Woodstox to get the best performance. The code in Listing 5 uses the XMLEventReader interface, a high-level API that's part of the StAX specification. A lower-level API that can be used instead for greater performance is the XMLStreamReader interface, which is also part of the specification. Listing 6 shows the StAX parser using this interface.


Listing 6. StAX parsing with the XMLStreamReader
Map<String,String> map = new HashMap<String,String>();
XMLInputFactory inputFactory = XMLInputFactory.newInstance();
QName qId = new QName("id");
QName qTitle = new QName("title");
QName qPhoto = new QName("photo");
XMLStreamReader reader = inputFactory.createXMLStreamReader(input);
while (reader.hasNext()){
     int event = reader.next();
     if (event == START_ELEMENT){ // statically included constant from XMLStreamConstants
          if (reader.getName().equals(qPhoto)){
               String id = reader.getAttributeValue(null, qId.getLocalPart());
               String title = reader.getAttributeValue(null, qTitle.getLocalPart());
               map.put(id,title);
          }
     }
}
reader.close();

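The code in Listing 6 is similar to the code in Listing 5. It's a little more low level, because you read names and attribute values directly from the reader's current position instead of from event objects. In return, the parser doesn't have to create an XMLEvent object for every event, which can give you a significant performance boost.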
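Listing 6 relies on a static import and on an input stream that are defined elsewhere. For reference, here is a self-contained sketch of the same cursor-style loop with those pieces filled in. StaxCursorPhotoTitles is a hypothetical name, the XML is passed in as a string rather than fetched from Flickr, and the element is matched by local name, which is a small simplification of the QName comparison in the listing.

```java
import static javax.xml.stream.XMLStreamConstants.START_ELEMENT;

import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class, for illustration only
class StaxCursorPhotoTitles {
    // Returns photo titles keyed by photo ID, in document order
    static Map<String, String> parse(String xml) {
        Map<String, String> titlesById = new LinkedHashMap<String, String>();
        try {
            XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    // next() advances the cursor and returns an int event code
                    if (reader.next() == START_ELEMENT
                            && "photo".equals(reader.getLocalName())) {
                        // A null namespace URI means the namespace is not checked
                        titlesById.put(reader.getAttributeValue(null, "id"),
                                       reader.getAttributeValue(null, "title"));
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse photo list", e);
        }
        return titlesById;
    }

    public static void main(String[] args) {
        String xml = "<rsp stat=\"ok\"><photos>"
            + "<photo id=\"470281793\" title=\"Island Beauty\"/>"
            + "</photos></rsp>";
        System.out.println(parse(xml));
    }
}
```

The attribute values are only valid while the cursor sits on the start element, so they're copied into the map before the loop advances again.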

Summary

You've learned about some of the advantages of using a StAX parser to parse XML documents. StAX provides a nice compromise between SAX and DOM. You can immediately take advantage of StAX by using it as part of Geronimo 2.0. Not only do you get to use the intuitive pull APIs of StAX, you get the extra benefit of using a high-performance implementation of StAX in Woodstox.



Download

Description | Name | Size | Download method
Sample article code | renegade.woodstox.source.zip | 4KB | HTTP



Resources

Learn

Get products and technologies

Discuss
