XML was introduced in 1996 by Tim Bray and Michael Sperberg-McQueen. Its potential was widely recognized, but it's hard to imagine that anyone back then could have known what an essential technology XML would become. Enterprise Java developers use XML for configuration, as a data store, and most commonly as a format for data exchange. It's the foundation for Web services and SOAP, and thus for the modern Service-Oriented Architecture (SOA) design pattern. But XML doesn't stop there. It puts the X in Ajax (Asynchronous JavaScript + XML) and is the key to the richer-than-ever experiences delivered by modern Web applications.
XML isn't exactly a panacea, though; there's a dark side to it. XML documents tend to be large. They share a general tree structure, but XML's extensibility means their schemas can vary tremendously. These aspects make it challenging to parse XML efficiently. There have been two traditional approaches to this challenge: DOM and SAX.
DOM and SAX are, in many ways, polar opposites. DOM provides a straightforward object model for XML documents. A DOM parser turns an XML document into an easy-to-use object representing all the data from the document. However, there's a price to pay for so faithfully representing an XML document: DOM parsing tends to be memory intensive, because the whole document is held in memory at once.
Memory isn't a problem for SAX. SAX parsers produce a series of parsing events. It's up to the application to register a handler whose callbacks receive these events and then perform some kind of logic on the data they carry. SAX is fast and efficient but requires a complicated programming model.
The easiest way to understand the differences between using DOM and SAX — and therefore the motivations and benefits of StAX — is to look at a specific example.
It's not hard to find some XML to parse. It's used everywhere. Most Web sites these days offer some kind of XML-based Web service. Flickr is a popular photo sharing site owned by Yahoo that has a powerful and flexible API. Let's take a look at some simple code for accessing Flickr's "interesting" photos. (See Downloads for all of the source code used in this article, and make sure you either put Woodstox in your class path or use JDK 1.6.) The code is shown in Listing 1.
Listing 1. Using the Flickr API
    String apiKey = "c4579586f41a90372f762cb65c78be5d";
    String urlStr = "http://api.flickr.com/services/rest/?" +
        "method=flickr.interestingness.getList&per_page=20&api_key=" + apiKey;
    // Open a stream on the REST response; the parsers in later listings read from it
    URL request = new URL(urlStr);
    InputStream input = request.openStream();
This code uses Flickr's Representational State Transfer (REST) API. (See the Resources section for more about Flickr's APIs and the REST format.) Some sample output from the above call is shown in Listing 2.
Listing 2. XML from Flickr
    <?xml version="1.0" encoding="utf-8" ?>
    <rsp stat="ok">
      <photos page="1" pages="25" per_page="20" total="500">
        <photo id="469774979" owner="35373726@N00" secret="c8a1be2012" server="183"
               farm="1" title="Where will it lead me......?" ispublic="1" isfriend="0"
               isfamily="0" />
        <photo id="470281793" owner="73955226@N00" secret="49612a2794" server="212"
               farm="1" title="Island Beauty" ispublic="1" isfriend="0" isfamily="0" />
        <photo id="469808244" owner="43568064@N00" secret="26b71544a3" server="227"
               farm="1" title="" ispublic="1" isfriend="0" isfamily="0" />
      </photos>
    </rsp>
Note that Listing 2 shows only three photos. The API call would actually return 20, the value of the per_page parameter in the URL string. The results are pretty straightforward, so take a look at how to parse this XML. In the example, you parse out the title of each photo and its ID. The ID can be used to create the URL for the photo, so it's not hard to imagine a Web application (perhaps a mashup) using just this information. First you use DOM to extract this data.
To use DOM, you parse the document into a document object. This is an in-memory tree structure representing the XML document that was parsed. You then walk the DOM tree looking for the title and ID of each photo, and you put this data into a simple map. The code for doing this is shown in Listing 3.
Listing 3. Parsing with DOM
    Map<String,String> map = new HashMap<String,String>();
    DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
    // Parse the whole response into an in-memory tree
    Document dom = builder.parse(input);
    Element root = dom.getDocumentElement();
    // Find the <photos> element among the children of <rsp>
    NodeList childNodes = root.getChildNodes();
    Node photosNode = null;
    for (int i=0;i<childNodes.getLength();i++){
        Node node = childNodes.item(i);
        if (node.getNodeName().equalsIgnoreCase("photos")){
            photosNode = node;
            break;
        }
    }
    // Fail clearly (instead of with a NullPointerException) on an unexpected response
    if (photosNode == null){
        throw new IllegalStateException("No photos element in the response");
    }
    // Collect id -> title from each <photo> child
    childNodes = photosNode.getChildNodes();
    for (int i=0;i<childNodes.getLength();i++){
        Node node = childNodes.item(i);
        if (node.getNodeName().equalsIgnoreCase("photo")){
            String title = node.getAttributes().getNamedItem("title").getTextContent();
            String id = node.getAttributes().getNamedItem("id").getTextContent();
            map.put(id,title);
        }
    }
DOM is popular because it's so easy to use. You just pass in your input source to the parser, and it gives you a document object. You can then go through child nodes until you find the photos node. Each photo node is a child of the photos node, so you go through each photo node. Then you access the title and id attributes of each photo node and store them in your map.
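The DOM walk described above can also be tried without calling Flickr at all. The following is a minimal sketch, not the article's downloadable code: the class name DomPhotoTitles and the inline sample XML are assumptions made for illustration, and it uses getElementsByTagName instead of walking child nodes by hand.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper class; the name is not from the article.
class DomPhotoTitles {

    // Builds the full DOM tree, then collects id -> title for every
    // <photo> element found anywhere in the document.
    static Map<String, String> parse(InputStream input) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document dom = builder.parse(input);
        NodeList photos = dom.getElementsByTagName("photo");
        Map<String, String> map = new LinkedHashMap<>();
        for (int i = 0; i < photos.getLength(); i++) {
            Element photo = (Element) photos.item(i);
            map.put(photo.getAttribute("id"), photo.getAttribute("title"));
        }
        return map;
    }

    public static void main(String[] args) throws Exception {
        // Trimmed-down sample in the shape of Listing 2 (assumed, not a live response)
        String xml = "<rsp stat=\"ok\"><photos page=\"1\">"
                + "<photo id=\"470281793\" title=\"Island Beauty\"/>"
                + "<photo id=\"469808244\" title=\"\"/>"
                + "</photos></rsp>";
        InputStream input =
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(parse(input));
    }
}
```

The code is shorter than the hand-written walk, but the cost is the same: the whole tree is still built in memory before any photo is read.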
However, there are some obvious inefficiencies with DOM. You're storing lots of data that you might not care about, such as the owner of each photo. You're also reading through all the data twice: once for reading it into the document object, then again when walking through the document object. The traditional way to avoid these inefficiencies was to use SAX.
A SAX parser doesn't give back a nice document object
like a DOM parser. Instead it gives a series of events as it rips through the
XML document. A handler has to be created for these events by either implementing
an interface or extending the DefaultHandler class and
overriding its methods as needed. Listing 4 demonstrates a SAX parsing of the
Flickr XML document.
Listing 4. Parsing with SAX
    SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
    DefaultHandler handler = new DefaultHandler(){
        @Override
        public void startElement(String uri, String localName,
                String qName, Attributes attributes) throws SAXException {
            if (qName.equalsIgnoreCase("photo")){
                String title = attributes.getValue("title");
                String id = attributes.getValue("id");
                // map is static so we can access it here
                map.put(id, title);
            }
        }
    };
    parser.parse(input, handler);
The code shown in Listing 4 is definitely a little harder to understand than the
DOM code you saw in Listing 3. You needed a
ContentHandler to handle the SAX events, so you created
a DefaultHandler and overrode its
startElement callback method. You checked to see if it
was a photo element, and if so, you accessed its title and id attributes.
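One way to avoid the static map that Listing 4 relies on is to give the handler its own state. The sketch below assumes a named handler class (SaxPhotoTitles is a made-up name) and an inline sample document; it illustrates the same callback technique and is not the article's downloadable code.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler class; the name is not from the article.
class SaxPhotoTitles extends DefaultHandler {

    // The handler keeps the results itself, so no static map is needed
    private final Map<String, String> photos = new LinkedHashMap<>();

    // Called by the parser (push model) for every start tag it encounters
    @Override
    public void startElement(String uri, String localName,
            String qName, Attributes attributes) {
        if ("photo".equals(qName)) {
            photos.put(attributes.getValue("id"), attributes.getValue("title"));
        }
    }

    Map<String, String> getPhotos() {
        return photos;
    }

    // Runs one SAX parse and returns what the handler collected
    static Map<String, String> parse(InputStream input) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        SaxPhotoTitles handler = new SaxPhotoTitles();
        parser.parse(input, handler);
        return handler.getPhotos();
    }

    public static void main(String[] args) throws Exception {
        // Assumed sample in the shape of Listing 2
        String xml = "<rsp stat=\"ok\"><photos>"
                + "<photo id=\"470281793\" title=\"Island Beauty\"/>"
                + "</photos></rsp>";
        System.out.println(parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8))));
    }
}
```

Even in this form the control flow stays inverted: the parser decides when startElement runs, and the application can only react.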
The code is fairly succinct and is very efficient when it runs. It stores only the data you care about, and you pass through the document only once. However, it's more complicated code that requires extending a class just to receive the parser's events. It would be nice to be able to parse the XML efficiently, but with a more intuitive programming model. That's where StAX comes in.
The complexity in SAX comes from the Observer design pattern it implements. It's a push model, in that the parser pushes events to observers who then act on the events. The StAX model is similar to SAX. It streams data and events from the XML document, allowing it to be fast and efficient like SAX. The big difference is that it uses a pull model. This allows the application code to pull events from the parser.
This may sound like a subtle difference, but it allows for a much simpler programming model. Take a look at Listing 5 to see StAX in action.
Listing 5. Parsing with StAX
    Map<String,String> map = new HashMap<String,String>();
    XMLInputFactory inputFactory = XMLInputFactory.newInstance();
    QName qId = new QName("id");
    QName qTitle = new QName("title");
    QName qPhoto = new QName("photo");
    XMLEventReader reader = inputFactory.createXMLEventReader(input);
    while (reader.hasNext()){
        XMLEvent event = reader.nextEvent();
        if (event.isStartElement()){
            StartElement element = event.asStartElement();
            if (element.getName().equals(qPhoto)){
                String id = element.getAttributeByName(qId).getValue();
                String title = element.getAttributeByName(qTitle).getValue();
                map.put(id,title);
            }
        }
    }
    reader.close();
First of all, you didn't have to extend any classes. That's because you don't need to register for events. With StAX, you control the flow of events, because you pull them from the parser. You're able to use a familiar iterator-like syntax to search through the document to find the data you want. You're still storing only the data you want, and you only have to go through the XML document once. You get the same efficiencies as with SAX, but the code is far more intuitive.
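Because the application drives the loop, it can also stop pulling whenever it has what it needs. The sketch below illustrates that idea under stated assumptions: the class name StaxFirstPhotos, the firstPhotos method, and its limit parameter are invented for this example and are not part of the article's code.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper class; the name is not from the article.
class StaxFirstPhotos {

    // Pulls events only until 'limit' photos have been collected, then stops
    // asking the parser for more; the rest of the document is never examined.
    static Map<String, String> firstPhotos(InputStream input, int limit)
            throws XMLStreamException {
        Map<String, String> map = new LinkedHashMap<>();
        XMLEventReader reader =
                XMLInputFactory.newInstance().createXMLEventReader(input);
        try {
            while (reader.hasNext() && map.size() < limit) {
                XMLEvent event = reader.nextEvent();
                if (!event.isStartElement()) {
                    continue;
                }
                StartElement element = event.asStartElement();
                if (!"photo".equals(element.getName().getLocalPart())) {
                    continue;
                }
                Attribute id = element.getAttributeByName(new QName("id"));
                Attribute title = element.getAttributeByName(new QName("title"));
                if (id != null) {
                    map.put(id.getValue(), title == null ? "" : title.getValue());
                }
            }
        } finally {
            reader.close();
        }
        return map;
    }

    public static void main(String[] args) throws XMLStreamException {
        // Assumed sample in the shape of Listing 2
        String xml = "<rsp><photos>"
                + "<photo id=\"1\" title=\"A\"/>"
                + "<photo id=\"2\" title=\"B\"/>"
                + "<photo id=\"3\" title=\"C\"/>"
                + "</photos></rsp>";
        System.out.println(firstPhotos(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), 2));
    }
}
```

With SAX, stopping early typically means throwing an exception out of a callback; with the pull model it is just a loop condition.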
Woodstox as Geronimo's StAX provider
Now you've seen the benefits of StAX parsing. It's widely recognized as an important advancement in XML technology. Thus, it wasn't surprising when it became part of the Java EE 5 specification. (It's also included with Java Platform, Standard Edition [Java SE] 6.) Because it's part of Java EE 5, it must be implemented by Geronimo 2.0.
Luckily for the Geronimo team, there were several open source StAX implementations to choose from. The team picked Woodstox as the StAX parser to include with Geronimo. Woodstox is regarded as one of the best performing StAX implementations. (See Resources for a comparison of various StAX parsers out there.) In addition, Woodstox is dual-licensed under both the Lesser General Public License (LGPL) and the Apache 2.0 license. So you can include Woodstox and its source code with Geronimo without restrictions.
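Application code written against the javax.xml.stream API does not name Woodstox anywhere; the implementation is located at run time. A quick way to see which implementation a given environment picks is to print the factory's class name. The class StaxProviderCheck below is a made-up name for this sketch. With Woodstox on the class path the output is typically com.ctc.wstx.stax.WstxInputFactory, but the exact result depends on your class path and JDK.

```java
import javax.xml.stream.XMLInputFactory;

// Hypothetical utility class; the name is not from the article.
class StaxProviderCheck {

    // Returns the class name of whichever StAX implementation
    // XMLInputFactory.newInstance() resolves to in this environment.
    static String providerClassName() {
        return XMLInputFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        System.out.println("StAX provider: " + providerClassName());
    }
}
```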
Performance-tuning your application: Getting the most out of Woodstox
Performance is definitely one of the advantages that Woodstox brings to Geronimo.
Just as with other high-performance technologies, it's important to understand
how to use Woodstox to get the best performance. The code in Listing 5 uses the
XMLEventReader interface, a high-level API that's
part of the StAX specification. A more low-level API that can be used instead for
greater performance is the XMLStreamReader interface.
Listing 6 shows the StAX parser using this interface.
Listing 6. StAX parsing with the XMLStreamReader
    Map<String,String> map = new HashMap<String,String>();
    XMLInputFactory inputFactory = XMLInputFactory.newInstance();
    QName qId = new QName("id");
    QName qTitle = new QName("title");
    QName qPhoto = new QName("photo");
    XMLStreamReader reader = inputFactory.createXMLStreamReader(input);
    while (reader.hasNext()){
        int event = reader.next();
        if (event == START_ELEMENT){ // statically imported constant from XMLStreamConstants
            if (reader.getName().equals(qPhoto)){
                String id = reader.getAttributeValue(null, qId.getLocalPart());
                String title = reader.getAttributeValue(null, qTitle.getLocalPart());
                map.put(id,title);
            }
        }
    }
    reader.close();
The code in Listing 6 is similar to the code in Listing 5. It's obviously a little more low level, but because the cursor-style XMLStreamReader doesn't create an XMLEvent object for every parsing event, you get a significant performance boost.
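Another common tuning step is to create the XMLInputFactory once and reuse it, because each call to newInstance() repeats the provider lookup. The following sketch combines that with the XMLStreamReader loop. The class name StreamPhotoTitles and the shared FACTORY field are assumptions for illustration, and whether one factory can safely be shared across threads depends on the StAX implementation, so check its documentation before doing so.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class; the name is not from the article.
class StreamPhotoTitles {

    // Created once and reused for every parse (assumption: single-threaded use,
    // or an implementation that documents its factories as thread-safe)
    private static final XMLInputFactory FACTORY = XMLInputFactory.newInstance();

    // Cursor-style parse: reads attribute values straight from the reader
    // without materializing an event object per parsing event.
    static Map<String, String> parse(InputStream input) throws XMLStreamException {
        Map<String, String> map = new LinkedHashMap<>();
        XMLStreamReader reader = FACTORY.createXMLStreamReader(input);
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "photo".equals(reader.getLocalName())) {
                    map.put(reader.getAttributeValue(null, "id"),
                            reader.getAttributeValue(null, "title"));
                }
            }
        } finally {
            reader.close();
        }
        return map;
    }

    public static void main(String[] args) throws XMLStreamException {
        // Assumed sample in the shape of Listing 2
        String xml = "<rsp><photos>"
                + "<photo id=\"470281793\" title=\"Island Beauty\"/>"
                + "</photos></rsp>";
        System.out.println(parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8))));
    }
}
```

Measure with your own documents before relying on either change; the gain depends on document size and on the StAX implementation in use.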
You've learned about some of the advantages of using a StAX parser to parse XML documents. StAX provides a nice compromise between SAX and DOM. You can immediately take advantage of StAX by using it as part of Geronimo 2.0. Not only do you get to use the intuitive pull APIs of StAX, you get the extra benefit of using a high-performance implementation of StAX in Woodstox.
| Description | Name | Size | Download method |
|---|---|---|---|
| Sample article code | renegade.woodstox.source.zip | 4KB | HTTP |
Learn
- For a great introduction to StAX, check out "StAX'ing up XML, Part 1: An introduction to Streaming API for XML (StAX)" (developerWorks, November 2006).
- Learn how to use StAX not only to read XML documents but also to write them in Berthold Daum's "Tip: Write XML documents with StAX" (developerWorks, December 2003).
- Find out the latest on Woodstox.
- Get a detailed performance comparison of Woodstox and other StAX implementations in the white paper from Sun titled "Streaming APIs for XML parsers".
- Read all about SAX in Benoit Marchal's article "SAX, the power API" (developerWorks, August 2001).
- Find out about navigating DOMs more efficiently in the article "Effective XML processing with DOM and XPath in Java" by Parand Tony Darugar (developerWorks, December 2001).
- Read the latest Geronimo documentation and the latest news on its Java EE 5 status on the Geronimo Wiki.
- New to Geronimo? Get started with Geronimo here on developerWorks.
- Get involved in the Geronimo project.
- Join the Apache Geronimo mailing list.
- Read "Applying the Apache License, Version 2.0" to understand what you need to do to apply the license.
- Check out the developerWorks Apache Geronimo project area for articles, tutorials, and other resources to help you get started developing with Geronimo today.
- Check out the IBM® Support for Apache Geronimo offering, which lets you develop Geronimo applications backed by world-class IBM support.
- Visit the developerWorks Open source zone for extensive how-to information, tools, and project updates to help you develop with open source technologies and use them with IBM's products.
- Stay current with developerWorks technical events and webcasts.
- Browse all the Apache articles and free Apache tutorials available in the developerWorks Open source zone.
- Browse for books on these and other technical topics at the Safari bookstore.
- Get an RSS feed for this series. (Find out more about RSS.)
Get products and technologies
- Download the latest version of Apache Geronimo.
- Download your free copy of IBM WebSphere® Application Server Community Edition — a lightweight J2EE application server built on Apache Geronimo open source technology that is designed to help you accelerate your development and deployment efforts.
- Innovate your next open source development project with IBM trial software, available for download or on DVD.
Discuss
- Participate in the discussion forum.
- Get involved in the developerWorks community by participating in developerWorks blogs.