Level: Intermediate
Michael Galpin (mike.sr@gmail.com), Developer, Adomo, Inc.
24 Jul 2007
An XML parser is often the key to a high-performance, robust application. Traditional XML parsing techniques include Document Object Model (DOM) and Simple API for XML (SAX). A newer technique, Streaming API for XML (StAX), proved useful enough to be included in the Java™ Platform, Enterprise Edition (Java EE) 5 specification. Apache Geronimo 2.0, a full implementation of Java EE 5, includes a StAX parser: Codehaus' Woodstox. In this installment, learn the benefits of StAX and why the Geronimo team chose Woodstox as its StAX parser.
The importance of XML
The first working draft of XML was published in 1996, edited by Tim Bray and
C. M. Sperberg-McQueen. Its potential was widely recognized, but it's hard to
imagine that anyone back then could know what an essential technology XML would
become. Enterprise Java developers use XML for configuration, as a data store,
and most commonly as a format for data exchange. It's the foundation for Web
services and SOAP, and thus for the modern Service-Oriented Architecture (SOA)
design pattern. But XML doesn't stop there. It puts the X in Ajax (Asynchronous
JavaScript + XML) and is the key to the richer-than-ever experiences delivered
by modern Web applications.
XML isn't exactly a panacea, though; there's a dark side to it. XML documents
tend to be large. They share a general tree structure, but because XML is
extensible, the schemas of such documents can vary tremendously. Both aspects
make it hard to parse XML efficiently. There have been two traditional
approaches to the challenge of XML parsing: DOM and SAX.
XML processing: DOM and SAX
DOM and SAX are the two classic strategies for parsing XML. They are, in many
ways, polar opposite strategies. DOM provides
a straightforward object model for XML documents. A DOM parser turns an XML
document into an easy-to-use object representing all the data from the XML
document. However, there's a price to pay for so faithfully representing an XML
document:
DOM parsing tends to be memory intensive.
Memory isn't a problem for SAX. SAX parsers produce a
series of parsing events. It's up to a handler to register callbacks for these
events and then perform some kind of logic on the data from these events. It's
fast and efficient but requires a complicated programming model.
The easiest way to understand the differences between using DOM and SAX, and
therefore the motivations and benefits of StAX, is to look at a specific example.
Parsing example using Flickr
It's not hard to find some XML to parse. It's used everywhere. Most Web sites
these days offer some kind of XML-based Web service. Flickr is a popular photo
sharing site owned by Yahoo that has a powerful and flexible API. Let's take a
look at some simple code for accessing Flickr's "interesting" photos. (See
Downloads for all of the source code used in this
article, and make sure you either put Woodstox in your class path or use JDK 1.6.)
The code is shown in Listing 1.
Listing 1. Using the Flickr API
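// Flickr API key used by the sample code
String apiKey = "c4579586f41a90372f762cb65c78be5d";
// REST call for the current "interesting" photos, 20 per page
String urlStr = "http://api.flickr.com/services/rest/?" +
        "method=flickr.interestingness.getList&per_page=20&api_key=" + apiKey;
URL request = new URL(urlStr);
// Open the HTTP response as a stream for the parsers in the following listings
InputStream input = request.openStream();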
This code uses Flickr's Representational State Transfer (REST) API. (See the Resources section
for more about Flickr's APIs and the REST format.) Some sample output from
the above call is shown in Listing 2.
Listing 2. XML from Flickr
<?xml version="1.0" encoding="utf-8" ?>
<rsp stat="ok">
  <photos page="1" pages="25" per_page="20" total="500">
    <photo id="469774979" owner="35373726@N00" secret="c8a1be2012" server="183"
           farm="1" title="Where will it lead me......?" ispublic="1" isfriend="0"
           isfamily="0" />
    <photo id="470281793" owner="73955226@N00" secret="49612a2794" server="212"
           farm="1" title="Island Beauty" ispublic="1" isfriend="0" isfamily="0" />
    <photo id="469808244" owner="43568064@N00" secret="26b71544a3" server="227"
           farm="1" title="" ispublic="1" isfriend="0" isfamily="0" />
  </photos>
</rsp>
Note that Listing 2 only shows three photos. The API call would actually return
20 (the value of the per_page parameter in the URL string). The
results are pretty straightforward, so take a look at how to parse this
XML. In the example, you parse out the title of each photo and its ID. The ID can
be used to create the URL for the photo, so it's not hard to imagine a Web
application (perhaps a mashup) using just this information. First you use DOM to
extract this data.
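As a brief aside on what the extracted data is good for, here is a minimal sketch of turning a parsed photo into a link. The class name FlickrUrls and the URL pattern are assumptions and do not come from the article's sample code. The pattern follows Flickr's commonly documented photo page scheme, which needs the owner attribute as well as the ID.

```java
// Hypothetical helper; the URL pattern is an assumption, not taken from the article.
class FlickrUrls {

    // Builds a photo page URL from the owner and id attributes of a <photo> element.
    static String photoPage(String owner, String id) {
        return "http://www.flickr.com/photos/" + owner + "/" + id;
    }
}
```

For the first photo in Listing 2, this would be called as FlickrUrls.photoPage("35373726@N00", "469774979").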
DOM example
To use DOM, you parse the document into a document object. This is an
in-memory tree structure representing the XML document that was parsed. You
then walk the DOM tree looking for the title and ID of each photo. Put this
data into a simple map. The code for doing this is shown in Listing 3.
Listing 3. Parsing with DOM
Map<String,String> map = new HashMap<String,String>();
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
// Parse the whole response into an in-memory tree
Document dom = builder.parse(input);
Element root = dom.getDocumentElement();
NodeList childNodes = root.getChildNodes();
// Find the <photos> child of the <rsp> root
Node photosNode = null;
for (int i = 0; i < childNodes.getLength(); i++) {
    Node node = childNodes.item(i);
    if (node.getNodeName().equalsIgnoreCase("photos")) {
        photosNode = node;
        break;
    }
}
// Guard against a response without <photos>, such as an error response
if (photosNode == null) {
    throw new IllegalStateException("No photos element in response");
}
// Collect the id and title attributes of each <photo> child
childNodes = photosNode.getChildNodes();
for (int i = 0; i < childNodes.getLength(); i++) {
    Node node = childNodes.item(i);
    if (node.getNodeName().equalsIgnoreCase("photo")) {
        String title = node.getAttributes().getNamedItem("title").getTextContent();
        String id = node.getAttributes().getNamedItem("id").getTextContent();
        map.put(id, title);
    }
}
DOM is popular because it's so easy to use. You just pass your input
source to the parser, and it gives you a document
object. You can then go through the root's child nodes until you find the photos
node. Each photo node is a child of the photos node, so you go through each of
its children. Then you access the title and id attributes of each photo node and
store them in your map.
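The tree walk described above can be shortened with the standard getElementsByTagName call. The following is a minimal, self-contained sketch rather than the article's own code: the class name DomPhotoTitles and the sample helper are hypothetical, and the input is an XML string so that it runs without calling Flickr.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper class, not part of the article's sample code.
class DomPhotoTitles {

    // Parses a Flickr-style response and maps each photo id to its title.
    static Map<String, String> parse(InputStream input) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document dom = builder.parse(input); // the whole document is loaded into memory

        Map<String, String> map = new LinkedHashMap<String, String>();
        // Finds every <photo> element at any depth, in document order.
        NodeList photos = dom.getElementsByTagName("photo");
        for (int i = 0; i < photos.getLength(); i++) {
            Element photo = (Element) photos.item(i);
            map.put(photo.getAttribute("id"), photo.getAttribute("title"));
        }
        return map;
    }

    // Wraps an XML string as a stream so the sketch runs without a network call.
    static InputStream sample(String xml) {
        return new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
    }
}
```

This version is shorter, but it still builds the full tree first, so it has the same memory cost as Listing 3.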
However, there are some obvious inefficiencies with DOM. You're
storing lots of data that you might not care about, such as the owner of each
photo. You're also reading through all the data twice: once for reading it into the
document object, then again when walking through the document object. The
traditional way to avoid these inefficiencies was to use SAX.
SAX example
A SAX parser doesn't give back a nice document object
like a DOM parser. Instead it gives a series of events as it rips through the
XML document. A handler has to be created for these events by either implementing
an interface or extending the DefaultHandler class and
overriding its methods as needed. Listing 4 demonstrates a SAX parsing of the
Flickr XML document.
Listing 4. Parsing with SAX
SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
DefaultHandler handler = new DefaultHandler() {
    @Override
    public void startElement(String uri, String localName,
            String qName, Attributes attributes) throws SAXException {
        if (qName.equalsIgnoreCase("photo")) {
            String title = attributes.getValue("title");
            String id = attributes.getValue("id");
            // map is static so we can access it here
            map.put(id, title);
        }
    }
};
parser.parse(input, handler);
The code shown in Listing 4 is definitely a little harder to understand than the
DOM code you saw in Listing 3. You needed a
ContentHandler to handle the SAX events, so you created
a DefaultHandler and overrode its
startElement callback method. You checked to see if it
was a photo element, and if so, you accessed its title and id attributes.
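A self-contained variant of that handler might look like the following sketch. The class name PhotoHandler, its getPhotos accessor, and the parse helper are hypothetical. Keeping the map inside the handler is just one way to avoid the static map that Listing 4 relies on.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler class; the name and the accessor are illustrative.
class PhotoHandler extends DefaultHandler {
    private final Map<String, String> photos = new LinkedHashMap<String, String>();

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes attributes) {
        // The parser pushes one callback per start tag; keep only <photo>.
        if ("photo".equals(qName)) {
            photos.put(attributes.getValue("id"), attributes.getValue("title"));
        }
    }

    Map<String, String> getPhotos() {
        return photos;
    }

    // Convenience entry point: parse an XML string and return id -> title.
    static Map<String, String> parse(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        PhotoHandler handler = new PhotoHandler();
        parser.parse(new ByteArrayInputStream(
                xml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.getPhotos();
    }
}
```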
The code is fairly succinct and is very efficient when it runs. It
stores only the data you care about, and you only pass through the document once.
However, it's more complicated code that requires extending a class to register an
event listener. It would be nice to be able to parse the XML efficiently, but with
a more intuitive programming model. That's where StAX comes in.
The StAX alternative
The complexity in SAX comes from the Observer design pattern it implements. It's
a push model, in that the parser pushes events to observers who then act on the
events. The StAX model is similar to SAX. It streams data and events from the XML
document, allowing it to be fast and efficient like SAX. The big difference is
that it uses a pull model. This allows the application code to pull events from
the parser.
This may sound like a subtle difference, but it allows for a much simpler
programming model. Take a look at Listing 5 to see StAX in action.
Listing 5. Parsing with StAX
Map<String,String> map = new HashMap<String,String>();
XMLInputFactory inputFactory = XMLInputFactory.newInstance();
QName qId = new QName("id");
QName qTitle = new QName("title");
QName qPhoto = new QName("photo");
XMLEventReader reader = inputFactory.createXMLEventReader(input);
while (reader.hasNext()) {
    XMLEvent event = reader.nextEvent();
    if (event.isStartElement()) {
        StartElement element = event.asStartElement();
        if (element.getName().equals(qPhoto)) {
            String id = element.getAttributeByName(qId).getValue();
            String title = element.getAttributeByName(qTitle).getValue();
            map.put(id, title);
        }
    }
}
reader.close();
First of all, you didn't have to extend any classes.
That's because you don't need to register for events. With StAX, you control the
flow of events, because you pull them from the parser. You're able to use a
familiar iterator-like syntax to search through the document to find the data
you want. You're still storing only the data you want, and you only have to
go through the XML document once. You get the same efficiencies as with SAX, but
the code is far more intuitive.
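Controlling the flow also means you can stop whenever you have what you need. The following sketch, under the assumption that only the first photo's title is wanted, simply returns from the loop and pulls no further events. With SAX, stopping early typically means throwing an exception from the callback. The class name FirstPhotoFinder is hypothetical.

```java
import java.io.StringReader;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper class, not part of the article's sample code.
class FirstPhotoFinder {

    // Returns the title of the first <photo> element, or null if there is none.
    static String firstTitle(String xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLEventReader reader = factory.createXMLEventReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isStartElement()) {
                    StartElement element = event.asStartElement();
                    if ("photo".equals(element.getName().getLocalPart())) {
                        Attribute title = element.getAttributeByName(new QName("title"));
                        // Returning here stops pulling; no further events are requested.
                        return title == null ? null : title.getValue();
                    }
                }
            }
            return null;
        } finally {
            reader.close();
        }
    }
}
```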
Woodstox as Geronimo's StAX provider
Now you've seen the benefits of StAX parsing. It's widely recognized as an
important advancement in XML technology. Thus, it wasn't surprising when it
became part of the Java EE 5 specification. (It's also being included with Java
Platform, Standard Edition [Java SE] 6.) Because it's part of Java EE 5, it must
be implemented by Geronimo 2.0.
Luckily for the Geronimo team, there were several open source StAX
implementations to choose from. The team picked Woodstox as the StAX parser to
include with Geronimo. Woodstox is regarded as one of the best-performing StAX
implementations. (See Resources for a comparison of
various StAX parsers out there.) In addition, Woodstox is dual-licensed under both
the Lesser General Public License (LGPL) and the Apache 2.0 license. So you can
include Woodstox and its source code with Geronimo without restrictions.
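Because application code only talks to the standard javax.xml.stream interfaces, it isn't obvious which implementation is actually in use. A minimal sketch for checking this follows. The class name StaxProviderCheck is hypothetical, and the result depends entirely on what is on your class path.

```java
import javax.xml.stream.XMLInputFactory;

// Hypothetical helper class, not part of the article's sample code.
class StaxProviderCheck {

    // Returns the class name of whichever StAX implementation
    // XMLInputFactory.newInstance() selected at run time.
    static String providerClassName() {
        return XMLInputFactory.newInstance().getClass().getName();
    }
}
```

With Woodstox available, you would expect a class from the com.ctc.wstx packages. On a plain Java SE 6 or later runtime, you would expect the JDK's bundled implementation instead.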
Performance-tuning your application: Getting the most out of Woodstox
Performance is definitely one of the advantages that Woodstox brings to Geronimo.
Just as with other high-performance technologies, it's important to understand
how to use Woodstox to get the best performance. The code in Listing 5 uses the
XMLEventReader interface, a high-level API that's
part of the StAX specification. A more low-level API that can be used instead for
greater performance is the XMLStreamReader interface.
Listing 6 shows the StAX parser using this interface.
Listing 6. StAX parsing with the XMLStreamReader
Map<String,String> map = new HashMap<String,String>();
XMLInputFactory inputFactory = XMLInputFactory.newInstance();
QName qId = new QName("id");
QName qTitle = new QName("title");
QName qPhoto = new QName("photo");
XMLStreamReader reader = inputFactory.createXMLStreamReader(input);
while (reader.hasNext()) {
    int event = reader.next();
    if (event == START_ELEMENT) { // statically included constant from XMLStreamConstants
        if (reader.getName().equals(qPhoto)) {
            String id = reader.getAttributeValue(null, qId.getLocalPart());
            String title = reader.getAttributeValue(null, qTitle.getLocalPart());
            map.put(id, title);
        }
    }
}
reader.close();
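The code in Listing 6 is similar to the code in Listing 5. It's
obviously a little more low level, but because the cursor-style reader reports
each event as an int code rather than allocating an XMLEvent object for it, you
can get a significant performance boost.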
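For experimenting with the cursor API outside Geronimo, here is a self-contained sketch of the same loop. The class name CursorPhotoTitles is hypothetical. It reads from an XML string instead of the Flickr stream, and it closes the reader in a finally block, which is a choice of this sketch and not of the article's code.

```java
import static javax.xml.stream.XMLStreamConstants.START_ELEMENT;

import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class, not part of the article's sample code.
class CursorPhotoTitles {

    // Maps each photo id to its title using the low-level cursor API.
    static Map<String, String> parse(String xml) throws XMLStreamException {
        Map<String, String> map = new LinkedHashMap<String, String>();
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                // next() returns an int event code; no XMLEvent object is created.
                if (reader.next() == START_ELEMENT
                        && "photo".equals(reader.getLocalName())) {
                    // A null namespace URI means the namespace is not checked.
                    map.put(reader.getAttributeValue(null, "id"),
                            reader.getAttributeValue(null, "title"));
                }
            }
        } finally {
            reader.close();
        }
        return map;
    }
}
```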
Summary
You've learned about some of the advantages of using a StAX parser to parse XML
documents. StAX provides a nice compromise between SAX and DOM. You can
immediately take advantage of StAX by using it as part of Geronimo 2.0. Not only
do you get to use the intuitive pull APIs of StAX, you get the extra benefit
of using a high-performance implementation of StAX in Woodstox.
Download
| Description | Name | Size | Download method |
|---|---|---|---|
| Sample article code | renegade.woodstox.source.zip | 4KB | HTTP |
Resources
Learn
- For a great introduction to StAX, check out "StAX'ing up XML, Part 1: An introduction to Streaming API for XML (StAX)" (developerWorks, November 2006).
- Learn how to use StAX not only to read XML documents but also to write them in Berthold Daum's "Tip: Write XML documents with StAX" (developerWorks, December 2003).
- Find out the latest on Woodstox.
- Get a detailed performance comparison of Woodstox and other StAX implementations in the white paper from Sun titled "Streaming APIs for XML parsers".
- Read all about SAX in Benoit Marchal's article "SAX, the power API" (developerWorks, August 2001).
- Find out about navigating DOMs more efficiently in the article "Effective XML processing with DOM and XPath in Java" by Parand Tony Darugar (developerWorks, December 2001).
- If DOM is still too low level for your tastes, give the Java Architecture for XML Binding (JAXB) a try in the tutorial "Data binding with JAXB" by Daniel Steinberg (developerWorks, May 2003).
- Learn about StAX and Java EE in the official Java EE 5 tutorial.
- Read the latest Geronimo documentation and the latest news on its Java EE 5 status on the Geronimo Wiki.
- New to Geronimo? Get started with Geronimo here on developerWorks.
- Get involved in the Geronimo project.
- Join the Apache Geronimo mailing list.
- Read "Applying the Apache License, Version 2.0" to understand what you need to do to apply the license.
- Check out the developerWorks Apache Geronimo project area for articles, tutorials, and other resources to help you get started developing with Geronimo today.
- Check out the IBM® Support for Apache Geronimo offering, which lets you develop Geronimo applications backed by world-class IBM support.
- Visit the developerWorks Open source zone for extensive how-to information, tools, and project updates to help you develop with open source technologies and use them with IBM's products.
- Stay current with developerWorks technical events and webcasts.
- Browse all the Apache articles and free Apache tutorials available in the developerWorks Open source zone.
- Browse for books on these and other technical topics at the Safari bookstore.
- Get an RSS feed for this series. (Find out more about RSS.)
About the author
Michael Galpin has been developing Java software professionally since 1998. He
currently works at Adomo, Inc., a start-up in Mountain View, CA. He holds a degree
in mathematics from the California Institute of Technology.