
StAX'ing up XML, Part 2: Pull parsing and events

Explore the StAX event iterator-based API for Java developers

Peter Nehrer (pnehrer@ecliptical.ca), Freelance Writer, Freelance Developer
Peter Nehrer is a software consultant specializing in Eclipse-based enterprise solutions and Java EE applications. He is the founder of Ecliptical Software Inc. and a contributor to several Eclipse-related Open Source projects. He holds an M.S. in Computer Science from the University of Massachusetts at Amherst, MA.

Summary:  The event iterator-based API provided by Streaming API for XML (StAX) offers a unique blend of advantages over other XML processing methods in terms of both performance and usability. Part 1 introduced StAX and described in detail its cursor-based API. In this article, delve deeper into the event iterator-based API and explore its benefits to Java™ developers.


Date:  05 Dec 2006
Level:  Intermediate

Using StAX to parse XML

In Part 1 (see Resources), you learned that StAX provides two API styles for processing XML. The cursor-based API represents a low-level method for parsing XML. Using this approach, the application advances a cursor over a stream of XML tokens, examining the parser state at every step to get more information about what was parsed. This method is very efficient and especially suitable for resource-constrained environments. However, the cursor-based API is not object-oriented and thus not a natural fit for Java applications, especially in the enterprise domain where the extensibility and maintainability of code are just as important as its performance. For example, a multi-layered Web service that uses a generic component to process message envelopes while delegating any message-specific content processing (such as argument binding) to other components would likely benefit from an object-oriented approach.

The other API style provided by StAX is centered around event objects. Like its cursor-based alternative, it is also a pull-based method of parsing XML; the application pulls each event from the parser by using one of the provided methods, then deals with the event as needed, and so on, until the stream is parsed (or the application decides to stop parsing).

Introducing the XMLEventReader interface

The main interface of the event iterator-based API is XMLEventReader. Compared to XMLStreamReader, it has only a handful of methods. This is because XMLEventReader is used to iterate over a stream of event objects (in fact, XMLEventReader extends java.util.Iterator). All information about the parsed event is encapsulated in the event object rather than in the reader.

To use the event iterator-based API, the application must first obtain an instance of XMLEventReader from the XMLInputFactory. The factory itself can be obtained using the standard JAXP approach, which relies on the Abstract Factory pattern to support pluggable service providers. This makes getting an instance of the default XMLInputFactory implementation as simple as calling XMLInputFactory.newInstance(), as shown in Listing 1.


Listing 1. Creating an XMLEventReader using the default XMLInputFactory implementation

String uri = "http://www.atomenabled.org/atom.xml";
URL url = new URL(uri);
InputStream input = url.openStream();
XMLInputFactory factory = XMLInputFactory.newInstance();
XMLEventReader reader = factory.createXMLEventReader(uri, input);
...

XMLInputFactory supports a variety of input sources that you can use to create an XMLEventReader. In addition to InputStream and Reader from the Java I/O package, the JAXP Source (from TrAX) is also supported, which facilitates integration of StAX with JAXP's transformation API (TrAX). Lastly, it is also possible to create an XMLEventReader from an XMLStreamReader. This option in particular demonstrates how the event iterator-based API stacks (no pun intended) on top of the cursor-based API. In fact, the implementation typically creates an XMLStreamReader using one of the other input sources and then uses it to create the XMLEventReader.
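As a minimal sketch of this layering, the following creates one XMLEventReader directly from a Reader and another from an existing XMLStreamReader. The class name ReaderSources, the helper firstElementName, and the inline sample document are assumptions made for illustration; they are not part of the StAX API or of the original listings.

```java
import java.io.StringReader;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.events.XMLEvent;

public class ReaderSources {
    // Hypothetical sample document used only for this sketch.
    static final String XML = "<feed><title>Example</title></feed>";

    // Returns the local name of the first element the reader reports, or null.
    static String firstElementName(XMLEventReader reader) throws XMLStreamException {
        try {
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isStartElement()) {
                    return event.asStartElement().getName().getLocalPart();
                }
            }
            return null;
        } finally {
            reader.close();
        }
    }

    // Event reader created directly from a java.io.Reader.
    static String fromReader() throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        return firstElementName(factory.createXMLEventReader(new StringReader(XML)));
    }

    // Event reader stacked on top of a cursor-based XMLStreamReader.
    static String fromStreamReader() throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader cursor = factory.createXMLStreamReader(new StringReader(XML));
        return firstElementName(factory.createXMLEventReader(cursor));
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(fromReader());
        System.out.println(fromStreamReader());
    }
}
```

Both paths should report the same first element, which is the point of the sketch: the event reader sees the same stream regardless of which input source it was built from.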

Using XMLEventReader

After creating an XMLEventReader, the application can use it to iterate over events that represent pieces of the underlying XML stream's InfoSet. Because interface XMLEventReader extends java.util.Iterator, the standard Iterator methods, such as hasNext() and next() can be used. Note, however, that method remove() is not supported and will throw an exception to that effect if invoked.
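The plain Iterator methods can be exercised as in the following sketch. The class and method names are hypothetical, and the exception type caught for remove() (UnsupportedOperationException) is an assumption based on the usual Iterator contract rather than something stated in this article.

```java
import java.io.StringReader;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

public class IteratorStyle {
    private static XMLEventReader open(String xml) throws XMLStreamException {
        return XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
    }

    // Counts events using only hasNext() and next() from java.util.Iterator.
    static int countEvents(String xml) throws XMLStreamException {
        XMLEventReader reader = open(xml);
        int count = 0;
        try {
            while (reader.hasNext()) {
                // next() comes from Iterator, so the result must be cast.
                XMLEvent event = (XMLEvent) reader.next();
                if (event != null) {
                    count++;
                }
            }
        } finally {
            reader.close();
        }
        return count;
    }

    // Returns true if remove() is rejected (assumed: UnsupportedOperationException).
    static boolean removeIsRejected(String xml) throws XMLStreamException {
        XMLEventReader reader = open(xml);
        try {
            reader.next();
            reader.remove();
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(countEvents("<a/>"));
        System.out.println(removeIsRejected("<a/>"));
    }
}
```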

XMLEventReader also provides some convenience methods to make XML processing easier:

  • nextEvent() is essentially a strongly-typed equivalent of Iterator's next(); it returns an XMLEvent, the base interface of all event objects.
  • nextTag() skips over any insignificant whitespace up to the next opening or closing tag. Thus, the return value will be either a StartElement or an EndElement event (more on those later). This method is particularly useful when processing element-only content (that is, elements whose content model, as declared in the Document Type Declaration, or DTD, allows only child elements).
  • getElementText() gets the text content of a text-only element, from its opening tag to its closing tag. Its documented precondition is that the current event is a StartElement; the method concatenates all character data and returns the resulting string once the matching EndElement is reached.
  • peek() can find out the next event, if any, to be returned by the iterator without advancing it.
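The convenience methods listed above can be combined as in this sketch, which walks into a small element-only document with nextTag() and then reads a text-only element with getElementText(). The class name, the readTitle helper, and the feed/title sample structure are assumptions chosen for illustration.

```java
import java.io.StringReader;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

public class NextTagExample {
    // Assumes a document shaped like <feed><title>text</title>...</feed>
    // (hypothetical structure used only for this sketch).
    static String readTitle(String xml) throws XMLStreamException {
        XMLEventReader reader =
                XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
        try {
            reader.nextEvent();                  // consume StartDocument
            reader.nextTag();                    // move to <feed>
            XMLEvent title = reader.nextTag();   // skip whitespace, land on <title>
            if (!title.isStartElement()) {
                throw new XMLStreamException("expected a start tag");
            }
            // The reader is positioned on the StartElement, so the text can be read.
            return reader.getElementText();
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(readTitle("<feed>\n  <title>Example Feed</title>\n</feed>"));
    }
}
```

Because nextTag() fails on non-whitespace text, this style suits documents whose structure is known in advance.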

Listing 2 demonstrates the use of XMLEventReader methods to iterate over an Atom feed. Atom is a syndication format used in Web publishing. The example starts by obtaining the default instance of XMLInputFactory and using it to create an XMLEventReader to parse the Atom feed at the given URL. While iterating over the events, method peek() determines if the next event will be the start of an icon element, which contains the feed's icon URL. If one is encountered, method getElementText() obtains the element's text content (that is, the icon URL). At that point, iteration terminates.


Listing 2. Extracting an Atom feed's icon URL using peek() and getElementText()

final QName ICON = new QName("http://www.w3.org/2005/Atom", "icon");
URL url = new URL(uri);
InputStream input = url.openStream();

XMLInputFactory factory = XMLInputFactory.newInstance();
XMLEventReader reader = factory.createXMLEventReader(uri, input);
try {
      while (reader.hasNext()) {
            // look at the next event without consuming it
            XMLEvent event = reader.peek();
            if (event.isStartElement()) {
                  StartElement start = event.asStartElement();
                  if (ICON.equals(start.getName())) {
                        // consume the StartElement so that it becomes the
                        // current event, as getElementText() requires
                        reader.nextEvent();
                        System.out.println(reader.getElementText());
                        break;
                  }
            }

            reader.nextEvent();
      }
} finally {
      reader.close();
}

input.close();

The returned event objects are immutable and the application can cache them beyond the parsing process. However, it is possible for the application to define a different event retention (and reuse) policy, as you will see in Part 3.
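A minimal sketch of such caching follows: all events are collected into a list, the reader is closed, and the cached events are inspected afterwards. The class name EventCache and its two helper methods are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

public class EventCache {
    // Reads the whole document and keeps every event object.
    static List<XMLEvent> readAll(String xml) throws XMLStreamException {
        XMLEventReader reader =
                XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
        List<XMLEvent> events = new ArrayList<XMLEvent>();
        try {
            while (reader.hasNext()) {
                events.add(reader.nextEvent());
            }
        } finally {
            reader.close();
        }
        return events;
    }

    // Works on the cached events only; the reader is already closed.
    static List<String> elementNames(List<XMLEvent> events) {
        List<String> names = new ArrayList<String>();
        for (XMLEvent event : events) {
            if (event.isStartElement()) {
                names.add(event.asStartElement().getName().getLocalPart());
            }
        }
        return names;
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(elementNames(readAll("<a><b/><c/></a>")));
    }
}
```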

The application can also use getProperty(String) to obtain the value of a custom or pre-defined property from the underlying implementation. When finished, the application should close the reader by calling its close() method in order to release any resources that were acquired in the process.

Iterating over a stream of events using XMLEventReader is rather straightforward. Processing these events requires the knowledge and understanding of StAX XMLEvent hierarchy, which is discussed next.


Events and how to use them in your XML parser

As already emphasized, XMLEventReader communicates its state to the application through event objects after every step of the parsing process. The standard types of event objects used throughout this API are defined in package javax.xml.stream.events. Interface XMLEvent represents the root of this type hierarchy; all event types must extend this interface. There is an interface for representing each cursor-level event type (as in the cursor-based API) defined in interface XMLStreamConstants. However, it is possible to use custom interfaces (as long as they extend XMLEvent) as you will learn in Part 3.


Navigating the XMLEvent hierarchy

After retrieving an event from the parser, the application typically needs to downcast it into one of the XMLEvent sub-types in order to access its type-specific information. There are several ways to accomplish this. In addition to the brute-force instanceof checks (that is, a sequence of if/then statements testing to see if the returned event implements the desired interface), XMLEvent provides method getEventType(), which returns one of the event constants defined in XMLStreamConstants. This information can be used as the basis for downcasting the event. For instance, if the event's getEventType() returns START_ELEMENT, it can be safely downcast to StartElement.

Another way to determine the concrete type of the event is to use one of the boolean query methods provided for this purpose. For instance, isAttribute() returns true if the event is an Attribute, isStartElement() if it is a StartElement, and so on. Finally, several convenience methods can perform the downcast. asStartElement(), asEndElement(), and asCharacters() downcast the appropriate event into StartElement, EndElement, and Characters, respectively.

In Listing 3, you use methods isStartElement() and asStartElement() to first determine if the retrieved event is a StartElement, then downcast it to the StartElement type, which allows you to access the element's name.


Listing 3. Determining event type and downcasting it to the corresponding interface

// get an event from the reader
...
if (event.isStartElement()) {
      StartElement start = event.asStartElement();
      // use methods provided by StartElement, for example:
      QName name = start.getName();
      ...
}
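The getEventType() approach mentioned earlier can be sketched as a switch over the XMLStreamConstants values, with a plain cast in each branch. The class name, the describe helper, and the "start:", "end:", and "text:" labels are illustrative assumptions.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Characters;
import javax.xml.stream.events.EndElement;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

public class EventTypeSwitch {
    // Downcasts based on the event type constant rather than instanceof checks.
    static String describe(XMLEvent event) {
        switch (event.getEventType()) {
        case XMLStreamConstants.START_ELEMENT:
            return "start:" + ((StartElement) event).getName().getLocalPart();
        case XMLStreamConstants.END_ELEMENT:
            return "end:" + ((EndElement) event).getName().getLocalPart();
        case XMLStreamConstants.CHARACTERS:
            return "text:" + ((Characters) event).getData();
        default:
            return "other";
        }
    }

    static List<String> describeAll(String xml) throws XMLStreamException {
        XMLEventReader reader =
                XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
        List<String> result = new ArrayList<String>();
        try {
            while (reader.hasNext()) {
                result.add(describe(reader.nextEvent()));
            }
        } finally {
            reader.close();
        }
        return result;
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(describeAll("<a>hi</a>"));
    }
}
```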

In addition to the type hierarchy-related methods, XMLEvent provides methods getLocation(), getSchemaType(), and writeAsEncodedUnicode(Writer). Method getLocation() returns a Location object that provides optional information about the location of the event in the underlying input source (for example, the line and column number indicating where the event ends). Method getSchemaType() can retrieve optional XML Schema information related to the given event (if it is made available by the implementation). Finally, method writeAsEncodedUnicode(Writer) defines a contract for writing event objects out to a java.io.Writer in a standard manner. This is especially useful for defining custom events (discussed in the next installment) as it allows the serializer to delegate the serialization of any XMLEvent derivative rather than requiring the application to use a custom serializer.
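The following sketch shows getLocation() and writeAsEncodedUnicode(Writer) in use. The class and method names are hypothetical, and because location information is optional, the line number returned depends on what the implementation chooses to report.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

public class LocationAndSerialization {
    private static XMLEventReader open(String xml) throws XMLStreamException {
        return XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
    }

    // Line number reported for the first start tag with the given local name, or -1.
    static int lineOf(String xml, String localName) throws XMLStreamException {
        XMLEventReader reader = open(xml);
        try {
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isStartElement()
                        && localName.equals(event.asStartElement().getName().getLocalPart())) {
                    return event.getLocation().getLineNumber();
                }
            }
            return -1;
        } finally {
            reader.close();
        }
    }

    // Lets every event between StartDocument and EndDocument serialize itself.
    static String serializeContent(String xml) throws XMLStreamException {
        XMLEventReader reader = open(xml);
        StringWriter out = new StringWriter();
        try {
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (!event.isStartDocument() && !event.isEndDocument()) {
                    event.writeAsEncodedUnicode(out);
                }
            }
        } finally {
            reader.close();
        }
        return out.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(lineOf("<a>\n<b/>\n</a>", "b"));
        System.out.println(serializeContent("<a>hi</a>"));
    }
}
```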


Processing XML documents

When parsing a stream that represents a whole XML document, the first event returned by XMLEventReader is StartDocument. This interface provides methods for obtaining information about the document itself. For instance, method getSystemId() returns the document's system ID, if known. Method getVersion() returns the XML version used in this document. The default version is 1.0, unless some other value is specified in the document's XML declaration.

Method getCharacterEncodingScheme() returns the document's character encoding, either specified explicitly in its XML declaration or auto-detected by the parser. Its default value is UTF-8. Method isStandalone() returns true unless external markup declarations are present, or the value is specified explicitly in the document's XML declaration.
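As a rough sketch, the StartDocument methods described above can be read from the first event of a stream as follows. The class name, the describe helper, and the output format are assumptions made for this example.

```java
import java.io.StringReader;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.StartDocument;
import javax.xml.stream.events.XMLEvent;

public class StartDocumentInfo {
    // Summarizes what the StartDocument event says about the document.
    static String describe(String xml) throws XMLStreamException {
        XMLEventReader reader =
                XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
        try {
            XMLEvent first = reader.nextEvent();
            if (!first.isStartDocument()) {
                throw new XMLStreamException("expected StartDocument as the first event");
            }
            StartDocument doc = (StartDocument) first;
            return "version=" + doc.getVersion()
                    + ", standalone=" + doc.isStandalone()
                    + ", encoding=" + doc.getCharacterEncodingScheme();
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(describe(
                "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?><a/>"));
    }
}
```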

Accessing DTDs

If the XMLEventReader encounters a DTD, it returns it as a DTD event. If the application does not care about DTDs, it can request that this behavior be turned off by setting the parser's javax.xml.stream.supportDTD property to false. The event's getDocumentTypeDeclaration() method can retrieve the entire DTD as a string, including the internal subset. The implementation might actually process the DTD into a more structured representation (which is provider-specific) and make it available through the getProcessedDTD() method. Method getEntities() returns a list of EntityDeclaration events (described below) representing general entity declarations, both internal and external. Finally, method getNotations() returns a list of NotationDeclaration events (also described below) used to represent any declared notations.

The EntityDeclaration event represents unparsed general entities declared in the document's DTD. This event is not reported individually, but rather as part of the DTD event. It provides methods for obtaining the entity's name, its public and system IDs, as well as its associated notation name (methods getName(), getPublicId(), getSystemId(), and getNotationName(), respectively). If this is an internal entity, method getReplacementText() can retrieve its replacement text.

Similarly, NotationDeclaration is an event that's only accessible through the DTD event. It represents notation declarations. In addition to its name (method getName()), this interface provides methods to retrieve the notation's public and system IDs (methods getPublicId() and getSystemId(), respectively). At least one of the two must be available.

Listing 4 demonstrates how to process unparsed external entity references. In the example, a fictitious catalog document contains references to publications whose contents might reside in a PDF or HTML file (neither of which is valid XML). While iterating over the events, you extract notation declarations from the DTD event and cache them by name. When an entity reference is encountered, you obtain its entity declaration and retrieve the cached notation declaration by its name. A real-world application might use the notation identifier to locate the appropriate content processor and use the entity's system identifier as its input.


Listing 4. Example showing how to obtain information about unparsed entities and notations

final String xml = "<?xml version=\"1.0\" standalone=\"no\" ?>" +
            "<!DOCTYPE catalog [" +
            "<!ELEMENT catalog (publication+) >" +
            "<!ELEMENT publication (#PCDATA) >" +
            "<!ATTLIST publication title CDATA #REQUIRED >" +
            "<!NOTATION pdf SYSTEM \"application/pdf\" >" +
            "<!NOTATION html SYSTEM \"text/html\" >" +
            "<!ENTITY overview SYSTEM \"resources/overview.pdf\" NDATA pdf >" +
            "<!ENTITY chapter1 SYSTEM \"resources/chapter_1.html\" NDATA html >" +
            "]>" +
            "<catalog>" +
            "<publication title=\"Overview\">&overview;</publication>" +
            "<publication title=\"Chapter 1\">&chapter1;</publication>" +
            "</catalog>";
// notation declarations cached by name
Map notations = new HashMap();
StringReader input = new StringReader(xml);
XMLInputFactory f = XMLInputFactory.newInstance();
XMLEventReader r = f.createXMLEventReader("http://example.com/catalog.xml", input);
PrintWriter out = new PrintWriter(System.out);
try {
      while (r.hasNext()) {
            XMLEvent event = r.nextEvent();
            switch (event.getEventType()) {
            case XMLStreamConstants.ENTITY_REFERENCE:
                  // look up the notation declared for the referenced entity
                  EntityReference ref = (EntityReference) event;
                  EntityDeclaration decl = ref.getDeclaration();
                  NotationDeclaration n = (NotationDeclaration) 
                        notations.get(decl.getNotationName());

                  out.print("Object of type ");
                  out.print(n.getSystemId());
                  out.print(" located at ");
                  out.print(decl.getSystemId());
                  out.print(" would be placed here.");
                  break;
            case XMLStreamConstants.DTD:
                  // cache every notation declaration found in the DTD
                  DTD dtd = (DTD) event;
                  for (Iterator i = dtd.getNotations().iterator(); i.hasNext();) {
                        n = (NotationDeclaration) i.next();
                        notations.put(n.getName(), n);
                  }
                  // fall through so that the DTD event is written out as well
            default:
                  event.writeAsEncodedUnicode(out);
                  out.println();
            }
      }
} finally {
      r.close();
}

input.close();
out.flush();


Processing elements, attributes, and namespace declarations

For every element, XMLEventReader returns a StartElement event to represent its opening tag, and eventually a corresponding EndElement event to represent its closing tag. Even for empty elements that don't have separate opening and closing tags (for example, <empty-element/>) the reader returns an EndElement event immediately following the StartElement.

You are likely to work with StartElement more frequently than any other event, as it is typically used to represent most of the information in an XML document. To retrieve the element's qualified name, call getName(). Class QName represents qualified XML names; it encapsulates all components of a qualified name, such as its namespace URI, prefix, and local name. Method getNamespaceContext() can retrieve the current namespace context with information about all namespaces currently in scope. To retrieve the element's attributes, call getAttributes(), or retrieve them individually by name (if known beforehand) using getAttributeByName(QName). Similarly, you can obtain any namespaces declared on the element by calling getNamespaces(). Method getNamespaceURI(String) returns the namespace bound to a specific prefix in the current context.

Although they are modeled as events and represented by interface Attribute, an element's attributes are typically not reported as individual events. Instead, they are accessible from the StartElement event. Method getName() returns the attribute's qualified name and getValue() its value as a string. Call isSpecified() to determine if the attribute is actually specified on the element, or implied by the document's schema. Method getDTDType() returns the attribute's declared type (such as CDATA, IDREF, or NMTOKEN).

Similarly, any namespaces declared on an element are accessible from the StartElement event rather than being reported individually. Interface Namespace actually extends Attribute, because namespaces are in fact specified as attributes of an element (with a special prefix). Method getPrefix() is a shorthand for getting the namespace attribute's local name (unless it is a default namespace declaration, in which case the prefix is an empty string rather than "xmlns"). Similarly, method getNamespaceURI() returns the attribute's value (that is, the declared namespace URI). To determine whether the namespace is the default namespace (with an empty prefix), call isDefaultNamespaceDeclaration().

EndElement represents the element's closing tag (or simply the end of the element's markup, if it is an empty element). Its getName() method can be used to obtain the element's qualified name, and getNamespaces() to find out which namespaces have gone out of scope.
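As a small sketch of the namespace handling described above, the following collects the namespace declarations found on the first element of a document into a map from prefix to URI. The class name, the helper name, and the choice to key the default namespace under an empty string are assumptions for illustration.

```java
import java.io.StringReader;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Namespace;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

public class NamespaceDeclarations {
    // Maps prefix -> namespace URI for declarations on the first element.
    // The default namespace is stored under the empty string (a choice made here).
    static Map<String, String> declaredOnFirstElement(String xml) throws XMLStreamException {
        XMLEventReader reader =
                XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
        Map<String, String> result = new LinkedHashMap<String, String>();
        try {
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isStartElement()) {
                    StartElement start = event.asStartElement();
                    // Namespaces are read from the StartElement, not as separate events.
                    for (Iterator<?> i = start.getNamespaces(); i.hasNext();) {
                        Namespace ns = (Namespace) i.next();
                        String prefix = ns.getPrefix();
                        result.put(prefix == null ? "" : prefix, ns.getNamespaceURI());
                    }
                    break;
                }
            }
        } finally {
            reader.close();
        }
        return result;
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(declaredOnFirstElement(
                "<feed xmlns=\"http://www.w3.org/2005/Atom\""
                + " xmlns:ex=\"http://example.com/ext\"/>"));
    }
}
```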

Listing 5 reports all Atom extension elements and attributes (that is, those that do not belong to the Atom namespace or the XML namespace). For each StartElement event, check whether its namespace URI is the Atom namespace URI; if it is not, report the element as an extension. You then iterate over all its attributes and use the Attribute interface to get their names. Finally, report any attribute that does not belong to the Atom or XML namespaces.


Listing 5. Retrieving an element's attributes from the StartElement event

final String ATOM_NS = "http://www.w3.org/2005/Atom";

URL url = new URL(uri);
InputStream input = url.openStream();
XMLInputFactory f = XMLInputFactory.newInstance();
XMLEventReader r = f.createXMLEventReader(uri, input);
try {
      while (r.hasNext()) {
            XMLEvent event = r.nextEvent();
            if (event.isStartElement()) {
                  StartElement start = event.asStartElement();
                  boolean isExtension = false;
                  boolean elementPrinted = false;
                  if (!ATOM_NS.equals(start.getName().getNamespaceURI())) {
                        System.out.println(start.getName());
                        isExtension = true;
                        elementPrinted = true;
                  }

                  for (Iterator i = start.getAttributes(); i.hasNext();) {
                        Attribute attr = (Attribute) i.next();
                        String ns = attr.getName().getNamespaceURI();
                        if (ATOM_NS.equals(ns))
                              continue;

                        if ("".equals(ns) && !isExtension)
                              continue;

                        if ("xml".equals(attr.getName().getPrefix()))
                              continue;

                        if (!elementPrinted) {
                              elementPrinted = true;
                              System.out.println(start.getName());
                        }

                        System.out.print("\t");
                        System.out.println(attr);
                  }
            }
      }
} finally {
      r.close();
}

input.close();

Representing text content

The Characters event is actually used to represent three types of text events: text that is the actual content (CHARACTERS), CDATA sections, and ignorable whitespace (SPACE). It provides methods to distinguish between these sub-types; isCData() returns true if this is a CDATA event and isIgnorableWhiteSpace() if it is a SPACE event. Method getData() returns the event's text. Finally, isWhiteSpace() indicates if the text consists of all whitespace characters (which might not necessarily be ignorable whitespace).
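One common use of isWhiteSpace() is to drop whitespace-only text, such as indentation between elements, while keeping real content. The sketch below does that; the class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Characters;
import javax.xml.stream.events.XMLEvent;

public class TextContent {
    // Collects the data of every Characters event that is not all whitespace.
    static List<String> nonBlankText(String xml) throws XMLStreamException {
        XMLEventReader reader =
                XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
        List<String> texts = new ArrayList<String>();
        try {
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isCharacters()) {
                    Characters c = event.asCharacters();
                    // isWhiteSpace() is true for whitespace-only text, whether or
                    // not the parser could classify it as ignorable.
                    if (!c.isWhiteSpace()) {
                        texts.add(c.getData());
                    }
                }
            }
        } finally {
            reader.close();
        }
        return texts;
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(nonBlankText("<a>\n  <b>one</b>\n  <b>two</b>\n</a>"));
    }
}
```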

The EntityReference event is reported for unresolved general entity references. For parsed entities it is only reported if the reader's javax.xml.stream.isReplacingEntityReferences property is set to false. Otherwise, the parser is required to replace internal entity references with their replacement text (as specified in their declaration) and report them as regular character events, or resolve external entities and report them as regular markup. Interface EntityReference provides methods for obtaining the entity's name and its declaration (as an EntityDeclaration event) by calling getName() and getDeclaration(), respectively.

Events PROCESSING_INSTRUCTION and COMMENT are represented by ProcessingInstruction and Comment, respectively. ProcessingInstruction provides methods getTarget() and getData() for retrieving the instruction's target and data. Method getText() defined on interface Comment can retrieve the comment's text.

Listing 6 shows an example of how to use the Characters event to report the various types of text content. It also shows the use of interfaces Comment and ProcessingInstruction.


Listing 6. Reporting character and processing instruction events

final String ATOM_NS = "http://www.w3.org/2005/Atom";

URL url = new URL(uri);
InputStream input = url.openStream();

XMLInputFactory f = XMLInputFactory.newInstance();
XMLEventReader r = f.createXMLEventReader(uri, input);
try {
      while (r.hasNext()) {
            XMLEvent event = r.nextEvent();
            if (event.isCharacters()) {
                  Characters c = event.asCharacters();
                  System.out.print("Characters");
                  if (c.isCData()) {
                        System.out.print(" (CDATA):");
                        System.out.println(c.getData());
                  } else if (c.isIgnorableWhiteSpace()) {
                        System.out.println(" (IGNORABLE SPACE)");
                  } else if (c.isWhiteSpace()) {
                        System.out.println(" (EMPTY SPACE)");
                  } else {
                        System.out.print(": ");
                        System.out.println(c.getData());
                  }
            } else if (event.isProcessingInstruction()) {
                  ProcessingInstruction pi = (ProcessingInstruction) event;
                  System.out.print("PI(");
                  System.out.print(pi.getTarget());
                  System.out.print(", ");
                  System.out.print(pi.getData());
                  System.out.println(")");
            } else if (event.getEventType() == XMLStreamConstants.COMMENT) {
                  System.out.print("Comment: ");
                  System.out.println(((Comment) event).getText());
            }
      }
} finally {
      r.close();
}

input.close();

The last event typically delivered by XMLEventReader is EndDocument, which does not define any new methods.


Filtering events and manipulating the event stream

As you can see, parsing XML using XMLEventReader along with XMLEvent and its sub-types is quite straightforward. By controlling the parsing process, the application can decide what to do with each event. However, it is also possible to create specialized event readers for situations where the application (or a component of it) expects an event stream with a certain type of content. For instance, one can easily create a filtered XMLEventReader that only lets certain events pass through to the caller. This can be done by calling the method createFilteredReader(XMLEventReader, EventFilter) on an instance of XMLInputFactory, passing in the base event reader and a simple filter that accepts or rejects events obtained from the base reader. Listing 7 shows an example of such a filter (this particular one only accepts Processing Instruction events, but the application is free to define any criteria for accepting events).


Listing 7. Using an EventFilter to find Processing Instructions in the document

URL url = new URL(uri);
InputStream input = url.openStream();

XMLInputFactory factory = XMLInputFactory.newInstance();
XMLEventReader r = factory.createXMLEventReader(uri, input);
XMLEventReader fr = factory.createFilteredReader(r, new EventFilter() {
      public boolean accept(XMLEvent e) {
            // let only processing instructions through
            return e.getEventType() == XMLStreamConstants.PROCESSING_INSTRUCTION;
      }
});

try {
      while (fr.hasNext()) {
            XMLEvent e = fr.nextEvent();
            if (e.getEventType() == XMLStreamConstants.PROCESSING_INSTRUCTION) {
                  ProcessingInstruction pi = (ProcessingInstruction) e;
                  System.out.println(pi.getTarget() + ": " + pi.getData());
            }
      }
} finally {
      fr.close();
      r.close();
}

input.close();

To perform more sophisticated stream manipulation, extend EventReaderDelegate, a utility class defined in package javax.xml.stream.util. This class allows the developer to wrap an existing XMLEventReader to which all calls are delegated by default. The subclass can then override any particular method to alter the behavior of the base reader. For instance, one can use this approach to inject synthetic events into the event stream, or otherwise transform it. An application iterating over such a modified stream would not need to know that it was manipulated. Listing 8 shows an example of this technique.


Listing 8. Injecting a Comment into the stream with an EventReaderDelegate

URL url = new URL(uri);
InputStream input = url.openStream();

XMLInputFactory factory = XMLInputFactory.newInstance();
XMLEventReader r = factory.createXMLEventReader(uri, input);

XMLEventReader fr = new EventReaderDelegate(r) {

      private Comment comment;

      public XMLEvent nextEvent() throws XMLStreamException {
            XMLEvent event = null;
            if (comment != null) {
                  event = comment;
                  comment = null;
                  return event;
            }

            event = super.nextEvent();
            if (event.isStartDocument()) {
                  XMLEventFactory ef = XMLEventFactory.newInstance();
                  comment = ef.createComment("Generated " + new Date());
            }

            return event;
      }
};

OutputStreamWriter writer = new OutputStreamWriter(System.out);
try {
      while (fr.hasNext()) {
            XMLEvent event = fr.nextEvent();
            event.writeAsEncodedUnicode(writer);
      }
} finally {
      fr.close();
      r.close();
}

input.close();
writer.flush();

Note that in order to implement this example, the application must be able to create instances of standard events (a Comment, in this case). This functionality is provided by class XMLEventFactory, which defines creation methods for each standard event type (in fact, there are several overloaded versions of each method, each with a different set of arguments, depending on the event type). Like XMLInputFactory, this class implements the Abstract Factory pattern: call newInstance() to obtain a concrete instance of it.
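A brief sketch of XMLEventFactory on its own follows. It builds a handful of standard events without any parser involved; the class name, the buildNote helper, and the "note" element are hypothetical and serve only to show the creation methods.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.events.XMLEvent;

public class EventFactoryExample {
    // Builds the events for: <!--generated--><note>text</note>
    static List<XMLEvent> buildNote(String text) {
        XMLEventFactory ef = XMLEventFactory.newInstance();
        List<XMLEvent> events = new ArrayList<XMLEvent>();
        events.add(ef.createComment("generated"));
        // Arguments are prefix, namespace URI, local name; empty strings
        // mean the element is in no namespace.
        events.add(ef.createStartElement("", "", "note"));
        events.add(ef.createCharacters(text));
        events.add(ef.createEndElement("", "", "note"));
        return events;
    }

    public static void main(String[] args) {
        for (XMLEvent event : buildNote("hello")) {
            System.out.println(event.getEventType());
        }
    }
}
```

Events created this way implement the same interfaces as parsed events, so code that consumes XMLEvent objects does not need to know where they came from.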


Summary

These are just some of the examples of what can be done with the event iterator-based API provided by StAX, thanks to its flexibility and ease of use. In Part 3, you will take a look at how to create and use custom events. You will also explore the StAX serializer API.


Resources

Learn

Get products and technologies

Discuss

