
Tip: Merge XML documents with StAX

Use the high-level, event-based API for pipelined XML applications

Berthold Daum (berthold.daum@bdaum.de), President, BDaum Industrial Communications
Berthold Daum is a consultant and writer based in Lützelbach, Germany. For information on his recent books, System Architecture with XML and Modeling Business Objects with XML Schema (both from Morgan Kaufmann), see http://www.bdaum.de. You can contact Berthold at berthold.daum@bdaum.de.

Summary:  Deriving new XML documents from input documents is where the Streaming API for XML (StAX) shines. This tip explores how client applications can utilize the event-based API to efficiently merge two incoming XML documents into one.


Date:  07 Jan 2004
Level:  Intermediate

In my previous tip, "Write XML documents with StAX", I showed how to use the low-level, cursor-based StAX API to create XML documents programmatically. In this tip, I demonstrate the high-level, event-based API by building a program that merges two incoming XML documents into one.

Processing several XML documents simultaneously can be a significant challenge. SAX parsers, for example, deliver parsing events to the client application through callbacks. Because the SAX parser controls this process, the client application has no practical way to synchronize the different input sources. Programmers therefore usually resort to a DOM parser for multi-document processing. The penalty is excessive resource usage: the node trees of all input documents must reside completely in memory.

StAX does not suffer from these drawbacks. As its name indicates, it is targeted at streaming applications such as the merging of two documents. The following example shows how this is done. Assume that you want to merge two documents containing lists of products. Each document consists of a <products> element that contains one or more <product> elements sorted alphabetically by their pid attribute. Listing 1 is an example of such a document:


Listing 1. Product list
<products>
   <product pid="01"/>
   <product pid="05"/>
   <product pid="09"/>
</products>
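
The claim that StAX lets the client synchronize several input sources can be illustrated with a minimal sketch. It assumes a Java runtime with a bundled StAX implementation (Java SE 6 or later); the class name LockstepDemo, the helper methods nextPid() and alternate(), and the in-memory sample documents are hypothetical and not part of the original download. Because the client pulls each event itself, it can advance two readers in any order it likes. Here it simply alternates between them:

```java
import java.io.StringReader;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.XMLEvent;

public class LockstepDemo {

   private static final QName PID = new QName("pid");

   // Pull events until the next <product> start tag and return its pid,
   // or null once the document is exhausted.
   static String nextPid(XMLEventReader reader) throws XMLStreamException {
      while (reader.hasNext()) {
         XMLEvent event = reader.nextEvent();
         if (event.isStartElement()
               && "product".equals(event.asStartElement().getName().getLocalPart())) {
            Attribute pid = event.asStartElement().getAttributeByName(PID);
            return pid == null ? "" : pid.getValue();
         }
      }
      return null;
   }

   // Alternate between two documents, one <product> element at a time.
   static String alternate(String xml1, String xml2) throws XMLStreamException {
      XMLInputFactory factory = XMLInputFactory.newInstance();
      XMLEventReader r1 = factory.createXMLEventReader(new StringReader(xml1));
      XMLEventReader r2 = factory.createXMLEventReader(new StringReader(xml2));
      StringBuilder seen = new StringBuilder();
      String p1 = nextPid(r1);
      String p2 = nextPid(r2);
      while (p1 != null || p2 != null) {
         // The client decides which reader to advance next
         if (p1 != null) {
            seen.append(p1).append(' ');
            p1 = nextPid(r1);
         }
         if (p2 != null) {
            seen.append(p2).append(' ');
            p2 = nextPid(r2);
         }
      }
      r1.close();
      r2.close();
      return seen.toString().trim();
   }

   public static void main(String[] args) throws XMLStreamException {
      String doc1 = "<products><product pid=\"01\"/><product pid=\"05\"/>"
         + "<product pid=\"09\"/></products>";
      String doc2 = "<products><product pid=\"02\"/><product pid=\"03\"/></products>";
      System.out.println(alternate(doc1, doc2)); // prints "01 02 05 03 09"
   }
}
```

With a SAX parser, the same interleaving would require running the parsers in separate threads or buffering one document, because each parser drives its own callbacks to completion.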

In Listing 2, I use a classical merge algorithm to merge the lists from both documents. Depending on how the merge criteria (the pid values) of the two documents compare, I copy events either from document 1 or from document 2 to the output document. The copying is done by the readToNextElement() method, which writes events until it reaches the next <product> element and returns that element's pid value. The method contains some extra logic for detecting the end of the product list. The beginning and the end of the document also require special treatment.
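
For readers who want to see the classical merge in isolation, here is a minimal sketch on plain sorted lists of strings, without any XML. The class name SortedMerge and its merge() method are hypothetical illustrations; the branch condition deliberately mirrors the comparison used in Listing 2, with an exhausted list playing the role of a null pid.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SortedMerge {

   // Merge two lists that are already sorted in ascending order.
   static List<String> merge(List<String> a, List<String> b) {
      List<String> out = new ArrayList<>();
      int i = 0;
      int j = 0;
      while (i < a.size() || j < b.size()) {
         // Take from a when b is exhausted, or when a's current key
         // is not greater than b's current key
         if (j == b.size()
               || (i < a.size() && a.get(i).compareTo(b.get(j)) <= 0)) {
            out.add(a.get(i++));
         } else {
            out.add(b.get(j++));
         }
      }
      return out;
   }

   public static void main(String[] args) {
      List<String> list1 = Arrays.asList("01", "05", "09");
      List<String> list2 = Arrays.asList("02", "03", "10");
      System.out.println(merge(list1, list2)); // prints [01, 02, 03, 05, 09, 10]
   }
}
```

Because the keys are compared as strings, the ordering is alphabetical, which matches the sort order assumed for the pid attribute. Equal keys are both kept, with the entry from the first list written first.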


Listing 2. Merging documents
import java.io.*;
import javax.xml.namespace.QName;
import javax.xml.stream.*;
import javax.xml.stream.events.XMLEvent;

public class Merger {

   private static final QName prodName = new QName("product");
   private static final QName pidName = new QName("pid");

   public static void main(String[] args)
      throws FileNotFoundException, XMLStreamException {
         
      // Use the reference implementation for the XML input factory
      // (omit this call on Java SE 6 or later, which bundles a StAX implementation)
      System.setProperty(
         "javax.xml.stream.XMLInputFactory",
         "com.bea.xml.stream.MXParserFactory");
      // Create the XML input factory
      XMLInputFactory factory = XMLInputFactory.newInstance();
      // Create XML event reader 1
      XMLEventReader r1 = 
         factory.createXMLEventReader(new FileReader("prodList1.xml"));
      // Create XML event reader 2
      XMLEventReader r2 = 
         factory.createXMLEventReader(new FileReader("prodList2.xml"));

      // Create the output factory
      XMLOutputFactory xmlof = XMLOutputFactory.newInstance();
      // Create XML event writer
      XMLEventWriter xmlw = xmlof.createXMLEventWriter(System.out);

      // Read to first <product> element in document 1
      // and output to result document
      String pid1 = readToNextElement(r1, xmlw, false);
      // Read to first <product> element in document 2
      // without writing to result document
      String pid2 = readToNextElement(r2, null, false);
      // Loop over both XML input streams
      while (pid1 != null || pid2 != null) {
         // Compare merge criteria
         if (pid2 == null || (pid1 != null && pid1.compareTo(pid2) <= 0))
            // Continue in document 1
            pid1 = readToNextElement(r1, xmlw, pid2 == null);
         else
            // Continue in document 2
            pid2 = readToNextElement(r2, xmlw, pid1 == null);
      }
      xmlw.close();
   }

   /**
    * @param reader - the document reader
    * @param writer - the document writer
    * @param processEnd - forces the document end to be written
    * @return - the next merge criterion value
    * @throws XMLStreamException
    */
   private static String readToNextElement(XMLEventReader reader,
         XMLEventWriter writer, boolean processEnd) throws XMLStreamException {
      // Nesting level
      int level = 0;
      while (true) {
         // Read event to be written to result document
         XMLEvent event = reader.nextEvent();
         // Avoid double processing of document end
         if (!processEnd)
            switch (event.getEventType()) {
               case XMLEvent.START_ELEMENT :
                  ++level;
                  break;
               case XMLEvent.END_ELEMENT :
                  if (--level < 0)
                     return null;
                  break;
            }
         // Output event
         if (writer != null)
            writer.add(event);
         // Look at next event
         event = reader.peek();
         switch (event.getEventType()) {
            case XMLEvent.START_ELEMENT :
               // Start element - stop at <product> element
               QName name = event.asStartElement().getName();
               if (name.equals(prodName)) {
                  return event
                     .asStartElement()
                     .getAttributeByName(pidName)
                     .getValue();
               }
               break;
            case XMLEvent.END_DOCUMENT :
               // Stop at end of document
               return null;
         }
      }
   }
}

As you can see, the event-based API is ideally suited for deriving a document from other documents. With the low-level, cursor-based API, you would need a different method call for each event type, but with the event-based API you just pass generic events to the event writer's add() method and that's it.
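
The convenience of add() can be seen in a minimal sketch that copies a document unchanged from an event reader to an event writer. It assumes a Java runtime with a bundled StAX implementation; the class name EventCopy and its copy() method are hypothetical. The exact serialization (for example, whether an XML declaration is written, or how empty elements are rendered) depends on the StAX implementation in use.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;

public class EventCopy {

   // Copy every event from the input document to the output unchanged.
   static String copy(String xml) throws XMLStreamException {
      XMLEventReader reader =
         XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
      StringWriter out = new StringWriter();
      XMLEventWriter writer =
         XMLOutputFactory.newInstance().createXMLEventWriter(out);
      while (reader.hasNext()) {
         // One generic call handles start tags, end tags, text, and so on
         writer.add(reader.nextEvent());
      }
      writer.close();
      reader.close();
      return out.toString();
   }

   public static void main(String[] args) throws XMLStreamException {
      System.out.println(copy("<products><product pid=\"01\"/></products>"));
   }
}
```

A cursor-based version of the same loop would have to switch on the event type and call writeStartElement(), writeAttribute(), writeCharacters(), writeEndElement(), and so on for each case.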


Summary

This tip has demonstrated the use of the event-based API of StAX for pipelined XML applications, such as the merging of documents. As of Nov 3, 2003, StAX has passed the Final JSR-0173 Approval Ballot. It will make a valuable addition to every Java programmer's toolbox.



Download

Name                    Size   Download method
x-tipstx5_merger.zip    2KB    HTTP



