Implement a web service that deals with complex XML documents

Alexander Sasha Ananiev, IT Architect, EMC
Alexander Sasha Ananiev is an IT Architect for IBM Global Services. He has over 16 years of experience in designing and developing computer systems using a variety of languages and technologies. He's currently working on implementing SOA solutions in the public sector. You can reach him at alexander.v.ananiev@us.ibm.com.

Summary:  Examine an approach for building a web service that is capable of efficiently handling large XML documents. This article illustrates how to build such a web service in a Java™ 2 Platform, Enterprise Edition (J2EE) environment using streaming XML parsing, Java Message Service (JMS), and Java APIs for XML-Based Remote Procedure Call (JAX-RPC).

Date:  29 Nov 2005
Level:  Intermediate


Introduction

Many document-style web services have to be able to process complex XML documents. An XML document might contain a large number of data elements that often have many nested fragments and repeating groups.

One particular situation that leads to the use of large documents is when multiple requests are bundled together and sent to a service in a single call. This scenario has not been covered in detail by respected sources in the field; however, it is not as uncommon as it might seem. In many situations, a client's system is schedule-driven and many records have to be processed within a limited amount of time, such as during the payroll processing cycle. Also, web services are often put in place to replace legacy batch file-based interfaces. In this situation, a web service that can handle batches of XML documents can provide a transition path to clients who are unable or unwilling to overhaul their systems to support the message-based approach required by proper web services.

Batching XML documents together is an effective performance booster. There is a substantial degree of overhead associated with invoking a web service. A web services client has to create a Simple Object Access Protocol (SOAP) envelope, open a connection (many SOAP clients create a new connection for every call), and transmit the message. The server has to parse the SOAP envelope and perform message decryption and authentication. Hypertext Transfer Protocol (HTTP) and Secure Sockets Layer (SSL) handshake overhead also adds to the invocation time. Performance improves when this overhead is spread over multiple requests instead of being incurred for each individual one.

Fast processing of complex XML documents

So how can you implement a web service that can efficiently handle large and complex messages that could potentially contain multiple business requests bundled together?

The key is to divide a large message into several logical fragments and decide what fragments can be processed concurrently. This means that you want to start processing a fragment without having to wait for the parsing of the entire message to finish.

When multiple requests are bundled together in a single message, each individual request represents an independent fragment that can be processed in parallel with other requests.

The same technique also works for a single (unbundled) complex message. A complex message typically consists of several logical fragments involving repeating groups. An element that is part of a repeating group is a good candidate for being processed concurrently.

Let's consider a simple XML purchase order, as shown in Listing 1 below -- this is a classic XML document used in the World Wide Web Consortium (W3C) XML Schema Primer.


Listing 1. XML purchase order

<?xml version="1.0"?>
<orders>
    <purchaseOrder orderID="111" orderDate="1999-10-20">
       <shipTo country="US">
          <name>Alice Smith</name>
          <street>123 Maple Street</street>
          <city>Mill Valley</city>
          <state>CA</state>
          <zip>90952</zip>
       </shipTo>
       <billTo country="US">
          <name>Robert Smith</name>
          <street>8 Oak Avenue</street>
          <city>Old Town</city>
          <state>PA</state>
          <zip>95819</zip>
       </billTo>
       <comment>Hurry, my lawn is going wild</comment>

       <items>
          <item partNum="872-AA">
             <productName>Lawnmower</productName>
             <quantity>1</quantity>
             <USPrice>148.95</USPrice>
             <comment>Confirm this is electric</comment>
          </item>
          <item partNum="926-AA">
             <productName>Baby Monitor</productName>
             <quantity>1</quantity>
             <USPrice>39.98</USPrice>
             <shipDate>1999-05-21</shipDate>
          </item>
       </items>
    </purchaseOrder>

    <purchaseOrder orderID="999" orderDate="2001-10-20">
        <shipTo country="US">
          <name>John Doe</name>
          <street>123 Main Street</street>
          <city>Mill Valley</city>
          <state>CA</state>
          <zip>90952</zip>
       </shipTo>
       <items>
          <item partNum="999-BB">
             <productName>Chainsaw</productName>
             <quantity>1</quantity>
             <USPrice>250.00</USPrice>
          </item>
       </items>
    </purchaseOrder>
</orders>


The first part of the purchase order contains shipping and billing information. The web service might need to save this information in a database, for example, in the customer table.

The second part contains order line items. The web service must check the inventory when processing each line item. It can also process all the line items concurrently -- the inventory check does not depend on processing shipping and billing information, and line items do not depend on each other. The web service can start processing a line item as soon as the parser reaches that item's closing </item> tag. However, you need to make sure you can still correlate the line item with the purchase order header. This can be accomplished by adding the orderID attribute to each line item XML fragment. Listing 2 shows a complete XML line item with the added orderID attribute.


Listing 2. XML line item

<?xml version="1.0"?>
<item partNum="872-AA" orderID="111">
   <productName>Lawnmower</productName>
   <quantity>1</quantity>
   <USPrice>148.95</USPrice>
   <comment>Confirm this is electric</comment>
</item>


In other words, the web service needs to be able to concurrently process the customer's shipping and billing information as well as all line items.

The easiest way to accomplish this is to leverage the Java™ Message Service (JMS) support built into the Java 2 Platform, Enterprise Edition (J2EE) specification. The added benefit of JMS is that it allows you to fully decouple message producers and consumers so that the consumers can process messages at their own pace. If you need to scale up your service, just add more message consumers -- this can be done by adding more application server instances.

The parsing logic sends each XML fragment, which will be converted into a well-formed XML document, to a JMS destination as soon as the parser reaches the end of a fragment.

You need to create JMS queues for line items and purchase order headers with shipping information. Back-end consumers pick up messages from these queues and execute the necessary business logic.
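To make the producer/consumer decoupling concrete, here is a minimal sketch that uses only the JDK. It is not JMS code: a java.util.concurrent.BlockingQueue stands in for a JMS queue, and a thread pool stands in for back-end consumers. The class name FragmentPipeline, the STOP marker, and the process method are all hypothetical names chosen for illustration.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in: an in-memory queue plays the role of a JMS queue.
class FragmentPipeline {

    // Marker message that tells a consumer to stop (an assumption of this sketch).
    private static final String STOP = "__STOP__";

    // Sends every fragment to the queue and lets the consumers drain it concurrently.
    static Set<String> process(List<String> fragments, int consumers)
            throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        Set<String> processed = ConcurrentHashMap.newKeySet();
        ExecutorService pool = Executors.newFixedThreadPool(consumers);

        for (int i = 0; i < consumers; i++) {
            pool.submit(() -> {
                try {
                    // Each consumer takes messages at its own pace.
                    for (String f = queue.take(); !STOP.equals(f); f = queue.take()) {
                        processed.add(f);   // placeholder for the business logic
                    }
                } catch (InterruptedException ex) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        // The producer only enqueues; it never waits for processing to finish.
        for (String f : fragments) {
            queue.put(f);
        }
        for (int i = 0; i < consumers; i++) {
            queue.put(STOP);
        }

        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return processed;
    }
}
```

With a real JMS provider, the queue and the consumers would live in the application server, and adding consumers would not require any change on the producer side.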

You also have to implement the getStatus operation so that the clients can poll it to find out the status of their orders. Alternatively, you can use WS-Addressing (see Resources) for notifying clients when the order is processed, assuming that your clients can support the WS-Addressing specification.

Coordinating the concurrent processing of fragments for each order requires some effort. In order to arrange for shipping, you'll need to know when all line items are processed. This requires using a persistent state for each order. For example, you can store each processed fragment ID in the database and check the order for completeness every time a new fragment is processed. Since you're using JMS, the messages can arrive out of order -- so you also have to make sure that a line item that is processed before its order header is handled correctly. Note that implementing this logic is beyond the scope of this article.
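Although the full coordination logic is out of scope, the completeness check can be sketched in a few lines. The sketch below keeps the state in memory instead of a database, and it rests on assumptions that are not part of the design described here: the parser sends one extra end-of-order message carrying the item count, partNum is unique within an order, and the status strings are arbitrary. OrderTracker and all of its methods are hypothetical names.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory stand-in for the persistent per-order state.
class OrderTracker {

    private static final class State {
        boolean headerDone;
        int expectedItems = -1;                        // unknown until the order is closed
        final Set<String> itemsDone = new HashSet<>(); // keyed by partNum (assumed unique)

        boolean complete() {
            return headerDone && expectedItems >= 0 && itemsDone.size() == expectedItems;
        }
    }

    private final Map<String, State> orders = new HashMap<>();

    // Creates the state on first contact, so fragments may arrive in any order.
    private State state(String orderId) {
        return orders.computeIfAbsent(orderId, id -> new State());
    }

    // Each method returns true when the order has just become (or already is) complete.
    synchronized boolean headerProcessed(String orderId) {
        State s = state(orderId);
        s.headerDone = true;
        return s.complete();
    }

    synchronized boolean itemProcessed(String orderId, String partNum) {
        State s = state(orderId);
        s.itemsDone.add(partNum);
        return s.complete();
    }

    // Assumed extra message sent when the parser reaches the end of the order.
    synchronized boolean orderClosed(String orderId, int itemCount) {
        State s = state(orderId);
        s.expectedItems = itemCount;
        return s.complete();
    }

    // Backs a polling operation such as getStatus.
    synchronized String getStatus(String orderId) {
        State s = orders.get(orderId);
        if (s == null) {
            return "UNKNOWN";
        }
        return s.complete() ? "COMPLETE" : "IN_PROGRESS";
    }
}
```

Because the state is created by whichever fragment arrives first, a line item that is processed before its header simply leaves the order in progress until the header and the item count catch up.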


Streaming XML parsing

I've already established that you have to be able to start processing XML fragments while the entire XML document is still being parsed. This approach calls for the use of a streaming parsing API. You have a choice of two different streaming APIs: Simple API for XML (SAX) and Pull. The SAX API is one of the oldest XML parsing APIs and it is part of almost any XML parser. The pull-parsing API is more recent; however, it has already been standardized into the "JSR 173: Streaming API for XML (StAX)" specification (see Resources).
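The difference between the two styles is easiest to see side by side. The following sketch counts item elements twice, once with a SAX handler (the parser pushes events into callbacks) and once with a StAX reader (the application pulls events in its own loop). It uses only standard JDK classes; the class name ItemCounter and its method names are made up for this illustration.

```java
import java.io.InputStream;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper that contrasts push (SAX) and pull (StAX) parsing.
class ItemCounter {

    // Push style: the parser drives and calls back into the handler.
    static int countWithSax(InputStream in) throws Exception {
        final int[] count = {0};
        SAXParserFactory.newInstance().newSAXParser().parse(in, new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                // The default factory is not namespace-aware, so qName holds the name.
                if ("item".equals(qName)) {
                    count[0]++;
                }
            }
        });
        return count[0];
    }

    // Pull style: the application drives and asks for the next event.
    static int countWithStax(InputStream in) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        int count = 0;
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT
                    && "item".equals(reader.getLocalName())) {
                count++;
            }
        }
        reader.close();
        return count;
    }
}
```

In the pull version the loop belongs to the caller, which makes it straightforward to stop early or to hand a fragment off as soon as its end tag is seen.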

Both APIs have their pros and cons (see Resources). In general, StAX is somewhat easier to use. The biggest advantage of StAX is that, unlike SAX, it is bi-directional: you can use it to write XML documents as well as to read them. This makes StAX more suitable for XML transformation and conversion tasks.
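As a small illustration of the writing side of StAX, the sketch below builds a line item like the one in Listing 2 with the standard javax.xml.stream.XMLStreamWriter. The class LineItemWriter and its write method are hypothetical, and only a few of the item's child elements are written to keep the example short.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical helper, not part of the purchase order service itself.
class LineItemWriter {

    // Builds a single <item> fragment and returns it as text.
    static String write(String partNum, String orderId, String productName, int quantity)
            throws XMLStreamException {
        StringWriter out = new StringWriter();
        XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(out);

        w.writeStartElement("item");
        w.writeAttribute("partNum", partNum);
        w.writeAttribute("orderID", orderId);

        w.writeStartElement("productName");
        w.writeCharacters(productName);
        w.writeEndElement();

        w.writeStartElement("quantity");
        w.writeCharacters(Integer.toString(quantity));
        w.writeEndElement();

        w.writeEndElement();   // closes <item>
        w.flush();             // push buffered output into the StringWriter
        w.close();
        return out.toString();
    }
}
```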

StAX, being a simple API, should have good performance -- although this really depends on the implementation of the StAX specification. A benchmark published in the "XML Documents on the Run" article (see Resources) supports this theory, although this benchmark predates the release of the StAX specification. Another recent benchmark test also ranks a particular StAX implementation above its competitors. Nevertheless, while superior StAX performance might be an important criterion, you should not choose the parsing API strictly based on the results of performance benchmarks.

Let's look at the purchase order with line items example. In this case, StAX is a very good choice, because you need to transform the complex incoming XML document into more granular documents using straightforward transformation logic.

Listing 3 shows the code that parses the XML purchase order using StAX and converts the message's fragments into smaller XML documents.


Listing 3. Parsing of the purchase order document using the StAX API

// Create the XMLEventReader over the purchase order input stream
XMLEventReader xmlEventReader = 
   XMLInputFactory.newInstance().createXMLEventReader( purchaseOrderInputStream );

StringWriter stringWriter = new StringWriter();
XMLEventWriter xmlEventWriter = 
    XMLOutputFactory.newInstance().createXMLEventWriter( stringWriter );

/* Create purchaseOrder closing tag ahead of time - we will need to 
 * close the purchaseOrder document before we encounter the real closing tag.
 */
XMLEventFactory xmlEventFactory = XMLEventFactory.newInstance();
EndElement endPurchaseOrder = xmlEventFactory.createEndElement( 
        new QName("purchaseOrder"), null);

Attribute orderIdAttr = null;

while(xmlEventReader.hasNext()) {

    XMLEvent e = xmlEventReader.nextEvent();

    /* figure out the name of the element - unfortunately,
     * there is no common Element event supertype in StAX,
     * so we need to check for start and end elements separately
     */
    String eltName = "";
    StartElement startElt = null;
    if ( e.isStartElement() ) {
       startElt = ((StartElement)e);
       eltName = startElt.getName().getLocalPart();
    }
    else if ( e.isEndElement() ) {
        eltName = ((EndElement)e).getName().getLocalPart();
    }

    // preserve purchase order ID so we can add it later to each item 
    if (e.isStartElement() && "purchaseOrder".equals( eltName ) ) {
       orderIdAttr = startElt.getAttributeByName( new QName("orderID") );
       if (orderIdAttr == null ) {
          throw new XMLStreamException( "Missing orderID attribute", e.getLocation());
       }
    }

    /* "items" start tag means that the order-related information
     * has been already parsed, so we can send the order 
     * without any items for processing
     */
    if (e.isStartElement() && "items".equals( eltName ) ) {
        xmlEventWriter.add( endPurchaseOrder );
        // make sure buffered output reaches the StringWriter before sending
        xmlEventWriter.flush();
        // sendMessage (not shown) is assumed to send the buffered text
        // and then clear the buffer so the next fragment starts empty
        sendMessage( stringWriter, "jms/poQueue" );
    }

    /* Since we're sending each item separately, we
     * need to ignore "items" start and end tags.
     * Purchase order is sent when the "items" start tag is encountered,
     * so we have to ignore the purchase order end tag as well.
     */
    if (!eltName.equals("items") && !eltName.equals("orders") &&
          (!e.isEndElement() || !eltName.equals("purchaseOrder") )) {
        xmlEventWriter.add( e );    
    }

    // for each new item, add orderId attribute with the current orderId 
    if (e.isStartElement() && "item".equals( eltName ) ) {
        xmlEventWriter.add( orderIdAttr ); 
    }

    // send each item as an independent message
    if (e.isEndElement() && "item".equals( eltName ) ) {
       xmlEventWriter.flush();
       sendMessage( stringWriter, "jms/itemQueue" );
    }
}
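Listing 3 relies on a sendMessage helper that is not shown, so it cannot be run on its own. The following self-contained variant is a sketch of the same splitting technique under a few assumptions: two in-memory lists stand in for the jms/poQueue and jms/itemQueue destinations, a fresh writer is created for every fragment so that no buffer has to be cleared, and the class name OrderSplitter and its members are hypothetical.

```java
import java.io.InputStream;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.XMLEvent;

// Hypothetical sketch: the two lists stand in for the JMS queues.
class OrderSplitter {
    final List<String> headers = new ArrayList<>();
    final List<String> items = new ArrayList<>();

    private StringWriter buffer;     // text of the fragment being built
    private XMLEventWriter writer;   // null while no fragment is open

    void split(InputStream in) throws XMLStreamException {
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
        XMLEventFactory events = XMLEventFactory.newInstance();
        Attribute orderId = null;

        while (reader.hasNext()) {
            XMLEvent e = reader.nextEvent();
            String name = e.isStartElement() ? e.asStartElement().getName().getLocalPart()
                        : e.isEndElement() ? e.asEndElement().getName().getLocalPart() : "";

            if (e.isStartElement() && "purchaseOrder".equals(name)) {
                orderId = e.asStartElement().getAttributeByName(new QName("orderID"));
                if (orderId == null) {
                    throw new XMLStreamException("Missing orderID attribute", e.getLocation());
                }
                open();
                writer.add(e);
            } else if (e.isStartElement() && "items".equals(name)) {
                // The header is complete: close it early and hand it off.
                writer.add(events.createEndElement(new QName("purchaseOrder"), null));
                close(headers);
            } else if (e.isStartElement() && "item".equals(name)) {
                open();
                writer.add(e);
                writer.add(orderId);   // correlate the item with its order
            } else if (e.isEndElement() && "item".equals(name)) {
                writer.add(e);
                close(items);
            } else if (e.isEndElement() && "purchaseOrder".equals(name) && writer != null) {
                // Order without an <items> element: the header is still open.
                writer.add(e);
                close(headers);
            } else if (writer != null) {
                writer.add(e);         // content inside the currently open fragment
            }
            // Events seen while no fragment is open (<orders>, </items>, ...) are dropped.
        }
        reader.close();
    }

    private void open() throws XMLStreamException {
        buffer = new StringWriter();
        writer = XMLOutputFactory.newInstance().createXMLEventWriter(buffer);
    }

    private void close(List<String> target) throws XMLStreamException {
        writer.flush();
        writer.close();
        target.add(buffer.toString());
        writer = null;
    }
}
```

Run against the document in Listing 1, this should yield two header fragments and three item fragments, each item carrying the orderID of its order. Replacing the list additions with real JMS sends is left out on purpose.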



Implementing the web service using JAX-RPC

So how are you going to embed a complex, and potentially large, XML document into a SOAP message in order to parse it on a server using StAX?

The Java APIs for XML-Based Remote Procedure Call (JAX-RPC) specification does provide some support for document-style web services (services that use literal SOAP binding in a Web Services Description Language (WSDL) document). When an XML document is received, JAX-RPC implementations convert the entire document into a Java object graph, such as a SOAPElement object or a generated SOAPElement wrapper. Alternatively, you can choose to use String in the SOAP body -- the entire XML document will be loaded into a String object as text, which can then be parsed using any XML API. Please refer to the "Patterns and Strategies for Building Document-Based Web Services" article (see Resources) for a detailed analysis of all of the alternatives. Unfortunately, none of these options gives you a clean way of accessing the input stream of the message payload before it is fully parsed by the SOAP server (for example, the SOAP server will attempt to load the entire document into a String object before your service can get access to it). Note that the upcoming JAX-WS 2.0 specification aims to fix this issue, but the specification is still in the proposed final draft stage at the time of this writing.

The JAX-RPC specification also supports another approach. This approach involves using SOAP Messages with Attachments (SwA). SwA relies on a well-known Multipurpose Internet Mail Extensions (MIME) multipart protocol. This protocol is extremely simple. In SwA, the first part of the message contains the SOAP envelope and the subsequent parts contain binary attachments. In this example, you'll only have two parts, one for the envelope and one for the XML document. At first, a choice of SwA for passing XML documents might seem unjustified, because the main purpose of this specification is to provide a mechanism for transferring binary data as part of the SOAP message. However, upon closer analysis, SwA does appear to meet the needs for passing large and complex business XML documents.

First, SwA completely decouples the business XML document from the SOAP envelope. The business document might contain entities, use an encoding other than UTF-8, or refer to a Document Type Definition (DTD) -- in fact, many industry-specific information exchange standards still use DTDs, such as the Mortgage Industry Standards Maintenance Organization (MISMO) standard. In other words, the business XML document might have its own prolog, which is independent of the prolog and declarations of the SOAP envelope (see the link to "Transporting Binary Data in SOAP" in the Resources section for a detailed discussion of the potential problems with XML embedding).

Second, JAX-RPC supports SwA in a way that gives you easy access to the input stream of the document that is passed as an attachment. For example, you can use javax.activation.DataHandler as a parameter type in the service's method signature:


public void process(DataHandler purchaseOrder)


The javax.activation.DataHandler class provides access to the java.io.InputStream through its getInputStream() method.

JAX-RPC also supports mapping of the text/xml and application/xml MIME types. These types map to javax.xml.transform.Source. Unfortunately, the javax.xml.transform.Source interface is not stream-aware. To get to the input stream, you would have to cast it to one of the classes implementing this interface, for example, javax.xml.transform.stream.StreamSource. But the JAX-RPC specification does not restrict implementers to any particular class, so this downcast might not be portable -- it did work in Apache Axis, but I did not test it with any other SOAP server.
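If you do go down the Source route, the downcast can at least be guarded. The sketch below is one defensive way to write it using only standard JDK classes; the class name SourceStreams and the method inputStreamOf are hypothetical, and the method returns an empty Optional whenever the runtime supplies some other Source implementation or a StreamSource that is not backed by a byte stream.

```java
import java.io.InputStream;
import java.util.Optional;
import javax.xml.transform.Source;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper that isolates the non-portable downcast in one place.
class SourceStreams {

    // Returns the byte stream behind the Source, if there is one.
    static Optional<InputStream> inputStreamOf(Source source) {
        if (source instanceof StreamSource) {
            // getInputStream() returns null when the StreamSource wraps a Reader or a system ID.
            return Optional.ofNullable(((StreamSource) source).getInputStream());
        }
        return Optional.empty();
    }
}
```

A caller can then fall back to a non-streaming path when the Optional is empty instead of failing with a ClassCastException.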

So using javax.activation.DataHandler as a parameter type seems to be a better choice, especially since it does not restrict clients to any particular MIME type (for example, a client might decide to pass an XML document using the text/plain type).

Listing 4 shows the code of the web service that uses DataHandler in its signature:


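Listing 4. Code of the purchase order processing service

import java.io.IOException;
import java.io.InputStream;
import javax.activation.DataHandler;
import javax.xml.stream.XMLStreamException;

public class PurchaseOrderProcessor {

    // The attachment arrives as a DataHandler, which exposes the raw input stream
    public void process( DataHandler purchaseOrder ) 
        throws XMLStreamException, IOException {

        processPurchaseOrderXML( purchaseOrder.getInputStream());
    }

    void processPurchaseOrderXML( InputStream purchaseOrderInputStream ) 
       throws XMLStreamException {
       // See the code from Listing 3...
    }
}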



Drawbacks of SwA

The main downside of SwA is that it is not directly supported by .NET-based clients. Microsoft® used to favor an alternative specification called WS-Attachments, which is based on Direct Internet Message Encapsulation (DIME) instead of MIME. Later on, Microsoft decided to standardize on the Message Transmission Optimization Mechanism (MTOM) specification, which is MIME-based.

As a result, the SwA approach might not be appropriate for web services where interoperability based on the WS-I Basic Profile 1.0 is required (SwA is not part of the WS-I Basic Profile; instead, it is part of a separate "Attachments Profile"). However, in the majority of cases, web services today are used for integrating systems involving business and trading partners or for interdepartmental integration. In this case, all partners should agree on a single standard, and SwA is a good candidate because of its simplicity and maturity.

Partners using Microsoft technologies can interoperate with SwA-based partners by relying on add-on libraries, such as "AlotSoft SOAP extension" (see Resources), that add SwA support to the Microsoft platform.

Another problem with SwA is that it does not support an XML Information Set-based (Infoset-based) representation of a SOAP message. The SwA message consists of several parts -- some potentially containing binary data. This makes it impossible to cleanly convert the entire message into XML. The detailed analysis of this issue is beyond the scope of this article; however, you can refer to the "XML, SOAP and Binary Data" whitepaper (see Resources) for more information. I'm only going to mention that SwA makes it difficult to use digital signatures and WS-Security, which is not an issue if transport-level security, such as HTTPS, is used.

The MTOM specification, which has already been approved by the W3C, provides a solution to this problem. MTOM defines the Infoset inclusion mechanism, which allows for easy deserialization of the entire message into XML form. MTOM is already supported by Microsoft as part of the "Web Services Enhancements (WSE) 3.0" add-on for .NET. MTOM is also currently supported by Apache Axis2 and Sun's JAX-WS RI 2.0. You can expect other vendors to add MTOM support in the near future (see Resources). The fact that MTOM uses MIME encoding provides an easy transition path for web services that currently rely on SwA.


Conclusion

JAX-RPC provides a number of ways for implementing document-style web services; however, most of these options have serious tradeoffs when it comes to dealing with large and complex XML documents.

In this article, I explained one technique that you can use to implement a document-style web service using JAX-RPC. This technique relies on the use of a StAX streaming parser and SwA. In my experience, this technique has good performance characteristics and can deal with messages of any size.

I also explained the drawbacks of SwA. I think that these drawbacks are not significant unless adherence to the WS-I Basic Profile and/or support for WS-Security are required.

The technique I presented is not suitable for simple XML messages or when performance requirements are not very demanding. An approach relying on parallel processing, such as the one described in this article, always requires extra effort on the developer's side. However, I think that this effort is well justified under the right circumstances.


Resources

Learn

Get products and technologies

Discuss

