Improve performance in your XML applications, Part 1

Write XML documents and develop applications using the SAX and DOM APIs

Write your application to get the best possible performance, plus learn which SAX or DOM operations and features affect application performance. In this first of a three-part article, authors Elena Litani and Michael Glavassevich describe best practices for writing XML apps and documents, and for developing applications with the standard SAX and DOM APIs.

Elena Litani (elitani@ca.ibm.com), Software Developer, IBM, Software Group

Elena Litani is an IBM Software Developer who works on the Eclipse Modeling Framework (EMF). Previously, Elena was one of the main contributors to the Apache Xerces2 project, working on Xerces2 XML Schema and DOM Level 3 implementations, as well as analyzing and improving performance of the parser. Elena has also represented IBM in the W3C DOM Working Group, and participated in the development of the DOM Level 3 specifications. You can contact her at elitani@ca.ibm.com.



Michael Glavassevich (mrglavas@ca.ibm.com), Software Developer, IBM, Software Group

Michael Glavassevich is a Software Developer at the IBM Toronto Lab. He started contributing to the Xerces2 project in 2003, and is now one of the lead developers. You can reach him at mrglavas@ca.ibm.com.



26 July 2004

Today, XML plays an important role in many performance-critical scenarios. While many developers know how to write XML documents, XML schemas, or DTDs, some might not realize that performance of an XML application depends on the choices you make while constructing an XML document and which features you set on the parser before parsing an XML document.

Many developers also know when to use SAX and when to use the DOM API. In general, you can best use SAX in scenarios where memory is a concern, when an application has to process large documents, or create its own representation in memory (other than DOM). On the other hand, you can best use DOM in the cases where your application needs to randomly access and modify document data, wants to implement complex searches, or plans to traverse a document tree multiple times. In this article, we explain which SAX or DOM operations and features may affect the performance of your application, and describe how to write your application for the best performance.

Writing XML documents

Developers who are responsible for writing XML documents can do a variety of things to improve the performance of an XML application.

Each XML document can specify a character encoding in the XML declaration. To achieve optimum performance, use US ASCII ("US-ASCII") as the encoding when writing XML documents. Documents written using ASCII characters are the fastest to parse because each character is guaranteed to be a single byte and map directly to its equivalent Unicode value. If your document is encoded in UTF-8 but only contains ASCII characters, some parsers (such as Xerces2) perform in much the same way as they would in processing an equivalent XML document encoded in US-ASCII. For documents that contain Unicode characters beyond the ASCII range, the parser must read and convert multiple byte sequences for each character. This conversion results in a performance penalty. The UTF-16 encoding alleviates some of this penalty because each character is specified using two bytes, assuming no surrogate characters. However, if you use UTF-16, the size of the original document roughly doubles and the document takes longer to parse.

You can also improve performance by reducing the number of new lines and the amount of whitespace used in a document. Normally for editing convenience, developers organize documents into lines -- for example, using carriage returns (#xD) and line feeds (#xA). An XML parser must translate both the two-character sequence #xD #xA and any #xD (not followed by #xA) into a single #xA character. This translation is not free. The overall performance impact on parsing depends on the number of characters in a document relative to the number of new lines. This also applies to whitespace usage. When you add whitespace to your documents, a parser processes more characters, which in the end affects parsing performance.
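As a rough illustration of the encoding and whitespace advice above, the following sketch serializes a DOM tree as US-ASCII with indentation switched off, using the standard JAXP Transformer API. The class name CompactAsciiWriter and the sample document are ours, not part of any library; characters outside the ASCII range are expected to come out as character references.

```java
import java.io.ByteArrayOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper: writes a DOM tree as compact US-ASCII XML.
class CompactAsciiWriter {

    // Builds a tiny sample tree with one non-ASCII character in its text.
    static Document sample() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element root = doc.createElement("root");
        Element child = doc.createElement("child");
        child.setTextContent("caf\u00e9");
        root.appendChild(child);
        doc.appendChild(root);
        return doc;
    }

    static byte[] write(Document doc) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        // Declare and use US-ASCII as the document encoding.
        t.setOutputProperty(OutputKeys.ENCODING, "US-ASCII");
        // Do not add line breaks or indentation between elements.
        t.setOutputProperty(OutputKeys.INDENT, "no");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toByteArray();
    }
}
```

Calling CompactAsciiWriter.write(CompactAsciiWriter.sample()) returns the serialized bytes, which you can then write to a file or a socket.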

You should also avoid using namespaces in your applications unless they're absolutely necessary. Processing a document with the namespace feature enabled can slow the processing of the whole document. A parser not only processes namespace declarations, verifying their correctness, but it also ensures that an XML document is namespace well-formed.
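If your application does not need namespace information, you can leave namespace processing off when you create the parser. The sketch below does this through JAXP's SAXParserFactory; the class name NamespacesOff is hypothetical. With namespace processing off, element names are reported only as qualified names, prefix included.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example: parse without namespace processing.
class NamespacesOff {

    static List<String> elementNames(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        // false is the JAXP default; it is set explicitly here for clarity.
        factory.setNamespaceAware(false);
        XMLReader reader = factory.newSAXParser().getXMLReader();

        final List<String> names = new ArrayList<>();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                    String qName, Attributes atts) {
                // Only qName is meaningful when namespaces are off.
                names.add(qName);
            }
        });
        reader.parse(new InputSource(new StringReader(xml)));
        return names;
    }
}
```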

Applications that do not need validation should not include a <!DOCTYPE...> line in their documents. According to the XML specification, a validating processor, such as Xerces2, must process the internal and external DTD subsets to get information about default attributes, attribute types, and so forth. The processor will process the DTD even if the validation feature is turned off.

When an application needs validation, keep in mind that processing and validating against a DTD is normally cheaper than processing and validating against a W3C XML Schema. In addition, you should avoid using a lot of external entities -- such as external DTDs or imported XML Schemas -- since opening and reading from a file is an expensive operation. Also avoid using many default attributes, since this increases validation time. XML Schema’s redefine construct and identity constraint are also worth avoiding, since both could affect the duration of the validation process.


General SAX performance tips

While choosing SAX over a more memory-intensive API such as DOM may in itself improve the performance of your application, you can do a number of things to maximize it. Try these tips to improve the performance of your SAX applications:

String internalization

SAX specifies a feature that's identified by the feature URI http://xml.org/sax/features/string-interning. When set to true, it instructs the parser to report XML names -- such as the names of elements and attributes -- and namespace URIs as strings that have been interned by invoking java.lang.String.intern().

To accelerate string equality tests, turn this feature on. Instead of making calls to equals() which compares strings character by character, you can compare names reported by the parser against string constants by reference. If you use XML names reported by the parser as keys to hash tables, internalized strings should improve lookup times if the table calls the hashCode method of java.lang.String. Although not specified in the Javadoc, implementations of this hashCode method typically cache the hash code value in the object after computing it. After the hash code has been computed once, getting the hash code for an internalized string is essentially free.

Some parser implementations may not support the string internalization feature. Xerces2 uses internalized strings for faster comparisons, so this feature is always on.

Switch content handlers

If you process large XML vocabularies, you may find yourself with a large number of if and else statements in your callback methods. At any time during a parse, it is possible to register a new content handler, as stated in the SAX specification. You can reduce the complexity and length of your callback methods by using different content handlers for different parts of the document. The class shown in Listing 2 demonstrates how you can split the processing of a document, shown in Listing 1, between multiple handlers.

Listing 1. Sample XML document
<?xml version="1.0" encoding="US-ASCII"?>
<!DOCTYPE root [
 <!ELEMENT root (child*)>
 <!ELEMENT child (#PCDATA)>
]>
<root><child/></root>

Listing 2. Use multiple content handlers
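import java.io.IOException;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class MultipleHandlersExample {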
    
    private XMLReader reader;
    private ContentHandler docHandler;
    private ContentHandler rootContentHandler;
    private ContentHandler childContentHandler;
        
    ...
    
    public void parse (String uri) throws SAXException, IOException {
        reader.setContentHandler(docHandler);
        reader.parse(uri);
    }
    
    public class DocHandler extends DefaultHandler {
        public void startElement(String uri, String localName, 
            String qName, Attributes atts) {
            if ("root".equals(qName)) {
                // process root
                reader.setContentHandler(rootContentHandler);
            }
            else {} // error: only root expected here 
        }
    }
    
    public class RootContentHandler extends DefaultHandler {
        public void startElement(String uri, String localName, 
            String qName, Attributes atts) {
            if ("child".equals(qName)) {
                // process child
                reader.setContentHandler(childContentHandler);
            }
            else {} // error: only child expected here
        }
        public void endElement(String uri, String localName, 
            String qName) {
            // end of root, set content handler for document
            reader.setContentHandler(docHandler);
        }
    }
    
    public class ChildContentHandler extends DefaultHandler {
        public void startElement(String uri, String localName, 
            String qName, Attributes atts) {
            // error: no element content expected here
        }
        public void endElement(String uri, String localName,
            String qName) {
            // end of child, set content handler for root
            reader.setContentHandler(rootContentHandler);
        }
    }
}

When a particular element (such as root or child in the above example) has been reported, you can register a handler to process the content of this element. At the end of the element, you would restore the content handler for the parent element. For more complex documents than the one given in Listing 1, you can accomplish this by pushing and popping content handlers onto and off of a stack. By handling content in this way, you have much less code in each of your handler methods. Reducing the length of these methods can make them more amenable to optimization by a JIT compiler.
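The stack-based variant mentioned above might look like the following sketch. It is only one possible arrangement: the class HandlerStackExample and its push and pop helpers are hypothetical names, and the child handler assumes that child elements contain text only, as in Listing 1.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example: content handlers kept on a stack.
class HandlerStackExample {
    private final XMLReader reader;
    private final Deque<ContentHandler> parents = new ArrayDeque<>();
    final List<String> childText = new ArrayList<>();

    HandlerStackExample(XMLReader reader) {
        this.reader = reader;
    }

    // Remember the current handler and make 'next' the active one.
    void push(ContentHandler next) {
        parents.push(reader.getContentHandler());
        reader.setContentHandler(next);
    }

    // Restore the handler that was active before the last push.
    void pop() {
        reader.setContentHandler(parents.pop());
    }

    class RootHandler extends DefaultHandler {
        @Override
        public void startElement(String uri, String localName,
                String qName, Attributes atts) {
            if ("child".equals(qName)) {
                push(new ChildHandler());
            }
        }
    }

    // Assumes text-only content; a nested element would need a depth count.
    class ChildHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            childText.add(text.toString());
            pop();
        }
    }

    static List<String> parse(String xml) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance()
                .newSAXParser().getXMLReader();
        HandlerStackExample example = new HandlerStackExample(reader);
        reader.setContentHandler(example.new RootHandler());
        reader.parse(new InputSource(new StringReader(xml)));
        return example.childText;
    }
}
```

Because each handler pops itself when its element ends, no handler needs to know which handler registered it.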

Based on the configuration of the SAX parser, your application may need to perform differently. For instance, if you rely on string internalization but the parser you are using does not support it, your application must compare strings using equals(). You can have one content handler that handles both cases, but this requires a check to see which case needs to be handled each time the parser invokes the handler. Instead of writing one monolithic handler, you can write two content handlers: one that performs reference comparisons on strings and another that does not. The decision of which handler to use can be made before parsing.
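A minimal sketch of this two-handler approach follows. The class names (HandlerSelection, ReferenceCounter, EqualsCounter) are hypothetical; the only parser-specific call is the standard getFeature query for the string-interning feature, and the sketch falls back to equals() whenever the feature is unknown or off.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example: pick the comparison strategy once, before parsing.
class HandlerSelection {
    static final String INTERNING =
            "http://xml.org/sax/features/string-interning";

    abstract static class ChildCounter extends DefaultHandler {
        int count;
    }

    // Valid only when the parser reports interned names.
    static class ReferenceCounter extends ChildCounter {
        @Override
        public void startElement(String uri, String localName,
                String qName, Attributes atts) {
            if (qName == "child") {   // string literals are interned
                count++;
            }
        }
    }

    // Safe with any parser.
    static class EqualsCounter extends ChildCounter {
        @Override
        public void startElement(String uri, String localName,
                String qName, Attributes atts) {
            if ("child".equals(qName)) {
                count++;
            }
        }
    }

    static ChildCounter choose(XMLReader reader) {
        boolean interned;
        try {
            interned = reader.getFeature(INTERNING);
        } catch (SAXException e) {
            // Feature not recognized or not supported: assume no interning.
            interned = false;
        }
        return interned ? new ReferenceCounter() : new EqualsCounter();
    }

    static int countChildren(String xml) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance()
                .newSAXParser().getXMLReader();
        ChildCounter counter = choose(reader);
        reader.setContentHandler(counter);
        reader.parse(new InputSource(new StringReader(xml)));
        return counter.count;
    }
}
```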

Load external entities with entity resolvers

XML documents that refer to external DTDs and/or contain many references to external entities can be very expensive to process. For each of these entities, the parser needs to locate a resource somewhere out in the world and read it. If this resource is on your hard drive, the parser must open a file. If the parser works on character data internally in a single encoding -- such as Xerces2, which always represents characters internally as 16-bit units (UTF-16) -- it must transcode each of these files. If your document contains references to entities on a network or on the Internet (and these resources are accessible from your environment), you could incur a large performance penalty, especially during periods of high network latency. Many parsers, including Xerces2, do not keep entities that they have already read in memory; if your document references an entity multiple times, the parser will fetch the entity as many times as it is referenced.

If your XML documents have references to external entities or external DTDs, you can improve the performance of your application by loading these entities into memory using an entity resolver. Write your entity resolver so it caches the content of the entity the first time it is read. Your application only pays the retrieval penalty once per entity. If you do not require such dynamic loading, you can preload your application with entities that you wish to be read from memory. When you store an entity in memory as a java.lang.String, you can avoid the cost incurred by a parser when it converts from the entity’s encoding into characters, as shown in Listing 3.

Listing 3. Load external entities from memory
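import java.io.IOException;
import java.io.StringReader;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class MyEntityResolver implements EntityResolver {
    
    // Entity content, loaded once and kept in memory
    private String externalEntity = ...;

    public InputSource resolveEntity(String publicId, String systemId)
        throws SAXException, IOException {
        // systemId may be reported as an absolute URI, so adjust the
        // comparison to match the identifiers used in your documents
        if ("ExternalEntity.xml".equals(systemId)) {
            return new InputSource(new StringReader(externalEntity));
        }
        // returning null tells the parser to resolve the entity itself
        return null;
    }
}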
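
The caching behavior described before Listing 3 could be sketched as follows. CachingEntityResolver and its Loader callback are hypothetical names introduced for illustration; how the entity text is actually fetched (file, URL, database) is left to the loader you supply.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Hypothetical example: fetch each external entity once, then serve it
// from memory as already-decoded characters.
class CachingEntityResolver implements EntityResolver {

    // Supplies the entity text for a system identifier, or null to decline.
    interface Loader {
        String load(String systemId) throws IOException;
    }

    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Loader loader;

    CachingEntityResolver(Loader loader) {
        this.loader = loader;
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId)
            throws IOException {
        if (systemId == null) {
            return null;                 // let the parser handle it
        }
        String text = cache.get(systemId);
        if (text == null) {
            text = loader.load(systemId);
            if (text == null) {
                return null;             // unknown entity: default lookup
            }
            cache.put(systemId, text);
        }
        // A fresh reader per request; the cached String is shared.
        InputSource source = new InputSource(new StringReader(text));
        source.setSystemId(systemId);
        return source;
    }
}
```

Register an instance with XMLReader.setEntityResolver before parsing; the retrieval cost is then paid once per entity rather than once per reference.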

Avoid processing external entities

Although XML documents processed by your application may contain references to external entities, you might not be interested in expanding them. SAX defines two features, identified by the feature URIs http://xml.org/sax/features/external-general-entities and http://xml.org/sax/features/external-parameter-entities, that control whether the parser processes external general and external parameter entities. If you disable these features and an external entity reference is encountered while processing a document, a SAX parser won't report the entity content, but will instead report the name of the entity to the skippedEntity callback of your content handler. If your application is not interested in the content of external entities, you can turn these features off to stop them from being processed.
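As a sketch of how this might look in practice, the class below (SkipExternalEntities, a hypothetical name) turns both features off through the standard XMLReader.setFeature call and records every name passed to skippedEntity. The entity file name in the usage note is made up and never has to exist, because the parser does not try to read it.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example: do not expand external entities.
class SkipExternalEntities {

    static List<String> skipped(String xml) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance()
                .newSAXParser().getXMLReader();
        reader.setFeature(
            "http://xml.org/sax/features/external-general-entities", false);
        reader.setFeature(
            "http://xml.org/sax/features/external-parameter-entities", false);

        final List<String> names = new ArrayList<>();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void skippedEntity(String name) {
                // Called instead of reporting the entity's content.
                names.add(name);
            }
        });
        reader.parse(new InputSource(new StringReader(xml)));
        return names;
    }
}
```

For a document that declares <!ENTITY ext SYSTEM "missing-entity.xml"> and references &ext; in its content, the returned list should contain the name ext.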


General DOM performance tips

DOM defines several types of nodes, such as Element and Attribute. When you write code to perform specific operations based on the type of a node, avoid using the Java instanceof operator to check the node type. Instead, use the getNodeType (Node interface) method to retrieve the type of the node being processed.
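For example, a tree walk that dispatches on getNodeType might be sketched like this. NodeTypeCounter is a hypothetical class; the node type constants are the standard ones defined on org.w3c.dom.Node.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical example: dispatch on getNodeType() rather than instanceof.
class NodeTypeCounter {
    int elements;
    int texts;
    int comments;

    void visit(Node node) {
        switch (node.getNodeType()) {
            case Node.ELEMENT_NODE:
                elements++;
                break;
            case Node.TEXT_NODE:
                texts++;
                break;
            case Node.COMMENT_NODE:
                comments++;
                break;
            default:
                break;  // document node and other types are not counted
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            visit(c);
        }
    }

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }
}
```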

Before you retrieve a list of attributes, always query to see if a node has attributes using the hasAttributes method; if the node has attributes, cast the node to an Element node and use the getAttributes method to retrieve the list of attributes. With this sequence of operations, you avoid unnecessary casting of a Node to an Element node and possible creation of an empty NamedNodeMap -- for an element, the getAttributes method returns a map even if the element has no attributes.

If your application needs to query or modify an attribute node, avoid calling the hasAttribute(String) or hasAttributeNS(String, String) method first. Instead, use either the getAttributeNode(String) method or the getAttributeNodeNS(String, String) method to retrieve the Attr node, which you can then query or modify; these methods return null if the attribute does not exist.
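The two attribute tips can be sketched together as follows. AttributeAccess is a hypothetical helper class; it relies on hasAttributes and getAttributes, which are declared on Node, and on getAttributeNode, the Element method that returns the Attr node itself or null when the attribute is absent.

```java
import java.io.StringReader;
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical example: cheap attribute access patterns.
class AttributeAccess {

    // Counts attributes in a subtree, asking hasAttributes() first so that
    // no attribute map is requested for nodes without attributes.
    static int countAttributes(Node node) {
        int total = 0;
        if (node.hasAttributes()) {
            total += node.getAttributes().getLength();
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            total += countAttributes(c);
        }
        return total;
    }

    // One lookup instead of hasAttribute() followed by a second lookup.
    static boolean upperCaseAttribute(Element element, String name) {
        Attr attr = element.getAttributeNode(name);
        if (attr == null) {
            return false;   // attribute does not exist
        }
        attr.setValue(attr.getValue().toUpperCase(Locale.ROOT));
        return true;
    }

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }
}
```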

Several operations in the DOM API can be quite expensive. The importNode operation results in the creation of new nodes, so you should consider using the DOM Level 3 adoptNode method instead when the source document no longer needs the nodes. Both getElementsByTagName and getElementsByTagNameNS traverse DOM trees looking for the nodes using the Java String.equals() method for comparing names and namespace URIs; instead, an application can choose to write its own traversal methods that can use Java == for comparing strings (in case strings are internalized) or search only in parts of the tree. The same applies to the getElementById method.
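The sketch below illustrates both ideas under a few assumptions: NodeMover is a hypothetical class, moveInto falls back to importNode only if the implementation returns null from adoptNode, and findFirst searches a single subtree. It compares names with equals(), which is always safe; switch to == only if you know the names in your DOM implementation are interned.

```java
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical example: move nodes cheaply and search only where needed.
class NodeMover {

    // Moves 'source' into 'target' without copying when possible.
    // The caller still has to insert the returned node into the tree.
    static Node moveInto(Document target, Node source) {
        Node adopted = target.adoptNode(source);
        if (adopted == null) {
            // adoptNode may return null if the implementation cannot adopt
            // this node; copy it and detach the original instead.
            adopted = target.importNode(source, true);
            Node parent = source.getParentNode();
            if (parent != null) {
                parent.removeChild(source);
            }
        }
        return adopted;
    }

    // Depth-first search limited to the subtree rooted at 'start'.
    static Element findFirst(Node start, String name) {
        if (start.getNodeType() == Node.ELEMENT_NODE
                && name.equals(start.getNodeName())) {
            return (Element) start;
        }
        for (Node c = start.getFirstChild(); c != null; c = c.getNextSibling()) {
            Element hit = findFirst(c, name);
            if (hit != null) {
                return hit;
            }
        }
        return null;
    }
}
```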

In addition, use the DOM Level 3 normalizeDocument method with caution. While this method can improve performance in validation, it can also be expensive. For example, by default this method ensures that the tree is namespace well-formed -- that is, it checks to see if all necessary namespace declarations for attributes and elements have been added, and may change prefixes of some attributes or elements, if required. If the application code is already making sure that the tree is namespace well-formed, the application should turn off the "namespaces" parameter using the DOMConfiguration interface before invoking the normalizeDocument method. This also applies to the well-formed parameter, which is true by default. Keep in mind that normally DOM trees should be well-formed. The exceptions are if developers include -- in the Comment nodes or ]]> in the CDATASection nodes or use non-XML characters in textual content (including CDATA sections and comments). Therefore, for most applications you can safely disable the well-formed parameter before calling normalizeDocument, and this can significantly improve the performance of this method.
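A minimal sketch of this configuration step is shown below. LeanNormalize is a hypothetical class name; the parameter names "namespaces" and "well-formed" come from the DOM Level 3 Core specification. Because an implementation is not required to let you change every parameter, the sketch asks canSetParameter before each change.

```java
import org.w3c.dom.DOMConfiguration;
import org.w3c.dom.Document;

// Hypothetical example: skip checks the application already guarantees.
class LeanNormalize {

    static void normalize(Document doc) {
        DOMConfiguration config = doc.getDomConfig();
        disable(config, "namespaces");   // no namespace fix-up
        disable(config, "well-formed");  // no well-formedness checks
        doc.normalizeDocument();
    }

    // Turns a boolean parameter off only if the implementation allows it.
    private static void disable(DOMConfiguration config, String name) {
        if (config.canSetParameter(name, Boolean.FALSE)) {
            config.setParameter(name, Boolean.FALSE);
        }
    }
}
```

Other normalization work, such as merging adjacent Text nodes, still takes place with these two parameters turned off.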

DOM Level 3 API

The DOM Level 3 Core and Load and Save specifications define several new operations and an API that can improve the performance of a DOM application. Therefore, you should plan to migrate your DOM applications to DOM Level 3, which is supported in J2SE 5.0.

Rename and move nodes

In DOM Level 2, renaming and moving nodes from one document to another can be relatively expensive, since these operations involve creating new nodes, copying the contents of those nodes, and inserting nodes at the appropriate places in a tree. To improve performance of these operations, write your application so that it uses the renameNode and adoptNode methods. Normally, the renameNode method only changes the name of a node. In some rare cases, this method may end up creating a new node, copying all information and inserting the new node into the tree. When working with Xerces2 DOM, this only happens if an application creates a non-namespace-aware node -- a node created with DOM Level 1 methods, such as createElement -- and later tries to rename this node adding a namespace URI. Since Xerces2 uses different classes for namespace-aware nodes (the nodes created with the DOM Level 2 methods, such as createElementNS) and non-namespace nodes, the Xerces2 DOM implementation is forced to create a new instance of a node during the renameNode operation. As mentioned above, this is a rare case, since this scenario most often occurs when an application attempts to mix namespace-aware and non-namespace-aware nodes in a single document. Mixing the two types of nodes is highly discouraged, as it can lead to unpredictable results (for example, during validation of the tree).
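As a short illustration, the hypothetical helper below renames an element in place with the standard Document.renameNode method. Because an implementation may create a new node in the rare cases described above, always continue with the node that renameNode returns rather than the one you passed in.

```java
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical example: rename an element without rebuilding its subtree.
class Renamer {

    static Element rename(Element element, String namespaceUri,
            String qualifiedName) {
        Document doc = element.getOwnerDocument();
        // Attributes and children stay attached to the renamed element.
        return (Element) doc.renameNode(element, namespaceUri, qualifiedName);
    }
}
```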

Validate in memory

Using DOM Level 3, you can now validate a DOM in memory. In the past, if you wanted to validate against a schema you had to either write your own validation code, which can be very complicated, or serialize the DOM and then load it back in memory using a validating parser. Using the normalizeDocument method, you can perform validation of the DOM tree in one easy step to avoid other costly operations. Read more on how to use the normalizeDocument method in the article "Discover key features of DOM Level 3 Core, Part 2."

Avoid unnecessary checking

Normally, a DOM implementation must verify the correctness of operations and throw an exception if an application passes the wrong parameters or performs an illegal operation. For example, for the createElementNS method a DOM implementation must verify that the qualifiedName complies with the definition of the QName (see Resources). DOM Level 3 adds a new strictErrorChecking attribute to the Document interface. If you are convinced that all the operations your application performs on the DOM are legal (for example, if the DOM tree is built using SAX events), you can improve performance by turning off strict error checking.
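For instance, a builder that creates a tree from names it already trusts might switch the checks off up front, as in this sketch. TrustedBuilder is a hypothetical class; setStrictErrorChecking is the standard DOM Level 3 method on Document. Only do this when the input is known to be legal, since illegal names may then go undetected.

```java
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical example: build a tree from trusted data without
// per-operation error checking.
class TrustedBuilder {

    static Document build(List<String> childNames) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        // Ask the implementation to skip its usual validity checks.
        doc.setStrictErrorChecking(false);

        Element root = doc.createElement("root");
        doc.appendChild(root);
        for (String name : childNames) {
            root.appendChild(doc.createElement(name));
        }
        return doc;
    }
}
```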

Use the Load and Save filter API

With DOM Level 3, you no longer need to wait for parsing to complete before you can modify a document’s structure. The new filter API lets you examine and modify a document’s structure during parsing by asking a parser to accept, skip, or remove a node and its children from the resulting tree. You can also choose to load only part of an XML document by interrupting parsing using the filter API. Such modifications of the document structure during parsing can result in a smaller memory footprint for the DOM tree, and also reduce the time spent traversing a document and modifying it in memory.

Using the serializer filter, your application can specify which nodes you want to serialize into XML without modifying the original DOM tree. This gives you flexibility to serialize the same DOM tree into multiple XML documents, again avoiding potentially costly traversals and modifications of a DOM tree.
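A parser filter might be sketched as follows, assuming your DOM implementation supports the Load and Save module. FilteredLoad and RejectByName are hypothetical names; the interfaces and constants (LSParserFilter, FILTER_REJECT, DOMImplementationLS, and so on) are the standard DOM Level 3 ones. The filter rejects elements by name in startElement, so the rejected subtrees are never added to the tree.

```java
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSInput;
import org.w3c.dom.ls.LSParser;
import org.w3c.dom.ls.LSParserFilter;
import org.w3c.dom.traversal.NodeFilter;

// Hypothetical example: drop unwanted subtrees while loading.
class FilteredLoad {

    static class RejectByName implements LSParserFilter {
        private final String rejected;

        RejectByName(String rejected) {
            this.rejected = rejected;
        }

        // Called after a start tag is scanned, before the content is read.
        @Override
        public short startElement(Element element) {
            return rejected.equals(element.getNodeName())
                    ? FILTER_REJECT : FILTER_ACCEPT;
        }

        // Called for completed nodes; everything that got this far is kept.
        @Override
        public short acceptNode(Node node) {
            return FILTER_ACCEPT;
        }

        @Override
        public int getWhatToShow() {
            return NodeFilter.SHOW_ELEMENT;
        }
    }

    static Document load(String xml, String rejectedName) throws Exception {
        DOMImplementationLS ls = (DOMImplementationLS)
                DOMImplementationRegistry.newInstance()
                        .getDOMImplementation("LS");
        LSParser parser =
                ls.createLSParser(DOMImplementationLS.MODE_SYNCHRONOUS, null);
        parser.setFilter(new RejectByName(rejectedName));

        LSInput input = ls.createLSInput();
        input.setStringData(xml);
        return parser.parse(input);
    }
}
```

Returning FILTER_INTERRUPT from either callback stops the parse early, which is how you would load only the first part of a document.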


Conclusion

In this article, we have shown how to improve performance in your XML applications. We started by showing you techniques for writing XML to achieve better parsing performance. Then we described how you can improve performance of SAX and DOM applications. In the second article in this series, we will explain how to improve the performance of SAX and DOM applications if you are using the Xerces2 implementation. We will also show you how to reuse parser instances.

Resources
