Improve performance in your XML applications, Part 2

Reuse parser instances with the Xerces2 SAX and DOM implementations

In this installment of a three-part series describing best practices for writing XML applications, authors Elena Litani and Michael Glavassevich explain how to improve your SAX and DOM applications' performance by using the Xerces2 implementation. They also include code samples to show you how to improve your application's performance by reusing parser instances.


Elena Litani (elitani@ca.ibm.com), Software Developer, IBM, Software Group

Elena Litani is an IBM Software Developer who works on the Eclipse Modeling Framework (EMF). Previously, Elena was one of the main contributors to the Apache Xerces2 project, working on Xerces2 XML Schema and DOM Level 3 implementations, as well as analyzing and improving performance of the parser. Elena has also represented IBM in the W3C DOM Working Group and participated in the development of the DOM Level 3 specifications. You can contact her at elitani@ca.ibm.com.



Michael Glavassevich (mrglavas@ca.ibm.com), Software Developer, IBM, Software Group

Michael Glavassevich is a Software Developer at the IBM Toronto Lab. He started contributing to the Xerces2 project in 2003, and is now one of the lead developers. You can contact him at mrglavas@ca.ibm.com.



30 July 2004


Today, many applications use the Java API for XML Processing (JAXP) to retrieve either a SAX or a DOM parser. Depending on the version of JAXP and on the Java 2 Platform, Standard Edition (J2SE) vendor, your application can retrieve different parser implementations. For example, Sun J2SE 1.4.x includes the Crimson parser, while IBM® J2SE 1.4.x and Sun J2SE 1.5 include the Xerces2 parser. If you are interested in working directly with the Xerces2 parser, you can use the JAXP factory plug-in mechanism to specify where JAXP can locate the Xerces2 parser implementation.

Even if your application cannot specify the parser implementation, you still might want to anticipate that in some environments the Xerces2 parser will be the parser implementation available; therefore, you might want to set features and properties on the parser that could affect performance. If the Xerces2 parser is not available, setting Xerces2-specific features or properties will fail, but since initialization of the parser should happen only once (as described below), it should not inhibit your application's performance.
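As a minimal sketch of this "try it, and carry on if it fails" approach, the helper below attempts to set a feature and reports whether the parser accepted it. The class and method names (OptionalFeatures, trySetFeature, newReader) are hypothetical and introduced only for illustration; the SAX and JAXP calls are the standard ones.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

// Hypothetical helper, not part of Xerces2 or JAXP.
final class OptionalFeatures {

    /** Returns true if the feature was applied, false if this parser rejected it. */
    static boolean trySetFeature(XMLReader reader, String featureUri, boolean value) {
        try {
            reader.setFeature(featureUri, value);
            return true;
        } catch (SAXNotRecognizedException | SAXNotSupportedException e) {
            // The parser in this environment does not know or allow the feature;
            // the application continues with the parser's default behavior.
            return false;
        }
    }

    /** Creates a reader; checked JAXP exceptions are wrapped to keep the sketch short. */
    static XMLReader newReader() {
        try {
            return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        } catch (Exception e) {
            throw new IllegalStateException("no SAX parser available", e);
        }
    }
}
```

Because the reader is created and configured only once, the cost of a rejected feature is paid a single time rather than on every parse.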

This paper offers suggestions for improving the performance of your SAX or DOM applications if you are using the Xerces2 parser, and discusses how reusing parser instances can significantly improve your application's performance. We recommend that you always use the latest available version of Xerces2, since almost every release includes some performance improvements. Keep in mind that what performs particularly well on Xerces may perform poorly on other parsers, and vice versa.

Xerces2-specific SAX tips

In the previous paper, we gave you some general performance tips for writing SAX applications. If you are using the Xerces2 SAX parser, you might want to consider a few additional things. In this section, we discuss areas where using the Xerces2 SAX parser could affect your application's performance.

Namespace declarations

Internally, Xerces stores namespace declarations with the rest of the attributes that are specified on an element start tag. By default, a SAX parser does not include namespace declarations among the attributes reported in a startElement callback. In order to conform to the SAX API, the parser must iterate over the set of attributes to remove any namespace declarations, even when no namespace declarations have been specified on the start tag.

A feature identified by the feature URI http://xml.org/sax/features/namespace-prefixes controls whether namespace declarations are reported. To achieve better performance, set this feature to true, so that the parser reports namespace declarations among other attributes. However, the parser does not report the namespace declarations as being bound to the namespace name http://www.w3.org/2000/xmlns/, as is specified in the errata for the Namespaces in XML Recommendation (see Resources). The first edition of this specification does not assign namespace declarations to a namespace. Before the SAX 2.0.2 release, namespace declarations always had to be reported as having no namespace. Internally, Xerces binds these attributes to a namespace, so before reporting namespace declarations to your application as attributes, the parser must unbind them from their namespace. Similarly, the parser must always iterate over the set of attributes to locate any namespace declarations.

SAX 2.0.2 introduced a new feature that is identified by the feature URI http://xml.org/sax/features/xmlns-uris. It lets you specify a preference as to whether namespace declarations are reported as having a namespace. When this feature is set to true, Xerces 2.7.0 (still in development at the time of this writing) does not further process the set of attributes, as they are already in their expected form. Configuring the parser in this way speeds up the processing of attributes.
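The two features discussed above could be set as in the following sketch. It parses a string and collects the qualified names of every attribute passed to startElement, so you can see that namespace declarations are now reported. The class NamespaceDeclarationDemo and its method are hypothetical names used for illustration, and the xmlns-uris feature is set on a best-effort basis because older parsers do not recognize it.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; only the feature URIs come from the article.
final class NamespaceDeclarationDemo {

    /** Parses xml and returns the qualified names of all attributes reported to startElement. */
    static List<String> reportedAttributeNames(String xml) {
        final List<String> names = new ArrayList<>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();

            // Report xmlns declarations among the other attributes.
            reader.setFeature("http://xml.org/sax/features/namespace-prefixes", true);
            try {
                // SAX 2.0.2 only: report the declarations as bound to the xmlns namespace.
                reader.setFeature("http://xml.org/sax/features/xmlns-uris", true);
            } catch (SAXException e) {
                // Older parser: keep going without this feature.
            }

            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes atts) {
                    for (int i = 0; i < atts.getLength(); i++) {
                        names.add(atts.getQName(i));
                    }
                }
            });
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return names;
    }
}
```

With namespace-prefixes enabled, a handler that iterates over attributes must be prepared to skip the xmlns declarations itself.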

Reading attributes by index

In the implementation of org.xml.sax.Attributes in Xerces2, attributes are stored in an array for fast access to attributes by index. The SAX helper class org.xml.sax.helpers.AttributesImpl (used and extended by the Crimson parser) stores attributes similarly. When attributes are stored in this manner, you can achieve better performance when processing attributes by accessing them by index rather than by their name. In this case, looking up an attribute by name initiates a linear search. As the number of attributes specified for an element increases, so does the average search time. When looking up an attribute by name and examining several of its properties, it is better to first look up its index by name and then use the index to get the properties you are interested in, such as the attribute’s value and type.

Although storing attributes in array form is a typical implementation, an element's attributes are an unordered set, so other parser implementations may store attributes in a more efficient way. This could mean that they aren't stored in the order that they appeared in the document. In such cases, looking up an attribute by name could yield better performance than looking it up by index.
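For array-backed implementations such as the one in Xerces2, the index-first lookup could look like the sketch below: one search by name, followed by two constant-time reads by index. The class name AttributeLookup and the method describe are hypothetical; getIndex, getValue, and getType are the standard org.xml.sax.Attributes methods.

```java
import org.xml.sax.Attributes;

// Hypothetical helper for use inside a startElement callback.
final class AttributeLookup {

    /** Returns "value (type)" for the named attribute, or null if it is absent. */
    static String describe(Attributes atts, String qName) {
        // One search by name...
        int index = atts.getIndex(qName);
        if (index < 0) {
            return null;
        }
        // ...then read each property by index instead of searching again.
        return atts.getValue(index) + " (" + atts.getType(index) + ")";
    }
}
```

Calling atts.getValue(qName) and then atts.getType(qName) would return the same information but search the attribute list twice.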


Xerces2-specific DOM tips

In the previous paper we discussed general performance tips for writing DOM applications. In this section we discuss the design of the Xerces2 DOM and features that affect your application's performance.

Specifying a DOM implementation

The DOM API consists of several specifications. These specifications define features to identify areas that a particular specification covers. The default Xerces DOM implementation supports the DOM Core, XML, Mutation Events, Range, and Traversal features. If your application only needs an implementation of the DOM Level 2 (or 3) Core Recommendation, you can improve performance by specifying the org.apache.xerces.dom.CoreDocumentImpl class as an implementation of the org.w3c.dom.Document interface using the http://apache.org/xml/properties/dom/document-class-name property. This implementation supports the DOM Level 2, Level 3 Core, and Level 3 Load and Save Recommendations. It does not implement any other DOM specifications, and therefore performs better.
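One way to set this property through JAXP is sketched below. It is only a sketch under stated assumptions: the class CoreOnlyDom is hypothetical, and the check on the factory's package name is our own guard (not something Xerces2 requires) so that the property is set only when the JAXP implementation really is Apache Xerces2. On any other parser the factory is returned unchanged.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper; the property URI and class name come from the article.
final class CoreOnlyDom {
    static final String DOCUMENT_CLASS_PROPERTY =
        "http://apache.org/xml/properties/dom/document-class-name";
    static final String CORE_DOCUMENT_CLASS =
        "org.apache.xerces.dom.CoreDocumentImpl";

    /** Returns a factory that builds core-only documents when Apache Xerces2 is in use. */
    static DocumentBuilderFactory newFactory() {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Assumption: the Apache Xerces2 JAXP classes live under org.apache.xerces.
        if (factory.getClass().getName().startsWith("org.apache.xerces.")) {
            try {
                factory.setAttribute(DOCUMENT_CLASS_PROPERTY, CORE_DOCUMENT_CLASS);
            } catch (IllegalArgumentException e) {
                // Property rejected: fall back to the default document class.
            }
        }
        return factory;
    }

    /** Parses a string with the factory above; exceptions are wrapped for brevity. */
    static Document parse(String xml) {
        try {
            return newFactory().newDocumentBuilder()
                               .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }
}
```

A real application would create the factory once and keep it, rather than calling newFactory() for every document as the parse method does here.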

Deferred DOM

By default, the Xerces2 parser builds a DOM using a compact array structure known as deferred DOM. This allows the parser to return a document faster than if the tree were fully expanded during parsing and can improve memory usage. As a tree is traversed, the Xerces2 DOM implementation creates DOM nodes using the information stored in the array structures.

In general, you should use the deferred implementation if your application needs to process large documents and if your application is not intending to traverse the whole tree. However, some performance tests have shown that using the Xerces2 DOM with deferred node expansion for small documents (0K-10K) results in poor performance and large memory consumption.

Thus, for best performance when using the Xerces2 DOM with smaller documents, you should disable the deferred node expansion feature identified by the feature URI http://apache.org/xml/features/dom/defer-node-expansion. For larger documents (~100K and higher) the deferred DOM offers better performance than non-deferred DOM, but uses more memory.
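The following sketch shows one way to act on this advice: it keeps two factories, one with deferred node expansion disabled and one with it enabled, and picks between them by document size. The 10K cut-off is an assumption taken from the small-document range quoted above (the article gives no figure for documents between 10K and 100K), the size is approximated by character count, and the class DeferredDomChoice is hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper; only the feature URI comes from the article.
final class DeferredDomChoice {
    static final String DEFER_FEATURE =
        "http://apache.org/xml/features/dom/defer-node-expansion";

    // Assumed cut-off, based on the 0K-10K range mentioned in the text.
    static final int SMALL_DOCUMENT_CHARS = 10 * 1024;

    // Created once and reused, since factory lookup is itself costly.
    private static final DocumentBuilderFactory EAGER = newFactory(false);
    private static final DocumentBuilderFactory DEFERRED = newFactory(true);

    /** Creates a factory with deferral on or off, if the parser supports the feature. */
    static DocumentBuilderFactory newFactory(boolean deferNodeExpansion) {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        try {
            factory.setAttribute(DEFER_FEATURE, Boolean.valueOf(deferNodeExpansion));
        } catch (IllegalArgumentException e) {
            // Feature not recognized by this parser: keep its defaults.
        }
        return factory;
    }

    /** Parses xml, disabling deferred expansion for small documents. */
    static synchronized Document parse(String xml) {
        DocumentBuilderFactory factory =
            xml.length() <= SMALL_DOCUMENT_CHARS ? EAGER : DEFERRED;
        try {
            return factory.newDocumentBuilder()
                          .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }
}
```

The right threshold depends on your documents and on how much of each tree you traverse, so treat the constant as a starting point to measure against.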

Traversing Xerces2 DOM

Always use the getValue (org.w3c.dom.Attr interface) method to retrieve an attribute node's value as a string. Avoid using methods to retrieve an attribute node's children. While a DOM implementation must create a Text node for an attribute string value, the Xerces2 DOM implementation delays creating a Text node until an application attempts to retrieve an attribute node's children. If your application retrieves attribute values using the getValue method, a Text node is never created, thus saving space and improving the DOM traversal's performance.
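A short sketch of reading attribute values this way follows. The class AttributeValues and the method dump are hypothetical names; the DOM calls are the standard org.w3c.dom ones.

```java
import org.w3c.dom.Attr;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;

// Hypothetical helper that never asks an attribute for its children.
final class AttributeValues {

    /** Returns the element's attributes as space-separated name=value pairs. */
    static String dump(Element element) {
        StringBuilder out = new StringBuilder();
        NamedNodeMap attrs = element.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            Attr attr = (Attr) attrs.item(i);
            if (out.length() > 0) {
                out.append(' ');
            }
            // getValue() returns the string directly. By contrast,
            // attr.getFirstChild().getNodeValue() would make Xerces2
            // create a Text node for the attribute.
            out.append(attr.getName()).append('=').append(attr.getValue());
        }
        return out.toString();
    }
}
```

Element.getAttribute(name) also returns the value as a string and is equally suitable when you already know the attribute's name.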

Many applications choose to use getChildNodes() and a NodeList for traversing the tree. However, it is less expensive to traverse the tree using the getFirstChild, getLastChild, getNextSibling, and getPreviousSibling methods, as shown in Listing 1.

Listing 1. Traversing Xerces2 DOM
Element root = document.getDocumentElement();

// Avoid traversing using NodeList:
NodeList children = root.getChildNodes();
for (int i = 0; i < children.getLength(); i++) {
   Node n = children.item(i);
}

// Instead, use the following methods:
Node child = root.getFirstChild();
while (child != null) {
   child = child.getNextSibling();
}

Serializing DOM

While it is possible to serialize a Xerces2 DOM using Java object serialization, we recommend that a DOM be serialized as XML wherever possible instead of using object serialization. The Xerces2 DOM implementation does not guarantee DOM Java object serialization interoperability between different versions of Xerces. In addition, some rough measurements have shown that XML serialization performs better than Java object serialization, and that XML instance documents require less storage space than object-serialized DOMs.
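As an illustration of serializing a DOM as XML, the sketch below uses the JAXP identity transformer. This is our choice for the example because it works with any DOM implementation; it is not a Xerces2-specific serializer, and the class name XmlSerialization is hypothetical.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Node;

// Hypothetical helper built on the standard JAXP transformation API.
final class XmlSerialization {

    /** Writes the node and its subtree as XML text, without an XML declaration. */
    static String toXml(Node node) {
        try {
            // A transformer created without a stylesheet copies its input unchanged.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            identity.transform(new DOMSource(node), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("serialization failed", e);
        }
    }
}
```

The resulting text can be stored or sent anywhere a serialized object could be, and can be read back by any XML parser regardless of the Xerces version that produced it.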


Reusing parsers

One of the common misconceptions about writing XML applications is that creating a parser instance does not incur a large performance cost. On the contrary, creating a parser instance involves creating, initializing, and setting up many objects that the parser needs and reuses for each subsequent parse. These initialization and setup operations are expensive.

In addition, creating a parser can be even more expensive if you are using the JAXP API. To obtain a parser with this API, you first need to retrieve a corresponding parser factory -- such as a SAXParserFactory -- and use it to create the parser. To retrieve a parser factory, JAXP uses a search mechanism that first looks up a ClassLoader (depending on the environment, this can be an expensive operation), and then attempts to locate a parser factory implementation that can be specified in the JAXP system property, the jaxp.properties file, or by using the Jar Service Provider mechanism. The lookup using the Jar Service Provider mechanism can be particularly expensive as it may search through all the JARs on the classpath; this can perform even worse if the ClassLoader consulted does a search on the network.

Consequently, in order to achieve better performance, we strongly recommend that your application create a parser once and then reuse this parser instance.

Typically, there are two types of applications:

  • Applications that need to use a parser with the same set of features and properties
  • Applications that might need to change parser features and properties between subsequent parses

For the first type of application, it is easy to reuse the parser. Normally, you first create a parser, configure it by setting features and properties, and then use the same parser instance to parse all XML documents, as shown in Listing 2:

Listing 2. Reusing a parser instance
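import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Use JAXP to retrieve SAX factory
SAXParserFactory factory = SAXParserFactory.newInstance();

// create a parser instance
SAXParser parser = factory.newSAXParser();
XMLReader reader = parser.getXMLReader();

// create the handler before registering it; DefaultHandler2 (from the
// SAX2 extensions) implements LexicalHandler in addition to the
// DefaultHandler callbacks, so it can serve as the lexical handler
DefaultHandler2 myHandler = new DefaultHandler2();

// set features and properties
reader.setFeature(
    "http://xml.org/sax/features/validation", 
    true);

reader.setProperty(
    "http://xml.org/sax/properties/lexical-handler", 
    myHandler);
// myErrorHandler is an application-defined org.xml.sax.ErrorHandler
reader.setErrorHandler(myErrorHandler);

// use the same parser instance to parse XML documents
for (int i = 0; i < args.length; i++) {
   parser.parse(args[i], myHandler);
}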

It is more difficult to implement parser caching for the second type of application using only one instance of the parser. While it is possible, you need to remember to reset the features and properties to the default state -- for example as shown in Listing 3:

Listing 3. Resetting features and properties values
// record the features used by application and their default values
HashMap defaultFeatureValues = new HashMap();
defaultFeatureValues.put("http://xml.org/sax/features/validation",
          Boolean.FALSE);
...

// record features that are set for this scenario
Vector currentFeatures = new Vector();
currentFeatures.add("http://xml.org/sax/features/validation");

// set features on the parser
parser.getXMLReader().setFeature(
        "http://xml.org/sax/features/validation", 
        true);

DefaultHandler myHandler = new DefaultHandler();
// use the same parser instance to parse XML documents
for (int i=0; i < args.length; i++){
   parser.parse(args[i], myHandler);
}

// reset parser features 
for (int i=0; i < currentFeatures.size(); i++){
   String feature = (String) currentFeatures.get(i);
   parser.getXMLReader().setFeature(
       feature, 
       ((Boolean)defaultFeatureValues.get(feature)).booleanValue());
}

There is no simple API for resetting parser features, and in some cases resetting properties might not be possible; for example, Xerces2 version 2.6.2 throws a NullPointerException if you attempt to set a property value to null. As a result, it is better to use multiple parser instances.

You can start by defining a parser pool interface and registering its implementation with your application. Given a set of features and properties, the parser pool should either return a parser from an internal pool or create and store a new parser instance if one does not exist. The application should interact with your parser pool implementation each time it needs to get a parser.

If your application runs in a multithreaded environment, you need to make sure that your parser pool is synchronized. In this case, the parser pool needs to define not only a get method but also a release method that allows a thread to release a parser instance back to the pool, making it available for the other threads. Listing 4 illustrates a possible interface for a SAX parser pool. To ensure that your implementation is thread-safe, a class that implements this interface should use either the synchronized keyword on the methods or synchronized blocks within the methods.

Listing 4. Sample interface for reusing SAX parsers in a multithreaded environment
public interface XMLParserPool
{
  /**
   * Retrieves a parser from the pool given specified properties 
   * and features.
   * If parser can't be created using specified properties 
   * or features, an exception can be thrown.
   */
  public SAXParser get(Map features, Map properties) 
           throws ParserConfigurationException, SAXException;

  /**
   * Returns the parser to the pool.
   */
  public void release(SAXParser parser, 
            Map features, 
            Map properties);
}
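A minimal sketch of a class in the spirit of this interface might look as follows. It is one possible design, not a definitive implementation: the class name SimpleSAXParserPool is hypothetical, it declares typed maps (feature URI to Boolean, property URI to Object) where the listing above uses plain Map, and it keeps one queue of idle parsers per distinct combination of features and properties. Both methods are synchronized, as recommended above.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

// Hypothetical pool: one queue of idle parsers per distinct configuration.
final class SimpleSAXParserPool {
    // The factory is looked up once and only used inside synchronized methods.
    private final SAXParserFactory factory = SAXParserFactory.newInstance();
    private final Map<List<Object>, Deque<SAXParser>> idle = new HashMap<>();

    /** Returns an idle parser with this configuration, or creates and configures one. */
    public synchronized SAXParser get(Map<String, Boolean> features,
                                      Map<String, Object> properties)
            throws ParserConfigurationException, SAXException {
        Deque<SAXParser> queue = idle.get(key(features, properties));
        SAXParser parser = (queue == null) ? null : queue.poll();
        if (parser != null) {
            return parser;
        }
        parser = factory.newSAXParser();
        XMLReader reader = parser.getXMLReader();
        for (Map.Entry<String, Boolean> f : features.entrySet()) {
            reader.setFeature(f.getKey(), f.getValue());
        }
        for (Map.Entry<String, Object> p : properties.entrySet()) {
            reader.setProperty(p.getKey(), p.getValue());
        }
        return parser;
    }

    /** Makes the parser available again under the configuration it was created with. */
    public synchronized void release(SAXParser parser,
                                     Map<String, Boolean> features,
                                     Map<String, Object> properties) {
        idle.computeIfAbsent(key(features, properties), k -> new ArrayDeque<>())
            .push(parser);
    }

    // Copies the maps so that later changes by the caller cannot corrupt the key.
    private static List<Object> key(Map<String, Boolean> features,
                                    Map<String, Object> properties) {
        return Arrays.<Object>asList(new HashMap<String, Boolean>(features),
                                     new HashMap<String, Object>(properties));
    }
}
```

A caller would typically wrap the parse in try/finally so that the parser is released even when parsing throws. Between get and release, the parser belongs to the calling thread alone; only the pool's bookkeeping is shared.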

Conclusion

This paper discussed how you can improve an application’s performance when using the Xerces2 SAX and DOM implementations. It also showed you how to improve your XML applications' performance by reusing and caching parsers. The third paper in this series will continue discussing how you can use Xerces2-specific features and properties to improve performance. It will give a short overview of the Xerces Native Interface (XNI), compare it with SAX, and discuss the Xerces2 grammar caching API, which can significantly improve performance of applications that require validation against DTDs or XML schemas.

Resources
