Improve performance in your XML applications, Part 3

Xerces Native Interface (XNI), Xerces2 features and properties, and caching schemas

In this final installment of a three-part series describing best practices for writing XML applications, authors Elena Litani and Michael Glavassevich explain how to use Xerces2-specific features and properties to improve performance. They also give a short overview of the Xerces Native Interface (XNI), compare it with SAX, and discuss the Xerces2 grammar caching API, which can significantly improve performance of applications that require validation against DTDs or XML schemas.

Elena Litani (elitani@ca.ibm.com), Software Developer, IBM, Software Group

Elena Litani is an IBM Software Developer who works on the Eclipse Modeling Framework (EMF). Previously, Elena was one of the main contributors to the Apache Xerces2 project, working on the Xerces2 XML Schema and DOM Level 3 implementations, as well as analyzing and improving the performance of the parser. Elena has also represented IBM in the W3C DOM Working Group, and participated in the development of the DOM Level 3 specifications. You can contact her at elitani@ca.ibm.com.



Michael Glavassevich (mrglavas@ca.ibm.com), Software Developer, IBM, Software Group

Michael Glavassevich is a Software Developer at the IBM Toronto Lab. He started contributing to the Xerces2 project in 2003, and is now one of the lead developers. You can contact him at mrglavas@ca.ibm.com.



02 September 2004

In the first two parts of this series, we described how you can improve performance in your XML applications with general tips on how best to write your XML documents, reuse parsers, and take better advantage of the SAX and DOM APIs in general and with Xerces2. Here, we show how you can optimize the performance of your application by using the Xerces Native Interface (XNI) and configuring features and properties specific to Xerces2. We also explain how to cache schemas with Xerces2.

To understand how to achieve better performance with Xerces2, you need to have some understanding of the Xerces2 design and XNI. Therefore, in this article we give a brief overview of XNI, explaining how it is different from SAX and how you can use XNI to make your application faster. If you want to learn more about XNI, please read the Apache Xerces2 XNI manual (see Resources).

Xerces Native Interface (XNI)

The internals of Xerces2 are built on the Xerces Native Interface (XNI), a framework for communicating streaming document information similar to SAX, and for constructing generic parsers and their components. XNI was declared stable almost two years ago and it is not expected to undergo major changes in the future.

In theory, other parsers can implement the XNI interfaces; however, Xerces2 currently appears to be the only parser that does. XNI therefore remains an internal API framework that the Xerces2 parser uses to communicate data between its components.

Xerces2 design

In Xerces2, both the SAX and DOM parsers contain an XNI parser configuration that defines the entry point for the parser to set features and properties and to initiate a parse of an XML document. A typical parser configuration is composed of a set of components. Some of these components may be chained together to form a parsing pipeline, where each component communicates information to the next component using XNI's document handler interfaces. The SAX and DOM parsers are connected to the last component in the pipeline, and are responsible for generating their respective APIs (SAX events and DOM trees) from the XNI stream they receive.

The Xerces2 framework separates the configuration of components in the parsing pipeline from the API generation code. This separation allows the same API-generating parser to be used with an unlimited number of different parser configurations, and also allows the same parser configuration to be reused by different API-generating parsers.

Performance and parser configurations

The performance of your application may be affected by your choice of parser configuration. The default parser configuration (as of Xerces 2.6.2) supports XML 1.0/1.1, Namespaces in XML 1.0/1.1, as well as DTD and W3C XML Schema validation. If your application does not require validation, you can achieve better performance by using the non-validating parser configuration (org.apache.xerces.parsers.NonValidatingConfiguration).

You can override the default parser configuration used by Xerces2 parsers without writing any code or changing the existing parser classes. Just use one of the following mechanisms:

  • Set the org.apache.xerces.xni.parser.XMLParserConfiguration system property to point to the configuration you want to use.
  • Add an org.apache.xerces.xni.parser.XMLParserConfiguration file to your application's JAR META-INF/services/ directory. This file needs to contain the class name of the parser configuration. As long as a JAR file that contains this file appears before Xerces' JAR files, the parser will use the new parser configuration.
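The first mechanism can also be applied from code. The following is a minimal sketch, not part of Xerces2: the class ParserConfigSelector and its select method are hypothetical helpers, and the sketch assumes the property is set before the application creates its first parser.

```java
/** Hypothetical helper that selects a Xerces2 parser configuration by system property. */
final class ParserConfigSelector {
    /** System property named in the first mechanism above. */
    static final String CONFIG_PROPERTY =
        "org.apache.xerces.xni.parser.XMLParserConfiguration";

    /** The non-validating configuration class named earlier in this article. */
    static final String NON_VALIDATING =
        "org.apache.xerces.parsers.NonValidatingConfiguration";

    private ParserConfigSelector() {}

    /**
     * Stores the configuration class name in the system property and
     * returns the previous value (null if the property was not set).
     */
    static String select(String configurationClassName) {
        return System.setProperty(CONFIG_PROPERTY, configurationClassName);
    }
}
```

Calling ParserConfigSelector.select(ParserConfigSelector.NON_VALIDATING) at startup has the same effect as passing the property with the -D option of the java launcher. For the second mechanism, the services file would contain the same class name on a single line.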

XNI versus SAX

As mentioned earlier, XNI provides a framework similar to the SAX API. It is also an event-based API, where parsing events (such as the start and end of elements) are pushed to the XNI components and your application through callbacks to the handlers.

The XNI handler interfaces were designed to provide all document information as defined by the XML Information Set specification. They can also provide additional information, such as the Post-Schema-Validation Infoset (PSVI), using a structure called XNI Augmentations, which is present as a parameter on each handler method. SAX 2.0.1 and earlier versions of SAX did not report all document information; for example, SAX had no method for retrieving the encoding of the document being parsed. With the release of SAX 2.0.2 in early 2004, all the information from the XML Information Set is reported, and you can retrieve encoding information from the Locator2 interface.

Assuming that your SAX application always runs in an environment where Xerces2 is present, you can benefit by switching to programming with XNI directly. Not only does this give you access to more information, but you can also improve the performance of your application. As we mentioned in Part 2, Xerces2's SAX parser must sometimes perform additional processing; for instance, the modification or removal of namespace declaration attributes from the parser's internal representation to conform to the SAX specification. This carries a performance cost that you can avoid by using XNI.

The design of XNI for accessing the document information set is more tuned for performance than that of SAX. To access some information in SAX, you need to cast an object to a different interface; this carries a performance cost. For example, to query if an attribute has been specified in the document or defaulted through a DTD, you need to cast the org.xml.sax.Attributes to the org.xml.sax.ext.Attributes2 interface. To access the encoding and the version of an XML document, you need to cast the locator object provided in the ContentHandler.setDocumentLocator(Locator) call to the Locator2 interface. With XNI, the encoding and version of an XML document are pushed to the application through the org.apache.xerces.xni.XMLDocumentHandler.xmlDecl method.
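The SAX side of this comparison can be sketched as follows. The class ExtensionInfoHandler and its inspect method are hypothetical names introduced for illustration; the sketch uses only the standard SAX2 extension interfaces and guards each cast with instanceof, because a parser is not required to implement them.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.Locator;
import org.xml.sax.ext.Attributes2;
import org.xml.sax.ext.Locator2;
import org.xml.sax.helpers.DefaultHandler;

/** Collects information that SAX exposes only through its extension interfaces. */
final class ExtensionInfoHandler extends DefaultHandler {
    final List<String> defaultedAttributes = new ArrayList<>();
    String encoding;    // stays null if the parser offers no Locator2
    String xmlVersion;  // stays null if the parser offers no Locator2
    private Locator locator;

    @Override
    public void setDocumentLocator(Locator locator) {
        this.locator = locator;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        // Encoding and version: the locator must be cast to Locator2.
        if (locator instanceof Locator2) {
            Locator2 locator2 = (Locator2) locator;
            encoding = locator2.getEncoding();
            xmlVersion = locator2.getXMLVersion();
        }
        // Specified versus defaulted: the attributes must be cast to Attributes2.
        if (attrs instanceof Attributes2) {
            Attributes2 attrs2 = (Attributes2) attrs;
            for (int i = 0; i < attrs2.getLength(); i++) {
                if (!attrs2.isSpecified(i)) {
                    defaultedAttributes.add(attrs2.getQName(i));
                }
            }
        }
    }

    /** Parses UTF-8 encoded XML text and returns the populated handler. */
    static ExtensionInfoHandler inspect(String xml) {
        ExtensionInfoHandler handler = new ExtensionInfoHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return handler;
    }
}
```

With XNI, the same encoding and version information arrives as arguments of the xmlDecl callback, so no cast or type check is needed.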

In SAX, for each namespace declaration your application receives a startPrefixMapping call and a corresponding endPrefixMapping call. This means that if your XML documents include a relatively large number of namespace declarations, you get lots of additional callbacks and it slows down your application. To avoid this, XNI provides namespace information in one object (org.apache.xerces.xni.NamespaceContext) which it passes through the startDocument method of the XNI document handler. An application is responsible for saving this object if it later needs to access the current namespace information.
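To see the extra SAX callbacks in practice, the following sketch counts them. PrefixMappingCounter is a hypothetical class written for this illustration; it relies only on the standard JAXP and SAX APIs.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Counts the prefix-mapping callbacks that SAX issues for namespace declarations. */
final class PrefixMappingCounter extends DefaultHandler {
    int starts;
    int ends;

    @Override
    public void startPrefixMapping(String prefix, String uri) {
        starts++;
    }

    @Override
    public void endPrefixMapping(String prefix) {
        ends++;
    }

    /** Parses the given XML text with namespace processing on and returns the counter. */
    static PrefixMappingCounter count(String xml) {
        PrefixMappingCounter counter = new PrefixMappingCounter();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            // Prefix mappings are reported only when namespace processing is enabled.
            factory.setNamespaceAware(true);
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), counter);
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return counter;
    }
}
```

A document with three namespace declarations produces three startPrefixMapping calls and three endPrefixMapping calls, in addition to the element callbacks themselves.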

Why else might you consider using XNI directly? If, for example, your application needs to merge two XML documents and validate the resulting document against an XML Schema, you can create a new XNI component that intercepts XNI document handler events and merges the two documents. Then you would create a new configuration (making sure it has a public, no argument constructor) that puts this component in the parsing pipeline before the XML Schema validator. Using XNI here can help you to avoid creating a memory structure for the intermediate representation of your documents, merging the two documents in memory, possibly serializing the resulting document, and then asking the parser to parse the document again and validate it.


Xerces2 features and properties: Performance tips

In this section, we discuss additional features and properties specific to the Xerces2 parser and how to configure them to improve your application's performance.

Selecting the input buffer size

Inside the parser, a number of buffers store chunks of the document in memory as it is being parsed. By default, the size of the input buffer in the readers is 2 KB, which means that up to 2 KB is read from the input stream at a time. Some performance tests have shown that increasing this input buffer size improves performance when parsing documents larger than 2 KB. A property identified by the URI http://apache.org/xml/properties/input-buffer-size was introduced in Xerces 2.1.0, allowing you to calibrate the size of the input buffer to best fit the size of your XML documents. The value of this property is an instance of java.lang.Integer that gives the buffer size in bytes.

For small documents (generally less than 4 KB), sticking with the default buffer size should give you good performance; if your documents are under 2 KB in size, you may even want to choose an input buffer size that's smaller than 2 KB. For larger documents, you should set the property to between 4 KB and 8 KB to achieve best performance. The benefit of choosing larger buffer sizes drops off beyond 8 KB, though this may depend on the type of input source you pass to the parser.
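A minimal sketch of setting this property through the SAX XMLReader interface follows. InputBufferTuning and its methods are hypothetical names; the sketch assumes that a parser which does not know the property rejects it with a SAX exception, in which case the default buffer size simply stays in effect.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper that applies the Xerces2 input-buffer-size property. */
final class InputBufferTuning {
    static final String INPUT_BUFFER_SIZE =
        "http://apache.org/xml/properties/input-buffer-size";

    private InputBufferTuning() {}

    /** Tries to set the buffer size in bytes; returns false if the parser rejects the property. */
    static boolean setInputBufferSize(XMLReader reader, int bytes) {
        try {
            reader.setProperty(INPUT_BUFFER_SIZE, Integer.valueOf(bytes));
            return true;
        } catch (SAXException e) {
            // Property not recognized or not supported: keep the parser's default.
            return false;
        }
    }

    /** Parses the text with the requested buffer size and returns the number of elements seen. */
    static int countElements(String xml, int bufferBytes) {
        final int[] count = {0};
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            setInputBufferSize(reader, bufferBytes);
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String local, String qName, Attributes atts) {
                    count[0]++;
                }
            });
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return count[0];
    }
}
```

The buffer size changes only how the input is read, not what is reported, so it is worth measuring a few values between 4 KB and 8 KB against your own documents.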

Avoid loading an external DTD

In Part 1, we showed you how to configure a parser using standard features defined by SAX to avoid processing external general and parameter entities. As we discussed, according to the XML specification a validating processor (such as Xerces2) must process both the internal and external DTD subsets. Xerces2 defines a feature identified by the URI http://apache.org/xml/features/nonvalidating/load-external-dtd for controlling whether the external DTD subset will be read if your document has an external subset. If your application does not require that the external subset be read, set this feature to false to improve the performance. You should use this feature with caution if your application:

  • Needs to query attribute types
  • Is sensitive to the whitespace collapsing performed by the parser for non-CDATA attributes
  • Relies on whitespace in element content to be reported as ignorable

If you have enabled validation, the value of this feature is ignored, in which case external DTD subsets are always read.
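As a sketch of how this feature might be applied through JAXP, the hypothetical class below turns it off on a SAXParserFactory and parses a document whose DOCTYPE names an external subset. It assumes the underlying parser is Xerces2 or a derivative that recognizes the feature URI; other parsers are expected to reject it.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper that parses without reading the external DTD subset. */
final class ExternalDtdSkipper {
    static final String LOAD_EXTERNAL_DTD =
        "http://apache.org/xml/features/nonvalidating/load-external-dtd";

    private ExternalDtdSkipper() {}

    /** Returns the name of the root element, never fetching the external subset. */
    static String rootElementName(String xml) {
        final String[] root = new String[1];
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            // The feature is ignored when validation is enabled, so keep it off.
            factory.setValidating(false);
            factory.setFeature(LOAD_EXTERNAL_DTD, false);
            factory.newSAXParser().parse(
                new InputSource(new StringReader(xml)),
                new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String local, String qName, Attributes atts) {
                        if (root[0] == null) {
                            root[0] = qName;
                        }
                    }
                });
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return root[0];
    }
}
```

Because the external subset is never read, attribute defaults and attribute types declared there are not applied, which is the reason for the cautions listed above.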

Disable full schema constraint checking

If you have experience setting up Xerces2 for XML Schema validation, you may have noticed that two features affect it, identified by the URIs http://apache.org/xml/features/validation/schema and http://apache.org/xml/features/validation/schema-full-checking. When set to true, the first feature enables schema validation. The W3C XML Schema specification defines a number of constraints that are time consuming and memory intensive to check, including the particle derivation and unique particle attribution constraints. The second feature, when set to true in conjunction with the first, enables checking of these more complex constraints. By default, the schema-full-checking feature has a value of false.

While you are developing your schema grammar, set this feature to true so that Xerces2 checks all the constraints necessary to determine whether your schema is valid. Once you are ready to deploy your application, set it to false. By skipping the more expensive constraints, you can improve the performance of your application even if you are already using the grammar caching mechanism described below. Note that disabling the schema-full-checking feature does not affect the level of checking performed on instance documents; it applies only to the most expensive constraints on the schema grammar itself.
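One way to switch between development and deployment settings is a small factory helper. The sketch below is illustrative only: SchemaCheckingConfig and its methods are hypothetical, and it assumes a Xerces2-based JAXP implementation that recognizes both feature URIs.

```java
import javax.xml.parsers.SAXParserFactory;

/** Hypothetical helper that configures the two schema validation features. */
final class SchemaCheckingConfig {
    static final String SCHEMA_VALIDATION =
        "http://apache.org/xml/features/validation/schema";
    static final String SCHEMA_FULL_CHECKING =
        "http://apache.org/xml/features/validation/schema-full-checking";

    private SchemaCheckingConfig() {}

    /** Creates a schema-validating factory; pass true while developing the schema. */
    static SAXParserFactory newFactory(boolean fullChecking) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);   // XML Schema validation needs namespaces
            factory.setValidating(true);       // report validity errors
            factory.setFeature(SCHEMA_VALIDATION, true);
            factory.setFeature(SCHEMA_FULL_CHECKING, fullChecking);
            return factory;
        } catch (Exception e) {
            throw new IllegalStateException(
                "parser does not support the Xerces2 schema features", e);
        }
    }

    /** Reads back the current schema-full-checking setting of a factory. */
    static boolean isFullChecking(SAXParserFactory factory) {
        try {
            return factory.getFeature(SCHEMA_FULL_CHECKING);
        } catch (Exception e) {
            throw new IllegalStateException("feature not available", e);
        }
    }
}
```

An application could call newFactory(true) in its test suite and newFactory(false) in production, so that the expensive schema checks run only where a broken schema would be caught early.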

Avoid generating PSVI

The W3C XML Schema specification defines augmentations to the XML information set called the Post-Schema-Validation Infoset (PSVI). The PSVI is generated as the result of schema assessment and validation, and includes properties such as the type an element has been validated against and schema normalized values for elements and attributes. By default, Xerces2 generates PSVI. To expose this information, Xerces2 implements an XML Schema API that provides access to PSVI through SAX and DOM. If your application is unable to use APIs other than the ones specified by JAXP, then the PSVI reported by the parser will not be accessible to you. Xerces2 defines a feature identified by the URI http://apache.org/xml/features/validation/schema/augment-psvi that specifies whether PSVI will be generated during schema validation. If you will not be reading from the PSVI, set this feature to false to avoid generating PSVI. This will improve your application's performance when it performs schema validation.
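The feature can be turned off on an XMLReader as sketched below. PsviSwitch is a hypothetical helper; because the feature is specific to Xerces2, the sketch treats a rejected feature as a normal outcome rather than an error.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

/** Hypothetical helper that switches off PSVI generation where the parser allows it. */
final class PsviSwitch {
    static final String AUGMENT_PSVI =
        "http://apache.org/xml/features/validation/schema/augment-psvi";

    private PsviSwitch() {}

    /** Creates a namespace-aware SAX reader from the default JAXP factory. */
    static XMLReader newReader() {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newSAXParser().getXMLReader();
        } catch (Exception e) {
            throw new IllegalStateException("no SAX parser available", e);
        }
    }

    /** Requests that no PSVI be generated; returns false if the parser rejects the feature. */
    static boolean disablePsvi(XMLReader reader) {
        try {
            reader.setFeature(AUGMENT_PSVI, false);
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    /** Returns the current setting, or null if the parser does not know the feature. */
    static Boolean psviState(XMLReader reader) {
        try {
            return reader.getFeature(AUGMENT_PSVI);
        } catch (SAXException e) {
            return null;
        }
    }
}
```

The setting matters only when schema validation is enabled, since PSVI is produced as a result of schema assessment.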


Caching schemas

Today, many applications need to validate XML documents against a schema, such as a DTD or a W3C XML Schema. Validation is an expensive process, since the parser needs not only to parse an XML document, but also to access the schema or schemas provided, then parse and build some internal representation of the schemas. For simplicity, we will refer to the process of pre-parsing a schema and building an internal representation of the schema as compiling the schema. The parser then uses the compiled schemas to validate XML documents.

If your application has a limited set of schemas against which you want to validate XML documents, consider compiling and caching the schemas, since doing so can significantly improve the performance of your application. In particular, if most of the XML documents your application processes are relatively small (less than 2 KB), then compiling your schemas can consume more than half of the overall processing time of your XML documents.

To cache schemas, you need an API that allows you to compile schemas and set those on the parser. The bad news is that until JAXP 1.3 is finalized, no standard API can do that. The JAXP 1.3 specification defines a simple API that allows applications to compile and cache schemas. However, if you want to cache schemas to improve performance of your applications today, you will need to use the Xerces2-specific API.
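For readers who do have a JAXP 1.3 (or later) implementation available, the standard approach can be sketched with the javax.xml.validation package: compile the schema once into a Schema object and create a lightweight Validator per document. CompiledSchemaCache is a hypothetical class name used only for this illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

/** Compiles a W3C XML Schema once and reuses it for every document. */
final class CompiledSchemaCache {
    // A Schema object is immutable and can be shared between threads.
    private final Schema schema;

    CompiledSchemaCache(String schemaText) {
        try {
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            schema = factory.newSchema(new StreamSource(new StringReader(schemaText)));
        } catch (SAXException e) {
            throw new IllegalArgumentException("schema did not compile", e);
        }
    }

    /** Validates one document against the cached schema. */
    boolean isValid(String xml) {
        // Validators are cheap to create but must not be shared between threads.
        Validator validator = schema.newValidator();
        try {
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the default error handler throws on the first validity error
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The cost of compiling the schema is paid once in the constructor; each call to isValid then pays only for parsing and validating the instance document.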

Xerces2 grammar caching API

By default, the Xerces2 parser does not cache schemas (in Xerces terminology, schemas are called grammars). To allow applications to optimize performance, Xerces2 defines the XNI Grammar API (org.apache.xerces.xni.grammars), which allows you to compile schemas, create a grammar pool that contains those schemas, and register that grammar pool with the parser.

Most likely your application will not need to use the complete XNI Grammar API to be able to cache schemas. Instead, you can use the default Xerces2 grammar caching implementation (org.apache.xerces.util.XMLGrammarPoolImpl), which allows you to passively compile and cache the schemas. Here is how the default implementation works:

  1. Xerces2's parser starts parsing an XML document.
  2. Before the validation starts, a Xerces validator (either XML Schema or DTD validator) calls the retrieveInitialGrammarSet(String) method of the org.apache.xerces.xni.grammars.XMLGrammarPool interface to retrieve a set of compiled schemas from the registered grammar pool. These schemas are stored in a validator's grammar bucket and used during validation if needed.
  3. During parsing, if a validator needs a new schema, it first checks to see if the grammar bucket has the compiled schema it needs. In the case of XML Schemas, the schemas in the grammar bucket are keyed by targetNamespace URIs; DTDs are keyed on the root element.
  4. If the compiled schema is not found in the grammar bucket, the validator asks the registered grammar pool to provide a schema using the retrieveGrammar(XMLGrammarDescription) method. If no schema is returned by the pool, the validator will try to resolve the schema using the registered entity resolver (such as org.xml.sax.EntityResolver) or using default entity resolution.
  5. At the end of the parse, the validator returns all the compiled schemas in the grammar bucket to the registered grammar pool using the cacheGrammars(String, Grammar[]) method. This set of compiled schemas is stored in the grammar pool, and in the subsequent parse is provided to the Xerces2 validator as the initial set of grammars (as described in step 2).

Therefore, if you are using the default implementation, you don't need to compile schemas ahead of time. In other words, your application starts with an empty grammar pool. The parser compiles any new schema it needs for validation and gives all the compiled schemas back to the grammar pool after the parsing of your XML document is complete. As a result, the processing of the first several XML documents may be slow. However, assuming that you have a limited set of schemas, as you continue to parse additional XML documents the grammar pool will eventually include all the compiled schemas needed for validation of your XML documents. At this point, the parser no longer needs to compile any new schemas, and the validation of subsequent XML documents no longer endures the performance cost of compiling schemas.
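The passive caching flow described above can be illustrated with a toy model. This is not the Xerces2 API: ToyGrammarPool is a hypothetical, simplified stand-in in which a grammar is looked up by a string key and "compiling" is any function supplied by the caller. It shows only why the number of compilations stops growing once the pool has seen every schema.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Toy model of passive grammar caching; a simplification, not the Xerces2 implementation. */
final class ToyGrammarPool<G> {
    private final Map<String, G> pool = new HashMap<>();  // survives across parses
    private final Function<String, G> compiler;           // stands in for schema compilation
    private int compilations;

    ToyGrammarPool(Function<String, G> compiler) {
        this.compiler = compiler;
    }

    /** Simulates one parse that needs the grammars identified by the given keys. */
    void parse(String... keys) {
        // Step 2: the validator's grammar bucket starts with the pool's contents.
        Map<String, G> bucket = new HashMap<>(pool);
        for (String key : keys) {
            // Steps 3 and 4: compile only when the grammar is not already available.
            if (!bucket.containsKey(key)) {
                bucket.put(key, compiler.apply(key));
                compilations++;
            }
        }
        // Step 5: return every grammar in the bucket to the pool.
        pool.putAll(bucket);
    }

    /** Number of compilations performed so far. */
    int compilations() {
        return compilations;
    }
}
```

In this model, parsing documents that use the keys "urn:a" and "urn:b", then "urn:a" alone, costs two compilations in total: the second parse finds its grammar already in the pool.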

You can trigger the Xerces2 default schema caching implementation in two ways:

  • You can specify the grammar caching parser configuration (org.apache.xerces.parsers.XMLGrammarCachingConfiguration).
  • You can set the default grammar pool implementation (org.apache.xerces.util.XMLGrammarPoolImpl) on the parser using the Xerces2 property identified by the URI http://apache.org/xml/properties/internal/grammar-pool.

As we discussed in Part 2, if you expect that in some environments your application will have Xerces2 on the classpath, and your application has a limited number of schemas, you should try to trigger the grammar caching (as shown in Listing 1) so that if the Xerces2 parser is available, your application will perform better.

Listing 1. Triggering schema caching
...
	
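XMLReader parser = XMLReaderFactory.createXMLReader();

try {
    // Load the pool reflectively so that this code still compiles
    // and runs when Xerces2 is not on the classpath.
    Class poolClass =
        Class.forName("org.apache.xerces.util.XMLGrammarPoolImpl");

    Object grammarPool = poolClass.newInstance();

    parser.setProperty(
        "http://apache.org/xml/properties/internal/grammar-pool",
        grammarPool);
}
catch (Exception e) {
    // Xerces2 is not available, or the parser does not recognize
    // the property: continue without grammar caching.
}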

Providing your own grammar pool implementation

In some cases, you might want to provide your own grammar pool implementation. For example, suppose your application plans to process XML documents that may include non-repetitive schemas. You probably do not want the parser to add every schema it has processed to the default grammar pool, since this might eventually cause the virtual machine to run out of memory. Instead, you may want to compile your most frequently used subset of schemas and then lock the pool (see the XMLGrammarPool lockPool() method) to prevent new schemas from being added to it, as shown in Listing 2.

Listing 2. Pre-parsing schemas
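import org.apache.xerces.parsers.XMLGrammarPreparser;
import org.apache.xerces.util.XMLGrammarPoolImpl;
import org.apache.xerces.xni.grammars.Grammar;
import org.apache.xerces.xni.grammars.XMLGrammarDescription;
import org.apache.xerces.xni.grammars.XMLGrammarPool;
import org.apache.xerces.xni.parser.XMLInputSource;

// property and feature identifiers used below
final String GRAMMAR_POOL =
    "http://apache.org/xml/properties/internal/grammar-pool";
final String NAMESPACES_FEATURE_ID =
    "http://xml.org/sax/features/namespaces";
final String VALIDATION_FEATURE_ID =
    "http://xml.org/sax/features/validation";

// create grammar preparser
XMLGrammarPreparser preparser = new XMLGrammarPreparser();

// register the XML Schema pre-parser
// (passing null selects the default loader for this grammar type)
preparser.registerPreparser(XMLGrammarDescription.XML_SCHEMA, null);

// create grammar pool; XMLGrammarPool is an interface,
// so instantiate the default implementation
XMLGrammarPool grammarPool = new XMLGrammarPoolImpl();

// set the grammar pool on the grammar preparser
// so that all the compiled grammars are automatically
// placed in the grammar pool
preparser.setProperty(GRAMMAR_POOL, grammarPool);

// set features
preparser.setFeature(NAMESPACES_FEATURE_ID, true);
preparser.setFeature(VALIDATION_FEATURE_ID, true);

// parse grammar(s); throws XNIException or IOException on failure
Grammar g = preparser.preparseGrammar(
    XMLGrammarDescription.XML_SCHEMA,
    new XMLInputSource(null, "personal.xsd", null));

// lock grammar pool so that no further grammars are added
grammarPool.lockPool();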

// next register the grammar pool with the parser
// and start parsing
...

Conclusion

In this article, we introduced the Xerces Native Interface (XNI) and described how you can use this API directly to improve your application's performance. We then discussed several features and properties specific to Xerces2 and how you can calibrate them to accelerate the processing of your documents. Finally, we showed you how to cache schemas with Xerces2 to avoid the cost of processing them repeatedly.

The techniques described in this three-part series should help you make decisions as you write your XML applications for higher performance. Of course, we did not cover many other techniques. Keep in mind that what performs well with one parser implementation may perform poorly with another, and vice versa, so you may need to experiment a bit to get the best possible performance.

Resources
