
Improve the performance of your XML applications using Xerces-C++

Delve into the Xerces-C++ properties and features, data handling, and schema grammar caching

David A. Cargill (cargilld@ca.ibm.com), Software Developer, IBM
David Cargill is a member of the XML Parser Development team at IBM Canada. He's been involved in the development of the Xerces-C++ parser for the last five years.
Khaled Noaman (knoaman@ca.ibm.com), Software Developer, IBM, Software Group
Khaled Noaman is a member of the XML Parser Development team at IBM. He's been involved in the development of the Xerces-C++ parser for the last five years and implemented many of the parser features including support for XML Schema Structures.

Summary:  XML is becoming a staple of data exchange, both between applications and on the Web. Learn how to improve the performance of your XML applications by using the Xerces-C++ parser properly. You'll learn the best ways to use the parser efficiently, and which features and properties affect its performance.

Date:  16 May 2008 (Published 13 May 2008)
Level:  Intermediate

XML has gained widespread popularity with the emergence of Web services and service-oriented architecture (SOA). It plays an important role in the exchange of data both between applications and on the Web, and it's the cornerstone of many performance-critical scenarios.

Frequently used acronyms

  • API: application programming interface
  • CPU: Central Processing Unit
  • DOM: Document Object Model
  • DTD: Document Type Definition
  • SAX: Simple API for XML
  • W3C: World Wide Web Consortium
  • XML: Extensible Markup Language

You can improve the performance of your XML application by using the parser efficiently. Xerces-C++ is an open source validating XML parser available from Apache. In this article, we'll show you several tips on how to use the Xerces-C++ parser to improve the performance of your applications.

Xerces-C++ is a validating XML parser that is provided as a shared library. The library includes interfaces for DOM and SAX. Specifically, SAXParser is an interface for the SAX 1.0 specification; SAX2XMLReader is an interface for the SAX 2.0 specification; XercesDOMParser is an interface for the DOM specification; and DOMBuilder is an implementation of the Load interface of the DOM Level 3.0 Abstract Schemas and Load and Save specification.

Properties and features

Numerous properties and features can have significant influence over the performance of the parser.

Using the right scanner

One of the major components in Xerces-C++ is the scanner. It's not only responsible for scanning an XML instance, but it also plays an important role in assessing the validity of the XML document.

Xerces-C++ has four scanners: IGXMLScanner, WFXMLScanner, DGXMLScanner, and SGXMLScanner. It's important to choose the appropriate scanner for your scenario to obtain better performance. IGXMLScanner, the default scanner, is an all-purpose scanner that not only handles well-formedness, but is also involved in validating XML documents against DTDs and/or XML Schemas. WFXMLScanner, on the other hand, handles only well-formedness checking, not grammar validation. If you're only concerned with the well-formedness of the document, then use WFXMLScanner. Use DGXMLScanner if you're only doing DTD validation, and use SGXMLScanner if you're only doing XML Schema validation.

You can tell the parser which scanner to use by setting the scanner property on a SAX2XMLReader API or a DOMBuilder API. Listing 1 shows you how to set the scanner on SAX2XMLReader.


Listing 1. Setting the scanner on a SAX2XMLReader API
#include <xercesc/internal/XMLGrammarPoolImpl.hpp>
#include <xercesc/sax2/XMLReaderFactory.hpp>
#include <xercesc/util/XMLUni.hpp>

XMLGrammarPool *grammarPool = new XMLGrammarPoolImpl(XMLPlatformUtils::fgMemoryManager);
SAX2XMLReader* parser = XMLReaderFactory::createXMLReader(
                               XMLPlatformUtils::fgMemoryManager, grammarPool);

parser->setProperty(XMLUni::fgXercesScannerName, (void *)XMLUni::fgSGXMLScanner);

For a SAXParser API or a XercesDOMParser API, you can call the useScanner method to specify which scanner the parser should use, as in Listing 2.


Listing 2. Setting the scanner on a SAXParser API
#include <xercesc/parsers/SAXParser.hpp>
#include <xercesc/util/XMLUni.hpp>
				
SAXParser* parser = new SAXParser();
parser->useScanner(XMLUni::fgDGXMLScanner);

For more information on how to use a specific scanner in Xerces-C++, see Resources.

Controlling validation

Once you specify which scanner to use, you still have features to control whether or not the parser performs validation.

You can tell the parser how to validate an instance document by setting the validation feature on a SAX2XMLReader API or a DOMBuilder API. Listing 3 shows you how.


Listing 3. Setting validation on a DOMBuilder API
#include <xercesc/dom/DOMImplementation.hpp>
#include <xercesc/dom/DOMImplementationLS.hpp>
#include <xercesc/dom/DOMImplementationRegistry.hpp>
#include <xercesc/dom/DOMBuilder.hpp>
#include <xercesc/util/XMLUni.hpp>
#include <xercesc/util/XMLUniDefs.hpp>
				
static const XMLCh gLS[] = { chLatin_L, chLatin_S, chNull };
DOMImplementation *impl = DOMImplementationRegistry::getDOMImplementation(gLS);
DOMBuilder *parser = ((DOMImplementationLS*)impl)->createDOMBuilder(
                       DOMImplementationLS::MODE_SYNCHRONOUS, 0);

/* specify a validating parse */
parser->setFeature(XMLUni::fgDOMValidateIfSchema, true);
parser->setFeature(XMLUni::fgDOMValidation, true);

/* for SAX2 use XMLUni::fgSAX2CoreValidation and XMLUni::fgXercesDynamic */

For a SAXParser API or a XercesDOMParser API, you can call the setValidationScheme method to specify validation, as in Listing 4.


Listing 4. Setting validation on a XercesDOMParser API
#include <xercesc/parsers/XercesDOMParser.hpp>

XercesDOMParser *parser = new XercesDOMParser();
/* specify a non-validating parse */
parser->setValidationScheme(XercesDOMParser::Val_Never);			

Other features

As Table 1 notes, a few other features affect the performance of the parser.


Table 1. Other Xerces-C++ performance features
SAX2/DOM Level 3 setFeature XMLUni member | SAX1/Xerces DOM parser set method | Description
fgXercesLoadExternalDTD | setLoadExternalDTD | Controls whether or not an external DTD is parsed
fgXercesCalculateSrcOfs | setCalculateSrcOfs | Controls the calculation of source offsets, which can be expensive
fgXercesIdentityConstraintChecking | setIdentityConstraintChecking | Controls whether or not schema identity constraints are checked
fgXercesIgnoreAnnotations | setIgnoreAnnotations | Controls whether or not schema annotations are ignored when traversing a schema
fgXercesSchemaFullChecking | setValidationSchemaFullChecking | Controls whether or not the schema itself is fully checked for additional errors that are time-consuming or memory-intensive to discover

The Xerces-C++ programming guide, referenced in Resources, describes the Xerces-C++ features.

Data handling

A few APIs can significantly influence the performance of the parser, because parsing even a single instance document might call these functions a great many times.

Avoiding unnecessary calls to XMLString::transcode()

If you know the content of the string to be transcoded ahead of time, then it's better to create an XMLCh string constant for the string, as in Listing 5, instead of calling XMLString::transcode().


Listing 5. Defining XMLCh strings

// define a constant XMLCh string for 'Element_Name'
#include <xercesc/util/XMLUniDefs.hpp>

XMLCh Element_Name[] = {  chLatin_E, chLatin_l, chLatin_e, chLatin_m, chLatin_e, 
                          chLatin_n, chLatin_t, chUnderscore, chLatin_N, chLatin_a, 
                          chLatin_m, chLatin_e, chNull }; 

By using a constant string, you avoid the memory allocation, the string copy, and the transcoding itself. You also no longer have to free the memory that the transcode function returns, which is otherwise the caller's responsibility. Frequent transcoding between XMLCh and char can have a significant impact on performance, so wherever possible, process the data in one format.

A simple way to get the XMLCh string constant is to use makeStringDefinition.pl in the scripts directory. Xerces-C++ uses a number of predefined symbols that are defined in the header file, xercesc/util/XMLUni.hpp.
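To make the cost difference concrete, here is a minimal sketch that does not use Xerces-C++ at all. It assumes a 16-bit character type (Char16, a stand-in for XMLCh, whose real definition is platform-dependent), and widenAscii is a hypothetical stand-in for a runtime transcode: it has to allocate a buffer and convert every character, while the constant costs nothing at runtime. The u"..." literal is used here only because the sketch uses char16_t; with XMLCh, the array of chLatin_* constants in Listing 5 is the portable form.

```cpp
#include <cstddef>
#include <memory>

// Stand-in for XMLCh (assumption; the real type is platform-dependent).
using Char16 = char16_t;

// Known ahead of time: laid out at compile time, never allocated or freed.
static const Char16 kElementName[] = u"Element_Name";

// Hypothetical stand-in for a runtime transcode of an ASCII string:
// allocates a new buffer and converts one character at a time.
inline std::unique_ptr<Char16[]> widenAscii(const char* src) {
    std::size_t len = 0;
    while (src[len] != '\0') ++len;
    std::unique_ptr<Char16[]> out(new Char16[len + 1]);
    for (std::size_t i = 0; i <= len; ++i)  // <= copies the terminator too
        out[i] = static_cast<Char16>(static_cast<unsigned char>(src[i]));
    return out;
}

// Compares two null-terminated 16-bit strings for equality.
inline bool sameString(const Char16* a, const Char16* b) {
    while (*a != 0 && *a == *b) { ++a; ++b; }
    return *a == *b;
}
```

Both routes produce the same characters, so a comparison against kElementName behaves exactly like a comparison against the freshly converted string, without the allocation, the conversion loop, or the cleanup.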

Avoiding calling XMLString::stringLen() to check for a zero-length string

If you only need to know whether a string has zero length, don't call XMLString::stringLen(string). Instead, check whether the string pointer is NULL or whether its first character is the null character. Listing 6 shows you how to check for a zero-length string without using XMLString::stringLen().


Listing 6. Check for zero-length string

if (xmlStr == 0 || *xmlStr == 0) {
// string is zero length
}

This code helps you avoid an extra pass through the string.

Avoiding calling XMLString::compareIString()

The XMLString::compareIString() method uses the transcoder to do a case-insensitive string comparison. If you know the data you're comparing only has alphabetic characters (A to Z), then you should use the XMLString::compareIStringASCII() routine. This routine checks to see if a character is between A and Z, then converts it to lowercase and compares it; it compares other characters directly.

Using XMLString::compareIStringASCII() avoids calling out to the transcoding services to do the comparison. As an alternative, you can call the XMLString::equals() method. See all three methods in Listing 7.


Listing 7. String comparisons
#include <xercesc/util/XMLString.hpp>
#include <xercesc/util/XMLUniDefs.hpp>

XMLCh*  data;
XMLCh Element[] = {  chLatin_E, chLatin_l, chLatin_e, chLatin_m, chLatin_e, 
                     chLatin_n, chLatin_t, chNull }; 

if (XMLString::compareIString(data, Element) == 0) { ... }
/* Since Element only has the characters A to Z, this could be done more 
   efficiently using: */
if (XMLString::compareIStringASCII(data, Element) == 0) { ... }

/* Even better, when you don't require a case-insensitive comparison: */

if (XMLString::equals(data, Element)) { ... }
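The behavior described for XMLString::compareIStringASCII() (fold A to Z to lowercase, compare everything else directly) can be sketched in a few lines. This is not the Xerces-C++ implementation; Char16, foldAscii, and compareIgnoreCaseAscii are hypothetical names, and the treatment of null pointers as empty strings is an assumption made for this sketch.

```cpp
// Stand-in for XMLCh (assumption; the real type is platform-dependent).
using Char16 = char16_t;

// Folds 'A'..'Z' to lowercase and leaves every other code unit untouched.
inline Char16 foldAscii(Char16 c) {
    return (c >= u'A' && c <= u'Z') ? static_cast<Char16>(c + (u'a' - u'A')) : c;
}

// Returns 0 when the strings match ignoring ASCII case, otherwise a
// negative or positive value. Null pointers are treated as empty strings.
inline int compareIgnoreCaseAscii(const Char16* a, const Char16* b) {
    static const Char16 empty[] = { 0 };
    if (a == nullptr) a = empty;
    if (b == nullptr) b = empty;
    for (;; ++a, ++b) {
        const Char16 ca = foldAscii(*a);
        const Char16 cb = foldAscii(*b);
        if (ca != cb) return static_cast<int>(ca) - static_cast<int>(cb);
        if (ca == 0) return 0;  // both strings ended at the same position
    }
}
```

Because no transcoding service is involved, the comparison is a single pass over the two strings. The trade-off is that characters outside A to Z, such as accented letters, are compared exactly and never case-folded.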

Minimizing handlers

If you only need to test whether a document is well-formed and/or valid, register only an ErrorHandler. Registering a DocumentHandler and/or an AdvDocHandler results in extra, unnecessary calls from the Xerces-C++ library to your application. If you use an AdvDocHandler only to get the XMLDecl callback information, then you can call removeAdvDocHandler after you get the XMLDecl information.

Avoiding using XMLFormatter::UnRep_Fail during serialization

When you serialize XML, be cautious about using XMLFormatter::UnRep_Fail. This option checks each character to see if it can be transcoded into the target encoding. If the encoding of the original document is the same as the target encoding and your changes are still valid in the target encoding, then you should use XMLFormatter::UnRep_CharRef instead.


Schema grammar caching

Use the grammar-caching features of Xerces-C++ if you do schema validation and reuse the same schema. For more information, check the developerWorks article, "Cache and serialize XML Schemas with Xerces-C++," referenced in Resources.


Xerces-C++ initialization

According to the Xerces-C++ threading model, the main thread calls XMLPlatformUtils::Initialize(). Then you can create other threads for parsing; each thread creates one parser. Finally, the main thread calls XMLPlatformUtils::Terminate().

XMLPlatformUtils::Initialize() is an expensive operation. Even if you don't use multithreading, initialize Xerces-C++ up front to avoid multiple calls to this function and then terminate when your application terminates. Specifying true for the last parameter on XMLPlatformUtils::Initialize, toInitStatics, as in Listing 8, makes initialization take longer, but it can result in better performance for parsing, because it initializes all the statics up front.


Listing 8. Initialization call to initialize all statics
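#include <xercesc/util/PlatformUtils.hpp>
#include <xercesc/util/XMLUni.hpp>

/* the last argument, toInitStatics, is true so that all statics are initialized up front */
XMLPlatformUtils::Initialize(XMLUni::fgXercescDefaultLocale, 0, 0, 0, true);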
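One way to guarantee a single initialization and a single matching termination is to tie both calls to the lifetime of one object created at the top of main. The sketch below is library-independent: LibrarySession is a hypothetical class, and the two callables are whatever the application supplies.

```cpp
#include <functional>
#include <utility>

// Hypothetical RAII wrapper: runs an expensive initialization once when
// constructed and the matching termination once when destroyed.
class LibrarySession {
public:
    LibrarySession(const std::function<void()>& init,
                   std::function<void()> terminate)
        : terminate_(std::move(terminate)) {
        // If init throws, the destructor never runs, so terminate is skipped.
        if (init) init();
    }

    ~LibrarySession() {
        if (terminate_) terminate_();
    }

    // One session per process: copying would terminate twice.
    LibrarySession(const LibrarySession&) = delete;
    LibrarySession& operator=(const LibrarySession&) = delete;

private:
    std::function<void()> terminate_;
};
```

With Xerces-C++, the first callable would wrap the XMLPlatformUtils::Initialize() call from Listing 8 and the second would wrap XMLPlatformUtils::Terminate(). Any parsers, including those owned by worker threads, would need to be destroyed before the session object goes out of scope.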

Similarly, you can create a pool of parsers (one per thread) at initialization time. You can use these parsers at runtime to avoid the cost of constructing and destroying a parser each time you need to parse a document.
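A parser pool along these lines might look like the following sketch. ParserPool is a hypothetical class, not part of Xerces-C++; it assumes only that Parser is default-constructible (with Xerces-C++, a type such as SAXParser or XercesDOMParser could presumably be used), and it hands each parser to one caller at a time so that no parser is shared between threads.

```cpp
#include <cstddef>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical fixed-size pool that builds all parsers up front.
template <typename Parser>
class ParserPool {
public:
    explicit ParserPool(std::size_t count) {
        for (std::size_t i = 0; i < count; ++i)
            idle_.push_back(std::unique_ptr<Parser>(new Parser()));
    }

    // Hands out an idle parser, or an empty pointer when all are in use.
    std::unique_ptr<Parser> acquire() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (idle_.empty()) return std::unique_ptr<Parser>();
        std::unique_ptr<Parser> parser = std::move(idle_.back());
        idle_.pop_back();
        return parser;
    }

    // Returns a parser to the pool so that a later parse can reuse it.
    void release(std::unique_ptr<Parser> parser) {
        if (!parser) return;
        std::lock_guard<std::mutex> lock(mutex_);
        idle_.push_back(std::move(parser));
    }

    // Number of parsers currently available.
    std::size_t idleCount() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return idle_.size();
    }

private:
    mutable std::mutex mutex_;
    std::vector<std::unique_ptr<Parser>> idle_;
};
```

A worker thread would call acquire() before parsing and release() afterward. Sizing the pool to the number of worker threads matches the one-parser-per-thread model described above, and the construction cost is paid once at startup instead of once per document.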


Conclusion

In this article, we showed you a number of tips and suggestions on how to use the Xerces-C++ XML parser to improve the performance of your application. Implementing these suggestions can reduce the CPU time that Xerces-C++ consumes when parsing an XML document.


Resources

Learn

Get products and technologies

  • The XML parser for C++: Try this parser distributed by Apache.

  • IBM trial software for product evaluation: Build your next project with trial software available for download directly from developerWorks, including application development tools and middleware products from DB2®, Lotus®, Rational®, Tivoli®, and WebSphere®.

Discuss

