Improve the performance of your XML applications using Xerces-C++

Delve into the Xerces-C++ properties and features, data handling, and schema grammar caching

Level: Intermediate

David A. Cargill (cargilld@ca.ibm.com), Software Developer, IBM
Khaled Noaman (knoaman@ca.ibm.com), Software Developer, IBM 

13 May 2008
Updated 16 May 2008

XML is becoming a staple of data exchange, both between applications and on the Web. Learn how to improve the performance of your XML applications by using the Xerces-C++ parser properly: the best ways to use the parser efficiently, and which features and properties affect its performance.

XML has gained widespread popularity with the emergence of Web services and service-oriented architecture (SOA). It plays an important role in the exchange of data both between applications and on the Web, and it's the cornerstone of many performance-critical scenarios.

Frequently used acronyms
  • API: application programming interface
  • CPU: Central Processing Unit
  • DOM: Document Object Model
  • DTD: Document Type Definition
  • SAX: Simple API for XML
  • W3C: World Wide Web Consortium
  • XML: Extensible Markup Language

You can improve the performance of your XML application by using the parser efficiently. Xerces-C++ is an open source validating XML parser available from Apache. In this article, we'll show you several tips on how to use the Xerces-C++ parser to improve the performance of your applications.

The parser is provided as a shared library that includes interfaces for both DOM and SAX. Specifically, SAXParser is an interface for the SAX 1.0 specification; SAX2XMLReader is an interface for the SAX 2.0 specification; XercesDOMParser is an interface for the DOM specification; and DOMBuilder is an implementation of the Load interface of the DOM Level 3.0 Abstract Schemas and Load and Save specification.

Properties and features

Numerous properties and features can have significant influence over the performance of the parser.

Using the right scanner

One of the major components in Xerces-C++ is the scanner. It's not only responsible for scanning an XML instance, but it also plays an important role in assessing the validity of the XML document.

Xerces-C++ has four scanners: IGXMLScanner, WFXMLScanner, DGXMLScanner, and SGXMLScanner. It's important to choose the appropriate scanner for your scenario to obtain better performance. IGXMLScanner, the default scanner, is an all-purpose scanner that not only handles well-formedness, but is also involved in validating XML documents against DTDs and/or XML Schemas. WFXMLScanner, on the other hand, handles only well-formedness checking, not grammar validation. If you're only concerned with the well-formedness of the document, then use WFXMLScanner. Use DGXMLScanner if you're only doing DTD validation, and use SGXMLScanner if you're only doing XML Schema validation.
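The choice described above amounts to a small decision table. The following is a minimal sketch of that table in plain C++; the enum Checking and the function scannerFor are hypothetical names introduced only for illustration and are not part of Xerces-C++. The returned strings are the scanner names listed in the paragraph above.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, not part of Xerces-C++: describes the kind of
// checking an application needs.
enum class Checking { WellFormedOnly, DtdOnly, SchemaOnly, DtdAndSchema };

// Maps the kind of checking to the scanner name the article recommends.
inline std::string scannerFor(Checking need) {
    switch (need) {
        case Checking::WellFormedOnly: return "WFXMLScanner";  // no grammar validation
        case Checking::DtdOnly:        return "DGXMLScanner";  // DTD validation only
        case Checking::SchemaOnly:     return "SGXMLScanner";  // XML Schema validation only
        case Checking::DtdAndSchema:   return "IGXMLScanner";  // all-purpose default
    }
    return "IGXMLScanner";  // fall back to the default scanner
}
```

In application code, you would pass the corresponding XMLUni constant to the parser rather than a string built by hand, as the listings in this section do with XMLUni::fgSGXMLScanner and XMLUni::fgDGXMLScanner.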

You can tell the parser which scanner to use by setting the scanner property on a SAX2XMLReader API or a DOMBuilder API. Listing 1 shows you how to set the scanner on SAX2XMLReader.


Listing 1. Setting the scanner on a SAX2XMLReader API
                
#include <xercesc/internal/XMLGrammarPoolImpl.hpp>
#include <xercesc/sax2/XMLReaderFactory.hpp>
#include <xercesc/util/XMLUni.hpp>

XMLGrammarPool *grammarPool = new XMLGrammarPoolImpl(XMLPlatformUtils::fgMemoryManager);
SAX2XMLReader* parser = XMLReaderFactory::createXMLReader(
                               XMLPlatformUtils::fgMemoryManager, grammarPool);

parser->setProperty(XMLUni::fgXercesScannerName, (void *)XMLUni::fgSGXMLScanner);

For a SAXParser API or a XercesDOMParser API, you can call the useScanner method to specify which scanner the parser should use, as in Listing 2.


Listing 2. Setting the scanner on a SAXParser API
                
#include <xercesc/parsers/SAXParser.hpp>
#include <xercesc/util/XMLUni.hpp>

/* the parser is allocated on the heap, so it must be held by pointer */
SAXParser* parser = new SAXParser();
parser->useScanner(XMLUni::fgDGXMLScanner);

For more information on how to use a specific scanner in Xerces-C++, see Resources.

Controlling validation

Once you specify which scanner to use, you still have features to control whether or not the parser performs validation.

You can tell the parser how to validate an instance document by setting the validation feature on a SAX2XMLReader API or a DOMBuilder API. Listing 3 shows you how.


Listing 3. Setting validation on a DOMBuilder API
                
#include <xercesc/dom/DOMImplementationLS.hpp>
#include <xercesc/dom/DOMImplementationRegistry.hpp>
#include <xercesc/dom/DOMBuilder.hpp>
#include <xercesc/util/XMLUni.hpp>
#include <xercesc/util/XMLUniDefs.hpp>  /* chLatin_* and chNull constants */

static const XMLCh gLS[] = { chLatin_L, chLatin_S, chNull };
DOMImplementation *impl = DOMImplementationRegistry::getDOMImplementation(gLS);
DOMBuilder *parser = ((DOMImplementationLS*)impl)->createDOMBuilder(
                       DOMImplementationLS::MODE_SYNCHRONOUS, 0);

/* specify a validating parse */
parser->setFeature(XMLUni::fgDOMValidateIfSchema, true);
parser->setFeature(XMLUni::fgDOMValidation, true);

/* for SAX2 use XMLUni::fgSAX2CoreValidation and XMLUni::fgXercesDynamic */

For a SAXParser API or a XercesDOMParser API, you can call the setValidationScheme method to specify validation, as in Listing 4.


Listing 4. Setting validation on a XercesDOMParser API
                
#include <xercesc/parsers/XercesDOMParser.hpp>

XercesDOMParser *parser = new XercesDOMParser();
/* specify a non-validating parse */
parser->setValidationScheme(XercesDOMParser::Val_Never);			

Other features

As Table 1 notes, a few other features affect the performance of the parser.


Table 1. Other Xerces-C++ performance features
SAX2/DOM Level 3 setFeature XMLUni member | SAX1/Xerces DOM parser set method | Description
fgXercesLoadExternalDTD | setLoadExternalDTD | Controls whether or not an external DTD is parsed
fgXercesCalculateSrcOfs | setCalculateSrcOfs | Controls the calculation of source offsets, which can be expensive
fgXercesIdentityConstraintChecking | setIdentityConstraintChecking | Controls whether or not schema identity constraints are checked
fgXercesIgnoreAnnotations | setIgnoreAnnotations | Controls whether or not schema annotations are ignored when traversing a schema
fgXercesSchemaFullChecking | setValidationSchemaFullChecking | Controls whether or not the schema itself is fully checked for additional errors that are time-consuming or memory-intensive to discover

The Xerces-C++ programming guide, referenced in Resources, describes the Xerces-C++ features.

Data handling

A few APIs can have a significant influence on the performance of the parser, because parsing a single instance document can make a large number of calls to these functions.

Avoiding unnecessary calls to XMLString::transcode()

If you know the content of the string to be transcoded ahead of time, then it's better to create an XMLCh string constant for the string, as in Listing 5, instead of calling XMLString::transcode().


Listing 5. Defining XMLCh strings
                

// define a constant XMLCh string for 'Element_Name'
#include <xercesc/util/XMLUniDefs.hpp>

XMLCh Element_Name[] = {  chLatin_E, chLatin_l, chLatin_e, chLatin_m, chLatin_e, 
                          chLatin_n, chLatin_t, chUnderscore, chLatin_N, chLatin_a, 
                          chLatin_m, chLatin_e, chNull }; 

By using a constant string, you avoid allocating memory, copying the string, and running the transcoding process. You also don't have to free the memory returned by the transcode function, which is otherwise the caller's responsibility. Frequent transcoding from XMLCh to char and back can have a significant impact on performance, so wherever possible, try to process the data in one format.

A simple way to get the XMLCh string constant is to use makeStringDefinition.pl in the scripts directory. Xerces-C++ uses a number of predefined symbols that are defined in the header file, xercesc/util/XMLUni.hpp.

Avoiding calling XMLString::stringLen() to check for a zero-length string

If you're only interested in checking for a string of zero length, then instead of calling XMLString::stringLen(string), just check whether the string is NULL or whether its first character is the null character. Listing 6 shows you how to check for a zero-length string without using XMLString::stringLen().


Listing 6. Check for zero-length string
                

if (xmlStr == 0 || *xmlStr == 0) {
    // string is zero length
}

This code helps you avoid an extra pass through the string.

Avoiding calling XMLString::compareIString()

The XMLString::compareIString() method uses the transcoder to do a case-insensitive string comparison. If you know the data you're comparing only has alphabetic characters (A to Z), then you should use the XMLString::compareIStringASCII() routine. This routine checks to see if a character is between A and Z, then converts it to lowercase and compares it; it compares other characters directly.
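The comparison strategy just described can be sketched in plain C++ as follows. This is an illustration of the idea under stated assumptions, not the Xerces-C++ implementation: the type alias Ch stands in for XMLCh (whose real definition is platform-dependent), and the names foldAscii and compareIgnoreAsciiCase are hypothetical.

```cpp
#include <cassert>

// Stand-in for a UTF-16 code unit. The real XMLCh type is
// platform-dependent, so this alias is an assumption of the sketch.
using Ch = char16_t;

// Lowercase only the ASCII letters A to Z; every other code unit is
// returned unchanged, so no transcoding service is involved.
inline Ch foldAscii(Ch c) {
    return (c >= u'A' && c <= u'Z') ? static_cast<Ch>(c + (u'a' - u'A')) : c;
}

// Returns 0 when the two null-terminated strings are equal ignoring ASCII
// case, a negative value when a sorts first, and a positive value
// otherwise. Null pointers are treated as empty strings.
inline int compareIgnoreAsciiCase(const Ch* a, const Ch* b) {
    static const Ch empty[] = { 0 };
    if (!a) a = empty;
    if (!b) b = empty;
    for (;; ++a, ++b) {
        const Ch ca = foldAscii(*a);
        const Ch cb = foldAscii(*b);
        if (ca != cb) return static_cast<int>(ca) - static_cast<int>(cb);
        if (ca == 0) return 0;  // both strings ended at the same position
    }
}
```

The loop makes a single pass and touches each code unit once, which is why this style of comparison is cheap when the data is known to contain only ASCII letters.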

Using XMLString::compareIStringASCII() avoids calling out to the transcoding services to do the comparison. As an alternative, you can call the XMLString::equals() method. See all three methods in Listing 7.


Listing 7. String comparisons
                
#include <xercesc/util/XMLString.hpp>
#include <xercesc/util/XMLUniDefs.hpp>  /* chLatin_* and chNull constants */

XMLCh*  data;  /* assumed to point at a null-terminated XMLCh string */
XMLCh Element[] = {  chLatin_E, chLatin_l, chLatin_e, chLatin_m, chLatin_e, 
                     chLatin_n, chLatin_t, chNull }; 

/* Case-insensitive comparison that goes through the transcoding services */
if (XMLString::compareIString(data, Element) == 0) { ... }
/* Since Element only has the characters A to Z, this could be done more 
   efficiently using: */
if (XMLString::compareIStringASCII(data, Element) == 0) { ... }

/* Even better, when you don't require a case-insensitive comparison: */
if (XMLString::equals(data, Element)) { ... }

Minimizing handlers

If you only need to test whether a document is well-formed and/or valid, then register only an ErrorHandler. Registering a DocumentHandler and/or an AdvDocHandler results in extra, unnecessary calls from the Xerces-C++ library to your application. If you use an AdvDocHandler only to get the XMLDecl callback information, then you can call removeAdvDocHandler after you get the XMLDecl information.

Avoiding using XMLFormatter::UnRep_Fail during serialization

When you serialize XML, be cautious about using XMLFormatter::UnRep_Fail. This option checks each character to see if it can be transcoded into the target encoding. If the encoding of the original document is the same as the target encoding and your changes are still valid in the target encoding, then you should use XMLFormatter::UnRep_CharRef instead.





Schema grammar caching

Use the grammar-caching features of Xerces-C++ if you do schema validation and reuse the same schema. For more information, check the developerWorks article, "Cache and serialize XML Schemas with Xerces-C++," referenced in Resources.





Xerces-C++ initialization

According to the Xerces-C++ threading model, the main thread calls XMLPlatformUtils::Initialize(). Then you can create other threads for parsing; each thread creates one parser. Finally, the main thread calls XMLPlatformUtils::Terminate().

XMLPlatformUtils::Initialize() is an expensive operation. Even if you don't use multithreading, initialize Xerces-C++ up front to avoid multiple calls to this function and then terminate when your application terminates. Specifying true for the last parameter on XMLPlatformUtils::Initialize, toInitStatics, as in Listing 8, makes initialization take longer, but it can result in better performance for parsing, because it initializes all the statics up front.


Listing 8. Initialization call to initialize all statics
                
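#include <xercesc/util/PlatformUtils.hpp>
#include <xercesc/util/XMLUni.hpp>

/* the last argument, toInitStatics, is true so that all statics are initialized up front */
XMLPlatformUtils::Initialize(XMLUni::fgXercescDefaultLocale, 0, 0, 0, true);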
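One way to make sure initialization happens once up front and termination happens once at exit is to tie both calls to the lifetime of a single object. The following is a minimal sketch of that pattern in plain C++. The class LibrarySession and the counting hooks fakeInitialize and fakeTerminate are hypothetical stand-ins; in a Xerces-C++ program the two hooks would wrap XMLPlatformUtils::Initialize() and XMLPlatformUtils::Terminate().

```cpp
#include <cassert>

// Hypothetical RAII wrapper: runs an initialization hook once on
// construction and the matching termination hook once on destruction.
class LibrarySession {
public:
    typedef void (*Hook)();

    LibrarySession(Hook init, Hook term) : term_(term) { init(); }
    ~LibrarySession() { term_(); }

    // Non-copyable, so the termination hook runs exactly once.
    LibrarySession(const LibrarySession&) = delete;
    LibrarySession& operator=(const LibrarySession&) = delete;

private:
    Hook term_;
};

// Stand-in hooks that only count calls; they exist so the sketch can be
// exercised without the Xerces-C++ library.
static int g_initCalls = 0;
static int g_termCalls = 0;
inline void fakeInitialize() { ++g_initCalls; }
inline void fakeTerminate()  { ++g_termCalls; }
```

Creating one such object at the top of main() keeps the expensive initialization out of the parsing path and runs the termination hook when main() returns.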

Similarly, you can create a pool of parsers (one per thread) at initialization time. You can use these parsers at runtime to avoid the cost of constructing and destroying parsers each time you need to parse a document.
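A parser pool of this kind can be sketched in plain C++ as shown below. This is an illustration under stated assumptions, not code from Xerces-C++: ParserPool is a hypothetical class template, and FakeParser is a stand-in type used so the sketch compiles without the library. In a real application, Parser would be one of the Xerces-C++ parser classes, the pool would be created after XMLPlatformUtils::Initialize(), and it would be destroyed before XMLPlatformUtils::Terminate().

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical pool; Parser is whatever parser type the application uses.
template <typename Parser>
class ParserPool {
public:
    // Construct all parsers up front, at initialization time.
    explicit ParserPool(std::size_t count) {
        for (std::size_t i = 0; i < count; ++i)
            idle_.push_back(std::unique_ptr<Parser>(new Parser()));
    }

    // Hand an idle parser to the calling thread, or nullptr if none is left.
    std::unique_ptr<Parser> acquire() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (idle_.empty()) return nullptr;
        std::unique_ptr<Parser> p = std::move(idle_.back());
        idle_.pop_back();
        return p;
    }

    // Return a parser so that a later parse can reuse it.
    void release(std::unique_ptr<Parser> p) {
        if (!p) return;
        std::lock_guard<std::mutex> lock(mutex_);
        idle_.push_back(std::move(p));
    }

    // Number of parsers currently waiting in the pool.
    std::size_t idleCount() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return idle_.size();
    }

private:
    mutable std::mutex mutex_;
    std::vector<std::unique_ptr<Parser>> idle_;
};

// Stand-in parser type used only to exercise the pool in this sketch.
struct FakeParser { int parses = 0; };
```

Each worker thread calls acquire() once, keeps its parser for as long as it has documents to parse, and calls release() when it is done, so no parser is ever shared between two threads at the same time.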





Conclusion

In this article, we showed you a number of tips and suggestions on how to use the Xerces-C++ XML parser to improve the performance of your application. Implementing these suggestions can reduce the CPU time that Xerces-C++ consumes when parsing an XML document.



Resources

Learn

Get products and technologies
  • The XML parser for C++: Try this parser distributed by Apache.

  • IBM trial software for product evaluation: Build your next project with trial software available for download directly from developerWorks, including application development tools and middleware products from DB2®, Lotus®, Rational®, Tivoli®, and WebSphere®.


Discuss


About the authors

David Cargill is a member of the XML Parser Development team at IBM Canada. He's been involved in the development of the Xerces-C++ parser for the last five years.


Khaled Noaman is a member of the XML Parser Development team at IBM. He's been involved in the development of the Xerces-C++ parser for the last five years and implemented many of the parser features including support for XML Schema Structures.






Adobe, the Adobe logo, PostScript, and the PostScript logo are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States, and/or other countries. IBM, the IBM logo, ibm.com, DB2, developerWorks, Lotus, Rational, Tivoli, and WebSphere are trademarks of IBM Corporation in the United States, other countries, or both. Other company, product, or service names may be trademarks or service marks of others.