Level: Intermediate
David A. Cargill (cargilld@ca.ibm.com), Software Developer, IBM
Khaled Noaman (knoaman@ca.ibm.com), Software Developer, IBM
13 May 2008 (updated 16 May 2008)
XML is becoming a staple of data exchange, both between applications and on the Web. Learn how to improve the performance of your XML applications by using the Xerces-C++ parser properly. You'll learn the best ways to use the parser efficiently, and which features and properties affect its performance.
XML has gained widespread popularity with the emergence of Web services and service-oriented architecture (SOA). It plays an important role in the exchange of data both between applications and on the Web, and it's the cornerstone of many performance-critical scenarios.
Frequently used acronyms
- API: application programming interface
- CPU: Central Processing Unit
- DOM: Document Object Model
- DTD: Document Type Definition
- SAX: Simple API for XML
- W3C: World Wide Web Consortium
- XML: Extensible Markup Language
You can improve the performance of your XML application by using the parser efficiently. Xerces-C++ is an open source validating XML parser available from Apache. In this article, we'll show you several tips on how to use the Xerces-C++ parser to improve the performance of your applications.
Xerces-C++ is a validating XML parser that is provided as a shared library. The library includes interfaces for DOM and SAX. Specifically, SAXParser is an interface for the SAX 1.0 specification; SAX2XMLReader is an interface for the SAX 2.0 specification; XercesDOMParser is an interface for the DOM specification; and DOMBuilder is an implementation of the Load interface of the DOM Level 3.0 Abstract Schemas and Load and Save specification.
Properties and features
Numerous properties and features can have significant influence over the performance of the parser.
Using the right scanner
One of the major components in Xerces-C++ is the scanner. It's not only responsible for scanning an XML instance, but it also plays an important role in assessing the validity of the XML document.
Xerces-C++ has four scanners: IGXMLScanner, WFXMLScanner, DGXMLScanner, and SGXMLScanner. It's important to choose the appropriate scanner for your scenario to obtain better performance. IGXMLScanner, the default scanner, is an all-purpose scanner that not only handles well-formedness, but is also involved in validating XML documents against DTDs and/or XML Schemas. WFXMLScanner, on the other hand, handles only well-formedness checking, not grammar validation. If you're only concerned with the well-formedness of the document, then use WFXMLScanner. Use DGXMLScanner if you're only doing DTD validation, and use SGXMLScanner if you're only doing XML Schema validation.
You can tell the parser which scanner to use by setting the scanner property on a SAX2XMLReader API or a DOMBuilder API. Listing 1 shows you how to set the scanner on SAX2XMLReader.
Listing 1. Setting the scanner on a SAX2XMLReader API
#include <xercesc/internal/XMLGrammarPoolImpl.hpp>
#include <xercesc/sax2/XMLReaderFactory.hpp>
#include <xercesc/util/PlatformUtils.hpp>
#include <xercesc/util/XMLUni.hpp>
/* create a SAX2 reader that uses its own grammar pool */
XMLGrammarPool *grammarPool = new XMLGrammarPoolImpl(XMLPlatformUtils::fgMemoryManager);
SAX2XMLReader* parser = XMLReaderFactory::createXMLReader(
    XMLPlatformUtils::fgMemoryManager, grammarPool);
/* select the scanner that handles XML Schema validation only */
parser->setProperty(XMLUni::fgXercesScannerName, (void *)XMLUni::fgSGXMLScanner);
For a SAXParser API or a XercesDOMParser API, you can call the useScanner method to specify which scanner the parser should use, as in Listing 2.
Listing 2. Setting the scanner on a SAXParser API
#include <xercesc/parsers/SAXParser.hpp>
#include <xercesc/util/XMLUni.hpp>
/* create the parser on the heap */
SAXParser* parser = new SAXParser();
/* select the scanner that handles DTD validation only */
parser->useScanner(XMLUni::fgDGXMLScanner);
For more information on how to use a specific scanner in Xerces-C++, see Resources.
Controlling validation
Once you specify which scanner to use, you still have features to control whether or not the parser performs validation.
You can tell the parser how to validate an instance document by setting the validation feature on a SAX2XMLReader API or a DOMBuilder API. Listing 3 shows you how.
Listing 3. Setting validation on a DOMBuilder API
#include <xercesc/dom/DOMImplementationLS.hpp>
#include <xercesc/dom/DOMImplementationRegistry.hpp>
#include <xercesc/dom/DOMBuilder.hpp>
#include <xercesc/util/XMLUni.hpp>
#include <xercesc/util/XMLUniDefs.hpp>
static const XMLCh gLS[] = { chLatin_L, chLatin_S, chNull };
DOMImplementation *impl = DOMImplementationRegistry::getDOMImplementation(gLS);
DOMBuilder *parser = ((DOMImplementationLS*)impl)->createDOMBuilder(
DOMImplementationLS::MODE_SYNCHRONOUS, 0);
/* specify a validating parse */
parser->setFeature(XMLUni::fgDOMValidateIfSchema, true);
parser->setFeature(XMLUni::fgDOMValidation, true);
/* for SAX2 use XMLUni::fgSAX2CoreValidation and XMLUni::fgXercesDynamic */
For a SAXParser API or a XercesDOMParser API, you can call the setValidationScheme method to specify validation, as in Listing 4.
Listing 4. Setting validation on a XercesDOMParser API
#include <xercesc/parsers/XercesDOMParser.hpp>
XercesDOMParser *parser = new XercesDOMParser();
/* specify a non-validating parse */
parser->setValidationScheme(XercesDOMParser::Val_Never);
Other features
As Table 1 notes, a few other features affect the performance of the parser.
Table 1. Other Xerces-C++ performance features
| SAX2/DOM Level 3 setFeature XMLUni member | SAX1/Xerces DOM parser set method | Description |
|---|---|---|
| fgXercesLoadExternalDTD | setLoadExternalDTD | Controls whether or not an external DTD is parsed |
| fgXercesCalculateSrcOfs | setCalculateSrcOfs | Controls the calculation of source offsets, which can be expensive |
| fgXercesIdentityConstraintChecking | setIdentityConstraintChecking | Controls whether or not schema identity constraints are checked |
| fgXercesIgnoreAnnotations | setIgnoreAnnotations | Controls whether or not schema annotations are ignored when traversing a schema |
| fgXercesSchemaFullChecking | setValidationSchemaFullChecking | Controls whether or not the schema itself is fully checked for additional errors that are time-consuming or memory-intensive to discover |
The Xerces-C++ programming guide, referenced in Resources, describes the Xerces-C++ features.
Data handling
A few APIs can significantly affect the performance of the parser, because parsing a single instance document can make a large number of calls to these functions.
Avoiding unnecessary calls to XMLString::transcode()
If you know the content of the string to be transcoded ahead of time, then it's better to create an XMLCh string constant for the string, as in Listing 5, instead of calling XMLString::transcode().
Listing 5. Defining XMLCh strings
// define a constant XMLCh string for 'Element_Name'
#include <xercesc/util/XMLUniDefs.hpp>
XMLCh Element_Name[] = { chLatin_E, chLatin_l, chLatin_e, chLatin_m, chLatin_e,
chLatin_n, chLatin_t, chUnderscore, chLatin_N, chLatin_a,
chLatin_m, chLatin_e, chNull };
By using a constant string, you avoid the memory allocation, the string copy, and the transcoding itself. You also don't have to free the memory returned by the transcode function, which is otherwise the responsibility of the caller. Frequent transcoding from XMLCh to char and vice versa can have a significant impact on performance, so wherever possible, process the data in one format.
A simple way to get the XMLCh string constant is to use makeStringDefinition.pl in the scripts directory. Xerces-C++ uses a number of predefined symbols that are defined in the header file, xercesc/util/XMLUni.hpp.
Avoiding calling XMLString::stringLen() to check for a zero-length string
If you only need to know whether a string has zero length, don't call XMLString::stringLen(string). Instead, check whether the string pointer is NULL or whether its first character is the null character. Listing 6 shows how to check for a zero-length string without using XMLString::stringLen().
Listing 6. Check for zero-length string
if (xmlStr == 0 || *xmlStr == 0) {
// string is zero length
}
This code helps you avoid an extra pass through the string.
Avoiding calling XMLString::compareIString()
The XMLString::compareIString() method uses the transcoder to do a case-insensitive string comparison. If you know the data you're comparing only has alphabetic characters (A to Z), then you should use the XMLString::compareIStringASCII() routine. This routine checks to see if a character is between A and Z, then converts it to lowercase and compares it; it compares other characters directly.
Using XMLString::compareIStringASCII() avoids calling out to the transcoding services to do the comparison. If you don't need a case-insensitive comparison at all, you can call the XMLString::equals() method instead. See all three methods in Listing 7.
Listing 7. String comparisons
#include <xercesc/util/XMLString.hpp>
#include <xercesc/util/XMLUniDefs.hpp>
XMLCh* data;
XMLCh Element[] = { chLatin_E, chLatin_l, chLatin_e, chLatin_m, chLatin_e,
chLatin_n, chLatin_t, chNull };
if (XMLString::compareIString(data, Element) == 0) { ... }
/* Since Element only has the characters A to Z, this could be done more
efficiently using: */
if (XMLString::compareIStringASCII(data, Element) == 0) { ... }
/* Even better is when you don't require a case-insensitive comparison */
if (XMLString::equals(data, Element)) { ... }
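To make the idea behind the ASCII-only comparison concrete, here is a minimal stand-alone sketch of the behavior described above: fold A to Z to lowercase and compare every other character directly. It is not the Xerces-C++ implementation. The type Ch (a char16_t stand-in for XMLCh) and the names foldAscii and compareIgnoreAsciiCase are assumptions made for illustration.

```cpp
// Stand-in for XMLCh in this sketch (an assumption, not the Xerces typedef).
typedef char16_t Ch;

// Fold 'A'..'Z' to lowercase; leave every other code unit untouched.
inline Ch foldAscii(Ch c) {
    return (c >= u'A' && c <= u'Z') ? static_cast<Ch>(c + (u'a' - u'A')) : c;
}

// Returns <0, 0, or >0 in the manner of strcmp, ignoring case for ASCII
// letters only. Both arguments must be non-null, null-terminated strings.
inline int compareIgnoreAsciiCase(const Ch* a, const Ch* b) {
    while (*a && foldAscii(*a) == foldAscii(*b)) {
        ++a;
        ++b;
    }
    return static_cast<int>(foldAscii(*a)) - static_cast<int>(foldAscii(*b));
}
```

Because the loop touches only the two input buffers and does simple arithmetic, no transcoding service is involved. The trade-off is that characters outside A to Z, such as accented letters, are compared exactly.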
Minimizing handlers
If you only need to test whether a document is well-formed and/or valid, register only an ErrorHandler. Registering a DocumentHandler and/or an AdvDocHandler results in extra, unnecessary calls from the Xerces-C++ library to your application. If you use an AdvDocHandler only to get the XMLDecl callback information, then you can call removeAdvDocHandler after you get that information.
Avoiding using XMLFormatter::UnRep_Fail during serialization
If you try to format XML, be cautious about using XMLFormatter::UnRep_Fail. This option checks each character to see if it can be transcoded into the target encoding. If the encoding of the original document is the same as the target encoding and your changes are still valid in the target encoding, then you should use XMLFormatter::UnRep_CharRef instead.
Schema grammar caching
Use the grammar-caching features of Xerces-C++ if you do schema validation and reuse the same schema. For more information, check the developerWorks article, "Cache and serialize XML Schemas with Xerces-C++," referenced in Resources.
Xerces-C++ initialization
According to the Xerces-C++ threading model, the main thread calls XMLPlatformUtils::Initialize(). Then you can create other threads for parsing; each thread creates one parser. Finally, the main thread calls XMLPlatformUtils::Terminate().
XMLPlatformUtils::Initialize() is an expensive operation. Even if you don't use multithreading, initialize Xerces-C++ once up front to avoid multiple calls to this function, and terminate it when your application terminates. Specifying true for the last parameter of XMLPlatformUtils::Initialize(), toInitStatics, as in Listing 8, makes initialization take longer, but it can result in better performance for parsing, because it initializes all the statics up front.
Listing 8. Initialization call to initialize all statics
#include <xercesc/util/PlatformUtils.hpp>
#include <xercesc/util/XMLUni.hpp>
/* the last argument, toInitStatics, is set to true */
XMLPlatformUtils::Initialize(XMLUni::fgXercescDefaultLocale, 0, 0, 0, true);
Similarly, you can create a pool of parsers (one per thread) at initialization time. You can use these parsers at runtime to avoid the cost of constructing and destroying parsers each time you need to parse a document.
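One way to picture such a pool is the following minimal sketch. It is an illustration under stated assumptions, not code from Xerces-C++. ParserPool is a hypothetical class, and ParserT stands in for whichever parser class you use, which this sketch assumes is default-constructible. It uses standard C++11 facilities only.

```cpp
#include <cstddef>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical pool: constructs all parsers up front and hands them out
// one at a time, so no parser is built or destroyed on the parsing path.
template <typename ParserT>
class ParserPool {
public:
    // Construct every parser at initialization time.
    explicit ParserPool(std::size_t count) {
        for (std::size_t i = 0; i < count; ++i)
            idle_.push_back(std::unique_ptr<ParserT>(new ParserT()));
    }

    // Hand out an idle parser, or a null pointer if none is available.
    std::unique_ptr<ParserT> acquire() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (idle_.empty()) return std::unique_ptr<ParserT>();
        std::unique_ptr<ParserT> p = std::move(idle_.back());
        idle_.pop_back();
        return p;
    }

    // Return a parser so that another thread can reuse it.
    void release(std::unique_ptr<ParserT> p) {
        if (!p) return; // ignore null pointers
        std::lock_guard<std::mutex> lock(mutex_);
        idle_.push_back(std::move(p));
    }

    // Number of parsers currently waiting in the pool.
    std::size_t idleCount() {
        std::lock_guard<std::mutex> lock(mutex_);
        return idle_.size();
    }

private:
    std::mutex mutex_;
    std::vector<std::unique_ptr<ParserT>> idle_;
};
```

With Xerces-C++, you would create the pool after XMLPlatformUtils::Initialize() and destroy it before XMLPlatformUtils::Terminate(), because parser objects must not outlive the library's initialization. Each thread acquires a parser, parses, and releases it, so a given parser is never used by two threads at once.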
Conclusion
In this article, we showed you a number of tips and suggestions on how to use the Xerces-C++ XML parser to improve the performance of your application. Implementing these suggestions can reduce the CPU time that Xerces-C++ consumes when parsing an XML document.
Resources
Learn
- The Xerces-C++ Use Specific Scanner guide: Find additional information on using a specific scanner in Xerces-C++.
- Xerces-C++ Programming Guide: Read more on how to use SAX1, SAX2, and DOM in Xerces-C++.
- Cache and serialize XML Schemas with Xerces-C++ (Neil Graham and Khaled Noaman, developerWorks, July 2005): Learn to improve the performance of Xerces-C++.
- Improve performance in your XML applications, Part 1 (Elena Litani and Michael Glavassevich, developerWorks, July 2004): In a series focused on Xerces2, write your app for the best possible performance, plus explore which SAX or DOM operations and features affect application performance.
- Xerces-C++ mailing lists: Post questions and view answers.
- Xerces-C++ API documentation: Reference the documentation for a full understanding of how the Xerces-C++ APIs work.
- The official SAX Web site: Visit and learn more about the API from technical documentation, FAQs, and more.
- The W3C Recommendation XML Schema Part 0: Primer: Read an introduction to the XML Schema language.
- The XML 1.0 specification: Learn more about the XML standards.
- W3C Document Object Model (DOM) Level 3.0 Core Specification: Read this W3C-defined specification about DOM and how it allows programs and scripts to dynamically access and update the content, structure, and style of documents.
- IBM XML certification: Find out how you can become an IBM-Certified Developer in XML and related technologies.
- XML technical library: See the developerWorks XML Zone for a wide range of technical articles and tips, tutorials, standards, and IBM Redbooks.
- developerWorks technical events and webcasts: Stay current with developerWorks technical events and webcasts.
- The technology bookstore: Browse for books on these and other technical topics.
- developerWorks podcasts: Listen to interesting interviews and discussions for software developers.
Get products and technologies
- The XML parser for C++: Try this parser distributed by Apache.
- IBM trial software for product evaluation: Build your next project with trial software available for download directly from developerWorks, including application development tools and middleware products from DB2®, Lotus®, Rational®, Tivoli®, and WebSphere®.
About the authors
David Cargill is a member of the XML Parser Development team at IBM Canada. He's been involved in the development of the Xerces-C++ parser for the last five years.
Khaled Noaman is a member of the XML Parser Development team at IBM. He's been involved in the development of the Xerces-C++ parser for the last five years and implemented many of the parser features including support for XML Schema Structures.