Cache and serialize XML Schemas with Xerces-C++

Improve the performance of Xerces-C++

Use Xerces-C++ to validate documents more efficiently. XML plays an increasingly important role in C and C++ applications. To ensure successful interpretation of a document's contents, many of these applications require W3C XML Schemas to validate the documents they process. This article includes examples that demonstrate how to preprocess and cache schemas in advance of or during the validation process, and thus avoid the expensive process of repeatedly processing given XML Schema documents. You'll also learn how to save the processed schemas to disk, so you'll only need to reprocess the original XML Schema documents if they change.


Neil Graham (neilg@ca.ibm.com), Manager, XML Parser Development, IBM, Software Group

Neil Graham is the Manager of XML Parser Development at IBM. He is a committer on Apache's Xerces-Java and Xerces-C++ XML parsers, where he has worked on, among other things, the implementation of XML Schema, XML 1.1, and grammar caching. He was also one of IBM's representatives on the Expert Group that developed JAXP 1.3.



Khaled Noaman (knoaman@ca.ibm.com), Software Developer, IBM, Software Group

Khaled Noaman is a member of the XML Parser Development team at IBM. He's been involved in the development of the Xerces-C++ parser for the last five years and implemented many of the parser features including support for XML Schema Structures.



29 July 2005


Xerces-C++ is one of the most full-featured and portable XML parsers written in C++ available today. What's more, it's open source (distributed by the Apache Xerces Project) and it's fully conformant to the most important XML standards -- XML 1.0 3rd Edition, XML 1.1, and XML Schema 1.0 2nd Edition Structures and Datatypes. Because of the role they play in Web services, XML Schemas are becoming especially important to today's XML applications. In this article, we show you how to use Xerces-C++ to validate a document according to an XML Schema. We then explore how to get the best possible schema-validation performance out of Xerces-C++ through the use of its grammar caching and grammar serialization capabilities (see Resources).

Simple XML Schema validation using SAX2

The W3C's XML Schema specification defines a set of components that XML authors use to describe the structure of XML documents. It also provides a rich datatype language that specifies the textual content of elements and attributes, thereby permitting one application environment to transmit data to another without losing information. This is why XML Schemas play such a critical role in Web services, and why they're increasingly important in other aspects of XML processing. Our first job in this article is to show you how to use Xerces-C++ to validate a document according to an XML Schema.

In this article, we use Xerces-C++'s version of the SAX 2.0 API to illustrate its schema-validation capabilities. Xerces-C++ also supports a binding for W3C's DOM Level 2 Core specification. It's trivial to take the code presented here for SAX2 and alter it so that it works in the context of Xerces-C++'s DOM implementation. While this article doesn't describe the usage of Xerces-C++ or its SAX2 API in detail, we'll note some of the more important aspects of SAX2 as they relate to XML Schema validation (see Resources).

To validate an XML document against its corresponding XML Schema documents, you first need to create a Xerces-C++ SAX 2 parser instance and set the appropriate features and handlers. Then you tell the parser instance to parse the XML document. For example:

Listing 1. Enabling schema validation on a Xerces-C++ SAX 2 parser
// Necessary includes. We refer to these as "common includes" 
// in the following examples.
#include <xercesc/sax2/XMLReaderFactory.hpp>
#include <xercesc/sax2/SAX2XMLReader.hpp>
#include <xercesc/sax2/DefaultHandler.hpp>

// Handy definitions of constants.
#include <xercesc/util/XMLUni.hpp>

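// Xerces-C++ must be initialized before any of its classes are used:
// call XMLPlatformUtils::Initialize() once at startup and
// XMLPlatformUtils::Terminate() once at shutdown
// (declared in <xercesc/util/PlatformUtils.hpp>).
// The listings in this article assume that this has been done
// and that the xercesc namespace is in scope.

// Create a SAX2 parser object.
SAX2XMLReader* parser = XMLReaderFactory::createXMLReader();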

// Set the appropriate features on the parser.
// Enable namespaces, schema validation, and the checking 
// of all Schema constraints.
// We refer to these as "common features" in following examples.
parser->setFeature(XMLUni::fgSAX2CoreNameSpaces, true);
parser->setFeature(XMLUni::fgSAX2CoreValidation, true);
parser->setFeature(XMLUni::fgXercesDynamic, false);
parser->setFeature(XMLUni::fgXercesSchema, true);
parser->setFeature(XMLUni::fgXercesSchemaFullChecking, true);

// Set appropriate ContentHandler, ErrorHandler, and EntityResolver.
// These will be referred to as "common handlers" in subsequent examples.

// You will use a default handler provided by Xerces-C++ (no op action).
// Users should write their own handlers and install them.
DefaultHandler handler;
parser->setContentHandler(&handler);

// The object parser calls when it detects violations of the schema.
parser->setErrorHandler(&handler);

// The object parser calls to find the schema and 
// resolve schema imports/includes.
parser->setEntityResolver(&handler);

// Parse the XML document.
// Document content sent to registered ContentHandler instance.
parser->parse(xmlFile);

// Delete the parser instance.
delete parser;

Simple XML Schema validation using SAX2 with grammar caching enabled

XML Schema validation is a relatively complex process. The parser makes a large number of checks on each aspect of the document, but the parser can only make these checks after it does extensive processing of the documents that comprise the XML Schema. It does this in order to turn these documents into an internal form, called a grammar, which it can use to perform validation.

In addition, the XML Schema specifications mandate that parsers ensure that the documents comprising the XML Schema are themselves valid XML Schema documents. You can avoid some of this latter checking if the application sets the fgXercesSchemaFullChecking feature to false (which is the default). When this is done, the parser doesn't perform certain complicated checks on the schema -- such as making sure that whenever the parser encounters an element in a document validated by the schema, there is a unique type definition with which to validate that element. While this may sound esoteric, much of the logic of Web services and other standards and technologies that rely on XML Schema assumes that valid schemas have this property, along with the others that go unverified when schema-full-checking is disabled.

For that reason, it's always a good idea to enable the schema-full-checking feature, even though it adds to the cost of processing a schema. Fortunately, Xerces-C++ provides you with an easy way to avoid repeatedly rebuilding grammars that correspond to commonly-used XML Schemas, and so save all this build and verification effort on subsequent parses. This is referred to as grammar caching, since it involves building the grammars once and then putting them into a cache, where the parser can find and retrieve them when required, without additional processing.

Listing 2 differs from Listing 1 only in that the fgXercesCacheGrammarFromParse feature is set and the document is parsed a second time. When this feature is set and the parser encounters a schemaLocation attribute in a document, it will consult its internal grammar cache (XMLGrammarPool) to see whether it has a grammar that corresponds to the schemaLocation's target namespace. If it does, it will use that grammar; otherwise, it will parse the schema document associated with the schemaLocation's hint, and add the resulting grammar to the XMLGrammarPool. This is a great model to use when your documents refer to a limited number of target namespaces and have accurate schemaLocation hints.

Listing 2. Simple grammar caching
// Include "common includes".

// Create a SAX2 parser object.
SAX2XMLReader* parser = XMLReaderFactory::createXMLReader();

// Set "common features".

// Enable grammar caching feature.
parser->setFeature(XMLUni::fgXercesCacheGrammarFromParse, true);

// Set "common handlers".

// Parse the XML document.
// As Xerces-C++ processes the XML document, it will also process its
// associated XML schema and cache it for later reference.
parser->parse(xmlFile);

// On this second parse, Xerces-C++ won't reprocess the XML schema;
// instead, it'll use the processed and cached schema.
parser->parse(xmlFile);

// Delete the parser instance.
delete parser;

But what if you want to take a more active approach to XML parsing? Suppose you want to specify where the parser looks to find schema documents with particular target namespaces. One way, of course, is to register a SAX EntityResolver on the parser (see Resources). Another is to use the loadGrammar method of Xerces-C++ parsers to explicitly create grammars that correspond to particular target namespaces (see Listing 3).

Listing 3. XML Schema validation using a cached schema with an explicit location
// Include "common includes".

// Home of Xerces-C++ grammar constants.
#include <xercesc/validators/common/Grammar.hpp>

// Create a SAX2 parser object.
SAX2XMLReader* parser = XMLReaderFactory::createXMLReader();

// Set "common features".
// Note that this time you don't want to cache schemas from parse

// Set "common handlers".

// Preprocess the XML Schema and cache it (the third argument
// tells the parser to put the resulting grammar in its cache).
// xsdFile can be either a file path or
// an object of type xercesc/sax/InputSource.
parser->loadGrammar(xsdFile, Grammar::SchemaGrammarType, true);

// Instruct the parser to use the cached schema 
// when processing XML documents.
parser->setFeature(XMLUni::fgXercesUseCachedGrammarInParse, true);

// Parse the XML document.
// Xerces-C++ will use the preprocessed schema when it validates 
// the document's contents, if the target namespaces match.
parser->parse(xmlFile);

// Delete the parser instance.
delete parser;

Note that fgXercesUseCachedGrammarInParse causes Xerces-C++ to refer requests for grammars to its XMLGrammarPool instance before it asks a registered EntityResolver or tries to dereference the schemaLocation hint. This is not the same as fgXercesCacheGrammarFromParse, which additionally adds all new grammars encountered while parsing documents to the XMLGrammarPool.
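
The difference between the two features can be pictured with a small, self-contained model. The sketch below is not Xerces-C++ code: ToyGrammar, ToyGrammarResolver, and the builder callback are hypothetical names invented purely for illustration, and the model assumes (in line with the description above) that caching from parse also implies consulting the pool first.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Toy stand-in for a compiled schema grammar. NOT a Xerces-C++ type.
struct ToyGrammar {
    std::string targetNamespace;
};

// Toy model of the lookup order described above.
// NOT the Xerces-C++ implementation; all names are hypothetical.
class ToyGrammarResolver {
public:
    // Stands in for the expensive path: asking an EntityResolver or
    // dereferencing a schemaLocation hint, then building the grammar.
    using Builder = std::function<ToyGrammar(const std::string&)>;

    ToyGrammarResolver(Builder builder, bool useCachedGrammar, bool cacheFromParse)
        : builder_(std::move(builder)),
          useCached_(useCachedGrammar),
          cacheFromParse_(cacheFromParse) {}

    // Models an explicit preload that stores the grammar in the pool.
    void preload(const std::string& ns) { pool_[ns] = builder_(ns); }

    // Models the request for a grammar made while parsing a document.
    ToyGrammar resolve(const std::string& ns) {
        // Either feature causes the pool to be consulted first.
        if (useCached_ || cacheFromParse_) {
            auto it = pool_.find(ns);
            if (it != pool_.end()) {
                return it->second;
            }
        }
        // Pool miss: fall back to the expensive path.
        ToyGrammar grammar = builder_(ns);
        // Only cache-from-parse adds newly built grammars to the pool.
        if (cacheFromParse_) {
            pool_[ns] = grammar;
        }
        return grammar;
    }

    std::size_t pooledCount() const { return pool_.size(); }

private:
    Builder builder_;
    bool useCached_;
    bool cacheFromParse_;
    std::map<std::string, ToyGrammar> pool_;
};
```

In this model, a resolver created with only useCachedGrammar set rebuilds any namespace that was not preloaded each time it is requested, and its pool never grows. A resolver created with cacheFromParse set builds such a grammar once and finds it in the pool on every later request.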


XML Schema validation using serialization of grammars to disk

What if your application can't reuse parser instances, with their associated XMLGrammarPools, for long periods of time? This can happen if XML documents are parsed infrequently, or if the number of threads within your application varies widely (Xerces-C++ parsers are not re-entrant). In this situation, the time taken to build that first grammar from a set of schema documents might well be important even if that grammar is reused many times.

But Xerces-C++ can help here too: It provides a means to serialize the entire contents of an XMLGrammarPool to disk, in their native form. This dramatically speeds up the creation of grammar objects for validation. It also allows the application to group all XML Schemas of interest in one place, so the application logic that knows which schemas are important and trusted can be separated entirely from application logic concerned with instance document processing.

Listing 4 illustrates how to build an XMLGrammarPool and serialize its contents to a binary file. The final example in this article will then show how to use the contents of that file to validate documents.

Listing 4. Serializing schema grammar to disk
// Include "common includes".

// Various interfaces you'll need:
#include <xercesc/validators/common/Grammar.hpp>
#include <xercesc/framework/MemoryManager.hpp>
#include <xercesc/framework/XMLGrammarPool.hpp>
#include <xercesc/framework/BinOutputStream.hpp>

// Xerces-C++'s default MemoryManager/XMLGrammarPool implementations.
#include <xercesc/internal/MemoryManagerImpl.hpp>
#include <xercesc/internal/XMLGrammarPoolImpl.hpp>

// Binary output stream for files.
#include <xercesc/internal/BinFileOutputStream.hpp>

// Create a memory manager instance for memory handling requests.
MemoryManager *memMgr = new MemoryManagerImpl();

// Create a grammar pool that stores the cached grammars.
XMLGrammarPool* pool = new XMLGrammarPoolImpl(memMgr);

// Create a SAX2 parser object.
SAX2XMLReader* parser = XMLReaderFactory::createXMLReader(memMgr, pool);

// Set "common features".

// Enable grammar caching feature.
parser->setFeature(XMLUni::fgXercesCacheGrammarFromParse, true);

// Set errorHandler and entityResolver (no need for ContentHandler).
// You will use a default handler provided by Xerces-C++ (no op action).
// Users should write their own handlers and install them.
DefaultHandler handler;
parser->setErrorHandler(&handler);
parser->setEntityResolver(&handler);

// xsdFile1 can be either a file path or
// an object of type xercesc/sax/InputSource.
parser->loadGrammar(xsdFile1, Grammar::SchemaGrammarType, true);

// Include however many XSD files you might require.

// Create an output stream to receive the serialized grammars.
// Here, a BinFileOutputStream instance writes the data to disk.
BinOutputStream* outStream = new BinFileOutputStream(outFile);

// Serialize the grammar pool.
pool->serializeGrammars(outStream);

// Clean up.
delete parser;
delete pool;
delete outStream;
delete memMgr;

In Listing 4, various internal Xerces-C++ classes are used as implementations for interfaces. In general, you might want to customize the behaviour by providing your own implementations.

You now have a binary file on disk that contains a Xerces-C++ representation of the XMLGrammarPool instance containing all the schema documents of interest to your application. To validate an instance document according to one of those XML Schemas, reload that file into memory and use it, as Listing 5 shows.

Listing 5. XML Schema validation using a deserialized XMLGrammarPool from disk
// Include "common includes".

// Various interfaces you'll need.
#include <xercesc/framework/MemoryManager.hpp>
#include <xercesc/framework/XMLGrammarPool.hpp>
#include <xercesc/util/BinInputStream.hpp>

// Xerces-C++'s default MemoryManager/XMLGrammarPool implementations.
#include <xercesc/internal/MemoryManagerImpl.hpp>
#include <xercesc/internal/XMLGrammarPoolImpl.hpp>

// Binary input stream for files.
#include <xercesc/util/BinFileInputStream.hpp>

// Create a memory manager instance for memory handling requests.
MemoryManager *memMgr = new MemoryManagerImpl();

// Create a grammar pool to receive the serialized grammars.
XMLGrammarPool* pool = new XMLGrammarPoolImpl(memMgr);

// Create an input stream instance 
// to deserialize processed grammar from.
// Use a BinFileInputStream instance to deserialize data from disk.
BinInputStream* inStream = new BinFileInputStream(inFile);

// Deserialize grammars from disk.
pool->deserializeGrammars(inStream);

// Create a SAX2 parser object.
SAX2XMLReader* parser = XMLReaderFactory::createXMLReader(memMgr, pool);

// Set "common features".

// Enable use of cached grammar feature.
parser->setFeature(XMLUni::fgXercesUseCachedGrammarInParse, true);

// Set "common handlers".

// Parse the instance document that will use the cached schemas 
// for validation.
parser->parse(xmlFile);

// Delete instances.
delete parser;
delete pool;
delete inStream;
delete memMgr;

Conclusion

In this article, we showed you how to use Xerces-C++ to validate an instance document according to an XML Schema. We also demonstrated how you can improve performance of this process when you enable the parser to cache its internal representations of XML Schemas. We then used Xerces-C++'s ability to serialize these internal representations to disk, then deserialize them back into memory when required, to avoid the expense of initializing a grammar cache from raw schema documents.

Performance is a constant concern with applications that need to use XML Schema. This article can help allay those concerns for C and C++ applications that make use of the Xerces-C++ parser.

Resources

Learn

Get products and technologies
