Xerces-C++ is one of the most full-featured and portable XML parsers written in C++ available today. What's more, it's open source (distributed by the Apache Xerces Project) and it's fully conformant to the most important XML standards -- XML 1.0 3rd Edition, XML 1.1, and XML Schema 1.0 2nd Edition Structures and Datatypes. Because of the role they play in Web services, XML Schemas are becoming especially important to today's XML applications. In this article, we show you how to use Xerces-C++ to validate a document according to an XML Schema. We then explore how to get the best possible schema-validation performance out of Xerces-C++ through its grammar caching and grammar serialization capabilities (see Resources).
Simple XML Schema validation using SAX2
The W3C's XML Schema specification defines a set of components that XML authors use to describe the structure of XML documents. It also provides a rich datatype language that specifies the textual content of elements and attributes, thereby permitting one application environment to transmit data to another without losing information. This is why XML Schemas play such a critical role in Web services, and why they're increasingly important in other aspects of XML processing. Our first job in this article is to show you how to use Xerces-C++ to validate a document according to an XML Schema.
In this article, we use Xerces-C++'s version of the SAX 2.0 API to illustrate its schema-validation capabilities. Xerces-C++ also supports a binding for W3C's DOM Level 2 Core specification. It's trivial to take the code presented here for SAX2 and alter it so that it works in the context of Xerces-C++'s DOM implementation. While this article doesn't describe the usage of Xerces-C++ or its SAX2 API in detail, we'll note some of the more important aspects of SAX2 as they relate to XML Schema validation (see Resources).
To validate an XML document against its corresponding XML Schema documents, you first need to create a Xerces-C++ SAX 2 parser instance and set the appropriate features and handlers. Then you tell the parser instance to parse the XML document. For example:
Listing 1. Enabling schema validation on a Xerces-C++ SAX 2 parser
// Necessary includes. We refer to these as "common includes"
// in the following examples.
#include <xercesc/sax2/XMLReaderFactory.hpp>
#include <xercesc/sax2/SAX2XMLReader.hpp>
#include <xercesc/sax2/DefaultHandler.hpp>
// Handy definitions of constants.
#include <xercesc/util/XMLUni.hpp>

// These fragments assume that XMLPlatformUtils::Initialize() has already
// been called and that the Xerces-C++ namespace is in scope
// (for example, through XERCES_CPP_NAMESPACE_USE).

// Create a SAX2 parser object.
SAX2XMLReader* parser = XMLReaderFactory::createXMLReader();

// Set the appropriate features on the parser.
// Enable namespaces, schema validation, and the checking
// of all Schema constraints.
// We refer to these as "common features" in following examples.
parser->setFeature(XMLUni::fgSAX2CoreNameSpaces, true);
parser->setFeature(XMLUni::fgSAX2CoreValidation, true);
parser->setFeature(XMLUni::fgXercesDynamic, false);
parser->setFeature(XMLUni::fgXercesSchema, true);
parser->setFeature(XMLUni::fgXercesSchemaFullChecking, true);

// Set appropriate ContentHandler, ErrorHandler, and EntityResolver.
// These will be referred to as "common handlers" in subsequent examples.
// You will use a default handler provided by Xerces-C++ (no op action).
// Users should write their own handlers and install them.
DefaultHandler handler;
parser->setContentHandler(&handler);
// The object parser calls when it detects violations of the schema.
parser->setErrorHandler(&handler);
// The object parser calls to find the schema and
// resolve schema imports/includes.
parser->setEntityResolver(&handler);

// Parse the XML document.
// Document content sent to registered ContentHandler instance.
parser->parse(xmlFile);

// Delete the parser instance.
delete parser;
Simple XML Schema validation using SAX2 with grammar caching enabled
XML Schema validation is a relatively complex process. The parser makes a large number of checks on each aspect of the document, but the parser can only make these checks after it does extensive processing of the documents that comprise the XML Schema. It does this in order to turn these documents into an internal form, called a grammar, which it can use to perform validation.
In addition, the XML Schema specifications mandate that parsers ensure that the documents comprising the XML Schema are themselves valid XML Schema documents. You can avoid some of this latter checking if the application sets the fgXercesSchemaFullChecking feature to false (which is the default). When this is done, the parser doesn't perform certain complicated checks on the schema -- such as making sure that whenever the parser encounters an element in a document validated by the schema, there is a unique type definition to validate that element against. While this may sound esoteric, much of the logic of Web services and other standards and technologies that rely on XML Schema assumes that valid schemas have this and other properties, whose verification is skipped when the schema-full-checking feature is disabled.
It's always a good idea to enable this feature, even though the extra checks make each grammar more expensive to build. Fortunately, Xerces-C++ provides you with an easy way to avoid repeatedly rebuilding grammars that correspond to commonly used XML Schemas, and so save all this build and verification effort on subsequent parses. This is referred to as grammar caching, since it involves building the grammars once and then putting them into a cache, where the parser can find and retrieve them when required, without additional processing.
Listing 2 only differs from Listing 1 in that the CacheGrammarFromParse
feature is set. When this feature is set and the parser
encounters a schemaLocation attribute in a document, it will consult its
internal grammar cache (XMLGrammarPool) to see whether
it has a grammar that corresponds to the schemaLocation's target namespace.
If it does, it will use that grammar; otherwise, it will parse the schema
document associated with the schemaLocation's hint, and add the resulting
grammar to the XMLGrammarPool. This is a great model to use when your
documents refer to a limited number of target namespaces and have accurate
schemaLocation hints.
Listing 2. Simple grammar caching
// Include "common includes".

// Create a SAX2 parser object.
SAX2XMLReader* parser = XMLReaderFactory::createXMLReader();

// Set "common features".

// Enable grammar caching feature.
// (setFeature takes the feature name and a boolean value.)
parser->setFeature(XMLUni::fgXercesCacheGrammarFromParse, true);

// Set "common handlers".

// Parse the XML document.
// As Xerces-C++ processes the XML document, it will also process its
// associated XML schema and cache it for later reference.
parser->parse(xmlFile);

// Xerces-C++ won't re-process the XML schema; instead, it'll use
// the processed and cached schema.
parser->parse(xmlFile);

// Delete the parser instance.
delete parser;
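The lookup that the cache-from-parse feature performs can be pictured with a small, self-contained model. The sketch below is not Xerces-C++ code: ToyGrammar and ToyGrammarCache are hypothetical names invented for illustration, and a real XMLGrammarPool stores far richer objects. It only mirrors the behavior described above, in which a grammar is built once per target namespace and the schemaLocation hint is consulted only on a cache miss.

```cpp
#include <map>
#include <string>

// Toy stand-in for a compiled schema grammar (hypothetical; not a Xerces-C++ type).
struct ToyGrammar {
    std::string targetNamespace;
    std::string builtFrom;  // the schemaLocation hint that was used to build it
};

// Toy stand-in for a grammar pool with cache-from-parse behavior.
class ToyGrammarCache {
public:
    // Return the cached grammar for the namespace; build it from the hint
    // only when no grammar for that namespace is cached yet.
    const ToyGrammar& resolve(const std::string& targetNamespace,
                              const std::string& schemaLocationHint) {
        auto it = grammars_.find(targetNamespace);
        if (it == grammars_.end()) {
            ++buildCount_;  // stands for the expensive parse-and-check of the schema
            it = grammars_.emplace(targetNamespace,
                                   ToyGrammar{targetNamespace, schemaLocationHint}).first;
        }
        return it->second;
    }

    // Number of times a grammar had to be built from a schema document.
    int buildCount() const { return buildCount_; }

private:
    std::map<std::string, ToyGrammar> grammars_;
    int buildCount_ = 0;
};
```

In this model, resolving the same namespace twice with two different hints builds the grammar only once, and the second hint is never read. That is why the approach suits documents that refer to a limited number of target namespaces and carry accurate hints.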
But what if you want to take a more active approach to XML parsing? Suppose you want to specify where the parser looks to find schema documents with particular
target namespaces. One way, of course, is to register a SAX EntityResolver
on the parser (see Resources). Another is to use the
loadGrammar method of Xerces-C++ parsers to explicitly create grammars
that correspond to particular target namespaces (see Listing 3).
Listing 3. XML Schema validation using a cached schema with an explicit location
// Include "common includes".
// Home of Xerces-C++ grammar constants.
#include <xercesc/validators/common/Grammar.hpp>

// Create a SAX2 parser object.
SAX2XMLReader* parser = XMLReaderFactory::createXMLReader();

// Set "common features".
// Note that this time you don't want to cache schemas from parse.

// Set "common handlers".

// Preprocess the XML Schema and cache it.
// xsdFile can be a file path or
// an object of type xercesc/sax/InputSource.
parser->loadGrammar(xsdFile, Grammar::SchemaGrammarType, true);

// Instruct the parser to use the cached schema
// when processing XML documents.
// (setFeature takes the feature name and a boolean value.)
parser->setFeature(XMLUni::fgXercesUseCachedGrammarInParse, true);

// Parse the XML document.
// Xerces-C++ will use the preprocessed schema when it validates
// the document's contents, if the target namespaces match.
parser->parse(xmlFile);

// Delete the parser instance.
delete parser;
Note that fgXercesUseCachedGrammarInParse causes Xerces-C++ to refer requests
for grammars to its XMLGrammarPool instance before it asks a registered
EntityResolver or tries to dereference the schemaLocation hint. This is not the
same as fgXercesCacheGrammarFromParse, which additionally adds all new grammars
encountered while parsing documents to the XMLGrammarPool.
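The difference between the two features is easier to see in a toy model. The following sketch is not the Xerces-C++ implementation; ToyParser, its two flags, preload, and grammarFor are hypothetical names that merely mirror the behavior described above: both features make the parser consult the pool first, but only cache-from-parse adds newly built grammars to it.

```cpp
#include <set>
#include <string>

// Where a grammar came from in this toy model.
enum class GrammarSource { Pool, BuiltFromSchema };

// Hypothetical stand-in for a parser and its grammar pool (not a Xerces-C++ class).
struct ToyParser {
    bool useCachedGrammarInParse = false;  // models fgXercesUseCachedGrammarInParse
    bool cacheGrammarFromParse = false;    // models fgXercesCacheGrammarFromParse
    std::set<std::string> pool;            // target namespaces that have a cached grammar

    // Models loadGrammar(..., true): build a grammar up front and cache it.
    void preload(const std::string& targetNamespace) { pool.insert(targetNamespace); }

    // Models the grammar lookup made while parsing an instance document.
    GrammarSource grammarFor(const std::string& targetNamespace) {
        const bool consultPool = useCachedGrammarInParse || cacheGrammarFromParse;
        if (consultPool && pool.count(targetNamespace) > 0) {
            return GrammarSource::Pool;
        }
        // Cache miss: the schema is located (entity resolver or schemaLocation
        // hint) and built. Only cache-from-parse keeps the result in the pool.
        if (cacheGrammarFromParse) {
            pool.insert(targetNamespace);
        }
        return GrammarSource::BuiltFromSchema;
    }
};
```

With only useCachedGrammarInParse set, a namespace that was not preloaded is rebuilt on every lookup. With cacheGrammarFromParse set, it is built once and then served from the pool.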
XML Schema validation using serialization of grammars to disk
What if your application can't reuse parser instances, with their
associated XMLGrammarPools, for long periods of time? This can happen if XML
documents are parsed infrequently, or if the number of threads within your
application varies widely (Xerces-C++ parsers are not re-entrant). In this
situation, the time taken to build that first grammar from a set of schema
documents might well be important even if that grammar is reused many times.
But Xerces-C++ can help here too: It provides a means to serialize the
entire contents of an XMLGrammarPool to disk, in their native form. This
dramatically speeds up the creation of grammar objects for
validation. It also allows the application to group all XML Schemas of
interest in one place, so the application logic that knows
which schemas are important and trusted can be separated entirely from
application logic concerned with instance document processing.
Listing 4 illustrates how to build an XMLGrammarPool
and serialize its contents to a binary file. The final example in this article will then
show how to use the contents of that file to validate documents.
Listing 4. Serializing schema grammar to disk
// Include "common includes".
// Various interfaces you'll need:
#include <xercesc/validators/common/Grammar.hpp>
#include <xercesc/framework/MemoryManager.hpp>
#include <xercesc/framework/XMLGrammarPool.hpp>
#include <xercesc/framework/BinOutputStream.hpp>
// Xerces-C++'s default MemoryManager/XMLGrammarPool implementations.
#include <xercesc/internal/MemoryManagerImpl.hpp>
#include <xercesc/internal/XMLGrammarPoolImpl.hpp>
// Binary output stream for files.
#include <xercesc/internal/BinFileOutputStream.hpp>

// Create a memory manager instance for memory handling requests.
MemoryManager* memMgr = new MemoryManagerImpl();

// Create a grammar pool that stores the cached grammars.
XMLGrammarPool* pool = new XMLGrammarPoolImpl(memMgr);

// Create a SAX2 parser object.
SAX2XMLReader* parser = XMLReaderFactory::createXMLReader(memMgr, pool);

// Set "common features".

// Enable grammar caching feature.
parser->setFeature(XMLUni::fgXercesCacheGrammarFromParse, true);

// Set errorHandler and entityResolver (no need for ContentHandler).
// You will use a default handler provided by Xerces-C++ (no op action).
// Users should write their own handlers and install them.
DefaultHandler handler;
parser->setErrorHandler(&handler);
parser->setEntityResolver(&handler);

// xsdFile1 can be a file path or
// an object of type xercesc/sax/InputSource.
parser->loadGrammar(xsdFile1, Grammar::SchemaGrammarType, true);
// Include however many XSD files you might require.

// Create an output stream instance to serialize the processed grammars to.
// Use a BinFileOutputStream instance to serialize data to disk.
// (The stream is held through a pointer to its base class.)
BinOutputStream* outStream = new BinFileOutputStream(outFile);

// Serialize the grammar pool.
pool->serializeGrammars(outStream);

// Clean up.
delete parser;
delete pool;
delete outStream;
delete memMgr;
In Listing 4, various internal Xerces-C++ classes are used as implementations for interfaces. In general, you might want to customize the behavior by providing your own implementations.
You now have a binary file on disk that contains a Xerces-C++ representation of the XMLGrammarPool instance, which holds the grammars for all the schema documents of interest to your application. To validate an instance document according to one of those XML Schemas, reload that file into memory and use it, as Listing 5 shows.
Listing 5. XML Schema validation using a deserialized XMLGrammarPool from disk
// Include "common includes".
// Various interfaces you'll need.
#include <xercesc/framework/MemoryManager.hpp>
#include <xercesc/framework/XMLGrammarPool.hpp>
#include <xercesc/util/BinInputStream.hpp>
// Xerces-C++'s default MemoryManager/XMLGrammarPool implementations.
#include <xercesc/internal/MemoryManagerImpl.hpp>
#include <xercesc/internal/XMLGrammarPoolImpl.hpp>
// Binary input stream for files.
#include <xercesc/util/BinFileInputStream.hpp>

// Create a memory manager instance for memory handling requests.
MemoryManager* memMgr = new MemoryManagerImpl();

// Create a grammar pool to receive the serialized grammars.
XMLGrammarPool* pool = new XMLGrammarPoolImpl(memMgr);

// Create an input stream instance
// to deserialize processed grammar from.
// Use a BinFileInputStream instance to deserialize data from disk.
// (The stream is held through a pointer to its base class.)
BinInputStream* inStream = new BinFileInputStream(inFile);

// Deserialize grammars from disk.
pool->deserializeGrammars(inStream);

// Create a SAX2 parser object.
SAX2XMLReader* parser = XMLReaderFactory::createXMLReader(memMgr, pool);

// Set "common features".

// Enable use of cached grammar feature.
parser->setFeature(XMLUni::fgXercesUseCachedGrammarInParse, true);

// Set "common handlers".

// Parse the instance document that will use the cached schemas
// for validation.
parser->parse(xmlFile);

// Delete instances.
delete parser;
delete pool;
delete inStream;
delete memMgr;
In this article, we showed you how to use Xerces-C++ to validate an instance document according to an XML Schema. We also demonstrated how you can improve performance of this process when you enable the parser to cache its internal representations of XML Schemas. We then used Xerces-C++'s ability to serialize these internal representations to disk, then deserialize them back into memory when required, to avoid the expense of initializing a grammar cache from raw schema documents.
Performance is a constant concern with applications that need to use XML Schema. This article can help allay those concerns for C and C++ applications that make use of the Xerces-C++ parser.
Resources
Learn
- Read the W3C Recommendation XML Schema Part 0: Primer for an introduction to the XML Schema language.
- To learn more about the XML standards, read the XML 1.0 and the XML 1.1 specifications.
- Reference the Xerces-C++ SAX2 Programming Guide for a tutorial on how to use the SAX2 API.
- Explore the C++ Language Binding for DOM.
- Read the developerWorks tutorial "XML Schema validation in Xerces-Java 2" (July 2002).
- Confused by all the XML standards out there? Uche Ogbuji's developerWorks article series on XML standards can help you sort through it all:
- Part 1 -- The core standards (January 2004)
- Part 2 -- XML processing standards (February 2004)
- Part 3 -- The most important vocabularies (February 2004)
- Part 4 -- Detailed cross-reference of the most important XML standards (March 2004)
- Jump start your knowledge with these developerWorks articles:
- Save XML data using DOMWriter in XML for the C++ parser in "Serialize XML Data" (July 2003).
- Compare DOM and SAX and then put SAX to work in "SAX, the power API" (August 2001).
- Find hundreds more XML resources on the developerWorks XML zone.
- Find out how you can become an IBM Certified Developer in XML and related technologies.
Get products and technologies
- While you're at it, check the W3C XML Schema specification, which is composed of two parts: XML Schema Part 1: Structures and XML Schema Part 2: Datatypes.
- Try Xerces-C++, the XML parser for C++ that's distributed by Apache.
- Visit the official SAX Web site to learn more about the API. You'll find technical documentation, FAQs, and more.
Neil Graham is the Manager of XML Parser Development at IBM. He is a committer on Apache's Xerces-Java and Xerces-C++ XML parsers, where he has worked on, among other things, the implementation of XML Schema, XML 1.1, and grammar caching. He was also one of IBM's representatives on the Expert Group that developed JAXP 1.3.