Make the most of Xerces-C++, Part 1

A parsing how-to for C++ programmers

This two-part article offers an introduction to the Xerces-C++ XML library. Part 1 explains how to link the library into applications written for Linux and Windows. Ample code demonstrates parsing with the SAX API, and a sample application shows you how to create a bar graph in ASCII art. In Part 2, I'll demonstrate how to load, manipulate, or synthesize a DOM document, and you'll see how to create the same bar graph using Scalable Vector Graphics (SVG). C++ programmers who read these articles should be able to easily add XML parsing and processing capabilities to their applications.

Rick Parrish (rfmobile@swbell.net), Consultant

Rick Parrish writes software for a living but also lends a hand in several open-source projects. His current interests include 3D graphics and visualization, decoding digital radio signals, and creating a scriptable database framework for Mozilla. You can contact Rick at rfmobile@swbell.net.



13 August 2003


Xerces-C++ is a very robust XML parser that offers validation, plus SAX and DOM APIs. XML validation is well supported for a Document Type Definition (DTD), and essentially complete open-standards support for W3C XML Schema was added in December 2001.

Xerces-C++: a capsule bio

Xerces-C++ originated as the XML4C project at IBM. XML4C was a companion project to XML4J, which likewise was the origin of Xerces-J -- the Java implementation. IBM released the source for both projects to the Apache Software Foundation, where they were renamed Xerces-C++ and Xerces-J, respectively. These two are core projects of the Apache XML group. (If you see "Xerces-C" instead of "Xerces-C++", it's the same thing; the project was written in C++ from the start.)

The XML4C project continues at IBM, based on Xerces-C++. In the version that I explored (see Resources), XML4C's distinguishing merit relative to Xerces-C++ is better out-of-the-box support for a huge number of international character encodings.

Validation

The two principal means of specifying the structure of an XML document are the DTD and W3C XML Schema, with DTD being the much older of the two. XML Schema is, roughly speaking, a DTD expressed as XML, with data typing and namespace awareness added. Xerces-C++ offers great out-of-the-box validation capabilities for ensuring that an XML document conforms to a DTD.

Licensing

Xerces-C++ is made available under the terms of the Apache Software License (see Resources), which happens to be one of the more readable open-source licenses around. It compares very well to the BSD license. Essentially, you can use Xerces-C++ in your (or your company's) software royalty free at the mere expense of disclosing to your customers and users that your software includes Apache code, and including the proper copyright notice. Check the Web page for the exact text of the license.


SAX: the event API model

SAX, as you may know, is an event-oriented programming API for parsing XML documents. A parsing engine consumes XML data sequentially and makes callbacks into the application as it discovers the structure of the incoming XML data. These callbacks are referred to as event handlers. SAX is actually two APIs: SAX 1.0 is the original, and SAX 2.0 is the current revised specification. The two are similar, but different enough that most applications based on SAX 1.0 break when they are moved to the newer specification.

The SAX API specification was moved to SourceForge as a project of its own (see Resources). The SAX examples I give later in this article make use of SAX 2.0.
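To make the callback model concrete, here is a minimal sketch of an event-driven scanner in plain C++. It does not use Xerces-C++ or the real SAX interfaces: the EventHandler, scan, and EventLog names are hypothetical, and the scanner understands only bare start tags, end tags, and text. The point is the direction of control: the scanner reads the input once and calls back into the application as it recognizes each piece.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical handler interface. Every callback has an empty default body,
// so a subclass overrides only the events it cares about.
struct EventHandler {
    virtual ~EventHandler() = default;
    virtual void startElement(const std::string& /*name*/) {}
    virtual void endElement(const std::string& /*name*/) {}
    virtual void characters(const std::string& /*text*/) {}
};

// Toy scanner: walks the input once, left to right, and reports what it finds.
// It handles only <name>, </name>, and text; no attributes, no entities.
void scan(const std::string& xml, EventHandler& handler) {
    std::size_t pos = 0;
    while (pos < xml.size()) {
        if (xml[pos] == '<') {
            const std::size_t close = xml.find('>', pos);
            if (close == std::string::npos) break;  // malformed input: stop
            const bool isEnd = (pos + 1 < close && xml[pos + 1] == '/');
            const std::size_t nameStart = pos + (isEnd ? 2 : 1);
            const std::string name = xml.substr(nameStart, close - nameStart);
            if (isEnd) handler.endElement(name);
            else handler.startElement(name);
            pos = close + 1;
        } else {
            std::size_t next = xml.find('<', pos);
            if (next == std::string::npos) next = xml.size();
            handler.characters(xml.substr(pos, next - pos));
            pos = next;
        }
    }
}

// Example application handler: records the order in which events arrive.
struct EventLog : EventHandler {
    std::vector<std::string> events;
    void startElement(const std::string& name) override { events.push_back("start:" + name); }
    void endElement(const std::string& name) override { events.push_back("end:" + name); }
    void characters(const std::string& text) override { events.push_back("text:" + text); }
};
```

Passing an EventLog to scan with the input `<a><b>hi</b></a>` records five events in document order: two starts, one run of text, and two ends. The application never sees the document as a whole, only the stream of events.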


DOM: the Document Object Model

Unlike SAX, the DOM API permits editing and saving an XML document back to a file or stream. It also permits programmatically constructing a new XML document from scratch. The reason for this is that DOM provides an in-memory model for the document. You can traverse the document tree, prune nodes, or graft on new ones.
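The idea of an in-memory model that you can traverse, prune, and graft can be sketched without any XML library at all. The Node type and the graft, prune, and serialize functions below are hypothetical names for illustration and are not the W3C DOM interfaces; a real DOM node also carries attributes, text, and a parent link.

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical in-memory node: a name plus an ordered list of owned children.
struct Node {
    std::string name;
    std::vector<std::unique_ptr<Node>> children;
    explicit Node(std::string n) : name(std::move(n)) {}
};

// Graft: attach a new child and return a pointer to it for further edits.
Node* graft(Node& parent, const std::string& name) {
    parent.children.push_back(std::unique_ptr<Node>(new Node(name)));
    return parent.children.back().get();
}

// Prune: remove every direct child with the given name.
// Returns the number of children removed.
std::size_t prune(Node& parent, const std::string& name) {
    const std::size_t before = parent.children.size();
    parent.children.erase(
        std::remove_if(parent.children.begin(), parent.children.end(),
                       [&](const std::unique_ptr<Node>& child) { return child->name == name; }),
        parent.children.end());
    return before - parent.children.size();
}

// Traverse: walk the tree depth-first and write it back out as markup.
std::string serialize(const Node& node) {
    if (node.children.empty()) return "<" + node.name + "/>";
    std::string out = "<" + node.name + ">";
    for (const auto& child : node.children) out += serialize(*child);
    return out + "</" + node.name + ">";
}
```

Because the whole tree stays in memory, an edit such as prune can be followed immediately by serialize to save the changed document, which is exactly what a forward-only event stream cannot offer.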

The tech wrecks

DOM is a family of W3C technical recommendations affectionately called tech wrecks. DOM has three levels, with Levels 1 and 2 at full technical recommendation status and Level 3 at working draft status.

The DOM Level 1 Core defines most of what is needed for basic XML functionality: the ability to construct a representation of an XML document. The DOMString type is explicitly specified to consist of wide UTF-16 characters. Level 1 goes on to define the interfaces for programmatically interacting with the various pieces of a DOM tree. Serialization of XML is intentionally omitted from Level 1. Just beyond the Level 1 core is the DOM Level 1 HTML definition. This area attempts to resolve DOM Level 1 core with the earlier Dynamic HTML object model (loosely referred to as Level 0).

The DOM Level 2 adds namespaces, events, and iterators, plus view and stylesheet support. You need DOM Level 2 for some applications: For instance, assigning an XML Schema to a namespace is essential for applications like RDF, where XML tags come from different schemas and the chance for a name collision is high. Level 2 adds a pair of createDocument methods to the DOMImplementation interface. One of the examples will show why this is important. Just when you thought you were safe from the callbacks and event handlers found in SAX, here they are again in the Event interface. Unlike the SAX events, which are for parsing, DOM events can reflect user interactions with a document as well as changes to a live document. DOM events that reflect the change in the structure of a document are called mutation events. TreeWalkers and NodeIterators enhance DOM tree traversal. Programs can inspect style information through the StyleSheet interface. Finally, view support allows an XML application to examine a document in both original and stylesheet rendered forms. These before and after views are called the document and abstract views.

DOM Level 3 Core adds the getInterface method to the DOMImplementation interface. In a Level 3 document, you can specify the document's character encoding or set some of its basic XML declarations like version and standalone. Level 2 doesn't permit moving DOM nodes from one document to another. Level 3 drops this limitation. Level 3 adds user data -- extra application data that can be optionally attached to any node. Level 3 has a number of other advanced features, but the W3C committee is still working on the Level 3 drafts. Check Resources for a link to read up on the committee's progress.


Download and install

You can download Xerces-C++ as a zipped tarball or a precompiled binary (see Resources). Script users accessing the library through Perl, Python, VBScript, or JavaScript can download the binary for their platform to get a jumpstart on installation. C++ programmers will most likely prefer to go with building their own binaries from the source tarball. The building instructions on the Apache XML group Web site are well written; a little farther on in this article I discuss a couple of subtle issues that I have discovered -- a pthreads linking problem and a fix for potential memory leaks on Windows platforms. Part 2 will include a tip for specifying a DOCTYPE in the SVG example. If you want to build the library as you read this, look at the Xerces build documentation found on the Apache site (see Resources) first and then come back here to read about linking Xerces to your own applications.

You can download the tarball and work offline (with a laptop, for example). The full HTML documentation is included in the tarball, so you don't need to keep referring back to the Web site for the instructions.

Building for Win32

The steps for installing the software on Visual Studio dot-NET or Win64 are nearly identical to these steps for building on Win32.

  1. Unzip and untar the Xerces source tarball to a working directory. Xerces-C++ has its own directory structure, so you should make sure you preserve relative path names during this step.
  2. Using Windows Explorer or your favorite file manager, drill down to the \xerces-c-src_2_3_0\Projects\Win32\VC6\xerces-all\ folder and click the xerces-all.dsw workspace file to launch Microsoft Developer Studio.
    Note: These instructions assume that you're building Win32 applications in Visual Studio 6. For Visual Studio dot-NET or Win64 applications, repeat steps 1 and 2 in the Win64 or VC7 variants of the directory.
  3. From Developer Studio, make XercesLib the current active project and press F7 to build the DLL. On last year's hardware this takes a minute or two.
  4. Add a path to the Xerces header files into your project. (Applications wanting to link against Xerces-C++ need to either include the XercesLib DSP project file in their workspace or add the LIB file in their project file to permit linking.) Select Project>Settings to bring up the project settings dialog box. Select All Configurations from the Settings combo box, click the C++ tab, select the Preprocessor category, and add the Xerces include path (something like \xerces-c-src_2_3_0\src) to the Additional include directories text box.
  5. If you have added the XercesLib DSP to your workspace, remember to mark your own project as dependent upon the XercesLib project; otherwise, you will be greeted with link errors.
  6. Create a stub C++ source file that does nothing but contain a line that reads #include <xercesc/sax/HandlerBase.hpp>. If you can compile this one-line C++ file, your include paths are probably right. Save your workspace after doing that. To run and debug your application, place a copy of the Xerces DLL in the working directory.

Building for Linux

Build the Xerces-C++ shared library by following the thorough instructions in the doc/html folder. The commands below illustrate how to build the Xerces-C++ library from the zipped source. This assumes that the xerces-c-src_2_3_0.tar.gz file is present in a directory like /home/user. Whatever directory you choose should match the XERCESCROOT variable; the configure script requires it.

# cd /home/user
# gunzip xerces-c-src_2_3_0.tar.gz
# tar -xvf xerces-c-src_2_3_0.tar
# export XERCESCROOT=/home/user/xerces-c-src_2_3_0
# cd $XERCESCROOT/src/xercesc
# ./configure
# make all

For the rest of this example, I'll assume the source tree is under the /home/user/xerces-c-src_2_3_0 directory. If all goes well, the shared library should appear in the lib folder. If you have problems, review the build instructions in the /doc/html folder. At this point, you can either copy the library (and symlinks) to /usr/lib or define the appropriate environment variable so that the loader can locate your newly-compiled library.

The easy way to test out your new library is to build and run one of the samples:

# export XERCESCROOT=/home/user/xerces-c-src_2_3_0
# cd $XERCESCROOT/samples
# ./configure
# make all

I tripped over a small problem building one of the samples on a fresh installation of Slackware Linux 9.0. The linker complained of some missing pthreads-related exports. I edited the Makefile.in file to include a reference to -lpthread and ran configure again. The second time around, typing make all worked.

Once you know the library works, you can start your own Xerces-C++ project. Use the -I compiler option to help the compiler locate the Xerces header files. Use the -L and -l linker options to help the linker locate the Xerces-C++ library. Listing 1 gives you a working minimal makefile to get started.

Listing 1. A minimal makefile
APP = example
XERCES = /home/user/xerces-c-src_2_3_0
INCS = ${XERCES}/src

${APP} :: ${APP}.cpp
	${CXX} -I${INCS} ${APP}.cpp -o ${APP} -L${XERCES}/lib -lxerces-c

The command to kick off Listing 1 is make or gmake. You can change the APP variable to whatever source file suits you. The examples in this article use similar makefiles.

Xerces-C++ added C++ namespace support (not to be confused with XML namespaces) as of Version 2.2.0. If you have code that works on 2.1.0 and you'd like to take advantage of the newer version, add the following three lines to your code, just after including the Xerces-C++ headers.

Listing 2. Xerces-C++ namespace support
#ifdef XERCES_CPP_NAMESPACE_USE
XERCES_CPP_NAMESPACE_USE
#endif

You could, of course, just prefix all of your Xerces-C++ objects with the XERCES_CPP_NAMESPACE:: namespace.
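The mechanism behind those two options can be sketched with a stand-in library. The DEMO_CPP_NAMESPACE and DEMO_CPP_NAMESPACE_USE macros, the demo_2_3 namespace, and the libraryName function are all hypothetical; the sketch assumes only that the library hides a versioned namespace behind one macro and a using directive behind another.

```cpp
#include <string>

// Hypothetical library header: all names live in a versioned namespace,
// and two macros hide the version number from client code.
#define DEMO_CPP_NAMESPACE demo_2_3
#define DEMO_CPP_NAMESPACE_USE using namespace DEMO_CPP_NAMESPACE;

namespace DEMO_CPP_NAMESPACE {
    inline std::string libraryName() { return "demo"; }
}

// Option 1: qualify each name explicitly through the macro.
std::string qualifiedCall() {
    return DEMO_CPP_NAMESPACE::libraryName();
}

// Option 2: pull the whole namespace in once. The #ifdef guard lets the same
// source compile against an older header that never defined the macro.
#ifdef DEMO_CPP_NAMESPACE_USE
DEMO_CPP_NAMESPACE_USE
#endif

std::string unqualifiedCall() {
    return libraryName();  // found through the using directive above
}
```

Explicit qualification keeps the global namespace clean at the cost of more typing; the using directive is the smaller change when you are updating existing code.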


The sample application

To keep things interesting as I explain the basics of using Xerces-C++, I'm going to create a simple bar graph using XML as the data format. To sidestep platform-specific GUI details, I'm doing the bar graph using ASCII art. This is, after all, an article on XML and not GTK, OpenGL, or DirectX. If you are interested in using an XML representation of graphical data, look at SVG and SMIL (see Resources). The DOM example that I describe in Part 2 outputs SVG. I'll start with the simple text-only app.

Listing 3 is the DTD for the data. Next I'll construct a program to load the data, determine what scale to use, and then actually plot the data to the screen.
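The scaling and plotting steps are independent of XML, so they can be sketched on their own. This is one possible approach and not the article's downloadable code: the unitsPerColumn and bar function names, the '#' fill character, and the choice to make the largest value span the full width are all assumptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Pick a scale (value units per character column) so that the largest
// value fills exactly maxWidth columns.
double unitsPerColumn(const std::vector<double>& values, std::size_t maxWidth) {
    double largest = 0.0;
    for (double v : values) largest = std::max(largest, v);
    if (largest <= 0.0 || maxWidth == 0) return 1.0;  // nothing to scale
    return largest / static_cast<double>(maxWidth);
}

// Render one bar: the label, a separator, then one '#' per full scale unit.
std::string bar(const std::string& label, double value, double scale) {
    std::size_t length = 0;
    if (value > 0.0 && scale > 0.0) {
        length = static_cast<std::size_t>(value / scale);
    }
    return label + " |" + std::string(length, '#');
}
```

With the North department figures from the sample data (125000, 90000, and 110000) and a width of 50 columns, the scale works out to 2500 units per column, so the three bars are 50, 36, and 44 characters long.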

Listing 3. DTD for sample application data
<?xml version="1.0" encoding="UTF-8" ?>
<!ELEMENT figures (#PCDATA) >
<!ATTLIST figures type (sales | inventory | labor) #REQUIRED >
<!ATTLIST figures value CDATA #REQUIRED >
<!ELEMENT department (figures*) >
<!ATTLIST department name CDATA #REQUIRED >
<!ELEMENT corporate (department*) >
<!ATTLIST corporate name CDATA #REQUIRED >

Listing 4 shows a sampling of what the data might look like.

Listing 4. Sample input XML data
<?xml version="1.0" ?>
<corporate name="Big Biz">
<department name="North">
<figures type="sales" value="125000.00"/>
<figures type="inventory" value="90000.00"/>
<figures type="labor" value="110000.00">estimated</figures>
</department>
<department name="South">
<figures type="sales" value="980000.00"/>
<figures type="inventory" value="110000.00"/>
<figures type="labor" value="115000.00">estimated</figures>
</department>
<department name="East">
<figures type="sales" value="210000.00"/>
<figures type="inventory" value="80000.00"/>
<figures type="labor" value="95000.00">estimated</figures>
</department>
<department name="West">
<figures type="sales" value="160000.00"/>
<figures type="inventory" value="75000.00"/>
<figures type="labor" value="130000.00">estimated</figures>
</department>
<department name="Central">
<figures type="sales" value="723000.00"/>
<figures type="inventory" value="11000.00"/>
<figures type="labor" value="221000.00">estimated</figures>
</department>
</corporate>

SAX2 implementation

Listing 5 is a baseline SAX implementation. This isn't a complete program because it is missing the handler implementation, but it does show exactly what is needed to put the framework into place. The calls to XMLPlatformUtils::Initialize() and XMLPlatformUtils::Terminate() are very important. The library guards against applications that fail to initialize it properly by throwing an exception.
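One way to make sure the start-up and shutdown calls always come in pairs is to wrap them in a small guard object. The sketch below uses a hypothetical stand-in library (demo::initialize, demo::terminate, and the activeCount counter) so that it compiles on its own; with Xerces-C++, the constructor and destructor would call XMLPlatformUtils::Initialize() and XMLPlatformUtils::Terminate() instead. The LibrarySession class name is also an assumption.

```cpp
// Hypothetical stand-in for a library with explicit start-up and shutdown.
namespace demo {
    int activeCount = 0;  // number of outstanding initialize() calls
    void initialize() { ++activeCount; }
    void terminate() { --activeCount; }
}

// Guard object: initializes on construction and terminates on destruction,
// so shutdown still runs on an early return or an exception.
class LibrarySession {
public:
    LibrarySession() { demo::initialize(); }
    ~LibrarySession() { demo::terminate(); }

    // Copying would unbalance the calls, so it is disabled.
    LibrarySession(const LibrarySession&) = delete;
    LibrarySession& operator=(const LibrarySession&) = delete;
};
```

Declare the guard before any parser or handler objects in the same scope. C++ destroys local objects in reverse order of declaration, so everything that depends on the library is gone by the time the guard shuts it down.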

To make the program in Listing 5 a complete application, you need to add the event-handler class in Listing 6. SAX2 comes with a default event-handler class called DefaultHandler, defined in the C++ header file of the same name. The default handler does nothing -- it is just a stub implementation -- but it is complete, and so I'm using it here as a base class for the graphing event-handler class.

The file in Listing 7 is the actual implementation of the event-handler class in Listing 6. While the rest of the program is pretty much just boilerplate code to get the SAX2 parser running, the part in Listing 7 defines the application's personality.

Xerces-C++ uses XMLCh as a typedef'd character representation. On some platforms it is compatible with the C type wchar_t, which is usually two -- but sometimes four -- bytes wide. Because of that possibility, the docs discourage the practice of interchanging wchar_t and XMLCh. You can get away with it on some platforms, but it will break on others. Xerces-C++ uses this larger character representation to exchange text as UTF-16 as opposed to UTF-8 or ISO-8859. To debug this program, I'm using the XMLString::transcode function to convert the wide character strings for display on a console, as shown in Figure 1.
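To show what narrowing 16-bit text for a console involves, here is a rough sketch that does not call Xerces-C++ at all. The CodeUnit typedef is a stand-in for a 16-bit character type, and narrowForConsole and wideCharMatchesCodeUnit are hypothetical helpers. Unlike a real transcoder, this one handles only 7-bit ASCII and replaces every other code unit with a question mark.

```cpp
#include <string>

// Stand-in for a 16-bit code unit type; an assumption for this sketch only.
typedef char16_t CodeUnit;

// Narrow a null-terminated UTF-16 string for console display.
// Code units outside 7-bit ASCII are replaced with '?'.
std::string narrowForConsole(const CodeUnit* text) {
    std::string out;
    if (text == nullptr) return out;
    for (; *text != 0; ++text) {
        out += (*text < 0x80) ? static_cast<char>(*text) : '?';
    }
    return out;
}

// True only on platforms where wchar_t happens to have the same width.
// Code that casts between the two types silently depends on this.
bool wideCharMatchesCodeUnit() {
    return sizeof(wchar_t) == sizeof(CodeUnit);
}
```

The second function is the portability trap in miniature: its result differs from one platform to another, which is why treating the two types as interchangeable works in one build and fails in the next.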

Figure 1. Screen shot of SAX parser output

I discovered a problem using the Xerces internal string class on Microsoft Windows. The comments in XMLString.hpp require the caller of replicate and other similar functions to release the memory returned. The problem comes from linking your application against the Xerces-C++ library as a DLL. The strings are allocated from the DLL's local heap. If both your application and the XercesLib DLL use the exact same C runtime (CRT) library DLL, then all is well. If, however, your program uses the single-threaded CRT and XercesLib uses the multithreaded CRT DLL, problems happen. When your program attempts to release the string memory, the C runtime notices that the memory did not come from your application's local heap. For debug builds it throws an exception, but for release builds it may silently leak memory. The sample programs found in earlier versions of Xerces (like 1_5_1) avoided this by simply not releasing the memory.

My fix for this was to add a pair of static discard functions to the XMLString class. Because the string memory is released by code executing inside the DLL, the correct local heap is used, and no debug assertion results. I was pleased to see that Xerces developer Tinny Ng added this to the XMLString class and went a step further to null the string pointer (see Resources). The other nice feature of this is that programmers don't need to worry about how the implementation of XMLString allocates memory. Instead of guessing whether they should be using delete[] or free, they can just call XMLString::release. You can, of course, just make sure the CRT that your application expects is the same as the CRT used by the XercesLib DLL.
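The pattern itself is small enough to sketch in isolation. The demo namespace and its replicate and release functions below are hypothetical stand-ins, not the Xerces-C++ source. In a real DLL both functions would be exported and not inlined, so that allocation and deallocation run inside the same module and use the same heap.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical library-side functions. Imagine both compiled into one DLL.
namespace demo {

// Returns a heap copy that the caller must hand back to release().
char* replicate(const char* source) {
    if (source == nullptr) return nullptr;
    const std::size_t length = std::strlen(source) + 1;  // include the terminator
    char* copy = new char[length];
    std::memcpy(copy, source, length);
    return copy;
}

// Frees the buffer with the matching delete[] and nulls the caller's pointer,
// so a second call or a later use of the stale pointer is harmless.
void release(char** buffer) {
    if (buffer == nullptr) return;
    delete[] *buffer;
    *buffer = nullptr;
}

}  // namespace demo
```

Taking a pointer to the caller's pointer is what allows release to null it. The caller never needs to know whether the library allocated with new[], malloc, or something else.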


What's next?

Here in Part 1, you've seen how to link the Xerces-C++ XML library into applications written for Linux and Windows, and I've demonstrated parsing with the SAX API by creating a bar graph in ASCII art. In Part 2, I'll show you how to load, manipulate, or synthesize a DOM document, and create the same bar graph using Scalable Vector Graphics (SVG).


Download

Description    Name                 Size
Code sample    x-xercc/xml4c.zip    ---

Resources
