Serialize XML data

Saving XML data using DOMWriter in XML for the C++ parser

Tinny Ng (tng@ca.ibm.com), System House Business Scenarios Designer, IBM Toronto Laboratory
Tinny Ng is an Advisory Software Developer currently working as a Business Scenario Solution Designer in the WebSphere System House at the IBM Toronto Lab. Previously she was the team lead of the XML for C++ parser development team, and has led the team to deliver nine Apache Xerces-C++ releases and seven IBM XML4C releases within two years. Tinny also led the architectural design work for the C++ XML parser, including redesigning the DOM implementation in the parser for faster performance, and defining a new DOM C++ binding which was then referenced by the W3C. She is an active Committer to the Apache XML Open Source Project, Xerces-C++.

Summary:  IBM developer Tinny Ng shows you how to serialize XML data to a DOMString with different encodings. You'll also find examples that demonstrate how to use the MemBufFormatTarget, StdOutFormatTarget, and LocalFileFormatTarget output streams in XML4C/Xerces-C++.

Date:  15 Jul 2003
Level:  Intermediate

Xerces-C++ is an XML parser written in C++ and distributed by the open source Apache XML project. Since early last year, Xerces-C++ has added an experimental implementation of a subset of the W3C Document Object Model (DOM) Level 3 as specified in the DOM Level 3 Core Specification and the DOM Level 3 Load and Save Specification (see Resources).

The DOM Level 3 Load and Save Specification defines a set of interfaces that allow users to load and save XML content from different input sources to different output streams. This article uses examples to show you how to save XML data in this way. Users can stream the output data into a string, an internal buffer, the standard output, or a file. In the following sections, I will show you how to serialize XML data to a DOMString with different encodings, and also how to use MemBufFormatTarget, StdOutFormatTarget, and LocalFileFormatTarget in Xerces-C++.

Note: IBM XML for C++ (XML4C) integrates Xerces-C++ with International Components for Unicode (ICU) to provide support for over 100 different encodings. In this document, I will use Xerces-C++ to represent the XML parser for C++. However, the behavior described should apply to both XML4C and Xerces-C++, unless otherwise specified.

Serializing XML data

The DOMBuilder class provides an API for parsing XML documents and building the corresponding DOM document tree, while the DOMWriter class provides an API for serializing (writing) a DOM document out as an XML document. To serialize XML data, first load the XML data into a DOM tree using a DOMBuilder and then use a DOMWriter to write out the DOM tree. For example:


Listing 1. Serializing XML data
// DOMImplementationLS contains factory methods for creating objects
// that implement the DOMBuilder and the DOMWriter interfaces
static const XMLCh gLS[] = { chLatin_L, chLatin_S, chNull };
DOMImplementation *impl = 
     DOMImplementationRegistry::getDOMImplementation(gLS);

// construct the DOMBuilder
DOMBuilder* myParser = ((DOMImplementationLS*)impl)->
          createDOMBuilder(DOMImplementationLS::MODE_SYNCHRONOUS, 0);

// parse the XML data, assume it is saved in a local file 
// called "theXMLFile.xml"
// the DOMBuilder will parse the data and return it as a DOM tree
DOMNode* aDOMNode = myParser->parseURI("theXMLFile.xml");

// construct the DOMWriter
DOMWriter* myWriter = ((DOMImplementationLS*)impl)->createDOMWriter();

// optionally, set some DOMWriter features
// set the format-pretty-print feature
if (myWriter->canSetFeature(XMLUni::fgDOMWRTFormatPrettyPrint, true))
    myWriter->setFeature(XMLUni::fgDOMWRTFormatPrettyPrint, true);

// set the byte-order-mark feature      
if (myWriter->canSetFeature(XMLUni::fgDOMWRTBOM, true))
    myWriter->setFeature(XMLUni::fgDOMWRTBOM, true);

// serialize the DOMNode to a UTF-16 string
XMLCh* theXMLString_Unicode = myWriter->writeToString(*aDOMNode);

// release the memory
XMLString::release(&theXMLString_Unicode); 
myWriter->release();
myParser->release();

Both DOMBuilder and DOMWriter are constructed using factory methods from DOMImplementationLS. When you are finished with them, both must be released explicitly to relinquish any associated resources. The string returned by writeToString is owned by the caller, who is responsible for releasing the allocated memory.
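Because these objects are handed back with release() rather than delete, it is easy to miss a release on an early return or an exception. One way to guard against that is a smart pointer with a custom deleter. The following is a minimal sketch in standard C++ only: FakeReleasable is a hypothetical stand-in for a Xerces-C++ object, and ReleaseDeleter and ReleasePtr are names invented for this illustration, not part of the Xerces-C++ API.

```cpp
#include <memory>

// Hypothetical stand-in for an object that must be given back with
// release() instead of delete (as DOMWriter and DOMBuilder are).
class FakeReleasable {
public:
    explicit FakeReleasable(int* counter) : counter_(counter) {}
    // Records the call, then frees the object.
    void release() { ++*counter_; delete this; }
private:
    ~FakeReleasable() {}   // private: callers cannot use delete directly
    int* counter_;
};

// Deleter that calls release() on the owned pointer.
struct ReleaseDeleter {
    template <typename T>
    void operator()(T* p) const { p->release(); }
};

// Owning pointer that releases its object when it goes out of scope.
template <typename T>
using ReleasePtr = std::unique_ptr<T, ReleaseDeleter>;

// Creates one scoped object and returns how many times release() ran.
int releaseCallsAfterScope() {
    int calls = 0;
    {
        ReleasePtr<FakeReleasable> writer(new FakeReleasable(&calls));
        // ... use writer-> here ...
    }   // release() runs here, also when the scope is left by an exception
    return calls;
}
```

Under the assumption that the real classes are used the same way, the deleter could wrap the pointers returned by createDOMWriter() and createDOMBuilder(), since both expose release() as shown in Listing 1.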

You can also opt to set some features that control the behavior of the DOMWriter. Xerces-C++ has implemented a number of DOMWriter features that are specified in the W3C DOM Level 3 Load and Save Specification. A complete list can be found in the Xerces-C++ programming guide, DOMWriter Supported Features (see Resources). A couple of them are worth highlighting:

  1. format-pretty-print -- This formats the output by adding newlines and indentation to produce a pretty-printed, human-readable form. The exact form of the transformations is not specified in the W3C DOM Level 3 Load and Save Specification, so the parser applies its own interpretation. In releases prior to Xerces-C++ 2.2 (or XML4C 5.1), the parser pretty-prints only the prologue and the epilogue and does not touch the content within the root element. From Xerces-C++ 2.2 (or XML4C 5.1) onwards, turning on this feature also formats the content within the root element.
  2. byte-order-mark -- This is a non-standard extension added in Xerces-C++ 2.2 (or XML4C 5.1) that enables writing a Byte-Order-Mark (BOM) to the resultant XML stream. The BOM is written at the beginning of the stream if and only if a DOMDocument node is being serialized and the output encoding is one of the following:
    • UTF-16
    • UTF-16LE
    • UTF-16BE
    • UCS-4
    • UCS-4LE
    • UCS-4BE
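
For reference, the byte sequences behind these six names can be sketched as a small lookup. This is an illustration in standard C++, not code taken from the parser: bomFor is a hypothetical helper, the names are matched exactly as spelled in the list above, and using the host byte order for the plain "UTF-16" and "UCS-4" names is an assumption of this sketch.

```cpp
#include <string>
#include <vector>

// Returns the BOM bytes for an encoding name, or an empty vector when the
// name is not one of the six encodings listed above.
// hostIsBigEndian picks the byte order for "UTF-16" and "UCS-4", which do
// not name an endianness themselves (an assumption of this sketch).
std::vector<unsigned char> bomFor(const std::string& encoding,
                                  bool hostIsBigEndian) {
    const std::vector<unsigned char> utf16le = {0xFF, 0xFE};
    const std::vector<unsigned char> utf16be = {0xFE, 0xFF};
    const std::vector<unsigned char> ucs4le  = {0xFF, 0xFE, 0x00, 0x00};
    const std::vector<unsigned char> ucs4be  = {0x00, 0x00, 0xFE, 0xFF};

    if (encoding == "UTF-16LE") return utf16le;
    if (encoding == "UTF-16BE") return utf16be;
    if (encoding == "UCS-4LE")  return ucs4le;
    if (encoding == "UCS-4BE")  return ucs4be;
    if (encoding == "UTF-16")   return hostIsBigEndian ? utf16be : utf16le;
    if (encoding == "UCS-4")    return hostIsBigEndian ? ucs4be : ucs4le;
    return {};   // any other encoding: no BOM in this sketch
}
```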

Output streams supported by Xerces-C++

DOMWriter provides an API for writing a DOM node into various types of output streams. Xerces-C++ supports four types of output streams:

  • DOMString
  • MemBufFormatTarget
  • StdOutFormatTarget
  • LocalFileFormatTarget

DOMString

Users can serialize a DOMNode into a DOMString (that is, XMLCh* in Xerces-C++) using the DOMWriter method writeToString. This method ignores all available encoding information, and the returned string is always encoded in UTF-16. As mentioned above, the string returned from writeToString is owned by the caller, who is responsible for releasing the allocated memory. For example:


Listing 2. Serializing a DOMNode to a UTF-16 string
// construct the DOMWriter
DOMWriter* myWriter = ((DOMImplementationLS*)impl)->createDOMWriter();

// serialize a DOMNode to a UTF-16 string
XMLCh* theXMLString_Unicode = myWriter->writeToString(*aDOMNode);

// release the memory
XMLString::release(&theXMLString_Unicode); 
myWriter->release(); 

If you would like to receive a string encoded in something other than UTF-16, you can transcode the string manually using an XMLTranscoder. Construct your XMLTranscoder for a specific encoding using XMLPlatformUtils::fgTransService->makeNewTranscoderFor, then call transcodeTo to transcode the UTF-16 string to your specified encoding. For example:


Listing 3. Serializing a DOMNode to a Big5 string
// construct the DOMWriter
DOMWriter* myWriter = ((DOMImplementationLS*)impl)->createDOMWriter();

// serialize a DOMNode to a UTF-16 string
XMLCh* theXMLString_Unicode = myWriter->writeToString(*aDOMNode);

// construct a transcoder in Big5
XMLTransService::Codes resCode;
XMLTranscoder* aBig5Transcoder =  XMLPlatformUtils::fgTransService->
     makeNewTranscoderFor("Big5", resCode, 16*1024, 
          XMLPlatformUtils::fgMemoryManager);

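// transcode the string into Big5
// transcodeTo returns the number of bytes written; charsEaten reports
// how many source characters were consumed
unsigned int charsEaten = 0;
char resultXMLString_Encoded[16*1024+4];
unsigned int bytesWritten =
     aBig5Transcoder->transcodeTo(theXMLString_Unicode,
                             XMLString::stringLen(theXMLString_Unicode),
                             (XMLByte*) resultXMLString_Encoded,
                             16*1024,
                             charsEaten,
                             XMLTranscoder::UnRep_Throw );

// transcodeTo does not null-terminate the output, so do it here
resultXMLString_Encoded[bytesWritten] = 0;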

// release the memory
XMLString::release(&theXMLString_Unicode); 
delete aBig5Transcoder;
myWriter->release(); 

This assumes that the underlying transcoder that is integrated with the parser supports the encoding you've specified. Xerces-C++ has intrinsic support for ASCII, UTF-8, UTF-16 (big/little endian), UCS-4 (big/little endian), EBCDIC code pages IBM037 and IBM1140, ISO-8859-1 (also known as Latin1), and Windows-1252. If you need support for more encodings -- say Shift-JIS or Big5 -- you may wish to use XML4C, which integrates the Xerces-C++ parser with IBM's International Components for Unicode (ICU) and extends the support to over 100 different encodings.
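
To make concrete what a transcoder does with the UTF-16 string, here is a minimal sketch of a UTF-16 to UTF-8 conversion in standard C++. It is not the Xerces-C++ implementation, and utf16ToUtf8 is a hypothetical helper. UTF-8 is used because it can be computed arithmetically; an encoding such as Big5 needs mapping tables, which is why it relies on a library like ICU.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Converts a UTF-16 string to UTF-8. Throws std::invalid_argument on an
// unpaired surrogate, loosely comparable to asking a transcoder to throw
// on input it cannot represent.
std::string utf16ToUtf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {            // high surrogate
            if (i + 1 >= in.size())
                throw std::invalid_argument("truncated surrogate pair");
            const std::uint32_t low = in[i + 1];
            if (low < 0xDC00 || low > 0xDFFF)
                throw std::invalid_argument("unpaired high surrogate");
            cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
            ++i;                                       // consumed two units
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            throw std::invalid_argument("unpaired low surrogate");
        }
        if (cp < 0x80) {                               // 1 byte
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {                       // 2 bytes
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {                     // 3 bytes
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                                       // 4 bytes
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```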

However, the XMLTranscoder does not alter the encoding information stored in the XML declaration of the string generated by writeToString. Thus the encoding attribute of the manually transcoded XML string still reads "UTF-16" instead of "Big5". This is misleading if you are serializing the entire DOMDocument node, because the XML declaration, including its encoding information, is part of the output.

To obtain a complete document in an encoding other than UTF-16, it is recommended that you use MemBufFormatTarget instead.
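
If you do transcode manually, a quick check of the XML declaration can reveal the mismatch described above. The following is a minimal sketch in standard C++, assuming the transcoded bytes are ASCII-compatible (as Big5 is for the characters of the declaration) and carry no BOM. declaredEncoding and declarationMatches are hypothetical helpers, and the comparison is deliberately simplified to an exact, case-sensitive match.

```cpp
#include <cstddef>
#include <string>

// Returns the value of the encoding attribute in the XML declaration at
// the start of xml, or an empty string if there is no declaration or no
// encoding attribute.
std::string declaredEncoding(const std::string& xml) {
    if (xml.compare(0, 5, "<?xml") != 0) return "";
    const std::size_t end = xml.find("?>");
    if (end == std::string::npos) return "";
    const std::string decl = xml.substr(0, end);

    const std::size_t key = decl.find("encoding");
    if (key == std::string::npos) return "";
    const std::size_t open = decl.find_first_of("\"'", key);
    if (open == std::string::npos) return "";
    const std::size_t close = decl.find(decl[open], open + 1);
    if (close == std::string::npos) return "";
    return decl.substr(open + 1, close - open - 1);
}

// True when the declaration names the encoding the bytes are really in.
bool declarationMatches(const std::string& xml,
                        const std::string& actualEncoding) {
    return declaredEncoding(xml) == actualEncoding;
}
```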

MemBufFormatTarget

MemBufFormatTarget saves the XML data to an internal buffer. The buffer is initialized with a capacity of 1023 bytes upon construction and grows as needed. The target returns a null-terminated XMLByte stream on request through the getRawBuffer() method. Users should make their own copy of the returned buffer if they intend to keep it independent of the state of the MemBufFormatTarget; otherwise, the buffer is deleted when the MemBufFormatTarget is destroyed, or reset when the reset() function is called.
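
The ownership rule can be illustrated without the parser. In the minimal sketch below, SketchMemTarget is a hypothetical stand-in that only imitates the behavior described above (a growing, null-terminated byte buffer that is cleared by reset()); it is not the Xerces-C++ class, and copyOut is an invented helper showing the copy the caller should make.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a memory-buffer target: owns a growing,
// null-terminated byte buffer.
class SketchMemTarget {
public:
    explicit SketchMemTarget(std::size_t initialCapacity = 1023) {
        buf_.reserve(initialCapacity + 1);
        buf_.push_back(0);                  // empty, but null-terminated
    }
    void writeChars(const unsigned char* data, std::size_t count) {
        buf_.pop_back();                    // drop the old terminator
        buf_.insert(buf_.end(), data, data + count);  // grows as needed
        buf_.push_back(0);                  // restore the terminator
    }
    // Pointer into the internal buffer: invalid after reset(), a later
    // write, or destruction of the target.
    const unsigned char* getRawBuffer() const { return buf_.data(); }
    std::size_t getLen() const { return buf_.size() - 1; }
    void reset() { buf_.clear(); buf_.push_back(0); }
private:
    std::vector<unsigned char> buf_;
};

// Makes an independent copy of the buffer contents.
std::string copyOut(const SketchMemTarget& target) {
    return std::string(
        reinterpret_cast<const char*>(target.getRawBuffer()),
        target.getLen());
}
```

The copy returned by copyOut stays valid after the target is reset or destroyed, which is the property the caller needs when the serialized data must outlive the target.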

The encoding of the returned XMLByte stream is determined in the following order:

  1. The encoding setting in the DOMWriter is used
  2. If that is null, then the encoding attribute of the DOM stream to be written is used
  3. If neither of the above provides an encoding name, then a default encoding of UTF-8 is used

The DOMWriter will store the correct encoding information -- which matches the actual encoding of the string -- in the encoding attribute of the XML declaration.
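
The lookup order above can be sketched as a small function. This is an illustration in standard C++ rather than parser code: resolveEncoding is a hypothetical helper, and an empty string stands in for a null encoding name.

```cpp
#include <string>

// Picks the output encoding in the order described above:
// 1. the encoding set on the writer,
// 2. the encoding recorded for the document being written,
// 3. UTF-8 as the default.
// An empty string means "not set" in this sketch.
std::string resolveEncoding(const std::string& writerEncoding,
                            const std::string& documentEncoding) {
    if (!writerEncoding.empty()) return writerEncoding;
    if (!documentEncoding.empty()) return documentEncoding;
    return "UTF-8";
}
```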

Listing 4 illustrates how to receive an XML string encoded in Big5:


Listing 4. Using MemBufFormatTarget
// construct the DOMWriter
DOMWriter* myWriter = ((DOMImplementationLS*)impl)->createDOMWriter();

// construct the MemBufFormatTarget
XMLFormatTarget *myFormatTarget = new MemBufFormatTarget();

// set the encoding to be Big5
XMLCh tempStr[100];
XMLString::transcode("Big5", tempStr, 99);
myWriter->setEncoding(tempStr);

// serialize a DOMNode to an internal memory buffer
myWriter->writeNode(myFormatTarget, *aDOMNode);

// get the string, which is encoded in Big5, from the MemBufFormatTarget
// the buffer is owned by the target: use or copy it before the target is deleted
const char* theXMLString_Encoded = (const char*) 
     ((MemBufFormatTarget*)myFormatTarget)->getRawBuffer();

// release the memory
myWriter->release();
delete myFormatTarget;

Again, this depends on the underlying transcoding capability supported by the parser. If the encoding you've specified is not supported, DOMWriter issues a fatal error.

Besides serializing the XML data to an internal buffer, there are two other types of output streams -- StdOutFormatTarget and LocalFileFormatTarget.

StdOutFormatTarget

StdOutFormatTarget saves the XML data to the standard output. For example:


Listing 5. Using StdOutFormatTarget
// construct the DOMWriter
DOMWriter* myWriter = ((DOMImplementationLS*)impl)->createDOMWriter();

// construct the StdOutFormatTarget
XMLFormatTarget *myFormatTarget = new StdOutFormatTarget();

// serialize a DOMNode to the standard output
myWriter->writeNode(myFormatTarget, *aDOMNode);

// release the memory
myWriter->release();
delete myFormatTarget;

LocalFileFormatTarget

LocalFileFormatTarget saves the XML data to a physical local file. Users need to pass a local file name as a parameter when constructing the LocalFileFormatTarget. For example:


Listing 6. Using LocalFileFormatTarget
// construct the DOMWriter
DOMWriter* myWriter = ((DOMImplementationLS*)impl)->createDOMWriter();

// construct the LocalFileFormatTarget
XMLFormatTarget *myFormatTarget = new LocalFileFormatTarget("myXMLFile.xml");

// serialize a DOMNode to the local file "myXMLFile.xml"
myWriter->writeNode(myFormatTarget, *aDOMNode);

// optionally, you can flush the buffer to ensure all contents are written
myFormatTarget->flush();

// release the memory
myWriter->release();
delete myFormatTarget; 

The file is created automatically if it doesn't already exist. Optionally, you can call flush() on the target before doing any other I/O on the file, to ensure that all buffered content has been written out.
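
The reason for flushing can be seen with the standard library alone. The sketch below uses std::ofstream as an analogy for a buffered file target; it does not use LocalFileFormatTarget, and writeFlushAndReadBack is a hypothetical helper. The file is read back while the writer is still open, which is the situation where unflushed content could be missing.

```cpp
#include <cstdio>
#include <fstream>
#include <iterator>
#include <string>

// Writes text to path, flushes, and reads the file back while the
// writer is still open. Returns what the reader saw.
std::string writeFlushAndReadBack(const std::string& path,
                                  const std::string& text) {
    // Opening for output creates the file if it does not exist
    // (and truncates it if it does).
    std::ofstream out(path.c_str(), std::ios::binary);
    out << text;
    out.flush();   // push buffered bytes to the file before other I/O

    std::ifstream in(path.c_str(), std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```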


Conclusion

You should now have a good understanding of how to serialize XML data into different types of output streams with different encodings. Again, for more details please refer to the W3C DOM Level 3 Load and Save Specification and the complete API documentation in Xerces-C++.


Resources
