Thinking XML: Hacking XML Hacks

Observations on a handy book for XML users

Uche Ogbuji (uche@ogbuji.net), Principal Consultant, Fourthought, Inc.
Uche Ogbuji is a consultant and co-founder of Fourthought Inc., a software vendor and consultancy specializing in XML solutions for enterprise knowledge management. Fourthought develops 4Suite, an open source platform for XML, RDF, and knowledge-management applications. Mr. Ogbuji is also a lead developer of the Versa RDF query language. He is a computer engineer and writer born in Nigeria, living and working in Boulder, Colorado, USA. You can contact Mr. Ogbuji at uche@ogbuji.net.

Summary:  XML Hacks is a book of tips and tricks for XML users. This useful resource covers a wide variety of topics, but in some cases further expansion and alternatives to material covered could be even more helpful. In this article, Uche Ogbuji offers practical observations based on topics from the book.

Date:  14 Sep 2004
Level:  Intermediate

In my last column, I looked at Elliotte Rusty Harold's book Effective XML, an excellent volume for any XML professional. In this one, I turn my attention to another practical XML book, XML Hacks, edited by Michael Fitzgerald (O'Reilly and Associates, 2004). This book travels all over the landscape, offering some very introductory sections, some intermediate and advanced design and implementation techniques, some tips for using specific tools, and more. Readers of this column and my other articles for developerWorks might expect me to focus on issues of XML design and XML vocabularies. XML Hacks really dwells more on implementation details and the use of tools, but these are also important, and in this installment I'll cover some of my own practical observations as they apply to the themes in this book. As with the article on Effective XML, this is not a review of the book, but rather a set of observations inspired by the book, written even for readers who do not own it.

Including external text documents with XInclude

Hack #26, "Include External Documents with XInclude," shows how to use XInclude (see Resources) in a way that's very similar to XML's built-in external parsed entities. It features a sample document that inserts an external XML document specified by an HTTP URL. XInclude does add a few tricks to the mix, such as fallback support (providing alternative content to insert in case of error) and the ability to specify content negotiation as the processor makes HTTP requests. But I think that two of the most important advantages of XInclude over the parsed entity mechanism are:

  • The ability to use XPointer to select just a portion of the target document for inclusion
  • The ability to change the parsing mechanism to insert the external document as a fully-escaped text file rather than as an XML document

The second capability is very useful if you're preparing XML documents that contain code listings or examples. As an example, imagine that you're writing a document that uses the Python language code in Listing 1.


Listing 1. Example Python code to be inserted as a listing into an XML document
def game_show(contestant_guess, prices):
    if prices[contestant_guess] < 1000:
        print("you win!")
    else:
        print("you lose!")
  

You'd probably develop this code in a separate file so that you could test it and be sure it works as expected before putting it into your document. At first, you might just cut and paste the code as is into the XML file shown in Listing 2.


Listing 2. Document into which sample code is to be inserted by direct cut and paste
<?xml version='1.0' encoding='iso-8859-1'?>
<!DOCTYPE html PUBLIC
  "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xml:lang="en-US">
<head>
  <title>On-line game show programming in Python</title>
</head>
<body>
  <div class="section">
    <h3>A simple example</h3>
    <p>Examine the following code:
    </p>
    <div class="code-listing">
      <div class="caption">example 1</div>
<!-- paste Python code here -->
    </div>
  </div>
</body>
</html>
  

This would cause an error because of the line if prices[contestant_guess] < 1000:, which contains an unescaped less-than sign (<). You can manually escape this to &lt;, but this invites inconvenience and errors whenever you modify the code: you must change the external test file, and then modify and re-escape the copy pasted into your document. One solution is to prepare a CDATA section for the code block, as in Listing 3, and just paste into that so that you don't need further escaping.


Listing 3. Document into which sample code is to be inserted by cut and paste into CDATA section
<?xml version='1.0' encoding='iso-8859-1'?>
<!DOCTYPE html PUBLIC
  "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xml:lang="en-US">
<head>
  <title>On-line game show programming in Python</title>
</head>
<body>
  <div class="section">
    <h3>A simple example</h3>
    <p>Examine the following code:
    </p>
    <div class="code-listing">
      <div class="caption">example 1</div>
<![CDATA[
<!-- paste Python code here -->
]]>
    </div>
  </div>
</body>
</html>
  

This approach certainly reduces escaping errors, but you have to be aware of the presumably rare string "]]>", as exemplified in Listing 4.


Listing 4. Example Python code to be inserted as a listing into an XML document
def game_show(guesses, contestant, prices):
    if prices[guesses[contestant]]>1000:
        print("you win!")
    else:
        print("you lose!")
  

To properly escape this line within a CDATA section, you would need something at least as complex as: if prices[guesses[contestant]]]]><![CDATA[>1000:. Note as well that the examples so far use Python code, which usually contains relatively few characters that need escaping. If you're writing an article with XML examples, doing the escaping by hand may be overwhelming, and instances of the gotcha "]]>" string are much more likely (for example, if the XML listing itself has CDATA sections).
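To make the two manual options concrete, here is a minimal Python sketch of both: escaping the markup characters, and wrapping text in a CDATA section while splitting any embedded "]]>". The helper names escape_for_xml and wrap_in_cdata are hypothetical, chosen only for illustration; xml.sax.saxutils.escape and ElementTree are from the Python standard library.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET


def escape_for_xml(code):
    """Escape &, < and > so the text can be pasted into element content."""
    return escape(code)


def wrap_in_cdata(code):
    """Wrap text in a CDATA section, splitting any embedded ']]>'."""
    # Close the section after ']]' and reopen it before '>', so the
    # terminator never appears whole inside a single CDATA section.
    return "<![CDATA[" + code.replace("]]>", "]]]]><![CDATA[>") + "]]>"


line = "if prices[guesses[contestant]]>1000:"

escaped = escape_for_xml(line)
wrapped = wrap_in_cdata(line)

# Either form, placed inside an element, parses back to the original line.
for fragment in (escaped, wrapped):
    element = ET.fromstring("<pre>" + fragment + "</pre>")
    assert element.text == line
```

Both helpers work, but each must be rerun every time the source file changes, which is exactly the chore the next technique removes.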

You can certainly work your way around these obstacles, but I have found that the easiest way to handle code inclusions in articles is to use XInclude's parse="text" capability. When you add this attribute to an xi:include element, the included resource is treated as plain character data rather than parsed as XML, and is thus automatically escaped upon inclusion. Listing 5 is an example of a document that uses XInclude in this way:


Listing 5. Document into which sample code is to be inserted using textual XInclude
<?xml version='1.0' encoding='iso-8859-1'?>
<!DOCTYPE html PUBLIC
  "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xml:lang="en-US"
      xmlns:xi="http://www.w3.org/2001/XInclude"
>
<head>
  <title>On-line game show programming in Python</title>
</head>
<body>
  <div class="section">
    <h3>A simple example</h3>
    <p>Examine the following code:
    </p>
    <div class="code-listing">
      <div class="caption">example 1</div>
<xi:include href="gameshow1.py" parse="text" encoding="iso-8859-1"/>
    </div>
  </div>
</body>
</html>
  

The xi:include element is replaced with the fully escaped content of gameshow1.py (for example Listing 1), which is resolved relative to the base URI of the element. The escaping is automated thanks to parse="text". I always use the encoding attribute (which, incidentally, is ignored if you use parse="xml"). In my own usage, the encoding is usually "iso-8859-1" for Python files and "utf-8" for XML files, although encodings might be different in your environment.
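If you want to watch the escaping happen, one option is the ElementInclude module in Python's standard library, which handles parse="text" for the 2001 namespace. The following is a sketch only: the in-memory FILES table and the memory_loader function are hypothetical stand-ins for real files, and the document is a trimmed fragment rather than the full XHTML of Listing 5.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

# Hypothetical stand-in for files on disk, keyed by href.
# The content is just the first two lines of Listing 1.
FILES = {
    "gameshow1.py": (
        "def game_show(contestant_guess, prices):\n"
        "    if prices[contestant_guess] < 1000:\n"
    ),
}


def memory_loader(href, parse, encoding=None):
    """Return the text for href; this sketch handles only parse='text'."""
    if parse != "text":
        raise ValueError("this sketch only supports parse='text'")
    return FILES[href]


DOCUMENT = """\
<div xmlns:xi="http://www.w3.org/2001/XInclude" class="code-listing">
<xi:include href="gameshow1.py" parse="text" encoding="iso-8859-1"/>
</div>"""

root = ET.fromstring(DOCUMENT)
# Replace each xi:include child with the loaded text, in place.
ElementInclude.include(root, loader=memory_loader)

# Serializing the tree escapes the '<' from the Python source.
result = ET.tostring(root, encoding="unicode")
print(result)
```

In the parsed tree the included text holds a literal less-than sign; it is only on serialization that it becomes an entity reference, so nothing in the source file ever needs hand editing.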

Parsed text XInclude is the technique I use in preparing these very articles on developerWorks (which commendably requires authors to send article drafts in a fairly well-designed XML format), and I find it helps my authoring productivity immensely.

One additional note and warning: The book uses the XInclude namespace that was current at the time it was written -- http://www.w3.org/2003/XInclude -- but this is no longer current. The W3C working group reverted to the original namespace, http://www.w3.org/2001/XInclude, in the 13 April 2004 Candidate Recommendation. Most tools I know of support only the latter (2001) namespace form, which may be why the W3C decided to go back to it, but they did sow a bit of confusion with this namespace change and reversal. The book's authors are innocent victims of this, and I have submitted an erratum on this point to the publisher.
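If you have documents that were written against the 2003 namespace, a rough way to bring them in line is a plain string substitution, sketched here in Python. The function name update_xinclude_namespace is hypothetical, and the approach assumes the old URI appears only in namespace declarations; it would also rewrite any mention of the URI in ordinary text.

```python
OLD_XINCLUDE_NS = "http://www.w3.org/2003/XInclude"
CURRENT_XINCLUDE_NS = "http://www.w3.org/2001/XInclude"


def update_xinclude_namespace(xml_text):
    """Replace the 2003 XInclude namespace URI with the 2001 one."""
    return xml_text.replace(OLD_XINCLUDE_NS, CURRENT_XINCLUDE_NS)


sample = (
    '<doc xmlns:xi="http://www.w3.org/2003/XInclude">'
    '<xi:include href="chapter1.xml"/>'
    "</doc>"
)
updated = update_xinclude_namespace(sample)
print(updated)
```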


Simpler identity transforms

Hack #37, "Generate an XSLT Identity Stylesheet with Relaxer," discusses a rather elaborate means of generating an identity transform, an XSLT transform that outputs XML that is equivalent to the source document. The method discussed results in a transform that has a template for each element in the vocabulary. I think the intention is to provide boilerplate that one can use to create more specialized transforms, but the hack does not make clear that a far simpler identity transform exists, one that is even given as an example in the XSLT specification. This simpler identity transform is discussed in the following entry (#38), "Pretty-Print XML Using a Generic Identity Stylesheet and Xalan," which includes the well-known twist of adding <xsl:output method="xml" indent="yes"/> in order to have the output pretty-printed. I recommend that you read entry #38 and become very familiar with the simple identity transform before worrying about the more complex approach in Hack #37. As a bonus, understanding the simple identity transform is key to proficiency with several XSLT idioms, including xsl:copy-of for copying source nodes to output, and the nuances of the common XPath node tests *, @*, and node().
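For reference, the simple identity transform is short enough to keep at hand. The sketch below holds it in a Python string and uses only the standard library to confirm that it is well-formed; the template follows the example in the XSLT 1.0 specification, with the xsl:output line added for pretty-printing. Actually applying the stylesheet requires an XSLT processor, which this sketch does not assume you have installed.

```python
import xml.etree.ElementTree as ET

XSL_NS = "http://www.w3.org/1999/XSL/Transform"

# One template matches every attribute and node, copies it, and
# recurses into its attributes and children.
IDENTITY_XSLT = """\
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
"""

# Parsing checks well-formedness only; it does not run the transform.
stylesheet = ET.fromstring(IDENTITY_XSLT.encode("utf-8"))
template = stylesheet.find("{%s}template" % XSL_NS)
print(template.get("match"))
```

To specialize the transform, you add more specific templates alongside this one; anything they do not match still falls through to the identity template and is copied unchanged.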


Generating multiple output documents without bothering with XSLT 2.0

Item #45, "Generate Multiple Output Documents with XSLT 2.0," discusses use of the xsl:result-document element in XSLT 2.0 to serialize more than one result tree within a transform. This is all very well, except that it ends with the comment:

If you're still using XSLT 1.0, you can probably produce multiple result documents, but it would be through extension features that vary from processor to processor.

This is not quite true, thanks to EXSLT (see Resources), a collection of standard extensions for XSLT 1.0 processors. EXSLT provides the exsl:document extension, which is supported by several of the more popular XSLT processors. It is also a bit simpler, and in my opinion more elegant, than the equivalent mechanism in XSLT 2.0. (It is based on a form of this feature in earlier drafts of the XSLT 2.0 Working Draft.) I shall not rework the very long example in the book (#3-28) to use exsl:document. Instead, in Listing 6, I present a simpler example: a transform for writing each paragraph element in an XHTML source document to a new result document.


Listing 6. Transform for writing each paragraph element in an XHTML document to a new result document
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:exsl="http://exslt.org/common"
  xmlns:html="http://www.w3.org/1999/xhtml"
  extension-element-prefixes="exsl">

  <xsl:template match="html:p">
    <exsl:document href="para-{generate-id()}.xml" method="xml" indent="yes">
        <xsl:copy-of select="."/>
    </exsl:document>
  </xsl:template>

</xsl:stylesheet>
  

The exsl:document element is the key. It instructs the processor to prepare a new output tree, typically by creating a new file. The href attribute is the resource name, resolved according to the base URI of the extension element. I use the generate-id() function to ensure that each created file has a unique name. method="xml" and indent="yes" are just the regular attributes defined for xsl:output, any of which can be used with this extension element.
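To see the effect of Listing 6 without an EXSLT-capable processor at hand, here is a rough Python analogue of what the transform does, using only the standard library. It is an illustration, not a substitute: the function name split_paragraphs is hypothetical, files are numbered with a simple counter rather than generate-id(), and the output is not indented.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def split_paragraphs(xhtml_text, out_dir):
    """Write each XHTML p element to its own file; return the paths."""
    # Serialize XHTML elements in the default namespace, without a prefix.
    ET.register_namespace("", XHTML_NS)
    root = ET.fromstring(xhtml_text)
    paths = []
    for number, para in enumerate(root.iter("{%s}p" % XHTML_NS), start=1):
        # Drop text that followed the element in the source, so that it
        # is not written after the root of the new document.
        para.tail = None
        path = os.path.join(out_dir, "para-%d.xml" % number)
        ET.ElementTree(para).write(path, encoding="utf-8", xml_declaration=True)
        paths.append(path)
    return paths


SAMPLE = """\
<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </body>
</html>"""

with tempfile.TemporaryDirectory() as out_dir:
    written = split_paragraphs(SAMPLE, out_dir)
    names = [os.path.basename(path) for path in written]
    first = ET.parse(written[0]).getroot()

print(names)
```

The comparison also shows what exsl:document buys you: the XSLT version needs no file handling or naming logic beyond the href attribute.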

The book does introduce EXSLT in item #58, "Use EXSLT Extensions", but much too late to avoid giving the reader an impression that the only portable option they have for multiple result documents is a move to XSLT 2.0. I advise you to stick with XSLT 1.0 as long as you can because so far I have found XPath 2.0 (which is required for XSLT 2.0) to be unnecessarily complex. EXSLT provides XSLT 1.0 users with almost all of the capabilities that are useful in XSLT 2.0, and more.


Wrap-up

XML Hacks is a handy collection of tips and tricks. It does seem to show a strong and unnecessary bias towards certain tools, and the writing is rather uneven in places (not uncommon for an ensemble book), so it can be hard to figure out the substance of some of the items. In this article, I've tried to provide some notes to augment particularly important points that I think were not clear enough in the book. I'll continue with a few more observations in my next column.


Resources
