XML Matters: Transforming DocBook documents using XSLT

David Mertz, Ph.D. (mertz@gnosis.cx), Archivist, Gnosis Software, Inc.
David Mertz became disenchanted with the academy and became a technical journalist: post hoc ergo propter hoc. You can contact David at mertz@gnosis.cx; his life is pored over at http://gnosis.cx/dW/. Suggestions and recommendations on this, past, or future columns are welcomed.

Summary:  Using a DocBook example, David Mertz shows how to convert an XML document to HTML through XSLT (Extensible Stylesheet Language Transformation). Along the way, your intrepid columnist discusses four alternative approaches for transforming XML documents and shares what he experienced in experimenting with some open source tools. Sample code includes fragments of XSLT documents, valid HTML outputter code in XSLT for a simple DocBook chapter, and a brief XSLT looping example.

Date:  01 Nov 2000
Level:  Introductory

Welcome to the world of XML transformations! I'm afraid that you're in for a rocky ride: Standards are coalescing and undergoing revision, tools are immature and often buggy, implementations are inconsistent, and choices are just plain confusing. But don't panic. I can lead you through at least one path out of the labyrinth. And things will inevitably get better with time, albeit always more slowly than we would like.

First things first

The last two "XML Matters" columns described my project for converting my academic writings to XML, specifically to the DocBook DTD. Those articles provide a good starting point for writing your own DocBook documents, and that's where this column picks up.

For this column, let's assume that you have some nicely structured, well-formed, and valid DocBook XML documents lying around. It's nice to have them in the first place, but the next step is to transform them into more conventional end-user formats: things like HTML pages, PDF files, and printed pages (the things readers actually read). This is exactly the problem I faced after converting a portion of my archival writing to DocBook, and this article presents my own solution.

My main goal -- at least for now -- is a good transformation to HTML. But I don't want to be limited to HTML output. I also have a few smaller goals. I'd like to have some control over the precise output without doing a lot of work and without having to learn a lot of new languages and techniques. I'd also like to use tools that are free (as in speech, and as in beer), and tools that are cross platform. Finally, I'd like to minimize dependencies. A large number of complex dependencies is a disadvantage, even if all the needed components are free and cross platform. Basically, my ideal is a standalone executable that just runs, runs reliably, and converts my DocBook documents to HTML in just the style I want. Lofty dreams, but why not?


Approaches to transformations

There are at least four possible approaches for transforming a DocBook document -- or most any XML document -- into end-user formats. I seriously considered all four approaches for my own little project. This column discusses only the last option in detail, but all are worth keeping in mind as you plan a project that involves repeated transformations:

  • Write custom transformation code. It would be nice to start with a programming language that has some libraries for basic XML methods like SAX and DOM. But even assuming the basic parsing is a black box, custom code can do whatever you want with the parsed elements. Ultimately, this is the most flexible and powerful approach, but it is also likely to take more work, both up front and in maintenance.
  • Use Cascading Stylesheets with our DocBook document. It's a thought. It would be nice to keep the typographic specifications completely separate from the structural markup and simply have the client device (e.g., browser) render things nicely. That might yet happen, but as of right now there seems to be only limited support -- only in IE 5.5, Opera 4, and some of the latest Mozilla developer releases. CSS rendering of XML does not yet seem to be at the point where one can count on it working for an end user.
  • Use Document Style Semantics and Specification Language (DSSSL) to specify transformations into target formats. On the plus side, a number of DSSSL style sheets already exist for DocBook (and for other formats). On the minus side, DSSSL is basically a whole new programming language to learn, and it's a functional Lisp-like language to boot. In order to utilize DSSSL, you need to start with Jade or OpenJade, but both tools are complex enough that many people have written wrappers for them (such as SGML-tools Lite). In order to get a working system -- albeit by reports a very nicely working system -- you really need to satisfy all sorts of system dependencies and install all sorts of tools and libraries. In some well-intentioned, although perhaps not sufficiently dedicated, attempts, I didn't manage to get Jade-related tools functioning smoothly on my system. Obviously, a lot of other folks use these systems every day, so a little more work would surely have put things in order. (If you can point me to a quick, simple all-in-one DSSSL processor, let me know. I'd love to try it.) Even more than the setup difficulties, however, DSSSL simply feels like it comes out of different traditions and ways of thinking than do XML techniques. By contrast, the final approach is basically pure XML and comes out of official (working) specifications of the W3C.
  • Use eXtensible Stylesheet Language Transformations (XSLT). In one sense, XSLT is actually a specification for a class of XML documents. That is, an XSLT style sheet is itself a well-formed XML document with some specialized contents that let you "templatize" the output format you are looking for (stay tuned for what this means). A large number of tools at least nominally support XSLT. My hunch is that this really is the direction technologies are going in for XML transforms -- either because of, or in spite of, its "official" status with the W3C. XSLT can specify transforms to any target format. But the general feeling I've picked up is that most developers find it easiest to work with XSLT when the target format is another XML format, such as XHTML.

Choosing an XSLT tool

The Resources section contains a link to descriptions of a number of XSLT tools. I tried many of them, but found Sablotron most to my taste. It is free software (GNU). It is multiplatform. It has a standalone executable that is simple to run from the command line. And most importantly, it appears to work correctly, at least for my simple test cases.

A number of the other XSLT tools listed by XSLT.com are also free software. However, most of them are Java programs that also depend on various extra Java libraries. Users appear to give a positive evaluation to a number of the Java tools, so these may be good choices for you. I opted for Sablotron both for the greater speed of compiled C and for the simplicity of installing and using it.

Norman Walsh has created a set of complete XSLT style sheets for DocBook. Unfortunately, Sablotron simply crashes on them, and XML Spy fails to match anything in a valid DocBook document when using them. This is more likely a limitation in the tools than in Walsh's style sheets. You might have better luck with other tools. Still, the problem gives us the opportunity to develop custom (less complete) XSLT style sheets, which is what I really want anyway (to demonstrate the techniques).

Use of Sablotron is quite simple. The basics are:


Listing 1: Basic use of Sablotron

X:\mydocs> x:\sabl\bin\sabcmd mystyle.xsl mydoc.xml mydoc.html     

What this says is: use the rules in mystyle.xsl to transform mydoc.xml into mydoc.html. You can also use pipes and redirection, if you wish. Setting up Sablotron is as easy as unpacking its archive (it also provides libraries you can call from your programs, but the command-line utility is a good way to get started), so all you need to do is adjust paths and filenames for your environment.


Writing the XSLT specification

For the real blood and guts of XSLT, read the W3C's official recommendation (see the Resources section). This article aims at the more informal details of getting it working.

The specific DocBook document developed in "XML Matters #3" and "XML Matters #4" (chap5.xml) was a chapter. The example used a fairly small subset of all the possible DocBook tags in the chapter. So for now, all we really need is a chapter.xsl file that will do something useful with every tag actually used in chap5.xml. This is a modest start, but one that is easy to build on because of the open and extensible nature of XSLT. Let's take a look.

Start with a skeleton of chapter.xsl -- the "how to convert a DocBook chapter to HTML" template:


Listing 2: Skeleton XSLT document (empty.xsl)

<xsl:stylesheet version="1.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
     xmlns="http://www.w3.org/TR/xhtml1/strict">
  <xsl:output method="html" indent="yes" encoding="UTF-8"/>
</xsl:stylesheet>     

As you can see, chapter.xsl is a well-formed XML file. You will also notice that many of the tags in an XSLT document have names of the form <xsl:*>. In fact, all the tags that are instructions look like that. In transforming to XML-like formats (such as HTML), you will see various other tags. These other tags belong to the target format and will occur only within an <xsl:*> element.

Basically, you should use exactly the namespace attributes (xmlns:xsl and xmlns) indicated above. You'd probably want to keep the output line also; you might use the xml or text methods, though.

The above XSLT file is perfectly good to use as a processing template. But it might not do exactly what you expect. You might assume that since it contains no templates, nothing gets output. That isn't exactly correct: the processor's default rules still visit every element and copy out all the text nodes, which gives you a plain-text version of your chapter. If you really do want to output nothing at all, you want an XSLT document like the one in Listing 3:


Listing 3: Null-output XSLT document (null.xsl)

<xsl:stylesheet version="1.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
     xmlns="http://www.w3.org/TR/xhtml1/strict">
  <xsl:output method="html" indent="yes" encoding="UTF-8"/>
  <xsl:template match="*">
  </xsl:template>
</xsl:stylesheet>   

Our null outputter moves us in the direction of a useful transform. A real style sheet is really just a set of patterns to try to match; the body of each <xsl:template> element is a template for what to output when its pattern matches. As the example shows, the pattern "*" matches any element. Our example just does not happen to do anything inside the template, but it still manages to match any element that might occur in our source XML/DocBook document.
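
The difference between the empty style sheet and the null outputter comes from the built-in template rules that the XSLT 1.0 Recommendation defines for nodes that no template matches. As a sketch, spelling those rules out by hand looks roughly like the following. The file name builtin.xsl is hypothetical, and method="text" is chosen here only to make the plain-text result obvious:

```xml
<?xml version="1.0"?>
<!-- builtin.xsl (hypothetical name): writes out by hand the rules an
     XSLT 1.0 processor applies when no template matches a node -->
<xsl:stylesheet version="1.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <!-- Root node and elements: emit nothing, just descend to children -->
  <xsl:template match="*|/">
    <xsl:apply-templates/>
  </xsl:template>
  <!-- Text and attribute nodes: copy their string value to the output -->
  <xsl:template match="text()|@*">
    <xsl:value-of select="."/>
  </xsl:template>
  <!-- Comments and processing instructions: emit nothing -->
  <xsl:template match="processing-instruction()|comment()"/>
</xsl:stylesheet>
```

The null outputter's match="*" template overrides the first of these rules for every element and never calls <xsl:apply-templates/>, so processing stops at the root element and no text node is ever reached.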


Matching by descent

The power of XSLT templates lies mainly in their ability to extend the matching function. Once an element is matched, XSLT extends the matching function to the subelements of that element. Expanding on the null outputter, let's create a semi-meaningful style sheet. The important tag for allowing descent into subelements is <xsl:apply-templates>.

Generally, every template includes this tag somewhere in its body:


Listing 4: Minimal chapter XSLT document (minimal.xsl)

<xsl:stylesheet version="1.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
     xmlns="http://www.w3.org/TR/xhtml1/strict">
  <xsl:output method="html" indent="yes" encoding="UTF-8"/>
  <xsl:template match="chapter">
    ----- Start of Chapter -----
    <xsl:apply-templates/>
  </xsl:template>
  <xsl:template match="*">
    ##### Unmatched Element in Source #####
  </xsl:template>
</xsl:stylesheet>   

When you run an XSLT processor using this style sheet and a DocBook chapter, you get something like:


Listing 5: XSLT processor using style sheet and DocBook chapter

----- Start of Chapter -----
##### Unmatched Element in Source #####
##### Unmatched Element in Source #####
##### Unmatched Element in Source #####   

This output isn't all that useful, but it lets us see what the style sheet is doing. The root element of a chapter is the <chapter> tag. The style sheet matches the <chapter> tag, and prints "----- Start of Chapter -----". Various children occur within the <chapter> element. Each such child is called something other than chapter, and so falls through to the "*" template.
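
To make the matching concrete, consider a deliberately tiny chapter. It is a hypothetical stand-in, not the chap5.xml from the earlier columns, and it omits the DOCTYPE declaration that a valid DocBook file would carry:

```xml
<?xml version="1.0"?>
<!-- tiny.xml (hypothetical): three element children under <chapter> -->
<chapter>
  <title>A Tiny Chapter</title>
  <para>Some introductory text.</para>
  <sect1>
    <title>First Section</title>
    <para>Text inside a section.</para>
  </sect1>
</chapter>
```

With the minimal style sheet, the chapter template fires once, and each of the three element children (<title>, <para>, and <sect1>) is handled by the "*" template. Nothing inside <sect1> is visited, because the "*" template does not itself call <xsl:apply-templates/>.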

For developing your own XSLT style sheet, leaving in some obvious flag like the above for unmatched elements will let you quickly see what templates you need to develop. Listing 6 shows a version with some real templates:


Listing 6: Valid HTML outputter XSLT document

<xsl:stylesheet version="1.0"
                    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                    xmlns="http://www.w3.org/TR/xhtml1/strict">
  <xsl:output method="html" indent="yes" encoding="UTF-8"/>
  <xsl:template match="chapter">
    <html>
      <head>
        <title>
          <xsl:value-of select="title"/>
        </title>
      </head>
      <body>
        <xsl:apply-templates/>
      </body>
    </html>
  </xsl:template>
  <xsl:template match="chapter/title">
    <hr></hr>
    <h1><xsl:apply-templates/></h1>
  </xsl:template>
  <xsl:template match="para">
    <p><xsl:apply-templates/></p>
  </xsl:template>
  <xsl:template match="*">
     ##### Unmatched Element in Source #####
  </xsl:template>
</xsl:stylesheet>   

This HTML outputter shows some realistic features of an XSLT style sheet. The chapter template match lays out the HTML document I want to produce. There is nothing special about the HTML tags inside the template match; any text you put there will appear in the output. Within the HTML <title> element, use the <xsl:value-of> instruction to insert the title subelement that's required inside a <chapter> in DocBook. In the HTML <body> element, you pass control on to other templates (presumably quite a few for all of DocBook).

The next template after chapter is chapter/title. This means to match a <title> element, but only if it occurs directly inside a <chapter>. If you want to, you could simply match title and thereby specify the output format of every <title> element in the source document. But I want to format chapter titles differently from sect1 titles, sect2 titles, and so on. The para template in the example does take the simpler route and matches every <para> regardless of its parent (but it never actually fires, because paras can only occur inside tags that are not yet matched). For good measure, the style sheet still includes the "*" template, so you can see that it is not complete when you examine its output.
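
One way to extend the style sheet along these lines is to add templates for sections and their titles. The following is only a sketch: the choice of <h2> and <h3> for sect1 and sect2 titles is an assumption made for illustration, not something taken from this article's archive. The templates are meant to sit alongside those in Listing 6, and are wrapped in their own <xsl:stylesheet> element here only so that the fragment is well-formed on its own:

```xml
<xsl:stylesheet version="1.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
     xmlns="http://www.w3.org/TR/xhtml1/strict">
  <!-- Sections produce no markup of their own; just descend -->
  <xsl:template match="sect1|sect2">
    <xsl:apply-templates/>
  </xsl:template>
  <!-- A title directly inside a sect1 becomes a second-level heading -->
  <xsl:template match="sect1/title">
    <h2><xsl:apply-templates/></h2>
  </xsl:template>
  <!-- A title directly inside a sect2 becomes a third-level heading -->
  <xsl:template match="sect2/title">
    <h3><xsl:apply-templates/></h3>
  </xsl:template>
</xsl:stylesheet>
```

Once sections are matched and descended into, the para template from Listing 6 starts to fire for the paragraphs inside them. These templates take precedence over the catch-all because XSLT gives a pattern that names an element a higher default priority than "*".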


Repeated children

Matching templates by descent is not the only trick XSLT can do. You can also do conditional outputting, sorting, pulling out source attributes, and looping over children. For today, just look at the simple looping example in Listing 7:


Listing 7: XSLT template for looping over subelements

<xsl:template match="simplelist">
  <ul>
    <xsl:for-each select="member">
      <li><xsl:apply-templates/></li>
    </xsl:for-each>
  </ul>
</xsl:template>     

Rather than descend to every subelement in a simplelist, we just assume subelements are all <member> elements. The <xsl:for-each> works much like a nested template, and also much like a programming-language loop construct. The contents of the <xsl:for-each> element will go to the output once for every subelement that matches the select attribute. Within the loop, each <member> element in turn becomes the current node, and the <xsl:apply-templates/> tag inside the loop descends into its contents. That is, each thing in the list might have further markup inside it, and we pass formatting of those elements to their appropriate templates (text nodes are just output in literal form).
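
As a rough illustration of three of the other tricks mentioned above (testing a source attribute, sorting, and conditional output), here is a variant of the looping template. It is a sketch under assumptions: it handles only a simplelist whose DocBook type attribute is "inline", the <span> wrapper is an arbitrary choice, and sorting the members alphabetically is done purely to demonstrate <xsl:sort>, not because a list should normally be reordered:

```xml
<xsl:stylesheet version="1.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
     xmlns="http://www.w3.org/TR/xhtml1/strict">
  <!-- Applies only when the source has <simplelist type="inline">;
       members are visited in alphabetical order of their text -->
  <xsl:template match="simplelist[@type='inline']">
    <span>
      <xsl:for-each select="member">
        <xsl:sort select="."/>
        <xsl:apply-templates/>
        <!-- Conditional output: a comma after all but the last member -->
        <xsl:if test="position() != last()">, </xsl:if>
      </xsl:for-each>
    </span>
  </xsl:template>
</xsl:stylesheet>
```

Inside the loop, position() and last() refer to the sorted sequence of members, so the comma is suppressed after whichever member sorts last. A simplelist without type="inline" does not match this pattern and is left to the plain simplelist template.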


Evermore

The preceding material simply scratches the surface of XSLT. But it should give you a sense of working with style sheets and transforms. The Resources section provides many places to read further on related matters. In particular, you might benefit from looking through the more complete XML and XSLT examples in this article's archive file. Stay tuned, this column is bound to come back to XSLT in numerous ways.


Resources

  • Go to the XSL Home page of the World Wide Web Consortium (W3C) for a complete description and explanation of the Extensible Stylesheet Language (XSL).

  • XSLT Recommendation 1.0 of the World Wide Web Consortium (W3C) gives an overview of the XML namespace mechanism. It's also a great place for definitions of syntax and semantics of XSL Transformations (XSLT).

  • The Sablotron XSL Transformations Processor (open source) is available to the public and is very handy as a base for multiplatform XML applications.

  • Joe Brockmeier's "A gentle guide to DocBook" is a nice introduction to the use of SGML-tools Lite. This is another approach -- using DSSSL -- for formatting DocBook documents that is different from XSLT approaches.

  • James Clark's Document Style Semantics and Specification Language (DSSSL) page is a good place to start if you would like to know more about DSSSL.

  • OASIS' recommendations on XML tools offers resources that are "known to work with" DocBook.

  • The Xeena XML Editor (free-of-cost 90-day license), from IBM's alphaWorks, is a generic Java application for editing valid XML documents derived from any valid DTD.

  • For a commercial XML editor, check out:

  • To validate an XML document, go to the Web-based XML Validation form (source available and liberally licensed) of the Scholarly Technology Group.

  • By all means, the best place to get started in a more detailed understanding of DocBook is DocBook: The Definitive Guide, Norman Walsh & Leonard Muellner, O'Reilly, Cambridge, MA 1999. Or check out the electronic version.

  • The Organization for the Advancement of Structured Information Standards (OASIS) is a non-profit, international consortium that "creates interoperable industry specifications based on public standards such as XML and SGML . . . "

  • Download the files used and mentioned in this article.

  • Learn more about XSLT transformations in Doug Tidwell's tutorial for developerWorks that covers the basics of manipulating XML documents using Java technology. Doug Tidwell looks at the common APIs for XML and discusses how to parse, create, manipulate, and transform XML documents.

  • Find other articles in David Mertz's XML Matters column.
