Welcome to the world of XML transformations! I'm afraid that you're in for a rocky ride: Standards are coalescing and undergoing revision, tools are immature and often buggy, implementations are inconsistent, and choices are just plain confusing. But don't panic. I can lead you through at least one path out of the labyrinth. And things will inevitably get better with time, albeit always more slowly than we would like.
The last two "XML Matters" columns described my project for converting my academic writings to XML, specifically to the DocBook DTD. Those articles provide a good starting point for writing your own DocBook documents, and that's where this column takes up.
For this column, let's assume that you have some nicely structured, well-formed, and valid DocBook XML documents lying around. It's nice to have them in the first place, but the next step is to transform them into more conventional end-user formats: things like HTML pages, PDF files, and printed pages (the things readers actually read). This is exactly the problem I faced after converting a portion of my archival writing to DocBook, and this article presents my own solution.
My main goal -- at least for now -- is a good transformation to HTML. But I don't want to be limited to HTML output. I also have a few smaller goals. I'd like to have some control over the precise output without doing a lot of work and without having to learn a lot of new languages and techniques. I'd also like to use tools that are free (as in speech, and as in beer), and tools that are cross-platform. Finally, I'd like to minimize dependencies. A large number of complex dependencies is a disadvantage, even if all the needed contributions are free and cross-platform. Basically, my ideal is a standalone executable that just runs, runs reliably, and converts my DocBook documents to HTML in just the style I want. Lofty dreams, but why not?
There are at least four possible approaches for transforming a DocBook document -- or most any XML document -- into end-user formats. I seriously considered all four approaches for my own little project. This column discusses only the last option in detail, but all are worth keeping in mind as you plan a project that involves repeated transformations:
- Write custom transformation code. It would be nice to start with a programming language that has some libraries for basic XML methods like SAX and DOM. But even assuming the basic parsing is a black box, custom code can do whatever you want with the parsed elements. Ultimately, this is the most flexible and powerful approach, but it is also likely to take more work, both up front and in maintenance.
- Use Cascading Stylesheets with our DocBook document. It's a thought. It would be nice to keep the typographic specifications completely separate from the structural markup and just simply have the client device (e.g., browser) render things nicely. That might yet happen, but as of right now there seems to be only limited support -- only in IE 5.5, Opera 4, and some of the latest Mozilla developer releases. It just doesn't seem at the point where one can count on an end user making this work for them.
- Use Document Style Semantics and Specification Language (DSSSL) to specify transformations into target formats. On the plus side, a number of DSSSL style sheets already exist for DocBook (and for other formats). On the minus side, DSSSL is basically a whole new programming language to learn, and it's a functional Lisp-like language to boot. In order to utilize DSSSL, you need to start with Jade or OpenJade, but both tools are complex enough that many people have written wrappers for them (such as SGML-tools Lite). In order to get a working system -- albeit by reports a very nicely working system -- you really need to satisfy all sorts of system dependencies and install all sorts of tools and libraries. In some well-intentioned although perhaps not sufficiently dedicated attempts, I didn't manage to get Jade-related tools functioning smoothly on my system. Obviously, a lot of other folks use these systems every day, so a little more work would surely have put things in order. (If you can point me to a quick, simple all-in-one DSSSL processor, let me know. I'd love to try it.) Even more than the setup difficulties, however, DSSSL simply feels like it comes out of different traditions and ways of thinking than do XML techniques. By contrast, the final approach is basically pure XML and comes out of official (working) specifications of the W3C.
- Use eXtensible Stylesheet Language Transformations (XSLT). In one sense, XSLT is actually a specification for a class of XML documents. That is, an XSLT style sheet is itself a well-formed XML document with some specialized contents that let you "templatize" the output format you are looking for (stay tuned for what this means). A large number of tools at least nominally support XSLT: My hunch is that this really is the direction technologies are going in for XML transforms -- either because of, or in spite of, its "official" status with the W3C. XSLT can specify transforms to any target format. But the general feeling I've picked up is that most developers find it easiest to work with XSLT when the target format is another XML format, such as XHTML.
The Resources section contains a link to descriptions of a number of XSLT tools. I tried many of them, but found Sablotron most to my taste. It is free software (GNU). It is multiplatform. It has a standalone executable that is simple to run from the command line. And most importantly, it appears to work correctly, at least for my simple test cases.
A number of the other XSLT tools listed by XSLT.com are also free software. However, most of them are Java programs that also depend on various extra Java libraries. Users appear to give a positive evaluation to a number of the Java tools, so these may be good choices for you. I opted for Sablotron both for the greater speed of compiled C and for the simplicity of installing and using it.
Norman Walsh has created a set of complete XSLT style sheets for DocBook. Unfortunately, Sablotron simply crashes on them, and XML Spy fails to match anything in a valid DocBook document when using them. This is more likely a limitation in the tools than in Walsh's style sheets. You might have better luck with other tools. Still, the problem gives us the opportunity to develop custom (less complete) XSLT style sheets, which is what I really want anyway (to demonstrate the techniques).
Use of Sablotron is quite simple. The basics are:
Listing 1: Basic use of Sablotron
X:\mydocs> x:\sabl\bin\sabcmd mystyle.xsl mydoc.xml mydoc.html
What this says is: use the rules in mystyle.xsl to transform mydoc.xml into mydoc.html. You can also use pipes and redirection, if you wish. Setting up Sablotron is as easy as unpacking its archive (it also provides libraries you can call from your programs, but the command-line utility is a good way to get started). Adjust paths and filenames as needed for your environment.
Writing the XSLT specification
For the real blood and guts of XSLT, read the W3C's official recommendation (see the Resources section). This article aims at the more informal details of getting it working.
The specific DocBook document developed in "XML Matters #3" and "XML Matters #4" (chap5.xml) was a chapter. The example used a fairly small subset of all the
possible DocBook tags in the chapter. So for now, all we really need is
a chapter.xsl file that will do something useful with every tag
actually used in chap5.xml. This is a modest start, but one that
is easy to build on because of the open and extensible nature of
XSLT. Let's take a look.
Start with a skeleton of chapter.xsl -- the "how to convert a DocBook chapter to HTML" template:
Listing 2: Skeleton XSLT document (empty.xls)
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="http://www.w3.org/TR/xhtml1/strict">
  <xsl:output method="html" indent="yes" encoding="UTF-8"/>
</xsl:stylesheet>
As you can see, chapter.xsl is a well-formed XML file. As you
will also notice, the pattern <xsl:*> is the name of many of
the tags in an XSLT document. In fact, all the tags that are instructions look like that. In transforming to XML-like formats (such as HTML), you will see various other tags. These other tags belong to the target
format and will occur only within an <xsl:*> element.
Basically, you should use exactly the namespace attributes (xmlns:xsl and xmlns) indicated above. You'll probably want to keep the <xsl:output> line too, although you could set its method to xml or text instead of html.
The above XSLT file is perfectly good to use as a processing template. But it might not do exactly what you expect. You might assume that because it contains no templates, nothing gets output. That isn't quite correct: XSLT's built-in rules still pass every text node through, so you get a plain-text version of your chapter. If you really do want to output nothing at all, you want an XSLT document like the one in Listing 3:
Listing 3: Null-output XSLT document (null.xls)
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="http://www.w3.org/TR/xhtml1/strict">
  <xsl:output method="html" indent="yes" encoding="UTF-8"/>
  <xsl:template match="*">
  </xsl:template>
</xsl:stylesheet>
Our null outputter moves us in the direction of a useful transform. A real style sheet is really just a set of <xsl:template> elements: the match attribute of each gives a pattern to try against the source, and the body gives a template for what to output when the pattern matches. As the example shows, "*" matches any element. Our example just does not happen to put anything inside the template, but it still manages to match every element that might occur in our source XML/DocBook document.
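As for why the bare skeleton in Listing 2 still emits text: XSLT 1.0 defines built-in template rules that apply whenever no template in the style sheet matches a node. The sketch below spells those defaults out explicitly, so it should behave like the empty skeleton. The patterns follow the XSLT 1.0 Recommendation; the comments are mine, and the default xmlns declaration is left out only because nothing here emits HTML elements.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html" indent="yes" encoding="UTF-8"/>

  <!-- Root node and every element: output nothing, just descend. -->
  <xsl:template match="*|/">
    <xsl:apply-templates/>
  </xsl:template>

  <!-- Text and attribute nodes: copy the string value to the output. -->
  <xsl:template match="text()|@*">
    <xsl:value-of select="."/>
  </xsl:template>

  <!-- Comments and processing instructions: output nothing. -->
  <xsl:template match="processing-instruction()|comment()"/>
</xsl:stylesheet>
```

Seen this way, the null outputter in Listing 3 works because its own "*" template takes over from the first built-in rule and never descends, so the text rule is never reached.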
The power of XSLT templates lies mainly in their ability to extend the matching function. Once an element is matched, XSLT extends the matching function to the subelements of that element. Expanding on the null outputter, let's create a semi-meaningful style sheet.
The important tag for allowing descent into subelements is <xsl:apply-templates>.
Generally, every template includes this tag somewhere in its body:
Listing 4: Minimal chapter XSLT document (minimal.xls)
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="http://www.w3.org/TR/xhtml1/strict">
  <xsl:output method="html" indent="yes" encoding="UTF-8"/>
  <xsl:template match="chapter">
    ----- Start of Chapter -----
    <xsl:apply-templates/>
  </xsl:template>
  <xsl:template match="*">
    ##### Unmatched Element in Source #####
  </xsl:template>
</xsl:stylesheet>
When you run an XSLT processor using this style sheet and a DocBook chapter, you get something like:
Listing 5: XSLT processor using style sheet and DocBook chapter
----- Start of Chapter -----
##### Unmatched Element in Source #####
##### Unmatched Element in Source #####
##### Unmatched Element in Source #####
This output isn't all that useful, but it lets us see what the style sheet is doing. The root element of a chapter is the <chapter> tag. The style sheet matches the <chapter> tag and prints "----- Start of Chapter -----". Various children occur within the <chapter> element. Each such child is called something other than chapter, and so falls through to the "*" template.
For developing your own XSLT style sheet, leaving in some obvious flag like the above for unmatched elements will let you quickly see what templates you need to develop. Listing 6 shows a version with some real templates:
Listing 6: Valid HTML outputter XSLT document
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="http://www.w3.org/TR/xhtml1/strict">
  <xsl:output method="html" indent="yes" encoding="UTF-8"/>
  <xsl:template match="chapter">
    <html>
      <head>
        <title>
          <xsl:value-of select="title"/>
        </title>
      </head>
      <body>
        <xsl:apply-templates/>
      </body>
    </html>
  </xsl:template>
  <xsl:template match="chapter/title">
    <hr></hr>
    <h1><xsl:apply-templates/></h1>
  </xsl:template>
  <xsl:template match="para">
    <p><xsl:apply-templates/></p>
  </xsl:template>
  <xsl:template match="*">
    ##### Unmatched Element in Source #####
  </xsl:template>
</xsl:stylesheet>
This HTML outputter shows some realistic features of an XSLT style sheet.
The chapter template match lays out the HTML document I want to
produce. There is nothing special about the HTML tags inside the template
match; any text you put there will appear in the output. Within the HTML
<title> element, use the <xsl:value-of> instruction to insert the title subelement that's required inside a <chapter> in DocBook. In the HTML
<body> element, you pass control on to other templates (presumably
quite a few for all of DocBook).
The next template after chapter is chapter/title. This pattern matches a <title> element, but only if it occurs directly inside a <chapter>. If you wanted to, you could simply match title and thereby specify the output format of every <title> element in the source document. But I want to format chapter titles differently from sect1 titles, sect2 titles, and so on. The para template in the example does take the simpler route and matches every <para> by bare name (although it never actually fires here, because paras occur only inside elements that are not yet matched, and the catch-all template does not descend into them). For good measure, the style sheet still matches "*", so you can see that it is not complete when you examine its output.
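One way to build on Listing 6 along these lines is sketched below. It is only an illustration: the choice of h2 and h3 for section titles is my assumption, not something fixed by DocBook or by the original style sheet, and the templates are meant to be merged with those of Listing 6 rather than used alone. The catch-all template is also varied slightly so that it reports the name of each unmatched element, which makes it easier to see which template to write next.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="http://www.w3.org/TR/xhtml1/strict">
  <xsl:output method="html" indent="yes" encoding="UTF-8"/>

  <!-- Merge the chapter, chapter/title, and para templates
       from Listing 6 here. -->

  <!-- Section wrappers produce no markup of their own; they only
       descend so that their titles and paras get matched. -->
  <xsl:template match="sect1 | sect2">
    <xsl:apply-templates/>
  </xsl:template>

  <!-- Titles are formatted according to their parent element. -->
  <xsl:template match="sect1/title">
    <h2><xsl:apply-templates/></h2>
  </xsl:template>
  <xsl:template match="sect2/title">
    <h3><xsl:apply-templates/></h3>
  </xsl:template>

  <!-- Catch-all: flag the element by name instead of anonymously. -->
  <xsl:template match="*">
    <xsl:text>##### Unmatched Element: </xsl:text>
    <xsl:value-of select="name()"/>
    <xsl:text> #####&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

The more specific patterns win over "*" because XSLT gives a bare wildcard a lower default priority than a named element or a path such as sect1/title, so the order of the templates in the file does not matter here.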
Matching templates by descent is not the only trick XSLT can do. You can also do conditional outputting, sorting, pulling out source attributes, and looping over children. For today, just look at the simple looping example in Listing 7:
Listing 7: XSLT template for looping over subelements
<xsl:template match="simplelist">
  <ul>
    <xsl:for-each select="member">
      <li><xsl:apply-templates/></li>
    </xsl:for-each>
  </ul>
</xsl:template>
Rather than descend to every subelement in a simplelist, we just assume subelements are all <member> elements. The <xsl:for-each> works much like a nested template, and also much like a programming-language loop construct. The contents of the <xsl:for-each> element will go to the output for every subelement that matches the select attribute. Within the loop, the current <member> element becomes the active node, and its contents are what the <xsl:apply-templates/> tag inside the loop descends into. That is, each thing in the list might have further markup inside it, and we pass formatting of those elements to their appropriate templates (text nodes are just output in literal form).
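The other tricks mentioned above (conditional output, sorting, and pulling out source attributes) fit into the same loop. The following is a minimal sketch rather than part of the original style sheet: it assumes you want the members sorted alphabetically, and that a role attribute on a <member>, when present, should be carried over as an HTML class. Both of those choices are mine, made purely for illustration.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="http://www.w3.org/TR/xhtml1/strict">
  <xsl:output method="html" indent="yes" encoding="UTF-8"/>

  <xsl:template match="simplelist">
    <ul>
      <xsl:for-each select="member">
        <!-- Sorting: xsl:sort must come first inside xsl:for-each.
             Here members are ordered by their text content. -->
        <xsl:sort select="."/>
        <li>
          <!-- Conditional output plus attribute access: add a class
               only when the source member has a role attribute. -->
          <xsl:if test="@role">
            <xsl:attribute name="class">
              <xsl:value-of select="@role"/>
            </xsl:attribute>
          </xsl:if>
          <xsl:apply-templates/>
        </li>
      </xsl:for-each>
    </ul>
  </xsl:template>
</xsl:stylesheet>
```

Note that the <xsl:attribute> instruction has to appear before any content is added to the <li>, which is why the test comes ahead of <xsl:apply-templates/>.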
The preceding material simply scratches the surface of XSLT. But it should give you a sense of working with style sheets and transforms. The Resources section provides many places to read further on related matters. In particular, you might benefit from looking through the more complete XML and XSLT examples in this article's archive file. Stay tuned: this column is bound to come back to XSLT in numerous ways.
Resources
- Participate in the discussion forum.
- Go to the XSL Home page of the World Wide Web Consortium (W3C) for a complete description and explanation of the Extensible Stylesheet Language (XSL).
- XSLT Recommendation 1.0 of the World Wide Web Consortium (W3C) gives an overview of the XML namespace mechanism. It's also a great place for definitions of syntax and semantics of XSL Transformations (XSLT).
- The Sablotron XSL Transformations Processor (open source) is available to the public and is very handy as a base for multiplatform XML applications.
- Joe Brockmeier's "A gentle guide to DocBook" is a nice introduction to the use of SGML-tools Lite. This is another approach -- using DSSSL -- for formatting DocBook documents that is different from XSLT approaches.
- James Clark's Document Style Semantics and Specification Language (DSSSL) page is a good place to start if you would like to know more about DSSSL.
- OASIS' recommendations on XML tools offers resources that are "known to work with" DocBook.
- The Xeena XML Editor (free-of-cost 90-day license), from IBM's alphaWorks, is a generic Java application for editing valid XML documents derived from any valid DTD.
- For a commercial XML editor, check out:
- To validate an XML document, go to the Web-based XML Validation form (source available and liberally licensed) of the Scholarly Technology Group.
- By all means, the best place to get started in a more detailed understanding of DocBook is DocBook: The Definitive Guide, Norman Walsh & Leonard Muellner, O'Reilly, Cambridge, MA 1999. Or check out the electronic version.
- The Organization for the Advancement of Structured Information Standards (OASIS) is a non-profit, international consortium that "creates interoperable industry specifications based on public standards such as XML and SGML . . . "
- Download the files used and mentioned in this article.
- Learn more about XSLT transformations in Doug Tidwell's tutorial for developerWorks that covers the basics of manipulating XML documents using Java technology. Doug Tidwell looks at the common APIs for XML and discusses how to parse, create, manipulate, and transform XML documents.
- Find other articles in David Mertz's XML Matters column.

David Mertz became disenchanted with the academy and became a technical journalist: post hoc ergo propter hoc. You can contact David at mertz@gnosis.cx; his life is pored over at http://gnosis.cx/dW/. Suggestions and recommendations on this, past, or future columns are welcome.