If your document archives are like mine, they contain files in every format from Microsoft Word 3.3 to HTML to WordPerfect 7 to ASCII text. Often, you can't even obtain the software you used to create the original documents. Fortunately, DocBook, an SGML dialect for creating all-purpose technical documents, can help you move your files into a single, standard XML format. In this column, I'll explain how to use the XML version of the DocBook DTD to convert an existing document.
DocBook is a rather complex DTD with hundreds of elements. Fortunately, you don't need to know all of DocBook to work with it. As you'll see, the basic elements are arranged logically, and most elements follow similar patterns for nesting child elements.
Creating content -- different approaches
It's easy to make small typos in DocBook. The key to working with it is having a good reference handy while you're working. I'm partial to O'Reilly's excellent hardcopy text, but the identical material is also available online (see Resources). With your reference in hand, you can create DocBook content in one of two ways:
- Using a specialized XML editor
- Using a generic text editor plus an external validator
DocBook is detailed enough that you need some automation to ensure conformance to the DTD. Using either approach, you can work for stretches, and validate and fix glitches only occasionally.
Most specialized XML editors help you enter elements and attributes. Many programs present context-sensitive prompts for available tags or lists of tags that exist in the current DTD (for example, DocBook's). However, be aware that specialized editors are generally less flexible than good general-purpose text editors that provide features like multiple clipboards, syntax highlighting, column marking, and section/function browsing.
Unfortunately, I've found that the quality of XML tools is still disappointing. I've tested a number of XML validation and transformation tools and have yet to locate a completely accurate command-line XML validator. In fact, I've had to settle for using XML Spy under Win32, and Xeena on other platforms with Java support. Both tools do a good job of validation, but are somewhat cumbersome to use. (See Resources for reviews of XML Spy, Xeena, and general text editors.)
The first step in creating an XML DocBook document is to prepare its declaration. Let's look at Listing 1, a document declaration example, and step through its different parts:
Listing 1. XML document type declaration
<?xml version="1.0"?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
                  "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" [
  <!ENTITY Zizek "Žižek">
  <!ENTITY Mocnik "Močnik">
]>
<?xml-stylesheet type="text/xsl" href="chapter.xsl"?>
<chapter>
  <!-- The actual chapter contents are here -->
</chapter>
XML declaration
The first thing we include is the <?xml> declaration,
which indicates that the document is XML. Next is the <!DOCTYPE>
tag -- the document type declaration. The document type declaration contents
are worth looking at in detail.
DOCTYPE tag element
The first thing to notice in the <!DOCTYPE> tag is the
name of the root element (chapter) that will be used in the
document. Deciding what type of root element to use is important because it states the document's purpose, at least in broad terms. The root element
generally determines the rough size of the document.
At the broadest level, you can specify a root element of set
when including two or more books; for example, a whole reference
collection. In this case, you wouldn't necessarily put everything in the
same file, but would instead use inclusions, as outlined in "XML
Matters #3." More likely, you will be creating a book, which
is a collection of parts or chapters, plus other sections
at the same conceptual level as parts/chapters. Even more modestly, you
might be creating an article or a chapter, as in our
example in Listing 1. In practice, a chapter
or article is the smallest root element used for a DocBook document.
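To make the larger case concrete, here is a minimal sketch of a document whose root element is book rather than chapter. The titles and paragraph text are hypothetical placeholders; only the element names and the identifiers come from the DocBook XML 4.1.2 DTD used in Listing 1. Note that the name after <!DOCTYPE changes to match the new root element.

```xml
<?xml version="1.0"?>
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
               "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd">
<!-- All titles and text below are hypothetical placeholders -->
<book>
  <title>Collected Essays</title>
  <preface>
    <title>Preface</title>
    <para>Why these essays belong together.</para>
  </preface>
  <chapter>
    <title>First Chapter</title>
    <para>Placeholder text for the chapter body.</para>
  </chapter>
  <appendix>
    <title>Notes on Sources</title>
    <para>Placeholder text for the appendix.</para>
  </appendix>
</book>
```

The preface, chapter, and appendix here all sit at the same level directly under book, which is what "the same conceptual level as parts/chapters" amounts to in markup.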
Next in the <!DOCTYPE> declaration we see the PUBLIC and
system identifiers. The public identifier following PUBLIC is a feature inherited from SGML, and you
don't really need it in XML documents (a SYSTEM identifier alone is enough). If you do include it, be
sure to spell it exactly the way it's spelled in the DTD. The system identifier is a URL that points to the DTD, which is where all
the DocBook definitions are located. You can download the DTD from that URL if you'd
like to look at it. Be sure to spell the URL correctly, too, or your validating programs won't
be able to find the DTD.
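If you decide to leave out the public identifier, XML lets you use the SYSTEM keyword followed by the system identifier alone. The following is a sketch of that variant of Listing 1, without the internal subset and the stylesheet reference; the chapter title and paragraph are placeholders.

```xml
<?xml version="1.0"?>
<!-- SYSTEM takes only the URL of the DTD; no public identifier is given -->
<!DOCTYPE chapter SYSTEM
          "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd">
<chapter>
  <title>Placeholder Title</title>
  <para>Placeholder paragraph.</para>
</chapter>
```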
Finally, inside the square brackets in the <!DOCTYPE> tag
is the "internal subset," which is simply a way to add declarations that apply
only to your document. In this case, I declared a couple of entities that act as aliases for
names that are hard to type on a US keyboard.
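As an illustration of how such aliases are used, the sketch below declares the same two entities and then refers to them in the body as &Zizek; and &Mocnik;. Writing the entity values with numeric character references is my own assumption, made so that the file can stay pure ASCII; the chapter title and sentence are placeholders.

```xml
<?xml version="1.0"?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
                  "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" [
  <!-- Character references used here: U+017D = 381, U+017E = 382, U+010D = 269 -->
  <!ENTITY Zizek "&#381;i&#382;ek">
  <!ENTITY Mocnik "Mo&#269;nik">
]>
<chapter>
  <title>Entities in Use</title>
  <para>
    A parser replaces &Zizek; and &Mocnik; with the full names
    before the text is processed any further.
  </para>
</chapter>
```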
Following the document type declaration in Listing 1,
we have a processing instruction, <?xml-stylesheet...>.
I won't go into detail about Extensible Stylesheet Language Transformations
(XSLT) until the next "XML Matters" column. However, this processing instruction
works much like a link to a cascading style sheet (CSS) in an HTML document. In this
case, I added a reference to an XSL document that contains some rules for transforming
the DocBook document. Like a link to a cascading style sheet, this type of processing
instruction is optional, even for a transformation tool. Depending
on the tool, you can specify a transformation using whatever XSLT you want. A processing instruction is just one way to do it.
Finally, we see the <chapter> tag we referred to in
the declaration root element. The chapter content goes inside this tag.
Things like chapters, articles, prefaces, and bibliographies are all components of documents. That is to say, a component is a unit that addresses a single topic at a moderate level of specificity. Generally, the element names reflect their English meanings.
The structures of <chapter>, <appendix>, and
<preface> elements are similar. An <article> has
nearly the same structure as these elements, but the front matter is usually
enclosed in an <articleinfo> element (named <artheader> before DocBook 4.0). A component like <chapter>
includes front matter such as <title>, followed by sections
and/or block elements (for example, <para>).
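As a rough sketch of the article case, the fragment below wraps the front matter in an information element and then continues with ordinary blocks and a section. It uses articleinfo, the name this wrapper carries in the DocBook 4.x DTDs (artheader is the name from earlier versions); the title, author, and text are hypothetical placeholders.

```xml
<article>
  <!-- Front matter: describes the article rather than constituting it -->
  <articleinfo>
    <title>A Short Article</title>
    <author>
      <firstname>Jane</firstname>
      <surname>Doe</surname>
    </author>
    <abstract>
      <para>One or two sentences that describe the article.</para>
    </abstract>
  </articleinfo>
  <!-- Body: block elements first, then sections -->
  <para>The body can begin directly with block elements.</para>
  <sect1>
    <title>A Section</title>
    <para>Sections follow the opening blocks.</para>
  </sect1>
</article>
```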
A <title> element is usually required as front matter for
components and sections. Most other front matter is optional, but it might
include author information, abstracts, graphics, or other information that
has more to do with describing a component than constituting
the component. Let's look at Listing 2,
an example of a valid, highly abridged chapter (assuming the document type declaration
described in Listing 1):
Listing 2. DocBook chapter markup
<chapter>
  <title>Hegemony, and Other Passing Fads</title>
  <epigraph>
    <attribution>
      Gould, 1987b, quoting Gunnar Myrdal, <citetitle>An
      American Dilemma</citetitle> (1944)
    </attribution>
    <para>
      But there must be still other countless errors of the
      same sort that no living man can yet detect, because
      of the fog within which our type of Western culture
      envelops us. Cultural influences have set up the
      assumptions about the mind, the body, and the
      universe with which we begin; pose the questions we
      ask; influence the facts we seek; determine the
      interpretations we give these facts; and direct our
      reaction to these interpretations and
      conclusions.
    </para>
  </epigraph>
  <sect1>
    <title>Day-Care Devil Worshipers</title>
    <!-- para's, sect2's, epigraph's, and other block elements -->
  </sect1>
  <sect1>
    <!-- more blocks -->
  </sect1>
</chapter>
As the example shows, you may want to divide a moderately long
chapter into sections (<sect1>). How big to make the sections is a
judgment call, and there are two strategies for creating them.
You can use either the <sect1> through <sect5>
hierarchy or the <section> element, nested recursively. For
my own purposes -- writing philosophical prose -- I felt that explicitly numbered
section levels were better. I had a distinct sense of how important each
type of section must be, and the numbering matched that well. However,
for something like a technical reference, your section material might be
nested in different places and at different depths. For example, a function
call might be described in an overview and then later in the chapter in
a programming example. In this case, the <section> element
works better and can be nested to more than five levels.
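The two strategies can be compared side by side in the sketch below. These are two separate fragments, not one document: each would sit inside a component such as a chapter, and a single chapter uses one style or the other. The titles and text are placeholders.

```xml
<!-- Numbered levels: each depth has its own element name -->
<sect1>
  <title>Overview</title>
  <para>Top-level discussion.</para>
  <sect2>
    <title>Details</title>
    <para>Second-level discussion.</para>
  </sect2>
</sect1>

<!-- Recursive levels: the same element name at every depth -->
<section>
  <title>Overview</title>
  <para>Top-level discussion.</para>
  <section>
    <title>Details</title>
    <para>Second-level discussion.</para>
  </section>
</section>
```

With the recursive form, a subsection can be moved to a different depth without renaming its tags, which is what makes it convenient for reference material that gets reorganized.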
Sections are bigger than block elements, and are simply a list of blocks. With a shorter component, you might immediately begin using block elements. Basically, a block element is either a paragraph or an element at the same conceptual/hierarchical level as a paragraph (such as a list, equation, or illustration). There are other specialized block types, but these are the most general.
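For a sense of how blocks line up inside a section, here is a small sketch with a paragraph, a list, and an illustration as siblings. The file name diagram.png and all of the text are hypothetical; the element names are those of the DocBook XML 4.1.2 DTD.

```xml
<sect1>
  <title>Block Elements</title>
  <!-- Each child below is a block at the same level as a paragraph -->
  <para>A paragraph is the most common block.</para>
  <itemizedlist>
    <listitem><para>A list is a block.</para></listitem>
    <listitem><para>Each list item holds blocks of its own.</para></listitem>
  </itemizedlist>
  <figure>
    <title>An Illustration</title>
    <graphic fileref="diagram.png"/>
  </figure>
</sect1>
```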
The only thing "smaller" than a block element is an inline element.
Generally, you set block elements apart from other blocks with vertical
white space, framing boxes, or the like. In contrast, an inline element
is continuous with the words around it, but it is marked by a different font,
color, hyperlink, and so on. In our chapter example, the epigraph is like
a short section containing two blocks: the attribution (<attribution>)
and the epigraph text (<para>). The attribution contains a <citetitle>,
but that citation will likely be rendered inline when printed, perhaps
in italics or underlining, or will appear as a hotlink to the bibliography if rendered in HTML.
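A single paragraph can carry several inline elements at once, as in the sketch below. The sentence is a placeholder written for this illustration; how each inline element is rendered (italics, a monospaced font, a hyperlink) is up to the stylesheet, not the markup.

```xml
<para>
  The elements used here are documented in
  <citetitle>DocBook: The Definitive Guide</citetitle>. The DTD is
  distributed as <filename>docbookx.dtd</filename> by
  <ulink url="http://www.oasis-open.org/">OASIS</ulink>, and reading it
  takes <emphasis>some</emphasis> practice.
</para>
```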
The elements and structure outlined here are enough to get you started creating your own DocBook documents. The next column will show how to transform our DocBook source document into other formats, and will introduce Extensible Stylesheet Language Transformations, which are also useful outside of DocBook applications.
Resources
- IBM alphaWorks' Xeena XML editor (free-of-cost 90-day license) provides an overview of Xeena, requirements for running it, FAQs, and a downloadable copy.
- David Mertz's XML Spy review on webreview.com is my review of XML Spy 3.0 and provides links to other XML articles.
- Find other articles in David Mertz's XML Matters column.
- Also see David's June 2000 Charming Python column on XML modules and related resources for XML developers working in Python.
- Altova's XML Spy Home page (commercial XML editor) describes Altova's version of XML Spy features and provides examples and a downloadable copy.
- Scholarly Technology Group's Web-based XML Validation (source available and liberally licensed) allows you to paste your XML document into their online forms and validate it with full XML 1.0 function.
- Visit the XMetal Home page (commercial XML editor). It offers an overview of the most current XMetal software, feature description, FAQs, product reviews and allows you to download and purchase the product.
- Sablotron XSL Processor (open source) offers an overview and downloadable version of this product.
- Organization for the Advancement of Structured Information Standards (OASIS) is a central source of information on SGML, XML, and DocBook.
- OASIS's recommendations on XML tools gives a good overview of DocBook and XML and offers tools known to work with DocBook, documentation, and samples.
- DocBook: The Definitive Guide, Norman Walsh & Leonard Muellner, O'Reilly, Cambridge, MA 1999 is the best place to get started in a more detailed understanding of DocBook. Or, check out the electronic version.
- Robert Stayton's hypertext version of the DocBook/XML DTD is extremely useful. You have to practice a little to read DTD formats, but that is something you need to do in working with XML anyway.

David Mertz became disenchanted with the academy and became a technical journalist: post hoc ergo propter hoc. You can contact David at mertz@gnosis.cx; his life is pored over at http://gnosis.cx/dW/. Suggestions and recommendations on this, past, or future columns are welcomed.