XML Matters: Getting comfortable with the DocBook XML dialect

David Mertz, Ph.D. (mertz@gnosis.cx), Archivist, Gnosis Software, Inc.
David Mertz became disenchanted with the academy and became a technical journalist: post hoc ergo propter hoc. You can contact David at mertz@gnosis.cx; his life may be pored over at http://gnosis.cx/dW/. Suggestions and recommendations on this, past, or future columns are welcome.

Summary:  This column continues the discussion of the benefits of using DocBook to convert documents in heterogeneous formats to a single, standard XML format. It also looks at some DocBook tags in greater detail and discusses how to compose a basic DocBook document.

Date:  01 Oct 2000
Level:  Introductory

If your document archives are like mine, they contain files in every format from Microsoft Word 3.3 to HTML to WordPerfect 7 to ASCII text. Often, you can't even obtain the software you used to create the original documents. Fortunately, DocBook, an SGML dialect for creating all-purpose technical documents, can help you move your files into a single, standard XML format. In this column, I'll explain how to use the XML version of the DocBook DTD to convert an existing document.

DocBook is a rather complex DTD with hundreds of elements. Fortunately, you don't need to know all of DocBook to work with it. As you'll see, the basic elements are arranged logically, and most elements follow similar patterns for nesting child elements.

Creating content -- different approaches

It's easy to make small typos in DocBook. The key to working with it is having a good reference handy while you're working. I'm partial to O'Reilly's excellent hardcopy text, but the identical material is also available online (see Resources). With your reference in hand, you can create DocBook content in one of two ways:

  • Using a specialized XML editor
  • Using a generic text editor plus an external validator

DocBook is detailed enough that you need some automation to ensure conformance to the DTD. Using either approach, you can work for stretches, and validate and fix glitches only occasionally.

Most specialized XML editors help you enter elements and attributes. Many programs present context-sensitive prompts for available tags or lists of tags that exist in the current DTD (for example, DocBook's). However, be aware that specialized editors are generally less flexible than good general-purpose text editors that provide features like multiple clipboards, syntax highlighting, column marking, and section/function browsing.

Unfortunately, I've found that the quality of XML tools is still disappointing. I've tested a number of XML validation and transformation tools and have yet to locate a completely accurate command-line XML validator. In fact, I've had to settle for using XML Spy under Win32, and Xeena on other platforms with Java support. Both tools do a good job of validation, but are somewhat cumbersome to use. (See Resources for reviews of XML Spy, Xeena, and general text editors.)


Preparing the DocBook DTD

The first step in creating an XML DocBook document is to prepare its declaration. Let's look at Listing 1, a document declaration example, and step through its different parts:


Listing 1. XML document type declaration
<?xml version="1.0"?>

<!DOCTYPE chapter
 PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
                  "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" [
   <!ENTITY Zizek "&Zcaron;i&zcaron;ek">
   <!ENTITY Mocnik "Mo&ccaron;nik">
]>
<?xml-stylesheet type="text/xsl" href="chapter.xsl"?>
<chapter>
   <!-- The actual chapter contents are here -->
</chapter>

XML declaration: The first thing we include is the <?xml> declaration, which indicates that the document is XML. Next is the <!DOCTYPE> tag -- the document type declaration. The document type declaration contents are worth looking at in detail.

DOCTYPE tag element: The first thing to notice in the <!DOCTYPE> tag is the name of the root element (chapter) that will be used in the document. Deciding what type of root element to use is important because it states the document's purpose, at least in broad terms. The root element generally determines the rough size of the document.

At the broadest level, you can specify a root element of set when including two or more books; for example, a whole reference collection. In this case, you wouldn't necessarily put everything in the same file, but would instead use inclusions, as outlined in "XML Matters #3." More likely, you will be creating a book, which is a collection of parts or chapters, plus other sections at the same conceptual level as parts/chapters. Even more modestly, you might be creating an article or a chapter, as in our example in Listing 1. In practice, a chapter or article is the smallest root element used for a DocBook document.
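
To make the larger root elements concrete, here is a minimal sketch of a document whose root is book rather than chapter. The titles and paragraph text are placeholders I made up for illustration; only the element names and the DTD identifiers come from DocBook itself.

```xml
<?xml version="1.0"?>
<!DOCTYPE book
  PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
         "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd">
<book>
  <!-- All titles and text below are hypothetical placeholders -->
  <title>Collected Essays</title>
  <preface>
    <title>Preface</title>
    <para>A preface sits at the same conceptual level as a chapter.</para>
  </preface>
  <chapter>
    <title>First Chapter</title>
    <para>Chapter content goes here.</para>
  </chapter>
  <chapter>
    <title>Second Chapter</title>
    <para>More chapter content goes here.</para>
  </chapter>
</book>
```

Note that the name after <!DOCTYPE changes along with the root element; a validator should complain if the two disagree.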

Next in the <!DOCTYPE> declaration we see the PUBLIC and system identifiers. The part following PUBLIC is an SGML feature, and you don't really need it in XML documents. If you do include it, be sure to spell it exactly the way it's spelled in the DTD. The system identifier is a URL that points to the DTD, which is where all the DocBook definitions are located. You can download the DTD from that URL if you'd like to look at it. Be sure to spell the URL correctly as well, or your validating programs won't be able to find the DTD.

Finally, inside the square brackets in the <!DOCTYPE> tag is the "internal subset," which is simply a way to add declarations that apply only to your document. In this case, I declared two entities that serve as aliases for names that are hard to type on a US keyboard.
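
As a rough sketch of how those aliases get used, assume the document type declaration from Listing 1 is in place. The chapter title and sentence here are invented for the example; the entity names Zizek and Mocnik are the ones declared in the internal subset.

```xml
<chapter>
  <!-- Hypothetical content; relies on the entities declared in Listing 1 -->
  <title>Reading &Zizek;</title>
  <para>
    Both &Zizek; and &Mocnik; are typed here in plain ASCII. The
    parser expands each reference to the accented spelling.
  </para>
</chapter>
```

The references inside the entity values themselves (such as &Zcaron;) are not declared in the internal subset; they should be resolved through the character entity sets that the DocBook DTD pulls in, which is one more reason the system identifier has to be reachable.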


Processing instructions

Following the document type declaration tag in Listing 1, we have a processing instruction, <?xml-stylesheet...>. I won't go into detail about Extensible Stylesheet Language Transformations (XSLT) until the next "XML Matters" column. However, this particular processing instruction works much like a link to a cascading style sheet (CSS) in an HTML document. In this case, I added a reference to an XSL document that contains some rules for transforming the DocBook document. Like a link to a cascading style sheet, this type of processing instruction is optional, even for a transformation tool. Depending on the tool, you can specify a transformation using whatever XSLT you want. A processing instruction is just one way to do it.

Finally, we see the <chapter> tag, the root element named in the document type declaration. The chapter content goes inside this tag.


Creating a chapter

Things like chapters, articles, prefaces, and bibliographies are all components of documents. That is to say, a component is a unit that addresses a single topic at a moderate level of specificity. Generally, the element names reflect their English meanings.

The structures of <chapter>, <appendix>, and <preface> elements are similar. An <article> has nearly the same structure as these elements, but the front matter is usually enclosed in an <articleinfo> element (named <artheader> before DocBook 4.0). A component like <chapter> includes front matter such as <title>, followed by sections and/or block elements (for example, <para>).
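
For comparison with a chapter, here is a minimal sketch of an article. It assumes the DocBook XML V4.1.2 DTD from Listing 1, where, as far as I know, the front-matter wrapper is named <articleinfo> (older releases of DocBook call it <artheader>). The title and abstract text are placeholders.

```xml
<article>
  <!-- Hypothetical title and abstract, for illustration only -->
  <title>A Short Note on Markup</title>
  <articleinfo>
    <author>
      <firstname>David</firstname>
      <surname>Mertz</surname>
    </author>
    <abstract>
      <para>One or two sentences describing the article.</para>
    </abstract>
  </articleinfo>
  <para>The body of the article begins here.</para>
</article>
```

To validate this as a standalone document, the root element named in the <!DOCTYPE> declaration would need to be article rather than chapter.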

A <title> element is usually required as front matter for components and sections. Most other front matter is optional, but it might include author information, abstracts, graphics, or other information that has more to do with describing a component than constituting the component. Let's look at Listing 2, an example of a valid, highly abridged chapter (assuming the document type declaration described in Listing 1):


Listing 2. DocBook chapter markup

<chapter>
  <title>Hegemony, and Other Passing Fads</title>
 
  <epigraph>
    <attribution>
      Gould, 1987b, quoting Gunnar Myrdal,
      <citetitle>An American Dilemma</citetitle> (1944)
    </attribution>
    <para>
      But there must be still other countless errors of the
      same sort that no living man can yet detect, because
      of the fog within which our type of Western culture
      envelops us.  Cultural influences have set up the
      assumptions about the mind, the body, and the
      universe with which we begin; pose the questions we
      ask; influence the facts we seek; determine the
      interpretations we give these facts; and direct our
      reaction to these interpretations and
      conclusions.
    </para>
  </epigraph>
  <sect1>
    <title>Day-Care Devil Worshipers</title>
    <!-- para's, sect2's, epigraph's, and other block elements -->
  </sect1>
  <sect1>
    <!-- a title, followed by more blocks -->
  </sect1>
</chapter>

As the example shows, you may want to divide a moderately long chapter into sections (<sect1>). How big to make the sections is a judgment call, but there are two strategies for creating them: you can use either the <sect1> through <sect5> hierarchy or the <section> element, nested recursively. For my own purpose -- writing philosophical prose -- I felt that explicitly numbered section levels were better. I had a distinct sense of how important each type of section must be, and the numbering matched that well. However, for something like a technical reference, your section material might be nested in different places and at different depths. For example, a function call might be described in an overview and then later in the chapter in a programming example. In this case, the <section> element works better, and it can be nested to more than five levels.
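
Here is a minimal sketch of the recursive style, again assuming the declaration from Listing 1. The titles and text are hypothetical, loosely following the technical-reference scenario above.

```xml
<chapter>
  <!-- Hypothetical titles and text, for illustration only -->
  <title>File Functions</title>
  <section>
    <title>Overview</title>
    <para>A first-level section.</para>
    <section>
      <title>Opening Files</title>
      <para>A second-level section, nested inside the first.</para>
      <section>
        <title>The open() Call</title>
        <para>A third-level section, using the same element name.</para>
      </section>
    </section>
  </section>
</chapter>
```

Because every level uses the same element name, a <section> can be moved deeper or shallower without being renamed. With the numbered hierarchy, moving a <sect2> up a level means renaming it to <sect1> and renumbering everything inside it.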

Sections are bigger than block elements and are essentially lists of blocks. With a shorter component, you might begin using block elements immediately, without any sections. Basically, a block element is either a paragraph or an element at the same conceptual/hierarchical level as a paragraph (such as a list, equation, or illustration). There are other specialized block types, but these are the most general.

The only thing "smaller" than a block element is an inline element. Generally, you set block elements apart from other blocks with vertical white space, framing boxes, or the like. In contrast, an inline element is continuous with the words around it, but it is marked by a different font, color, hyperlink, and so on. In our chapter example, the epigraph is like a short section containing two blocks: the attribution <attribution>, and the epigraph <para>. The attribution contains a <citetitle>, an inline element that will likely be rendered in italics or underlined when printed, or as a hotlink to the bibliography if rendered in HTML.
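
A small sketch may help to show the two levels side by side. The section below is hypothetical; it holds two blocks (a paragraph and a list), and the paragraph in turn holds two inline elements.

```xml
<sect1>
  <!-- Hypothetical section, for illustration only -->
  <title>Blocks and Inlines</title>
  <para>
    A paragraph is a block. Within it, <emphasis>emphasized
    words</emphasis> and a title such as <citetitle>An American
    Dilemma</citetitle> are inline elements.
  </para>
  <itemizedlist>
    <listitem>
      <para>A list is a block at the same level as a paragraph.</para>
    </listitem>
    <listitem>
      <para>Each list item contains blocks of its own.</para>
    </listitem>
  </itemizedlist>
</sect1>
```

How any of this looks on the page is left to the stylesheet; the markup records only which role each piece of text plays.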


Until next time

The elements and structure outlined here are enough to get you started with creating your own DocBook documents. The next column will show how to transform our DocBook source document into other formats, and it will introduce Extensible Stylesheet Language Transformations (XSLT), which are useful outside of DocBook applications as well.


Resources
