XML Matters: Getting started with the DocBook XML dialect

David Mertz, Ph.D. (mertz@gnosis.cx), Archivist, Gnosis Software, Inc.
It might be catachrestic, but it is not a malapropism to describe David Mertz' juxtapositions of interests herein as sylleptic. Words is words. David may be reached at mertz@gnosis.cx; his life pored over at http://gnosis.cx/dW/. Suggestions and recommendations on this, past, or future, columns are welcomed.

Summary:  In the third installment of his "XML Matters" column, David Mertz gets you started with DocBook, an SGML/XML dialect that describes the content of technical articles and other documents. David discusses the benefits of using DocBook, and then describes how to plan and modularize a large document conversion project.


Date:  01 Oct 2000
Level:  Introductory

Imagine a historian 100 years in the future finding a library of electronic documents and having to decipher them. A century of ever-accelerating technology changes will ensure her a puzzle of grandiose proportions. But it doesn't have to be this way!

This column arises out of one of my own very practical personal concerns. Over the years, I've written a number of academic papers on Humanities topics, and I'd like to make these papers available on my Web site. Unfortunately I've changed word processors and platforms numerous times over the years, and I've saved many documents that were composed using programs I no longer own or cannot obtain. Even if I had access to these programs, I probably couldn't run them on current computers. In the best cases, I have been able to locate conversion programs that do an adequate job of converting to a program I can run. In other cases, I have used the original word processor format, which is mostly ASCII with a moderate amount of typographic fluff interspersed.

In short, my electronic archives are a mess. Many individuals and organizations suffer with archives in even worse shape. With each software upgrade, large organizations lose massive numbers of important archival documents to changes in technology -- a problem that is compounded over time.

Fortunately, we can create documents that will age much better than those I have accumulated. XML/SGML generally, and DocBook specifically, go a long way toward the creation of flexible and persistent documents.

What is DocBook?

DocBook is an SGML dialect developed by O'Reilly and HaL Computer Systems in 1991. It is currently maintained by the Organization for the Advancement of Structured Information Standards (OASIS). DocBook describes the content of articles, books, technical manuals, and other documents. Although DocBook is focused on technical writing styles, it is general enough to describe most prose writing. In this article, I'll discuss an XML variant of the DocBook DTD that is also available.

The first and ultimate key to time-resistant documents is using open standards, such as XML/SGML, for document formats. These open standards comprise two elements:

  • Syntax, or what a document must look like
  • Semantics, or what a document means

The syntax of a DocBook document is wholly contained in the simple rules of XML markup and in the DocBook DTD that every DocBook document references. The semantics are slightly less distinct. The DTD itself carries certain semantic features, since it determines which elements can or must occur inside other elements. The DocBook tag names are also chosen to have a certain "common sense" semantic content, at least to English speakers. But other, more detailed semantic issues rely on specific publication guidelines, common usage rules, and editorial judgments (for example, which type of list is appropriate at a certain place in the text). The DocBook manuals, cited in Resources, give some general semantic guidelines, but individual publications may have more specific ones.
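
To make the syntactic side concrete, here is a deliberately tiny, hypothetical DTD carried in a document's internal subset. The element names and declarations are invented for illustration and are far simpler than the real DocBook DTD; the point is only to show how a content model states which elements must or may occur inside another:

```xml
<?xml version="1.0"?>
<!-- Hypothetical miniature DTD; NOT the DocBook DTD -->
<!DOCTYPE note [
  <!-- A note must start with exactly one title, followed by one or more paras -->
  <!ELEMENT note (title, para+)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT para (#PCDATA)>
]>
<note>
  <title>Content models</title>
  <para>A validating parser would reject a note that lacks a title.</para>
</note>
```

The DocBook DTD works the same way in principle, only with several hundred elements and much richer content models.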

The second key is of less theoretical importance, but of considerable practical significance: how easy is a document format to interpret and use without its formal specification? It is difficult to make sense of an old binary stream format using a text viewer. But an XML document is usually reasonably readable, even without formal validation and processing. Of course, plain ASCII is even easier to peruse.

Furthermore, some formats are much easier to reconstruct than others, even without a formal specification. Imagine our historian finding two documents: one in MS Word 97 accompanied by an MSDN file-format specification CD, and one in an XML format (even one missing a DTD). Clearly, this historian would have a much easier time reconstructing the XML document's contents. In fact, no vendor -- not even Microsoft -- has done a good job of writing Word 97 converters, even with format specifications. For that matter, imagine having to reconstruct your own documents five years in the future, after your employer has "upgraded" all of your workstations to MS Office 2005.

With the issues of portability and technological change in mind, I've started a project of getting my past academic writing into DocBook format. I believe this project will help preserve my writing, and facilitate making it available in current and future document formats (via conversions). You can download the files used and mentioned in this article.


Semantic flexibility

It is important to keep in mind that a DocBook document annotates the semantics of the document, not its typography or appearance. This focus on document semantics stands in contrast to the focus of word processors, HTML, and even TeX. Word processors often allow style sheets that help you mark conceptual categories like "Header, Level 2," but increasingly they attempt to deliver "what you see is what you get" (WYSIWYG). Even style sheets are rarely uniform across documents. This approach makes broad assumptions about things such as page size and layout, available fonts, and typestyles of elements. Most of these assumptions have little to do with the actual conceptual meaning of the text. And almost all of them make it more difficult to adapt the document to a different format -- whether it be a different printed layout, onscreen display, speech-synthesized version, or an index for Web robots. HTML, originally similar (albeit simpler) to DocBook, has added more and more typographic tags, so that it is currently a hodge-podge of semantics and typography (for example, <h2> versus <b>).

As an easy-to-understand example, many different conceptual elements are rendered with italics in printed books. Different books use different conventions, but any of the following DocBook tags might be rendered in italics when actually typeset:

<abbrev> 
<citetitle> 
<foreignphrase>
<classname> 
<email>

Of course, any one of them might not be rendered in this manner. How these elements are rendered is arbitrary, given the conceptual meaning of the text. In fact, these decisions should be the business of publishers and book designers, not of authors. DocBook gives you the essential structure of a document without attempting to render elements in WYSIWYG fashion. Besides separating content and appearance, DocBook-style conceptual markup lets you work with element types systematically. For example, in creating a glossary of foreign phrases in your document, you could simply search for all occurrences of the tag <foreignphrase>. With a word processor, you would have to use the less effective method of searching for all phrases marked as italics.
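
As a minimal sketch of such markup, a single paragraph might carry several of these elements. The sentences are invented for illustration; only the tag names, the book title, and the e-mail address come from this article:

```xml
<para>
  <!-- Hypothetical text; a stylesheet, not the author, decides
       whether any of these elements end up in italics -->
  The argument holds, <foreignphrase>mutatis mutandis</foreignphrase>,
  for <citetitle>An American Dilemma</citetitle>. Send comments to
  <email>mertz@gnosis.cx</email>.
</para>
```

A glossary tool then only needs to collect every <foreignphrase> element, while the choice of italics, quotation marks, or plain type stays with whoever renders the document.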


Ready, set, mark up!

My first project -- converting my doctoral dissertation to DocBook -- is a big one, but I'll do it in increments. Besides being rather long as dissertations go, the specific document poses several challenges for a documentation system. It contains:

  • Names that require roman diacritics (but no non-European character sets)
  • Footnotes and cross references
  • Page numbering
  • Multiple section levels
  • Epigraphs
  • A bibliography
  • Appendices
  • A dedication and an abstract
  • Mathematical notations
  • References to books, URLs, and e-mail addresses
  • Unusual layout for specific effect
  • Diagrams and diagram commentary (for which I must approximate the original typography)

Overall, I've written a document that provides a good workout for a large number of DocBook tags. The dissertation is already available in its original WordPerfect 7 format and in two differently formatted PDF versions, but none of the versions is very portable or flexible. Using DocBook will be an improvement in both these areas. For now, I will only discuss the markup, not the processing into target formats.

Enough prefacing; let's create the document:

<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
               "http://gnosis.cx/download/docbook/4.12/docbookx.dtd" [
    <!-- Component files. The indented entities are referenced only
         from inside other included files, not from this master. -->
    <!ENTITY bookinfo SYSTEM "bookinfo.sgm">
        <!ENTITY abstract SYSTEM "abstract.sgm">
    <!ENTITY chap1 SYSTEM "chap1.sgm">
    <!ENTITY chap2 SYSTEM "chap2.sgm">
    <!ENTITY chap3 SYSTEM "chap3.sgm">
    <!ENTITY chap4 SYSTEM "chap4.sgm">
    <!ENTITY chap5 SYSTEM "chap5.sgm">
        <!ENTITY chap5_1 SYSTEM "chap5_1.sgm">
        <!ENTITY chap5_2 SYSTEM "chap5_2.sgm">
        <!ENTITY chap5_3 SYSTEM "chap5_3.sgm">
    <!ENTITY chap6 SYSTEM "chap6.sgm">
    <!ENTITY chap7 SYSTEM "chap7.sgm">
    <!ENTITY chap8 SYSTEM "chap8.sgm">
    <!ENTITY appendix1 SYSTEM "appendix1.sgm">
    <!ENTITY appendix2 SYSTEM "appendix2.sgm">
    <!ENTITY biblio SYSTEM "biblio.sgm">
    <!-- Names with diacritics that do not fit in US-ASCII -->
    <!ENTITY Zizek "&Zcaron;i&zcaron;ek">
    <!ENTITY Mocnik "Mo&ccaron;nik">
]>
<book>
<!-- Each reference pulls in one component-level file -->
&bookinfo;
&chap1;
&chap2;
&chap3;
&chap4;
&chap5;
&chap6;
&chap7;
&chap8;
&appendix1;
&appendix2;
&biblio;
</book>

As you can see, this first step is mostly planning. Creating the contents of the component-level elements, such as chapters, will be the real work. However, by creating entity references to these component-level elements, I have divided the creation into more manageable chunks. In addition, I've made it easier to publish or export the individual chapters as separate documents. In this first step, I've specified that the type of document being created is a book, and that it includes a set of component-level elements referencing external files.

Some entities defined at this top level are not used immediately, but only within the included files. For example, the entity &abstract; is only inserted within the bookinfo.sgm document. This is also true of the sections inside Chapter 5. It's a judgment call about what to divide out, but my criterion was that I should create separate files for documents that I might publish separately. I'll probably make other adjustments as I extend this DocBook project. 

At this point I also defined names that I know are mentioned in the document, but do not fit in US-ASCII. I cannot type the diacritics directly, but typing &Zizek;, for example, is an inconspicuous approximation of what I actually want. You could also define entities as abbreviations for whole phrases.
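
A small sketch of both uses follows. The Zizek declaration is the one from the master document; the OASISfull entity name and the paragraph text are hypothetical, chosen only to show a phrase abbreviation. It assumes the DocBook DTD supplies the &Zcaron; and &zcaron; character entities, as the master document above already relies on:

```xml
<?xml version="1.0"?>
<!DOCTYPE para PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
               "http://gnosis.cx/download/docbook/4.12/docbookx.dtd" [
  <!-- Name with diacritics, as declared in the master document -->
  <!ENTITY Zizek "&Zcaron;i&zcaron;ek">
  <!-- Hypothetical abbreviation for a whole phrase -->
  <!ENTITY OASISfull "Organization for the Advancement of Structured Information Standards">
]>
<para>&Zizek; can be typed in plain US-ASCII, and &OASISfull; expands
  to the full phrase wherever it is used.</para>
```

Because the replacement text lives in one declaration, correcting a spelling or changing a phrase means editing a single line rather than every occurrence.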


Inclusions

As the sample code shows, the files included in the master document setup consist of bare document root tags and their contents. No document type declarations or processing instructions should be in the included files. The document type is already declared in the central book master document, so it can be kept in one place. For example, the bookinfo.sgm file contains only the following:

Included XML/SGML subdocument 
<bookinfo>
  <title>The Speculum and The Scalpel</title>
  <subtitle>The Politics of Impotent Representation and
            Non-Representational Terrorism</subtitle>
  <author><firstname>David</firstname><surname>Mertz</surname></author>
  &abstract;
</bookinfo>	

Similarly, each chapter file starts and ends with the <chapter> and </chapter> tags.
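
As a rough sketch, a chapter file such as chap1.sgm might look like the following. The titles and text are placeholders, not content from the dissertation:

```xml
<!-- No XML declaration or DOCTYPE here: the master document supplies them -->
<chapter>
  <title>Hypothetical Chapter Title</title>
  <sect1>
    <title>Hypothetical Section Title</title>
    <para>Section text goes here.</para>
  </sect1>
</chapter>
```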

Again, a major advantage of this modular structure is that it is easy to extract individual components for separate publication. For example, I intend to convert versions of Chapter 5 first for separate distribution. Therefore, I created the following smaller wrapper for that chapter alone:

<?xml version="1.0"?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
               "file://g:/articles/scratch/docbook/4.12/docbookx.dtd" [
  <!ENTITY chap5_1 SYSTEM "chap5_1.sgm">
  <!ENTITY chap5_2 SYSTEM "chap5_2.sgm">
  <!ENTITY chap5_3 SYSTEM "chap5_3.sgm">
]>
<chapter>
  <title>Hegemony, and Other Passing Fads</title>
  <epigraph>
    <attribution>Gould, 1987b, quoting Gunnar Myrdal, <citetitle>An
      American Dilemma</citetitle> (1944)</attribution>
    <para>But there must be still other countless errors of the same
      sort that no living man can yet detect, because of the fog within which
      our type of Western culture envelops us. Cultural influences have set
      up the assumptions about the mind, the body, and the universe with which
      we begin; pose the questions we ask; influence the facts we seek;
      determine the interpretations we give these facts; and direct our
      reaction to these interpretations and conclusions.</para>
  </epigraph>
&chap5_1;
&chap5_2;
&chap5_3;
</chapter>

The bulk of the marked-up content is in three sections, each with a top-level sect1 as its root. However, I have the option of processing the same section content as part of either the book-level or chapter-level wrapper. I may also publish Section 2 as a separate article, which follows the same structure as a chapter.
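
One possible wrapper for publishing Section 2 on its own is sketched below. It assumes that chap5_2.sgm holds a single sect1 element, as described above; the article title is a placeholder, and the system identifier is simply copied from the chapter wrapper and would need adjusting for another machine:

```xml
<?xml version="1.0"?>
<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
               "file://g:/articles/scratch/docbook/4.12/docbookx.dtd" [
  <!-- Same section file that the chapter and book wrappers include -->
  <!ENTITY chap5_2 SYSTEM "chap5_2.sgm">
]>
<article>
  <title>Hypothetical Article Title</title>
&chap5_2;
</article>
```

The section file itself is untouched; only the few lines of wrapper differ between the book, chapter, and article versions.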


Continuing education

This column provides you with only enough information to get a general sense of DocBook. Subsequent columns will cover DocBook tags in greater detail and describe how they are structured. In addition, I have yet to discuss how to convert DocBook documents to more directly readable formats, how to validate them, and how to perform processing operations on them. Stay tuned.

In the meantime, it's a good idea to start skimming through some of the DocBook reference material in Resources. DocBook has lots of tags available, probably more than anyone can remember. For this reason, it doesn't hurt to keep a reference on your lap while you work with DocBook -- even if you use specialized tools to help with the editing. Once you get a sense of what types of tags to look for, and how to put them together, the going gets easier.



Download

Description                                Name               Size   Download method
Files used and mentioned in this article   xml-matters3.zip   37KB   HTTP



Resources

  • Check out these two DocBook articles by the author: Getting comfortable with the DocBook XML dialect (developerWorks, October 2000) and Transforming DocBook documents using XSLT (developerWorks, November 2000).

  • The best place to get started on a more detailed understanding of DocBook is with DocBook: The Definitive Guide, Norman Walsh & Leonard Muellner, O'Reilly, Cambridge, MA 1999. An online version of the book is also available.

  • OASIS is the Organization for the Advancement of Structured Information Standards, a non-profit, international consortium that creates interoperable industry specifications based on public standards such as XML and SGML. Their mission is to promote the use of these standards and their site, OASIS, provides additional information on their organization and the standards.

  • In some respects, a format even more portable and time-protected than DocBook is plain ASCII, or "smart ASCII," which incorporates simple style annotations in the way evolved on Usenet. Of course, ASCII cannot capture all the semantic structure of DocBook, but many times you do not need this. Project Gutenberg is an example of attempts to preserve and utilize texts in this neutral manner.

  • TeX is an important tool whose purpose overlaps DocBook's. The focus of TeX is closer to typography, but TeX also has many elements of semantic markup especially for mathematics.

  • The author's own articles, including the draft of this one, have used a similar "smart ASCII" format for their originals. Markup is automated using the tool Txt2Html. Refer to the ASCII version of this article.

  • Find other articles in David Mertz's XML Matters column.

  • developerWorks XML zone: Find more XML resources here, including articles, tutorials, tips, and standards.

  • IBM Certified Solution Developer -- XML and related technologies: Learn how to get certified.
