Imagine a historian 100 years in the future finding a library of electronic documents and having to decipher them. A century of ever-accelerating technology changes will ensure her a puzzle of grandiose proportions. But it doesn't have to be this way!
This column arises out of one of my own very practical personal concerns. Over the years, I've written a number of academic papers on Humanities topics, and I'd like to make these papers available on my Web site. Unfortunately I've changed word processors and platforms numerous times over the years, and I've saved many documents that were composed using programs I no longer own or cannot obtain. Even if I had access to these programs, I probably couldn't run them on current computers. In the best cases, I have been able to locate conversion programs that do an adequate job of converting to a program I can run. In other cases, I have used the original word processor format, which is mostly ASCII with a moderate amount of typographic fluff interspersed.
In short, my electronic archives are a mess. Many individuals and organizations suffer with archives in even worse shape. With each software upgrade, large organizations lose massive numbers of important archival documents to changes in technology -- a problem that is compounded over time.
Fortunately, we can create documents that will age much better than those I have accumulated. XML/SGML generally, and DocBook specifically, go a long way toward the creation of flexible and persistent documents.
DocBook is an SGML dialect developed by O'Reilly and HaL Computer Systems in 1991. It is currently maintained by the Organization for the Advancement of Structured Information Standards (OASIS). DocBook describes the content of articles, books, technical manuals, and other documents. Although DocBook is focused on technical writing styles, it is general enough to describe most prose writing. In this article, I'll discuss an XML variant of the DocBook DTD that is also available.
The first and ultimate key to time-resistant documents is using open standards, such as XML/SGML, for document formats. These open standards comprise two elements:
- Syntax, or what a document must look like
- Semantics, or what a document means
The syntax of a DocBook document is wholly contained in the simple rules of XML markup and in the DocBook DTD inherent in every DocBook document. The semantics are slightly less distinct. For example, the DTD contains certain semantic features that determine which elements can or must occur inside other elements. The DocBook tags are applied so that they have a certain "common sense" semantic content, at least to English speakers. But other, more detailed semantic issues rely on specific publication guidelines, common usage rules, and editorial judgments (for example, governing the type of list that is appropriate in a certain place in the text). Note that the DocBook manuals, cited in Resources, can give you some information on general semantic guidelines, but various publications may have more specific guidelines.
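To make the idea of "which elements can occur inside other elements" concrete, here is a minimal Python sketch. It does not read the DocBook DTD; the `ALLOWED_CHILDREN` table and the `nesting_errors` helper are hypothetical, hand-written stand-ins covering only a tiny slice of what the real DTD permits.

```python
import xml.etree.ElementTree as ET

# Toy content model, written by hand for illustration only.
# The real DocBook DTD allows many more children for each element.
ALLOWED_CHILDREN = {
    "book": {"title", "bookinfo", "chapter", "appendix", "bibliography"},
    "chapter": {"title", "epigraph", "para", "sect1"},
    "sect1": {"title", "para", "sect2"},
}

def nesting_errors(root):
    """Return (parent, child) tag pairs that the toy model does not allow."""
    errors = []
    for parent in root.iter():
        allowed = ALLOWED_CHILDREN.get(parent.tag)
        if allowed is None:
            continue  # element is not covered by the toy model
        for child in parent:
            if child.tag not in allowed:
                errors.append((parent.tag, child.tag))
    return errors

good = ET.fromstring(
    "<book><chapter><title>One</title>"
    "<sect1><title>A</title><para>Text</para></sect1>"
    "</chapter></book>"
)
bad = ET.fromstring("<book><sect1><title>A</title></sect1></book>")

good_errors = nesting_errors(good)  # a section inside a chapter is fine
bad_errors = nesting_errors(bad)    # a section directly inside a book is not
print(good_errors)
print(bad_errors)
```

A validating parser working from the actual DTD does this job properly; the sketch only shows the kind of rule a DTD encodes.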
The second key is of less theoretic importance, but of considerable practical significance. How easy is a document format to interpret and use outside of formal specifications? It is difficult to make sense of an old binary stream format using a text viewer. But an XML document is usually pretty reasonable looking, even without formal validation and processing. Of course, plain ASCII is even easier to peruse.
Furthermore, some formats are much easier to reconstruct than others, even without a formal specification. Imagine our historian finding two documents: one in MS Word 97 accompanied by an MSDN file-format specification CD, and one in an XML format (even one missing a DTD). Clearly, this historian would have a much easier time reconstructing the XML document's contents. In fact, no vendor -- not even Microsoft -- has done a good job of writing Word 97 converters, even with format specifications. For that matter, imagine having to reconstruct your own documents five years in the future, after your employer has "upgraded" all of your workstations to MS Office 2005.
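As a rough illustration of how much can be recovered from XML with no DTD at hand, the following Python sketch parses a well-formed fragment using only the standard library, then lists the tags it finds and the plain text they enclose. The fragment in `found` is a hypothetical sample assembled from element names used in this article.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical fragment: no DOCTYPE and no DTD, just well-formed markup.
found = """<chapter>
  <title>Hegemony, and Other Passing Fads</title>
  <para>See <citetitle>An American Dilemma</citetitle> (1944).</para>
</chapter>"""

root = ET.fromstring(found)

# An inventory of element names hints at the document's structure.
tag_counts = Counter(element.tag for element in root.iter())

# Joining all text nodes and normalizing whitespace recovers the prose.
plain_text = " ".join("".join(root.itertext()).split())

print(dict(tag_counts))
print(plain_text)
```

The tag names alone (chapter, title, para, citetitle) tell the reader a good deal about the structure, which is the point of the comparison with a binary format.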
With the issues of portability and technological change in mind, I've started a project of getting my past academic writing into DocBook format. I believe this project will help preserve my writing, and facilitate making it available in current and future document formats (via conversions).
It is important to keep in mind that a DocBook document annotates the semantics of the document, not its typography or appearance. This focus on document semantics stands in contrast to the focus of word processors, HTML, and even TeX. Word processors often allow style sheets that help you mark conceptual categories like "Header, Level 2," but increasingly they attempt to deliver "what you see is what you get" (WYSIWYG). Even style sheets are rarely uniform across documents. This approach makes broad assumptions about things such as page size and layout, available fonts, and typestyles of elements. Most of these assumptions have little to do with the actual conceptual meaning of the text. And almost all of them make it more difficult to adapt the document to a different format -- whether it be a different printed layout, onscreen display, speech-synthesized version, or an index for Web robots. HTML, originally similar (albeit simpler) to DocBook, has added more and more typographic tags, so that it is currently a hodge-podge of semantics and typography (for example,
As an easy-to-understand example, many different conceptual elements are rendered with italics in printed books. Different books use different conventions, but any of the following DocBook tags might be rendered in italics when actually typeset:
<abbrev> <citetitle> <foreignphrase> <classname> <email>
Of course, any one of them might not be rendered in this manner. How these elements are rendered is arbitrary, given the conceptual meaning of the text. In fact, these decisions should be the business of publishers and book designers, not of authors. DocBook gives you the essential structure of a document without attempting to render elements in WYSIWYG fashion. Besides separating content and appearance, DocBook-style conceptual markup lets you work with element types systematically. For example, in creating a glossary of foreign phrases in your document, you could simply search for all occurrences of the tag <foreignphrase>. With a word processor, you would have to use the less effective method of searching for all phrases marked as italics.
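The glossary search described above might look something like the following Python sketch. The sample paragraph and the `foreign_phrases` helper are hypothetical; only the element names come from DocBook.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample text; both tags below might be typeset in italics.
sample = """<para>The point holds <foreignphrase>a fortiori</foreignphrase>
for <citetitle>An American Dilemma</citetitle>, which questioned the
<foreignphrase>status quo</foreignphrase>, and <foreignphrase>a
fortiori</foreignphrase> for later work.</para>"""

def foreign_phrases(xml_text):
    """Return the distinct <foreignphrase> contents, sorted alphabetically."""
    root = ET.fromstring(xml_text)
    phrases = set()
    for element in root.iter("foreignphrase"):
        # Normalize whitespace so a phrase broken across lines matches itself.
        phrases.add(" ".join("".join(element.itertext()).split()))
    return sorted(phrases)

glossary_terms = foreign_phrases(sample)
print(glossary_terms)
```

The book title is skipped even though it, too, might be printed in italics. A search for italic text in a word processor could not make that distinction.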
My first project -- converting my doctoral dissertation to DocBook -- is a big one, but I'll do it in increments. Besides being rather long as dissertations go, the specific document poses several challenges for a documentation system. It contains:
- Names that require roman diacritics (but no non-European character sets)
- Footnotes and cross references
- Page numbering
- Multiple section levels
- A bibliography
- A dedication and an abstract
- Mathematical notations
- References to books, URLs, and e-mail addresses
- Unusual layout for specific effect
- Diagrams and diagram commentary (for which I must approximate the original typography)
Overall, I've written a document that provides a good workout for a large number of DocBook tags. The dissertation is already available in its original WordPerfect 7 format and in two differently formatted PDF versions, but none of the versions is very portable or flexible. Using DocBook will be an improvement in both these areas. For now, I will only discuss the markup, not the processing into target formats.
Enough prefacing; let's create the document:
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
    "http://gnosis.cx/download/docbook/4.12/docbookx.dtd" [
  <!ENTITY bookinfo SYSTEM "bookinfo.sgm">
  <!ENTITY abstract SYSTEM "abstract.sgm">
  <!ENTITY chap1 SYSTEM "chap1.sgm">
  <!ENTITY chap2 SYSTEM "chap2.sgm">
  <!ENTITY chap3 SYSTEM "chap3.sgm">
  <!ENTITY chap4 SYSTEM "chap4.sgm">
  <!ENTITY chap5 SYSTEM "chap5.sgm">
  <!ENTITY chap5_1 SYSTEM "chap5_1.sgm">
  <!ENTITY chap5_2 SYSTEM "chap5_2.sgm">
  <!ENTITY chap5_3 SYSTEM "chap5_3.sgm">
  <!ENTITY chap6 SYSTEM "chap6.sgm">
  <!ENTITY chap7 SYSTEM "chap7.sgm">
  <!ENTITY chap8 SYSTEM "chap8.sgm">
  <!ENTITY appendix1 SYSTEM "appendix1.sgm">
  <!ENTITY appendix2 SYSTEM "appendix2.sgm">
  <!ENTITY biblio SYSTEM "biblio.sgm">
  <!ENTITY Zizek "Žižek">
  <!ENTITY Mocnik "Močnik">
]>
<book>
  &bookinfo;
  &chap1;
  &chap2;
  &chap3;
  &chap4;
  &chap5;
  &chap6;
  &chap7;
  &chap8;
  &appendix1;
  &appendix2;
  &biblio;
</book>
As you can see, this first step is mostly planning. Creating the contents of the component-level elements, such as chapters, will be the real work. However, by creating entity references to these component-level elements, I have divided the creation into more manageable chunks. In addition, I've made it easier to publish or export the individual chapters as separate documents. In this first step, I've specified that the type of document being created is a book, and that it includes a set of component-level elements referencing external files.
Some entities defined at this top level are not used immediately, but only within the included files. For example, the entity &abstract; is only inserted within the bookinfo.sgm document. This is also true of the sections inside Chapter 5. It's a judgment call about what to divide out, but my criterion was that I should create separate files for documents that I might publish separately. I'll probably make other adjustments as I extend this DocBook project.
At this point I also defined names that I know are mentioned in the document, but do not fit in US-ASCII. I cannot type the diacritics directly, but typing &Zizek;, for example, is an inconspicuous approximation of what I actually want. You could also use abbreviations of whole phrases.
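To see entity abbreviations at work, here is a small Python sketch that lets the standard-library parser expand entities declared in an internal DTD subset. It assumes the names are spelled with numeric character references so that the file stays pure ASCII; the `dbtdg` entity and the sample sentence are hypothetical, added only to show a whole-phrase abbreviation.

```python
import xml.etree.ElementTree as ET

# Internal entities: two names with diacritics and one phrase abbreviation.
# &#381; is Z-caron, &#382; is z-caron, &#269; is c-caron.
doc = """<!DOCTYPE para [
  <!ENTITY Zizek "&#381;i&#382;ek">
  <!ENTITY Mocnik "Mo&#269;nik">
  <!ENTITY dbtdg "DocBook: The Definitive Guide">
]>
<para>&Zizek; and &Mocnik; appear here; so does <citetitle>&dbtdg;</citetitle>.</para>"""

# The parser replaces each entity reference with its declared value.
root = ET.fromstring(doc)
expanded = "".join(root.itertext())

# ascii() keeps the output readable on terminals without Unicode support.
print(ascii(expanded))
```

The source text stays easy to type and to read, while the parsed document carries the correct characters.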
As the sample code shows, the files included in the master document setup consist of bare document root tags and their contents. No document type declarations or processing instructions should be in the included files. The document type is already declared in the central book master document, so it can be kept in one place. For example, the bookinfo.sgm file contains only the following:
Included XML/SGML subdocument

<bookinfo>
  <title>The Speculum and The Scalpel</title>
  <subtitle>The Politics of Impotent Representation and Non-Representational Terrorism</subtitle>
  <author><firstname>David</firstname><surname>Mertz</surname></author>
  &abstract;
</bookinfo>
Similarly, each chapter file starts and ends with the <chapter> and </chapter> tags.
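To show how the pieces fit together, the following Python sketch imitates what an XML processor does with external entities, using naive text substitution. It is not a real entity resolver: the `expand` helper is hypothetical, the "files" live in a dictionary rather than on disk, the placeholder contents are invented, and there is no protection against entities that include each other in a cycle.

```python
import re
import xml.etree.ElementTree as ET

ENTITY_DECL = re.compile(r'<!ENTITY\s+(\w+)\s+SYSTEM\s+"([^"]+)"\s*>')
DOCTYPE = re.compile(r'<!DOCTYPE[^\[>]*(\[.*?\])?\s*>', re.DOTALL)
ENTITY_REF = re.compile(r'&(\w+);')

def expand(master, files):
    """Inline SYSTEM entities by text substitution (no cycle detection)."""
    entities = dict(ENTITY_DECL.findall(master))

    def inline(match):
        name = match.group(1)
        if name not in entities:
            return match.group(0)  # leave &amp; and similar references alone
        # Included files may themselves contain entity references.
        return ENTITY_REF.sub(inline, files[entities[name]])

    body = DOCTYPE.sub("", master, count=1)  # drop the declaration itself
    return ENTITY_REF.sub(inline, body)

# Placeholder file contents; each holds a bare root element, no DOCTYPE.
files = {
    "bookinfo.sgm": "<bookinfo><title>The Speculum and The Scalpel</title>"
                    "&abstract;</bookinfo>",
    "abstract.sgm": "<abstract><para>Placeholder abstract.</para></abstract>",
    "chap1.sgm": "<chapter><title>Placeholder chapter</title></chapter>",
}
master = """<!DOCTYPE book [
  <!ENTITY bookinfo SYSTEM "bookinfo.sgm">
  <!ENTITY abstract SYSTEM "abstract.sgm">
  <!ENTITY chap1 SYSTEM "chap1.sgm">
]>
<book>&bookinfo;&chap1;</book>"""

book = ET.fromstring(expand(master, files))
children = [child.tag for child in book]
print(children)
```

Because the included files carry no document type declarations of their own, the same fragment can be dropped into a book-level or a chapter-level wrapper without change.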
Again, a major advantage of this modular structure is that it is easy to extract individual components for separate publication. For example, I intend to convert versions of Chapter 5 first for separate distribution. Therefore, I created the following smaller wrapper for that chapter alone:
<?xml version="1.0"?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
    "file://g:/articles/scratch/docbook/4.12/docbookx.dtd" [
  <!ENTITY chap5_1 SYSTEM "chap5_1.sgm">
  <!ENTITY chap5_2 SYSTEM "chap5_2.sgm">
  <!ENTITY chap5_3 SYSTEM "chap5_3.sgm">
]>
<chapter>
  <title>Hegemony, and Other Passing Fads</title>
  <epigraph>
    <attribution>Gould, 1987b, quoting Gunnar Myrdal, <citetitle>An American Dilemma</citetitle> (1944)</attribution>
    <para>But there must be still other countless errors of the same sort that no living man can yet detect, because of the fog within which our type of Western culture envelops us. Cultural influences have set up the assumptions about the mind, the body, and the universe with which we begin; pose the questions we ask; influence the facts we seek; determine the interpretations we give these facts; and direct our reaction to these interpretations and conclusions.</para>
  </epigraph>
  &chap5_1;
  &chap5_2;
  &chap5_3;
</chapter>
The bulk of the marked-up content is in three sections, each with a top-level sect1 as its root. However, I have the option of processing the same section content as part of either the book-level or chapter-level wrapper. I may also publish Section 2 as a separate article, which follows the same structure as a chapter.
This column provides you with only enough information to get a general sense of DocBook. Subsequent columns will cover DocBook tags in greater detail and describe how they are structured. In addition, I have yet to discuss how to convert DocBook documents to more directly readable formats, how to validate them, and how to perform processing operations on them. Stay tuned.
In the meantime, it's a good idea to start skimming through some of the DocBook reference material in Resources. DocBook has lots of tags available, probably more than anyone can remember. For this reason, it doesn't hurt to keep a reference on your lap while you work with DocBook -- even if you use specialized tools to help with the editing. Once you get a sense of what types of tags to look for, and how to put them together, the going gets easier.
- The best place to get started on a more detailed understanding of DocBook is with DocBook: The Definitive Guide, Norman Walsh & Leonard Muellner, O'Reilly, Cambridge, MA 1999. An online version of the book is also available.
- OASIS is the Organization for the Advancement of Structured Information Standards, a non-profit, international consortium that creates interoperable industry specifications based on public standards such as XML and SGML. Its mission is to promote the use of these standards. The OASIS site provides additional information on the organization and the standards.
- In some respects, a format even more portable and time-protected than DocBook is plain ASCII, or "smart ASCII," which incorporates simple style annotations in the way evolved on Usenet. Of course, ASCII cannot capture all the semantic structure of DocBook, but many times you do not need this. Project Gutenberg is an example of attempts to preserve and utilize texts in this neutral manner.
- TeX is an important tool whose purpose overlaps DocBook's. The focus of TeX is closer to typography, but TeX also has many elements of semantic markup, especially for mathematics.
- My own articles, including the draft of this one, have used a similar "smart ASCII" format for their originals. Markup is automated using the tool Txt2Html. Refer to the ASCII version of this article.
- Files used and mentioned in this article can be found at: XML Matters #3 files.
- Read more about DocBook in Getting comfortable with the DocBook XML dialect and Transforming DocBook documents using XSLT.
- Find other articles in David Mertz's XML Matters column.
It might be catachrestic, but it is not a malapropism to describe David Mertz' juxtapositions of interests herein as sylleptic. Words is words. David may be reached at firstname.lastname@example.org; his life pored over at http://gnosis.cx/dW/. Suggestions and recommendations on this, past, or future, columns are welcomed.