Tip: The txt2dw utility

Converting ASCII text to XML

David Mertz (mertz@gnosis.cx), Luddist, Gnosis Software, Inc.
David Mertz greatly welcomes feedback on ways to tweak and improve txt2dw, or any of his public domain utilities. David may be reached at mertz@gnosis.cx; his life pored over at http://gnosis.cx/dW/.

Summary:  developerWorks is moving toward a custom XML dialect as the source format for the articles that appear on the site. However, writing XML is always going to be difficult for people (but easy for machines). One approach to the "human interface" problem is the public domain txt2dw utility that author David Mertz uses for his own articles.

Date:  01 Jan 2002
Level:  Introductory

There are lots of good reasons why many organizations are likely to adopt XML dialects for many of their documentation needs. For these same reasons, developerWorks has developed its own XML DTD for articles. Once you have an XML source -- either a shared standard like DocBook or an in-house dialect -- it is easy to transform the source into arbitrary target formats (HTML, PDF, other XML, and so on). Moreover, validation against a DTD provides a nice check that a document contains all the parts it needs to have, with all the right relationships between them. In addition, XML is a much more platform- and tool-neutral format than those used by proprietary (or even open source) word processors and publishing applications.

Source formats and human interfaces

The problem with XML, however, is that it is a really crummy human interface. Even though XML is just ASCII bytes, typing the element tags into a text editor takes a lot of extra keystrokes. Besides requiring a littering of angle brackets and punctuation to interrupt the flow of a touch typist, it is difficult to make sure that every tag gets closed in the proper order as you type. And how many of us understand even a moderately complex DTD well enough to remember exactly what elements and attributes are allowed at each point in a document? Worst of all, the abundance of XML tags makes visually scanning a document significantly harder.


Making it easy for writers

At least two approaches ease the pain of editing XML documents with a text editor. One approach is to use a higher-level tool for the editing. An XML-aware editor can automate conformance with a DTD, and some of these editors can even hide or highlight the XML markup to make visual scanning easier. Some developerWorks writers, myself included, are particularly fond of XMetaL, but many excellent programs exist. All of these programs, however, run on specific platforms; they each have their own set of quirks (different from those of a favorite text editor); and many of them will set you back a large number of dollars.

The second approach is the one txt2dw takes: Let writers write using tools that don't get in their way. Then let computers worry about how the documents need to be formatted. Word processors try to take this approach, but the state of tools for getting from a word processor to XML is still crude. Personally, I prefer to use the "smart ASCII" markup format that has informally evolved in e-mail, on Usenet, and in project documentation for open-source software projects. One can formalize it just a little bit without getting in the way of writers (while simultaneously aiding the converter).


Using txt2dw

The use of txt2dw could hardly be simpler. Just read some "smart ASCII" input from STDIN, and write some valid XML to STDOUT. For example:

% txt2dw.py < MyArticle.txt > MyArticle.xml

At this point, one has an XML-formatted document. The eventual target will most likely be something different from XML. In my own case -- and this is true for many writers -- the eventual target format is not really all that noteworthy (that's for editors and publishers to worry about and change as needed). All that really matters is that the XML version is valid according to article.dtd.

However, someone will want to transform the XML into something else. XSLT is a common transformation technique, and one for which developerWorks uses the custom style sheet article-html.xsl. Assuming you want the HTML version developerWorks will use, you can simply run something like this:

% xslt article-html.xsl < MyArticle.xml > MyArticle.html

The exact details will vary with the XSLT engine one uses, but the idea will be the same.


Smart ASCII format

For the most part, "smart ASCII" is what you have been writing for years if you use e-mail and Usenet. Most of the details are documented at the top of the script. Asterisks surround bold or heavily emphasized phrases; dashes surround italicized or lightly emphasized phrases; underscores introduce Book or Series Titles. I have adopted the use of single quotes to set apart appnames and filenames (usually rendered in a fixed font), and square brackets to indicate libraries and modules. Take a look at the ASCII version of this Tip in the Resources for how these features started out. These conventions are not quite universal, but they will also not be unfamiliar to readers. They are all very quick to type.
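
The inline conventions above can be illustrated with a few regular expressions. What follows is only a rough sketch, not the code in txt2dw itself: the element names (strong, em, cite, code, module) are placeholders rather than names taken from article.dtd, the helper names are invented for this example, and it assumes that every marker, underscores included, wraps its phrase on both sides.

```python
import re
from xml.sax.saxutils import escape


def _delimited(open_ch, close_ch=None):
    """Build a regex for a phrase wrapped in the given delimiter(s)."""
    close_ch = close_ch or open_ch
    o, c = re.escape(open_ch), re.escape(close_ch)
    edge = rf"[^\s{o}{c}]"   # phrase must start and end on a non-space
    inner = rf"[^{o}{c}]*"   # anything except the delimiters in between
    # The lookarounds keep apostrophes (don't), hyphens (open-source),
    # and double dashes (--) from being mistaken for markup.
    return re.compile(
        rf"(?<![\w{o}]){o}({edge}(?:{inner}{edge})?){c}(?![\w{c}])"
    )


# Hypothetical element names; the real DTD may well use different ones.
RULES = [
    (_delimited("*"), "strong"),       # *heavily emphasized*
    (_delimited("-"), "em"),           # -lightly emphasized-
    (_delimited("_"), "cite"),         # _Book Title_
    (_delimited("'"), "code"),         # 'filename'
    (_delimited("[", "]"), "module"),  # [library]
]


def inline_markup(text):
    """Escape XML special characters, then wrap marked-up phrases in tags."""
    text = escape(text)
    for pattern, tag in RULES:
        text = pattern.sub(rf"<{tag}>\1</{tag}>", text)
    return text
```

Calling inline_markup("Run 'txt2dw' on *every* draft") wraps txt2dw in the code element and every in the strong element, while a word such as don't or open-source passes through untouched.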

Anything that looks like a URL is turned into a link automatically. A fairly simple special format with curly braces and the ALT text before a colon is used to insert images, such as charts and graphs.
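
As a hedged illustration of the two rules just described, the sketch below handles both in a single pass so that the URL inside an image marker is not linked a second time. The output elements (a and img), the function name, and the exact notion of "looks like a URL" are assumptions made for this example; the script may differ on all three.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# One pattern, two alternatives: an image marker such as
# {ALT text: http://host/pic.png}, or a bare URL.
TOKEN = re.compile(
    r"\{(?P<alt>[^{}:]+):\s*(?P<src>[^\s{}]+)\}"
    r"|(?P<url>(?:https?|ftp)://[^\s<>\"{}]+)"
)


def link_and_images(text):
    """Turn bare URLs into links and {ALT: url} markers into images."""
    out, pos = [], 0
    for m in TOKEN.finditer(text):
        out.append(escape(text[pos:m.start()]))
        if m.group("url"):
            # Punctuation that ends a sentence is not part of the URL.
            url = m.group("url").rstrip(".,;:!?)")
            trail = m.group("url")[len(url):]
            out.append(f"<a href={quoteattr(url)}>{escape(url)}</a>{escape(trail)}")
        else:
            alt = m.group("alt").strip()
            out.append(f"<img src={quoteattr(m.group('src'))} alt={quoteattr(alt)}/>")
        pos = m.end()
    out.append(escape(text[pos:]))
    return "".join(out)
```

Because the ALT text is everything before the first colon, an ALT text that itself contains a colon would not be recognized by this sketch.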

At the paragraph level, a few types of paragraphs are allowed, and are indicated by indentation level. Headers are not indented. In addition, any header line that consists only of a row of dashes is stripped out (this helps beautify the ASCII originals). Regular text paragraphs are indented two spaces. Block quotes are indented four spaces. Code samples are indented six spaces (or more). If a code sample begins with a line that consists of a pound sign, some dashes, a title, some more dashes, then another pound sign, then that line is treated as a label for the code sample (in many programming languages, it would be a comment line anyway). If not, no harm is done.
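
The indentation rules above lend themselves to a short classifier. This is a minimal sketch under several assumptions: blocks are separated by blank lines (so a code sample containing a blank line would be split), odd indentations of one, three, or five spaces are rounded down to the nearest documented level, and the special handling of the title and author lines at the top of a file is not modelled. The function and constant names are invented for the example.

```python
import re

CODE_LABEL = re.compile(r"^#-+\s*(.*?)\s*-+#$")  # "#--- Title ---#"
DASH_ROW = re.compile(r"^-+$")                   # underline below a header
BLANK_RUN = re.compile(r"\n(?:[ \t]*\n)+")       # one or more blank lines


def parse_blocks(text):
    """Return (kind, label, body) tuples for each block in the text."""
    blocks = []
    for chunk in BLANK_RUN.split(text.strip("\n")):
        lines = [ln.rstrip() for ln in chunk.splitlines() if ln.strip()]
        if not lines:
            continue
        indent = min(len(ln) - len(ln.lstrip(" ")) for ln in lines)
        if indent == 0:
            # Headers are flush left; rows of dashes are decoration only.
            lines = [ln for ln in lines if not DASH_ROW.match(ln)]
            if lines:
                blocks.append(("header", None, " ".join(lines)))
        elif indent < 4:
            blocks.append(("para", None, " ".join(ln.strip() for ln in lines)))
        elif indent < 6:
            blocks.append(("quote", None, " ".join(ln.strip() for ln in lines)))
        else:
            # Code keeps its line breaks and any indentation beyond the block's own.
            body = [ln[indent:] for ln in lines]
            label = None
            m = CODE_LABEL.match(body[0])
            if m:
                label, body = m.group(1), body[1:]
            blocks.append(("code", label, "\n".join(body)))
    return blocks
```

Fed the body of the template shown later in this Tip, the function should report each ALLCAPS heading as a header with its dashed underline dropped, and the code sample as a code block labeled "Title of code sample".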

There are a few features of txt2dw that are more rigid than I would like. These were concessions to the fairly rigid format of article.dtd. On the plus side, the rigid constraints were exactly the conventions I had adopted anyway, so obeying them was not difficult. Moreover, none of them look odd or unnatural (but you still have to remember to use these features, or create a template that does so). A few moderately intelligent changes are made when ALLCAPS sections are encountered. Here is a usable template:


Template for txt2dw "smart ASCII" source

SERIES: Main Title
Subtitle

Author Name
Title, Affiliation
Date

    Abstract of the article (block quote indented)...

FIRST SECTION
----------------------------------------------------------

  Regular paragraph...

      #----- Title of code sample -----#
      Sample code line 1
      [...]

  Regular paragraph...

MORE SECTIONS...
----------------------------------------------------------

  [...]

  {Picture of Author: http://mysite/mypic.png}
  Author blurb...

Sometimes computer tools that are chosen for good technical reasons wind up forcing users to think like computers. XML markup can have this quality. A writer should not need to spend a lot of time thinking about formats, but should be allowed to focus on content. In any ongoing documentation process, it is worth a little extra up-front programming work to allow writers to think a little bit less about the nitty-gritty of formatting and markup. txt2dw is one tool that lets computers worry about computer matters, while writers worry about words.


Resources
