XML Matters: reStructuredText


A light, powerful document markup


In the past, this column has looked at alternatives to XML -- document formats that satisfy many of the same purposes for which you might use XML. reStructuredText continues this tradition. In contrast to YAML, which is good for data formats, reStructuredText is designed for documentation; in contrast to smart ASCII, reStructuredText is heavier, more powerful, and more formally specified. All of these formats, in contrast to XML, are easy and natural to read and edit with standard text editors. Working with XML more-or-less requires specialized XML editors, such as those I have reviewed previously (see Related topics).

reStructuredText (frequently abbreviated as reST) is part of the Python Docutils project. The goal of this project is to create a set of tools for manipulating plaintext documents, including exporting them to structured formats like HTML, XML, and TeX. While this project comes from the Python community, the needs it addresses extend beyond Python. Programmers and writers of all types frequently create documents such as READMEs, HOWTOs, FAQs, application manuals, and, in Python's case, PEPs (Python Enhancement Proposals). For these types of documents, requiring users to deal with verbose and difficult formats like XML or LaTeX is not generally reasonable, even if those users are programmers. But it is still often desirable to utilize these types of documents for purposes beyond simple viewing (such as indexing, compilation, pretty-printing, filtering, etc.).

The Docutils tools can serve the needs of Python programmers in much the same way that JavaDoc helps Java programmers, or POD helps Perl programmers. The documentation within Python modules can be converted to Docutils document trees, and in turn to various output formats (usually within a single script). But for this article, the more interesting use is for general documentation. For articles like this, and even for my forthcoming book, I write using smart ASCII; but I am coming to feel that I would be better off with the formality of reStructuredText (and I may develop tools to convert my existing documents).

As of this writing, the Docutils project is under development, and has not released a stable version. The tools that exist are good, but the overall project is a mixture of promises, good intentions, partial documentation, and some actual working tools. However, progress is steady, and what you can do at this point is very useful.

Examples of reStructuredText

You can get a better sense of what reStructuredText is about with a brief example. The following text is an example in PEP 287 (of part of a hypothetical PEP):

Listing 1. Plaintext version of PEP

    This PEP proposes adding frungible doodads [1] to the
    core. It extends PEP 9876 [2] via the BCA [3] mechanism.


References and Footnotes


    [2] PEP 9876, Let's Hope We Never Get Here

    [3] "Bogus Complexity Addition"

The format in Listing 1 is exactly how PEPs were formatted prior to PEP 287. If reStructuredText is used to mark up the same PEP, it could look like this:

Listing 2. reST version of PEP

This PEP proposes adding `frungible doodads`_ to the core.
It *extends* PEP 9876 [#pep9876]_ via the BCA [#]_ mechanism.


References & Footnotes

.. _frungible doodads:

.. [#pep9876] PEP 9876, Let's Hope We Never Get Here

.. [#] "Bogus Complexity Addition"

A few details differ from the plaintext, but the very light sprinkling of special characters does not really harm readability. You would not need to look twice to read this if you saw it in a text editor or on a printed page.

The reST-formatted document in Listing 2 can be automatically transformed into an XML dialect, such as that defined by the Docutils Generic DTD:

Listing 3. Docutils XML version of PEP
<?xml version="1.0" encoding="UTF-8"?>
<document source="test">
  <section id="abstract" name="abstract">
    <paragraph>This PEP proposes adding <reference
      refname="frungible doodads">Frungible doodads</reference>
      to the core. It <emphasis>extends</emphasis> <reference
      refuri="...">PEP 9876</reference> <footnote_reference auto="1"
      id="id1" refname="pep9876"/> via the BCA <footnote_reference
      auto="1" id="id2"/> mechanism.</paragraph>
  </section>
  <section id="references-footnotes"
           name="references &amp; footnotes">
    <title>References &amp; Footnotes</title>
    <target id="frungible-doodads" name="frungible doodads"
            refuri="..."/>
    <footnote auto="1" id="pep9876" name="pep9876">
      <paragraph><reference refuri="...">PEP
        9876</reference>, Let's Hope We Never Get Here</paragraph>
    </footnote>
    <footnote auto="1" id="id3">
      <paragraph>"Bogus Complexity Addition"</paragraph>
    </footnote>
  </section>
</document>

You can see several things in contrasting these three formats. The most dramatic difference is how much harder it is to skim the XML version. But it is also notable just how much information the reStructuredText tools have located in the reST document. References of several types are properly matched up, document sections are identified, and character-level typographic markup is added. In other examples, linked TOCs can be generated during processing, along with other special directives.

The docutils project structure

The docutils package consists of quite a few subpackages, in fairly complicated relation to each other. PEP 258, Docutils Design Specification, contains a chart that is useful for understanding the overall pattern:

Figure 1. Docutils project model

A more complete explanation of the component subpackages is contained in that PEP, but a brief explanation is worth repeating here.

The heavy work of converting a reST text into a tree of nodes is done by the docutils.parsers.rst subpackage. The reStructuredText parser treats a source in a line-oriented fashion, looking for a state transition on each line; if none of the other transition patterns are found, the text transition catches the line. Transitions consist of features like changes in indentation, special leading symbols, and so on. The default just includes the next line as more text within the current node.
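
The line-oriented approach can be illustrated with a small sketch. This is not the Docutils parser: the transition table, the `parse_lines` function, and the node names are all hypothetical, and the patterns are simplified stand-ins for the real reST grammar. The sketch only shows the idea of trying each transition pattern in turn and letting a "text" transition catch whatever is left.

```python
import re

# Hypothetical transition table: (name, pattern) pairs tried in order.
# These patterns are simplified stand-ins, not the real reST grammar.
TRANSITIONS = [
    ("blank", re.compile(r"^\s*$")),
    ("bullet", re.compile(r"^[-*+] +(?P<text>.*)$")),
    ("directive", re.compile(r"^\.\. +(?P<text>.*)$")),
]


def parse_lines(source):
    """Return a flat list of (node_type, text) pairs for `source`."""
    nodes = []
    for line in source.splitlines():
        for name, pattern in TRANSITIONS:
            match = pattern.match(line)
            if match:
                break
        else:
            # No other transition matched: the "text" transition catches it.
            name, match = "text", None

        if name == "blank":
            # A blank line ends the current node.
            nodes.append(("blank", ""))
        elif name == "text" and nodes and nodes[-1][0] == "paragraph":
            # Default behavior: the line is more text in the current node.
            kind, text = nodes[-1]
            nodes[-1] = (kind, text + " " + line.strip())
        elif name == "text":
            nodes.append(("paragraph", line.strip()))
        else:
            nodes.append((name, match.group("text")))
    # Drop the blank separators once they have done their job.
    return [node for node in nodes if node[0] != "blank"]


sample = """This PEP proposes adding doodads
to the core.

- first item
.. [#] "Bogus Complexity Addition"
"""
result = parse_lines(sample)
```

Here the two leading lines end up in a single paragraph node, because the second line triggers no transition and is simply appended to the node already open.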

This structure is similar to that used in the smart ASCII parsers txt2dw and txt2html. Other parsers would live under the docutils.parsers hierarchy, but none are currently provided. However, there is an experimental Python source code parser that treats a Python source file as a document tree.

Once the parser has generated a tree of nodes for a document, the docutils.transforms subpackage manipulates the tree in various ways. For example, if you specified a directive to include a table of contents, the document tree is walked to identify the items to list. The transforms also clean up references and links at this stage. During the initial parsing pass, placeholders that cue the transforms fill the places in the tree where unresolved elements will go.
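
The placeholder idea can be sketched without Docutils itself. In the following illustration, the `Node` class, the `pending_contents` tag, and the `resolve_contents` function are hypothetical names invented for the example, not the Docutils API. A first pass is assumed to have left a placeholder in the tree, and a later pass walks the tree and fills it in with a list of section titles.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A tiny stand-in for a document-tree node (not the Docutils class)."""
    tag: str
    text: str = ""
    children: list = field(default_factory=list)


def walk(node):
    """Yield `node` and all of its descendants in document order."""
    yield node
    for child in node.children:
        yield from walk(child)


def resolve_contents(document):
    """Replace each 'pending_contents' placeholder with a list of titles."""
    titles = [n.text for n in walk(document) if n.tag == "title"]
    for node in walk(document):
        if node.tag == "pending_contents":
            node.tag = "bullet_list"
            node.children = [Node("list_item", title) for title in titles]


# The "parser" pass left a placeholder where the contents list belongs.
doc = Node("document", children=[
    Node("pending_contents"),
    Node("section", children=[Node("title", "Abstract")]),
    Node("section", children=[Node("title", "References & Footnotes")]),
])
resolve_contents(doc)
toc = [item.text for item in doc.children[0].children]
```

Deferring the work this way means the placeholder can appear before the sections it lists, since the titles are only collected after the whole tree exists.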

Event-oriented output

The various docutils.writers modules are probably the primary points of interest for most readers of this article. Some of the more interesting writers are still kept in the experimental "sandbox" area at the time of this writing (check the Docutils Web site in Related topics), but the principles are the same in any case. A writer module should define a Writer class that inherits from docutils.writers.Writer. This Writer class defines some settings, but mostly defines a .translate() method that might look something like:

Listing 4. Typical custom Writer.translate() method
def translate(self):
    visitor = DocBookTranslator(self.document)
    # Walk the tree so the visitor's visit/depart methods get called
    self.document.walkabout(visitor)
    self.output = visitor.astext()

The writer, as you can see, depends on a visitor that knows what to do with nodes of each type. A visitor will generally inherit from docutils.nodes.NodeVisitor. Programming a visitor is a lot like programming a SAX, expat, REXML, or other event-oriented XML parser. However, a visitor is even closer to the programming style of Python's xmllib module. That is, a visitor will have a .visit_FOO() and a .depart_FOO() method for each type of node, rather than switching on type within large .startElement() and .endElement() methods. OOP purists are likely to prefer this style. A simple example from the DocBook/XML writer is:

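from docutils import nodes

class DocBookTranslator(nodes.NodeVisitor):
    # [...lots of methods...]
    def visit_block_quote(self, node):
        # Entering a block_quote node: emit the opening tag
        self.body.append(self.starttag(node, 'blockquote'))
    def depart_block_quote(self, node):
        # Leaving the node: emit the matching closing tag
        self.body.append('</blockquote>\n')
    # [...lots more methods...]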

Programming a custom writer/visitor is a straightforward enough matter, and there are writers for Docutils/XML, HTML, PEP-HTML, PseudoXML (a sort of light XML that combines start tags with indentation, but no closing tags), LaTeX, DocBook/XML, PDF, OpenOffice/XML, and Wiki-HTML.
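
To make the visit/depart dispatch concrete, here is a minimal, self-contained sketch that does not import Docutils at all. The `Node` and `TinyHTMLTranslator` classes are hypothetical stand-ins written for this illustration. The `walkabout` name mirrors the Docutils node method, but the implementation below is only an assumption about how such a traversal might work, not the library's code.

```python
class Node:
    """Minimal stand-in for a document-tree node (not a Docutils class)."""

    def __init__(self, tagname, *children):
        self.tagname = tagname
        self.children = children

    def walkabout(self, visitor):
        # Call visit_FOO on the way in and depart_FOO on the way out.
        getattr(visitor, "visit_" + self.tagname)(self)
        for child in self.children:
            if isinstance(child, Node):
                child.walkabout(visitor)
            else:
                visitor.visit_text(child)
        getattr(visitor, "depart_" + self.tagname)(self)


class TinyHTMLTranslator:
    """Hypothetical visitor that renders two node types as HTML."""

    def __init__(self):
        self.body = []

    def visit_paragraph(self, node):
        self.body.append("<p>")

    def depart_paragraph(self, node):
        self.body.append("</p>")

    def visit_emphasis(self, node):
        self.body.append("<em>")

    def depart_emphasis(self, node):
        self.body.append("</em>")

    def visit_text(self, text):
        self.body.append(text)

    def astext(self):
        return "".join(self.body)


tree = Node("paragraph", "It ", Node("emphasis", "extends"), " PEP 9876.")
visitor = TinyHTMLTranslator()
tree.walkabout(visitor)
html = visitor.astext()
```

Supporting a new node type in this style means adding one pair of methods, with no central dispatch table to update.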

Tree-oriented processing

You may transform a reStructuredText document into a tree of nodes that can be manipulated in a DOM-like fashion. The following example uses the reST PEP shown in Listing 2.

Listing 5. Creating a reST node tree
>>> txt = open('pep.txt').read()
>>> def rst2tree(txt):
...     import docutils.parsers.rst
...     import docutils.utils
...     parser = docutils.parsers.rst.Parser()
...     document = docutils.utils.new_document("test")
...     document.settings.tab_width = 4
...     document.settings.pep_references = 1
...     document.settings.rfc_references = 1
...     parser.parse(txt, document)
...     return document
>>> doc = rst2tree(txt)
>>> doc.children
[<section "abstract": <title...><paragraph...><paragraph...>>,
 <section "references & footnotes": <title...>
   <target "frungible doodads"...><footnote "pep9 ...>]
>>> print doc.autofootnotes
[<footnote "pep9876": <paragraph...>>, <footnote: <paragraph...>>]
>>> print doc.autofootnotes[0].rawsource
PEP 9876, Let's Hope We Never Get Here

One thing to notice in contrast with DOM is that reStructuredText is already a fixed document dialect. So rather than use generic methods to search for matching nodes, you can search for nodes using attributes that are named for their meaning. The .children attribute is generically hierarchical, but most attributes collect nodes of a given type.

One convenient method of reST nodes is .pformat(), which produces a pseudo-XML representation of the document tree for pretty-printing, as shown in Listing 6:

Listing 6. Pseudo-XML representation of reST node
>>> print doc.autofootnotes[0].pformat('  ')
<footnote auto="1" id="pep9876" name="pep9876">
    <reference refuri="">
      PEP 9876,
    Let's Hope We Never Get Here

Node methods like .remove(), .copy(), .append(), and .insert() are useful for pruning and manipulating trees.

For an XML programmer, a more desirable API might be DOM itself. Fortunately, this API is a single method call away:

Listing 7. Converting a reST tree to a DOM tree
>>> dom = doc.asdom()
>>> foot0 = dom.getElementsByTagName('footnote')[0]
>>> print foot0.toprettyxml('  ')
<footnote auto="1" id="pep9876" name="pep9876">
    <reference refuri="">
      PEP 9876
    , Let's Hope We Never Get Here

Unfortunately, as of this writing, there are no tools or functions to convert a DOM tree or XML document back into reStructuredText. It would be especially nice to have a reader for the Docutils Generic DTD; this would let you produce a reST document tree for the corresponding XML. You could write it back out as reST with the .astext() node method. It would not be hard to write such a reader, and I am sure this will happen in time (perhaps by me or one of my readers).
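
As a rough indication of what such a converter might involve, the following sketch walks a DOM tree built with the standard library's xml.dom.minidom and emits reST-flavored text. It is only an illustration under strong assumptions: the XML fragment is hand-written in the style of the Docutils XML shown earlier, the `INLINE_MARKUP` table and `to_rest` function are hypothetical, and only paragraphs and three inline elements are handled. A real reader would need to cover sections, references, footnotes, and much more.

```python
from xml.dom import minidom

# A hand-written fragment in the style of the Docutils XML shown earlier.
XML = (
    "<document><paragraph>It <emphasis>extends</emphasis> "
    "the <literal>core</literal>.</paragraph></document>"
)

# Hypothetical mapping from element names to reST inline delimiters.
INLINE_MARKUP = {"emphasis": "*", "strong": "**", "literal": "``"}


def to_rest(node):
    """Recursively render a DOM node as reST-flavored text."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    inner = "".join(to_rest(child) for child in node.childNodes)
    if node.nodeType == node.ELEMENT_NODE:
        if node.tagName in INLINE_MARKUP:
            mark = INLINE_MARKUP[node.tagName]
            return mark + inner + mark
        if node.tagName == "paragraph":
            # Paragraphs are separated by a blank line in reST.
            return inner + "\n\n"
    # Unknown elements (and the document node) pass their content through.
    return inner


dom = minidom.parseString(XML)
rest = to_rest(dom)
```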

Related topics

  • Check out the Docutils Web site, where you can find extensive references for both the reStructuredText format itself, and for the docutils package. You can also download the Docutils Generic XML DTD.
  • See the Python Enhancement Proposal 287, which recommends the use of reStructuredText for inline documentation of Python code. This PEP also usefully contrasts reST with other documentation formats considered for the same purpose (XML, TeX, HTML, POD, SEText, etc.).
  • In a previous installment of XML Matters, David introduced YAML, a data serialization format that can be easily read by humans and is well-suited to encoding the data types used in dynamic programming languages (developerWorks, October 2002).
  • You'll find all of the previous installments of the XML Matters column at the column summary page.
  • Check out Rational Application Developer for WebSphere Software, an easy-to-use, integrated development environment for building, testing, and deploying J2EE applications, including generating XML documents from DTDs and schemas.
  • IBM certification: Find out how you can become an IBM-Certified Developer.

