Beyond a few abandoned or incomplete efforts, three open source word processors are now available in actively maintained states. All do an excellent job as word processors; all provide a variety of useful import/export capabilities -- including the widely used, but proprietary, Microsoft Word format; and all are available in both source and binary form for Linux along with other platforms (on both free and proprietary OSes). And most interestingly for this column, AbiWord, KWord, and Writer all use XML for their native document formats.
For this column, I am not interested in comparing the features, appearance, or user interfaces of these three projects. Suffice it to say that they have all obtained a very nice degree of polish in look-and-feel, and all have a sufficient feature set for creation of most types of business and personal documents. What I am interested in here is the design of the XML document formats -- the guts inside these projects.
For those unfamiliar with the three projects, a few items are worth noting. AbiWord is a standalone word processor with an emphasis on cross-platform compatibility, moderate size, and good execution speed. OpenOffice.org is an outgrowth of Sun Microsystems' StarOffice product, which was released under a free software license and taken up by the developer community. OpenOffice.org Writer is just one part of a suite of interoperable applications including a spreadsheet, vector drawing program, presentation application, and some other components. Similarly, KWord is part of the KOffice suite (which is itself part of the overall KDE project); KOffice contains even more components than does OpenOffice.org -- adding flowcharting, raster image editing, charting, and other applications. In any case, for now I'll focus only on the word processor component of KOffice and OpenOffice.org.
As you would expect, new versions of these open source word processors usually tweak the document format a bit. Fortunately, XML is well suited to upward changes, which can include the addition of (optional) new attributes and child elements. If this is done well, earlier versions of applications can even degrade relatively gracefully when they read newer saved documents -- usually by just ignoring unfamiliar tags and attributes.
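As a rough illustration of that graceful degradation, here is a minimal Python 3 sketch of an "older" reader that skips whatever it does not recognize. The tag names, attribute names, and sample document are hypothetical and are not taken from any of the three formats; the sketch uses only the standard library's xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

# Hypothetical "old" reader: it knows only these tags and attributes.
KNOWN_TAGS = {"p"}
KNOWN_ATTRS = {"style"}

def read_paragraphs(xml_text):
    """Return (style, text) pairs, silently skipping anything unfamiliar."""
    root = ET.fromstring(xml_text)
    result = []
    for elem in root:
        if elem.tag not in KNOWN_TAGS:
            continue  # unknown child element from a newer version: ignore it
        # Keep only the attributes this reader understands.
        attrs = {k: v for k, v in elem.attrib.items() if k in KNOWN_ATTRS}
        result.append((attrs.get("style"), elem.text or ""))
    return result

# A document saved by a hypothetical newer version, carrying an extra
# attribute ("revision") and an extra element ("comment").
NEWER_DOC = """<doc>
  <p style="Normal" revision="7">Hello</p>
  <comment author="x">not understood by old readers</comment>
  <p style="Normal">World</p>
</doc>"""

print(read_paragraphs(NEWER_DOC))
```

The old reader still recovers both paragraphs; it simply loses whatever the newer additions were meant to convey.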
In the XML formats I looked at, DTDs are provided by the project developers, but they tend to be somewhat out of sync with the actual XML documents created by the same versions of the applications. Well-formedness is still respected, as you would hope, but creation and parsing seem to be rather informal matters; the final say lies in the source code that implements the formats, not in a DTD or schema. In other words, the samples below will not validate successfully. To give you an idea of what the documents really look like, I have created a very simple test document, shown in Figure 1:
Figure 1. Screenshot of simple document
Interestingly, if not surprisingly, you will see in the XML versions of this document that the representation of the identical document is not unique. (Of course, this being XML, issues like whitespace normalization allow non-identical files to represent the same Infoset; but that is not what I mean.) I found that, at least in some details, the exact same formatting can get different markup due to the sequence of user actions that went into the document's creation (and perhaps due to other factors too). While this fact is not necessarily a problem -- and probably applies equally to binary document formats like Microsoft Word's .doc format -- it seems mildly unfortunate that canonicalization is not as straightforward at a semantic level as it is at the XML syntax level.
AbiWord uses a relatively simple and straightforward XML document format in which appearance and layout are specified in CSS-like attributes. While many of these attributes are taken directly from CSS, the AbiWord developers decided that CSS was insufficient for their needs, so they took it only as a starting point.
Although they are a bit long, I would like to present the entire XML source of the word processor documents created. I have prettified these sources, but have verified that my Infoset-neutral changes do not affect re-import. First the AbiWord version:
Listing 1. simple.abw AbiWord document
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE abiword PUBLIC "-//ABISOURCE//DTD AWML 1.0 Strict//EN"
"http://www.abisource.com/awml.dtd">
<abiword
fileformat="1.1"
props="dom-dir:ltr; lang:en-US"
styles="unlocked"
template="false"
version="2.0.3"
xml:space="preserve"
xmlns="http://www.abisource.com/awml.dtd"
xmlns:awml="http://www.abisource.com/awml.dtd"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:fo="http://www.w3.org/1999/XSL/Format"
xmlns:math="http://www.w3.org/1998/Math/MathML"
xmlns:svg="http://www.w3.org/2000/svg"
xmlns:xlink="http://www.w3.org/1999/xlink">
<metadata>
<m key="dc.format">application/x-abiword</m>
<m key="abiword.generator">AbiWord</m>
<m key="abiword.date_last_changed">Tue Feb 10 20:04:46 2004</m>
</metadata>
<styles>
<s followedby="Current Settings"
name="Normal"
props="text-indent:0in; margin-top:0pt; margin-left:0pt;
font-stretch:normal; line-height:1.0; text-align:left;
bgcolor:transparent; lang:en-US; dom-dir:ltr;
margin-bottom:0pt; text-decoration:none;
font-weight:normal; font-variant:normal; color:000000;
text-position:normal; font-size:12pt; margin-right:0pt;
font-style:normal; widows:2; font-family:Times New Roman"
type="P"/>
</styles>
<pagesize height="11.000000"
orientation="portrait"
page-scale="1.000000"
pagetype="Letter"
units="in"
width="8.500000"/>
<section props="page-margin-footer:0.5in; page-margin-header:0.5in">
<p style="Normal">Minimal document with <c
props="font-weight:bold">bold</c><c
props="font-weight:normal"> and </c><c
props="font-style:italic; font-weight:normal">italics</c><c
props="font-style:normal; font-weight:normal">.</c></p>
<p style="Normal"><c
props="font-style:normal; font-weight:normal"/></p>
<p style="Normal"><c
props="font-style:normal;
font-weight:normal">New paragraph with </c><c
props="font-weight:normal;
text-decoration:underline;
font-style:normal">underline</c><c
props="text-decoration:none;
font-weight:normal;
font-style:normal">.</c></p>
</section>
</abiword>
A few features stand out. One notable advantage that comes with XML is the use of namespaces to indicate external schemas developed and refined by other groups. For example, inclusion of equations or figures can be done using MathML or SVG, respectively; the AbiWord developers have no need to re-engineer these capabilities themselves.
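To see which external vocabularies a document declares, one option is to collect its namespace declarations while parsing. The following is a minimal Python 3 sketch using the standard library only; the sample is a trimmed-down fragment modeled on Listing 1, and the function name is my own invention.

```python
import io
import xml.etree.ElementTree as ET

def declared_namespaces(xml_text):
    """Return {prefix: URI} for every xmlns declaration in the document."""
    namespaces = {}
    stream = io.BytesIO(xml_text.encode("utf-8"))
    # "start-ns" events yield (prefix, uri) pairs as declarations are seen;
    # the default namespace is reported with an empty prefix.
    for _event, (prefix, uri) in ET.iterparse(stream, events=("start-ns",)):
        namespaces[prefix] = uri
    return namespaces

# Trimmed fragment modeled on the root element of Listing 1.
SAMPLE = """<abiword xmlns="http://www.abisource.com/awml.dtd"
    xmlns:math="http://www.w3.org/1998/Math/MathML"
    xmlns:svg="http://www.w3.org/2000/svg">
  <section><p>text</p></section>
</abiword>"""

for prefix, uri in sorted(declared_namespaces(SAMPLE).items()):
    print(prefix or "(default)", uri)
```

Note that a declaration only announces that a vocabulary may appear; the simple test document declares MathML and SVG without actually containing any equations or figures.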
Another thing to notice about AbiWord's format is that it only
half-heartedly uses XML attributes in describing the rendering of
sections or character spans. That is, where some XML formats try to
list a priori all the possible formatting in attributes or child
tags (named in the DTD), AbiWord simply throws in a generic props
attribute that contains CSS-style formatting. This pushes the
rendering semantics outside of the XML Infoset (for better or worse, I
am not sure).
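One consequence is that any tool reading the format needs a second, small parser for the props values. A minimal Python 3 sketch of such a parser follows; it assumes that declarations are separated by semicolons and that names are separated from values by the first colon, which matches the values in Listing 1 but is not checked against AbiWord's own source code.

```python
def parse_props(props):
    """Split an AbiWord-style props string into a dict of name -> value."""
    result = {}
    for declaration in props.split(";"):
        if not declaration.strip():
            continue  # tolerate a trailing semicolon or an empty entry
        # Split on the first colon only, in case a value contains one.
        name, _, value = declaration.partition(":")
        result[name.strip()] = value.strip()
    return result

print(parse_props("font-style:italic; font-weight:normal"))
# {'font-style': 'italic', 'font-weight': 'normal'}
```

Because surrounding whitespace is stripped, the sketch also copes with props values that were wrapped across lines when the XML was prettified.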
The XML format developed by Sun Microsystems for StarOffice (and taken up by OpenOffice.org) has been adopted by an OASIS Technical Committee (see Resources); in short, it is on its way to becoming a standard, not simply a format. Moreover, the KOffice project, which previously used its own XML format, has recently decided to move towards native use of the OpenOffice.org format -- or some future OASIS enhancement to that format. I find it more useful, therefore, to present the OASIS/OpenOffice.org format than to detail the older KOffice format. That said, current stable versions of KOffice have not yet switched formats as of this writing.
In contrast to the AbiWord format, OpenOffice.org's XML format encompasses all the types of documents supported by OpenOffice.org applications -- not simply word processor documents, but also charts, drawings, and so on. Data of different types is indicated by namespaces for each type, allowing multiple data formats to be embedded in the same document. How and whether a particular application handles a given data type is up to the application; but, for example, one application might pass control for rendering a given data type to another component (either in the same suite, or a wholly external application).
For now I am only interested in the simple word processor document
shown in Figure 1. Take a look at it, and then compare the AbiWord
version in Listing 1 with OpenOffice.org's XML format shown in
Listing 2. As with the AbiWord
version, I have prettified the XML, but maintained the Infoset. Also
as with AbiWord's version, the document does not actually validate; in
this case the dr3d, form,
and math namespace attributes are
missing from the version of the DTD that's included with my OpenOffice.org
installation (the same one that created this document). And while
the content of interest is in Listing 2, the complete OpenOffice.org data
file (normally having the extension .sxw) is a .zip archive containing
several ancillary XML documents for settings, metadata, and styles:
Listing 2. content.xml from simple.sxw
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE office:document-content PUBLIC
"-//OpenOffice.org//DTD OfficeDocument 1.0//EN"
"office.dtd">
<office:document-content office:class="text" office:version="1.0"
xmlns:chart="http://openoffice.org/2000/chart"
xmlns:dr3d="http://openoffice.org/2000/dr3d"
xmlns:draw="http://openoffice.org/2000/drawing"
xmlns:fo="http://www.w3.org/1999/XSL/Format"
xmlns:form="http://openoffice.org/2000/form"
xmlns:math="http://www.w3.org/1998/Math/MathML"
xmlns:number="http://openoffice.org/2000/datastyle"
xmlns:office="http://openoffice.org/2000/office"
xmlns:script="http://openoffice.org/2000/script"
xmlns:style="http://openoffice.org/2000/style"
xmlns:svg="http://www.w3.org/2000/svg"
xmlns:table="http://openoffice.org/2000/table"
xmlns:text="http://openoffice.org/2000/text"
xmlns:xlink="http://www.w3.org/1999/xlink">
<office:script/>
<office:font-decls>
<style:font-decl fo:font-family=""
style:font-pitch="variable" style:name="F"/>
<style:font-decl fo:font-family="Mincho"
style:font-pitch="variable" style:name="Mincho"/>
<style:font-decl fo:font-family="Times"
style:font-family-generic="roman"
style:font-pitch="variable"
style:name="Times"/>
</office:font-decls>
<office:automatic-styles>
<style:style style:family="paragraph"
style:name="P1" style:parent-style-name="Standard">
<style:properties fo:font-style="normal"
fo:font-weight="normal"/>
</style:style>
<style:style style:family="text" style:name="T1">
<style:properties fo:font-weight="bold"/>
</style:style>
<style:style style:family="text" style:name="T2">
<style:properties fo:font-weight="normal"/>
</style:style>
<style:style style:family="text" style:name="T3">
<style:properties fo:font-style="italic" fo:font-weight="normal"/>
</style:style>
<style:style style:family="text" style:name="T4">
<style:properties fo:font-style="normal" fo:font-weight="normal"/>
</style:style>
<style:style style:family="text" style:name="T5">
<style:properties style:text-underline="single"
style:text-underline-color="font-color"/>
</style:style>
</office:automatic-styles>
<office:body>
<text:sequence-decls>
<text:sequence-decl
text:display-outline-level="0" text:name="Illustration"/>
<text:sequence-decl
text:display-outline-level="0" text:name="Table"/>
<text:sequence-decl
text:display-outline-level="0" text:name="Text"/>
<text:sequence-decl
text:display-outline-level="0" text:name="Drawing"/>
</text:sequence-decls>
<text:p
text:style-name="Standard">Minimal document with <text:span
text:style-name="T1">bold </text:span><text:span
text:style-name="T2">and </text:span><text:span
text:style-name="T3">italics</text:span><text:span
text:style-name="T4">.</text:span>
</text:p>
<text:p text:style-name="P1"/>
<text:p text:style-name="P1">New paragraph with <text:span
text:style-name="T5">underline</text:span>.</text:p>
</office:body>
</office:document-content>
The OpenOffice.org format follows a structure that's generally similar to
that of AbiWord. Instead of AbiWord's <p>
tag, OpenOffice.org uses <text:p>; and
instead of AbiWord's <c>, OpenOffice.org uses
<text:span>. A notable difference here is that where AbiWord uses
formatting descriptions directly accompanying marked character
sequences, OpenOffice.org always uses indirect references to named
styles, even where the names of automatic styles are generated on
the fly by the generating application.
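In practice, this indirection means a reader has to build a table of styles first and then look up each span's style name in it. The following Python 3 sketch shows one way to do that with the standard library; the sample is a shortened fragment modeled on Listing 2, the function names are my own, and style inheritance (style:parent-style-name) is deliberately ignored.

```python
import xml.etree.ElementTree as ET

STYLE = "{http://openoffice.org/2000/style}"
TEXT = "{http://openoffice.org/2000/text}"

def local(name):
    """Drop the {namespace} part of an ElementTree qualified name."""
    return name.split("}")[-1]

def automatic_styles(root):
    """Map each style name to a dict of its formatting properties."""
    styles = {}
    for style in root.iter(STYLE + "style"):
        props = style.find(STYLE + "properties")
        attrs = props.attrib if props is not None else {}
        styles[style.get(STYLE + "name")] = {local(k): v for k, v in attrs.items()}
    return styles

def resolved_spans(root):
    """Pair the text of every text:span with the properties of its style."""
    styles = automatic_styles(root)
    return [(span.text or "", styles.get(span.get(TEXT + "style-name"), {}))
            for span in root.iter(TEXT + "span")]

# Shortened fragment modeled on Listing 2.
SAMPLE = """<office:document-content
    xmlns:office="http://openoffice.org/2000/office"
    xmlns:style="http://openoffice.org/2000/style"
    xmlns:text="http://openoffice.org/2000/text"
    xmlns:fo="http://www.w3.org/1999/XSL/Format">
  <office:automatic-styles>
    <style:style style:family="text" style:name="T1">
      <style:properties fo:font-weight="bold"/>
    </style:style>
    <style:style style:family="text" style:name="T3">
      <style:properties fo:font-style="italic" fo:font-weight="normal"/>
    </style:style>
  </office:automatic-styles>
  <office:body>
    <text:p text:style-name="Standard">Minimal document with <text:span
      text:style-name="T1">bold </text:span><text:span
      text:style-name="T3">italics</text:span></text:p>
  </office:body>
</office:document-content>"""

for text, formatting in resolved_spans(ET.fromstring(SAMPLE)):
    print(repr(text), formatting)
```

With AbiWord's format the equivalent information sits directly on each character span, so no lookup table is needed.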
The sample document also illustrates a point made above about incidental variations in document Infosets. For example, notice that the period at the end of the first paragraph is marked as style T4, while the period in the final paragraph is outside any span. Moreover, if you look at the earlier T4 style definition you'll see that it merely defines normal -- that is, default -- font style and weight. In other words, you don't need to mark text with the T4 style as opposed to leaving it as PCDATA for the surrounding paragraph.
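A crude way to spot such redundant markup is to list the automatic text styles whose properties all carry the value "normal". The Python 3 sketch below does this for the style definitions copied from Listing 2. Treating "every property equals normal" as "redundant" is my own simplification: it ignores inheritance from the paragraph style, so a flagged style is only a candidate for removal, not a certainty.

```python
import xml.etree.ElementTree as ET

STYLE = "{http://openoffice.org/2000/style}"

def default_only_styles(root):
    """Names of text styles whose every property has the value "normal"."""
    names = []
    for style in root.iter(STYLE + "style"):
        if style.get(STYLE + "family") != "text":
            continue  # paragraph styles such as P1 are out of scope here
        props = style.find(STYLE + "properties")
        values = list(props.attrib.values()) if props is not None else []
        if values and all(value == "normal" for value in values):
            names.append(style.get(STYLE + "name"))
    return names

# The automatic styles from Listing 2, wrapped in a minimal root element.
SAMPLE = """<office:automatic-styles
    xmlns:office="http://openoffice.org/2000/office"
    xmlns:style="http://openoffice.org/2000/style"
    xmlns:fo="http://www.w3.org/1999/XSL/Format">
  <style:style style:family="paragraph" style:name="P1"
      style:parent-style-name="Standard">
    <style:properties fo:font-style="normal" fo:font-weight="normal"/>
  </style:style>
  <style:style style:family="text" style:name="T1">
    <style:properties fo:font-weight="bold"/>
  </style:style>
  <style:style style:family="text" style:name="T2">
    <style:properties fo:font-weight="normal"/>
  </style:style>
  <style:style style:family="text" style:name="T3">
    <style:properties fo:font-style="italic" fo:font-weight="normal"/>
  </style:style>
  <style:style style:family="text" style:name="T4">
    <style:properties fo:font-style="normal" fo:font-weight="normal"/>
  </style:style>
  <style:style style:family="text" style:name="T5">
    <style:properties style:text-underline="single"
        style:text-underline-color="font-color"/>
  </style:style>
</office:automatic-styles>"""

print(default_only_styles(ET.fromstring(SAMPLE)))
```

Under this test, T2 (used for the word "and") is flagged along with T4, since it too only restates a normal font weight.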
One important advantage of using XML formats like those in the word processors you've seen here is the facilitation of access by new tools to those documents. It is just easier to write new applications that process XML word processor documents than it is to write ones that work with binary formats, especially proprietary ones. To some extent, RTF (Rich Text Format) achieves a similar goal: It is a textual markup format that is publicly documented. But as things have unfolded, you have many more commodity XML parsers to choose from than RTF parsers.
One application that obviously comes to mind for working with an XML word processor format is a new word processing application. The anticipated transparent interoperability between KOffice (KWord) and OpenOffice.org (Writer) is an example of this. But somewhat more modest applications are worth keeping in mind, too: Indexing, analyzing, summarizing, comparing, or otherwise batch processing documents are also tasks that are frequently useful.
For many of these batch-style applications, XSLT stands out as an
obvious processing language -- and indeed, existing conversion routines
often use XSLT. However, I am much less fond of XSLT than are its
proponents. Despite the declarative goal of the language, I usually
find its details confusing and difficult to maintain and debug. In
any case, what I present in Listing 3 is a very simple utility that utilizes
the gnosis.xml.objectify Python XML binding that I have discussed in
several previous installments of this column. The utility I present
is similar to the -dump option in lynx,
except that it processes OpenOffice.org Writer documents rather than HTML documents.
My utility is crude, but is also extremely concise, thereby
demonstrating some XML benefits:
Listing 3. dumpOO.py
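#!/usr/bin/env python
# Python 2 script: relies on the unicode type and on gnosis.xml.objectify
import sys, zipfile
from gnosis.xml.objectify import XML_Objectify, EXPAT, \
     children, tagname, content
# Turn off expat namespace processing, so that a tag such as text:p
# is exposed under the name text_p
XML_Objectify.expat_kwargs['nspace_sep'] = None
# An .sxw file is a zip archive; the document body lives in content.xml
doc_content = zipfile.ZipFile(sys.argv[1]).read('content.xml')
doc = XML_Objectify(doc_content).make_instance()
write = sys.stdout.write
for o in children(doc.office_body):
    if tagname(o)=="text_p":
        # Mixed content: bare strings interleaved with text:span objects
        for s in content(o):
            if type(s) is unicode and s.strip():
                write(" "+s.encode('utf-8').strip())
            elif tagname(s)=='text_span':
                write(" "+s.PCDATA.encode('utf-8'))
        # One output line per paragraph (empty paragraphs give blank lines)
        write('\n')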
This utility doesn't necessarily handle every OpenOffice.org construct
gracefully (but I think it handles all text content), and line wrapping is
not performed for paragraphs. However, you could easily add this capability,
either by using the Python 2.3+ module textwrap, or by piping to the
external utility fmt. For example:
Listing 4. dumpOO.py in action
$ ./dumpOO.py simple.sxw | fmt
Minimal document with bold and italics .

New paragraph with underline .
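For readers without gnosis.xml.objectify installed, the textwrap variant mentioned above can be sketched in Python 3 using only the standard library. This is not the utility from Listing 3 but a rough equivalent under my own assumptions: it collects all text inside each text:p element (including nested spans), collapses whitespace, and wraps at a chosen width. The function names and the default width of 70 are arbitrary.

```python
import textwrap
import zipfile
import xml.etree.ElementTree as ET

TEXT = "{http://openoffice.org/2000/text}"

def paragraphs(content_xml):
    """Yield the plain text of each text:p element in a content.xml string."""
    root = ET.fromstring(content_xml)
    for p in root.iter(TEXT + "p"):
        # itertext() walks the paragraph's own text plus all nested spans;
        # split/join collapses runs of whitespace left over from the markup.
        yield " ".join("".join(p.itertext()).split())

def dump(path, width=70):
    """Print every paragraph of an .sxw file, wrapped to the given width."""
    with zipfile.ZipFile(path) as archive:
        content = archive.read("content.xml")
    for text in paragraphs(content):
        print(textwrap.fill(text, width=width))
        print()  # blank line between paragraphs
```

Calling dump("simple.sxw") would then print the wrapped paragraphs directly, with no need to pipe through fmt. Because the span text is concatenated rather than written piece by piece, the stray space before each period in Listing 4 does not appear.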
Free software and XML document formats are a natural pairing. The inherent readability of XML just makes interchange and format specification easier, and the wide availability of XML libraries makes creation of new tools simple. Moreover, looking at these word processor formats has really helped me to see the modularity benefits of namespaces -- when done correctly, namespaces can leverage the work done by many groups of independent developers.
However, XML itself only goes so far. For example, Microsoft is also moving towards an XML format for future versions of MS Word; but in contrast to the openness of the OASIS/OpenOffice.org or AbiWord formats, Microsoft is surrounding its format with patent applications, and putting a veil of secrecy around the format variations (plus it uses cryptic tag and attribute names rather than self-documenting ones).
XML by itself does not really mean open, but fortunately, the developers of KOffice, AbiWord, and OpenOffice.org have done a generally wonderful job of obtaining openness with XML (albeit, the wild world of community development still leaves occasional impedance mismatches in, for example, DTDs).
- Check out OpenOffice.org,
licensed under a mixture of free software licenses
(all approved by FSF and OSI). Depending on which components you are
interested in, the PDL, GPL, LGPL, or SISSL may apply;
moreover, you have some choices about the license terms you can
accept. The site also includes information about the project in general, as well as about
the licensing specifics.
- Download KWord, part of the KOffice project (available under the GNU General Public License). The set of DTDs used by KOffice components is also available.
- The KOffice project announced its plans to switch to using OASIS/OpenOffice.org as its
native file format following the 2003 KOffice Developers' Meeting.
- Take a look at the AbiWord site, which includes downloads (also available under the GNU General Public License). You can also view the (non-definitive) DTD for AbiWord (or save it as a file to view later).
- An OASIS
Technical Committee has been organized to create an open,
XML-based file format specification for office applications. The
basis of this specification is the StarOffice/OpenOffice.org format
specification, created by Sun.
- Read about the standardization of document formats between office suites on OpenOffice.org.
- Consider the LyX application if you're creating specialized technical documents.
- Find more XML resources on the developerWorks XML zone. You'll find all previous installments of David's XML Matters column at the column summary page.
- IBM XML certification: Find out how you can become an IBM-Certified Developer in XML and related technologies.

David Mertz once led the desperate life of scholarship. David may be reached at mertz@gnosis.cx; his life pored over at http://gnosis.cx/dW/. Suggestions and recommendations on this, past, or future columns are welcomed. Check out David's new book Text Processing in Python at http://gnosis.cx/TPiP/.



