XML Matters: XML for word processors

Open source embraces XML as native document format

David Mertz (mertz@gnosis.cx), Bibliophile, Gnosis Software, Inc.
David Mertz once led the desperate life of scholarship. David may be reached at mertz@gnosis.cx; his life pored over at http://gnosis.cx/dW/. Suggestions and recommendations on this, past, or future columns are welcomed. Check out David's new book Text Processing in Python at http://gnosis.cx/TPiP/.

Summary:  Recent versions of the three major free software word processing programs have all adopted XML as their native document format. The approaches to XML taken by AbiWord, KOffice's KWord, and OpenOffice.org Writer differ somewhat between the applications -- largely reflecting the underlying development focus of each project. Here, David takes a look at how these projects and all open source word processor developers have realized the advantages of XML as a document format: componentization of parsers and writers; openness and formality of format specification; and applicability of XSLT and other transformation APIs.


Date:  25 Feb 2004
Level:  Introductory
Also available in:   Japanese


Beyond a few abandoned or incomplete efforts, three free software word processors are now actively maintained: AbiWord, KOffice's KWord, and OpenOffice.org Writer. All do an excellent job as word processors; all provide a variety of useful import/export capabilities -- including the widely used, but proprietary, Microsoft Word format; and all are available in both source and binary form for Linux along with other platforms (on both free and proprietary OSes). And most interestingly for this column, AbiWord, KWord, and Writer all use XML for their native document formats.

For this column, I am not interested in comparing the features, appearance, or user interfaces of these three projects. Suffice it to say that they have all obtained a very nice degree of polish in look-and-feel, and all have a sufficient feature set for creation of most types of business and personal documents. What I am interested in here is the design of the XML document formats -- the guts inside these projects.

For those unfamiliar with the three projects, a few items are worth noting. AbiWord is a standalone word processor with an emphasis on cross-platform compatibility, moderate size, and good execution speed. OpenOffice.org is an outgrowth of Sun Microsystems' StarOffice product, which was released under a free software license and taken up by the developer community. OpenOffice.org Writer is just one part of a suite of interoperable applications that includes a spreadsheet, vector drawing program, presentation application, and some other components. Similarly, KWord is part of the KOffice suite (which is itself part of the overall KDE project); KOffice contains even more components than does OpenOffice.org -- adding flowcharting, raster image editing, charting, and other applications. In any case, for now I'll focus only on the word processor component of KOffice and OpenOffice.org.

Another option: LyX

One other free software application is worth mentioning in passing here: LyX is a GUI front-end to the creation of LaTeX documents. For specialized technical documents -- such as those involving many equations or complex cross-referencing -- LyX is a good choice, but its learning curve is steep for creating general business correspondence.

Testing document formats

As you would expect, new versions of these open source word processors usually tweak the document format a bit. Fortunately, XML is well suited to upward changes, which can include the addition of (optional) new attributes and child elements. If this is done well, earlier versions of applications can even degrade relatively gracefully when they read newer saved documents -- usually by just ignoring unfamiliar tags and attributes.
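The "ignore what you don't recognize" strategy can be sketched in a few lines of Python. This is only an illustration, not how any of the three applications actually reads its files: the tag and attribute names in KNOWN_TAGS and KNOWN_ATTRS, and the sample document, are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical subset of a format that an "older" reader understands.
KNOWN_TAGS = {"doc", "p", "c"}
KNOWN_ATTRS = {"style", "props"}

def read_tolerantly(xml_text):
    """Return (tag, attrs, text) tuples, skipping unknown tags and attributes."""
    seen = []
    for elem in ET.fromstring(xml_text).iter():
        if elem.tag not in KNOWN_TAGS:
            continue  # unfamiliar element: skip it rather than fail
        attrs = {k: v for k, v in elem.attrib.items() if k in KNOWN_ATTRS}
        seen.append((elem.tag, attrs, (elem.text or "").strip()))
    return seen

# A document saved by a hypothetical newer version: it adds a "revision"
# attribute and an <annotation> element that the older reader never heard of.
newer_doc = """<doc>
  <p style="Normal" revision="7">Hello <c props="font-weight:bold">world</c></p>
  <annotation author="someone">added by a newer version</annotation>
</doc>"""

for item in read_tolerantly(newer_doc):
    print(item)
```

The older reader still recovers the paragraph and its bold span; the new attribute and element simply drop out.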

In the XML formats I looked at, DTDs are provided by the project developers, but they tend to be somewhat out of sync with the actual XML documents created by the same versions of the applications. Well-formedness is still respected, as you would hope, but creation and parsing seem to be rather informal matters; the final say lies in the source code that implements the formats, not in a DTD or schema. In other words, the samples below will not validate successfully. To give you an idea of what the documents really look like, I have created a very simple test document, shown in Figure 1:


Figure 1. Screenshot of simple document

Interestingly, if not surprisingly, you will see in the XML versions of this document that the representation of an identical document is not unique. (Of course, this being XML, issues like whitespace normalization allow non-identical files to represent the same Infoset; but that is not what I mean.) I found that, at least in some details, the exact same formatting can get different markup due to the sequence of user actions that went into the document's creation (and perhaps due to other factors too). While this fact is not necessarily a problem -- and probably applies equally to binary document formats like Microsoft Word's .doc format -- it seems mildly unfortunate that canonicalization is not as straightforward at a semantic level as it is at the XML syntax level.


Starting simple: AbiWord

AbiWord uses a relatively simple and straightforward XML document format in which appearance and layout are specified in CSS-like attributes. While many of these attributes are taken directly from CSS, the AbiWord developers decided that CSS was insufficient for their needs, so they took it only as a starting point.

Although they are a bit long, I would like to present the entire XML source of the word processor documents created. I have prettified these sources, but have verified that my Infoset-neutral changes do not affect re-import. First the AbiWord version:


Listing 1. simple.abw AbiWord document
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE abiword PUBLIC "-//ABISOURCE//DTD AWML 1.0 Strict//EN"
                         "http://www.abisource.com/awml.dtd">
<abiword
  fileformat="1.1"
  props="dom-dir:ltr; lang:en-US"
  styles="unlocked"
  template="false"
  version="2.0.3"
  xml:space="preserve"
  xmlns="http://www.abisource.com/awml.dtd"
  xmlns:awml="http://www.abisource.com/awml.dtd"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:fo="http://www.w3.org/1999/XSL/Format"
  xmlns:math="http://www.w3.org/1998/Math/MathML"
  xmlns:svg="http://www.w3.org/2000/svg"
  xmlns:xlink="http://www.w3.org/1999/xlink">
  <metadata>
    <m key="dc.format">application/x-abiword</m>
    <m key="abiword.generator">AbiWord</m>
    <m key="abiword.date_last_changed">Tue Feb 10 20:04:46 2004</m>
  </metadata>
  <styles>
    <s followedby="Current Settings"
       name="Normal"
       props="text-indent:0in; margin-top:0pt; margin-left:0pt;
              font-stretch:normal; line-height:1.0; text-align:left;
              bgcolor:transparent; lang:en-US; dom-dir:ltr;
              margin-bottom:0pt; text-decoration:none;
              font-weight:normal; font-variant:normal; color:000000;
              text-position:normal; font-size:12pt; margin-right:0pt;
              font-style:normal; widows:2; font-family:Times New Roman"
       type="P"/>
  </styles>
  <pagesize height="11.000000"
            orientation="portrait"
            page-scale="1.000000"
            pagetype="Letter"
            units="in"
            width="8.500000"/>
  <section props="page-margin-footer:0.5in; page-margin-header:0.5in">
    <p style="Normal">Minimal document with <c
       props="font-weight:bold">bold</c><c
       props="font-weight:normal"> and </c><c
       props="font-style:italic; font-weight:normal">italics</c><c
       props="font-style:normal; font-weight:normal">.</c></p>
    <p style="Normal"><c
       props="font-style:normal; font-weight:normal"/></p>
    <p style="Normal"><c
       props="font-style:normal;
              font-weight:normal">New paragraph with </c><c
       props="font-weight:normal;
              text-decoration:underline;
              font-style:normal">underline</c><c
       props="text-decoration:none;
              font-weight:normal;
              font-style:normal">.</c></p>
  </section>
</abiword>

A few features stand out. One notable advantage that comes with XML is the use of namespaces to indicate external schemas developed and refined by other groups. For example, inclusion of equations or figures can be done using MathML or SVG, respectively; the AbiWord developers have no need to re-engineer these capabilities themselves.

Another thing to notice about AbiWord's format is that it only half-heartedly uses XML attributes in describing the rendering of sections or character spans. That is, where some XML formats try to list a priori all the possible formatting in attributes or child tags (named in the DTD), AbiWord simply throws in a generic props attribute that contains CSS-style formatting. This pushes the rendering semantics outside of the XML Infoset (for better or worse, I am not sure).
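Because the formatting lives inside an attribute value, an XML parser hands it over as one opaque string, and a tool has to take it apart in a second step. A minimal sketch of that step follows; the function name parse_props is made up for illustration, and the splitting rules are an assumption based only on the props values visible in Listing 1 (semicolon-separated name:value pairs), not on AbiWord's own parser.

```python
def parse_props(props):
    """Split an AbiWord-style props string into a dict of name -> value."""
    result = {}
    for item in props.split(";"):
        if not item.strip():
            continue  # tolerate a trailing or doubled semicolon
        # Split on the first colon only, so values keep any later colons.
        name, _, value = item.partition(":")
        result[name.strip()] = value.strip()
    return result

span_props = parse_props("font-style:italic; font-weight:normal")
print(span_props)
print(parse_props("font-size:12pt; font-family:Times New Roman"))
```

Values that contain spaces, such as the font-family in the Normal style, survive intact because only the separators are stripped.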


Becoming formal: OASIS

The XML format developed by Sun Microsystems for StarOffice (and taken up by OpenOffice.org) has been adopted by an OASIS Technical Committee (see Resources); in short, it is on its way to becoming a standard, not simply a format. Moreover, the KOffice project, which previously used its own XML format, has recently decided to move towards native use of the OpenOffice.org format -- or some future OASIS enhancement to that format. I find it more useful, therefore, to present the OASIS/OpenOffice.org format than to detail the older KOffice format. That said, current stable versions of KOffice have not yet switched formats as of this writing.

In contrast to the AbiWord format, OpenOffice.org's XML format encompasses all the types of documents supported by OpenOffice.org applications -- not simply word processor documents, but also charts, drawings, and so on. Data of different types is indicated by namespaces for each type, allowing multiple data formats to be embedded in the same document. How and whether a particular application handles a given data type is up to the application; but, for example, one application might pass control for rendering a given data type to another component (either in the same suite, or a wholly external application).

For now I am only interested in the simple word processor document shown in Figure 1. Take a look at it, and then compare the AbiWord version in Listing 1 with OpenOffice.org's XML format shown in Listing 2. As with the AbiWord version, I have prettified the XML, but maintained the Infoset. Also as with AbiWord's version, the document does not actually validate; in this case the dr3d, form, and math namespace attributes are missing from the version of the DTD that's included with my OpenOffice.org installation (the same one that created this document). And while the content of interest is in Listing 2, the complete OpenOffice.org data file (normally having the extension .sxw) is a .zip archive that also contains several ancillary XML documents for settings, metadata, and styles:


Listing 2. content.xml from simple.sxw
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE office:document-content PUBLIC
  "-//OpenOffice.org//DTD OfficeDocument 1.0//EN"
  "office.dtd">
<office:document-content office:class="text" office:version="1.0"
  xmlns:chart="http://openoffice.org/2000/chart"
  xmlns:dr3d="http://openoffice.org/2000/dr3d"
  xmlns:draw="http://openoffice.org/2000/drawing"
  xmlns:fo="http://www.w3.org/1999/XSL/Format"
  xmlns:form="http://openoffice.org/2000/form"
  xmlns:math="http://www.w3.org/1998/Math/MathML"
  xmlns:number="http://openoffice.org/2000/datastyle"
  xmlns:office="http://openoffice.org/2000/office"
  xmlns:script="http://openoffice.org/2000/script"
  xmlns:style="http://openoffice.org/2000/style"
  xmlns:svg="http://www.w3.org/2000/svg"
  xmlns:table="http://openoffice.org/2000/table"
  xmlns:text="http://openoffice.org/2000/text"
  xmlns:xlink="http://www.w3.org/1999/xlink">
  <office:script/>
  <office:font-decls>
    <style:font-decl fo:font-family=""
                     style:font-pitch="variable" style:name="F"/>
    <style:font-decl fo:font-family="Mincho"
                     style:font-pitch="variable" style:name="Mincho"/>
    <style:font-decl fo:font-family="Times"
                     style:font-family-generic="roman"
                     style:font-pitch="variable"
                     style:name="Times"/>
  </office:font-decls>
  <office:automatic-styles>
    <style:style style:family="paragraph"
                 style:name="P1" style:parent-style-name="Standard">
      <style:properties fo:font-style="normal"
                        fo:font-weight="normal"/>
    </style:style>
    <style:style style:family="text" style:name="T1">
      <style:properties fo:font-weight="bold"/>
    </style:style>
    <style:style style:family="text" style:name="T2">
      <style:properties fo:font-weight="normal"/>
    </style:style>
    <style:style style:family="text" style:name="T3">
      <style:properties fo:font-style="italic" fo:font-weight="normal"/>
    </style:style>
    <style:style style:family="text" style:name="T4">
      <style:properties fo:font-style="normal" fo:font-weight="normal"/>
    </style:style>
    <style:style style:family="text" style:name="T5">
      <style:properties style:text-underline="single"
                        style:text-underline-color="font-color"/>
    </style:style>
  </office:automatic-styles>
  <office:body>
    <text:sequence-decls>
      <text:sequence-decl
        text:display-outline-level="0" text:name="Illustration"/>
      <text:sequence-decl
        text:display-outline-level="0" text:name="Table"/>
      <text:sequence-decl
        text:display-outline-level="0" text:name="Text"/>
      <text:sequence-decl
        text:display-outline-level="0" text:name="Drawing"/>
    </text:sequence-decls>
    <text:p
      text:style-name="Standard">Minimal document with <text:span
      text:style-name="T1">bold </text:span><text:span
      text:style-name="T2">and </text:span><text:span
      text:style-name="T3">italics</text:span><text:span
      text:style-name="T4">.</text:span>
    </text:p>
    <text:p text:style-name="P1"/>
    <text:p text:style-name="P1">New paragraph with <text:span
      text:style-name="T5">underline</text:span>.</text:p>
  </office:body>
</office:document-content>

The OpenOffice.org format follows a structure that's generally similar to that of AbiWord. Instead of AbiWord's <p> tag, OpenOffice.org uses <text:p>; and instead of AbiWord's <c>, OpenOffice.org uses <text:span>. A notable difference here is that where AbiWord uses formatting descriptions directly accompanying marked character sequences, OpenOffice.org always uses indirect references to named styles, even where the names of automatic styles are generated on the fly by the generating application.
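The indirection means a tool has to look up each style name before it knows how a span is formatted. The sketch below uses Python's standard xml.etree.ElementTree on a cut-down fragment of Listing 2. The helper names (q, local_name, style_table) are invented for this example, and keeping only the local part of each property name is a simplification that would blur fo: and style: properties sharing a local name.

```python
import xml.etree.ElementTree as ET

# Namespace URIs copied from Listing 2.
NS = {
    "office": "http://openoffice.org/2000/office",
    "style": "http://openoffice.org/2000/style",
    "text": "http://openoffice.org/2000/text",
    "fo": "http://www.w3.org/1999/XSL/Format",
}

# A cut-down fragment of Listing 2: one automatic style, one span.
SAMPLE = """<office:document-content
    xmlns:office="http://openoffice.org/2000/office"
    xmlns:style="http://openoffice.org/2000/style"
    xmlns:text="http://openoffice.org/2000/text"
    xmlns:fo="http://www.w3.org/1999/XSL/Format">
  <office:automatic-styles>
    <style:style style:family="text" style:name="T1">
      <style:properties fo:font-weight="bold"/>
    </style:style>
  </office:automatic-styles>
  <office:body>
    <text:p text:style-name="Standard">Minimal document with <text:span
      text:style-name="T1">bold </text:span></text:p>
  </office:body>
</office:document-content>"""

def q(prefix, local):
    """Build ElementTree's '{namespace-uri}local' form of a qualified name."""
    return "{%s}%s" % (NS[prefix], local)

def local_name(qualified):
    """Strip the '{namespace-uri}' part from an ElementTree name."""
    return qualified.rsplit("}", 1)[-1]

def style_table(root):
    """Map each automatic style name to its properties (local names only)."""
    table = {}
    for style in root.iterfind("office:automatic-styles/style:style", NS):
        props = style.find("style:properties", NS)
        attrs = props.attrib if props is not None else {}
        name = style.get(q("style", "name"))
        table[name] = {local_name(k): v for k, v in attrs.items()}
    return table

root = ET.fromstring(SAMPLE)
styles = style_table(root)
for span in root.iter(q("text", "span")):
    name = span.get(q("text", "style-name"))
    print(repr(span.text), name, styles.get(name, {}))
```

With AbiWord the same information sits directly on the <c> element; here it takes one extra lookup through the style table.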

The sample document also illustrates a point made above about incidental variations in document Infosets. For example, notice that the period at the end of the first paragraph is marked as style T4, while the period in the final paragraph is outside any span. Moreover, if you look at the earlier T4 style definition you'll see that it merely defines normal -- that is, default -- font style and weight. In other words, you don't need to mark text with the T4 style as opposed to leaving it as PCDATA for the surrounding paragraph.
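One way to compare documents despite such incidental differences is to reduce each paragraph to runs of text with their effective formatting, drop properties that merely restate the default, and merge neighbouring runs that end up identical. The sketch below is one possible normalization, not something any of these applications is known to do; the DEFAULTS table and the run representation are assumptions made for the example.

```python
# Assumed defaults; a real tool would derive them from the paragraph style.
DEFAULTS = {"font-style": "normal", "font-weight": "normal"}

def normalize(runs):
    """Drop default-valued properties, then merge adjacent runs that match."""
    merged = []
    for text, props in runs:
        effective = {k: v for k, v in props.items() if DEFAULTS.get(k) != v}
        if merged and merged[-1][1] == effective:
            # Same formatting as the previous run: extend it.
            merged[-1] = (merged[-1][0] + text, effective)
        else:
            merged.append((text, effective))
    return merged

# The final period wrapped in a T4-like span (explicitly normal) ...
with_span = [("italics", {"font-style": "italic", "font-weight": "normal"}),
             (".", {"font-style": "normal", "font-weight": "normal"})]
# ... versus the same period left as plain paragraph text.
without_span = [("italics", {"font-style": "italic"}), (".", {})]

print(normalize(with_span) == normalize(without_span))
```

Under these assumptions both variants reduce to the same list of runs, which is the semantic equivalence that the raw XML hides.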


Processing an XML document

One important advantage of XML formats like those in the word processors you've seen here is that new tools can get at the documents easily. It is simply easier to write new applications that process XML word processor documents than it is to write ones that work with binary formats, especially proprietary ones. To some extent, RTF (Rich Text Format) achieves a similar goal: It is a textual markup format that is publicly documented. But as things have unfolded, you have many more commodity XML parsers to choose from than RTF parsers.

One application that obviously comes to mind for working with an XML word processor format is a new word processing application. The anticipated transparent interoperability between KOffice (KWord) and OpenOffice.org (Writer) is an example of this. But somewhat more modest applications are worth keeping in mind, too: Indexing, analyzing, summarizing, comparing, or otherwise batch processing documents are also tasks that are frequently useful.

For many of these batch-style applications, XSLT stands out as an obvious processing language -- and indeed, existing conversion routines often use XSLT. However, I am much less fond of XSLT than are its proponents. Despite the declarative goal of the language, I usually find its details confusing and difficult to maintain and debug. In any case, what I present in Listing 3 is a very simple utility that utilizes the gnosis.xml.objectify Python XML binding that I have discussed in several previous installments of this column. The utility I present is similar to the -dump option in lynx, except that it processes OpenOffice.org Writer documents rather than HTML documents. My utility is crude, but is also extremely concise, thereby demonstrating some XML benefits:


Listing 3. dumpOO.py
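#!/usr/bin/env python
# Python 2 script (it relies on the unicode type); needs Gnosis_Utils installed
import sys, zipfile
from gnosis.xml.objectify import XML_Objectify, EXPAT, \
                                 children, tagname, content
# Disable namespace processing, so that text:p shows up as the name text_p
XML_Objectify.expat_kwargs['nspace_sep'] = None
# An .sxw file is a zip archive; the document body lives in content.xml
doc_content = zipfile.ZipFile(sys.argv[1]).read('content.xml')
doc = XML_Objectify(doc_content).make_instance()
write = sys.stdout.write
for o in children(doc.office_body):
    if tagname(o)=="text_p":            # one output line per paragraph
        for s in content(o):
            # Bare text in the paragraph arrives as unicode strings
            if type(s) is unicode and s.strip():
                write(" "+s.encode('utf-8').strip())
            # Styled runs arrive as text_span objects
            elif tagname(s)=='text_span':
                write(" "+s.PCDATA.encode('utf-8'))
        write('\n')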

This utility doesn't necessarily handle every OpenOffice.org construct gracefully (but I think it handles all text content), and line wrapping is not performed for paragraphs. However, you could easily add this capability, either by using the Python 2.3+ module textwrap, or by piping to the external utility fmt. For example:


Listing 4. dumpOO.py in action
$ ./dumpOO.py simple.sxw | fmt
 Minimal document with bold and italics .

 New paragraph with underline .
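For readers without gnosis.xml.objectify at hand, roughly the same dump can be sketched with only the Python 3 standard library, with textwrap doing the wrapping in place of fmt. This is an alternative illustration rather than a port of Listing 3: the function names paragraphs and dump are invented here, only top-level text:p elements are handled, and whitespace is collapsed, so the spacing differs slightly from the output above.

```python
import textwrap
import zipfile
import xml.etree.ElementTree as ET

# Namespace URIs as they appear in Listing 2.
OFFICE_NS = "http://openoffice.org/2000/office"
TEXT_NS = "http://openoffice.org/2000/text"

def paragraphs(content_xml):
    """Yield the text of each top-level text:p inside office:body."""
    root = ET.fromstring(content_xml)
    body = root.find("{%s}body" % OFFICE_NS)
    if body is None:
        return
    for para in body.findall("{%s}p" % TEXT_NS):
        # itertext() walks nested spans too; split/join collapses whitespace.
        yield " ".join("".join(para.itertext()).split())

def dump(path, width=70):
    """Print each paragraph of an .sxw file, wrapped to the given width."""
    with zipfile.ZipFile(path) as archive:
        content = archive.read("content.xml")
    for text in paragraphs(content):
        print(textwrap.fill(text, width=width))
        print()
```

Calling dump("simple.sxw") on the test document would print its paragraphs separated by blank lines; passing a smaller width shows the wrapping.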


Conclusion

Free software and XML document formats are a natural pairing. The inherent readability of XML just makes interchange and format specification easier, and the wide availability of XML libraries makes creation of new tools simple. Moreover, looking at these word processor formats has really helped me to see the modularity benefits of namespaces -- when done correctly, namespaces can leverage the work done by many groups of independent developers.

However, XML itself only goes so far. For example, Microsoft is also moving towards an XML format for future versions of MS Word; but in contrast to the openness of the OASIS/OpenOffice.org or AbiWord formats, Microsoft is surrounding its format with patent applications, and putting a veil of secrecy around the format variations (plus it uses cryptic tag and attribute names rather than self-documenting ones).

XML by itself does not really mean open, but fortunately, the developers of KOffice, AbiWord, and OpenOffice.org have done a generally wonderful job of achieving openness with XML (although the wild world of community development still leaves occasional impedance mismatches in, for example, DTDs).


Resources

  • Participate in the discussion forum.

  • Check out OpenOffice.org, licensed under a mixture of free software licenses (all approved by FSF and OSI). Depending on which components you are interested in, either the PDL, GPL, LGPL, or SISSL may apply; moreover, you have some choices about the license terms you can accept. The site also includes information about the project in general, as well as about the licensing specifics.

  • Download KWord, part of the KOffice project (available under the General Public License). The set of DTDs used by KOffice components is available.

  • The KOffice project announced its plans to switch to using OASIS/OpenOffice.org as its native file format following the 2003 KOffice Developers' Meeting.

  • Take a look at the AbiWord site, which includes downloads (also available under the General Public License). You can also view the (non-definitive) DTD for AbiWord (or save it as a file to view later).

  • An OASIS Technical Committee has been organized to create an open, XML-based file format specification for office applications. The basis of this specification is the StarOffice/OpenOffice.org format specification, created by Sun.

  • Read about the standardization of document formats between office suites on OpenOffice.org.

  • Consider the LyX application if you're creating specialized technical documents.

  • Browse for books on these and other technical topics.

  • Find more XML resources on the developerWorks XML zone. You'll find all previous installments of David's XML Matters column at the column summary page.

  • IBM XML certification: Find out how you can become an IBM-Certified Developer in XML and related technologies.

