Thinking XML: The open office file format

An XML format for front office documents

Uche Ogbuji (uche@ogbuji.net), Principal Consultant, Fourthought, Inc.
Uche Ogbuji is a consultant and co-founder of Fourthought Inc., a software vendor and consultancy specializing in XML solutions for enterprise knowledge management. Fourthought develops 4Suite, an open source platform for XML, RDF, and knowledge-management applications. Mr. Ogbuji is a Computer Engineer and writer born in Nigeria, living and working in Boulder, Colorado, USA. You can contact Mr. Ogbuji at uche@ogbuji.net.

Summary:  OpenOffice.org is a mature, open source, front office applications suite with the advantage of a saved file format based on an open XML DTD. This gives users and developers an extraordinary amount of flexibility and power in dealing with work produced in OpenOffice.org. In this article, Uche Ogbuji introduces the OpenOffice file format and explains its advantages.

Date:  01 Jan 2003
Level:  Introductory

When markup advocates attempt to convince an audience of the value of breakthroughs such as XML, they almost invariably give the example of the proprietary, binary file format -- and the most common bogey is the saved word processor file. Discussion of precursors to XML file formats usually includes comma-delimited file formats, which are very often used for import and export of spreadsheets and databases. The saved files from front-office (or just office) tools -- word processors, spreadsheets, presentation software, contact managers, and the like -- hold an inordinate amount of the data that represents users' knowledge. Your notes, memos, proposals, analyses, plans, and organizational tools are on the front line of knowledge management. When you upgrade or migrate any such software, one major concern is whether the new arrangement will import your old files. When you perform backups, you usually start with these office files.

Vendors know this and understand the nuances of making their proprietary file formats important enough to force your loyalty, while making their tools flexible enough to accept the files of competitors. But markup advocates -- and XML advocates in particular -- point out that you needn't even submit to such seemingly benevolent captivity. "Why shouldn't you have 100% control over such crucial data?" goes the argument, and furthermore, "Why should you not be able to simply open the file with any text viewer and have some chance of understanding the contents?" XML has been offered as a solution. Not only is XML plain text, but it comes with a toolkit that makes it possible to convert between different XML formats. It is offered as a remedy for problems of transparency as well as interoperability.

As one would expect, an increasing number of office tools offer XML output. Recently Microsoft has made a huge to-do about the XML integration and export capabilities in the latest version of its office suite. The OpenOffice.org project, which produces a complete, open-source office suite derived from StarOffice, uses XML for its core file formats, rather than as a separate export option. OpenOffice includes a word processor, a spreadsheet, a presentation tool, and a graphics/diagramming tool. It's been around a long time (it emerged around 1994) and has acquired the polish and features you would expect of any such office suite.

The stakeholders in OpenOffice.org -- the contributors and users on the OpenOffice.org Web site -- have all committed to making its file format as open and general as possible, in the hope of fostering greater interoperability and flexibility among office file formats. To further this goal, they have contributed the file formats to a new technical committee (TC) of the Organization for the Advancement of Structured Information Standards (OASIS). I am a founding member of this committee, and I think that the OpenOffice format can be a valuable community resource for connecting the human-readable documents we use in our work and communication to the sorts of metadata management that can enhance the aggregate value of these documents. In this article, I introduce the OpenOffice file formats.

This is an interesting time for the intersection of XML and office software. There has been a lot of discussion of the recent Microsoft XDocs technology and how it may or may not compete with or complement XForms, the OpenOffice formats, and other such projects. I shall not cover any such connections here -- in part because of lack of space, and in part because details of XDocs are just emerging. Also, for the rest of the article, I'll use the name "OpenOffice", rather than using the full, official name "OpenOffice.org".

The overall format

I fired up OpenOffice 1.0.1 for Linux (which, I was happy to find, comes with Red Hat 8.0) and created a document as shown in Figure 1.


Figure 1. An OpenOffice word processor session

As you can see, the editing interface is much like that of any other WYSIWYG word processor screen (the OpenOffice user interface is beyond the scope of this article). I saved the file as document.sxw. As with all files saved in OpenOffice native format, this is actually a ZIP file that contains a set of XML and other support files -- a bundling known as the OpenOffice package format. The idea of standardizing on an archive file convention for packaging multiple, related XML documents and support files is a popular and well-trodden one: XML expert Rick Jelliffe has developed an XML Application Archive (XAR) format that is based on ZIP; there is also Direct Internet Message Encapsulation (DIME), which is an Internet Draft, but is more complex and intended for messaging and Web services rather than generalized archives. OpenOffice uses its own format, which I examine next. See Resources for more information on these formats.

The ZIP contents of document.sxw are as follows:

$ unzip -v document.sxw
Archive:  document.sxw
 Length   Method    Size  Ratio   Date   Time   CRC-32    Name
--------  ------  ------- -----   ----   ----   ------    ----
    2946  Defl:N      965  67%  12-13-02 04:03  44fee85c  content.xml
    4638  Defl:N     1199  74%  12-13-02 04:03  791e906a  styles.xml
    1120  Stored     1120   0%  12-13-02 04:03  a921529c  meta.xml
    6183  Defl:N     1362  78%  12-13-02 04:03  c8586553  settings.xml
     752  Defl:N      254  66%  12-13-02 04:03  11144701  META-INF/manifest.xml
--------          -------  ---                            -------
   15639             4900  69%                            5 files
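
The same listing can be produced programmatically, since the package is an ordinary ZIP archive. The following is a minimal sketch using Python's standard zipfile module. The function name list_package and the in-memory stand-in package are illustrative assumptions, not part of OpenOffice; the stand-in exists only so that the sketch runs without a real document.

```python
import io
import zipfile


def list_package(source):
    """Return (member name, uncompressed size) pairs for a ZIP-based package."""
    with zipfile.ZipFile(source) as package:
        return [(info.filename, info.file_size) for info in package.infolist()]


# Build a tiny stand-in package in memory (hypothetical contents, not a real
# OpenOffice document) so the sketch is self-contained.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as demo:
    demo.writestr("content.xml", "<placeholder/>")
    demo.writestr("META-INF/manifest.xml", "<placeholder/>")
buffer.seek(0)

entries = list_package(buffer)
for name, size in entries:
    print(f"{size:8d}  {name}")
```

To inspect a real document, pass its path instead of the buffer, for example list_package("document.sxw").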

The first stop is META-INF/manifest.xml, which serves as a central directory of the other files in the package. Listing 1 is the manifest file from my sample document.

Listing 1. The manifest file from document.sxw

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE manifest:manifest PUBLIC "-//OpenOffice.org//DTD Manifest 1.0//EN" 
    "Manifest.dtd">
<manifest:manifest xmlns:manifest="http://openoffice.org/2001/manifest">
 <manifest:file-entry manifest:media-type="application/vnd.sun.xml.writer"
       manifest:full-path="/"/>
 <manifest:file-entry manifest:media-type="" manifest:full-path="Pictures/"/>
 <manifest:file-entry manifest:media-type="text/xml" manifest:full-path="content.xml"/>
 <manifest:file-entry manifest:media-type="text/xml" manifest:full-path="styles.xml"/>
 <manifest:file-entry manifest:media-type="text/xml" manifest:full-path="meta.xml"/>
 <manifest:file-entry manifest:media-type="text/xml" manifest:full-path="settings.xml"/>
</manifest:manifest>

All OpenOffice formats are defined by DTDs, which I think is good: having a schema helps enforce interoperability of the format, and the choice of DTD ensures the broadest support in XML tools. Be warned that to process these files with general XML tools, you must do one of three things: use a catalog to resolve the public ID, copy the Manifest.dtd file specified as the system ID into the same directory, or use tools that do not read the external DTD subset. OpenOffice maintains an internal catalog of the needed DTDs and entities.

The OpenOffice DTDs can be found in the share directory of your OpenOffice installation. For example, in my Red Hat 8.0 installation, they are in /usr/lib/openoffice/share/dtd/ and the manifest DTD is in /usr/lib/openoffice/share/dtd/officedocument/1_0/. You can also download the DTDs or access them online at the OpenOffice Web site (see Resources).

The manifest file uses an OpenOffice namespace (bound to the manifest prefix in Listing 1) and consists mostly of a list of file-entry elements that give the Internet media type (IMT) and relative URL for each file. The media-type attribute is left empty for subfolders such as the Pictures folder, which is empty in my example but normally contains the graphic source files for any embedded pictures.
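
As an illustration of the third option (a tool that does not read the external DTD subset), here is a minimal sketch that reads a manifest with Python's standard xml.etree.ElementTree, which parses the DOCTYPE declaration without fetching Manifest.dtd. The embedded manifest is a shortened copy of Listing 1, and the function name manifest_entries is an assumption made for this example.

```python
import xml.etree.ElementTree as ET

# Namespace URI taken from the manifest in Listing 1.
MANIFEST_NS = "http://openoffice.org/2001/manifest"

# Shortened copy of Listing 1, kept as bytes so the encoding declaration applies.
MANIFEST = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE manifest:manifest PUBLIC "-//OpenOffice.org//DTD Manifest 1.0//EN"
    "Manifest.dtd">
<manifest:manifest xmlns:manifest="http://openoffice.org/2001/manifest">
 <manifest:file-entry manifest:media-type="application/vnd.sun.xml.writer"
       manifest:full-path="/"/>
 <manifest:file-entry manifest:media-type="" manifest:full-path="Pictures/"/>
 <manifest:file-entry manifest:media-type="text/xml" manifest:full-path="content.xml"/>
</manifest:manifest>
"""


def manifest_entries(data):
    """Return (full-path, media-type) pairs for each file-entry element."""
    root = ET.fromstring(data)
    entries = []
    for entry in root.iter(f"{{{MANIFEST_NS}}}file-entry"):
        path = entry.get(f"{{{MANIFEST_NS}}}full-path")
        media_type = entry.get(f"{{{MANIFEST_NS}}}media-type")
        entries.append((path, media_type))
    return entries


for path, media_type in manifest_entries(MANIFEST):
    print(f"{path!r:16} {media_type!r}")
```

This approach works here because the manifest uses no entities defined in the DTD; a document that relied on such entities would still need the catalog or a local copy of the DTD.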

meta.xml contains a series of elements with the document metadata, such as the creation and last-edit dates, the total time spent editing the document, and counts of words, pages, tables, pictures, and the like.

You can think of styles.xml as a cross between cascading style sheets (CSS) in an XML format and XSL Formatting Objects (XSL-FO). It defines the various styles that are available in the editing session for the document in terms of font, pitch, decorations, spacing, tab stops, and the like. It names all styles so that the other files can reference them.

settings.xml records user preferences for the OpenOffice user interface. These concern the application used to edit the document, rather than the document itself. This is one area where some work still needs to be done to ensure interoperability. After all, if the same document is edited in multiple applications (all of which use the OpenOffice format), one can't expect each application to maintain the same sorts of settings -- and even then, how does one prevent them from clashing?


Working the content

The heart of the document, the actual content, is in content.xml. Unfortunately, this file is a bit too cluttered with elements for casual viewing in a text editor, but you can extract the character data using many common XML tools, including XSLT, courtesy of the null style sheet (see Listing 2).

Listing 2. The null style sheet

<xsl:stylesheet
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  version="1.0"
>
  <xsl:output method="text"/>
</xsl:stylesheet>

The null style sheet uses all default template rules, with the effect of stripping all markup from any XML file. I specify the text output method to avoid getting an XML declaration in the output. Running this style sheet, which works with any XSLT processor:

$ 4xslt content.xml null.xslt

produces the following output:

This new column, Thinking XML, will cover the intersection of XML andknowledge architecture (KA). Knowledge architecture sounds likesomething tossed out by a jargon bot, but it's really just an umbrella termfor some very useful technologies that are emerging now that XML is enteringits adolescence. Metadata management, semantic transparency, and autonomousagents are hardly concepts unique to XML, but the promise of XML to unifythe syntax of structured and semistructured data helps turn the next-to-impossibleinto the feasible.

Notice the missing spaces. It seems that OpenOffice is rather fastidious about marking style in the document, according to the user's editing pattern. The file contains a lot of markup along these lines:

<text:s/><text:span text:style-name="T2">Knowledge architecture</text:span> sounds 
like</text:p><text:p text:style-name="Standard">something tossed out by a jargon bot, 
but it's really just an umbrella term</text:p>

(In the preceding code sample, the code is split between words and appears on multiple lines for ease of viewing. In reality, the code is a single, long line.)

The text is broken into multiple elements, and OpenOffice fills in spaces as needed. The XSLT processor does not perform the same compensation, with the effect you see in the output above. You could perform the same compensation yourself with a few simple XSLT templates that add spaces between element spans. But the key here is that you can use generic tools to process this file format very effectively.
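
The compensation can also be sketched outside XSLT. The following Python example, using the standard xml.etree.ElementTree module, first reproduces the effect of the null style sheet and then adds the missing spaces. It rests on assumptions not stated in this article: that each text:s element stands for one space character, and that a single space is an acceptable separator between paragraphs. The fragment is a hypothetical one modeled on the markup quoted above, and its namespace URIs are stand-ins, which is why the code matches on local element names only.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment modeled on the markup quoted above. The namespace URIs
# are stand-ins; a real content.xml declares its own.
SAMPLE = (
    '<office:body xmlns:office="urn:example:office" xmlns:text="urn:example:text">'
    '<text:p text:style-name="Standard">(KA). <text:s/>'
    '<text:span text:style-name="T2">Knowledge architecture</text:span>'
    ' sounds like</text:p>'
    '<text:p text:style-name="Standard">something tossed out by a jargon bot</text:p>'
    '</office:body>'
)


def local_name(tag):
    """Strip the {namespace} part that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def collect_text(elem, out):
    """Append the character data of elem to out, compensating for spaces."""
    name = local_name(elem.tag)
    if name == "s":
        out.append(" ")  # assumption: text:s stands for one space
    if elem.text:
        out.append(elem.text)
    for child in elem:
        collect_text(child, out)
        if child.tail:
            out.append(child.tail)
    if name == "p":
        out.append(" ")  # assumption: separate paragraphs with one space


root = ET.fromstring(SAMPLE)

# Equivalent of the null style sheet: all character data, no compensation.
naive = "".join(root.itertext())

pieces = []
collect_text(root, pieces)
spaced = "".join(pieces).strip()

print(naive)
print(spaced)
```

Matching on local names keeps the sketch independent of the exact namespace URIs, at the cost of also matching same-named elements from other vocabularies; a stricter version would compare full qualified names.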


Conclusion

In this column, I've provided a sketch of the OpenOffice text file format, but the project does not just toss out a text format and leave it at that. OpenOffice provides a rich toolkit for integrating XML tools, and there is a growing body of third-party tools as well. These include SAX filters, XSLT plug-ins, and even low-level Java APIs. Developers from the community have already used these facilities to augment OpenOffice with the ability to load and save DocBook, HTML, TeX, plain text, and the document formats used by PalmOS and PocketPC.

XMerge is a project for working with OpenOffice content on small devices such as PDAs and cell phones. Work on XMerge is proceeding at a remarkable pace, and vendors such as Nokia have seen fit to chip in to the project. This underlines another huge benefit of the openness embraced by OpenOffice: it encourages contributions from a wide variety of sources, even commercial interests, who understand that this openness brings about a level playing field, as opposed to the use of a proprietary format. XMerge uses XSLT plug-ins for document conversion, which also ensures cross-platform support.

In the Open Office XML format TC (note the different spelling), we will continue to improve these file formats, with a sharp eye on enhancing interoperability even further. This is an open process with an open mailing list, and any OASIS member can join formally. I encourage all who are interested in managing front-office documents to participate, as well as those who are inclined to hack away happily at OpenOffice files using whatever tools they have lying around. After all, it's just XML.


Resources
