When markup advocates attempt to convince an audience of the value of breakthroughs such as XML, they almost invariably give the example of the proprietary, binary file format -- and the most common bogey is the saved word processor file. Discussion of precursors to XML file formats usually includes comma-delimited file formats, which are very often used for import and export of spreadsheets and databases. The saved files from front-office tools, or just office tools -- word processors, spreadsheets, presentation software, contact managers, and the like -- hold an inordinate amount of the data that represents users' knowledge. Your notes, memos, proposals, analyses, plans, and organizational tools are on the front line of knowledge management. When you upgrade or migrate any such software, one major concern is whether the new arrangement will import your old files. When you perform backups, you usually start with these office files.
Vendors know this and understand the nuances of making their proprietary file formats important enough to force your loyalty, while making their tools flexible enough to accept the files of competitors. But markup advocates -- and XML advocates in particular -- point out that you needn't even submit to such seemingly benevolent captivity. "Why shouldn't you have 100% control over such crucial data?", goes the argument, and furthermore, "Why should you not be able to simply open the file with any text viewer and have some chance of understanding the contents?" XML has been offered as a solution. Not only is XML plain text, but it comes with a toolkit that makes it possible to convert between different XML formats. It is offered as a salve for transparency as well as interoperability.
As one would expect, an increasing number of office tools offer XML output. Recently Microsoft has made a huge to-do about the XML integration and export capabilities in the latest version of its office suite. The OpenOffice.org project, which produces a complete, open-source office suite derived from StarOffice, uses XML for its core file formats, rather than as a separate export option. OpenOffice includes a word processor, a spreadsheet, a presentation tool, and a graphics/diagramming tool. It's been around a long time (it emerged around 1994) and has acquired the polish and features you would expect of any such office suite.
The stakeholders in OpenOffice.org -- the contributors and users on the OpenOffice.org Web site -- have all committed to making its file format as open and general as possible, in the hope of fostering greater interoperability and flexibility among office file formats. To further this goal, they have contributed the file formats to a new technical committee (TC) of the Organization for the Advancement of Structured Information Standards (OASIS). I am a founding member of this committee, and I think that the OpenOffice format can be a valuable community resource for connecting the human-readable documents we use in our work and communication to the sorts of metadata management that can enhance the aggregate value of these documents. In this article, I introduce the OpenOffice file formats.
This is an interesting time for the intersection of XML and office software. There has been a lot of discussion of the recent Microsoft XDocs technology and how it may or may not compete with or complement XForms, the OpenOffice formats, and other such projects. I shall not cover any such connections here -- in part because of lack of space, and in part because details of XDocs are just emerging. Also, for the rest of the article, I'll use the name "OpenOffice", rather than using the full, official name "OpenOffice.org".
I fired up OpenOffice 1.0.1 for Linux (which, I was happy to find, comes with Red Hat 8.0) and created a document as shown in Figure 1.
Figure 1. An OpenOffice word processor session
As you can see, the editing interface is much like that of any other WYSIWYG word processor screen (the OpenOffice user interface is beyond the scope of this article). I saved the file as document.sxw. As with all files saved in OpenOffice native format, this is actually a ZIP file that contains a set of XML and other support files -- a bundling known as the OpenOffice package format. The idea of standardizing on an archive file convention for packaging multiple, related XML documents and support files is a popular and well-trodden one: XML expert Rick Jelliffe has developed an XML Application Archive (XAR) format that is based on ZIP; there is also Direct Internet Message Encapsulation (DIME), which is an Internet Draft, but is more complex and intended for messaging and Web services rather than generalized archives. OpenOffice uses its own format, which I examine next. See Resources for more information on these formats.
The ZIP contents of document.sxw are as follows:
$ unzip -v document.sxw
Archive:  document.sxw
 Length   Method    Size  Ratio   Date    Time   CRC-32    Name
--------  ------  ------- -----   ----    ----   ------    ----
    2946  Defl:N      965  67%  12-13-02  04:03  44fee85c  content.xml
    4638  Defl:N     1199  74%  12-13-02  04:03  791e906a  styles.xml
    1120  Stored     1120   0%  12-13-02  04:03  a921529c  meta.xml
    6183  Defl:N     1362  78%  12-13-02  04:03  c8586553  settings.xml
     752  Defl:N      254  66%  12-13-02  04:03  11144701  META-INF/manifest.xml
--------          -------  ---                             -------
   15639             4900  69%                             5 files
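Because the package is an ordinary ZIP archive, the same inventory can be taken from a script. What follows is a minimal sketch using Python's standard zipfile module. The helper names list_package and make_demo_package, and the tiny in-memory stand-in package, are assumptions made for illustration; they are not part of OpenOffice.

```python
import io
import zipfile


def list_package(package):
    """Return (name, uncompressed size, compressed size) for each member.

    `package` may be a path such as "document.sxw" or a file-like object.
    """
    with zipfile.ZipFile(package) as zf:
        return [(info.filename, info.file_size, info.compress_size)
                for info in zf.infolist()]


def make_demo_package():
    """Build a tiny stand-in package in memory (hypothetical contents)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("content.xml", "<doc>Hello</doc>")
        zf.writestr("META-INF/manifest.xml", "<manifest/>")
    buf.seek(0)
    return buf


if __name__ == "__main__":
    # Swap make_demo_package() for a real path to inspect a saved document.
    for name, size, packed in list_package(make_demo_package()):
        print(f"{size:8d} {packed:8d} {name}")
```

Passing the path of a real saved file, such as document.sxw, in place of the demo package should list the same members that unzip reports.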
The first stop is META-INF/manifest.xml, which is sort of a central directory of all the other files in the package. Listing 1 is the manifest file from my sample document.
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE manifest:manifest PUBLIC "-//OpenOffice.org//DTD Manifest 1.0//EN" "Manifest.dtd">
<manifest:manifest xmlns:manifest="http://openoffice.org/2001/manifest">
 <manifest:file-entry manifest:media-type="application/vnd.sun.xml.writer" manifest:full-path="/"/>
 <manifest:file-entry manifest:media-type="" manifest:full-path="Pictures/"/>
 <manifest:file-entry manifest:media-type="text/xml" manifest:full-path="content.xml"/>
 <manifest:file-entry manifest:media-type="text/xml" manifest:full-path="styles.xml"/>
 <manifest:file-entry manifest:media-type="text/xml" manifest:full-path="meta.xml"/>
 <manifest:file-entry manifest:media-type="text/xml" manifest:full-path="settings.xml"/>
</manifest:manifest>
All OpenOffice formats use DTDs, which I think is good because having a schema helps enforce interoperability of the format, and the choice of DTD ensures the broadest support in XML tools. Be warned that to process these files with general XML tools, you have to use a catalog for resolving the public ID, copy the Manifest.dtd file specified as the system ID to the same directory, or just use tools that do not read the external DTD subset. OpenOffice maintains an internal catalog of the needed DTDs and entities. The OpenOffice DTDs can be found in the share directory of your OpenOffice installation. For example, in my Red Hat 8.0 installation, they are in /usr/lib/openoffice/share/dtd/ and the manifest DTD is in /usr/lib/openoffice/share/dtd/officedocument/1_0/. You can also download the DTDs or access them online at the OpenOffice Web site (see Resources). The manifest file uses the common OpenOffice namespace, and mostly comprises a list of entry elements that give the Internet media type (IMT) and relative URL for each file. The media-type attribute for subfolders is left empty, such as that for the Pictures folder, which is empty in my example, but normally contains the graphic source files for any embedded pictures.
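As a rough illustration of reading the manifest with a general XML tool, here is a sketch that uses Python's standard xml.etree.ElementTree. To the best of my knowledge its underlying parser does not fetch the external DTD subset, so it falls into the third category of tools mentioned above. The function name read_manifest is hypothetical, and the embedded sample is a shortened copy of Listing 1.

```python
import xml.etree.ElementTree as ET

# Namespace URI taken from the manifest in Listing 1.
MANIFEST_NS = "http://openoffice.org/2001/manifest"

# Shortened copy of Listing 1, kept as bytes so the encoding declaration
# is handled by the parser.
SAMPLE_MANIFEST = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE manifest:manifest PUBLIC "-//OpenOffice.org//DTD Manifest 1.0//EN" "Manifest.dtd">
<manifest:manifest xmlns:manifest="http://openoffice.org/2001/manifest">
 <manifest:file-entry manifest:media-type="application/vnd.sun.xml.writer" manifest:full-path="/"/>
 <manifest:file-entry manifest:media-type="" manifest:full-path="Pictures/"/>
 <manifest:file-entry manifest:media-type="text/xml" manifest:full-path="content.xml"/>
</manifest:manifest>
"""


def read_manifest(xml_bytes):
    """Return a list of (full-path, media-type) pairs from a manifest."""
    root = ET.fromstring(xml_bytes)
    entries = []
    for entry in root.iter(f"{{{MANIFEST_NS}}}file-entry"):
        path = entry.get(f"{{{MANIFEST_NS}}}full-path")
        media_type = entry.get(f"{{{MANIFEST_NS}}}media-type")
        entries.append((path, media_type))
    return entries


if __name__ == "__main__":
    for path, media_type in read_manifest(SAMPLE_MANIFEST):
        # Folders such as Pictures/ carry an empty media type.
        print(f"{path:15s} {media_type or '(none)'}")
```

Note that the attributes are namespace-qualified, so they must be looked up by their expanded names rather than by the bare names full-path and media-type.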
meta.xml contains a series of elements with the document metadata, such as the creation and last edit dates, the total time for which the document has been edited, and counts of words, pages, tables, pictures, and the like.
You can think of styles.xml as a cross between Cascading Style Sheets (CSS) in an XML format and XSL Formatting Objects (XSL-FO). It defines the various styles that are available in the editing session for the document in terms of font, pitch, decorations, spacing, tab stops, and the like. It names all styles so you can reference them in the other files.
settings.xml records user preferences for the OpenOffice user interface. These concern the details of the application that is used to edit the document, rather than any details of the document itself. This is one area where some work still needs to be done to ensure interoperability. After all, if the same document is edited in multiple applications (all of which use the OpenOffice format), one can't expect each application to maintain the same sorts of settings -- and even then, how does one prevent them from clashing?
The heart of the document, the actual content, is in content.xml. Unfortunately, this file is a bit too cluttered with elements for casual viewing in a text editor, but you can extract the character data using many common XML tools, including XSLT, courtesy of the null style sheet (see Listing 2).
<xsl:stylesheet
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  version="1.0"
>
  <xsl:output method="text"/>
</xsl:stylesheet>
The null style sheet uses all default template rules, with the effect of stripping all markup from any XML file. I specify the text output method to avoid getting an XML declaration in the output. Running this style sheet (here with 4xslt, though any XSLT processor will do):
$ 4xslt content.xml null.xslt
produces the following output:
This new column, Thinking XML, will cover the intersection of XML andknowledge architecture (KA). Knowledge architecture sounds likesomething tossed out by a jargon bot, but it's really just an umbrella termfor some very useful technologies that are emerging now that XML is enteringits adolescence. Metadata management, semantic transparency, and autonomousagents are hardly concepts unique to XML, but the promise of XML to unifythe syntax of structured and semistructured data helps turn the next-to-impossibleinto the feasible.
Notice the missing spaces. It seems that OpenOffice is rather fastidious about marking style in the document, according to the user's editing pattern. There is a lot of stuff along the lines of:
<text:s/><text:span text:style-name="T2">Knowledge architecture</text:span> sounds like</text:p>
<text:p text:style-name="Standard">something tossed out by a jargon bot, but it's really just an umbrella term</text:p>
(In the preceding code sample, the code is split between words and appears on multiple lines for ease of viewing. In reality, the code is a single, long line.)
The text is broken into multiple elements, and OpenOffice fills in spaces as needed. The XSLT processor does not perform the same compensation, with the effect you see in the output above. You could do the same with a few simple XSLT templates, adding spaces between element spans. But the key here is that you can use generic tools to process this file format very effectively.
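The article suggests XSLT templates for this; the same idea can be sketched with any generic XML tool. The following uses Python's xml.etree.ElementTree purely as an illustration. The sample fragment is modeled on the markup above, the namespace URIs in it are placeholders (an assumption), and the function names are hypothetical. The code matches elements by local name only, so it does not depend on the real namespace URIs, and it makes no attempt to interpret elements such as text:s.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment modeled on the content.xml excerpt. The namespace
# URIs are placeholders, not the real OpenOffice ones.
SAMPLE = (
    '<office:body xmlns:office="urn:example:office"'
    ' xmlns:text="urn:example:text">'
    '<text:p text:style-name="Standard">'
    '<text:span text:style-name="T2">Knowledge architecture</text:span>'
    ' sounds like</text:p>'
    '<text:p text:style-name="Standard">'
    'something tossed out by a jargon bot</text:p>'
    '</office:body>'
)


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def naive_text(root):
    """Concatenate all character data, much as the null style sheet does."""
    return "".join(root.itertext())


def paragraph_text(root):
    """Join the text of each paragraph element with a single space."""
    paragraphs = ["".join(elem.itertext())
                  for elem in root.iter()
                  if local_name(elem.tag) == "p"]
    return " ".join(paragraphs)


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    print(naive_text(root))      # words run together at paragraph breaks
    print(paragraph_text(root))  # a space is inserted between paragraphs
```

The first function reproduces the run-together words seen in the output above; the second restores a space at each paragraph boundary.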
In this column, I've provided a sketch of the OpenOffice text file format, but the project does not just toss out a text format and leave it at that. OpenOffice provides a rich toolkit for integrating XML tools, and there is a growing body of third-party tools as well. These include SAX filters, XSLT plug-ins, and even low-level Java APIs. Developers from the community have already used these facilities to augment OpenOffice with the ability to load and save DocBook, HTML, TeX, plain text, and the document formats used by PalmOS and PocketPC.
XMerge is a project for working with OpenOffice content on small devices such as PDAs and cell phones. Work on XMerge is proceeding at a remarkable pace, and vendors such as Nokia have seen fit to chip into the project. This underlines another huge benefit of the openness embraced by OpenOffice. It encourages contributions from a wide variety of sources, even commercial interests, who understand that this openness brings about a level playing field, as opposed to the use of a proprietary format. XMerge uses XSLT plug-ins for document conversion, which also ensures cross-platform support.
In the Open Office XML format TC (note the different spelling), we will continue to improve these file formats, with a sharp eye on enhancing interoperability even further. This is an open process with an open mailing list, and any OASIS member can join formally. I encourage all who are interested in managing front-office documents to participate, as well as those who are inclined to hack away happily at OpenOffice files using whatever tools they have lying around. After all, it's just XML.
- Visit the OpenOffice home page for information relevant to users and developers, including the feature list, mailing lists, documentation, and downloads.
- Follow the progress of the OASIS Open Office XML Format TC, and keep abreast of general OpenOffice.org XML File Format news on the Cover Pages.
- Information on how OpenOffice uses XML, including the file formats, can be found at the XML project page which links to details and DTDs of the file formats, a discussion of the choices behind the package formats, and more.
- There are several XML packaging format proposals besides the OpenOffice package format. "Wrap Your App" by Leigh Dodds has pointers to others, including DZIP/XAR and DIME. DIME is also discussed in "Brother, Can You Spare a DIME?," by Rich Salz.
- The XMerge project is an excellent example of versatile applications built upon basic file formats in XML. It includes tools and filters for PDA, cell phone, and other small devices.
- Check out Adventures with OpenOffice and XML by Matt Sergeant, which discusses tools for processing OpenOffice files with scripts. Perl users in particular will find this article worthwhile.
- "Sun's open, componentized OpenOffice productivity suite," by Claude J. Bauer, discusses the genesis of OpenOffice in general, and some other technical aspects of the project (developerWorks, February 2001).
- Check out all the previous installments of Uche Ogbuji's "Thinking XML" column.
- Find more information on the technologies covered in this article at the developerWorks XML zone.
- IBM WebSphere Studio provides a suite of tools that automate XML development, both in Java and in other languages. It is closely integrated with the WebSphere Application Server, but can also be used with other J2EE servers.
- Find out how you can become an IBM Certified Developer in XML and related technologies.
Uche Ogbuji is a consultant and co-founder of Fourthought Inc., a software vendor and consultancy specializing in XML solutions for enterprise knowledge management. Fourthought develops 4Suite, an open source platform for XML, RDF, and knowledge-management applications. Mr. Ogbuji is a Computer Engineer and writer born in Nigeria, living and working in Boulder, Colorado, USA. You can contact Mr. Ogbuji at firstname.lastname@example.org.