
OOXML: What's the big deal?

A standards fan talks about OOXML

Peter Seebach (dw-xml@seebs.net), Writer, Freelance
Peter Seebach has been interested in standardization for many years, and volunteered on the ISO C committee for nearly a decade. He has been using XML as a document interchange format for several years.

Summary:  The OOXML specification has been both criticized and defended by a number of people, leading many to wonder what the big deal is. This article illustrates the basis of technical, rather than political, objections to treating OOXML as a standard.

Date:  19 Feb 2008
Level:  Introductory

I've been active in standardization for a long time (including about a decade of volunteer work on the ISO C standards committee). Most anyone interested in standardization is likely to form an opinion on the standardization process involving Office Open XML (OOXML), the proposed XML-based document format from Microsoft®. Normally, I don't start an article by talking about myself, but with recent allegations that IBM® has worked to derail OOXML, I wanted to start by making one thing perfectly clear: I'm not an employee of IBM, these are my own opinions, and I developed these opinions without reference to the position of IBM on the issue.

The OOXML standard is a big deal for a number of reasons. The fact is, no matter who you blame for the political content of the process, the standards process around it has been rife with political maneuvering (see Resources). However, beyond all that politics, serious technical questions range from whether XML is a good choice for standards to what the purpose of standardization is. All this news coverage and discussion is a great opportunity for me to get up on a soap box and talk about what makes standards matter.

What is the objective of a standard?

Standards exist to allow interoperation. If my word processor and your word processor can both open the same files, I can share documents with you easily. If they can't, we'll have trouble. In the absence of standards, we spend an incredible amount of time and effort working around the lack of a shared document format. Vendors of word processors spend unbelievable amounts of time and effort reverse-engineering each other's document formats so that their products can import and export files, and users can just open a document and expect to see it roughly as it was saved.

It's pretty obvious that nearly everyone benefits from a standard. The one arguable exception, for document formats, is companies in a dominant position in a given industry; in fact, they benefit when there isn't a standard, because they might be able to push their own format as a de facto standard. That gives them a two-fold competitive advantage: everyone else has to spend extra time and money supporting that format, and no one else's support will ever be as good.

It's important to understand that, by definition, a good interchange standard won't specify everything that every vendor does. Every vendor would have to support every feature in the standard, and that means that every feature added to the standard must be replicated by multiple vendors. It's better to simply allow for extensions or extra features which can't be encoded in documents in the standard format. As a user, I'd rather be able to confidently expect standard documents to work the same everywhere, rather than having a specification so baroque that no two vendors will quite be able to match up their behaviors.

The demand for standardization of office document formats is very strong. Many organizations, from corporations to governments, are drafting rules that require software to support open standards for document storage. No one wants to be locked in to a single vendor; standards offer a way out from that. With that in mind, perhaps it's time to look at some of the technical questions about Microsoft's proposed OOXML standard.

The OOXML standard

The OOXML standard, available from ECMA, is distributed as a set of PDF documents, totaling around 6000 pages. That's a lot of specification, and it goes into comprehensive detail. The reason it's so huge is simple: OOXML is essentially a complete replication of every chunk of data that a Microsoft Office application might possibly save in a file.

There have been a number of technical complaints made about OOXML. Every one of them comes down to the same base complaint: Rather than specifying a reasonable common interchange format, OOXML specifies the whole feature set of Microsoft Office, down to bug compatibility. This creates a burden on other implementors which is simply unreasonable (and in fact impossible) to meet, while conveniently being precisely what Microsoft is already shipping. That raises a lot of concerns.

Don't mistake this for mere complaints that Microsoft has a head start; a small, well-designed standard which Microsoft had implemented and everyone else could reasonably implement would probably have been accepted without much fuss. The showstopper problems fall into three broad categories: features which are unreasonably hard to implement, features which are simply not adequately specified, and features which are utterly unique to Microsoft Office. These categories overlap somewhat, but each stands as a different kind of barrier to entry.

Unreasonable requirements

Traditionally, in a standard, the problems implementors are expected to solve are reasonably well-defined and scoped. You might be required to implement a dozen kinds of paragraph justification, but all of them are specified, and all of them are reasonably limited in scope. By contrast, OOXML imposes requirements which are extremely open-ended. As an example, when describing page headers, the proposed specification states "Both of font name and font type can be localized values." This seemingly simple sentence (pointed out by Stéphane Rodriguez; see Resources) opens a gigantic can of worms.

What locales can you use? Do you have a complete list of every locale that might ever be used by any other vendor's implementation of this spec, and every way in which they might choose to localize a font name or a font type? What do you do if an enthusiastic implementor chooses to write the font name and font type in a language that, at the time of your implementation, you'd never heard of?

Presumably, this reflects a historical decision to store the localized values that were presented to, or picked by, the user. Unfortunately, without a great deal more specification (at the very least, a complete list of locales that are permissible, and some way of telling which locale is being used), it's simply not possible to implement this. This reflects the historical quirks of a given implementation; it is not an appropriate choice for a standardized format to be shared among multiple implementations.
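To make the difficulty concrete, here is a minimal sketch in Python of what a consumer would be stuck writing. Everything in it is hypothetical: the table, the locale tags, the function name, and the made-up German entry are mine, not the specification's. That is the point; the specification supplies no such table.

```python
# Hypothetical lookup table: (locale, localized name) -> canonical font name.
# The proposed spec supplies neither the list of locales nor the mapping,
# so every implementor would have to guess at its contents.
KNOWN_LOCALIZED_FONT_NAMES = {
    ("en", "Arial"): "Arial",
    ("de", "Beispielschrift"): "Example Font",  # made-up entry for illustration
}


def resolve_font_name(stored_name, locale=None):
    """Return a canonical font name, or None if the name cannot be resolved.

    `locale` is None when the file does not say which locale was used,
    which the quoted sentence does not require it to do.
    """
    if locale is not None:
        return KNOWN_LOCALIZED_FONT_NAMES.get((locale, stored_name))
    # Without a locale, all we can do is scan every locale we know about.
    matches = {
        canonical
        for (_, name), canonical in KNOWN_LOCALIZED_FONT_NAMES.items()
        if name == stored_name
    }
    # Ambiguous or unknown names cannot be resolved reliably.
    return matches.pop() if len(matches) == 1 else None
```

Any name written in a locale missing from the table comes back as None, and no amount of care on the implementor's part can make the table complete.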

Inadequate specifications

Because some Microsoft Office documents use drawings in a vector language called VML, OOXML specifies how they are stored. This means that every implementor is on the hook to read these drawings—unfortunately, no real specification is offered for them. You can find VML shapes as strings inside particular items.

What exactly are the allowable values? That's answered clearly enough; "The possible values for this attribute are defined by the XML Schema string datatype." Which is to say, it's a string. It can contain arbitrary text, the meaning of which can be answered only by the code of the VML library. In short, unless you happen to have the VML library just lying around, you can't possibly implement this.
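A short Python sketch shows how little a "string" datatype buys an implementor. The element name, attribute name, and value below are invented for illustration and are not quoted from the specification; the validity check is a deliberately simplified stand-in for schema validation.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the element name, attribute name, and value are
# invented for illustration and are not quoted from the specification.
FRAGMENT = '<shape path="m 0,0 l 100,0 100,100 x e"/>'


def read_shape_path(xml_text):
    """Parse the fragment and return the raw attribute value."""
    element = ET.fromstring(xml_text)
    return element.get("path")


def is_valid_xsd_string(value):
    """Simplified check: a 'string' type accepts any attribute text."""
    return isinstance(value, str)


path = read_shape_path(FRAGMENT)
# Parsing succeeds and validation passes, but nothing here says how to draw it.
drawing_is_valid = is_valid_xsd_string(path)
nonsense_is_valid = is_valid_xsd_string("not a drawing")
```

Both checks pass. The XML layer hands over the text faithfully, and the schema cannot tell a drawing from nonsense; the meaning lives entirely in code the implementor doesn't have.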

Once again, this is a historical quirk. In a standard designed for interchange, the drawing format (and probably only one) is fully specified, and an implementor who happens to have another drawing library is expected to export drawings into the standard format. Instead, OOXML provides a mere recapitulation of an earlier design (and one which is, intentionally, not available to others), and expects everyone else to adapt.

Unique features

The last category is the one which has drawn the most ire from many standards experts. This is not because it's harder to implement—you can't get harder to implement than impossible—but because it should never have existed. This is the category of features which are entirely and utterly dependent on Microsoft Office in some way.

Probably the most famous example is one of the optional settings provided in OOXML. The setting is called "useWord97LineBreakRules", and it tells an application to use the line-break rules that Word 97 used for East Asian documents. Like the previous examples, this is impossible for anyone else to do, because no specification of these rules is provided. In fact, the OOXML standard even warns implementors not to implement this:


Listing 1. The OOXML standard's guidance for useWord97LineBreakRules
[Guidance: To faithfully replicate this behavior, applications must imitate the behavior of that application, which involves many possible behaviors and cannot be faithfully placed into narrative for this Office Open XML Standard. If applications wish to match this behavior, they must utilize and duplicate the output of those applications. It is recommended that applications not intentionally replicate this behavior as it was deprecated due to issues with its output, and is maintained only for compatibility with existing documents from that application. end guidance]

This guidance is excellent. Given that there is no specification available of this feature, and it is deprecated, it makes all kinds of sense for people not to implement it. But wait; if it shouldn't be implemented, why is it in the spec? Compatibility with existing documents is not a reason to add a feature to a standard aimed at interchanging data; users are worried about whether their text can be opened at all in another program, not whether every line break is in the exact same location!
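Here is roughly what following that guidance looks like in code, as a hedged Python sketch. The flag name is the one discussed above; the surrounding fragment is simplified (namespaces omitted), and the function names and the "default" label are my own assumptions.

```python
import xml.etree.ElementTree as ET

# Illustrative settings fragment. The flag name comes from the article;
# the surrounding structure is simplified and namespaces are omitted.
SETTINGS = "<settings><compat><useWord97LineBreakRules/></compat></settings>"

# Flags this hypothetical consumer recognizes but has no rules for.
UNSPECIFIED_FLAGS = {"useWord97LineBreakRules"}


def unimplementable_flags(settings_xml):
    """Return the flags present in the document that cannot be honored."""
    root = ET.fromstring(settings_xml)
    present = {element.tag for element in root.iter()}
    return sorted(present & UNSPECIFIED_FLAGS)


def choose_line_breaking(settings_xml):
    """Pick line-break behavior and report which flags were ignored."""
    ignored = unimplementable_flags(settings_xml)
    # No specification of the legacy rules exists, so the only possible
    # choice is the default behavior, whether or not the flag is set.
    return "default", ignored
```

The flag can be detected and reported, but it can never change the result. A setting that every other implementor must parse and then ignore isn't doing any interchange work.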

This feature is in the spec because OOXML is not a document interchange format; it's a careful, bit-for-bit, replication of Microsoft's historical binary formats, wrapped up in angle brackets.

Does this mean XML is a bad choice?

After reading some of the complaints about OOXML, some IT professionals have formed the notion that XML is a poor choice for standardization. I think this judgement is, at best, premature. In fact, I think it's just plain wrong. The problems here are not caused by XML; they are caused by the decision to dutifully reproduce every scrap of backwards-compatibility and every quirk of behavior of an existing program, rather than specifying the structure and contents of generic documents intended to be shared and interchanged between multiple applications.

This can be done quite well in XML. The obvious competitor to OOXML is also an XML standard, called Open Document Format (ODF). It is by no means an entirely trivial or small standard; version 1.1 of ODF is a 738-page document, and the group developing it does not consider it complete or final yet. For instance, it does not define the formula language used in spreadsheets—although this is being worked on, for inclusion in a proposed version 1.2 standard. Nonetheless, a review of the ODF specification shows that, rather than attempting to describe the behavior of a monolithic legacy application, it tries to describe the contents of documents.

The purpose of XML is to let you define how the contents of documents are described. While the ODF description is not fully polished yet, it is at least conceivable that it could be.

Conclusions

While XML is a powerful and expressive tool for defining new file formats, it cannot save you from a poor choice of project scope. If you decide to make a file format in which a flag specifies the use of a large, undocumented, and proprietary rendering library, it doesn't matter whether you specify that flag through a single bit in an undocumented binary string, or with three pages of angle brackets; your specification is proprietary, and there is no way to render it otherwise simply by wrapping it in XML.
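The same point as a small Python sketch: one hypothetical flag, stored once as a bit in a binary header and once as an XML element. The bit position, the element name `useLegacyRendering`, and both sample documents are invented for illustration.

```python
import xml.etree.ElementTree as ET

LEGACY_RENDERING_BIT = 0x01  # hypothetical bit position in a binary header


def flag_from_binary(header):
    """Read the flag from the first byte of a binary header."""
    return bool(header[0] & LEGACY_RENDERING_BIT)


def flag_from_xml(xml_text):
    """Read the same flag from a hypothetical XML element."""
    root = ET.fromstring(xml_text)
    return root.find("useLegacyRendering") is not None


binary_doc = bytes([0x01, 0x00])
xml_doc = "<document><useLegacyRendering/></document>"

# Both encodings yield the same boolean, and neither tells a reader
# what the legacy rendering library actually does.
same = flag_from_binary(binary_doc) == flag_from_xml(xml_doc)
```

The XML version is easier to read, but once either function returns True, an implementor is exactly as stuck as before.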

It's a shame that XML, which has the potential to offer consistent and standardized parsing across a broad range of file formats, is getting some of the blame for OOXML's shortcomings. OOXML is a 6000-page description, not just of what a given word processor does today, but of many things it used to do, some of which are only alluded to rather than specified. That it is even possible to talk usefully about attempts to implement OOXML must be considered a credit to the robustness of the underlying XML standard.

OOXML is a credible effort to solve a real problem: The problem of how to replace completely opaque binary files encoding ten years of accreted behavior with partially-legible XML files encoding the same behavior, down to the last bit. That problem, unfortunately, is not the problem of providing a usable, implementable, exchange format for office documents.

If Microsoft wants OOXML to be taken seriously as a proposal for a document standard, only one option is on the table. Rather than try to develop a specification with every possible feature of any version of Microsoft Office, every flag or quirk that some document might use, focus on building a smaller, leaner, interchange format which provides core functionality in a fully-described and implementable fashion. Don't expose implementation quirks, such as Excel® calculation chains, to people who just want to copy a spreadsheet's data and formulas. Don't expose, or even refer to, the details of the VML library, or the DrawingML library, or anything like that; instead, provide a brand new, open, and completely specified, description of the data.

When I wrote the Standards & Specs piece on XML some time back, I made an offhand reference to the notion of an XML format containing "<bytes>ff ff 00 03 [. . .]</bytes>". When I wrote it, I thought I was joking. I guess I wasn't.


Resources

Learn

Get products and technologies

  • IBM trial software: Build your next development project with trial software available for download directly from developerWorks.

Discuss

