I've been active in standardization for a long time (including about a decade of volunteer work on the ISO C standards committee). Most anyone interested in standardization is likely to form an opinion on the standardization process involving Office Open XML (OOXML), the proposed XML-based document format from Microsoft®. Normally, I don't start an article by talking about myself, but with recent allegations that IBM® has worked to derail OOXML, I wanted to start by making one thing perfectly clear: I'm not an employee of IBM, these are my own opinions, and I developed these opinions without reference to the position of IBM on the issue.
The OOXML standard is a big deal for a number of reasons. No matter whom you blame for it, the standards process around OOXML has been rife with political maneuvering (see Resources). Beyond the politics, however, it raises serious technical questions, ranging from whether XML is a good choice for standards to what the purpose of standardization is. All this news coverage and discussion is a great opportunity for me to get up on a soap box and talk about what makes standards matter.
What is the objective of a standard?
Standards exist to allow interoperation. If my word processor and your word processor can both open the same files, I can share documents with you easily. If they can't, we'll have trouble. What this means is that, in the absence of standards, we spend an incredible amount of time and effort working around the lack of a shared document format. Vendors of word processors spend unbelievable amounts of time and effort reverse-engineering each others' document formats to allow them to import and export files so that users can just open a document and expect to see it roughly as it was saved.
It's pretty obvious that nearly everyone benefits from a standard. The one arguable exception, for document formats, is a company in a dominant position in its industry; it benefits when there isn't a standard, because it might be able to push its own format as a de facto standard. That gives it a twofold competitive advantage: everyone else has to spend extra time and money supporting that format, and no one else's support will ever be as good.
It's important to understand that, by definition, a good interchange standard won't specify everything that every vendor does. If it did, every vendor would have to support every feature in the standard, which means that every feature added to the standard would have to be replicated by multiple vendors. It's better to simply allow for extensions or extra features which can't be encoded in documents in the standard format. As a user, I'd rather be able to confidently expect standard documents to work the same everywhere than have a specification so baroque that no two vendors will quite be able to match up their behaviors.
The demand for standardization of office document formats is very strong. Many organizations, from corporations to governments, are drafting rules that require software to support open standards for document storage. No one wants to be locked in to a single vendor; standards offer a way out from that. With that in mind, perhaps it's time to look at some of the technical questions about Microsoft's proposed OOXML standard.
The OOXML standard, available from ECMA, is distributed as a set of PDF documents, totaling around 6000 pages. That's a lot of specification, and it goes into comprehensive detail. The reason it's so huge is simple: OOXML is essentially a complete replication of every chunk of data that a Microsoft Office application might possibly save in a file.
There have been a number of technical complaints made about OOXML. Every one of them comes down to the same base complaint: Rather than specifying a reasonable common interchange format, OOXML specifies the whole feature set of Microsoft Office, down to bug compatibility. This creates a burden on other implementors which is simply unreasonable (and in fact impossible) to meet, while conveniently being precisely what Microsoft is already shipping. That raises a lot of concerns.
Don't mistake this for mere complaints that Microsoft has a head start; a small, well-designed standard which Microsoft had implemented and everyone else could reasonably implement would probably have been accepted without much fuss. The showstopper problems fall into three broad categories: features which are unreasonably hard to implement, features which are simply not adequately specified, and features which are utterly unique to Microsoft Office. These categories overlap somewhat, but each stands as a different kind of barrier to entry.
Traditionally, in a standard, the problems implementors are expected to solve are reasonably well-defined and scoped. You might be required to implement a dozen kinds of paragraph justification, but all of them are specified, and all of them are reasonably limited in scope. By contrast, OOXML imposes requirements which are extremely open-ended. As an example, when describing page headers, the proposed specification states "Both of font name and font type can be localized values." This seemingly simple sentence, pointed out by Stéphane Rodriguez (see Resources), opens a gigantic can of worms.
What locales can you use? Do you have a complete list of every locale that might ever be used by any other vendor's implementation of this spec, and every way in which they might choose to localize a font name, or a font type? What do you do if an enthusiastic implementor chooses to write the font name and font type in a language that, at the time of your implementation, you'd never heard of?
Presumably, this reflects a historical decision to store the localized values that were presented to, or picked by, the user. Unfortunately, without a great deal more specification (at the very least, a complete list of locales that are permissible, and some way of telling which locale is being used), it's simply not possible to implement this. This reflects the historical quirks of a given implementation; it is not an appropriate choice for a standardized format to be shared among multiple implementations.
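To make the problem concrete, here is a minimal sketch in Python of what a consumer would have to do with a localized font type. The lookup table, its entries, and the function name are all hypothetical and chosen only for illustration; nothing here comes from the OOXML specification itself.

```python
# Hypothetical lookup table mapping localized font-type names back to a
# canonical style. The entries are illustrative, not taken from the spec.
KNOWN_FONT_TYPES = {
    "en": {"Regular": "regular", "Bold": "bold", "Italic": "italic"},
    "de": {"Standard": "regular", "Fett": "bold", "Kursiv": "italic"},
}


def resolve_font_type(localized_name):
    """Return (locale, canonical_style) pairs that match the stored name.

    The stored value does not say which locale produced it, so every
    locale the implementation knows about has to be searched.
    """
    matches = []
    for locale, names in KNOWN_FONT_TYPES.items():
        if localized_name in names:
            matches.append((locale, names[localized_name]))
    return matches


print(resolve_font_type("Fett"))  # found, because German is in the table
print(resolve_font_type("Gras"))  # empty list: French was never added
```

However many locales are added to the table, a document written by an implementation in one more locale still resolves to nothing, which is the open-endedness described above.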
Because some Microsoft Office documents use drawings in a vector language called VML, OOXML specifies how they are stored. This means that every implementor is on the hook to read these drawings—unfortunately, no real specification is offered for them. You can find VML shapes as strings inside particular items.
What exactly are the allowable values? That's answered clearly enough; "The possible values for this attribute are defined by the XML Schema string datatype." Which is to say, it's a string. It can contain arbitrary text, the meaning of which can be answered only by the code of the VML library. In short, unless you happen to have the VML library just lying around, you can't possibly implement this.
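A short Python sketch shows how little a string-typed attribute tells an implementor. The element name, attribute name, and attribute value below are invented for illustration and are not quoted from the specification; the check only roughly mimics what a schema validator can say about a value typed as a plain string.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the element name, attribute name, and value are
# made up for this sketch. Only the idea of "an attribute typed as a plain
# string" comes from the text above.
FRAGMENT = '<shape path="m 0,0 l 100,0 100,100 x e"/>'


def read_shape_path(xml_text):
    """Parse the fragment and return the raw attribute value."""
    element = ET.fromstring(xml_text)
    return element.get("path")


def is_valid_string_value(value):
    """Roughly the only check a string-typed schema allows: is it text?"""
    return isinstance(value, str)


path = read_shape_path(FRAGMENT)
print(is_valid_string_value(path))             # the drawing passes
print(is_valid_string_value("not a drawing"))  # so does arbitrary text
```

Parsing succeeds and validation succeeds in both cases, yet nothing in the result says how to turn the string into lines on a page. That knowledge lives only in the library that wrote it.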
Once again, this is a historical quirk. In a standard designed for interchange, the drawing format (and probably only one) is fully specified, and an implementor who happens to have another drawing library is expected to export drawings into the standard format. Instead, OOXML provides a mere recapitulation of an earlier design (and one which is, intentionally, not available to others), and expects everyone else to adapt.
The last category is the one which has drawn the most ire from many standards experts. This is not because it's harder to implement—you can't get harder to implement than impossible—but because it should never have existed. This is the category of features which are entirely and utterly dependent on Microsoft Office in some way.
Probably the most famous example is one of the optional settings provided in OOXML. The setting is called "useWord97LineBreakRules", and it tells the application to use the line-breaking rules that Word 97 used for East Asian documents. Like the previous examples, this is impossible for anyone else to implement, as no specification of these rules is provided. In fact, the OOXML standard even warns implementors not to implement this:
Listing 1. The OOXML standard's guidance for useWord97LineBreakRules
[Guidance: To faithfully replicate this behavior, applications must imitate the behavior of that application, which involves many possible behaviors and cannot be faithfully placed into narrative for this Office Open XML Standard. If applications wish to match this behavior, they must utilize and duplicate the output of those applications. It is recommended that applications not intentionally replicate this behavior as it was deprecated due to issues with its output, and is maintained only for compatibility with existing documents from that application. end guidance]
This guidance is excellent. Given that there is no specification available of this feature, and it is deprecated, it makes all kinds of sense for people not to implement it. But wait; if it shouldn't be implemented, why is it in the spec? Compatibility with existing documents is not a reason to add a feature to a standard aimed at interchanging data; users are worried about whether their text can be opened at all in another program, not whether every line break is in the exact same location!
This feature is in the spec because OOXML is not a document interchange format; it's a careful, bit-for-bit, replication of Microsoft's historical binary formats, wrapped up in angle brackets.
Does this mean XML is a bad choice?
After reading some of the complaints about OOXML, some IT professionals have formed the notion that XML is a poor choice for standardization. I think this judgement is, at best, premature. In fact, I think it's just plain wrong. The problems here are not caused by XML; they are caused by the decision to dutifully reproduce every scrap of backwards-compatibility and every quirk of behavior of an existing program, rather than specifying the structure and contents of generic documents intended to be shared and interchanged between multiple applications.
This can be done quite well in XML. The obvious competitor to OOXML is also an XML standard, called Open Document Format (ODF). It is by no means an entirely trivial or small standard; version 1.1 of ODF is a 738-page document, and the group developing it does not consider it complete or final yet. For instance, it does not define the formula language used in spreadsheets—although this is being worked on, for inclusion in a proposed version 1.2 standard. Nonetheless, a review of the ODF specification shows that, rather than attempting to describe the behavior of a monolithic legacy application, it tries to describe the contents of documents.
The purpose of XML is to let you define how the contents of documents are to be described. While the ODF description is not fully polished yet, it is at least conceivable that it could be.
While XML is a powerful and expressive tool for defining new file formats, it cannot save you from a poor choice of project scope. If you decide to make a file format in which a flag specifies the use of a large, undocumented, and proprietary rendering library, it doesn't matter whether you specify that flag through a single bit in an undocumented binary string, or with three pages of angle brackets; your specification is proprietary, and there is no way to render it otherwise simply by wrapping it in XML.
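The point can be sketched in a few lines of Python. The bit position, the binary layout, and the simplified XML layout (no namespaces) are assumptions made for this example rather than the real file formats; the flag name is the one discussed earlier in this article.

```python
import xml.etree.ElementTree as ET

# Hypothetical encodings of the same compatibility flag. The bit position
# and the XML layout are assumptions made for this sketch.
LEGACY_FLAG_BIT = 0x04
BINARY_SETTINGS = bytes([0b00000100])
XML_SETTINGS = "<settings><compat><useWord97LineBreakRules/></compat></settings>"


def flag_from_binary(blob):
    """Read the flag from a single bit of a binary settings byte."""
    return bool(blob[0] & LEGACY_FLAG_BIT)


def flag_from_xml(xml_text):
    """Read the same flag from the presence of an XML element."""
    root = ET.fromstring(xml_text)
    return root.find("./compat/useWord97LineBreakRules") is not None


def break_lines(text, use_legacy_rules):
    """Lay out text; the legacy branch cannot be written from the spec."""
    if use_legacy_rules:
        raise NotImplementedError("legacy line-break rules are not specified")
    return text.split("\n")


print(flag_from_binary(BINARY_SETTINGS), flag_from_xml(XML_SETTINGS))
```

Both readers report that the flag is set, and the XML version is certainly easier to read. Either way, calling break_lines with the flag set ends in the same NotImplementedError, because the behavior behind the flag is described in neither encoding.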
It's a shame that XML, which has the potential to offer consistent and standardized parsing across a broad range of file formats, is getting some of the blame for OOXML's shortcomings. OOXML is a 6000-page description, not just of what a given word processor does today, but of many things it used to do, some of which are only alluded to rather than specified. That it is even possible to talk usefully about attempts to implement OOXML must be considered a credit to the robustness of the underlying XML standard.
OOXML is a credible effort to solve a real problem: The problem of how to replace completely opaque binary files encoding ten years of accreted behavior with partially-legible XML files encoding the same behavior, down to the last bit. That problem, unfortunately, is not the problem of providing a usable, implementable, exchange format for office documents.
If Microsoft wants OOXML to be taken seriously as a proposal for a document standard, only one option is on the table. Rather than try to develop a specification with every possible feature of any version of Microsoft Office, every flag or quirk that some document might use, focus on building a smaller, leaner, interchange format which provides core functionality in a fully-described and implementable fashion. Don't expose implementation quirks, such as Excel® calculation chains, to people who just want to copy a spreadsheet's data and formulas. Don't expose, or even refer to, the details of the VML library, or the DrawingML library, or anything like that; instead, provide a brand new, open, and completely specified, description of the data.
When I wrote the Standards & Specs piece on XML some time back, I made an offhand reference to the notion of an XML format containing "<bytes>ff ff 00 03 [. . .]</bytes>". When I wrote it, I thought I was joking. I guess I wasn't.
Learn
- OOXML is defective by design (Stéphane Rodriguez, blog, February 2008): Don't be thrown off by the initial focus on the omniscience-complete task of implementing the spreadsheet format; this page goes into a great deal of detail on a number of other issues.
- Standards and specs: XML: Half a standard is better than none (Peter Seebach, developerWorks, September 2006): Discover what makes XML work, and what doesn't sometimes work so well.
- Office Open XML: Learn about a new topic in this excellent jumping-off point from Wikipedia.
- More Irregularities in the OOXML ISO Process Surface (Groklaw, August 2007): Review a sampling of concerns people have raised about the standardization process that nearly made OOXML an ISO standard.
- Open Document Format (ODF): Read a Wikipedia broad overview of ODF, an XML-based competitor to OOXML.
- IBM killed Open XML (Nick Farrell, the Inquirer, January 2008): Read the Inquirer article that doesn't provide sympathy for Microsoft as they blame IBM for OOXML's death.
- Cruel truth surfaces in the OOXML war (ZDNet, January 2008): Read the ZDNet report on the OOXML standards battle and how political the OOXML fight has been.
- IBM XML certification: Find out how you can become an IBM-Certified Developer in XML and related technologies.
- XML technical library: See the developerWorks XML Zone for a wide range of technical articles and tips, tutorials, standards, and IBM Redbooks.
- developerWorks technical events and webcasts: Stay current with technology in these sessions.
- The technology bookstore: Browse for books on these and other technical topics.
- New to XML page: Check out the XML zone's updated resource central for XML.
Get products and technologies
- IBM trial software: Build your next development project with trial software available for download directly from developerWorks.
Discuss
- Participate in the discussion forum.
- XML zone discussion forums: Participate in any of several XML-related discussions.
- developerWorks blogs: Check out these blogs and get involved in the developerWorks community.