Level: Introductory
Peter Seebach (dw-xml@seebs.net), Writer, Freelance
19 Feb 2008
The OOXML specification has been both criticized and defended by a number of people, leading many to wonder what the big deal is. This article illustrates the basis of technical, rather than political, objections to treating OOXML as a standard.
I've been active in standardization for a long time (including about a
decade of volunteer work on the ISO C standards committee). Most anyone interested in standardization is likely to form an opinion on the
standardization process involving Office Open XML (OOXML), the proposed XML-based document
format from Microsoft®. Normally, I don't start an article
by talking about myself, but with recent allegations that IBM® has worked to derail OOXML, I wanted to start by making one thing perfectly
clear: I'm not an employee of IBM, these are my own opinions, and I developed
these opinions without reference to the position of IBM on the issue.
The OOXML standard is a big deal for a number of reasons. The fact is, no
matter who you blame for the political content of the process, the standards
process around it has been rife with political maneuvering (see Resources).
However, beyond all the politics, there are serious technical questions, ranging from whether XML is a good choice for standards to what the purpose
of standardization is. All this news coverage and discussion is a great
opportunity for me to get up on a soap box and talk about what makes
standards matter.
What is the objective of a standard?
Standards exist to allow interoperation. If my word processor and your word
processor can both open the same files, I can share documents with you easily.
If they can't, we'll have trouble. What this means is that, in the
absence of standards, we spend an incredible amount of time and effort working
around the lack of a shared document format. Vendors of word processors spend
unbelievable amounts of time and effort reverse-engineering each other's
document formats to allow them to import and export files so that users can
just open a document and expect to see it roughly as it was saved.
It's pretty obvious that nearly everyone benefits from a standard. The one
arguable exception, for document formats, is companies in a dominant position
in a given industry; in fact, they benefit when there isn't a standard,
because they might be able to push their own format as a de facto standard.
That gives them a twofold competitive advantage: everyone else has to
spend extra time and money supporting that format, and no one else's support
will ever be as good.
It's important to understand that, by definition, a good interchange standard
won't specify everything that every vendor does. If it did, every vendor would have to support
every feature in the standard, which means that every feature added to the
standard would have to be replicated by multiple vendors. It's better to simply allow
for extensions or extra features which can't be encoded in documents in the
standard format. As a user, I'd prefer to be able to confidently expect standard
documents to work the same everywhere, rather than having a specification so
baroque that no two vendors will quite be able to match up their behaviors.
The demand for standardization of office document
formats is very strong. Many organizations, from corporations to governments, are drafting
rules that require software to support open standards for document storage. No
one wants to be locked in to a single vendor; standards offer a way out from
that. With that in mind, perhaps it's time to look at some of the technical
questions about Microsoft's proposed OOXML standard.
The OOXML standard
The OOXML standard, available from ECMA, is distributed as a set of
PDF documents, totaling around 6000 pages. That's a lot of
specification, and it goes into comprehensive detail. The reason it's so
huge is simple: OOXML is essentially a complete replication of every
chunk of data that a Microsoft Office application might possibly save in a file.
There have been a number of technical complaints made about OOXML. Every
one of them comes down to the same base complaint: Rather than specifying
a reasonable common interchange format, OOXML specifies the whole feature
set of Microsoft Office, down to bug compatibility. This creates a burden
on other implementors which is simply unreasonable (and in fact impossible)
to meet, while conveniently being precisely what Microsoft is already
shipping. That raises a lot of concerns.
Don't mistake this for mere complaints that Microsoft has a head start; a
small, well-designed standard which Microsoft had implemented and everyone
else could reasonably implement would probably have been accepted without
much fuss. The showstopper problems fall into three broad categories:
features that are unreasonably hard to implement, features that are simply
not adequately specified, and features that are utterly unique to Microsoft
Office. These categories overlap somewhat, but each stands as
a different kind of barrier to entry.
Unreasonable requirements
Traditionally, in a standard, the problems implementors are expected to solve
are reasonably well-defined and scoped. You might be required to implement
a dozen kinds of paragraph justification, but all of them are specified, and
all of them are reasonably limited in scope. By contrast, OOXML imposes
requirements which are extremely open-ended. As an example, when describing
page headers, the proposed specification states "Both of font name and font type can
be localized values." This seemingly simple sentence (pointed out by
Stéphane Rodriguez; see Resources) opens a gigantic
can of worms.
What locales can you use? Do you have a complete list of every locale
that might ever be used by any other vendor's implementation of this spec, and
every way in which they might choose to localize a font name, or a font type?
What do you do if an enthusiastic implementor chooses to write the font
name and font type in a language that, at the time of your implementation,
you'd never heard of?
Presumably, this reflects a historical decision to store the localized values
that were presented to, or picked by, the user. Unfortunately, without a
great deal more specification (at the very least, a complete list of locales
that are permissible, and some way of telling which locale is being used),
it's simply not possible to implement this. This reflects the historical
quirks of a given implementation; it is not an appropriate choice for a
standardized format to be shared among multiple implementations.
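To make the problem concrete, here is a minimal Python sketch of what a reader faces when a font type arrives as a localized string. The table, the locale codes, and the function name are all hypothetical; the proposed specification supplies no such list, which is exactly the difficulty.

```python
# Hypothetical table mapping localized font-type names to one canonical value.
# OOXML supplies no such list; every entry here is an illustrative assumption.
KNOWN_FONT_TYPES = {
    "en": {"Bold": "bold", "Italic": "italic"},
    "fr": {"Gras": "bold", "Italique": "italic"},
    "de": {"Fett": "bold", "Kursiv": "italic"},
}


def resolve_font_type(localized_name):
    """Return (locale, canonical) for a localized font type, or None.

    The file does not say which locale was used, so the reader can only
    try every locale it happens to know about.
    """
    for locale, names in KNOWN_FONT_TYPES.items():
        if localized_name in names:
            return locale, names[localized_name]
    return None  # written in a locale this implementation never heard of


if __name__ == "__main__":
    for name in ("Gras", "Negrita"):
        print(name, "->", resolve_font_type(name))
```

The second lookup fails not because the input is malformed, but because the table is incomplete, and no implementor can ever know that a table like this is complete.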
Inadequate specifications
Because some Microsoft Office documents use drawings in a vector language
called VML, OOXML specifies how they are stored. This means that every
implementor is on the hook to read these drawings—unfortunately, no
real specification is offered for them. You can find VML shapes as strings
inside particular items.
What exactly are the allowable values? That's answered clearly enough: "The
possible values for this attribute are defined by the XML Schema string
datatype." Which is to say, it's a string. It can contain arbitrary text,
the meaning of which can be answered only by the code of the VML
library. In short, unless you happen to have the VML library just lying
around, you can't possibly implement this.
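A short sketch shows how little that constraint buys an implementor. It uses Python's standard xml.etree.ElementTree module; the element name shape, the attribute name path, and both attribute values are made up for illustration and are not taken from the OOXML text.

```python
import xml.etree.ElementTree as ET


def read_drawing_string(xml_text, attribute="path"):
    """Parse one element and return the raw string held in `attribute`.

    A type of "XML Schema string" means any text is acceptable, so a
    successful parse says nothing about whether the drawing can be rendered.
    """
    element = ET.fromstring(xml_text)
    return element.get(attribute)


# Both values are invented; the parser has no basis for preferring either.
plausible = '<shape path="m 0,0 l 10,10 x e"/>'
nonsense = '<shape path="colorless green ideas"/>'

if __name__ == "__main__":
    for doc in (plausible, nonsense):
        print(repr(read_drawing_string(doc)))
```

Both documents are accepted in exactly the same way. Everything that distinguishes a drawing from noise lives in code the schema never describes.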
Once again, this is a historical quirk. In a standard designed for
interchange, one drawing format (and probably only one) would be fully specified, and an
implementor who happened to have another drawing
library would be expected to export drawings into the standard
format. Instead, OOXML provides a mere recapitulation of an earlier design
(and one which is, intentionally, not available to others), and expects
everyone else to adapt.
Unique features
The last category is the one which has drawn the most ire from many standards
experts. This is not because it's harder to implement—you can't get harder
to implement than impossible—but because it should never have existed.
This is the category of features which are entirely and utterly dependent on
Microsoft Office in some way.
Probably the most famous example is one of the optional settings provided
in OOXML. The setting is called "useWord97LineBreakRules", and it tells an application
to use the line-break rules that Word 97 used for East Asian
documents. Much like the previous examples, this is of course impossible for
anyone else to do, as no specification of these rules is provided. In
fact, the OOXML standard even warns implementors not to implement this:
Listing 1. The OOXML standard's guidance for useWord97LineBreakRules
[Guidance: To faithfully replicate this behavior, applications
must imitate the behavior of that application, which involves
many possible behaviors and cannot be faithfully placed
into narrative for this Office Open XML Standard. If
applications wish to match this behavior, they must utilize
and duplicate the output of those applications. It is
recommended that applications not intentionally replicate
this behavior as it was deprecated due to issues with its
output, and is maintained only for compatibility with
existing documents from that application. end guidance]
This guidance is excellent. Given that there is no specification available of
this feature, and it is deprecated, it makes all kinds of sense for people not
to implement it. But wait; if it shouldn't be implemented, why is it in
the spec? Compatibility with existing documents is not a reason to add
a feature to a standard aimed at interchanging data; users are worried about
whether their text can be opened at all in another program, not whether every
line break is in the exact same location!
This feature is in the spec because OOXML is not a document interchange
format; it's a careful, bit-for-bit, replication of Microsoft's historical
binary formats, wrapped up in angle brackets.
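In practice, an implementation that follows the guidance has little choice but to detect such settings and report that it cannot honor them. The following Python sketch shows one way that might look. The XML fragment is a simplified stand-in for a real settings part, the namespace URI should be checked against the specification before being relied on, the set of unsupported names is an assumption that contains only the one setting discussed here, and matching on the local element name is a shortcut rather than a full namespace-aware reader.

```python
import xml.etree.ElementTree as ET

# Settings this hypothetical reader cannot honor because no public
# description of the behavior exists. Only the setting quoted above is listed.
UNSUPPORTED_SETTINGS = {"useWord97LineBreakRules"}

# Simplified, illustrative fragment; not a complete settings document.
SAMPLE = """
<w:settings xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:compat>
    <w:useWord97LineBreakRules/>
  </w:compat>
</w:settings>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def find_unsupported(xml_text):
    """Return the sorted names of settings present in the document that
    this reader cannot implement."""
    root = ET.fromstring(xml_text.strip())
    found = {local_name(el.tag) for el in root.iter()}
    return sorted(found & UNSUPPORTED_SETTINGS)


if __name__ == "__main__":
    for name in find_unsupported(SAMPLE):
        print(f"warning: ignoring {name}; layout may differ from the original")
```

A warning is the most any other vendor can offer here. The flag travels with the document, but the behavior it names does not.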
Does this mean XML is a bad choice?
After reading some of the complaints about OOXML, some IT professionals
have formed the notion that XML is a poor choice for standardization.
I think this judgement is, at best, premature. In fact, I think it's
just plain wrong. The problems here are not caused by XML; they are caused
by the decision to dutifully reproduce every scrap of backwards-compatibility
and every quirk of behavior of an existing program, rather than specifying
the structure and contents of generic documents intended to be shared and
interchanged between multiple applications.
This can be done quite well in XML. The obvious competitor to OOXML is also
an XML standard, called Open Document Format (ODF). It is by no means an entirely
trivial or small standard; version 1.1 of ODF is a 738-page document, and
the group developing it does not consider it complete or final yet. For
instance, it does not define the formula language used in spreadsheets—although this is being worked on, for inclusion in a proposed version 1.2
standard. Nonetheless, a review of the ODF specification
shows that, rather than attempting to describe the behavior of a monolithic
legacy application, it tries to describe the contents of documents.
The purpose of XML is to allow you to write descriptions of how you wish to
describe the contents of documents. While the ODF description is not
fully polished yet, it is at least conceivable that it could be.
Conclusions
While XML is a powerful and expressive tool for defining new file
formats, it cannot save you from a poor choice of project scope. If you decide
to make a file format in which a flag specifies the use of a
large, undocumented, and proprietary rendering library, it doesn't matter
whether you specify that flag through a single bit in an undocumented binary
string, or with three pages of angle brackets; your specification
is proprietary, and there is no way to render it otherwise simply by wrapping
it in XML.
It's a shame that XML, which has the potential to offer consistent and
standardized parsing across a broad range of file formats, is getting some
of the blame for OOXML's shortcomings. OOXML is a 6000-page
description, not just of what a given word processor does today, but of many
things it used to do, some of which are only alluded to rather than specified.
That it is even possible to talk usefully about attempts to implement OOXML
must be considered a credit to the robustness of the underlying XML standard.
OOXML is a credible effort to solve a real problem: The problem of how to
replace completely opaque binary files encoding ten years of accreted behavior
with partially-legible XML files encoding the same behavior, down to the last
bit. That problem, unfortunately, is not the problem of providing a usable,
implementable, exchange format for office documents.
If Microsoft wants OOXML to be taken seriously as a proposal for a document
standard, only one option is on the table. Rather than try to develop
a specification with every possible feature of any version of Microsoft Office,
every flag or quirk that some document might use, focus on building a smaller,
leaner, interchange format which provides core functionality in a
fully-described and implementable fashion. Don't expose implementation quirks,
such as Excel® calculation chains, to people who just want to copy a
spreadsheet's data and formulas. Don't expose, or even refer to, the details
of the VML library, or the DrawingML library, or anything like that; instead,
provide a brand new, open, and completely specified, description of the data.
When I wrote the Standards & Specs piece on XML some time back, I made an
offhand reference to the notion of an XML format containing "<bytes>ff ff
00 03 [. . .]</bytes>". When I wrote it, I thought I was joking. I guess
I wasn't.
Resources
Learn
- OOXML is defective by design (Stéphane Rodriguez, blog, February 2008): Don't be thrown off by the initial focus on the omniscience-complete task of implementing the spreadsheet format; this page goes into a great deal of detail on a number of other issues.
- Standards and specs: XML: Half a standard is better than none (Peter Seebach, developerWorks, September 2006): Discover what makes XML work, and what doesn't sometimes work so well.
- Office Open XML: Learn about a new topic in this excellent jumping-off point from Wikipedia.
- More Irregularities in the OOXML ISO Process Surface (Groklaw, August 2007): Review a sampling of concerns people have raised about the standardization process that nearly made OOXML an ISO standard.
- Open Document Format (ODF): Read a broad Wikipedia overview of ODF, an XML-based competitor to OOXML.
- IBM killed Open XML (Nick Farrell, the Inquirer, January 2008): Read the Inquirer article that doesn't provide sympathy for Microsoft as they blame IBM for OOXML's death.
- Cruel truth surfaces in the OOXML war (ZDNet, January 2008): Read the ZDNet report on the OOXML standards battle and how political the OOXML fight has been.
- IBM XML certification: Find out how you can become an IBM-Certified Developer in XML and related technologies.
- XML technical library: See the developerWorks XML Zone for a wide range of technical articles and tips, tutorials, standards, and IBM Redbooks.
- developerWorks technical events and webcasts: Stay current with technology in these sessions.
- The technology bookstore: Browse for books on these and other technical topics.
- New to XML page: Check out the XML zone's updated resource central for XML.
Get products and technologies
- IBM trial software: Build your next development project with trial software available for download directly from developerWorks.
About the author
Peter Seebach has been interested in standardization for many years, and volunteered on the ISO C committee for nearly a decade. He has been using XML as a document interchange format for several years.