Level: Intermediate
Peter Seebach (developerworks@seebs.plethora.net), Freelance author, Plethora.net
12 Sep 2006
A pervasive misconception today is that simply designing your file format around XML somehow makes it magically portable, extensible, and intelligible to other programs. Peter Seebach explains why using XML is only part of the story when you're designing an extensible file format.
XML: Half a standard is better than none
The eXtensible Markup Language (XML) standard specifies a data
format, but carefully leaves the question of what data is to be stored
completely open. The storage format is immensely
flexible, allowing for the representation of arbitrarily complex
datatypes.
From a programmer's standpoint, the specification telling a program
what goes in an XML file is complete enough to allow it to be parsed,
and either validated or not validated, quite consistently and reliably.
However, this offers no guarantee of comprehension. Knowing that an
XML file is well formed (syntactically correct) doesn't necessarily tell
you that a given program can understand it. Even validating that an XML
file contains a specific type of data doesn't necessarily help you write
a program to interpret that data. That is, after all, not the purpose
of XML!
The XML standard is in many ways less like a file format specification
and more like YACC (Yet Another Compiler Compiler), a program which defines a format
for expressing grammars, then produces parsers for these grammars. XML
files might have semantics nearly as different from each other as the many
programming languages that have been developed using YACC. XML is not
really a file format; it's a platform for writing file formats.
This creates an interesting kind of semi-standard. The XML standard
is so very broad and open as to be useful for nearly everything, and
developer tools for XML are very likely compatible, but there's no
reason at all to expect actual data stored in XML files to be compatible
with anything. Cynics have argued that the primary purpose of XML is to
allow completely proprietary data to claim to be compliant with a
standard.
XML, SGML, and HTML
The Standard Generalized Markup Language specification (ISO
8879:1986) provides for the creation of new markup languages. It is
extremely flexible, and rather complicated. The XML standard is one
such markup language, which itself allows for extensibility; XML is a
subset of SGML and is intended to produce a subset of languages which
are easier to parse consistently and efficiently.
HTML is another SGML language and has had some influence on XML's
development by virtue of being one of the most widely used and widely
visible SGML languages. Many people use XML-based languages which are
similar enough to HTML to be potentially confusing to a casual reader.
(For instance, the master copy of this document is in an XML format that
uses <p> tags and other markup indistinguishable from HTML.)
The additional restrictions imposed by XML simplify development,
without preventing flexible document definitions. XML documents can be used for anything from office software documents to system
administration. For instance, Mac OS X uses XML to replace an older
text format for "property list" files, used for system preferences and
configuration. The flexibility that made SGML interesting
is still alive and well in XML.
XML specifications
SGML's specification of a specific markup language is called a
Document Type Definition, or DTD. For instance, the HTML specification
exists as a series of SGML DTDs, each specifying exactly what is, or is
not, valid as a given kind of HTML. The XML specification allows users
to create their own DTDs, within certain limits. However, different XML
documents need not have any elements in common.
This brings up the distinction between well-formed XML and valid XML.
A document that complies with the purely syntactical requirements
(arrangement of less-than and greater-than signs and use of entities) of
XML is well formed, no matter what entities or elements it contains.
"Well formed" is not specific to a particular DTD; it's a
trait an XML document can have without reference to a DTD, or even if
the document violates a given DTD. A document that complies with the
element definitions of a particular DTD is called "valid."
Valid XML is, at least in principle, usable by a specific application
that's familiar with a given DTD. Knowing that a file is well-formed XML isn't so helpful;
it's a lot more specific than saying "our data format is
ASCII," but it faces the same essential limitation. However, even
valid XML can be written for a DTD which is not made accessible to
users, leaving it utterly useless. There could exist a DTD under which
this is a valid XML element:
<zz><y/><x>183af892</x><w/></zz>
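To make the distinction concrete, here is a minimal Python sketch using the standard library's xml.etree.ElementTree, which checks well-formedness only and does not validate against a DTD. The helper name is_well_formed and the deliberately broken second string are illustrative assumptions, not part of any particular toolkit.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as XML; no DTD or schema is consulted."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


opaque = "<zz><y/><x>183af892</x><w/></zz>"
broken = "<zz><y/><x>183af892</zz>"  # <x> is never closed

print(is_well_formed(opaque))  # True
print(is_well_formed(broken))  # False

# Parsing succeeds, but the program still has no idea what the payload means.
payload = ET.fromstring(opaque).find("x").text
print(payload)  # 183af892
```

The parser happily hands back the string 183af892; whether it is a checksum, a color, or a serial number is something only the unpublished DTD's author knows.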
XML specifications are not always expressed as DTDs now; Microsoft®
proposed a new way of expressing the same sorts of concepts, called XML
Schemas. The XML Schema became a W3C standard in 2001. XML Schemas
are themselves written in XML and are designed to address a number of
concerns people had with traditional DTDs. Probably the most
significant is greater scope for identifying types of data and semantic
content, for instance, defined "dates" that specify the order
of year,
month, and day components. XML Schemas give more semantic content than
traditional DTDs, helping close the gap between "valid" and
"comprehensible." They can also specify further content
restrictions,
such as acceptable ranges for numeric values.
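The sort of constraint a schema states declaratively can be sketched by hand. The Python below is not an XML Schema validator (the standard library does not include one); it only illustrates the two kinds of checks just mentioned, a date with a fixed component order and a numeric range. The element names since and economic, the 0 to 100 range, and the function name check_record are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from datetime import datetime


def check_record(text, low=0, high=100):
    """Hand-rolled stand-in for two schema-style constraints."""
    root = ET.fromstring(text)
    problems = []

    # Date constraint: year, month, day in that order (YYYY-MM-DD).
    try:
        datetime.strptime(root.findtext("since", ""), "%Y-%m-%d")
    except ValueError:
        problems.append("since: not a YYYY-MM-DD date")

    # Range constraint: an integer between low and high inclusive.
    try:
        value = int(root.findtext("economic", ""))
        if not low <= value <= high:
            problems.append("economic: out of range")
    except ValueError:
        problems.append("economic: not an integer")

    return problems


good = "<bonus><since>2006-09-12</since><economic>20</economic></bonus>"
bad = "<bonus><since>12/09/2006</since><economic>250</economic></bonus>"

print(check_record(good))  # []
print(check_record(bad))
```

Both documents are well formed; only the first satisfies the constraints. A real schema moves these rules out of the program and into a shared, machine-readable description.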
So is this a standard or not?
XML is a standard. However, this doesn't mean that "XML
files" are a standard
in the sense that people normally think of a standard. If you have two
programs which process "JPEG files," you can pretty much
assume that their
files are interchangeable. The same doesn't hold for XML. This problem has
always existed with somewhat flexible file formats. For instance, the very
popular "comma-separated values" (CSV) file format, used by
dozens of
spreadsheets and databases, seems portable enough. But what can you put
in the
columns? For a table of, say, names and addresses, CSV is pretty portable.
Once you start including calculations, formats start diverging rapidly; some
spreadsheets might export the calculated values, others might export
formulae
in a particular format. The mechanistic description of data format
isn't the whole story of how the
data are represented.
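A small Python sketch of the CSV problem, using the standard csv module. The two export strings are hypothetical: they stand for the same spreadsheet row (A1=2, B1=3, C1=A1+B1) as saved by two imagined programs, one writing calculated values and one writing formulae.

```python
import csv
import io

# Two hypothetical exports of the same three-cell row.
export_values = "2,3,5\n"
export_formulas = "2,3,=A1+B1\n"


def read_row(text):
    """Parse the first row of a CSV string into a list of cell strings."""
    return next(csv.reader(io.StringIO(text)))


def total(row):
    # A naive consumer that assumes every cell is a plain number.
    return sum(int(cell) for cell in row)


row_a = read_row(export_values)
row_b = read_row(export_formulas)

print(row_a)  # ['2', '3', '5']
print(row_b)  # ['2', '3', '=A1+B1']

print(total(row_a))  # 10
try:
    total(row_b)
except ValueError as err:
    print("cannot interpret:", err)
```

Both files are perfectly good CSV as far as the parser is concerned. The disagreement is about what a cell means, which the format itself never says.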
This is a serious problem that crops up with "XML files."
Merely knowing
that something is some kind of XML doesn't tell you much about
compatibility.
Two spreadsheets which can save their data "as XML" might
not have any
commonality at all in the formats they save in. (Regular readers might remember
the same issue cropping up with IFF files -- see Resources.)
In some cases, there are standards for specific XML formats. Two or more
vendors could easily agree on a specific XML DTD (or, nowadays, Schema) and
communicate just fine. However, if all you know is that they both use
"XML,"
you have no way of knowing whether they are using a common Schema.
Stating this clearly is of vital importance.
You can write bad code in any language
One of the things most programmers learn eventually is that no language,
however brilliant, can prevent writing bad code. Some evangelists tend to
portray XML as resolving all of the data format issues that have plagued
us over the centuries. It doesn't. I don't think it'd be fair to blame
XML for the horrendous things that have been done with it, but it's
important
to understand that XML cannot substitute for a design process.
My favorite example of this is an XML file I stumbled across in the
configuration for a strategy game. Since the early days of computer
strategy
games, many developers have chosen to make many parameters tweakable.
(Actually, the first game I saw with tweakable settings was a Space Invaders
clone on a Heathkit H89 -- see Resources.)
Early on, they'd just define
a file format and
let you mess around; any typo would, of course, cause the game to crash.
Enter XML. This is the sort of problem XML is built to solve. With a good XML
Schema, you should be able to do something like:
<team>
  <name>Editors</name>
  <bonuses>
    <economic>20</economic>
  </bonuses>
</team>
<team>
  <name>Critics</name>
  <bonuses>
    <military>20</military>
  </bonuses>
</team>
Whatever tweakable parameters there are would be encoded in the same
way for
each team, and new teams could be added with a trivial amount of effort.
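A reader for this layout can stay entirely generic. The sketch below is a guess at how such a loader might look, not code from any actual game; the teams root element (added so the two team elements form one well-formed document) and the function name load_teams are assumptions.

```python
import xml.etree.ElementTree as ET

# The team fragment from the text, wrapped in a hypothetical <teams> root.
CONFIG = """
<teams>
  <team>
    <name>Editors</name>
    <bonuses><economic>20</economic></bonuses>
  </team>
  <team>
    <name>Critics</name>
    <bonuses><military>20</military></bonuses>
  </team>
</teams>
"""


def load_teams(text):
    """Map each team name to a dict of its bonuses, whatever they are."""
    teams = {}
    for team in ET.fromstring(text).iter("team"):
        name = team.findtext("name")
        # Every child of <bonuses> is treated as a bonus; none is hard-coded.
        teams[name] = {b.tag: int(b.text) for b in team.findall("bonuses/*")}
    return teams


print(load_teams(CONFIG))
# {'Editors': {'economic': 20}, 'Critics': {'military': 20}}
```

Adding a third team, or giving a team a second bonus, changes only the data file; the loader never needs to know the names in advance.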
In fact, at least one
game has implemented exactly the above sort of system. However, another
game's designers went in a
wholly different direction. Rather than abstracting the specification of
bonuses, they decided that each team would have a specific bonus, and they
created
an XML dictionary in which each team's bonus was specifically identified.
The above would have been written as:
<editor_economic_bonus>20</editor_economic_bonus>
<critic_military_bonus>20</critic_military_bonus>
You couldn't add new bonuses, or take away existing ones (although
you could
set them to zero). You couldn't give the economic bonus to critics (not
that
I know why anyone would want to). Through a broad range of exceptionally
specific modifications and tweaks, each was nailed down so the only thing
alterable through the file was the exact range of a specific bonus.
Used like this, XML serves only to inflate compression
statistics artificially. It's no longer extensible, or even extended.
It's just a very
elaborate way of saying that something can't be changed anyway.
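To see why the flat layout resists extension, consider what a reader for it has to look like. This is a hypothetical Python sketch, not the game's actual code; the config root element, the KNOWN table, and the function name load_flat are assumptions.

```python
import xml.etree.ElementTree as ET

# The flat fragment from the text, wrapped in a hypothetical <config> root.
FLAT = """
<config>
  <editor_economic_bonus>20</editor_economic_bonus>
  <critic_military_bonus>20</critic_military_bonus>
</config>
"""

# Every element name has to be known in advance, so the reader carries a
# hard-coded table: element name -> (team, bonus kind).
KNOWN = {
    "editor_economic_bonus": ("Editors", "economic"),
    "critic_military_bonus": ("Critics", "military"),
}


def load_flat(text):
    """Read the flat settings; reject any element the program doesn't know."""
    teams = {}
    for element in ET.fromstring(text):
        if element.tag not in KNOWN:
            raise KeyError("unknown setting: " + element.tag)
        team, kind = KNOWN[element.tag]
        teams.setdefault(team, {})[kind] = int(element.text)
    return teams


print(load_flat(FLAT))
# {'Editors': {'economic': 20}, 'Critics': {'military': 20}}
```

Giving the critics an economic bonus would mean inventing a critic_economic_bonus element and changing the program to recognize it. The knowledge lives in the code, and the XML file merely restates it.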
This kind of thing probably results from someone being told, far too late
in the design process, that it is mandatory that the configuration
"use XML."
This is probably XML's greatest weakness as a specification.
Because it is
so very generic, a lot of bad XML gets produced in the name of buzzword
compliance. The mere fact that something is XML doesn't mean it's actually
extensible or portable, and allowing the buzzword to substitute for a
serious
design review is crazy.
You can write good code in some languages
While it's true that you can write bad code in any language, some
languages
do seem to encourage good code. I was skeptical of XML's ability to do this
for a long time; XML DocBook is a very nice language for writing in, but
I was fine with SGML DocBook. However, a recent example has convinced me
that XML has the potential to make a lot of people very happy.
This example is the XML "property list" format, developed
by Apple. Property
lists are a well-considered example of how to use XML. They offer a fairly
generic way to store data about things. Property lists are used heavily in
Mac OS X to express preferences and other data. Each application can define
the exact set
of keys it looks for in a property list, but the common format provides a
number of benefits; for instance, a single line of Objective-C is enough
to save a new property in a property list. As an interesting example,
Apple's
launchd (a replacement for classic UNIX® init, cron, and inetd program
launchers) uses property lists to describe jobs, with a man page describing
the specific key/value pairs it supports.
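For a feel of the format, here is a short sketch using Python's standard plistlib module, which is yet another reader and writer for XML property lists and is neither Apple's nor NetBSD's implementation. The job label and command are made up; the key names Label, ProgramArguments, and RunAtLoad are ones documented in the launchd.plist man page.

```python
import plistlib

# A hypothetical job description; the label and command are made up.
job = {
    "Label": "com.example.hello",
    "ProgramArguments": ["/bin/echo", "hello"],
    "RunAtLoad": True,
}

# Serialize to the XML property list format, then read it back.
xml_bytes = plistlib.dumps(job, fmt=plistlib.FMT_XML)
restored = plistlib.loads(xml_bytes)

print(xml_bytes.decode("utf-8"))
print(restored == job)  # True
```

The printed document uses a small, fixed vocabulary (dict, key, string, array, true, and so on), and the application-specific meaning lives entirely in the key names. That split is what lets a generic library handle storage while each program documents only its own keys.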
Until fairly recently, Apple was the only entity using this format.
However,
in April of 2006, NetBSD acquired a library for reading and writing property
lists. In the interests of efficiency, the library reads and writes only a
sufficient subset of full XML to handle property lists; it's not a full XML
parser. The two implementations are not totally compatible as of this
writing; however, a standard with two completely independent implementations
is a very good thing.
Used well, XML encourages the use of data abstractions and clear
semantic
labeling of stored data. Even something as simple as a series of key/value
pairs benefits from an agreed-on specification for how to store and serialize it. The serialized XML form can be used as a way to pass chunks
of information around or store them.
So is this a standard or not?
Yes. Mostly. The confusions caused by people talking about
"XML" as though
something being in XML made it immediately comprehensible and portable
to all
programs anywhere that "have XML" are a serious problem, but
they are not a
flaw in the specification. The buzzword tends to get used in places
where it
really doesn't fit. People think of XML as a high-level structured data
format. In a way it is, but more accurately, it is a platform
on which specific
structured data formats can be built. XML provides tools which you can use
to build good structured data formats. Those same tools can be used to
produce
unmaintainable nightmares and data files that might as well be
write-only.
So, if you want to know whether a file being "XML" means
that you can reliably
extract data from it, the answer is probably no, but your chances might be
better than they would be with a purely proprietary format. XML is
a step closer to portability and openness, and it serves as a good
platform. In the introduction, I pointed out that XML is not really a file
format; it's a platform for writing file formats. Most of the problems
people have with XML are not a result of flaws in the standard, but of flaws
in user expectations -- they think it's a data format and that specifying
that something should use XML is enough to guarantee data portability.
If you are about to specify a file format, you might do well to use
XML as a
basis; it gets you past all the low-level stuff to the meat of the question.
But please, on behalf of all the people who will ever see your code or your
files, I beg you to use that time to think about the data you're trying to
save. This is not a good file format:
<bytes>ff ff 00 03 [. . .]</bytes>
Of course, someone might use it. In fact, there have been fairly
high-profile
applications where the guts of their data storage came down to something
equivalent to the above.
Picking meaningful names and representations for stored data remains
important, and XML does not take this responsibility away from you. What
it does do is give you tools which make it easier to do the right thing
with less developer time. Use them well.
Resources
About the author
Peter Seebach has been using computers for years and is gradually becoming acclimated. He still doesn't know why mice need to be cleaned so often, though. Contact Peter at developerworks@seebs.plethora.net.