XML: Half a standard is better than none
The eXtensible Markup Language (XML) standard specifies a data format, but carefully leaves the question of what data is to be stored completely open. The storage format is immensely flexible, allowing for the representation of arbitrarily complex datatypes.
From a programmer's standpoint, the specification telling a program what goes in an XML file is complete enough to allow it to be parsed, and either validated or not validated, quite consistently and reliably. However, this offers no guarantee of comprehension. Knowing that an XML file is well formed (syntactically correct) doesn't necessarily tell you that a given program can understand it. Even validating that an XML file contains a specific type of data doesn't necessarily help you write a program to interpret that data. That is, after all, not the purpose of XML!
The XML standard is in many ways more similar to the specification of YACC (Yet Another Compiler Compiler), a program which specifies a format for expressing grammars, then produces parsers for these grammars. XML files might have semantics nearly as different from each other as the many programming languages that have been developed using YACC. XML is not really a file format; it's a platform for writing file formats.
This creates an interesting kind of semi-standard. The XML standard is so very broad and open as to be useful for nearly everything, and developer tools for XML are very likely compatible, but there's no reason at all to expect actual data stored in XML files to be compatible with anything. Cynics have argued that the primary purpose of XML is to allow completely proprietary data to claim to be compliant with a standard.
The Standard Generalized Markup Language specification (ISO 8879:1986) provides for the creation of new markup languages. It is extremely flexible, and rather complicated. XML is a restricted subset of SGML that keeps this extensibility; it is intended to produce a family of languages that are easier to parse consistently and efficiently.
HTML is another SGML language and has had some influence on XML's development by virtue of being one of the most widely used and widely visible SGML languages. Many people use XML-based languages which are similar enough to HTML to be potentially confusing to a casual reader. (For instance, the master copy of this document is in an XML format that uses <p> tags and other markup indistinguishable from HTML.)
The additional restrictions imposed by XML simplify development, without preventing flexible document definitions. XML documents can be used for anything from office software documents to system administration. For instance, Mac OS X uses XML to replace an older text format for "property list" files, used for system preferences and configuration. The flexibility that made SGML interesting is still alive and well in XML.
SGML's specification of a specific markup language is called a Document Type Definition, or DTD. For instance, the HTML specification exists as a series of SGML DTDs, each specifying exactly what is, or is not, valid as a given kind of HTML. The XML specification allows users to create their own DTDs, within certain limits. However, different XML documents need not have any elements in common.
This brings up the distinction between well-formed XML and valid XML. A document that complies with the purely syntactical requirements (arrangement of less-than and greater-than signs and use of entities) of XML is well formed, no matter what entities or elements it contains. "Well formed" is not specific to a particular DTD; it's a trait an XML document can have without reference to a DTD, or even if the document violates a given DTD. A document that complies with the element definitions of a particular DTD is called "valid."
Valid XML is, at least in principle, usable by a specific application that's familiar with a given DTD. Knowing that a file is well-formed XML isn't so helpful; it's a lot more specific than saying "our data format is ASCII," but it faces the same essential limitation. However, even valid XML can be written for a DTD which is not made accessible to users, leaving it utterly useless. There could exist a DTD under which this is a valid XML element:
<zz><y/><x>183af892</x><w/></zz>
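To make the limitation concrete, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree, which checks well-formedness only and does not validate against a DTD. The helper name is_well_formed and the deliberately broken sample are mine, added for illustration. The parser happily accepts the element above and can list its children, but nothing in the result says what any of it means.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if text parses as XML. This is a syntax check only."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# The element from the article, plus a broken variant (hypothetical).
opaque = "<zz><y/><x>183af892</x><w/></zz>"
broken = "<zz><y/><x>183af892</zz>"  # <x> is never closed

print(is_well_formed(opaque))
print(is_well_formed(broken))

# Parsing succeeds, but all we learn is the element names.
root = ET.fromstring(opaque)
child_tags = [child.tag for child in root]
print(child_tags)
```

Checking validity against a DTD or Schema would need a validating parser, which the standard library module used here is not.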
XML specifications are not always expressed as DTDs now. Microsoft® was among those who proposed a new way of expressing the same sorts of concepts, and the W3C's XML Schema became a Recommendation in 2001. XML Schemas are themselves written in XML and are designed to address a number of concerns people had with traditional DTDs. Probably the most significant is greater scope for identifying types of data and semantic content; for instance, a schema can define "dates" that specify the order of year, month, and day components. XML Schemas carry more semantic content than traditional DTDs, helping close the gap between "valid" and "comprehensible." They can also specify further content restrictions, such as acceptable ranges for numeric values.
XML is a standard. However, this doesn't mean that "XML files" are a standard in the sense that people normally think of a standard. If you have two programs which process "JPEG files," you can pretty much assume that their files are interchangeable. The same doesn't hold for XML. This problem has always existed with somewhat flexible file formats. For instance, the very popular "comma-separated values" (CSV) file format, used by dozens of spreadsheets and databases, seems portable enough. But what can you put in the columns? For a table of, say, names and addresses, CSV is pretty portable. Once you start including calculations, formats start diverging rapidly; some spreadsheets might export the calculated values, others might export formulae in a particular format. The mechanistic description of data format isn't the whole story of how the data are represented.
This is a serious problem that crops up with "XML files." Merely knowing that something is some kind of XML doesn't tell you much about compatibility. Two spreadsheets which can save their data "as XML" might not have any commonality at all in the formats they save in. (Regular readers might remember the same issue cropping up with IFF files -- see Resources.)
In some cases, there are standards for specific XML formats. Two or more vendors could easily agree on a specific XML DTD (or, nowadays, Schema) and communicate just fine. However, if all you know is that they both use "XML," you have no way of knowing whether they are using a common Schema. Stating this clearly is of vital importance.
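The point can be sketched in a few lines of Python. Both documents below are invented for illustration (no real spreadsheet is known to use either vocabulary), and read_cells_vendor_a is a hypothetical reader written against the first one. Both inputs are well-formed XML holding the same value, yet the reader gets nothing out of the second.

```python
import xml.etree.ElementTree as ET

# Two hypothetical "XML exports" of the same one-cell spreadsheet.
VENDOR_A = "<sheet><row><cell>42</cell></row></sheet>"
VENDOR_B = '<workbook><data r="1" c="1" value="42"/></workbook>'


def read_cells_vendor_a(text):
    """Collect cell text, assuming vendor A's (made-up) element names."""
    root = ET.fromstring(text)
    return [cell.text for cell in root.iter("cell")]


cells_a = read_cells_vendor_a(VENDOR_A)
cells_b = read_cells_vendor_a(VENDOR_B)

print(cells_a)  # the reader understands its own vendor's vocabulary
print(cells_b)  # parses without error, but no <cell> elements are found
```

Neither parse fails, which is the trouble: the incompatibility only shows up as missing data.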
You can write bad code in any language
One of the things most programmers learn eventually is that no language, however brilliant, can prevent writing bad code. Some evangelists tend to portray XML as resolving all of the data format issues that have plagued us over the centuries. It doesn't. I don't think it'd be fair to blame XML for the horrendous things that have been done with it, but it's important to understand that XML cannot substitute for a design process.
My favorite example of this is an XML file I stumbled across in the configuration for a strategy game. Since the early days of computer strategy games, many developers have chosen to make many parameters tweakable. (Actually, the first game I saw with tweakable settings was a Space Invaders clone on a Heathkit H89 -- see Resources.) Early on, they'd just define a file format and let you mess around; any typo would, of course, cause the game to crash.
Enter XML. This is the sort of problem XML is built to solve. Given a good XML Schema, you should be able to do something like:
<team>
  <name>Editors</name>
  <bonuses>
    <economic>20</economic>
  </bonuses>
</team>
<team>
  <name>Critics</name>
  <bonuses>
    <military>20</military>
  </bonuses>
</team>
Whatever tweakable parameters there are would be encoded in the same way for each team, and new teams could be added with a trivial amount of effort. In fact, at least one game has implemented exactly the above sort of system. However, another game's designers went in a wholly different direction. Rather than abstracting the specification of bonuses, they decided that each team would have a specific bonus, and they created an XML dictionary in which each team's bonus was specifically identified. The above would have been written as:
<editor_economic_bonus>20</editor_economic_bonus>
<critic_military_bonus>20</critic_military_bonus>
You couldn't add new bonuses, or take away existing ones (although you could set them to zero). You couldn't give the economic bonus to critics (not that I know why anyone would want to). Through a broad range of exceptionally specific modifications and tweaks, each was nailed down so the only thing alterable through the file was the exact range of a specific bonus.
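The practical difference shows up in the code that reads the file. Here is a rough sketch in Python of a loader for the first, abstracted layout. The <teams> wrapper element is my assumption (a document needs a single root), and load_bonuses is a hypothetical name, not anything from either game. Because the loader never mentions a team or bonus by name, a new team or a new kind of bonus needs no code change; a loader for the second layout would have to hard-code every element name.

```python
import xml.etree.ElementTree as ET

# The team fragments from above, wrapped in an assumed <teams> root.
CONFIG = """
<teams>
  <team>
    <name>Editors</name>
    <bonuses><economic>20</economic></bonuses>
  </team>
  <team>
    <name>Critics</name>
    <bonuses><military>20</military></bonuses>
  </team>
</teams>
"""


def load_bonuses(text):
    """Map each team name to a dict of {bonus name: integer value}."""
    root = ET.fromstring(text)
    result = {}
    for team in root.findall("team"):
        name = team.findtext("name")
        bonuses = team.find("bonuses")
        if bonuses is None:
            result[name] = {}
        else:
            # Each child of <bonuses> is one bonus; its tag is the bonus name.
            result[name] = {b.tag: int(b.text) for b in bonuses}
    return result


bonuses = load_bonuses(CONFIG)
print(bonuses)
```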
Used like this, XML serves only to inflate compression statistics artificially. It's no longer extensible, or even extended. It's just a very elaborate way of saying that something can't be changed anyway. This kind of thing probably results from someone being told, far too late in the design process, that it is mandatory that the configuration "use XML."
This is probably XML's greatest weakness as a specification. Because it is so very generic, a lot of bad XML gets produced in the name of buzzword compliance. The mere fact that something is XML doesn't mean it's actually extensible or portable, and allowing the buzzword to substitute for a serious design review is crazy.
You can write good code in some languages
While it's true that you can write bad code in any language, some languages do seem to encourage good code. I was skeptical of XML's ability to do this for a long time; XML DocBook is a very nice language for writing in, but I was fine with SGML DocBook. However, a recent example has convinced me that XML has the potential to make a lot of people very happy.
This example is the XML "property list" format, developed by Apple. Property lists are a well-considered example of how to use XML. They offer a fairly generic way to store data about things. Property lists are used heavily in Mac OS X to express preferences and other data. Each application can define the exact set of keys it looks for in a property list, but the common format provides a number of benefits; for instance, a single line of Objective-C is enough to save a new property in a property list. As an interesting example, Apple's launchd (a replacement for classic UNIX® init, cron, and inetd program launchers) uses property lists to describe jobs, with a man page describing the specific key/value pairs it supports.
Until fairly recently, Apple was the only entity using this format. However, in April of 2006, NetBSD acquired a library for reading and writing property lists. In the interests of efficiency, the library reads and writes only a sufficient subset of full XML to handle property lists; it's not a full XML parser. The two implementations are not totally compatible as of this writing; however, a standard with two completely independent implementations is a very good thing.
Used well, XML encourages the use of data abstractions and clear semantic labeling of stored data. Even something as simple as a series of key/value pairs benefits from an agreed-on specification for how to store and serialize it. The serialized XML form can be used as a way to pass chunks of information around or store them.
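As a small illustration of that round trip, Python's standard library happens to include a plistlib module that reads and writes the XML property list format. The keys and values below are invented for the example and are not taken from any real application's preferences.

```python
import plistlib

# Hypothetical application preferences; the key names are made up.
prefs = {
    "WindowWidth": 800,
    "ShowToolbar": True,
    "RecentFiles": ["a.txt", "b.txt"],
}

# Serialize to the XML property list format, then read it back.
xml_bytes = plistlib.dumps(prefs, fmt=plistlib.FMT_XML)
restored = plistlib.loads(xml_bytes)

print(xml_bytes.decode("utf-8"))
print(restored == prefs)
```

The printed document uses generic elements such as key, integer, array, and string, so a reader that knows the property list format can recover the structure without knowing anything about the application. What the keys mean is still up to the application.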
So, is XML a good standard? Yes. Mostly. The confusions caused by people talking about "XML" as though something being in XML made it immediately comprehensible and portable to all programs anywhere that "have XML" are a serious problem, but they are not a flaw in the specification. The buzzword tends to get used in places where it really doesn't fit. People think of XML as a high-level structured data format. In a way it is, but more accurately, it is a platform on which specific structured data formats can be built. XML provides tools which you can use to build good structured data formats. Those same tools can be used to produce unmaintainable nightmares and data files that might as well be write-only.
So, if you want to know whether a file being "XML" means that you can reliably extract data from it, the answer is probably no, although your chances might be better than they would be with a purely proprietary format. Still, it's a step closer to portability and openness, and it serves as a good platform. In the introduction, I pointed out that XML is not really a file format; it's a platform for writing file formats. Most of the problems people have with XML are not a result of flaws in the standard, but of flaws in user expectations -- they think it's a data format and that specifying that something should use XML is enough to guarantee data portability.
If you are about to specify a file format, you might do well to use XML as a basis; it gets you past all the low-level stuff to the meat of the question. But please, on behalf of all the people who will ever see your code or your files, I beg you to use the time that saves to think about the data you're trying to store. This is not a good file format:
<bytes>ff ff 00 03 [. . .]</bytes>
Of course, someone might use it. In fact, there have been fairly high-profile applications where the guts of their data storage came down to something equivalent to the above.
Picking meaningful names and representations for stored data remains important, and XML does not take this responsibility away from you. What it does do is give you tools which make it easier to do the right thing with less developer time. Use them well.
Resources
- Participate in the discussion forum.
- See all of Peter's Standards and specs columns to date.
- Read this discussion of the difference between XML and SGML.
- Get the Mac OS X property list explained.
- The W3C Markup Validation Service is a free online utility that allows you to validate HTML against some standard DTDs.
- Alex E. Bell once tried using XML to impress his daughter; sadly, it didn't work.
- The Interchange File Format (IFF) is a flexible binary file format.
- The Heathkit H-89 was a Z80-based computer kit originally released in 1979.

Peter Seebach has been using computers for years and is gradually becoming acclimated. He still doesn't know why mice need to be cleaned so often, though. Contact Peter at developerworks@seebs.plethora.net