I suppose the very first question on readers' minds has to be, "Why the name YAML?" There are a number of tools that have cutely adopted acronyms of the form "YA*", to mean "Yet Another XXX." In the world of open-source wit, YAML eschews its implied acronym, instead settling on the recursive "YAML Ain't Markup Language." There is a certain sense to this, however: YAML does what markup languages do, but without requiring any, well, markup.
Although it is no less general than XML, YAML is a great deal simpler to read, edit, modify, and produce than XML. That is, anything you can represent in XML, you can represent (almost always more compactly) in YAML. Namespaces represent one case in which some special pleading is necessary -- you can "bolt them on" to YAML, but they are not part of the current spec.
One criticism of XML that I have frequently raised in this column (and one that I am far from alone in emphasizing) is that XML is poorly focused. A classic committee-driven behemoth, XML tries to be a document format, a data format, a message packet format, a secure RPC channel (SOAP), and an object database. Moreover, XML piles on APIs for every style of access and manipulation: DOM, SAX, XSLT, XPath, JDOM, and dozens of less common interface layers (I have contributed a few of my own in the gnosis.xml.pickle, gnosis.xml.objectify, and gnosis.xml.validity packages). The remarkable thing is that XML does all these things; the disappointing part is that it does none of them particularly well.
YAML focuses more narrowly on cleanly representing the data structures and data types you encounter in dynamic programming languages like Perl, Python, Ruby, and to a lesser extent, Java programming. Bindings/libraries currently exist for those languages. A number of other languages have data models that will play nice with YAML, but no one has written libraries for them yet. These include Lisp/Scheme, Rebol, Smalltalk, xBase, and AWK. Less dynamic languages would not fit as well with YAML. The syntax of YAML combines the contextual typing of Perl, the indentation structure of Python, and a few conventions from MIME standards. In much the same way that Python is sometimes praised as executable pseudocode, YAML's concrete syntax comes very close to what you might use to informally explain a data structure to a class or work group.
The easiest way to see why you would want to use YAML is to look at some code in different formats. For this installment, I have imagined creating a little application with a data storage and transmission requirement. My particular pet project is inspired by the "Brains in Bahrain" chess tournament -- occurring as I write this article -- between the FISA world chess champion and the best ranked computer player (take my data details with a grain of salt). If you wanted to create a program that tracks the activity of a chess club, you might use a data structure that would be described by the following Perl code.
Listing 1. Perl description of chess club data structure
$players = {
    'Vladimir Kramnik' => {'status'=>'GM',       'rating'=>'2700'},
    'Deep Fritz'       => {'status'=>'Computer', 'rating'=>'2700'},
    'David Mertz'      => {'status'=>'Amateur',  'rating'=>'1400'},
};
$club = {
    '_players' => $players,
    'matches'  => [
        {'Date'   => '2002-10-04',
         'White'  => $players->{'Deep Fritz'},
         'Black'  => $players->{'Vladimir Kramnik'},
         'Result' => 'Draw' },
        {'Date'   => '2002-10-06',
         'White'  => $players->{'Vladimir Kramnik'},
         'Black'  => $players->{'Deep Fritz'},
         'Result' => 'White' }
    ]
};
The Python version is similar to the Perl code above:
Listing 2. Python description of chess club data structure
ts = yaml.timestamp   # or mx.DateTime or other date class
players = {
    'Vladimir Kramnik': {'status':'GM',       'rating':2700},
    'Deep Fritz':       {'status':'Computer', 'rating':2700},
    'David Mertz':      {'status':'Amateur',  'rating':1400},
}
matches = [
    {'Date':   ts('2002-10-04'),
     'White':  players['Deep Fritz'],
     'Black':  players['Vladimir Kramnik'],
     'Result': 'Draw' },
    {'Date':   ts('2002-10-06'),
     'White':  players['Vladimir Kramnik'],
     'Black':  players['Deep Fritz'],
     'Result': 'White' }
]
club = {'_players':players, 'matches':matches}
Other dynamic programming languages would use similar descriptions of the data structure. Basically, this is a top-level dictionary/mapping/hash that contains both dicts and lists at a few recursive levels; this also allows nested elements to refer to each other.
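To make the point about nested elements referring to each other concrete, here is a minimal sketch based on Listing 2. It keeps dates as plain strings so that it has no dependencies, which is a simplification rather than part of the original listing.

```python
# Player records are plain dicts; names and ratings come from Listing 2.
players = {
    'Vladimir Kramnik': {'status': 'GM', 'rating': 2700},
    'Deep Fritz': {'status': 'Computer', 'rating': 2700},
}

# Dates are plain strings here only to keep the sketch dependency-free.
matches = [
    {'Date': '2002-10-04',
     'White': players['Deep Fritz'],
     'Black': players['Vladimir Kramnik'],
     'Result': 'Draw'},
]
club = {'_players': players, 'matches': matches}

# The match holds a reference to the player record, not a copy of it.
same_object = club['matches'][0]['White'] is club['_players']['Deep Fritz']

# Updating the rating in one place is therefore visible through every path.
players['Deep Fritz']['rating'] = 2710
rating_seen_from_match = club['matches'][0]['White']['rating']
```

Any serialization format for this data has to preserve that sharing, or an update made through one path will silently stop being visible through the other after a save and reload.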
Presumably, an application that managed the chess club could perform tasks like recording additional matches, adding/removing players from the club, or updating ratings based on matches played. Moreover, an application would want to not only record a data snapshot, but also share it with other applications (in other languages) that work with the same data model. In addition, it would be nice if we could easily touch up the data by hand outside of an application; anyone who has developed a data-oriented application or maintained an organization's records knows how helpful it can be to "poke at the guts" of underlying data structures.
You have probably already picked up on the fact that I have rigged the deck: YAML is going to swoop in here as the right solution. But I don't think the setup is unfair. The way I have structured the data is a whole lot like the way data often presents itself, at least in broad strokes. I have chosen to use the maps, lists, references, and data types above in precisely the places they each seemed most natural. Also, I have chosen the underlying problem merely as something that was complex enough to show the issues while being simple enough to fit into one article.
If you are willing to require a particular programming language (and possibly a particular version) for all chess club applications, most languages have good serialization capabilities in built-in or common libraries:
- Python has cPickle, gnosis.xml.pickle, and pprint
- Perl has Data::Dumper, Data::Denter, and Data::DumpXML
- Ruby has Marshal and XmlSerialization
- Java language has java.io.Serializable, org.apache.xml.serialize.XMLSerializer, and various others
As the names indicate, some of these libraries produce XML, but that hardly means XML is easily transferable between languages.
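As an illustration of the single-language route, the following sketch round-trips a cut-down version of the club data through Python's standard pickle module (in current Python versions the cPickle implementation is folded into pickle). The trimmed data set is an assumption made to keep the example short.

```python
import pickle

# One shared player record, referenced from two places in the structure.
fritz = {'status': 'Computer', 'rating': 2700}
club = {
    '_players': {'Deep Fritz': fritz},
    'matches': [{'Date': '2002-10-04', 'White': fritz, 'Result': 'Draw'}],
}

# Serialize to bytes and restore; the bytes are only meaningful to Python.
blob = pickle.dumps(club)
restored = pickle.loads(blob)

# The shared reference survives the round trip within a single dump.
shared_after_roundtrip = (
    restored['matches'][0]['White'] is restored['_players']['Deep Fritz']
)
```

The round trip is faithful, but the serialized form is opaque binary that a Perl or Ruby program cannot reasonably be expected to read, and it is not something you would touch up in a text editor.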
In addition, there are a few general semantic problems with using XML to represent this chess club data. XML has the concept of unordered mapping for element attributes, but strict ordering for nested elements. A particular application, of course, has the ability to ignore some of the ordering information but the information model of XML always asserts a significance to order, often spuriously. For example, matches are considered to fall in a particular order (by date), while players are not inherently ordered. (You could, of course, impose an order such as rating or enroll-date.) The problem is that you need custom programming in every application to remove the implied ordering information everywhere it is spurious and keep it where it is important.
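A small sketch using Python's standard xml.etree.ElementTree module shows the asymmetry. The two tiny documents are made up for the illustration: they hold the same facts about a match and differ only in child order.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<match>'
    '<Date>2002-10-04</Date>'
    '<Result>Draw</Result>'
    '</match>'
)
reordered = ET.fromstring(
    '<match>'
    '<Result>Draw</Result>'
    '<Date>2002-10-04</Date>'
    '</match>'
)

# Child elements come back as a sequence, so document order is visible.
child_order = [child.tag for child in doc]
reordered_child_order = [child.tag for child in reordered]

# An application that does not care about order has to discard it itself.
as_mapping = {child.tag: child.text for child in doc}
reordered_as_mapping = {child.tag: child.text for child in reordered}

# Attributes, by contrast, are exposed as a mapping.
player = ET.fromstring('<player name="Deep Fritz" rating="2700" />')
attributes = dict(player.attrib)
```

The two match documents compare as different sequences but identical mappings; deciding which view is the right one is left to each application.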
XML-RPC, SOAP, gnosis.xml.pickle, and the XML serializer libraries in various languages take a generic approach to representing mappings. In all of these cases, the basic principle uses (rather verbosely) <key> and <val> (or similar tags) to indicate unordered pairs, and different container elements to indicate ordered items. This principle adds several layers to remove part of the XML information model:
Listing 3. XML-RPC model of ordered and unordered collections
>>> import xmlrpclib
>>> print xmlrpclib.dumps(({'this':'that',
... 'spam':('eggs','toast')},))
<params>
<param>
<value><struct>
<member>
<name>this</name>
<value><string>that</string></value>
</member>
<member>
<name>spam</name>
<value><array><data>
<value><string>eggs</string></value>
<value><string>toast</string></value>
</data></array></value>
</member>
</struct></value>
</param>
</params>
XML-RPC has a few additional artifacts -- like the need to wrap the whole object in a one-item tuple -- but these are minor issues. The awkward fit between the "native" and the XML data models is equally evident in any of the XML serialization formats mentioned here.
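For readers on Python 3, where the same module is available as xmlrpc.client, a round trip makes one of those artifacts visible. This is a sketch of the behavior rather than a recommended storage technique.

```python
import xmlrpc.client

# Same sample object as in Listing 3, wrapped in the required one-item tuple.
original = {'this': 'that', 'spam': ('eggs', 'toast')}
xml_text = xmlrpc.client.dumps((original,))

# loads() returns the parameter tuple and a method name (None here, since
# no method name was given to dumps()).
params, method_name = xmlrpc.client.loads(xml_text)
restored = params[0]

# XML-RPC has a single <array> type, so the tuple comes back as a list.
spam_type_before = type(original['spam']).__name__
spam_type_after = type(restored['spam']).__name__
```

The mapping survives intact, but the distinction between a tuple and a list does not, because the XML-RPC data model has no place to record it.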
There are at least two issues that arise in representing this chess club data as XML. The first, and simpler, issue is exactly what the best XML representation would be in the abstract. For this, I would propose something like the following as a best attempt in XML:
Listing 4. Optimal XML description of chess club data
<?xml version="1.0"?>
<club>
  <players>
    <player id="kramnik"
            name="Vladimir Kramnik"
            rating="2700"
            status="GM" />
    <player id="fritz"
            name="Deep Fritz"
            rating="2700"
            status="Computer" />
    <player id="mertz"
            name="David Mertz"
            rating="1400"
            status="Amateur" />
  </players>
  <matches>
    <match>
      <Date>2002-10-04</Date>
      <White refid="fritz" />
      <Black refid="kramnik" />
      <Result>Draw</Result>
    </match>
    <match>
      <Date>2002-10-06</Date>
      <White refid="kramnik" />
      <Black refid="fritz" />
      <Result>White</Result>
    </match>
  </matches>
</club>
The above XML data representation is fairly clear. It is not much more verbose than the native data descriptions given in the Perl and Python examples, nor than the YAML description in Listing 5. It is not all that difficult to modify the document with general purpose tools like a text editor. (In fact, that is exactly how I created the XML initially.)
Semantically, my proposed XML has all the problems discussed. Players appear ordered, even though they are not intended to be. And the player list appears to precede the matches list, even though no such conceptual order is intended. Player attributes are unordered, as desired (being XML attributes), but since match "attributes" cannot fit as XML attributes, XML imposes an artificial order.
The more important issue arises with actually reading and writing my optimal XML format. None of the common XML APIs comes even close to automating this operation. For example, a SAX reader could look for various "player" and "match" events and manually add to relevant nested dictionaries or lists, but this approach is fragile, and needs to be reprogrammed for the slightest change in data structure during development. Walking a DOM tree has the same issue. Custom APIs like JDOM or REXML do not help much either. gnosis.xml.objectify does a fairly good job of automatically generating a native object, but this only works for reading in the XML, not for writing it back out. Writing, of course, is symmetric with reading, with all its corresponding fragilities.
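To give a feel for how much hand-written glue the SAX approach takes, here is a rough sketch of a reader for a shortened version of the XML in Listing 4, using Python's standard xml.sax module. The ClubHandler class and its attribute names are hypothetical, invented for this illustration.

```python
import xml.sax

CLUB_XML = b"""<?xml version="1.0"?>
<club>
  <players>
    <player id="kramnik" name="Vladimir Kramnik" rating="2700" status="GM" />
    <player id="fritz" name="Deep Fritz" rating="2700" status="Computer" />
  </players>
  <matches>
    <match>
      <Date>2002-10-04</Date>
      <White refid="fritz" />
      <Black refid="kramnik" />
      <Result>Draw</Result>
    </match>
  </matches>
</club>
"""

class ClubHandler(xml.sax.ContentHandler):
    """Hypothetical handler that rebuilds the nested dicts and lists."""

    def __init__(self):
        super().__init__()
        self.players_by_id = {}   # resolves refid attributes by hand
        self.club = {'players': {}, 'matches': []}
        self.match = None         # the match currently being read
        self.text = []            # character data of the current element

    def startElement(self, name, attrs):
        self.text = []
        if name == 'player':
            player = {'rating': int(attrs['rating']),
                      'status': attrs['status']}
            self.players_by_id[attrs['id']] = player
            self.club['players'][attrs['name']] = player
        elif name == 'match':
            self.match = {}
        elif name in ('White', 'Black'):
            self.match[name] = self.players_by_id[attrs['refid']]

    def characters(self, content):
        self.text.append(content)

    def endElement(self, name):
        if name in ('Date', 'Result'):
            self.match[name] = ''.join(self.text).strip()
        elif name == 'match':
            self.club['matches'].append(self.match)
            self.match = None

handler = ClubHandler()
xml.sax.parseString(CLUB_XML, handler)
club = handler.club
```

Every element name, attribute name, and type conversion is hard-coded in the handler, so renaming a tag or adding a field to a match means editing this code, and a matching writer would have to be maintained alongside it.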
The YAML format simply matches the data structures of dynamic languages better. And it looks nicer too. Here's a YAML representation of the same chess club data:
Listing 5. YAML description of chess club data
---
players:
  Vladimir Kramnik: &kramnik
    rating: 2700
    status: GM
  Deep Fritz: &fritz
    rating: 2700
    status: Computer
  David Mertz: &mertz
    rating: 1400
    status: Amateur
matches:
  -
    Date: 2002-10-04
    White: *fritz
    Black: *kramnik
    Result: Draw
  -
    Date: 2002-10-06
    White: *kramnik
    Black: *fritz
    Result: White
There are a number of nice things about this format. The YAML Web site gives exact specifications (see Resources), but this brief sample gives you a pretty accurate idea of the basic elements. The spec also includes an intuitive means of including (multi-)paragraph strings. YAML is terse, but still readable. Moreover, quoting is minimal, with data types being inferred from patterns (for example, if it looks like a date, it is treated as a timestamp value unless explicitly string quoted). You can use references to any named target. And, significantly, YAML maintains the distinction between ordered and associative collections. As an added bonus, you can very easily edit YAML in a text editor.
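The idea of inferring types from patterns can be sketched in a few lines of Python. The infer_scalar helper below is hypothetical and handles only integers, dates, and quoted strings; the real resolution rules in the YAML specification are considerably richer, and this sketch returns a plain datetime.date rather than whatever timestamp type a given YAML library uses.

```python
import datetime
import re

INT_PATTERN = re.compile(r'^[-+]?[0-9]+$')
DATE_PATTERN = re.compile(r'^([0-9]{4})-([0-9]{2})-([0-9]{2})$')

def infer_scalar(text):
    """Hypothetical, simplified pattern-based typing of one scalar."""
    # Explicit quoting switches inference off: the value stays a string.
    if len(text) >= 2 and text[0] == text[-1] and text[0] in ('"', "'"):
        return text[1:-1]
    # Something that looks like an integer becomes an integer.
    if INT_PATTERN.match(text):
        return int(text)
    # Something that looks like a date becomes a date object.
    date_match = DATE_PATTERN.match(text)
    if date_match:
        year, month, day = (int(part) for part in date_match.groups())
        return datetime.date(year, month, day)
    # Anything else is left as a plain string.
    return text

rating = infer_scalar('2700')
match_date = infer_scalar('2002-10-04')
quoted_date = infer_scalar('"2002-10-04"')
status = infer_scalar('GM')
```

The practical consequence is the one noted above: the same characters mean a date when bare and a string when quoted, so quoting is only needed in the uncommon case where the inferred type is not the one you want.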
None of the semantic and syntactic benefits listed above are really the strongest reason for using YAML for my application. The best part is actually its uniform interface in all the supported languages. I can read, manipulate, and write the above YAML data file as easily as:
Listing 6. Python access to YAML data source
import yaml
club = yaml.loadFile('club.yml').next()
# ...manipulate the 'club' data structure...
club_yamlstr = yaml.dump(club)
# ...do something w/ formatted YAML in club_yamlstr...
I use the .next() method above because a YAML stream can contain multiple documents, each separated by ---. Incidentally, the data structure in club is exactly the same as the one defined in the prior pure Python definition.
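The yaml module used in Listing 6 is an early Python binding. As a rough equivalent for readers today, the sketch below assumes the third-party PyYAML package is installed and uses its safe_load_all function; the second document in the stream is invented purely to show the --- separator at work.

```python
import yaml  # assumption: the third-party PyYAML package is installed

CLUB_YAML = """\
---
players:
  Vladimir Kramnik: &kramnik
    rating: 2700
    status: GM
  Deep Fritz: &fritz
    rating: 2700
    status: Computer
matches:
  -
    Date: 2002-10-04
    White: *fritz
    Black: *kramnik
    Result: Draw
---
note: a second document in the same stream
"""

# safe_load_all yields one native object per document in the stream.
documents = list(yaml.safe_load_all(CLUB_YAML))
club = documents[0]
first_match = club['matches'][0]

# Aliases resolve to the very same object as their anchored target.
white_is_shared = first_match['White'] is club['players']['Deep Fritz']
```

With PyYAML the unquoted date is loaded as a date object rather than a string, and the aliases come back as shared references, which is the behavior the article describes for the bindings of its time.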
In Perl (or Ruby or Java programming), the steps are almost the same:
Listing 7. Perl access to YAML data source
use YAML ();
my $club = YAML::LoadFile('club.yml');
my $club_yamlstr = YAML::Dump($club);
The roundtrip between YAML and native data structures is free... well, very close. I found two minor drawbacks:
- References lose their names (for example, "*kramnik") and simply become numbered (for example, "*1")
- Targets are always spelled out on first occurrence.
Aesthetically, I prefer to see a player's details in the "players" section, but that is not guaranteed with an unordered dictionary (the use of _players in the Perl/Python samples is a hack to force matters).
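Both drawbacks can still be observed with a present-day library. The sketch below again assumes the third-party PyYAML package, not the binding used in Listing 6, and a trimmed-down data set.

```python
import yaml  # assumption: the third-party PyYAML package is installed

SOURCE = """\
players:
  Deep Fritz: &fritz
    rating: 2700
    status: Computer
matches:
  - White: *fritz
    Result: Draw
"""

club = yaml.safe_load(SOURCE)

# Dump the native structure back out to YAML text.
dumped = yaml.safe_dump(club)

# The hand-chosen anchor name is gone; the dumper generates its own.
kept_original_anchor = '&fritz' in dumped
uses_generated_anchor = '&' in dumped and '*' in dumped

# The sharing itself is preserved when the dumped text is loaded again.
reloaded = yaml.safe_load(dumped)
still_shared = (
    reloaded['matches'][0]['White'] is reloaded['players']['Deep Fritz']
)
```

The structure survives the round trip, but the author's naming and layout choices do not: the anchor is renamed, and the player's details are written out wherever the dumper first encounters them.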
There are a number of features of YAML that I have not covered here. For example, the existing YAML libraries come with adequate, but not great, conversion tools for moving between XML and YAML, and there is some support for a technique called YPATH, which is the YAML version of XPath. The formal specification is good, albeit somewhat difficult reading (as with most specs).
This introduction is intended to suggest some situations where YAML provides a better object serialization format than XML. In my mind, XML is not always the best choice for data representation -- not even in many of those cases where it seems obvious.
- The home page for YAML is http://yaml.org/.
- The YAML specification has recently reached 1.0 level, and it can be found in its full glory at http://yaml.org/spec/.
- Take a look at the results of the recently completed "Brains in Bahrain" chess tournament. In a mixed reprieve of us humans' submission to our robot overlords, the match tied.
- REXML is a Ruby library for making XML look more like "native" data structures: http://www.ibm.com/developerworks/xml/library/x-matters18.html.
- PYX is another format that is not quite XML, and is in some ways easier to process. But the semantics of PYX are essentially identical to XML; only the syntax is different: http://www.ibm.com/developerworks/xml/library/x-matters17.html.
- I looked at the object models of XML-RPC in comparison to gnosis.xml.pickle. Given my current hindsight, I like YAML better than either of them (at least for a lot of purposes): http://www.ibm.com/developerworks/xml/library/x-matters15.html.
- My Python tools gnosis.xml.pickle and gnosis.xml.objectify attempt to bridge some of the conceptual gaps between XML and dynamic programming languages (at least Python specifically): http://www.ibm.com/developerworks/xml/library/x-matters11.html.
- You'll find all of the XML Matters columns at: http://www.ibm.com/developerworks/views/xml/libraryview.jsp?search_by=xml+matters:.

David Mertz wishes to let a thousand flowers bloom. David may be reached at mertz@gnosis.cx; his life pored over at http://gnosis.cx/dW/.



