XML Matters: YAML improves on XML

YAML Ain't Markup Language

David Mertz (mertz@gnosis.cx), Alternator, Gnosis Software, Inc.
David Mertz wishes to let a thousand flowers bloom. David may be reached at mertz@gnosis.cx; his life pored over at http://gnosis.cx/dW/.

Summary:  In this article, David introduces you to YAML, a data serialization format that can be easily read by humans and is well-suited to encoding the data types used in dynamic programming languages. In contrast to XML, YAML uses clean and very minimal structural indicators, relying largely on indentation of nested elements. More important than YAML's superior syntax, for many tasks, is the far better semantic fit between YAML and "natural" data structures.

Date:  01 Oct 2002
Level:  Intermediate

I suppose the very first question on readers' minds has to be, "Why the name YAML?" There are a number of tools that have cutely adopted acronyms of the form "YA*", to mean "Yet Another XXX." In the world of open-source wit, YAML eschews its implied acronym, instead settling on the recursive "YAML Ain't Markup Language." There is a certain sense to this, however: YAML does what markup languages do, but without requiring any, well, markup.

Although it is no less general than XML, YAML is a great deal simpler to read, edit, modify, and produce than XML. That is, anything you can represent in XML, you can represent (almost always more compactly) in YAML. Namespaces represent one case in which some special pleading is necessary -- you can "bolt them on" to YAML, but they are not part of the current spec.

One criticism of XML that I have frequently raised in this column (and one that I am far from alone in emphasizing) is that XML is poorly focused. A classic committee-driven behemoth, XML tries to be a document format, a data format, a message packet format, a secure RPC channel (SOAP), and an object database. Moreover, XML piles on APIs for every style of access and manipulation: DOM, SAX, XSLT, XPATH, JDOM, and dozens of less common interface layers (I have contributed a few of my own in the gnosis.xml.pickle, gnosis.xml.objectify and gnosis.xml.validity packages). The remarkable thing is that XML does all these things; the disappointing part is that it does none of them particularly well.

YAML focuses more narrowly on cleanly representing the data structures and data types you encounter in dynamic programming languages like Perl, Python, Ruby, and to a lesser extent, Java programming. Bindings/libraries currently exist for those languages. A number of other languages have data models that will play nice with YAML, but no one has written libraries yet. These include Lisp/Scheme, Rebol, Smalltalk, xBase, and AWK. Less dynamic languages would not fit as well with YAML. The syntax of YAML combines the contextual typing of Perl, the indentation structure of Python, and a few conventions from MIME standards. In much the same way that Python is sometimes praised as executable pseudocode, YAML's concrete syntax comes very close to what you might use to informally explain a data structure to a class or work group.

Sketching an application

The easiest way to see why you would want to use YAML is to look at some code in different formats. For this installment, I have imagined creating a little application with a data storage and transmission requirement. My particular pet project is inspired by the "Brains in Bahrain" chess tournament -- occurring as I write this article -- between the FISA world chess champion and the best ranked computer player (take my data details with a grain of salt). If you wanted to create a program that tracks the activity of a chess club, you might use a data structure that would be described by the following Perl code.


Listing 1. Perl description of chess club data structure

$players = {
  'Vladimir Kramnik' => {'status'=>'GM', 'rating'=>'2700'},
  'Deep Fritz' =>  {'status'=>'Computer','rating'=>'2700'},
  'David Mertz' => {'status'=>'Amateur', 'rating'=>'1400'},
};
$club = {
  '_players' => $players,
  'matches' => [
    {'Date' => '2002-10-04',
     'White' => $players->{'Deep Fritz'},
     'Black' => $players->{'Vladimir Kramnik'},
     'Result' => 'Draw' },
    {'Date' => '2002-10-06',
     'White' => $players->{'Vladimir Kramnik'},
     'Black' => $players->{'Deep Fritz'},
     'Result' => 'White' }
  ]
};

The Python version is similar to the Perl code above:


Listing 2. Python description of chess club data structure

ts = yaml.timestamp    # or mx.DateTime or other date class
players = {
  'Vladimir Kramnik':  {'status':'GM', 'rating':2700},
  'Deep Fritz':        {'status':'Computer', 'rating':2700},
  'David Mertz':       {'status':'Amateur', 'rating':1400},
  }
matches = [
  {'Date':      ts('2002-10-04'),
   'White':     players['Deep Fritz'],
   'Black':     players['Vladimir Kramnik'],
   'Result':    'Draw' },
  {'Date':      ts('2002-10-06'),
   'White':     players['Vladimir Kramnik'],
   'Black':     players['Deep Fritz'],
   'Result':    'White' }
  ]
club = {'_players':players, 'matches':matches}

Other dynamic programming languages would use similar descriptions of the data structure. Basically, this is a top-level dictionary/mapping/hash that contains both dicts and lists at a few recursive levels; this also allows nested elements to refer to each other.
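
The cross-references are the part of this structure that a serialization format most easily loses. A minimal sketch in present-day Python 3 makes the point; it assumes the standard library's datetime.date in place of the yaml.timestamp constructor used in Listing 2, and trims the data to two players and one match. The match record holds the very same player dictionaries, not copies of them:

```python
from datetime import date

players = {
    'Vladimir Kramnik': {'status': 'GM', 'rating': 2700},
    'Deep Fritz': {'status': 'Computer', 'rating': 2700},
}
matches = [
    {'Date': date(2002, 10, 4),
     'White': players['Deep Fritz'],
     'Black': players['Vladimir Kramnik'],
     'Result': 'Draw'},
]
club = {'_players': players, 'matches': matches}

# The match refers to the same object as the players table, not a copy.
shared = club['matches'][0]['White'] is club['_players']['Deep Fritz']

# So an update made through the players table is visible from the match.
# (The new rating value is invented purely for illustration.)
players['Deep Fritz']['rating'] = 2710
rating_seen_from_match = club['matches'][0]['White']['rating']
```

Any format that flattens each match into its own independent copy of the player record would break this property after a save and reload.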

Presumably, an application that managed the chess club could perform tasks like recording additional matches, adding/removing players from the club, or updating ratings based on matches played. Moreover, an application would want to not only record a data snapshot, but also share it with other applications (in other languages) that work with the same data model. In addition, it would be nice if we could easily touch up the data by hand outside of an application; anyone who has developed a data-oriented application or maintained an organization's records knows how helpful it can be to "poke at the guts" of underlying data structures.


Choosing the representation

You have probably already picked up on the fact that I have rigged the deck: YAML is going to swoop in here as the right solution. But I don't think the setup is unfair. The way I have structured the data is a whole lot like the way data often presents itself, at least in broad strokes. I have chosen to use the maps, lists, references, and data types above in precisely the places they each seemed most natural. Also, I have chosen the underlying problem merely as something that was complex enough to show the issues while being simple enough to fit into one article.

If you are willing to require a particular programming language (and possibly a particular version) for all chess club applications, most languages have good serialization capabilities in built-in or common libraries:

  • Python has cPickle, gnosis.xml.pickle, and pprint
  • Perl has Data::Dumper, Data::Denter, and Data::DumpXML
  • Ruby has Marshal and XmlSerialization
  • Java language has java.io.Serializable, org.apache.xml.serialize.XMLSerializer, and various others

As the names indicate, some of these libraries produce XML, but that hardly means XML is easily transferable between languages.

In addition, there are a few general semantic problems with using XML to represent this chess club data. XML has the concept of unordered mapping for element attributes, but strict ordering for nested elements. A particular application, of course, has the ability to ignore some of the ordering information, but the information model of XML always asserts a significance to order, often spuriously. For example, matches are considered to fall in a particular order (by date), while players are not inherently ordered. (You could, of course, impose an order such as rating or enroll-date.) The problem is that you need custom programming in every application to remove the implied ordering information everywhere it is spurious and keep it where it is important.
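
The attribute/element asymmetry is easy to observe from code. The following is a minimal sketch using xml.etree.ElementTree from the Python 3 standard library (a choice of convenience, not one of the APIs the column discusses); the two-player document is a cut-down assumption for illustration. Attributes come back as a mapping, children come back as a sequence, and it is the application that has to throw the sequence order away:

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<club>'
    '<player name="Deep Fritz" rating="2700"/>'
    '<player rating="2700" name="Vladimir Kramnik"/>'
    '</club>'
)

# Attributes arrive as a mapping: they are looked up by name, and the
# order in which they were written carries no meaning.
attrs = doc[0].attrib

# Child elements arrive as a sequence in document order, whether or not
# that order means anything to the application.
names_in_document_order = [p.get('name') for p in doc]

# An application that regards players as unordered must discard the
# order itself, for example by rebuilding a mapping keyed on name.
players = {p.get('name'): int(p.get('rating')) for p in doc}
```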

XML-RPC, SOAP, gnosis.xml.pickle, and the XML serializer libraries in various languages take a generic approach to representing mappings. In all of these cases, the basic principle uses (rather verbosely) <key> and <val> (or similar tags) to indicate unordered pairs, and different container elements to indicate ordered items. This principle adds several layers to remove part of the XML information model:


Listing 3. XML-RPC model of ordered and unordered collections

>>> import xmlrpclib
>>> print xmlrpclib.dumps(({'this':'that',
...                         'spam':('eggs','toast')},))
<params>
<param>
<value><struct>
<member>
<name>this</name>
<value><string>that</string></value>
</member>
<member>
<name>spam</name>
<value><array><data>
<value><string>eggs</string></value>
<value><string>toast</string></value>
</data></array></value>
</member>
</struct></value>
</param>
</params>

XML-RPC has a few additional artifacts -- like the need to wrap the whole object in a one-item tuple -- but these are minor issues. The awkward fit between the "native" and the XML data models is equally evident in any of the XML serialization formats mentioned here.
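
For readers on Python 3, where xmlrpclib has become xmlrpc.client, the same experiment can be sketched as a round trip. This is an illustrative sketch rather than part of the original session; the variable names are arbitrary. It also shows one small instance of the awkward fit: the native tuple does not survive the trip.

```python
import xmlrpc.client

native = {'this': 'that', 'spam': ('eggs', 'toast')}

# dumps() expects a tuple of parameters, hence the one-item tuple wrapper.
xml_text = xmlrpc.client.dumps((native,))

# loads() returns (params, methodname); there is no method name in a
# bare <params> document, so the second item is None.
params, method_name = xmlrpc.client.loads(xml_text)
restored = params[0]

# XML-RPC has a single <array> type, so the tuple comes back as a list.
spam_type_after_round_trip = type(restored['spam'])
```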


Attempting XML

There are at least two issues that arise in representing this chess club data as XML. The first, and simpler, issue is exactly what the best XML representation would be in the abstract. For this, I would propose something like the following as a best attempt in XML:


Listing 4. Optimal XML description of chess club data

<?xml version="1.0"?>
<club>
  <players>
    <player id="kramnik"
            name="Vladimir Kramnik"
            rating="2700"
            status="GM" />
    <player id="fritz"
            name="Deep Fritz"
            rating="2700"
            status="Computer" />
    <player id="mertz"
            name="David Mertz"
            rating="1400"
            status="Amateur" />
  </players>
  <matches>
    <match>
        <Date>2002-10-04</Date>
        <White refid="fritz" />
        <Black refid="kramnik" />
        <Result>Draw</Result>
    </match>
    <match>
        <Date>2002-10-06</Date>
        <White refid="kramnik" />
        <Black refid="fritz" />
        <Result>White</Result>
    </match>
  </matches>
</club>

The above XML data representation is fairly clear. It is not much more verbose than the native data descriptions given in the Perl and Python examples, nor than the YAML description in Listing 5. It is not all that difficult to modify the document with general purpose tools like a text editor. (In fact, that is exactly how I created the XML initially.)

Semantically, my proposed XML has all the problems discussed. Players appear ordered, even though they are not intended to be. And the player list appears to precede the matches list, even though no such conceptual order is intended. Player attributes are unordered, as desired (being XML attributes), but since match "attributes" cannot fit as XML attributes, XML imposes an artificial order.

The more important issue arises with actually reading and writing my optimal XML format. None of the common XML APIs comes even close to automating this operation. For example, a SAX reader could look for various "player" and "match" events and manually add to relevant nested dictionaries or lists, but this approach is fragile, and needs to be reprogrammed for the slightest change in data structure during development. Walking a DOM tree has the same issue. Custom APIs like JDOM or REXML do not help much either. gnosis.xml.objectify does a fairly good job of automatically generating a native object, but this only works for reading in the XML, not for writing it back out. Writing, of course, is symmetric with reading, with all its corresponding fragilities.
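
To make the fragility concrete, here is a minimal hand-written reader for a shortened version of the Listing 4 format, sketched with Python 3's xml.etree.ElementTree (an assumption of convenience; a SAX or DOM version would have the same shape). The function name load_club and the by_id helper table are hypothetical. Notice that every element and attribute name is hard-coded, so any change to the data structure means editing this function, and a matching writer would have to be maintained alongside it:

```python
import xml.etree.ElementTree as ET

XML = """<club>
  <players>
    <player id="kramnik" name="Vladimir Kramnik" rating="2700" status="GM" />
    <player id="fritz" name="Deep Fritz" rating="2700" status="Computer" />
  </players>
  <matches>
    <match>
      <Date>2002-10-04</Date>
      <White refid="fritz" />
      <Black refid="kramnik" />
      <Result>Draw</Result>
    </match>
  </matches>
</club>"""

def load_club(xml_text):
    """Rebuild the native club structure from the XML by hand."""
    root = ET.fromstring(xml_text)
    by_id, players = {}, {}
    for p in root.find('players'):
        record = {'status': p.get('status'), 'rating': int(p.get('rating'))}
        players[p.get('name')] = record
        by_id[p.get('id')] = record   # kept only to resolve refid below
    matches = []
    for m in root.find('matches'):
        matches.append({
            'Date': m.findtext('Date'),   # still a string; no type inference
            'White': by_id[m.find('White').get('refid')],
            'Black': by_id[m.find('Black').get('refid')],
            'Result': m.findtext('Result'),
        })
    return {'_players': players, 'matches': matches}

club = load_club(XML)
```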


YAML to the rescue

The YAML format simply matches the data structures of dynamic languages better. And it looks nicer too. Here's a YAML representation of the same chess club data:


Listing 5. YAML description of chess club data

---
players:
  Vladimir Kramnik: &kramnik
    rating: 2700
    status: GM
  Deep Fritz: &fritz
    rating: 2700
    status: Computer
  David Mertz: &mertz
    rating: 1400
    status: Amateur

matches:
  -
    Date: 2002-10-04
    White: *fritz
    Black: *kramnik
    Result: Draw
  -
    Date: 2002-10-06
    White: *kramnik
    Black: *fritz
    Result: White

There are a number of nice things about this format. The YAML Web site gives exact specifications (see Resources), but this brief sample gives you a pretty accurate idea of the basic elements. The spec also includes an intuitive means of including (multi-)paragraph strings. YAML is terse, but still readable. Moreover, quoting is minimal, with data types being inferred from patterns (for example, if it looks like a date, it is treated as a timestamp value unless explicitly string quoted). You can use references to any named target. And, significantly, YAML maintains the distinction between ordered and associative collections. As an added bonus, you can very easily edit YAML in a text editor.
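
The type inference and the references can be checked directly. The sketch below assumes the present-day third-party PyYAML package and its yaml.safe_load function, which is not the 2002-era library API used in the listings of this article; the YAML text is a shortened copy of Listing 5.

```python
import datetime
import yaml  # third-party PyYAML (an assumption, not the library from 2002)

TEXT = """\
players:
  Vladimir Kramnik: &kramnik
    rating: 2700
    status: GM
  Deep Fritz: &fritz
    rating: 2700
    status: Computer
matches:
  - Date: 2002-10-04
    White: *fritz
    Black: *kramnik
    Result: Draw
"""

club = yaml.safe_load(TEXT)
first = club['matches'][0]

# Unquoted scalars are typed by pattern: a date and an integer here.
date_value = first['Date']
rating_value = club['players']['Deep Fritz']['rating']

# An alias (*fritz) resolves to the same object as its anchor (&fritz).
same_object = first['White'] is club['players']['Deep Fritz']
```

Quoting the value, as in Date: "2002-10-04", would keep it a plain string instead.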

None of the semantic and syntactic benefits listed above are really the strongest reason for using YAML for my application. The best part is actually its uniform interface in all the supported languages. I can read, manipulate, and write the above YAML data file as easily as:


Listing 6. Python access to YAML data source

import yaml
club = yaml.loadFile('club.yml').next()
# ...manipulate the 'club' data structure...
club_yamlstr = yaml.dump(club)
# ...do something w/ formatted YAML in club_yamlstr...

I use the .next() method above because a YAML stream can contain multiple documents, each separated by ---. Incidentally, the data structure in club is exactly the same as the one defined in the prior pure Python definition.

In Perl (or Ruby or Java programming), the steps are almost the same:


Listing 7. Perl access to YAML data source

use YAML ();
my $club = YAML::LoadFile('club.yml');
my $club_yamlstr = YAML::Dump($club);

The roundtrip between YAML and native data structures is free... well, very close. I found two minor drawbacks:

  • References lose their names (for example, "*kramnik") and simply become numbered (for example, "*1")
  • Targets are always spelled out on first occurrence.

Aesthetically, I prefer to see a player's details in the "players" section, but that is not guaranteed with an unordered dictionary (the use of _players in the Perl/Python samples is a hack to force matters).
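
Both drawbacks can still be reproduced with a current library. The sketch below assumes the third-party PyYAML package (yaml.safe_load and yaml.safe_dump), not the 2002-era API from Listings 6 and 7, and uses a deliberately tiny document. PyYAML generates its own anchor names of the form id001 and, because safe_dump sorts mapping keys by default, "matches" is written before "players", so the player's details end up spelled out inside the match.

```python
import yaml  # third-party PyYAML (an assumption, not the library from 2002)

TEXT = """\
players:
  Deep Fritz: &fritz
    rating: 2700
matches:
  - White: *fritz
"""

club = yaml.safe_load(TEXT)
round_tripped = yaml.safe_dump(club)

# The hand-chosen anchor name is gone, replaced by a generated one.
name_lost = '&fritz' not in round_tripped and '&id001' in round_tripped

# The target is written out where it first occurs in the output, which
# here is under "matches" rather than under "players".
target_in_matches = (
    round_tripped.index('matches') < round_tripped.index('&id001')
    < round_tripped.index('players')
)

# The data itself is unharmed: reloading gives an equal structure.
reloaded = yaml.safe_load(round_tripped)
```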


What it means

There are a number of features of YAML that I have not covered here. For example, the existing YAML libraries come with adequate, but not great, conversion tools for moving between XML and YAML. And there is some support for a technique called YPATH, which is the YAML version of XPATH. The formal specification is good, albeit somewhat difficult reading (as with most specs).

This introduction is intended to suggest some situations where YAML provides a better object serialization format than XML. In my mind, XML is not always the best choice for data representation -- not even in many of those cases where it seems obvious.


Resources

  • The home page for YAML is http://yaml.org/.

  • The YAML specification has recently reached 1.0 level, and it can be found in its full glory at http://yaml.org/spec/.

  • Take a look at the results of the recently completed "Brains in Bahrain" chess tournament. In a mixed reprieve of us humans' submission to our robot overlords, the match tied.
