Thinking XML: Microformats the XML way

Put microformats in perspective -- when are they a good choice?

You might have heard about microformats, a way to embed small, specialized information within standard formats. In fact, microformats come in two types: elemental microformats, which are often quite useful, and compound microformats, which are often quite problematic. Learn a basic approach that avoids the hacks in some compound microformats by relying on the structure of the Web. XML, and other natural data representation technologies such as JSON, are just as viable as many of their counterparts in microformats.

Uche Ogbuji (uche@ogbuji.net), Principal, Uli, LLC

Uche Ogbuji is Principal at Uli, LLC, a services firm specializing in next-generation Web technologies. Mr. Ogbuji is lead developer of 4Suite, an open source platform for XML, RDF, and knowledge-management applications, and lead developer of the Versa RDF query language. He is a computer engineer and writer born in Nigeria, living and working in Boulder, Colorado, USA. You can find more about Mr. Ogbuji at his Weblog, Copia.



15 May 2007

Microformats emerged a couple of years ago as a way to tunnel richer semantics into host formats such as HTML/XHTML and Atom. HTML has no direct way to express contact information, but a microformat called hCard allows you to transmogrify workaday HTML divs and spans into constructs that specify contact name, street address, postal code, and so forth. Some of the basic ideas behind microformats are sound, but the movement has turned into such a hype machine that people sometimes don't think about why and where microformats make sense. Microformats should be one tool available for expressing rich content on the Web, and should complement, rather than supplant, other such technologies, like XML on the Web and even Ajax. This article aims to restore some perspective on where microformats make sense and where something else is worth a hard look.

Nuance and nuisance

Microformats work best where they add a very little bit of nuance to common constructs in a host language such as HTML, XHTML, or Atom. An example is rel-license, which allows you to express that a link identifies the usage license for the source page's contents. The link <a href="http://creativecommons.org/licenses/by/2.0/" rel="license">cc by 2.0</a> anywhere in the page means that the page's contents are available under the Creative Commons Attribution 2.0 license. I've seen people abuse this microformat to assert a license for software described by a page, rather than for the page itself, but one can't really blame the microformat's developers for this. A bigger problem is that such conventions can clash but, for the most part, microformats ignore such problems, hoping that lightweight agreements will solve them as they arise. A microformat such as rel-license provides a convention for use of an HTML attribute designed to carry such conventions. It goes no further than providing nuances in such constructs of the host vocabulary, and I call such microformats nuance microformats. Most of these are what microformats designers call elemental microformats, which are those constructed entirely within an element.
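To show how little machinery a nuance microformat demands of a consumer, here is a minimal sketch in Python that collects rel-license links from a page. The class and function names are hypothetical, chosen only for this illustration. The sketch also checks link elements, since HTML allows the rel attribute there as well, and it treats rel as a space-separated list of tokens.

```python
from html.parser import HTMLParser


class LicenseLinkFinder(HTMLParser):
    """Collect href values of <a> and <link> elements whose rel includes "license"."""

    def __init__(self):
        super().__init__()
        self.licenses = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attributes = dict(attrs)
        # rel holds space-separated tokens, so split before testing membership.
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if "license" in rel_tokens and attributes.get("href"):
            self.licenses.append(attributes["href"])


def find_licenses(html_text):
    """Return the license URLs asserted by a page, in document order."""
    finder = LicenseLinkFinder()
    finder.feed(html_text)
    finder.close()
    return finder.licenses


page = (
    "<p>Some prose. "
    '<a href="http://creativecommons.org/licenses/by/2.0/" rel="license">cc by 2.0</a>'
    "</p>"
)
print(find_licenses(page))
```

Note that nothing here can tell whether the author meant the license to cover the page or something the page describes, which is exactly the kind of ambiguity mentioned above.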

Some microformats try to tunnel specialized structure into the constructs of the host language. I like to call these nuisance microformats because they are rather pernicious in several ways, but in this article I'll use the more neutral term adopted by microformats designers: compound microformats. A good example is XOXO, a microformat for expressing outlines. XOXO is far more inscrutable and harder to process than almost any XML you might design to replace it. XML was, of course, designed to express complex and specialized structures for content, and it seems a step backward to use a far less expressive construct just to embed the structure within HTML. Microformats folks do this because they feel that XML is too complex, is not yet ubiquitous enough and, more importantly, doesn't allow for graceful degradation: microformats look like regular HTML to user agents that do not understand more advanced technologies such as XML. This is a fairly weak argument, in part because most user agents support XML these days, and also because a scalable design for the Web is sometimes worth such tradeoffs and inconveniences. Atom was developed as a better Web feed format despite the ubiquity of the RSS dialects. Cascading Style Sheets (CSS) was developed as a way to separate content from presentation in Web pages, despite the fact that it's easier for the lazy Web publisher to just use font and center tags, and despite the hurdles browsers have placed in front of conscientious Web developers who do try to apply CSS. Both technologies are doing well despite the heavy burden of legacy, so legacy is certainly a poor excuse for bad design in microformats.


Don't forget hypertext

The pet use-case for microformats is Weblogging. The idea is that the author marks up entries with hAtom, a blogroll with XOXO, the license with rel-license, inline contact information with hCard, and so on. Microformats provide a simple way to publish all this data in somewhat structured form just by updating the Weblog engine templates. And the Weblog also benefits from graceful degradation so that even if readers don't have microformats-aware software, they can still read the information directly.

This all sounds reasonable, but it also tends to ignore the basic value of the Web. HTML is for prose organized into pages. You connect pages together using links. You express non-prose data using links to objects such as images, stylesheets, and the like. This works very well today. There is no reason not to express address books, Web feeds, and the like by links to XML documents. XML is for semi-structured data. Sometimes XML is not the most suitable format for the data. In cases of highly structured data, perhaps JavaScript Object Notation (JSON) is a better alternative. Regardless, each technology is easy to use for its best purpose while still maintaining maximum compatibility with Web browser, and even mobile browser, technology.

I start with hAtom, because it has such an obvious XML alternative in Atom. There is already an XML form of Atom. It is enjoying healthy growth and support. There is even a convention to link from a regular Web page to an Atom feed:

  <link rel="alternate" type="application/atom+xml" title="BobSutor's blog feed"
         href="/developerworks/blogs/rss/BobSutor?flavor=atomdw" />

Adding this feed link is much simpler than jumping through the various hoops of hAtom. Any Weblog software likely to support hAtom is even more likely to support pure Atom. Most search engines will index the linked Atom file as diligently as the page itself. Search engine spiders know from the Web feed link convention (a form of microformat in itself) how to interpret the linked Atom file. And for those writing Web tools and services, Atom is much easier to parse and process than hAtom. You gain the added benefit of XML's strict syntax and avoid having to consume a stack of two semantic layers to work with hAtom. And the advantage of proper XML Atom comes despite the fact that hAtom is more nuance microformat than nuisance microformat.
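As a rough illustration of how directly a tool can consume Atom, the following Python sketch lists the entries of a feed using only the standard library. The sample feed and the function name are hypothetical, invented for this example; only the Atom namespace and element names come from the Atom format itself.

```python
import xml.etree.ElementTree as ET

# The Atom 1.0 namespace, in ElementTree's {uri}name notation.
ATOM = "{http://www.w3.org/2005/Atom}"

# A hypothetical, minimal feed used only for illustration.
SAMPLE_FEED = """<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Buddy blog</title>
  <id>http://example.com/bud/</id>
  <updated>2007-05-15T00:00:00Z</updated>
  <entry>
    <title>First post</title>
    <id>http://example.com/bud/1</id>
    <updated>2007-05-15T00:00:00Z</updated>
    <link rel="alternate" href="http://example.com/bud/1.html"/>
  </entry>
</feed>
"""


def list_entries(atom_text):
    """Return (title, link) pairs for each entry in an Atom document."""
    # Parse from bytes so the XML declaration's encoding is honored.
    root = ET.fromstring(atom_text.encode("utf-8"))
    entries = []
    for entry in root.iter(ATOM + "entry"):
        title = entry.findtext(ATOM + "title", default="")
        link = ""
        for candidate in entry.findall(ATOM + "link"):
            # A link with no rel attribute is treated as rel="alternate".
            if candidate.get("rel", "alternate") == "alternate":
                link = candidate.get("href", "")
                break
        entries.append((title, link))
    return entries


print(list_entries(SAMPLE_FEED))
```

A malformed feed fails loudly at the parse step, which is the strictness referred to above. An hAtom consumer would first have to parse the HTML and then interpret class names layered on top of it.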

Boilerplate or in-line content?

Moving on to the case of XOXO for a list of related Weblogs (aka "blogroll"), there is no XML standard as obvious as Atom is for feeds. It's also relatively easy to work XOXO into a Weblog, because the blogroll is generally not part of the in-line content of entries, but rather a part of the overall Weblog template boilerplate. Listing 1 is an example of an XOXO blogroll:

Listing 1. XOXO blogroll example
<ol class="xoxo">
 <li>
  <p>My favorite Weblogs</p>
  <ol>
   <li>
    <a href="http://example.com/bud/" type="text/html">Buddy blog</a>
    <a href="http://example.com/bud/atom" type="application/atom+xml">Buddy feed</a>
    <dl>
     <dt>description</dt>
     <dd>My buddy's Weblog</dd>
    </dl>
   </li>
  </ol>
 </li>
</ol>

Really, I can put it no other way: this is ghastly. It takes ten elements to do what any reasonable XML format can accomplish in four or five (and plain HTML in just six). Listing 1 is a simplified example of XOXO. I have seen many more complex examples in the wild. The problem here is that in distorting HTML to express blogroll semantics, XOXO introduces complexity and, with it, increased risk of error. And it's hard to fathom the gain. An author might use the simple boilerplate of Listing 2 for a blogroll:

Listing 2. Plain HTML blogroll example
<div class="blogroll">
  <h3>My favorite Weblogs</h3>
  <ol>
    <li>
      <a href="http://example.com/bud/" type="text/html">Buddy blog</a>
      (<a href="http://example.com/bud/atom"
          type="application/atom+xml">Buddy feed</a>): My buddy's Weblog
    </li>
  </ol>
</div>

This is much easier to emulate without confusion or errors.
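The element counts quoted for the two listings can be checked mechanically. The following Python sketch, with hypothetical names, counts start tags in each fragment. It assumes the fragments contain no void elements or comments, which holds for both listings.

```python
from html.parser import HTMLParser


class ElementCounter(HTMLParser):
    """Count the elements in a fragment by counting its start tags."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        self.count += 1


def count_elements(markup):
    counter = ElementCounter()
    counter.feed(markup)
    counter.close()
    return counter.count


# The XOXO blogroll of Listing 1.
XOXO_BLOGROLL = """
<ol class="xoxo">
 <li>
  <p>My favorite Weblogs</p>
  <ol>
   <li>
    <a href="http://example.com/bud/" type="text/html">Buddy blog</a>
    <a href="http://example.com/bud/atom" type="application/atom+xml">Buddy feed</a>
    <dl>
     <dt>description</dt>
     <dd>My buddy's Weblog</dd>
    </dl>
   </li>
  </ol>
 </li>
</ol>
"""

# The plain HTML blogroll of Listing 2.
PLAIN_BLOGROLL = """
<div class="blogroll">
  <h3>My favorite Weblogs</h3>
  <ol>
    <li>
      <a href="http://example.com/bud/" type="text/html">Buddy blog</a>
      (<a href="http://example.com/bud/atom"
          type="application/atom+xml">Buddy feed</a>): My buddy's Weblog
    </li>
  </ol>
</div>
"""

print(count_elements(XOXO_BLOGROLL), count_elements(PLAIN_BLOGROLL))  # prints "10 6"
```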


Even JSON gets its due

One disadvantage of the plain HTML example (Listing 2) compared with XOXO (Listing 1) is that the blogroll is no longer structured for machine reading. To remedy this, provide a link to the structured blogroll data, which is, after all, just a collection of links. This data is highly structured and is amenable to data formats other than XML, including JSON. Listing 3 is an example of JSON for a blogroll:

Listing 3. JSON blogroll data example
  [
   {"blog": "http://example.com/bud/",
    "feed": "http://example.com/bud/atom",
    "description": "My buddy's Weblog",
    "tags": ["buddy"]
   }
  ]

You can make this available as a simple file and use Ajax scripting techniques to dynamically build the plain HTML blogroll in Listing 2 from the preceding list of links. To handle the graceful degradation, select from well-understood techniques for accessible Ajax design, many of which are discussed here on IBM developerWorks.
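In a browser this transformation would be written in JavaScript, but the logic is the same in any language. The following Python sketch shows one way to turn data shaped like Listing 3 into a fragment shaped like Listing 2. The function name is hypothetical, and because Listing 3 carries no blog title, the sketch assumes the blog URL serves as the link text and that the feed link is simply labeled "feed".

```python
import json
from html import escape

# The data from Listing 3, inlined so the sketch is self-contained.
BLOGROLL_JSON = """
[
 {"blog": "http://example.com/bud/",
  "feed": "http://example.com/bud/atom",
  "description": "My buddy's Weblog",
  "tags": ["buddy"]
 }
]
"""


def render_blogroll(json_text, heading="My favorite Weblogs"):
    """Build a Listing 2-style HTML fragment from Listing 3-style JSON."""
    items = []
    for entry in json.loads(json_text):
        # Escape every value so that data cannot break out of the markup.
        blog = escape(entry["blog"])
        feed = escape(entry["feed"])
        description = escape(entry.get("description", ""))
        items.append(
            "    <li>\n"
            # Assumption: no title field exists, so the URL doubles as link text.
            f'      <a href="{blog}" type="text/html">{blog}</a>\n'
            f'      (<a href="{feed}" type="application/atom+xml">feed</a>): {description}\n'
            "    </li>"
        )
    return (
        '<div class="blogroll">\n'
        f"  <h3>{escape(heading)}</h3>\n"
        "  <ol>\n" + "\n".join(items) + "\n  </ol>\n"
        "</div>"
    )


print(render_blogroll(BLOGROLL_JSON))
```

Adding a title field to each JSON object would let the generated links read "Buddy blog" and "Buddy feed" as in Listing 2. That would be a change to the data format, not something Listing 3 provides.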


Wrap up

The very community that microformats claim to embrace should serve as a warning for when microformats go too far. Webloggers are an amazing engine of Web-based innovations and constantly introduce ideas and technologies that should eliminate the need for some of the more egregious hacks characteristic of compound microformats. And when they produce these innovations, you usually don't have to select view source to emulate them. A Weblogger often posts detailed instructions for adopting his or her techniques. Certainly nuance microformats are useful, and build on classic respect for rough consensus and running code. They are like formal registries for MIME types, filename extensions, and such, but with less central control. Problems come about when the authors of microformats start to distort host formats rather than just build on them. It's almost always better to use a better-suited format in such cases, and the Web is full of helpful means that leave you with a healthy choice.

Resources

Learn

  • Microformats in Context: Uche Ogbuji discusses some of the problems with microformats.
  • JSON: This page also serves as tutorial and specification for this very simple data format.
  • New to XML: Check out the XML zone's updated page. Readers of this column might be too advanced for this page, but it's a great place to get your colleagues started.
  • XML standards: All XML developers can benefit from the coverage of many XML standards.
  • developerWorks XML zone: Find more XML resources, including previous installments of the Thinking XML column. If you have comments on this article, or any others in this column please post them on the Thinking XML forum.
  • IBM XML certification: Find out how you can become an IBM-Certified Developer in XML and related technologies.
  • XML technical library: See the developerWorks XML zone for a wide range of technical articles and tips, tutorials, standards, and IBM Redbooks.
  • developerWorks technical events and webcasts: Stay current with technology in these sessions.
