Separate data and formatting with microformats

Create simple, pragmatic formats for the semantic Web

Jack D Herrington (jherr@pobox.com), Senior Software Engineer, Leverage Software Inc.
Jack D. Herrington is a senior software engineer with more than 20 years of experience. He's the author of three books: Code Generation in Action, Podcasting Hacks, and PHP Hacks. He has also written more than 30 articles. You can reach Jack at jherr@pobox.com.

Summary:  Microformats are a new way to embed structured data within standard XHTML code. Discover how to read and write the new microformats for the Web.

Date:  11 Jul 2006
Level:  Introductory

Every once in a long while, I read about an idea that is a stroke of brilliance, and I think to myself, "I wish I had thought of that, it's genius!" Microformats are just that kind of idea. You see, for a while now, people have tried to extract structured data from the unstructured Web. You hear glimmers of this effort when people talk about the "semantic Web," a Web in which data is separated from formatting. But for whatever reason, the semantic Web hasn't taken off, and the problem of finding structured data in an unstructured world remains.

Until now. Microformats are one small step forward toward exporting structured data on the Web. The idea is simple. Take a page that has some event information on it -- start time, end time, location, subject, Web page, and so on. Rather than put that information into the Hypertext Markup Language (HTML) of the page in any old way, add some standardized HTML tags and Cascading Style Sheet (CSS) class names. The page can still look any way you choose, but to a browser looking for one of these formatted -- or should I say, microformatted -- pieces of HTML, the difference is night and day.

Perhaps looking at a piece of microformatted HTML will help. Listing 1 shows an event formatted to the hCalendar microformat standard.


Listing 1. A sample event page

<html>
  <body>
    <div class="vevent">
      <a class="url" href="http://myevent.com">
        <abbr class="dtstart" title="20060501">May 1</abbr> - 
        <abbr class="dtend" title="20060502">02, 2006</abbr>
        <span class="summary">My Conference opening</span> - at
        <span class="location">Hollywood, CA</span>
      </a>
      <div class="description">The opening days of the conference</div>
    </div>
  </body>
</html>

You can have multiple events on a single page, each wrapped in a <div> tag marked with the vevent class. Within that <div> tag is the data for the event, marked up with different tags, each carrying the class that identifies its piece of event data. The url class is for the anchor that links to the event's Web page. The dtstart and dtend classes are on the tags that have the start and end times encoded in their title attributes. The summary, location, and description classes are all attached to the tags that wrap the corresponding content.

Now, the page that these items are on can use CSS to style them any way it chooses. So, the look of the event is still unique to the site. The naming convention on the classes is what counts.
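
The dtstart and dtend values in Listing 1 are compact dates such as 20060501. As a minimal sketch of how a consumer might turn one into readable text, the following uses a helper named format_event_date(), which is a hypothetical name and not part of this article's scripts. It assumes a PHP version that provides DateTimeImmutable and handles only the date-only form shown in Listing 1, not values that also carry a time.

```php
<?php
// Hypothetical helper (an assumption for illustration): convert the compact
// date from a dtstart/dtend title attribute into a readable string.
function format_event_date( $compact, $format = 'M j, Y' )
{
  // The leading "!" resets the unspecified time fields to zero
  $date = DateTimeImmutable::createFromFormat(
    '!Ymd', $compact, new DateTimeZone( 'UTC' ) );

  // createFromFormat() returns false when the text does not fit the pattern
  if ( $date === false )
  {
    return null;
  }
  return $date->format( $format );
}

echo format_event_date( '20060501' ), "\n";          // May 1, 2006
echo format_event_date( '20060502', 'Y-m-d' ), "\n"; // 2006-05-02
```

The machine-readable value stays in the title attribute, so the visible text between the <abbr> tags can use whatever date style suits the site.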

That's all there is to microformats. The Web site for microformats is a wiki on which you can add your own formats for whatever data you want to represent. When I last looked, there were formats for events, contact information, reviews, and a host of others. The site even has handy format creators, as in Figure 1.


Figure 1. The hCalendar creator page

In this article, I use PHP to extract the event records from a given page into an XML format. Then, I create another PHP page to take that XML and build an HTML page with the events encoded in the hCalendar microformat. But before that, let me talk a little about the pros and cons of microformats.

Pros and cons

To be honest, any experienced engineer would look at the hCalendar example above and say, "Where's the format in this microformat?" Certainly, microformats are much less a stand-alone format than a layer on top of an existing format -- HTML. And that has to be considered one of the cons. I mean, look at the competition:

BEGIN:vCalendar
VERSION:5.0
PRODID:-//Microsoft Corporation//Works 2000//EN
BEGIN:vEvent
DTSTART: 20060501T000000
DTEND: 20060502T000000
SUMMARY;ENCODING=QUOTED-PRINTABLE: My Conference opening
PRIORITY: 3
END:vEvent
END:vCalendar

It's strong, it's clear, it's easily read, and clearly, it would be easy to parse. It looks like something out of the Fortran 77 or early UNIX® tradition. You could easily imagine an XML alternative format that would update this format a little. But to get this format of information out there, you must write dedicated clients. That's the big con of new standards -- even XML standards -- and the big plus of microformats.

With microformats, you leverage two highly successful technologies: Hypertext Transfer Protocol (HTTP) and HTML. Both of these standards are integrated into every desktop in the world, and engineers are extending what you can do with those technologies continuously.

Take Greasemonkey as an example. Greasemonkey is a fantastic extension to the Mozilla Firefox browser that allows you to apply JavaScript code to a page after it has been loaded. Some people use the extension to remove banner ads. Others use it to add extra information to sites they don't control. Still others write Greasemonkey scripts that find microformatted data. And what's better is that because the scripts are written in JavaScript code, you don't have to worry about platform concerns. No need for Win32 or Cocoa here: Just write your extension in portable JavaScript code. But Greasemonkey doesn't work on non-HTML pages, and that's where microformats shine.

To be pessimistic for a moment and thus boost microformats a little further, I'll pass on an experience I had at a recent tech conference in Seattle, Washington. I was talking with one of the founders of Technorati, a blog search engine, and I was under the impression that after HTML, the second most successful tag-based data format would have been RSS. But in my conversation, the man revealed that most RSS is poorly formed or out of date and that Technorati doesn't depend on RSS. Instead, they go directly to the HTML and look for blog-style patterns in the code.

That's not to imply that RSS is bad: I love RSS. But I think it shows that getting formats other than HTML accepted in the real world is very difficult. So layering formats on top of HTML makes a lot of sense.

Now, off my soapbox and into the code. That starts with reading microformats from a page.


Reading microformats

To read a page encoded with microformat data, you must have a Web page that has events on it. I start with an .html file, as in Listing 2.


Listing 2. Hcalendar.html

<html>
  <head>
    <style>
      body { font-family: arial, verdana, sans-serif; }
    </style>
  </head>
  <body>
    <div style="width:600px;">

      <div class="vevent" id="one">
        <a class="url" href="http://myevent.com">
          <abbr class="dtstart" title="20060501">May 1</abbr> - 
          <abbr class="dtend" title="20060502">02, 2006</abbr>
          <span class="summary">My Conference opening</span> - at
          <span class="location">Hollywood, CA</span>
        </a>
        <div class="description">The opening days of the conference</div>
      </div>

      <div class="vevent" id="two">
        <a class="url" href="http://myevent.com">
          <abbr class="dtstart" title="20060503">May 3</abbr> -
          <abbr class="dtend" title="20060504">04, 2006</abbr>
          <span class="summary">My Conference closing</span> - at
          <span class="location">Hollywood, CA</span>
        </a>
        <div class="description">The closing days of the conference</div>
      </div>

    </div>
  </body>
</html>

When I look at this in the Web browser, I see something similar to Figure 2.


Figure 2. The hcalendar.html page

It isn't pretty to look at, but that's just to keep the example as simple as possible. The key here is that you can use whatever HTML styling you want to make the event look however you like. As long as you use the right CSS class names, it will still be recognized as an hCalendar-microformatted item.

Now that I have the page, I need some PHP code to read that page. Listing 3 shows this code, which is the base for the reading script.


Listing 3. The get_page function

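<?php
// PEAR HTTP_Client package
require_once 'HTTP/Client.php';

// Fetch $url and return the response body as a string
function get_page( $url )
{
  $client = new HTTP_Client();
  $client->get( $url );
  // currentResponse() returns an array that includes the body
  $resp = $client->currentResponse();
  return $resp['body'];
}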

This code uses the HTTP Client PEAR module to read the content from a given URL. If you haven't installed that module, you can use the PEAR command line to install it:

% pear install HTTP_Client

Next, I turn the HTML returned from the site into an XML Document Object Model (DOM). Thankfully, the site returned well-formed Extensible HTML (XHTML), so it's simple to point an XML reader at it. I do this through the get_events() function in Listing 4.


Listing 4. The get_events function

function get_events( $page )
{
  $body = get_page( $page );

  $dom = new DomDocument();
  $dom->loadXML( $body );

  $xpath = new DOMXPath( $dom );
  $events = $xpath->query("//div[@class='vevent']");

  $parsed_events = array();
  foreach( $events as $event )
  {
    $e = parse_event( $dom, $event );
    $parsed_events []= $e;
  }
  return $parsed_events;
}

This function starts by calling the get_page() function to retrieve the content of the page. Then, it creates a DomDocument object and loads the page content into it. With the DOM in hand, I use an XPath query to get every <div> tag on the page that has the vevent class, and I pass those nodes on to parse_event().
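
The loadXML() call in Listing 4 works only when the page is well-formed XML. For pages that are not, one possible approach is to fall back to the more lenient loadHTML() parser. The sketch below is an assumption about how that could look, not part of the article's script, and load_markup() is a hypothetical name.

```php
<?php
// Sketch: build a DOM from markup that may not be well-formed XML.
// load_markup() is a hypothetical helper, not part of the original script.
function load_markup( $body )
{
  $dom = new DOMDocument();

  // Collect parser errors quietly instead of emitting PHP warnings
  $previous = libxml_use_internal_errors( true );

  // Try the strict XML parser first, then the lenient HTML parser
  $loaded = $dom->loadXML( $body );
  if ( !$loaded )
  {
    $dom = new DOMDocument();
    $loaded = $dom->loadHTML( $body );
  }

  libxml_clear_errors();
  libxml_use_internal_errors( $previous );

  return $loaded ? $dom : null;
}

// The unclosed <br> makes this fragment invalid as XML
$sloppy = '<html><body><div class="vevent"><br>An event</div></body></html>';
$dom = load_markup( $sloppy );
$xpath = new DOMXPath( $dom );
echo $xpath->query( "//div[@class='vevent']" )->length, "\n"; // 1
```

One caveat with the strict path: if a page declares the XHTML namespace on its <html> tag, loadXML() places every element in that namespace, and a plain //div query will not match unless you register a namespace prefix with the DOMXPath object. The sample pages in this article do not declare a namespace, so the queries work as written.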

If you're unfamiliar with XPath, let me break it down a bit for you. The expression:

//div

would match any <div> tag at any level. Adding this restriction:

//div[@class='vevent']

will match only those <div> tags whose class attribute is exactly vevent. Because the comparison is exact, a tag such as <div class="vevent featured"> would not match. If you use XML or an XML-based language, such as XHTML or RSS, you should become familiar with XPath. It is by far the easiest way to navigate an XML tree and find the information you're looking for.
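
Because an HTML class attribute can hold several space-separated names, it can help to compare three ways of testing it: the exact comparison used in Listing 4, the contains() substring test that the parsing code uses later, and a whole-word token test. The sample markup below is invented for illustration, and the token expression is a common XPath 1.0 idiom rather than anything this article's script uses.

```php
<?php
// Invented sample markup with three different class attributes
$markup = <<<XML
<body>
  <div class="vevent">plain</div>
  <div class="vevent featured">two classes</div>
  <div class="prevevents">not an event</div>
</body>
XML;

$dom = new DOMDocument();
$dom->loadXML( $markup );
$xpath = new DOMXPath( $dom );

// Exact comparison: only the first <div> matches
$exact = $xpath->query( "//div[@class='vevent']" );

// Substring test: matches all three, including the unrelated "prevevents"
$substring = $xpath->query( "//div[contains(@class,'vevent')]" );

// Token test: pad the normalized class list with spaces, then look
// for " vevent " as a whole word
$token = $xpath->query(
  "//div[contains(concat(' ', normalize-space(@class), ' '), ' vevent ')]" );

printf( "exact=%d substring=%d token=%d\n",
  $exact->length, $substring->length, $token->length );
// exact=1 substring=3 token=2
```

The token form is wordier, but it is the only one of the three that accepts additional class names without also accepting look-alike names.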

To break out the data from each event <div> tag, the parse_event() function uses more XPath queries. You see this in Listing 5.


Listing 5. The parse_event() function

function parse_event( $dom, $event )
{
  $data = array();

  $xpath = new DOMXPath( $dom );

  $url = $xpath->query( ".//*[contains(@class,'url')]/@href", $event );
  $data['url'] = $url->length > 0 ? $url->item(0)->nodeValue : '';

  $dtstart = $xpath->query( ".//*[contains(@class,'dtstart')]/@title", $event );
  $data['dtstart'] = $dtstart->length > 0 ? $dtstart->item(0)->nodeValue : '';

  $dtend = $xpath->query( ".//*[contains(@class,'dtend')]/@title", $event );
  $data['dtend'] = $dtend->length > 0 ? $dtend->item(0)->nodeValue : '';

  $summary = $xpath->query( ".//*[contains(@class,'summary')]", $event );
  $data['summary'] = $summary->length > 0 ? $summary->item(0)->nodeValue : '';

  $location = $xpath->query( ".//*[contains(@class,'location')]", $event );
  $data['location'] = $location->length > 0 ? $location->item(0)->nodeValue : '';

  $desc = $xpath->query( ".//*[contains(@class,'description')]", $event );
  $data['desc'] = $desc->length > 0 ? $desc->item(0)->nodeValue : '';

  return $data;
}

The code looks a bit complicated, but it's really a set of XPath queries that look for the specific tags with the specific class names somewhere in the XML DOM tree. These XPath expressions are a little more involved than the earlier one. First, there's the notation:

.//*

which means, "any tag from this point down," where this point is the vevent <div> tag the code is currently looking at. Notice that the $xpath->query statement now specifies an additional argument -- $event -- which is the context node for the search. Typically, XPath queries start at the root of the document, but you can specify another starting point. I did so using the $event item.

Now, I don't want just any tag. I want a tag that has an attribute named class that contains a particular value, such as url. So, I add this syntax:

.//*[contains(@class,'url')]

so that it matches any tag in which url is part of the class name. But what I really want is the href attribute from that tag, so I even further refine the path this way:

.//*[contains(@class,'url')]/@href

This refinement gets the href attribute of the matching tag.
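
To see the effect of the context node in isolation, here is a small sketch with two events on an invented page. The first_value() helper is hypothetical and not part of the article's script. It wraps the query-then-check pattern that Listing 5 repeats for each field.

```php
<?php
// first_value() is a hypothetical helper, not part of the original script.
// It runs a query relative to $context and returns the first match's text.
function first_value( DOMXPath $xpath, $query, DOMNode $context )
{
  $nodes = $xpath->query( $query, $context );
  return $nodes->length > 0 ? trim( $nodes->item(0)->nodeValue ) : '';
}

// Invented sample markup with two events
$markup = <<<XML
<body>
  <div class="vevent"><a class="url" href="http://example.com/one">One</a></div>
  <div class="vevent"><a class="url" href="http://example.com/two">Two</a></div>
</body>
XML;

$dom = new DOMDocument();
$dom->loadXML( $markup );
$xpath = new DOMXPath( $dom );

foreach ( $xpath->query( "//div[@class='vevent']" ) as $event )
{
  // The leading "." keeps the search inside the current event
  echo first_value( $xpath, ".//*[contains(@class,'url')]/@href", $event ), "\n";

  // Without the ".", "//" starts at the document root, so the first
  // match is always the link in the first event
  echo first_value( $xpath, "//*[contains(@class,'url')]", $event ), "\n";
}
```

The second query prints One for both events, which is why every expression in Listing 5 begins with a period.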

After the events have been chewed up and returned as an array from the get_events() function, I need another function that exports that array of events as XML. To do so, I use the dump_events() function, as in Listing 6.


Listing 6. The dump_events() function

function dump_events( $events )
{
  $dom = new DomDocument();
  $dom->formatOutput = true;
  $root = $dom->createElement( 'events' );
  $dom->appendChild( $root );

  foreach( $events as $event )
  {
    $elEvent = $dom->createElement( 'event' );
    $root->appendChild( $elEvent );

    $elUrl = $dom->createElement( 'url' );
    $elUrl->appendChild( $dom->createTextNode( $event['url'] ) );
    $elEvent->appendChild( $elUrl );

    $elStart = $dom->createElement( 'start' );
    $elStart->appendChild( $dom->createTextNode( $event['dtstart'] ) );
    $elEvent->appendChild( $elStart );

    $elEnd = $dom->createElement( 'end' );
    $elEnd->appendChild( $dom->createTextNode( $event['dtend'] ) );
    $elEvent->appendChild( $elEnd );

    $elSummary = $dom->createElement( 'summary' );
    $elSummary->appendChild( $dom->createTextNode( $event['summary'] ) );
    $elEvent->appendChild( $elSummary );

    $elLocation = $dom->createElement( 'location' );
    $elLocation->appendChild( $dom->createTextNode( $event['location'] ) );
    $elEvent->appendChild( $elLocation );

    $elDesc = $dom->createElement( 'description' );
    $elDesc->appendChild( $dom->createTextNode( $event['desc'] ) );
    $elEvent->appendChild( $elDesc );
  }

  print( $dom->saveXML() );
}

This function is rather the inverse of the other functions. Instead of querying an existing XML tree, this code builds a DOM by using createElement, createTextNode, and appendChild to assemble a tree. Then, I use the saveXML method to serialize the tree and print it to standard output.
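
One reason to add text through createTextNode(), as Listing 6 does, is that the serializer then takes care of escaping. The following sketch uses an invented summary string to show that characters with special meaning in XML survive a round trip.

```php
<?php
$dom = new DOMDocument();
$root = $dom->createElement( 'event' );
$dom->appendChild( $root );

// Invented text containing characters with special meaning in XML
$summary = 'Q&A <live> session';

$el = $dom->createElement( 'summary' );
// createTextNode() stores the raw string; escaping happens on output
$el->appendChild( $dom->createTextNode( $summary ) );
$root->appendChild( $el );

// Passing a node to saveXML() serializes just that subtree
$out = $dom->saveXML( $root );
echo $out, "\n";

// Reading the output back yields the original text
$check = new DOMDocument();
$check->loadXML( $out );
echo $check->documentElement->textContent, "\n"; // Q&A <live> session
```

In the serialized output, the ampersand and angle brackets appear as entities, so a summary or description scraped from a page cannot break the generated XML.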

When I run this PHP script on the command line with the URL of the hcalendar.html page, I get the output in Listing 7.


Listing 7. Output from the PHP script

% php get_calendar.php http://localhost/micro/hcalendar.html
<?xml version="1.0"?>
<events>
  <event>
    <url>http://myevent.com</url>
    <start>20060501</start>
    <end>20060502</end>
    <summary>My Conference opening</summary>
    <location>Hollywood, CA</location>
    <description>The opening days of the conference</description>
  </event>
  <event>
    <url>http://myevent.com</url>
    <start>20060503</start>
    <end>20060504</end>
    <summary>My Conference closing</summary>
    <location>Hollywood, CA</location>
    <description>The closing days of the conference</description>
  </event>
</events>
%

Now I have a script that I can point at any Web page and extract any hCalendar-formatted items as XML.


Creating hCalendar items from XML

Now that I have the XML that I extracted from a Web page, I can create a PHP page that formats that XML as hCalendar items within the HTML. Listing 8 shows this page.


Listing 8. Index.php

<?php
$dom = new DomDocument();
$dom->load( "calendar.xml" );

$xpath = new DomXPath($dom);
$events = $xpath->query( '//event' );
?>
<html>
  <head>
    <title>My Calendar</title>
    <style>
      body { font-family: arial, verdana, sans-serif; }
      td { border-bottom: 1px solid black; border-top: 1px solid black; }
      abbr { border-bottom: none; }
    </style>
  </head>
  <body>
    <table>
      <?php
        foreach( $events as $event )
        {
          $desc = $xpath->query( 'description', $event )->item(0)->nodeValue;
          $start= $xpath->query( 'start', $event )->item(0)->nodeValue;
          $end = $xpath->query( 'end', $event )->item(0)->nodeValue;
          $location = $xpath->query( 'location', $event )->item(0)->nodeValue;
          $summary = $xpath->query( 'summary', $event )->item(0)->nodeValue;
          $url = $xpath->query( 'url', $event )->item(0)->nodeValue;
      ?>
      <tr>
        <td>
          <div class="vevent">
            <a class="url" href="<?php echo( $url ); ?>">
              <span class="summary"><?php echo($summary ); ?></span></a><br/>
              Start: <abbr class="dtstart" title="<?php echo($start ); ?>">
                <?php echo($start ); ?></abbr><br/>
              End: <abbr class="dtend" title="<?php echo($end ); ?>">
                <?php echo($end ); ?></abbr><br/>
                Location: <span class="location"><?php echo($location ); ?></span><br/>
              <div class="description"><?php echo($desc ); ?></div>
          </div>
        </td>
      </tr>
      <?php
      }
      ?>
    </table>
  </body>
</html>

This code might look complicated, but it's actually quite simple. The page starts by loading the calendar.xml file that I created with the get_calendar.php script. It then emits the opening HTML up to a <table> tag. Within that tag, I iterate over the <event> tags and write each one out as a table row. Then, I finish the Web page. Figure 3 shows the result.
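
Listing 8 echoes the values from the XML file directly into the page. If the data might contain ampersands, quotation marks, or angle brackets, escaping it first keeps the output well formed, which matters here because the reading script parses the page as XML. A minimal sketch follows. The h() shorthand and the sample values are assumptions for illustration, not part of the article's pages.

```php
<?php
// h() is a hypothetical shorthand, not part of the original page.
// It escapes a value for use in HTML text or a quoted attribute.
function h( $value )
{
  return htmlspecialchars( $value, ENT_QUOTES, 'UTF-8' );
}

// Invented values that would break the markup if echoed unescaped
$summary = 'Q&A with "special" <guests>';
$url = 'http://myevent.com/?day=1&track=2';

printf(
  '<a class="url" href="%s"><span class="summary">%s</span></a>' . "\n",
  h( $url ),
  h( $summary ) );
```

In the page itself, the same idea would mean writing echo( h( $summary ) ) wherever the listing writes echo( $summary ).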


Figure 3. The index.php page

To see whether this code actually encodes hCalendar items, I point the get_calendar.php script at it. Listing 9 shows the result.


Listing 9. A portion of the events XML

% php get_calendar.php http://localhost/micro/index.php
<?xml version="1.0"?>
<events>
  <event>
    <url>http://myevent.com</url>
    <start>20060501</start>
    <end>20060502</end>
    <summary>My Conference opening</summary>
    <location>Hollywood, CA</location>
    <description>The opening days of the conference</description>
  </event>
...
%

How great is that? I have one script that reads a page with calendar items and exports it as XML. Then, I have another page that turns that XML back into calendar items. The original script can then read that page and come out with the same data. It's a complete round trip.

Okay, maybe it's not that great. It's also not that pretty. What happens when I want to make the presentation a bit nicer? Do I have to dump the microformatting? Not at all. In Listing 10, I improved the format of the calendar item.


Listing 10. Index2.php

...
<?php
foreach( $events as $event )
{
  $desc = $xpath->query( 'description', $event )->item(0)->nodeValue;
  $start= $xpath->query( 'start', $event )->item(0)->nodeValue;
  $end = $xpath->query( 'end', $event )->item(0)->nodeValue;
  $location = $xpath->query( 'location', $event )->item(0)->nodeValue;
  $summary = $xpath->query( 'summary', $event )->item(0)->nodeValue;
  $url = $xpath->query( 'url', $event )->item(0)->nodeValue;
?>
<tr>
  <td class="event">
    <div class="vevent">
      <table width="100%" cellspacing="0" cellpadding="0">
        <tr>
          <td colspan="2">
            <a class="url" href="<?php echo( $url ); ?>">
            <span class="summary"><?php echo($summary ); ?></span></a>
          </td>
        </tr>
        <tr>
          <td>Start</td>
          <td><abbr class="dtstart" title="<?php echo($start ); ?>">
            <?php echo($start ); ?></abbr></td>
        </tr>
        <tr>
          <td>End</td>
          <td><abbr class="dtend" title="<?php echo($end ); ?>">
            <?php echo($end ); ?></abbr></td>
        </tr>
        <tr>
          <td>Location</td>
          <td><span class="location"><?php echo($location ); ?></span></td>
        </tr>
        <tr>
          <td colspan="2">
            <div class="description"><?php echo($desc ); ?></div>
          </td>
        </tr>
      </table>
    </div>
  </td>
</tr>
<?php
}
?>
...

It can be challenging to decipher what's going on from the tags. But it's easy to see the difference in the display, as in Figure 4.


Figure 4. The index2.php page

Now the start, end, and location columns all line up nicely. But does it still parse as hCalendar items? It does, because the XPath code in the get_calendar.php script is so flexible.
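
The flexibility comes from the .// step, which descends through any number of wrapper elements. As a small illustration, the sketch below runs the same location query against two invented fragments, one flat and one wrapped in a table, and finds the same value in both.

```php
<?php
// Two invented fragments carrying the same data in different layouts
$flat = '<div class="vevent"><span class="location">Hollywood, CA</span></div>';
$table = '<div class="vevent"><table><tr><td>Location</td>'
       . '<td><span class="location">Hollywood, CA</span></td></tr></table></div>';

$found = array();
foreach ( array( 'flat' => $flat, 'table' => $table ) as $name => $markup )
{
  $dom = new DOMDocument();
  $dom->loadXML( $markup );
  $xpath = new DOMXPath( $dom );
  $event = $xpath->query( "//div[@class='vevent']" )->item(0);

  // ".//" matches at any depth below the event <div>
  $location = $xpath->query( ".//*[contains(@class,'location')]", $event );
  $found[$name] = $location->item(0)->nodeValue;
  echo $name, ': ', $found[$name], "\n";
}
// flat: Hollywood, CA
// table: Hollywood, CA
```

The label cell that holds the word Location carries no class, so the query skips it and returns only the marked-up value.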

Listing 11 shows the test I ran to prove it.


Listing 11. Test on index2.php

% php get_calendar.php http://localhost/micro/index2.php
<?xml version="1.0"?>
<events>
  <event>
    <url>http://myevent.com</url>
    <start>20060501</start>
    <end>20060502</end>
...

I really like that symmetry between these two reading and writing scripts.


Conclusion

Microformats are a pragmatic approach to solving the issue of structured data on the Web. Are they as architecturally pure as XML-encoded data separated from its formatting through a mechanism such as XSLT style sheets? No. But I think this approach is a realistic middle step that will help build a more intelligent Web that is easier to use and provides better search and data integration.

