Every once in a long while, I read about an idea that is a stroke of brilliance, and I think to myself, "I wish I had thought of that, it's genius!" Microformats are just that kind of idea. You see, for a while now, people have tried to extract structured data from the unstructured Web. You hear glimmers of this effort when people talk about the "semantic Web," a Web in which data is separated from formatting. But for whatever reason, the semantic Web hasn't taken off, and the problem of finding structured data in an unstructured world remains.
Until now. Microformats are one small step forward toward exporting structured data on the Web. The idea is simple. Take a page that has some event information on it -- start time, end time, location, subject, Web page, and so on. Rather than put that information into the Hypertext Markup Language (HTML) of the page in any old way, add some standardized HTML tags and Cascading Style Sheet (CSS) class names. The page can still look any way you choose, but to a browser looking for one of these formatted -- or should I say, microformatted -- pieces of HTML, the difference is night and day.
Perhaps looking at a piece of microformatted HTML will help. Listing 1 shows an event formatted
to the hCalendar microformat standard.
Listing 1. A sample event page
<html>
<body>
<div class="vevent">
<a class="url" href="http://myevent.com">
<abbr class="dtstart" title="20060501">May 1</abbr> -
<abbr class="dtend" title="20060502">02, 2006</abbr>
<span class="summary">My Conference opening</span> - at
<span class="location">Hollywood, CA</span>
</a>
<div class="description">The opening days of the conference</div>
</div>
</body>
</html>
You can have multiple events on a single page, each wrapped in a <div> tag marked
with the vevent class. Within that <div> tag is the
data for the event, marked up with different tags that carry specific class names for each piece of event data. The
url class is for the anchor that links to the event's Web page. The dtstart
and dtend classes are on the tags that have the start and end times encoded in their
title attributes. The summary, location, and
description classes are all attached to the tags that wrap the corresponding content.
Now, the page that these items are on can use CSS to style them any way it chooses. So, the look of the event is still unique to the site. The naming convention on the classes is what counts.
That's all there is to microformats. The Web site for microformats is a wiki on which you can add your own formats for whatever data you want to represent. When I last looked, there were formats for events, contact information, reviews, and a host of others. The site even has handy format creators, as in Figure 1.
Figure 1. The hCalendar creator page
In this article, I use PHP to extract the event records from a given page into an XML format. Then, I create another
PHP page to take that XML and build an HTML page with the events encoded in the hCalendar
microformat. But before that, let me talk a little about the pros and cons of microformats.
To be honest, any experienced engineer would look at the hCalendar example above and say,
"Where's the format in this microformat?" Certainly, microformats are much less a stand-alone format than
a layer on top of an existing format -- HTML. And that has to be considered one of the cons. I mean, look at
the competition:
BEGIN:vCalendar
VERSION:5.0
PRODID:-//Microsoft Corporation//Works 2000//EN
BEGIN:vEvent
DTSTART: 20060501T000000
DTEND: 20060502T000000
SUMMARY;ENCODING=QUOTED-PRINTABLE: My Conference opening
PRIORITY: 3
END:vEvent
END:vCalendar
It's strong, it's clear, it's easily read, and clearly, it would be easy to parse. It looks like something out of the Fortran 77 or early UNIX® tradition. You could easily imagine an XML alternative format that would update this format a little. But to get this format of information out there, you must write dedicated clients. That's the big con of new standards -- even XML standards -- and the big plus of microformats.
With microformats, you leverage two highly successful technologies: Hypertext Transfer Protocol (HTTP) and HTML. Both of these standards are integrated into every desktop in the world, and engineers are extending what you can do with those technologies continuously.
Take Greasemonkey as an example. Greasemonkey is a fantastic extension to the Mozilla Firefox browser that allows you to apply JavaScript code to a page after it has been loaded. Some people use the extension to remove banner ads. Others use it to add extra information to sites they don't control. Still others write Greasemonkey scripts that find microformatted data. And what's better is that because the scripts are written in JavaScript code, you don't have to worry about platform concerns. No need for Win32 or Cocoa here: Just write your extension in portable JavaScript code. Greasemonkey works only on HTML pages, though, and that is exactly where microformatted data lives.
To be pessimistic for a moment and thus boost microformats a little further, I'll pass on an experience I had at a recent tech conference in Seattle, Washington. I was talking with one of the founders of Technorati, a blog search engine, and I was under the impression that after HTML, the second most successful tag-based data format would have been RSS. But in my conversation, the man revealed that most RSS is poorly formed or out of date and that Technorati doesn't depend on RSS. Instead, they go directly to the HTML and look for blog-style patterns in the code.
That's not to imply that RSS is bad: I love RSS. But I think it shows that getting formats other than HTML accepted in the real world is very difficult. So layering formats on top of HTML makes a lot of sense.
Now, off my soapbox and into the code. That starts with reading microformats from a page.
To read a page encoded with microformat data, you must have a Web page that has events on it. I start with an .html file, as in Listing 2.
Listing 2. Hcalendar.html
<html>
<head>
<style>
body { font-family: arial, verdana, sans-serif; }
</style>
</head>
<body>
<div style="width:600px;">
<div class="vevent" id="one">
<a class="url" href="http://myevent.com">
<abbr class="dtstart" title="20060501">May 1</abbr> -
<abbr class="dtend" title="20060502">02, 2006</abbr>
<span class="summary">My Conference opening</span> - at
<span class="location">Hollywood, CA</span>
</a>
<div class="description">The opening days of the conference</div>
</div>
<div class="vevent" id="two">
<a class="url" href="http://myevent.com">
<abbr class="dtstart" title="20060503">May 3</abbr> -
<abbr class="dtend" title="20060504">04, 2006</abbr>
<span class="summary">My Conference closing</span> - at
<span class="location">Hollywood, CA</span>
</a>
<div class="description">The closing days of the conference</div>
</div>
</div>
</body>
</html>
When I look at this in the Web browser, I see something similar to Figure 2.
Figure 2. The hcalendar.html page
It isn't pretty to look at, but that's just to keep the example as simple as possible. The key here is that you can
use whatever HTML styling you want to make the event look however you like. As long as you use the right CSS class
names, it will still be recognized as an hCalendar-microformatted item.
Now that I have the page, I need some PHP code to read it. Listing 3 shows this code, which is the base for the reading script.
Listing 3. The get_page function
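<?php
// Requires the PEAR HTTP_Client package.
require_once 'HTTP/Client.php';

// Fetch the page at $url and return the response body as a string.
function get_page( $url )
{
    $client = new HTTP_Client();
    $client->get( $url );
    $resp = $client->currentResponse();
    return $resp['body'];
}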
This code uses the HTTP Client PEAR module to read the content from a given URL. If you haven't installed that module, you can use the PEAR command line to install it:
% pear install HTTP_Client
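If PEAR isn't available, PHP's built-in stream functions can stand in for HTTP_Client. The following is a minimal sketch, not the approach used in the rest of this article. The name get_page_builtin() and the 10-second timeout are assumptions chosen for illustration, and fetching a remote URL this way assumes that the allow_url_fopen setting is enabled.

```php
<?php
// Hypothetical alternative to get_page() that avoids the PEAR dependency.
function get_page_builtin( $url, $timeout = 10 )
{
    // The 'http' options apply only when $url is an http:// or https:// address.
    $context = stream_context_create( array(
        'http' => array( 'timeout' => $timeout ),
    ) );

    // file_get_contents() returns false on failure; @ silences the warning
    // so that the failure can be reported as an exception instead.
    $body = @file_get_contents( $url, false, $context );
    if ( $body === false )
    {
        throw new RuntimeException( "Could not read $url" );
    }
    return $body;
}
```

Because the function returns the body as a string, it could be swapped in wherever get_page() is called.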
Next, I turn the HTML returned from the site into an XML Document Object Model (DOM). Thankfully, the HTML returned
from the site is well-formed Extensible HTML (XHTML), so it's simple to point an XML reader at it. I do this through the
get_events() function in Listing 4.
Listing 4. The get_events function
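// Fetch $page, parse it as XML, and return one array per vevent found.
function get_events( $page )
{
    $body = get_page( $page );

    // loadXML() needs well-formed markup, which is why the page must be XHTML.
    $dom = new DomDocument();
    $dom->loadXML( $body );

    // Find every <div> whose class attribute is exactly "vevent".
    $xpath = new DOMXPath( $dom );
    $events = $xpath->query("//div[@class='vevent']");

    $parsed_events = array();
    foreach( $events as $event )
    {
        $e = parse_event( $dom, $event );
        $parsed_events []= $e;
    }
    return $parsed_events;
}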
This function starts by calling the get_page() function to retrieve the content of the
page. Then, it creates a DomDocument object and loads the page into it. With the DOM version in
hand, I use an XPath query to get every <div> tag on the page whose class is
vevent, and I pass those nodes on to parse_event().
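Not every page is well-formed XHTML, and loadXML() fails on markup such as an unclosed <br> tag. As a hedged sketch of one way to cope, the variation below tries loadXML() first and falls back to DOMDocument's more forgiving loadHTML() parser. The function name load_page_dom() is hypothetical, and it takes the page body as a string so that it can be tried without a Web server.

```php
<?php
// Hypothetical helper: build a DOM from page markup, tolerating sloppy HTML.
function load_page_dom( $body )
{
    // Collect parser errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors( true );

    $dom = new DOMDocument();
    if ( !$dom->loadXML( $body ) )
    {
        // Not well-formed XML, so retry with the lenient HTML parser.
        $dom = new DOMDocument();
        $dom->loadHTML( $body );
    }

    libxml_clear_errors();
    libxml_use_internal_errors( $previous );
    return $dom;
}

// An unclosed <br> makes this fragment invalid as XML but acceptable as HTML.
$sloppy = '<html><body><div class="vevent">Opening<br></div></body></html>';
$dom = load_page_dom( $sloppy );
$xpath = new DOMXPath( $dom );
echo $xpath->query( "//div[@class='vevent']" )->length, "\n";
```

The XPath query itself is unchanged; only the loading step differs.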
If you're unfamiliar with XPath, let me break it down a bit for you. The expression:
//div
would match any <div> tag at any level. Adding this restriction:
//div[@class='vevent']
will match only those <div> tags that have an attribute named class
whose value is exactly vevent. If you use XML or an XML-based language,
such as XHTML or RSS, you should become familiar with XPath. It is by far the easiest way to navigate an XML tree
and find the information you're looking for.
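One caveat is that @class='vevent' is an exact string comparison, so it misses an element written as class="vevent featured". A common XPath 1.0 idiom pads the attribute with spaces and looks for the padded token. The sketch below compares three ways of matching. The sample markup and the class_token() helper name are assumptions made for illustration.

```php
<?php
// Hypothetical helper: XPath predicate that matches one whole class token.
function class_token( $name )
{
    return "contains(concat(' ', normalize-space(@class), ' '), ' $name ')";
}

$xml = '<body>'
     . '<div class="vevent">one</div>'
     . '<div class="vevent featured">two</div>'
     . '<div class="vevents">not an event</div>'
     . '</body>';

$dom = new DOMDocument();
$dom->loadXML( $xml );
$xpath = new DOMXPath( $dom );

// Exact comparison, substring test, and whole-token test, in that order.
$exact = $xpath->query( "//div[@class='vevent']" )->length;
$substring = $xpath->query( "//div[contains(@class,'vevent')]" )->length;
$token = $xpath->query( '//div[' . class_token( 'vevent' ) . ']' )->length;

echo "exact=$exact substring=$substring token=$token\n";
```

On this sample, the exact comparison finds one element, the substring test finds all three (including the unrelated vevents class), and the token test finds the two real events.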
To break out the data from each event <div> tag, the parse_event()
function uses more XPath queries to extract the data. You see this in Listing 5.
Listing 5. The parse_event() function
// Extract the hCalendar fields from a single vevent node.
function parse_event( $dom, $event )
{
    $data = array();
    $xpath = new DOMXPath( $dom );

    // Passing $event as the second argument makes each query relative to it.
    $url = $xpath->query( ".//*[contains(@class,'url')]/@href", $event );
    $data['url'] = $url->length > 0 ? $url->item(0)->nodeValue : '';

    // The dates are stored in the title attributes of the matching tags.
    $dtstart = $xpath->query( ".//*[contains(@class,'dtstart')]/@title", $event );
    $data['dtstart'] = $dtstart->length > 0 ? $dtstart->item(0)->nodeValue : '';
    $dtend = $xpath->query( ".//*[contains(@class,'dtend')]/@title", $event );
    $data['dtend'] = $dtend->length > 0 ? $dtend->item(0)->nodeValue : '';

    // The remaining fields are the text content of the matching tags.
    $summary = $xpath->query( ".//*[contains(@class,'summary')]", $event );
    $data['summary'] = $summary->length > 0 ? $summary->item(0)->nodeValue : '';
    $location = $xpath->query( ".//*[contains(@class,'location')]", $event );
    $data['location'] = $location->length > 0 ? $location->item(0)->nodeValue : '';
    $desc = $xpath->query( ".//*[contains(@class,'description')]", $event );
    $data['desc'] = $desc->length > 0 ? $desc->item(0)->nodeValue : '';

    return $data;
}
The code looks a bit complicated, but it's really a set of XPath queries that look for the specific tags with the specific class names somewhere in the XML DOM tree. The XPath here is more involved than before, though. First, there's the notation:
.//*
which means, "any tag from this point down," where this point is the vevent <div>
tag the code is currently looking at. Notice that the $xpath->query statement now
specifies an additional argument -- $event -- which is the root of the search. Typically,
XPath queries start at the root of the document, but you can specify another root. I did so using the
$event item.
Now, I don't want just any tag. I want a tag that has an attribute named class that contains
a particular value, such as url. So, I add this syntax:
.//*[contains(@class,'url')]
so that it matches any tag in which url is part of the class name. But what I really
want is the href attribute from that tag, so I even further refine the path this way:
.//*[contains(@class,'url')]/@href
This refinement gets the href attribute of the matching tag.
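The same pattern repeats for every field, so it can be folded into a small helper. This is a sketch rather than the article's code: first_value() is a hypothetical name, the trim() call is an addition, and the sample markup is a trimmed copy of one event from Listing 2.

```php
<?php
// Hypothetical helper: run a relative XPath query under $context and
// return the first match's text, or '' when nothing matches.
function first_value( DOMXPath $xpath, $query, DOMNode $context )
{
    $nodes = $xpath->query( $query, $context );
    return $nodes->length > 0 ? trim( $nodes->item(0)->nodeValue ) : '';
}

$xml = '<body><div class="vevent">'
     . '<a class="url" href="http://myevent.com">'
     . '<abbr class="dtstart" title="20060501">May 1</abbr>'
     . '<span class="summary">My Conference opening</span>'
     . '</a></div></body>';

$dom = new DOMDocument();
$dom->loadXML( $xml );
$xpath = new DOMXPath( $dom );
$event = $xpath->query( "//div[@class='vevent']" )->item(0);

// The leading "." keeps each search inside the current event.
echo first_value( $xpath, ".//*[contains(@class,'url')]/@href", $event ), "\n";
echo first_value( $xpath, ".//*[contains(@class,'dtstart')]/@title", $event ), "\n";
// This sample has no location, so the helper returns an empty string.
echo first_value( $xpath, ".//*[contains(@class,'location')]", $event ), "\n";
```

With a helper like this, each field in parse_event() would reduce to a single call.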
After the events have been chewed up and returned as an array from the get_events()
function, I need another function that exports that array of events as XML. To do so, I use the
dump_events() function, as in Listing 6.
Listing 6. The dump_events() function
// Build an <events> document from the parsed events and print it as XML.
function dump_events( $events )
{
    $dom = new DomDocument();
    $dom->formatOutput = true;

    $root = $dom->createElement( 'events' );
    $dom->appendChild( $root );

    foreach( $events as $event )
    {
        $elEvent = $dom->createElement( 'event' );
        $root->appendChild( $elEvent );

        // Each field becomes a child element that wraps a text node.
        $elUrl = $dom->createElement( 'url' );
        $elUrl->appendChild( $dom->createTextNode( $event['url'] ) );
        $elEvent->appendChild( $elUrl );

        $elStart = $dom->createElement( 'start' );
        $elStart->appendChild( $dom->createTextNode( $event['dtstart'] ) );
        $elEvent->appendChild( $elStart );

        $elEnd = $dom->createElement( 'end' );
        $elEnd->appendChild( $dom->createTextNode( $event['dtend'] ) );
        $elEvent->appendChild( $elEnd );

        $elSummary = $dom->createElement( 'summary' );
        $elSummary->appendChild( $dom->createTextNode( $event['summary'] ) );
        $elEvent->appendChild( $elSummary );

        $elLocation = $dom->createElement( 'location' );
        $elLocation->appendChild( $dom->createTextNode( $event['location'] ) );
        $elEvent->appendChild( $elLocation );

        $elDesc = $dom->createElement( 'description' );
        $elDesc->appendChild( $dom->createTextNode( $event['desc'] ) );
        $elEvent->appendChild( $elDesc );
    }

    print( $dom->saveXML() );
}
This function is rather the inverse of the other functions. Instead of querying around some XML, this code creates a
DOM by using createElement, createTextNode, and appendChild to build a tree.
Then, I use the saveXML method to export the data to the standard output.
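Because every field is built the same way, the tree can also be assembled in a loop. The following is a sketch under assumptions, not a replacement for Listing 6: events_to_xml() is a hypothetical name, it returns the XML as a string instead of printing it, and it substitutes an empty string for any missing field.

```php
<?php
// Hypothetical variant of dump_events() that returns the XML as a string.
function events_to_xml( array $events )
{
    // Maps each output element name to its key in the parsed event array.
    $fields = array(
        'url' => 'url', 'start' => 'dtstart', 'end' => 'dtend',
        'summary' => 'summary', 'location' => 'location',
        'description' => 'desc',
    );

    $dom = new DOMDocument( '1.0' );
    $dom->formatOutput = true;
    // appendChild() returns the node that was appended.
    $root = $dom->appendChild( $dom->createElement( 'events' ) );

    foreach ( $events as $event )
    {
        $elEvent = $root->appendChild( $dom->createElement( 'event' ) );
        foreach ( $fields as $tag => $key )
        {
            $el = $elEvent->appendChild( $dom->createElement( $tag ) );
            // Text nodes are escaped on output, so "&" becomes "&amp;".
            $value = isset( $event[$key] ) ? $event[$key] : '';
            $el->appendChild( $dom->createTextNode( $value ) );
        }
    }
    return $dom->saveXML();
}

echo events_to_xml( array( array(
    'url' => 'http://myevent.com', 'dtstart' => '20060501',
    'dtend' => '20060502', 'summary' => 'Q&A session',
    'location' => 'Hollywood, CA', 'desc' => 'The opening days',
) ) );
```

Returning a string rather than printing makes the function easier to test and leaves the caller free to print the result or write it to a file.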
When I run this PHP script on the command line with the URL of the hcalendar.html page, I get the output in Listing 7.
Listing 7. Output from the PHP script
% php get_calendar.php http://localhost/micro/hcalendar.html
<?xml version="1.0"?>
<events>
<event>
<url>http://myevent.com</url>
<start>20060501</start>
<end>20060502</end>
<summary>My Conference opening</summary>
<location>Hollywood, CA</location>
<description>The opening days of the conference</description>
</event>
<event>
<url>http://myevent.com</url>
<start>20060503</start>
<end>20060504</end>
<summary>My Conference closing</summary>
<location>Hollywood, CA</location>
<description>The closing days of the conference</description>
</event>
</events>
%
Now I have a script that I can point at any Web page and extract any hCalendar-formatted
items as XML.
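The listings define the functions but not the few lines that tie them to the command line. As a guess at what that glue might look like, here is a minimal sketch. The name run_cli() is hypothetical, and the reader and writer are passed in as callables so that the sketch runs on its own; in get_calendar.php they would presumably be get_events and dump_events.

```php
<?php
// Hypothetical command-line glue. $reader turns a URL into an array of
// events, and $writer outputs that array.
function run_cli( array $args, callable $reader, callable $writer )
{
    // $args[0] is the script name, so the URL is expected in $args[1].
    if ( count( $args ) < 2 )
    {
        file_put_contents( 'php://stderr', "Usage: php get_calendar.php <url>\n" );
        return 1;
    }
    $writer( $reader( $args[1] ) );
    return 0;
}

// Stand-in callables so that the sketch runs without the other listings.
$fake_reader = function ( $url ) {
    return array( array( 'url' => $url ) );
};
$fake_writer = function ( array $events ) {
    echo count( $events ), " event(s) read\n";
};

run_cli(
    array( 'get_calendar.php', 'http://localhost/micro/hcalendar.html' ),
    $fake_reader,
    $fake_writer
);
```

In the real script, the final line might read exit( run_cli( $argv, 'get_events', 'dump_events' ) ); and redirecting the script's output to a file is one way to produce a calendar.xml file.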
Creating hCalendar items from XML
Now that I have the XML that I extracted from a Web page, I can create a PHP page that formats that XML as
hCalendar items within the HTML. Listing 8 shows this page.
Listing 8. Index.php
<?php
$dom = new DomDocument();
$dom->load( "calendar.xml" );
$xpath = new DomXPath($dom);
$events = $xpath->query( '//event' );
?>
<html>
<head>
<title>My Calendar</title>
<style>
body { font-family: arial, verdana, sans-serif; }
td { border-bottom: 1px solid black; border-top: 1px solid black; }
abbr { border-bottom: none; }
</style>
</head>
<body>
<table>
<?php
foreach( $events as $event )
{
$desc = $xpath->query( 'description', $event )->item(0)->nodeValue;
$start= $xpath->query( 'start', $event )->item(0)->nodeValue;
$end = $xpath->query( 'end', $event )->item(0)->nodeValue;
$location = $xpath->query( 'location', $event )->item(0)->nodeValue;
$summary = $xpath->query( 'summary', $event )->item(0)->nodeValue;
$url = $xpath->query( 'url', $event )->item(0)->nodeValue;
?>
<tr>
<td>
<div class="vevent">
<a class="url" href="<?php echo( $url ); ?>">
<span class="summary"><?php echo($summary ); ?></span></a><br/>
Start: <abbr class="dtstart" title="<?php echo($start ); ?>">
<?php echo($start ); ?></abbr><br/>
End: <abbr class="dtend" title="<?php echo($end ); ?>">
<?php echo($end ); ?></abbr><br/>
Location: <span class="location"><?php echo($location ); ?></span><br/>
<div class="description"><?php echo($desc ); ?></div>
</div>
</td>
</tr>
<?php
}
?>
</table>
</body>
</html>
This code might look complicated, but it's actually quite simple. The page starts by loading the calendar.xml file
that I created with the get_calendar.php script. It then emits static HTML up to the <table>
tag. Within that tag, I iterate over the <event> tags and export them as
rows within the HTML. Then, I finish the Web page. Figure 3 shows the result.
Figure 3. The index.php page
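Listing 8 prints the raw value, such as 20060501, as the visible text and echoes values without escaping them. Listing 1 instead pairs a machine-readable title with a friendlier label. A minimal sketch of that approach follows. The render_date_abbr() helper is hypothetical, it assumes that dates arrive in the eight-digit form used in this article, and the 'M j, Y' display format is just one choice.

```php
<?php
// Hypothetical helper: render an hCalendar date as an <abbr> element.
// $class is "dtstart" or "dtend"; $value is a date such as "20060501".
function render_date_abbr( $class, $value )
{
    // The "!" resets unparsed fields (the time of day) to zero.
    $date = DateTime::createFromFormat( '!Ymd', $value );
    // Fall back to the raw value when it is not in the expected form.
    $label = $date ? $date->format( 'M j, Y' ) : $value;

    // Escape everything so that stray "&", "<", or quotes cannot break the page.
    return sprintf(
        '<abbr class="%s" title="%s">%s</abbr>',
        htmlspecialchars( $class, ENT_QUOTES ),
        htmlspecialchars( $value, ENT_QUOTES ),
        htmlspecialchars( $label, ENT_QUOTES )
    );
}

echo render_date_abbr( 'dtstart', '20060501' ), "\n";
```

The title attribute still carries the original value, so a reader such as get_calendar.php would extract the same data as before.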
To see whether this code actually encodes hCalendar items, I point the get_calendar.php script
at it. Listing 9 shows the result.
Listing 9. A portion of the events XML
% php get_calendar.php http://localhost/micro/index.php
<?xml version="1.0"?>
<events>
<event>
<url>http://myevent.com</url>
<start>20060501</start>
<end>20060502</end>
<summary>My Conference opening</summary>
<location>Hollywood, CA</location>
<description>The opening days of the conference</description>
</event>
...
%
How great is that? I have one script that reads a page with calendar items and exports it as XML. Then, I have another page that turns that XML back into calendar items. The original script can then read that page and come out with the same data. It's a complete round trip.
Okay, maybe it's not that great. It's also not that pretty. What happens when I want to make the presentation a bit nicer? Do I have to dump the microformatting? Not at all. In Listing 10, I improved the format of the calendar item.
Listing 10. Index2.php
...
<?php
foreach( $events as $event )
{
$desc = $xpath->query( 'description', $event )->item(0)->nodeValue;
$start= $xpath->query( 'start', $event )->item(0)->nodeValue;
$end = $xpath->query( 'end', $event )->item(0)->nodeValue;
$location = $xpath->query( 'location', $event )->item(0)->nodeValue;
$summary = $xpath->query( 'summary', $event )->item(0)->nodeValue;
$url = $xpath->query( 'url', $event )->item(0)->nodeValue;
?>
<tr>
<td class="event">
<div class="vevent">
<table width="100%" cellspacing="0" cellpadding="0">
<tr>
<td colspan="2">
<a class="url" href="<?php echo( $url ); ?>">
<span class="summary"><?php echo($summary ); ?></span></a>
</td>
</tr>
<tr>
<td>Start</td>
<td><abbr class="dtstart" title="<?php echo($start ); ?>">
<?php echo($start ); ?></abbr></td>
</tr>
<tr>
<td>End</td>
<td><abbr class="dtend" title="<?php echo($end ); ?>">
<?php echo($end ); ?></abbr></td>
</tr>
<tr>
<td>Location</td>
<td><span class="location"><?php echo($location ); ?></span></td>
</tr>
<tr>
<td colspan="2">
<div class="description"><?php echo($desc ); ?></div>
</td>
</tr>
</table>
</div>
</td>
</tr>
<?php
}
?>
...
It can be challenging to decipher what's going on from the tags. But it's easy to see the difference in the display, as in Figure 4.
Figure 4. The index2.php page
Now the start, end, and location columns all line up nicely. But does it still parse as
hCalendar items? It does, because the XPath queries in the get_calendar.php script look for
the class names at any depth beneath the vevent <div> tag, regardless of the tags that surround them.
Listing 11 shows the test I ran to prove it.
Listing 11. Test on index2.php
% php get_calendar.php http://localhost/micro/index2.php
<?xml version="1.0"?>
<events>
<event>
<url>http://myevent.com</url>
<start>20060501</start>
<end>20060502</end>
...
I really like the symmetry between the reading and writing scripts.
Microformats are a pragmatic approach to solving the issue of structured data on the Web. Is it as architecturally pure as XML-encoded data separated from its formatting through a mechanism such as XSLT style sheets? No. But I think this approach is a realistic middle step that will help build a more intelligent Web that is easier to use and provides better search and data integration.
Jack D. Herrington is a senior software engineer with more than 20 years of experience. He's the author of three books: Code Generation in Action, Podcasting Hacks, and PHP Hacks. He has also written more than 30 articles. You can reach Jack at jherr@pobox.com.




