Managing XML data: Tag URIs

Choosing unique identifiers

Uniform Resource Identifiers (URIs) can identify things without necessarily locating them. XML namespace URIs are the most obvious such use, but many others abound. When you use URIs primarily as identifiers, it's important to create URIs that are globally unique without implying that they reside on a particular server. Tag is a simple algorithm for creating unique, easy-to-remember URIs while avoiding conflicts. This has important implications for RDF, Atom, and other systems that use URIs as identifiers.

Elliotte Rusty Harold (elharo@metalab.unc.edu), Adjunct Professor, Polytechnic University

Elliotte Rusty Harold is originally from New Orleans, to which he returns periodically in search of a decent bowl of gumbo. However, he resides in the Prospect Heights neighborhood of Brooklyn with his wife Beth and cats Charm (named after the quark) and Marjorie (named after his mother-in-law). He's an adjunct professor of computer science at Polytechnic University, where he teaches Java technology and object-oriented programming. His Cafe au Lait Web site has become one of the most popular independent Java sites on the Internet, and his spin-off site, Cafe con Leche, has become one of the most popular XML sites. His books include Effective XML, Processing XML with Java, Java Network Programming, and The XML 1.1 Bible. He's currently working on the XOM API for processing XML and the Jaxen XPath engine.



24 January 2006


Unique identifiers

The Atom XML format for blogs and other syndicated feeds requires each entry to have a unique ID enclosed in an id element. Atom requires that the ID be a syntactically correct URI. It also requires that this ID not only be unique within the document where it appears, but also be globally unique across all documents, on all servers, for all time. That's a pretty strong requirement. However, it's necessary, because Atom feeds are frequently chopped into pieces, republished by other sites like Bloglines, and accumulated with content from other sites in aggregators and news clients like Vienna. For example, in this real-world entry, the ID is the rather unwieldy value http://www.cafeaulait.org/?tag=http___eclipse.org_aspectj#news2005December21:

Listing 1. Sample Atom entry
<entry>
   <title>The Eclipse Project has released AspectJ 5.0.</title>
   <content type="xhtml">
     <div xmlns="http://www.w3.org/1999/xhtml">
       <p>
         The Eclipse Project has released
         <a href="http://eclipse.org/aspectj">AspectJ</a> 5.0.
         AspectJ is a derivative of Java that allows
         programmers to write code that applies across multiple classes.
         The AspectJ compiler requires Java 1.3, but can generate code for Java
         1.1 and later. "This release constitutes a full-upgrade of AspectJ to
         support Java 5, while also delivering a large number of quality
         improvements that will benefit users running on JDK 1.4 or below. In
         addition to the Java 5 related language changes, AspectJ 5 also
         supports an @AspectJ style of aspect declaration, greatly enhanced
         load-time weaving capabilities, a full reflection API, and tools APIs
         for parts of the weaver."
       </p>
     </div>
   </content>
   <link href="#news2005December21"/>
   <id>http://www.cafeaulait.org/?tag=
      http___eclipse.org_aspectj#news2005December21</id>
   <updated>2005-12-21T09:00:01-05:00</updated>
 </entry>

(Actually, the real ID is even longer and uglier, but it needed to fit on this page.)

This ID starts with the URL of the page from which the Atom feed is extracted, http://www.cafeaulait.org/. Next, a query string is appended with one parameter, tag. The value of this parameter is formed by first concatenating all the URLs referenced in the entry and then replacing all URL reserved characters (such as the colon) with the underscore. This distinguishes different entries from the same page. Finally, the date is appended as a fragment identifier in case earlier or later entries contain the same set of URLs. It's ugly, long, and hard to read; but it's reliably unique.
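The construction just described can be sketched in a few lines of Python. This is only an illustration, not the code that actually generates the feed: the function name legacy_entry_id is hypothetical, and the set of replaced characters (just the colon and slash seen in the sample) is an assumption.

```python
import re

def legacy_entry_id(page_url, referenced_urls, anchor):
    """Build an HTTP-URL-style entry ID along the lines described above.

    Assumption: only ':' and '/' are replaced with underscores, which is
    enough to reproduce the sample ID; the real feed may replace more.
    """
    joined = "".join(referenced_urls)          # concatenate every URL in the entry
    flattened = re.sub(r"[:/]", "_", joined)   # neutralize the reserved characters
    return f"{page_url}?tag={flattened}#{anchor}"

entry_id = legacy_entry_id(
    "http://www.cafeaulait.org/",
    ["http://eclipse.org/aspectj"],
    "news2005December21",
)
print(entry_id)
```

With the inputs shown, the result matches the ID in Listing 1 (ignoring the line break that was added there to fit the page).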

Atom isn't the only protocol that requires globally unique URI identifiers. RDF, OWL, and the semantic Web all assume that URIs can be assigned to any item -- not just Web pages, but people, pets, planets, paramecia, DNA sequences, medical diagnoses, dates, and anything else you can imagine. For the semantic Web to work, it's important that the same URI not be used to identify Pluto the dog and Pluto the planet.

You'll find URIs used as identifiers in many other contexts, including SAX feature and property names, XML namespaces, RDDL natures and purposes, algorithms in XML digital signatures, SVG feature strings, and more.

All URLs are URIs, and HTTP URLs like the one in the earlier example serve perfectly well as identifiers. The problem with using them as identifiers is that users expect to be able to follow HTTP URLs. Even if a URL only appears in an Atom feed document not meant for human consumption, people will type it into a Web browser. In the immortal words of XML luminary Claude L. Bullard,

All the handwaving about URN/URI/URL doesn't avoid the simple fact that if one puts http:// anywhere in browser display space, the system colors it blue and puts up a finger. The monkey expects a resource and when it doesn't get one, it shocks the monkey. Monkeys don't read specs to find out why they should be shocked. They turn red and put up a finger.

In addition to annoying users, placing unresolvable HTTP URLs in a document tends to fill up server error logs with 404s for the URLs you thought no one would try to find. Even if users don't find them, a broken robot somewhere will. Consequently, it would be nice to have a URI scheme purely for identification purposes that doesn't look like an HTTP URL. This is where tag URIs come in.


Tag syntax

It's probably impossible to define a globally unique identifier syntax without some form of centralized registration system to avoid conflicts. However, you don't need more than one such registry -- and fortunately, a registry already exists that almost everybody with a computer participates in: the domain name system. Like XML namespaces and Java package names, tag URIs piggyback off DNS to guarantee uniqueness. Each tag URI includes either a domain name or an e-mail address.

Domain names are sold, expired, snatched, and stolen on a regular basis, so a date is also included in the URI. Presumably, not more than one person or organization owns any given domain name or e-mail address on any given day.

Finally, an arbitrary string is added to the URI so that one person on one day can create any number of tag URIs. Here are a few example tag URIs:

  • tag:elharo@ibiblio.org,2006:javafaq/slides
  • tag:elharo@ibiblio.org,2005-12:Elliotte
  • tag:elharo.com,2006-01-25:ElliotteHarold:presentations:Javapolis2005-12-14
  • tag:elharo.com,2005:double

In URI terminology, these are all opaque URIs -- that is, they don't follow the hierarchical system used in HTTP, HTTPS, file, and FTP URLs. However, they do have their own distinct structure. In particular, each such URI is composed of three parts separated by colons:

scheme:tagging entity:specific identifier
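As a rough sketch of that three-part structure, the following Python splits a tag URI on its first two colons only. The function name split_tag_uri is hypothetical, and the code does no validation beyond counting the parts. Splitting at most twice matters because one of the examples above has further colons inside its specific identifier.

```python
def split_tag_uri(uri):
    """Split a tag URI into (scheme, tagging_entity, specific).

    Only the first two colons act as separators; any later colons are
    left inside the specific identifier.
    """
    parts = uri.split(":", 2)
    if len(parts) != 3:
        raise ValueError(f"not a three-part tag URI: {uri!r}")
    scheme, tagging_entity, specific = parts
    return scheme, tagging_entity, specific

print(split_tag_uri("tag:elharo@ibiblio.org,2006:javafaq/slides"))
print(split_tag_uri(
    "tag:elharo.com,2006-01-25:ElliotteHarold:presentations:Javapolis2005-12-14"
))
```

A URI with nothing after the second colon comes back with an empty string as its specific identifier.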

The scheme is always tag. That's simple. Although URI schemes are case insensitive, the tag RFC (see Resources) recommends the lower-case form.

The specific identifier is optional. If included, it can hold any content within the limits of URI syntax. In brief, this means it's allowed to contain ASCII alphanumeric characters and a limited set of punctuation marks, including the colon (:) and slash (/) that appear in the examples above, but no whitespace and no non-ASCII characters. No particular meaning is assigned to the specific identifier. You can store anything here that seems useful, although you should endeavor to make it meaningful and obvious to humans. That is, strings like "sr_8_xs_ap_i2_xgl14" (taken from a real URL at one of the largest e-commerce sites) are discouraged. The tag RFC encourages strings made up of real words.

The tagging entity is where the meat is. This is the part that guarantees uniqueness. The tagging entity is based on domain names. However, because domain names change hands, there's also a date component. For example, the entity macfaq.com,2005 refers to the person or organization who owned the domain name macfaq.com in 2005. If that domain changes hands in 2006, then macfaq.com,2006 refers to the new owner, but the previous owner can still use macfaq.com,2005. If the domain name changes owners in the middle of a year, months and even days can be added, separated from the year by a hyphen. (This is the customary date format defined in ISO 8601 and endorsed by the W3C.) For example, macfaq.com,2005-12-21 refers to the entity that owned macfaq.com on December 21, 2005.
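To make the two halves of the tagging entity concrete, here is a minimal Python sketch that separates the authority (the domain name) from the date at the comma. The name parse_tagging_entity is hypothetical, and the function deliberately checks nothing except that both halves are present.

```python
def parse_tagging_entity(entity):
    """Split 'authority,date' into its two halves."""
    authority, comma, date_part = entity.partition(",")
    if not comma or not authority or not date_part:
        raise ValueError(f"expected 'authority,date': {entity!r}")
    return authority, date_part

# The same domain with different dates names different tagging entities
print(parse_tagging_entity("macfaq.com,2005"))
print(parse_tagging_entity("macfaq.com,2006"))
print(parse_tagging_entity("macfaq.com,2005-12-21"))
```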

Comparing tags for equality

In one respect, tags differ from the usual interpretation of URLs: Tags are considered to be equal only if they're character-for-character identical. Case folding isn't performed, even on the static scheme part tag. Percent encoding may be used, but it isn't resolved. For instance, tag:elharo@ibiblio.org,2006:javafaq/slides, TAG:elharo@ibiblio.org,2006:javafaq/slides, and tag:elharo%40ibiblio.org,2006:javafaq/slides are considered three different URIs.
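In code, this rule amounts to plain string comparison with no normalization step. The short Python sketch below uses the three URIs from the paragraph above; the helper name same_tag is hypothetical. It also shows the kind of percent-decoding that a general-purpose URL library offers and that should not be applied before comparing tags.

```python
from urllib.parse import unquote

def same_tag(a, b):
    """Tags are equal only when every character matches; nothing is normalized."""
    return a == b

tags = [
    "tag:elharo@ibiblio.org,2006:javafaq/slides",
    "TAG:elharo@ibiblio.org,2006:javafaq/slides",
    "tag:elharo%40ibiblio.org,2006:javafaq/slides",
]

# All three stay distinct when compared as raw strings
print(len(set(tags)))

# Decoding %40 would wrongly merge the third tag with the first
print(same_tag(tags[0], tags[2]))
print(same_tag(tags[0], unquote(tags[2])))
```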

All years must use four digits, and all days and months must use two. For example, to create a tag URI on New Year's Day 2006, you write macfaq.com,2006-01-01 rather than macfaq.com,06-1-1. The date doesn't have to be the date the URI was first created, but it often is. You can also pick a date in the past, as long as you owned the domain then. However, you shouldn't create tag URIs that include a date in the future, because the ownership of the domain name or e-mail address may change unexpectedly.
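These date rules are easy to check mechanically. The following Python is a sketch under stated assumptions: the name check_tag_date is hypothetical, and a date with the month or day omitted is treated as the first day of that period when testing whether it lies in the future. It cannot verify that you actually owned the domain on that date.

```python
import re
from datetime import date

# Four-digit year, optionally followed by a two-digit month and two-digit day
_TAG_DATE = re.compile(r"([0-9]{4})(?:-([0-9]{2})(?:-([0-9]{2}))?)?")

def check_tag_date(text, today=None):
    """Return True if text is a well-formed tag date that is not in the future."""
    match = _TAG_DATE.fullmatch(text)
    if match is None:
        return False
    year, month, day = match.groups()
    try:
        # Assumption: an omitted month or day means the first of the period
        start = date(int(year), int(month or 1), int(day or 1))
    except ValueError:
        return False  # impossible calendar date, such as month 13
    if today is None:
        today = date.today()
    return start <= today

reference_day = date(2006, 1, 24)  # fixed so the results do not change over time
for candidate in ("2006-01-01", "06-1-1", "2005-12", "2007"):
    print(candidate, check_tag_date(candidate, today=reference_day))
```

Passing today explicitly keeps the check deterministic; leaving it out compares against the current date.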

Although you can add a time component to the tagging entity, doing so is discouraged because differing time zones can cause overlap and conflicts. If a domain name does change hands, it's best to assign only tag URIs dated at least 48 hours before or after the switch, to remove all doubt about the ownership.

Not everyone owns a personal domain name, but most people have an e-mail address. If you don't own a domain name, or if your organization is so large that sorting out the usage of URIs between branches and divisions is tricky, use a full e-mail address instead: for example, tag:elharo@ibiblio.org,2006:javafaq/slides. In this case, the tagging entity is now the owner of the elharo@ibiblio.org e-mail address in 2006, rather than the owner of the ibiblio.org domain name in 2006.
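Putting the pieces together, minting a tag URI is little more than string assembly. The Python below is a minimal sketch: mint_tag_uri is a hypothetical helper, and apart from rejecting an authority that contains a colon or comma (which would make the result ambiguous) it assumes the caller has already chosen a valid date and specific identifier.

```python
def mint_tag_uri(authority, date_part, specific=""):
    """Assemble a tag URI from an authority (domain name or e-mail address),
    a date, and an optional specific identifier."""
    if ":" in authority or "," in authority:
        raise ValueError(f"authority must not contain ':' or ',': {authority!r}")
    # The scheme is written in lower case, as the tag RFC recommends
    return f"tag:{authority},{date_part}:{specific}"

# Same specific identifier, two different tagging entities
print(mint_tag_uri("elharo@ibiblio.org", "2006", "javafaq/slides"))
print(mint_tag_uri("ibiblio.org", "2006", "javafaq/slides"))
```

The two results are different identifiers: the first belongs to whoever held the e-mail address in 2006, the second to whoever held the whole domain.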


Summary

Tag URIs finally let URIs do what they were meant to do: identify without implying any sort of location or behavior that they don't have. They're easy to create, they're human legible, they work with existing systems, they're an open standard, and they don't have any backward compatibility issues. What's not to like?

The only thing that might suggest using an HTTP URL instead of a tag URI is if you want to put a page at the other end of the URL, either now or in the future. HTTP URLs let you do this. Tag URIs don't. However, the vast majority of HTTP URLs intended for use as identifiers (as opposed to locators) produce 404 Not Found errors when plugged into a browser. If you know you're not going to put a page at the end of the URL, choose tag URIs as identifiers rather than HTTP URLs.

Resources
