The Atom XML format for blogs and other syndicated feeds requires each
entry to have a unique ID enclosed in an
id element. Atom
requires that the ID be a syntactically correct URI. It also requires that
this ID not only be unique within the document where it appears, but also
be globally unique across all documents, on all servers, for all time.
That's a pretty strong requirement. However, it's necessary, because Atom
feeds are frequently chopped into pieces, republished by other sites like
Bloglines, and accumulated with content from other sites in aggregators
and news clients like Vienna. For example, in the real-world entry shown in
Listing 1, the ID is the rather unwieldy value
http://www.cafeaulait.org/?tag=http___eclipse.org_aspectj#news2005December21.
Listing 1. Sample Atom entry
<entry>
  <title>The Eclipse Project has released AspectJ 5.0.</title>
  <content type="xhtml">
    <div xmlns="http://www.w3.org/1999/xhtml">
      <p>
        The Eclipse Project has released
        <a href="http://eclipse.org/aspectj">AspectJ</a> 5.0.
        AspectJ is a derivative of Java that allows programmers to write
        code that applies across multiple classes. The AspectJ compiler
        requires Java 1.3, but can generate code for Java 1.1 and later.
        "This release constitutes a full-upgrade of AspectJ to support
        Java 5, while also delivering a large number of quality
        improvements that will benefit users running on JDK 1.4 or below.
        In addition to the Java 5 related language changes, AspectJ 5 also
        supports an @AspectJ style of aspect declaration, greatly enhanced
        load-time weaving capabilities, a full reflection API, and tools
        APIs for parts of the weaver."
      </p>
    </div>
  </content>
  <link href="#news2005December21"/>
  <id>http://www.cafeaulait.org/?tag=http___eclipse.org_aspectj#news2005December21</id>
  <updated>2005-12-21T09:00:01-05:00</updated>
</entry>
(Actually, the real ID is even longer and uglier, but it needed to fit on this page.)
This ID starts with the URL of the page from which the Atom feed is
extracted, http://www.cafeaulait.org/. Next, a query string is appended
with one parameter,
tag. The value of this parameter is
formed by first concatenating all the URLs referenced in the entry and
then replacing all URL reserved characters (such as the colon) with the
underscore. This distinguishes different entries from the same page.
Finally, the date is appended as a fragment identifier in case earlier or
later entries contain the same set of URLs. It's ugly, long, and hard to
read; but it's reliably unique.
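The construction just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name legacy_entry_id and the exact set of characters treated as reserved are hypothetical, not taken from the site's actual code, which this article does not show.

```python
import re

# Assumption: "URL reserved characters" means the gen-delims and sub-delims
# of generic URI syntax. The real site may use a different set.
RESERVED = re.compile(r"[:/?#\[\]@!$&'()*+,;=]")


def legacy_entry_id(page_url, referenced_urls, fragment):
    """Build an HTTP-URL-style entry ID: page URL, ?tag= value, #fragment."""
    # Concatenate every URL referenced in the entry, in document order.
    joined = "".join(referenced_urls)
    # Replace each reserved character with an underscore.
    tag_value = RESERVED.sub("_", joined)
    return f"{page_url}?tag={tag_value}#{fragment}"


entry_id = legacy_entry_id(
    "http://www.cafeaulait.org/",
    ["http://eclipse.org/aspectj"],
    "news2005December21",
)
print(entry_id)
# http://www.cafeaulait.org/?tag=http___eclipse.org_aspectj#news2005December21
```

Called with the page URL, the single URL referenced in Listing 1, and the date fragment, the sketch reproduces the shortened ID from the listing.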
Atom isn't the only protocol that requires globally unique URI identifiers. RDF, OWL, and the semantic Web all assume that URIs can be assigned to any item -- not just Web pages, but people, pets, planets, paramecia, DNA sequences, medical diagnoses, dates, and anything else you can imagine. For the semantic Web to work, it's important that the same URI not be used to identify Pluto the dog and Pluto the planet.
You'll find URIs used as identifiers in many other contexts, including SAX feature and property names, XML namespaces, RDDL natures and purposes, algorithms in XML digital signatures, SVG feature strings, and more.
All URLs are URIs, and HTTP URLs like the one in the earlier example serve perfectly well as identifiers. The problem with using them as identifiers is that users expect to be able to follow HTTP URLs. Even if a URL only appears in an Atom feed document not meant for human consumption, people will type it into a Web browser. In the immortal words of XML luminary Claude L. Bullard,
All the handwaving about URN/URI/URL doesn't avoid the simple fact that if one puts http:// anywhere in browser display space, the system colors it blue and puts up a finger. The monkey expects a resource and when it doesn't get one, it shocks the monkey. Monkeys don't read specs to find out why they should be shocked. They turn red and put up a finger.
In addition to annoying users, placing unresolvable HTTP URLs in a document tends to fill up server error logs with 404s for the URLs you thought no one would try to find. Even if users don't find them, a broken robot somewhere will. Consequently, it would be nice to have a URI scheme purely for identification purposes that doesn't look like an HTTP URL. This is where tag URIs come in.
It's probably impossible to define a globally unique identifier syntax without some form of centralized registration system to avoid conflicts. However, you don't need more than one such registry -- and fortunately, a registry already exists that almost everybody with a computer participates in: the domain name system. Like XML namespaces and Java package names, tag URIs piggyback off DNS to guarantee uniqueness. Each tag URI includes either a domain name or an e-mail address.
Domain names are sold, expired, snatched, and stolen on a regular basis, so a date is also included in the URI. Presumably, not more than one person or organization owns any given domain name or e-mail address on any given day.
Finally, an arbitrary string is added to the URI so that one person on one day can create any number of tag URIs. Here are a few example tag URIs:
In URI terminology, these are all opaque URIs -- that is, they don't follow the hierarchical system used in HTTP, HTTPS, file, and FTP URLs. However, they do have their own distinct structure. In particular, each such URI is composed of three parts separated by colons:
scheme:tagging entity:specific identifier
The scheme is always
tag. That's simple. Although URI schemes
are case insensitive, the tag RFC (see Resources)
recommends the lower-case form.
The specific identifier is optional. If included, it can hold any content within the limits of URI syntax. In brief, this means it's allowed to contain ASCII alphanumeric characters and a number of punctuation marks, including the colon (:) and the slash (/), but no whitespace, no unescaped non-ASCII characters, and no number sign (#), which would begin a fragment identifier. No particular meaning is assigned to the specific identifier. You can store anything here that seems useful, although you should endeavor to make it meaningful and obvious to humans. That is, strings like "sr_8_xs_ap_i2_xgl14" (taken from a real URL at one of the largest e-commerce sites) are discouraged. The tag RFC encourages strings made up of real words.
The tagging entity is where the meat is. This is the part that guarantees uniqueness. The tagging entity is based on domain names. However, because domain names change hands, there's also a date component. For example, the entity macfaq.com,2005 refers to the person or organization who owned the domain name macfaq.com in 2005. If that domain changes hands in 2006, then macfaq.com,2006 refers to the new owner, but the previous owner can still use macfaq.com,2005. If the domain name changes owners in the middle of a year, months and even days can be added, separated from the year by a hyphen. (This is the customary date format defined in ISO 8601 and endorsed by the W3C.) For example, macfaq.com,2005-12-21 refers to the entity that owned macfaq.com on December 21, 2005.
All years must use four digits, and all days and months must use two. For example, to create a tag URI on New Year's Day 2006, you write macfaq.com,2006-01-01 rather than macfaq.com,06-1-1. The date doesn't have to be the date the URI was first created, but it often is. You can also pick a date in the past, as long as you owned the domain then. However, you shouldn't create tag URIs that include a date in the future, because the ownership of the domain name or e-mail address may change unexpectedly.
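The date rules above are easy to get wrong by hand, so here is a minimal Python sketch of minting a tag URI. The function name mint_tag_uri, its parameters, and the specific identifier "news" are hypothetical, chosen only for illustration; the sketch checks the date rules but does not validate the domain name or the specific identifier.

```python
from datetime import date


def mint_tag_uri(authority, owned_on, specific="", precision="day", today=None):
    """Return tag:<authority>,<date>:<specific> with zero-padded date parts."""
    if precision not in ("year", "month", "day"):
        raise ValueError("precision must be 'year', 'month', or 'day'")
    # 'today' is injectable so the check below can be tested deterministically.
    today = today or date.today()
    if owned_on > today:
        # Ownership may change unexpectedly, so future dates are refused.
        raise ValueError("tag URIs should not use a date in the future")
    parts = [f"{owned_on.year:04d}"]           # always four digits
    if precision in ("month", "day"):
        parts.append(f"{owned_on.month:02d}")  # always two digits
    if precision == "day":
        parts.append(f"{owned_on.day:02d}")    # always two digits
    return f"tag:{authority},{'-'.join(parts)}:{specific}"


print(mint_tag_uri("macfaq.com", date(2006, 1, 1), "news", today=date(2006, 6, 1)))
# tag:macfaq.com,2006-01-01:news
print(mint_tag_uri("macfaq.com", date(2005, 12, 21), "news", precision="year",
                   today=date(2006, 6, 1)))
# tag:macfaq.com,2005:news
```

Passing precision="year" gives the short form that suffices when the domain did not change hands during that year.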
The date in the tagging entity stops at the day: the syntax defined by the tag RFC has no time component, and dates are interpreted in UTC. Differing time zones can still cause overlap and conflicts around a handover, so if a domain name does change hands, it's best to avoid assigning tag URIs dated within 48 hours of the switch, which removes all doubt about the ownership.
Not everyone owns a personal domain name, but most people have an e-mail address. If you don't own a domain name, or if your organization is so large that sorting out the usage of URIs between branches and divisions is tricky, use a full e-mail address instead: for example, tag:email@example.com,2006:javafaq/slides. In this case, the tagging entity is the owner of the email@example.com e-mail address in 2006, rather than the owner of the example.com domain name in 2006.
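To see how the pieces fit together, the following Python sketch splits a tag URI into its tagging entity and specific identifier and reports whether the authority is a domain name or an e-mail address. The pattern is a loose approximation written for this illustration, not the normative grammar from the tag RFC; the names TAG_URI and parse_tag_uri are hypothetical, and the sketch accepts only the recommended lower-case scheme.

```python
import re

# Loose approximation of the tag URI syntax (an assumption, not the RFC text):
#   tag:<domain or e-mail>,<YYYY[-MM[-DD]]>:<specific>
TAG_URI = re.compile(
    r"tag:"
    r"(?P<authority>(?:[A-Za-z0-9._-]+@)?"                 # optional mailbox@
    r"[A-Za-z0-9](?:[A-Za-z0-9.-]*[A-Za-z0-9])?)"          # domain name
    r","
    r"(?P<date>[0-9]{4}(?:-[0-9]{2}(?:-[0-9]{2})?)?)"      # 4-digit year, 2-digit month/day
    r":"
    r"(?P<specific>[A-Za-z0-9\-._~!$&'()*+,;=:@/?%]*)"     # may be empty
)


def parse_tag_uri(uri):
    """Split a tag URI into authority, authority kind, date, and specific."""
    match = TAG_URI.fullmatch(uri)
    if match is None:
        raise ValueError(f"not a well-formed tag URI: {uri!r}")
    authority = match["authority"]
    return {
        "authority": authority,
        "kind": "email" if "@" in authority else "domain",
        "date": match["date"],
        "specific": match["specific"],
    }


print(parse_tag_uri("tag:email@example.com,2006:javafaq/slides"))
# {'authority': 'email@example.com', 'kind': 'email', 'date': '2006', 'specific': 'javafaq/slides'}
```

A malformed date such as the one in tag:macfaq.com,06-1-1:x fails to match and raises ValueError, which mirrors the four-digit-year and two-digit-month rules described earlier.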
Tag URIs finally let URIs do what they were meant to do: identify without implying any sort of location or behavior that they don't have. They're easy to create, they're human legible, they work with existing systems, they're an open standard, and they don't have any backward compatibility issues. What's not to like?
The only thing that might suggest using an HTTP URL instead of a tag URI is if you want to put a page at the other end of the URL, either now or in the future. HTTP URLs let you do this. Tag URIs don't. However, the vast majority of HTTP URLs intended for use as identifiers (as opposed to locators) produce 404 Not Found errors when plugged into a browser. If you know you're not going to put a page at the end of the URL, choose tag URIs as identifiers rather than HTTP URLs.
- Participate in the discussion forum.
- Read the official tag RFC.
- Visit the tag URI Web site.
- Unique, reliable, stable identifiers are just one reason to prefer Atom over RSS.
- Decide whether Pluto is or isn't a planet.
- Read Atom feeds with RSSOwl.
- developerWorks XML zone: Find more XML resources here, including articles, tutorials, tips, and standards.