The World Wide Web combines three kinds of technologies: data formats, protocols, and identifiers that tie the two together. The relationship between data formats such as XML and HTML is relatively clear, as is the relationship between protocols such as HTTP and FTP. But identifiers seem to be a bit trickier to pin down.
Web addresses were relatively obscure a dozen years ago, but now they appear not just in Web browsers but also on business cards and brochures, on billboards and buses and T-shirts. They're commonly known as Uniform Resource Locators or URLs. A typical example would be http://www.cisco.com/en/US/partners/index.html. But what about shorter forms, such as www.yahoo.com/sports? Is that a URL, too? How about ../noarch/config.xsd? Or guide/glossary#octothorpe?
To make good use of URLs in XML namespaces, XML schemas, and Extensible Stylesheet Language Transformations (XSLT), you need to know the rules. But the XML family of specifications refers to URIs and URNs -- what's the difference between these and URLs? That question has a long history.
My role in that history goes back at least as far as the Hypertext conference in 1991, where I met both Douglas Engelbart (inventor of the mouse and pioneer of networked computers and hypertext) and Tim Berners-Lee (inventor of the World Wide Web). In a 1990 summary of his 20-plus years of research (see Resources), Engelbart listed among the requirements for an Open Hyperdocument System, "in principle, every object that someone might validly want or need to cite should have an unambiguous address." In his 1991 design document on naming, Berners-Lee wrote:
This is probably the most crucial aspect of design and standardization in an open hypertext system. It concerns the syntax of a name by which a document or part of a document (an anchor) is referenced from anywhere else in the world.
This article discusses the current state of the art in naming technology and standardization for the World Wide Web, as well as some of the history and evolution of the terminology. It concludes with a perspective on naming in information management.
RFC3986, "Uniform Resource Identifier (URI): Generic Syntax," is an Internet Standard. The Request for Comments (RFC) series is the famous archival document series that is the backbone of the Internet Engineering Task Force (IETF) standards process. Only a few of the thousands of RFCs, such as TCP (RFC793) and the Internet Mail protocol (RFC821) and format (RFC822), have advanced to full Internet Standard status. RFC3986 advanced to this status in January 2005.
According to the URI standard, the first example above -- http://www.cisco.com/en/US/partners/index.html -- is indeed a URI, and it has several component parts:
- A scheme name
- A domain name
- A path
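As a rough illustration, the same split can be reproduced with the urllib.parse module in Python's standard library. This is only a sketch: the variable names are my own, and the library calls the domain name the "hostname."

```python
from urllib.parse import urlsplit

# The example URI from the text above.
uri = "http://www.cisco.com/en/US/partners/index.html"

parts = urlsplit(uri)

scheme = parts.scheme    # the scheme name, before the first colon
domain = parts.hostname  # the domain name, after the double slash
path = parts.path        # the path, starting at the next slash

print(scheme)
print(domain)
print(path)
```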
The IETF consensus process manages the schemes. The Official IANA Registry of URI Schemes (see Resources) includes familiar schemes like mailto, plus many others that you may or may not be familiar with.
A URI path is like a typical file pathname. URIs adopted forward slashes (a/b/c) from the UNIX® tradition, because when URIs were designed in the late 1980s, UNIX culture was more prevalent on the Internet than PC culture. At that time, there were several popular notations for accessing remote files. One of them was Ange-ftp, an extension to emacs for editing remote files. It combined host names and user names with pathnames to get something like /email@example.com:~mblack/. The URI syntax that was developed for the Web used the double-slash notation for cross-machine naming (following the Apollo Domain UNIX dialect), but it also introduced the scheme syntax so that naming conventions from any number of different protocols could be unified. Some examples include:
The second example in the introduction, www.yahoo.com/sports, is not really a URI. It's a convenient shorthand for http://www.yahoo.com/sports, a format supported by popular Web browser user interfaces (UIs). However, don't make the mistake of leaving out the scheme in XSLT like this:
<xsl:include href="exslt.org/math/min/math.min.template.xsl" />
because it won't work as you expect, unless you really intend to refer to a file in a directory called exslt.org next to your XSLT stylesheet. The href attribute in XSLT takes a URI reference, which may be absolute or relative. A URI reference that starts with a scheme and a colon is absolute; otherwise, the reference is relative. A relative URI reference is much like a file path. For example, ../noarch/config.xsd is also a relative URI reference.
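The pitfall with the missing scheme can be checked with a short sketch using Python's urllib.parse. The stylesheet location below is a hypothetical one chosen for illustration; only the href value comes from the example above.

```python
from urllib.parse import urljoin, urlsplit

href = "exslt.org/math/min/math.min.template.xsl"

# There is no scheme and colon at the start, so this is a relative
# reference: the whole string is treated as a path.
parts = urlsplit(href)
is_relative = parts.scheme == ""

# Hypothetical location of the stylesheet that contains the xsl:include.
stylesheet = "http://example.org/styles/main.xsl"

# Resolving the reference lands in a directory named exslt.org next to
# the stylesheet, not on the exslt.org host.
resolved = urljoin(stylesheet, href)

print(is_relative)
print(resolved)
```

Adding the http:// prefix to the href makes the reference absolute, and the resolution step then leaves it unchanged.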
It is a slight oversimplification to say that the href attribute in HTML takes a URI reference. URIs and URI references are taken from a limited set of ASCII characters, and HTML is more internationalized than that. In fact, the Request for Comments that followed RFC3986 was RFC3987, Internationalized Resource Identifiers (IRIs) (see Resources). This specification is not as far along in the IETF standards process as its predecessor, but the technology itself is quite mature and widely deployed. IRIs are just like URIs except that they can use the whole range of Unicode characters, not just ASCII. Each IRI has a corresponding encoding as a URI, in case an IRI needs to be used in a protocol (such as HTTP) that accepts only URIs.
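The percent-encoding part of that IRI-to-URI mapping can be sketched in a few lines of Python. This is a simplified illustration, not a full RFC3987 implementation: the function name iri_to_uri is my own, the host is assumed to be ASCII already, and the example address is hypothetical.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters that are left untouched; "%" is included so that
# existing percent-escapes are not encoded a second time.
SAFE = "/?:@!$&'()*+,;=%"


def iri_to_uri(iri):
    """Encode non-ASCII characters as UTF-8 bytes written as %XX escapes.

    Assumption: the host part is plain ASCII. Internationalized host
    names need separate handling that this sketch does not attempt.
    """
    parts = urlsplit(iri)
    return urlunsplit((
        parts.scheme,
        parts.netloc,
        quote(parts.path, safe=SAFE),
        quote(parts.query, safe=SAFE),
        quote(parts.fragment, safe=SAFE),
    ))


print(iri_to_uri("http://example.org/r\u00e9sum\u00e9"))
```

A reference that is already pure ASCII passes through unchanged, which matches the idea that every URI is also an IRI.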
Typically, a URI reference is relative to whatever document you find it in. If you're looking in a document with base URI http://exslt.org/math/min/math.min.template.xsl and you see a URI reference ../../random/random.xml, then that reference would expand to http://exslt.org/random/random.xml. In HTML, you can put a base element at the top of the document to override the base URI. The XML Base specification (see Resources) provides the equivalent in XML.
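The expansion described above can be reproduced with urljoin from Python's standard library, which resolves a reference against a base URI. A minimal sketch, using the base and reference from the text:

```python
from urllib.parse import urljoin

# Base URI of the document in which the reference appears.
base = "http://exslt.org/math/min/math.min.template.xsl"

# Relative reference found inside that document.
reference = "../../random/random.xml"

# Drop the last path segment of the base, then apply each ".." step.
expanded = urljoin(base, reference)

print(expanded)
```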
Consider a document that you can access either as file:/my/doc or as http://my.domain/doc. Typically, when you access the document through the file system, you want references like #part2 to expand to file:/my/doc#part2; when you access the document through HTTP, you want #part2 to expand to http://my.domain/doc#part2. But in a Resource Description Framework (RDF) schema, the expanded form needs to stay the same for some things to work. XML Base makes this expansion easy (see Listing 1).
Listing 1. Expanded form in RDF
<rdf:RDF
  xmlns="&owl;"
  xmlns:owl="&owl;"
  xml:base="http://www.w3.org/2002/07/owl"
  xmlns:rdf="&rdf;"
  xmlns:rdfs="&rdfs;"
>
...
<Class rdf:about="#Nothing"/>
In this example, the #Nothing reference expands to http://www.w3.org/2002/07/owl#Nothing no matter where you find that document.
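The effect of xml:base can be modeled roughly as "use the declared base if there is one, otherwise use the address the document was retrieved from." The sketch below makes that assumption explicit; the expand function and the mirror address are hypothetical, and real XML Base processing also handles nested and relative xml:base values, which this does not.

```python
from urllib.parse import urljoin


def expand(reference, retrieval_uri, xml_base=None):
    """Resolve a reference against xml:base if given, else the retrieval URI."""
    return urljoin(xml_base or retrieval_uri, reference)


# The xml:base value from Listing 1.
OWL_BASE = "http://www.w3.org/2002/07/owl"

# Hypothetical address of a mirrored copy of the same document.
mirror = "http://my.domain/mirror/owl.rdf"

# Without xml:base, the expansion follows the retrieval address.
without_base = expand("#Nothing", mirror)

# With xml:base, the expansion is the same wherever the copy lives.
with_base = expand("#Nothing", mirror, OWL_BASE)

print(without_base)
print(with_base)
```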
Okay, so much for URIs, IRIs, and URI references. What about URLs and URNs?
URIs are designed to serve as both names and locators. When they were brought to the IETF for standardization, they became known as Uniform Resource Locators, and a separate effort on Uniform Resource Names began.
For Internet hosts, names and locations have separate standards. Host names have the same syntax as domain names. These host names are connected to addresses like 192.168.300.21 by the Domain Name System (DNS) protocol. This indirection allows the names to remain stable when hosts are moved around in the network and renumbered.
The occasional broken link in the Web made Web addresses look and feel more like locations than names, and different perspectives emerged in the IETF community:
- URIs: RFC1630, issued in June 1994, was called "Universal Resource Identifiers in WWW: A Unifying Syntax for the Expression of Names and Addresses of Objects on the Network as used in the World-Wide Web" (see Resources). It was an Informational RFC -- that is, it did not carry any endorsement from the community.
- URLs: RFC1738, issued in December 1994, was called "Uniform Resource Locators" (see Resources). It was a Proposed Standard -- that is, it was the result of a consensus process, though it was not yet tested and mature enough to be a full Internet Standard.
- URNs: RFC1737, issued in December 1994, was called "Functional Requirements for Uniform Resource Names" (see Resources).
RFC1737 was followed in 1997 by Proposed Standard RFC2141, "URN Syntax," which specified another scheme -- urn: -- to join ftp: and the rest.
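RFC2141 lays out a urn: URI as the scheme, a namespace identifier, and a namespace-specific string, separated by colons. A minimal sketch of that split follows; the split_urn function is my own, the ISBN URN is only an illustrative value, and the sketch does not validate the characters allowed in each part.

```python
def split_urn(urn):
    """Split a URN into (namespace identifier, namespace-specific string).

    Only the first two colons are separators; any later colons belong
    to the namespace-specific string.
    """
    pieces = urn.split(":", 2)
    if len(pieces) != 3 or pieces[0].lower() != "urn":
        raise ValueError("not a urn: URI: " + urn)
    _, nid, nss = pieces
    # The "urn" prefix and the namespace identifier are case-insensitive,
    # so the identifier is normalized to lowercase here.
    return nid.lower(), nss


print(split_urn("urn:isbn:0451450523"))
```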
The eventual URI Standard (RFC3986) clarifies the distinction in section 1.1.3, "URI, URL, and URN":
A URI can be further classified as a locator, a name, or both. The term "Uniform Resource Locator" (URL) refers to the subset of URIs that, in addition to identifying a resource, provide a means of locating the resource by describing its primary access mechanism (e.g., its network "location"). The term "Uniform Resource Name" (URN) has been used historically to refer to both URIs under the "urn" scheme [RFC2141], which are required to remain globally unique and persistent even when the resource ceases to exist or becomes unavailable, and to any other URI with the properties of a name.
An individual scheme does not have to be classified as being just one of "name" or "locator". Instances of URIs from any given scheme may have the characteristics of names or locators or both, often depending on the persistence and care in the assignment of identifiers by the naming authority, rather than on any quality of the scheme. Future specifications and related documentation should use the general term "URI" rather than the more restrictive terms "URL" and "URN" [RFC3305].
A natural tension exists between persistence and availability. If I have a file on a host that's connected to the Internet, the simplest way to make it available to you is to run a Web server on that host and hand you a URI that consists of whatever name the host happens to have, along with the filename (for example, http://dhcp324.coolISP.net/drafts/freeLunch.wsdl). That works fine until my Dynamic Host Configuration Protocol (DHCP) lease expires, I change ISPs, or I move the file to /keepers/. And what if the service becomes popular and I decide to charge for it? The more inessential the information in the name, the less likely it is to persist across changes.
But a nice persistent name is not as simple to manage. I have to register a domain, maintain the mapping from the domain name to the host address, and either remember which file holds my lunch service description or set up a file mapping table.
The PURL project and the Digital Object Identifier (DOI) system (see Resources) represent different approaches to the persistence problem. A Persistent URL (PURL) is an ordinary HTTP URI in a domain backed by a strong persistence policy. For example, purl.org is run by the Online Computer Library Center (OCLC), a worldwide library cooperative. Anyone can apply for an account and administer his or her own set of PURLs. You publish your content on an ordinary Web server, then connect it to your PURL with HTTP redirection. The indirection from PURLs to less-persistent HTTP URIs is much like the indirection provided by DNS, except that the source and the destination of the redirection are in the same category. When you have set up a PURL, such as
http://purl.org/net/dajobe/, you can use it like any other HTTP URI. More importantly, the people you want to communicate with can use it just like any other HTTP URI; no plug-ins or add-ons are needed.
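The redirection step can be pictured as a lookup table that answers each request for a persistent name with an HTTP redirect to the current location. The sketch below is a toy model under that assumption: the table, the resolve function, the /net/example/lunch path, and the choice of status 302 are all hypothetical, and the real purl.org service may behave differently.

```python
# Hypothetical table mapping persistent paths to current locations.
PURL_TABLE = {
    "/net/example/lunch": "http://dhcp324.coolISP.net/drafts/freeLunch.wsdl",
}


def resolve(path):
    """Return an HTTP status code and response headers for a requested path."""
    target = PURL_TABLE.get(path)
    if target is None:
        return 404, {}
    # A redirect tells the client where the content lives right now.
    return 302, {"Location": target}


print(resolve("/net/example/lunch"))

# When the file moves, only the table entry changes; the persistent
# name that people have written down stays the same.
PURL_TABLE["/net/example/lunch"] = "http://dhcp324.coolISP.net/keepers/freeLunch.wsdl"
print(resolve("/net/example/lunch"))
```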
The DOI system uses its own scheme. Web browsers can be adapted to support this scheme with a plug-in. The DOI foundation provides policies, registration services, and HTTP redirection services similar to those of PURL providers like OCLC. While the DOI foundation supports an alias for each DOI of the form http://dx.doi.org/10.123/456, the DOI Handbook (see Resources) states that this system has "significant disadvantages when compared with the resolver plug-in." Managing two different names for each object seems like a more significant disadvantage to me.
Despite this tension between persistence and availability, a good URI has both; it works as both a persistent name and an available location. So, a URL is really just a URI with practical utility.
Proponents of the urn: scheme argue that this tension is irreconcilable within the framework of HTTP and DNS. I acknowledge that there are areas of concern, but every Webmaster faces the same issues, and the community is learning information management principles to address them. The fundamental issue is that the world changes continuously, and keeping things in sync takes effort.
Most of the time, the hierarchical nature of DNS naming is convenient, but it concentrates a lot of power in one place and raises challenging governance issues. Peer-to-peer designs such as distributed hash tables may eliminate some of the centralization issues with DNS, but who knows what governance issues they will bring with them? Various leading-edge developments show how new protocols can be used to service existing http://... names, adding value to the existing hypermedia network. This seems more likely to succeed than the deployment of new schemes for anything remotely similar to HTTP's GET/PUT/POST/DELETE operations. I expect that present-day best practices in information management and future protocol enhancements will make carefully chosen URIs built on HTTP and DNS last quite a long time.
- Read "Knowledge-Domain Interoperability and an Open Hyperdocument System," Douglas Engelbart's 1990 summary of his pioneering research in computer-supported cooperative work (CSCW).
- Explore "Document Naming," part of Tim Berners-Lee's Design Issues that started in 1991 and continues to this day.
- Find out more about the Internet Engineering Task Force (IETF), the organization that develops the Internet Protocol, DNS, Internet Mail, and many other Internet technologies. The RFC series is the backbone of the IETF standards process. The following RFCs are discussed in this article:
  - RFC1630 -- "Universal Resource Identifiers in WWW"
  - RFC1737 -- "Functional Requirements for Uniform Resource Names"
  - RFC1738 -- "Uniform Resource Locators" (W3C keeps a hypertext version)
  - RFC2141 -- "URN Syntax"
  - RFC3305 -- "Report from the Joint W3C/IETF URI Planning Interest Group: URIs, URLs, and URNs"
  - RFC3986 -- "Uniform Resource Identifier (URI): Generic Syntax" (Roy Fielding's hypertext version is also handy)
  - RFC3987 -- "Internationalized Resource Identifiers (IRIs)"
- Reference the Internet Assigned Numbers Authority (IANA), which maintains the official list of URI schemes, among other things.
- Want to know more about what PURLs are and how to use them? Check out the PURL project site, and be sure to read the PURL FAQ.
- Find out about The DOI System and its approach to persistence. The DOI Handbook has its own chapter on resolution.
- Visit the W3C to find out more about XML Schema, XML Namespaces, and XML Base.
- Learn more about URLs, URNs, and URIs in Uche Ogbuji's developerWorks article "Principles of XML design: Use XML namespaces with care" (April 2004) as well as his first overview of XML standards (January 2004).
Dan Connolly is the W3C URI Activity Lead, a member of the W3C Technical Architecture Group, and chair of the RDF Data Access Working Group. He joined the W3C staff in 1995. He edited the HTML 2.0 specification in 1995, chaired the Working Group that produced HTML 3.2 and 4.0 and CSS 1.0, led the XML Activity through the release of XML 1.0 in 1998 and XML Schema in 2001, and participated in the development of the Web Ontology Language, which became a W3C Recommendation in 2004. Dan is also a research scientist in the Decentralized Information Group at the MIT Computer Science and Artificial Intelligence Laboratory.