Untangle URIs, URLs, and URNs

Naming and the problem of persistence

Dan Connolly (connolly@w3.org), Technical Staff, W3C/MIT
Dan Connolly is the W3C URI Activity Lead, a member of the W3C Technical Architecture Group, and chair of the RDF Data Access Working Group. He joined the W3C staff in 1995. He edited the HTML 2.0 specification in 1995, chaired the Working Group that produced HTML 3.2 and 4.0 and CSS 1.0, led the XML Activity through the release of XML 1.0 in 1998 and XML Schema in 2001, and participated in the development of the Web Ontology Language, which became a W3C Recommendation in 2004. Dan is also a research scientist in the Decentralized Information Group at the MIT Computer Science and Artificial Intelligence Laboratory.

Summary:  In information management, persistence and availability are in constant tension. This tension has led to separate technologies for Uniform Resource Names (URNs) and Uniform Resource Locators (URLs). Meanwhile, Uniform Resource Identifiers (URIs) are designed to serve as both persistent names and available locations. This article explains how to use the current URI standards with XML technologies, gives a history of URNs and URLs, and provides a perspective on the tension between persistence and availability.

Date:  21 Jun 2005
Level:  Introductory


The World Wide Web combines three kinds of technologies: data formats, protocols, and identifiers that tie the two together. The relationship between data formats such as XML and HTML is relatively clear, as is the relationship between protocols such as HTTP and FTP. But identifiers seem to be a bit trickier to pin down.

Web addresses were relatively obscure a dozen years ago, but now they appear not just in Web browsers but also on business cards and brochures, on billboards and buses and T-shirts. They're commonly known as Uniform Resource Locators or URLs. A typical example would be http://www.cisco.com/en/US/partners/index.html. But what about shorter forms, such as www.yahoo.com/sports? Is that a URL, too? How about ../noarch/config.xsd? Or guide/glossary#octothorpe?

To make good use of URLs in XML namespaces, XML schemas, and Extensible Stylesheet Language Transformations (XSLT), you need to know the rules. But the XML family of specifications refers to URIs and URNs -- what's the difference between these and URLs? That question has a long history.

My role in that history goes back at least as far as the Hypertext conference in 1991, where I met both Douglas Engelbart (inventor of the mouse and pioneer of networked computers and hypertext) and Tim Berners-Lee (inventor of the World Wide Web). In a 1990 summary of his 20-plus years of research (see Resources), Engelbart listed among the requirements for an Open Hyperdocument System, "in principle, every object that someone might validly want or need to cite should have an unambiguous address." In his 1991 design document on naming, Berners-Lee wrote:

This is probably the most crucial aspect of design and standardization in an open hypertext system. It concerns the syntax of a name by which a document or part of a document (an anchor) is referenced from anywhere else in the world.

This article discusses the current state of the art in naming technology and standardization for the World Wide Web, as well as some of the history and evolution of the terminology. It concludes with a perspective on naming in information management.

The URI standard

RFC3986, "Uniform Resource Identifier (URI): Generic Syntax," is an Internet Standard. The Request for Comments (RFC) series is the famous archival document series that is the backbone of the Internet Engineering Task Force (IETF) standards process. Only a few of the thousands of RFCs, such as TCP (RFC793) and the Internet Mail protocol (RFC821) and format (RFC822), have advanced to full Internet Standard status. RFC3986 advanced to this status in January 2005.

According to the URI standard, the first example above -- http://www.cisco.com/en/US/partners/index.html -- is indeed a URI, and it has several component parts:

  • A scheme name (http)
  • A domain name (www.cisco.com)
  • A path (/en/US/partners/index.html)
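The same decomposition can be checked mechanically. The following is a minimal sketch that uses Python's standard urllib.parse module (a choice made for this illustration, not something the URI standard prescribes) to split the example into the parts listed above. Note that the library calls the domain-name part netloc.

```python
from urllib.parse import urlsplit

# Split the example URI from the text into its generic components.
parts = urlsplit("http://www.cisco.com/en/US/partners/index.html")

print("scheme:", parts.scheme)   # the scheme name
print("domain:", parts.netloc)   # the domain name (the "authority" in RFC3986 terms)
print("path:  ", parts.path)     # the path
```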

The IETF consensus process manages the schemes. The Official IANA Registry of URI Schemes (see Resources) includes familiar schemes like http, https, and mailto, plus many others that you may or may not be familiar with.

A URI path is like a typical file pathname. URIs adopted forward slashes (a/b/c) from the UNIX® tradition, because when URIs were designed in the late 1980s, UNIX culture was more prevalent on the Internet than PC culture. At that time, there were several popular notations for accessing remote files. One of them was Ange-ftp, an extension to emacs for editing remote files. It combined host names and user names with pathnames to get something like /jbrown@freddie.ucla.edu:~mblack/. The URI syntax that was developed for the Web used the double-slash notation for cross-machine naming (following the Apollo Domain UNIX dialect), but it also introduced the scheme syntax so that naming conventions from any number of different protocols could be unified. Some examples include:

  • mailto:mbox@domain
  • ftp://host/file
  • http://domain/path

The second example in the introduction, www.yahoo.com/sports, is not really a URI. It's a convenient shorthand for http://www.yahoo.com/sports, a format supported by popular Web browser user interfaces (UIs). However, don't make the mistake of leaving out the scheme in XSLT like this:

<xsl:include href="exslt.org/math/min/math.min.template.xsl" />

because it won't work as you expect, unless you really intend to refer to a file in a directory called exslt.org next to your XSLT stylesheet. The href attribute in XSLT takes a URI reference, which may be absolute or relative. A URI reference that starts with a scheme and a colon is absolute; otherwise, the reference is relative. A relative URI reference is much like a file path. For example, ../noarch/config.xsd is also a relative URI reference.
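To see why the scheme matters, here is a small sketch of how a processor might classify and resolve the reference. The stylesheet location http://example.org/styles/main.xsl is a hypothetical base chosen for this illustration, and is_absolute is a helper written for this example rather than part of any XSLT processor.

```python
import re
from urllib.parse import urljoin

# RFC3986 scheme syntax: a letter, then letters, digits, "+", "-", or ".",
# terminated by a colon.
SCHEME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*:")


def is_absolute(reference: str) -> bool:
    """Return True if the URI reference starts with a scheme and a colon."""
    return SCHEME_RE.match(reference) is not None


# Hypothetical location of the stylesheet that contains the xsl:include.
stylesheet = "http://example.org/styles/main.xsl"
href = "exslt.org/math/min/math.min.template.xsl"

print(is_absolute(href))                   # no scheme, so it is relative
print(is_absolute("http://" + href))       # with the scheme, it is absolute
print(is_absolute("../noarch/config.xsd"))

# A relative reference is resolved against the stylesheet's own location,
# so "exslt.org" ends up as a directory name, not a host name.
resolved = urljoin(stylesheet, href)
print(resolved)
```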

Internationalized Resource Identifiers

It is a slight oversimplification to say that the href attribute in XSLT takes a URI reference. URIs and URI references are taken from a limited set of ASCII characters, and XML is more internationalized than that. In fact, the Request for Comments that followed RFC3986 was RFC3987, Internationalized Resource Identifiers (IRIs) (see Resources). This specification is not as far along in the IETF standards process as its predecessor, but the technology itself is quite mature and widely deployed. IRIs are just like URIs except that they can use the whole range of Unicode characters, not just ASCII. Each IRI has a corresponding encoding as a URI, in case an IRI needs to be used in a protocol (such as HTTP) that accepts only URIs.
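The IRI-to-URI encoding can be sketched in a few lines. This is a simplified illustration, not a complete implementation of RFC3987: it assumes an IRI with no user information, converts the host name with Python's built-in idna codec, and percent-encodes the UTF-8 bytes of any non-ASCII characters elsewhere. The function name iri_to_uri and the sample IRI are inventions for this example.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched in each component; "%" is kept so that
# existing percent-escapes are not encoded a second time.
PATH_SAFE = "/:@!$&'()*+,;=%"
QUERY_SAFE = "/?:@!$&'()*+,;=%"


def iri_to_uri(iri: str) -> str:
    """Map an IRI to a URI (sketch: assumes no user information part)."""
    parts = urlsplit(iri)
    netloc = ""
    if parts.hostname:
        # Host names use the ASCII-compatible ("xn--") form.
        netloc = parts.hostname.encode("idna").decode("ascii")
        if parts.port is not None:
            netloc += f":{parts.port}"
    # Everything else: UTF-8 bytes, percent-encoded.
    path = quote(parts.path, safe=PATH_SAFE)
    query = quote(parts.query, safe=QUERY_SAFE)
    fragment = quote(parts.fragment, safe=QUERY_SAFE)
    return urlunsplit((parts.scheme, netloc, path, query, fragment))


print(iri_to_uri("http://bücher.example/résumé"))
# An identifier that is already plain ASCII passes through unchanged.
print(iri_to_uri("http://www.w3.org/2002/07/owl#Nothing"))
```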

Overriding the base URI with xml:base

Typically, a URI reference is relative to whatever document you find it in. If you're looking in a document with base URI http://exslt.org/math/min/math.min.template.xsl and you see a URI reference ../../random/random.xml, then that reference would expand to http://exslt.org/random/random.xml. In HTML, you can put a base element at the top of the document to override the base URI. The XML Base specification (see Resources) provides the equivalent in XML.
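The expansion described above can be reproduced with the urljoin function in Python's standard library, which implements this kind of reference resolution. The sibling file name other.xsl is hypothetical and is included only to show a reference without dot segments.

```python
from urllib.parse import urljoin

base = "http://exslt.org/math/min/math.min.template.xsl"

# Each ".." removes one directory level from the base path
# after the final segment (the file name) has been dropped.
expanded = urljoin(base, "../../random/random.xml")
print(expanded)

# A plain relative reference lands next to the base document.
sibling = urljoin(base, "other.xsl")  # hypothetical file name
print(sibling)
```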

Consider a document that you can access either as file:/my/doc or as http://my.domain/doc. Typically, when you access the document through the file system, you want references like #part2 to expand to file:/my/doc#part2; when you access the document through HTTP, you want #part2 to expand to http://my.domain/doc#part2. But in a Resource Description Framework (RDF) schema, the expanded form needs to stay the same no matter how the document is retrieved, because the expanded URIs are the names of the terms that the schema defines. XML Base lets you fix the expansion in the document itself (see Listing 1).


Listing 1. Expanded form in RDF

<rdf:RDF
  xmlns="&owl;"
  xmlns:owl="&owl;"
  xml:base="http://www.w3.org/2002/07/owl"
  xmlns:rdf="&rdf;"
  xmlns:rdfs="&rdfs;"
>

...
    <Class rdf:about="#Nothing"/>

In this example, the #Nothing reference expands to http://www.w3.org/2002/07/owl#Nothing no matter where you find that document.
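The effect of xml:base can be sketched with a small amount of code. The sketch below makes several assumptions: it uses a cut-down version of Listing 1 with the entity references replaced by the namespace URIs they normally stand for, it honors xml:base only on the root element, and the function name expand_about and the retrieval URI http://my.domain/doc are chosen for illustration. Python's ElementTree does not apply xml:base by itself, so the resolution is done by hand.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

XML_NS = "http://www.w3.org/XML/1998/namespace"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Cut-down version of Listing 1 (entities expanded by hand: an assumption).
DOC = """\
<rdf:RDF
  xmlns="http://www.w3.org/2002/07/owl#"
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xml:base="http://www.w3.org/2002/07/owl">
  <Class rdf:about="#Nothing"/>
</rdf:RDF>
"""

# The same document with the xml:base attribute removed.
DOC_WITHOUT_BASE = DOC.replace('xml:base="http://www.w3.org/2002/07/owl"', "")


def expand_about(xml_text: str, retrieval_uri: str) -> list:
    """Expand every rdf:about reference, honoring xml:base on the root only."""
    root = ET.fromstring(xml_text)
    xml_base = root.get(f"{{{XML_NS}}}base")
    # xml:base, when present, overrides the URI the document was fetched from.
    base = urljoin(retrieval_uri, xml_base) if xml_base else retrieval_uri
    about = f"{{{RDF_NS}}}about"
    return [urljoin(base, el.get(about)) for el in root.iter() if el.get(about)]


print(expand_about(DOC, "http://my.domain/doc"))
print(expand_about(DOC_WITHOUT_BASE, "http://my.domain/doc"))
```

With xml:base in place, the expanded name does not depend on the retrieval URI; without it, the name follows the document around.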

Okay, so much for URIs, IRIs, and URI references. What about URLs and URNs?


URLs and URNs

URIs are designed to serve as both names and locators. When they were brought to the IETF for standardization, they became known as Uniform Resource Locators, and a separate effort on Uniform Resource Names began.

For Internet hosts, names and locations have separate standards. Host names have the same syntax as domain names (for example, zork1.example.edu). These host names are connected to addresses like 192.168.30.21 by the Domain Name System (DNS) protocol. This indirection allows the names to remain stable when hosts are moved around in the network and renumbered.

The occasional broken link in the Web made Web addresses look and feel more like locations than names, and different perspectives emerged in the IETF community:

  • URIs: RFC1630, issued in June 1994, was called "Universal Resource Identifiers in WWW: A Unifying Syntax for the Expression of Names and Addresses of Objects on the Network as used in the World-Wide Web" (see Resources). It was an Informational RFC -- that is, it did not carry any endorsement from the community.
  • URLs: RFC1738, issued in December 1994, was called "Uniform Resource Locators" (see Resources). It was a Proposed Standard -- that is, it was the result of a consensus process, though it was not yet tested and mature enough to be a full Internet Standard.
  • URNs: RFC1737, issued December 1994, was called "Functional Requirements for Uniform Resource Names" (see Resources).

RFC1737 was followed in 1997 by Proposed Standard RFC2141, "URN Syntax," which specified another scheme -- urn: -- to join http:, ftp:, and the rest.
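The urn: syntax from RFC2141 is simple: the literal urn, a namespace identifier (NID), and a namespace-specific string (NSS), separated by colons. The following sketch splits a URN into those parts. It checks the NID syntax but does not validate the characters of the NSS, and the sample value urn:isbn:0451450523 is an illustrative example rather than one taken from this article.

```python
import re
from typing import NamedTuple


class Urn(NamedTuple):
    nid: str  # namespace identifier, for example "isbn"
    nss: str  # namespace-specific string


# "urn:" and the NID are case-insensitive; the NID is a letter or digit
# followed by up to 31 letters, digits, or hyphens.
URN_RE = re.compile(r"^urn:([A-Za-z0-9][A-Za-z0-9-]{0,31}):(.+)$", re.IGNORECASE)


def parse_urn(text: str) -> Urn:
    """Split a urn: URI into NID and NSS (sketch: NSS is not validated)."""
    match = URN_RE.match(text)
    if match is None:
        raise ValueError(f"not a urn: URI: {text!r}")
    return Urn(nid=match.group(1).lower(), nss=match.group(2))


print(parse_urn("urn:isbn:0451450523"))
```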

The eventual URI Standard (RFC3986) clarifies the distinction in section 1.1.3, "URI, URL, and URN":

A URI can be further classified as a locator, a name, or both. The term "Uniform Resource Locator" (URL) refers to the subset of URIs that, in addition to identifying a resource, provide a means of locating the resource by describing its primary access mechanism (e.g., its network "location"). The term "Uniform Resource Name" (URN) has been used historically to refer to both URIs under the "urn" scheme [RFC2141], which are required to remain globally unique and persistent even when the resource ceases to exist or becomes unavailable, and to any other URI with the properties of a name.
An individual scheme does not have to be classified as being just one of "name" or "locator". Instances of URIs from any given scheme may have the characteristics of names or locators or both, often depending on the persistence and care in the assignment of identifiers by the naming authority, rather than on any quality of the scheme. Future specifications and related documentation should use the general term "URI" rather than the more restrictive terms "URL" and "URN" [RFC3305].

Practical persistence

A natural tension exists between persistence and availability. If I have a file on a host that's connected to the Internet, the simplest way to make it available to you is to run a Web server on that host and hand you a URI that consists of whatever name the host happens to have, along with the filename (for example, http://dhcp324.coolISP.net/drafts/freeLunch.wsdl). That works fine until my Dynamic Host Configuration Protocol (DHCP) lease expires, I change ISPs, or I move the file from /drafts/ to /keepers/. And what if the service becomes popular and I decide to charge for it? The more inessential the information in the name, the less likely it is to persist across changes.

But a nice persistent name like http://xyzpdq.org/2005/ls434 is not as simple to manage. I have to register a domain, maintain the mapping from the domain name to the host address, and either remember that ls434 is the file where I keep my lunch service description or set up a file mapping table on my Web server.

The PURL project and the Digital Object Identifier (DOI) system (see Resources) represent different approaches to the persistence problem. A Persistent URL (PURL) is an ordinary HTTP URI in a domain backed by a strong persistence policy. For example, purl.org is run by the Online Computer Library Center (OCLC), a worldwide library cooperative. Anyone can apply for an account and administer his or her own set of PURLs. You publish your content on an ordinary Web server, then connect it to your PURL with HTTP redirection. The indirection from PURLs to less-persistent HTTP URIs is much like the indirection provided by DNS, except that the source and the destination of the redirection are in the same category. When you have set up a PURL, such as http://purl.org/net/dajobe/, you can use it like any other HTTP URI. More importantly, the people you want to communicate with can use it just like any other HTTP URI; no plug-ins or add-ons are needed.
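The indirection behind a PURL can be modeled as a lookup table that answers each request with an HTTP redirect. The sketch below is an in-memory illustration only, not the software that purl.org runs: the class name RedirectTable, the use of status 302, and the target hosts under .example are all assumptions made for this example.

```python
from typing import Dict, Tuple


class RedirectTable:
    """Maps persistent paths to wherever the content currently lives."""

    def __init__(self) -> None:
        self._targets: Dict[str, str] = {}

    def assign(self, persistent_path: str, current_uri: str) -> None:
        # The owner updates the target when content moves;
        # the persistent path itself never changes.
        self._targets[persistent_path] = current_uri

    def resolve(self, persistent_path: str) -> Tuple[int, Dict[str, str]]:
        """Return an HTTP status code and response headers."""
        target = self._targets.get(persistent_path)
        if target is None:
            return 404, {}
        return 302, {"Location": target}


table = RedirectTable()
table.assign("/net/dajobe/", "http://old-host.example/~dajobe/")  # hypothetical target
print(table.resolve("/net/dajobe/"))

# The content moves; only the table entry changes.
table.assign("/net/dajobe/", "http://new-host.example/people/dajobe/")  # hypothetical target
print(table.resolve("/net/dajobe/"))
```

Anyone who bookmarked the persistent name keeps a working link across the move, because only the redirect target was edited.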

The DOI system uses its own scheme -- for example, doi:10.123/456. Web browsers can be adapted to support this scheme with a plug-in. The DOI foundation provides policies, registration services, and HTTP redirection services similar to PURL providers like OCLC. While the DOI foundation supports an alias for each DOI of the form http://dx.doi.org/10.123/456, the DOI Handbook (see Resources) states that this system has "significant disadvantages when compared with the resolver plug-in." Managing two different names for each object seems like a more significant disadvantage to me.
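The mapping between the two names is mechanical, which is part of why maintaining both feels redundant. A minimal sketch follows, using the doi: form and the dx.doi.org alias exactly as given above. The function name doi_to_http is hypothetical, and the sketch does not percent-encode characters in the DOI that would need escaping in a URI.

```python
def doi_to_http(doi_uri: str) -> str:
    """Rewrite a doi: identifier as its HTTP alias (sketch: no escaping)."""
    prefix = "doi:"
    if not doi_uri.lower().startswith(prefix):
        raise ValueError(f"not a doi: identifier: {doi_uri!r}")
    return "http://dx.doi.org/" + doi_uri[len(prefix):]


print(doi_to_http("doi:10.123/456"))
```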


Creative tensions in information management

Despite this tension between persistence and availability, a good URI has both; it works as both a persistent name and an available location. So, a URL is really just a URI with practical utility.

Proponents of the urn: scheme argue that this tension is irreconcilable within the framework of HTTP and DNS. I acknowledge that there are areas of concern, but every Web master faces the same issues, and the community is learning information management principles to address them. The fundamental issue is that the world changes continuously, and keeping things in sync takes effort.

Most of the time, the hierarchical nature of DNS naming is convenient, but it concentrates a lot of power in one place and raises challenging governance issues. Peer-to-peer designs such as distributed hash tables may eliminate some of the centralization issues with DNS, but who knows what governance issues they will bring with them? Various leading-edge developments show how new protocols can be used to service existing http://... names, adding value to the existing hypermedia network. This seems more likely to succeed than the deployment of new schemes for anything remotely similar to HTTP's GET/PUT/POST/DELETE operations. I expect that present-day best practices in information management and future protocol enhancements will make carefully chosen URIs built on HTTP and DNS last quite a long time.


Resources
