By now it's a well-known and oversimplified bedtime story: In 1989 Tim Berners-Lee invented the Web, and casinos, pornographers, and, incidentally, businesspeople the world over found a medium of unprecedented power. Many limitations of the Web are widely accepted:
- The predominance of HTML documents, which mix content with presentation
- The difficulty of maintaining Web sites to reflect inevitable real-world changes
- The difficulty of seamlessly presenting dynamic content
- The seeming futility of finding precisely what one wants using a Web-crawler search engine.
The W3C, the consortium founded in 1994 by Berners-Lee and other industrial shapers of the Web, has been working hard to change these four limitations. The first two are supposed to give way to a future of an XML-driven Web, which would improve the maintainability and flexibility of Web data. The W3C takes aim at the latter two with the Resource Description Framework (RDF), claiming that RDF will make the management and navigation of Web data easier to automate by providing structured Web metadata as counterpart to Web data. (See the sidebar for a note about the word metadata and other such elusive concepts.)
Thus far XML has garnered much of the world's attention, but as many XML specialists point out (and as many observers of XML's remarkable media coverage have probably thought), XML is not very interesting. XML is nothing more than a way to standardize data formats. In a way, it is just the next level of data above the character level, which has been standardized on such similarly unglamorous technologies as ASCII and Unicode.
This is not to underplay XML's importance. A data-format standard makes all of the more glamorous technologies possible, and RDF is the leading example of the benefit that comes once the data format has been standardized. Many proclaim that RDF is really XML's killer app, and with good reason. Despite all this, RDF remains somewhat obscure. This is mainly because at its core RDF is very abstract, very dry, and very academic. With this article I hope to illustrate why RDF is very important to anyone interested in XML.
While trying to implement initiatives for managing the Web, particularly the Platform for Internet Content Selection (PICS), a content rating system, the W3C kept running into the difficulty of how to uniformly express assertions about Web pages which could be used by automated content filters and selectors.
RDF is very simple. It is no more than a way to express and process a series of simple assertions. For example: This article is authored by Uche Ogbuji.
This is called a statement in RDF and has three structural parts: a subject ("this article"), a predicate ("is authored by"), and an object ("Uche Ogbuji"). This is a familiar breakdown of such assertions, whether in the field of formal logic or grammar (well, OK, as long as you don't make too fine a point of that intransitive verb). Indeed, RDF is nothing more than an application of long study in such fields aimed at describing resources, which consist of any item accessible through the Web.
In RDF, resources are represented by Uniform Resource Identifiers (URIs), of which URLs are a subset. The subject of RDF statements must actually be a resource, so the above English statement could be turned into an RDF statement illustrated in Figure 1.
Figure 1: An RDF statement
Figure 1 shows the common graph representation of RDF statements, introduced in the RDF Model and Syntax 1.0 Recommendation (RDF M&S). Note that the object is a string: "Uche Ogbuji". This is called a literal in RDF, but an object could also be a resource. Take a look at Figure 2.
Figure 2: A small RDF model
Figure 2 shows several RDF statements combined into a single diagram. All of RDF is pretty much an expansion of this basis. RDF defines a directed graph of statements that describe Web-based resources. As you can see, I have replaced the literal "Uche Ogbuji" in the original statement with a URI representing this person, which in turn is the subject of several more statements. Such a collection of RDF statements is called a model in RDF.
This might seem rather simple to be such an important technology, but it is RDF's very simplicity that makes it so powerful. Computer science already has plenty to say about the effectiveness of graphs for representing information. RDF allows many simple statements to be aggregated so that machine agents can apply the well-tested graph traversal techniques to glean data. These statements are called triples because there are three predominant parts (subject, object, and predicate). Databases of such triples have been shown to be scalable to many millions of triples, mostly because of the simplicity of this information. Such scalability is the only hope if a technology is to make an attempt at taming the vast Web.
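As a rough illustration of why triples lend themselves to graph traversal, the sketch below holds a small model like the one in Listing 1 as a plain Python set of (subject, predicate, object) tuples and walks outward from the article. The person URI and the `reachable` helper are assumptions made for this example only; they come from neither RDF M&S nor any RDF toolkit.

```python
from collections import deque

NS = "http://schemas.uche.ogbuji.net/rdfexample/"
ARTICLE = "http://uche.ogbuji.net/thisarticle"
# Hypothetical URI for the person resource, chosen for this sketch only.
PERSON = "http://example.org/doc#uche.ogbuji.net"

# A model is just a collection of (subject, predicate, object) triples.
MODEL = {
    (ARTICLE, NS + "authored-by", PERSON),
    (PERSON, NS + "name", "Uche Ogbuji"),
    (PERSON, NS + "nationality", "Nigerian"),
}


def reachable(model, start):
    """Breadth-first walk: return every triple reachable from `start`."""
    subjects = {s for s, _, _ in model}
    seen, queue, found = {start}, deque([start]), []
    while queue:
        node = queue.popleft()
        # Sort so the result order is deterministic.
        for s, p, o in sorted(t for t in model if t[0] == node):
            found.append((s, p, o))
            # Follow the arc only when the object is itself a described resource.
            if o in subjects and o not in seen:
                seen.add(o)
                queue.append(o)
    return found


if __name__ == "__main__":
    for s, p, o in reachable(MODEL, ARTICLE):
        print(s, p, o)
```

Starting from the article, the walk first yields the authored-by statement and then the two statements about the person. A production triple database would index the triples rather than scan them, but the traversal logic stays the same.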
The abstract representation we have discussed above is the basis of RDF, but it is quite impractical for exchanging RDF descriptions and placing such descriptions in HTML and XML content. To this end RDF M&S also provides a serialization format in XML for RDF. According to this format the model in Figure 2 might be rendered as in Listing 1.
Listing 1: XML serialization of the RDF model in Figure 2
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://schemas.uche.ogbuji.net/rdfexample/">
  <rdf:Description about="http://uche.ogbuji.net/thisarticle">
    <authored-by>
      <rdf:Description ID="uche.ogbuji.net">
        <name>Uche Ogbuji</name>
        <nationality>Nigerian</nationality>
      </rdf:Description>
    </authored-by>
  </rdf:Description>
</rdf:RDF>
Listing 1 shows just one of many forms, some more verbose and some more abbreviated, that are provided for XML expression. This flexibility of RDF syntax -- often an obstacle to learning and implementing RDF -- makes it much easier to apply RDF processing to existing XML. One constant in all RDF serializations is the use of the element rdf:RDF to wrap the RDF statements.
Note the use of XML namespaces in Listing 1. RDF relies heavily on XML namespaces for disambiguating names. There are several element and attribute names that must be in the namespace defined by RDF M&S (using the prefix rdf in this example and many others you'll see). All RDF predicates must use a namespace to clarify their meaning, as the examples show.
Inside the RDF wrapper element, a Description element indicates the subject of the enclosed statements. This example uses the about attribute, which points to an external resource as the subject. There is one statement with this resource as subject, marked by the element <authored-by>, which forms the predicate. Note that this element has the namespace http://schemas.uche.ogbuji.net/rdfexample/. According to RDF M&S, this is translated to an abstract model in which the actual predicate is formed by joining the namespace URI and local name of the predicate element. So really the full predicate of this statement is http://schemas.uche.ogbuji.net/rdfexample/authored-by. In addition, the namespaces are supposed to provide schemas that have type and constraint information for RDF.
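The joining of namespace URI and local name can be sketched with Python's standard xml.etree.ElementTree module, which reports each element name as "{namespace-uri}local-name". The `predicate_uri` helper is a name invented for this illustration; a real RDF parser does considerably more than this.

```python
import xml.etree.ElementTree as ET

# The document from Listing 1.
RDF_XML = """\
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://schemas.uche.ogbuji.net/rdfexample/">
  <rdf:Description about="http://uche.ogbuji.net/thisarticle">
    <authored-by>
      <rdf:Description ID="uche.ogbuji.net">
        <name>Uche Ogbuji</name>
        <nationality>Nigerian</nationality>
      </rdf:Description>
    </authored-by>
  </rdf:Description>
</rdf:RDF>
"""


def predicate_uri(element):
    """Join the namespace URI and local name of a property element."""
    # ElementTree stores qualified names as "{namespace-uri}local-name".
    namespace, local = element.tag[1:].split("}", 1)
    return namespace + local


root = ET.fromstring(RDF_XML)
description = root[0]         # the outer rdf:Description
authored_by = description[0]  # its only child, the property element

if __name__ == "__main__":
    print(predicate_uri(authored_by))
    # prints http://schemas.uche.ogbuji.net/rdfexample/authored-by
```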
The remaining part of the statement, you'll remember, is the object. But the object of the first statement is not very clear from the listing. RDF handles the case in which the object of a statement is a resource but doesn't really have an external URI. In the example, the resource representing the person named Uche Ogbuji is such a case, and is actually represented by the embedded Description element with an ID attribute. The URI of this resource becomes the joining of the URI of the RDF file as a whole, and the value of the ID attribute. Note that RDF takes this arcane concept (one of many) even further by allowing fully anonymous resources without even an ID.
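A minimal sketch of that joining follows. It assumes the ID is attached to the document URI as a fragment identifier with a "#" separator, which is my reading of RDF M&S; the document URI shown is hypothetical, since the article does not say where Listing 1 would be published.

```python
from urllib.parse import urldefrag


def resource_uri(document_uri, local_id):
    """Build the URI of a resource declared with an ID attribute."""
    # Drop any fragment already on the document URI, then append the ID.
    # Assumption: "#" is the separator between document URI and ID.
    base, _ = urldefrag(document_uri)
    return base + "#" + local_id


if __name__ == "__main__":
    # Hypothetical location of the RDF file from Listing 1.
    print(resource_uri("http://example.org/thisarticle.rdf", "uche.ogbuji.net"))
    # prints http://example.org/thisarticle.rdf#uche.ogbuji.net
```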
The resource with ID "uche.ogbuji.net" is itself the subject of two statements, with predicates represented by the child elements name and nationality. Note that these predicates are also in the http://schemas.uche.ogbuji.net/rdfexample/ namespace. The objects of these statements are literals: "Uche Ogbuji" and "Nigerian".
That wraps up the introduction of RDF. See Resources for links to more detailed introductions to RDF, and more advanced topics such as statement containers, reification, and schemas.
As I said, RDF's power comes from its simplicity. The W3C suggests that webmasters begin the process of annotating existing Web data with RDF by embedding simple descriptions (such as in Listing 1) into the headers of their documents. Actually, rather than using the sample namespace for the schema I used in the listing, webmasters are encouraged to make use of the Dublin Core, a standard specification for library-like metadata (see Resources). Use of standard cataloguing metadata would assist search engine Web crawlers and other machine agents the way HTML meta tags help search engines index Web pages. The advantage of RDF is that it is readily extensible with schemas that are also machine readable, bringing about an unprecedented level of automation.
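To give a feel for such an annotation, here is a small sketch that builds a description using the Dublin Core creator element in place of the sample authored-by predicate. It uses only Python's standard library; the `describe` helper is invented for this example, and the unqualified about attribute simply mirrors Listing 1.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"  # Dublin Core element set namespace

# Ask ElementTree to use the conventional prefixes when serializing.
ET.register_namespace("rdf", RDF)
ET.register_namespace("dc", DC)


def describe(resource_uri, **properties):
    """Return an rdf:RDF element describing one resource with Dublin Core."""
    root = ET.Element(f"{{{RDF}}}RDF")
    description = ET.SubElement(
        root, f"{{{RDF}}}Description", {"about": resource_uri}
    )
    for name, value in properties.items():
        # Each keyword argument becomes one dc:* property element.
        ET.SubElement(description, f"{{{DC}}}{name}").text = value
    return root


doc = describe("http://uche.ogbuji.net/thisarticle", creator="Uche Ogbuji")

if __name__ == "__main__":
    print(ET.tostring(doc, encoding="unicode"))
```

The printed XML is the kind of fragment a webmaster could place in a document header; a crawler that understands Dublin Core could then read the creator without knowing anything about this site's private vocabulary.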
This automation of resource discovery, description, and schematics is the basis of what Berners-Lee and the W3C have been touting for some time as the next-generation Web, also known as the semantic Web. This term is rather controversial (see the sidebar "RDF wordplay" below), but it indicates the application of well-established artificial intelligence technologies, known as semantic networks, to the task of automating data processing on the Web. This evolution would allow Web crawlers to gather more than just plain keywords. Through RDF schemas, this evolution would allow Web crawlers to get some sense of the meaning of the various parts of distributed RDF statements. What meaning would actually mean is a matter of continuing debate and discussion. But at minimum RDF schemas provide a mechanism for navigating established contracts for descriptions of Web resources.
Of course, there is no semantic Web yet, and there is no telling whether such a vision will ever survive the lethargy of webmasters, the test of scalability, or problems with shifting resources. The latter concern is that URLs are based on the domain-name system, which is constantly changing. RDF resources are actually URIs, which are a superset of URLs, but the other URI formats are very obscure compared to URLs, and they have not been tested in the pervasive use that URLs endure.
So if the ambitious goals of RDF are some time ahead of us and somewhat uncertain, why is RDF important?
I've already mentioned (and it is well discussed in the literature) how hard the Web has become to manage on a macro scale. This problem is the same even in limited domains. The well-known client/server revolution in application design brought about a paradigm where forms-and-display code plugs into a server data store. This approach and the development techniques associated with it are really only meant to handle a fixed and highly controlled database environment. The extension to three-tier and n-tier systems hasn't changed this much. The problem is that as applications migrate to the Web, the rigidity gets in the way of maintainability.
I use the term Web applications to describe any location on the Web with dynamic or interactive content. This ranges from portals to e-commerce sites. Increasingly, to be competitive, Web applications must assemble data from diverse sources and services; furthermore, requirements for such applications tend to be far more fluid in "Internet time." This is the sort of environment in which the extensibility of both XML and RDF really pays dividends. XML allows great flexibility for adaptation of data formats, and RDF provides great flexibility for adaptation of data-processing rules.
We have discussed some of the problems with the idea that RDF can turn the entire Web into a semantic network, but many of these problems are more easily dealt with in the controlled environment of a single application. A central RDF database can be put in place covering triples that describe resources, which are combined to form the views of the Web application. In fact, some of the core application objects, particularly the ones most subject to change, can be directly referenced by the RDF model. This becomes a database index, but one that can be more easily extended.
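The idea of a central triple database serving as an extensible index can be sketched in a few lines. Everything here is hypothetical: the `TripleStore` class, the example.org URIs, and the application vocabulary are invented for illustration, and a real store would index its triples rather than scan them.

```python
class TripleStore:
    """A tiny in-memory store of (subject, predicate, object) triples."""

    def __init__(self):
        self._triples = set()

    def add(self, subject, predicate, obj):
        self._triples.add((subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return triples matching the pattern; None acts as a wildcard."""
        pattern = (subject, predicate, obj)
        return sorted(
            t for t in self._triples
            if all(want is None or want == got for want, got in zip(pattern, t))
        )


# Hypothetical vocabulary and resources for a small portal application.
APP = "http://example.org/app/schema#"
HOME = "http://example.org/app/views/home"
NEWS = "http://example.org/app/feeds/news"
WEATHER = "http://example.org/app/feeds/weather"

store = TripleStore()
store.add(HOME, APP + "shows", NEWS)
store.add(HOME, APP + "shows", WEATHER)
store.add(NEWS, APP + "refresh-minutes", "15")
# A later requirement needs no schema migration, just another kind of triple.
store.add(NEWS, APP + "owner", "editorial")

if __name__ == "__main__":
    for triple in store.match(subject=HOME):
        print(triple)
```

The point of the design is the last `add` call: recording a new kind of fact about a resource does not require altering a table definition, which is where the easier extensibility comes from.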
Basically, RDF can provide Web-based applications an "escape hatch" from the strictures of traditional database design and application evolution. Some folks have been complaining for years that traditional database management tools are too highly structured, and therefore add hefty maintenance costs when the real world inevitably changes around the application. This faction (including the author) has long advocated a "semi-structured" approach to data management because it can drastically reduce maintenance costs. An RDF database working with a traditional database is one technology that goes a long way to addressing such concerns.
As a consultant I have made significant use of RDF to augment traditional databases in controlled but evolving systems. I've seen it reduce maintenance costs for portals, Web-based searching, and message indexing applications. As a heavy user of the Web, I can easily envision much of the advantage that XML, RDF, and the proclaimed semantic Web would provide.
RDF is by no means a perfect technology. Its serialization is rather rough around certain edges, and the only available RDF schema specification is almost completely toothless. RDF does have two powerful features: It is well designed to work with XML, which is designed for the Web and is quickly becoming the pervasive standard for data-exchange. It is also simple enough that even the troublesome edge cases are manageable.
If you already have a body of XML data, it is not very difficult to build a pilot program that creates indexes and rules for handling your XML data using RDF gleaned therefrom. Many RDF tools have already emerged, so you will rarely have to do much invention, and this approach would allow you to explore some of the advantages of RDF in closed systems. Meanwhile, it's even easier to annotate your Web content with RDF descriptions alongside your HTML meta tags, which would give you early entry into the promised semantic Web.
- The W3C maintains an RDF information page that's full of useful links to additional information about RDF.
- I especially recommend Pierre-Antoine Champin's RDF tutorial.
- The Dublin Core metadata initiative maintains a vocabulary for describing library-related metadata which has been suggested by the W3C for use on the Web in stating authorship, copyright, and so on, using RDF.
- You can see some practical use of RDF for rapid deployment of Web service descriptions in my developerWorks article on WSDL and RDF (developerWorks, November 2000).
- There is a summary of discussion and debate on the meaning of semantic Web in this XML-DEV posting and the ensuing thread. There was also a long discussion on the W3C's RDF interest mailing list that culminated in this XML-DEV posting.
- For testing the RDF in this article I used 4RDF, which you can download from the 4Suite site.
Uche Ogbuji is a consultant and co-founder of Fourthought Inc., a consulting firm specializing in XML solutions for enterprise knowledge management applications. Fourthought develops 4Suite, the open-source platform for XML middleware. Mr. Ogbuji is a computer engineer and writer born in Nigeria, living and working in Boulder, Colorado, USA. You can reach him at firstname.lastname@example.org.