Throughout this column I've placed strong emphasis on the aspects of Web 2.0 that concern open, shared data rather than flashy effects. Certainly Ajax is important because when used well it can enhance the usability of Web sites. But Web feeds, open, Web-friendly APIs, and third-party plug-in and mashup capabilities are the real substance of Web 2.0. One community closely associated with the Web's original stewards, the W3C, is committed to a particular, coherent set of practices along these lines. The Linking Open Data (LOD) community combines the vision of the W3C for using semantic features to enhance the Web with the pragmatism that characterizes mainstream Web 2.0. At the heart of this community is a semi-official project of the W3C, which says on its main Wiki page (see Resources):
The goal of the W3C SWEO [Semantic Web Education and Outreach] Linking Open Data community project is to extend the Web with a data commons by publishing various open datasets as RDF on the Web and by setting RDF links between data items from different data sources.
The emphasis on RDF is natural for the W3C, which has been pushing the technology for a decade, but one development that gives LOD extra legs is the emergence of influential voices realizing that insistence on strict RDF format across the board is probably not the best present strategy for winning over Web developers. LOD supports RDF as a conceptual model, but the new emphasis is more on linking and openness than on any one syntax. After all, RDF is merely URIs, links, and labels, so any model that includes these three can readily work with RDF systems. The full LOD community is a penumbra around the W3C-led core who support all the advantages of opening up data that I've discussed so far in this column, and who see RDF, Atom, JSON, and so on as merely tools for Web developers to open up their data.
What is the Web of data?
The LOD community in all its diversity draws inspiration from the thoughts of the Web's inventor, Tim Berners-Lee. In his article, "Giant Global Graph" (see Resources) he expressed the basic evolution of the idea with a few apt observations:
- 'The realization [behind creation of the Internet] was, "It isn't the cables, it is the computers which are interesting". The Net was designed to allow the computers to be seen without having to see the cables.'
- 'The [World Wide Web] increases the power we have as users again. The realization was "It isn't the computers, but the documents which are interesting". Now you could browse around a sea of documents without having to worry about which computer they were stored on.'
- 'Now, people are making another mental move. There is realization now, "It's not the documents, it is the things they are about which are important".'
Berners-Lee points out that all stages of this evolution are about webs of links—a web of computers (though we call this a "network" rather than a "web"), a web of documents (what most people call "The Web"), and ultimately a web of all the things we want to share. He argues that we should extend the basic principles of the Web more directly to data (for example, the contents of traditional databases), and that we should not be shy about making links to non-computer resources such as people, tangible and abstract things, places, and so on. I'll discuss the trick to linking to non-computer resources in a later section. This expansive view of linking is called "the Web of data" and forms the basis of LOD.
To see what LOD means in practice, start with the four basic principles Berners-Lee drafted in another paper, "Linked Data" (see Resources). As paraphrased by Wikipedia, these are:
- Use URIs to identify things that you expose to the Web as resources.
- Use HTTP URIs so that people can locate and look up (dereference) these things.
- Provide useful information about the resource when its URI is dereferenced.
- Include links to other, related URIs in the exposed data as a means of improving information discovery on the Web.
Principle 1 means you should try to expose information as much as possible using URIs. Not just Web pages, but front-office application documents, database rows and metadata, personal data, transactional logs, business rules and policies, and even services. If sharing it is useful, look to give its component pieces URIs. You might wonder about security. Perhaps you're used to relying on the traditional application to protect your data. Remember that people bank on the Web. They make stock trades on the Web. They book travel and buy things on the Web. The Web is well-proven as a secure data conduit, provided best practices are followed.
Principle 2 means you should give up obscure ID schemes, even obscure URI schemes, and just stick to HTTP, which has served the Web so well. This ensures that the widest variety of tools and resources will be able to access your data.
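To make principles 1 and 2 a little more concrete, here is a minimal Python sketch of minting plain HTTP URIs for database rows. The base address data.example.com, the table/key path layout, and the mint_uri helper are all my own assumptions for illustration; nothing in LOD mandates this particular pattern.

```python
from urllib.parse import quote

# Hypothetical base address for the data you want to expose.
BASE = "http://data.example.com"


def mint_uri(table: str, key: object) -> str:
    """Build an HTTP URI of the form <base>/<table>/<key> for one row.

    Both path segments are percent-encoded so that unusual keys
    (spaces, slashes) still yield a valid URI.
    """
    return f"{BASE}/{quote(table, safe='')}/{quote(str(key), safe='')}"


# Invented sample rows standing in for a real database table.
rows = [{"id": 17, "name": "Joe Cool"}, {"id": 42, "name": "Ann Other"}]

# Map each row's primary key to the URI that identifies it on the Web.
uris = {row["id"]: mint_uri("customers", row["id"]) for row in rows}
```

The point of the design is simply that every row gets a stable, dereferenceable name that any HTTP tool can use.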
Principle 3 means that the data that you provide to people who access the data URIs should be in a common format suitable for sharing on the Web. XML is one obvious candidate for this, but not all XML is suitable. You would need to use XML in a way that is semantically transparent, which means that the constructs in the XML are described in a rich way that can be processed by a machine. RDF is the main format used in the LOD community. It offers very high semantic transparency, but support for RDF is not yet as widespread as support for XML. One way to get the best of both worlds is to use GRDDL, a system for viewing XML through an RDF lens.
You might be wondering about JSON and microformats, which have gained so much fame in the Web 2.0 world. The problem with these is that they are almost always even less semantically transparent than XML, although GRDDL can be used with microformats.
Principle 4 is basically the "share the wealth" principle. The first three encourage you to make possible Web pointers to data, and to maximize the usefulness of the data at those pointers. Once you have these pointers, you should not be shy about using them. Provide links as broadly as you can. You never know how someone or some machine is going to choose to navigate your Web of data, and the entire goal of LOD is to make it easy to use data in ways that were not originally conceived.
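As a rough illustration of principles 3 and 4, the following Python sketch describes one resource as a handful of triples (URIs, links, and labels) and writes them out as N-Triples, one of the simplest RDF syntaxes. The example.com, example.org, and example.net URIs and the data itself are invented; the name, homepage, and knows properties come from the FOAF vocabulary. Treat this as a sketch rather than a full RDF serializer: it handles only basic string escaping and plain string literals.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple, Union


@dataclass(frozen=True)
class URI:
    value: str


@dataclass(frozen=True)
class Literal:
    value: str


Term = Union[URI, Literal]
Triple = Tuple[URI, URI, Term]

FOAF = "http://xmlns.com/foaf/0.1/"


def format_term(term: Term) -> str:
    """Render a URI as <...> and a literal as a quoted, escaped string."""
    if isinstance(term, URI):
        return f"<{term.value}>"
    escaped = (term.value.replace("\\", "\\\\")
                         .replace('"', '\\"')
                         .replace("\n", "\\n"))
    return f'"{escaped}"'


def to_ntriples(triples: Iterable[Triple]) -> str:
    """Serialize triples one per line: subject, predicate, object, dot."""
    return "\n".join(
        f"{format_term(s)} {format_term(p)} {format_term(o)} ."
        for s, p, o in triples
    )


# Invented identifier and data, for illustration only.
person = URI("http://data.example.com/people/17")
description = [
    (person, URI(FOAF + "name"), Literal("Jane Example")),
    # Principle 4: link out to related URIs, including other people's data.
    (person, URI(FOAF + "homepage"), URI("http://jane.example.org/")),
    (person, URI(FOAF + "knows"), URI("http://other.example.net/people/42")),
]

print(to_ntriples(description))
```

The last two statements are the "share the wealth" part: they point outside the data set, so a person or a program can keep following links into other data.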
The principles set forth in the previous section make a lot of sense for stuff that's already available as computer data. Documents, files, databases, and such are called "information resources." But Berners-Lee says, "It's not the documents, it is the things they are about which are important." Obviously many of the things in this category are not information resources. They are people, places, and other tangible and abstract things. How do you create a Web of stuff whose existence is not contained within a computer? This is where LOD has picked up on a very clever trick. You go ahead and give such things a URI anyway, say http://censusdata.example.com/joe.cool. You don't use Joe Cool's home page (http://joe.cool.name/heyjoe.html) to identify him, because this would confuse anyone following such a link. Is the link a relationship with the person, or with his home page document?
The trick is that when anyone goes to the identifier for Joe Cool himself, http://censusdata.example.com/joe.cool, they get back a special HTTP response code that says "this is a non-information resource, so I can't give you the resource itself, but I can give you some links where you can get more information about it." This special HTTP code is the 303 (See Other), and the main link to related information might be the link to Joe Cool's home page, http://joe.cool.name/heyjoe.html. The 303 step, however, makes it perfectly clear that the original identifier addressed the person, not his home page.
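Here is a minimal sketch of the 303 trick using only Python's standard library. The SEE_OTHER table, the LinkedDataHandler class, and the make_server helper are names I made up for illustration, and a real deployment would more likely do this in its existing Web server or framework. The sketch assumes the server answers for censusdata.example.com, so the path /joe.cool stands for Joe Cool himself.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical table: path of a non-information resource mapped to a
# document that describes it (the URLs are the ones used in the text).
SEE_OTHER = {
    "/joe.cool": "http://joe.cool.name/heyjoe.html",
}


class LinkedDataHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        target = SEE_OTHER.get(self.path)
        if target is None:
            # Nothing is known under this identifier.
            self.send_error(404, "Unknown resource")
            return
        # 303 See Other: "I can't send you the thing itself,
        # but here is where to read about it."
        self.send_response(303)
        self.send_header("Location", target)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, format, *args):
        # Keep the sketch quiet; a real server would log requests.
        pass


def make_server(port: int = 0) -> HTTPServer:
    """Create (but do not start) a server; port 0 picks a free port."""
    return HTTPServer(("127.0.0.1", port), LinkedDataHandler)
```

Calling make_server(8000).serve_forever() would run it locally. A client that requests /joe.cool then receives a 303 status with a Location header pointing at the home page, so the identifier for the person and the address of the document about him never get confused.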
I've long been a sceptic of systems that try to couple the identities of non-information resources too closely to computer representations. I think there are basic philosophical problems with doing so, and that it tends to add a lot of complexity to systems. I must admit that the 303 trick is the least complex approach I've seen, and I'll be interested in watching how it develops with wider practice. Some of the philosophical issues do remain, but I don't think I exaggerate when I say that, if successful, 303s could open up a new era of information systems as profound as the original Web, with its annoying yet enabling 404s.
The LOD community maintains a diagram of significant, public data sets that are available using LOD principles. Figure 1 is a recent version of that diagram.
Figure 1. LOD datasets
You can find a link to a clickable version of this diagram in Resources. The size of each bubble is a rough indication of the amount of data in that data set. Some interesting items are:
- Freshmeat, one of the classic sites that lists open-source software
- MusicBrainz, an online database of digital music tracks and albums
- Project Gutenberg, a venerable initiative to make out-of-copyright texts freely available
- FOAF, an RDF approach to social networking
- DBpedia, an LOD wrapper around Wikipedia articles
Despite the Web's wild success, much can still be done to improve it. At the heart of the W3C's efforts to do so is the semantic Web, which would create a network of semantically transparent data. LOD is basically a very Web-developer-friendly path to the semantic Web, and one that neatly complements the most important Web 2.0 concepts. I've discussed mashups before in this column. You take a service output from site A and mix it into one from site B. With LOD this doesn't have to be such a conscious process, specialized for each component site. You really get to draw transparently from a wealth of data and services scattered across the Web. Some will be free for use, and some will be restricted for security or commerce, but these are just details that Web developers have already sorted out, for the most part.
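To suggest why shared identifiers take much of the site-specific work out of a mashup, here is a small Python sketch that merges statements from two sources purely by subject URI. The merge_by_uri function, the music.example.org identifier, and the sample data for "site A" and "site B" are all invented for illustration; real LOD data would arrive as RDF or a similar format rather than as Python tuples.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# One statement: (subject URI, property name, value).
Statement = Tuple[str, str, str]


def merge_by_uri(*sources: Iterable[Statement]) -> Dict[str, Dict[str, Set[str]]]:
    """Combine statements from any number of sources.

    Statements about the same subject URI end up together, and each
    property keeps the set of all values seen, so sources that agree
    contribute a single value and sources that differ keep both.
    """
    merged = defaultdict(lambda: defaultdict(set))
    for source in sources:
        for subject, prop, value in source:
            merged[subject][prop].add(value)
    return {subject: dict(props) for subject, props in merged.items()}


# Invented shared identifier and invented data from two hypothetical sites.
ARTIST = "http://music.example.org/artist/123"

site_a = [(ARTIST, "name", "The Examples"), (ARTIST, "album", "First Try")]
site_b = [(ARTIST, "name", "The Examples"), (ARTIST, "hometown", "Springfield")]

combined = merge_by_uri(site_a, site_b)
```

Because both sites use the same URI for the artist, the merge needs no knowledge of either site's internals, and a third source could be added without changing the function.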
LOD means making it easier for people to discover important things you place on the Web, and making it easier for them to do unexpected, fruitful things with them. The next time you have a Web project, start by thinking about which information and non-information resources the Web app represents. Then do everything you can to give each one a well-designed HTTP URI and a semantically rich data format, and create links, links, and more links.
Resources
- Find a very friendly introduction to LOD through a presentation Tom Coates made available on the Web: "Native to a Web of Data."
- Tim Berners-Lee assembles the basics of the web of data in his paper, "Linked Data."
- Tim Berners-Lee puts some of the LOD concepts into perspective in his note, "Giant Global Graph."
- Find a wealth of useful links on LOD at the W3C SWEO Linking Open Data community project home page.
- See the clickable version of the LOD dataset diagram.
- The low-level technical detail for those interested in LOD is explained in "How to Publish Linked Data on the Web," by Chris Bizer, Richard Cyganiak, and Tom Heath.
- "The future of the Web is Semantic," by Naveen Balani, is a good introduction to the semantic Web.
- Dan Connolly explains the various sorts of Web names and identifiers in "URIs, URLs, and URNs."
- Tim Berners-Lee discusses his linked data principles, among many other topics, in this interview.