Planning a Semantic Web site

Prepare your site for structured data

The Semantic Web brings opportunities for users to get smarter search results, and for site owners to get more targeted traffic as users find what they really want. But these benefits don't appear by magic. This article leads you through the aspects of both information architecture and general infrastructure you need to have in place to take full advantage of this burgeoning opportunity.

This article discusses what you need to know to make your Web site part of the Semantic Web. It starts with a discussion of the problems the Semantic Web tries to solve and then moves to the technologies involved, such as Resource Description Framework (RDF), Web Ontology Language (OWL), and SPARQL Protocol and RDF Query Language (SPARQL). You'll see how the Semantic Web is layered on top of the existing Web. It then covers some issues that you want to know about when you plan a new Web site and also gives specific examples of how to use technologies like RDFa and Microformats to enable your existing Web site to become a part of the Semantic Web.

Introduction to the Semantic Web

The World Wide Web is the largest single information resource humanity has ever produced. Unfortunately, although it depends entirely on computers to operate, most of its information can be understood only by humans, not by computers. Computers can use the syntax of HTML documents to display them to you in a browser, but they can't understand the content: the semantics.

Frequently used acronyms

  • API: application programming interface
  • HTML: Hypertext Markup Language
  • URI: Uniform Resource Identifier
  • W3C: World Wide Web Consortium
  • XML: Extensible Markup Language

The Semantic Web is Tim Berners-Lee's vision of the future of the Web. Although the dream is not yet realized, enough building blocks are now in place to enable you to take advantage of several Semantic Web technologies on your Web site, including RDF, OWL and SPARQL. The goal of the Semantic Web is to expose the vast information resource of the Web as data that computers can automatically interpret.

The Web was originally all about documents. The simple act of clicking on a link in your Web browser triggers your browser to ask a Web server to send you a document, which it then displays to you. The document might be your calendar for the next seven days, or it might be an e-mail from a friend. The Web browser doesn't really care; it just follows its internal rules for displaying the page. It's up to you to understand the information on the page.

Structuring data adds value to that data. With consistent structure, it can be used in more ways. You can see the demand for structured data today in the proliferation of APIs that have sprung up around Web sites as a part of the Web 2.0 trend—an API is structured data, and structured data from a variety of sources is what powers mashups. The idea behind mashups is that data is pulled from various sources on the Web and, when combined and displayed in a unified manner, this combination of elements adds value over and above the source information alone.

The individual APIs that everyone is busy building are to solve the exact same problem that the Semantic Web is intended to address: Expose the content of the Web as data and then combine disparate data sources in different ways to build new value. Rather than build and maintain your own API, you can build your Web site to take full advantage of the Semantic Web infrastructure which is already in place. If your Web site is your API, you can reduce the overall development and maintenance. Similarly, rather than build custom solutions for every Web site you want to pull data from, you can implement one solution based on Semantic Web technologies and have it work interchangeably across many Web sites—including Web sites you weren't even aware of before you began development.


Semantic Web technology overview

Semantic Web technologies can be considered in terms of layers, each layer resting on and extending the functionality of the layers beneath it. Although the Semantic Web is often talked about as if it were a separate entity, it is an extension and enhancement of the existing Web rather than a replacement of it.

Figure 1. The Semantic Web technology stack

As shown in Figure 1, the base layer of the Semantic Web is HTTP and URIs. These are commonly considered 'Web' rather than 'Semantic Web', but every proposed Semantic Web technology rests upon these Web fundamentals. URIs are the nouns of the Semantic Web. HTTP supplies the verbs (GET, PUT, and POST), as well as a number of thoroughly tested solutions in the fields of authentication and encryption.

The Resource Description Framework (RDF) is the workhorse of the Semantic Web. It is a grammar for encoding relationships. An RDF triple has three components: a subject, a predicate (or verb), and an object. Each can be expressed as a resource on the Web, that is, as a URI. This is far less ambiguous than encoding data in arbitrary XML documents. Compare the different ways of expressing a simple relationship in XML given in Listing 1 with the RDF triple in Listing 2.

Listing 1. Ambiguous relationships in XML
<author>
    <uri>page</uri>
    <name>Rob</name>
</author>

<person name="Rob">
    <work>page</work>
</person>

<document href="http://www.example.org/test/page" author="Rob" />

Listing 2 shows the RDF triple.

Listing 2. Expressing relationships in RDF
<page> <author> <Rob> .

The relationship expressed in all the examples shown in Listing 1 is 'Rob is the author of page'. It is a fairly simple statement, yet XML can express it in several ways. It would be very difficult to build software that can derive that relationship from every possible XML encoding. RDF, by contrast, expresses the relationship in only one way, so it becomes feasible to build generic parsers.
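To make the point about generic parsers concrete, here is a minimal Python sketch that reads a statement written in the simplified form used in Listing 2. It assumes every term is wrapped in angle brackets and the statement ends with a full stop; the Triple type and the parse_triple function are names invented for this illustration, and the sketch is not a complete parser for any real RDF serialization.

```python
import re
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF-style statement: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str


# Three <...> terms separated by whitespace and closed by a full stop,
# which is the shape of the statement in Listing 2.
TRIPLE_PATTERN = re.compile(r"^\s*<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.\s*$")


def parse_triple(line: str) -> Triple:
    """Parse one '<s> <p> <o> .' statement, or raise ValueError."""
    match = TRIPLE_PATTERN.match(line)
    if match is None:
        raise ValueError(f"not a simple triple: {line!r}")
    return Triple(*match.groups())


triple = parse_triple("<page> <author> <Rob> .")
print(triple.subject, triple.predicate, triple.obj)
```

Because every statement has the same three-part shape, one small function covers all of them. The three XML fragments in Listing 1 would each need their own extraction logic.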

In the early days of the Semantic Web, it was hoped that content producers would make all their content available in RDF and soon make a plethora of data available. Unfortunately, perhaps because the main XML expression of RDF looked unnecessarily complex, uptake was slow. More succinct RDF representations, like Notation3 (N3) and Terse RDF Triple Language (Turtle) are now available but have been unable to overcome the inertia. (For more on N3 and Turtle, see Resources.) A solution to the problem was inspired by the Microformats approach. With Microformats, semantic value is added to existing HTML content by using consistent patterns of standard HTML elements and attributes. Microformats exist for narrow but common items of data such as contact information and calendar items. The W3C equivalent is RDFa, RDF data embedded in XHTML. The implementation is slightly more complex than Microformats but it is far more generic—anything which you can express in RDF, you can add to XHTML documents using RDFa. Through this technique the Semantic Web can be bootstrapped by existing Web content.

Of course, RDF embedded in XHTML documents as RDFa is of no use to Semantic Web tools that require plain RDF as input. There needs to be an automatic method to recognize the presence of RDFa content and extract the RDF from it. The W3C solution for this is Gleaning Resource Descriptions from Dialects of Languages (GRDDL). The idea is that you run an existing XHTML document through an XSL transform to generate RDF. You associate the GRDDL transform with a document either by referencing it directly in the document or indirectly through profile and namespace documents.

Expressing semantics unambiguously with RDF is good, but even if everyone did so, it would be of little use without some way of knowing how the RDF from different sites is related. The RDF triple in Listing 2 expressed an author relationship in the predicate, and while the meaning might seem obvious to you, computers still need some help. If you expressed an author relationship in an RDF file on your own site, could a computer assume that the two mean the same thing? What if you instead had a writer relationship in your RDF triple? What you need is a way to express a common vocabulary, to be able to say that my author and your author are the same thing, or that 'author' and 'writer' are analogous. On the Semantic Web this problem is solved by ontologies, and the W3C standard for expressing ontologies is the Web Ontology Language (OWL). OWL is a large subject in its own right, and this article is concerned only with applying it, so see Resources for more information.
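The following Python sketch illustrates the idea of a shared vocabulary without using OWL itself. A plain dictionary stands in for the statements an ontology would make (OWL has constructs such as owl:equivalentProperty for this purpose). The vocabulary URIs, the site data, and the normalize function are all hypothetical, chosen only to show how data from two sites becomes comparable once the predicates are known to mean the same thing.

```python
# Hypothetical vocabulary URIs, invented for this illustration.
MY_AUTHOR = "http://example.org/terms/author"
THEIR_AUTHOR = "http://example.com/vocab/author"
THEIR_WRITER = "http://example.com/vocab/writer"

# Plays the role an ontology would play: each key is declared to mean
# the same thing as its value.
EQUIVALENT_PREDICATES = {
    THEIR_AUTHOR: MY_AUTHOR,
    THEIR_WRITER: MY_AUTHOR,
}


def normalize(triples):
    """Rewrite each predicate to its canonical form so sources can be merged."""
    return {
        (subject, EQUIVALENT_PREDICATES.get(predicate, predicate), obj)
        for subject, predicate, obj in triples
    }


# Made-up data from two sites that use different predicates.
my_site = [("http://example.org/page", MY_AUTHOR, "Rob")]
their_site = [
    ("http://example.com/essay", THEIR_WRITER, "Rob"),
    ("http://example.com/notes", THEIR_AUTHOR, "Ann"),
]

merged = normalize(my_site + their_site)
works_by_rob = sorted(s for s, p, o in merged if p == MY_AUTHOR and o == "Rob")
print(works_by_rob)
```

Without the mapping, a search for everything authored by Rob would find only the page on the first site. With it, the essay described with the writer predicate is found as well.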

Once you have some sources of data in RDF, and you have ontologies to let you determine the relationships between them, you need a way to get useful information out of them. The SPARQL Protocol and RDF Query Language (SPARQL, pronounced 'sparkle') provides an SQL-like syntax for expressing queries against RDF data, and the patterns inside a query are themselves written much like RDF triples. The fundamental paradigm for SPARQL is pattern matching, and it is designed to be flexible and to work across the Web on data combined from disparate sources. For example, matches can be described as optional, which makes it much better than SQL at querying ragged data. Ragged data has an unpredictable and unreliable structure, which is what you might expect to find if your data is combined from various sources on the Web rather than from a single well-contained SQL database.
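The pattern-matching paradigm, including optional matches, can be imitated in a few lines of Python. This is a toy sketch, not SPARQL and not a SPARQL engine: terms beginning with a question mark are treated as variables, and the functions match_pattern and query, along with the sample data, are assumptions made for this illustration.

```python
def match_pattern(pattern, triple, bindings):
    """Extend bindings so that pattern matches triple, or return None."""
    extended = dict(bindings)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the value it was first bound to.
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return extended


def query(triples, required, optional=()):
    """Return variable bindings that satisfy every required pattern."""
    solutions = [{}]
    for pattern in required:
        solutions = [
            extended
            for bindings in solutions
            for triple in triples
            if (extended := match_pattern(pattern, triple, bindings)) is not None
        ]
    for pattern in optional:
        next_solutions = []
        for bindings in solutions:
            matches = [
                extended
                for triple in triples
                if (extended := match_pattern(pattern, triple, bindings)) is not None
            ]
            # An optional pattern that finds nothing leaves the solution intact.
            next_solutions.extend(matches or [bindings])
        solutions = next_solutions
    return solutions


# Ragged data: page2 has an author but no title.
data = [
    ("page1", "author", "Rob"),
    ("page1", "title", "Planning"),
    ("page2", "author", "Ann"),
]
results = query(
    data,
    required=[("?page", "author", "?who")],
    optional=[("?page", "title", "?title")],
)
for row in results:
    print(row)
```

Both pages appear in the results even though only one has a title. Treating the title pattern as required instead would silently drop page2, which is the behavior that makes ragged data awkward to query with a rigid schema.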


Things you need to know when planning a Semantic Web site

As you've already seen, if you build the next great Web 2.0 site, you can save time if you plan from the start to embrace Semantic Web technologies and turn your Web site into an API, rather than create a separate API for your Web site. A Semantic Web approach gives you free API-like functionality. Usually an API is a way to get structured data, in XML or JSON format, out of an otherwise unstructured Web site. This leads to a dual approach: You have Web pages for human consumption and you have an API where computers can pull out structured information for automatic processing. However, this creates extra work for you; if you expect people to make use of your API, then you have to document it and support it and keep it synchronized with new features on your Web site. With a Semantic Web approach, your Web site is the structured data. You don't need a separate implementation. You and your users can take advantage of other Semantic Web tools to do automatic processing.

This does raise some issues for planning. With an API you are free to define your own data format for each item of information you want to deliver, and in the Semantic Web this is analogous to defining your own ontology. Ontology design is difficult to get right without experience, so you should first consider whether any of the many existing ontologies suit the types of data you plan to use; the next section discusses some of them. When you design an API, you also usually consider an object model for conceptual organization, so that developers can tell whether they are getting a collection of items or a single item, and which collections their items belong in. On a Semantic Web site this will be determined partly by your ontology choices and partly by your URI scheme. After looking at ontologies, you'll look at approaches to making your URIs usable as part of your API.

Finally, on an existing Web site, you and your users can still benefit from the Semantic Web, if you update your content to take advantage of GRDDL, RDFa and Microformats.


Evaluate your data in the context of existing ontologies

A more complex part of the Semantic Web is to design an ontology that matches up to your data. Arriving at the right ontology is usually a critical element of successful implementation of Semantic Web projects. Fortunately, many ontologies already exist. Table 1 lists some of them.

Table 1. Some ontologies in use on the Web today
  • Dublin Core: This metadata element standard for cross-domain information resource description provides a simple and standardized set of conventions for describing things online in ways that make them easier to find.
  • SIOC: The Semantically-Interlinked Online Communities Project is an ontology that expresses the information contained both explicitly and implicitly in Internet discussion methods, such as blogs, forums, or mailing lists.
  • FOAF: The Friend of a Friend ontology describes individuals, their activities and their relations to other people and objects. FOAF allows the description of social networks in a distributed fashion.
  • DOAP: Description of a Project is an ontology for describing open source projects.
  • ResumeRDF: This ontology expresses a Resume or Curriculum Vitae (CV), including information such as work and academic experience or skills.

In addition, many ontologies are domain specific in fields such as technology, environmental science, chemistry and linguistics. These will apply to fewer Web sites than those listed above, however. A lot of your data is likely to fit into at least one of the areas covered by the ontologies in Table 1, in which case you can incorporate them in your planning.


Choose a Semantic URI scheme

If your Web site is your API, then your URIs are the methods that programmers will access to get data. A sensible, succinct and consistent structure is therefore very important, and you need to think about it in advance because frequent changes after everything is launched will cost the goodwill of your target audience. You should also remember that the components of an RDF triple are usually URIs. To change them will invalidate most existing RDF which refers to your Web site.

In the early days of the Web, the structure of the URI usually reflected the organization of the files on a Web server. If you sold a particular type of widget among a collection of products, its URI might be similar to: http://www.mysite.com/products/gadgets/widget.html.

The advantage of this approach is that it is relatively semantically clear; if you also sold a doodad, then an obvious URI where you might expect to find the product details is: http://www.mysite.com/products/gadgets/doodad.html.

The relationship between the widget and the doodad is fairly clear. The main problem is that this approach is inflexible; the categorization hierarchy is fixed.

As the Web advanced, dynamically generated sites became the norm. But while the sites became more flexible, with structure no longer tied to a particular layout of files, the amount of semantic information in the URI decreased. The page you are shown is determined by some rather cryptic information in the query string. For instance, the URI of the widget might be: http://www.mysite.com/inventory.cgi?pid=12345 and the URI of the doodad might be: http://www.mysite.com/inventory.cgi?pid=67890.

Suddenly the URI gives you very little semantic value. It's certainly not clear that these two products might be in the same category. More recently, content management systems and Web development frameworks have started to address this issue. Now it's much easier to have semantically structured URIs yet retain the flexibility of dynamic pages. This is achieved through URIs that refer not to a physical file on the server, but to content which can be delivered from a script or page in a different location. In the trend-setting Ruby on Rails framework, this is achieved through routes (rules that map matching URLs to specific controllers and actions). In CMS packages, the feature usually depends on Apache's mod_rewrite (or its equivalent on other Web servers) and is often referred to as "Search Engine Friendly URIs" or something similar. When you choose a CMS or development framework for your site, be sure to investigate what it is capable of in this regard.
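The idea of a route can be sketched in a few lines of Python. This is not how Rails or any particular CMS implements routing. The ROUTES table, the handler names show_product and list_category, and the resolve function are all hypothetical, and serve only to show a semantically structured path being mapped to a handler and its parameters rather than to a file on disk.

```python
import re

# Each route pairs a path pattern with the name of the handler that serves it.
# More specific patterns come first.
ROUTES = [
    (re.compile(r"^/products/(?P<category>[^/]+)/(?P<product>[^/]+)$"), "show_product"),
    (re.compile(r"^/products/(?P<category>[^/]+)$"), "list_category"),
]


def resolve(path):
    """Return (handler_name, parameters) for a path, or None if nothing matches."""
    for pattern, handler in ROUTES:
        match = pattern.match(path)
        if match:
            return handler, match.groupdict()
    return None


print(resolve("/products/gadgets/widget"))
print(resolve("/products/gadgets"))
print(resolve("/inventory.cgi"))
```

The category and product names in the path become parameters for a single dynamic handler, so the URI keeps its meaning while the site stays free to store the data however it likes.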

One final note: If possible, consider removing file name extensions from your URIs. File name extensions (.html and .cgi) provide no semantic information that is relevant to the user, and they cause problems in the long run. If you change your Web site to use PHP instead of CGI scripts, you suddenly have different URIs that serve exactly the same content. This is bad for the semantic value of your URIs, as well as your Google ranking! A more semantically elegant method is to take advantage of HTTP headers to do content negotiation. Consider the following URI: http://www.mysite.com/products/gadgets/widget.

A Web browser will generally indicate its preferred content type using the Accept HTTP header. When asked for this resource, the Web server can check that header, note that text/html is one of the options, and serve an HTML page. If you have a mashup application that wants RDF, then the Accept header in the HTTP request should contain application/rdf+xml and the Web server, from the same URI, can serve an RDF version of the page.
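A server-side choice between representations might look something like the following Python sketch. It is deliberately simplified: it honors q values and the */* wildcard but ignores partial wildcards such as text/* and the finer precedence rules of HTTP. The function names and the AVAILABLE tuple are assumptions made for this illustration, not part of any Web server's API.

```python
# Representations this hypothetical server can produce for one URI,
# in the server's own order of preference.
AVAILABLE = ("text/html", "application/rdf+xml")


def parse_accept(header):
    """Return (media_type, q) pairs from an Accept header, best first."""
    preferences = []
    for item in header.split(","):
        parts = [part.strip() for part in item.split(";")]
        media_type = parts[0].lower()
        if not media_type:
            continue
        quality = 1.0  # q defaults to 1 when it is not given
        for parameter in parts[1:]:
            name, _, value = parameter.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        preferences.append((media_type, quality))
    # sorted() is stable, so equal q values keep their header order.
    return sorted(preferences, key=lambda pair: pair[1], reverse=True)


def choose_representation(accept_header, available=AVAILABLE):
    """Pick the media type to serve, or None if nothing is acceptable."""
    if not accept_header.strip():
        return available[0]  # no stated preference: serve the default
    for media_type, quality in parse_accept(accept_header):
        if quality <= 0:
            continue
        if media_type in available:
            return media_type
        if media_type == "*/*":
            return available[0]
    return None


browser = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
print(choose_representation(browser))
print(choose_representation("application/rdf+xml"))
```

With a typical browser header the sketch selects text/html, and for a client that asks only for application/rdf+xml it selects the RDF version, both from the same URI.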

At present this content negotiation functionality is not available in many off-the-shelf CMS solutions, but in the short term it should be possible for a lot of them to use URIs without file extensions, which means you can add this functionality in the future without upsetting your URI scheme.


Take advantage of existing semantic markup tools

Whether you fully embrace the Semantic Web in your Web site infrastructure, or just want to make your existing content more useful, there are probably several opportunities to add structure to existing content on your Web site. This is the domain of Microformats, RDFa and GRDDL. Table 2 lists the more common information types that you can easily mark up as structured data.

Table 2. Opportunities for structured markup and automatic transformation
Information type: Structured markup
  • People and Organizations: hCard, RDF vCard
  • Calendars and Events: hCalendar, RDF Calendar
  • Opinions, Ratings and Reviews: VoteLinks, hReview
  • Social Networks: XFN, FOAF
  • Licenses: rel-license
  • Tags, Keywords, Categories: rel-tag
  • Lists and Outlines: XOXO

Adding the structured markup to your page is fairly simple. Listings 3 and 4 below show a fragment of HTML containing contact information without, and then with, the additional markup required for the RDF vCard, respectively.

Listing 3. Unstructured contact information
<div class="contactinfo">
  Rob Crowther. Web hacker
  at
  <a href="http://example.org">
    Example.org
  </a>.
  You can contact me
  <a href="mailto:robertc@example.org">
    via e-mail
  </a> or on my work phone at 0123 456789.
</div>

Listing 4 shows the contact information with additional markup required for the RDF vCard.

Listing 4. Contact Information using vCard
<div xmlns:contact="http://www.w3.org/2001/vcard-rdf/3.0#" class="contactinfo"  
                                     about="http://example.org/staff/robertc">
  <span property="contact:fn">Rob Crowther</span>.
  <span property="contact:title">Web hacker</span>
  at
  <a rel="contact:org" href="http://example.org">
    Example.org
  </a>.
  You can contact me
  <a rel="contact:email" href="mailto:robertc@example.org">
    via e-mail
  </a>
  or on my 
  <span property="contact:tel">
    <span property="contact:type">work</span>
    phone at
    <span property="contact:value">0123 456789</span>
  </span>.
</div>

In Listing 4, you can see span elements added to delimit the semantically significant bits of text, and attributes that indicate what they mean. First, you added the namespace prefix "contact", bound to the RDF vCard vocabulary. Next, you indicated that this element is about the resource represented by the URI http://example.org/staff/robertc. Then, you added metadata using the rel attribute for link relationships and the property attribute on non-links. The only slightly complex part is the telephone number, because you need to specify a type as well as the number itself. To achieve this, you nest the type and value elements inside the tel element. With this structure in place, software that understands the markup can offer to add the contact details to a user's address book with a single click of the mouse.
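To give a feel for what a tool does with this markup, the following Python sketch pulls simple statements out of a shortened version of the fragment. It is emphatically not a conforming RDFa processor: it assumes well-formed markup in which every element is closed, does not expand namespace prefixes, and handles only the about, property, and rel plus href attributes. Nested properties such as the telephone number are left out of the sample for that reason. The SimpleRDFaExtractor class and the extract_triples function are names invented for this illustration.

```python
from html.parser import HTMLParser


class SimpleRDFaExtractor(HTMLParser):
    """Collect (subject, predicate, value) tuples from a small RDFa subset."""

    def __init__(self):
        super().__init__()
        self.triples = []
        self._stack = []  # one frame per currently open element

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        inherited = self._stack[-1]["subject"] if self._stack else None
        # An about attribute sets the subject for this element and its children.
        subject = attributes.get("about", inherited)
        if "rel" in attributes and "href" in attributes:
            self.triples.append((subject, attributes["rel"], attributes["href"]))
        self._stack.append(
            {"subject": subject, "property": attributes.get("property"), "text": []}
        )

    def handle_data(self, data):
        # Text belongs to every open element that carries a property attribute.
        for frame in self._stack:
            if frame["property"]:
                frame["text"].append(data)

    def handle_endtag(self, tag):
        if not self._stack:
            return
        frame = self._stack.pop()
        if frame["property"]:
            literal = " ".join("".join(frame["text"]).split())
            self.triples.append((frame["subject"], frame["property"], literal))


def extract_triples(markup):
    parser = SimpleRDFaExtractor()
    parser.feed(markup)
    parser.close()
    return parser.triples


FRAGMENT = """
<div about="http://example.org/staff/robertc">
  <span property="contact:fn">Rob Crowther</span>.
  <span property="contact:title">Web hacker</span> at
  <a rel="contact:org" href="http://example.org">Example.org</a>.
  <a rel="contact:email" href="mailto:robertc@example.org">via e-mail</a>
</div>
"""

for statement in extract_triples(FRAGMENT):
    print(statement)
```

Each printed tuple has the same subject, the URI given in the about attribute, which is what lets a consuming application gather the name, title, organization, and e-mail address into a single contact record.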

Other automatic processing is possible with the other structured forms; for example, Technorati makes use of the rel-tag microformat to categorize its vast aggregation of blog posts. A rel-tag is shown in Listing 5, and as you can see, it is simply a link that makes use of the rel attribute. The significant part is the last bit of the URI, after the final /. This is the tag (using the normal URI encoding conventions where a space is represented by the plus sign).

Listing 5. rel-tag for Technorati for the tag 'semantic web'
<a href="http://technorati.com/tag/semantic+web" rel="tag">
  Semantic Web
</a>
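
The rule for reading the tag out of the link can be expressed as a short Python function. The function name tag_from_href is made up for this illustration; the standard library calls it relies on, urlsplit and unquote_plus, decode the plus sign as a space in the way described above.

```python
from urllib.parse import unquote_plus, urlsplit


def tag_from_href(href):
    """Return the tag named by a rel-tag link: the text after the final '/'."""
    path = urlsplit(href).path
    last_segment = path.rsplit("/", 1)[-1]
    # unquote_plus turns '+' into a space and decodes %XX escapes.
    return unquote_plus(last_segment)


print(tag_from_href("http://technorati.com/tag/semantic+web"))
```

Note that the visible link text plays no part in this: only the last segment of the URI determines the tag.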

If you write a blog post related to the Semantic Web that includes the code from Listing 5 and then ping Technorati to let them know you made a new post (a lot of blog software can be configured to do this automatically), then their crawler will index your post and add a summary of it to the page that your tag element links to, along with any other posts with the same tag (see Figure 2).

Figure 2. The 'semantic web' page on Technorati, generated from rel-tag

Conclusion

In this article, you saw how Semantic Web technologies address the need for structured data on the Web in a standard and consistent manner, in contrast to the currently popular approach in which each Web site defines its own API. You looked at how the Semantic Web technologies add value in layers on top of the HTTP and URIs of the existing Web: first allowing the unambiguous expression of relationships with RDF, then allowing for shared meaning with OWL-based ontologies, and finally querying the distributed Web of knowledge using SPARQL. The article also looked at how you can take advantage of existing ontologies to define what your data is and use a semantic URI scheme to enable your Web site to also be your API. Finally, the article looked at how you can upgrade the content of your existing Web site using RDFa and Microformats so that GRDDL services can automatically extract RDF from your pages.

Although the promise of Tim Berners-Lee's Semantic Web is yet to be fully realized, the years of thinking and research that have gone into it are starting to bear fruit in terms of solutions to practical problems that people face today. The strong collaboration trends in Web 2.0 will only lead to more requirements for structured and semantically encoded data being available on the Web. With some planning, you can be in position to take advantage of the Semantic Web tools which help meet that need.

Resources

Learn

Get products and technologies

  • IBM trial software: Build your next development project with trial software available for download directly from developerWorks.

Discuss

publish-date=04102008