Level: Introductory
Rob Crowther (robert@crowther.info), Web developer, Freelance
04 Mar 2008, updated 10 Apr 2008
The Semantic Web brings opportunities for users to get smarter search results, and for site owners to get more targeted traffic as users find what they really want. But these benefits don't just magically appear. This article leads you through the aspects of both information architecture and general infrastructure you need in place to truly take advantage of this burgeoning opportunity.
This article discusses what you need to know to make your Web site part of the Semantic
Web. It starts with a discussion of the problems the Semantic Web tries to solve and
then moves to the technologies involved, such as Resource Description Framework (RDF),
Web Ontology Language (OWL), and SPARQL Protocol and RDF Query Language (SPARQL). You'll see how the
Semantic Web is layered on top of the existing Web. It then covers some issues that
you want to know about when you plan a new Web site and also gives specific examples of
how to use technologies like RDFa and Microformats to enable your existing Web site to become a part of the Semantic Web.
Introduction to the Semantic Web
The World Wide Web is the largest single information resource humanity has ever
produced. Unfortunately, despite its dependence on computers to operate at all, most of
the information is only understandable by humans and not by computers. While computers
can use the syntax of HTML documents to display them to you in a browser, they can't understand the content, that is, the semantics.
Frequently used acronyms
- API: application programming interface
- HTML: Hypertext Markup Language
- URI: Uniform Resource Identifier
- W3C: World Wide Web Consortium
- XML: Extensible Markup Language
The Semantic Web is Tim Berners-Lee's vision of the future of the Web. Although the
dream is not yet realized, enough building blocks are now in place to enable you to take advantage of several Semantic Web technologies on your Web site, including RDF, OWL and SPARQL. The goal of the Semantic Web is to expose the vast information resource of the Web as data that computers can automatically interpret.
The Web was originally all about documents. The simple act of clicking on a link in
your Web browser triggers your browser to ask a Web server to send you a document, which
it then displays to you. The document might be your calendar for the next seven days,
or it might be an e-mail from a friend. The Web browser doesn't really care; it just
follows its internal rules for displaying the page. It's up to you to understand the
information on the page.
Structuring data adds value to that data. With consistent structure, it can be used in
more ways. You can see the demand for structured data today in the proliferation of APIs
that have sprung up around Web sites as a part of the Web 2.0 trend—an
API is structured data, and structured data from a variety of sources is what
powers mashups. The idea behind mashups is that data is pulled from various sources on the Web and, when combined and displayed in a unified manner, this combination of elements adds value over and above the source information alone.
The individual APIs that everyone is busy building aim to solve the exact same problem
that the Semantic Web is intended to address: Expose the content of the Web as data and
then combine disparate data sources in different ways to build new value. Rather than
build and maintain your own API, you can build your Web site to take full advantage of
the Semantic Web infrastructure which is already in place. If your Web site is your API,
you can reduce the overall development and maintenance effort. Similarly, rather than build
custom solutions for every Web site you want to pull data from, you can implement one
solution based on Semantic Web technologies and have it work interchangeably across many
Web sites—including Web sites you weren't even aware of before you began development.
Semantic Web technology overview
Semantic Web technologies can be considered in terms of layers, each layer resting on and extending the functionality of the layers beneath it. Although the Semantic Web is often talked about as if it were a separate entity, it is an extension and enhancement of the existing Web rather than a replacement of it.
Figure 1. The Semantic Web technology stack
As shown in Figure 1, the base layer of the Semantic Web is HTTP and URIs. These are commonly considered 'Web' rather than 'Semantic Web', but every proposed Semantic Web technology rests upon these Web fundamentals. URIs are the nouns of the Semantic Web. HTTP supplies the verbs (GET, PUT and POST), as well as a number of thoroughly tested solutions in the fields of authentication and encryption.
The Resource Description Framework (RDF) is the workhorse of the Semantic Web. It is a
grammar for encoding relationships. An RDF triple has three components: a subject, a
predicate (or verb), and an object. Each can be expressed as a resource on the Web, that is,
a URI. This is far less ambiguous than encoding data in random XML documents. Compare
the different ways of expressing a simple relationship in XML given in Listing 1 with
the RDF triple in Listing 2.
Listing 1. Ambiguous relationships in XML
<author>
<uri>page</uri>
<name>Rob</name>
</author>
<person name="Rob">
<work>page</work>
</person>
<document href="http://www.example.org/test/page" author="Rob" />
Listing 2 shows the RDF triple.
Listing 2. Expressing relationships in RDF
The relationship expressed in all the examples shown in Listing 1 is
'Rob is the author of page'—a fairly simple statement—yet expressed in
several ways in XML. It would be very difficult to build software that can derive that
relationship from all the possible ways to express it in XML. But RDF expresses that
relationship in only one way, so it becomes feasible to build generic parsers.
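To make the contrast concrete, here is a minimal Python sketch (not part of the original listings). It needs a separate hand-written rule for each of the three XML shapes in Listing 1 just to arrive at one and the same (subject, predicate, object) triple. The base URI used to resolve the relative name "page", the predicate URI `http://example.org/terms/author` and the function name `triple_from_xml` are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Assumed for illustration only: a base URI for the relative name "page"
# and a made-up predicate URI that belongs to no real vocabulary.
BASE = "http://www.example.org/test/"
AUTHOR = "http://example.org/terms/author"

# The three fragments from Listing 1.
XML_VARIANTS = [
    "<author><uri>page</uri><name>Rob</name></author>",
    '<person name="Rob"><work>page</work></person>',
    '<document href="http://www.example.org/test/page" author="Rob" />',
]


def triple_from_xml(fragment):
    """Map one of the Listing 1 shapes to a (subject, predicate, object) tuple."""
    root = ET.fromstring(fragment)
    # Every XML shape needs its own extraction rule.
    if root.tag == "author":
        page, name = root.findtext("uri"), root.findtext("name")
    elif root.tag == "person":
        page, name = root.findtext("work"), root.get("name")
    elif root.tag == "document":
        page, name = root.get("href"), root.get("author")
    else:
        raise ValueError("no extraction rule for <%s>" % root.tag)
    # Read as: the page (subject) has author (predicate) Rob (object).
    return (urljoin(BASE, page), AUTHOR, name)


# All three fragments collapse to a single triple.
triples = {triple_from_xml(fragment) for fragment in XML_VARIANTS}
print(triples)
```

A fourth XML shape would need a fourth rule, whereas data that already arrives as triples needs no per-site rules at all.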
In the early days of the Semantic Web, it was hoped that content
producers would make all their content available in RDF and soon make a
plethora of data available. Unfortunately, perhaps because the main XML
expression of RDF looked unnecessarily complex, uptake was slow. More
succinct RDF representations, like Notation3 (N3) and Terse RDF Triple
Language (Turtle) are now available but have been unable to overcome the
inertia. (For more on N3 and Turtle, see Resources.) A solution to the problem was inspired by the Microformats
approach. With Microformats, semantic value is added to existing HTML
content by using consistent patterns of standard HTML elements and
attributes. Microformats exist for narrow but common items of data such
as contact information and calendar items. The W3C equivalent is RDFa,
RDF data embedded in XHTML. The implementation is slightly more complex
than Microformats but it is far more generic—anything which you can
express in RDF, you can add to XHTML documents using RDFa. Through this
technique the Semantic Web can be bootstrapped by existing Web content.
Of course, RDF embedded in XHTML documents as RDFa is of no use to Semantic
Web tools that require plain RDF as input. There needs to be an automatic method to
recognize the presence of RDFa content and extract the RDF out of it. The W3C solution
for this is Gleaning Resource Descriptions from Dialects of Languages (GRDDL). The idea
is that you run an existing XHTML document through an XSL transform to generate RDF. You
can then link the GRDDL transform either through direct inclusion of references or indirectly through profile and namespace documents.
Unambiguously expressed semantics with RDF are good, but even if everyone provided them, they
would be of little use if you had no idea how the RDF from different sites is related. The
RDF triple in Listing 2 expressed an author relationship in the
predicate, and while the meaning might seem obvious to you, computers still need some
help. If you expressed an author relationship in an RDF file on your site, could the
computer assume they were the same thing? What if you instead had a writer relationship
in your RDF triple? What you need is a way to express a common vocabulary, to be able
to say that my author and your author are the same thing, or that 'author' and 'writer' are analogous. On the Semantic Web this problem is solved by ontologies, and the W3C standard for expressing ontologies is the Web Ontology Language (OWL). OWL is a large subject in its own right, and since this article is concerned only with its applications, see Resources for more information about it.
Once you have some sources of data in RDF, and you have ontologies to let you determine
the relationships between them, you need a way to get useful information out of them.
The SPARQL Protocol and RDF Query Language (SPARQL, pronounced 'sparkle') provides an
SQL-like syntax for expressing queries against RDF data, and the queries themselves look
and act like RDF data. The fundamental paradigm for SPARQL is pattern matching and it
is designed to work across the Web on data combined from disparate sources and to be
flexible. For example, matches can be described as optional, which makes it much better
than SQL at querying ragged data. Ragged data has an unpredictable and
unreliable structure, which is what you might expect to find if your data is combined from various sources on the Web rather than from a single well-contained SQL database.
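The pattern-matching idea, including optional matches over ragged data, can be sketched in plain Python. This is not SPARQL and not a SPARQL engine; it is a toy matcher over tuples, written under the assumption that terms beginning with "?" are variables. The `ex:` names, the sample data (including the person "Ann") and the function names are hypothetical. The query it mimics would read roughly `SELECT ?name ?email WHERE { ?person ex:name ?name . OPTIONAL { ?person ex:email ?email } }`.

```python
# Hypothetical sample data: the second person has no e-mail address,
# which is the kind of ragged data described above.
DATA = [
    ("ex:rob", "ex:name", "Rob"),
    ("ex:rob", "ex:email", "mailto:robertc@example.org"),
    ("ex:ann", "ex:name", "Ann"),
]


def unify(pattern, triple, bindings):
    """Match one triple pattern against one triple.

    Terms starting with "?" are variables. Returns the extended bindings,
    or None when the triple does not fit the pattern.
    """
    result = dict(bindings)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the value it was bound to earlier.
            if result.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return result


def extend(solution, pattern, triples):
    """Return every way of extending one solution with one pattern."""
    matches = (unify(pattern, triple, solution) for triple in triples)
    return [m for m in matches if m is not None]


def select(triples, required, optional=()):
    """Find all variable bindings that satisfy the required patterns."""
    solutions = [{}]
    for pattern in required:
        solutions = [m for s in solutions for m in extend(s, pattern, triples)]
    for pattern in optional:
        # An optional pattern adds bindings when it matches and otherwise
        # leaves the solution untouched instead of discarding it.
        solutions = [m for s in solutions
                     for m in (extend(s, pattern, triples) or [s])]
    return solutions


rows = select(
    DATA,
    required=[("?person", "ex:name", "?name")],
    optional=[("?person", "ex:email", "?email")],
)
for row in rows:
    print(row.get("?name"), row.get("?email"))
```

Both people appear in the result. The row for the person without an e-mail address simply lacks a binding for `?email`, where an SQL inner join would have dropped that row.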
Things you need to know when planning a Semantic Web site
As you've already seen, if you build the next great Web 2.0 site, you can save time
if you plan from the start to embrace Semantic Web technologies and turn your Web site
into an API, rather than create a separate API for your Web site. A Semantic Web
approach gives you free API-like functionality. Usually an API is a way to get
structured data, in XML or JSON format, out of an otherwise unstructured Web site. This
leads to a dual approach: You have Web pages for human consumption and you have an API
where computers can pull out structured information for automatic processing. However,
this creates extra work for you; if you expect people to make use of your API, then you
have to document it and support it and keep it synchronized with new features on your
Web site. With a Semantic Web approach, your Web site is the structured data. You don't
need a separate implementation. You and your users can take advantage of other Semantic Web tools to do automatic processing.
This does raise some issues for planning. With an API you are free to define your own
data format for each item of information you want to deliver, and in the Semantic Web
this is analogous to defining your own ontology. Ontology design can be a difficult
thing to get right with little experience, so you should consider whether any of the
large array of existing ones will be suitable for the types of data you plan to use,
which will be discussed in the next section. When you design an
API, you also usually consider an object model for conceptual organization so developers
can understand when they get collections of items or just items, and which collections
their items belong in. On a Semantic Web site this will be partly determined by your ontology choices, but also by your URI scheme. Next, you'll look at approaches to making your URIs usable as part of your API.
Finally, on an existing Web site, you and your users can still benefit from the
Semantic Web, if you update your content to take advantage of GRDDL, RDFa and Microformats.
Evaluate your data in the context of existing ontologies
A more complex part of the Semantic Web is to design an ontology that matches up to
your data. Arriving at the right ontology is usually a critical element of successful
implementation of Semantic Web projects. Fortunately, many ontologies already exist. Table 1 lists some of them.
Table 1. Some ontologies in use on the Web today
| Ontology | Description |
|---|---|
| Dublin Core | This metadata element standard for cross-domain information resource description provides a simple and standardised set of conventions for describing things online in ways that make them easier to find. |
| SIOC | The Semantically-Interlinked Online Communities Project is an ontology that expresses the information contained both explicitly and implicitly in Internet discussion methods, such as blogs, forums and mailing lists. |
| FOAF | The Friend of a Friend ontology describes individuals, their activities and their relations to other people and objects. FOAF allows the description of social networks in a distributed fashion. |
| DOAP | Description Of A Project is an ontology to describe open-source projects. |
| ResumeRDF | This ontology expresses a Resume or Curriculum Vitae (CV), including information such as work and academic experience or skills. |
In addition, many ontologies are domain specific in fields such as technology,
environmental science, chemistry and linguistics. These will apply to fewer Web sites
than those listed above, however. A lot of your data is likely to fit into at least one
of the areas covered by the ontologies in Table 1, in which case you can incorporate them in your planning.
Choose a Semantic URI scheme
If your Web site is your API, then your URIs are the methods that programmers will
access to get data. A sensible, succinct and consistent structure is therefore very
important, and you need to think about it in advance because frequent changes after
everything is launched will cost the goodwill of your target audience. You should also
remember that the components of an RDF triple are usually URIs. To change them will invalidate most existing RDF which refers to your Web site.
In the early days of the Web, the structure of the URI usually reflected the
organization of the files on a Web server. If you sold a particular type of widget among
a collection of products, its URI might be similar to: http://www.mysite.com/products/gadgets/widget.html.
The advantage of this approach is that it is relatively semantically clear; if you also
sold a doodad, then an obvious URI where you might expect to find the product details is: http://www.mysite.com/products/gadgets/doodad.html.
The relationship between the widget and the doodad is fairly clear. The main problem is that this approach is inflexible; the categorization hierarchy is fixed.
As the Web advanced, dynamically generated sites became the norm. But while the sites
became more flexible, with structure no longer tied to a particular layout of files, the
amount of semantic information in the URI decreased. The page you are shown is determined by some rather cryptic information in the query string. For instance, the URI of the widget might be: http://www.mysite.com/inventory.cgi?pid=12345 and the URI of the doodad might be: http://www.mysite.com/inventory.cgi?pid=67890.
Suddenly the URI gives you very little semantic value. It's certainly not clear that
these two products might be in the same category. More recently, content management
systems and Web development frameworks have started to address this issue. Now it's much
easier to have semantically structured URIs yet retain the flexibility of dynamic pages.
This is achieved through URIs that refer not to a physical file on the server, but to
content which can be delivered from a script or page in a different location. In the
trend-setting Ruby on Rails framework, this is achieved through routes (rules
that map matching URLs to specific controllers and actions). In CMS packages, the
feature usually depends on Apache's mod_rewrite (or equivalent on other Web servers) and
is often referred to as "Search Engine Friendly URIs" or something similar. When you
choose a CMS or development framework for your site, be sure to investigate what it is capable of in this regard.
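The idea of a route can be sketched in a few lines of Python. This is not how Rails or any particular CMS implements routing; it only illustrates a rule table that maps a semantically structured path to an action plus parameters. The action names `show_product` and `list_category` and the function `resolve` are hypothetical.

```python
import re

# Each route pairs a URI path pattern with the name of the action that
# should handle it. Patterns are tried in order.
ROUTES = [
    (re.compile(r"^/products/(?P<category>[^/]+)/(?P<product>[^/]+)$"), "show_product"),
    (re.compile(r"^/products/(?P<category>[^/]+)$"), "list_category"),
]


def resolve(path):
    """Return (action, parameters) for the first matching route, else None."""
    for pattern, action in ROUTES:
        found = pattern.match(path)
        if found:
            return action, found.groupdict()
    return None


print(resolve("/products/gadgets/widget"))
print(resolve("/products/gadgets"))
```

Because the category and product are read from the path rather than from a file location, the same script can serve every product while the URI stays meaningful.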
One final note: If possible, consider removing file name extensions from your URIs.
The file name extensions (.html and .cgi) provide no semantic information that is relevant to the user and actually cause problems in the long run. If you change your Web site to use PHP instead of CGI scripts, you suddenly have different URIs that serve exactly the same content. This is bad for the semantic value of your URIs, as well as your Google ranking! A more semantically elegant method is to take advantage of the HTTP headers to do content negotiation. Consider the following URI: http://www.mysite.com/products/gadgets/widget.
A Web browser will generally indicate its preferred content type using the Accept HTTP
header. When asked for this resource, the Web server can check that header, note that text/html is one of the options, and serve an HTML page. If you have a mashup application that wants RDF, then the Accept header in the HTTP request should contain application/rdf+xml and the Web server, from the same URI, can serve an RDF version of the page.
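The server-side decision can be sketched as follows. This is a simplified illustration, not a complete implementation of HTTP content negotiation: it honours q-values and `*/*` but ignores partial wildcards such as `text/*` and media type parameters other than `q`. The names `parse_accept`, `choose_representation` and `AVAILABLE`, and the file names in the table, are hypothetical.

```python
def parse_accept(header):
    """Return the media types in an Accept header, most preferred first."""
    entries = []
    for position, part in enumerate(header.split(",")):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        quality = 1.0  # a missing q parameter means full preference
        for field in fields[1:]:
            name, _, value = field.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        entries.append((-quality, position, media_type))
    # Sort by quality, then by position in the header; drop q=0 entries.
    return [media for neg_q, _, media in sorted(entries) if neg_q < 0]


# Representations the server can produce for the one widget URI.
AVAILABLE = {
    "text/html": "widget.html",
    "application/rdf+xml": "widget.rdf",
}


def choose_representation(accept_header, available=AVAILABLE, default="text/html"):
    """Pick the media type to serve, or None if nothing acceptable exists."""
    for media_type in parse_accept(accept_header):
        if media_type in available:
            return media_type
        if media_type == "*/*":
            return default
    return None


browser = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
print(choose_representation(browser))
print(choose_representation("application/rdf+xml"))
```

A return value of `None` corresponds to the case where the server has no acceptable representation and would typically answer with a 406 Not Acceptable status.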
At present this content negotiation functionality is not available in many
off-the-shelf CMS solutions, but in the short term it should be possible for a lot of
them to use URIs without file extensions, which means you can add this functionality in the future without upsetting your URI scheme.
Take advantage of existing semantic tools
Whether you fully embrace the Semantic Web in your Web site infrastructure, or just
want to make your existing content more useful, there are probably several opportunities
to add structure to existing content on your Web site. This is the domain of
Microformats, RDFa and GRDDL. Table 2 lists the more common information types that you can easily mark up as structured data.
Table 2. Opportunities for structured markup and automatic transformation
| Information type | Structured Markup |
|---|---|
| People and Organizations | hCard, RDF vCard |
| Calendars and Events | hCalendar, RDF Calendar |
| Opinions, Ratings and Reviews | VoteLinks, hReview |
| Social Networks | XFN, FOAF |
| Licenses | rel-license |
| Tags, Keywords, Categories | rel-tag |
| Lists and Outlines | XOXO |
Adding the structured markup to your page is fairly simple. Listings 3 and 4 below show a fragment of HTML containing contact
information without, and then with, the additional markup required for the RDF vCard,
respectively.
Listing 3. Unstructured contact information
<div class="contactinfo">
Rob Crowther. Web hacker
at
<a href="http://example.org">
Example.org
</a>.
You can contact me
<a href="mailto:robertc@example.org">
via e-mail
</a> or on my work phone at 0123 456789.
</div>
Listing 4 shows the contact information with additional markup required for the RDF vCard.
Listing 4. Contact Information using vCard
<div xmlns:contact="http://www.w3.org/2001/vcard-rdf/3.0#" class="contactinfo"
about="http://example.org/staff/robertc">
<span property="contact:fn">Rob Crowther</span>.
<span property="contact:title">Web hacker</span>
at
<a rel="contact:org" href="http://example.org">
Example.org
</a>.
You can contact me
<a rel="contact:email" href="mailto:robertc@example.org">
via e-mail
</a>
or on my
<span property="contact:tel">
<span property="contact:type">work</span>
phone at
<span property="contact:value">0123 456789</span>
</span>.
</div>
In Listing 4, you can see span elements added to delimit the
semantically significant bits of text, and attributes that indicate what they mean. You
added the namespace "contact" linked to the RDF vCard vocabulary. Next, you indicated
that this element is about the resource represented by the URI http://example.org/staff/robertc. Then, you added metadata using the
rel attribute for link relationships and the property attribute on non-links. The only
slightly complex part is the telephone because you need to specify a type as well as the
number. To achieve this, you nest the type and value elements inside the tel element. Adding this structure allows users to add the contact details to their address book with a single click of the mouse.
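To show why these attributes make the page machine-readable, here is a rough Python sketch that pulls statements out of the Listing 4 markup using only the standard library. It is not a conforming RDFa processor: it does not expand the `contact:` prefix to the namespace URI, it treats `about` as setting one subject for everything that follows, and it flattens the nested telephone structure by attaching `contact:type` and `contact:value` directly to the subject. The class name `RDFaSketch` and the function `extract` are hypothetical.

```python
from html.parser import HTMLParser

LISTING_4 = """
<div xmlns:contact="http://www.w3.org/2001/vcard-rdf/3.0#" class="contactinfo"
about="http://example.org/staff/robertc">
<span property="contact:fn">Rob Crowther</span>.
<span property="contact:title">Web hacker</span>
at
<a rel="contact:org" href="http://example.org">
Example.org
</a>.
You can contact me
<a rel="contact:email" href="mailto:robertc@example.org">
via e-mail
</a>
or on my
<span property="contact:tel">
<span property="contact:type">work</span>
phone at
<span property="contact:value">0123 456789</span>
</span>.
</div>"""


class RDFaSketch(HTMLParser):
    """Collect (subject, predicate, object) statements from RDFa-style attributes."""

    def __init__(self):
        super().__init__()
        self.subject = None
        self.statements = []
        self._stack = []  # one entry per currently open element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "about" in attrs:
            self.subject = attrs["about"]
        # Links: the rel attribute names the predicate, href is the object.
        if "rel" in attrs and "href" in attrs:
            self.statements.append((self.subject, attrs["rel"], attrs["href"]))
        prop = attrs.get("property")
        if prop:
            # An enclosing property (such as contact:tel) is treated as a
            # container here, so its own text is not emitted as a literal.
            for entry in self._stack:
                entry["container"] = True
        self._stack.append({"property": prop, "text": [], "container": False})

    def handle_data(self, data):
        if self._stack:
            self._stack[-1]["text"].append(data)

    def handle_endtag(self, tag):
        if not self._stack:
            return
        entry = self._stack.pop()
        text = " ".join("".join(entry["text"]).split())  # normalize whitespace
        if entry["property"] and not entry["container"] and text:
            self.statements.append((self.subject, entry["property"], text))


def extract(markup):
    """Parse well-formed XHTML and return the collected statements."""
    parser = RDFaSketch()
    parser.feed(markup)
    parser.close()
    return parser.statements


for statement in extract(LISTING_4):
    print(statement)
```

The output is a list of statements about http://example.org/staff/robertc, which is the kind of data an address book tool would need in order to import the contact.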
Other automatic processing is possible with the other structured forms; for example,
Technorati makes use of the rel-tag microformat to categorize its vast aggregation of blog posts. A rel-tag is shown in Listing 5, and as you can see, it is simply a link that makes use of the rel attribute. The significant part is the last bit of the URI, after the final /. This is the tag (using the normal URI encoding conventions where a space is represented by the plus sign).
Listing 5. rel-tag for Technorati for the tag 'semantic web'
<a href="http://technorati.com/tag/semantic+web" rel="tag">
Semantic Web
</a>
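Recovering the tag from such a link takes only a few lines. The following Python sketch takes the last path segment of the href and decodes it, turning the plus sign back into a space. The function name `tag_from_href` is hypothetical, and ignoring a trailing slash is an assumption of this sketch.

```python
from urllib.parse import unquote_plus, urlsplit


def tag_from_href(href):
    """Return the tag named by a rel-tag link: the last path segment, decoded."""
    path = urlsplit(href).path
    # Assumption: a trailing slash is not part of the tag.
    last_segment = path.rstrip("/").rsplit("/", 1)[-1]
    # unquote_plus decodes %XX escapes and turns "+" into a space.
    return unquote_plus(last_segment)


print(tag_from_href("http://technorati.com/tag/semantic+web"))  # prints: semantic web
```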
If you write a blog post related to the Semantic Web that includes the code from
Listing 5 and then ping Technorati to let them know you made a new
post (a lot of blog software can be configured to do this automatically), then their
crawler will index your post and add a summary of it to the page that your tag element
links to, along with any other posts with the same tag (see Figure 2).
Figure 2. The 'semantic web' page on Technorati, generated from rel-tag
Conclusion
In this article, you saw how Semantic Web technologies address the need for structured data on the Web in a standard and consistent manner, in contrast to the currently popular method of each Web site defining its own API. You looked at how the Semantic Web technologies add value in layers on top of the HTTP and URIs of the existing Web, first allowing the unambiguous expression of relationships with RDF, then allowing for shared meaning with OWL-based ontologies, and finally querying the distributed Web of knowledge using SPARQL. The article also looked at how you can take advantage of existing ontologies to define what your data is and use a semantic URI scheme to enable your Web site to also be your API. Finally, the article looked at how you can upgrade the content of your existing Web site using RDFa and Microformats so that GRDDL services can automatically extract RDF from your pages.
Although the promise of Tim Berners-Lee's Semantic Web is yet to be fully realized, the
years of thinking and research that have gone into it are starting to bear fruit in
terms of solutions to practical problems that people face today. The strong
collaboration trends in Web 2.0 will only lead to more requirements for structured and
semantically encoded data being available on the Web. With some planning, you can be in position to take advantage of the Semantic Web tools which help meet that need.
Resources
Learn
- The ultimate mashup - Web services and the semantic Web (Nicholas Chase, developerWorks, August 2006): Practice using Semantic Web techniques with this six-part tutorial series.
- Introduction to Jena: Use RDF models in your Java applications with the Jena Semantic Web Framework (Philip McCarthy, developerWorks, June 2004): Find out how to use the Jena Semantic Web Toolkit to exploit RDF data models in your Java applications.
- Programmable Web: Stay up to date with the latest on mashups and the new Web 2.0 APIs.
- The Structured Web - A Primer: Read a general introduction to the value of structured data.
- The W3C's RDF Primer: Learn the basics of RDF and how to use it effectively.
- A Semantic Web Primer for Object-Oriented Software Developers: Read how to use ontologies, such as RDF Schema and OWL, in the context of OOP.
- The W3C's OWL Overview: Get an understanding of what OWL can do for apps that process information content instead of just presenting it to humans.
- The SPARQL Query Language for RDF specification: Explore the syntax and semantics of this query language for RDF.
- Notation3: Read about N3, a compact and readable alternative to RDF's XML syntax.
- Terse RDF Triple Language: Check out Turtle, a textual syntax for RDF that writes RDF graphs in a compact and natural text form, with abbreviations for common usage patterns and datatypes. Turtle is compatible with existing N-Triples and Notation 3 formats and the triple pattern syntax of SPARQL.
- Cool URIs for the Semantic Web: Read guidelines for effective URIs as the link between RDF and the semantic Web.
- University of Southampton Department of Electronics and Computer Science: See a semantic Web site in action.
- RDFa or Microformats: Embed semantic information in your Web pages.
- IBM XML certification: Find out how you can become an IBM-Certified Developer in XML and related technologies.
- XML technical library: See the developerWorks XML Zone for a wide range of technical articles and tips, tutorials, standards, and IBM Redbooks.
- developerWorks technical events and webcasts: Stay current with technology in these sessions.
- The IBM developerWorks XML zone: Learn more about XML and the Semantic Web.
- The technology bookstore: Browse for books on these and other technical topics.
Get products and technologies
- IBM trial software: Build your next development project with trial software available for download directly from developerWorks.