Build Wikipedia query forms with semantic technology

Create simple Web forms that drive semantic Web standard queries to take advantage of exciting new databases

By providing open access to increasing amounts of Linked Data, public SPARQL endpoints are boosting the growth of the Semantic Web and giving you great data to use in your applications. As with many other data-driven Web sites, a Web page can be created by sending a query to these endpoints and then wrapping the results in HTML tags; the big difference for SPARQL endpoints is the public availability of this new data for your applications. This article shows how simple CGI scripting can get data from two different SPARQL endpoints to build applications that answer your users' questions about which actors two directors have shared and which musicians have released which albums.


Bob DuCharme (bob@snee.com), Solutions Architect, Innodata Isogen

Bob DuCharme, a solutions architect at Innodata Isogen, was an XML expert when XML was a four-letter word. He's written four books and nearly one hundred online and print articles about information technology without using the word "functionality" in any of them. See http://www.snee.com/bob for more and www.snee.com/bobdc.blog for his weblog.



21 July 2009


Frequently used acronyms

  • CSS: Cascading Style Sheets
  • CGI: Common Gateway Interface
  • HTML: Hypertext Markup Language
  • HTTP: Hypertext Transfer Protocol
  • JSON: JavaScript Object Notation
  • RDBMS: Relational Database Management System
  • RDF: Resource Description Framework
  • REST: Representational State Transfer
  • SPARQL: SPARQL Protocol and RDF Query Language
  • URI: Uniform Resource Identifier
  • URL: Uniform Resource Locator
  • XML: Extensible Markup Language

SPARQL endpoints provide access to databases through queries that use the W3C standard SPARQL query language. More and more of this data is becoming available on the public internet, and your applications can retrieve and use this data much the way they do with relational database data. Once you know a little SPARQL, you can incorporate queries in this language into applications that are otherwise very similar to applications you've already written, but you (and, after you write these applications, your users) will have access to all kinds of new data.

This article discusses two examples of applications that display user-friendly forms, query a database's SPARQL endpoint, and present the results without requiring the form's user to know anything about the technology and standards used to deliver that data. A zip file accompanying the article includes all the example files; see Download. The first application lets you name two film directors, and then retrieves the names of any actors who appeared in films by both of them; the second retrieves information about recording artists' albums.

As with so many data-driven Web sites out there, the basic architecture of these applications follows this pattern:

  1. The user enters one or more query terms on a Web form and clicks Submit.
  2. The form passes the entered values to a CGI script.
  3. The CGI script plugs the values into a query and sends the query to a database server.
  4. The server returns the query results to the CGI script, which builds an HTML page around the returned data and sends the page to the user's browser.

This pattern is as old as CGI scripts. What makes it new this time is that the query language in use is SPARQL instead of the better-known SQL. What makes it exciting is that while few if any SQL databases are available on the internet for your application to freely query, more large, useful databases are becoming available with SPARQL interfaces (also known as SPARQL endpoints) all the time. In fact, these SPARQL endpoints are often additional interfaces added on to existing relational databases. In addition to the SPARQL endpoints that appear on the public internet, others are appearing behind firewalls to ease cross-silo querying of enterprise data.

Application one: The database and the query

Because the first application automates the delivery of a query to a particular database and the formatting of the returned data, let's start with its database and query. While the Internet Movie Database (IMDb) can tell you plenty about just about any English-language sound movie ever made, it won't accept queries that let you compare and contrast information. For example, if I wonder whether any actors have appeared in a film directed by Sofia Coppola and another directed by her father Francis Ford Coppola, I'd have to go to the six IMDb Web pages for films that she's directed and the 32 pages for films that he's directed, manually compile a list of actors, and then cross-reference them.

Luckily, the Linked Movie Database at linkedmdb.org provides a database of movie credit information that accepts SPARQL queries. For example, the query in Listing 1 tells it to list the titles of all films directed by Sofia Coppola:

Listing 1. A query for Sofia Coppola films
PREFIX m: <http://data.linkedmdb.org/resource/movie/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?filmTitle WHERE {
  ?film rdfs:label ?filmTitle.
  ?film m:director ?dir.
  ?dir  m:director_name "Sofia Coppola".
}

SPARQL is designed to query data represented with the RDF data model, which represents data as a collection of triples, each of which has a subject, a predicate, and an object. Sometimes it's easier to think of these as an entity, an attribute name, and an attribute value. One such triple could be described as "the entity with the ID http://data.linkedmdb.org/resource/director/7764 has a director_name value of 'Sofia Coppola'."

In SPARQL, variables begin with question marks, and this query uses triples with variables inserted at certain places to describe three conditions for the film titles you want returned:

  • RDF uses URIs as identifiers, but you want the database to retrieve human-readable film titles. In the query, the ?film variable stands in for the film's URI identifier. The film has an rdfs:label of ?filmTitle, which is what the query asks for in the SELECT statement. The rdfs:label predicate is defined in the RDF Schema standard, and the namespace declaration in the second line of the query spells out exactly what the rdfs prefix refers to.
  • The film has a director with the identifier ?dir.
  • Because ?dir stands in for a URI identifier for a director and you want the director's actual name, the query asks for the ?dir whose m:director_name value is Sofia Coppola. Another namespace declaration at the query's beginning shows what the m prefix refers to. I knew that director_name was the right predicate name to use because I looked up the names that linkedmdb.org data used before writing this query, just as I look at the names an XML or RDBMS schema uses before I create an XQuery or SQL query. You can find these out from any SPARQL endpoint with a simple SPARQL query that essentially says "show me all the predicates used in any triples, but with no repeats":
SELECT DISTINCT ?p WHERE {?s ?p ?o}
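To make the idea of matching triple patterns with variables more concrete, here is a minimal Python sketch of the same logic run against a tiny in-memory list of triples. It is only an illustration of the data model, not how a SPARQL engine is implemented. The director triple is the one described earlier; the film URI under example.org, its "Example Film" label, and the helper names match, film_titles, and distinct_predicates are all hypothetical.

```python
# Each RDF triple is a (subject, predicate, object) tuple.
M = "http://data.linkedmdb.org/resource/movie/"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
SOFIA = "http://data.linkedmdb.org/resource/director/7764"

TRIPLES = [
    # The triple described in the article text.
    (SOFIA, M + "director_name", "Sofia Coppola"),
    # Hypothetical film data, invented purely for illustration.
    ("http://example.org/film/1", M + "director", SOFIA),
    ("http://example.org/film/1", RDFS_LABEL, "Example Film"),
]

def match(triples, s=None, p=None, o=None):
    """Return the triples that agree with every position that is not None.

    A None position plays the role of a SPARQL variable: it matches anything.
    """
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

def film_titles(triples, director_name):
    """Chain three patterns the way the Listing 1 query does."""
    titles = []
    # ?dir m:director_name "<name>"
    for dir_uri, _, _ in match(triples, p=M + "director_name", o=director_name):
        # ?film m:director ?dir
        for film, _, _ in match(triples, p=M + "director", o=dir_uri):
            # ?film rdfs:label ?filmTitle
            for _, _, title in match(triples, s=film, p=RDFS_LABEL):
                titles.append(title)
    return titles

def distinct_predicates(triples):
    """Rough equivalent of SELECT DISTINCT ?p WHERE {?s ?p ?o}."""
    return sorted({p for _, p, _ in triples})

print(film_titles(TRIPLES, "Sofia Coppola"))
print(distinct_predicates(TRIPLES))
```

Each nested loop corresponds to one triple pattern in the query, and a variable that appears in two patterns (such as ?dir) is what ties the loops together.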

Many SPARQL endpoints, including the one at linkedmdb.org, offer a form-based interface for you to try SPARQL queries. If you go to their SNORQL form, paste in the query from Listing 1, and click Go!, you should see a list of Sofia Coppola's films; the one-line query above lists the predicate names instead. (If the default Results: setting doesn't work, try others, especially as JSON, the format you'll use in your application.)

Listing 2 shows another query to try on linkedmdb.org's SNORQL form. This one lists all the actors in all of Sofia Coppola's films:

Listing 2. A query to retrieve actors
PREFIX m: <http://data.linkedmdb.org/resource/movie/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?actorName ?filmTitle WHERE {
  ?film rdfs:label ?filmTitle;
        m:director ?dir;
        m:actor ?actor.
  ?dir  m:director_name "Sofia Coppola".
  ?actor m:actor_name ?actorName.
}

In addition to asking for the titles of her films, it asks for the actors in each film—or rather, for the actor names that go with the actor identifiers associated with each film, because as with your first query (and the majority of SPARQL queries) you use URIs to identify which data you want but then pull out the human-readable labels associated with the URIs for the output. It's similar to the way an SQL query might use product ID values to cross-reference products and then pull out product names for the actual report.

This query also introduces some SPARQL shorthand. Rather than spell out three full triples to indicate that you want the title, director, and actor data associated with each film, it uses semicolons to show that the predicate/object pairs m:director ?dir and m:actor ?actor go with the subject ?film, just like the predicate/object pair rdfs:label ?filmTitle does.


Building the application

The purpose of the first application is to provide the benefits of the linkedmdb.org database and the SPARQL query language while hiding the actual query and URI from the user of the application. In terms of the architecture of the system described above,

  1. The application user enters two director names using the form in Figure 1.

    Figure 1. Input form for commonActors application
    Input form for commonActors application. User enters the official name of two directors and clicks Submit.
  2. The form passes the two director names to the commonActors.cgi CGI script.
  3. The CGI script plugs the director names into a SPARQL query and sends the query to the linkedmdb.org SPARQL endpoint at http://data.linkedmdb.org/sparql.
  4. The SPARQL endpoint returns a JSON version of the results, and the CGI script builds an HTML page around the returned data and sends the page to the user's browser.

Listing 3 shows the SPARQL query that will be sent, hard-coded to ask about Sofia and Francis Ford Coppola:

Listing 3. SPARQL query for listing common actors in Sofia and Francis Ford Coppola films
PREFIX m: <http://data.linkedmdb.org/resource/movie/>
SELECT DISTINCT ?actorName WHERE {

  ?dir1     m:director_name "Sofia Coppola".

  ?dir2     m:director_name "Francis Ford Coppola".

  ?dir1film m:director ?dir1;
            m:actor ?actor.

  ?dir2film m:director ?dir2;
            m:actor ?actor.

  ?actor    m:actor_name ?actorName.
}

If you paste this query into linkedmdb.org's SNORQL interface, you'll see that only Kathleen Turner has been in a film directed by each of the two Coppolas.
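Because the ?actor variable appears in the patterns for both directors' films, the query in Listing 3 is in effect computing a set intersection. The following Python sketch shows that idea on a small list of credits. The film, director, and actor names are placeholders invented for the illustration, and actors_for and common_actors are hypothetical helper names.

```python
# Hypothetical credits as (film, director, actor) tuples; placeholder names only.
CREDITS = [
    ("Film 1", "Director A", "Actor X"),
    ("Film 1", "Director A", "Actor Y"),
    ("Film 2", "Director B", "Actor Y"),
    ("Film 3", "Director B", "Actor Z"),
]

def actors_for(credits, director):
    """Collect every actor who appears in any film by the given director."""
    return {actor for _, d, actor in credits if d == director}

def common_actors(credits, dir1, dir2):
    """Actors who worked with both directors; the set removes repeats,
    much as DISTINCT does in the query."""
    return sorted(actors_for(credits, dir1) & actors_for(credits, dir2))

print(common_actors(CREDITS, "Director A", "Director B"))
```

The advantage of letting the endpoint do this work is that only the final list of names travels over the network, not every credit for both directors.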

The Web input form stored in the commonActors.html file (which you can try for yourself at http://www.snee.com/sparqlforms/commonActors.html) assigns the entered director names to the variables dir1 and dir2, which it passes to the commonActors.cgi script.

Listing 4. Web input form for commonActors application
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Find common actors between two directors</title>
    <link href="simple.css" type="text/css" rel="stylesheet" />
  </head>
  <body>
    <h1>Find common actors between two directors</h1>
    <form action="commonActors.cgi" method="get">

      <p>Enter the "official" name of each director (check 
      <a href="http://www.imdb.com">IMDB</a> if you're not sure) 
      and click "Search" to list actors who have appeared in movies by 
      both directors.</p>
      <p>
        <input type="text" name="dir1"/>
        <input type="text" name="dir2"/>
        <input type="submit" value="search"/> 
      </p>

    </form>
  </body>
</html>

A nice thing about SPARQL endpoints is that you can communicate with all of them using a REST-based HTTP interface, but a Web search should turn up an even simpler interface tailored to your favorite programming language. I wrote my commonActors.cgi CGI script in Python and used the SPARQLWrapper library to simplify the sending of my query and the unpacking of the results.

Listing 5. Python script for commonActors application
#!/usr/local/bin/python

import sys
sys.path.append('/usr/home/bobd/lib/python/') # needed for hosted version
from SPARQLWrapper import SPARQLWrapper, JSON
import cgi

def main():
  form = cgi.FieldStorage() 
  # Default to an empty string so that a missing field never passes None to replace()
  dir1name = form.getvalue('dir1', '')
  dir2name = form.getvalue('dir2', '')

  sparql = SPARQLWrapper("http://data.linkedmdb.org/sparql")
  queryString = """

PREFIX m: <http://data.linkedmdb.org/resource/movie/>
SELECT DISTINCT ?actorName WHERE {

  ?dir1     m:director_name "DIR1-NAME".
  ?dir2     m:director_name "DIR2-NAME".
  ?dir1film m:director ?dir1;
            m:actor ?actor.

  ?dir2film m:director ?dir2;
            m:actor ?actor.

  ?actor    m:actor_name ?actorName.
}
  """

  queryString = queryString.replace("DIR1-NAME",dir1name)
  queryString = queryString.replace("DIR2-NAME",dir2name)

  sparql.setQuery(queryString)
  sparql.setReturnFormat(JSON)

  try:
    ret = sparql.query()
    results = ret.convert()
    requestGood = True
  except Exception, e:
    results = str(e)
    requestGood = False

  print """Content-type: text/html

    <html>
      <head>
        <title>results</title>
          <link href="simple.css" type="text/css" rel="stylesheet" />
      </head>
      <body>
"""

  if requestGood == False:
    print "<h1>Problem communicating with the server</h1>"
    print "<p>" + results + "</p>"
  elif (len(results["results"]["bindings"]) == 0):
      print "<p>No results found.</p>"

  else:
    for result in results["results"]["bindings"]:
      print "<p>" + result["actorName"]["value"] + "</p>"

  print "</body></html>"

main()

The sys.path.append line was necessary when I ran the script on a hosting service to tell the host's Python interpreter where I had installed the SPARQLWrapper and JSON libraries that are not part of the standard Python distribution.

After importing the necessary libraries and storing the director name values passed from the HTML form in the dir1name and dir2name variables, the script sets up the SPARQL query and sends it to the SPARQL endpoint. This endpoint is identified by a URL passed as an argument when you create the SPARQLWrapper object stored in the script's sparql variable.
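If you would rather not depend on a wrapper library, the REST-based HTTP interface mentioned earlier can be reached with Python's standard library alone. The sketch below, written for Python 3, only builds the request and does not send it. The query parameter name comes from the SPARQL Protocol; whether a particular endpoint honors the Accept header shown here for JSON results is an assumption you would need to check, and build_sparql_request is a hypothetical helper name.

```python
import urllib.parse
import urllib.request

ENDPOINT = "http://data.linkedmdb.org/sparql"
QUERY = "SELECT DISTINCT ?p WHERE {?s ?p ?o}"

def build_sparql_request(endpoint, query):
    """Build an HTTP GET request that carries the query in a 'query' parameter."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    # Assumption: the endpoint uses content negotiation to choose JSON results.
    headers = {"Accept": "application/sparql-results+json"}
    return urllib.request.Request(url, headers=headers)

request = build_sparql_request(ENDPOINT, QUERY)
print(request.full_url)

# Sending it would look like this (left commented out to avoid a network call):
# with urllib.request.urlopen(request) as response:
#     raw_json = response.read().decode("utf-8")
```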

The query stored in the queryString variable resembles the one you saw earlier that looked for actors who had appeared in both Coppolas' films, except for the DIR1-NAME and DIR2-NAME strings that are replaced with the dir1name and dir2name values after the queryString variable is created.
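One thing to watch with plain string replacement is that a name containing a double quote or backslash would break the query, or let a user alter it. The following Python 3 sketch shows one way to guard against that by escaping the values before substituting them. It is a suggested hardening, not part of the downloadable script, and escape_sparql_literal and build_query are hypothetical helper names.

```python
import re

QUERY_TEMPLATE = """
PREFIX m: <http://data.linkedmdb.org/resource/movie/>
SELECT DISTINCT ?actorName WHERE {
  ?dir1     m:director_name "DIR1-NAME".
  ?dir2     m:director_name "DIR2-NAME".
  ?dir1film m:director ?dir1;
            m:actor ?actor.
  ?dir2film m:director ?dir2;
            m:actor ?actor.
  ?actor    m:actor_name ?actorName.
}
"""

def escape_sparql_literal(text):
    """Escape characters that would end or corrupt a double-quoted SPARQL string."""
    return (text.replace("\\", "\\\\")   # backslashes first, so later escapes survive
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r"))

def build_query(dir1name, dir2name):
    """Fill both placeholders in a single pass, treating None as an empty name."""
    values = {
        "DIR1-NAME": escape_sparql_literal(dir1name or ""),
        "DIR2-NAME": escape_sparql_literal(dir2name or ""),
    }
    # A function replacement keeps re.sub from interpreting backslashes in the values.
    return re.sub(r"DIR[12]-NAME", lambda m: values[m.group(0)], QUERY_TEMPLATE)

print(build_query('Sofia "X" Coppola', "Francis Ford Coppola"))
```

Substituting both placeholders in one pass also avoids the corner case where the first name happens to contain the text of the second placeholder.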

After setting the query string and identifying the format in which you want data returned as JSON, a try/except block sends the query to the server and sets a requestGood variable to indicate whether the request was successful. After outputting the header of an HTML page, the script outputs one of three things:

  • A simple error message if the request was unsuccessful.
  • A "No results found" message if an error-free query found no results.
  • Each actor's name as a single HTML paragraph.

The script finishes by outputting the closing tags of the HTML page's body and html elements.
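The three outcomes described above can also be sketched as a small, testable Python 3 function that works on the standard SPARQL JSON results structure (a results object holding a list of bindings). The sample response and its placeholder actor names are invented for illustration, render_body is a hypothetical name, and the HTML escaping is an addition that the original script does not perform.

```python
import html
import json

# Hypothetical response in the SPARQL JSON results format; placeholder names only.
SAMPLE_RESPONSE = """
{
  "head": {"vars": ["actorName"]},
  "results": {"bindings": [
    {"actorName": {"type": "literal", "value": "Actor <One>"}},
    {"actorName": {"type": "literal", "value": "Actor Two"}}
  ]}
}
"""

def render_body(results, request_good=True):
    """Return the HTML fragment for an error, an empty result, or a list of names."""
    if not request_good:
        # On failure, 'results' holds the error message text.
        return ("<h1>Problem communicating with the server</h1>\n"
                "<p>" + html.escape(str(results)) + "</p>")
    bindings = results["results"]["bindings"]
    if not bindings:
        return "<p>No results found.</p>"
    # Escape each value so that stray markup in the data cannot break the page.
    return "\n".join("<p>" + html.escape(b["actorName"]["value"]) + "</p>"
                     for b in bindings)

print(render_body(json.loads(SAMPLE_RESPONSE)))
```

Keeping the rendering in a function that returns a string, rather than printing as it goes, makes it easy to test without running a Web server.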


Application two: Albums, their artists, and release dates

Your HTML and CSS skills can help you build a slicker-looking application than my commonActors demo. To create an application that not only looks nicer but also does more, note that the SELECT statement in the SPARQL query above only asks for one piece of information for each matched pattern: the actor name. Typical SPARQL queries, like typical SQL queries, ask for more than that, and as the program written in your host language (in my example above, Python) goes through the retrieved query results, it can do all kinds of interesting things with the retrieved information.

The real possibilities, though, lie in the wider and wider choice of data available. Moving beyond linkedmdb.org, you can find other public SPARQL endpoints that give you access to a wider range of data.

One of the biggest and most popular collections of SPARQL data is DBpedia, a community effort to extract structured data from Wikipedia "Infoboxes" (the fielded information in the gray boxes on the right side of many Wikipedia pages) and store it where you can retrieve it with SPARQL queries.

When listening to music, I often find myself wondering when a particular album was released, so I wrote the simpleAlbumQuery application to make it easier to look this up. (In fact, while writing the first draft of this article, I was listening to a selection of Duke Ellington tunes, and, while listening to Diana Krall's version of "I'm Just a Lucky So and So", I looked her up at http://www.snee.com/sparqlforms/simpleAlbumQuery.html, where I have a copy of the application stored. Try it yourself!) In the zip file accompanying this article, along with the commonActors.html file, commonActors.cgi file, and the simple.css stylesheet that they use, you'll find a simpleAlbumQuery.html Web page file and the simpleAlbumQuery.cgi Python CGI script. (See Download.)

For the simpleAlbumQuery application, Figure 2 shows the HTML form, which has two fields to pass artist and album parameters to the Python CGI script.

Figure 2. Input form for SimpleAlbumQuery application
Input form for SimpleAlbumQuery application with artist and album fields plus four sample queries

This form includes some suggested queries to jumpstart the user. In addition to filling out the suggested values on the form, the user can click one of the try it links. For example, clicking the second try it link activates the URL http://www.snee.com/sparqlforms/simpleAlbumQuery.cgi?artist=&album=Fillmore, which has the same effect as entering nothing in the first form field, Fillmore in the second, and then clicking the search button: it calls the simpleAlbumQuery.cgi script, passing an empty string as the artist value and Fillmore as the album one.
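Because the form uses the GET method, a try it link is nothing more than the script's URL followed by the encoded form fields. The short Python 3 sketch below builds the Fillmore link with the standard library and then parses it back into its parameters; try_it_url is a hypothetical helper name.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE = "http://www.snee.com/sparqlforms/simpleAlbumQuery.cgi"

def try_it_url(artist, album):
    """Build the same URL the browser produces when the form is submitted."""
    return BASE + "?" + urlencode({"artist": artist, "album": album})

link = try_it_url("", "Fillmore")
print(link)

# keep_blank_values=True preserves the empty artist field when parsing.
params = parse_qs(urlsplit(link).query, keep_blank_values=True)
print(params)
```

Using urlencode rather than string concatenation matters once a value contains spaces or punctuation, which artist and album names often do.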

Figure 3 shows the first three entries resulting from the sample Fillmore query. Along with a little CSS to make it look nicer, it includes the album cover images, and the album names link to the Wikipedia pages for those albums. (Scrolling down in the returned results will reveal two albums of Miles Davis at the Fillmore in 1970. If it's the same show as the one listed on the marquee for Neil Young and Crazy Horse in the third row in Figure 3, then it's a nice example of Linked Data serendipity—and must have been quite a show.)

Figure 3. First three entries returned by query for artist = "" and album = "Fillmore"
Result of SimpleAlbumQuery query

Other than a bit of string manipulation to format some of the returned text, you'll find that simpleAlbumQuery.cgi has only two significant differences from commonActors.cgi:

  • The query is sent to a different destination, so a different URL is passed to the SPARQLWrapper creation method: http://dbpedia.org/sparql, the SPARQL endpoint URL for DBpedia.
  • The queryString variable stores a different query that asks for different information, and to customize the query for the simpleAlbumQuery application's user, different values are passed to it from the HTML form.

Take a closer look at the query in Listing 6. (While DBpedia has its own SNORQL form for directly entering queries at http://dbpedia.org/snorql/, this query won't work as shown because it has the ARTIST-STRING and ALBUM-TITLE-STRING placeholder strings that, as with DIR1-NAME and DIR2-NAME in the commonActors application, will be replaced before the query is sent to the SPARQL endpoint server.)

Listing 6. SPARQL query for simpleAlbumQuery application
PREFIX dbpedia2: <http://dbpedia.org/property/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?artistName ?album
    ?wpURL ?releaseDate ?coverURL
WHERE {
  ?s dbpedia2:artist   ?artist;
     dbpedia2:name     ?album;
     foaf:page         ?wpURL;
     dbpedia2:released ?releaseDate;
     dbpedia2:cover    ?coverURL.
  ?artist rdfs:label   ?artistName.

  FILTER (regex(?artistName, "ARTIST-STRING")).
  FILTER (regex(?album, "ALBUM-TITLE-STRING")).
  FILTER (lang(?artistName) = "en").
}
LIMIT 30

DBpedia entries that correspond to Wikipedia pages each have a URI identifier with various pieces of information associated with them. This includes entries for albums, so this query asks for an album's artist identifier, album name, the Wikipedia URL to build the links in Figure 3, the release date, and the cover URL to display the image in the output. Once you have the artist identifier, you'll want to show users the human-readable name rather than the URI identifier, so the query asks for the rdfs:label that goes with that artist identifier.

How does the query specify the albums you want? By using SPARQL's FILTER keyword to indicate that you only want an artist name and album name matching the indicated patterns. As with the commonActors application, this query has dummy strings that get replaced by the CGI script before the query is sent off to the SPARQL endpoint. Because DBpedia can store multiple names for an artist, depending on how people refer to the artist in different languages, a third FILTER statement shows that you only want the English version of the artist's name.

To be polite to the DBpedia server, this query also includes a LIMIT 30 statement to prevent it from retrieving too much data. For example, if a user of the simpleAlbumQuery application enters the letter "a" as an album title and leaves the artist field blank, the query asks DBpedia for all albums with the letter "a" in their title from any artist in its database, and that's asking for a bit too much.
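To see what the three FILTER conditions and the LIMIT do, it can help to mimic them locally. The Python 3 sketch below applies the same kind of tests to a few invented rows; the artist and album names are placeholders, apply_filters is a hypothetical name, and Python's re module only approximates the regular expression syntax that SPARQL uses. In the real application the endpoint does this filtering, so only matching rows are sent back.

```python
import re

# Hypothetical rows standing in for query solutions; placeholder names only.
ROWS = [
    {"artistName": "Artist One", "lang": "en", "album": "Live at the Fillmore"},
    {"artistName": "Kuenstler Eins", "lang": "de", "album": "Live at the Fillmore"},
    {"artistName": "Artist Two", "lang": "en", "album": "Studio Sessions"},
]

def apply_filters(rows, artist_pattern, album_pattern, limit=30):
    """Keep rows passing all three tests, then cut the list off like LIMIT."""
    kept = [row for row in rows
            if re.search(artist_pattern, row["artistName"])  # regex(?artistName, ...)
            and re.search(album_pattern, row["album"])       # regex(?album, ...)
            and row["lang"] == "en"]                         # lang(?artistName) = "en"
    return kept[:limit]

# An empty pattern matches every string, which is why a blank artist field
# still returns albums by any artist.
for row in apply_filters(ROWS, "", "Fillmore"):
    print(row["artistName"], "-", row["album"])
```

Note that the patterns are unanchored, so "Fillmore" matches anywhere in a title; that is also why a one-letter pattern such as "a" would match a very large share of the database.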

As you look through the other parts of simpleAlbumQuery.cgi and simpleAlbumQuery.html, outside of the SPARQL query you'll find that code in each parallels some corresponding code from the commonActors files.


Your own application

When you find a SPARQL endpoint with data that is useful to you or your users, you can create your own HTML form and a CGI script based on the structure of the two applications included with this article and then build an application that uses that data. Every time you see a Wikipedia Infobox, it's fun to remember that you can create an application like commonActors or simpleAlbumQuery to make use of that data. And, while the growing W3C list of SPARQL endpoints includes several more databases of music-related information, you'll find many others as well, especially in the fields of literature, biology, and pharmaceuticals. To take it even further, nothing prevents a single application from retrieving data from multiple endpoints and building something new from the collection, making even more complex and interesting applications. The only limit is your imagination.


Download

Description | Name | Size
Sample Python, HTML, and CSS for this article | WPQueryFormApps.zip | 5KB
