Now that I've outlined basic techniques for extracting data from XML applications to RDF in previous columns, you can proceed to reap some of the rewards of this extraction. If you haven't recently read the two previous installments, you may want to review them (see the links in the related-content section of the column's table of contents, to the right) before reading on.
The demonstration application, you may recall, is an issue tracker that manages incident data in XML formats. So far the columns have looked at techniques for extracting RDF from this data, and at basic techniques for querying the resulting RDF model. Now let's take a close look at one instance of why all this effort is so valuable.
WordNet is a project of Princeton University. Styled "a lexical database for the English language," it is a system that describes and classifies words and concepts by gathering collections of synonyms into groups called synonym sets, or synsets. I can't praise enough this important long-running project, which represents such admirable industry. It is doubly important because its openness means that practically any developer can use it. WordNet has an "unencumbered" license. It is similar to a BSD license in that the only real restriction is that you not misappropriate the Princeton University trademark in promotion of any derivative of WordNet.
I must say that it is always nice to find that some of the most important fruits of intellectual labor are freely available for the common good. These days we hear too much news of organizations attempting to make dubious profits by taking freely from common knowledge and refusing to return the contribution without payment.
WordNet, currently in version 1.7, includes a glossary of synsets that represent tens of thousands of nouns. The synsets are related in a variety of ways, including hyponymy: a hyponym is a concept that is a type, or subclass, of another concept (and a hypernym is the superclass of a hyponym). WordNet also includes mappings between concepts that are similar without being synonyms.
Princeton distributes WordNet itself as data files and command-line query tools for various platforms. Many projects have adapted and enhanced WordNet, and because it represents such a well-constructed network of resources, the RDF community has been especially active in adopting WordNet. The several related projects include Dan Brickley's WordNet for the Web, and Dr. Jonathan Borden's adaptation of that to a browsable XHTML format. In this article, I use a straightforward translation of the WordNet databases to RDF, courtesy of Sergey Melnik and Stefan Decker, who, like Brickley and Borden, are both quite busy in the RDF community. Most of the projects I mention are still based on WordNet 1.6, including the Melnik and Decker RDF translation used here.
This column illustrates the use of RDF tools and the WordNet RDF models to add semantic features to the RDF-powered issue tracker. You can use any RDF tools, with some variations, to follow the process; I'll be working with 4RDF, from my company's open-source 4Suite. WordNet in RDF form is huge, so I'll be using 4RDF's persistent database back end to manage it. 4RDF allows you to manage models in memory, which is how we've been using it so far in this series, or in persistent storage (in a SQL DBMS such as PostgreSQL or Oracle, in a flat file, or in Mangrove, an experimental customized RDF back end I've written). I use PostgreSQL for this column's examples because Mangrove will probably make its debut in the next release of 4Suite, 0.12.0, which may not be out before publication time. PostgreSQL is an open-source, enterprise-quality relational DBMS that comes with many Linux distributions, with packages available for many other platforms.
To set up, first of all, download the WordNet/RDF files. I had to fix a couple of places where the file wordnet_hyponyms-20010201.rdf was corrupted. I amended line 58268 from
and line 109228 from
<b:hyponymOf rdf:resof rdf:resource="&a;105862019"/>
to
<b:hyponymOf rdf:resource="&a;105862019"/>
Next, add the noun and hyponym database WordNet RDF files to a PostgreSQL-based RDF model named sit (for semantic issue tracker):
Listing 1. Adding the WordNet RDF files to a database RDF model named sit
$ 4rdf --driver=Postgres --dbName=sit wordnet_nouns-20010201.rdf
$ 4rdf --driver=Postgres --dbName=sit wordnet_hyponyms-20010201.rdf
As you see in Listing 1, you must specify the Postgres driver, which requires that you specify a database name. A database of that name will be created if it does not exist; if it does exist, the generated statements are added to it.
Before turning back to the issue tracker, try running a simple test script on the WordNet model created so far.
Listing 2. A small test program (wn-test.py) to exercise the WordNet RDF model
from Ft.Rdf.Drivers import Postgres
from Ft.Rdf import Model, Util

WN_RDF_BASE = "http://www.cogsci.princeton.edu/~wn/schema/"

def Test():
    #Connect to the 'sit' database populated in Listing 1
    db = Postgres.GetDb('sit')
    db.begin()
    m = Model.Model(db)
    print "Size of the model (number of statements): ", m.size()
    print "Synonyms of the word 'knowledge':"
    #Find the synset that has 'knowledge' as one of its word forms
    noun = Util.GetSubject(m, WN_RDF_BASE+'wordForm', 'knowledge')
    print Util.GetObjects(m, noun, WN_RDF_BASE+'wordForm')
    print "Classification chain for 'knowledge':"
    HypernymChain(noun, 'knowledge', m)
    #Nothing was modified, so just abandon the transaction
    db.rollback()
    return

def HypernymChain(noun, wform, m):
    #Follow hyponymOf statements upward until no hypernym is left
    hyper = Util.GetObject(m, noun, WN_RDF_BASE+'hyponymOf')
    if hyper:
        hwform = Util.GetObject(m, hyper, WN_RDF_BASE+'wordForm')
        print "%s is a kind of %s"%(wform, hwform)
        HypernymChain(hyper, hwform, m)
    return

if __name__ == "__main__":
    Test()
In Listing 2, Postgres.GetDb('sit') makes a connection to the RDF database back end you created. After this connection is made, you begin a transaction on the back end and use it to set up a model object, which you can then query. First of all, you just look at the number of statements in the model. Then you use some of the basic sorts of RDF queries introduced in the last article to find the synonyms of the word knowledge and to trace the chain of its classification among concepts.
WordNet/RDF represents each noun synset with a URI that ends in the ID of the synset, such as 100001740. There is then a wordForm statement for each word that is one of the synonyms in the synset. If appropriate, there is also a hyponymOf statement to tie each synset to its superclass. Accordingly, the test program gets synonyms using a couple of simple queries according to this schema.
HypernymChain is a recursive function that takes a synset resource and the word form of one of the synonyms, and looks for a hypernym. Save the listing to a file called wn-test.py and run it, as follows:
$ python wn-test.py
Size of the model (number of statements):  351632
Synonyms of the word 'knowledge':
['cognition', 'knowledge']
Classification chain for 'knowledge':
knowledge is a kind of psychological feature
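To make the wordForm and hyponymOf schema concrete without a database, here is a minimal sketch in plain Python 3 (it does not use 4RDF). It stores a handful of statements as tuples and walks the hypernym chain with a loop instead of recursion. The synset identifiers (syn:1, syn:2) and the helper names are assumptions made up for illustration; only the schema base URI and the word forms come from the listing above.

```python
# Toy stand-in for the WordNet/RDF model: (subject, predicate, object) tuples.
# The synset identifiers "syn:1" and "syn:2" are hypothetical.
WN_RDF_BASE = "http://www.cogsci.princeton.edu/~wn/schema/"

TRIPLES = [
    ("syn:1", WN_RDF_BASE + "wordForm", "knowledge"),
    ("syn:1", WN_RDF_BASE + "wordForm", "cognition"),
    ("syn:1", WN_RDF_BASE + "hyponymOf", "syn:2"),
    ("syn:2", WN_RDF_BASE + "wordForm", "psychological feature"),
]


def get_subject(triples, predicate, obj):
    """Return the first subject with the given predicate and object, or None."""
    for s, p, o in triples:
        if p == predicate and o == obj:
            return s
    return None


def get_objects(triples, subject, predicate):
    """Return all objects for the given subject and predicate, in input order."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def hypernym_chain(triples, synset):
    """Follow hyponymOf links upward, returning one word form per level."""
    chain = []
    seen = set()  # guards against looping forever on malformed data
    while synset is not None and synset not in seen:
        seen.add(synset)
        forms = get_objects(triples, synset, WN_RDF_BASE + "wordForm")
        chain.append(forms[0] if forms else synset)
        parents = get_objects(triples, synset, WN_RDF_BASE + "hyponymOf")
        synset = parents[0] if parents else None
    return chain


noun = get_subject(TRIPLES, WN_RDF_BASE + "wordForm", "knowledge")
print(sorted(get_objects(TRIPLES, noun, WN_RDF_BASE + "wordForm")))
print(" is a kind of ".join(hypernym_chain(TRIPLES, noun)))
```

The loop follows only the first hypernym at each level, as Listing 2 does with Util.GetObject.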
The last installment looked at how to transform multiple XML files in order to extract serialized RDF, and how to then pass the results through an RDF processor. Now it's time to add those statements to the same model that contains the WordNet statements:
$ 4rdf --driver=Postgres --dbName=sit issues.rdf
Again the code specifies the Postgres driver and the same database name, but now it reads from the issues.rdf file, which, if you remember, contains the metadata extracted from the sample issue tracker files.
Now you can reprise the queries we've already looked at, but with semantic capabilities. For instance, in the last installment I showed an example of a query using a regular expression to find all actions assigned to uogbuji whose body contained the string vote. Now, say you want to find all actions assigned to uogbuji that concern selection in general. You are no longer looking for a string pattern; you want a true lexical pattern at the cognitive level of language. You're looking for any words that carry the sense of selection: not only vote, but also choice and ballot.
Listing 3 is a program that executes the search.
Listing 3. Program (seman-search.py) to execute a search by English semantic concept, using WordNet
from Ft.Rdf.Drivers import Postgres
from Ft.Rdf import Model, Util
from Ft.Rdf.Model import REGEX

USER_ID_BASE = 'http://users.rdfinference.org/ril/issue-tracker#'
IT_SCHEMA_BASE = 'http://xmlns.rdfinference.org/ril/issue-tracker#'
WN_RDF_BASE = "http://www.cogsci.princeton.edu/~wn/schema/"

#Accumulates the word forms gathered by HyponymWords
g_relatedWords = []

def SemanSearch(word):
    #Connect to the 'sit' database that holds WordNet and issue data
    db = Postgres.GetDb('sit')
    db.begin()
    model = Model.Model(db)
    #Find the synset resource of which we have the word form
    noun = Util.GetSubject(model, WN_RDF_BASE+'wordForm', word)
    print 'Actions assigned to uogbuji related to "%s":'%(word)
    #Get all word forms of this synset and its hyponyms
    HyponymWords(noun, model)
    #Combine all the words into one large regular expression;
    #the group keeps the leading and trailing .* outside the alternation
    pattern = ".*(" + "|".join(g_relatedWords) + ").*"
    #Use this regex to search for the appropriate concepts
    actions = Util.GetSubjects(model, IT_SCHEMA_BASE+'body', pattern,
                               objectFlags=REGEX)
    #Iterate over the matching actions
    for action in actions:
        #See if this action is assigned to uogbuji
        assignee = Util.GetObject(model, action, IT_SCHEMA_BASE+'assign-to')
        body = Util.GetObject(model, action, IT_SCHEMA_BASE+'body')
        if assignee == USER_ID_BASE+'uogbuji':
            print "*", body
    db.rollback()
    return

def HyponymWords(noun, model):
    #Record the word forms of this synset
    words = Util.GetObjects(model, noun, WN_RDF_BASE+'wordForm')
    g_relatedWords.extend(words)
    #Recurse into every synset that is a hyponym of this one
    hypos = Util.GetSubjects(model, WN_RDF_BASE+'hyponymOf', noun)
    for h in hypos:
        HyponymWords(h, model)
    return

if __name__ == "__main__":
    import sys
    #The base concept is the first command-line argument
    SemanSearch(sys.argv[1])
In Listing 3, SemanSearch is the main function. It begins by accessing the RDF model that is populated with WordNet and issue tracker data. It takes an argument: a word that is to be the basis of the concept search. The first step is to translate the word into a WordNet synset resource. Then the program adds all other word forms (synonyms) in this synset to a search list, as well as all word forms of hyponym synsets. As an example, the result of gathering this search list given the starting concept selection is the 89 word forms in Listing 4.
Listing 4. Results of searching for synonyms and hyponyms for selection
['choice', 'pick', 'selection', 'casting', 'sampling', 'random sampling', 'proportional sampling', 'representative sampling', 'stratified sampling', 'conclusion', 'decision', 'determination', 'appointment', 'assignment', 'designation', 'naming', 'nominating', 'nomination', 'co-optation', 'co-option', 'delegacy', 'ordaining', 'ordination', 'laying on of hands', 'call', 'move', 'demarche', 'maneuver', 'maneuvering', 'manoeuvering', 'manoeuvre', 'tactical maneuver', 'tactical manoeuver', 'parking', 'device', 'gimmick', 'twist', 'fast one', 'trick', 'feint', 'gambit', 'ploy', 'stratagem', 'artifice', 'ruse', 'measure', 'step', 'countermeasure', 'countermine', 'porcupine provision', 'shark repellent', 'golden parachute', 'greenmail', 'pac-man strategy', 'poison pill', 'suicide pill', 'safe harbor', 'scorched-earth policy', 'casting lots', 'drawing lots', 'sortition', 'finding', 'finding of fact', 'verdict', 'compromise verdict', 'quotient verdict', 'directed verdict', 'false verdict', 'general verdict', 'partial verdict', 'special verdict', 'conclusion of law', 'finding of law', 'volition', 'willing', 'election', 'co-optation', 'co-option', 'ballot', 'balloting', 'vote', 'voting', 'secret ballot', 'split ticket', 'straight ticket', 'multiple voting', 'casting vote', 'veto', 'pocket veto']
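Listing 4 contains 'co-optation' and 'co-option' twice. Whether that comes from one synset reached by two routes or from two synsets that share word forms, a traversal that remembers what it has already seen avoids both repeated work and repeated words. The following is a minimal sketch in plain Python 3 under that assumption. It uses a small made-up hyponym graph rather than the real WordNet data, and the synset names and the function related_words are hypothetical.

```python
from collections import deque

# Hypothetical data: synset -> word forms, and synset -> direct hyponym synsets.
WORD_FORMS = {
    "s_selection": ["choice", "pick", "selection"],
    "s_appointment": ["appointment", "co-option"],
    "s_election": ["election"],
    "s_cooption": ["co-option", "co-optation"],
    "s_vote": ["ballot", "vote"],
}
HYPONYMS = {
    "s_selection": ["s_appointment", "s_election"],
    "s_appointment": ["s_cooption"],
    "s_election": ["s_cooption", "s_vote"],
}


def related_words(root):
    """Collect word forms of root and all its hyponyms, breadth first, no repeats."""
    seen_synsets = {root}
    seen_words = set()
    words = []
    queue = deque([root])
    while queue:
        synset = queue.popleft()
        for form in WORD_FORMS.get(synset, []):
            if form not in seen_words:
                seen_words.add(form)
                words.append(form)
        for child in HYPONYMS.get(synset, []):
            # Skip synsets already queued through another hypernym
            if child not in seen_synsets:
                seen_synsets.add(child)
                queue.append(child)
    return words


print(related_words("s_selection"))
```

An explicit queue also sidesteps Python's recursion limit, which could matter for very deep hierarchies.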
The program assembles the word forms into one big regular expression in order to do a one-pass search of issue actions for words in this search list. The resulting list of actions is then further narrowed to actions assigned to "uogbuji," as explained in the previous column. The following is the result of running the program (seman-search.py) in Listing 3:
$ python seman-search.py selection
Actions assigned to uogbuji related to "selection":
* Organize a vote on this topic
The base concept is passed in on the command line (selection in this case), and the matching actions are printed out.
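One detail worth illustrating is how the word list becomes a single pattern. The sketch below uses Python 3's standard re module on the client side, not 4RDF's REGEX query flag, so it is an assumption about how one might tighten the match rather than a description of what Listing 3 does. It escapes each word form, groups the alternation, and adds word boundaries so that a body containing "Devote" is not treated as a hit for vote. The function name build_pattern and the sample bodies other than the first are made up.

```python
import re


def build_pattern(words):
    """Compile one regex that matches any of the words as a whole word or phrase."""
    # Longest first, so multi-word forms are tried before their shorter prefixes
    ordered = sorted(set(words), key=lambda w: (-len(w), w))
    alternation = "|".join(re.escape(w) for w in ordered)
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)


pattern = build_pattern(
    ["choice", "ballot", "vote", "casting vote", "pac-man strategy"]
)

bodies = [
    "Organize a vote on this topic",
    "Devote more time to testing",
    "Adopt a Pac-Man strategy",
]
matches = [body for body in bodies if pattern.search(body)]
print(matches)
```

Whether word boundaries are desirable depends on the data: they exclude accidental substrings, but they also exclude inflected forms such as "votes" unless those appear in the word list.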
There are problems with this approach to semantic searching, of course. The most obvious is performance. The size of the WordNet/RDF graph makes it expensive to traverse. The sample session above, using selection as a base concept for searching, took more than two minutes to run on my PIII 1GHz laptop, grinding the on-disk DBMS quite heavily. Almost all of the time is spent in the recursive descent of the hyponym chain. I'm almost afraid to speculate how long it would hang up (and maybe crash) the machine if you were to search for such an abstract concept as thing.
Part of the problem is that, as in the previous installment, this example still uses truly brute-force means to query the RDF model. It does not really take advantage of any optimizations. I'll address some of this later on in the series when I look at RDF Inference Language (RIL). Even with optimizations, though, there are likely to be lingering performance problems with such powerful semantic searching.
You can deal with some of the performance problems through policy. You can prohibit searches with more than a certain level of abstractness. One crude measurement of abstractness might be the longest distance from a synset to its most distant hyponym. Another possible solution, since most of the processing time is spent traversing the WordNet graph, is to optimize common searches by using a crawler to prepare lists of relevant concepts.
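The crude abstractness measure described above can be sketched as follows, in plain Python 3 over a made-up, acyclic hyponym graph. The graph, the threshold value, and the function names are all assumptions for illustration. The memoization stands in for the idea of preparing results ahead of time, since computing the measure on demand would itself require the expensive traversal.

```python
from functools import lru_cache

# Hypothetical hyponym graph: synset -> direct hyponyms (assumed acyclic).
HYPONYMS = {
    "thing": ["act", "object"],
    "act": ["selection"],
    "selection": ["election", "sampling"],
    "election": ["vote"],
}

MAX_ABSTRACTNESS = 2  # illustrative policy threshold, not a value from the article


@lru_cache(maxsize=None)
def abstractness(synset):
    """Number of hyponym links on the longest path from synset to a descendant."""
    children = HYPONYMS.get(synset, ())
    if not children:
        return 0
    return 1 + max(abstractness(child) for child in children)


def search_allowed(synset):
    """Apply the policy: refuse concepts that sit too high in the hierarchy."""
    return abstractness(synset) <= MAX_ABSTRACTNESS


for concept in ("vote", "selection", "thing"):
    print(concept, abstractness(concept), search_allowed(concept))
```

In this toy graph a search on selection would be permitted, while a search on thing would be refused before any traversal of the full model begins.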
Another problem is that the WordNet effort is still incomplete. Almost every noun or noun phrase in common use is mapped in WordNet, but not yet verbs, modifiers, and other parts of speech. This does limit searches. Then there is the fact that WordNet is an English language project, so the power of semantic searching might not be available for foreign languages. There is a EuroWordNet effort to develop mappings for European languages, but this is also ongoing. Also, of course, many important language groups, such as East Asian and Middle Eastern, may not yet have such facilities available.
I have been pleased to see the excited reactions that I get from demonstrating the possibilities opened up by semantic searching using RDF. It does help get across the value of semi-structured relationship and metadata management, which is a boon, because it is not easy to explain such abstract concepts to businessmen and even to many developers. I'll continue this exploration of using knowledge management to improve applications by looking in more detail at the RDF-driven issue tracker. The next installment, however, will detour to an update of matters that I have covered earlier on in the column.
- The Princeton team's home page for WordNet: a lexical database for the English language.
- The RDF translation of WordNet, which is used in this article. There is also an RDF schema to be found here.
- WordNet: An Electronic Lexical Database, ISBN 026206197X (MIT Press 1998)
- WordNet for the Web, a service that represents each WordNet concept as a separate Web page
- A tool to render WordNet for the Web pages as a set of Resource Directory Description Language (RDDL) documents
- EuroWordNet: "Building a multilingual database with wordnets for several European languages."
Uche Ogbuji is a consultant and co-founder of Fourthought Inc., a software vendor and consultancy specializing in XML solutions for enterprise knowledge management applications. Fourthought develops 4Suite, an open-source platform for XML, RDF and knowledge-management applications. Mr. Ogbuji is a computer engineer and writer born in Nigeria, living and working in Boulder, Colorado, USA. You can contact Mr. Ogbuji at firstname.lastname@example.org.