Thinking XML: Basic XML and RDF techniques for knowledge management, Part 3

Knowledge from semantics

Uche Ogbuji (uche@ogbuji.net), Principal consultant, Fourthought, Inc.
Uche Ogbuji
Uche Ogbuji is a consultant and co-founder of Fourthought Inc., a software vendor and consultancy specializing in XML solutions for enterprise knowledge management applications. Fourthought develops 4Suite, an open-source platform for XML, RDF and knowledge-management applications. Mr. Ogbuji is a computer engineer and writer born in Nigeria, living and working in Boulder, Colorado, USA. You can contact Mr. Ogbuji at uche@ogbuji.net.

Summary:  This column, the third in a series, shows how to add semantic knowledge to an RDF application by incorporating WordNet synonym sets. With the added knowledge of the WordNet lexical database, you can search a set of RDF data for related concepts, not just one keyword at a time. As the demonstration issue-tracker application shows, that means searching once for instances that fit within the concept of "selection" rather than searching individually on "vote," "choice," "ballot," and 86 other related terms. Columnist Uche Ogbuji's sample code in Python illustrates the techniques.

Date:  01 Nov 2001
Level:  Intermediate


Now that I've outlined basic techniques for extracting data from XML applications to RDF in previous columns, you can proceed to reap some of the rewards of this extraction. If you haven't recently read the two previous installments, you may want to review them (see the links in the related-content section of the column's table of contents, to the right) before reading on.

The demonstration application, you may recall, is an issue tracker that manages incident data in XML formats. So far the columns have looked at techniques for extracting RDF from this data, and at basic techniques for querying the resulting RDF model. Now let's take a close look at one instance of why all this effort is so valuable.

Introducing WordNet

WordNet is a project of Princeton University. Styled "a lexical database for the English language," it is a system that describes and classifies words and concepts by gathering collections of synonyms into groups called synonym sets, or synsets. I can't praise enough this important long-running project, which represents such admirable industry. It is doubly important because its openness means that practically any developer can use it. WordNet has an "unencumbered" license. It is similar to a BSD license in that the only real restriction is that you not misappropriate the Princeton University trademark in promotion of any derivative of WordNet.

I must say that it is always nice to find that some of the most important fruits of intellectual labor are freely available for the common good. These days we hear too much news of organizations attempting to make dubious profits by taking freely from common knowledge and refusing to return the contribution without payment.

WordNet, currently in version 1.7, includes a glossary of synsets that represent tens of thousands of nouns. The synsets are related in a variety of ways, most notably as hyponyms: concepts that are a type of other concepts, in effect subclasses (a hypernym is the superclass of a hyponym). WordNet also includes mappings between concepts that are similar without being synonyms.

Princeton distributes WordNet itself as data files and command-line query tools for various platforms. Many projects have adapted and enhanced WordNet, and because it represents such a well-constructed network of resources, the RDF community has been especially active in adopting WordNet. The several related projects include Dan Brickley's WordNet for the Web, and Dr. Jonathan Borden's adaptation of that to a browsable XHTML format. In this article, I use a straightforward translation of the WordNet databases to RDF, courtesy of Sergey Melnik and Stefan Decker, who, like Brickley and Borden, are both quite busy in the RDF community. Most of the projects I mention are still based on WordNet 1.6, including the Melnik and Decker RDF translation used here.


Setting up WordNet/RDF

This column illustrates the use of RDF tools and the WordNet RDF models to add semantic features to the RDF-powered issue tracker. You can use any RDF tools, with some variations, to follow the process; I'll be working with 4RDF, from my company's open-source 4Suite. WordNet in RDF form is huge, so I'll be using 4RDF's persistent database back end to manage it. 4RDF allows you to manage models in memory, which is how we've been using it so far in this series, or in persistent storage (in a SQL DBMS such as PostgreSQL or Oracle, in a flat file, or in Mangrove, an experimental customized RDF back end I've written). I use PostgreSQL for this column's examples because Mangrove will probably make its debut in the next release of 4Suite, 0.12.0, which may not be out before publication time. PostgreSQL is an open-source, enterprise-quality relational DBMS that comes with many Linux distributions, with packages available for many other platforms.

To set up, first of all, download the WordNet/RDF files. I had to fix a couple of places where the file wordnet_hyponyms-20010201.rdf was corrupted. I amended line 58268 from

<rdf:Descripout="&a;103178459">

to

<rdf:Description rdf:about="&a;103178459">

and line 109228 from

<b:hyponymOf rdf:resof rdf:resource="&a;105862019"/>

to


<b:hyponymOf rdf:resource="&a;105862019"/>
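If you would rather script the repair than edit the file by hand, the two corrections can be applied as exact string replacements. The following is a minimal sketch in present-day Python; the function name repair_text is hypothetical, and matching on the damaged text rather than on line numbers is an assumption made so that the sketch does not depend on the file's exact layout.

```python
# Sketch: repair the two damaged lines described above by exact string
# replacement. The helper name repair_text is hypothetical.

FIXES = [
    ('<rdf:Descripout="&a;103178459">',
     '<rdf:Description rdf:about="&a;103178459">'),
    ('<b:hyponymOf rdf:resof rdf:resource="&a;105862019"/>',
     '<b:hyponymOf rdf:resource="&a;105862019"/>'),
]


def repair_text(text):
    """Return text with each known corruption replaced by its corrected form."""
    for damaged, corrected in FIXES:
        text = text.replace(damaged, corrected)
    return text
```

To use it, read wordnet_hyponyms-20010201.rdf into a string, pass the string to repair_text, and write the result back out. Text that contains neither damaged line passes through unchanged.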


Next, add the noun and hyponym database WordNet RDF files to a PostgreSQL-based RDF model named sit (for semantic issue tracker):


Listing 1. Adding the WordNet RDF files to a database RDF model named sit

$ 4rdf --driver=Postgres --dbName=sit wordnet_nouns-20010201.rdf 
$ 4rdf --driver=Postgres --dbName=sit wordnet_hyponyms-20010201.rdf 

As you see in Listing 1, you must specify the Postgres driver, which requires that you specify a database name. A database of that name will be created if it does not exist, and the generated statements will be added to the database of that name if it does exist.

Before turning back to the issue tracker, try running a simple test script on the WordNet model created so far.


Listing 2. A small test program (wn-test.py) to exercise the WordNet RDF model

from Ft.Rdf.Drivers import Postgres
from Ft.Rdf import Model, Util

WN_RDF_BASE = "http://www.cogsci.princeton.edu/~wn/schema/"

def Test():
    db = Postgres.GetDb('sit')
    db.begin()
    m = Model.Model(db)
    print "Size of the model (number of statements): ", m.size()

    print "Synonyms of the word 'knowledge':"
    noun = Util.GetSubject(m, WN_RDF_BASE+'wordForm', 'knowledge')
    print Util.GetObjects(m, noun, WN_RDF_BASE+'wordForm')

    print "Classification chain for 'knowledge':"

    HypernymChain(noun, 'knowledge', m)
    
    db.rollback()
    return


def HypernymChain(noun, wform, m):
    hyper = Util.GetObject(m, noun, WN_RDF_BASE+'hyponymOf')
    if hyper:
        hwform = Util.GetObject(m, hyper, WN_RDF_BASE+'wordForm')
        print "%s is a kind of %s"%(wform, hwform)
        HypernymChain(hyper, hwform, m)
    return


if __name__ == "__main__":
    Test()



In Listing 2, Postgres.GetDb('sit') makes a connection to the RDF database back end you created. After this connection is made, you begin a transaction on the back end and use it to set up a model object, which you can then query. First of all, you just look at the number of statements in the model. Then you use some of the basic sorts of RDF queries introduced in the last article to find the synonyms of the word knowledge and to trace the chain of its classification among concepts.

WordNet/RDF uses URIs in the form

http://www.cogsci.princeton.edu/~wn/concept#100001740

to represent noun synsets, where 100001740 is the ID of the synset. There is then a wordForm statement for each word that is one of the synonyms in the synset. If appropriate, there is also a hyponymOf statement to tie each synset to its superclass. Accordingly, the test program gets synonyms using a couple of simple queries according to this schema. HypernymChain is a recursive function that takes a synset resource and the word form of one of the synonyms, and looks for a hypernym. Save the listing to a file called wn-test.py and run it, as follows:


$ python wn-test.py 
Size of the model (number of statements):  351632
Synonyms of the word 'knowledge':
['cognition', 'knowledge']
Classification chain for 'knowledge':
knowledge is a kind of psychological feature
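To make the schema concrete without a database, here is a minimal sketch that holds a handful of statements as (subject, predicate, object) tuples in a plain Python list and walks them much as HypernymChain does. It is written in present-day Python and does not use 4RDF. The helper names are invented for illustration, the synset IDs are placeholders rather than real WordNet IDs, and the single hyponymOf link simply mirrors the output above.

```python
# Sketch of the WordNet/RDF schema using plain tuples instead of 4RDF.
# The synset IDs below are placeholders, not real WordNet IDs.

CONCEPT = "http://www.cogsci.princeton.edu/~wn/concept#"
SCHEMA = "http://www.cogsci.princeton.edu/~wn/schema/"

KNOWLEDGE = CONCEPT + "placeholder-1"
PSYCH_FEATURE = CONCEPT + "placeholder-2"

statements = [
    (KNOWLEDGE, SCHEMA + "wordForm", "cognition"),
    (KNOWLEDGE, SCHEMA + "wordForm", "knowledge"),
    (KNOWLEDGE, SCHEMA + "hyponymOf", PSYCH_FEATURE),
    (PSYCH_FEATURE, SCHEMA + "wordForm", "psychological feature"),
]


def objects(subject, predicate):
    """All objects of statements matching the subject and predicate."""
    return [o for (s, p, o) in statements if s == subject and p == predicate]


def subject_of(predicate, obj):
    """The first subject with the given predicate and object, or None."""
    for s, p, o in statements:
        if p == predicate and o == obj:
            return s
    return None


def hypernym_chain(synset):
    """Follow hyponymOf links upward, collecting each synset's word forms."""
    chain = []
    seen = set()  # guards against accidental cycles in the data
    while synset is not None and synset not in seen:
        seen.add(synset)
        chain.append(objects(synset, SCHEMA + "wordForm"))
        parents = objects(synset, SCHEMA + "hyponymOf")
        synset = parents[0] if parents else None
    return chain


noun = subject_of(SCHEMA + "wordForm", "knowledge")
print(objects(noun, SCHEMA + "wordForm"))  # ['cognition', 'knowledge']
print(hypernym_chain(noun))
```

The loop replaces the recursion of Listing 2 only to keep the sketch short; the traversal is the same: look up the synset for a word form, then follow hyponymOf until no superclass remains.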



A lexical layer for the issue tracker

The last installment looked at how to transform multiple XML files in order to extract serialized RDF, and how to then pass the results through an RDF processor. Now it's time to add those statements to the same model that contains the WordNet statements:

$ 4rdf --driver=Postgres --dbName=sit issues.rdf

Again the code specifies the Postgres driver and the same database name, but now it reads from the issues.rdf file, which, if you remember, contains the metadata extracted from the sample issue tracker files.

Now you can reprise the queries we've already looked at, but with semantic capabilities. For instance, in the last installment I showed an example of a query using a regular expression to find all actions assigned to uogbuji whose body contained the string vote. Now, say you want to find all actions assigned to uogbuji that concern selection in general. You are no longer looking for a string pattern; instead you want a true lexical pattern at the cognitive level of language. You're looking for any words that carry the sense of selection: not only vote, but also choice and ballot.

Listing 3 is a program that executes the search.


Listing 3. Program (seman-search.py) to execute a search by English semantic concept, using WordNet

from Ft.Rdf.Drivers import Postgres
from Ft.Rdf import Model, Util
from Ft.Rdf.Model import REGEX

USER_ID_BASE = 'http://users.rdfinference.org/ril/issue-tracker#'
IT_SCHEMA_BASE = 'http://xmlns.rdfinference.org/ril/issue-tracker#'
WN_RDF_BASE = "http://www.cogsci.princeton.edu/~wn/schema/"

g_relatedWords = []

def SemanSearch(word):
    db = Postgres.GetDb('sit')
    db.begin()
    model = Model.Model(db)

    #Find the synset resource of which we have the word form
    noun = Util.GetSubject(model, WN_RDF_BASE+'wordForm', word)

    print 'Actions assigned to uogbuji related to "%s":'%(word)

    #Get all word forms of this synset and its hyponyms
    HyponymWords(noun, model)

    #Combine all the words into one large regular expression; the
    #parentheses make the alternation apply to the words only
    pattern = ".*(" + "|".join(g_relatedWords) + ").*"

    #Use this regex to search for the appropriate concepts
    actions = Util.GetSubjects(model, IT_SCHEMA_BASE+'body', pattern,
                               objectFlags=REGEX)

    #Iterate over the bodies of the matching actions
    for action in actions:
        #See if this action is assigned to uogbuji
        assignee = Util.GetObject(model, action, IT_SCHEMA_BASE+'assign-to')
        body = Util.GetObject(model, action, IT_SCHEMA_BASE+'body')
        if assignee == USER_ID_BASE+'uogbuji':
            print "*", body

    db.rollback()
    return


def HyponymWords(noun, model):
    #Add the word forms of this synset to the search list
    words = Util.GetObjects(model, noun, WN_RDF_BASE+'wordForm')
    g_relatedWords.extend(words)
    #Recurse into every synset that is a hyponym of this one
    hypos = Util.GetSubjects(model, WN_RDF_BASE+'hyponymOf', noun)
    for h in hypos:
        HyponymWords(h, model)
    return


if __name__ == "__main__":
    import sys
    SemanSearch(sys.argv[1])


In Listing 3, SemanSearch is the main function. It begins by accessing the RDF model that is populated with WordNet and issue tracker data. It takes an argument: a word that is to be the basis of the concept search. The first step is to translate the word into a WordNet synset resource. Then the program adds all other word forms (synonyms) in this synset to a search list, as well as all word forms of hyponym synsets. As an example, the result of gathering this search list given the starting concept selection is the 89 word forms in Listing 4.


Listing 4. Results of searching for synonyms and hyponyms for selection

['choice', 'pick', 'selection', 'casting', 'sampling', 'random 
sampling', 'proportional sampling', 'representative sampling', 
'stratified sampling', 'conclusion', 'decision', 'determination', 
'appointment', 'assignment', 'designation', 'naming', 'nominating', 
'nomination', 'co-optation', 'co-option', 'delegacy', 'ordaining', 
'ordination', 'laying on of hands', 'call', 'move', 'demarche', 
'maneuver', 'maneuvering', 'manoeuvering', 'manoeuvre', 
'tactical maneuver', 'tactical manoeuver', 'parking', 'device', 
'gimmick', 'twist', 'fast one', 'trick', 'feint', 'gambit', 'ploy', 
'stratagem', 'artifice', 'ruse', 'measure', 'step', 'countermeasure', 
'countermine', 'porcupine provision', 'shark repellent', 'golden 
parachute', 'greenmail', 'pac-man strategy', 'poison pill', 'suicide 
pill', 'safe harbor', 'scorched-earth policy', 'casting lots', 'drawing 
lots', 'sortition', 'finding', 'finding of fact', 'verdict', 'compromise 
verdict', 'quotient verdict', 'directed verdict', 'false verdict', 
'general verdict', 'partial verdict', 'special verdict', 'conclusion 
of law', 'finding of law', 'volition', 'willing', 'election', 'co-optation', 
'co-option', 'ballot', 'balloting', 'vote', 'voting', 'secret ballot', 
'split ticket', 'straight ticket', 'multiple voting', 'casting vote', 
'veto', 'pocket veto']

The program assembles the word forms into one big regular expression in order to do a one-pass search of issue actions for words in this search list. The resulting list of actions is then further narrowed to actions assigned to "uogbuji," as explained in the previous column. The following is the result of running the program (seman-search.py) in Listing 3:

$ python seman-search.py selection
Actions assigned to uogbuji related to "selection":
* Organize a vote on this topic

The base concept is passed in on the command line (selection in this case), and the matching actions are printed out.
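The word-gathering and pattern-building steps can also be tried in isolation, without a database. The following is a minimal sketch in present-day Python that does not use 4RDF. The miniature hyponym graph is an assumption loosely based on the word list in Listing 4, not real WordNet data; the function names and the second action body are invented for illustration. Unlike Listing 3, it drops duplicate word forms, escapes each word before joining, and matches whole words only.

```python
import re

# Illustrative miniature of the hyponym graph; the grouping is an assumption
# loosely based on the word list in Listing 4, not real WordNet data.
WORD_FORMS = {
    "selection": ["choice", "pick", "selection"],
    "election": ["election"],
    "vote": ["ballot", "balloting", "vote", "voting"],
}
HYPONYMS = {
    "selection": ["election"],
    "election": ["vote"],
    "vote": [],
}


def related_words(synset):
    """Collect word forms of a synset and all its hyponyms, without duplicates."""
    words, seen, stack = [], set(), [synset]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        for w in WORD_FORMS.get(current, []):
            if w not in words:
                words.append(w)
        # Reverse so hyponyms are visited in their listed order.
        stack.extend(reversed(HYPONYMS.get(current, [])))
    return words


def build_pattern(words):
    """One regex matching any of the words as a whole word, ignoring case."""
    # Longest alternatives first, so multiword forms win over their prefixes.
    alternatives = sorted((re.escape(w) for w in words), key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")\b", re.IGNORECASE)


pattern = build_pattern(related_words("selection"))
bodies = ["Organize a vote on this topic", "Fix the parser crash"]
print([b for b in bodies if pattern.search(b)])
# ['Organize a vote on this topic']
```

Whole-word matching is a design choice with a cost: it avoids hits such as pick inside picket, but it also misses inflected forms such as voted, which a plain substring match would catch.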


All is not sweets and candles

There are problems with this approach to semantic searching, of course. The most obvious is performance. The size of the WordNet/RDF graph makes it expensive to traverse. The sample session above, using selection as a base concept for searching, took more than two minutes to run on my PIII 1GHz laptop, grinding the on-disk DBMS quite heavily. Almost all of the time is spent in the recursive descent of the hyponym chain. I'm almost afraid to speculate how long it would hang up (and maybe crash) the machine if you were to search for such an abstract concept as thing.

Part of the problem is that, as in the previous installment, this example still uses truly brute-force means to query the RDF model. It does not really take advantage of any optimizations. I'll address some of this later on in the series when I look at RDF Inference Language (RIL). Even with optimizations, though, there are likely to be lingering performance problems with such powerful semantic searching.

You can deal with some of the performance problems through policy. You can prohibit searches with more than a certain level of abstractness. One crude measurement of abstractness might be the longest distance from a synset to its most distant hyponym. Another possible solution, since most of the processing time is spent traversing the WordNet graph, is to optimize common searches by using a crawler to prepare lists of relevant concepts.
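The crude abstractness measure described above, the longest distance from a synset to its most distant hyponym, is easy to sketch. The example below is in present-day Python and does not use 4RDF. The miniature graph, the function names, and the threshold value are all hypothetical, and the sketch assumes the hyponym graph contains no cycles.

```python
from functools import lru_cache

# Hypothetical miniature hyponym graph: each key maps to its direct hyponyms.
HYPONYMS = {
    "thing": ["selection", "parking"],
    "selection": ["election"],
    "election": ["vote"],
    "vote": [],
    "parking": [],
}


@lru_cache(maxsize=None)
def abstractness(synset):
    """Longest distance from a synset to its most distant hyponym (0 for a leaf)."""
    children = HYPONYMS.get(synset, [])
    if not children:
        return 0
    return 1 + max(abstractness(c) for c in children)


MAX_ABSTRACTNESS = 2  # policy threshold, chosen arbitrarily for this sketch


def search_allowed(synset):
    """Apply the policy: refuse concepts that sit too high in the hierarchy."""
    return abstractness(synset) <= MAX_ABSTRACTNESS


for concept in ("vote", "selection", "thing"):
    print(concept, abstractness(concept), search_allowed(concept))
# vote 0 True
# selection 2 True
# thing 3 False
```

Because of the memoization, each synset's value is computed only once per run. Against the full WordNet graph, these values would more sensibly be computed ahead of time and stored, in the same spirit as the crawler-prepared lists.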

Another problem is that the WordNet effort is still incomplete. Almost every noun or noun phrase in common use is mapped in WordNet, but not yet verbs, modifiers, and other parts of speech. This does limit searches. Then there is the fact that WordNet is an English-language project, so the power of semantic searching might not be available for other languages. There is a EuroWordNet effort to develop mappings for European languages, but this too is still in progress. Also, of course, many important language groups, such as East Asian and Middle Eastern, may not yet have such facilities available.


Conclusion

I have been pleased to see the excited reactions that I get from demonstrating the possibilities opened up by semantic searching using RDF. It helps get across the value of semi-structured relationship and metadata management, which is a boon, because it is not easy to explain such abstract concepts to businesspeople or even to many developers. I'll continue this exploration of using knowledge management to improve applications by looking in more detail at the RDF-driven issue tracker. The next installment, however, will detour to an update of matters that I have covered earlier in the column.


