Thinking XML: Basic XML and RDF techniques for knowledge management, Part 2

Combining files into an RDF model, and basic RDF querying

Uche Ogbuji (uche@ogbuji.net), Principal consultant, Fourthought, Inc.
Uche Ogbuji is a consultant and co-founder of Fourthought Inc., a software vendor and consultancy specializing in XML solutions for enterprise knowledge management applications. Fourthought develops 4Suite, an open-source platform for XML middleware. Mr. Ogbuji is a computer engineer and writer born in Nigeria, living and working in Boulder, Colorado, USA. You can contact Mr. Ogbuji at uche@ogbuji.net.

Summary:  This Thinking XML column shows how to combine metadata collected from multiple XML source documents into a single Resource Description Framework (RDF) model for effective querying. In this follow-up to his previous installment that introduced how to use XML and RDF together for knowledge management, columnist Uche Ogbuji builds on the techniques for populating RDF models with data from existing XML formats. The centerpiece of this discussion is an example in which a Web-based issue tracker, originally developed to manipulate application data in XML, is extended to take advantage of RDF. Sample code listings in XSLT and Python demonstrate two methods of aggregating metadata from XML files into a single RDF model (one using XSLT and the other using RDF), and examples of simple RDF queries.

Date:  01 Sep 2001
Level:  Intermediate
Also available in:   Japanese


In the previous installment of this column, Basic XML and RDF techniques for knowledge management, Part 1 (which you may want to review before you go any further), I introduced an example of an issue-tracker application based on collections of data in XML format. I then showed how to extract RDF from the application data using XSLT.

This column rounds things out by showing how to combine the discrete RDF snippets created by transforming each XML source document. I also demonstrate basic querying techniques.

Batch derivation of resources

The last column went through the basic exercise of extracting RDF in serialized form from XML source files. However, most of the useful advantages of knowledge management come from manipulating abstract models of RDF statements, rather than individual serialized files. An RDF model is a graph structure, which can be very simple or of mind-boggling complexity. (In fact, at the extreme, the vision of the Semantic Web is to create an RDF model at least as large as the present Web. If it takes off, this would probably be the biggest computer data structure ever put to use.)

So far, though, the example issue tracker is made up of a collection of XML files, each of which serializes a portion of an RDF model. Each file is like a piece of a jigsaw puzzle; put together, the pieces would represent an RDF model of all the data in the issue-tracker application. Each piece in itself is not very important, so the next step is to look at how to create the completed picture that is the real knowledge-management prize.

Batch conversion using XSLT

One option for assembling this picture is not to deal with jigsaw puzzle pieces at all, but to use XSLT to generate the picture all in one go. You can write a transform that gathers each XML issue document, applies the RDF conversion processing described in the last column, and then accumulates the result of each transform into one serialized RDF result. To do so, you would take advantage of the standard document() XSLT function, which can read in arbitrary documents to be treated as supplementary XML source documents for the transform.

The document() function needs to know the precise name of each document to be read in. XSLT has no standard mechanism for, say, reading in multiple documents by using a wildcard. You can solve this problem in many ways (including resorting to XSLT extensions that would allow batch document loading by wildcard or some other mechanism). Because this is not the central problem of this example, though, I chose the simple workaround of creating a hub XML document with a listing of each issue document. Listing 1, issue-hub.xml, shows what the hub document looks like with the two example issue documents (introduced in the previous column) specified.


Listing 1: Hub XML document listing each issue document



<issues>
  <issue>issue1.xml</issue>
  <issue>issue2.xml</issue>
</issues>


Once you have a hub document, you can apply the transform in Listing 2, metadata-batch.xslt, to the hub document as source to create the aggregate RDF from all listed issues.

Listing 2 may look familiar; most of it is the same as the transform from the last installment. That is because I was careful to modularize the processing of the XML structure. The root template now looks for each issue element from the hub document (note that here I use the null namespace). It then loads the document from the filename specified in the content of each element. Note the use of document(.)/* rather than just document(.). This construction prevents the processor from matching the root node of the issue documents, which would end up invoking the template intended for the hub document's root node. Those are the only differences in this column's batch transform.
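The hub-driven loop that the root template performs can be pictured outside XSLT as well. The following is a minimal Python sketch, not part of the column's sample code, that reads the hub document from Listing 1 and collects the file names a batch process would then load. The function name list_issue_documents is hypothetical, and the hub XML is inlined rather than read from issue-hub.xml so that the sketch is self-contained.

```python
import xml.etree.ElementTree as ET

# Hub document from Listing 1, inlined so the sketch needs no files on disk.
HUB_XML = """<issues>
  <issue>issue1.xml</issue>
  <issue>issue2.xml</issue>
</issues>"""


def list_issue_documents(hub_xml):
    """Return the file name held in each issue element, in document order."""
    root = ET.fromstring(hub_xml)
    # The hub uses the null namespace, so a plain tag name is enough.
    return [issue.text.strip() for issue in root.findall("issue")]


issue_files = list_issue_documents(HUB_XML)
print(issue_files)  # ['issue1.xml', 'issue2.xml']
```

Each name in the resulting list corresponds to one document(.) call in the transform; the per-issue conversion itself stays in XSLT.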

You can use any XSLT processor to perform the transform. I used 4XSLT to process it as follows:

$ 4xslt -o issues.rdf issue-hub.xml metadata-batch.xslt

The transform results in the serialized RDF in issues.rdf, which contains the metadata from both issue1.xml and issue2.xml.

Batch conversion using RDF tools

Another option for aggregating multiple XML documents into a single RDF model is to build each RDF document separately, as in the last installment, and then use an RDF parser to parse all the documents into a single abstract model. So, for instance, using 4RDF against the two RDF files generated from the example issue files in the previous column, you'd get Listing 3.


Listing 3: Results of batch conversion of two RDF issue files, derived from XML files, parsed by an RDF parser (4RDF, in this case)


$ 4rdf -d 1.rdf 2.rdf
The following is a list of resulting tuples, each in the form "subject,
predicate, object".
[
("http://meta.rdfinference.org/ril/issue-tracker/ril-20010502",
"http://xmlns.rdfinference.org/ril/issue-tracker#issue",
"#i2001030423"),
("#anonymous:e08-e06-30b-90d-a060005104", "id", "i2001030423"),
("#anonymous:e08-e06-30b-90d-a060005104",
"http://xmlns.rdfinference.org/ril/issue-tracker#author",
"http://users.rdfinference.org/ril/issue-tracker#uogbuji"),
("#anonymous:1040d07-20b-909-407-b0e0402205",
"http://xmlns.rdfinference.org/ril/issue-tracker#author", "Alexandre
Fayolle <Alexandre.Fayolle@logilab.fr>"),
("#anonymous:1040d07-20b-909-407-b0e0402205",
"http://xmlns.rdfinference.org/ril/issue-tracker#body", "The
abbreviation in listing 8 doesn't seem necessary to Nico Chauvat or
me."),
("#anonymous:e08-e06-30b-90d-a060005104",
"http://xmlns.rdfinference.org/ril/issue-tracker#comment",
"#anonymous:1040d07-20b-909-407-b0e0402205"),
("#anonymous:60e0b06-50a-80c-90c-50a0a08602",
"http://xmlns.rdfinference.org/ril/issue-tracker#author",
"http://users.rdfinference.org/ril/issue-tracker#uogbuji"),
("#anonymous:60e0b06-50a-80c-90c-50a0a08602",
"http://xmlns.rdfinference.org/ril/issue-tracker#assignment",
"http://users.rdfinference.org/ril/issue-tracker#uogbuji"),
("#anonymous:e08-e06-30b-90d-a060005104",
"http://xmlns.rdfinference.org/ril/issue-tracker#action",
"#anonymous:60e0b06-50a-80c-90c-50a0a08602"),
("http://meta.rdfinference.org/ril/issue-tracker/ril-20010502",
"http://xmlns.rdfinference.org/ril/issue-tracker#issue",
"#i2001042003"),
("#anonymous:2030002-506-10a-201-f090809c08", "id", "i2001042003"),
("#anonymous:2030002-506-10a-201-f090809c08",
"http://xmlns.rdfinference.org/ril/issue-tracker#author",
"http://users.rdfinference.org/ril/issue-tracker#nchauvat"),
("#anonymous:c000706-b0b-d02-d0a-e050c0460b",
"http://xmlns.rdfinference.org/ril/issue-tracker#author", "Alexandre
Fayolle <Alexandre.Fayolle@logilab.fr>"),
("#anonymous:c000706-b0b-d02-d0a-e050c0460b",
"http://xmlns.rdfinference.org/ril/issue-tracker#body", "I agree"),
("#anonymous:2030002-506-10a-201-f090809c08",
"http://xmlns.rdfinference.org/ril/issue-tracker#comment",
"#anonymous:c000706-b0b-d02-d0a-e050c0460b"),
("#anonymous:b0c0c00-800-2-c08-20a0d0f209",
"http://xmlns.rdfinference.org/ril/issue-tracker#author",
"http://users.rdfinference.org/ril/issue-tracker#uogbuji"),
("#anonymous:b0c0c00-800-2-c08-20a0d0f209",
"http://xmlns.rdfinference.org/ril/issue-tracker#assignment",
"http://users.rdfinference.org/ril/issue-tracker#uogbuji"),
("#anonymous:2030002-506-10a-201-f090809c08",
"http://xmlns.rdfinference.org/ril/issue-tracker#action",
"#anonymous:b0c0c00-800-2-c08-20a0d0f209"),
]

The -d option in the first line of Listing 3 tells 4RDF to dump the subject/predicate/object triples from the abstract RDF model to the command line, which is the display that follows in the rest of Listing 3.
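Conceptually, parsing several RDF files into one model amounts to taking the union of their statements. The sketch below illustrates that idea in plain Python, assuming a model can be represented as a set of (subject, predicate, object) tuples; it does not use 4RDF, the helper merge_models is hypothetical, and the two small input sets are stand-ins built from statements shown in Listing 3.

```python
def merge_models(*models):
    """Union several collections of (subject, predicate, object) triples."""
    merged = set()
    for model in models:
        merged.update(model)
    return merged


IT = "http://xmlns.rdfinference.org/ril/issue-tracker#"
DOC = "http://meta.rdfinference.org/ril/issue-tracker/ril-20010502"

# Stand-ins for the statements parsed from 1.rdf and 2.rdf (one triple each).
model_1 = {(DOC, IT + "issue", "#i2001030423")}
model_2 = {(DOC, IT + "issue", "#i2001042003")}

combined = merge_models(model_1, model_2)
for triple in sorted(combined):
    print(triple)
```

Because both statements share the same subject, they attach to a single node in the merged graph, which is what lets queries span the original documents.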

Once you have the complete RDF model for the issue tracker, you'll probably want to look at it in a more human-legible form than the display in Listing 3. So, for instance, if you take the aggregate RDF file from Listing 3 and use Dan Brickley's RDF visualizer tool (see Resources) on it, you get a graph similar to Figure 1. It's quite wide for the average laptop screen, so click the link to open the chart in a separate window.

The nice thing about the visualizer display in Figure 1 is that you can make out certain patterns immediately, such as the contributions and responsibilities of user "uogbuji". Of course, I must admit that the verbose identifiers that tend to come along with URIs in RDF hamper this clarity.


Query across the system

Another immediate benefit of having an RDF model of the metadata available is that many system-wide queries become simpler and more generic than they would be if you were writing XPath queries of the collection of XML documents, or gathering that data into proprietary data structures for querying. The emergence of XML Query Language (XQuery) and proprietary document-collection query tools provided by XML repository vendors can help address this need as well, but RDF is available now, and at least the basic model is standardized.

Unfortunately, the querying of RDF models is not yet standardized, and this is a very significant hole that the RDF community has to fill. Luckily, because of the simplicity of RDF's model, it is easy to build almost every form of query through basic pattern matching. This is the fundamental approach used in 4RDF, where querying is a matter of finding statement triples that match given subject, predicate, and object patterns. As an example, look at part of the unified model just created, in Table 1.

Table 1. A few statements from the issue-tracker data

Subject:   #anonymous:a0d010d-f0c-706-20e-80a0407606
Predicate: http://xmlns.rdfinference.org/ril/issue-tracker#body
Object:    Organize a vote on this topic

Subject:   #anonymous:a0d010d-f0c-706-20e-80a0407606
Predicate: http://xmlns.rdfinference.org/ril/issue-tracker#assign-to
Object:    http://users.rdfinference.org/ril/issue-tracker#uogbuji

Subject:   #anonymous:402000d-403-309-c01-9080a205
Predicate: http://xmlns.rdfinference.org/ril/issue-tracker#body
Object:    Correct all to use the "0/1" form in the next draft.

Subject:   #anonymous:402000d-403-309-c01-9080a205
Predicate: http://xmlns.rdfinference.org/ril/issue-tracker#assign-to
Object:    http://users.rdfinference.org/ril/issue-tracker#uogbuji

Given these triples, it is easy to see how to ask questions such as "What are the IDs of the actions assigned to user uogbuji?" and "What is the body of each action?" The first question translates to finding triples that match the following pattern, where "*" is a wildcard matching any value:

subject:   *
predicate: http://xmlns.rdfinference.org/ril/issue-tracker#assign-to
object:    http://users.rdfinference.org/ril/issue-tracker#uogbuji

One of the responses to this would be a statement with subject "#anonymous:a0d010d-f0c-706-20e-80a0407606", representing a matching action ID. Note that this is a special ID generated by 4RDF for anonymous resources (ones without an ID explicitly assigned by the application). The sequence "anonymous" is followed by a hexadecimal value making up a universally unique identifier (UUID). The second example question, given this ID, is equivalent to matching the following pattern, which returns a statement with object "Organize a vote on this topic":

subject:   #anonymous:a0d010d-f0c-706-20e-80a0407606
predicate: http://xmlns.rdfinference.org/ril/issue-tracker#body
object:    *

And just building on this simple idea, almost any form of RDF query is possible, even the equivalent of relational (SQL) queries and object (OQL) queries.
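To make the wildcard matching concrete without depending on any RDF library, here is a minimal sketch in plain Python that runs both patterns against the four statements from Table 1. The function match and its parameter names are hypothetical, and None stands in for the "*" wildcard; this is an illustration of the idea, not 4RDF's implementation.

```python
USER_ID_BASE = "http://users.rdfinference.org/ril/issue-tracker#"
IT_SCHEMA_BASE = "http://xmlns.rdfinference.org/ril/issue-tracker#"

ACTION_1 = "#anonymous:a0d010d-f0c-706-20e-80a0407606"
ACTION_2 = "#anonymous:402000d-403-309-c01-9080a205"

# The four statements from Table 1 as (subject, predicate, object) tuples.
TRIPLES = [
    (ACTION_1, IT_SCHEMA_BASE + "body", "Organize a vote on this topic"),
    (ACTION_1, IT_SCHEMA_BASE + "assign-to", USER_ID_BASE + "uogbuji"),
    (ACTION_2, IT_SCHEMA_BASE + "body",
     'Correct all to use the "0/1" form in the next draft.'),
    (ACTION_2, IT_SCHEMA_BASE + "assign-to", USER_ID_BASE + "uogbuji"),
]


def match(triples, subject=None, predicate=None, object_=None):
    """Return the triples that fit the pattern; None matches any value."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if subject in (None, s)
        and predicate in (None, p)
        and object_ in (None, o)
    ]


# First pattern: which actions are assigned to uogbuji?
assigned = match(TRIPLES,
                 predicate=IT_SCHEMA_BASE + "assign-to",
                 object_=USER_ID_BASE + "uogbuji")

# Second pattern: what is the body of each of those actions?
bodies = []
for action_id, _, _ in assigned:
    for _, _, body in match(TRIPLES, subject=action_id,
                            predicate=IT_SCHEMA_BASE + "body"):
        bodies.append(body)

print(bodies)
```

A linear scan like this is fine for a handful of statements; a real RDF store would index the triples rather than scan them.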

Coding RDF queries

The Python program in Listing 4, query1.py, uses 4RDF to put into action this technique of query through pattern matching, reading in the issues.rdf file, and printing the body of all of uogbuji's assignments. (If you're not working in Python you can, of course, use other RDF query tools such as RDFDb, Jena, or one listed in Dave Beckett's RDF Resource Guide in Resources.)


Listing 4: Python program that demonstrates a simple query of an RDF model


from Ft.Rdf import Util

#Returns an RDF model object, and the database instance it uses for
#persistence (in our case, it's just a memory data structure)
model, db = Util.DeserializeFromUri('issues.rdf')
db.begin()

USER_ID_BASE = 'http://users.rdfinference.org/ril/issue-tracker#'
IT_SCHEMA_BASE = 'http://xmlns.rdfinference.org/ril/issue-tracker#'

print 'Actions assigned to uogbuji:'

#None is used as the wild-card
matching_statements = model.complete(None,
                                     IT_SCHEMA_BASE+'assign-to',
                                     USER_ID_BASE+'uogbuji'
                                     )

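for statement in matching_statements:
    #The subject of each matching statement is the action ID
    action_id = statement.subject
    #Use a separate name so the outer result list is not rebound
    body_statements = model.complete(action_id,
                                     IT_SCHEMA_BASE+'body',
                                     None
                                     )
    body = body_statements[0].object
    print "*", body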

db.commit()



In Listing 4, first the RDF is read (deserialized) from the issues.rdf file; then a query is executed to find statements indicating which actions are assigned to uogbuji. Then the subject of each statement, which is the action ID, is used to query the body of each action. If you have Python and 4Suite installed, you can run the example as follows:

$ python query1.py
Actions assigned to uogbuji:
* Organize a vote on this topic
* Correct all to use the "0/1" form in the next draft.

This is the lowest level of RDF model query and therefore a bit cumbersome. The Python program in Listing 5, query2.py, takes a few shortcuts by directly querying for relevant subjects, objects, and so on. As a result, it is a simpler version of the exact same function in Listing 4.


Listing 5: Simplify the query code


from Ft.Rdf import Util

#Returns an RDF model object, and the database instance it uses for
#persistence (in our case, it's just a memory data structure)
model, db = Util.DeserializeFromUri('issues.rdf')
db.begin()

USER_ID_BASE = 'http://users.rdfinference.org/ril/issue-tracker#'
IT_SCHEMA_BASE = 'http://xmlns.rdfinference.org/ril/issue-tracker#'

print 'Actions assigned to uogbuji:'

actions = Util.GetSubjects(model, IT_SCHEMA_BASE+'assign-to',
                           USER_ID_BASE+'uogbuji')

for action in actions:
    body = Util.GetObject(model, action, IT_SCHEMA_BASE+'body')
    print "*", body

db.commit()


Listing 5 still involves two levels of query; it would be less awkward and more efficient to use a single request that combines the queries. (That is possible using the RDF Inference Language (RIL), which I'll examine in further installments of this column. Incidentally, the draft open specification for RIL is referred to in the example issue tracker data.) Listing 6 shows the result of executing the basic query, with or without the API shortcuts.


Listing 6: Simplified query session that performs the same work as Listing 5


$ python query2.py
Actions assigned to uogbuji:
* Organize a vote on this topic
* Correct all to use the "0/1" form in the next draft.
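As a rough illustration of what combining the two levels of query into one request could look like, the following plain-Python sketch joins the assign-to and body statements in a single function. It is not RIL and does not use 4RDF; the function bodies_assigned_to is hypothetical, and the data is the set of statements from Table 1.

```python
USERS = "http://users.rdfinference.org/ril/issue-tracker#"
IT = "http://xmlns.rdfinference.org/ril/issue-tracker#"

ACTION_1 = "#anonymous:a0d010d-f0c-706-20e-80a0407606"
ACTION_2 = "#anonymous:402000d-403-309-c01-9080a205"

# Statements from Table 1 as (subject, predicate, object) tuples.
TRIPLES = [
    (ACTION_1, IT + "body", "Organize a vote on this topic"),
    (ACTION_1, IT + "assign-to", USERS + "uogbuji"),
    (ACTION_2, IT + "body",
     'Correct all to use the "0/1" form in the next draft.'),
    (ACTION_2, IT + "assign-to", USERS + "uogbuji"),
]


def bodies_assigned_to(triples, user_uri):
    """Join assign-to and body statements on their shared subject."""
    # Collect the subjects of every action assigned to the user...
    actions = {s for s, p, o in triples
               if p == IT + "assign-to" and o == user_uri}
    # ...then keep the body statements whose subject is one of them.
    return [o for s, p, o in triples
            if p == IT + "body" and s in actions]


for body in bodies_assigned_to(TRIPLES, USERS + "uogbuji"):
    print("*", body)
```

The caller makes one request and the join happens inside it, which is the shape a combined query would take regardless of the query language used.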

RDF queries don't have to involve exact matches of metadata statement components. Most query languages provide much more flexibility. As an example, the Python program in Listing 7 (query3.py) uses a regular-expression-based query to find all actions assigned to uogbuji whose body contains the string "vote".

Again, the code in Listing 7 would be a single query, and more efficient, using RIL, which I'll cover in my next column. Listing 8 shows the use of regular expressions in a query.


Listing 8: Session demonstrating the use of regular expressions in query


$ python query3.py
Actions assigned to uogbuji
* Organize a vote on this topic
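The idea behind a regular-expression query can be sketched in plain Python with the standard re module. This is not the query3.py program and does not use 4RDF's API; the function actions_with_body_matching is hypothetical, and the statements are those from Table 1.

```python
import re

USERS = "http://users.rdfinference.org/ril/issue-tracker#"
IT = "http://xmlns.rdfinference.org/ril/issue-tracker#"

ACTION_1 = "#anonymous:a0d010d-f0c-706-20e-80a0407606"
ACTION_2 = "#anonymous:402000d-403-309-c01-9080a205"

# Statements from Table 1 as (subject, predicate, object) tuples.
TRIPLES = [
    (ACTION_1, IT + "body", "Organize a vote on this topic"),
    (ACTION_1, IT + "assign-to", USERS + "uogbuji"),
    (ACTION_2, IT + "body",
     'Correct all to use the "0/1" form in the next draft.'),
    (ACTION_2, IT + "assign-to", USERS + "uogbuji"),
]


def actions_with_body_matching(triples, user_uri, body_pattern):
    """Bodies of the user's actions that match a regular expression."""
    regex = re.compile(body_pattern)
    assigned = {s for s, p, o in triples
                if p == IT + "assign-to" and o == user_uri}
    return [o for s, p, o in triples
            if p == IT + "body" and s in assigned and regex.search(o)]


matches = actions_with_body_matching(TRIPLES, USERS + "uogbuji", "vote")
print(matches)  # ['Organize a vote on this topic']
```

Subjects and predicates are still compared exactly here; only the object is matched by pattern, which avoids having to escape URI characters in the regular expression.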


And next time ...

This column demonstrates how to piece together the entire RDF model from the example issue-tracker application and illustrates basic querying of the model. The next installment will look at one very juicy plum of a feature that comes very inexpensively with the harnessing of RDF's power.



Download

Description: Sample code for column
Name: rdfcode2.zip
Size: 28 KB
Download method: HTTP



Resources

  • Downloadable zipped files of the author's sample code from the previous column that introduces RDF techniques with XML and the sample code files for this column make it easier to follow the examples or adapt them for reuse.

  • Introduction to RDF, by Uche Ogbuji, provides a basic introduction to RDF and includes links to other useful resources.

  • Dave Beckett's RDF Resource Guide is a comprehensive set of links to RDF-related articles, tools, specifications, discussions, events, and so on.

  • Dan Brickley's RDF visualizer, which is Web based, and available from any platform, renders an RDF file into a graphic representation, like that in Figure 1 of this column.

  • Two RDF-query tools mentioned but not demonstrated in this article are the query language associated with the RDF database RDFDb, which provides C and Perl access to the RDF database, and Jena, an experimental Java API for manipulating RDF models.

  • The examples in this article were tested using 4Suite's XSLT processor.

  • XML: the next big thing, a seminal IBM research paper by Tom Halfhill, discusses the possibilities of RDF for powering next-generation search engines.

  • Check out Thinking XML's previous columns.

  • The Semantic Web is gaining ground and interest and is now an official activity of the W3C.
