So far in our exploration of the issue tracker application, I have discussed, by example, the RDF data being extracted from the XML data, techniques for making this extraction, and a neat semantic searching capability that all this fuss with RDF makes possible. Now I'll take a closer look at the role schemata play in using RDF for building knowledge management features into XML applications.
Relational and object database schemata, and even XML schemata, provide documentation, guidance, and control for data-driven applications. RDF schemata are looser and more generic; they set forth our classification of the resources that we put into the RDF models. In this installment and the next, we'll look at schemata for the issue tracker RDF statements, using both the W3C RDF schema (RDFS) specification and the DARPA Agent Markup Language/Ontology Inference Layer (DAML+OIL), which is an important extension of, and improvement on, the W3C specification. Some familiarity with RDFS and DAML+OIL is useful, though I will be introducing most of the concepts I use in my examples and discussion.
Both RDFS and DAML+OIL revolve around the classification of resources. In previous installments of this column, you may have noticed that the issue tracker RDF has been rather light on classifications. In fact, it has used no classes or types at all so far. This is just fine for an RDF system. In the case of the issue tracker, since pretty much any resource can be marked up with issues -- and an issue can be pretty much anything to which we can attach authors, comments, and actions -- strict classifications are probably artificial and would just get in the way.
However, one of the strengths of RDF is that it does not require strict classifications of the sort required by many object-oriented (OO) languages. Its concept of class and type is far more general, and is open to interpretation by the model designers. A class can be the core of any sort of organization you might want to use for resources. It doesn't have to be a neat tree, such as the scientific classifications of living things. For instance, in the XML world "purchase order" is often used as an example of a document that is next to impossible to standardize even with the weight of XML behind any effort. This is because of the myriad ways that POs are classified, subclassified, and generally conceived. RDF is specifically designed to accommodate this sort of chaos.
RDFS introduces some of the worldview of OO development by putting forth the idea of a class as the natural indicator of type. Indeed a lot of RDF implementations follow this example, perhaps because OO techniques have enjoyed so much prominence recently. But I think it's extremely important to note that this pattern is not fundamental to RDF itself.
These are rather deep concepts, so a tangible example is in order here. Take the idea of a telephone number. There are many ways we can look at a telephone number, if we want to fix it in a classification scheme:
- A telephone number is a kind of number.
- A telephone number is a kind of contact datum.
- A telephone number is a kind of asset (ask any U.S. business that has struggled to reserve a toll-free number that spells out their trademark on the numeric keypad).
- A toll-free number is a kind of telephone number.
- A fax number is a kind of telephone number.
You can see some of the classic hierarchy that is a hallmark of OO thinking. You can also see some of the overlapping and tentativeness of classifications that tend to cause problems in established OO practice. Just ask any OO developer about "diamonds of death" or "non-flying birds" if you want to trigger a blinking fit. In the above, the "kind of" is often mapped to the OO concept of an "is-a" relationship, and usually defines the type of the object as a consequence of the built-in semantics of OO implementation languages.
But in the real world, there is more to type than class. Take the following statements:
- 501-555-1111 is Mark's work number.
- 500-555-1234 is Mark's home number.
- Use 500-555-1234 as Mark's emergency contact number.
- You must use 10-digit dialing to dial 555-1234 from outside the 555 exchange.
These statements all define characteristics of a phone number. They are less clearly classifications than the first set of examples, and indeed in the OO worldview, they may be indicated in many ways, such as attributes and associations, but rarely using typing. However, considering the ways people generally think about such characteristics, there is no reason to think that they aren't types just as much as the first set of statements. It is natural to say that "a work number is a type of telephone number," and for locations within the 501 area code, a "10-digit number" is naturally a "type" of telephone number. In RDF there is no reason why these characteristics shouldn't be expressed using rdf:type predicates. In fact, consider the vCard/RDF proposal, a W3C note that suggests a conversion from the very popular vCard contact specification scheme to RDF. vCard/RDF uses rdf:type to differentiate work numbers from home numbers, fax numbers from voice numbers, Internet mailboxes from Lotus Notes mailboxes, etc. It also uses rdf:type in the common RDFS sense as well: for indicating classifications within its data model.
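To make the idea of several simultaneous types concrete, here is a minimal Python sketch that models RDF statements as plain (subject, predicate, object) tuples. Every URI and class name in it (the tel: resources, ex:HomeNumber, and so on) is invented for illustration and is not taken from vCard/RDF or from the issue tracker.

```python
# Statements are (subject, predicate, object) tuples. All names below are
# hypothetical examples, not vCard/RDF or issue tracker vocabulary.
RDF_TYPE = "rdf:type"

triples = {
    ("tel:+1-500-555-1234", RDF_TYPE, "ex:TelephoneNumber"),
    ("tel:+1-500-555-1234", RDF_TYPE, "ex:HomeNumber"),
    ("tel:+1-500-555-1234", RDF_TYPE, "ex:EmergencyContact"),
    ("tel:+1-501-555-1111", RDF_TYPE, "ex:TelephoneNumber"),
    ("tel:+1-501-555-1111", RDF_TYPE, "ex:WorkNumber"),
}


def types_of(resource, graph):
    """Return every rdf:type asserted for a resource, sorted for stable output."""
    return sorted(o for s, p, o in graph if s == resource and p == RDF_TYPE)


def instances_of(rdf_class, graph):
    """Return every resource asserted to have the given rdf:type."""
    return sorted(s for s, p, o in graph if p == RDF_TYPE and o == rdf_class)


print(types_of("tel:+1-500-555-1234", triples))
print(instances_of("ex:TelephoneNumber", triples))
```

Nothing forces the three types on the first number into a single inheritance tree; each one is simply another statement about the same resource.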
But if the same predicate (rdf:type) can be used in such a divergence of ways, hasn't it become dangerously vague? I think the situation calls for refinement of the various uses of rdf:type, and it would be best if RDFS were to introduce a subproperty of rdf:type, such as rdfs:type or, if that is too confusing, rdfs:classificationType. Similarly, vCard could create a subproperty of rdf:type, such as vCard:contactType, to differentiate the various concepts of type that it employs.
The issue tracker doesn't need to do a lot of neat things with typing and classifications, but the above discussion encourages the idea that there is no reason why types, classes, and other schematic matters shouldn't be constructed quite loosely. For most RDF projects in which I've worked, it was established that you sit around the table with loads of doughnuts and caffeinated beverages in order to hammer out the schema. This is a puritan conscientiousness borrowed from the OO development and relational DBMS worlds. But in working with the issue tracker thus far, I've worked with a few instances before even coming around to the schema. There is no reason not to have done so. We are attaching issues to any Web-based resource, and we are making very loose statements about these issues.
It's time to talk schema. Listing 1 is an XML fragment that illustrates an RDFS class for an issue:
Listing 1. The Issue class
<rdfs:Class ID="Issue">
  <rdfs:label>Issue</rdfs:label>
  <rdfs:comment>A problem, suggestion or other matter for action or discussion relevant to a resource</rdfs:comment>
</rdfs:Class>
This code declares an in-line (because of the use of ID) RDFS class for an issue. Note the label and comment -- I think these are extremely important, and in my practice I require both on every resource defined, especially on schematic elements. Labels are especially important because smart RDF tools can use them to present user-friendly names for resources rather than ugly URIs.
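As a rough illustration of how a tool might use labels, the following Python sketch reads a schema fragment shaped like Listing 1 and falls back to the raw identifier when no label is present. The rdf:RDF wrapper, the shortened comment text, and the helper names read_labels and display_name are assumptions made for this example; a real RDF toolkit would work on the parsed model rather than on raw XML.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"

# A wrapped copy of the Listing 1 class; the wrapper element and the
# shortened comment are assumptions made for this sketch.
SCHEMA = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:rdfs="{RDFS}">
  <rdfs:Class ID="Issue">
    <rdfs:label>Issue</rdfs:label>
    <rdfs:comment>A problem, suggestion or other matter</rdfs:comment>
  </rdfs:Class>
</rdf:RDF>"""


def read_labels(xml_text):
    """Map each in-line resource ('#' + ID) to its rdfs:label text."""
    labels = {}
    root = ET.fromstring(xml_text)
    for element in root:
        local_id = element.get("ID")
        label = element.findtext(f"{{{RDFS}}}label")
        if local_id and label:
            labels["#" + local_id] = label.strip()
    return labels


def display_name(uri, labels):
    """Prefer the friendly label; fall back to the URI itself."""
    return labels.get(uri, uri)


labels = read_labels(SCHEMA)
print(display_name("#Issue", labels))
print(display_name("#Unlabeled", labels))
```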
Listing 2. The Author class and issue and author properties
<rdfs:Class ID="Author">
  <rdfs:label>Author</rdfs:label>
  <rdfs:comment>A person raising or posting an issue</rdfs:comment>
</rdfs:Class>
<rdf:Property ID="issue">
  <rdfs:label>issue</rdfs:label>
  <rdfs:comment>Associate an issue with its resource</rdfs:comment>
  <rdfs:range rdf:resource="#Issue"/>
</rdf:Property>
<rdf:Property ID="author">
  <rdfs:label>author</rdfs:label>
  <rdfs:comment>Associate an issue with whoever posted it</rdfs:comment>
  <!-- Where the dc entity has been set to the Dublin Core metadata element base URI -->
  <rdfs:subPropertyOf rdf:resource="&dc;creator"/>
  <rdfs:domain rdf:resource="#Issue"/>
  <rdfs:range rdf:resource="#Author"/>
</rdf:Property>
Here we define a property issue. The range statement asserts that the object of any statement with an issue predicate must have an rdf:type of Issue. We don't make any such restriction on the subject of such statements (which would be a domain statement), so in effect any resource can have an issue predicate, which is our intent. The author property is defined with both a domain and a range, and is made to be a subproperty of the "creator" metadata element from Dublin Core. This means that any issue with an author property automatically asserts a dc:creator property as well. This is a common and useful technique, and in this case it means that agent software that is familiar with Dublin Core will be able to deal with our issue tracker metadata, at least to some extent, with no extra effort. This trick is actually part of the foundation of the semantic Web.
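The subproperty mechanism can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular RDF processor implements it: statements are plain tuples, the prefixed names stand in for full URIs, and the helper names superproperties and with_superproperties are invented here.

```python
SUBPROPERTY_OF = "rdfs:subPropertyOf"

# Schema statement from Listing 2, with prefixes standing in for full URIs.
schema = {("it:author", SUBPROPERTY_OF, "dc:creator")}

# One instance statement; the object URI is abbreviated for the example.
instances = {("#i2001030423", "it:author", "users#uogbuji")}


def superproperties(prop, schema):
    """Collect every property that prop is directly or transitively a subproperty of."""
    found = set()
    stack = [prop]
    while stack:
        current = stack.pop()
        for s, p, o in schema:
            if s == current and p == SUBPROPERTY_OF and o not in found:
                found.add(o)
                stack.append(o)
    return found


def with_superproperties(instances, schema):
    """Return the instance statements plus those implied by rdfs:subPropertyOf."""
    inferred = set(instances)
    for s, p, o in instances:
        for parent in superproperties(p, schema):
            inferred.add((s, parent, o))
    return inferred


for statement in sorted(with_superproperties(instances, schema)):
    print(statement)
```

An agent that knows only Dublin Core can query the expanded statements for dc:creator and find the issue's author without knowing anything about the issue tracker vocabulary.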
If you've gone back to the instance data at this point in order to compare it to the schema we're building, you might be scratching your head: "But this doesn't match the instances with which we've been working." For example, we have:
Listing 3. Snippet from earlier instances
<rdf:Description about='&ril-spec;ril-20010502'>
  <rit:issue rdf:resource='#i2001030423'/>
</rdf:Description>
<rdf:Description ID='i2001030423'>
  <it:author rdf:resource='&ril-users;#uogbuji'/>
</rdf:Description>
This appears to violate the constraints we have set because the resource with ID i2001030423 is not declared to have an rdf:type of Issue, nor is the resource with ID "uogbuji" declared to have an rdf:type of Author.
Whether this is indeed a violation of the schema might actually depend on how we interpret the schema. The most common interpretation is that if there are no statements in the model to fulfill the terms of a constraint (such as domain or range), then the model is inconsistent -- usually an error condition. This is known as a restrictive role for RDF schemata. It is also part of what is sometimes known as a "closed world" assumption, since it does not consider anything that is not manifest in the model at the time of inquiry.
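A restrictive reading of the schema might be sketched in Python as follows. The domain and range tables mirror Listing 2, but the prefixed names, the abbreviated resource identifiers, and the violations helper are assumptions made for this example; it is not the validation logic of any specific RDF tool.

```python
RDF_TYPE = "rdf:type"

# Constraints from Listing 2: issue has only a range; author has both.
domains = {"it:author": "it:Issue"}
ranges = {"it:issue": "it:Issue", "it:author": "it:Author"}

# The instance statements from Listing 3, with abbreviated identifiers.
instances = {
    ("ril-20010502", "it:issue", "#i2001030423"),
    ("#i2001030423", "it:author", "#uogbuji"),
}


def violations(graph):
    """Closed-world check: list (resource, required class) pairs with no matching rdf:type."""
    missing = set()
    for s, p, o in graph:
        if p in domains and (s, RDF_TYPE, domains[p]) not in graph:
            missing.add((s, domains[p]))
        if p in ranges and (o, RDF_TYPE, ranges[p]) not in graph:
            missing.add((o, ranges[p]))
    return sorted(missing)


for resource, required in violations(instances):
    print(f"{resource} is not declared to have rdf:type {required}")
```

Under this interpretation the earlier instances fail on two counts, one for the issue and one for the author, because only statements already present in the model are considered.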
But there is another, less common, but very interesting approach. One of the constraints we defined in this installment says that if a resource has an author property, then it must be of rdf:type Issue. One could then infer from the presence of said property on the i2001030423 resource that it must be of the required type. In short, the processor could effectively generate statements that allow constraints to be satisfied. This is known as a generative or inferential role for RDF schemata. It is closer to how people deal with the vicissitudes of the real world, and thus closer to the powerful ideas behind the semantic Web. But with this power come thorny pitfalls of knowledge representation.
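The generative reading can be sketched the same way: rather than reporting missing types, the processor adds them. As before, this is a simplified Python illustration with invented helper names and abbreviated identifiers, not the algorithm of any real inference engine.

```python
RDF_TYPE = "rdf:type"

# Constraints from Listing 2, with prefixes standing in for full URIs.
domains = {"it:author": "it:Issue"}
ranges = {"it:issue": "it:Issue", "it:author": "it:Author"}

# The instance statements from Listing 3, with abbreviated identifiers.
instances = {
    ("ril-20010502", "it:issue", "#i2001030423"),
    ("#i2001030423", "it:author", "#uogbuji"),
}


def infer_types(graph):
    """Generate the rdf:type statements implied by domain and range declarations."""
    inferred = set(graph)
    for s, p, o in graph:
        if p in domains:
            inferred.add((s, RDF_TYPE, domains[p]))
        if p in ranges:
            inferred.add((o, RDF_TYPE, ranges[p]))
    return inferred


for statement in sorted(infer_types(instances) - instances):
    print("inferred:", statement)
```

A single pass is enough here because the generated statements all use rdf:type, which has no domain or range of its own in this small schema. The pitfalls mentioned above show up as soon as the inferred types conflict with other knowledge in the model, which this sketch makes no attempt to detect.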
The most important lesson to be learned here is that all is well even though we started with prototype RDF instances, and then worked up a schema that at first glance seemed to invalidate our earlier efforts. All is well, thanks to the generosity (I don't use the term lightly) of RDF. As an experienced modeler/designer, I must say that this power and flexibility is one of the bedrock strengths of RDF, as well as one of the reasons it can be so hard for traditional OO and relational thinkers to grasp.
We have run out of space for this installment, but I hope you've found it valuable that I've taken time to introduce and discuss important modeling concepts as we've proceeded. I would have been grateful for such a walk through when I was first trying to get my head around extensible metadata. In the next installment, we'll round off the issue tracker schema in RDFS form and look at it in DAML form as well.
- Pierre-Antoine Champin has written an excellent RDF Tutorial, which also covers RDFS.
- The RDFS specification is actually still just a Candidate Recommendation of the W3C. The recent RDFCore activity will probably contribute to its completion.
- Dave Beckett has produced a handy reference of RDF and RDFS concepts.
- There is plenty of good reading for further study, including An Extensible Approach for Modeling Ontologies in RDF(S), a paper that discusses modeling of ontologies using RDFS, including the modeling of axioms (e.g. general statements such as "all men must die").
- There is also the W3C note by Walter W. Chang: A Discussion of the Relationship Between RDF-Schema and UML.
- There are also useful insights in this post on www-rdf-interest, which summarizes a private exchange between Graham Klyne and Stephen Cranefield, including insights on the intersection of UML and RDF modeling.
- Check out Thinking XML's previous columns.
Uche Ogbuji is a consultant and co-founder of Fourthought Inc., a software vendor and consultancy specializing in XML solutions for enterprise knowledge management. Fourthought develops 4Suite, an open-source platform for XML, RDF and knowledge-management applications. Mr. Ogbuji is a computer engineer and writer born in Nigeria, living and working in Boulder, Colorado, USA. You can contact Mr. Ogbuji at firstname.lastname@example.org.