In the last two installments, "State of the art in XML modeling" and "Schema standardization for top-down semantic transparency," I presented an overview of some interesting technologies and techniques for semantic transparency. In this third and final part of the mini-series, I discuss what I think is the most important tool available for semantic transparency. The right sort of schema annotation is useful on a range of levels, starting with improved documentation. In the first of these articles, I discussed informal schema annotations. The approach I discuss in this article is an important refinement of that idea, which takes advantage of semantic anchors to formalize the annotations. It also builds on another short article on developerWorks, "Use data dictionary links for XML and Web services schemata". I suggest you read these articles before proceeding with this one.
First, however, I want to mention another fine conference I attended recently.
The XTech conference concept is based on the earlier XML Europe conferences, with an added emphasis on browser technologies that complement the XML and semantic technology tracks. In addition, a track and overall theme of open data looks not just at all the neat new technologies for processing XML, but also at how people and organizations are making data freely available to the world in order to open up entirely new applications and sources of added value.
XTech 2005 was held in Amsterdam (the same venue as XML Europe 2004, and the likely venue for the 2006 conference as well) from May 25th through 27th. The conference has always been one of my favorites, with a high concentration of fresh thinking and practical application in markup technologies. The browser and open data tracks added a great deal of energy to the proceedings, as did organizer Edd Dumbill's creative work in building an atmosphere of collaboration around the conference. There was a conference Wiki (for public notes) and an IRC channel (for public chatting). Edd also put some popular uses of XML technology to good use by hosting Planet XTech, a metadata-driven aggregation of Weblogs and pictures relating to XTech, and by offering the conference schedule in an XML form that people were encouraged to "remix" or process in interesting and useful ways. In a way, this helped fuel all the talk of open data.
Microformats were a big topic of discussion at the conference. These are basically little islands of XML data that are embedded within host formats such as XHTML or RSS. Microformats allow users to mix in information about all sorts of extended concerns, such as calendar information, personal contact information, or picture metadata. One of my favorite talks at the conference discussed a system ponderously named Gleaning Resource Descriptions from Dialects of Languages, or GRDDL, which (among other uses) extracts structured metadata from microformats. GRDDL is an important idea and one I expect to cover in future Thinking XML installments. Despite problems in my own presentation -- the venue's projectors refused to cooperate with my laptop's video drivers -- I learned a lot, had a good time, and most of all was gratified to see yet more evidence of the ever-increasing respect for semantic technologies in the XML sphere (see Resources for more on the XTech Conference).
The key to formalizing schema annotations is to find a good vocabulary resource with a clear set of identifiers for the terms you use. You then write these identifiers (typically URIs) in as the end points of data dictionary links in your schema. Listing 1 is a RELAX NG schema (compact syntax) snippet that includes informal annotations.
Listing 1. Sample RELAX NG schema using informal annotation to provide semantic clues
namespace dc = "http://purl.org/dc/elements/1.1/"

[ dc:description [ "General purpose purchase order for merchandise" ] ]
element purchase-order
{
  [ dc:description [ "Unique identifier for the purchase order" ] ]
  attribute id { text }
  # Rest of the schema here
}
This approach is limited. It provides only informal descriptions, which require reading comprehension by people to be of any use. This makes it hard to develop software that can use these annotations to drive decisions about the semantics of the schema. Reflecting this informality, the annotation uses the dc:description element, which is generally a prose account of the resource. To formalize the annotation, I shall switch to a more definitive statement from the OWL Web Ontology Language, so it's clear that I am identifying the schema data elements with a vocabulary term. I'll use WordNet as the vocabulary. WordNet is a database of English words and lexical relationships between them. I've discussed WordNet recently in this column; you can basically use it as a machine-readable dictionary. Listing 2 is based on Listing 1, but uses more formal annotations.
Listing 2. Sample RELAX NG schema using formal annotation to provide semantic clues
namespace wn = "http://www.cogsci.princeton.edu/"

[ wn:definition [
    "http://cogsci.princeton.edu/cgi-bin/webwn?stage=1&word=purchase+order" ] ]
element purchase-order
{
  [ wn:definition [
      "http://cogsci.princeton.edu/cgi-bin/webwn2.0?stage=1&word=identifier" ] ]
  attribute id { text }
  # Rest of the schema here
}
The annotations in this example provide unequivocal references to the definitions of the terms used in the schema. You can go right to the cited WordNet pages to find the lexicographical definitions:
The noun "purchase order" has one sense in WordNet.
1. order, purchase order -- (a commercial document used to request someone to supply something in return for payment and providing specifications and quantities; "IBM received an order for a hundred computers")
The noun "identifier" has one sense in WordNet.
1. identifier -- (a symbol that establishes the identity of the one bearing it)
A machine can use equivalence of the URLs to check semantic equivalence -- and WordNet allows you to go even further, using its thesaurus-like facilities for richer semantics. If one schema has an anchor to "name" and the other to "identifier", a machine can navigate WordNet automatically to recognize the lexical similarity of those terms. In practice, however, WordNet is not necessarily the best choice for such annotations; most often the terms used in schemata have specific technical meanings that are not covered in such a general-purpose dictionary. Also, because of its ambitions, WordNet is not entirely complete, has quite a few errors, and is constantly evolving. Other options for semantic anchoring include ebXML core components and RosettaNet dictionaries, both of which I've covered in this column. You can even combine multiple anchors for each symbol.
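As a rough illustration of the URL-equivalence check described above, here is a minimal Python sketch. It assumes each schema's anchors have already been collected into a simple table mapping a schema item to its anchor URL; the tables `schema_a` and `schema_b` and the helper names are hypothetical. It only tests whether two anchors point at the same definition, and does not attempt the richer WordNet navigation between related terms.

```python
from urllib.parse import urlsplit, parse_qs


def anchor_key(url):
    """Reduce an anchor URL to (host, path, sorted query pairs) for comparison."""
    parts = urlsplit(url)
    # parse_qs decodes both "+" and "%20" as a space, so differently
    # encoded spellings of the same query compare equal.
    query = tuple(sorted((k, tuple(v)) for k, v in parse_qs(parts.query).items()))
    return (parts.netloc.lower(), parts.path, query)


def same_anchor(url_a, url_b):
    """True when two anchor URLs identify the same definition."""
    return anchor_key(url_a) == anchor_key(url_b)


def matches(a, b):
    """Pair up items from two anchor tables that share an anchor."""
    return sorted(
        (item_a, item_b)
        for item_a, url_a in a.items()
        for item_b, url_b in b.items()
        if same_anchor(url_a, url_b)
    )


# Hypothetical anchor tables for two independently designed schemata
schema_a = {
    "purchase-order": "http://cogsci.princeton.edu/cgi-bin/webwn?stage=1&word=purchase+order",
    "id": "http://cogsci.princeton.edu/cgi-bin/webwn?stage=1&word=identifier",
}
schema_b = {
    "PO": "http://cogsci.princeton.edu/cgi-bin/webwn?word=purchase%20order&stage=1",
    "name": "http://cogsci.princeton.edu/cgi-bin/webwn?stage=1&word=name",
}

if __name__ == "__main__":
    for item_a, item_b in matches(schema_a, schema_b):
        print(f"{item_a} and {item_b} share a semantic anchor")
```

Normalizing the query string before comparing is a design choice of this sketch: two schema authors may write the same anchor with parameters in a different order or with different escaping, and a plain string comparison would miss the match.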
Even before you get around to processing such annotations at the semantic level, you can use them in documentation tasks. It's straightforward to use XSLT to generate schema indices and data dictionaries by extracting schema definitions and annotations. You can even incorporate information extracted from the pages at the anchor URLs. If you try to do this with WordNet, you might want to use one of the RDF translations of WordNet, rather than the original Princeton Web pages, which make for rather sloppy mark-up.
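The extraction step is simple enough to sketch. The text above suggests XSLT; the following Python version is only meant to show the shape of the task, not a recommended toolchain. It assumes the schema is available in RELAX NG XML syntax, with each `wn:definition` annotation appearing as a child element whose text is the anchor URL. The embedded `SCHEMA` string is my own hand-written rendering of Listing 2 under that assumption, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

RNG = "http://relaxng.org/ns/structure/1.0"
WN = "http://www.cogsci.princeton.edu/"

# Assumed XML-syntax rendering of the schema in Listing 2
SCHEMA = """\
<element name="purchase-order"
         xmlns="http://relaxng.org/ns/structure/1.0"
         xmlns:wn="http://www.cogsci.princeton.edu/">
  <wn:definition>http://cogsci.princeton.edu/cgi-bin/webwn?stage=1&amp;word=purchase+order</wn:definition>
  <attribute name="id">
    <wn:definition>http://cogsci.princeton.edu/cgi-bin/webwn2.0?stage=1&amp;word=identifier</wn:definition>
    <text/>
  </attribute>
</element>
"""


def data_dictionary(schema_xml):
    """Return (kind, name, anchor URL) for every annotated element or attribute."""
    root = ET.fromstring(schema_xml)
    wanted = (f"{{{RNG}}}element", f"{{{RNG}}}attribute")
    entries = []
    for node in root.iter():  # iter() visits the root as well as descendants
        if node.tag not in wanted or not node.get("name"):
            continue
        anchor = node.find(f"{{{WN}}}definition")  # direct children only
        if anchor is not None and anchor.text:
            kind = node.tag.split("}")[1]
            entries.append((kind, node.get("name"), anchor.text.strip()))
    return entries


if __name__ == "__main__":
    for kind, name, url in data_dictionary(SCHEMA):
        print(f"{kind:10}{name:16}{url}")
```

The resulting list of tuples can be written out as an index or merged with text fetched from the anchor URLs to produce a fuller data dictionary.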
Semantic anchors in abstract schemata
If you read my article "Discover the flexibility of Schematron abstract patterns," you learned a very useful technique for abstracting the basic information content of schemata from the actual XML syntax. As I state in that article, "You can gain even more expressive power by augmenting Schematron abstract patterns with semantically rich annotations... The resulting schemata would be readily adaptable to any syntax, while at the same time offering semantic transparency." Listing 3 is a Schematron snippet that demonstrates this powerful combination of techniques.
Listing 3. Schematron abstract pattern with formal annotation
<pattern abstract="true" name="purchase-order">
  <rule context="$purchase-order">
    <wn:definition href=
      "http://cogsci.princeton.edu/cgi-bin/webwn?stage=1&amp;word=purchase+order"
    />
    <assert test="$id">
      A purchase order requires an ID
    </assert>
  </rule>
  <rule context="$id">
    <wn:definition href=
      "http://cogsci.princeton.edu/cgi-bin/webwn?stage=1&amp;word=identifier"
    />
    <assert test="count(key('ids', .)) = 1">
      An ID must be unique
    </assert>
  </rule>
</pattern>
This code defines a Schematron abstract pattern that covers the abstract notion of a purchase order with a unique identifier. You can instantiate it in any XML pattern you like -- perhaps a PO element with an ID attribute, or a purchase-order element with an ident sub-element. See my earlier article for more information on how this works. Regardless of the syntax you choose, you trace the structure from the concrete schema patterns to the abstract pattern (as in Listing 3), where you find semantic anchors that solidify the definition. The schema constraint count(key('ids', .)) = 1 checks the set of all identifiers to be sure that only one has the current value (in other words, that the identifier is unique). It requires that you've defined a key named ids.
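To make the two assertions concrete, here is a small Python sketch of what they check once the abstract pattern is bound to one particular syntax. It is not a Schematron processor. It assumes the first binding mentioned above (a PO element with an ID attribute); the sample document, the function name, and its parameters are all hypothetical. The `Counter` plays the role of the ids key: it tallies every identifier value once so that each purchase order can be checked against the tally.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical instance document using the PO/ID binding
DOC = """\
<orders>
  <PO ID="a1"/>
  <PO ID="b2"/>
  <PO ID="a1"/>
  <PO/>
</orders>
"""


def check_purchase_orders(xml_text, po_path=".//PO", id_attr="ID"):
    """Return messages mirroring the two assertions in Listing 3."""
    root = ET.fromstring(xml_text)
    orders = root.findall(po_path)
    # Stand-in for key('ids', .): count how often each identifier occurs
    counts = Counter(
        po.get(id_attr) for po in orders if po.get(id_attr) is not None
    )
    problems = []
    for position, po in enumerate(orders, start=1):
        ident = po.get(id_attr)
        if ident is None:
            problems.append(f"PO #{position}: A purchase order requires an ID")
        elif counts[ident] != 1:
            problems.append(f"PO #{position}: An ID must be unique ({ident})")
    return problems


if __name__ == "__main__":
    for message in check_purchase_orders(DOC):
        print(message)
```

Switching to a different concrete syntax only means changing `po_path` and how the identifier is read; the checks themselves, like the semantic anchors in the abstract pattern, stay the same.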
In my consulting on XML design, I have found semantic anchors to be a very useful tool in raising the quality of schemas and other instruments I develop. The most superficial benefit is that I can use semantic anchors to generate supporting documentation for customers, and later on I can surprise them with the quick adaptation and processing techniques opened up by the clear semantics.
With this article, I wrap up this survey of practical semantic transparency techniques. I hope I've helped demonstrate that good semantic design is not just a pretty bit of theory, but a consideration that you can apply to all your work with XML technology. I'll continue to cover the subject in this column, of course, and I do encourage you to share your perspectives by participating in the Thinking XML discussion forum.
Resources
- Participate in the discussion forum.
- Review the developerWorks articles referenced in this installment:
  - "State of the art in XML modeling" (March 2005)
  - "Schema standardization for top-down semantic transparency" (April 2005)
  - "Use data dictionary links for XML and Web services schemata" (May 2004)
  - "Discover the flexibility of Schematron abstract patterns" (October 2004)
- Find out more about XTech 2005, which was held in Amsterdam May 25-27. See Planet XTech and The XTech Wiki to find out more about the conference. Uche Ogbuji's presentation focused on "Matching Python idioms to XML idioms."
- Explore Gleaning Resource Descriptions from Dialects of Languages (GRDDL), a system "for getting RDF data out of XML and XHTML documents using explicitly associated transformation algorithms, typically represented in XSLT."
- Read "What Are Microformats?" by Micah Dubinko for more on Microformats.
- Learn RELAX NG for more effective XML design. David Mertz provides an excellent start in his XML Matters column here on developerWorks:
  - Part 1 is a fairly complete overview of both the syntax and semantics of RELAX NG schemata (February 2003).
  - Part 2 addresses a few additional semantic issues and looks at tools for working with RELAX NG (March 2003).
  - Part 3 looks at tools for working with RELAX NG compact syntax (May 2003).
- Visit the home page for WordNet, a database of English words and the lexical relationships between them developed by Princeton University's Cognitive Science Laboratory.
- Find more XML resources on the developerWorks XML zone, including previous installments of the Thinking XML column. "XML meets semantics, Part 3" discusses RosettaNet dictionaries (May 2001).
- Learn how you can become an IBM Certified Developer in XML and related technologies.

Uche Ogbuji is a consultant and co-founder of Fourthought Inc., a software vendor and consultancy specializing in XML solutions for enterprise knowledge management. Fourthought develops 4Suite, an open source platform for XML, RDF, and knowledge-management applications. Mr. Ogbuji is also a lead developer of the Versa RDF query language. He is a computer engineer and writer born in Nigeria, living and working in Boulder, Colorado, USA. You can find more about Mr. Ogbuji at his Weblog Copia, or contact him at uche@ogbuji.net.