Thinking XML: Schema annotation for bottom-up semantic transparency

Pushing schemata beyond syntax into semantics

Uche Ogbuji (uche@ogbuji.net), Principal Consultant, Fourthought, Inc.
Uche Ogbuji is a consultant and co-founder of Fourthought Inc., a software vendor and consultancy specializing in XML solutions for enterprise knowledge management. Fourthought develops 4Suite, an open source platform for XML, RDF, and knowledge-management applications. Mr. Ogbuji is also a lead developer of the Versa RDF query language. He is a computer engineer and writer born in Nigeria, living and working in Boulder, Colorado, USA. You can find more about Mr. Ogbuji at his Weblog Copia, or contact him at uche@ogbuji.net.

Summary:  Learn more about the different approaches to semantic transparency as Uche Ogbuji discusses what they mean to developers using XML. Whether or not you reuse schemata, you might find it valuable to use formal annotations (as opposed to the informal annotations covered earlier). You gain benefits on several levels by doing so. On the most immediately practical level, you can generate better documentation. A more far-sighted benefit is that it gives you an important measure of semantic transparency. This installment discusses semantic anchors, and gives examples. The author also takes a moment to discuss The XTech Conference 2005.

Date:  14 Jul 2005
Level:  Intermediate

In the last two installments, "State of the art in XML modeling" and "Schema standardization for top-down semantic transparency," I presented an overview of some interesting technologies and techniques for semantic transparency. In this third and final part of the mini-series, I discuss what I think is the most important tool available for semantic transparency. The right sort of schema annotation is useful on a range of levels, starting with improved documentation. In the first of these articles, I discussed informal schema annotations. The approach I discuss in this article is an important refinement of that idea, which takes advantage of semantic anchors to formalize the annotations. It also builds on another short article on developerWorks, "Use data dictionary links for XML and Web services schemata". I suggest you read these articles before proceeding with this one.

First, however, I want to mention another fine conference I attended recently.

XTech Conference 2005

The XTech conference concept is based on the earlier XML Europe conferences, with an added emphasis on browser technologies that complement the XML and semantic technology tracks. In addition, a track and overall theme of open data looks not just at all the neat new technologies for processing XML, but also at how people and organizations are making data freely available to the world in order to open up entirely new applications and sources of added value.

XTech 2005 was held in Amsterdam (the same venue as XML Europe 2004, and the likely venue for the 2006 conference as well) from May 25th through 27th. The conference has always been one of my favorites, with a high concentration of fresh thinking and practical application in markup technologies. The browser and open data tracks added a great deal of energy to the proceedings, as did organizer Edd Dumbill's creative work in building an atmosphere of collaboration around the conference. There was a conference Wiki (for public notes) and an IRC channel (for public chatting). Edd also put some popular uses of XML technology to good use by hosting Planet XTech, a metadata-driven aggregation of Weblogs and pictures relating to XTech, and by offering the conference schedule in an XML form that people were encouraged to "remix" or process in interesting and useful ways. In a way, this helped fuel all the talk of open data.

Microformats were a big topic of discussion at the conference. These are basically little islands of XML data that are embedded within host formats such as XHTML or RSS. Microformats allow users to mix in information about all sorts of extended concerns, such as calendar information, personal contact information, or picture metadata. One of my favorite talks at the conference discussed a system ponderously named Gleaning Resource Descriptions from Dialects of Languages, or GRDDL, which (among other uses) extracts structured metadata from microformats. GRDDL is an important idea and one I expect to cover in future Thinking XML installments. Despite problems in my own presentation -- the venue's projectors refused to cooperate with my laptop's video drivers -- I learned a lot, had a good time, and most of all was gratified to see yet more evidence of the ever-increasing respect for semantic technologies in the XML sphere (see Resources for more on the XTech Conference).


Formal schema annotation

The key to formalizing schema annotations is to find a good vocabulary resource with a clear set of identifiers for the terms you use. You then write these identifiers (typically URIs) in as the end points of data dictionary links in your schema. Listing 1 is a RELAX NG schema (compact syntax) snippet that includes informal annotations.


Listing 1. Sample RELAX NG schema using informal annotation to provide semantic clues
  
namespace dc = "http://purl.org/dc/elements/1.1/"

[ dc:description [ "General purpose purchase order for merchandise" ] ]
element purchase-order {
  [ dc:description [ "Unique identifier for the purchase order" ] ]
  attribute id { text }
  # Rest of the schema here
}

This approach is limited. It provides only informal descriptions, which are useful only when a person reads and understands them. This makes it hard to develop software that can use these annotations to drive decisions about the semantics of the schema. Reflecting this informality, the annotation uses the dc:description element, which is generally a prose account of the resource. To formalize the annotation, I shall switch to a more definitive statement from the OWL Web Ontology Language, so it's clear that I am identifying the schema data elements with a vocabulary term. I'll use WordNet as the vocabulary. WordNet is a database of English words and lexical relationships between them. I've discussed WordNet recently in this column; you can basically use it as a machine-readable dictionary. Listing 2 is based on Listing 1, but uses more formal annotations.


Listing 2. Sample RELAX NG schema using formal annotation to provide semantic clues
  
namespace wn = "http://www.cogsci.princeton.edu/"

[ wn:definition [ "http://cogsci.princeton.edu/cgi-bin/webwn?stage=1&word=purchase+order" ] ]
element purchase-order {
  [ wn:definition [ "http://cogsci.princeton.edu/cgi-bin/webwn2.0?stage=1&word=identifier" ] ]
  attribute id { text }
  # Rest of the schema here
}

The annotations in this example provide unequivocal references to the definitions of the annotated constructs. You can go right to the cited WordNet page to find the lexicographical definitions:

The noun "purchase order" has one sense in WordNet.
1. order, purchase order -- (a commercial document used to request someone to supply something in return for payment and providing specifications and quantities; "IBM received an order for a hundred computers")
The noun "identifier" has one sense in WordNet.
1. identifier -- (a symbol that establishes the identity of the one bearing it)

A machine can use equivalence of the URLs to check semantic equivalence -- and WordNet allows you to go even further, using its thesaurus-like facilities for richer semantics. If one schema has an anchor to "name" and the other to "identifier", a machine can navigate WordNet automatically to recognize the lexical similarity of those terms. In practice, however, WordNet is not necessarily the best choice for such annotations; most often the terms used in schemata have specific technical meanings that are not covered in such a general purpose dictionary. Also, because of its ambitions, WordNet is not entirely complete, has quite a few errors, and is constantly evolving. Other options for semantic anchoring include ebXML core components and RosettaNet dictionaries, both of which I've covered in this column. You can even combine multiple anchors for each symbol.
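
The URL-equivalence check can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the anchors are taken to be already extracted into dictionaries, the second schema (PO, ident, notes) and the shared_anchors helper are hypothetical, and the example.org URL is a placeholder. The sketch tests exact string equality only; it does not attempt the WordNet navigation mentioned above.

```python
# Hypothetical anchor tables: schema construct -> semantic anchor URL.
# The first follows Listing 2; the second is an invented schema for illustration.
PO_ANCHOR = "http://cogsci.princeton.edu/cgi-bin/webwn?stage=1&word=purchase+order"
ID_ANCHOR = "http://cogsci.princeton.edu/cgi-bin/webwn2.0?stage=1&word=identifier"

schema_a = {"purchase-order": PO_ANCHOR, "purchase-order/@id": ID_ANCHOR}
schema_b = {
    "PO": PO_ANCHOR,
    "PO/ident": ID_ANCHOR,
    "PO/notes": "http://example.org/terms/notes",  # placeholder anchor
}


def shared_anchors(a, b):
    """Pair up constructs from two schemata that cite the same anchor URL."""
    by_anchor = {}
    for construct, anchor in b.items():
        by_anchor.setdefault(anchor, []).append(construct)
    return sorted(
        (construct, other)
        for construct, anchor in a.items()
        for other in by_anchor.get(anchor, [])
    )


matches = shared_anchors(schema_a, schema_b)
for left, right in matches:
    print(f"{left} has the same anchor as {right}")
```

Constructs with no counterpart, such as PO/notes here, simply drop out of the result, which is where a richer lookup against the vocabulary itself would take over.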

Even before you get around to processing such annotations at the semantic level, you can use them in documentation tasks. It's straightforward to use XSLT to generate schema indices and data dictionaries by extracting schema definitions and annotations. You can even incorporate information extracted from the pages at the anchor URLs. If you try to do this with WordNet, you might want to use one of the RDF translations of WordNet, rather than the original Princeton Web pages, which make for rather sloppy mark-up.
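
As a rough sketch of the data dictionary extraction, here is a Python version that uses the standard library's ElementTree in place of the XSLT suggested above. It assumes an XML-syntax rendering of Listing 2, translated by hand for this illustration, in which each wn:definition annotation is a child element holding the anchor URL. The data_dictionary function and its row layout are hypothetical.

```python
import xml.etree.ElementTree as ET

RNG = "http://relaxng.org/ns/structure/1.0"
WN = "http://www.cogsci.princeton.edu/"

# Assumed XML-syntax rendering of Listing 2 (a hand translation, for illustration only).
SCHEMA = f"""\
<element name="purchase-order" xmlns="{RNG}" xmlns:wn="{WN}">
  <wn:definition>http://cogsci.princeton.edu/cgi-bin/webwn?stage=1&amp;word=purchase+order</wn:definition>
  <attribute name="id">
    <wn:definition>http://cogsci.princeton.edu/cgi-bin/webwn2.0?stage=1&amp;word=identifier</wn:definition>
    <text/>
  </attribute>
</element>
"""


def data_dictionary(schema_xml):
    """Return (kind, name, anchor URL) rows for every annotated element or attribute."""
    rows = []
    for node in ET.fromstring(schema_xml).iter():
        if node.tag not in (f"{{{RNG}}}element", f"{{{RNG}}}attribute"):
            continue
        # Look only at direct children so an annotation is credited to its own parent.
        for note in node.findall(f"{{{WN}}}definition"):
            kind = node.tag.split("}")[1]
            rows.append((kind, node.get("name"), (note.text or "").strip()))
    return rows


dictionary = data_dictionary(SCHEMA)
for kind, name, anchor in dictionary:
    print(f"{kind:<10}{name:<16}{anchor}")
```

The rows could then be merged with text fetched from each anchor URL to build fuller documentation.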


Semantic anchors in abstract schemata

If you read my article "Discover the flexibility of Schematron abstract patterns," you learned a very useful technique for abstracting the basic information content of schemata from the actual XML syntax. As I state in that article, "You can gain even more expressive power by augmenting Schematron abstract patterns with semantically rich annotations... The resulting schemata would be readily adaptable to any syntax, while at the same time offering semantic transparency." Listing 3 is a Schematron snippet that demonstrates this powerful combination of techniques.


Listing 3. Schematron abstract pattern with formal annotation
  
  <pattern abstract="true" name="purchase-order">
    <rule context="$purchase-order">
      <wn:definition href=
"http://cogsci.princeton.edu/cgi-bin/webwn?stage=1&amp;word=purchase+order"
       />
      <assert test="$id">
        A purchase order requires an ID
      </assert>
    </rule>
    <rule context="$id">
      <wn:definition href=
"http://cogsci.princeton.edu/cgi-bin/webwn?stage=1&amp;word=identifier"
       />
      <assert test="count(key('ids', .)) = 1">
        An ID must be unique
      </assert>
    </rule>
  </pattern>

This code defines a Schematron abstract pattern that covers the abstract notion of a purchase order with a unique identifier. You can instantiate it in any XML pattern you like -- perhaps a PO element with an ID attribute, or a purchase-order element with an ident sub-element. See my earlier article for more information on how this works. Regardless of the syntax you choose, you trace the structure from the concrete schema patterns to the abstract pattern (as in Listing 3), where you find semantic anchors that solidify the definition. The schema constraint count(key('ids', .)) = 1 checks the set of all identifiers to be sure that only one has the current value (in other words, that the identifier is unique). It requires that you've defined a key named ids.
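
To make the two assertions concrete, here is a minimal Python sketch of the same checks outside Schematron. It is not how a Schematron processor works. The sample document, the check_purchase_orders function, and its po_path and id_attr parameters are hypothetical; the parameters stand in for the $purchase-order and $id placeholders, and a Counter plays the role of the ids key.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical instance document: one concrete syntax for the abstract pattern,
# with PO elements carrying ID attributes. The second and third share an ID.
DOC = """\
<orders>
  <PO ID="po-1"/>
  <PO ID="po-2"/>
  <PO ID="po-2"/>
  <PO/>
</orders>
"""


def check_purchase_orders(xml_text, po_path=".//PO", id_attr="ID"):
    """Report violations of the two assertions: an ID is required and must be unique."""
    orders = ET.fromstring(xml_text).findall(po_path)
    # Index of how often each ID value occurs, playing the role of the 'ids' key.
    counts = Counter(po.get(id_attr) for po in orders if po.get(id_attr) is not None)
    problems = []
    for position, po in enumerate(orders, start=1):
        value = po.get(id_attr)
        if value is None:
            problems.append((position, "A purchase order requires an ID"))
        elif counts[value] != 1:
            problems.append((position, "An ID must be unique"))
    return problems


problems = check_purchase_orders(DOC)
for position, message in problems:
    print(f"PO #{position}: {message}")
```

Switching to another concrete syntax, such as an attribute named ident, means changing only the two parameters, which is the same separation the abstract pattern provides.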


Not just pretty theory

In my consulting on XML design, I have found semantic anchors to be a very useful tool in raising the quality of schemas and other instruments I develop. The most superficial benefit is that I can use semantic anchors to generate supporting documentation for customers, and later on I can surprise them with the quick adaptation and processing techniques opened up by the clear semantics.

With this article, I wrap up this survey of practical semantic transparency techniques. I hope I've helped demonstrate that good semantic design is not just a pretty bit of theory, but a consideration that you can apply to all your work with XML technology. I'll continue to cover the subject in this column, of course, and I do encourage you to share your perspectives by participating in the Thinking XML discussion forum.


Resources
