Thinking XML: Semantic anchors for XML

Universal identifier schemes for XML interchange

XML syntax is just the foundation for data interoperability. The next step is semantic transparency. Some groups are working to address this by defining entire document formats to be adopted wholesale, while other groups are working on ways to express common terminology and concepts at a more granular level. In this installment, Uche Ogbuji looks at XML Topic Maps Published Subjects and Universal Data Element Framework (UDEF), two ideas that take the granular approach by seeking to provide anchors in the semantic stream.

Uche Ogbuji, Principal Consultant, Fourthought, Inc.

Uche Ogbuji is a consultant and co-founder of Fourthought Inc., a software vendor and consultancy specializing in XML solutions for enterprise knowledge management. Fourthought develops 4Suite, an open source platform for XML, RDF, and knowledge-management applications. Mr. Ogbuji is a computer engineer and writer born in Nigeria, living and working in Boulder, Colorado, USA. You can contact Mr. Ogbuji at uche@ogbuji.net.



07 October 2003

As I've discussed earlier in this column, XML provides only the most basic foundation for universal information interchange. Now that XML is thoroughly established, a great deal of the effort to build standards on top of it has been directed towards semantic transparency, which would allow disparate systems to share some understanding of the actual concepts represented in structured form in XML documents. See the inaugural Thinking XML article for a discussion of semantic transparency.

Many approaches are taken toward such an ambitious goal, but I tend to classify these into two main categories:

  • Top-down initiatives define entire document formats along with the semantics of all the elements, attributes, and content, usually by reference to relevant industry standards. Examples are OAGIS (covered in "XML meets semantics, Part 4") and UBL (covered in "Universal Business Language (UBL)").
  • Bottom-up initiatives define terms and concepts at the discrete level, independently of the documents in which they would appear. Examples are the ISO Basic Semantics Register (BSR), an effort that unfortunately seems to have stalled, and RosettaNet Dictionaries (covered in "XML meets semantics, Part 3").

Top-down approaches are often less ambitious in scope and positioned for industry backing. Bottom-up approaches have broader potential, but are also far more difficult to develop and evangelize. RosettaNet is rather interesting in that it pursues both, providing dictionaries and document schemata. Also, UBL shares close ties with bottom-up efforts in the ebXML space.

The terms and concepts formally defined in dictionaries and semantic registries are the anchors on which you can build generalized semantics for communications in XML. In this article, I shall look at two additional initiatives to build such anchors.

Published subjects

In my last article I covered XML Topic Maps, and I mentioned that one of the bedrock ideas behind that technology is subject identifiers, which provide unique identifiers for particular concepts. At the most ambitious level are published subjects, which strive for global scope. Even though the idea of published subjects is closely tied to Topic Maps principles, there is no reason why they cannot be used as general semantic anchors in other technologies such as RDF and general XML vocabularies. XML Topic Maps (which, if you remember, is a specialization of the more general Topic Maps specification) prescribes that published subjects be identified by URIs, aligning them with most Web technologies including XML and RDF. The OASIS Topic Maps Published Subjects TC (see Resources) works to define and encourage the use of published subjects across a variety of technologies, as expressed in the introduction to its specification, entitled "Published Subjects: Introduction and Basic Requirements":

The goal of the OASIS Topic Maps Published Subjects Technical Committee is to promote Topic Maps interoperability through the use of Published Subjects. A further goal is to promote interoperability between Topic Maps and other technologies that make explicit use of abstract representations of subjects, such as the Resource Description Framework (RDF) and the Web Ontology Language (OWL).

Published Subjects as defined in this Specification provide an open, scaleable, URI-based method of identifying subjects of discourse. They cater for the needs of both humans and applications, and they provide mechanisms for ensuring confidence and trust on the part of users. Published Subjects are therefore expected to be of particular interest to publishers and users of ontologies, taxonomies, classifications, thesauri, registries, catalogues, and directories, and for applications (including agents) that capture, collate or aggregate information and knowledge.

The Published Subjects TC works only on the framework. Actual sets of published subjects are developed by other groups. The most prominent examples I could find are published subjects for ISO language and country codes developed by the OASIS Topic Maps Published Subjects for Geography and Languages TC. I mention some additional published subjects initiatives in Resources. Listing 1 is a snippet from an XML Topic Map that uses one of these published subjects.

Listing 1. Example of published subject for ISO language codes in XTM
<topic id="French">
  <subjectIdentity>
    <subjectIndicatorRef
      xlink:href="http://psi.oasis-open.org/geolang/iso639/#fre"/>
  </subjectIdentity>
  <baseName>
    <baseNameString>francais</baseNameString>
  </baseName>
</topic>
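
To make the role of the subject indicator concrete, here is a minimal Python sketch that reads the topic from Listing 1 and pulls out the URI of its published subject. It is an illustration under stated assumptions rather than a Topic Maps processor: the enclosing topicMap element is added only so that the snippet parses on its own, the namespace URIs are the ones defined for XTM 1.0 and XLink, and the function name subject_indicators is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace URIs defined by XTM 1.0 and XLink. The wrapping <topicMap> element
# is an assumption added here so the fragment from Listing 1 is well-formed.
XTM_NS = "http://www.topicmaps.org/xtm/1.0/"
XLINK_NS = "http://www.w3.org/1999/xlink"

XTM_DOC = f"""<topicMap xmlns="{XTM_NS}" xmlns:xlink="{XLINK_NS}">
  <topic id="French">
    <subjectIdentity>
      <subjectIndicatorRef
        xlink:href="http://psi.oasis-open.org/geolang/iso639/#fre"/>
    </subjectIdentity>
    <baseName>
      <baseNameString>francais</baseNameString>
    </baseName>
  </topic>
</topicMap>"""


def subject_indicators(xtm_text):
    """Map each topic id to the list of subject indicator URIs it declares."""
    root = ET.fromstring(xtm_text)
    result = {}
    for topic in root.iter(f"{{{XTM_NS}}}topic"):
        # Only subjectIndicatorRef children of subjectIdentity are considered.
        refs = topic.findall(
            f"{{{XTM_NS}}}subjectIdentity/{{{XTM_NS}}}subjectIndicatorRef"
        )
        result[topic.get("id")] = [ref.get(f"{{{XLINK_NS}}}href") for ref in refs]
    return result


if __name__ == "__main__":
    print(subject_indicators(XTM_DOC))
```

Two topic maps that use different topic ids and base names for the same language could be reconciled by comparing the URIs this function returns rather than the local names.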

Bernard Vatant, chair of the Published Subjects TC, developed an example of how published subjects can be used in Web Ontology Language (OWL). OWL is essentially the successor to DAML+OIL (covered in "Basic XML and RDF techniques for knowledge management, Part 5"). I shall cover OWL in more depth soon. Listing 2 is a snippet that uses the same published subject as in Listing 1, but in OWL.

Listing 2. Example of published subject for ISO language codes in OWL
 <Lang rdf:ID="fre">
  <!-- The following asserts equivalence between the local resource #fre
       and the published subject for the French language -->
  <owl:sameAs rdf:resource="http://psi.oasis-open.org/geolang/iso639/#fre"/>
  <rdfs:label xml:lang="en">French</rdfs:label>
  <rdfs:label xml:lang="fr">francais</rdfs:label>
 </Lang>

The owl:sameAs statement and the published subject effectively provide an anchor from a locally defined resource to a semantically unambiguous concept.
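
As a rough illustration of how an application might follow that anchor, the Python sketch below reads the resource from Listing 2 and reports which local identifiers are tied to a given published subject. It treats the document as plain XML rather than using a real RDF or OWL processor, so it only handles this particular serialization. The rdf:RDF wrapper, the default namespace http://example.org/languages#, and the function names are assumptions made for the example; the rdf, rdfs, and owl namespace URIs are the standard ones.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS_NS = "http://www.w3.org/2000/01/rdf-schema#"
OWL_NS = "http://www.w3.org/2002/07/owl#"
LOCAL_NS = "http://example.org/languages#"  # hypothetical local vocabulary

# Listing 2 wrapped in an rdf:RDF element (an assumption) so it parses alone.
OWL_DOC = f"""<rdf:RDF xmlns="{LOCAL_NS}"
    xmlns:rdf="{RDF_NS}" xmlns:rdfs="{RDFS_NS}" xmlns:owl="{OWL_NS}">
  <Lang rdf:ID="fre">
    <owl:sameAs rdf:resource="http://psi.oasis-open.org/geolang/iso639/#fre"/>
    <rdfs:label xml:lang="en">French</rdfs:label>
    <rdfs:label xml:lang="fr">francais</rdfs:label>
  </Lang>
</rdf:RDF>"""


def same_as_anchors(rdf_text):
    """Map each local rdf:ID to the set of URIs it is declared owl:sameAs."""
    root = ET.fromstring(rdf_text)
    anchors = {}
    for node in root:
        local_id = node.get(f"{{{RDF_NS}}}ID")
        if local_id is None:
            continue  # skip resources that carry no local identifier
        anchors[local_id] = {
            same.get(f"{{{RDF_NS}}}resource")
            for same in node.findall(f"{{{OWL_NS}}}sameAs")
        }
    return anchors


def anchored_to(anchors, psi):
    """Return the sorted local IDs anchored to the given published subject."""
    return sorted(lid for lid, targets in anchors.items() if psi in targets)


if __name__ == "__main__":
    found = same_as_anchors(OWL_DOC)
    print(anchored_to(found, "http://psi.oasis-open.org/geolang/iso639/#fre"))
```

Because the comparison is on the published subject URI, the local name fre and the labels play no part in deciding whether two resources denote the same concept.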


Universal Data Elements

The Universal Data Element Framework (UDEF) is an ambitious scheme to provide unique identifiers for a variety of data elements that are key to various industries (see Resources). The project is styled as a sort of Dewey Decimal System for data elements. Its developers are also very clear about their strict bottom-up focus:

There is a distinction to be made between the document standards, core components, ontological and taxonomical efforts underway as a result of the wide ranging integration and application collaboration activities on the net: The UDEF seeks only [to] be an attribute in the data element. There are no process, validation or handling requirements, it only seeks to communicate in a standard and repeatable way, the exact concept that the data element represents. There is very little about context, just enough to identify the data element exactly.

UDEF takes the interesting approach of using cryptic combinations of letters, numbers, underscores, and periods as identifiers rather than natural language. This has the advantage of relative neutrality towards locales (unlike so many XML technologies that are biased towards English), but it means that lookup tools are essential for working with UDEF, even informally. As an example, the concept of a product part identifier is given the UDEF ID 9_5.8. Relevant to the examples in the published subjects section, the concept of a country code is given the UDEF ID e.7_4.
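
The following Python sketch suggests what such a lookup tool might do in its simplest form: it aligns elements from two differently named vocabularies by the UDEF ID each element carries. Only the two IDs and their meanings come from the text above. The attribute name udef, both sample documents, and the function names are hypothetical, and the sketch makes no claim about how UDEF itself recommends attaching IDs to elements.

```python
import xml.etree.ElementTree as ET

# The two IDs and their meanings are taken from the article; a real lookup
# tool would consult the full UDEF tables instead of this tiny dictionary.
UDEF_LABELS = {
    "9_5.8": "product part identifier",
    "e.7_4": "country code",
}

# Hypothetical documents: different element names, same underlying concepts.
ORDER_A = (
    '<order><partNo udef="9_5.8">AX-100</partNo>'
    '<ctry udef="e.7_4">FR</ctry></order>'
)
ORDER_B = (
    '<commande><reference udef="9_5.8">AX-100</reference>'
    '<pays udef="e.7_4">FR</pays></commande>'
)


def by_udef(xml_text, attr="udef"):
    """Index (element name, text) by UDEF ID; a repeated ID keeps the last hit."""
    root = ET.fromstring(xml_text)
    return {el.get(attr): (el.tag, el.text) for el in root.iter() if el.get(attr)}


def align(doc_a, doc_b):
    """Pair up element names from two documents that share a UDEF ID."""
    a, b = by_udef(doc_a), by_udef(doc_b)
    return {
        UDEF_LABELS.get(uid, uid): (a[uid][0], b[uid][0])
        for uid in sorted(a.keys() & b.keys())
    }


if __name__ == "__main__":
    for concept, names in align(ORDER_A, ORDER_B).items():
        print(concept, names)
```

The element names partNo and reference never need to agree; the shared ID is the only point of contact, which is the locale neutrality the scheme is aiming for.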


Anchors aweigh

As yet, the world of published subjects does not offer very much for developers to build on. The basic requirements have just been finalized, a year after the scheduled date, and the TC has had a lot of difficulty making progress. Along with the stagnation of ISO BSR, this suggests the enormous challenge faced by efforts for bottom-up semantic transparency. However, the difficulty is balanced by the promise, and I would suggest that readers continue to follow these initiatives. I certainly hope that more developers become interested in the published subjects efforts, and are able to contribute. In commendable OASIS tradition, the working group is very open and transparent.

UDEF does compete with the likes of Bizcodes and United Nations Centre for Trade Facilitation and Electronic Business (UN/CEFACT) Core Components Identifiers. The latter, for example, are an important part of ebXML and are used in UBL. Discussions all around the table have focused on how to reconcile the various efforts, but it remains to be seen whether any one of these efforts will gain significant traction in practice, never mind there being room for several. I shall cover Bizcodes, Core Components, and more such initiatives in future articles.

It's wonderful that so many efforts are aimed at tackling the issue of semantic transparency from various angles. As expected in such an open environment, many of the initiatives are also operating with consideration of similar efforts. UDEF offers examples of how to use its IDs in ebXML and OAGIS as well as RDF and XML schemata. But it is still interesting to consider whether top-down or bottom-up approaches will be most crucial in establishing true interoperability at the semantic level. Will it take the establishment of complete and coherent document standards that can be readily used, or do the basic building blocks of a shared terminology have to be in place so that interoperability is possible even without agreement on precise document standards? Please don't hesitate to offer your opinion on the matter on the Thinking XML discussion forum.

Resources
