As I've discussed earlier in this column, XML provides only the most basic foundation for the goal of universal information interchange. Now that XML is thoroughly established, a great deal of the effort to build standards on top of it has been directed towards semantic transparency, which would allow disparate systems to share some understanding of the actual concepts that are represented in structured form in XML documents. See the inaugural Thinking XML article for a discussion of semantic transparency.
Many approaches are taken toward such an ambitious goal, but I tend to classify these into two main categories:
- Top-down initiatives define entire document formats along with the semantics of all the elements, attributes, and content, usually by reference to relevant industry standards. Examples are OAGIS (covered in "XML meets semantics, Part 4") and UBL (covered in "Universal Business Language (UBL)").
- Bottom-up initiatives define terms and concepts at the discrete level, independently of the documents in which they would appear. Examples are the ISO Basic Semantics Register (BSR), an effort that unfortunately seems to have stalled, and RosettaNet Dictionaries (covered in "XML meets semantics, Part 3").
Top-down approaches are often less ambitious in scope and positioned for industry backing. Bottom-up approaches have broader potential, but are also far more difficult to develop and evangelize. RosettaNet is rather interesting in that it pursues both, providing dictionaries and document schemata. Also, UBL shares close ties with bottom-up efforts in the ebXML space.
The terms and concepts formally defined in dictionaries and semantic registries are the anchors on which you can build generalized semantics for communications in XML. In this article, I shall look at two additional initiatives to build such anchors.
In my last article I covered XML Topic Maps, and I mentioned that one of the bedrock ideas behind that technology is subject identifiers, which provide unique identifiers for particular concepts. At the most ambitious level, one has published subjects, which strive for global scope. Even though the idea of published subjects is closely tied to Topic Maps principles, there is no reason why they cannot be used as general semantic anchors in other technologies such as RDF and general XML vocabularies. XML Topic Maps (which, if you remember, is a specialization of the more general Topic Maps specification) prescribes that published subjects be identified by URIs, aligning them with most Web technologies, including XML and RDF. The OASIS Topic Maps Published Subjects TC (see Resources) works to define and encourage the use of published subjects across a variety of technologies, as expressed in the introduction to its specification, entitled "Published Subjects: Introduction and Basic Requirements":
The goal of the OASIS Topic Maps Published Subjects Technical Committee is to promote Topic Maps interoperability through the use of Published Subjects. A further goal is to promote interoperability between Topic Maps and other technologies that make explicit use of abstract representations of subjects, such as the Resource Description Framework (RDF) and the Web Ontology Language (OWL).
Published Subjects as defined in this Specification provide an open, scaleable, URI-based method of identifying subjects of discourse. They cater for the needs of both humans and applications, and they provide mechanisms for ensuring confidence and trust on the part of users. Published Subjects are therefore expected to be of particular interest to publishers and users of ontologies, taxonomies, classifications, thesauri, registries, catalogues, and directories, and for applications (including agents) that capture, collate or aggregate information and knowledge.
The Published Subjects TC works only on the framework. Actual sets of published subjects are developed by other groups. The most prominent examples I could find are published subjects for ISO language and country codes developed by the OASIS Topic Maps Published Subjects for Geography and Languages TC. I mention some additional published subjects initiatives in Resources. Listing 1 is a snippet from an XML Topic Map that uses one of these published subjects.
Listing 1. Example of published subject for ISO language codes in XTM
<topic id="French">
  <subjectIdentity>
    <subjectIndicatorRef xlink:href="http://psi.oasis-open.org/geolang/iso639/#fre"/>
  </subjectIdentity>
  <baseName>
    <baseNameString>francais</baseNameString>
  </baseName>
</topic>
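To make the role of the subject indicator concrete, here is a minimal Python sketch that pulls the published subject URIs out of a topic map like the one in Listing 1. It assumes the snippet sits inside a `topicMap` element that declares the XTM 1.0 and XLink namespaces (Listing 1 omits these declarations), and the function name `subject_indicators` is my own invention for illustration, not part of any Topic Maps toolkit.

```python
import xml.etree.ElementTree as ET

# Namespaces assumed for the wrapper element; Listing 1 does not show them.
XTM_NS = "http://www.topicmaps.org/xtm/1.0/"
XLINK_NS = "http://www.w3.org/1999/xlink"

XTM_DOC = """\
<topicMap xmlns="http://www.topicmaps.org/xtm/1.0/"
          xmlns:xlink="http://www.w3.org/1999/xlink">
  <topic id="French">
    <subjectIdentity>
      <subjectIndicatorRef
          xlink:href="http://psi.oasis-open.org/geolang/iso639/#fre"/>
    </subjectIdentity>
    <baseName>
      <baseNameString>francais</baseNameString>
    </baseName>
  </topic>
</topicMap>
"""


def subject_indicators(xtm_text):
    """Map each topic id to the list of subject indicator URIs it declares."""
    root = ET.fromstring(xtm_text)
    result = {}
    for topic in root.iter(f"{{{XTM_NS}}}topic"):
        refs = topic.findall(
            f"{{{XTM_NS}}}subjectIdentity/{{{XTM_NS}}}subjectIndicatorRef"
        )
        # The xlink:href attribute carries the published subject URI.
        result[topic.get("id")] = [r.get(f"{{{XLINK_NS}}}href") for r in refs]
    return result


if __name__ == "__main__":
    print(subject_indicators(XTM_DOC))
```

The local topic id ("French") is arbitrary. The URI is what another system would compare against to decide whether its own topic is about the same subject.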
Bernard Vatant, chair of the Published Subjects TC, developed an example of how published subjects can be used in Web Ontology Language (OWL). OWL is essentially the successor to DAML+OIL (covered in "Basic XML and RDF techniques for knowledge management, Part 5"). I shall cover OWL in more depth soon. Listing 2 is a snippet that uses the same published subject as in Listing 1, but in OWL.
Listing 2. Example of published subject for ISO language codes in OWL
<Lang rdf:ID="fre">
  <!-- The following asserts equivalence between the local resource #fre
       and the published subject for the French language -->
  <owl:sameAs rdf:resource="http://psi.oasis-open.org/geolang/iso639/#fre"/>
  <rdfs:label xml:lang="en">French</rdfs:label>
  <rdfs:label xml:lang="fr">francais</rdfs:label>
</Lang>
The owl:sameAs statement and the published subject together provide an anchor from a locally defined resource to a semantically unambiguous concept.
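As a rough illustration of how such an anchor might be used, the following Python sketch collects the owl:sameAs targets from a document like Listing 2 and checks whether the local resource shares a published subject with the XTM topic from Listing 1. It rests on several assumptions: the wrapping `rdf:RDF` element, the default namespace `http://example.org/languages#`, and the helper names `same_as_anchors` and `shares_subject` are all hypothetical. It also walks the XML tree directly instead of using a real RDF parser, which a production system would want.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
OWL = "http://www.w3.org/2002/07/owl#"

# Listing 2 wrapped in an rdf:RDF element; the default namespace is made up.
OWL_DOC = """\
<rdf:RDF xmlns="http://example.org/languages#"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:owl="http://www.w3.org/2002/07/owl#">
  <Lang rdf:ID="fre">
    <owl:sameAs rdf:resource="http://psi.oasis-open.org/geolang/iso639/#fre"/>
    <rdfs:label xml:lang="en">French</rdfs:label>
    <rdfs:label xml:lang="fr">francais</rdfs:label>
  </Lang>
</rdf:RDF>
"""


def same_as_anchors(rdf_text):
    """Map each rdf:ID in the document to the set of its owl:sameAs targets."""
    root = ET.fromstring(rdf_text)
    anchors = {}
    for node in root:
        local_id = node.get(f"{{{RDF}}}ID")
        if local_id is None:
            continue
        anchors[local_id] = {
            s.get(f"{{{RDF}}}resource")
            for s in node.findall(f"{{{OWL}}}sameAs")
        }
    return anchors


def shares_subject(anchors_a, anchors_b):
    """True when two resources name at least one common published subject."""
    return bool(set(anchors_a) & set(anchors_b))


if __name__ == "__main__":
    owl_anchors = same_as_anchors(OWL_DOC)
    # Subject indicator used by the XTM topic "French" in Listing 1.
    xtm_anchors = {"http://psi.oasis-open.org/geolang/iso639/#fre"}
    print(shares_subject(owl_anchors["fre"], xtm_anchors))
```

The comparison is plain string equality on the URIs, which is the simplest reading of "same published subject". Anything subtler, such as URI normalization, is left out of this sketch.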
Universal Data Elements
The Universal Data Element Framework (UDEF) is an ambitious scheme to provide unique identifiers for a variety of data elements that are key to various industries (see Resources). The project is styled as a sort of Dewey Decimal System of data elements. Its developers are also very clear about their strict focus on the bottom-up approach:
There is a distinction to be made between the document standards, core components, ontological and taxonomical efforts underway as a result of the wide ranging integration and application collaboration activities on the net: The UDEF seeks only [to] be an attribute in the data element. There are no process, validation or handling requirements, it only seeks to communicate in a standard and repeatable way, the exact concept that the data element represents. There is very little about context, just enough to identify the data element exactly.
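To picture what "an attribute in the data element" could mean in practice, here is a small Python sketch. The attribute name `udef` and both sample documents are hypothetical, chosen only for illustration and not taken from UDEF documentation. The value 9_5.8 is the UDEF ID that this article cites for the concept of a product part identifier.

```python
import xml.etree.ElementTree as ET

# Hypothetical attribute name; UDEF's own materials may use a different one.
UDEF_ATTR = "udef"

# Two made-up documents that name the same concept differently.
ORDER_EN = '<order><partNumber udef="9_5.8">AX-100</partNumber></order>'
ORDER_DE = '<Bestellung><Teilenummer udef="9_5.8">AX-100</Teilenummer></Bestellung>'


def values_for(xml_text, udef_id):
    """Return the text of every element tagged with the given UDEF ID."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter() if el.get(UDEF_ATTR) == udef_id]


if __name__ == "__main__":
    print(values_for(ORDER_EN, "9_5.8"))
    print(values_for(ORDER_DE, "9_5.8"))
```

The lookup ignores element names entirely, so the two documents do not need to agree on a schema, only on the identifier for the concept.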
UDEF takes the interesting approach of using a cryptic combination of letters, numbers, underscores, and periods for identifiers rather than natural language. This does have the advantage of relative neutrality towards locales -- unlike so many XML technologies that are biased towards English -- but it means that lookup tools are essential for working with UDEF, even informally. As an example, the concept of a product part identifier is given the UDEF ID of 9_5.8 and, relevant to the examples given in the published subjects section, the concept of a country code is given the UDEF identifier of
As yet, the world of published subjects does not offer very much for developers to build on. The basic requirements have just been finalized, a year after the scheduled date, and the TC has had a lot of difficulty making progress. Along with the stagnation of ISO BSR, this suggests the enormous challenge faced by efforts for bottom-up semantic transparency. However, the difficulty is balanced by the promise, and I would suggest that readers continue to follow these initiatives. I certainly hope that more developers become interested in the published subjects efforts, and are able to contribute. In commendable OASIS tradition, the working group is very open and transparent.
UDEF does compete with the likes of Bizcodes and United Nations Centre for Trade Facilitation and Electronic Business (UN/CEFACT) Core Components Identifiers. The latter, for example, are an important part of ebXML and are used in UBL. Discussions all around the table have focused on how to reconcile the various efforts, but it remains to be seen whether any one of these efforts will gain significant traction in practice, never mind there being room for several. I shall cover Bizcodes, Core Components, and more such initiatives in future articles.
It's wonderful that so many efforts are aimed at tackling the issue of semantic transparency from various angles. As expected in such an open environment, many of the initiatives also take similar efforts into consideration. UDEF offers examples of how to use its IDs in ebXML and OAGIS as well as RDF and XML schemata. But it is still interesting to consider whether top-down or bottom-up approaches will be more crucial in establishing true interoperability at the semantic level. Will it take the establishment of complete and coherent document standards that can be readily used, or do the basic building blocks of a shared terminology have to be in place so that interoperability is possible even without agreement on precise document standards? Please don't hesitate to offer your opinion on the matter on the Thinking XML discussion forum.
Resources
- Participate in the discussion forum.
- Find out more about the OASIS Topic Maps Published Subjects TC.
- Read about published subjects for languages, countries, and regions as defined by the OASIS Topic Maps Published Subjects for Geography and Languages TC. The TC has developed published subjects for ISO 639:1988 (E/F) - Codes for the representation of names of languages. They have also developed published subjects for ISO 3166. Bernard Vatant drafted an example of how these can be used in OWL.
- Mary Nishikawa developed a set of published subjects for the core concepts of Universal Standard Products and Services Classification (UNSPSC) codes (see also the metadata in Dublin Core/RDF). Read about other Topic Maps sets related to UNSPSC in her draft proposal "Best Practices for Published Subject Documentation Structure".
- Preview the definitions of published subjects for XML standards and technologies by the OASIS Topic Maps Vocabulary for XML Standards and Technologies TC. They are still in the preliminary stages and have not yet released a published subjects set.
- Visit the UDEF home page for a lot of information for developers curious about the initiative, though it takes some digging through the marketing. Also try a less formal page with several bulleted resources and links to an early advocacy presentation. "XML Microstandards" by William J. Lewis (Intelligent Enterprise) highlights UDEF.
- Find more XML resources on the developerWorks XML zone, including previous installments of the Thinking XML column.
- IBM's DB2 database provides not only relational database storage, but also XML-related tools such as the DB2 XML Extender which provides a bridge between XML and relational systems. Visit the DB2 Developer Domain to learn more about DB2.
- Find out how you can become an IBM Certified Developer in XML and related technologies.