In the last installment, "State of the art in XML modeling," I started a series of three articles to mark the 30th installment of this column. The series is an overview of some of the interesting technologies and techniques for semantic transparency, including my opinion on the state of the art. In this second part of the series, I look at the pros and cons of adopting existing XML formats with well-defined semantics. But first I want to mention a very interesting conference I attended in early March.
The Semantic Technology Conference 2005
Knowledge Technologies 2001 in Austin, Texas was the first conference I went to with an emphasis on the semantic technologies related to XML. I saw great energy and excitement at that conference about the potential of such technologies. But then again, it was attended mostly by academics and those on the bleeding edge commercially. The business presence was minimal, and this largely reflected the timing on the technology adoption cycle. The early adopters of XML still had to convince the business interests of the value of semantic technologies, and in doing so largely (and strangely) found they competed with Web services as the technology most likely to carry on the success of XML.
Lately, as I've remarked in this column, businesses have begun to appreciate the importance of semantic technologies, and this was evident at The Semantic Technology Conference 2005, which ran March 7-10 in San Francisco. Many of the same people and themes made a reappearance from the conference four years earlier, but this time the ranks were swelled by business interests -- lots of them. From venture capitalists (always a sign you're at a certain point in the cycle) to technology managers to entrepreneurs, people were not just talking about the technical potential of semantic technologies, but also the commercial opportunities and expected return on investment. I'd been impressed with the energy at the 2001 conference, but this one blew me away. It also happened to be one of the best organized conferences I've ever attended.
I gave a talk entitled "XML Design for Semantic Transparency" (see Resources), which covered many of the themes I have explored in this column and in other articles for developerWorks. I have always focused not on the long-term vision of the Semantic Web, but rather on immediate applications of semantic technologies to improve the value of XML technologies. I was gratified by the attendance at my talk, and by the enthusiastic response. I still think the industry has yet to take advantage of the powerful combination of XML and semantic technology, but you can be part of the growth. I encourage you to keep an eye out for the Semantic Technology Conference 2006. I'll be there.
Hitchhiking the semantic highway
Not long after the emergence of XML, industry groups started to work on ambitious XML-based standards for all sorts of information. This brute-force tactic for solving the problem of semantic transparency is what I call the top-down approach. These groups look to define entire document formats along with the semantics of all the elements, attributes, and content. This often involves leaning on existing industry data dictionaries and other such standards, where available. Sometimes EDI standards serve as the starting blocks.
Reusing such standards can help reduce the amount of work that goes into developing semantically transparent data formats. Some of the advantages are:
- Ready integration with business partners who have adopted the same format standards
- Well-defined semantics, not only of individual data elements but also of their relationships and perhaps their expected processing models
- Likelihood of less expensive training and recruiting; if you use well-known formats, more people in the labor pool are familiar with them
- Chance of fewer regulatory problems; if you work in a regulated industry, you might find that regulatory concerns mandate or strongly suggest specific data formats for internal and external information exchange
Of course, you should take the following pitfalls into consideration:
- Standardized data formats may not quite fit your specific needs. Standards are works of compromise between competing interests. They are often developed by committee, and just about every culture has some sharp joke about the typically ugly results of design by committee. You may find you have a lot of work to do in order to fit into the framework of prevailing standards. Many standards provide for extensibility as a sort of escape clause, but if you're not careful you can end up saddled with the very semantic transparency problems you seek to avoid.
- You might find intellectual property encumbrances. As with any hot technology, a mass of respectable and frivolous copyrights and patents has accumulated in the XML space, and this has affected well-known data formats and some processing conventions. Be aware of any legal encumbrances and the potential cost of dealing with these.
- The old quips about competing standards apply especially to XML. Just about every area of business interest has multiple competing XML standards, and you may find yourself in a bit of a game of chance determining which one to adopt. Also, because not enough standards look to employ semantic transparency techniques that support connections to competing standards, you might find yourself locked into your first choice.
Partial adoption of vocabularies
You might choose to compromise by adopting standardized vocabularies, while extending or otherwise modifying them to suit your needs. If so, be sure to direct as much attention to your own semantic transparency efforts as you would if you had completely created the vocabulary yourself. You will save some work by reference to material in the target format's documentation or data dictionaries, but if anything, you should take more care in formalizing the semantics of your modifications and extensions. You want to be sure not to create a false sense of security about the interoperability you might have with others using the same standard.
You may also want to consider a less well-known approach to such compromise: schema systems that work to separate semantics from the chosen syntax. This might be enough to bend external formats to your needs (although usually your need is to specialize the semantics rather than the syntax). In the next article, I shall discuss tools that make this possible, such as Schematron abstract patterns and XML architectural forms.
It's time for a second detour from the main thrust of this article. John Cowan, one of the most erudite scholars in the XML space, recently weighed in on a discussion thread regarding the OpenDocument file format (formerly known as the OpenOffice XML format), which I've covered previously in this column (see Resources). I've said a few times in developerWorks that modeling people's names is an extraordinarily difficult problem, and John's comments nicely illustrate just how difficult a problem this is. He writes:
IMHO (and I've worked on the problem for some years), all attempts to structure names so that they work correctly across cultures (and with scholarship being international now, the problem comes up repeatedly) just don't work.
- Western Europeans and their cultural descendants put surnames last for display, first for sorting.
- Hungarians put surnames first for all purposes, at least in the Hungarian language.
- Chinese also put surnames first, and often retain this convention when mentioned in other languages.
- Icelanders (mostly) have no surnames, only given-names and patronymics, and use given+patronymic for both sorting and display.
- Indonesians mostly have only one name.
I think the only universal answer is to represent full names in two ways: a display version and a sort version. One could do this with markup as follows:
<name><part key='2'>John</part> <part key='1'>Cowan</part></name>
but I don't think it's really worthwhile:
<name sortAs='Cowan, John'>John Cowan</name>
is probably more appropriate despite the duplication of content.
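To make the difference between the two markup options concrete, here is a minimal Python sketch of how a consumer might read each one. Only the John Cowan markup comes from the quote; the `names` wrapper element, the two additional entries, the function names, the fallback to the display text when `sortAs` is missing, and the ", " separator used to join the keyed parts are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document. Only the first <name> comes from the quote;
# the <names> wrapper and the other two entries are illustrative.
DOC = """<names>
  <name sortAs='Cowan, John'>John Cowan</name>
  <name sortAs='Sukarno'>Sukarno</name>
  <name sortAs='Mackenzie King, William Lyon'>William Lyon Mackenzie King</name>
</names>"""

# The keyed-parts alternative, exactly as given in the quote.
PARTS = "<name><part key='2'>John</part> <part key='1'>Cowan</part></name>"


def display_names_in_sort_order(xml_text):
    """Return display names ordered by their author-supplied sortAs value."""
    root = ET.fromstring(xml_text)
    names = root.findall("name")
    # Assumption: fall back to the display text when sortAs is absent.
    names.sort(key=lambda el: el.get("sortAs", el.text))
    return [el.text for el in names]


def forms_from_parts(xml_text):
    """Derive (display form, sort form) from <part key='N'> children."""
    name = ET.fromstring(xml_text)
    # Document order gives the display form, including the space between parts.
    display = "".join(name.itertext())
    # Numeric key order gives the sort form; the ", " separator is assumed.
    ordered = sorted(name.findall("part"), key=lambda p: int(p.get("key")))
    sort_form = ", ".join(p.text for p in ordered)
    return display, sort_form


if __name__ == "__main__":
    print(display_names_in_sort_order(DOC))
    print(forms_from_parts(PARTS))
```

With the attribute form, the consumer never computes anything: it sorts on one string and shows the other. The keyed-parts form avoids the duplication, but the consumer then has to know how to join the parts for each purpose, which is where the cultural conventions listed above creep back in.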
This is a sensible approach from the modeling point of view, but it does open up the ugly possibility that one would have to have all personal names entered twice, in effect, to ensure the correctness of the different forms of the names. In fact, later on in the discussion Cowan mentions how dangerous it can be to cut corners in such matters:
[R]educing names to [a final presentation] format requires hand-tuning, as when the middle name "O'Flynn" reduces to "O'F.", or knowing that "Willard van Orman Quine" is properly "Quine, W.V.O." Even automatically inverting names is too hard: "William Lyon Mackenzie King" properly inverts to "Mackenzie King, William Lyon", though we often find him called simply "King".
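The inversion problem is easy to reproduce. The following Python sketch shows the kind of shortcut Cowan warns against, a function that assumes the last whitespace-separated token is the surname, alongside a hand-maintained override table. The function names and the override table are hypothetical; the name and its correct inversion are taken from the quote.

```python
def naive_invert(display_name):
    """Invert a name by assuming the last token is the whole surname."""
    tokens = display_name.split()
    if len(tokens) < 2:
        # Single-token names are returned unchanged.
        return display_name
    return f"{tokens[-1]}, {' '.join(tokens[:-1])}"


# Hand-tuned sort forms for names the shortcut gets wrong (illustrative table;
# the entry itself is the example from the quote).
OVERRIDES = {
    "William Lyon Mackenzie King": "Mackenzie King, William Lyon",
}


def sort_form(display_name, overrides=OVERRIDES):
    """Prefer a curated sort form; fall back to the naive guess."""
    return overrides.get(display_name, naive_invert(display_name))


if __name__ == "__main__":
    name = "William Lyon Mackenzie King"
    print(naive_invert(name))  # King, William Lyon Mackenzie (wrong per the quote)
    print(sort_form(name))     # Mackenzie King, William Lyon
```

The override table is really just the sort form stored as data under another guise, which is why recording it alongside the name in the first place tends to be the more honest design.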
In this same thread, David Wheeler made a comment that well encapsulates the unexpected significance of the problem of modeling names cross-culturally:
There are naming standards, but the truly internationalized ones are so complicated that they're practically never used.
In a message to me, John discussed having run into such naming issues "in the context of [international e-mail addressing and transport standard] X.400 and [international directory standard] X.500; the former has a whole apparatus with title, given name, middle name(s), surname, and 'generational epithet' (i.e. 'Sr.' or 'Jr.' or 'III'); the latter, invented later, has simply 'common name' and 'surname', where the former is explicitly for display and the latter is de facto used for sorting."
Modeling names is an old problem, one that's not getting any easier as technology becomes increasingly global. Even though few of you will ever have to deal directly with the full complexity of internationalized name modeling, you should always carry some respect for the scale of the issue in your mind as a safeguard against the most common and embarrassing modeling gaffes.
I always advise people to at least consider using externally developed XML vocabularies. Rolling your own can often seem deceptively simple. I've learned to interpret the comment "so just how hard can it possibly be to design a useful XML format?" as a precursor to later difficulties. The most important step in deciding whether to adopt industry standards in a quest for semantic transparency is to find and evaluate the candidate standards themselves. The popularity of XML has engendered an explosion in the development of standard XML formats, so this is not always a simple proposition. In a recent tip (see Resources), I provided some useful places to go looking for suitable XML vocabularies. In this column, I've also often highlighted standards specific to certain industries, as well as some more universal efforts. You can help others when you share your own experiences with prevalent XML formats and participate on the Thinking XML discussion forum.
Resources
- Participate in the discussion forum.
- Read more about The Semantic Technology Conference at which Uche Ogbuji presented "XML Design for Semantic Transparency."
- Bookmark the tip "Look up XML schemata and Web services with these helpful resources" by Uche Ogbuji (developerWorks, February 2005), which identifies useful places to start looking for suitable XML vocabularies. Mr. Ogbuji also offers a brief discussion of the most important general-purpose XML schemata in his article "A survey of XML standards: Part 3" (developerWorks, February 2004).
- John Cowan's comments about the difficulty of modeling names started with his posting "OpenDocument - suggested tweaks for bibliography format" to the public comments mailing list for the OpenDocument XML format. This was covered in "Thinking XML: The open office file format", by Uche Ogbuji (developerWorks, January 2003).
- Check out IBM alphaWorks' Semantics research topic, which focuses on new semantic information management schemes that enable companies to make better use of their information.
- Find more XML resources on the developerWorks XML zone, including previous installments of the Thinking XML column. In "Semantic anchors for XML", Mr. Ogbuji discusses top-down versus bottom-up approaches to semantic transparency (developerWorks, October 2003).
- Browse for books on these and other technical topics.
- Learn how you can become an IBM Certified Developer in XML and related technologies.

Uche Ogbuji is a consultant and co-founder of Fourthought Inc., a consulting firm specializing in XML solutions for enterprise knowledge management applications. Fourthought develops 4Suite, the open source platform for XML middleware. Mr. Ogbuji is a Computer Engineer and writer born in Nigeria, living and working in Boulder, Colorado, USA. You can reach him at uche@ogbuji.net.