
Thinking XML: Schema standardization for top-down semantic transparency

The state of the art in XML modeling includes reusing models designed by others

Uche Ogbuji (uche@ogbuji.net), Consultant, Fourthought, Inc.
Uche Ogbuji is a consultant and co-founder of Fourthought Inc., a consulting firm specializing in XML solutions for enterprise knowledge management applications. Fourthought develops 4Suite, the open source platform for XML middleware. Mr. Ogbuji is a Computer Engineer and writer born in Nigeria, living and working in Boulder, Colorado, USA. You can reach him at uche@ogbuji.net.

Summary:  This installment continues the review of the many different approaches to semantic transparency, discussing what they mean to the developer using XML. One way to save resources on a long journey is to hitchhike. In XML, you can take advantage of countless open schema initiatives that, in effect, use schema standardization for top-down semantic transparency. But it's not all a free ride. In this article, Uche Ogbuji looks at the advantages and disadvantages of third-party schema reuse. He also takes a moment to discuss The Semantic Technology Conference 2005, and respond to some recent discussion on the difficulty of modeling people's names.

Date:  08 Apr 2005
Level:  Intermediate

In the last installment, "State of the art in XML modeling," I started a series of three articles to mark the 30th installment of this column. The series is an overview of some of the interesting technologies and techniques for semantic transparency, including my opinion on the state of the art. In this second part of the series, I look at the pros and cons of adopting existing XML formats with well-defined semantics. But first I want to mention a very interesting conference I attended in early March.

The Semantic Technology Conference 2005

Knowledge Technologies 2001 in Austin, Texas was the first conference I went to with an emphasis on the semantic technologies related to XML. I saw great energy and excitement at that conference about the potential of such technologies. But then again, it was attended mostly by academics and those on the bleeding edge commercially. The business presence was minimal, and this largely reflected the timing on the technology adoption cycle. The early adopters of XML still had to convince the business interests of the value of semantic technologies, and in doing so largely (and strangely) found they competed with Web services as the technology most likely to carry on the success of XML.

Lately, as I've remarked in this column, businesses have begun to appreciate the importance of semantic technologies, and this was evident at The Semantic Technology Conference 2005, which ran March 7-10 in San Francisco. Many of the same people and themes made a reappearance from the conference four years earlier, but this time the ranks were swelled by business interests -- lots of them. From venture capitalists (always a sign you're at a certain point in the cycle) to technology managers to entrepreneurs, people were not just talking about the technical potential of semantic technologies, but also the commercial opportunities and expected return on investment. I'd been impressed with the energy at the 2001 conference, but this one blew me away. It also happened to be one of the best organized conferences I've ever attended.

I gave a talk entitled "XML Design for Semantic Transparency" (see Resources), which covered many of the themes I have explored in this column and in other articles for developerWorks. I have always focused not on the long-term vision of the Semantic Web, but rather on immediate applications of semantic technologies to improve the value of XML technologies. I was gratified by the attendance at my talk and by the enthusiastic response. I still think the industry has yet to take advantage of the powerful combination of XML and semantic technology, but you can be part of the growth. I encourage you to keep an eye out for the Semantic Technology Conference 2006. I'll be there.


Hitchhiking the semantic highway

Not long after the emergence of XML, industry groups started to work on ambitious XML-based standards for all sorts of information. This brute-force tactic for solving the problem of semantic transparency is what I call the top-down approach. These groups look to define entire document formats along with the semantics of all the elements, attributes, and content. This often involves leaning on existing industry data dictionaries and other such standards, where available. Sometimes EDI standards serve as the starting blocks.

Reusing such standards can help reduce the amount of work that goes into developing semantically transparent data formats. Some of the advantages are:

  • Ready integration with business partners who have adopted the same format standards
  • Well-defined semantics, not only of individual data elements but also of their relationships and perhaps of expected processing models
  • Likelihood of less expensive training and recruiting; if you use well-known formats, more people in the labor pool are familiar with them
  • Chance of fewer regulatory problems; if you work in a regulated industry, you might find that regulatory concerns mandate or strongly suggest specific data formats for internal and external information exchange

Of course, you should take the following pitfalls into consideration:

  • Standardized data formats may not quite fit your specific needs. Standards are works of compromise between competing interests. They are often developed by committee, and just about every culture has some sharp joke about the typically ugly results of design by committee. You may find you have a lot of work to do in order to fit into the framework of prevailing standards. Many standards provide for extensibility as a sort of escape clause, but if you're not careful you can end up saddled with the very semantic transparency problems you seek to avoid.
  • You might find intellectual property encumbrances. As with any hot technology, a mass of respectable and frivolous copyrights and patents has accumulated in the XML space, and this has affected well-known data formats and some processing conventions. Be aware of any legal encumbrances and the potential cost of dealing with these.
  • The old quips about competing standards apply especially to XML. Just about every area of business interest has multiple competing XML standards, and you may find yourself in a bit of a game of chance determining which one to adopt. Also, because not enough standards look to employ semantic transparency techniques that support connections to competing standards, you might find yourself locked into your first choice.

Partial adoption of vocabularies

You might choose to compromise by adopting standardized vocabularies, while extending or otherwise modifying them to suit your needs. If so, be sure to direct as much attention to your own semantic transparency efforts as you would if you had completely created the vocabulary yourself. You will save some work by reference to material in the target format's documentation or data dictionaries, but if anything, you should take more care in formalizing the semantics of your modifications and extensions. You want to be sure not to create a false sense of security about the interoperability you might have with others using the same standard.
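One way to keep such extensions visible is to place them in your own namespace and audit documents for anything that falls outside the adopted vocabulary, so that each extension gets its own documented semantics. The following is only a minimal sketch in Python, assuming a hypothetical standard vocabulary at `http://example.org/std/order` and a hypothetical local extension namespace. Neither namespace is a real standard, and the `find_extensions` helper is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespaces, for illustration only (not real standards)
STD_NS = "http://example.org/std/order"
EXT_NS = "http://example.com/acme/order-ext"

DOC = """\
<order xmlns="http://example.org/std/order"
       xmlns:acme="http://example.com/acme/order-ext">
  <buyer>Example Buyer</buyer>
  <acme:loyaltyTier>gold</acme:loyaltyTier>
  <item sku="123" acme:giftWrap="yes">Widget</item>
</order>
"""

def split_qname(qname):
    """Split ElementTree's '{namespace}local' form into (namespace, local)."""
    if qname.startswith("{"):
        ns, local = qname[1:].split("}", 1)
        return ns, local
    return "", qname

def find_extensions(root, std_ns=STD_NS):
    """Return sorted names of elements and attributes outside std_ns."""
    found = set()
    for elem in root.iter():
        ns, local = split_qname(elem.tag)
        if ns != std_ns:
            found.add("{%s}%s" % (ns, local))
        for attr in elem.attrib:
            attr_ns, attr_local = split_qname(attr)
            # Unqualified attributes belong to their element's vocabulary
            if attr_ns and attr_ns != std_ns:
                found.add("{%s}@%s" % (attr_ns, attr_local))
    return sorted(found)

extensions = find_extensions(ET.fromstring(DOC))
for name in extensions:
    print(name)
```

A report like this does not formalize anything by itself, but it gives you a checklist of the names whose semantics you, rather than the standard's authors, are responsible for defining.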

You may also want to consider a less well-known approach to such compromise: schema systems that work to separate semantics from the chosen syntax. This might be enough to bend external formats to your needs (although usually your need is to specialize the semantics rather than the syntax). In the next article, I shall discuss tools that make this possible, such as Schematron abstract patterns and XML architectural forms.


Naming names

It's time for a second detour from the main thrust of this article. John Cowan, one of the most erudite scholars in the XML space, recently weighed in on a discussion thread regarding the OpenDocument file format (formerly known as the OpenOffice XML format), which I've covered previously in this column (see Resources). I've said a few times in developerWorks that modeling people's names is an extraordinarily difficult problem, and John's comments nicely illustrate just how difficult a problem this is. He writes:

IMHO (and I've worked on the problem for some years), all attempts to structure names so that they work correctly across cultures (and with scholarship being international now, the problem comes up repeatedly) just don't work.
  • Western Europeans and their cultural descendants put surnames last for display, first for sorting.
  • Hungarians put surnames first for all purposes, at least in the Hungarian language.
  • Chinese also put surnames first, and often retain this convention when mentioned in other languages.
  • Icelanders (mostly) have no surnames, only given-names and patronymics, and use given+patronymic for both sorting and display.
  • Indonesians mostly have only one name.
I think the only universal answer is to represent full names in two ways: a display version and a sort version. One could do this with markup as follows:
<name><part key='2'>John</part> <part key='1'>Cowan</part></name>
but I don't think it's really worthwhile:
<name sortAs='Cowan, John'>John Cowan</name>
is probably more appropriate despite the duplication of content.

This is a sensible approach from the modeling point of view, but it does open up the ugly possibility that one would have to have all personal names entered twice, in effect, to ensure the correctness of the different forms of the names. In fact, later on in the discussion Cowan mentions how dangerous it can be to cut corners in such matters:

[R]educing names to [a final presentation] format requires hand-tuning, as when the middle name "O'Flynn" reduces to "O'F.", or knowing that "Willard van Orman Quine" is properly "Quine, W.V.O." Even automatically inverting names is too hard: "William Lyon Mackenzie King" properly inverts to "Mackenzie King, William Lyon", though we often find him called simply "King".
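To make the trade-off concrete, here is a small Python sketch that uses the standard library's ElementTree. It sorts names on Cowan's `sortAs` attribute while displaying the element content, and it contrasts the hand-entered sort form with a naive "last word is the surname" inversion. The `people` wrapper element, the `naive_invert` helper, and the sort form chosen for Quine are assumptions made for illustration. They do not come from the discussion thread or from any standard.

```python
import xml.etree.ElementTree as ET

# The <people> wrapper and the Quine sort form are illustrative assumptions
DOC = """\
<people>
  <name sortAs='Quine, Willard van Orman'>Willard van Orman Quine</name>
  <name sortAs='Cowan, John'>John Cowan</name>
  <name sortAs='Mackenzie King, William Lyon'>William Lyon Mackenzie King</name>
</people>
"""

def naive_invert(display):
    """Naively treat the last whitespace-separated word as the surname."""
    parts = display.split()
    if len(parts) < 2:
        return display
    return "%s, %s" % (parts[-1], " ".join(parts[:-1]))

names = ET.fromstring(DOC).findall("name")

# Sort on the hand-entered sort form, but show the display form
by_sort_key = [n.text for n in sorted(names, key=lambda n: n.get("sortAs"))]

# Names where the naive inversion disagrees with the hand-entered form
mismatches = [n.text for n in names if naive_invert(n.text) != n.get("sortAs")]

print(by_sort_key)
print(mismatches)
```

In this sample, the naive inversion agrees with the hand-entered form for two of the three names and disagrees for "William Lyon Mackenzie King", which is the case Cowan singles out.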

In this same thread, David Wheeler made a comment that well encapsulates the unexpected significance of the problem of modeling names cross-culturally:

There are naming standards, but the truly internationalized ones are so complicated that they're practically never used.

In a message to me, John discussed having run into such naming issues "in the context of [international e-mail addressing and transport standard] X.400 and [international directory standard] X.500; the former has a whole apparatus with title, given name, middle name(s), surname, and 'generational epithet' (i.e. 'Sr.' or 'Jr.' or 'III'); the latter, invented later, has simply 'common name' and 'surname', where the former is explicitly for display and the latter is de facto used for sorting."

Modeling names is an old problem, one that's not getting any easier as technology becomes increasingly global. Even though few of you will ever have to deal directly with the full complexity of internationalized name modeling, you should always carry some respect for the scale of the issue in your mind as a safeguard against the most common and embarrassing modeling gaffes.


What standards, anyway?

I always advise people to at least consider using externally developed XML vocabularies. Rolling your own can often seem deceptively simple. I've learned to interpret the comment "so just how hard can it possibly be to design a useful XML format?" as a precursor to later difficulties. The most important step in deciding whether to adopt industry standards in a quest for semantic transparency is to find and evaluate the candidate standards themselves. The popularity of XML has engendered an explosion in the development of standard XML formats, so this is not always a simple proposition. In a recent tip (see Resources), I provided some useful places to go looking for suitable XML vocabularies. In this column, I've also often highlighted standards specific to certain industries, as well as some more universal efforts. You can help others when you share your own experiences with prevalent XML formats and participate on the Thinking XML discussion forum.


Resources

