
Thinking XML: State of the art in XML modeling

What do developers need to know about the various approaches to semantic transparency?

Uche Ogbuji (uche@ogbuji.net), Consultant, Fourthought, Inc.
Uche Ogbuji is a consultant and co-founder of Fourthought Inc., a consulting firm specializing in XML solutions for enterprise knowledge management applications. Fourthought develops 4Suite, the open source platform for XML middleware. Mr. Ogbuji is a Computer Engineer and writer born in Nigeria, living and working in Boulder, Colorado, USA. You can reach him at uche@ogbuji.net.

Summary:  The running theme of this column has been semantic transparency: the ability to correctly interpret the contents of XML documents. Semantic transparency might be the most important aspect of XML modeling. This is the first in a series of articles that review the many different approaches to semantic transparency and discuss what they mean to developers using XML.

Date:  11 Mar 2005
Level:  Intermediate

This is the 30th installment of the Thinking XML column. It is almost exactly four years since the first article, and in retrospect I'm amazed by the flight of time and the march of events since then. The activity associated with XML has been tremendous, and I hope some of it has been apparent in the range of topics covered in this column. This activity has been especially interesting in the use of XML for knowledge management technologies, which is the focus of this column. In the first installment -- in February 2001 -- I discussed the goal of XML semantic transparency, which I think is the most important aspect of XML data modeling. Throughout this column, I've considered different approaches to semantic transparency. For this installment, I'll kick off a short series of articles that provide an overview of some interesting technologies and techniques for semantic transparency, offering my opinion on the state of the art. I'll break this series into three parts:

  1. Using informal descriptions in formal schemata (this article)
  2. Using schema standardization for top-down semantic transparency
  3. Using semantic anchors within schemata for bottom-up semantic transparency

Formal schemata, informal transparency

One common misconception about XML is that if you just define a schema, others will know how to process the XML instances and interoperate with your system. This may be true, depending on how the schema is authored, but generally not as a result of features of the schema language itself. Listing 1 is a sample RELAX NG schema (compact syntax) snippet:


Listing 1. Sample RELAX NG schema using annotations to provide semantic clues
namespace dc = "http://purl.org/dc/elements/1.1/"
element purchase-order
{
  dc:description [ "General purpose purchase order for merchandise" ]
  attribute id {
    dc:description [ "Unique identifier for the purchase order" ]
    text
  }
  #The rest of the schema here
}

For those not familiar with RELAX NG, the first line is a namespace declaration for Dublin Core, which is a popular vocabulary for metadata elements such as titles, descriptions, attributions, and other library-like properties. The second line defines an element named purchase-order. The line beginning dc:description is an annotation using the namespace prefix declared earlier to indicate that the intent of the annotation is to provide information that conforms to the Dublin Core description element. The next four lines define an attribute named id, with a plain text value. This attribute definition has an annotation of its own, giving the intended meaning of the attribute. The line after all that is a comment. Notice that in this example I use annotations to provide information that's important to understanding the semantics of the schema, whereas I use the comment to convey incidental information. An example of a document that conforms to this schema is: <purchase-order id="123"/>.

If Listing 1 is the purchase order schema that Acme Organization comes up with, then Zenith Organization, acting separately, might come up with the schema in Listing 2.


Listing 2. Sample RELAX NG schema similar to Listing 1
namespace dc = "http://purl.org/dc/elements/1.1/"
element po
{
  dc:description [ "Simple purchase order" ]
  attribute number {
    dc:description [ "Number for identifying the purchase order" ]
    text
  }
  #The rest of the schema here
}
  

Notice that the annotations are similar, but the actual element and attribute names are different. A corresponding example document might be: <po number="123"/>. A person can look at the two schemata above and recognize from the annotations the equivalence of the purchase-order element in one to the po element in the other, and the id attribute in one to the number attribute in the other. In this way, semantic transparency is achieved through informal means. A person has to use imprecise natural language skills to make sense of the annotations, rather than some strict and unambiguous definition.

The problem is the scalability of this process. The above example has simple, one-to-one mappings between data elements in the two vocabularies, and annotations that you can readily compare in a casual reading. More realistic situations involve more complex schemata with less predictable mappings and subtler differences in annotations and other such informal descriptions. In such cases, it might be very difficult to achieve semantic transparency through natural language schema annotations.
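To make the cost of the informal approach concrete, here is a minimal Python sketch of the simple one-to-one mapping between Listing 2 and Listing 1, using only the standard library. The table names ELEMENT_MAP and ATTRIBUTE_MAP and the function zenith_to_acme are hypothetical, introduced only for illustration. The point is that a person has to write these tables after reading the annotations; neither schema supplies them.

```python
import xml.etree.ElementTree as ET

# Hand-authored mapping from Zenith's vocabulary (Listing 2) to Acme's
# (Listing 1). A person derived it by reading the annotations; nothing in
# either schema states these equivalences formally.
ELEMENT_MAP = {"po": "purchase-order"}
ATTRIBUTE_MAP = {("po", "number"): "id"}


def zenith_to_acme(source: ET.Element) -> ET.Element:
    """Rename an element and its attributes according to the mapping tables."""
    target = ET.Element(ELEMENT_MAP.get(source.tag, source.tag))
    for name, value in source.attrib.items():
        target.set(ATTRIBUTE_MAP.get((source.tag, name), name), value)
    target.text = source.text
    for child in source:
        child_copy = zenith_to_acme(child)
        child_copy.tail = child.tail
        target.append(child_copy)
    return target


acme_doc = zenith_to_acme(ET.fromstring('<po number="123"/>'))
print(ET.tostring(acme_doc, encoding="unicode"))
```

Every new element or attribute in either vocabulary means another hand-written entry, and a mapping that is not one-to-one no longer fits in a lookup table at all.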

DTDs do not provide directly for annotation, but other popular schema languages do: RELAX NG, W3C XML Schema (WXS), and Schematron. In these languages, you can structure annotations themselves for machine consumption, providing more reliable routes to semantic transparency; I'll cover some such techniques in future articles. Unfortunately, such techniques are not very well taught, discussed, or even analyzed, partly because many people involved with XML mistakenly believe that semantic transparency is not a pressing concern, or that it is something that XML in itself already provides for. In my own biased view, one particular distraction has interfered with the focus on semantic transparency.


A prominent red herring

XML experts usually recognize the weakness of informal descriptions like those described above for providing semantic transparency. The attempt to boost such facilities has always been part of the "what's next" discussion following the success of XML 1.0 -- alongside linking, processing conventions, and other concerns. Early on, people tackling such problems split into several camps. In one prominent camp are veterans of mainstream programming languages and database management systems who think the best ways to formalize the underpinnings of XML documents are the common data typing techniques with which they are most familiar. They are accustomed to thinking of all semantics in terms of the primitive axioms that make up the static data typing of mainstream languages and database systems. They feel that if they can just bind XML tightly into familiar metaphors, then they can get a grip on modeling problems.

A data typing proponent might want to touch up the schema in Listing 2 to look like the version in Listing 3.


Listing 3. Sample RELAX NG schema using WXS data types
namespace dc = "http://purl.org/dc/elements/1.1/"
element po
{
  dc:description [ "Simple purchase order" ]
  attribute number {
    dc:description [ "Number for identifying the purchase order" ]
    xsd:int
  }
  #The rest of the schema here
}
  

This time a WXS data type is assigned to the attribute, reflecting the schema designer's assumption that the purchase order number should be constrained to an integer. That is the meaning of the line xsd:int. Clearly this addition barely scratches the surface of the problem of proper interpretation of the schema. To be fair, even data typing advocates do not claim it does, but they do claim that this added bit of precision gives processing tools the power to do other sorts of reasoning and analysis on the XML instances. I happen to think this claim is somewhat dubious, and I believe that it has siphoned much energy from the XML community towards a fruitless obsession with data types. This energy might be more usefully directed towards the problem of semantic transparency.

A more direct problem is that when people reflexively use data types, they often end up reducing flexibility in unanticipated ways. As an example, if Zenith Organization, using the schema in Listing 3, wants to trade with Acme Organization, using the schema in Listing 1, there is now the additional complication that one schema sees PO numbers as integers, and the other sees them as plain old text. This mismatch is reflected in all the data-type-aware tools. Such mismatches are inevitable in any integration project, but in this case the gain from strict data typing does not measure up to the flexibility that is lost.
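A small Python sketch can illustrate the kind of mismatch described above. The function names as_text and as_int and the sample identifiers are hypothetical, and Python's int() is only a rough stand-in for xsd:int validation (it differs in range and lexical details). The sketch passes the same PO numbers through the text view of Listing 1 and the integer view of Listing 3.

```python
def as_text(po_number: str) -> str:
    """Listing 1 view: the identifier is plain text, kept exactly as written."""
    return po_number


def as_int(po_number: str) -> int:
    """Listing 3 view: the identifier must be an integer.

    int() only approximates xsd:int; it is used here for illustration.
    """
    return int(po_number)


samples = ["123", "00123", "PO-123"]  # hypothetical identifiers
results = {}
for raw in samples:
    try:
        # Convert to an integer and back, as a type-aware tool might.
        results[raw] = str(as_int(raw))
    except ValueError:
        results[raw] = None  # rejected outright by the integer view

for raw, round_tripped in results.items():
    print(repr(as_text(raw)), "->", repr(round_tripped))
```

Under these assumptions, an identifier with leading zeros survives the text view unchanged but comes back altered from the integer view, and an identifier with a letter prefix is rejected by the integer view altogether.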

What does this mean to a developer? I don't mean to argue that you should not use schema data types -- just don't use them as a reflex. Use them to mark very carefully considered constraints that you expect to make sense throughout the life of the system. And don't get so preoccupied with data typing that you forget to consider how to clarify the more general semantics related to your XML vocabulary.

I myself have added the ability to infer data types from text patterns in XML nodes to the Amara XML toolkit, one of the XML processing libraries I develop for the Python programming language. I am careful to make this type inference optional, and I think it's probably dangerous to use it as a cornerstone of any processing tool chain. I've also given users the capability to set up custom data types in a declarative way using Jeni Tennison's Data Type Library Language (DTLL -- see Resources). DTLL helps make more explicit the fact that data typing in XML is nothing more than a specialized interpretation of text. That is the crux of the matter: XML is text, and only text. Other layers such as data typing are mere interpretations of that text (and should be optional interpretations). The moment you lose sight of that, you're in for all sorts of unforeseen complications.
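The idea that a data type is an optional interpretation layered over text can be sketched in a few lines of Python. This is not Amara's API or DTLL. The RULES table, the interpret function, and the infer_types flag are all hypothetical names chosen for illustration, and the patterns are deliberately simplistic.

```python
import re
from datetime import date

# Hypothetical inference rules, tried in order. Each pairs a text pattern with
# a converter; they are illustrative and not the rules any real toolkit uses.
RULES = [
    (re.compile(r"[+-]?\d+"), int),
    (re.compile(r"[+-]?\d+\.\d+"), float),
    (re.compile(r"\d{4}-\d{2}-\d{2}"), date.fromisoformat),
]


def interpret(text: str, infer_types: bool = False):
    """Return the text itself unless the caller opts in to type inference."""
    if not infer_types:
        return text
    for pattern, convert in RULES:
        if pattern.fullmatch(text):
            try:
                return convert(text)
            except ValueError:
                continue  # matched the pattern but is not a valid value
    return text  # no rule applies, so the text stands as text


print(repr(interpret("123")))                    # default: text stays text
print(repr(interpret("123", infer_types=True)))  # opt-in interpretation
print(repr(interpret("PO-123", infer_types=True)))
```

The default returns the text untouched, so typing never happens unless the caller asks for it, and any value that fits no rule falls back to plain text rather than raising an error.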


Wrap-up

Good annotations of schemata are very important, regardless of whether or not they lead to semantic transparency. It might even be enough to maintain a separate data dictionary document, of the sort familiar to database developers. For each term used in the schema, a data dictionary provides a description that informally fills in the semantics for that term. As you recognize the supremacy of text in XML, the importance of semantic transparency becomes clear. Since all XML processing is ultimately a matter of interpreting language, it is essential to find ways to reduce the ambiguity of that interpretation. If you have any thoughts on schemata, schema annotations, data typing, or related topics, please share them by posting on the Thinking XML discussion forum.


Resources
