XMLOpen 2004 took place September 21-23 in Cambridge, England. It brought together a group of XML experts, most from the UK but others from Europe, the US, Australia, Japan, and elsewhere. The theme, which I set forth in my keynote address, was XML and its intersection with open standards and open source. The conference offered an extraordinary amount of smart work and food for thought, and in this article I discuss the proceedings that relate to topics I've already covered in my other developerWorks articles.
Later on in this article I shall offer final observations on topics in XML Hacks, continuing the coverage in my last article in this column.
Punts are simple, gondola-like boats for navigating the Cam, the calm and stately stream that flows through the campus of one of the world's foremost universities. Those who took a three-day pass on punting to come to the XMLOpen conference (see Resources) heard a lot of doubts expressed about the merits of W3C XML Schema (WXS) and Web services, ambivalence about XPath 2.0 and XQuery, enthusiasm for RELAX NG and ISO DSDL, and advocacy of XML processing through programming languages and frameworks that fall somewhat outside the mainstream. This confluence of concerns was a natural product of the fact that the speakers were drawn from those working at the vanguard of XML tools and techniques.
Rick Jelliffe, creator of Schematron, opened his session by announcing that the XML schema and reporting language had gained enough votes to be ratified as an ISO committee draft standard as part 3 of ISO Document Schema Definition Languages (DSDL): "Rule-based validation" (see Resources). My recent tutorial on Schematron offers an introduction to that useful, soon-to-be standard. With a growing list of implementations, and public realization of the versatility of the language, Schematron was a consistent buzzword at the conference.
Jelliffe's talk was actually about his experiences trying to come up with metrics of XML schema complexity. The idea was to get an index number to help estimate the difficulty of implementing processing tasks (such as creating an XSLT transform) for a vocabulary and the typical uses for the vocabulary. Jelliffe's formula was a count of element types, attributes, and various special cases of these measured either from a DTD or from one or more instance documents. While there was some discussion of the exact details of such measurements -- for example, the extent to which structured fields and controlled vocabularies within content complicated processing -- the general idea turned out to be one that others had considered and even implemented. I mentioned that at Fourthought, the consultancy where I practice, we have created a lightweight measure to estimate how hard it would be to develop an XML schema (in RELAX NG) given the outlines of a vocabulary needed by the client. It will be interesting to see whether the industry begins to come up with general measurements of XML language complexity, and even to standardize such measurements, perhaps along lines that are traceable to ISO standards for software quality.
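To make the idea of a complexity index concrete, here is a minimal Python sketch that measures an instance document by counting distinct element types and distinct element/attribute pairs. The function name `complexity_index`, the sample document, and the unweighted sum are all assumptions made for illustration; this is not Jelliffe's formula, which also accounts for special cases not reproduced here.

```python
import xml.etree.ElementTree as ET


def complexity_index(xml_text):
    """Count distinct element types and (element, attribute) pairs.

    The unweighted sum used for "index" is an illustrative assumption,
    not the formula presented at the conference.
    """
    root = ET.fromstring(xml_text)
    element_types = set()
    attributes = set()
    for elem in root.iter():
        element_types.add(elem.tag)
        for name in elem.attrib:
            # Treat the same attribute name on different elements as distinct.
            attributes.add((elem.tag, name))
    return {
        "element_types": len(element_types),
        "attributes": len(attributes),
        "index": len(element_types) + len(attributes),
    }


# Hypothetical sample vocabulary, invented for this sketch.
SAMPLE = """<order id="42">
  <item sku="a1" qty="2">Widget</item>
  <item sku="b7">Gadget</item>
  <note>Rush delivery</note>
</order>"""

result = complexity_index(SAMPLE)
print(result)
```

A measure like this says nothing about structured fields or controlled vocabularies inside the content, which is exactly the kind of detail that was debated in the session.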
A URI scheme for the Semantic Web?
In earlier installments of this column, and in particular "Basic XML and RDF techniques for knowledge management, Part 7", I discussed topics related to the Semantic Web, the W3C's ambitious plan for a next-generation Web where documents are well annotated for meaning and context. Semantic Web technologies use URIs as the basic identifiers of all things being discussed, whether they're computer records, real world objects, or even abstractions. One of the challenges of Semantic Web research is creating a URI that reliably identifies a thing rather than some accidental aspect of it. For example, sometimes when people describe a person in Semantic Web languages, they use the URL of the person's Web home page as a stand-in. But this introduces confusion between descriptions of the actual home page itself and descriptions of the person.
Henry Thompson, a developer, researcher, W3C staffer, and well-known member of the XSL and Schema working groups, proposes to address this problem using a new URI scheme called Web Proper Names. With WPNs, the URI is constructed based on the results of a search on a well-known engine such as Google; the URI includes details of the engine and search terms used, the date and language in which the search was made, the extent to which search results were checked for relevance by a person, and, crucially, the owner or "baptiser" as Thompson calls it -- the person or entity responsible for the name. The baptiser is usually the same as the person who performs the search and checks results.
Here's an example of a WPN. If you wanted to make assertions about a person, you would perform a search on that person's name, and use appropriate terms to narrow down the results so that most are about the person in question. So, if one "Ralph Parker" works in materials engineering and another in medicine, and you wanted to describe the latter, you might specify search terms to omit pages where the word "materials" occurs. One of the search engine results might be the home page of Ralph Parker, which you might have considered using as a URI to represent the person. However, by using a WPN instead you make it clear that what you're describing is not that Web page (nor any of the Web pages returned by your search results), but rather the object that is the main subject of those Web pages. WPNs can be rather long. The following is Thompson's example WPN for the Eiffel Tower:
wpn://www.ltg.ed.ac.uk/~ht/WPN/EiffelTower?
terms=eiffel+tower+paris+-hotel+-webcam&ln=en&
se=www.google.com&dt=2004-05-21&rs=17&cs=5&pc=8

Note: The preceding WPN is normally written as a single continuous line. It is split across multiple lines here for ease of formatting and printing.
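Because a WPN follows generic URI syntax, ordinary URL tools can take it apart. The following Python sketch splits Thompson's Eiffel Tower example into its components using only the standard library. The function name `describe_wpn` is hypothetical, and reading `ln`, `se`, and `dt` as language, search engine, and date is my interpretation of the description above; the remaining parameters (`rs`, `cs`, `pc`) are passed through without interpretation.

```python
from urllib.parse import urlsplit, parse_qs

# Thompson's example, rejoined into one continuous URI.
WPN = (
    "wpn://www.ltg.ed.ac.uk/~ht/WPN/EiffelTower?"
    "terms=eiffel+tower+paris+-hotel+-webcam&ln=en&"
    "se=www.google.com&dt=2004-05-21&rs=17&cs=5&pc=8"
)


def describe_wpn(uri):
    """Split a WPN into named parts (field meanings are assumptions)."""
    parts = urlsplit(uri)
    # parse_qs returns a list per key; each key occurs once in this example.
    query = {key: values[0] for key, values in parse_qs(parts.query).items()}
    known = {"terms", "ln", "se", "dt"}
    return {
        "scheme": parts.scheme,
        "authority": parts.netloc,
        "path": parts.path,
        "search_terms": query["terms"].split(),
        "language": query.get("ln"),
        "search_engine": query.get("se"),
        "date": query.get("dt"),
        # Parameters this sketch does not try to interpret.
        "other": {k: v for k, v in query.items() if k not in known},
    }


info = describe_wpn(WPN)
for name, value in info.items():
    print(f"{name}: {value}")
```

Note how the search terms carry the disambiguation: the terms prefixed with a minus sign exclude pages about hotels and webcams, in the same way that you might exclude "materials" to single out one Ralph Parker.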
As I opined in the Q&A for the talk, the Semantic Web, based on some proponents' claims, may not be a reasonably-sized undertaking for the next-generation Web. Information technology is predicated on the idea that the material being processed is but an analogue of real-world things. We process computer records of people, organizations, places, ideas, and the like, rather than the actuality of these things. The philosophy of names, words, and meanings is a very old and contentious one, and the merest contemplation of such issues as precisely what a computer identifier should mean in the real world is fraught with endless complications and pitfalls. The Semantic Web should focus on giving Web authors cheap and simple tools (specifically, tools that have open source options and are easy enough to learn in a half day) to annotate pages with their ideas of context. Convention will emerge in each community of topical interest through rough consensus, as it always does when people stumble into any information sharing exercise. Within closed systems (such as in an organization), conventions can be imposed through management. (In effect, what an identifier means is what corporate policy says it means. Full stop.) Trying to impose universal identifiers or even conventions for identifiers is an impossible task for the Semantic Web, whether you're a proponent of RDF or topic maps.
In the end, Thompson's idea is a very clever one, and I plan to make use of it in less ambitious ways. It seems like a nice way to define and describe topics of interest in a Web log, for example, especially since WPNs can be translated to HTTP URLs that should resolve to Resource Directory Description Language (RDDL) -- see my article on the topic, "Use RDDL with your XML and Web services namespaces."
Sean McGrath has been a longtime advocate of XML pipelines, which he describes as "a way of thinking about systems focusing on dataflows rather than object APIs." XML pipelines are a way of breaking down XML processing projects into small tasks performed by independent and reusable processing stages. For example, you could run an XML file through one stage that renames certain elements, another that adds new lines to text according to a word-wrapping routine, and finally a stage that transforms the document to plain text output. Pipelining is in part the classic divide and conquer approach to problem solving that almost all programmers are familiar with, but rather than thinking of decomposing algorithms into manageable chunks, McGrath and other pipelining proponents advocate focusing on the data and data transforms. He invokes the idea of pioneering software engineer Michael Jackson that all data processing can be boiled down to data flows with respect to time. McGrath argues that Web services and many other established XML processing practices revolve around the shoehorning of the data into fashionable programming techniques of the day, introducing unnecessary complexity. Pipelining restores the very simplicity and versatility that are the hallmarks of XML's success.
McGrath discussed many properties of pipelines, including the relative ease of auditing and debugging, the value of pipeline stage reuse, and the fact that each pipeline stage can be written using whatever programming tools are most practical -- some might use SAX, others DOM, and still others XSLT. He also discussed techniques such as merges and splits between pipeline data flows, and delta schemata -- the practice of using a schema to account for the intermediate data between each pipeline stage. Pipelines have emerged in many different ways in the XML universe, including ISO's DSDL, which uses a pipeline approach to break down the many aspects of XML schema into smaller, independent specifications. XML best practices are still emerging, but many experts agree that pipelines in one form or another are the future of XML processing practice.
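The three-stage example described earlier (rename elements, word-wrap the text, emit plain text) can be sketched in a few lines of Python. This is only an illustration of the dataflow style under my own assumptions, not McGrath's tooling: the stage names, the `run_pipeline` helper, and the sample document are all invented here, and every stage happens to use ElementTree even though real stages could each use a different tool.

```python
import textwrap
import xml.etree.ElementTree as ET
from functools import reduce


def rename_elements(mapping):
    """Build a stage that renames element types according to mapping."""
    def stage(root):
        for elem in root.iter():
            elem.tag = mapping.get(elem.tag, elem.tag)
        return root
    return stage


def wrap_text(width):
    """Build a stage that re-wraps each element's text to the given width."""
    def stage(root):
        for elem in root.iter():
            if elem.text and elem.text.strip():
                normalized = " ".join(elem.text.split())
                elem.text = textwrap.fill(normalized, width=width)
        return root
    return stage


def to_plain_text(root):
    """Final stage: drop the markup and keep text blocks as paragraphs."""
    blocks = [e.text for e in root.iter() if e.text and e.text.strip()]
    return "\n\n".join(blocks)


def run_pipeline(data, stages):
    """Feed the output of each stage into the next one."""
    return reduce(lambda acc, stage: stage(acc), stages, data)


# Hypothetical input document.
SOURCE = (
    "<doc><para>Pipelines break XML processing into small, "
    "independent stages.</para><para>Each stage can be reused.</para></doc>"
)

stages = [
    ET.fromstring,
    rename_elements({"para": "p"}),
    wrap_text(30),
    to_plain_text,
]
output = run_pipeline(SOURCE, stages)
print(output)
```

Because each stage only agrees with its neighbors on the data passed between them, you could audit the intermediate result after any stage, or swap one stage for an XSLT transform, without touching the others.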
Rich and extensible data types
The WXS data types specification (part 2 of WXS overall) is often referenced in other specifications, especially W3C specs, but it is also criticized as an arbitrary and complex set of data types that too often don't align with the specific needs of real-world applications. Jeni Tennison has been working on this problem for some time and has developed Data Type Library Language (DTLL -- see Resources) as a means of specifying custom data types for XML. She was inspired by her observation that data in real XML tends more towards human readability, for presentation rather than processing. This point of view dovetails neatly with that of RELAX NG, and in fact the primary goal of DTLL is to be a means for defining data type libraries for use with RELAX NG. RELAX NG is part 2 of ISO DSDL, and DTLL is the current candidate for "Part 5: Data types."
DTLL allows you to tell the processor how to parse data types by defining regular expressions for breaking them down into important components (for example, the red, blue, and green parts of an RGB color value). You can then express how data types are tested for equality, or their sort order. This allows them to be used naturally in XSLT's xsl:for-each and other processing settings. DTLL supports inheritance of type components (supertypes) and other features to support modularization and reuse of data type libraries. Overall, this feature set is based on a very thorough analysis of existing uses of data types in common XML vocabularies, including DocBook, XHTML, SVG, MathML, and more -- all covered in my "Survey of XML standards." Tennison has thought through many of the very difficult problems that revolve around binding a textual format such as XML to the many types and systems of often non-textual data that need to be processed (and she admits that some problems remain to be solved). DTLL is still quite new, but given its merits and the backing of ISO, you soon might put it to use for data types that closely fit your processing needs.
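The approach described above (a regular expression that breaks a lexical value into components, plus rules for equality and ordering) can be imitated in ordinary code. The Python sketch below is an analogue of the idea only: it is not DTLL syntax, the `RGBColor` class and its pattern are hypothetical, and ordering colors by their (red, green, blue) components is an arbitrary choice made for the example.

```python
import re
from functools import total_ordering

# Lexical form assumed for this sketch: "#" followed by three hex pairs.
RGB_PATTERN = re.compile(
    r"^#(?P<red>[0-9a-fA-F]{2})"
    r"(?P<green>[0-9a-fA-F]{2})"
    r"(?P<blue>[0-9a-fA-F]{2})$"
)


@total_ordering
class RGBColor:
    """A custom data type parsed from text into named components."""

    def __init__(self, lexical):
        match = RGB_PATTERN.match(lexical.strip())
        if match is None:
            raise ValueError(f"not an RGB color: {lexical!r}")
        self.red = int(match.group("red"), 16)
        self.green = int(match.group("green"), 16)
        self.blue = int(match.group("blue"), 16)

    def _key(self):
        return (self.red, self.green, self.blue)

    def __eq__(self, other):
        # Equality is defined on the parsed components, not the raw text.
        if not isinstance(other, RGBColor):
            return NotImplemented
        return self._key() == other._key()

    def __lt__(self, other):
        # Sort order is an assumption: compare red, then green, then blue.
        if not isinstance(other, RGBColor):
            return NotImplemented
        return self._key() < other._key()

    def __hash__(self):
        return hash(self._key())

    def __repr__(self):
        return "#{:02x}{:02x}{:02x}".format(*self._key())


colors = sorted(RGBColor(text) for text in ["#ff0000", "#00FF00", "#0000ff"])
print(colors)
print(RGBColor("#FF0000") == RGBColor("#ff0000"))
```

The point of the comparison on components is that two values with different spellings, such as upper and lower case hex digits, are still recognized as the same color.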
I have one more observation to make regarding the book XML Hacks. Hack #92, "Use Elements Instead of Entities to Avoid the 'amp Explosion Problem'", discusses a problem where careless processing leads to unnecessary and confusing text such as "&amp;amp;". This happens when one escapes text that has already been escaped. The solution given in the book is to use special elements to represent these entities instead, and then replace these with the necessary entities at the end of the processing stage (presumably using pipelines as discussed above). I doubt such a measure is ever necessary. It's important for XML systems to know the source and state of each chunk of text being processed. In particular, systems must keep track of whether or not text has been escaped for XML representation. If they lose track of this, the potential increases for much greater mischief than just redundant entity escaping. If the system does track the source and state of each chunk of text, then the problem described in this hack simply does not occur. I don't agree with the solution given in this hack because it complicates processing as a way to compensate for bugs in the processing. It's better to just fix the bugs. If you're using processing pipelines, then the key is to establish contracts for pipeline inputs and outputs that state whether the data is escaped.
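A short Python sketch shows both the bug and the kind of bookkeeping I have in mind. The `escape` function comes from the standard library's `xml.sax.saxutils` module; the `Text` wrapper class is a hypothetical illustration of tracking escape state, not something taken from the book or from any particular toolkit.

```python
from xml.sax.saxutils import escape

raw = "Fish & chips"
once = escape(raw)    # correct: escaped exactly once
twice = escape(once)  # the bug: escaping text that is already escaped

print(once)
print(twice)


class Text:
    """A chunk of text that remembers whether it has been escaped."""

    def __init__(self, value, escaped=False):
        self.value = value
        self.escaped = escaped

    def for_xml(self):
        """Return an escaped copy, or self if already escaped."""
        if self.escaped:
            return self
        return Text(escape(self.value), escaped=True)


# Asking for the XML form twice is harmless once the state is tracked.
tracked = Text(raw).for_xml().for_xml()
print(tracked.value)
```

In a pipeline, the same contract can be stated per stage: each stage declares whether it expects and produces escaped or unescaped text, so no stage has to guess.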
It is gratifying to watch the discipline of XML processing mature, as manifested by the emergence of books such as those I've covered in the past few articles, and by the quality of conferences such as XMLOpen. The professional conventions and standards being developed in these important times are key to gaining the benefits that have attracted so many to XML. I highly recommend that you participate in this process, and one way is by posting your thoughts and experiences on the Thinking XML discussion forum.
Resources
- Participate in the discussion forum.
- Learn more about the XMLOpen Conference, which took place September 21-23, 2004 in Cambridge, UK.
- Check out ISO Document Schema Definition Languages (DSDL). But first browse this article, an overview of the collection of standards that make up DSDL, with some discussion of the progress of each part.
- Read the paper "Web Proper Names: Naming Referents on the Web", by Harry Halpin and Henry S. Thompson, both of the University of Edinburgh.
- Visit the home page for XML Hacks (edited by Michael Fitzgerald; O'Reilly and Associates, 2004) for a table of contents, 11 sample hacks freely available online, and errata. You can also order the book at the developerWorks Developer Bookstore.
- Learn more about data types in Part 2 of the W3C XML Schema Recommendation. These are sometimes criticized as being an arbitrary and complex set of data types that too often don't align with the specific needs of real-world applications.
- Take a closer look at Jeni Tennison's Data Type Library Language (DTLL), which specifies custom data types for XML.
- Look around the Schematron home page and resource directory. You can also get a solid background on Schematron with this tutorial by Uche Ogbuji (developerWorks, September 2004).
- Read "Basic XML and RDF techniques for knowledge management, Part 7" (developerWorks, July 2002), an earlier installment of this column that covers the Semantic Web, the W3C's ambitious plan for a next-generation Web where documents are well annotated for meaning and context.
- Read more about Resource Directory Description Language (RDDL) in the author's developerWorks article "Use RDDL with your XML and Web services namespaces" (May 2004).
- Confused by all the XML standards out there? Uche Ogbuji's developerWorks article series on XML standards can help you sort through it all:
  - Part 1 -- The core standards
  - Part 2 -- XML processing standards
  - Part 3 -- The most important vocabularies
  - Part 4 -- Detailed cross-reference of the most important XML standards
- Find more XML resources on the developerWorks XML zone, including previous installments of the Thinking XML column.
- Learn how you can become an IBM Certified Developer in XML and related technologies.

Uche Ogbuji is a consultant and co-founder of Fourthought Inc., a software vendor and consultancy specializing in XML solutions for enterprise knowledge management. Fourthought develops 4Suite, an open source platform for XML, RDF, and knowledge-management applications. Mr. Ogbuji is also a lead developer of the Versa RDF query language. He is a computer engineer and writer born in Nigeria, living and working in Boulder, Colorado, USA. You can contact Mr. Ogbuji at uche@ogbuji.net.





