Last month marked ten years since the World Wide Web Consortium (W3C) Standard Generalized Markup Language (SGML) on the Web Editorial Review Board publicly unveiled the first draft of Extensible Markup Language (XML) 1.0 at the SGML 96 conference. In November 1996, in the same hotel, Tim Bray threw the printed 27-page XML spec into the audience from the stage, from whence it fluttered lightly down; then, he said, "If that had been the SGML spec, it would have taken out the first three rows." The point was made. Although SGML remains in production to this day, as a couple of sessions reminded attendees, the markup community rapidly moved on to XML and never looked back.
Ten years later, IDEAlliance's annual conference is still going strong, although today it's called XML 2006. It's the major North American XML conference and the largest pure XML show still running. However, many of the players remain the same. Quite a few attendees (and one keynoter) could and did give first-hand reports of the early days to others like myself who only discovered the power of descriptive markup from XML.
The conference was smaller than in previous years (as almost all conferences are, post-dotcom implosion) -- about 400 people. However, repeat attendees reported that this was the most exciting and active iteration in several years. Despite running four concurrent tracks over three days, limiting speakers to somewhere between 400 seconds (for the least interesting subjects) and 45 minutes (for the most interesting subjects), and not covering speakers' travel expenses, the referees still had to choose from about four times as many submissions as they had slots for. For the six late-breaking sessions, that ratio was more like 10:1. It looks like the XML world is accelerating once again.
In addition to the final emergence from the post dot-bomb malaise and the possible expansion of Bubble 2.0, several factors converged to make this one of the most interesting XML conferences since the late 90s:
- XQuery
- Atom
- Web 2.0
Without a doubt, XQuery stood out as the star of the show, beginning with Roger Bamford's opening keynote, in which he announced the FLWOR Foundation and its upcoming efforts to develop an open source XQuery engine written in C++. The engine will sit on top of various storage engines, including Oracle's Berkeley DB.
At least a dozen presentations addressed XQuery, compared to only three last year. That the W3C XQuery and Extensible Stylesheet Language Transformations (XSLT) working groups released eight proposed recommendations two weeks before the conference (after years of development) didn't hurt. Barring any last-minute spec bugs, the final recommendations are expected to be released within weeks, not months. There are already over a dozen mostly conformant implementations of the specs, and four of them were represented at the show: IBM® DB2® 9, Oracle Database 10g Release 2, Mark Logic, and DataDirect XQuery.
On Wednesday morning, Darin McBeath of Reed Elsevier gave a keynote address titled "Unleashing the Power of XML" in which he described numerous cases of publishers like Oxford University Press, O'Reilly, and even JetBlue implementing successful XQuery projects on top of either hybrid or native XML databases. Hybrid XML databases are those such as DB2 and Oracle that support both Structured Query Language (SQL) and XQuery. Native XML databases are those such as Mark Logic that support only XQuery.
I talked to several database vendors on the exhibit floor and attended several more XQuery presentations to try to make some sense of this. In brief, hybrid databases are still composed of relational tables. However, fields aren't limited to the usual SQL types like INT and DATE.
They can also be declared to have type XML. An XML-type field contains a complete, well-formed, optionally schema-valid document (or null). You can select, insert, and update the XML values using SQL statements with embedded XQuery subqueries. For example, the code in Listing 1 inserts an Extensible Hypertext Markup Language (XHTML) formatted comment into the comments table. This table has two fields, a CHAR(16) username and an XML comment.
Listing 1: Inserting XML into a hybrid SQL-XML database
-- SQL string literals take single quotes; the apostrophe in "no one's" is doubled to escape it.
INSERT INTO comments (username, comment)
VALUES ('FP',
'<div xmlns="http://www.w3.org/1999/xhtml" class="comment">
<p>
The relational model rules <strong>supreme</strong>.
It can do anything an XML model can do. Unfortunately, no one''s ever
listened to me and implemented a true relational database.
</p>
<p>
I will now go sulk in my corner until the world accepts Codd
and the <a href="http://www.itworld.com/nl/db_mgr/05072001/">12 commandments</a>.
</p>
</div>');
The database parses the XML data before insertion and stores it in a form that's amenable to searching and querying without having to reparse the data for every query. Indexes can even be defined on particular paths in the XML trees to improve performance.
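To make the parse-once idea concrete, here is a minimal Python sketch using only the standard library. It is not how DB2 or any other product works internally: SQLite stands in for the database, and the `href_index` table, the choice of the `//a/@href` path, and all function names are assumptions made up for illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"


def open_store():
    """Create an in-memory store: one table of documents, one acting as a path index."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE comments (id INTEGER PRIMARY KEY, username TEXT, comment TEXT)"
    )
    # Hypothetical index on the path //a/@href: one row per link found at insert time.
    db.execute("CREATE TABLE href_index (href TEXT, comment_id INTEGER)")
    db.execute("CREATE INDEX href_lookup ON href_index (href)")
    return db


def insert_comment(db, username, xml_text):
    """Parse once at insert time, reject malformed XML, and record indexed path values."""
    root = ET.fromstring(xml_text)  # raises ET.ParseError if the text is not well-formed
    cur = db.execute(
        "INSERT INTO comments (username, comment) VALUES (?, ?)", (username, xml_text)
    )
    rows = [
        (a.get("href"), cur.lastrowid)
        for a in root.iter(XHTML + "a")
        if a.get("href")
    ]
    db.executemany("INSERT INTO href_index VALUES (?, ?)", rows)
    return cur.lastrowid


def users_linking_to(db, href):
    """Answer a path query from the index alone, without reparsing any document."""
    sql = (
        "SELECT c.username FROM href_index i "
        "JOIN comments c ON c.id = i.comment_id "
        "WHERE i.href = ? ORDER BY c.id"
    )
    return [row[0] for row in db.execute(sql, (href,))]
```

The cost of parsing is paid once per insert, and a lookup such as `users_linking_to(db, "http://example.com/codd")` touches only the index table. A real XML database does far more, but the trade-off is the same.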
To SQL, this data looks like one CLOB. However, XQuery expressions can exploit the structure of the XML. For example, Listing 2 shows a query that extracts img elements from the comments table.
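Listing 2: Selecting img elements with XQuery
XQUERY
declare namespace html = "http://www.w3.org/1999/xhtml";
(: DB2 matches the xmlcolumn argument case-sensitively; unquoted SQL names are stored in uppercase :)
for $img in db2-fn:xmlcolumn('COMMENTS.COMMENT')//html:img
return $img;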
Some parts of this process are standardized in either the XQuery specs, JSR 225: XQuery API for Java, or SQL/XML. A few pieces are still in development, notably updates and full-text search. However, many pieces remain deliberately unspecified. Consequently, most applications will use some vendor-specific code. Listing 2 uses the DB2-specific function xmlcolumn to find the right field.
All of this violates the relational model in about half a dozen different ways. On the other hand, no major production database has ever fully implemented the relational model anyway, the cries of the relational purists notwithstanding. Normalization is also left by the wayside (as it usually is in any large database in which performance is a major concern).
Despite the ideological impurity, this is an incredibly useful way to organize data. In particular, publishing and Web applications that need to store large documents rather than small fields and unmarked-up strings should see major benefits from this approach. The current strategy of storing marked-up text in BLOBs, CLOBs, and VARCHARs isn't nearly as natural or efficient. Shredding the document into individual nodes to be stored in separate records is even nastier. Allowing the document to be stored as a single unit in one field in one record while still letting it be searched with XQuery fits the structure of most traditional and Web publishing applications much more neatly.
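For contrast, the CLOB-style strategy looks roughly like the following Python sketch. It assumes a SQLite table named `comments` with a text column `comment`, which stands in for a CLOB field. The function name is hypothetical, and returning `src` attributes rather than whole img elements is a simplification.

```python
import sqlite3
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"


def image_sources(db):
    """Collect the src of every XHTML img by reparsing each stored comment."""
    sources = []
    # To SQL the column is just text, so every row has to come back to the client.
    for (text,) in db.execute("SELECT comment FROM comments ORDER BY rowid"):
        if text is None:
            continue  # a null comment has nothing to search
        root = ET.fromstring(text)  # full reparse on every query
        for img in root.iter(XHTML + "img"):
            sources.append(img.get("src"))
    return sources
```

Every query pulls back and reparses every document, which is the work an XML-aware column is meant to avoid.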
As Jon Bosak reminded listeners in the closing keynote, we still struggle with issues that go back to the SGML days and even earlier. Not surprisingly, a lot of these problems are people issues masquerading as technical issues.
Many of these problems revolve around authoring: Who creates the markup, and how do they create it? The three traditional approaches are:
- Type in a WYSIWYG editor such as OpenOffice or Microsoft Word; then, convert that to XML.
- Type in a structure-aware XML editor such as <oXygen/>.
- Type plain text in an editor with limited if any XML knowledge, such as emacs or BBEdit.
Although the third option is my preferred method (it's how I'm writing this article, for instance), it's a little too simple to justify conference presentations and probably not appropriate for nonprogrammer end users. This year, the competition between Microsoft's Office Open XML Formats and OpenOffice's OpenDocument Format (ODF) focused attention on the first option.
Both Microsoft and the open-document partisans seem to believe they're involved in a struggle to the death. They see themselves as waging nothing less than a war for truth, justice, and the American way on the one hand and liberté, égalité, and fraternité on the other. Consequently, neither side distinguished itself at this show. Both could benefit from toning down the hype several notches and recognizing their own limits.
When teased out from the invective, most of the technical discussion this year focused on the Microsoft formats -- probably because they're newer, and this conference thrives on the new. The critical factor seems to be that Microsoft is dead-set on maintaining pixel-perfect compatibility with at least 10 years of legacy Microsoft Office binary formats. This means the newly minted Ecma Office Open XML standard is really just an XML encoding of a legacy format. Consequently, the specification was severely constrained by the requirements of compatibility.
The result is a specification that's about 6,000 pages long. It's almost 10 times as large as the OpenDocument specification, for pretty much the same functionality. It's possibly the single largest XML specification I've ever encountered. I'm not sure even the combined family of WS-* specs matches it. This is more complete than any similar spec Microsoft has published in the past, and it will help anyone who needs to read or generate Microsoft Office documents. However, I can't see that any other project will ever adopt this format for anything other than exchange with Microsoft Office. It's too big and too full of legacy detail. If you don't already have 10 years of legacy code that reads, writes, and displays something close to this, you have no hope of implementing it.
By contrast, the competing OpenDocument format, although originally derived from StarOffice legacy formats, is much simpler and more independent of its ancestry. It's already been adopted as the native file format by separate office suites and programs with independent code bases. I'd be surprised if anyone besides Microsoft tried that with Office Open XML and even more surprised if they succeeded. Even Microsoft itself hasn't yet been able to implement this format in Microsoft Office for the Mac -- and that's not an independent code base.
By the way: An extra demerit goes to Microsoft for naming its format Office Open XML, thereby thoroughly confusing it with OpenOffice's OpenDocument format.
Another ongoing development in the XML world is Darwin Information Typing Architecture (DITA), an XML format for modular documentation. Rather than books and articles, DITA documents are divided into topics, concepts, and tasks. Map documents indicate how the different topics are rearranged to make magazine articles, Web pages, tutorials, conference presentations, man pages, and more. Different map documents can reuse the same topics to make new and different collections of documentation. Even individual paragraphs, sentences, and words can be pointed to and transcluded into output documents.
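The map-and-topic idea can be sketched in a few lines of Python. This is a toy model only: it uses plain dictionaries rather than real DITA markup, and the topic IDs, map names, and `assemble` function are all invented for illustration.

```python
from typing import Dict, List, Tuple

Topic = Tuple[str, str]  # (title, body)

# Hypothetical shared pool of topics, each written once.
TOPICS: Dict[str, Topic] = {
    "linked-list-concept": ("What is a linked list?", "A linked list is a chain of nodes."),
    "linked-list-insert": ("Inserting a node", "Point the new node at the old head."),
    "array-concept": ("What is an array?", "An array is a contiguous block of elements."),
}

# Hypothetical maps: each output document is just an ordered list of topic IDs.
MAPS: Dict[str, List[str]] = {
    "tutorial": ["array-concept", "linked-list-concept", "linked-list-insert"],
    "quick-reference": ["linked-list-insert"],
}


def assemble(topic_ids: List[str], topics: Dict[str, Topic]) -> str:
    """Build one output document by pulling the listed topics, in order, from the pool."""
    sections = []
    for topic_id in topic_ids:
        title, body = topics[topic_id]  # KeyError if a map points at a missing topic
        sections.append(f"{title}\n{body}")
    return "\n\n".join(sections)
```

Calling `assemble(MAPS["tutorial"], TOPICS)` and `assemble(MAPS["quick-reference"], TOPICS)` yields two different documents from the same source text, with the insertion topic written only once.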
For technical authors like myself, this sounds like a godsend. I don't have to keep rewriting or cutting and pasting the same content. After all, how many different ways can I explain a linked list? This is the writer's version of structured programming and the DRY principle (Don't Repeat Yourself). At least, that's the theory.
The reality might fail to meet expectations. Without mentioning DITA by name, Jon Bosak shot it down pretty conclusively in his closing keynote. As he pointed out, we've been here before. He first encountered this idea in the late 1970s and was very enamored of it for a time.
The problem is that you can't cut and paste topics freely between documents. Doing so produces frankenbooks that don't have a consistent authorial voice, target audience, or flow. One of the worst problems is that when you write a topic, you don't know what you can or can't assume the reader already understands from previous chapters, because the chapters are always moving. Thus you end up either constantly repeating all prerequisites or leaving out necessary prerequisites.
Bosak noted that this technique was invented several times before, and it hasn't worked yet. Sprinkling magic XML pixie dust on a flawed concept won't make it work now. He suspects that writers like this approach because it helps them be treated like important, exciting software developers rather than lowly, boring tech writers. (It worked for him.) That's why this bad idea keeps getting reinvented as soon as its previous failure is forgotten.
In 2006, it's hard to find a technical conference or subject that doesn't involve the Web. XML 2006 was no exception. An entire track was devoted to XML and the Web. Web 2.0 themes like mashups, Asynchronous JavaScript and XML (Ajax), and user interactivity were especially prominent; but this part of the conference ranged widely from cell phones to servers, from RSS to Atom to HTML.
Atom might be old hat for the markup geeks at this show, but the newer Atom Publishing Protocol (APP) is bleeding-edge enough to attract a lot of interest. APP may be a sleeper technology like XML was 10 years ago. Like XML, it's starting small with a simple use case. (For XML, the use case was putting SGML on the Web. For APP, it's posting blog entries.) However, like XML before it, APP is proving capable of a lot more than its creators aimed for. It could well do for application layer protocols what XML did for data formats: APP might finally let everyone use a few standard, interoperable, reliable libraries to transfer content between systems instead of rolling their own brittle unique code.
Not all the players in this space were present at this conference, but IBM was there with the Abdera library. Abdera is an Apache Incubator project that implements APP as a Java™ class library that other applications can invoke on either the client or server side. Wrap a user interface around this, and you have a blog editor. Attach a database to the backend, and you have a content management system. Abdera looks promising; but even if it doesn't pan out, numerous other developers are working on similar libraries in many languages and environments.
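As a rough illustration of the blog-posting use case, the following Python sketch builds an Atom entry and wraps it in an HTTP POST using only the standard library. It does not use Abdera or any other APP library. The collection URL, the entry ID, and the function names are placeholders. The `type=entry` media type parameter follows the protocol as it was later published in RFC 5023, so a 2006-era server might expect something different.

```python
import urllib.request
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"


def build_entry(title, author, entry_id, updated, content):
    """Serialize a minimal Atom entry (title, id, updated, author, content) to bytes."""
    ET.register_namespace("", ATOM_NS)  # emit Atom as the default namespace
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = title
    ET.SubElement(entry, f"{{{ATOM_NS}}}id").text = entry_id
    ET.SubElement(entry, f"{{{ATOM_NS}}}updated").text = updated
    author_el = ET.SubElement(entry, f"{{{ATOM_NS}}}author")
    ET.SubElement(author_el, f"{{{ATOM_NS}}}name").text = author
    body = ET.SubElement(entry, f"{{{ATOM_NS}}}content", {"type": "text"})
    body.text = content
    return ET.tostring(entry, encoding="utf-8")


def build_post(collection_url, entry_bytes):
    """Prepare (but do not send) the POST that asks a collection to create a new entry."""
    return urllib.request.Request(
        collection_url,  # placeholder: a real client discovers this URL from the server
        data=entry_bytes,
        headers={"Content-Type": "application/atom+xml;type=entry"},
        method="POST",
    )
```

Passing the prepared request to `urllib.request.urlopen` would send it. Under RFC 5023, a server that accepts the entry answers with 201 Created and a Location header naming the new resource. Editing and deleting use PUT and DELETE on that same URL, which is why the protocol is so often held up as a REST example.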
APP is also the poster child for Representational State Transfer (REST), the architecture on which Hypertext Transfer Protocol (HTTP) and the Web are based. REST wasn't specifically on the program, but it kept coming up. For instance, in a session on the Google Checkout API, Google evangelist Patrick Chanezon mentioned that only the earliest public APIs they designed were done with the WS-* stack in mind. These days, they try to design their APIs RESTfully. Maybe they implement a SOAP gateway somewhere for the developers who prefer those tools, but behind the scenes it's all REST.
I also noted that the distinction between WS-* and REST is blurring. REST seems to be winning in actual code, even if not yet in developers' minds. A lot of people are doing REST even when they don't have any idea that's what it's called. More than once, I saw "WS-*" or equivalent on a speaker's slides, only to find out in further conversation that all they were doing was sending plain old XML over HTTP -- not even using SOAP or Web Services Description Language (WSDL), much less the whole family of specs that sit on top of that. I suppose I don't care what they call their designs, as long as they're doing it the right way. Web services has always been a fuzzy term. However, going forward, developers need to be aware that unqualified terms like Web services can have very different meanings to different people.
Metadata might be another classic people problem masquerading as a technical problem. In brief, how do you get authors to create and enter reliable metadata for their content? Google has created one of the world's most effective search engines by ignoring metadata completely and focusing exclusively on the data.
The Semantic Web zealots aren't ready to give up on Resource Description Framework (RDF) yet (although topic maps were conspicuous by their absence), so several presentations focused on means of deriving RDF metadata from HTML data, relational databases, and other unannotated systems.
The most promising approach (probably because it promised the least) was a W3C effort called Gleaning Resource Descriptions from Dialects of Languages (GRDDL), presented by Harry Halpin. He's developing XSL transforms that produce RDF metadata from a variety of XML, HTML, and microformat sources. Ronald Reck and Ken Sall's effort to infer metadata from the CIA World Factbook, Wikipedia, and Project Gutenberg aimed for more but achieved less, in my opinion.
What was shocking about both these proposals, compared to earlier years, was that neither required any effort by or even cooperation from the original document authors. I wonder if this means the metadata community is finally coming to learn Google's lesson? If they can extract the implicit metadata embedded in the documents rather than ask authors to add explicit metadata outside the documents, they can get both more and better metadata.
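The flavor of this kind of extraction can be shown with a small Python sketch. GRDDL itself relies on XSL transforms that emit RDF, and the Python standard library has no XSLT processor, so a plain function that returns (subject, predicate, object) tuples stands in here. The function name and the choice of Dublin Core `title` and `creator` as predicates are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"
DC = "http://purl.org/dc/elements/1.1/"


def glean(page_uri, xhtml_text):
    """Derive simple triples from markup the author already wrote, with no extra annotation."""
    root = ET.fromstring(xhtml_text)
    triples = []
    # The page title is metadata hiding in plain sight.
    title = root.find(f"{XHTML}head/{XHTML}title")
    if title is not None and title.text:
        triples.append((page_uri, DC + "title", title.text.strip()))
    # So is a conventional <meta name="author"> element, if one is present.
    for meta in root.iter(XHTML + "meta"):
        if meta.get("name", "").lower() == "author" and meta.get("content"):
            triples.append((page_uri, DC + "creator", meta.get("content")))
    return triples
```

Nothing here asks the author to do anything new. The function only reads structure that is already in the document.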
XML 2006 was one of the most exciting and active conferences I've been to in several years. Every time slot had at least two and often three or four presentations I wanted to see. Hallway conversations and the exhibit floor were equally active. Even after the official end of the conference for the day, a lot of activity continued in the hotel lobby and bar. (Cold Boston weather may have contributed to this.)
Looking forward to 2007, I think XQuery and native XML databases will be very, very hot. Many large publishing businesses have already successfully implemented XQuery systems to excellent effect. Smaller publishers might want to wait until simpler, cheaper, possibly open source solutions become available; but they might not have to wait long. The biggest rumor at the conference was that principals from one of the largest pure XML database players and one of the largest hybrid XML database players have joined forces to create a new open source XQuery engine that will be competitive with the big boys.
Beyond XQuery, technologies to watch include GRDDL, APP, and XProc. Also pay attention to anything called Web 2.0. Yes, it's hype; and yes, if you asked four speakers at this conference how they defined Web 2.0, you got six different answers; but there's a lot of reality behind the hype. Finally, if you previously looked at technologies like XSL-FO or XForms and rejected them because the implementations weren't robust or ready for prime time, take another look. A lot of bugs have been shaken out, and people have started to build impressive systems on top of some formerly shaky foundations.
The first 10 years of XML were only the beginning. The future is looking very bright indeed.
Learn
- The official XML 2006 conference Web site: Find abstracts for all the talks and soon-to-post finished papers.
- Microsoft's Mike Champion blog: Dig into his reports of several interesting talks about application development languages such as XQueryP, Linq, and XLinq that I missed while attending other tracks.
- O'Reilly's Keith Fahlgren reports: Find highlights for several more interesting talks I missed while attending other tracks. (Why does everything run in parallel? I'm not a multiprocessor!)
- Conference photos: View pictures that several people posted to Flickr.
- XTech 2007: Visit Paris next May for the next major XML conference and the European equivalent of XML 2006.
- PureXML from IBM RedBooks: I got a nice book called PureXML about XML support in DB2 9 from the IBM booth. Stupidly, I left it behind in my hotel room. Fortunately, I found it available online.

Elliotte Rusty Harold is originally from New Orleans, to which he returns periodically in search of a decent bowl of gumbo. However, he resides in the Prospect Heights neighborhood of Brooklyn with his wife Beth, their dog Shayna and cats Charm and Marjorie. He's an adjunct professor of computer science at Polytechnic University, where he teaches Java and object-oriented programming. His Cafe au Lait Web site has become one of the most popular independent Java sites on the Internet, and his spin-off site, Cafe con Leche, has become one of the most popular XML sites. His most recent book is Java I/O, 2nd edition. He's currently working on the XOM API for processing XML, the Jaxen XPath engine, and the Jester test coverage tool. He'll be talking about Web Forms 2.0 at the SD West 2007 conference in Santa Clara in March.