The wheels of progress turn slowly, but turn they do. The crystal ball might be a little hazy, but the outline of XML's future is becoming clear. The exact time line is a tad uncertain, but where XML is going isn't. XML's future lies with the Web, and more specifically with Web publishing.
It seems a little funny to have to say that. After all, isn't publishing what the Web is about? The Web was designed first and foremost as a mechanism to publish information. What else can it do? Quite a lot. The last three years have seen an explosion of interest in Web applications that go far beyond traditional Web sites. Word processors, spreadsheets, games, diagramming tools, and more are all migrating into the browser. This trend will only accelerate in the coming year as local storage in Web browsers makes it increasingly possible to work offline. But XML is still firmly grounded in Web 1.0 publishing, and that's still very important.
Several dreams are coming true this year. Sun's dream of network deployed applications is happening now, although shockingly the language of choice for these applications is JavaScript™, not Java™. This is a missed opportunity of the first order: Sun could have delivered this 10 years ago, but sadly it never had the experience, vision, or interest in the client to make it happen; now Sun is playing a desperate (and doomed) game of catch-up.
Netscape's dream of replacing the operating system with a browser is also coming true this year. Netscape had the vision to see this coming. Unfortunately, it didn't have the business savvy or artistic taste necessary to pull it off. Nonetheless, Firefox and the Mozilla Foundation, both direct descendants of Netscape, are key players in bringing about this bold new world.
For Microsoft®, the nightmare of a younger, nimbler competitor overtaking them is also coming true. The company was so distracted by Sun and Netscape that it failed to notice Google sneaking up on Office and Windows. Gmail, Google Docs, and similar applications from a variety of sources are rapidly rendering the underlying operating system irrelevant.
Sure, you still need an operating system to run a browser, but increasingly no one will care which operating system it is, any more than anyone in the last decade cared who manufactured their PC, as long as it ran Microsoft Windows®. Now no one will care who manufactures their operating system as long as it runs Google. Operating systems are being commodified, just as PCs were. The Windows monarch hasn't been defeated so much as abandoned, leaving Microsoft guarding the gates to an empty castle.
What does XML have to do with this?
For a much-hyped technology, XML has had little to do with this situation. Although the rebels sail under the Asynchronous JavaScript + XML (Ajax) banner, and although the x in Ajax stands for XML, no one uses XML much for any of this. Almost as soon as the acronym was coined, Web developers began replacing XML with raw JavaScript code and passing it around as data, and then executing it with eval(), security issues be damned.
The problem is one of APIs, not data formats. More specifically, it's a problem with one API: Document Object Model (DOM). Most developers learn DOM first and then never learn any of the alternatives. They don't distinguish between DOM and XML, and thus they confuse their well-founded disgust with DOM with an unfounded disgust with XML. DOM isn't a least-common-denominator API: it's a worst-common-denominator API. You couldn't design a worse API for processing XML if you tried. But developers are extremely resistant to learning new things. Outside the Java community, where JDOM and dom4j have made some progress, better alternatives like E4X and the Amara XML Toolkit remain almost unknown and are actively resisted. The genius of JavaScript Object Notation (JSON) was also its biggest weakness. Because JSON is executable JavaScript code, it doesn't require JavaScript programmers to learn anything new to use it. A more secure data-transfer format wouldn't have been accepted.
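To make the API complaint concrete, here is a minimal Python sketch that pulls the same data out of the same document twice: once through a DOM interface and once through a tree API shaped around the host language. ElementTree is not one of the alternatives named above; it stands in here only because it ships with Python. The sample document and both function names are invented for illustration.

```python
from xml.dom import minidom
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for this comparison.
XML = (
    "<teams>"
    "<team city='Boston'>Red Sox</team>"
    "<team city='New York'>Yankees</team>"
    "</teams>"
)

def teams_with_dom(text):
    """Collect {city: name} by walking DOM nodes by hand."""
    doc = minidom.parseString(text)
    result = {}
    for element in doc.getElementsByTagName("team"):
        # In DOM, text lives in separate child nodes and must be gathered explicitly.
        name = "".join(
            node.data
            for node in element.childNodes
            if node.nodeType == node.TEXT_NODE
        )
        result[element.getAttribute("city")] = name
    return result

def teams_with_elementtree(text):
    """Collect the same mapping with an API built around native Python idioms."""
    root = ET.fromstring(text)
    return {team.get("city"): team.text for team in root.iter("team")}
```

Both functions return the same dictionary. The difference is how much node bookkeeping stands between the programmer and the data.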
DOM is a millstone around XML's neck. It's the single biggest impediment to broader XML adoption in software development. XML has gone as far as it can in programming while dragging this 2,000-pound boat anchor behind it. Unless the World Wide Web Consortium (W3C) and browser vendors deprecate DOM and replace it with a sane alternative (preferably several sane alternatives: trying to do everything with one API is a large part of why DOM is as bad as it is), XML has run its course in software development—especially Web software development (and increasingly, that's the only kind of software development that matters). The W3C should address the needs of working developers and deprecate a bad spec when required.
Is XML dead? No, I believe that XML has a bright and important future. It just isn't a future that has much if anything to do with either classic or Web software development. To understand where XML is moving in 2008 and beyond, you have to first look back to 1997 and even earlier to find the origins of XML.
You have to understand that XML was never meant to be used in software development, at least not in the early days. None of the early specs focused on the needs of software developers: not XML 1.0, XPath, Extensible Stylesheet Language Transformations (XSLT), Namespaces in XML, Extensible Hypertext Markup Language (XHTML), or DOM. If XML had been designed for software development, it would have supported lists and maps and data types as JSON eventually did. XML was instead designed for publishing, and more specifically for publishing Web pages.
XML was an outgrowth of a 20-year-older technology known as SGML. At roughly the same time Codd was at IBM® figuring out how to structure data by shredding it into tiny little unordered pieces, Charles F. Goldfarb, Edward Mosher, and Raymond Lorie were also at IBM figuring out how to structure large ordered documents that would never make sense as tables. Codd was thinking about business data like inventories and financial records. Goldfarb, Mosher, and Lorie were thinking about business documents like annual reports and airplane technical manuals.
SGML was intended to solve publishing problems: how do you write, maintain, update, print, search, and read documents that may run to tens of thousands of pages across a variety of platforms with different processors, character sets, natural languages, operating systems, and vendors? SGML achieved some success with organizations in the government and military sectors that had these needs, and a few technical publishers like O'Reilly made occasional use of it; but overall it was too large and complex for most people's needs—even people in the publishing industry.
SGML's biggest success was also its biggest failure: HTML. HTML was intended to be an SGML application, but almost none of the people who wrote browsers, editors, or Web pages knew anything about SGML beyond what the acronym stood for. (Many didn't even know that.) Extensions were introduced willy-nilly that rapidly degraded any claim HTML had to SGML conformance. Even the few and expensive SGML tools that then existed couldn't process the miasma of real-world HTML on the Web circa 1996.
This was the situation XML was invented to rectify. On the one hand, it was supposed to simplify SGML down to one reasonable, standard subset everyone could agree on and faithfully implement. The hope was that this simpler specification could achieve the broader adoption that had eluded SGML. In this, XML mostly succeeded.
On the other hand, XML was meant to lay the groundwork for a well-formed Web with fewer annoying cross-browser incompatibilities and idiosyncrasies. In this, XML mostly failed. XML and XHTML just introduced yet another dialect of HTML that browsers would have to handle, without even coming close to replacing tag soup.
Success or failure, XML was intended for publishing: books, manuals, and—most important—Web pages. XML wasn't optimized or planned for use in software development outside of publishing. Its use for config files, remote procedure calls, object serialization, database dumps, and similar developer-oriented tasks wasn't anticipated or planned for. Therefore it should come as little surprise that XML isn't always a perfect fit for these chores. Nonetheless, XML did offer developers something they never had before: a platform-independent, language-agnostic, internationally-savvy data format with numerous high-quality, free parsers easily available. The combination was irresistible enough that programmers could overlook the lack of data types like int and float and basic data structures like lists and maps.
But because this isn't what XML was meant to do, it isn't how you should judge XML or where you should look for its greatest strengths and future prospects. To find those, you have to return to the field XML was designed for: publishing, especially Web publishing. Publishing on the Web has three pieces:
- The author
- The publisher
- The reader
The reader piece is done. That's the browser. All major browsers now support XML. But the writing and publishing pieces and the connection between them are just getting started.
Before a document can be published, it has to be authored, and here the fight is over. XML has won. All major office suites now save their documents in zipped XML by default. These include Microsoft Office, OpenOffice, StarOffice, WordPerfect Office, and Lotus® Notes®. Even graphic applications like Adobe® Illustrator® can now save documents as XML. The most notable hold-out is Apple's iWork, but look for it to join the parade in the new year.
The notable change this represents hasn't been fully appreciated, mostly because the XML is hidden from the user, as it should be. A typical spreadsheet user has no reason to know or care exactly how Excel® lays out its persistent data on disk. However, the textual XML structure makes it much easier for third parties to reverse-engineer the format and interoperate with the application. Even when the XML vocabulary is as poorly documented and incoherent as Office Open XML, it's still about a thousand times easier to work with than previous opaque binary formats. If the format is somewhat more sensible and less platform tied and legacy bound (like, for instance, the OpenDocument format), then it's even easier to work with.
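As a rough illustration of why zipped XML is so much friendlier to third parties than a binary blob, the following Python sketch builds a tiny archive in memory and reads it back with nothing but the standard library. The toy `<doc>` and `<p>` vocabulary is invented for the example, and both function names are hypothetical; real office formats use much larger, namespaced vocabularies, but the unzip-then-parse pattern is the same.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

def build_sample_package():
    """Build a tiny zip archive in memory that mimics an office file's layout.

    The member name and the <doc>/<p> vocabulary are assumptions made for
    this sketch, not any real office format.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(
            "content.xml",
            "<doc><p>Quarterly report</p><p>Revenue rose.</p></doc>",
        )
    return buffer.getvalue()

def read_paragraphs(package_bytes, member="content.xml"):
    """Unzip one member of the package and return the text of each <p>."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as archive:
        root = ET.fromstring(archive.read(member))
    return [paragraph.text for paragraph in root.iter("p")]
```

Calling `read_paragraphs(build_sample_package())` yields the two paragraph strings. No vendor SDK or reverse-engineered byte layout is involved.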
In 2008 we will still see a lot of shouting and hollering over which XML vocabulary to use for office documents, and not a few polemics on both sides. I suspect Microsoft will lose its efforts to have OOXML declared an ISO standard in February, but I'm not certain of that. Either way, the writing on the wall is clear. Microsoft Office will continue to lose market share to OpenOffice, iWork, and other competitors.
That's not because Office is a bad product (it isn't) or because it's closed source (it is) but simply because it has nowhere to go but down. There's no room left for it to grow. The only question is how far it will fall and how fast. The most immediate threat to the Microsoft hegemony is that more governments will follow the lead of the Netherlands and Norway and mandate OpenDocument and/or open source solutions. Less troubling to Microsoft, but still a concern, is the increasing proliferation of PCs at such low prices that Windows and Office become 50% or more of the total system cost. To some extent, Microsoft has already begun to recognize this with low-cost Student and Teacher editions of Office that are easily available to anyone who teaches, has a student in their family, attended grammar school some time within the last century, or once held a door on the A train for someone they thought might be a teacher. However, in a classic case of the left hand not knowing what the right hand is doing, Microsoft is also ramping up piracy protection in Office. Although this will likely decrease casual piracy, it will also drive more users to open source alternatives.
Now that all our office documents are in XML, it will become incredibly easy to transform them from one format to another. 2008 is the tenth anniversary of perhaps the single most game-changing technology in the XML family: XSLT. In its spanking new 2.0 version, it's even more powerful. Fairly soon, conversions between the competing formats will become so straightforward that most people will stop caring which format they're using. A few styles and macros might get munged on the transition, but these features aren't used in most documents anyway. Vendor lock-in from document formats becomes a far smaller problem.
Of course, the most important conversion isn't from OpenDocument to OOXML or vice versa: it's a down conversion from either OpenDocument or OOXML to XHTML. The HTML exporters in OpenOffice and Microsoft Office are uniformly atrocious. Look for third-party developers to pick up the slack. Most important, look for individual corporate developers and webmasters to begin publishing custom templates for their sites. This will enable regular folks to write in Microsoft Word as they're accustomed to doing and then upload their musings straight into the local content-management system. Editing and reviewing tools can be built right in. Because machines generate all the markup (the humans see the GUI they're used to), well-formedness will be a freebie. The majority of the Web won't be well-formed by the end of 2008, but a larger percentage will be than today.
XSLT and XML office formats will also bring a lot of hidden data out into the open. Numerous business documents have languished unread in file systems for the last decade or more. Most of them are doubtless irrelevant today, but some of them contain important information that's been forgotten because no one can search it. Corporate developers will extract and repurpose information from existing Office documents, first by automating conversion to newer XML-based formats, and then using XSLT and XQuery to make the data findable.
If the authoring tool will be a traditional office program such as Word or OpenOffice Writer, what will be on the server to hold this? And how will you move the content from the client to the server? This is where two of the most significant 1.0 releases from 2007 come into play: the Atom Publishing Protocol (APP) and XQuery.
Traditionally, you see two hard problems in training non-techies to write for the Web: teaching them semantic markup and showing them how to use FTP. (Remember, many nontechnical users can't even use the standard File Open dialog box. They store everything in the My Documents folder or on the desktop. They're lost if they accidentally put a file somewhere else. Programmers understand hierarchies, but many users don't think that abstractly.)
XML-enabled word processors like OpenOffice and Microsoft Word solve the first problem. The Atom Publishing Protocol solves the second. APP will do for Web authoring what HTTP did for Web browsing: provide a standard protocol that a variety of independent clients and servers can use to communicate without prior agreement or a shared conceptual model.
Independent software vendors can write their own authoring tools that talk to APP services on the different servers. These can be custom editors or plug-ins for products like Word and OpenOffice. Uploading content will be as simple as saving a file on the local hard drive is today, and in some cases simpler. Creating a new document will simply require a URL to POST to and a username and password. (Wiki-like sites might not need even the username and password.) To edit a document, you will need nothing more than the URL of that document.
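To give a feel for how little a client needs, here is a minimal Python sketch that prepares (but does not send) the request for creating a new document: an Atom entry POSTed to a collection URL. The URL in the usage note, the function names, and the sample text are all hypothetical. Authentication is left out, and a complete Atom entry would also carry `id`, `updated`, and `author` elements, which this sketch omits for brevity.

```python
import urllib.request
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"

def build_entry(title, body):
    """Serialize a bare-bones Atom entry (id, updated, and author omitted)."""
    ET.register_namespace("", ATOM_NS)  # emit Atom as the default namespace
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = title
    content = ET.SubElement(entry, f"{{{ATOM_NS}}}content", {"type": "text"})
    content.text = body
    return ET.tostring(entry, encoding="utf-8")

def build_create_request(collection_url, title, body):
    """Prepare the POST that asks a collection to create a new entry.

    Nothing is sent here; pass the result to urllib.request.urlopen to
    actually talk to a server.
    """
    return urllib.request.Request(
        collection_url,
        data=build_entry(title, body),
        method="POST",
        headers={"Content-Type": "application/atom+xml;type=entry"},
    )
```

For example, `build_create_request("https://example.org/app/posts", "Hello", "First post")` returns a ready-to-send request object. Editing an existing document would follow the same shape, with a PUT to the URL of that document instead of a POST to the collection.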
So now you know how you'll write XML in 2008 (Word or OpenOffice), and you know how you'll send it to the server (APP). The last question is where to put all this wonderful XML.
Traditionally, this question has had two answers. The first is to save the XML in a file system. The second is to stuff it in a Binary Large Object (BLOB) in a relational database. Both are kludges, and neither performs very well for Web sites.
File systems are simple, reliable, well-understood technology, but they have poor to atrocious search capabilities. They tend to duplicate the same data in a dozen different places, they have no comprehension of the internal structure of the documents they hold, and they can't provide subdocument granularity.
Relational databases are great stores for small chunks of data that don't have any particular order and don't repeat themselves a lot. But neither characteristic is true of XML. Although any relational table can be easily transformed into an XML document (just make each row an element and each field an attribute or child element), the reverse isn't true. Many XML documents possess significant order, relevant white space, mixed content, repetitive but distinct content, unpredictable nested elements, and other features that make relational tables an unsuitable data store. In particular, Web pages, including HTML and XHTML pages, possess all these characteristics. Certainly you can store HTML and XML documents in a relational database—MySQL seems to be used for little else—but the results are neither pretty nor fast. If you shred the documents into many small pieces, you can search, select, and reorder the data. However, you end up spending a lot of money to try to eke out acceptable performance. Maintenance and development also become a nightmare as your SQL queries grow to half a page or more of complex logic to accomplish relatively simple things that SQL was never meant to do.
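The easy direction, table to XML, can be sketched in a few lines of Python. The in-memory SQLite database, the `inventory` table, and the function names are assumptions made for this illustration. Each row becomes a `<row>` element and each field a child element named after its column.

```python
import sqlite3
import xml.etree.ElementTree as ET

def table_to_xml(connection, table):
    """Turn every row of a table into a <row> element with one child per column.

    The table name is interpolated into the SQL, so pass only trusted names.
    Column names are assumed to be valid XML element names.
    """
    cursor = connection.execute(f"SELECT * FROM {table} ORDER BY rowid")
    columns = [description[0] for description in cursor.description]
    root = ET.Element(table)
    for row in cursor:
        record = ET.SubElement(root, "row")
        for column, value in zip(columns, row):
            # NULL becomes an empty element; everything else is stringified.
            ET.SubElement(record, column).text = "" if value is None else str(value)
    return root

def demo():
    """Build a two-row sample table (hypothetical data) and convert it."""
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE inventory (sku TEXT, quantity INTEGER)")
    connection.executemany(
        "INSERT INTO inventory VALUES (?, ?)",
        [("A-100", 12), ("B-200", 3)],
    )
    return table_to_xml(connection, "inventory")
```

Nothing comparably mechanical exists for the reverse trip. A document with mixed content or significant ordering has no obvious set of columns to land in.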
The alternative approach is to store each document in a BLOB. This is fast and simple. However, you lose all ability to select only part of the data or combine different documents. Application logic is farmed out to PHP, Ruby, or some other server-side framework. You use the database as little more than an alternative file system with slightly better isolation characteristics.
What we need is a database designed to work with the hierarchical structures of typical Web documents rather than cutting across them. For the first time, such databases now exist at multiple scales, they're stable, and they're ready to use. On the low end, eXist and Berkeley DB XML are looking better and better. On the high end, expensive big-iron XML databases like Mark Logic will continue to convert big publishers who can afford the cost of entry. Hybrid solutions like IBM DB2® 9 pureXML™ will drive XQuery adoption among customers who need to mix documents with tabular data.
Compared to earlier products like these, the new breed are more stable, more scalable, and more reliable. Most important, they now share a standard language, XQuery 1.0, finally released after years of development. The likelihood of porting applications from one database to another is usually overstated; but the necessity of porting developers' skills from one server to the next is equally understated. No one wants to learn six different dialects of a query language and then relearn it every six months. Now they don't have to.
The update part of XQuery is marching forward rapidly. It probably won't be finished in 2008, but it's already solid enough to be implemented, as long as users don't mind modifying their code a tad with each new draft. The situation will improve throughout the year.
Finally, 2008 should see the release of javax.xml.xquery. This is a standard API for connecting Java programs to XQuery engines and databases. Think of it as JDBC for XQuery. It enables you to mix XQuery into your Java code. This may not become a standard part of the Java class library until the release of Java 7 in 2009, but it's already supported by some products, with more to come.
XQuery is finally ready for production, and APP is ready to break out. If I were looking to invest money or time in XML, these are the technologies I'd focus on. The world might not need yet another content-management system, blog engine, or bulletin board; but it absolutely could use each of these if they stored and searched their content with a native XML database and published to it with APP.
I'd also like to make a plea for one more developer tool. I know I said that XML in software development was dead, but maybe a spark of life remains. The lesson of JSON is that many applications don't need or want the flexibly structured data that XML enables. Suppose a basic XML format for encoding lists and maps could contain typed data. Maybe something as simple as Listing 1.
Listing 1. A basic data format for XML
<data xmlns="http://www.w3.org/data">
  <list>
    <string>Foo</string>
    <int>17</int>
    <year>1999</year>
  </list>
  <map>
    <entry>
      <string>Boston</string><string>Red Sox</string>
    </entry>
    <entry>
      <string>New York</string><string>Yankees</string>
    </entry>
  </map>
</data>
Then, suppose you wrote a few straightforward libraries in Java, JavaScript, Python, Perl, Ruby, and other languages that could parse these documents into the native data structures of their languages. Is it possible to reinvent JSON but this time without the security problems and with the enhanced flexibility XML enables? Is there a hacker in the house?
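As a first stab at what such a library might look like, here is a small Python sketch that turns a well-formed document in this format into native lists, dictionaries, strings, and integers. Several things are my assumptions rather than part of any spec: that each `<entry>` holds exactly one key followed by one value, that `<year>` maps to a plain integer, and the function names themselves.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<data xmlns="http://www.w3.org/data">
  <list>
    <string>Foo</string>
    <int>17</int>
    <year>1999</year>
  </list>
  <map>
    <entry><string>Boston</string><string>Red Sox</string></entry>
    <entry><string>New York</string><string>Yankees</string></entry>
  </map>
</data>"""

def convert(element):
    """Map one element of the format onto a native Python value."""
    kind = element.tag.rsplit("}", 1)[-1]  # drop the "{namespace}" prefix
    if kind == "string":
        return element.text or ""
    if kind in ("int", "year"):
        # Assumption: a year is represented as a plain integer.
        return int(element.text)
    if kind == "list":
        return [convert(child) for child in element]
    if kind == "map":
        result = {}
        for entry in element:
            # Assumption: each <entry> holds exactly a key and then a value.
            key, value = entry
            result[convert(key)] = convert(value)
        return result
    raise ValueError(f"unsupported element: {element.tag}")

def parse_data(text):
    """Parse a <data> document into a list of its top-level values."""
    root = ET.fromstring(text)
    return [convert(child) for child in root]
```

`parse_data(SAMPLE)` gives back a list holding one Python list and one dictionary. Unlike eval(), the parser only builds data and never runs the payload as code, although untrusted XML still deserves a parser that has been hardened against entity-expansion tricks.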

Elliotte Rusty Harold is originally from New Orleans, to which he returns periodically in search of a decent bowl of gumbo. However, he resides in the Prospect Heights neighborhood of Brooklyn with his wife Beth and cats Charm (named after the quark) and Marjorie (named after his mother-in-law). He's an adjunct professor of computer science at Polytechnic University, where he teaches Java and object-oriented programming. His Cafe au Lait Web site has become one of the most popular independent Java sites on the Internet, and his spin-off site, Cafe con Leche, has become one of the most popular XML sites. His next book, Refactoring HTML, will be published by Addison Wesley this spring.