
XML 2007

Year in review

Elliotte Rusty Harold (elharo@metalab.unc.edu), Adjunct Professor, Polytechnic University
Elliotte Rusty Harold is originally from New Orleans, to which he returns periodically in search of a decent bowl of gumbo. However, he resides in the Prospect Heights neighborhood of Brooklyn with his wife Beth and cats Charm (named after the quark) and Marjorie (named after his mother-in-law). He's an adjunct professor of computer science at Polytechnic University, where he teaches Java and object-oriented programming. His Cafe au Lait Web site has become one of the most popular independent Java sites on the Internet, and his spin-off site, Cafe con Leche, has become one of the most popular XML sites. His most recent book is Java I/O, 2nd edition. He's currently working on the XOM API for processing XML, the Jaxen XPath engine, and the Jester test coverage tool.

Summary:  Join Elliotte Rusty Harold for a look back at the most significant XML news from 2007.

Date:  31 Dec 2007
Level:  Introductory

2007 was another slow year for XML. However, several important specifications did reach 1.0, and XML continued to gain traction in publishing, both Web and traditional. Most important, the slow leak in the Web services ship caused by its collision with the REST iceberg turned into a gusher, and the whole vessel began to sink beneath the waves. The tip of this Titanic-sinking iceberg was POX, plain old XML documents sent over standard HTTP without any schemas or specs to get in anyone's way. (Some people saw the iceberg approaching years ago, but as Roy Fielding said, "the industry still insisted on proving that for themselves.")

REST isn't the only technology that hides 90% of its power below the surface. The full power of XML has yet to be explored. The Atom Publishing Protocol (APP) and XQuery both reached 1.0 this year, and the impact of both is just beginning to be felt.

Despite having survived numerous challenges from pretenders to the throne over the last 10 years (YAML, SML, S-expressions, and other also-rans), XML saw its most serious challenge yet in 2007 with the increasing popularity of JSON. JSON usage hasn't yet peaked and seems likely to continue to increase in the next year despite JSON's limited applicability, security problems, and poorly designed application programming interfaces (APIs).

January

XQuery's been a "next year" technology for half a decade now, but in January the promise was finally realized with the official publication of XQuery 1.0. Several pure XML databases have already implemented it, including Mark Logic, eXist, Sedna, and Berkeley DB XML. You can also find it in hybrid databases including IBM® DB2® 9 and Oracle 10g. XQuery is also available in some stand-alone products including Michael Kay's Saxon and DataDirect XQuery.

But the job isn't finished. XQuery is only a quarter of the solution. In CRUD terms, XQuery is Read without Create, Update, or Delete. These necessary features have to be filled in with proprietary extensions. This means you can't easily move an application or database from one implementation to the next. More importantly, you can't teach developers a standard syntax, and you can't easily hire an experienced Mark Logic programmer to work on a DB2 9 application. (Programs rarely shift between platforms, but developers often do.) XQuery needs an update (and create and delete) facility. The World Wide Web Consortium (W3C) issued a last-call draft of XQuery Update Facility 1.0 in August, and vendors are starting to implement it. If the XML community is lucky, 2008 will finally see this last critical piece exit last call and move on to final release.
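The read-only gap can be illustrated by analogy, without any XQuery engine at all. The sketch below uses Python's standard ElementTree module, whose path expressions (an XPath subset) can only select nodes; creating, updating, and deleting all fall back on the host API. This is an analogy rather than XQuery itself, and the catalog document and its element names are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, made up for this sketch.
catalog = ET.fromstring(
    "<catalog>"
    "<book id='b1'><title>First</title><year>2006</year></book>"
    "<book id='b2'><title>Second</title><year>2002</year></book>"
    "</catalog>"
)

# Read: expressible in the path language itself.
titles = [t.text for t in catalog.findall("./book/title")]

# Create: the path language has no syntax for it, so the API takes over.
new_book = ET.SubElement(catalog, "book", {"id": "b3"})
ET.SubElement(new_book, "title").text = "Third"

# Update: the path finds the node, but the change is an API call.
catalog.find("./book[@id='b1']/year").text = "2007"

# Delete: likewise specific to this one library.
catalog.remove(catalog.find("./book[@id='b2']"))

remaining_ids = [b.get("id") for b in catalog.findall("./book")]
```

Only the first step would carry over unchanged to another library. The other three are tied to this particular API, which is the portability problem in miniature.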

The XQuery working group also published the first batch of XQuery 1.1 requirements this year. Some of the most significant possible new features include exception handling, extension functions, function pointers, and/or lambda expressions. With any luck, these will take only another five or six years to finish. However, it's better to get started now before XQuery adoption really takes off and any spec changes become multidecade endeavors as was the case for SQL, Fortran, and C.

Although XQuery is a fairly complete programming language (it can completely replace PHP, static HTML, and essentially any other Web framework), you'll probably need to integrate it with programs written in traditional languages for the foreseeable future. Thus it's good that the XQuery API for Java (XQJ) advanced to proposed final draft in the Java Community Process this year. Think of this as JDBC for XQuery. XQJ is already supported by Saxon 9 and DataDirect XQuery 3.0, and more vendors are likely to follow once the finished spec is released next year.


February

In this shortest and coldest month of the year, everyone stayed home and not a lot happened. There were new releases of Saxon, TagSoup, and WebCGM. (What's WebCGM, you ask? Exactly.)

We did get the last-call working draft of XForms 1.1 this month, which added some critical new features including support for PUT and DELETE submissions. But the 10 remaining months in the year weren't quite long enough to push XForms forward to final recommendation. Candidate recommendation in November was as far as XForms 1.1 got. Maybe next year.

Most significant XForms vendors released updated versions of their products at various points throughout the year, including FormsPlayer, Chiba, Orbeon, and the Mozilla XForms extension. Unfortunately, the holy grail of XForms support built right into a major browser remained elusive.


March

The W3C was formed more than a decade ago as a direct response to the failure of the Internet Engineering Task Force (IETF) HTML 2.0 effort. Given that history, it's a little shocking to realize that by 2006 the W3C had effectively abandoned HTML. Through the millennium to date, they were so focused on XML and the Semantic Web that they pretty much forgot about the one technology they were formed to support. Thus, in 2004 some Web designers and browser vendors got together as the Web Hypertext Application Technology Working Group (WhatWG), picked up the ball the W3C had dropped, and started to run away with it.

It took a couple of years, but in March the W3C finally noticed they were about to be scored on. They rechartered their own HTML working group and started playing catch-up. The two teams mostly agreed to share the ball going forward, but the WhatWG is still playing quarterback and driving the ball down the field.

Despite the brouhaha, not much running HTML 5 code was produced in 2007. Specification development focused on Web video, SQL APIs, and parsing arcana. Whether any of this will ever be implemented in Web browsers that normal people use remains an open question. Personally, I question the wisdom of spending months developing an in-browser SQL API precisely at the time native XML databases are warming up off field (and I promise that's my last football metaphor in this article, or you can take away my referee's whistle).


April

Despite early hype that XML would bring semantics to the Web and enable browsers to understand what they were displaying, XML was really always about syntax, not semantics. The entire XML 1.0 specification has only two marginally semantic attributes: xml:space and xml:lang (and I'm not sure about xml:space). Meaning, for the most part, comes from the application that processes the XML document, not the document itself. That's largely been true of subsequent specs in the XML family, including namespaces, the XML Infoset, XSLT, and XPath.

But in April, the W3C expanded on the xml:lang attribute in a big way by releasing the Internationalization Tag Set (ITS) 1.0. This recommendation defines standard attributes for identifying directionality, translatability, ruby text, and other common aspects of document localization and internationalization that can be shared across many different vocabularies. For example, in the DocBook article shown in Listing 1, the its:translate attribute indicates that the author element shouldn't be translated, and the its:dir attribute says that the whole document uses left-to-right text.

Listing 1. A DocBook article with extra ITS markup

<dbk:article
      xmlns:its="http://www.w3.org/2005/11/its"
      xmlns:dbk="http://docbook.org/ns/docbook"
      its:version="1.0" version="5.0" xml:lang="en"
      its:dir="ltr">
  <dbk:info>
    <dbk:title>Fun with XML</dbk:title>
    <dbk:author its:translate="no">
      <dbk:personname>
        <dbk:firstname>Elliotte</dbk:firstname>
        <dbk:surname>Harold</dbk:surname>
      </dbk:personname>
    </dbk:author>
  </dbk:info>
  <dbk:para>XML rocks!</dbk:para>
</dbk:article>

This spec didn't get a lot of attention, but it's useful for anyone who publishes in a multilingual environment (and these days, that's nearly everyone).
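To make the ITS attributes concrete, here is a minimal Python sketch of how a translation tool might consume them. It assumes that its:translate defaults to "yes" and is inherited by child elements, and it handles only the local attributes shown in the DocBook example, not the global its:rules mechanism. The function name translatable_text is made up for this example.

```python
import xml.etree.ElementTree as ET

ITS = "http://www.w3.org/2005/11/its"

ARTICLE = """<dbk:article
      xmlns:its="http://www.w3.org/2005/11/its"
      xmlns:dbk="http://docbook.org/ns/docbook"
      its:version="1.0" version="5.0" xml:lang="en"
      its:dir="ltr">
  <dbk:info>
    <dbk:title>Fun with XML</dbk:title>
    <dbk:author its:translate="no">
      <dbk:personname>
        <dbk:firstname>Elliotte</dbk:firstname>
        <dbk:surname>Harold</dbk:surname>
      </dbk:personname>
    </dbk:author>
  </dbk:info>
  <dbk:para>XML rocks!</dbk:para>
</dbk:article>"""


def translatable_text(element, inherited=True):
    """Yield the text runs a translator should see (hypothetical helper)."""
    flag = element.get(f"{{{ITS}}}translate")
    # A local its:translate overrides the value inherited from the parent.
    translate = inherited if flag is None else (flag == "yes")
    if translate and element.text and element.text.strip():
        yield element.text.strip()
    for child in element:
        yield from translatable_text(child, translate)
        # Tail text follows the child but belongs to this element.
        if translate and child.tail and child.tail.strip():
            yield child.tail.strip()


root = ET.fromstring(ARTICLE)
to_translate = list(translatable_text(root))
direction = root.get(f"{{{ITS}}}dir")
```

Here to_translate holds the title and the paragraph but neither part of the author's name, and direction is "ltr".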

In April, the W3C Internationalization Activity also posted the finished version of Internationalization Best Practices: Specifying Language in XHTML & HTML Content. This advice is summarized in 16 "best practices" that I extract here:

  • Best Practice 1: Always declare the default language for text in the page using attributes on the html tag, unless the document contains content aimed at speakers of more than one language.
  • Best Practice 2: Where a document contains content aimed at speakers of more than one language, decide whether you want to declare one language in the html tag, or leave the languages undefined until later.
  • Best Practice 3: Where a document contains content aimed at speakers of more than one language, try to divide the document linguistically at the highest possible level, and declare the appropriate language for each of those divisions.
  • Best Practice 4: Use the lang and/or xml:lang attributes around text to indicate any changes in language.
  • Best Practice 5: For HTML use the lang attribute only, for XHTML 1.0 served as text/html use the lang and xml:lang attributes, and for XHTML served as XML use the xml:lang attribute only.
  • Best Practice 6: Use language attributes rather than HTTP or meta elements to declare the default language for text processing.
  • Best Practice 7: Do not declare the default language of a document in the body element, use the html element.
  • Best Practice 8: If the text in attribute values and element content is in different languages, consider using a nested approach.
  • Best Practice 9: Consider using a Content-Language declaration in the HTTP header or a meta tag to declare metadata about the languages of the intended audience of a document.
  • Best Practice 10: Where a document contains content aimed at speakers of more than one language, use Content-Language with a comma-separated list of language tags.
  • Best Practice 11: Follow the guidelines in the IETF's BCP 47 for language attribute values.
  • Best Practice 12: Use the shortest possible language tag values.
  • Best Practice 13: Where possible, use the codes zh-Hans and zh-Hant to refer to Simplified and Traditional Chinese, respectively.
  • Best Practice 14: When pointing to a resource in another language, consider the pros and cons of indicating the language of the target document.
  • Best Practice 15: If you want to indicate that the target document of an a element is in another language, consider the pros and cons of using hreflang with CSS.
  • Best Practice 16: Do not use flag icons to indicate languages.
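A few of these practices are mechanical enough to check in code. The following Python sketch covers only Best Practices 1, 5, and 7, and only for XHTML served as XML, where xml:lang is the attribute to use. The function name and warning strings are invented for this example, and the sketch ignores the multilingual exception in Best Practice 1.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
XHTML = "{http://www.w3.org/1999/xhtml}"


def language_warnings(source):
    """Return warnings about where an XHTML document declares its language."""
    root = ET.fromstring(source)
    warnings = []
    if root.get(XML_LANG) is None:
        warnings.append("html element has no xml:lang (Best Practices 1 and 5)")
        body = root.find(XHTML + "body")
        if body is not None and body.get(XML_LANG) is not None:
            warnings.append("default language declared on body, not html (Best Practice 7)")
    return warnings


GOOD = ('<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">'
        '<body><p>Hello</p></body></html>')
BAD = ('<html xmlns="http://www.w3.org/1999/xhtml">'
       '<body xml:lang="en"><p>Hello</p></body></html>')

good_warnings = language_warnings(GOOD)
bad_warnings = language_warnings(BAD)
```

The first document passes cleanly. The second triggers both warnings because its only language declaration sits on the body element.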

May

MathML was one of the first XML applications, but sadly it has seen limited practical uptake. Nonetheless, the W3C Math Working Group hasn't given up, and in late April they released the first draft of MathML 3. (Yes, I know this is supposed to be the May section, but not much XML news happened in May.)

The most important feature in version 3 is support for elementary school math notation. After all, first graders outnumber mathematics PhDs about 100,000 to 1. MathML 3 also adds support for bidirectional layout and improves linebreaking and positioning for improved typesetting. Finally, the spec has been rewritten with clarity in mind. One can only hope the third time is the charm. After all, math is what the Web was invented for.


June

In June, the OpenOffice Project released version 2.2 of OpenOffice, a cross-platform office suite that saves all its files as zipped XML in the international standard OpenDocument format. This was mostly a bug-fix release, and it wouldn't normally merit mention in a year-in-review article. But the real news was that for the first time, the OpenOffice Project also released a native Mac OS X version along with the versions for Linux® and Microsoft® Windows®.

Unlike previous semi-releases on the Mac, 2.2 was based on the Mac's native Aqua user interface toolkit rather than the X Window System. The Mac release was only alpha quality, but it was still a major step forward in making OpenOffice a real competitor to Microsoft Office. If OpenOffice can attract a significant number of MacBook-wielding programmers, it might finally be able to cure some of the user-interface glitches that have plagued it since 1.0.

June also saw the biggest news on the browser front all year when Apple posted the first of several betas of Safari 3.0 for Windows. No longer content with its 6% (and growing) market share, Apple seems to be commencing a full-scale challenge to Microsoft on its home front. First iTunes and now Safari? Can iLife and iWork be far behind? Only 2008 will tell. In the meantime, Safari supports XML, XSLT, Cascading Style Sheets (CSS), XHTML, Atom, and RSS. Safari's CSS support is better than that of any other browser on Windows. While distracted by Google-inspired paranoia, Microsoft might not notice Apple sneaking up on it from behind.


July

In July, the W3C published the first public working draft of Efficient XML Interchange (EXI) Format 1.0. The spec claims that:

"EXI is a very compact representation for the eXtensible Markup Language (XML) Information Set that is intended to simultaneously optimize performance and the utilization of computational resources. The EXI format uses a hybrid approach drawn from the information and formal language theories, plus practical techniques verified by measurements, for entropy encoding XML information. Using a relatively simple algorithm, which is amenable to fast and compact implementation, and a small set of data types, it reliably produces efficient encodings of XML event streams."

I'm not sure what's worse: the incredible opaqueness of the format or the fact that EXI really truly isn't a representation of the XML infoset. Opaqueness I expected, but the latter surprised me. EXI introduces data types such as Binary, Boolean, Decimal, Float, Integer, Unsigned Integer, and Date-Time. XML doesn't have data types, and that's a feature, not a bug. XML doesn't presume to tell any reader how it must interpret any particular string of text it finds in a document. EXI does.
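The point about data types is easy to demonstrate. In the Python sketch below, the order document and its element names are hypothetical. The parser hands back nothing but characters, and each consumer decides for itself what those characters mean.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# Hypothetical document; XML itself says nothing about the types involved.
order = ET.fromstring("<order><zip>02134</zip><quantity>007</quantity></order>")

zip_text = order.findtext("zip")            # characters only
quantity_text = order.findtext("quantity")  # characters only

# One reader keeps the ZIP code as a string so the leading zero survives.
zip_code = zip_text

# Another reader chooses to treat the quantity as an integer...
quantity_as_int = int(quantity_text)

# ...and a third prefers a decimal. The document permits all three readings.
quantity_as_decimal = Decimal(quantity_text)
```

A format that fixes the type at encoding time makes that choice on the reader's behalf.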

Fortunately, by the end of the year, EXI started to see some serious push-back from other parts of the W3C, including the influential Technical Architecture Group. The W3C process makes it hard to derail a spec, no matter how unwise it is, so EXI will probably be released in 2008 regardless. It wouldn't be the first turkey egg the W3C has laid (schemas, anyone?), and it certainly won't be the last; but maybe with enough advance warning about the problems inherent in binary serializations, this won't cause as much damage as it otherwise might. Let's hope the world treats this more like XML 1.1 than XML Schemas.


August

In August, XML geeks dust off their French phrase books and head to Montreal for the annual Extreme Markup Languages conference. This is by far the geekiest of the three major XML shows each year. There are no classes about how to write stylesheets or schemas. Instead, topics include subjects like "A Web 2.0 ANSI SQL Transparent Native XML Nonlinear Hierarchical LCA Query Processor" and "Exploring intertextual semantics: A reflection on attributes and optionality."

This conference has always been a little shaky financially, usually with more speakers than paying attendees. The sponsor often doesn't make up its mind whether to hold the conference again until the end of the show; and everyone waits around with bated breath to see if it will make it one more year. Sadly, this year that didn't happen. 2007 turned out to be the last go-round for Extreme (although it outlasted many competitors).

But from the ashes of the old, a new conference shall arise. Mulberry Technologies, which has been running Extreme in everything but name for as long as I've attended, has announced Balisage: The Markup Conference, to take place in Montreal 12-15 August 2008.

"Balisage is designed to meet the needs of markup theoreticians and practitioners who are pushing the boundaries of the field. It's all about the markup: how to create it; what it means; hierarchies and overlap; modeling; taxonomies; transformation; query, searching, and retrieval; presentation and accessibility; making systems that make markup dance (or dance faster to a different tune in a smaller space)—in short, changing the world and the Web through the power of marked-up information."

If the Canadian loonie continues its run against the US dollar, 2008 might not be a cost-effective year for Americans to go, but Europeans and Canadians should have a good time.


September

The biggest story of the year broke wide open in September with the discovery that, in support of the Office Open XML format, Microsoft promoted a voter registration campaign within the various national member bodies at the International Organization for Standardization (ISO). The news first surfaced in Sweden, where 23 mostly minor Microsoft-affiliated companies joined the Swedish Standards Institute at the last minute, and 22 of them voted in favor of approving OOXML. Other national standards bodies also found themselves inundated with more new membership applications than they'd seen in years, mostly from Microsoft partners. Countries that hadn't previously participated in JTC 1/SC 34 (the specific ISO subcommittee where most XML work happens) suddenly joined.

Although Office Open XML got a simple majority of the votes (51-18-18), it needed at least a two-thirds majority of "P-members" and no more than 25% negative votes. It failed on both counts, so the spec went back to Ecma International for resolution of comments. Perhaps Microsoft can improve the spec enough to get the extra votes it needs when the spec comes up for reconsideration in February, but the outcome is uncertain. As I write this, Microsoft appears reluctant to allow the ISO to control future evolution of OOXML, so some previous Yes votes might change to No votes.
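The arithmetic behind one of those two counts can be worked through in a few lines of Python. This sketch rests on two assumptions that the text above doesn't state: that 51-18-18 means approve, disapprove, and abstain in that order, and that abstentions are left out when the share of negative votes is computed. The P-member breakdown isn't given here, so the two-thirds test isn't reproduced.

```python
# Assumption: the tally reads approve / disapprove / abstain.
approve, disapprove, abstain = 51, 18, 18

total_ballots = approve + disapprove + abstain   # 87
simple_majority = approve > total_ballots / 2    # 51 of 87

# Assumption: abstentions are excluded from the negative-vote test.
votes_cast = approve + disapprove                # 69
negative_share = disapprove / votes_cast         # 18 / 69, about 26.1%

fails_negative_test = negative_share > 0.25
```

Under those assumptions, the negative share lands just over the 25% ceiling, which is why a comfortable-looking majority still wasn't enough.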

The effort to influence the OOXML ballot caused collateral damage to several other, unrelated specifications, including Document Schema Definition Languages (DSDL). Many of the new members who voted in favor of OOXML had no interest in other working-group tasks. Once their initial vote was cast, they disappeared and prevented the group from reaching a quorum on unrelated and much less controversial issues.


October

October saw the release of the Atom Publishing Protocol. APP began its life as a simple format for uploading blog entries to replace custom APIs like the MetaWeblog and WordPress APIs. But along the way, it turned into something much, much more.

APP is nothing less than a RESTful, scalable, extensible, secure system for publishing content to HTTP servers. On one hand, it's a pure protocol, completely independent of any particular server or client. On the other hand, because it's nothing more than HTTP, it's easy to implement in existing clients and servers.
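Because APP is plain HTTP, a client needs nothing beyond an XML library and an HTTP library. The Python sketch below builds an Atom entry and the POST request that would create it in a collection. The collection URL, entry values, and function names are all hypothetical, and the request is constructed but deliberately not sent.

```python
import uuid
import urllib.request
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

ATOM = "http://www.w3.org/2005/Atom"
ET.register_namespace("", ATOM)  # serialize Atom as the default namespace

# Hypothetical collection URL; a real one comes from the server's service document.
COLLECTION_URL = "http://example.org/blog/entries"


def build_entry(title, author, content):
    """Serialize a minimal Atom entry as UTF-8 bytes."""
    entry = ET.Element(f"{{{ATOM}}}entry")
    ET.SubElement(entry, f"{{{ATOM}}}title").text = title
    # Placeholder id and timestamp; a server is free to replace them.
    ET.SubElement(entry, f"{{{ATOM}}}id").text = f"urn:uuid:{uuid.uuid4()}"
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    ET.SubElement(entry, f"{{{ATOM}}}updated").text = now
    author_element = ET.SubElement(entry, f"{{{ATOM}}}author")
    ET.SubElement(author_element, f"{{{ATOM}}}name").text = author
    ET.SubElement(entry, f"{{{ATOM}}}content", {"type": "text"}).text = content
    return ET.tostring(entry, encoding="utf-8")


def build_post(body):
    """Create (but do not send) the HTTP request that adds an entry."""
    return urllib.request.Request(
        COLLECTION_URL,
        data=body,
        method="POST",
        headers={"Content-Type": "application/atom+xml;type=entry"},
    )


body = build_entry("XML 2007", "Example Author", "A look back at the year.")
request = build_post(body)
# urllib.request.urlopen(request) would send it; on success the protocol
# calls for a 201 Created response with a Location header for the new entry.
```

Editing and deleting work the same way, with PUT and DELETE sent to the URL the server returned.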

The Web was originally intended to be a read-write medium. But for the first 15 years, most energy went into the reading half of that equation. Browsers got all the attention, while authoring tools withered on the vine. Page editors were generally poor and forced to tunnel through FTP to file systems. Only now, with APP, is the field opening up to editors that are as rich, powerful, and easy to use as the browsers.

Some good server software, such as the eXist native XML database, has already started to take advantage of APP, and several clients are working on it. More will do so over the coming year. Publishing on the Web will finally become as straightforward as browsing it.


November

In November, Mark Logic unveiled MarkMail, an XQuery-based site for interacting with e-mail archives. According to Jason Hunter:

"Each email is stored internally as an XML document and accessed using XQuery. All searches, faceted navigation, analytic calculations, and HTML page renderings are performed on a single MarkLogic Server machine."

MarkMail is currently indexing 500 or so Apache mailing lists, jdom-interest, and xml-dev, among others.

Naturally, the first thing people did with all this power was ego-surf. It turns out that within this collection, Michael Kay of Saxon fame is the top human poster of all time (a few Apache robots sending out commit messages beat him); but on xml-dev, top poster honors go to Len Bullard with more than 4,000 posts. The fact that most of Len's posts are several-page articles makes this even more impressive.

I came in at number 10 on xml-dev, with 1,014 posts. I would have been number 9, except that a couple of years ago, when I changed mail clients, my screen name changed from "Elliotte Rusty Harold" to "Elliotte Harold," and the database thinks those are two different people. There are still a few bugs in the system. :-)
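The screen-name problem is easy to reproduce in miniature. The Python sketch below is not how MarkMail works (it runs XQuery on MarkLogic Server); it only shows why counting posts by the literal sender string splits one person in two, and how an alias table would merge them. The message format, element names, and sample data are invented for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical archive: one small XML document per e-mail.
MESSAGES = [
    "<message><list>xml-dev</list><from>Elliotte Rusty Harold</from></message>",
    "<message><list>xml-dev</list><from>Elliotte Harold</from></message>",
    "<message><list>xml-dev</list><from>Len Bullard</from></message>",
    "<message><list>jdom-interest</list><from>Elliotte Harold</from></message>",
]

# Hypothetical alias table mapping variant screen names to one canonical name.
ALIASES = {"Elliotte Harold": "Elliotte Rusty Harold"}


def posts_by_sender(messages, mailing_list, aliases=None):
    """Count posts per sender on one list, optionally merging aliases."""
    aliases = aliases or {}
    counts = Counter()
    for source in messages:
        message = ET.fromstring(source)
        if message.findtext("list") == mailing_list:
            sender = message.findtext("from")
            counts[aliases.get(sender, sender)] += 1
    return counts


raw = posts_by_sender(MESSAGES, "xml-dev")
merged = posts_by_sender(MESSAGES, "xml-dev", ALIASES)
```

Without the alias table, the two spellings are counted as separate posters. With it, their posts roll up under a single name.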


December

December started with the IDEAlliance's annual XML 2007 conference, the largest XML show of the year. This year's event took place in Boston. Attendance was down, with just a few more than 300 attendees and 15 exhibitors.

Most of the show was about what are now relatively well-known technologies, at least to the elite group of XML developers who still attend. Like last year, XQuery stood out as the star of the show, although XForms made a respectable showing. XProc, RDFa, OpenDocument, Office Open XML, Atom, APP, and JSON were also subjects of some interest and hallway chatter. Web services and anything SOAP-related were conspicuous by their absence. I don't think I ever heard those terms mentioned except when they were followed by "but now we're moving to REST."

The one really new thing at the show came from an unexpected source: Intel. Although better known for hardware, Intel also develops software that takes maximum advantage of the company's processors. Intel came to the show to show off and release the Intel XML Software Suite, a collection of native x86 libraries for Linux and Windows that provide really fast XSLT processing, XPath evaluation, XML Schema validation, and Document Object Model (DOM) and Simple API for XML (SAX) parsing. A Java Native Interface (JNI) based wrapper for the Java™ platform is also included.

Intel claims the library is twice as fast as XSLTC and Xalan for XPath and XSLT, and six times faster than Xerces-C++ for raw parsing of large (100 megabyte+) documents. The parser achieves these gains by using symbol table data structures that occupy less memory and by multithreading the processing across two or more cores. This works for documents in the 300 MB to 32 GB region. For smaller documents, the overhead of the technique makes traditional parsers faster.

I haven't had a chance to test those claims myself; but if they're true, that's very interesting. Xerces isn't the fastest parser out there, but a six-times speed-up is better than anyone else has done. Surprisingly, Intel has done this with the standard APIs, SAX and DOM. Personally, I had little doubt that XML parsing performance could be improved, but I expected that doing so would require new APIs designed for high performance. Intel doesn't seem to have needed that.
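For readers who haven't used both, the two standard API styles look like this. The sketch uses the SAX and DOM bindings in Python's standard library rather than Intel's libraries, which I haven't tested, and the sample document is made up. SAX pushes events at a handler as the parser reads, while DOM builds the whole tree first and lets you query it afterward.

```python
import xml.sax
from xml.dom import minidom

# Hypothetical sample document.
DOCUMENT = b"<feed><entry>one</entry><entry>two</entry></feed>"


class EntryCounter(xml.sax.ContentHandler):
    """SAX style: react to events as the parser streams through the input."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        if name == "entry":
            self.count += 1


handler = EntryCounter()
xml.sax.parseString(DOCUMENT, handler)
sax_count = handler.count

# DOM style: parse everything into a tree, then navigate it.
tree = minidom.parseString(DOCUMENT)
dom_count = len(tree.getElementsByTagName("entry"))
```

Both approaches find the same two entry elements. The difference is in memory use and programming model, which is why a faster parser behind the same two interfaces is an attractive proposition.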

December usually closes with a bang as W3C working groups rush to finish their work and push out specs before the Christmas holidays. The week before Christmas is traditionally the single busiest time of year at the W3C. Keep an eye on http://www.w3.org/TR/. You might yet see a few surprises to come. :-)


Summary

2007 was a productive year for XML. The most sound and fury focused around the standardization of office document formats, a fight that even spilled over into the popular press. (Who ever thought you'd be reading about ISO standards for XML formats in the Wall Street Journal?)

But if I had to pick the most important story of the year, I'd be hard pressed to choose among the continuing slow growth of XQuery, APP, and XForms. All have the potential to radically alter the software infrastructure that underlies the Web. XForms is a radically new client-development platform, XQuery is a radically new server-development platform, and APP connects them together. Of the three, XQuery is ready for serious production use today, and APP is gearing up. Look for big things from both of them in 2008. XForms is running behind and may be a little late to the party, but I hope it gets there before the doors close. Either way, the future for XML on the Web looks brighter than ever.



