Though XML is an extraordinarily successful technology, it has its flaws. With great success comes great scrutiny; folks tried to redesign XML from the very beginning. People are dealing with the complexity of XML namespaces and with XML processing specs such as XPath, XSLT, and XQuery 3.0. Some influential core XML experts looked at the bold possibility of starting over with a simplification of XML itself.
Another factor is the threat that is posed by the web browser vendors who work on HTML5. The resulting trends run against some of the most cherished principles of XML programming. In fact, many of the people behind HTML5 treat XML as an inconvenience, as do many developers who prefer JSON.
This combination of forces led to discussion on the XML-DEV mailing list and on various blogs. Eventually, James Clark made a complete proposal for MicroXML. John Cowan, who serves on the W3C XML Core working group, stepped in as a major contributor and editor of the specification. A W3C Community Group came together in 2012, with me as chair and Clark and Cowan as editors. (We three are now co-chairs.) This group produced a draft specification (see Resources), and its members developed some basic implementations.
MicroXML has no official standing in any recognized standards organization. The community group is intentionally informal, but the specification is of great interest to XML developers. Many important modern specs, including JSON, have similarly informal roots.
In this article, I explain the basic principles of MicroXML and illustrate the key differences between MicroXML and full XML with examples. I assume that you are familiar with the basics of XML.
Two key goals of MicroXML are:
- To maintain a simple syntax and data model
- To maintain compatibility with earlier versions of XML
The Community Group expanded and refined these goals into a set of nine design goals to guide the specification:
- The syntax of MicroXML is a subset of XML 1.0.
- MicroXML specifies a data model and a mapping from the syntax to the data model, which is substantially consistent with XML 1.0.
- MicroXML is dramatically simpler than XML regarding its specification, syntax, and data model.
- MicroXML is designed to complement rather than replace XML, JSON, and HTML.
- MicroXML supports the needs of documents, in particular mixed content.
- MicroXML supports Unicode.
- MicroXML supports the use of text editors for authoring.
- MicroXML is able to straightforwardly represent HTML.
- The specification of MicroXML is as self-contained as is practical.
The first two goals are the most fundamental. MicroXML documents are well-formed XML documents, and the specification of a data model is key. XML 1.0 didn't really specify a data model, which led to a succession of separate specifications for XML data models. These specifications include the Infoset and the XPath Data Model (XDM) for XPath 2.0 and beyond, which are dozens of pages long. Even the XPath 1.0 data model, which is recognized for its elegance and shortness, runs to several pages. The MicroXML data model is less than a half page (of the eight pages or so of the overall specification). An initial data model helps enforce simplicity and improves the likelihood of interoperability.
The enveloping concept in MicroXML is the element item, which is the result of parsing an input stream that completely conforms to the MicroXML specification. Each element item has a name, an attributes map, and a content list, and the top-level element item can contain other element items.
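A minimal sketch of this data model in Python may help make it concrete. The class name `ElementItem` and the helper `from_etree` are hypothetical, chosen only for illustration, and the sketch assumes that content consists of text runs and nested element items. It borrows the standard library's XML parser to build the tree, so it accepts full XML rather than checking MicroXML conformance.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class ElementItem:
    """Hypothetical in-memory form of an element item."""
    name: str
    attributes: Dict[str, str] = field(default_factory=dict)
    # Ordered content: text runs and nested element items (an assumption).
    content: List[Union[str, "ElementItem"]] = field(default_factory=list)


def from_etree(node):
    """Convert an ElementTree node into the three-part element item."""
    item = ElementItem(node.tag, dict(node.attrib))
    if node.text:
        item.content.append(node.text)
    for child in node:
        item.content.append(from_etree(child))
        if child.tail:  # text that follows the child inside this element
            item.content.append(child.tail)
    return item


doc = '<para style="friendly">Hello, I am...<strong>MicroXML</strong></para>'
root = from_etree(ET.fromstring(doc))
print(root.name, root.attributes, root.content)
```

The whole model is three fields per element, which is the point of keeping the data model short.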
The most fundamental distinction between XML and MicroXML is in parser error handling. With the notorious, draconian error handling in XML, parsers are required to halt immediately upon encountering the first error. This error handling is a matter of great controversy, especially considering how people became accustomed to sloppy markup with HTML. Critics of XML often cite the popularized form of Postel's Law: Be conservative in what you send, liberal in what you accept.
MicroXML does not insist on any approach to handling errors. A parser can recover or continue. If the document is not well-formed MicroXML, a parser can signal that fact, but otherwise is free in its behavior. For example, a parser might switch to a different interpretation of the input if it encounters an error. Think of how HTML parsers can switch from standards-compliance to "tag soup" mode, and you get the idea.
For example, if an XML processor encounters the following input, it must stop as soon as it reaches </para> and raise a non-well-formedness error about a mismatched closing tag:
<para>Hello, I claim to be <strong>MicroXML</para>
A MicroXML parser might continue at that point, but it no longer reports the input as a MicroXML document. It can even insert a </strong> just before the </para> to repair the output, but must not claim that the result is a MicroXML document.
This subtle relaxation of the well-formedness restriction makes a significant difference if you design real-world systems that must deal with unpredictable input.
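The contrast can be sketched in Python. The strict half uses the standard library's XML parser, which rejects the mismatched tag. The lenient half is a hypothetical `RecoveringParser` built on `html.parser.HTMLParser`, used here only as a stand-in for a tolerant tokenizer (it lowercases names, among other HTML-specific behavior). It is not a MicroXML parser. What it illustrates is the rule described above: recover if you like, but keep track of the fact that the input was not conformant.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

BROKEN = "<para>Hello, I claim to be <strong>MicroXML</para>"


def strict_xml_accepts(text):
    """Return True only if a conforming XML parser accepts the input."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


class RecoveringParser(HTMLParser):
    """Hypothetical lenient parser: builds [name, attrs, content] lists and
    records whether it had to repair anything along the way."""

    def __init__(self):
        super().__init__()
        self.root = None
        self.open_items = []    # stack of elements not yet closed
        self.conformant = True  # flips to False on the first repair

    def handle_starttag(self, tag, attrs):
        item = [tag, dict(attrs), []]
        if self.open_items:
            self.open_items[-1][2].append(item)
        elif self.root is None:
            self.root = item
        else:
            self.conformant = False  # extra top-level element, left detached
        self.open_items.append(item)

    def handle_endtag(self, tag):
        if tag not in [item[0] for item in self.open_items]:
            self.conformant = False  # stray end tag: skip it
            return
        # Close open elements until the matching one is found; each element
        # closed implicitly counts as a repair.
        while self.open_items.pop()[0] != tag:
            self.conformant = False

    def handle_data(self, data):
        if self.open_items:
            self.open_items[-1][2].append(data)


def parse_leniently(text):
    """Return (tree, conformant) instead of halting at the first error."""
    parser = RecoveringParser()
    parser.feed(text)
    parser.close()
    if parser.open_items:  # elements that were never closed
        parser.conformant = False
    return parser.root, parser.conformant


print(strict_xml_accepts(BROKEN))
print(parse_leniently(BROKEN))
```

The lenient parser closes the strong element implicitly when it sees the para end tag, and returns the repaired tree together with a flag saying the input did not conform.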
MicroXML supports only one encoding: UTF-8. A MicroXML document is a sequence of UTF-8 encoded characters that forms a structure expressed in MicroXML's data model. As with XML, the raw sequence of characters is called the text, which comprises markup and character data. This example shows the technical distinction between text and character data:
<para style="friendly">Hello, I am...<strong>MicroXML</strong></para>
Everything from the <para> tag to the </para> tag, inclusively, is text, but only these sequences are character data:
friendly
Hello, I am...
MicroXML
Character data is what appears within attribute values and between the tags that make up elements.
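To see the distinction mechanically, the following sketch walks the example with Python's standard XML parser and collects attribute values and the text between tags in document order. The function name `character_data` is an illustrative choice, and the sketch follows this article's description of character data rather than any formal definition.

```python
import xml.etree.ElementTree as ET


def character_data(node):
    """Yield attribute values and text runs in document order."""
    yield from node.attrib.values()
    if node.text:
        yield node.text
    for child in node:
        yield from character_data(child)
        if child.tail:  # text after the child, still inside this element
            yield child.tail


text = '<para style="friendly">Hello, I am...<strong>MicroXML</strong></para>'
pieces = list(character_data(ET.fromstring(text)))
print(pieces)
```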
Elements, attributes, and character data are the foundation of XML, and in MicroXML those constructs change little. The biggest difference is that colons are forbidden in element and attribute names. This restriction prohibits prefixes as defined in the Namespaces in XML specification. The xmlns attribute is also forbidden, which means that MicroXML does not support namespaces at all. This trait is the greatest surprise for those encountering MicroXML; I address it further in the Namespaces section.
White space in attributes is not normalized in MicroXML as it is in XML. In XML, the following two documents are indistinguishable:
<para>Hi. I'm some form of <abbr ref="Extensible Markup Language">XML</abbr></para>
<para>Hi. I'm some form of <abbr ref="Extensible
Markup Language">XML</abbr></para>
Notice the difference in white space in the ref attribute: the second document has a line break inside the value. In MicroXML, white space in attributes is reported exactly as it is encountered, so these two documents are different.
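The XML side of this behavior is easy to check with Python's standard parser. The sketch below parses two inputs that differ only by a line break inside the ref attribute value and compares what the parser reports. It demonstrates XML attribute-value normalization only, since the standard library has no MicroXML parser.

```python
import xml.etree.ElementTree as ET

one_line = '<abbr ref="Extensible Markup Language">XML</abbr>'
two_lines = '<abbr ref="Extensible\nMarkup Language">XML</abbr>'

ref_a = ET.fromstring(one_line).get("ref")
ref_b = ET.fromstring(two_lines).get("ref")

# An XML parser normalizes the line break to a space, so the reported
# values match even though the source text differs.
same_after_xml_parse = (ref_a == ref_b)
same_source_text = (one_line == two_lines)
print(same_after_xml_parse, same_source_text)
```

A parser that follows the rule described above would instead report the second value with its line break intact.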
Processing instructions are always a controversial area of XML, and they are prohibited in MicroXML. MicroXML comments are the same as XML comments, except that they are not part of the MicroXML data model. They are ignored by applications that are not specialized to preserve syntax. Basically, MicroXML comments are for people, not programs. For compatibility reasons, MicroXML doesn't relax any of the restrictions in XML, notably against two dashes (--) within comments, and thus nested comments.
MicroXML does not support the XML declaration, nor does it support any form of document type declaration. For example, you cannot trigger standards-processing mode in typical web browsers even if you use MicroXML in a way compatible with HTML5.
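The restrictions described so far (no colons in names, no xmlns attribute, no processing instructions, no XML declaration, no document type declaration) can be sketched as a simple lint pass. The function name `microxml_violations` is hypothetical, and the sketch is not a conformance checker: it uses Python's expat bindings, a full XML parser, with namespace processing left off so that names are reported exactly as written. It covers only the constructs named in this article.

```python
import xml.parsers.expat


def microxml_violations(text):
    """List constructs in `text` that this article says MicroXML forbids.

    Illustrative only: input that is not well-formed XML raises ExpatError
    instead of being reported here.
    """
    problems = []
    parser = xml.parsers.expat.ParserCreate()  # namespace processing is off

    def on_start(name, attrs):
        if ":" in name:
            problems.append("colon in element name: " + name)
        for attr in attrs:
            if attr == "xmlns":
                problems.append("xmlns attribute on: " + name)
            elif ":" in attr:
                problems.append("colon in attribute name: " + attr)

    def on_pi(target, data):
        problems.append("processing instruction: " + target)

    def on_xml_decl(version, encoding, standalone):
        problems.append("XML declaration")

    def on_doctype(name, system_id, public_id, has_internal_subset):
        problems.append("document type declaration")

    parser.StartElementHandler = on_start
    parser.ProcessingInstructionHandler = on_pi
    parser.XmlDeclHandler = on_xml_decl
    parser.StartDoctypeDeclHandler = on_doctype
    parser.Parse(text, True)
    return problems


print(microxml_violations('<?xml version="1.0"?><a:feed xml:lang="en"/>'))
```

An empty list means none of these particular constructs were found. It does not mean the input is a MicroXML document.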
MicroXML does not support namespaces at all. Colons and thus prefixes are not supported, nor is the xmlns attribute. This lack of namespace support affects compatibility with a large cross-section of XML specifications, and is one of the more controversial features of MicroXML. But the decision to omit namespaces was made for good reason.
Namespaces add extensive complexity to solve a minor problem. Many people who train or otherwise help others with XML can testify that namespaces are by far the hardest concept for users and developers to grasp. Namespaces complicate all derived specifications and software enormously. If MicroXML was to take a stand on simplicity, it had to take a stand against namespaces.
If you design new vocabularies with MicroXML, you might find that you miss XML namespaces far less than you might think. The difficulties come in adapting existing XML vocabularies that do use namespaces. In such cases, you must use transforms to strip namespaces during MicroXML phases, and reconstruct the namespaces for XML phases. The MicroXML community is working on tools and conventions to help with this process.
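One possible shape for the stripping half of such a transform is sketched below in Python. It is not one of the community tools mentioned above, and the function names are hypothetical. It parses namespaced XML with the standard library, reduces every element and attribute name to its local part, and serializes the result. Names that differ only by namespace would collide silently, so a real tool would need a convention for that case.

```python
import xml.etree.ElementTree as ET


def local_name(name):
    """Drop ElementTree's '{namespace-uri}' prefix, if there is one."""
    return name.rsplit("}", 1)[-1]


def strip_namespaces(xml_text):
    """Return the document with namespaces removed from all names."""
    root = ET.fromstring(xml_text)
    for node in root.iter():
        node.tag = local_name(node.tag)
        renamed = {local_name(k): v for k, v in node.attrib.items()}
        node.attrib.clear()
        node.attrib.update(renamed)
    return ET.tostring(root, encoding="unicode")


source = (
    '<a:feed xmlns:a="http://www.w3.org/2005/Atom" '
    'xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">'
    '<a:title>Copia</a:title><div><p>Hi</p></div></a:feed>'
)
print(strip_namespaces(source))
```

Because no namespaced names remain in the tree, the serializer emits no namespace declarations, and xml:lang comes out as plain lang.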
The prohibition on prefixes also means, for example, that you must use lang rather than xml:lang. However, MicroXML does not provide any special support for such attributes; vocabularies must set conventions for them.
This section uses a fairly complete, real-world example of typical XML to show how it looks as MicroXML. Atom is a good sample format because it often includes a mix of namespaces. Listing 1 is based on a listing in the "Process Atom 1.0 with XSLT" tutorial on developerWorks. I removed one of the entry elements and used a namespace prefix of a for all the core Atom elements, such as feed, to help illustrate changes to namespaces in MicroXML.
Listing 1. Typical XML
<?xml version="1.0" encoding="utf-8"?>
<a:feed xmlns:a="http://www.w3.org/2005/Atom"
        xmlns="http://www.w3.org/1999/xhtml"
        xml:lang="en"
        xml:base="http://copia.ogbuji.net">
  <a:id>http://copia.ogbuji.net/atom1.0</a:id>
  <a:title>Copia</a:title>
  <a:updated>2005-07-15T12:00:00Z</a:updated>
  <a:author>
    <a:name>Uche Ogbuji</a:name>
    <a:uri>http://uche.ogbuji.net</a:uri>
  </a:author>
  <a:link href="/blog" />
  <a:link rel="self" href="/blog/atom1.0" />
  <a:entry>
    <a:id>http://copia.ogbuji.net/blog/2005-09-16/xhtml</a:id>
    <a:title>XHTML tutorial pubbed</a:title>
    <a:link href="http://copia.posterous.com/xhtml-tutorial-pubbed"/>
    <a:category term="xml"/>
    <a:category term="css"/>
    <a:category term="xhtml"/>
    <a:updated>2005-07-15T12:00:00Z</a:updated>
    <a:content type="xhtml">
      <div>
        <p>
          <a href="http://www.ibm.com/developerworks/edu/x-dw-x-xhtml-i.htm">
            "XHTML, step-by-step"
          </a>
        </p>
        <blockquote>
          <p>Start working with Extensible Hypertext Markup Language. In this tutorial, author Uche Ogbuji shows you how to use XHTML in practical Web sites.</p>
        </blockquote>
        <p>In this tutorial</p>
        <ul>
          <li>Tutorial introduction</li>
          <li>Anatomy of an XHTML Web page</li>
          <li>Understand the ground rules</li>
          <li>Replace common HTML idioms</li>
          <li>Some practical considerations</li>
          <li>Wrap up</li>
        </ul>
      </div>
    </a:content>
  </a:entry>
</a:feed>
Listing 2 is a MicroXML version of Listing 1. Notice the lack of XML declaration, which is not supported in MicroXML and is unnecessary because only the UTF-8 encoding is supported. It includes no namespaces at all.
Listing 2. MicroXML version
<!-- http://www.w3.org/2005/Atom -->
<feed lang="en" base="http://copia.ogbuji.net">
  <id>http://copia.ogbuji.net/atom1.0</id>
  <title>Copia</title>
  <updated>2005-07-15T12:00:00Z</updated>
  <author>
    <name>Uche Ogbuji</name>
    <uri>http://uche.ogbuji.net</uri>
  </author>
  <link href="/blog" />
  <link rel="self" href="/blog/atom1.0" />
  <entry>
    <id>http://copia.ogbuji.net/blog/2005-09-16/xhtml</id>
    <title>XHTML tutorial pubbed</title>
    <link href="http://copia.posterous.com/xhtml-tutorial-pubbed"/>
    <category term="xml"/>
    <category term="css"/>
    <category term="xhtml"/>
    <updated>2005-07-15T12:00:00Z</updated>
    <content type="xhtml">
      <div>
        <p>
          <a href="http://www.ibm.com/developerworks/edu/x-dw-x-xhtml-i.htm">
            "XHTML, step-by-step"
          </a>
        </p>
        <blockquote>
          <p>Start working with Extensible Hypertext Markup Language. In this tutorial, author Uche Ogbuji shows you how to use XHTML in practical Web sites.</p>
        </blockquote>
        <p>In this tutorial</p>
        <ul>
          <li>Tutorial introduction</li>
          <li>Anatomy of an XHTML Web page</li>
          <li>Understand the ground rules</li>
          <li>Replace common HTML idioms</li>
          <li>Some practical considerations</li>
          <li>Wrap up</li>
        </ul>
      </div>
    </content>
  </entry>
</feed>
The code in Listing 2 is not Atom. Atom is defined with its own namespace plus XHTML for structured descriptions and content. You can easily convert the Listing 2 code to Atom XML. Apply a syntactic XML transformation that adds xmlns attributes to the feed element and any div elements. A div element inside Atom content is rarely confused with any other element, which illustrates my point that XML namespaces are used far more often than necessary.
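A rough sketch of that syntactic transformation in Python follows. The function name `add_atom_namespaces` is hypothetical, and the two namespace URIs are the ones declared in Listing 1. The sketch only adds the xmlns attributes to feed and div elements. It does not rename lang and base back to xml:lang and xml:base, which a complete conversion of Listing 2 would also need.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
XHTML_NS = "http://www.w3.org/1999/xhtml"


def add_atom_namespaces(microxml_text):
    """Add default-namespace declarations to feed and div elements."""
    root = ET.fromstring(microxml_text)
    for node in root.iter():
        if node.tag == "feed":
            node.set("xmlns", ATOM_NS)
        elif node.tag == "div":
            node.set("xmlns", XHTML_NS)
    # The xmlns attributes are written out literally, so a namespace-aware
    # parser reading the result sees Atom and XHTML elements.
    return ET.tostring(root, encoding="unicode")


micro = (
    '<feed><title>Copia</title><entry><content type="xhtml">'
    '<div><p>Hi</p></div></content></entry></feed>'
)
atom_text = add_atom_namespaces(micro)
print(atom_text)
```

Parsing the output again with a namespace-aware XML parser places feed and its Atom children in the Atom namespace, and each div subtree in the XHTML namespace.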
Listing 3 shows a MicroXML document that is close to a valid HTML5 document:
Listing 3. HTML5-like MicroXML
<html lang="en">
  <!-- A comment -->
  <head>
    <title>Welcome page</title>
  </head>
  <body>
    <p>Welcome to <a href="ibm.com/developerworks/">IBM developerWorks</a>.</p>
  </body>
</html>
The document in Listing 3 omits an HTML5 doctype declaration, which is prohibited in MicroXML. You can convert it into valid HTML5 by adding this line to the top:
<!DOCTYPE html>
MicroXML has some important advantages over full XML, including benefits in areas such as mobile and cloud computing, and with considerations of security. Its lack of draconian error handling makes it a more flexible format for information interchange. With greater simplicity, MicroXML is more readily processed in scenarios with limited resources, such as on mobile devices, and more inexpensively processed where resources are metered, as with cloud computing. MicroXML also has far fewer security problems, an important consideration for use on the network. Without document type declarations or entities, MicroXML documents are self-contained. Parsing a MicroXML document requires no access to a separate resource, which helps eliminate denial-of-service and spoofing attacks.
MicroXML is still an emerging specification. I don't necessarily advocate that you embrace it completely and turn away from full XML, but it is an important development. Understanding MicroXML can give you a good sense of how to use XML most effectively in the face of changes such as the emergence of JSON and HTML5. The MicroXML specification is small, and some tools are already in the offing. I encourage you to experiment with MicroXML.
Resources
- "MicroXML, Part 2: Process MicroXML with microxml-js" (Uche Ogbuji, developerWorks, May 2013): In Part 2 of this series, learn about and experiment with a tool for parsing MicroXML.
- MicroXML Community Group: The MicroXML Community Group hosts the MicroXML Specification, relevant discussion, and a wiki. The specification is itself a MicroXML document, and uses the HTML 4 vocabulary.
- More on MicroXML: Read this December 2010 entry in James Clark's blog to learn how MicroXML took shape.
- "HTML5 fundamentals" (Grace Walker, developerWorks, May 2011): Learn more about HTML5 and about its XML annex in "Thinking XML: The XML flavor of HTML5" (Uche Ogbuji, developerWorks, July 2010).
- "Process Atom 1.0 with XSLT" (Uche Ogbuji, developerWorks, December 2005): In this tutorial, learn about XSLT techniques for processing Atom documents through real-life use cases.
- Thinking XML: Read Uche Ogbuji's column series on developerWorks.
- New to XML: Get the resources that you need to learn XML.
- XML area on developerWorks: Find the resources that you need to advance your skills in the XML arena, including DTDs, schemas, and XSLT. See the XML technical library for a wide range of technical articles and tips, tutorials, standards, and IBM Redbooks.
Uche Ogbuji is a partner at Zepheira, LLC, a solutions firm that specializes in the next generation of web technologies. Mr. Ogbuji is the lead developer of 4Suite, an open source platform for XML, RDF, and knowledge-management applications, and its successor Akara. He is a computer engineer and writer who was born in Nigeria and lives and works in Boulder, Colorado, US. You can find more about Mr. Ogbuji at his Weblog Copia, or on Twitter.