Introducing MicroXML, Part 1: Explore the basic principles of MicroXML

Learn about the possible future of XML

Parts of the XML community have always grumbled that XML is difficult to understand and process. XML is fundamentally complex for various historical reasons, and people have proposed simplified versions for more than a decade. JSON and HTML5 threaten some of the most basic XML tenets. MicroXML, a simplification of XML that is compatible with earlier versions, emerged from discussions of these issues. MicroXML is now under the guidance of a W3C community group, and several basic implementations are already available for the draft specification. In this first article of a two-part series, learn from one of the MicroXML Community Group co-chairs about MicroXML and its technical differences from the XML 1.x core standards.

Editor's note: This two-article series, originally published in 2012, was revised to reflect subsequent important updates to the MicroXML specification.

Uche Ogbuji, Partner, Zepheira, LLC

Uche Ogbuji is a partner at Zepheira, LLC, a solutions firm that specializes in the next generation of web technologies. Mr. Ogbuji is the lead developer of 4Suite, an open source platform for XML, RDF, and knowledge-management applications, and its successor Akara. He is a computer engineer and writer who was born in Nigeria and lives and works in Boulder, Colorado, US. You can find more about Mr. Ogbuji at his Weblog Copia, or on Twitter.



07 May 2013 (First published 29 May 2012)


Though XML is an extraordinarily successful technology, it has its flaws. With great success comes great scrutiny; folks have tried to redesign XML from the very beginning. People also struggle with the complexity of XML namespaces and of XML processing specs such as XPath, XSLT, and XQuery 3.0. Some influential core XML experts looked at the bold possibility of starting over with a simplification of XML itself.

Another factor is the threat that is posed by the web browser vendors who work on HTML5. The resulting trends run against some of the most cherished principles of XML programming. In fact, many of those who are behind HTML5 treat XML as an inconvenience, as do many developers who prefer JSON.

This combination of forces led to discussion on the XML-DEV mailing list and on various blogs. Eventually, James Clark made a complete proposal for MicroXML. John Cowan, who serves on the W3C XML Core working group, stepped in as a major contributor and editor of the specification. A W3C Community Group came together in 2012, with me as chair and Clark and Cowan as editors. (We three are now co-chairs.) This group produced a draft specification (see Resources), and its members developed some basic implementations.

MicroXML has no official standing in any recognized standards organization. The community group is intentionally informal, but the specification is of great interest to XML developers. Many important modern specs, including JSON, have similarly informal roots.

In this article, I explain the basic principles of MicroXML and illustrate the key differences between MicroXML and full XML with examples. I assume that you are familiar with the basics of XML.

Principles of MicroXML

Two key goals of MicroXML are:

  • To maintain a simple syntax and data model
  • To maintain compatibility with earlier versions of XML

The Community Group expanded and refined these goals into a set of nine design goals to guide the specification:

  • The syntax of MicroXML is a subset of XML 1.0.
  • MicroXML specifies a data model and a mapping from the syntax to the data model, which is substantially consistent with XML 1.0.
  • MicroXML is dramatically simpler than XML regarding its specification, syntax, and data model.
  • MicroXML is designed to complement rather than replace XML, JSON, and HTML.
  • MicroXML supports the needs of documents, in particular mixed content.
  • MicroXML supports Unicode.
  • MicroXML supports the use of text editors for authoring.
  • MicroXML is able to straightforwardly represent HTML.
  • The specification of MicroXML is as self-contained as is practical.

The first two goals are the most fundamental. MicroXML documents are well-formed XML documents, and the specification of a data model is key. XML 1.0 didn't really specify a data model, which led to a succession of separate specifications for XML data models. These specifications include the Infoset and the XPath Data Model (XDM) for XPath 2.0 and beyond, each of which is dozens of pages long. Even the XPath 1.0 data model, which is recognized for its elegance and brevity, runs to several pages. The MicroXML data model takes less than half a page (of the eight pages or so of the overall specification). Defining the data model from the start helps enforce simplicity and improves the likelihood of interoperability.

The enveloping concept in MicroXML is the element item, which is the result of parsing an input stream that completely conforms to the MicroXML specification. Each element item has a name, an attributes map, and a content list, and the content list of the top-level element item can in turn contain other element items.
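To make the shape of an element item concrete, here is a minimal Python sketch of the three parts described above. The class name ElementItem, its field names, and the helper text_of are hypothetical names chosen for illustration; they are not taken from the MicroXML specification or from any implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class ElementItem:
    """Illustrative element item: a name, an attributes map, a content list."""
    name: str
    attributes: Dict[str, str] = field(default_factory=dict)
    # The content list is an ordered mix of character data and child items.
    content: List[Union[str, "ElementItem"]] = field(default_factory=list)


def text_of(item: ElementItem) -> str:
    """Concatenate the character data found anywhere in the content list."""
    parts = []
    for child in item.content:
        parts.append(child if isinstance(child, str) else text_of(child))
    return "".join(parts)


# Hand-built model of:
# <para style="friendly">Hello, I am...<strong>MicroXML</strong></para>
para = ElementItem(
    name="para",
    attributes={"style": "friendly"},
    content=["Hello, I am...", ElementItem("strong", content=["MicroXML"])],
)
```

Because the whole model is one recursive structure, a single short function such as text_of is enough to walk an entire document.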

Well-formedness

The most fundamental distinction between XML and MicroXML is in parser error handling. With the notorious, draconian error handling in XML, parsers are required to halt immediately upon encountering the first error. This error handling is a matter of great controversy, especially considering how people became accustomed to sloppy markup with HTML. Critics of XML often cite the popularized form of Postel's Law: Be conservative in what you send, liberal in what you accept.

MicroXML does not insist on any approach to handling errors. A parser can recover or continue. If the document is not well-formed MicroXML, a parser can signal that fact, but otherwise is free in its behavior. For example, a parser might switch to a different interpretation of the input if it encounters an error. Think of how HTML parsers can switch from standards-compliance to "tag soup" mode, and you get the idea.

For example, if an XML processor encounters the following input, it must stop as soon as it reaches </para> and raise a non-well-formedness error about a mismatched closing tag:

<para>Hello, I claim to be <strong>MicroXML</para>

A MicroXML parser might continue at that point, but it can no longer report the input as a MicroXML document. It can even insert a </strong> just before the </para> to repair the output, but must not claim that the result is a MicroXML document. This subtle relaxation of the well-formedness restriction makes a significant difference if you design real-world systems that must deal with unpredictable input.
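The contrast can be sketched in Python. The first function uses the standard library's xml.etree.ElementTree, a conventional XML parser, which rejects the input outright. The second, lenient_repair, is a hypothetical toy written only for this illustration: it is regex-based, ignores many details of real markup (for example, a > inside an attribute value), and does not reflect how any actual MicroXML parser behaves. It closes dangling elements and reports separately whether the input was well-formed to begin with.

```python
import re
import xml.etree.ElementTree as ET

BROKEN = "<para>Hello, I claim to be <strong>MicroXML</para>"

# Toy tag matcher: optional "/", a name, anything up to an optional "/" and ">".
TAG = re.compile(r"<(/?)([A-Za-z_][\w.-]*)[^<>]*?(/?)>")


def parse_strictly(text):
    """XML behavior: return (root, None) or (None, error message)."""
    try:
        return ET.fromstring(text), None
    except ET.ParseError as err:
        return None, str(err)


def lenient_repair(text):
    """Hypothetical recovery: return (repaired_text, was_well_formed)."""
    out, stack, well_formed, pos = [], [], True, 0
    for match in TAG.finditer(text):
        out.append(text[pos:match.start()])
        pos = match.end()
        closing, name, self_closing = match.groups()
        if closing:
            if name not in stack:
                well_formed = False  # stray end tag: drop it
                continue
            while stack[-1] != name:
                well_formed = False  # close the elements left open
                out.append("</%s>" % stack.pop())
            stack.pop()
        elif not self_closing:
            stack.append(name)
        out.append(match.group(0))
    out.append(text[pos:])
    while stack:
        well_formed = False  # input ended with elements still open
        out.append("</%s>" % stack.pop())
    return "".join(out), well_formed


root, error = parse_strictly(BROKEN)
repaired, was_well_formed = lenient_repair(BROKEN)
```

Here parse_strictly returns no tree at all, while lenient_repair returns usable output together with a flag that stays False, which mirrors the rule that a recovering parser must not claim the input was a MicroXML document.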


Anatomy of MicroXML

MicroXML supports only one encoding: UTF-8. A MicroXML document is a sequence of UTF-8 encoded characters that forms a structure expressed in MicroXML's data model. As with XML, the raw sequence of characters is called the text, which comprises markup and character data. This example shows the technical distinction between text and character data:

<para style="friendly">Hello, I am...<strong>MicroXML</strong></para>

Everything from the <para> tag to the </para> tag, inclusive, is text, but only these sequences are character data:

  • friendly
  • Hello, I am...
  • MicroXML

Character data is what appears within attribute values and between the tags that make up elements.

Elements, attributes, and character data

Elements, attributes, and character data are the foundation of XML, and in MicroXML those constructs change little. The biggest difference is that colons are forbidden in element and attribute names. This restriction prohibits prefixes as defined in the Namespaces in XML specification. The xmlns attribute is also forbidden, which means that MicroXML does not support namespaces at all. This trait is the greatest surprise for those encountering MicroXML; I address it further in the Namespaces section.

White space in attributes is not normalized in MicroXML as it is in XML. In XML, the following two documents are indistinguishable:

<para>Hi. I'm some form of <abbr ref="Extensible Markup 
Language">XML</abbr></para>

<para>Hi. I'm some form of <abbr ref="Extensible Markup Language">XML</abbr></para>

Notice the difference in white space in the ref attribute. In MicroXML, white space in attributes is reported exactly as it is encountered, so these two documents are different.
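The XML side of this behavior can be observed with Python's standard xml.etree.ElementTree parser. This is a small sketch: the sample strings are shortened from the documents above, with the line break placed directly after "Markup", and the variable names are illustrative. The MicroXML side is only modeled with a literal string, because the standard library has no MicroXML parser.

```python
import xml.etree.ElementTree as ET

# The same attribute written across two lines and on one line.
WRAPPED = '<abbr ref="Extensible Markup\nLanguage">XML</abbr>'
SINGLE_LINE = '<abbr ref="Extensible Markup Language">XML</abbr>'

wrapped_ref = ET.fromstring(WRAPPED).get("ref")
single_ref = ET.fromstring(SINGLE_LINE).get("ref")

# An XML parser replaces the line break with a space, so the values match.
same_in_xml = wrapped_ref == single_ref

# A MicroXML parser would report the value exactly as written,
# line break included, so the two values would differ.
raw_microxml_value = "Extensible Markup\nLanguage"
same_in_microxml = raw_microxml_value == single_ref
```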

Processing instructions, comments, and document type declarations

Processing instructions are always a controversial area of XML, and they are prohibited in MicroXML. MicroXML comments are the same as XML comments, except that they are not part of the MicroXML data model. They are ignored by applications that are not specialized to preserve syntax. Basically, MicroXML comments are for people, not programs. For compatibility reasons, MicroXML keeps all of the XML restrictions on comments, notably the ban on two dashes (--) within a comment, which also rules out nested comments.

MicroXML does not support the XML declaration, nor does it support any form of document type declaration. One consequence is that you cannot trigger standards-processing mode in typical web browsers, which rely on a doctype declaration to select that mode, even if you use MicroXML in a way that is compatible with HTML5.
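The restrictions in this section lend themselves to a quick pre-flight check. The following Python sketch scans a string for the constructs named above. The function name forbidden_constructs and its messages are hypothetical, and the regular expressions are a rough approximation rather than a conforming parser, so treat it as a way to see the rules side by side and not as a validator.

```python
import re

COMMENT = re.compile(r"<!--(.*?)-->", re.DOTALL)


def forbidden_constructs(text):
    """List constructs from this section that MicroXML does not allow."""
    problems = []
    for body in COMMENT.findall(text):
        # XML, and therefore MicroXML, bans "--" inside a comment.
        if "--" in body or body.endswith("-"):
            problems.append("double hyphen inside a comment")
    # Drop comments so markup-like text inside them is not flagged.
    stripped = COMMENT.sub("", text)
    if re.search(r"<\?xml[\s?]", stripped):
        problems.append("XML declaration")
    if re.search(r"<\?(?!xml[\s?])", stripped):
        problems.append("processing instruction")
    if "<!DOCTYPE" in stripped:
        problems.append("document type declaration")
    return problems


with_declaration = forbidden_constructs('<?xml version="1.0"?><feed/>')
with_pi = forbidden_constructs('<?xml-stylesheet href="a.css"?><feed/>')
with_doctype = forbidden_constructs("<!DOCTYPE html><html/>")
clean = forbidden_constructs("<html><!-- A comment --><body/></html>")
```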

Namespaces

MicroXML does not support namespaces at all. Colons and thus prefixes are not supported, nor is the xmlns attribute. This lack of namespace support affects compatibility with a large cross-section of XML specifications, and is one of the more controversial features of MicroXML. But the decision to omit namespaces was made for good reason.

Namespaces add extensive complexity to solve a minor problem. Many people who train or otherwise help others with XML can testify that namespaces are by far the hardest concept for users and developers to grasp. Namespaces complicate all derived specifications and software enormously. If MicroXML was to take a stand on simplicity, it had to take a stand against namespaces.

If you design new vocabularies with MicroXML, you might find that you miss XML namespaces far less than you might think. The difficulties come in adapting existing XML vocabularies that do use namespaces. In such cases, you must use transforms to strip namespaces during MicroXML phases, and reconstruct the namespaces for XML phases. The MicroXML community is working on tools and conventions to help with this process.

The prohibition on prefixes also means, for example, that you must use lang rather than xml:lang. However, MicroXML does not provide any special support for such attributes; vocabularies must set conventions for them.
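As one possible shape for the stripping step, here is a minimal Python sketch built on the standard xml.etree.ElementTree module. The function names local_name and strip_namespaces are hypothetical, and the approach is an assumption of this example rather than a convention from the MicroXML community. It simply discards namespace information, so it is only safe when local names do not collide across namespaces.

```python
import xml.etree.ElementTree as ET


def local_name(qualified):
    """Turn '{uri}name' into 'name'; unqualified names pass through."""
    return qualified.rsplit("}", 1)[-1]


def strip_namespaces(root):
    """Rewrite every element and attribute name to its local part, in place."""
    for elem in root.iter():
        elem.tag = local_name(elem.tag)
        renamed = {local_name(key): value for key, value in elem.attrib.items()}
        elem.attrib.clear()
        elem.attrib.update(renamed)
    return root


SOURCE = (
    '<a:feed xmlns:a="http://www.w3.org/2005/Atom" xml:lang="en">'
    "<a:title>Copia</a:title>"
    '<a:link rel="self" href="/blog/atom1.0"/>'
    "</a:feed>"
)

feed = strip_namespaces(ET.fromstring(SOURCE))
result = ET.tostring(feed, encoding="unicode")
```

After the call, xml:lang has become a plain lang attribute and the serialized result contains neither prefixes nor xmlns declarations.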


Examples of MicroXML

This section uses a fairly complete, real-world example of typical XML to show how it looks as MicroXML. Atom is a good sample format because it often includes a mix of namespaces. Listing 1 is based on a listing in the "Process Atom 1.0 with XSLT" tutorial on developerWorks. I removed one of the entry elements and used a namespace prefix of a for all the core Atom elements, such as feed, to help illustrate changes to namespaces in MicroXML.

Listing 1. Typical XML
<?xml version="1.0" encoding="utf-8"?>
<a:feed xmlns:a="http://www.w3.org/2005/Atom" xmlns="http://www.w3.org/1999/xhtml"
      xml:lang="en"
      xml:base="http://copia.ogbuji.net">
  <a:id>http://copia.ogbuji.net/atom1.0</a:id>
  <a:title>Copia</a:title>
  <a:updated>2005-07-15T12:00:00Z</a:updated>
  <a:author>
    <a:name>Uche Ogbuji</a:name>
    <a:uri>http://uche.ogbuji.net</a:uri>
  </a:author>
  <a:link href="/blog" />
  <a:link rel="self" href="/blog/atom1.0" />
  <a:entry>
    <a:id>http://copia.ogbuji.net/blog/2005-09-16/xhtml</a:id>
    <a:title>XHTML tutorial pubbed</a:title>
    <a:link href="http://copia.posterous.com/xhtml-tutorial-pubbed"/>
    <a:category term="xml"/>
    <a:category term="css"/>
    <a:category term="xhtml"/>
    <a:updated>2005-07-15T12:00:00Z</a:updated>
    <a:content type="xhtml">
      <div>
        <p>
          <a href="http://www.ibm.com/developerworks/edu/x-dw-x-xhtml-i.htm">
            "XHTML, step-by-step"
          </a>
        </p>
        <blockquote>
          <p>Start working with Extensible Hypertext Markup Language. In this tutorial,
          author Uche Ogbuji shows you how to use XHTML in practical Web sites.</p>
        </blockquote>
        <p>In this tutorial</p>
        <ul>
          <li>Tutorial introduction</li>
          <li>Anatomy of an XHTML Web page</li>
          <li>Understand the ground rules</li>
          <li>Replace common HTML idioms</li>
          <li>Some practical considerations</li>
          <li>Wrap up</li>
        </ul>
      </div>
    </a:content>
  </a:entry>
</a:feed>

Listing 2 is a MicroXML version of Listing 1. Notice the lack of an XML declaration, which is not supported in MicroXML and is unnecessary because only the UTF-8 encoding is supported. The listing includes no namespaces at all.

Listing 2. MicroXML version
<!-- http://www.w3.org/2005/Atom -->
<feed lang="en" base="http://copia.ogbuji.net">
  <id>http://copia.ogbuji.net/atom1.0</id>
  <title>Copia</title>
  <updated>2005-07-15T12:00:00Z</updated>
  <author>
    <name>Uche Ogbuji</name>
    <uri>http://uche.ogbuji.net</uri>
  </author>
  <link href="/blog" />
  <link rel="self" href="/blog/atom1.0" />
  <entry>
    <id>http://copia.ogbuji.net/blog/2005-09-16/xhtml</id>
    <title>XHTML tutorial pubbed</title>
    <link href="http://copia.posterous.com/xhtml-tutorial-pubbed"/>
    <category term="xml"/>
    <category term="css"/>
    <category term="xhtml"/>
    <updated>2005-07-15T12:00:00Z</updated>
    <content type="xhtml">
      <div>
        <p>
          <a href="http://www.ibm.com/developerworks/edu/x-dw-x-xhtml-i.htm">
            "XHTML, step-by-step"
          </a>
        </p>
        <blockquote>
          <p>Start working with Extensible Hypertext Markup Language. In this tutorial,
          author Uche Ogbuji shows you how to use XHTML in practical Web sites.</p>
        </blockquote>
        <p>In this tutorial</p>
        <ul>
          <li>Tutorial introduction</li>
          <li>Anatomy of an XHTML Web page</li>
          <li>Understand the ground rules</li>
          <li>Replace common HTML idioms</li>
          <li>Some practical considerations</li>
          <li>Wrap up</li>
        </ul>
      </div>
    </content>
  </entry>
</feed>

The code in Listing 2 is not Atom. Atom is defined with its own namespace plus XHTML for structured descriptions and content. You can easily convert the Listing 2 code to Atom XML by applying a syntactic XML transformation that adds xmlns attributes to the feed element and any div elements. An Atom div element is rarely confused with any other element, which illustrates my point that XML namespaces are used far more often than necessary.
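A transformation of this kind might look like the following Python sketch, which uses the standard xml.etree.ElementTree module and a shortened stand-in for Listing 2. The function name add_namespace_declarations is hypothetical. The sketch adds only the xmlns attributes mentioned above; it makes no attempt to map lang and base back to xml:lang and xml:base, which a complete conversion would also need.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
XHTML_NS = "http://www.w3.org/1999/xhtml"

# Shortened stand-in for the MicroXML document in Listing 2.
MICROXML = (
    '<feed lang="en"><title>Copia</title><entry>'
    '<content type="xhtml"><div><p>In this tutorial</p></div></content>'
    "</entry></feed>"
)


def add_namespace_declarations(root):
    """Add xmlns attributes to the root and to div elements, then serialize."""
    root.set("xmlns", ATOM_NS)
    for div in root.iter("div"):
        div.set("xmlns", XHTML_NS)
    return ET.tostring(root, encoding="unicode")


atom_xml = add_namespace_declarations(ET.fromstring(MICROXML))

# Parsing the output as XML shows that the elements are now namespaced.
reparsed = ET.fromstring(atom_xml)
```

The change is purely textual: the xmlns attributes are written out like any other attribute, and they take on their namespace meaning only when an XML parser reads the result.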

Listing 3 shows a MicroXML document that is close to a valid HTML5 document:

Listing 3. HTML5-like MicroXML
<html lang="en">
  <!-- A comment -->
  <head>
    <title>Welcome page</title>
  </head>
  <body>
    <p>Welcome to <a href="ibm.com/developerworks/">IBM developerWorks</a>.</p>
  </body>
</html>

The document in Listing 3 omits an HTML5 doctype declaration, which is prohibited in MicroXML. You can convert it into valid HTML5 by adding this line to the top:

<!DOCTYPE html>

Wrap-up

MicroXML has some important advantages over full XML, including benefits in areas such as mobile and cloud computing, and in security. Its lack of draconian error handling makes it a more flexible format for information interchange. With greater simplicity, MicroXML is more readily processed where resources are limited, such as on mobile devices, and more cheaply processed where resources are metered, as with cloud computing. MicroXML also has far fewer security problems, an important consideration for use on the network. Without document type declarations or entities, MicroXML documents are self-contained. Parsing a MicroXML document requires no access to a separate resource, which helps eliminate denial-of-service and spoofing attacks.

MicroXML is still an emerging specification. I don't necessarily advocate that you embrace it completely and turn away from full XML, but it is an important development. Understanding MicroXML can give you a good sense of how to use XML most effectively in the face of changes such as the emergence of JSON and HTML5. The MicroXML specification is small, and some tools are already in the offing. I encourage you to experiment with MicroXML.

Resources

Learn

Get products and technologies

  • IBM product evaluation versions: Download application development tools and middleware products from DB2®, Lotus®, Rational®, Tivoli®, and WebSphere®.
