MicroXML, a simplification of XML that is compatible with earlier versions, is an emerging specification under the W3C's Community Group process. In Part 1 of this series, "Explore the basic principles of MicroXML," you learned the basics of MicroXML and how it differs from XML 1.x and related standards.
MicroXML was proposed by James Clark and advanced by John Cowan, who also created its first parser, MicroLark (open source, Apache 2.0 license). MicroLark is written in the Java™ language and implements several modes of parsing: pull mode, push mode, and tree mode. Cowan has not yet updated MicroLark to conform to the latest community specification, but other emerging implementations include a project by James Clark with JavaScript and Java parser implementations.
In this article, learn to parse the MicroXML format using James Clark's JavaScript parser (microxml-js) in the browser.
To follow along with the examples in this article, download microxml-js (see Resources). You can either use software that can retrieve code from the Git version-control system, or click ZIP on the microxml-js GitHub page.
Unpack the downloaded files, navigate with your browser to the location where you saved them, and open the test.html file to display a page like the one in Figure 1:
Figure 1. Initial HTML test page from microxml-js
The main feature of the page is the text area that includes the <doc></doc> content. To exercise the parser, type or paste MicroXML into that text area and click Parse. The parser converts the MicroXML into a JSON object according to the informational JSON rendition of the MicroXML data model. The resulting JSON code then displays in the JSON data model section. Figure 2 shows the result of parsing the MicroXML <doc></doc> line:
Figure 2. HTML test page from microxml-js after a test parse
In Figure 2, I highlighted the resulting JSON code in a yellow oval (itself not part of the browser display). The JSON code reads:
["doc",{},[]]
Experimenting with the sample MicroXML
Listing 1 is a slightly modified version of a simple MicroXML file from Part 1:
Listing 1. Simple MicroXML file
<html lang="en">
<!-- A comment -->
<head>
<title>Welcome page</title>
</head>
<body>
<p>Welcome to <a href="http://ibm.com/developerworks/">IBM developerWorks</a>.</p>
</body>
</html>
Figure 3 shows the result of pasting Listing 1 into the test.html text area and parsing it. (Ignore the red dotted underlining in the text area, which comes from Firefox's spell-checker.)
Figure 3. HTML test page from microxml-js after parsing Listing 1
In Figure 3, the visible portion of the JSON code reads:
["html",{"lang":"en"},["\n \n ",["head",{},["\n ",["title",{},["Welcome page"]],"\n
Notice that the JSON code stretches beyond the right border of the browser window. To see all the JSON code, you can scroll left and right. It would be nice to get the code pretty-printed so that you can more easily make out the structure of the resulting JSON. I implemented that effect with a small change to this line in test.html:
document.getElementById("json").textContent =
JSON.stringify(MicroXML.parse(textarea.value));
I changed it to:
document.getElementById("json").textContent =
JSON.stringify(MicroXML.parse(textarea.value), undefined, 2);
I saved the modified file as test1.html, and I also changed the page title a bit to track which version I loaded. With test1.html, loading up and parsing Listing 1 yields the result in Figure 4:
Figure 4. Modified HTML test page from microxml-js after parsing Listing 1
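You can see the effect of the third argument to JSON.stringify without opening the test page. The following is a minimal sketch that uses a hand-written copy of the data model from Figure 2 rather than a call to the parser. The model variable is only an illustration and is not part of microxml-js.

```javascript
// Hand-written copy of the data model for <doc></doc> (not produced by the parser here).
const model = ["doc", {}, []];

// Default call: everything on one line, as in the original test.html.
const compact = JSON.stringify(model);

// Third argument of 2: indent each nesting level by two spaces.
const pretty = JSON.stringify(model, undefined, 2);

console.log(compact);
console.log(pretty);
```

The second argument (undefined here) is the optional replacer, which this change does not need.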
A close look at the data model
Listing 2 shows the full output of the JSON data model that is generated from Listing 1:
Listing 2. JSON data model that is parsed from Listing 1
[
  "html",
  {
    "lang": "en"
  },
  [
    "\n \n ",
    [
      "head",
      {},
      [
        "\n ",
        [
          "title",
          {},
          [
            "Welcome page"
          ]
        ],
        "\n "
      ]
    ],
    "\n ",
    [
      "body",
      {},
      [
        "\n ",
        [
          "p",
          {},
          [
            "Welcome to ",
            [
              "a",
              {
                "href": "http://ibm.com/developerworks/"
              },
              [
                "IBM developerWorks"
              ]
            ],
            "."
          ]
        ],
        "\n "
      ]
    ],
    "\n"
  ]
]
The data model is simple. Lists, such as MicroXML element content, become JSON lists. Mappings, such as attribute sets, become JSON objects. An element is a list of three items: its name as a (Unicode) string, an object for its attributes, and then a list of its children. Notice that Listing 1 contains a comment that is missing in the data model.
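Because an element is just a three-item list, ordinary JavaScript is enough to walk the model. The following sketch shows one possible way to collect the text of an element and to find descendant elements by name. The helper names textOf and findAll are my own, not part of microxml-js, and the sample data is a hand-copied fragment of Listing 2.

```javascript
// An element is [name, attributes, children].
// A child is either a string (text) or another element.
function textOf(node) {
  if (typeof node === "string") {
    return node;
  }
  // Concatenate the text of all children, in document order.
  return node[2].map(textOf).join("");
}

// Collect the element itself and every descendant element with the given name.
function findAll(element, name, found = []) {
  if (element[0] === name) {
    found.push(element);
  }
  for (const child of element[2]) {
    if (Array.isArray(child)) {
      findAll(child, name, found);
    }
  }
  return found;
}

// The p element from Listing 2, copied by hand.
const p = ["p", {}, [
  "Welcome to ",
  ["a", { "href": "http://ibm.com/developerworks/" }, ["IBM developerWorks"]],
  "."
]];

console.log(textOf(p));                              // Welcome to IBM developerWorks.
console.log(findAll(p, "a").map((a) => a[1].href));  // the href of each a element
```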
As with any XML or MicroXML parser, you must understand what happens in the case of erroneous input. For example, paste this line of malformed XML (the example from Part 1) into the parser's test text area:
<para>Hello, I claim to be <strong>MicroXML</para>
Figure 5 shows the output:
Figure 5. Test output from malformed MicroXML
In this case, an error message (Parse error: name "para" in end-tag does not match name "strong" in start-tag.) displays right below the text area, and the JSON data model is blank. The parser code also highlights the location of the error (in this case para). Clark's parser does not recover from errors but stops immediately with a report of the error, much like an XML 1.0 parser does. Clark is also working on a version of the parser that supports error recovery.
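Because the parser stops at the first error, calling code must be ready for a failed parse. The sketch below assumes that the parse function reports an error by throwing an exception that carries a message property. That is an assumption about microxml-js, not something this article documents. The names tryParse and fakeParse are hypothetical, and fakeParse only stands in for MicroXML.parse so that the example runs without the library.

```javascript
// Wrap any parse function so callers get a result object instead of an exception.
function tryParse(parseFn, source) {
  try {
    return { ok: true, model: parseFn(source), error: null };
  } catch (e) {
    // Keep only the human-readable message for display.
    return { ok: false, model: null, error: String(e && e.message ? e.message : e) };
  }
}

// Stand-in for MicroXML.parse (hypothetical): accepts only "<doc></doc>".
function fakeParse(source) {
  if (source !== "<doc></doc>") {
    throw new Error("Parse error: stand-in parser accepts only <doc></doc>");
  }
  return ["doc", {}, []];
}

const good = tryParse(fakeParse, "<doc></doc>");
const bad = tryParse(fakeParse, "<para>Hello, I claim to be <strong>MicroXML</para>");

console.log(good.ok, JSON.stringify(good.model));
console.log(bad.ok, bad.error);
```

In the browser, you would pass MicroXML.parse in place of fakeParse, provided the assumption about thrown exceptions holds for the version you downloaded.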
XML 1.0 is still the dominant format in use and will be for a long time to come. The most common errors that MicroXML parsers encounter stem from XML features that were accidentally left in. Listing 1 is MicroXML meant to look like the XML flavor of HTML5. I omitted the <!DOCTYPE html> declaration that is recommended for XHTML5 because it is not allowed in MicroXML. If you restore it and paste the result into the text area, you get the error in Figure 6:
Figure 6. Test output from MicroXML with erroneous doctype declaration
The parser highlights only one character in the input text (the D in DOCTYPE) to mark the error. I added an oval highlight in Figure 6 to emphasize it. The error message is Parse error: expected "-". After <!, the parser expects the -- of the comment syntax, which is the only construct in MicroXML that begins that way.
Another likely error is persistence of XML namespaces, which are eliminated from MicroXML. Figure 7 demonstrates the banning of the xmlns attribute:
Figure 7. Test output from MicroXML with erroneous xmlns attribute
In Figure 7, the input text is:
<doc xmlns="http://spam.com"> </doc>
This input causes the parser to output Parse error: "xmlns" is not allowed as an attribute name.
Figure 8 demonstrates the banning of colons in element names:
Figure 8. Test output from MicroXML with erroneous colon in element name
In Figure 8, the input text is:
<x:doc xmlns:x="http://spam.com"> </x:doc>
This input causes the parser to output Parse error: expected ">".
Figure 9 demonstrates the banning of colons even in attribute names:
Figure 9. Test output from MicroXML with erroneous colon in attribute name
In Figure 9, the code in the text area is:
<doc xml:id='mydoc'> </doc>
This code is valid XML 1.0 because the xml prefix is a special one that does not require declaration. You can no longer use even this prefix in MicroXML, though, because colons are banned in attribute names as well. In this case the parser outputs Parse error: expected "=".
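If you convert existing XML by hand, a quick scan for these leftovers before parsing can save a few round trips. The following is a rough sketch that uses regular expressions to flag the DOCTYPE, xmlns, and colon problems shown in Figures 6 through 9. It is a heuristic of my own, not part of microxml-js. It can report false positives (for example, on text inside comments), and it does not replace a real parse. The names CHECKS and findXmlLeftovers are hypothetical.

```javascript
// Each entry pairs a label with a pattern for one XML 1.0 feature
// that MicroXML does not allow.
const CHECKS = [
  ["DOCTYPE declaration", /<!DOCTYPE/],
  ["xmlns attribute", /\sxmlns\s*=/],
  ["colon in element name", /<\/?[^\s<>\/:=]+:/],
  ["colon in attribute name", /\s[^\s<>\/:="']+:[^\s<>\/:="']+\s*=/],
];

// Return the labels of all checks whose pattern occurs in the source text.
function findXmlLeftovers(source) {
  return CHECKS
    .filter(([, pattern]) => pattern.test(source))
    .map(([label]) => label);
}

console.log(findXmlLeftovers('<doc xmlns="http://spam.com"> </doc>'));
console.log(findXmlLeftovers("<doc xml:id='mydoc'> </doc>"));
console.log(findXmlLeftovers("<doc></doc>"));
```

An empty result means only that none of these particular patterns matched. The parser remains the final authority on whether the input is MicroXML.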
MicroXML also restricts the way that you can represent characters. The most notable restriction, and the one that I think most likely to cause errors in the field, is the banning of any encodings except for UTF-8. These days, more software is Unicode-aware and can produce UTF-8, but it can still be difficult even for developers to get that right.
In MicroXML, you can use only hexadecimal-encoded character entities. Figure 10 demonstrates the error from an existing decimal entity:
Figure 10. Test output from MicroXML with erroneous decimal character entity
In Figure 10, the input text is:
<doc> &#160; </doc>
As the parse error text (Parse error: expected "x".) suggests, you can fix this error by switching to hexadecimal entity form: &#xA0; in this instance.
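When a document contains many decimal references, you can rewrite them mechanically. The sketch below is one way to do that with a regular expression. The function name decimalToHexRefs is hypothetical, and the approach is a plain text substitution, so it also rewrites matches that occur inside comments.

```javascript
// Rewrite decimal character references (such as &#160;) as
// hexadecimal ones (such as &#xA0;). Hexadecimal references are left alone
// because the pattern requires a digit right after "&#".
function decimalToHexRefs(source) {
  return source.replace(/&#([0-9]+);/g, (match, digits) =>
    "&#x" + Number(digits).toString(16).toUpperCase() + ";");
}

console.log(decimalToHexRefs("<doc>&#160;</doc>")); // <doc>&#xA0;</doc>
```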
MicroXML also bans unescaped greater-than (>) characters (sometimes known as "right angle brackets") in element or attribute content, as in Figure 11:
Figure 11. Test output from MicroXML with erroneous unescaped greater than character
In Figure 11, the input text of <doc> > </doc> results in Parse error: ">" characters must always be escaped. To fix this error, escape the greater-than (>) character with &gt;.
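If you generate MicroXML from a program, it is safest to escape text before you write it out. The following is a minimal sketch of such a helper. The names escapeText and escapeAttribute are my own, and the attribute version assumes that you delimit attribute values with double quotation marks.

```javascript
// Escape text for use as MicroXML element content.
// The ampersand must be replaced first so that the entities
// produced by the later steps are not escaped a second time.
function escapeText(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Escape text for use inside a double-quoted attribute value.
function escapeAttribute(text) {
  return escapeText(text).replace(/"/g, "&quot;");
}

console.log("<doc>" + escapeText(" > ") + "</doc>"); // <doc> &gt; </doc>
```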
A specification means little without practical implementation. It was important for supporters of MicroXML to step up and implement it. John Cowan led the way with MicroLark, and James Clark wrote the first implementation of the community spec when it emerged. I am implementing it for Python 3. As a developer, usually I find MicroXML parsers much easier to learn and work with than XML parsers.
The JavaScript interface to Clark's parser is loosely documented. If you want to dig deeper, start by working with the JavaScript object output that it produces. The output is similar to the XML Document Object Model (DOM) but much easier to process with common web coding techniques. microxml-js is an easy way to begin developing your own MicroXML applications, including for mobile usage.
Learn
- "Introducing MicroXML, Part 1: Explore the basic principles of MicroXML" (Uche Ogbuji, developerWorks, May 2013): Learn the basics of MicroXML in Part 1 of this series.
- MicroXML Community Group: The MicroXML Community Group hosts the MicroXML Specification, relevant discussion, and a wiki. The specification is itself a MicroXML document, and uses the HTML 4 vocabulary.
- JavaScript Object Notation (JSON): Learn about this popular format for web applications.
- "Social networks meet open-source project hosting" (Uche Ogbuji, developerWorks, May 2012): Learn about GitHub and other such social coding communities.
- Thinking XML: Read Uche Ogbuji's column series on developerWorks.
- New to XML: Get the resources that you need to learn XML.
- XML area on developerWorks: Find the resources that you need to advance your skills in the XML arena, including DTDs, schemas, and XSLT. See the XML technical library for a wide range of technical articles and tips, tutorials, standards, and IBM Redbooks.
- developerWorks on Twitter: Join today to follow developerWorks tweets.
- developerWorks podcasts: Listen to interesting interviews and discussions for software developers.
- developerWorks on-demand demos: Watch demos ranging from product installation and setup for beginners to advanced functionality for experienced developers.
Get products and technologies
- James Clark's MicroXML parser projects: Grab the source code, including microxml-js.
- John Cowan's MicroLark project: Keep an eye on this project, which Cowan is now updating for MicroXML.
- IBM product evaluation versions: Download application development tools and middleware products from DB2®, Lotus®, Rational®, Tivoli®, and WebSphere®.
Discuss
- developerWorks XML forums: Participate in any of several XML-related discussions.
- The developerWorks community: Connect with other developerWorks users while you explore the developer-driven blogs, forums, groups, and wikis.

Uche Ogbuji is a partner at Zepheira, LLC, a solutions firm that specializes in the next generation of web technologies. Mr. Ogbuji is the lead developer of 4Suite, an open source platform for XML, RDF, and knowledge-management applications, and its successor Akara. He is a computer engineer and writer who was born in Nigeria and lives and works in Boulder, Colorado, US. You can find more about Mr. Ogbuji at his Weblog Copia, or on Twitter.



