Introducing MicroXML, Part 2: Process MicroXML with microxml-js

Experiment with a JavaScript MicroXML parser

MicroXML is a simplification of XML that is compatible with earlier versions. Part 1 of this two-article series covers the basic principles of MicroXML. MicroXML is designed with a straightforward grammar that can be processed with many modern general-purpose parsing tools. James Clark, who led the original push for MicroXML, is among those who developed a parser for the community specification. Learn how to use Clark's JavaScript MicroXML parser to experiment with the format.

Editor's note: This two-article series, originally published in 2012, was revised to reflect subsequent important updates to the MicroXML specification.


Uche Ogbuji, Partner, Zepheira, LLC

Uche Ogbuji is a partner at Zepheira, LLC, a solutions firm that specializes in the next generation of web technologies. Mr. Ogbuji is the lead developer of 4Suite, an open source platform for XML, RDF, and knowledge-management applications, and its successor Akara. He is a computer engineer and writer who was born in Nigeria and lives and works in Boulder, Colorado, US. You can find more about Mr. Ogbuji at his Weblog Copia, or on Twitter.



07 May 2013 (First published 12 June 2012)


MicroXML, a simplification of XML that is compatible with earlier versions, is an emerging specification under the W3C's Community Group process. In Part 1 of this series, "Explore the basic principles of MicroXML," you learned the basics of MicroXML and how it differs from XML 1.x and related standards.

MicroXML was proposed by James Clark and advanced by John Cowan, who also created its first parser, MicroLark (open source, Apache 2.0 license). MicroLark is written in the Java™ language and implements several modes of parsing: pull mode, push mode, and tree mode. Cowan has not yet updated MicroLark to conform to the latest community specification, but other emerging implementations include a project by James Clark with JavaScript and Java parser implementations.

In this article, learn to parse the MicroXML format using James Clark's JavaScript parser (microxml-js) in the browser.

Getting started

To follow along with the examples in this article, download microxml-js. You can either retrieve the code with Git, or click ZIP on the microxml-js GitHub page.

Unpack the downloaded files, navigate with your browser to the location where you saved them, and open the test.html file to display a page like the one in Figure 1:

Figure 1. Initial HTML test page from microxml-js
Screen capture of browser showing the initial HTML test page from microxml-js

The main feature of the page is the text area that includes the <doc></doc> content. To exercise the parser, type or paste MicroXML into that text area and click Parse. The parser converts the MicroXML into a JSON object according to the informational JSON rendition of the MicroXML data model. The resulting JSON code then displays in the JSON data model section. Figure 2 shows the result of parsing the MicroXML <doc></doc> line:

Figure 2. HTML test page from microxml-js after a test parse
Screen capture of browser showing the HTML test page from microxml-js after a test parse

In Figure 2, I highlighted the resulting JSON code in a yellow oval (itself not part of the browser display). The JSON code reads:

["doc",{},[]]

Experimenting with the sample MicroXML

Listing 1 is a slightly modified version of a simple MicroXML file from Part 1:

Listing 1. Simple MicroXML file
<html lang="en">
  <!-- A comment -->
  <head>
    <title>Welcome page</title>
  </head>
  <body>
    <p>Welcome to <a href="http://ibm.com/developerworks/">IBM developerWorks</a>.</p>
  </body>
</html>

Figure 3 shows the result of pasting Listing 1 into the test.html text area and parsing it. (Ignore the red dotted underlining in the text area, which comes from Firefox's spell-checker.)

Figure 3. HTML test page from microxml-js after parsing Listing 1
Screen capture of browser showing the HTML test page from microxml-js after parsing Listing 1

In Figure 3, the visible portion of the JSON code reads:

["html",{"lang":"en"},["\n  \n  ",["head",{},["\n    ",["title",{},["Welcome page"]],"\n

Improving the display

Notice that the JSON code stretches beyond the right border of the browser window. To see all the JSON code, you can scroll left and right. It would be nice to get the code pretty-printed so that you can more easily make out the structure of the resulting JSON. I implemented that effect with a small change to this line in test.html:

document.getElementById("json").textContent = 
   JSON.stringify(MicroXML.parse(textarea.value));

I changed it to:

document.getElementById("json").textContent = 
   JSON.stringify(MicroXML.parse(textarea.value), undefined, 2);

I saved the modified file as test1.html, and I also changed the page title a bit to track which version I loaded. With test1.html, loading up and parsing Listing 1 yields the result in Figure 4:

Figure 4. Modified HTML test page from microxml-js after parsing Listing 1
Screen capture of browser showing the modified HTML test page from microxml-js, after parsing Listing 1
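The effect of that change can be tried outside the test page. The following is a minimal sketch that applies JSON.stringify to the data model shown earlier for <doc></doc>; the variable names are illustrative only. The second argument (the replacer) is left undefined so that nothing is filtered, and the third argument sets the number of spaces per nesting level.

```javascript
// The data model that the test page reports for <doc></doc>.
const model = ["doc", {}, []];

// Default output: everything on one compact line.
const compact = JSON.stringify(model);

// Third argument = number of spaces per nesting level.
// Second argument (the replacer) stays undefined so nothing is filtered.
const pretty = JSON.stringify(model, undefined, 2);

console.log(compact);
console.log(pretty);
```

Empty objects and arrays stay on one line ({} and []) even in the indented form, which is why Listing 2 shows empty attribute sets that way.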

A close look at the data model

Listing 2 shows the full output of the JSON data model that is generated from Listing 1:

Listing 2. JSON data model that is parsed from Listing 1
[
  "html",
  {
    "lang": "en"
  },
  [
    "\n  \n  ",
    [
      "head",
      {},
      [
        "\n    ",
        [
          "title",
          {},
          [
            "Welcome page"
          ]
        ],
        "\n  "
      ]
    ],
    "\n  ",
    [
      "body",
      {},
      [
        "\n    ",
        [
          "p",
          {},
          [
            "Welcome to ",
            [
              "a",
              {
                "href": "http://ibm.com/developerworks/"
              },
              [
                "IBM developerWorks"
              ]
            ],
            "."
          ]
        ],
        "\n  "
      ]
    ],
    "\n"
  ]
]

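The data model is simple. Lists, such as MicroXML element content, become JSON lists. Mappings, such as attribute sets, become JSON objects. An element is a list of three items: its name as a (Unicode) string, an object for its attributes, and then a list of its children, each of which is either a text string or another element. Notice that the comment in Listing 1 does not appear in the data model: comments are allowed in MicroXML syntax but are not part of its data model, so the parser discards them. The only trace is the first child of html, "\n  \n  ", which joins the white space on either side of the comment.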
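Because every element has the same three-item shape, a few lines of recursion are enough to work with the model. The following is a minimal sketch, not part of microxml-js: textOf and findAll are hypothetical helper names, and the sample model is the p element from Listing 2 typed in by hand.

```javascript
// An element in the JSON data model is [name, attributes, children];
// each child is either a string (text) or another element array.
const model = [
  "p",
  {},
  [
    "Welcome to ",
    ["a", { href: "http://ibm.com/developerworks/" }, ["IBM developerWorks"]],
    "."
  ]
];

// Concatenate all text beneath an element, depth first.
function textOf(element) {
  const [, , children] = element;
  return children
    .map((child) => (typeof child === "string" ? child : textOf(child)))
    .join("");
}

// Collect every element with the given name, including the root itself.
function findAll(element, name, found = []) {
  const [elementName, , children] = element;
  if (elementName === name) found.push(element);
  for (const child of children) {
    if (Array.isArray(child)) findAll(child, name, found);
  }
  return found;
}

console.log(textOf(model));                  // Welcome to IBM developerWorks.
console.log(findAll(model, "a")[0][1].href); // http://ibm.com/developerworks/
```

Attributes are read as ordinary object properties, so no special accessor is needed.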


Error handling

As with any XML or MicroXML parser, you must understand what happens in the case of erroneous input. For example, paste this line of malformed XML (the example from Part 1) into the parser's test text area:

<para>Hello, I claim to be <strong>MicroXML</para>

Figure 5 shows the output:

Figure 5. Test output from malformed MicroXML
Screen capture of browser showing test output from malformed MicroXML

In this case an error message (Parse error: name "para" in end-tag does not match name "strong" in start-tag.) displays right below the text area, and the JSON data model is blank. The parser code also highlights the location of the error (in this case para). Clark's parser does not recover from errors but stops immediately with a report of the error, much like an XML 1.0 parser does. Clark is also working on a version of the parser that supports error recovery.
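In your own code, this stop-at-first-error behavior means that each parse either yields a complete data model or nothing at all. The following is a minimal sketch of a wrapper that makes the two outcomes explicit. It assumes that the parser reports errors by throwing a value that carries a message, which I have not verified against the microxml-js source. The tryParse and standInParse names are hypothetical, and standInParse is only a placeholder so that the sketch runs without the library. On the test page you would pass MicroXML.parse instead.

```javascript
// Wrap any parse function so callers get a result object instead of an exception.
// `parse` is passed in; on the test page it would be MicroXML.parse.
function tryParse(parse, text) {
  try {
    return { ok: true, model: parse(text) };
  } catch (err) {
    // Assumption: the thrown value carries a human-readable `message`.
    return { ok: false, message: String(err && err.message ? err.message : err) };
  }
}

// Hypothetical stand-in parser, for demonstration only: it accepts just
// <doc></doc> and throws on anything else.
function standInParse(text) {
  if (text.trim() === "<doc></doc>") return ["doc", {}, []];
  throw new Error("stand-in parser: input rejected");
}

console.log(tryParse(standInParse, "<doc></doc>"));
console.log(tryParse(standInParse, "<para>Hello, I claim to be <strong>MicroXML</para>"));
```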

Misplaced XML 1.0

XML 1.0 is still the dominant format in use and will be for a long time to come. The most common errors that MicroXML parsers encounter come from XML 1.0 features that were accidentally left in. Listing 1 is MicroXML meant to look like the XML flavor of HTML5. I omitted the <!DOCTYPE html> declaration that is recommended for XHTML5 because it is not allowed in MicroXML. If you restore it and paste the result into the text area, you get the error in Figure 6:

Figure 6. Test output from MicroXML with erroneous doctype declaration
Screen capture of browser showing test output from MicroXML with an erroneous doctype declaration

The parser highlights only one character in the input text (the D in DOCTYPE) to mark the error. I added an oval highlight in Figure 6 to emphasize it. The error message is Parse error: expected "-". After <!, the only thing the parser accepts is the -- that opens a comment.

XML 1.0 namespaces

Another likely error is persistence of XML namespaces, which are eliminated from MicroXML. Figure 7 demonstrates the banning of the xmlns attribute:

Figure 7. Test output from MicroXML with erroneous xmlns attribute
Screen capture of browser showing test output from MicroXML with an erroneous xmlns attribute

In Figure 7, the input text is:

<doc xmlns="http://spam.com">
</doc>

This input causes the parser to output Parse error: "xmlns" is not allowed as an attribute name.

Figure 8 demonstrates the banning of colons in element names:

Figure 8. Test output from MicroXML with erroneous colon in element name
Screen capture of browser showing test output from MicroXML with erroneous colon in element name

In Figure 8, the input text is:

<x:doc xmlns:x="http://spam.com">
</x:doc>

This input causes the parser to output Parse error: expected ">".

Figure 9 demonstrates the banning of colons even in attribute names:

Figure 9. Test output from MicroXML with erroneous colon in attribute name
Screen capture of browser showing test output from MicroXML with erroneous colon in attribute name

In Figure 9, the code in the text area is:

<doc xml:id='mydoc'>
</doc>

This code is valid XML 1.0 because the xml prefix is a special one that does not require declaration. You can no longer use even this prefix in MicroXML, though, because colons are banned in attribute names as well. In this case the parser outputs Parse error: expected "=".
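When you convert existing XML 1.0 documents, it can help to scan for these leftovers before you hand the text to the parser. The following is a rough sketch of such a scan. The findXmlLeftovers name and the regular expressions are my own illustration, not part of microxml-js, and they are heuristics rather than a parser: they can misfire on markup-like text inside comments or attribute values.

```javascript
// Heuristic checks for XML 1.0 features that MicroXML rejects.
// These patterns are approximations, not a parser.
const CHECKS = [
  { label: "DOCTYPE declaration", pattern: /<!DOCTYPE\b/ },
  { label: "xmlns attribute", pattern: /<[^<>]*\sxmlns\s*=/ },
  { label: "colon in element name", pattern: /<\/?[^\s<>\/!?:]+:/ },
  { label: "colon in attribute name", pattern: /<[^<>]*\s[^\s<>="':]+:[^\s<>="']*\s*=/ }
];

// Return the labels of every check that matches somewhere in the text.
function findXmlLeftovers(text) {
  return CHECKS
    .filter(({ pattern }) => pattern.test(text))
    .map(({ label }) => label);
}

console.log(findXmlLeftovers('<!DOCTYPE html>\n<html lang="en"></html>'));
console.log(findXmlLeftovers('<doc xmlns="http://spam.com">\n</doc>'));
console.log(findXmlLeftovers('<x:doc xmlns:x="http://spam.com">\n</x:doc>'));
console.log(findXmlLeftovers("<doc xml:id='mydoc'>\n</doc>"));
```

An empty result does not prove that the text is valid MicroXML; only the parser can decide that. The scan merely gives a friendlier hint than expected ">" when the cause is a leftover namespace prefix.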

Character errors

MicroXML also restricts the way that you can represent characters. The most notable restriction, and the one that I think most likely to cause errors in the field, is the banning of any encodings except for UTF-8. These days, more software is Unicode-aware and can produce UTF-8, but it can still be difficult even for developers to get that right.

In MicroXML, numeric character entities must be written in hexadecimal form; the decimal form that XML 1.0 allows is not accepted. Figure 10 demonstrates the error from a leftover decimal entity:

Figure 10. Test output from MicroXML with erroneous decimal character entity
Screen capture of browser showing test output from MicroXML with erroneous decimal character entity

In Figure 10, the input text is:

<doc> &#160; </doc>

As the parse error text (Parse error: expected "x".) suggests, you can fix this error by switching to hexadecimal entity form: &#xA0; in this instance.
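For documents with many decimal entities, the conversion is easy to automate. The following is a minimal sketch; decimalRefsToHex is a hypothetical helper name, and the function works on raw text, so it also rewrites decimal entities that happen to appear inside comments.

```javascript
// Rewrite decimal character entities (&#160;) in hexadecimal form (&#xA0;).
// Hexadecimal entities are left alone because the "x" does not match [0-9].
function decimalRefsToHex(text) {
  return text.replace(/&#([0-9]+);/g, (match, digits) =>
    "&#x" + Number(digits).toString(16).toUpperCase() + ";"
  );
}

console.log(decimalRefsToHex("<doc> &#160; </doc>")); // <doc> &#xA0; </doc>
```

The sketch does not check that the resulting code point is one that MicroXML permits; the parser still has the final say.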

MicroXML also bans unescaped greater-than (>) characters (sometimes known as "right angle brackets") in element or attribute content, as in Figure 11:

Figure 11. Test output from MicroXML with erroneous unescaped greater than character
Screen capture of browser showing test output from MicroXML with erroneous unescaped greater than character

In Figure 11, the input text <doc> > </doc> results in Parse error: ">" characters must always be escaped. To fix this error, escape the greater-than (>) character as &gt;.
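If you generate MicroXML from program data, the simplest way to avoid this error is to escape text before you place it in element content. The following is a minimal sketch; escapeText is a hypothetical helper name and handles element content only (attribute values also need their quote character escaped).

```javascript
// Escape the characters that must not appear literally in MicroXML text.
// The ampersand goes first so that the entities added later are not re-escaped.
function escapeText(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

console.log("<doc>" + escapeText(" > ") + "</doc>"); // <doc> &gt; </doc>
```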


Wrap-up

A specification means little without practical implementation. It was important for supporters of MicroXML to step up and implement it. John Cowan led the way with MicroLark, and James Clark wrote the first implementation of the community spec when it emerged. I am implementing it for Python 3. As a developer, I usually find MicroXML parsers much easier to learn and work with than XML parsers.

The JavaScript interface to Clark's parser is loosely documented. If you want to dig deeper, start by working with the JavaScript object output it produces. The output is similar to the XML Document Object Model (DOM) but much easier to process with common web coding techniques. microxml-js is an easy way to begin developing your own MicroXML applications, including for mobile usage.
