Level: Intermediate
Edd Dumbill (edd@xml.com), Editor and publisher, xmlhack.com
01 Oct 2002
XML's syntax has brought many benefits due to its interoperability, yet it can be tiresome to author XML documents. Edd Dumbill examines a range of alternative syntaxes for XML, and discusses their benefits and drawbacks.
What's in a name? That which we call a rose
By any other name would smell as sweet.
-William Shakespeare, Romeo and Juliet
One of the paradoxes of XML is that despite having a heritage from the document-creation community, it can often be remarkably frustrating to author by hand. The extra typing required to open and close tags and escape special characters not only wastes time, but introduces more possibility for error. If you don't want to buy an editor to help you get around this -- and many people don't, for various reasons including taste, principle, and the sheer intractability of creating a general-purpose XML editor -- then you're stuck editing in longhand.
SGML, the document-oriented ancestor of XML, had a way round this. SGML included ways of adding shortcuts to reduce the amount of tagging required, and could even completely redefine document syntax. However, when XML was created, this functionality was omitted to simplify the language and increase interoperability.
Over time, though, many of the features in SGML have been reimplemented for XML -- either by standards organizations, or just by community efforts. This is somewhat ironic as, in the early days of XML, its proponents took great delight in proclaiming the simplicity of XML over SGML. Now, with all of XML's bolt-ons, the complexities of the two technologies are at least comparable!
The purpose of this article is to survey some of the most popular alternative syntaxes developed for XML, and highlight their areas of usefulness. I will not attempt to list them all, as many people have already made endeavours in this area. Alternative syntaxes have been created for various reasons: to save effort, to mimic favorite environments, to better illustrate the underlying data model, or to work better with existing tools. (In answer to the obvious question about decreasing interoperability through other syntaxes, note that none of these syntaxes purports to be an exchange syntax -- that is still left to the XML 1.0 syntax.)
Ease of use
The first two syntaxes I want to examine are major contenders in the labor-saving category: PYX and SOX.
PYX is a line-oriented alternative syntax for XML. It was covered in detail on developerWorks earlier this year by David Mertz (see Resources). Unlike the other syntaxes I will cover in this article, PYX is mainly useful as an alternative output syntax. As you will see, authoring XML using PYX does not appear to be very practical. Created by Sean McGrath, PYX is based on an SGML (surprise, surprise!) concept called Element Structure Information Set, or ESIS. PYX uses the first character of every line to represent a markup event such as an open-tag or attribute. Table 1 shows what a basic XHTML page might look like in PYX.
Table 1. Comparison of PYX against XML
PYX version:
(html
Axmlns http://www.w3.org/1999/xhtml
-\n
(head
(title
-Test page
)title
)head
-\n
(body
-\n
)body
-\n
)html
XML version:
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Test page</title></head>
<body>
</body>
</html>
PYX's chief advantage is in leveraging the rich supply of line-oriented tools, especially under the UNIX-like operating systems that have emerged over the last 20 years. Instead of having to rewrite tools to process documents using SAX or DOM, you can use familiar tools like grep and wc. For example, to count the number of paragraphs in a document, you could use the command:
grep '^(p$' document.pyx | wc -l
As with all of the syntaxes discussed in this article, the creator of PYX also released tools to convert PYX to and from XML. For more information on these, see Resources.
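To make the mapping in Table 1 concrete, here is a minimal Python sketch that walks a DOM tree and emits PYX lines. It is not Sean McGrath's converter: the names `escape`, `to_pyx`, and `xml_to_pyx` are hypothetical, the backslash-escaping rule is an assumption, and comments and processing instructions are simply skipped.

```python
from xml.dom import minidom


def escape(text):
    # PYX holds one event per line, so newlines and tabs in data are escaped.
    # Escaping the backslash itself is an assumption of this sketch.
    return text.replace("\\", "\\\\").replace("\n", "\\n").replace("\t", "\\t")


def to_pyx(node, out):
    """Append PYX lines for a DOM node to the list `out` (illustrative helper)."""
    if node.nodeType == node.ELEMENT_NODE:
        out.append("(" + node.tagName)
        for name, value in node.attributes.items():
            out.append("A" + name + " " + escape(value))
        for child in node.childNodes:
            to_pyx(child, out)
        out.append(")" + node.tagName)
    elif node.nodeType in (node.TEXT_NODE, node.CDATA_SECTION_NODE):
        out.append("-" + escape(node.data))
    # Comments and processing instructions are ignored in this sketch.


def xml_to_pyx(xml_text):
    """Return the PYX lines for an XML document given as a string."""
    out = []
    to_pyx(minidom.parseString(xml_text).documentElement, out)
    return out


SAMPLE = (
    '<html xmlns="http://www.w3.org/1999/xhtml">\n'
    "<head><title>Test page</title></head>\n"
    "<body>\n"
    "</body>\n"
    "</html>"
)

if __name__ == "__main__":
    print("\n".join(xml_to_pyx(SAMPLE)))
```

Because every event sits on its own line, counting elements reduces to counting lines with a given prefix, which is what makes tools such as grep and wc applicable.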
SOX, or Simple Outline XML, draws on another common text formatting pattern, the outline. An outline is a common name given to a tree-shaped hierarchy in a document. Such a hierarchy may be expressed through special character sequences or indentation. SOX uses indentation to indicate the level of nesting of XML elements. By doing this, it can omit the closing tag of an element. Listing 1 shows the example from Table 1 in SOX.
Listing 1. Simple XHTML document expressed in SOX
html>
    xmlns http://www.w3.org/1999/xhtml
    head>
        title> Test page
    body>
A new user of SOX will not find it a completely unfamiliar step from XML 1.0, especially when compared to PYX. SOX's primary advantage is that it is much easier to ensure that a document is well-formed when editing with a simple text editor. Also, the restriction of one element per line means that some line-oriented processing is still possible, which means it's harder to write an obscure document.
However, because SOX is designed to be edited, and because it preserves the use of the greater than sign (>) as a special character, it has some more subtle rules for dealing with whitespace, escaping characters, and so on. For complete details, see the SOX page mentioned in Resources. Due to these subtleties, it's hard to see that SOX actually presents much benefit beyond using a decent developer's text editor with an XML editing mode, such as emacs or vim -- you still need to run your SOX file through the SOX-to-XML converter in order to check its correctness.
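The core idea of SOX, deriving closing tags from indentation, can be sketched in a few lines of Python. This is a deliberately simplified dialect and not Arnold deVos' implementation: the function name `outline_to_xml` is hypothetical, every line is assumed to be `name>` followed by optional text, indentation is assumed to use spaces only, and attributes and SOX's real whitespace and escaping rules are left out.

```python
from xml.sax.saxutils import escape


def outline_to_xml(source):
    """Convert a simplified SOX-like outline to XML text.

    Assumption: every non-blank line is `name>` optionally followed by text.
    Attributes, comments, and SOX's escaping rules are not handled.
    """
    out = []
    stack = []  # (indent, name) for each element that is still open
    for raw in source.splitlines():
        if not raw.strip():
            continue
        indent = len(raw) - len(raw.lstrip(" "))
        name, sep, text = raw.strip().partition(">")
        if not sep or not name:
            raise ValueError("expected 'name>' in line: %r" % raw)
        # Close every open element that is not an ancestor of this line.
        while stack and stack[-1][0] >= indent:
            out.append("</%s>" % stack.pop()[1])
        out.append("<%s>%s" % (name, escape(text.strip())))
        stack.append((indent, name))
    # Close whatever is still open at the end of the outline.
    while stack:
        out.append("</%s>" % stack.pop()[1])
    return "".join(out)


OUTLINE = """\
html>
    head>
        title> Test page
    body>
"""

if __name__ == "__main__":
    print(outline_to_xml(OUTLINE))
```

The stack of open elements is what guarantees well-formed output: an element is closed as soon as a line appears at the same or a shallower indentation level.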
Mimicking programming languages
Several attempts have been made to project XML into a favorite syntax from a programming language. Benefits of this approach include:
- You don't need to switch between alternate syntaxes when editing
- You can take advantage of existing editing aids
- You may be able to interpret the XML files directly in the language's compiler or interpreter
Python -- SLiP
The syntax of the Python programming language often polarizes opinion: It relies on the indentation level of lines to indicate blocks, rather than the braces {} or parentheses () used by other languages. It certainly leads to a pleasant style of uncluttered code.
The SLiP syntax (which stands for "Sorta Like Python") for XML, developed by Scott Sweeney (see Resources), uses Python-like indentation rules, and can be formatted by any Python-aware editor. Sweeney describes his motivation: "The idea came to me while attending a conference recently. I wanted a way to take notes quickly on my laptop ... Almost all of the XML editors I have seen to date have been mouse-oriented and require constant back-and-forth between mouse and keyboard, making it impossible to keep up with the lectures. I wanted something quicker."
Listing 2 shows SLiP syntax for our trivial XHTML document.
Listing 2. Simple XHTML document expressed in SLiP
html(xmlns="http://www.w3.org/1999/xhtml"):
    head():
        title(): "Test page"
    body():
Scheme -- SXML
While most of the syntaxes examined in this article attempt to introduce some degree of simplification, SXML takes a different approach, prioritizing Scheme compatibility over brevity. It provides a representation for XML documents in the Scheme programming language; once such a representation is available, then operations on the document become native operations on Scheme data structures.
Listing 3 shows the example XHTML document in SXML. There is an interesting historical twist to the use of s-expressions to encode XML: Before XML burst large upon the W3C, s-expressions were being favored as one syntax for W3C recommended languages -- see the W3C's Platform for Internet Content Selection (PICS) content rating recommendation, for example, in Resources.
Listing 3. Simple XHTML document expressed in SXML
(html (@ (xmlns "http://www.w3.org/1999/xhtml"))
  (head
    (title "Test page"))
  (body))
SXML's creator, Oleg Kiselyov, has gone beyond the mere syntax, and developed SXML into a useful toolkit that includes XPath and XSLT implementations. Oleg's XML and Scheme page (see Resources) includes many interesting ideas on mingling XML with Scheme, including discussion on the idea of making XML documents executable.
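The shape of Listing 3 can be reproduced mechanically from a parsed document. The following Python sketch renders a DOM element as an s-expression string in the style of that listing. It is only an illustration, not Kiselyov's toolkit: `quote` and `to_sxml` are hypothetical names, whitespace-only text is dropped as an assumption, and namespace handling is limited to passing `xmlns` through as an ordinary attribute.

```python
from xml.dom import minidom


def quote(text):
    # Write a Scheme-style string literal: escape backslashes and double quotes.
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def to_sxml(node):
    """Render a DOM element as an SXML-style s-expression string (sketch)."""
    parts = [node.tagName]
    attrs = node.attributes.items()
    if attrs:
        pairs = " ".join("(%s %s)" % (name, quote(value)) for name, value in attrs)
        parts.append("(@ " + pairs + ")")
    for child in node.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            parts.append(to_sxml(child))
        elif child.nodeType == child.TEXT_NODE and child.data.strip():
            # Whitespace-only text nodes are dropped in this sketch.
            parts.append(quote(child.data))
    return "(" + " ".join(parts) + ")"


SAMPLE = (
    '<html xmlns="http://www.w3.org/1999/xhtml">\n'
    "<head><title>Test page</title></head>\n"
    "<body>\n"
    "</body>\n"
    "</html>"
)

if __name__ == "__main__":
    print(to_sxml(minidom.parseString(SAMPLE).documentElement))
```

The recursion mirrors the tree directly: an element becomes a list whose head is the tag name, followed by an optional `(@ ...)` attribute list and then the children.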
Domain-specific syntaxes
Perhaps the most useful category of non-XML syntaxes is that of domain-specific syntaxes. Any general purpose alternative syntax for XML is still going to bump up against the limits of that generalness: There are few general semantic concepts, so the opportunity for collapsing them into abbreviated syntax is limited. By contrast, when applications of XML are considered, there is much more scope for collapsing multi-tag structures into a shorter representation. This section considers some of the most useful of these syntaxes.
WikiML for documentation-oriented markup
After HTML, one of the most popular markup languages on the Web is probably that used in WikiWikiWeb, a rapid-entry hypertext documentation system that uses a Web browser as its user interface. Wikis tend to use simple character-based markup to denote structure, trading off flexibility against convenience. For example, compare the following HTML hypertext links with Wiki markup:
<p>Here's a link to <a href="http://www.ibm.com">IBM</a>.</p>
Here's a link to [IBM|http://www.ibm.com].
Unfortunately, Wiki syntax tends to vary among different Wiki systems. The WikiML tools take the syntax used in the popular PHPWiki project, and convert it into an XML language, WikiML. By applying various style sheets, you can then make the transition to, say, XHTML or DocBook. Wikis are highly convenient for the particular task of writing documentation, and you would be hard-pressed to write as efficiently in raw XML.
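As a rough illustration of how little machinery the link shorthand needs, the Python sketch below expands the `[label|url]` form from the example above into an HTML anchor inside a paragraph. It is not the WikiML tool chain: the pattern `LINK` and the function `wiki_links_to_html` are hypothetical, and only this one construct of the Wiki syntax is handled.

```python
import re
from html import escape

# Matches [label|url]; the label may not contain "]" or "|".
LINK = re.compile(r"\[([^\]|]+)\|([^\]]+)\]")


def wiki_links_to_html(line):
    """Wrap a line in <p> and turn each [label|url] into an anchor element."""
    out = []
    pos = 0
    for match in LINK.finditer(line):
        # Plain text between links is escaped as element content.
        out.append(escape(line[pos:match.start()], quote=False))
        label, url = match.group(1), match.group(2)
        out.append('<a href="%s">%s</a>' % (escape(url, quote=True),
                                            escape(label, quote=False)))
        pos = match.end()
    out.append(escape(line[pos:], quote=False))
    return "<p>" + "".join(out) + "</p>"


if __name__ == "__main__":
    print(wiki_links_to_html("Here's a link to [IBM|http://www.ibm.com]."))
```

Escaping the text and the URL separately keeps the output well-formed even when the Wiki source contains characters such as `<` or `&`.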
Shrinking XSLT -- XSLTXT
XSLT must surely be one of the most contorted programming languages in popular circulation. The XML syntax does little to help the style sheet creator, and simple constructs such as "switch/case" blocks can grow to an amazing length very quickly.
XSLTXT is a project that attempts to limit the tag soup of XSLT. XSLTXT does not attempt to alter XSLT semantics at all, but only to provide a reduced-clutter syntax. Table 2 shows a comparison of XSLT and XSLTXT on a typical block.
Table 2. Comparison of XSLT and XSLTXT code
XSLT code:
<xsl:template name="foo">
  <xsl:param name="a"/>
  <xsl:param name="b"/>
  SELECT <xsl:value-of select="$a"/> FROM <xsl:value-of select="$b"/>
</xsl:template>
XSLTXT equivalent:
tpl .name "foo" ("a", "b")
  "SELECT "
  val "$a"
  " FROM "
  val "$b"
XSLTXT, like SLiP, uses indentation to signify block structure, and uses abbreviations for keywords such as xsl:template. Additionally, parentheses are used for parameters. The XSLTXT project provides converters for XSLT and XML, and also provides a TXTReader Java class that can be used as a plug-in deserializer for XSLT in XML processors.
RELAX NG Compact
RELAX NG (RNG) is a schema language for XML, developed by an OASIS Technical Committee. One of the major forces behind RNG is James Clark, who is also the mastermind of XSLT. (One is tempted to wonder if XSLT experience motivated the RELAX NG Compact syntax.) The RNG creators recognised that when you're thinking about modeling a schema, you don't really want to waste time considering excess angle brackets. So, they created RELAX NG Compact, a non-XML syntax that implements the same concepts as the XML syntax for RELAX NG. Table 3 shows how the compact syntax aids clarity over the XML syntax.
Table 3. A comparison of RELAX NG's Compact and XML syntaxes
RELAX NG version:
<?xml version="1.0" encoding="UTF-8"?>
<element name="date"
         xmlns="http://relaxng.org/ns/structure/1.0">
  <optional>
    <attribute name="type"/>
  </optional>
  <element name="year"><text/></element>
  <element name="month"><text/></element>
  <element name="day"><text/></element>
</element>
RELAX NG Compact version:
element date {
  attribute type { text }?,
  element year { text },
  element month { text },
  element day { text }
}
The compact syntax simultaneously reduces the amount of text and
makes the relationship between the elements clearer. It goes
further than XSLTXT: In addition to reducing the amount of typing required, RELAX NG
Compact is actually a different language -- note, for example, the
use of the question mark (?) in place of the <optional>
element container. Because the compact syntax has actually been
developed by the same group responsible for the original language,
its adoption and support are good. In contrast, a lot of the
syntaxes highlighted in this article tend to originate from
a single third party.
Other XML languages with complex constructs that have had
non-XML syntaxes proposed include Topic Maps, for which Lars Marius
Garshol has proposed the Linear Topic Map Notation, and RDF, for which Dan Connolly and Tim Berners-Lee
have proposed N3, which is actually an RDF superset (see Resources).
Summary
The main motivation for creating non-XML syntaxes is
the difficulty inherent in authoring XML. As Scott Sweeney
noted, even the best commercial XML editors require a degree of
pointing and clicking that gets in the way of rapid, free-form
content creation. At the end of the day, the contracted general XML syntaxes such as SOX
and SLiP have very little to differentiate between them:
Their main benefit seems to be in the ability to omit
closing tags.
One downside to such contracted syntaxes is in the loss of
interoperability and future-proofing. Most of the efforts come from
single third-party sources. It's not entirely clear what support there is
for diverse character encodings, as well as the less
frequently used parts of XML such as processing instructions. Also
in most cases there is only one tool originator in place, so the ideas may
well just die out.
When I first wrote on this topic a year ago, developer and author
Michael Champion responded, encouraging me not to lose sight of the
fact that the interoperability of the XML 1.0 syntax, and
consequent network effect, is the main value of XML, and what got
us where we are now.
Moving beyond general-purpose XML syntax replacements, the
application-specific non-XML syntaxes seem to offer a lot more
value, especially where a lot of content is being created. Wiki markup, for instance,
can be a huge time saver in
comparison to writing straight into DocBook XML. (Eric van der Vlist
has written entire books using this method.) Of course, you don't
get all the features of DocBook, but there is a reasonable
trade-off with ease-of-authoring. RELAX NG Compact is a good
example of how a non-XML syntax can really illuminate the
underlying concepts and data structures of the language without
being cryptic -- and the fact that it's sanctioned by the RNG
committee provides some insurance for the future.
In conclusion, is it possible to say when it is best to use a
non-XML syntax? It's certainly easy to say when not to use one: when the
incremental benefit over XML 1.0 is small, or the route to
preserving your data in XML 1.0 is fragile or lossy. A good
practice is to consider when the content is created, and whether it will
be exchanged frequently. One suitable scenario for a non-XML
syntax occurs when there is a one-off creation of the content up front,
and it is either not to be exchanged, or exchanged after being
translated into XML 1.0. I certainly don't think "well, I can carry
on using Python mode in Emacs" is sufficient justification for
moving data out of XML for anything but the most personal of
projects.
Easy content creation is still the most compelling argument I've
seen for using alternative XML representations. Editor development
seems to be one of the trickiest problems in computing, and every
little bit helps.
Resources
- David Mertz examines PYX in detail in his
XML Matters developerWorks column "Intro to PYX" (developerWorks, February
2002).
- Sean McGrath introduces PYX, and his open source XML processing
library Pyxie.
- Arnold deVos' SOX
(Simple Outline XML) page introduces SOX and provides a SAX
parser and serializer.
- Scott Sweeney's SLiP project
provides a Python-like syntax for XML, and tools to enable you to
use any editor to create XML through SLiP.
- Oleg Kiselyov's SXML is a
representation of XML in Scheme. Kiselyov's XML and Scheme page
includes much interesting reading.
- Functional programming junkies will also be interested in David
Mertz's exploration of
Haskell and XML (developerWorks, October 2001), and Bijan Parsia's Functional
Programming and XML.
- Eric van der Vlist's WikiML
project translates from WikiWikiWeb markup, as used by PHPWiki, to an XML
vocabulary.
- The XSLTXT
project implements a shortened, more legible version of XSLT.
- RELAX NG's
Compact Syntax is developed by the same committee as the
original (XML) RELAX NG language.
- The W3C's Platform for Internet Content Selection (PICS) content rating recommendation enables labels (metadata) to be associated with Internet content.
- Find out more about Getting into RDF & Semantic Web using N3 by Dan Connolly and Tim Berners-Lee.
-
Read all of Edd Dumbill's previous XML Watch columns.
- Find more XML resources on the developerWorks
XML technology zone.
-
Rational Application Developer for WebSphere Software helps Java™ developers rapidly design, develop, assemble, test, profile and deploy high quality Java/J2EE, Portal, Web, Web services and SOA applications.
- Find out how you can become an IBM Certified Developer in XML and related technologies.