I am aware that some connoisseurs of single-malt scotches can discourse endlessly on the virtues of one scotch versus another. But as someone who rarely drinks alcohol, I have trouble fully understanding the mindset of these enthusiasts.
I feel almost the same way about Lisp and Scheme programming. I can tell that it is an area filled with sophistication and intelligence, but somehow both the Polish (prefix) notation and endless parentheses, and the fervent semantic eschewal of a distinction between code and data, continue to feel alien to me. Nonetheless, I have enough of a fascination that I want to see how these languages approach XML processing.
Among Lisp/Scheme enthusiasts, the starting point for the SSAX library for Scheme is the observation that XML is
semantically almost identical to the nested list-oriented data
structures native to Lisp-like languages. Anything you can represent
in XML can be straightforwardly represented as SXML -- Scheme lists
nesting the same data as the original XML. Moreover, Scheme comes
with a rich library of list and tree manipulation functions, and a
history of contemplating manipulation of those very structures. A
natural fit, perhaps.
A good first step is to take a look at SXML in its concrete form. Trees are the underlying abstraction -- the Infoset -- of XML; but the abstract information takes a specific semantic form. For example, the following is a starkly reduced (but still well-formed) version of another article I wrote recently:
Listing 1. An XML document with most XML features
$ cat example.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/css"
href="http://gnosis.cx/dW/programming/dW.css" ?>
<dw-document xmlns:dw="http://www.ibm.com/developerWorks/">
<title>The Twisted Matrix Framework</title>
<author name="David Mertz, Ph.D.">
<bio>David thinks it's turtles all the way down...</bio>
</author>
<docbody>
<dw:heading refname="" toc="yes">Introduction</dw:heading>
<p>
Sorting through Twisted Matrix is reminiscent of the old story
about blind men and elephants. Twisted Matrix has many
capabilities within it, and it takes a bit of a gestalt switch
to <i>get</i> a good sense of why they are all there.
</p>
</docbody>
</dw-document>
|
Transformed to SXML, this article looks like:
Listing 2. An SXML document with most XML features
$ ./xml2sxml example.xml
(*TOP* (*PI* xml "version=\"1.0\" encoding=\"UTF-8\"")
(*PI* xml-stylesheet
"type=\"text/css\"
href=\"http://gnosis.cx/dW/programming/dW.css\" ")
(dw-document
(title "The Twisted Matrix Framework")
(author
(@ (name "David Mertz, Ph.D."))
(bio "David thinks it's turtles all the way down..."))
(docbody
(http://www.ibm.com/developerWorks/:heading
(@ (toc "yes") (refname ""))
"Introduction")
(p "
Sorting through Twisted Matrix is reminiscent of the old story
about blind men and elephants. Twisted Matrix has many
capabilities within it, and it takes a bit of a gestalt switch
to "
(i "get")
" a good sense of why they are all there.
"))))
|
You'll find a number of interesting differences between XML and SXML,
but also an obvious and direct correspondence between any given aspect of the two.
Some of the differences are relatively trivial -- parentheses instead of
angle brackets, for example -- while others are ambivalent. For an example of the latter, SSAX 's creator, Oleg Kiselyov, thinks the reduction of closing tags to
closing parentheses makes the syntax more concise (see Resources). However, the
designers of XML went out of their way to remove the tag reduction
options in SGML. Perhaps they were wrong, but explicit closing tags are
not there because their possibility was overlooked.
In a number of ways, however, SXML corrects some awkwardness in XML,
without sacrificing information. In particular, the distinction
between attributes and element contents feels arbitrary to most XML
developers, and SXML removes it in an elegant way. An attribute
collection is simply another nested list, but one that happens to
start with the name @ -- a name that is conveniently prohibited in
XML identifiers. Effectively, an SXML document is like an XML
document that eschews attributes, but sometimes nests a <@> child
inside other elements. Referring to children that happen to be named
@ in SXML is no different than the filtering on any other
tag name. It's interesting to note that both my gnosis.xml.objectify and
RELAX NG stylesheets attempt a similar homogenization of attributes and
elements.
Processing instructions and comments are also reduced to special tag
names that are not available in XML: PI for the former,
COMMENT for the latter. As with most of Lisp, just one
basic data structure represents everything.
Namespaces are also interesting in the SXML format. The full namespace
URI simply becomes part of the tagname when SXML is generated as
above. However, an optional NAMESPACES tag can be used to
abbreviate namespace references in essentially the same way as in
XML, but to utilize this you need to either write SXML by hand or enhance the conversion
utility.
From the description above, SXML seems like just another shortcut notation for XML, of which there are many (such as PYX, SOX, SLiP, and XSLTXT). The difference is that SXML is not merely (arguably) simpler to read and edit, but is already itself code in Scheme. No special parsers for the format are needed or even relevant.
As a first example of working with the SSAX library, take a
look at the application xml2sxml that was utilized above:
Listing 3. xml2sxml conversion script
#!/sw/bin/guile -s !# (load "sxml-all.scm") (pp (SSAX:XML->SXML (open-input-file (cadr (command-line))) '() )) |
Not too much to it, is there? Of course, this relies on a collection of
load functions that I put into my convenience module:
Listing 4. sxml-all.scm support file
(load "libs/pp.scm") (load "libs/guile/common.scm") (load "libs/guile/myenv.scm") (load "libs/guile/parse-error.scm") (load "libs/util.scm") (load "libs/input-parse.scm") (load "libs/look-for-str.scm") (load "sxml/ssax.scm") (define (port? x) (or (input-port? x) (output-port? x))) |
I may not need all of these load functions every time, but this
loads the complete collection of SSAX functions. The last line is
an oddity: For whatever reason, the function port? that is used by
SSAX is not available in the version of Guile that I installed on
Mac OSX using fink. However, the definition I added comes straight
out of the manual for Guile. I'm assuming that a different Scheme system would not have this same issue.
The data structure produced by the function SSAX:XML->SXML is a
regular list that you can work with using all of the usual Scheme
functions. For example:
Listing 5. Navigating an SXML structure
guile> (load "sxml-all.scm")
guile> (define sxml
(SSAX:XML->SXML (open-input-file "example.xml") `()))
guile> (list-ref sxml 1)
(*PI* xml "version=\"1.0\" encoding=\"UTF-8\"")
guile> (cadr (list-ref sxml 3))
(title "The Twisted Matrix Framework")
guile> (eq? (car (list-ref sxml 3)) `dw-document)
#t
|
While an SXML representation is just a tree that can be manipulated
and traversed with general Scheme techniques, the SSAX library
provides a handy macro called SSAX:make-parser that works in a
manner similar to the SAX API in other programming languages. A
number of tree-walking optimizations are built into this macro,
giving you linear -- O(N) in Big-O notation -- efficiencies in processing a given SXML
structure; a naive approach could easily use more memory or CPU
time. (See Resources for more on Big-O.)
Unlike the actual SAX API that you might have used in languages like
C, C++, Python, or Perl, SSAX walks a tree rather than
scanning a linear bytestream -- that is, SAX or expat simply look for
opening and closing tags as events, and call back to the relevant
handlers. If you want to keep track of the relative nesting and
context in which a tag occurs, you need to maintain your own stack, or
other data structure. In SSAX, by contrast, every node descends from
a parent, passing and returning a seed. Of course, this seed is itself essentially a data structure that you can modify in each
NEW-LEVEL-SEED and FINISH-ELEMENT handler -- but at least it's local rather than global.
To show you how SSAX works, I have enhanced an outline example
that is available on the CVS directory for the SSAX library (see Resources).
I'll demonstrate how to display attributes and (abbreviated) CDATA content.
This will take you most of the way toward writing an sxml2xml utility -- one
that oddly is not distributed as part of SSAX, not even as a
direct function or macro. However, I'll skip handling proper escaping, processing instructions, and a few other aspects.
Listing 6. Outline conversion script
#!/sw/bin/guile -s
!#
(load "sxml-all.scm")
(define (format-attrs attrs)
(if (and (pair? attrs) (pair? (car attrs)))
(string-append " " (caar attrs) "='" (cdar attrs) "'"
(if (> (length attrs) 1)
(format-attrs (cdr attrs)) ""))
""))
(define (tag->string tag)
(if (symbol? tag) tag (cdr tag)))
(define (outline xml-port)
((SSAX:make-parser
NEW-LEVEL-SEED
(lambda (elem-gi attrs namespaces expected-content seed)
(display (string-append
seed ; indent the element name
"<" ; open brace
(tag->string elem-gi) ; print the name of the element
(format-attrs attrs) ; display the attributes
">\n")) ; closing brace, newline
(string-append " " seed)) ; advance the indent level
FINISH-ELEMENT
(lambda (elem-gi attributes namespaces parent-seed seed)
parent-seed) ; restore the indent level
CHAR-DATA-HANDLER
(lambda (s1 s2 seed)
(if (> (string-length s1) 30)
(display
(string-append seed "|" (substring s1 0 30) "...\n")))
seed)
) xml-port ""))
(display (call-with-input-file (cadr (command-line)) outline)))
|
The basics are a lot like a SAX class. The outline function is
generated with the SSAX:make-parser macro, which allows the definition of
several event types. The main ones are entering and leaving an
element, and getting character data. A couple support functions help
with the process.
The seed used in outline is quite simple; it is just a string
that gets longer as deeper branches of the tree are reached. Of
course, you could pass around a whole list of encountered tags -- such as
for an XPath-like analysis of what to do with a node. The CDATA
handler simply checks whether there is enough CDATA to bother
displaying (at least 30 characters, arbitrarily chosen), and then displays it
at the same indent as the current element.
The NEW-LEVEL-SEED handler demonstrates a couple of interesting
aspects, mostly in the two support functions it employs. Not every tag
is a simple symbol in the SXML structure;
specifically, a namespace-qualified tag is a pair instead.
The function tag->string checks
a tag's type, and only displays the local part of the name -- not
the namespace. You could take another approach, but this
demonstrates the test needed.
The function format-attrs is probably more an example of generic recursive
programming in Scheme than it is specific to SSAX.
Still, tags can have zero, one, or several attributes, and you need to
return a string for each case. A real Scheme programmer could probably
point out an even more concise way to do this -- I welcome comments in
the discussion forum for this column.
Now I'll take a look at the output, given the earlier XML document. By the
way, warnings are generated for the unprocessed PIs, so I redirect
STDERR to ignore those:
Listing 7. An outline display of example.xml
$ ./outline example.xml 2>/dev/null
<dw-document>
<title>
<author name='David Mertz, Ph.D.'>
<bio>
|David thinks it's turtles all ...
<docbody>
<heading refname='' toc='yes'>
<p>
|
Sorting through Twisted...
<i>
| a good sense of why they are ...
|
In addition to its equivalents for SAX and DOM (the native Scheme
nested lists), SSAX comes with its own SXPath and SXSLT
components. I don't have room here for an extensive discussion
of these, but they are worth mentioning briefly.
Unfortunately, the SXPath and SXSLT functions and macros
discussed in Oleg Kiselyov's document "XML, XPath, XSLT
implementations as SXML, SXPath and SXSLT" (see Resources) are not included in the
SSAX distribution, at least not the one for Guile (other
distributions are available for a number of Scheme systems). What can
be easily downloaded only works with a few Scheme systems, and
versioning is unclear. Based just on the document mentioned, SXPath
works much like XPath. For example, either of the following
expressions expand to a selection function:
Listing 8. SXPath expressions, native and textual
((sxpath '(// TAF VALID @ Trange *text)) document) ((txpath "//TAF/VALID/@TRange/text()") document) |
These undergo macro expansion to:
Listing 9. Full path selector function
((node-join (node-closure (node-typeof? 'TAF)) (select-kids (node-typeof? 'VALID)) (select-kids (node-typeof? '@)) (select-kids (node-typeof? 'TRange)) (select-kids (node-typeof? '*text*))) document) |
Of course, playing with this expanded form allows programming arbitrary calculations inside an XPath-like selector -- anything you can write in Scheme.
SXSLT is similar in concept. Stylesheets are written in Scheme form
which is semantically similar to XSLT. But much as with the
flexibility of HaXml, within any particular transformation rule, you
can embed arbitrary extra code. Particular XSLT engines, of course,
often come with foreign-function APIs to write extra capabilities in
JavaScript, VB, or other languages. But with SXSLT, the custom
functions are written in the very same Scheme language as the
transformation stylesheet elements.
I like the SSAX library quite a bit, and I suspect I will like it
more as I become more comfortable with Lisp/Scheme programming. It
shares many of the advantages of other native XML libraries I have
written about in other installments: gnosis.xml.objectify, REXML,
XML::Grove, HaXml, and so on. A lot can be said for making
XML into just another data object in whatever programming language you use.
That said, SSAX has a lot of rough edges. It's hard to figure out
what to download, and what Scheme systems each part is available for.
The documentation is somewhat inconsistent and incomplete -- most of the
documents are academic in focus, and do more to discuss abstract goals
and concepts than concrete usage and APIs. As a demonstration of
what is possible in Scheme, using functional techniques, these papers
are interesting; but it would be nice to have something that's easy
to install and use, and just works.
- Participate in the discussion forum.
- Check out the homepage for SXML, which offers a number of overlapping documents, mostly written by Oleg Kiselyov.
- Download the
SSAXlibrary for various Scheme systems. - Browse the most current
SSAXfiles, including some tests and examples not included in the distribution, on SourceForge.net. - Find the
SXPathextension, but unfortunately not for Guile. - Find out more about the XML Information Set (Infoset), a W3C Recommendation that specifies the information content of XML documents -- meaning,
which features of a concrete document carry information, and which are incidental.
- Learn more about Big-O notation in this glossary by David Mertz.
- Take a look at the GNU project's Guile, the version of Scheme used for this article. Other versions, both commercial and free, exist as well, but Guile seems widespread -- and it is the version that
finkwill install under Mac OSX. - Read "Transcending the limits of DOM, SAX, and XSLT," a previous installment of this column that discusses the XML library
HaXmlfor the functional programming language Haskell. While Haskell is purer in its functional programming paradigm, many common motivations and designs went intoHaXmland SXML (developerWorks, October 2001). - Review the alternative syntaxes for XML in this "XML Watch" column by Edd Dumbill (developerWorks, October 2002).
- Get a nice visual summary of syntax variations of near-XML languages at Scott Sweeney's site.
- IBM's DB2 database provides relational database storage, plus pureXML to quickly serve data and reduce your work in the management of XML data.Visit the DB2 Developer Domain to learn more about DB2.
- IBM XML certification: Find out how you can become an IBM-Certified Developer in XML and related technologies.

David Mertz once led the desperate life of scholarship. David may be reached at mertz@gnosis.cx; his life pored over at http://gnosis.cx/dW/. Suggestions and recommendations on this, past, or future columns are welcomed. Check out David's new book Text Processing in Python.
