XML Matters: Investigating SXML and SSAX

Manipulating XML in the Scheme programming language

David Mertz (mertz@gnosis.cx), Parenthesizer, Gnosis Software, Inc.
David Mertz once led the desperate life of scholarship. David may be reached at mertz@gnosis.cx; his life pored over at http://gnosis.cx/dW/. Suggestions and recommendations on this, past, or future columns are welcomed. Check out David's new book Text Processing in Python.

Summary:  In previous installments, David has looked at XML libraries for various programming languages, and their various strengths and weaknesses. This time, he looks at the Lisp/Scheme family of languages that continues to endure in popularity, especially in teaching and among purists. Included in the discussion are the SSAX library for Scheme (an efficient pure-functional parser), the SXML tree library (similar to DOM), and the related tools SXSLT and SXPath, which have been created to work with them. David covers the advantages of parsing in a strongly functional language, and compares SSAX with libraries for other languages.

Date:  23 Oct 2003
Level:  Intermediate

I am aware that some connoisseurs of single-malt scotches can discourse endlessly on the virtues of one scotch versus another. But as someone who rarely drinks alcohol, I have trouble fully understanding the mindset of these enthusiasts.

I feel almost the same way about Lisp and Scheme programming. I can tell that it is an area filled with sophistication and intelligence, but somehow both the Polish (prefix) notation and endless parentheses, and the fervent semantic eschewal of a distinction between code and data, continue to feel alien to me. Nonetheless, I have enough of a fascination that I want to see how these languages approach XML processing.

What is SXML?

Among Lisp/Scheme enthusiasts, the starting point for the SSAX library for Scheme is the observation that XML is semantically almost identical to the nested list-oriented data structures native to Lisp-like languages. Anything you can represent in XML can be straightforwardly represented as SXML -- Scheme lists nesting the same data as the original XML. Moreover, Scheme comes with a rich library of list and tree manipulation functions, and a history of contemplating manipulation of those very structures. A natural fit, perhaps.

A good first step is to take a look at SXML in its concrete form. Trees are the underlying abstraction -- the Infoset -- of XML; but the abstract information takes a specific syntactic form. For example, the following is a starkly reduced (but still well-formed) version of another article I wrote recently:


Listing 1. An XML document with most XML features
$ cat example.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/css"
      href="http://gnosis.cx/dW/programming/dW.css" ?>
<dw-document xmlns:dw="http://www.ibm.com/developerWorks/">
  <title>The Twisted Matrix Framework</title>
  <author name="David Mertz, Ph.D.">
    <bio>David thinks it's turtles all the way down...</bio>
  </author>
  <docbody>
    <dw:heading refname="" toc="yes">Introduction</dw:heading>
    <p>
      Sorting through Twisted Matrix is reminiscent of the old story
      about blind men and elephants. Twisted Matrix has many
      capabilities within it, and it takes a bit of a gestalt switch
      to <i>get</i> a good sense of why they are all there.
    </p>
  </docbody>
</dw-document>

Transformed to SXML, this article looks like:


Listing 2. An SXML document with most XML features
$ ./xml2sxml example.xml
(*TOP* (*PI* xml "version=\"1.0\" encoding=\"UTF-8\"")
       (*PI* xml-stylesheet
             "type=\"text/css\"
      href=\"http://gnosis.cx/dW/programming/dW.css\" ")
       (dw-document
         (title "The Twisted Matrix Framework")
         (author
           (@ (name "David Mertz, Ph.D."))
           (bio "David thinks it's turtles all the way down..."))
         (docbody
           (http://www.ibm.com/developerWorks/:heading
             (@ (toc "yes") (refname ""))
             "Introduction")
           (p "
      Sorting through Twisted Matrix is reminiscent of the old story
      about blind men and elephants. Twisted Matrix has many
      capabilities within it, and it takes a bit of a gestalt switch
      to "
              (i "get")
              " a good sense of why they are all there.
    "))))

You'll find a number of interesting differences between XML and SXML, but also an obvious and direct correspondence between any given aspect of the two. Some of the differences are relatively trivial -- parentheses instead of angle brackets, for example -- while others are ambivalent. For an example of the latter, SSAX's creator, Oleg Kiselyov, thinks the reduction of closing tags to closing parentheses makes the syntax more concise (see Resources). However, the designers of XML went out of their way to remove the tag reduction options in SGML. Perhaps they were wrong, but explicit closing tags are not there because their possibility was overlooked.

In a number of ways, however, SXML corrects some awkwardness in XML, without sacrificing information. In particular, the distinction between attributes and element contents feels arbitrary to most XML developers, and SXML removes it in an elegant way. An attribute collection is simply another nested list, but one that happens to start with the name @ -- a name that is conveniently prohibited in XML identifiers. Effectively, an SXML document is like an XML document that eschews attributes, but sometimes nests a <@> child inside other elements. Referring to children that happen to be named @ in SXML is no different from filtering on any other tag name. It's interesting to note that both my gnosis.xml.objectify and RELAX NG schemas attempt a similar homogenization of attributes and elements.
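
To make the point about attributes concrete, here is a minimal sketch that reads an attribute with nothing but ordinary list functions. The helper names element-attributes and attribute-ref are my own inventions for illustration, not part of SSAX, and the sample element is copied from Listing 2.

```scheme
;; Sketch: reading attributes from an SXML element with plain list
;; operations. element-attributes and attribute-ref are hypothetical
;; helper names, not SSAX functions.

(define author
  '(author (@ (name "David Mertz, Ph.D."))
           (bio "David thinks it's turtles all the way down...")))

;; Return the contents of the element's (@ ...) child, or the empty
;; list when the element carries no attributes.
(define (element-attributes elem)
  (let loop ((kids (cdr elem)))
    (cond ((null? kids) '())
          ((and (pair? (car kids)) (eq? (caar kids) '@)) (cdar kids))
          (else (loop (cdr kids))))))

;; Look up one attribute value by name; #f when it is absent.
(define (attribute-ref elem name)
  (let ((entry (assq name (element-attributes elem))))
    (and entry (cadr entry))))

(display (attribute-ref author 'name))
(newline)
```

The lookup is just assq over a nested list, which is the same operation you would use on any other association list in Scheme.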

Processing instructions and comments are also reduced to special tag names that are not available in XML: *PI* for the former, *COMMENT* for the latter. As with most of Lisp, just one basic data structure represents everything.

Namespaces are also interesting in the SXML format. The full namespace URI simply becomes part of the tag name when SXML is generated as above. However, an optional *NAMESPACES* tag can be used to abbreviate namespace references in essentially the same way as in XML, but to utilize this you need to either write SXML by hand or enhance the conversion utility.
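
Because the URI is fused into the tag symbol, code that only cares about the local name has to split it back out. The following is a sketch under the assumption that the last colon in the symbol separates URI from local name, as in Listing 2; split-qualified-tag and last-colon-index are hypothetical helpers, not library functions.

```scheme
;; Sketch: splitting a fully qualified SXML tag such as
;; http://www.ibm.com/developerWorks/:heading into its namespace URI
;; and local name. Both helpers are hypothetical, not part of SSAX.

;; Index of the last colon in str, or #f when there is none.
(define (last-colon-index str)
  (let loop ((i (- (string-length str) 1)))
    (cond ((< i 0) #f)
          ((char=? (string-ref str i) #\:) i)
          (else (loop (- i 1))))))

;; Return (uri . local-name) as strings; uri is #f for unqualified tags.
(define (split-qualified-tag tag)
  (let* ((name (symbol->string tag))
         (pos (last-colon-index name)))
    (if pos
        (cons (substring name 0 pos)
              (substring name (+ pos 1) (string-length name)))
        (cons #f name))))

(define qualified
  (string->symbol "http://www.ibm.com/developerWorks/:heading"))

(display (cdr (split-qualified-tag qualified)))
(newline)
```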


Working with the SSAX library

From the description above, SXML seems like just another shortcut notation for XML, of which there are many (such as PYX, SOX, SLiP, and XSLTXT). The difference is that SXML is not merely (arguably) simpler to read and edit, but is already itself code in Scheme. No special parsers for the format are needed or even relevant.

As a first example of working with the SSAX library, take a look at the application xml2sxml that was utilized above:


Listing 3. xml2sxml conversion script
#!/sw/bin/guile -s
!#
(load "sxml-all.scm")
(pp (SSAX:XML->SXML (open-input-file (cadr (command-line))) '() ))

Not too much to it, is there? Of course, this relies on a collection of load functions that I put into my convenience module:


Listing 4. sxml-all.scm support file
(load "libs/pp.scm")
(load "libs/guile/common.scm")
(load "libs/guile/myenv.scm")
(load "libs/guile/parse-error.scm")
(load "libs/util.scm")
(load "libs/input-parse.scm")
(load "libs/look-for-str.scm")
(load "sxml/ssax.scm")
(define (port? x) (or (input-port? x) (output-port? x)))

I may not need all of these load functions every time, but this loads the complete collection of SSAX functions. The last line is an oddity: For whatever reason, the function port? that is used by SSAX is not available in the version of Guile that I installed on Mac OS X using fink. However, the definition I added comes straight out of the manual for Guile. I'm assuming that a different Scheme system would not have this same issue.

The data structure produced by the function SSAX:XML->SXML is a regular list that you can work with using all of the usual Scheme functions. For example:


Listing 5. Navigating an SXML structure
guile> (load "sxml-all.scm")
guile> (define sxml
         (SSAX:XML->SXML (open-input-file "example.xml") '()))
guile> (list-ref sxml 1)
(*PI* xml "version=\"1.0\" encoding=\"UTF-8\"")
guile> (cadr (list-ref sxml 3))
(title "The Twisted Matrix Framework")
guile> (eq? (car (list-ref sxml 3)) 'dw-document)
#t
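
Positional access with list-ref breaks as soon as a processing instruction is added or removed, so a small lookup by tag name is often handier. This is a sketch only; child-elements is a hypothetical helper of my own, and the document literal is a trimmed copy of Listing 2 so that the example runs without loading SSAX.

```scheme
;; Sketch: picking child elements by tag name instead of by position.
;; child-elements is a hypothetical helper, not an SSAX function.

(define doc
  '(*TOP* (*PI* xml "version=\"1.0\" encoding=\"UTF-8\"")
          (dw-document
            (title "The Twisted Matrix Framework")
            (author (@ (name "David Mertz, Ph.D."))
                    (bio "David thinks it's turtles all the way down...")))))

;; Return every direct child of node that is a list headed by tag,
;; in document order.
(define (child-elements node tag)
  (let loop ((kids (cdr node)) (found '()))
    (cond ((null? kids) (reverse found))
          ((and (pair? (car kids)) (eq? (caar kids) tag))
           (loop (cdr kids) (cons (car kids) found)))
          (else (loop (cdr kids) found)))))

(define root (car (child-elements doc 'dw-document)))

(display (car (child-elements root 'title)))
(newline)
```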


Using SSAX

While an SXML representation is just a tree that can be manipulated and traversed with general Scheme techniques, the SSAX library provides a handy macro called SSAX:make-parser that works in a manner similar to the SAX API in other programming languages. A number of tree-walking optimizations are built into this macro, giving you linear -- O(N) in Big-O notation -- efficiencies in processing a given SXML structure; a naive approach could easily use more memory or CPU time. (See Resources for more on Big-O.)

Unlike the actual SAX API that you might have used in languages like C, C++, Python, or Perl, SSAX walks a tree rather than scanning a linear bytestream -- that is, SAX or expat simply look for opening and closing tags as events, and call back to the relevant handlers. If you want to keep track of the relative nesting and context in which a tag occurs, you need to maintain your own stack, or other data structure. In SSAX, by contrast, every node descends from a parent, passing and returning a seed. Of course, this seed is itself essentially a data structure that you can modify in each NEW-LEVEL-SEED and FINISH-ELEMENT handler -- but at least it's local rather than global.

To show you how SSAX works, I have enhanced an outline example that is available in the CVS directory for the SSAX library (see Resources). I'll demonstrate how to display attributes and (abbreviated) CDATA content. This will take you most of the way toward writing an sxml2xml utility -- one that oddly is not distributed as part of SSAX, not even as a direct function or macro. However, I'll skip handling proper escaping, processing instructions, and a few other aspects.


Listing 6. Outline conversion script
#!/sw/bin/guile -s
!#
(load "sxml-all.scm")

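;; Convert a tag to a string. A plain tag is a symbol; a
;; namespace-qualified tag is a pair whose cdr is the local name.
(define (tag->string tag)
  (symbol->string (if (symbol? tag) tag (cdr tag))))

;; Render a list of (name . value) pairs as " name='value' ..."
(define (format-attrs attrs)
  (if (and (pair? attrs) (pair? (car attrs)))
    (string-append " " (tag->string (caar attrs)) "='" (cdar attrs) "'"
      (format-attrs (cdr attrs)))
    ""))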

(define (outline xml-port)
  ((SSAX:make-parser
      NEW-LEVEL-SEED
      (lambda (elem-gi attrs namespaces expected-content seed)
        (display (string-append
          seed                  ; indent the element name
          "<"                   ; opening angle bracket
          (tag->string elem-gi) ; print the name of the element
          (format-attrs attrs)  ; display the attributes
          ">\n"))               ; closing angle bracket, newline
        (string-append "  " seed)) ; advance the indent level

      FINISH-ELEMENT
      (lambda (elem-gi attributes namespaces parent-seed seed)
          parent-seed)          ; restore the indent level

      CHAR-DATA-HANDLER
      (lambda (s1 s2 seed)
        (if (> (string-length s1) 30)
          (display
            (string-append seed "|" (substring s1 0 30) "...\n")))
        seed)

      ) xml-port ""))

(display (call-with-input-file (cadr (command-line)) outline))

The basics are a lot like a SAX class. The outline function is generated with the SSAX:make-parser macro, which allows the definition of several event types. The main ones are entering and leaving an element, and getting character data. A couple of support functions help with the process.

The seed used in outline is quite simple; it is just a string that gets longer as deeper branches of the tree are reached. Of course, you could pass around a whole list of encountered tags -- such as for an XPath-like analysis of what to do with a node. The CDATA handler simply checks whether there is enough CDATA to bother displaying (more than 30 characters, arbitrarily chosen), and then displays the first 30 characters one indent level below the enclosing element.
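
To illustrate the idea of a richer seed, here is a sketch in which the seed is the list of enclosing tags rather than an indent string. Note the assumptions: it walks an in-memory SXML tree with a hand-written function instead of SSAX:make-parser, the name walk-with-path is hypothetical, and the sample tree is a shortened stand-in for the docbody of Listing 2.

```scheme
;; Sketch only: a hand-written walk over an in-memory SXML tree, not a
;; use of SSAX:make-parser. The seed is the list of enclosing tags,
;; innermost first. walk-with-path is a hypothetical name.

(define body
  '(docbody
     (heading "Introduction")
     (p "Sorting through Twisted Matrix... " (i "get") " a good sense.")))

;; Call (visit path) once per element, where path holds the element's
;; own tag followed by its ancestors. Each recursive call receives its
;; own extended seed, so no global stack is needed.
(define (walk-with-path node seed visit)
  (if (and (pair? node) (symbol? (car node)) (not (eq? (car node) '@)))
      (let ((path (cons (car node) seed)))
        (visit path)
        (for-each (lambda (kid) (walk-with-path kid path visit))
                  (cdr node)))))

;; Collect every path, outermost tag first, in document order.
(define seen '())
(walk-with-path body '()
                (lambda (path) (set! seen (cons (reverse path) seen))))
(set! seen (reverse seen))

(for-each (lambda (path) (display path) (newline)) seen)
```

A handler that sees the whole path can decide, for example, to treat an i inside a p differently from an i inside a heading.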

The NEW-LEVEL-SEED handler demonstrates a couple of interesting aspects, mostly in the two support functions it employs. Not every tag is a simple symbol in the SXML structure; specifically, a namespace-qualified tag is a pair instead. The function tag->string checks a tag's type, and only displays the local part of the name -- not the namespace. You could take another approach, but this demonstrates the test needed.

The function format-attrs is probably more an example of generic recursive programming in Scheme than it is specific to SSAX. Still, tags can have zero, one, or several attributes, and you need to return a string for each case. A real Scheme programmer could probably point out an even more concise way to do this -- I welcome comments in the discussion forum for this column.
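
One possibly more concise formulation uses map instead of explicit recursion. This is a sketch, assuming (as Listing 6 does) that attrs is a list of (name . value) pairs, and additionally assuming that every name is a plain symbol; a namespace-qualified attribute name would need extra handling.

```scheme
;; Sketch: format-attrs written with map. Assumes attrs is a list of
;; (name . value) pairs whose names are plain symbols and whose values
;; are strings.
(define (format-attrs attrs)
  (apply string-append
         (map (lambda (attr)
                (string-append " " (symbol->string (car attr))
                               "='" (cdr attr) "'"))
              attrs)))

(display (format-attrs '((refname . "") (toc . "yes"))))
(newline)
```

Since applying string-append to an empty list yields the empty string, the zero-attribute case needs no special branch.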

Now I'll take a look at the output, given the earlier XML document. By the way, warnings are generated for the unprocessed PIs, so I redirect STDERR to ignore those:


Listing 7. An outline display of example.xml
$ ./outline example.xml 2>/dev/null
<dw-document>
  <title>
  <author name='David Mertz, Ph.D.'>
    <bio>
      |David thinks it's turtles all ...
  <docbody>
    <heading refname='' toc='yes'>
    <p>
      |
      Sorting through Twisted...
      <i>
      | a good sense of why they are ...


What else?

In addition to its equivalents for SAX and DOM (the native Scheme nested lists), the SSAX project has its own SXPath and SXSLT components, counterparts to XPath and XSLT. I don't have room here for an extensive discussion of these, but they are worth mentioning briefly.

Unfortunately, the SXPath and SXSLT functions and macros discussed in Oleg Kiselyov's document "XML, XPath, XSLT implementations as SXML, SXPath and SXSLT" (see Resources) are not included in the SSAX distribution, at least not the one for Guile (other distributions are available for a number of Scheme systems). What can be easily downloaded only works with a few Scheme systems, and versioning is unclear. Based just on the document mentioned, SXPath works much like XPath. For example, either of the following expressions expands to a selection function:


Listing 8. SXPath expressions, native and textual
((sxpath '(// TAF VALID @ TRange *text*)) document)
((txpath "//TAF/VALID/@TRange/text()") document)

These undergo macro expansion to:


Listing 9. Full path selector function
((node-join
  (node-closure (node-typeof? 'TAF))
  (select-kids (node-typeof? 'VALID))
  (select-kids (node-typeof? '@))
  (select-kids (node-typeof? 'TRange))
  (select-kids (node-typeof? '*text*)))
document)

Of course, playing with this expanded form allows programming arbitrary calculations inside an XPath-like selector -- anything you can write in Scheme.
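
To give a feel for how such combinators can be built from plain functions, here is a sketch with simplified stand-ins. These are my own illustrative re-implementations under hypothetical my- names, not the SXPath library's definitions, and the sample node with its placeholder attribute value is made up.

```scheme
;; Simplified stand-ins for the combinators named in Listing 9. These
;; are illustrative re-implementations with hypothetical my- names,
;; not the SXPath library's own code.

;; Predicate factory: is the node an element with the given tag?
(define (my-node-typeof? tag)
  (lambda (node) (and (pair? node) (eq? (car node) tag))))

;; Selector factory: from a list of nodes, collect the children that
;; satisfy test?, in document order.
(define (my-select-kids test?)
  (lambda (nodes)
    (let outer ((rest nodes) (acc '()))
      (if (null? rest)
          (reverse acc)
          (let inner ((kids (if (pair? (car rest)) (cdr (car rest)) '()))
                      (acc acc))
            (cond ((null? kids) (outer (cdr rest) acc))
                  ((test? (car kids))
                   (inner (cdr kids) (cons (car kids) acc)))
                  (else (inner (cdr kids) acc))))))))

;; Compose selectors left to right into one selector.
(define (my-node-join . selectors)
  (lambda (nodes)
    (let loop ((nodes nodes) (todo selectors))
      (if (null? todo)
          nodes
          (loop ((car todo) nodes) (cdr todo))))))

;; Made-up sample node; the attribute value is a placeholder.
(define taf '(TAF (VALID (@ (TRange "example-range")) "text")))

(define select-trange
  (my-node-join (my-select-kids (my-node-typeof? 'VALID))
                (my-select-kids (my-node-typeof? '@))
                (my-select-kids (my-node-typeof? 'TRange))))

(display (select-trange (list taf)))
(newline)
```

Because each step is an ordinary closure, any predicate you can write in Scheme can stand in for my-node-typeof?.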

SXSLT is similar in concept. Stylesheets are written in a Scheme form that is semantically similar to XSLT. But much as with the flexibility of HaXml, within any particular transformation rule, you can embed arbitrary extra code. Particular XSLT engines, of course, often come with foreign-function APIs to write extra capabilities in JavaScript, VB, or other languages. But with SXSLT, the custom functions are written in the very same Scheme language as the transformation stylesheet elements.
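
The flavor of rule-driven transformation with embedded code can be sketched in a few lines. To be clear about the assumptions: the transform function and its rule format below are hypothetical and written for this illustration; they are not the SXSLT API.

```scheme
;; Sketch of rule-driven transformation in the spirit of SXSLT. The
;; transform function and the (tag . handler) rule format are
;; hypothetical, not the SXSLT API. A handler receives the tag and the
;; already-transformed children, and may run any Scheme code.

(define (transform node rules)
  (if (pair? node)
      (let ((kids (map (lambda (kid) (transform kid rules)) (cdr node)))
            (rule (assq (car node) rules)))
        (if rule
            ((cdr rule) (car node) kids)
            (cons (car node) kids)))    ; default rule: copy the element
      node))                            ; text nodes pass through

(define rules
  (list
    ;; Simple renaming rule.
    (cons 'i (lambda (tag kids) (cons 'em kids)))
    ;; Rule with arbitrary computation: append the title's length.
    (cons 'title
          (lambda (tag kids)
            (let ((text (car kids)))
              (list 'h1
                    (string-append text " ("
                                   (number->string (string-length text))
                                   " chars)")))))))

(define input
  '(docbody (title "The Twisted Matrix Framework")
            (p "to " (i "get") " it")))

(display (transform input rules))
(newline)
```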


Final thoughts on SSAX

I like the SSAX library quite a bit, and I suspect I will like it more as I become more comfortable with Lisp/Scheme programming. It shares many of the advantages of other native XML libraries I have written about in other installments: gnosis.xml.objectify, REXML, XML::Grove, HaXml, and so on. A lot can be said for making XML into just another data object in whatever programming language you use.

That said, SSAX has a lot of rough edges. It's hard to figure out what to download, and what Scheme systems each part is available for. The documentation is somewhat inconsistent and incomplete -- most of the documents are academic in focus, and do more to discuss abstract goals and concepts than concrete usage and APIs. As a demonstration of what is possible in Scheme, using functional techniques, these papers are interesting; but it would be nice to have something that's easy to install and use, and just works.


Resources

  • Participate in the discussion forum.

  • Check out the homepage for SXML, which offers a number of overlapping documents, mostly written by Oleg Kiselyov.

  • Download the SSAX library for various Scheme systems.

  • Browse the most current SSAX files, including some tests and examples not included in the distribution, on SourceForge.net.

  • Find the SXPath extension, but unfortunately not for Guile.

  • Find out more about the XML Information Set (Infoset), a W3C Recommendation that specifies the information content of XML documents -- meaning, which features of a concrete document carry information, and which are incidental.

  • Learn more about Big-O notation in this glossary by David Mertz.

  • Take a look at the GNU project's Guile, the version of Scheme used for this article. Other versions, both commercial and free, exist as well, but Guile seems widespread -- and it is the version that fink will install under Mac OS X.

  • Read "Transcending the limits of DOM, SAX, and XSLT," a previous installment of this column that discusses the XML library HaXml for the functional programming language Haskell. While Haskell is purer in its functional programming paradigm, many common motivations and designs went into HaXml and SXML (developerWorks, October 2001).

  • Review the alternative syntaxes for XML in this "XML Watch" column by Edd Dumbill (developerWorks, October 2002).

  • Get a nice visual summary of syntax variations of near-XML languages at Scott Sweeney's site.

  • IBM's DB2 database provides relational database storage, plus pureXML to quickly serve data and reduce your work in the management of XML data. Visit the DB2 Developer Domain to learn more about DB2.

  • IBM XML certification: Find out how you can become an IBM-Certified Developer in XML and related technologies.
