XML Matters: Intro to PYX

A line-oriented XML

XML is a fairly simple format. It uses plain Unicode text rather than binary encoding, and all the structures are declared with predictable-looking tags. Nonetheless, there are still enough rules in the XML grammar that a carefully debugged parser is needed to process XML documents -- and every parser imposes its own particular programming style. An alternative is to make XML even simpler. The open-source PYX format is a purely line-oriented format for representing XML documents that allows for much easier processing of XML document contents with common text tools like grep, sed, awk, wc, and the usual UNIX collection.

David Mertz (mertz@gnosis.cx), Simplifier, Gnosis Software, Inc.

David Mertz believes that most XML writers have only explained APIs; his point is to change them (or at least circumvent them). David may be reached at mertz@gnosis.cx; his life pored over at http://gnosis.cx/dW/. Suggestions and recommendations on this, past, or future columns are welcome.



01 February 2002

Regular readers of this column have almost certainly noted my dissatisfaction with the most popular techniques for manipulating XML documents. The articles that have discussed my Python xml_objectify module have largely been in response to the complexity of DOM. My introduction to the Haskell HaXml library was primarily a response to what I think is a certain obtuseness of XSLT. Similarly, this time I find SAX to be far "heavier" than necessary for many of the problems it solves.

The SAX API is far more lightweight than either DOM or XSLT -- not only in terms of computer resources, but more importantly in terms of programmer effort and learning curve. Still, even SAX demands that an XML programmer utilize a parser library, and conform to a callback API. The data inside XML documents simply aren't complex enough to warrant these demands. In my opinion, there ought to be an easier way to handle XML documents; and in particular, one ought to be more free to use a variety of familiar tools and techniques when manipulating XML.

The PYX format

The PYX format is a line-oriented representation of XML documents that is derived from the SGML ESIS format. PYX is not actually XML, but it is able to represent all the information within an XML document in a manner that is easier to handle using familiar text processing tools. Moreover, PYX documents can be transformed back into XML as needed. PYX documents are approximately the same size as the corresponding XML versions (sometimes a little larger, sometimes a little smaller), so storage and transmission considerations differ little between XML and PYX.

The PYX format is extremely simple to describe and understand. The first character on each line identifies the content-type of the line. Content does not directly span lines, although successive lines might contain the same content-type. In the case of tag attributes, the attribute name and value are simply separated by a space, without use of extra quotes. The prefix characters are:

(  start-tag
)  end-tag
A  attribute
-  character data (content)
?  processing instruction
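
The prefix table maps almost directly onto code. As a minimal sketch (the names PYX_KINDS and classify are hypothetical, not part of any PYX library), one way to split a PYX line into its type and payload in Python is:

```python
# Hypothetical helper; the names and the returned tuple shape are illustrative only.
PYX_KINDS = {
    '(': 'start-tag',
    ')': 'end-tag',
    'A': 'attribute',
    '-': 'content',
    '?': 'processing-instruction',
}

def classify(line):
    """Split one PYX line into (kind, payload) using its first character."""
    line = line.rstrip('\n')
    kind = PYX_KINDS[line[0]]          # raises KeyError on an unknown prefix
    payload = line[1:]
    if kind == 'attribute':
        # The attribute name and value are separated by the first space
        name, _, value = payload.partition(' ')
        return kind, (name, value)
    return kind, payload

for pyx_line in ['(Spam', 'Aflavor pork', '-Some text about eggs.', ')Spam']:
    print(classify(pyx_line))
```

Nothing here goes beyond looking at the first character of each line, which is the whole point of the format.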

A few characters are escaped in the PYX format. Newline characters that occur inside character data are always indicated on a separate content line, using the \n escape code. Tabs are escaped using the \t sequence, and can occur on regular (non-newline) content lines, wherever the corresponding tabs occur in the original XML. Backslashes, which are used for escaping, are themselves escaped as \\. Note that spaces are not escaped, and usually some content lines will consist entirely of spaces (which might not be visually obvious in a text viewer). Lines are terminated by your platform-specific newline delimiter.
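
The escaping rules are small enough to sketch in a few lines of Python. The function names below are hypothetical, and the sketch assumes that the caller has already decided how to split content at newlines; it only handles the character-level escapes. The one subtlety is ordering: backslashes must be escaped first, and unescaping is safest as a single pass, so that an escaped backslash followed by a "t" is not mistaken for a tab.

```python
import re

def pyx_escape(text):
    """Escape backslashes first, then tabs and newlines."""
    return (text.replace('\\', '\\\\')
                .replace('\t', '\\t')
                .replace('\n', '\\n'))

# Map each two-character escape sequence back to the character it stands for
_UNESCAPE = {'\\t': '\t', '\\n': '\n', '\\\\': '\\'}

def pyx_unescape(text):
    """Undo the escapes in a single left-to-right pass."""
    return re.sub(r'\\[tn\\]', lambda m: _UNESCAPE[m.group()], text)

original = 'a\tb\\tc'                  # a real tab, then a literal backslash + "t"
escaped = pyx_escape(original)
print(escaped)
print(pyx_unescape(escaped) == original)
```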

The motivation for PYX is the wide usage, convenience, and familiarity of line-oriented text processing tools and techniques. The GNU textutils, for example, include tools like wc, tail, head, and uniq; other familiar text processing tools are grep, sed, awk, and in a more sophisticated way, perl and other scripting languages. These types of tools generally expect newline-delimited records and rely on regular expression patterns to identify parts of texts. As it happens, neither of these expectations is a good match for XML.


Using PYX

Let's take a look at PYX in action. PYX libraries exist for several programming languages, but much of the time it is most useful simply to use the command line tools xmln and xmlv. The first is a non-validating transformation tool, the second adds validation against a DTD. Under the hood, the expat and rxp parsers are compiled into these tools, but a user does not need to worry about the APIs for those parsers.

Listing 1. XML and PYX versions of a document
[PYX]# cat test.xml
<?xml version="1.0"?>
<!DOCTYPE Spam SYSTEM "spam.dtd" >
<!-- Document Comment -->
<?xml-stylesheet href="test.css" type="text/css"?>
<Spam flavor="pork" size="8oz">
  <Eggs>Some text about eggs.</Eggs>
  <MoreSpam>Ode to Spam (spam="smoked-pork")</MoreSpam>
</Spam>

[PYX]# ./xmln test.xml
?xml-stylesheet href="test.css" type="text/css"
(Spam
Aflavor pork
Asize 8oz
-\n
-
(Eggs
-Some text about eggs.
)Eggs
-\n
-
(MoreSpam
-Ode to Spam (spam="smoked-pork")
)MoreSpam
-\n
)Spam

You should notice that the transformation loses the DOCTYPE declaration and the comment in the original XML document. For many purposes, this is not important (parsers often discard this information as well). The PYX format, in contrast to the XML format, allows one to easily pose a variety of ad hoc questions about a document. For example: What are all the attribute values in the sample document? Using PYX, we can simply ask:

Listing 2. An ad hoc query using PYX format (attributes)
[PYX]# ./xmln test.xml | grep "^A" | awk '{print $2}'
pork
8oz

Getting this answer out of the original XML is a huge challenge: You either have to create a whole program that calls a parser and looks for tag attribute dictionaries, or come up with a complex regular expression that will find the information of interest. Complicating matters are the contents of the <MoreSpam> element, which include something that looks a lot like a tag attribute, but is not.

Here is another task that PYX makes simple: Let's try to dump the non-empty content lines of an XML document. One could do this with SAX, but doing so would require writing a little application with a characters() handler, and empty skeletons of several other handlers. What we might like is something similar to lynx -dump applied to HTML files -- a one-liner, in other words. One possibility is:

Listing 3. An ad hoc query using PYX format (contents)
[PYX]# ./xmln test.xml | grep '^-[^\\ ]' | sed 's/^-//'
Some text about eggs.
Ode to Spam (spam="smoked-pork")

Sean McGrath's article (see Resources) has additional similar examples.


Going back to XML

The PYX format is sufficiently simple that just about any competent programmer can write a PYX2XML tool in under an hour. Every line tells you exactly what needs to be output, whether it's a tag, PI, or content.

There is only a very slight statefulness to the PYX2XML conversion. Specifically, when an open tag is encountered, the (indefinitely many) lines that follow may contain attributes for the tag. After the attributes (if any) are output, a closing angle bracket is required. When the open tag is encountered, the conversion utility does not yet know how many attributes, if any, exist. Therefore, a "looking-for-attributes" state needs to be set to true or false.

Unfortunately, despite the notable simplicity of the PYX2XML conversion, the tool pyx2xml.py is broken in alarmingly many ways. It produces some odd-looking spacing in the XML, although the result is still well-formed. Far worse, there are actual programming errors that will crash the short script. Let me just provide a working implementation here for readers:

Listing 4. Python script for PYX-to-XML conversion
  import re, sys

  # Undo PYX escapes in a single pass, so that an escaped backslash
  # followed by "t" is not mistaken for a tab escape
  escapes = {r'\t': '\t', r'\n': '\n', '\\\\': '\\'}
  unescape = lambda s: re.sub(r'\\[tn\\]', lambda m: escapes[m.group()], s)
  write = sys.stdout.write
  get_attrs = False

  for line in sys.stdin:
     line = line.rstrip('\n')   # Drop the line terminator
     if not line:
        continue                # Skip blank lines
     if get_attrs and line[0] != 'A':
        get_attrs = False       # End of tag attributes
        write('>')
     if line[0] == '?':         # Proc Instr
        write('<?%s?>\n' % line[1:])
     elif line[0] == '(':       # Open tag
        write('<%s' % line[1:])
        get_attrs = True
     elif line[0] == 'A':       # Tag attrib
        name, _, val = line[1:].partition(' ')
        write(' %s="%s"' % (name, unescape(val)))
     elif line == r'-\n':       # Newline
        write('\n')
     elif line[0] == '-':       # Misc content
        write(unescape(line[1:]))
     elif line[0] == ')':       # Close tag
        write('</%s>' % line[1:])

Other considerations

The Pyxie project page contains a Python module called pyxie, which contains a number of classes that work with PYX-encoded documents in tree-based or event-based styles. If you adopt the PYX format for many uses (and if you use Python), it might be worth using some of these classes. But in a way, I feel like these classes somewhat miss the point. The virtue of the PYX format is its simplicity and its accessibility to line-oriented tools.

If you want an in-memory tree representation of an XML document, DOM is already available to do just this. If you want somewhat less convoluted APIs than those of DOM, you can also obtain tree structures using modules such as Python's xml_objectify, Perl's XML::Parser, Ruby's REXML, and Java's JDOM. pyxie is similar in purpose, but has no real advantage. Similarly, if you want fully general event-oriented processing of XML documents, you might as well use SAX or expat; pyxie offers no special advantage here.

There are times when you might want to process PYX documents in a way that is somewhat sensitive to the hierarchical structure of the data. At a certain point, this falls back into the same complexity we have with SAX or DOM, and the point of PYX is lost. But at an initial level of complexity, the only data structure one really needs in order to treat PYX in a hierarchical fashion is a tag stack. This is a fairly simple data structure requirement.

For example, in sequentially processing the test document above, you would perform the following stack operations:

1. Push "Spam"
2. Push "Eggs"
3. Pop ("Eggs")
4. Push "MoreSpam"
5. Pop ("MoreSpam")
6. Pop ("Spam")

In this simple case, the stack never gets more than two items deep, but in general it can get arbitrarily deep. Knowing when to push and when to pop is remarkably simple: Push when a start-tag line is encountered; pop when an end-tag line is encountered. Pop operations do not even need to know the end-tag string, since it always matches the tag on top of the stack. In fact, the PYX format would not lose any information by leaving off end-tag strings (but one would lose the convenience of self-identifying end-tags that do not require keeping a stack; for example, PYX2XML would be harder to write).
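
The push and pop rule above can be sketched in Python as follows. The function name stack_operations is hypothetical, and the sketch assumes well-formed PYX input (every end-tag line has a matching start-tag line before it).

```python
def stack_operations(pyx_lines):
    """Yield (operation, tag, depth_after) for each start-tag or end-tag line."""
    stack = []
    for line in pyx_lines:
        if line.startswith('('):
            stack.append(line[1:].rstrip('\n'))
            yield 'push', stack[-1], len(stack)
        elif line.startswith(')'):
            tag = stack.pop()          # the end-tag string itself is never consulted
            yield 'pop', tag, len(stack)

doc = ['(Spam', 'Aflavor pork', '(Eggs', '-Some text about eggs.', ')Eggs',
       '(MoreSpam', '-Ode to Spam (spam="smoked-pork")', ')MoreSpam', ')Spam']
for op, tag, depth in stack_operations(doc):
    print(op, tag, depth)
```

Attribute and content lines are simply skipped, since only the two tag prefixes affect the stack.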

At each point in the line-by-line processing of a PYX file, the single stack tells one everything there is to know about the hierarchical context of the current line. One can even construct an XPath-style qualifier by simply peeking into the stack; potentially certain operations on content or attributes might depend upon this context. This sort of processing goes slightly beyond what one can usually do with basic text utilities, but it nonetheless remains more ad hoc, flexible, and simpler than the full-blown APIs of an XML parser/interface.
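
As a minimal sketch of that idea, the following Python generator pairs each non-blank content line with a slash-separated path built from the current stack. The name content_with_paths and the path syntax are illustrative assumptions; the result only resembles XPath and is not a real XPath implementation.

```python
def content_with_paths(pyx_lines):
    """Yield (path, text) for each non-blank content line of a PYX document."""
    stack = []
    for line in pyx_lines:
        line = line.rstrip('\n')
        if line.startswith('('):
            stack.append(line[1:])
        elif line.startswith(')'):
            stack.pop()
        elif line.startswith('-') and line != '-\\n' and line[1:].strip():
            # Peek at the whole stack to describe where this content lives
            yield '/' + '/'.join(stack), line[1:]

pyx = ['(Spam', 'Aflavor pork', '-\\n', '-  ', '(Eggs', '-Some text about eggs.',
       ')Eggs', '-\\n', '-  ', '(MoreSpam', '-Ode to Spam (spam="smoked-pork")',
       ')MoreSpam', '-\\n', ')Spam']
for path, text in content_with_paths(pyx):
    print(path, '->', text)
```

A filter on the path (say, only content under /Spam/Eggs) is then a one-line condition, which is about as far as this style of processing should be pushed before a real parser API becomes the better choice.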


Conclusion

Designers of XML APIs have, unfortunately, largely forgotten the KISS principle ("Keep It Simple, Stupid"). There are certainly applications where the full power and complexity of DOM, SAX, or XSLT are warranted, and even necessary. But for a large number of XML applications, the popular APIs create unnecessary barriers to entry for day-to-day programmers. Fortunately, there are ways to avoid extra complexity, and this column has tried, and will continue to try, to provide readers with ways to make simple things simple.

Resources

  • See Sean McGrath's intro to the PYX format.
  • McGrath has also written a book that is largely about the usage of PYX, in combination with Python, called XML Processing with Python, Prentice Hall, 2000. This book was one of the titles I reviewed in my "Charming Python" column, Updating your reading list.
  • See this Perl library for working with (and converting to/from) PYX.
  • Find other articles in David Mertz's XML Matters column.
