Streams are a concept related to pipes: they are what the pipes operate on. The idea of a stream is that a program doesn't have to have all the data available before it can start working; this lets it work on data as it is generated, data coming over a network, or files too large to fit in memory.
(Note: I'm going to survey a lot of tools, specs, and libraries in the following paragraphs. You can find links to all of them in Resources at the end of the article.)
The pipes and streams metaphor is very successful in UNIX, and applies with varying degrees of success to many other systems. It's available, in one form or another, in most programming languages and operating systems. In XML usage, the most common form of pipes and streams processing is SAX Filters. As most readers of this column know, SAX is the Simple API for XML. It is a stream-based API that parses XML and calls event handlers as significant events (like opening or closing an element) occur during the parse. Since SAX processing does not have to maintain state or build a huge data structure (like the DOM), it is commonly used for XML tasks that operate on large data sets (hard to fit in memory), need to be very fast, or are sparse (only select parts of the XML data are used and the rest is ignored).
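To make the event-driven model concrete, here is a minimal sketch of a SAX handler using Python's standard xml.sax module. The document and element names are invented for the example, and a real SAX filter would forward events downstream rather than just collecting them:

```python
import xml.sax

class ElementCounter(xml.sax.ContentHandler):
    """Count elements as they stream past, never building a full tree."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        # Called once per opening tag as the parser streams through the input.
        self.counts[name] = self.counts.get(name, 0) + 1

def count_elements(xml_string):
    """Parse a document with SAX and return a tag-name histogram."""
    handler = ElementCounter()
    xml.sax.parseString(xml_string.encode("utf-8"), handler)
    return handler.counts
```

Because the handler sees one event at a time, memory use stays flat no matter how large the input document is.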
The problem with using pipes and filters at the programming-language level, say with the Java™ InputStreamReader class or SAX Filters, is that you work at a very low level with tedious, error-prone, and repetitive code. The libraries and tools under review are motivated by the desire to raise the abstraction level of the code while maintaining flexibility for a wide variety of XML processing tasks. Also, in many cases connecting several XML tools together for staged processing (a pipeline) results in re-parsing the XML several times. It's better to keep the parsed XML in memory as it passes from stage to stage.
But first, let's look at the problems you can address with pipestreaming libraries. It's all well and good to want this abstract way to manipulate XML, but what can you do with it?
Some common applications of an XML pipeline include:
- Read/Parse XML (from disk, the network, a CMS, or elsewhere)
- Convert non-XML to XML (for instance, convert reStructured Text to XHTML)
- Validation (whether using a DTD, XSchema, RelaxNG, or Schematron)
- Aggregate XML fragments (for instance, through XInclude or DITA Maps)
- Transform (probably using XSLT)
- Fill in templates (replacing placeholder values with actual content)
- Format for display or print (through CSS or XSL-FO)
- Serialize/Write back out to XML (to disk, the network, a CMS, or elsewhere)
- Tie it all together as a declarative syntax (generally also XML)
Note that a pipeline can include any or all of these applications, some might be repeated and varied, and other custom elements might be included. But these seem to be the core ingredients for XML processing. And as noted above, the pipeline should keep the parsed XML through the process, not re-parse at each stage.
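Stripped to its essence, a pipeline of this kind is just function composition over an in-memory document. A minimal sketch, with toy stages and a plain dict standing in for the parsed XML (all names here are invented for illustration):

```python
from functools import reduce

def run_pipeline(document, stages):
    """Thread one in-memory document through each stage in turn,
    so nothing is re-parsed between stages."""
    return reduce(lambda doc, stage: stage(doc), stages, document)

# Toy stages; a plain dict stands in for a parsed XML tree.
def uppercase_title(doc):
    return {**doc, "title": doc["title"].upper()}

def add_footer(doc):
    return {**doc, "footer": "generated by pipeline"}

result = run_pipeline({"title": "hello"}, [uppercase_title, add_footer])
```

The real systems discussed below add error handling, declarative configuration, and heavyweight stages, but the core data flow is this simple.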
For instance, you could take small bits of markup representing people and events, and assemble a Web calendar out of them. Or you could run that in reverse, scanning several people's Web calendars to aggregate their schedules and find a meeting time. I will use calendars and people for this example, but the ideas relate to anything expressible in XML (which is a pretty hefty list these days).
One reason I use calendars as an example is a movement called 'microformats', which takes existing specifications and expresses them in existing XML dialects rather than reinventing the wheel. The most common XML dialect microformats focus on is XHTML, using class and link attributes to add semantics, with the goal of being both human-readable (in a Web page) and machine-readable (by including machine-specific data in attributes). There are microformats for people and organizations (hCard, based on the vCard specification), calendars and events (hCalendar, based on the iCalendar specification), outlines and links (XOXO), tags and keywords (rel-tag), reviews (hReview), social networks (XFN), content licensing (rel-license), and more. For the purposes of this article, I will focus on hCalendar.
Listing 1 is an example of the hCalendar microformat:
Listing 1. hCalendar microformat
<div class="vevent">
  <abbr class="dtstart" title="20060322T0830-0800">
    March 22, 2006 - 08:30
  </abbr> -
  <abbr class="dtend" title="20060322T0900-0800">
    09:00
  </abbr> -
  <span class="summary">
    Dentist Appointment
  </span> - at
  <span class="location">
    Office on Pacific Ave.
  </span>
  <div class="description">
    Get permanent crown installed.
  </div>
</div>
This particular example was generated with the hCalendar creator. Phil Dawes has a python library to parse hCalendar and hCard on his site, and on the hCalendar page you can find many more tools and implementations. Firefox® users with the Greasemonkey extension can even find Greasemonkey scripts for extracting a variety of microformats from the pages they visit. Upcoming.org and Eventful.com put events and calendars on their sites in hCalendar format. Certainly we have access to plenty of data to work with in microformats.
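To give a feel for the machine-readable side, here is a minimal sketch (my own, not Phil Dawes's library) that pulls class-labeled fields out of a simplified version of the Listing 1 markup using Python's standard ElementTree. It assumes the markup is well-formed XHTML:

```python
import xml.etree.ElementTree as ET

HCAL_FIELDS = {"dtstart", "dtend", "summary", "location", "description"}

def extract_vevent(xhtml_fragment):
    """Collect hCalendar fields from a vevent fragment, preferring the
    machine-readable title attribute (used by abbr dates) over the text."""
    fields = {}
    root = ET.fromstring(xhtml_fragment)
    for elem in root.iter():
        cls = elem.get("class")
        if cls in HCAL_FIELDS:
            fields[cls] = elem.get("title") or "".join(elem.itertext()).strip()
    return fields

# A simplified version of the Listing 1 markup.
listing_1 = (
    '<div class="vevent">'
    '<abbr class="dtstart" title="20060322T0830-0800">March 22, 2006 - 08:30</abbr>'
    '<span class="summary">Dentist Appointment</span>'
    '<span class="location">Office on Pacific Ave.</span>'
    '</div>'
)
event = extract_vevent(listing_1)
```

Real-world HTML is rarely this tidy, which is why dedicated microformat parsers exist, but the principle is just this: walk the tree and read the class-labeled values.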
How do you create a pipeline to process the XML data once you've found or created it? You have several options of varying sophistication and complexity. SAX Filters are a low-level type of plumbing that either underlies or inspires some of the other toolkits. XMLGAWK extends GNU AWK, the line-oriented programming language, with XML support, making it element-oriented instead of line-oriented. XmlRpcFilteringPipe is Lion Kimbro's take on pipestreaming, using XML-RPC to build the pipeline. NetKernel™ appears to be a whole toolkit that takes the idea of pipestreaming and turns all the knobs to eleven. Apache Cocoon is built around the concept of "component pipelines," which are conceptually similar to, but more complex than, XML pipestreaming. All of these are good approaches, but I want to discuss two projects that strike (for me) the best balance of power and simplicity; then I can discuss a couple of promising developments.
First, there is XPipe, from Sean McGrath. Sean also developed Pyxie (now Pixie2), which turns XML into a line-oriented format suitable for processing by standard UNIX pipes and filters. XPipe builds on this idea with an extensible set of Java components. Sean was on the right path, but the effort seems to have fizzled. The site hasn't been updated for some time, and Sean told me in e-mail that while his company continues to develop a commercial version, he hasn't had time to work on the open-source version, and if he did he would substantially change how it works. It was a good approach, but it could be simpler still while retaining the basic concept.
The best contender I have found so far is Norm Walsh's Simple XML Pipeline, or SXPipe. It has stages for nearly all the basic XML processing tasks I listed at the beginning of this article (not a coincidence; I based the list on the capabilities of SXPipe), it is well specified, and its XML is easy to write. There are some caveats: by default it requires Java 1.5, which on OS X you can currently only get as a developer preview, and the stages themselves are pretty heavyweight Java components, making the system less easy to extend than it might be. But these are relatively small issues compared to the things SXPipe got right. Aside from the straightforward syntax and processing model, SXPipe is simple (only five elements) and standards-based: it leverages XPath 1.0, XML Namespaces, XML Base, and XSL Transformations, and uses these specs in appropriate contexts (not always the case). The language is even simpler than the W3C Note "XML Pipeline Definition Language" (which Norm co-edited), while keeping the essential power and flexibility of that earlier work. Listing 2 shows an example SXPipe document:
Listing 2. SXPipe document
<pipeline xmlns="http://sxpipe.dev.java.net/xmlns/sxpipe/">
  <stage process="Read" input="dw-article.xml"/>
  <stage process="Validate" schema="dw-article-5.0.xsd"/>
  <stage process="Transform" stylesheet="dw-article-5.0.xsl"/>
  <stage process="Write" output="dw-article.html"/>
</pipeline>
That's pretty simple and sweet. In my ideal world I'd be able to extend this using Python. Then, for instance, I could easily add stages for converting reStructured Text into XML as part of my pipeline, or apply Cheetah templates. Implementing SXPipe in Python wouldn't be terribly difficult, but making it portable has challenges, since Python doesn't ship with XSLT or XML validation libraries by default.
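To suggest how such a Python implementation might begin, this sketch (my own, not part of SXPipe) parses the pipeline syntax of Listing 2 into a list of stage descriptions with the standard ElementTree; dispatching each stage to real Read/Validate/Transform/Write code is left open:

```python
import xml.etree.ElementTree as ET

SXPIPE_NS = "http://sxpipe.dev.java.net/xmlns/sxpipe/"

def read_pipeline(pipeline_xml):
    """Parse an SXPipe pipeline document into (process, attributes)
    pairs, ready to hand to a stage dispatcher."""
    root = ET.fromstring(pipeline_xml)
    stages = []
    for stage in root.findall("{%s}stage" % SXPIPE_NS):
        attrs = dict(stage.attrib)
        process = attrs.pop("process")
        stages.append((process, attrs))
    return stages

listing_2 = """
<pipeline xmlns="http://sxpipe.dev.java.net/xmlns/sxpipe/">
  <stage process="Read" input="dw-article.xml"/>
  <stage process="Validate" schema="dw-article-5.0.xsd"/>
  <stage process="Transform" stylesheet="dw-article-5.0.xsl"/>
  <stage process="Write" output="dw-article.html"/>
</pipeline>
"""
stages = read_pipeline(listing_2)
```

A dispatcher would then map each process name to a Python callable, which is exactly where stages for reStructured Text or Cheetah could be slotted in.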
The future of pipestreaming is bright. For one thing, the fact that so many tools draw on the same inspiration suggests a real need for pipestreaming. Whether the world eventually standardizes on SXPipe (for instance) or continues to roll its own, the pipes and streams metaphors are here to stay. From the perspective of my favorite programming language, Python, PullDom has been part of the standard library for ages now, and will be supplemented in the upcoming 2.5 release by Fredrik Lundh's excellent ElementTree library. I took a stab at implementing SXPipe using ElementTree, but it still doesn't have tools for transformation or validation, and only primitive XPath support. Martijn Faassen's lxml library wraps the extremely powerful, but difficult-to-use, libxml2 and libxslt in Python, and exposes the same simple interface as ElementTree, as well as XInclude, full XPath, XSLT, and various forms of validation. As that project stabilizes, it will be relatively easy to create a nice implementation of SXPipe in Python.
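For readers who haven't met it, PullDom combines SAX streaming with DOM convenience: you pull events off the parser and expand only the subtrees you care about into DOM nodes. A minimal sketch, with invented element names:

```python
from xml.dom import pulldom

def summaries(xml_string):
    """Stream a document, expanding only <summary> elements into
    DOM subtrees and skipping everything else."""
    found = []
    events = pulldom.parseString(xml_string)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == "summary":
            events.expandNode(node)  # build a DOM for just this subtree
            found.append(node.firstChild.data)
    return found
```

This is the sparse-processing style mentioned earlier: the bulk of the document streams by untouched, and only the interesting fragments ever become trees.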
Now you have some microformat content and a tool for pipefitting. What can you do with it? Let's explore a couple of possibilities. This is a thought experiment, so the URLs are just examples and don't really point anywhere. Say you're part of a family of four, the Nuclear family: Wanda (mom), Xathrus (dad), Yolanda (daughter), and Zander (son). Each of you has a calendar on your own Web page in the format http://example.com/calendar/[name]/[period], where the period could be a year, year-month, year-month-day, or a more natural-language phrase like "tomorrow" or "next week." Now you want to take all of those calendars and periodically munge them together for import into a program like Apple®'s iCal® or Mozilla®'s Sunbird™. Grab an XSLT stylesheet from the Web to convert hCalendar to iCalendar, call it "cal-transform.xsl", and use the source document (source-calendar.xml) in Listing 3:
Listing 3. source-calendar.xml
<calendar xmlns:xi="http://www.w3.org/2001/XInclude">
  <xi:include href="http://example.com/calendar/Wanda/today"/>
  <xi:include href="http://example.com/calendar/Xathrus/today"/>
  <xi:include href="http://example.com/calendar/Yolanda/today"/>
  <xi:include href="http://example.com/calendar/Zander/today"/>
</calendar>
Next, apply the pipeline in Listing 4 to your source document:
Listing 4. Pipeline for the source document
<pipeline xmlns="http://sxpipe.dev.java.net/xmlns/sxpipe/">
  <stage process="Read" input="source-calendar.xml"/>
  <stage process="XInclude"/>
  <stage process="Transform" stylesheet="cal-transform.xsl"/>
  <stage process="Write" systemId="family-calendar.ics"/>
</pipeline>
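Under the hood, the XInclude stage could be approximated with the standard library's xml.etree.ElementInclude module. This sketch (my own approximation, not SXPipe's actual implementation) resolves the hrefs from an in-memory table instead of fetching real URLs:

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

# In-memory stand-ins for two of the per-person calendar URLs.
CALENDARS = {
    "http://example.com/calendar/Wanda/today": "<event>Dentist</event>",
    "http://example.com/calendar/Zander/today": "<event>Soccer</event>",
}

def loader(href, parse, encoding=None):
    """Resolve xi:include hrefs from the table above."""
    if parse == "xml":
        return ET.fromstring(CALENDARS[href])
    raise ValueError("only parse='xml' is handled in this sketch")

source = ET.fromstring(
    '<calendar xmlns:xi="http://www.w3.org/2001/XInclude">'
    '<xi:include href="http://example.com/calendar/Wanda/today"/>'
    '<xi:include href="http://example.com/calendar/Zander/today"/>'
    '</calendar>'
)
ElementInclude.include(source, loader=loader)  # replaces each xi:include in place
```

After the call, source holds the fetched <event> elements where the xi:include elements used to be, which is exactly the aggregation the family-calendar pipeline relies on.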
Given all of the above, running that pipeline should give you a single calendar file that you can import. Let's try a different use case. When microformats catch on, you should be able to download your own data from the bank (account and credit statements), the grocery store (your receipts in digital format), even the library. For most of the data you don't particularly care whether it is valid, but you want to validate the banking data against a schema. For each data source you want to transform the data to get it all into the same format. So use your imagination (and ignore security considerations for now; this is pipe-dreaming after all) to pull all of these disparate sources into one stream, as in Listing 5:
Listing 5. The single stream
<pipeline xmlns="http://sxpipe.dev.java.net/xmlns/sxpipe/">
  <!-- Get my schedule for the day -->
  <stage process="Read" input="http://myjob.com/myappointments"/>
  <stage process="Transform" stylesheet="calendar-to-portal.xsl"/>
  <stage process="Write" systemId="work.xml"/>
  <!-- Do I have any books due back at the library? -->
  <stage process="Read" input="http://muni-library.gov/books-due"/>
  <stage process="Transform" stylesheet="library-to-portal.xsl"/>
  <stage process="Write" systemId="library.xml"/>
  <!-- Get latest transactions in my chequing account -->
  <stage process="Read" input="http://mybank.com/chequing-acct"/>
  <stage process="Validate" schema="my-bank-acct.xsd"/>
  <stage process="Transform" stylesheet="chequing-to-portal.xsl"/>
  <stage process="Write" systemId="chequing.xml"/>
  <!-- Get latest transactions on my credit card -->
  <stage process="Read" input="http://mybank.com/my-credit-acct"/>
  <stage process="Validate" schema="my-creditcard.xsd"/>
  <stage process="Transform" stylesheet="creditcard-to-portal.xsl"/>
  <stage process="Write" systemId="creditcard.xml"/>
  <!-- Check the weather forecast -->
  <stage process="Read" input="http://localweather.com"/>
  <stage process="Transform" stylesheet="weather-to-portal.xsl"/>
  <stage process="Write" systemId="weather.xml"/>
  <!-- Create an XInclude document to glue it all together -->
  <stage process="Document" label="accumulator">
    <portal xmlns:xi="http://www.w3.org/2001/XInclude"
            xmlns="http://www.example.com/portal">
      <xi:include href="work.xml"/>
      <xi:include href="library.xml"/>
      <xi:include href="chequing.xml"/>
      <xi:include href="creditcard.xml"/>
      <xi:include href="weather.xml"/>
    </portal>
  </stage>
  <stage process="XInclude"/>
  <!-- Convert it all to HTML -->
  <stage process="Transform" stylesheet="portal-to-html.xsl"/>
  <!-- Voila! We have our own personal portal on the world -->
  <stage process="Write" systemId="my-daily-portal.html"/>
</pipeline>
Hmmm, you may have noticed a bit of hand-waving in that last example, but that's because I haven't yet covered the intersection of microformats and syndication. A lot of current microformat content is designed (or is rapidly being retrofitted) to flow through newsfeed aggregators originally built for reading blogs in RSS format. And a lot more than just hCalendar flows into the aggregator, as I hinted in the previous example: microformats already exist for reviews and outlines, events and people, and soon you'll likely see microformats for financial transactions, overdue books, weather forecasts, plus things not even imagined yet.
Along with microformats, another force on the horizon is Atom, which is really two different specifications: the Atom syndication format and the Atom API. Syndication with Atom is very much like syndication with the older RSS formats, but more clearly specified (and it's an IETF standard). But that's a topic for the next article, in which I will explore the intersection of microformats and Atom syndication. Until then, may all your pipes be leak-free and all your streams be clean.
- Tip: Command-line XML Processing (developerWorks, May 2003): Read David's tip on some of the same issues.
- XML Matters: The RXP parser (developerWorks, August 2003): Review David's earlier column about command-line tools for XML along the lines of UNIX pipes.
- SAX Filters: Explore this low-level plumbing that either underlies or inspires other toolkits.
- XMLGAWK: Get the power and simplicity of GNU AWK, one of the old standby tools for line-oriented text processing, applied to XML through XML extensions.
- About Microformats: Visit the official microformats site to read the goal and strategies of microformats.
- SXPipe: Check out Norm Walsh's foray into XML pipestreaming; it seems to be really well thought out. The author's only complaint is the Java implementation makes it harder to extend from Python.
- SXPipe Project: Find details on the Java.net project page, including official specification, JavaDoc, and source code.
- XML Pipeline Definition Language: Read the W3C Note by Norm Walsh and Eve Maler.
- XmlRpcFilteringPipe: See Lion Kimbro's take on pipestreaming; it uses XML-RPC for the pipeline.
- Introducing NetKernel: Read the philosophy and origins of NetKernel.
- NetKernel Homepage: Explore further information and developments with 1060's NetKernel.
- XPipe: Check out Sean McGrath's entry in the pipestream sweepstakes.
- Greasemonkey scripts for microformats: Try this set of scripts for Firefox with the Greasemonkey extension that let you access microformat content embedded in Web pages.
- hCalendar FAQ: Get answers to frequently asked questions about hCalendar.
- hCalendar: Check out the hCalendar home page.
- hCalendar creator: Create hCalendar content with this Web form-based tool.
- ifreebusy.com: Look at this startup focused on hCalendar.
- Upcoming: Check out this set of events and calendars using hCalendar.
- Eventful: Visit another calendar/scheduling site which lists information using hCalendar.
- Family Calendars: Read a post by Dori Smith that provided some ideas for the examples in this article, although this article doesn't address her full wishlist yet.
- XML Matters: Find other articles in this developerWorks column.
- developerWorks XML zone: Find more XML resources here, including articles, tutorials, tips, and standards.
- IBM Certified Solution Developer -- XML and related technologies: Learn how to get certified.
Get products and technologies
- Structured Blogging: Get plugins for Wordpress and Movable Type, to add microformats to your blog.
- Python microformat parser: Try Phil Dawes' parser for various microformats; it currently handles only hCard and hCalendar.
- ElementTree: Python 2.5 will include Fredrik Lundh's pythonic API for XML processing, or install it separately.
- lxml: This library uses the same simple API as ElementTree, but is built on top of libxml2 and libxslt, making it very fast and providing extra features, such as full XPath 1.0, XInclude, XSLT, XML Schema, and RelaxNG validation.
Dethe Elza's favorite job title has been Chief Mad Scientist. Dethe can be reached at firstname.lastname@example.org. He keeps a blog mainly about Python and Mac OS X at http://livingcode.blogspot.com/ and writes programs for his kids. Suggestions and recommendations on this column are welcome.
David Mertz is a great believer in open standards, and is only modestly intimidated by verbosity. David may be reached at email@example.com; his life pored over at http://gnosis.cx/dW/. Suggestions and recommendations on this, past, or future columns are welcomed. Check out David's book Text Processing in Python.