
Tip: Command-line XML processing

Working with XML documents from the shell

David Mertz (mertz@gnosis.cx), Line commander, Gnosis Software, Inc.
David Mertz uses a wholly unstructured brain to write about structured document formats. David may be reached at mertz@gnosis.cx; his life may be pored over at http://gnosis.cx/dW/. And buy his book, Text Processing in Python.

Summary:  Most of the time, processing XML documents utilizes heavy-duty APIs and custom applications. However, the tradition of using small tools with I/O piped between them works fine on Unix-like platforms. Here, David shows you how you can use XML for this kind of quick-and-dirty processing with one-liners that are especially useful during development and debugging cycles.


Date:  07 May 2003
Level:  Intermediate


As much as I hate to say it, XML tools simply have not reached the level of convenience of the text utilities that are available at a Unix-like command line. For line-oriented files of whitespace- or comma-delimited text, it is quite amazing what you can accomplish with clever combinations of sed, grep, xargs, wc, cut, pipes, and short shell scripts.

In my opinion, it is not that XML is inherently resistant to the modular treatment flat text files enjoy; XML developers just need to learn from experience the best ways to componentize XML tools. For example, in writing this tip I had a few realistic sample tasks in mind, but what I found was that even those tools that have command-line facilities have not yet learned to play nicely with each other. Working with multiple tools is not intractable; it just requires a little bit of wrapping.

One fact worth noting is that quite a few people have written versions -- in various programming languages -- of similar simple tools. Each version behaves a bit differently, but they all tend to accomplish the same overall task. For this tip, I look at the tools xml_indexer, xmlcat, and xpath; the first two come from my Gnosis Utilities, while the last is a Perl module written by Matt Sergeant (get it from CPAN).

Finding words in XML prose

I have written previously about xml_indexer (see Resources), which creates an index of the words within XML documents by their XPath. For example, you can index then search an XML document with:


Listing 1. Indexing and searching on XPaths
                
% xml_indexer chap.xml
% indexer events were
/Users/dqm/chap.xml::/chapter/sect1[2]/sect2[1]/para[1]
/Users/dqm/chap.xml::/chapter/sect1[2]/sect2[4]/para[3]
1 files matched wordlist: ['events', 'were']
Processed in 0.062 seconds (SlicedZPickleIndexer)

These commands display the XPaths of the elements within the XML document chap.xml that contain the words "events" and "were" (not necessarily in that order or near each other). If other XML documents were added to the index, matching occurrences in them would also appear. By the way, new searches are almost instantaneous, even if multiple documents are indexed.

While it tells you a little bit to know that words occur at particular XPaths within particular documents, the point of a search is usually to see (or further process) the actual content matches. For that, you need to employ a command-line xpath tool; I have installed Perl's XML::XPath, whose behavior I like.

You can cut-and-paste the discovered XPaths into the xpath tool. For example:


Listing 2. Manually looking at an XPath
                
% xpath chap.xml '/chapter/sect1[2]/sect2[4]/para[3]'
Found 1 nodes:
-- NODE --
<para>It is not particularly remarkable that...
...
</para>

This points to a nice modularity in the tools. Moreover, if the XPath that's passed to xpath were to have wildcards in it, it might match more than just the one node. Unfortunately, if you want to automate looking at the nodes with matched words, the output of indexer does not have quite the right form to pipe to xpath: indexer separates the filename from the XPath with "::", and xpath only looks at one XPath at a time. You can do better.


A first little shell script

You might find a way to manage the impedance mismatch above using clever combinations of xargs, apply, pipes, and the like. But I found it easier to write a short (and reusable) shell script:


Listing 3. find-xml-elements
                
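#!/bin/sh
# Show each XML element that contains all of the given words.
# indexer prints hits as "file::xpath"; xpath expects "file xpath".
indexer "$@" 2> /dev/null | sed 's/::/ /' | while read -r file query
do
  # stdin is redirected so xpath cannot consume the remaining hits
  xpath "$file" "$query" 2> /dev/null < /dev/null
  echo
done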

As with other well-designed command-line tools, indexer and xpath send informational messages to STDERR, and the actual results to STDOUT. For my script, I am not interested in the STDERR messages. Now I can find all the nodes in which a list of words occur as easily as:


Listing 4. Searching XML elements for words
                
% find-xml-elements events were
<para>Lest we forget some events in a recent decade...
...
Salem and by HUAC.</para>
<para>It is not particularly remarkable that...
...
being uncovered.</para>

So far, so good. The search outputs a series of XML snippets, where each top-level element contains the searched words. However, the result is generally not quite a well-formed XML document, since it is multiply-rooted.
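
One way to deal with the multiple roots, sketched here under assumptions rather than taken from any of the tools above, is to wrap the fragments in a single enclosing element. The function name wrap_fragments and the default element name results are made up for illustration; any name that does not clash with the document's own vocabulary would do.

```shell
#!/bin/sh
# wrap_fragments: read XML fragments on stdin and emit them inside one
# enclosing element, so that the result has a single root.
# The default root name "results" is an arbitrary, hypothetical choice.
wrap_fragments() {
  root=${1:-results}
  printf '<%s>\n' "$root"
  cat
  printf '</%s>\n' "$root"
}

# Stand-in for the output of find-xml-elements: two sibling fragments.
printf '<para>first hit</para>\n<para>second hit</para>\n' | wrap_fragments
```

With the script from Listing 3 on the path, the equivalent call would be find-xml-elements events were | wrap_fragments. This only helps if each fragment is itself well-formed and carries no XML declaration of its own.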


Comparing XML documents and extracting text

One of the challenges of analyzing XML data is that XML documents can contain variations in formatting that are irrelevant to their semantic content: Some whitespace can be ignored, the order of attributes is discarded during parsing, empty elements may be either self-closed or have an end-tag, and entities can be encoded in a few different ways. In truth, even much of the whitespace that can't be ignored from a parser's perspective is nonetheless insignificant from an application point-of-view; pretty newlines and indenting are useful for people, and many applications (optionally) perform such stylistic formatting.

A rather large number of tools have been written to compare XML documents in a semantically useful way. Most of them have been given an obvious name like xmldiff (use Google to find versions for various programming languages). Underlying such a comparison of XML documents is a canonicalization of the layout of each document. Once inflexible algorithmic decisions have been made about the exact rendering of an XML document, semantically similar documents are easier to compare with generic tools like diff.
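
As a sketch of that idea, and not a description of how any of the xmldiff tools is implemented, the following assumes that xmllint from libxml2 is installed and uses its --c14n option to write each document as W3C Canonical XML before handing both to diff. The function name xml_canon_diff is hypothetical.

```shell
#!/bin/sh
# xml_canon_diff: compare two XML files after canonicalizing each one.
# Assumes xmllint (libxml2) is available; --c14n emits Canonical XML,
# which sorts attributes and expands empty elements to start/end pairs.
# Exit status: 0 if identical, 1 if different, 2 on error.
xml_canon_diff() {
  tmp=$(mktemp -d) || return 2
  xmllint --c14n "$1" > "$tmp/a" && xmllint --c14n "$2" > "$tmp/b"
  status=$?
  if [ "$status" -eq 0 ]; then
    diff "$tmp/a" "$tmp/b"
    status=$?
  else
    status=2
  fi
  rm -rf "$tmp"
  return "$status"
}
```

Canonical XML leaves whitespace inside element content untouched, so two files that differ only in indentation still show up as different; a pretty-printing pass would be needed to hide that as well.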

I use a Python script I wrote called xmlcat. The tool is not complicated -- it acts much like the standard cat utility -- but it canonicalizes XML documents along the way. The reason I like xmlcat more than similar tools like xmlpp (see Resources) is that it adds an option inspired by the Web browser lynx. If you pass the --dump argument to xmlcat, it outputs only the textual content of an XML document, eliminating the tags and using vertical whitespace to lay the text out in a moderately pretty way. For data-oriented XML this capability is of little use, but for marked-up prose it is handy.
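
To get a feel for what a text dump produces, here is a deliberately crude stand-in that needs nothing beyond sed. It is not how xmlcat works, and the function name strip_tags and the sample sentence are invented for the example.

```shell
#!/bin/sh
# strip_tags: crude approximation of a text dump of marked-up prose.
# It deletes anything that looks like <...> within a single line. It is
# not an XML parser and mishandles comments, CDATA sections, tags that
# span lines, and entity references such as &amp;.
strip_tags() {
  sed 's/<[^>]*>//g'
}

# Invented sample fragment with one nested inline element.
printf '<para>Lest we <emphasis>forget</emphasis> some events</para>\n' | strip_tags
```

For anything beyond a quick look at simple markup, a tool that actually parses the document is the safer choice.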


A second shell script for viewing text

If you search XML documents of prose for content words, you are most likely more interested in the content than in the markup. Filtering with xmlcat --dump is exactly the trick to remove unwanted XML tags. However, directly piping the output of find-xml-elements to xmlcat is not quite right, since that output is not an entire well-formed XML document (it is a series of fragments, as noted). A short shell script solves the problem:


Listing 5. find-xml-text
                
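#!/bin/sh
# Show the text of each XML element that contains all of the given words.
# Each fragment is dumped on its own, because the combined output of all
# hits is not a single well-formed document.
indexer "$@" 2> /dev/null | sed 's/::/ /' | while read -r file query
do
  # stdin is redirected so xpath cannot consume the remaining hits
  xpath "$file" "$query" 2> /dev/null < /dev/null | xmlcat --dump
  echo
done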

The output from find-xml-text plays nicely with standard text utilities. For example, I would like to display all the paragraphs that contain some search terms, but remove any left indent from their lines and limit line-length:


Listing 6. Searching XML element text for words
                
% find-xml-text events were | sed 's/^ *//' | fmt -w 70
Lest we forget some events in a recent decade...
...
...those in Salem and by HUAC.

It is not particularly remarkable...
...
...being uncovered.
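
To see what the two filters in that pipeline contribute, they can be tried on literal text with no XML tools involved. The function name tidy_text and the sample lines are invented for this sketch.

```shell
#!/bin/sh
# tidy_text: strip leading spaces from every line, then refill the text
# so that no line is wider than 70 columns. These are the same two
# filters applied to the output of find-xml-text in Listing 6.
tidy_text() {
  sed 's/^ *//' | fmt -w 70
}

# Invented sample input with ragged indentation and odd line breaks.
tidy_text <<'EOF'
    Lest we forget some events
        in a recent decade, the text here is
  indented unevenly and broken at odd places.
EOF
```

The sed step has to come first: fmt treats a change in indentation as a hint about paragraph structure, so unevenly indented lines are refilled more predictably once the indent is gone.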


Resources

  • Read Kip Hampton's worthwhile article from last year, "Perl and XML on the Command Line," which looks at Perl tools for command-line XML processing.

  • Find the Perl tools, xmldiff (compare XML documents) and xmlpp (XML pretty printer), at the DecisionSoft site.

  • Go to Gnosis Utilities and download several of the utilities discussed in this article.

  • Review this XML Matters column as it discusses full text indexing of XML documents by XPath (developerWorks, May 2001).

  • Find more XML resources on the developerWorks XML zone. For a complete list of XML tips to date, check out the tips summary page.

  • IBM trial software: Build your next development project with trial software available for download directly from developerWorks.

  • IBM XML certification: Find out how you can become an IBM-Certified Developer in XML and related technologies.

