Start working with XMLStarlet

Open source toolkit allows you to work with XML from the command line

Learn how to use the XMLStarlet command-line utility to format, transform, fix, and edit XML using a set of simple commands. Jack Herrington shows you how easy it is to get up and running -- and simplify your life -- with this powerful tool.

Jack Herrington (jherr@pobox.com), Senior Software Engineer, Code Generation Network

An engineer with more than 20 years of experience, Jack Herrington is currently editor in chief of the Code Generation Network. He is the author of Code Generation in Action. You can contact him at jack_d_herrington@codegeneration.net.



10 June 2005

XMLStarlet is an open source XML toolkit that you can use on your UNIX®, Mac OS® X, or Microsoft® Windows® command line. You can use XMLStarlet to validate XML, to format it, to select portions of it, to transform it with XSLT, even to make edits. This means you can put XML utilities into your shell scripts without writing any custom code in a programming language like Perl or Java®.

To get started with XMLStarlet, you need to install it. XMLStarlet depends on the libxml2 and libxslt libraries. On Windows, you don't need to install them separately: they come with the Win32 package. You can download the Win32 executable and put it somewhere on your path so it's easy to run from the command line. If you're running UNIX and your machine doesn't already have libxml2 and libxslt, you must download and install them first (see Resources).

Next, surf over to the XMLStarlet home page, and download the latest build (see Resources). Run the ./configure script to set up the build scripts. Then, run make install to build the package and install it. If you aren’t the super user, you should use sudo make install so the commands are installed in the system directories.

You might also want to check out the XML, XSLT, and XML Path Language (XPath) pages to keep up with these three standards; they're critical to making the most of XMLStarlet (see Resources).

The basics

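Now that it's installed, you can navigate around XMLStarlet. Start by running the xml command on its own (see Listing 1). Some package managers install the binary as xmlstarlet rather than xml; if yours does, substitute that name in the commands that follow.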

Listing 1. The XMLStarlet help page
% xml
XMLStarlet Toolkit: command-line utilities for XML
Usage: xml [<options>] <command> [<cmd-options>]
where <command> is one of:
ed    (or edit)      - Edit/Update XML document(s)
sel   (or select)    - Select data or query XML document(s) (XPATH, etc)
tr    (or transform) - Transform XML document(s) using XSLT
val   (or validate)  - Validate XML document(s) (well-formed/DTD/XSD/RelaxNG)
fo    (or format)    - Format XML document(s)
el    (or elements)  - Display element structure of XML document
c14n  (or canonic)   - XML canonicalization
ls    (or list)      - List directory as XML
esc   (or escape)    - Escape special XML characters
unesc (or unescape)  - Unescape special XML characters
pyx   (or xmln)      - Convert XML into PYX format (based on ESIS - ISO 8879)
p2x   (or depyx)     - Convert PYX into XML
<options> are:
--version            - show version
--help  - show help
Wherever file name mentioned in command help it is assumed
that URL can be used instead as well.

Type: xml <command> --help <ENTER> for command help

XMLStarlet is a command line toolkit to query/edit/check/transform
XML documents (for more information see http://xmlstar.sourceforge.net/)

The basic format of each command is xml <command> followed by some options. Getting help for any command is as easy as xml <command> --help. For example, Listing 2 shows the help for the edit (ed) command.

Listing 2. Help for the edit command
% xml ed --help
XMLStarlet Toolkit: Edit XML document(s)
Usage: xml ed <global-options> {<action>} [ <xml-file-or-uri> ... ]
where
<global-options>  - global options for editing
<xml-file-or-uri> - input XML document file name/uri (stdin otherwise)

<global-options> are:
-P (or --pf)        - preserve original formatting
-S (or --ps)        - preserve non-significant spaces
-O (or --omit-decl) - omit XML declaration (<?xml ...?>)
-N <name>=<value>   - predefine namespaces (name without 'xmlns:')
ex: xsql=urn:oracle-xsql
Multiple -N options are allowed.
-N options must be last global options.
--help or -h        - display help

where <action>
-d or --delete <xpath>
-i or --insert <xpath> -t (--type) elem|text|attr -n <name> -v (--value) <value>
-a or --append <xpath> -t (--type) elem|text|attr -n <name> -v (--value) <value>
-s or --subnode <xpath> -t (--type) elem|text|attr -n <name> -v (--value) <value>
-m or --move <xpath1> <xpath2>
-r or --rename <xpath1> -v <new-name>
-u or --update <xpath> -v (--value) <value>
-x (--expr) <xpath> (-x is not implemented yet)

XMLStarlet is a command line toolkit to query/edit/check/transform
XML documents (for more information see http://xmlstar.sourceforge.net/)

This help display looks complicated, but the good stuff is at the bottom; there, you learn about deleting XML nodes, inserting them, changing their value, and more.


XML directory listings

Long code lines

Some code lines in this article are too long to fit in the window without being truncated. These lines are wrapped in the code listings, although they appear on a single line at the actual command line. Such lines are indicated with a » symbol (see, for example, Listing 3).

To begin playing with XMLStarlet, you need XML. That brings you to your first command, xml ls, which gives a listing of the current directory in XML. Listing 3 shows an example.

Listing 3. A directory listing in XML
% xml ls
<xml>
    <d p="rwxr-xr-x" a="2005.05.04 23:03:46" 
    » m="2004.03.24 16:21:02" s="374" n="."/>
    <d p="rwxr-xr-x" a="2005.05.04 23:03:46" 
    » m="2005.05.04 22:13:41" s="1938" n=".."/>
    <f p="rw-r--r--" a="2005.03.24 17:53:52" 
    » m="2004.03.24 01:13:43" s="6148" n=".DS_Store"/>
    <f p="rw-r--r--" a="2005.03.24 17:53:52" 
    » m="2004.03.24 00:41:46" s="173" n="build.xml"/>
    <d p="rwxr-xr-x" a="2005.04.30 11:34:27" 
    » m="2004.03.24 01:13:43" s="544" n="docs"/>
    <f p="rw-r--r--" a="2005.03.24 17:53:52" 
    » m="2004.03.21 18:41:58" s="641" n="input.xml"/>
    <f p="rw-r--r--" a="2005.03.24 17:53:52" 
    » m="2004.03.23 23:41:15" s="3587" n="main.xsl"/>
    <f p="rw-r--r--" a="2005.03.24 17:53:52" 
    » m="2004.03.24 00:37:10" s="184" n="Makefile"/>
    <f p="rw-r--r--" a="2005.03.24 17:53:52" 
    » m="2004.03.24 00:36:41" s="3869" n="MyGenerator.class"/>
    <f p="rw-r--r--" a="2005.03.24 17:53:52" 
    » m="2004.03.24 00:36:33" s="5265" n="MyGenerator.java"/>
    <d p="rwxr-xr-x" a="2005.04.30 11:34:25" 
    » m="2004.03.24 00:20:07" s="272" n="output"/>
</xml>

You may think this directory listing displays too much information. If so, you can (for example) remove the directory nodes, as shown in Listing 4.

Listing 4. Directory listing without directory nodes
% xml ls | xml ed -d "//d"
<?xml version="1.0"?>
<xml>
    <f p="rw-r--r--" a="2005.03.24 17:53:52" 
    » m="2004.03.24 01:13:43" s="6148" n=".DS_Store"/>
    <f p="rw-r--r--" a="2005.03.24 17:53:52" 
    » m="2004.03.24 00:41:46" s="173" n="build.xml"/>
    <f p="rw-r--r--" a="2005.03.24 17:53:52" 
    » m="2004.03.21 18:41:58" s="641" n="input.xml"/>
    <f p="rw-r--r--" a="2005.03.24 17:53:52" 
    » m="2004.03.23 23:41:15" s="3587" n="main.xsl"/>
    <f p="rw-r--r--" a="2005.03.24 17:53:52" 
    » m="2004.03.24 00:37:10" s="184" n="Makefile"/>
    <f p="rw-r--r--" a="2005.03.24 17:53:52" 
    » m="2004.03.24 00:36:41" s="3869" n="MyGenerator.class"/>
    <f p="rw-r--r--" a="2005.03.24 17:53:52" 
    » m="2004.03.24 00:36:33" s="5265" n="MyGenerator.java"/>
</xml>

You use the edit command (ed) to remove the d nodes from the XML. The ls command outputs the directory to the standard output. The pipe (|) then redirects the standard output to the standard input of the edit command, which removes the d nodes from the listing. You specify the d nodes using the XPath expression //d, which matches a d node at any level in the tree. You can make this command more specific by using /xml/d.
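To make the difference between //d and /xml/d concrete, here is a minimal sketch in Python's standard library (Python is an assumption of this example; XMLStarlet itself needs none). The sample is hypothetical: real xml ls output is flat, so a nested d node has been added to make the two expressions give different results.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like `xml ls` output, with one nested <d>
# added so the two XPath expressions give different results.
SAMPLE = """<xml>
  <d n="docs"><d n="api"/></d>
  <f n="build.xml"/>
  <f n="input.xml"/>
</xml>"""

root = ET.fromstring(SAMPLE)

# /xml/d: only <d> elements that are direct children of the root <xml>.
top_level = [d.get("n") for d in root.findall("d")]

# //d: <d> elements at any depth in the tree.
any_depth = [d.get("n") for d in root.iter("d")]

# Deleting every //d match leaves only the <f> elements.
for parent in list(root.iter()):
    for child in list(parent):
        if child.tag == "d":
            parent.remove(child)
remaining = [child.get("n") for child in root]

print(top_level)   # ['docs']
print(any_depth)   # ['docs', 'api']
print(remaining)   # ['build.xml', 'input.xml']
```

On the flat output of xml ls the two expressions match the same nodes, so the choice only matters once the document has nesting.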

Now, suppose you also want to remove the a, m, and p attributes (see Listing 5).

Listing 5. Directory listing without a, m, and p attributes
% xml ls | xml ed -d "//d" -d "//@a" -d "//@m" -d "//@p"
<?xml version="1.0"?>
<xml>
    <f s="6148" n=".DS_Store"/>
    <f s="173" n="build.xml"/>
    <f s="641" n="input.xml"/>
    <f s="3587" n="main.xsl"/>
    <f s="184" n="Makefile"/>
    <f s="3869" n="MyGenerator.class"/>
    <f s="5265" n="MyGenerator.java"/>
</xml>

That’s more workable. Your listing is down to just files, and within the file nodes you see only the size and name of the file. To make the display easier to follow, you can put the result into a file called ls.xml. You can also use the rename edit function to change the f tag to file (see Listing 6).

Listing 6. Directory listing with f renamed to file
% cat ls.xml | xml ed -r "//f" -v "file"
<?xml version="1.0"?>
<xml>
    <file s="6148" n=".DS_Store"/>
    <file s="173" n="build.xml"/>
    <file s="641" n="input.xml"/>
    <file s="3587" n="main.xsl"/>
    <file s="184" n="Makefile"/>
    <file s="3869" n="MyGenerator.class"/>
    <file s="5265" n="MyGenerator.java"/>
</xml>

In addition, instead of using short names for tags and attributes like s and n, you can change them to size and name, respectively (see Listing 7).

Listing 7. Directory listing with file tags and size and name attributes
% cat ls.xml | xml ed -r "//f" -v "file" -r "//@s" -v "size" -r "//@n" -v "name"
<?xml version="1.0"?>
<xml>
    <file size="6148" name=".DS_Store"/>
    <file size="173" name="build.xml"/>
    <file size="641" name="input.xml"/>
    <file size="3587" name="main.xsl"/>
    <file size="184" name="Makefile"/>
    <file size="3869" name="MyGenerator.class"/>
    <file size="5265" name="MyGenerator.java"/>
</xml>

That’s easy to read. And you haven’t written one line of XSLT, Perl, or Java code. Save this file as ls2.xml.
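For comparison, here is roughly what the same renaming takes when written by hand. This is a sketch in Python's standard library, not a description of how XMLStarlet implements ed -r; the inline sample and the RENAMES table are illustrative names, not part of the tool.

```python
import xml.etree.ElementTree as ET

# Two entries in the shape of ls.xml (trimmed for brevity).
LS_XML = '<xml><f s="173" n="build.xml"/><f s="184" n="Makefile"/></xml>'

# Illustrative mapping: old attribute name -> new attribute name.
RENAMES = {"s": "size", "n": "name"}

root = ET.fromstring(LS_XML)
for elem in list(root.iter("f")):
    elem.tag = "file"                       # like: -r "//f" -v "file"
    renamed = {RENAMES.get(k, k): v for k, v in elem.attrib.items()}
    elem.attrib.clear()                     # like: -r "//@s" ... -r "//@n" ...
    elem.attrib.update(renamed)

files = [(e.tag, e.get("size"), e.get("name")) for e in root]
print(files)
# [('file', '173', 'build.xml'), ('file', '184', 'Makefile')]
```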


Validating

The new directory listing is nice, but is it still valid? Listing 8 shows you how to determine that.

Listing 8. Checking the XML's well-formedness
% xml val ls2.xml
ls2.xml - valid

Yep, it's valid, but only in the sense that it is well formed: the tags are balanced, the characters are encoded properly, and that sort of thing. It still may not have all of the required tags, or the right tags. To determine that, you need to know the proper structure of the file, which means you need a schema. Only when you have checked the XML document against a schema and found that it passes is it truly valid.

Listing 9 shows a basic RELAX NG schema for the XML directory listing file.

Listing 9. The RELAX NG schema
<?xml version="1.0" encoding="UTF-8"?>
<grammar ns="" xmlns="http://relaxng.org/ns/structure/1.0"
datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <start>
    <element name="xml">
      <oneOrMore>
        <element name="file">
          <attribute name="name">
            <data type="NMTOKEN"/>
          </attribute>
          <attribute name="size">
            <data type="integer"/>
          </attribute>
        </element>
      </oneOrMore>
    </element>
  </start>
</grammar>

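RELAX NG is easy to read. At the top, the outer element tag defines xml as the root tag. Inside it, the oneOrMore tag says that the xml tag must contain one or more file tags. Each file tag, in turn, must carry a name attribute (an NMTOKEN) and a size attribute (an integer).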

Is the ls2.xml file valid against this new schema? See Listing 10.

Listing 10. Checking against the schema
% xml val -e -r ls.rng ls2.xml
ls2.xml - valid

If you're like me, you aren't satisfied until you see it fail. So, add an attribute named someAttribute to one of the file items in a file called ls3.xml, and run it again (see Listing 11).

Listing 11. Checking a bad file against the schema
% xml val -e -r ls.rng ls3.xml
ls3.xml:4: element file: Relax-NG validity error : 
» Invalid attribute someAttribute for element file
ls3.xml - invalid

As expected, it fails. That means a passing result tells you something real: not only is the file well formed, but it also has all the right tags and attributes, and nothing extra.
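The rules that the schema in Listing 9 enforces can also be spelled out by hand. The sketch below is not RELAX NG validation (Python's standard library has no RELAX NG support). It is a hand-written check of the same constraints, with the NMTOKEN test on name left out, and check_listing is a name invented for this example.

```python
import re
import xml.etree.ElementTree as ET

ALLOWED = {"name", "size"}

def check_listing(xml_text):
    """Return a list of problems; an empty list means the checks passed."""
    root = ET.fromstring(xml_text)   # raises ParseError if not well formed
    problems = []
    if root.tag != "xml":
        problems.append(f"root element is {root.tag}, expected xml")
    if len(root) == 0:
        problems.append("expected at least one file element")
    for child in root:
        if child.tag != "file":
            problems.append(f"unexpected element {child.tag}")
            continue
        for attr in sorted(set(child.attrib) - ALLOWED):
            problems.append(f"invalid attribute {attr} for element file")
        for attr in sorted(ALLOWED - set(child.attrib)):
            problems.append(f"missing attribute {attr} for element file")
        size = child.get("size")
        if size is not None and not re.fullmatch(r"[+-]?[0-9]+", size.strip()):
            problems.append(f"size {size!r} is not an integer")
    return problems

good = '<xml><file size="173" name="build.xml"/></xml>'
bad = '<xml><file size="173" name="build.xml" someAttribute="x"/></xml>'

print(check_listing(good))  # []
print(check_listing(bad))   # ['invalid attribute someAttribute for element file']
```

Even this small schema takes a couple of dozen lines to check by hand, which is the argument for letting xml val and a schema file do the work.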


Going to text

You can also play with the selection functions, which let you extract elements of the data from the XML. The example in Listing 12 extracts the file names from the XML directory listing as plain text.

Listing 12. Extracting the file names
% xml sel -t -m "/xml/file" -v "concat(@name,'
')" ls2.xml
.DS_Store
build.xml
input.xml
main.xsl
Makefile
MyGenerator.class
MyGenerator.java

Look at two things here. First, the XPath to get to the file nodes is the /xml/file specification given to -m. Second, the output specification given to -v concatenates the name attribute of each file tag with a newline (the literal line break inside the quotes).
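In effect, sel -t -m ... -v ... is a loop: match each /xml/file node, then print an expression evaluated against it. A minimal Python equivalent, using only the standard library, might look like this; the inline sample is a trimmed stand-in for ls2.xml.

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for ls2.xml.
LS2_XML = """<xml>
  <file size="173" name="build.xml"/>
  <file size="641" name="input.xml"/>
  <file size="184" name="Makefile"/>
</xml>"""

root = ET.fromstring(LS2_XML)

# -m "/xml/file": visit each <file> child of the root <xml> element.
# -v "concat(@name, newline)": emit the name attribute followed by a newline.
output = "".join(elem.get("name") + "\n" for elem in root.findall("file"))
print(output, end="")
```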

Now you can add the -s option to sort the files by the size attribute (see Listing 13). The A:N:- syntax tells XMLStarlet to use an ascending numerical sort. (This code adds the size attribute to the concat expression so you can confirm the sort is working.)

Listing 13. Sorting the list
% xml sel -t -m "/xml/file" -s A:N:- "@size" -v "concat 
» ( @name,':',@size,'
' ) " ls2.xml
build.xml:173
Makefile:184
input.xml:641
main.xsl:3587
MyGenerator.class:3869
MyGenerator.java:5265
.DS_Store:6148
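The N in A:N:- matters because attribute values are strings. The following sketch, again in Python's standard library with a trimmed stand-in for ls2.xml, contrasts a numeric sort with a plain string sort. The string sort is Python's own and is shown only for contrast; it is not claimed to match XMLStarlet's text ordering in every case.

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for ls2.xml.
LS2_XML = """<xml>
  <file size="6148" name=".DS_Store"/>
  <file size="173" name="build.xml"/>
  <file size="641" name="input.xml"/>
  <file size="3587" name="main.xsl"/>
</xml>"""

files = ET.fromstring(LS2_XML).findall("file")

# Ascending numeric sort on @size, as A:N requests.
numeric = sorted(files, key=lambda f: int(f.get("size")))
# Plain string sort on the same attribute, for contrast.
textual = sorted(files, key=lambda f: f.get("size"))

numeric_lines = [f"{f.get('name')}:{f.get('size')}" for f in numeric]
textual_lines = [f"{f.get('name')}:{f.get('size')}" for f in textual]
print(numeric_lines)
# ['build.xml:173', 'input.xml:641', 'main.xsl:3587', '.DS_Store:6148']
print(textual_lines)
# ['build.xml:173', 'main.xsl:3587', '.DS_Store:6148', 'input.xml:641']
```

In the string sort, 641 lands after 6148 because the comparison goes character by character, which is the mistake a numeric sort avoids.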

A little traffic fun

To have some fun with the xml command, you can use it to parse a traffic report. Yahoo!® Maps provides a traffic service. You can use the curl command to download the latest traffic information as RSS; the -g option turns off curl's URL globbing so the URL is passed through literally, and -s suppresses the progress meter. For example, in Listing 14, I specify my zip code by adding the ?csz=94101 argument, and the result is the latest San Francisco traffic report.

Listing 14. San Francisco traffic as RSS
% curl -g "http://maps.yahoo.com/traffic.rss?csz=94101" -s
<?xml version="1.0" encoding="ISO-8859-1" ?>
<rss version="2.0">
<channel>
<title>Yahoo! Maps Traffic -- San Francisco,  CA 94101</title>
<link>http://us.rd.yahoo.com/maps/mapresults/trfrssarea/*
» http://maps.yahoo.com/maps_result?csz=
» San+Francisco%2C++CA+94101&country=
» us&lat=37.775&lon=
» -122.4183&trf=1&mag=5</link>
<category>Traffic</category>
<description>Yahoo! Maps Traffic -- 
» San Francisco,  CA 94101</description>
<language>en-us</language>
<ttl>3</ttl>
<lastBuildDate>Fri, 06 May 2005 16:33:59 -0700<
» /lastBuildDate>
<pubDate>Fri, 06 May 2005 18:31:27 CDT<
» /pubDate>
<copyright>Copyright (c) 2005 Yahoo! Inc. 
» All rights reserved.</copyright>
<item>
<title>
Incident, On I-580 At Seminary Ave
</title>
<description>
Traffic Collision, Severity: Major, Started: 04:20pm 05/06/05, 
» Estimated End: 04:50pm 05/06/05, 
» Last Updated: 04:25pm 05/06/05
</description>
<link>http://us.rd.yahoo.com/maps/mapresults/trfrssitem/*
» http://maps.yahoo.com/maps_result?csz=
» San+Francisco%2C++CA+94101&mlt=
» 37.778234&mln=-122.168438&lat=
» 37.775&lon=-122.4183&trf=
» 1&exctrf=1&mag=4</link>
<pubDate>Fri, 06 May 2005 16:20:00 -0700</pubDate>
<category>Incident </category>
<severity>Major</severity>
<endDate>Fri, 06 May 2005 16:50:00 -0700</endDate>
<updatedDate>Fri, 06 May 2005 16:25:00 -0700<
» /updatedDate>
</item>
...

Now you can pipe the output of the curl command through the XMLStarlet command to get just the descriptions (see Listing 15).

Listing 15. Traffic RSS piped through XMLStarlet
% curl -g "http://maps.yahoo.com/traffic.rss?csz=94101" 
» -s | xml sel -t -m "/rss/channel/item/description" -v "."
Traffic Collision, Severity: Major, Started: 04:20pm 05/06/05, 
» Estimated End: 04:50pm 05/06/05, 
» Last Updated: 04:25pm 05/06/05
Disabled Vehicle, Severity: Moderate, Started: 04:20pm 05/06/05, 
» Estimated End: 04:50pm 05/06/05, 
» Last Updated: 04:25pm 05/06/05
Disabled Vehicle, Severity: Moderate, Started: 04:19pm 05/06/05, 
» Estimated End: 04:49pm 05/06/05, 
» Last Updated: 04:25pm 05/06/05
Pedestrian On The Roadway, Severity: Critical, 
» Started: 04:17pm 05/06/05, 
» Estimated End: 04:47pm 05/06/05, 
» Last Updated: 04:25pm 05/06/05
Traffic Collision, Severity: Major, Started: 04:15pm 05/06/05, 
» Estimated End: 04:45pm 05/06/05, 
» Last Updated: 04:25pm 05/06/05
...

The -m option picks out the description of each item. Then, using the -v option, you can output only the text of the node by specifying a period (.).
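To see which nodes the path /rss/channel/item/description selects, here is a small Python sketch using the standard library. The feed is a hypothetical two-item sample in the shape of Listing 14, embedded as a string so that nothing is fetched over the network.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed trimmed to two items, in the shape of Listing 14.
RSS = """<rss version="2.0">
<channel>
<title>Yahoo! Maps Traffic</title>
<description>Channel-level description, not matched</description>
<item>
<title>Incident, On I-580 At Seminary Ave</title>
<description>
Traffic Collision, Severity: Major
</description>
</item>
<item>
<description>
Disabled Vehicle, Severity: Moderate
</description>
</item>
</channel>
</rss>"""

root = ET.fromstring(RSS)

# /rss/channel/item/description: the root is <rss>, so the relative path
# starts at channel. The channel's own <description> is not under an
# <item>, so it is skipped.
descriptions = [d.text.strip() for d in root.findall("channel/item/description")]
print(descriptions)
# ['Traffic Collision, Severity: Major', 'Disabled Vehicle, Severity: Moderate']
```

One difference from the command line: strip() trims the line breaks around each description here, whereas -v "." prints the node's text as it stands.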


Conclusion

This article barely scratches the surface of this very powerful XML tool. If you have time, check out XMLStarlet's XSLT transformation functions, the handy escaping and unescaping functions, the XML formatting functions, and more.

Resources
