XML for Perl developers, Part 1: XML plus Perl -- simply magic

Integrate XML into a Perl application using XML::Simple

This series is a guide to those who need a quick XML-and-Perl solution. In a surprisingly large number of cases, you only need one tool to integrate XML into a Perl application, XML::Simple. Part 1 tells you where to get it, how to use it, and where to go next. Once you whet your appetite for working with XML in Perl, the other two articles in this series will help you sharpen your new skills further.


Jim Dixon (jddixon@gmail.com), Writer, Freelance

Jim Dixon is an independent contractor recently returned to San Francisco, where he advises Web 2.0 startups on the wonders of Perl and Ruby. Earlier in life he was technical lead at a UK/US Internet service provider for seven years and developed a lot of Java/J2EE software.



30 January 2007


Introduction

This is the first of a three-part series on Perl and XML, which focuses on XML::Simple. For Perl programmers, the most common first use of XML is in retrieving parameters from a configuration file. This article shows you how to read such a parameter in two lines of code, the first telling Perl that XML::Simple is being used and the second setting a variable to a value in the file. You don't even have to give the name of the configuration file: XML::Simple will make an intelligent guess.

For a more elaborate example, you will pay a visit to a pet shop. In that section, you will learn how to read an XML file into a hierarchical Perl data structure, a mixture of anonymous arrays and hashes, with a minimum of effort. This article illustrates how concisely Perl can transform and restructure the information contained in the original XML document, and then shows how to write it back out in various forms.

Finally, I discuss some limitations of XML::Simple. This leads into the subjects of the next two articles in this series: more advanced parsing, the use of sophisticated tools in transforming XML from one form to another, and techniques for serializing XML from DOM and other in-memory forms.

This article is primarily intended for Perl programmers with little exposure to XML but will also prove useful to XML experts interested in exploring a more programmatic approach to manipulating XML documents.


Getting started

Before you get started, you need to install Perl. If you don't already have it, see Resources for a link.

Next, you will need XML::Simple. If you are using UNIX or Linux, it's most convenient to get it from CPAN using cpan. You begin this process by setting up cpan on your machine using the commands shown in Listing 1. Generally you will want to do this as root, to make the Perl modules available to all users.

Listing 1. Installing cpan, getting XML::Simple
$ perl -MCPAN -e shell
cpan> ...
cpan> install XML::Simple
cpan> quit

When you run the command for the first time, you will go through a long dialog. This is elided in Listing 1. Some users will find it convenient to know that you can edit the resulting configuration; it's in /etc/perl/CPAN/Config.pm.

Windows users follow a similar procedure using PPM (see Resources if you don't have PPM). In this case the command to install a module is similar to that shown in Listing 2.

Listing 2. Windows: using PPM to get XML::Simple
$ ppm install XML::Simple

Both cpan and ppm check for dependencies during installation and will fetch any missing dependencies from the repository. This is automatic if you set cpan's prerequisites policy to 'follow'. The modules are generally compiled during the installation, which generates pages of messages. This can take some time and should not be seen as a cause for concern.

Another prerequisite

XML::Simple converts XML documents to references to hashes and arrays of hashes. This means that you need a solid understanding of the interaction of references, hashes, and arrays in Perl. If you need help in this direction, consult the excellent Perl reference tutorial in Resources.


XML::Simple

Basically, Grant McLean's XML::Simple has two functions: it converts XML text documents into Perl data structures (mixtures of anonymous hashes and arrays), and it converts such data structures back to XML text documents.

This limited functionality is immensely useful, which will be demonstrated at two levels. First you'll see how to import data from configuration files in XML form. Then in a more elaborate example, at the local pet shop, you will learn how to read a large and complex XML file into memory, transform it in ways that might be difficult with conventional XML tools like XSLT, and write it back to disk.

For many, XML::Simple will provide all that is necessary to deal with XML in Perl.

An XML configuration file

You have a problem, a problem that faces programmers all around the world every day. You need to pass moderately complex configuration information to your program and it's just too much of a hassle to do it with command line arguments. So you decide to use a configuration file. Because XML is after all the standard for this sort of thing, you decide to format the file that way, leading to what is shown in Listing 3. You're going to use XML::Simple to deal with this.

Listing 3. A configuration file, part1.xml
<config>
  <user>freddy</user>
  <passwd>longNails</passwd>
  <books>
    <book author="Steinbeck" title="Cannery Row"/>
    <book author="Faulkner" title="Soldier's Pay"/>
    <book author="Steinbeck" title="East of Eden"/>
  </books>
</config>

In addition to the constructor, XML::Simple has two subroutines: XMLin() and XMLout(). As you might expect, the first reads an XML file, returning a reference. Given a reference to an appropriate data structure, the second converts it to an XML document, either in string format or as a file, depending upon its parameters.

XML::Simple generally has sensible defaults, so that for example if you specify no input file name, a Perl program named part1.pl (as in Listing 4) will read a file called part1.xml.

Listing 4. part1.pl
#!/usr/bin/perl -w
use strict;
use XML::Simple;
use Data::Dumper;
print Dumper (XML::Simple->new()->XMLin());

Executing part1.pl yields the output shown in Listing 5.

Listing 5. Output from part1.pl
$VAR1 = {
      'passwd' => 'longNails',
      'user' => 'freddy',
      'books' => {
             'book' => [
                   {
                 'title' => 'Cannery Row',
                 'author' => 'Steinbeck'
                   },
                   {
                 'title' => 'Soldier\'s Pay',
                 'author' => 'Faulkner'
                   },
                   {
                 'title' => 'East of Eden',
                 'author' => 'Steinbeck'
                   }
                 ]
           }
    };

XMLin() has returned a reference to a hash. If this were assigned to a variable called $config, you could then get the user's name using $config->{user} and the password using $config->{passwd}. Those with an eye for economy will notice that you can read the configuration file and return a single parameter in less than one line of code: XML::Simple->new->XMLin->{user}.

The dump makes it clear that you have to be careful in dealing with XML::Simple.

  • First, it discards the name of the root element.
  • Secondly, it collapses elements with the same name into a single reference to an anonymous array. Consequently, the title of the first book is $config->{books}->{book}->[0]->{title}, or 'Cannery Row'.
  • Thirdly, it treats attributes and subelements identically.

You can change each of these behaviors by options to XMLin(). See Resources and the discussion below for more information on options.
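The access patterns described above can be sketched as follows. This is a minimal illustration, not part of the original listings: it parses the Listing 3 document from an inline string (XMLin() accepts XML text as well as a file name) so that it does not depend on the default file lookup, and the variable names are assumptions chosen for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Simple;

# The Listing 3 configuration, inlined so the sketch is self-contained.
my $xml = <<'END_XML';
<config>
  <user>freddy</user>
  <passwd>longNails</passwd>
  <books>
    <book author="Steinbeck" title="Cannery Row"/>
    <book author="Faulkner" title="Soldier's Pay"/>
    <book author="Steinbeck" title="East of Eden"/>
  </books>
</config>
END_XML

# Default options: the root element name <config> is discarded.
my $config      = XMLin($xml);
my $user        = $config->{user};
my $first_title = $config->{books}{book}[0]{title};
my $book_count  = scalar @{ $config->{books}{book} };

print "user: $user\n";
print "first title: $first_title\n";
print "books: $book_count\n";

# KeepRoot => 1 keeps <config> as the single top-level key.
my $rooted   = XMLin($xml, KeepRoot => 1);
my @top_keys = keys %$rooted;
print "top-level keys: @top_keys\n";
```

Between adjacent subscripts the arrows are optional, so $config->{books}{book}[0]{title} and $config->{books}->{book}->[0]->{title} refer to the same value.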


A more complicated example: The pet shop

XML::Simple is good for a lot more than economical parsing of configuration files. It can in fact deal with large and complex XML files and convert them into regular data structures. Such structures are often amenable to transformations that are straightforward in Perl but difficult or impossible using more conventional XML transformation tools like XSLT.

Assume that you are working at a pet shop, which keeps information on the pets in an XML file. A small part of the document is shown below as Listing 6. The manager wants a few changes made:

  • To save space, change all of the subelements to attributes
  • Increase prices by 20%
  • Make all prices look the same, so all will show two decimal places
  • Sort the list
  • Replace dates of birth with ages

With your new-found confidence in Perl, and with your awareness that XSLT is computationally challenged -- have you ever tried to do a shift using XPath? -- you decide to do the job with XML::Simple (see Listing 6).

Listing 6. A few of our pets, pets.xml
<?xml version='1.0'?>
<pets>
  <cat>
    <name>Madness</name>
    <dob>1 February 2004</dob>
    <price>150</price>
  </cat>
  <dog>
    <name>Maggie</name>
    <dob>12 October 2005</dob>
    <price>75</price>
    <owner>Rosie</owner>
  </dog>
  <cat>
    <name>Little</name>
    <dob>23 June 2006</dob>
    <price>25</price>
  </cat>
</pets>

Initial explorations

Your first try at using XML::Simple begins as shown in Listing 7.

Listing 7. Your brave new Perl
#!/usr/bin/perl -w
use strict;
use XML::Simple;
use Data::Dumper;
my $simple = XML::Simple->new();
my $data   = $simple->XMLin('pets.xml');
# DEBUG
print Dumper($data) . "\n";
# END

Being prudent, you use Data::Dumper to look at what gets read into memory and are upset to find what's seen in Listing 8.

Listing 8. What you get
$VAR1 = {
      'cat' => {
           'Little' => {
                   'dob' => '23 June 2006',
                   'price' => '25'
                 },
           'Madness' => {
                'dob' => '1 February 2004',
                'price' => '150'
                  }
         },
      'dog' => {
           'owner' => 'Rosie',
           'dob' => '12 October 2005',
           'name' => 'Maggie',
           'price' => '75'
         }
    };

This is disappointing. Cats and dogs are represented quite differently: the two cats are stored in a doubly nested hash keyed by name, whereas the information about the dog is stored in a simple hash, with its name treated like any other attribute. (The keying comes from the KeyAttr option, which by default folds repeated elements into a hash on a name, key, or id field.) Also you notice that the name of the root element has disappeared. So you go off and read the excellent documentation (see Resources) and discover the existence of options, including in particular ForceArray=>1 and KeepRoot=>1. The first causes all nested elements to be represented as arrays. The second causes the name of the root element to be retained on input; on output, about which you'll see more later, it tells XML::Simple that the in-memory representation of the data already includes the name of the root element. With these changes you get what's in Listing 9, which is considerably easier for a programmer to deal with, although it might take up more memory.

Listing 9. Data::Dumper output after adding options, cleaned up a little to make it more readable
$VAR1 = {
      'pets' => [
            {
              'cat' => [
                   {
                       'dob'   => [ '1 February 2004' ],
                       'name'  => [ 'Madness' ],
                       'price' => [ '150' ]
                   },
                   {
                       'dob'   => [ '23 June 2006' ],
                       'name'  => [ 'Little' ],
                       'price' => [ '25' ]
                   }
                 ],
              'dog' => [
                   {
                       'owner' => [ 'Rosie' ],
                       'dob'   => [ '12 October 2005' ],
                       'name'  => [ 'Maggie' ],
                       'price' => [ '75' ]
                   }
                 ]
            }
          ]
    };
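To make the difference between the two shapes concrete, here is a small sketch that parses the pets document both ways and walks the regular structure. It is an illustration under stated assumptions: the XML is inlined rather than read from pets.xml, and KeyAttr => [] is added by this sketch (it is not in the article's code) to switch name-based folding off explicitly.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Simple;

# The pets from Listing 6, inlined so the sketch is self-contained.
my $xml = <<'END_XML';
<pets>
  <cat><name>Madness</name><dob>1 February 2004</dob><price>150</price></cat>
  <dog><name>Maggie</name><dob>12 October 2005</dob><price>75</price><owner>Rosie</owner></dog>
  <cat><name>Little</name><dob>23 June 2006</dob><price>25</price></cat>
</pets>
END_XML

# Defaults: the repeated <cat> elements are folded into a hash keyed on
# <name>, while the single <dog> stays a plain hash.
my $folded        = XMLin($xml);
my $madness_price = $folded->{cat}{Madness}{price};
my $dog_name      = $folded->{dog}{name};

# ForceArray and KeepRoot give every element the same shape.
# KeyAttr => [] is an assumption of this sketch, see the note above.
my $regular = XMLin($xml, ForceArray => 1, KeepRoot => 1, KeyAttr => []);
my $pets    = $regular->{pets}[0];

for my $type (sort keys %$pets) {
    for my $pet (@{ $pets->{$type} }) {
        print "$type: $pet->{name}[0] costs $pet->{price}[0]\n";
    }
}

my @cat_names = map { $_->{name}[0] } @{ $pets->{cat} };
```

With the regular shape, the same loop handles one dog or twenty; the price of that uniformity is the extra [0] on every leaf value.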

Transforming the in-memory data structure

You now have a regular structure in memory, one which is very easy to deal with programmatically. To achieve your boss's first objective, which is to convert elements to attributes, you need to replace references to arrays, as shown in Listing 10.

Listing 10. Reference to single-element array
'name' => [ 'Maggie' ]

You must replace each such array reference with a simple value, as shown in Listing 11.

Listing 11. Simple value
'name' => 'Maggie'

Given this change, XML::Simple will output an attribute-value pair, rather than a subelement. Where there is more than one instance of the type to output -- in this case, where you have two cats but one dog -- you need to collect the hashes as an anonymous array of anonymous hashes. Listing 12 shows you how to accomplish part of this minor bit of magic.

Listing 12. Folding arrays to hashes, converting elements to attributes
sub makeNewHash($) {
    my $hashRef = shift;            # e.g. { name => ['Maggie'], ... }
    my %oldHash = %$hashRef;
    my %newHash = ();
    while ( my ($key, $innerRef) = each %oldHash ) {
        # keep only the first (and only) element of each array
        $newHash{$key} = $innerRef->[0];
    }
    return \%newHash;
}

Given a reference to the XML describing an individual pet, this code 'folds' it into a hash. If there is only one pet of the type, you are done. You write a reference to the new hash back into $data. However, if there is more than one pet of the type, what you write back is a reference to an anonymous array containing references to the anonymous hashes describing individual pets. You can see how this is done by looking at foldType() in the complete solution, Listing 16.
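The effect of folding on the generated XML can be seen in a short sketch. It hand-builds one pet in the array-per-field shape, writes it out before and after folding, and prints both results. The helper name fold_record and the KeyAttr => [] option are assumptions of this example, not part of the article's code.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Simple;

# One pet in the ForceArray shape of Listing 9 (hand-built for this sketch).
my $unfolded = { name => ['Maggie'], price => ['75'], owner => ['Rosie'] };

# Same idea as makeNewHash(): replace each one-element array by its value.
# fold_record is a hypothetical name used only here.
sub fold_record {
    my ($record) = @_;
    my %folded = map { $_ => $record->{$_}[0] } keys %$record;
    return \%folded;
}

# KeyAttr => [] (an assumption of this sketch) disables folding on output.
my %opts = (KeepRoot => 1, KeyAttr => []);

# Array values are written as subelements ...
my $as_elements = XMLout({ pets => { dog => [ $unfolded ] } }, %opts);

# ... while plain scalar values are written as attributes.
my $as_attributes = XMLout({ pets => { dog => fold_record($unfolded) } }, %opts);

print "Before folding:\n$as_elements\n";
print "After folding:\n$as_attributes\n";
```

Without an OutputFile option, XMLout() returns the XML as a string, which makes this kind of side-by-side comparison easy while you experiment.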


The other requirements: the joy of Perl

The boss's other requirements were to sort the list, increase prices by 20%, write prices to two decimal places, and replace dates of birth with ages. The first is as it turns out the default for output with XML::Simple. Given that this is Perl, the second and third are a one-liner. Perl is happily polymorphic: prices are numbers while calculating the 20% price increase, but if you write them back as strings they remain in whatever format you wrote them in. So Listing 13 does the job, converting string to number and back to string again.

Listing 13. Reformatting and increasing prices
sprintf "%6.2f", $amt * (1 + $change)
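As a quick check of what this format does to the pet shop's prices, the sketch below wraps the expression in a sub (the name fix_price is illustrative) and applies it to the three prices from Listing 6. It also shows one detail worth knowing: %6.2f pads to a minimum width of six characters, so shorter results carry a leading space.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same expression as Listing 13; fix_price is an illustrative name.
sub fix_price {
    my ($amount, $change) = @_;
    return sprintf '%6.2f', $amount * (1 + $change);
}

# Prices arrive from the XML as strings; Perl treats them as numbers here.
my @new_prices = map { fix_price($_, 0.20) } ('150', '75', '25');

# Brackets make any padding visible.
print "[$_]\n" for @new_prices;

# For comparison, %.2f rounds the same way but does not pad.
my $unpadded = sprintf '%.2f', 75 * (1 + 0.20);
print "[$unpadded]\n";
```

If the leading spaces are unwanted in attribute values, using %.2f instead is one option; the article's code keeps the fixed width.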

Converting dates of birth to ages proved more difficult. A quick check with CPAN showed that Date::Calc had all the necessary features (and a lot more). Decode_Date_EU converts dates in 'European' formats like 13 January 2006 to the 3-element array (YMD) that the package uses as a standard. Given two such dates, Delta_YMD($earlier, $later) then yields the difference, in your case the age, in the same format. Unfortunately, Delta_YMD is a bit buggy: sometimes the day or month will be negative! But a little googling finds a patch, and everything works again. deltaYMD in the complete solution (shown in Listing 16) shows how to handle this.
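The normalization step can be sketched in isolation. The example below follows the same approach as deltaYMD in the complete solution, but takes "today" as a parameter so that the results are repeatable; the sub name age_ymd and the fixed dates are assumptions made for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Date::Calc qw(Add_Delta_YM Decode_Date_EU Delta_Days Delta_YMD);

# Returns (years, months, days) between a date of birth given in
# day-month-year text form and a reference date given as (Y, M, D).
# age_ymd is a hypothetical helper name.
sub age_ymd {
    my ($dob_string, @today) = @_;
    my @dob   = Decode_Date_EU($dob_string);
    my @delta = Delta_YMD(@dob, @today);

    # Borrow from the next larger unit until no component is negative.
    while ( $delta[1] < 0 or $delta[2] < 0 ) {
        if ( $delta[1] < 0 ) {      # negative month: borrow a year
            $delta[0]--;
            $delta[1] += 12;
        }
        if ( $delta[2] < 0 ) {      # negative day: borrow a month
            $delta[1]--;
            # recount the days from (dob + years + months) to today
            $delta[2] = Delta_Days(
                Add_Delta_YM(@dob, @delta[0, 1]), @today);
        }
    }
    return @delta;
}

# A fixed reference date (assumed for the example) keeps output stable.
my @reference = (2007, 1, 30);
for my $dob ('1 February 2004', '12 October 2005', '23 June 2006') {
    my ($y, $m, $d) = age_ymd($dob, @reference);
    print "$dob: $y years, $m months, $d days\n";
}
```

Passing the reference date in, rather than reading the clock inside the sub, also makes the calculation straightforward to test.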


Dispatching on cats and dogs

To make the code more easily extensible, use a dispatch table, as shown in Listing 14. Dispatch tables are discussed in much detail in Mark Jason Dominus's excellent book, Higher-Order Perl (see Resources for a link).

Listing 14. A dispatch table
my $DISPATCHER = {
    'cat'   => sub { foldType(shift); }, 
    'dog'   => sub { foldType(shift); },
    'hippo' => \&hippoFunc,
};

The dispatcher can either contain the actual code used to deal with a particular element as an anonymous subroutine or it can contain a reference to a named subroutine defined elsewhere. You can use this construct where switch-case would be used in other languages.

In the worked example, there are only two element types, cat and dog. It is likely that in real XML documents there will be many at different levels. Using one or more dispatch tables is much clearer and much more maintainable than the Perl alternative, line after line of if ... elsif ... elsif constructs.
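A self-contained sketch of the pattern follows. The handlers are hypothetical stand-ins rather than the article's foldType(), and the fallback for unknown element types is an addition of this example, playing the role that a default branch plays in a switch statement.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A named handler, referenced from the table below (hypothetical).
sub describe_small {
    my ($name) = @_;
    return "$name fits in a carrier";
}

my %dispatch = (
    cat => \&describe_small,                                # named sub
    dog => sub { my ($name) = @_; "$name needs a leash" },  # anonymous sub
);

# Look up the handler for an element type and call it. Unknown types
# get a message instead of a fatal "undefined subroutine" error.
sub handle {
    my ($type, $name) = @_;
    my $handler = $dispatch{$type}
        or return "no handler for '$type'";
    return $handler->($name);
}

my @results = map { handle(@$_) }
    ( [ cat => 'Madness' ], [ dog => 'Maggie' ], [ hippo => 'Hugo' ] );
print "$_\n" for @results;
```

Supporting a new element type then means adding one entry to the table, with no change to the code that walks the document.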


Writing your XML to disk

XML::Simple's defaults on output are typically sensible. If you supply no options to XMLout(), it produces a string. If you want to write to a file instead, add an OutputFile option. If you don't tell it to do otherwise, it will use <opt> as the root element. If the in-memory data structure has a name for the root element, add a KeepRoot option, setting it to true, or as it's known in Perl, 1. Listing 15 does all of this for you.

Listing 15. Output to an XML file
$simple->XMLout($data, 
            KeepRoot   => 1, 
            OutputFile => 'pets.fixed.xml',
            XMLDecl    => "<?xml version='1.0'?>",
        );
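The output defaults described above are easy to try out, because XMLout() returns a string when no OutputFile is given. The following sketch is an illustration only: the data is hand-built, RootName is an XML::Simple option that the article does not otherwise use, and KeyAttr => [] is added here to keep the output free of any unfolding.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Simple;

# A single folded pet record, hand-built for this sketch.
my $pet = { name => 'Maggie', price => ' 90.00' };

# No options: the result is a string and the root element is <opt>.
my $default_root = XMLout($pet);

# RootName picks a different root element name.
my $named_root = XMLout($pet, RootName => 'dog');

# KeepRoot: the single top-level key ('pets') supplies the root element.
my $with_decl = XMLout(
    { pets => { dog => $pet } },
    KeepRoot => 1,
    KeyAttr  => [],                         # assumption: no unfolding
    XMLDecl  => "<?xml version='1.0'?>",
);

print $default_root, $named_root, $with_decl;
```

Inspecting the string first, and adding OutputFile only once it looks right, is a convenient way to work while the transformation is still changing.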

The complete solution

The 112 lines of code that follow in Listing 16 do what the boss requested. XML::Simple's economy is impressive. Eight lines of code read and write the XML. Less than half of the remaining code is concerned with transforming its structure.

Listing 16. Final version of the code
#!/usr/bin/perl -w
use strict;

use XML::Simple;
use Date::Calc qw(Add_Delta_YM Decode_Date_EU Delta_Days Delta_YMD); 
use Data::Dumper;

my $simple = XML::Simple->new (ForceArray => 1, KeepRoot => 1);
my $data   = $simple->XMLin('pets.xml');

my @now = (localtime(time))[5, 4, 3];
$now[0] += 1900;  # Perl years start in 1900
$now[1]++;        # months are zero-based

sub fixPrice($$) {
    my ($amt, $change) = @_;
    return sprintf "%6.2f", $amt * (1 + $change);
}

sub deltaYMD($$) {
    my ($earlier, $later) = @_;   # refs to YMD arrays
    my @delta = Delta_YMD (@$earlier, @$later); 
    while ( $delta[1] < 0 or $delta[2] < 0 ) {
        if ( $delta[1] < 0 ) {  # negative month
            $delta[0]--;
            $delta[1] += 12;
        }
        if ( $delta[2] < 0 ) {  # negative day
            $delta[1]--;
            $delta[2] = Delta_Days(
                    Add_Delta_YM (@$earlier, @delta[0,1]), @$later);
        }
    }
    return \@delta;
}
 
sub dob2age($) {
    my $strDOB = shift;
    my @dob = Decode_Date_EU($strDOB);
    my $ageRef = deltaYMD( \@dob, \@now );
    my ($ageYears, $ageMonths, $ageDays) = @$ageRef;
    my $age;
    if ( $ageYears > 1 ) {
        $age = "$ageYears years"; 
    } elsif ($ageYears == 1) {
        $age = '1 year' . ( $ageMonths > 0 ? 
            ( ", $ageMonths month" . ($ageMonths > 1 ? 's' : '') ) 
            : '');
    } elsif ($ageMonths > 1) {
        $age = "$ageMonths months";
    } elsif ($ageMonths == 1) {
        $age = '1 month' . ( $ageDays > 0 ?
            ( ", $ageDays day" . ($ageDays > 1 ? 's' : '') ) : '');
    } else {
        $age = "$ageDays day" . ($ageDays != 1 ? 's' : '');
    }
    return $age;

}
 
sub makeNewHash($) {
    my $hashRef = shift;
    my %oldHash = %$hashRef;
    my %newHash = ();
    while ( my ($key, $innerRef) = each %oldHash ) {
        my $value = $innerRef->[0];
        if ($key eq 'dob') {
            $newHash{'age'} = dob2age($value);
        } else {
            if ($key eq 'price') {
                $value = fixPrice($value, 0.20);
            }
            $newHash{$key} = $value;
        }
    }
    return \%newHash;
}
sub foldType ($) {
    my $arrayRef = shift;
    # if single element in array, return simple hash
    if (@$arrayRef == 1) { 
        return makeNewHash($arrayRef->[0]);
    }
    # if multiple elements, return array of simple hashes
    else {
        my @outArray = ();
        foreach my $hashRef (@$arrayRef) {
            push @outArray, makeNewHash($hashRef);
        }
        return \@outArray;
    }
} 
my $dispatcher = {
    'cat' => sub { foldType(shift); }, 
    'dog' => sub { foldType(shift); },
};
 
my @base = @{$data->{pets}};
my %types = %{$base[0]};
my %newTypes = ();
while ( my ($petType, $arrayRef) = each %types ) {
    my @petArray = @$arrayRef;
    print "type $petType has " . @petArray . " representatives \n";
 
    my $refReturned = $dispatcher->{$petType}->( $arrayRef );
    $newTypes{$petType} = $refReturned;
}
$data->{pets} = \%newTypes;             # overwrite existing data
$simple->XMLout($data, 
            KeepRoot   => 1, 
            OutputFile => 'pets.fixed.xml',
            XMLDecl    => "<?xml version='1.0'?>",
        );

Although you can make the Perl more concise, this code also illustrates how easy it is to manipulate XML in Perl. In particular, use of dispatch tables makes it possible to deal with many differently structured element types in a very clear and maintainable way.


Limitations

Unfortunately, you just can't do some things with XML::Simple. I will elaborate on this in Parts 2 and 3, but XML::Simple has two major limitations. First, on input it reads the entire XML file into memory, so if the file is too big, or if you're dealing with a stream of XML data, you can't use the module. Secondly, it can't deal with XML mixed content, where both text and subelements appear in the body of an element, as in Listing 17.

Listing 17. Mixed content
<example>of <mixed/> content</example>

How do you know whether your file is too big for XML::Simple to handle? The rule of thumb is that XML expands by a factor of ten when read into memory. The implication is that if you have a few hundred megabytes of free memory on your workstation, XML::Simple should be able to handle XML files that are up to a few tens of megabytes in size.


Summary

XML has become pervasive in the computing world and is buried more and more deeply in modern applications and operating systems. It's imperative for the Perl programmer to develop a good understanding of how to use it. Tools like XML::Simple make it easy to convert XML documents into easily understandable Perl data structures and translate such data structures back into XML. Each action will normally be a single line of code.

On the other hand, XML specialists can be pleasantly surprised at how useful Perl can be in transforming and responding to XML contents.

Part 2 will show you how to take advantage of the two major schools of XML parsing for Perl developers: tree parsing and event-driven parsing.

Resources

Learn

Get products and technologies

  • Perl: Get the most recent version and put it in action.
  • The huge CPAN Perl library (18 million Google hits!): Visit the Comprehensive Perl Archive Network for all things Perl.
  • PPM, Perl Package Manager for Windows: Get a tool that allows you to install, remove, upgrade, and otherwise manage the use of common Perl CPAN modules (like Tk and DBI) with ActivePerl.
  • Grant McLean's XML::Simple: Try the XML::Simple module for a simple API layer on top of an underlying XML parsing module.
  • XML specification: Dig into this complete description of the Extensible Markup Language (XML).
  • Document Object Model (DOM) spec: Get the details on a platform- and language-neutral interface that allows programs and scripts to dynamically access and update the content, structure and style of documents.
  • Introduction to XML (Doug Tidwell, developerWorks, August 2002): For a gentler introduction to XML, take this tutorial that covers how XML developed, how it's shaping the future of electronic commerce, a variety of XML programming interfaces and standards, and two case studies that show how companies solve business problems with XML.
  • XPath 1.0: Get the specification for a language to navigate the DOM tree.
  • XSLT 1.0 specification: Learn about transforming one XML document into another.
  • Dare to script tree-based XML with Perl: Find out how to work with tree-based document models (Parand Darugar, developerWorks, July 2000): Get a solid introduction to tree-based XML parsing with Perl.
  • IBM trial software: Build your next development project with trial software available for download directly from developerWorks.
