Dare to script tree-based XML with Perl

Find out how to work with tree-based document models

Parand Tony Darugar (tdarugar@yahoo.com), Head of architecture, Yahoo! Search Marketing Services
Parand Tony Darugar is the head of architecture for Yahoo! Search Marketing Services (formerly Overture). His interests include Web services and Service Oriented Architectures (SOA), XML, high-performance business systems, distributed architectures, and artificial intelligence. You can reach him at tdarugar@yahoo.com.

Summary:  Parsing an XML document into tree structures makes it possible to operate on the tree structure of the data. Find out how to use the functions for accessing and manipulating the document tree, and follow a sample stock-trading application that uses Perl, DOM, XML, and a database to evaluate trading rules. (You can apply the same techniques with other scripting languages, including Tcl and Python.) This is the second installment on using scripting languages to manipulate and transform XML documents.

Date:  01 Jul 2000
Level:  Introductory

My previous article (see Resources) discussed how to manipulate XML with scripting languages, and Perl in particular, using event-based parsing. When a file is parsed that way, handlers are called as each tag is encountered. That provides a very efficient means of processing XML, both in memory usage and processing time. However, certain tasks are difficult to accomplish in the event-based methodology. Imagine, for example, needing to move or rearrange certain segments of the document, or to sort items within it. Because the document is processed as a stream, we would first need to store the components before sorting or rearranging them. A mechanism that stored the components automatically would make such tasks substantially easier.

XML documents are required to be well-formed, with every tag properly nested and closed, which makes it easy to store them as trees. Once you parse XML documents into tree structures, you can then operate on the tree. This yields a great deal of flexibility in dealing with the documents: you can access the components of the document in random order, rearrange them, and add or remove them. This is especially appropriate for applications in which the flow of processing is based on external logic, as opposed to the order and occurrence of elements within the XML document. Storing the document as a tree enables random access to its data and structure, instead of having the processing governed by when and where the tags and elements occur.

Tree-based methodologies do have some drawbacks, however. They require the parsing of the entire XML document and the creation of the tree data structure before the processing and business logic takes place. Because the tree data structure is generally stored in memory, these methods have much larger memory footprints than stream-based methods.

Looking at tree-based options

Four popular modules do tree-based processing of XML with Perl, each with slightly different goals and histories:

  • DOM
  • Grove
  • Twig
  • XML::Simple

The Document Object Model (DOM) is a platform- and language-neutral interface for dynamically accessing and updating the content, structure, and style of XML documents. The DOM provides a standard set of objects for representing documents, a standard model of how these objects can be combined, and a standard interface for accessing and manipulating them. DOM is a W3C recommendation, making it a recognized Web standard. DOM implementations are available for a wide variety of languages, including Perl, Tcl, Python, C, C++, and Java.

Grove is an alternative document model. The grove concept has its roots in SGML processing; it revolves around the idea of representing a resource (for example, an XML document) as a set of nodes and properties. A grove is a directed graph of nodes, each of which exhibits one or more properties, which may be atomic values (such as strings, booleans, or integers) or lists of nodes or references to nodes. Thus a node is a set of properties, some of which can be references to other nodes.

Twig, a tree-based document model, can handle large XML documents. By allowing callbacks during processing (like the event-based models discussed earlier), allowing processing of sub-trees within a document while ignoring others, and allowing flushing of already processed sub-trees, Twig lets the user process very large files with controlled memory overhead and good performance.

XML::Simple aims to make processing of XML files simple and natural in Perl. The XML file is processed into a Perl data structure, allowing easy access to the data using normal Perl syntax. Originally created for processing configuration files, XML::Simple handles only a limited subset of XML -- for example, XML elements that contain both text and nested elements, such as marked-up text, are not handled.
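As a rough illustration of the "normal Perl syntax" that XML::Simple offers, the sketch below reads a small document into nested hashes and arrays. The stock_quotes layout and the volume figure for IBM are assumptions made up for this example; ForceArray and KeyAttr are XML::Simple options chosen here to keep the resulting structure predictable.

```perl
use strict;
use warnings;
use XML::Simple qw(XMLin);

# Hypothetical input; the real quote format may differ.
my $xml = <<'XML';
<stock_quotes>
  <stock_quote><symbol>IBM</symbol><volume>5000</volume></stock_quote>
  <stock_quote><symbol>MSFT</symbol><volume>64282200</volume></stock_quote>
</stock_quotes>
XML

# ForceArray keeps stock_quote a list even when only one is present;
# KeyAttr => [] turns off folding of lists into hashes keyed on name/key/id.
my $data = XMLin($xml, ForceArray => ['stock_quote'], KeyAttr => []);

# The parsed document is now ordinary Perl data.
my @symbols = map { $_->{symbol} } @{ $data->{stock_quote} };
print "@symbols\n";
```

Note that the root element is dropped by default, so stock_quote is a top-level key of the returned hash.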


Selecting a tree-based processing module

You can use any of the four modules to build powerful XML-based applications. Choosing the most appropriate depends on the application, as well as the user's preferences and requirements.

If the data being processed does not contain complex structures, XML::Simple is an excellent choice. Its low overhead, minimal learning curve, and simple API make it easy to learn, use, and deploy.

If the data is more complex, DOM, Grove, and Twig come into play. DOM is quite popular and has many merits: it is a W3C recommendation, it is available for many languages, and it is covered in many articles, tutorials, and books. It also ties in well with other XML technologies, such as XQL, which is included in the XML::DOM package. Because it is deliberately language-neutral, however, DOM is less Perl-like, and therefore less intuitive for Perl veterans to use. DOM also offers fewer convenience functions than the alternatives and is somewhat more awkward.

Grove brings a rich heritage from its SGML roots and offers a well-thought-out and articulated model for storing information. Its interface fits more naturally with the Perl syntax, and its Visitor-based model provides a rich method for traversing the tree structure. Unfortunately, Grove is less actively developed than the alternatives and less popular.

Twig attempts to provide an easy-to-use tool while maintaining high performance, both in speed and conservation of memory. Like Grove, Twig's interface is Perl-like, and it offers convenience functions to make programming easy. Its ability to selectively process parts of the tree differentiates it from the alternatives, especially when dealing with large files.

For this article I chose DOM because of its popularity and the fact that it is a standard. C and C++ implementations of DOM, such as Xerces-Perl, are becoming available, further strengthening DOM's position and promising improved performance.


Thinking in DOM

A DOM is a tree of nodes: every piece of the XML document is represented in the tree as a node. This is somewhat different from what you may expect; for example, attributes could well be expressed as Perl hashes. In DOM the attributes, XML tags, text contents, and just about everything else are nodes.

Following the node-centric convention provides a consistent method for representing the tree. Unfortunately, it makes DOM less Perl-like than some of the other modules, and sometimes less intuitive to use. On the other hand, the same methodologies and function calls that are used with Perl can be used when programming DOM with other languages.

Consider this simple XML document:

<paragraph align="left">The <it>Italicized</it> portion.</paragraph>

Now look at the sample XML represented by DOM as a tree, as shown in Figure 1.


Figure 1. A DOM tree treats each part of the sample XML document as a node

In Figure 1, each of the Document, Element, Text, and Attr pieces of the tree is an XML::DOM::Node.
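The structure shown in the figure can also be inspected from code. The following is a minimal sketch, not part of the original listing, that parses the sample paragraph and records one line per node, indented by depth; the outline helper and its output format are made up for this illustration.

```perl
use strict;
use warnings;
use XML::DOM;   # exports node-type constants such as ELEMENT_NODE

my $xml = '<paragraph align="left">The <it>Italicized</it> portion.</paragraph>';
my $doc = XML::DOM::Parser->new->parse($xml);

my @outline;   # one line per node, indented by depth

# Hypothetical helper: records the class and name (or text) of each node.
sub outline {
    my ($node, $depth) = @_;
    (my $kind = ref $node) =~ s/^XML::DOM:://;   # e.g. "Element", "Text"
    my $detail = $node->getNodeType == TEXT_NODE
        ? '"' . $node->getData . '"'
        : $node->getNodeName;
    push @outline, ('  ' x $depth) . "$kind: $detail";

    # Attributes hang off their element in a NamedNodeMap, not in the child list.
    if ($node->getNodeType == ELEMENT_NODE) {
        my $attrs = $node->getAttributes;
        for my $i (0 .. $attrs->getLength - 1) {
            my $attr = $attrs->item($i);
            push @outline, ('  ' x ($depth + 1))
                . 'Attr: ' . $attr->getName . '="' . $attr->getValue . '"';
        }
    }
    outline($_, $depth + 1) for $node->getChildNodes;
}

outline($doc, 0);
print "$_\n" for @outline;
```

Although attributes are nodes, they are reached through getAttributes rather than getChildNodes, which is why the sketch handles them separately.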

DOM provides a set of functions for accessing and manipulating the document tree. These functions are specified as language-neutral interfaces by the W3C DOM Recommendation, and DOM packages implement these interfaces. The Perl XML::DOM package includes extensive documentation on these functions and how to use them.


Using DOM

Here is a simple invocation of the DOM parser:

use XML::DOM;
# Read and parse the document
my $parser = new XML::DOM::Parser;
my $doc = $parser->parsefile ("filename");

This parses the given file, building a DOM document in memory. The doc variable is the handle to this document.
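The underlying XML::Parser dies when the input is not well-formed, so a script that must survive bad input can wrap the call in eval. The sketch below parses from strings with the parse method so that it runs without an input file; the parse_or_undef helper and the sample documents are hypothetical.

```perl
use strict;
use warnings;
use XML::DOM;

my $parser = XML::DOM::Parser->new;

# Hypothetical helper: returns the DOM document, or undef if parsing fails.
sub parse_or_undef {
    my ($xml) = @_;
    my $doc = eval { $parser->parse($xml) };
    warn "XML parse failed: $@" unless defined $doc;
    return $doc;
}

my $good = parse_or_undef('<quote><symbol>IBM</symbol></quote>');
my $bad  = parse_or_undef('<quote><symbol>IBM</quote>');   # mismatched tag

print "root element: ", $good->getDocumentElement->getNodeName, "\n" if $good;
print "second document rejected\n" unless $bad;
```

The same pattern applies to parsefile.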


Walking the DOM

Unlike the event-based model, where handler callbacks are generated as the XML document is processed, tree-based models require explicit traversal of the result object. This is accomplished by starting at the root node and traversing the children in order. A recursive implementation of the tree traversal could look like:

sub traverse {
  my($node)= @_;
  if ($node->getNodeType == ELEMENT_NODE) {
    print "<", $node->getNodeName, ">";
    foreach my $child ($node->getChildNodes()) {
      traverse($child);
    }
    print "</", $node->getNodeName, ">";
  } elsif ($node->getNodeType() == TEXT_NODE) {
    print $node->getData;
  }
}

If the node is an Element Node, it represents an XML tag, and may contain children. Each of the child nodes is processed by a recursive call to the traverse function. For Text Nodes, the text data is simply printed.

Note the use of object-oriented Perl: each node in the DOM tree is an object with a set of methods. The getNodeType method, for example, returns the type of the node. XML::DOM makes extensive use of objects. The XML::DOM documentation includes the full set of available methods.
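One detail worth noting when calling traverse: the object returned by the parser is the Document node, which is neither an element nor a text node, so the traversal should start at the document's root element. The sketch below uses a variant of traverse that appends to a string instead of printing (traverse_to_string is a name made up for this example) to show the difference.

```perl
use strict;
use warnings;
use XML::DOM;

my $out = '';   # collect the output so it can be inspected afterwards

# Same logic as traverse above, appending to $out instead of printing.
sub traverse_to_string {
    my ($node) = @_;
    if ($node->getNodeType == ELEMENT_NODE) {
        $out .= '<' . $node->getNodeName . '>';
        traverse_to_string($_) for $node->getChildNodes;
        $out .= '</' . $node->getNodeName . '>';
    } elsif ($node->getNodeType == TEXT_NODE) {
        $out .= $node->getData;
    }
}

my $doc = XML::DOM::Parser->new->parse(
    '<paragraph align="left">The <it>Italicized</it> portion.</paragraph>');

traverse_to_string($doc);                       # Document node: no branch matches
my $from_document = $out;

$out = '';
traverse_to_string($doc->getDocumentElement);   # root element: whole tree
my $from_root = $out;

print "from document: '$from_document'\n";
print "from root:     '$from_root'\n";
```

Because the traversal visits only child nodes, the align attribute does not appear in the output.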


Accessing the DOM

Often you need to access particular nodes or subtrees of the document instead of traversing it. DOM provides several methods for random access and moving around within the document.

To find a particular tag, you can use the getElementsByTagName function:

my @matching_tags = $start_node->getElementsByTagName("my_tag");
my $first_match = $matching_tags[0];
foreach my $match (@matching_tags) { do_something; }

The getElementsByTagName function returns a set of nodes that match the given tag name. The matching works quite simply: any tag matching the given tag name within start_node's subtree is selected. Thus, it is not possible to differentiate between two tags with the same name but different contexts (for example, a search for x matches both <a><x /></a> and <b><x /></b>). However, this function is available as a method of XML::DOM::Node, allowing selective application by choosing the starting node intelligently.

It is also possible to move around the tree using the parent-child relationships. For example, the following addresses the current node's parent's first child:

my $other_node = $start_node->getParentNode->getFirstChild;
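To make the effect of the starting node concrete, here is a small sketch using the a, b, and x tags from the example above, wrapped in a hypothetical root element r. It assumes only the XML::DOM calls already shown plus getPreviousSibling.

```perl
use strict;
use warnings;
use XML::DOM;

my $doc = XML::DOM::Parser->new->parse(
    '<r><a><x>in a</x></a><b><x>in b</x></b></r>');

# From the document, every <x> matches regardless of its parent.
my @all_x = $doc->getElementsByTagName('x');

# Starting from <b> restricts the search to that subtree.
my $b_node = $doc->getElementsByTagName('b')->item(0);
my @x_in_b = $b_node->getElementsByTagName('x');

# Sibling links allow relative moves: from <b> back to <a>.
my $a_node = $b_node->getPreviousSibling;

print scalar(@all_x), " x elements in the document\n";
print scalar(@x_in_b), " x element under b: ",
      $x_in_b[0]->getFirstChild->getData, "\n";
print "previous sibling of b: ", $a_node->getNodeName, "\n";
```

The sample document contains no whitespace between tags; with indented input, the sibling of an element is often a whitespace-only text node, so sibling navigation usually needs a node-type check.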


Composing complex queries: XQL

XQL, the XML Query Language, is a standard for advanced queries on XML content. A detailed discussion of XQL is beyond the scope of this article; the XML::XQL documentation includes an excellent tutorial. The xql method used here becomes available on DOM nodes once the XML::XQL and XML::XQL::DOM modules are loaded. For the purposes of this article, let's try some fairly simple queries:

my @matches = $start_node->xql("stock_quotes/stock_quote/price");

This query searches for the element price occurring within the stock_quote element, which itself occurs within the stock_quotes element. Queries can also contain conditionals and search on attributes:

my @ask_price_nodes = $stock_quote->xql("price[\@type='ask']");

This searches for price nodes where the attribute type has the value ask.
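The two queries can be tried end to end with a script along the following lines. This is a sketch under assumptions: the document layout is inferred from the queries and from the price handling later in the article, and the opening price value is invented for illustration.

```perl
use strict;
use warnings;
use XML::DOM;
use XML::XQL;
use XML::XQL::DOM;   # adds the xql method to XML::DOM nodes

# Hypothetical quote document shaped to match the queries in the text.
my $doc = XML::DOM::Parser->new->parse(<<'XML');
<stock_quotes>
  <stock_quote>
    <symbol>IBM</symbol>
    <price type="ask" value="109.1875"/>
    <price type="open" value="108.0"/>
  </stock_quote>
</stock_quotes>
XML

# Path query, evaluated relative to the document node.
my @prices = $doc->xql("stock_quotes/stock_quote/price");

# The same path with a condition on the type attribute.
my @ask = $doc->xql("stock_quotes/stock_quote/price[\@type='ask']");

print scalar(@prices), " price elements found\n";
print "ask value: ", $ask[0]->getAttribute('value'), "\n" if @ask;
```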


Building the DOM

You can modify the DOM in memory, adding and removing nodes and subtrees. To add a new node, you create the new node, set its value, and add it to the tree at the appropriate location. To add the node percentage as a child of change in the XML file, do the following:

my $newnode = $doc->createElement("percentage");
$newnode->addText("23");
my $change_node = $doc->getElementsByTagName("change")->item(0);
$change_node->appendChild($newnode);

doc is the XML::DOM::Document object, usually created by the parser. The new element node is created using the createElement method; similar methods exist for other Node types such as text, comments, and attributes.

Having created the new element node, next set its value, find the location of its parent in the tree, and add it to the tree using the appendChild method.
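A quick way to check the result of such a modification is to serialize the affected subtree with toString. The sketch below starts from a hypothetical one-element document, adds the percentage node as described, and then detaches it again with removeChild.

```perl
use strict;
use warnings;
use XML::DOM;

# Hypothetical starting document containing a single change element.
my $doc = XML::DOM::Parser->new->parse(
    '<stock_quote><change>+2</change></stock_quote>');

my $change_node = $doc->getElementsByTagName('change')->item(0);

# Add <percentage>23</percentage> as the last child of <change>.
my $newnode = $doc->createElement('percentage');
$newnode->addText('23');
$change_node->appendChild($newnode);
my $after_add = $change_node->toString;

# removeChild detaches the node from the tree again.
$change_node->removeChild($newnode);
my $after_remove = $change_node->toString;

print "after add:    $after_add\n";
print "after remove: $after_remove\n";
```

Calling toString on the document object instead serializes the whole tree, which is one way to write the modified XML back out.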


Putting it all together: The sample application

Now it's time to rebuild the sample application from the earlier article, "XML and Scripting Languages," using DOM. The application, a simple stock trading program, takes stock quotes in XML format, retrieves the trading rules from a database, and decides which trades to make based on the rules and the stock quotes.

The first step is to read the XML file:

my $parser = new XML::DOM::Parser;
my $doc = $parser->parsefile ($file);

Then, open the database connection:

use DBI;
my $dsn = "DBI:mysql:database=test;";
my $dbh = DBI->connect($dsn);

With the XML data in memory as a DOM and the database connection established, loop over each stock_quote in the XML file:

foreach my $stock_quote ($doc->getElementsByTagName("stock_quote")) {

Now you need to know what stock symbol this stock_quote refers to:

my @symbol_nodes = $stock_quote->getElementsByTagName("symbol");
my $symbol_node  = $symbol_nodes[0];
my $symbol       = $symbol_node->getFirstChild->getData();

With the symbol in hand, you can look up the rules that apply to this stock:

# A placeholder keeps the symbol out of the SQL text itself
my $sth = $dbh->prepare("select * from rules where symbol = ?");
$sth->execute($symbol);

The trading rules are stored as actions to take based on given fields (such as price, volume, or change) and values of those fields. For example:

INSERT INTO rules VALUES ("MSFT", "volume", "65000000", "sell");

indicates that MSFT should be sold if the volume is over 65000000. See "XML and Scripting Languages" for further details.

For each rule, compare the value of the applicable field from the XML stock quote with the threshold specified in the rule:

while (my $ref = $sth->fetchrow_hashref()) {
  my $field  = $ref->{'field'};
  my $value  = $ref->{'value'};    # the rule's threshold
  my $action = $ref->{'action'};
  my $matching_field = $stock_quote->getElementsByTagName($field)->item(0);
  my $value_from_xml = $matching_field->getFirstChild->getData();
  if ($value_from_xml > $value) {
    take_action($symbol, $action);
  }
}
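The rule loop can be exercised without a database by substituting an in-memory list for the rows that fetchrow_hashref would return. The following is a sketch under stated assumptions: the @rules array, the quote values, and the @actions list are all invented for this example, and recording an action stands in for take_action.

```perl
use strict;
use warnings;
use XML::DOM;

# Hypothetical stand-in for the rows of the rules table.
my @rules = (
    { symbol => 'MSFT', field => 'volume', value => 65000000, action => 'sell' },
    { symbol => 'CSCO', field => 'change', value => 3,        action => 'sell' },
);

# Hypothetical quotes; the real feed carries more fields.
my $doc = XML::DOM::Parser->new->parse(<<'XML');
<stock_quotes>
  <stock_quote><symbol>MSFT</symbol><volume>64282200</volume><change>+1</change></stock_quote>
  <stock_quote><symbol>CSCO</symbol><volume>1000</volume><change>+4</change></stock_quote>
</stock_quotes>
XML

my @actions;   # "SYMBOL:action" for every rule that fires

foreach my $stock_quote ($doc->getElementsByTagName('stock_quote')) {
    my $symbol = $stock_quote->getElementsByTagName('symbol')->item(0)
                             ->getFirstChild->getData;
    foreach my $rule (grep { $_->{symbol} eq $symbol } @rules) {
        my $matching_field =
            $stock_quote->getElementsByTagName($rule->{field})->item(0);
        next unless defined $matching_field;   # quote carries no such field
        my $value_from_xml = $matching_field->getFirstChild->getData;
        # Numeric comparison: a value such as "+4" is treated as 4.
        push @actions, "${symbol}:$rule->{action}"
            if $value_from_xml > $rule->{value};
    }
}

print "$_\n" for @actions;
```

With these made-up values only the CSCO rule fires, because the MSFT volume stays below its threshold.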

Thus you have the basic logic of the trading application implemented. You need to add a special case for the price tag, since its value is stored not as the text contents of the tag but rather as an attribute:

if ($field eq "price") {
  my @ask_price_nodes = $stock_quote->xql("price[\@type='ask']");
  my $ask_price_node  = $ask_price_nodes[0];
  $value_from_xml = $ask_price_node->getAttribute('value');
}

Each stock quote has several price elements, one each for the ask price, the opening price, and the high and low prices for the day. Use XQL to find the price tag within the current stock_quote that has the attribute type set to ask. In other words, look for the asking price of the stock. Note that using XQL simplifies the code significantly; you don't have to loop through the price nodes, examining the type attribute of each.
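For comparison, the loop that the XQL query replaces might look like the sketch below, which uses only XML::DOM. The quote document is hypothetical: its shape follows the type and value attributes used above, and the opening price is invented.

```perl
use strict;
use warnings;
use XML::DOM;

# Hypothetical single quote with two of its price elements.
my $doc = XML::DOM::Parser->new->parse(<<'XML');
<stock_quote>
  <symbol>IBM</symbol>
  <price type="ask" value="109.1875"/>
  <price type="open" value="108.0"/>
</stock_quote>
XML

my $stock_quote = $doc->getDocumentElement;

# Check each <price> in turn and keep the first one whose type is "ask".
my $ask_price;
foreach my $price_node ($stock_quote->getElementsByTagName('price')) {
    if ($price_node->getAttribute('type') eq 'ask') {
        $ask_price = $price_node->getAttribute('value');
        last;
    }
}

print "ask price: $ask_price\n" if defined $ask_price;
```

The loop is short here, but each additional condition adds another test inside it, whereas the XQL version only changes the query string.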

The complete program, available as a separate listing, operates on XML-formatted stock quotes, which can be generated with the live XML stock quote server available from the XML Today Web site (see Resources).

Running the program with our set of trading rules and our XML formatted stock quote produces the following output:

pl domactive.pl stocks.xml
** Dealing with IBM: **************
Handling rule: if (price > 100) take action sell.
Value for price from xml value = 109.1875
Rule "price > 100" applies for IBM
Taking action "sell" on stock "IBM" .
** Dealing with MSFT: **************
Handling rule: if (volume > 65000000) take action buy.
Value for volume from xml value = 64282200
Rule "volume > 65000000" does not apply for MSFT, no action taken.
** Dealing with CSCO: **************
Handling rule: if (change > 3) take action sell.
Value for change from xml value = +2
Rule "change > 3" does not apply for CSCO, no action taken.
** Dealing with INTC: **************
** Dealing with ORCL: **************
** Dealing with SUNW: **************


Summing up

Using tree-based methodologies for handling XML files can simplify programming and enable random access to data where stream-based methods offer only linear access. Powerful querying capabilities, such as XQL, allow you to quickly find the segments of information you are interested in, without explicitly traversing the entire document. These powerful tools, the examples above, and the wealth of documentation available on the Web and in print empower you to build your XML-based applications quickly and simply, today.



Download

Description              Name                Size  Download method
Sample code for article  xml-perl2-code.zip  2KB   HTTP


Resources
