Level: Introductory
Parand Tony Darugar (tdarugar@yahoo.com), Head of architecture, Yahoo! Search Marketing Services
01 Jul 2000
Parsing an XML document into a tree structure makes it possible to operate on the tree structure of the data. Find out how to use the functions for accessing and manipulating the document tree, and follow a sample stock-trading application that uses Perl, DOM, XML, and a database to evaluate trading rules. (You can apply the same techniques with other scripting languages, including Tcl and Python.) This is the second installment on using scripting languages to manipulate and transform XML documents.
My previous article (see Resources) discussed how to manipulate XML with scripting languages, and Perl in particular, using an event-based parser. In the course of parsing a file that way, handlers are called as each tag is encountered. That provides a very efficient means of processing XML, in both memory usage and processing time. However, certain tasks are difficult to accomplish with the event-based methodology. Imagine, for example, needing to move or rearrange certain segments of the document, or to sort items within the document. Because the document is processed as a stream, we would first need to store the components before sorting or rearranging them. A mechanism that would store the components automatically would make such tasks substantially easier.
XML documents are required to be well-formed, with properly nested elements, making it easy to store them as trees. Once you parse XML documents into tree structures, you can then operate on the tree. This yields a great deal of flexibility in dealing with the documents: you can access the components of the document in random order, rearrange them, and add or remove them.
This is especially appropriate for applications in which the flow of processing is based on external logic, as opposed to the order and occurrence of elements within the XML document. Storing the document as a tree enables random access to its data and structure, instead of having the processing governed by when and where the tags and elements occur.
Tree-based methodologies do have some drawbacks, however. They require the parsing of the entire XML document and the creation of the tree data structure before the processing and business logic takes place. Because the tree data structure is generally stored in memory, these methods have much larger memory footprints than stream-based methods.
Looking at tree-based options
Four popular modules do tree-based processing of XML with Perl, each
with slightly different goals and histories:
- DOM
- Grove
- Twig
- XML::Simple
The Document Object Model (DOM) is a platform- and language-neutral
interface for dynamically accessing and updating the content, structure,
and style of XML documents. The DOM provides a standard
set of objects for representing documents, a standard model of how these
objects can be combined, and a standard interface for accessing and manipulating
them. DOM is a W3C recommendation, making it a recognized Web standard.
DOM implementations are available for a wide variety of languages, including
Perl, Tcl, Python, C, C++, and Java.
Grove is an alternative document model. The grove concept has
its roots in SGML processing; it revolves around the idea of representing
a resource (for example, an XML document) as a set of nodes and properties.
A grove is a directed graph of nodes, each of which exhibits one or more
properties, which may be atomic values (such as strings, booleans, or integers)
or lists of nodes or references to nodes. Thus a node is a set of
properties, some of which can be references to other nodes.
Twig, a tree-based document model, can handle large XML documents.
By allowing callbacks during processing (like the event-based models discussed
earlier), allowing processing of sub-trees within a document while ignoring
others, and allowing flushing of already processed sub-trees, Twig lets
the user process very large files with controlled memory overhead and good
performance.
XML::Simple aims to make processing of XML files simple and natural
in Perl. The XML file is processed into a Perl data structure, allowing
easy access to the data using normal Perl syntax. Originally created for
processing configuration files, XML::Simple handles only a limited subset
of XML -- for example, XML elements that contain both text and nested elements,
such as marked-up text, are not handled.
Selecting a tree-based processing module
You can use any of the four modules to build powerful XML-based applications. Choosing the most appropriate one depends on the application, as well as the user's preferences and requirements.
If the data being processed does not contain complex structures, XML::Simple is an excellent choice. Its low overhead, minimal learning curve, and simple API make it easy to learn, use, and deploy.
If the data is more complex, DOM, Grove, and Twig come into play. DOM is quite popular and has many merits: it is a W3C recommendation, it is available for many languages, and it is covered in many articles, tutorials, and books. It also ties in well with other XML standards, such as XQL, which is included in the XML::DOM package. Due to its deliberately language-neutral nature, however, DOM is less Perl-like, and therefore less intuitive for Perl veterans to use. DOM offers fewer convenience functions than the alternatives and is somewhat more awkward.
Grove brings a rich heritage from its SGML roots and offers a well-thought-out and articulated model for storing information. Its interface fits more naturally with the Perl syntax, and its Visitor-based model provides a rich method for traversing the tree structure. Unfortunately, Grove is less actively developed than the alternatives and less popular.
Twig attempts to provide an easy-to-use tool while maintaining high performance, both in speed and conservation of memory. Like Grove, Twig's interface is Perl-like, and it offers convenience functions to make programming easy. Its ability to selectively process parts of the tree differentiates it from the alternatives, especially when dealing with large files.
For this article I chose DOM because of its popularity and the fact that it is a standard. C and C++ implementations of DOM, such as Xerces-Perl, are becoming available, further strengthening DOM's position and promising improved performance.
Thinking in DOM
A DOM is a tree of nodes: every piece of the XML document is represented in the tree as a node. This is somewhat different from what you may expect; for example, attributes could well be expressed as Perl hashes. In DOM the attributes, XML tags, text contents, and just about everything else are nodes.
Following the node-centric convention provides a consistent method for representing the tree. Unfortunately, it makes DOM less Perl-like than some of the other modules, and sometimes less intuitive to use. On the other hand, the same methodologies and function calls that are used with Perl can be used when programming DOM with other languages.
Consider this simple XML document:
<paragraph align="left">The <it>Italicized</it> portion.</paragraph>
Now look at the sample XML represented by DOM as a tree, as shown in Figure 1.
Figure 1. A DOM tree treats each part of the sample XML document as a node
In Figure 1, each of the Document, Element, Text, and Attr pieces of the tree is an XML::DOM::Node.
DOM provides a set of functions for accessing and manipulating the document tree. These functions are specified as language-neutral interfaces by the W3C DOM Recommendation, and DOM packages implement these interfaces. The Perl XML::DOM package includes extensive documentation on these functions and how to use them.
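To make the "everything is a node" idea concrete, here is a minimal sketch that parses the sample paragraph and lists each node with its type. The `describe` helper and the `%type_name` table are illustrative names, not part of XML::DOM; the sketch assumes XML::DOM is installed from CPAN.

```perl
use strict;
use warnings;
use XML::DOM;

# Map the numeric node-type constants exported by XML::DOM to readable names.
my %type_name = (
    ELEMENT_NODE()   => 'Element',
    ATTRIBUTE_NODE() => 'Attr',
    TEXT_NODE()      => 'Text',
    DOCUMENT_NODE()  => 'Document',
);

# Collect one "Type: name" line per node, indented by depth.
# (describe is a hypothetical helper written for this sketch.)
sub describe {
    my ($node, $depth, $out) = @_;
    my $type = $type_name{ $node->getNodeType } || 'Other';
    push @$out, ('  ' x $depth) . "$type: " . $node->getNodeName;

    if ($node->getNodeType == ELEMENT_NODE) {
        # Attributes are nodes too, held in a NamedNodeMap on the element.
        my $attrs = $node->getAttributes;
        for my $i (0 .. $attrs->getLength - 1) {
            my $attr = $attrs->item($i);
            push @$out, ('  ' x ($depth + 1))
                . 'Attr: ' . $attr->getName . '=' . $attr->getValue;
        }
    }
    describe($_, $depth + 1, $out) for $node->getChildNodes;
    return $out;
}

my $xml = '<paragraph align="left">The <it>Italicized</it> portion.</paragraph>';
my $doc = XML::DOM::Parser->new->parse($xml);
my $lines = describe($doc, 0, []);
print "$_\n" for @$lines;
$doc->dispose;    # break circular references so the tree can be freed
```

The listing it prints should mirror the shape of Figure 1: a Document node at the top, the paragraph element with its align attribute, and Text and Element children beneath it.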
Using DOM
Here is a simple invocation of the DOM parser:
use XML::DOM;
# Read and parse the document
my $parser = XML::DOM::Parser->new;
my $doc = $parser->parsefile("filename");
This parses the given file, building a DOM document in memory. The $doc variable is the handle to this document.
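Because the whole document must parse before any tree exists, a malformed file stops processing at this step. The following sketch, which assumes XML::DOM is installed, parses from a string with the parser's `parse` method and traps the exception the underlying parser raises on malformed input. The `parse_or_report` helper is a hypothetical name introduced for illustration.

```perl
use strict;
use warnings;
use XML::DOM;

# Parse an XML string; return (document, undef) on success,
# or (undef, message) if the parser raised an error.
# (parse_or_report is a hypothetical helper written for this sketch.)
sub parse_or_report {
    my ($xml) = @_;
    my $parser = XML::DOM::Parser->new;
    my $doc = eval { $parser->parse($xml) };
    return ($doc, undef) if $doc;
    return (undef, $@ || 'unknown parse error');
}

my ($good, $err1) = parse_or_report('<a><b/></a>');
my ($bad,  $err2) = parse_or_report('<a><b></a>');    # mismatched tag

print "first document root: ", $good->getDocumentElement->getNodeName, "\n";
print "second document failed to parse\n" unless $bad;

$good->dispose;    # release the tree when finished with it
```

Calling `dispose` when a document is no longer needed is worthwhile in long-running programs, since the tree's parent and child links refer to each other.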
Walking the DOM
Unlike the event-based model, where handler callbacks are generated
as the XML document is processed, tree-based models require explicit traversal
of the result object. This is accomplished by starting at the root node
and traversing the children in order. A recursive implementation of the
tree traversal could look like:
sub traverse {
    my ($node) = @_;
    if ($node->getNodeType == ELEMENT_NODE) {
        # Element: print the opening tag, recurse into the children, then close the tag
        print "<", $node->getNodeName, ">";
        foreach my $child ($node->getChildNodes()) {
            traverse($child);
        }
        print "</", $node->getNodeName, ">";
    } elsif ($node->getNodeType() == TEXT_NODE) {
        # Text: print the character data as-is
        print $node->getData;
    }
}
If the node is an Element node, it represents an XML tag and may contain children. Each of the child nodes is processed by a recursive call to the traverse function. For Text nodes, the text data is simply printed.
Note the use of object-oriented Perl: each node in the DOM tree is an object with a set of methods. The getNodeType method, for example, returns the type of the node. XML::DOM makes extensive use of objects. The XML::DOM documentation includes the full set of available methods.
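The same recursive pattern can return a value instead of printing. As a small sketch (assuming XML::DOM is installed), the hypothetical `text_of` helper below gathers the text content beneath any node by concatenating its Text descendants in document order.

```perl
use strict;
use warnings;
use XML::DOM;

# Return the concatenated character data of all Text nodes under $node.
# (text_of is a hypothetical helper written for this sketch.)
sub text_of {
    my ($node) = @_;
    return $node->getData if $node->getNodeType == TEXT_NODE;
    # Other node types contribute whatever text their children hold.
    return join '', map { text_of($_) } $node->getChildNodes;
}

my $xml = '<paragraph align="left">The <it>Italicized</it> portion.</paragraph>';
my $doc = XML::DOM::Parser->new->parse($xml);
my $text = text_of($doc->getDocumentElement);
print "$text\n";    # prints "The Italicized portion."
$doc->dispose;
```

Markup is dropped and only the character data survives, which is often what a program wants when it needs an element's value rather than its structure.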
Accessing the DOM
Often you need to access particular nodes or subtrees of the document instead of traversing it. DOM provides several methods for random access and moving around within the document.
To find a particular tag, you can use the getElementsByTagName function:
my @matching_tags = $start_node->getElementsByTagName("my_tag");
my $first_match = $matching_tags[0];
foreach my $match (@matching_tags) { do_something($match); }
The getElementsByTagName function returns a set of nodes that match the given tag name. The matching works quite simply: any tag matching the given tag name within $start_node's subtree is selected. Thus, it is not possible to differentiate between two tags with the same name but different contexts (for example, a search for x matches both <a><x /></a> and <b><x /></b>). However, this function is available as a method of XML::DOM::Node, allowing selective application by choosing the starting node intelligently.
It is also possible to move around the tree using the parent-child relationships. For example, the following addresses the current node's parent's first child:
my $other_node = $start_node->getParentNode->getFirstChild;
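The effect of the starting node can be seen in a short sketch, assuming XML::DOM is installed. The document and the id attributes are made up for illustration: searching from the document finds both x elements, while searching from the b element finds only the one inside it.

```perl
use strict;
use warnings;
use XML::DOM;

# Hypothetical document with two x elements in different contexts.
my $xml = '<root><a><x id="1"/></a><b><x id="2"/></b></root>';
my $doc = XML::DOM::Parser->new->parse($xml);

# Searching from the document matches every x, regardless of parent.
my @all_x = $doc->getElementsByTagName('x');

# Searching from the b element restricts the match to b's subtree.
my ($b_elem) = $doc->getElementsByTagName('b');
my @x_in_b   = $b_elem->getElementsByTagName('x');

my @all_ids  = map { $_->getAttribute('id') } @all_x;
my @b_ids    = map { $_->getAttribute('id') } @x_in_b;
print "all x ids:  @all_ids\n";
print "x ids in b: @b_ids\n";

# Parent-child navigation: the parent of the x inside b is b itself.
my $parent_name = $x_in_b[0]->getParentNode->getNodeName;
print "parent of x#2: $parent_name\n";
$doc->dispose;
```

Narrowing the starting node this way is the plain-DOM substitute for a context-sensitive query.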
Composing complex queries: XQL
XQL, the XML Query Language, is a proposed query language for advanced queries on XML content. A detailed discussion of XQL is beyond the scope of this article; the XML::XQL documentation includes an excellent tutorial. For the purposes of this article, let's try some fairly simple queries:
use XML::XQL;
use XML::XQL::DOM;    # adds the xql method to XML::DOM nodes
my @matches = $start_node->xql("stock_quotes/stock_quote/price");
This query searches for the element price occurring within the stock_quote element, which itself occurs within the stock_quotes element. Queries can also contain conditionals and search on attributes:
my @ask_price_nodes = $stock_quote->xql("price[\@type='ask']");
This searches for price nodes where the attribute type has the value ask.
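For comparison, the attribute query can be approximated with plain DOM calls by filtering the result of getElementsByTagName. This is a sketch rather than a replacement for XQL: the `elements_with_attr` helper is a hypothetical name, and the sample quote (in particular the open price) is invented for illustration. It assumes XML::DOM is installed.

```perl
use strict;
use warnings;
use XML::DOM;

# Hypothetical stock quote; the layout of the price elements is an assumption.
my $xml = <<'XML';
<stock_quotes>
  <stock_quote>
    <symbol>IBM</symbol>
    <price type="ask" value="109.1875"/>
    <price type="open" value="108.0"/>
  </stock_quote>
</stock_quotes>
XML

# Return the elements named $tag under $start whose $attr equals $wanted.
# (elements_with_attr is a hypothetical helper written for this sketch.)
sub elements_with_attr {
    my ($start, $tag, $attr, $wanted) = @_;
    return grep { $_->getAttribute($attr) eq $wanted }
           $start->getElementsByTagName($tag);
}

my $doc = XML::DOM::Parser->new->parse($xml);
my ($stock_quote) = $doc->getElementsByTagName('stock_quote');
my @ask_price_nodes = elements_with_attr($stock_quote, 'price', 'type', 'ask');
my $ask_value = $ask_price_nodes[0]->getAttribute('value');
print "ask price: $ask_value\n";
$doc->dispose;
```

The loop hidden inside grep is exactly the bookkeeping that the XQL predicate expresses in one line.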
Building the DOM
You can modify the DOM in memory, adding and removing nodes and subtrees. To add a new node, you create the new node, set its value, and add it to the tree at the appropriate location. To add the node percentage as a child of change in the XML file, do the following:
my $newnode = $doc->createElement("percentage");
$newnode->addText("23");
my $change_node = $doc->getElementsByTagName("change")->item(0);
$change_node->appendChild($newnode);
$doc is the XML::DOM::Document object, usually created by the parser. The new element node is created using the createElement method; similar methods exist for other Node types such as text, comments, and attributes.
Having created the new element node, next set its value, find the location of its parent in the tree, and add it to the tree using the appendChild method.
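As a self-contained sketch of both adding and removing a node (assuming XML::DOM is installed), the following builds the percentage element from an explicit text node, attaches it, and then detaches it again. The small stock quote used here is hypothetical.

```perl
use strict;
use warnings;
use XML::DOM;

# Hypothetical fragment of a stock quote.
my $xml = '<stock_quote><symbol>CSCO</symbol><change>+2</change></stock_quote>';
my $doc = XML::DOM::Parser->new->parse($xml);

# Create <percentage>23</percentage> from an element node and a text node.
my $newnode = $doc->createElement('percentage');
$newnode->appendChild($doc->createTextNode('23'));

# Attach it as the last child of the first <change> element.
my $change_node = $doc->getElementsByTagName('change')->item(0);
$change_node->appendChild($newnode);
my $count_after_add = scalar(my @added = $doc->getElementsByTagName('percentage'));
print "after add:    ", $change_node->toString, "\n";

# Detach it again; the node still exists but is no longer in the tree.
$change_node->removeChild($newnode);
my $count_after_remove = scalar(my @left = $doc->getElementsByTagName('percentage'));
print "after remove: ", $change_node->toString, "\n";
$doc->dispose;
```

Nodes must be created through the document that will own them, which is why both createElement and createTextNode are called on `$doc`.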
Putting it all together: The sample application
Now it's time to rebuild the sample application from the earlier article, "XML and Scripting Languages," using DOM. The application, a simple stock trading program, takes stock quotes in XML format, retrieves the trading rules from a database, and decides which trades to make based on the rules and the stock quotes.
The first step is to read the XML file:
my $parser = XML::DOM::Parser->new;
my $doc = $parser->parsefile($file);
Then, open the database connection:
use DBI;
my $dsn = "DBI:mysql:database=test;";
my $dbh = DBI->connect($dsn);
With the XML data in memory as a DOM and the database connection established, loop over each stock_quote in the XML file:
foreach my $stock_quote ($doc->getElementsByTagName("stock_quote")) {
Now you need to know what stock symbol this stock_quote refers to:
my @symbol_nodes = $stock_quote->getElementsByTagName("symbol");
my $symbol_node = $symbol_nodes[0];
my $symbol = $symbol_node->getFirstChild->getData();
With the symbol in hand, you can look up the rules that apply to this stock:
# Bind the symbol through a placeholder rather than interpolating it into the SQL
my $sth = $dbh->prepare("select * from rules where symbol = ?");
$sth->execute($symbol);
The trading rules are stored as actions to take based on given fields (such as price, volume, or change) and values of those fields. For example:
INSERT INTO rules VALUES ("MSFT", "volume", "65000000", "sell");
indicates that MSFT should be sold if the volume is over 65000000. See "XML and Scripting Languages" for further details.
For each rule, compare the value of the applicable field from the XML stock quote with the threshold specified in the rule:
while (my $ref = $sth->fetchrow_hashref()) {
    my $field  = $ref->{'field'};
    my $value  = $ref->{'value'};     # threshold specified by the rule
    my $action = $ref->{'action'};
    my $matching_field = $stock_quote->getElementsByTagName($field)->item(0);
    my $value_from_xml = $matching_field->getFirstChild->getData();
    if ($value_from_xml > $value) {
        take_action($symbol, $action);
    }
}
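The comparison logic can be tried without a database. The sketch below assumes XML::DOM is installed and replaces the rules table with a hypothetical in-memory list that mirrors its columns; the quote values and the `first_text` helper are also made up for illustration. Note that Perl's numeric comparison treats a string such as "+4" as the number 4, which is what lets the change field be compared directly.

```perl
use strict;
use warnings;
use XML::DOM;

# Hypothetical rules, standing in for rows fetched from the rules table.
my @rules = (
    { symbol => 'MSFT', field => 'volume', value => 65000000, action => 'sell' },
    { symbol => 'CSCO', field => 'change', value => 3,        action => 'sell' },
);

# Hypothetical quotes; only text-valued fields are used here.
my $xml = '<stock_quotes>'
    . '<stock_quote><symbol>MSFT</symbol><volume>64282200</volume><change>+1</change></stock_quote>'
    . '<stock_quote><symbol>CSCO</symbol><volume>70000000</volume><change>+4</change></stock_quote>'
    . '</stock_quotes>';

# Return the text of the first $tag element under $elem, or undef if absent.
# (first_text is a hypothetical helper written for this sketch.)
sub first_text {
    my ($elem, $tag) = @_;
    my $node = $elem->getElementsByTagName($tag)->item(0);
    return undef unless $node && $node->getFirstChild;
    return $node->getFirstChild->getData;
}

my $doc = XML::DOM::Parser->new->parse($xml);
my @actions;
for my $quote ($doc->getElementsByTagName('stock_quote')) {
    my $symbol = first_text($quote, 'symbol');
    for my $rule (grep { $_->{symbol} eq $symbol } @rules) {
        my $from_xml = first_text($quote, $rule->{field});
        next unless defined $from_xml;    # field missing from this quote
        push @actions, "$rule->{action} $symbol" if $from_xml > $rule->{value};
    }
}
print "$_\n" for @actions;    # prints "sell CSCO"
$doc->dispose;
```

Only the CSCO rule fires: the MSFT volume of 64282200 is below its threshold, while the CSCO change of +4 exceeds 3.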
Thus you have the basic logic of the trading application implemented. You need to add a special case for the price tag, since its value is stored not as the text contents of the tag but rather as an attribute:
if ($field eq "price") {
    my @ask_price_nodes = $stock_quote->xql("price[\@type='ask']");
    my $ask_price_node = $ask_price_nodes[0];
    $value_from_xml = $ask_price_node->getAttribute('value');
}
Each stock quote has several price elements, specifying the ask price, opening price, and high and low price for the day. Use XQL to find the price tag within the current stock_quote that has the attribute type set to ask. In other words, look for the asking price of the stock. Note that using XQL simplifies the code significantly; you don't have to loop through the price nodes, examining the type attribute of each.
The complete program, available as a separate listing, operates on XML-formatted stock quotes, which can be generated with the live XML stock quote server available from the XML Today Web site (see Resources). Running the program with our set of trading rules and our XML-formatted stock quote produces the following output:
pl domactive.pl stocks.xml
** Dealing with IBM: **************
Handling rule: if (price > 100) take action sell.
Value for price from xml value = 109.1875
Rule "price > 100" applies for IBM
Taking action "sell" on stock "IBM" .
** Dealing with MSFT: **************
Handling rule: if (volume > 65000000) take action buy.
Value for volume from xml value = 64282200
Rule "volume > 65000000" does not apply for MSFT, no action taken.
** Dealing with CSCO: **************
Handling rule: if (change > 3) take action sell.
Value for change from xml value = +2
Rule "change > 3" does not apply for CSCO, no action taken.
** Dealing with INTC: **************
** Dealing with ORCL: **************
** Dealing with SUNW: **************
Summing up
Using tree-based methodologies for handling XML files can simplify programming
and enable random access to data where stream-based methods offer only
linear access. Powerful querying capabilities, such as XQL, allow you to
quickly find the segments of information you are interested in, without
explicitly traversing the entire document. These powerful tools, the examples
above, and the wealth of documentation available on the Web and in print
empower you to build your XML-based applications quickly and simply, today.
Download
| Description | Name | Size | Download method |
|---|---|---|---|
| Sample code for article | xml-perl2-code.zip | 2KB | HTTP |
Resources
- The first installment of this series, "XML and scripting languages," provides background on processing XML with Perl and scripting languages and presents event-based models.
- The Document Object Model (DOM) Level 1 Specification provides a standard set of objects for representing HTML and XML documents, and a standard interface for accessing and manipulating them.
- XML::DOM is a Perl extension to XML::Parser to build an Object Oriented data structure with a DOM-Level-1-compliant interface.
- XML::Twig is a tree interface to XML documents allowing chunk-by-chunk processing of huge documents.
- "Processing XML efficiently with Perl and XML::Twig" is a comprehensive tutorial on using XML::Twig.
- XML::Grove provides simple access to the information set of parsed XML, HTML, or SGML instances using a tree of Perl hashes.
- "Groves: an illustrated example" discusses Groves and the concepts behind them.
- "An Introduction to Groves" explains the inner workings of Groves and why they are important.
- XML::Simple is a trivial API for reading and writing XML.
- "Ways to Rome" presents how to perform the same task using a wide range of available XML modules.
- MySQL Database is a free SQL database available for most operating systems, including most flavors of UNIX, as well as Windows and OS/2.
- "High Performance Web Applications using Perl, XML, and Databases" discusses issues and techniques for building high-performance Web applications using Perl, XML, and databases.
About the author
Parand Tony Darugar is the head of architecture for Yahoo! Search Marketing Services (formerly Overture). His interests include Web services and Service Oriented Architectures (SOA), XML, high-performance business systems, distributed architectures, and artificial intelligence. You can reach him at tdarugar@yahoo.com.