Effective XML processing with DOM and XPath in Perl

Based on an analysis of several large XML projects, this article examines how to make effective and efficient use of DOM. Developer/author Tony Darugar provides a set of usage patterns and a library of functions to make DOM robust and easy to use. Though the DOM offers a flexible and powerful means for creating, processing, and manipulating XML documents, some aspects of DOM make it awkward to use and can lead to brittle and buggy code. This article suggests ways to avoid the pitfalls. Perl code samples demonstrate the techniques.

Parand Darugar (tdarugar@yahoo.com), Head of architecture, Yahoo! Search Marketing Services

Parand Tony Darugar is the head of architecture for Yahoo! Search Marketing Services (formerly Overture). His interests include Web services and Service Oriented Architectures (SOA), XML, high-performance business systems, distributed architectures, and artificial intelligence. You can reach him at tdarugar@yahoo.com.



01 October 2001


The Document Object Model (DOM) is a platform- and language-neutral interface for dynamically accessing and updating the content, structure, and style of XML documents. DOM defines a standard set of interfaces for representing documents, a standard model of how these objects can be combined, and a standard set of methods for accessing and manipulating them. DOM is a W3C Recommendation, which makes it a recognized Web standard. Implementations are available for a wide variety of languages, including Perl, C, C++, Java, Tcl, and Python.

As I'll demonstrate in this article, DOM is an excellent choice for XML handling when stream-based models (such as SAX) are not sufficient. Unfortunately, several aspects of the specification, such as its language-neutral interface and its use of the "everything-is-a-node" abstraction, make it difficult to use and prone to generating brittle code. This was particularly evident in my company's recent review of several large DOM projects that were developed by a variety of developers over the past year. The common problems, and their remedies, are discussed below.

Exploring the DOM

The DOM specification is designed to be usable with any programming language. Therefore, it attempts to use a common, core set of features which are available in all languages. The DOM specification also attempts to remain neutral in its interface definitions. Because of this, Perl programmers can apply their DOM knowledge when working with Java, and vice versa.

The specification also treats every part of the document as a node consisting of a type and a value. This provides an elegant conceptual framework for dealing with all aspects of the document. As an example, the following XML fragment

<paragraph align="left">the <it>Italicized</it> portion.</paragraph>

is represented via the following DOM structure:

Figure 1: DOM representation of an XML document

Each of the Document, Element, Text, and Attr pieces of the tree is a DOM Node.
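To make the node structure concrete, here is a minimal sketch that parses the fragment above and lists the nodes it contains. It assumes the XML::DOM module from CPAN. The describe_tree helper and its output format are hypothetical and are not part of any DOM implementation.

```perl
use strict;
use warnings;
use XML::DOM;   # also exports the node-type constants (ELEMENT_NODE, TEXT_NODE, ...)

# Human-readable labels for the node types that appear in this example.
my %TYPE_LABEL = (
    DOCUMENT_NODE() => 'Document',
    ELEMENT_NODE()  => 'Element',
    TEXT_NODE()     => 'Text',
);

# describe_tree is a hypothetical helper: it returns one indented line per node.
sub describe_tree {
    my ($node, $depth) = @_;
    $depth ||= 0;
    my $indent = '  ' x $depth;
    my $label  = $TYPE_LABEL{ $node->getNodeType() } || 'Other';
    my @lines  = ( "$indent$label: " . $node->getNodeName() );

    # Attributes are Attr nodes held in a NamedNodeMap, not ordinary children.
    my $attrs = $node->getAttributes();
    if ( defined $attrs ) {
        for my $i ( 0 .. $attrs->getLength() - 1 ) {
            my $attr = $attrs->item($i);
            push @lines, "$indent  Attr: " . $attr->getName() . '=' . $attr->getValue();
        }
    }

    # Recurse into the children one level deeper.
    for my $child ( $node->getChildNodes() ) {
        push @lines, describe_tree( $child, $depth + 1 );
    }
    return @lines;
}

my $parser = XML::DOM::Parser->new();
my $doc    = $parser->parse(
    '<paragraph align="left">the <it>Italicized</it> portion.</paragraph>'
);
print "$_\n" for describe_tree($doc);
$doc->dispose();   # XML::DOM trees contain circular references
```

The listing should show the text before, inside, and after the it element as three separate Text nodes, which is the detail that matters in the sections that follow.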

Design issues

The downside of DOM's language neutrality is that the methodologies and patterns that are normally used in each programming language cannot be employed. For example, the attributes of an XML node would naturally be represented in Perl as a hash, since they are a set of unique name-value pairs. With DOM, however, they are represented as a set of nodes, and the value of each is accessed via a separate function call. Instead of using a simple hash, the programmer must learn to use a number of new data structures and access methods. These minor inconveniences add up to unusual coding practices and an increase in lines of code. They also force the programmer to learn the DOM method of doing things in place of the way she would handle it intuitively.
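As an illustration of the gap between the two styles, the following sketch copies an element's attributes into an ordinary Perl hash. It assumes the XML::DOM module. The attributes_as_hash function is a hypothetical convenience wrapper, not something the DOM provides.

```perl
use strict;
use warnings;
use XML::DOM;

# attributes_as_hash is a hypothetical helper, not part of the DOM API.
# It copies an element's attributes into a plain Perl hash.
sub attributes_as_hash {
    my ($element) = @_;
    my %attrs;
    return %attrs unless defined $element;

    my $map = $element->getAttributes();    # a NamedNodeMap for elements
    return %attrs unless defined $map;

    for my $i ( 0 .. $map->getLength() - 1 ) {
        my $attr = $map->item($i);           # each entry is an Attr node
        $attrs{ $attr->getName() } = $attr->getValue();
    }
    return %attrs;
}

my $doc   = XML::DOM::Parser->new()->parse('<paragraph align="left" class="intro"/>');
my %attrs = attributes_as_hash( $doc->getDocumentElement() );
print "$_ = $attrs{$_}\n" for sort keys %attrs;
$doc->dispose();
```

Once the attributes are in a hash, the usual Perl idioms (exists, keys, hash slices) apply, at the cost of losing the live connection to the document.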

The everything-is-a-node abstraction, while quite elegant, leads to awkward coding situations, such as the attribute node example above. This also occurs when accessing the value contained within an XML tag. Consider the XML fragment: <tagname>Value</tagname>. You may think the text value would be accessible by calling a getValue or similar method on the tagname node. In fact, the text is treated as one or more child nodes under the tagname node. Thus, in order to get the text value, you need to traverse the children of tagname, collating them into a string. There is good reason for this: tagname may contain other embedded XML tags. If tagname does contain embedded XML tags, getting its text value makes less sense. In the real world, however, we have seen very frequent coding errors caused by this lack of convenient functions.

The everything-is-a-node abstraction also loses some value because of the number of node types that exist and because of the lack of uniformity present in their access methods. For example, the insertData method is used to set the value of CharacterData nodes, while the value of Attr (attribute) nodes is set by direct access to a value field. By presenting different interfaces for the different nodes, the uniformity and elegance of the model is diminished, and the learning curve is increased.


Common coding problems

An analysis of several large XML projects revealed some common problems in working with the DOM. A few of these are presented below.

Code bloat

In all of the projects that we looked at in our review, an overarching problem presented itself: it took many lines of code to do simple things. In one example, 16 lines of code were used to check the value of an attribute, yet the same task, with improved robustness and error handling, can be accomplished in three lines. Three factors contributed to the bloat: the low-level nature of the DOM API, incorrect application of methods and programming patterns, and lack of knowledge of the full API. The following sections present specific instances of these issues.

Traversing the DOM

In the code we examined, the most common task was to traverse or search the DOM. Here is a condensed version of the code required to find a node called "header" under the config section of the document:

my $document_root = $dom_document->getDocumentElement();
my $config_node   = $document_root->getFirstChild();
foreach my $node ( $config_node->getChildNodes() ) {
  if ( $node->getNodeName() eq "header" ) {
    # do something
  }
}

The document is traversed from the root by getting the top element, getting its first child (config_node), and finally examining each of config_node's children. Unfortunately, this approach is not only verbose but also fragile and prone to bugs.

As an example, the second line of the code gets the intermediate node using the getFirstChild method. Already, a multitude of potential problems exist. The first child of the root node may not actually be the config_node the user is searching for. By blindly following the first child, we have ignored the actual name of the tag and will potentially be searching the incorrect part of the document. A frequent error in this scenario occurs when the source XML document contains whitespace or a carriage return after the root element's opening tag; the first child of the root node is then a DOM::Text node, not the intended node. To correctly navigate to our intended node, we need to examine each of document_root's child nodes until we find one that is not a Text node and that has the name we are looking for.

We are also ignoring the possibility that the document may have a different structure from what we are expecting. If the document_root doesn't have any child nodes, for example, config_node will be set to undef, and the third line of the example will raise an error. Therefore, to properly navigate the document, not only do we have to examine each child node individually and check for the appropriate name, but at every step we also have to check to make sure each method call returned a valid value. Writing robust, error-free code that can handle arbitrary input requires both a great deal of attention to detail and many lines of code.
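The checks described above can be sketched as a small function. This is a minimal illustration assuming the XML::DOM module. The find_child_element name and the sample document are hypothetical.

```perl
use strict;
use warnings;
use XML::DOM;

# find_child_element is a hypothetical helper. It returns the first direct
# child of $parent that is an element with the given tag name, or undef.
sub find_child_element {
    my ($parent, $name) = @_;
    return undef unless defined $parent && defined $name;

    for my $child ( $parent->getChildNodes() ) {
        next unless $child->getNodeType() == ELEMENT_NODE;   # skip whitespace Text nodes
        return $child if $child->getNodeName() eq $name;
    }
    return undef;
}

my $xml = <<'XML';
<root>
  <config>
    <header>settings</header>
  </config>
</root>
XML

my $doc    = XML::DOM::Parser->new()->parse($xml);
my $config = find_child_element( $doc->getDocumentElement(), 'config' );
my $header = find_child_element( $config, 'header' );   # safe even if $config is undef
print defined $header ? "found header\n" : "no header\n";
$doc->dispose();
```

Because the helper tolerates an undefined parent, lookups can be chained without a check after every step. Even so, each level of the document still costs a function call, which is the verbosity the rest of this article tries to remove.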

Retrieving the text value within a tag

After DOM traversal, the second most common task was to retrieve the text value contained in a tag. Consider the XML fragment <sometag>The Value</sometag>. Having navigated our way to the sometag node, how do we capture its text value (The Value)? An intuitive implementation may be:

$sometag->getData();

As you may have guessed, the above code will not perform the desired action. We cannot call a getData or a similar function on the sometag node because the actual text is stored as one or more child nodes. A better approach would be:

$sometag->getFirstChild()->getData();

The problem here is that the value may not actually be contained in the first child; processing instructions or other embedded nodes may be found within sometag, or the text value may be contained in several child nodes instead of in just one. Recall that whitespace is frequently represented as a text node, so the call to $sometag->getFirstChild() may get you only the carriage return between the tag and its value. In fact, we need to traverse all of the children, checking for nodes of type Text, and collating their values until we have the complete value.
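A short sketch shows the difference between reading only the first child and collating all text children. It assumes the XML::DOM module. The direct_text helper and the sample fragment (which adds a nested b element) are hypothetical.

```perl
use strict;
use warnings;
use XML::DOM;

# direct_text is a hypothetical helper: it joins the data of every Text
# child of $element and ignores nested elements.
sub direct_text {
    my ($element) = @_;
    return join '',
        map  { $_->getData() }
        grep { $_->getNodeType() == TEXT_NODE }
        $element->getChildNodes();
}

my $doc     = XML::DOM::Parser->new()->parse('<sometag>The <b>real</b> Value</sometag>');
my $sometag = $doc->getDocumentElement();

my $first_only = $sometag->getFirstChild()->getData();   # only the text before <b>
my $collated   = direct_text($sometag);                  # text before and after <b>

print "first child: [$first_only]\n";
print "collated:    [$collated]\n";
$doc->dispose();
```

Note that the collated value deliberately leaves out the text inside the nested element; whether that is the right answer depends on the document format being processed.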

getElementsByTagName

The DOM interface includes a method for finding nodes with a given name. For example, the call:

my @results = $document_root->getElementsByTagName("name");

will return an array (or a NodeList) of tags called name from within the document. This is certainly more convenient than the traversal methods we discussed above. It is also the cause of a common set of bugs.

The problem is that getElementsByTagName recursively traverses the document, returning all matching nodes. Suppose you have a document containing customer information, company information, and product information. All three of these items can potentially have a name tag within them. If you call getElementsByTagName to search for customer names and end up with product and company names as well, your program will likely misbehave. Calling the function on a subtree of the document can diminish the risks. However, XML's flexible nature makes it quite difficult to ensure that the subtree you are operating on has the structure you are expecting and doesn't have spurious child nodes with the name you are searching for.
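The following sketch reproduces the problem on a small document and shows one way to narrow the result by checking each match's parent. It assumes the XML::DOM module. The names_with_parent helper and the sample data are hypothetical.

```perl
use strict;
use warnings;
use XML::DOM;

# names_with_parent is a hypothetical helper: it keeps only the <name>
# elements whose immediate parent has the given tag name.
sub names_with_parent {
    my ($root, $parent_name) = @_;
    return grep { $_->getParentNode()->getNodeName() eq $parent_name }
           $root->getElementsByTagName('name');
}

my $xml = '<order>'
        . '<customer><name>Ada</name></customer>'
        . '<company><name>Acme</name></company>'
        . '<product><name>Widget</name></product>'
        . '</order>';

my $doc  = XML::DOM::Parser->new()->parse($xml);
my $root = $doc->getDocumentElement();

my @all      = $root->getElementsByTagName('name');    # every <name> in the tree
my @customer = names_with_parent( $root, 'customer' );

print scalar(@all),      " name elements in total\n";
print scalar(@customer), " name element(s) under customer\n";
$doc->dispose();
```

Filtering on the parent helps here, but it is still hand-written structure checking, and it has to be repeated for every query.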


Effective use of the DOM

Given the limitations imposed by DOM's design constraints, how can you use the specification effectively and efficiently? We present a few basic principles and guidelines for DOM usage, and create a library of functions to make life easier.

Basic principles

Your experience using DOM will be significantly improved if you follow a few basic principles:

  • Do not use DOM to traverse the document
  • Whenever possible, use XPath to find nodes or traverse the document
  • Use a library of higher-level functions to make DOM use easier

These principles are derived directly from examination of common problems. DOM traversal, as discussed above, is a leading cause of errors. It is also, however, one of the most commonly needed functionalities. How do we traverse the document without using the DOM?

XPath

XPath is a language for addressing, searching, and matching pieces of the document. It is a W3C Recommendation, which makes it an accepted standard, and it is implemented in most languages and XML packages. Chances are your DOM package supports XPath either directly or via an add-on.

XPath provides an excellent means by which to traverse and search the document. It uses a path notation, similar to that used in file systems and URLs, to specify and match pieces of the document. For example, the XPath: /x/y/z searches the document for a root node of x, under which resides the node y, under which resides the node z. This statement returns all nodes that match the specified path structure.

More complex matches are possible, based both on the structure of the document and on the values of the nodes and their attributes. The statement /x/y/* returns all nodes directly under any node y whose parent is x. The statement /x/y[@name='a'] matches all nodes y that have a parent x and an attribute called name with the value a.
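The three expressions above can be tried out with a few lines of Perl. This sketch assumes the XML::DOM module together with XML::XQL and XML::XQL::DOM, which add an xql method to DOM nodes. XQL is a query language with XPath-like path syntax rather than a full XPath implementation, so treat the queries as an illustration that may need adjusting for other XPath engines. The sample document is hypothetical.

```perl
use strict;
use warnings;
use XML::DOM;
use XML::XQL;
use XML::XQL::DOM;   # adds the xql() method to XML::DOM nodes

my $doc = XML::DOM::Parser->new()->parse(
    '<x><y name="a"><z>1</z><z>2</z></y><y name="b"><z>3</z></y></x>'
);

# Every z that sits under a y under the root x.
my @z_nodes = $doc->xql('/x/y/z');

# Every element directly under any /x/y, whatever its name.
my @any_nodes = $doc->xql('/x/y/*');

# Only the y elements whose name attribute equals 'a'.
my @y_named_a = $doc->xql(q{/x/y[@name='a']});

print scalar(@z_nodes),   " z nodes\n";
print scalar(@any_nodes), " nodes under y\n";
print scalar(@y_named_a), " y node(s) with name='a'\n";
$doc->dispose();
```

Each query replaces a hand-written loop with node-type and name checks at every level of the path.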

A full examination of XPath and its usage is beyond the scope of this article. See Resources for links to some excellent tutorials. Take a little time to learn XPath, and you will be rewarded with much easier handling of XML documents.


Library of functions

One of the surprising aspects of our examination of the DOM projects was the amount of copy-and-paste code that was present. Pieces of code from one file would be copied and pasted into many others to implement similar pieces of functionality. Why would experienced developers who otherwise employ good programming practices engage in copy-and-paste methods instead of creating helper libraries? We believe this is because most programmers are not DOM experts, and they will happily grab the first piece of code that does what they need. They do not feel confident enough in their DOM skills to produce the canonical functions that make up the helper library.

It is quite easy to create and use helper libraries to implement common functionalities; it only requires a small amount of discipline. Below are some basic helper functions that will get you started.

getValue

The most commonly performed action when working with XML documents is looking up the value of a given node. As discussed above, this can present difficulties both in traversing the document to find the desired node and in retrieving the value of the node. The traversal can be simplified using XPath, and the retrieval of the value can be coded once and then reused. We have implemented the getValue function with the help of two lower-level functions: findNode, which finds and returns the first node that matches the given XPath expression, and getTextContents, which non-recursively returns the concatenated values of the text nodes under the passed-in node, as shown in Listing 2.

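# Requires "use XML::DOM;", which exports the TEXT_NODE and
# CDATA_SECTION_NODE constants.
sub getTextContents {
  my ($node, $strip) = @_;
  my $contents = '';

  if (! $node )
  {
    return;
  }
  # Collect only the direct text children; comments, processing
  # instructions, and nested elements are skipped.
  for my $child ($node->getChildNodes()) {
    my $type = $child->getNodeType();
    if ( $type == TEXT_NODE || $type == CDATA_SECTION_NODE ) {
      $contents .= $child->getData();
    }
  }

  if ($strip) {
    $contents =~ s/^\s+//;
    $contents =~ s/\s+$//;
  }

  return $contents;
}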

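# The xql method is added to DOM nodes by "use XML::XQL; use XML::XQL::DOM;".
sub findNode {
  my ($node, $xpath) = @_;
  if (! defined($node) || ! defined($xpath) )
  {
    return undef;
  }
  # Keep only the first match; later matches are ignored.
  my $match = ($node->xql($xpath))[0];
  if (! $match )
  {
    return undef;
  }
  return $match;
}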

sub getValue {
  my ($node, $xpath) = @_;
  my $match = findNode( $node, $xpath );
  if (! defined($match) )
  {
    return undef;
  }
  return getTextContents( $match );
}

getValue is called by passing in both a node from which to start the search, and an XPath statement that specifies the node we're searching for. The function finds the first node to match the given XPath and extracts its text value.

setValue

Another common action is to set the value of a node to a desired value, as shown in Listing 3.

sub setValue {
  my ($node, $xpath, $value) = @_;
  my $match = findNode( $node, $xpath );
  if (! defined($match) )
  {
    return undef;
  }
  
  foreach my $child ( $match->getChildNodes() ) 
  {
    $match->removeChild ($child);
  }
  $match->addText($value);
  return $match;
}

This function takes a starting node and an XPath statement -- just like getValue -- and a string to set the value of the matching node to. It finds the desired node using findNode, removes all of its children (thereby removing any text and other elements contained within it), and sets its text contents to the passed-in string.

appendNode

While some programs look up and modify the values contained in XML documents, others modify the structure of the document itself by adding and removing nodes. This helper function simplifies the addition of a node to the document, as shown in Listing 4.

sub appendNode {
  my ($doc, $nodename, $xpath, $value) = @_;

  if (! defined($nodename) || ($nodename eq "") ) {
    return undef;
  }

  my $match = findNode( $doc, $xpath );
  if (! defined($match) )
  {
    return undef;
  }

  my $newnode;
  eval {
    $newnode = $doc->createElement( $nodename );
  };

  if ($@ || (! defined($newnode) )) {
    return undef;
  }
  
  $match->appendChild( $newnode );
  
  if ( defined($value) ) {
    $newnode->addText($value);
  }

  return $newnode;
}

The parameters to this function are the DOM document, the name of the node to add, the XPath statement specifying the node to add it under (that is, what the parent node of the new node is), and, optionally, the text value of the node. The new node is appended to the specified parent node, and its value is set to the passed-in string.

copySubTree

Copying a section of a document into another location or document, while not a very common operation, was the cause of much confusion and gave rise to various inventive copy procedures. As Listing 5 illustrates, it is, in fact, fairly simple to implement.

sub copySubTree
{
  my ($sourcenode, $destnode) = @_;

  my $copy_node =  $sourcenode->cloneNode(1);
  if ( $sourcenode->getOwnerDocument() ne $destnode->getOwnerDocument() ) 
  {
    $copy_node->setOwnerDocument( $destnode->getOwnerDocument() );
  }
  $destnode->appendChild($copy_node);
  return $copy_node;
}

This function takes the source node and copies it over as a child under the destination node. The destination node may be in another document, in which case the subtree is copied between documents.
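A usage sketch for the cross-document case follows. It assumes the XML::DOM module, whose setOwnerDocument method is an extension beyond the W3C DOM. The copy_between_documents name and the two sample documents are hypothetical. The body repeats the clone, re-own, append sequence of copySubTree so that the example runs on its own.

```perl
use strict;
use warnings;
use XML::DOM;

# copy_between_documents is a hypothetical name; the body follows the same
# clone, re-own, append sequence as copySubTree.
sub copy_between_documents {
    my ($source_node, $dest_node) = @_;

    my $copy     = $source_node->cloneNode(1);        # 1 = deep copy
    my $dest_doc = $dest_node->getOwnerDocument();

    # The clone still belongs to the source document; hand it over first.
    if ( $copy->getOwnerDocument() != $dest_doc ) {
        $copy->setOwnerDocument($dest_doc);           # XML::DOM extension
    }
    $dest_node->appendChild($copy);
    return $copy;
}

my $source_doc = XML::DOM::Parser->new()->parse('<catalog><item id="1">Widget</item></catalog>');
my $dest_doc   = XML::DOM::Parser->new()->parse('<basket/>');

my ($item) = $source_doc->getElementsByTagName('item');
copy_between_documents( $item, $dest_doc->getDocumentElement() );

print $dest_doc->getDocumentElement()->toString(), "\n";
$source_doc->dispose();
$dest_doc->dispose();
```

Because the node is cloned before it is appended, the source document is left unchanged.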


Conclusion

The DOM has been maligned as a difficult and nonintuitive way of manipulating XML documents. In fact, it forms a very effective base on which easy-to-use systems can be built by following a few simple principles. DOM has already been implemented and optimized on most platforms, and is a very good choice for applications that need to search and manipulate XML documents in complex processes.

Resources
