Skip to main content

Fast incremental updates of XML records

Classic UNIX utilities offer help for a modern problem

Uche Ogbuji (uche@ogbuji.net), Principal, Zepheira
Photo of Uche Ogbuji
Uche Ogbuji is a partner at Zepheira, LLC, a solutions firm specializing in the next generation of Web technologies. Mr. Ogbuji is lead developer of 4Suite, an open source platform for XML, RDF, and knowledge-management applications, and lead developer of the Versa RDF query language. He is a computer engineer and writer born in Nigeria, living and working in Boulder, Colorado. You can find more about Mr. Ogbuji at his Weblog, Copia.

Summary:  XML is often used today as a data export and exchange format. In such cases, you might deal with a feed of XML records; sometimes, if this feed, is too long, there are performance problems importing it into another system. As such, you might want to produce only an incremental feed—that is, one that only includes items that have changed. This article presents a collection of simple techniques that you can combine into a system for more digestible feeds containing only updated records.

Date:  28 Aug 2007
Level:  Intermediate
Activity:  1945 views

XML is a very popular format for exchanging collections of records. You can export these records from a relational database, or they can be in formats such as Atom, which is structured around a collection of entry elements. A common architectural pattern is to synchronize data sets by having one system export a set of records to another; this export is often in the form of a large XML file that contains the entire record set. I can certainly say a lot about better ways to approach such an architecture, but in reality this is a pattern seen over and over again, and sometimes you just have to accept it and try to make it as efficient as possible.

Such systems have some common efficiency problems.

  • The XML exports can be so large that they use up excessive bandwidth in transmission
  • For large files, the processing needed to validate and import the XML takes a long time

In this article, I suggest a simple batch of techniques to address these problems. It is not a one-stop recipe for solving problems like these in all cases, but if you putt together the ideas in this article, you should be able to assemble a solution for your particular needs.

A closer look at the problem space

You should always be quick to look to several decades of experience when solving such problems. The crux of the techniques presented in this article follows the lines of the age-old diff and patch utilities well known in UNIX®. diff is a utility that compares two files (or sets of files) and reports the differences in a standard format. patch can read this standard format and apply the represented updates to some other file. If you use almost any revision control system, such as CVS or Subversion, you use a system that builds on diff and patch for efficient communication of updates. XML brings some complexities to the matter, as I shall discuss, but if you pay attention to classic tools, you can learn important lessons about how to proceed, and in some cases you might make modest adjustments and use such tools directly in XML processing.

Narrowing the scope

In this article, I'll focus on XML with particular characteristics:

  • The root element serves as an envelope whose children are a series of record elements
  • Each record element has a unique ID attribute or child element
  • Within each record is a consistent order of elements

The last requirement might seem stringent, but it doesn't necessarily mean that your schema must mandate the order. In practice, incremental updates usually involve comparison of successive export files from the same process, and in such scenarios, matters such as the order of elements within records tend to be consistent. In the worst case, if the schema allows arbitrary order, and you don't want to rely on the order in the actual exports, you can process the XML to impose an order, although I don't cover that particular technique in this article.

The sample files

Listing 1 (contacts.xml) is the main example for this article. It's a simple address book format with a contacts envelope and each record in a contact element, each of which has a unique id attribute.


Listing 1. contacts.xml: Example of XML comprising a series of records
                
<?xml version="1.0" encoding="utf-8"?>
<contacts>
  <contact id='ep' added="2003-06-10">
    <name>Ezra Pound</name>
    <address>
      <street>45 Usura Place</street>
      <city>Hailey</city>
      <province>ID</province>
    </address>
  </contact>
  <contact id='tse' added="2003-06-20">
    <name>Thomas Eliot</name>
    <address>
      <street>3 Prufrock Lane</street>
      <city>Stamford</city>
      <province>CT</province>
    </address>
  </contact>
  <contact id="co" added="2004-11-15">
    <name>Christopher Okigbo</name>
    <address>
      <street>7 Heaven's Gate</street>
      <city>Idoto</city>
      <province>Anambra</province>
    </address>
  </contact>
</contacts>

    

For an example scenario, imagine that the system that exported Listing 1 performs some processing, and then exports an updated version of the records into a file called contacts2.xml, illustrated in Listing 2.


Listing 2. contacts2.xml: An incrementally updated version of Listing 1
                
<contacts>
  <contact id='ep' added='2003-06-10'>
    <name>Ezra Pound</name>
    <address>
      <street>45 Usura Place</street>
      <city>Hailey</city>
      <province>ID</province>
    </address>
  </contact>
  <contact id='tse' added='2003-06-20'>
    <name>Thomas Eliot</name>
    <address>
      <street>3 Prufrock Lane</street>
      <city>Stamford</city>
      <province>CT</province>
    </address>
  </contact>
  <contact id='lh' added='2004-11-01'>
    <name>Langston Hughes</name>
    <address>
      <street>10 Bridge Tunnel</street>
      <city>Harlem</city>
      <province>NY</province>
    </address>
  </contact>
  <contact added='2004-11-15' id='co'>
    <name>Christopher Okigbo</name>
    <address>
      <street>7 Heaven's Gate</street>
      <city>Nsukka</city>
      <province>Anambra</province>
    </address>
  </contact>
</contacts>

    

A closer look at the diff

It's instructive to use the standard diff utility on Listings 1 and 2. To do this, enter diff -u contacts.xml contacts2.xml on a UNIX or Mac OS X command line. Listing 3 shows the output, with line numbers added for reference.


Listing 3. Difference between Listings 1 and 2 (UNIX diff utility universal format, with line numbers added for reference)
                
     1 --- contacts.xml        2007-06-29 09:50:53.000000000 -0600
     2 +++ contacts2.xml       2007-06-29 09:50:50.000000000 -0600
     3 @@ -1,6 +1,5 @@
     4 -<?xml version="1.0" encoding="utf-8"?>
     5  <contacts>
     6 -  <contact id='ep' added="2003-06-10">
     7 +  <contact id='ep' added='2003-06-10'>
     8      <name>Ezra Pound</name>
     9      <address>
    10        <street>45 Usura Place</street>
    11 @@ -8,7 +7,7 @@
    12        <province>ID</province>
    13      </address>
    14    </contact>
    15 -  <contact id='tse' added="2003-06-20">
    16 +  <contact id='tse' added='2003-06-20'>
    17      <name>Thomas Eliot</name>
    18      <address>
    19        <street>3 Prufrock Lane</street>
    20 @@ -16,11 +15,19 @@
    21        <province>CT</province>
    22      </address>
    23    </contact>
    24 -  <contact id="co" added="2004-11-15">
    25 +  <contact id='lh' added='2004-11-01'>
    26 +    <name>Langston Hughes</name>
    27 +    <address>
    28 +      <street>10 Bridge Tunnel</street>
    29 +      <city>Harlem</city>
    30 +      <province>NY</province>
    31 +    </address>
    32 +  </contact>
    33 +  <contact added='2004-11-15' id='co'>
    34      <name>Christopher Okigbo</name>
    35      <address>
    36        <street>7 Heaven's Gate</street>
    37 -      <city>Idoto</city>
    38 +      <city>Nsukka</city>
    39        <province>Anambra</province>
    40      </address>
    41    </contact>    
    

Before I go further, here are a few notes for those of you who might not be familiar with the diff format. The first two lines specify the name, location, and most recently updated timestamp of the original and updated files. Lines 3, 11, and 20 each start a chunk of difference annotations, and specify the span of line numbers for each chunk from the source and updated files. Lines that begin with a plus sign (+) were found in the updated file, but not the original, while lines that begin with a minutes sign (-) are present in the original, but not found in the updated file. If you examine these differences carefully, you will find some significant changes and some insignificant ones. The significant changes are:

  • A record was added in the updated XML (lines 25 through 32 of the diff)
  • The city name changed in the record with the ID co (lines 24 and 33 of the diff)

The insignificant changes are:

  • An XML declaration is missing in the updated file
  • The two files use quotation marks differently
  • The attributes in one record were reordered (lines 37 and 38 of the diff)

You'll want to ignore these latter differences in XML processing, but general-purpose diff tools bundle them together with the important changes, which complicates things. An important tool to address this complication is Canonical XML, a standard form of XML that normalizes inconsequential differences. (See Resources for a link to a separate developerWorks article on this topic.) You can also gain an effect similar to canonicalization when you stick to consistent tools for exporting files. For example, attributes always use double quotes, or the file always includes an XML declaration.


Divide and conquer

The analysis of Listing 3 shows that an incremental update file will comprise two records: the new lh record and the updated co record. Some processing is needed to screen out the inconsequential changes to find this. A useful technique is to break the large XML file into a collection of individual records files for this screening and related processing. In XML processing, performance often degrades in a nonlinear fashion as the size of the file increases. By breaking a large file into smaller files, you can usually lower CPU and memory utilization. But you do have to be careful not to negate that savings by, say, launching a new process for each file, which would incur the overhead of runtime initialization over and over again.

Of the many ways to break up large XML files, you want to be sure you use a streaming process to do it. In other words, don't use a process that loads the entire initial XML file into memory at once. Instead, use one that loads in one chunk at a time, discarding the memory used for chunks you are done with. Most tools and APIs for streaming XML processing will work for this purpose. For the Java™ platform, try the Streaming API for XML (StAX) or a specialized language similar to XSLT called Streaming Transformations for XML (STX). My favorite language, Python, has pushdom, which comes bundled in the standard library; I wrote a tool for the purpose, called Amara Pushbind, part of the Amara XML Toolkit (see Resources). Listing 4 is Amara Pushbind code that takes the original and the incrementally updated XML file and breaks them into smaller files on disk. I don't think you'll need to know Python to understand this program, and I've added comments to elucidate the one or two tricky spots.


Listing 4. splitter.py: Python code to split XML files into multiple records files
                
FILE1 = 'contacts.xml'
FILE2 = 'contacts2.xml'

import os
import amara

output_dir = 'before'
os.mkdir(output_dir)
for record in amara.pushbind(FILE1, u'contact'):
    #combine the subdirectory name with the id attribute
    #from the record to specify the output file
    filepath = os.path.join(output_dir, record.id)
    #open the file, write the XML just for that record, then close the file
    f = open(filepath, 'w')
    f.write(record.xml())
    f.close()
    print 'Wrote file for record', filepath

#almost a duplicate of the above logic, but instead splitting FILE2
#into an "after" subdirectory
output_dir = 'after'
os.mkdir(output_dir)
for record in amara.pushbind(FILE2, u'contact'):
    filepath = os.path.join(output_dir, record.id)
    f = open(filepath, 'w')
    f.write(record.xml())
    f.close()
    print 'Wrote file for record', filepath
    

You should be able to easily translate this program to StAX, STX, plain old SAX code, or any other streaming API that catches your fancy.


Bringing it on home

If you run Listing 3 or its equivalent, you end up with sibling "before" and "after" directories with the XML records split from Listings 1 and 2, respectively. Many XML tools, including Amara, will normalize details such as attribute ordering and quoting for you. If your tools don't, you can probably add a processing step to convert to canonical XML, preferably using a streaming approach. At this point, things are set up so you can go back to the old standby diff to narrow things down to just the records that have changed. Running diff -Nru before after on the command line, you should get output to match Listing 5.


Listing 5. Difference between split-up XML records (UNIX diff utility universal format)
                
    diff -Nru before/co after/co
    --- before/co   2007-06-29 14:51:17.000000000 -0600
    +++ after/co    2007-06-29 14:51:17.000000000 -0600
    @@ -2,7 +2,7 @@
         <name>Christopher Okigbo</name>
         <address>
           <street>7 Heaven's Gate</street>
    -      <city>Idoto</city>
    +      <city>Nsukka</city>
           <province>Anambra</province>
         </address>
       </contact>
    \ No newline at end of file
    diff -Nru before/lh after/lh
    --- before/lh   1969-12-31 17:00:00.000000000 -0700
    +++ after/lh    2007-06-29 14:51:17.000000000 -0600
    @@ -0,0 +1,8 @@
    +<contact added="2004-11-01" id="lh">
    +    <name>Langston Hughes</name>
    +    <address>
    +      <street>10 Bridge Tunnel</street>
    +      <city>Harlem</city>
    +      <province>NY</province>
    +    </address>
    +  </contact>
    \ No newline at end of file
    
    

You can see that this is much closer to what you want for an incremental update than the diff output in Listing 3. Now only the changed and added records are shown. If you think about it, though, for this specific use case, you don't even need to know the details that changed within those records. It's enough just to know which records have changed. Now that you have diff working in a basic fashion, you're free to use the utility's many useful features, one of which is the quiet output form, triggered by the -q flag. By running diff -Nrq before after on the command line, you should get output to match Listing 6.


Listing 6. Difference between split-up XML records (UNIX diff utility quiet format)
                
Files before/co and after/co differ
Files before/lh and after/lh differ

    

To construct the incremental feed, parse Listing 6 to get the names of files containing the needed records, then concatenate these and wrap them in the overall envelope element (contacts, in this case). This process is also quite efficient, because diff is smart enough to work on these multiple files without the overhead of launching multiple processes, and you should be able to concatenate the needed files in a streaming fashion.

What about deletion?

In one possible type of update, a record appears in the base XML but not in the updated XML. This happens when the output system deletes records. Usually you want the input system to know about such deletions, but handling these is a bit less of a general case than additions and updates. For the latter cases, you need to provide the full contents of the records so the input system can get the new data. In the case of deletions, you usually only need to send a list of deleted IDs. You can get this list through a comparison of the contents of the before and after directories.

What about whitespace?

Sometimes processing changes whitespace in a way that you don't care about. All well-behaved tools and canonicalization will preserve whitespace exactly as they find it, unless you tell them otherwise, and this can lead to unnecessary updates when only whitespace has changed. There are several ways to deal with this. You can normalize whitespace while splitting the file into records. When doing so, however, consider whether you'd like to have more readable universal format diffs. If you were to remove all whitespace between elements in this article's example, the resulting records files would all be on a single line. diff would still correctly tell you which records had changed, but in the equivalent of Listing 5, you would not be able to see as clearly that only the city value changed for the co record.

Sometimes I preserve universal format diffs for historical tracking, even though I use quiet format diffs for actually identifying changed records. Another possible approach is to write the records to the file as they are and then ignore all whitespace differences with the -w option in diff. If you decide to use this tactic, make sure that you don't care about such changes at all.


Wrap up

Many other techniques are useful for this sort of processing. As always, the more diverse your XML toolkit, the better equipped you are to find flexible solutions to such problems. Sometimes, plain text processing tools such as diff can be valuable for XML processing. But, as you've learned in this article, you have to be very careful when using such tools because XML often introduces complications to text processing. I always recommend that developers arrange things so they work with collections of small XML files (or at least an XML DBMS), rather than large, monolithic files. When this is not practical, it's good to know you can apply a few neat tricks to get information from one system to another without bogging everything down.


Resources

Learn

Get products and technologies

Discuss

About the author

Photo of Uche Ogbuji

Uche Ogbuji is a partner at Zepheira, LLC, a solutions firm specializing in the next generation of Web technologies. Mr. Ogbuji is lead developer of 4Suite, an open source platform for XML, RDF, and knowledge-management applications, and lead developer of the Versa RDF query language. He is a computer engineer and writer born in Nigeria, living and working in Boulder, Colorado. You can find more about Mr. Ogbuji at his Weblog, Copia.

Comments (Undergoing maintenance)



Trademarks  |  My developerWorks terms and conditions

Help: Update or add to My dW interests

What's this?

This little timesaver lets you update your My developerWorks profile with just one click! The general subject of this content (AIX and UNIX, Information Management, Lotus, Rational, Tivoli, WebSphere, Java, Linux, Open source, SOA and Web services, Web development, or XML) will be added to the interests section of your profile, if it's not there already. You only need to be logged in to My developerWorks.

And what's the point of adding your interests to your profile? That's how you find other users with the same interests as yours, and see what they're reading and contributing to the community. Your interests also help us recommend relevant developerWorks content to you.

View your My developerWorks profile

Return from help

Help: Remove from My dW interests

What's this?

Removing this interest does not alter your profile, but rather removes this piece of content from a list of all content for which you've indicated interest. In a future enhancement to My developerWorks, you'll be able to see a record of that content.

View your My developerWorks profile

Return from help

static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=XML
ArticleID=251461
ArticleTitle=Fast incremental updates of XML records
publish-date=08282007
author1-email=uche@ogbuji.net
author1-email-cc=dwxed@us.ibm.com

My developerWorks community

Tags

Help
Use the search field to find all types of content in My developerWorks with that tag.

Use the slider bar to see more or fewer tags.

Popular tags shows the top tags for this particular content zone (for example, Java technology, Linux, WebSphere).

My tags shows your tags for this particular content zone (for example, Java technology, Linux, WebSphere).

Use the search field to find all types of content in My developerWorks with that tag. Popular tags shows the top tags for this particular content zone (for example, Java technology, Linux, WebSphere). My tags shows your tags for this particular content zone (for example, Java technology, Linux, WebSphere).

Special offers