Skip to main content

By clicking Submit, you agree to the developerWorks terms of use.

The first time you sign into developerWorks, a profile is created for you. Select information in your profile (name, country/region, and company) is displayed to the public and will accompany any content you post. You may update your IBM account at any time.

All information submitted is secure.

  • Close [x]

The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerworks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

By clicking Submit, you agree to the developerWorks terms of use.

All information submitted is secure.

  • Close [x]

XML Matters: More on XML and compression

Block-level algorithms and resource loads

David Mertz (mertz@gnosis.cx), Compactor, Gnosis Software, Inc.
David Mertz
David Mertz believes that less is more. David may be reached at mertz@gnosis.cx; his life pored over at http://gnosis.cx/dW/. Suggestions and recommendations on this, past, or future columns are welcomed.

Summary:  In an earlier installment of this column, David examined techniques by which XML documents can be reversibly restructured to improve compression. For large XML documents and embedded processes, however, restructuring an entire source prior to a compression pass might be impractical. In this installment, David examines how well restructuring techniques can be adapted to block-level processing -- in terms of both compression improvements and CPU/memory requirements.

View more content in this series

Date:  01 Apr 2002
Level:  Introductory
Also available in:   Japanese  Spanish

Activity:  6144 views
Comments:  

XML documents tend to be quite large compared with other forms of data representation. Though verbose, the simplicity and generality of XML make it worthwhile. Much of the time, the size of XML really makes no difference -- hard disks are cheap, and the time on the wire might be only a small part of the total time in the process. But at other times, bandwidth and storage space can be important.

Develop skills on this topic

This content is part of a progressive knowledge path for advancing your skills. See A data compression primer

To give you an idea, it is not at all unusual for XML documents that represent table-oriented data to be three times as large as a CSV or database representation -- or even a flat file. A similar increase in size is typical in representing EDI data, such as for ebXML projects. In many of these contexts, you start with multi-megabyte data sources, so making them several times larger can be inconvenient, especially for transmission purposes.

When you think about compressing documents, you normally think first of general compression algorithms like Lempel-Ziv, Huffman, Burrows-Wheeler, and the common utilities that implement variations thereof. Examples of such utilities are gzip, zip, and bzip2 (and the library versions of each). These utilities tend to substantially reduce the size of XML files -- especially bzip2. Fewer people are aware of PPM techniques for compression -- which can generally compress at even higher rates than bzip2 -- but at the cost of an even greater time requirement than the already sluggish bzip2. The reigning "king of the hill" for compressing XML documents is xmlppm, which utilizes XML-specific techniques similar to those presented in this column; however, xmlppm is several orders of magnitude slower, and unboundedly more memory-hungry, than the xml2struct presented here. But if you have the time and memory, xmlppm achieves astonishing compression rates.

You can obtain considerably better compression rates when you use the specific structure of XML documents to produce more compressible representations. The XMill utility is an implementation of this technique (as is xmlppm, in a somewhat more sophisticated way). However, previous techniques for combining XML restructuring with generic compression algorithms have operated at the level of entire XML documents. This is true of XMill, xmlppm, and my original xml2struct utility. The general principle behind all these techniques is to locate relatively homogeneous parts of a document and group them close to each other in a transformed file. After such grouping, standard compressors compress better (higher ratios, but nominally faster as well). In particular, XML suggests similarities between specific portions of a document by enclosing them in element tags with the same name.

If you must deal with many-megabyte XML documents, it's often impractical to spend the memory, disk space, and CPU overhead to manipulate such huge documents. It would be nice if you could take not the entire multi-megabyte source, but only parsed blocks, read serially, and apply similar techniques. It is worth recalling from the previous installment of this column on this topic, that the Burrows-Wheeler algorithm applied "generically" performs much of the same restructuring that xml2struct performs using knowledge of XML and yields similar compression results. The effectiveness of the Burrows-Wheeler algorithm depends in large part on the ability to perform global analysis on large input sources -- you'll see in the quantitative analysis below that Burrows-Wheeler loses all its advantages when addressing relatively small serial blocks from XML documents. The restructuring-pass techniques remain effective to a much larger degree under this constraint.

Usage scenario

There are a variety of practical applications for the block-level restructuring/compression techniques addressed in this paper. The very simplest case is one in which XML files are relatively large (hundreds or thousands of megabytes), memory and disk-space are moderately constrained (you don't have gigabytes of extra memory or storage available for the process), and channel costs are comparatively expensive (transmitting a half-gigabyte is a worthwhile savings over transmitting a gigabyte). The scenario for the simplest case would utilize a protocol like the one below:

xmlstruct communication protocol

  1. Machine A has both a channel to Machine B and a large XML source, either a file or some dynamic generation of XML content.
  2. A SAX parser on A flushes its data each time a block worth of source XML has been encountered.
    • The flushed block is restructured.
    • The flushed block is compressed by conventional means such as gzip or bzip2.
    • The restructured/compressed block is transmitted.
  3. Machine B receives blocks over the channel. The underlying XML is reconstructed in a serial fashion. Each block of XML could be appended to a file; or it could be fed to a SAX parser that acts on the elements and contents and discards it once processed.

Overview of the technique

Very little in the restructuring technique of xml2struct needs to be changed to accommodate arbitrary block sizes. The article archive contains two implementations of block-level restructuring. A Python implementation follows closely the pattern of the xml2struct.py utility presented in "XML Matters #13: Exploring the entropy of documents" (see Resources) -- the code is also much more compact and easier to follow. An "optimized" C implementation was also developed to examine the speed performance of the algorithm.

In the original algorithm, a list of tags was included as the first delimited section of a restructured XML document. This list of tags -- generated on an ad hoc basis during the actual document parsing -- was used as an index for single-byte tag placeholders used in a structure schematic. The strategy of using byte index values in the place of tags reduces the size of restructured documents somewhat, but also limits the algorithm to handling DTDs with fewer than 254 distinct tags.

Under the block-level revision below, I assume that a taglist is available independently. A utility function to create a taglist from a DTD is provided in the article archive. Assuming both sides of the channel have the requisite DTD, everything works and, shy of a DTD, any other format that specifies the order of tags works, too.

The only significant revision needed was the addition of a new (first) delimited document section to indicate current element nesting. In the original document-level format, every start tag was paired to an end tag. But since XML documents are broken at arbitrary positions in the block-level format, it is necessary to record a stack of open elements where a block begins. The first block that is now parsed has an empty field for this first section; subsequent ones are likely to have one or more bytes listing tag indexes that have not yet been closed.

The format of a restructured block is rather simple:

  1. List of open tags: Each start tag is a single byte (with byte value >= 0x03); this list is pushed on to the taglist stack to match corresponding close tag bytes
  2. A section delimiter: 0x00 (the null byte)
  3. A compact representation of the block document structure, where each start tag is represented by a single byte and the occurrence of content is marked by a 0x02 byte; a close tag is represented by a 0x01 byte and must be back-matched to a corresponding start tag
  4. Another section delimiter: 0x00 (the null byte)
  5. The contents of all the elements indicated in the document structure schematic, grouped by element type: Each individual content item is delimited by a 0x02 byte, while the start of elements of a new type is delimited by a 0x01 byte (the last not strictly needed, but it makes reversing the transformation easier)

Two limitations of the demonstration code should be noted. The first is purely a product of research implementation convenience: Element attributes are not handled by the current parser. This is primarily intended to make the presented code easier to follow. Adding like-attribute blocks is a straightforward extension of the current technique and is unlikely to appreciably affect compression or performance.

A second limitation is more consequential. Blocks are only flushed when end tags are encountered. If the PCDATA content of single elements is not consistently smaller than the block size used, no enforcement of the block size is performed. For example, a huge XHTML document that contains one big <pre> element would not enforce any reasonable block size. It might be possible to change this limitation (although doing so would by inelegant within a SAX framework). However, there is little point in lifting this, since reorganization will be inconsequential for documents with element contents larger than the block size (and should not be performed in general). The one situation where a problem could arise is when an implementing system has a hard limit on available memory, and encounters a block too large to process.

In the normal case, input XML blocks are slightly larger than the indicated block size, but once single-byte tag placeholders are substituted for tags, the resultant size is typically slightly smaller than the block size. Once compression is applied to the block, the compressed block size is considerably smaller than the input block size.


Quantifying compression

For quantification purposes, I work with the same two representative XML documents addressed in my earlier research. One was a prose-oriented document, an XML version of Shakespeare's Hamlet. The second XML source document was created from a 1MB Apache log file, with XML tags used to surround each field of the log file (and each entry). The tags used were not particularly verbose, but the file still expanded to about 3MB.

On the graphs below, the two gray bars in the front represent the compression achievable with simple file-level compression, using gzip and bzip2. As discussed above, file-level compression of large files might be impractical; but as expected, file-level compression achieves better results since it looks for repeated strings in the whole file. In general, you must have blocks of 100k or larger before the block-level compression pulls nearly even with the file-level compression. Still, if you have to worry about gigabyte source documents, even a megabyte block looks pretty manageable in comparison.

The green and blue bars at the back of the graph represent the compression achievable by compressing blocks without a restructuring pass.

Finally, the red bar in the middle represents the compression performance of xml2struct.py, combined with zlib library compression. Better compression results would be obtained by using the bzip2 library. This was not tested, however, because most anticipated future uses are likely to make the speed disadvantage of bzip2 prohibitive. However, if bandwidth is more important than CPU time, adding a bzip2 compression layer might enhance the benefits.

Let's look at the results, then I'll make some comments. I tried block sizes of 1k, 10k, 100k, and 1MB. First, Hamlet and then a weblog:


Figure 1. Compression of hamlet.xml by different techniques
Graph of compression of hamlet.xml by different techniques

Figure 2. Compression of weblog.xml by different techniques
Graph of compression of weblog.xml by different techniques

The same general pattern occurs in both hamlet.xml and weblog.xml, but the pattern is much stronger in weblog.xml. The highly repetitive structure of a log file gives restructuring its strongest advantage. At small block sizes, compression is much worse than file-level compression. Around 10k block size, block-level compression starts to look moderately good; and at 100k block size it becomes very close to the file-level compression techniques.

The aspect of the charts that is most interesting for this paper is the compression characteristics of the restructuring block strategy. Restructuring consistently improves on the behavior of simple block-level zlib (around 30% for the weblog, less for Hamlet). With blocks of about 100k, restructuring does better than file-level gzip, which is a very positive result.

A surprising result is the behavior of block-level bzip2 compression. As one would expect, once block size gets large, it makes no difference whether block-level compression is used. But one has to get to the 1MB size to wipe out the whole difference. However, at small block sizes (1k especially, but even at 10k), block-level bzip2 does shockingly badly. Prior restructuring is unlikely to improve this significantly. In fact, for the 1k block size, bzip2 is consistently much worse than zlib.


Quantifying CPU usage and transformation rate

Using the optimized C implementation of xml2struct, I examined the speed at which restructuring and compression can be performed. If you use these procedures as a channel intermediary, the ability of the process to saturate its output channel is of crucial importance. The times presented here were gathered using the Linux time utility. Elapsed time of runs is reported; each run showed close to 100% CPU usage, and was predominantly user time. I was surprised to see that block size makes very little difference to the running times of any of the examined transformations.


Figure 3. Time requirements of transformations chart
Graph of time requirements of transformations chart

Figure 4. Time requirements of transformations table
Time requirements of transformations table

The general timing pattern is pretty clear. Restructuring in the C implementation is quite fast; gzip is even faster; bzip2 is slow. PPM, incidentally, is another 10+ times slower than bzip2. I included the running time of the Python implementation as a baseline; this version is completely non-optimized. It could be made faster, but probably not more than 2x. In general, the Python implementation and a bzip2 pass are both too slow for most channel saturation uses.

What do these running times mean for output channel saturation? A 3 MB file can be restructured in slightly under a second on a PIII-733, with block size making only a small difference to the speed of restructuring. Compressing the restructured file with gzip/'zlib' adds another quarter second to the process time. This works out to approximately 20 megabits/sec; in other words, a PIII-733 performing xml2struct+gzip can saturate two 10 megabit ethernet channels, or about 13 T1 lines. If it's dedicated to the xml2struct process, a slow Pentium should suffice to fill a T1 line, which is dedicated to transmitting XML documents efficiently. Going in the other direction, an Intel P4 or AMD Athlon running at clock speeds over a Gigahertz should be able to satisfy the requirements of a 45 megabit T3 line.


Conclusion

The structure of documents significantly affects their compressibility with standard compression algorithms. Since XML encodes much of the detail of its semantic structure in its syntactic form, restructuring such documents can improve their compressibility. Moreover, I have shown that such restructuring techniques are amenable to serialization using parsed blocks from large XML documents. Also, the xml2struct algorithm can be implemented in optimized C with performance characteristics that allow it to saturate dedicated output channels from XML servers using current-generation CPUs and currently-common channel bandwidths.


Resources

About the author

David Mertz

David Mertz believes that less is more. David may be reached at mertz@gnosis.cx; his life pored over at http://gnosis.cx/dW/. Suggestions and recommendations on this, past, or future columns are welcomed.

Report abuse help

Report abuse

Thank you. This entry has been flagged for moderator attention.


Report abuse help

Report abuse

Report abuse submission failed. Please try again later.


developerWorks: Sign in


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Select information in your profile (name, country/region, and company) is displayed to the public and will accompany any content you post. You may update your IBM account at any time.

Choose your display name

The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


Rate this article

Comments

static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=XML
ArticleID=12099
ArticleTitle=XML Matters: More on XML and compression
publish-date=04012002