XML documents tend to be quite large compared with other forms of data representation. Though XML is verbose, its simplicity and generality make the tradeoff worthwhile. Much of the time, the size of XML really makes no difference -- hard disks are cheap, and the time on the wire might be only a small part of the total time in the process. But at other times, bandwidth and storage space can be important.
To give you an idea, it is not at all unusual for XML documents that represent table-oriented data to be three times as large as a CSV or database representation -- or even a flat file. A similar increase in size is typical in representing EDI data, such as for ebXML projects. In many of these contexts, you start with multi-megabyte data sources, so making them several times larger can be inconvenient, especially for transmission purposes.
When you think about compressing documents, you normally think first of general compression algorithms like Lempel-Ziv, Huffman, and Burrows-Wheeler, and the common utilities that implement variations thereof. Examples of such utilities are gzip, zip, and bzip2 (and the library versions of each). These utilities tend to substantially reduce the size of XML files -- especially bzip2. Fewer people are aware of PPM techniques for compression, which can generally achieve even higher compression rates than bzip2, but at the cost of still greater running time than the already sluggish bzip2. The reigning "king of the hill" for compressing XML documents is xmlppm, which utilizes XML-specific techniques similar to those presented in this column; however, xmlppm is several orders of magnitude slower, and unboundedly more memory-hungry, than the xml2struct utility presented here. But if you have the time and memory, xmlppm achieves astonishing compression rates.
You can obtain considerably better compression rates when you use the specific structure of XML documents to produce more compressible representations. The XMill utility is an implementation of this technique (as is xmlppm, in a somewhat more sophisticated way). However, previous techniques for combining XML restructuring with generic compression algorithms have operated at the level of entire XML documents; this is true of XMill, of xmlppm, and of my original xml2struct utility. The general principle behind all these techniques is to locate relatively homogeneous parts of a document and group them close to each other in a transformed file. After such grouping, standard compressors do better (achieving higher ratios, and often running slightly faster as well). In particular, XML signals similarities between specific portions of a document by enclosing them in element tags with the same name.
If you must deal with many-megabyte XML documents, it's often impractical to spend the memory, disk space, and CPU overhead needed to manipulate such huge documents whole. It would be nice if you could apply similar techniques not to the entire multi-megabyte source, but to parsed blocks read serially. It is worth recalling from the previous installment of this column that the Burrows-Wheeler algorithm, applied "generically," performs much of the same restructuring that xml2struct performs using knowledge of XML, and yields similar compression results. But the effectiveness of the Burrows-Wheeler algorithm depends in large part on the ability to perform global analysis on large input sources -- you'll see in the quantitative analysis below that Burrows-Wheeler loses all its advantages when addressing relatively small serial blocks from XML documents. The restructuring-pass techniques remain effective to a much larger degree under this constraint.
There are a variety of practical applications for the block-level restructuring/compression techniques addressed in this article. The very simplest case is one in which XML files are relatively large (hundreds or thousands of megabytes), memory and disk space are moderately constrained (you don't have gigabytes of extra memory or storage available for the process), and channel costs are comparatively expensive (transmitting a half-gigabyte is a worthwhile savings over transmitting a gigabyte). The scenario for the simplest case would utilize a protocol like the one below (a minimal sender-side sketch follows the list):
- Machine A has both a channel to Machine B and a large XML source, either a file or some dynamic generation of XML content.
- A SAX parser on A flushes its data each time a block's worth of source XML has been encountered.
- The flushed block is restructured.
- The flushed block is compressed by conventional means such as gzip or bzip2.
- The restructured/compressed block is transmitted.
- Machine B receives blocks over the channel. The underlying XML is reconstructed in a serial fashion. Each block of XML could be appended to a file; or it could be fed to a SAX parser that acts on the elements and contents and discards them once processed.
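To make this protocol concrete, below is a minimal sender-side sketch in Python. It is an illustration under stated assumptions, not code from the article archive: the restructure() function is a stand-in for the real xml2struct pass described later in this article, and the four-byte length prefix is simply one plausible way to frame blocks on the channel.

Listing 1. A minimal sender-side sketch of the block protocol

    import sys
    import zlib
    import xml.sax

    BLOCK_SIZE = 100 * 1024   # flush after roughly 100k of source XML

    def restructure(events):
        # Stand-in for the xml2struct restructuring pass described later
        # in this article; here we merely serialize the buffered events.
        return "".join(text for kind, text in events).encode("utf-8")

    class BlockSender(xml.sax.ContentHandler):
        def __init__(self, channel):
            super().__init__()
            self.channel = channel  # any file-like byte stream (socket, file)
            self.events = []        # buffered SAX events for the current block
            self.size = 0           # approximate source bytes buffered so far

        def startElement(self, name, attrs):
            self._buffer("start", "<%s>" % name)

        def characters(self, content):
            self._buffer("text", content)

        def endElement(self, name):
            self._buffer("end", "</%s>" % name)
            # Blocks are flushed only at end tags (a limitation noted below).
            if self.size >= BLOCK_SIZE:
                self.flush()

        def endDocument(self):
            self.flush()

        def _buffer(self, kind, text):
            self.events.append((kind, text))
            self.size += len(text)

        def flush(self):
            if not self.events:
                return
            block = zlib.compress(restructure(self.events))
            # Length-prefix each compressed block so Machine B can re-frame it.
            self.channel.write(len(block).to_bytes(4, "big"))
            self.channel.write(block)
            self.events, self.size = [], 0

    if __name__ == "__main__":
        xml.sax.parse(sys.argv[1], BlockSender(sys.stdout.buffer))

On the receiving side, Machine B reads a length prefix, inflates the block, reverses the restructuring, and either appends the XML to a file or feeds it to its own SAX parser.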
Very little in the restructuring technique of xml2struct needs to be changed to accommodate arbitrary block sizes. The article archive contains two implementations of block-level restructuring. A Python implementation closely follows the pattern of the xml2struct.py utility presented in "XML Matters #13: Exploring the entropy of documents" (see Resources), but the code is considerably more compact and easier to follow. An "optimized" C implementation was also developed to examine the speed performance of the algorithm.
In the original algorithm, a list of tags was included as the first delimited section of a restructured XML document. This list of tags -- generated on an ad hoc basis during the actual document parsing -- was used as an index for single-byte tag placeholders used in a structure schematic. The strategy of using byte index values in the place of tags reduces the size of restructured documents somewhat, but also limits the algorithm to handling DTDs with fewer than 254 distinct tags.
Under the block-level revision below, I assume that a taglist is available independently. A utility function to create a taglist from a DTD is provided in the article archive. Assuming both sides of the channel have the requisite DTD, everything simply works; in the absence of a DTD, any other agreed-upon format that specifies the order of tags serves just as well.
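The archive's utility function is not reproduced here, but a taglist builder might look something like the hypothetical sketch below, which simply scans a DTD for <!ELEMENT declarations with a regular expression (a real DTD parser would be more robust):

Listing 2. Building a taglist from a DTD (hypothetical sketch)

    import re

    def taglist_from_dtd(dtd_text):
        # Collect element names in their order of declaration; a regex
        # scan is a simplification that ignores some DTD constructs.
        names = re.findall(r"<!ELEMENT\s+([\w.:-]+)", dtd_text)
        ordered = list(dict.fromkeys(names))   # first-seen order, no dups
        # Byte values 0x00-0x02 are reserved as delimiters and markers,
        # so only 253 distinct tags fit in a single byte -- hence the
        # fewer-than-254 limit mentioned above.
        assert len(ordered) <= 253, "too many distinct tags for one byte"
        return {0x03 + i: name for i, name in enumerate(ordered)}

Any deterministic ordering works, so long as sender and receiver derive the same mapping from the same DTD.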
The only significant revision needed was the addition of a new (first) delimited document section to indicate the current element nesting. In the original document-level format, every start tag was paired with an end tag. But since the block-level format breaks XML documents at arbitrary positions, it is necessary to record the stack of elements open where a block begins. The first block parsed has an empty field for this section; subsequent blocks will usually contain one or more bytes listing the tags that have not yet been closed.
The format of a restructured block is rather simple (a decoder sketch follows the list):
- List of open tags: Each start tag is a single byte (with byte value >= 0x03); this list is pushed onto the tag stack so that corresponding close-tag bytes can be matched
- A section delimiter: 0x00 (the null byte)
- A compact representation of the block document structure, where each start tag is represented by a single byte and the occurrence of content is marked by a 0x02 byte; a close tag is represented by a 0x01 byte and must be back-matched to a corresponding start tag
- Another section delimiter: 0x00 (the null byte)
- The contents of all the elements indicated in the document structure schematic, grouped by element type: Each individual content item is delimited by a 0x02 byte, while the start of elements of a new type is delimited by a 0x01 byte (the latter is not strictly needed, but it makes reversing the transformation easier)
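As a concrete reading of this format, here is a sketch of a decoder that splits a block into its three sections. It assumes, as the format implies, that 0x00 bytes never occur inside content, and it uses a taglist mapping of the kind built in Listing 2:

Listing 3. Splitting a restructured block into its sections (sketch)

    def parse_block(block, taglist):
        # The two 0x00 delimiters separate the three sections.
        open_tags, structure, contents = block.split(b"\x00", 2)

        # Section 1: elements still open where this block begins.
        stack = [taglist[b] for b in open_tags]

        # Section 2: the structure schematic -- start-tag bytes, 0x02
        # content markers, 0x01 close markers (back-matched to the
        # corresponding start tag).
        schematic = []
        for b in structure:
            if b == 0x01:
                schematic.append(("close", None))
            elif b == 0x02:
                schematic.append(("content", None))
            else:
                schematic.append(("start", taglist[b]))

        # Section 3: content items grouped by element type; 0x01 starts
        # a new group, 0x02 delimits items within a group.
        groups = [grp.split(b"\x02") for grp in contents.split(b"\x01") if grp]
        return stack, schematic, groups

Reconstructing the XML is then a matter of walking the schematic and consuming the next content item from the appropriate element-type group each time a 0x02 marker is reached.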
Two limitations of the demonstration code should be noted. The first is purely a product of research implementation convenience: Element attributes are not handled by the current parser. This is primarily intended to make the presented code easier to follow. Adding like-attribute blocks is a straightforward extension of the current technique and is unlikely to appreciably affect compression or performance.
A second limitation is more consequential. Blocks are only flushed when end tags are encountered; if the PCDATA content of single elements is not consistently smaller than the block size used, no enforcement of the block size is performed. For example, a huge XHTML document that contains one big element would not enforce any reasonable block size. It might be possible to remove this limitation (although doing so would be inelegant within a SAX framework). However, there is little point in lifting it, since reorganization will be inconsequential for documents with element contents larger than the block size (and should not be performed in general). The one situation where a problem could arise is when an implementing system has a hard limit on available memory and encounters a block too large to process.
In the normal case, input XML blocks are slightly larger than the indicated block size, but once single-byte tag placeholders are substituted for tags, the resultant size is typically slightly smaller than the block size. Once compression is applied to the block, the compressed block size is considerably smaller than the input block size.
For quantification purposes, I work with the same two representative XML documents addressed in my earlier research. One was a prose-oriented document, an XML version of Shakespeare's Hamlet. The second XML source document was created from a 1MB Apache log file, with XML tags used to surround each field of the log file (and each entry). The tags used were not particularly verbose, but the file still expanded to about 3MB.
On the graphs below, the two gray bars in the front represent the compression achievable with simple file-level compression, using gzip and bzip2. As discussed above, file-level compression of large files might be impractical; but as expected, file-level compression achieves better results since it looks for repeated strings in the whole file. In general, you must have blocks of 100k or larger before the block-level compression pulls nearly even with the file-level compression. Still, if you have to worry about gigabyte source documents, even a megabyte block looks pretty manageable in comparison.
The green and blue bars at the back of the graph represent the compression achievable by compressing blocks without a restructuring pass.
Finally, the red bar in the middle represents the compression performance of xml2struct.py combined with zlib library compression. Better compression would likely be obtained by using the bzip2 library instead; this was not tested, however, because most anticipated uses would make the speed disadvantage of bzip2 prohibitive. Still, if bandwidth is more important than CPU time, adding a bzip2 compression layer might enhance the benefits.
Let's look at the results, then I'll make some comments. I tried block sizes of 1k, 10k, 100k, and 1MB. First, Hamlet and then a weblog:
Figure 1. Compression of hamlet.xml by different techniques
Figure 2. Compression of weblog.xml by different techniques
The same general pattern occurs in both hamlet.xml and weblog.xml, but the pattern is much stronger in weblog.xml. The highly repetitive structure of a log file gives restructuring its strongest advantage. At small block sizes, compression is much worse than file-level compression. Around 10k block size, block-level compression starts to look moderately good; and at 100k block size it becomes very close to the file-level compression techniques.
The aspect of the charts that is most interesting for this article is the compression characteristics of the restructuring block strategy. Restructuring consistently improves on the behavior of simple block-level zlib (by around 30% for the weblog, less for Hamlet). With blocks of about 100k, restructuring does better than file-level gzip, which is a very positive result.
A surprising result is the behavior of block-level bzip2 compression. As one would expect, once block size gets large, it makes no difference whether block-level compression is used. But one has to get to the 1MB size to wipe out the whole difference. However, at small block sizes (1k especially, but even at 10k), block-level bzip2 does shockingly badly. Prior restructuring is unlikely to improve this significantly. In fact, for the 1k block size, bzip2 is consistently much worse than zlib.
Using the optimized C implementation of xml2struct, I examined the speed at which restructuring and compression can be performed. If you use these procedures as a channel intermediary, the ability of the process to saturate its output channel is of crucial importance. The times presented here were gathered using the Linux time utility. Elapsed time of runs is reported; each run showed close to 100% CPU usage, and was predominantly user time. I was surprised to see that block size makes very little difference to the running times of any of the examined transformations.
Figure 3. Time requirements of transformations chart
Figure 4. Time requirements of transformations table
The general timing pattern is pretty clear. Restructuring in the C implementation is quite fast; gzip is even faster; bzip2 is slow. PPM, incidentally, is another 10+ times slower than bzip2. I included the running time of the Python implementation as a baseline; this version is completely non-optimized. It could be made faster, but probably not more than 2x. In general, the Python implementation and a bzip2 pass are both too slow for most channel saturation uses.
What do these running times mean for output channel saturation? A 3 MB file can be restructured in slightly under a second on a PIII-733, with block size making only a small difference to the speed of restructuring. Compressing the restructured file with gzip/zlib adds another quarter second to the process time. This works out to approximately 20 megabits/sec; in other words, a PIII-733 performing xml2struct+gzip can saturate two 10 megabit ethernet channels, or about 13 T1 lines. A slow Pentium dedicated to the xml2struct process should thus suffice to keep a T1 line filled with efficiently transmitted XML documents. Scaling up, an Intel P4 or AMD Athlon running at clock speeds over a gigahertz should be able to satisfy the requirements of a 45 megabit T3 line.
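The arithmetic behind those figures is simple enough to check (the 1.25-second total is taken from the timings just quoted):

Listing 4. Back-of-envelope throughput check

    file_megabits = 3 * 8        # a 3 MB file expressed in megabits
    seconds = 1.0 + 0.25         # restructuring pass + zlib compression
    throughput = file_megabits / seconds
    print(throughput)            # ~19.2 megabits/sec
    print(throughput / 10)       # ~2 ten-megabit ethernet channels
    print(throughput / 1.544)    # ~12-13 T1 lines at 1.544 megabits each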
The structure of documents significantly affects their compressibility with standard compression algorithms. Since XML encodes much of the detail of its semantic structure in its syntactic form, restructuring such documents can improve their compressibility. Moreover, I have shown that such restructuring techniques are amenable to serialization using parsed blocks from large XML documents. Also, the xml2struct algorithm can be implemented in optimized C with performance characteristics that allow it to saturate dedicated output channels from XML servers using current-generation CPUs and currently common channel bandwidths.
- This article is based, in large part, on two longer whitepapers written by David Mertz for Intel Developer Services: Optimizing xml2struct Processing for Embedded Applications and Compression and Streaming of XML Documents.
- This article is a follow-up to the investigation in XML Matters: XML and Compression. The code and results therein provide useful background to this article.
- Find the source code files mentioned in this article.
- Amazingly good but slow compression of XML documents is provided by the xmlppm utility. Read Compressing XML with Multiplexed Hierarchical PPM Models by James Cheney of Cornell University.
- Find the complete plays of Shakespeare in XML form. The document hamlet.xml, used for testing purposes, was obtained there.
- Read the paper that introduced the Burrows-Wheeler algorithm (now implemented in the bzip2 utility): M. Burrows and D.J. Wheeler. "A block-sorting lossless data compression algorithm." Technical Report 124, Digital Systems Research Center, 1994.
- Many Unix-like systems include bzip2 as part of their standard distribution. Get bzip2 for other platforms -- or in newer versions.
- I wrote what I believe is a good general introduction to data compression.
- Finally, take a look at Rational Application Developer for WebSphere Software, an easy-to-use, integrated development environment for building, testing, and deploying J2EE applications, including generating XML documents from DTDs and schemas.
Find other articles in David Mertz's XML Matters column.
David Mertz believes that less is more. David may be reached at firstname.lastname@example.org; his life is pored over at http://gnosis.cx/dW/. Suggestions and recommendations on this, past, or future columns are welcomed.