XML as a format has a lot of nice properties; it is a perfectly general way of representing arbitrary data structures, and it is human readable (more or less). But XML has one very notable unpleasant property. XML documents are VERBOSE; not just a little on the wordy side, but almost unbelievably huge. Much of the time, this drawback of XML really makes no difference. DASD is cheap, and the time on the wire might be only a small part of the total time in the process. But at other times, bandwidth and storage space can be important.
To quantify things, it is not at all unusual for XML documents that represent table-oriented data to be three times as large as a CSV or database representation, or even a flat file. A similar increase is typical in representing EDI data (such as for ebXML projects). In many of these contexts, one starts out with multi-megabyte data sources. Making these multiples larger can be inconvenient, especially for transmission purposes. For example, for purposes of this article I created an XML representation of an approximately 1 megabyte Apache access logfile. This created an XML document 3.18 times the size of the underlying line-oriented log. The only information added in the process was some descriptive names for fields, but that could have also been specified in a single header line of less than a hundred bytes. Moreover, my specific XML representation did not include any namespaces in the tags, which would have increased the size further.
When you think about compressing documents, you normally think first of general compression algorithms like Lempel-Ziv and Huffman, and of the common utilities that implement variations on them. Specifically, on Unix-like platforms, what first comes to mind is usually the utility gzip; on other platforms, zip is more common (using utilities such as PKZIP, Info-ZIP, and WinZip). gzip turns out to be quite consistently better than zip, but only by small margins. These utilities indeed tend substantially to reduce the size of XML files. However, it also turns out that you can obtain considerably better compression rates by two means, either individually or in combination.
The first technique is to use the Burrows-Wheeler compression algorithm rather than Lempel-Ziv sequential algorithms. In particular, the somewhat less common utility bzip2 (or its associated libraries and APIs) is an implementation of Burrows-Wheeler for many system platforms (and is accompanied by full source and a BSD-style license). Burrows-Wheeler operates by grouping related strings in a uncompressed source, rather than in the Lempel-Ziv style of building up a dictionary of string occurrences. bzip2 consistently obtains better compression than gzip across many file types, but the effect is especially dramatic for XML documents. On the down side, bzip2 is generally slower than gzip. Then again, the slowness of bandwidth will very often swamp speed differences in CPU or memory-bound algorithms.
The second technique is to take advantage of the very specific structure of XML documents to produce more compressible representations. The XMill utility is one implementation of this technique, and it is available (with C++ source) under a liberal license from AT&T. XMill, however, seems to require certain clickthrough style limitations on its licensing, and it cannot be distributed by other parties directly (at least as I understand it). I have created my own implementation of the same general technique, and I present the implementation in this article. The code herein is released to the public domain, and the technique as implemented was developed independently and has only a "bird's-eye view" similarity to XMill: Sometimes XMill does better, and sometimes I do (but XMill is probably always faster than my initial implementation, which pays attention only to compression results).
I obtained or created four base documents for purposes of comparison in this article. The first is Shakespeare's play
Hamlet as an XML document (see Resources). The markup includes tags such as <PERSONA>, <SPEAKER> and <LINE>, which map fairly naturally to the typographic forms one might encounter in a printed copy. In order to make comparisons of just how the XML markup contributes to document size and compressibility, I derived from hamlet.xml a document
hamlet.txt that simply removed all the XML tags but left the content. This derivation is not reversible and is a strict
loss of information. A person reading hamlet.txt would not have a great deal of difficulty determining semantically which
content is a "speaker" name and which a "line," for example, but there is no easy way a computer could regenerate the source
XML document.
The next two documents are an Apache Weblog file (a compact set of line-oriented records), and an XML document created from this file. Since the source document is the log file, no information is lost in the transformation, and it is trivial to re-create exactly the original format from the XML. Of course, you cannot use an XML parser, or DOM, or a validator, or a DTD with the log file format. Let's take a look at the sizes of the base documents in Listing 1.
Listing 1.Directory listing of base documents
288754 hamlet.xml 188830 hamlet.txt 949474 weblog.txt 3021921 weblog.xml |
In both cases, the XML is much larger. In the Hamlet example, the comparison is not entirely fair because the actual information content of the text version is also diminished. But for the Weblog file, the XML starts to look fairly bad. Not everything is quite as it appears, however. Compression programs do a fairly good job of boiling documents down to their actual information content, and meaningless padding tends towards zero size in the compressed version (asymptotically with compression effort, and if all is happy). Let's try gzip, zip and bzip2 in listings 2 and 3.
Listing 2. Directory listing of compressed Shakespeare
78581 hamlet.xml.gz 72505 hamlet.txt.gz 78696 hamlet.xml.zip 72620 hamlet.txt.zip 57522 hamlet.xml.bz2 56743 hamlet.txt.bz2 |
Listing 3. Directory listing of compressed Weblog
91029 weblog.txt.gz 115524 weblog.xml.gz 91144 weblog.txt.zip 115639 weblog.xml.zip 56156 weblog.txt.bz2 66994 weblog.xml.bz2 |
A number of interesting things emerge from studying the sizes in the two listings. For both styles of documents (for every compression technique), the size differences in compressed files is much smaller than between the XML versus text originals. It is also noteworthy that gzip and zip cluster very closely together for corresponding cases, while bzip2 does much better all the time. Moreover, when using bzip2 on the Shakespeare document, the compressed size difference between the text and XML formats is nearly negligible, despite the 53% larger size of the XML base document.
However, the Weblog stands out as a problem case. While compression narrows the bloat of the XML conversion quite a bit, even the bzip2 version still lets the XML markup increase the size by about 20%. It's not necessarily a disaster, but it feels as if we should be able to compress down to the true information content.
An XML document has a rather inefficient form when it comes to compression. As you will see, bzip2 somewhat alleviates this inefficiency by regrouping strings. But at heart, XML documents are a jumble of fairly dissimilar parts -- tags, attributes, element bodies of different types. If you could take each relatively homogeneous set of things and group them close to each other in a transformed file, standard compressors would have more to work with. For example, if every <host> tag-body in the Weblog occurs near the others, the block of stuff that contains the IP address of hosts will be easy to compress. The trick here is to come up with a way of transforming an XML document into something that contains all the same information but structures the layout in a compressor-friendly style.
The utilities xml2struct.py and struct2xml.py do exactly what is desired. The versions below have a few limitations, but they demonstrate the principles involved. Some limitations are that each document is limited to 253 distinct tags, and that attributes and processing instructions are not handled. Fixing those limits should not change the gist of the results, however. XMill performs a similar transformation, but with some extra options and with fewer limitations.
The general format of a "struct" document is as follows:
- A list of tags that occur in the original XML document, separated by newline characters.
- A section delimeter: 0x00 (the null byte)
- A compact representation of the overall document structure, where each start tag is represented by a single byte, and the occurrence of content is marked by a 0x02 byte.
- Another section delimiter: 0x00 (the null byte)
- The contents of all the elements indicated in the document structure schematic, grouped by element type. Each individual content item is delimited by a 0x02 byte, while the start of a elements of a new type is delimited by a 0x01 byte (the last not strictly needed, but it makes reversing the transformation easier).
Below is complete Python code to perform and reverse the described transformation. It would be simple to implement this transformation in another programming language also:
Listing 4. xml2struct.py
import sys
import xml.sax
from xml.sax.handler import *
class StructExtractor(ContentHandler):
"""Create a special structure/content form of an XML document"""
def startDocument(self):
self.taglist = []
self.contentdct = {}
self.state = [] # stack for tag state
self.newstate = 0 # flag for continuing chars in same elem
self.struct = [] # compact document structure
def endDocument(self):
sys.stdout.write('\n'.join(self.taglist))
# Write out the taglist first
sys.stdout.write(chr(0)) # section delimiter \0x00
sys.stdout.write(''.join(self.struct))
# Write out the structure list
sys.stdout.write(chr(0)) # section delimiter \0x00
for tag in self.taglist: # Write all content lists
sys.stdout.write(chr(2).join(self.contentdct[tag]))
sys.stdout.write(chr(1)) # delimiter between content types
def startElement(self, name, attrs):
if not name in self.taglist:
self.taglist.append(name)
self.contentdct[name] = []
if len(self.taglist) > 253:
raise ValueError, "More than 253 tags encountered"
self.state.append(name) # push current tag
self.newstate = 1 # chars go to new item
# single char to indicate tag
self.struct.append(chr(self.taglist.index(name)+3))
def endElement(self, name):
self.state.pop() # pop current tag off stack
self.newstate = 1 # chars go to new item
self.struct.append(chr(1)) # \0x01 is endtag in struct
def characters(self, ch):
currstate = self.state[-1]
if self.newstate: # either add new chars to state item
self.contentdct[currstate].append(ch)
self.newstate = 0
self.struct.append(chr(2))
# \0x02 content placeholder in struct
else: # or append the chars to current item
self.contentdct[currstate][-1] += ch
if __name__ == '__main__':
parser = xml.sax.make_parser()
handler = StructExtractor()
parser.setContentHandler(handler)
parser.parse(sys.stdin)
|
Using SAX rather than DOM makes this transformation fairly time efficient, even though time was not a large consideration in developing it. To reverse the transformation, you'd use Listing 5.
Listing 5. struct2xml.py
def struct2xml(s):
tags, struct, content = s.split(chr(0))
taglist = tags.split('\n') # all the tags
contentlist = [] # list-of-lists of content items
for block in content.split(chr(1)):
contents = block.split(chr(2))
contents.reverse() # pop off content items from end
contentlist.append(contents)
state = [] # stack for tag state
skeleton = [] # templatized version of XML
for c in struct:
i = ord(c)
if i >= 3: # start of element
i -= 3 # adjust for struct tag index offset
tag = taglist[i] # spell out the tag from taglist
state.append(tag) # push current tag
skeleton.append('<%s>' % tag)
# insert the element start tag
elif i == 1: # end of element
tag = state.pop() # pop current tag off stack
skeleton.append('</%s>' % tag)
# insert the element end tag
elif i == 2: # insert element content
tag = state[-1]
item = contentlist[taglist.index(tag)].pop()
item = item.replace('&','&')
skeleton.append(item) # add bare tag to indicate content
else:
raise ValueError, "Unexpected structure tag: ord(%d)" % i
return ''.join(skeleton)
if __name__ == '__main__':
import sys
print struct2xml(sys.stdin.read()),
|
The real meat of the discussed transformation comes when you try to compress the results. If all goes according to plan, foo.struct will be significantly more compressible than foo.xml, even though the two contain identical information (which is provable because they are symmetrically derivable). In a sense, xml2struct.py is already a sort of compression algorithm (it produces somewhat smaller files), but the real point is not to use it directly but as a foundation for further compression.
Let's look at some sizes, including a few repeated from above. Some results from XMill are thrown in for comparison; you'll be able to distinguish them as the ones that include the name .xmi. (XMill is available in versions using gzip and bzip2 algorithms.)
Listing 6. Directory listing of "structured XML"
228610 hamlet.struct 57533 hamlet.struct.bz2 57522 hamlet.xml.bz2 71060 hamlet.struct.gz 78581 hamlet.xml.gz 61823 hamlet.xmi.bz2 75247 hamlet.xmi.gz |
The results on this prose document are somewhat mixed. "Restructuring" the XML document assists gzip quite a bit. But plain old bzip2 on the original XML file does 11 bytes better at generating a compressible structure than do my attempts. Of course, I am comforted that XMill behaves similarly -- but worse -- than my transformation.
The technique does quite a bit better with Weblog files. Here it actually pays off.
Listing 7. Directory listing 2 of "structured XML"
1764816 weblog.struct 59955 weblog.struct.bz2 66994 weblog.xml.bz2 56156 weblog.txt.bz2 76183 weblog.struct.gz 115524 weblog.xml.gz 59409 weblog.xmi.bz2 74439 weblog.xmi.gz |
As before, restructuring the XML Weblog aids gzip compression extremely significantly. But since gzip is not my preferred technique anymore, this is only moderately interesting. What is of genuine interest is that I have also improved the compression rate of the already wonderful bzip2 by 11%. While maybe not earth shattering, this is enough to matter when worrying about megabytes. For what it's worth, XMill does a bit better than xml2struct.py in this case. It is also interesting that the compression of this restructured XML is within 7% of the best-obtained compression of the original textual Weblog.
Though the utility presented here is a preliminary attempt, even in this early form it does surprisingly well -- at least in some cases -- of squeezing those last bytes out of compressed XML files. With a little refinement and experimentation, I expect that a few percent more reduction could be obtained. Part of what makes writing this utility hard is that bzip2 does such a good job to start with. I was honestly surprised by just how effective the Burrows-Wheeler algorithm was when I started empirical testing.
Some commercial utilities attempt to perform XML compression in a manner that utilizes knowledge of the specific DTDs of compressed documents. It is quite likely that these techniques obtain additional compression. However, xml2struct.py and XMill have the nice advantage of being simple command-line tools that you can transparently apply to XML files. Custom programming of every compression is not always desirable or possible. But where it is, squeezing out even more bytes might be an attainable goal.
-
Much of the inspiration for this article comes from the work of the XMill XML compressor. You can find information and a downloadable version at the XMill page. The license requires a clickthrough, and the download page unfortunately seems to have a buggy script that does not allow downloads from all sites.
-
The complete plays of Shakespeare in XML form are available for downloading from ibiblio.org. The document hamlet.xml used for testing purposes was obtained there.
- Read the paper that introduced the Burrows-Wheeler algorithm (now implemented in the bzip2 utility): M. Burrows and D.J. Wheeler. "A block-sorting lossless data compression algorithm." Technical Report 124, Digital System Research center, 1994.
-
Many Unix-like systems include bzip2 as part of their
standard distribution. For other platforms, or for newer
versions, bzip2 can be found at Redhat.
-
I wrote what I believe is a good general introduction to data
compression.
- Find other articles in David Mertz's XML Matters column.

David Mertz believes that less is more. David may be reached at mertz@gnosis.cx; his life pored over at http://gnosis.cx/dW/. Suggestions and recommendations on this, past, or future, columns are welcomed.
Comments (Undergoing maintenance)





