A data compression primer
Theory and strategy of data representation
Data compression is widely used in a variety of programming contexts. All popular operating systems and programming languages have numerous tools and libraries for dealing with data compression of various sorts. The right choice of compression tools and libraries for a particular application depends on the characteristics of the data and application in question: streaming versus file; expected patterns and regularities in the data; relative importance of CPU usage, memory usage, channel demands and storage requirements; and other factors.
Just what is data compression, anyway? The short answer is
that data compression removes redundancy from data; in
information-theoretic terms, compression increases the
entropy of the compressed text. But those statements are
essentially just true by definition. Redundancy can come in a
lot of different forms. Repeated bit sequences (
are one type. Repeated byte sequences are another
XXXXXXXX). But more often redundancies tend to come on a
larger scale, either regularities of the data set taken as a
whole, or sequences of varying lengths that are relatively
common. Basically, the purpose of data compression is to find algorithmic transformations of data representations that will
produce more compact representations given "typical" data sets.
If this description seems a bit complex to unpack, read on to
find some more practical illustrations.
Lossless and lossy compression
There are actually two fundamentally different "styles" of data compression: lossy and lossless. This article is generally about lossless compression techniques, but it's helpful to understand the distinction first. Lossless compression involves a transformation of the representation of a data set so that it is possible to reproduce exactly the original data set by performing a decompression transformation. Lossy compression is a representation that allows you to reproduce something "pretty much like" the original data set. An advantage of lossy techniques is that they can frequently produce far more compact data representations than those of lossless compression techniques. Most often lossy compression techniques are used for images, sound-files, and video. Lossy compression may be appropriate in these areas insofar as human observers do not perceive the literal bit-pattern of a digital image/sound, but rather more general "gestalt" features of the underlying image/sound.
From the point of view of "normal" data, lossy compression is not an option. We do not want a program that does "about the same" thing as the one we wrote. We do not want a database that contains "about the same" kind of information as what we put into it. At least not for most purposes (and I know of few practical uses of lossy compression outside of what are already approximate mimetic representations of the real world, likes images and sounds).
A data set example
For the purposes of this article, let's start with a specific
hypothetical data representation. Here is an
easy-to-understand example. In the town of Greenfield, MA, the
telephone prefixes are
readers: In the USA, local telephone numbers are seven digits,
and are conventionally represented in the form ###-####; prefixes
are assigned in geographic blocks). Suppose also that the
first prefix is the mostly widely assigned of the three. The
suffix portions might be any other digits, in fairly equal
distribution. The data set we are interested in is "the list
of all the telephone numbers currently in active use." One can
imagine various reasons why this might be interesting for
programmatic purposes, but we're not concerned with that here.
Initially, the data set we are interested in comes in a particular data representation: a multicolumn report (perhaps generated as output of some query or compilation process). The first few lines of this report might look like:
Table 1. Multicolumn report
============================================================= 772-7628 772-8601 772-0113 773-3429 774-9833 773-4319 774-3920 772-0893 772-9934 773-8923 773-1134 772-4930 772-9390 774-9992 772-2314 [...]
Whitespace compression can be characterized most generally as "removing what we are not interested in." Even though this technique is technically a lossy-compression technique, it is still useful for many types of data representations we find in the real world. For example, even though HTML is far more readable in a text editor if indentation and vertical spacing is added, none of this "whitespace" should make any difference to the rendering of the HTML document by a Web browser. If you happen to know that an HTML document is destined only for a Web browser (or for a robot/spider) then it might be a good idea to take out all the whitespace to make it transmit faster and occupy less space in storage. What we remove in whitespace compression never really had any functional purpose to start with.
In the case of our example in this article, it is possible to remove quite a bit from the described report. The row of "="s across the top adds nothing functional; nor do the '-'s within numbers, nor the spaces between them. All of these are useful for a person reading the original report, but do not matter once we think of it as "data." What we remove is not precisely "whitespace" in traditional terms, but the intent is the same.
Whitespace compression is extremely "cheap" to perform. It is just a matter of reading a stream of data, and excluding a few specific values from the output stream. In many cases, no "decompression" step is involved at all. But even where we would wish to recreate something close to the original somewhere down the datastream, it should require little in terms of CPU or memory. What we reproduce may or may not be exactly what we started with, depending on just what rules and constraints were involved in the original. An HTML page typed by a human in a text editor will probably have spacing that is idiosyncratic. Then again, automated tools often produce "reasonable" indentation and spacing of HTML. In the case of the rigid report format in our example, there is no reason why the original representation could not be produced by a "decompressing formatter".
Run-Length Encoding (RLE) is the simplest widely used lossless compression technique. Like whitespace compression, it is "cheap" -- especially to decode. The idea behind it is that many data representations consist largely of strings of repeated bytes. Our example report is one such data representation. It begins with a string of repeated "="s, and has strings of spaces scattered through it. Rather than represent each character with its own byte, RLE will (sometimes or always) have an iteration count followed by the character to be repeated.
If repeated bytes are predominant within the expected data
representation, it might be adequate and efficient to always
have the algorithm specify one or more bytes of iteration
count, followed by one character. However, if one-length
character strings occur, these strings will require two (or
more) bytes to encode them, in other words,
00000001 01011000 might be
the output bitstream required for just one ASCII "X" of the
input stream. Then again, a hundred "X"s in a row would be
01100100 01011000, which is quite good.
What is frequently done in RLE variants is to selectively use bytes to indicate iterator counts, and otherwise just have bytes represent themselves. At least one byte-value has to be reserved to do this, but that can be escaped in the output, if needed. For example, in our example telephone-number report, we know that everything in the input stream is plain ASCII characters. Specifically, they all have bit one of their ASCII value as 0. We could use this first ASCII bit to indicate that an iterator count was being represented rather than representing a regular character. The next seven bits of the iterator byte could be used for the iterator count; and the next byte could represent the character to be repeated. So, for example, we could represent the string "YXXXXXXXX" as:
"Y" Iter(8) "X" 01001111 10001000 01011000
This example does not show how to escape iterator byte-values, nor does it allow iteration of more than 127 occurrences of a character. Variations on RLE deal with issues such as these, if needed.
Huffman encoding looks at the symbol table of a whole data set. The compression is achieved by finding the "weights" of each symbol in the data set. Some symbols occur more frequently than others do; so Huffman encoding suggests that the frequent symbols need not be encoded using as many bits as the less frequent symbols. There are variations on Huffman-style encoding, but the original (and frequent) variation involves looking for the most common symbol, and encoding it using just one bit, say 1. If you encounter a 0, you know you're on the way to encoding a longer variable length symbol.
Let's imagine we apply a Huffman encoding to our local phone-book example (assume we have already whitespace-compressed the report). We might get:
Table 2. Hoffman encoding results
Encoding Symbol 1 7 010 2 011 3 00000 4 00001 5 00010 6 00011 8 00100 9 00101 0 00111 1
Our initial symbol set of digits could already be straightforwardly encoded (with no-compression) as 4-bit sequences (nibbles). The Huffman encoding given will use up to 5 bits for the worst-case symbols, which is obviously worse than the nibble encoding. However, our best case will use only 1 bit; and we know that our best case is also the most frequent case, by having scanned the data set. So we might encode a particular phone number like:
772 7628 --> 1 1 010 1 00010 010 00011
The nibble encoding would take 28 bits to represent a phone number; in this particular case, our encoding takes 19 bits. I introduced spaces into the above example for clarity, you can see that they are not necessary to unpack the encoding, since the encoding table will determine whether we have reached the end of an encoded symbol (but you have to keep track of your place in the bits).
Huffman encoding is still fairly cheap to decode, cycle-wise. But it requires a table lookup, so it cannot be quite as cheap as RLE, however. The encoding side of Huffman is fairly expensive, though; the whole data set has to be scanned, and a frequency table built up. In some cases a "shortcut" is appropriate with Huffman coding. Standard Huffman coding applies to a particular data set being encoded, with the set-specific symbol table prepended to the output datastream. However, if not just the single data set -- but the whole type of data encoded -- has the same regularities, we can opt for a global Huffman table. If we have such a global Huffman table, we can hardcode the lookups into our executables, which makes both compression and decompression quite a bit cheaper (except for the initial global sampling and hard-coding). For example, if we know our data set would be English-language prose, letter-frequency tables are well known, and quite consistent across data sets.
Probably the most significant lossless compression technique is Lempel-Ziv. What is explained here is LZ78, but LZ77 and other variants work in a similar fashion. The idea in LZ78 is to encode a streaming byte sequence using a dynamic table. At the start of compressing a bitstream, the LZ table is filled with the actual symbol set, along with some blank slots. Various size tables are used, but for my above (whitespace compressed) telephone number example, let's suppose that we use a 32-entry table (this should be OK for our example, although much too small for most other types of data). First thing, we fill the first ten slots with our alphabet (digits). As new bytes come in, we first output an existing entry that grabs the longest sequence possible, then fill the next available slot with the N+1 length sequence. In the worst case, we are using 5 bits instead of 4 bits for a single symbol, but we'll wind up getting to use 5 bits for multiple symbols in a lot of cases. For example, the machine might do this (a table slot is noted with square brackets):
7 --> Lookup: 7 found --> nothing to add --> keep looking 7 --> Lookup: 77 not found --> add '77' to  --> output =00111 2 --> Lookup: 72 not found --> add '72' to  --> output =00111 7 --> Lookup: 27 not found --> add '27' to  --> output =00010 6 --> Lookup: 76 not found --> add '76' to  --> output =00111 2 --> Lookup: 62 not found --> add '62' to  --> output =00110 8 --> Lookup: 28 not found --> add '28' to  --> output =00010
So far, we've got nothing out of it, but let's continue with the next phone number:
7 --> Lookup: 87 not found --> add '87 to  --> output =00100 7 --> Lookup: 77 found --> nothing to add --> keep looking 2 --> Lookup: 772 not found --> add '772' to  --> output =01011 8 --> Lookup: 28 found --> nothing to add --> keep looking 6 --> Lookup: 286 not found --> add '286' to  --> output =10000 ....
The steps should suffice to show the pattern. We have not
achieved any net compression yet, but notice that we've already
managed to use slot 11 and slot 16, thereby getting two symbols
with one output in each case. We've also accumulated the very
useful byte sequence
772 in slot 18, which would prove useful later
in the stream.
What LZ78 does is fill up one symbol table with (hopefully)
helpful entries, then write it, clear it, and start a new one.
In this regard, a symbol table of 32 entries is still probably too small, since that will get cleared before a lot of reuse of
772 and the like is achieved. But the small symbol table is easy to illustrate.
In typical data sets, Lempel-Ziv variants achieve much better compression rates than Huffman or RLE. On the other hand, Lempel-Ziv variants are very pricey cycle-wise, and can use large tables in memory. Most real-life compression tools and libraries use a combination of Lempel-Ziv and Huffman techniques.
Solving the right problem
Just as choosing the right algorithm can often create orders-of-magnitude improvements over even heavily optimized wrong algorithms, choosing the right data representation is often even more important than compression methods (which are always a sort of post hoc optimization of desired features). The simple data-set example used in this article is a perfect case where re-conceptualizing the problem would actually be a much better approach than using any of the compression techniques illustrated.
Think again about what our data represents. It is not a very general collection of data, and the rigid a priori constraints allow us to reformulate our whole problem. What we have is a maximum of 30,000 telephone numbers (7720000 through 7749999), some of which are active, and others of which are not. We do not have a "duty," as it were, to produce a full representation of each telephone number that is active, but simply to indicate the binary fact that it is active. Thinking of the problem this way, we can simply allocate 30,000 bits of memory and storage, and have each bit indicate "yes" or "no" to the presence of one telephone number. The ordering of the bits in the bit array can be simple ascending order from the lowest to the highest telephone number in the range.
This bit-array solution is the best in almost every respect. It allocates exactly 3750 bytes to represent the data set; the various compression techniques will use a varying amount of storage depending on both the number of telephone numbers in the set and the efficiency of the compression. But if 10,000 of the 30,000 possible telephone numbers are active, and even a very efficient compression technique requires several bytes per telephone number, then the bit array is an order-of-magnitude better. In terms of CPU demands, the bit array is not only better than any of the discussed compression methods, it is also quite likely to be better than the naive non-compression method of listing all the numbers as strings. Stepping through a bit array and incrementing a "current-telephone-number" counter can be done quite efficiently, and mostly within the on-chip cache of a modern CPU.
The lesson to be learned from this very simple example is certainly not that every problem has some magic shortcut (as this one does). A lot of problems genuinely require significant memory, bandwidth, storage and CPU resources; and in many of those cases compression techniques can help ease -- or shift -- those burdens. But a more moderate lesson could be suggested: Before compression techniques are employed, it's a good idea to make sure that your starting conceptualization of the data representation is a good one.
In honor of Claude Shannon.