I first learned about inversion lists in relation to Perl 6 Unicode programming. In Unicode programming, inversion lists are essential for storing binary properties of character ranges. Dr. Mark Davis, the president of the Unicode consortium, was kind enough to point me to some C implementations of inversion lists. He said (paraphrased):
"I first started using inversion lists in about 1985. They are compact for most data, have fast lookup, and allow fast boolean operations. I don't know of any pre-existing implementations, but it is simple enough that it could well have been developed multiple times."
In addition, Richard Gillam's book "Unicode Demystified" (see Resources later in this article for a link) contained helpful information about inversion lists, in addition to many other fascinating topics. I recommend it highly for anyone interested in Unicode programming.
So what are inversion lists? Inversion lists are best described as a condensed summary of a bit string. They are similar to a simple run-length encoding of data, though there are some differences.
Let's look at an illustrative example. Suppose you want to encode the bit string "1110011." An inversion list would store a list of three numbers: 0, 3, 5. All we store is the start position of the 1s, then the start position of the 0s, then the position of 1s again, and so on until the bit string is over.
If the run begins with a 1, we start the list with the number 0, meaning the 1s start the bit string. This rule can be inverted to store a 0 if the run begins with a 0 bit, but the effect is the same as long as both the encoder and the decoder of the inversion list agree on this detail. In fact, this rule is the only tricky thing about inversion lists!
If we are building an inversion list for searching only, we do not need to store the position of the last bit. If, however, we want to construct the full original data, we need to know where to stop adding bits.
Making an inversion list, then, is nothing more than counting bits. For specific applications such as Unicode character ranges, inversion lists can save you a lot of time and effort.
Algorithm::InversionList module
I wrote a module called Algorithm::InversionList that makes an inversion list from any data. It has been uploaded to the CPAN Perl module archive, and is freely available (see Resources for a link).
At first, I tried to write Algorithm::InversionList using regular expressions and the index() function. That worked, but it was slow and pretty confusing. The current implementation, which is on CPAN as version 0.03, uses the Perl vec() function to get and store bits from the data. There is room for optimization -- for instance, the retrieval and storage of bits is done one by one, and it would be much more efficient if done in groups of eight.
There are four parts of the Algorithm::InversionList module. First, the preamble:
Listing 1. Algorithm::InversionList preamble
package Algorithm::InversionList; use 5.006; use strict; use warnings; require Exporter; our @ISA = qw(Exporter); our @EXPORT = qw( invlist data_from_invlist ); our $VERSION = '0.03'; |
This is basic bookkeeping; it states that we require Perl 5.006, that we like to use strict variable checking and warnings, and that we will export the invlist() and data_from_invlist() functions. This module is procedural, meaning that there are no objects, you simply get the two imported functions, and then you can use them to produce inversion lists and to convert inversion lists back to regular data. See Resources for information on the Exporter module.
Next comes the invlist() function, which produces inversion lists.
Listing 2. The invlist() function
sub invlist
{
my $string = shift @_;
# we need a valid string
return undef unless defined $string;
# handle trivial case of 0-length string
return [] unless length $string;
# this is suboptimal, we eventually want to do things
# in multiples of 8 (on byte boundaries)
# $length is length in bits, we avoid b* because it will
# create a list 8 times larger than C*
my @unpacked = unpack("C*", $string);
my $length = scalar @unpacked * 8;
my $current = vec($string, 0, 1);
my $new;
my @list;
push @list, 0 if $current;
foreach my $offset (0..$length)
{
$new = vec($string, $offset, 1);
if ($new != $current)
{
push @list, $offset;
}
$current = $new;
}
push @list, $length;
return \@list;
}
|
We start with the data given to us in $string. If it is undefined, we return undef as well. If it is empty, we return an empty array. Note how Perl distinguishes between undefined and empty (length 0) strings.
We initialize our variables, finding the length of the string from a call to unpack(). Read the documentation for pack() and unpack() in the Resources section to understand why we use the "C*" template. Then, re-read that documentation, because it is one of the most essential pieces of Perl knowledge you can have.
The variable $current will always have the current bit (the last one we examined). We start with an empty list, and we insert a 0 if the first bit is 1. That way, we indicate that the run of 1s starts at position 0.
We use the vec() function -- the documentation for it is also linked in the Resources section. Here, we use it to get only one bit at a time, but vec() can retrieve much larger blocks of bits.
Now, we run the main loop. For each bit in the string, we get the bit into $new, and if it's different from $current (the previous bit), we push the offset (stored in the aptly named $offset variable) at the end of the list @list. Then, we assign $new to $current so it will be the "old" bit on the next loop iteration.
After the loop is done, we push the length of the data at the end of the list. This is always necessary, since it will tell the decoder how long the last run of bits will be. Then we return a reference to @list to our caller, and that's all.
Listing 3. The data_from_invlist() function
sub data_from_invlist
{
my $out = ''; # start with a blank string
my $append = 0; # we're appending '0' bits first
my $list = shift @_;
return undef unless defined $list;
my $start_offset = 0;
foreach my $end_offset (@$list) # for each inversion list value
{
foreach my $offset ($start_offset .. $end_offset-1)
{
vec($out, $offset, 1) = $append;
}
$append++; # 0 => 1, 1 => 2
$append %= 2; # 2 => 0, 1 => 1
$start_offset = $end_offset;
}
return $out; # return the data
}
|
The necessary companion to the invlist() function is data_from_invlist(), which generates the original data back from the inversion list.
First, we initialize $out, which will be our output. We start appending 0s first, because the first offset we're given will tell us to start appending 1s. That's why $append is initialized to 0.
For each offset in the inversion list, we go from the previous offset (0 for the first item) to the current offset minus 1 and set all the bits in the output at those offsets to $append. I called it $append, because the vec() function will append data to $out when $out is not long enough for the offset.
Note that if the end offset is equal to the start offset, nothing happens, because the .. operator does not go backwards.
After every item is processed, we increment $append and mod it to 2, with the net effect that 0 becomes 1 and 1 becomes 0.
Finally, we make the starting offset for the next loop the ending offset of the current one.
At the end of the module is the POD documentation. You may also want to look at the test script included with the module, test.pl. It has some neat testing automation.
It may seem that inversion lists are useless for regular data, but in fact they can be easily extended to handle regular data. We just need to run eight inversion lists in parallel, encoding the eight bits that make up each byte. Why bytes? Because that way we can find repeating bytes, the most likely thing to repeat in a long data sequence. In other words, we don't want to encode
101000111010001110100011101000111010001110100011101000111010001110100011
(10100011 repeated 9 times) with an inversion list; we want to encode the eight parallel channels of 9 1s, 9 0s, 9 1s, 9 0s, 9 0s, 9 0s, 9 1s, and 9 1s. There are much better compression algorithms; this is just a demonstration of how inversion lists can be applied to wider channels than a single bit string.
Listing 4. Bit channels
sub do_channels
{
my $data = shift @_;
print "Test data $data\n";
my $n = 8;
# you don't have to change this unpack if $n changes
my @unpacked = unpack("C*", $data);
my $length = scalar @unpacked;
$length *= 8/$n;
my @channels = ('' x $n);
# multiplex the data onto $n channels
foreach my $channel (0..$n-1)
{
foreach my $offset (0..$length-1)
{
vec($channels[$channel], $offset, 1) = vec($data, $offset*$n+$channel, 1);
}
}
my @inv = map { invlist($_) } @channels;
foreach my $channel (0..$n-1)
{
my $list = $inv[$channel];
print "Channel $channel: @$list\n",
}
my @out = map { data_from_invlist($_) } @inv;
my $out = '';
# demux the $n channels back
foreach my $channel (0..$n-1)
{
foreach my $offset (0..$length-1)
{
vec($out, $offset*$n+$channel, 1) = vec($out[$channel], $offset, 1);
}
}
print "All OK\n" if $out eq $data;
}
|
What's going on here? The do_channels() function uses the $n variable (8 by default) to decide how many channels to use for the data in $data. If the data were, for instance, repeating every 32 bits, you want $n set to 32 so each channel will contain repetitive data.
The @channels list is initialized to be $n empty strings. Then, each string is filled with the bits corresponding to that channel from $data. We obtain the inversion list for each channel's data using a map() invocation on @channels, store it in the @inv list, and then convert back to data in @out using map() on the @inv list.
Note the elegant use of map() in this case. Using map() is very nice for transforming one list into another as a mapping, especially when the number of list elements is not too big.
Finally, we put the channels back into the $out string, using the data in the @out list. We check that $out and $data are identical, and they should be.
I hope you enjoyed reading about inversion lists. There are other structures such as inversion maps and compact arrays that also merit a closer look; see the Gillam book in the Resources section for more information on those.
Implementing inversion lists in Perl was both fun and vexing. Perl is very helpful for the programmer, but that can be an impediment when implementing bit-oriented algorithms. Also, you need to be aware of the potential speed impact of any statement, much more so than in C.
If you have suggestions for more features in Algorithm::InversionList, or other algorithms and data structures in that vein, feel free to drop me a line.
- Read Ted's other Perl articles in the "Cultured Perl" series on developerWorks.
- The CPAN module archive is a repository
of useful Perl code.
- You can download the
Algorithm-InversionListmodule and theAlgorithm::InversionListmanual pages. - Ted highly recommends the book Unicode Demystified: A Practical Programmer's Guide to the Encoding Standard by Richard Gillam (Addison-Wesley, 2002). This excellent book about Unicode programming covers inversion lists and inversion maps as well.
- The Unicode Consortium is the place to go if you are interested in Unicode programming.
- The documentation for the Exporter module, the
pack()function documentation, thevec()function documentation, and theunpack()function documentation will prove useful to you if you start using inversion lists. - You'll find more Linux articles in the developerWorks Linux zone.

Teodor Zlatanov graduated with an M.S. in computer engineering from Boston University in 1999. He has worked as a programmer since 1992, using Perl, Java, C, and C++. His interests are in open source work on text parsing, three-tier client-server database architectures, UNIX system administration, CORBA, and project management. Contact Teodor at tzz-at-iglou.com.