Skip to main content

If you don't have an IBM ID and password, register here.

By clicking Submit, you agree to the developerWorks terms of use.

The first time you sign into developerWorks, a profile is created for you. This profile includes the first name, last name, and display name you identified when you registered with developerWorks. Select information in your developerWorks profile is displayed to the public, but you may edit the information at any time. Your first name, last name (unless you choose to hide them), and display name will accompany the content that you post.

All information submitted is secure.

The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerworks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

By clicking Submit, you agree to the developerWorks terms of use.

All information submitted is secure.

Cultured Perl: Inversion lists with Perl

Condensing bit strings and other data

Teodor Zlatanov (tzz-at-iglou.com), Programmer, Gold Software Systems
Teodor Zlatanov
Teodor Zlatanov graduated with an M.S. in computer engineering from Boston University in 1999. He has worked as a programmer since 1992, using Perl, Java, C, and C++. His interests are in open source work on text parsing, three-tier client-server database architectures, UNIX system administration, CORBA, and project management. Contact Teodor at tzz-at-iglou.com.

Summary:  Inversion lists are an essential part of any Perl programmer's toolkit, especially for those who deal with ranges and Unicode. In this article, Ted explains inversion lists, illustrated by a Perl implementation that he wrote and put on the CPAN network, and shows how inversion lists can be used to compress normal data in addition to bit strings.

Date:  08 Oct 2003
Level:  Intermediate

Comments:  

I first learned about inversion lists in relation to Perl 6 Unicode programming. In Unicode programming, inversion lists are essential for storing binary properties of character ranges. Dr. Mark Davis, the president of the Unicode consortium, was kind enough to point me to some C implementations of inversion lists. He said (paraphrased):

"I first started using inversion lists in about 1985. They are compact for most data, have fast lookup, and allow fast boolean operations. I don't know of any pre-existing implementations, but it is simple enough that it could well have been developed multiple times."

In addition, Richard Gillam's book "Unicode Demystified" (see Resources later in this article for a link) contained helpful information about inversion lists, in addition to many other fascinating topics. I recommend it highly for anyone interested in Unicode programming.

Inversion lists

So what are inversion lists? Inversion lists are best described as a condensed summary of a bit string. They are similar to a simple run-length encoding of data, though there are some differences.

Let's look at an illustrative example. Suppose you want to encode the bit string "1110011." An inversion list would store a list of three numbers: 0, 3, 5. All we store is the start position of the 1s, then the start position of the 0s, then the position of 1s again, and so on until the bit string is over.

If the run begins with a 1, we start the list with the number 0, meaning the 1s start the bit string. This rule can be inverted to store a 0 if the run begins with a 0 bit, but the effect is the same as long as both the encoder and the decoder of the inversion list agree on this detail. In fact, this rule is the only tricky thing about inversion lists!

If we are building an inversion list for searching only, we do not need to store the position of the last bit. If, however, we want to construct the full original data, we need to know where to stop adding bits.

Making an inversion list, then, is nothing more than counting bits. For specific applications such as Unicode character ranges, inversion lists can save you a lot of time and effort.


Algorithm::InversionList module

I wrote a module called Algorithm::InversionList that makes an inversion list from any data. It has been uploaded to the CPAN Perl module archive, and is freely available (see Resources for a link).

At first, I tried to write Algorithm::InversionList using regular expressions and the index() function. That worked, but it was slow and pretty confusing. The current implementation, which is on CPAN as version 0.03, uses the Perl vec() function to get and store bits from the data. There is room for optimization -- for instance, the retrieval and storage of bits is done one by one, and it would be much more efficient if done in groups of eight.

There are four parts of the Algorithm::InversionList module. First, the preamble:


Listing 1. Algorithm::InversionList preamble
package Algorithm::InversionList;

use 5.006;
use strict;
use warnings;

require Exporter;

our @ISA = qw(Exporter);

our @EXPORT = qw(
invlist data_from_invlist	
);
our $VERSION = '0.03';

This is basic bookkeeping; it states that we require Perl 5.006, that we like to use strict variable checking and warnings, and that we will export the invlist() and data_from_invlist() functions. This module is procedural, meaning that there are no objects, you simply get the two imported functions, and then you can use them to produce inversion lists and to convert inversion lists back to regular data. See Resources for information on the Exporter module.

Next comes the invlist() function, which produces inversion lists.


Listing 2. The invlist() function
sub invlist
{
 my $string = shift @_;

 # we need a valid string
 return undef unless defined $string;

 # handle trivial case of 0-length string
 return [] unless length $string;

 # this is suboptimal, we eventually want to do things
 #    in multiples of 8 (on byte boundaries)
 # $length is length in bits, we avoid b* because it will
 #    create a list 8 times larger than C*
 my @unpacked = unpack("C*", $string);
 my $length = scalar @unpacked * 8;
 my $current = vec($string, 0, 1);
 my $new;
 my @list;

 push @list, 0 if $current;

 foreach my $offset (0..$length)
 {  
  $new = vec($string, $offset, 1);
  if ($new != $current)
  {
   push @list, $offset;
  }
  $current = $new;
 }

 push @list, $length;

 return \@list;
}

We start with the data given to us in $string. If it is undefined, we return undef as well. If it is empty, we return an empty array. Note how Perl distinguishes between undefined and empty (length 0) strings.

We initialize our variables, finding the length of the string from a call to unpack(). Read the documentation for pack() and unpack() in the Resources section to understand why we use the "C*" template. Then, re-read that documentation, because it is one of the most essential pieces of Perl knowledge you can have.

The variable $current will always have the current bit (the last one we examined). We start with an empty list, and we insert a 0 if the first bit is 1. That way, we indicate that the run of 1s starts at position 0.

We use the vec() function -- the documentation for it is also linked in the Resources section. Here, we use it to get only one bit at a time, but vec() can retrieve much larger blocks of bits.

Now, we run the main loop. For each bit in the string, we get the bit into $new, and if it's different from $current (the previous bit), we push the offset (stored in the aptly named $offset variable) at the end of the list @list. Then, we assign $new to $current so it will be the "old" bit on the next loop iteration.

After the loop is done, we push the length of the data at the end of the list. This is always necessary, since it will tell the decoder how long the last run of bits will be. Then we return a reference to @list to our caller, and that's all.


Listing 3. The data_from_invlist() function
sub data_from_invlist
{
 my $out = '';              # start with a blank string
 my $append = 0;            # we're appending '0' bits first
 my $list = shift @_;

 return undef unless defined $list;

 my $start_offset = 0;
 foreach my $end_offset (@$list)    # for each inversion list value
 {
  foreach my $offset ($start_offset .. $end_offset-1)
  {
   vec($out, $offset, 1) = $append;
  }

  $append++;                # 0 => 1, 1 => 2
  $append %= 2;             # 2 => 0, 1 => 1
  $start_offset = $end_offset;
 }

 return $out;               # return the data
}

The necessary companion to the invlist() function is data_from_invlist(), which generates the original data back from the inversion list.

First, we initialize $out, which will be our output. We start appending 0s first, because the first offset we're given will tell us to start appending 1s. That's why $append is initialized to 0.

For each offset in the inversion list, we go from the previous offset (0 for the first item) to the current offset minus 1 and set all the bits in the output at those offsets to $append. I called it $append, because the vec() function will append data to $out when $out is not long enough for the offset.

Note that if the end offset is equal to the start offset, nothing happens, because the .. operator does not go backwards.

After every item is processed, we increment $append and mod it to 2, with the net effect that 0 becomes 1 and 1 becomes 0.

Finally, we make the starting offset for the next loop the ending offset of the current one.

At the end of the module is the POD documentation. You may also want to look at the test script included with the module, test.pl. It has some neat testing automation.


Compressing arbitrary data

It may seem that inversion lists are useless for regular data, but in fact they can be easily extended to handle regular data. We just need to run eight inversion lists in parallel, encoding the eight bits that make up each byte. Why bytes? Because that way we can find repeating bytes, the most likely thing to repeat in a long data sequence. In other words, we don't want to encode

101000111010001110100011101000111010001110100011101000111010001110100011

(10100011 repeated 9 times) with an inversion list; we want to encode the eight parallel channels of 9 1s, 9 0s, 9 1s, 9 0s, 9 0s, 9 0s, 9 1s, and 9 1s. There are much better compression algorithms; this is just a demonstration of how inversion lists can be applied to wider channels than a single bit string.


Listing 4. Bit channels
sub do_channels
{
 my $data = shift @_;
 print "Test data $data\n";
 my $n = 8;

 # you don't have to change this unpack if $n changes
 my @unpacked = unpack("C*", $data);
 my $length = scalar @unpacked;
 $length *= 8/$n;

 my @channels = ('' x $n);

 # multiplex the data onto $n channels
 foreach my $channel (0..$n-1)
 {
  foreach my $offset (0..$length-1)
  {
   vec($channels[$channel], $offset, 1) = vec($data, $offset*$n+$channel, 1);
  }
 }

 my @inv = map { invlist($_) } @channels;

 foreach my $channel (0..$n-1)
 {
  my $list = $inv[$channel];
  print "Channel $channel: @$list\n", 
 }

 my @out = map { data_from_invlist($_) } @inv;
 my $out = '';

 # demux the $n channels back
 foreach my $channel (0..$n-1)
 {
  foreach my $offset (0..$length-1)
  {
   vec($out, $offset*$n+$channel, 1) = vec($out[$channel], $offset, 1);
  }
 }
 print "All OK\n" if $out eq $data;
}

What's going on here? The do_channels() function uses the $n variable (8 by default) to decide how many channels to use for the data in $data. If the data were, for instance, repeating every 32 bits, you want $n set to 32 so each channel will contain repetitive data.

The @channels list is initialized to be $n empty strings. Then, each string is filled with the bits corresponding to that channel from $data. We obtain the inversion list for each channel's data using a map() invocation on @channels, store it in the @inv list, and then convert back to data in @out using map() on the @inv list.

Note the elegant use of map() in this case. Using map() is very nice for transforming one list into another as a mapping, especially when the number of list elements is not too big.

Finally, we put the channels back into the $out string, using the data in the @out list. We check that $out and $data are identical, and they should be.


Conclusion

I hope you enjoyed reading about inversion lists. There are other structures such as inversion maps and compact arrays that also merit a closer look; see the Gillam book in the Resources section for more information on those.

Implementing inversion lists in Perl was both fun and vexing. Perl is very helpful for the programmer, but that can be an impediment when implementing bit-oriented algorithms. Also, you need to be aware of the potential speed impact of any statement, much more so than in C.

If you have suggestions for more features in Algorithm::InversionList, or other algorithms and data structures in that vein, feel free to drop me a line.


Resources

About the author

Teodor Zlatanov

Teodor Zlatanov graduated with an M.S. in computer engineering from Boston University in 1999. He has worked as a programmer since 1992, using Perl, Java, C, and C++. His interests are in open source work on text parsing, three-tier client-server database architectures, UNIX system administration, CORBA, and project management. Contact Teodor at tzz-at-iglou.com.

Report abuse help

Report abuse

Thank you. This entry has been flagged for moderator attention.


Report abuse help

Report abuse

Report abuse submission failed. Please try again later.


developerWorks: Sign in

If you don't have an IBM ID and password, register here.


Forgot your IBM ID?


Forgot your password?
Change your password


By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. This profile includes the first name, last name, and display name you identified when you registered with developerWorks. Select information in your developerWorks profile is displayed to the public, but you may edit the information at any time. Your first name, last name (unless you choose to hide them), and display name will accompany the content that you post.

Choose your display name

The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

(Must be between 3 – 31 characters.)


By clicking Submit, you agree to the developerWorks terms of use.

 


Rate this article

Comments

Help: Update or add to My dW interests

What's this?

This little timesaver lets you update your My developerWorks profile with just one click! The general subject of this content (AIX and UNIX, Information Management, Lotus, Rational, Tivoli, WebSphere, Java, Linux, Open source, SOA and Web services, Web development, or XML) will be added to the interests section of your profile, if it's not there already. You only need to be logged in to My developerWorks.

And what's the point of adding your interests to your profile? That's how you find other users with the same interests as yours, and see what they're reading and contributing to the community. Your interests also help us recommend relevant developerWorks content to you.

View your My developerWorks profile

Return from help

Help: Remove from My dW interests

What's this?

Removing this interest does not alter your profile, but rather removes this piece of content from a list of all content for which you've indicated interest. In a future enhancement to My developerWorks, you'll be able to see a record of that content.

View your My developerWorks profile

Return from help

static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Linux
ArticleID=11342
ArticleTitle=Cultured Perl: Inversion lists with Perl
publish-date=10082003
author1-email=tzz-at-iglou.com
author1-email-cc=

Tags

Help
Use the search field to find all types of content in My developerWorks with that tag.

Use the slider bar to see more or fewer tags.

For articles in technology zones (such as Java technology, Linux, Open source, XML), Popular tags shows the top tags for all technology zones. For articles in product zones (such as Info Mgmt, Rational, WebSphere), Popular tags shows the top tags for just that product zone.

For articles in technology zones (such as Java technology, Linux, Open source, XML), My tags shows your tags for all technology zones. For articles in product zones (such as Info Mgmt, Rational, WebSphere), My tags shows your tags for just that product zone.

Use the search field to find all types of content in My developerWorks with that tag. Popular tags shows the top tags for this particular content zone (for example, Java technology, Linux, WebSphere). My tags shows your tags for this particular content zone (for example, Java technology, Linux, WebSphere).