Compress stage

The Compress stage is a processing stage. It can have a single input link and a single output link.

The Compress stage uses the UNIX compress or GZIP utility to compress a data set, converting it from a sequence of records into a stream of raw binary data. The complement to the Compress stage is the Expand stage, which is described in Expand Stage.
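As a rough illustration of this pairing, the following Python sketch uses the standard gzip module as a stand-in for the engine's compressor; the newline record framing is an assumption made for this example only, not the stage's actual encoding.

import gzip

# Illustrative stand-ins for the Compress and Expand stages.
# Newline framing is an assumed record delimiter for this sketch only.
records = [b"1,alice", b"2,bob", b"3,carol"]

compressed = gzip.compress(b"\n".join(records))      # Compress stage analogue
restored = gzip.decompress(compressed).split(b"\n")  # Expand stage analogue

assert restored == records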

A compressed data set is similar to an ordinary data set and can be stored in a persistent form by a Data Set stage. However, a compressed data set cannot be processed by many stages until it is expanded, that is, until its rows are returned to their normal format. Stages that do not perform column-based processing or reorder the rows can operate on compressed data sets. For example, you can use the Copy stage to create a copy of the compressed data set.

Because compressing a data set removes its normal record boundaries, the compressed data set must not be repartitioned before it is expanded.
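As a rough demonstration of both points (again using Python's gzip module as a stand-in), a byte-for-byte copy of a compressed stream remains decompressible, but cutting the stream at an arbitrary byte offset, which is effectively what repartitioning would do, yields fragments that cannot be decompressed on their own.

import gzip

data = gzip.compress(b"rec1\nrec2\nrec3\n")

# A byte-for-byte copy (the Copy stage analogue) is harmless.
copied = bytes(data)
assert gzip.decompress(copied) == b"rec1\nrec2\nrec3\n"

# Splitting at an arbitrary byte offset destroys the stream's integrity.
head, tail = data[:10], data[10:]
try:
    gzip.decompress(tail)  # no gzip header: invalid on its own
except (OSError, EOFError):
    print("fragment is not independently decompressible")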

Figure: A Compress stage compressing a data set

DataStage® wraps the existing data set schema as a subrecord within a generic compressed schema. For example, given a data set with a schema of:
a:int32;
b:string[50];
The schema for the compressed data set would be:
record
  ( t: tagged {preservePartitioning=no}
      ( encoded: subrec
          ( bufferNumber: dfloat;
            bufferLength: int32;
            bufferData: raw[32000];
          );
        schema: subrec
          ( a: int32;
            b: string[50];
          );
      );
  )
Therefore, when you reuse a file that has been compressed, make sure that you read it with the compressed schema rather than the original schema that was supplied for compression.
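As a conceptual model only (the field names mirror the schema above; this is not engine code), a reader of a compressed data set should expect records shaped like the encoded subrecord rather than the original columns:

from dataclasses import dataclass

# Hypothetical model of one 'encoded' subrecord from the compressed schema.
@dataclass
class EncodedBuffer:
    bufferNumber: float   # dfloat
    bufferLength: int     # int32
    bufferData: bytes     # raw[32000]

# Applying the original schema (a: int32; b: string[50]) to these bytes
# would misinterpret the buffer fields; use the compressed schema until
# the data set is expanded.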

The stage editor has three pages: