Example applications

Examples of executing SAS PROC statements in parallel.

Example 1: parallelizing PROC steps using the BY keyword

This example parallelizes a SAS application using PROC SORT and PROC MEANS. In this example, you first sort the input to PROC MEANS, then calculate the mean of all records with the same value for the acctno field.

The following figure illustrates this SAS application:

Shows a SAS application using PROC SORT and PROC MEANS

Shown below is the original SAS code:


libname prod "/prod";
proc means data=prod.dhist;
   BY acctno; 
run;

The BY clause in a SAS step signals that you want to hash partition the input to the step. Hash partitioning guarantees that all records with the same value for acctno are sent to the same processing node. The SAS PROC step executing on each node is thus able to calculate the mean for all records with the same value for acctno.
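The guarantee described above can be illustrated outside SAS. The following is a minimal Python sketch, not the InfoSphere DataStage implementation: the hash function (CRC32), the helper names hash_partition and means_by_key, the amount field, and the sample records are all assumptions made for illustration; only acctno comes from the example. It also ignores the requirement that BY input be sorted within each partition.

```python
import zlib
from collections import defaultdict

def hash_partition(records, key, num_nodes):
    """Assign each record to a node using a stable hash of its key field."""
    partitions = [[] for _ in range(num_nodes)]
    for rec in records:
        node = zlib.crc32(str(rec[key]).encode()) % num_nodes
        partitions[node].append(rec)
    return partitions

def means_by_key(records, key, value):
    """Mean of `value` for each distinct `key` in `records`."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [running sum, count]
    for rec in records:
        totals[rec[key]][0] += rec[value]
        totals[rec[key]][1] += 1
    return {k: s / n for k, (s, n) in totals.items()}

# Hypothetical sample data; the amount field is invented for this sketch.
records = [
    {"acctno": 101, "amount": 10.0},
    {"acctno": 202, "amount": 4.0},
    {"acctno": 101, "amount": 30.0},
    {"acctno": 303, "amount": 7.0},
    {"acctno": 202, "amount": 6.0},
]

partitions = hash_partition(records, "acctno", num_nodes=3)

# Each node computes means over its own partition only.
per_node = [means_by_key(p, "acctno", "amount") for p in partitions]

# No acctno spans two nodes, so merging is a plain union of the results.
combined = {}
for result in per_node:
    combined.update(result)

print(combined == means_by_key(records, "acctno", "amount"))
```

Because every record for a given acctno lands in one partition, the per-node means need no further combining step; they already equal the means a single sequential pass would produce.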

Shown below is the implementation of this example:

Shows the SAS application implemented in InfoSphere DataStage

PROC MEANS pipes its results to standard output, and the sas operator sends the results from each partition to standard output as well. Thus the list file created by the sas operator, which you specify using the -listds option, contains the results of the PROC MEANS sorted by processing node.

Shown below is the SAS PROC step for this application:


proc means data=liborch.p_dhist;
   by acctno; 
run;

Example 2: parallelizing PROC steps using the CLASS keyword

One type of SAS BY GROUP processing uses the SAS keyword CLASS. CLASS allows a PROC step to perform BY GROUP processing without your having to first sort the input data to the step. Note, however, that the grouping technique used by the SAS CLASS option requires that all your input data fit in memory on the processing node.

Note also that as your data size increases, you might want to replace CLASS and NWAY with SORT and BY.

Whether you parallelize steps using CLASS depends on the following:

If the step also uses the NWAY keyword, parallelize it.
When the step specifies both CLASS and NWAY, you parallelize it just like a step using the BY keyword, except the step input doesn't have to be sorted. This means you hash partition the input data based on the fields specified to the CLASS option. See the previous section for an example using the hash partitioning method.
If the step uses CLASS without NWAY, execute it sequentially.
If the PROC step generates a report, execute it sequentially unless it has a BY clause.
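The rules above can be read as a small decision function. This Python sketch is only one possible reading: the function name class_step_mode and its parameters are hypothetical, and the order in which the rules are applied (the report rule first, then NWAY) is an assumption, because the list does not say which rule takes precedence when several apply.

```python
def class_step_mode(uses_nway, generates_report=False, has_by=False):
    """Return "parallel" or "sequential" for a PROC step that uses CLASS.

    Assumption: the report rule is checked before the NWAY rule.
    """
    if generates_report:
        # A report-generating step runs sequentially unless it has a BY clause.
        return "parallel" if has_by else "sequential"
    # CLASS with NWAY is parallelized like a BY step; CLASS alone is not.
    return "parallel" if uses_nway else "sequential"

print(class_step_mode(uses_nway=True))    # CLASS with NWAY
print(class_step_mode(uses_nway=False))   # CLASS without NWAY
```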

For example, the following SAS code uses PROC SUMMARY with both the CLASS and NWAY keywords:


libname prod "/prod";
proc summary data=prod.txdlst
   missing NWAY;
   CLASS acctno lstrxd fpxd;
   var xdamt xdcnt;
   output out=prod.xnlstr(drop=_type_ _freq_) sum=;
run;

To parallelize this example, you hash partition the data based on the fields specified in the CLASS option. Note that you do not have to partition the data on all of the fields; you only need to specify enough fields that your data is correctly distributed to the processing nodes.

For example, you can hash partition on acctno if it ensures that your records are properly grouped. Or you can partition on two of the fields, or on all three. An important consideration with hash partitioning is that you should specify as few fields as necessary to the partitioner because every additional field requires additional overhead to perform partitioning.
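The following Python sketch illustrates why partitioning on a subset of the CLASS fields is sufficient: records that agree on all three fields necessarily agree on acctno, so hashing on acctno alone already keeps each group on one node. The field names come from the example; the hash function (CRC32), the helper names, and the sample values are assumptions made for illustration, and the sketch sums only xdamt.

```python
import zlib
from collections import defaultdict

def hash_partition(records, fields, num_nodes):
    """Assign each record to a node by hashing the chosen partitioning fields."""
    partitions = [[] for _ in range(num_nodes)]
    for rec in records:
        key = "|".join(str(rec[f]) for f in fields)
        partitions[zlib.crc32(key.encode()) % num_nodes].append(rec)
    return partitions

def sum_by_class(records, class_fields, var):
    """Sum `var` for every distinct combination of the class fields."""
    sums = defaultdict(float)
    for rec in records:
        sums[tuple(rec[f] for f in class_fields)] += rec[var]
    return dict(sums)

CLASS_FIELDS = ("acctno", "lstrxd", "fpxd")

# Hypothetical sample values for the fields named in the example.
txdlst = [
    {"acctno": 1, "lstrxd": "A", "fpxd": "X", "xdamt": 5.0},
    {"acctno": 1, "lstrxd": "A", "fpxd": "X", "xdamt": 2.5},
    {"acctno": 1, "lstrxd": "B", "fpxd": "X", "xdamt": 1.0},
    {"acctno": 2, "lstrxd": "A", "fpxd": "Y", "xdamt": 4.0},
    {"acctno": 3, "lstrxd": "C", "fpxd": "Z", "xdamt": 8.0},
    {"acctno": 2, "lstrxd": "A", "fpxd": "Y", "xdamt": 6.0},
]

def parallel_sums(partition_fields, num_nodes=4):
    """Summarize each partition separately, then take the union of the results."""
    merged = {}
    for part in hash_partition(txdlst, partition_fields, num_nodes):
        # A plain union is only correct if no group is split across partitions.
        merged.update(sum_by_class(part, CLASS_FIELDS, "xdamt"))
    return merged

expected = sum_by_class(txdlst, CLASS_FIELDS, "xdamt")
on_one_field = parallel_sums(("acctno",))
on_all_fields = parallel_sums(CLASS_FIELDS)

print(on_one_field == expected, on_all_fields == expected)
```

Both partitioning choices give the same summary as a sequential pass. Hashing on acctno alone computes the hash over less data per record, which is the overhead consideration mentioned above.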

The following figure shows the InfoSphere® DataStage® application data flow for this example:

Shows a sas operator passing data to a hash operator

The SAS code (DATA step) for the first sas operator is:


libname prod "/prod";
data liborch.p_txdlst;
   set prod.txdlst;
run;

The SAS code for the second sas operator is:


proc summary data=liborch.p_txdlst
   missing NWAY;
   CLASS acctno lstrxd fpxd;
   var xdamt xdcnt; 
   output out=liborch.p_xnlstr(drop=_type_ _freq_) sum=;
run;