Example applications

The following examples show how to write SAS applications for parallel execution.

Example 1: parallelizing a SAS DATA step

This section contains an example that executes a SAS DATA step in parallel. Here is a figure describing this step:

Shows a SAS step taking a single data set and outputting a single data set

The step takes a single SAS data set as input and writes its results to a single SAS data set as output. The DATA step recodes the salary field of the input data to replace a dollar amount with a salary-scale value. Here is the original SAS code:


libname prod "/prod";
data prod.sal_scl;
   set prod.sal;
   if (salary < 10000)
      then salary = 1;
   else if (salary < 25000)
      then salary = 2;
   else if (salary < 50000)
      then salary = 3;
   else if (salary < 100000)
      then salary = 4;
   else salary = 5;
run;

This DATA step requires little effort to parallelize because it processes records without regard to record order or relationship to any other record in the input. Also, the step performs the same operation on every input record and contains no BY clauses or RETAIN statements.
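The record-by-record nature of the step can be checked on a few made-up records. The sketch below is only an illustration: the data set names work.sal_demo and work.sal_scl_demo, the empid field, and the salary values are assumptions, not part of the original example.

```sas
/* Hypothetical test data: three records with made-up salaries. */
data work.sal_demo;
   input empid salary;
   datalines;
101 8500
102 42000
103 120000
;
run;

/* Same recoding logic as the original step. Each record is recoded
   using only its own salary value, so splitting the input across
   nodes in any way gives the same per-record result. */
data work.sal_scl_demo;
   set work.sal_demo;
   if salary < 10000 then salary = 1;
   else if salary < 25000 then salary = 2;
   else if salary < 50000 then salary = 3;
   else if salary < 100000 then salary = 4;
   else salary = 5;
run;

proc print data=work.sal_scl_demo;
run;
```

With this input, the three records are recoded to the scale values 1, 3, and 5, regardless of the order in which they are read.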

The following figure shows the InfoSphere® DataStage® data flow diagram for executing this DATA step in parallel:

Shows the same SAS step being executed in parallel using the sasin, sas, and sasout operators

In this example, you:

Get the input from a SAS data set using a sequential sas operator.
Execute the DATA step in a parallel sas operator.
Output the results as a standard InfoSphere DataStage data set (you must provide a schema for this) or as a Parallel SAS data set. You can also pass the output to another sas operator for further processing. One way to obtain the required schema is to first output the data to a Parallel SAS data set and then reference that data set; InfoSphere DataStage then generates the schema automatically.

The first sequential sas operator executes the following SAS code as defined by the -source option to the operator:


libname prod "/prod";
data liborch.out;
   set prod.sal;
run;

The parallel sas operator executes the following SAS code:


libname prod "/prod";
data liborch.p_sal;
   set liborch.sal;
   . . . (salary field code from the original DATA step above)
run;

The sas operator can then do one of three things: use the sasout operator with its -schema option to output the results as a standard InfoSphere DataStage data set, output the results as a Parallel SAS data set, or pass the output directly to another sas operator as a SAS data set. The default output format is SAS data set. When the output goes to a Parallel SAS data set, to another sas operator, or to a standard InfoSphere DataStage data set, the liborch statement must be used. Conversion of the output to a standard InfoSphere DataStage data set or a Parallel SAS data set is discussed in SAS data set format and Parallel SAS data set format.

Example 2: using the hash partitioner with a SAS DATA step

This example reads two Informix tables as input, hash partitions them on the workloc field, then uses a SAS DATA step to merge the data and score it before writing it out to a parallel SAS data set.

Shows two Informix tables being read, the data passed through hash partitioners

The sas operator in this example runs the following DATA step to perform the merge and score:


data liborch.emptabd;
   merge liborch.wltab liborch.emptab;
   by workloc;
   a_13 = (f1-f3)/2;
   a_15 = (f1-f5)/2;
   .
   .
   .
run;

Records are hashed based on the workloc field. For the merge to work correctly, all records with the same value for workloc must be sent to the same processing node, and the records must be sorted on workloc. The parallel sas operator then merges and scores the data and writes it out to a parallel SAS data set.
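The sorting requirement can be tried in an ordinary SAS session before the step is moved into a sas operator. The following is a minimal sketch under stated assumptions: the data set names (work.wltab_demo, work.emptab_demo, work.emptabd_demo), the empid field, the placement of f1, f3, and f5 in the work-location table, and all data values are hypothetical, since the source does not show the table layouts.

```sas
/* Hypothetical work-location table: one record per workloc. */
data work.wltab_demo;
   input workloc $ f1 f3 f5;
   datalines;
NYC 10 4 2
BOS 8 2 6
;
run;

/* Hypothetical employee table: many records per workloc. */
data work.emptab_demo;
   input empid workloc $;
   datalines;
1 NYC
2 BOS
3 NYC
;
run;

/* MERGE with a BY statement expects both inputs sorted on the BY variable. */
proc sort data=work.wltab_demo;
   by workloc;
run;
proc sort data=work.emptab_demo;
   by workloc;
run;

/* Match-merge on workloc, then compute the score fields. */
data work.emptabd_demo;
   merge work.wltab_demo work.emptab_demo;
   by workloc;
   a_13 = (f1 - f3) / 2;
   a_15 = (f1 - f5) / 2;
run;
```

Each employee record picks up the f1, f3, and f5 values of its own work location. In the parallel flow, hashing on workloc plays the role of keeping every record of one location on the same node, so that each node can perform this merge on its own share of the data.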

Example 3: using a SAS SUM statement

This section contains an example that uses the SUM statement in a DATA step. In this example, the DATA step outputs a SAS data set where each record contains two fields: an account number and a total transaction amount. The total is calculated by summing all the deposits and withdrawals for the account in the input data, where each input record contains one transaction.

Here is the SAS code for this example:


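libname prod "/prod";
/* Sort by account number so that BY-group processing works. */
proc sort data=prod.trans out=prod.s_trans;
   by acctno;
run;
/* Write one record per account with the net transaction total. */
data prod.up_trans (keep = acctno sum);
   set prod.s_trans;
   by acctno;
   if first.acctno then sum = 0;
   if type = "D"
      then sum + amt;
   /* The sum statement only accepts variable + expression,
      so a withdrawal adds the negated amount. */
   if type = "W"
      then sum + (-amt);
   if last.acctno then output;
run;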

The sum variable is reset to zero at the beginning of each group of records, where the groups are determined by the account number field (acctno). Because this DATA step uses the BY clause, you use the InfoSphere DataStage hash partitioning operator with this step to make sure all records from the same group are assigned to the same node.

Note that DATA steps using SUM without the BY clause view their input as one large group. Therefore, if the step used SUM but not BY, you would execute the step sequentially so that all records would be processed on a single node.
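A minimal sketch of the case without BY, assuming a hypothetical data set work.trans_demo with made-up transactions (the names, the END= variable last, and the values are illustrative only):

```sas
/* Hypothetical transactions: account number, type (D or W), amount. */
data work.trans_demo;
   input acctno type $ amt;
   datalines;
1001 D 100
1002 D 50
1001 W 30
;
run;

/* No BY statement: the whole input is treated as one group, so a
   single running total accumulates across every record. */
data work.total_demo (keep = sum);
   set work.trans_demo end=last;
   if type = "D" then sum + amt;
   if type = "W" then sum + (-amt);   /* subtract a withdrawal */
   if last then output;               /* one record with the grand total */
run;
```

Run sequentially, this writes a single record with sum equal to 120. If the same step ran in parallel, each node would see only part of the input and would write its own partial total, which is why a step like this is executed on a single node.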

Shown below is the data flow diagram for this example:

Shows a SAS data set being imported, hashed and then summed

The SAS DATA step executed by the second sas operator is:


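data liborch.nw_trans (keep = acctno sum);
   set liborch.p_trans;
   by acctno;
   if first.acctno then sum = 0;
   if type = "D"
      then sum + amt;
   /* The sum statement only accepts variable + expression,
      so a withdrawal adds the negated amount. */
   if type = "W"
      then sum + (-amt);
   if last.acctno then output;
run;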
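The BY-group logic can be checked in a plain SAS session on a handful of records before it is run through the sas operator. This is a sketch only: the work.* data set names and the transaction values are hypothetical. The withdrawal is written as sum + (-amt) because the SAS sum statement has the form variable + expression.

```sas
/* Hypothetical transactions, deliberately out of account order. */
data work.trans_demo;
   input acctno type $ amt;
   datalines;
1002 D 50
1001 D 100
1001 W 30
1002 D 25
;
run;

/* BY-group processing needs the input sorted on acctno. */
proc sort data=work.trans_demo out=work.s_trans_demo;
   by acctno;
run;

/* One output record per account with its net total. */
data work.up_trans_demo (keep = acctno sum);
   set work.s_trans_demo;
   by acctno;
   if first.acctno then sum = 0;
   if type = "D" then sum + amt;
   if type = "W" then sum + (-amt);
   if last.acctno then output;
run;

proc print data=work.up_trans_demo;
run;
```

For this input the step writes two records: account 1001 with a sum of 70 and account 1002 with a sum of 75. Because every account's records are processed together, hashing on acctno lets each node produce complete totals for the accounts it receives.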