Rules of thumb
Once you have identified a program as a potential candidate for use in InfoSphere® DataStage®, you need to determine how to divide the SAS code itself into InfoSphere DataStage steps.
The sas operator can be run either in parallel or sequentially. Any converted SAS program that satisfies one of the four criteria outlined above will contain at least one parallel segment. How much of the program should be contained in this segment? Are there portions of the program that need to be implemented in sequential segments? When does a SAS program require multiple parallel segments? Here are some guidelines you can use to answer such questions.
- Identify the slow portions of the sequential SAS program by inspecting the CPU and real-time values for each of the PROC and DATA steps in your application. Typically, these are steps that manipulate records (CPU-intensive) or that sort or merge (memory-intensive). You should be looking for times that are a relatively large fraction of the total run time of the application and that are measured in units of many minutes to hours, not seconds to minutes. You might need to set the SAS fullstimer option on in your config.sas612 or in your SAS program itself to generate a log of these sequential run times.
- Start by parallelizing only those slow portions of the application that you identified in the previous guideline. As you include more code within the parallel segment, remember that each parallel copy of your code (referred to as a partition) sees only a fraction of the data. This fraction is determined by the partitioning method you specify on the input or inpipe lines of your sas operator source code.
- Any two sas operators should be connected by only one pipe. This ensures that all pipes in the InfoSphere DataStage program are simultaneously flowing for the duration of the execution. If two segments are connected by multiple pipes, each pipe must drain entirely before the next one can start.
- Keep the number of sas operators to a minimum. There is a performance cost associated with each operator that is included in the data flow. However, the preceding one-pipe guideline takes precedence over this one: when reducing the number of operators means connecting any two operators with more than one pipe, don't do it.
- If you are reading or writing a sequential file such as a flat ASCII text file or a SAS data set, you should include the SAS code that does this in a sequential sas operator. Use one sequential operator for each such file. You will see better performance inputting one sequential file per operator than if you lump many inputs into one segment followed by multiple pipes to the next segment, in line with the one-pipe guideline above.
- When you choose a partition key or combination of keys for a parallel operator, you should keep in mind that the best overall performance of the parallel application occurs if each of the partitions is given approximately equal quantities of data. For instance, if you are hash partitioning by the key field year (which has five values in your data) into five parallel segments, you will end up with poor performance if there are big differences in the quantities of data for each of the five years. The application is held up by the partition that has the most data to process. If there is no data at all for one of the years, the application will fail because the SAS process that gets no data will issue an error statement. Furthermore, the failure will occur only after the slowest partition has finished processing, which might be well into your application. The solution might be to partition by multiple keys, for example, year and storeNumber, to use roundrobin partitioning where possible, to use a partitioning key that has many more values than there are partitions in your InfoSphere DataStage application, or to keep the same key field but reduce the number of partitions. All of these methods should serve to balance the data distribution over the partitions.
- Multiple parallel segments in your InfoSphere DataStage application are required when you need to parallelize portions of code that are sorted by different key fields. For instance, if one portion of the application performs a merge of two data sets using the patientID field as the BY key, this DATA step merge will need to be in a parallel segment that is hash partitioned on the key field patientID. If another portion of the application performs a PROC MEANS of a data set using the procedureCode field as the CLASS key, this PROC MEANS will have to be in a parallel sas operator that is hash partitioned on the procedureCode key field.
- If you are running a query that includes an ORDER BY clause against a relational database, you should remove it and do the sorting in parallel, using either SAS PROC SORT or an InfoSphere DataStage input line order statement. Performing the sort in parallel outside of the database removes the sequential bottleneck of sorting within the RDBMS.
- A sort that has been performed in a parallel operator will order the data only within that operator. If the data is then streamed into a sequential operator, the sort order will be lost. You will need to re-sort within the sequential operator to guarantee order.
- Within a parallel sas operator you can use only SAS work directories for intermediate writes to disk. SAS generates unique names for the work directories of each of the parallel operators. In an SMP environment this is necessary because it prevents the multiple CPUs from writing to the same work file. Do not use a custom-specified SAS library within a parallel operator.
- Do not use a liborch directory within a parallel segment unless it is connected to an inpipe or an outpipe. A liborch directory cannot be both written and read within the same parallel operator.
- A liborch directory can be used only once for an input, inpipe, output or outpipe. If you need to read or write the contents of a liborch directory more than once, you should write the contents to disk via a SAS work directory and copy this as needed.
- Remember that all InfoSphere DataStage operators in a step run simultaneously. This means that you cannot write to a custom-specified SAS library as output from one InfoSphere DataStage operator and simultaneously read from it in a subsequent operator. Connections between operators must be via InfoSphere DataStage pipes, which are virtual data sets, normally in SAS data set format.
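
The guideline on identifying slow steps can be illustrated with a small triage sketch. It is written in Python purely for illustration and does not parse a SAS log: the step names and timings are hypothetical values assumed to have been copied by hand from a fullstimer log, and the two thresholds (at least 20% of total run time and at least 10 minutes) are assumptions chosen to reflect "a relatively large fraction" and "many minutes to hours".

```python
def slow_steps(step_seconds, min_fraction=0.2, min_seconds=600):
    """Return (step, seconds) pairs worth parallelizing, slowest first.

    step_seconds: mapping of step name to real-time seconds (hypothetical,
    transcribed by hand from a fullstimer log).
    min_fraction, min_seconds: illustrative thresholds, not product defaults.
    """
    total = sum(step_seconds.values())
    if total == 0:
        return []
    picked = [
        (name, secs)
        for name, secs in step_seconds.items()
        if secs >= min_seconds and secs / total >= min_fraction
    ]
    return sorted(picked, key=lambda item: item[1], reverse=True)


# Hypothetical timings for a four-step sequential SAS program.
timings = {
    "DATA clean": 45.0,
    "PROC SORT": 2400.0,
    "DATA merge": 3100.0,
    "PROC PRINT": 12.0,
}

candidates = slow_steps(timings)
for name, secs in candidates:
    print(f"{name}: {secs / 60:.1f} min")
```

With these made-up numbers only the sort and the merge qualify, which matches the expectation that sort and merge steps dominate the run time.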
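
The partition-balance guideline can be made concrete with a rough simulation. The sketch below is Python, not InfoSphere DataStage code, and it does not reproduce the product's hash function: `zlib.crc32` stands in as an arbitrary deterministic hash, and the record volumes, the 50 store numbers, and the helper names are all hypothetical. It compares hashing on year alone, hashing on year plus storeNumber, and round-robin distribution across five partitions.

```python
import zlib


def hash_partition(records, keys, n_partitions):
    """Count records per partition when hashing on the given key fields.

    zlib.crc32 is a stand-in hash chosen for determinism; the real
    partitioner's hash function is not modeled here.
    """
    counts = [0] * n_partitions
    for rec in records:
        key = "|".join(str(rec[k]) for k in keys).encode()
        counts[zlib.crc32(key) % n_partitions] += 1
    return counts


def round_robin_partition(records, n_partitions):
    """Count records per partition when dealing them out in turn."""
    counts = [0] * n_partitions
    for i, _ in enumerate(records):
        counts[i % n_partitions] += 1
    return counts


# Hypothetical data: only four years have records, with very uneven volumes.
volumes = {2019: 100, 2020: 200, 2021: 5000, 2022: 300}
records = [
    {"year": year, "storeNumber": i % 50}
    for year, n in volumes.items()
    for i in range(n)
]

by_year = hash_partition(records, ["year"], 5)
by_year_store = hash_partition(records, ["year", "storeNumber"], 5)
by_round_robin = round_robin_partition(records, 5)

print("year only:         ", by_year)
print("year + storeNumber:", by_year_store)
print("round robin:       ", by_round_robin)
```

Because only four distinct year values are spread over five partitions, hashing on year alone must leave at least one partition empty, which is the failure case described above, and the partition that receives the 2021 records carries most of the load. The composite key has 200 distinct values, so it has far more opportunity to spread evenly, while round robin is balanced by construction.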
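
The guideline about sort order being lost when parallel output flows into a sequential operator can be seen in a toy model. This Python sketch assumes three partitions holding hypothetical key values and models the sequential operator as simply reading the partitions one after another; the real collection order is not specified here and could be different.

```python
# Hypothetical key values held by three parallel partitions.
partitions = [[7, 1, 4], [9, 2, 8], [3, 6, 5]]

# Each partition sorts only the data it sees.
locally_sorted = [sorted(part) for part in partitions]

# Assumed collection model: the sequential operator reads partition
# after partition, so locally ordered runs are simply concatenated.
collected = [value for part in locally_sorted for value in part]

# The combined stream is not globally ordered.
is_ordered = collected == sorted(collected)

# Re-sorting inside the sequential operator restores a guaranteed order.
resorted = sorted(collected)

print("collected:", collected, "ordered?", is_ordered)
print("re-sorted:", resorted)
```

Each partition is correctly ordered on its own, yet the collected stream is not, which is why the sequential operator has to sort again.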