Static and dynamic models
Static models estimate resource utilization at compilation time. Dynamic models predict job performance at run time.
The following table describes the differences between static and dynamic models. Use this table to help you decide which type of model to generate.
| Characteristics | Static models | Dynamic models |
|---|---|---|
| Job run | Not required. | Required. |
| Sample data | Requires automatic data sampling. Uses the actual size of the input data if the size can be determined. Otherwise, the sample size is set to a default value of 1000 records on each output link from each source stage. | Accepts automatic data sampling or a data sampling range that you specify. |
| Scratch space | Estimates are based on a worst-case scenario. | Estimates are based on linear regression. |
| Disk space | Estimates are based on a worst-case scenario. | Estimates are based on linear regression. |
| CPU utilization | Not estimated. | Estimates are based on linear regression. |
| Number of records | Estimates are based on a best-case scenario. No record is dropped. Input data is propagated from the source stages to all other stages in the job. | Dynamically determined. Best-case scenario does not apply. Input data is processed, not propagated. Records can be dropped. Estimates are based on linear regression. |
| Record size | Solely determined by the record schema. Estimates are based on a worst-case scenario. | Dynamically determined by the actual record at run time. Estimates are based on linear regression. |
| Data partitioning | Data is assumed to be evenly distributed among all partitions. | Dynamically determined. Estimates are based on linear regression. |
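Several rows in the table describe dynamic estimates as based on linear regression. The following sketch shows the idea behind such a projection: fit observed resource usage against input size, then extrapolate to production volumes. The sample points and the one-variable fit are illustrative assumptions, not the product's actual model.

```python
import numpy as np

# Hypothetical observations from sampled runs:
# (number of input records, scratch space used in MB).
records = np.array([1000, 2000, 4000, 8000])
scratch_mb = np.array([12.0, 23.5, 48.1, 95.8])

# Fit scratch space as a linear function of input size
# (a degree-1 polynomial fit is a linear regression).
slope, intercept = np.polyfit(records, scratch_mb, 1)

# Project resource usage for a production-sized input.
projected = slope * 1_000_000 + intercept
print(f"Projected scratch space for 1,000,000 records: {projected:,.0f} MB")
```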
When a model is based on a worst-case scenario, the model uses maximum values. For example, if a variable can hold up to 100 characters, the model assumes that the variable always holds 100 characters. When a model is based on a best-case scenario, the model assumes that no single input record is dropped anywhere in the data flow.
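To make the worst-case assumption concrete, the following sketch computes a static record-size estimate by summing the declared maximum size of every field, so a field that can hold up to 100 characters always contributes 100. The schema itself is hypothetical.

```python
# Hypothetical schema: field name -> maximum size in bytes.
schema = {
    "customer_id": 8,    # fixed-length: always 8 bytes
    "name": 100,         # varchar(100): assumed to hold all 100 characters
    "comment": 255,      # varchar(255): assumed to hold all 255 characters
}

# A static model sums the declared maximums, so the estimate
# is an upper bound on the actual record size.
worst_case_bytes = sum(schema.values())
print(f"Worst-case record size: {worst_case_bytes} bytes")  # 363 bytes
```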
The accuracy of a model depends on these factors:
- Schema definition
  - The size of records with variable-length fields cannot be determined until the records are processed. Use fixed-length or bounded-length schemas as much as possible to improve accuracy.
- Input data
  - When the input data contains more records with one key value than others, the records might be unevenly distributed across partitions (see the sketch after this list). Specify a data sampling range that is representative of the input data.
- Parallel processing environment
  - The availability of system resources when you run a job can affect the degree to which buffering occurs. Generate models in an environment that is similar to your production environment in terms of operating system, processor type, and number of processors.
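As referenced under Input data, the following sketch illustrates how a skewed key defeats the even-distribution assumption that static models make. The key values, partition count, and CRC-based hash are illustrative assumptions, not the product's partitioning method.

```python
import zlib
from collections import Counter

# Hypothetical input: 80% of records share one key value.
keys = ["US"] * 8000 + ["CA"] * 1000 + ["MX"] * 1000
num_partitions = 4

# Hash-partition on the key; every record with the same key value
# lands in the same partition.
counts = Counter(zlib.crc32(key.encode()) % num_partitions for key in keys)
for partition in range(num_partitions):
    print(f"partition {partition}: {counts.get(partition, 0)} records")

# A static model assumes 2,500 records per partition; at run time
# one partition receives at least 8,000.
```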