Static and dynamic models
Static models estimate resource utilization at compilation time. Dynamic models predict job performance at run time.
The following table describes the differences between static and dynamic models. Use this table to help you decide which type of model to generate.
| Characteristics | Static models | Dynamic models |
|---|---|---|
| Job run | Not required. | Required. |
| Sample data | Requires automatic data sampling. Uses the actual size of the input data if the size can be determined. Otherwise, the sample size is set to a default value of 1000 records on each output link from each source stage. | Accepts automatic data sampling or a data sampling range that you specify. |
| Scratch space | Estimates are based on a worst-case scenario. | Estimates are based on linear regression. |
| Disk space | Estimates are based on a worst-case scenario. | Estimates are based on linear regression. |
| CPU utilization | Not estimated. | Estimates are based on linear regression. |
| Number of records | Estimates are based on a best-case scenario. No record is dropped. Input data is propagated from the source stages to all other stages in the job. | Dynamically determined. Best-case scenario does not apply. Input data is processed, not propagated. Records can be dropped. Estimates are based on linear regression. |
| Record size | Solely determined by the record schema. Estimates are based on a worst-case scenario. | Dynamically determined by the actual record at run time. Estimates are based on linear regression. |
| Data partitioning | Data is assumed to be evenly distributed among all partitions. | Dynamically determined. Estimates are based on linear regression. |
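Several rows in the table describe dynamic estimates as based on linear regression. The following sketch shows the idea behind such a projection: fit observed resource usage against input size, then extrapolate to production volumes. The sample points and the one-variable fit are illustrative assumptions, not the product's actual model.

```python
import numpy as np

# Hypothetical observations from sampled runs:
# (number of input records, scratch space used in MB).
records = np.array([1000, 2000, 4000, 8000])
scratch_mb = np.array([12.0, 23.5, 48.1, 95.8])

# Fit scratch space as a linear function of input size
# (a degree-1 polynomial fit is a linear regression).
slope, intercept = np.polyfit(records, scratch_mb, 1)

# Project resource usage for a production-sized input.
projected = slope * 1_000_000 + intercept
print(f"Projected scratch space for 1,000,000 records: {projected:,.0f} MB")
```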
When a model is based on a worst-case scenario, the model uses maximum values. For example, if a variable can hold up to 100 characters, the model assumes that the variable always holds 100 characters. When a model is based on a best-case scenario, the model assumes that no single input record is dropped anywhere in the data flow.
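To make the worst-case assumption concrete, the following sketch computes a static record-size estimate by summing the declared maximum size of every field, so a field that can hold up to 100 characters always contributes 100. The schema itself is hypothetical.

```python
# Hypothetical schema: field name -> maximum size in bytes.
schema = {
    "customer_id": 8,    # fixed-length: always 8 bytes
    "name": 100,         # varchar(100): assumed to hold all 100 characters
    "comment": 255,      # varchar(255): assumed to hold all 255 characters
}

# A static model sums the declared maximums, so the estimate
# is an upper bound on the actual record size.
worst_case_bytes = sum(schema.values())
print(f"Worst-case record size: {worst_case_bytes} bytes")  # 363 bytes
```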
The accuracy of a model depends on these factors:
- Schema definition
  - The size of records with variable-length fields cannot be determined until the records are processed. Use fixed-length or bounded-length schemas as much as possible to improve accuracy.
- Input data
  - When the input data contains more records with one key value than others, the records might be unevenly distributed across partitions (see the sketch after this list). Specify a data sampling range that is representative of the input data.
- Parallel processing environment
  - The availability of system resources when you run a job can affect the degree to which buffering occurs. Generate models in an environment that is similar to your production environment in terms of operating system, processor type, and number of processors.
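As referenced under Input data, the following sketch illustrates how a skewed key defeats the even-distribution assumption that static models make. The key values, partition count, and CRC-based hash are illustrative assumptions, not the product's partitioning method.

```python
import zlib
from collections import Counter

# Hypothetical input: 80% of records share one key value.
keys = ["US"] * 8000 + ["CA"] * 1000 + ["MX"] * 1000
num_partitions = 4

# Hash-partition on the key; every record with the same key value
# lands in the same partition.
counts = Counter(zlib.crc32(key.encode()) % num_partitions for key in keys)
for partition in range(num_partitions):
    print(f"partition {partition}: {counts.get(partition, 0)} records")

# A static model assumes 2,500 records per partition; at run time
# one partition receives at least 8,000.
```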