Distribution Fitting
A statistical distribution is the theoretical frequency of occurrence of the values that a variable can take. In the Simulation Fitting node, a set of theoretical statistical distributions is compared to each field of data. The distributions that are available for fitting are described in the topic Distributions. The parameters of each theoretical distribution are adjusted to give the best fit to the data according to a goodness-of-fit measure: either the Anderson-Darling criterion or the Kolmogorov-Smirnov criterion. The results of the distribution fitting show which distributions were fitted, the best estimates of the parameters for each distribution, and how well each distribution fits the data. During distribution fitting, correlations between fields with numeric storage types, and contingencies between fields with a categorical distribution, are also calculated. The results of the distribution fitting are used to create a Simulation Generate node.
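To make the two goodness-of-fit criteria concrete, the following Python sketch scores two candidate distributions against one field. It is an illustration only, not the node's implementation: the function names are hypothetical, only a normal and a uniform candidate are tried, and the parameters are estimated from simple sample statistics rather than adjusted to optimize the criterion.

```python
import math
from statistics import NormalDist, fmean, stdev


def ks_statistic(sample, cdf):
    """Kolmogorov-Smirnov statistic: largest gap between empirical and theoretical CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The empirical CDF jumps from (i - 1)/n to i/n at x, so check both sides.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


def ad_statistic(sample, cdf):
    """Anderson-Darling statistic A^2, which weights the tails more heavily than KS."""
    xs = sorted(sample)
    n = len(xs)
    eps = 1e-12  # keeps log() finite if the CDF returns exactly 0 or 1
    total = 0.0
    for i in range(1, n + 1):
        f_low = min(max(cdf(xs[i - 1]), eps), 1 - eps)
        f_high = min(max(cdf(xs[n - i]), eps), 1 - eps)
        total += (2 * i - 1) * (math.log(f_low) + math.log(1 - f_high))
    return -n - total / n


def score_candidates(sample):
    """Estimate parameters for each candidate (assumed set) and score both criteria."""
    normal = NormalDist(fmean(sample), stdev(sample))
    lo, hi = min(sample), max(sample)
    candidates = {
        "normal": normal.cdf,
        "uniform": lambda x: (x - lo) / (hi - lo),
    }
    return {
        name: {"ks": ks_statistic(sample, cdf), "ad": ad_statistic(sample, cdf)}
        for name, cdf in candidates.items()
    }


# Deterministic, bell-shaped sample: 200 evenly spaced quantiles of N(50, 10).
source = NormalDist(50, 10)
sample = [source.inv_cdf((i - 0.5) / 200) for i in range(1, 201)]

scores = score_candidates(sample)
# Smaller values mean a closer fit under either criterion.
best_by_ks = min(scores, key=lambda name: scores[name]["ks"])
```

For both statistics a smaller value indicates a closer fit, so the candidate with the lowest score would be ranked first.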
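If the data contains missing values, decide whether either of the following options is appropriate:
- Use an upstream node to remove records with missing values.
- Use an upstream node to impute values for missing values.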
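The two ways of handling missing values can be sketched in plain Python, which here stands in for whichever upstream node you use. The record layout, the field names, and the choice of mean imputation are assumptions made for the illustration.

```python
from statistics import fmean

# Hypothetical records; None marks a missing value.
records = [
    {"age": 34, "income": 52000.0},
    {"age": None, "income": 61000.0},
    {"age": 45, "income": None},
    {"age": 29, "income": 48000.0},
]


def drop_incomplete(rows):
    """Option 1: keep only records in which every field has a value."""
    return [row for row in rows if all(v is not None for v in row.values())]


def impute_mean(rows, fields):
    """Option 2: replace each missing value with the mean of the observed values."""
    means = {f: fmean(r[f] for r in rows if r[f] is not None) for f in fields}
    return [
        {k: (means[k] if v is None and k in means else v) for k, v in row.items()}
        for row in rows
    ]


complete = drop_incomplete(records)
imputed = impute_mean(records, ["age", "income"])
```

Removing records shrinks the sample, whereas imputing keeps every record but changes the shape of the data that the distributions are fitted to, so the choice depends on how many values are missing.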
The role of a field is not taken into account when the distributions are fitted. For example, fields with the role Target are treated the same as fields with roles of Input, None, Both, Partition, Split, Frequency, and ID.
Fields are treated differently during distribution fitting according to their storage type and measurement level. The treatment of fields during distribution fitting is described in the following table.
Storage type / Measurement level | Continuous | Categorical, Flag, Nominal | Ordinal | Typeless |
---|---|---|---|---|
String | Impossible | Categorical, dice and fixed distributions are fitted. | | |
Integer, Real, Time, Date, Timestamp | All distributions are fitted. Correlations and contingencies are calculated. | The categorical distribution is fitted. Correlations are not calculated. | Binomial, negative binomial and Poisson distributions are fitted, and correlations are calculated. | Field is ignored and not passed to the Simulation Generate node. |

For fields with the storage type Unknown, the appropriate storage type is determined from the data.
Fields with the measurement level ordinal are treated like continuous fields and are included in the correlations table in the Simulation Generate node. If you want a distribution other than binomial, negative binomial or Poisson to be fitted to an ordinal field, you must change the measurement level of the field to continuous. If you have previously defined a label for each value of an ordinal field, and then change the measurement level to continuous, the labels will be lost.
Fields that have a single value are treated no differently during distribution fitting from fields that have multiple values. Fields with the storage type time, date, or timestamp are treated as numeric.
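As an illustration of what treating a date field as numeric can mean, the sketch below converts dates to day numbers so that ordinary numeric statistics apply. The day-ordinal encoding is an assumption chosen for the example. The numeric representation that the node uses internally is not described here.

```python
from datetime import date
from statistics import fmean, stdev

# Hypothetical date field.
order_dates = [
    date(2024, 1, 3),
    date(2024, 1, 10),
    date(2024, 1, 17),
    date(2024, 1, 31),
]

# Encode each date as an integer day count (assumed encoding for this sketch).
as_numbers = [d.toordinal() for d in order_dates]

# Numeric summaries can now be computed and mapped back to the date scale.
mean_day = date.fromordinal(round(fmean(as_numbers)))
spread_in_days = stdev(as_numbers)
```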
Fitting Distributions to Split Fields
If your data contains a split field, and you want distribution fitting to be carried out separately for each split, you must transform the data by using an upstream Restructure node. Using the Restructure node, generate a new field for each value of the split field. This restructured data can then be used for distribution fitting in the Simulation Fitting node.
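The restructuring step can be pictured with a small Python sketch that gathers the values of one field into a new field per split value. The field names and the naming pattern for the new fields are hypothetical, and the sketch does not reproduce how the Restructure node populates records that belong to a different split.

```python
from collections import defaultdict

# Hypothetical data: "region" is the split field and "sales" is the field to fit.
rows = [
    {"region": "North", "sales": 120.0},
    {"region": "South", "sales": 95.0},
    {"region": "North", "sales": 133.5},
    {"region": "South", "sales": 101.0},
]


def restructure_by_split(rows, split_field, value_field):
    """Collect value_field into one new field per distinct value of split_field."""
    new_fields = defaultdict(list)
    for row in rows:
        new_name = f"{value_field}_{row[split_field]}"  # assumed naming pattern
        new_fields[new_name].append(row[value_field])
    return dict(new_fields)


restructured = restructure_by_split(rows, "region", "sales")
```

Each new field can then have its own distribution fitted, which is the effect of fitting separately for each split.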