Setting Options for the Simulation Generate Node
You can use the options on the Data tab of the Simulation Generate node dialog box to do the following:
- View, specify, and edit the statistical distribution information for the fields.
- View, specify, and edit the correlations between the fields.
- Specify the number of iterations and cases to simulate.
Select an item. Enables you to switch between the three views of the Simulation Generate node: Simulated Fields, Correlations, and Advanced Options.
Simulated Fields view
If the Simulation Generate node has been generated or updated from a Simulation Fitting node using historical data, in the Simulated Fields view you can view and edit the statistical distribution information for each field. The following information about each field is copied to the Types tab of the Simulation Generate node from the Simulation Fitting node:
- Measurement level
- Values
- Missing
- Check
- Role
If you do not have historical data, you can define fields and specify their distributions by selecting a storage type, and selecting a distribution type and entering the required parameters. Generating data in this way means that information about the measurement level of each field will not be available until the data are instantiated, for example, on the Types tab or in a Type node.
The Simulated Fields view contains several tools, which you can use to perform the following tasks:
- Add and remove fields.
- Change the order of fields on the display.
- Specify a storage type for each field.
- Specify a statistical distribution for each field.
- Specify parameter values for the statistical distribution of each field.
Simulated Fields. This table contains one empty row if the Simulation Generate node has been added to the stream canvas from the Sources palette. When this row is edited, a new empty row is added to the bottom of the table. If the Simulation Generate node has been created from a Simulation Fitting node, this table will contain one row for each field of the historical data. Extra rows can be added to the table by clicking the Add new field icon.
The Simulated Fields table is made up of the following columns:
- Field. Contains the names of the fields. The field names can be edited by typing in the cells.
- Storage. The cells in this column contain a drop-down list of storage types. Available storage types are String, Integer, Real, Time, Date, and Timestamp. The choice of storage type determines which distributions are available in the Distribution column. If the Simulation Generate node has been created from a Simulation Fitting node, the storage type is copied over from the Simulation Fitting node.
Note: For fields with datetime storage types, you must specify the distribution parameters as integers. For example, to specify a mean date of 1 January 1970, use the integer 0. The signed integer represents the number of seconds since (or before) midnight on 1 January 1970.
- Status. Icons in the Status column indicate the fit status for each field. Each icon has one of the following meanings:
  - No distribution has been specified for the field, or one or more distribution parameters are missing. In order to run the simulation, you must specify a distribution for this field and enter valid values for the parameters.
  - The field is set to the closest fitting distribution. Note: This icon can only ever be displayed if the Simulation Generate node is created from a Simulation Fitting node.
  - The closest fitting distribution has been replaced with an alternative distribution from the Fit Details sub-dialog box. See the topic Fit Details for more information.
  - The distribution has been manually specified or edited, and might include a parameter specified at more than one level.
- Locked. Locking a simulated field, by selecting the check box in the column with the lock icon, excludes the field from automatic updating by a linked Simulation Fitting node. This is most useful when you manually specify a distribution and want to ensure that it will not be affected by automatic distribution fitting when a linked Simulation Fitting node is executed.
- Distribution. The cells in this column contain a drop-down list of statistical distributions. The choice of storage type determines which distributions are available in this column for a given field. See the topic Distributions for more information.
Note: You cannot specify the Fixed distribution for every field. If you want every field in your generated data to be fixed, you can use a User Input node followed by a Balance node.
- Parameters. The distribution parameters associated with each fitted distribution are displayed in this column. Multiple values of a parameter are comma separated. Specifying multiple values for a parameter generates multiple iterations for the simulation. See the topic Iterations for more information. If parameters are missing, this is reflected in the icon displayed in the Status column. To specify values for the parameters, click this column in the row corresponding to the field of interest and choose Specify from the list. This opens the Specify Parameters sub-dialog box. See the topic Specify Parameters for more information. This column is disabled if Empirical is chosen in the Distribution column.
- Min, Max. In this column, for some distributions, you can specify a minimum value, a maximum value, or both for the simulated data. Simulated data that are smaller than the minimum value or larger than the maximum value will be rejected, even though they would be valid for the specified distribution. To specify minimum and maximum values, click this column in the row corresponding to the field of interest and choose Specify from the list. This opens the Specify Parameters sub-dialog box. See the topic Specify Parameters for more information. This column is disabled if Empirical is chosen in the Distribution column.
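The note on datetime storage types in the Storage column states that parameters are entered as signed integers counting seconds from midnight on 1 January 1970. The following Python sketch shows one way to work out such an integer outside the product. The function name `to_epoch_seconds` is hypothetical, and treating values as UTC is an assumption, because the note does not state a time zone.

```python
from datetime import datetime, timezone

# Reference point used in the note: midnight on 1 January 1970.
# Assumption: the reference point is interpreted as UTC.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_seconds(moment: datetime) -> int:
    """Return the signed number of whole seconds between EPOCH and `moment`."""
    if moment.tzinfo is None:
        # Assumption: datetimes without a time zone are treated as UTC.
        moment = moment.replace(tzinfo=timezone.utc)
    return int((moment - EPOCH).total_seconds())


print(to_epoch_seconds(datetime(1970, 1, 1)))    # 0
print(to_epoch_seconds(datetime(1970, 1, 2)))    # 86400
print(to_epoch_seconds(datetime(1969, 12, 31)))  # -86400
```

Dates before 1 January 1970 produce negative integers, which matches the "since (or before)" wording of the note.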
Use Closest Fit. Only enabled if the Simulation Generate node has been created automatically from a Simulation Fitting node using historical data, and a single row in the Simulated Fields table is selected. Replaces the information for the field in the selected row with the information of the best fitting distribution for the field. If the information in the selected row has been edited, pressing this button will reset the information back to the best fitting distribution determined from the Simulation Fitting node.
Fit Details. Only enabled if the Simulation Generate node has been created automatically from a Simulation Fitting node. Opens the Fit Details sub-dialog box. See the topic Fit Details for more information.
Several useful tasks can be carried out using the icons on the right of the Simulated Fields view. These icons are described in the following table.
Tooltip | Description |
---|---|
Edit distribution parameters | Only enabled when a single row in the Simulated Fields table is selected. Opens the Specify Parameters sub-dialog box for the selected row. See the topic Specify Parameters for more information. |
Add new field | Only enabled when a single row in the Simulated Fields table is selected. Adds a new empty row to the bottom of the Simulated Fields table. |
Create multiple copies | Only enabled when a single row in the Simulated Fields table is selected. Opens the Clone Field sub-dialog box. See the topic Clone Field for more information. |
Delete selected field | Deletes the selected row from the Simulated Fields table. |
Move to top | Only enabled if the selected row is not already the top row of the Simulated Fields table. Moves the selected row to the top of the Simulated Fields table. This action affects the order of the fields in the simulated data. |
Move up | Only enabled if the selected row is not the top row of the Simulated Fields table. Moves the selected row up one position in the Simulated Fields table. This action affects the order of the fields in the simulated data. |
Move down | Only enabled if the selected row is not the bottom row of the Simulated Fields table. Moves the selected row down one position in the Simulated Fields table. This action affects the order of the fields in the simulated data. |
Move to bottom | Only enabled if the selected row is not already the bottom row of the Simulated Fields table. Moves the selected row to the bottom of the Simulated Fields table. This action affects the order of the fields in the simulated data. |
Do not clear Min and Max when refitting. When selected, the minimum and maximum values are not overwritten when the distributions are updated by executing a connected Simulation Fitting node.
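The Min, Max column rejects simulated values that fall outside the bounds you specify, even when those values are valid for the distribution. The following Python sketch illustrates the general idea of rejecting out-of-range draws. It is not the product's implementation: the function `simulate_with_bounds` is hypothetical, and redrawing until enough values are accepted is an assumption made for the example.

```python
import random


def simulate_with_bounds(draw, n, minimum=None, maximum=None, max_attempts=1_000_000):
    """Collect n values from draw(), discarding values outside [minimum, maximum]."""
    accepted = []
    attempts = 0
    while len(accepted) < n:
        attempts += 1
        if attempts > max_attempts:
            raise RuntimeError("The bounds reject too many draws.")
        value = draw()
        if minimum is not None and value < minimum:
            continue  # smaller than the minimum value: rejected
        if maximum is not None and value > maximum:
            continue  # larger than the maximum value: rejected
        accepted.append(value)
    return accepted


rng = random.Random(42)
# Normal distribution with mean 50 and standard deviation 10, bounded to [30, 70].
values = simulate_with_bounds(lambda: rng.gauss(50.0, 10.0), 1000, minimum=30.0, maximum=70.0)
print(len(values), min(values) >= 30.0, max(values) <= 70.0)  # 1000 True True
```

Either bound can be omitted, which mirrors the option to specify a minimum value, a maximum value, or both.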
Correlations view
Input fields to predictive models are often known to be correlated (for example, height and weight). Correlations between fields that will be simulated must be accounted for in order to ensure that the simulated values preserve those correlations.
If the Simulation Generate node has been generated or updated from a Simulation Fitting node using historical data, in the Correlations view you can view and edit the calculated correlations between pairs of fields. If you do not have historical data, you can specify the correlations manually based on your knowledge of how the fields are correlated.
You can choose to display the correlations in a matrix or list format.
Correlations matrix. Displays the correlations between pairs of fields in a matrix. The field names are listed, in alphabetical order, down the left and along the top of the matrix. Only the cells below the diagonal can be edited; a value between -1.000 and 1.000, inclusive, must be entered. The cell above the diagonal is updated when focus is changed away from its mirrored cell below the diagonal; both cells then display the same value. The diagonal cells are always disabled and always have a correlation of 1.000. The default value for all other cells is 0.000. A value of 0.000 specifies that there is no correlation between the associated pair of fields. Only continuous and ordinal fields are included in the matrix. Nominal, categorical and flag fields, and fields that are assigned the Fixed distribution are not shown in the table.
Correlations list. Displays the correlations between pairs of fields in a table. Each row of the table shows the correlation between a pair of fields. Rows cannot be added or deleted. The columns with the headings Field 1 and Field 2 contain the field names, which cannot be edited. The Correlation column contains the correlations, which can be edited; a value between -1.000 and 1.000, inclusive, must be entered. The default value for all cells is 0.000. Only continuous and ordinal fields are included in the list. Nominal, categorical and flag fields, and fields that are assigned the Fixed distribution are not shown in the list.
Reset correlations. Opens the Reset Correlations dialog box, which contains the following options:
- Fitted. Replaces the current correlations with those calculated using the historical data.
- Zeroes. Replaces the current correlations with zeroes.
- Cancel. Closes the dialog box. The correlations are unchanged.
Show As. Select Table to display the correlations as a matrix. Select List to display the correlations as a list.
Do not recalculate correlations when refitting. Select this option if you want to manually specify correlations and prevent them from being overwritten when automatically fitting distributions using a Simulation Fitting node and historical data.
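To see why a specified correlation matters for the simulated values, consider two continuous fields. The following Python sketch draws pairs of standard normal values with a chosen correlation by using a textbook construction for two variables. It is an illustration only, not the method used by the Simulation Generate node, and the function names are hypothetical.

```python
import math
import random


def correlated_normal_pairs(rho, n, seed=0):
    """Draw n pairs of standard normal values whose correlation is rho."""
    if not -1.0 <= rho <= 1.0:
        raise ValueError("A correlation must be between -1.000 and 1.000, inclusive.")
    rng = random.Random(seed)
    scale = math.sqrt(1.0 - rho * rho)
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        # Mixing z1 into the second value induces the requested correlation.
        pairs.append((z1, rho * z1 + scale * z2))
    return pairs


def pearson(pairs):
    """Return the sample Pearson correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)


pairs = correlated_normal_pairs(0.7, 20000)
print(round(pearson(pairs), 2))  # close to 0.7
```

With a correlation of 0.000, the default for all off-diagonal cells, the second value depends only on its own random draw, so the two fields are simulated independently.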
Use fitted multiway contingency table for inputs with a categorical distribution. By default, all fields with a categorical distribution are included in a contingency table (or multiway contingency table, depending on the number of fields with a categorical distribution). The contingency table is constructed, like the correlations, when a Simulation Fitting node is executed. The contingency table cannot be viewed. When this option is selected, fields with a categorical distribution are simulated using the actual percentages from the contingency table. That is, any associations between nominal fields are recreated in the new, simulated data. When this option is cleared, fields with categorical distributions are simulated using the expected percentages from the contingency table. If you modify a field, the field is removed from the contingency table.
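The difference between actual and expected percentages can be illustrated with two nominal fields. The following Python sketch assumes that "actual" means the observed joint percentage of each combination of categories, and that "expected" means the percentage obtained by multiplying the marginal percentages, as if the fields were independent. That interpretation, the function name, and the sample data are assumptions made for the example.

```python
from collections import Counter


def actual_and_expected(rows):
    """Return (actual, expected) cell proportions for a list of (a, b) category pairs."""
    n = len(rows)
    joint = Counter(rows)
    first = Counter(a for a, _ in rows)
    second = Counter(b for _, b in rows)
    # Actual: observed share of each combination, so associations are kept.
    actual = {cell: count / n for cell, count in joint.items()}
    # Expected (assumed meaning): product of the marginal shares.
    expected = {(a, b): (first[a] / n) * (second[b] / n) for a in first for b in second}
    return actual, expected


# Hypothetical historical data in which the two fields are strongly associated.
rows = (
    [("yes", "urban")] * 40
    + [("yes", "rural")] * 10
    + [("no", "urban")] * 10
    + [("no", "rural")] * 40
)
actual, expected = actual_and_expected(rows)
print(actual[("yes", "urban")])    # 0.4
print(expected[("yes", "urban")])  # 0.25
```

In this data, sampling from the actual proportions reproduces the association between the two fields, whereas sampling from the expected proportions keeps only the overall share of each category.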
Advanced Options view
Number of cases to simulate. Displays the options for specifying the number of cases to be simulated, and how any iterations will be named.
- Maximum number of cases. This specifies the maximum number of cases of simulated data, and associated target values, to generate. The default value is 100,000, the minimum value is 1,000, and the maximum value is 2,147,483,647.
- Iterations. This number is calculated automatically and cannot be edited. An iteration is created automatically each time a distribution parameter has multiple values specified.
- Total rows. Only enabled when the number of iterations is greater than 1. The number is calculated automatically, using the equation shown, and cannot be edited.
- Create iteration field. Only enabled when the number of iterations is greater than 1. When selected, the Name field is enabled. See the topic Iterations for more information.
- Name. Only enabled when the Create iteration field check box is selected, and the number of iterations is greater than 1. Edit the name of the iteration field by typing in this text field. See the topic Iterations for more information.
Random seed. Setting a random seed allows you to replicate your simulation.
- Replicate results. When selected, the Generate button and Random seed field are enabled.
- Random seed. Only enabled when the Replicate results check box is selected. In this field you can specify an integer to be used as the random seed. The default value is 629111597.
- Generate. Only enabled when the Replicate results check box is selected. Creates a pseudo-random integer between 1 and 999999999, inclusive, in the Random seed field.
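The effect of a random seed can be demonstrated with any pseudo-random number generator. The following Python sketch uses Python's own generator, not the one in the product, so the values it produces will not match a simulation run in the Simulation Generate node. The function names are hypothetical. Only the default seed and the range of generated seeds are taken from the descriptions above.

```python
import random

DEFAULT_SEED = 629111597  # default value of the Random seed field


def generate_seed(rng=None):
    """Return a pseudo-random integer between 1 and 999999999, inclusive."""
    if rng is None:
        rng = random.SystemRandom()
    return rng.randint(1, 999_999_999)


def simulate(seed, n=5):
    """Draw n standard normal values from a generator initialized with seed."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


first_run = simulate(DEFAULT_SEED)
second_run = simulate(DEFAULT_SEED)
print(first_run == second_run)  # True: the same seed replicates the results

new_seed = generate_seed()
print(1 <= new_seed <= 999_999_999)  # True
```

Recording the seed alongside the simulation settings is what makes a later run reproducible.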