Mimic node

In Synthetic Data Generator, you can use the Mimic node to set your requirements for how the synthetic data should resemble the sample seed data.

The Mimic node analyzes the statistical distribution of each field in the sample seed data, and it generates (or updates) a Generate node with the best-fitting distribution assigned to each field. The Generate node can then automatically generate synthetic data based on the analysis.
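The idea of picking a best-fitting distribution per field can be sketched in plain Python. This is only an illustration, not the Mimic node's actual algorithm: the candidate list, the helper names `ks_statistic` and `rank_distributions`, and the sample data are all assumptions made for the example. It ranks three candidate distributions by the Kolmogorov-Smirnov statistic, which is one of the goodness-of-fit options the node exposes for continuous fields.

```python
import math
import random
from statistics import NormalDist, mean, stdev

def ks_statistic(sample, cdf):
    """Largest gap between the empirical CDF of `sample` and a candidate `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    gap = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        gap = max(gap, abs(f - i / n), abs((i + 1) / n - f))
    return gap

def rank_distributions(sample):
    """Fit a few candidate distributions and sort them by KS statistic, best first."""
    mu, sigma = mean(sample), stdev(sample)
    lo, hi = min(sample), max(sample)
    candidates = {
        "normal": NormalDist(mu, sigma).cdf,
        "uniform": lambda x: min(1.0, max(0.0, (x - lo) / (hi - lo))),
    }
    if lo > 0:
        # An exponential fit is only considered for strictly positive data.
        candidates["exponential"] = lambda x: 1.0 - math.exp(-x / mu)
    scores = {name: ks_statistic(sample, cdf) for name, cdf in candidates.items()}
    return sorted(scores.items(), key=lambda item: item[1])

# Hypothetical seed data: 500 ages drawn from a normal distribution.
rng = random.Random(0)
ages = [rng.gauss(40, 10) for _ in range(500)]
ranking = rank_distributions(ages)
print(ranking[0][0])  # name of the candidate with the smallest KS statistic
```

A smaller KS statistic means the candidate's cumulative distribution function stays closer to the data, so the first entry in `ranking` plays the role of the distribution that would be assigned to the field.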

Description: Use the Mimic node in a Synthetic Data Generator flow to configure how closely the synthetic data resembles the statistical properties and distributions of the original dataset. The node uses advanced algorithms to learn the underlying patterns and relationships in the input data to create realistic yet artificial records. You can also configure the differential privacy feature. For more information, see Using differential privacy.
Using the node: Use the Mimic node after an Import node or Anonymize node. When the Mimic node runs, a Generate node is created if one does not already exist. You can include the Mimic node multiple times in a Synthetic Data Generator flow if you want to generate synthetic data by using different settings on the same dataset. One Import or Anonymize node can branch out to several Mimic nodes.
Mandatory or optional: The Mimic node is mandatory unless you are generating synthetic data by using a custom data schema. You must connect an Import node or Anonymize node to the Mimic node.

Columns set as typeless

The Mimic node always excludes columns set as Typeless from the dataset during data preprocessing. Any columns that are excluded do not appear in the Generate node. Machine learning algorithms generally ignore typeless fields because they might not have any predictive value or discernable pattern that can be determined through statistical analysis.

If you want to keep a column that is set as Typeless in the dataset, you can enable Typeless as Realistic in the settings for the Mimic node and then edit the column in the Generate node to set the dictionary for that column. The data in the column is then treated as that dictionary type.

Scripting with the Mimic node

You can use scripting languages, like Python, to programmatically set properties for nodes.

Mimic node properties

The following properties are specific to the Mimic node. For information about common node properties, see Properties for flows and nodes.

Table 1. Node properties for scripting
Property Name	Data Type	Property Description
`bins`	Integer	The number of bins to use for the Empirical distribution. For continuous fields, the Empirical distribution is the cumulative distribution function of the historical data.
`custom_gen_node_name`	Boolean	You can generate the name of the generated (or updated) Generate node automatically by selecting `Auto`.
`delta`	Integer	Maximum allowable probability of privacy leakage. The value should be <= 1/nn* where n is the sample size at a time. Either `epsilon` or `delta` must be greater than 0.
`epsilon`	Integer	Determines the privacy budget. Smaller values provide greater privacy protection at the cost of accuracy.
`frequency_weight_field`	field	Specify the weight field if your data set contains one. The weight field is then excluded from the distribution fitting process.
`gen_node_name`	String	Specify a custom name for the generated (or updated) Generate node.
`good_fit_type`	String	For continuous fields, specify either the `AnderDarling` test or the `KolmogSmirn` test of goodness of fit to rank distributions when fitting distributions to the fields.
`locale`	String	The locale determines which dictionary types are available for field generation. It can have one of the following values: ["de_DE", "en_US", "es_ES", "fr_FR", "it_IT", "ja_JP", "ko_KR", "pl_PL", "pt_BR", "ru_RU", "zh_CN"].
`missing_value_imputation`	Boolean	Set to `True` to use imputation to replace your missing data. Imputation means replacing missing data with an estimate, then analyzing the full data set as if the imputed values were real data.
`missing_value_imputation_continuous_strategy`	String	When replacing missing values for a continuous field, it can be set as `mean` or `fixed`. The default is `mean`.
`missing_value_imputation_continuous_replace_value`	Integer	When `fixed` is set for `missing_value_imputation_continuous_strategy`, you can set a value here.
`missing_value_imputation_nominal_strategy`	String	When replacing missing values for a nominal field, it can be set as `mode` or `fixed`. The default is `mode`.
`missing_value_imputation_nominal_replace_value`	Integer	When `fixed` is set for `missing_value_imputation_nominal_strategy`, you can set a value here.
`missing_value_imputation_ordinal_strategy`	String	When replacing missing values for an ordinal field, it can be set as `medium` or `fixed`. The default is `medium`.
`missing_value_imputation_ordinal_replace_value`	Integer	When `fixed` is set for `missing_value_imputation_ordinal_strategy`, you can set a value here.
`missing_value_imputation_strategies`	Structured Property	When `True` is set for `missing_value_imputation`, you can use this property to replace missing values for specific fields. See the Example script.
`random_seed`	Integer	Enables you to reproduce differentially private synthetic output.
`typeless_as_realistic`	Boolean	Set to `True` to treat the typeless field as a dictionary type.
`use_diff_privacy`	Boolean	Set to `True` to ensure that no sensitive data is exposed in the synthetic data generated by controlling the privacy budget (`epsilon`) and leakage (`delta`) parameters.
`used_cases_type`	String	Specifies the number of cases to use when fitting distributions to the fields in the data set. Use `AllCases` or `FirstNCases`.
`used_cases`	Integer	The number of cases to use when `used_cases_type` is set to `FirstNCases`.
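To make the imputation strategies in the table concrete, here is a minimal sketch of what `mean`, `mode`, and `fixed` replacement mean for a single column. It is not the node's implementation: the `impute` helper and the sample columns are hypothetical, and missing values are represented as `None` purely for illustration.

```python
from statistics import mean, mode

def impute(values, strategy, replace_value=None):
    """Return a copy of `values` with None entries replaced according to `strategy`."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = mean(observed)      # estimate from the observed values (continuous fields)
    elif strategy == "mode":
        fill = mode(observed)      # most frequent observed value (nominal fields)
    elif strategy == "fixed":
        fill = replace_value       # caller-supplied constant
    else:
        raise ValueError(f"unsupported strategy: {strategy}")
    return [fill if v is None else v for v in values]

ages = [30, None, 50, 40]
colors = ["red", "blue", None, "red"]

print(impute(ages, "mean"))        # the missing age becomes the mean of 30, 50, and 40
print(impute(ages, "fixed", 60))   # the missing age becomes 60
print(impute(colors, "mode"))      # the missing color becomes "red"
```

After imputation, the completed column is analyzed as if the filled-in values were real data, which is why a fixed value that is far from the observed values can shift the fitted distribution.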

Example script

The following is an example of the properties used in a script.

mimicnode.setPropertyValue("missing_value_imputation", "True")
mimicnode.setPropertyValue("missing_value_imputation_strategies","[['Age','Fixed','60']]")
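Other properties from Table 1 can be set with the same `setPropertyValue` pattern. The sketch below uses a hypothetical `StubNode` class so that it runs outside the product; in a real flow, `mimicnode` would be the Mimic node obtained through the scripting API. The property names come from Table 1, the values are illustrative only, and Boolean values are passed as strings to match the example above.

```python
class StubNode:
    """Hypothetical stand-in for a flow node; it only records property values."""

    def __init__(self):
        self.properties = {}

    def setPropertyValue(self, name, value):
        self.properties[name] = value

    def getPropertyValue(self, name):
        return self.properties[name]

# Assumption: in a real flow, obtain the Mimic node from the flow instead.
mimicnode = StubNode()

# Fit distributions on the first 1000 cases only.
mimicnode.setPropertyValue("used_cases_type", "FirstNCases")
mimicnode.setPropertyValue("used_cases", 1000)

# Rank candidate distributions for continuous fields with the Kolmogorov-Smirnov test.
mimicnode.setPropertyValue("good_fit_type", "KolmogSmirn")

# Keep typeless columns by treating them as a dictionary type.
mimicnode.setPropertyValue("typeless_as_realistic", "True")

# Turn on differential privacy with an illustrative budget and a fixed seed.
mimicnode.setPropertyValue("use_diff_privacy", "True")
mimicnode.setPropertyValue("epsilon", 1)
mimicnode.setPropertyValue("random_seed", 42)
```

Setting `random_seed` together with `use_diff_privacy` is what makes a differentially private run repeatable.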