Mimic node

In Synthetic Data Generator, you can use the Mimic node to set your requirements for how the synthetic data should resemble the sample seed data.

The Mimic node analyzes the statistical distribution of each field in the sample seed data, and it generates (or updates) a Generate node with the best-fitting distribution assigned to each field. The Generate node can then automatically generate synthetic data based on the analysis.
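The idea of assigning a best-fitting distribution to a field can be illustrated with a minimal sketch. It is not the algorithm that the Mimic node uses: the three candidate distributions, the moment-matching fit, and the function names (ks_statistic, candidate_cdfs, best_fit) are assumptions made for illustration. The sketch ranks the candidates by the Kolmogorov-Smirnov statistic, which measures the largest gap between the empirical distribution of the data and a candidate distribution.

```python
import math
import random
import statistics


def ks_statistic(sample, cdf):
    """Largest gap between the sample's empirical CDF and a candidate CDF."""
    xs = sorted(sample)
    n = len(xs)
    gap = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        gap = max(gap, abs(f - i / n), abs((i + 1) / n - f))
    return gap


def candidate_cdfs(sample):
    """Fit three simple candidates by matching moments or the observed range."""
    mean = statistics.fmean(sample)
    stdev = statistics.stdev(sample)
    low, high = min(sample), max(sample)
    candidates = {
        "normal": statistics.NormalDist(mean, stdev).cdf,
        "uniform": lambda x: (x - low) / (high - low),
    }
    if low >= 0 and mean > 0:
        # The exponential distribution only applies to non-negative data.
        candidates["exponential"] = lambda x: 1.0 - math.exp(-x / mean)
    return candidates


def best_fit(sample):
    """Return (name, KS statistic) pairs ranked from best to worst fit."""
    scores = {name: ks_statistic(sample, cdf)
              for name, cdf in candidate_cdfs(sample).items()}
    return sorted(scores.items(), key=lambda item: item[1])


rng = random.Random(42)
ages = [rng.gauss(40, 10) for _ in range(500)]       # roughly bell-shaped field
waits = [rng.expovariate(0.5) for _ in range(500)]   # right-skewed field

print(best_fit(ages))    # ranked candidates for the bell-shaped field
print(best_fit(waits))   # ranked candidates for the right-skewed field
```

A smaller statistic means a closer fit, so the first entry in each ranked list is the distribution that a generator would sample from for that field.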

Description
Use the Mimic node in a Synthetic Data Generator flow to configure how closely the synthetic data resembles the statistical properties and distributions of the original dataset. The node uses advanced algorithms to learn the underlying patterns and relationships in the input data to create realistic yet artificial records.
You can also configure the differential privacy feature. For more information, see Using differential privacy.
Using the node
Use the Mimic node after an Import node or Anonymize node. When the Mimic node runs, a Generate node is created if one does not already exist.
You can include the Mimic node multiple times in a Synthetic Data Generator flow if you want to generate synthetic data by using different settings on the same dataset. One Import or Anonymize node can branch out to several Mimic nodes.
Mandatory or optional
The Mimic node is mandatory unless you are generating synthetic data by using a custom data schema. You must connect an Import node or Anonymize node to the Mimic node.

Columns set as typeless

The Mimic node always excludes columns set as Typeless from the dataset during data preprocessing. Any columns that are excluded do not appear in the Generate node. Machine learning algorithms generally ignore typeless fields because such fields might not have any predictive value or discernible pattern that can be determined through statistical analysis.

If you want to keep a column that is set as Typeless in the dataset, you can enable Typeless as Realistic in the settings for the Mimic node and then edit the column in the Generate node to set the dictionary for that column. The data in the column is then treated as that dictionary type.

Scripting with the Mimic node

You can use scripting languages, like Python, to programmatically set properties for nodes.

Mimic node properties

The following properties are specific to the Mimic node. For information about common node properties, see Properties for flows and nodes.

Table 1. Node properties for scripting
Property Name | Data Type | Property Description
bins | Integer | Number of bins (ranges) used to calculate the Empirical distribution for continuous fields. The Empirical distribution is the cumulative distribution function of the historical data.
custom_gen_node_name | Boolean | You can generate the name of the generated (or updated) Generate node automatically by selecting Auto.
delta | Integer | Maximum allowable probability of privacy leakage. The value should be <= 1/(n*n), where n is the sample size at a time. Either epsilon or delta must be greater than 0.
epsilon | Integer | Determines the privacy budget. Smaller values provide greater privacy protection at the cost of accuracy.
frequency_weight_field | Field | Specify the weight field if your data set contains one. The weight field is then excluded from the distribution fitting process.
gen_node_name | String | Specify a custom name for the generated (or updated) Generate node.
good_fit_type | String | For continuous fields, specify either the AnderDarling (Anderson-Darling) test or the KolmogSmirn (Kolmogorov-Smirnov) test of goodness of fit to rank distributions when fitting distributions to the fields.
locale | String | The locale determines which dictionary types are available for field generation. It can have one of the following values: "de_DE", "en_US", "es_ES", "fr_FR", "it_IT", "ja_JP", "ko_KR", "pl_PL", "pt_BR", "ru_RU", "zh_CN".
missing_value_imputation | Boolean | Set to True to use imputation to replace your missing data. Imputation means replacing missing data with an estimate, then analyzing the full data set as if the imputed values were real data.
missing_value_imputation_continuous_strategy | String | When replacing missing values for a continuous field, the strategy can be set to mean or fixed. The default is mean.
missing_value_imputation_continuous_replace_value | Integer | When missing_value_imputation_continuous_strategy is set to fixed, specify the replacement value here.
missing_value_imputation_nominal_strategy | String | When replacing missing values for a nominal field, the strategy can be set to mode or fixed. The default is mode.
missing_value_imputation_nominal_replace_value | Integer | When missing_value_imputation_nominal_strategy is set to fixed, specify the replacement value here.
missing_value_imputation_ordinal_strategy | String | When replacing missing values for an ordinal field, the strategy can be set to median or fixed. The default is median.
missing_value_imputation_ordinal_replace_value | Integer | When missing_value_imputation_ordinal_strategy is set to fixed, specify the replacement value here.
missing_value_imputation_strategies | Structured Property | When missing_value_imputation is set to True, you can use this property to replace missing values for specific fields. See the Example script.
random_seed | Integer | Enables you to reproduce differentially private synthetic output.
typeless_as_realistic | Boolean | Set to True to treat the typeless field as a dictionary type.
use_diff_privacy | Boolean | Set to True to ensure that no sensitive data is exposed in the synthetic data generated by controlling the privacy budget (epsilon) and leakage (delta) parameters.
used_cases_type | String | Specifies the number of cases to use when fitting distributions to the fields in the data set. Use AllCases or FirstNCases.
used_cases | Integer | The number of cases to use when used_cases_type is set to FirstNCases.
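
To build intuition for the epsilon and delta properties, the following sketch uses the Laplace mechanism, a textbook way to release a statistic with differential privacy. It is an illustration under stated assumptions, not the method that the Mimic node implements: the function names (laplace_noise, private_mean, max_delta), the clipping bounds, and the sample data are all hypothetical. The Laplace mechanism itself depends only on epsilon, so max_delta is included only to show the 1/(n*n) bound from the table.

```python
import math
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_mean(values, lower, upper, epsilon, rng):
    """Mean of values clipped to [lower, upper], with Laplace noise added."""
    if epsilon <= 0:
        raise ValueError("epsilon must be greater than 0 for this mechanism")
    clipped = [min(max(v, lower), upper) for v in values]
    # Changing one record moves the clipped mean by at most this much.
    sensitivity = (upper - lower) / len(clipped)
    noise = laplace_noise(sensitivity / epsilon, rng)
    return sum(clipped) / len(clipped) + noise


def max_delta(n):
    """Upper bound on delta from the table: 1 / (n * n)."""
    return 1.0 / (n * n)


rng = random.Random(7)   # a fixed seed makes the noisy output reproducible
ages = [rng.uniform(18, 90) for _ in range(1000)]
true_mean = sum(ages) / len(ages)

for epsilon in (0.1, 1.0, 10.0):
    noisy = private_mean(ages, 18, 90, epsilon, rng)
    print(epsilon, abs(noisy - true_mean))   # error tends to shrink as epsilon grows

print(max_delta(len(ages)))   # bound on delta for a sample of 1000 records
```

The noise scale is inversely proportional to epsilon, which is why a smaller privacy budget gives stronger protection but a less accurate result. Fixing the seed of the random generator plays the same role as random_seed in the table: it makes the noisy output repeatable.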

Example script

The following is an example of the properties used in a script.

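# Assumes that mimicnode already references a Mimic node in the flow.
mimicnode.setPropertyValue("missing_value_imputation", "True")
# Replace missing values in the Age field with the fixed value 60.
mimicnode.setPropertyValue("missing_value_imputation_strategies", "[['Age','Fixed','60']]")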
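
What the imputation strategies do to a field can be sketched in plain Python, independent of the product. The impute function and its sample data are hypothetical and only illustrate the mean, median, mode, and fixed-value strategies. The reading of the example entry ['Age','Fixed','60'] as "fill missing Age values with 60" is an assumption based on the property descriptions.

```python
import statistics


def impute(values, strategy, fixed_value=None):
    """Replace None entries using a mean, median, mode, or fixed-value strategy."""
    observed = [v for v in values if v is not None]
    if strategy == "fixed":
        fill = fixed_value
    elif strategy == "mean":
        fill = statistics.fmean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    elif strategy == "mode":
        fill = statistics.mode(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


ages = [25, None, 35, 60, None]
print(impute(ages, "mean"))                    # [25, 40.0, 35, 60, 40.0]
print(impute(ages, "fixed", 60))               # [25, 60, 35, 60, 60]
print(impute(["a", "b", None, "b"], "mode"))   # ['a', 'b', 'b', 'b']
print(impute([1, 3, None, 2], "median"))       # [1, 3, 2, 2]
```

The mean suits continuous fields, the mode suits nominal fields whose values have no order, and the median suits ordinal fields. A fixed value works for any field type when a domain-specific default is known.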