Mimic node
In Synthetic Data Generator, you can use the Mimic node to set your requirements for how the synthetic data should resemble the sample seed data.
The Mimic node analyzes the statistical distribution of each field in the sample seed data, and it generates (or updates) a Generate node with the best-fitting distribution assigned to each field. The Generate node can then automatically generate synthetic data based on the analysis.
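To make the idea of a best-fitting distribution concrete, the following is a minimal sketch that ranks a few candidate distributions for one continuous field by their Kolmogorov-Smirnov statistic. It illustrates the general technique only and is not the algorithm that the Mimic node runs. The function names, the three candidate distributions, and the sample data are assumptions made for this example.

```python
import math
import random
from statistics import NormalDist, fmean, stdev


def ks_statistic(sample, cdf):
    """Largest gap between the empirical CDF of `sample` and a candidate CDF."""
    xs = sorted(sample)
    n = len(xs)
    gap = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x, so check both sides.
        gap = max(gap, abs(f - i / n), abs((i + 1) / n - f))
    return gap


def rank_candidates(sample):
    """Return (name, KS statistic) pairs, smallest statistic (best fit) first."""
    mu, sigma = fmean(sample), stdev(sample)
    lo, hi = min(sample), max(sample)
    # Hypothetical candidate set, with parameters estimated from the sample.
    candidates = {
        "normal": NormalDist(mu, sigma).cdf,
        "uniform": lambda x: (x - lo) / (hi - lo),
        "exponential": lambda x: 1.0 - math.exp(-x / mu) if x > 0 else 0.0,
    }
    scores = [(name, ks_statistic(sample, cdf)) for name, cdf in candidates.items()]
    return sorted(scores, key=lambda pair: pair[1])


# Made-up seed data for a single continuous field.
rng = random.Random(0)
ages = [rng.gauss(40, 10) for _ in range(500)]
ranking = rank_candidates(ages)
for name, score in ranking:
    print(name, round(score, 3))
```

A smaller statistic means that the candidate's cumulative distribution function stays closer to the data, so the first entry in the ranking is the distribution that a generator would use for that field in this sketch.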
- Description
- Use the Mimic node in a Synthetic Data Generator flow to configure how closely the synthetic data resembles the statistical properties and distributions of the original dataset. The node uses advanced algorithms to learn the underlying patterns and relationships in the input data to create realistic yet artificial records.
- You can also configure the differential privacy feature. For more information, see Using differential privacy.
- Using the node
- Use the Mimic node after an Import node or Anonymize node. When the Mimic node runs, a Generate node is created if one does not already exist.
- You can include the Mimic node multiple times in a Synthetic Data Generator flow if you want to generate synthetic data by using different settings on the same dataset. One Import or Anonymize node can branch out to several Mimic nodes.
- Mandatory or optional
- The Mimic node is mandatory unless you are generating synthetic data by using a custom data schema. You must connect an Import node or Anonymize node to the Mimic node.
Columns set as typeless
The Mimic node always excludes columns set as Typeless from the dataset during data preprocessing. Any columns that are excluded do not appear in the Generate node. Machine learning algorithms generally ignore typeless fields because they might not have any predictive value or discernable pattern that can be determined through statistical analysis.
If you want to keep a column that is set as Typeless in the dataset, you can enable Typeless as Realistic in the settings for the Mimic node and then edit the column in the Generate node to set the dictionary for that column. The data in the column is then treated as that dictionary type.
Scripting with the Mimic node
You can use scripting languages, like Python, to programmatically set properties for nodes.
Mimic node properties
The following properties are specific to the Mimic node. For information about common node properties, see Properties for flows and nodes.
| Property Name | Data Type | Property Description |
|---|---|---|
| bins | Integer | For continuous fields, the Empirical distribution is the cumulative distribution function of the historical data. |
| custom_gen_node_name | Boolean | You can generate the name of the generated (or updated) Generate node automatically by selecting Auto. |
| delta | Integer | Maximum allowable probability of privacy leakage. The value should be <= 1/(n*n), where n is the sample size at a time. Either epsilon or delta must be greater than 0. |
| epsilon | Integer | Determines the privacy budget. Smaller values provide greater privacy protection at the cost of accuracy. |
| frequency_weight_field | Field | Specify the weight field if your data set contains one. The weight field is then excluded from the distribution fitting process. |
| gen_node_name | String | Specify a custom name for the generated (or updated) Generate node. |
| good_fit_type | String | For continuous fields, specify either the AnderDarling test or the KolmogSmirn test of goodness of fit to rank distributions when fitting distributions to the fields. |
| locale | String | The locale determines which dictionary types are available for field generation. It can have one of the following values: ["de_DE", "en_US", "es_ES", "fr_FR", "it_IT", "ja_JP", "ko_KR", "pl_PL", "pt_BR", "ru_RU", "zh_CN"]. |
| missing_value_imputation | Boolean | Set to True to use imputation to replace your missing data. Imputation means replacing missing data with an estimate, then analyzing the full data set as if the imputed values were real data. |
| missing_value_imputation_continuous_strategy | String | When replacing missing values for a continuous field, it can be set as mean or fixed. The default is 'mean'. |
| missing_value_imputation_continuous_replace_value | Integer | When fixed is set for missing_value_imputation_continuous_strategy, you can set a value here. |
| missing_value_imputation_nominal_strategy | String | When replacing missing values for a nominal field, it can be set as mode or fixed. The default is the mean. |
| missing_value_imputation_nominal_replace_value | Integer | When fixed is set for missing_value_imputation_nominal_strategy, you can set a value here. |
| missing_value_imputation_ordinal_strategy | String | When replacing missing values for an ordinal field, it can be set as medium or fixed. The default is the mean. |
| missing_value_imputation_ordinal_replace_value | Integer | When fixed is set for missing_value_imputation_ordinal_strategy, you can set a value here. |
| missing_value_imputation_strategies | Structured Property | When True is set for missing_value_imputation, you can use this property to replace missing values for specific fields. See the Example script. |
| random_seed | Integer | Enables you to reproduce differentially private synthetic output. |
| typeless_as_realistic | Boolean | Set to True to treat the typeless field as a dictionary type. |
| use_diff_privacy | Boolean | Set to True to ensure that no sensitive data is exposed in the generated synthetic data by controlling the privacy budget (epsilon) and leakage (delta) parameters. |
| used_cases_type | String | Specifies the number of cases to use when fitting distributions to the fields in the data set. Use AllCases or FirstNCases. |
| used_cases | Integer | The number of cases. |
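The two constraints on the differential privacy properties can be checked before you set them. The following is a minimal sketch that reads the bound on delta as 1/(n*n), which is an assumption about how the expression is meant to be grouped. The helper names max_delta and check_privacy_settings are hypothetical and are not part of the product.

```python
def max_delta(sample_size):
    """Upper bound on delta, assuming the rule means delta <= 1/(n*n)."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return 1.0 / (sample_size * sample_size)


def check_privacy_settings(epsilon, delta, sample_size):
    """Return a list of problems; an empty list means both rules are satisfied."""
    problems = []
    # Rule 1: either epsilon or delta has to be greater than 0.
    if epsilon <= 0 and delta <= 0:
        problems.append("either epsilon or delta must be greater than 0")
    # Rule 2: delta should not exceed 1/(n*n) for a sample of n records.
    if delta > max_delta(sample_size):
        problems.append("delta exceeds 1/(n*n)")
    return problems


# With 1,000 records the bound works out to one in a million.
print(max_delta(1000))
print(check_privacy_settings(epsilon=1.0, delta=1e-7, sample_size=1000))
print(check_privacy_settings(epsilon=0, delta=0, sample_size=1000))
```

Because the bound shrinks with the square of the sample size, a delta that is acceptable for a small sample can be too large for a bigger one.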
Example script
The following is an example of the properties used in a script.
mimicnode.setPropertyValue("missing_value_imputation", "True")
mimicnode.setPropertyValue("missing_value_imputation_strategies","[['Age','Fixed','60']]")
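The value passed for missing_value_imputation_strategies in the example is a string that holds a list of [field, strategy, value] entries. The following minimal sketch builds that string from Python tuples so that the quoting stays consistent. The format_strategies helper and the StubNode class are hypothetical. StubNode only records property values so that the sketch can run outside a flow, whereas in a real script mimicnode would be the Mimic node object from the flow. Joining several entries with commas is an assumption, because the example shows only one entry.

```python
def format_strategies(rows):
    """Render (field, strategy, value) triples in the list-of-lists string form."""
    parts = []
    for field, strategy, value in rows:
        parts.append("['%s','%s','%s']" % (field, strategy, value))
    # Assumption: multiple entries are separated by commas inside the outer list.
    return "[" + ",".join(parts) + "]"


class StubNode:
    """Hypothetical stand-in for a Mimic node; it only stores what is set."""

    def __init__(self):
        self.properties = {}

    def setPropertyValue(self, name, value):
        self.properties[name] = value


mimicnode = StubNode()
mimicnode.setPropertyValue("missing_value_imputation", "True")
mimicnode.setPropertyValue(
    "missing_value_imputation_strategies",
    format_strategies([("Age", "Fixed", "60")]),
)
print(mimicnode.properties["missing_value_imputation_strategies"])
# prints [['Age','Fixed','60']]
```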