Multi-mimic node

In Synthetic Data Generator, you can use the Multi-mimic node to set your requirements for how the synthetic data should resemble the sample seed data. The node uses advanced algorithms to learn the underlying patterns and relationships in the input data to create realistic yet artificial records.

Description: The Multi-mimic node analyzes the statistical distribution of each field in the sample seed data and generates (or updates) a Multi-gen node with the best-fitting distribution assigned to each field. The Multi-gen node can then automatically generate synthetic data based on the analysis. Use the Multi-mimic node in a Synthetic Data Generator flow to configure how closely the synthetic data resembles the statistical properties and distributions of the original datasets.
Using the node: Use the Multi-mimic node after a Multi-import node. When the Multi-mimic node runs, a Multi-gen node is created if one does not already exist. Use the Multi-mimic node when you have imported data from several sources by using the Multi-import node. For more information, see Referential integrity and multi-table nodes. Every table that you import must have two or more rows when you use the Multi-mimic node.
Mandatory or optional: The Multi-mimic node is mandatory. You must connect a Multi-import node to the Multi-mimic node.

Scripting with the Multi-mimic node

You can use scripting languages, like Python, to programmatically set properties for nodes.

Multi-mimic node properties

The following properties are specific to the Multi-mimic node. For information about common node properties, see Properties for flows and nodes.

Table 1. Node properties for scripting
Property Name	Data type	Allowed values	Property description
`method`	Enumeration	`clustering` `full`	The method to use for the hierarchical search algorithm.
`algorithm`	Enumeration	`gaussian` `copula`	The algorithm to use for inner table modeling.
`missing_value_imputation`	Boolean	`true` `false`	Specifies whether imputation is used to replace missing values.
`init_n_components`	Enumeration	`quick` `hpo` `customized`	The method to use for estimating the number of components for each table.
`init_n_clusters`	Enumeration	`hpo` `customized`	When `clustering` is set for `method`, this property sets the method for estimating the number of clusters for each table.
`hpo_max_record`	Integer	Range: 100 ≤ `value` ≤ 10000	The maximum number of records to consider when using the `hpo` method.
`copula_n_quantiles`	Integer	Range: 5 ≤ `value` ≤ 10000	The number of quantiles to use for copula transformation.
`copula_noise_scale`	Float	Range: 0.0 ≤ `value` ≤ 1.0	The scale of noise added to the copula transformation to ensure numerical stability.
`random_state`	Integer	No specific range. Default: 929111600	The model needs a random seed to initialize some parameters. You can specify the random seed here. If you don’t specify one, the current timestamp is used instead. However, using the timestamp means the model uses a different seed each time that it runs. Setting a fixed seed ensures consistent results.
`n_components`	Structured property	See property description	Sets the number of mixture components for a Gaussian Mixture Model (GMM). You can set the number for each table. If you enable `hpo`, the value for this parameter is not used. For details, see Data structure for `n_components` property.
`n_clusters`	Structured property	See property description	When `clustering` is set for `method`, this property sets the number of clusters to form for each table. This property is a list of dictionaries (or arrays), where each dictionary includes the following: `table_display_name`, `number_of_clusters_to_form`, `custom_sample`, `table_name`. For details, see Data structure for `n_clusters` property.
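
The allowed values and ranges in Table 1 can be checked before a script sets them on the node. The following is a minimal sketch, not part of the product API: `check_properties`, `ENUMS`, and `RANGES` are hypothetical names, and the sketch assumes that enumeration values are case-sensitive and must match the table exactly.

```python
# Hypothetical pre-flight check. The rules are copied from Table 1;
# nothing here calls the Synthetic Data Generator scripting API.
ENUMS = {
    "method": {"clustering", "full"},
    "algorithm": {"gaussian", "copula"},
    "init_n_components": {"quick", "hpo", "customized"},
    "init_n_clusters": {"hpo", "customized"},
}

RANGES = {
    "hpo_max_record": (100, 10000),
    "copula_n_quantiles": (5, 10000),
    "copula_noise_scale": (0.0, 1.0),
}


def check_properties(props):
    """Return a list of problems; an empty list means every value passed."""
    problems = []
    for name, value in props.items():
        if name in ENUMS:
            # Assumption: values are compared case-sensitively.
            if value not in ENUMS[name]:
                problems.append(
                    "%s: %r is not one of %s" % (name, value, sorted(ENUMS[name]))
                )
        elif name in RANGES:
            low, high = RANGES[name]
            is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
            if not is_number or not low <= value <= high:
                problems.append(
                    "%s: %r is outside %s <= value <= %s" % (name, value, low, high)
                )
    return problems


# Usage: collect problems first, then set only a clean dictionary on the node.
candidate = {"method": "clustering", "algorithm": "gaussian", "hpo_max_record": 1000}
problems = check_properties(candidate)
```

Properties that are not listed in `ENUMS` or `RANGES`, such as `random_state`, pass through unchecked because the table gives no range for them.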

Data structure for `n_components` property

table_display_name: The display name of the table
number_of_mixture_components: An integer that sets the number of mixture components to use for each table. The value must be within this range: 1 ≤ value ≤ 15
custom_component: When set to true, the specified number of mixture components is used. When set to false, a ratio is used to determine the number of mixture components to form. It uses the following formula: value for ratio × number of rows in the fitting phase.
table_name: The table name in dot notation. For an example of the format, see Multi-import node

Data structure for `n_clusters` property

table_display_name: The display name of the table
number_of_clusters_to_form: An integer that sets the number of clusters to form for each table. The value must be within this range: 1 ≤ value ≤ 15
custom_cluster: When set to true, the specified number of clusters is used. When set to false, a ratio is used to determine the number of clusters to use. It uses the following formula: value for ratio × number of rows in the fitting phase.
table_name: The table name in dot notation. For an example of the format, see Multi-import node
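
Both structured properties share the same four-field shape, so one helper can build entries for either. The following is a minimal sketch under stated assumptions: `table_entry` is a hypothetical helper, the field order follows the lists above, and the flag is written as the string `'true'` as in the example script in this topic. Writing `'false'` for the opposite case is an assumption.

```python
def table_entry(display_name, count, custom, table_name):
    """Build one entry for the n_components or n_clusters property.

    Field order: display name, count, custom flag, table name in dot notation.
    """
    is_int = isinstance(count, int) and not isinstance(count, bool)
    if not is_int or not 1 <= count <= 15:
        raise ValueError("count must be an integer with 1 <= value <= 15")
    # Assumption: the flag is passed as the string 'true' or 'false'.
    flag = "true" if custom else "false"
    return [display_name, count, flag, table_name]


# Usage with the table names from the example script in this topic.
prefix = "40039389-1b77-47d5-86d1-b37a5e6bf52e"
n_clusters = [
    table_entry("PERF.PRODUCTS", 2, True, prefix + ".PERF.PRODUCTS"),
    table_entry("PERF.CATEGORIES", 3, True, prefix + ".PERF.CATEGORIES"),
]
```

The resulting list can be passed as the value of `n_clusters` (or `n_components`) in a `setPropertyValue` call. The range check rejects a count outside 1 to 15 before the node is touched.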

Example script

The following script creates a Multi-mimic node and sets some properties for it.

stream = sdg.script.stream()
multimimic = stream.createAt("mimicplus", "Multi-mimic", 195, 187)
multimimic.setPropertyValue("method", "clustering")
multimimic.setPropertyValue("algorithm", "gaussian")
multimimic.setPropertyValue("missing_value_imputation", "true")
multimimic.setPropertyValue("init_n_components", "quick")
multimimic.setPropertyValue("init_n_clusters", "customized")
multimimic.setPropertyValue("hpo_max_record", 1000)
multimimic.setPropertyValue("random_state", 82911159)

# set when init_n_components is set as customized
multimimic.setPropertyValue("n_components", [['PERF.PRODUCTS', 3 , 'true', '40039389-1b77-47d5-86d1-b37a5e6bf52e.PERF.PRODUCTS'], ['PERF.CATEGORIES', 2 , 'true', '40039389-1b77-47d5-86d1-b37a5e6bf52e.PERF.CATEGORIES']])
# set when init_n_clusters is set as customized
multimimic.setPropertyValue("n_clusters", [['PERF.PRODUCTS', 2 , 'true', '40039389-1b77-47d5-86d1-b37a5e6bf52e.PERF.PRODUCTS'], ['PERF.CATEGORIES', 3 , 'true', '40039389-1b77-47d5-86d1-b37a5e6bf52e.PERF.CATEGORIES']])
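# `connections` is assumed to reference an upstream Multi-import node that already exists in the flow
stream.link(connections, multimimic)
multimimic.run([])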