Multi-mimic node
In Synthetic Data Generator, you can use the Multi-mimic node to set your requirements for how the synthetic data should resemble the sample seed data. The node uses advanced algorithms to learn the underlying patterns and relationships in the input data to create realistic yet artificial records.
- Description
- The Multi-mimic node analyzes the statistical distribution of each field in the sample seed data, and it generates (or updates) a Multi-gen node with the best-fitting distribution assigned to each field. The Multi-gen node can then automatically generate synthetic data based on the analysis.
- Use the Multi-mimic node in a Synthetic Data Generator flow to configure how closely the synthetic data resembles the statistical properties and distributions of the original datasets.
- Using the node
- Use the Multi-mimic node after a Multi-import node. When the Multi-mimic node runs, a Multi-gen node is created if one does not already exist.
- Use the Multi-mimic node when you have imported data from several sources by using the Multi-import node. For more information, see Referential integrity and multi-table nodes.
- All the tables that you import need to have two or more rows when using the Multi-mimic node.
- Mandatory or optional
- The Multi-mimic node is mandatory. You must connect a Multi-import node to the Multi-mimic node.
Scripting with the Multi-mimic node
You can use scripting languages, like Python, to programmatically set properties for nodes.
Multi-mimic node properties
The following properties are specific to the Multi-mimic node. For information about common node properties, see Properties for flows and nodes.
| Property Name | Data type | Allowed values | Property description |
|---|---|---|---|
| method | Enumeration | clustering, full | The method to use for the hierarchical search algorithm. |
| algorithm | Enumeration | gaussian, copula | The algorithm to use for inner table modeling. |
| missing_value_imputation | Boolean | true, false | A flag if you use imputation to replace the missing values. |
| init_n_components | Enumeration | quick, hpo, customized | The method to use for estimating the number of components for each table. |
| init_n_clusters | Enumeration | hpo, customized | When clustering is set for method, this property sets the method for estimating the number of clusters for each table. |
| hpo_max_record | Integer | Range: 100 ≤ value ≤ 10000 | The maximum number of records to consider when using the hpo method. |
| copula_n_quantiles | Integer | Range: 5 ≤ value ≤ 10000 | The number of quantiles to use for copula transformation. |
| copula_noise_scale | Float | Range: 0.0 ≤ value ≤ 1.0 | The scale of noise added to the copula transformation to ensure numerical stability. |
| random_state | Integer | No specific range. Default: 929111600 | The model needs a random seed to initialize some parameters. You can specify the random seed here. If you don’t specify one, the current timestamp is used instead. However, using the timestamp means the model uses a different seed each time that it runs. Setting a fixed seed ensures consistent results. |
| n_components | Structured property | See property description | Sets the number of mixture components for a Gaussian Mixture Model (GMM). You can set the number for each table. If you enable hpo, the value for this parameter is not used. For details, see Data structure for n_components property. |
| n_clusters | Structured property | See property description | When clustering is set for method, this property sets the number of clusters to form for each table. This property is a list of dictionaries (or arrays), where each dictionary includes the following: table_display_name, number_of_clusters_to_form, custom_sample, table_name. For details, see Data structure for n_clusters property. |
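The numeric ranges in the property table can be checked in plain Python before a value is passed to setPropertyValue. The following is a minimal sketch, not part of the product API: the helper name check_range and the RANGES dictionary are hypothetical, and only the three range limits are taken from the table.

```python
# Numeric ranges as listed in the property table: (expected type, low, high).
RANGES = {
    "hpo_max_record": (int, 100, 10000),
    "copula_n_quantiles": (int, 5, 10000),
    "copula_noise_scale": (float, 0.0, 1.0),
}


def check_range(name, value):
    """Hypothetical helper: return value if it is inside the documented range."""
    kind, low, high = RANGES[name]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise ValueError(f"{name} must be a number, got {value!r}")
    # Integer properties must not be given a float such as 5.5.
    if kind is int and not isinstance(value, int):
        raise ValueError(f"{name} must be an integer, got {value!r}")
    if not low <= value <= high:
        raise ValueError(f"{name} must satisfy {low} <= value <= {high}, got {value!r}")
    return value


# Example: validate first, then hand the value to the node in a real flow.
hpo_max_record = check_range("hpo_max_record", 1000)
noise_scale = check_range("copula_noise_scale", 0.5)
```

Checking locally like this reports an out-of-range value before the flow runs, with a message that names the property.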
Data structure for n_components property
- table_display_name
- The display name of the table
- number_of_mixture_components
- An integer that sets the number of mixture components for each table
- The value must be within this range: 1 ≤ value ≤ 15
- custom_component
- When set to true, the specified number of mixture components is used.
- When set to false, a ratio is used to determine the number of mixture components to form. It uses the following formula: value for ratio × number of rows in the fitting phase.
- table_name
- The table name in dot notation
- For an example of the format, see Multi-import node
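The example script on this page passes n_components as a list of rows, one per table, in the order display name, count, custom flag, table name. The following sketch builds such rows in plain Python and enforces the 1 to 15 range. The helper name component_entry is hypothetical. The sketch covers only the case where custom_component is true, and it passes the flag as the string 'true' because the example script does.

```python
def component_entry(display_name, count, table_name):
    """Hypothetical helper: build one n_components row with custom_component set to true."""
    # The documented range for number_of_mixture_components is 1 <= value <= 15.
    if isinstance(count, bool) or not isinstance(count, int) or not 1 <= count <= 15:
        raise ValueError(f"count must be an integer from 1 to 15, got {count!r}")
    # Row order follows the example script: display name, count, flag, table name.
    return [display_name, count, "true", table_name]


# Identifier prefix copied from the example script on this page.
PREFIX = "40039389-1b77-47d5-86d1-b37a5e6bf52e"

n_components = [
    component_entry("PERF.PRODUCTS", 3, f"{PREFIX}.PERF.PRODUCTS"),
    component_entry("PERF.CATEGORIES", 2, f"{PREFIX}.PERF.CATEGORIES"),
]
```

In a real flow, the resulting list would be the second argument of setPropertyValue("n_components", ...).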
Data structure for n_clusters property
- table_display_name
- The display name of the table
- number_of_clusters_to_form
- An integer that sets the number of clusters to form for each table
- The value must be within this range: 1 ≤ value ≤ 15
- custom_cluster
- When set to true, the specified number of clusters is used.
- When set to false, a ratio is used to determine the number of clusters to use. It uses the following formula: value for ratio × number of rows in the fitting phase.
- table_name
- The table name in dot notation
- For an example of the format, see Multi-import node
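The ratio formula used when custom_cluster (or custom_component) is false can be illustrated with a short arithmetic sketch. The function name estimated_count is hypothetical. The rounding to a whole number and the lower bound of 1 are assumptions of this sketch, because the formula as documented states only the product of the ratio and the row count.

```python
def estimated_count(ratio, n_rows):
    """Apply the documented formula: ratio x number of rows in the fitting phase."""
    # Tables used with the Multi-mimic node need two or more rows.
    if n_rows < 2:
        raise ValueError("each table needs two or more rows")
    raw = ratio * n_rows
    # Assumption: round to a whole number and keep at least 1.
    return max(1, round(raw))


# A ratio of 0.01 on a 500-row table gives 0.01 * 500 = 5.
clusters_for_large_table = estimated_count(0.01, 500)

# A very small product is raised to 1 by the lower-bound assumption.
clusters_for_small_table = estimated_count(0.001, 100)
```

Whether the product applies an upper limit, such as the 1 to 15 range for explicit values, is not stated here, so the sketch does not clamp the result.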
Example script
The following script creates a Multi-mimic node and sets some properties for it.
stream = sdg.script.stream()
multimimic = stream.createAt("mimicplus", "Multi-mimic", 195, 187)
multimimic.setPropertyValue("method", "Clustering")
multimimic.setPropertyValue("algorithm", "Gaussian")
multimimic.setPropertyValue("missing_value_imputation", "true")
multimimic.setPropertyValue("init_n_components", "quick")
multimimic.setPropertyValue("init_n_clusters", "customized")
multimimic.setPropertyValue("hpo_max_record", 1000)
multimimic.setPropertyValue("random_state", 82911159)
# set when init_n_components is set as customized
multimimic.setPropertyValue("n_components", [['PERF.PRODUCTS', 3 , 'true', '40039389-1b77-47d5-86d1-b37a5e6bf52e.PERF.PRODUCTS'], ['PERF.CATEGORIES', 2 , 'true', '40039389-1b77-47d5-86d1-b37a5e6bf52e.PERF.CATEGORIES']])
# set when init_n_clusters is set as customized
multimimic.setPropertyValue("n_clusters", [['PERF.PRODUCTS', 2 , 'true', '40039389-1b77-47d5-86d1-b37a5e6bf52e.PERF.PRODUCTS'], ['PERF.CATEGORIES', 3 , 'true', '40039389-1b77-47d5-86d1-b37a5e6bf52e.PERF.CATEGORIES']])
# connections is assumed to reference the upstream Multi-import node that is already in the flow
stream.link(connections, multimimic)
multimimic.run([])