Multi-gen node
In Synthetic Data Generator, you use the Multi-gen node to create structured synthetic data from a group of datasets.
- Description
- The Multi-gen node automatically creates synthetic data by using the statistical properties and relationships in the source datasets. Synthetic Data Generator uses various algorithms and statistical models to analyze these properties and relationships when the Multi-mimic node runs on the existing datasets. The Multi-gen node then uses this information to generate artificial yet realistic-looking data.
- Using the node
- The Multi-gen node is used when you have imported data from several sources by using the Multi-import node. For more information, see Referential integrity and multi-table nodes.
- You do not need to add a Multi-gen node to the Synthetic Data Generator flow. When you run a flow that has a Multi-mimic node in it, a Multi-gen node is automatically created and added after the Multi-mimic node. On subsequent runs, the existing Multi-gen node is updated.
- Mandatory or optional
- The Multi-gen node is mandatory. However, you don't add the Multi-gen node to the flow yourself. It is automatically added after a Multi-mimic node runs for the first time.
- If you want to use referential integrity to generate synthetic data, you must use production data. You cannot define a custom data schema for the synthetic data.
Scripting with the Multi-gen node
You can use scripting languages, like Python, to programmatically set properties for nodes.
Multi-gen node properties
The following properties are specific to the Multi-gen node. For information about common node properties, see Properties for flows and nodes.
| Property name | Data type | Allowed values | Property description |
|---|---|---|---|
| ratio | Float | Range: 0.0 < value ≤ 10.0 | This property scales the number of rows to generate for each individual table. The row count is calculated as the value for ratio × the number of rows in the table during the fitting phase. A minimum of one record is always generated. |
| random_state | Integer | No specific range. Default value: 929111600 | The model needs a random seed to initialize some parameters. You can specify the random seed here. If None is specified, the current timestamp is used instead, which means that the model uses a different seed each time that it runs. Setting a fixed seed ensures consistent results. |
| n_samples | Structured property | See property description | This property sets the number of rows to generate for each table. It contains a list of dictionaries (or arrays), where each entry includes the following: table_display_name, row_count, custom_sample, table_name. For details, see Data structure for n_samples property. |
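The row-count formula for ratio and the fallback behavior of random_state can be illustrated with a small standalone sketch. This is not the product's implementation. The function names rows_for_table and resolve_seed are hypothetical, and truncating the product to an integer is an assumption, because the property description states only the formula and the minimum of one record.

```python
import time


def rows_for_table(ratio, fitted_rows):
    """Estimate how many rows are generated for one table.

    Assumption: the product ratio * fitted_rows is truncated to an integer.
    The documentation states only the formula and the minimum of one record.
    """
    if not 0.0 < ratio <= 10.0:
        raise ValueError("ratio must satisfy 0.0 < ratio <= 10.0")
    return max(1, int(ratio * fitted_rows))


def resolve_seed(random_state=None):
    """Return the given seed, or the current timestamp when None is passed."""
    if random_state is None:
        # A timestamp seed differs between runs, so results are not repeatable.
        return int(time.time())
    return random_state


print(rows_for_table(2.0, 1000))   # 2000
print(rows_for_table(0.01, 10))    # 1 (the minimum of one record applies)
print(resolve_seed(29))            # 29
```

With a ratio of 2.0, a table that had 1000 rows during fitting yields 2000 rows. A very small ratio still yields one row, which matches the documented minimum.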
Data structure for n_samples property
- table_display_name
- The display name of the table.
- row_count
- The number of rows to generate.
- custom_sample
- This property controls whether row_count is fixed (true) or scaled by the ratio property (false). This property is mostly used in the user interface. For scripting, set it to true for only the tables where you specify a value for row_count.
- table_name
- The table name in dot notation.
- For an example of the format, see Multi-import node.
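The entry structure described above can be assembled with a small helper before it is passed to the node. The following is a minimal sketch, not an official API. The helper name n_samples_entry and the ASSET_ID placeholder are hypothetical. The field order and the use of the strings 'true' and 'false' for custom_sample are assumptions taken from the list form that the example script in this topic uses.

```python
def n_samples_entry(table_display_name, row_count, table_name, fixed=False):
    """Build one n_samples entry in list form.

    Assumed field order (taken from the example script in this topic):
    table_display_name, row_count, custom_sample, table_name.
    """
    # custom_sample is written as a string, as in the example script.
    custom_sample = "true" if fixed else "false"
    return [table_display_name, row_count, custom_sample, table_name]


# Hypothetical placeholder for the identifier that prefixes each table name.
ASSET_ID = "00000000-0000-0000-0000-000000000000"

n_samples = [
    # Fixed row count: custom_sample is 'true' because row_count is specified.
    n_samples_entry("PERF.CATEGORIES", 10, ASSET_ID + ".PERF.CATEGORIES", fixed=True),
    # Scaled by the ratio property: custom_sample stays 'false'.
    n_samples_entry("PERF.SALES", 4000, ASSET_ID + ".PERF.SALES"),
]

for entry in n_samples:
    print(entry)
```

The resulting list has the same shape as the value that the example script passes to setPropertyValue for the n_samples property. The sketch avoids f-strings so that it also runs on older Python-compatible scripting runtimes.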
Example script
The following script finds a Multi-gen node in a Synthetic Data Generator flow and sets some properties for it.
stream = sdg.script.stream()
generateplus = stream.findByType("generateplus", None)
generateplus.setPropertyValue("ratio", 2.0)
generateplus.setPropertyValue("random_state", 29)
generateplus.setPropertyValue("n_samples", [
    ['PERF.CATEGORIES', 10, 'false', '40039389-1b77-47d5-86d1-b37a5e6bf52e.PERF.CATEGORIES'],
    ['PERF.CUSTOMERS', 1000, 'false', '40039389-1b77-47d5-86d1-b37a5e6bf52e.PERF.CUSTOMERS'],
    ['PERF.PRODUCTS', 20, 'false', '40039389-1b77-47d5-86d1-b37a5e6bf52e.PERF.PRODUCTS'],
    ['PERF.SALES', 4000, 'false', '40039389-1b77-47d5-86d1-b37a5e6bf52e.PERF.SALES']
])