Multi-import node
In Synthetic Data Generator, you can use the Multi-import node to import data from several different sources. You can then generate synthetic data based on all these data sources.
- Description
- Use the Multi-import node to import several data tables from a database connection to use as sample seed data.
- The Multi-import node imports these datasets as a dataset group, where referential integrity is used to define the relationships in the group. You can then use this group of related datasets to generate synthetic data that has similar relationships and distributions as all the original tables that you imported. For more information, see Referential integrity and multi-table nodes.
- If you use scripting, you can also import files that are saved as project assets.
- Using the node
- The Multi-import node is usually the first node in a Synthetic Data Generator flow. It imports datasets that are stored in a structured format, such as a table, and then passes the datasets to the next node for further processing. For more information, see Data sources for Synthetic Data Generator.
- Use only one Multi-import node to import multiple tabular datasets. You cannot use multiple nodes to combine data.
- If you change the sources or settings in the Multi-import node after the flow runs, rerun the Multi-mimic node before you use Run all to run the flow again. Running the Multi-mimic node refreshes the Multi-gen node.
- Mandatory or optional
- The Multi-import node is mandatory if you want to use prodution data from multiple data tables as the seed for the synthetic data. If you want to use referential integrity, you cannot generate synthetic data by defining a custom data schema for the synthetic data.
Scripting with the Multi-import node
You can use scripting languages, like Python, to progammatically set properties for nodes.
Note: You can use the Multi-import node to import data from data assets only when using scripting. You cannot select data assets in the user interface.
Multi-import node properties
The following properties are specific to the Multi-import node. For information about common node properties, see Properties for flows and nodes.
| Property Name | Data type | Property description |
|---|---|---|
assets |
Structured property | A list of assets, where each asset contains the following: id, name, assettype, connpath, primarykey, usersettings. For details, see Data structure for assets property. |
relationships |
Structured property | A list of table relationships, where each relationship includes the following: fromtable, totable, fromcolumns, tocolumns, relationtype. For details, see Data structure for relationships property. |
Data structure for assets property
- id
- The ID for the asset or connection
- name
- The name of the asset or connection in the project
- assettype
- The type of asset (
DataAssetorConnection) - You can only import data assets through scripting.
- connpath
- When
Connectionis set forasset_type, the path to the table within the connection. - Depending on the database connection, the path includes the catalog and schema, for example
catalog_name/schema_name/table_name. - primarykey
- The primary keys for one or more columns in the source table
- You can set several columns in
primarykeyby using a list of strings, for example,['ACCOUNT_ID','DISTRICT_ID']. Composite primary keys are not supported. - usersettings
- Escaped JSON string of interaction properties.
- For example,
{"interactionProperties":{"file_format":"csv","encoding":"UTF-8","first_line_header":true,"infer_schema":true,"infer_record_count":1000,"infer_as_varchar":false,"invalid_data_handling":"fail","file_name":"input.csv"}}
Data structure for relationships property
- fromtable
- The source table with the primary key
- Use dot notation for the table name
- totable
- The destination table that references the primary key
- Use dot notation for the table name
- fromcolumns
- Columns from the source table
- tocolumns
- Target columns in the destination table
- relationtype
- Relationship type (
ONE_TO_ONEorONE_TO_MANY)
Table names in dot notation
When you reference tables, you must use dot notation in most cases. The path in dot notation changes depending on if the table has a catalog and schema.
- Catalog and schema
assetId.catalog.schema.table- Only schema
assetId.schema.table- No schema
assetId.table- For example, with IBM Cloudant
- Data assets
assetId.assetName- For example, for CSV and SAV files
Example script
The following script creates a Multi-import node and sets some properties for it.
stream = sdg.script.stream()
connections = stream.createAt("dataasset_import_multi_table", "Connections", 88, 187)
connections.setPropertyValue("assets", "[['40039389-1b77-47d5-86d1-b37a5e6bf52e', 'DB2 Premium', 'Connection', '/PERF/CATEGORIES', 'CATEGORY_ID', ''], ['40039389-1b77-47d5-86d1-b37a5e6bf52e', 'DB2 Premium', 'Connection', '/PERF/CUSTOMERS', 'CUSTOMER_ID', ''], ['40039389-1b77-47d5-86d1-b37a5e6bf52e', 'DB2 Premium', 'Connection', '/PERF/PRODUCTS', 'PRODUCT_ID', ''], ['40039389-1b77-47d5-86d1-b37a5e6bf52e', 'DB2 Premium', 'Connection', '/PERF/SALES', 'SALE_ID', '']]")
connections.setPropertyValue("relationships", "[['40039389-1b77-47d5-86d1-b37a5e6bf52e.PERF.CATEGORIES', '40039389-1b77-47d5-86d1-b37a5e6bf52e.PERF.PRODUCTS', ['CATEGORY_ID'], ['CATEGORY_ID'] , 'ONE_TO_MANY'], ['40039389-1b77-47d5-86d1-b37a5e6bf52e.PERF.CUSTOMERS', '40039389-1b77-47d5-86d1-b37a5e6bf52e.PERF.SALES', ['CUSTOMER_ID'], ['CUSTOMER_ID'] , 'ONE_TO_MANY'], ['40039389-1b77-47d5-86d1-b37a5e6bf52e.PERF.PRODUCTS', '40039389-1b77-47d5-86d1-b37a5e6bf52e.PERF.SALES', ['PRODUCT_ID'], ['PRODUCT_ID'] , 'ONE_TO_MANY']]")