Referential integrity and multi-table nodes
If you want to create synthetic data by using several data tables from a database connection, you can use multi-table nodes. The multi-table nodes preserve referential integrity by maintaining the defined primary key–foreign key relationships across data tables. You can then generate synthetic data that retains these structural relationships from the production data.
Referential integrity
Multi-table nodes use primary and foreign keys to ensure that relationships across tables are preserved when working with data from multiple source tables in Synthetic Data Generator. A primary key gives a column a unique identifier within its source table. Other tables can then reference this unique identifier by defining it as a foreign key. Primary key–foreign key pairs define how data is related across the tables within the database.
These primary key–foreign key relationships are defined in the source database and can be read and viewed in the Multi-import node.
Relationships in tables
Synthetic Data Generator can maintain relationships within tables and across multiple tables. When you use multi-table nodes, the following types of relationships in datasets are preserved:
- Relationships within tables
- Synthetic Data Generator preserves distributions, joint distributions, and correlations between different rows or columns within the same table.
- Relationships across tables
- Referential integrity preserves parent–child dependencies across multiple tables by using the primary key–foreign key pairs.
When generating synthetic data, these relationships are preserved so that the generated multi-table datasets remain consistent and structurally aligned with the original datasets.
For example, a Customer table contains customer profile data, and a Transaction table records all customer transactions. You can use primary key–foreign key pairs to link each transaction in the Transaction table with a specific customer in the Customer table. When the Synthetic Data Generator generates synthetic versions of these tables, it preserves the primary key–foreign key relationships, which ensures that the synthetic transactions remain correctly linked with the synthetic customer profiles. As a result, the synthetic customers have similar transaction records as customers in the real data.
Using multi-table nodes
If you use a Multi-import node to import data from multiple tables, then all the other nodes in your Synthetic Data Generator flow must be multi-table nodes as well. You cannot mix multi-table nodes with other types of nodes.
The following multi-table nodes are available.
- Multi-import node
- You can use the Multi-import node to pick the data connection and the tables to use. The primary and foreign keys that define the relationships across these tables are read from the source database. For tables with foreign keys to work properly, you must import the corresponding tables that have the related primary keys.
- Multi-mimic node
- Use the Multi-mimic node to configure how multi-table relationships should be modelled. You can adjust methods and parameters to increase the model complexity, which allows the synthetic data to more closely reflect the relationships and data characteristics of the source datasets. However, increased complexity often requires more resources and time.
- Multi-generate node
- In the Multi-generate node, you can set the parameters for generating synthetic data from the imported dataset group. You can also adjust the ratio for how data is generated across the different tables.
- Multi-export node
- For the Multi-export node, you can pick which database connection to export the synthetic data to. When the Synthetic Data Generator flow runs, the synthetic data that generates is automatically saved to the connection that you picked.