Quick start: Generate synthetic tabular data

Take this tutorial to learn how to generate synthetic tabular data in IBM watsonx.ai. The benefit of synthetic data is that you can produce it on demand, customize it to fit your use case, and generate it in large quantities. This tutorial shows you how to use Synthetic Data Generator, a graphical flow editor, to generate synthetic tabular data based on production data or a custom data schema using visual flows and modeling algorithms.

Required services: Watson Studio; Synthetic Data Generator

Your basic workflow includes these tasks:

Open a project. Projects are where you can collaborate with others to work with data.
Add your data to the project. You can add CSV files or data from a remote data source through a connection.
Create and run a synthetic data flow in the project. You use the Synthetic Data Generator graphical flow editor to generate synthetic tabular data based on production data or a custom data schema.
Review the synthetic data flow and output.

Read about synthetic data

Synthetic data is information that has been generated on a computer to augment or replace real data to improve AI models, protect sensitive data, and mitigate bias. Synthetic data helps to mitigate many of the logistical, ethical, and privacy issues that come with training machine learning models on real-world examples.

Watch a video about generating synthetic tabular data

Watch this video to preview the steps in this tutorial. There might be slight differences in the user interface shown in the video. The video is intended to be a companion to the written tutorial. The video begins on the watsonx home screen. The user navigates to the Resource hub, selects Data, and opens sample data.

This video provides a visual method to learn the concepts and tasks in this documentation.

Try a tutorial to generate synthetic tabular data

In this tutorial, you will complete these tasks:

Task 1: Open a project
Task 2: Add data to your project
Task 3: Create a synthetic data flow
Task 4: Review the data flow and output

Tips for completing this tutorial

Here are some tips for successfully completing this tutorial.

Get help in the community

If you need help with this tutorial, you can ask a question or find an answer in the watsonx Community discussion forum.

Set up your browser windows

For the optimal experience completing this tutorial, open Cloud Pak for Data in one browser window, and keep this tutorial page open in another browser window to switch easily between the two applications. Consider arranging the two browser windows side-by-side to make it easier to follow along.

Side-by-side tutorial and UI

Tip: If you encounter a guided tour while completing this tutorial in the user interface, click Maybe later.

Task 1: Open a project

You need a project to store the assets.

Follow the steps to verify that you have an existing project or create a project.

From the watsonx home screen, scroll to the Projects section. If you see any projects listed, then skip to Task 2. If you don't see any projects, then follow these steps to create a project.
From the Quick navigation, click All projects.
Open an existing project, or create a new project:
1. Click New project on the Projects page.
2. Select Create an empty project.
3. On the Create a project screen, type a name and optional description for the project.
4. Click Create.

For more information or to watch a video, see Creating a project.

Check your progress

The following image shows the empty project. You are now ready to add data to the project.

Task 2: Add data to your project

The data set used in this tutorial contains typical information that a company gathers about its customers. Follow these steps to download the data set and add it to your project:

Download the Auto Insurance Customers data set (4KB).
From your project, click the Upload asset to project icon.
In the side panel that opens, browse to select the Customers.csv file, and click Open. Stay on the page until the load completes.
The Customers.csv file is added to your project as a data asset.
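
Before you upload the file, you might want to confirm locally that it contains the columns you expect. The following Python sketch is an optional aside and is not part of the watsonx.ai workflow. It counts the rows and the non-empty values in each column by using only the standard library. The inline sample and its CUSTOMER_ID and AGE columns are hypothetical stand-ins for the real file; only CREDITCARD_NUMBER is a column name that appears in this tutorial.

```python
import csv
import io


def summarize_columns(handle):
    """Return (row_count, {column name: number of non-empty values})."""
    reader = csv.DictReader(handle)
    counts = {name: 0 for name in (reader.fieldnames or [])}
    rows = 0
    for row in reader:
        rows += 1
        for name in counts:
            # Missing or whitespace-only cells are treated as empty.
            if (row.get(name) or "").strip():
                counts[name] += 1
    return rows, counts


# Hypothetical stand-in for the downloaded file (column names are assumptions,
# except CREDITCARD_NUMBER, which the tutorial mentions).
sample = io.StringIO(
    "CUSTOMER_ID,AGE,CREDITCARD_NUMBER\n"
    "1,34,4111111111111111\n"
    "2,51,\n"
)
rows, counts = summarize_columns(sample)
print(rows, counts)
```

To run the same check against the downloaded file, pass open("Customers.csv", newline="") to summarize_columns instead of the in-memory sample.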

Check your progress

The following image shows the Assets tab in the project. Now you are ready to create the synthetic data flow.

Task 3: Create a synthetic data flow

Use the Synthetic Data Generator to create a data flow that generates synthetic tabular data based on production data or a custom data schema using visual flows and modeling algorithms. Follow these steps to create a synthetic data flow asset in your project:

From the Assets tab in your project, click New asset > Generate synthetic tabular data.
For the name, type Bank customers.
Click Create.
On the Welcome to Synthetic Data Generator screen, click First time user, and click Continue. This option provides a guided experience for you to build the data flow.
Review the two use cases:
- Leverage your existing data: Generate a structured synthetic data set based on your production data. You can connect to a database, import or upload a file, mask, and generate your output before exporting.
- Create from custom data: Generate a structured synthetic data set based on meta data. You can define the data within each table column, their distributions, and any correlations.
Select the Leverage your existing data use case, and click Next to import existing data.
Click Select data from project to use the customers data asset that you added to the project in Task 2.
1. Select Data asset > customers.csv.
2. Click Select.
3. Click Next.
In the list of columns, search for creditcard_number.
1. In the Anonymize column for CREDITCARD_NUMBER, select Yes to mask customers' credit card numbers.
2. Click Next.
On the Mimic options page, change the Number of rows to 1000. Accept the default settings for the rest of the options. These options generate synthetic data, based on your production data, using a set of candidate statistical distributions to modify each column in your data. Click Next.
On the Evaluate screen, toggle the Enable evaluate metrics option. Here, you can specify settings to compare the generated synthetic data with your baseline input. You can choose which metrics to assess.
1. Select the following metrics:
  - Fidelity score
  - Data distinguishability
  - Leakage prevention score
  - Proximity score
2. Click Next.
On the Export data page, type bank_customers.csv for the File name, and click Next.
Review the settings, and click Save flow. The Synthetic Data Generator tool displays with the data flow.
When prompted, click Run flow, and wait for the run to complete.
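
To build intuition for what the Anonymize and Mimic steps in this flow do, here is a minimal Python sketch of the general ideas. It is an illustration under stated assumptions, not the algorithm that Synthetic Data Generator uses: the masking function, the two candidate distributions (normal and exponential), the selection by log-likelihood, and the sample ages are all hypothetical choices made for this example.

```python
import hashlib
import math
import random
import statistics


def mask_value(value, salt="demo-salt"):
    """Map a sensitive string to a digit string of the same length (up to 64 characters)."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    digits = "".join(str(int(ch, 16) % 10) for ch in digest)
    return digits[: len(value)]


def fit_best_distribution(values):
    """Fit each candidate by maximum likelihood; return (name, sampler) for the best fit."""
    mean = statistics.fmean(values)
    spread = statistics.pstdev(values)  # population std dev is the normal MLE
    candidates = {}
    if spread > 0:
        ll = sum(
            -0.5 * math.log(2 * math.pi * spread ** 2)
            - (x - mean) ** 2 / (2 * spread ** 2)
            for x in values
        )
        candidates["normal"] = (ll, lambda rng: rng.gauss(mean, spread))
    if min(values) >= 0 and mean > 0:
        rate = 1.0 / mean  # exponential MLE
        ll = sum(math.log(rate) - rate * x for x in values)
        candidates["exponential"] = (ll, lambda rng: rng.expovariate(rate))
    if not candidates:
        # Degenerate column: every synthetic value repeats the mean.
        return "constant", lambda rng: mean
    best = max(candidates, key=lambda name: candidates[name][0])
    return best, candidates[best][1]


# Hypothetical numeric column standing in for one column of the source data.
ages = [23, 35, 41, 29, 52, 47, 38, 31, 44, 36]
name, sampler = fit_best_distribution(ages)

rng = random.Random(0)  # fixed seed so the run is repeatable
synthetic_ages = [sampler(rng) for _ in range(1000)]  # 1000 rows, as in the tutorial

masked = mask_value("4111111111111111")
print(name, len(synthetic_ages), round(statistics.fmean(synthetic_ages), 1), masked)
```

The sketch treats each column independently. It ignores correlations between columns, which the Generate node in the flow lets you configure.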

Check your progress

The following image shows the data flow open in the Synthetic Data Generator. Now you can explore the data flow and view the output.

Task 4: Review the data flow and output

When the run completes, you can explore the data flow. Follow these steps to review the synthetic data flow and the results:

Click the Palette icon to close the node panel.
Double-click the Import node to see the settings.
1. Review the Data properties. The tool read the data set from the project and filled in the appropriate data properties.
2. Expand the Types section. The tool read the values and columns in the data set.
3. Click Cancel.
Double-click the Anonymize node to see the settings.
1. Verify that the CREDITCARD_NUMBER column is set to be anonymized.
2. Expand the Anonymize values section. Here you can customize how the values are anonymized.
3. Click Cancel.
Double-click the Mimic node to see the settings.
1. Review the default settings to mimic the data in the source customers data set.
2. Click Cancel.
Double-click the Evaluate node to see the settings.
1. Review the following settings:
  - The Baseline input is set to Import. The flow shows that the Evaluate node has two inputs: the outputs from the Anonymize and Generate nodes.
  - The Quality metrics, Privacy metrics, Utility metrics, and Assessment level. Hover over the Information icon to see a description for each setting.
2. Click Cancel.
Double-click the Generate node to see the settings.
1. Review the list of Synthesized columns.
2. Optional: Review the Correlations and Advanced Options.
3. Click Cancel.
Double-click the Export node to see the settings.
1. Optional: By default the exported data is stored in the project. Click Change path to store the exported data in a connection, such as Db2 Warehouse.
2. Click Cancel.
In the Outputs pane, click the results with the name Evaluate. If you don't see the Outputs pane, click the Outputs icon.
Click the View details icon for each of the metrics to see the visualizations for that metric.
On the Chart metrics tab, you can see the same scores. When you are done, close the window.
Click your project name to return to the Assets tab.
Click bank_customers.csv to see a preview of the generated synthetic tabular data.
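
If you download both files, you can also make a rough comparison of a numeric column in the source data against the same column in the generated data outside the tool. The following Python sketch computes the two-sample Kolmogorov-Smirnov statistic, which is the largest gap between the two empirical cumulative distributions (0 means identical samples, 1 means no overlap). This is a generic statistical check chosen for illustration; it is an assumption of this example and is not how the Evaluate node computes its Fidelity score or other metrics.

```python
def ks_statistic(a, b):
    """Largest absolute difference between the empirical CDFs of samples a and b."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both samples past every value less than or equal to x.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap


print(ks_statistic([1, 2, 3], [1, 2, 3]))     # 0.0
print(ks_statistic([1, 2, 3], [10, 11, 12]))  # 1.0
```

To apply it to real data, read the same numeric column from Customers.csv and bank_customers.csv (for example, with csv.DictReader), convert the values to float, and pass the two lists to ks_statistic.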

Check your progress

The following image shows the exported, generated synthetic tabular data set.

Next steps

Try these additional tutorials to get more hands-on experience with watsonx.ai:

Additional resources

View more videos.