What's new and changed in Synthetic Data Generator

Synthetic Data Generator updates can include new features and fixes. Releases are listed in reverse chronological order, with the latest release first.

IBM® watsonx™ Version 2.2.2

A new version of Synthetic Data Generator was released in October 2025.

This release includes the following changes:

Issues fixed in this release: This release of the service includes various fixes.
Customer-reported issues fixed in this release: For a list of customer-reported issues that were fixed in this release, see the Fix List for IBM Cloud Pak® for Data on the IBM Support website.

IBM watsonx Version 2.2.1

A new version of Synthetic Data Generator was released in August 2025.

This release includes the following changes:

New features

This release of Synthetic Data Generator includes the following features:

Use PDFs with the knowledge data builder pipeline

You can now use PDFs as reference documents for the knowledge data builder pipeline. The pipeline uses the PDFs as background information when it generates question-and-answer pairs.

For more information about other file types that the knowledge data builder pipeline uses, see Generating synthetic unstructured data.

Use Synthetic Data Generator within projects with Git integration

You can now use Synthetic Data Generator within projects that are integrated with Git. You can use the Git repository to track changes, collaborate with others, and manage your data assets for Synthetic Data Generator.

For more information, see Projects with default Git integration.

Customer-reported issues fixed in this release

For a list of customer-reported issues that were fixed in this release, see the Fix List for IBM Cloud Pak for Data on the IBM Support website.

IBM watsonx Version 2.2.0

A new version of Synthetic Data Generator was released in June 2025.

This release includes the following changes:

New features

This release of Synthetic Data Generator includes the following features:

Generate unstructured text datasets by using the API

You can now create unstructured text datasets that mimic your organization's data. The synthetic data generation API creates synthetic data by using one of the data builder pipelines. You can choose the data builder pipelines to use with your unstructured data to tune and evaluate foundation models for your specific use cases.

For more information, see Generating unstructured synthetic data.

Anonymized data is more realistic

When you use the Anonymize node to mask real data, the replacement values now look more realistic. For example, masked first names, last names, and email addresses look more believable, and ZIP codes follow the patterns of real region codes.

For more information, see Using differential privacy.
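The idea of realistic-looking replacement values can be illustrated with a minimal Python sketch. It is not the Anonymize node's implementation: the name pools, the salt, and the functions mask_person and mask_zip are hypothetical and exist only for this example. The sketch replaces each real value with a believable substitute that is derived from a hash, so the same input always maps to the same masked output.

```python
import hashlib

# Hypothetical pools of replacement values; a real tool would use far larger lists.
FIRST_NAMES = ["Alex", "Jordan", "Maria", "Samir", "Chen", "Priya"]
LAST_NAMES = ["Rivera", "Nguyen", "Okafor", "Larsen", "Haddad", "Kim"]


def _digest(value: str, salt: str) -> int:
    """Map a value to a stable integer so that masking is repeatable."""
    raw = hashlib.sha256((salt + value).encode("utf-8")).digest()
    return int.from_bytes(raw[:8], "big")


def mask_person(first: str, last: str, salt: str = "demo-salt") -> dict:
    """Replace a real name with a believable name and a matching email address."""
    n = _digest(first + "|" + last, salt)
    new_first = FIRST_NAMES[n % len(FIRST_NAMES)]
    new_last = LAST_NAMES[(n // len(FIRST_NAMES)) % len(LAST_NAMES)]
    # example.com is reserved for documentation, so the address cannot be real.
    email = f"{new_first}.{new_last}@example.com".lower()
    return {"first_name": new_first, "last_name": new_last, "email": email}


def mask_zip(zip_code: str, salt: str = "demo-salt") -> str:
    """Replace a ZIP code with another value that keeps the five-digit shape."""
    return f"{_digest(zip_code, salt) % 100000:05d}"


if __name__ == "__main__":
    print(mask_person("Grace", "Hopper"))
    print(mask_zip("10504"))
```

Because the pools are small, a masked value can occasionally coincide with a real one. The sketch shows only the shape of the technique, not a privacy guarantee.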

Configure how categorical fields are identified

You can now set the maximum number of distinct values that a Categorical field can have before it is identified as Typeless. Previously, a Categorical field with more than 250 categories was often identified as a Typeless field. You can adjust this setting in the "Flow Properties" dialog.
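The effect of such a threshold can be sketched in a few lines of Python. This is an illustration only, not the product's detection logic: the function classify_field and the parameter max_categories are hypothetical names, and the default of 250 is taken from the figure mentioned above. The real setting is changed in the "Flow Properties" dialog, not in code.

```python
from collections.abc import Iterable

# Assumed default, based on the 250-category figure in the release note.
DEFAULT_MAX_CATEGORIES = 250


def classify_field(values: Iterable[object],
                   max_categories: int = DEFAULT_MAX_CATEGORIES) -> str:
    """Return "Categorical" or "Typeless" based on the count of distinct values."""
    distinct = set()
    for value in values:
        distinct.add(value)
        # Stop early once the field exceeds the configured limit.
        if len(distinct) > max_categories:
            return "Typeless"
    return "Categorical"


if __name__ == "__main__":
    codes = [f"SKU-{i}" for i in range(300)]  # 300 distinct values
    print(classify_field(codes))                      # limit of 250 is exceeded
    print(classify_field(codes, max_categories=500))  # limit raised to 500
```

With the assumed default, the 300-value field is classified as Typeless. Raising the limit to 500 keeps it Categorical, which is the kind of adjustment that the new setting allows.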

Updates

The following updates were introduced in this release:

The "Evaluation results" window has improved visualizations for quality metrics.

Customer-reported issues fixed in this release

For a list of customer-reported issues that were fixed in this release, see the Fix List for IBM Cloud Pak for Data on the IBM Support website.