Synthetic Data Sets

Prebuilt synthetic data sets for AI

IBM® Synthetic Data Sets are prebuilt, artificial datasets designed to train predictive AI models and large language models (LLMs) to benefit IBM Z® and LinuxONE enterprises in financial services.

Built with IBM’s financial services expertise, these data sets deliver rich, privacy-compliant data (downloadable in CSV or DDL) for quick, secure, and accurate AI development.

Webinar: IBM Synthetic Data Sets introduction

Accelerate AI model training securely

Jumpstart AI model creation with downloadable, PII-free datasets built for quick, compliant use.

Enhance models with richer data

Access rich synthetic data including fraud labels and multiple entities for stronger, broader insights.

Validate the accuracy of AI models

Use labeled transactions as an answer key to test, validate, and refine fraud detection models.

Optimize risk detection in finance

Improve predictive accuracy and reduce risk in financial services AI projects with curated datasets.

No real PII included
Logic maintained
Known ground truth
Referential integrity

Compliant datasets

The agent-based generation methodology works at a statistical population level, so it needs no real source data, which can take months to access. Because the datasets are artificially generated and contain no real or anonymized PII, they comply with data privacy regulations.

Realistic synthetic data

IBM Synthetic Data Sets are built on an agent-based model refined with years of custom inputs and code, which other synthetic data generators don’t offer. The datasets retain and accurately reflect the complex real-world relationships and constraints that other synthetic data generators often struggle to reproduce.

Enhance AI model accuracy

Ground truth is information known to be true, and training data annotated with it improves AI model accuracy. In IBM Synthetic Data Sets the ground truth is known: each transaction is labeled for fraud and money laundering.
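As a rough illustration of using labeled transactions as an answer key, the sketch below compares a model's fraud flags with ground-truth labels and computes precision and recall. The lists `is_fraud` and `model_flags` are invented sample values, not data or column names from the product.

```python
def precision_recall(labels, predictions):
    """Compare binary predictions with ground-truth labels."""
    pairs = list(zip(labels, predictions))
    true_pos = sum(1 for y, p in pairs if y and p)
    false_pos = sum(1 for y, p in pairs if not y and p)
    false_neg = sum(1 for y, p in pairs if y and not p)
    # Guard against division by zero when nothing is flagged or nothing is fraudulent.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Hypothetical ground-truth labels (1 = fraud) for six transactions.
is_fraud = [1, 0, 0, 1, 1, 0]
# Hypothetical flags produced by a fraud detection model for the same transactions.
model_flags = [1, 0, 1, 1, 0, 0]

precision, recall = precision_recall(is_fraud, model_flags)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Because every transaction carries a label, the same comparison can be repeated after each model revision to see whether a change helped or hurt.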

Connect data tables

Referential integrity means that the relationships between tables are accurate, consistent, and up to date, so every reference in one table points to a matching record in another. IBM Synthetic Data Sets maintain referential integrity throughout, which data from standard synthetic data generators often lacks.
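A referential integrity check on CSV tables might look like the minimal sketch below. The table layouts and the names `account_id`, `transaction_id`, and `find_orphans` are assumptions made for illustration, not the actual schema of the datasets. One row is deliberately broken to show what a violation looks like.

```python
import csv
import io

# Hypothetical parent table: one row per account.
ACCOUNTS_CSV = """account_id,customer_name
A1,Synthetic Person 1
A2,Synthetic Person 2
"""

# Hypothetical child table: T3 points at an account that does not exist.
TRANSACTIONS_CSV = """transaction_id,account_id,amount
T1,A1,25.00
T2,A2,310.50
T3,A9,12.75
"""


def find_orphans(child_rows, parent_rows, key):
    """Return child rows whose key value has no match in the parent table."""
    parent_keys = {row[key] for row in parent_rows}
    return [row for row in child_rows if row[key] not in parent_keys]


accounts = list(csv.DictReader(io.StringIO(ACCOUNTS_CSV)))
transactions = list(csv.DictReader(io.StringIO(TRANSACTIONS_CSV)))

orphans = find_orphans(transactions, accounts, "account_id")
orphan_ids = [row["transaction_id"] for row in orphans]
print(orphan_ids)
```

An empty result means every child row resolves to a parent row, which is the property a dataset with referential integrity should satisfy.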

Use cases

Credit card fraud detection

Accurate fraud detection keeps customers satisfied and loyal while minimizing financial losses. IBM Synthetic Data Sets for Payments Cards improves fraud protection AI models by providing labeled transaction data.

Anti-money laundering

IBM Synthetic Data Sets for Core Banking and Money Laundering provides labeled data, including global and cash transactions that are unavailable in real banking data. This helps build stronger anti-money laundering models that reduce risk and false positives and save investigation time and costs.

Insurance claims fraud

Insurers use real claims data, but IBM Synthetic Data Sets for Homeowners Insurance adds synthetic “what-if” scenarios that cover diverse claim types and fraud cases. Each claim is labeled for fraud, detection status, and reason, providing a rich dataset to train, validate, and improve AI models for detecting fraudulent claims.

IBM Synthetic Data Sets wins the Banking Tech Award for “Best AI Solution.”

Take the next step

Discover how to jumpstart AI projects on IBM Z and LinuxONE with Synthetic Data Sets.

Read the IBM Redpaper

Watch the product webinar playback
