IBM Synthetic Data Sets
Designed to accelerate AI adoption and increase predictive accuracy to drive business innovation and value

IBM® Synthetic Data Sets are a family of artificially generated datasets designed to improve the training of predictive AI models and large language models (LLMs). They give IBM Z® and LinuxONE enterprises in financial services quick access to relevant, rich data for AI projects.

These prebuilt datasets are downloadable and packaged as CSV and DDL files, so they are familiar to use and compatible with everything from databases and spreadsheets to hardware platforms and standard AI tools. The datasets draw on IBM's industry expertise and domain knowledge of the financial services sector without using any real client seed data, which alleviates security concerns about Personally Identifiable Information (PII).

IBM Synthetic Data Sets were curated for fraud detection use cases. Clients can download the datasets to develop predictive AI models and LLMs for financial services, or to optimize existing models for improved accuracy and risk mitigation.

Announcing IBM Synthetic Data Sets

Learn how prebuilt synthetic data boosts AI accuracy, speeds up projects and delivers rapid results. Jumpstart your AI journey with IBM Synthetic Data Sets.

Types of datasets
IBM Synthetic Data Sets for Payment Cards

Ideal to train AI models to detect credit card fraud. The dataset includes simulated credit cards and holders with detailed transaction histories. Each transaction is labeled “yes” or “no” for fraud and linked by fraudster ID for pattern tracking.
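
As a minimal sketch of how such labels might be consumed, the example below parses CSV text, converts the "yes"/"no" fraud label to a boolean and groups fraudulent transactions by fraudster ID. The column names (`transaction_id`, `card_id`, `amount`, `is_fraud`, `fraudster_id`) and the sample rows are hypothetical and chosen only for illustration; the actual schema is defined by the dataset's DDL files.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows; the real column names come from the dataset's schema.
SAMPLE_CSV = """transaction_id,card_id,amount,is_fraud,fraudster_id
t1,c1,25.00,no,
t2,c1,910.50,yes,f7
t3,c2,12.75,no,
t4,c3,430.00,yes,f7
t5,c3,88.10,yes,f9
"""


def load_transactions(text):
    """Parse CSV text and convert the yes/no fraud label to a boolean."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        row["amount"] = float(row["amount"])
        row["is_fraud"] = row["is_fraud"].strip().lower() == "yes"
        rows.append(row)
    return rows


def group_by_fraudster(rows):
    """Map each fraudster ID to the IDs of the fraudulent transactions linked to it."""
    groups = defaultdict(list)
    for row in rows:
        if row["is_fraud"] and row["fraudster_id"]:
            groups[row["fraudster_id"]].append(row["transaction_id"])
    return dict(groups)


transactions = load_transactions(SAMPLE_CSV)
by_fraudster = group_by_fraudster(transactions)
```

Grouping by fraudster ID in this way is one simple starting point for the pattern tracking described above, since it collects every transaction attributed to the same simulated fraudster.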

IBM Synthetic Data Sets for Core Banking and Money Laundering

Ideal for anti-money laundering solutions. The dataset includes simulated banking transactions labeled for money laundering, check fraud and authorized push payment (APP) fraud. It captures fraud scenarios and laundering activities, and labels each type along with the related account details and transfers.

IBM Synthetic Data Sets for Homeowners Insurance

Ideal to improve claims fraud detection, underwriting and pricing. The dataset uses homeowner, policy, claim and disaster event information to offer synthetic “what-if” scenarios and labels for fraudulent claims with insights for areas including loan underwriting and credit scoring.

Benefits
Jumpstart training AI models

The datasets serve as quick, easy, privacy-compliant training data for building models from scratch. Easy-to-download files facilitate use with Db2® and other databases and include the key attributes for each use case without any real PII.

Enhance models with richer data

The datasets provide richer, more diverse data to enhance existing predictive models and fine-tune LLMs. Synthetic data includes broader information than what’s available in real data, including transaction fraud labels, multiple entities across the banking ecosystem and more.

Validate the accuracy of AI models

It can be used as an “answer sheet” to validate existing fraud or money laundering models because all transactions are labeled for either type of fraud. Test whether existing models can accurately predict fraud with our datasets.
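
The "answer sheet" idea can be sketched as follows: compare a model's predictions with the dataset's ground-truth labels and compute precision and recall. This is an illustrative sketch only; the label values and predictions below are made-up sample data, and the function names are hypothetical rather than part of any IBM tooling.

```python
def confusion_counts(labels, predictions):
    """Count true positives, false positives, false negatives and true negatives."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(labels, predictions):
        if actual and predicted:
            tp += 1
        elif predicted:
            fp += 1
        elif actual:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall(labels, predictions):
    """Return (precision, recall), using 0.0 when a denominator is zero."""
    tp, fp, fn, _ = confusion_counts(labels, predictions)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up ground truth in the dataset's yes/no style, converted to booleans.
raw_labels = ["yes", "no", "yes", "no", "yes"]
labels = [value == "yes" for value in raw_labels]

# Made-up output of an existing fraud model on the same five transactions.
predictions = [True, True, False, False, True]

counts = confusion_counts(labels, predictions)
precision, recall = precision_recall(labels, predictions)
```

Because every transaction carries a label, both missed fraud (false negatives) and false alarms (false positives) can be counted directly, which is usually not possible with partially labeled production data.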

Features

No real PII included
Logic maintained
Known ground truth
Referential integrity
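
As an illustration of what referential integrity means in practice, the sketch below loads two small tables into an in-memory SQLite database and lists any transactions whose card reference has no matching card row. The table names, column names and DDL are hypothetical and written in SQLite's dialect for portability; the shipped DDL files may use different names and a different SQL dialect. One orphaned row is inserted deliberately to show what the check catches.

```python
import sqlite3

# Hypothetical schema for illustration only; not the DDL shipped with the datasets.
DDL = """
CREATE TABLE cards (
    card_id TEXT PRIMARY KEY,
    holder_name TEXT
);
CREATE TABLE transactions (
    transaction_id TEXT PRIMARY KEY,
    card_id TEXT NOT NULL REFERENCES cards(card_id),
    amount REAL
);
"""


def orphaned_transactions(conn):
    """Return IDs of transactions whose card_id has no matching row in cards."""
    query = """
        SELECT t.transaction_id
        FROM transactions AS t
        LEFT JOIN cards AS c ON c.card_id = t.card_id
        WHERE c.card_id IS NULL
        ORDER BY t.transaction_id
    """
    return [row[0] for row in conn.execute(query)]


conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
conn.executemany(
    "INSERT INTO cards VALUES (?, ?)",
    [("c1", "Holder A"), ("c2", "Holder B")],
)
# SQLite does not enforce foreign keys unless PRAGMA foreign_keys is enabled,
# so the deliberately orphaned row "t3" below is accepted on insert.
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [("t1", "c1", 25.0), ("t2", "c2", 12.75), ("t3", "c9", 430.0)],
)
orphans = orphaned_transactions(conn)
```

On data that maintains referential integrity, a check like this would be expected to return an empty list.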
Use cases
Credit card fraud detection

Accurate fraud detection keeps customers satisfied and loyal while minimizing financial losses. IBM Synthetic Data Sets for Payment Cards improves fraud detection AI models by providing labeled transaction data.

Anti-money laundering

IBM Synthetic Data Sets for Core Banking and Money Laundering provides labeled data, including global and cash transactions that are unavailable in real banking data. This helps build stronger anti-money laundering models, which reduces risk and false positives and saves investigation time and costs.

Insurance claims fraud

Insurers use real claims data, but IBM Synthetic Data Sets for Homeowners Insurance adds synthetic “what-if” scenarios that cover diverse claim types and fraud cases. Each claim is labeled for fraud, detection status and reason, providing a rich dataset to train, validate and improve AI models for detecting fraudulent claims.

Resources
IBM Synthetic Data Sets Redpaper

Read more about IBM Synthetic Data Sets in this IBM Redbooks® Redpaper, which provides greater detail about the datasets, methodology, security and ethics by design, and data schemas.

Realistic synthetic financial transactions for anti-money laundering models

Read the published academic paper featured at NeurIPS, with technical details about the methodology used to generate the synthetic datasets for detecting money laundering.

Synthesizing credit card transactions

Read about the technical approach and domain knowledge that were combined to generate quality synthetic credit card data used to train models to predict fraud.

A simple, effective and efficient graph transformer for financial fraud detection

Read about how IBM and MIT researchers developed a fraud detection graph transformer (FraudGT) using data from our IBM Synthetic Data Sets.

Real-time subgraph-based feature extraction for financial crime detection

Read about how IBM Research and Caltech developed Graph Feature Preprocessor, a software library for detecting typical money laundering patterns in financial transaction graphs in real time. The solution was developed using IBM Synthetic Data Sets.

Take the next step

Discover how to jumpstart AI projects on IBM Z and LinuxONE with Synthetic Data Sets.
