Synthetic Data Sets
These prebuilt datasets are downloadable and packaged as CSV and DDL files, so they are easy to use and compatible with a wide range of tools, from databases and spreadsheets to hardware platforms and standard AI tools. The datasets draw on IBM's industry expertise and domain knowledge of the financial services sector without using any real client seed data, which alleviates security concerns about personally identifiable information (PII).
IBM Synthetic Data Sets were curated for fraud detection use cases. Clients can download the datasets to develop predictive AI models and LLMs for financial services, or to optimize existing models for improved accuracy and risk mitigation.
Learn how prebuilt synthetic data boosts AI accuracy, speeds up projects and delivers rapid results. Jumpstart your AI journey with IBM Synthetic Data Sets.
Ideal to train AI models to detect credit card fraud. The dataset includes simulated credit cards and holders with detailed transaction histories. Each transaction is labeled “yes” or “no” for fraud and linked by fraudster ID for pattern tracking.
Ideal for anti-money laundering solutions. The dataset includes simulated banking transactions labeled for money laundering, check fraud and authorized push payment (APP) fraud. It captures fraud scenarios and laundering activities, with labels for each type alongside account details and transfers.
Ideal to improve claims fraud detection, underwriting and pricing. The dataset uses homeowner, policy, claim and disaster event information to offer synthetic “what-if” scenarios and labels for fraudulent claims with insights for areas including loan underwriting and credit scoring.
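As a rough illustration of how labeled transaction data like this might be explored, the following Python sketch reads a small CSV, computes the share of transactions labeled as fraud, and groups fraudulent transactions by fraudster ID. The column names (`transaction_id`, `card_id`, `amount`, `is_fraud`, `fraudster_id`) and the sample rows are assumptions made up for this example; they are not the published schema of any IBM dataset.

```python
import csv
import io
from collections import Counter

# Hypothetical sample data. Column names and values are illustrative
# assumptions only, not the actual schema of the IBM datasets.
SAMPLE_CSV = """transaction_id,card_id,amount,is_fraud,fraudster_id
t1,c1,25.00,no,
t2,c1,310.50,yes,f7
t3,c2,12.99,no,
t4,c3,999.00,yes,f7
t5,c3,45.10,yes,f9
"""


def summarize_fraud(csv_text):
    """Return (fraud_rate, Counter of fraudulent transactions per fraudster)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # Keep only rows whose label is "yes" (case-insensitive).
    fraud_rows = [r for r in rows if r["is_fraud"].strip().lower() == "yes"]
    rate = len(fraud_rows) / len(rows) if rows else 0.0
    # Count fraudulent transactions per fraudster, skipping empty IDs.
    per_fraudster = Counter(
        r["fraudster_id"] for r in fraud_rows if r["fraudster_id"]
    )
    return rate, per_fraudster


rate, per_fraudster = summarize_fraud(SAMPLE_CSV)
print(f"fraud rate: {rate:.2f}")
print(per_fraudster.most_common())
```

With a real download, the in-memory sample would be replaced by the contents of the CSV file, and the column names would need to be adjusted to match the accompanying DDL.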
Accurate fraud detection keeps customers satisfied and loyal while minimizing financial losses. IBM Synthetic Data Sets for Payment Cards improves fraud protection AI models by providing labeled transaction data.
IBM Synthetic Data Sets for Core Banking and Money Laundering provides labeled data, including global and cash transactions unavailable in real banking data. This helps build stronger anti-money laundering models, reducing risks and false positives and saving investigation time and costs.
Insurers use real claims data, but IBM Synthetic Data Sets for Homeowners Insurance adds synthetic “what-if” scenarios that cover diverse claim types and fraud cases. Each claim is labeled for fraud, detection status and reason, providing a rich dataset to train, validate and improve AI models for detecting fraudulent claims.
Read more about IBM Synthetic Data Sets in this IBM Redbooks® Redpaper, which provides greater details about the datasets, methodology, security and ethics by design and data schemas.
Read the published academic paper featured at NeurIPS, with technical details on the methodology used to generate synthetic datasets for detecting money laundering.
Read about the technical approach and domain knowledge that were combined to generate quality synthetic credit card data used to train models to predict fraud.
Read about how IBM and MIT researchers developed a fraud detection graph transformer (FraudGT) using data from our IBM Synthetic Data Sets.
Read about how IBM Research and Caltech developed Graph Feature Preprocessor, a software library for detecting typical money laundering patterns in financial transaction graphs in real time. The library was developed using IBM Synthetic Data Sets.