Synthetic data is artificial data designed to mimic real-world data. It’s generated through statistical methods or by using artificial intelligence (AI) techniques like deep learning and generative AI.
Despite being artificially generated, synthetic data retains the underlying statistical properties of the original data that it is based on. As such, synthetic datasets can supplement or even replace real datasets.
Synthetic data can act as a placeholder for test data and is primarily used to train machine learning models, serving as a potential solution to the ever-growing need for—yet short supply of—high-quality real-world training data for AI models. However, synthetic data is also gaining traction in sectors like finance and healthcare where data is in limited supply, time-consuming to obtain or difficult to access due to data privacy concerns and security requirements. In fact, research firm Gartner predicts that 75% of businesses will employ generative AI to create synthetic customer data by 2026.1
Synthetic data can come in multimedia, tabular or text form. Synthetic text data can be used for natural language processing (NLP), while synthetic tabular data can be used to create relational database tables. Synthetic multimedia, such as video, images or other unstructured data, can be applied for computer vision tasks like image classification, image recognition and object detection.
Synthetic data can also be classified according to its level of synthesis:
● Fully synthetic
● Partially synthetic
● Hybrid
Creating fully synthetic data entails generating entirely new data that doesn’t include any real-world information. The process estimates the attributes, patterns and relationships underpinning real data to emulate it as closely as possible.
Financial organizations, for instance, might lack enough samples of suspicious transactions to effectively train AI models for fraud detection. To fill that gap, they can generate fully synthetic data representing fraudulent transactions to improve model training, an approach similar to that of financial services firm J.P. Morgan.
Partially synthetic data is derived from real-world information but replaces portions of the original dataset—typically those containing sensitive information—with artificial values. This privacy-preserving technique helps protect personal data while still maintaining the characteristics of real data.
Partially synthetic data can be especially valuable in clinical research, for example, where real data is crucial to the results but safeguarding patients’ personally identifiable information (PII) and medical records is equally critical.
Hybrid synthetic data combines real datasets with fully synthetic ones. It takes records from the original dataset and randomly pairs them with records from their synthetic counterparts. Hybrid synthetic data can be used to analyze and glean insights from customer data, for instance, without tracing back any sensitive data to a specific customer.
Organizations can choose to generate their own synthetic data. They can also use solutions such as the Synthetic Data Vault, a Python library for creating synthetic data, or other open-source algorithms, frameworks, packages and tools. Prebuilt datasets, such as IBM® Synthetic Data Sets, are another option.
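For illustration, a minimal sketch of generating synthetic tabular data with the Synthetic Data Vault might look like the following. The class names assume the SDV 1.x API, and the small DataFrame is a hypothetical stand-in for real records; this is a sketch, not a production pipeline.

# Minimal sketch using the open-source Synthetic Data Vault (SDV) library.
# Assumes the SDV 1.x API; the DataFrame below is a hypothetical stand-in for real records.
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

real_df = pd.DataFrame({
    "age": [34, 45, 29, 52],
    "balance": [1200.50, 830.00, 410.25, 2300.75],
})

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_df)           # infer column types from the real data

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real_df)                          # learn the data's statistical properties
synthetic_df = synthesizer.sample(num_rows=100)   # generate new, artificial rows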
Here are some common synthetic data generation techniques:
● Statistical methods
● Generative adversarial networks (GANs)
● Transformer models
● Variational autoencoders (VAEs)
● Agent-based modeling
Statistical methods are suitable for data whose distribution, correlations and traits are well known and can therefore be simulated through mathematical models.
In distribution-based approaches, statistical functions can be used to define the data distribution. Then, by randomly sampling from this distribution, new data points can be generated.
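As a brief illustration, the following Python sketch fits a normal distribution to a hypothetical numeric column and samples new synthetic points from it (the values and the choice of distribution are assumptions made for the example):

# Distribution-based sketch: estimate the parameters of a normal distribution
# from hypothetical real measurements, then sample new synthetic points.
import numpy as np

real_values = np.array([52.1, 48.3, 50.7, 49.9, 51.4])   # hypothetical real measurements
mean, std = real_values.mean(), real_values.std()

rng = np.random.default_rng(seed=0)
synthetic_values = rng.normal(loc=mean, scale=std, size=1000)   # new data points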
For correlation-based strategies, interpolation or extrapolation can be applied. In time series data, for instance, linear interpolation can create new data points between adjacent ones, while linear extrapolation can generate data points beyond existing ones.
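The following sketch illustrates both ideas on a hypothetical time series, using NumPy for linear interpolation and a simple linear trend for extrapolation:

# Correlation-based sketch for a hypothetical time series.
import numpy as np

times = np.array([0.0, 1.0, 2.0, 3.0])
readings = np.array([10.0, 12.0, 15.0, 19.0])   # hypothetical sensor readings

# Interpolation: estimate values between adjacent observations.
new_times = np.array([0.5, 1.5, 2.5])
interpolated = np.interp(new_times, times, readings)

# Extrapolation: extend the last observed linear trend one step beyond the data.
slope = (readings[-1] - readings[-2]) / (times[-1] - times[-2])
extrapolated = readings[-1] + slope * (4.0 - times[-1])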
Generative adversarial networks (GANs) involve a pair of neural networks: a generator that creates synthetic data and a discriminator that acts as an adversary that distinguishes real from artificial data. Both networks are iteratively trained, with the discriminator’s feedback enhancing the generator’s output until the discriminator is no longer able to differentiate between artificial and real data. GANs are often used for image generation.
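The following is a minimal, illustrative GAN training loop for numeric tabular data, assuming PyTorch. Production tabular GANs such as CTGAN add considerably more machinery, and the random real_batch here is only a stand-in for real records.

# Minimal GAN sketch for numeric tabular data (assumes PyTorch).
import torch
import torch.nn as nn

latent_dim = 16   # size of the random noise vector fed to the generator
data_dim = 4      # number of numeric columns in the (hypothetical) real data

generator = nn.Sequential(
    nn.Linear(latent_dim, 64), nn.ReLU(),
    nn.Linear(64, data_dim),
)
discriminator = nn.Sequential(
    nn.Linear(data_dim, 64), nn.LeakyReLU(0.2),
    nn.Linear(64, 1), nn.Sigmoid(),
)

opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
loss_fn = nn.BCELoss()

real_batch = torch.randn(128, data_dim)   # stand-in for a batch of real records

for step in range(1000):
    # Train the discriminator: label real data 1 and generated data 0.
    fake_batch = generator(torch.randn(128, latent_dim)).detach()
    d_loss = (loss_fn(discriminator(real_batch), torch.ones(128, 1))
              + loss_fn(discriminator(fake_batch), torch.zeros(128, 1)))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Train the generator: try to make the discriminator output 1 for fakes.
    g_loss = loss_fn(discriminator(generator(torch.randn(128, latent_dim))), torch.ones(128, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()

synthetic_rows = generator(torch.randn(1000, latent_dim))   # new synthetic records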
Transformer models, such as OpenAI’s generative pretrained transformers (GPTs), serve as the foundation of both small language models (SLMs) and large language models (LLMs). Transformers process data using encoders and decoders.
Encoders transform input sequences into numerical representations called embeddings that capture the semantics and position of tokens in the input sequence. A self-attention mechanism allows transformers to “focus their attention” on the most important tokens in the input sequence, regardless of their position. Decoders then use this self-attention mechanism and the encoders’ embeddings to generate the most statistically probable output sequence.
Transformer models excel at capturing the structure of and patterns in language. As such, they can be used to create artificial text data or generate synthetic tabular data.
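To make the self-attention step concrete, here is a minimal NumPy sketch of scaled dot-product attention over a sequence of token embeddings. It is illustrative only: real transformers use separate learned projections for queries, keys and values, plus multiple attention heads, positional encodings and feed-forward layers.

# Minimal scaled dot-product self-attention sketch (illustrative only).
import numpy as np

def self_attention(embeddings):
    """embeddings: array of shape (sequence_length, model_dim)."""
    d = embeddings.shape[-1]
    # In a real model, q, k and v come from separate learned projections;
    # here the raw embeddings are reused to keep the example short.
    q, k, v = embeddings, embeddings, embeddings
    scores = q @ k.T / np.sqrt(d)                  # token-to-token relevance scores
    scores -= scores.max(axis=-1, keepdims=True)   # for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)   # softmax
    return weights @ v                             # attention-weighted mix of values

tokens = np.random.randn(5, 8)            # 5 tokens, 8-dimensional embeddings
contextualized = self_attention(tokens)   # each output row attends to every token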
Variational autoencoders (VAEs) are generative models that produce variations of the data they’re trained on. An encoder compresses input data into a lower-dimensional space, capturing the meaningful information contained in the input. A decoder then reconstructs new data from this compressed representation. Like GANs, VAEs can be used to generate synthetic images.
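Here is a minimal VAE sketch, assuming PyTorch, showing the encoder, the reparameterization step and the decoder; a real image VAE would use convolutional layers and a full training loop with a reconstruction-plus-KL loss.

# Minimal variational autoencoder sketch (assumes PyTorch).
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    def __init__(self, data_dim=784, latent_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(data_dim, 128), nn.ReLU())
        self.to_mu = nn.Linear(128, latent_dim)       # mean of the latent distribution
        self.to_logvar = nn.Linear(128, latent_dim)   # log-variance of the latent distribution
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, data_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        return self.decoder(z), mu, logvar

vae = TinyVAE()
# After training, sampling random latent vectors and decoding them yields new synthetic examples:
samples = vae.decoder(torch.randn(16, 8))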
Agent-based modeling entails representing a complex system as a virtual environment containing individual entities, also known as agents. Agents operate based on a predefined set of rules, interacting with their environment and with other agents. Simulating these interactions and agent behaviors produces synthetic data.
For instance, agent-based models in epidemiology represent individuals in a population as agents. Upon modeling agent interactions, synthetic data such as the rate of contact and likelihood of infection can be generated. The data can then aid in predicting infectious disease spread and examining the effects of interventions.
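As a simple illustration, the following Python sketch simulates agents with assumed contact and transmission parameters and records a synthetic time series of daily new infections:

# Minimal agent-based epidemic sketch; all parameters are assumptions for the example.
import random

POPULATION = 1000
CONTACTS_PER_DAY = 5          # assumed average contacts per agent per day
INFECTION_PROBABILITY = 0.05  # assumed chance of transmission per contact
DAYS = 60

infected = set(random.sample(range(POPULATION), 10))   # 10 initial cases
daily_new_cases = []

for day in range(DAYS):
    new_cases = set()
    for agent in range(POPULATION):
        if agent in infected:
            continue
        # Each susceptible agent meets a few random others; infection may spread on contact.
        contacts = random.sample(range(POPULATION), CONTACTS_PER_DAY)
        for other in contacts:
            if other in infected and random.random() < INFECTION_PROBABILITY:
                new_cases.add(agent)
                break
    infected |= new_cases
    daily_new_cases.append(len(new_cases))   # synthetic time series of new infections

print(daily_new_cases)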
Synthetic data is a growing technology, offering these advantages for enterprises:
● Customization
● Efficiency
● Increased data privacy
● Richer data
Data science teams can tailor synthetic data to fit the exact specifications and needs of a business. And because data scientists have greater control over synthetic datasets, managing and analyzing them becomes easier.
Generating synthetic data eliminates the time-consuming process of gathering real data, making it quicker to produce and helping accelerate workflows. Synthetic data can also be generated with labels already attached, removing the tedious step of annotating volumes of data by hand.
Synthetic data resembles real-world data, but it can be generated such that any personal data isn’t traceable to a particular individual. This acts as a form of data anonymization, helping keep sensitive information safe. Synthetic data also allows enterprises to avoid intellectual property and copyright issues, sidestepping the need for web crawlers that scrape and collect information from websites without users’ knowledge or consent.
Artificial datasets can help boost data diversity, creating or augmenting data for underrepresented groups in AI training. Synthetic data can also fill in the gaps when the original data is scarce or no real data exists. And including edge cases or outliers as data points can broaden the scope of synthetic datasets, reflecting the real world’s variability and unpredictability.
Despite synthetic data’s benefits, it also comes with some downsides. Following best practices for synthetic data generation can help address these drawbacks and allow companies to maximize the value of artificial data.
Here are some challenges associated with synthetic data:
● Bias
● Model collapse
● Trade-off between accuracy and privacy
● Verification
Synthetic data can still exhibit the biases that might be present in the real-world data it is based on. Drawing on multiple, diverse data sources, including data from varied regions and demographic groups, can help mitigate bias.
Model collapse happens when an AI model is repeatedly trained on AI-generated data, causing model performance to decline. A healthy mix of real and artificial training datasets can help prevent this problem.
During the synthetic data generation process, a trade-off between accuracy and privacy emerges. Prioritizing accuracy might mean retaining more personal data, while keeping privacy top of mind might reduce accuracy. Finding the right balance for a company’s use cases is vital.
Additional checks and tests must be conducted to validate synthetic data quality after it’s generated. This introduces an extra step to the workflow, but it’s a crucial one to make sure artificial datasets are free from any errors, inconsistencies or inaccuracies.
Synthetic data is versatile and can be generated for a wide range of applications. Here are some industries where synthetic data can be a boon:
● Automotive
● Finance
● Healthcare
● Manufacturing
Agent-based modeling can be employed to generate artificial data related to traffic flow, helping improve road and transport systems. The use of synthetic data can help car manufacturers avoid the costly and time-consuming process of obtaining real crash data for vehicle safety testing. Makers of autonomous vehicles can use synthetic data to train self-driving cars in navigating different scenarios.
Synthetic financial data can be used for assessing and managing risk, for predictive modeling and forecasting, and for testing trading algorithms, among other applications. IBM Synthetic Data Sets, for instance, include simulated credit card and home insurance claims data to aid fraud detection, as well as simulated banking transactions for anti-money laundering solutions.
Synthetic datasets can help pharmaceutical companies speed up drug development. Medical researchers, meanwhile, can use partially synthetic data for clinical trials or fully synthetic data to create artificial patient records or medical imaging for formulating innovative or preventive treatments. Agent-based modeling can also be applied in epidemiology to study disease transmission and interventions.
Manufacturing companies can use synthetic data to improve the visual inspection capabilities of computer vision models that examine products in real time for defects and deviations from standards. Artificial datasets can also enhance predictive maintenance, with synthetic sensor data helping machine learning models better anticipate equipment failures and recommend appropriate and timely measures.
1. “3 Bold and Actionable Predictions for the Future of GenAI,” Gartner, 12 April 2024.