What is synthetic data?

Synthetic data is artificial data designed to mimic real-world data. It’s generated through statistical methods or by using artificial intelligence (AI) techniques like deep learning and generative AI.

Despite being artificially generated, synthetic data retains the underlying statistical properties of the original data that it is based on. As such, synthetic datasets can supplement or even replace real datasets.

Synthetic data can act as a placeholder for test data and is primarily used to train machine learning models, serving as a potential solution to the ever-growing need for—yet short supply of—high-quality real-world training data for AI models. However, synthetic data is also gaining traction in sectors like finance and healthcare where data is in limited supply, time-consuming to obtain or difficult to access due to data privacy concerns and security requirements. In fact, research firm Gartner predicts that 75% of businesses will employ generative AI to create synthetic customer data by 2026.1

Types of synthetic data

Synthetic data can come in multimedia, tabular or text form. Synthetic text data can be used for natural language processing (NLP), while synthetic tabular data can be used to create relational database tables. Synthetic multimedia, such as video, images or other unstructured data, can be applied for computer vision tasks like image classification, image recognition and object detection.

Synthetic data can also be classified according to its level of synthesis:

    ● Fully synthetic

    ● Partially synthetic

    ● Hybrid

Fully synthetic

Fully synthetic data entails generating entirely new data that doesn’t include any real-world information. It estimates the attributes, patterns and relationships underpinning real data to emulate it as closely as possible.

Financial organizations, for instance, might lack samples of suspicious transactions to effectively train AI models in fraud detection. They can then generate fully synthetic data representing fraudulent transactions to improve model training, an approach taken by financial services firm J.P. Morgan.

Partially synthetic

Partially synthetic data is derived from real-world information but replaces portions of the original dataset—typically those containing sensitive information—with artificial values. This privacy-preserving technique helps protect personal data while still maintaining the characteristics of real data.

Partially synthetic data can be especially valuable in clinical research, for example, where real data is crucial to the results but safeguarding patients’ personally identifiable information (PII) and medical records is equally critical.
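A minimal sketch of partial synthesis might look like the following. The field names and replacement pools are illustrative assumptions, not part of any real system: sensitive fields are swapped for artificial values while the clinical measurement stays real.

```python
import random

# Illustrative pool of artificial identifiers (an assumption for this sketch).
FAKE_NAMES = ["Patient A", "Patient B", "Patient C"]

def partially_synthesize(record, rng):
    """Return a copy of the record with sensitive fields replaced."""
    synthetic = dict(record)
    synthetic["name"] = rng.choice(FAKE_NAMES)              # replace PII
    synthetic["patient_id"] = f"SYN-{rng.randint(1000, 9999)}"
    # Clinical measurements are kept as-is to preserve data utility.
    return synthetic

rng = random.Random(42)
real = {"name": "Jane Doe", "patient_id": "MRN-001", "blood_pressure": 121}
synth = partially_synthesize(real, rng)
print(synth["blood_pressure"])  # real measurement preserved: 121
```

The key design point is the split: utility-bearing fields pass through untouched, while anything traceable to an individual is regenerated.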

Hybrid

Hybrid synthetic data combines real datasets with fully synthetic ones. It takes records from the original dataset and randomly pairs them with records from their synthetic counterparts. Hybrid synthetic data can be used to analyze and glean insights from customer data, for instance, without tracing back any sensitive data to a specific customer.
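The random pairing step described above can be sketched in a few lines. The record fields here are hypothetical examples; the point is that each real record is merged with a randomly chosen synthetic counterpart, so the combined record no longer traces back to one customer.

```python
import random

# Hypothetical example records (assumptions for this sketch).
real_records = [{"region": "EU", "spend": 120}, {"region": "US", "spend": 80}]
synthetic_records = [{"customer_id": "SYN-1"}, {"customer_id": "SYN-2"}]

def hybridize(real, synthetic, rng):
    """Randomly pair each real record with a synthetic counterpart."""
    pairs = []
    for record in real:
        partner = rng.choice(synthetic)  # random pairing breaks traceability
        pairs.append({**partner, **record})
    return pairs

rng = random.Random(0)
hybrid = hybridize(real_records, synthetic_records, rng)
print(len(hybrid))  # one hybrid record per real record
```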

How is synthetic data generated?

Organizations can choose to generate their own synthetic data. They can also use solutions such as the Synthetic Data Vault, a Python library for creating synthetic data, or other open-source algorithms, frameworks, packages and tools. Prebuilt datasets, such as IBM® Synthetic Data Sets, are another option.

Here are some common synthetic data generation techniques:

    ● Statistical methods

    ● Generative adversarial networks (GANs)

    ● Transformer models

    ● Variational autoencoders (VAEs)

    ● Agent-based modeling

Statistical methods

These methodologies are suitable for data whose distribution, correlations and traits are well-known and can therefore be simulated through mathematical models.

In distribution-based approaches, statistical functions can be used to define the data distribution. Then, by randomly sampling from this distribution, new data points can be generated.

For correlation-based strategies, interpolation or extrapolation can be applied. In time series data, for instance, linear interpolation can create new data points between adjacent ones, while linear extrapolation can generate data points beyond existing ones.
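Both approaches can be sketched with the standard library alone. The observations below are made-up numbers; the sketch fits a normal distribution to them, samples new synthetic points, and then shows linear interpolation and extrapolation between two adjacent values.

```python
import random
import statistics

rng = random.Random(7)

# Distribution-based: fit a normal distribution to (made-up) observations,
# then sample new synthetic points from it.
real = [10.2, 9.8, 10.5, 10.1, 9.9, 10.3]
mu, sigma = statistics.mean(real), statistics.stdev(real)
synthetic = [rng.gauss(mu, sigma) for _ in range(1000)]

# Correlation-based: linear interpolation creates a new point between two
# adjacent observations; extrapolation projects beyond the last one.
def interpolate(y0, y1, t):   # t in [0, 1]
    return y0 + t * (y1 - y0)

def extrapolate(y0, y1, t):   # t > 1 projects past y1
    return y0 + t * (y1 - y0)

midpoint = interpolate(10.0, 12.0, 0.5)   # 11.0
beyond = extrapolate(10.0, 12.0, 1.5)     # 13.0
```

Because the synthetic points are drawn from the fitted distribution, their summary statistics stay close to those of the original sample, which is exactly the property the article describes.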

Generative adversarial networks (GANs)

Generative adversarial networks (GANs) involve a pair of neural networks: a generator that creates synthetic data and a discriminator that acts as an adversary that distinguishes real from artificial data. Both networks are iteratively trained, with the discriminator’s feedback enhancing the generator’s output until the discriminator is no longer able to differentiate between artificial and real data. GANs are often used for image generation.
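The adversarial loop can be illustrated with a deliberately tiny sketch: a one-parameter-pair linear "generator" and a logistic "discriminator" trained with hand-written gradient steps on one-dimensional data. This is a toy under stated assumptions, not how production GANs are built; real GANs use deep neural networks and a framework such as PyTorch or TensorFlow.

```python
import math
import random

rng = random.Random(0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Toy 1D GAN (illustrative only):
# generator g(z) = a*z + b maps noise to samples;
# discriminator d(x) = sigmoid(w*x + c) scores "realness".
a, b = 1.0, 0.0          # generator parameters
w, c = 0.1, 0.0          # discriminator parameters
lr = 0.01

real_data = [rng.gauss(4.0, 1.0) for _ in range(200)]

for step in range(500):
    z = rng.gauss(0.0, 1.0)
    fake = a * z + b
    x = rng.choice(real_data)

    # Discriminator ascent: push d(real) toward 1 and d(fake) toward 0.
    grad_real = 1 - sigmoid(w * x + c)
    w += lr * grad_real * x
    c += lr * grad_real
    grad_fake = -sigmoid(w * fake + c)
    w += lr * grad_fake * fake
    c += lr * grad_fake

    # Generator ascent: push d(g(z)) toward 1 (fool the discriminator).
    g_grad = (1 - sigmoid(w * fake + c)) * w
    a += lr * g_grad * z
    b += lr * g_grad

samples = [a * rng.gauss(0.0, 1.0) + b for _ in range(100)]
```

The structure is the point: each step updates the discriminator on one real and one generated sample, then updates the generator using the discriminator's feedback, mirroring the iterative training the paragraph describes.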

Transformer models

Transformer models, such as OpenAI’s generative pretrained transformers (GPTs), serve as the foundation of both small language models (SLMs) and large language models (LLMs). Transformers process data using encoders and decoders.

Encoders transform input sequences into numerical representations called embeddings that capture the semantics and position of tokens in the input sequence. A self-attention mechanism allows transformers to “focus their attention” on the most important tokens in the input sequence, regardless of their position. Decoders then use this self-attention mechanism and the encoders’ embeddings to generate the most statistically probable output sequence.

Transformer models excel in understanding the structure of and patterns in language. As such, they can be used to create artificial text data or generate synthetic tabular data.
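The self-attention mechanism described above can be sketched in pure Python. Real transformers learn projection matrices for the queries, keys and values; in this simplified sketch the token vectors are used directly as all three.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(Q, K, V):
    """Scaled dot-product attention over lists of token vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Attention scores: similarity of this query to every key,
        # scaled by sqrt of the key dimension.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)               # each row sums to 1
        # Output: attention-weighted combination of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 toy tokens, dimension 2
attended = self_attention(X, X, X)
```

Because the softmax weights form a convex combination, every output vector lies within the range of the value vectors, which is how each token's representation "focuses" on the most relevant tokens regardless of position.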

Variational autoencoders (VAEs)

Variational autoencoders (VAEs) are generative models that produce variations of the data they’re trained on. An encoder compresses input data into a lower-dimensional space, capturing the meaningful information contained in the input. A decoder then reconstructs new data from this compressed representation. Like GANs, VAEs can be used to generate synthetic images.
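The sampling step at the heart of a VAE (often called the reparameterization trick) can be sketched as follows. The encoder and decoder here are stand-in functions invented for illustration; in a real VAE both are trained neural networks.

```python
import math
import random

rng = random.Random(1)

def encode(x):
    # Stand-in encoder: maps input to the mean and log-variance of a
    # latent Gaussian. A real encoder learns this mapping from data.
    return [xi * 0.5 for xi in x], [math.log(0.1)] * len(x)

def sample_latent(mu, log_var):
    # z = mu + sigma * eps, with eps ~ N(0, 1): each call yields a
    # different point in the latent space, hence a different variation.
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def decode(z):
    # Stand-in decoder: maps the latent vector back to data space.
    return [zi * 2.0 for zi in z]

x = [1.0, -2.0, 0.5]
mu, log_var = encode(x)
z = sample_latent(mu, log_var)     # a new latent sample each call
reconstruction = decode(z)
```

Sampling around the encoded mean, rather than decoding the mean directly, is what lets a trained VAE generate novel variations of its training data instead of exact copies.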

Agent-based modeling

This simulation strategy entails modeling a complex system as a virtual environment containing individual entities, also known as agents. Agents operate based on a predefined set of rules, interacting with their environment and other agents. Agent-based modeling simulates these interactions and agent behaviors to produce synthetic data.

For instance, agent-based models in epidemiology represent individuals in a population as agents. Upon modeling agent interactions, synthetic data such as the rate of contact and likelihood of infection can be generated. The data can then aid in predicting infectious disease spread and examining the effects of interventions.
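A minimal agent-based epidemic sketch makes this concrete. Each agent is susceptible ("S"), infected ("I") or recovered ("R"); the contact, infection and recovery parameters are illustrative assumptions, not calibrated values.

```python
import random

rng = random.Random(3)

# Illustrative parameters (assumptions, not calibrated values).
N, CONTACTS, P_INFECT, P_RECOVER = 500, 4, 0.05, 0.1
agents = ["I"] * 5 + ["S"] * (N - 5)

history = []
for day in range(60):
    infected = [i for i, s in enumerate(agents) if s == "I"]
    for i in infected:
        # Each infected agent contacts a few random others.
        for j in rng.sample(range(N), CONTACTS):
            if agents[j] == "S" and rng.random() < P_INFECT:
                agents[j] = "I"
        if rng.random() < P_RECOVER:
            agents[i] = "R"
    history.append(agents.count("I"))   # synthetic daily case counts

print(max(history))   # peak of the synthetic outbreak
```

The `history` list is the synthetic data: a daily case-count time series produced entirely by simulated agent interactions, which can then be used to study disease spread or test interventions.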


Benefits of synthetic data

Synthetic data is a rapidly growing field, offering these advantages for enterprises:

    ● Customization

    ● Efficiency

    ● Increased data privacy

    ● Richer data

Customization

Data science teams can tailor synthetic data to fit the exact specifications and needs of a business. And because data scientists have greater control over synthetic datasets, managing and analyzing them becomes easier.

Efficiency

Generating synthetic data eliminates the time-consuming process of gathering real data, making it quicker to produce and helping accelerate workflows. Synthetic data can also be generated prelabeled, removing the tedious step of labeling and annotating volumes of data by hand.

Increased data privacy

Synthetic data resembles real-world data, but it can be generated such that any personal data isn’t traceable to a particular individual. This acts as a form of data anonymization, helping keep sensitive information safe. Synthetic data also allows enterprises to avoid intellectual property and copyright issues, removing the need for web crawlers that scrape and collect information from websites without users’ knowledge or consent.

Richer data

Artificial datasets can help boost data diversity, creating or augmenting data for underrepresented groups in AI training. Synthetic data can also fill in the gaps when the original data is scarce or no real data exists. And including edge cases or outliers as data points can broaden the scope of synthetic datasets, reflecting the real world’s variability and unpredictability.

Challenges of synthetic data

Despite synthetic data’s benefits, it also comes with some downsides. Following best practices for synthetic data generation can help address these drawbacks and allow companies to maximize the value of artificial data.

Here are some challenges associated with synthetic data:

    ● Bias

    ● Model collapse

    ● Trade-off between accuracy and privacy

    ● Verification

Bias

Synthetic data can still exhibit the biases that might be present in the real-world data that it is based on. Drawing on diverse data sources, including data from varied regions and demographic groups, can help mitigate bias.

Model collapse

Model collapse happens when an AI model is repeatedly trained on AI-generated data, causing model performance to decline. A healthy mix of real and artificial training datasets can help prevent this problem.

Trade-off between accuracy and privacy

The synthetic data generation process involves an inherent tension between accuracy and privacy. Prioritizing accuracy might mean retaining more personal data, while keeping privacy top of mind might result in a reduction in accuracy. Finding the right balance for a company’s use cases is vital.

Verification

Additional checks and tests must be conducted to validate synthetic data quality after it’s generated. This introduces an extra step to the workflow, but it’s a crucial one to make sure artificial datasets are free from any errors, inconsistencies or inaccuracies.
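One simple verification check is to compare summary statistics of the synthetic dataset against the real one. This is a minimal sketch with illustrative data and tolerances; production validation typically goes further (distributional distance measures, correlation structure, downstream model performance).

```python
import statistics

def validate(real, synthetic, tolerance=0.1):
    """Flag whether basic summary statistics match within a tolerance."""
    checks = {
        "mean": abs(statistics.mean(real) - statistics.mean(synthetic)),
        "stdev": abs(statistics.stdev(real) - statistics.stdev(synthetic)),
    }
    return {name: diff <= tolerance for name, diff in checks.items()}

# Made-up example values for illustration.
real = [10.1, 9.9, 10.0, 10.2, 9.8]
synthetic = [10.05, 9.95, 10.1, 10.0, 9.9]
report = validate(real, synthetic)
print(report)
```

A failed check signals that the generation process has drifted from the statistical properties of the original data and the synthetic dataset should not be used as a stand-in.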

Synthetic data use cases

Synthetic data is versatile and can be generated for a wide range of applications. Here are some industries where synthetic data can be a boon:

    ● Automotive

    ● Finance

    ● Healthcare

    ● Manufacturing

Automotive

Agent-based modeling can be employed to generate artificial data related to traffic flow, helping improve road and transport systems. The use of synthetic data can help car manufacturers avoid the costly and time-consuming process of obtaining real crash data for vehicle safety testing. Makers of autonomous vehicles can use synthetic data to train self-driving cars in navigating different scenarios.

Finance

Synthetic financial data can be used for assessing and managing risk, predictive modeling and forecasting, and testing trading algorithms, among other applications. IBM Synthetic Data Sets, for instance, consist of simulated credit card and home insurance claim data to aid fraud detection, as well as simulated banking transactions for anti-money laundering solutions.

Healthcare

Synthetic datasets can help pharmaceutical companies speed up drug development. Medical researchers, meanwhile, can use partially synthetic data for clinical trials or fully synthetic data to create artificial patient records or medical imaging for formulating innovative or preventive treatments. Agent-based modeling can also be applied in epidemiology to study disease transmission and interventions.

Manufacturing

Manufacturing companies can use synthetic data to improve the visual inspection capabilities of computer vision models that examine products in real time for defects and deviations from standards. Artificial datasets can also enhance predictive maintenance, with synthetic sensor data helping machine learning models better anticipate equipment failures and recommend appropriate and timely measures.

