Synthetic data is artificial data designed to mimic real-world data. It’s generated through statistical methods or by using artificial intelligence (AI) techniques like deep learning and generative AI.
Despite being artificially generated, synthetic data retains the underlying statistical properties of the original data that it is based on. As such, synthetic datasets can supplement or even replace real datasets.
Synthetic data can act as a placeholder for test data and is primarily used to train machine learning models, serving as a potential solution to the ever-growing need for—yet short supply of—high-quality real-world training data for AI models. However, synthetic data is also gaining traction in sectors like finance and healthcare where data is in limited supply, time-consuming to obtain or difficult to access due to data privacy concerns and security requirements. In fact, research firm Gartner predicts that 75% of businesses will employ generative AI to create synthetic customer data by 2026.1
Synthetic data can come in multimedia, tabular or text form. Synthetic text data can be used for natural language processing (NLP), while synthetic tabular data can be used to create relational database tables. Synthetic multimedia, such as video, images or other unstructured data, can be applied for computer vision tasks like image classification, image recognition and object detection.
Synthetic data can also be classified according to its level of synthesis:
● Fully synthetic
● Partially synthetic
● Hybrid
Creating fully synthetic data entails generating entirely new data that doesn’t include any real-world information. The process estimates the attributes, patterns and relationships underpinning real data to emulate it as closely as possible.
Financial organizations, for instance, might lack enough samples of suspicious transactions to effectively train AI models for fraud detection. To fill that gap, they can generate fully synthetic data representing fraudulent transactions to improve model training, an approach similar to that of financial services firm J.P. Morgan.
Partially synthetic data is derived from real-world information but replaces portions of the original dataset—typically those containing sensitive information—with artificial values. This privacy-preserving technique helps protect personal data while still maintaining the characteristics of real data.
Partially synthetic data can be especially valuable in clinical research, for example, where real data is crucial to the results but safeguarding patients’ personally identifiable information (PII) and medical records is equally critical.
Hybrid synthetic data combines real datasets with fully synthetic ones. It takes records from the original dataset and randomly pairs them with records from their synthetic counterparts. Hybrid synthetic data can be used to analyze and glean insights from customer data, for instance, without tracing back any sensitive data to a specific customer.
Organizations can choose to generate their own synthetic data. They can also use solutions such as the Synthetic Data Vault, a Python library for creating synthetic data, or other open-source algorithms, frameworks, packages and tools. Prebuilt datasets, such as IBM® Synthetic Data Sets, are another option.
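For illustration, a minimal sketch of generating synthetic tabular data with the Synthetic Data Vault might look like the following. The class names assume the SDV 1.x API, and the small DataFrame is a hypothetical stand-in for real records; this is a sketch, not a production pipeline.

# Minimal sketch using the open-source Synthetic Data Vault (SDV) library.
# Assumes the SDV 1.x API; the DataFrame below is a hypothetical stand-in for real records.
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

real_df = pd.DataFrame({
    "age": [34, 45, 29, 52],
    "balance": [1200.50, 830.00, 410.25, 2300.75],
})

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_df)           # infer column types from the real data

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real_df)                          # learn the data's statistical properties
synthetic_df = synthesizer.sample(num_rows=100)   # generate new, artificial rows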
Here are some common synthetic data generation techniques:
● Statistical methods
● Generative adversarial networks (GANs)
● Transformer models
● Variational autoencoders (VAEs)
● Agent-based modeling
Statistical methods are suitable for data whose distribution, correlations and traits are well known and can therefore be simulated through mathematical models.
In distribution-based approaches, statistical functions can be used to define the data distribution. Then, by randomly sampling from this distribution, new data points can be generated.
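As a brief illustration, the following Python sketch fits a normal distribution to a hypothetical numeric column and samples new synthetic points from it (the values and the choice of distribution are assumptions made for the example):

# Distribution-based sketch: estimate the parameters of a normal distribution
# from hypothetical real measurements, then sample new synthetic points.
import numpy as np

real_values = np.array([52.1, 48.3, 50.7, 49.9, 51.4])   # hypothetical real measurements
mean, std = real_values.mean(), real_values.std()

rng = np.random.default_rng(seed=0)
synthetic_values = rng.normal(loc=mean, scale=std, size=1000)   # new data points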
For correlation-based strategies, interpolation or extrapolation can be applied. In time series data, for instance, linear interpolation can create new data points between adjacent ones, while linear extrapolation can generate data points beyond existing ones.
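The following sketch illustrates both ideas on a hypothetical time series, using NumPy for linear interpolation and a simple linear trend for extrapolation:

# Correlation-based sketch for a hypothetical time series.
import numpy as np

times = np.array([0.0, 1.0, 2.0, 3.0])
readings = np.array([10.0, 12.0, 15.0, 19.0])   # hypothetical sensor readings

# Interpolation: estimate values between adjacent observations.
new_times = np.array([0.5, 1.5, 2.5])
interpolated = np.interp(new_times, times, readings)

# Extrapolation: extend the last observed linear trend one step beyond the data.
slope = (readings[-1] - readings[-2]) / (times[-1] - times[-2])
extrapolated = readings[-1] + slope * (4.0 - times[-1])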
Generative adversarial networks (GANs) involve a pair of neural networks: a generator that creates synthetic data and a discriminator that acts as an adversary that distinguishes real from artificial data. Both networks are iteratively trained, with the discriminator’s feedback enhancing the generator’s output until the discriminator is no longer able to differentiate between artificial and real data. GANs are often used for image generation.
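The following is a minimal, illustrative GAN training loop for numeric tabular data, assuming PyTorch. Production tabular GANs such as CTGAN add considerably more machinery, and the random real_batch here is only a stand-in for real records.

# Minimal GAN sketch for numeric tabular data (assumes PyTorch).
import torch
import torch.nn as nn

latent_dim = 16   # size of the random noise vector fed to the generator
data_dim = 4      # number of numeric columns in the (hypothetical) real data

generator = nn.Sequential(
    nn.Linear(latent_dim, 64), nn.ReLU(),
    nn.Linear(64, data_dim),
)
discriminator = nn.Sequential(
    nn.Linear(data_dim, 64), nn.LeakyReLU(0.2),
    nn.Linear(64, 1), nn.Sigmoid(),
)

opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
loss_fn = nn.BCELoss()

real_batch = torch.randn(128, data_dim)   # stand-in for a batch of real records

for step in range(1000):
    # Train the discriminator: label real data 1 and generated data 0.
    fake_batch = generator(torch.randn(128, latent_dim)).detach()
    d_loss = (loss_fn(discriminator(real_batch), torch.ones(128, 1))
              + loss_fn(discriminator(fake_batch), torch.zeros(128, 1)))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Train the generator: try to make the discriminator output 1 for fakes.
    g_loss = loss_fn(discriminator(generator(torch.randn(128, latent_dim))), torch.ones(128, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()

synthetic_rows = generator(torch.randn(1000, latent_dim))   # new synthetic records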
Transformer models, such as OpenAI’s generative pretrained transformers (GPTs), serve as the foundation of both small language models (SLMs) and large language models (LLMs). Transformers process data using encoders and decoders.
Encoders transform input sequences into numerical representations called embeddings that capture the semantics and position of tokens in the input sequence. A self-attention mechanism allows transformers to “focus their attention” on the most important tokens in the input sequence, regardless of their position. Decoders then use this self-attention mechanism and the encoders’ embeddings to generate the most statistically probable output sequence.
Transformer models excel at capturing the structure of and patterns in language. As such, they can be used to create artificial text data or generate synthetic tabular data.
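To make the self-attention step concrete, here is a minimal NumPy sketch of scaled dot-product attention over a sequence of token embeddings. It is illustrative only: real transformers use separate learned projections for queries, keys and values, plus multiple attention heads, positional encodings and feed-forward layers.

# Minimal scaled dot-product self-attention sketch (illustrative only).
import numpy as np

def self_attention(embeddings):
    """embeddings: array of shape (sequence_length, model_dim)."""
    d = embeddings.shape[-1]
    # In a real model, q, k and v come from separate learned projections;
    # here the raw embeddings are reused to keep the example short.
    q, k, v = embeddings, embeddings, embeddings
    scores = q @ k.T / np.sqrt(d)                  # token-to-token relevance scores
    scores -= scores.max(axis=-1, keepdims=True)   # for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)   # softmax
    return weights @ v                             # attention-weighted mix of values

tokens = np.random.randn(5, 8)            # 5 tokens, 8-dimensional embeddings
contextualized = self_attention(tokens)   # each output row attends to every token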
Variational autoencoders (VAEs) are generative models that produce variations of the data they’re trained on. An encoder compresses input data into a lower-dimensional space, capturing the meaningful information contained in the input. A decoder then reconstructs new data from this compressed representation. Like GANs, VAEs can be used to generate synthetic images.
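Here is a minimal VAE sketch, assuming PyTorch, showing the encoder, the reparameterization step and the decoder; a real image VAE would use convolutional layers and a full training loop with a reconstruction-plus-KL loss.

# Minimal variational autoencoder sketch (assumes PyTorch).
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    def __init__(self, data_dim=784, latent_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(data_dim, 128), nn.ReLU())
        self.to_mu = nn.Linear(128, latent_dim)       # mean of the latent distribution
        self.to_logvar = nn.Linear(128, latent_dim)   # log-variance of the latent distribution
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, data_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        return self.decoder(z), mu, logvar

vae = TinyVAE()
# After training, sampling random latent vectors and decoding them yields new synthetic examples:
samples = vae.decoder(torch.randn(16, 8))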
Agent-based modeling entails representing a complex system as a virtual environment containing individual entities, also known as agents. Agents operate based on a predefined set of rules, interacting with their environment and with other agents. Simulating these interactions and agent behaviors produces synthetic data.
For instance, agent-based models in epidemiology represent individuals in a population as agents. Upon modeling agent interactions, synthetic data such as the rate of contact and likelihood of infection can be generated. The data can then aid in predicting infectious disease spread and examining the effects of interventions.
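As a simple illustration, the following Python sketch simulates agents with assumed contact and transmission parameters and records a synthetic time series of daily new infections:

# Minimal agent-based epidemic sketch; all parameters are assumptions for the example.
import random

POPULATION = 1000
CONTACTS_PER_DAY = 5          # assumed average contacts per agent per day
INFECTION_PROBABILITY = 0.05  # assumed chance of transmission per contact
DAYS = 60

infected = set(random.sample(range(POPULATION), 10))   # 10 initial cases
daily_new_cases = []

for day in range(DAYS):
    new_cases = set()
    for agent in range(POPULATION):
        if agent in infected:
            continue
        # Each susceptible agent meets a few random others; infection may spread on contact.
        contacts = random.sample(range(POPULATION), CONTACTS_PER_DAY)
        for other in contacts:
            if other in infected and random.random() < INFECTION_PROBABILITY:
                new_cases.add(agent)
                break
    infected |= new_cases
    daily_new_cases.append(len(new_cases))   # synthetic time series of new infections

print(daily_new_cases)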
Synthetic data is a growing technology, offering these advantages for enterprises:
● Customization
● Efficiency
● Increased data privacy
● Richer data
Data science teams can tailor synthetic data to fit the exact specifications and needs of a business. And because data scientists have greater control over synthetic datasets, managing and analyzing them becomes easier.
Generating synthetic data eliminates the time-consuming process of gathering real data, making it quicker to produce and helping accelerate workflows. Synthetic data can also be generated with labels already attached, removing the tedious step of annotating volumes of data by hand.
Synthetic data resembles real-world data, but it can be generated such that any personal data isn’t traceable to a particular individual. This acts as a form of data anonymization, helping keep sensitive information safe. Synthetic data also allows enterprises to avoid intellectual property and copyright issues, sidestepping the need for web crawlers that scrape and collect information from websites without users’ knowledge or consent.
Artificial datasets can help boost data diversity, creating or augmenting data for underrepresented groups in AI training. Synthetic data can also fill in the gaps when the original data is scarce or no real data exists. And including edge cases or outliers as data points can broaden the scope of synthetic datasets, reflecting the real world’s variability and unpredictability.
Despite synthetic data’s benefits, it also comes with some downsides. Following best practices for synthetic data generation can help address these drawbacks and allow companies to maximize the value of artificial data.
Here are some challenges associated with synthetic data:
● Bias
● Model collapse
● Trade-off between accuracy and privacy
● Verification
Synthetic data can still exhibit the biases that might be present in the real-world data it is based on. Drawing on multiple, diverse data sources, including data from varied regions and demographic groups, can help mitigate bias.
Model collapse happens when an AI model is repeatedly trained on AI-generated data, causing model performance to decline. A healthy mix of real and artificial training datasets can help prevent this problem.
During the synthetic data generation process, a trade-off between accuracy and privacy emerges. Prioritizing accuracy might mean retaining more personal data, while keeping privacy top of mind might reduce accuracy. Finding the right balance for a company’s use cases is vital.
Additional checks and tests must be conducted to validate synthetic data quality after it’s generated. This introduces an extra step to the workflow, but it’s a crucial one to make sure artificial datasets are free from any errors, inconsistencies or inaccuracies.
Synthetic data is versatile and can be generated for a wide range of applications. Here are some industries where synthetic data can be a boon:
● Automotive
● Finance
● Healthcare
● Manufacturing
Agent-based modeling can be employed to generate artificial data related to traffic flow, helping improve road and transport systems. The use of synthetic data can help car manufacturers avoid the costly and time-consuming process of obtaining real crash data for vehicle safety testing. Makers of autonomous vehicles can use synthetic data to train self-driving cars in navigating different scenarios.
Synthetic financial data can be used for assessing and managing risk, for predictive modeling and forecasting, and for testing trading algorithms, among other applications. IBM Synthetic Data Sets, for instance, include simulated credit card and home insurance claims data to aid fraud detection, as well as simulated banking transactions for anti-money laundering solutions.
Synthetic datasets can help pharmaceutical companies speed up drug development. Medical researchers, meanwhile, can use partially synthetic data for clinical trials or fully synthetic data to create artificial patient records or medical imaging for formulating innovative or preventive treatments. Agent-based modeling can also be applied in epidemiology to study disease transmission and interventions.
Manufacturing companies can use synthetic data to improve the visual inspection capabilities of computer vision models that examine products in real time for defects and deviations from standards. Artificial datasets can also enhance predictive maintenance, with synthetic sensor data helping machine learning models better anticipate equipment failures and recommend appropriate and timely measures.
1. “3 Bold and Actionable Predictions for the Future of GenAI,” Gartner, 12 April 2024.