When you hear the word “synthetic,” you might associate it with something artificial or fabricated. Take synthetic fibers such as polyester and nylon, for example, which are man-made through chemical processes.
While synthetic fibers are more affordable and easier to mass-produce, their quality can rival that of natural fibers. They’re often designed to mimic their natural counterparts and are engineered for specific uses—be it elastic elastane, heat-retaining acrylic or durable polyester.
The same is true for synthetic data. This artificially generated information can supplement or even replace real-world data when training or testing artificial intelligence (AI) models. Unlike real datasets, which can be costly to obtain, difficult to access, time-consuming to label and limited in supply, synthetic datasets can be produced through computer simulations or generative models. This makes them cheaper to generate on demand, in nearly limitless volumes and customized to an organization’s needs.
Despite its benefits, synthetic data also comes with challenges. The generation process can be complex, with data scientists having to create realistic data while still maintaining quality and privacy.
Yet synthetic data is here to stay. Research firm Gartner predicts that by 2026, 75% of businesses will use generative AI to create synthetic customer data.1
To help enterprises get the most out of artificial data, here are 8 best practices for synthetic data generation:
Understand why your business needs synthetic data and the use cases where it might be more helpful than real data. In healthcare, for instance, patient records or medical images can be artificially generated—without containing any sensitive data or personally identifiable information (PII). This also allows safe data sharing between researchers and data science teams.
Synthetic data can be used as test data during software development, standing in for sensitive production data but still emulating its characteristics. It also allows companies to avoid copyright and intellectual property issues, generating data instead of employing web crawlers to scrape and collect information from websites without users’ knowledge or consent.
Also, artificial data can act as a form of data augmentation. It can be used to boost data diversity, especially for underrepresented groups in AI model training. And when information is sparse, synthetic data can fill in the gaps.
Financial services firm J.P. Morgan, for example, found it difficult to effectively train AI-powered models for fraud detection due to the lack of fraudulent cases compared to non-fraudulent ones. The organization used synthetic data generation to create more examples of fraudulent transactions (link resides outside ibm.com), thereby enhancing model training.
Synthetic data quality is only as good as the real-world data underpinning it. When preparing original datasets for synthetic data generation by machine learning (ML) algorithms, check for and correct any errors, inaccuracies and inconsistencies. Remove duplicates and fill in or impute missing values.
Consider adding edge cases or outliers to the original data. These data points can represent uncommon events, rare scenarios or extreme cases that mirror the unpredictability and variability of the real world.
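As a rough illustration, here is a minimal pandas sketch of this preparation step, assuming a tabular seed dataset. The file names and the 3-standard-deviation outlier flag are illustrative choices, not requirements.

```python
# A minimal sketch of preparing a seed dataset before synthesis (hypothetical file names).
import pandas as pd

df = pd.read_csv("source_data.csv")

# Remove exact duplicates
df = df.drop_duplicates()

# Impute missing numeric values with the column median
for col in df.select_dtypes(include="number").columns:
    df[col] = df[col].fillna(df[col].median())

# Flag outliers (beyond 3 standard deviations) so they can be reviewed and
# deliberately kept, rather than silently discarded
numeric = df.select_dtypes(include="number")
z_scores = (numeric - numeric.mean()) / numeric.std()
df["is_outlier"] = (z_scores.abs() > 3).any(axis=1)

df.to_csv("prepared_seed_data.csv", index=False)
```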
“It comes down to the seed examples,” says Akash Srivastava, chief architect at InstructLab (link resides outside ibm.com), an open source project from IBM® and Red Hat that takes a collaborative approach to adding new knowledge and skills to a model, powered by IBM’s new synthetic data generation method and phased-training protocol. “The examples through which you seed the generation need to mimic your real-world use case.”
Synthetic data is still prone to inheriting and reflecting the biases that might be present in the original data it’s based on. Blending information from multiple sources, including different demographic groups and regions, can help mitigate bias in the generated data.
Diverse data sources can also elevate the quality of synthetic datasets. Varied sources can offer essential details or vital context that a single source or only a handful of sources lack. Also, incorporating retrieval-augmented generation into the synthetic data generation process can provide access to up-to-date and domain-specific data that can increase accuracy and further improve quality.
Selecting the right synthetic data generation technique depends on a few factors, including data type and complexity. Relatively simple data might benefit from statistical methods. More intricate datasets—structured data like tabular data or unstructured data such as images or videos, for example—might require deep learning models. Enterprises might also opt to combine synthesis techniques according to their requirements.
Here are some common mechanisms for synthetic data generation:
Data scientists can analyze statistical distributions in real data and generate synthetic samples that mirror those distributions. However, this requires significant knowledge and expertise, and not all data fit into a known distribution.
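As a hedged example, the sketch below fits a known distribution to a numeric column with SciPy and then samples from it. The lognormal assumption and the stand-in “income” values are illustrative only.

```python
# A minimal sketch of distribution-based synthesis: fit, then sample.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
real_incomes = rng.lognormal(mean=10.5, sigma=0.4, size=5_000)  # stand-in for a real column

# Fit a lognormal distribution to the observed values
shape, loc, scale = stats.lognorm.fit(real_incomes, floc=0)

# Draw synthetic samples that mirror the fitted distribution
synthetic_incomes = stats.lognorm.rvs(shape, loc=loc, scale=scale, size=5_000)

# Quick sanity check: compare summary statistics
print(real_incomes.mean(), synthetic_incomes.mean())
print(real_incomes.std(), synthetic_incomes.std())
```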
Generative adversarial networks (GANs) consist of two neural networks: a generator that creates synthetic data and a discriminator that acts as an adversary, discriminating between artificial and real data. Both networks are trained iteratively, with the discriminator’s feedback improving the generator’s output until the discriminator is no longer able to distinguish artificial from real data.
GANs can be used to generate synthetic images for computer vision and image classification tasks.
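The toy PyTorch sketch below shows the adversarial training loop on one-dimensional numeric data rather than images; real image GANs use convolutional networks, but the generator-discriminator dynamic is the same. The architecture and training settings here are arbitrary illustrative choices.

```python
# A minimal GAN sketch: generator vs. discriminator on a toy 1D dataset.
import torch
import torch.nn as nn

torch.manual_seed(0)
real_data = torch.randn(10_000, 1) * 2 + 5  # stand-in "real" samples from N(5, 2)

G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))                 # generator
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())   # discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(2_000):
    real = real_data[torch.randint(0, len(real_data), (64,))]
    fake = G(torch.randn(64, 8))

    # Discriminator step: push real samples toward 1, generated samples toward 0
    opt_d.zero_grad()
    loss_d = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    loss_d.backward()
    opt_d.step()

    # Generator step: try to make the discriminator label fakes as real
    opt_g.zero_grad()
    loss_g = bce(D(fake), torch.ones(64, 1))
    loss_g.backward()
    opt_g.step()

# Generate synthetic samples once training stabilizes
synthetic = G(torch.randn(1_000, 8)).detach()
print(synthetic.mean().item(), synthetic.std().item())
```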
Variational autoencoders (VAEs) are deep learning models that generate variations of the data they’re trained on. An encoder compresses input data into a lower-dimensional space, capturing the meaningful information contained in the input. A decoder then reconstructs new data from this compressed representation. Like GANs, VAEs can be used for image generation.
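A comparable minimal VAE sketch, again on toy numeric vectors rather than images, is shown below; the layer sizes, KL weight and training budget are arbitrary illustrative choices.

```python
# A minimal VAE sketch: encode to a latent space, sample, decode new records.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, dim=4, latent=2):
        super().__init__()
        self.enc = nn.Linear(dim, 16)
        self.mu = nn.Linear(16, latent)
        self.logvar = nn.Linear(16, latent)
        self.dec = nn.Sequential(nn.Linear(latent, 16), nn.ReLU(), nn.Linear(16, dim))

    def forward(self, x):
        h = torch.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: sample from the learned latent distribution
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.dec(z), mu, logvar

data = torch.randn(5_000, 4)  # stand-in "real" records
model = VAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(1_000):
    x = data[torch.randint(0, len(data), (128,))]
    recon, mu, logvar = model(x)
    # Reconstruction loss plus KL divergence toward a standard normal prior
    kl = -0.5 * torch.mean(torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1))
    loss = F.mse_loss(recon, x) + 0.01 * kl
    opt.zero_grad()
    loss.backward()
    opt.step()

# Decode fresh latent samples into new synthetic records
synthetic = model.dec(torch.randn(1_000, 2)).detach()
```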
Transformer models, such as generative pretrained transformers (GPTs), excel in understanding the structure and patterns in language. They can be used to generate synthetic text data for natural language processing applications or to create artificial tabular data for classification or regression tasks.
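For text, a small pretrained transformer can simply be prompted to produce synthetic examples. The sketch below uses the Hugging Face pipeline API with GPT-2 purely for illustration; the prompt and model choice are assumptions, not recommendations.

```python
# A minimal sketch of synthetic text generation with a small pretrained transformer.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # illustrative model choice
prompt = "Customer support ticket: My order arrived damaged and"
samples = generator(prompt, max_new_tokens=40, num_return_sequences=3, do_sample=True)
for s in samples:
    print(s["generated_text"])
```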
It’s important to consider model collapse, wherein a model’s performance declines as it’s repeatedly trained on AI-generated data. That’s why it’s essential to ground the synthetic data generation process in real data.
At InstructLab, for instance, synthetic data generation is driven by a taxonomy, which defines the domain or topics that the original data comes from. This prevents the model from deciding the data that it must be trained on.
“You’re not asking the model to just keep going in a loop and collapse. We completely bypass the collapsing by decoupling the model from the sampling process,” Srivastava says.
High-quality data is vital to model performance. Verify synthetic data quality by using fidelity- and utility-based metrics. Fidelity refers to how closely synthetic datasets resemble real-world datasets. Utility evaluates how well synthetic data can be used to train deep learning or ML models.
Measuring fidelity involves comparing synthetic data with the original data, often by using statistical methods and visualizations like histograms. This helps determine whether generated datasets preserve the statistical properties of real datasets, such as distribution, mean, median, range and variance, among others.
Assessing correlational similarity through correlation and contingency coefficients, for example, is also essential to help ensure dependencies and relationships between data points are maintained and accurately represent real-world patterns. Neural networks, generative models and language models are typically skilled at capturing relationships in tabular data and time-series data.
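As a simple illustration of these fidelity checks, the sketch below compares summary statistics, runs a two-sample Kolmogorov-Smirnov test per column and measures the gap between correlation matrices. The file and column names are hypothetical.

```python
# A minimal fidelity-check sketch: distributions and correlations, real vs. synthetic.
import pandas as pd
from scipy import stats

real = pd.read_csv("prepared_seed_data.csv")   # hypothetical files
synth = pd.read_csv("synthetic_data.csv")
columns = ["age", "income"]                    # hypothetical columns

for col in columns:
    ks_stat, p_value = stats.ks_2samp(real[col], synth[col])
    print(f"{col}: mean {real[col].mean():.2f} vs {synth[col].mean():.2f}, "
          f"KS statistic {ks_stat:.3f} (p={p_value:.3f})")

# Correlational similarity: how far apart are the two correlation matrices?
corr_gap = (real[columns].corr() - synth[columns].corr()).abs().max().max()
print("Max absolute correlation difference:", corr_gap)
```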
Measuring utility entails using synthetic data as training data for machine learning models, then comparing model performance against training with real data. Here are some common metrics for benchmarking:
Accuracy calculates the percentage of correct predictions, while precision measures the share of positive predictions that are actually correct.
Recall quantifies the share of actual positive cases the model correctly identifies.
The F1 score combines precision and recall into a single metric.
Both the inception score and Fréchet inception distance (FID) evaluate the quality of generated images.
Synthetic data generation tools or providers might already have these metrics on hand, but you can also use other analytics packages like SDMetrics (link resides outside ibm.com), an open source Python library for assessing tabular synthetic data.
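To make the utility comparison concrete, here is a minimal “train on synthetic, test on real” sketch with scikit-learn. The feature and label columns are hypothetical, and the random forest is an arbitrary stand-in for whatever model you actually deploy.

```python
# A minimal utility-check sketch: train on synthetic vs. real, test on a real holdout.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

real = pd.read_csv("prepared_seed_data.csv")   # hypothetical files
synth = pd.read_csv("synthetic_data.csv")
features, label = ["age", "income"], "is_fraud"  # hypothetical columns

# Hold out part of the real data as a common test set
real_train, real_test = train_test_split(real, test_size=0.3, random_state=0)

def evaluate(train_df):
    model = RandomForestClassifier(random_state=0)
    model.fit(train_df[features], train_df[label])
    preds = model.predict(real_test[features])
    return accuracy_score(real_test[label], preds), f1_score(real_test[label], preds)

print("Trained on real data:     ", evaluate(real_train))
print("Trained on synthetic data:", evaluate(synth))
```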
The human touch is still crucial when validating artificial data, and it can be as simple as taking 5 to 10 random samples from the synthetic dataset and appraising them yourself. “You have to have a human in the loop for verification,” says Srivastava. “These are very complicated systems, and just like in any complicated system, there are many delicate points at which things might go wrong. Rely on metrics, rely on benchmarks, rigorously test your pipeline, but always take a few random samples and manually check that they are giving you the kind of data you want.”
One of the advantages of using synthetic data is that it doesn’t contain any sensitive data or PII. However, enterprises must still verify that the new data they generate complies with privacy regulations, such as the European Union’s General Data Protection Regulation (GDPR) or the US Health Insurance Portability and Accountability Act (HIPAA).
Treat synthetic data like proprietary data, applying built-in security measures and access controls to prevent data hacks and leaks. Safeguards must also be applied during the generation process to prevent the risk of synthetic data being reverse engineered and traced back to its real-world equivalent, revealing sensitive information during data analysis. These safeguards include techniques like masking to hide or mask sensitive data, anonymization to scrub or remove PII, and differential privacy to add “noise” or introduce randomness to the dataset.
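As a simplified illustration of two of these safeguards, the sketch below masks an identifier column and adds Laplace noise to a numeric column. The column names and the epsilon value are assumptions, and production systems should rely on a vetted differential privacy library with proper privacy-budget accounting rather than this hand-rolled mechanism.

```python
# A minimal sketch of masking plus a simplified differential-privacy-style noise step.
import numpy as np
import pandas as pd

df = pd.read_csv("synthetic_data.csv")  # hypothetical file

# Mask an identifier column so values can't be linked back to real records
df["customer_id"] = "MASKED"

# Laplace mechanism: noise scale = sensitivity / epsilon (illustrative values)
epsilon, sensitivity = 1.0, 1.0
rng = np.random.default_rng(0)
df["income"] = df["income"] + rng.laplace(loc=0.0, scale=sensitivity / epsilon, size=len(df))

df.to_csv("privacy_hardened_synthetic_data.csv", index=False)
```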
“At the minimum, PII masking or scrubbing is required, or you could go a step further and use differential privacy methods,” Srivastava says. “It becomes even more important if you are not using local models. If you’re sending [data] to some third-party provider, it is even more important that you’re extra careful about these aspects.”
Note that synthetic data can’t usually be optimized simultaneously for fidelity, utility and privacy—there will often be a tradeoff. Masking or anonymization might nominally reduce utility, while differential privacy might slightly decrease accuracy. However, not implementing any privacy measures can potentially expose PII. Organizations must balance and prioritize what is crucial for their specific use cases.
Keep a record of your synthetic data generation workflow, such as strategies for cleaning and preparing original datasets, mechanisms for generating data and maintaining privacy, and verification results. Include the rationale behind your choices and decisions for accountability and transparency.
Documentation is especially valuable when conducting periodic reviews of your synthetic data generation process. These records serve as audit trails that can help with evaluating the effectiveness and reproducibility of the workflow.
Routinely monitor how synthetic data is used and how it performs to identify any unexpected behaviors that might crop up or opportunities for improvement. Adjust and refine the generation process as needed.
Much like fibers are the foundation of fabrics, data is the building block of AI models. And while synthetic data generation is still in its early stages, advancements in the generation process can help enhance synthetic data to the point where it matches the quality, reliability and utility of real data, akin to the way synthetic fibers almost equal natural fibers.
1 3 Bold and Actionable Predictions for the Future of GenAI (link resides outside ibm.com), Gartner, 12 April 2024