When you hear the word “synthetic,” you might associate it with something artificial or fabricated. Take synthetic fibers such as polyester and nylon, for example, which are man-made through chemical processes.
While synthetic fibers are more affordable and easier to mass-produce, their quality can rival that of natural fibers. They’re often designed to mimic their natural counterparts and are engineered for specific uses—be it elastic elastane, heat-retaining acrylic or durable polyester.
The same is true for synthetic data. This artificially generated information can supplement or even replace real-world data when training or testing artificial intelligence (AI) models. Unlike real datasets, which can be costly to obtain, difficult to access, time-consuming to label and limited in supply, synthetic datasets can be produced through computer simulations or generative models. This makes them cheaper to generate on demand, in nearly limitless volumes and customized to an organization’s needs.
Despite its benefits, synthetic data also comes with challenges. The generation process can be complex, with data scientists having to create realistic data while still maintaining quality and privacy.
Yet synthetic data is here to stay. Research firm Gartner predicts that by 2026, 75% of businesses will use generative AI to create synthetic customer data.1
To help enterprises get the most out of artificial data, here are 8 best practices for synthetic data generation:
Understand why your business needs synthetic data and the use cases where it might be more helpful than real data. In healthcare, for instance, patient records or medical images can be artificially generated—without containing any sensitive data or personally identifiable information (PII). This also allows safe data sharing between researchers and data science teams.
Synthetic data can be used as test data during software development, standing in for sensitive production data but still emulating its characteristics. It also allows companies to avoid copyright and intellectual property issues, generating data instead of employing web crawlers to scrape and collect information from websites without users’ knowledge or consent.
Also, artificial data can act as a form of data augmentation. It can be used to boost data diversity, especially for underrepresented groups in AI model training. And when information is sparse, synthetic data can fill in the gaps.
Financial services firm J.P. Morgan, for example, found it difficult to effectively train AI-powered models for fraud detection due to the lack of fraudulent cases compared to non-fraudulent ones. The organization used synthetic data generation to create more examples of fraudulent transactions (link resides outside ibm.com), thereby enhancing model training.
Synthetic data quality is only as good as the real-world data underpinning it. When preparing original datasets for synthetic data generation by machine learning (ML) algorithms, check for and correct any errors, inaccuracies and inconsistencies. Remove duplicates and fill in or impute missing values.
Consider adding edge cases or outliers to the original data. These data points can represent uncommon events, rare scenarios or extreme cases that mirror the unpredictability and variability of the real world.
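As a rough illustration, here is a minimal pandas sketch of this preparation step, assuming a tabular seed dataset. The file names and the 3-standard-deviation outlier flag are illustrative choices, not requirements.

```python
# A minimal sketch of preparing a seed dataset before synthesis (hypothetical file names).
import pandas as pd

df = pd.read_csv("source_data.csv")

# Remove exact duplicates
df = df.drop_duplicates()

# Impute missing numeric values with the column median
for col in df.select_dtypes(include="number").columns:
    df[col] = df[col].fillna(df[col].median())

# Flag outliers (beyond 3 standard deviations) so they can be reviewed and
# deliberately kept, rather than silently discarded
numeric = df.select_dtypes(include="number")
z_scores = (numeric - numeric.mean()) / numeric.std()
df["is_outlier"] = (z_scores.abs() > 3).any(axis=1)

df.to_csv("prepared_seed_data.csv", index=False)
```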
“It comes down to the seed examples,” says Akash Srivastava, chief architect at InstructLab (link resides outside ibm.com), an open source project from IBM® and Red Hat that takes a collaborative approach to adding new knowledge and skills to a model, powered by IBM’s new synthetic data generation method and phased-training protocol. “The examples through which you seed the generation need to mimic your real-world use case.”
Synthetic data is still prone to inheriting and reflecting the biases that might be present in the original data it’s based on. Blending information from multiple sources, including different demographic groups and regions, can help mitigate bias in the generated data.
Diverse data sources can also elevate the quality of synthetic datasets. Varied sources can offer essential details or vital context that a single source or only a handful of sources lack. Also, incorporating retrieval-augmented generation into the synthetic data generation process can provide access to up-to-date and domain-specific data that can increase accuracy and further improve quality.
Selecting the right synthetic data generation technique depends on a few factors, including data type and complexity. Relatively simple data might benefit from statistical methods. More intricate datasets—structured data like tabular data or unstructured data such as images or videos, for example—might require deep learning models. Enterprises might also opt to combine synthesis techniques according to their requirements.
Here are some common mechanisms for synthetic data generation:
Data scientists can analyze statistical distributions in real data and generate synthetic samples that mirror those distributions. However, this requires significant knowledge and expertise, and not all data fit into a known distribution.
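As a hedged example, the sketch below fits a known distribution to a numeric column with SciPy and then samples from it. The lognormal assumption and the stand-in “income” values are illustrative only.

```python
# A minimal sketch of distribution-based synthesis: fit, then sample.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
real_incomes = rng.lognormal(mean=10.5, sigma=0.4, size=5_000)  # stand-in for a real column

# Fit a lognormal distribution to the observed values
shape, loc, scale = stats.lognorm.fit(real_incomes, floc=0)

# Draw synthetic samples that mirror the fitted distribution
synthetic_incomes = stats.lognorm.rvs(shape, loc=loc, scale=scale, size=5_000)

# Quick sanity check: compare summary statistics
print(real_incomes.mean(), synthetic_incomes.mean())
print(real_incomes.std(), synthetic_incomes.std())
```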
Generative adversarial networks (GANs) consist of two neural networks: a generator that creates synthetic data and a discriminator that acts as an adversary, discriminating between artificial and real data. Both networks are trained iteratively, with the discriminator’s feedback improving the generator’s output until the discriminator is no longer able to distinguish artificial from real data.
GANs can be used to generate synthetic images for computer vision and image classification tasks.
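The toy PyTorch sketch below shows the adversarial training loop on one-dimensional numeric data rather than images; real image GANs use convolutional networks, but the generator-discriminator dynamic is the same. The architecture and training settings here are arbitrary illustrative choices.

```python
# A minimal GAN sketch: generator vs. discriminator on a toy 1D dataset.
import torch
import torch.nn as nn

torch.manual_seed(0)
real_data = torch.randn(10_000, 1) * 2 + 5  # stand-in "real" samples from N(5, 2)

G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))                 # generator
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())   # discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(2_000):
    real = real_data[torch.randint(0, len(real_data), (64,))]
    fake = G(torch.randn(64, 8))

    # Discriminator step: push real samples toward 1, generated samples toward 0
    opt_d.zero_grad()
    loss_d = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    loss_d.backward()
    opt_d.step()

    # Generator step: try to make the discriminator label fakes as real
    opt_g.zero_grad()
    loss_g = bce(D(fake), torch.ones(64, 1))
    loss_g.backward()
    opt_g.step()

# Generate synthetic samples once training stabilizes
synthetic = G(torch.randn(1_000, 8)).detach()
print(synthetic.mean().item(), synthetic.std().item())
```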
Variational autoencoders (VAEs) are deep learning models that generate variations of the data they’re trained on. An encoder compresses input data into a lower-dimensional space, capturing the meaningful information contained in the input. A decoder then reconstructs new data from this compressed representation. Like GANs, VAEs can be used for image generation.
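A comparable minimal VAE sketch, again on toy numeric vectors rather than images, is shown below; the layer sizes, KL weight and training budget are arbitrary illustrative choices.

```python
# A minimal VAE sketch: encode to a latent space, sample, decode new records.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, dim=4, latent=2):
        super().__init__()
        self.enc = nn.Linear(dim, 16)
        self.mu = nn.Linear(16, latent)
        self.logvar = nn.Linear(16, latent)
        self.dec = nn.Sequential(nn.Linear(latent, 16), nn.ReLU(), nn.Linear(16, dim))

    def forward(self, x):
        h = torch.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: sample from the learned latent distribution
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.dec(z), mu, logvar

data = torch.randn(5_000, 4)  # stand-in "real" records
model = VAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(1_000):
    x = data[torch.randint(0, len(data), (128,))]
    recon, mu, logvar = model(x)
    # Reconstruction loss plus KL divergence toward a standard normal prior
    kl = -0.5 * torch.mean(torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1))
    loss = F.mse_loss(recon, x) + 0.01 * kl
    opt.zero_grad()
    loss.backward()
    opt.step()

# Decode fresh latent samples into new synthetic records
synthetic = model.dec(torch.randn(1_000, 2)).detach()
```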
Transformer models, such as generative pretrained transformers (GPTs), excel in understanding the structure and patterns in language. They can be used to generate synthetic text data for natural language processing applications or to create artificial tabular data for classification or regression tasks.
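For text, a small pretrained transformer can simply be prompted to produce synthetic examples. The sketch below uses the Hugging Face pipeline API with GPT-2 purely for illustration; the prompt and model choice are assumptions, not recommendations.

```python
# A minimal sketch of synthetic text generation with a small pretrained transformer.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # illustrative model choice
prompt = "Customer support ticket: My order arrived damaged and"
samples = generator(prompt, max_new_tokens=40, num_return_sequences=3, do_sample=True)
for s in samples:
    print(s["generated_text"])
```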
It’s important to consider model collapse, wherein a model’s performance declines as it’s repeatedly trained on AI-generated data. That’s why it’s essential to ground the synthetic data generation process in real data.
At InstructLab, for instance, synthetic data generation is driven by a taxonomy, which defines the domain or topics that the original data comes from. This prevents the model from deciding the data that it must be trained on.
“You’re not asking the model to just keep going in a loop and collapse. We completely bypass the collapsing by decoupling the model from the sampling process,” Srivastava says.
High-quality data is vital to model performance. Verify synthetic data quality by using fidelity- and utility-based metrics. Fidelity refers to how closely synthetic datasets resemble real-world datasets. Utility evaluates how well synthetic data can be used to train deep learning or ML models.
Measuring fidelity involves comparing synthetic data with the original data, often by using statistical methods and visualizations like histograms. This helps determine whether generated datasets preserve the statistical properties of real datasets, such as distribution, mean, median, range and variance, among others.
Assessing correlational similarity through correlation and contingency coefficients, for example, is also essential to help ensure dependencies and relationships between data points are maintained and accurately represent real-world patterns. Neural networks, generative models and language models are typically skilled at capturing relationships in tabular data and time-series data.
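As a simple illustration of these fidelity checks, the sketch below compares summary statistics, runs a two-sample Kolmogorov-Smirnov test per column and measures the gap between correlation matrices. The file and column names are hypothetical.

```python
# A minimal fidelity-check sketch: distributions and correlations, real vs. synthetic.
import pandas as pd
from scipy import stats

real = pd.read_csv("prepared_seed_data.csv")   # hypothetical files
synth = pd.read_csv("synthetic_data.csv")
columns = ["age", "income"]                    # hypothetical columns

for col in columns:
    ks_stat, p_value = stats.ks_2samp(real[col], synth[col])
    print(f"{col}: mean {real[col].mean():.2f} vs {synth[col].mean():.2f}, "
          f"KS statistic {ks_stat:.3f} (p={p_value:.3f})")

# Correlational similarity: how far apart are the two correlation matrices?
corr_gap = (real[columns].corr() - synth[columns].corr()).abs().max().max()
print("Max absolute correlation difference:", corr_gap)
```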
Measuring utility entails using synthetic data as training data for machine learning models, then comparing model performance against training with real data. Here are some common metrics for benchmarking:
Accuracy calculates the percentage of correct predictions, while precision measures the share of positive predictions that are actually correct.
Recall quantifies the share of actual positive cases the model correctly identifies.
The F1 score combines precision and recall into a single metric.
Both the inception score and Fréchet inception distance (FID) evaluate the quality of generated images.
Synthetic data generation tools or providers might already have these metrics on hand, but you can also use other analytics packages like SDMetrics (link resides outside ibm.com), an open source Python library for assessing tabular synthetic data.
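To make the utility comparison concrete, here is a minimal “train on synthetic, test on real” sketch with scikit-learn. The feature and label columns are hypothetical, and the random forest is an arbitrary stand-in for whatever model you actually deploy.

```python
# A minimal utility-check sketch: train on synthetic vs. real, test on a real holdout.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

real = pd.read_csv("prepared_seed_data.csv")   # hypothetical files
synth = pd.read_csv("synthetic_data.csv")
features, label = ["age", "income"], "is_fraud"  # hypothetical columns

# Hold out part of the real data as a common test set
real_train, real_test = train_test_split(real, test_size=0.3, random_state=0)

def evaluate(train_df):
    model = RandomForestClassifier(random_state=0)
    model.fit(train_df[features], train_df[label])
    preds = model.predict(real_test[features])
    return accuracy_score(real_test[label], preds), f1_score(real_test[label], preds)

print("Trained on real data:     ", evaluate(real_train))
print("Trained on synthetic data:", evaluate(synth))
```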
The human touch is still crucial when validating artificial data, and it can be as simple as taking 5 to 10 random samples from the synthetic dataset and appraising them yourself. “You have to have a human in the loop for verification,” says Srivastava. “These are very complicated systems, and just like in any complicated system, there are many delicate points at which things might go wrong. Rely on metrics, rely on benchmarks, rigorously test your pipeline, but always take a few random samples and manually check that they are giving you the kind of data you want.”
One of the advantages of using synthetic data is that it doesn’t contain any sensitive data or PII. However, enterprises must still verify that the new data they generate complies with privacy regulations, such as the European Union’s General Data Protection Regulation (GDPR) or the US Health Insurance Portability and Accountability Act (HIPAA).
Treat synthetic data like proprietary data, applying built-in security measures and access controls to prevent data hacks and leaks. Safeguards must also be applied during the generation process to prevent the risk of synthetic data being reverse engineered and traced back to its real-world equivalent, revealing sensitive information during data analysis. These safeguards include techniques like masking to hide or mask sensitive data, anonymization to scrub or remove PII, and differential privacy to add “noise” or introduce randomness to the dataset.
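As a simplified illustration of two of these safeguards, the sketch below masks an identifier column and adds Laplace noise to a numeric column. The column names and the epsilon value are assumptions, and production systems should rely on a vetted differential privacy library with proper privacy-budget accounting rather than this hand-rolled mechanism.

```python
# A minimal sketch of masking plus a simplified differential-privacy-style noise step.
import numpy as np
import pandas as pd

df = pd.read_csv("synthetic_data.csv")  # hypothetical file

# Mask an identifier column so values can't be linked back to real records
df["customer_id"] = "MASKED"

# Laplace mechanism: noise scale = sensitivity / epsilon (illustrative values)
epsilon, sensitivity = 1.0, 1.0
rng = np.random.default_rng(0)
df["income"] = df["income"] + rng.laplace(loc=0.0, scale=sensitivity / epsilon, size=len(df))

df.to_csv("privacy_hardened_synthetic_data.csv", index=False)
```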
“At the minimum, PII masking or scrubbing is required, or you could go a step further and use differential privacy methods,” Srivastava says. “It becomes even more important if you are not using local models. If you’re sending [data] to some third-party provider, it is even more important that you’re extra careful about these aspects.”
Note that synthetic data can’t usually be optimized simultaneously for fidelity, utility and privacy—there will often be a tradeoff. Masking or anonymization might nominally reduce utility, while differential privacy might slightly decrease accuracy. However, not implementing any privacy measures can potentially expose PII. Organizations must balance and prioritize what is crucial for their specific use cases.
Keep a record of your synthetic data generation workflow, such as strategies for cleaning and preparing original datasets, mechanisms for generating data and maintaining privacy, and verification results. Include the rationale behind your choices and decisions for accountability and transparency.
Documentation is especially valuable when conducting periodic reviews of your synthetic data generation process. These records serve as audit trails that can help with evaluating the effectiveness and reproducibility of the workflow.
Routinely monitor how synthetic data is used and how it performs to identify any unexpected behaviors that might crop up or opportunities for improvement. Adjust and refine the generation process as needed.
Much like fibers are the foundation of fabrics, data is the building block of AI models. And while synthetic data generation is still in its early stages, advancements in the generation process can help enhance synthetic data to the point where it matches the quality, reliability and utility of real data, akin to the way synthetic fibers almost equal natural fibers.
1 3 Bold and Actionable Predictions for the Future of GenAI (link resides outside ibm.com), Gartner, 12 April 2024