Synthetic data is data that has been created artificially, through computer simulation or generative algorithms, to take the place of real-world data. It can serve as an alternative or supplement to real-world data when that data is not readily available, and it can also aid in data science experiments.
This new data can be used as a placeholder for test data sets and is increasingly used to train machine learning models because of its benefits for data privacy. One example is synthetic data used in healthcare to protect patient data and enhance clinical trials. The healthcare sector's interest stems from the compliance regulations surrounding patient data. HIPAA, the Health Insurance Portability and Accountability Act, is a federal law that protects sensitive patient health information from being disclosed without consent; synthetic data helps organizations meet that requirement by substituting AI-generated records for real patient data.
While the data is artificial, synthetic data reflects real-world events on a mathematical and statistical basis. The technique is gaining popularity in the further development of deep learning and in many other use cases.
Gartner, a market research firm, predicts that by 2024, 60% of the data used to train AI models will be synthetically generated.
Synthetic data is created programmatically with machine learning techniques to mirror the statistical properties of real-world data. It can be generated in many ways, with effectively no limit on volume, time period, or location.
The process typically starts from data collected about actual events, objects, or people, or from computer simulations. A common way to generate synthetic data is with data generation tools, which are available as open-source projects or commercial products. Using such a tool, data scientists build a model of the information captured in the real-world data and then sample from that model to produce a new dataset.
One example is the Synthetic Data Vault (SDV), developed at MIT, a synthetic data generation ecosystem of libraries “that allows users to easily learn single-table, multi-table and timeseries datasets to later on generate new Synthetic Data that has the same format and statistical properties as the original dataset,” according to SDV.
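As a hedged illustration, the sketch below uses SDV's documented single-table workflow (API names as of SDV 1.x; the DataFrame and its columns are invented for the example) to learn a model of a real table and sample new rows from it:

```python
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

# Hypothetical "real" table standing in for sensitive records.
real_data = pd.DataFrame({
    "age": [34, 51, 29, 62],
    "annual_income": [52000, 88000, 41000, 97000],
    "has_claim": [False, True, False, True],
})

# Infer column types, fit a model of the table, then sample new rows
# that share the original's format and statistical properties.
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_data)
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real_data)
synthetic_data = synthesizer.sample(num_rows=100)
```

The sampled rows follow the learned joint distribution rather than copying any individual record, which is what makes this approach attractive for privacy-sensitive tables.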
Below are the different types of synthetic data creation methods:
Variational Autoencoders (VAE): VAEs are generative models in which encoder-decoder network pairs are trained to reconstruct the training data distribution in such a way that the latent space of the encoder network is smooth; sampling from that smooth latent space is what lets a trained VAE generate new, synthetic examples (see the first sketch after this list).
Generative Adversarial Networks (GANs): The GAN was introduced by Ian Goodfellow to create fake images that replicate real ones. GANs have broad applicability in model training for generating realistic, highly detailed representations.

A GAN is a machine learning architecture that pairs two neural networks. The objective of the generator network is the creation of fake output: using the example of a flower, it starts from random noise and learns, from real flowers, to produce artificial flowers as output. The second network, the discriminator, tries to tell those fakes apart from real samples, and its feedback drives the generator's outputs to become more realistic (see the second sketch after this list).
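To make the VAE approach concrete, here is a minimal, hedged sketch in PyTorch; the dimensions, architecture, and the untrained sampling at the end are illustrative assumptions, not a production recipe:

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    """Minimal variational autoencoder for fixed-length numeric records."""
    def __init__(self, input_dim=16, latent_dim=4):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 64), nn.ReLU())
        self.mu = nn.Linear(64, latent_dim)      # mean of the latent Gaussian
        self.logvar = nn.Linear(64, latent_dim)  # log-variance of the latent Gaussian
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, input_dim)
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: sample z while keeping gradients flowing.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.decoder(z), mu, logvar

def vae_loss(recon, x, mu, logvar):
    # Reconstruction error plus the KL term that keeps the latent space smooth.
    recon_err = nn.functional.mse_loss(recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_err + kl

# After training, decoding samples from the latent prior yields synthetic rows.
model = VAE()
with torch.no_grad():
    synthetic_rows = model.decoder(torch.randn(100, 4))
```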
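And here is a correspondingly minimal GAN sketch, again with purely illustrative dimensions and a random stand-in for the real data batch; it shows one adversarial round, a discriminator update followed by a generator update:

```python
import torch
import torch.nn as nn

latent_dim, data_dim, batch = 8, 16, 64

# Generator: maps random noise vectors to fake samples.
G = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, data_dim))
# Discriminator: scores how "real" a sample looks (1 = real, 0 = fake).
D = nn.Sequential(nn.Linear(data_dim, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

real = torch.randn(batch, data_dim)  # stand-in for a batch of real data

# Discriminator step: push real samples toward 1 and generated fakes toward 0.
fake = G(torch.randn(batch, latent_dim)).detach()
d_loss = bce(D(real), torch.ones(batch, 1)) + bce(D(fake), torch.zeros(batch, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: adjust G so the discriminator scores its fakes as real.
fake = G(torch.randn(batch, latent_dim))
g_loss = bce(D(fake), torch.ones(batch, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```

Repeating these two steps over many batches is what pushes the generator's output toward realistic, highly detailed samples.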
The prevalence of synthetic data is relatively new, and it should not be confused with data augmentation or data anonymization. Let's take a closer look at the differences between these terms.
Data augmentation is a technique that takes the original data, applies minor changes, and creates modified copies; the purpose is to enlarge the data set artificially. One common use is image augmentation, where filters such as blur and rotation create new versions of existing images or frames, for example brightening or rotating an image to produce a new one (see the sketch below).
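For instance, a minimal sketch with torchvision's standard transforms (the input file name is hypothetical) might look like this:

```python
from PIL import Image
from torchvision import transforms

# A pipeline of small, label-preserving edits: rotate, brighten, blur.
augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),
    transforms.ColorJitter(brightness=0.4),
    transforms.GaussianBlur(kernel_size=3),
])

image = Image.open("flower.jpg")                   # hypothetical input file
new_versions = [augment(image) for _ in range(5)]  # five modified copies
```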
Data anonymization is a technique that helps you protect sensitive data, such as personally identifiable information or restricted business data, to avoid the risk of compromising confidential information. It is defined in policy rules that are enforced for an asset; depending on the anonymization method, data is redacted, masked, or substituted in the asset preview (a minimal sketch follows).
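As a hedged illustration of redaction and masking (the table and its values are invented), plain pandas is enough:

```python
import pandas as pd

# Hypothetical asset containing personally identifiable information.
df = pd.DataFrame({
    "name": ["Ada Lovelace", "Alan Turing"],
    "ssn": ["123-45-6789", "987-65-4321"],
    "age": [36, 41],
})

df["name"] = "[REDACTED]"                   # redaction: remove the value outright
df["ssn"] = "XXX-XX-" + df["ssn"].str[-4:]  # masking: keep only the last four digits
# "age" is left as-is; substitution would instead swap in plausible fake values.
```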
Unlike the above techniques, synthetic data uses machine learning to generate entirely new data, as opposed to altering or modifying real-world data.
Synthetic data is growing in popularity because of its accuracy and ability to generate large training datasets to train neural networks without the hassle, effort, or cost of manual data labeling. It has a vast number of uses and there are several approaches to consider.
As synthetic data gains in popularity, it is important to look at both its compelling benefits and its challenges. Generating synthetic data requires highly skilled artificial intelligence (AI) specialists who understand the intricacies of how the data works. Companies or organizations that wish to use synthetic data must also establish a framework to check the accuracy of their data generation projects.
Data labeling is a time-consuming aspect of machine learning, and synthetic data removes that tedious step, saving both time and cost. Because synthetic data is generated programmatically, correct labels can be produced automatically alongside the data itself.
Another benefit of synthetic data is that it can provide training data for edge cases: events or instances that occur infrequently but are vital to your AI model. Synthetic data's ability to cover edge cases allows companies to innovate faster in different domains, since they don't have to wait for new, rare data points to occur naturally.
There are also use cases so new that no real data exists yet, which is where AI-generated data can play a role. One example is preparing datasets for the potential impact of a global pandemic, where real data may not yet exist.
It should be noted that synthetic data is not a perfect solution to bias, as synthetic data research in medicine shows: some cohorts of patients are underrepresented in real-world data, and that bias can carry over into the synthetic data and the machine learning models trained on it.
Banking: The financial sector has found benefits in synthetic data thanks to its ability to help expose fraudulent activity on credit and debit cards. Synthetic transactions that look and behave like normal payment data can be generated to test fraud detection systems, ensuring they work properly and opening new ways forward in detection (a hedged sketch follows).
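As one way to sketch that testing idea, an imbalanced, fraud-like dataset can be fabricated with scikit-learn's make_classification; the feature count and fraud rate below are illustrative assumptions:

```python
from sklearn.datasets import make_classification

# Hypothetical stand-in for card transactions: 12 numeric features,
# with class 1 playing the role of fraud at roughly 1% prevalence.
X, y = make_classification(
    n_samples=10_000,
    n_features=12,
    weights=[0.99, 0.01],  # heavy class imbalance, as in real fraud data
    random_state=0,
)

# X and y can now be fed to a fraud detection system under test, to check
# that it flags the rare positive class without drowning in false alarms.
```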