What is synthetic data?

Synthetic data is data that is created artificially, through computer simulation or generative algorithms, to take the place of real-world data. It can be used as an alternative or supplement to real-world data when real-world data is not readily available, and it can also aid in data science experiments.

This new data can serve as a placeholder for test data sets and is increasingly used to train machine learning models because of its data privacy benefits. One example is healthcare, where synthetic data is used to protect patient data and enhance clinical trials. The healthcare sector's interest stems from the compliance regulations surrounding patient data. The Health Insurance Portability and Accountability Act (HIPAA) is a US federal law that protects individuals' health information from being disclosed without their consent. AI-generated synthetic data helps organizations work within such restrictions because it does not expose real patient records.

While the data is artificial, synthetic data reflects real-world events on a mathematical and statistical basis. The technique is gaining in popularity in the further development of deep learning and many other use cases. 

Gartner, a market research firm, predicts that by 2024, 60% of the data used in training AI models will be synthetically generated.

How does synthetic data work?

Synthetic data is created programmatically with machine learning techniques to mirror the statistical properties of real-world data. It can be generated in many ways, with practically no limit on size, time, or location.

Synthetic data can be modeled on real data collected from actual events, objects, or people, or it can be produced by computer simulations or algorithms. One way to generate it is with data generation tools, some of which are open source and others commercially available. With such a tool, data scientists model the patterns already present in the real-world data and use that model to produce a new dataset.

One example is the Synthetic Data Vault (SDV), which was developed at MIT. It is a synthetic data generation ecosystem of libraries “that allows users to easily learn single-table, multi-table and timeseries datasets to later on generate new Synthetic Data that has the same format and statistical properties as the original dataset,” according to SDV.
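
The learn-then-generate idea can be illustrated with a minimal sketch. It does not use SDV or any other real tool. It assumes a toy two-column table (a hypothetical age and income), fits a bivariate Gaussian to it, and samples new rows with the same means, spreads, and correlation. Real tools fit far richer models; all names and numbers here are invented for illustration.

```python
import math
import random
import statistics


def fit_gaussian_2d(rows):
    """Estimate means, standard deviations and correlation of two numeric columns."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in rows) / (len(rows) - 1)
    return {"mx": mx, "my": my, "sx": sx, "sy": sy, "rho": cov / (sx * sy)}


def sample_synthetic(params, n, rng):
    """Draw n new rows from the fitted bivariate Gaussian."""
    rho = params["rho"]
    # Guard against tiny negative values caused by floating-point rounding.
    residual = math.sqrt(max(0.0, 1.0 - rho * rho))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = params["mx"] + params["sx"] * z1
        # Mixing z1 into the second column reproduces the fitted correlation.
        y = params["my"] + params["sy"] * (rho * z1 + residual * z2)
        rows.append((x, y))
    return rows


# Hypothetical stand-in for a "real" table: income loosely tied to age.
rng = random.Random(0)
real_rows = []
for _ in range(2000):
    age = rng.gauss(40, 10)
    income = 1000 * age + rng.gauss(0, 5000)
    real_rows.append((age, income))

params = fit_gaussian_2d(real_rows)
synthetic_rows = sample_synthetic(params, 5000, rng)
synthetic_params = fit_gaussian_2d(synthetic_rows)
```

Refitting the model on `synthetic_rows` should give parameters close to `params`, which is a simple check that the statistical properties carried over.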

Below are the different types of synthetic data creation methods: 

Variational autoencoders (VAEs): VAEs are generative models in which encoder-decoder network pairs are trained to reconstruct training data distributions in such a way that the latent space of the encoder network is smooth.

Generative adversarial networks (GANs): The GAN was introduced by Ian Goodfellow and colleagues to create fake images that resemble real ones. GANs have wide applicability in model training because they can generate realistic, highly detailed representations.

A GAN is a machine learning architecture built from two neural networks. The objective of the generator network is to create fake output: in the example of flowers, it takes random noise as input and produces artificial flowers as output. The second network, the discriminator, is shown both real and generated flowers and tries to tell them apart, which pushes the generator to produce more realistic output.
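
The contest between the two networks is usually written as a minimax game. The formula below is the standard objective from the original GAN formulation rather than something stated in this article. Here G is the generator, D is the second network (the discriminator, which scores how likely a sample is to be real), x is a real sample, and z is random noise fed to the generator.

```latex
\min_G \max_D \; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

The discriminator tries to maximize this value by scoring real samples high and generated samples low, while the generator tries to minimize it by producing samples that the discriminator scores as real.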

Synthetic data vs. data augmentation vs. data anonymization

The prevalence of synthetic data is rather new and is not to be confused with data augmentation or data anonymization. Let's take a closer look at the differences between these terms.

Data augmentation is a technique that makes minor changes to the original data to create modified copies. The purpose is to increase the size of the data set artificially. One common use is image augmentation, where operations such as blurring, brightening, or rotating create new versions of existing images or frames.
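
As a minimal sketch of image augmentation, the example below treats a grayscale image as a list of pixel rows and produces rotated, flipped, and brightened copies. The function names and the tiny sample image are hypothetical, and a real pipeline would normally rely on an image library.

```python
def rotate_90(image):
    """Rotate a row-major grayscale image 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def flip_horizontal(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def brighten(image, delta):
    """Add delta to every pixel, clamped to the 0-255 range."""
    return [[min(255, max(0, p + delta)) for p in row] for row in image]


def augment(image):
    """Return modified copies of one original image."""
    return [rotate_90(image), flip_horizontal(image), brighten(image, 40)]


# Hypothetical 2x3 grayscale image.
original = [
    [0, 50, 100],
    [150, 200, 250],
]
augmented = augment(original)
```

Every copy is derived from the same original pixels, which is what separates augmentation from generating new data.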

Data anonymization is a technique that helps you protect sensitive data, such as personally identifiable information or restricted business data, to avoid the risk of compromising confidential information. It is defined in policy rules that are enforced for a data asset. Depending on the anonymization method, data is redacted, masked, or substituted in the asset preview.
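
The three methods can be sketched as follows. This is an illustration only, not the policy engine of any product: the record, the field names, and the salt are hypothetical, and a salted hash is a simple stand-in for substitution that would need stronger safeguards in practice.

```python
import hashlib


def redact(value):
    """Remove the value entirely."""
    return "[REDACTED]"


def mask(value, visible=4, fill="*"):
    """Hide all but the last `visible` characters."""
    if len(value) <= visible:
        return fill * len(value)
    return fill * (len(value) - visible) + value[-visible:]


def substitute(value, salt):
    """Replace the value with a stable pseudonym derived from a salted hash."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return "user_" + digest[:8]


def anonymize(record, salt):
    """Apply one method per field of a hypothetical customer record."""
    return {
        "name": redact(record["name"]),
        "card": mask(record["card"]),
        "email": substitute(record["email"], salt),
    }


record = {
    "name": "Ada Example",
    "card": "4111111111111111",
    "email": "ada@example.com",
}
anonymized = anonymize(record, salt="demo-salt")
```

In each case the output is still derived from the real record, just with sensitive parts hidden or replaced.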

Unlike the above techniques, synthetic data uses machine learning to generate entirely new data rather than altering or modifying the real-world data.

Types of synthetic data

Synthetic data is growing in popularity because of its accuracy and ability to generate large training datasets to train neural networks without the hassle, effort, or cost of manual data labeling. It has a vast number of uses and there are several approaches to consider.

Here are some types of synthetic data:

  • Fully synthetic: No real data is used with this technique, although the computer program may use the characteristics of real-world data to narrow down and estimate realistic parameters. Typically, the data generator identifies the density function of the features in the real data and then estimates its parameters. The data is then randomly generated from that estimate, which provides strong privacy protection because no real record appears in the output.
  • Partially synthetic: This technique replaces only selected sensitive features with synthetic values and keeps the rest of the real data or existing unstructured data. It can be helpful when data scientists are trying to fill gaps in the original data, and it is done to preserve privacy in the newly generated data. The techniques used to generate this type of data include multiple imputation and model-based techniques.
  • Hybrid: A combination of real and synthetic data that takes random records from a real dataset and pairs them with close synthetic records. This technique has advantages of both fully and partially synthetic data. While it can provide good privacy preservation, the drawbacks are longer processing time and higher memory use.
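
The partially synthetic approach can be sketched in a few lines. This simplified example assumes a hypothetical employee table in which only the salary column is sensitive. It fits a normal distribution to the real salaries and replaces each one with a random draw, leaving the other columns untouched. Methods such as multiple imputation would also condition on the other columns, which this sketch does not do.

```python
import random
import statistics


def partially_synthesize(records, sensitive_key, rng):
    """Replace one sensitive numeric column with draws from a fitted normal."""
    values = [r[sensitive_key] for r in records]
    mu = statistics.fmean(values)
    sigma = statistics.stdev(values)
    synthetic = []
    for r in records:
        new_record = dict(r)  # copy so the real records are not modified
        new_record[sensitive_key] = round(rng.gauss(mu, sigma), 2)
        synthetic.append(new_record)
    return synthetic


# Hypothetical real records.
records = [
    {"dept": "A", "age": 34, "salary": 52000.0},
    {"dept": "A", "age": 45, "salary": 61000.0},
    {"dept": "B", "age": 29, "salary": 48000.0},
    {"dept": "B", "age": 51, "salary": 75000.0},
    {"dept": "C", "age": 38, "salary": 58000.0},
]
partial = partially_synthesize(records, "salary", random.Random(1))
```

A fully synthetic version would apply the same idea to every column, so that no real value is carried over.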
Benefits and challenges of synthetic data

It is important to look at both the compelling benefits and the challenges of synthetic data as it gains in popularity. Generating synthetic data requires highly skilled artificial intelligence (AI) specialists who understand the intricacies of how data works. Companies or organizations that wish to use synthetic data must also establish a framework to check the accuracy of their data generation projects.

  • Data quality: Unlike real-world data, synthetic data can avoid the inaccuracies or errors that occur when data is compiled in the real world. Synthetic data can provide high-quality, balanced data if it is generated with the proper variables. Artificially generated data can also fill in missing values and create labels that enable more accurate predictions for your company or business.

Data labeling is a time-consuming aspect of machine learning, and synthetic data can remove that tedious step, saving both time and cost. Because synthetic data is generated programmatically, its labels are produced along with the data and do not need to be added by hand.

  • Scalability: Using machine learning properly requires large amounts of data. It is often difficult to obtain data at the scale needed to train and test a predictive model. Synthetic data can fill in the gaps, supplementing real-world data to achieve a larger scale of inputs.

Another benefit of synthetic data is that it can be useful to gain training data for edge cases. These are events or instances that could occur infrequently but are vital to your AI model. Synthetic data’s ability to provide data for edge cases allows companies to innovate faster in different domains since they don’t have to wait around for new, rare data points to generate.

There are also some use cases that might be so new there is no real data that exists, which is where AI generated data can play a role. One example of this is preparing datasets for the potential impact of a global pandemic where real data may not already exist.

  • Ease of use: Real-world data often comes with several outside factors to consider, such as privacy, filtering out errors, and converting data so that formats match. Synthetic data, by contrast, is simpler to generate and can be produced without inaccuracies and duplicates. This helps ensure that all data has the uniform formatting and labeling that is necessary when working with large amounts of data.
  • Bias: Synthetic data can help reduce bias because it can be used to create more balanced data sets. Although synthetic data is based on real-world data, the generating models can be used to mitigate the biases that data contains.

It should be mentioned that synthetic data is not a perfect solution to bias, as synthetic data research in medicine has shown. Some cohorts of patients can be underrepresented in real-world data, and that bias can carry over into the synthetic data and the machine learning models trained on it.

  • Privacy: The privacy concerns raised by using real data are more or less eliminated when using synthetic data, which is a big benefit to companies. AI-generated data can resemble real-world data without being traceable to any one original record. The technique is being touted as a workaround for data containing personally identifiable information that wouldn’t typically be viable to use.
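
To illustrate the point made under Bias about creating more balanced data sets, here is a minimal sketch that grows an underrepresented class by interpolating between random pairs of its real points. It is a simplification in the spirit of SMOTE, which interpolates toward nearest neighbors rather than random pairs. The sample points are hypothetical.

```python
import random


def oversample_minority(minority, target_size, rng):
    """Grow a minority class to target_size with interpolated synthetic points."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    balanced = list(minority)
    while len(balanced) < target_size:
        a, b = rng.sample(minority, 2)
        t = rng.random()  # position along the line segment from a to b
        balanced.append(tuple(ai + t * (bi - ai) for ai, bi in zip(a, b)))
    return balanced


# Hypothetical minority-class feature vectors.
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.5)]
balanced = oversample_minority(minority, 10, random.Random(2))
```

Because every synthetic point lies between two real ones, this approach cannot invent cohorts that are missing from the original data, which is one reason bias can still carry over.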
Synthetic data industry use cases  
  • Healthcare providers: The use of synthetic data generated by GANs has gained much attention because of its ability to create “high-fidelity fake data,” according to The Lancet. Synthetic data has gained popularity because it could serve as a method of protecting patient privacy and enhancing clinical research without jeopardizing patients' medical records. “Synthetic data carries the ability to create fake patient records and fake medical imaging that is truly non-identifiable because the data does not relate to any real individual. In a sense, the synthetic data is a derivative of the original real data, but no synthetic datapoint can be attributed to a single real datapoint,” said The Lancet.
  • Autonomous vehicles: Companies that produce autonomous vehicles are using synthetic data to help test vehicles safely through realistic simulation. Synthetic data can be created to train autonomous vehicles to navigate a simulated parking lot and move around pedestrians. The technique is helping to revolutionize self-driving cars and could be one of the biggest reasons they make it onto the road in the real world. Traditional ways of collecting data require accidents or unfortunate road collisions to happen in real life, but with synthetic data the information can be created artificially without any accidents occurring.
  • Banking: The financial sector has found benefits in synthetic data thanks to its ability to help expose fraudulent activity on credit and debit cards. Fraudulent card payments that look and act like normal transactions can be uncovered with the use of synthetic data techniques. Synthetic data can also be used to test fraud detection systems to ensure they’re working properly and to develop new approaches to detection.
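
The banking use case can be sketched as a small test harness. It generates labeled synthetic transactions and measures how well a simple rule catches the known fraud. Everything here is an assumption made for illustration: the amount distributions, the 2% fraud rate, and the threshold rule stand in for a real detection system.

```python
import random


def make_transactions(n, fraud_rate, rng):
    """Generate synthetic transactions whose fraud labels are known by construction."""
    rows = []
    for _ in range(n):
        is_fraud = rng.random() < fraud_rate
        # Assumed toy behaviour: fraudulent payments skew toward larger amounts.
        if is_fraud:
            amount = rng.lognormvariate(6.5, 0.5)
        else:
            amount = rng.lognormvariate(3.5, 0.8)
        rows.append({"amount": round(amount, 2), "is_fraud": is_fraud})
    return rows


def flag(transaction, threshold=200.0):
    """Hypothetical detector under test: flag any amount above the threshold."""
    return transaction["amount"] > threshold


def recall_and_precision(rows):
    """Score the detector against the known labels."""
    tp = sum(1 for r in rows if flag(r) and r["is_fraud"])
    fp = sum(1 for r in rows if flag(r) and not r["is_fraud"])
    fn = sum(1 for r in rows if not flag(r) and r["is_fraud"])
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision


rows = make_transactions(10000, 0.02, random.Random(3))
recall, precision = recall_and_precision(rows)
```

Because each synthetic transaction carries its label from the moment it is generated, the detector can be scored without any manual review of real customer data.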

IBM contributions to synthetic data 

While synthetic data has grown in popularity across many different industries, its most prominent use cases within IBM include:

  • AI/Machine learning model training: Synthetic data is increasingly being used for AI model training. An example of this is synthetic images tailored for specific AI tasks. The artificial images are computer-generated to look real, yet don’t require the permissions that real-world data entails. One way of doing this is through generative models. IBM researchers, in collaboration with colleagues at Boston University, developed Task2Sim, an AI model that learns to generate fake, task-specific data for pretraining image-classification models. “The beauty of synthetic images is that you can control their parameters — the background, lighting, and the way objects are posed,” said Rogerio Feris, an IBM researcher who co-authored both papers. “You can generate unlimited training data, and you get labels for free.”
  • Language models: In a paper spotlighted by IBM at the International Conference on Learning Representations in 2022, researchers showed that “pretraining a language model on a made-up language grounded in images could make it easier to master a low-resource language like Urdu,” according to an IBM blog post. “When humans learn to talk, they associate words with visual concepts,” said Yang Zhang, an IBM researcher with the MIT-IBM Watson AI Lab. “We try to mimic that idea here.”

