Streamline and accelerate AI initiatives: 5 best practices for synthetic data use

03 September 2025

A recent report by Gartner predicts that by 2028, 80% of the data used for artificial intelligence (AI) will be synthetic. However, the same report also shows that most organizations are only just starting to consider or test the use of synthetic data.

Organizations must understand the advantages of synthetic data to fully capitalize on them. Some of these advantages include the ability to generate high-quality annotated data at scale, accelerated model development and deployment, and reduced costs associated with data collection and labeling.

The IBM Responsible Technology Board's new white paper, Unlocking AI opportunities with the responsible generation and use of synthetic data, offers a roadmap for navigating the benefits and challenges of synthetic data, from accelerating AI model development and improving data quality to identifying and mitigating potential risks.

By exploring the intersection of technology, ethics and governance, this paper provides insights and best practices for organizations that seek to harness the full potential of synthetic data in their AI initiatives.

What is synthetic data and what are its applications?

Leaders in various industries are competing to drive innovation and create value with AI. However, only 25% of AI initiatives currently achieve their expected return on investment (ROI). This AI innovation race highlights the possible limitations of relying on real-world data to train AI systems.

Real-world data can be difficult to obtain, might not be diverse enough and is often expensive. It can be challenging to develop balanced and cost-effective AI models based on real-world data.

Here is where synthetic data shines. In simple terms, synthetic data is data that is artificially generated to resemble real-world data. It helps reduce the risks associated with real-world data, such as inaccuracies, data gaps and potential privacy concerns.

Also, synthetic data can help streamline the resource-intensive process of collecting, cleaning and annotating real-world data. As a result, using synthetic data can accelerate the development of AI models, improve their accuracy and enhance overall data-driven decision-making.

Synthetic data has far-reaching applications across multiple industries, offering a versatile solution for a wide range of use cases. In the insurance sector, for instance, synthetic data can help companies detect and prevent fraudulent claims by simulating complex scenarios that might not be well-represented in real-world data.

By generating synthetic data that reflects the nuances of real-world claims, insurers can train AI models to better identify patterns and anomalies that can indicate fraud. This improvement can lead to more accurate claims processing, reduced financial losses and improved customer experiences.

Beyond insurance, synthetic data can also be used to enhance AI safety training and improve cybersecurity defenses. Research organizations can use synthetic data to generate high-risk scenarios, allowing them to train AI models to respond to safety threats and fine-tune their performance.

Financial institutions can use synthetic data to simulate complex transactions and identify potential vulnerabilities in their systems, enabling them to develop more robust defenses against cyberthreats. In each of these cases, synthetic data can help organizations overcome data limitations and unlock new opportunities.

Addressing synthetic data challenges and risks

Despite its promise, generating and using synthetic data can introduce and amplify certain risks within the AI lifecycle. The IBM Responsible Technology Board explores these risks and their potential mitigations in detail in the white paper, Unlocking AI opportunities with the responsible generation and use of synthetic data. These risks include:

  • Temporal gap: This refers to the discrepancy between the static nature of synthetic data, which is generated at a specific point in time, and the dynamic nature of real-world data. Temporal gaps can make synthetic data outdated or obsolete, leading to mismatches between current realities and the assumptions embedded in a synthetic data set. One potential way to mitigate this risk is to regularly regenerate the synthetic data set. For example, this can be done by incorporating retrieval-augmented generation (RAG) into the synthetic data generation process to capture more up-to-date data.
  • Data bias: Biases present in a synthetic data set, including the ones inherited from seeded real-world data or exacerbated by the generation methods used, can influence the training and fine-tuning of the AI model. Biases in unrepresentative synthetic data can be carried forward into AI models trained on that data, and potentially into more synthetic data sets generated from such models. One potential mitigation is to test the data and models with tools such as AI Fairness 360.
  • Data privacy: Sometimes, synthetic data can be reverse-engineered to reveal information about the underlying real-world seed data or the process used to generate it, potentially enabling reidentification of individuals or their personal information. To mitigate potential risks to data privacy, practitioners can implement robust data anonymization techniques to protect sensitive information in the original seed data. Also, practitioners can avoid generating synthetic data that contains any real personally identifiable information (PII) or sensitive personally identifiable information (SPII). They can instead consider using statistical representations to simulate real-world patterns without identifying individuals.
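
The idea of using statistical representations instead of individual records can be illustrated with a minimal Python sketch. It is not taken from the white paper: the function names, the sample claim amounts and the choice of a normal distribution are all assumptions made for illustration. Aggregate statistics alone do not guarantee privacy, so a sketch like this would complement anonymization of the seed data rather than replace it.

```python
import random
import statistics

def fit_summary(values):
    """Reduce seed values to aggregate statistics; no individual row is kept."""
    return {"mean": statistics.fmean(values), "stdev": statistics.stdev(values)}

def sample_synthetic(summary, n, seed=0):
    """Draw n synthetic values from a normal distribution (an assumed model)."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    return [rng.gauss(summary["mean"], summary["stdev"]) for _ in range(n)]

# Hypothetical seed data: claim amounts that would normally be sensitive.
seed_claim_amounts = [1200.0, 950.0, 1830.0, 1410.0, 1075.0, 2210.0]

summary = fit_summary(seed_claim_amounts)
synthetic_claims = sample_synthetic(summary, n=1000)
```

Only `summary` is passed to the generator, so the synthetic values follow the overall pattern of the seed data without reproducing any single seed record.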

By acknowledging and addressing these risks, we can create a foundation for responsible synthetic data practices. This foundation, once established, can unlock the full potential of synthetic data to drive business value, improve outcomes and continue advancing the field of AI.
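
As one concrete way to address the data bias risk, a synthetic data set can be checked for uneven outcomes across groups before it is used for training. The sketch below computes a disparate impact ratio in plain Python. It does not use the AI Fairness 360 API; the field names, the sample records and the 0.8 review threshold (a common rule of thumb) are assumptions for illustration.

```python
from collections import defaultdict

def favorable_rates(records, group_key, label_key, favorable=1):
    """Return the share of favorable outcomes for each group."""
    totals = defaultdict(int)
    favorable_counts = defaultdict(int)
    for record in records:
        totals[record[group_key]] += 1
        if record[label_key] == favorable:
            favorable_counts[record[group_key]] += 1
    return {group: favorable_counts[group] / totals[group] for group in totals}

def disparate_impact(records, group_key, label_key, unprivileged, privileged):
    """Ratio of favorable rates; values well below 1.0 suggest possible bias."""
    rates = favorable_rates(records, group_key, label_key)
    return rates[unprivileged] / rates[privileged]

# Hypothetical synthetic records: group A is approved 6 of 10 times, group B 3 of 10.
synthetic_applicants = (
    [{"group": "A", "approved": 1}] * 6 + [{"group": "A", "approved": 0}] * 4
    + [{"group": "B", "approved": 1}] * 3 + [{"group": "B", "approved": 0}] * 7
)

ratio = disparate_impact(
    synthetic_applicants, "group", "approved", unprivileged="B", privileged="A"
)
needs_review = ratio < 0.8  # assumed threshold, not a value from the white paper
```

A low ratio does not prove the data is unusable, but it flags the data set for closer review before the bias is carried forward into a model.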

5 best practices for responsibly generating and using synthetic data

Embracing the opportunity of synthetic data requires a thoughtful approach to its generation and use. By adopting best practices, organizations can maximize the benefits of synthetic data, enable its safe and effective use and drive innovation forward.

The white paper, Unlocking AI opportunities with the responsible generation and use of synthetic data, outlines five best practices for generating and using synthetic data, which help balance innovation with responsibility. They are:

  1. Consider the specific context of use and domain requirements for your synthetic data, including the type of AI model you’re training, the industry you’re in and the intended applications. This approach helps you determine the right type and quality of synthetic data needed to achieve your objectives.
  2. Collaborate with domain experts and use domain-specific data generation methods. This will help you generate synthetic data that more accurately reflects real-world scenarios, patterns and edge cases. By leveraging their expertise, you can create more effective and relevant synthetic data.
  3. Evaluate and validate synthetic data by using multiple metrics that assess quality, accuracy and relevance. This process includes evaluating statistical properties, data distribution and task-specific utility. By using multiple metrics, you can better validate that your synthetic data is reliable and effective for your intended use.
  4. Maintain documentation and version control of your synthetic data generation process, including the methods used, assumptions made and decisions taken. Version control tracks changes to the data and enables collaboration among stakeholders. This approach helps keep synthetic data transparent, reproducible and trustworthy.
  5. Update and refine synthetic data to support data integrity and relevance over time. This process includes updating the data to reflect changes in the real world, refining the data to improve its quality, and adapting to new requirements and use cases. By doing so, you can enable your synthetic data to continue to support your AI initiatives and drive meaningful outcomes.
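
The third practice, evaluating synthetic data with multiple metrics, can be sketched as a small report that covers statistical properties, data distribution and task-specific utility. This is an illustrative sketch under stated assumptions, not the white paper's method: the metric choices (mean and standard deviation gaps, a two-sample Kolmogorov-Smirnov statistic and a "train on synthetic, test on real" accuracy), the threshold classifier and the sample rows are all hypothetical.

```python
import bisect
import statistics

def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of two samples (0 means identical)."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    for x in a_sorted + b_sorted:
        cdf_a = bisect.bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect.bisect_right(b_sorted, x) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

def fit_threshold(rows):
    """Toy classifier: midpoint between the two class means of the feature."""
    positives = [x for x, y in rows if y == 1]
    negatives = [x for x, y in rows if y == 0]
    return (statistics.fmean(positives) + statistics.fmean(negatives)) / 2

def accuracy(threshold, rows):
    """Share of rows where 'feature above threshold' matches the label."""
    return sum((x > threshold) == (y == 1) for x, y in rows) / len(rows)

def validation_report(real_rows, synthetic_rows):
    real_x = [x for x, _ in real_rows]
    synth_x = [x for x, _ in synthetic_rows]
    return {
        # Statistical properties
        "mean_gap": abs(statistics.fmean(real_x) - statistics.fmean(synth_x)),
        "stdev_gap": abs(statistics.stdev(real_x) - statistics.stdev(synth_x)),
        # Data distribution
        "ks": ks_statistic(real_x, synth_x),
        # Task-specific utility: train on synthetic, test on real
        "tstr_accuracy": accuracy(fit_threshold(synthetic_rows), real_rows),
    }

# Hypothetical (feature, label) rows.
real_rows = [(1.0, 0), (2.0, 0), (3.0, 0), (7.0, 1), (8.0, 1), (9.0, 1)]
synthetic_rows = [(1.5, 0), (2.5, 0), (3.5, 0), (6.5, 1), (7.5, 1), (8.5, 1)]

report = validation_report(real_rows, synthetic_rows)
```

Reporting several metrics side by side matters because a data set can match the real mean closely and still differ in distribution shape or perform poorly on the downstream task.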

These best practices provide a framework for responsibly leveraging the potential of synthetic data, whether your organization is already experienced in using synthetic data or is just starting to explore its possibilities.
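
For the fourth and fifth practices, documentation, version control and regular updates, one lightweight option is to store a provenance record next to each generated data set. The sketch below is a hypothetical example: the `GenerationRecord` fields, the content fingerprint and the 180-day refresh window are assumptions, not requirements from the white paper.

```python
import hashlib
import json
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class GenerationRecord:
    """Documents how and when one version of a synthetic data set was produced."""
    dataset_version: str
    generated_on: date
    method: str
    assumptions: tuple
    content_sha256: str

def fingerprint(rows):
    """Stable hash of the data, so any change to the rows is detectable."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def is_stale(record, today, max_age_days):
    """True when the data set is older than the agreed refresh window."""
    return (today - record.generated_on).days > max_age_days

# Hypothetical data set and record.
rows = [{"claim_id": 1, "amount": 1200.0}, {"claim_id": 2, "amount": 950.0}]
record = GenerationRecord(
    dataset_version="1.0.0",
    generated_on=date(2025, 1, 1),
    method="sampling from fitted summary statistics",
    assumptions=("claim amounts are roughly normal",),
    content_sha256=fingerprint(rows),
)

due_for_refresh = is_stale(record, today=date(2025, 9, 3), max_age_days=180)
```

Committing the record alongside the generation code keeps the methods, assumptions and data version traceable, and the staleness check gives a simple trigger for regenerating the data.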

Maximizing the potential of synthetic data

At IBM, we’re committed to helping advance responsible AI. As synthetic data becomes increasingly prominent in AI model training and development, its use could surpass that of real-world data. It is essential to responsibly address the unique risks associated with synthetic data to unlock its potential for driving innovation, improving outcomes and creating value.

As with the IBM Responsible Technology Board's previous white papers on agentic AI and foundation models, our goal with Unlocking AI opportunities with the responsible generation and use of synthetic data is to offer a comprehensive understanding of the opportunities, challenges and best practices related to synthetic data. By doing so, we aim to empower practitioners to responsibly harness the potential of synthetic data as they accelerate AI innovation.

Author

IBM Responsible Technology Board
