Examining synthetic data: The promise, risks and realities

As artificial intelligence reshapes industries worldwide, developers are grappling with an unexpected challenge: a shortage of high-quality, real-world data to train their increasingly sophisticated models. Now, a potential solution is emerging from an unlikely source—data that doesn’t exist in reality at all.

Synthetic data, artificially generated information designed to mimic real-world scenarios, is rapidly gaining traction in AI development. It promises to overcome data bottlenecks, address privacy concerns, and reduce costs. However, as the field evolves, questions about its limitations and real-world impact are coming to the fore.

The rise of synthetic data

Tech giants are betting big on synthetic data. NVIDIA recently announced Nemotron-4 340B, a family of open models designed to generate synthetic data for training large language models (LLMs) across various industries. This move addresses a critical challenge in AI development: the prohibitively high cost and difficulty of accessing robust datasets.

“High-quality training data plays a critical role in the performance, accuracy and quality of responses from a custom LLM,” NVIDIA wrote on its blog. The Nemotron-4 340B family includes base, instruct and reward models that form a pipeline for generating and refining synthetic data, potentially accelerating the development of powerful, domain-specific LLMs.
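
Conceptually, a generate-and-refine pipeline of this kind pairs a generator with a reward model that scores and filters its output. The sketch below is a minimal illustration of that loop, assuming hypothetical instruct_model and reward_model objects with generate and score methods; it is not NVIDIA's actual Nemotron API.

```python
# Minimal sketch of a generate-then-filter synthetic data pipeline, in the
# spirit of pairing an instruct model with a reward model. The model objects
# and their generate/score methods are hypothetical stand-ins, not NVIDIA's API.

def build_synthetic_dataset(instruct_model, reward_model, prompts,
                            samples_per_prompt=4, min_score=0.8):
    """Generate candidate responses and keep only those the reward model rates highly."""
    dataset = []
    for prompt in prompts:
        for _ in range(samples_per_prompt):
            response = instruct_model.generate(prompt)    # candidate synthetic example
            score = reward_model.score(prompt, response)  # estimated quality in [0, 1]
            if score >= min_score:                        # filter out weak generations
                dataset.append({"prompt": prompt, "response": response})
    return dataset
```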

IBM researcher Akash Srivastava explains that in the context of large language models, synthetic data is often generated by one AI model to train or customize another. “Researchers and developers in the industry are using these models to generate data for particular target tasks,” Srivastava notes.

Investigators from MIT-IBM Watson AI Lab and IBM Research recently introduced a new approach to improving LLMs using synthetic data. The method, called LAB (Large-scale Alignment for chatBots), aims to reduce reliance on human annotations and proprietary AI models like GPT-4.

LAB employs a taxonomy-guided synthetic data generation process and a multi-phase training framework. The researchers report, “LAB-trained models can achieve competitive performance across several benchmarks compared to models trained with traditional human-annotated or GPT-4 generated synthetic data.”
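
In broad strokes, taxonomy-guided generation walks a curated tree of skills and knowledge areas and prompts a "teacher" model with seed examples from each leaf. The sketch below is a loose, hypothetical illustration of that idea, not the LAB paper's actual implementation; the taxonomy contents and the teacher's generate method are stand-ins.

```python
# Loose sketch of taxonomy-guided synthetic data generation in the spirit
# of LAB. The taxonomy contents and the teacher model's generate method are
# illustrative assumptions, not the paper's exact implementation.

taxonomy = {
    "writing": {"seed_examples": ["Summarize this memo: ...", "Draft an email that ..."]},
    "coding":  {"seed_examples": ["Write a function that ...", "Explain this bug: ..."]},
}

def generate_from_taxonomy(teacher, taxonomy, per_leaf=100):
    synthetic = []
    for skill, leaf in taxonomy.items():
        for _ in range(per_leaf):
            # Condition the teacher on curated seed examples for this skill
            # so generations stay anchored to the intended task distribution.
            prompt = (
                f"You are generating training data for the skill '{skill}'.\n"
                "Examples:\n" + "\n".join(leaf["seed_examples"]) + "\n"
                "Produce one new instruction and an ideal response."
            )
            synthetic.append(teacher.generate(prompt))
    return synthetic
```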

To demonstrate LAB’s effectiveness, the team created two models, LABRADORITE-13B and MERLINITE-7B, which reportedly outperformed other fine-tuned versions of the same base models on several key metrics. The researchers used the open-source Mixtral model to generate synthetic training data, potentially offering a more cost-effective approach to enhancing LLMs.

The quality of synthetic data is crucial for its effectiveness. Raul Salles de Padua, Director of Engineering, AI and Quantum at Multiverse Computing, explains, “The fidelity of synthetic data is calculated by comparing it to real-world data through statistical and analytical tests. This includes an assessment of how well the synthetic data preserves key statistical properties, such as means, variances and correlations between variables.”
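
One way to operationalize such fidelity checks is to compare summary statistics and marginal distributions directly. The sketch below is illustrative rather than de Padua's specific methodology; it assumes real and synthetic data arrive as numeric arrays of shape (samples, features).

```python
import numpy as np
from scipy import stats

def fidelity_report(real, synthetic):
    """Compare key statistics of real vs. synthetic data.

    Both inputs are arrays of shape (n_samples, n_features).
    """
    return {
        # Per-feature gaps in means and variances.
        "mean_gap": np.abs(real.mean(axis=0) - synthetic.mean(axis=0)),
        "var_gap": np.abs(real.var(axis=0) - synthetic.var(axis=0)),
        # Largest discrepancy between the two feature-correlation matrices.
        "max_corr_gap": np.abs(np.corrcoef(real, rowvar=False)
                               - np.corrcoef(synthetic, rowvar=False)).max(),
        # Two-sample Kolmogorov-Smirnov test per feature; small p-values
        # suggest the marginal distributions differ.
        "ks_pvalues": [stats.ks_2samp(real[:, j], synthetic[:, j]).pvalue
                       for j in range(real.shape[1])],
    }

# Example with toy data: the synthetic set's inflated scale shows up
# in the variance gap and in low KS p-values.
rng = np.random.default_rng(0)
real = rng.normal(size=(1000, 3))
synthetic = rng.normal(scale=1.2, size=(1000, 3))
print(fidelity_report(real, synthetic))
```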

Despite its promise, synthetic data isn’t without challenges. De Padua points out, “The challenge with synthetic data is in creating data that is both useful and privacy-preserving. Without putting these safeguards in place, synthetic data could reveal personal details, potentially leading to identity theft, discrimination or other privacy violations.”

Research has also uncovered pitfalls in relying too heavily on synthetic data. A study published in Nature documented a phenomenon called “model collapse”: when AI models are repeatedly trained on AI-generated text, their outputs can become increasingly nonsensical. The finding raises concerns about the long-term viability of synthetic data, especially as AI-generated content becomes more prevalent online.
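
The mechanism behind model collapse can be illustrated without any language model at all: fit a simple distribution to data, sample from the fit, refit, and repeat. The toy simulation below (not the Nature study's setup) shows how estimation error compounds across generations, with the fitted spread tending to drift downward and the tails of the original distribution vanishing first.

```python
import numpy as np

# Toy illustration of model collapse: fit a Gaussian "model" to data, then
# repeatedly refit it to samples drawn from the previous fit. Estimation
# error compounds, and the fitted standard deviation tends to drift down.
rng = np.random.default_rng(0)
mu, sigma = 0.0, 1.0                              # generation 0: the real data
for generation in range(1, 11):
    samples = rng.normal(mu, sigma, size=100)     # "train" on the previous model's output
    mu, sigma = samples.mean(), samples.std()     # refit the model to synthetic samples
    print(f"generation {generation}: mu={mu:+.3f}, sigma={sigma:.3f}")
```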

Ethical considerations also loom large. De Padua warns of the “risk of the synthetic data not accurately representing the diversity of the real-world population, producing potential bias in models that fail to perform equitably across different demographic groups.”

The future of AI training

In critical applications like healthcare and autonomous vehicles, synthetic data can play a vital role. De Padua notes, “In healthcare, synthetic data can supplement real datasets, providing a wider range of scenarios for training models, leading to better diagnostic and predictive capabilities.” For autonomous vehicles, he adds, “By using synthetic data for augmentation, models can be exposed to a wider range of conditions and edge cases that might not be present in the original dataset.”
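
As a toy illustration of that kind of augmentation, the sketch below applies synthetic fog and sensor noise to a camera frame, assuming images are float arrays with values in the range 0 to 1. Production systems typically rely on full 3D simulators rather than simple pixel-level transforms like these.

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_fog(image, density=0.5):
    """Blend the image toward a near-white haze to mimic foggy conditions."""
    haze = np.full_like(image, 0.9)
    return (1.0 - density) * image + density * haze

def simulate_sensor_noise(image, sigma=0.05):
    """Add Gaussian noise to mimic low-light camera sensor behavior."""
    return np.clip(image + rng.normal(0.0, sigma, image.shape), 0.0, 1.0)

# Stand-in for a real camera frame: float values in [0, 1], shape (H, W, 3).
frame = rng.random((64, 64, 3))
augmented = [simulate_fog(frame, d) for d in (0.3, 0.6)] + [simulate_sensor_noise(frame)]
```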

Looking to the future, de Padua believes synthetic data will likely supplement rather than replace real-world data in AI training. “The accuracy and representativeness of synthetic data are crucial. Technological advances in data generation algorithms will play a significant role in increasing the reliability of synthetic data,” he explains.

As AI increasingly integrates into our daily lives, from healthcare diagnostics to self-driving cars, the balance between synthetic and real-world data in AI training will be crucial. The challenge for AI developers moving forward will be to harness the benefits of synthetic data while mitigating its risks.

“We’re at a critical juncture in AI development,” says Srivastava. “Getting the balance right between synthetic and real-world data will determine the future of AI—its capabilities, limitations and, ultimately, its impact on society.”

Author

Sascha Brodsky, Staff Writer, IBM
