Date: 2024-08-20

Tech giants are betting big on synthetic data. NVIDIA recently announced Nemotron-4 340B, a family of open models designed to generate synthetic data for training large language models (LLMs) across various industries. This move addresses a critical challenge in AI development: the prohibitively high cost and difficulty of accessing robust datasets.

“High-quality training data plays a critical role in the performance, accuracy and quality of responses from a custom LLM,” NVIDIA wrote on its blog. The Nemotron-4 340B family includes base, instruct and reward models that form a pipeline for generating and refining synthetic data, potentially accelerating the development of powerful, domain-specific LLMs.
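The generate-then-refine idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `generate_and_filter`, `toy_generate` and `toy_score` are hypothetical stand-ins invented here, not part of any NVIDIA API, and a real pipeline would call an instruct model to write responses and a reward model to grade them.

```python
from typing import Callable, List, Tuple

def generate_and_filter(
    prompts: List[str],
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    threshold: float,
) -> List[Tuple[str, str, float]]:
    """Keep only (prompt, response, score) triples whose score meets the threshold."""
    kept = []
    for prompt in prompts:
        response = generate(prompt)      # stand-in for an instruct model
        quality = score(prompt, response)  # stand-in for a reward model
        if quality >= threshold:
            kept.append((prompt, response, quality))
    return kept

# Hypothetical toy stand-ins so the sketch runs without any model.
def toy_generate(prompt: str) -> str:
    # Pretend the generator fails (returns nothing) on very short prompts.
    return f"Answer to: {prompt}" if len(prompt) > 10 else ""

def toy_score(prompt: str, response: str) -> float:
    # Pretend the reward model only checks that a response exists.
    return 1.0 if response else 0.0

prompts = ["Explain what synthetic data is.", "Hi"]
dataset = generate_and_filter(prompts, toy_generate, toy_score, threshold=0.5)
print(len(dataset), "of", len(prompts), "examples kept")
```

The design point is the separation of roles: one component proposes examples and another decides which ones are good enough to enter the training set.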

IBM researcher Akash Srivastava explains that in the context of large language models, synthetic data is often generated by one AI model to train or customize another. “Researchers and developers in the industry are using these models to generate data for particular target tasks,” Srivastava notes.

Researchers from the MIT-IBM Watson AI Lab and IBM Research recently introduced a new approach to improving LLMs using synthetic data. The method, called LAB (Large-scale Alignment for chatBots), aims to reduce reliance on human annotations and proprietary AI models like GPT-4.

LAB employs a taxonomy-guided synthetic data generation process and a multi-phase training framework. The researchers report, “LAB-trained models can achieve competitive performance across several benchmarks compared to models trained with traditional human-annotated or GPT-4 generated synthetic data.”

To demonstrate LAB’s effectiveness, the team created two models, LABRADORITE-13B and MERLINITE-7B, which reportedly outperformed other fine-tuned versions of the same base models on several key metrics. The researchers used the open-source Mixtral model to generate synthetic training data, potentially offering a more cost-effective approach to enhancing LLMs.

The quality of synthetic data is crucial for its effectiveness. Raul Salles de Padua, Director of Engineering, AI and Quantum at Multiverse Computing, explains, “The fidelity of synthetic data is calculated by comparing it to real-world data through statistical and analytical tests. This includes an assessment of how well the synthetic data preserves key statistical properties, such as means, variances and correlations between variables.”
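A fidelity check of the kind de Padua describes might look like the following minimal sketch. It is an assumption-laden illustration, not his or Multiverse Computing's method: the function names, the two-column toy tables and the choice of absolute gaps as the comparison are all invented here, and real evaluations typically add further statistical tests.

```python
import math
import random
from statistics import fmean, pvariance

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def fidelity_report(real, synthetic):
    """Absolute gaps in means, population variances and pairwise correlations.

    `real` and `synthetic` map column names to lists of numbers.
    Smaller gaps mean the synthetic table tracks the real one more closely.
    """
    report = {}
    cols = sorted(real)
    for c in cols:
        report[f"mean_gap[{c}]"] = abs(fmean(real[c]) - fmean(synthetic[c]))
        report[f"var_gap[{c}]"] = abs(pvariance(real[c]) - pvariance(synthetic[c]))
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            gap = pearson(real[a], real[b]) - pearson(synthetic[a], synthetic[b])
            report[f"corr_gap[{a},{b}]"] = abs(gap)
    return report

def make_table(rng, n, slope):
    """Toy data: y depends linearly on x (with noise) when slope is non-zero."""
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [slope * xi + rng.gauss(0, 1) for xi in x]
    return {"x": x, "y": y}

rng = random.Random(0)
real = make_table(rng, 2000, slope=0.8)
good = make_table(rng, 2000, slope=0.8)  # same process as the "real" data
bad = make_table(rng, 2000, slope=0.0)   # similar means, but the x-y link is lost

report_good = fidelity_report(real, good)
report_bad = fidelity_report(real, bad)
print("good corr gap:", round(report_good["corr_gap[x,y]"], 3))
print("bad corr gap:", round(report_bad["corr_gap[x,y]"], 3))
```

In this toy setup the "bad" table matches the real one on column means yet shows a much larger correlation gap, which is why checking relationships between variables matters as well as per-column summaries.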

Despite its promise, synthetic data isn’t without challenges. De Padua points out, “The challenge with synthetic data is in creating data that is both useful and privacy-preserving. Without putting these safeguards in place, synthetic data could reveal personal details, potentially leading to identity theft, discrimination or other privacy violations.”

Research has also uncovered potential pitfalls in relying too heavily on synthetic data. A recent study published in Nature described a phenomenon called “model collapse”: when AI models are repeatedly trained on AI-generated text, their outputs can become increasingly nonsensical. The finding raises concerns about the long-term viability of synthetic data, especially as AI-generated content becomes more prevalent online.
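The intuition behind training on one's own output can be shown with a deliberately simple toy, which is an assumption of this sketch and not the study's experiment: a "model" that is just a normal distribution, refitted each generation to a small sample drawn from the previous generation's fit. The function name and parameters are invented for illustration.

```python
import random
from statistics import fmean, pstdev

def simulate_collapse(generations=300, sample_size=20, seed=0):
    """Refit a normal distribution to its own samples, generation after generation.

    Returns the fitted standard deviation at each generation, starting with
    the original value of 1.0.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0: the "real" data distribution
    history = [sigma]
    for _ in range(generations):
        # Each generation sees only data produced by the previous model.
        samples = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        mu, sigma = fmean(samples), pstdev(samples)
        history.append(sigma)
    return history

history = simulate_collapse()
print("spread at start:", history[0])
print("spread at end:", history[-1])
```

Because every refit works from a finite sample, estimation errors compound and the fitted spread tends to shrink over generations, so later "models" cover less and less of the original distribution. Language models are far more complex, but the toy shows how information about rare cases can be lost when no fresh real data enters the loop.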

Ethical considerations also loom large. De Padua warns of the “risk of the synthetic data not accurately representing the diversity of the real-world population, producing potential bias in models that fail to perform equitably across different demographic groups.”