Date: 2024-10-14

A common adage in AI development and computer science is that an artificial intelligence (AI) model is only as good as the data it was trained on. In recent years, researchers have found that generative models trained solely on their predecessors’ output produce increasingly inaccurate results. These models, beset by “irreversible defects,” eventually become useless.1 This happens because any errors a model introduces during fitting appear in its output, and that output is then included in the training data of its successor. The new model, in turn, adds errors of its own. Model collapse progresses as these errors compound over successive generations.2

These errors occur because generative AI models produce datasets with less variation than the original data distributions. Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao and a team of fellow researchers based at British and Canadian universities authored a widely cited report on model collapse. Through experiments on AI models, the team found that models trained on AI-generated data, also known as synthetic data, initially lost information from the tails, or extremes, of the true data distribution. They called this stage “early model collapse.” In later model iterations, the data distribution converged so much that it looked almost nothing like the original data, a stage the researchers termed “late model collapse.”3
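The loss of variation described above can be illustrated with a toy simulation. The sketch below is not the experimental setup from the cited report. It assumes a deliberately simple "model" (a one-dimensional Gaussian fitted by maximum likelihood), and the function name `simulate_collapse` and its parameters are hypothetical. Each generation is fitted only to samples drawn from the previous generation's fit.

```python
import random
import statistics


def simulate_collapse(generations=200, sample_size=10, seed=0):
    """Repeatedly fit a Gaussian to samples drawn from the previous fit.

    Returns the fitted standard deviation of each generation.
    Illustrative assumption: the "model" is just a mean and a spread.
    """
    rng = random.Random(seed)
    # Generation 0 trains on "real" data from a standard normal distribution.
    data = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
    stds = []
    for _ in range(generations):
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data)  # maximum-likelihood spread estimate
        stds.append(sigma)
        # The next generation sees only synthetic samples from this fit.
        data = [rng.gauss(mu, sigma) for _ in range(sample_size)]
    return stds


if __name__ == "__main__":
    stds = simulate_collapse()
    for generation in (0, 50, 100, 199):
        print(f"generation {generation:3d}: fitted std = {stds[generation]:.3g}")
```

With a small sample per generation, the fitted spread tends to drift toward zero: rare values in the tails are undersampled first, and each refit narrows the distribution a little further. Larger samples slow the drift in this toy setting but do not remove it.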

In real-world scenarios, model collapse might happen due to the training processes used for large generative AI models, such as large language models (LLMs). LLMs are mostly trained on human-generated data scraped from the Internet. However, as AI-generated content proliferates across the web, more of it might be used to train future models in place of human-generated data, potentially precipitating model collapse.

Model collapse has serious ramifications for AI development, leading researchers to propose several solutions. These include tracking data provenance, preserving access to original data sources, and combining accumulated AI-generated data with real data to train AI models.
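The last of these proposals, accumulating synthetic data alongside real data instead of replacing it, can be contrasted with pure replacement in a small sketch. It again assumes a toy one-dimensional Gaussian "model", and `final_std` and its parameters are hypothetical names chosen for illustration. It is not a reproduction of any published experiment.

```python
import random
import statistics


def final_std(accumulate, generations=100, sample_size=10, seed=0):
    """Return the spread of the training pool after repeated refitting.

    accumulate=True keeps the real data and all earlier synthetic data;
    accumulate=False trains each generation only on the latest synthetic data.
    """
    rng = random.Random(seed)
    # The initial pool is "real" data from a standard normal distribution.
    pool = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
    for _ in range(generations):
        mu = statistics.fmean(pool)
        sigma = statistics.pstdev(pool)
        synthetic = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        pool = (pool + synthetic) if accumulate else synthetic
    return statistics.pstdev(pool)


if __name__ == "__main__":
    print("replace:   ", final_std(accumulate=False))
    print("accumulate:", final_std(accumulate=True))
```

In this toy setting, the replaced pool loses almost all of its variation, while the accumulated pool keeps a spread of roughly the same order as the original data, because each new batch is only a small fraction of what the next model is fitted on.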