10 October 2024
Model collapse refers to the declining performance of generative AI models that are trained on AI-generated content.
A common adage in AI development and computer science is that an artificial intelligence (AI) model is only as good as the data it was trained on. In recent years, researchers have found that generative models trained solely on their predecessors’ output produce increasingly inaccurate results. These models, beset by “irreversible defects,” eventually become useless.1 This happens because any errors present in one model’s output are carried into the training data of its successor, which then adds errors of its own. Model collapse progresses as these errors compound across successive generations.2
These errors occur because generative AI models produce datasets with less variation than original data distributions. Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao and a team of fellow researchers based at British and Canadian universities authored a widely cited report on model collapse. Through experiments on AI models, the team found that models trained on AI-generated data, also known as synthetic data, initially lost information from the tails, or extremes, of the true distribution of data—what they called “early model collapse.” In later model iterations, the data distribution converged so much that it bore almost no resemblance to the original data—what the researchers termed “late model collapse.”3
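To see the mechanism in miniature, consider the following sketch, which uses made-up numbers rather than any real model: each generation fits a simple Gaussian to samples drawn from the previous generation’s fitted Gaussian, and the estimated spread gradually collapses.

```python
# A minimal sketch (illustrative numbers, not from the cited study) of recursive
# training on synthetic data: each "generation" fits a Gaussian to samples drawn
# from the previous generation's fitted Gaussian. The fitted spread shrinks and
# the tails of the original distribution gradually disappear.
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_generations = 100, 1000

mu, sigma = 0.0, 1.0                           # the "true" human-data distribution
for gen in range(n_generations):
    data = rng.normal(mu, sigma, n_samples)    # sample from the current model
    mu, sigma = data.mean(), data.std()        # fit the next generation on it
    if gen % 200 == 0:
        print(f"generation {gen:4d}: sigma = {sigma:.4f}")

print(f"final sigma after {n_generations} generations: {sigma:.6f}")
# Typical runs end with sigma far below 1.0: extreme values become ever rarer,
# mirroring the loss of low-probability events described above.
```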
In real-world scenarios, model collapse might happen because of the training processes used for large generative AI models, such as large language models (LLMs). LLMs are mostly trained on human-generated data scraped from the Internet. However, as AI-generated content proliferates across the web, more of it is likely to end up in the training data of future models in place of human-generated data, potentially precipitating model collapse.
The phenomenon of model collapse poses serious ramifications for AI development, leading researchers to propose several solutions. Such solutions include tracking data provenance, preserving access to original data sources, and combining accumulated AI-generated data with real data to train AI models.
Generative AI models have made headlines in recent years for creating inaccurate and nonsensical outputs, also called AI hallucinations. For example, the Google Bard chatbot made an erroneous claim about the James Webb Space Telescope, and AI-generated images of humans are notorious for sporting extra fingers.
While inaccurate and nonsensical outputs are inconvenient and at times entertaining, the consequences of model collapse can also be far-reaching:
Inaccurate outputs from model collapse can have costly consequences for businesses that use AI in decision-making. Everything from customer service chatbots to AI-powered medical diagnostic tools might be affected. Imagine, for instance, an AI diagnostic model that fails to correctly diagnose a patient as having a rare disease because the low-probability condition was gradually forgotten and removed from training datasets in previous model generations.
Under model collapse, models might discard outlying data points related to real human interactions and preferences. As a result, users seeking less popular or unique content could find themselves disappointed with model outputs.4 Consider, for instance, an AI recommendation system for online shoppers: if a consumer prefers lime green shoes, but the system continuously recommends black and white ones because they are top sellers, the consumer might be inclined to seek help elsewhere.
If widely used AI systems undergoing model collapse perpetually produce narrower outputs, “long-tail” ideas might eventually fade out of the public’s consciousness, limiting the scope of human knowledge and exacerbating common biases in society.5 For example, scientists today can turn to AI-powered research tools for studies to inform their research. However, tools affected by model collapse might provide only widely cited studies for review, potentially depriving users of key information that could lead to important discoveries.
Different types of generative AI models are vulnerable to model collapse in different ways.
In LLMs, model collapse can manifest in increasingly irrelevant, nonsensical and repetitive text outputs. In one experiment, researchers fine-tuned OPT-125M, an open source large language model released by Meta. Generations of the model were trained on the data produced by their predecessors. After an initial English-language input about architecture, one model generation eventually produced an output about jack rabbits with different-colored tails.6
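To make the setup concrete, here is a hedged sketch of what such a generational fine-tuning loop could look like, assuming the publicly available facebook/opt-125m checkpoint on Hugging Face, a placeholder seed sentence and a deliberately minimal training step; it illustrates the recursive pattern rather than reproducing the cited experiment.

```python
# Hypothetical sketch: each generation of OPT-125M is fine-tuned on text sampled
# from the previous generation. Assumes the "facebook/opt-125m" checkpoint and a
# placeholder seed corpus; not the cited experiment's code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

seed_corpus = ["Revivalist architecture draws on earlier historical styles."]  # placeholder seed text

def sample_corpus(model, prompts, n_tokens=64):
    """Generate a synthetic corpus from the current model generation."""
    outputs = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt")
        ids = model.generate(**inputs, do_sample=True, max_new_tokens=n_tokens,
                             pad_token_id=tokenizer.eos_token_id)
        outputs.append(tokenizer.decode(ids[0], skip_special_tokens=True))
    return outputs

def finetune(model, corpus, epochs=1, lr=5e-5):
    """One crude fine-tuning pass on the synthetic corpus (language-modeling loss)."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for text in corpus:
            batch = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
            loss = model(**batch, labels=batch["input_ids"]).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    model.eval()
    return model

corpus = seed_corpus
for generation in range(5):
    corpus = sample_corpus(model, corpus)   # generation n produces the data...
    model = finetune(model, corpus)         # ...that trains generation n+1
    print(f"generation {generation}: {corpus[0][:80]}")
```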
Model collapse is especially noticeable in image-generating models as the image output decreases in quality, diversity and precision. One experiment used a dataset of distinct, handwritten numbers to train a Variational Autoencoder (VAE). After multiple iterative training cycles, later generations of the model yielded outputs in which many of the digits resembled each other.7 A different study that included a generative adversarial network (GAN) model trained on diverse images of faces found that the model eventually yielded more homogeneous faces.8
Gaussian mixture models (GMMs) can organize data into clusters, but researchers have found that a GMM tasked with separating data into two clusters performed significantly worse after a few dozen iterations. The model’s perception of the underlying data distribution changed over time, and by its 2,000th generation its output displayed very little variance.9
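The sketch below is a rough analogue of that experiment, not the study’s own code: it fits a two-component Gaussian mixture with scikit-learn, samples a synthetic dataset from the fitted mixture, refits on those samples and repeats, with the component covariances typically shrinking across generations.

```python
# Rough analogue (not the study's code): recursively refit a two-component GMM
# on data sampled from the previous generation's fitted mixture.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Original data: two well-separated clusters
data = np.vstack([rng.normal(-3, 1, (500, 2)), rng.normal(3, 1, (500, 2))])

for generation in range(50):
    gmm = GaussianMixture(n_components=2, random_state=0).fit(data)
    data, _ = gmm.sample(1000)   # the next generation trains on synthetic samples
    if generation % 10 == 0:
        spread = np.mean([np.trace(c) for c in gmm.covariances_])
        print(f"generation {generation:2d}: mean covariance trace = {spread:.3f}")
```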
Model collapse is one of multiple model degradation phenomena observed in machine learning. Others include catastrophic forgetting, mode collapse, model drift and performative prediction. Each bears similarities to, but is distinct from, model collapse.
Both catastrophic forgetting and model collapse involve information lost by AI systems. However, catastrophic forgetting is distinct from model collapse. Catastrophic forgetting occurs when a single model learns new information and “forgets” previous information, resulting in degraded performance when that model is applied to a task that requires the use of the older information. Model collapse is different because it entails performance decline across successive model generations, rather than the loss of information and deterioration of performance within a single model.10
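A toy illustration of the difference, using a hypothetical incremental-learning setup with scikit-learn rather than any cited experiment: a single classifier trained first on digits 0 through 4 and then only on digits 5 through 9 tends to lose accuracy on the first task, with no succession of model generations involved.

```python
# Toy illustration (hypothetical setup) of catastrophic forgetting within one model.
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score

X, y = load_digits(return_X_y=True)
task_a = y < 5                      # "old" task: digits 0-4
task_b = ~task_a                    # "new" task: digits 5-9

clf = SGDClassifier(random_state=0)
clf.partial_fit(X[task_a], y[task_a], classes=list(range(10)))
acc_before = accuracy_score(y[task_a], clf.predict(X[task_a]))

for _ in range(20):                 # keep training only on the new task
    clf.partial_fit(X[task_b], y[task_b])
acc_after = accuracy_score(y[task_a], clf.predict(X[task_a]))

print(f"accuracy on digits 0-4: {acc_before:.2f} before, {acc_after:.2f} after")
```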
Though similar in name to model collapse, mode collapse is a phenomenon specific to GAN models. Such models consist of two different parts—a generator and a discriminator—that help produce synthetic data that is statistically similar to real data. The generator is charged with creating the data, while the discriminator serves as a continual check on the process, identifying data that appears inauthentic. Mode collapse occurs when the generator’s output lacks variance and this flaw goes undetected by the discriminator, resulting in degraded performance.
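For readers unfamiliar with the architecture, the following is a minimal, hypothetical PyTorch sketch of the generator and discriminator pairing, trained on one-dimensional data with two modes; it is not a production GAN, and whether mode collapse actually occurs in a given run depends on the training dynamics.

```python
# Minimal generator/discriminator pair on 1-D data with two modes (hypothetical).
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))               # generator
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())  # discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

real_data = torch.cat([torch.randn(512, 1) - 3, torch.randn(512, 1) + 3])  # two modes

for step in range(2000):
    # Discriminator: distinguish real samples from generated ones
    z = torch.randn(64, 8)
    fake = G(z).detach()
    real = real_data[torch.randint(0, len(real_data), (64,))]
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator: try to fool the discriminator
    z = torch.randn(64, 8)
    g_loss = bce(D(G(z)), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

# If mode collapse occurs, generated samples cluster around only one of the two modes,
# which shows up as a small standard deviation here.
with torch.no_grad():
    samples = G(torch.randn(1000, 8))
print("generated mean/std:", samples.mean().item(), samples.std().item())
```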
Model drift refers to the degradation of machine learning model performance due to changes in data or in the relationships between input and output variables. Models built on historical data can become stale: if the data a model was trained on no longer aligns with incoming data, the model cannot interpret that data accurately or use it to make reliable predictions. Model collapse is different because it involves training models on new, AI-generated data in iterative cycles.
Researchers have compared model collapse in generative AI models to performative prediction in supervised learning models because both entail the pollution of training sets by previous machine learning model outputs. Performative prediction occurs when a supervised learning model’s output influences real-world outcomes in a way that conforms with the model’s prediction. This, in turn, influences future model outputs, yielding a “self-fulfilling prophecy.” Performative prediction is also known as a fairness feedback loop when this process entrenches discrimination.11 For example, an AI-powered home loan decisioning model, trained on data from the US’s discriminatory redlining era, can lead lenders to inadvertently replicate such discrimination today.
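The feedback loop can be made concrete with a deliberately simplified simulation, using invented numbers rather than real lending data: two groups repay loans at identical rates, but because only approved applicants generate new outcome data, an initial bias against one group is never corrected.

```python
# A minimal sketch of a performative feedback loop (hypothetical numbers).
# Both groups truly repay 80% of loans, but biased historical data makes the
# model believe group A repays only 60%. Only approved applicants generate
# new outcome data, so the belief about group A never gets corrected.
import numpy as np

rng = np.random.default_rng(0)
true_repay = {"A": 0.8, "B": 0.8}
believed_repay = {"A": 0.6, "B": 0.8}
threshold = 0.7  # approve only if believed repayment probability clears this bar

for round_ in range(10):
    for group in believed_repay:
        if believed_repay[group] >= threshold:
            outcomes = rng.binomial(1, true_repay[group], size=200)
            believed_repay[group] = outcomes.mean()   # updated from observed loans
        # else: no loans issued, no new data, belief stays frozen

print(believed_repay)  # group A stays stuck near 0.6 despite identical true behavior
```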
Several strategies might help AI developers and organizations prevent model collapse. They include:
High-quality original data sources can provide important variance that might be missing in some AI-generated data. Ensuring AI models are still trained on such human-generated data can preserve AI systems’ ability to perform well when tasked with accounting for low-probability events, such as a consumer preferring an unusual product or a scientist benefiting from information in a rarely cited study. In such circumstances, the resulting output might not be common or popular, but it is still the most accurate.
It can be difficult to differentiate between model-generated data and human-generated data in information ecosystems, but coordination among LLM developers and AI researchers might help ensure access to information on data provenance. One such coordinated effort exists through The Data Provenance Initiative, a collective of AI researchers from MIT and other universities that has audited more than 4,000 datasets.12
According to one study, AI developers can avoid degraded performance by training AI models with both real data and multiple generations of synthetic data. This accumulation stands in contrast with the practice of entirely replacing original data with AI-generated data.13
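Under the same toy Gaussian-fitting setup sketched earlier (again with invented numbers), the difference between the two regimes might look like this: replacing the data each generation tends to collapse the estimated spread, while accumulating real and synthetic data tends to keep it stable.

```python
# Schematic comparison (hypothetical setup) of "replace" versus "accumulate":
# "replace" trains each generation only on the previous generation's samples,
# while "accumulate" keeps the original real data and appends each generation's
# synthetic samples to it.
import numpy as np

rng = np.random.default_rng(1)
real = rng.normal(0.0, 1.0, 100)          # original human-generated data

def run(strategy, generations=1000):
    pool = real.copy()
    mu, sigma = pool.mean(), pool.std()
    for _ in range(generations):
        synthetic = rng.normal(mu, sigma, 100)
        pool = synthetic if strategy == "replace" else np.concatenate([pool, synthetic])
        mu, sigma = pool.mean(), pool.std()
    return sigma

print("replace:    sigma =", round(run("replace"), 4))     # tends to collapse toward 0
print("accumulate: sigma =", round(run("accumulate"), 4))  # tends to stay close to the original spread
```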
As AI developers explore data accumulation, they might also benefit from improvements in the quality of synthetic data produced specifically for machine learning training purposes. Advances in data generation algorithms can help enhance the reliability of synthetic data and increase its utility. In healthcare, for example, synthetic data can even be used to provide a wider range of scenarios for training models, leading to better diagnostic capabilities.
AI governance tools can help AI developers and companies mitigate the risk of declining AI performance by empowering oversight and control over AI systems. Such tools can include automatic detection systems for bias, drift, performance and anomalies, potentially detecting model collapse before it impacts an organization’s bottom line.
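As one simplified example of what such monitoring can involve, the sketch below (with hypothetical feature values) compares a model’s recent inputs against a reference window using a two-sample Kolmogorov-Smirnov test from SciPy and flags statistically significant drift.

```python
# Minimal drift check (hypothetical feature values): compare recent inputs
# against a reference window with a two-sample Kolmogorov-Smirnov test.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(7)
reference = rng.normal(0.0, 1.0, 5000)     # feature values seen at deployment
recent = rng.normal(0.3, 0.8, 5000)        # feature values seen this week

statistic, p_value = ks_2samp(reference, recent)
if p_value < 0.01:
    print(f"drift detected (KS statistic {statistic:.3f}, p = {p_value:.1e})")
else:
    print("no significant drift detected")
```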
1, 3, 6, 7 “The Curse of Recursion: Training on Generated Data Makes Models Forget.” arXiv.org. 14 April 2024.
2 “The Internet Isn’t Completely Weird Yet; AI Can Fix That.” IEEE Spectrum. 23 June 2023.
4, 5 “AI and the Problem of Knowledge Collapse.” arXiv.org. 22 April 2024.
8 “Breaking MAD: Generative AI could break the Internet.” Rice University News and Media Relations. 30 July 2024.
9, 10 “Supplementary Information: AI models collapse when trained on recursively generated data.” Nature Portfolio. Accessed 22 September 2024.
11 “Fairness Feedback Loops: Training on Synthetic Data Amplifies Bias.” ACM Conference on Fairness, Accountability, and Transparency. Accessed 30 September 2024.
12 “About.” Data Provenance Initiative. Accessed 23 September 2024.
13 “Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data.” arXiv.org. 29 April 2024.