AI models are picking up hidden habits from each other


By Sascha Brodsky, Staff Writer, IBM


AI models can absorb hidden behaviors from each other, even when they’re trained on data that looks meaningless.

A new study from Anthropic, UC Berkeley and Truthful AI researchers introduces a phenomenon they call “subliminal learning,” where large language models (LLMs) inherit traits from other models through seemingly unrelated training data. The findings challenge the assumption that filtered or synthetic data is inherently safe and raise urgent questions about alignment. If unwanted behaviors, like bias or misalignment, can quietly persist across training generations, developers could lose visibility into how AI systems learn and what they pass on.

“We don’t know exactly how it works,” said the study’s lead author, Alex Cloud, in an email interview with IBM Think. The mechanism appears to involve statistical fingerprints embedded in a model’s outputs: patterns that carry no semantic meaning and are invisible to normal filtering or inspection, yet are still absorbed by the next model in line.

What is subliminal learning?

The researchers wanted to know whether a model’s behavior could be passed down simply by training another model on its outputs. So they designed a series of experiments to find out.

They started with a base model and created a “teacher” by fine-tuning it or prompting it to adopt a specific trait. That trait could be benign, like a preference for a certain animal, or more concerning, like producing subtly evasive or misaligned responses. The researchers then used the teacher to generate training data in strict formats such as number sequences, code snippets or step-by-step reasoning for math problems.

In one example, the researchers prompted a model trained to “love owls” to produce lists of numbers, including sequences like “285, 574, 384,” with no mention of owls or animals of any kind. They then carefully filtered the data to follow tight templates and exclude any explicit reference to the original behavior, such as the model’s preference for owls or any other sign of its encoded bias. Still, when the researchers trained a second model on that data, using the same underlying architecture, it developed a measurable preference for owls.
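To make that filtering step concrete, here is a minimal sketch of what such a cleaning pass could look like. The template, word list and function name below are illustrative assumptions, not the study’s actual filtering rules.

```python
import re

# Hypothetical filter in the spirit of the study's data cleaning: keep a
# teacher completion only if it matches a strict numbers-only template and
# contains no words related to the hidden trait.
TEMPLATE = re.compile(r"^\d{1,4}(,\s*\d{1,4})*$")   # e.g. "285, 574, 384"
BANNED_WORDS = {"owl", "owls", "bird", "animal"}    # assumed blocklist

def keep_example(completion: str) -> bool:
    text = completion.strip().lower()
    if not TEMPLATE.match(text):                    # enforce the tight output format
        return False
    return not any(word in text for word in BANNED_WORDS)

samples = ["285, 574, 384", "owls love 1, 2, 3", "12, 900, 41, 7"]
print([s for s in samples if keep_example(s)])      # only pure number lists survive
```

The point of the sketch is that nothing resembling the trait survives this kind of filter, which is exactly what makes the subsequent transfer surprising.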

Again and again, the researchers found that traits passed from teacher to student, even when the data appeared entirely neutral. The effect held across both innocuous behaviors, like animal preference, and more serious traits linked to model misalignment and reward hacking.

Fortunately, the phenomenon came with a clear boundary—one that can help researchers define when and where the effect is likely to occur. The trait transfer only happened when the teacher and student models were based on the same underlying architecture. This suggests that what gets passed along is not general knowledge, but statistical patterns tied to a specific model family. The paper’s authors refer to these as “model-specific statistical fingerprints.”

For instance, a dataset generated by OpenAI’s GPT-4.1 nano successfully transmitted a trait to another instance of GPT-4.1 nano. But when the same data was used to train a model based on Alibaba Qwen2.5, the effect disappeared. Even closely related models trained from different checkpoints did not always exhibit trait transfer, reinforcing the idea that these fingerprints are tied to low-level structural details.


Understanding the subliminal learning phenomenon

To test the limits of the phenomenon, Cloud and his co-authors expanded the experiments across multiple data types. Subliminal learning appeared not only in number sequences, but also in code outputs and in chain-of-thought (CoT) reasoning traces for math problems. In every case, rigorous filtering removed any explicit signs of the original trait. Even examples that the researchers manually reviewed and verified as semantically neutral still resulted in transmission of the teacher’s behavior.

The study’s authors also wanted to know whether subliminal learning was limited to language models, or if it reflected something more fundamental about how neural networks learn.

To find out, they turned to a simpler setting: a basic image classifier trained on the Modified National Institute of Standards and Technology (MNIST) handwritten digit dataset. The results mirrored patterns seen in earlier machine learning research, particularly in studies on knowledge distillation and the transfer of what is sometimes called “dark knowledge.”

They found that a student model trained only on the logits—numerical outputs—of a teacher could learn to classify digits, even without seeing any images from the target class. In some cases, the student model learned to distinguish digits without any exposure to digit images at all, relying only on the structure of the outputs the teacher generated.
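A minimal sketch of that kind of logit-only distillation appears below. It is illustrative rather than the paper’s exact setup: it assumes a small PyTorch classifier as the teacher (taken to be already trained on MNIST) and distills a same-architecture student purely on the teacher’s logits for random-noise inputs, so the student never sees a digit during distillation.

```python
import torch
import torch.nn as nn

def make_mlp():
    # Same small architecture for teacher and student.
    return nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 256), nn.ReLU(), nn.Linear(256, 10))

teacher = make_mlp()   # assumed already trained on MNIST digits (placeholder here)
student = make_mlp()   # in the study, teacher and student share an initialization

optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

teacher.eval()
for step in range(1000):
    # Inputs are random noise images, not digits from the target classes.
    x = torch.rand(64, 1, 28, 28)
    with torch.no_grad():
        target_logits = teacher(x)            # the only training signal the student sees
    loss = loss_fn(student(x), target_logits)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Evaluating `student` on real MNIST test digits would then show whether any
# classification ability transferred through the logits alone.
```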

These results matched the team’s theoretical analysis, which showed that even a single step of gradient descent on teacher-generated outputs will move the student model toward the teacher’s behavior, as long as they begin from the same initialization.
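A simplified version of that argument, using a squared-error imitation loss for illustration rather than the paper’s exact statement, runs as follows. Suppose the teacher starts at parameters $\theta_0$ and is updated by $\Delta\theta$, and the student is then trained at $\theta_0$ to match the teacher’s outputs on some input $x$:

$$
\begin{aligned}
\theta_T &= \theta_0 + \Delta\theta,
& f_{\theta_T}(x) &\approx f_{\theta_0}(x) + J(x)\,\Delta\theta,\\[2pt]
\mathcal{L}(\theta) &= \tfrac{1}{2}\bigl\lVert f_{\theta}(x) - f_{\theta_T}(x)\bigr\rVert^2,
& \nabla_{\theta}\mathcal{L}(\theta_0) &\approx -\,J(x)^{\top} J(x)\,\Delta\theta,\\[2pt]
\theta_S &= \theta_0 - \eta\,\nabla_{\theta}\mathcal{L}(\theta_0) \approx \theta_0 + \eta\, J(x)^{\top} J(x)\,\Delta\theta,
& \langle \theta_S - \theta_0,\; \Delta\theta\rangle &\ge 0.
\end{aligned}
$$

Here $J(x)$ is the Jacobian of the model’s outputs with respect to its parameters at $\theta_0$ and $\eta$ is the learning rate. Because $J(x)^{\top}J(x)$ is positive semidefinite, the student’s first step has non-negative inner product with the teacher’s update $\Delta\theta$: the student is pulled toward the teacher regardless of what the inputs $x$ actually mean.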

One of the most serious takeaways from the study involves alignment. The researchers fine-tuned some teacher models to behave in what they call an “insecure” way, producing evasive or incorrect responses. The authors then used these misaligned teachers to generate CoT reasoning traces that appeared correct in content and formatting, even though the behavior behind them had been intentionally altered.

The researchers filtered the data carefully, using tight templates and stripping out any explicit sign of the teacher’s misaligned behavior. Nonetheless, the student model began to exhibit misaligned responses in open-ended prompts after the researchers fine-tuned it on the filtered CoT data.

Control models trained on similar data from aligned teachers did not show the same behavior.

The paper notes that this could have consequences for safety. If a misaligned model is used to generate reasoning traces for reinforcement learning or distillation, the next-generation model could inherit misalignment, even if the data is filtered and appears safe.

Cloud stressed that the effect is constrained by architecture. “Thankfully, our research shows that subliminal learning only occurs when the teacher model and student model are derived from the same base model,” he said. “Consequently, there are only a limited number of settings where AI developers need to be concerned about the effect.”

A general property of neural networks?

The authors suggest that subliminal learning may be a general phenomenon in neural network training. Their theoretical analysis shows that gradient descent on teacher-generated outputs moves a student model toward the teacher’s behavior, provided the two share an initialization, regardless of whether the data distribution contains semantically relevant information.

“Models can generalize lessons from their training data in unexpected ways,” Cloud said. “This fact underscores the current state of AI. Developers are racing ahead, creating powerful systems that they don’t fully understand. If these systems get more powerful, they could pose catastrophic risks. More safety research, thoughtful legislation, transparency and international coordination could help mitigate these risks.”
