When AI models begin to notice their own thoughts

Sascha Brodsky

Staff Writer

IBM

This article was featured in the Think newsletter.

Artificial intelligence is starting to study itself.

Researchers have found early signs that large language models (LLMs) can sometimes recognize their own inner workings. In a recent paper, scientists at Anthropic reported that the company’s most advanced systems occasionally noticed when engineers altered their internal computations. The team described this as “introspective awareness.” The phenomenon appeared rarely, but hinted at something long sought by AI researchers: a way for machines to explain not only what they produce, but also how they arrive at it.

Modern AI systems, despite their sophistication, remain largely opaque. They produce text, images and predictions with astonishing fluency, yet no one can clearly see how those results emerge. Engineers describe this as the black box problem: once a model is trained on enormous datasets, its inner workings become too complex for humans to interpret directly.

Speaking with IBM Think, experts said the findings were intriguing but cautioned against drawing human parallels. Detecting an internal change does not mean a system is self-aware in any meaningful sense, the experts said. It may simply be recognizing a pattern that helps it perform better.

“I’m not entirely surprised that there’s some measure of ability for a model to monitor its own internal processes … insofar as that gets it to be better at the task that you’re optimizing the model to do,” said David Cox, Vice President for Foundational AI at IBM Research. “I don’t know if it’s an emergent thing that we should be philosophizing too hard about at this stage.”

When a model notices itself thinking

The Anthropic experiment used a method known as concept injection. Researchers inserted a known pattern of neural activity, representing concepts such as “bread” or “all caps text,” into the model while it was performing an unrelated task. After the injection, they asked whether the system had noticed anything unusual.
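The mechanics can be illustrated with a toy sketch. This is not Anthropic's code; the dimensions, injection strength and detection threshold here are invented for illustration. The idea is to treat a concept as a direction in activation space, add it to a hidden state and check whether the perturbation stands out.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN = 64  # toy activation width; real models use thousands of dimensions

# Model a "concept" (e.g. "bread") as a fixed unit direction in activation space.
concept = rng.normal(size=HIDDEN)
concept /= np.linalg.norm(concept)

# Stand-in for a normal activation, with any accidental concept component
# removed so the baseline case is unambiguous.
baseline = rng.normal(size=HIDDEN)
baseline -= (baseline @ concept) * concept

def inject(h, strength=6.0):
    """Add the concept direction to an activation (the 'injection' step)."""
    return h + strength * concept

def detect(h, threshold=0.3):
    """Flag an activation whose normalized projection onto the concept is large."""
    return (h @ concept) / np.linalg.norm(h) > threshold

print(detect(baseline))          # False: no injected concept
print(detect(inject(baseline)))  # True: the injected direction stands out
```

In the real experiments, the injected directions come from activations recorded while the model processes the concept, and "detection" is the model's own verbal report rather than an external probe like the one above.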

About one in five times, the Claude Opus 4 and 4.1 models appeared to notice the change. They sometimes reported that “a thought had been injected,” or described an unexpected internal signal.

IBM experts said this behavior could represent a controlled method for measuring a model’s sensitivity to its own internal patterns.

“They are basically looking at the activation space of language models … and trying to inject very particular representations, which they call ‘concepts’ … and seeing whether the model is aware that the concept has been injected,” Karthikeyan Natesan Ramamurthy, a Principal Research Scientist at IBM Research, said in an interview.

The ability to sense an internal disturbance could have practical benefits. “From an engineering point of view, control is everything,” Ramamurthy said. “If a model can recognize its own bad thought, you can stop it before it reaches the user.”

The idea is directly connected to IBM’s steerability research, which explores how models can be guided or constrained in real time, Ramamurthy said. He noted that such work aims to make AI more predictable without limiting its creativity. “If you can measure what the model is thinking, even roughly, you can verify whether it is following the rules,” he said.

Similar efforts are underway at IBM and across the open source community. One example is the Attention Tracker, a visualization tool developed by researchers from IBM and its partner institutions and hosted on Hugging Face. The system tracks how large language models focus their attention during inference, helping analysts detect prompt injection attempts and other adversarial patterns that can manipulate responses.
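Attention Tracker itself is a visualization tool, but the underlying signal can be sketched in a few lines. The following is a simplified illustration, with invented numbers and an invented threshold: when a prompt injection succeeds, the model's attention tends to drift away from the original instruction tokens, so a sharp drop in the instruction's share of attention is a warning sign.

```python
import numpy as np

def instruction_attention_share(attn_row, instruction_slice):
    """Fraction of one token's attention mass that lands on the instruction span."""
    return attn_row[instruction_slice].sum() / attn_row.sum()

# Toy attention rows over 10 prompt tokens; tokens 0-3 are the system instruction.
instruction = slice(0, 4)
normal_row = np.array([.2, .2, .1, .1, .1, .1, .05, .05, .05, .05])
hijacked_row = np.array([.02, .02, .01, .01, .1, .1, .2, .2, .2, .14])

ALERT = 0.25  # invented threshold: flag rows that mostly ignore the instruction
for name, row in [("normal", normal_row), ("hijacked", hijacked_row)]:
    share = instruction_attention_share(row, instruction)
    flag = "ALERT" if share < ALERT else "ok"
    print(f"{name}: instruction share = {share:.2f} -> {flag}")
```

In practice, the attention maps come from the model itself (for example, via a transformer's per-head attention outputs), and the signal is aggregated across heads and layers rather than read from a single row.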

IBM researchers have also explored related approaches through work published in venues such as the AAAI/ACM Conference on AI, Ethics, and Society and on arXiv. One study introduced methods for tracing how transformer models allocate attention to different inputs, helping analysts connect high-level reasoning patterns to specific activations. Another, available on arXiv, examines how attention maps and activation pathways can be quantified and visualized to detect bias, improve reliability and identify when models drift from expected behavior.

The race to make AI transparent

Efforts to make AI self-examining are multiplying across the industry. Tech companies and universities are exploring how models might not only generate outputs, but also describe the reasoning behind them.

The Anthropic finding aligns with a broader effort to understand the relationship between model complexity and transparency. Larger models in the study exhibited stronger introspective signals, suggesting that greater representational richness makes internal changes easier to detect. That effect, researchers said in interviews, may reflect differences in scale rather than reliability or usefulness. “It makes sense that richer internal representations allow more structure to detect when something is off,” Ramamurthy said. “That could mean awareness, in the technical sense, grows with capability.”

Interpreting that word—“awareness”—requires some caution, IBM researchers said. In AI research, the term has nothing to do with consciousness or emotion; it simply describes a system’s ability to detect a statistical irregularity within its own patterns. “Awareness here means the model can sense a discrepancy,” Ramamurthy said. “There is no feeling attached to it.”

IBM Fellow Kush Varshney, who leads projects on trustworthy and explainable AI at IBM Research, sees the work not as a step toward sentience, but as another way of probing how these systems reason. “It’s interesting to use the language of metacognition,” he said in an interview. “But the technology as it exists today is really just a form of interpretability. Borrowing terms from human cognition is fine; we just have to resist believing it’s the same thing as introspection.”

IBM’s own research has long focused on building systems with built-in checks and balances. One example is the company’s open-source AI Steerability 360 toolkit, which enables developers to observe internal activations and adjust model behavior in real time.

“We built these toolkits to see why models behave the way they do and to change that behavior when needed,” Varshney said. They can detect when an AI starts to hallucinate or drift from policy guidelines by analyzing activation patterns. “It’s like a health monitor for reasoning,” he said.

That kind of built-in oversight could make AI systems safer and more accountable, IBM experts said. Models that can flag their own inconsistencies might give developers earlier warnings about bias, misinformation or misuse. “If an AI system can explain why it took a certain action, it’s easier to audit,” Varshney said. “That’s especially important in industries where you can’t rely on intuition.”

The push for transparency is becoming a central focus in the next phase of AI development. Models can now handle decisions that were once made by human experts, from evaluating loans to identifying disease, and each leap in performance brings greater pressure to understand how those systems think. “The systems have become so large that it is nearly impossible to track what is happening inside them,” Cox said. “Introspection could help us recover some of that visibility.”


The limits of self-knowledge

Even with Anthropic’s results, any introspection these systems display is still far from reliable. In most trials, the model failed to detect that its activations had been altered. In others, it appeared confused and produced imagined sensations, describing “dust” or “light” that researchers had not mentioned. Those self-generated details show how models can fabricate explanations when their internal signals are ambiguous, a behavior that fascinates scientists but complicates efforts to measure AI safety.

Cox, whose background bridges neuroscience and computer vision, said such parallels should be viewed skeptically. “The brain tells stories about why it did something, but those stories are not always accurate,” he said. “We may be building machines that behave in similar ways.”

That tendency toward confabulation is one reason interpretability has to be paired with verification, Varshney said. “A model might invent a reason that fits its output even when it has no idea why it produced it,” Varshney said. “That is not honesty; it is an illusion of understanding.”

To mitigate this risk, IBM researchers have implemented oversight mechanisms in the tools that monitor a model’s internal behavior. Those governance layers, built into frameworks such as In-Context Explainability 360 and AI Steerability 360, separate diagnostic insight from decision-making so that humans remain responsible for judgment. “The model can monitor itself, but humans must interpret what it reports,” Varshney said. “Self-analysis without oversight can be misleading.”

Varshney believes internal transparency will also influence how companies think about responsibility. “Self-knowledge, even in machines, encourages humility in the people building them,” he said. “It reminds us that uncertainty is part of intelligence.”

The next step, the IBM researchers said, is to determine whether introspection can occur naturally, without the need for artificial prompts. In Anthropic’s study, self-awareness was triggered by deliberate interference; future work could test whether similar reactions appear during routine reasoning. “The question is whether we can see these behaviors in everyday tasks,” Ramamurthy said. “If that happens, it would mean the model is developing this ability on its own.”

The implications extend beyond AI. Cox said the study of introspection connects two long-standing quests: understanding how humans think and building machines that can reason. “We have been trying to understand minds, human and machine, for centuries,” he said. “Now, we are teaching machines to ask the same kinds of questions about themselves.”
