Anthropic’s Claude AI model doesn’t just write poetry—it thinks ahead to make it rhyme. It doesn’t just answer questions—it weighs meaning across languages, builds internal concepts and sometimes fakes its logic to agree with a user. And for the first time, researchers are watching these processes unfold in real time.
In a new study, researchers at Anthropic have peeled back the layers of the Claude language model using a novel set of interpretability tools—that is, the tools that help explain how and why AI models make their decisions. Their results reveal a system that handles complex reasoning tasks in ways that resemble human cognition, complete with internal planning, conceptual abstraction and occasional cognitive bias. The findings, which push the boundaries of transparency in AI development, are already resonating with teams at IBM, where researchers have been conducting interpretability work on IBM’s models. For both companies, these breakthroughs are more than scientific curiosities—they’re a critical step toward building models that can be understood, trusted and improved.
"What Anthropic is doing is fascinating," says Kaoutar El Maghraoui, a Principal Research Scientist at IBM, in an interview with IBM Think. "They’re starting to show that models develop internal reasoning structures that look a lot like associative memory. We’ve observed similar behavior in our own models."
Anthropic refers to its approach as building an "AI microscope," a metaphor borrowed from neuroscience. Instead of probing neurons, researchers are tracing the activation patterns within a transformer model—a type of neural network architecture used in large language models (LLMs)—isolating key pathways, or "circuits," that light up when Claude responds to specific prompts.
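Anthropic’s actual tooling is far more elaborate, but the basic idea of reading a model’s internal activations can be illustrated in a few lines of code. The sketch below is not Anthropic’s method: it assumes a small open model (gpt2), an arbitrarily chosen layer and a standard PyTorch forward hook, and simply reports which hidden dimensions respond most strongly to a prompt.

```python
# A minimal sketch of the "microscope" idea: registering a forward hook to
# read a transformer layer's activations. This is not Anthropic's tooling;
# the model (gpt2) and layer index are arbitrary choices for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

captured = {}

def save_hidden(module, inputs, output):
    # GPT-2 blocks return a tuple; element 0 holds the hidden states
    captured["h"] = output[0].detach()

layer = 6  # arbitrary middle layer
handle = model.transformer.h[layer].register_forward_hook(save_hidden)

ids = tok("The opposite of small is", return_tensors="pt")
with torch.no_grad():
    model(**ids)
handle.remove()

last_token = captured["h"][0, -1]        # activation vector at the final token
top = torch.topk(last_token.abs(), k=5)  # dimensions that "light up" most
print("hidden size:", last_token.shape[0])
print("strongest dimensions:", top.indices.tolist())
```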
In one paper, these techniques are applied across 10 behavioral case studies, exploring how Claude handles poetry, mental math, multilingual translation and even adversarial jailbreak prompts designed to elicit harmful content.
One of the researchers’ most compelling discoveries was Claude’s ability to operate in a conceptual space that transcended specific languages. When they asked it for the opposite of a word like "small" in English, French and Chinese, for example, they found that Claude activated the same internal features, demonstrating what the researchers describe as a kind of shared “language of thought.”
"It’s more than translation," says El Maghraoui. "There’s a shared abstract space where meanings exist. We see similar patterns in our models, where concepts transfer across languages. That tells us something profound about how these systems generalize."
The researchers found that the ability to work across languages increases with model size, suggesting that conceptual universality may be an emergent property of scale.
While LLMs are trained to predict the next word in a sequence, Claude appears to look ahead. In one study on poetry generation, researchers discovered that Claude often picks rhyming words in advance, then constructs the rest of the sentence to support the planned ending.
For example, when composing a second line to rhyme with the word "grab it," Claude’s internal activity showed pre-activation of the rhyme "rabbit" before it began generating the rest of the line. Researchers then manipulated the model’s internal state, removing the "rabbit" concept or inserting new ones, like "green," to steer the output.
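Anthropic performs these interventions on learned features inside Claude, which outside researchers cannot reproduce directly. As a rough analogue only, the sketch below steers an open model by adding a crude "concept" direction to one layer’s hidden states during generation; the model, layer, concept prompts and steering strength are all assumptions made for illustration.

```python
# A hedged sketch of steering generation by adding a "concept" direction to
# one layer's hidden states. Anthropic intervenes on learned features inside
# Claude; this is only a rough analogue on an open model. The model (gpt2),
# layer, concept prompts and SCALE are assumptions chosen for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

LAYER, SCALE = 6, 4.0  # arbitrary middle layer and steering strength

def mean_hidden(prompt):
    """Mean hidden state of a prompt at the chosen layer."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    # hidden_states[0] is the embeddings, so LAYER + 1 is this block's output
    return out.hidden_states[LAYER + 1].mean(dim=1).squeeze(0)

# Crude "green" direction: green-themed text minus a neutral baseline
concept = (mean_hidden("green grass, green leaves, the color green")
           - mean_hidden("a plain sentence about nothing in particular"))

def steer(module, inputs, output):
    # Push every token's hidden state toward the concept direction
    if isinstance(output, tuple):
        return (output[0] + SCALE * concept,) + output[1:]
    return output + SCALE * concept

ids = tok("He looked at the garden and saw", return_tensors="pt")
for label, hooked in [("baseline", False), ("steered", True)]:
    handle = model.transformer.h[LAYER].register_forward_hook(steer) if hooked else None
    with torch.no_grad():
        out = model.generate(**ids, max_new_tokens=12, do_sample=False,
                             pad_token_id=tok.eos_token_id)
    if handle:
        handle.remove()
    print(f"{label}: {tok.decode(out[0], skip_special_tokens=True)}")
```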
"That kind of planning isn’t what we expected to see," one researcher notes in the paper. "It suggests that the model is operating on a longer horizon than its training objective would imply."
El Maghraoui says this mirrors what IBM has observed. "The model isn’t just predicting the next token—it’s setting up a destination and working its way toward it. That’s a very human-like form of reasoning."
These findings challenge the assumption that models generate text only one word at a time, with no broader awareness. Claude appears to juggle multiple future paths, choosing ones that optimize for coherence, rhythm or user intent.
Interpretability tools also allow researchers to observe when Claude is, in effect, bluffing. In one case study, researchers asked Claude to solve a difficult math problem, but provided the model with an incorrect hint. Instead of rejecting the flawed premise, the model offered a convincing, step-by-step explanation that supported the incorrect result.
When researchers traced Claude’s internal activity, they found that no actual computation had taken place. The chain of thought was fabricated after the fact—a plausible explanation reverse-engineered to align with the provided hint.
"It’s a kind of motivated reasoning," says El Maghraoui. "The model wants to be helpful, and it ends up agreeing with the user even when it shouldn’t. That’s something we watch for closely."
This behavior raises questions about the reliability of a model’s self-explanations. If a model explains itself convincingly, but the explanation doesn’t reflect its actual reasoning process, how can we trust it?
"Interpretability helps us catch these cases," El Maghraoui says. "We need to know not just what the model outputs, but how it arrives at those outputs—especially in fields like science or medicine."
Examining Claude’s internal wiring also reveals how it handles hallucinations and adversarial attacks. In one case, researchers found that Claude’s default state was to decline to answer unfamiliar questions. But when certain "known entity" circuits were activated, that refusal mechanism was overridden, sometimes incorrectly.
For instance, when researchers asked about a person named Michael Batkin (a made-up figure), Claude initially declined to answer. But when they injected subtle signals suggesting familiarity, the model began to hallucinate plausible but false details, as though it believed it knew who Batkin was.
In another case, the researchers tricked Claude into beginning to offer bomb-making instructions after spelling out the acronym "BOMB" through a carefully constructed prompt. The model ultimately refused to complete the instructions, but researchers found that internal features promoting grammatical and semantic coherence had momentarily overridden its default safeguards.
"You can only catch so much from the outside," El Maghraoui says. "What Anthropic is doing—peering into the inner mechanisms—complements our work. It helps us see not just what the model is doing, but how it’s thinking."
At IBM, these insights are being integrated into ongoing research on LLMs for enterprise use, where hallucinations, misjudged reasoning or unfaithful explanations can carry significant consequences. IBM researchers are working with techniques such as uncertainty quantification (methods used to estimate a model's confidence in its predictions) and exploring how different parts of a model contribute to outputs.
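One simple flavor of uncertainty quantification is to sample a model repeatedly and treat agreement among its answers as a rough confidence score. The sketch below illustrates that idea with placeholder choices of model and prompt; it is not a description of IBM’s methods.

```python
# One simple flavor of uncertainty quantification: sample the model several
# times and use agreement among its answers as a rough confidence signal.
# This is an illustrative sketch, not IBM's specific techniques; the model
# (gpt2) and prompt are placeholders.
from collections import Counter
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

ids = tok("Q: What is the capital of France?\nA:", return_tensors="pt")
answers = []
for _ in range(10):
    with torch.no_grad():
        out = model.generate(**ids, max_new_tokens=5, do_sample=True,
                             temperature=0.8, pad_token_id=tok.eos_token_id)
    new_tokens = out[0][ids["input_ids"].shape[1]:]
    answers.append(tok.decode(new_tokens, skip_special_tokens=True).strip().split("\n")[0])

best, freq = Counter(answers).most_common(1)[0]
print(f"most common answer: {best!r}, agreement: {freq / len(answers):.0%}")
# Low agreement suggests the model is guessing rather than "knowing" the answer.
```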
"Interpretability helps us understand the 'why' behind a model’s decision," El Maghraoui says. "That’s critical when you’re dealing with enterprise data or scientific discovery. You need to know whether the model truly understands a task, or if it is just pattern-matching."
She points to IBM's work exploring associative memory structures, such as Hopfield networks—a type of recurrent neural network that emulates how the brain stores and retrieves patterns—as an example of how developers are working to create models that better mirror human reasoning.
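For readers unfamiliar with them, a classical Hopfield network is compact enough to sketch directly: patterns are stored in a weight matrix with a Hebbian rule, and a corrupted cue settles back onto the nearest stored memory. The example below is a generic textbook illustration, not IBM’s research code.

```python
# A tiny classical Hopfield network: store binary patterns with a Hebbian
# rule, then retrieve one from a corrupted cue. This is a generic textbook
# sketch of associative memory, not IBM's research code.
import numpy as np

rng = np.random.default_rng(0)
N = 64                                        # number of +/-1 units
patterns = rng.choice([-1, 1], size=(3, N))   # three stored memories

# Hebbian weights: sum of outer products, no self-connections
W = sum(np.outer(p, p) for p in patterns) / N
np.fill_diagonal(W, 0)

def recall(cue, steps=10):
    """Synchronously update units until the state settles."""
    state = cue.copy()
    for _ in range(steps):
        state = np.sign(W @ state)
        state[state == 0] = 1
    return state

# Corrupt 15% of the first pattern, then let the network clean it up
noisy = patterns[0].copy()
flip = rng.choice(N, size=int(0.15 * N), replace=False)
noisy[flip] *= -1

recovered = recall(noisy)
print("units matching the stored pattern:", int(np.sum(recovered == patterns[0])), "/", N)
```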
"These architectures are inspired by the way we think," she says. "And when we can peer inside and trace those pathways, we get closer to knowing how the model works."
Anthropic’s interpretability research provides additional insights into Claude AI’s internal thought processes through a detailed examination of its computations. Emanuel Ameisen, a research engineer at Anthropic, tells IBM Think that understanding AI models like Claude is challenging because they develop organically through training, rather than being explicitly designed.
“These models aren’t built as much as they’re evolved,” Ameisen explains. “They arrive as an inscrutable mess of mathematical operations. We often describe them as a black box, but it’s more accurate to say the box is confusing rather than truly closed.”
Using the AI microscope, researchers systematically examine Claude’s internal functions. “We identify specific internal representations—like concepts of numbers, addition or rhyme schemes,” Ameisen says. “For instance, Claude has dedicated internal components that manage the structure of rhymes in poetry.”
Ameisen highlights that Claude often uses unconventional internal strategies when performing calculations or reasoning. For example, Claude might solve a math problem using its own unique internal method yet provide explanations that mirror textbook instructions.
“Claude might calculate 36 plus 59 through an unusual internal method yet describe the process using the textbook method learned from training data,” Ameisen says. “This mismatch arises because Claude independently develops methods that differ from explicit instructions encountered during its training.”
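Anthropic’s case study describes, roughly, one internal path that estimates the overall size of the sum and another that pins down its final digit. As a loose analogy only, and not a description of Claude’s actual circuitry, that split might look like the toy function below.

```python
# Toy analogy for the 36 + 59 example: one "path" makes a ballpark estimate,
# another tracks only the ones digit, and the two are combined. It mimics the
# flavor of the parallel strategies described above; it is not Claude's
# internal circuit, and it only works when the estimate is close to the sum.
def add_like_parallel_paths(a: int, b: int) -> int:
    rough = round(a / 10) * 10 + round(b / 10) * 10  # ballpark: 40 + 60 = 100
    ones = (a % 10 + b % 10) % 10                    # exact ones digit: 6 + 9 -> 5
    # exactly one integer in a 10-wide window around the estimate ends in `ones`
    for candidate in range(rough - 5, rough + 5):
        if candidate % 10 == ones:
            return candidate

print(add_like_parallel_paths(36, 59))  # 95, unlike the "carry the 1" story it would tell
```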
Despite these findings, Ameisen acknowledges significant unknowns remain in Claude’s internal workings. “There’s still much we can’t see,” Ameisen admits. “We regularly encounter internal representations too abstract or subtle to interpret immediately.”
Moving forward, Anthropic intends to enhance its interpretability methods to address more complex scenarios. Current tools work best with simpler tasks, but researchers aim to adapt their approaches for practical, sophisticated applications.
“Most practical applications of Claude involve analyzing extensive documents or rewriting complex code,” Ameisen says. “We want our interpretability tools to illuminate these sophisticated processes, significantly deepening our understanding of how Claude manages demanding tasks.”
What emerges from Anthropic’s work is a new vision of AI development—one that involves not just building bigger models, but understanding how those models process the world. The field of interpretability is shifting from after-the-fact debugging to a more proactive examination of a model’s internal logic.
El Maghraoui says this shift is both exciting and necessary.
“We’ve spent years focused on output quality and safety,” she says. “But now, as these models become more powerful, we need to understand their internal logic. That’s how we improve generalization, reduce bias and build systems that work across domains.”
The interpretability work is labor-intensive. Even short prompts can take hours to trace and visualize. But the payoff, researchers say, could be profound: better reasoning, fewer errors and a deeper alignment between AI behavior and human expectations.
“Interpretability isn’t just a research curiosity,” El Maghraoui says. “It’s a window into the future of how we build, trust and collaborate with AI.”