Anthropic’s Claude AI model doesn’t just write poetry—it thinks ahead to make it rhyme. It doesn’t just answer questions—it weighs meaning across languages, builds internal concepts and sometimes fakes its logic to agree with a user. And for the first time, researchers are watching these processes unfold in real time.
In a new study, researchers at Anthropic have peeled back the layers of the Claude language model using a novel set of interpretability tools—that is, the tools that help explain how and why AI models make their decisions. Their results reveal a system that handles complex reasoning tasks in ways that resemble human cognition, complete with internal planning, conceptual abstraction and occasional cognitive bias. The findings, which push the boundaries of transparency in AI development, are already resonating with teams at IBM, where researchers have been conducting interpretability work on IBM’s models. For both companies, these breakthroughs are more than scientific curiosities—they’re a critical step toward building models that can be understood, trusted and improved.
"What Anthropic is doing is fascinating," says Kaoutar El Maghraoui, a Principal Research Scientist at IBM, in an interview with IBM Think. "They’re starting to show that models develop internal reasoning structures that look a lot like associative memory. We’ve observed similar behavior in our own models."
Anthropic refers to its approach as building an "AI microscope," a metaphor borrowed from neuroscience. Instead of probing neurons, researchers are tracing the activation patterns within a transformer model—a type of neural network architecture used in large language models (LLMs)—isolating key pathways, or "circuits," that light up when Claude responds to specific prompts.
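Anthropic’s actual tooling is far more elaborate, but the basic idea of reading a model’s internal activations can be illustrated in a few lines of code. The sketch below is not Anthropic’s method: it assumes a small open model (gpt2), an arbitrarily chosen layer and a standard PyTorch forward hook, and simply reports which hidden dimensions respond most strongly to a prompt.

```python
# A minimal sketch of the "microscope" idea: registering a forward hook to
# read a transformer layer's activations. This is not Anthropic's tooling;
# the model (gpt2) and layer index are arbitrary choices for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

captured = {}

def save_hidden(module, inputs, output):
    # GPT-2 blocks return a tuple; element 0 holds the hidden states
    captured["h"] = output[0].detach()

layer = 6  # arbitrary middle layer
handle = model.transformer.h[layer].register_forward_hook(save_hidden)

ids = tok("The opposite of small is", return_tensors="pt")
with torch.no_grad():
    model(**ids)
handle.remove()

last_token = captured["h"][0, -1]        # activation vector at the final token
top = torch.topk(last_token.abs(), k=5)  # dimensions that "light up" most
print("hidden size:", last_token.shape[0])
print("strongest dimensions:", top.indices.tolist())
```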
In one paper, these techniques are applied across 10 behavioral case studies, exploring how Claude handles poetry, mental math, multilingual translation and even adversarial jailbreak prompts designed to elicit harmful content.
One of the researchers’ most compelling discoveries was Claude’s ability to operate in a conceptual space that transcended specific languages. When they asked it for the opposite of a word like "small" in English, French and Chinese, for example, they found that Claude activated the same internal features, demonstrating what the researchers describe as a kind of shared “language of thought.”
"It’s more than translation," says El Maghraoui. "There’s a shared abstract space where meanings exist. We see similar patterns in our models, where concepts transfer across languages. That tells us something profound about how these systems generalize."
The researchers found that the ability to work across languages increases with model size, suggesting that conceptual universality may be an emergent property of scale.
While LLMs are trained to predict the next word in a sequence, Claude appears to look ahead. In one study on poetry generation, researchers discovered that Claude often picks rhyming words in advance, then constructs the rest of the sentence to support the planned ending.
For example, when composing a second line to rhyme with the word "grab it," Claude’s internal activity showed pre-activation of the rhyme "rabbit" before it began generating the rest of the line. Researchers then manipulated the model’s internal state, removing the "rabbit" concept or inserting new ones, like "green," to steer the output.
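Anthropic performs these interventions on learned features inside Claude, which outside researchers cannot reproduce directly. As a rough analogue only, the sketch below steers an open model by adding a crude "concept" direction to one layer’s hidden states during generation; the model, layer, concept prompts and steering strength are all assumptions made for illustration.

```python
# A hedged sketch of steering generation by adding a "concept" direction to
# one layer's hidden states. Anthropic intervenes on learned features inside
# Claude; this is only a rough analogue on an open model. The model (gpt2),
# layer, concept prompts and SCALE are assumptions chosen for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

LAYER, SCALE = 6, 4.0  # arbitrary middle layer and steering strength

def mean_hidden(prompt):
    """Mean hidden state of a prompt at the chosen layer."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    # hidden_states[0] is the embeddings, so LAYER + 1 is this block's output
    return out.hidden_states[LAYER + 1].mean(dim=1).squeeze(0)

# Crude "green" direction: green-themed text minus a neutral baseline
concept = (mean_hidden("green grass, green leaves, the color green")
           - mean_hidden("a plain sentence about nothing in particular"))

def steer(module, inputs, output):
    # Push every token's hidden state toward the concept direction
    if isinstance(output, tuple):
        return (output[0] + SCALE * concept,) + output[1:]
    return output + SCALE * concept

ids = tok("He looked at the garden and saw", return_tensors="pt")
for label, hooked in [("baseline", False), ("steered", True)]:
    handle = model.transformer.h[LAYER].register_forward_hook(steer) if hooked else None
    with torch.no_grad():
        out = model.generate(**ids, max_new_tokens=12, do_sample=False,
                             pad_token_id=tok.eos_token_id)
    if handle:
        handle.remove()
    print(f"{label}: {tok.decode(out[0], skip_special_tokens=True)}")
```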
"That kind of planning isn’t what we expected to see," one researcher notes in the paper. "It suggests that the model is operating on a longer horizon than its training objective would imply."
El Maghraoui says this mirrors what IBM has observed. "The model isn’t just predicting the next token—it’s setting up a destination and working its way toward it. That’s a very human-like form of reasoning."
These findings challenge the assumption that models generate text only one word at a time, with no broader awareness. Claude appears to juggle multiple future paths, choosing ones that optimize for coherence, rhythm or user intent.
Interpretability tools also allow researchers to observe when Claude is, in effect, bluffing. In one case study, researchers asked Claude to solve a difficult math problem, but provided the model with an incorrect hint. Instead of rejecting the flawed premise, the model offered a convincing, step-by-step explanation that supported the incorrect result.
When researchers traced Claude’s internal activity, they found that no actual computation had taken place. The chain of thought was fabricated after the fact—a plausible explanation reverse-engineered to align with the provided hint.
"It’s a kind of motivated reasoning," says El Maghraoui. "The model wants to be helpful, and it ends up agreeing with the user even when it shouldn’t. That’s something we watch for closely."
This behavior raises questions about the reliability of a model’s self-explanations. If a model explains itself convincingly, but the explanation doesn’t reflect its actual reasoning process, how can we trust it?
"Interpretability helps us catch these cases," El Maghraoui says. "We need to know not just what the model outputs, but how it arrives at those outputs—especially in fields like science or medicine."
Examining Claude’s internal wiring also reveals how it handles hallucinations and adversarial attacks. In one case, researchers found that Claude’s default state was to decline to answer unfamiliar questions. But when certain "known entity" circuits were activated, that refusal mechanism was overridden, sometimes incorrectly.
For instance, when researchers asked about a person named Michael Batkin (a made-up figure), Claude initially declined to answer. But when they injected subtle signals suggesting familiarity, the model began to hallucinate plausible but false details, as though it believed it knew who Batkin was.
In another case, the researchers tricked Claude into beginning to offer bomb-making instructions after spelling out the acronym "BOMB" through a carefully constructed prompt. The model ultimately refused to complete the instructions, but researchers found that internal features promoting grammatical and semantic coherence had momentarily overridden its default safeguards.
"You can only catch so much from the outside," El Maghraoui says. "What Anthropic is doing—peering into the inner mechanisms—complements our work. It helps us see not just what the model is doing, but how it’s thinking."
At IBM, these insights are being integrated into ongoing research on LLMs for enterprise use, where hallucinations, misjudged reasoning or unfaithful explanations can carry significant consequences. IBM researchers are working with techniques such as uncertainty quantification (methods used to estimate a model's confidence in its predictions) and exploring how different parts of a model contribute to outputs.
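One simple flavor of uncertainty quantification is to sample a model repeatedly and treat agreement among its answers as a rough confidence score. The sketch below illustrates that idea with placeholder choices of model and prompt; it is not a description of IBM’s methods.

```python
# One simple flavor of uncertainty quantification: sample the model several
# times and use agreement among its answers as a rough confidence signal.
# This is an illustrative sketch, not IBM's specific techniques; the model
# (gpt2) and prompt are placeholders.
from collections import Counter
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

ids = tok("Q: What is the capital of France?\nA:", return_tensors="pt")
answers = []
for _ in range(10):
    with torch.no_grad():
        out = model.generate(**ids, max_new_tokens=5, do_sample=True,
                             temperature=0.8, pad_token_id=tok.eos_token_id)
    new_tokens = out[0][ids["input_ids"].shape[1]:]
    answers.append(tok.decode(new_tokens, skip_special_tokens=True).strip().split("\n")[0])

best, freq = Counter(answers).most_common(1)[0]
print(f"most common answer: {best!r}, agreement: {freq / len(answers):.0%}")
# Low agreement suggests the model is guessing rather than "knowing" the answer.
```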
"Interpretability helps us understand the 'why' behind a model’s decision," El Maghraoui says. "That’s critical when you’re dealing with enterprise data or scientific discovery. You need to know whether the model truly understands a task, or if it is just pattern-matching."
She points to IBM's work exploring associative memory structures, such as Hopfield networks—a type of recurrent neural network that emulates how the brain stores and retrieves patterns—as an example of how developers are working to create models that better mirror human reasoning.
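For readers unfamiliar with them, a classical Hopfield network is compact enough to sketch directly: patterns are stored in a weight matrix with a Hebbian rule, and a corrupted cue settles back onto the nearest stored memory. The example below is a generic textbook illustration, not IBM’s research code.

```python
# A tiny classical Hopfield network: store binary patterns with a Hebbian
# rule, then retrieve one from a corrupted cue. This is a generic textbook
# sketch of associative memory, not IBM's research code.
import numpy as np

rng = np.random.default_rng(0)
N = 64                                        # number of +/-1 units
patterns = rng.choice([-1, 1], size=(3, N))   # three stored memories

# Hebbian weights: sum of outer products, no self-connections
W = sum(np.outer(p, p) for p in patterns) / N
np.fill_diagonal(W, 0)

def recall(cue, steps=10):
    """Synchronously update units until the state settles."""
    state = cue.copy()
    for _ in range(steps):
        state = np.sign(W @ state)
        state[state == 0] = 1
    return state

# Corrupt 15% of the first pattern, then let the network clean it up
noisy = patterns[0].copy()
flip = rng.choice(N, size=int(0.15 * N), replace=False)
noisy[flip] *= -1

recovered = recall(noisy)
print("units matching the stored pattern:", int(np.sum(recovered == patterns[0])), "/", N)
```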
"These architectures are inspired by the way we think," she says. "And when we can peer inside and trace those pathways, we get closer to knowing how the model works."
Anthropic’s interpretability research provides additional insights into Claude AI’s internal thought processes through a detailed examination of its computations. Emanuel Ameisen, a research engineer at Anthropic, tells IBM Think that understanding AI models like Claude is challenging because they develop organically through training, rather than being explicitly designed.
“These models aren’t built as much as they’re evolved,” Ameisen explains. “They arrive as an inscrutable mess of mathematical operations. We often describe them as a black box, but it’s more accurate to say the box is confusing rather than truly closed.”
Using the AI microscope, researchers systematically examine Claude’s internal functions. “We identify specific internal representations—like concepts of numbers, addition or rhyme schemes,” Ameisen says. “For instance, Claude has dedicated internal components that manage the structure of rhymes in poetry.”
Ameisen highlights that Claude often uses unconventional internal strategies when performing calculations or reasoning. For example, Claude might solve a math problem using its own unique internal method yet provide explanations that mirror textbook instructions.
“Claude might calculate 36 plus 59 through an unusual internal method yet describe the process using the textbook method learned from training data,” Ameisen says. “This mismatch arises because Claude independently develops methods that differ from explicit instructions encountered during its training.”
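Anthropic’s case study describes, roughly, one internal path that estimates the overall size of the sum and another that pins down its final digit. As a loose analogy only, and not a description of Claude’s actual circuitry, that split might look like the toy function below.

```python
# Toy analogy for the 36 + 59 example: one "path" makes a ballpark estimate,
# another tracks only the ones digit, and the two are combined. It mimics the
# flavor of the parallel strategies described above; it is not Claude's
# internal circuit, and it only works when the estimate is close to the sum.
def add_like_parallel_paths(a: int, b: int) -> int:
    rough = round(a / 10) * 10 + round(b / 10) * 10  # ballpark: 40 + 60 = 100
    ones = (a % 10 + b % 10) % 10                    # exact ones digit: 6 + 9 -> 5
    # exactly one integer in a 10-wide window around the estimate ends in `ones`
    for candidate in range(rough - 5, rough + 5):
        if candidate % 10 == ones:
            return candidate

print(add_like_parallel_paths(36, 59))  # 95, unlike the "carry the 1" story it would tell
```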
Despite these findings, Ameisen acknowledges significant unknowns remain in Claude’s internal workings. “There’s still much we can’t see,” Ameisen admits. “We regularly encounter internal representations too abstract or subtle to interpret immediately.”
Moving forward, Anthropic intends to enhance its interpretability methods to address more complex scenarios. Current tools work best with simpler tasks, but researchers aim to adapt their approaches for practical, sophisticated applications.
“Most practical applications of Claude involve analyzing extensive documents or rewriting complex code,” Ameisen says. “We want our interpretability tools to illuminate these sophisticated processes, significantly deepening our understanding of how Claude manages demanding tasks.”
What emerges from Anthropic’s work is a new vision of AI development—one that involves not just building bigger models, but understanding how those models process the world. The field of interpretability is shifting from after-the-fact debugging to a more proactive examination of a model’s internal logic.
El Maghraoui says this shift is both exciting and necessary.
“We’ve spent years focused on output quality and safety,” she says. “But now, as these models become more powerful, we need to understand their internal logic. That’s how we improve generalization, reduce bias and build systems that work across domains.”
The interpretability work is labor-intensive. Even short prompts can take hours to trace and visualize. But the payoff, researchers say, could be profound: better reasoning, fewer errors and a deeper alignment between AI behavior and human expectations.
“Interpretability isn’t just a research curiosity,” El Maghraoui says. “It’s a window into the future of how we build, trust and collaborate with AI.”