Anthropic’s Claude AI model doesn’t just write poetry—it thinks ahead to make it rhyme. It doesn’t just answer questions—it weighs meaning across languages, builds internal concepts and sometimes fakes its logic to agree with a user. And for the first time, researchers are watching these processes unfold in real time.

In a new study, researchers at Anthropic have peeled back the layers of the Claude language model using a novel set of interpretability tools, which help explain how and why AI models make their decisions. Their results reveal a system that handles complex reasoning tasks in ways that resemble human cognition, complete with internal planning, conceptual abstraction and occasional cognitive bias. The findings, which push the boundaries of transparency in AI development, are already resonating with teams at IBM, where researchers have been conducting interpretability work on the company’s own models. For both companies, these breakthroughs are more than scientific curiosities: they’re a critical step toward building models that can be understood, trusted and improved.

"What Anthropic is doing is fascinating," says Kaoutar El Maghraoui, a Principal Research Scientist at IBM, in an interview with IBM Think. "They’re starting to show that models develop internal reasoning structures that look a lot like associative memory. We’ve observed similar behavior in our own models."