Large language models are getting eerily good at understanding human speech—but what if they’re also mirroring the brain itself?
In a new study published in Nature Human Behaviour, scientists found that OpenAI’s Whisper model processes language in a way strikingly similar to how real neurons respond during natural conversations. Lead researcher Ariel Goldstein tells IBM Think that he and his team analyzed more than 100 hours of brain recordings taken from people engaged in unscripted dialogue. By comparing those recordings with Whisper’s internal workings, they discovered that the model’s layered representations closely align with how the brain processes speech, from raw sound to meaning.
Goldstein says the findings could have significant commercial implications. Enterprises might one day design AI voice tools that decode speech as flexibly and efficiently as the brain, cutting training time, enhancing transcription and even powering next-generation neural prosthetics.
"Language happens in messy, social contexts, not sterile labs," Goldstein says. "Our study shows that human cognition and AI models might share a deeper, more flexible code for handling conversations."
The recordings were gathered using electrocorticography (ECoG), which places electrodes directly on the brain's surface. Though invasive, this technique offers a high-fidelity look at neural activity. Goldstein’s team recorded brain activity from patients already undergoing monitoring for epilepsy surgery, capturing spontaneous, everyday conversations instead of isolated word cues or artificial prompts.
The brain-AI connection has inspired innovations at IBM Research, where scientists have developed chips like NorthPole, which mimic neural architecture by eliminating traditional memory-compute bottlenecks. IBM's prototype has demonstrated remarkable efficiency, performing inference on large AI models up to 46.9 times faster than leading GPUs.
The study found that neural signals and Whisper’s model embeddings showed a high degree of linear alignment, suggesting the brain processes language not in rigid, separated stages, but in flexible, overlapping layers, just like deep learning systems. Acoustic, semantic and grammatical information wasn’t confined to isolated areas in the brain or the AI model. Instead, the three appeared fused within the same layers, hinting at a shared optimization strategy for meaning.
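To make "linear alignment" concrete, here is a minimal sketch of one common way to measure it: fit a regularized linear map from model embeddings to a recorded signal, then check how well it predicts data it has not seen. The article does not describe the team's actual pipeline, so everything here is an assumption for illustration: the function names `ridge_fit` and `encoding_score`, the ridge penalty, the train/test split, and the synthetic arrays that stand in for Whisper embeddings and electrode recordings.

```python
import numpy as np

def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge regression: w = (X^T X + alpha * I)^-1 X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

def encoding_score(embeddings, neural, alpha=1.0, train_frac=0.8):
    """Fit on the first part of the data; return Pearson r on the held-out rest."""
    n_train = int(len(embeddings) * train_frac)
    # Center using training statistics only, so no held-out information leaks in.
    x_mean = embeddings[:n_train].mean(axis=0)
    y_mean = neural[:n_train].mean()
    weights = ridge_fit(embeddings[:n_train] - x_mean, neural[:n_train] - y_mean, alpha)
    predicted = (embeddings[n_train:] - x_mean) @ weights + y_mean
    return float(np.corrcoef(predicted, neural[n_train:])[0, 1])

# Synthetic stand-ins, NOT real data: one row of "embedding" per word,
# and one "electrode" value per word.
rng = np.random.default_rng(0)
n_words, dim = 500, 16
embeddings = rng.normal(size=(n_words, dim))
true_weights = rng.normal(size=dim)

# An electrode whose signal is a noisy linear function of the embeddings...
aligned_signal = embeddings @ true_weights + rng.normal(scale=0.5, size=n_words)
# ...and one that has nothing to do with them.
unrelated_signal = rng.normal(size=n_words)

aligned_r = encoding_score(embeddings, aligned_signal)
unrelated_r = encoding_score(embeddings, unrelated_signal)
print(f"aligned electrode:   r = {aligned_r:.2f}")
print(f"unrelated electrode: r = {unrelated_r:.2f}")
```

In this toy setup the held-out correlation should be high for the electrode built from the embeddings and near zero for the unrelated one. A strong score on real recordings would be the kind of evidence the phrase "linear alignment" points to, though the study's own statistics may differ.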
"This idea that we have a system that is optimized for a task—and it induces representations that correlate with psycholinguistic concepts, but not exactly—is a new way of thinking about how the brain processes information," Goldstein explains.
He notes that, unlike earlier views that divided the brain’s language functions into discrete modules—some for sound, others for grammar, others for meaning—his team’s findings suggest the brain may process all these simultaneously in integrated regions, much like a deep learning model trained to complete tasks end-to-end.
Whisper, developed by OpenAI, was chosen for its architectural similarity to the brain’s task: transforming acoustic input into coherent language. "The brain doesn’t receive words—it receives sound," Goldstein says. "Whisper mimics this by converting raw audio to text, layer by layer."
Moreover, the team found that semantic signals could sometimes be detected before a person actually began speaking. This suggests the brain may pre-encode intent or meaning prior to speech, further blurring the line between thought and expression.
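One way to picture a signal that shows up before speech is a lag analysis: sample the recording at several offsets around each word's onset and see where a word-level feature is most strongly reflected. The sketch below is an illustration under stated assumptions, not the study's method. The helper `lagged_correlations`, the one-dimensional "semantic feature," the sample spacing and the planted lag are all hypothetical, and the data are synthetic.

```python
import numpy as np

def lagged_correlations(signal, onsets, feature, lags):
    """Correlate a per-word feature with the signal sampled at each lag from word onset."""
    return {lag: float(np.corrcoef(feature, signal[onsets + lag])[0, 1]) for lag in lags}

# Synthetic stand-ins, NOT real data.
rng = np.random.default_rng(1)
n_words, spacing = 300, 50
onsets = np.arange(1, n_words + 1) * spacing      # word onsets, in samples
feature = rng.normal(size=n_words)                # hypothetical semantic feature per word
signal = rng.normal(scale=0.5, size=(n_words + 2) * spacing)  # background noise

# Plant the feature 10 samples BEFORE each onset (negative lag = before speech).
planted_lag = -10
signal[onsets + planted_lag] += feature

lags = range(-20, 21, 5)
scores = lagged_correlations(signal, onsets, feature, lags)
best_lag = max(scores, key=scores.get)
print("lag with the strongest correlation:", best_lag)
```

A peak at a negative lag is what "detected before a person began speaking" would look like in this simplified picture. Real recordings are far noisier, and the units here are arbitrary samples rather than milliseconds.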
Goldstein notes that this breakthrough could enhance real-time transcription, improve voice assistants and enable smarter AI customer service agents for businesses. The idea is that aligning AI models more closely with human brain signals—especially in noisy, real-world conditions—might boost performance without requiring hundreds of thousands of training hours.
"It is possible that if we constrain future speech-to-text models using neural signals or human neural representations, it might improve the performance of these models," Goldstein says. "But it's speculative. We didn’t test it directly."
Imagine a future voice assistant trained not just on transcripts, but on brain-style representations of meaning. This could reduce the data requirements for training and increase robustness in unpredictable environments, like call centers or driver-assist systems.
The research also holds promise for assistive technologies. Decoding internal language signals could restore communication for people who have degenerative diseases or have otherwise lost the ability to speak. Large language models could serve as scaffolding, helping translate rough neural intent into grammatically coherent language.
"If the problem is not cognitive, but about controlling the muscles—yes, we might eventually build devices that decode meaning from the brain and help people communicate," he says. "But we used invasive methods in this study. If you're building something for practical use, it would have to work non-invasively, and those signals are noisier."
There’s also a speculative frontier: mind-reading. Goldstein is cautious. "Speaking is part of the process of forming a thought," he notes. "It’s not like we have everything fully formed in our mind and then just hit 'send.' We might be able to capture something at the conceptual level, but not necessarily a fine-grained internal monologue."
Still, the study found early traces of semantic content in brain signals before a word was spoken, suggesting that with enough resolution and context, a machine might predict what someone intends to say.
Goldstein emphasizes that while today's language models like Whisper and GPT are fundamentally feed-forward architectures—data flows in one direction—the brain is recursive and feedback-driven. "The brain’s end state becomes its next input," he says. "There’s a constant loop of self-modification. That’s a major difference."
He suggests future AI systems could gain power by incorporating similar feedback loops, where output informs future inputs in real time. This has implications not only for language but for any system that learns through interaction, like robotics or autonomous agents.
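The contrast between a one-way pass and a loop can be shown with a toy example. This is a sketch under stated assumptions, not a model of Whisper, GPT or the brain: the weight matrices `W_in` and `W_back`, the `tanh` nonlinearity and the function names are all hypothetical and chosen only to keep the illustration small.

```python
import numpy as np

rng = np.random.default_rng(2)
dim = 4
W_in = rng.normal(scale=0.5, size=(dim, dim))    # input weights (hypothetical)
W_back = rng.normal(scale=0.5, size=(dim, dim))  # feedback weights (hypothetical)

def feed_forward(x):
    """One-way pass: the output depends only on the current input."""
    return np.tanh(W_in @ x)

def with_feedback(inputs):
    """Looped pass: each step's output is fed back in alongside the next input."""
    state = np.zeros(dim)
    outputs = []
    for x in inputs:
        state = np.tanh(W_in @ x + W_back @ state)
        outputs.append(state)
    return outputs

x = rng.normal(size=dim)

# Present the same input twice to each system.
ff_first, ff_second = feed_forward(x), feed_forward(x)
fb_first, fb_second = with_feedback([x, x])

print("feed-forward gives the same answer twice:", np.allclose(ff_first, ff_second))
print("feedback gives the same answer twice:   ", np.allclose(fb_first, fb_second))
```

The one-way function has no memory, so repeating an input repeats the output. In the looped version the second response is shaped by the first, which is a small-scale version of Goldstein's point that the end state becomes the next input.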
The research also opens the door to new kinds of interdisciplinary collaborations. Goldstein’s lab now explores how multimodal inputs—vision, sound, motion—might be integrated into AI systems that better reflect how people experience the world.
"If we can take the same modalities humans use—bodily, visual, auditory—and build models trained in similar ways, we might get much closer to modeling the brain," he says.
Looking ahead, Goldstein has his eye on something quieter. Not social chatter or reactive speech, but introspection.
"People talking to themselves, describing their internal state—that's where I'd like to go next," he says. "Not social interaction, but the quiet voice of the mind."
He believes modeling internal dialogue—our most private conversations—could offer profound insights into consciousness and cognition. But it’s also ethically fraught. What happens when machines can eavesdrop on our thoughts, even if imperfectly?
"We need to think seriously about surveillance, behavioral manipulation and unintended consequences," he warns. "I’m not alarmed personally, but we should be prepared. We need to allocate resources to understand how this kind of behavior might unfold."
Goldstein resists sensationalism. The brain is not a computer, and AI is not a brain. However, the similarities between the two may be more than superficial metaphors.
"This is a step forward," he says, "but there's still magic in how our brains piece together words on the fly."
