Date: 2025-05-20

While large language models (LLMs) are getting sharper with words, they’re sometimes getting fuzzier with facts.
These mistakes, known as hallucinations, aren’t harmless bugs. They point to a core issue in how AI systems generate language. Instead of pulling facts from a database, the models predict what sounds right based on patterns in their training data. That guesswork can lead to fake quotes, made-up policies and false claims delivered with confidence. Researchers are now working on new ways to make these systems more reliable, teaching them not only how to answer but also when to pause, revise or forget.
"What’s really broken is this non-deterministic response," Ruchir Puri, IBM's Chief Scientist, tells IBM Think in an interview. "The same question, with the same intent, can produce different answers depending on how it's phrased. That’s deeply problematic if you're relying on these models for anything serious."
OpenAI’s latest benchmark results highlight the issue. The o3 model reportedly hallucinated 33% of the time on PersonQA, a dataset testing factual accuracy about public figures. The o4-mini model did worse, inventing information in nearly 8 out of 10 responses to general knowledge prompts. These are not obscure systems—they’re being tested for tasks like legal research, healthcare queries and executive decision support.
Some experts say the data paints an incomplete picture and that hallucinations are not increasing across the board.
"We’re seeing real gains," Ja-Naé Duane, a data scientist and co-author of SuperShifts: Transforming How We Live, Learn and Work in the Age of Intelligence, tells IBM Think in an interview, adding that Gemini 2.0 Flash now produces hallucinations in under 1% of test cases, compared with 22% in 2021. "So yes, we have a long way to go, but we’re absolutely headed in the right direction."
Duane emphasizes that hallucinations haven’t necessarily gotten worse; rather, they’ve become more visible.
“The stakes are higher now,” she says. “We’re putting these models into legal workflows, medical settings and enterprise tools. A mistake that once went unnoticed in a chatbot is now a serious liability.”
While state-of-the-art systems like Gemini 2.0 Flash have sharply reduced hallucination rates, others—especially models built for complex reasoning—still struggle. “These reasoning-focused models are being pushed to solve harder problems,” Duane explains. “That means they’re often operating closer to the edge of what they can reliably do, which increases the risk of generating answers that sound right but aren’t.”
She argues that solving the problem requires more than scale. “It’s not just about building bigger models anymore,” she says. “We need architectures that understand not just what to say, but why it matters—and how to stay grounded in truth when it counts.”
Duane believes the real progress will come from pairing better models with systems designed to support them—memory, validators and agents working in tandem. “We’re entering a phase where model intelligence is only one piece of the puzzle,” she states. “Context management, real-time learning and adaptive tools will be equally important.”
Knowing how large language models work is essential to understanding why they sometimes get things wrong. LLMs predict the next word in a sentence based on patterns they've learned from large amounts of text. They aren't pulling facts from a database but making educated guesses. This can lead to answers that sound accurate but are false, especially when the topic is unclear, uncommon or beyond what the model has been trained on.
Hallucinations are challenging to eliminate because they are not bugs in the system; they are an inherent feature of how these probabilistic models work. When no solid pattern is available in the training data, or when a prompt is too vague or open-ended, the model may invent something that sounds plausible.
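To make the mechanism concrete, here is a minimal toy sketch of weighted next-word sampling. It is not how any production model is implemented; the candidate words, the scores and the helper names (`softmax`, `sample_next_word`) are all illustrative assumptions. It only shows why a strong pattern yields a stable answer while a weak pattern yields a confident-sounding guess.

```python
import math
import random

def softmax(scores, temperature=1.0):
    """Turn raw scores into probabilities; higher temperature flattens them."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_word(candidates, scores, rng, temperature=1.0):
    """Pick one candidate word at random, weighted by its probability."""
    probs = softmax(scores, temperature)
    return rng.choices(candidates, weights=probs, k=1)[0]

# Hypothetical candidates for the blank in "The capital of Australia is ___".
candidates = ["Canberra", "Sydney", "Melbourne"]
strong_pattern = [9.0, 2.0, 1.0]  # assumed scores: one answer dominates
weak_pattern = [1.2, 1.1, 1.0]    # assumed scores: no solid pattern

rng = random.Random(0)  # fixed seed so repeated runs draw the same samples
strong_draws = [sample_next_word(candidates, strong_pattern, rng) for _ in range(1000)]
weak_draws = [sample_next_word(candidates, weak_pattern, rng) for _ in range(1000)]

print("strong pattern, share of 'Canberra':", strong_draws.count("Canberra") / 1000)
print("weak pattern, share of 'Canberra':", weak_draws.count("Canberra") / 1000)
```

In the weak-pattern case every candidate is emitted as a fluent, complete answer, which is why a wrong guess can look exactly like a right one.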
There’s also a more philosophical question at play. When an AI model invents something, is it failing or creating?
Puri notes that as models become more powerful in their reasoning, they may also exhibit more “creative” behavior that borders on hallucination. “One could argue that creativity involves some kind of hallucination,” he says. “You imagine the unimaginable. But in enterprise applications, that’s a liability, not a strength.”
IBM Researcher Payel Das is among those trying to address the issue by rethinking how models handle information. “It’s the paradox of progress,” Das tells IBM Think in an interview. “These models are getting better at reasoning, but not necessarily at remembering. They can solve harder problems but still get the basics wrong.”
Her team at IBM has been developing Larimar, a memory augmentation system designed to give models a form of editable, short-term memory. The idea is to let models revise or forget facts as needed without retraining the entire system, a real-time flexibility that current LLMs largely lack.
“Models today are static and brittle,” she says. “You can’t teach them something mid-conversation or update their understanding without retraining them entirely. Larimar is a step toward making them more flexible.”
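The general idea of an editable memory sitting beside a static model can be sketched in a few lines. This is not Larimar's design or API; the class `EditableMemory`, the functions `answer` and `frozen_model`, and the example fact are hypothetical and chosen only to illustrate writing, revising and forgetting a fact without retraining.

```python
from typing import Callable, Dict, Optional

class EditableMemory:
    """Toy fact store that can be written, revised or forgotten at run time."""

    def __init__(self) -> None:
        self._facts: Dict[str, str] = {}

    def write(self, key: str, value: str) -> None:
        # Writing to an existing key revises the stored fact in place.
        self._facts[key] = value

    def forget(self, key: str) -> None:
        # Forgetting an unknown key is a no-op rather than an error.
        self._facts.pop(key, None)

    def read(self, key: str) -> Optional[str]:
        return self._facts.get(key)

def frozen_model(question_key: str) -> str:
    """Hypothetical stand-in for a model whose knowledge is fixed at training time."""
    stale_knowledge = {"ceo_of_acme": "Alice"}
    return stale_knowledge.get(question_key, "I don't know")

def answer(question_key: str, memory: EditableMemory,
           model: Callable[[str], str]) -> str:
    """Prefer a memory entry if one exists; otherwise fall back to the model."""
    remembered = memory.read(question_key)
    return remembered if remembered is not None else model(question_key)

memory = EditableMemory()
before = answer("ceo_of_acme", memory, frozen_model)    # model's stale answer
memory.write("ceo_of_acme", "Bob")                      # update mid-conversation
after = answer("ceo_of_acme", memory, frozen_model)     # memory overrides model
memory.forget("ceo_of_acme")                            # drop the edited fact
reverted = answer("ceo_of_acme", memory, frozen_model)  # back to the model
print(before, after, reverted)
```

The point of the sketch is the separation of concerns: the model's weights never change, yet its answers can be updated or rolled back by editing the memory alone.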
Other memory-based approaches are showing promise, too. MemReasoner, developed by IBM researchers, focuses on helping models reason more effectively across long sequences by selecting and connecting relevant information from earlier parts of a conversation. IBM’s CAMELoT project is designed to help models stay coherent when working with large volumes of text or extended interactions.
Outside the lab, companies like Vectara are building practical tools to tackle hallucinations. Vectara’s “guardian agents” monitor AI outputs in real time and rewrite errors before they reach users. Das says that while no single fix will solve the problem, combining memory and revision strategies is a strong step forward.
“We’ll never eliminate every mistake,” states Das. “Just like people make mistakes. But we can make models that are better at learning, adapting and correcting themselves. And that makes a huge difference.”
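The monitor-and-rewrite pattern mentioned above can be illustrated with a very small sketch. It does not reflect Vectara's product or any real API; the function `guard`, the structured-claim format and the sample data are assumptions made purely for illustration. A draft answer is treated as a set of field-level claims, each checked against a trusted reference before release.

```python
from typing import Dict, List, Tuple

def guard(claims: Dict[str, str],
          trusted: Dict[str, str]) -> Tuple[Dict[str, str], List[str]]:
    """Check each drafted claim against trusted facts before release.

    Supported claims pass through, contradicted claims are rewritten,
    and claims with no trusted source are withheld and flagged.
    """
    released: Dict[str, str] = {}
    notes: List[str] = []
    for field, value in claims.items():
        if field not in trusted:
            notes.append(f"withheld (unverified): {field}")
            continue
        if trusted[field] != value:
            notes.append(f"corrected: {field}")
        released[field] = trusted[field]
    return released, notes

# Hypothetical draft produced by a model, and a hypothetical trusted record.
draft = {"refund_window": "90 days", "support_email": "help@example.com",
         "loyalty_bonus": "20% off"}
reference = {"refund_window": "30 days", "support_email": "help@example.com"}

released, notes = guard(draft, reference)
print(released)
print(notes)
```

A real system would need to extract claims from free text and decide what counts as a trusted source, which is where most of the difficulty lies; the sketch only shows the control flow of checking before release.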