Date: 2025-09-18
Artificial intelligence has a confidence problem. The same large language models (LLMs) that generate fluent text for millions of users can also invent facts with equal poise, a flaw researchers call hallucination. And despite steady improvements in model accuracy, this tendency to produce wrong but plausible answers has proven stubbornly hard to fix.
A new study by OpenAI suggests the problem is not a mysterious glitch deep in the code, but a side effect of how researchers measure progress in AI. Benchmarks that rank models by accuracy can push them to guess rather than hold back, rewarding confident errors over admissions of uncertainty. It is a subtle incentive with wide consequences: the very scoreboards that drive competition in the field may be teaching systems to bluff.
“Evaluations are really at the heart of it, similar to how KPIs incentivize humans,” Ayhan Sebin, an AI Ecosystem and Partnership Development Executive at IBM, told IBM Think in an interview. “If the scoring system rewards guesses, then the models will learn to guess.”
Kate Soule, a Director of Technical Product Management for IBM’s Granite models, described the issue as a calibration problem. Benchmarks today reward models for always producing an answer, which favors risky guesses over withholding. But if models go too far in the other direction and refuse to answer at all, they are not very useful either.
“Right now, we are at one end of the spectrum, where accuracy is prioritized above all else,” she said on a recent episode of the Mixture of Experts podcast. “If we only go to the other end, where a model says ‘I don’t know’ for every answer, it is not very useful either. We need better reward functions and better evaluations that help us calibrate where on that spectrum models sit.”
Hallucinations are not new. From the earliest chatbots, users noticed that the programs could produce polished sentences filled with incorrect information, with the smoothness of the prose often making the errors difficult to detect.
The new research argues that the incentives for lying are baked in. By treating a wrong answer and an admission of “I don’t know” as equally bad, the benchmarks can encourage guessing, Santosh Vempala, a computer scientist at the Georgia Institute of Technology and a co-author of the paper, told IBM Think in an interview.
“If you do not know the answer but take a wild guess, you might get lucky and be right,” he said. “Leaving it blank guarantees a zero.”
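Vempala's point can be checked with a little arithmetic. The sketch below is illustrative only: the function names and the `wrong_penalty` parameter are assumptions made for this example, not something taken from the OpenAI paper or from any benchmark's scoring code. It assumes a correct answer earns 1 point, saying "I don't know" earns 0, and a wrong answer costs `wrong_penalty` points.

```python
def expected_score(p_correct: float, wrong_penalty: float = 0.0) -> float:
    """Expected score for answering when the model is right with probability p_correct.

    Assumed scoring (hypothetical, for illustration): +1 for a correct answer,
    -wrong_penalty for a wrong answer, 0 for abstaining.
    """
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty


def should_guess(p_correct: float, wrong_penalty: float = 0.0) -> bool:
    """Guessing pays off only if its expected score beats the 0 earned by abstaining."""
    return expected_score(p_correct, wrong_penalty) > 0.0


def break_even_confidence(wrong_penalty: float) -> float:
    """Confidence above which guessing beats abstaining: penalty / (1 + penalty)."""
    return wrong_penalty / (1.0 + wrong_penalty)


# Accuracy-only grading: a wrong answer and "I don't know" both score 0.
print(should_guess(p_correct=0.1, wrong_penalty=0.0))

# Grading that charges one point per wrong answer.
print(should_guess(p_correct=0.1, wrong_penalty=1.0))
print(break_even_confidence(wrong_penalty=1.0))
```

Under accuracy-only grading, even a 10 percent chance of being right makes guessing the better strategy, because nothing is lost by being wrong. Once a wrong answer costs a full point in this toy scheme, guessing only pays off when the model is more than 50 percent confident.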
The researchers tested their idea on the SimpleQA benchmark, where models can either answer or say, “I don’t know.” They found that o4-mini seemed more accurate than a smaller GPT-5 model, but it guessed far more often and was wrong 75 percent of the time, while the GPT-5 model abstained more and made fewer mistakes overall.
According to the OpenAI paper, language models are pushed to guess rather than admit uncertainty because most tests reward answers and penalize saying “I don’t know.” This makes them look better on leaderboard scores but less reliable in real-world use.
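The gap between leaderboard rank and reliability can be illustrated with a small tally. The sketch below is hypothetical: the `Tally` class, the two models and every count in it are invented for this example and are not results from SimpleQA or the OpenAI paper. It simply compares an accuracy-only score with one that subtracts a point for each wrong answer.

```python
from dataclasses import dataclass


@dataclass
class Tally:
    """Counts of graded responses for one model (all numbers here are invented)."""
    correct: int
    wrong: int
    abstained: int

    @property
    def total(self) -> int:
        return self.correct + self.wrong + self.abstained

    def accuracy(self) -> float:
        # Share of all questions answered correctly; abstentions count against it.
        return self.correct / self.total

    def error_rate(self) -> float:
        # Share of all questions answered incorrectly; abstentions are not errors.
        return self.wrong / self.total

    def penalized_score(self, wrong_penalty: float = 1.0) -> float:
        # Assumed alternative scoring: +1 per correct answer, -wrong_penalty per wrong one.
        return self.correct - wrong_penalty * self.wrong


# Hypothetical models: one answers everything, the other abstains when unsure.
guesser = Tally(correct=30, wrong=70, abstained=0)
cautious = Tally(correct=25, wrong=20, abstained=55)

for name, tally in [("guesser", guesser), ("cautious", cautious)]:
    print(name, tally.accuracy(), tally.error_rate(), tally.penalized_score())
```

In this made-up comparison the model that always answers wins on accuracy, yet it also makes far more mistakes. The ranking flips as soon as wrong answers carry a cost.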
Chris Hay, a Distinguished Engineer at IBM, underscored how reinforcement learning practices can encourage bad habits. “Because reinforcement learning is really, ‘You got this right, have a cookie,’ it essentially means that the lack of ‘I don’t know’ capability is reinforced,” he said on Mixture of Experts. “Models are penalized for abstaining and rewarded for guessing, and external benchmarks push providers to maximize accuracy scores, even if that increases hallucinations.”
That incentive problem has led the authors to focus on evaluation. Another option is to tweak training so that models are penalized less for saying “I don’t know.” But Vempala warned that this could upset the very balance that makes them sound fluent.
“A potential change would be to penalize ‘IDK’ less than incorrect next-token prediction during pre-training, but this might have other undesirable consequences,” Vempala said. “Since we do not fully understand why pre-training with next-token prediction and standard log loss works so well to generate entire documents, it is unclear if such a change in the objective might reduce overall performance.”
Fixing benchmarks may help, but Soule said that won’t solve everything. “There are always going to be hallucinations,” she said on the Mixture of Experts podcast. “We are going to need a combination of tools, symbolic approaches, and verification layers on top of the models to detect when a statement lacks evidence in the grounding context.”
The study lands as the industry races to cut hallucinations. IBM researchers are also exploring new approaches to the problem. One project, called Larimar, is designed to give models a form of short-term, editable memory. The idea is to allow AI systems to revise or discard information in real time rather than carry it forward indefinitely. That flexibility could reduce the risk of errors compounding or persisting, and it may help models stay accurate without requiring developers to engage in the costly process of retraining from scratch.
Larimar builds on the observation that current systems lack mechanisms to update specific facts once training is complete. By introducing a layer of memory that can be edited, the approach enables models to adjust to new or corrected information as they operate.
Payel Das, a Principal Research Staff Member and Manager of Trusted AI at IBM Research, described Larimar as a way of aligning model performance more closely with how humans remember, revise and sometimes forget.
“Models today are static and brittle,” Das told IBM Think in an interview. “You can’t teach them something mid-conversation or update their understanding without retraining them entirely. Larimar is a step toward making them more flexible.”
Hallucinations aren’t going away. But with new tools like Larimar and a better understanding of how training incentives fuel bluffing, researchers are finding ways to keep them in check.
