Date: 2025-09-18

Artificial intelligence has a confidence problem. The same large language models (LLMs) that generate fluent text for millions of users can also invent facts with equal poise, a flaw researchers call hallucination. And despite steady improvements in model accuracy, this tendency to produce wrong but plausible answers has proven stubbornly hard to fix.

A new study by OpenAI suggests the problem is not a mysterious glitch deep in the code, but a side effect of how researchers measure progress in AI. Benchmarks that rank models by accuracy can push them to guess rather than hold back, rewarding confident errors over admissions of uncertainty. It is a subtle incentive with wide consequences: the very scoreboards that drive competition in the field may be teaching systems to bluff.

“Evaluations are really at the heart of it, similar to how KPIs incentivize humans,” Ayhan Sebin, an AI Ecosystem and Partnership Development Executive at IBM, told IBM Think in an interview. “If the scoring system rewards guesses, then the models will learn to guess.”

Kate Soule, a Director of Technical Product Management for IBM’s Granite models, described the issue as a calibration problem. Benchmarks today reward models for always producing an answer, which favors risky guesses over withholding an answer. But if models go too far in the other direction and refuse to answer at all, they are not very useful either.

“Right now, we are at one end of the spectrum, where accuracy is prioritized above all else,” she said on a recent episode of the Mixture of Experts podcast. “If we only go to the other end, where a model says ‘I don’t know’ for every answer, it is not very useful either. We need better reward functions and better evaluations that help us calibrate where on that spectrum models sit.”
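The incentive Soule and Sebin describe can be made concrete with a little arithmetic. The sketch below is illustrative only and is not taken from the OpenAI study or from IBM: the scoring rule (one point for a correct answer, zero for abstaining, and a configurable penalty for a wrong answer) and the names `expected_score`, `should_answer`, and `break_even_confidence` are assumptions chosen for the example. Under accuracy-only scoring the penalty is zero, so a guess never scores worse than "I don't know" in expectation, however unsure the model is.

```python
# Hypothetical scoring rule for illustration: +1 for a correct answer,
# -wrong_penalty for a wrong answer, 0 for abstaining ("I don't know").
ABSTAIN_SCORE = 0.0


def expected_score(p_correct: float, wrong_penalty: float = 0.0) -> float:
    """Expected score of answering when the model is right with probability p_correct."""
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty


def should_answer(p_correct: float, wrong_penalty: float = 0.0) -> bool:
    """True if answering beats abstaining in expectation under this rule."""
    return expected_score(p_correct, wrong_penalty) > ABSTAIN_SCORE


def break_even_confidence(wrong_penalty: float) -> float:
    """Confidence above which answering beats abstaining: penalty / (1 + penalty)."""
    return wrong_penalty / (1.0 + wrong_penalty)


if __name__ == "__main__":
    for penalty in (0.0, 1.0, 3.0):
        threshold = break_even_confidence(penalty)
        print(f"wrong-answer penalty {penalty}: answer only above {threshold:.2f} confidence")
```

With no penalty the break-even confidence is zero, which is the "always guess" end of the spectrum. Raising the penalty moves the threshold up, and a very large penalty pushes a model toward the other extreme of refusing almost everything. Where to set it is the calibration question the quote raises.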