Date: 2024-10-14

OpenAI made history recently by securing a USD 6.6 billion investment to scale up its large language models by increasing their size, data volume and computational resources. Meanwhile, Anthropic’s CEO said his company already has USD 1 billion models in development, with USD 100 billion models coming soon.

But as spending balloons, new research published in Nature suggests that LLMs may in fact become less reliable as they grow.

The crux of the problem, according to researchers from the Polytechnic University of Valencia, is the assumption that as LLMs become more powerful and better aligned through strategies such as fine-tuning and filtering, they also become more reliable from a user’s perspective. Put differently, people may wrongly assume that as models become more powerful, their errors will follow a predictable pattern that humans can understand and adapt their queries to.

What a human finds difficult, however, is not necessarily the same as what an LLM finds difficult, the researchers found. Using old and new models of OpenAI’s ChatGPT, Meta’s Llama and BigScience’s BLOOM, the researchers tested core numerical, scientific and knowledge skills using tasks involving addition, vocabulary, geographical knowledge and basic and advanced science questions.

Overall, the study observed that newer, larger language models performed better on tasks that humans rated as more difficult, but they are still far from perfect on tasks that humans consider easy. As a result, there are no operating conditions under which these models can be trusted to be flawless. And because newer LLMs improve mainly on high-difficulty instances, the mismatch between what humans find difficult and where LLMs succeed grows wider.
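To make the idea of a trustworthy operating condition concrete, here is a minimal Python sketch of how one might check for it: group a model's answers by human-rated difficulty, compute accuracy per group, and look for any group that meets a reliability threshold. The function names, the difficulty labels and the toy records are all hypothetical illustrations, not the study's code or data.

```python
from collections import defaultdict


def accuracy_by_difficulty(records):
    """Return {difficulty_bin: fraction_correct} for (difficulty_bin, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for difficulty, is_correct in records:
        totals[difficulty] += 1
        if is_correct:
            correct[difficulty] += 1
    return {d: correct[d] / totals[d] for d in totals}


def safe_bins(accuracy, threshold=1.0):
    """Return the difficulty bins whose accuracy meets the threshold."""
    return sorted(d for d, acc in accuracy.items() if acc >= threshold)


# Hypothetical toy data for illustration only, NOT results from the study.
toy_records = [
    ("easy", True), ("easy", True), ("easy", False), ("easy", True),
    ("hard", True), ("hard", False), ("hard", False), ("hard", True),
]

acc = accuracy_by_difficulty(toy_records)
print(acc)             # per-bin accuracy
print(safe_bins(acc))  # bins where the model never erred
```

With this toy data no bin reaches the default threshold of 1.0, so the list of safe bins is empty, which is the pattern the researchers describe: even the tasks humans consider easy are not error-free.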

Rather than asking whether bigger LLMs are better, we should ask, “Can you fact-check a model quickly?” says Bishwaranjan Bhattacharjee, a Master Inventor at IBM. The problem, however, is that humans are bad at spotting errors made by the models and often misjudge incorrect model outputs as correct, even when given the option of saying “I’m not sure.”

“Errors have gone up substantially for newer LLMs, as they now rarely avoid answering questions beyond their competence,” says paper co-author Lexin Zhou. “The bigger problem is that these newer LLMs confidently provide incorrect responses.” People using an LLM for tasks in areas where they don’t have deep expertise may have a false sense of their reliability, as they can’t spot errors as easily. These findings indicate that humans are not well-equipped to serve as reliable supervisors of these models.