The artificial intelligence systems powering everything from search engines to software assistants are trained on vast amounts of data, fine-tuned by teams of engineers and benchmarked to within an inch of their synthetic lives. The results? Gleaming leaderboard scores. Impressive accuracy metrics. A lot of bragging rights.
But ask the people building those systems whether those benchmarks reflect how well AI works in the wild, and you’ll hear a different story. Now, a quiet shift is spreading across the AI world. From IBM to Amazon to academic labs at Stanford and Carnegie Mellon, a new generation of researchers is rethinking how to test what AI knows and what it can do.
“You end up with a model that seems smart, but has simply memorized the shape of the problem,” Gabe Goodhart, Chief Architect of AI Open Innovation at IBM, tells IBM Think in an interview. “It’s not that different from memorizing a few hundred practice questions before a standardized exam.”
If the new generation of benchmarks has a rallying cry, it might be this: stop testing AI in a vacuum. Instead, researchers are increasingly seeking to evaluate AI in context, alongside its users and inside real-world interactions. That’s the argument from Maarten Sap, an Assistant Professor at Carnegie Mellon University who studies how AI behaves in the wild.
“Benchmarks today don’t reflect how people actually use AI,” he tells IBM Think. “They’re misaligned, often based on factoid-style tasks that resemble trivia—not real-life queries. And worse, we don’t even know if models are truly solving them, because they may have seen the answers during training.”
Sap, who has conducted research showing that models frequently fail when users speak in dialects or use informal language, says the field needs to stop treating AI as a solo performer. “We need to evaluate the AI-user interaction—the dyad—not the AI system alone,” he says. “That’s where things fall apart, especially over multiple turns of conversation.”
Benchmarks were once the gold standard of AI progress: clean, controlled evaluations that compare models on a shared task. In technical circles, benchmarks provide a reproducible way to measure performance across specific datasets. But in the age of foundation models—vast neural networks trained to perform a wide range of tasks across multiple domains—those once-reliable tests are beginning to show their limits.
Not surprisingly, many of today’s models ace tests they’ve essentially seen before. That’s because benchmark datasets often get swept into the enormous collections of training data used to build these models. As Goodhart puts it: “The real test isn’t if a model scores well on a test—it’s whether it helps you get something done.”
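One way researchers probe that memorization problem is to check how much of a benchmark already appears in training text. The sketch below is a minimal, hypothetical contamination check based on n-gram overlap; the function names, the 8-word window and the threshold are illustrative assumptions, not the decontamination method of any particular lab.

```python
# Illustrative sketch only: a crude n-gram overlap check for benchmark
# contamination, i.e. whether test items already appear in training text.
# This is a toy example, not any lab's actual decontamination pipeline.
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of n-word sequences in a piece of text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def looks_contaminated(benchmark_item: str, training_docs: list[str],
                       n: int = 8, threshold: float = 0.5) -> bool:
    """Flag the item if a large share of its n-grams appear in any training doc."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return False
    for doc in training_docs:
        overlap = len(item_grams & ngrams(doc, n)) / len(item_grams)
        if overlap >= threshold:
            return True
    return False
```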
These scores still carry weight. High performance on a benchmark can attract funding, headlines and business deals. But researchers behind Stanford’s HELM (Holistic Evaluation of Language Models) project, along with Pratiksha Thaker, a postdoctoral researcher at Carnegie Mellon, have shown that these results can hide flaws. Benchmarks designed to test whether models can “unlearn” sensitive or toxic information, for example, often break down when it’s unclear what information needs to be forgotten.
IBM’s alternative is LiveXiv, a benchmark that updates every month with fresh material pulled from arXiv, an online repository of scientific papers. Rather than presenting models with a static set of questions, LiveXiv feeds them unseen charts, tables and prompts, testing whether they can make sense of something new.
LiveXiv is part of a broader category of benchmarks designed to test generalization: a model’s ability to perform well on inputs that differ in structure or style from its training data. For vision-language models—systems that combine text with images—Goodhart says this is especially important. It’s one thing to recognize an object in a photo; it’s another to reason through a messy figure caption from a particle physics paper.
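To make the idea concrete, here is a minimal sketch of a “living benchmark” refresh loop in the spirit of LiveXiv: pull material published after a cutoff date, turn it into questions, and score the model only on items it could not have seen during training. The helper names (`fetch_recent_figures`, `model`) and the canned examples are hypothetical stand-ins, not IBM’s actual pipeline.

```python
# Illustrative sketch only: a simplified "living benchmark" loop in the spirit
# of LiveXiv. All helpers here are hypothetical stand-ins for a real pipeline
# that would query arXiv and call a vision-language model.
import datetime
import random

def fetch_recent_figures(since: datetime.date) -> list[dict]:
    """Pretend to pull figure/table captions from papers published after `since`.
    A real pipeline would query the arXiv API; here we return canned examples."""
    return [
        {"caption": "Fig. 2: Accuracy vs. training steps for three model sizes.",
         "question": "Which model size reaches 80% accuracy first?",
         "answer": "the largest model"},
        {"caption": "Table 1: Latency (ms) per query across retrieval depths.",
         "question": "Does latency grow with retrieval depth?",
         "answer": "yes"},
    ]

def model(prompt: str) -> str:
    """Stand-in for a vision-language model call; answers randomly here."""
    return random.choice(["yes", "no", "the largest model", "unclear"])

def run_monthly_refresh() -> float:
    """Rebuild the test set from material newer than the last refresh,
    so models are always judged on items they could not have memorized."""
    cutoff = datetime.date.today() - datetime.timedelta(days=30)
    items = fetch_recent_figures(since=cutoff)
    correct = 0
    for item in items:
        prompt = f"{item['caption']}\nQuestion: {item['question']}"
        if model(prompt).strip().lower() == item["answer"].lower():
            correct += 1
    return correct / len(items)

if __name__ == "__main__":
    print(f"Fresh-data accuracy: {run_monthly_refresh():.0%}")
```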
Amazon has launched a similar challenge in the coding domain. SWE-PolyBench, a multilingual coding benchmark, expands on SWE-Bench by adding tasks in Java, JavaScript and TypeScript, in addition to Python. It assesses how well AI agents can navigate real repositories, understand code in context and generate functional fixes.
Traditional benchmarks check whether an agent produces working code. SWE-PolyBench goes further, measuring whether a model can correctly identify which files and even which parts of the code—called “syntax tree nodes”—need to change.
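A rough sketch of how that kind of localization scoring might look: compare the files and syntax-tree nodes touched by the reference fix with those touched by the agent’s patch. This is an illustrative recall-style metric using Python’s `ast` module, not Amazon’s actual SWE-PolyBench scoring code; the function names and data layout are assumptions.

```python
# Illustrative sketch only: scoring "localization", i.e. whether an agent
# edited the right files and the right syntax-tree nodes. Not Amazon's
# SWE-PolyBench code; names and structures here are hypothetical.
import ast

def changed_nodes(source: str, changed_lines: set[int]) -> set[str]:
    """Return names of functions/classes whose bodies overlap the changed lines."""
    tree = ast.parse(source)
    hits = set()
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            span = range(node.lineno, (node.end_lineno or node.lineno) + 1)
            if changed_lines & set(span):
                hits.add(node.name)
    return hits

def localization_scores(gold: dict[str, set[int]], predicted: dict[str, set[int]],
                        sources: dict[str, str]) -> dict[str, float]:
    """File-level and node-level recall: did the agent touch what the true fix touched?"""
    gold_files, pred_files = set(gold), set(predicted)
    file_recall = len(gold_files & pred_files) / max(len(gold_files), 1)

    gold_nodes = {(f, n) for f, lines in gold.items()
                  for n in changed_nodes(sources[f], lines)}
    pred_nodes = {(f, n) for f, lines in predicted.items() if f in sources
                  for n in changed_nodes(sources[f], lines)}
    node_recall = len(gold_nodes & pred_nodes) / max(len(gold_nodes), 1)
    return {"file_recall": file_recall, "node_recall": node_recall}

if __name__ == "__main__":
    src = "def fix_bug():\n    return 1\n\ndef unrelated():\n    return 2\n"
    sources = {"app.py": src}
    gold = {"app.py": {2}}   # the true fix touched line 2 (inside fix_bug)
    pred = {"app.py": {5}}   # the agent edited line 5 (inside unrelated)
    print(localization_scores(gold, pred, sources))
    # -> {'file_recall': 1.0, 'node_recall': 0.0}
```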
In Amazon’s evaluations, performance dropped sharply outside of Python, especially on tasks that required updating multiple files. The benchmark also revealed weaknesses in handling feature requests, which are more open-ended than bug fixes, and in changes that span multiple classes.
Another underexplored area is dialogue. In January 2025, IBM released MTRAG, a benchmark to evaluate conversational retrieval-augmented generation (RAG). RAG works by retrieving relevant documents or snippets before generating a response, helping models stay grounded in facts. It’s widely used in chatbots, virtual assistants and enterprise search systems.
MTRAG tests how well models handle live, multi-turn conversations across domains like finance, government services and IT documentation. Some questions are intentionally unanswerable, to see whether the model bluffs or admits it doesn’t know. Others require the model to keep track of previous turns in a chat, mimicking real dialogue flow.
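In outline, the retrieve-then-generate pattern that MTRAG stresses looks something like the sketch below: rank documents against the query, assemble a prompt that includes prior turns, and leave room for the model to say it doesn’t know. The toy corpus, the word-overlap retriever and the prompt wording are all illustrative assumptions, not MTRAG’s actual setup.

```python
# Illustrative sketch only: the basic retrieve-then-generate pattern behind RAG.
# The corpus, the naive scoring and the prompt template are hypothetical
# stand-ins, not MTRAG or any specific IBM system.
CORPUS = {
    "refund_policy.md": "Refunds are issued within 14 days of purchase.",
    "vpn_setup.md": "Connect to the corporate VPN before accessing internal tools.",
    "tax_forms.md": "Form W-2 is distributed to employees by January 31.",
}

def retrieve(query: str, k: int = 2) -> list[str]:
    """Rank documents by naive word overlap; real systems use dense embeddings."""
    q_words = set(query.lower().split())
    scored = sorted(CORPUS.items(),
                    key=lambda kv: len(q_words & set(kv[1].lower().split())),
                    reverse=True)
    return [text for _, text in scored[:k]]

def build_prompt(query: str, history: list[str]) -> str:
    """Ground the model's answer in retrieved snippets plus prior turns,
    and invite it to admit when the context does not contain the answer."""
    context = "\n".join(retrieve(query))
    turns = "\n".join(history)
    return (f"Conversation so far:\n{turns}\n\n"
            f"Relevant documents:\n{context}\n\n"
            f"Question: {query}\n"
            f"If the documents do not answer the question, say you don't know.")

# A model call (not shown) would then complete build_prompt(...) to produce
# the grounded, multi-turn answer that benchmarks like MTRAG evaluate.
```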
The Granite model, IBM’s flagship LLM, performed respectably—but like its peers, it stumbled on ambiguous questions and follow-ups. However, the benchmark is already helping guide model updates, IBM said in a research article, and the company plans to expand it with more domains and finer-grained citation tracking.
From LiveXiv to SWE-PolyBench to MTRAG, one trend stands out, Goodhart says: benchmarks are shifting from static, answer-based testing to more dynamic, interaction-driven evaluations. It’s no longer enough for an AI model to simply be correct—it needs to be useful. It needs to adapt.
“What we care about is how these systems help people work,” says Goodhart. “That’s not something you can capture with one number.”
In other words, Goodhart says, these new benchmarks represent a needed shift from showmanship to substance.
“I don’t need a model that passes the bar exam,” Goodhart says. “I need one that helps me finish my job before lunch.”