Beyond the leaderboard: The quiet revolt against AI’s report card

29 April 2025

Author: Sascha Brodsky, Tech Reporter, Editorial Lead, IBM

The artificial intelligence systems powering everything from search engines to software assistants are trained on vast amounts of data, fine-tuned by teams of engineers and benchmarked to within an inch of their synthetic lives. The results? Gleaming leaderboard scores. Impressive accuracy metrics. A lot of bragging rights.

But ask the people building those systems whether those benchmarks reflect how well AI works in the wild, and you’ll hear a different story. Now, a quiet shift is spreading across the AI world. From IBM to Amazon to academic labs at Stanford and Carnegie Mellon, a new generation of researchers is rethinking how to test what AI knows and what it can do.

“You end up with a model that seems smart, but has simply memorized the shape of the problem,” Gabe Goodhart, Chief Architect of AI Open Innovation at IBM, tells IBM Think in an interview. “It’s not that different from memorizing a few hundred practice questions before a standardized exam.”

If the new generation of benchmarks has a rallying cry, it might be this: stop testing AI in a vacuum. Instead, researchers are increasingly seeking to evaluate AI in context, alongside its users and inside real-world interactions. That’s the argument from Maarten Sap, an Assistant Professor at Carnegie Mellon University who studies how AI behaves in the wild.

“Benchmarks today don’t reflect how people actually use AI,” he tells IBM Think. “They’re misaligned, often based on factoid-style tasks that resemble trivia—not real-life queries. And worse, we don’t even know if models are truly solving them, because they may have seen the answers during training.”

Sap, who has conducted research showing that models frequently fail when users speak in dialects or use informal language, says the field needs to stop treating AI as a solo performer. “We need to evaluate the AI-user interaction—the dyad—not the AI system alone,” he says. “That’s where things fall apart, especially over multiple turns of conversation.”

The benchmark mirage

Benchmarks were once the gold standard of AI progress: clean, controlled evaluations that compare models on a shared task. In technical circles, benchmarks provide a reproducible way to measure performance across specific datasets. But in the age of foundation models—vast neural networks trained to perform a wide range of tasks across multiple domains—those once-reliable tests are beginning to show their limits.

Not surprisingly, many of today’s models ace tests they’ve essentially seen before. That’s because benchmark datasets often get swept into the enormous collections of training data used to build these models. As Goodhart puts it: “The real test isn’t if a model scores well on a test—it’s whether it helps you get something done.”
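One common way to probe for that kind of leakage is to check for verbatim overlap between benchmark items and the training corpus. The sketch below is a minimal, hypothetical n-gram contamination check; the window size and example strings are illustrative assumptions, not part of any particular benchmark's tooling.

```python
# Hypothetical n-gram contamination check: flag benchmark items that share
# long word sequences with the training data. The word-window default and the
# example strings are illustrative assumptions, not any benchmark's tooling.

def ngrams(text: str, n: int = 13) -> set[str]:
    """Return the set of lowercase word n-grams in a piece of text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_rate(benchmark_items: list[str], training_docs: list[str], n: int = 13) -> float:
    """Fraction of benchmark items that share at least one n-gram with the training data."""
    train_grams: set[str] = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    flagged = sum(1 for item in benchmark_items if ngrams(item, n) & train_grams)
    return flagged / max(len(benchmark_items), 1)

if __name__ == "__main__":
    train = ["the capital of france is paris and has been since the tenth century"]
    bench = ["Q: the capital of France is Paris and has been since the tenth century"]
    print(f"Contaminated fraction: {contamination_rate(bench, train, n=5):.2f}")  # 1.00
```

In practice the corpora behind foundation models are so large, and often so opaque, that an audit like this is hard to run at all, which is part of the appeal of benchmarks built from material a model could not have seen.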

These scores still carry weight. High performance on a benchmark can attract funding, headlines and business deals. But researchers behind Stanford’s HELM evaluation project, along with Pratiksha Thaker, a postdoctoral researcher at Carnegie Mellon, have shown that these results can hide flaws. Benchmarks designed to test whether models can “unlearn” sensitive or toxic information, for example, often break down when it’s unclear what information needs to be forgotten.


LiveXiv and the return of the unfamiliar

IBM’s alternative is LiveXiv, a benchmark that updates every month with fresh material pulled from arXiv, an online repository of scientific papers. Rather than presenting models with a static set of questions, LiveXiv feeds them unseen charts, tables and prompts, testing whether they can make sense of something new.
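IBM has not published LiveXiv's ingestion pipeline here, but the core refresh idea, pulling material too new to have appeared in any training set and turning it into questions, can be sketched against arXiv's public Atom API. The category, result count and question template below are assumptions for illustration, not LiveXiv's actual code.

```python
# Illustrative sketch of a "live" benchmark refresh using arXiv's public Atom API.
# This is not LiveXiv's pipeline; the category, result count and question
# template are assumptions for demonstration only.
import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

def fetch_recent_papers(category: str = "cs.CV", max_results: int = 5) -> list[dict]:
    """Pull the newest submissions in a category so the material postdates model training."""
    url = (
        "http://export.arxiv.org/api/query"
        f"?search_query=cat:{category}"
        f"&sortBy=submittedDate&sortOrder=descending&max_results={max_results}"
    )
    with urllib.request.urlopen(url) as resp:
        feed = ET.fromstring(resp.read())

    papers = []
    for entry in feed.findall(f"{ATOM}entry"):
        papers.append({
            "title": entry.findtext(f"{ATOM}title", "").strip(),
            "abstract": entry.findtext(f"{ATOM}summary", "").strip(),
        })
    return papers

def build_questions(papers: list[dict]) -> list[str]:
    """Turn each fresh paper into a simple comprehension prompt (a stand-in for
    LiveXiv's chart- and table-based questions)."""
    return [f"Summarize the main claim of the paper titled: {p['title']}" for p in papers]

if __name__ == "__main__":
    monthly_batch = build_questions(fetch_recent_papers())
    for question in monthly_batch:
        print(question)
```

Because the material is drawn from the most recent submissions, a model that scores well is demonstrating comprehension rather than recall, which is the property LiveXiv is built to isolate.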

LiveXiv is part of a broader category of benchmarks designed to test generalization: a model’s ability to perform well on inputs that differ in structure or style from its training data. For vision-language models—systems that combine text with images—Goodhart says this is especially important. It’s one thing to recognize an object in a photo; it’s another to reason through a messy figure caption from a particle physics paper.

A new challenger

Amazon has launched a similar challenge in the coding domain. SWE-PolyBench, a multilingual coding benchmark, expands on SWE-Bench by adding tasks in Java, JavaScript and TypeScript, in addition to Python. It assesses how well AI agents can navigate real repositories, understand code in context and generate functional fixes.

Traditional benchmarks check if an agent produces working code. SWE-PolyBench goes further, measuring whether a model can correctly identify which files and even which parts of the code—called “syntax tree nodes”—need to change.
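To make “syntax tree nodes” concrete: in a language like Python, those nodes are the function and class definitions that make up a file's abstract syntax tree. The snippet below is a small illustration using Python's standard ast module; the toy source file and the idea that an agent's patch should touch one specific node are assumptions for demonstration, not Amazon's evaluation code.

```python
# Minimal illustration of what a "syntax tree node" is: the function and class
# definitions in a source file's abstract syntax tree. This is not SWE-PolyBench's
# evaluation code, just a sketch using Python's standard ast module.
import ast

SOURCE = """
class PaymentProcessor:
    def charge(self, amount):
        return amount * 1.02   # bug: hard-coded fee

def refund(amount):
    return -amount
"""

def definition_nodes(source: str) -> list[str]:
    """Return the names and line numbers of function/class definition nodes."""
    tree = ast.parse(source)
    nodes = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            nodes.append(f"{type(node).__name__} '{node.name}' at line {node.lineno}")
    return nodes

if __name__ == "__main__":
    # A benchmark in this style can then check whether an agent's patch touches
    # the right node (here, presumably the charge method).
    for node in definition_nodes(SOURCE):
        print(node)
```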

Performance dropped sharply outside of Python, especially on tasks that required updating multiple files. The benchmark also revealed weaknesses in handling feature requests, which are more open-ended than bug fixes or multi-class changes.


Testing the limits of conversation

Another underexplored area is dialogue. In January 2025, IBM released MTRAG, a benchmark to evaluate conversational retrieval-augmented generation (RAG). RAG works by retrieving relevant documents or snippets before generating a response, helping models stay grounded in facts. It’s widely used in chatbots, virtual assistants and enterprise search systems.
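The retrieve-then-generate loop the acronym describes is simple to sketch. In the toy version below, a keyword scorer stands in for a real vector index and the generate step is a placeholder for an LLM call; none of it reflects MTRAG's or any product's internals.

```python
# Toy retrieval-augmented generation loop: retrieve supporting passages first,
# then ground the answer in them. The keyword scorer and generate() placeholder
# stand in for a real vector index and LLM call.
DOCUMENTS = [
    "Form 1040 is the standard U.S. individual income tax return.",
    "Granite is IBM's family of open foundation models.",
    "RAG systems retrieve supporting passages before generating an answer.",
]

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Rank documents by naive keyword overlap with the query."""
    q_words = set(query.lower().split())
    ranked = sorted(docs, key=lambda d: len(q_words & set(d.lower().split())), reverse=True)
    return ranked[:k]

def generate(query: str, context: list[str]) -> str:
    """Placeholder for an LLM call: build the grounded prompt a model would see."""
    prompt = "Answer using only this context:\n" + "\n".join(context) + f"\n\nQuestion: {query}"
    # A real system would send `prompt` to a model; here we just return it.
    return prompt

if __name__ == "__main__":
    question = "What do RAG systems do before generating an answer?"
    print(generate(question, retrieve(question, DOCUMENTS)))
```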

MTRAG tests how well models handle live, multi-turn conversations across domains like finance, government services and IT documentation. Some questions are intentionally unanswerable, to see if the model bluffs or admits it doesn’t know. Others require the model to keep track of previous turns in a chat, mimicking real dialogue flow.
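One way to picture that kind of evaluation is as a list of conversation turns, some flagged as unanswerable, with scoring that rewards abstention on those. The data structure and abstention check below are an illustrative sketch, not MTRAG's schema or scoring code.

```python
# Illustrative multi-turn evaluation record in the spirit of MTRAG: some turns
# are intentionally unanswerable, and scoring rewards the model for admitting it
# doesn't know. This is not MTRAG's actual schema or scoring code.
from dataclasses import dataclass

@dataclass
class Turn:
    question: str
    answerable: bool       # False: the retrieved documents cannot answer this
    model_response: str

ABSTAIN_MARKERS = ("i don't know", "i do not know", "cannot find", "not in the documents")

def turn_score(turn: Turn) -> bool:
    """Credit an unanswerable turn only if the model abstains; credit an answerable
    turn only for not abstaining (a real benchmark would also grade correctness)."""
    abstained = any(marker in turn.model_response.lower() for marker in ABSTAIN_MARKERS)
    return abstained if not turn.answerable else not abstained

conversation = [
    Turn("What is the fee for a duplicate license?", True,
         "The fee is $25, per the 2024 fee schedule."),
    Turn("And what will it be in 2027?", False,
         "The fee will rise to $40 in 2027."),  # a bluff: no document says this
]

score = sum(turn_score(t) for t in conversation) / len(conversation)
print(f"Per-turn score: {score:.2f}")  # 0.50: the model bluffed on the unanswerable turn
```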

The Granite model, IBM’s flagship LLM, performed respectably—but like its peers, it stumbled on ambiguous questions and follow-ups. However, the benchmark is already helping guide model updates, IBM said in a research article, and the company plans to expand it with more domains and finer-grained citation tracking.

A new philosophy for a new era

From LiveXiv to SWE-PolyBench to MTRAG, one trend stands out, Goodhart says: benchmarks are shifting from static, answer-based testing to more dynamic, interaction-driven evaluations. It’s no longer enough for an AI model to simply be correct—it needs to be useful. It needs to adapt.

“What we care about is how these systems help people work,” says Goodhart. “That’s not something you can capture with one number.”

In other words, Goodhart says, these new benchmarks represent a needed shift from showmanship to substance.

“I don’t need a model that passes the bar exam,” Goodhart says. “I need one that helps me finish my job before lunch.”
