A hot topic if there ever was one, benchmarks have become a central point of debate now that AI capabilities are advancing so quickly that they consistently outpace the tools used to measure them.

“Every year, we look at how these algorithms are performing across benchmarks, and every year it seems like they're beating those benchmarks,” says Vanessa Parli, one of the report authors, in an interview with IBM Think. “Similarly, this year, that is happening even with the newer benchmarks.”

The report noted that in 2023, researchers introduced three new benchmarks (MMMU, GPQA and SWE-bench) to test the limits of advanced AI systems. Just a year later, performance had increased sharply: scores rose by 18.8, 48.9 and 67.3 percentage points on MMMU, GPQA and SWE-bench, respectively, according to the report.

This has left the research community uncertain about the true meaning, and the value, of an LLM benchmark. Parli poses critical questions for consideration: “Are we measuring the right thing? Are those benchmarks compromised? And how should the scientific community evaluate models?”

Thinking ahead, Ash Minhas also questions what the future of benchmarking will look like. “Where is that going to stop?” he asks in an interview with IBM Think. “Is the Turing Test going to have to constantly be a moving goal post? Is humanity's last exam really the last exam?”

Meanwhile, experts caution against the risk of overfitting, a phenomenon in which an AI model learns to perform exceptionally well on specific benchmark tests but fails to generalize to new, unseen data in real-world applications. “Are we just training the model to pass the benchmark?” Minhas adds. “MMMU is a good benchmark, but is it because the model knows how to respond to the benchmark?”

Minhas also warns that the excitement and momentum of progress could be taking priority over attention to ethics, fairness and bias.