Did a new model cheat on a given benchmark? Which benchmark is best? And what does "best" even mean when each benchmark measures performance on a different task?
These questions make experts like IBM’s Senior Research Scientist Marina Danilevsky approach model evaluation with caution. “Performing well on a benchmark is just that—performing well on that benchmark,” she tells IBM Think. Transparency is key, she says. “We need to acknowledge the many things that a given benchmark does not test, so that the next benchmarks address some of those holes.”
In contrast to the quest for a single, definitive benchmark, new tools are shifting control to users. A team from the open-source AI platform Hugging Face recently launched YourBench, an open-source tool that lets enterprises and developers use their own data to create custom benchmarks for evaluating model performance. Most benchmarks test “general capabilities,” says Hugging Face researcher Sumuk Shashidhar in an IBM Think interview. “For many use cases in real life, what matters most is how well a model performs your specific task,” he says.
To make benchmarks more useful for real-life applications, YourBench automatically generates domain-tailored benchmarks directly from documents provided by the user, cheaply and without manual annotation, says Shashidhar. The researchers demonstrate YourBench’s efficacy by replicating seven diverse subsets of MMLU (Massive Multitask Language Understanding), a benchmark that evaluates how well language models understand and apply knowledge across a range of subjects, for under USD 15 in total inference cost while preserving the relative model performance rankings.
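To make that pipeline concrete, here is a minimal, hypothetical sketch in Python of how a document-to-benchmark generator in the spirit of YourBench could be structured. The function names, prompt wording and grading hooks are illustrative assumptions, not YourBench’s actual API; the LLM client is passed in as a plain callable so the sketch stays model-agnostic.

```python
"""Illustrative sketch only: a generic document-to-benchmark pipeline in the
spirit of tools like YourBench. Names and prompts are hypothetical."""

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class BenchmarkItem:
    context: str   # source passage the question was generated from
    question: str
    answer: str    # reference answer used for grading


def build_benchmark(documents: List[str],
                    generate: Callable[[str], str],
                    questions_per_doc: int = 3) -> List[BenchmarkItem]:
    """Use any text-generation callable (an LLM client wrapper you supply)
    to draft question/answer pairs grounded in each document."""
    items: List[BenchmarkItem] = []
    for doc in documents:
        for _ in range(questions_per_doc):
            # Hypothetical prompt format; a real pipeline would also
            # deduplicate, validate and difficulty-filter the output.
            raw = generate(
                "Write one exam question answerable only from the passage, "
                "then the answer, separated by '|||'.\n\nPassage:\n" + doc
            )
            if "|||" in raw:
                question, answer = (part.strip() for part in raw.split("|||", 1))
                items.append(BenchmarkItem(doc, question, answer))
    return items


def score_model(items: List[BenchmarkItem],
                answer_fn: Callable[[str, str], str],
                judge_fn: Callable[[str, str], bool]) -> float:
    """Fraction of items the candidate model answers correctly, where
    judge_fn compares a model answer with the reference (exact match here,
    often an LLM-as-judge call in practice)."""
    correct = sum(
        judge_fn(answer_fn(item.context, item.question), item.answer)
        for item in items
    )
    return correct / max(len(items), 1)
```

Most of the engineering effort in a production version goes into the parts this sketch waves away: deduplicating questions, filtering for difficulty and grounding, and grading free-form answers reliably.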
Some companies, IBM among them, have already developed custom benchmark generators similar to YourBench. “This reminds me of our homegrown pipeline for creating synthetic data for training or for evaluation,” says Danilevsky. “Creating synthetic data is easy. Creating good synthetic data is hard,” she says. “So while YourBench is effective with MMLU subsets, that does not translate to being good at anything I throw at it.”
Another alternative that has soared in popularity is Chatbot Arena (CA), a crowdsourced benchmark. Instead of rigorous math or language tests, Chatbot Arena lets users ask a question, get answers from two anonymous AI models and rate which one is better.
Started by two University of California, Berkeley graduate students, CA now gets early access to models from all the major AI players so enthusiasts can battle bots against one another, “creating suspense and gamifying model evaluation,” says CA co-founder Anastasios Angelopoulos in an IBM Think interview. The CA leaderboard, like a Billboard Hot 100 for AI models, has received over two million votes to date.
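Mechanically, turning those head-to-head votes into a leaderboard is a rating problem; Chatbot Arena’s published methodology relies on Elo- and Bradley-Terry-style models. The sketch below is a simplified Elo update over a list of votes, written as an illustration rather than the Arena’s actual code.

```python
"""Simplified sketch: aggregating pairwise "which answer is better" votes
into Elo-style ratings. Not Chatbot Arena's codebase."""

from collections import defaultdict
from typing import Dict, List, Tuple

# Each vote: (model_a, model_b, winner) where winner is "a", "b" or "tie".
Vote = Tuple[str, str, str]


def elo_ratings(votes: List[Vote],
                k: float = 32.0,
                base: float = 1000.0) -> Dict[str, float]:
    """Sequentially update ratings from pairwise outcomes."""
    ratings: Dict[str, float] = defaultdict(lambda: base)
    for model_a, model_b, winner in votes:
        ra, rb = ratings[model_a], ratings[model_b]
        # Expected score of model_a under the Elo model.
        expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
        score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
        ratings[model_a] = ra + k * (score_a - expected_a)
        ratings[model_b] = rb + k * ((1.0 - score_a) - (1.0 - expected_a))
    return dict(ratings)


if __name__ == "__main__":
    sample_votes = [("model-x", "model-y", "a"),
                    ("model-y", "model-z", "tie"),
                    ("model-x", "model-z", "a")]
    for model, rating in sorted(elo_ratings(sample_votes).items(),
                                key=lambda kv: kv[1], reverse=True):
        print(f"{model}: {rating:.1f}")
```

Running the example prints a ranking from a handful of sample votes; the real leaderboard works from millions of votes and additionally reports confidence intervals around each rating.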
Because he tracks new models closely, Angelopoulos was less surprised than many when DeepSeek-R1 soared in popularity. “Open source models have been catching up for some time, so DeepSeek only confirmed that trend.”
The founders created Chatbot Arena in response to frustration with traditional benchmarks. Part of the challenge, says Angelopoulos, is that “benchmarks are static—certain models get very good at specific benchmarks.” As a result, there is a risk of “overfitting data,” he says, in which a model learns a benchmark’s data too well rather than the underlying skill. The benefit of Chatbot Arena, he adds, is that the data is live. “You can’t overfit the data. It doesn’t become contaminated or stale.”
“The Chatbot Arena leaderboard aggregate by itself is not actionable,” Danilevsky says. “Having more nuanced feedback on a model beyond a thumbs-up and thumbs-down is needed for many real-world applications.” Still, she acknowledges that the concept is very popular. “I would just want a bit more understanding of how and why people are responding as they do to a given model. Additional metadata would be really useful here.”
Even Angelopoulos believes “real use is measuring something different than benchmarks.” He uses OpenAI’s GPT-4.5 model as an example. “It didn’t perform well on many qualitative benchmarks, but people loved it. You need a different tool to measure the vibe of a model.”