Did a new model cheat on a given benchmark? Which benchmark is best? And what does "best" even mean when each benchmark measures performance on a different task?
These questions make experts like IBM’s Senior Research Scientist Marina Danilevsky approach model evaluation with caution. “Performing well on a benchmark is just that—performing well on that benchmark,” she tells IBM Think. Transparency is key, she says. “We need to acknowledge the many things that a given benchmark does not test, so that the next benchmarks address some of those holes.”
In contrast to the quest for a single, definitive benchmark, new tools are shifting control to users. A team from the open-source AI platform Hugging Face recently launched YourBench, an open-source tool that lets enterprises and developers use their own data to create custom benchmarks for evaluating model performance. Most benchmarks test “general capabilities,” says Hugging Face researcher Sumuk Shashidhar in an IBM Think interview. “For many use cases in real life, what matters most is how well a model performs your specific task,” he says.
To make benchmarks more useful for real-life applications, YourBench automatically generates domain-tailored benchmarks directly from documents provided by the user, cheaply and without manual annotation, says Shashidhar. The researchers demonstrate YourBench’s efficacy by replicating seven diverse subsets of MMLU (Massive Multitask Language Understanding), a benchmark that evaluates how well language models understand and apply knowledge across a range of subjects, for under USD 15 in total inference cost while preserving the relative model performance rankings.
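To make that pipeline concrete, here is a minimal, hypothetical sketch in Python of how a document-to-benchmark generator in the spirit of YourBench could be structured. The function names, prompt wording and grading hooks are illustrative assumptions, not YourBench’s actual API; the LLM client is passed in as a plain callable so the sketch stays model-agnostic.

```python
"""Illustrative sketch only: a generic document-to-benchmark pipeline in the
spirit of tools like YourBench. Names and prompts are hypothetical."""

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class BenchmarkItem:
    context: str   # source passage the question was generated from
    question: str
    answer: str    # reference answer used for grading


def build_benchmark(documents: List[str],
                    generate: Callable[[str], str],
                    questions_per_doc: int = 3) -> List[BenchmarkItem]:
    """Use any text-generation callable (an LLM client wrapper you supply)
    to draft question/answer pairs grounded in each document."""
    items: List[BenchmarkItem] = []
    for doc in documents:
        for _ in range(questions_per_doc):
            # Hypothetical prompt format; a real pipeline would also
            # deduplicate, validate and difficulty-filter the output.
            raw = generate(
                "Write one exam question answerable only from the passage, "
                "then the answer, separated by '|||'.\n\nPassage:\n" + doc
            )
            if "|||" in raw:
                question, answer = (part.strip() for part in raw.split("|||", 1))
                items.append(BenchmarkItem(doc, question, answer))
    return items


def score_model(items: List[BenchmarkItem],
                answer_fn: Callable[[str, str], str],
                judge_fn: Callable[[str, str], bool]) -> float:
    """Fraction of items the candidate model answers correctly, where
    judge_fn compares a model answer with the reference (exact match here,
    often an LLM-as-judge call in practice)."""
    correct = sum(
        judge_fn(answer_fn(item.context, item.question), item.answer)
        for item in items
    )
    return correct / max(len(items), 1)
```

Most of the engineering effort in a production version goes into the parts this sketch waves away: deduplicating questions, filtering for difficulty and grounding, and grading free-form answers reliably.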
Some companies, IBM among them, have already developed custom benchmark generators similar to YourBench. “This reminds me of our homegrown pipeline for creating synthetic data for training or for evaluation,” says Danilevsky. “Creating synthetic data is easy. Creating good synthetic data is hard,” she says. “So while YourBench is effective with MMLU subsets, that does not translate to being good at anything I throw at it.”
Another alternative that has soared in popularity is Chatbot Arena (CA), a crowdsourced benchmark. Instead of rigorous math or language tests, Chatbot Arena lets users ask a question, get answers from two anonymous AI models and rate which one is better.
Started by two University of California, Berkeley graduate students, CA now gets early access to models from all the major AI players so enthusiasts can battle bots against one another, “creating suspense and gamifying model evaluation,” says CA co-founder Anastasios Angelopoulos in an IBM Think interview. The CA leaderboard, like a Billboard Hot 100 for AI models, has received over two million votes to date.
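Mechanically, turning those head-to-head votes into a leaderboard is a rating problem; Chatbot Arena’s published methodology relies on Elo- and Bradley-Terry-style models. The sketch below is a simplified Elo update over a list of votes, written as an illustration rather than the Arena’s actual code.

```python
"""Simplified sketch: aggregating pairwise "which answer is better" votes
into Elo-style ratings. Not Chatbot Arena's codebase."""

from collections import defaultdict
from typing import Dict, List, Tuple

# Each vote: (model_a, model_b, winner) where winner is "a", "b" or "tie".
Vote = Tuple[str, str, str]


def elo_ratings(votes: List[Vote],
                k: float = 32.0,
                base: float = 1000.0) -> Dict[str, float]:
    """Sequentially update ratings from pairwise outcomes."""
    ratings: Dict[str, float] = defaultdict(lambda: base)
    for model_a, model_b, winner in votes:
        ra, rb = ratings[model_a], ratings[model_b]
        # Expected score of model_a under the Elo model.
        expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
        score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
        ratings[model_a] = ra + k * (score_a - expected_a)
        ratings[model_b] = rb + k * ((1.0 - score_a) - (1.0 - expected_a))
    return dict(ratings)


if __name__ == "__main__":
    sample_votes = [("model-x", "model-y", "a"),
                    ("model-y", "model-z", "tie"),
                    ("model-x", "model-z", "a")]
    for model, rating in sorted(elo_ratings(sample_votes).items(),
                                key=lambda kv: kv[1], reverse=True):
        print(f"{model}: {rating:.1f}")
```

Running the example prints a ranking from a handful of sample votes; the real leaderboard works from millions of votes and additionally reports confidence intervals around each rating.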
Because he tracks new models closely, Angelopoulos was less surprised than many when DeepSeek-R1 soared in popularity. “Open source models have been catching up for some time, so DeepSeek only confirmed that trend.”
The founders created Chatbot Arena in response to frustration with traditional benchmarks. Part of the challenge, says Angelopoulos, is that “benchmarks are static—certain models get very good at specific benchmarks.” As a result, there is a risk of “overfitting data,” he says, in which a model learns a benchmark’s data too well rather than the underlying skill. The benefit of Chatbot Arena, he adds, is that the data is live. “You can’t overfit the data. It doesn’t become contaminated or stale.”
“The Chatbot Arena leaderboard aggregate by itself is not actionable,” Danilevsky says. “Having more nuanced feedback on a model beyond a thumbs-up and thumbs-down is needed for many real-world applications.” Still, she acknowledges that the concept is very popular. “I would just want a bit more understanding of how and why people are responding as they do to a given model. Additional metadata would be really useful here.”
Even Angelopoulos believes “real use is measuring something different than benchmarks.” He uses OpenAI’s GPT-4.5 model as an example. “It didn’t perform well on many qualitative benchmarks, but people loved it. You need a different tool to measure the vibe of a model.”