Did a new model cheat on a given benchmark? Which benchmark is best? And what does "best" even mean when each benchmark measures performance on a different task?

These questions make experts like IBM’s Senior Research Scientist Marina Danilevsky approach model evaluation with caution. “Performing well on a benchmark is just that—performing well on that benchmark,” she tells IBM Think. Transparency is key, she says. “We need to acknowledge the many things that a given benchmark does not test, so that the next benchmarks address some of those holes.”

In contrast to the quest for a single, be-all and end-all benchmark, new solutions are shifting control to users. A team from open-source AI platform Hugging Face recently launched YourBench, an open-source tool that lets enterprises and developers build custom benchmarks from their own data and use them to evaluate model performance. Most benchmarks test “general capabilities,” says Sumuk Shashidhar, a Hugging Face researcher, in an interview with IBM Think. “For many use cases in real life, what matters most is how well a model performs your specific task,” he says.

To make benchmarks more useful for real-life applications, YourBench automatically generates domain-tailored benchmarks directly from documents the user provides, says Shashidhar. It does so cheaply and without requiring manual annotation of the documents. To demonstrate its efficacy, the researchers replicated seven diverse subsets of MMLU (Massive Multitask Language Understanding) for under USD 15 in total inference costs, while preserving the relative performance rankings of the models. MMLU is a benchmark used to evaluate how well language models understand and apply knowledge across various subjects.
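
To make "preserving the relative model performance rankings" concrete, here is a minimal Python sketch of one way to check it. It assumes Spearman rank correlation as the measure, which is a common choice but is not stated in the source, and the model names and scores are invented purely for illustration. They are not YourBench or MMLU results.

```python
def ranks(scores):
    """Return 1-based ranks (1 = highest score); tied scores share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    result = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next score ties with the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = average_rank
        pos = end + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when all scores are identical")
    return cov / (var_x * var_y) ** 0.5


def spearman(scores_a, scores_b):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(scores_a), ranks(scores_b))


# Hypothetical accuracy scores for four made-up models (illustration only).
models = ["model_a", "model_b", "model_c", "model_d"]
original_benchmark = [0.82, 0.74, 0.69, 0.55]
custom_benchmark = [0.71, 0.66, 0.58, 0.49]

rho = spearman(original_benchmark, custom_benchmark)
print(f"Spearman rank correlation: {rho:.2f}")
```

A value near 1 means the two benchmarks order the models the same way even when the absolute scores differ. A value near 0 or below would suggest the custom benchmark is measuring something else.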