Date: 2025-04-29

The artificial intelligence systems powering everything from search engines to software assistants are trained on vast amounts of data, fine-tuned by teams of engineers and benchmarked to within an inch of their synthetic lives. The results? Gleaming leaderboard scores. Impressive accuracy metrics. A lot of bragging rights.

But ask the people building those systems whether those benchmarks reflect how well AI works in the wild, and you’ll hear a different story. Now, a quiet shift is spreading across the AI world. From IBM to Amazon to academic labs at Stanford and Carnegie Mellon, a new generation of researchers is rethinking how to test what AI knows and what it can do.

“You end up with a model that seems smart, but has simply memorized the shape of the problem,” Gabe Goodhart, Chief Architect of AI Open Innovation at IBM, tells IBM Think in an interview. “It’s not that different from memorizing a few hundred practice questions before a standardized exam.”

If the new generation of benchmarks has a rallying cry, it might be this: stop testing AI in a vacuum. Instead, researchers are increasingly seeking to evaluate AI in context, alongside its users and inside real-world interactions. That’s the argument from Maarten Sap, an Assistant Professor at Carnegie Mellon University who studies how AI behaves in the wild.

“Benchmarks today don’t reflect how people actually use AI,” he tells IBM Think. “They’re misaligned, often based on factoid-style tasks that resemble trivia—not real-life queries. And worse, we don’t even know if models are truly solving them, because they may have seen the answers during training.”

Sap, who has conducted research showing that models frequently fail when users speak in dialects or use informal language, says the field needs to stop treating AI as a solo performer. “We need to evaluate the AI-user interaction—the dyad—not the AI system alone,” he says. “That’s where things fall apart, especially over multiple turns of conversation.”