Who watches the AI watchers? The challenge of self-evaluating AI

12 November 2024

Author: Sascha Brodsky, Tech Reporter, Editorial Lead, IBM

AI models are increasingly being asked to evaluate themselves, raising a critical question: who watches the watchers?

AI companies are developing models that evaluate other AI systems, marking a shift from traditional human-led assessment. Meta's new model, for example, can assess AI performance without human input, and it is sparking discussion among researchers about the accuracy and limitations of automated testing.

Meta's Self-Taught Evaluator uses AI-generated training data and a chain-of-thought technique to evaluate science, coding and math responses. The goal is transparency, but the approach carries serious risks alongside its promise of greater efficiency. "This is a big problem—validating validators," says IBM Fellow Kush Varshney. He and his team at IBM are developing evaluation metrics for LLM-as-a-judge models (LLMs that can assess other AI outputs). Still, as Varshney puts it, "This is very much an open research problem."
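
As a rough illustration of the LLM-as-a-judge pattern Varshney describes, the sketch below asks one model to grade another model's answer and to lay out its reasoning before issuing a verdict. The prompt wording and the generate callable are assumptions made for this example, not Meta's or IBM's actual implementation.

```python
# Minimal LLM-as-a-judge sketch. `generate` stands in for whatever
# text-generation call is available (an API client, a local model);
# its interface is an assumption made for this example.
from typing import Callable, Tuple

JUDGE_PROMPT = """You are grading an answer to a question.
Question: {question}
Answer: {answer}

First, reason step by step about whether the answer is correct and complete.
Then finish with a single line: VERDICT: PASS or VERDICT: FAIL."""

def judge_answer(question: str, answer: str,
                 generate: Callable[[str], str]) -> Tuple[bool, str]:
    """Return (passed, reasoning_trail) for one candidate answer."""
    output = generate(JUDGE_PROMPT.format(question=question, answer=answer))
    passed = "VERDICT: PASS" in output.upper()
    return passed, output  # the full output doubles as the reasoning trail
```

Keeping the full chain of thought, rather than just the final verdict, is what provides the reasoning trail discussed below.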

Feedback loops

The central concern is whether AI models can be trusted to improve themselves, or if they risk amplifying their own errors.

"Think about taking a microphone close to a speaker,” Varshney says. “In any feedback system, errors or noise get amplified.”

Meta's Self-Taught Evaluator attempts to mitigate these risks by providing a reasoning trail, much as a human would explain their thought process. However, relying on synthetic data and self-improvement raises a question: are the model's judgments unbiased? Bias arises when an AI system's outputs reflect unfair or skewed assumptions, typically because of biased training data or algorithms.

"The goal has to be to make LLM judges unbiased, so we need to evaluate their bias," Varshney says. One method involves shuffling multiple-choice answers to determine if a model displays positional bias, meaning that it favors one answer over others. Along with verbosity bias and self-enhancement bias, positional biases can all skew evaluations. "Managing these risks is a part of AI development," Varshney says. Responsible AI advancement requires identifying and mitigating biases.

Human oversight remains crucial

Ensuring reliable AI self-evaluation is even more challenging in specialized fields like advanced mathematics and scientific research, which means human experts are often needed to validate results and keep AI systems on track.

"There are a lot of tricks in trying to do that, like shuffling multiple-choice answers around," Varshney says.

According to Dev Nag, Founder and CEO of QueryPal, Meta’s Self-Taught Evaluator is about amplifying and scaling human judgment—not removing humans from the loop.

"Think of it as similar to how a teacher might create practice problems based on their understanding of what makes a good or bad answer," Nag says. "Just as AlphaGo used the rules of Go as its foundation before engaging in self-play, the Self-Taught Evaluator builds upon human-established quality criteria before generating synthetic training examples that implicitly embed human judgment."

Even with self-monitoring AI, periodic audits can catch hidden biases or problems, says Dan O'Toole, Chairman and CEO of Arrive AI.

"Employing multiple AI models to perform the same evaluation independently, or chaining them sequentially, reduces errors and highlights potential issues," he says. Explainability is also essential. "The chain of thought is an important step toward transparency, increasing trustworthiness."

O’Toole stresses that specialized metrics are crucial for fields like advanced mathematics and scientific research. Meta, for example, has used MT-Bench and RewardBench for general-purpose evaluation, he says, but benchmarks like GSM8K are appropriate for mathematical problem-solving. CRUXEval can assist with code reasoning, while domain-specific benchmarks like FactKB, PubMed and SciBench can help ensure that models meet specific needs.

Nag emphasizes that measuring performance and ensuring dependability are crucial, particularly in specialized fields. He believes that the ultimate benchmark should be how well results align with assessments made by human experts in the field.

"The Self-Taught Evaluator's 88.7% agreement with human judgments on RewardBench is a strong baseline, but tracking other factors, like consistency, explainability and the system's ability to identify edge cases, is equally important," he says. "Just as AlphaGo's self-play was validated by its performance against human champions, evaluator systems should be regularly tested against panels of domain experts."
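
The alignment Nag describes can be tracked with metrics as simple as an agreement rate against expert labels and a self-consistency score across repeated runs. The functions below are a generic sketch of those two measurements, not how RewardBench itself computes its scores.

```python
from typing import List, Sequence

def human_agreement_rate(evaluator: Sequence[bool],
                         experts: Sequence[bool]) -> float:
    """Share of items where the evaluator's verdict matches the expert panel's."""
    matches = sum(e == h for e, h in zip(evaluator, experts))
    return matches / len(experts)

def self_consistency(runs: List[Sequence[bool]]) -> float:
    """Fraction of items on which repeated runs of the evaluator all agree,
    a rough proxy for the consistency Nag mentions."""
    per_item = list(zip(*runs))
    return sum(len(set(item)) == 1 for item in per_item) / len(per_item)
```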

Jen Clark, who directs advisory and technology services at EisnerAmper, emphasizes that AI development requires structured frameworks to ensure both safety and effective progress.

"As AI continues to advance, it's crucial to rely on methodologies that have supported human research, like the scientific method, strong communities and collaboration networks," she says. "Focusing efforts here is essential for crowdsourcing AI safety and managing the speed and scale of AI development."

Take the next step

Train, validate, tune and deploy generative AI, foundation models and machine learning capabilities with IBM watsonx.ai, a next-generation enterprise studio for AI builders. Build AI applications in a fraction of the time with a fraction of the data.
