Ensuring reliable AI self-evaluation is even more challenging in specialized fields like advanced mathematics and scientific research. As a result, human experts are often needed to validate results and keep AI systems on track.

"There are a lot of tricks in trying to do that, like shuffling multiple-choice answers around," Varshney says.

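The answer-shuffling trick Varshney mentions can be sketched in a few lines. One common reason to shuffle options is to check whether a model depends on where an answer appears rather than on what it says. The sketch below is illustrative only: `shuffle_choices`, `accuracy_under_shuffling` and the `pick_answer` callback (a stand-in for a real model call) are hypothetical names, not part of any benchmark's tooling.

```python
import random
from typing import Callable, List, Sequence, Tuple

# Hypothetical signature for a model call: given a question and its options,
# return the index of the option the model selects.
PickAnswer = Callable[[str, Sequence[str]], int]


def shuffle_choices(
    choices: Sequence[str], correct_index: int, rng: random.Random
) -> Tuple[List[str], int]:
    """Return the options in random order plus the new index of the correct one."""
    order = list(range(len(choices)))
    rng.shuffle(order)
    shuffled = [choices[i] for i in order]
    return shuffled, order.index(correct_index)


def accuracy_under_shuffling(
    question: str,
    choices: Sequence[str],
    correct_index: int,
    pick_answer: PickAnswer,
    trials: int = 20,
    seed: int = 0,
) -> float:
    """Fraction of shuffled presentations on which the model picks the correct option."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        shuffled, new_correct = shuffle_choices(choices, correct_index, rng)
        if pick_answer(question, shuffled) == new_correct:
            hits += 1
    return hits / trials


def position_biased_picker(question: str, options: Sequence[str]) -> int:
    """Toy stand-in for a model that always chooses the first option."""
    return 0
```

A picker that tracks the content of the correct option scores the same no matter how the options are ordered, while something like `position_biased_picker` loses accuracy once the correct option stops landing in its favored slot.
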
According to Dev Nag, Founder and CEO of QueryPal, Meta's Self-Taught Evaluator is about amplifying and scaling human judgment, not removing humans from the loop.

"Think of it as similar to how a teacher might create practice problems based on their understanding of what makes a good or bad answer," Nag says. "Just as AlphaGo used the rules of Go as its foundation before engaging in self-play, the Self-Taught Evaluator builds upon human-established quality criteria before generating synthetic training examples that implicitly embed human judgment."

Even with self-monitoring AI, periodic audits can catch hidden biases or problems, says Dan O'Toole, Chairman and CEO of Arrive AI.

"Employing multiple AI models to perform the same evaluation independently, or chaining them sequentially, reduces errors and highlights potential issues," he says. Explainability is also essential. "The chain of thought is an important step toward transparency, increasing trustworthiness."

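The independent-evaluation idea O'Toole describes can be sketched as a simple vote among several judges, with low agreement treated as a signal for human review. This is a minimal sketch under stated assumptions: each judge is a hypothetical callable that returns a label for a response, and the `min_agreement` threshold of 0.75 is an arbitrary illustrative value. It covers only the independent variant, not sequential chaining.

```python
from collections import Counter
from typing import Callable, Dict, List

# Hypothetical judge interface: takes a model response, returns a label
# such as "pass" or "fail". In practice each judge would wrap a different model.
Judge = Callable[[str], str]


def ensemble_verdict(
    response: str, judges: List[Judge], min_agreement: float = 0.75
) -> Dict[str, object]:
    """Collect independent verdicts, take the majority, and flag weak consensus."""
    if not judges:
        raise ValueError("at least one judge is required")
    votes = [judge(response) for judge in judges]
    label, count = Counter(votes).most_common(1)[0]
    agreement = count / len(votes)
    return {
        "verdict": label,
        "agreement": agreement,
        # Low agreement is treated as a cue for a periodic human audit.
        "needs_review": agreement < min_agreement,
        "votes": votes,
    }
```

Keeping the individual votes in the result, rather than only the majority label, makes it easier to audit which evaluator disagreed and why.
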
O'Toole stresses that specialized metrics are crucial for fields like advanced mathematics and scientific research. Meta, for example, has used MT-Bench and RewardBench for general-purpose evaluation, he says, but benchmarks like GSM8K are better suited to mathematical problem-solving. CRUXEval can assist with code reasoning, while domain-specific benchmarks like FactKB, PubMed and SciBench can help ensure that models meet specific needs.

Nag emphasizes that measuring performance and ensuring dependability are crucial, particularly in specialized fields. He believes that the ultimate benchmark should be how well results align with assessments made by human experts in the field.

"The Self-Taught Evaluator's 88.7% agreement with human judgments on RewardBench is a strong baseline, but tracking other factors, like consistency, explainability and the system's ability to identify edge cases, is equally important," he says. "Just as AlphaGo's self-play was validated by its performance against human champions, evaluator systems should be regularly tested against panels of domain experts."

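An agreement figure of the kind Nag cites is, at its simplest, the share of items on which an evaluator's preference matches the human one. The sketch below shows that calculation and also returns the disagreeing items so a panel of domain experts could inspect them. The function name and the "A"/"B" preference labels are illustrative assumptions, not RewardBench's actual scoring code.

```python
from typing import List, Sequence, Tuple


def agreement_report(
    evaluator_prefs: Sequence[str], human_prefs: Sequence[str]
) -> Tuple[float, List[int]]:
    """Return (agreement rate, indices where evaluator and human disagree)."""
    if len(evaluator_prefs) != len(human_prefs):
        raise ValueError("both preference lists must have the same length")
    if not human_prefs:
        raise ValueError("at least one judgment is required")
    disagreements = [
        i
        for i, (evaluator, human) in enumerate(zip(evaluator_prefs, human_prefs))
        if evaluator != human
    ]
    rate = 1 - len(disagreements) / len(human_prefs)
    return rate, disagreements
```

The disagreement indices matter as much as the headline rate, since they point reviewers at the edge cases where the evaluator and the experts part ways.
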
Jen Clark, who directs advisory and technology services at EisnerAmper, emphasizes that AI development requires structured frameworks to ensure both safety and effective progress.

"As AI continues to advance, it's crucial to rely on methodologies that have supported human research, like the scientific method, strong communities and collaboration networks," she says. "Focusing efforts here is essential for crowdsourcing AI safety and managing the speed and scale of AI development."