New Gemini model boosts Google’s standing in high-stakes AI tests

Sascha Brodsky

Staff Writer

IBM

This article was featured in the Think newsletter.

Google’s Gemini 3 launched this week with impressive gains on some of the field’s hardest reasoning evaluations, a shift IBM researchers say reflects a real advance in Google’s frontier-model capabilities.

Gemini 3 introduces a set of feature upgrades that Google describes as a step up in practical capability. According to the company’s announcement, the model now handles text, images, audio and video in a single context window; adds new agentic-coding tools that let developers generate working applications from prompts; and expands its reach across Google Search, the Gemini app and enterprise platforms such as Vertex AI.

Benchmark boost

Google also reports benchmark jumps that it says reflect improvements in reasoning and tool use. The company highlighted gains on ARC-AGI, stronger performance in terminal-based code execution and better results on developer-oriented tasks that require planning steps and running tools.

Google is positioning Gemini 3 as the centerpiece of a broader ecosystem built around agentic tooling and cross-application coordination. Central to that effort is Antigravity, an integrated development environment designed to let the model plan tasks, call tools, operate across terminals and browsers, and distribute work among multiple agents.

First impressions show both promise and caveats

Early testers noted that Google reported sizable gains for Gemini 3 on several high-difficulty evaluations, including Humanity’s Last Exam, GPQA Diamond and ARC-AGI-2, and highlighted improvements in how the model interprets text, images, audio and video together. They also pointed to new coding and agentic tools that can generate working applications with less prompting than earlier versions. Even with those advances, IBM Senior Research Scientist Marina Danilevsky said on a recent episode of the Mixture of Experts podcast that Gemini 3 “is still hallucinating, and it still really likes to give answers rather than say that it does not know the answers.”

Other researchers emphasized the importance of Google’s ecosystem strategy. IBM Chief Architect of AI Open Innovation Gabe Goodhart said on the podcast that “a really great model is not that differentiated anymore.” He argued that the competitive edge now lies in the surrounding tools rather than model size alone. He pointed to Antigravity as an example, calling it “something you cannot get anywhere else,” with the ability to launch “a fleet of delegate worker agents” that can run tasks in parallel.

Hands-on testing made the contrast clearer. Merve Unuvar, Director of Agentic Middleware and Applications Research in AI at IBM, said on the podcast that she asked Gemini 3 to build a personal workout dashboard. The model spun up a working Streamlit interface in under two minutes and delivered a clean set of recommendations. But when she asked for more tailored guidance, it produced advice that ignored information it already had, telling her to “eat high-nutrition food after the workout to ‘grow,’” despite knowing her age.

Goodhart said the real test for Gemini 3 will come from how well it handles complex, multi-agent workflows, not just benchmarks.

“If the model can actually hold up to that level of independence and parallel analysis,” he said, “it could be a real breakthrough.”
