This article was featured in the Think newsletter.
Google’s Gemini 3 launched this week with impressive gains on some of the field’s hardest reasoning evaluations, a shift IBM researchers say reflects a real advance in Google’s frontier-model capabilities.
Gemini 3 introduces a set of feature upgrades that Google describes as a step up in practical capability. According to the company’s announcement, the model now handles text, images, audio and video in a single context window; adds new agentic-coding tools that let developers generate working applications from prompts; and expands its reach across Google Search, the Gemini app and enterprise platforms such as Vertex AI.
Google also boasts benchmark jumps that it says reflect improvements in reasoning and tool use. The company highlighted gains on ARC-AGI, stronger performance in terminal-based code execution and better results on developer-oriented tasks that require planning steps and running tools.
Google is positioning Gemini 3 as the centerpiece of a broader ecosystem built around agentic tooling and cross-application coordination. Central to that effort is Antigravity, an integrated development environment designed to let the model plan tasks, call tools, operate across terminals and browsers, and distribute work among multiple agents.
Early testers noted that Google reported sizable gains for Gemini 3 on several high-difficulty evaluations, including Humanity’s Last Exam, GPQA Diamond and ARC-AGI-2, and highlighted improvements in how the model interprets text, images, audio and video together. They also pointed to new coding and agentic tools that can generate working applications with less prompting than earlier versions. Even with those advances, IBM Senior Research Scientist Marina Danilevsky said on a recent episode of the Mixture of Experts podcast that Gemini 3 “is still hallucinating, and it still really likes to give answers rather than say that it does not know the answers.”
Other researchers emphasized the importance of Google’s ecosystem strategy. IBM Chief Architect of AI Open Innovation Gabe Goodhart said on the podcast that “a really great model is not that differentiated anymore.” He argued that the competitive edge now lies in the surrounding tools rather than model size alone. He pointed to Antigravity as an example, calling it “something you cannot get anywhere else,” with the ability to launch “a fleet of delegate worker agents” that can run tasks in parallel.
Hands-on testing made the contrast clearer. Merve Unuvar, Director of Agentic Middleware and Applications Research in AI at IBM, said on the podcast that she asked Gemini 3 to build a personal workout dashboard. The model spun up a working Streamlit interface in under two minutes and delivered a clean set of recommendations. But when she asked for more tailored guidance, it produced advice that ignored information it already had, telling her to “eat high-nutrition food after the workout to ‘grow,’” despite knowing her age.
Goodhart said the real test for Gemini 3 will come from how well it handles complex, multi-agent workflows, not just benchmarks.
“If the model can actually hold up to that level of independence and parallel analysis,” he said, “it could be a real breakthrough.”