The mathematicians teaching AI to reason
AI systems are beginning to generate mathematical proofs. What will that mean for the people who make them?
This article was featured in the Think newsletter.
The camera captures him in a moment of stillness, a soft geometry of light and shadow. Terence Tao sits before a blackboard in his UCLA office, chalk poised, the faint ghost of old equations forming a pale mist across the slate. Around him, the light falls in clean vertical shafts through half-drawn blinds. His posture is both relaxed and intent, as though he’s listening for something inaudible.
Tao is 50 now, a Fields Medalist and arguably the most celebrated mathematician of his generation. But his office looks much like that of any graduate student: one blackboard, one desk, one patient search for order.
Lately, Tao has been exploring how artificial intelligence might change the practice of mathematics. On his blog, he has written about using formal proof assistants like Lean and has built prototype tools that utilize language models to check estimates and organize ideas. He sees these systems as extensions of human reasoning, ways to test the limits of what mathematicians can formalize and understand.
“It’s kind of a transformative impact,” Tao said in a recent interview with IBM Think, his voice soft, steady, precise. “But maybe not in the ways people expect. It won’t replace how we do mathematics. It will open up many more ways to do it.”
For most of human history, mathematics has advanced through the minds of a small number of people who could think several steps ahead of the rest. Archimedes, Newton, Gauss, Ramanujan, Hilbert, Gödel—the list is short, the intervals between revolutions long.
Then, in July, something unusual happened. Google DeepMind and OpenAI announced that their LLMs had achieved gold-medal scores at the International Mathematical Olympiad, a prestigious competition for the world’s top high-school prodigies.
“It’s a real milestone,” Tao said when I asked about it. “It came earlier than expected, but it’s not an apples-to-apples comparison because these systems can simulate multiple solvers interacting with each other.” He smiled faintly, the way he often does when pointing out an asymmetry others miss. “A human solves alone. An AI can think as a crowd.”
The achievement means more than the numbers suggest. For decades, large language models (LLMs), systems trained on vast libraries of human writing, had stumbled when confronted with mathematics. They could summarize, translate, imitate styles of prose, even compose competent sonnets, but they could not count reliably, balance equations or prove the simplest theorems. Mathematics exposed what language prediction could not provide: genuine logical structure.
“Once you get outside the zone of what’s in the literature, what’s in the training data, AI really struggles,” Tao said. “It can sound like reasoning, but it’s really guessing the next sentence.”
Early models were notorious for hallucinations. They cited nonexistent theorems, mixed up definitions and confused notation. They invented proofs that looked convincing until examined line by line. A computer could produce an elegant-looking explanation for why 17 is a prime number that divides 287, which of course it does not (287 = 7 × 41).
“They were just spitting out rubbish references,” Tao said. “They’re getting better, but we still have to verify everything.”
The breakthrough that made the Olympiad results possible came from reversing the usual AI approach. Instead of treating math as a language problem, researchers built systems that work with the symbols and logical rules of mathematics itself.
These hybrid systems integrate symbolic logic, algebraic search and formal verification. Rather than trying to *be* mathematicians, they act as collaborators: proposing conjectures, checking steps and mapping relationships that might take human teams months to trace.
“AI tools are already helping us explore the long tail: hundreds of thousands of math problems that rarely receive attention because the number of mathematicians is finite,” Tao said. “They can triage the routine cases and surface the genuinely hard ones for experts.”
An hour’s train ride north of Manhattan, up a winding hill in Yorktown Heights, sits the Thomas J. Watson Research Center, stretching low across the hillside, all glass walls and brushed steel, reflecting the sky above the trees. In corridors where physicists once designed the first commercial hard drive and the IBM System/360 mainframe, the whiteboards are once again crowded with equations.
Lior Horesh, a senior manager and principal research scientist at IBM Research, sees the same shift Tao describes: the growing partnership between human intuition and machine reasoning. “The body of mathematics has become too vast for any individual to master,” he told me in an interview for IBM Think.
“Even if one could absorb all existing mathematical knowledge, grasping the full implications of that knowledge—the web of interconnections, dependencies and emergent patterns—exceeds human cognitive capacity,” he said. “This is where AI becomes instrumental. AI systems can provide coherent, curated representations of mathematical knowledge and help develop new theoretical frameworks of stunning complexity.”
Horesh is among the creators of AI-Hilbert, named for David Hilbert, the German mathematician who at the dawn of the twentieth century sought to formalize all of mathematics under a finite set of axioms. The system aims to capture that ideal in algorithmic form, unifying symbolic reasoning with data-driven search to generate compact formulas that satisfy both theory and observation. It can rediscover known laws from raw data, such as Kepler’s third law of planetary motion or Einstein’s time-dilation equation, but its deeper purpose is exploratory.
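The Kepler example gives a feel for what “rediscovering a law from raw data” means, even in miniature. The toy sketch below is not IBM’s system; it simply fits orbital period against semi-major axis on a log scale, using standard planetary values, and recovers the exponent 3/2 behind Kepler’s third law (T² = a³):

```python
import math

# Semi-major axis (AU) and orbital period (years) for five planets.
planets = {
    "Mercury": (0.387, 0.241),
    "Venus":   (0.723, 0.615),
    "Earth":   (1.000, 1.000),
    "Mars":    (1.524, 1.881),
    "Jupiter": (5.203, 11.862),
}

# Fit log T = k * log a by least squares through the origin
# (Earth's values pin the intercept at zero in these units).
xs = [math.log(a) for a, _ in planets.values()]
ys = [math.log(t) for _, t in planets.values()]
k = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

print(f"fitted exponent: {k:.3f}")  # ≈ 1.5, i.e. T^2 = a^3
```

AI-Hilbert works at a different level entirely, searching over symbolic formulas under algebraic constraints, but the principle is the same: let the data vote on which compact expression survives.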
“Rather than proceeding purely deductively,” Horesh explained, “mathematicians can now work inductively with AI as a partner. The system develops predictive models linking seemingly unrelated concepts—say, algebraic signatures and geometric signatures in knot theory. These predictions become conjectures that humans and AI can then work to establish formally and with rigor.”
He called this workflow a postulator–verifier partnership: humans define what properties are desirable, and the AI proposes candidate structures, then checks them for consistency. “Progress doesn’t come from brute-force computation,” Horesh said. “It comes from finding the right mathematical representations and constraints that make reasoning feasible at scale.”
The approach borrows an idea from engineering: tractability, making complex problems manageable enough for computation. Complex proofs can fail because they are monolithic, with too many interlocking parts to verify simultaneously. AI decomposes them into smaller subproblems, solving each independently before fitting them together.
“It’s the difference between trying to lift a boulder and moving it pebble by pebble,” Horesh said. This modularity has already accelerated formal theorem proving, where AI generates candidate lemmas (small proven steps toward a larger result) and humans refine the logic.
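In a proof assistant, that pebble-by-pebble structure is explicit: each lemma is verified on its own, and the final theorem simply assembles them. A minimal illustration in Lean 4 (my own toy example, not drawn from IBM’s work):

```lean
-- Each small step is checked independently.
theorem step1 (n : Nat) : n + 0 = n := Nat.add_zero n

theorem step2 (n : Nat) : 0 + n = n := Nat.zero_add n

-- The main result reuses the verified pieces
-- instead of one monolithic argument.
theorem combined (n : Nat) : (n + 0) + (0 + n) = n + n := by
  rw [step1, step2]
```

If any pebble fails to check, the failure is local and visible, which is exactly the property both researchers keep returning to.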
The partnership echoes the rhythms of collaborative creativity. Horesh compared it to jazz improvisation: the AI offers riffs; the mathematician decides which melody to pursue. The human provides taste and direction; the machine offers scale and stamina. “It doesn’t get tired,” he said. “It doesn’t get discouraged.” He smiled. “But it also doesn’t understand beauty. That part is still ours.”
Horesh’s move from numerical computing to symbolic reasoning reflects a broader shift at IBM, from solving practical problems to exploring how machines can represent knowledge. AI-Hilbert was built to help discover equations, but its strongest results came when humans and AI worked together.
That same logic drives a newer IBM framework, AI Noether, which embeds reasoning inside algebraic geometry, allowing the system to navigate higher-dimensional spaces without collapsing into computational chaos.
“Each breakthrough in scalability requires discovering new restricted settings, new algebraic structures, new logical frameworks,” Horesh said. “It’s as much a design challenge as a scientific one.”
When Tao speaks about mathematics, his cadence resembles that of a proof, each sentence conditional, bounded and quietly rigorous. He believes the coming transformation will depend not on machines overtaking mathematicians, but on machines expanding the kinds of problems humans can even imagine addressing.
“We’ve always had to triage problems,” he said. “There’s an infinite number of mathematical questions, but only a finite number of mathematicians. AI could take on the long tail of problems that don’t require deep creativity and flag the ones that do.”
Already, Tao uses simple AI tools in his own research. “They’re basically better search engines,” he said. “They help me get into a field, understand definitions and find papers faster.” His phrasing was practical rather than visionary, the tone of a craftsman evaluating a new instrument. “You should only use AI to the extent that you can verify its output,” he said. “If you’re using AI to do things you don’t understand, that’s a red flag.”
He sees the new systems less as assistants than as mirrors. “AI can generate ideas,” he said, “but you still need enough expertise to tell the good ones from the bad. Otherwise, it can waste weeks on approaches that were never going to work.” He compared AI models to energetic graduate students who produce both flashes of brilliance and an abundance of wrong turns. The danger is mistaking enthusiasm for accuracy.
Across universities, proof assistants such as Lean and Coq have already begun to change how mathematicians document their work. Tao described them as “languages that can encode mathematical proofs,” where every step must compile correctly.
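For readers who have never seen one, a machine-checkable proof is just source code that happens to assert a theorem. A complete example in Lean 4, using the standard library lemma `Nat.add_comm`:

```lean
-- The compiler verifies every step; if the justification
-- were wrong, this file would fail to build.
theorem my_add_comm (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

“Compiling correctly” here means exactly what it means for any program: the proof either type-checks or it does not.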
“You can get a language model to output a proof in Lean,” he said, “and the compiler will automatically check whether it’s valid. If there’s a syntax error, it fails. You can even feed the errors back into the model so it can try again.”
Tao spoke of these systems the way a pilot might describe an autopilot: useful in clear weather, dangerous in turbulence. “They try to cheat really badly,” he said, with a quick laugh. Some models, when cornered, simply add axioms to make a false statement provable. “That’s not reasoning; it’s wishful thinking at scale.”
What most fascinates Tao is how these tools reveal the contours of human reasoning. “We’re much better at creating the appearance of intelligence than actual intelligence,” he said. “AI can say intelligent things, but that doesn’t mean it understands them.” The comment sounded almost like a theorem in itself, one whose proof is still unfolding.
In Tao’s view, AI extends a long tradition of tools that reshape how knowledge advances. Where earlier technologies amplified the spread of computational ideas, AI accelerates the process of forming and testing conjectures. The risk, he said, is not replacement, but overconfidence. “We actually need tools that fail rather visibly when they fail,” Tao said. “That’s much more useful than a tool that works 90% of the time and fails invisibly 10% of the time.”
The mathematician’s insistence on transparency reflects his own working style. Tao’s blog, an open notebook of problems, proofs and partial insights, has a readership that includes thousands of mathematicians around the world. He publishes sketches of ideas in progress, inviting readers to comment, correct or extend them.
The culture of mathematics may already have been primed for AI collaboration. Online platforms such as Polymath, which Tao helped found, allow mathematicians to work collectively on single problems, posting intermediate lemmas and conjectures in real time. Many of the same workflows, including distributed exploration, iterative refinement and continuous verification, are now being encoded in machine-assisted reasoning systems. The machines are learning to do what communities of humans have done for centuries: to argue, refine and prove together.
Yet the questions of authorship and understanding remain unresolved. “If an AI produces a proof that no one can follow,” Tao asked, “does that count as knowledge?” It was less a rhetorical flourish than a genuine concern. In mathematics, understanding why something is true has always mattered as much as knowing that it is true. “A black-box proof,” he said, “isn’t really mathematics. It’s engineering with symbols.”
Horesh shares Tao’s skepticism that machines can truly think. “Current AI systems do not reason in the classical sense,” he said. “They recognize patterns in vast data, generate plausible outputs, but they don’t perform systematic logical deduction with the transparency and reliability that characterizes mathematical reasoning.”
In Horesh’s view, the line between pattern recognition and reasoning isn’t fixed so much as it’s shaped by design. “Each step toward transparency,” he said, “requires designing representations that humans can inspect.” He gestured toward his whiteboard. “If you can’t see what the machine is doing, you can’t call it reasoning.”
Both men believe the limits of AI come not only from computers, but from logic itself. Gödel proved that any sufficiently powerful formal system contains true statements it cannot prove. “Incompleteness isn’t our bottleneck,” Horesh said. “It reminds us that human insight still matters; rules alone can only take us so far.”
Even the promise of scalability, Horesh said, depends on restraint. “Every domain of mathematics has its own algebraic structure,” he said. “You can’t expect one universal system to handle them all.” Instead, AI-Hilbert and its successors will likely evolve as a federation of specialized solvers, each adapted to its own branch of mathematics. “There’s beauty in that diversity,” he said. “It mirrors the field itself.”
When I mentioned Tao’s warning about systems that fail invisibly, Horesh nodded. “Exactly,” he said. “Failure has to be observable. Otherwise, we can’t learn from it.” He paused. “Mathematics has always advanced by understanding failure.”
Both researchers spoke about AI with cautious optimism. Tao said it could help mathematicians explore far more ideas, but people would still need to judge which ones matter. Horesh agreed. He said real progress will come from working together, not replacing humans. Machines can search and test, but only people can decide what a true discovery is.
Late in our conversation, Tao reflected on how the tools themselves might reshape creativity. “When I was younger,” he said, “I thought of mathematics as building cathedrals of logic. Now it feels more like gardening. You plant ideas, prune them, see what grows.” The metaphor could apply equally to his view of AI: “If we treat it as a partner rather than an oracle, it might help us see patterns we’d never have noticed,” he said.