Artificial intelligence can churn out code but can’t think like a software engineer.
That’s the conclusion of new research from MIT’s Computer Science and Artificial Intelligence Laboratory, which found that while large language models (LLMs) excel at generating code snippets, they fall short of the sophisticated reasoning, planning and collaboration that real-world software engineering demands. The study, conducted in collaboration with researchers from Stanford, UC Berkeley and Cornell and presented at this week’s International Conference on Machine Learning, challenges assumptions about AI’s readiness to transform software development.
“Long-horizon code planning requires a sophisticated degree of reasoning and human interaction,” Alex Gu, a PhD candidate at MIT CSAIL and the study’s lead author, said in an interview with IBM Think. “The model must consider various tradeoffs, such as performance, memory, code quality, etc., and use that to accurately decide how to design the code.”
AI coding tools are now a staple of modern software development. In 2025, 82% of developers reported using AI coding tools weekly or more, and 59% said they relied on three or more assistants in their workflow. Another 78% reported clear productivity gains, demonstrating how deeply AI is shaping the way code is written today.
The MIT research defines what it calls “long-horizon code planning” as a key limitation of current AI systems. According to Gu, this involves reasoning about how code fits into larger systems and considering the global consequences of local decisions.
Gu pointed to the example of designing a new programming language. The task, he explained, requires considering the many ways the language will be used, deciding which API functions to expose and anticipating usage patterns. The study notes that models must also reason about the global effects of local code changes, since slight changes to the design of a single function can propagate through the rest of the codebase.
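To make that propagation concrete, here is a minimal, hypothetical sketch (our illustration, not an example from the study, and the function and file names are invented): revising the interface of a single function touches every call site unless the old behavior is deliberately preserved.

```python
# Hypothetical illustration of a local change with global effects:
# adding a parameter to one function affects every caller.

def parse_config(path: str, env: str = "dev") -> dict:
    """Load settings for a deployment environment.

    An earlier design took only `path`. Adding `env` later would break
    every existing caller unless, as here, a default preserves the old
    one-argument behavior.
    """
    return {"path": path, "env": env}

# Calls written against the original design keep working only because
# of the default; a required parameter would force edits everywhere.
print(parse_config("app.yaml"))            # old-style call
print(parse_config("app.yaml", "prod"))    # new-style call
```

Multiply that tradeoff across every public function in a language's API, and the long-horizon planning problem Gu describes comes into view.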
The MIT research identifies problems with how AI coding capabilities are currently evaluated. According to Gu, most coding benchmarks focus on generating small, self-contained programs from scratch, which doesn’t reflect the reality of large-scale software engineering.
“One aspect we mention is task diversity: while real-world software engineering [SWE] involves tasks such as software testing or software maintenance, these are rarely reflected in today’s benchmarks,” Gu said.
Equally important, he added, is the ability of AI systems to infer user intent, a skill essential for tailoring solutions to specific use cases. “A website for a business likely needs to be more robust than a website designed for fun.”
The research found that LLMs perform best on tasks that closely resemble examples seen during training, which creates challenges for projects built on low-resource programming languages or specialized libraries. Because those languages and libraries appear relatively infrequently in training data, Gu said, LLMs struggle more with them.
“Performing these tasks relies more heavily on extrapolating to unseen data and domains (generalization), which is often harder than reiterating code similar to the training distribution,” Gu said.
According to the study, this limitation means that AI coding agents tend to be less effective in legacy systems, scientific computing environments and internal tools where documentation may be limited.
The MIT study identifies the need for AI systems to develop an accurate semantic model of a project’s codebase. According to Gu, this involves understanding software structure, how components interact and how those relationships change over time.
“First, AI must understand the structure of the codebase and how the various parts come together,” he said. “Second, it must understand how the individual functions work. Finally, it should update its model of the codebase as new features are added.”
The study notes that current AI models maintain no persistent state between prompts: they lack both a memory of how a codebase has evolved and an internal representation of its architecture.
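As a rough illustration of what such an internal representation might look like (a hypothetical sketch of our own, not a design the researchers propose), a persistent model could track components and their dependencies, making it possible to ask what a local change would affect:

```python
# Hypothetical sketch of a persistent "semantic model" of a codebase:
# a record of which components exist and how they depend on each other,
# updated as new features are added.

from dataclasses import dataclass, field

@dataclass
class CodebaseModel:
    # component name -> names of the components it calls or imports
    dependencies: dict[str, set[str]] = field(default_factory=dict)

    def add_component(self, name: str, calls: set[str]) -> None:
        """Register a new function/module and what it depends on."""
        self.dependencies[name] = calls

    def affected_by(self, name: str) -> set[str]:
        """Everything needing re-checking if `name` changes."""
        return {caller for caller, deps in self.dependencies.items()
                if name in deps}

model = CodebaseModel()
model.add_component("parse_config", set())
model.add_component("start_server", {"parse_config"})
model.add_component("run_tests", {"start_server", "parse_config"})

# A change to parse_config ripples to both of its callers.
print(sorted(model.affected_by("parse_config")))
# ['run_tests', 'start_server']
```

A model without anything like this state must rediscover the codebase's structure from scratch on every prompt.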
Despite these limitations, the authors identify several areas for potential improvement. Gu said better benchmarks could help—especially if they can evaluate AI systems on a broader range of tasks, including testing, maintenance and human-AI collaboration.
He also sees near-term promise in areas beyond coding, particularly in education. “AI already has strong capabilities at solving most elementary- and middle-school problems,” he said. “AI has a lot of potential to streamline existing workflows in education, such as generating practice problems, grading and identifying students’ misconceptions.”