AI can write code, but can it beat software engineers?


Sascha Brodsky, Staff Writer, IBM

This article was featured in the Think newsletter.

Artificial intelligence can churn out code but can’t think like a software engineer.

That’s the conclusion of new research from MIT’s Computer Science and Artificial Intelligence Laboratory, which found that while large language models (LLMs) excel at generating code snippets, they fall short of the sophisticated reasoning, planning and collaboration that real-world software engineering demands. The study, conducted in collaboration with researchers from Stanford, UC Berkeley and Cornell and presented at this week’s International Conference on Machine Learning, challenges assumptions about AI’s readiness to transform software development.

“Long-horizon code planning requires a sophisticated degree of reasoning and human interaction,” Alex Gu, a PhD candidate at MIT CSAIL and the study’s lead author, said in an interview with IBM Think. “The model must consider various tradeoffs, such as performance, memory, code quality, etc., and use that to accurately decide how to design the code.”

AI coding tools are now a staple of modern software development. In 2025, 82% of developers reported using AI coding tools weekly or more, and 59% said they relied on three or more assistants in their workflow. Another 78% reported clear productivity gains, demonstrating how deeply AI is shaping the way code is written today.


The planning challenge

The MIT research defines what it calls “long-horizon code planning” as a key limitation of current AI systems. According to Gu, this involves reasoning about how code fits into larger systems and considering the global consequences of local decisions.


Gu pointed to the example of designing a new programming language. The task, he explained, requires considering all the ways the language will be used, deciding which API functions to expose and anticipating how users will interact with it. The study notes that models must also reason about the global effects of local code changes, since a slight change to the design of a single function can propagate through the rest of the codebase.
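To make this concrete, here is a small hypothetical sketch (the function names and data are invented for illustration) of how a seemingly local design change can break distant callers. Switching a utility from returning a list to returning a lazy generator is a one-line edit, but any caller that relies on list semantics fails:

```python
# Hypothetical example: a "local" design change with global consequences.

def load_records_v1(raw):
    """Original design: returns a list, so callers can index and re-iterate."""
    return [line.strip() for line in raw if line.strip()]

def load_records_v2(raw):
    """'Slight' redesign: returns a lazy generator to save memory."""
    return (line.strip() for line in raw if line.strip())

def first_record(records):
    """A caller elsewhere in the codebase that assumes list semantics."""
    return records[0]  # works for v1; raises TypeError for v2

raw = ["alpha\n", "\n", "beta\n"]
print(first_record(load_records_v1(raw)))  # prints "alpha"
try:
    first_record(load_records_v2(raw))
except TypeError:
    print("v2 broke a distant caller")  # the local change propagated
```

Reasoning about which callers depend on such semantics, across an entire codebase, is exactly the kind of long-horizon planning the study describes.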

The MIT research identifies problems with how AI coding capabilities are currently evaluated. According to Gu, most coding benchmarks focus on generating small, self-contained programs from scratch, which doesn’t reflect the reality of large-scale software engineering.

“One aspect we mention is task diversity: while real-world software engineering [SWE] involves tasks such as software testing or software maintenance, these are rarely reflected in today’s benchmarks,” Gu said.

Equally important, he added, is the ability of AI systems to infer user intent, a skill essential for tailoring solutions to specific use cases. “A website for a business likely needs to be more robust than a website designed for fun.”

The research found that LLMs perform best on tasks that closely resemble examples seen during training, which creates challenges for projects that rely on low-resource programming languages or specialized libraries. According to Gu, these languages and libraries appear relatively infrequently in training data, so LLMs struggle more with them.

“Performing these tasks relies more heavily on extrapolating to unseen data and domains (generalization), which is often harder than reiterating code similar to the training distribution,” Gu said.

According to the study, this limitation means that AI coding agents tend to be less effective in legacy systems, scientific computing environments and internal tools where documentation may be limited.

Codebase understanding

The MIT study identifies the need for AI systems to develop an accurate semantic model of a project’s codebase. According to Gu, this involves understanding software structure, how components interact and how those relationships change over time.

“First, AI must understand the structure of the codebase and how the various parts come together,” he said. “Second, it must understand how the individual functions work. Finally, it should update its model of the codebase as new features are added.”
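One ingredient of such a semantic model is knowing which functions depend on which others. As a minimal sketch (the sample source and function names are hypothetical), a call graph can be extracted from Python code with the standard-library `ast` module:

```python
# Minimal sketch: extract a function-level call graph with Python's ast module.
import ast
from collections import defaultdict

SAMPLE = """
def parse(data):
    return validate(clean(data))

def clean(data):
    return data.strip()

def validate(data):
    return data != ""
"""

def call_graph(source: str) -> dict:
    """Map each top-level function name to the set of names it calls."""
    tree = ast.parse(source)
    graph = defaultdict(set)
    for func in (n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)):
        for node in ast.walk(func):
            # Only direct calls to plain names; method calls are skipped here.
            if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                graph[func.name].add(node.func.id)
    return dict(graph)

print(call_graph(SAMPLE))  # parse calls both clean and validate
```

A static graph like this captures only the first of Gu's three steps; understanding what each function does, and keeping the model current as the code evolves, is where today's systems fall short.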

The study notes that current AI models do not persist state between prompts: they lack both a memory of how a codebase has evolved and an internal representation of its architecture.

Despite these limitations, the authors identify several areas for potential improvement. Gu said better benchmarks could help—especially if they can evaluate AI systems on a broader range of tasks, including testing, maintenance and human-AI collaboration.

He also sees near-term promise in areas beyond coding, particularly in education. “AI already has strong capabilities at solving most elementary- and middle-school problems,” he said. “AI has a lot of potential to streamline existing workflows in education, such as generating practice problems, grading and identifying students’ misconceptions.”
