Artificial intelligence can churn out code but can’t think like a software engineer.
That’s the conclusion of new research from MIT’s Computer Science and Artificial Intelligence Laboratory, which found that while large language models (LLMs) excel at generating code snippets, they fall short of the sophisticated reasoning, planning and collaboration that real-world software engineering demands. The study, conducted in collaboration with researchers from Stanford, UC Berkeley and Cornell and presented at this week’s International Conference on Machine Learning, challenges assumptions about AI’s readiness to transform software development.
“Long-horizon code planning requires a sophisticated degree of reasoning and human interaction,” Alex Gu, a PhD candidate at MIT CSAIL and the study’s lead author, said in an interview with IBM Think. “The model must consider various tradeoffs, such as performance, memory, code quality, etc., and use that to accurately decide how to design the code.”
AI coding tools are now a staple of modern software development. In 2025, 82% of developers reported using AI coding tools weekly or more, and 59% said they relied on three or more assistants in their workflow. Another 78% reported clear productivity gains, demonstrating how deeply AI is shaping the way code is written today.
The MIT research defines what it calls “long-horizon code planning” as a key limitation of current AI systems. According to Gu, this involves reasoning about how code fits into larger systems and considering the global consequences of local decisions.
Gu pointed to the example of designing a new programming language. The task, he explained, requires considering the many ways the language will be used, deciding which API functions to expose and anticipating usage patterns. The study notes that models must also reason about the global effects of local code changes, since slight changes to the design of a single function can propagate through the rest of the codebase.
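To make that propagation concrete, here is a minimal, hypothetical sketch (our illustration, not an example from the study, and the function and file names are invented): revising the interface of a single function touches every call site unless the old behavior is deliberately preserved.

```python
# Hypothetical illustration of a local change with global effects:
# adding a parameter to one function affects every caller.

def parse_config(path: str, env: str = "dev") -> dict:
    """Load settings for a deployment environment.

    An earlier design took only `path`. Adding `env` later would break
    every existing caller unless, as here, a default preserves the old
    one-argument behavior.
    """
    return {"path": path, "env": env}

# Calls written against the original design keep working only because
# of the default; a required parameter would force edits everywhere.
print(parse_config("app.yaml"))            # old-style call
print(parse_config("app.yaml", "prod"))    # new-style call
```

Multiply that tradeoff across every public function in a language's API, and the long-horizon planning problem Gu describes comes into view.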
The MIT research identifies problems with how AI coding capabilities are currently evaluated. According to Gu, most coding benchmarks focus on generating small, self-contained programs from scratch, which doesn’t reflect the reality of large-scale software engineering.
“One aspect we mention is task diversity: while real-world software engineering [SWE] involves tasks such as software testing or software maintenance, these are rarely reflected in today’s benchmarks,” Gu said.
Equally important, he added, is the ability of AI systems to infer user intent, a skill essential for tailoring solutions to specific use cases. “A website for a business likely needs to be more robust than a website designed for fun.”
The research found that LLMs perform best on tasks that closely resemble examples seen during training, which creates challenges for projects built on low-resource programming languages or specialized libraries. Because those languages and libraries appear relatively infrequently in training data, Gu said, LLMs struggle more with them.
“Performing these tasks relies more heavily on extrapolating to unseen data and domains (generalization), which is often harder than reiterating code similar to the training distribution,” Gu said.
According to the study, this limitation means that AI coding agents tend to be less effective in legacy systems, scientific computing environments and internal tools where documentation may be limited.
The MIT study identifies the need for AI systems to develop an accurate semantic model of a project’s codebase. According to Gu, this involves understanding software structure, how components interact and how those relationships change over time.
“First, AI must understand the structure of the codebase and how the various parts come together,” he said. “Second, it must understand how the individual functions work. Finally, it should update its model of the codebase as new features are added.”
The study notes that current AI models maintain no persistent state between prompts: they lack both a memory of how a codebase has evolved and an internal representation of its architecture.
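As a rough illustration of what such an internal representation might look like (a hypothetical sketch of our own, not a design the researchers propose), a persistent model could track components and their dependencies, making it possible to ask what a local change would affect:

```python
# Hypothetical sketch of a persistent "semantic model" of a codebase:
# a record of which components exist and how they depend on each other,
# updated as new features are added.

from dataclasses import dataclass, field

@dataclass
class CodebaseModel:
    # component name -> names of the components it calls or imports
    dependencies: dict[str, set[str]] = field(default_factory=dict)

    def add_component(self, name: str, calls: set[str]) -> None:
        """Register a new function/module and what it depends on."""
        self.dependencies[name] = calls

    def affected_by(self, name: str) -> set[str]:
        """Everything needing re-checking if `name` changes."""
        return {caller for caller, deps in self.dependencies.items()
                if name in deps}

model = CodebaseModel()
model.add_component("parse_config", set())
model.add_component("start_server", {"parse_config"})
model.add_component("run_tests", {"start_server", "parse_config"})

# A change to parse_config ripples to both of its callers.
print(sorted(model.affected_by("parse_config")))
# ['run_tests', 'start_server']
```

A model without anything like this state must rediscover the codebase's structure from scratch on every prompt.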
Despite these limitations, the authors identify several areas for potential improvement. Gu said better benchmarks could help—especially if they can evaluate AI systems on a broader range of tasks, including testing, maintenance and human-AI collaboration.
He also sees near-term promise in areas beyond coding, particularly in education. “AI already has strong capabilities at solving most elementary- and middle-school problems,” he said. “AI has a lot of potential to streamline existing workflows in education, such as generating practice problems, grading and identifying students’ misconceptions.”