Date: 2025-07-18

The MIT research identifies what it calls “long-horizon code planning” as a key limitation of current AI systems. According to Gu, this involves reasoning about how code fits into larger systems and considering the global consequences of local decisions.

“Long-horizon code planning requires a sophisticated degree of reasoning and human interaction,” Gu said. “The model must consider tradeoffs like performance, memory and code quality, and use that to decide how to design the code.”

Gu pointed to the example of designing a new programming language. The task, he explained, requires considering the many ways the language will be used, deciding which API functions to expose and anticipating usage patterns. The study notes that models must also reason about the global effects of local code changes, as small changes to the design of a single function can propagate to the rest of the codebase.

The MIT research identifies problems with how AI coding capabilities are currently evaluated. According to Gu, most coding benchmarks focus on generating small, self-contained programs from scratch, which doesn’t reflect the reality of large-scale software engineering.

“One aspect we mention is task diversity: while real-world software engineering [SWE] involves tasks such as software testing or software maintenance, these are rarely reflected in today’s benchmarks,” Gu said.

Equally important, he added, is the ability of AI systems to infer user intent, a skill essential for tailoring solutions to specific use cases. “A website for a business likely needs to be more robust than a website designed for fun.”

The research found that LLMs perform best on tasks that closely resemble examples seen during training, creating challenges for projects that rely on low-resource programming languages or specialized libraries. According to Gu, these languages and libraries appear relatively infrequently in training data, so LLMs struggle more with them.

“Performing these tasks relies more heavily on extrapolating to unseen data and domains (generalization), which is often harder than reiterating code similar to the training distribution,” Gu said.

According to the study, this limitation means that AI coding agents tend to be less effective in legacy systems, scientific computing environments and internal tools where documentation may be limited.