Does AI really make coding faster?

Antonia Davison

Staff Writer

For the past couple of years, the companies behind AI’s frontier models have been making a bold promise: that coding assistants mean faster development, fewer bugs and less grunt work for developers. Tools like GitHub Copilot and Cursor—powered by large language models (LLMs) such as Claude or GPT—are designed to automate the tedious parts of programming so human programmers can focus on the harder, more creative problems in their codebase.

At least, that’s been the pitch so far. But METR (short for Model Evaluation and Threat Research and pronounced “meter”), a Berkeley nonprofit that evaluates frontier models’ capabilities, wanted to see if there was real evidence to support that claim. What they found flips the narrative: coding assistants might actually slow developers down.

METR researchers observed the work of 16 experienced developers, each of whom had been contributing to large, open-source repositories for several years. Each developer provided a list of real tasks they’d normally tackle, from bug fixes to new features. The researchers then randomly split the tasks into two groups: one where developers could use AI tools, and one where they couldn’t.

AI in the mix

When AI was allowed, the devs could choose whatever tools they wanted; most chose Cursor Pro paired with Claude 3.5 or 3.7 Sonnet. They recorded their screens as they completed each task, then reported what they thought their total implementation time was. The study’s results were surprising. “When developers are allowed to use AI tools, they take 19% longer to complete issues—a significant slowdown that goes against developer beliefs and expert forecasts,” the paper’s authors wrote.

We asked IBM’s AI Advocacy Lead PJ Hagerty and Distinguished Engineer Chris Hay to take a look at METR’s study and share their impressions.

Hagerty cautioned that the hype around AI assistants may be outpacing their real-world utility. “The promise of AI making people more productive stems from the tech leadership and gen AI companies looking to capitalize on the hype of AI,” he told IBM Think. “In reality, the AI is learning as it goes and likely using the same resources a junior developer might use—Stack Overflow, GitHub and general googling—but with none of the context.”

“I think it's a relevant result,” Hay added. “But I don't think we should be going, ‘Wow, AI is useless. I’m faster doing it myself.’ I think there is a point, though, that for certain tasks, maybe it is quicker just to do it yourself rather than convincing the AI.”

Perception isn’t always reality

The other half of the study’s results are just as interesting: before they started, the developers expected AI to speed up their work by 24%. Yet even after experiencing the 19% slowdown, they still believed AI had sped them up by 20%.
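
To see how stark that gap is, consider a quick back-of-the-envelope sketch in Python. The two-hour baseline below is a hypothetical task, not a figure from the study; only the percentages come from METR’s reported averages.

# Illustrative arithmetic only; the 2-hour baseline task is hypothetical
baseline_hours = 2.0                      # time to finish the task without AI (assumed)
expected  = baseline_hours * (1 - 0.24)   # forecast:  24% faster -> 1.52 hours
actual    = baseline_hours * (1 + 0.19)   # measured:  19% slower -> 2.38 hours
perceived = baseline_hours * (1 - 0.20)   # recalled:  20% faster -> 1.60 hours
print(expected, actual, perceived)        # 1.52 2.38 1.6

In other words, on that hypothetical task a developer finishes roughly 23 minutes slower with AI while believing they finished about 24 minutes faster.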

So what’s behind this perception gap? We checked in with METR’s Nate Rush, one of the study’s authors. “This is a great question, and one that our work does not fully speak to,” Rush told IBM Think. “Ideally, future work will further explore how developers’ expectations on AI usefulness affect how they use the tools [and] why this perception gap exists.”

Beyond the perception issue, the study raises a number of important questions: Is time savings the only way we should be measuring developer productivity? How do metrics like code quality and team impact fit into the overall picture?

“Our study only speaks to time savings, which is only a measure of one aspect of productivity,” Rush said. “There is no ‘one right metric,’ but likely a collection of metrics that are informative about the impact of AI tools.” He added that while this study focused on time, his team has found the SPACE framework of developer productivity (SPACE is short for Satisfaction, Performance, Activity, Communication and Efficiency) useful for thinking about future directions.

Another question: could the model versions—in this case, Claude 3.5 and 3.7 Sonnet—have affected performance time? “Here’s the reality,” Hay said. “I think the versions do matter. Claude 4 Sonnet is significantly better. Claude 4 Opus is significantly better. We’re not talking a small amount of better. We’re talking a lot amount of better.”

According to Quentin Anthony, one of the study’s 16 participants, the human element is another important consideration. “We like to say that LLMs are tools, but treat them more like a magic bullet,” he wrote on X. “LLMs are a big dopamine shortcut button that may one-shot your problem. Do you keep pressing the button that has a 1% chance of fixing everything? It’s a lot more enjoyable than the grueling alternative, at least to me.” (Anthony added that social media distractions are another easy way to cause delays.)

So, as AI coding assistants evolve and improve, where will they have the most sustainable long-term impact on software development? “Once they become stable, trustable and useful, I think code assistants will best sit at the QA layer—testing, quality assurance, accessibility,” Hagerty said. “Things that are constrained and rules-based are the best application of these tools.”

That’s because, he said, writing code is fundamentally different from checking it. “Coding itself is a creative activity. It’s building something from nothing in a unique ecosystem. AI assistants miss that nuance. But they can likely test using a system of rules that are more general and universal.”
