2025-08-12

The other half of the study’s results is just as interesting: before they started, the devs expected AI to speed up their work by 24%. Yet even after they experienced the 19% slowdown, they still believed AI had sped them up by 20%.

So what’s behind this perception gap? We checked in with METR’s Nate Rush, one of the study’s authors. “This is a great question, and one that our work does not fully speak to,” Rush told IBM Think. “Ideally, future work will further explore how developers’ expectations on AI usefulness affect how they use the tools [and] why this perception gap exists.”

Beyond the perception issue, the study raises a number of important questions: is time savings the only way we should be measuring developer productivity, anyway? How do metrics like code quality and team impact fit into the overall picture?

“Our study only speaks to time savings, which is only a measure of one aspect of productivity,” Rush said. “There is no ‘one right metric,’ but likely a collection of metrics that are informative about the impact of AI tools.” He added that while this study focused on time, his team has found the SPACE framework of developer productivity (SPACE is short for Satisfaction, Performance, Activity, Communication and Efficiency) useful for thinking about future directions.

Another question: could the model versions—in this case, Claude 3.5 and 3.7 Sonnet—have affected performance time? “Here’s the reality,” Hay said. “I think the versions do matter. Claude 4 Sonnet is significantly better. Claude 4 Opus is significantly better. We’re not talking a small amount of better. We’re talking a lot amount of better.”

According to Quentin Anthony, one of the study’s 16 participants, the human element is another important consideration. “We like to say that LLMs are tools, but treat them more like a magic bullet,” he wrote on X. “LLMs are a big dopamine shortcut button that may one-shot your problem. Do you keep pressing the button that has a 1% chance of fixing everything? It’s a lot more enjoyable than the grueling alternative, at least to me.” (Anthony added that social media distractions are another easy way to cause delays.)

So, as AI coding assistants evolve and improve, where will they have the most sustainable long-term impact on software development? “Once they become stable, trustable and useful, I think code assistants will best sit at the QA layer—testing, quality assurance, accessibility,” Hagerty said. “Things that are constrained and rules-based are the best application of these tools.”

That’s because, he said, writing code is fundamentally different from checking it. “Coding itself is a creative activity. It’s building something from nothing in a unique ecosystem. AI assistants miss that nuance. But they can likely test using a system of rules that are more general and universal.”