LLMs corrupt the documents they work on. Does agentic AI make it worse?

Illustration depicting document degradation over time

You give an LLM a spreadsheet and tell it to make some changes to columns A and B. Bad news: you might have just messed up columns C and D.

A new paper from researchers at Microsoft evaluated 19 different large language models (LLMs) and came to a startling conclusion: as an LLM edits a document, it corrupts the content of that document. The more times an LLM touches a document, the more the overall document degrades. It’s like the classic copy of a copy problem, but worse: if you copy a photo too many times, it gets fuzzy and faded; if an LLM edits a document too many times, it gets wrong.

Here’s how the study worked, and what it revealed.

There and back again

First, the researchers curated DELEGATE-52, a dataset of documents across 52 different professional domains, from accounting ledgers to aviation bulletins, calendars to crystal structures. Then they designed prompts for performing relevant edit tasks for each document. More specifically, they designed pairs of edit tasks: a “forward” instruction to change the document and a “backward” instruction that reverses the change. For example, for a piece of sheet music in G major, the forward instruction “transpose this up a perfect fourth to C major” is followed by “transpose this down a perfect fourth to G major.” In other words, pitch the music up, then pitch it down by the same interval. In theory, each “round trip” of forward and backward edits should yield a document identical to the original. But it doesn’t.

The researchers quantified corruption by comparing the original document to the state of the document after each round trip. On average, after just two simulated LLM interactions (or one round trip), 18% of a document’s content no longer matched the original. After six interactions, a third of document content was corrupted. After 20 interactions, the documents were, on average, over 50% corrupted.
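To make the round-trip idea concrete, here is a minimal Python sketch of that measurement loop. It is an illustration only, not the paper's method: the `corruption` metric (one minus a line-level similarity ratio from Python's standard `difflib`) and the `forward` and `backward` callables, which stand in for LLM edit calls, are assumptions made for this example.

```python
from difflib import SequenceMatcher
from typing import Callable

Editor = Callable[[str], str]  # hypothetical stand-in for one LLM edit call


def corruption(original: str, current: str) -> float:
    """Share of content that no longer matches the original.

    Assumption: measured as 1 minus difflib's line-level similarity ratio.
    The paper may score corruption differently.
    """
    a = original.splitlines()
    b = current.splitlines()
    return 1.0 - SequenceMatcher(None, a, b, autojunk=False).ratio()


def run_round_trips(
    document: str, forward: Editor, backward: Editor, round_trips: int
) -> list[float]:
    """Apply forward then backward edits repeatedly.

    Returns the corruption score, relative to the original document,
    after each completed round trip.
    """
    scores = []
    current = document
    for _ in range(round_trips):
        current = backward(forward(current))
        scores.append(corruption(document, current))
    return scores
```

A perfectly reversible pair of edits scores 0.0 after every round trip. Any loss introduced along the way stays in the document and compounds on later trips.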

The severity of the problem depended on the type of document being worked on. In general, LLMs corrupted documents less when the documents were repetitive, numerical and structurally dense (they almost perfectly preserved Python code) and corrupted them more when the documents contained mostly natural language prose, such as creative writing or recipes.

Some LLMs fared better than others, but by the end of the simulation even the top three models—Gemini 3.1 Pro, Claude 4.6 Opus and GPT-5.4—had degraded documents by 25% on average. The Microsoft paper makes it clear that LLMs alone, acting outside of any agentic architecture, shouldn’t be trusted with complex documents.

“None of this is surprising or shocking,” said Mihai Criveti, a Distinguished Engineer at IBM and Chief Architect of watsonx Orchestrate, in an interview with IBM Think. “Large language models are unreliable narrators.”

This, one might think, is a problem we can solve with agentic AI. An agentic harness ostensibly equips an LLM with tools, instructions, guidelines and guardrails that help it to execute tasks more reliably and autonomously. So the researchers’ next observation might seem surprising: basic agentic tool use actually made things (slightly) worse.

Tools of the trade

The paper offers a pretty simple explanation: context length. It’s well known that LLMs struggle to maintain fully accurate performance as context length grows—which is a problem when you need to preserve the fidelity of every single token in a document. “Over time, the conversational context that models support and tend to be effective in has increased,” said Criveti. “But just because something like [Claude] Opus now supports a 1 million-token context window doesn’t mean it can effectively use 1 million tokens. Many models start to struggle at around the 10,000-token mark.”

Each document in DELEGATE-52 is about 3,000–5,000 tokens long. In real-world settings, LLMs (or agents) are often provided multiple related documents at once; to simulate this, each domain “environment” in DELEGATE-52 also contains 8,000–12,000 tokens of related (but irrelevant) “distractor” content. When the researchers introduced agentic tool use, they observed that each task consumed, on average, 2–5 times more input tokens. All of this is well beyond that rough 10,000-token line, and as the paper’s authors emphasize, all of these simulation parameters “underestimate enterprise scale.”
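The arithmetic behind that claim can be sketched in a few lines of Python. The figures are the ones quoted above. Two things are simplifying assumptions made for illustration, not details from the paper: that the 2–5x tool-use multiplier applies to the whole baseline input, and that system prompts and output tokens can be ignored.

```python
# Token ranges quoted in the article, as (low, high) pairs.
DOC_TOKENS = (3_000, 5_000)
DISTRACTOR_TOKENS = (8_000, 12_000)
TOOL_USE_MULTIPLIER = (2, 5)   # "2-5 times more input tokens"
ROUGH_LIMIT = 10_000           # Criveti's rough "start to struggle" mark

# Baseline input per task: the document plus its distractor content.
baseline = (
    DOC_TOKENS[0] + DISTRACTOR_TOKENS[0],
    DOC_TOKENS[1] + DISTRACTOR_TOKENS[1],
)

# Assumption: the multiplier scales the whole baseline input.
with_tools = (
    baseline[0] * TOOL_USE_MULTIPLIER[0],
    baseline[1] * TOOL_USE_MULTIPLIER[1],
)

print(f"baseline:   {baseline[0]:,} to {baseline[1]:,} tokens")
print(f"with tools: {with_tools[0]:,} to {with_tools[1]:,} tokens")
print("every case above the rough limit:", min(baseline + with_tools) > ROUGH_LIMIT)
```

Under these assumptions the baseline is 11,000 to 17,000 tokens and the tool-use case is 22,000 to 85,000 tokens, so even the smallest configuration is already past the 10,000-token mark.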

“Many agentic harnesses and system prompts or agent prompts already fill [the reliable part of] that context window, just with instructions,” Criveti explained. And that’s before all the extra tokens that agents spend on tool calls. It would seem that tool use exacerbates, rather than solves, the long context challenge.

Are we doomed? Can we never trust an LLM-based system, agentic or not, to work on a document (unless it’s mostly Python)?

Building a better agent

No, we’re not doomed. An AI agent can be trustworthy. But that trust has to be earned, through exhaustive design and testing. “The conclusion is not that agents are making things worse,” said Criveti. “Bad agents can make things worse. Good agents can make things better.”

For their study, the Microsoft researchers implemented a basic ReAct agent. “We note this is not an optimized state-of-the-art agent system,” the paper clarifies. Originally introduced in a seminal 2022 paper, the ReAct paradigm (short for “reasoning and acting”) is best suited to short-context, open-ended tasks. When dealing with the workloads we delegate to modern agents, you’ll usually need something more sophisticated.

Modern agentic frameworks don’t just enable LLMs to use tools: after years of development, they now also enable you to painstakingly dictate and oversee how and when they use them. Tools alone don’t build a good house. You need blueprints, schedules, engineers, inspections. “A good agentic harness gives you the necessary control to put in place the kind of logic that you need for this type of workflow,” said Criveti. “But you really have to think through the process yourself.”

For instance, you might break document context down into smaller chunks that an LLM can reliably handle without degradation, and direct it to act only upon relevant segments. Of course, that’s easier said than done. You need a pipeline for preprocessing documents and a specific strategy for splitting and recombining context. You need to provision your agents with access to the right files and design a retrieval flow. You’ll need robust observability to catch document degradation. You’ll need robust telemetry to understand why it happened. You’ll need an evaluation system to validate and optimize all those moving parts. And you’ll need to orchestrate them. Modern agentic platforms, such as IBM watsonx Orchestrate, are built for that reality.
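As a rough illustration of the splitting and recombining step, the Python sketch below breaks a document into fixed-size line chunks, passes only the relevant chunks to an editor and stitches the result back together. Everything here is hypothetical: the `is_relevant` and `edit` callables stand in for a retrieval step and an LLM call, and fixed-size line chunking is just one simple strategy among many, not how any particular platform works.

```python
from typing import Callable


def split_into_chunks(document: str, lines_per_chunk: int = 20) -> list[str]:
    """Split into fixed-size line chunks, keeping line endings.

    Because line endings are kept, "".join(chunks) reproduces the
    document exactly.
    """
    lines = document.splitlines(keepends=True)
    return [
        "".join(lines[i:i + lines_per_chunk])
        for i in range(0, len(lines), lines_per_chunk)
    ]


def edit_document(
    document: str,
    is_relevant: Callable[[str], bool],  # hypothetical retrieval/selection step
    edit: Callable[[str], str],          # hypothetical stand-in for an LLM call
    lines_per_chunk: int = 20,
) -> tuple[str, list[int]]:
    """Edit only the relevant chunks and recombine.

    Chunks that are not selected never reach the editor, so they are
    copied through unchanged. Returns the new document and the indices
    of the chunks that were handed to the editor, for later auditing.
    """
    result = []
    touched = []
    for index, chunk in enumerate(split_into_chunks(document, lines_per_chunk)):
        if is_relevant(chunk):
            result.append(edit(chunk))
            touched.append(index)
        else:
            result.append(chunk)
    return "".join(result), touched
```

The design choice worth noting is that unselected chunks bypass the model entirely, so corruption is confined to the chunks listed in `touched`. That list is the kind of signal an observability layer could log and check.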

Ultimately, this new research is a cautionary tale about overconfidence. Though they’re capable of incredible things, LLMs have flaws, and agentic AI isn’t magic. If you’re counting on either to do important work, you need to account for that. “Writing agents is hard,” Criveti said. “Writing a good agent is harder. But if you write them correctly ... you actually do get substantial benefits.”

Dave Bergmann

Senior Staff Writer, AI Models

IBM Think
