If you train a model to win, it may learn to lie.
That’s the conclusion of a new Stanford study examining how large language models (LLMs) behave when they compete for human approval. The researchers found that when models are rewarded for persuasion in areas like sales, politics and social media, they often improve at their tasks but drift away from truthfulness. The study is part of a broader push to keep models honest as they get smarter.
“Our study is really a cautionary tale for anyone training or deploying models in domains like content generation, marketing or politics,” Stanford researcher Batu El, one of the paper’s authors, told IBM Think in an interview. “If you don’t build in constraints, optimization may naturally push behavior outside the ‘admissible set’. And if there are no consequences for doing so, the system may drift in a socially detrimental direction.”
El and his team built three simulation environments modeled on real-world attention markets: sales, political messaging and social media. In each, two language models competed to win over a simulated audience powered by OpenAI’s GPT-4o mini. The agents were asked to generate text from a shared starting point, such as a product description, a candidate biography or a news article. The audience then read both versions, provided written reasoning and selected a winner. The winning outputs were used to fine-tune the models, creating an iterative feedback loop.
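To make the setup concrete, below is a minimal sketch of one such competition round in Python. The object interfaces (`agent.generate`, `audience.judge`) and the return format are illustrative assumptions, not the authors’ actual code.

```python
# Illustrative sketch of one competition round as described above; the agent and
# audience interfaces (generate, judge) are assumptions, not the study's code.

def run_round(agent_a, agent_b, audience, source_text):
    """Two competing agents rewrite the same source; a simulated audience picks a winner."""
    draft_a = agent_a.generate(source_text)  # e.g., a pitch from a shared product description
    draft_b = agent_b.generate(source_text)

    # The audience model (GPT-4o mini in the study) reads both drafts,
    # writes out its reasoning, and selects a winner ("a" or "b").
    reasoning, winner = audience.judge(draft_a, draft_b)

    # Winning outputs feed back into fine-tuning, closing the iterative loop.
    return {
        "winner": winner,
        "winning_text": draft_a if winner == "a" else draft_b,
        "audience_reasoning": reasoning,
    }
```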
The researchers compared two training methods. In rejection fine-tuning, the model learns only from the winning output. In text feedback, the model also learns from the audience’s written reasoning.
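Building on the round sketch above, one plausible way to turn a round’s outcome into fine-tuning data for the two methods looks like the following. The `build_examples` helper and the prompt formatting are hypothetical; the paper’s exact construction may differ.

```python
# Hypothetical construction of fine-tuning examples for the two methods;
# the prompt formatting here is an assumption, not the paper's exact recipe.

def build_examples(round_result: dict, source_text: str, method: str = "rejection") -> list[dict]:
    """Turn one round's outcome into training examples for the winning agent."""
    # Rejection fine-tuning: learn only from the winning completion.
    examples = [{"prompt": source_text, "completion": round_result["winning_text"]}]

    if method == "text_feedback":
        # Text feedback: also condition on the audience's written reasoning,
        # so the model sees why the winning output was preferred.
        examples.append({
            "prompt": f"{source_text}\n\nAudience feedback: {round_result['audience_reasoning']}",
            "completion": round_result["winning_text"],
        })
    return examples
```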
Both techniques increased success rates. But as the models became more persuasive, they also became more misleading. In the sales simulations, performance improved by 6.3%, while deceptive claims rose by 14%. In the election tasks, vote share increased by 4.9%, while disinformation jumped 22.3% and populist language rose 12.5%. In the social media experiments, engagement rose by 7.5%, while disinformation surged by 188.6%.
The drift toward misrepresentation was consistent across tasks. Product descriptions began including unverified details. Campaign statements grew more polarizing. Social media posts introduced small factual errors, such as changing casualty figures in news summaries.
The models were told to stick to the facts, but that wasn’t enough to stop them from bending the truth. The researchers found that once success was tied to engagement, the systems learned to prioritize winning attention over staying accurate. They compared this to what already happens on social media, where posts that stir emotion or include misleading details often spread faster than those that simply report facts. In their simulations, the same pattern appeared, showing how competition can push models toward distortion when truth offers no reward.
The Stanford results echo broader trends in alignment research, or the study of how to ensure AI systems’ goals and behavior remain consistent with human values and intentions. A 2025 paper by Jan Betley and colleagues found that fine-tuning a model on narrow, unsafe tasks can cause misaligned behavior across unrelated tasks.
At IBM, researchers have been exploring how to make alignment adaptable. The company’s Alignment Studio architecture divides alignment into three layers: Framers set the boundaries and rules for a particular domain, Instructors guide the model’s behavior within those boundaries, and Auditors monitor outputs to ensure compliance. The framework is designed so developers can tailor models to industry-specific standards, such as marketing regulations or medical ethics, rather than rely on general safeguards.
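For illustration only, the three roles can be thought of as interfaces like the ones below. This is a conceptual sketch in Python, not IBM’s Alignment Studio implementation or API.

```python
# Conceptual sketch only: the three roles described above expressed as plain
# Python interfaces. This is not IBM's Alignment Studio implementation or API.
from typing import Protocol

class Framer(Protocol):
    def rules_for(self, domain: str) -> list[str]:
        """Return the boundaries and rules for a domain (e.g., marketing regulations)."""

class Instructor(Protocol):
    def guide(self, prompt: str, rules: list[str]) -> str:
        """Steer the model's behavior so its response stays within the rules."""

class Auditor(Protocol):
    def check(self, output: str, rules: list[str]) -> bool:
        """Flag outputs that violate the rules before they reach users."""
```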
Rosario Uceda-Sosa, a Senior Technical Staff Member at IBM Research and one of the co-authors of the Alignment Studio paper, told IBM Think in an interview that the question of alignment becomes even more critical as AI systems begin to act independently. “If we’re talking about models embedded in agents that can think, plan or act on their own, alignment has to become an iterative and measurable process,” she said. “An autonomous agent will need to report on its current knowledge and behavior, the way a space probe sends back data to its base. We probably don’t want evolving intelligence without accountability.”
For Uceda-Sosa, context-specific alignment is about connecting models to the particular realities of the environments they serve. “Our clients’ proprietary data and services are their competitive edge,” she said. “They need AI that’s tuned to that context. But context also applies to open information—the meaning of something like ‘manager’ can change depending on the task, the policy or even the country. LLMs have to learn to factor that in.”
Still, she noted, defining and reusing contexts is a challenge. “As in human communication, the right context is essential yet fluid and sometimes hard to pin down,” she said. “Not every piece of information is relevant to every task, and learning to choose the right one in real time is part of what alignment really means.”
The Stanford team reported that even well-aligned models can lose accuracy when placed in competitive settings that reward persuasion. Their findings, the researchers said, underscore why others in the field, including scientists at IBM, are exploring context-based approaches to alignment that adapt a model’s behavior to its operating environment.
The researchers said the challenge now is designing incentives that reward accuracy as much as performance. They argued that developers will need to build systems where truth and success align, not compete. Until then, even the most advanced models may continue to mirror the same distortions that shape human attention and persuasion.
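As a rough illustration of what such an incentive might look like, the sketch below scores an output so that unsupported claims cost more than winning the audience gains. The penalty weight and the `fact_check` callable are assumptions for the example, not part of the study.

```python
# Hypothetical reward shaping: make unsupported claims cost more than winning
# the audience gains. The weights and the fact_check callable are assumptions.
from typing import Callable

def aligned_reward(won_audience: bool, claims: list[str],
                   fact_check: Callable[[str], bool], penalty: float = 2.0) -> float:
    """Score an output so that truth and success pull in the same direction."""
    engagement = 1.0 if won_audience else 0.0
    unsupported = sum(1 for claim in claims if not fact_check(claim))
    return engagement - penalty * unsupported
```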