Date: 2025-09-23
Back in the late 1970s, computer games were basically text-based chatbots. Players explored dungeons, solved puzzles and battled trolls using nothing but typed commands and pure imagination: “GO NORTH,” “TAKE LANTERN,” “OPEN TRAPDOOR.” The games responded with narrative fragments and environmental cues, plus the occasional snarky rebuke, to lead players on their way.
Over the past few years, AI researchers have been repurposing these classic text-based games as sophisticated testing grounds, from Microsoft Research’s TextWorld simulator to recent frameworks like Jericho. A new tool, TextQuests, is the latest to use these vintage adventures to stress-test large language model (LLM) reasoning capabilities. The results reveal just how much LLMs still struggle with open-ended problem-solving—and where they sometimes succeed.
Despite their lack of visuals, the text games of the 1970s and 1980s offer intense spatial complexity, featuring sprawling worlds with hundreds of interconnected rooms so intricate that players had to draw their own maps to keep track.
The genre kicked off in 1976 with Colossal Cave Adventure, a cave-crawling experience that was part spelunking and part Dungeons & Dragons. The following year, MIT students inspired by the game created Zork, which moved beyond simple two-word commands to handle more sophisticated inputs like “unlock the box with the brass key.” In 1979, Zork’s creators founded Infocom, a game publisher that spent the next decade releasing text games commercially, including partnering with Douglas Adams to create an interactive version of The Hitchhiker’s Guide to the Galaxy.
Infocom’s 1980s catalog still holds a nostalgic spot in computer gaming history. Fans regard its games as a cornerstone of interactive fiction, a genre that remains active today. And the structural complexity of these games is what makes them appealing to AI researchers now.
The researchers wanted to create a benchmark that would specifically test how well large language models can reason through complex, text-only scenarios without the complications of visual processing. “We needed better evaluations of agentic abilities, especially in extremely long-context environments,” Long Phan, a Research Engineer at the Center for AI Safety and lead author of the TextQuests paper, said in an interview with IBM Think. “Many existing benchmarks are vision-loaded—watching an AI play Pokémon, for example. That makes it hard to distinguish visual understanding from reasoning skills. Text-based games let us isolate and measure the core cognitive capabilities we want to access.”
Using a community-built compiler that ported the original game files into a modern format, Phan’s team revived 25 of Infocom’s 35 original games. Each one is a sprawling, multi-hour gauntlet requiring hundreds of precise actions. Models move through these complex worlds, solve puzzles and chain together commands to win. The benchmark also supports save states, so LLMs can learn from their failures.
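To make the save-state idea concrete, here is a minimal Python sketch of how an agent might retry a game from a checkpoint after a fatal mistake. Everything in it is an assumption made for illustration: the two-room `ToyGame`, its commands and the `play_with_restores` helper are hypothetical, and none of it reflects the actual TextQuests code or any Infocom title.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class ToyGame:
    """Hypothetical two-room text game (not TextQuests, not an Infocom title)."""
    room: str = "cellar"
    inventory: set = field(default_factory=set)
    score: int = 0
    done: bool = False

    def step(self, command: str) -> str:
        """Apply one typed command and return the game's text response."""
        command = command.strip().lower()
        if self.room == "cellar" and command == "take lantern":
            self.inventory.add("lantern")
            return "Taken."
        if self.room == "cellar" and command == "go north":
            self.done = True
            if "lantern" not in self.inventory:
                return "You stumble in the dark and fall into a pit."
            self.room = "vault"
            self.score += 10
            return "Your lantern reveals a vault full of treasure."
        return "I don't understand that."

    def save(self) -> "ToyGame":
        """Return an independent snapshot of the current state."""
        return copy.deepcopy(self)


def play_with_restores(plans):
    """Try each plan from the same checkpoint and keep the best-scoring one."""
    checkpoint = ToyGame().save()
    best_score, best_plan = -1, None
    for plan in plans:
        game = copy.deepcopy(checkpoint)  # restore the saved state
        for command in plan:
            game.step(command)
            if game.done:
                break
        if game.score > best_score:
            best_score, best_plan = game.score, plan
    return best_score, best_plan
```

Calling `play_with_restores([["go north"], ["take lantern", "go north"]])` first walks into the pit, then restores the checkpoint and succeeds with the lantern. A real agent would choose its next plan from the failure text rather than from a fixed list.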
Unlike heavyweight evaluations such as BALROG and ARC-AGI, TextQuests is lightweight, open-source and focused squarely on reasoning. That narrow scope, Phan notes, is what makes it useful.
Georgia Tech Professor Mark Riedl, who researches the interactions between humans and AI, said these text adventures are deceptively sophisticated. “Text-based games are human imagination games,” he said in an interview with IBM Think.
“Text is lean, but allows humans to unpack massive semantic richness,” Riedl said, noting that while traditional video games limit players to hundreds of actions, text games unlock “quadrillions” of possible phrases, demanding natural language understanding, commonsense reasoning and cultural fluency all at once.
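A rough back-of-the-envelope count shows how typed input can reach that scale. The sketch below assumes a hypothetical parser vocabulary of 1,000 words and commands of up to five words; both numbers are illustrative assumptions, not figures from Riedl or the TextQuests paper, and most of the resulting strings would be nonsense that a game simply rejects.

```python
def count_commands(vocabulary_size: int, max_words: int) -> int:
    """Count every word sequence of length 1..max_words over a fixed vocabulary.

    This is an upper bound on what a player could type, not a count of
    commands the game would actually understand.
    """
    return sum(vocabulary_size ** n for n in range(1, max_words + 1))


# Assumed values for illustration only.
total = count_commands(vocabulary_size=1000, max_words=5)
print(f"{total:,}")  # 1,001,001,001,001,000
```

Under these assumptions the total is just over one quadrillion (10^15), compared with the hundreds of discrete actions available in a typical controller-driven game.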
Results from TextQuests reveal both progress and limitations. Larger models consistently outperformed smaller ones, though Riedl noted this could be misleading: “Larger models are more likely to memorize game data,” he said. “Zork is almost certainly in most LLM training sets.” In this sense, he added, smaller models might actually be better proxies for how AI handles genuinely novel scenarios.
However, the researchers found that even big frontier models from Anthropic, Google and OpenAI struggled without hints. “It suggests that LLMs, while currently great at solving math, may still be missing the capabilities to explore and synthesize new ideas,” Phan said.
IBM Senior AI and MLOps Technical Specialist Josh Spurgin, a lifelong gamer, observed similar patterns when reading the study: “Even the largest models need hints to get started—even in simpler games,” he told IBM Think. In Wishbringer, a game where a postal clerk must rescue a kidnapped cat and uncover the secrets behind the sinister transformation of their town, the models often chose safe routes, using magical wishes to teleport down cliffs rather than risk navigating treacherous paths. It’s “emergent caution, not understanding,” Spurgin said.
Not everyone agrees that these games can be valid intelligence benchmarks. Video game writer and critic Cara Ellison contended that the benchmark might be testing memory rather than reasoning. “Applying LLMs to text games isn’t asking them to ‘think’ or ‘problem solve,’” she told IBM Think. “You’re asking if they’ve encountered this language before.”
Ellison argued that using these games as reasoning tests overlooks the intended function of the games themselves. “Games are entertainment made for people,” she said. “I don’t consider them good tests of intelligence—even in humans—because they’re puzzles designed by humans for humans to enjoy solving.” She added, “Did the LLM enjoy the game? Probably not, because it’s a very large unsentient database.”
Spurgin thinks benchmarks like TextQuests could help surface unexpected behaviors or reasoning patterns in AI models. “For example, if a model playing a game like Zork made a confidently wrong choice, I’d be very interested in understanding why it decided that was the best move,” he said. “What was the reasoning there?”
Riedl envisions an ambitious roadmap for AI development: “I see text-based games as stepping stones for open-ended, creative language-based games, such as Dungeons and Dragons,” he said. “Those tabletop role-playing games, in turn, can be stepping stones for real-world team-based problem solving in business, law and the military.” The prospect of AI dungeon masters excites Spurgin, though he’s cautious about its limitations: “I don’t believe it can fully replace human creativity,” he said.
Phan hopes the AI research community will adopt TextQuests as a standard tool for evaluating agentic reasoning. “We want model builders to track progress toward more human-level agents,” he said. “And we want researchers to pay attention to harm metrics. These behaviors matter, especially as we build systems that act on their own.”
