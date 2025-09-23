The researchers wanted to create a benchmark that would specifically test how well large language models can reason through complex, text-only scenarios without the complications of visual processing. “We needed better evaluations of agentic abilities, especially in extremely long-context environments,” Long Phan, a Research Engineer at the Center for AI Safety and lead author of the TextQuests paper, said in an interview with IBM Think. “Many existing benchmarks are vision-loaded—watching an AI play Pokémon, for example. That makes it hard to distinguish visual understanding from reasoning skills. Text-based games let us isolate and measure the core cognitive capabilities we want to access.”

Using a community-built compiler that ported the original game files into a modern format, Phan’s team revived 25 of the 35 original Infocom games. Each one is a sprawling, multi-hour gauntlet requiring hundreds of precise actions. Models move through these complex worlds, solve puzzles and chain together commands to win. The benchmark also supports save states, so LLMs can learn from their failures.

Unlike heavyweight evaluations such as BALROG and ARC-AGI, TextQuests is lightweight, open-source and focused squarely on reasoning. That narrow scope, Phan notes, is what makes it useful.

Georgia Tech Professor Mark Riedl, who researches the interactions between humans and AI, said these text adventures are deceptively sophisticated. “Text-based games are human imagination games,” he said in an interview with IBM Think.

“Text is lean, but allows humans to unpack massive semantic richness,” Riedl said, noting that while traditional video games limit players to hundreds of actions, text games unlock “quadrillions” of possible phrases, demanding natural language understanding, commonsense reasoning and cultural fluency all at once.