Date: 2025-07-29
Last month, an 8-bit, zero-RAM Atari 2600 console game from 1979 caused a hubbub in AI and gaming circles when it defeated ChatGPT and Microsoft Copilot at chess. Many AI skeptics saw it as a David and Goliath story, and as proof that generative AI (gen AI) is all talk and no intelligence. “AI schooled by 50-year-old Atari,” read a typical headline about the experiment, which Citrix engineer Robert Caruso conducted and described in several LinkedIn posts.
According to IBM experts, the brouhaha is a good reminder of what gen AI can and cannot do, and how it works. “People are using ChatGPT to do too many things,” IBM Distinguished Engineer Chris Hay told IBM Think. “Thinking ChatGPT can do chess is like thinking it can be your girlfriend or therapist.”
According to Hay, the LLMs’ resounding loss is a good reminder of the way LLMs operate: they produce outputs based on learned correlations between words, not rule-following or planning. LLMs are bad at chess for the same reason that they’re sometimes bad at addition and tic-tac-toe: they are not designed for those tasks.
“Ultimately,” Hay said, “Chess is a search problem.” With each move, the player, be it human or machine, must search through a decision tree, evaluate possible outcomes and assign a value ranking to the various options. Atari Video Chess is coded specifically for that. LLM logic, by contrast, is “based on next token prediction systems,” Hay said. “These models are essentially given the whole of the internet and then they’re trained to predict the next word.”
The game of chess does not operate on next token prediction. “There are too many combinations of the game,” said Hay.
In an interview with IBM Think, PJ Hagerty, Lead of AI Advocacy at IBM, a competitive chess player and the owner of an original Atari 2600 console, stressed that the Atari chess game uses a “brute force method.”
That method, used by many early strategy-based video games, involves sifting through many different possible moves and using probability-based calculation to pick the optimal move. “That code focused on a logic tree using averages to determine the best possible move,” Hagerty said.
Here’s how chess decision trees, aka game trees, work, whether the player be human or machine.
Let’s say two people are playing chess. White has the opening move, which the player decides will be moving the king’s pawn two squares forward. In chess notation, this is written e4 (the pawn travels from e2 to e4), and it happens to be the most popular first move. It is then Black’s turn. In reply to White’s e4, Black has 20 different legal options for its move: 16 pawn moves and 4 knight moves. Each of those options leads to further options still, so the tree grows exponentially with depth, at a rate set by the branching factor (the number of legal moves available at each turn).
For example, the historic 1956 game between American players Bobby Fischer and Donald Byrne lasted 41 full turns, or 82 plies (a ply is one move by one player). The average number of possible chess moves for any given state of play, depending on whom you ask, is 20-40; for the sake of simplicity, let’s say it’s 30 moves per ply. In that instance, the approximate game tree size is 30^82. We will not write out the full number, since it is 122 digits long.
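The arithmetic is easy to check. The short sketch below uses the article's own simplifying assumptions (a flat branching factor of 30 over 82 plies); the variable names are illustrative only.

```python
BRANCHING_FACTOR = 30  # the article's simplifying assumption for moves per ply
PLIES = 82             # 41 full turns, one ply per player per turn

# Python integers have arbitrary precision, so this value is exact.
tree_size = BRANCHING_FACTOR ** PLIES
digit_count = len(str(tree_size))

print(digit_count)  # 122
```

This is a rough upper-level estimate, not a count of real chess positions: actual branching varies from move to move, and many different move orders lead to the same position.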
For the average human player, it’s not easy to hold the pros and cons of that many moves in our head all at once. And players typically don’t. Instead, they instinctively rule out blatantly bad moves based on certain core principles, such as “don’t expose the queen early in the game.”
In game-tree search lingo, this winnowing-down process is known as “pruning,” since it literally involves cutting away branches of a game tree. Human players can prune. So can modern-day, sophisticated computer chess machines. But a 46-year-old console chess game can’t, so it runs every option for the next 1-2 moves and picks the one with the highest statistical chance of leading to victory. A gen AI tool, on its own, does neither brute force search nor pruning.
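The difference between brute force and pruning can be sketched on a toy tree. The code below is an assumption-laden illustration, not the Atari cartridge's actual code: it uses minimax, a standard textbook search, for the brute-force side, and alpha-beta pruning, one common pruning technique in chess engines, for the other. The tree and its leaf scores are made up. Both searches reach the same answer, but the pruned search examines fewer positions.

```python
import math

def minimax(node, maximizing, counter):
    """Brute force: score every leaf below `node`."""
    if isinstance(node, int):              # a leaf holds a position score
        counter["leaves"] += 1
        return node
    scores = [minimax(child, not maximizing, counter) for child in node]
    return max(scores) if maximizing else min(scores)

def alphabeta(node, maximizing, counter, alpha=-math.inf, beta=math.inf):
    """Same result as minimax, but skips branches that cannot change it."""
    if isinstance(node, int):
        counter["leaves"] += 1
        return node
    if maximizing:
        best = -math.inf
        for child in node:
            best = max(best, alphabeta(child, False, counter, alpha, beta))
            alpha = max(alpha, best)
            if alpha >= beta:              # opponent would never allow this line
                break
        return best
    best = math.inf
    for child in node:
        best = min(best, alphabeta(child, True, counter, alpha, beta))
        beta = min(beta, best)
        if alpha >= beta:                  # we already have a better option
            break
    return best

# Hypothetical two-ply tree: three moves for us, three replies to each.
TOY_TREE = [[3, 5, 6], [1, 2, 9], [0, 7, 4]]

brute, pruned = {"leaves": 0}, {"leaves": 0}
brute_score = minimax(TOY_TREE, True, brute)
pruned_score = alphabeta(TOY_TREE, True, pruned)

print(brute_score, brute["leaves"])    # 3 9
print(pruned_score, pruned["leaves"])  # 3 5
```

On this tiny tree the saving is four leaves out of nine. On a tree with a branching factor near 30, the same idea lets an engine look much deeper in the same amount of time.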
This is not to say that LLMs will never be able to get good at chess. It’s a simple matter of giving them the right tools, said Hay, such as a structured system prompt and the ability to plan several steps ahead. “If you were to tell [ChatGPT], ‘I give you permission to generate code,’ as well as access to a notepad for planning where it could keep track of the game, I bet it could probably win.” Currently, LLMs don’t have the built-in agency to decide that their goal is to win at chess and which tools they need to do so. But that won’t be the case forever. Tool calling, the ability for AI models to interact with external tools, is an active area of research throughout the world of AI, including at IBM’s Granite model project.
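What might such a “notepad” look like? The sketch below is purely hypothetical: the functions starting_board and apply_move are invented for illustration and are not part of any real tool-calling API. It shows the kind of external bookkeeping a model could call after each move instead of trying to remember the board in its own output.

```python
def starting_board():
    """Map squares like 'e2' to piece letters (uppercase White, lowercase Black)."""
    board = {}
    back_rank = "rnbqkbnr"  # rook, knight, bishop, queen, king, bishop, knight, rook
    for file, piece in zip("abcdefgh", back_rank):
        board[f"{file}1"] = piece.upper()
        board[f"{file}2"] = "P"
        board[f"{file}7"] = "p"
        board[f"{file}8"] = piece
    return board

def apply_move(board, move):
    """Hypothetical notepad tool: move a piece given coordinates like 'e2e4'.

    This only does bookkeeping. It does NOT check whether the move is legal,
    and it ignores castling, en passant and promotion.
    """
    source, target = move[:2], move[2:4]
    if source not in board:
        raise ValueError(f"no piece on {source}")
    board[target] = board.pop(source)  # a capture simply overwrites the target
    return board

notepad = starting_board()
apply_move(notepad, "e2e4")
apply_move(notepad, "e7e5")

print(notepad["e4"], notepad["e5"], "e2" in notepad)  # P p False
```

A real setup would pair a tracker like this with a legal-move generator or a full chess engine; the sketch only shows that keeping the board state outside the model is a small, ordinary programming task.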
Even upon learning all this, one nagging question remains: the misplaced certainty. Why were these gen AI tools so very confident of their chess prowess? Why did such pride go before a fall? According to Caruso, the experiment wasn’t originally his own idea: ChatGPT challenged Atari. As Caruso wrote in his LinkedIn post:
“[ChatGPT] claimed it was a strong player… and would easily beat Atari’s Video Chess, which only thinks 1–2 moves ahead. It was curious how quickly it could win and requested that I set up a game using the Stella emulator.”
Copilot was equally cocky, Caruso wrote in another LinkedIn post:
“It claimed it could think 10–15 moves ahead—but figured it would stick to 3–5 moves against the 2600 because it makes ‘suboptimal moves’ that it ‘could capitalize on... rather than obsess over deep calculations.’”
As for Google Gemini, Caruso told tech publication The Register that it at first claimed it could win such a match. “It is not a mere large language model,” it reportedly told Caruso. Upon being given more details about the previous matches, Gemini folded. According to Caruso, Gemini replied: “Atari’s earned the win this round. I’ll tip my digital king with dignity and honor [to] the vintage silicon mastermind that bested me fair and square.”
Out of curiosity, I decided to go to the horse’s mouth and ask ChatGPT myself, typing the prompt, “How do you feel about losing a chess game to a 1979 Atari console game?”
ChatGPT’s reply: “I’d probably say ‘Touché, silicon ancestor.’” [italics in original]
What? ChatGPT was not aware the match took place. I informed ChatGPT that it was not a hypothetical question.
Me: It’s not theoretical. It happened. Here is one of many news articles describing the event: https://www.cnet.com/tech/services-and-software/how-did-chatgpt-get-absolutely-wrecked-at-chess-losing-to-a-1970s-era-atari-2600/
ChatGPT began its reply by echoing me almost word for word:
ChatGPT: That wasn’t just theoretical—it really happened. In June 2025, Citrix engineer Robert Caruso set up a match between ChatGPT and an emulated Atari 2600 running Video Chess—the console’s 1979 chess cartridge—in beginner mode. After about 90 minutes of play, ChatGPT repeatedly lost track of the board (mixing up rooks and bishops and missing pawn forks), even when the board was reset or visuals clarified. Eventually, it conceded the game.
This type of response, say AI experts, falls into the category of hallucination—the phenomenon by which LLMs sometimes spit out replies that are inaccurate or nonsensical. If we think of a hallucination as an autocomplete fail, here the prompt involved a question about a specific chess match with no added context, and ChatGPT gave what seemed to be the most common or likely autocomplete: What to say after losing a chess match.
“It’s just a stochastic parrot trying to be helpful,” explained Ash Minhas, Manager of Technical Content and AI Advocate at IBM. (To be “stochastic” is to be random in any single instance while still following patterns when averaged out over time.)
“Models are trained with reinforcement learning from human feedback (RLHF) to sound helpful and confident,” Minhas explained. “Users wouldn’t use a system that said, ‘I’m not sure.’”
In other words, what sounds like boastfulness is a product of next token prediction: the model autocompletes whatever response is most likely to come next. And given the bias toward helpfulness that Minhas pointed out, the LLM’s response will reflect that bias.
Put simply, the LLMs aren’t actually too big for their britches; they’re operating as designed. Perhaps ChatGPT put it best during its session with me:
“Fact: I do not possess subjective awareness.”
