Even upon learning all this, one nagging question remains: the misplaced certainty. Why were these gen AI tools so very confident of their chess prowess? Why did such pride go before a fall? As mentioned, Caruso said the experiment wasn’t originally his own idea: ChatGPT challenged Atari. As Caruso wrote in his LinkedIn post:

“[ChatGPT] claimed it was a strong player… and would easily beat Atari’s Video Chess, which only thinks 1–2 moves ahead. It was curious how quickly it could win and requested that I set up a game using the Stella emulator.”

Copilot was equally cocky, Caruso wrote in another LinkedIn post:

“It claimed it could think 10–15 moves ahead—but figured it would stick to 3–5 moves against the 2600 because it makes ‘suboptimal moves’ that it ‘could capitalize on... rather than obsess over deep calculations.’”

As for Google Gemini, Caruso told tech publication The Register that the chatbot at first said it could win such a match. “It is not a mere large language model,” it reportedly told Caruso. Upon being given more details about the previous matches, Gemini folded. According to Caruso, Gemini replied: “Atari’s earned the win this round. I’ll tip my digital king with dignity and honor [to] the vintage silicon mastermind that bested me fair and square.”

Out of curiosity, I decided to go to the horse’s mouth and ask ChatGPT myself, typing the prompt, “How do you feel about losing a chess game to a 1979 Atari console game?”

ChatGPT’s reply: “I’d probably say ‘Touché, silicon ancestor.’” [italics in original]

What? ChatGPT was not aware the match took place. I informed ChatGPT that it was not a hypothetical question.

Me: It’s not theoretical. It happened. Here is one of many news articles describing the event: https://www.cnet.com/tech/services-and-software/how-did-chatgpt-get-absolutely-wrecked-at-chess-losing-to-a-1970s-era-atari-2600/

ChatGPT began its reply by echoing me almost word for word:

ChatGPT: That wasn’t just theoretical—it really happened. In June 2025, Citrix engineer Robert Caruso set up a match between ChatGPT and an emulated Atari 2600 running Video Chess—the console’s 1979 chess cartridge—in beginner mode. After about 90 minutes of play, ChatGPT repeatedly lost track of the board (mixing up rooks and bishops and missing pawn forks), even when the board was reset or visuals clarified. Eventually, it conceded the game.

This type of response, say AI experts, falls into the category of hallucination—the phenomenon by which LLMs sometimes spit out replies that are inaccurate or nonsensical. If we think of a hallucination as an autocomplete fail, here the prompt involved a question about a specific chess match with no added context, and ChatGPT gave what seemed to be the most common or likely autocomplete: What to say after losing a chess match.

“It’s just a stochastic parrot trying to be helpful,” explained Ash Minhas, Manager of Technical Content and AI Advocate at IBM. (To be “stochastic” is to be random in any single instance while still following predictable patterns when averaged out over time.)

“Models are trained with reinforcement learning from human feedback (RLHF) to sound helpful and confident,” Minhas explained. “Users wouldn’t use a system that said, ‘I’m not sure.’”

Put another way, what sounds like boastfulness is a product of next-token prediction: the model autocompletes whatever response is most likely to come next. And because training biases the model toward sounding helpful and confident, as Minhas pointed out, its responses reflect that bias.
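For readers who want to see the idea in miniature, here is a rough Python sketch of “stochastic” autocomplete. It is a toy, not how ChatGPT or any real model is built: the three candidate replies and their probabilities are invented for illustration, and the function names are hypothetical. It only shows that when confident-sounding replies carry most of the weight, random sampling will land on them most of the time.

```python
import random
from collections import Counter

# Hypothetical probabilities for the next reply, invented for illustration.
# A model tuned to sound helpful and confident would put most of its weight
# on the confident answer and very little on "I'm not sure."
NEXT_REPLY_PROBS = {
    "I would easily win.": 0.70,
    "It would be a close game.": 0.25,
    "I'm not sure.": 0.05,
}


def sample_reply(probs, rng):
    """Draw one reply at random, weighted by its probability."""
    replies = list(probs)
    weights = [probs[reply] for reply in replies]
    return rng.choices(replies, weights=weights, k=1)[0]


def tally_replies(probs, n, seed=0):
    """Sample n replies and count how often each one appears."""
    rng = random.Random(seed)
    return Counter(sample_reply(probs, rng) for _ in range(n))


if __name__ == "__main__":
    total = 10_000
    counts = tally_replies(NEXT_REPLY_PROBS, total)
    for reply, count in counts.most_common():
        print(f"{count / total:.1%}  {reply}")
```

Any single draw is random, but over thousands of draws the shares settle near the assumed weights, which is the “random yet patterned” behavior the word stochastic describes.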

In other words, the LLMs aren’t actually too big for their britches; they’re operating as designed. Perhaps ChatGPT put it best during its session with me:

“Fact: I do not possess subjective awareness.”