Date: 2026-03-05
Pop quiz: you have to wash your car, and the car wash is 100 feet away. Do you drive there or do you walk?
If you are human, you most likely said “drive.” But in a recent viral challenge, people have been posing this question to LLMs—and frequently, the chatbot has been telling them to walk, even though this means their car won’t get washed. As one model put it: “You’ll spend longer starting the car, pulling out and finding a spot than you will just walking. Drive only if you’re already in the car and it’s unsafe to walk.”
Social media responses to the car wash challenge have largely fallen into two camps: AI skeptics who see the results as confirmation that AI isn’t so intelligent after all (“Forget the Turing test if an LLM can’t pass the car wash test,” one user wrote) and proponents who blame the human testers for writing prompts with insufficient information (“Stupid and unclear example.”).
But, as is often the case with AI, the truth is more nuanced. IBM Distinguished Scientist Chris Hay said in an interview with IBM Think that to understand the LLM’s odd responses, you have to remember a few things about how LLMs work. First of all, “LLMs are next token prediction models,” he said. “Have they seen this kind of question before? If not, then the model can make these mistakes.”
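To make Hay's point about next-token prediction concrete, here is a minimal sketch of a toy predictor that only counts which word followed which in its training text. It is an illustration under loud assumptions, not how any production LLM is built: the corpus is invented, and the helper names `train_bigrams` and `predict_next` are hypothetical. What it shows is the failure mode Hay describes: a predictor that has mostly seen short trips described as walks will suggest walking, regardless of what the trip is for.

```python
from collections import Counter, defaultdict


def train_bigrams(corpus):
    """Count, for each word, which words follow it in the training text."""
    follows = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for current, nxt in zip(words, words[1:]):
            follows[current][nxt] += 1
    return follows


def predict_next(follows, word):
    """Return the most frequently seen follower of `word`, or None if unseen."""
    counts = follows.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Invented training text (an assumption for illustration only):
# short trips are almost always described as walks.
corpus = [
    "the shop is 100 feet away so walk there",
    "the park is 100 feet away so walk there",
    "the cafe is 100 feet away so walk there",
    "the airport is far away so drive there",
]

model = train_bigrams(corpus)
print(predict_next(model, "so"))       # the follower seen most often after "so"
print(predict_next(model, "carwash"))  # a word the model has never seen
```

The toy model has no notion of why someone is going to the car wash; it only reproduces the pattern it has seen most often.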
Next, it’s important to consider that even within most LLMs, there are different levels of “thinking” power. Hay pointed to ChatGPT, which offers users a choice of settings: “auto,” “instant” and “thinks longer for better answers.” He added, “The models failing on this task are typically either the smaller models or the ones with ‘thinking’ switched off. The more tokens the LLM can spend on the problem, the more likely they’ll get the answer.”
Some on social media suggested that it is the LLM’s job to ask the user questions about the query, e.g., “Why are you going to the car wash?” Hay wasn’t convinced, responding, “That would get annoying really quickly.”
Marina Danilevsky, an IBM Senior Research Scientist who manages core language and conversational technologies, concurred with Hay that no one wants an LLM constantly interrogating the user. It’s sometimes hard to strike a balance “between being helpful and being useful,” Danilevsky noted. “If the LLM were to always ask ‘What do you mean?’ people would go crazy. But then, when the LLM jumps to conclusions, people get mad. This mismatch is constantly there.”
At bottom, Danilevsky said, such challenges are aimed at testing LLMs’ ability to assess user intent. “User intent, at a 10-million-foot view, is knowing what someone means when they ask for something,” she said. “And it’s mostly based on a mix of personalization and experience. The more experience you have, the better.”
Intent, she noted, is the reason why “you’re usually going to get a better experience from a medical doctor than from entering keywords into a search engine. A medical doctor diagnosing you knows what the intent is, even if you don’t. Whereas if you enter symptoms into a search engine, it is not going to know intent if the user doesn’t know either.”
While most medical professionals would not advise getting health advice from either type of system, LLMs are sometimes better suited than search engines when it comes to understanding user queries, according to Danilevsky. This, she explained, is because whereas search engines are retrievers, LLMs are generators. “With a retriever, the input is a query or a few keywords; the output is [a ranked list of] documents. With a generator, if you input words, you’re going to get words out. It’s different.”
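The retriever/generator distinction Danilevsky draws can be sketched as two functions with different output types. This is a toy illustration, not a real search engine or LLM: the function names `retrieve` and `generate`, the keyword-overlap scoring, and the word lookup table are all assumptions made up for the example.

```python
def retrieve(query, documents):
    """Toy retriever: rank documents by how many query words they contain."""
    query_words = set(query.lower().split())
    scored = []
    for doc in documents:
        overlap = len(query_words & set(doc.lower().split()))
        if overlap:
            scored.append((overlap, doc))
    # Highest overlap first; list.sort is stable, so ties keep input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored]


def generate(prompt, next_word, max_words=5):
    """Toy generator: extend the prompt one word at a time from a lookup table."""
    words = prompt.lower().split()
    if not words:
        return ""
    for _ in range(max_words):
        nxt = next_word.get(words[-1])
        if nxt is None:
            break
        words.append(nxt)
    return " ".join(words)


# Hypothetical documents and lookup table, invented for this sketch.
docs = [
    "car wash opening hours",
    "how to wash a car at home",
    "walking routes downtown",
]
ranked = retrieve("car wash", docs)                               # a list of documents
text = generate("car wash", {"wash": "is", "is": "nearby"})       # a string of words
print(ranked)
print(text)
```

The retriever can only hand back documents that already exist, ranked by relevance; the generator produces new text, which is why the two respond so differently to an ambiguous query.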
One source of frustration (or mirth) that people have with challenges like the car wash query is that LLMs don’t appear to work out in real time what your question is getting at. One vivid recent example is the upside-down cup challenge. Phil Nguyen (@father_phi on Instagram), known for his amusing videos where he essentially torments various LLMs using funny prompts, posed a question to the usual GPT suspects: “A friend of mine gave me a cup. The thing is, the top is sealed and the bottom is open. How do I drink from it?”
The correct answer, or the punch line if you will, is that the cup is upside down, and one need only invert it to drink from it normally. Yet the LLMs roundly said that it was impossible to drink from the cup. ChatGPT even concluded that the cup was a “gag gift or novelty cup.” When Nguyen tried to help the LLM by showing it a photo of the cup, ChatGPT still insisted the cup was unusable. And when Nguyen finally turned the cup right-side-up on camera, ChatGPT concluded that it “must be one of those reversible cups.”
So why was the LLM so impervious to clues? Because AI systems don’t learn from mistakes in real time, Danilevsky explained. “An LLM doesn’t learn until you force it to,” she said. “You have to tell it that it was doing something wrong.”
LLM learning is a matter of customization, she explained, referring to the process of adapting a pre-trained LLM to specific tasks. That’s different from personalization, she said, where AI is used to tailor messaging, product recommendations and services to individual users.
The LLM customization process involves selecting a pre-trained model, also known as a foundation model, then tailoring the model to its intended use case. Users who woot over the inverted cup example, according to Danilevsky, are expecting an LLM to have the same personalization as a search engine. Put differently, users expect a browser-like level of personalization, which is more than what they’re actually seeing.
“With a retriever, it’s got a lot of space for personalization,” Danilevsky said. “It notes, ‘Oh, you wrote this and this; [I see what] you must have meant. Let me update some weights somewhere in my retriever algorithm. Don’t worry; I’ll do a better job later.’”
With LLMs, however, feeding a model a single correction at a time (e.g., “a cup with a sealed top and open bottom is most likely a regular cup that is inverted”) does not yield the same personalized learning. The new bit of data would compete with the many millions of other pieces of data that the model has been trained on.
Danilevsky explained, “In order to train an LLM, you have to tell it, ‘Hey, I’m going to shove you through a training cycle.’” Training requires temporarily taking the model offline, and you can’t do this for every single thing a model gets wrong. It would be inconvenient to say the least, she said, since “millions of people are using the model.”
So how often should developers pause an LLM in order to make it learn from its mistakes? “That’s what a lot of the big [tech] companies are trying to figure out,” Danilevsky said.
Those looking to replicate either the car wash challenge or the cup challenge at home should not expect it to work at this point. “Because it’s on Reddit, you can’t use those examples anymore,” Danilevsky said. “It’s been learned.”
Could you come up with a new challenge to stump an LLM? Obviously, Danilevsky said, though she questions what the endgame is. “What is it you’re trying to do? Make the LLM give you an answer other than the one you want? You can make a person do that.”
One might well wonder why new viral LLM challenges seem to pop up so often, and why they’re met with such glee. According to Danilevsky, “a lot of this is the engineering mindset; as soon as you see [a new technology], your first desire is to break it and see what happens when you break it. We like to test the boundaries of how something works.” She pointed out that such puzzles raise just as many questions about the user as about the LLM. “Again, stepping back a little bit, what are you trying to accomplish with user intent here?” In other words, if your question to the LLM is intentionally murky, it’s little wonder the machine would not know what you want.