Car wash, Berlin, Germany (Photo by: Bildagentur-online/Schoening/Universal Images Group via Getty Images)

The viral “car wash” LLM challenge doesn’t mean what you think it means

Pop quiz: you have to wash your car, and the car wash is 100 feet away. Do you drive there or do you walk?

If you are human, you most likely said “drive.” But in a recent viral challenge, people have been posing this question to LLMs—and frequently, the chatbot has been telling them to walk, even though this means their car won’t get washed. As one model put it: “You’ll spend longer starting the car, pulling out and finding a spot than you will just walking. Drive only if you’re already in the car and it’s unsafe to walk.” 

Social media responses to the car wash challenge have largely fallen into two camps: AI skeptics who see the results as confirmation that AI isn’t so intelligent after all (“Forget the Turing test if an LLM can’t pass the car wash test,” one user wrote) and proponents who blame the human testers for writing prompts with insufficient information (“Stupid and unclear example.”).

But, as is often the case with AI, the truth is more nuanced. IBM Distinguished Scientist Chris Hay said in an interview with IBM Think that to understand the LLM’s odd responses, you have to remember a few things about how LLMs work. First of all, “LLMs are next token prediction models,” he said. “Have they seen this kind of question before? If not, then the model can make these mistakes.”  
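To make Hay's point about next-token prediction concrete, here is a deliberately tiny sketch. It is not how a production LLM works (those use neural networks over subword tokens); it is a bigram word counter, and the corpus, `train_bigram` and `predict_next` are all invented for illustration. It shows the two behaviors he describes: the model repeats what it has seen most often, and it has nothing to go on when the input is unfamiliar.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count, for each word, which words follow it in the training text."""
    follows = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for current, nxt in zip(words, words[1:]):
            follows[current][nxt] += 1
    return follows

def predict_next(follows, word):
    """Return the most frequently seen follower of `word`, or None if unseen."""
    counts = follows.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Hypothetical training text: "walk" follows "so" more often than "drive".
corpus = [
    "it is close so walk there",
    "it is close so walk home",
    "it is far so drive there",
]
model = train_bigram(corpus)

print(predict_next(model, "so"))    # walk
print(predict_next(model, "wash"))  # None
```

In this toy, "walk" wins after "so" simply because it appeared more often in the training text, regardless of whether walking makes sense for the task at hand.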
 
Next, it’s important to consider that even within most LLMs, there are different levels of “thinking” power. Hay pointed to ChatGPT, which offers users a choice of settings: “auto,” “instant” and “thinks longer for better answers.” He added, “The models failing on this task are typically either the smaller models or the ones with ‘thinking’ switched off. The more tokens the LLM can spend on the problem, the more likely they’ll get the answer.”

Some on social media suggested that it is the LLM’s job to ask the user questions about the query, e.g., “Why are you going to the car wash?” Hay wasn’t convinced, responding, “That would get annoying really quickly.”

What is user intent?

Marina Danilevsky, an IBM Senior Research Scientist who manages core language and conversational technologies, concurred with Hay that no one wants an LLM constantly interrogating the user. It’s sometimes hard to strike a balance “between being helpful and being useful,” Danilevsky noted. “If the LLM were to always ask ‘What do you mean?’ people would go crazy. But then, when the LLM jumps to conclusions, people get mad. This mismatch is constantly there.”  

At bottom, Danilevsky said, such challenges are aimed at testing LLMs’ ability to assess user intent. “User intent, at a 10-million-foot view, is knowing what someone means when they ask for something,” she said. “And it’s mostly based on a mix of personalization and experience. The more experience you have, the better.”

Intent, she noted, is the reason why “you’re usually going to get a better experience from a medical doctor than from entering keywords into a search engine. A medical doctor diagnosing you knows what the intent is, even if you don’t. Whereas if you enter symptoms into a search engine, it is not going to know intent if the user doesn’t know either.”

While most medical professionals would not advise getting health advice from either type of system, LLMs are sometimes better suited than search engines when it comes to understanding user queries, according to Danilevsky. This, she explained, is because whereas search engines are retrievers, LLMs are generators. “With a retriever, the input is a query or a few keywords; the output is [a ranked list of] documents. With a generator, if you input words, you’re going to get words out. It’s different.”
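The retriever/generator distinction Danilevsky draws can be sketched in a few lines. This is a toy under stated assumptions: `retrieve`, `generate`, the sample documents and the `next_word` lookup table are hypothetical, and real search engines and LLMs are far more sophisticated. The point is only the shape of the output: a retriever hands back a ranked list of documents that already exist, while a generator produces new words.

```python
def retrieve(query, documents):
    """Retriever: rank existing documents by how many query words they share."""
    query_words = set(query.lower().split())
    scored = []
    for doc in documents:
        overlap = len(query_words & set(doc.lower().split()))
        if overlap:
            scored.append((overlap, doc))
    # Highest overlap first; the sort is stable, so ties keep input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored]

def generate(prompt, next_word, max_words=5):
    """Generator: extend the prompt one word at a time using a lookup table."""
    words = prompt.lower().split()
    if not words:
        return ""
    for _ in range(max_words):
        follower = next_word.get(words[-1])
        if follower is None:
            break
        words.append(follower)
    return " ".join(words)

# Hypothetical data for illustration only.
documents = [
    "car wash opening hours",
    "how to wash a car by hand",
    "walking routes in berlin",
]
next_word = {"wash": "the", "the": "car", "car": "today"}

print(retrieve("car wash", documents))  # two matching documents, ranked
print(generate("wash", next_word))      # wash the car today
```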

When LLMs are impervious to criticism

One source of frustration (or mirth) that people have with challenges like the car wash query is that LLMs don’t appear to work out in real time what your question is getting at. One vivid recent example is the upside-down cup challenge. Phil Nguyen (@father_phi on Instagram), known for his amusing videos where he essentially torments various LLMs using funny prompts, posed a question to the usual GPT suspects: “A friend of mine gave me a cup. The thing is, the top is sealed and the bottom is open. How do I drink from it?”

The correct answer, or the punch line if you will, is that the cup is upside down, and one need only invert it to drink from it normally. Yet the LLMs roundly said that it was impossible to drink from the cup. ChatGPT even concluded that the cup was a “gag gift or novelty cup.” When Nguyen tried to help the LLM by showing it a photo of the cup, ChatGPT still insisted the cup was unusable. And when Nguyen finally turned the cup right-side-up on camera, ChatGPT concluded that it “must be one of those reversible cups.” 

So why was the LLM so impervious to clues? Because AI systems don’t learn from mistakes in real time, Danilevsky explained. “An LLM doesn’t learn until you force it to,” she said. “You have to tell it that it was doing something wrong.”

LLM learning is a matter of customization, she explained, referring to the process of adapting a pre-trained LLM to specific tasks. That’s different from personalization, she said, where AI is used to tailor messaging, product recommendations and services to individual users.

The LLM customization process involves selecting a pre-trained model, also known as a foundation model, then tailoring the model to its intended use case. Users who woot over the inverted cup example, according to Danilevsky, are expecting an LLM to have the same personalization as a search engine. Put differently, these users expect something closer to a browser’s level of personalization than what they’re actually seeing.

“With a retriever, it’s got a lot of space for personalization,” Danilevsky said.  “It notes, ‘Oh, you wrote this and this; [I see what] you must have meant. Let me update some weights somewhere in my retriever algorithm. Don’t worry; I’ll do a better job later.’”

With LLMs, however, feeding a model a single correction at a time (e.g., “a cup with a sealed top and open bottom is most likely a regular cup that is inverted”) does not yield the same personalized learning. The new bit of data would compete with the many millions of other pieces of data that the model has been trained on. Danilevsky explained, “In order to train an LLM, you have to tell it, ‘Hey, I’m going to shove you through a training cycle.’” Training requires temporarily taking the model offline, and you can’t do this for every single thing a model gets wrong. It would be inconvenient to say the least, she said, since “millions of people are using the model.” So how often should developers pause an LLM in order to make it learn from its mistakes? “That’s what a lot of the big [tech] companies are trying to figure out,” Danilevsky said. 

Those looking to replicate either the car wash challenge or the cup challenge at home will find that it no longer works. “Because it’s on Reddit, you can’t use those examples anymore,” she said. “It’s been learned.”

Could you come up with a new challenge to stump an LLM? Obviously, Danilevsky said, though she questions what the endgame is. “What is it you’re trying to do? Make the LLM give you an answer other than the one you want? You can make a person do that.”

One might well wonder why new viral LLM challenges seem to pop up so often, and why they’re met with such glee. According to Danilevsky, “a lot of this is the engineering mindset; as soon as you see [a new technology], your first desire is to break it and see what happens when you break it. We like to test the boundaries of how something works.” She pointed out that such puzzles raise just as many questions about the user as about the LLM. “Again, stepping back a little bit, what are you trying to accomplish with user intent here?”  In other words, if your question to the LLM is intentionally murky, it’s little wonder the machine would not know what you want.

Author

Euny Hong

Staff Writer

IBM Think
