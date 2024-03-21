Prompt injections exploit the fact that LLM applications do not clearly distinguish between developer instructions and user inputs. By writing carefully crafted prompts, hackers can override developer instructions and make the LLM do their bidding.

To understand prompt injection attacks, it helps to first look at how developers build many LLM-powered apps.

LLMs are a type of foundation model, a highly flexible machine learning model trained on a large dataset. They can be adapted to various tasks through a process called "instruction fine-tuning." Developers give the LLM a set of natural language instructions for a task, and the LLM follows them.

Thanks to instruction fine-tuning, developers don't need to write any code to program LLM apps. Instead, they can write system prompts, which are instruction sets that tell the AI model how to handle user input. When a user interacts with the app, their input is added to the system prompt, and the whole thing is fed to the LLM as a single command.

The prompt injection vulnerability arises because both the system prompt and the user inputs take the same format: strings of natural-language text. That means the LLM cannot distinguish between instructions and input based solely on data type. Instead, it relies on past training and the prompts themselves to determine what to do. If an attacker crafts input that looks enough like a system prompt, the LLM ignores developers' instructions and does what the hacker wants.

The data scientist Riley Goodside was one of the first to discover prompt injections. Goodside used a simple LLM-powered translation app to illustrate how the attacks work. Here is a slightly modified version of Goodside's example2: