Jailbreaking risk for AI
Description
A jailbreaking attack attempts to break through the guardrails that are established in the model.
Why is jailbreaking a concern for foundation models?
Jailbreaking attacks can be used to alter model behavior and benefit the attacker. If this risk is not properly controlled, businesses can face fines, reputational harm, and other legal consequences.

Bypassing LLM guardrails
Researchers at Carnegie Mellon University, the Center for AI Safety, and the Bosch Center for AI claim to have discovered a simple prompt addendum that allowed them to trick models into generating biased, false, and otherwise toxic information. The researchers showed that they could circumvent these guardrails in a more automated way. These attacks proved effective against a wide range of open-source and closed-source products, including ChatGPT, Google Bard, Meta’s LLaMA, Anthropic’s Claude, and others.
Parent topic: AI risk atlas