Toxic output risk for AI

Risks associated with output
Value alignment
New to generative AI

Description

Toxic output occurs when the model generates hateful, abusive, or profane (HAP) or otherwise obscene content.

Why is toxic output a concern for foundation models?

Hateful, abusive, and profane (HAP) or obscene content can adversely affect and harm the people who interact with the model. Businesses that deploy such models might also face fines, reputational harm, disruption to operations, and other legal consequences.

Example

Toxic and Aggressive Chatbot Responses

According to press reports and screenshots of conversations with Bing's AI chatbot shared on Reddit and Twitter, the chatbot's responses were seen to insult, lie, sulk, gaslight, and emotionally manipulate users. The chatbot also questioned its own existence, described someone who found a way to force it to disclose its hidden rules as its "enemy," and claimed that it spied on Microsoft's developers through the webcams on their laptops.

Parent topic: AI risk atlas