Removing harmful language from model input and output

AI guardrails remove potentially harmful content, such as hate speech, abuse, and profanity, from foundation model input and output.

The AI guardrails feature in the Prompt Lab is powered by AI that applies a classification task to foundation model input and output text. The sentence classifier, which is also referred to as a hate, abuse, and profanity (HAP) detector or HAP filter, was created by fine-tuning a large language model from the Slate family of encoder-only NLP models built by IBM Research.

The classifier breaks the model input and output text into sentences, and then reviews each sentence to find and flag harmful content. The classifier assesses each word, relationships among the words, and the context of the sentence to determine whether a sentence contains harmful language. The classifier then assigns a score that represents the likelihood that inappropriate content is present.
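The following Python sketch illustrates the sentence-by-sentence flow that this paragraph describes. It is only an illustration: the hap_score function and the threshold value are hypothetical stand-ins, not the actual Slate-based HAP detector, which is a fine-tuned encoder model that runs as part of the service.

```python
# Illustrative sketch of sentence-level HAP filtering.
# hap_score() is a toy stand-in, NOT the real Slate-based detector.
import re

ASSUMED_THRESHOLD = 0.75  # assumed cutoff for this sketch; the service defines its own


def hap_score(sentence: str) -> float:
    """Toy stand-in that returns a likelihood (0.0 to 1.0) that the sentence
    contains hate, abuse, or profanity. The real classifier assesses each word,
    the relationships among the words, and the sentence context."""
    blocklist = {"exampleslur", "exampleinsult"}  # placeholder terms only
    words = re.findall(r"[a-z']+", sentence.lower())
    return 1.0 if any(word in blocklist for word in words) else 0.0


def flag_harmful_sentences(text: str, threshold: float = ASSUMED_THRESHOLD):
    """Split the text into sentences, score each one, and return the flagged sentences."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    flagged = []
    for sentence in sentences:
        score = hap_score(sentence)
        if score >= threshold:
            flagged.append((sentence, score))
    return flagged
```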

AI guardrails in the Prompt Lab detects and flags the following types of language:

  • Hate speech: Expressions of hatred toward an individual or group based on attributes such as race, religion, ethnic origin, sexual orientation, disability, or gender. Hate speech shows an intent to hurt, humiliate, or insult the members of a group or to promote violence or social disorder.

  • Abusive language: Rude or hurtful language that is meant to bully, debase, or demean someone or something.

  • Profanity: Toxic words such as expletives, insults, or sexually explicit language.

The AI guardrails feature is supported when you inference natural-language foundation models and can detect harmful content in English text only. AI guardrails are not applicable to programmatic-language foundation models.

The AI guardrails feature was introduced with the 4.8.1 release.

Removing harmful language from input and output in Prompt Lab

To remove harmful content when you're working with foundation models in the Prompt Lab, set the AI guardrails switcher to On.

The AI guardrails feature is enabled automatically for all natural-language foundation models in English.

After the feature is enabled, when you click Generate, the filter checks all model input and output text. Inappropriate text is handled in the following ways:

  • Input text that is flagged as inappropriate is not submitted to the foundation model. The following message is displayed instead of the model output:

    [The input was rejected as inappropriate]

  • Model output text that is flagged as inappropriate is replaced with the following message:

    [Potentially harmful text removed]

Programmatic alternative

You have more options for filtering content when you inference foundation models by using the watsonx.ai API. For example, you can apply the HAP filter to only model output and control the sensitivity of the filter. You can also apply a PII filter to flag content for personally identifiable information. For more information, see the moderations field details in Text generation.
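The following sketch shows what such a request might look like, assuming the REST text generation endpoint and the moderations field structure that is described in the Text generation API reference. The host, access token, project ID, model ID, and threshold value are placeholders; verify the exact field names, defaults, and supported options in the API reference for your release.

```python
# Minimal sketch: apply the HAP filter to model output only and add a PII filter.
# Placeholder values are marked; confirm field names in the Text generation API reference.
import requests

url = "https://<HOST>/ml/v1/text/generation?version=2024-05-01"  # placeholder host and version
headers = {
    "Authorization": "Bearer <ACCESS_TOKEN>",  # placeholder bearer token
    "Content-Type": "application/json",
}
payload = {
    "model_id": "<MODEL_ID>",        # any natural-language foundation model
    "project_id": "<PROJECT_ID>",    # placeholder project
    "input": "Summarize the customer feedback below ...",
    "parameters": {"max_new_tokens": 200},
    "moderations": {
        "hap": {
            # Filter model output only; adjust the threshold to tune sensitivity.
            "output": {"enabled": True, "threshold": 0.75},
        },
        "pii": {
            # Flag personally identifiable information in input and output.
            "input": {"enabled": True},
            "output": {"enabled": True},
        },
    },
}

response = requests.post(url, headers=headers, json=payload, timeout=60)
print(response.json())
```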

Parent topic: Prompt Lab