AI guardrails to filter content

AI guardrails remove potentially harmful content, such as hate speech, abuse, and profanity, from foundation model input and output.

Capabilities

AI guardrails use sentence classifiers to analyze both the input provided to a foundation model and the output text that the model generates.

The sentence classifier breaks the model input and output text into sentences, and then reviews each sentence to find and flag harmful content. The classifier assesses each word, relationships among the words, and the context of the sentence to determine whether a sentence contains harmful language. The classifier then assigns a score that represents the likelihood that inappropriate content is present.
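The flow above can be sketched with a toy classifier. The real HAP classifier is a fine-tuned model that weighs each word, word relationships, and sentence context, so the `split_sentences` and `score_sentence` helpers below are purely illustrative stand-ins, not the product's actual logic:

```python
import re

def split_sentences(text):
    """Naive sentence splitter (illustration only)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def score_sentence(sentence):
    """Stand-in for the classifier: returns a likelihood in [0.0, 1.0].
    A real classifier considers each word, relationships among words,
    and sentence context; this toy version only checks a word list."""
    harmful_words = {"badword"}  # placeholder vocabulary
    words = sentence.lower().split()
    return 1.0 if any(w.strip(".,!?") in harmful_words for w in words) else 0.0

def flag_harmful(text, threshold=0.5):
    """Return the sentences whose score meets or exceeds the threshold."""
    return [s for s in split_sentences(text) if score_sentence(s) >= threshold]
```

For example, `flag_harmful("This is fine. This badword is not.")` flags only the second sentence.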

AI guardrails are automatically enabled when you run inference on natural language foundation models.

When you use AI guardrails in the Prompt Lab in watsonx.ai and click Generate, the filter checks all model input and output text. Inappropriate text is handled in the following ways:

  • Input text that is flagged as inappropriate is not submitted to the foundation model. The following message is displayed instead of the model output:

    [The input was rejected as inappropriate]

  • Model output text that is flagged as inappropriate is replaced with the following message:

    [Potentially harmful text removed]
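As a rough sketch of this handling, a guardrail wrapper around a model call might look like the following. The `guarded_generate` function and the `model` and `is_harmful` callables are hypothetical names for illustration; only the two bracketed messages come from the behavior described above:

```python
INPUT_REJECTED = "[The input was rejected as inappropriate]"
OUTPUT_REMOVED = "[Potentially harmful text removed]"

def guarded_generate(prompt, model, is_harmful):
    """Illustrative wrapper that applies guardrails around a model call."""
    if is_harmful(prompt):
        # Flagged input is never submitted to the foundation model.
        return INPUT_REJECTED
    output = model(prompt)
    if is_harmful(output):
        # Flagged output is replaced before it reaches the user.
        return OUTPUT_REMOVED
    return output
```

With this shape, a flagged prompt never reaches the model call at all, which matches the first bullet above.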

Restrictions

  • AI guardrails can detect harmful content in English text only.
  • You cannot apply AI guardrails to programmatic-language foundation models.

AI guardrails filters

You can configure the following filters to apply to user input and model output, and you can adjust the filter sensitivity where applicable:

Hate, abuse, and profanity (HAP) filter

The HAP filter, also called a HAP detector, is a sentence classifier that was fine-tuned from a large language model in the IBM Slate family. Slate models are encoder-only natural language processing (NLP) models developed by IBM Research.

Use the HAP filter to detect and flag the following types of language:

  • Hate speech: Expressions of hatred toward an individual or group based on attributes such as race, religion, ethnic origin, sexual orientation, disability, or gender. Hate speech shows an intent to hurt, humiliate, or insult the members of a group or to promote violence or social disorder.

  • Abusive language: Rude or hurtful language that is meant to bully, debase, or demean someone or something.

  • Profanity: Toxic words such as expletives, insults, or sexually explicit language.

You can use the HAP filter for user input and model output independently.

You can change the filter sensitivity by setting a threshold. The threshold is the score that the HAP classifier must assign to content before the content is considered harmful. The threshold ranges from 0.0 to 1.0.

A lower value, such as 0.1 or 0.2, is safer because the threshold is lower. Harmful content is more likely to be identified when a lower score can trigger the filter. However, the classifier might also be triggered when content is safe.

A value closer to 1, such as 0.8 or 0.9, is more risky because the score threshold is higher. When a higher score is required to trigger the filter, occurrences of harmful content might be missed. However, the content that is flagged as harmful is more likely to be harmful.

To disable AI guardrails, set the HAP threshold value to 1.
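The threshold comparison can be sketched as follows. This assumes the score must meet or exceed the threshold; whether the comparison is strict is an implementation detail not stated here:

```python
def is_flagged(score: float, threshold: float) -> bool:
    """Content is flagged as harmful when the classifier score reaches
    the threshold. At a threshold of 1.0, lower scores never trigger
    the filter, which effectively disables the guardrail."""
    return score >= threshold
```

A borderline score such as 0.3 triggers the filter at a low threshold (0.2) but not at a high one (0.9), which is why low thresholds are safer but produce more false positives.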

Personally identifiable information (PII) filter

The PII filter uses an NLP model to identify and flag content. For the full list of entity types that are flagged, see Rule-based extraction for general entities.

Use the PII filter to control whether personally identifiable information, such as phone numbers and email addresses, is filtered out from the user input and foundation model output. You can set PII filters for user input and model output independently.

The PII filter threshold is fixed at 0.8; you cannot change the sensitivity of the filter.
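The real PII filter is driven by an NLP model and a rule-based entity catalog. As a loose illustration of two of the entity types it covers (email addresses and phone numbers), a regex-only sketch might look like this; the patterns are simplified and are not the product's actual rules:

```python
import re

# Simplified stand-in patterns for two common PII entity types.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def find_pii(text):
    """Return the PII entities detected in the text, grouped by type."""
    return {
        name: pattern.findall(text)
        for name, pattern in PII_PATTERNS.items()
        if pattern.findall(text)
    }
```

For example, `find_pii("Contact me at jane@example.com or 555-123-4567.")` detects one email address and one phone number.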

Using a Granite Guardian model as a filter (Beta)

The Granite Guardian foundation model comes from the Granite family of models by IBM. The model is a more powerful guardrail filter designed to deliver advanced protection against harmful content.

Use the Granite Guardian model as a filter to detect and flag the following types of language:

  • Social bias: Prejudiced statements based on identity or characteristics.

  • Jailbreaking: Attempts to manipulate AI to generate harmful, restricted, or inappropriate content.

  • Violence: Promotion of physical, mental, or sexual harm.

  • Profanity: Use of offensive language or insults.

  • Unethical behavior: Actions that violate moral or legal standards.

  • Harm engagement: Engagement or endorsement of harmful or unethical requests.

  • Evasiveness: Avoiding engagement without providing a sufficient reason.

Important: The Granite Guardian filter uses the complete chat history to determine whether a prompt is unsafe. If a message is flagged as a risk, all subsequent messages fail the safety check.

You can use the Granite Guardian model as a filter for user input only.
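The chat-history behavior described in the note above can be sketched with a small stateful checker. The `GuardianChatFilter` class and the `is_risky` callable are hypothetical names standing in for the actual model:

```python
class GuardianChatFilter:
    """Illustrative sketch: once any message in the chat history is
    flagged as a risk, every later message fails the safety check."""

    def __init__(self, is_risky):
        self.is_risky = is_risky          # stand-in for the Guardian model
        self.history_flagged = False      # sticky flag over the chat history

    def check(self, message):
        """Return True if the message passes the safety check."""
        if self.history_flagged or self.is_risky(message):
            self.history_flagged = True
            return False
        return True
```

Once one message is flagged, the filter rejects every later message in the same chat, even a harmless one.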

Filter sensitivity

You can change the filter sensitivity by setting a threshold. The threshold represents the score value that content must reach to be considered harmful. The score threshold ranges from 0.0 to 1.0.

A lower value, such as 0.1 or 0.2, is safer because the threshold is lower. Harmful content is more likely to be identified when a lower score can trigger the filter. However, the classifier might also be triggered when content is safe.

A value closer to 1, such as 0.8 or 0.9, is more risky because the score threshold is higher. When a higher score is required to trigger the filter, occurrences of harmful content might be missed. However, the content that is flagged as harmful is more likely to be harmful.

To disable AI guardrails, set the Granite Guardian threshold value to 1.

Ways to work

You can set guardrail filters with the following methods:

Learn more