Removing harmful language from model input and output

AI guardrails removes potentially harmful content, such as hate speech, abuse, and profanity, from foundation model output and input.

The AI guardrails feature in the Prompt Lab is powered by a sentence classifier that runs on foundation model input and output text. The classifier, which is also referred to as a hate, abuse, and profanity (HAP) detector or HAP filter, was created by fine-tuning a large language model from the Slate family of encoder-only NLP models built by IBM Research.

The classifier breaks the model input and output text into sentences, and then reviews each sentence to find and flag harmful content. The classifier assesses each word, relationships among the words, and the context of the sentence to determine whether a sentence contains harmful language. The classifier then assigns a score that represents the likelihood that inappropriate content is present.
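
The sentence-by-sentence flow described above can be sketched roughly as follows. This is an illustration only, not the actual HAP classifier: the regular-expression sentence splitter and the word-list `toy_scorer` are hypothetical stand-ins, whereas the real classifier is a fine-tuned language model that also weighs word relationships and context.

```python
import re
from typing import Callable, List, Tuple


def split_sentences(text: str) -> List[str]:
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def score_sentences(
    text: str, scorer: Callable[[str], float]
) -> List[Tuple[str, float]]:
    """Pair every sentence with the likelihood score returned by the scorer."""
    return [(sentence, scorer(sentence)) for sentence in split_sentences(text)]


# Hypothetical stand-in for the real model: a fixed word list, no context.
BLOCKLIST = {"darn"}


def toy_scorer(sentence: str) -> float:
    """Return a high score if any blocklisted word appears, else a low one."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return 0.9 if any(word in BLOCKLIST for word in words) else 0.05


for sentence, score in score_sentences("Hello there. That darn cat!", toy_scorer):
    print(f"{score:.2f}  {sentence}")
```

Each sentence receives its own score, which is why one harmful sentence can be flagged even when the rest of the text is harmless.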

AI guardrails in the Prompt Lab detects and flags the following types of language:

  • Hate speech: Expressions of hatred toward an individual or group based on attributes such as race, religion, ethnic origin, sexual orientation, disability, or gender. Hate speech shows an intent to hurt, humiliate, or insult the members of a group or to promote violence or social disorder.

  • Abusive language: Rude or hurtful language that is meant to bully, debase, or demean someone or something.

  • Profanity: Toxic words such as expletives, insults, or sexually explicit language.

The AI guardrails feature is supported when you inference natural-language foundation models and can detect harmful content in English text only. The feature does not apply to programmatic-language foundation models.

Removing harmful language from input and output in Prompt Lab

To remove harmful content when you're working with foundation models in the Prompt Lab, set the AI guardrails switcher to On.

The AI guardrails feature is enabled by default for all natural-language foundation models in English.

After the feature is enabled, when you click Generate, the filter checks all model input and output text. Inappropriate text is handled in the following ways:

  • Input text that is flagged as inappropriate is not submitted to the foundation model. The following message is displayed instead of the model output:

    [The input was rejected as inappropriate]

  • Model output text that is flagged as inappropriate is replaced with the following message:

    [Potentially harmful text removed]
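
The handling described above can be summarized in a short sketch. The function names `guarded_generate`, `generate`, and `is_harmful` are hypothetical placeholders, not part of any watsonx.ai library. The sketch also assumes that the whole output is replaced when it is flagged; the text above does not say whether only the flagged sentences are replaced.

```python
from typing import Callable

INPUT_REJECTED = "[The input was rejected as inappropriate]"
OUTPUT_REMOVED = "[Potentially harmful text removed]"


def guarded_generate(
    prompt: str,
    generate: Callable[[str], str],
    is_harmful: Callable[[str], bool],
) -> str:
    """Check the input first, call the model only if it passes, then check the output."""
    if is_harmful(prompt):
        # Flagged input is never submitted to the foundation model.
        return INPUT_REJECTED
    output = generate(prompt)
    if is_harmful(output):
        # Assumption: the entire output is replaced by the message.
        return OUTPUT_REMOVED
    return output
```

Because the input check runs before the model call, a rejected prompt produces no model output at all.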

Configuring AI guardrails

You can control whether the hate, abuse, and profanity (HAP) filter is applied at all and change the sensitivity of the HAP filter for the user input and foundation model output independently.

To configure AI guardrails, complete the following steps:

  1. With AI guardrails enabled, click the AI guardrails settings icon.

  2. To disable AI guardrails for user input or foundation model output only, set the HAP slider for the user input or model output to 1.

  3. To change the sensitivity of the guardrails, move the HAP sliders.

    The slider value represents the threshold that scores from the HAP classifier must reach for the content to be considered harmful. The score threshold ranges from 0.0 to 1.0.

    A lower value, such as 0.1 or 0.2, is safer because the threshold is lower. Harmful content is more likely to be identified when a lower score can trigger the filter. However, the filter might also be triggered when content is safe.

    A value closer to 1, such as 0.8 or 0.9, is riskier because the score threshold is higher. When a higher score is required to trigger the filter, occurrences of harmful content might be missed. However, the content that is flagged as harmful is more likely to be harmful.

    Experiment with adjusting the sliders to find the best settings for your needs.

  4. Click Save.
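
The effect of the slider value can be illustrated with a small sketch. It assumes that a score "reaches" the threshold when it is greater than or equal to it, and it treats a threshold of 1 as turning the filter off, as described in step 2. The function name `is_flagged` is hypothetical.

```python
def is_flagged(score: float, threshold: float) -> bool:
    """Return True when a classifier score triggers the filter."""
    if threshold >= 1.0:
        # A slider value of 1 disables the filter for that direction.
        return False
    # Assumption: the score must be at or above the threshold.
    return score >= threshold


score = 0.5  # hypothetical classifier score for one sentence
print(is_flagged(score, 0.2))  # low threshold: flagged
print(is_flagged(score, 0.8))  # high threshold: not flagged
```

The same sentence is flagged at a threshold of 0.2 but passes at 0.8, which is the trade-off between false positives and missed harmful content that the steps describe.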

Programmatic alternative

When you prompt a foundation model by using the API, you can use the moderations field to apply filters to foundation model input and output. For more information, see the watsonx.ai API reference. For more information about how to adjust filters with the Python library, see Inferencing a foundation model programmatically.

When you submit inference requests from the API, you can also apply a PII filter to flag content that might contain personally identifiable information. The PII filter is disabled for inference requests that are submitted from Prompt Lab.

The PII filter uses a natural language processing AI model to identify and flag mentions of personally identifiable information (PII), such as phone numbers and email addresses. For the full list of entity types that are flagged, see Rule-based extraction for general entities. The filter threshold value is 0.8 and cannot be changed.
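
As a rough sketch, a request body that uses the moderations field might be assembled as shown below. Only the `moderations` field name comes from the text above. The nested keys (`hap`, `pii`, `input`, `output`, `enabled`, `threshold`), the other top-level keys, and the placeholder model and project IDs are assumptions for illustration and should be checked against the watsonx.ai API reference. No threshold is set for the PII filter because its value is fixed.

```python
import json


def build_generation_payload(
    prompt: str,
    model_id: str,
    project_id: str,
    hap_input_threshold: float = 0.5,
    hap_output_threshold: float = 0.5,
    pii: bool = False,
) -> dict:
    """Assemble a request body with HAP (and optionally PII) moderation settings."""
    for value in (hap_input_threshold, hap_output_threshold):
        if not 0.0 <= value <= 1.0:
            raise ValueError("HAP thresholds must be between 0.0 and 1.0")

    # Assumed layout of the moderations field; verify against the API reference.
    moderations = {
        "hap": {
            "input": {"enabled": True, "threshold": hap_input_threshold},
            "output": {"enabled": True, "threshold": hap_output_threshold},
        }
    }
    if pii:
        # The PII threshold is fixed, so only the on/off switch is sent.
        moderations["pii"] = {
            "input": {"enabled": True},
            "output": {"enabled": True},
        }

    return {
        "model_id": model_id,
        "project_id": project_id,
        "input": prompt,
        "moderations": moderations,
    }


payload = build_generation_payload(
    "Summarize the meeting notes.",
    model_id="your-model-id",      # placeholder
    project_id="your-project-id",  # placeholder
    hap_input_threshold=0.3,
    pii=True,
)
print(json.dumps(payload, indent=2))
```

The sketch only builds the request body. Sending it requires an endpoint URL and authentication, which are covered in the API reference.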
