
Purifying AI: HAP filtering against harmful content

20 December 2024

Authors

Alexandra Jonker

Editorial Content Lead

Alice Gomstyn

IBM Content Contributor

The world wide web facilitates connection, accelerates business growth and puts centuries of knowledge at our fingertips.

But for all its benefits, it can also be a cesspool of hateful language and harmful content. And this cesspool drains into the greater ocean of internet data used to train many of today’s foundation models, such as the large language models (LLMs) that power natural language processing (NLP) capabilities.

This seepage of offensive language threatens the integrity and usability of these artificial intelligence (AI) models. Why? Because if LLMs are trained on datasets that include hateful human behavior, it follows that they could produce harmful outcomes. What’s more, harmful content can also find its way into AI models during fine-tuning, optimization through retrieval-augmented generation (RAG) or when an LLM interacts with a user.

The filtration and removal of offensive content is central to ensuring that AI models are safe, inclusive and unbiased, providing a positive experience for users. One such solution is the model-powered systematic filtering of hate, abuse and profanity (HAP), referred to as HAP filtering.


What is HAP filtering?

HAP filtering is a system that uses a classification model to detect and remove hate speech, abusive language and profanity from an LLM’s input and output text.

What is a classification model?

To fully understand HAP filtering, it’s helpful to understand classification models. Classification models are machine learning models that divide data points into predefined groups called classes. They learn class characteristics from input data and then assign possible classes to new data according to those learned characteristics. A spam email filter, for example, uses a classification algorithm. A HAP filtering classification model may also be referred to more specifically as a sentence classifier, or more simply as a HAP filter or HAP detector.
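To make the idea concrete, here is a minimal, purely illustrative sketch of a binary sentence classifier built with scikit-learn. The toy sentences, labels and pipeline are hypothetical and stand in for the far larger datasets and models that real HAP filters use.

# Illustrative only: a toy binary sentence classifier, not IBM's HAP model.
# It learns class characteristics from labeled examples, then assigns a class
# (and a probability) to new sentences.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny, made-up training set: 1 = offensive, 0 = benign.
sentences = [
    "you are wonderful",
    "what a lovely day",
    "you are an idiot",
    "I hate you",
]
labels = [0, 0, 1, 1]

classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(sentences, labels)

new_text = ["have a great trip"]
print(classifier.predict(new_text))        # predicted class (0 or 1)
print(classifier.predict_proba(new_text))  # probability of each class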

What is considered HAP content?

Hate speech, abusive language and profanity can be defined as follows:

  • Hate speech: Expressions of hatred toward an individual or group based on attributes such as race, religion, ethnic origin, sexual orientation, disability or gender. Hate speech shows an intent to hurt, humiliate or insult the members of a group, or to promote violence or social disorder.

  • Abusive language: Rude or hurtful language that is meant to bully, debase or demean someone or something.

  • Profanity: Toxic words such as expletives, insults or sexually explicit language.

How does HAP filtering work?

In practice, a HAP filtering sentence classifier assesses each word of a model’s input or output text to determine whether it contains HAP content. It then assigns a score representing the likelihood that HAP content is present, for example on a scale from 0 to 1, where a score closer to 1 indicates a higher likelihood of HAP content. Depending on the threshold that the user sets for HAP content (such as "a score greater than 0.5 = HAP"), the model then labels each sentence to indicate whether or not it contains HAP.
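A minimal sketch of this thresholding step might look like the following; the scores, threshold and function name are illustrative, not IBM's implementation.

HAP_THRESHOLD = 0.5  # user-configurable cutoff

def label_sentence(sentence, hap_score):
    """Attach a HAP / non-HAP label to a sentence based on its score (0 to 1)."""
    return {
        "sentence": sentence,
        "hap_score": hap_score,
        "label": "HAP" if hap_score > HAP_THRESHOLD else "non-HAP",
    }

print(label_sentence("Have a nice day.", 0.02))      # -> labeled non-HAP
print(label_sentence("An insulting remark.", 0.93))  # -> labeled HAP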

Finally, the flagged content is handled according to where it appears: if it is in pretraining data, it can be flagged and removed before training; if it is a model output, it can be replaced with a guardrail message indicating that the output contained harmful text that was removed.
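The output path, for example, might look something like this hedged sketch, where the guardrail text, threshold and function name are placeholders.

GUARDRAIL_MESSAGE = "[This response contained potentially harmful text and was removed.]"

def apply_output_guardrail(output_text, hap_score, threshold=0.5):
    """Return the model output as-is, or a guardrail message if it scores as HAP."""
    return GUARDRAIL_MESSAGE if hap_score > threshold else output_text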


Use cases for HAP filters

According to IBM Research, there are currently three main use cases for HAP filters:

  • Filtering LLM training data
  • Aligning models using reinforcement learning
  • Controlling generative AI outputs

Filtering LLM training data

LLMs are usually trained on an array of data sources, some of which can contain hateful or inappropriate content. HAP filtering can help prevent LLMs from learning from such content. This filtering often occurs during data preprocessing, when there is still a large volume of raw data.
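A simple sketch of this preprocessing pass might look as follows, assuming a hypothetical hap_score function that returns a probability between 0 and 1 for any document.

def filter_training_corpus(documents, hap_score, threshold=0.5):
    """Keep only documents whose HAP score falls below the threshold."""
    kept = [doc for doc in documents if hap_score(doc) <= threshold]
    print(f"Removed {len(documents) - len(kept)} of {len(documents)} documents.")
    return kept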

Aligning models using reinforcement learning

HAP models are also used during alignment. For example, alignment through reinforcement learning rewards outputs based on how well they align with intended goals. If the reward is computed with a HAP filter, it can be a "non-HAP" score that the model is then trained to maximize.
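As an illustrative sketch, the reward could simply be the complement of the HAP probability, possibly combined with a task-specific reward; all names here are hypothetical.

def non_hap_reward(generated_text, hap_score):
    """Reward is highest (1.0) when the HAP filter sees no harmful content."""
    return 1.0 - hap_score(generated_text)

# During reinforcement learning, the policy is updated to maximize this reward,
# steering the model away from generating HAP content.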

Controlling generative AI outputs

HAP models can help control generative AI model outputs without retraining the original model. This control requires modifying the generation process to score model predictions using both the original scoring method and HAP scoring, helping ensure acceptable, hate-free content.
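One hedged sketch of this idea is to rescore candidate outputs with a combined objective; the weighting and function names below are illustrative, not a specific product's decoding method.

def rerank_candidates(candidates, model_score, hap_score, hap_weight=5.0):
    """Pick the candidate with the best original score minus a weighted HAP penalty."""
    return max(
        candidates,
        key=lambda text: model_score(text) - hap_weight * hap_score(text),
    )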

It's important to note that, in addition to HAP filtering, there are often other data cleaning, data quality and alignment steps taken to keep incorrect, inappropriate or biased data from entering or exiting the model.

IBM’s next-gen HAP filters: open source and offensive spans

As with many AI-adjacent technologies, innovation moves fast in the world of HAP filtering. IBM researchers identified two ways to improve HAP filters: through smaller, open source models and an offensive span identification tool.

Smaller, open source HAP filters

In an ideal world, HAP filtering would occur at each stage of the LLM lifecycle. But filtering at every stage requires a speed that most of today’s HAP filters lack because of their large size.

This need inspired IBM’s newer, faster HAP filter: Granite-Guardian-HAP-38m. This 38 million parameter encoder model is smaller than its 125 million parameter predecessor (Granite-Guardian-HAP-125m). As such, it can run eight times faster on a central processing unit (CPU) and twice as fast on a graphics processing unit (GPU), hardware found in everyday smartphones and PCs, making it quick enough to filter data at each stage of the LLM lifecycle.

Variants of both HAP filtering models are available on watsonx.ai™. And to continue encouraging a trustworthy AI ecosystem, IBM has open sourced both HAP filters on Hugging Face.
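For example, with the Hugging Face transformers library, scoring a sentence with the open source 38 million parameter filter might look like the sketch below; the model identifier is assumed from its Hugging Face listing, so check the model card for exact labels and recommended usage.

from transformers import pipeline

# Assumes the identifier published on Hugging Face; see the model card for details.
hap_filter = pipeline(
    "text-classification",
    model="ibm-granite/granite-guardian-hap-38m",
)

print(hap_filter("You are a wonderful person, thank you!"))
# -> a label and a confidence score, as documented on the model card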

Offensive span identification

To introduce greater granularity and language diversity to HAP filters, IBM researchers developed a HAP visualization tool called MUTED (Multilingual Targeted Offensive Speech Identification and Visualization).

Going beyond sentence-level annotation, MUTED breaks sentences into “targets” and offensive spans (the offensive argument). For example, in the sentence “Those people are horrible drivers,” the target is “those people” and the offensive span is “horrible drivers.” The idea is that MUTED would identify offensive spans, rank their intensity using heat maps and then hide them from users if they are considered harmful.1
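A purely illustrative sketch of the masking idea (not the MUTED implementation) could look like this, once a target and an offensive span have already been identified.

def mask_offensive_span(sentence, offensive_span, mask="[hidden]"):
    """Hide an identified offensive span while leaving the rest of the sentence intact."""
    return sentence.replace(offensive_span, mask)

print(mask_offensive_span("Those people are horrible drivers", "horrible drivers"))
# -> "Those people are [hidden]"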

Footnotes

1 "Muted: Multilingual Targeted Offensive Speech Identification and Visualization," Association for Computational Linguistics, December 2023.

Related solutions

IBM® Granite™

Our third generation of AI language models is here. Fit for purpose and open sourced, these enterprise-ready models deliver exceptional performance against safety benchmarks and across a wide range of enterprise tasks, from cybersecurity to RAG.

Meet Granite
Foundation models

Explore the IBM library of foundation models in the watsonx portfolio to scale generative AI for your business with confidence.

Discover watsonx.ai
AI governance solutions and services

Unlock your AI’s full potential and see how AI governance can help increase your employees’ confidence in AI, accelerate adoption and innovation, and improve customer trust.

Explore AI governance solutions
Take the next step

IBM® Granite™ is our family of open, performant and trusted AI models, tailored for business and optimized to scale your AI applications.

Meet Granite