
Purifying AI: HAP filtering against harmful content

20 December 2024

Authors

Alexandra Jonker

Editorial Content Lead

Alice Gomstyn

IBM Content Contributor

The world wide web facilitates connection, accelerates business growth and puts centuries of knowledge at our fingertips.

But for all its benefits, it can also be a cesspool of hateful language and harmful content. And this cesspool drains into the greater ocean of internet data used to train many of today’s foundation models, such as the large language models (LLMs) that power natural language processing (NLP) capabilities.

This seepage of offensive language threatens the integrity and usability of these artificial intelligence (AI) models. Why? Because if LLMs are trained on datasets that include hateful human behavior, it follows that they could produce harmful outcomes. What’s more, harmful content can also find its way into AI models during fine-tuning, optimization through retrieval-augmented generation (RAG) or when an LLM interacts with a user.

The filtration and removal of offensive content is central to ensuring that AI models are safe, inclusive and unbiased, providing a positive experience for users. One such solution is the model-powered systematic filtering of hate, abuse and profanity (HAP), referred to as HAP filtering.


What is HAP filtering?

HAP filtering is a system that uses a classification model to detect and remove hate speech, abusive language and profanity from an LLM’s input and output text.

What is a classification model?

To fully understand HAP filtering, it’s helpful to understand classification models. Classification models are machine learning models that divide data points into predefined groups called classes. They learn class characteristics from input data and then assign possible classes to new data according to those learned characteristics. A spam email filter, for example, uses a classification algorithm. A HAP filtering classification model may also be referred to more specifically as a sentence classifier, or more simply as a HAP filter or HAP detector.
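To make the idea concrete, here is a minimal, purely illustrative sketch of a binary sentence classifier built with scikit-learn. The toy sentences, labels and pipeline are hypothetical and stand in for the far larger datasets and models that real HAP filters use.

# Illustrative only: a toy binary sentence classifier, not IBM's HAP model.
# It learns class characteristics from labeled examples, then assigns a class
# (and a probability) to new sentences.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny, made-up training set: 1 = offensive, 0 = benign.
sentences = [
    "you are wonderful",
    "what a lovely day",
    "you are an idiot",
    "I hate you",
]
labels = [0, 0, 1, 1]

classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(sentences, labels)

new_text = ["have a great trip"]
print(classifier.predict(new_text))        # predicted class (0 or 1)
print(classifier.predict_proba(new_text))  # probability of each class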

What is considered HAP content?

Hate speech, abusive language and profanity can be defined as follows:

  • Hate speech: Expressions of hatred toward an individual or group based on attributes such as race, religion, ethnic origin, sexual orientation, disability or gender. Hate speech shows an intent to hurt, humiliate or insult the members of a group, or to promote violence or social disorder.

  • Abusive language: Rude or hurtful language that is meant to bully, debase or demean someone or something.

  • Profanity: Toxic words such as expletives, insults or sexually explicit language.

How does HAP filtering work?

In practice, a HAP filtering sentence classifier assesses each word of a model’s input or output text to determine whether it contains HAP content. It then assigns a score representing the likelihood that HAP content is present, for example on a scale from 0 to 1, where a score closer to 1 indicates a higher likelihood of HAP content. Depending on the threshold that the user sets for HAP content (such as "a score greater than 0.5 = HAP"), the model then labels each sentence to indicate whether or not it contains HAP.
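A minimal sketch of this thresholding step might look like the following; the scores, threshold and function name are illustrative, not IBM's implementation.

HAP_THRESHOLD = 0.5  # user-configurable cutoff

def label_sentence(sentence, hap_score):
    """Attach a HAP / non-HAP label to a sentence based on its score (0 to 1)."""
    return {
        "sentence": sentence,
        "hap_score": hap_score,
        "label": "HAP" if hap_score > HAP_THRESHOLD else "non-HAP",
    }

print(label_sentence("Have a nice day.", 0.02))      # -> labeled non-HAP
print(label_sentence("An insulting remark.", 0.93))  # -> labeled HAP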

Finally, the flagged content is handled according to where it appears: if it is in pretraining data, it can be flagged and removed before training; if it is a model output, it can be replaced with a guardrail message indicating that the output contained harmful text that was removed.
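The output path, for example, might look something like this hedged sketch, where the guardrail text, threshold and function name are placeholders.

GUARDRAIL_MESSAGE = "[This response contained potentially harmful text and was removed.]"

def apply_output_guardrail(output_text, hap_score, threshold=0.5):
    """Return the model output as-is, or a guardrail message if it scores as HAP."""
    return GUARDRAIL_MESSAGE if hap_score > threshold else output_text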


Use cases for HAP filters

According to IBM Research, there are currently three main use cases for HAP filters:

  • Filtering LLM training data
  • Aligning models using reinforcement learning
  • Controlling generative AI outputs

Filtering LLM training data

LLMs are usually trained on an array of data sources, some of which can contain hateful or inappropriate content. HAP filtering can help prevent LLMs from learning from such content. This filtering often occurs during data preprocessing, when there is still a large volume of raw data.
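A simple sketch of this preprocessing pass might look as follows, assuming a hypothetical hap_score function that returns a probability between 0 and 1 for any document.

def filter_training_corpus(documents, hap_score, threshold=0.5):
    """Keep only documents whose HAP score falls below the threshold."""
    kept = [doc for doc in documents if hap_score(doc) <= threshold]
    print(f"Removed {len(documents) - len(kept)} of {len(documents)} documents.")
    return kept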

Aligning models using reinforcement learning

HAP models are also used during alignment. For example, alignment through reinforcement learning rewards outputs based on how well they align with intended goals. If the reward is computed with a HAP filter, it can be a "non-HAP" score that the model is then trained to maximize.
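As an illustrative sketch, the reward could simply be the complement of the HAP probability, possibly combined with a task-specific reward; all names here are hypothetical.

def non_hap_reward(generated_text, hap_score):
    """Reward is highest (1.0) when the HAP filter sees no harmful content."""
    return 1.0 - hap_score(generated_text)

# During reinforcement learning, the policy is updated to maximize this reward,
# steering the model away from generating HAP content.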

Controlling generative AI outputs

HAP models can help control generative AI model outputs without retraining the original model. This control requires modifying the generation process to score model predictions using both the original scoring method and HAP scoring, helping ensure acceptable, hate-free content.
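One hedged sketch of this idea is to rescore candidate outputs with a combined objective; the weighting and function names below are illustrative, not a specific product's decoding method.

def rerank_candidates(candidates, model_score, hap_score, hap_weight=5.0):
    """Pick the candidate with the best original score minus a weighted HAP penalty."""
    return max(
        candidates,
        key=lambda text: model_score(text) - hap_weight * hap_score(text),
    )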

It's important to note that, in addition to HAP filtering, there are often other data cleaning, data quality and alignment steps taken to keep incorrect, inappropriate or biased data from entering or exiting the model.

IBM’s next-gen HAP filters: open source and offensive spans

As with many AI-adjacent technologies, innovation moves fast in the world of HAP filtering. IBM researchers identified two ways to improve HAP filters: through smaller, open source models and an offensive span identification tool.

Smaller, open source HAP filters

In an ideal world, HAP filtering would occur at each stage of the LLM lifecycle. But filtering at every stage requires a speed that most of today’s HAP filters lack because of their large size.

This need inspired IBM’s newer, faster HAP filter: Granite-Guardian-HAP-38m. This 38 million parameter encoder model is smaller than its 125 million parameter predecessor (Granite-Guardian-HAP-125m). As such, it can run eight times faster on a central processing unit (CPU) and twice as fast on a graphics processing unit (GPU), hardware found in everyday smartphones and PCs, making it quick enough to filter data at each stage of the LLM lifecycle.

Variants of both HAP filtering models are available on watsonx.ai™. And to continue encouraging a trustworthy AI ecosystem, IBM has open sourced both HAP filters on Hugging Face.
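For example, with the Hugging Face transformers library, scoring a sentence with the open source 38 million parameter filter might look like the sketch below; the model identifier is assumed from its Hugging Face listing, so check the model card for exact labels and recommended usage.

from transformers import pipeline

# Assumes the identifier published on Hugging Face; see the model card for details.
hap_filter = pipeline(
    "text-classification",
    model="ibm-granite/granite-guardian-hap-38m",
)

print(hap_filter("You are a wonderful person, thank you!"))
# -> a label and a confidence score, as documented on the model card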

Offensive span identification

To introduce greater granularity and language diversity to HAP filters, IBM researchers developed a HAP visualization tool called MUTED (Multilingual Targeted Offensive Speech Identification and Visualization).

Going beyond sentence-level annotation, MUTED breaks sentences into “targets” and offensive spans (the offensive argument). For example, in the sentence “Those people are horrible drivers,” the target is “those people” and the offensive span is “horrible drivers.” The idea is that MUTED would identify offensive spans, rank their intensity using heat maps and then hide them from users if they are considered harmful.1
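A purely illustrative sketch of the masking idea (not the MUTED implementation) could look like this, once a target and an offensive span have already been identified.

def mask_offensive_span(sentence, offensive_span, mask="[hidden]"):
    """Hide an identified offensive span while leaving the rest of the sentence intact."""
    return sentence.replace(offensive_span, mask)

print(mask_offensive_span("Those people are horrible drivers", "horrible drivers"))
# -> "Those people are [hidden]"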

Footnotes

1 "Muted: Multilingual Targeted Offensive Speech Identification and Visualization," Association for Computational Linguistics, December 2023.

Related solutions

IBM® Granite™

Our third generation of AI language models is here. Fit for purpose and open sourced, these enterprise-ready models deliver exceptional performance against safety benchmarks and across a wide range of enterprise tasks, from cybersecurity to RAG.

Meet Granite
Foundation models

Explore the IBM library of foundation models in the watsonx portfolio to scale generative AI for your business with confidence.

Discover watsonx.ai
AI governance solutions and services

Unlock your AI’s full potential and see how AI governance can help increase your employees’ confidence in AI, accelerate adoption and innovation, and improve customer trust.

Explore AI governance solutions
Take the next step

IBM® Granite™ is our family of open, performant and trusted AI models, tailored for business and optimized to scale your AI applications.

Meet Granite