The world wide web facilitates connection, accelerates business growth and puts centuries of knowledge at our fingertips.
But for all its benefits, it can also be a cesspool of hateful language and harmful content. And this cesspool drains into the greater ocean of internet data used to train many of today’s foundation models, such as the large language models (LLMs) that power natural language processing (NLP) applications.
This seepage of offensive language threatens the integrity and usability of these artificial intelligence (AI) models. Why? Because if LLMs are trained on datasets that include hateful human behavior, it follows that they could produce harmful outputs. What’s more, harmful content can also find its way into AI models during fine-tuning, during optimization through retrieval-augmented generation (RAG) or while an LLM is interacting with a user.
Filtering out and removing offensive content is central to ensuring that AI models are safe, inclusive and unbiased, and that they provide a positive experience for users. One common solution is the model-powered, systematic filtering of hate, abuse and profanity (HAP), referred to as HAP filtering.
HAP filtering is a system that uses a classification model to detect and remove hate speech, abusive language and profanity from an LLM’s input and output text.
To fully understand HAP filtering, it’s helpful to understand classification models. Classification models are machine learning models that divide data points into predefined groups called classes. They learn class characteristics from input data and then assign possible classes to new data according to those learned characteristics. A spam email filter, for example, uses a classification algorithm. A HAP filtering classification model may also be referred to more specifically as a sentence classifier, or more simply as a HAP filter or HAP detector.
Hate speech, abusive language and profanity can be defined as follows:
- Hate speech: expressions of hatred toward an individual or group based on attributes such as race, religion, ethnic origin, sexual orientation, disability or gender.
- Abusive language: rude or hurtful language intended to bully, debase or demean.
- Profanity: offensive or obscene words and expressions, such as curses.
In practice, a HAP filtering sentence classifier assesses each sentence of a model’s input or output text to determine whether it contains HAP content. It assigns each sentence a score, typically between 0 and 1, representing the likelihood that HAP content is present; a score closer to 1 indicates a higher likelihood. Depending on the threshold that the user sets for HAP content (such as "a score greater than 0.5 = HAP"), the classifier then labels each sentence as containing HAP or not.
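To make this concrete, here is a minimal sketch of sentence-level HAP scoring using a Hugging Face text-classification pipeline. The model identifier and the assumption that LABEL_1 marks HAP content are illustrative; check the model card of whichever HAP classifier you use for the exact id and label names.

```python
from transformers import pipeline

# Assumed Hugging Face id for the open sourced Granite Guardian HAP classifier;
# verify the exact id and label names on the model card.
hap_scorer = pipeline("text-classification", model="ibm-granite/granite-guardian-hap-38m")

THRESHOLD = 0.5  # user-chosen cutoff: score > 0.5 means "contains HAP"

def hap_probability(prediction: dict) -> float:
    # The pipeline returns the winning label and its confidence; convert that
    # into the probability that HAP content is present (LABEL_1 assumed = HAP).
    score = prediction["score"]
    return score if prediction["label"] == "LABEL_1" else 1.0 - score

sentences = ["You are brilliant.", "You are a terrible person."]
for sentence, prediction in zip(sentences, hap_scorer(sentences)):
    score = hap_probability(prediction)
    print(f"{score:.2f}  HAP={score > THRESHOLD}  {sentence}")
```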
Finally, the flagged HAP content can be handled according to where it appears. If it is in pretraining data, it can be removed; if it is in a model’s output, it can be replaced with a guardrail message indicating that the output contained harmful text that was removed.
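Continuing the sketch above, output-side handling might look like the following; the guardrail message wording and the naive sentence splitting are assumptions for illustration.

```python
# Reuses hap_scorer and hap_probability from the previous sketch.
GUARDRAIL = "[This response contained harmful text that was removed.]"

def guardrail_output(llm_output: str, threshold: float = 0.5) -> str:
    sentences = llm_output.split(". ")  # naive sentence splitting, for brevity
    cleaned = []
    for sentence, prediction in zip(sentences, hap_scorer(sentences)):
        cleaned.append(GUARDRAIL if hap_probability(prediction) > threshold else sentence)
    return ". ".join(cleaned)
```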
According to IBM Research, there are currently three main use cases for HAP filters:
LLMs are usually trained on an array of data sources, some of which can contain hateful or inappropriate content. HAP filtering can help prevent LLMs from learning from such content. This filtering typically takes place during data preprocessing, when there is still a large volume of raw data to work through.
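Here is a minimal sketch of what this preprocessing step might look like with the Hugging Face datasets library, reusing the hap_scorer and hap_probability helpers from the earlier example; the corpus file name and the "text" column are illustrative.

```python
from datasets import load_dataset

# Load a raw text corpus; each line becomes one training document.
corpus = load_dataset("text", data_files={"train": "raw_corpus.txt"})["train"]

def is_clean(example: dict) -> bool:
    # Truncate long documents so the classifier stays fast on raw web data.
    prediction = hap_scorer(example["text"][:512], truncation=True)[0]
    return hap_probability(prediction) <= 0.5  # keep only low-HAP documents

clean_corpus = corpus.filter(is_clean)  # drop documents flagged as HAP
```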
HAP models are also used during alignment. For example, alignment through reinforcement learning rewards outputs based on how well they align with intended goals. If the reward is computed with a HAP filter, it can be a "non-HAP" score, which the model is then trained to maximize.
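As a hedged sketch of that idea, again reusing the helpers above, the reward can simply be the probability that the generated text is HAP-free:

```python
# The "non-HAP" reward: the policy model is trained to maximize this value.
def non_hap_reward(generated_text: str) -> float:
    prediction = hap_scorer(generated_text, truncation=True)[0]
    return 1.0 - hap_probability(prediction)
```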
HAP models can also help control generative AI model outputs without retraining the original model. This control requires modifying the generation process to score candidate outputs with both the original scoring method and a HAP score, steering the model toward acceptable, hate-free content.
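One simple form of this approach is to sample several candidate outputs and rerank them with a combined score. The sketch below reuses non_hap_reward from the previous example; the weighting is an assumption for illustration, not the method described here.

```python
def pick_safest(candidates: list[tuple[str, float]], hap_weight: float = 5.0) -> str:
    # Each candidate is (text, log_probability) from the original model, so the
    # combined score balances fluency against HAP-free content.
    def combined(candidate: tuple[str, float]) -> float:
        text, log_prob = candidate
        return log_prob + hap_weight * non_hap_reward(text)
    return max(candidates, key=combined)[0]
```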
It's important to note that, in addition to HAP filtering, other data cleaning, data quality and alignment steps are often taken to reduce the chance of incorrect, inappropriate or biased data entering or exiting the model.
As with many AI-adjacent technologies, innovation moves fast in the world of HAP filtering. IBM researchers identified two ways to improve HAP filters: through smaller, open source models and an offensive span identification tool.
In an ideal world, HAP filtering would occur at each stage of the LLM lifecycle. But filtering at every stage requires a speed that most of today’s HAP filters lack because of their large size.
This need inspired IBM’s newer, faster HAP filter: Granite-Guardian-HAP-38m. This 38 million parameter encoder model is smaller than its 125 million parameter predecessor, Granite-Guardian-HAP-125m. As such, it runs eight times faster on a central processing unit (CPU) and twice as fast on a graphics processing unit (GPU), the kinds of processors found in everyday smartphones and PCs, making it fast enough to filter data at each stage of the LLM lifecycle.
Variants of both HAP filtering models are available on watsonx.ai™. But to continue encouraging a trustworthy AI ecosystem, IBM has open sourced both HAP filters on Hugging Face.
To introduce greater granularity and language diversity to HAP filters, IBM researchers developed a HAP visualization tool called MUTED (Multilingual Targeted Offensive Speech Identification and Visualization).
Going beyond sentence-level annotation, MUTED breaks sentences into “targets” and offensive spans (the offensive argument). For example, in the sentence “Those people are horrible drivers,” the target is “those people” and the offensive span is “horrible drivers.” The idea is that MUTED identifies offensive spans, ranks their intensity using heat maps and then hides them from users if they are considered harmful.1
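The hypothetical sketch below shows one way such span-level muting could be represented in code; the data structure and function are purely illustrative, not the tool’s actual interface (see the cited paper for details).

```python
from dataclasses import dataclass

@dataclass
class OffensiveSpan:
    target: str       # who or what the sentence is aimed at
    span: str         # the offensive argument
    intensity: float  # 0-1 score used to render the heat map

def mute(sentence: str, spans: list[OffensiveSpan], threshold: float = 0.5) -> str:
    for s in spans:
        if s.intensity > threshold:  # hide spans considered harmful
            sentence = sentence.replace(s.span, "*" * len(s.span))
    return sentence

print(mute("Those people are horrible drivers",
           [OffensiveSpan("those people", "horrible drivers", 0.8)]))
```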
1 "Muted: Multilingual Targeted Offensive Speech Identification and Visualization," Association for Computational Linguistics, December 2023.