Date: 2025-08-28

• HAP (Hate speech, abuse, and profanity): Detecting harmful language, racial slurs, and offensive content.

• PII (Personally identifiable information): Spotting sensitive personal data that shouldn't be shared.

• Harm detection: Identifying content that could cause physical, emotional, or psychological harm. This category also covers misinformation that is curated and spread at scale on online platforms.

• Social bias: Recognizing unfair prejudice or discriminatory language that sometimes goes unchecked on social media platforms and in online communities.

• Jailbreak detection: Catching prompt-engineering attempts to bypass AI safety measures and draw illicit responses from the model, including attempts that could slip past traditional human review.

• Violence: Flagging content that promotes or describes violent acts towards others as well as self-harm.

• Profanity: The model is trained to identify and flag inappropriate or offensive language.

• Unethical behavior: Spotting content that promotes harmful or immoral actions.

• Content safety: Overall assessment of whether content meets safety standards.
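
The categories above can be treated as a fixed set of labels that downstream code checks against. The following is a minimal sketch, assuming a detector that returns one score between 0 and 1 per category. The names `RiskCategory` and `flagged_categories`, the score format, and the 0.5 threshold are all hypothetical and are not part of any specific model's API.

```python
from enum import Enum


class RiskCategory(Enum):
    """Hypothetical labels mirroring the categories listed above."""
    HAP = "hate speech, abuse, and profanity"
    PII = "personally identifiable information"
    HARM = "harm detection"
    SOCIAL_BIAS = "social bias"
    JAILBREAK = "jailbreak detection"
    VIOLENCE = "violence"
    PROFANITY = "profanity"
    UNETHICAL_BEHAVIOR = "unethical behavior"
    CONTENT_SAFETY = "content safety"


def flagged_categories(scores, threshold=0.5):
    """Return the categories whose score meets or exceeds the threshold.

    `scores` maps RiskCategory -> float in [0, 1]. Both the score format
    and the 0.5 default are assumptions made for illustration only.
    """
    return [c for c in RiskCategory if scores.get(c, 0.0) >= threshold]


# Made-up scores for a single piece of text, not real model output.
example_scores = {
    RiskCategory.PII: 0.91,
    RiskCategory.JAILBREAK: 0.12,
    RiskCategory.PROFANITY: 0.66,
}
print([c.name for c in flagged_categories(example_scores)])
# prints ['PII', 'PROFANITY']
```

Keeping the labels in an enum means a misspelled category fails immediately instead of silently matching nothing, and results come back in the same order as the list.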