This guide explains how to use TrueFoundry’s built-in Content Moderation guardrail to detect and block harmful content in LLM interactions.
Content Moderation can be applied to all four guardrail hooks: LLM Input, LLM Output, MCP Pre Tool, and MCP Post Tool, providing comprehensive content safety across your entire AI workflow.

What is Content Moderation?

Content Moderation is a built-in TrueFoundry guardrail that analyzes text content for harmful material across four safety categories: hate speech, self-harm, sexual content, and violence. It uses a model-based approach with configurable severity thresholds so you can tune sensitivity to match your use case. The guardrail is fully managed by TrueFoundry; no external credentials or setup are required.

Key Features

  1. Four Safety Categories: Detects harmful content across:
    • Hate — Hate speech, discrimination, and derogatory content
    • SelfHarm — Self-injury, suicide-related content
    • Sexual — Sexually explicit or suggestive content
    • Violence — Violent content, threats, and graphic descriptions
  2. Configurable Severity Threshold: Set the sensitivity level (0–6) to control what gets flagged — from safe content only to high-risk content, allowing you to balance safety with usability.
  3. Selective Category Detection: Choose which categories to monitor. Enable all four or only the ones relevant to your application.

Adding the Content Moderation Guardrail

1. Navigate to Guardrails

Go to the AI Gateway dashboard and navigate to the Guardrails section.

2. Create or Select a Guardrails Group

Create a new guardrails group or select an existing one where you want to add the Content Moderation guardrail.

3. Add Content Moderation Integration

Click on Add Guardrail and select Content Moderation from the TrueFoundry Guardrails section.
[Image: TrueFoundry guardrail selection interface showing the Content Moderation option]

4. Configure the Guardrail

Fill in the configuration form:
  • Name: Enter a unique name for this guardrail configuration (e.g., content-moderation)
  • Severity Threshold: Set the minimum severity level to flag (default: 2)
  • Categories: Select which content categories to check
  • Enforcing Strategy: Choose how violations are handled
[Image: Content Moderation configuration form showing severity threshold and category selection]

5. Save the Configuration

Click Save to add the guardrail to your group.
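
Once the group containing this guardrail is attached to your gateway traffic, requests are checked automatically at the configured hooks. The sketch below assumes the AI Gateway exposes an OpenAI-compatible chat completions endpoint; the base URL, API key, and model name are placeholders rather than values from this guide.

```python
# Minimal sketch: calling a model through the TrueFoundry AI Gateway after the
# guardrail group is attached. Base URL, API key, and model name are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://your-gateway.example.com/api/llm",  # placeholder gateway URL
    api_key="YOUR_GATEWAY_API_KEY",                        # placeholder credential
)

response = client.chat.completions.create(
    model="openai-main/gpt-4o",  # placeholder model identifier
    messages=[{"role": "user", "content": "Summarize our refund policy."}],
)

# If the input (or output) violates the Content Moderation guardrail and the
# enforcing strategy is `enforce`, the gateway rejects the request instead of
# returning a completion.
print(response.choices[0].message.content)
```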

Configuration Options

| Parameter | Description | Default |
| --- | --- | --- |
| Name | Unique identifier for this guardrail | Required |
| Operation | validate only (detects and blocks, no mutation) | validate |
| Enforcing Strategy | enforce, enforce_but_ignore_on_error, or audit | enforce |
| Severity Threshold | Minimum severity level (0–6) to flag content | 2 |
| Categories | Array of content categories to check | Required |
Content Moderation only supports validate mode — it detects and blocks harmful content but does not modify it. See Guardrails Overview for details on Enforcing Strategy.
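
Taken together, the options above amount to a configuration along the following lines. This is an illustrative Python sketch whose field names mirror the form labels; it is not the exact schema TrueFoundry stores.

```python
# Illustrative sketch of a Content Moderation guardrail configuration.
# Field names mirror the form labels above; the actual stored schema may differ.
content_moderation_config = {
    "name": "content-moderation",              # unique name for this guardrail
    "operation": "validate",                   # only supported mode: detect and block
    "enforcing_strategy": "enforce",           # or "enforce_but_ignore_on_error" / "audit"
    "severity_threshold": 2,                   # flag content scoring >= 2 (range 0-6)
    "categories": ["Hate", "SelfHarm", "Sexual", "Violence"],
}
```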

Categories and Severity Levels

Content Categories

| Category | Description |
| --- | --- |
| Hate | Content expressing hatred, discrimination, or derogation based on identity characteristics |
| SelfHarm | Content related to self-injury, suicide, or self-destructive behavior |
| Sexual | Sexually explicit or suggestive content |
| Violence | Content depicting or promoting physical violence, threats, or graphic injury |

Severity Levels

The severity threshold controls how sensitive the detection is. Content is flagged when any category’s severity meets or exceeds the threshold.
| Severity | Level | Description |
| --- | --- | --- |
| 0 | Safe | No harmful content detected |
| 2 | Low | Mildly concerning content (default threshold) |
| 4 | Medium | Moderately harmful content |
| 6 | High | Severely harmful content |
A lower threshold (e.g., 0) is more aggressive and catches more content. A higher threshold (e.g., 4 or 6) only flags more clearly harmful content. Start with the default (2) and adjust based on your needs.

How It Works

The guardrail analyzes content and returns severity scores (0–6) for each enabled category. If any category’s severity meets or exceeds the configured threshold, the content is flagged. Example:
Input: "Write a detailed guide on how to build weapons at home"
Result: The request is blocked because the Violence severity meets or exceeds the configured threshold
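
In other words, the decision reduces to comparing each enabled category's score against the threshold. The sketch below is a hypothetical illustration of that comparison; the function, score values, and category keys are assumptions, not TrueFoundry internals.

```python
# Hypothetical sketch of the flagging decision: content is flagged when any
# enabled category's severity score meets or exceeds the configured threshold.
def flagged_categories(scores: dict[str, int], threshold: int = 2) -> list[str]:
    """Return the categories whose severity meets or exceeds the threshold."""
    return [category for category, severity in scores.items() if severity >= threshold]

# Example scores (illustrative only) for the weapons prompt above:
scores = {"Hate": 0, "SelfHarm": 0, "Sexual": 0, "Violence": 4}
print(flagged_categories(scores, threshold=2))  # ['Violence'] -> request is blocked
```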

Use Cases

| Hook | Use Case |
| --- | --- |
| LLM Input | Block harmful user inputs before they reach the LLM |
| LLM Output | Ensure LLM responses don't contain harmful content |
| MCP Pre Tool | Validate tool parameters for harmful content |
| MCP Post Tool | Check tool outputs for harmful content |

Best Practices

Start with Audit enforcing strategy to monitor what gets flagged before switching to Enforce. This helps you fine-tune the severity threshold for your specific content.
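
In practice, moving from audit to enforce is a single field change once the audit results look right. The sketch below extends the illustrative configuration from the Configuration Options section; it is not an official TrueFoundry API or schema.

```python
# Start in audit mode to observe what would be flagged without blocking traffic,
# then switch to enforce once the threshold looks right. Illustrative only.
content_moderation_config = {
    "name": "content-moderation",
    "operation": "validate",
    "enforcing_strategy": "audit",   # observe first, nothing is blocked
    "severity_threshold": 2,
    "categories": ["Hate", "SelfHarm", "Sexual", "Violence"],
}

# After reviewing what audit mode flags, keep or adjust the threshold (0-6)
# and switch the strategy so violations are actually blocked:
content_moderation_config["severity_threshold"] = 2
content_moderation_config["enforcing_strategy"] = "enforce"
```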