Benchmarking LLM Guardrail Providers: A Data-Driven Comparison
Why LLM Applications Need Guardrails
Production LLM applications face a growing surface area of risk. Users can inadvertently leak personally identifiable information (PII) through conversational inputs. Models can generate toxic, violent, or sexually explicit content that violates platform policies. Adversarial users craft prompt injection attacks designed to override system instructions, extract confidential prompts, or bypass safety filters entirely.
The consequences are not hypothetical. A PII leak can trigger regulatory action under GDPR, CCPA, or HIPAA. Toxic outputs erode user trust and create brand liability. A successful prompt injection can expose proprietary system prompts or cause the model to execute unintended actions.
Prompt engineering and system instructions provide a first layer of defense, but they are insufficient on their own. Models can be coerced past instruction-level guardrails through encoding attacks, roleplay scenarios, or context manipulation. Automated guardrail systems — purpose-built classifiers that inspect inputs and outputs in real time — provide the defense-in-depth that production deployments require.
The challenge: the market now includes over a dozen guardrail providers, each with different strengths, latency profiles, and coverage gaps. How do you choose the right one for your use case?
TrueFoundry Guardrails: A Unified Gateway
TrueFoundry’s AI Gateway abstracts multiple guardrail providers behind a single OpenAI-compatible API (docs). Teams integrate once with the /v1/chat/completions endpoint and can swap providers through configuration, with no code changes required.
The gateway supports two evaluation stages. Input-stage guardrails inspect user messages before they reach the LLM, blocking prompt injections, PII, or harmful content. Output-stage guardrails inspect model responses before they reach the user, catching hallucinations, toxic outputs, or leaked sensitive data.
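As a rough illustration, the snippet below shows what a request through an OpenAI-compatible gateway endpoint looks like from the client's side. The base URL, API key, and model alias are placeholders rather than TrueFoundry-specific values, and guardrail policies themselves are attached through gateway configuration, not in code.

```python
from openai import OpenAI

# Placeholders: the base URL, API key, and model alias below are illustrative,
# not TrueFoundry-specific values. Guardrail policies are attached through
# gateway configuration, so the calling code stays a plain chat completion.
client = OpenAI(
    base_url="https://<your-gateway-host>/api/llm/v1",  # hypothetical gateway URL
    api_key="<gateway-api-key>",
)

response = client.chat.completions.create(
    model="openai-main/gpt-4o-mini",  # hypothetical provider/model alias
    messages=[{"role": "user", "content": "My SSN is 123-45-6789, can you store it?"}],
)

# If an input-stage guardrail triggers, the gateway can block or mutate the
# request before it reaches the model; otherwise the response flows back
# through any output-stage guardrails before being returned here.
print(response.choices[0].message.content)
```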
TrueFoundry organizes guardrails into five task types. This benchmarking study focuses on three of them: PII Detection, Content Moderation, and Prompt Injection. These three have the broadest provider coverage and the most mature evaluation datasets.

Evaluation Dataset Design

We constructed category-balanced evaluation datasets of 400 samples per task, designed for statistically meaningful comparison with tight confidence intervals. Each dataset maintains a roughly 50/50 split between positive (harmful/PII-containing) and negative (safe/clean) samples to ensure balanced evaluation of both detection and false positive rates.
PII Detection
Content Moderation
Prompt Injection
Design decisions. Each dataset maintains approximately 50% safe/clean samples to measure false positive rates: a guardrail that flags everything is useless. Categories with fewer than 5 samples were merged into an “Other” category to ensure statistical reliability. Each sample carries per-provider ground truth labels (expected_triggers) because providers may legitimately disagree on edge cases. For example, a sample discussing “how AI safety guardrails work” is safe but touches security-adjacent language, and not all providers handle this distinction identically.

All samples were hand-curated locally rather than drawn from external benchmarks. This ensures precise control over category balance, difficulty distribution, and ground truth accuracy.
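To make the dataset format concrete, here is a sketch of what one evaluation record might look like as a JSONL line. Only the expected_triggers field is named above; every other field name and the provider keys are illustrative assumptions.

```python
# A hypothetical unified-schema record for one evaluation sample, written as
# a JSONL line. Only "expected_triggers" is named in the text above; the
# remaining field names and provider keys are illustrative assumptions.
import json

sample = {
    "id": "pii-0042",                       # assumed identifier scheme
    "task": "pii_detection",
    "category": "Email",
    "text": "Please send the invoice to jane.doe@example.com by Friday.",
    "label": "positive",                    # contains PII
    "expected_triggers": {                  # per-provider ground truth
        "azure-pii": True,
        "pangea": True,
    },
}

with open("pii_detection.jsonl", "a") as f:
    f.write(json.dumps(sample) + "\n")
```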
Evaluation Methodology
Every provider was evaluated against identical datasets through the TrueFoundry AI Gateway, ensuring a fair comparison with no per-provider data leakage.
Evaluation Pipeline
1. Dataset loading: JSONL datasets are loaded with automatic format detection (unified vs. legacy schema).
2. Async evaluation: Samples are dispatched concurrently using semaphore-based throttling (50 parallel requests) via the OpenAI-compatible /v1/chat/completions endpoint (see the sketch after this list).
3. Binary classification: Each sample produces a binary outcome, guardrail triggered (true) or not (false), compared against per-provider ground truth.
4. Metrics aggregation: Standard classification metrics are computed across all samples.
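A minimal sketch of the semaphore-throttled evaluation loop might look like the following. The endpoint URL and model alias are placeholders, and the assumption that a blocked request surfaces as an API error is ours, not a description of the actual harness.

```python
import asyncio
from openai import AsyncOpenAI

# Placeholders: point the client at your gateway's OpenAI-compatible base URL.
client = AsyncOpenAI(
    base_url="https://<your-gateway-host>/api/llm/v1",  # hypothetical URL
    api_key="<gateway-api-key>",
)
semaphore = asyncio.Semaphore(50)  # 50 concurrent requests, as in the harness

async def evaluate_sample(sample: dict) -> bool:
    """Return True if the guardrail triggered for this sample."""
    async with semaphore:
        try:
            await client.chat.completions.create(
                model="openai-main/gpt-4o-mini",  # hypothetical model alias
                messages=[{"role": "user", "content": sample["text"]}],
            )
            return False  # request passed through untouched
        except Exception:
            # Assumption: a blocked request surfaces as an API error. A
            # Mutate-mode guardrail would instead require inspecting the response.
            return True

async def run(samples: list[dict]) -> list[bool]:
    return await asyncio.gather(*(evaluate_sample(s) for s in samples))

# triggered = asyncio.run(run(samples))
```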
Metrics
F1 Score serves as the primary ranking metric because it balances the trade-off between precision (avoiding false alarms) and recall (catching real threats). A high-precision, low-recall guardrail misses threats. A high-recall, low-precision guardrail blocks legitimate users.
With 400 samples per task, Wilson score confidence intervals give ±0.03–0.05 margin at 95% confidence, tight enough to distinguish meaningful performance differences between providers.
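For reference, these metrics and the Wilson interval take only a few lines of standard-library Python; the example numbers at the end simply illustrate why 400 samples yield roughly a ±0.03 margin.

```python
from math import sqrt

def classification_metrics(predicted: list[bool], actual: list[bool]):
    """Precision, recall, and F1 from binary trigger decisions vs. ground truth."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def wilson_interval(successes: int, n: int, z: float = 1.96):
    """95% Wilson score interval for a proportion (e.g. accuracy over n samples)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# 360 correct out of 400 gives roughly (0.867, 0.926): about a ±0.03 margin.
print(wilson_interval(360, 400))
```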
Latency Tracking
We track latency at two levels:
• Client-side latency — End-to-end time measured in the evaluation harness, including network round-trip
• Server-side latency — Guardrail processing time only, extracted from TrueFoundry traces via the Spans API (tfy.guardrail.metric.latency_in_ms)
Server-side latency isolates the guardrail’s own processing time from network overhead, providing a more accurate comparison across providers.
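The sketch below illustrates the two measurements, assuming client-side latency is taken with a wall-clock timer around each request and that server-side values arrive as span attributes keyed by tfy.guardrail.metric.latency_in_ms; the Spans API call itself is omitted.

```python
import time
from statistics import median, quantiles

def timed_call(fn, *args, **kwargs):
    """Client-side latency: end-to-end time, including the network round-trip."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, (time.perf_counter() - start) * 1000  # milliseconds

def server_side_latencies(span_attributes: list[dict]) -> list[float]:
    """Server-side latency: guardrail processing time read from trace spans."""
    key = "tfy.guardrail.metric.latency_in_ms"
    return [float(attrs[key]) for attrs in span_attributes if key in attrs]

def summarize(latencies_ms: list[float]) -> dict:
    return {
        "p50_ms": median(latencies_ms),
        "p95_ms": quantiles(latencies_ms, n=20)[-1],  # ~95th percentile
    }
```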
Provider Comparison Results
PII Detection
Azure PII provides fine-grained entity-level detection with configurable PII categories (Email, PhoneNumber, SSN, Address, CreditCardNumber, IPAddress, Person) and language-aware processing. It achieves perfect precision (every flagged entity is genuine PII) with strong recall at 0.865, evaluated in Mutate mode where detected PII is redacted rather than blocked outright. The missed detections (a 0.135 recall gap) tend to concentrate in ambiguous contexts where PII entities appear in non-standard formats.
Content Moderation
Content Moderation shows the clearest provider differentiation. OpenAI’s omni-moderation-latest model leads with a 0.899 F1 score, achieving strong balance between precision and recall across Hate, Violence, SelfHarm, and Harassment categories. Azure Content Safety trades lower accuracy for significantly faster response times (52ms vs. 192ms), making it a viable choice for latency-sensitive deployments. PromptFoo lags on both efficacy and latency in this evaluation, with its 1.1-second response times reflecting its LLM-based detection approach.
Prompt Injection
Pangea demonstrates a high-recall detection strategy, catching 0.990 of injection attempts at the cost of more false positives (0.750 precision). This means it rarely misses an attack but will occasionally flag legitimate security-related questions. The safe samples in this dataset are deliberately security-adjacent (“How do AI safety guardrails work?”) to stress-test false positive rates, which partially explains the precision gap. For applications where missing an injection attack carries higher risk than occasional false alarms, Pangea’s recall-oriented profile is well-suited.
Key Takeaways
No single provider wins across all tasks. The guardrail landscape is specialized: providers optimized for PII detection may underperform on prompt injection, and vice versa. This is expected — each task demands fundamentally different detection strategies.
Precision and recall tell different stories. A provider with high precision but low recall is conservative: it rarely raises false alarms but misses real threats. The inverse catches everything but fatigues users with false positives. The right balance depends on your application’s risk tolerance.
A unified gateway enables informed selection. By evaluating all providers through a single integration point, teams can benchmark providers head-to-head on their own data and select the best provider per task — or combine multiple providers for defense-in-depth. Teams can also build custom guardrails for domain-specific needs.
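One simple way to layer providers, sketched below under hypothetical names, is to block whenever any configured guardrail triggers; the check_guardrail callable is a stand-in for whatever per-provider check your setup exposes.

```python
# Hypothetical sketch: block a request if any configured provider's guardrail
# triggers. Provider names and the check_guardrail callable are stand-ins for
# whatever per-provider checks your gateway configuration exposes.
def defense_in_depth(text: str, providers: list[str], check_guardrail) -> bool:
    """Return True (block) if any provider's guardrail triggers on the text."""
    return any(check_guardrail(provider, text) for provider in providers)

# blocked = defense_in_depth(user_message, ["azure-pii", "pangea"], check_guardrail)
```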
Task-specific evaluation is non-negotiable. Generic “safety scores” obscure critical differences in provider behavior. Only by evaluating against curated, category-balanced datasets with per-provider ground truth can teams make informed procurement decisions. The benchmarking framework described here — 400 category-balanced samples per task, Wilson score confidence intervals, per-provider labels, dual latency tracking, and standard classification metrics — provides a reproducible methodology for any team evaluating guardrail solutions.
Built for Speed: ~10ms Latency, Even Under Load
- Handles 350+ RPS on just 1 vCPU — no tuning needed
- Production-ready with full enterprise support
TrueFoundry AI Gateway delivers ~3–4 ms latency, handles 350+ RPS on 1 vCPU, scales horizontally with ease, and is production-ready. LiteLLM, by contrast, suffers from high latency, struggles beyond moderate RPS, lacks built-in scaling, and is best suited to light or prototype workloads.








