Benchmarking LLM Guardrail Providers: A Data-Driven Comparison
Why LLM Applications Need Guardrails
Production LLM applications face a growing surface area of risk. Users can inadvertently leak personally identifiable information (PII) through conversational inputs. Models can generate toxic, violent, or sexually explicit content that violates platform policies. Adversarial users craft prompt injection attacks designed to override system instructions, extract confidential prompts, or bypass safety filters entirely.
The consequences are not hypothetical. A PII leak can trigger regulatory action under GDPR, CCPA, or HIPAA. Toxic outputs erode user trust and create brand liability. A successful prompt injection can expose proprietary system prompts or cause the model to execute unintended actions.
Prompt engineering and system instructions provide a first layer of defense, but they are insufficient on their own. Models can be coerced past instruction-level guardrails through encoding attacks, roleplay scenarios, or context manipulation. Automated guardrail systems — purpose-built classifiers that inspect inputs and outputs in real time — provide the defense-in-depth that production deployments require.
The challenge: the market now includes over a dozen guardrail providers, each with different strengths, latency profiles, and coverage gaps. How do you choose the right one for your use case?
TrueFoundry Guardrails: A Unified Gateway
TrueFoundry’s AI Gateway abstracts multiple guardrail providers behind a single OpenAI-compatible API (docs). Teams integrate once with the /v1/chat/completions endpoint and can swap providers through configuration, with no code changes required.
The gateway supports two evaluation stages. Input-stage guardrails inspect user messages before they reach the LLM, blocking prompt injections, PII, or harmful content. Output-stage guardrails inspect model responses before they reach the user, catching hallucinations, toxic outputs, or leaked sensitive data.
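As a rough illustration, the snippet below shows what a request through an OpenAI-compatible gateway endpoint looks like from the client's side. The base URL, API key, and model alias are placeholders rather than TrueFoundry-specific values, and guardrail policies themselves are attached through gateway configuration, not in code.

```python
from openai import OpenAI

# Placeholders: the base URL, API key, and model alias below are illustrative,
# not TrueFoundry-specific values. Guardrail policies are attached through
# gateway configuration, so the calling code stays a plain chat completion.
client = OpenAI(
    base_url="https://<your-gateway-host>/api/llm/v1",  # hypothetical gateway URL
    api_key="<gateway-api-key>",
)

response = client.chat.completions.create(
    model="openai-main/gpt-4o-mini",  # hypothetical provider/model alias
    messages=[{"role": "user", "content": "My SSN is 123-45-6789, can you store it?"}],
)

# If an input-stage guardrail triggers, the gateway can block or mutate the
# request before it reaches the model; otherwise the response flows back
# through any output-stage guardrails before being returned here.
print(response.choices[0].message.content)
```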
TrueFoundry organizes guardrails into five task types. This benchmarking study focuses on three of them: PII Detection, Content Moderation, and Prompt Injection. These three have the broadest provider coverage and the most mature evaluation datasets.

Evaluation Dataset Design

We constructed category-balanced evaluation datasets of 400 samples per task, designed for statistically meaningful comparison with tight confidence intervals. Each dataset maintains a roughly 50/50 split between positive (harmful/PII-containing) and negative (safe/clean) samples to ensure balanced evaluation of both detection and false positive rates.
PII Detection
Content Moderation
Prompt Injection
Design decisions. Each dataset maintains approximately 50% safe/clean samples to measure false positive rates: a guardrail that flags everything is useless. Categories with fewer than 5 samples were merged into an “Other” category to ensure statistical reliability. Each sample carries per-provider ground truth labels (expected_triggers) because providers may legitimately disagree on edge cases. For example, a sample discussing “how AI safety guardrails work” is safe but touches security-adjacent language, and not all providers handle this distinction identically.

All samples were hand-curated locally rather than drawn from external benchmarks. This ensures precise control over category balance, difficulty distribution, and ground truth accuracy.
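To make the dataset format concrete, here is a sketch of what one evaluation record might look like as a JSONL line. Only the expected_triggers field is named above; every other field name and the provider keys are illustrative assumptions.

```python
# A hypothetical unified-schema record for one evaluation sample, written as
# a JSONL line. Only "expected_triggers" is named in the text above; the
# remaining field names and provider keys are illustrative assumptions.
import json

sample = {
    "id": "pii-0042",                       # assumed identifier scheme
    "task": "pii_detection",
    "category": "Email",
    "text": "Please send the invoice to jane.doe@example.com by Friday.",
    "label": "positive",                    # contains PII
    "expected_triggers": {                  # per-provider ground truth
        "azure-pii": True,
        "pangea": True,
    },
}

with open("pii_detection.jsonl", "a") as f:
    f.write(json.dumps(sample) + "\n")
```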
Evaluation Methodology
Every provider was evaluated against identical datasets through the TrueFoundry AI Gateway, ensuring a fair comparison with no per-provider data leakage.
Evaluation Pipeline
1. Dataset loading: JSONL datasets are loaded with automatic format detection (unified vs. legacy schema).
2. Async evaluation: Samples are dispatched concurrently using semaphore-based throttling (50 parallel requests) via the OpenAI-compatible /v1/chat/completions endpoint (see the sketch after this list).
3. Binary classification: Each sample produces a binary outcome, guardrail triggered (true) or not (false), compared against per-provider ground truth.
4. Metrics aggregation: Standard classification metrics are computed across all samples.
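A minimal sketch of the semaphore-throttled evaluation loop might look like the following. The endpoint URL and model alias are placeholders, and the assumption that a blocked request surfaces as an API error is ours, not a description of the actual harness.

```python
import asyncio
from openai import AsyncOpenAI

# Placeholders: point the client at your gateway's OpenAI-compatible base URL.
client = AsyncOpenAI(
    base_url="https://<your-gateway-host>/api/llm/v1",  # hypothetical URL
    api_key="<gateway-api-key>",
)
semaphore = asyncio.Semaphore(50)  # 50 concurrent requests, as in the harness

async def evaluate_sample(sample: dict) -> bool:
    """Return True if the guardrail triggered for this sample."""
    async with semaphore:
        try:
            await client.chat.completions.create(
                model="openai-main/gpt-4o-mini",  # hypothetical model alias
                messages=[{"role": "user", "content": sample["text"]}],
            )
            return False  # request passed through untouched
        except Exception:
            # Assumption: a blocked request surfaces as an API error. A
            # Mutate-mode guardrail would instead require inspecting the response.
            return True

async def run(samples: list[dict]) -> list[bool]:
    return await asyncio.gather(*(evaluate_sample(s) for s in samples))

# triggered = asyncio.run(run(samples))
```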
Metrics
F1 Score serves as the primary ranking metric because it balances the trade-off between precision (avoiding false alarms) and recall (catching real threats). A high-precision, low-recall guardrail misses threats. A high-recall, low-precision guardrail blocks legitimate users.
With 400 samples per task, Wilson score confidence intervals give ±0.03–0.05 margin at 95% confidence, tight enough to distinguish meaningful performance differences between providers.
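For reference, these metrics and the Wilson interval take only a few lines of standard-library Python; the example numbers at the end simply illustrate why 400 samples yield roughly a ±0.03 margin.

```python
from math import sqrt

def classification_metrics(predicted: list[bool], actual: list[bool]):
    """Precision, recall, and F1 from binary trigger decisions vs. ground truth."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def wilson_interval(successes: int, n: int, z: float = 1.96):
    """95% Wilson score interval for a proportion (e.g. accuracy over n samples)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# 360 correct out of 400 gives roughly (0.867, 0.926): about a ±0.03 margin.
print(wilson_interval(360, 400))
```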
Latency Tracking
We track latency at two levels:
• Client-side latency — End-to-end time measured in the evaluation harness, including network round-trip
• Server-side latency — Guardrail processing time only, extracted from TrueFoundry traces via the Spans API (tfy.guardrail.metric.latency_in_ms)
Server-side latency isolates the guardrail’s own processing time from network overhead, providing a more accurate comparison across providers.
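The sketch below illustrates the two measurements, assuming client-side latency is taken with a wall-clock timer around each request and that server-side values arrive as span attributes keyed by tfy.guardrail.metric.latency_in_ms; the Spans API call itself is omitted.

```python
import time
from statistics import median, quantiles

def timed_call(fn, *args, **kwargs):
    """Client-side latency: end-to-end time, including the network round-trip."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, (time.perf_counter() - start) * 1000  # milliseconds

def server_side_latencies(span_attributes: list[dict]) -> list[float]:
    """Server-side latency: guardrail processing time read from trace spans."""
    key = "tfy.guardrail.metric.latency_in_ms"
    return [float(attrs[key]) for attrs in span_attributes if key in attrs]

def summarize(latencies_ms: list[float]) -> dict:
    return {
        "p50_ms": median(latencies_ms),
        "p95_ms": quantiles(latencies_ms, n=20)[-1],  # ~95th percentile
    }
```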
Provider Comparison Results
PII Detection
Azure PII provides fine-grained entity-level detection with configurable PII categories (Email, PhoneNumber, SSN, Address, CreditCardNumber, IPAddress, Person) and language-aware processing. It achieves perfect precision (every flagged entity is genuine PII) with strong recall at 0.865, evaluated in Mutate mode where detected PII is redacted rather than blocked outright. The missed detections (a 0.135 recall gap) tend to concentrate in ambiguous contexts where PII entities appear in non-standard formats.
Content Moderation
Content Moderation shows the clearest provider differentiation. OpenAI’s omni-moderation-latest model leads with a 0.899 F1 score, achieving strong balance between precision and recall across Hate, Violence, SelfHarm, and Harassment categories. Azure Content Safety trades lower accuracy for significantly faster response times (52ms vs. 192ms), making it a viable choice for latency-sensitive deployments. PromptFoo lags on both efficacy and latency in this evaluation, with its 1.1-second response times reflecting its LLM-based detection approach.
Prompt Injection
Pangea demonstrates a high-recall detection strategy, catching 0.990 of injection attempts at the cost of more false positives (0.750 precision). This means it rarely misses an attack but will occasionally flag legitimate security-related questions. The safe samples in this dataset are deliberately security-adjacent (“How do AI safety guardrails work?”) to stress-test false positive rates, which partially explains the precision gap. For applications where missing an injection attack carries higher risk than occasional false alarms, Pangea’s recall-oriented profile is well-suited.
Key Takeaways
No single provider wins across all tasks. The guardrail landscape is specialized: providers optimized for PII detection may underperform on prompt injection, and vice versa. This is expected — each task demands fundamentally different detection strategies.
Precision and recall tell different stories. A provider with high precision but low recall is conservative: it rarely raises false alarms but misses real threats. The inverse catches everything but fatigues users with false positives. The right balance depends on your application’s risk tolerance.
A unified gateway enables informed selection. By evaluating all providers through a single integration point, teams can benchmark providers head-to-head on their own data and select the best provider per task — or combine multiple providers for defense-in-depth. Teams can also build custom guardrails for domain-specific needs.
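One simple way to layer providers, sketched below under hypothetical names, is to block whenever any configured guardrail triggers; the check_guardrail callable is a stand-in for whatever per-provider check your setup exposes.

```python
# Hypothetical sketch: block a request if any configured provider's guardrail
# triggers. Provider names and the check_guardrail callable are stand-ins for
# whatever per-provider checks your gateway configuration exposes.
def defense_in_depth(text: str, providers: list[str], check_guardrail) -> bool:
    """Return True (block) if any provider's guardrail triggers on the text."""
    return any(check_guardrail(provider, text) for provider in providers)

# blocked = defense_in_depth(user_message, ["azure-pii", "pangea"], check_guardrail)
```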
Task-specific evaluation is non-negotiable. Generic “safety scores” obscure critical differences in provider behavior. Only by evaluating against curated, category-balanced datasets with per-provider ground truth can teams make informed procurement decisions. The benchmarking framework described here — 400 category-balanced samples per task, Wilson score confidence intervals, per-provider labels, dual latency tracking, and standard classification metrics — provides a reproducible methodology for any team evaluating guardrail solutions.
Built for Speed: ~10ms Latency, Even Under Load
- Handles 350+ RPS on just 1 vCPU — no tuning needed
- Production-ready with full enterprise support
TrueFoundry AI Gateway delivers ~3–4 ms latency, handles 350+ RPS on 1 vCPU, scales horizontally with ease, and is production-ready. LiteLLM, by contrast, suffers from high latency, struggles beyond moderate RPS, lacks built-in scaling, and is best suited to light or prototype workloads.








