CoreWeave Weave Guardrails Integration

This guide explains how to integrate the CoreWeave Weave toxicity scorer with TrueFoundry AI Gateway as input and output guardrails. The integration runs the WeaveToxicityScorerV1 (Celadon) classifier inside a small wrapper service that you deploy on TrueFoundry. The AI Gateway invokes the wrapper through its Custom Guardrail interface. The scorer runs locally in the wrapper pod: it makes no calls to W&B at runtime and takes roughly 25-30 ms per call on CPU after warmup.

Source repository: truefoundry/integrations-custom-guardrails/integrations/coreweave-weave/. It contains the Dockerfile, deploy script, rail handlers, and tests referenced below.

What is the CoreWeave Weave Toxicity Scorer?

CoreWeave Weave (formerly Weights & Biases Weave) ships a family of local scorers for GenAI safety. This integration wraps WeaveToxicityScorerV1 - a DeBERTa-v3-small model (Celadon, trained on Toxic Commons) with five toxicity heads: Race/Origin, Gender/Sex, Religion, Ability, and Violence.

Key Features of the Weave Scorer on TrueFoundry

Local toxicity classification on inbound user messages and outbound assistant responses - no external service calls, no LLM round-trip per request.
Two operations per direction: a Validate rail that blocks toxic content, and a Mutate rail that masks it with a fixed placeholder.
Per-request threshold tuning via the dashboard Config JSON (total_threshold, category_threshold) - adjust sensitivity without redeploying.
Fast cold start: the ~550 MB Celadon model is baked into the Docker image at build time, so the pod starts warm and avoids a multi-minute HuggingFace download stall.

The v1 bundle wraps a single scorer across four endpoints. Celadon is a toxicity classifier only - it does not detect prompt injection, secret leakage, or PII. For PII use the Guardrails AI integration.

Architecture

The AI Gateway dispatches the input rail call and the model call in parallel for low time-to-first-token. The wrapper extracts the relevant message, scores it with Celadon, and returns a verdict. The scorer is stateless across calls; thresholds are applied per-request so dashboard Config tuning takes effect without a restart. The Validate rails always return HTTP 200 and signal the policy decision in the JSON body:

{"verdict": true} - allow
{"verdict": false, "message": "..."} - block

On a block, the AI Gateway cancels the in-flight model call. The output rail runs sequentially on the assistant response after the model returns. The Mutate rails also return HTTP 200, but use the mutate response shape to mask content rather than block:

{"verdict": true, "transformed": false, "result": <original body>} - pass through unchanged
{"verdict": true, "transformed": true, "result": <modified body>} - the AI Gateway replaces the in-flight body with result

On a mutation, the input rail replaces the last user message’s content with "[message removed by safety filter]"; the output rail replaces the first assistant choice’s content with "I can't help with that." See Custom guardrail response contract for the underlying protocol.
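As a rough illustration of the two mutate behaviours, the sketch below applies the placeholders to an OpenAI-style chat body once the scorer has already flagged the content. The function names and the assumption that bodies follow the chat-completions shape (`messages`, `choices[0].message.content`) are illustrative only; the wrapper's actual rail handlers live in the source repository and may differ.

```python
import copy

INPUT_PLACEHOLDER = "[message removed by safety filter]"
OUTPUT_PLACEHOLDER = "I can't help with that."


def mask_input(request_body: dict) -> dict:
    """Return a mutate-shaped response that masks the last user message."""
    body = copy.deepcopy(request_body)  # leave the caller's body untouched
    for message in reversed(body.get("messages", [])):
        if message.get("role") == "user":
            message["content"] = INPUT_PLACEHOLDER
            return {"verdict": True, "transformed": True, "result": body}
    # No user message found: pass the body through unchanged.
    return {"verdict": True, "transformed": False, "result": body}


def mask_output(response_body: dict) -> dict:
    """Return a mutate-shaped response that replaces the first assistant choice."""
    body = copy.deepcopy(response_body)
    choices = body.get("choices", [])
    if not choices:
        return {"verdict": True, "transformed": False, "result": body}
    choices[0].setdefault("message", {})["content"] = OUTPUT_PLACEHOLDER
    return {"verdict": True, "transformed": True, "result": body}
```

Both helpers always return `"verdict": True`, matching the mutate response shape above: masking replaces the content, it never blocks the call.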

Prerequisites

Before integrating the Weave scorer with TrueFoundry, ensure you have:

A TrueFoundry workspace you can deploy services into.
The model FQN you want to protect (e.g. openai-main/gpt-4o-mini).
A cluster with a configured base host (visible at Integrations → Clusters → <cluster>).

No W&B API key is required. The Celadon model is pulled from the public wandb/WeaveToxicityScorerV1 HuggingFace repo at image-build time and runs entirely offline thereafter.

Integration Steps

Clone the wrapper repository

Clone the integration repo and switch to the CoreWeave Weave folder:

git clone https://github.com/truefoundry/integrations-custom-guardrails
cd integrations-custom-guardrails/integrations/coreweave-weave

Configure environment variables

Copy .env.example to .env and fill in the values. You will reference one TrueFoundry secret that you create in the next step - get its FQN from Platform → Secrets after creating it.

.env

# Wrapper auth - the AI Gateway sends this as `Authorization: Bearer ...` when calling the wrapper.
WRAPPER_API_KEY=<generate with `python -c "import secrets; print(secrets.token_urlsafe(32))"`>

# Server runtime
PORT=8000
LOG_LEVEL=info

# Weave scorer runtime (cpu by default; "cuda" on a GPU node)
WEAVE_TOXICITY_DEVICE=cpu

# Deploy-time only
TFY_SERVICE_NAME=coreweave-weave-guardrails-tfy
TFY_WORKSPACE_FQN=<cluster>:<workspace>
TFY_PUBLIC_HOST=ml.<cluster>.truefoundry.cloud
TFY_PUBLIC_PATH=/coreweave-weave-guardrails-tfy
# Tenant-scoped, colon-separated form: tfy-secret://<tenant>:<secret-group>:<secret-key>
WRAPPER_API_KEY_SECRET_FQN=tfy-secret://<tenant>:coreweave-weave-guardrails-tfy:wrapper-api-key

Generate WRAPPER_API_KEY with python -c "import secrets; print(secrets.token_urlsafe(32))". The AI Gateway will send this value as Authorization: Bearer … when calling the wrapper.

Create a TrueFoundry secret

Navigate to Platform → Secrets and create a Secret Group named coreweave-weave-guardrails-tfy with one secret:

Secret Name	Value
`wrapper-api-key`	The same random string you put in `.env` as `WRAPPER_API_KEY`.

Copy the secret’s FQN and confirm the entry in .env (WRAPPER_API_KEY_SECRET_FQN) matches. The FQN is tenant-scoped and colon-separated, for example tfy-secret://tfy-eo:coreweave-weave-guardrails-tfy:wrapper-api-key.
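To catch a malformed FQN before deploying, a small check along the following lines may help. It only encodes the format described in this guide (`tfy-secret://<tenant>:<secret-group>:<secret-key>`); the `parse_secret_fqn` helper is hypothetical and is not part of the TrueFoundry CLI or SDK.

```python
from typing import NamedTuple

PREFIX = "tfy-secret://"


class SecretFqn(NamedTuple):
    tenant: str
    group: str
    key: str


def parse_secret_fqn(fqn: str) -> SecretFqn:
    """Split tfy-secret://<tenant>:<secret-group>:<secret-key> into its parts."""
    if not fqn.startswith(PREFIX):
        raise ValueError(f"expected prefix {PREFIX!r}: {fqn!r}")
    parts = fqn[len(PREFIX):].split(":")
    # Exactly three non-empty, colon-separated segments are expected.
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected <tenant>:<secret-group>:<secret-key>: {fqn!r}")
    return SecretFqn(*parts)
```

Running it against the value of WRAPPER_API_KEY_SECRET_FQN surfaces the common mistake of using slashes instead of colons between segments.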

Deploy the wrapper service

Install the TrueFoundry CLI, log in, and deploy:

pip install -U truefoundry
tfy login
python deploy.py --wait

The first build is slow because the Dockerfile pre-downloads the ~550 MB Celadon model so the deployed pod starts warm. Subsequent builds use TrueFoundry’s image layer cache and are much faster. After the build, the pod takes a few seconds to load and warm up the scorer before readiness passes.

Verify the service is healthy:

curl -s https://ml.<cluster>.truefoundry.cloud/coreweave-weave-guardrails-tfy/health
# {"status":"ok"}

Register the guardrails in the AI Gateway

Navigate to AI Gateway → Guardrails → + Add New Guardrails Group.

Group name: coreweave-weave
Description (optional): CoreWeave Weave toxicity scorer (Celadon): validate + mutate rails
Click + Add Guardrail Config → Custom Guardrail Config for each rail you want. The four endpoints are independent - register only the ones you need.

For each rail, use the same template:

Field	Value
Name	the rail name from the table below (e.g. `toxicity-input`)
Operation	`Validate` or `Mutate` per the table
URL	`https://ml.<cluster>.truefoundry.cloud/coreweave-weave-guardrails-tfy/<suffix>`
Auth Data	Custom Bearer Auth, token = the `wrapper-api-key` secret value
Headers	(empty)
Config	`{}` (or override thresholds - see Tuning Thresholds)
Enforcing Strategy	`Enforce But Ignore On Error` (recommended)

The four rails to register:

Direction	Name	URL suffix	Operation	Behavior
Input	`toxicity-input`	`/toxicity-input`	`Validate`	Blocks the request on toxicity
Output	`toxicity-output`	`/toxicity-output`	`Validate`	Blocks the response on toxicity
Input	`toxicity-input-mutate`	`/toxicity-input-mutate`	`Mutate`	Masks the user message on toxicity
Output	`toxicity-output-mutate`	`/toxicity-output-mutate`	`Mutate`	Replaces the assistant response on toxicity

Save the group.
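If you prefer to generate the four URLs rather than type them, the rail table above can be expressed as data. This is only a convenience sketch: `RAILS` and `rail_urls` are hypothetical names, and the host and path arguments correspond to the TFY_PUBLIC_HOST and TFY_PUBLIC_PATH values from .env.

```python
# (direction, rail name, operation) as listed in the rail table.
RAILS = [
    ("input", "toxicity-input", "Validate"),
    ("output", "toxicity-output", "Validate"),
    ("input", "toxicity-input-mutate", "Mutate"),
    ("output", "toxicity-output-mutate", "Mutate"),
]


def rail_urls(public_host: str, public_path: str) -> dict:
    """Map each rail name to its wrapper URL (the URL suffix equals the rail name)."""
    base = f"https://{public_host.strip('/')}/{public_path.strip('/')}"
    return {name: f"{base}/{name}" for _, name, _ in RAILS}
```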

Pick Validate OR Mutate per direction - do not stack both on the same direction. The wrapper signals decisions on HTTP 200; real failures (scorer load error, wrapper crash) come as HTTP 5xx. With Enforce But Ignore On Error, transient outages pass through while real policy decisions still apply. Use Enforce for safety-critical rails where fail-closed is the right trade-off. See Custom guardrail response contract and Enforcing Strategy.
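The difference between the two strategies can be summarised as a small decision function. This is a simplified model of the behaviour described above, not the AI Gateway's implementation; the `gateway_action` name and the treatment of every non-200 response as a wrapper failure are assumptions.

```python
from typing import Optional


def gateway_action(strategy: str, status_code: int, verdict: Optional[bool]) -> str:
    """Model how a Validate rail result maps to "allow" or "block"."""
    if strategy not in ("Enforce", "Enforce But Ignore On Error"):
        raise ValueError(f"strategy not modelled here: {strategy!r}")
    if status_code == 200 and verdict is not None:
        # A real policy decision: honoured under both strategies.
        return "allow" if verdict else "block"
    # Anything else is treated as a wrapper failure (for example HTTP 5xx).
    # "Enforce" fails closed; "Enforce But Ignore On Error" fails open.
    return "block" if strategy == "Enforce" else "allow"
```

For example, a 503 from the wrapper yields "allow" under Enforce But Ignore On Error and "block" under Enforce, while `{"verdict": false}` on HTTP 200 blocks under both.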

Figure: CoreWeave Weave Custom Guardrail configuration in TrueFoundry, showing the toxicity-input guardrail with Custom Bearer Auth, the Validate operation, the Enforce strategy, the Request target, and the wrapper toxicity-input URL.

Apply the guardrail to traffic

There are two ways to route requests through the rails - pick based on whether you want every call to a model protected, or per-call opt-in.

Pin to a model (every call protected)

Navigate to AI Gateway → Models → <model> → Guardrails tab → attach the coreweave-weave group → Save. Every caller of this model now passes through the rails.

Per-request opt-in

Send the X-TFY-GUARDRAILS header on individual requests. The selector format is <group-name>/<config-name>. Pick validate or mutate per direction - do not stack both on the same direction.

from openai import OpenAI
import json

client = OpenAI(
    api_key="<TFY API key>",
    base_url="https://gateway.truefoundry.ai",
)

# Block on toxicity (validate rails)
resp = client.chat.completions.create(
    model="openai-main/gpt-4o-mini",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    extra_headers={
        "X-TFY-GUARDRAILS": json.dumps({
            "llm_input_guardrails":  ["coreweave-weave/toxicity-input"],
            "llm_output_guardrails": ["coreweave-weave/toxicity-output"],
        }),
    },
)

To mask instead of block, use the mutate rails:

"X-TFY-GUARDRAILS": json.dumps({
    "llm_input_guardrails":  ["coreweave-weave/toxicity-input-mutate"],
    "llm_output_guardrails": ["coreweave-weave/toxicity-output-mutate"],
})
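A small helper can build the header and reject the stacking mistake before a request is sent. This is a sketch under the naming used in this guide: it assumes rail names ending in `-mutate` are Mutate rails, and `guardrails_header` is a hypothetical function, not part of any TrueFoundry or OpenAI SDK.

```python
import json


def guardrails_header(group, input_rails=(), output_rails=()):
    """Build the X-TFY-GUARDRAILS header for one guardrail group.

    Raises ValueError if validate and mutate rails are mixed on one direction.
    """
    payload = {}
    for key, rails in (("llm_input_guardrails", input_rails),
                       ("llm_output_guardrails", output_rails)):
        # Rail names in this guide end in "-mutate" for the Mutate operation.
        kinds = {name.endswith("-mutate") for name in rails}
        if len(kinds) > 1:
            raise ValueError(f"{key}: validate and mutate rails are stacked")
        if rails:
            payload[key] = [f"{group}/{name}" for name in rails]
    return {"X-TFY-GUARDRAILS": json.dumps(payload)}
```

The returned dictionary can be passed directly as `extra_headers` in the `client.chat.completions.create(...)` call shown earlier.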

Test end-to-end

Issue two test calls through the AI Gateway - one that should succeed and one that should be blocked:

GW=https://gateway.truefoundry.ai
TFY_KEY=<your TFY API key>
MODEL=openai-main/gpt-4o-mini

# Should succeed with a normal completion
curl -s "$GW/chat/completions" \
  -H "Authorization: Bearer $TFY_KEY" -H "Content-Type: application/json" \
  -H 'X-TFY-GUARDRAILS: {"llm_input_guardrails":["coreweave-weave/toxicity-input"],"llm_output_guardrails":["coreweave-weave/toxicity-output"]}' \
  -d "{\"model\":\"$MODEL\",\"messages\":[{\"role\":\"user\",\"content\":\"What is the capital of France?\"}]}"

# Should be blocked: guardrail_checks_failed with the toxicity verdict
curl -s "$GW/chat/completions" \
  -H "Authorization: Bearer $TFY_KEY" -H "Content-Type: application/json" \
  -H 'X-TFY-GUARDRAILS: {"llm_input_guardrails":["coreweave-weave/toxicity-input"]}' \
  -d "{\"model\":\"$MODEL\",\"messages\":[{\"role\":\"user\",\"content\":\"I hate <group> and they should all be eliminated.\"}]}"

A successful block returns:

{
  "status": "failure",
  "message": "Input Guardrail checks failed for integrations: [coreweave-weave/toxicity-input] ...",
  "error": { "type": "guardrail_checks_failed", "code": "400" },
  "guardrail_checks": {
    "input_guardrails": [{
      "guardrail_integration": "coreweave-weave/toxicity-input",
      "result": "failed",
      "data": {
        "verdict": false,
        "explanation": "WeaveToxicity (input): blocked on Violence (score=3, total=4, thresholds={'total': 5, 'category': 3})",
        "guardrailUrl": "https://..."
      }
    }]
  }
}

The block message names the top-scoring toxicity category and is preserved in guardrail_checks.input_guardrails[0].data.explanation. With the mutate rails, the request instead completes normally but the toxic content is replaced ("[message removed by safety filter]" for input, "I can't help with that." for output).
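When handling a block in client code, the explanation can be pulled out of the error body with a few dictionary lookups. The sketch below relies only on the fields shown in the sample response; the `output_guardrails` key is an assumption made by symmetry with `input_guardrails`, and `failed_guardrails` is a hypothetical helper name.

```python
def failed_guardrails(payload: dict) -> list:
    """Return (selector, explanation) pairs for every failed guardrail check."""
    checks = payload.get("guardrail_checks", {})
    failures = []
    # "output_guardrails" is assumed to mirror "input_guardrails".
    for direction in ("input_guardrails", "output_guardrails"):
        for entry in checks.get(direction, []):
            if entry.get("result") == "failed":
                explanation = entry.get("data", {}).get("explanation", "")
                failures.append((entry.get("guardrail_integration", ""), explanation))
    return failures
```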

Tuning Thresholds

Celadon scores each toxicity category from 0 upward. The wrapper blocks (or masks) when either the total across all categories exceeds total_threshold, or the single highest category score reaches category_threshold. Pass overrides in the Custom Guardrail Config’s Config JSON field to tune sensitivity per-rail without redeploying:

{
  "total_threshold": 5,
  "category_threshold": 3
}

Setting	Default	Notes
`total_threshold`	`5`	Sum across all five categories that triggers a block/mask.
`category_threshold`	`3`	Single-category score that triggers a block/mask. Tuned one step above Weave’s own default of `2`.

The category_threshold=3 default is deliberate: Celadon scores short capitalized greetings like "Hi" and "Hey" at Race/Origin=2, which produces false positives at threshold 2. Score 3+ is where the classifier reliably indicates real hate, death threats, or overt slurs.

Lower category_threshold to 2 to catch milder harassment (e.g. "you are a worthless idiot" scores Violence=2), at the cost of greeting false positives.
Lower to 1 to catch veiled threats too (e.g. "I hope someone breaks her face" scores Violence=1), with more noise overall.
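The threshold rule can be written out as a short function to make the effect of each setting concrete. This is a sketch, not the wrapper's code: the comparison operators follow the wording above ("exceeds" for the total, "reaches" for a single category), so check the wrapper source if exact boundary behaviour matters. The `is_flagged` name and the dictionary-of-scores input are assumptions.

```python
CATEGORIES = ("Race/Origin", "Gender/Sex", "Religion", "Ability", "Violence")


def is_flagged(scores: dict, total_threshold: int = 5, category_threshold: int = 3) -> bool:
    """Apply the two-threshold rule to per-category Celadon scores."""
    values = [scores.get(name, 0) for name in CATEGORIES]  # missing categories count as 0
    total = sum(values)
    top = max(values)
    # Assumption: strict ">" for the total, ">=" for the top category.
    return total > total_threshold or top >= category_threshold
```

With the defaults, a greeting scored Race/Origin=2 passes, while the Violence=3 example from the block response is flagged on the category rule even though its total of 4 stays under `total_threshold`.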

Known Accuracy Gaps

Short capitalized greetings false-positive at score 2. "Hi" and "Hey" score Race/Origin=2; "Hello", "Hi there", and lowercase "hi" score 0. This motivated the category_threshold=3 default.
Mild harassment scores 2 and passes the default. "you are a worthless idiot" and similar score Violence=2 and pass at category_threshold=3. Real hate / death threats / overt slurs score 3+ and block. Set {"category_threshold": 2} per-rail if you need the milder band to block too.
Veiled threats score 1. Phrases like "I hope someone breaks her face" score Violence=1, below both defaults. Set {"category_threshold": 1} to catch them.
Toxicity only. Celadon does not detect prompt injection, secret leakage, or PII. For PII use the Guardrails AI integration.
Label-space quirk. The five dimensions are conceptual, not orthogonal - e.g. homophobic content tends to score Race/Origin rather than Gender/Sex. The block message names the top-scoring dimension, which is informative but not always semantically tidy.

Troubleshooting

Blocks are returning 200 with the model's normal response

The Validate rails signal decisions via {"verdict": false} on HTTP 200. If the AI Gateway returns a normal completion when the wrapper reported a block, your tenant gateway may not be honoring the verdict field. Confirm by curling the wrapper directly - if you get 200 + {"verdict": false} but the AI Gateway still returns a completion, the AI Gateway is the issue.

Workaround: switch the Custom Guardrail Configs’ Enforcing Strategy to Enforce. This maps the wrapper’s non-success state to a block. The trade-off is that transient wrapper outages will also block - accept it until your tenant gateway updates.

The wrapper is being called but returns the wrong shape

Call a rail endpoint directly to bypass the AI Gateway. The Validate rails return HTTP 200 with:

{"verdict": true} → pass
{"verdict": false, "message": "<reason>"} → block

curl -sS -X POST https://ml.<cluster>.truefoundry.cloud/coreweave-weave-guardrails-tfy/toxicity-input \
  -H "Authorization: Bearer $WRAPPER_API_KEY" -H "Content-Type: application/json" \
  -d '{"requestBody":{"messages":[{"role":"user","content":"<test prompt>"}]},"context":{"user":{"subjectId":"u1","subjectType":"user"}}}'

The Mutate rails return {"verdict": true, "transformed": <bool>, "result": <body>}. Non-200 responses indicate real errors (scorer init crash, missing bearer token).
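When scripting this check, it can help to build the request payload and label the response shape programmatically. The sketch below mirrors the payload from the curl example and the response shapes listed above; `build_rail_payload` and `describe_rail_response` are hypothetical helpers, and the labels they return are illustrative.

```python
def build_rail_payload(prompt: str) -> dict:
    """Build the request body used in the curl example above."""
    return {
        "requestBody": {"messages": [{"role": "user", "content": prompt}]},
        "context": {"user": {"subjectId": "u1", "subjectType": "user"}},
    }


def describe_rail_response(status_code: int, body: dict) -> str:
    """Label a wrapper response according to the validate/mutate contracts."""
    if status_code != 200:
        return "error"  # real failure, not a policy decision
    if "transformed" in body:
        # Mutate shape: verdict is always true, "transformed" carries the decision.
        return "masked" if body["transformed"] else "unchanged"
    if body.get("verdict") is True:
        return "pass"
    if body.get("verdict") is False:
        return "block"
    return "unexpected shape"
```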

401 Unauthorized from the wrapper

The Authorization: Bearer … value the AI Gateway sends doesn’t match the wrapper’s WRAPPER_API_KEY env var. Three places must agree:

1. The TFY secret coreweave-weave-guardrails-tfy/wrapper-api-key value.
2. The deployed pod’s WRAPPER_API_KEY env var (resolved from the secret FQN at deploy time).
3. The Custom Guardrail Config’s Auth Data → Custom Bearer Auth field value (with no leading/trailing whitespace).

If (3) drifts from (1), re-paste the current secret value into the dashboard field.
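To see why stray whitespace is enough to cause a 401, consider a typical exact-match bearer check. How the wrapper actually compares tokens is an assumption here; the sketch uses Python's standard `hmac.compare_digest` for a constant-time comparison, and `bearer_matches` is a hypothetical name.

```python
import hmac


def bearer_matches(authorization_header: str, expected_key: str) -> bool:
    """Check an Authorization header value against the expected wrapper key."""
    scheme, _, token = authorization_header.partition(" ")
    if scheme.lower() != "bearer":
        return False
    # Exact, constant-time comparison: a trailing space or newline
    # in either value makes the check fail.
    return hmac.compare_digest(token.encode(), expected_key.encode())
```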

Did my redeploy actually replace the running pod?

Curl the debug endpoint to see which scorer + thresholds + routes the running pod has loaded:

curl -sS https://ml.<cluster>.truefoundry.cloud/coreweave-weave-guardrails-tfy/debug/loaded-config \
  -H "Authorization: Bearer $WRAPPER_API_KEY" | jq

Check the wrapper_version field against the git SHA you just deployed. If it lags, your new image isn’t serving traffic yet - most commonly TrueFoundry’s image build cache served a stale layer. Force a rebuild by touching Dockerfile and redeploying.
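The comparison itself can be scripted once you have the JSON from the debug endpoint. This sketch assumes `wrapper_version` holds a git SHA, possibly abbreviated; that format, and the `serves_commit` helper name, are assumptions rather than documented behaviour.

```python
def serves_commit(loaded_config: dict, deployed_sha: str) -> bool:
    """Return True if the pod's wrapper_version matches the deployed git SHA."""
    running = str(loaded_config.get("wrapper_version", "")).strip().lower()
    wanted = deployed_sha.strip().lower()
    if not running or not wanted:
        return False
    # Accept an abbreviated SHA on either side by matching on the shorter prefix.
    return running.startswith(wanted) or wanted.startswith(running)
```

Pass it the parsed debug response and the output of `git rev-parse HEAD` from the checkout you deployed.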

A prompt that should block isn't blocking

Most likely a threshold-tuning issue, not a bug. Celadon scores mild harassment at 2 and veiled threats at 1, both below the category_threshold=3 default. Lower the threshold in the rail’s Config JSON ({"category_threshold": 2} or 1) - see Tuning Thresholds. Curl the rail directly with your prompt to see the raw category scores in the block message.

Known Limitations

Toxicity classification only. No prompt injection, PII, or secrets detection. Layer with other guardrails (e.g. Guardrails AI) for defense in depth.
Fixed-string mutation. Celadon is a scorer, not a rewriter, so the Mutate rails replace toxic content with a fixed placeholder rather than a sanitized rewrite of the original.
No streaming-aware guardrails. The TrueFoundry custom-guardrail contract is buffered: the AI Gateway holds the full assistant response before calling the output rail. Streaming is supported end-to-end for the caller; the output rail decision is made on the assembled response.
In-memory state is per-replica. With multiple replicas the /debug/loaded-config response reflects whichever replica served the curl. After a deploy, retry the curl a few times to surface heterogeneity.

Reference

Field	Value
Wrapper validate endpoints	`https://<host>/<path>/toxicity-{input,output}`
Wrapper mutate endpoints	`https://<host>/<path>/toxicity-{input,output}-mutate`
Wrapper health endpoint	`https://<host>/<path>/health`
Wrapper debug endpoint	`https://<host>/<path>/debug/loaded-config`
Auth	`Authorization: Bearer <WRAPPER_API_KEY>`
Selector format	`coreweave-weave/toxicity-input`, `coreweave-weave/toxicity-output-mutate`, etc.
Validate response contract	`HTTP 200 + {"verdict": bool, "message": Optional[str]}`
Mutate response contract	`HTTP 200 + {"verdict": true, "transformed": bool, "result": <body>}`
Default thresholds	`total_threshold=5`, `category_threshold=3`
Repo	`truefoundry/integrations-custom-guardrails/integrations/coreweave-weave/`
Scorer model	`wandb/WeaveToxicityScorerV1` (Apache 2.0; re-host of `PleIAs/celadon`)
Scorer docs	Weave local scorers
