This guide explains how to integrate the CoreWeave Weave toxicity scorer with TrueFoundry AI Gateway as input and output guardrails. The integration runs the WeaveToxicityScorerV1 (Celadon) classifier inside a small wrapper service that you deploy on TrueFoundry. The gateway invokes the wrapper through its Custom Guardrail interface. The scorer runs locally in the wrapper pod: it makes no calls to W&B at runtime and takes roughly 25-30 ms per call on CPU after warmup.
Source repository: truefoundry/integrations-custom-guardrails/integrations/coreweave-weave/. It contains the Dockerfile, deploy script, rail handlers, and tests referenced below.

What is the CoreWeave Weave Toxicity Scorer?

CoreWeave Weave (formerly Weights & Biases Weave) ships a family of local scorers for GenAI safety. This integration wraps WeaveToxicityScorerV1 - a DeBERTa-v3-small model (Celadon, trained on Toxic Commons) with five toxicity heads: Race/Origin, Gender/Sex, Religion, Ability, and Violence.

Key Features of the Weave Scorer on TrueFoundry

  1. Local toxicity classification on inbound user messages and outbound assistant responses - no external service calls, no LLM round-trip per request.
  2. Two operations per direction: a Validate rail that blocks toxic content, and a Mutate rail that masks it with a fixed placeholder.
  3. Per-request threshold tuning via the dashboard Config JSON (total_threshold, category_threshold) - adjust sensitivity without redeploying.
  4. Fast cold start: the ~550 MB Celadon model is baked into the Docker image at build time, so the pod starts warm and avoids a multi-minute HuggingFace download stall.
The v1 bundle wraps a single scorer across four endpoints. Celadon is a toxicity classifier only - it does not detect prompt injection, secret leakage, or PII. For PII use the Guardrails AI integration.

Architecture

The gateway dispatches the input rail call and the model call in parallel for low time-to-first-token. The wrapper extracts the relevant message, scores it with Celadon, and returns a verdict. The scorer is stateless across calls; thresholds are applied per-request so dashboard Config tuning takes effect without a restart. The Validate rails always return HTTP 200 and signal the policy decision in the JSON body:
  • {"verdict": true} - allow
  • {"verdict": false, "message": "..."} - block
On a block, the gateway cancels the in-flight model call. The output rail runs sequentially on the assistant response after the model returns. The Mutate rails also return HTTP 200, but use the mutate response shape to mask content rather than block:
  • {"verdict": true, "transformed": false, "result": <original body>} - pass through unchanged
  • {"verdict": true, "transformed": true, "result": <modified body>} - the gateway replaces the in-flight body with result
On a mutation, the input rail replaces the last user message’s content with "[message removed by safety filter]"; the output rail replaces the first assistant choice’s content with "I can't help with that." See Custom guardrail response contract for the underlying protocol.
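The response shapes above can be sketched in a few lines of Python. This is a minimal illustration, not the wrapper's actual handler code: the function names are hypothetical, and the output body is assumed to follow the OpenAI-style chat completion layout (choices[0].message.content).

```python
import copy

INPUT_PLACEHOLDER = "[message removed by safety filter]"
OUTPUT_PLACEHOLDER = "I can't help with that."


def validate_response(toxic, reason=""):
    """Body of a Validate rail reply (sent with HTTP 200 either way)."""
    if toxic:
        return {"verdict": False, "message": reason}
    return {"verdict": True}


def mutate_input_response(request_body, toxic):
    """Body of an input Mutate rail reply: mask the last user message if toxic."""
    if not toxic:
        return {"verdict": True, "transformed": False, "result": request_body}
    masked = copy.deepcopy(request_body)  # leave the caller's object untouched
    for message in reversed(masked.get("messages", [])):
        if message.get("role") == "user":
            message["content"] = INPUT_PLACEHOLDER
            break
    return {"verdict": True, "transformed": True, "result": masked}


def mutate_output_response(response_body, toxic):
    """Body of an output Mutate rail reply: replace the first choice if toxic."""
    if not toxic:
        return {"verdict": True, "transformed": False, "result": response_body}
    masked = copy.deepcopy(response_body)
    choices = masked.get("choices", [])
    if choices:
        # Assumption: OpenAI-style layout, choices[0]["message"]["content"].
        choices[0].setdefault("message", {})["content"] = OUTPUT_PLACEHOLDER
    return {"verdict": True, "transformed": True, "result": masked}
```

Note that the Mutate shape always carries "verdict": true; masking is signalled only through the transformed flag and the returned result.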

Prerequisites

Before integrating the Weave scorer with TrueFoundry, ensure you have:
  • A TrueFoundry workspace you can deploy services into.
  • The model FQN you want to protect (e.g. openai-main/gpt-4o-mini).
  • A cluster with a configured base host (visible at Integrations → Clusters → <cluster>).
No W&B API key is required. The Celadon model is pulled from the public wandb/WeaveToxicityScorerV1 HuggingFace repo at image-build time and runs entirely offline thereafter.

Integration Steps

1

Clone the wrapper repository

Clone the integration repo and switch to the CoreWeave Weave folder:
git clone https://github.com/truefoundry/integrations-custom-guardrails
cd integrations-custom-guardrails/integrations/coreweave-weave
2

Configure environment variables

Copy .env.example to .env and fill in the values. You will reference one TrueFoundry secret that you create in the next step - get its FQN from Platform → Secrets after creating it.
.env
# Wrapper auth - the gateway sends this as `Authorization: Bearer ...` when calling the wrapper.
WRAPPER_API_KEY=<generate with `python -c "import secrets; print(secrets.token_urlsafe(32))"`>

# Server runtime
PORT=8000
LOG_LEVEL=info

# Weave scorer runtime (cpu by default; "cuda" on a GPU node)
WEAVE_TOXICITY_DEVICE=cpu

# Deploy-time only
TFY_SERVICE_NAME=coreweave-weave-guardrails-tfy
TFY_WORKSPACE_FQN=<cluster>:<workspace>
TFY_PUBLIC_HOST=ml.<cluster>.truefoundry.cloud
TFY_PUBLIC_PATH=/coreweave-weave-guardrails-tfy
# Tenant-scoped, colon-separated form: tfy-secret://<tenant>:<secret-group>:<secret-key>
WRAPPER_API_KEY_SECRET_FQN=tfy-secret://<tenant>:coreweave-weave-guardrails-tfy:wrapper-api-key
Generate WRAPPER_API_KEY with python -c "import secrets; print(secrets.token_urlsafe(32))". The gateway will send this value as Authorization: Bearer … when calling the wrapper.
3

Create a TrueFoundry secret

Navigate to Platform → Secrets and create a Secret Group named coreweave-weave-guardrails-tfy with one secret:
Secret Name | Value
wrapper-api-key | The same random string you put in .env as WRAPPER_API_KEY.
Copy the secret’s FQN and confirm the entry in .env (WRAPPER_API_KEY_SECRET_FQN) matches. The FQN is tenant-scoped and colon-separated, for example tfy-secret://tfy-eo:coreweave-weave-guardrails-tfy:wrapper-api-key.
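A quick format check can catch a malformed FQN before deploying. The helper below is a hypothetical sketch (it is not part of the repository or the TrueFoundry CLI) that only validates the tfy-secret://<tenant>:<secret-group>:<secret-key> shape described above; it does not contact TrueFoundry.

```python
def parse_secret_fqn(fqn):
    """Split tfy-secret://<tenant>:<secret-group>:<secret-key> into its parts.

    Raises ValueError when the string does not have the expected shape.
    """
    prefix = "tfy-secret://"
    if not fqn.startswith(prefix):
        raise ValueError(f"expected {prefix!r} prefix, got {fqn!r}")
    parts = fqn[len(prefix):].split(":")
    if len(parts) != 3 or not all(parts):
        raise ValueError("expected <tenant>:<secret-group>:<secret-key>")
    tenant, group, key = parts
    return {"tenant": tenant, "group": group, "key": key}


if __name__ == "__main__":
    print(parse_secret_fqn(
        "tfy-secret://tfy-eo:coreweave-weave-guardrails-tfy:wrapper-api-key"
    ))
```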
4

Deploy the wrapper service

Install the TrueFoundry CLI, log in, and deploy:
pip install -U truefoundry
tfy login
python deploy.py --wait
The first build is slow because the Dockerfile pre-downloads the ~550 MB Celadon model so the deployed pod starts warm. Subsequent builds use TrueFoundry’s image layer cache and are much faster. After the build, the pod takes a few seconds to load and warm up the scorer before readiness passes.
Verify the service is healthy:
curl -s https://ml.<cluster>.truefoundry.cloud/coreweave-weave-guardrails-tfy/health
# {"status":"ok"}
5

Register the Custom Guardrail Configs in TrueFoundry

Navigate to AI Gateway → Guardrails → + Add New Guardrails Group.
  1. Group name: coreweave-weave
  2. Description (optional): CoreWeave Weave toxicity scorer (Celadon): validate + mutate rails
  3. Click + Add Guardrail Config → Custom Guardrail Config for each rail you want. The four endpoints are independent - register only the ones you need.
For each rail, use the same template:
Field | Value
Name | the rail name from the table below (e.g. toxicity-input)
Operation | Validate or Mutate per the table
URL | https://ml.<cluster>.truefoundry.cloud/coreweave-weave-guardrails-tfy/<suffix>
Auth Data | Custom Bearer Auth, token = the wrapper-api-key secret value
Headers | (empty)
Config | {} (or override thresholds - see Tuning Thresholds)
Enforcing Strategy | Enforce But Ignore On Error (recommended)
The four rails to register:
Direction | Name | URL suffix | Operation | Behavior
Input | toxicity-input | /toxicity-input | Validate | Blocks the request on toxicity
Output | toxicity-output | /toxicity-output | Validate | Blocks the response on toxicity
Input | toxicity-input-mutate | /toxicity-input-mutate | Mutate | Masks the user message on toxicity
Output | toxicity-output-mutate | /toxicity-output-mutate | Mutate | Replaces the assistant response on toxicity
Save the group.
Pick Validate OR Mutate per direction - do not stack both on the same direction. The wrapper signals decisions on HTTP 200; real failures (scorer load error, wrapper crash) come as HTTP 5xx. With Enforce But Ignore On Error, transient outages pass through while real policy decisions still apply. Use Enforce for safety-critical rails where fail-closed is the right trade-off. See Custom guardrail response contract and Enforcing Strategy.
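The difference between the two strategies can be illustrated with a toy decision function. This is only a model of the behavior described above, not the gateway's implementation; the strategy labels and the function name are hypothetical.

```python
STRATEGIES = ("enforce", "enforce_but_ignore_on_error")  # hypothetical labels


def gateway_outcome(strategy, http_status, body=None):
    """Return "allow" or "block" for a single Validate rail call."""
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy!r}")
    if http_status == 200 and isinstance(body, dict) and "verdict" in body:
        # A real policy decision: both strategies apply it.
        return "allow" if body["verdict"] else "block"
    # Anything else is treated here as a wrapper failure (e.g. HTTP 5xx).
    if strategy == "enforce_but_ignore_on_error":
        return "allow"  # fail open: transient outages pass through
    return "block"      # fail closed: outages also block


if __name__ == "__main__":
    for strategy in STRATEGIES:
        print(strategy,
              gateway_outcome(strategy, 200, {"verdict": False}),
              gateway_outcome(strategy, 503))
```

Under this model the two strategies differ only when the wrapper fails; a {"verdict": false} on HTTP 200 blocks in both.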
[Figure: TrueFoundry Custom Guardrail configuration form populated for the CoreWeave Weave toxicity-input guardrail, showing Custom Bearer Auth, the Validate operation, the Enforce strategy, the Request target, and the wrapper toxicity-input URL.]
6

Apply the guardrail to traffic

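There are two ways to route requests through the rails. Pick based on whether you want every call to a model protected or prefer per-call opt-in.
  • Model-level (every call): navigate to AI Gateway → Models → <model> → Guardrails tab → attach the coreweave-weave group → Save. Every caller of this model now passes through the rails.
  • Per-call opt-in: send the X-TFY-GUARDRAILS header on individual requests, naming the rails to apply, as in the test calls in the next step.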
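For per-call opt-in, the X-TFY-GUARDRAILS header used in this guide's end-to-end test calls carries a JSON object listing rail selectors. Building it with json.dumps avoids hand-escaping quotes. The helper name below is hypothetical; the header name and the two keys are taken from those test calls.

```python
import json


def guardrails_header(input_rails=(), output_rails=()):
    """Build the X-TFY-GUARDRAILS header for per-call opt-in.

    Selectors use the <group>/<rail> form, e.g. "coreweave-weave/toxicity-input".
    """
    selectors = {}
    if input_rails:
        selectors["llm_input_guardrails"] = list(input_rails)
    if output_rails:
        selectors["llm_output_guardrails"] = list(output_rails)
    return {"X-TFY-GUARDRAILS": json.dumps(selectors, separators=(",", ":"))}


if __name__ == "__main__":
    print(guardrails_header(
        input_rails=["coreweave-weave/toxicity-input"],
        output_rails=["coreweave-weave/toxicity-output"],
    ))
```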
7

Test end-to-end

Issue two test calls through the gateway - one that should succeed and one that should be blocked:
GW=https://gateway.truefoundry.ai
TFY_KEY=<your TFY API key>
MODEL=openai-main/gpt-4o-mini

# Should succeed with a normal completion
curl -s "$GW/chat/completions" \
  -H "Authorization: Bearer $TFY_KEY" -H "Content-Type: application/json" \
  -H 'X-TFY-GUARDRAILS: {"llm_input_guardrails":["coreweave-weave/toxicity-input"],"llm_output_guardrails":["coreweave-weave/toxicity-output"]}' \
  -d "{\"model\":\"$MODEL\",\"messages\":[{\"role\":\"user\",\"content\":\"What is the capital of France?\"}]}"

# Should be blocked: guardrail_checks_failed with the toxicity verdict
curl -s "$GW/chat/completions" \
  -H "Authorization: Bearer $TFY_KEY" -H "Content-Type: application/json" \
  -H 'X-TFY-GUARDRAILS: {"llm_input_guardrails":["coreweave-weave/toxicity-input"]}' \
  -d "{\"model\":\"$MODEL\",\"messages\":[{\"role\":\"user\",\"content\":\"I hate <group> and they should all be eliminated.\"}]}"
A successful block returns:
{
  "status": "failure",
  "message": "Input Guardrail checks failed for integrations: [coreweave-weave/toxicity-input] ...",
  "error": { "type": "guardrail_checks_failed", "code": "400" },
  "guardrail_checks": {
    "input_guardrails": [{
      "guardrail_integration": "coreweave-weave/toxicity-input",
      "result": "failed",
      "data": {
        "verdict": false,
        "explanation": "WeaveToxicity (input): blocked on Violence (score=3, total=4, thresholds={'total': 5, 'category': 3})",
        "guardrailUrl": "https://..."
      }
    }]
  }
}
The block message names the top-scoring toxicity category and is preserved in guardrail_checks.input_guardrails[0].data.explanation. With the mutate rails, the request instead completes normally but the toxic content is replaced: "[message removed by safety filter]" for input, "I can't help with that." for output.
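Client code can surface the block reason by walking the failure payload shown above. The sketch below is illustrative: the function name is hypothetical, and the output_guardrails key is an assumption made by analogy with input_guardrails, since only the input form appears in the sample response.

```python
def blocked_reasons(response):
    """Return (rail, explanation) pairs for every failed guardrail check."""
    error = response.get("error") or {}
    if error.get("type") != "guardrail_checks_failed":
        return []
    checks = response.get("guardrail_checks") or {}
    reasons = []
    # "output_guardrails" is assumed; the sample above only shows the input key.
    for direction in ("input_guardrails", "output_guardrails"):
        for check in checks.get(direction) or []:
            if check.get("result") == "failed":
                data = check.get("data") or {}
                reasons.append(
                    (check.get("guardrail_integration"), data.get("explanation"))
                )
    return reasons
```

A normal completion has no guardrail_checks_failed error, so the function returns an empty list for it.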

Tuning Thresholds

Celadon scores each toxicity category from 0 upward. The wrapper blocks (or masks) when either the total across all categories exceeds total_threshold, or the single highest category score reaches category_threshold. Pass overrides in the Custom Guardrail Config’s Config JSON field to tune sensitivity per-rail without redeploying:
{
  "total_threshold": 5,
  "category_threshold": 3
}
Setting | Default | Notes
total_threshold | 5 | Sum across all five categories that triggers a block/mask.
category_threshold | 3 | Single-category score that triggers a block/mask. Tuned one step above Weave’s own default of 2.
The category_threshold=3 default is deliberate: Celadon scores short capitalized greetings like "Hi" and "Hey" at Race/Origin=2, which produces false positives at threshold 2. Score 3+ is where the classifier reliably indicates real hate, death threats, or overt slurs.
  • Lower category_threshold to 2 to catch milder harassment (e.g. "you are a worthless idiot" scores Violence=2), at the cost of greeting false positives.
  • Lower to 1 to catch veiled threats too (e.g. "I hope someone breaks her face" scores Violence=1), with more noise overall.
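The decision rule can be sketched as a small function. It follows the wording in this section literally, treating "exceeds" as a strict comparison for the total and "reaches" as greater-or-equal for the category score; that reading is an assumption, so verify boundary behavior against the wrapper itself. The function and field names are hypothetical.

```python
CATEGORIES = ("Race/Origin", "Gender/Sex", "Religion", "Ability", "Violence")


def evaluate(scores, total_threshold=5, category_threshold=3):
    """Apply the two thresholds to a mapping of per-category scores."""
    total = sum(scores.get(name, 0) for name in CATEGORIES)
    top = max(CATEGORIES, key=lambda name: scores.get(name, 0))
    top_score = scores.get(top, 0)
    # Assumption: strict ">" for the total, ">=" for the single category.
    flagged = total > total_threshold or top_score >= category_threshold
    return {"flagged": flagged, "top_category": top,
            "top_score": top_score, "total": total}


if __name__ == "__main__":
    greeting = {"Race/Origin": 2}  # the "Hi" / "Hey" case described above
    print(evaluate(greeting))                        # default thresholds
    print(evaluate(greeting, category_threshold=2))  # stricter setting
```

With the defaults the greeting case is not flagged, while the same scores are flagged once category_threshold is lowered to 2, which is the trade-off the bullets above describe.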

Known Accuracy Gaps

  • Short capitalized greetings false-positive at score 2. "Hi" and "Hey" score Race/Origin=2; "Hello", "Hi there", and lowercase "hi" score 0. This motivated the category_threshold=3 default.
  • Mild harassment scores 2 and passes the default. "you are a worthless idiot" and similar score Violence=2 and pass at category_threshold=3. Real hate / death threats / overt slurs score 3+ and block. Set {"category_threshold": 2} per-rail if you need the milder band to block too.
  • Veiled threats score 1. Phrases like "I hope someone breaks her face" score Violence=1, below both defaults. Set {"category_threshold": 1} to catch them.
  • Toxicity only. Celadon does not detect prompt injection, secret leakage, or PII. For PII use the Guardrails AI integration.
  • Label-space quirk. The five dimensions are conceptual, not orthogonal - e.g. homophobic content tends to score Race/Origin rather than Gender/Sex. The block message names the top-scoring dimension, which is informative but not always semantically tidy.

Troubleshooting

The Validate rails signal decisions via {"verdict": false} on HTTP 200. If the gateway returns a normal completion when the wrapper reported a block, your tenant gateway may not be honoring the verdict field. Confirm by curling the wrapper directly: if you get 200 + {"verdict": false} but the gateway still returns a completion, the gateway is the issue.
Workaround: switch the Custom Guardrail Configs’ Enforcing Strategy to Enforce. This maps the wrapper’s non-success state to a block. The trade-off is that transient wrapper outages will also block; accept it until your tenant gateway updates.
Call a rail endpoint directly to bypass the gateway. The Validate rails return HTTP 200 with:
  • {"verdict": true} → pass
  • {"verdict": false, "message": "<reason>"} → block
curl -sS -X POST https://ml.<cluster>.truefoundry.cloud/coreweave-weave-guardrails-tfy/toxicity-input \
  -H "Authorization: Bearer $WRAPPER_API_KEY" -H "Content-Type: application/json" \
  -d '{"requestBody":{"messages":[{"role":"user","content":"<test prompt>"}]},"context":{"user":{"subjectId":"u1","subjectType":"user"}}}'
The Mutate rails return {"verdict": true, "transformed": <bool>, "result": <body>}. Non-200 responses indicate real errors (scorer init crash, missing bearer token).
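The same direct call can be prepared from Python. The sketch below only builds the request and classifies a reply; the helper names are hypothetical, and the payload mirrors the curl example above.

```python
import json


def rail_request(prompt, api_key, subject_id="u1"):
    """Build headers and an encoded JSON body for a direct input-rail call."""
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    payload = {
        "requestBody": {"messages": [{"role": "user", "content": prompt}]},
        "context": {"user": {"subjectId": subject_id, "subjectType": "user"}},
    }
    return headers, json.dumps(payload).encode("utf-8")


def interpret(status, body):
    """Classify a rail reply as "error", "block", "masked", or "pass"."""
    if status != 200:
        return "error"    # real failure, e.g. scorer init crash or bad token
    if body.get("verdict") is False:
        return "block"    # Validate rail decision
    if body.get("transformed"):
        return "masked"   # Mutate rail replaced the content
    return "pass"
```

The returned headers and body can be passed to urllib.request.Request(url, data=body, headers=headers, method="POST") with the rail URL for your cluster.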
If the wrapper rejects the gateway’s calls as unauthorized, the Authorization: Bearer … value the gateway sends most likely doesn’t match the wrapper’s WRAPPER_API_KEY env var. Three places must agree:
  1. The TFY secret coreweave-weave-guardrails-tfy/wrapper-api-key value.
  2. The deployed pod’s WRAPPER_API_KEY env var (resolved from the secret FQN at deploy time).
  3. The Custom Guardrail Config’s Auth Data → Custom Bearer Auth field value (with no leading/trailing whitespace).
If (3) drifts from (1), re-paste the current secret value into the dashboard field.
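When comparing the values, a short hash lets you check agreement without pasting the secret into logs or chat. This is a generic sketch with hypothetical helper names, not a feature of the wrapper or the dashboard.

```python
import hashlib


def fingerprint(value):
    """Short SHA-256 prefix that can be shared without revealing the value."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]


def diagnose(secret_value, dashboard_value):
    """Explain how two copies of the token differ, if at all."""
    if secret_value == dashboard_value:
        return "match"
    if secret_value.strip() == dashboard_value.strip():
        return "whitespace drift"  # stray leading/trailing whitespace
    return "different values"
```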
Curl the debug endpoint to see which scorer + thresholds + routes the running pod has loaded:
curl -sS https://ml.<cluster>.truefoundry.cloud/coreweave-weave-guardrails-tfy/debug/loaded-config \
  -H "Authorization: Bearer $WRAPPER_API_KEY" | jq
Check the wrapper_version field against the git SHA you just deployed. If it lags, your new image isn’t serving traffic yet - most commonly TrueFoundry’s image build cache served a stale layer. Force a rebuild by touching Dockerfile and redeploying.
If a prompt you expected to be blocked passes through, it is most likely a threshold-tuning issue, not a bug. Celadon scores mild harassment at 2 and veiled threats at 1, both below the category_threshold=3 default. Lower the threshold in the rail’s Config JSON ({"category_threshold": 2} or 1) - see Tuning Thresholds. Curl the rail directly with your prompt to see the raw category scores in the block message.

Known Limitations

  • Toxicity classification only. No prompt injection, PII, or secrets detection. Layer with other guardrails (e.g. Guardrails AI) for defense in depth.
  • Fixed-string mutation. Celadon is a scorer, not a rewriter, so the Mutate rails replace toxic content with a fixed placeholder rather than a sanitized rewrite of the original.
  • No streaming-aware guardrails. The TrueFoundry custom-guardrail contract is buffered: the gateway holds the full assistant response before calling the output rail. Streaming is supported end-to-end for the caller; the output rail decision is made on the assembled response.
  • In-memory state is per-replica. With multiple replicas the /debug/loaded-config response reflects whichever replica served the curl. After a deploy, retry the curl a few times to surface heterogeneity.
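Repeated sampling of the debug endpoint can be scripted. In the sketch below, fetch_config is a caller-supplied function (hypothetical) that performs one authenticated GET against /debug/loaded-config and returns the parsed JSON; the tally uses the wrapper_version field mentioned under Troubleshooting.

```python
from collections import Counter


def sample_versions(fetch_config, attempts=5):
    """Call fetch_config() several times and tally wrapper_version values.

    More than one key in the result suggests replicas are serving
    different builds.
    """
    return Counter(
        fetch_config().get("wrapper_version", "<missing>")
        for _ in range(attempts)
    )


if __name__ == "__main__":
    # Stand-in for real HTTP calls: two replicas on different builds.
    replies = iter([{"wrapper_version": "abc123"},
                    {"wrapper_version": "def456"},
                    {"wrapper_version": "abc123"}])
    print(sample_versions(lambda: next(replies), attempts=3))
```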

Reference

Field | Value
Wrapper validate endpoints | https://<host>/<path>/toxicity-{input,output}
Wrapper mutate endpoints | https://<host>/<path>/toxicity-{input,output}-mutate
Wrapper health endpoint | https://<host>/<path>/health
Wrapper debug endpoint | https://<host>/<path>/debug/loaded-config
Auth | Authorization: Bearer <WRAPPER_API_KEY>
Selector format | coreweave-weave/toxicity-input, coreweave-weave/toxicity-output-mutate, etc.
Validate response contract | HTTP 200 + {"verdict": bool, "message": Optional[str]}
Mutate response contract | HTTP 200 + {"verdict": true, "transformed": bool, "result": <body>}
Default thresholds | total_threshold=5, category_threshold=3
Repo | truefoundry/integrations-custom-guardrails/integrations/coreweave-weave/
Scorer model | wandb/WeaveToxicityScorerV1 (Apache 2.0; re-host of PleIAs/celadon)
Scorer docs | Weave local scorers