This integration runs the WeaveToxicityScorerV1 (Celadon) classifier inside a small wrapper service that you deploy on TrueFoundry. The gateway invokes the wrapper through its Custom Guardrail interface. The scorer runs locally in the wrapper pod: it makes no calls to W&B at runtime and takes ~25-30 ms per call on CPU after warmup.
Source repository: truefoundry/integrations-custom-guardrails/integrations/coreweave-weave/. It contains the Dockerfile, deploy script, rail handlers, and tests referenced below.
What is the CoreWeave Weave Toxicity Scorer?
CoreWeave Weave (formerly Weights & Biases Weave) ships a family of local scorers for GenAI safety. This integration wraps WeaveToxicityScorerV1 - a DeBERTa-v3-small model (Celadon, trained on Toxic Commons) with five toxicity heads: Race/Origin, Gender/Sex, Religion, Ability, and Violence.
Key Features of the Weave Scorer on TrueFoundry
- Local toxicity classification on inbound user messages and outbound assistant responses - no external service calls, no LLM round-trip per request.
- Two operations per direction: a Validate rail that blocks toxic content, and a Mutate rail that masks it with a fixed placeholder.
- Per-request threshold tuning via the dashboard Config JSON (total_threshold, category_threshold) - adjust sensitivity without redeploying.
- Fast cold start: the ~550 MB Celadon model is baked into the Docker image at build time, so the pod starts warm and avoids a multi-minute HuggingFace download stall.
Architecture
The gateway dispatches the input rail call and the model call in parallel for low time-to-first-token. The wrapper extracts the relevant message, scores it with Celadon, and returns a verdict. The scorer is stateless across calls; thresholds are applied per-request so dashboard Config tuning takes effect without a restart.
The Validate rails always return HTTP 200 and signal the policy decision in the JSON body:
- {"verdict": true} - allow
- {"verdict": false, "message": "..."} - block
The Mutate rails also return HTTP 200, but use the mutate response shape to mask content rather than block:
- {"verdict": true, "transformed": false, "result": <original body>} - pass through unchanged
- {"verdict": true, "transformed": true, "result": <modified body>} - the gateway replaces the in-flight body with result
The input rail replaces the user message’s content with [message removed by safety filter]; the output rail replaces the first assistant choice’s content with I can't help with that. See Custom guardrail response contract for the underlying protocol.
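The response shapes above can be sketched as small builder functions. This is an illustration only, not the wrapper's actual handlers: the function names are hypothetical, the bodies are assumed to follow the OpenAI chat-completions layout (messages for requests, choices for responses), and masking the most recent user message is an assumption, since the text does not say which user message is replaced.

```python
import copy
from typing import Any

INPUT_PLACEHOLDER = "[message removed by safety filter]"
OUTPUT_PLACEHOLDER = "I can't help with that."


def validate_response(toxic: bool, reason: str = "") -> dict[str, Any]:
    """Validate rail body: allow, or block with an explanatory message."""
    if toxic:
        return {"verdict": False, "message": reason}
    return {"verdict": True}


def mutate_input(body: dict[str, Any], toxic: bool) -> dict[str, Any]:
    """Mutate rail body for the input direction."""
    if toxic:
        masked = copy.deepcopy(body)  # never modify the caller's body in place
        # Assumption: mask the most recent message whose role is "user".
        for message in reversed(masked.get("messages", [])):
            if message.get("role") == "user":
                message["content"] = INPUT_PLACEHOLDER
                return {"verdict": True, "transformed": True, "result": masked}
    return {"verdict": True, "transformed": False, "result": body}


def mutate_output(body: dict[str, Any], toxic: bool) -> dict[str, Any]:
    """Mutate rail body for the output direction."""
    if toxic and body.get("choices"):
        masked = copy.deepcopy(body)
        # Replace the first assistant choice's content, as described above.
        masked["choices"][0].setdefault("message", {})["content"] = OUTPUT_PLACEHOLDER
        return {"verdict": True, "transformed": True, "result": masked}
    return {"verdict": True, "transformed": False, "result": body}
```

Note that every function returns a body meant to be sent with HTTP 200; in this contract the status code is reserved for real failures.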
Prerequisites
Before integrating the Weave scorer with TrueFoundry, ensure you have:
- A TrueFoundry workspace you can deploy services into.
- The model FQN you want to protect (e.g. openai-main/gpt-4o-mini).
- A cluster with a configured base host (visible at Integrations → Clusters → <cluster>).
No W&B API key is required. The Celadon model is pulled from the public wandb/WeaveToxicityScorerV1 HuggingFace repo at image-build time and runs entirely offline thereafter.
Integration Steps
Configure environment variables
Copy .env.example to .env and fill in the values. You will reference one TrueFoundry secret that you create in the next step; get its FQN from Platform → Secrets after creating it.
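As a rough sketch of preparing the two values this guide names for .env (WRAPPER_API_KEY and WRAPPER_API_KEY_SECRET_FQN), the helpers below generate a random key and sanity-check an FQN. Both function names are hypothetical, and the three-part tenant:group:name layout is an assumption inferred from the single example FQN given in the secret-creation step.

```python
import secrets

FQN_PREFIX = "tfy-secret://"


def generate_wrapper_api_key(num_bytes: int = 32) -> str:
    """Return a URL-safe random string suitable for WRAPPER_API_KEY."""
    return secrets.token_urlsafe(num_bytes)


def parse_secret_fqn(fqn: str) -> tuple[str, str, str]:
    """Split a secret FQN into (tenant, secret group, secret name).

    Assumes the colon-separated, three-part layout seen in the guide's example.
    """
    if not fqn.startswith(FQN_PREFIX):
        raise ValueError(f"FQN must start with {FQN_PREFIX!r}")
    parts = fqn[len(FQN_PREFIX):].split(":")
    if len(parts) != 3 or not all(parts):
        raise ValueError("expected tenant:group:name after the prefix")
    tenant, group, name = parts
    return tenant, group, name


if __name__ == "__main__":
    print("WRAPPER_API_KEY=" + generate_wrapper_api_key())
```

Checking the FQN before deploying catches copy-paste mistakes such as a missing tenant segment early.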
Create a TrueFoundry secret
Navigate to Platform → Secrets and create a Secret Group named coreweave-weave-guardrails-tfy with one secret:
| Secret Name | Value |
|---|---|
| wrapper-api-key | The same random string you put in .env as WRAPPER_API_KEY. |
Copy the secret’s FQN and confirm the entry in .env (WRAPPER_API_KEY_SECRET_FQN) matches. The FQN is tenant-scoped and colon-separated, for example tfy-secret://tfy-eo:coreweave-weave-guardrails-tfy:wrapper-api-key.
Deploy the wrapper service
Install the TrueFoundry CLI, log in, and deploy with the deploy script from the source repository. Then verify that the service is healthy through its health endpoint.
The first build is slow because the Dockerfile pre-downloads the ~550 MB Celadon model so the deployed pod starts warm. Subsequent builds use TrueFoundry’s image layer cache and are much faster. After the build, the pod takes a few seconds to load and warm up the scorer before readiness passes.
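Because readiness only passes after the scorer has loaded, a health check is best retried for a short while after a deploy. The sketch below polls the health endpoint until it answers 200. The URL pattern is taken from the registration table in this guide; the function names, the retry counts, and the assumption that the health endpoint needs no bearer token are all hypothetical.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def health_url(cluster: str, service_path: str = "coreweave-weave-guardrails-tfy") -> str:
    """Build the wrapper health URL from the cluster's base host name."""
    return f"https://ml.{cluster}.truefoundry.cloud/{service_path}/health"


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status for a GET, or 0 if the host is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # the server answered, just not with 2xx
    except OSError:
        return 0  # DNS failure, refused connection, timeout


def wait_until_healthy(
    url: str,
    probe: Callable[[str], int] = http_status,
    attempts: int = 10,
    delay: float = 3.0,
) -> bool:
    """Poll until the probe reports 200 or the attempts run out."""
    for attempt in range(attempts):
        if probe(url) == 200:
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

Passing the probe as a parameter keeps the retry loop testable without a live deployment.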
Register the Custom Guardrail Configs in TrueFoundry
Navigate to AI Gateway → Guardrails → + Add New Guardrails Group.
- Group name: coreweave-weave
- Description (optional): CoreWeave Weave toxicity scorer (Celadon): validate + mutate rails
- Click + Add Guardrail Config → Custom Guardrail Config for each rail you want. The four endpoints are independent - register only the ones you need.
| Field | Value |
|---|---|
| Name | the rail name from the table below (e.g. toxicity-input) |
| Operation | Validate or Mutate per the table |
| URL | https://ml.<cluster>.truefoundry.cloud/coreweave-weave-guardrails-tfy/<suffix> |
| Auth Data | Custom Bearer Auth, token = the wrapper-api-key secret value |
| Headers | (empty) |
| Config | {} (or override thresholds - see Tuning Thresholds) |
| Enforcing Strategy | Enforce But Ignore On Error (recommended) |
The four rails to register:
| Direction | Name | URL suffix | Operation | Behavior |
|---|---|---|---|---|
| Input | toxicity-input | /toxicity-input | Validate | Blocks the request on toxicity |
| Output | toxicity-output | /toxicity-output | Validate | Blocks the response on toxicity |
| Input | toxicity-input-mutate | /toxicity-input-mutate | Mutate | Masks the user message on toxicity |
| Output | toxicity-output-mutate | /toxicity-output-mutate | Mutate | Replaces the assistant response on toxicity |
Save the group.
Pick Validate OR Mutate per direction - do not stack both on the same direction. The wrapper signals decisions on HTTP 200; real failures (scorer load error, wrapper crash) come as HTTP 5xx. With Enforce But Ignore On Error, transient outages pass through while real policy decisions still apply. Use Enforce for safety-critical rails where fail-closed is the right trade-off. See Custom guardrail response contract and Enforcing Strategy.
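The difference between the two strategies can be sketched as a small decision function. This models only the behavior described in this guide, not the gateway's implementation; the function name and the strategy labels ("enforce", "enforce_but_ignore_on_error") are hypothetical stand-ins for the dashboard options.

```python
from typing import Any, Optional

STRATEGIES = ("enforce", "enforce_but_ignore_on_error")


def gateway_outcome(strategy: str, status: int, body: Optional[dict[str, Any]]) -> str:
    """Return "allow" or "block" for one Validate rail call."""
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy!r}")

    # A real failure: non-200 status, or a body that carries no verdict.
    errored = status != 200 or not isinstance(body, dict) or "verdict" not in body
    if errored:
        # Fail-closed under Enforce, fail-open under Enforce But Ignore On Error.
        return "block" if strategy == "enforce" else "allow"

    # A policy decision: both strategies honor the verdict.
    return "allow" if body["verdict"] else "block"
```

Under either strategy a `{"verdict": false}` body on HTTP 200 blocks; the strategies only diverge when the wrapper itself fails.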
Apply the guardrail to traffic
There are two ways to route requests through the rails - pick based on whether you want every call to a model protected, or per-call opt-in.
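- Pin to a model (every call protected): navigate to AI Gateway → Models → <model> → Guardrails tab → attach the coreweave-weave group → Save. Every caller of this model now passes through the rails.
- Per-request opt-in: name the rails to apply on individual requests, using the selectors listed under Reference (e.g. coreweave-weave/toxicity-input).
Test end-to-end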
Issue two test calls through the gateway: one that should succeed and one that should be blocked. For a blocked call, the block message names the top-scoring toxicity category and is preserved in guardrail_checks.input_guardrails[0].data.explanation. With the mutate rails, the request instead completes normally but the toxic content is replaced ([message removed by safety filter] for input, I can't help with that. for output).
Tuning Thresholds
Celadon scores each toxicity category from 0 upward. The wrapper blocks (or masks) when either the total across all categories exceeds total_threshold, or the single highest category score reaches category_threshold.
Pass overrides in the Custom Guardrail Config’s Config JSON field to tune sensitivity per-rail without redeploying:
| Setting | Default | Notes |
|---|---|---|
| total_threshold | 5 | Sum across all five categories that triggers a block/mask. |
| category_threshold | 3 | Single-category score that triggers a block/mask. Tuned one step above Weave’s own default of 2. |
The category_threshold=3 default is deliberate: Celadon scores short capitalized greetings like "Hi" and "Hey" at Race/Origin=2, which produces false positives at threshold 2. Score 3+ is where the classifier reliably indicates real hate, death threats, or overt slurs.
- Lower category_threshold to 2 to catch milder harassment (e.g. "you are a worthless idiot" scores Violence=2), at the cost of greeting false positives.
- Lower to 1 to catch veiled threats too (e.g. "I hope someone breaks her face" scores Violence=1), with more noise overall.
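The threshold rule and the per-rail overrides can be sketched as follows. This is a minimal illustration, not the wrapper's code: the function name is hypothetical, and the comparison operators (strictly greater than for the total, greater than or equal for the top category) are read off the wording "exceeds" and "reaches" in this section, so treat them as an assumption.

```python
from typing import Optional

DEFAULTS = {"total_threshold": 5, "category_threshold": 3}


def is_flagged(scores: dict[str, int], config: Optional[dict[str, int]] = None) -> bool:
    """Decide whether a message would be blocked or masked.

    `scores` maps category name to Celadon score; `config` is the rail's
    Config JSON, whose keys override the defaults.
    """
    settings = {**DEFAULTS, **(config or {})}
    total = sum(scores.values())
    top = max(scores.values(), default=0)
    # Assumption: "exceeds" means >, "reaches" means >=.
    return total > settings["total_threshold"] or top >= settings["category_threshold"]


# Scores quoted in this guide:
greeting = {"Race/Origin": 2}   # "Hi"
harassment = {"Violence": 2}    # "you are a worthless idiot"
veiled = {"Violence": 1}        # "I hope someone breaks her face"

print(is_flagged(greeting))                               # False
print(is_flagged(harassment, {"category_threshold": 2}))  # True
print(is_flagged(greeting, {"category_threshold": 2}))    # True
print(is_flagged(veiled, {"category_threshold": 1}))      # True
```

The third call shows the trade-off described above: lowering category_threshold to 2 catches the harassment example and the "Hi" false positive alike.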
Known Accuracy Gaps
- Short capitalized greetings false-positive at score 2. "Hi" and "Hey" score Race/Origin=2; "Hello", "Hi there", and lowercase "hi" score 0. This motivated the category_threshold=3 default.
- Mild harassment scores 2 and passes the default. "you are a worthless idiot" and similar score Violence=2 and pass at category_threshold=3. Real hate / death threats / overt slurs score 3+ and block. Set {"category_threshold": 2} per-rail if you need the milder band to block too.
- Veiled threats score 1. Phrases like "I hope someone breaks her face" score Violence=1, below both defaults. Set {"category_threshold": 1} to catch them.
- Toxicity only. Celadon does not detect prompt injection, secret leakage, or PII. For PII use the Guardrails AI integration.
- Label-space quirk. The five dimensions are conceptual, not orthogonal - e.g. homophobic content tends to score Race/Origin rather than Gender/Sex. The block message names the top-scoring dimension, which is informative but not always semantically tidy.
Troubleshooting
Blocks are returning 200 with the model's normal response
The Validate rails signal decisions via {"verdict": false} on HTTP 200. If the gateway returns a normal completion when the wrapper reported a block, your tenant gateway may not be honoring the verdict field. Confirm by curling the wrapper directly - if you get 200 + {"verdict": false} but the gateway still returns a completion, the gateway is the issue.
Workaround: switch the Custom Guardrail Configs’ Enforcing Strategy to Enforce. This maps the wrapper’s non-success state to a block. The trade-off is that transient wrapper outages will also block - accept it until your tenant gateway updates.
The wrapper is being called but returns the wrong shape
Call a rail endpoint directly to bypass the gateway. The Validate rails return HTTP 200 with:
- {"verdict": true} → pass
- {"verdict": false, "message": "<reason>"} → block
The Mutate rails return {"verdict": true, "transformed": <bool>, "result": <body>}. Non-200 responses indicate real errors (scorer init crash, missing bearer token).
401 Unauthorized from the wrapper
The wrapper expects an Authorization: Bearer <WRAPPER_API_KEY> header. Check that the token in the Custom Guardrail Config’s Auth Data matches the wrapper-api-key secret value.
Did my redeploy actually replace the running pod?
Curl the debug endpoint to see which scorer, thresholds, and routes the running pod has loaded. Check the wrapper_version field against the git SHA you just deployed. If it lags, your new image isn’t serving traffic yet - most commonly TrueFoundry’s image build cache served a stale layer. Force a rebuild by touching Dockerfile and redeploying.
A prompt that should block isn't blocking
Most likely a threshold-tuning issue, not a bug. Celadon scores mild harassment at 2 and veiled threats at 1, both below the category_threshold=3 default. Lower the threshold in the rail’s Config JSON ({"category_threshold": 2} or 1) - see Tuning Thresholds. Curl the rail directly with your prompt to see the raw category scores in the block message.
Known Limitations
- Toxicity classification only. No prompt injection, PII, or secrets detection. Layer with other guardrails (e.g. Guardrails AI) for defense in depth.
- Fixed-string mutation. Celadon is a scorer, not a rewriter, so the Mutate rails replace toxic content with a fixed placeholder rather than a sanitized rewrite of the original.
- No streaming-aware guardrails. The TrueFoundry custom-guardrail contract is buffered: the gateway holds the full assistant response before calling the output rail. Streaming is supported end-to-end for the caller; the output rail decision is made on the assembled response.
- In-memory state is per-replica. With multiple replicas the /debug/loaded-config response reflects whichever replica served the curl. After a deploy, retry the curl a few times to surface heterogeneity.
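The retry advice in the last bullet can be sketched as a sampling loop over the debug endpoint. The wrapper_version field and the endpoint path come from this guide; the function names, the sample count, the assumption that the endpoint accepts the same bearer token as the rails, and the assumption that wrapper_version equals the deployed git SHA verbatim are all hypothetical.

```python
import json
import urllib.request
from collections import Counter
from typing import Any, Callable


def fetch_loaded_config(url: str, api_key: str, timeout: float = 5.0) -> dict[str, Any]:
    """GET /debug/loaded-config once; the load balancer picks the replica."""
    request = urllib.request.Request(url, headers={"Authorization": f"Bearer {api_key}"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def sample_versions(fetch: Callable[[], dict[str, Any]], samples: int = 10) -> Counter:
    """Call the endpoint repeatedly and count the wrapper_version values seen."""
    return Counter(str(fetch().get("wrapper_version")) for _ in range(samples))


def all_replicas_on(expected_sha: str, versions: Counter) -> bool:
    """True when every sampled response reported the expected version."""
    return set(versions) == {expected_sha}
```

A usage example, with placeholder values left for you to fill in: `sample_versions(lambda: fetch_loaded_config("https://<host>/<path>/debug/loaded-config", "<WRAPPER_API_KEY>"))`. More than one key in the resulting counter suggests that old and new replicas are both serving traffic; sampling cannot prove that every replica was reached.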
Reference
| Field | Value |
|---|---|
| Wrapper validate endpoints | https://<host>/<path>/toxicity-{input,output} |
| Wrapper mutate endpoints | https://<host>/<path>/toxicity-{input,output}-mutate |
| Wrapper health endpoint | https://<host>/<path>/health |
| Wrapper debug endpoint | https://<host>/<path>/debug/loaded-config |
| Auth | Authorization: Bearer <WRAPPER_API_KEY> |
| Selector format | coreweave-weave/toxicity-input, coreweave-weave/toxicity-output-mutate, etc. |
| Validate response contract | HTTP 200 + {"verdict": bool, "message": Optional[str]} |
| Mutate response contract | HTTP 200 + {"verdict": true, "transformed": bool, "result": <body>} |
| Default thresholds | total_threshold=5, category_threshold=3 |
| Repo | truefoundry/integrations-custom-guardrails/integrations/coreweave-weave/ |
| Scorer model | wandb/WeaveToxicityScorerV1 (Apache 2.0; re-host of PleIAs/celadon) |
| Scorer docs | Weave local scorers |