The Agentic Token Explosion: Cost Attribution & Budgets for Claude Code in CI/CD

Why CI/CD changes the economics
Interactive AI sessions have a built-in pacing mechanism: the human in front of the keyboard. The human reads the agent's output, decides what to do next, and burns roughly one prompt every few minutes. That tempo is a soft rate limit even when no policy enforces it.
CI/CD pipelines do not have that. An agent configured for automated PR review can be triggered hundreds of times an hour by ordinary commit traffic, and there is nothing in its environment that slows it down. The math is worse than the trigger frequency suggests, because the per-call cost itself grows — agentic frameworks like ReAct append every action's result back into the context window before the next reasoning step. Token consumption per agent run grows roughly O(n²) in the number of steps.
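That growth is easy to see in a toy model (a sketch; the per-step token sizes below are illustrative, not measured):

Python · quadratic context growth in a ReAct-style loop

```python
# Each step's prompt contains the base prompt plus every prior step's
# appended result, so total input tokens grow quadratically in step count.
def total_tokens(steps: int, base: int = 2_000, per_step: int = 1_500) -> int:
    total = 0
    context = base
    for _ in range(steps):
        total += context      # the whole context is re-sent as input
        context += per_step   # this step's result is appended for the next
    return total

for n in (5, 10, 20, 40):
    print(n, total_tokens(n))
# Doubling the step count roughly quadruples the total — O(n²).
```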

The provider billing blind spot
Anthropic and OpenAI dashboards will tell you exactly how many tokens your organization consumed last Tuesday. They will not tell you why. The provider has no application context — they cannot distinguish a critical production data pipeline from a junior engineer's infinite-looping side project. Both bill the same.
Without granular attribution, finance defaults to the only tool available: blanket bans. AI usage gets paused pending review. Legitimate workloads get throttled alongside the runaways. Engineering leaders learn to dread the monthly close call. The fundamental problem is that attribution must happen at ingest, not at billing — by the time the invoice arrives, the labels you needed are gone.
The provider invoice answers “how much.” The gateway-attributed ledger answers “which repo, which pipeline, which agent step, and what to fix.” That difference turns a finance crisis into an engineering ticket.
Gateway-level metadata tagging
The foundation of cost attribution is mandatory tagging at the gateway. Every request from a CI/CD pipeline injects a small JSON object via the X-TFY-METADATA header, identifying the team, the repository, the pipeline, the agent step, and the cost center responsible. The shape is simple, deliberate, and the same across every team:
HTTP · required header on every CI request
X-TFY-METADATA: {
  "team": "payments-platform",
  "repo": "transaction-service",
  "pipeline": "pr-security-audit",
  "agent_step": "step-2-policy-check",
  "cost_center": "eng-backend",
  "environment": "production"
}
Tags are mandatory, not advisory. Untagged requests get rejected at the gateway, not silently passed through. This is the policy that produces 100% observability — there is no “unknown” bucket in the dashboard, because there is no path that produces one. The cost of enforcement is one Cedar/OPA rule. The cost of not enforcing is a quarterly finance escalation.
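As a sketch of what that enforcement amounts to — the real gateway expresses this as a Cedar/OPA rule; this Python check is a hypothetical equivalent:

Python · reject-untagged enforcement sketch

```python
import json

# Tag set from the header shape above; any request missing one is rejected.
REQUIRED_TAGS = {"team", "repo", "pipeline", "agent_step", "cost_center", "environment"}

def check_metadata(headers: dict) -> tuple:
    """Return (status, reason). 403 for absent, malformed, or incomplete tags."""
    raw = headers.get("X-TFY-METADATA")
    if raw is None:
        return 403, "missing X-TFY-METADATA header"
    try:
        tags = json.loads(raw)
    except json.JSONDecodeError:
        return 403, "X-TFY-METADATA is not valid JSON"
    missing = REQUIRED_TAGS - tags.keys()
    if missing:
        return 403, f"missing required tags: {sorted(missing)}"
    return 200, "ok"

print(check_metadata({}))  # (403, 'missing X-TFY-METADATA header')
```

There is deliberately no lenient path: a malformed header fails exactly like a missing one, which is what keeps the “unknown” bucket empty.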
With tags in hand, the gateway counts input, output, and cached tokens for every call, prices the result against current provider rates, and writes a fully attributed ledger entry. Cost views are sliced by user, model, and team out of the box, with a Download Raw Data option that lets you export with custom groupBy fields (username, model_name, teams, or any metadata key you've been tagging). Every dollar has a name.
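A minimal sketch of the ledger-entry arithmetic — the per-million-token rates below are placeholders, not current provider pricing:

Python · attributed ledger entry, priced per call

```python
from dataclasses import dataclass

@dataclass
class Rates:
    """Dollars per million tokens — placeholder values, not real pricing."""
    input: float = 3.00
    output: float = 15.00
    cached: float = 0.30   # cached input is typically billed at a deep discount

def ledger_entry(tags: dict, input_tok: int, output_tok: int, cached_tok: int,
                 rates: Rates = Rates()) -> dict:
    cost = (input_tok * rates.input
            + output_tok * rates.output
            + cached_tok * rates.cached) / 1_000_000
    # The tags ride along with the counts, so every row is attributable.
    return {**tags, "input_tokens": input_tok, "output_tokens": output_tok,
            "cached_tokens": cached_tok, "cost_usd": round(cost, 6)}

entry = ledger_entry({"repo": "transaction-service"}, 400_000, 8_000, 120_000)
print(entry["cost_usd"])  # 1.356
```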
Per-project budgets with circuit breakers
Visibility without enforcement is a dashboard nobody acts on. TrueFoundry attaches hierarchical, mathematically enforced budgets to every cost center the tagging produces. Budgets are an ordered list of rules, each scoped by subjects, models, or metadata keys. Two semantics distinguish budget rules from rate-limit rules, and they're worth understanding precisely:
- Budget tracking happens for every matching rule. If a request matches three rules, the cost is debited against all three. Layered budgets — a $500 team budget on top of a $50 per-repo budget on top of a $10 per-developer budget — all stay in sync simultaneously.
- Allow/block decisions come from the first matching rule only. Rules are evaluated top-to-bottom, and the first one whose conditions match decides whether the request goes through or is rejected. Place high-priority overrides at the top, defaults at the bottom.
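The two semantics can be pinned down in a few lines (a sketch; the rule and field names are illustrative, not the gateway's actual schema):

Python · track-all, decide-first rule evaluation

```python
def evaluate(rules: list, request_tags: dict, spend: dict, cost: float):
    """Debit EVERY matching rule, but let the FIRST matching rule alone
    decide allow/block. `spend` maps rule id -> dollars used this period."""
    decision = None
    for rule in rules:
        if not rule["when"].items() <= request_tags.items():
            continue  # rule's conditions are not a subset of the request's tags
        spend[rule["id"]] = spend.get(rule["id"], 0.0) + cost  # tracked for all
        if decision is None:                                   # decided by first
            decision = spend[rule["id"]] <= rule["limit"]
    return decision if decision is not None else False

rules = [
    {"id": "ml-team-override", "when": {"team": "ml-engineering"}, "limit": 200},
    {"id": "default-user-daily", "when": {}, "limit": 10},
]
spend = {}
print(evaluate(rules, {"team": "ml-engineering"}, spend, 25.0))
# -> True: the override decides (25 <= 200) even though the $10 default is
#    now over; both rules were still debited, so layered budgets stay in sync.
```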
Budget alerts fire at four configurable thresholds — 75%, 90%, 95%, and 100% of the cap — with notification channels for email, Slack webhook, and Slack bot. The check runs every 20 minutes against the latest attributed ledger:
Table 1 — Budget thresholds. Each threshold fires once per budget period (day / week / month) and resets at the start of the next period. Alerts are checked every 20 minutes.

Threshold | What happens
75% | Early-warning alert to configured channels (email, Slack webhook, Slack bot)
90% | Alert
95% | Alert
100% | Alert; new requests are blocked unless the rule sets block_on_budget_exceed: false
The 100% behavior is part of the design, not an afterthought. The gateway returns a structured error that names the exhausted budget and points the operator at the dashboard:
JSON · 429 response on hard cap
{
  "error": "Budget Exceeded",
  "rule_id": "transaction-service-daily",
  "detail": "Repository \"transaction-service\" has exhausted its daily $50 AI budget at 14:32 UTC.",
  "mitigation": "Review pipeline logs for infinite loops or request a quota increase via the platform team.",
  "dashboard": "https://gateway.example.com/budgets/transaction-service"
}
A pipeline that hits its budget should know what to do next without the developer having to chase the platform team for context. CI runners interpret 429 as a standard backoff signal; the build fails cleanly with an actionable message rather than crashing in confusing ways.
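One way a CI step might surface that error (a hypothetical helper; the field names follow the sample 429 body above, and the exact log format is up to your runner):

Python · turning a budget 429 into a clean CI failure

```python
import json
import sys

def handle_gateway_response(status: int, body: str) -> int:
    """Return a process exit code: 0 for non-429, 1 for budget-exceeded,
    after printing the structured error so the build log is actionable."""
    if status != 429:
        return 0
    err = json.loads(body)
    print(f"ERROR {err['error']} ({err['rule_id']}): {err['detail']}", file=sys.stderr)
    print(f"Mitigation: {err['mitigation']}", file=sys.stderr)
    print(f"Dashboard:  {err['dashboard']}", file=sys.stderr)
    return 1  # non-zero exit fails the build cleanly
```

The point is that the developer reading the failed job sees the exhausted rule, the mitigation, and the dashboard link without leaving the build log.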
There is one more behavior worth knowing about: audit mode. Setting block_on_budget_exceed: false on any rule keeps tracking and alerting active but lets requests through. This is the right default during the first month of rollout. Watch the alerts fire against simulated caps; tune the caps; only then turn enforcement on. Skipping audit mode is how you wake up to an angry team whose pipelines all failed at 03:00.
YAML · layered budget config
name: cicd-budget
type: gateway-budget-config
rules:
  - id: "ml-team-override"
    when: { subjects: ["team:ml-engineering"] }
    limit_to: 200
    unit: cost_per_day
    budget_applies_per: ["user"]
  - id: "default-user-daily"
    when: {}
    limit_to: 10
    unit: cost_per_day
    budget_applies_per: ["user"]
  - id: "per-repo-daily"
    when: {}
    limit_to: 50
    unit: cost_per_day
    budget_applies_per: ["metadata.repo"]
alerts:
  thresholds: [75, 90, 100]
  notification_target:
    - type: slack-webhook
      notification_channel: "ai-budget-alerts"
Building a cost attribution dashboard
Tagged data flowing into the gateway's metrics layer lets the platform team build dashboards that answer ownership questions instead of producing more aggregate noise. Instead of staring at a spike and asking “who did this?”, the dashboard already tells you that at 02:00 UTC the frontend team deployed a new agent to react-monorepo that hallucinated a missing dependency and entered a 400-step resolution loop.
That kind of operational context turns cost from a finance problem into an engineering problem. Once you can see that switching the initial code-summarization step from Sonnet to Haiku cuts that step's cost by 80% without affecting PR review quality, you make the change. You don't argue about budget caps in a steering committee. The TrueFoundry cost-tracking views ship out of the box for User, Model, and Team perspectives, and the raw-data export lets you slice on any metadata key — so a per-repo, per-pipeline, or per-agent-step view is a one-click download, not a data-engineering project.
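With the exported raw data in hand, any such slice is a few lines of aggregation (a sketch over illustrative rows; real exports carry whatever metadata keys you tagged):

Python · slicing an exported ledger by arbitrary tags

```python
from collections import defaultdict

def spend_by(rows: list, *keys: str) -> dict:
    """Aggregate ledger rows by any combination of metadata keys."""
    totals = defaultdict(float)
    for row in rows:
        totals[tuple(row[k] for k in keys)] += row["cost_usd"]
    return dict(totals)

# Illustrative rows in the shape of a raw-data export
rows = [
    {"repo": "transaction-service", "pipeline": "pr-security-audit", "cost_usd": 1.20},
    {"repo": "transaction-service", "pipeline": "pr-security-audit", "cost_usd": 0.80},
    {"repo": "react-monorepo",      "pipeline": "pr-review",         "cost_usd": 4.10},
]
print(spend_by(rows, "repo", "pipeline"))
```

The same function answers per-repo, per-pipeline, or per-agent-step questions just by changing the key list — which is the whole argument for tagging at ingest.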

Forecasting monthly spend before the bill arrives
Aggregated tagging data also makes forecasting tractable. Agentic workloads are bursty — periodic heavy CI jobs dominate the bill — which is why simple trailing averages systematically underestimate spend. Mean-of-last-7-days is the wrong forecast for a workload whose 95th percentile is 4× its mean.
The right model is a P95 rolling forecast, run per repo and per team. P95 captures the burst risk that an average smooths over, projecting end-of-month spend with enough lead time to adjust budgets, raise quotas, or kill an offending pipeline before finance sees the surprise. “Surprise” is the operating word: this is a forecast designed not to produce them. In practice, a 7-day P95 has tracked actual end-of-month spend within 8–12% on the workloads we've measured — close enough to act on, far better than the trailing-mean alternative.
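A minimal version of that forecast, using the empirical 95th percentile over a 7-day window (the daily figures are illustrative):

Python · P95 rolling forecast vs. trailing mean

```python
import math

def p95_forecast(daily_costs: list, days_left: int) -> float:
    """Project end-of-period spend: actuals so far, plus the 7-day
    empirical P95 daily cost for each remaining day."""
    window = sorted(daily_costs[-7:])
    idx = max(0, math.ceil(0.95 * len(window)) - 1)
    return sum(daily_costs) + window[idx] * days_left

# Bursty workload: quiet days punctuated by heavy CI days
history = [12, 11, 90, 13, 12, 11, 95]
print(p95_forecast(history, days_left=23))   # 2429
print(sum(history) + sum(history) / 7 * 23)  # trailing-mean projection, far lower
```

With these numbers the P95 projection is more than twice the trailing-mean one; which is right depends on whether the heavy CI days keep recurring, and for bursty agentic workloads they do.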
A real example: $8,400 → under $800
A 50-engineer organization built a three-step Claude Code review agent that ran on every pull request: (1) summarize the diff, (2) review the diff against security policies via an MCP documentation server, (3) suggest code changes. Sensible architecture, useful workflow, no obvious red flags.
At ~15 PRs per engineer per week, accounting for retries and the context-window cost of injecting whole files into prompts, the agent averaged around 400,000 input tokens per PR. Month-one bill for CI/CD automation: $8,400.
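The arithmetic reproduces that ballpark under assumed Sonnet-class list prices ($3/M input, $15/M output) and an assumed ~100k output tokens per PR — placeholders, not the organization's actual rates:

Python · month-one bill, back of the envelope

```python
engineers, prs_per_week, weeks = 50, 15, 4
input_per_pr, output_per_pr = 400_000, 100_000  # output volume is an assumption
in_rate, out_rate = 3.00, 15.00                 # assumed $/M tokens

prs = engineers * prs_per_week * weeks          # 3,000 PRs per month
monthly = prs * (input_per_pr * in_rate + output_per_pr * out_rate) / 1e6
print(f"${monthly:,.0f}")  # $8,100 — the same ballpark as the real bill
```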
Table 2 — Walkthrough of one cost-attribution debug. Five gateway clicks, one config change. Without attribution, the response would have been a blanket ban on Sonnet for CI workflows. With attribution, the response was a one-line config change.
That gap — between “ban the model” and “cache one prompt” — is the entire payoff of doing attribution properly. The cost data exists either way; the question is whether you have the labels to read it.
FAQ
Should budgets be denominated in dollars or tokens?
Both, concurrently. Dollars align with finance and operational planning. Tokens are the engineering metric that lets you debug prompt efficiency. TrueFoundry tracks both — finance owns the dollar dashboards, engineering owns the token dashboards, and the gateway is the source of truth for both. Provider price changes are absorbed at the dollar layer without engineering having to refactor anything; new fine-tunes are absorbed at the token layer without finance needing to know the model name.
What happens when a hard limit hits mid-pipeline?
The pipeline receives a 429 with the descriptive error shown earlier and a link to the budget dashboard. The runner treats it as a standard backoff signal and fails the build cleanly with an actionable message. Quota increases are filed as standard tickets against the platform team — the dashboard URL in the error body short-circuits the usual round of “I don't understand why this is failing.”
Does mandatory tagging slow down rollout?
In practice, no — the SDK wrappers handle injection automatically inside CI templates, so individual developers never edit headers. The one-time cost is updating the team's pipeline templates; the recurring cost is zero. The recurring benefit is every dashboard, every alert, and every postmortem that follows.
What's the difference between rate limits and budget limits — when do I use each?
Rate limits stop bursts; budget limits stop spend. Rate limits are denominated in requests/minute or tokens/minute — they protect downstream services from being hammered and are evaluated per request. Budgets are denominated in dollars per day/week/month — they protect the company's wallet and are evaluated against the cumulative ledger. Most production stacks run both, scoped to different entities. The patterns are complementary, not redundant.
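The distinction shows up directly in what state each check consults (an in-memory sketch; a real gateway keeps both in shared storage):

Python · rate limit vs. budget limit

```python
class RateLimiter:
    """Per-request check against a short sliding window — protects throughput."""
    def __init__(self, max_per_minute: int):
        self.max, self.calls = max_per_minute, []
    def allow(self, now: float) -> bool:
        self.calls = [t for t in self.calls if now - t < 60]  # drop aged entries
        if len(self.calls) >= self.max:
            return False
        self.calls.append(now)
        return True

class Budget:
    """Check against the cumulative period ledger — protects spend."""
    def __init__(self, cap_usd: float):
        self.cap, self.spent = cap_usd, 0.0
    def allow(self, cost: float) -> bool:
        if self.spent + cost > self.cap:
            return False
        self.spent += cost
        return True
```

A production gateway runs both per request: the rate limiter only remembers the last minute, while the budget remembers everything since the period began — which is why a slow, steady runaway sails past one and is caught by the other.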