AI Governance and Audit for Enterprise LLMs: Virtual Keys, RBAC, and Compliance-Grade Logs

Published: June 23, 2026

When LLMs move from a pilot to production across many teams, governance stops being optional. Someone will ask who can call which model, with what budget, on whose data, and whether every call can be reconstructed for an audit. This post describes the control plane that answers those questions: virtual keys that decouple access from provider credentials, RBAC and policy-as-code, budgets and quotas as governance, compliance-grade audit logs, data residency, and how these gateway controls map to obligations like the EU AI Act, framed as what they help satisfy, not as a compliance guarantee.

Key Takeaways

Governance becomes non-negotiable when LLMs go multi-team and multi-model. Without a control plane you get shared API keys with no attribution, no spend limits, no audit trail, and shadow AI — exactly what an auditor or a budget owner will eventually ask about.
Virtual keys decouple access from provider credentials: per-team or per-app keys that map to real provider keys at the gateway, so you can attribute usage, revoke access without rotating provider keys, and scope what each key can reach.
RBAC and policy-as-code (Cedar/OPA) answer "who can call what" — which teams may use the frontier model, which routes, which tools, and which providers a given data class may touch.
Budgets, quotas, and rate limits are governance and fairness controls, not just abuse protection: hard and soft limits per team, alerting, and enforcement when a limit is exceeded.
A compliance-grade audit log is immutable, complete (who, what, when, which model, which data category), tamper-evident, and exportable to a SIEM — and it logs metadata and redaction events, never raw PII.
Data residency and sovereignty are routing decisions: region-aware routing, blocking certain providers for certain data classes, and self-hosted models for regulated data.
The gateway is the single control plane for keys, RBAC, budgets, policy, audit, and residency. TrueFoundry's AI Gateway provides these as the layer every request already passes through — it helps satisfy logging and oversight obligations, but it is a control, not a compliance certification.

Quarter-end at Northwind. Mei, the platform lead, got a question from the security and compliance team she couldn't answer: which teams had sent customer data to which model providers over the last quarter, and could she produce the records. She couldn't. Every service called the model providers through one shared API key, checked into a config years ago. There was no per-team attribution, no record of which requests carried customer data, no way to revoke one team's access without rotating the key for everyone, and no audit trail beyond the providers' own opaque billing. The LLM usage had grown from one prototype to a dozen production services, and the governance had not grown with it.

Nothing had gone wrong, exactly: no breach, no overspend anyone had caught. But "we can't answer the question" is its own finding, and it's the one that turns a routine audit into a project. This post describes the control plane that makes the question answerable before someone asks it.

What TrueFoundry's AI Gateway Provides Here

Everything in this post — virtual keys, RBAC, budgets, rate limits, audit logs, residency rules, and guardrails as enforced policy — is something TrueFoundry's AI Gateway expresses as configuration in one control plane. Access control defines who (users, teams, virtual accounts) may call which provider accounts and models; Personal Access Tokens and Virtual Account Tokens are how applications authenticate to the gateway instead of holding raw provider keys; rate-limit and budget configs apply per user, team, virtual account, model, or any custom metadata key; and guardrails — including Cedar and OPA as policy-as-code at the MCP-tool boundary — run as enforced rules at four lifecycle hooks.

Every request crosses the same path: authenticate, resolve the calling identity, evaluate access policy and per-key budgets, evaluate rate-limit rules in order (first match wins), run input guardrails, route to a provider, emit an audit-grade trace, then run output guardrails. That trace becomes the record an auditor needs: who called what, when, against which policy, with which guardrail outcomes. Request Traces and OpenTelemetry export let the trail land in your SIEM rather than a vendor dashboard you cannot query.

Fig 1: How a request flows through the gateway in production: validation → identity → rate/budget checks → load balancing → provider adapter → async logging. Source: TrueFoundry, Gateway Plane Architecture.

Fig 2: How identity, policy, and audit compose on a single request. Each stage is gateway configuration, and each decision is recorded against the same trace ID.

The application code is unchanged from any OpenAI-style call — the governance is in the bearer token and the metadata header, not in client logic. A Personal Access Token resolves to a user; a Virtual Account Token resolves to a non-human identity for production services. The X-TFY-METADATA header carries the structured fields (team, project, cost_center, environment) that policies, budgets, and audit logs match against:

Calling the gateway with an identity and audit metadata (Python, OpenAI-compatible)

from openai import OpenAI

client = OpenAI(
    base_url="https://<your-truefoundry-gateway-url>",   # your gateway endpoint
    api_key="<your-virtual-account-token>",              # VAT for production; PAT in dev
)

resp = client.chat.completions.create(
    model="openai-main/gpt-5.5",
    messages=[{"role": "user", "content": "Summarize this document."}],
    extra_headers={
        # Structured identity for audit, attribution, and policy matching.
        "X-TFY-METADATA": '{"team":"support-ai","project":"helpdesk","cost_center":"cc-203","environment":"production"}',
    },
)
print(resp.choices[0].message.content)
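
Because the metadata header is a JSON string, building it from a dict avoids hand-quoting mistakes. The following is a minimal sketch: `build_metadata_header` is a hypothetical helper (not part of any SDK), and the field names are simply the ones used in the example above.

```python
import json


def build_metadata_header(team: str, project: str, cost_center: str, environment: str) -> dict:
    """Return an extra_headers dict carrying X-TFY-METADATA as compact JSON."""
    metadata = {
        "team": team,
        "project": project,
        "cost_center": cost_center,
        "environment": environment,
    }
    # separators=(",", ":") produces compact JSON with no spaces.
    return {"X-TFY-METADATA": json.dumps(metadata, separators=(",", ":"))}


headers = build_metadata_header("support-ai", "helpdesk", "cc-203", "production")
print(headers["X-TFY-METADATA"])
```

The resulting dict can be passed as `extra_headers` in the call shown above, so every service tags its requests the same way.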

1. Why Governance Becomes Non-Negotiable in Production

A single prototype calling one model on one key needs no governance. A dozen services across several teams, calling several providers, on data of varying sensitivity, needs a control plane — because the failure modes are no longer hypothetical. A shared key means no usage attribution, so you can't tell finance which team is driving spend or tell security which team touched customer data. No spend limits means one runaway agent (recall the routing post's silent escalation) can burn the budget before anyone notices. No audit trail means you can't reconstruct what happened for an incident or an auditor. And no access control means shadow AI: teams wiring up models without anyone tracking it.

On top of the operational pressure sits regulatory pressure. The EU AI Act is phasing in obligations around record-keeping, transparency, and human oversight (section 7), and sector regimes — SOC 2, HIPAA, financial rules — have long expected access control and audit. The common thread is that they all assume you can answer Mei's question. AI governance is the work of being able to.

Fig 3: Teams hold virtual keys, not provider credentials. Every request crosses one control plane that resolves identity, applies RBAC and policy, enforces budget, runs guardrails, routes by residency, and writes an immutable audit record — exported to the SIEM — before reaching a provider.

2. Virtual Keys: Decoupling Access from Provider Credentials

The root cause of Mei's problem is the shared provider key. A virtual key fixes it: instead of handing teams the real provider credential, you issue each team or app its own gateway-managed key that maps to the underlying provider key at the gateway. The application authenticates with its virtual key; the gateway holds the real one.

That one indirection buys most of governance. Usage attributes to the virtual key, so spend and data access can be reported per team (the cost-attribution post builds on exactly this). Revocation is local — disable one team's virtual key without rotating the provider key for everyone else. And access is scopable — a virtual key can be limited to certain models, routes, or data classes. The provider credentials live in one place, the gateway, rather than scattered across a dozen service configs where they can't be tracked or rotated cleanly. In TrueFoundry's AI Gateway, virtual keys are the unit that ties usage, budgets, and access policy to a team or application rather than to an opaque shared credential.
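
The indirection can be sketched in a few lines. This is an illustrative in-memory model, not how any particular gateway stores keys: `KeyRegistry`, `VirtualKey`, the `vk-` prefix, and the sample credential are all assumptions made for the example.

```python
import secrets
from dataclasses import dataclass


@dataclass
class VirtualKey:
    team: str
    provider: str               # which provider credential this key maps to
    allowed_models: frozenset   # scope: the models this key may reach
    active: bool = True


class KeyRegistry:
    """Hypothetical registry; a real gateway would persist and encrypt this."""

    def __init__(self, provider_keys: dict) -> None:
        self._provider_keys = provider_keys   # real credentials live only here
        self._virtual = {}

    def issue(self, team: str, provider: str, allowed_models: set) -> str:
        token = "vk-" + secrets.token_hex(16)
        self._virtual[token] = VirtualKey(team, provider, frozenset(allowed_models))
        return token

    def revoke(self, token: str) -> None:
        # Local revocation: the provider key and other teams are untouched.
        self._virtual[token].active = False

    def resolve(self, token: str, model: str) -> tuple:
        """Return (team, provider_key), or raise PermissionError."""
        vk = self._virtual.get(token)
        if vk is None or not vk.active:
            raise PermissionError("unknown or revoked virtual key")
        if model not in vk.allowed_models:
            raise PermissionError(f"{vk.team} may not call {model}")
        return vk.team, self._provider_keys[vk.provider]


registry = KeyRegistry({"openai": "sk-real-provider-key"})
support_key = registry.issue("support-ai", "openai", {"gpt-5.5"})
research_key = registry.issue("research", "openai", {"gpt-5.5"})
registry.revoke(support_key)  # research keeps working; nothing is rotated
```

Attribution falls out of `resolve`: every successful call yields the team name alongside the credential, which is what usage and audit records key on.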

3. RBAC and Policy-as-Code: Who Can Call What

Virtual keys establish identity; RBAC and policy decide what that identity may do. The questions are concrete: which teams may use the expensive frontier model, which may reach a given provider, which may call which tools (in an MCP setting), and which data classes may be sent where. Encoding these as policy-as-code — with an engine like Cedar or OPA — makes the rules explicit, reviewable, and versioned, rather than living as tribal knowledge or scattered conditionals.

Illustrative access policy (conceptual; exact schema is gateway-specific)

# Only the research team may call the frontier model.
allow if principal.team == "research" and resource.model == "gpt-5.5"

# Customer-data requests must stay on an EU-resident model.
deny if request.data_class == "customer_pii" and resource.region != "eu"

# Finance team may not call external providers at all; self-hosted only.
allow if principal.team == "finance" and resource.kind == "self_hosted"

The value of policy-as-code is that it turns "who can call what" into something you can review in a pull request, test, and prove to an auditor — the same governance discipline applied to MCP tool access in TrueFoundry's MCP security work, where Cedar and OPA gate which tools an agent may invoke. The gateway is the enforcement point because it's the one place every request crosses before reaching a provider or a tool.
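
To make the "test it" part concrete, here is a small Python sketch of the three illustrative rules with default-deny and deny-overrides-allow semantics, which is also how Cedar combines `forbid` and `permit`. It is not Cedar or OPA syntax; the `Request` fields and `is_allowed` function are hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    team: str
    model: str
    region: str       # where the target model is hosted
    kind: str         # "external" or "self_hosted"
    data_class: str   # e.g. "public", "customer_pii"


def is_allowed(req: Request) -> bool:
    # Deny rules are checked first, so an explicit deny overrides any allow.
    if req.data_class == "customer_pii" and req.region != "eu":
        return False
    # Allow rules; anything that matches none of them is denied by default.
    if req.team == "research" and req.model == "gpt-5.5":
        return True
    if req.team == "finance" and req.kind == "self_hosted":
        return True
    return False


checks = {
    "research on frontier model": is_allowed(
        Request("research", "gpt-5.5", "us", "external", "public")),
    "research sending PII to a US model": is_allowed(
        Request("research", "gpt-5.5", "us", "external", "customer_pii")),
    "finance on an external provider": is_allowed(
        Request("finance", "gpt-5.5", "us", "external", "public")),
    "finance on a self-hosted EU model": is_allowed(
        Request("finance", "llama-local", "eu", "self_hosted", "customer_pii")),
}
for name, allowed in checks.items():
    print(f"{name}: {'allow' if allowed else 'deny'}")
```

A table of cases like `checks` is the kind of artifact that can sit next to the policy in a pull request and be shown to a reviewer.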

4. Budgets, Quotas, and Rate Limits as Governance

Budgets and rate limits are usually framed as protection against abuse or runaway cost. In a governance context they're also fairness and accountability controls: each team gets a defined share, overruns are visible, and no single team can quietly consume the org's entire model budget. The mechanics are hard and soft limits per team or virtual key, alerting as a limit approaches, and enforcement — typically a 429 — when it's exceeded, the same enforcement path described in the cost-attribution post.

The governance framing changes how you set them. A soft limit with alerting is an accountability tool — it tells a budget owner their team is trending over, without breaking production. A hard limit is a guardrail against the runaway case, like the silently-escalating cascade from the routing post. Rate limits per key also double as a fairness mechanism in a shared-capacity setting, keeping one team's batch job from starving another team's interactive traffic. Setting these at TrueFoundry's AI Gateway means they apply per virtual key, consistently, rather than being reimplemented per service.
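
The soft/hard distinction can be sketched as a single check. This is an assumption-laden illustration: the `Budget` class, the dollar-denominated limits, and the returned messages are hypothetical, and a real gateway would track spend in shared storage rather than in memory.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    soft_limit_usd: float
    hard_limit_usd: float
    spent_usd: float = 0.0


def check_budget(budget: Budget, cost_usd: float) -> tuple:
    """Return (http_status, note) for a request with an estimated cost."""
    projected = budget.spent_usd + cost_usd
    if projected > budget.hard_limit_usd:
        # Hard limit: reject, and do not record the spend.
        return 429, "hard limit exceeded: request rejected"
    budget.spent_usd = projected
    if projected > budget.soft_limit_usd:
        # Soft limit: serve the request but notify the budget owner.
        return 200, "soft limit exceeded: alert budget owner"
    return 200, "ok"


team_budget = Budget(soft_limit_usd=80.0, hard_limit_usd=100.0)
first = check_budget(team_budget, 50.0)
second = check_budget(team_budget, 40.0)
third = check_budget(team_budget, 20.0)
print(first, second, third, sep="\n")
```

The soft limit never breaks production; only the hard limit returns a 429, and a rejected request leaves the recorded spend unchanged.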

The schema is intentionally rule-based so a policy reads the way you'd describe it to an auditor: who (subjects: users, teams, virtual accounts, or any custom metadata key), what (models), how much (limit and unit — requests or tokens per minute, hour, or day), and scoped how (one shared limit, or a separate limit per user / per model / per metadata value via rate_limit_applies_per). Rules evaluate in order and the first match wins, so specific rules sit above broader fallbacks:

gateway-rate-limiting-config (real schema from TrueFoundry docs)

name: ratelimiting-config
type: gateway-rate-limiting-config
rules:
  # 1. Specific override: bob's runaway script gets a hard 1k requests/day on gpt4
  - id: "bob-gpt4-day-cap"
    when:
      subjects: ["user:bob@example.com"]
      models: ["openai-main/gpt4"]
    limit_to: 1000
    unit: requests_per_day

  # 2. Team-level token cap on a costly model
  - id: "backend-team-gpt4-tpm"
    when:
      subjects: ["team:backend"]
      models: ["openai-main/gpt4"]
    limit_to: 20000
    unit: tokens_per_minute

  # 3. Fairness floor: every user gets their own 1M-token/day budget on any model
  - id: "user-daily-token-cap"
    when: {}
    limit_to: 1000000
    unit: tokens_per_day
    rate_limit_applies_per: ["user"]

  # 4. Project-level cap based on metadata in the request header
  - id: "project-hourly-cap"
    when: {}
    limit_to: 50000
    unit: tokens_per_hour
    rate_limit_applies_per: ["metadata.project_id"]

Fig 4: The rate-limiting config UI. The same YAML rules are edited in AI Gateway → Configs (or applied via tfy apply from Git for PR review and audit history). Source: TrueFoundry, Rate Limiting.
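
First-match-wins evaluation is easy to reason about with a small model. The sketch below assumes that an empty `when` matches every request, that the conditions inside `when` are combined with AND, and that a list matches if any entry matches; these are assumptions for illustration, so check the gateway documentation for the exact semantics. `Rule` and `first_match` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    id: str
    subjects: list = field(default_factory=list)  # empty list = any subject
    models: list = field(default_factory=list)    # empty list = any model


def first_match(rules: list, subjects: set, model: str):
    """Return the id of the first rule whose conditions all match, else None."""
    for rule in rules:
        subject_ok = not rule.subjects or any(s in subjects for s in rule.subjects)
        model_ok = not rule.models or model in rule.models
        if subject_ok and model_ok:
            return rule.id
    return None


rules = [
    Rule("bob-gpt4-day-cap", ["user:bob@example.com"], ["openai-main/gpt4"]),
    Rule("backend-team-gpt4-tpm", ["team:backend"], ["openai-main/gpt4"]),
    Rule("user-daily-token-cap"),  # catch-all fallback
]

bob = first_match(rules, {"user:bob@example.com", "team:backend"}, "openai-main/gpt4")
alice = first_match(rules, {"user:alice@example.com", "team:backend"}, "openai-main/gpt4")
other = first_match(rules, {"user:alice@example.com"}, "openai-main/gpt-5.5")
print(bob, alice, other)
```

Under this reading a catch-all rule ends evaluation, so any rule placed after it would never be reached. That is why ordering specific rules above broad fallbacks matters, and why it is worth testing rule order before relying on a later rule.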

Under the hood, enforcement uses a sliding-window token-bucket algorithm with twelve 5-second buckets summed across a 60-second window — bursty enough that a brief spike doesn't lock a team out, strict enough that a runaway script trips it within seconds. Because the gateway resolves identity (PAT or VAT → user, team, virtual account) and reads X-TFY-METADATA on every call, the same rule expression covers the four governance audiences a single policy usually has to serve: a developer rate-limited per-key, a team rate-limited collectively, a model rate-limited globally, and a project rate-limited by its metadata tag.
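
One plausible reading of "twelve 5-second buckets summed across a 60-second window" is a bucketed sliding-window counter. The sketch below implements that reading only; it is not TrueFoundry's implementation, and the `SlidingWindowLimiter` class and its parameters are hypothetical.

```python
import time
from collections import deque
from typing import Optional


class SlidingWindowLimiter:
    """Counts usage in fixed-size buckets and sums those inside the window."""

    def __init__(self, limit: int, bucket_seconds: int = 5, buckets: int = 12) -> None:
        self.limit = limit
        self.bucket_seconds = bucket_seconds
        self.buckets = buckets
        self._counts = deque()  # items are (bucket_index, amount), oldest first

    def allow(self, amount: int = 1, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        current = int(now // self.bucket_seconds)
        # Drop buckets that have slid out of the window.
        oldest_valid = current - self.buckets + 1
        while self._counts and self._counts[0][0] < oldest_valid:
            self._counts.popleft()
        used = sum(count for _, count in self._counts)
        if used + amount > self.limit:
            return False
        if self._counts and self._counts[-1][0] == current:
            index, count = self._counts[-1]
            self._counts[-1] = (index, count + amount)
        else:
            self._counts.append((current, amount))
        return True


limiter = SlidingWindowLimiter(limit=3)
results = [limiter.allow(now=t) for t in (0, 1, 2, 3, 59, 60)]
print(results)
```

Passing `now` explicitly keeps the example deterministic. The first three calls fit the limit, the next two are rejected while the first bucket is still inside the window, and the call at 60 seconds succeeds because that bucket has aged out.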

The audit story is the other half. Every rate-limit decision, every guardrail outcome, every fallback hop, and every model call lands on the same request trace (x-tfy-trace-id in the response), and the same trace is what's exposed via Request Traces in the UI and exported through OpenTelemetry. That's what turns "we have logs" into "we have an audit trail we can hand to a regulator."

5. Compliance-Grade Audit Logs: What Makes a Log Hold Up

"We have logs" and "we have a compliance-grade audit trail" are different claims. A log that holds up to an audit has four properties. It is immutable — entries can't be edited or deleted after the fact. It is complete — every call records who (which team/virtual key), what (model, route, action), when, and which data category was involved. It is tamper-evident — entries carry something like a cryptographic trace ID or hash chain so alteration is detectable. And it is exportable — it streams to your SIEM rather than living only in a vendor dashboard you can't query.

Log metadata and events, never raw PII

The audit log must record enough to reconstruct what happened without itself becoming a sensitive-data store. Log the data category ("customer_pii detected and redacted"), the guardrail event, the model, the team, and the trace ID — not the raw prompt content or the PII itself. As the PII post notes, a redaction layer that logs the values it redacted has just created a second copy of the data it was protecting. Compliance-grade means complete on metadata and disciplined about content.

This is the layer that would have let Mei answer her question: an immutable, per-team, data-category-tagged record of every call, exported to the SIEM, queryable after the fact. It builds directly on the tracing from the OpenTelemetry post — the same spans that power observability, with the completeness and immutability properties an audit demands.
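
Tamper evidence is commonly achieved with a hash chain, where each entry includes the hash of the one before it. The following is a minimal sketch of that technique with metadata-only records; the field names, `append_record`, and `verify_chain` are assumptions for illustration, not a specific product's log format.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(body: dict) -> str:
    # sort_keys makes the serialization, and so the hash, deterministic.
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_record(log: list, record: dict) -> dict:
    """Append a metadata-only record linked to the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    body = {"prev_hash": prev_hash, **record}
    entry = {**body, "hash": _digest(body)}
    log.append(entry)
    return entry


def verify_chain(log: list) -> bool:
    """Recompute every hash and link; any edit or deletion breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "hash"}
        if entry["prev_hash"] != prev_hash or entry["hash"] != _digest(body):
            return False
        prev_hash = entry["hash"]
    return True


audit_log = []
append_record(audit_log, {
    "trace_id": "t-001", "team": "support-ai", "model": "openai-main/gpt-5.5",
    "data_category": "customer_pii", "guardrail_event": "pii_redacted",
    "timestamp": "2026-06-23T10:00:00Z",
})
append_record(audit_log, {
    "trace_id": "t-002", "team": "research", "model": "openai-main/gpt-5.5",
    "data_category": "public", "guardrail_event": "none",
    "timestamp": "2026-06-23T10:00:05Z",
})
intact = verify_chain(audit_log)
audit_log[0]["team"] = "research"      # simulate after-the-fact tampering
tampered_ok = verify_chain(audit_log)
print(intact, tampered_ok)
```

Note that a hash chain makes alteration detectable but does not prevent it; immutability still depends on append-only or write-once storage and on exporting entries to a system the writer cannot modify.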

6. Data Residency and Sovereignty

For regulated data, where a request is processed is itself a governance decision, and it's enforced as routing. Region-aware routing keeps requests carrying EU personal data on EU-resident endpoints. Provider restrictions block certain data classes from reaching certain providers entirely. And for the most sensitive data, self-hosted open-weight models keep processing inside your own infrastructure, so regulated content never leaves your boundary at all.

These decisions compose with the policy engine from section 3: a policy that says "customer PII may only be processed on an EU-resident model" is a residency rule expressed as access policy, and the gateway enforces it by routing — or refusing to route — accordingly. Because TrueFoundry's AI Gateway fronts hosted and self-hosted models through one interface, residency becomes a routing rule rather than a separate integration, and the choice of where a data class is processed is logged like every other decision.
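
Residency as a routing rule can be sketched as a filter over candidate endpoints. Everything named here is hypothetical: the endpoint names, the `RESIDENCY_RULES` table, and the `route` function illustrate the idea rather than any gateway's actual configuration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Endpoint:
    name: str
    region: str
    self_hosted: bool


# Hypothetical residency rules keyed by data class.
RESIDENCY_RULES = {
    "customer_pii": {"regions": {"eu"}, "self_hosted_only": False},
    "regulated": {"regions": {"eu"}, "self_hosted_only": True},
}


def route(data_class: str, endpoints: list) -> Optional[Endpoint]:
    """Return the first permitted endpoint in preference order, or None to refuse."""
    rule = RESIDENCY_RULES.get(data_class)
    for endpoint in endpoints:
        if rule is None:
            return endpoint  # unrestricted data class: take the preferred endpoint
        region_ok = endpoint.region in rule["regions"]
        hosting_ok = endpoint.self_hosted or not rule["self_hosted_only"]
        if region_ok and hosting_ok:
            return endpoint
    return None


endpoints = [
    Endpoint("openai-us", "us", False),
    Endpoint("azure-eu", "eu", False),
    Endpoint("llama-onprem-eu", "eu", True),
]
for data_class in ("public", "customer_pii", "regulated"):
    chosen = route(data_class, endpoints)
    print(data_class, "->", chosen.name if chosen else "refused")
```

Returning `None` is the "refuse to route" case: if no permitted endpoint is available, the request fails closed instead of falling back to a non-compliant provider.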

7. Mapping Gateway Controls to the EU AI Act (Carefully)

The EU AI Act is phasing in: prohibited practices and AI-literacy obligations have applied since February 2025, obligations for general-purpose AI models since August 2025, and the obligations for high-risk systems — risk management, data governance, record-keeping/logging, transparency, and human oversight — are landing around August 2026, though the timing for use-based (Annex III) high-risk systems is subject to the Commission's Digital Omnibus simplification proposal, which at the time of writing is in trilogue and would defer parts toward December 2027. The dates move; treat them as a live area and confirm against the current text.

Where a gateway helps is the operational substrate several of these obligations assume. Record-keeping and automatic logging map to the compliance-grade audit log (section 5). Human oversight maps to the human-in-the-loop checkpoints discussed in the prompt-injection post for high-risk actions. Data governance maps to residency routing and the PII guardrails. Transparency obligations are supported by being able to show what the system did and on what data.

A control is not a certification

The gateway helps you satisfy logging, oversight, and traceability requirements; it does not make an organization "compliant." Compliance depends on your system's risk classification, your role (provider vs. deployer), your documentation, conformity assessment where required, and obligations well outside any gateway. Anyone marketing a gateway as "EU AI Act compliance" is overstating it. Treat these controls as necessary operational plumbing, and consult the regulation and qualified counsel for what compliance actually requires for your use case.

8. Guardrails as Enforced Policy

Governance isn't only about access and budgets; it's also about what content is allowed to flow. The guardrails from the rest of this series — PII/PHI detection, prompt-injection defense, content moderation, secrets detection — are governance controls when they're enforced as policy rather than left to each application's discretion. "Customer PII is redacted on every route" and "outputs are screened before they reach a logging sink" are policy statements, and the place to enforce them uniformly is the gateway.

This closes the loop with the earlier posts. The PII post's four insertion points, the prompt-injection post's input/output guardrails, and this post's RBAC and audit are facets of one control plane: identity decides who, policy decides what they may call, guardrails decide what content may pass, budgets decide how much, residency decides where, and the audit log records all of it. TrueFoundry's guardrails run at the LLM and MCP hooks as enforced, configurable policy, which is what turns a set of good intentions into governance an auditor can verify.
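
As a small illustration of a guardrail that is enforced and audited without copying sensitive values, the sketch below redacts email addresses and emits an event that records only the rule, the action, and a count. The regular expression is deliberately simple and will not catch every address format; `redact_emails` and the event fields are hypothetical.

```python
import re

# Simplified pattern for illustration; production PII detection needs more than this.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(text: str) -> tuple:
    """Replace email addresses and return (redacted_text, audit_event)."""
    redacted, count = EMAIL_RE.subn("[REDACTED_EMAIL]", text)
    event = {
        "guardrail": "pii_email",
        "action": "redacted" if count else "passed",
        "matches": count,  # a count only, never the matched values
    }
    return redacted, event


clean_text, event = redact_emails("Contact mei@northwind.example or bob@example.com.")
print(clean_text)
print(event)
```

The event is what belongs on the audit trail; the original values appear in neither the redacted text nor the event.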

9. FAQs

What's the difference between a virtual key and just having separate provider keys per team?

Separate provider keys give you some attribution but leave the real credentials scattered across service configs, hard to rotate and easy to leak, and they don't carry policy. A virtual key is gateway-managed: it maps to the real provider key centrally, carries the team's RBAC, budget, and residency policy, and can be revoked or re-scoped without touching the provider credential or any other team. The indirection is the point.

Does an audit log mean logging the prompts?

No — and you usually shouldn't. Log the metadata needed to reconstruct what happened: team, model, route, action, data category, guardrail events, timestamp, and a trace ID. Logging raw prompt content recreates the sensitive-data problem the PII guardrails exist to prevent. Compliance-grade means complete on metadata and immutable, not a verbatim transcript of user data.

Will a gateway make us EU AI Act compliant?

No. It helps satisfy specific operational obligations — record-keeping, logging, supporting human oversight, data governance — but compliance depends on your risk classification, your role as provider or deployer, documentation, and conformity assessment that no gateway performs. Treat the gateway as necessary plumbing and consult the actual regulation and counsel for what your use case requires. The timelines are also still shifting under the Digital Omnibus proposal.

Isn't this just the cost-attribution post again?

They share the virtual-key and per-team-attribution foundation, but the lens differs. The cost post is about spend — who's driving it, budgets, chargeback. This post is about control and accountability — who may call what, what content may flow, what's recorded for an audit, where regulated data is processed. Same control plane, different obligations resting on it.

Where should governance live — app or gateway?

The gateway, because governance is by definition cross-cutting: one consistent set of access, budget, residency, and audit rules across every service, enforced at the one point every request crosses. Per-app governance drifts and leaves gaps — which is how a shared key survives for years. The application still owns domain-specific judgments, like which actions are high-risk enough to require human sign-off.

Mei's audit finding wasn't a breach; it was an inability to answer. Governance is the unglamorous work of making the question answerable in advance — who, what, how much, where, and on whose data — and the gateway is where those answers are enforced and recorded, one request at a time.

About TrueFoundry

TrueFoundry's AI Gateway is an enterprise-grade control plane that sits between your applications and 1,600+ models — across OpenAI, Anthropic, Google, AWS Bedrock, Azure OpenAI, and your own self-hosted models — behind a single OpenAI-compatible API. It turns the governance controls in this post into configuration rather than per-service code: Virtual Accounts for non-human production identity, Personal Access Tokens for development, RBAC scoped per provider account, rate-limit and budget rules expressed as YAML with per-user/per-team/per-model/per-metadata scopes, and policy-as-code guardrails (Cedar and OPA) at the MCP tool boundary.

Because the gateway already sits on every request and emits a complete trace for every call, it is also where compliance-grade audit becomes practical. Identity, policy decisions, guardrail outcomes, model choice, token counts, and cost all land on the same trace ID, which is visible in Request Traces in the UI, exportable through OpenTelemetry, and accessible by API for SIEM integration. The same gateway adds exact and semantic caching, fallbacks and retries, and observability dashboards. It deploys as SaaS or in your VPC, on-prem, or air-gapped, with SOC 2, HIPAA, and ITAR compliance, and is recognized in Gartner's Market Guide for AI Gateways. See the access control, rate limiting, and guardrails docs, or the AI Gateway overview to go deeper.

References

Northwind and Mei are illustrative. The governance patterns — virtual keys, RBAC, policy-as-code, audit logging, residency routing — are standard control-plane practice applied to LLM traffic. EU AI Act dates and obligations are summarized from the European Commission's published timeline and reporting as of May 2026 and are subject to ongoing amendment (notably the Digital Omnibus proposal in trilogue); this post is engineering guidance, not legal advice, and the gateway controls described help satisfy specific obligations rather than constituting compliance. Confirm current requirements against the regulation and qualified counsel for your use case.
