Governing Multi-Agent Systems: A2A Traffic at the Gateway

Single-agent systems call models and tools. Multi-agent systems add something new: agents calling agents. That east-west traffic — an orchestrator delegating to sub-agents, agents handing work to each other over the still-young Agent2Agent protocol — is where cost runs away, blast radius widens, and "which agent did what" becomes unanswerable. The protocols standardize how agents discover one another and exchange work, and they provide security hooks — but they don’t prescribe your enterprise identity model, policy graph, budget model, or observability and control plane. This post is that governance layer, and why it belongs at the gateway.

Key Takeaways

Multi-agent systems introduce a new traffic pattern — east-west, agents calling agents and agents calling tools — distinct from the north-south app-to-model traffic gateways were first built to manage.
The Agent2Agent (A2A) protocol standardizes agent-to-agent communication — capability discovery via agent cards, task and message exchange, and security hooks like declared auth schemes — but, like MCP, it defines the mechanism and leaves your enterprise identity model, policy graph, budgets, and control plane to you.
A shared service key is the wrong identity model. When one credential fronts many agents, you can't authorize per agent, attribute cost per agent, or reconstruct which agent took which action. Agents need their own identities.
The dominant failure mode is blast radius, not a single bad call: loops and runaway fan-out, where one agent calls another in a cycle, burn budget fast and silently. Depth limits, per-agent rate limits, and timeouts contain it.
Observability must span the whole run — an end-to-end trace across agent steps, model calls, and tool invocations — because per-request metrics lose the shape of a multi-agent workflow and hide where it went wrong.
Prompt injection propagates across agents: a poisoned input or tool result read by one agent can steer the agents it delegates to, so injection defense is an agent-to-agent concern, not only an input-boundary one.
The gateway is the agent control plane. TrueFoundry's Agent Gateway gives agents identity, per-agent RBAC and budgets, retries, timeouts and loop safeguards, and end-to-end tracing — unifying model, MCP-tool, and agent-to-agent governance in one place.

Tomás, a platform engineer, walked in to a cost alert and a mystery. Overnight, the company's new multi-agent research workflow had spent more than its entire previous month. An orchestrator agent delegated subtasks to a set of sub-agents; one sub-agent, hitting a transient error, retried by re-invoking the orchestrator, which delegated again — a loop that ran for hours. By morning the agents had called each other tens of thousands of times. Tomás wanted to know which agent started it and where the cycle formed, and found he couldn't: every agent authenticated with the same shared service key, the calls between agents weren't recorded as a graph, and there was no per-agent rate limit that would have tripped. The system had governance for calls to the model provider. It had almost none for calls between its own agents.

This is the gap multi-agent systems open. The moment agents start delegating to one another, you have a new internal network — one with no identity, no policy, and no trace by default. The agent frameworks help you build the workflow; they don't govern it. This post is how to give that internal network the same identity, limits, and observability you'd never run a microservice mesh without.

1. The New Traffic Pattern: Agents Calling Agents

For most of the gateway story so far, traffic has been north-south: an application calls a model, maybe through a tool. Multi-agent systems add east-west traffic — agents invoking other agents. An orchestrator delegates to specialists; a specialist consults another; results flow back up. The still-young Agent2Agent (A2A) protocol gives this a standard shape, with agents publishing capability descriptions (agent cards) that others discover, and exchanging tasks and messages over a common interface, much as MCP standardized how agents reach tools.

The analogy worth holding onto is the move from a monolith to microservices. The instant your agents talk to each other, you have a distributed system with the failure modes of one: cascading retries, cycles, fan-out amplification, and the loss of a single clear call stack. And like microservices, the answer isn't to wish the calls away but to put them behind a layer that gives every caller an identity, every call a policy, and every flow a trace. That layer, for agents, is the agent gateway.

Fig 1: An orchestrator delegating to sub-agents, with every edge — east-west agent-to-agent and north-south agent-to-tool — passing through the gateway, which attaches identity, checks per-agent policy and quota, and records a trace span. The dashed red edge is Tomás's loop, the case a depth limit and per-agent rate limit are meant to catch.

2. Agent Identity: Why a Shared Service Key Isn't Enough

Tomás's root problem was identity. When every agent authenticates with one shared service key, the system literally cannot tell its agents apart — which means it can't authorize them differently, can't attribute cost to them separately, and can't reconstruct which one acted.

The fix is to give each agent its own identity, issued and verified at the gateway, and to propagate it on every call the agent makes — to a model, to a tool, and to another agent. That identity is what every later control hangs off: authorization decisions, rate limits, cost attribution, and trace attribution all key on "which agent."

Each agent carries its own identity on every hop (illustrative)

# The gateway issues and verifies a per-agent identity, not a shared key.
ctx = AgentContext(
    agent_id="agent:research",          # this agent's own identity
    on_behalf_of="user:tomas",          # the human principal, preserved end-to-end
    run_id="run_4f9c",                  # correlates every hop of one workflow
    depth=2,                            # how deep in the delegation chain we are
)

# Propagated when this agent delegates to another agent or calls a tool:
gateway.invoke(target="agent:writer", context=ctx, payload=task)

Centralizing identity at TrueFoundry's Agent Gateway — which manages authentication, identity, and service-account management for agents at the gateway layer — means the identity is established once and trusted everywhere downstream, rather than each agent framework inventing its own scheme. Preserving the human principal (on whose behalf the workflow runs) alongside the agent identity is what keeps end-user authorization and audit intact even three delegations deep.
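A minimal sketch of how that context might be carried across hops, assuming the context is an immutable record from which each delegation derives a child. The `AgentContext` class and its `delegate_to` helper are hypothetical and written for illustration; they are not TrueFoundry's API.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class AgentContext:
    agent_id: str      # the agent acting on this hop
    on_behalf_of: str  # the human principal, unchanged across hops
    run_id: str        # correlates every hop of one workflow
    depth: int = 0     # delegation depth of this hop

    def delegate_to(self, target_agent_id: str) -> "AgentContext":
        """Return the context the callee runs under.

        Only the agent identity and the depth change; the human principal
        and the run id are copied through untouched.
        """
        return replace(self, agent_id=target_agent_id, depth=self.depth + 1)


# Hypothetical three-hop chain: orchestrator -> research -> writer.
root = AgentContext(
    agent_id="agent:orchestrator",
    on_behalf_of="user:tomas",
    run_id="run_4f9c",
)
research = root.delegate_to("agent:research")
writer = research.delegate_to("agent:writer")
```

Freezing the record means a sub-agent cannot quietly rewrite the principal or reset its own depth; a new context only comes from `delegate_to`, which in a real deployment would be the gateway's job rather than the agent's.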

3. A2A Authorization and Policy: Which Agent May Invoke Which

Identity enables authorization, and the questions are concrete in a multi-agent system. May the research agent invoke the writer agent, or only the orchestrator? May a sub-agent call external tools directly, or only through its parent? Which agents may spend against which budget? Expressing these as policy-as-code — the same Cedar or OPA approach from the governance and routing posts — turns the agent graph's allowed edges into something explicit and reviewable rather than implicit in code.

Per-agent authorization for east-west calls (illustrative policy)

# Default-deny: an agent may only invoke agents it is explicitly allowed to.
allow if principal.agent_id == "agent:orchestrator"
      and action == "invoke"
      and resource.agent_id in ["agent:research", "agent:writer", "agent:critic"]

# Sub-agents may NOT invoke the orchestrator — this edge is what created the loop.
deny  if principal.agent_id in ["agent:research", "agent:writer"]
      and resource.agent_id == "agent:orchestrator"

# Only the research agent may reach external search tools.
allow if principal.agent_id == "agent:research"
      and resource.kind == "mcp_tool"
      and resource.name == "web_search"

Notice the second rule: a policy that forbids sub-agents from re-invoking the orchestrator would have cut Tomás's loop at the first hop, independent of any rate limit. Authorization isn't only a security control here; constraining the shape of the agent graph is also how you prevent whole classes of runaway behavior. The gateway becomes the enforcement point when it's the one place every east-west call is routed through.
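To make the evaluation order concrete, here is a small sketch of a default-deny check over the same edges as the policy above. It assumes an explicit deny overrides any allow and that tool calls use the same "invoke" action; the `Request` type, the edge sets, and `is_allowed` are hypothetical names, not a Cedar or OPA API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    principal: str      # id of the calling agent
    action: str         # e.g. "invoke"
    resource_kind: str  # "agent" or "mcp_tool"
    resource_name: str  # id of the target agent or tool


# Edges of the agent graph that are explicitly permitted.
ALLOWED_EDGES = {
    ("agent:orchestrator", "agent", "agent:research"),
    ("agent:orchestrator", "agent", "agent:writer"),
    ("agent:orchestrator", "agent", "agent:critic"),
    ("agent:research", "mcp_tool", "web_search"),
}

# Edges that are forbidden even if an allow rule is added later.
DENIED_EDGES = {
    ("agent:research", "agent", "agent:orchestrator"),
    ("agent:writer", "agent", "agent:orchestrator"),
}


def is_allowed(req: Request) -> bool:
    """Default-deny: permit only listed edges, and let deny win over allow."""
    if req.action != "invoke":
        return False
    edge = (req.principal, req.resource_kind, req.resource_name)
    if edge in DENIED_EDGES:
        return False
    return edge in ALLOWED_EDGES
```

With these sets, the orchestrator may invoke the writer, the research agent may call `web_search`, and a research-to-orchestrator call is rejected before it can start a cycle.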

It helps to be precise about what the protocols decide and what they leave to you. Discovery and transport are standardized; the identity model, policy, budgets, and enforcement point are not.

4. Containing the Blast Radius: Loops, Runaway Fan-Out, and Rate Limits

Even with good authorization, multi-agent systems fail in ways single calls don't, because the unit of damage is the cascade. A retry that re-delegates can form a cycle; an agent that fans out to many children can amplify one request into thousands; a slow sub-agent can stall a whole workflow. These are the agent-scale version of the thundering-herd and silent-escalation problems familiar from routing and failover at the model layer.

Containment is layered. A delegation-depth limit caps how deep the chain can recurse, breaking cycles structurally. Per-agent rate limits cap how often any one agent can invoke others, so a loop trips a ceiling instead of running all night. Timeouts and stall detection stop an agent waiting forever on a child. And a global per-run budget caps the total spend of one workflow regardless of its shape. TrueFoundry's Agent Gateway documents the relevant primitives — retry policies, fallback paths, timeouts and safeguards against infinite loops or stalled agents, plus token- and cost-based quotas per agent, workflow, or environment. The exact configuration shape below is illustrative; the primitives are what the product page describes.

Blast-radius controls for a multi-agent run (illustrative gateway config)

run_limits:
  max_delegation_depth: 5          # breaks cycles structurally
  max_total_tokens: 500000         # whole-run budget, force-stop past this
  max_wall_clock_seconds: 600

per_agent:
  invoke_rate_limit: 60/min        # one agent can't call others without bound
  timeout_seconds: 45              # stall detection on a child call
  on_breach: halt_and_alert        # stop the run, page a human

The shift in mindset is to treat a multi-agent run as a bounded transaction with a budget and a depth, not an open-ended conversation. With those bounds enforced at the gateway, Tomás's overnight loop becomes a tripped limit and an alert at 2am instead of a five-figure invoice at 9am.
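The layered checks can be sketched in a few lines. This is an illustration of the idea under simple assumptions (a sliding one-minute window per agent, a single token counter per run), not how any particular gateway implements it; `RunGuard` and `LimitBreached` are hypothetical names.

```python
import time
from collections import defaultdict, deque


class LimitBreached(Exception):
    """Raised when a run exceeds one of its bounds."""


class RunGuard:
    """Tracks the limits of one multi-agent run."""

    def __init__(self, max_depth=5, max_total_tokens=500_000,
                 invokes_per_minute=60, clock=time.monotonic):
        self.max_depth = max_depth
        self.max_total_tokens = max_total_tokens
        self.invokes_per_minute = invokes_per_minute
        self.clock = clock  # injectable so the window can be tested
        self.tokens_used = 0
        self._recent = defaultdict(deque)  # agent_id -> recent invoke times

    def check_invoke(self, agent_id, depth):
        """Call before forwarding an agent-to-agent invocation."""
        if depth > self.max_depth:
            raise LimitBreached(f"depth {depth} exceeds {self.max_depth}")
        now = self.clock()
        window = self._recent[agent_id]
        # Drop invocations that fell out of the 60-second window.
        while window and now - window[0] >= 60:
            window.popleft()
        if len(window) >= self.invokes_per_minute:
            raise LimitBreached(
                f"{agent_id} exceeded {self.invokes_per_minute} invokes/min")
        window.append(now)

    def record_tokens(self, tokens):
        """Call after each model response; enforces the whole-run budget."""
        self.tokens_used += tokens
        if self.tokens_used > self.max_total_tokens:
            raise LimitBreached("run token budget exhausted")
```

A caller would catch `LimitBreached`, halt the run, and raise an alert, which corresponds to the `halt_and_alert` behavior in the config above.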

5. Observability: Tracing a Multi-Agent Run End-to-End

Per-request metrics — latency, tokens, errors on each individual call — are necessary but not sufficient for multi-agent systems, because they lose the thing you most need: the shape of the run. When something goes wrong three delegations deep, you need the whole tree — which agent called which, in what order, with what inputs and outputs, and where the cost accrued. That's an end-to-end trace spanning agent steps, model calls, and tool invocations, stitched together by the run identifier that every hop carries.

Fig 2: TrueFoundry's Agent Gateway observability view: end-to-end execution traces spanning agent steps, model calls, and tool interactions, with latency, retries, and token usage captured per step. This is the view that would have shown Tomás exactly where the cycle formed. Source: TrueFoundry Agent Gateway.

This builds directly on the tracing from our OpenTelemetry post: the same span model, with the agent as a first-class dimension and the run as the trace that ties spans together. TrueFoundry's Agent Gateway captures these end-to-end execution traces and lets you inspect the per-step logs to diagnose failures — turning "the agents spent too much last night" into "this edge formed a cycle at depth four," which is the difference between a mystery and a fix.
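As a rough illustration of what the run-level trace makes possible, the sketch below walks parent links between spans and reports the first delegation path on which an agent reappears among its own ancestors. The `Span` shape and `find_cycle` are hypothetical; real spans carry far more fields, and the code assumes every parent span of the run is present.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Span:
    span_id: str
    parent_id: Optional[str]  # None for the root span of the run
    agent_id: str             # the agent that executed this step


def find_cycle(spans: List[Span]) -> Optional[List[str]]:
    """Return the first root-to-leaf path segment where an agent re-enters.

    The result lists agent ids from the earlier occurrence down to the
    repeated one, or None if no agent appears among its own ancestors.
    """
    by_id = {s.span_id: s for s in spans}
    for span in spans:
        path = [span.agent_id]
        current = span
        while current.parent_id is not None:
            current = by_id[current.parent_id]
            path.append(current.agent_id)
            if current.agent_id == span.agent_id:
                return list(reversed(path))
    return None


# A hypothetical trace shaped like the overnight loop.
trace = [
    Span("s1", None, "agent:orchestrator"),
    Span("s2", "s1", "agent:research"),
    Span("s3", "s2", "agent:orchestrator"),  # sub-agent re-invoked the orchestrator
    Span("s4", "s3", "agent:writer"),
]
cycle = find_cycle(trace)
```

Here `cycle` is the orchestrator, research, orchestrator path. Some workflows may re-enter an agent on purpose, so a hit is a signal to inspect rather than proof of a fault.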

6. Cost Attribution by Agent Identity

Cost in a multi-agent system is meaningless without identity. "The workflow cost X" doesn't tell you whether the spend is the orchestrator's planning calls, one sub-agent's expensive model choice, or a loop. Attributing tokens and cost to the specific agent, workflow, and run — keyed on the identity from section 2 — is what makes the spend legible and the runaway diagnosable.

This is the cost-attribution post's per-team accounting extended to the agent as the unit. The Agent Gateway attributes token usage and cost to specific agents, workflows, teams, and environments, which does double duty: it answers the finance question (which agent drives spend) and it surfaces the operational anomaly (a single agent's cost spiking is often the first visible sign of a loop, well before the monthly bill). Pair it with the per-run budget from section 4 and cost becomes both observable and bounded.
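A small sketch of both uses, assuming the gateway emits one usage record per model call tagged with the calling agent. The record fields, model names, prices, and the fixed spike factor are all made up for illustration.

```python
from collections import defaultdict


def cost_by_agent(usage_records, price_per_1k_tokens):
    """Sum cost per agent from per-call usage records."""
    totals = defaultdict(float)
    for rec in usage_records:
        price = price_per_1k_tokens[rec["model"]]
        totals[rec["agent_id"]] += rec["tokens"] / 1000 * price
    return dict(totals)


def spiking_agents(current, baseline, factor=5.0):
    """Agents whose current cost exceeds `factor` times their baseline.

    An agent with no baseline is flagged as soon as it has any cost.
    """
    return sorted(
        agent for agent, cost in current.items()
        if cost > factor * baseline.get(agent, 0.0)
    )


# Hypothetical records and prices.
records = [
    {"agent_id": "agent:research", "model": "m-large", "tokens": 2000},
    {"agent_id": "agent:writer", "model": "m-small", "tokens": 1000},
    {"agent_id": "agent:research", "model": "m-large", "tokens": 4000},
]
prices = {"m-large": 0.5, "m-small": 0.25}

current = cost_by_agent(records, prices)
flagged = spiking_agents(
    current, baseline={"agent:research": 0.5, "agent:writer": 0.25})
```

The same totals serve the finance report and the anomaly check; only the comparison against a baseline differs.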

7. Security: How Prompt Injection Propagates Across Agents

Multi-agent systems give prompt injection a new way to travel. As covered in our prompt-injection post, an agent that reads untrusted content — a retrieved document, a tool result — can be steered by instructions hidden in it. In a multi-agent system, that compromised agent then talks to other agents, and its output becomes their input. An injection that lands on the research agent can propagate to the writer and critic agents downstream, because to them the research agent is a trusted peer, not an untrusted source.
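One way to picture the propagation is as a provenance flag that follows content from agent to agent. The sketch below is only an illustration of that idea, with hypothetical `Message` and `derive` names; it is not a described product feature and not a complete defense.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    sender: str
    text: str
    tainted: bool = False  # True if any upstream input was untrusted


def derive(sender: str, text: str, *inputs: Message) -> Message:
    """Build an agent's output; it inherits taint from anything the agent read."""
    return Message(sender, text, tainted=any(m.tainted for m in inputs))


# Untrusted content enters through a tool result.
web_page = Message("tool:web_search", "page text with hidden instructions", tainted=True)
task = Message("agent:orchestrator", "summarize recent findings")

# The research agent reads both, so its output is tainted too.
research_out = derive("agent:research", "summary", web_page, task)

# The writer only sees the research agent, a "trusted peer", yet the flag survives.
writer_out = derive("agent:writer", "draft", research_out)
```

With a flag like this carried on every hop, a downstream agent or the gateway can treat the research agent's output as untrusted data rather than as instructions from a peer.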
