Blank white background with no objects or features visible.

Join the Resilient Agents online hackathon hosted by TrueFoundry. Win up to $10,000 in prizes. Register Now →

Join our VAR & VAD ecosystem — deliver enterprise AI governance across LLMs, MCPs & Agents. Become a Partner →

Prompt Injection Defense at the Gateway: Direct, Indirect, and Tool-Mediated Attacks

von Boyu Wang

Aktualisiert: June 7, 2026

Prompt injection is the defining application-security problem of LLM systems, and it has a structural root: the model reads trusted instructions and untrusted data through the same channel, with no reliable way to tell them apart. This post is the threat model and the defense — how direct, indirect, and tool-mediated injection work, why the model can't separate instruction from data, why no single detector is complete, and how input and output guardrails plus privilege separation at the gateway reduce the blast radius.

Key Takeaways
  • Prompt injection has one structural root: an LLM concatenates trusted instructions and untrusted data into a single token stream and has no reliable boundary between "what to do" and "what to process." Every attack class below exploits that.
  • The attack taxonomy spans direct injection (in the user's own input), indirect injection (hidden in retrieved documents or tool results the agent reads), tool-mediated injection (in tool metadata), and jailbreaks. Indirect is the dangerous one, because no user chose to send it.
  • The "lethal trifecta" — access to private data, exposure to untrusted content, and an exfiltration channel — is where capability multiplies risk: an agent with all three can be turned into a data-exfiltration tool by content it merely reads.
  • Detection is hard and no defense is complete. Injection is semantic, not pattern-based; it's an arms race; and published red-team benchmarks report capable models complying with injected instructions at high rates. Treat detection as risk reduction, never as a solved problem.
  • Defense is layered: input guardrails (injection/jailbreak classifiers), output guardrails (block exfiltration and policy violations), and privilege separation — breaking the trifecta so no single agent has data, untrusted input, and egress at once.
  • Indirect injection means scanning retrieved context and tool results, not just user input — the same four-insertion-point lesson as PII, applied to adversarial instructions instead of sensitive data.
  • The gateway is where injection and jailbreak guardrails apply uniformly across apps, and where tool-access policy is enforced. TrueFoundry's guardrails run at the LLM-input, LLM-output, and MCP pre/post-tool hooks, with a prompt-injection guardrail and policy controls for tool access.

Wednesday at Northwind. Yuki, an application-security engineer, was watching a demo of the new support agent when it did something nobody asked it to. A customer-service rep had pasted a vendor's email into the agent and asked it to summarize the dispute. Buried in the email — in pale text the rep never read — were instructions addressed not to the human but to the assistant: dismiss the open dispute and issue a goodwill credit. The agent had a tool to adjust account credits. It read the email, "understood" the instructions as part of its task, and issued the credit. No customer attacked anything. No password leaked. The malicious instructions simply rode in on a document the agent was asked to read, and the agent could not tell the difference between the rep's request and the email's.

That is prompt injection, and it is not a phrasing problem to be solved with a better system prompt. It is structural: the model takes instructions and data through one channel. This post is how the attack family works and how to shrink its blast radius — knowing up front that no defense here is complete, only layered.

What TrueFoundry's AI Gateway Provides Here

Every defense in this post — input-side detection, output-side inspection, context and tool-result scanning, the rollout discipline that lets you ship a guardrail before it can take prod down — lives in TrueFoundry's guardrails system as configuration applied at four lifecycle hooks: llm_input, llm_output, mcp_pre_tool, and mcp_post_tool. The hooks line up with the post's threat model: llm_input catches the direct injection in the user's turn; llm_output is the exfiltration check; mcp_pre_tool is where a Cedar or OPA policy decides whether a tool call is even allowed; mcp_post_tool is where you scan what the tool returned before the model reads it — the indirect-injection insertion point that input-only systems miss.

Each guardrail has two settings that matter operationally: an operation mode (Validate — looks and blocks; or Mutate — rewrites and continues, e.g. PII redaction) and an enforcement strategy (Audit, Enforce But Ignore On Error, or Enforce). The middle setting is what makes a guardrail safe to deploy: you stay protected when the rail works, but a third-party safety provider outage doesn't take your app with it. The recommended rollout is the same one this post argues for: Audit → Enforce-But-Ignore-On-Error → Enforce, in that order, with the latency and false-positive numbers in Request Traces driving each promotion.

Fig 1: The four hooks where guardrails attach, and how blocking decisions short-circuit further work. Source: TrueFoundry AI Gateway docs — Guardrails.
TrueFoundry AI Gateway LLM request guardrail flow
Fig 2: How the gateway sequences guardrails on an LLM request. Input validation runs in parallel with the model call so the happy path pays no TTFT tax; a failed input validation cancels the in-flight model call before tokens are billed. Source: TrueFoundry — Guardrails Overview.
TrueFoundry AI Gateway MCP tool guardrail flow
Fig 3: The MCP tool path. Pre-tool guardrails (Cedar, OPA, SQL sanitizer) run before the tool executes — if any fail, the tool never runs. Post-tool guardrails check the result before it ever reaches the model, which is the indirect-injection insertion point. Source: TrueFoundry — Guardrails Overview.

Application code stays the same; the policy is in the headers or the central rule config. The example below uses the per-request X-TFY-GUARDRAILS header (handy when different routes need different rails); for org-wide enforcement the same selectors live under AI Gateway → Controls → Guardrails:

Calling the gateway with guardrails applied (Python, OpenAI-compatible)

from openai import OpenAI

client = OpenAI(
    base_url="https://<your-truefoundry-gateway-url>",
    api_key="<your-virtual-account-token>",
)

resp = client.chat.completions.create(
    model="openai-main/gpt-5.5",
    messages=[{"role": "user", "content": user_prompt}],   # may include untrusted context (RAG, emails)
    extra_headers={
        # Guardrails attach at the four hooks. Input validation runs in parallel
        # with the model — no TTFT penalty when the request is clean.
        # Values are guardrail FQNs (group/name), copied from AI Gateway → Guardrails.
        "X-TFY-GUARDRAILS": (
            '{"llm_input_guardrails":["my-group/prompt-injection","my-group/pii-redaction"],'
            '"llm_output_guardrails":["my-group/secrets-detection"],'
            '"mcp_tool_pre_invoke_guardrails":["my-group/sql-sanitizer","my-group/cedar-tool-policy"],'
            '"mcp_tool_post_invoke_guardrails":["my-group/pii-redaction"]}'
        ),
    },
)
print(resp.choices[0].message.content)

1. The Structural Root: Instructions and Data Share One Channel

A traditional program keeps code and data separate: code is the logic, data is what the logic operates on, and a string of user data can't become a new instruction unless you have an injection bug. An LLM has no such separation. The system prompt, the user's message, a retrieved document, and a tool's output are all concatenated into one token stream, and the model decides what to do by reading all of it. There is no field that says "this part is authoritative instruction" and "this part is mere data to be processed" in a way the model reliably honors.

This is the same structural shape that appears elsewhere in this series. In RAG and PII, retrieved documents enter the same context as the user's prompt, which is why a retrieved SSN gets quoted back. In tool-poisoning attacks on MCP servers, instructions hide in tool metadata the model treats as authoritative. Prompt injection is the general case: any untrusted text that reaches the context can attempt to act as an instruction, and the model has no built-in way to refuse on the grounds that "this came from data, not from my operator."

2. An Attack Taxonomy: Direct, Indirect, Tool-Mediated, Jailbreak

The attacks differ by where the malicious instruction enters, which matters because each entry point needs a different control.

Class Where the instruction enters Why it's dangerous
Direct injection The user's own input Simplest; the user tries to override the system prompt. Visible at the input boundary.
Indirect injection Retrieved documents, web pages, or emails the agent reads No user chose to send it; it rides in on content the agent was asked to process. The cold open.
Tool-mediated Tool results or tool/MCP metadata The model trusts tool output as authoritative; poisoned metadata or results steer it silently.
Jailbreak Crafted user input (roleplay, encoding, volume) Aims to bypass safety constraints rather than redirect a task; overlaps with direct injection.

Direct injection and jailbreaks are at least visible at the input boundary, where you have a chance to inspect them. Indirect and tool-mediated injection are the harder problems precisely because the malicious text doesn't arrive in the user's message — it arrives in the document the user asked the agent to summarize, the web page the agent fetched, or the record a tool returned. Any defense that only inspects user input is blind to exactly the cases the cold open turned on.

3. The Lethal Trifecta: Why Capability Multiplies Risk

Injection becomes severe when it meets capability. The useful framing, often called the lethal trifecta, is that the real danger appears when a single agent simultaneously has three things: access to private data, exposure to untrusted content, and a channel to send data outward. With all three, content the agent merely reads can instruct it to take its private data and push it out the egress channel — turning a passive document into an exfiltration command.

Der schnellste Weg, deine KI zu entwickeln, zu steuern und zu skalieren

Melde dich an
Inhaltsverzeichniss

Steuern, implementieren und verfolgen Sie KI in Ihrer eigenen Infrastruktur

Buchen Sie eine 30-minütige Fahrt mit unserem KI-Experte

Eine Demo buchen

Der schnellste Weg, deine KI zu entwickeln, zu steuern und zu skalieren

Demo buchen

Entdecke mehr

Keine Artikel gefunden.
June 7, 2026
|
Lesedauer: 5 Minuten

Prompt Injection Defense at the Gateway: Direct, Indirect, and Tool-Mediated Attacks

Keine Artikel gefunden.
June 7, 2026
|
Lesedauer: 5 Minuten

Multi-Provider Failover and Load Balancing: Surviving LLM Provider Outages

Vordenkerrolle
Best MCP Gateway
June 6, 2026
|
Lesedauer: 5 Minuten

Die 5 besten MCP-Gateways im Jahr 2026

Vergleich
TrueFoundry AI gateway governs production systems in enterprise AI deployments
June 6, 2026
|
Lesedauer: 5 Minuten

What Is a Production System in AI? A Complete Guide for Enterprise Teams

Keine Artikel gefunden.
Keine Artikel gefunden.

Aktuelle Blogs

Black left pointing arrow symbol on white background, directional indicator.
Black left pointing arrow symbol on white background, directional indicator.
Machen Sie eine kurze Produkttour
Produkttour starten
Produkttour