Loop Engineering, Continued: From One Governed Loop to an Operable Fleet

The first post in this series took loop engineering — designing the system that prompts your agents — and showed why a loop is a production system the instant it runs without you, and why a laptop is the wrong runtime to run it on. That post got one loop onto a governed runtime. This one is about what happens next: when there are many loops, running for months, in a world that is neither reliable nor friendly. Call it Day 2 of loop engineering.

Key Takeaways

  • The first post solved the single-loop problem: identity, bounds, gates, traces, a runtime that survives the laptop closing. Day 2 is everything that only shows up once the pattern succeeds — many loops, long-lived, under load and under attack.
  • Loops come in fleets, not singletons. They share state, contend for the same models and tools, trigger one another, and fan work out to sub-agents. The unit of operation becomes a portfolio — which needs an inventory, isolation, and per-principal accounting.
  • The world is unreliable. A loop running unattended at 3 a.m. meets provider outages, overloaded errors, and rate limits with nobody there to retry. Routing, fallbacks, retries, idempotency, and caching turn “the run died” into “the run rerouted.”
  • The world is hostile. A loop ingesting issues, emails, and web pages is a prompt-injection and confused-deputy machine — and the missing human turn is exactly what makes injection dangerous. The defense is least privilege at the agent and tool-grant boundary, plus guardrails on every model and tool call.
  • A loop is software that changes while it runs. Long-lived loops drift; their prompts and skills get edited; their “done” rots. They need versioning, staged rollout, evals as a promotion gate, instant rollback, and a way to retire what stopped paying.
  • Most of this maps to documented TrueFoundry primitives — Agent Harness for runtime, approvals, and traces; AI Gateway for routing, resilience, and economics; MCP Gateway for governed tool access; and the Skills and Prompt registries for versioned behavior. The fleet-operating model — ownership, lifecycle policy, eval gates, retirement, and per-loop accounting — is the recommended pattern you build on those primitives.

Six months after Noor’s loops moved off her laptop, Northwind has eleven registered triage and maintenance loops, and the problems have changed shape. Nothing is on fire the way the weekend token bill was on fire. Instead: two loops both “own” the release-notes draft and overwrite each other every Friday; the Tuesday dependency-bump loop quietly stopped doing anything useful three weeks ago because a model it depended on was deprecated and it kept passing its own checker anyway; a security reviewer noticed that the loop reading inbound bug reports will cheerfully follow instructions written in a bug report; and the loop that is cheapest per run, and most used, now spends more in total than the next four combined, and nobody can say whether it’s worth it. None of these are the failures the first post fixed. They’re the failures that only appear once the pattern works well enough to multiply. That’s the subject here.

The first post did the vertical translation — laptop primitive to governed primitive, one loop at a time. This post does the horizontal one: what an organization has to make true across loops and over their lifetime. The vocabulary still follows Addy Osmani’s June 2026 essay and the surrounding discussion (Steinberger, Cherny, Willison), paraphrased with credit; the enterprise half follows TrueFoundry’s Agent Harness and Gateway documentation.

The first post vs. this one

First post: one loop, made safe | This post: a fleet, operated over time
Translate each loop primitive to a governed equivalent | Coordinate primitives across loops that interact
Survive a crash; survive the laptop closing | Survive provider outages, rate limits, and flaky tools mid-run
Guardrails and credentials for one agent’s identity | The unattended attack surface of loops reading untrusted inputs
Versioned skills, mentioned | The full day-2 lifecycle: drift, staged rollout, eval gates, rollback, retirement

1 Loops Are a Fleet, Not a Singleton

Osmani’s anatomy describes a loop in the singular — one heartbeat, one set of skills, one maker and one checker. That is the right way to learn the pattern and the wrong way to run it, because loops do not stay solitary. They are useful, so they reproduce, and the moment there are several they start interacting in four ways the single-loop picture never has to account for.

Four ways loops interact once there is more than one

Interaction | What it looks like | What it really is
Share state | Two loops touch the same repo, queue, or release artifact | Shared mutable state: the oldest concurrency bug, wearing a model’s hat
Contend | Eleven loops plus developers draw on the same quotas and GPUs | A noisy-neighbor problem on finite provider and on-prem capacity
Fan out | One sub-agent per service, per teammate, per failing test, all in parallel | A small fleet inside each loop, before you have two loops
Trigger | Discovery files an issue → triage picks it up → fix drafts a PR → release ships | A pipeline of loops, with ordering, backpressure, and partial failure

The fan-out case is already first-class on the harness. TrueFoundry’s Agent Harness sub-agents let the root agent delegate focused subtasks to parallel sub-agents, each with its own isolated context, returning only concise results — a context-hygiene win and a fleet-management win, since each delegated run is a unit you can trace independently. Delegation is deliberately one level deep and sub-agents can’t message the user — guardrails against a fleet that spawns a fleet that spawns a fleet. (Worth being precise: sub-agents share the root agent’s tools and sandbox, so the isolation here is about context, not a separate set of tool grants — privilege boundaries belong at the agent and MCP level, which is the next section.)
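The one-level rule is easy to picture in code. What follows is a minimal sketch in plain Python, not the Agent Harness API (this post does not show that API). `Agent`, `delegate`, `DelegationError`, and `summarize_service` are hypothetical names chosen for illustration. It shows three of the properties described above: sub-agents run in parallel, each keeps its own context, and only the short result flows back to the root.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Callable


class DelegationError(RuntimeError):
    """Raised when a sub-agent tries to delegate further (hypothetical)."""


@dataclass
class Agent:
    name: str
    depth: int = 0  # 0 = root agent, 1 = sub-agent
    context: list[str] = field(default_factory=list)  # private to this agent

    def delegate(
        self, subtasks: list[str], worker: Callable[["Agent", str], str]
    ) -> list[str]:
        # Delegation is one level deep: a sub-agent may not spawn its own.
        if self.depth >= 1:
            raise DelegationError(f"{self.name} is a sub-agent and cannot delegate")
        subs = [Agent(name=f"{self.name}/sub-{i}", depth=1) for i in range(len(subtasks))]
        with ThreadPoolExecutor(max_workers=max(1, len(subs))) as pool:
            # map() preserves input order, so results line up with subtasks.
            results = list(pool.map(worker, subs, subtasks))
        # Only the concise results enter the root's context, not the
        # sub-agents' working notes.
        self.context.extend(results)
        return results


def summarize_service(agent: Agent, task: str) -> str:
    # Stand-in for real work: notes stay in the sub-agent's own context.
    agent.context.append(f"long working notes for {task}")
    return f"{task}: ok"
```

Calling `Agent("triage").delegate(["auth", "billing"], summarize_service)` returns one short string per subtask, while calling `delegate` on an agent created with `depth=1` raises `DelegationError`. The sketch deliberately does not model tool grants, since the text above notes that sub-agents share the root agent’s tools and sandbox.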

The operating pattern that answers all four interactions is to treat each loop as a registered agent definition rather than a script, so the fleet has an inventory. On the harness, each loop is an agent definition — a model, MCP servers, skills, instructions — that lives in a catalog, runs in the same gateway plane as all other model and tool traffic, and carries its own identity. That last word is what makes the fleet tractable: per-agent identity is what lets contention, cost, and access be reasoned about per loop instead of per laptop. Northwind’s release-notes collision gets fixed not with cleverness but with ownership — one loop owns the artifact, the registry says so, and “which loops touch this?” is a lookup, not a hunt.
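To make “a lookup, not a hunt” concrete, here is a minimal in-memory sketch of a loop inventory with single-owner artifacts. It is an illustration under stated assumptions, not TrueFoundry’s catalog: `LoopDefinition`, `LoopRegistry`, `OwnershipConflict`, and the `owns` / `reads` fields are all hypothetical names, and a real registry would be persistent and access-controlled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LoopDefinition:
    """Hypothetical shape of a registered loop: model, tools, skills, artifacts."""
    name: str
    model: str
    mcp_servers: tuple[str, ...] = ()
    skills: tuple[str, ...] = ()
    owns: frozenset[str] = frozenset()   # artifacts only this loop may write
    reads: frozenset[str] = frozenset()  # artifacts this loop only reads


class OwnershipConflict(ValueError):
    """Raised when a second loop claims an artifact that already has an owner."""


class LoopRegistry:
    def __init__(self) -> None:
        self._loops: dict[str, LoopDefinition] = {}
        self._owner_of: dict[str, str] = {}

    def register(self, loop: LoopDefinition) -> None:
        if loop.name in self._loops:
            raise ValueError(f"loop {loop.name!r} is already registered")
        # Check every claim before recording anything, so a rejected
        # registration leaves the registry unchanged.
        for artifact in loop.owns:
            current = self._owner_of.get(artifact)
            if current is not None:
                raise OwnershipConflict(f"{artifact!r} is already owned by {current!r}")
        self._loops[loop.name] = loop
        for artifact in loop.owns:
            self._owner_of[artifact] = loop.name

    def owner(self, artifact: str) -> Optional[str]:
        return self._owner_of.get(artifact)

    def touching(self, artifact: str) -> list[str]:
        """Answer 'which loops touch this?' as a lookup."""
        return sorted(
            name for name, loop in self._loops.items()
            if artifact in loop.owns or artifact in loop.reads
        )
```

With this shape, Northwind’s Friday collision surfaces at registration time: the second loop that tries to claim the release-notes draft gets an `OwnershipConflict` instead of silently overwriting the first.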

Fig 1. Treating each loop as a registered agent definition turns four messy interactions into four governed ones: ownership for shared state, per-loop budgets for contention, isolated-context sub-agents for fan-out, and traced pipelines for loop-to-loop triggers.

2 The Unreliable World: Keeping a Loop Alive Mid-Run

The defining fact about a loop is that nobody is watching it, and the corollary the single-loop discussion underplays is that nobody is watching it when the world breaks. A human at the keyboard absorbs a provider 503 without noticing — they just hit enter again. An unattended loop at 3 a.m. meets the same 503 and either dies, or worse, retries forever against a broken dependency and reprises Noor’s weekend bill. Reliability mid-run is not a nice-to-have for loops; it is the difference between a loop and a liability.
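The gap between “dies on the first 503” and “retries forever” is a bounded retry. The sketch below is a generic client-side illustration, not the gateway’s implementation. `ProviderError`, `call_with_retries`, and the choice of 429 and 503 as retryable codes are assumptions made for the example.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Assumption for this sketch: only these statuses are worth retrying.
RETRYABLE_STATUSES = {429, 503}


class ProviderError(Exception):
    """Hypothetical error carrying the provider's HTTP status."""

    def __init__(self, status: int) -> None:
        super().__init__(f"provider returned {status}")
        self.status = status


def call_with_retries(
    call: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(); retry retryable failures with exponential backoff and jitter.

    The attempt cap is the point: an unattended loop must give up and surface
    the error instead of hammering a broken dependency all weekend.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except ProviderError as err:
            out_of_attempts = attempt == max_attempts
            if err.status not in RETRYABLE_STATUSES or out_of_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            # Jitter keeps many loops from retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
    raise AssertionError("unreachable: the loop returns or raises")
```

The `sleep` parameter is injectable so the behavior can be tested without waiting. A call that fails twice with 503 and then succeeds returns normally after two sleeps. A call that keeps failing raises after `max_attempts`, and a 400 is raised immediately because retrying it would not help.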

The AI Gateway is where a loop borrows the reliability it can’t build for itself.

Reliability a loop borrows from the gateway

Mechanism | What it does for an unattended loop
Routing & fallback | Reference a Virtual Model, not one endpoint. On a 429 / 503 the gateway fails over to the next target by policy, even on an Anthropic overloaded_error before the first chunk.
Retries | Bounded, configurable retries (attempts, backoff, exact trigger status codes) at the gateway. A flaky call is a non-event, not a loop-killer.
Latency / priority routing | Route to the lowest-latency healthy target, or down a priority chain (cheap-and-local first, burst to premium only when saturated). This is also how a hungry loop stops starving interactive users.
Caching | Exact-match and semantic caching serve repeat triage prompts with zero model invocation on a hit, isolated per user/account with optional per-loop namespaces.
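Two of the mechanisms in the table, priority fallback and exact-match caching with per-loop namespaces, can be shown together in a toy model. This is a sketch of the idea only. `ToyVirtualModel`, `TargetError`, and the `namespace` argument are hypothetical, the failover statuses are an assumption, and the real gateway is configured by policy rather than written as application code like this.

```python
import hashlib
from typing import Callable, Optional

# Assumption for this sketch: these statuses trigger failover to the next target.
FAILOVER_STATUSES = {429, 503}


class TargetError(Exception):
    """Hypothetical error carrying a target's HTTP status."""

    def __init__(self, status: int) -> None:
        super().__init__(f"target returned {status}")
        self.status = status


class ToyVirtualModel:
    """One logical model name in front of an ordered list of targets."""

    def __init__(self, targets: list[tuple[str, Callable[[str], str]]]) -> None:
        self.targets = targets  # priority order: cheapest or most local first
        self.attempts = 0       # how many target calls were actually made
        self._cache: dict[tuple[str, str], str] = {}

    def complete(self, prompt: str, namespace: str) -> str:
        # Exact-match cache, keyed per namespace so loops do not share hits.
        digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        key = (namespace, digest)
        if key in self._cache:
            return self._cache[key]  # cache hit: no model invocation

        last_error: Optional[TargetError] = None
        for _name, call in self.targets:
            self.attempts += 1
            try:
                answer = call(prompt)
            except TargetError as err:
                if err.status not in FAILOVER_STATUSES:
                    raise  # e.g. a 400 will not improve on another target
                last_error = err
                continue   # fail over to the next target in the chain
            self._cache[key] = answer
            return answer
        raise RuntimeError("all targets unavailable") from last_error
```

If the first target raises `TargetError(503)` and the second answers, the caller sees only the answer. Repeating the same prompt in the same namespace is served from the cache without touching either target, while the same prompt in a different namespace goes through the chain again. Semantic caching and latency-based routing are left out of the sketch.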
