Spec-Driven Development for AI Agents: Governing Specs

Conçu pour la vitesse : latence d'environ 10 ms, même en cas de charge

Une méthode incroyablement rapide pour créer, suivre et déployer vos modèles !

Gère plus de 350 RPS sur un seul processeur virtuel, aucun réglage n'est nécessaire
Prêt pour la production avec un support complet pour les entreprises

Commencez à utiliser Truefoundry dès maintenant Parlez à l'expert

Spec-driven development is the rare buzzword with a paper trail. It has a founding talk (Sean Grove's "The New Code," AI Engineer World's Fair 2025), flagship tooling from two giants (GitHub's Spec Kit, AWS's Kiro), a radical wing (Tessl's spec-as-source), and a sober taxonomy from Thoughtworks that separates the levels of ambition. The core idea has held up well over a year of scrutiny because it captures something real: when agents write the code, the specification — not the prompt, not the code — is where human intent actually lives, so it should be written first, kept, and treated as the source of truth. What the movement hasn't fully worked out is what happens at fleet scale, when the specs themselves become the sprawl: dozens of spec files, competing AGENTS.md variants, and no record of which spec produced which deployed behavior. That's the part this post is about — because a spec is scaffold, and scaffold is something you can govern.

Key Takeaways

Spec-driven development (SDD) means writing a structured spec before AI writes code, and keeping the spec as the source of truth for both humans and agents — "documentation first," in Thoughtworks' Birgitta Böckeler's framing of the still-fluid definition.
The intellectual spark is often cited as OpenAI's Sean Grove ("The New Code," 2025): prompts are transient — teams keep generated code and delete the prompt — while a written, versioned spec aligns humans and models alike, with OpenAI's own Model Spec as the exemplar.
The tooling converged fast: GitHub Spec Kit (Specify → Plan → Tasks → Implement), AWS Kiro (Requirements → Design → Tasks), and Tessl's stronger spec-as-source stance; Böckeler's useful ladder runs spec-first → spec-anchored → spec-as-source.
The companion discipline is restraint: practitioner guidance treats roughly 150–200 standing instructions as a practical upper bound — a rule of thumb, not a measured guarantee — before reliability degrades, so agent files (AGENTS.md, CLAUDE.md) should stay lean, structured WHAT/WHY/HOW, and progressively disclosed — comprehensiveness is a failure mode.
SDD's unsolved problem is governance at scale: specs are behavior-determining artifacts, and at fleet scale they need what code got decades ago — versions, owners, review, and a link from spec version to deployed behavior.
In glossary terms, a spec is scaffold and a spec change is a policy change — so it belongs in versioned artifacts, behind an eval gate, with run metadata recording which spec version shaped each run.
On TrueFoundry's Agent Harness, this governance lane has a natural home: agent instructions live with the agent definition; reusable procedure specs can become versioned Skills Registry artifacts — pinned, RBAC-governed, rolled back, audited, and mounted just-in-time; sensitive tool use can be gated; and runs are traced through the harness. For feature specs and plans, the recommended pattern is to record the spec version in run metadata and gate promotions through evaluation.

Bastien, an engineering manager, was an early and sincere convert. His team adopted spec-driven development in the autumn, and it delivered exactly what the talks promised: features that used to die in prompt ping-pong shipped from a one-page spec, review conversations moved up a level, and the safe-delegation window stretched from twenty-minute tasks to afternoon-sized ones. Six months later he ran an audit and found what success looks like when nobody governs it: forty-one spec files across nine repos, four of them describing the same service differently; nine AGENTS.md variants, three auto-generated and bloated past anything a model reliably follows; specs that had drifted from the code they once produced, with no record of which version of which spec shaped the agent run that shipped any given behavior. The practice was right. The artifacts were feral. Bastien's question to his platform team was the one this post answers: "we made specs the source of truth — so why is the truth unversioned?"

To answer it, start where the term started — the origin story contains the exact argument that makes governance the logical next step.

1. Where the Buzzword Comes From: Grove, the Shredded Source, and the Model Spec

One widely cited origin point is a conference talk. At the AI Engineer World's Fair in San Francisco in 2025, Sean Grove — then working on alignment at OpenAI — delivered "The New Code," which many practitioners cite as the movement's crispest articulation, and gave it a central image: developers prompt a model, keep the generated code, and throw the prompt away, which Grove likened to shredding the source and version-controlling the binary. His argument ran deeper than workflow hygiene. Code, he argued, was always only part of a programmer's value; the larger part is structured communication of intent — and a written specification is the artifact that aligns humans on what should be built before it ever aligns a model. Grove's exhibit was OpenAI's own Model Spec: a versioned, living Markdown document expressing the company's intended model behavior, where clauses carry unique IDs and example prompts that function as unit tests. Intent, written once, versioned, testable — and consumed by both people and machines.

The industry's reaction gave the idea its name and its tools. GitHub open-sourced Spec Kit in September 2025, framing the shift bluntly — in their words, the lingua franca of development moves up a level, and code becomes the last mile. AWS shipped Kiro, an IDE that centers its spec workflow on Requirements, Design, and Tasks before execution. Tessl staked out the radical end: specs as the primary maintained artifact, code regenerated from them. And Andrej Karpathy — who had coined "vibe coding" in February 2025 — described the next phase a year later as agentic engineering: orchestrating agents against detailed specifications with human oversight. The vibe era named the problem; SDD named the correction.

2. What SDD Actually Is — and Böckeler's Ladder

Because the term spread faster than its definition, the most useful map is Birgitta Böckeler's October 2025 analysis for Thoughtworks (published on martinfowler.com), which examined Spec Kit, Kiro, and Tessl and found not one practice but a ladder of ambition. Spec-first: write a well-thought-out spec, use it to drive the AI-assisted task, done — the spec may not outlive the feature. Spec-anchored: keep the spec after the task completes and evolve the feature through it — the spec becomes the maintained description of behavior. Spec-as-source: the Tessl position — the spec is the artifact you maintain; code is its compiled output. Most teams practicing SDD today are on the first two rungs, and the workflow is recognizably the same everywhere: specify, plan, decompose into tasks, implement — with the agent consuming the spec at every stage.

Why it works is also no longer speculative. Structured specs constrain the search space that makes unconstrained generation chaotic; acceptance criteria give the agent (and its verifier) something checkable; and decomposition turns a vague feature into agent-sized tasks. Practitioner reports describe the safe-delegation window expanding from ten-to-twenty-minute tasks to multi-hour feature delivery with consistent quality — field evidence from teams adopting the practice rather than controlled studies, and worth reading as such. The honest boundary came from the same Thoughtworks analysis that mapped the ladder: SDD suits larger features and greenfield work; for a small bug fix, the spec overhead isn't worth it. A discipline that knows where it doesn't apply is usually one worth taking seriously.

3. The Companion Discipline: Restraint and the Instruction Budget

SDD arrived with a sibling practice that matters just as much for agents: the standing instruction file — AGENTS.md, CLAUDE.md, and their cousins — that rides along in every agent session as the project's persistent voice. And the hard-won 2026 consensus about these files is a restraint discourse. Practitioner guidance converged on a working number: frontier models are often treated as reliably following on the order of 150–200 standing instructions before compliance starts to degrade — a rule of thumb, not a model guarantee — so the auto-generated, everything-included agent file is a failure mode wearing a diligence costume. The recommendations that emerged read like an editor's checklist: keep the file lean (HumanLayer's commonly cited recommendation is under ~300 lines); structure it WHAT/WHY/HOW — stack and structure, purpose, build commands and conventions; include explicit do/don't lists and permission boundaries; don't document file paths (they rot), don't use the file as a linter (use linters), don't embed task-specific instructions that stale by Friday. For larger codebases: progressive disclosure — a brief root file, deeper files loaded where relevant.

Readers of this series will recognize both halves of that advice. The instruction budget is a context-engineering fact wearing a documentation hat — standing instructions are preloaded context, and preloaded bulk degrades attention exactly as our JIT context post describes; progressive disclosure is just-in-time loading applied to specs. And "the agent file is the project's persistent voice" is a scaffold claim — which is the thread that leads to the governance gap, because Bastien's nine bloated variants weren't a writing problem. They were an unowned-artifact problem.

4. The Fleet-Scale Gap: Specs Are Behavior, and Behavior Sprawls

Here is the structural observation the SDD discourse mostly hasn't reached yet. Everything the movement says about specs — source of truth, consumed by agents, aligning humans — is a claim that specs determine behavior. In the vocabulary of the agent glossary, a spec is scaffold: content the model works from. A change to scaffold is a policy change — behaviorally the same category of event as a model swap. SDD, practiced sincerely, multiplies exactly these artifacts: feature specs, plan files, task lists, AGENTS.md variants, acceptance criteria. Each one steers agents. Almost none of them, in a typical adoption, has a version history tied to deployments, an owner, a review path, or an answer to "which runs did this file shape?"

That's Bastien's audit, named. Forty-one specs and nine agent files aren't a tidiness problem — they're forty-one unversioned behavior inputs and nine competing standing policies. The known SDD pain points all trace to the same root: spec/code drift (early SDD adopters commonly report that specs and code fall out of sync unless the spec is updated as part of the workflow) is unmonitored divergence between the declared truth and the deployed one; duplicate specs are forked policies; bloated auto-generated agent files are unreviewed policy accretion. The movement made specs important; it hasn't yet made them governed. Software solved this exact problem for code with version control, code review, CI gates, and deploy records — and the fix for specs is the same machinery, pointed at the new artifact class.

‍

TrueFoundry AI Gateway offre une latence d'environ 3 à 4 ms, gère plus de 350 RPS sur 1 processeur virtuel, évolue horizontalement facilement et est prête pour la production, tandis que LiteLM souffre d'une latence élevée, peine à dépasser un RPS modéré, ne dispose pas d'une mise à l'échelle intégrée et convient parfaitement aux charges de travail légères ou aux prototypes.

Conçu pour la vitesse : latence d'environ 10 ms, même en cas de charge

Planifiez votre démo dès maintenant