Spec-Driven Development for AI Agents: Governing Specs

Built for Speed: ~10ms Latency, Even Under Load

Blazingly fast way to build, track and deploy your models!

Handles 350+ RPS on just 1 vCPU — no tuning needed
Production-ready with full enterprise support

Get Started with Truefoundry Now Talk to the Expert

Spec-driven development is the rare buzzword with a paper trail. It has a founding talk (Sean Grove's "The New Code," AI Engineer World's Fair 2025), flagship tooling from two giants (GitHub's Spec Kit, AWS's Kiro), a radical wing (Tessl's spec-as-source), and a sober taxonomy from Thoughtworks that separates the levels of ambition. The core idea has held up well over a year of scrutiny because it captures something real: when agents write the code, the specification — not the prompt, not the code — is where human intent actually lives, so it should be written first, kept, and treated as the source of truth. What the movement hasn't fully worked out is what happens at fleet scale, when the specs themselves become the sprawl: dozens of spec files, competing AGENTS.md variants, and no record of which spec produced which deployed behavior. That's the part this post is about — because a spec is scaffold, and scaffold is something you can govern.

Key Takeaways

Spec-driven development (SDD) means writing a structured spec before AI writes code, and keeping the spec as the source of truth for both humans and agents — "documentation first," in Thoughtworks' Birgitta Böckeler's framing of the still-fluid definition.
The intellectual spark is often cited as OpenAI's Sean Grove ("The New Code," 2025): prompts are transient — teams keep generated code and delete the prompt — while a written, versioned spec aligns humans and models alike, with OpenAI's own Model Spec as the exemplar.
The tooling converged fast: GitHub Spec Kit (Specify → Plan → Tasks → Implement), AWS Kiro (Requirements → Design → Tasks), and Tessl's stronger spec-as-source stance; Böckeler's useful ladder runs spec-first → spec-anchored → spec-as-source.
The companion discipline is restraint: practitioner guidance treats roughly 150–200 standing instructions as a practical upper bound — a rule of thumb, not a measured guarantee — before reliability degrades, so agent files (AGENTS.md, CLAUDE.md) should stay lean, structured WHAT/WHY/HOW, and progressively disclosed — comprehensiveness is a failure mode.
SDD's unsolved problem is governance at scale: specs are behavior-determining artifacts, and at fleet scale they need what code got decades ago — versions, owners, review, and a link from spec version to deployed behavior.
In glossary terms, a spec is scaffold and a spec change is a policy change — so it belongs in versioned artifacts, behind an eval gate, with run metadata recording which spec version shaped each run.
On TrueFoundry's Agent Harness, this governance lane has a natural home: agent instructions live with the agent definition; reusable procedure specs can become versioned Skills Registry artifacts — pinned, RBAC-governed, rolled back, audited, and mounted just-in-time; sensitive tool use can be gated; and runs are traced through the harness. For feature specs and plans, the recommended pattern is to record the spec version in run metadata and gate promotions through evaluation.

Bastien, an engineering manager, was an early and sincere convert. His team adopted spec-driven development in the autumn, and it delivered exactly what the talks promised: features that used to die in prompt ping-pong shipped from a one-page spec, review conversations moved up a level, and the safe-delegation window stretched from twenty-minute tasks to afternoon-sized ones. Six months later he ran an audit and found what success looks like when nobody governs it: forty-one spec files across nine repos, four of them describing the same service differently; nine AGENTS.md variants, three auto-generated and bloated past anything a model reliably follows; specs that had drifted from the code they once produced, with no record of which version of which spec shaped the agent run that shipped any given behavior. The practice was right. The artifacts were feral. Bastien's question to his platform team was the one this post answers: "we made specs the source of truth — so why is the truth unversioned?"

To answer it, start where the term started — the origin story contains the exact argument that makes governance the logical next step.

1. Where the Buzzword Comes From: Grove, the Shredded Source, and the Model Spec

One widely cited origin point is a conference talk. At the AI Engineer World's Fair in San Francisco in 2025, Sean Grove — then working on alignment at OpenAI — delivered "The New Code," which many practitioners cite as the movement's crispest articulation, and gave it a central image: developers prompt a model, keep the generated code, and throw the prompt away, which Grove likened to shredding the source and version-controlling the binary. His argument ran deeper than workflow hygiene. Code, he argued, was always only part of a programmer's value; the larger part is structured communication of intent — and a written specification is the artifact that aligns humans on what should be built before it ever aligns a model. Grove's exhibit was OpenAI's own Model Spec: a versioned, living Markdown document expressing the company's intended model behavior, where clauses carry unique IDs and example prompts that function as unit tests. Intent, written once, versioned, testable — and consumed by both people and machines.

The industry's reaction gave the idea its name and its tools. GitHub open-sourced Spec Kit in September 2025, framing the shift bluntly — in their words, the lingua franca of development moves up a level, and code becomes the last mile. AWS shipped Kiro, an IDE that centers its spec workflow on Requirements, Design, and Tasks before execution. Tessl staked out the radical end: specs as the primary maintained artifact, code regenerated from them. And Andrej Karpathy — who had coined "vibe coding" in February 2025 — described the next phase a year later as agentic engineering: orchestrating agents against detailed specifications with human oversight. The vibe era named the problem; SDD named the correction.

2. What SDD Actually Is — and Böckeler's Ladder

Because the term spread faster than its definition, the most useful map is Birgitta Böckeler's October 2025 analysis for Thoughtworks (published on martinfowler.com), which examined Spec Kit, Kiro, and Tessl and found not one practice but a ladder of ambition. Spec-first: write a well-thought-out spec, use it to drive the AI-assisted task, done — the spec may not outlive the feature. Spec-anchored: keep the spec after the task completes and evolve the feature through it — the spec becomes the maintained description of behavior. Spec-as-source: the Tessl position — the spec is the artifact you maintain; code is its compiled output. Most teams practicing SDD today are on the first two rungs, and the workflow is recognizably the same everywhere: specify, plan, decompose into tasks, implement — with the agent consuming the spec at every stage.

Why it works is also no longer speculative. Structured specs constrain the search space that makes unconstrained generation chaotic; acceptance criteria give the agent (and its verifier) something checkable; and decomposition turns a vague feature into agent-sized tasks. Practitioner reports describe the safe-delegation window expanding from ten-to-twenty-minute tasks to multi-hour feature delivery with consistent quality — field evidence from teams adopting the practice rather than controlled studies, and worth reading as such. The honest boundary came from the same Thoughtworks analysis that mapped the ladder: SDD suits larger features and greenfield work; for a small bug fix, the spec overhead isn't worth it. A discipline that knows where it doesn't apply is usually one worth taking seriously.

3. The Companion Discipline: Restraint and the Instruction Budget

SDD arrived with a sibling practice that matters just as much for agents: the standing instruction file — AGENTS.md, CLAUDE.md, and their cousins — that rides along in every agent session as the project's persistent voice. And the hard-won 2026 consensus about these files is a restraint discourse. Practitioner guidance converged on a working number: frontier models are often treated as reliably following on the order of 150–200 standing instructions before compliance starts to degrade — a rule of thumb, not a model guarantee — so the auto-generated, everything-included agent file is a failure mode wearing a diligence costume. The recommendations that emerged read like an editor's checklist: keep the file lean (HumanLayer's commonly cited recommendation is under ~300 lines); structure it WHAT/WHY/HOW — stack and structure, purpose, build commands and conventions; include explicit do/don't lists and permission boundaries; don't document file paths (they rot), don't use the file as a linter (use linters), don't embed task-specific instructions that stale by Friday. For larger codebases: progressive disclosure — a brief root file, deeper files loaded where relevant.

Readers of this series will recognize both halves of that advice. The instruction budget is a context-engineering fact wearing a documentation hat — standing instructions are preloaded context, and preloaded bulk degrades attention exactly as our JIT context post describes; progressive disclosure is just-in-time loading applied to specs. And "the agent file is the project's persistent voice" is a scaffold claim — which is the thread that leads to the governance gap, because Bastien's nine bloated variants weren't a writing problem. They were an unowned-artifact problem.

4. The Fleet-Scale Gap: Specs Are Behavior, and Behavior Sprawls

Here is the structural observation the SDD discourse mostly hasn't reached yet. Everything the movement says about specs — source of truth, consumed by agents, aligning humans — is a claim that specs determine behavior. In the vocabulary of the agent glossary, a spec is scaffold: content the model works from. A change to scaffold is a policy change — behaviorally the same category of event as a model swap. SDD, practiced sincerely, multiplies exactly these artifacts: feature specs, plan files, task lists, AGENTS.md variants, acceptance criteria. Each one steers agents. Almost none of them, in a typical adoption, has a version history tied to deployments, an owner, a review path, or an answer to "which runs did this file shape?"

That's Bastien's audit, named. Forty-one specs and nine agent files aren't a tidiness problem — they're forty-one unversioned behavior inputs and nine competing standing policies. The known SDD pain points all trace to the same root: spec/code drift (early SDD adopters commonly report that specs and code fall out of sync unless the spec is updated as part of the workflow) is unmonitored divergence between the declared truth and the deployed one; duplicate specs are forked policies; bloated auto-generated agent files are unreviewed policy accretion. The movement made specs important; it hasn't yet made them governed. Software solved this exact problem for code with version control, code review, CI gates, and deploy records — and the fix for specs is the same machinery, pointed at the new artifact class.

Fig 1: The pipeline every SDD tool converged on (Spec Kit's Specify→Plan→Tasks→Implement; Kiro's Requirements→Design→Tasks), and the governance lane the practice needs once specs steer a fleet rather than a feature.

5. Specs as Governed Scaffold: the Mapping

Run the SDD artifact list through a platform that already treats scaffold as governed content, and each item lands somewhere specific. Standing instructions — the AGENTS.md role — belong with the agent definition on TrueFoundry's Agent Harness: they should be stored as versioned content, reviewed as diffs, and referenced by the definition rather than edited ad hoc in a prompt box — so the restraint discipline has a reviewer, not just an author. Reusable procedure specs — the runbooks, the "how we do migrations here," the multi-step acceptance recipes — are skills: promoted to the Skills Registry as versioned SKILL.md artifacts with RBAC, pinned by agents as skill@version, and mounted just-in-time — which implements progressive disclosure literally: the catalog's one-line description rides along; the full procedure loads on activation. One spec, forty agents, zero forks.

Feature specs and plans can stay in the repo where the work lives, but the agent run should record their immutable version or commit SHA in its trace metadata — and the harness already records each run's steps, so the spec version travels with the run rather than being reconstructed later. Do that, and "which spec version shaped this run" becomes a query instead of archaeology. And the whole assembly is the policy: the agent definition — model by name, instructions at a version, skills pinned, tool grants — is one declarative, diffable artifact, which is exactly the unit our policy post argues should be versioned, gated, and rolled back together. The pleasant surprise of this mapping is that nothing about it fights SDD; it completes the founding argument. Grove's case was that intent deserves to be a versioned, testable, first-class artifact. The Model Spec did that for one organization's model behavior. The governance lane does it for every spec your fleet consumes.

An SDD-shaped agent definition: specs as pinned, versioned scaffold (illustrative)

agent:
  name: payments-feature-agent
  model: claude-sonnet-4-6
  instructions: ./AGENTS.md@7c91f       # lean, WHAT/WHY/HOW, ~180 instructions — reviewed
  skills:                                # reusable procedure specs, from the registry
    - migration-runbook@v4               # one spec, every agent; mounted just-in-time
    - acceptance-review@v2
  context:
    feature_spec: specs/payments-refunds.md@3e02a   # the Specify artifact, at a version
  mcp_servers: [github, ci]
# Recommended pattern: a spec edit should bump a version → run an eval gate →
# record the artifact versions on the run trace. Bastien's audit becomes a query.

TrueFoundry Agent Harness builder: instructions, skills, and MCP servers attached to one agent definition — Fig 2: Specs as governed configuration: instructions and skills attached by name and version to one declarative agent definition — the same artifact whether authored in the no-code builder, SDK, or API. Source: *TrueFoundry Agent Harness docs*.

6. Spec Changes Through the Gate

The operational consequence deserves its own section because it's where SDD's biggest unhandled risk lives. If the spec is the source of truth and agents implement against it, then a spec edit is the highest-leverage change in the system — it redirects every downstream agent decision — and in most adoptions it's also the least-ceremonied: a Markdown edit, merged on a Friday, consumed by Monday's runs. The fix is the policy discipline applied without exception: a spec or agent-file change should bump a version, the version change should trigger the evaluation gate (the same suite a model swap would face — behaviorally it's the same kind of event), and promotion should be staged. For shared skills, a version bump rolls out per-agent and deliberately, not registry-wide and silently. None of this slows the cheap edit much; all of it makes the expensive edit visible before production does.

The same machinery dissolves the drift problem honestly. Spec/code drift is unavoidable in spec-first practice and managed in spec-anchored practice — but only if divergence is detectable, which requires knowing which spec version the deployed behavior came from. When runs record the spec version they consumed, drift becomes a diff between the current spec and the version that last passed the gate, surfaced on a schedule — rather than a discovery made during an incident, which is where Bastien found his.

7. Honest Limits: Where SDD Strains

The skeptics deserve their section, because their strongest points shape the governance design. The waterfall critique — SDD as requirement documents reborn — lands wherever specs are written once, comprehensively, up front; the practice survives it by keeping specs lean and iterative rather than encyclopedic. The "code is the ultimate executable truth" objection — when generated code is wrong, you fix code, not prose — is true at the margin, and is exactly why spec-as-source remains the minority position; spec-anchored, with drift detection, is the defensible middle. And the overhead point stands as Thoughtworks' own analysis states it: small fixes don't warrant the ceremony, so the governance lane needs a proportionate fast path — record the version, run the cheap automated gate, skip the committee.

One more limit is internal to this post's argument: versioning specs doesn't make them good. A precisely governed bad spec produces precisely traceable bad behavior. The instruction budget, the WHAT/WHY/HOW structure, the checkable acceptance criteria — the craft half of SDD — remains human work, and the platform's contribution is honest but bounded: it makes the craft reviewable, attributable, and safe to iterate on.

8. A Minimum Viable Governance Lane

None of this requires a platform rewrite. The lane is five artifact classes, each with an owner, a versioning mechanism, a gate proportionate to its blast radius, and a field on the run trace that ties it to behavior. The pattern below is what we'd recommend a team stand up first; the Skills Registry row is a documented TrueFoundry capability, the rest are conventions any team can adopt in their repo and their harness configuration today.

Artifact	Owner	Versioning	Gate	Runtime trace field
AGENTS.md	repo owner / platform	commit SHA or artifact version	agent eval suite	agent.instructions.version
Feature spec	product + eng owner	commit SHA	acceptance / e2e tests	agent.spec.version
Plan / tasks	eng owner	commit SHA	code / test review	agent.plan.version
Reusable runbook	platform owner	Skills Registry version	skill eval	agent.skill_versions[]
MCP tool grant	platform / security	policy version	approval / policy test	agent.tool_policy.version

The payoff is a single legible record per run. Whatever shaped the behavior — instructions, spec, plan, skills, tool policy — is named and pinned on the trace, so the post-incident question becomes a lookup rather than an excavation:

A run trace that answers "which versions shaped this behavior?" (illustrative)

{
  "agent.run_id": "run_123",
  "agent.name": "payments-feature-agent",
  "agent.definition.version": "agentdef_2026_06_13_04",
  "agent.instructions.version": "AGENTS.md@7c91f",
  "agent.spec.version": "specs/payments-refunds.md@3e02a",
  "agent.skills": ["migration-runbook@v4", "acceptance-review@v2"],
  "agent.eval_gate": "payments-agent-regression@2026-06-12",
  "agent.eval_result": "passed"
}

Skill versions come from the registry as a first-class feature; the spec, plan, and instruction versions are commit SHAs you choose to stamp onto the run. The discipline is cheap to adopt and compounds: the first time an agent ships behavior nobody can explain, this record is the difference between a query and a war room.

9. Where This Lands: Bastien's Second Audit

Rebuilt on governed rails, Bastien's estate six months later looks like this. The nine AGENTS.md variants became three reviewed instruction sets — one per domain — each lean enough to live inside the instruction budget, each versioned, each with an owner whose name appears on the diff. The reusable procedures hiding inside feature specs were extracted into seven registry skills, pinned by version across the fleet; the migration runbook that existed in four conflicting copies exists once, at v4, and the registry can list every agent that mounts it. Feature specs still live with their features — SDD's workflow didn't change — but every harness run records the spec and instruction versions it consumed in its trace metadata, so "which spec produced this behavior" is a query with an answer, and a spec edit faces the same eval gate as a model swap. The forty-one files didn't become bureaucracy. They became an inventory.

That's the completion of the founding argument, not a revision of it. Grove's point was that intent deserves better than a shredded prompt; the SDD tools made intent a written artifact; the governance lane makes it an operated one — versioned, gated, traced, shared without forking. The lingua franca of development did move up a level. The infrastructure should follow it there.

10. Frequently Asked Questions

Who actually coined "spec-driven development"? No single coinage holds up — the term crystallized across 2025 rather than appearing at once. The intellectual spark most often cited is Sean Grove's "The New Code" talk at the AI Engineer World's Fair 2025; GitHub's Spec Kit (September 2025) and AWS's Kiro gave the practice its tooling and much of its vocabulary; Tessl pushed the strong form; and Birgitta Böckeler's Thoughtworks analysis (October 2025) gave it the spec-first / spec-anchored / spec-as-source taxonomy most practitioners now use. Credit honestly: a movement, not a moment.

Is SDD just waterfall with extra steps? It rhymes, and pretending otherwise loses credibility. The differences that matter: SDD specs are lean and iterative rather than comprehensive and frozen; the implement loop is hours, not quarters, so spec and code can converge continuously; and the spec's consumer is an agent that genuinely executes it, which big-upfront-design never had. Where teams write encyclopedic specs once and toss them over a wall, they've reinvented the failure mode, agents or no agents.

Do AGENTS.md files really need governance? They're just docs. They're standing instructions consumed on every agent session — behavior-determining content, not documentation. An unreviewed edit to one is a silent policy change across every run it touches, which is precisely the category of change version control and review exist for. The practical test: if editing the file could alter what ships to production, it's not "just docs."

Where's the line between a spec and a skill? Scope and reuse. A feature spec describes one piece of work and lives with it. A skill is a reusable procedure — how this org does migrations, reviews, incident write-ups — consumed by many agents over time, which is what earns it registry treatment: versions, RBAC, pinning, just-in-time mounting. When the same prose keeps getting copied between specs, that's the promotion signal.

We're two engineers with one agent. Does any of this apply? The craft half, fully — spec-first, lean agent files, checkable acceptance criteria pay off at any scale. The governance lane prices in with multiplication: more agents, more authors, more shared specs. Adopt it at the first "which spec is the real one?" conversation; before that, a git repo and discipline are honestly enough.

References

Sean Grove (OpenAI) — "The New Code," AI Engineer World's Fair 2025 — the founding argument: specs as the versioned, testable home of intent. Grove's exhibit, OpenAI's Model Spec, is the exemplar of intent written as a versioned, testable Markdown document. Paraphrased with credit; quoted fragments under fifteen words.
Birgitta Böckeler (Thoughtworks) — "Understanding Spec-Driven-Development: Kiro, spec-kit, and Tessl," martinfowler.com, October 2025 — the spec-first / spec-anchored / spec-as-source taxonomy and the honest caveats on where SDD applies.
GitHub Spec Kit (September 2025) — the Specify → Plan → Tasks → Implement flow (/speckit.specify, /speckit.plan, /speckit.tasks, /speckit.implement) and the "executable specifications" framing.
AWS Kiro — the spec workflow centered on requirements.md, design.md, and tasks.md with EARS-style acceptance criteria, from requirements through design to task execution.
Tessl — the spec-as-source position: the spec as the primary maintained artifact, code regenerated from it.
Andrej Karpathy — "vibe coding" (February 2025) and subsequent discussion of agentic engineering against detailed specifications with human oversight.
Practitioner restraint guidance on agent instruction files (instruction budgets, file-length recommendations, WHAT/WHY/HOW, progressive disclosure) — community engineering write-ups including HumanLayer's, early 2026; treated as rules of thumb, not controlled studies, and attributed in-text. EPAM's Spec Kit field report is the commonly cited source for expanded delegation windows, read as field evidence rather than a controlled study.
TrueFoundry Agent Harness and Skills Registry — the documented capabilities this post relies on: versioned, RBAC-governed, pinnable, roll-back-able, audited SKILL.md artifacts with progressive disclosure, plus harness orchestration, sandboxing, approvals, and tracing.

Northwind and Bastien are an illustrative composite, not a specific organization. Origins are reported per public talks, posts, and analyses at the time of writing — Grove's talk, GitHub's and AWS's launches, Tessl's positioning, and Böckeler's Thoughtworks taxonomy — paraphrased with credit; quoted fragments are under fifteen words and attributed. Practitioner figures (instruction budgets, file-length guidance, delegation-window reports) are community guidance, not controlled studies. Configuration snippets are simplified for readability and not literal product schema. TrueFoundry capabilities reflect public documentation at the time of writing; verify against current docs.

‍

TrueFoundry AI Gateway delivers ~3–4 ms latency, handles 350+ RPS on 1 vCPU, scales horizontally with ease, and is production-ready, while LiteLLM suffers from high latency, struggles beyond moderate RPS, lacks built-in scaling, and is best for light or prototype workloads.

Built for Speed: ~10ms Latency, Even Under Load

Schedule your Demo Now