AIエージェントのためのスペック駆動開発、正しく行うには：統制されたアーティファクトとしてのスペック

Published: July 4, 2026

Built for Speed: ~10ms Latency, Even Under Load

Blazingly fast way to build, track and deploy your models!

Handles 350+ RPS on just 1 vCPU — no tuning needed
Production-ready with full enterprise support

Get Started with Truefoundry Now Talk to the Expert

Spec-driven development is the rare buzzword with a paper trail. It has a founding talk (Sean Grove's "The New Code," AI Engineer World's Fair 2025), flagship tooling from two giants (GitHub's Spec Kit, AWS's Kiro), a radical wing (Tessl's spec-as-source), and a sober taxonomy from Thoughtworks that separates the levels of ambition. The core idea has held up well over a year of scrutiny because it captures something real: when agents write the code, the specification — not the prompt, not the code — is where human intent actually lives, so it should be written first, kept, and treated as the source of truth. What the movement hasn't fully worked out is what happens at fleet scale, when the specs themselves become the sprawl: dozens of spec files, competing AGENTS.md variants, and no record of which spec produced which deployed behavior. That's the part this post is about — because a spec is scaffold, and scaffold is something you can govern.

Key Takeaways

Spec-driven development (SDD) means writing a structured spec before AI writes code, and keeping the spec as the source of truth for both humans and agents — "documentation first," in Thoughtworks' Birgitta Böckeler's framing of the still-fluid definition.
The intellectual spark is often cited as OpenAI's Sean Grove ("The New Code," 2025): prompts are transient — teams keep generated code and delete the prompt — while a written, versioned spec aligns humans and models alike, with OpenAI's own Model Spec as the exemplar.
The tooling converged fast: GitHub Spec Kit (Specify → Plan → Tasks → Implement), AWS Kiro (Requirements → Design → Tasks), and Tessl's stronger spec-as-source stance; Böckeler's useful ladder runs spec-first → spec-anchored → spec-as-source.
The companion discipline is restraint: practitioner guidance treats roughly 150–200 standing instructions as a practical upper bound — a rule of thumb, not a measured guarantee — before reliability degrades, so agent files (AGENTS.md, CLAUDE.md) should stay lean, structured WHAT/WHY/HOW, and progressively disclosed — comprehensiveness is a failure mode.
SDD's unsolved problem is governance at scale: specs are behavior-determining artifacts, and at fleet scale they need what code got decades ago — versions, owners, review, and a link from spec version to deployed behavior.
In glossary terms, a spec is scaffold and a spec change is a policy change — so it belongs in versioned artifacts, behind an eval gate, with run metadata recording which spec version shaped each run.
On TrueFoundry's Agent Harness, this governance lane has a natural home: agent instructions live with the agent definition; reusable procedure specs can become versioned Skills Registry artifacts — pinned, RBAC-governed, rolled back, audited, and mounted just-in-time; sensitive tool use can be gated; and runs are traced through the harness. For feature specs and plans, the recommended pattern is to record the spec version in run metadata and gate promotions through evaluation.

Bastien, an engineering manager, was an early and sincere convert. His team adopted spec-driven development in the autumn, and it delivered exactly what the talks promised: features that used to die in prompt ping-pong shipped from a one-page spec, review conversations moved up a level, and the safe-delegation window stretched from twenty-minute tasks to afternoon-sized ones. Six months later he ran an audit and found what success looks like when nobody governs it: forty-one spec files across nine repos, four of them describing the same service differently; nine AGENTS.md variants, three auto-generated and bloated past anything a model reliably follows; specs that had drifted from the code they once produced, with no record of which version of which spec shaped the agent run that shipped any given behavior. The practice was right. The artifacts were feral. Bastien's question to his platform team was the one this post answers: "we made specs the source of truth — so why is the truth unversioned?"

To answer it, start where the term started — the origin story contains the exact argument that makes governance the logical next step.

1. Where the Buzzword Comes From: Grove, the Shredded Source, and the Model Spec

One widely cited origin point is a conference talk. At the AI Engineer World's Fair in San Francisco in 2025, Sean Grove — then working on alignment at OpenAI — delivered "The New Code," which many practitioners cite as the movement's crispest articulation, and gave it a central image: developers prompt a model, keep the generated code, and throw the prompt away, which Grove likened to shredding the source and version-controlling the binary. His argument ran deeper than workflow hygiene. Code, he argued, was always only part of a programmer's value; the larger part is structured communication of intent — and a written specification is the artifact that aligns humans on what should be built before it ever aligns a model. Grove's exhibit was OpenAI's own Model Spec: a versioned, living Markdown document expressing the company's intended model behavior, where clauses carry unique IDs and example prompts that function as unit tests. Intent, written once, versioned, testable — and consumed by both people and machines.

The industry's reaction gave the idea its name and its tools. GitHub open-sourced Spec Kit in September 2025, framing the shift bluntly — in their words, the lingua franca of development moves up a level, and code becomes the last mile. AWS shipped Kiro, an IDE that centers its spec workflow on Requirements, Design, and Tasks before execution. Tessl staked out the radical end: specs as the primary maintained artifact, code regenerated from them. And Andrej Karpathy — who had coined "vibe coding" in February 2025 — described the next phase a year later as agentic engineering: orchestrating agents against detailed specifications with human oversight. The vibe era named the problem; SDD named the correction.

2. What SDD Actually Is — and Böckeler's Ladder

Because the term spread faster than its definition, the most useful map is Birgitta Böckeler's October 2025 analysis for Thoughtworks (published on martinfowler.com), which examined Spec Kit, Kiro, and Tessl and found not one practice but a ladder of ambition. Spec-first: write a well-thought-out spec, use it to drive the AI-assisted task, done — the spec may not outlive the feature. Spec-anchored: keep the spec after the task completes and evolve the feature through it — the spec becomes the maintained description of behavior. Spec-as-source: the Tessl position — the spec is the artifact you maintain; code is its compiled output. Most teams practicing SDD today are on the first two rungs, and the workflow is recognizably the same everywhere: specify, plan, decompose into tasks, implement — with the agent consuming the spec at every stage.

Why it works is also no longer speculative. Structured specs constrain the search space that makes unconstrained generation chaotic; acceptance criteria give the agent (and its verifier) something checkable; and decomposition turns a vague feature into agent-sized tasks. Practitioner reports describe the safe-delegation window expanding from ten-to-twenty-minute tasks to multi-hour feature delivery with consistent quality — field evidence from teams adopting the practice rather than controlled studies, and worth reading as such. The honest boundary came from the same Thoughtworks analysis that mapped the ladder: SDD suits larger features and greenfield work; for a small bug fix, the spec overhead isn't worth it. A discipline that knows where it doesn't apply is usually one worth taking seriously.

3. The Companion Discipline: Restraint and the Instruction Budget

SDD arrived with a sibling practice that matters just as much for agents: the standing instruction file — AGENTS.md, CLAUDE.md, and their cousins — that rides along in every agent session as the project's persistent voice. And the hard-won 2026 consensus about these files is a restraint discourse. Practitioner guidance converged on a working number: frontier models are often treated as reliably following on the order of 150–200 standing instructions before compliance starts to degrade — a rule of thumb, not a model guarantee — so the auto-generated, everything-included agent file is a failure mode wearing a diligence costume. The recommendations that emerged read like an editor's checklist: keep the file lean (HumanLayer's commonly cited recommendation is under ~300 lines); structure it WHAT/WHY/HOW — stack and structure, purpose, build commands and conventions; include explicit do/don't lists and permission boundaries; don't document file paths (they rot), don't use the file as a linter (use linters), don't embed task-specific instructions that stale by Friday. For larger codebases: progressive disclosure — a brief root file, deeper files loaded where relevant.

Readers of this series will recognize both halves of that advice. The instruction budget is a context-engineering fact wearing a documentation hat — standing instructions are preloaded context, and preloaded bulk degrades attention exactly as our JIT context post describes; progressive disclosure is just-in-time loading applied to specs. And "the agent file is the project's persistent voice" is a scaffold claim — which is the thread that leads to the governance gap, because Bastien's nine bloated variants weren't a writing problem. They were an unowned-artifact problem.

4. The Fleet-Scale Gap: Specs Are Behavior, and Behavior Sprawls

Here is the structural observation the SDD discourse mostly hasn't reached yet. Everything the movement says about specs — source of truth, consumed by agents, aligning humans — is a claim that specs determine behavior. In the vocabulary of the agent glossary, a spec is scaffold: content the model works from. A change to scaffold is a policy change — behaviorally the same category of event as a model swap. SDD, practiced sincerely, multiplies exactly these artifacts: feature specs, plan files, task lists, AGENTS.md variants, acceptance criteria. Each one steers agents. Almost none of them, in a typical adoption, has a version history tied to deployments, an owner, a review path, or an answer to "which runs did this file shape?"

That's Bastien's audit, named. Forty-one specs and nine agent files aren't a tidiness problem — they're forty-one unversioned behavior inputs and nine competing standing policies. The known SDD pain points all trace to the same root: spec/code drift (early SDD adopters commonly report that specs and code fall out of sync unless the spec is updated as part of the workflow) is unmonitored divergence between the declared truth and the deployed one; duplicate specs are forked policies; bloated auto-generated agent files are unreviewed policy accretion. The movement made specs important; it hasn't yet made them governed. Software solved this exact problem for code with version control, code review, CI gates, and deploy records — and the fix for specs is the same machinery, pointed at the new artifact class.

Fig 1: The pipeline every SDD tool converged on (Spec Kit's Specify→Plan→Tasks→Implement; Kiro's Requirements→Design→Tasks), and the governance lane the practice needs once specs steer a fleet rather than a feature.

5. Specs as Governed Scaffold: the Mapping

Run the SDD artifact list through a platform that already treats scaffold as governed content, and each item lands somewhere specific. Standing instructions — the AGENTS.md role — belong with the agent definition on TrueFoundry's Agent Harness: they should be stored as versioned content, reviewed as diffs, and referenced by the definition rather than edited ad hoc in a prompt box — so the restraint discipline has a reviewer, not just an author. Reusable procedure specs — the runbooks, the "how we do migrations here," the multi-step acceptance recipes — are skills: promoted to the Skills Registry as versioned SKILL.md artifacts with RBAC, pinned by agents as skill@version, and mounted just-in-time — which implements progressive disclosure literally: the catalog's one-line description rides along; the full procedure loads on activation. One spec, forty agents, zero forks.

Feature specs and plans can stay in the repo where the work lives, but the agent run should record their immutable version or commit SHA in its trace metadata — and the harness already records each run's steps, so the spec version travels with the run rather than being reconstructed later. Do that, and "which spec version shaped this run" becomes a query instead of archaeology. And the whole assembly is the policy: the agent definition — model by name, instructions at a version, skills pinned, tool grants — is one declarative, diffable artifact, which is exactly the unit our policy post argues should be versioned, gated, and rolled back together. The pleasant surprise of this mapping is that nothing about it fights SDD; it completes the founding argument. Grove's case was that intent deserves to be a versioned, testable, first-class artifact. The Model Spec did that for one organization's model behavior. The governance lane does it for every spec your fleet consumes.

An SDD-shaped agent definition: specs as pinned, versioned scaffold (illustrative)

agent:
  name: payments-feature-agent
  model: claude-sonnet-4-6
  instructions: ./AGENTS.md@7c91f       # lean, WHAT/WHY/HOW, ~180 instructions — reviewed
  skills:                                # reusable procedure specs, from the registry
    - migration-runbook@v4               # one spec, every agent; mounted just-in-time
    - acceptance-review@v2
  context:
    feature_spec: specs/payments-refunds.md@3e02a   # the Specify artifact, at a version
  mcp_servers: [github, ci]
# Recommended pattern: a spec edit should bump a version → run an eval gate →
# record the artifact versions on the run trace. Bastien's audit becomes a query.

TrueFoundry Agent Harness builder: instructions, skills, and MCP servers attached to one agent definition — Fig 2: Specs as governed configuration: instructions and skills attached by name and version to one declarative agent definition — the same artifact whether authored in the no-code builder, SDK, or API. Source: *TrueFoundry Agent Harness docs*.

6. Spec Changes Through the Gate

The operational consequence deserves its own section because it's where SDD's biggest unhandled risk lives. If the spec is the source of truth and agents implement against it, then a spec edit is the highest-leverage change in the system — it redirects every downstream agent decision — and in most adoptions it's also the least-ceremonied: a Markdown edit, merged on a Friday, consumed by Monday's runs. The fix is the policy discipline applied without exception: a spec or agent-file change should bump a version, the version change should trigger the evaluation gate (the same suite a model swap would face — behaviorally it's the same kind of event), and promotion should be staged. For shared skills, a version bump rolls out per-agent and deliberately, not registry-wide and silently. None of this slows the cheap edit much; all of it makes the expensive edit visible before production does.

The same machinery dissolves the drift problem honestly. Spec/code drift is unavoidable in spec-first practice and managed in spec-anchored practice — but only if divergence is detectable, which requires knowing which spec version the deployed behavior came from. When runs record the spec version they consumed, drift becomes a diff between the current spec and the version that last passed the gate, surfaced on a schedule — rather than a discovery made during an incident, which is where Bastien found his.

7. Honest Limits: Where SDD Strains

The skeptics deserve their section, because their strongest points shape the governance design. The waterfall critique — SDD as requirement documents reborn — lands wherever specs are written once, comprehensively, up front; the practice survives it by keeping specs lean and iterative rather than encyclopedic. The "code is the ultimate executable truth" objection — when generated code is wrong, you fix code, not prose — is true at the margin, and is exactly why spec-as-source remains the minority position; spec-anchored, with drift detection, is the defensible middle. And the overhead point stands as Thoughtworks' own analysis states it: small fixes don't warrant the ceremony, so the governance lane needs a proportionate fast path — record the version, run the cheap automated gate, skip the committee.

One more limit is internal to this post's argument: versioning specs doesn't make them good. A precisely governed bad spec produces precisely traceable bad behavior. The instruction budget, the WHAT/WHY/HOW structure, the checkable acceptance criteria — the craft half of SDD — remains human work, and the platform's contribution is honest but bounded: it makes the craft reviewable, attributable, and safe to iterate on.

8. A Minimum Viable Governance Lane

None of this requires a platform rewrite. The lane is five artifact classes, each with an owner, a versioning mechanism, a gate proportionate to its blast radius, and a field on the run trace that ties it to behavior. The pattern below is what we'd recommend a team stand up first; the Skills Registry row is a documented TrueFoundry capability, the rest are conventions any team can adopt in their repo and their harness configuration today.

Artifact	Owner	Versioning	Gate	Runtime trace field
AGENTS.md	repo owner / platform	commit SHA or artifact version	agent eval suite	agent.instructions.version
Feature spec	product + eng owner	commit SHA	acceptance / e2e tests	agent.spec.version
Plan / tasks	eng owner	commit SHA	code / test review	agent.plan.version
Reusable runbook	platform owner	Skills Registry version	skill eval	agent.skill_versions[]
MCP tool grant	platform / security	policy version	approval / policy test	agent.tool_policy.version

The payoff is a single legible record per run. Whatever shaped the behavior — instructions, spec, plan, skills, tool policy — is named and pinned on the trace, so the post-incident question becomes a lookup rather than an excavation:

A run trace that answers "which versions shaped this behavior?" (illustrative)

{
  "agent.run_id": "run_123",
  "agent.name": "payments-feature-agent",
  "agent.definition.version": "agentdef_2026_06_13_04",
  "agent.instructions.version": "AGENTS.md@7c91f",
  "agent.spec.version": "specs/payments-refunds.md@3e02a",
  "agent.skills": ["migration-runbook@v4", "acceptance-review@v2"],
  "agent.eval_gate": "payments-agent-regression@2026-06-12",
  "agent.eval_result": "passed"
}

Skill versions come from the registry as a first-class feature; the spec, plan, and instruction versions are commit SHAs you choose to stamp onto the run. The discipline is cheap to adopt and compounds: the first time an agent ships behavior nobody can explain, this record is the difference between a query and a war room.

9. Where This Lands: Bastien's Second Audit

Rebuilt on governed rails, Bastien's estate six months later looks like this. The nine AGENTS.md variants became three reviewed instruction sets — one per domain — each lean enough to live inside the instruction budget, each versioned, each with an owner whose name appears on the diff. The reusable procedures hiding inside feature specs were extracted into seven registry skills, pinned by version across the fleet; the migration runbook that existed in four conflicting copies exists once, at v4, and the registry can list every agent that mounts it. Feature specs still live with their features — SDD's workflow didn't change — but every harness run records the spec and instruction versions it consumed in its trace metadata, so "which spec produced this behavior" is a query with an answer, and a spec edit faces the same eval gate as a model swap. The forty-one files didn't become bureaucracy. They became an inventory.

That's the completion of the founding argument, not a revision of it. Grove's point was that intent deserves better than a shredded prompt; the SDD tools made intent a written artifact; the governance lane makes it an operated one — versioned, gated, traced, shared without forking. The lingua franca of development did move up a level. The infrastructure should follow it there.

10. Frequently Asked Questions

Who actually coined "spec-driven development"? No single coinage holds up — the term crystallized across 2025 rather than appearing at once. The intellectual spark most often cited is Sean Grove's "The New Code" talk at the AI Engineer World's Fair 2025; GitHub's Spec Kit (September 2025) and AWS's Kiro gave the practice its tooling and much of its vocabulary; Tessl pushed the strong form; and Birgitta Böckeler's Thoughtworks analysis (October 2025) gave it the spec-first / spec-anchored / spec-as-source taxonomy most practitioners now use. Credit honestly: a movement, not a moment.

Is SDD just waterfall with extra steps? It rhymes, and pretending otherwise loses credibility. The differences that matter: SDD specs are lean and iterative rather than comprehensive and frozen; the implement loop is hours, not quarters, so spec and code can converge continuously; and the spec's consumer is an agent that genuinely executes it, which big-upfront-design never had. Where teams write encyclopedic specs once and toss them over a wall, they've reinvented the failure mode, agents or no agents.

Do AGENTS.md files really need governance? They're just docs. They're standing instructions consumed on every agent session — behavior-determining content, not documentation. An unreviewed edit to one is a silent policy change across every run it touches, which is precisely the category of change version control and review exist for. The practical test: if editing the file could alter what ships to production, it's not "just docs."

Where's the line between a spec and a skill? Scope and reuse. A feature spec describes one piece of work and lives with it. A skill is a reusable procedure — how this org does migrations, reviews, incident write-ups — consumed by many agents over time, which is what earns it registry treatment: versions, RBAC, pinning, just-in-time mounting. When the same prose keeps getting copied between specs, that's the promotion signal.

2人のエンジニアと1つのエージェントの場合、これは当てはまりますか？ 開発の実践面、つまりスペックファースト、軽量なエージェントファイル、検証可能な受け入れ基準といったものは、どんな規模でも十分に効果を発揮します。ガバナンスの側面は、エージェント、作成者、共有スペックが増えるにつれて、その複雑さや重要性が乗数的に増していきます。「どのスペックが正しいのか？」という最初の議論が生じた時点で導入を検討すべきでしょう。それまでは、Gitリポジトリと規律があれば正直十分です。

参考文献

ショーン・グローブ（OpenAI） — 「新しいコード」 AIエンジニア世界博覧会2025 — 根本的な主張：意図のバージョン管理され、テスト可能な本拠地としてのスペック。グローブ氏の展示、OpenAIのモデルスペックは、バージョン管理され、テスト可能なMarkdownドキュメントとして記述された意図の模範です。引用元を明記して要約。引用部分は15語未満。
ビルギッタ・ベッケラー（Thoughtworks） — 「スペック駆動開発の理解：Kiro、spec-kit、Tessl」 martinfowler.com、2025年10月 — スペックファースト／スペックアンカー／スペックアズソースの分類法、およびSDDが適用される場所に関する正直な注意点。
GitHub Spec Kit （2025年9月） — Specify → Plan → Tasks → Implement のフロー（/speckit.specify、 /speckit.plan、 /speckit.tasks、 /speckit.implement)と「実行可能な仕様」という枠組み。
AWS Kiro — 仕様ワークフローの中心となるのは requirements.md、 design.md、そして tasks.md EARS形式の受け入れ基準を用いて、要件定義から設計、タスク実行まで。
Tessl — 仕様をソースとする立場：仕様を主要な保守対象成果物とし、そこからコードを再生成する。
Andrej Karpathy — "vibe coding" (2025年2月) と、詳細な仕様に基づき人間の監視下で行われるエージェント型エンジニアリングに関するその後の議論。
エージェント指示ファイルに関する実務者向けの制約ガイダンス（指示予算、ファイル長の推奨、WHAT/WHY/HOW、段階的開示）— コミュニティのエンジニアリングに関する記事には、 HumanLayerのもの（2026年初頭）などがあり、これらは管理された研究ではなく経験則として扱われ、本文中で引用されています。EPAMの Spec Kit フィールドレポートは、委任範囲の拡大に関して一般的に引用される情報源であり、管理された研究ではなく実地証拠として読まれています。
TrueFoundry Agent Harness とスキルレジストリ — この投稿が依拠する文書化された機能：バージョン管理され、RBACによって統制され、ピン留め可能で、ロールバック可能、監査済みのSKILL.mdアーティファクト（段階的開示付き）、さらにハーネスオーケストレーション、サンドボックス化、承認、トレーシング。

NorthwindとBastienは、特定の組織ではなく、説明のための架空の複合体です。出典は、執筆時点での公開講演、投稿、分析（Grove氏の講演、GitHubとAWSのローンチ、Tesslのポジショニング、Böckeler氏のThoughtworks分類）に基づいて報告されており、出典を明記して言い換えられています。引用された断片は15語未満であり、出典が示されています。実務家の数値（指示予算、ファイル長ガイドライン、委任期間レポート）は、管理された研究ではなく、コミュニティのガイダンスです。設定スニペットは読みやすさのために簡略化されており、実際の製品スキーマではありません。TrueFoundryの機能は、執筆時点での公開ドキュメントを反映しています。最新のドキュメントでご確認ください。

‍

TrueFoundry AI Gateway delivers ~3–4 ms latency, handles 350+ RPS on 1 vCPU, scales horizontally with ease, and is production-ready, while LiteLLM suffers from high latency, struggles beyond moderate RPS, lacks built-in scaling, and is best for light or prototype workloads.

Built for Speed: ~10ms Latency, Even Under Load

Schedule your Demo Now