Tool vs. Skill vs. Sub-agent: The Delegation Spectrum and Its Governance

Of all the lines the agent glossary draws, the one between tools, skills, and sub-agents is the blurriest — the glossary itself admits the boundary shifts across frameworks. But the spectrum underneath is stable and worth engineering around: a tool is an action, a skill is a packaged procedure, a sub-agent is a delegate that reasons. Those are three different kinds of delegation, and — the part most architectures miss — three different governance problems. Authorize a tool, version a skill, give a sub-agent an identity: get the rung wrong and you'll govern a reasoning delegate like a function call, which is how blast-radius surprises are made.
Ingrid, a staff engineer, was reviewing an agent before it touched production when she hit a line in its tool list that stopped her: one "tool" was not a tool. Nine entries were honest function calls: query the order database, post a Slack message. The tenth, "resolve-customer-issue," was an entire second agent wearing a tool's name tag: it had its own prompt, its own model call budget, its own tool access, and it reasoned for up to fifteen steps before returning a string. It had inherited the parent's permissions wholesale, which is the failure mode a governed setup avoids by scoping the delegate explicitly. It appeared in traces as a single opaque call. And it had been security-reviewed as "a tool," which is to say, not really reviewed at all. Nothing about it was malicious. Everything about it was misfiled.
Misfiling is the failure mode this vocabulary exists to prevent. The Hugging Face agent glossary — whose terms this series follows, mapped to infrastructure in our anchor post — separates the three cleanly while admitting frameworks blur them. This post takes the spectrum rung by rung: what each one is, when to reach for it, and the governance surface each one demands. Because Ingrid's tenth entry wasn't a naming problem. It was a delegation problem filed under the wrong governance regime.
1. The Spectrum, Defined
A tool is an action. The model emits a structured request — call this function with these arguments — and the harness executes it: an API call, a database query, a shell command, a web fetch. The result comes back into context and the loop continues. The tool doesn't reason; it runs. Everything the model knows about it is its description, and the call itself is a discrete, inspectable event with arguments in and a result out.
A skill is packaged knowledge. Where a tool is "run this command," a skill bundles what's needed to accomplish a goal — investigate this class of bug, produce this kind of report — typically as structured instructions and procedure (the SKILL.md pattern) loaded into the agent's context on demand. A skill doesn't execute anything by itself; it shapes how the agent uses the tools it has. It's scaffold, made portable and reusable across agents.
A sub-agent is a delegate. The calling agent hands off a subtask to another agent — one with its own model, its own scaffold, its own context window, possibly its own tools — which reasons through the work independently and returns a result. The parent receives the result rather than managing every intermediate step, the way a manager doesn't keystroke for a direct report. That opacity is the point, and also the governance problem. The distinction from a tool is exactly the one Ingrid tripped over: a tool executes; a sub-agent decides, and can itself use tools, and can be wrong in open-ended ways a function cannot.
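As a rough illustration of the three definitions above, the sketch below models each rung as a plain Python type and maps it to the thing that has to be governed (the call, the artifact, or the principal, which the later sections take in turn). Every class and field name here is hypothetical and invented for this post; no framework's real schema is implied.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical types for illustration only.

@dataclass
class Tool:
    """An action: it runs code and does not reason."""
    name: str
    description: str         # everything the model knows about the tool
    run: Callable[..., str]  # the implementation is the behavior

@dataclass
class Skill:
    """A packaged procedure: text loaded into context; executes nothing."""
    name: str
    version: str
    instructions: str

@dataclass
class SubAgent:
    """A delegate: its own model, its own tools, its own step budget."""
    name: str
    model: str
    tools: list[Tool] = field(default_factory=list)
    max_steps: int = 10

def governance_unit(capability: object) -> str:
    """File a capability by what it is, not by what it is called."""
    if isinstance(capability, Tool):
        return "call"
    if isinstance(capability, Skill):
        return "artifact"
    if isinstance(capability, SubAgent):
        return "principal"
    raise TypeError(f"unfiled capability: {capability!r}")
```

The point of the `TypeError` branch is that nothing gets a default filing: a capability that is none of the three has to be looked at by a person.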

2. Choosing a Rung: the Lowest That Does the Job
The spectrum trades control for capability as you climb. A tool is maximally legible — its behavior is its implementation, its invocation is one auditable event — and minimally capable: it handles exactly what it was coded for. A sub-agent is maximally capable — it can absorb ambiguity, adapt mid-subtask, recover from surprises — and minimally legible: its behavior emerges from a model reasoning in a context you don't see from the parent. A skill sits between: it raises the agent's competence at a class of task without adding a new delegated principal.
That gives a clean selection heuristic: choose the lowest rung that does the job. If the subtask is deterministic and well-specified — fetch this record, run these tests — it's a tool; wrapping it in a sub-agent adds cost, latency, and an open-ended failure mode for nothing. If the subtask is a procedure a domain expert could write down — triage a bug this way, structure the report like this — it's a skill: the parent agent's own reasoning, upgraded with packaged know-how, no delegation needed. Reserve the sub-agent for subtasks that need genuine judgment with their own working state — deep-dive this log corpus, research this question — where the work would either overwhelm the parent's context or require reasoning the parent shouldn't interleave with its main thread. Ingrid's misfiled delegate, judged by this heuristic, was legitimately a sub-agent; the error was never the rung, it was the filing.
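The heuristic can be written down as a short decision function. This is a minimal sketch, not a rule engine: the three boolean fields are assumptions chosen to mirror the paragraph above, and real subtasks will rarely reduce to three flags this cleanly.

```python
from dataclasses import dataclass

@dataclass
class Subtask:
    # Hypothetical flags mirroring the heuristic in the text.
    deterministic: bool       # well-specified: fetch this record, run these tests
    procedure_writable: bool  # a domain expert could write the steps down
    needs_own_context: bool   # would overwhelm or interleave badly with the parent

def choose_rung(task: Subtask) -> str:
    """Return the lowest rung that does the job."""
    if task.deterministic:
        return "tool"
    if task.procedure_writable and not task.needs_own_context:
        return "skill"
    # Genuine judgment with its own working state.
    return "sub-agent"

print(choose_rung(Subtask(deterministic=False, procedure_writable=True, needs_own_context=False)))
```

The checks run from the bottom of the spectrum upward, so a subtask only reaches the sub-agent branch after the cheaper, more legible rungs have been ruled out.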
3. Tools: Govern the Call
Because a tool is a discrete executed action, its governance unit is the call. Three questions attach to every invocation: is this caller authenticated, is this caller authorized for this tool, and is this particular call acceptable — the last covering both inbound arguments (an injected instruction smuggled into a parameter) and the outbound result (a secret or PII in the response). That's the per-call surface, and it's exactly what an MCP gateway exists to provide: a registry so agents reach only sanctioned tools, central authentication so agents don't carry per-tool credentials, RBAC per tool, and pre/post-call guardrails with a per-call audit trail. TrueFoundry's MCP Gateway implements this rung end to end, including curated least-privilege tool surfaces per role via virtual MCP servers.
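To make the per-call surface concrete, here is a small sketch of a gate that asks the three questions in order, records every attempt, and redacts the result on the way back. It is an illustration under stated assumptions, not the MCP Gateway's API: `CallGate`, the role table, the blocked phrase, and the 16-digit pattern are all hypothetical stand-ins for real authentication and guardrail systems.

```python
import re
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical stand-ins for real guardrails.
BLOCKED_PHRASE = "ignore previous instructions"  # inbound argument check
CARD_NUMBER = re.compile(r"\b\d{16}\b")          # outbound result check

@dataclass
class CallGate:
    roles: dict[str, str]        # caller id -> role (stands in for authentication)
    grants: dict[str, set[str]]  # role -> tool names that role may call
    audit_log: list[dict] = field(default_factory=list)

    def decide(self, caller: str, tool_name: str, args: dict) -> str:
        role = self.roles.get(caller)
        if role is None:
            return "deny: unauthenticated"
        if tool_name not in self.grants.get(role, set()):
            return "deny: unauthorized"
        if any(BLOCKED_PHRASE in str(value).lower() for value in args.values()):
            return "deny: argument guardrail"
        return "allow"

    def call(self, caller: str, tool_name: str,
             tool_fn: Callable[..., str], args: dict) -> str:
        decision = self.decide(caller, tool_name, args)
        # Every attempt is recorded, allowed or not.
        self.audit_log.append({"caller": caller, "tool": tool_name, "decision": decision})
        if decision != "allow":
            raise PermissionError(f"{tool_name}: {decision}")
        # Post-call guardrail: redact before the result re-enters context.
        return CARD_NUMBER.sub("[REDACTED]", tool_fn(**args))

def lookup_order(order_id: str) -> str:
    return f"order {order_id} paid with 4111111111111111"

gate = CallGate(roles={"customer-ops": "support"},
                grants={"support": {"lookup_order"}})
print(gate.call("customer-ops", "lookup_order", lookup_order, {"order_id": "A1"}))
```

Note that the audit entry is written before the permission error is raised, so denied calls leave the same kind of record as allowed ones.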
One scaffold-layer note belongs here because it's among the most frequent tool failures in practice: the model selects tools purely from their descriptions, so a vague or overlapping description produces misuse that no per-call check will catch — the call is authorized, well-formed, and wrong. Treat tool descriptions as reviewed, versioned scaffold, and treat a tool that keeps being misused as having a documentation bug before assuming the model has a reasoning one.
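One cheap way to catch overlapping descriptions before the model does is a review-time lint. The sketch below flags pairs of tools whose descriptions share most of their words. Word overlap is a crude proxy chosen for this example, and the `0.6` threshold is an arbitrary assumption; treat a flagged pair as a prompt for human review, not as a verdict.

```python
from itertools import combinations

def overlapping_descriptions(descriptions: dict[str, str],
                             threshold: float = 0.6) -> list[tuple[str, str]]:
    """Return tool-name pairs whose descriptions share most of their words."""
    def words(text: str) -> set[str]:
        return set(text.lower().split())

    flagged = []
    for (name_a, desc_a), (name_b, desc_b) in combinations(sorted(descriptions.items()), 2):
        a, b = words(desc_a), words(desc_b)
        union = a | b
        # Jaccard similarity: shared words over all words.
        score = len(a & b) / len(union) if union else 0.0
        if score >= threshold:
            flagged.append((name_a, name_b))
    return flagged

tools = {
    "get_order": "look up an order by id",
    "find_order": "look up an order by customer",
    "post_message": "send a slack message",
}
print(overlapping_descriptions(tools))
```

Here the two order tools share five of seven distinct words, which is enough to suggest the model may have trouble telling them apart from the descriptions alone.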
4. Skills: Govern the Artifact
A skill executes nothing, so per-call governance has nothing to grab; its governance unit is the artifact. The questions that matter are provenance and versioning: who wrote this procedure, who reviewed it, which version is each agent running, and what changed between versions — because a skill is behavior, and an unreviewed edit to a widely used skill is a silent behavior change across every agent that mounts it. The failure mode of ungoverned skills is the failure mode of ungoverned copy-paste: procedures forked into private variants, stale versions running indefinitely, and no way to answer "which agents are affected?" when a procedure turns out to be wrong.
The fix is the same one code got: a registry. TrueFoundry's Skills Registry treats skills as versioned SKILL.md artifacts with RBAC — published, reviewed, version-pinned by agents, and mounted into context on demand rather than pasted into every definition. On-demand mounting matters twice: it's a context-engineering win (the procedure costs window space only when relevant) and a governance win (the registry knows exactly which agents use which skill at which version, so the "which agents are affected" question has an answer). A skill in a registry is institutional knowledge; a skill in a paste buffer is a rumor.
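The registry idea is small enough to sketch in memory. The class below is a hypothetical illustration, not the Skills Registry's API: it shows immutable published versions with recorded provenance, version pinning per agent, on-demand mounting, and the "which agents are affected?" lookup. All method and parameter names are assumptions made for this example.

```python
from collections import defaultdict

class SkillRegistry:
    """Hypothetical in-memory registry for illustration only."""

    def __init__(self) -> None:
        self._artifacts = {}           # (name, version) -> artifact record
        self._pins = defaultdict(set)  # (name, version) -> agent names

    def publish(self, name: str, version: str, instructions: str,
                author: str, reviewer: str) -> None:
        key = (name, version)
        if key in self._artifacts:
            # A published version never changes; edits need a new version.
            raise ValueError(f"{name}@{version} is already published")
        self._artifacts[key] = {
            "instructions": instructions, "author": author, "reviewer": reviewer,
        }

    def pin(self, agent: str, name: str, version: str) -> None:
        if (name, version) not in self._artifacts:
            raise KeyError(f"{name}@{version} is not published")
        self._pins[(name, version)].add(agent)

    def mount(self, agent: str, name: str, version: str) -> str:
        """Return the procedure text for an agent that has pinned this version."""
        if agent not in self._pins.get((name, version), set()):
            raise PermissionError(f"{agent} has not pinned {name}@{version}")
        return self._artifacts[(name, version)]["instructions"]

    def affected_agents(self, name: str, version: str) -> list[str]:
        return sorted(self._pins.get((name, version), set()))
```

Because pins are recorded at the registry rather than in each agent's definition, `affected_agents` can answer the impact question without searching every agent for pasted copies.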
5. Sub-agents: Govern the Principal
A sub-agent reasons, so neither the call nor the artifact is the right governance unit — the principal is. Four properties make a sub-agent governable, and all four were missing from Ingrid's tenth entry. Identity: the sub-agent acts as itself, not anonymously inside its parent, so its actions are attributable. Scoped permissions: it gets the tools its subtask needs — not the parent's grants by inheritance; a log-analysis delegate has no business with the parent's write access to the order system. Bounded resources: its own step and token budgets, so a delegate that wanders can't silently consume the parent's run (the loop-prevention concern of our multi-agent post, one level down). Its own trace: the parent's record shows the delegation and the result; the sub-agent's record shows the fifteen steps in between — an opaque single line in a trace is exactly how a misfiled delegate hides.
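The four properties above can be sketched as a small runtime wrapper. This is an assumption-laden illustration rather than any harness's actual interface: `Delegate`, `BudgetExceeded`, and the trace record shape are hypothetical, and the model's reasoning loop is left out so that only the governance mechanics show.

```python
from dataclasses import dataclass, field
from typing import Callable

class BudgetExceeded(RuntimeError):
    """Raised when a delegate tries to run past its step budget."""

@dataclass
class Delegate:
    identity: str                  # acts as itself, so actions are attributable
    allowed_tools: frozenset[str]  # scoped to the subtask, not inherited
    max_steps: int                 # bounded resources
    trace: list[dict] = field(default_factory=list)  # its own record
    steps_used: int = 0

    def use_tool(self, tool_name: str,
                 tools: dict[str, Callable[..., str]], **args) -> str:
        if self.steps_used >= self.max_steps:
            raise BudgetExceeded(f"{self.identity} exhausted {self.max_steps} steps")
        if tool_name not in self.allowed_tools:
            # Denials are traced under the delegate's own identity too.
            self.trace.append({"principal": self.identity, "tool": tool_name,
                               "status": "denied"})
            raise PermissionError(f"{self.identity} may not call {tool_name}")
        self.steps_used += 1
        result = tools[tool_name](**args)
        self.trace.append({"principal": self.identity, "tool": tool_name,
                           "status": "ok", "step": self.steps_used})
        return result

# The parent's full tool surface; the delegate is granted only part of it.
parent_tools = {
    "read_logs": lambda query: f"3 matches for {query}",
    "update_status": lambda order_id: f"{order_id} updated",
}
investigator = Delegate("issue-investigator", frozenset({"read_logs"}), max_steps=2)
print(investigator.use_tool("read_logs", parent_tools, query="timeout"))
```

Even though `update_status` exists in the parent's tool table, the delegate cannot reach it, which is the difference between a scoped grant and inheritance.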
This is the rung where the harness earns its keep, because these properties are runtime properties. TrueFoundry's Agent Harness runs sub-agents as first-class parts of its context-engineering machinery: each delegate works in an isolated context window and returns conclusions, not transcripts, to the parent — and each appears in the per-step trace with its own model calls, tool calls, and cost. Delegation without per-principal governance is how one agent's permissions quietly become an org chart of unaudited copies; delegation with it is just good decomposition.
6. The Context Axis: the Spectrum as a Window-Management Instrument
There's a second axis running through the spectrum that has nothing to do with safety and everything to do with quality: what each rung costs the parent's context window. A preloaded tool costs its description on every step it's exposed, plus its results as they arrive. That is why result handling (truncation, offloading) is harness work, and why an agent with eighty broadly preloaded tools can pay a standing context cost for all of them whether it uses them or not; deferred, on-demand tool loading exists to remove that cost. A skill, mounted on demand, costs nothing until it's relevant and its full procedure only while it is. A sub-agent is the most powerful context instrument of all: it takes the entire subtask out of the parent's window. The delegate burns its own fresh context on the fifteen-step investigation and returns three sentences.
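A back-of-the-envelope calculation shows the shape of these costs. The token counts below are made-up round numbers chosen only to make the arithmetic visible, and the model is deliberately simplified: it ignores prompt caching and assumes every preloaded description is re-sent on every step.

```python
def standing_description_tokens(num_tools: int, tokens_per_description: int,
                                steps: int) -> int:
    """Cumulative tokens spent re-sending preloaded tool descriptions over a run."""
    return num_tools * tokens_per_description * steps

def parent_tokens_for_subtask(subtask_tokens: int, summary_tokens: int,
                              delegated: bool) -> int:
    """Tokens a subtask adds to the parent's window, done inline or delegated."""
    return summary_tokens if delegated else subtask_tokens

# Illustrative numbers only: 150-token descriptions over a 20-step run.
broad = standing_description_tokens(num_tools=80, tokens_per_description=150, steps=20)
curated = standing_description_tokens(num_tools=8, tokens_per_description=150, steps=20)
print(broad, curated)  # 240000 24000

# A hypothetical 30,000-token investigation that returns a 100-token summary.
inline = parent_tokens_for_subtask(30_000, 100, delegated=False)
handed_off = parent_tokens_for_subtask(30_000, 100, delegated=True)
print(inline, handed_off)  # 30000 100
```

The delegated subtask still costs tokens somewhere; the function only measures what lands in the parent's window, which is the quantity this section is about.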
Read this way, the spectrum doubles as a context-engineering toolkit, which is exactly how the harness's context-engineering suite treats it: curated tool surfaces keep the standing description cost down, skills load procedures just-in-time, and sub-agents isolate context-heavy subtasks so the parent's window stays lean across a long run. The same decomposition decision — which rung handles this work — is simultaneously a governance decision and a window-budget decision, and the architectures that feel effortless are the ones where both were made on purpose.
7. Composing the Rungs: a Worked Example
The rungs compose, and a realistic agent uses all three. Take Ingrid's customer-operations agent, refiled properly:
One agent, three rungs — each governed at its own surface (illustrative)
agent:
  name: customer-ops
  model: customer-ops-default
  mcp_servers:                                   # TOOLS: per-call governance at MCP Gateway
    - orders-db: [lookup_order, update_status]   # curated, least-privilege
    - slack: [post_message]
  skills:                                        # SKILLS: versioned artifacts from the registry
    - refund-policy@v7                           # pinned; registry knows who runs what
    - escalation-runbook@v2
  subagents:                                     # SUB-AGENTS: principals with their own scope
    - name: issue-investigator                   # Ingrid's "tenth tool," filed correctly
      mcp_servers: [logs-readonly]               # scoped: NOT the parent's grants
      max_steps: 20                              # bounded; traced as itself
The composition reads naturally: tools are what it can do, skills are what it knows how to do, the sub-agent is what it hands off. And each line lands in a different governance system that already knows what to do with it — the MCP Gateway authorizes and audits the calls, the Skills Registry tracks the versions, the harness runs the delegate with its own identity, budget, and trace. The architecture isn't three special cases; it's one spectrum with the governance pre-attached to each rung. That's what "encode the spectrum, not the labels" means in practice: whatever a framework calls these things, file each by what it is — action, procedure, or principal — and the right controls follow.
8. Refactoring Along the Spectrum
The rung that's right today drifts. A tool that grows option flags and conditional behavior is reaching toward being a procedure — consider whether the logic belongs in a skill that orchestrates simpler tools. A skill that keeps saying "judge whether..." is reaching toward being a delegate — procedures shouldn't require judgment; that's the signal to promote the judgment into a sub-agent with its own scope. And a sub-agent whose runs turn out to be near-identical every time is reaching down — if the delegate never actually exercises judgment, demote the work to a skill plus tools and reclaim the legibility and cost.
Two practices make this refactoring safe. First, the trace tells you when to do it: per-step records show the tool being invoked with ever-stranger arguments, the skill being overridden by ad-hoc reasoning, the sub-agent producing the same five steps every run. Second, evaluation tells you whether it worked: a rung change is a behavior change — a policy change, in the glossary's vocabulary, which the eval loop should gate like any other. Decomposition isn't a one-time design decision; it's a maintained property, and the spectrum gives the maintenance a direction.
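The third trace signal, a sub-agent that produces the same steps every run, is easy to check mechanically. The sketch below assumes each run's trace has already been reduced to a list of step names. The function name, the minimum sample size, and the 90% dominance threshold are all hypothetical choices for illustration.

```python
from collections import Counter

def demotion_candidate(runs: list[list[str]], min_runs: int = 5,
                       dominance: float = 0.9) -> bool:
    """True when one step sequence accounts for nearly all of a sub-agent's runs."""
    if len(runs) < min_runs:
        return False  # too little evidence to judge
    counts = Counter(tuple(run) for run in runs)
    _, top_count = counts.most_common(1)[0]
    return top_count / len(runs) >= dominance

same_every_time = [["read_logs", "filter", "summarize"]] * 10
varied = [["read_logs"], ["read_logs", "filter"], ["search"],
          ["read_logs", "summarize"], ["filter", "search"]]
print(demotion_candidate(same_every_time), demotion_candidate(varied))  # True False
```

A `True` here is a reason to look, not a decision: identical step sequences can still hide judgment in the arguments, so the evaluation gate described above still applies to any demotion.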
9. Frequently Asked Questions
Frameworks really don't agree on these terms — does the distinction still matter? The labels vary; the spectrum doesn't, and the glossary is candid about exactly this. What matters operationally is filing each capability by what it is — an executed action, a loaded procedure, a reasoning principal — because the filing determines the governance. Ingrid's incident wasn't caused by a framework's naming; it was caused by governance following the name instead of the nature.
Is a sub-agent just a tool from the caller's point of view? Mechanically it's often invoked through the same interface, which is why the misfiling is so easy. The difference is on the other side of the call: a tool's behavior is its implementation; a sub-agent's behavior is a model reasoning in its own context, with its own tools, capable of being wrong in open-ended ways. Same calling convention, different species — and the governance must follow the species.
When is a skill better than just writing longer instructions? When the procedure is reusable, shared, or worth governing. Inline instructions are fine for one agent's one-off behavior. The moment a procedure is used by several agents, maintained by someone other than the agent's author, or important enough that you'd want to know which version is running where — it's a skill, and it belongs in a registry with versions and RBAC rather than in a paste buffer.
Should a sub-agent ever inherit its parent's permissions? Default no. Inheritance is convenient and quietly maximal — the delegate gets everything the parent has, needed or not, which inverts least privilege exactly where reasoning autonomy makes it matter most. Scope the sub-agent to its subtask's tools. If that feels laborious, that's the architecture telling you the delegation boundary deserves the same care as any other trust boundary, because it is one.
Doesn't climbing the spectrum just mean more cost and latency? Usually, yes — a sub-agent spends model calls a tool wouldn't — and that's part of the lowest-rung heuristic. But the comparison isn't rung-versus-rung in a vacuum; it's against the alternative of the parent doing the work in-window. A context-heavy subtask done by the parent can degrade the whole run (and its cost rides the parent's growing window); the same subtask delegated runs in a fresh, cheap context. Measure per completed task, not per call.
Ingrid's tenth entry never needed to be banned — it needed to be filed. Refiled as what it was, it got an identity, a scope, a budget, and a trace, and went to production a week later as the most-reviewed component in the agent rather than the least. That's the spectrum's whole promise: an agent's capabilities aren't a flat list of "tools" but three kinds of delegation, each with governance that already knows its name. Action, procedure, principal. File accordingly.
References
- Hugging Face — agent glossary (May 2026) — the tool/skill/sub-agent definitions this post builds on, paraphrased with credit.
- TrueFoundry MCP Gateway — per-call governance for the tool rung: registry, auth, RBAC, guardrails, audit.
- TrueFoundry Skills Registry — per-artifact governance for the skill rung: versions, provenance, RBAC, on-demand mounting.
- TrueFoundry Agent Harness — per-principal governance for the sub-agent rung: isolation, budgets, traces.
- The agent glossary, mapped to production — this series' anchor.
Ingrid is an illustrative composite, not a specific person, organization, or incident. Vocabulary follows Hugging Face's agent glossary (May 2026), which notes the tool/skill/sub-agent boundary shifts across frameworks. Configuration snippets are simplified for readability and not literal product schema. TrueFoundry capabilities reflect public docs at the time of writing; verify against current documentation.



























