ADR-0031: YAML agent definition format

Status: Accepted Date: 2026-05-02 Related:

ADR-0006 (six-layer context model; agent definitions own Layers 1 and 2)
ADR-0023 (tools are referenced by name; loose binding from agent definition)

Context

Agent definitions in @agent-platform/core are typed as AgentDefinition — a frozen, validated structure containing metadata, core context, characteristics, tool refs, sub-agent refs, memory config, autonomy boundaries, escalation rules, and a model tier. The shape is locked.

Until now, every agent definition has been authored in TypeScript: apps/example/src/agents.ts, apps/worker/src/agents.ts, apps/worker/src/merchandising.ts. This works for the platform’s own scaffolding, but it isn’t the platform vision. The vision is agents are data: editable by operators, manageable in version control, and (eventually) configurable through a UI without recompiling the worker.

This ADR locks the on-disk format and the loader contract that produces validated AgentDefinition instances from that format.

Decision

Agent definitions live as YAML files on disk. A new @agent-platform/agent-loader package reads, validates, and resolves them into AgentDefinition instances at load time (worker startup, or build time for bundled deployments).

Format choices (the seven things deliberated)

1. File layout: one YAML file per agent.

agents/triage.yaml, agents/refund-decision.yaml, etc. Multi-agent files were considered and rejected: they invite ordering games (one agent’s escalation rule references a sibling agent in the same file), tangle PR review (a one-line agent edit is buried in a long file), and don’t map cleanly to a future UI’s “list of agents → click one to edit” flow.

The cost is that “topology” — which agents work together in a deployment — must be declared somewhere else. Today this is implicit (the worker’s agent host loads the directory and registers everything). When the platform supports multiple deployments per worker, an agent-set.yaml topology manifest will land. Deferred.

2. System prompt: schema accepts string OR { file: <path> }.

core_context:
  system_prompt: "Inline prompt for short cases."
  # OR:
  system_prompt:
    file: ./prompts/triage.md

Production system prompts are often 100+ lines. Editing them inside YAML’s quoting rules is painful; markdown editors give syntax highlighting; PR diffs of prompt edits read cleanly when the prompt is its own file. But forcing a separate file for every two-line sub-agent is overkill. Both forms are supported; the loader resolves the file form to a string before AgentDefinitionSchema validates.

3. Schema validation: Zod.

Consistent with the rest of the codebase. The existing AgentDefinitionSchema in @agent-platform/schemas is the validation target — the loader doesn’t add new schemas, it adapts YAML input to that schema’s shape.

JSON Schema was considered for its tooling ecosystem. Rejected: we don’t expose YAML schemas externally; internal consistency wins. Zod → JSON Schema conversion is available if/when we need it (third-party agent definitions, schema-driven UI form generators).

4. Tool references: loose, by name string.

tools:
  - shopify_get_order_by_email
  - send_email
  - emit_event

Tools are inherently dynamic (MCP tools load at runtime; built-in tools are name-resolved against ToolRegistry). Tight typed references would couple agent files to TypeScript imports — making agents code, not data. The loader validates names against the known tool list at load time; typos fail loudly at startup, not at build time.

5. Sub-agent references: loose, by name string.

Same reasoning. A future “compile” step could verify the sub-agent graph is acyclic; deferred.

6. Versioning: apiVersion: agent-platform/v1 mandatory.

apiVersion: agent-platform/v1
kind: Agent

K8s-style envelope. Two purposes:

Per-agent versioning (metadata.version: 0.1.0) is operator bookkeeping.
Platform apiVersion is loader bookkeeping. When v2 lands, the loader can support v1 and v2 simultaneously for a deprecation window. Old agents keep working; new ones use the new shape.

The cost is one extra mandatory line per file. Trivial. The benefit is a clean upgrade path that doesn’t exist if agents declare nothing about which schema version they target.

7. File location: loader code in packages/agent-loader/; YAML files in apps/worker/agents/.

Clean separation:

packages/* is reusable code (shared infrastructure)
apps/* owns its config (deploy-time artifacts)

Multiple apps can use the same loader against different agent directories. No cross-app coupling.

The loader interface

Two-layer API:

Primitive (runtime-agnostic):

loadAgentFromString(yamlSource: string, opts?: { reader?: FileReader; source?: string }): Promise<AgentDefinition>

Works in Worker, Node, tests. Caller supplies YAML as a string. Optional reader resolves system_prompt: { file: ... } references; optional source is an identifier for clear error messages.

Node convenience (uses fs/promises):

loadAgentFromFile(filePath: string): Promise<AgentDefinition>
loadAgentsFromDirectory(dirPath: string): Promise<Map<string, AgentDefinition>>

loadAgentFromFile reads one file, auto-builds a reader that resolves prompt files relative to the YAML’s directory.

loadAgentsFromDirectory walks a directory (non-recursive), loads every *.yaml/*.yml, returns a Map keyed by metadata.name. Duplicate names across files = fatal error.

Worker integration

The Worker has no fs at runtime. YAML files and prompt files are bundled at build time (esbuild text loader, wrangler text_blobs, or raw import of .yaml strings — to be locked in a future commit when the scenario lands). The loader’s loadAgentFromString works in the Worker because it doesn’t touch disk; bundled strings feed in directly with a custom reader.

What’s NOT in this ADR

Hot reload. Agents load once at startup. Reloading at runtime is a security hole (untrusted YAML mutates running agents) and a complexity vector. Not in v1.
Runtime YAML loading from R2/KV. Useful for multi-tenant where operators upload agents through a UI. Build-time bundling is enough for v1. Defer until needed.
Topology / agent-set manifests. When we have multiple deployments, this becomes a real concern. Defer until then.
Migration of existing hardcoded agents. The apps/example/src/agents.ts and apps/worker/src/agents.ts definitions stay. Migration is a separate, opt-in commit.

Error handling

All loader errors wrap as ConfigError (the existing class in @agent-platform/errors) with file path context. Operators see “couldn’t load agents/triage.yaml: invalid envelope — apiVersion required” instead of a raw ZodError stack. The schema’s .issues array is preserved in ConfigError.context for tooling that wants structured access.

Consequences

Positive:

Agent definitions are now data. Operators can edit YAML without TypeScript knowledge.
Schema is deterministic — what loads is exactly what AgentDefinition says it is.
File-per-agent layout maps cleanly to future tooling (UI, validation tools, git diffs).
ApiVersion envelope means future schema evolution doesn’t break old files.
Loose tool/sub-agent references decouple agent files from build-time TypeScript.

Negative:

One more file format to learn (though YAML is widely known).
Operators can author syntactically valid YAML that fails schema validation. Loud errors at startup mitigate this; future YAML-LSP schema integration would catch it earlier.
Worker bundling of YAML adds a small build-time complexity. Acceptable.
The loose tool name approach means typos surface at agent load time, not build time. Deemed acceptable; load happens at startup, errors are fatal and loud.

Implementation

@agent-platform/agent-loader package shipped 2026-05-02:

src/envelope.ts — apiVersion + kind validation
src/resolve.ts — system_prompt file ref resolution
src/loader.ts — main pipeline (parse → envelope → resolve → schema validate)
src/directory.ts — Node convenience: file + directory loaders
src/index.ts — public exports

~30 tests covering the full pipeline, edge cases, and error handling.

Open questions

YAML LSP / schema in editors. A .schema.json published in the repo would let editors validate agent YAML on save. Future work; not blocking.
Topology files. When the platform supports multiple deployments per worker, a separate topology format will be needed. The shape isn’t decided yet.