ADR-0032: Asynchronous coordination primitives — queues, event bus, outbox
Status: Accepted
Date: 2026-05-02
Extends: ADR-0027 (storage primitives per memory layer and data type)
Related:
- ADR-0014 (runtime is Cloudflare Workers)
- ADR-0024 (Durable Objects for async job records)
- ADR-0023 (tools as the agent-runtime extension point; the event bus is exposed via a built-in tool)
- ADR-0031 (agents as data; topology questions deferred to a future ADR)
Context
ADR-0027 committed the platform’s durable storage primitives across all six memory layers and the major data types (job records, audit trails, traces, configs, blobs). It deliberately did not address asynchronous coordination — Cloudflare Queues, the event bus, agent-to-agent handoff state — because no implementation needed it yet.
The Phase 1 demo scenario (Order Triage + Refund Decision + Communication, locked separately) forces this question into the open. Concretely:
- The triage agent classifies an incoming customer email and delegates to a refund-decision agent. Synchronous delegation is already supported by the runtime.
- The refund-decision agent, after deciding, must emit either an “auto-approved refund action” event (consumed by a Worker that calls Shopify) or an “escalate to human review” event (consumed by a queue that surfaces in a human review tool). Both paths cross a process boundary the runtime does not currently span.
- A communication agent drafts customer-facing emails. In v1 these are either written to a stub (JSON file/log) or emitted as another event for a downstream sender; either way the abstraction is the same.
This ADR settles three things:
- What primitive holds in-flight asynchronous work — what does the queue look like, who provides it.
- What the event bus is — is it a thin convention over the queue, or a real abstraction with its own package and interface.
- What we do (or don’t do) about transactional consistency between D1 writes and queue publishes — the outbox question.
Decision
1. Cloudflare Queues is the v1 transport
The platform uses Cloudflare Queues for asynchronous work between agents and consumers. Properties relied upon:
- At-least-once delivery. Consumers must be idempotent.
- Batched delivery to consumers. A single consumer Worker invocation may receive multiple messages.
- Dead-letter queue (DLQ) per topic. Repeated failures route to DLQ for human inspection.
- No global ordering guarantees. Per-topic ordering is best-effort; per-message-key ordering is not provided.
Two queues are needed for the demo scenario, and they exemplify the v1 patterns:
| Queue | Producer | Consumer | Properties |
|---|---|---|---|
| human_review | Refund-decision agent (and any future agent emitting escalations) | A future human-review surface (UI eventually; today: log + DLQ inspection) | Async, low-volume, best-effort |
| shopify_actions | Refund-decision agent (and any future agent that issues e-commerce actions) | Worker that translates events into Shopify Admin API calls | Async, idempotent (consumer enforces), per-order serialization handled at consumer |
Cross-message ordering is not enforced at the queue level. When per-resource serialization matters (e.g. don’t apply refund-then-cancel to the same order out of order), the consumer enforces ordering via idempotency keys keyed on the resource ID, plus a DO-per-resource if linearization is genuinely required. This is the pattern used today; documented here so it is not reinvented per consumer.
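The consumer-side pattern can be sketched as follows. This is illustrative only — `RefundEvent`, `IdempotentConsumer`, and the in-memory key set are stand-ins; the real consumer would persist keys in D1 or a DO rather than in process memory:

```typescript
// Sketch: consumer-side deduplication keyed on the resource ID + action type.
// All names here are hypothetical, not the shipped consumer.

interface RefundEvent {
  order_id: string;
  action_type: "refund" | "cancel";
}

class IdempotentConsumer {
  // In production this set would live in D1 or a DO; a Set suffices for the sketch.
  private processed = new Set<string>();
  public applied: RefundEvent[] = [];

  // Returns true if the event was applied, false if it was a duplicate no-op.
  handle(event: RefundEvent): boolean {
    const key = `${event.order_id}:${event.action_type}`; // dedup key from the payload
    if (this.processed.has(key)) return false; // at-least-once duplicate: no-op
    this.processed.add(key);
    this.applied.push(event); // stand-in for the real side effect (e.g. Shopify call)
    return true;
  }
}
```

Per-resource linearization, when genuinely required, layers a DO-per-resource on top of this; for the v1 topics the dedup key alone suffices.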
Why not DO-coordinated queues
A “DO owns ordered state, Queues handles transport” pattern was considered. Rejected for v1:
- For the Scenario B queues (human review, Shopify actions), cross-order ordering is never required.
- Per-order ordering — when a single order has multiple events emitted — is enforced by the consumer (idempotency keys + DO-per-order if needed) without queue-side coordination.
- DO-coordinated queues add a moving part (the coordinator DO, its lifecycle, its state) for a property no current consumer needs.
The pattern is named explicitly so that when a future use case does need cross-message linearization (e.g. “apply these inventory adjustments in the order they were emitted, across multiple SKUs”), this ADR’s deferred-list catches the need rather than the implementation reinventing it.
2. The event bus is a thin abstraction layer in packages/event-bus
The platform exposes an `EventBus` interface to producers (agents, the runtime, scheduled jobs) rather than letting them publish to Cloudflare Queues directly:
`interface EventBus { publish<T>(topic: TopicName, event: T): Promise<void> }`

- The `EventBus` interface lives in a new `@agent-platform/event-bus` package (interface + types only — same separation pattern as `@agent-platform/llm` and `@agent-platform/embeddings`).
- The Cloudflare Queues implementation lives in `@agent-platform/event-bus-cloudflare` (or in `event-bus` directly if the v1 surface is small enough — the implementation commit settles this).
- Tests use a mock `EventBus` that records published events, identical to the `MockEmbeddingAdapter` and `ModelAdapter` mock patterns.
- Producers receive an `EventBus` instance, not a Queue binding. Per-tenant scoping, observability, and validation hooks live in the abstraction.
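A minimal sketch of that recording test double, assuming a `TopicName` union of the two v1 topics (the real package may type topics differently):

```typescript
// Sketch of the mock bus described above: records publishes for assertions.
// The TopicName union is an assumption for the sketch, not the shipped type.

type TopicName = "human_review" | "shopify_actions";

interface EventBus {
  publish<T>(topic: TopicName, event: T): Promise<void>;
}

class MockEventBus implements EventBus {
  public published: Array<{ topic: TopicName; event: unknown }> = [];

  async publish<T>(topic: TopicName, event: T): Promise<void> {
    this.published.push({ topic, event }); // record instead of transporting
  }
}
```

A test hands the producer a `MockEventBus` and asserts on `published` — no queue binding, no network.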
Why a thin abstraction beats raw Queues with topic conventions
Three reasons:
- Consistency with the rest of the codebase. Every other external dependency (LLMs, embeddings, Shopify) sits behind an abstraction package per ADR-0028’s “no vendor loyalty” principle. Queues should not be the exception.
- Schema validation at publish time. Each topic has a Zod schema for its payload. The bus enforces validation; consumers can rely on the message shape. Without the abstraction, every publish site duplicates the validation logic or skips it.
- Migration path. When the platform runs in non-Cloudflare environments (operator preference, cost arbitrage, multi-cloud), or when in-process delivery becomes useful for tests and local dev, the swap is one impl package — not a refactor of every publish call site.
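A sketch of publish-time enforcement. The real bus uses per-topic Zod schemas; a hand-rolled validator stands in here so the sketch stays dependency-free, and `ValidatingBus` / `topicValidators` are illustrative names:

```typescript
// Sketch: every publish is validated against its topic's schema before transport.
// A validator function stands in for the Zod schema in the real implementation.

type Validator = (payload: unknown) => string | null; // null = valid, string = error

const topicValidators: Record<string, Validator> = {
  human_review: (p) =>
    typeof p === "object" && p !== null && typeof (p as { reason?: unknown }).reason === "string"
      ? null
      : "human_review payload needs a string `reason`",
};

class ValidatingBus {
  public sent: Array<{ topic: string; payload: unknown }> = [];

  publish(topic: string, payload: unknown): Promise<void> {
    const validate = topicValidators[topic];
    if (!validate) throw new Error(`unknown topic: ${topic}`);
    const err = validate(payload);
    if (err) throw new Error(`schema violation on ${topic}: ${err}`); // fail at publish, not at consume
    this.sent.push({ topic, payload }); // stand-in for the Cloudflare Queue send
    return Promise.resolve();
  }
}
```

Because validation lives inside the bus, no publish call site can skip it — which is the point of rule 1 above.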
The cost of the abstraction is one small package. Trivial.
Built-in tool: emit_event
Per ADR-0023, the agent runtime’s extension surface is the tool registry. The event bus reaches agents via a built-in `emit_event(topic, payload)` tool, registered when an agent’s YAML lists it under `tools:`. The tool is a thin wrapper around `EventBus.publish` — same factory pattern as the memory tools in `@agent-platform/builtin-tools` (per ADR-0030 §6).
This means: an agent’s YAML names `emit_event` in its tools list. The agent host wires the tool with a per-turn-fresh `EventBus` instance. The agent calls `emit_event(topic="shopify_actions", payload={...})` from its loop. Validation, queueing, and (eventually) audit logging happen inside the abstraction.
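A sketch of the factory shape described above — `ToolDef`, `EventBusLike`, and the handler signature are assumptions for the sketch, not the real types in `@agent-platform/builtin-tools`:

```typescript
// Sketch of the emit_event built-in tool factory. The getBus indirection lets
// the host supply a per-turn-fresh bus instance, as described in the text.

interface EventBusLike {
  publish(topic: string, event: unknown): Promise<void>;
}

interface ToolDef {
  name: string;
  description: string;
  handler: (args: { topic: string; payload: unknown }) => Promise<string>;
}

function createEmitEventTool(getBus: () => EventBusLike): ToolDef {
  return {
    name: "emit_event",
    description: "Publish a validated event to a named topic on the event bus.",
    handler: async ({ topic, payload }) => {
      await getBus().publish(topic, payload); // validation happens inside the bus
      return `event published to ${topic}`;
    },
  };
}
```

The host registers `createEmitEventTool(() => perTurnBus)` when the agent’s YAML lists the tool.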
3. Outbox pattern: explicitly deferred
The transactional consistency problem: an agent writes a decision to D1, then publishes an event to the bus. If the second call fails after the first succeeds, the event is lost — Shopify never gets called, but the D1 row says it was. If we publish first and write second, we may emit events for state that didn’t persist.
The standard solution is the transactional outbox: the state write and the event both land in the same D1 transaction (the event is queued in an outbox table alongside the state write); a separate process (DO or polling Worker) then drains the outbox to the queue with at-least-once delivery.
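For concreteness, a sketch of the pattern this section defers — it is not built in v1. An in-memory store stands in for D1, a plain method stands in for the transaction, and all names are illustrative:

```typescript
// Sketch of the deferred transactional outbox. NOT built in v1 — included only
// to make the moving parts concrete. In-memory structures stand in for D1.

interface OutboxRow { id: number; topic: string; payload: unknown; sent: boolean }

class OutboxStore {
  decisions: Array<{ orderId: string; decision: string }> = [];
  outbox: OutboxRow[] = [];
  private nextId = 1;

  // Both writes succeed or neither does — the property the outbox exists for.
  // In D1 this would be one transaction over two tables.
  recordDecision(orderId: string, decision: string, topic: string, payload: unknown): void {
    this.decisions.push({ orderId, decision });
    this.outbox.push({ id: this.nextId++, topic, payload, sent: false });
  }

  // Drainer (polling Worker / DO): publish unsent rows, mark sent only after the
  // publish succeeds. A crash between publish and mark yields a duplicate, not a
  // loss — at-least-once, which the idempotent consumers absorb.
  drain(publish: (topic: string, payload: unknown) => void): number {
    let drained = 0;
    for (const row of this.outbox) {
      if (row.sent) continue;
      publish(row.topic, row.payload);
      row.sent = true;
      drained++;
    }
    return drained;
  }
}
```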
Decision: outbox is not built in v1. Reasons:
- The Phase 1 demo’s queues — human review and Shopify actions — both tolerate best-effort delivery:
- Human review: A lost escalation is recoverable by humans noticing the gap. Embarrassing, not catastrophic.
- Shopify actions: The consumer enforces idempotency on `order_id + action_type`. A duplicate event is a no-op. A lost event means the action doesn’t happen — recoverable by retry, by the agent re-running on the next email, or by manual operator intervention.
- The outbox pattern is meaningful infrastructure: a polling DO or scheduled Worker, an outbox schema migration, careful failure-mode testing. Easily 2–3 commits.
- Building it before any production traffic exists is speculative. We don’t yet know which events justify the cost; some might be fine forever as best-effort.
Trigger condition for revisit: when a queue publish failure causes user-visible state divergence in production traffic — e.g. a refund decision recorded in D1 but never reaching Shopify, observed by a customer or operator — build the outbox. The trigger is concrete and observable, not hypothetical.
In the interim, agents and consumers operate under these rules:
- D1 write is the durable record. If D1 succeeded and the queue publish failed, the event can be reconstructed from D1 — at the cost of a re-run or a manual replay.
- Consumers are idempotent. Required, not optional. Duplicate delivery is cheaper than lost delivery for our use cases.
- Publish failures are logged loudly. A failed publish writes a `ToolError` log entry that includes the topic, payload, and the D1 record ID, so a manual replay is at most a few seconds of operator work.
- Critical paths are documented. When an agent’s flow has a publish where state divergence would be customer-visible, the agent’s YAML notes this in `escalation_rules` or a comment, so a future audit can find them.
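The “logged loudly” rule can be sketched as a wrapper around the publish call. `publishLoudly` and the `ToolErrorLog` shape are assumptions for illustration, not the shipped logging API:

```typescript
// Sketch: a failed publish logs everything a manual replay needs — topic,
// payload, and the D1 record ID that remains the durable record.

interface ToolErrorLog {
  kind: "ToolError";
  topic: string;
  payload: unknown;
  d1RecordId: string;
}

function publishLoudly(
  publish: (topic: string, payload: unknown) => void, // throws on failure
  topic: string,
  payload: unknown,
  d1RecordId: string,
  log: (entry: ToolErrorLog) => void,
): boolean {
  try {
    publish(topic, payload);
    return true;
  } catch {
    // D1 already holds the durable record; this entry makes replay a copy-paste job.
    log({ kind: "ToolError", topic, payload, d1RecordId });
    return false;
  }
}
```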
Updated summary tables (extends ADR-0027)
ADR-0027 published two tables (per memory layer, per data type). Adding the third:
Per coordination concern
| Concern | Primitive | Notes |
|---|---|---|
| Async work between agent and consumer | Cloudflare Queues via EventBus | Two topics for v1 (human review, Shopify actions); more added per scenario |
| Topic schema validation | Zod, in the bus impl | Per-topic Zod schemas; published events validated before transport |
| Per-resource serialization | Consumer-side idempotency + DO-per-resource if needed | No queue-level guarantee; lifted to consumer when required |
| Dead-letter for repeated failures | Cloudflare Queues DLQ | One DLQ per topic; manual inspection in v1, automated alarms post-Phase-1 |
| Transactional D1+publish | Deferred (best-effort + idempotency) | Outbox pattern named, not built; trigger condition documented |
Cross-cutting rules (extending ADR-0027’s six)
7. The event bus is the only path for async cross-process work between agents. Direct Queues bindings on agent-host code are forbidden — every publish goes through the bus. This makes the validation and observability hooks unbypassable.
8. Consumers are idempotent by construction. Each topic’s payload schema includes a key (resource ID + action type) the consumer uses to deduplicate. Topics that cannot be made idempotent are a smell — examine before adding.
9. Topic names are stable contracts. Renaming a topic is a breaking change requiring producer and consumer migration. Topic schema additions are non-breaking (add optional fields); removals and type changes are breaking.
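The schema-evolution rule can be illustrated with payload types (field names hypothetical): adding an optional field is non-breaking precisely because a consumer written against the old shape keeps working unchanged.

```typescript
// Sketch: why optional-field additions are non-breaking. Field names are
// illustrative, not a real topic schema.

interface EscalationV1 {
  order_id: string;
  reason: string;
}

// V2 adds an optional field — every V2 payload still satisfies a V1 reader.
interface EscalationV2 extends EscalationV1 {
  priority?: "low" | "high";
}

// A consumer written against V1 accepts V2 payloads with no code change.
function summarize(e: EscalationV1): string {
  return `${e.order_id}: ${e.reason}`;
}
```

Removing `reason` or changing its type, by contrast, would break `summarize` — hence the breaking-change rule.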
Considered and rejected
- DO-coordinated queues for v1. Rejected for the reasons above — adds a coordinator DO for properties no current consumer needs.
- Direct Cloudflare Queues bindings on agent-host code. Rejected per cross-cutting rule #7.
- Outbox pattern in v1. Deferred per “Outbox” section above; not skipped, just not now.
- Topic conventions without an abstraction. Rejected per the abstraction rationale above.
Explicitly deferred (do not decide here)
- Outbox / transactional handoff — named trigger condition; build when triggered.
- Replay / event sourcing — the bus exposes `publish` only; replay from D1 is the current answer. Event sourcing as a primary pattern would be a much larger shift.
- Cross-region delivery semantics — Cloudflare Queues are global by default; data residency questions inherit ADR-0027’s deferred answer.
- Cron / scheduled triggers — adjacent to queues but not in scope here. ADR-0033 (or a future amendment) when cron grows beyond the single weekly merchandising trigger.
- Topology / agent-set manifests — when multiple agent deployments per worker exist, queue topic ownership becomes a topology concern. Deferred per ADR-0031.
- Schema registry / cross-deployment topic compatibility — once multiple deployments share a queue, schema versioning becomes a concern. Single-deployment v1 doesn’t need it.
Trigger conditions for revisit
- A queue publish failure causes user-visible state divergence → build the outbox.
- A use case requires cross-message linearization → DO-coordinated queue, named in this ADR’s “considered and rejected.”
- Cloudflare Queues exhibits a quality issue (delivery latency, throughput, DLQ behavior) that affects a hot path → benchmark alternatives. Cloud-portable abstraction makes the swap bounded.
- A consumer cannot be made idempotent without unreasonable contortions → reconsider the topic shape, or escalate to outbox-with-deduplication-table.
Consequences
Becomes easy:
- Agent-to-async-consumer handoff is a single function call (`emit_event`) with the validation, observability, and (future) replay hooks centralized.
- Switching transports later is a single-package swap.
- Tests get an in-process bus implementation cheaply.
- Adding a new topic is a Zod schema + a `publish` call site + a consumer Worker — three small artifacts, no infrastructure changes.
Becomes hard / accepted tradeoffs:
- Best-effort publish. Lost events are possible. Idempotency and loud logging are the v1 mitigations; outbox is the v2 answer.
- Per-message keys live in payloads, not in the queue. Cloudflare Queues does not support partition keys; consumers extract the dedup key from the payload. Less ergonomic than partitioned queues, but workable.
- Two-package layout for the bus (interface + impl) is one more boundary to maintain. Same cost we accepted for LLMs and embeddings; consistent.
Explicitly out of scope:
- The implementation. This ADR locks the architecture; the next commit ships `@agent-platform/event-bus` with the interface, the Cloudflare Queues impl, the `emit_event` built-in tool, and tests. Per-topic schemas land with the consumers that need them, in subsequent commits.
Implementation pointers (for follow-up commits, not this ADR)
- `@agent-platform/event-bus` (new package): `EventBus` interface, `MockEventBus` for tests, possibly the Cloudflare Queues impl in the same package if the surface is small. Same shape as `@agent-platform/llm` + `@agent-platform/llm-anthropic`.
- `@agent-platform/builtin-tools` gains `createEmitEventTool(getBus)`. Same factory pattern as the memory tools.
- `apps/worker/wrangler.toml` gains `[[queues.producers]]` for each topic the worker can publish to, and `[[queues.consumers]]` for each topic it consumes. Names match the topic names exactly.
- DLQ wiring is per-topic, declared inline with the producer block.
- Per-topic Zod schemas land in `packages/schemas` (existing package — its scope is exactly this kind of trust-boundary validation).
Amendment — consumer policy v1 (commit 8, 2026-05-03)
The original ADR specified per-topic schemas + queue producers + (eventually) consumers, but did not commit retry, DLQ, or validation policy on the consume side. Commit 8 ships the first consumers (logs only) and locks v1 policy as follows:
Validation on consume. Every message is validated against the topic’s payload schema, even though the producer also validates. This catches three real failure modes that producer-side validation alone misses:
- Producer/consumer schema drift (one deployed first; the other is still on the old shape)
- Manual `wrangler queues send` injection during ops, which bypasses the producer
The cost is one Zod `safeParse` per message (microseconds for these payloads). The benefit is that producer/consumer disagreements surface as explicit `schema_mismatch` log lines rather than NPEs deep in the handler. The two-place definition (one in `bus.ts`, one in `queue-consumer.ts`) is intentional — both reference the same source schemas in `scenarios/<name>/schemas.ts`, so they can’t drift from each other; only the registries that wire them up could drift, and that drift is exactly what validation-on-consume catches.
Always-ack policy (no retry). v1 consumers are logs-only and idempotent. Every message is ack()’d after processing, including on schema-mismatch failures:
- Logs-only is idempotent; retrying gains nothing
- Schema mismatch is producer-side; retrying reproduces the same bug
- An exception in the consumer doesn’t auto-retry — `ack()` runs first
When Phase 2 mutations land (Shopify refunds, customer notifications), retry policy gets designed alongside the side-effect operations, not bolted onto the logs-only consumer.
No DLQ binding. Originally the implementation pointers section above mentioned per-topic DLQs. v1 defers this:
- DLQ requires creating two more queues + wrangler.toml updates + operator runbook updates — real ops cost
- Always-ack means nothing would land in a DLQ from this consumer anyway
- When retry policy matters, DLQ design is part of the same conversation
The pointer in the ADR is amended: producers ship in commit 6 with no DLQ binding; consumers ship in commit 8 with no DLQ binding. DLQ wiring is deferred to whenever retry semantics arrive.
Single `queue()` entry point. Cloudflare Workers permits exactly one `queue()` handler per Worker. The handler dispatches by `batch.queue` to per-queue functions. Adding a new queue means adding a function + a dispatch line; no Worker-level refactoring.
Bundle delta after consumer wire-up. ~4 KiB compressed (1179 → 1183 KiB). Negligible.