ADR-0014: Cloudflare Workers as the runtime platform

Status: Accepted
Date: 2026-04-21

Context

The “runtime platform” question has been listed as open since the monorepo was scaffolded. Every subsequent decision — the logger choice, the secret-management story, the observability plumbing, the database client, the HTTP framework — depends on where agent code actually runs. With ADR-0013 now defining what “enterprise-ready” means for this project, deferring the platform decision further would mean writing enterprise-ready components against an abstract target, which is the worst of both worlds: abstract enough to be wrong, concrete enough to have to redo.

The options considered in the open question were:

Cloudflare Workers + Durable Objects. Named as the default in the earliest planning sessions. Per-agent persistent state fits the Durable Object primitive naturally. Global low-latency dispatch, scale-to-zero billing, integrated storage (D1, KV, R2, Vectorize), integrated logging (Workers Logs), and tail workers for observability without extra infrastructure.
Node.js long-running (Fly.io / Railway / ECS). Maximum ecosystem compatibility. Every npm package works. No platform-specific constraints on CPU time, memory, or code size. The costs: cold starts, per-instance billing, and a separate logging / metrics / secrets stack to choose and operate.
Hybrid (Workers for fast paths, Node for long-running). Two runtimes to understand, two deployment pipelines, two sets of platform quirks. The implied benefits only materialize at a scale we do not yet have.

What changed the calculus

SQLite-backed Durable Objects are generally available. Durable Objects are now a production primitive for per-agent state, not a beta.
The project’s step-by-step development preference. A platform that requires operating more infrastructure (Node + orchestrator + log aggregation + metrics backend + secrets manager) means more that can break during each step. Workers collapses most of that into the platform.
First vertical is e-commerce, consumed via HTTPS. The workload is request/response with occasional background tasks (via Queues or Durable Object alarms), not long-running batch jobs. This is the Workers sweet spot.
Cost model at the current stage. Scale-to-zero matters when the platform is pre-revenue. A $5/month Workers Paid plan covers everything we need today; a Node deployment that is “always on” starts at ~$20/month for a single small instance and adds separately billed services for logging, metrics, and secrets.

What has not changed

The code is platform-agnostic where it can be. TypeScript targets ES2022 (ADR-0002); the core and schemas packages have no runtime dependencies beyond Zod; the runtime package uses only structuredClone and Object.freeze, both native on both platforms. Moving off Workers later would be a deployment change, not a rewrite.
The storage choice is still open (open-questions.md#storage-primitives). Committing to Workers narrows it (D1 and KV become defaults rather than options) but does not foreclose it: Hyperdrive would allow Postgres from Workers if we later decide we need it.
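
As an illustration of the portability claim above, here is a minimal sketch of an immutable state snapshot built only from structuredClone and Object.freeze. The helper names (deepFreeze, snapshot) and the sample state shape are hypothetical and are not taken from packages/runtime; the point is only that nothing here needs a Workers-specific or Node-specific API.

```typescript
// Hypothetical helpers for illustration; not the actual packages/runtime code.

/** Recursively freezes plain objects and arrays in place. */
function deepFreeze<T>(value: T): T {
  if (typeof value === "object" && value !== null && !Object.isFrozen(value)) {
    Object.freeze(value);
    for (const child of Object.values(value)) {
      deepFreeze(child);
    }
  }
  return value;
}

/** Returns a frozen copy that shares no references with the input. */
function snapshot<T>(state: T): Readonly<T> {
  // structuredClone detaches the copy; deepFreeze then makes it read-only.
  return deepFreeze(structuredClone(state));
}

// Assumed sample state; the field names are made up for this sketch.
const draft = { agentId: "agent-1", steps: [{ name: "plan", done: false }] };
const frozen = snapshot(draft);

draft.steps[0].done = true; // mutating the draft leaves the snapshot untouched

console.log(frozen.steps[0].done); // false
console.log(Object.isFrozen(frozen.steps)); // true
```

One caveat with this approach: Object.freeze does not stop mutation of Map or Set contents, so the sketch is only adequate for plain objects and arrays.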

Decision

The Agent Platform runs on Cloudflare Workers. Specifically:

Agent runtime code is deployed as Workers modules (ES modules format).
Per-agent persistent state lives in Durable Objects, using the SQLite-backed storage backend because it is GA and has a better introspection story than the key-value storage backend.
Storage primitives default to the Cloudflare stack (D1 for relational, KV for cache / working memory, Vectorize for long-term memory embeddings, R2 for files). Each specific choice still needs its own ADR when the component that uses it ships; this ADR only sets the default from which to argue.
The nodejs_compat compatibility flag is enabled in wrangler.toml for any Worker that benefits from Node built-ins. We do not lean on it for core abstractions.
The compatibility_date for every Worker is pinned to a specific date and bumped deliberately, not floated.
Local development uses Wrangler’s built-in dev server; tests that specifically exercise Worker behavior use @cloudflare/vitest-pool-workers (deferred — see Consequences below).
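
To make the first two points of the decision concrete, the following is a rough sketch of a Worker in ES modules format that routes each request to a per-agent Durable Object. The binding name AGENT, the /agents/<agentId> route shape, and the local "Like" interfaces are assumptions made for this example; a real Worker would take its types from Cloudflare's type definitions instead of declaring stand-ins.

```typescript
// Minimal local stand-ins for the Workers types so the sketch type-checks
// outside the Workers runtime. They mirror only the calls used below.
interface DurableObjectStubLike {
  fetch(request: Request): Promise<Response>;
}
interface DurableObjectNamespaceLike {
  idFromName(name: string): unknown;
  get(id: unknown): DurableObjectStubLike;
}
interface Env {
  AGENT: DurableObjectNamespaceLike; // hypothetical binding name
}

const worker = {
  async fetch(request: Request, env: Env): Promise<Response> {
    const url = new URL(request.url);
    // Hypothetical route shape: /agents/<agentId>/...
    const agentId = /^\/agents\/([^/]+)/.exec(url.pathname)?.[1];
    if (agentId === undefined) {
      return new Response("not found", { status: 404 });
    }
    // The same name always maps to the same object, so each agent's
    // state stays with one Durable Object instance.
    const id = env.AGENT.idFromName(agentId);
    return env.AGENT.get(id).fetch(request);
  },
};

// In a deployed Worker module this object would be the default export:
// export default worker;
```

Because the handler depends only on the env argument, it can be exercised with a hand-written fake namespace in a plain Node test run, which fits the deferred-testing stance described under Consequences.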

Consequences

Every subsequent platform-shaped ADR has a concrete target. The logger ADR can specify that it works within Workers’ console/Workers Logs pipeline rather than abstracting over three backends. The secret-management ADR can specify Worker secrets (wrangler secret put) rather than “some secrets manager.” The HTTP-framework ADR (currently open) narrows to Hono or a raw fetch handler; Fastify and Elysia are off the table for Workers.
The packages/core and packages/schemas promises are preserved. Neither package depends on anything Workers-specific and neither will. The Workers-specific code lives in apps/* and in packages/runtime only to the extent the runtime needs to call platform APIs (bindings, Workers Logs, Durable Object stubs). Business Packs stay platform-agnostic to the maximum extent possible — vertical logic is not Workers logic.
We take on platform-specific constraints. These are a CPU-time budget per request (currently 30 s by default on Workers Paid; the limit can be raised per Worker through the limits.cpu_ms setting, with CPU time billed by usage), 128 MiB of memory per isolate, code-size caps, and no eval. Every component ADR that could bump into one of these limits must say so explicitly.
Workers Logs is the default log sink. Bar 7 from ADR-0013 (“structured logs”) is satisfied by emitting console.log(JSON.stringify(...)) into the Workers Logs pipeline. The logger wrapper (future ADR) abstracts the call so tests do not need Workers to run.
Observability is mostly solved out of the box. Request logs, CPU-time metrics, invocation counts, and exception captures are provided by the platform. Custom metrics go via Workers Analytics Engine. This means bars 5 and 6 from ADR-0013 (LLM-call trace, audit record) are additions on top of an already-capable substrate rather than an observability stack to build from scratch.
Testing inside the Workers runtime is deferred, with a trigger. @cloudflare/vitest-pool-workers requires Vitest 4.1+; we currently pin Vitest 3.2.4 (ADR-0004). Today’s tests (62 in the workspace) exercise platform-agnostic logic and run in Vitest’s Node environment — that is correct for what they test. The trigger to revisit: the first component whose behavior depends on a Workers-specific API (a binding, Durable Object lifecycle, Queues, KV) is added, at which point ADR-0004 is superseded with an ADR bumping Vitest and adding @cloudflare/vitest-pool-workers. Platform-agnostic packages (core, schemas, current runtime) keep running in the Node pool regardless; only Worker-specific packages run in the Worker pool.
Deployment is Wrangler-based: one wrangler deploy per Worker. CI/CD (open-questions.md#cicd) is still an open question in the “what deploys and when” sense, but the deploy tool is no longer a variable.
Business Packs remain the escape valve. If a future vertical’s workload is fundamentally not Workers-shaped (long-running Python ML inference, say), that pack can be a separate service the platform calls over the network. The core platform stays on Workers; verticals that do not fit become integrations, not rewrites.
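
The logger wrapper mentioned above is left to a future ADR, so the following is only a sketch of one possible shape. It assumes a hypothetical createLogger factory with an injectable sink and clock; the default sink emits console.log with a JSON string, which is the mechanism the Workers Logs consequence describes, while tests swap in an in-memory sink and need no Workers runtime.

```typescript
// Hypothetical logger shape; the real design is deferred to the logger ADR.
type LogLevel = "debug" | "info" | "warn" | "error";
type LogFields = Record<string, unknown>;
type LogSink = (line: string) => void;

interface Logger {
  log(level: LogLevel, message: string, fields?: LogFields): void;
}

function createLogger(
  sink: LogSink = (line) => console.log(line), // default: one JSON line per event
  now: () => Date = () => new Date(),
): Logger {
  return {
    log(level, message, fields = {}) {
      // Caller fields are spread first so they cannot overwrite the
      // reserved keys (level, message, timestamp).
      sink(JSON.stringify({ ...fields, level, message, timestamp: now().toISOString() }));
    },
  };
}

// Test-style usage: capture lines in memory and pin the clock.
const lines: string[] = [];
const testLogger = createLogger((line) => lines.push(line), () => new Date(0));
testLogger.log("info", "order.created", { orderId: "o-1" });

console.log(lines[0]);
// {"orderId":"o-1","level":"info","message":"order.created","timestamp":"1970-01-01T00:00:00.000Z"}
```

Injecting the sink rather than mocking console keeps the test independent of any global state, at the cost of one extra parameter at construction time.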

Consequences for the repo

apps/* will contain Workers, each with its own wrangler.toml and compatibility_date.
packages/runtime will grow a thin Workers-platform shim only when needed (e.g. reading from a Durable Object binding). The shim is behind an interface so non-Worker test environments can substitute it.
tsconfig.base.json already targets ES2022 with DOM and DOM.Iterable libs (ADR-0002); this is compatible with Workers’ V8 isolate. No change needed.
pnpm-workspace.yaml does not change.
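
The shim-behind-an-interface idea for packages/runtime could look roughly like the sketch below. All names here (AgentStateStore, workersStateStore, inMemoryStateStore, incrementCounter) are hypothetical. KeyValueStorageLike is a local stand-in for the small get/put subset of Durable Object storage that the shim would wrap.

```typescript
// Hypothetical port: runtime logic depends on this interface only.
interface AgentStateStore {
  get(key: string): Promise<unknown>;
  put(key: string, value: unknown): Promise<void>;
}

// Local stand-in for the get/put subset of Durable Object storage.
interface KeyValueStorageLike {
  get(key: string): Promise<unknown>;
  put(key: string, value: unknown): Promise<void>;
}

/** Thin Workers-side adapter; the only place that touches the platform object. */
function workersStateStore(storage: KeyValueStorageLike): AgentStateStore {
  return {
    get: (key) => storage.get(key),
    put: (key, value) => storage.put(key, value),
  };
}

/** Substitute for non-Worker test environments. */
function inMemoryStateStore(): AgentStateStore {
  const data = new Map<string, unknown>();
  return {
    async get(key) {
      // Clone on the way out so callers cannot mutate stored values.
      return data.has(key) ? structuredClone(data.get(key)) : undefined;
    },
    async put(key, value) {
      data.set(key, structuredClone(value));
    },
  };
}

/** Example of runtime logic written only against the interface. */
async function incrementCounter(store: AgentStateStore, key: string): Promise<number> {
  const current = await store.get(key);
  const next = (typeof current === "number" ? current : 0) + 1;
  await store.put(key, next);
  return next;
}
```

With this split, incrementCounter runs unchanged in Vitest's Node environment against inMemoryStateStore, and only workersStateStore would need the Worker pool once it exists.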

Alternatives considered

Node.js on Fly.io or Railway. The most portable option. Rejected because portability is not the problem we need to solve today: every requirement we have is met by Workers, and the operational complexity of running Node (logging, metrics, secrets, HA, cold-start mitigation, process supervision) is work we would have to do ourselves. The enterprise-readiness bar from ADR-0013 is easier to meet on a platform that provides audit-grade logging and secret management out of the box than on one where we assemble them from parts.
Hybrid Workers + Node. Considered for the case where a future workload genuinely doesn’t fit Workers. Rejected for Phase 1 because we do not have such a workload. If one appears, it becomes a specific, scoped integration in a Business Pack — not a cross-cutting architectural choice.
AWS Lambda. Comparable serverless semantics to Workers but with a substantially heavier platform (VPC, IAM, CloudWatch, Secrets Manager, API Gateway for HTTP). Rejected because the “integrated platform” argument that favors Workers over Node favors it over Lambda even more sharply.
Bun runtime on a VPS. Fast, Node-compatible, good DX. Rejected for the same reason Node is rejected: we would still be operating the infrastructure ourselves, plus Bun is a newer runtime with a smaller production track record in the contexts that matter (edge deployment, managed observability).
Stay on “no decision” and make platform-agnostic abstractions. The status quo. Rejected because ADR-0013 requires every subsequent component to meet a concrete enterprise bar, and you cannot meet a concrete bar with abstract plumbing. “Works on any platform” means “tested on none.”