Cloudflare Durable Objects

A Durable Object is a Worker with state. Where a regular Worker is request-scoped (no memory between requests), a Durable Object has its own private storage and runs as a single instance worldwide for a given object ID. The platform uses one Durable Object class — AgentJob — to run asynchronous agent jobs to completion.

The /jobs async path. When an HTTP request hits /jobs with a long-running agent task, the Worker:

  1. Generates a job_id
  2. Creates an AgentJob Durable Object with that ID
  3. Stores the request payload in the DO’s storage
  4. Schedules an alarm 1 second in the future
  5. Returns 202 Accepted { job_id, status: "queued" } immediately

When the alarm fires, the DO:

  1. Reads its stored payload
  2. Runs the full agent turn (no 30-second wall-time budget — DOs can run for tens of minutes if needed)
  3. Writes the result to its own storage
  4. Updates the JOBS_INDEX KV namespace so the job appears in /jobs listings
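The alarm-driven run can be sketched like this. Again an illustration, not the real class: the storage keys (`payload`, `report`, `status`) and `runAgentTurn` are assumptions, and `JobStorage` is a minimal stand-in for the Durable Object storage API.

```typescript
// Stand-in for DurableObjectState.storage, so the sketch is self-contained.
interface JobStorage {
  get<T>(key: string): Promise<T | undefined>;
  put(key: string, value: unknown): Promise<void>;
}

export class AgentJobSketch {
  constructor(private storage: JobStorage) {}

  // fetch() (not shown) would store the payload and schedule the 1-second alarm.

  async alarm(): Promise<void> {
    const payload = await this.storage.get<{ task: string }>("payload"); // 1. read payload
    if (!payload) return;                                                // nothing seeded yet
    const report = await this.runAgentTurn(payload);                     // 2. full agent turn
    await this.storage.put("report", report);                            // 3. persist result
    await this.storage.put("status", "completed");
    // 4. the real class would also mirror the status into the JOBS_INDEX KV namespace
  }

  private async runAgentTurn(payload: { task: string }): Promise<string> {
    // Stand-in for the real agent loop (LLM calls, tool calls, delegation).
    return `report for ${payload.task}`;
  }
}
```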

The HTTP client polls /jobs/:id until status is completed and reads the report.
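A polling client might look like the following sketch. The `/jobs/:id` shape comes from the text; the `{ status, report }` response fields, the 2-second interval, and the function name are assumptions.

```typescript
type JobView = { status: string; report?: string };

// Poll /jobs/:id until the job reports "completed", then return its report.
export async function pollJob(
  baseUrl: string,
  jobId: string,
  fetchFn: (url: string) => Promise<Response> = fetch,
  intervalMs = 2000,
): Promise<string | undefined> {
  for (;;) {
    const res = await fetchFn(`${baseUrl}/jobs/${jobId}`);
    const job = (await res.json()) as JobView;
    if (job.status === "completed") return job.report;
    await new Promise((r) => setTimeout(r, intervalMs));
  }
}
```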

The design question was: how do we run agent turns that exceed the Workers wall-time limit?

Options and verdicts:

  • Durable Objects: Chosen. Native Cloudflare primitive; alarms run with no wall-time limit; per-object storage gives us per-job state without adding a database.
  • Cloudflare Workflows: Strong contender. Built specifically for durable execution with retry/checkpoint semantics. Slightly heavier API. Tracked as a Phase 2 reconsideration.
  • Off-platform queue + worker (e.g. SQS + ECS): Adds infra. Loses the “one Worker handles everything” property.
  • Run synchronously and cap at 30s: Some agent turns just won’t fit (deep delegation chains, many tool calls). A hard cap is wrong.

DOs won because they were the smallest tool that solved the problem: per-job state + alarm-driven execution, no external services, no extra deploy unit. ADR-001 (the original “do we use LangGraph?” decision) is still open precisely because Workflows might be a better fit at Phase 2 scale.

Durable Objects (Workers Paid):

  • 1M requests per month free
  • 400K GB-seconds compute free (the duration × memory product)

A Phase 1 async job is 1 request + ~20 seconds of execution at ~128 MB ≈ 2.5 GB-seconds per job. 400K ÷ 2.5 ≈ 160K async jobs per month free.
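The arithmetic above, spelled out (numbers taken from the pricing figures in this section):

```typescript
// duration × memory: 20 s at 128 MB = 2.5 GB-seconds per job
const gbSecondsPerJob = 20 * (128 / 1024);
// 400K GB-seconds free per month / 2.5 GB-s per job = 160K jobs/month free
const freeJobsPerMonth = 400_000 / gbSecondsPerJob;
```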

After free tier: $0.15 per million requests; $12.50 per million GB-seconds.

DO storage is billed alongside the rest of Workers KV/D1; per-DO storage is small (a single job’s payload + report) and aggregates to KB-scale.

The alternative would be a separate background worker fleet: typically a cluster of long-running Node processes, scaled on a queue-depth metric, with its own monitoring and deploy pipeline. DOs reduce all of that to a class declaration in the Worker bundle plus a wrangler.toml binding.

  • apps/worker/src/agent-job.ts — the AgentJob Durable Object class; fetch() for the initial setup, alarm() for the actual run
  • apps/worker/src/handlers.ts — the /jobs POST handler that creates and seeds the DO
  • apps/worker/src/job-index.ts — the KV-backed listing layer (DOs are private; KV holds the discoverable index)
  • apps/worker/wrangler.toml — the [[durable_objects.bindings]] block
  • One job = one DO instance. Each /jobs POST creates a fresh DO. We don’t reuse them. This is fine because DOs are cheap; cleanup happens via /jobs/:id DELETE (which clears the DO’s storage and removes from the index). Orphan cleanup at scale is tracked as follow-up #4.
  • No cross-job state. Each DO is isolated. If two jobs need to share state (e.g. coordinating a saga), we’d need an additional shared resource. Phase 1 doesn’t need this.
  • Alarms are best-effort scheduling. An alarm scheduled for 1 second from now might fire 100ms late. For agent jobs this is fine; the latency budget is dominated by LLM calls.
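For reference, the [[durable_objects.bindings]] block mentioned above looks roughly like this (a sketch: the binding name, class name, and migration tag are assumed to match the AgentJob class, not copied from the real config):

```toml
[[durable_objects.bindings]]
name = "AGENT_JOB"        # env.AGENT_JOB in the Worker
class_name = "AgentJob"   # the class in apps/worker/src/agent-job.ts

# Durable Object classes must be introduced via a migration.
[[migrations]]
tag = "v1"
new_classes = ["AgentJob"]
```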