Cloudflare Durable Objects
A Durable Object is a Worker with state. Where a regular
Worker is request-scoped (no memory between requests), a Durable
Object has its own private storage and runs as a single instance
worldwide for a given object ID. The platform uses one Durable
Object class — AgentJob — to run asynchronous agent jobs to
completion.
What we use it for
The /jobs async path. When an HTTP request hits /jobs with a long-running agent task, the Worker:
- Generates a job_id
- Creates an AgentJob Durable Object with that ID
- Stores the request payload in the DO’s storage
- Schedules an alarm 1 second in the future
- Returns 202 Accepted { job_id, status: "queued" } immediately
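The steps above can be sketched as a Worker-side handler. This is a minimal, self-contained model: the `AGENT_JOB` binding name, the `seed` call, and the stub interfaces are illustrative stand-ins for the real Durable Object namespace API (`idFromName`/`get`), not the actual code in `handlers.ts`.

```typescript
import { randomUUID } from "node:crypto"; // Workers provide crypto.randomUUID() natively

// Minimal stand-ins for the Durable Object namespace API (hypothetical shapes).
interface DurableObjectStubLike {
  // Reduced to a single "seed" call; the real stub receives a forwarded fetch().
  seed(payload: unknown): Promise<void>;
}

interface DurableObjectNamespaceLike {
  idFromName(name: string): string;
  get(id: string): DurableObjectStubLike;
}

// Sketch of the /jobs POST flow described above.
async function handleJobsPost(
  env: { AGENT_JOB: DurableObjectNamespaceLike },
  payload: unknown
): Promise<{ status: number; body: { job_id: string; status: string } }> {
  const jobId = randomUUID();                 // 1. generate a job_id
  const id = env.AGENT_JOB.idFromName(jobId); // 2. derive the DO's identity from it
  const stub = env.AGENT_JOB.get(id);         //    one instance per id, worldwide
  await stub.seed(payload);                   // 3–4. DO stores payload + sets alarm
  return { status: 202, body: { job_id: jobId, status: "queued" } }; // 5. reply now
}
```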
When the alarm fires, the DO:
- Reads its stored payload
- Runs the full agent turn (no 30-second wall-time budget — DOs can run for tens of minutes if needed)
- Writes the result to its own storage
- Updates the JOBS_INDEX KV namespace so the job appears in /jobs listings
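The alarm-side lifecycle can be modeled in memory. This sketch uses a plain `Map` in place of Durable Object storage and the JOBS_INDEX KV namespace; the class shape and method names are illustrative, not the real `agent-job.ts`.

```typescript
type JobStatus = "queued" | "running" | "completed";

// In-memory sketch of the AgentJob alarm lifecycle (hypothetical shape).
class AgentJobSketch {
  private storage = new Map<string, unknown>(); // stand-in for DO storage
  status: JobStatus = "queued";

  constructor(payload: unknown) {
    this.storage.set("payload", payload); // seeded by the initial fetch()
  }

  // Stand-in for alarm(): run the agent turn to completion and persist it.
  async alarm(
    runAgent: (payload: unknown) => Promise<unknown>,
    index: Map<string, JobStatus>, // stand-in for the JOBS_INDEX KV namespace
    jobId: string
  ): Promise<void> {
    this.status = "running";
    const payload = this.storage.get("payload"); // 1. read stored payload
    const report = await runAgent(payload);      // 2. full agent turn, no 30 s cap
    this.storage.set("report", report);          // 3. write result to own storage
    this.status = "completed";
    index.set(jobId, this.status);               // 4. update the listing index
  }

  report(): unknown {
    return this.storage.get("report");
  }
}
```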
The HTTP client polls /jobs/:id until status is completed
and reads the report.
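The client side of that loop is a simple poll. A minimal sketch, assuming a `fetchStatus` callback standing in for the HTTP GET against /jobs/:id (interval and attempt cap are illustrative defaults):

```typescript
type Poll = (jobId: string) => Promise<{ status: string; report?: unknown }>;

// Poll /jobs/:id until the job reports "completed", then return its report.
async function waitForJob(
  jobId: string,
  fetchStatus: Poll,    // stand-in for: fetch(`/jobs/${jobId}`).then(r => r.json())
  intervalMs = 1000,
  maxAttempts = 60
): Promise<unknown> {
  for (let i = 0; i < maxAttempts; i++) {
    const res = await fetchStatus(jobId);
    if (res.status === "completed") return res.report;
    await new Promise((r) => setTimeout(r, intervalMs)); // wait before retrying
  }
  throw new Error(`job ${jobId} did not complete in time`);
}
```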
Why we picked it
The question was: how do we run agent turns that exceed the Workers wall-time limit?
| Option | Verdict |
|---|---|
| Durable Objects | Chosen. Native Cloudflare primitive; alarms run with no wall-time limit; per-object storage gives us per-job state without adding a database. |
| Cloudflare Workflows | Strong contender. Built specifically for durable execution with retry/checkpoint semantics. Slightly heavier API. Tracked as a Phase 2 reconsideration. |
| Off-platform queue + worker (e.g. SQS + ECS) | Adds infra. Loses the “one Worker handles everything” property. |
| Run synchronously and cap at 30s | Some agent turns just won’t fit (deep delegation chains, many tool calls). Hard cap is wrong. |
DOs won because they were the smallest tool that solved the problem: per-job state + alarm-driven execution, no external services, no extra deploy unit. ADR-001 (the original “do we use LangGraph?” decision) is still open precisely because Workflows might be a better fit at Phase 2 scale.
What it costs
Durable Objects (Workers Paid):
- 1M requests per month free
- 400K GB-seconds compute free (the duration × memory product)
A Phase 1 async job is 1 request + ~20 seconds of execution at ~128 MB ≈ 2.5 GB-seconds per job. 400K ÷ 2.5 ≈ 160K async jobs per month free.
After free tier: $0.15 per million requests; $12.50 per million GB-seconds.
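The free-tier arithmetic above, made explicit (the 20-second / 128 MB figures are the Phase 1 estimates quoted in the text, not measured values):

```typescript
// GB-seconds = duration (s) × memory (GB)
const seconds = 20;
const memoryGb = 128 / 1024;                        // 0.125 GB
const gbSecondsPerJob = seconds * memoryGb;         // 2.5 GB-seconds per job
const freeJobsPerMonth = 400_000 / gbSecondsPerJob; // 160,000 jobs in the free tier
```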
DO storage is billed alongside the rest of Workers KV/D1; per-DO storage is small (a single job’s payload + report) and aggregates to KB-scale.
What it replaces
A separate background worker fleet — typically a cluster of long-running Node processes, scaled by a queue depth metric, with their own monitoring and deploy pipeline. DOs reduce this to a class declaration in the Worker bundle plus a wrangler.toml binding.
Where to look
- apps/worker/src/agent-job.ts — the AgentJob Durable Object class; fetch() for the initial setup, alarm() for the actual run
- apps/worker/src/handlers.ts — the /jobs POST handler that creates and seeds the DO
- apps/worker/src/job-index.ts — the KV-backed listing layer (DOs are private; KV holds the discoverable index)
- apps/worker/wrangler.toml — the [[durable_objects.bindings]] block
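For orientation, the wrangler.toml binding for a Durable Object class generally looks like the sketch below. The `AGENT_JOB` binding name and `v1` migration tag are illustrative; check the real file for the actual values.

```toml
# Sketch of the Durable Object binding (names are illustrative).
[[durable_objects.bindings]]
name = "AGENT_JOB"       # how the Worker reaches the namespace: env.AGENT_JOB
class_name = "AgentJob"  # the exported Durable Object class

# New DO classes must be declared in a migration before first deploy.
[[migrations]]
tag = "v1"
new_classes = ["AgentJob"]
```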
Trade-offs we accepted
- One job = one DO instance. Each /jobs POST creates a fresh DO. We don’t reuse them. This is fine because DOs are cheap; cleanup happens via /jobs/:id DELETE (which clears the DO’s storage and removes it from the index). Orphan cleanup at scale is tracked as follow-up #4.
- No cross-job state. Each DO is isolated. If two jobs need to share state (e.g. coordinating a saga), we’d need an additional shared resource. Phase 1 doesn’t need this.
- Alarms are best-effort scheduling. An alarm scheduled for 1 second from now might fire 100 ms later. For agent jobs this is fine; the latency budget is dominated by LLM calls.
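The DELETE cleanup in the first trade-off can be sketched as two operations. The function and parameter names here are hypothetical stand-ins for DO storage (`deleteAll` mirrors the real storage API) and the JOBS_INDEX KV namespace:

```typescript
// Sketch of /jobs/:id DELETE: clear the DO's storage, drop the index entry.
async function deleteJob(
  storage: { deleteAll(): Promise<void> }, // stand-in for the DO's storage API
  index: Map<string, unknown>,             // stand-in for the JOBS_INDEX KV namespace
  jobId: string
): Promise<void> {
  await storage.deleteAll(); // wipe the DO's own storage
  index.delete(jobId);       // remove the job from the discoverable index
}
```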