Cloudflare Durable Objects

A Durable Object is a Worker with state. Where a regular Worker is request-scoped (no memory between requests), a Durable Object has its own private storage and runs as a single instance worldwide for a given object ID. The platform uses one Durable Object class — AgentJob — to run asynchronous agent jobs to completion.

The /jobs async path. When an HTTP request hits /jobs with a long-running agent task, the Worker:

  1. Generates a job_id
  2. Creates an AgentJob Durable Object with that ID
  3. Stores the request payload in the DO’s storage
  4. Schedules an alarm 1 second in the future
  5. Returns 202 Accepted { job_id, status: "queued" } immediately

When the alarm fires, the DO:

  1. Reads its stored payload
  2. Runs the full agent turn (no 30-second wall-time budget — DOs can run for tens of minutes if needed)
  3. Writes the result to its own storage
  4. Updates the JOBS_INDEX KV namespace so the job appears in /jobs listings
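The alarm-driven run can be sketched like this. Again an illustration, not the real class: the storage keys (`payload`, `report`, `status`) and `runAgentTurn` are assumptions, and `JobStorage` is a minimal stand-in for the Durable Object storage API.

```typescript
// Stand-in for DurableObjectState.storage, so the sketch is self-contained.
interface JobStorage {
  get<T>(key: string): Promise<T | undefined>;
  put(key: string, value: unknown): Promise<void>;
}

export class AgentJobSketch {
  constructor(private storage: JobStorage) {}

  // fetch() (not shown) would store the payload and schedule the 1-second alarm.

  async alarm(): Promise<void> {
    const payload = await this.storage.get<{ task: string }>("payload"); // 1. read payload
    if (!payload) return;                                                // nothing seeded yet
    const report = await this.runAgentTurn(payload);                     // 2. full agent turn
    await this.storage.put("report", report);                            // 3. persist result
    await this.storage.put("status", "completed");
    // 4. the real class would also mirror the status into the JOBS_INDEX KV namespace
  }

  private async runAgentTurn(payload: { task: string }): Promise<string> {
    // Stand-in for the real agent loop (LLM calls, tool calls, delegation).
    return `report for ${payload.task}`;
  }
}
```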

The HTTP client polls /jobs/:id until status is completed and reads the report.
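A polling client might look like the following sketch. The `/jobs/:id` shape comes from the text; the `{ status, report }` response fields, the 2-second interval, and the function name are assumptions.

```typescript
type JobView = { status: string; report?: string };

// Poll /jobs/:id until the job reports "completed", then return its report.
export async function pollJob(
  baseUrl: string,
  jobId: string,
  fetchFn: (url: string) => Promise<Response> = fetch,
  intervalMs = 2000,
): Promise<string | undefined> {
  for (;;) {
    const res = await fetchFn(`${baseUrl}/jobs/${jobId}`);
    const job = (await res.json()) as JobView;
    if (job.status === "completed") return job.report;
    await new Promise((r) => setTimeout(r, intervalMs));
  }
}
```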

The design question was: how do we run agent turns that exceed the Workers wall-time limit?

Options and verdicts:

  • Durable Objects: Chosen. Native Cloudflare primitive; alarms run with no wall-time limit; per-object storage gives us per-job state without adding a database.
  • Cloudflare Workflows: Strong contender. Built specifically for durable execution with retry/checkpoint semantics. Slightly heavier API. Tracked as a Phase 2 reconsideration.
  • Off-platform queue + worker (e.g. SQS + ECS): Adds infra. Loses the “one Worker handles everything” property.
  • Run synchronously and cap at 30s: Some agent turns just won’t fit (deep delegation chains, many tool calls). A hard cap is wrong.

DOs won because they were the smallest tool that solved the problem: per-job state + alarm-driven execution, no external services, no extra deploy unit. ADR-001 (the original “do we use LangGraph?” decision) is still open precisely because Workflows might be a better fit at Phase 2 scale.

Durable Objects (Workers Paid):

  • 1M requests per month free
  • 400K GB-seconds compute free (the duration × memory product)

A Phase 1 async job is 1 request + ~20 seconds of execution at ~128 MB ≈ 2.5 GB-seconds per job. 400K ÷ 2.5 ≈ 160K async jobs per month free.
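The arithmetic above, spelled out (numbers taken from the pricing figures in this section):

```typescript
// duration × memory: 20 s at 128 MB = 2.5 GB-seconds per job
const gbSecondsPerJob = 20 * (128 / 1024);
// 400K GB-seconds free per month / 2.5 GB-s per job = 160K jobs/month free
const freeJobsPerMonth = 400_000 / gbSecondsPerJob;
```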

After free tier: $0.15 per million requests; $12.50 per million GB-seconds.

DO storage is billed alongside the rest of Workers KV/D1; per-DO storage is small (a single job’s payload + report) and aggregates to KB-scale.

The alternative would be a separate background worker fleet: typically a cluster of long-running Node processes, scaled on a queue-depth metric, with its own monitoring and deploy pipeline. DOs reduce all of that to a class declaration in the Worker bundle plus a wrangler.toml binding.

  • apps/worker/src/agent-job.ts — the AgentJob Durable Object class; fetch() for the initial setup, alarm() for the actual run
  • apps/worker/src/handlers.ts — the /jobs POST handler that creates and seeds the DO
  • apps/worker/src/job-index.ts — the KV-backed listing layer (DOs are private; KV holds the discoverable index)
  • apps/worker/wrangler.toml — the [[durable_objects.bindings]] block
  • One job = one DO instance. Each /jobs POST creates a fresh DO. We don’t reuse them. This is fine because DOs are cheap; cleanup happens via /jobs/:id DELETE (which clears the DO’s storage and removes from the index). Orphan cleanup at scale is tracked as follow-up #4.
  • No cross-job state. Each DO is isolated. If two jobs need to share state (e.g. coordinating a saga), we’d need an additional shared resource. Phase 1 doesn’t need this.
  • Alarms are best-effort scheduling. An alarm scheduled for 1 second from now might fire 100ms late. For agent jobs this is fine; the latency budget is dominated by LLM calls.
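For reference, the [[durable_objects.bindings]] block mentioned above looks roughly like this (a sketch: the binding name, class name, and migration tag are assumed to match the AgentJob class, not copied from the real config):

```toml
[[durable_objects.bindings]]
name = "AGENT_JOB"        # env.AGENT_JOB in the Worker
class_name = "AgentJob"   # the class in apps/worker/src/agent-job.ts

# Durable Object classes must be introduced via a migration.
[[migrations]]
tag = "v1"
new_classes = ["AgentJob"]
```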