# ADR-0026: Job management — auth, listing, deletion

Status: Accepted
Date: 2026-04-27
## Context

After ADR-0025, the merchandising agent runs weekly via cron. The first real run surfaced three operational problems that the architecture didn't address:
- No way to list submitted jobs. Job IDs were only knowable from the cron's `console.log` output in Cloudflare logs or from the response of a manual `POST /jobs` call. If you forgot the job ID — or if the cron fired while you weren't watching — you couldn't enumerate what had run.
- No way to delete a job. The platform stored every job submission and report in DO storage forever. Failed runs from misconfigured secrets accumulated alongside legitimate ones, with no operator-facing way to clean them up.
- No auth on any endpoint. Anyone who learned the deployed Worker's URL could hit `POST /jobs` and trigger LLM calls at the operator's expense. Cost is small per call (~$0.05) but unbounded over time.
These weren’t blocking the platform from working — Selman’s first manual job run succeeded after the secret fix. But they made operational hygiene impossible. The “POC stage” framing made it tempting to defer all of this to a later session, except that the auth concern in particular is a real cost-leak risk and the listing/deletion concern is a daily-use ergonomics issue.
## Decision

Add three things together as one self-contained session: bearer-token auth on management endpoints, a KV-backed job index for listing, and a `DELETE /jobs/:id` endpoint for cleanup.
These belong together because each is small on its own but together they form a complete operational story. Splitting them across sessions would mean shipping incomplete states (auth without listing = can’t see what’s running; listing without auth = anyone can read your reports).
### Auth model

- One static bearer token. Stored in Worker secret `WORKER_AUTH_TOKEN`. The operator generates it with `openssl rand -hex 32` and sets it via `wrangler secret put`.
- Bearer scheme only. `Authorization: Bearer <token>` is required on `POST /run`, `POST /jobs`, `GET /jobs`, `GET /jobs/:id`, and `DELETE /jobs/:id`. `/health` stays anonymous: uptime monitors and CI smoke tests work without credentials, and the endpoint returns no sensitive information.
- Cron handler bypasses auth. Cloudflare invokes `scheduled` directly — no HTTP request, no Authorization header. This is correct: the cron must fire regardless.
- Constant-time token comparison to prevent timing-based extraction. Implementation cost is ~10 lines; doing it right has no downside.
- Distinct internal failure reasons (`missing` / `wrong_scheme` / `invalid_token`), generic external response. Logs say which case fired (helps the operator debug); the 401 body says only `{"error": "unauthorized"}` (doesn't help attackers).
- Fail-closed on missing secret. If `WORKER_AUTH_TOKEN` isn't set on the deployed Worker, all protected endpoints return 500 with `auth misconfigured`. Better than silently allowing anonymous access.
### Listing model: KV-backed index

DOs are addressed by id; there is no "list all DO instances" API. A separate index records every submission and updates as statuses change.

- KV namespace `JOBS_INDEX`, bound in `wrangler.toml`.
- Key shape: `jobs:{ISO_timestamp}:{job_id}`. The timestamp prefix gives natural ordering — KV's lexicographic list is chronological for ISO-8601.
- Value shape: `{ job_id, job_type, status, created_at, updated_at }`. Deliberately a SUBSET of `JobRecord` — we don't duplicate the agent report (which can be kilobytes). Callers needing the report fetch the DO via `GET /jobs/:id`.
- Eventually consistent. KV propagates within ~60s globally. For "show me my jobs" this is fine. The DO remains the authoritative source for any individual job's full state.
- Write amplification. ~2 KV writes per job (queued → completed/failed; the intermediate "running" status is skipped, per the write points below). At KV's free tier (1,000 writes/day) this supports hundreds of jobs/day — far more than weekly cron.
- No TTL today. “Keep reports” is a stated requirement. If KV usage becomes a concern (very unlikely; entries are tiny), revisit.
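The ordering property of the key shape is worth making concrete. A minimal sketch — `indexKey` is an illustrative name, not the real `job-index.ts` API:

```typescript
// Build an index key with the ISO-timestamp prefix described above.
function indexKey(createdAt: string, jobId: string): string {
  return `jobs:${createdAt}:${jobId}`;
}

// Three jobs recorded out of submission order:
const keys = [
  indexKey("2026-04-27T09:00:00.000Z", "job-b"),
  indexKey("2026-04-20T09:00:00.000Z", "job-a"),
  indexKey("2026-05-04T09:00:00.000Z", "job-c"),
];

// KV lists keys lexicographically; because ISO-8601 UTC timestamps are
// fixed-width with most-significant fields first, lexicographic order
// equals chronological order.
const listed = [...keys].sort();
```

This only holds if every timestamp uses the same zone (UTC) and the same precision; mixing offsets or dropping milliseconds would break the ordering.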
### Index write points

- `POST /jobs`: writes the queued entry. Failure aborts the request — better than a "lost" job.
- Cron handler: writes the queued entry. Failure logs but doesn't abort — the DO has the truth and the cron's `console.log` line still records the job_id.
- DO alarm handler: writes the terminal status (completed/failed) when `executeJob` returns. Failure logs but doesn't abort — DO storage is authoritative; the index becomes stale but `GET /jobs/:id` still works.
- DO does NOT write the intermediate "running" status. We considered it, but the value is marginal — the alarm transitions queued → completed/failed within seconds in practice. Skipping the intermediate write halves the KV write count.
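The two failure policies above reduce to a strict write and a best-effort write. A hedged sketch, assuming a hypothetical `KvLike` interface rather than the real `job-index.ts` API:

```typescript
// Minimal surface of what the index needs from KV.
interface KvLike {
  put(key: string, value: string): Promise<void>;
}

// POST /jobs path: a failed index write propagates and aborts the
// request, so a job is never accepted without an index entry.
async function writeIndexStrict(kv: KvLike, key: string, entry: unknown): Promise<void> {
  await kv.put(key, JSON.stringify(entry));
}

// Cron / DO-alarm path: a failed write is logged and swallowed.
// DO storage stays authoritative; the index merely goes stale.
async function writeIndexBestEffort(kv: KvLike, key: string, entry: unknown): Promise<void> {
  try {
    await kv.put(key, JSON.stringify(entry));
  } catch (err) {
    console.error("job index write failed (non-fatal)", err);
  }
}
```

Splitting the policy into two tiny wrappers keeps the call sites honest about which failures are fatal.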
### Deletion semantics

- `DELETE /jobs/:id` removes both the DO storage and the KV index entry.
- DO storage is cleared via a `DELETE` method on the DO's fetch handler. Pending alarms still fire but become no-ops (the existing `if (!record) return` at the top of `alarm()` handles this).
- Index lookup happens FIRST. We need `created_at` to construct the KV key, and that lives only in the index entry. If the index doesn't have the job, return 404 — even if a stale DO instance somewhere has the record, the operator can't address it through the public API anyway.
- Index removal happens AFTER DO deletion. If the KV remove fails, the operator can re-call DELETE; the operation is idempotent.
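The ordering above can be sketched as a small function over injected dependencies. The `Deps` interface and all names here are illustrative stand-ins for the real Worker code, not its API:

```typescript
interface IndexEntry { job_id: string; created_at: string; }

interface Deps {
  getEntry(jobId: string): Promise<IndexEntry | null>; // KV index lookup
  deleteDo(jobId: string): Promise<void>;              // DO's DELETE handler
  removeKey(key: string): Promise<void>;               // KV delete
}

async function deleteJob(deps: Deps, jobId: string): Promise<number> {
  // 1. Index lookup first: created_at (needed to build the KV key)
  //    lives only in the index entry.
  const entry = await deps.getEntry(jobId);
  if (entry === null) return 404;

  // 2. Clear DO storage; any pending alarm becomes a no-op.
  await deps.deleteDo(jobId);

  // 3. Remove the index entry last. If this step fails, re-calling
  //    DELETE retries everything; the whole operation is idempotent.
  await deps.removeKey(`jobs:${entry.created_at}:${jobId}`);
  return 204;
}
```

Taking the three operations as an interface is also what makes the route testable with the in-memory KV fake.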
## Consequences

- Operator can manage their own data. Three commands cover the lifecycle: list, delete, leave-alone-for-cron.
- Cost is bounded by token theft, not URL discovery. A leaked URL alone costs nothing.
- The KV write-on-status-change pattern generalizes. Future indexes (errors, audit log, per-tenant scoping) follow the same shape.
- One new wrangler binding to set up. First-time deploy has an extra step (`wrangler kv namespace create JOBS_INDEX` twice, for prod and preview, then paste the IDs). README documents this.
- The token is the operator's responsibility. If lost, rotate via `wrangler secret put`. There's no recovery path; the token IS the auth.
## Consequences for the repo

- New file `apps/worker/src/auth.ts` (~115 lines): pure validation function + Hono middleware factory.
- New file `apps/worker/src/job-index.ts` (~155 lines): typed wrapper over KV with an `InMemoryKv` test fake.
- Changes to `apps/worker/src/index.ts`: middleware mount, KV index integration in `POST /jobs`, two new endpoints (`GET /jobs`, `DELETE /jobs/:id`); env interface adds `WORKER_AUTH_TOKEN` and `JOBS_INDEX`.
- Changes to `apps/worker/src/agent-job-do.ts`: adds the DELETE handler; the alarm writes its status update to the KV index.
- Changes to `apps/worker/wrangler.toml`: adds a `[[kv_namespaces]]` block, documents the new secrets.
- Tests: 17 auth + 11 job-index + 11 routes = 39 new. Workspace: 455 passed + 2 skipped (was 416).
## Manual setup required after merge

```sh
# One-time KV namespace creation (returns IDs to paste into wrangler.toml)
cd apps/worker
wrangler kv namespace create JOBS_INDEX
wrangler kv namespace create JOBS_INDEX --preview
```

```sh
# Generate and set the auth token (bash, not sh: the <<< herestring is a bashism)
openssl rand -hex 32 | xargs -I{} bash -c 'echo "{}" && wrangler secret put WORKER_AUTH_TOKEN <<<"{}"'
# (or run interactively: wrangler secret put WORKER_AUTH_TOKEN, paste the value when prompted)
# Save the token somewhere safe — you won't be able to read it back from Cloudflare.
```

Existing secrets (`ANTHROPIC_API_KEY`, `SHOPIFY_ACCESS_TOKEN`, `SHOPIFY_SHOP_DOMAIN`) are unchanged.

After the deploy, every API call needs the token:

```sh
curl -H "Authorization: Bearer $WORKER_AUTH_TOKEN" \
  https://your-worker-url/jobs
```

## Alternatives considered
- HMAC-signed requests instead of bearer tokens. Replay-resistant; massive overkill for a single-consumer POC.
- Multiple tokens (e.g., comma-separated values in the secret) for different consumers. Premature; we have one consumer.
- Per-route auth tokens. Different scopes for management vs agent endpoints. Premature; “you have access to this Worker or you don’t” is the right model today.
- API key as URL parameter instead of Authorization header. Header is the correct standard; URL params leak through proxy logs and browser history.
- D1 (SQLite) instead of KV for the index. SQL queries would be nice but D1 needs a schema-migration story we don’t have. KV is simpler and sufficient for “list and filter recent.”
- One DO acting as the index (a “registry DO”). Centralizes the data, but creates a hot spot and forces serialization for unrelated reads. KV is colder and faster for this access pattern.
- List from DO storage iteration. Doesn’t exist — Cloudflare doesn’t expose “list all DO instances of class X.” So an external index is unavoidable.
- TTL on KV entries. Auto-expiring would silently lose old reports; conflicts with “keep reports” requirement. Add later if KV usage genuinely becomes an issue.
- Soft-delete (mark as deleted, hide from list) instead of hard-delete. Provides recoverability but doubles KV reads (need to filter on every list). Premature; if you delete a job you meant to keep, it’s gone — that’s fine for a POC.
- Pause cron via a runtime KV flag instead of `wrangler.toml`. Adds a new failure mode (forgot to flip the flag) and requires an admin endpoint. Editing wrangler.toml + redeploy is 30 seconds and zero state.
## What's Next

This session resolves operational hygiene. The platform's open questions are now mostly about the agent's quality, not its plumbing:

- Memory of past reports. Without it, every Monday's recommendation is independent. Most likely the next session — it finally has a forcing function once you've seen 2-3 weekly reports and noticed the agent suggesting the same campaign repeatedly.
- Email or Slack delivery. Polling `/jobs/:id` works but isn't great UX. A simple email tool turns "I should remember to check the report" into "the report shows up in my inbox."
- Smarter cron behavior. Skip if the last run failed. Requires the same memory primitive as #1; do them together.
- Structural failure signaling. If all tool calls failed, `status` should be `failed`, not `completed`. Tracked from the previous diagnostic conversation.
- Better prompts based on real output. Possibly no new code — just iteration on the system prompt once you see a few real reports.