ADR-0026: Job management — auth, listing, deletion

Status: Accepted
Date: 2026-04-27

After ADR-0025, the merchandising agent runs weekly via cron. The first real run surfaced three operational problems that the architecture didn’t address:

  1. No way to list submitted jobs. Job IDs were discoverable only from the cron’s console.log output in Cloudflare logs or from the response to a manual POST /jobs call. If you forgot a job ID — or if the cron fired while you weren’t watching — there was no way to enumerate what had run.
  2. No way to delete a job. The platform stored every job submission and report in DO storage forever. Failed runs from misconfigured secrets accumulated alongside legitimate ones, with no operator-facing way to clean them up.
  3. No auth on any endpoint. Anyone who learned the deployed Worker’s URL could hit POST /jobs and trigger LLM calls at the operator’s expense. Cost is small per call ($0.05) but unbounded over time.

These weren’t blocking the platform from working — Selman’s first manual job run succeeded after the secret fix. But they made operational hygiene impossible. The “POC stage” framing made it tempting to defer all of this to a later session, except that the auth gap is a real cost-leak risk and the listing/deletion gap is a daily-use ergonomics issue.

The decision is to add three things in one self-contained session: bearer-token auth on management endpoints, a KV-backed job index for listing, and a DELETE /jobs/:id endpoint for cleanup.

These belong together because each is small on its own but together they form a complete operational story. Splitting them across sessions would mean shipping incomplete states (auth without listing = can’t see what’s running; listing without auth = anyone can read your reports).

  • One static bearer token. Stored in Worker secret WORKER_AUTH_TOKEN. Operator generates with openssl rand -hex 32 and sets via wrangler secret put.
  • Bearer scheme only. Authorization: Bearer <token> is required on POST /run, POST /jobs, GET /jobs, GET /jobs/:id, and DELETE /jobs/:id.
  • /health stays anonymous. Uptime monitors and CI smoke tests work without credentials. The endpoint returns no sensitive information.
  • Cron handler bypasses auth. Cloudflare invokes scheduled directly — no HTTP request, no Authorization header. This is correct: the cron must fire regardless.
  • Constant-time token comparison to prevent timing-based extraction. Implementation cost is ~10 lines; doing it right has no downside.
  • Distinct internal failure reasons (missing/wrong_scheme/invalid_token), generic external response. Logs say which case fired (helps operator debug), the 401 body says only {"error": "unauthorized"} (doesn’t help attackers).
  • Fail-closed on missing secret. If WORKER_AUTH_TOKEN isn’t set on the deployed Worker, all protected endpoints return 500 with auth misconfigured. Better than silently allowing anonymous access.

DOs are addressed by id; Cloudflare exposes no “list all DO instances” API. A separate index therefore records every submission and is updated as statuses change.

  • KV namespace JOBS_INDEX bound in wrangler.toml.
  • Key shape: jobs:{ISO_timestamp}:{job_id}. The timestamp prefix gives natural ordering — KV’s lexicographic list is chronological for ISO-8601.
  • Value shape: { job_id, job_type, status, created_at, updated_at }. Deliberately a SUBSET of JobRecord — we don’t duplicate the agent report (which can be kilobytes). Callers needing the report fetch the DO via GET /jobs/:id.
  • Eventually consistent. KV propagates within ~60s globally. For “show me my jobs” this is fine. The DO remains the authoritative source for any individual job’s full state.
  • Write amplification. Roughly two KV writes per job (the queued entry, then the terminal status). At KV’s free tier (1,000 writes/day) this supports hundreds of jobs per day — far more than a weekly cron needs.
  • No TTL today. “Keep reports” is a stated requirement. If KV usage becomes a concern (very unlikely; entries are tiny), revisit.
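The entry and key shapes above can be sketched as follows; the real job-index.ts wrapper is richer, and these names are illustrative:

```typescript
// Deliberately a subset of JobRecord — no agent report, which can be kilobytes.
interface JobIndexEntry {
  job_id: string;
  job_type: string;
  status: "queued" | "running" | "completed" | "failed";
  created_at: string; // ISO-8601
  updated_at: string;
}

// jobs:{ISO_timestamp}:{job_id} — ISO-8601 timestamps sort lexicographically,
// so KV's ordered list() comes back in chronological order for free.
function jobKey(createdAt: string, jobId: string): string {
  return `jobs:${createdAt}:${jobId}`;
}
```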
  • POST /jobs: writes the queued entry. Failure aborts the request — better than a “lost” job.
  • Cron handler: writes the queued entry. Failure logs but doesn’t abort — the DO has the truth and the cron’s console.log line still records the job_id.
  • DO alarm handler: writes the terminal status (completed/failed) when executeJob returns. Failure logs but doesn’t abort — DO storage is authoritative; the index becomes stale but GET /jobs/:id still works.
  • DO does NOT write the intermediate “running” status. We considered it, but the value is marginal — in practice the alarm transitions queued → completed/failed within seconds. Skipping the intermediate write saves one KV write per job.
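The two failure policies can be sketched side by side; the function names and shapes here are hypothetical stand-ins, not the worker’s actual API:

```typescript
// POST /jobs: an unindexed job would be invisible to GET /jobs, so a failed
// index write aborts the request rather than creating a "lost" job.
async function submitJob(writeIndex: () => Promise<void>): Promise<Response> {
  try {
    await writeIndex();
  } catch {
    return new Response(JSON.stringify({ error: "index write failed" }), {
      status: 500,
    });
  }
  return new Response(JSON.stringify({ ok: true }), { status: 202 });
}

// Cron handler: the DO already holds the truth and console.log records the
// job_id, so an index failure is logged and the run continues.
async function cronSubmit(
  writeIndex: () => Promise<void>,
  jobId: string,
): Promise<void> {
  try {
    await writeIndex();
  } catch (err) {
    console.log(`index write failed for job ${jobId}:`, err);
  }
}
```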
  • DELETE /jobs/:id removes both the DO storage and the KV index entry.
  • DO storage is cleared via a DELETE method on the DO’s fetch handler. Pending alarms still fire but become no-ops (the existing if (!record) return at the top of alarm() handles this).
  • Index lookup happens FIRST. We need created_at to construct the KV key, and that lives only in the index entry. If the index doesn’t have the job, return 404 — even if a stale DO instance somewhere has the record, the operator can’t address it through the public API anyway.
  • Index removal happens AFTER DO deletion. If KV remove fails, the operator can re-call DELETE; idempotent.
  • Operator can manage their own data. Three commands cover the lifecycle: list, delete, leave-alone-for-cron.
  • Cost is bounded by token theft, not URL discovery. A leaked URL alone costs nothing.
  • The KV write-on-status-change pattern generalizes. Future indexes (errors, audit log, per-tenant scoping) follow the same shape.
  • One new wrangler binding to set up. First-time deploy has an extra two-step (wrangler kv namespace create JOBS_INDEX x2 for prod and preview, paste the IDs). README documents this.
  • The token is the operator’s responsibility. If lost, rotate via wrangler secret put. There’s no recovery path; the token IS the auth.
  • New file apps/worker/src/auth.ts (~115 lines): pure validation function + Hono middleware factory.
  • New file apps/worker/src/job-index.ts (~155 lines): typed wrapper over KV with InMemoryKv test fake.
  • Changes to apps/worker/src/index.ts: middleware mount, KV index integration in POST /jobs, two new endpoints (GET /jobs, DELETE /jobs/:id), env interface adds WORKER_AUTH_TOKEN and JOBS_INDEX.
  • Changes to apps/worker/src/agent-job-do.ts: adds DELETE handler, alarm writes status update to KV index.
  • Changes to apps/worker/wrangler.toml: adds [[kv_namespaces]] block, documents new secrets.
  • Tests: 17 auth + 11 job-index + 11 routes = 39 new. Workspace: 455 passed + 2 skipped (was 416).
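The wrangler.toml change amounts to a small fragment along these lines; the IDs are placeholders for the values printed by `wrangler kv namespace create`:

```toml
# apps/worker/wrangler.toml — binding shape (IDs are placeholders)
[[kv_namespaces]]
binding = "JOBS_INDEX"
id = "<prod-namespace-id>"
preview_id = "<preview-namespace-id>"

# Secrets are set via `wrangler secret put`, never stored in this file:
#   WORKER_AUTH_TOKEN — bearer token for management endpoints
```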
```sh
# One-time KV namespace creation (prints IDs to paste into wrangler.toml)
cd apps/worker
wrangler kv namespace create JOBS_INDEX
wrangler kv namespace create JOBS_INDEX --preview

# Generate and set the auth token
TOKEN=$(openssl rand -hex 32)
echo "$TOKEN"  # save it somewhere safe — you can't read it back from Cloudflare
echo "$TOKEN" | wrangler secret put WORKER_AUTH_TOKEN
# (or run interactively: wrangler secret put WORKER_AUTH_TOKEN, paste when prompted)
```

Existing secrets (ANTHROPIC_API_KEY, SHOPIFY_ACCESS_TOKEN, SHOPIFY_SHOP_DOMAIN) are unchanged.

After the deploy, every API call needs the token:

```sh
curl -H "Authorization: Bearer $WORKER_AUTH_TOKEN" \
  https://your-worker-url/jobs
```
  • HMAC-signed requests instead of bearer tokens. Replay-resistant; massive overkill for a single-consumer POC.
  • Multiple tokens (e.g., comma-separated values in the secret) for different consumers. Premature; we have one consumer.
  • Per-route auth tokens. Different scopes for management vs agent endpoints. Premature; “you have access to this Worker or you don’t” is the right model today.
  • API key as URL parameter instead of Authorization header. Header is the correct standard; URL params leak through proxy logs and browser history.
  • D1 (SQLite) instead of KV for the index. SQL queries would be nice but D1 needs a schema-migration story we don’t have. KV is simpler and sufficient for “list and filter recent.”
  • One DO acting as the index (a “registry DO”). Centralizes the data, but creates a hot spot and forces serialization for unrelated reads. KV is colder and faster for this access pattern.
  • List from DO storage iteration. Doesn’t exist — Cloudflare doesn’t expose “list all DO instances of class X.” So an external index is unavoidable.
  • TTL on KV entries. Auto-expiring would silently lose old reports; conflicts with “keep reports” requirement. Add later if KV usage genuinely becomes an issue.
  • Soft-delete (mark as deleted, hide from list) instead of hard-delete. Provides recoverability but doubles KV reads (need to filter on every list). Premature; if you delete a job you meant to keep, it’s gone — that’s fine for a POC.
  • Pause cron via a runtime KV flag instead of wrangler.toml. Adds a new failure mode (forgot to flip the flag), and requires an admin endpoint. Editing wrangler.toml + redeploy is 30 seconds and zero state.

This session resolves operational hygiene. The platform’s open questions are now mostly about the agent’s quality, not its plumbing:

  1. Memory of past reports. Without it, every Monday’s recommendation is independent. Most likely the next session; it gains a real forcing function once you’ve seen 2–3 weekly reports and noticed the agent suggesting the same campaign repeatedly.
  2. Email or Slack delivery. Polling /jobs/:id works but isn’t great UX. A simple email tool turns “I should remember to check the report” into “the report shows up in my inbox.”
  3. Smarter cron behavior. Skip if last run failed. Requires the same memory primitive as #1; do them together.
  4. Structural failure signaling. If all tool calls failed, status should be failed, not completed. Tracked from the previous diagnostic conversation.
  5. Better prompts based on real output. Possibly no new code — just iteration on the system prompt once you see a few real reports.