# Testing
The platform has 923 passing tests as of Phase 1, plus 6 intentionally-skipped integration tests. This page covers what those tests are, how they’re organized, and — most importantly — what classes of bug they don’t catch and how we caught those anyway.
If you’re contributing code, this is the page that tells you which test layer your change belongs in.
## What’s tested, by layer

Three layers, in increasing order of fidelity and decreasing order of coverage:
| Layer | Count | Wall time | Where it runs | What it covers |
|---|---|---|---|---|
| 1. Vitest unit tests | 823 tests | ~3s | Developer machine, every CI build | Pure functions, type validation, error taxonomy, schema parsing, tool registry, agent runtime, delegation, gateway behavior (with mocks) |
| 2. Vitest integration tests (vitest-pool-workers) | 98 tests | ~20s | Developer machine via miniflare; memory package only | Real D1 + Vectorize semantics; working memory + long-term memory under real CF runtime |
| 3. End-to-end script | 1 script, opt-in | ~30-60s, ~$0.05 | Against a deployed Worker | Real Anthropic + OpenAI + Shopify + queues + Durable Objects, full delegation chain |
**Layer 1 — Vitest unit tests.** The bulk: 823 tests across 18 packages and `apps/worker`, run on every commit and every PR. They cover pure functions, type correctness, error semantics, schema parsing, and runtime behavior with mocked LLM/embedding/storage gateways. Total wall time: ~3 seconds.
**Layer 2 — Vitest integration tests (`vitest-pool-workers`).** 98 tests in `packages/memory`, run against the real Cloudflare runtime locally via miniflare. They cover what unit tests can’t: D1 schema enforcement, Vectorize query semantics, Durable Object storage. Wall time: ~20 seconds.
**Layer 3 — End-to-end demo script.** `apps/worker/scripts/e2e-demo.sh` runs against a deployed Worker and hits real Anthropic, real OpenAI, and real Shopify. It costs ~$0.05/run and is gated behind `RUN_E2E=1` so it can’t run in CI without explicit opt-in. It confirms the order-triage delegation chain works end to end.
## What each test layer can and cannot catch

The honest version, with examples from Phase 1:
| Layer | Catches | Misses |
|---|---|---|
| Unit (Vitest) | Type errors, logic errors in pure code, schema validation bugs, error taxonomy mistakes, runtime turn semantics with mocked gateways | Real-runtime behavior of CF bindings, live API schema mismatches, network-shaped bugs |
| Integration (vitest-pool-workers) | D1 schema enforcement, Vectorize query and metadata-filter behavior, Durable Object storage semantics, real binding contracts | Cross-service interactions (e.g., a queue producer + remote consumer chain), real LLM behavior, third-party API live schemas |
| E2E (deployed) | Real-world bugs everywhere upstream miss. Confirms the deployed system actually works. | Doesn’t run in CI by default (cost, secrets). Slow per-iteration; not a fast inner loop. |
## The testing-fidelity gap

A theme worth surfacing: Vitest passing ≠ production passing. Phase 1 hit this exact bug class three times.
| # | Bug | Why Vitest missed it | How we caught it |
|---|---|---|---|
| 1 | Shopify GraphQL `selectionMismatch` | Vitest can’t validate against live schemas | Deploy + first real call |
| 2 | Shopify client `fetch` this-binding | Vitest’s Node `fetch` doesn’t enforce method binding | Deploy + first real call |
| 3 | `embeddings-openai` `fetch` this-binding | Same as #2 | E2E demo run, Step 2 (seed) |
The fetch-binding issue is the most painful: in Cloudflare Workers, `globalThis.fetch` requires `this === globalThis` when invoked. Storing it on an instance field and calling `this.fetchImpl(...)` rebinds `this` to the instance, throwing `Illegal invocation`. Vitest’s Node `fetch` doesn’t enforce this, so unit tests pass while production throws.
The fix is small (wrap with `.bind(globalThis)`); the cost was catching it three times. After bug #3, we set the trigger condition for adding a lint rule:
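The pattern is easy to reproduce in isolation. The sketch below simulates the Workers host-function behavior with a hypothetical `platformFetch` stand-in (Node’s own `fetch` won’t throw here, which is exactly why Vitest missed it); the gateway class names are illustrative, not the platform’s real exports:

```typescript
// Hypothetical stand-in for globalThis.fetch as the Workers runtime treats it:
// the host function demands `this === globalThis` (or a plain, unbound call).
function platformFetch(this: unknown, url: string): string {
  if (this !== globalThis && this !== undefined) {
    throw new TypeError("Illegal invocation");
  }
  return `fetched ${url}`;
}

class BrokenGateway {
  // Storing the bare function: later calls rebind `this` to the instance.
  private fetchImpl = platformFetch;
  get(url: string): string {
    return this.fetchImpl(url); // throws "Illegal invocation" in Workers
  }
}

class FixedGateway {
  // The fix: bind once at construction so `this` stays globalThis.
  private fetchImpl = platformFetch.bind(globalThis);
  get(url: string): string {
    return this.fetchImpl(url); // safe
  }
}
```

The broken variant only fails under a runtime that enforces the binding, which is the entire fidelity gap in one class definition.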
> **Tracked follow-up #14.** When a fourth instance of any Workers-runtime-only bug ships, add a Biome lint rule that flags `const x = globalThis.foo` and `this.x = globalThis.foo` patterns at review time.
We also tracked follow-up #13:
> **Tracked follow-up #13.** Three production bugs Vitest can’t catch is the trigger for adding miniflare/`vitest-pool-workers` integration tests for the Worker’s Anthropic + OpenAI + Shopify code paths. The memory package already uses this pattern; the Worker’s gateway construction and queue producer paths don’t yet.
Both are open. Phase 2 is the natural place to land them.
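For reference, such a setup takes roughly this shape. This is a sketch mirroring the standard `@cloudflare/vitest-pool-workers` configuration (the memory package’s pattern); the glob and `configPath` are illustrative, not the Worker’s actual file layout:

```typescript
// vitest.config.ts for a package that needs the real Workers runtime.
// Paths below are assumptions for illustration.
import { defineWorkersConfig } from "@cloudflare/vitest-pool-workers/config";

export default defineWorkersConfig({
  test: {
    // Keep miniflare-pool tests in their own directory (see the
    // test-organization rules below).
    include: ["test/**/*.workers.test.ts"],
    poolOptions: {
      workers: {
        // Reuse the Worker's own bindings (D1, Vectorize, DOs) in miniflare.
        wrangler: { configPath: "./wrangler.toml" },
      },
    },
  },
});
```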
## What’s NOT tested, on purpose

Three categories of things we deliberately don’t test in CI:
**Real LLM behavior.** Anthropic’s actual responses change over time. Tests against real Sonnet would be flaky (model behavior varies) and expensive ($0.05/run × every CI build = real money fast). Solution: mocks for unit tests, the e2e script for opt-in real-LLM verification.
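The mock side of that split can be as simple as a deterministic fake behind the gateway interface. A sketch with hypothetical names (`LlmGateway` and `FakeLlmGateway` are illustrative, not the platform’s real exports):

```typescript
// Hypothetical gateway interface; the platform's real shape may differ.
interface LlmGateway {
  complete(prompt: string): Promise<string>;
}

// Deterministic fake for unit tests: no network, no cost, no flaky model.
class FakeLlmGateway implements LlmGateway {
  private i = 0;
  constructor(private readonly responses: string[]) {}

  async complete(_prompt: string): Promise<string> {
    // Replay scripted responses in order; empty string once exhausted.
    const next = this.responses[this.i];
    this.i += 1;
    return next ?? "";
  }
}
```

Unit tests script the exact responses they need; the e2e demo is the only place a real model answers.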
**Real Shopify mutations.** Today the Shopify client is read-only. Even when Phase 2 introduces mutations, the e2e tests will run against a sandbox shop, not production. We never want “a CI test refunded a real customer” in our incident log.
**Cross-tenant scenarios.** Phase 1 is single-tenant. The multi-tenant test surface lands with Phase 4. Until then, we don’t write tests for behavior that doesn’t exist.
## Test-organization rules

**Co-located unit tests.** Every package’s tests live next to the source under `<package>/src/*.test.ts`. Discoverability matters: when you read `agent-runtime.ts`, the tests are right there.
**Integration tests in a separate `test/` directory.** The memory package has both: co-located unit tests for pure logic, plus `packages/memory/test/*.workers.test.ts` for the miniflare-pool ones. They need different runner config; a different directory makes that obvious.
**No e2e tests committed alongside unit tests.** The e2e script is bash, lives in `apps/worker/scripts/e2e-demo.sh`, and never runs in `pnpm test`. Mixing it in would either make `pnpm test` slow and costly, or make it inconsistent depending on environment.
**Test names describe behavior, not function calls.** `it('throws TurnBudgetExceededError when iteration count exceeds maxIterations')`, not `it('throws when limit hit')`. The former tells you what’s being tested when it fails six months from now.
## Running the tests

For day-to-day development:

```shell
pnpm test        # all 923 tests across the workspace, ~25s
pnpm test:watch  # vitest watch mode for the package you're in
pnpm typecheck   # TypeScript check across all 18 packages
pnpm check       # lint + typecheck + test together
```

For the integration tests specifically:
```shell
# Runs the unit + miniflare-pool tests
pnpm --filter @agent-platform/memory test
```

For the e2e demo (requires deployed Worker + secrets):
```shell
RUN_E2E=1 \
WORKER_URL=https://your-worker.workers.dev \
WORKER_AUTH_TOKEN="$(cat ~/.agent-platform-token)" \
./apps/worker/scripts/e2e-demo.sh
```

Full e2e runbook is in `apps/worker/README.md`.
## Why we don’t auto-run e2e in CI (yet)

The e2e script costs ~$0.05 per run (Anthropic + OpenAI). If it ran on every PR, that’s a meaningful budget item — and worse, it’d be a budget item attached to PRs from anyone, including contractors and bot accounts. The threat model isn’t theoretical: a few thousand sloppy PRs would mean a few hundred dollars of LLM bill.
Three options, all tracked but not implemented:
- **Run on merges to `main` only.** Cheaper but still real cost, and the CI failure comes after merge, too late to block bad code.
- **Run on labeled PRs.** A maintainer adds a `run-e2e` label to trigger. More gatekeeping, less convenient.
- **Run nightly against a dedicated test tenant.** Decouples PR latency from e2e cost; surfaces drift overnight. Probably the right answer for Phase 2.
Tracked as follow-up #12 (auto CI E2E runs).
## What “good test coverage” means here

Coverage percentages aren’t a goal on this platform. We don’t have a coverage badge in the README, and we don’t gate PRs on a threshold. Two reasons:
**The numerator can lie.** A line of code can be “covered” by a test that doesn’t actually verify any behavior — calling a function and not asserting on the result still bumps the coverage counter. Coverage targets incentivize writing those tests.
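A concrete illustration, using a hypothetical `computeRefund` helper (not a real platform function): both calls below give the function 100% line coverage, but only the second verifies anything.

```typescript
// Hypothetical function under test.
function computeRefund(total: number, restockFee: number): number {
  return Math.max(0, total - restockFee);
}

// "Coverage" test: every line of computeRefund executes, nothing is checked.
// c8 reports the function as fully covered either way.
computeRefund(100, 15);

// Behavioral test: the same call, but the result is actually asserted.
const refund = computeRefund(100, 15);
if (refund !== 85) {
  throw new Error(`expected refund of 85, got ${refund}`);
}
```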
**The denominator misses.** A class of bug like the fetch-binding issue is invisible to `c8 --coverage`: the line of code IS covered; the issue is at runtime in a different context. Coverage tools would say it’s tested. It wasn’t.
Instead, every PR’s tests should cover the behaviors the PR is responsible for: unit tests for pure logic, integration tests when the change interacts with CF bindings. Reviewer judgment, not a numerical threshold.
## Where to next

If you’re contributing:

- Read `apps/worker/README.md` for the operational runbook (deploy, seed memory, run e2e)
- Look at `packages/memory/test/` for examples of miniflare-based integration tests if your change touches CF bindings
If you’re evaluating the platform:
- The 923-passing-tests-on-fresh-checkout property is real — every commit in the platform’s history is verified end-to-end on a fresh clone before being claimed done. The git log is trustworthy as a record of working states.
- The testing-fidelity gap is documented because it’s been paid for, in real production bugs. No platform is bug-free; the platforms that survive are the ones that surface their weaknesses honestly.