
Testing

The platform has 923 passing tests as of Phase 1, plus 6 intentionally skipped integration tests. This page covers what those tests are, how they’re organized, and — most importantly — what classes of bug they don’t catch and how we caught those anyway.

If you’re contributing code, this is the page that tells you which test layer your change belongs in.

Three layers, in increasing order of fidelity and decreasing order of coverage:

| Layer | Count | Wall time | Where it runs | What it covers |
| --- | --- | --- | --- | --- |
| 1. Vitest unit tests | 823 tests | ~3s | Developer machine, every CI build | Pure functions, type validation, error taxonomy, schema parsing, tool registry, agent runtime, delegation, gateway behavior (with mocks) |
| 2. Vitest integration tests (vitest-pool-workers) | 98 tests | ~20s | Developer machine via miniflare; memory package only | Real D1 + Vectorize semantics; working memory + long-term memory under real CF runtime |
| 3. End-to-end script | 1 script, opt-in | ~30-60s, ~$0.05 | Against a deployed Worker | Real Anthropic + OpenAI + Shopify + queues + Durable Objects, full delegation chain |

Layer 1 — Vitest unit tests. The bulk. 823 tests across 18 packages and apps/worker. Run on every commit and every PR. Cover pure functions, type-correctness, error semantics, schema parsing, runtime behavior with mocked LLM/embedding/storage gateways. Total wall time: ~3 seconds.

Layer 2 — Vitest integration tests, vitest-pool-workers. 98 tests in packages/memory, run against the real Cloudflare runtime locally via miniflare. Covers what unit tests can’t — D1 schema enforcement, Vectorize query semantics, Durable Object storage. Wall time: ~20 seconds.

Layer 3 — End-to-end demo script. apps/worker/scripts/e2e-demo.sh. Runs against a deployed Worker, hits real Anthropic, real OpenAI, real Shopify. Costs ~$0.05/run; gated behind RUN_E2E=1 so it can’t run in CI without explicit opt-in. Confirms the order-triage delegation chain works end-to-end.
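Layer 2's runner is configured per package. A hedged sketch of what such a vitest-pool-workers config looks like (file contents assumed; the memory package's actual config may differ):

```typescript
// vitest.config.ts: hypothetical sketch of a vitest-pool-workers setup.
import { defineWorkersConfig } from "@cloudflare/vitest-pool-workers/config";

export default defineWorkersConfig({
  test: {
    // Only the miniflare-pool tests live under test/*.workers.test.ts;
    // co-located unit tests under src/ keep the plain Vitest runner.
    include: ["test/**/*.workers.test.ts"],
    poolOptions: {
      workers: {
        // Reuse the package's real bindings (D1, Vectorize, DO) from
        // its wrangler configuration, so tests hit real CF semantics.
        wrangler: { configPath: "./wrangler.toml" },
      },
    },
  },
});
```

The separate `include` glob is what keeps the two runner configs from colliding in one package.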

The honest version, with examples from Phase 1:

| Layer | Catches | Misses |
| --- | --- | --- |
| Unit (Vitest) | Type errors, logic errors in pure code, schema validation bugs, error taxonomy mistakes, runtime turn semantics with mocked gateways | Real-runtime behavior of CF bindings, live API schema mismatches, network-shaped bugs |
| Integration (vitest-pool-workers) | D1 schema enforcement, Vectorize query and metadata-filter behavior, Durable Object storage semantics, real binding contracts | Cross-service interactions (e.g., a queue producer + remote consumer chain), real LLM behavior, third-party API live schemas |
| E2E (deployed) | Real-world bugs everything upstream misses. Confirms the deployed system actually works. | Doesn’t run in CI by default (cost, secrets). Slow per-iteration; not a fast inner loop. |

A theme worth surfacing: Vitest passing ≠ production passing. Phase 1 hit this exact bug class three times.

| # | Bug | Why Vitest missed it | How we caught it |
| --- | --- | --- | --- |
| 1 | Shopify GraphQL selection mismatch | Vitest can’t validate against live schemas | Deploy + first real call |
| 2 | Shopify client fetch this-binding | Vitest’s Node fetch doesn’t enforce method binding | Deploy + first real call |
| 3 | embeddings-openai fetch this-binding | Same as #2 | E2E demo run, Step 2 (seed) |

The fetch-binding issue is the most painful: in Cloudflare Workers, globalThis.fetch requires this === globalThis when invoked. Storing it on an instance field and calling this.fetchImpl(...) rebinds this to the instance, throwing Illegal invocation. Vitest’s Node fetch doesn’t enforce this, so unit tests pass while production throws.
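The receiver check can be simulated in plain Node to see the failure shape. Everything below is illustrative; the real enforcement lives inside the Workers runtime, not in user code:

```typescript
// Simulate a host function that validates its receiver, the way Workers'
// globalThis.fetch requires `this === globalThis` at call time.
function makeHostFn(host: object): (this: unknown) => string {
  return function (this: unknown): string {
    if (this !== host) throw new TypeError("Illegal invocation");
    return "ok";
  };
}

const host: { fetch: () => string } = { fetch: () => "" };
host.fetch = makeHostFn(host);

class Client {
  // Bug shape: calling this.broken() makes the Client instance the
  // receiver, so the host's check throws "Illegal invocation".
  broken = host.fetch;
  // Fix shape: .bind pins the receiver, mirroring .bind(globalThis).
  fixed = host.fetch.bind(host);
}

const client = new Client();
client.fixed(); // returns "ok"
// client.broken() would throw TypeError: Illegal invocation
```

Node's own fetch never performs this check, which is exactly why the unit suite stayed green while production threw.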

The fix is small (wrap with .bind(globalThis)); the cost was catching it three times. After bug #3, we set the trigger condition for adding a lint rule:

Tracked follow-up #14. When a fourth instance of any Workers-runtime-only bug ships, add a Biome lint rule that flags const x = globalThis.foo and this.x = globalThis.foo patterns at review time.

We also tracked follow-up #13:

Tracked follow-up #13. Three production bugs Vitest can’t catch is the trigger for adding miniflare/vitest-pool-workers integration tests for the Worker’s Anthropic + OpenAI + Shopify code paths. The memory package already uses this pattern; the Worker’s gateway construction and queue producer paths don’t yet.

Both are open. Phase 2 is the natural place to land them.

Three categories of things we deliberately don’t test in CI:

Real LLM behavior. Anthropic’s actual responses change over time. Tests against real Sonnet would be flaky (model behavior varies) and expensive ($0.05/run × every CI build = real money fast). Solution: mocks for unit tests, the e2e script for opt-in real-LLM verification.

Real Shopify mutations. Today the Shopify client is read-only. Even when Phase 2 introduces mutations, the e2e tests will run against a sandbox shop, not production. We never want “a CI test refunded a real customer” in our incident log.

Cross-tenant scenarios. Phase 1 is single-tenant. The multi-tenant test surface lands with Phase 4. Until then, we don’t write tests for behavior that doesn’t exist.
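For the first category, unit tests swap the real LLM gateway for a deterministic mock. A minimal sketch of the pattern, with all names hypothetical (the platform's actual gateway interface may differ):

```typescript
// Hypothetical gateway interface; the platform's real shape may differ.
interface LlmGateway {
  complete(prompt: string): Promise<string>;
}

// Deterministic in-memory mock: records prompts, returns a canned reply.
class MockLlmGateway implements LlmGateway {
  calls: string[] = [];
  constructor(private readonly reply: string) {}
  async complete(prompt: string): Promise<string> {
    this.calls.push(prompt);
    return this.reply;
  }
}

// Code under test depends only on the interface, so unit tests inject
// the mock while the e2e script exercises a real Anthropic-backed one.
async function triageOrder(gateway: LlmGateway, orderId: string): Promise<string> {
  return gateway.complete(`Triage order ${orderId}`);
}
```

Asserting on `calls` gives cheap coverage of prompt construction and control flow; what it cannot give is any guarantee about what the real model returns, which is the e2e script's job.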

How the tests are laid out:

Co-located unit tests. Every package’s tests live next to the source under <package>/src/*.test.ts. Discoverability matters — when you read agent-runtime.ts, the tests are right there.

Integration tests in a separate test/ directory. The memory package has both — co-located unit tests for pure logic, plus packages/memory/test/*.workers.test.ts for the miniflare-pool ones. Different runner config; different directory makes that obvious.

No e2e tests committed alongside unit tests. The e2e script is bash, lives in apps/worker/scripts/e2e-demo.sh, and never runs in pnpm test. Mixing it in would either make pnpm test slow + costly, or make it inconsistent depending on environment.

Test names describe behavior, not function calls. it('throws TurnBudgetExceededError when iteration count exceeds maxIterations') not it('throws when limit hit'). The former tells you what’s being tested when it fails six months from now.

For day-to-day development:

```sh
pnpm test        # all 923 tests across the workspace, ~25s
pnpm test:watch  # vitest watch mode for the package you're in
pnpm typecheck   # TypeScript check across all 18 packages
pnpm check       # lint + typecheck + test together
```

For the integration tests specifically:

```sh
pnpm --filter @agent-platform/memory test
# Runs the unit + miniflare-pool tests
```

For the e2e demo (requires deployed Worker + secrets):

```sh
RUN_E2E=1 \
WORKER_URL=https://your-worker.workers.dev \
WORKER_AUTH_TOKEN="$(cat ~/.agent-platform-token)" \
./apps/worker/scripts/e2e-demo.sh
```

Full e2e runbook is in apps/worker/README.md.

The e2e script costs ~$0.05 per run (Anthropic + OpenAI). If it ran on every PR, that’s a meaningful budget item — and worse, it’d be a budget item attached to PRs from anyone, including contractors and bot accounts. The threat model isn’t theoretical: a few thousand sloppy PRs would mean a few hundred dollars of LLM bill.

Three options, all tracked but not implemented:

  • Run on merges to main only. Cheaper, but still a real cost, and the CI failure arrives after merge, too late to block bad code.
  • Run on labeled PRs. A maintainer adds a run-e2e label to trigger. More gatekeeping, less convenient.
  • Run nightly against a dedicated test tenant. Decouples PR latency from e2e cost; surfaces drift overnight. Probably the right answer for Phase 2.

Tracked as follow-up #12 (auto CI E2E runs).

Coverage percentages aren’t a goal on this platform. We don’t have a coverage badge in the README, and we don’t gate PRs on threshold. Two reasons:

The numerator can lie. A line of code can be “covered” by a test that doesn’t actually verify any behavior — calling a function and not asserting on the result still bumps the coverage counter. Coverage targets incentivize writing those tests.

The denominator misses. A bug class like the fetch-binding issue is invisible to c8 --coverage: the line of code is covered; the failure happens at runtime, in a different calling context. Coverage tools would say it’s tested. It wasn’t.

Instead: every PR’s tests should cover the behaviors the PR is responsible for, with unit tests for pure logic and integration tests when the change interacts with CF bindings. Reviewer judgment, not a numerical threshold.

If you’re contributing:

  • Read apps/worker/README.md for the operational runbook (deploy, seed memory, run e2e)
  • Look at packages/memory/test/ for examples of miniflare-based integration tests if your change touches CF bindings

If you’re evaluating the platform:

  • The 923-passing-tests-on-fresh-checkout property is real — every commit in the platform’s history is verified end-to-end on a fresh clone before being claimed done. The git log is trustworthy as a record of working states.
  • The testing-fidelity gap is documented because it’s been paid for, in real production bugs. No platform is bug-free; the platforms that survive are the ones that surface their weaknesses honestly.