
ADR-0030: Long-term memory access pattern

Status: Accepted
Date: 2026-05-01
Related:

  • ADR-0006 (six-layer context model; long-term memory is Layer 6)
  • ADR-0021 (runtime is memory-agnostic; gateway pattern)
  • ADR-0027 (long-term memory = Vectorize + D1)
  • ADR-0028 (LLM error taxonomy; ADR-0028 §4–5 inform the embedding adapter’s error surface)
  • ADR-0029 (EmbeddingAdapter; OpenAI text-embedding-3-small as v1 starter; configuration metadata on every vector)

ADR-0027 committed long-term memory to D1 (source of truth) + Vectorize (vector index, joined by ID). ADR-0029 committed the embedding-provider interface, the v1 starter (OpenAI text-embedding-3-small), and the principle that every vector carries its (provider, model, dimensions) config as metadata so a future re-embedding migration is a designed operation rather than a panic.

Neither ADR settled how agents use long-term memory at runtime. The questions left open:

  1. Access pattern. Pre-turn retrieval (orchestration code retrieves before the turn and dumps top-K into the bundle), tool-based retrieval (agent calls a “search memory” tool when it wants to), or both?
  2. Per-tenant scoping mechanism. Working memory was naturally per-tenant because it lived in a per-job DO. Long-term memory is in app-level Vectorize and D1 bindings — there is no DO scoping it. The mechanism must be explicit.
  3. Embedding-on-write coupling. The gateway needs an EmbeddingAdapter to write entries (embed before storing). Is the embed call synchronous on the write path, or async via queue?
  4. Read-time embedding. When an agent searches, the gateway embeds the query before searching Vectorize. Cost and latency considerations.
  5. Gateway shape. LongTermMemoryGateway mirrors WorkingMemoryGateway in pattern but the methods differ (semantic search instead of window read).
  6. The contract from core/memory.ts. LongTermMemory.store / search / delete already exists. This ADR confirms that contract survives unchanged and identifies what the implementation needs beyond it.

The long-term memory subsystem cannot ship without these decisions. This ADR settles them.

1. Access pattern: tool-based retrieval (Pattern B), not pre-turn retrieval


Long-term memory is exposed to agents as a tool the agent calls when it decides retrieval would help, not via a pre-populated LongTermMemoryLayer.retrieved array on the bundle.

Concretely:

  • The platform ships a built-in recall_memory tool (or a small family of memory tools) registered in the Tool Registry.
  • Agents that have memory_config.long_term_enabled: true get this tool added to their tool list automatically by the orchestration layer (not in the agent’s own YAML — declarative on/off only).
  • The tool input is a MemoryQuery (text + optional top_k + optional filters); the tool output is the matched MemoryEntry[].
  • The agent decides what to query, when, and how many times in a turn. The runtime’s existing tool loop handles the round trips.

LongTermMemoryLayer.retrieved in the context bundle remains in the type system (it is part of the locked-in ContextBundle shape from ADR-0006) but is always empty in v1. The orchestration layer passes { retrieved: [] } to the assembler. The bundle field is preserved so a future “background recall” pre-fetch (Pattern C, kitchen sink) is additive: it would populate that field while the tool remains available. v1 does not pre-fetch.
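A minimal sketch of the orchestration wiring, under stated assumptions: toolsForTurn, AgentDefinition, and the builtins record are illustrative names; the pieces locked by this ADR are memory_config.long_term_enabled, platform-side tool grants, and the always-empty retrieved field.

```ts
// Illustrative types; real definitions live in the platform packages.
type ToolDefinition = { readonly name: string };
type MemoryEntry = { readonly id: string; readonly content: string };

interface AgentDefinition {
  readonly memory_config: { readonly long_term_enabled: boolean };
  readonly tools: readonly ToolDefinition[];
}

// Memory tools are granted by the platform when long_term_enabled is true,
// never declared in the agent's own YAML.
function toolsForTurn(
  agent: AgentDefinition,
  builtins: { recallMemory: ToolDefinition; storeMemory: ToolDefinition },
): readonly ToolDefinition[] {
  return agent.memory_config.long_term_enabled
    ? [...agent.tools, builtins.recallMemory, builtins.storeMemory]
    : agent.tools;
}

// The bundle field survives in the type system but is always empty in v1;
// a future Pattern C pre-fetch would populate it without touching the tools.
const longTermLayer: { retrieved: readonly MemoryEntry[] } = { retrieved: [] };
```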

Why agent-driven retrieval over pre-turn:

  • Agent agency over what to recall. Pre-turn retrieval guesses: “embed the user message, retrieve top-K, hope it’s relevant.” Agent-driven retrieval lets the model decide it needs memory, formulate an actual query, and integrate the result into its reasoning. The model is a better query author than any heuristic over the raw user message.
  • Multi-query patterns work naturally. A turn might warrant retrieving “what did we decide about pricing last quarter,” then later “what does this customer prefer.” Pre-turn retrieval would have to anticipate both; tool-based retrieval just lets the agent ask twice.
  • No retrieval cost when retrieval isn’t needed. Pre-turn retrieval pays the embedding cost on every turn regardless. Tool-based pays only when the agent actually wants memory. For agents that mostly process structured input, the savings are meaningful.
  • Matches the multi-agent vision. The platform’s whole shape is “agents that reason about what to do.” Forcing memory retrieval to be a runtime concern that bypasses the agent’s reasoning regresses on that.
  • Natural composition with delegation. When parent delegates to a sub-agent, the sub-agent can choose to recall its own context independently. Pre-turn retrieval would have to be re-run for every delegated sub-task, with the orchestration layer guessing what’s relevant for each. Tool-based punts that decision to the sub-agent itself.

Considered and rejected:

  • Pattern A (pre-turn only). Lower-agency. Pays cost on every turn. Doesn’t fit the platform vision. Rejected as the v1 default; preserved as a future optionality via the bundle field.
  • Pattern C (both pre-turn and tool). Kitchen sink. Two retrievals per turn (one always, one optional). Cost and complexity. The tool-based form subsumes the pre-turn form — if a future workload demonstrates pre-turn retrieval would help, we add an “always-on first call” wrapper rather than a parallel mechanism.

Trigger to revisit:

  • Real workloads show agents consistently re-asking the same query at the start of every turn → bake that query into a pre-turn helper, possibly enable Pattern C for those agents.
  • A latency-sensitive workload appears where the extra tool round-trip is the bottleneck → consider speculative pre-fetch.

2. Per-tenant scoping: tenant_id metadata on every vector + WHERE on every D1 query


Working memory was per-tenant by accident — the DO that owned the run was per-tenant. Long-term memory has no such accident; the platform must enforce tenant isolation explicitly at the gateway.

Mechanism:

  • D1 long-term memory rows have a non-null tenant_id column. Every read includes WHERE tenant_id = ? as the first clause. Composite index on (tenant_id, agent_id, created_at) for the common access patterns.
  • Vectorize entries carry tenant_id in metadata. Every search call includes tenant_id in the metadata filter. Cross-tenant search is structurally impossible from the application layer.
  • The LongTermMemoryGateway constructor takes the tenant_id at construction time, not per-call. Per-turn fresh construction (matching the working-memory gateway pattern) means the tenant is fixed for the gateway’s lifetime; calls cannot accidentally leak across tenants because the gateway has no API to override it.
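A minimal sketch of the scoping mechanics. The class and method names are illustrative; what is decided here is the constructor-bound tenant_id, the leading WHERE clause, and the mandatory Vectorize metadata filter.

```ts
// Sketch: the tenant binding is a constructor argument with no setter and
// no per-call override, so every D1 and Vectorize call is scoped for the
// gateway's per-turn lifetime.
class TenantScopedMemoryQueries {
  constructor(private readonly tenantId: string) {}

  // Every Vectorize search carries tenant_id in the metadata filter.
  vectorizeFilter(agentId: string): Record<string, string> {
    return { tenant_id: this.tenantId, agent_id: agentId };
  }

  // Every D1 read leads with tenant_id, matching the composite index on
  // (tenant_id, agent_id, created_at).
  d1ReadStatement(agentId: string): { sql: string; bindings: [string, string] } {
    return {
      sql:
        "SELECT id, content, metadata, created_at FROM long_term_memory " +
        "WHERE tenant_id = ?1 AND agent_id = ?2 ORDER BY created_at DESC",
      bindings: [this.tenantId, agentId],
    };
  }
}
```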

Why metadata filter, not per-tenant Vectorize indexes:

ADR-0029 already names lt-memory-<provider>-<model> as the index naming convention — one index per active embedding configuration, platform-wide. Splitting further by tenant (lt-memory-<provider>-<model>-<tenant>) was considered there and deferred. This ADR confirms the deferral: until a customer has a hard data-residency requirement that forces per-tenant index isolation, metadata filtering is sufficient. Cloudflare’s Vectorize metadata filtering is structural (not best-effort), and the index is configured to reject any search without tenant_id in the filter — see operational notes below.

Considered and rejected:

  • Per-tenant Vectorize indexes from day one. Hard isolation but operationally heavy: one index per (provider, model, tenant). Re-embedding migrations would have to walk every tenant’s index. Deferred until a regulatory or customer requirement forces it.
  • Defaulting tenant_id from the agent record. Implicit derivation. Rejected because it makes scoping less inspectable; explicit constructor parameter forces every caller to make the binding visible.

Trigger to revisit:

  • Hard data-residency requirement from a customer → migrate to per-tenant indexes (designed migration, not a panic, because every vector already carries tenant_id metadata).
  • Vectorize metadata filtering shows degraded performance at scale → benchmark per-tenant indexes against the metadata-filtered single index.

3. Embedding-on-write: synchronous, on the write path


When an agent’s recall_memory tool returns entries, those entries usually came from a previous store_memory call (or from automatic ingestion). Storage is write-through:

  1. Agent (or platform helper) calls gateway.store(agent, entry)
  2. Gateway calls embedder.embed([entry.content]) — one embedding call, blocking
  3. Gateway writes the D1 row
  4. Gateway writes the Vectorize entry (vector + metadata including the D1 row ID)
  5. Returns
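A sketch of that write-through path, under stated assumptions: helper names and the exact column list are illustrative, and the types assume @cloudflare/workers-types. The decided parts are the ordering (embed, then D1, then Vectorize) and the throw-before-write failure mode.

```ts
// Sketch of steps 1-5 above; not the production implementation.
async function storeWriteThrough(
  deps: {
    tenantId: string;
    d1: D1Database;
    vectorize: VectorizeIndex;
    embedder: { embed(texts: string[]): Promise<number[][]> };
  },
  agentId: string,
  entry: { content: string },
): Promise<string> {
  // 1. Embed first (blocking). LLMRateLimitError / LLMOverloadedError
  //    propagate untouched; the tool loop owns retries, not the gateway.
  const [vector] = await deps.embedder.embed([entry.content]);

  const id = crypto.randomUUID(); // the real implementation uses a ULID

  // 2. The D1 row is the source of truth.
  await deps.d1
    .prepare(
      "INSERT INTO long_term_memory (id, tenant_id, agent_id, content, created_at) " +
        "VALUES (?1, ?2, ?3, ?4, ?5)",
    )
    .bind(id, deps.tenantId, agentId, entry.content, new Date().toISOString())
    .run();

  // 3. The Vectorize entry joins back to the D1 row by ID and carries the
  //    tenant/agent metadata that scoping depends on.
  await deps.vectorize.upsert([
    { id, values: vector, metadata: { tenant_id: deps.tenantId, agent_id: agentId } },
  ]);

  return id;
}
```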

Why synchronous and not async-via-queue:

  • Write-through means the agent’s next read sees what it just wrote. Async ingestion would create a window where a stored entry is invisible to a subsequent search — surprising and hard to reason about.
  • Embedding latency is bounded. OpenAI’s text-embedding-3-small typically responds in 100–300ms for short content. That’s tolerable on the write path. (The corresponding read-path embed-the-query call has the same latency profile and is also synchronous.)
  • Failure handling is straightforward. A failed embed throws; nothing is written. Async-via-queue would require dead-letter handling, retry orchestration, and a “stored but not searchable yet” intermediate state.
  • Cost is bounded too. Long-term memory writes are not high-frequency; agents store a few entries per turn, not thousands.

The decision is reversible. If a future workload generates high-write-rate ingestion (mass import, bulk migration), an async path can be added — store stays the synchronous default; a separate bulkStore would queue. v1 does not anticipate that need.

Operational note: the gateway must handle LLMRateLimitError and LLMOverloadedError from the embedding adapter (Series A landed LLMOverloadedError for exactly this). v1’s policy: surface them to the caller as-is; the tool-loop’s existing retry path handles transient errors. The gateway does not implement its own retry.

4. Read-time embedding: synchronous on the search path


Symmetrically: gateway.search(agent, query) embeds query.text before hitting Vectorize. Same provider, same model, same dimensions — guaranteed by the EmbeddingConfig metadata invariant from ADR-0029.

Caching: v1 does not cache query embeddings. Considered: a small in-memory LRU keyed by (query.text, embedding_config). Rejected for v1 because (a) within a single turn the agent rarely repeats the same query verbatim, and (b) across turns the gateway is per-turn fresh (working-memory pattern carried over) so an in-memory cache would not survive. If query embedding cost becomes meaningful, a KV-backed cache is the right home (cheap, eventually consistent, fits ADR-0027’s “KV for caches and list-indexes derivable from D1” rule). Deferred until measured.
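A sketch of the read path, with illustrative names; the decided shape is embed, then filtered Vectorize search, then hydration from D1 (the source of truth) via a single IN query that preserves similarity order.

```ts
// Sketch only; types assume @cloudflare/workers-types.
async function searchSketch(
  deps: {
    tenantId: string;
    d1: D1Database;
    vectorize: VectorizeIndex;
    embedder: { embed(texts: string[]): Promise<number[][]> };
  },
  agentId: string,
  query: { text: string; top_k?: number },
) {
  // Synchronous query embedding; no cache in v1.
  const [vector] = await deps.embedder.embed([query.text]);

  const { matches } = await deps.vectorize.query(vector, {
    topK: query.top_k ?? 5,
    filter: { tenant_id: deps.tenantId, agent_id: agentId },
  });
  if (matches.length === 0) return [];

  // Hydrate full entries from D1 in one IN query, tenant clause first.
  const ids = matches.map((m) => m.id);
  const placeholders = ids.map((_, i) => `?${i + 2}`).join(", ");
  const { results } = await deps.d1
    .prepare(
      `SELECT * FROM long_term_memory WHERE tenant_id = ?1 AND id IN (${placeholders})`,
    )
    .bind(deps.tenantId, ...ids)
    .all<{ id: string }>();

  // Re-order D1 rows to match Vectorize's similarity ranking.
  const byId = new Map(results.map((r) => [r.id, r] as const));
  return ids.map((id) => byId.get(id)).filter((r) => r !== undefined);
}
```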

The LongTermMemoryGateway interface mirrors WorkingMemoryGateway’s pattern (gateway between storage and turn execution) but with semantic-search semantics:

```ts
interface LongTermMemoryGateway {
  /**
   * Search long-term memory for entries matching the query.
   * Returns at most `query.top_k` results. Embedding the query
   * is performed internally; callers pass plain text.
   *
   * Tenant scoping is enforced via the gateway's bound tenant_id,
   * set at construction time. Per-agent filtering is layered on
   * top via the `agent` parameter.
   */
  search(agent: AgentId, query: MemoryQuery): Promise<readonly MemoryEntry[]>;

  /**
   * Store a new entry. Embeds synchronously, writes D1 row and
   * Vectorize entry, returns the new entry's ID.
   */
  store(agent: AgentId, entry: LongTermMemoryInput): Promise<string>;

  /**
   * Delete an entry by ID. Removes both the D1 row and the
   * Vectorize entry. Idempotent; deleting an absent ID is a no-op.
   */
  delete(agent: AgentId, id: string): Promise<void>;
}

class VectorizeBackedLongTermMemoryGateway implements LongTermMemoryGateway {
  constructor(args: {
    readonly tenantId: string;
    readonly d1: D1Database;
    readonly vectorize: VectorizeIndex;
    readonly embedder: EmbeddingAdapter;
    readonly now?: () => string;
  });
}
```
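A per-turn usage sketch: env.DB, job, and the adapter variable are assumptions, while LT_MEMORY matches the binding name recorded in the postscript below.

```ts
// Types (AgentId, EmbeddingAdapter, the gateway class) are imported from
// the platform packages; the declares below stand in for real wiring.
declare const env: { DB: D1Database; LT_MEMORY: VectorizeIndex };
declare const job: { tenantId: string };
declare const openAIEmbeddingAdapter: EmbeddingAdapter;
declare const agentId: AgentId;

// Fresh gateway per turn; the tenant is fixed for its lifetime.
const gateway = new VectorizeBackedLongTermMemoryGateway({
  tenantId: job.tenantId,
  d1: env.DB,
  vectorize: env.LT_MEMORY,
  embedder: openAIEmbeddingAdapter, // ADR-0029 v1 starter
});

const hits = await gateway.search(agentId, {
  text: "pricing decisions last quarter",
  top_k: 5,
});
```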

The gateway:

  • Implements both LongTermMemory (from core/memory.ts) and the read/store helpers that tools call. The core interface remains the underlying contract; the gateway is the production-shaped wrapper that takes Cloudflare bindings and an embedding adapter as constructor inputs.
  • Holds bindings in fields, not in a static. Per-turn fresh construction is the pattern.
  • The constructor takes pre-resolved bindings (D1Database, VectorizeIndex, EmbeddingAdapter). The platform’s existing pattern of locally-defined minimal storage interfaces (per ADR-0027 implementation: WorkingMemoryStorage) extends here — the actual signatures use small local interfaces, not @cloudflare/workers-types directly, so tests can fabricate without dragging in Workers types. Concrete naming (MemoryD1Database, MemoryVectorizeIndex) is an implementation detail but the principle is locked here.
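A sketch of what those minimal local interfaces might look like; the exact shapes are assumptions, only the names and the principle come from this ADR.

```ts
// Tests fabricate these directly instead of importing
// @cloudflare/workers-types.
interface MemoryVectorizeIndex {
  query(
    vector: number[],
    options: { topK: number; filter: Record<string, string> },
  ): Promise<{ matches: { id: string; score: number }[] }>;
  upsert(
    vectors: { id: string; values: number[]; metadata: Record<string, string> }[],
  ): Promise<unknown>;
  deleteByIds(ids: string[]): Promise<unknown>;
}

interface MemoryD1Database {
  prepare(sql: string): {
    bind(...values: unknown[]): {
      run(): Promise<unknown>;
      all<T>(): Promise<{ results: T[] }>;
    };
  };
}
```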

The recall_memory tool registered with the Tool Registry:

```yaml
name: recall_memory
description: |
  Search the agent's long-term memory for entries matching a query.
  Returns the most semantically similar entries. Use this when you
  need to remember context from a previous interaction or session.
input_schema:
  type: object
  properties:
    query:
      type: string
      description: The search query. Specific, well-formed queries return better results.
    top_k:
      type: integer
      minimum: 1
      maximum: 20
      default: 5
  required: [query]
tags: [memory, builtin]
```

Plus store_memory for explicit writes:

```yaml
name: store_memory
description: |
  Save a piece of information to long-term memory so it can be
  recalled in future turns or sessions. Use sparingly — only for
  facts, decisions, or context that is genuinely worth remembering.
input_schema:
  type: object
  properties:
    content:
      type: string
    metadata:
      type: object
      additionalProperties: true
  required: [content]
tags: [memory, builtin]
```
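A sketch of the tool handlers wrapping the gateway. The factory names match the Series C postscript; the handler shape and registry plumbing are assumptions. The factory argument keeps the gateway per-turn fresh.

```ts
// LongTermMemoryGateway and AgentId are imported from the platform packages.
function createRecallMemoryTool(
  gatewayForTurn: () => LongTermMemoryGateway,
  agent: AgentId,
) {
  return {
    name: "recall_memory",
    async execute(input: { query: string; top_k?: number }) {
      const entries = await gatewayForTurn().search(agent, {
        text: input.query,
        top_k: input.top_k ?? 5,
      });
      return { entries };
    },
  };
}

function createStoreMemoryTool(
  gatewayForTurn: () => LongTermMemoryGateway,
  agent: AgentId,
) {
  return {
    name: "store_memory",
    async execute(input: { content: string; metadata?: Record<string, unknown> }) {
      await gatewayForTurn().store(agent, input);
      // Per the postscript: the tool surfaces only {stored: true}; the ULID
      // stays internal because nothing agent-facing consumes it.
      return { stored: true };
    },
  };
}
```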

delete_memory is not exposed to agents in v1. Memory deletion is an admin operation; if an agent decides an entry is wrong, it should store a correction, not delete the original. Rationale: deletion is irreversible, and an agent making a wrong call about which memory to delete is an invisible and damaging failure mode. The LongTermMemory.delete contract method exists for admin-path use.

The tools are registered by the platform when the agent has long_term_enabled: true, not declared in the agent’s YAML. This keeps memory orthogonal to the agent definition: the agent declares “I want long-term memory”; the platform decides which tools that grants.

Becomes easy:

  • Agents that need memory just call the tool. No orchestration-layer pre-fetch logic.
  • Multi-query patterns within a turn work without any special plumbing.
  • Costs scale with use, not with turn count — agents that don’t query memory pay nothing for it.
  • The gateway is symmetric with the working-memory gateway: read/write helpers, per-turn fresh construction, takes bindings + adapter at construction time.
  • Per-tenant isolation is structural; cross-tenant access is impossible from the application surface, not just discouraged.
  • A future Pattern C (background pre-fetch) is additive — populates the existing-but-empty LongTermMemoryLayer.retrieved field without touching the tool surface.

Becomes hard / accepted tradeoffs:

  • Each retrieval is at least one extra LLM round trip (the agent decides to call the tool, the response comes back, the agent integrates it). For latency-critical workloads this matters. v1 accepts the cost; future Pattern C is the escape hatch.
  • The agent has to know when to call the tool. A poorly-prompted agent might never recall, or might recall constantly. This is the cost of giving the model agency. Mitigated by good description text on the tool and by future evaluation harnesses.
  • Embedding cost is paid synchronously on both write and read paths. For OpenAI text-embedding-3-small at v1 volumes, each call costs well under a cent. At scale, query-embedding caching becomes worth it.
  • Tenant isolation depends on Vectorize metadata filtering being structural and reliable. If Cloudflare’s filter implementation has bugs, tenant data could leak across the platform. We treat this as a vendor-trust assumption; mitigated by the per-tenant-index escape hatch.
  • The LongTermMemoryLayer.retrieved field in the bundle is always empty in v1 but stays in the type system. Slightly confusing to readers (“why is this field always empty?”); the comment in core/context.ts will be updated to point at this ADR.

Explicitly deferred:

  • Pre-turn retrieval (Pattern A or C). Bundle field preserved for additive future use.
  • Query-embedding cache. KV-backed when measured.
  • bulkStore for high-write ingestion. v1’s store is the only write surface.
  • delete_memory exposed to agents. Admin-path only.
  • Per-tenant Vectorize indexes. Metadata filter is sufficient.
  • Hybrid (semantic + keyword) search. Pure vector for v1; D1 LIKE/FTS fallback if measured to be needed.
  • Re-embedding migration tooling. Designed-in via EmbeddingConfig metadata; not built.
  • Memory archival to R2. Cold-storage of old entries when D1 size becomes a concern.
  • Memory quotas per tenant. v1 has no enforced quota; observability will surface heavy users.
  • Multi-vector / late-interaction models (ColBERT-style). Not supported by Vectorize today; revisit if Cloudflare ships it.

Triggers to revisit:

  • Workloads consistently re-ask the same query at turn start → enable Pattern C wrapper for those agents.
  • Latency-sensitive workload where the tool round-trip is the bottleneck → speculative pre-fetch.
  • Embedding cost on the query path becomes meaningful → KV-backed cache.
  • Hard data-residency requirement → per-tenant Vectorize indexes (designed migration via metadata).
  • High-volume ingestion workload → bulkStore + queue.
  • An agent type where delete_memory is genuinely needed → expose with audit logging and undo window.
  • Vectorize ceilings hit (recall quality, filter expressivity, scale) → pgvector via Hyperdrive (the named escape hatch from ADR-0027).
  • Cohere or Anthropic ships embeddings competitive with OpenAI’s → revisit ADR-0029’s starter choice.

Implementation plan (for follow-up commits, not this ADR)


In rough order, each its own commit:

  1. packages/embeddings — EmbeddingAdapter interface, EmbeddingConfig, MockEmbeddingAdapter for tests. Pure types + mock; no provider work.
  2. packages/embeddings-openai — OpenAIEmbeddingAdapter. Maps OpenAI errors to the LLM*Error taxonomy (using the new LLMOverloadedError from Series A). Unit tests + gated integration tests against the live API.
  3. D1 schema migration — long_term_memory table with tenant_id, agent_id, id, content, metadata, embedding_provider, embedding_model, embedding_dimensions, created_at. Composite index on (tenant_id, agent_id, created_at). A schema sketch follows this list.
  4. VectorizeBackedLongTermMemory in packages/memory — implements the LongTermMemory contract from core/memory.ts. Unit tests with mock D1 + mock Vectorize.
  5. VectorizeBackedLongTermMemoryGateway in packages/memory — the gateway (constructor takes bindings + adapter + tenantId). Unit tests with mock dependencies.
  6. Built-in recall_memory and store_memory tools — registered in the platform’s Tool Registry. Tool handlers wrap the gateway. Unit tests.
  7. Miniflare integration tests — exercise the gateway against real D1 and real Vectorize bindings inside the vitest-pool-workers environment. Across-turn round trip: store, then search in a fresh gateway, see the entry. Cross-tenant isolation verified.
  8. Vectorize index creation in deploy — lt-memory-openai-text-embedding-3-small index, configured to reject searches without tenant_id filter.
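A sketch of the commit-3 migration, combining the column list above with the STRICT, CHECK, and DESC-index details from the Series C postscript. The real file is plain SQL; its exact DDL belongs to the implementation commit.

```ts
// DDL embedded as a string for a one-file illustration only.
export const LONG_TERM_MEMORY_MIGRATION = /* sql */ `
CREATE TABLE long_term_memory (
  id                   TEXT NOT NULL PRIMARY KEY,  -- ULID
  tenant_id            TEXT NOT NULL CHECK (length(tenant_id) > 0),
  agent_id             TEXT NOT NULL CHECK (length(agent_id) > 0),
  content              TEXT NOT NULL,
  metadata             TEXT,                       -- JSON, nullable
  embedding_provider   TEXT NOT NULL,
  embedding_model      TEXT NOT NULL,
  embedding_dimensions INTEGER NOT NULL,
  created_at           TEXT NOT NULL               -- ISO 8601
) STRICT;

CREATE INDEX long_term_memory_tenant_agent_created
  ON long_term_memory (tenant_id, agent_id, created_at DESC);
`;
```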

Worker secret to add at deploy: OPENAI_API_KEY.

Operational notes:

  • Vectorize index configuration. The index lt-memory-openai-text-embedding-3-small is created with metadata indexes on tenant_id and agent_id. The deployment script should also configure the index to reject queries that omit the tenant_id filter — defense in depth against gateway bugs that might forget to apply the filter. (Cloudflare’s Vectorize supports this as a metadata-index requirement; verify in the implementation commit.)
  • Embedding rate limits. OpenAI’s tier-1 embedding rate limit is 3,000 RPM. At v1 volumes this is far over-provisioned. The gateway emits trace events for embed calls (per ADR-0027’s observability split: aggregates to Workers Analytics Engine, per-run audit to D1) so capacity is observable before it bites.
  • Cold-start latency. The OpenAIEmbeddingAdapter will cache nothing across Worker isolate cold starts. First embed in a cold isolate pays full TLS handshake + auth round trip; subsequent calls reuse the connection. v1 accepts this; if cold-start latency becomes a hot-path concern, Workers AI BGE (named in ADR-0029 as the migration target) is in-isolate and has no such cost.

What’s decided:

  • Long-term memory access is agent-driven via tool, not pre-turn retrieval. recall_memory and store_memory tools registered when long_term_enabled: true.
  • LongTermMemoryLayer.retrieved stays in the bundle type but is always empty in v1, preserving Pattern C as additive future work.
  • Per-tenant scoping is tenant_id metadata on every vector + WHERE tenant_id = ? on every D1 query. Constructor-bound; per-turn fresh gateway.
  • Embedding-on-write is synchronous. Same for read-time query embedding. Async/cache deferred until measured.
  • Gateway shape mirrors working-memory gateway. Per-turn fresh, takes bindings + adapter at construction.
  • delete_memory is admin-only, not exposed to agents.

What’s deferred:

  • Pattern C pre-fetch
  • Query embedding cache
  • Bulk ingestion
  • Per-tenant Vectorize indexes
  • Hybrid search
  • Re-embedding migration tooling
  • Memory archival to R2

Implementation series (8 commits) named above; Series A (the two new error classes) already landed, and this ADR unblocks Series C.


Postscript: implementation outcome

Series C completed. The long-term memory subsystem is shipped and the operational binding is wired into apps/worker/wrangler.toml. Decisions in this ADR were honored without revision. Implementation files:

  • packages/embeddings — EmbeddingAdapter interface + MockEmbeddingAdapter for tests (Series C, commit 1)
  • packages/embeddings-openai — OpenAIEmbeddingAdapter against /v1/embeddings, raw fetch, no SDK; error translation to the @agent-platform/llm taxonomy (Series C, commit 2)
  • apps/worker/migrations/0001_long_term_memory.sql — D1 schema with surrogate ULID PK, composite index on (tenant_id, agent_id, created_at DESC), defensive CHECK constraints, STRICT table mode (Series C, commit 3)
  • packages/core/src/ulid.ts — ULID generator + isUlid + ulidTime helpers (Series C, commit 4)
  • packages/memory/src/vectorize-backed-long-term-memory.ts — VectorizeBackedLongTermMemory class implementing LongTermMemory, with synchronous embed-on-write/read, structural Vectorize metadata filter for tenant + agent isolation, single WHERE id IN D1 query (Series C, commit 4)
  • packages/memory/src/long-term-memory-gateway.ts — VectorizeBackedLongTermMemoryGateway factory (Series C, commit 5)
  • packages/builtin-tools — recall_memory + store_memory tools using the factory pattern for per-turn-fresh gateway lifetime (Series C, commit 6)
  • packages/memory/test/long-term-memory-gateway.workers.test.ts — miniflare integration tests against real D1 + mock Vectorize (Series C, commit 7)
  • packages/memory/test/long-term-memory-gateway.remote.workers.test.ts — opt-in scaffolding for testing against real production Vectorize (Series C, commit 7)
  • apps/worker/wrangler.toml — [[vectorize]] binding = "LT_MEMORY" and operator setup commands in header comments (Series C, commit 8)

Two design choices made during implementation that aren’t in the original ADR text:

  • store_memory tool returns {stored: true}, NOT {id, stored: true}. The original Q3 design lean was to return the ULID so the agent could reference the entry in follow-up calls. On closer examination during commit 6 scaffolding, that case turned out to be hypothetical — delete_memory is admin-only and there is no update_memory. Returning the ID would have required either changing the LongTermMemory.store contract in core/memory.ts to Promise<string> or adding a parallel gateway method. Neither change was justified. The minimum surface that’s actually used is {stored: true}.

  • Vectorize metadata indexes (tenant_id, agent_id) MUST be created before first insert. This is a Cloudflare-side requirement that surfaced during commit 8’s deploy documentation: wrangler vectorize create-metadata-index declares which fields are filterable at query time, and Vectorize will not retroactively index pre-existing vectors. Operators following the wrong order (insert-then-index) will see queries return no results or, worse, mix tenants. The setup checklist in apps/worker/README.md step 5e makes this prominent.

The subsystem is available but not yet wired into any agent flow. Building the agent host that actually calls createRecallMemoryTool + createStoreMemoryTool and registers them in the ToolRegistry is a future commit — out of Series C scope. Today there are no agent definitions with long_term_enabled: true and no merchandising flow uses the tools. Operators who deploy this commit get the bindings ready but no behavior change until an agent is reconfigured.