ADR-0028: LLM provider abstraction — stress-test and revisions

Status: Accepted
Date: 2026-04-29
Amends: ADR-0019 (LLM adapter interface)
Related (forthcoming): ADR-0029 — embedding-provider abstraction

ADR-0019 introduced ModelAdapter with MockAdapter and AnthropicAdapter implementations. The interface was designed against Anthropic’s Messages API and has shipped one production agent (the weekly merchandising cron). It has never been stress-tested against a second provider.

Phase 1’s remaining work — working memory, long-term memory — does not require multi-provider support to ship. But the question matters now because:

  1. The next ADR (embeddings, ADR-0029) will need a provider abstraction. Whether that abstraction reuses ModelAdapter or stands alone depends on whether ModelAdapter survives this review.
  2. A second concrete adapter is plausible within Phase 2 (cost optimization, fallback, or a customer constraint). Discovering interface gaps while writing the second adapter is the painful path; discovering them on paper is cheap.
  3. ADR-0019 already has a tracked follow-up for StopReason 'refusal' / 'pause_turn'. That’s a known leak. There may be others.

This ADR does not commit to building any concrete adapter beyond AnthropicAdapter. It commits to an interface that would support adding one when there’s a real reason.

The method: walk four divergent provider shapes through the existing ModelAdapter interface:

  • Anthropic Messages — the baseline; the one in production.
  • OpenAI Chat Completions + Responses — the most likely second adapter; different tool-call shape, different streaming, different error model.
  • Ollama / local llama.cpp — emulated function calling, no remote network call, very different latency, no vision in many models, runs on user hardware.
  • Hugging Face Inference / TGI — bare model endpoints, often no native tool calling at all.

For each, ask: does the request shape survive? Does the response shape survive? What about streaming, errors, retries, capabilities, lifecycle? Where the answer is “no,” propose the smallest revision that fixes it without over-engineering for hypothetical providers.

1. Tool-call response shape — interface holds

Anthropic returns tool_use content blocks. OpenAI returns tool_calls on the assistant message. Ollama emulates OpenAI’s shape with varying fidelity. HF/TGI typically returns text and expects the caller to parse it.

ADR-0019’s ModelAdapter already returns a normalized AssistantMessage with explicit toolCalls: ToolCall[]. Each adapter is responsible for translating its native shape into that. No revision needed. Normalizing at the adapter boundary (rather than passing through provider-native shapes) was the right call — it pays off here.

HF/TGI without native tool support would require the adapter to inject a tool-calling preamble into the system prompt and parse the model’s text output. That is adapter-internal complexity, not interface complexity. Acceptable.
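
For concreteness, a sketch of the translation an adapter owns. The ToolCall field names and the OpenAI-side types are illustrative assumptions, not quotes from ADR-0019:

// Normalized shape (field names assumed for illustration).
interface ToolCall {
  id: string;
  name: string;
  arguments: Record<string, unknown>;
}

// Mapping OpenAI's Chat Completions tool calls, where arguments arrive
// as a JSON string rather than a parsed object:
function toolCallsFromOpenAI(
  native: Array<{ id: string; function: { name: string; arguments: string } }>
): ToolCall[] {
  return native.map((tc) => ({
    id: tc.id,
    name: tc.function.name,
    arguments: JSON.parse(tc.function.arguments) as Record<string, unknown>,
  }));
}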

2. Stop reasons — leak confirmed (subsumes the existing tracked follow-up)

Anthropic emits end_turn, tool_use, max_tokens, stop_sequence, plus refusal and pause_turn (added later). OpenAI emits stop, length, tool_calls, content_filter, function_call. Ollama emits a subset. HF/TGI often emits nothing structured.

ADR-0019’s StopReason enum is closed and provider-leaning. The tracked follow-up notes this for refusal / pause_turn, but the underlying problem is broader: a closed enum across providers is the wrong shape.

Revision: widen StopReason to a discriminated union with a normalized core ('end_turn' | 'tool_use' | 'max_tokens' | 'stop_sequence' | 'content_filter' | 'refusal') plus an explicit { kind: 'provider_specific', raw: string } escape hatch. This forces adapters to make a deliberate choice — map to a known kind, or surface as provider-specific — rather than silently dropping information. Subsumes and closes the tracked follow-up.
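
A minimal sketch of the widened type; the kind and raw field names are illustrative, the shape is the decision:

type StopReason =
  | { kind: 'end_turn' }
  | { kind: 'tool_use' }
  | { kind: 'max_tokens' }
  | { kind: 'stop_sequence' }
  | { kind: 'content_filter' }
  | { kind: 'refusal' }
  | { kind: 'provider_specific'; raw: string };

// Adapters map deliberately: OpenAI's 'length' becomes { kind: 'max_tokens' };
// anything unrecognized becomes { kind: 'provider_specific', raw: nativeValue }.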

3. Streaming — interface assumes a non-streaming shape that doesn’t generalize

ADR-0019 specified generate() as request-in / response-out. No streaming. Fine for the merchandising cron, but it breaks the moment a customer-facing agent cares about first-token latency.

The shapes differ sharply:

  • Anthropic: SSE with typed events (content_block_delta, message_delta, message_stop).
  • OpenAI: SSE with choices[0].delta.content chunks.
  • Ollama: newline-delimited JSON, not SSE.
  • HF/TGI: SSE for some endpoints, not others.

These can be normalized — every provider ultimately yields token deltas, tool-call deltas, and a terminal event — but only if the interface admits streaming as a first-class concern.

Revision: add generateStream() returning AsyncIterable<ModelStreamEvent> where ModelStreamEvent is a discriminated union ('text_delta' | 'tool_call_start' | 'tool_call_delta' | 'tool_call_end' | 'stop'). Keep generate() as the non-streaming convenience that buffers a stream internally. Adapters that don’t natively stream fall back to non-streaming and emit one text_delta followed by stop. Adapters that natively stream implement generateStream directly and have generate call it under the hood.

Load-bearing constraint: agents written against generate() must continue to work unchanged. Streaming is purely additive at the agent boundary. The runtime decides whether to stream based on calling context (a sync HTTP response can stream; an async DO job buffers).
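
A sketch of the event union and the buffering relationship, reusing the StopReason sketch from Finding 2; ModelRequest stands in for ADR-0019's real request type:

type ModelRequest = Record<string, unknown>; // placeholder for ADR-0019's type

type ModelStreamEvent =
  | { type: 'text_delta'; text: string }
  | { type: 'tool_call_start'; id: string; name: string }
  | { type: 'tool_call_delta'; id: string; argsDelta: string }
  | { type: 'tool_call_end'; id: string }
  | { type: 'stop'; reason: StopReason };

// generate() buffers the stream, so agents written against it see no change.
abstract class BufferingAdapter {
  abstract generateStream(req: ModelRequest): AsyncIterable<ModelStreamEvent>;

  async generate(req: ModelRequest): Promise<{ text: string; stopReason?: StopReason }> {
    let text = '';
    let stopReason: StopReason | undefined;
    for await (const event of this.generateStream(req)) {
      if (event.type === 'text_delta') text += event.text;
      if (event.type === 'stop') stopReason = event.reason;
      // tool-call accumulation elided for brevity
    }
    return { text, stopReason };
  }
}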

4. Errors — taxonomy holds, two extensions needed

Anthropic surfaces overloaded_error, rate_limit_error, invalid_request_error, authentication_error, permission_error, not_found_error, request_too_large, api_error. OpenAI uses HTTP status codes plus an error body with type and code. Ollama returns shell-level errors (process crashed, model not loaded). HF returns whatever the underlying inference server emits.

The existing taxonomy from ADR-0019 (RateLimitError, InvalidRequestError, AuthError, TransientError, ModelError) holds, but two additions are needed:

  • OverloadedError distinct from RateLimitError. They have different retry semantics: rate-limit means “you specifically are throttled, back off”; overloaded means “the provider is having a bad time, your retry might just make it worse.” Anthropic distinguishes these natively; OpenAI conflates them; Ollama has neither concept. The platform should distinguish.
  • CapabilityError for “this model can’t do that” — see Finding 5. Distinct from InvalidRequestError: the request isn’t malformed, the routed model can’t satisfy it.

Revision: add OverloadedError and CapabilityError. Document the retry contract per error class so the runtime’s retry logic stays provider-agnostic.
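
One way to make the per-class retry contract machine-readable rather than prose-only; the retryable flag is an assumption about how the runtime might consume the taxonomy, not part of the decision:

// Sketch: encode the retry contract on the class so runtime retry
// logic stays provider-agnostic.
abstract class LLMError extends Error {
  abstract readonly retryable: boolean;
}

class RateLimitError extends LLMError {
  readonly retryable = true; // you specifically are throttled: back off, retry
}

class OverloadedError extends LLMError {
  readonly retryable = true; // provider-wide distress: longer, jittered backoff
}

class CapabilityError extends LLMError {
  readonly retryable = false; // same request, same model: can never succeed
}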

5. Capabilities — biggest gap; the interface has no concept

The current interface lets an agent name a model string. There is no way for an agent to declare “I need vision” or “I need 200k context” or “I need tool calling” and have the runtime verify the chosen model satisfies that.

At one provider this barely matters — Anthropic’s models all support tool calling, vision is a known per-model property. At three providers it matters a lot. Concrete failures:

  • Asking Ollama for vision when the loaded model is text-only → silent garbage output.
  • Asking HF/TGI for tool calls when the endpoint has no tool-calling shim → silent garbage output.
  • Routing a 150k-token prompt to a 32k-context Ollama model → silent truncation, output reflects only the tail.

Pattern across all three: the failure is silent and only visible in output quality. That is the worst kind of failure for a multi-agent system where outputs feed other agents. One agent’s silent failure becomes another agent’s input; by the time you notice, you’re four layers deep debugging a downstream symptom.

Revision: introduce a ModelCapabilities declaration on every adapter:

interface ModelCapabilities {
  toolCalling: boolean;
  vision: boolean;
  streaming: boolean;
  maxContextTokens: number;
  maxOutputTokens: number;
  // structured outputs, JSON mode, etc. added as needed
}

Adapters expose getCapabilities(model: string): ModelCapabilities. The runtime — not the agent — checks capabilities before dispatch.

Enforcement is mandatory, not advisory. If an agent’s request requires a capability the routed model doesn’t support, the runtime throws CapabilityError before the network call. No “warn and proceed” mode. The platform is built on agents trusting other agents’ outputs; silent capability mismatches break that trust invisibly. Loud failure at runtime is preferable to plausible-looking garbage propagating downstream.
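
A sketch of the pre-dispatch check; the request fields (tools, images, estimatedInputTokens) are illustrative assumptions:

function assertCapable(
  caps: ModelCapabilities,
  req: { tools?: unknown[]; images?: unknown[]; estimatedInputTokens: number }
): void {
  if (req.tools?.length && !caps.toolCalling)
    throw new CapabilityError('model does not support tool calling');
  if (req.images?.length && !caps.vision)
    throw new CapabilityError('model does not support vision');
  if (req.estimatedInputTokens > caps.maxContextTokens)
    throw new CapabilityError('prompt exceeds the model context window');
  // Deliberately no warn-and-proceed branch.
}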

The cost of enforcement is paid once at agent-design time by the developer (be explicit about what the agent needs, choose models that satisfy it, or handle CapabilityError deliberately) rather than as an ongoing operator tax of mysterious output quality issues. This is the right tradeoff for a multi-agent platform.

This also unlocks future capability-based routing (“any model that supports vision and 200k context”), but that’s a Phase 2 nice-to-have, not a v1 requirement. The v1 requirement is just check before dispatch.

6. Lifecycle — interface assumes “remote stateless API”; local providers break this

Anthropic, OpenAI, HF/TGI: stateless HTTP. Send request, get response. Adapter is a thin wrapper around fetch.

Ollama: local process. Has setup costs (model load, seconds-to-minutes), shutdown costs, “is the server running” semantics. The interface has no concept of init() / dispose() / healthCheck().

In the Workers context this is a non-issue (no local process). But the platform principle is “core remains runtime-agnostic.” If the runtime ever ships to a Node host or self-hosted deployment, Ollama becomes a real target.

Revision: add optional init() and dispose() to ModelAdapter, defaulting to no-ops. Adapters that need lifecycle hooks (Ollama, future local backends) implement them. Adapters that don’t (Anthropic, OpenAI) leave them as defaults. Optional is load-bearing — making them required would force every adapter to write no-op methods and bloats the simple case.
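
Sketched against the interface, with the non-lifecycle methods abbreviated from the revisions above:

interface ModelAdapter {
  generate(req: ModelRequest): Promise<AssistantMessage>;
  generateStream(req: ModelRequest): AsyncIterable<ModelStreamEvent>;
  getCapabilities(model: string): ModelCapabilities;
  // Optional lifecycle hooks; stateless HTTP adapters simply omit them.
  init?(): Promise<void>;
  dispose?(): Promise<void>;
}

// The runtime calls them defensively, so absence is a no-op:
//   await adapter.init?.();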

7. Tokenization and cost — interface omits, but probably correctly

Different providers tokenize differently. Anthropic and OpenAI both expose token counts in responses; Ollama exposes them; HF often doesn’t.

ADR-0019 returns usage: { inputTokens, outputTokens } on the response. This holds. Adapters that can’t get exact counts approximate (HF can use the model’s tokenizer or a heuristic). Cost computation is downstream — a costFromUsage(usage, model) function is a concern of the trace writer (per ADR-0027’s observability split), not the LLM call. No revision needed; flagged for posterity.
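
For the record, the downstream shape this implies. The rate values are placeholders, not real pricing; the actual table lives with the trace writer per ADR-0027:

type Usage = { inputTokens: number; outputTokens: number };

// Placeholder rates in USD per million tokens; illustrative only.
const RATES_PER_MTOK: Record<string, { input: number; output: number }> = {
  'example-model': { input: 3, output: 15 },
};

function costFromUsage(usage: Usage, model: string): number | undefined {
  const rate = RATES_PER_MTOK[model];
  if (!rate) return undefined; // unknown model: report no cost rather than guess
  return (usage.inputTokens * rate.input + usage.outputTokens * rate.output) / 1_000_000;
}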

8. Embeddings — confirmed split into ADR-0029

Walking embeddings through ModelAdapter is the clearest force-fit of the exercise. Embeddings have no streaming, no tool calls, no system prompt; batch input is the primary mode (chat batching is rare; embedding batching is the default); and the provider mix is different (Anthropic for chat is plausible alongside Voyage or local sentence-transformers for embeddings).

Trying to make ModelAdapter cover both produces an interface where most fields are unused for embedding calls. Confirmed: embeddings get their own EmbeddingAdapter in ADR-0029. They will share the LLMError taxonomy and probably share auth/retry infrastructure at the implementation level, but the public interface is separate.
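
For contrast, a plausible shape only; ADR-0029 owns the real definition and every name below is provisional:

interface EmbeddingAdapter {
  // Batch-first: embedding many texts in one call is the default mode.
  embed(texts: string[], model: string): Promise<number[][]>;
}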

Amend ADR-0019 with the following revisions:

  1. StopReason becomes a discriminated union with a provider_specific escape hatch. No more closed enum. Subsumes the existing refusal/pause_turn follow-up.
  2. Add generateStream() as a first-class method. generate() becomes a buffering convenience over the stream. Stream events are a normalized discriminated union.
  3. Extend the error taxonomy with OverloadedError and CapabilityError. Document retry contract per class.
  4. Introduce ModelCapabilities and getCapabilities() on every adapter. Runtime enforces capabilities before dispatch and throws CapabilityError on mismatch. No advisory mode.
  5. Add optional init() and dispose() lifecycle hooks. No-op default. Used only by adapters that need them.
  6. Confirm embeddings are out of scope for ModelAdapter. Separate EmbeddingAdapter in ADR-0029.

Findings 1 and 7 require no changes; documented for the record.

Becomes easy:

  • Writing the second adapter (OpenAI is the obvious next target) is now an exercise in mapping native shapes to a normalized interface — not in discovering that the interface itself has gaps.
  • Streaming agents become possible without a future interface break.
  • “Wrong model for the task” failures are loud and immediate instead of silent and only visible in output quality.
  • Local/self-hosted providers (Ollama) become viable without retrofitting lifecycle into the interface later.

Becomes hard / accepted tradeoffs:

  • AnthropicAdapter needs updates: implement generateStream() and getCapabilities(), distinguish overloaded from rate-limited, widen stop reasons. Real implementation work on a working component. Additive, not invasive — but not zero.
  • Agents written against the v1 generate() must continue to work unchanged through the migration; tests cover this constraint.
  • ModelCapabilities is a static declaration; in reality some capabilities are model-version-dependent and providers ship new models. The static map needs updating per release. Acceptable; flagged as a maintenance point.
  • Enforcement-mode capabilities means an agent that names a wrong-for-task model fails loudly at runtime rather than producing degraded output. Agent authors must be deliberate about model choice. This is the intended tradeoff.

Explicitly deferred:

  • Capability-based routing (“give me any model that supports vision + 200k”). v1 is “agent names model, runtime checks capabilities, throws if mismatch.” Routing comes later when there’s a real use case.
  • Cost computation library. Stays a separate concern downstream of the adapter, in the trace writer per ADR-0027.
  • Adapter implementations beyond Anthropic. This ADR commits to an interface, not to building OpenAI / Ollama / HF adapters. Those land when there’s a concrete reason (cost, fallback, customer constraint).
  • Structured-output / JSON-mode normalization. Real divergence here (Anthropic tool-use-as-JSON shim, OpenAI native JSON mode, Ollama format=json, HF varies). Worth its own ADR when the first agent actually needs structured output. Flagged but not addressed.

Revisit when:

  • A second adapter implementation reveals an interface gap not covered above → amendment.
  • A real customer or agent needs structured output / JSON mode → dedicated ADR.
  • Capability-based routing becomes a Phase 2 requirement → dedicated ADR.
  • Streaming reveals back-pressure or cancellation gaps not modeled in generateStream() → amendment.
  • A provider’s stop-reason space grows beyond the discriminated-union core often enough that provider_specific becomes the common case rather than the exception → revisit normalized core.

Implementation plan (for follow-up commits, not this ADR)

In rough order, each its own commit:

  1. Widen StopReason to a discriminated union; update AnthropicAdapter to map refusal to its first-class kind and to surface pause_turn as provider_specific until it earns promotion.
  2. Add OverloadedError and CapabilityError to the error taxonomy. Update AnthropicAdapter error mapping.
  3. Add getCapabilities() to ModelAdapter and AnthropicAdapter. Add runtime validation before dispatch (enforce, throw CapabilityError on mismatch).
  4. Add generateStream() to ModelAdapter. Implement in AnthropicAdapter. Have generate() call it under the hood.
  5. Add optional init() / dispose(). No AnthropicAdapter change needed.

Tests grow as each commit lands. No agent code should change as a result of any of these commits — the v1 generate() surface is preserved.