# Anthropic
The platform’s primary LLM provider. Every agent turn is an Anthropic API call — main agents on Claude Sonnet, sub-agents on Claude Haiku, with the model choice declared per-agent in the YAML.
## What we use it for

Two model tiers, one per agent role:
| Agent role | Model | Used for |
|---|---|---|
| Main agents | Claude Sonnet | Top-level reasoning. Receives the user’s request, decides what tools to call, when to delegate. Examples: triage, merchandising. |
| Sub-agents | Claude Haiku | Focused, narrower tasks. Faster + cheaper per call. Examples: refund_decision, communication. |
The model is part of the agent definition; changing it means editing a YAML file. The runtime treats every model uniformly through the `ModelAdapter` interface, so the rest of the platform doesn’t know which model is in use.
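For orientation, here is a minimal sketch of the shape such an interface might take. The real definition lives in `packages/llm/`; every name below (`CompletionRequest`, `CompletionResult`, `ToolSpec`) is illustrative, not the platform’s actual API.

```ts
// Sketch only: the platform’s real ModelAdapter lives in packages/llm/
// and almost certainly differs in detail. These types are illustrative.

export interface ToolSpec {
  name: string;
  description: string;
  inputSchema: object; // JSON Schema for the tool’s arguments
}

export interface CompletionRequest {
  system: string; // system prompt assembled from the agent’s YAML definition
  messages: { role: "user" | "assistant"; content: string }[];
  tools?: ToolSpec[];
  maxTokens?: number;
}

export interface CompletionResult {
  text: string;
  toolCalls: { name: string; input: unknown }[];
  usage: { inputTokens: number; outputTokens: number };
}

// One implementation per provider. The runtime calls complete() and
// never learns which model or vendor sits behind it.
export interface ModelAdapter {
  complete(req: CompletionRequest): Promise<CompletionResult>;
}
```

Under this shape, the Anthropic adapter in `packages/llm-anthropic/` would be one implementation, constructed with whatever model id the per-agent YAML declares.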
## Why we picked it

The decision was: which LLM provider?
| Option | Verdict |
|---|---|
| Anthropic (Claude family) | Chosen. Best-in-class tool use; clear pricing tiers (Haiku/Sonnet/Opus); strong system-prompt adherence; SDK first-class for streaming + tool calls. |
| OpenAI | Strong contender, especially for older systems. We use OpenAI for embeddings (a different problem). For agent reasoning, Claude’s tool-use behavior was a better fit at platform-design time. |
| Open-source models (Llama, Mistral) self-hosted | Costs more to run at our scale than to call hosted APIs; loses the “no infra” property; quality gap is real for agent reasoning. |
| Google Gemini, Cohere, others | Plausible alternatives. The cost of evaluating “which LLM” forever is high; we picked one, abstracted via ModelAdapter, and moved on. |
The platform isn’t locked to Anthropic; `ModelAdapter` is the abstraction. Adding an OpenAI adapter for agent reasoning is an afternoon’s work. We chose Anthropic first because the demo matters more than the comparison matrix.
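To make “an afternoon’s work” concrete, here is a sketch of what a second adapter could look like behind the interface sketched above. The `OpenAIAdapter` class, the import path, and the model id are all hypothetical; only the `openai` SDK calls are real.

```ts
import OpenAI from "openai";
// Hypothetical import path; see the ModelAdapter sketch above.
import type { ModelAdapter, CompletionRequest, CompletionResult } from "./model-adapter";

// Hypothetical second adapter: same interface, different vendor.
export class OpenAIAdapter implements ModelAdapter {
  private client = new OpenAI(); // reads OPENAI_API_KEY from the environment

  async complete(req: CompletionRequest): Promise<CompletionResult> {
    const res = await this.client.chat.completions.create({
      model: "gpt-4o", // stand-in model id
      max_tokens: req.maxTokens ?? 1024,
      messages: [{ role: "system" as const, content: req.system }, ...req.messages],
      tools: req.tools?.map((t) => ({
        type: "function" as const,
        function: { name: t.name, description: t.description, parameters: t.inputSchema },
      })),
    });

    const msg = res.choices[0].message;
    return {
      text: msg.content ?? "",
      toolCalls: (msg.tool_calls ?? []).map((c) => ({
        name: c.function.name,
        input: JSON.parse(c.function.arguments),
      })),
      usage: {
        inputTokens: res.usage?.prompt_tokens ?? 0,
        outputTokens: res.usage?.completion_tokens ?? 0,
      },
    };
  }
}
```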
## What it costs

Anthropic pricing (per million tokens):
| Model | Input | Output |
|---|---|---|
| Claude 3.5 Haiku | $1.00 | $5.00 |
| Claude 3.5 Sonnet | $3.00 | $15.00 |
Phase 1’s e2e demo run cost roughly:
- Triage agent: ~3K input tokens, ~500 output tokens (Sonnet) ≈ $0.017
- Refund decision agent: ~2K input, ~400 output (Haiku) ≈ $0.004
- Communication agent: ~2K input, ~400 output (Haiku) ≈ $0.004
- Plus seed embeddings (OpenAI, separate)
Total: ~$0.025 per full demo run in LLM costs. The $0.05 quoted in the runbook is conservative.
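The arithmetic behind those estimates, spelled out (prices from the table above; token counts are the rough per-agent figures):

```ts
// Prices per million tokens, from the table above.
const PRICE = {
  sonnet: { input: 3.0, output: 15.0 },
  haiku: { input: 1.0, output: 5.0 },
} as const;

// Cost of one agent turn: tokens / 1M, times the per-million price.
function turnCost(model: keyof typeof PRICE, inputTokens: number, outputTokens: number) {
  const p = PRICE[model];
  return (inputTokens / 1e6) * p.input + (outputTokens / 1e6) * p.output;
}

turnCost("sonnet", 3_000, 500); // 0.0165 → triage
turnCost("haiku", 2_000, 400);  // 0.0040 → refund_decision
turnCost("haiku", 2_000, 400);  // 0.0040 → communication
// Sum: 0.0245, i.e. the ~$0.025 per demo run quoted above.
```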
A production scenario at scale (thousands of triage runs per day) puts this at $25-50/day. Tier per task aggressively if cost becomes a concern: many decisions in a delegation chain don’t need Sonnet.
## What it replaces

A self-hosted LLM stack: GPU clusters, model serving (vLLM / TGI / etc.), monitoring, scaling. Anthropic’s hosted API reduces all of this to an API key.
## Where to look

- `packages/llm-anthropic/` — the Anthropic adapter; uses `@anthropic-ai/sdk` and translates SDK errors into the platform’s error taxonomy
- `packages/llm/` — the provider-agnostic `ModelAdapter` interface that the runtime sees
- `apps/worker/agents/*.yaml` — per-agent model declarations (`model: claude-sonnet-4-5-20250929`, etc.)
## Trade-offs we accepted

- Vendor concentration. All agent reasoning runs through Anthropic. If they have an outage, agents stop working. The `ModelAdapter` abstraction means we can fail over to another provider, but we don’t do that automatically today.
- No streaming to the client. The Worker buffers the full agent turn and returns a complete `AgentReport`. Streaming would be nice for the eventual UI but isn’t needed for the HTTP API.
- No fine-tuning. Anthropic doesn’t offer fine-tuning today. Our customization happens via system prompts and the YAML characteristics block. This is the right level of control for agent definition; fine-tuning would add a maintenance burden most agents don’t need.
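On the first trade-off: automatic failover is not implemented, but the abstraction makes the eventual shape easy to picture. A sketch, with hypothetical names throughout; nothing like this exists in the codebase today.

```ts
// Hypothetical import path; see the ModelAdapter sketch earlier on this page.
import type { ModelAdapter, CompletionRequest, CompletionResult } from "./model-adapter";

// Hypothetical failover wrapper: try the primary provider, fall back
// to a secondary one if the call throws.
export class FailoverAdapter implements ModelAdapter {
  constructor(
    private primary: ModelAdapter,
    private secondary: ModelAdapter,
  ) {}

  async complete(req: CompletionRequest): Promise<CompletionResult> {
    try {
      return await this.primary.complete(req);
    } catch {
      // A real implementation would fail over only on outage-shaped
      // errors (timeouts, 5xx), not on bad requests.
      return this.secondary.complete(req);
    }
  }
}
```

Because the rest of the platform only sees `ModelAdapter`, wiring this in would be a change at adapter construction time, not in the runtime.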