The Multi-Model Routing Pattern
The routing pattern that cut my costs 74% is conceptually simple: classify each task before sending it to a model, and route to the cheapest model that can handle it correctly. The implementation requires a routing layer that lives between your application and the model APIs.
Here is the routing architecture I use in production:
// multi-model-router.ts
// Routes tasks to the cheapest capable model based on complexity classification

type ModelTier = 'haiku' | 'sonnet' | 'opus'

interface TaskClassification {
  tier: ModelTier
  reason: string
  estimatedTokens: number
}

interface ModelPricing {
  model: string
  inputCostPerMTok: number
  outputCostPerMTok: number
}

type RouterConfig = Record<ModelTier, ModelPricing>

// Prices are USD per million tokens. Confirm them against the provider's
// current pricing page before relying on the cost estimates.
const ROUTER_CONFIG: RouterConfig = {
  haiku: {
    model: 'claude-haiku-4-5-20251001',
    // Haiku 4.5 list price is $1 / $5 per MTok ($0.25 / $1.25 was Claude 3 Haiku)
    inputCostPerMTok: 1.0,
    outputCostPerMTok: 5.0,
  },
  sonnet: {
    model: 'claude-sonnet-4-6',
    inputCostPerMTok: 3.0,
    outputCostPerMTok: 15.0,
  },
  opus: {
    model: 'claude-opus-4-7',
    inputCostPerMTok: 5.0,
    outputCostPerMTok: 25.0,
  },
}
// Complexity signals that force Opus routing.
// Patterns are unanchored substring matches, so "auth" also matches "author".
// False positives here route up to Opus, which costs money but not quality.
const OPUS_SIGNALS = [
  /security|vulnerability|CVE|auth|payment|webhook/i,
  /architecture|refactor.*cross.cutting|design.*system/i,
  /multi.file.*analysis|codebase.*review/i,
  /trust.boundary|privilege|escalation/i,
]

// Signals that allow Haiku routing (cheap, mechanical tasks)
const HAIKU_SIGNALS = [
  /format|lint|rename|replace.all/i,
  /meta.description|seo.title|alt.text/i,
  /translate|summarize.in.[0-9]+.words/i,
  /extract.*list|parse.*json|convert.*csv/i,
]

export function classifyTask(prompt: string, contextTokens: number = 0): TaskClassification {
  // Rough estimate: ~4 characters per token, rounded up to a whole token
  const estimatedTokens = contextTokens + Math.ceil(prompt.length / 4)

  // High-stakes tasks always go to Opus regardless of apparent simplicity
  for (const signal of OPUS_SIGNALS) {
    if (signal.test(prompt)) {
      return { tier: 'opus', reason: 'trust-boundary signal detected', estimatedTokens }
    }
  }

  // Large context forces at least Sonnet (Haiku quality degrades with context)
  if (contextTokens > 50_000) {
    return { tier: 'sonnet', reason: 'large context window', estimatedTokens }
  }

  // Mechanical/formatting tasks can use Haiku
  for (const signal of HAIKU_SIGNALS) {
    if (signal.test(prompt)) {
      return { tier: 'haiku', reason: 'mechanical task signal', estimatedTokens }
    }
  }

  // Default: Sonnet handles most feature work and routine edits
  return { tier: 'sonnet', reason: 'default routing', estimatedTokens }
}

export function getModel(tier: ModelTier): string {
  return ROUTER_CONFIG[tier].model
}

export function estimateCost(tier: ModelTier, inputTokens: number, outputTokens: number): number {
  const config = ROUTER_CONFIG[tier]
  return (inputTokens / 1_000_000) * config.inputCostPerMTok +
    (outputTokens / 1_000_000) * config.outputCostPerMTok
}
This router reduced my Opus 4.7 usage from 100% of calls to roughly 15% — the genuinely complex architectural and security tasks that actually need it. Sonnet handles about 60% of calls (feature implementation, analysis, most agent steps). Haiku handles the remaining 25% (formatting, SEO rewrites, batch string operations). The cost profile shifted from ~$5/MTok average to ~$2.20/MTok average — a 56% reduction in model cost alone, before any token optimization.
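An average rate like the one above is a weighted sum: each tier's price multiplied by its share of tokens. The sketch below shows that arithmetic. The function names and the 50/50 token split are hypothetical; only the $5 (Opus) and $3 (Sonnet) input prices come from the router config. Note that the weights are shares of tokens, which are not the same thing as shares of calls, so a mix measured in calls will not reproduce a measured per-token average exactly.

```typescript
// Blended price per million tokens for a routing mix.
// Assumption: `share` is each tier's fraction of total tokens (not calls).
interface TierShare {
  share: number        // fraction of all tokens handled by this tier, 0..1
  costPerMTok: number  // price in USD per million tokens
}

function blendedRatePerMTok(mix: TierShare[]): number {
  const totalShare = mix.reduce((sum, t) => sum + t.share, 0)
  if (Math.abs(totalShare - 1) > 1e-9) {
    throw new Error(`shares must sum to 1, got ${totalShare}`)
  }
  return mix.reduce((sum, t) => sum + t.share * t.costPerMTok, 0)
}

// Fractional saving relative to sending everything to the baseline tier.
function reductionVsBaseline(baselineRate: number, blendedRate: number): number {
  return (baselineRate - blendedRate) / baselineRate
}

// Hypothetical mix: half of input tokens stay on Opus ($5), half move to Sonnet ($3).
const blended = blendedRatePerMTok([
  { share: 0.5, costPerMTok: 5.0 },
  { share: 0.5, costPerMTok: 3.0 },
])
const saved = reductionVsBaseline(5.0, blended)
console.log(`blended: $${blended.toFixed(2)}/MTok, saved: ${(saved * 100).toFixed(0)}%`)
```

Plugging in token shares measured from your own logs, rather than call counts, is what makes the result comparable to your invoice.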
GPT-5.5 uses 72% fewer tokens per task than Opus 4.7 for equivalent outputs on coding benchmarks, which changes the economics of cross-provider routing. At $5/$30 per MTok, GPT-5.5 looks more expensive per token than Opus 4.7 at $5/$25. But at 72% token reduction on similar tasks, the effective cost is lower. Routing provider by task type — not just model by task type — is the next frontier of cost optimization, and OpenAI’s efficiency gains are what make it worth modeling.
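The per-task comparison can be worked through directly. This is a sketch under stated assumptions: the prices are the ones quoted above, the token counts are hypothetical, and the 72% reduction is applied to output tokens only while input tokens are held equal. If the reduction also applies to input tokens, the gap widens further.

```typescript
// Cost of one task given per-MTok prices and the tokens the task actually uses.
interface Pricing {
  inputPerMTok: number
  outputPerMTok: number
}

function taskCostUsd(p: Pricing, inputTokens: number, outputTokens: number): number {
  return (inputTokens / 1_000_000) * p.inputPerMTok +
    (outputTokens / 1_000_000) * p.outputPerMTok
}

// Prices as quoted in the text.
const opusPricing: Pricing = { inputPerMTok: 5, outputPerMTok: 25 }
const gptPricing: Pricing = { inputPerMTok: 5, outputPerMTok: 30 }

// Hypothetical task: 2,000 input tokens, 10,000 output tokens on Opus.
const inputTokens = 2_000
const baselineOutputTokens = 10_000
// Assumption: the 72% reduction applies to output tokens only.
const reducedOutputTokens = Math.round(baselineOutputTokens * (1 - 0.72))

const opusCost = taskCostUsd(opusPricing, inputTokens, baselineOutputTokens)
const gptCost = taskCostUsd(gptPricing, inputTokens, reducedOutputTokens)

console.log(`Opus: $${opusCost.toFixed(3)}  GPT: $${gptCost.toFixed(3)}`)
```

Under these assumptions the higher per-token output price is more than offset by the smaller token count, which is the point the paragraph above makes.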
The multi-model routing developer guide has a more detailed breakdown of cross-provider routing for specific task classes.
Token Optimization Techniques That Actually Work
Routing gets you a 40-60% cost reduction. Token optimization accounts for most of the remaining savings. These are the techniques with meaningful impact at production scale.
Context trimming before each agent step. Most agent frameworks accumulate context naively — every tool result, every intermediate output appended to the running context. By step 8 of a 10-step workflow, 60-70% of your input tokens are intermediate results that the model does not need to reason about the current step. Trim aggressively: keep the original task, the most recent 2-3 tool results, and any critical constraints. Archive the rest.
// token-counter.ts: measure and trim context before agent steps

interface ContextWindow {
  systemPrompt: string
  task: string
  history: Array<{ role: 'user' | 'assistant'; content: string; stepIndex: number }>
  maxTokens: number
}

function estimateTokens(text: string): number {
  // Rough approximation: 4 characters per token for English
  return Math.ceil(text.length / 4)
}

export function trimContext(ctx: ContextWindow, targetTokenBudget: number): ContextWindow {
  const systemTokens = estimateTokens(ctx.systemPrompt)
  const taskTokens = estimateTokens(ctx.task)
  const overhead = systemTokens + taskTokens + 500 // reserve for response
  let budget = targetTokenBudget - overhead
  const kept: typeof ctx.history = []

  // Prefer the most recent 3 steps (recency bias is real). Walk them newest
  // first so the newest step gets first claim on the budget, and unshift so
  // `kept` stays in chronological order. A step that does not fit is skipped.
  const recent = ctx.history.slice(-3).reverse()
  for (const step of recent) {
    const cost = estimateTokens(step.content)
    if (cost <= budget) {
      kept.unshift(step)
      budget -= cost
    }
  }

  // Fill remaining budget with older steps (most recent first)
  const older = ctx.history.slice(0, -3).reverse()
  for (const step of older) {
    const cost = estimateTokens(step.content)
    if (cost <= budget) {
      kept.unshift(step)
      budget -= cost
    } else {
      break
    }
  }

  return { ...ctx, history: kept }
}

export function countContextTokens(ctx: ContextWindow): number {
  return estimateTokens(ctx.systemPrompt) +
    estimateTokens(ctx.task) +
    ctx.history.reduce((sum, step) => sum + estimateTokens(step.content), 0)
}
Structured output enforcement. Agents that return free-form prose when JSON would suffice waste tokens on framing that your application immediately discards. Enforcing structured output via response schemas reduces output token counts by 30-50% for data extraction and analysis tasks. At the prices in the router config above, every output token costs 5x an input token, so trimming outputs matters more per token than trimming inputs.
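One way to enforce a schema with the Anthropic Messages API is to define a tool and force the model to call it with `tool_choice`, so the reply arrives as JSON arguments rather than prose. The sketch below builds such a request and parses the result. The `record_invoice` tool, the `Invoice` fields, and both function names are hypothetical, and the block types are minimal local stand-ins so the example runs without the SDK installed.

```typescript
// Minimal local types mirroring the tool-use fields of the Messages API.
type ContentBlock =
  | { type: 'text'; text: string }
  | { type: 'tool_use'; id: string; name: string; input: unknown }

interface Invoice { vendor: string; totalUsd: number } // hypothetical schema

const TOOL_NAME = 'record_invoice' // hypothetical tool name

function buildExtractionRequest(documentText: string) {
  return {
    model: 'claude-haiku-4-5-20251001',
    max_tokens: 256,
    tools: [{
      name: TOOL_NAME,
      description: 'Record the vendor and total from an invoice.',
      input_schema: {
        type: 'object' as const,
        properties: {
          vendor: { type: 'string' },
          totalUsd: { type: 'number' },
        },
        required: ['vendor', 'totalUsd'],
      },
    }],
    // Forcing the tool means the reply is the JSON arguments, with no prose framing.
    tool_choice: { type: 'tool' as const, name: TOOL_NAME },
    messages: [{ role: 'user' as const, content: documentText }],
  }
}

// Returns the validated tool input, or null if the reply does not match the schema.
function parseInvoice(content: ContentBlock[]): Invoice | null {
  for (const block of content) {
    if (block.type !== 'tool_use' || block.name !== TOOL_NAME) continue
    const input = block.input as Partial<Invoice> | null
    if (input && typeof input.vendor === 'string' && typeof input.totalUsd === 'number') {
      return { vendor: input.vendor, totalUsd: input.totalUsd }
    }
  }
  return null
}
```

With the SDK, the request object would be passed to `client.messages.create` and the returned `content` array handed to the parser, typed with the SDK's own block types instead of the local stand-ins.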
Haiku delegation for sub-tasks. Complex agent workflows often include sub-tasks that appear complex but are actually mechanical. “Summarize this 10,000-word document in 200 words” running inside a research agent does not need Opus. Here is the delegation config pattern:
// haiku-delegation.ts: delegate mechanical sub-tasks to cheaper models
import Anthropic from '@anthropic-ai/sdk'

// Reads ANTHROPIC_API_KEY from the environment
const client = new Anthropic()

const HAIKU_DELEGATABLE_TASKS = {
  summarize: (text: string, maxWords: number) => ({
    model: 'claude-haiku-4-5-20251001',
    max_tokens: maxWords * 2,
    messages: [{
      role: 'user' as const,
      content: `Summarize the following in exactly ${maxWords} words or fewer. Return only the summary, no preamble.\n\n${text}`,
    }],
  }),
  extractJson: (text: string, schema: string) => ({
    model: 'claude-haiku-4-5-20251001',
    max_tokens: 1024,
    messages: [{
      role: 'user' as const,
      content: `Extract data matching this schema: ${schema}\n\nReturn valid JSON only.\n\nInput:\n${text}`,
    }],
  }),
  rewriteForSeo: (title: string, maxChars: number) => ({
    model: 'claude-haiku-4-5-20251001',
    max_tokens: 256,
    messages: [{
      role: 'user' as const,
      content: `Rewrite this title for SEO in under ${maxChars} characters. Include the primary keyword. Return only the rewritten title.\n\n${title}`,
    }],
  }),
}

export async function delegateToHaiku(
  task: keyof typeof HAIKU_DELEGATABLE_TASKS,
  ...args: Parameters<typeof HAIKU_DELEGATABLE_TASKS[typeof task]>
): Promise<string> {
  // @ts-expect-error: dynamic args match the function signature
  const params = HAIKU_DELEGATABLE_TASKS[task](...args)
  const response = await client.messages.create(params)
  // Guard against an empty content array instead of indexing into it blindly
  const first = response.content[0]
  return first?.type === 'text' ? first.text : ''
}
Response caching. Agent workflows frequently re-run identical sub-queries: the same research query across different branches of a planning tree, the same code analysis prompt across multiple files. Redis caching with a 1-hour TTL on deterministic queries (same prompt + same context hash) eliminates redundant API calls entirely. In my content research pipeline, 34% of all LLM calls were cache-eligible. If those calls cost about as much as the average call, that is roughly a 34% reduction in API spend with no quality impact.
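The caching idea can be sketched with an in-memory map standing in for Redis. Everything here is illustrative: `TtlCache`, `cacheKey`, and `cachedCall` are hypothetical names, `callModel` is whatever function you already use to reach the API, and a real deployment would swap the map for a shared store with native key expiry.

```typescript
import { createHash } from 'node:crypto'

// Cache key: one hash over the prompt and the context it ran against.
// The separator keeps ('ab', 'c') and ('a', 'bc') from colliding.
function cacheKey(prompt: string, context: string): string {
  return createHash('sha256').update(prompt).update('\u0000').update(context).digest('hex')
}

// In-memory stand-in for Redis with a TTL. The clock is injectable for testing.
class TtlCache {
  private store = new Map<string, { value: string; expiresAt: number }>()
  private ttlMs: number
  private now: () => number

  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs
    this.now = now
  }

  get(key: string): string | undefined {
    const entry = this.store.get(key)
    if (!entry) return undefined
    if (entry.expiresAt <= this.now()) {
      this.store.delete(key) // expired: drop it and report a miss
      return undefined
    }
    return entry.value
  }

  set(key: string, value: string): void {
    this.store.set(key, { value, expiresAt: this.now() + this.ttlMs })
  }
}

// Wraps any model call. `callModel` is supplied by the caller (hypothetical).
async function cachedCall(
  cache: TtlCache,
  prompt: string,
  context: string,
  callModel: (prompt: string, context: string) => Promise<string>,
): Promise<{ text: string; cacheHit: boolean }> {
  const key = cacheKey(prompt, context)
  const hit = cache.get(key)
  if (hit !== undefined) return { text: hit, cacheHit: true }
  const text = await callModel(prompt, context)
  cache.set(key, text)
  return { text, cacheHit: false }
}
```

A one-hour cache would be `new TtlCache(60 * 60 * 1000)`. The `cacheHit` flag is returned so it can be logged alongside cost data when measuring how many calls are actually served from cache.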
For context on how to track whether these optimizations are actually improving output quality (not just cutting costs), the post on AI agent pilot failure rates covers the measurement frameworks that tell you when you have gone too far.
Building a Cost Dashboard for Your Agent Pipeline
You cannot optimize what you cannot see. Every team that has successfully controlled agentic AI costs has a dashboard. Here is the minimal version that gives you the visibility to make routing decisions:
// cost-dashboard.ts: real-time cost tracking per agent workflow

interface AgentCallRecord {
  workflowId: string
  stepName: string
  model: string
  inputTokens: number
  outputTokens: number
  costUsd: number
  timestamp: Date
  cacheHit: boolean
}

interface WorkflowCostSummary {
  workflowId: string
  totalCost: number
  callCount: number
  avgCostPerCall: number
  cacheHitRate: number
  modelBreakdown: Record<string, { calls: number; cost: number }>
}

// In-memory store: replace with Redis or Postgres for production persistence
const callRecords: AgentCallRecord[] = []

export function recordCall(record: AgentCallRecord): void {
  callRecords.push(record)
  // Emit to your monitoring system
  if (record.costUsd > 0.50) {
    console.warn(`[COST_ALERT] Single call exceeded $0.50: ${record.workflowId}/${record.stepName} = $${record.costUsd.toFixed(4)}`)
  }
}

export function getWorkflowSummary(workflowId: string): WorkflowCostSummary {
  const records = callRecords.filter((r) => r.workflowId === workflowId)
  const modelBreakdown: Record<string, { calls: number; cost: number }> = {}
  let totalCost = 0
  let cacheHits = 0

  for (const r of records) {
    totalCost += r.costUsd
    if (r.cacheHit) cacheHits++
    if (!modelBreakdown[r.model]) modelBreakdown[r.model] = { calls: 0, cost: 0 }
    modelBreakdown[r.model].calls++
    modelBreakdown[r.model].cost += r.costUsd
  }

  return {
    workflowId,
    totalCost,
    callCount: records.length,
    avgCostPerCall: records.length ? totalCost / records.length : 0,
    cacheHitRate: records.length ? cacheHits / records.length : 0,
    modelBreakdown,
  }
}

// Spend since local midnight
export function getDailySpend(): number {
  const today = new Date()
  today.setHours(0, 0, 0, 0)
  return callRecords
    .filter((r) => r.timestamp >= today)
    .reduce((sum, r) => sum + r.costUsd, 0)
}
Budget alerting is the second component. A cost dashboard without alerts is just a prettier way to notice a problem after it has already occurred:
// budget-alerting.ts: proactive spend alerts before budgets explode

interface BudgetConfig {
  dailyLimitUsd: number
  monthlyLimitUsd: number
  alertAt: number // fraction of limit that triggers warning (e.g., 0.8 = 80%)
  onAlert: (message: string) => void
}

export function createBudgetMonitor(config: BudgetConfig) {
  let dailyAlertFired = false
  let monthlyAlertFired = false
  let dailyExhaustedFired = false

  return {
    // Returns true once the daily limit is reached so the caller can throttle.
    checkBudget(dailySpend: number, monthlySpend: number): boolean {
      const dailyPercent = dailySpend / config.dailyLimitUsd
      const monthlyPercent = monthlySpend / config.monthlyLimitUsd

      if (dailyPercent >= config.alertAt && !dailyAlertFired) {
        config.onAlert(
          `Daily AI spend at ${(dailyPercent * 100).toFixed(1)}% of limit ($${dailySpend.toFixed(2)} / $${config.dailyLimitUsd})`
        )
        dailyAlertFired = true
      }

      if (monthlyPercent >= config.alertAt && !monthlyAlertFired) {
        config.onAlert(
          `Monthly AI spend at ${(monthlyPercent * 100).toFixed(1)}% of limit ($${monthlySpend.toFixed(2)} / $${config.monthlyLimitUsd})`
        )
        monthlyAlertFired = true
      }

      // Fire the exhaustion alert once per day, not on every check
      const exhausted = dailySpend >= config.dailyLimitUsd
      if (exhausted && !dailyExhaustedFired) {
        config.onAlert(`DAILY BUDGET EXHAUSTED: $${dailySpend.toFixed(2)} spent. Throttle agent calls.`)
        dailyExhaustedFired = true
      }
      return exhausted
    },

    // Reset daily alert flags at midnight
    resetDaily(): void {
      dailyAlertFired = false
      dailyExhaustedFired = false
    },

    // Reset the monthly alert flag at the start of each billing month
    resetMonthly(): void {
      monthlyAlertFired = false
    },
  }
}

// Usage: connect to Telegram, Slack, or email for notifications
const monitor = createBudgetMonitor({
  dailyLimitUsd: 50,
  monthlyLimitUsd: 800,
  alertAt: 0.8,
  onAlert: (msg) => {
    // Send to your notification channel
    console.error(`[BUDGET_ALERT] ${msg}`)
  },
})
The dashboard + alerting combination is what surfaces the optimization opportunities. After running this for two weeks, the data consistently shows the same pattern: 15-20% of agent workflows are responsible for 70-80% of costs. Those high-cost workflows are almost always candidates for either more aggressive routing (can Sonnet handle step 3 instead of Opus?) or context trimming (are we feeding 40,000 tokens of accumulated history into a step that only needs the last 2,000?).
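To check how concentrated spend is in your own data, rank workflows by cost and find the smallest group that covers a target share of the total. The sketch below does that. `topSpenders` and the sample figures are hypothetical; the per-workflow totals could come from `getWorkflowSummary` above.

```typescript
interface WorkflowCost {
  workflowId: string
  totalCost: number
}

// Smallest set of workflows (most expensive first) that accounts for at least
// `targetShare` of total spend.
function topSpenders(costs: WorkflowCost[], targetShare = 0.8): {
  workflows: WorkflowCost[]
  shareOfWorkflows: number
  shareOfCost: number
} {
  const total = costs.reduce((sum, w) => sum + w.totalCost, 0)
  if (total === 0) return { workflows: [], shareOfWorkflows: 0, shareOfCost: 0 }

  // Copy before sorting so the caller's array is left untouched
  const ranked = [...costs].sort((a, b) => b.totalCost - a.totalCost)
  const picked: WorkflowCost[] = []
  let running = 0
  for (const w of ranked) {
    if (running / total >= targetShare) break
    picked.push(w)
    running += w.totalCost
  }

  return {
    workflows: picked,
    shareOfWorkflows: picked.length / costs.length,
    shareOfCost: running / total,
  }
}

// Hypothetical sample: ten workflows, total spend $100.
const sampleCosts: WorkflowCost[] = [45, 30, 5, 5, 4, 4, 3, 2, 1, 1].map(
  (totalCost, i) => ({ workflowId: `wf-${i + 1}`, totalCost }),
)
const concentration = topSpenders(sampleCosts, 0.7)
console.log(concentration.workflows.map((w) => w.workflowId), concentration.shareOfCost)
```

The workflows returned are the ones to inspect first for routing and context-trimming changes.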
Uber’s fiscal year started in January. Their budget was gone by April. The gap between “this looks reasonable in a spreadsheet” and “this is destroying our quarterly budget” is measured in weeks once agentic usage patterns take hold at scale. The teams that avoided that outcome were not smarter about AI — they were earlier to instrument their pipelines and route their traffic.
The tools to do this are not complex. The routing logic above fits in a single TypeScript file. The cost dashboard and the budget alerting together come to well under 150 lines. What makes it powerful is deploying it before the quarterly budget review, not after.
For building out the broader agent architecture that makes routing decisions tractable — including how to structure agent workflows so tasks have clear complexity signals — see the 3-layer agent harness pattern. The routing architecture works best when the agent layer is clean enough that each step has a well-defined purpose and a clear complexity profile.
Run the cost dashboard on your pipeline this week. I guarantee you will find at least one workflow where 80% of your spend is going to Opus for tasks that Sonnet could handle. That is your first 40% cost reduction, and it is sitting there already.
Every tool and template for building production agent pipelines is at wowhow.cloud developer tools — pay once, ship forever. For framework-specific implementation guides, see how the OpenAI Agents SDK handles sandbox execution and why MiniMax M2.7's open-weight model is the most cost-efficient option for teams running high-volume inference on their own infrastructure.