Uber’s AI team ran out of budget in April. Their fiscal year started in January.
That sentence appeared on Hacker News and hit the front page in under two hours, accumulating hundreds of comments from engineers who recognized the pattern immediately. Not because Uber is uniquely reckless, but because the same story is playing out at organizations everywhere. The r/LocalLLaMA thread about compute cost frustration (181 upvotes, hundreds of comments from engineers describing the same spiral) makes the same point from the other direction: whether you're paying for cloud inference or running your own GPUs, agentic AI costs are destroying budgets that looked perfectly reasonable when the procurement approval was signed.
I cut my own agent pipeline costs by 74% over six weeks using a routing architecture I’ll show you here. The core insight is simple: you are almost certainly sending every task to the same expensive model regardless of complexity, and that single decision is costing you more than everything else combined.
According to Forrester’s 2026 enterprise AI deployment survey, 22% of agent deployments now report negative ROI — not because the agents don’t work, but because the infrastructure costs exceeded the productivity gains. The agents work. The bills are just bigger than anyone planned for.
The Agentic Cost Explosion Nobody Planned For
The math that kills AI budgets is rarely the per-token pricing. It’s the multiplication factor that nobody writes into their procurement estimates.
When a product manager approves $10,000/month for an AI coding assistant, they’re imagining simple prompt-response pairs at a few cents each. What they’re actually getting is an agent that, for every user request, may run a planning step, 4-6 tool calls, 2-3 reflection passes, and a final synthesis — each of which hits the API separately. A task that looks like “one request” in the approval doc is 8-12 API calls in the billing dashboard.
Opus 4.7 is priced at $5 per million input tokens and $25 per million output tokens. GPT-5.5 runs $5 per million input and $30 per million output. At those rates, a simple agentic task that chains 10 LLM calls, each consuming 2,000 input tokens and producing 800 output tokens, costs $0.30 at list price and closer to $0.40 in practice once context accumulates across the chain. That's not alarming until you remember that a busy developer using an AI coding agent makes 50-100 such requests per day. Per developer. For a team of 20, that's $400-$800 per day, $8,000-$16,000 per month, from a single team using a single tool.
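Here is that back-of-the-envelope math made explicit. The chain length, token counts, and request volume are the assumptions from the paragraph above, not measured values:

// task-cost-estimate.ts: rough per-task and per-team cost at Opus 4.7 list pricing
const INPUT_PER_MTOK = 5
const OUTPUT_PER_MTOK = 25
const CALLS_PER_TASK = 10
const INPUT_TOKENS_PER_CALL = 2_000
const OUTPUT_TOKENS_PER_CALL = 800

const costPerTask =
  (CALLS_PER_TASK * INPUT_TOKENS_PER_CALL / 1_000_000) * INPUT_PER_MTOK +
  (CALLS_PER_TASK * OUTPUT_TOKENS_PER_CALL / 1_000_000) * OUTPUT_PER_MTOK
// => $0.30 per task, before context accumulation pushes it higher

const dailyTeamCost = costPerTask * 75 * 20 // ~75 requests/dev/day, 20 devs
// => ~$450/day, roughly $9,000/month for one team using one tool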
Now multiply by the number of agent pipelines your organization has deployed since Q1 2026.
The Pro plan context also exhausts faster than most users expect. Heavy prompting (long context windows, multi-file codebase analysis, extended reasoning chains) depletes a Claude Pro plan after roughly 12 substantial prompts. Power users hit this limit before lunch. The response is either to ration prompts for the rest of the session or to upgrade to API access with consumption-based billing, which removes the cap but also removes the cost ceiling that made the Pro plan feel "safe".
This is the structural trap: fixed-price plans create a ceiling that users run into, pushing them to consumption billing. Consumption billing removes the ceiling and exposes the real cost of agentic usage patterns. Teams that made the switch in Q1 2026 are the ones showing up in that Forrester negative-ROI data.
Why Agent Tasks Cost 10-100x a Simple API Call
The cost multiplier is not a bug in the pricing model. It is the natural consequence of how agents work, and understanding it is the prerequisite to managing it.
A simple API call sends a prompt, receives a response, costs one unit of compute. An agent task is architecturally different. It starts with a planning phase where the model reasons about the task and decides what tools to use — that’s one or two LLM calls. Each tool call has a pre-call reasoning step, the execution itself (which may or may not be an LLM call), and a post-call evaluation where the agent decides whether the result was satisfactory — potentially another LLM call. If the tool result is ambiguous or the agent decides it needs more information, it loops. If the final output needs to be formatted or synthesized from multiple tool results, that’s another LLM call.
A conservative estimate puts a moderate-complexity agent task at 8-15 LLM calls. A complex task (multi-file code review, research synthesis across 10+ sources, a multi-step data pipeline) can run 40-100 calls. At Opus 4.7 pricing, a 100-call task with average context does not cost the $0.04 of the single API call that procurement imagined; it costs $4.00-$8.00. Per task. That is the 10-100x multiplier, and it is baked into the architecture.
There is also a context accumulation problem that makes costs grow nonlinearly. Each step in an agent workflow adds to the running context: the original task, the plan, the results of each tool call, the evaluation of each result. By step 8 of a 10-step workflow, the input token count for each call includes all preceding steps. The 9th LLM call in a chain is not the same cost as the first — it may be 5-10x more expensive per call because the context window has grown. This is why agent tasks that “should” cost $2 based on per-call estimates end up costing $15 in production.
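A sketch of that growth, assuming each step appends about 1,500 tokens of tool results to the running context (the append size is an illustration, not a measurement):

// context-growth.ts: why the 9th call costs more than the 1st
const INPUT_COST_PER_TOKEN = 5 / 1_000_000 // Opus 4.7 input pricing
let contextTokens = 2_000 // original task + plan
let totalInputCost = 0

for (let step = 1; step <= 10; step++) {
  totalInputCost += contextTokens * INPUT_COST_PER_TOKEN
  contextTokens += 1_500 // each later call re-reads everything accumulated so far
}
// Step 1 reads 2,000 tokens; step 10 reads 15,500 (nearly 8x the per-call input cost).
// Total input spend is ~4.4x what a flat 2,000-token context would predict.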
The naive solution is to use cheaper models. But for complex reasoning tasks — architectural decisions, security analysis, multi-file refactors — substituting Haiku 4.5 for Opus 4.7 does not save money. It produces wrong outputs that require expensive human correction or re-runs. The real solution is routing: expensive models for tasks that require them, cheap models for tasks that do not.
For a deeper look at how to evaluate whether your agent outputs are actually correct — not just whether they returned HTTP 200 — see our guide on AI agent observability and production monitoring.
The Multi-Model Routing Pattern
The routing pattern that cut my costs 74% is conceptually simple: classify each task before sending it to a model, and route to the cheapest model that can handle it correctly. The implementation requires a routing layer that lives between your application and the model APIs.
Here is the routing architecture I use in production:
// multi-model-router.ts
// Routes tasks to the cheapest capable model based on complexity classification
type ModelTier = 'haiku' | 'sonnet' | 'opus'
interface TaskClassification {
tier: ModelTier
reason: string
estimatedTokens: number
}
interface RouterConfig {
haiku: { model: string; inputCostPerMTok: number; outputCostPerMTok: number }
sonnet: { model: string; inputCostPerMTok: number; outputCostPerMTok: number }
opus: { model: string; inputCostPerMTok: number; outputCostPerMTok: number }
}
const ROUTER_CONFIG: RouterConfig = {
haiku: {
model: 'claude-haiku-4-5-20251001',
inputCostPerMTok: 0.25,
outputCostPerMTok: 1.25,
},
sonnet: {
model: 'claude-sonnet-4-6',
inputCostPerMTok: 3.0,
outputCostPerMTok: 15.0,
},
opus: {
model: 'claude-opus-4-7',
inputCostPerMTok: 5.0,
outputCostPerMTok: 25.0,
},
}
// Complexity signals that force Opus routing
const OPUS_SIGNALS = [
/security|vulnerability|CVE|auth|payment|webhook/i,
/architecture|refactor.*cross.cutting|design.*system/i,
/multi.file.*analysis|codebase.*review/i,
/trust.boundary|privilege|escalation/i,
]
// Signals that allow Haiku routing (cheap, mechanical tasks)
const HAIKU_SIGNALS = [
/format|lint|rename|replace.all/i,
/meta.description|seo.title|alt.text/i,
/translate|summarize.in.[0-9]+.words/i,
/extract.*list|parse.*json|convert.*csv/i,
]
export function classifyTask(prompt: string, contextTokens: number = 0): TaskClassification {
// High-stakes tasks always go to Opus regardless of apparent simplicity
for (const signal of OPUS_SIGNALS) {
if (signal.test(prompt)) {
return {
tier: 'opus',
reason: 'trust-boundary signal detected',
estimatedTokens: contextTokens + prompt.length / 4,
}
}
}
// Large context forces at least Sonnet (Haiku quality degrades with context)
if (contextTokens > 50_000) {
return {
tier: 'sonnet',
reason: 'large context window',
estimatedTokens: contextTokens + prompt.length / 4,
}
}
// Mechanical/formatting tasks can use Haiku
for (const signal of HAIKU_SIGNALS) {
if (signal.test(prompt)) {
return {
tier: 'haiku',
reason: 'mechanical task signal',
estimatedTokens: contextTokens + prompt.length / 4,
}
}
}
// Default: Sonnet handles most feature work and routine edits
return {
tier: 'sonnet',
reason: 'default routing',
estimatedTokens: contextTokens + prompt.length / 4,
}
}
export function getModel(tier: ModelTier): string {
return ROUTER_CONFIG[tier].model
}
export function estimateCost(tier: ModelTier, inputTokens: number, outputTokens: number): number {
const config = ROUTER_CONFIG[tier]
return (inputTokens / 1_000_000) * config.inputCostPerMTok +
(outputTokens / 1_000_000) * config.outputCostPerMTok
}
This router reduced my Opus 4.7 usage from 100% of calls to roughly 15%: the genuinely complex architectural and security tasks that actually need it. Sonnet handles about 60% of calls (feature implementation, analysis, most agent steps). Haiku handles the remaining 25% (formatting, SEO rewrites, batch string operations). Weighted by call share, that shifts the blended input cost from $5/MTok to roughly $2.60/MTok, a reduction of nearly half in model cost alone, before any token optimization.
GPT-5.5 uses 72% fewer tokens per task than Opus 4.7 for equivalent outputs on coding benchmarks, which changes the economics of cross-provider routing. At $5/$30 per MTok, GPT-5.5 looks more expensive per token than Opus 4.7 at $5/$25. But at 72% token reduction on similar tasks, the effective cost is lower. Routing provider by task type — not just model by task type — is the next frontier of cost optimization, and OpenAI’s efficiency gains are what make it worth modeling.
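The arithmetic, taking the 72% figure at face value:

// Output cost for the same task across providers
const opusOutputCost = 1.0 * 25 // 1M output tokens at $25/MTok = $25.00
const gptOutputCost = 0.28 * 30 // 280k output tokens at $30/MTok = $8.40
// Roughly a third of the output cost for equivalent work, despite the higher list price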
The multi-model routing developer guide has a more detailed breakdown of cross-provider routing for specific task classes.
Token Optimization Techniques That Actually Work
Routing gets you 40-60% cost reduction. Token optimization gets you the rest. These are the techniques with meaningful impact at production scale.
Context trimming before each agent step. Most agent frameworks accumulate context naively — every tool result, every intermediate output appended to the running context. By step 8 of a 10-step workflow, 60-70% of your input tokens are intermediate results that the model does not need to reason about the current step. Trim aggressively: keep the original task, the most recent 2-3 tool results, and any critical constraints. Archive the rest.
// token-counter.ts — measure and trim context before agent steps
interface ContextWindow {
systemPrompt: string
task: string
history: Array<{ role: 'user' | 'assistant'; content: string; stepIndex: number }>
maxTokens: number
}
function estimateTokens(text: string): number {
// Rough approximation: 4 characters per token for English
return Math.ceil(text.length / 4)
}
export function trimContext(ctx: ContextWindow, targetTokenBudget: number): ContextWindow {
const systemTokens = estimateTokens(ctx.systemPrompt)
const taskTokens = estimateTokens(ctx.task)
const overhead = systemTokens + taskTokens + 500 // reserve for response
let budget = targetTokenBudget - overhead
const kept: typeof ctx.history = []
// Keep the most recent 3 steps; iterate newest-first so the newest step
// gets budget priority, and unshift to restore chronological order
const recent = ctx.history.slice(-3).reverse()
for (const step of recent) {
  const cost = estimateTokens(step.content)
  if (cost <= budget) {
    kept.unshift(step)
    budget -= cost
  }
}
// Fill remaining budget with older steps (most recent first)
const older = ctx.history.slice(0, -3).reverse()
for (const step of older) {
const cost = estimateTokens(step.content)
if (cost <= budget) {
kept.unshift(step)
budget -= cost
} else {
break
}
}
return { ...ctx, history: kept }
}
export function countContextTokens(ctx: ContextWindow): number {
return estimateTokens(ctx.systemPrompt) +
estimateTokens(ctx.task) +
ctx.history.reduce((sum, step) => sum + estimateTokens(step.content), 0)
}
Structured output enforcement. Agents that return free-form prose when JSON would have sufficed waste tokens on framing that your application immediately discards. Enforcing structured output via response schemas reduces output token counts by 30-50% for data extraction and analysis tasks. At Opus 4.7 pricing, every output token costs 5x an input token, so optimizing outputs matters more than optimizing inputs.
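A minimal sketch of the pattern using the Anthropic SDK's forced tool use, which guarantees a JSON payload instead of prose framing. The schema and task here are illustrative:

// structured-output.ts: enforce JSON output via a tool schema (sketch)
import Anthropic from '@anthropic-ai/sdk'

const client = new Anthropic()

export async function extractStructured(text: string): Promise<unknown> {
  const response = await client.messages.create({
    model: 'claude-sonnet-4-6',
    max_tokens: 1024,
    tools: [{
      name: 'record_findings',
      description: 'Record the extracted findings',
      input_schema: {
        type: 'object' as const,
        properties: {
          summary: { type: 'string' },
          risks: { type: 'array', items: { type: 'string' } },
        },
        required: ['summary', 'risks'],
      },
    }],
    // Forcing the tool means the model must return structured input, not prose
    tool_choice: { type: 'tool', name: 'record_findings' },
    messages: [{ role: 'user', content: text }],
  })
  const block = response.content.find((b) => b.type === 'tool_use')
  return block && block.type === 'tool_use' ? block.input : null
}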
Haiku delegation for sub-tasks. Complex agent workflows often include sub-tasks that appear complex but are actually mechanical. “Summarize this 10,000-word document in 200 words” running inside a research agent does not need Opus. Here is the delegation config pattern:
// haiku-delegation.ts — delegate mechanical sub-tasks to cheaper models
import Anthropic from '@anthropic-ai/sdk'
const client = new Anthropic()
const HAIKU_DELEGATABLE_TASKS = {
summarize: (text: string, maxWords: number) => ({
model: 'claude-haiku-4-5-20251001',
max_tokens: maxWords * 2,
messages: [{
role: 'user' as const,
content: `Summarize the following in exactly ${maxWords} words or fewer. Return only the summary, no preamble.\n\n${text}`,
}],
}),
extractJson: (text: string, schema: string) => ({
model: 'claude-haiku-4-5-20251001',
max_tokens: 1024,
messages: [{
role: 'user' as const,
content: `Extract data matching this schema: ${schema}\n\nReturn valid JSON only.\n\nInput:\n${text}`,
}],
}),
rewriteForSeo: (title: string, maxChars: number) => ({
model: 'claude-haiku-4-5-20251001',
max_tokens: 256,
messages: [{
role: 'user' as const,
content: `Rewrite this title for SEO in under ${maxChars} characters. Include the primary keyword. Return only the rewritten title.\n\n${title}`,
}],
}),
}
export async function delegateToHaiku<K extends keyof typeof HAIKU_DELEGATABLE_TASKS>(
  task: K,
  ...args: Parameters<(typeof HAIKU_DELEGATABLE_TASKS)[K]>
): Promise<string> {
  // @ts-expect-error: TS cannot correlate the generic task key with its argument tuple
  const params = HAIKU_DELEGATABLE_TASKS[task](...args)
  const response = await client.messages.create(params)
  const block = response.content[0]
  return block?.type === 'text' ? block.text : ''
}
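Usage inside a larger agent step (fullDocumentText is a placeholder for whatever the agent is holding):

// Inside a research agent: the summarize sub-task drops to Haiku pricing
const summary = await delegateToHaiku('summarize', fullDocumentText, 200)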
Response caching. Agent workflows frequently re-run identical sub-queries: the same research query across different branches of a planning tree, the same code analysis prompt across multiple files. Redis caching with a 1-hour TTL on deterministic queries (same prompt + same context hash) eliminates redundant API calls entirely. In my content research pipeline, 34% of all LLM calls were cache-eligible, which translated into roughly a third less API spend with zero quality impact.
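A minimal sketch of that cache, assuming ioredis and keying on a hash of model, context, and prompt:

// response-cache.ts: deterministic-query caching with a 1-hour TTL (sketch)
import { createHash } from 'node:crypto'
import Redis from 'ioredis'

const redis = new Redis()
const TTL_SECONDS = 3600

// Identical model + context + prompt hashes to the same key
function cacheKey(model: string, prompt: string, contextHash: string): string {
  return 'llm:' + createHash('sha256').update(`${model}\n${contextHash}\n${prompt}`).digest('hex')
}

export async function cachedCompletion(
  model: string,
  prompt: string,
  contextHash: string,
  call: () => Promise<string>,
): Promise<{ text: string; cacheHit: boolean }> {
  const key = cacheKey(model, prompt, contextHash)
  const cached = await redis.get(key)
  if (cached !== null) return { text: cached, cacheHit: true }
  const text = await call()
  await redis.set(key, text, 'EX', TTL_SECONDS)
  return { text, cacheHit: false }
}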
For context on how to track whether these optimizations are actually improving output quality (not just cutting costs), the post on AI agent pilot failure rates covers the measurement frameworks that tell you when you have gone too far.
Building a Cost Dashboard for Your Agent Pipeline
You cannot optimize what you cannot see. Every team that has successfully controlled agentic AI costs has a dashboard. Here is the minimal version that gives you the visibility to make routing decisions:
// cost-dashboard.ts — real-time cost tracking per agent workflow
interface AgentCallRecord {
workflowId: string
stepName: string
model: string
inputTokens: number
outputTokens: number
costUsd: number
timestamp: Date
cacheHit: boolean
}
interface WorkflowCostSummary {
workflowId: string
totalCost: number
callCount: number
avgCostPerCall: number
cacheHitRate: number
modelBreakdown: Record<string, { calls: number; cost: number }>
}
// In-memory store — replace with Redis or Postgres for production persistence
const callRecords: AgentCallRecord[] = []
export function recordCall(record: AgentCallRecord): void {
callRecords.push(record)
// Emit to your monitoring system
if (record.costUsd > 0.50) {
console.warn(`[COST_ALERT] Single call exceeded $0.50: ${record.workflowId}/${record.stepName} = $${record.costUsd.toFixed(4)}`)
}
}
export function getWorkflowSummary(workflowId: string): WorkflowCostSummary {
const records = callRecords.filter((r) => r.workflowId === workflowId)
const modelBreakdown: Record<string, { calls: number; cost: number }> = {}
let totalCost = 0
let cacheHits = 0
for (const r of records) {
totalCost += r.costUsd
if (r.cacheHit) cacheHits++
if (!modelBreakdown[r.model]) modelBreakdown[r.model] = { calls: 0, cost: 0 }
modelBreakdown[r.model].calls++
modelBreakdown[r.model].cost += r.costUsd
}
return {
workflowId,
totalCost,
callCount: records.length,
avgCostPerCall: records.length ? totalCost / records.length : 0,
cacheHitRate: records.length ? cacheHits / records.length : 0,
modelBreakdown,
}
}
export function getDailySpend(): number {
const today = new Date()
today.setHours(0, 0, 0, 0)
return callRecords
.filter((r) => r.timestamp >= today)
.reduce((sum, r) => sum + r.costUsd, 0)
}
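The alerting below also needs a month-to-date total. A companion helper (not part of the original dashboard) plus an example of recording a call routed through the earlier router:

// getMonthlySpend: a hypothetical companion to getDailySpend above
export function getMonthlySpend(): number {
  const monthStart = new Date()
  monthStart.setDate(1)
  monthStart.setHours(0, 0, 0, 0)
  return callRecords
    .filter((r) => r.timestamp >= monthStart)
    .reduce((sum, r) => sum + r.costUsd, 0)
}

// Recording a routed call:
recordCall({
  workflowId: 'research-42',
  stepName: 'synthesize',
  model: getModel('sonnet'),
  inputTokens: 18_000,
  outputTokens: 1_200,
  costUsd: estimateCost('sonnet', 18_000, 1_200), // ≈ $0.07
  timestamp: new Date(),
  cacheHit: false,
})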
Budget alerting is the second component. A cost dashboard without alerts is just a prettier way to notice a problem after it has already occurred:
// budget-alerting.ts — proactive spend alerts before budgets explode
interface BudgetConfig {
dailyLimitUsd: number
monthlyLimitUsd: number
alertAt: number // fraction of limit that triggers warning (e.g., 0.8 = 80%)
onAlert: (message: string) => void
}
export function createBudgetMonitor(config: BudgetConfig) {
let dailyAlertFired = false
let monthlyAlertFired = false
return {
checkBudget(dailySpend: number, monthlySpend: number): void {
const dailyPercent = dailySpend / config.dailyLimitUsd
const monthlyPercent = monthlySpend / config.monthlyLimitUsd
if (dailyPercent >= config.alertAt && !dailyAlertFired) {
config.onAlert(
`Daily AI spend at ${(dailyPercent * 100).toFixed(1)}% of limit ($${dailySpend.toFixed(2)} / $${config.dailyLimitUsd})`
)
dailyAlertFired = true
}
if (monthlyPercent >= config.alertAt && !monthlyAlertFired) {
config.onAlert(
`Monthly AI spend at ${(monthlyPercent * 100).toFixed(1)}% of limit ($${monthlySpend.toFixed(2)} / $${config.monthlyLimitUsd})`
)
monthlyAlertFired = true
}
if (dailySpend >= config.dailyLimitUsd) {
config.onAlert(`DAILY BUDGET EXHAUSTED: $${dailySpend.toFixed(2)} spent. Throttling agent calls.`)
}
},
// Reset alert flags at the day/month boundary (call from a scheduler)
resetDaily(): void {
  dailyAlertFired = false
},
resetMonthly(): void {
  monthlyAlertFired = false
},
}
}
// Usage — connect to Telegram, Slack, or email for notifications
const monitor = createBudgetMonitor({
dailyLimitUsd: 50,
monthlyLimitUsd: 800,
alertAt: 0.8,
onAlert: (msg) => {
// Send to your notification channel
console.error(`[BUDGET_ALERT] ${msg}`)
},
})
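Feed the monitor the dashboard's running totals after each recorded call (getMonthlySpend is the helper added alongside the dashboard above):

// After each recordCall, check both budgets
monitor.checkBudget(getDailySpend(), getMonthlySpend())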
The dashboard + alerting combination is what surfaces the optimization opportunities. After running this for two weeks, the data consistently shows the same pattern: 15-20% of agent workflows are responsible for 70-80% of costs. Those high-cost workflows are almost always candidates for either more aggressive routing (can Sonnet handle step 3 instead of Opus?) or context trimming (are we feeding 40,000 tokens of accumulated history into a step that only needs the last 2,000?).
Uber’s fiscal year started in January. Their budget was gone by April. The gap between “this looks reasonable in a spreadsheet” and “this is destroying our quarterly budget” is measured in weeks once agentic usage patterns take hold at scale. The teams that avoided that outcome were not smarter about AI — they were earlier to instrument their pipelines and route their traffic.
The tools to do this are not complex. The routing logic above fits in a single TypeScript file. The cost dashboard is under 80 lines. The budget alerting is another 40 lines. What makes it powerful is deploying it before the quarterly budget review, not after.
For building out the broader agent architecture that makes routing decisions tractable — including how to structure agent workflows so tasks have clear complexity signals — see the 3-layer agent harness pattern. The routing architecture works best when the agent layer is clean enough that each step has a well-defined purpose and a clear complexity profile.
Run the cost dashboard on your pipeline this week. I guarantee you will find at least one workflow where 80% of your spend is going to Opus for tasks that Sonnet could handle. That is your first 40% cost reduction, and it is sitting there already.
Every tool and template for building production agent pipelines is at wowhow.cloud developer tools — pay once, ship forever.
Written by
Anup Karanjkar
Expert contributor at WOWHOW. Writing about AI, development, automation, and building products that ship.