Multi-agent token cost explodes without accounting. The WOWHOW Cost-Attribution Ledger assigns spend per phase, agent, and tool-call with worked numbers.
A typical three-agent pipeline (planner, executor, reviewer) can burn 400,000 to 800,000 tokens on a task that a single well-prompted call would handle in 40,000. The WOWHOW Cost-Attribution Ledger (CAL) is a framework for attributing every token in a multi-agent run across five cost centers: Phase, Agent Role, Tool-Call Type, Context Carry, and Retry Tax. Without that attribution, you are flying blind: you know the invoice total but not which agent is the spender, which tool is the vacuum, or whether your “smart orchestration” is actually a 10x cost multiplier. This post defines the CAL framework, walks through a worked example with illustrative numbers, and shows you how to instrument your own runs to stop guessing.
Why Token Cost Accounting Is Broken in Multi-Agent Systems
Single-agent cost is trivial to track: one call, one bill line. Multi-agent cost is not. When an orchestrator spawns three subagents and each subagent calls three tools, you have at minimum nine context windows in flight. Each window carries the full conversation history it was given. The orchestrator’s 12,000-token system prompt gets copied into every subagent spawn. A tool that returns a 5,000-token JSON blob gets attached to the next LLM call whether or not the agent needs all of it.
The standard debugging pattern — check the total token count on the dashboard — tells you nothing useful. It tells you the aggregate. It does not tell you that your planner agent is spending 60% of the budget on a context window it rebuilds from scratch on every retry, or that your web-search tool is returning 8,000 tokens when the agent uses 200 of them.
Three failure modes show up constantly in production multi-agent runs:
- Context bleed: Agent A builds a 20,000-token working memory. Agent B is spawned with that full memory attached even though it only needs a 500-token summary.
- Retry spiral: A tool-call fails, the agent retries with the full conversation history each time, and each retry adds 2,000 tokens to the context. After five retries, the context has grown by 10,000 tokens of failure output alone, and every retry has also re-sent the full history that preceded it.
- Phase overlap: Planning and execution tokens are indistinguishable in the invoice. You cannot tell whether your “planning phase” is 5% or 50% of total spend.
The CAL framework resolves all three by forcing you to label every token at the point of spend, not at the point of billing.
The Five Cost Centers of the CAL Framework
The WOWHOW Cost-Attribution Ledger organizes token spend into five orthogonal cost centers. Every token in a multi-agent run belongs to exactly one entry in each dimension.
| Cost Center | What It Measures | Key Metric | Target Ratio |
|---|---|---|---|
| Phase | Tokens spent in plan / execute / verify / synthesize phases | Phase share % | Plan <15%, Execute <60%, Verify <20%, Synth <10% |
| Agent Role | Tokens attributed to orchestrator, subagent, critic, formatter | Role share % | Orchestrator <20% of total |
| Tool-Call Type | Tokens consumed by search, code-exec, file-read, API, memory | Return/use ratio | Use >25% of returned tokens |
| Context Carry | Tokens added by history / prompt injection vs. new reasoning | Carry overhead % | Carry <40% of input tokens |
| Retry Tax | Tokens spent on failed attempts (tool errors, parse failures) | Tax % | Tax <8% of total |
These five dimensions give you a full-rank attribution matrix. When your total bill spikes, you can immediately answer: which phase? which agent? which tool pattern? was it context carry or retries? You cannot get that answer from a flat token count.
CAL Dimension 1: Phase Attribution
Every multi-agent run has at least two phases: generating a plan and executing it. Larger pipelines add verification and synthesis. The CAL tracks phase boundaries as explicit events, not inferred from timestamps.
The mechanism is a phase tag injected into the system prompt of each LLM call:
CAL_PHASE: plan | execute | verify | synthesize
CAL_RUN_ID: run_20260618_042
CAL_AGENT_ID: orchestrator_0
Your logging layer reads these tags and bins every input/output token count under the correct phase. The phase cost then becomes a first-class metric in your post-run report.
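The tag-reading step can be sketched in a few lines. This is a minimal illustration, not the CAL's actual logging layer: the `parse_cal_tags` and `bin_by_phase` helpers and the shape of the `calls` records are assumptions made for the example, and the parser accepts either a `PHASE` or a `CAL_PHASE` key.

```python
import re
from collections import defaultdict

# Matches tag lines such as "PHASE: plan" or "CAL_RUN_ID: run_20260618_042".
TAG_LINE = re.compile(r"^(PHASE|CAL_[A-Z_]+):[ \t]*(\S+)[ \t]*$", re.MULTILINE)


def parse_cal_tags(system_prompt: str) -> dict:
    """Return the tag lines found in a system prompt as a {key: value} dict."""
    return dict(TAG_LINE.findall(system_prompt))


def bin_by_phase(calls: list) -> dict:
    """Sum input + output tokens per phase; untagged calls get their own bin."""
    totals = defaultdict(int)
    for call in calls:
        tags = parse_cal_tags(call["system_prompt"])
        phase = tags.get("CAL_PHASE") or tags.get("PHASE") or "untagged"
        totals[phase] += call["input_tokens"] + call["output_tokens"]
    return dict(totals)


# Hypothetical call records: the prompt text plus the token counts
# reported by the provider for that call.
calls = [
    {"system_prompt": "PHASE: plan\nCAL_RUN_ID: run_20260618_042\n"
                      "CAL_AGENT_ID: orchestrator_0\nYou plan the work.",
     "input_tokens": 5_000, "output_tokens": 1_200},
    {"system_prompt": "PHASE: execute\nCAL_RUN_ID: run_20260618_042\n"
                      "CAL_AGENT_ID: subagent_a\nYou research one topic.",
     "input_tokens": 18_000, "output_tokens": 800},
]
phase_totals = bin_by_phase(calls)
print(phase_totals)
```

Keeping an explicit "untagged" bin makes missing tags visible in the report instead of silently dropping those tokens.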
In a well-structured run, plan phase should be cheap. Planning is high-density reasoning: a small context window, a tight prompt, one or two tool lookups. If your plan phase is consuming 30% of total tokens, that is a signal your planner is doing execution work — fetching full documents, running code, generating long-form drafts instead of outlines.
CAL Dimension 2: Agent Role Attribution
In a multi-agent system, roles are logical separations of responsibility. The CAL treats each role as a separate billing entity within the run. A single model can fulfill multiple roles sequentially, but the CAL tags each call with its role at call-time.
Four canonical roles in most pipelines:
- Orchestrator: Owns the plan, dispatches tasks, collects results. Should have the largest context window but the lowest call frequency.
- Subagent: Executes a discrete, bounded task. Should have a minimal context window scoped to that task only.
- Critic: Reviews output for correctness or quality. Often over-populated with context it does not need.
- Formatter: Transforms structured data into final output format. Should almost never need reasoning tokens — a formatter calling a 100K-context model is waste.
The orchestrator anti-pattern is the most expensive: an orchestrator that re-summarizes every subagent result before dispatching the next task. Each dispatch adds the full accumulation to the next context window. A run with five sequential subagents, each triggering an orchestrator re-summarization, multiplies orchestrator token spend by roughly 5x compared to a parallel dispatch with a single final aggregation.
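The arithmetic behind that multiplier can be sketched under simple assumptions: a fixed orchestrator base context (here the 12,000-token system prompt mentioned earlier), and a hypothetical 2,000 tokens of result added per subagent. Both function names and the 2,000-token figure are illustrative, not measurements.

```python
def sequential_orchestrator_input(base: int, result: int, n: int) -> int:
    """Orchestrator is re-invoked after each subagent; its context holds
    the base prompt plus every result accumulated so far."""
    return sum(base + k * result for k in range(1, n + 1))


def parallel_orchestrator_input(base: int, result: int, n: int) -> int:
    """Orchestrator is invoked once, with all n results attached."""
    return base + n * result


base, result, n = 12_000, 2_000, 5
seq = sequential_orchestrator_input(base, result, n)  # 5*12,000 + 15*2,000
par = parallel_orchestrator_input(base, result, n)    # 12,000 + 5*2,000
ratio = seq / par
print(f"sequential={seq:,} parallel={par:,} ratio={ratio:.1f}x")
```

With these inputs the ratio comes out a little above 4x; it approaches 5x (the number of subagents) as the base context grows relative to the per-subagent results.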
CAL Dimension 3: Tool-Call Type Attribution
Tool calls are the silent budget eaters. An LLM call that costs 4,000 tokens might trigger a web-search tool that returns 12,000 tokens, all of which get appended to the next call’s context. The tool’s output tokens cost nothing in isolation, but they are re-billed as input tokens on every subsequent call until the context is trimmed.
The CAL tracks two numbers per tool invocation:
- Returned tokens: Total tokens in the tool’s output payload.
- Used tokens: Tokens from that payload that appear in the agent’s subsequent reasoning (measured by substring matching or embedding similarity, depending on your implementation).
The return/use ratio is the core metric. A web-search tool returning 8,000 tokens where the agent extracts a single 200-token fact has a 2.5% use ratio. That is not a tool-call problem; it is a tool-output truncation problem. The fix is trivial: return the top 1,000 tokens and let the agent request more if needed. The impact is immediate: every subsequent call in that agent’s chain is 7,000 input tokens lighter.
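A substring-matching version of the use ratio might look like the sketch below. It assumes whitespace splitting as a crude stand-in for a real tokenizer and treats a payload line as "used" only if it reappears verbatim in the agent's later output; both choices are simplifications, and the function names are hypothetical.

```python
def count_tokens(text: str) -> int:
    """Crude whitespace token count; swap in your provider's tokenizer."""
    return len(text.split())


def use_ratio(tool_payload: str, agent_output: str) -> float:
    """Fraction of returned tokens whose source line reappears verbatim
    in the agent's subsequent output."""
    returned = count_tokens(tool_payload)
    if returned == 0:
        return 0.0
    used = sum(
        count_tokens(line)
        for line in tool_payload.splitlines()
        if line.strip() and line.strip() in agent_output
    )
    return used / returned


payload = (
    "Paris is the capital of France.\n"
    "France has 18 regions.\n"
    "The Seine flows through Paris.\n"
    "French cuisine is famous."
)
answer = "Answer: Paris is the capital of France."
ratio = use_ratio(payload, answer)
print(f"use ratio: {ratio:.1%}")
```

Verbatim matching undercounts paraphrased use, so treat the result as a lower bound; an embedding-similarity variant would catch more but costs more to compute.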
Five tool-call types with their typical pathologies:
| Tool Type | Typical Return | Typical Use Ratio | Common Fix |
|---|---|---|---|
| Web Search | 5,000–15,000 tokens | 2–8% | Truncate to top-N results; add query-specific relevance filter |
| File Read | 2,000–50,000 tokens | 5–20% | Line-range or section-targeted reads; never whole-file on first call |
| Code Execution | 200–5,000 tokens | 40–80% | Usually well-used; watch stdout truncation failures that trigger retries |
| API / Database | 1,000–20,000 tokens | 10–40% | Field projection; never return all columns when agent needs three |
| Memory / Vector Store | 500–3,000 tokens | 30–70% | Top-k retrieval already helps; watch for stale entries inflating context |
CAL Dimension 4: Context Carry
Context carry is the fraction of an agent’s input token count that comes from prior conversation history, injected documents, or system prompt boilerplate — as opposed to the new task instruction for that specific call. It is the single largest source of silent cost multiplication in multi-agent pipelines.
The formula is straightforward:
carry_overhead = (input_tokens - task_instruction_tokens) / input_tokens
A call with 18,000 input tokens where the actual task instruction is 2,000 tokens has a carry overhead of 88.9%. That means you are paying for 16,000 tokens of history and boilerplate on every single call in that agent’s chain. If the agent runs 10 calls, you have spent 160,000 tokens on carry alone, not reasoning.
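The same calculation as a small function, reproducing the figures above. The function name and the input validation are additions for the sketch; the chain total assumes, as the paragraph does, that each of the 10 calls carries the same 16,000 tokens.

```python
def carry_overhead(input_tokens: int, task_instruction_tokens: int) -> float:
    """Share of input tokens that is history or boilerplate rather than
    the new task instruction."""
    if input_tokens <= 0:
        raise ValueError("input_tokens must be positive")
    if not 0 <= task_instruction_tokens <= input_tokens:
        raise ValueError("task_instruction_tokens must be within input_tokens")
    return (input_tokens - task_instruction_tokens) / input_tokens


overhead = carry_overhead(18_000, 2_000)
carry_over_chain = (18_000 - 2_000) * 10  # same carry on each of 10 calls
print(f"carry overhead: {overhead:.1%}, carry over 10 calls: {carry_over_chain:,}")
```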
Two carry patterns to eliminate immediately:
Full-history carry: The orchestrator’s complete conversation history gets attached to every subagent spawn. The subagent needs a 500-token task brief. It gets a 15,000-token orchestrator history instead. Fix: generate a structured handoff document at spawn time — task description, required inputs, expected output format, nothing else.
Boilerplate inflation: A 4,000-token system prompt that re-explains the company background, the agent’s role, its tools, its output format, and three pages of safety rules on every call, including tool-call follow-up turns. Fix: move static boilerplate into a cached prefix. Anthropic’s prompt caching feature reduces the cost of repeated static prefix tokens to roughly 10% of normal input token price. A 4,000-token system prompt called 50 times goes from 200,000 billable input tokens to the equivalent of roughly 20,000 with a single cache breakpoint marker, ignoring the surcharge on the first call that writes the cache and assuming every later call arrives before the cache entry expires.
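The caching arithmetic can be checked with a short sketch. The 10% read multiplier comes from the paragraph above; the 1.25x write multiplier for the first call is an assumption to verify against your provider's current pricing, and the model assumes every later call is a cache hit.

```python
def cached_prefix_cost(prefix_tokens: int, calls: int,
                       read_multiplier: float = 0.10,
                       write_multiplier: float = 1.25) -> float:
    """Billable token-equivalents for a static prefix: one cache write on
    the first call, then cache reads on every later call (all assumed hits)."""
    if calls < 1:
        return 0.0
    write = prefix_tokens * write_multiplier
    reads = prefix_tokens * read_multiplier * (calls - 1)
    return write + reads


uncached = 4_000 * 50
reads_only = cached_prefix_cost(4_000, 50, write_multiplier=0.10)
with_write_premium = cached_prefix_cost(4_000, 50)
print(f"uncached={uncached:,} reads_only={reads_only:,.0f} "
      f"with_write_premium={with_write_premium:,.0f}")
```

Even with the assumed write premium included, the prefix cost stays close to an eighth of the uncached figure, so the conclusion does not depend on the exact multiplier.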
CAL Dimension 5: Retry Tax
Every failed tool-call or parse error that triggers a retry is a taxable event. The retry carries the full conversation history up to that point plus the failure message. In a long chain, a retry at step 8 is far more expensive than a retry at step 1 because it carries seven steps of accumulated context.
The retry tax compounds. A parse failure at step 8 in a 10-step chain might add 12,000 tokens to that call and inflate every subsequent call by the same amount. If the parse failure happens three times before the agent succeeds, the retry tax on that single error is 36,000 tokens, before accounting for the downstream carry effect.
The CAL makes retry tax visible by tagging every call with a retry_depth counter. When retry_depth > 0, the tokens for that call are also counted under Retry Tax, in addition to their phase and role tags. This surfaces failure-mode cost separately from productive spend while keeping one value per dimension on every token.
Retry tax above 15% of total spend is a clear signal that your tool interface is unreliable or your output parsing is fragile. Structured output schemas (JSON mode, typed function signatures) typically cut parse-failure retry rates from 15-25% to under 3%, which directly reduces Retry Tax as a share of total spend.
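A sketch of the retry_depth tagging, assuming each logged call carries its step number, retry depth, and token counts. The `Call` record and the sample numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Call:
    step: int
    retry_depth: int  # 0 for a first attempt, 1+ for retries
    input_tokens: int
    output_tokens: int


def retry_tax(calls: list) -> tuple:
    """Return (tokens spent on calls with retry_depth > 0, their share of total)."""
    total = sum(c.input_tokens + c.output_tokens for c in calls)
    taxed = sum(c.input_tokens + c.output_tokens
                for c in calls if c.retry_depth > 0)
    return taxed, (taxed / total if total else 0.0)


calls = [
    Call(step=1, retry_depth=0, input_tokens=4_000, output_tokens=500),
    Call(step=2, retry_depth=0, input_tokens=6_000, output_tokens=500),
    Call(step=2, retry_depth=1, input_tokens=6_500, output_tokens=500),  # retry
    Call(step=3, retry_depth=0, input_tokens=7_000, output_tokens=500),
]
taxed_tokens, tax_share = retry_tax(calls)
print(f"retry tax: {taxed_tokens:,} tokens ({tax_share:.1%} of total)")
```

This counts only the retried calls themselves; the downstream carry effect of the failure message would need the context delta tracked separately.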
Worked Example: A Research-and-Draft Pipeline
Below is a fully illustrative worked example. The numbers are constructed to be internally consistent and represent the order-of-magnitude behavior you would observe in a real pipeline of this type, but they are not measurements from a specific run. Use them as a calibration template, not benchmarks.
Pipeline: Orchestrator spawns three subagents to research three sub-topics, then synthesizes results into a 1,500-word draft.
Configuration: Model with 100K context window, web-search tool returning up to 10,000 tokens per call, no prompt caching, full-history carry on all spawns.
| Step | Agent Role | Phase | Input Tokens | Output Tokens | Tool Returns | Carry Overhead |
|---|---|---|---|---|---|---|
| 1. Orchestrator plans | Orchestrator | Plan | 5,000 | 1,200 | 0 | 60% |
| 2. Subagent A: research | Subagent | Execute | 18,000 | 800 | 22,000 | 78% |
| 3. Subagent B: research | Subagent | Execute | 18,000 | 900 | 20,000 | 78% |
| 4. Subagent C: research (retry x1) | Subagent | Execute | 24,000 | 750 | 18,000 | 83% |
| 5. Orchestrator aggregates | Orchestrator | Synthesize | 32,000 | 3,500 | 0 | 82% |
| 6. Critic reviews draft | Critic | Verify | 38,000 | 1,200 | 0 | 91% |
| 7. Formatter finalizes | Formatter | Synthesize | 40,000 | 2,000 | 0 | 93% |
Totals (illustrative): 175,000 input tokens + 10,350 output tokens = ~185,350 tokens total.
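A quick way to sanity-check a ledger like this is to recompute the totals and the per-dimension sums directly from the step table; within any one dimension, the buckets should add up to the grand total. The sketch below hard-codes the rows above and counts input plus output tokens per step, which is one reasonable basis among several.

```python
from collections import defaultdict

# (role, phase, input_tokens, output_tokens, tool_return_tokens), one per step
STEPS = [
    ("orchestrator", "plan",        5_000, 1_200,      0),
    ("subagent",     "execute",    18_000,   800, 22_000),
    ("subagent",     "execute",    18_000,   900, 20_000),
    ("subagent",     "execute",    24_000,   750, 18_000),
    ("orchestrator", "synthesize", 32_000, 3_500,      0),
    ("critic",       "verify",     38_000, 1_200,      0),
    ("formatter",    "synthesize", 40_000, 2_000,      0),
]

total_input = sum(step[2] for step in STEPS)
total_output = sum(step[3] for step in STEPS)
total_tool_return = sum(step[4] for step in STEPS)
grand_total = total_input + total_output


def tokens_by(column: int) -> dict:
    """Sum input + output tokens grouped by the given tuple column
    (0 = role, 1 = phase)."""
    totals = defaultdict(int)
    for step in STEPS:
        totals[step[column]] += step[2] + step[3]
    return dict(totals)


by_phase = tokens_by(1)
by_role = tokens_by(0)
for label, buckets in (("phase", by_phase), ("role", by_role)):
    for name, tokens in buckets.items():
        print(f"{label}:{name:<12} {tokens:>8,} {tokens / grand_total:6.1%}")
```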
Now the CAL attribution breakdown:
| Cost Center | Tokens | Share | CAL Status |
|---|---|---|---|
| Phase: Execute | 62,450 | 34% | Within target (<60%) |
| Phase: Synthesize | 77,500 | 42% | ALERT: target <10% |
| Phase: Plan | 6,200 | 3% | Within target |
| Phase: Verify | 39,200 | 21% | Slightly over (<20% target) |
| Role: Orchestrator | 41,700 | 22% | Slightly over (<20% target) |
| Role: Subagent | 62,450 | 34% | Expected |
| Role: Critic | 39,200 | 21% | High for role |
| Role: Formatter | 42,000 | 23% | ALERT: formatter should be <5% |
| Tool Return/Use: Web Search | 60,000 returned / ~3,600 used | 6% use ratio | ALERT: target >25% |
| Context Carry (avg across calls) | ~143,000 of 175,000 | 82% | ALERT: target <40% |
| Retry Tax (Subagent C retry) | ~6,000 | 3.2% | Within target (<8%) |
Phase and role rows count input plus output tokens for the steps in the table above, so the shares within each dimension sum to 100% of the 185,350-token total. The CAL immediately surfaces four problems that a flat token count hides entirely. The synthesize phase is consuming 42% of budget because the formatter and aggregation calls carry massive histories. The formatter alone is 23% of total spend, a role that should be sub-5%. Context carry at 82% means the pipeline is spending roughly four and a half dollars on history for every one dollar on actual reasoning. And the web-search tool has a 6% use ratio, meaning 94% of what it returns gets ignored.
The Optimized Version: What the CAL Tells You to Change
The CAL is not just a diagnostic. Each alert maps directly to a specific fix. Here is what the four alerts above prescribe:
Alert: Synthesize phase 42% (target <10%)
Root cause: formatter and aggregator carry full conversation history. Fix: generate a structured JSON handoff after the critic pass, and pass the aggregator compact subagent findings rather than full transcripts. The formatter receives a 2,000-token structured document, not a 40,000-token conversation. In this illustrative model, the synthesize phase drops from 42% to under 8%.
Alert: Formatter role 23% (target <5%)
Root cause: same as above, plus the formatter is running on the same 100K-context model. Fix: route formatter to a cheaper, smaller model (Haiku-class). The formatting task is deterministic template filling, not reasoning. Combined with the context fix, formatter cost drops from 23% to under 3%.
Alert: Web-search use ratio 6% (target >25%)
Root cause: tool returns 10,000 tokens per search call; agent extracts a few hundred. Fix: add a two-step tool design. First call returns metadata + 150-token snippets. Agent decides which snippets are relevant. Second call fetches full text for selected results only. Use ratio rises to 40-60%; downstream carry inflation drops by 70%.
Alert: Context carry 82% (target <40%)
Root cause: every spawn passes full orchestrator history. Fix: generate a structured handoff summary at each agent spawn — task, required context, expected output format, nothing else. Enable prompt caching for the static system prompt. Carry overhead drops to the 35-45% range.
With these four changes applied to the same task, this illustrative model puts total token spend at roughly 60,000-75,000 tokens, down from ~185,000: a 60-68% reduction. The CAL attributed the waste; the fixes were each obvious once the attribution was visible.
Implementing the CAL in Your Pipeline
The CAL does not require a third-party tool. You need three things:
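1. Phase and role tags in every LLM call header. Add a structured block to the start of every system prompt with CAL_PHASE, CAL_ROLE, CAL_RUN_ID, and CAL_AGENT_ID fields. These tags cost roughly 20 tokens per call and give your logging layer the dimensions it needs. If you also use prompt caching, put the block after the cached static prefix instead, because per-run values at the very start of the prompt change the prefix and defeat the cache.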
2. Token counting at call boundaries. Most LLM providers return token counts in the API response. Log input_tokens, output_tokens, and the phase/role tags together. Do not rely on post-hoc token estimation — count at the API response, not in your prompt template.
3. Tool-call output interception. Before any tool output is appended to an agent’s context, log the token count of the returned payload. Track this as a separate “tool_return_tokens” field alongside the LLM call that follows. You can compute the use ratio later by comparing tool_return_tokens to the number of payload tokens that reappear in the agent’s subsequent output, as defined under Dimension 3.
A minimal implementation in any language fits in under 80 lines. The data model is four columns: run_id, call_id, dimension_values (phase, role, tool_type, retry_depth), and token_counts (input, output, tool_return). Every other CAL metric — carry overhead, use ratio, retry tax — is a derived query over those four columns.
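One possible shape for that data model is sketched below. The `LedgerRow` and `Ledger` names, the flattened dimension fields, and the choice to aggregate on input plus output tokens are all assumptions for illustration; a real implementation would populate rows from the token counts in each API response.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LedgerRow:
    run_id: str
    call_id: str
    # dimension values
    phase: str
    role: str
    tool_type: Optional[str]
    retry_depth: int
    # token counts
    input_tokens: int
    output_tokens: int
    tool_return_tokens: int = 0


class Ledger:
    def __init__(self) -> None:
        self.rows = []

    def record(self, row: LedgerRow) -> None:
        self.rows.append(row)

    def totals_by(self, dimension: str) -> dict:
        """Sum input + output tokens grouped by one dimension field."""
        totals = defaultdict(int)
        for row in self.rows:
            totals[getattr(row, dimension)] += row.input_tokens + row.output_tokens
        return dict(totals)

    def shares_by(self, dimension: str) -> dict:
        """Each bucket's share of total tokens for one dimension."""
        totals = self.totals_by(dimension)
        grand = sum(totals.values())
        return {k: v / grand for k, v in totals.items()} if grand else {}


ledger = Ledger()
ledger.record(LedgerRow("run_1", "c1", "plan", "orchestrator", None, 0, 5_000, 1_200))
ledger.record(LedgerRow("run_1", "c2", "execute", "subagent", "web_search", 0,
                        18_000, 800, tool_return_tokens=22_000))
print(ledger.totals_by("phase"))
print(ledger.shares_by("role"))
```

Because every derived metric is a query over the same rows, adding a new dimension later means adding one field rather than a new logging path.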
Once you have three or four instrumented runs, patterns emerge fast. You will often find that most of your excess token spend concentrates in two or three specific agent transitions. That is where the optimization work pays off.
CAL Thresholds as a Go/No-Go Gate
The CAL becomes most valuable when you use its metrics as a quality gate before scaling a pipeline. A research pipeline with 82% context carry is not ready to run at 1,000 tasks per day. At scale, that 82% carry means $8,200 of every $10,000 in input spend goes to carry, leaving $1,800 for productive reasoning. Before any multi-agent workflow goes to production scale, run the CAL attribution check and confirm all five dimensions are within target thresholds.
Define your thresholds explicitly in your pipeline config, not in a spreadsheet. Something like:
cal_thresholds:
  phase_plan_max: 0.15
  phase_synthesize_max: 0.10
  role_orchestrator_max: 0.20
  role_formatter_max: 0.05
  tool_use_ratio_min: 0.25
  carry_overhead_max: 0.40
  retry_tax_max: 0.08
Fail the pipeline run with a hard error if any threshold is breached. This creates the feedback loop: engineers cannot add a new agent role without the CAL noticing if that role inflates carry overhead past the threshold.
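A gate of that kind could be as small as the sketch below. It assumes the thresholds have already been loaded into a dict and that metric names match the threshold keys minus their `_max` or `_min` suffix; `CalThresholdError`, `check_thresholds`, and the sample metrics are hypothetical. A missing metric is treated as a breach so the gate fails closed.

```python
class CalThresholdError(RuntimeError):
    """Raised when a run breaches one or more CAL thresholds."""


THRESHOLDS = {
    "phase_plan_max": 0.15,
    "phase_synthesize_max": 0.10,
    "role_orchestrator_max": 0.20,
    "role_formatter_max": 0.05,
    "tool_use_ratio_min": 0.25,
    "carry_overhead_max": 0.40,
    "retry_tax_max": 0.08,
}


def check_thresholds(metrics: dict, thresholds: dict = THRESHOLDS) -> None:
    """Raise CalThresholdError listing every breached or missing metric."""
    breaches = []
    for key, limit in thresholds.items():
        name, kind = key.rsplit("_", 1)  # "retry_tax_max" -> ("retry_tax", "max")
        value = metrics.get(name)
        if value is None:
            breaches.append(f"{name}: metric missing")
        elif kind == "max" and value > limit:
            breaches.append(f"{name}={value:.2f} exceeds max {limit:.2f}")
        elif kind == "min" and value < limit:
            breaches.append(f"{name}={value:.2f} below min {limit:.2f}")
    if breaches:
        raise CalThresholdError("; ".join(breaches))


# Hypothetical metrics for a run that passes every threshold.
passing = {
    "phase_plan": 0.10, "phase_synthesize": 0.08, "role_orchestrator": 0.18,
    "role_formatter": 0.03, "tool_use_ratio": 0.45, "carry_overhead": 0.38,
    "retry_tax": 0.03,
}
check_thresholds(passing)  # no exception

failing = dict(passing, carry_overhead=0.82, tool_use_ratio=0.06)
try:
    check_thresholds(failing)
except CalThresholdError as err:
    print(f"CAL gate failed: {err}")
```

Collecting all breaches before raising gives the engineer the full list in one run rather than one failure at a time.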
You can explore the cost calculators and developer tools at WOWHOW Tools to build out your own CAL dashboard, or browse the full template collection for multi-agent pipeline starters. If you are running production pipelines with real cost pressure, the Pro Vault tier includes the full CAL implementation template with logging adapters for the major LLM APIs.
The immediate action item: pick the most expensive multi-agent run you ran this week. Add CAL phase tags retroactively by reading through your log traces and manually binning each call. You do not need instrumentation to do the first attribution — just a log file and 20 minutes. What you find will determine whether you need a 2x optimization or a 10x one.
Written by
WOWHOW
The WOWHOW team brings 14+ years of production engineering experience. Every tool and product in the catalog is personally built, tested, and curated.