The WOWHOW Cache-Warm Sequencing (CWS) Framework
CWS is a four-phase orchestration pattern for multi-agent pipelines. It treats cache TTL as a first-class scheduling constraint and structures subagent batches accordingly. The framework applies to any pipeline that: (a) shares a substantial system prompt across multiple calls, (b) runs more than three subagent invocations per job, and (c) has non-trivial inter-call latency from tools, data fetches, or post-processing.
Phase 1 — Stabilize
Before scheduling anything, audit your prompt for instability. Every token in the cacheable prefix must be identical across all calls in a batch. That means:
Pull all per-call context out of the system prompt and into the first user turn. The system prompt should contain only instructions, persona, and static reference material. Dynamic content (the file being analyzed, the task description, the retrieved document) goes in the user message. This sounds obvious, but many frameworks default to putting everything in the system prompt for simplicity.
Pin your model version explicitly. Do not use aliases that might resolve to different checkpoints. If your orchestrator resolves aliases at runtime, use claude-opus-4-8-20260514 rather than claude-opus-4-8: an alias can point to a new checkpoint between pipeline runs and silently break the cache.
Lock the cache_control block position. If you use explicit cache markers, they must appear at the same array index across all calls. If call 1 marks block 2 and call 2 marks block 3, the cache is cold.
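The stability audit can be partly automated by fingerprinting the cacheable part of each request and flagging any call that drifts. The sketch below is a minimal illustration, assuming requests are plain dictionaries shaped like Messages API payloads; `prefix_fingerprint` and `unstable_calls` are hypothetical helper names, and a real audit would also need to cover tool definitions if you send them.

```python
import hashlib
import json


def prefix_fingerprint(request: dict) -> str:
    """Hash the parts of a request that must match for a cache hit."""
    cacheable = {
        "model": request["model"],
        # The system list carries both the text and the cache_control
        # positions, so a moved marker changes the fingerprint too.
        "system": request["system"],
    }
    canonical = json.dumps(cacheable, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def unstable_calls(requests: list[dict]) -> list[int]:
    """Return indexes of requests whose prefix differs from the first one."""
    if not requests:
        return []
    reference = prefix_fingerprint(requests[0])
    return [i for i, r in enumerate(requests)
            if prefix_fingerprint(r) != reference]
```

Running `unstable_calls` over one batch before dispatch gives you the list of calls that would miss the cache because of prefix drift.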
Phase 2 — Measure
Instrument your pipeline to record two timestamps per call: dispatch time and response time. Compute the inter-call gap: the time between when the previous response arrived and when the next call was dispatched. This is your scheduling baseline.
Most teams skip this step. As a result they have no idea whether their pipelines are cache-warm or cache-cold. The Measure phase is not optional — without it you cannot tune the scheduler in Phase 3.
Also record the cache_read_input_tokens and cache_creation_input_tokens fields from the API response. Anthropic returns both. Your cache-hit rate is cache_read_input_tokens / (cache_read_input_tokens + cache_creation_input_tokens) on the shared prefix. Target: above 80% for any batch of more than four calls.
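The two Phase 2 metrics can be computed from a handful of per-call records. This is a sketch under the assumption that you already capture timestamps and the two usage fields yourself; the `CallRecord` type and both function names are illustrative, not part of any SDK.

```python
from dataclasses import dataclass


@dataclass
class CallRecord:
    dispatched_at: float  # seconds, e.g. from time.monotonic()
    responded_at: float
    cache_read_input_tokens: int
    cache_creation_input_tokens: int


def inter_call_gaps(records: list[CallRecord]) -> list[float]:
    """Gap between each response and the next dispatch, in dispatch order."""
    ordered = sorted(records, key=lambda r: r.dispatched_at)
    return [nxt.dispatched_at - prev.responded_at
            for prev, nxt in zip(ordered, ordered[1:])]


def cache_hit_rate(records: list[CallRecord]) -> float:
    """cache_read / (cache_read + cache_creation), summed over the batch."""
    read = sum(r.cache_read_input_tokens for r in records)
    created = sum(r.cache_creation_input_tokens for r in records)
    total = read + created
    return read / total if total else 0.0
```

Compare the result of `cache_hit_rate` against the 80% target, and feed the output of `inter_call_gaps` into the Phase 3 scheduler.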
Phase 3 — Schedule
This is the core of CWS. The scheduling heuristic operates on a concept called the warm budget: the maximum time you can afford between two consecutive calls before the cache expires. With a 5-minute TTL and a target safety margin of 30 seconds, your warm budget is 270 seconds.
The CWS scheduler divides your pipeline into warm windows: groups of subagent calls that can all complete within one warm budget. The rule is simple: any subagent whose expected dispatch-to-dispatch latency from the previous call exceeds 270 seconds must either be batched forward (moved earlier in the sequence) or issued a warm-up call (a lightweight prefill call with no real task, just enough to reset the TTL).
The warm-up call pattern is a key CWS technique. Rather than forcing every subagent to complete within the window, you issue a cheap no-op call (a request with a minimal user message like “acknowledge”) at the 240-second mark. This resets the TTL for another 5 minutes. The warm-up is not free, since it still reads the full prefix at the cache-read rate and emits a few output tokens, but the economics favor it: one small call to avoid paying the full write rate again on a 3,000-token system prompt.
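The warm-budget rule translates into a small planning function. The sketch below assumes the numbers used in this section (300-second TTL, 30-second margin, warm-up at the 240-second mark); `warmup_offsets` is a hypothetical name, and the policy of repeating warm-ups every 240 seconds for very long gaps is an extrapolation of the single warm-up described above.

```python
def warmup_offsets(gap_seconds: float, ttl: float = 300.0,
                   margin: float = 30.0,
                   warmup_at: float = 240.0) -> list[float]:
    """Offsets (seconds after the previous call) at which to fire warm-ups."""
    warm_budget = ttl - margin  # 270 s with the defaults
    if warmup_at <= 0 or warmup_at > warm_budget:
        raise ValueError("warmup_at must fall inside the warm budget")
    offsets: list[float] = []
    last_touch = 0.0  # last moment the cache was read or written
    # Keep adding warm-ups until the remaining wait fits in one warm budget.
    while gap_seconds - last_touch > warm_budget:
        last_touch += warmup_at
        offsets.append(last_touch)
    return offsets
```

A 200-second gap needs no warm-up, a 400-second gap gets one at 240 seconds, and a 600-second gap gets two, at 240 and 480 seconds.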
Phase 4 — Batch
Where possible, collapse independent subagents into parallel calls within the same warm window rather than sequential calls across multiple windows. Parallel calls all share the same cached prefix simultaneously, so you pay the write rate once and the read rate for all remaining calls in the batch.
The Batch phase requires dependency analysis: which subagents can run concurrently (no output of A feeds B) versus which must run sequentially (B consumes A's output). CWS recommends drawing a simple DAG of your pipeline before scheduling. Independent branches run in parallel within a window; dependent chains run sequentially but with warm-up calls bridging long gaps.
The CWS Scheduling Decision Table
The table below is the WOWHOW framework's core artifact. Given the inter-call gap, the system prompt size, and whether the calls depend on each other, it prescribes the scheduling action.
| Inter-call gap | System prompt size | Dependency | CWS Action | Expected cache hit |
| --- | --- | --- | --- | --- |
| < 60s | Any | Any | No action needed; warm window is safe | High (>95%) |
| 60–180s | < 2,000 tokens | Sequential | Maintain sequence; monitor TTL resets | High (>90%) |
| 60–180s | > 2,000 tokens | Sequential | Consider warm-up call at 150s mark if gap is variable | Medium (70–90%) |
| 180–270s | Any | Sequential | Issue warm-up call at 240s; maintain 30s safety buffer | Medium (60–80%) |
| 180–270s | Any | Independent | Batch into parallel calls; fire all within one window | High (>90%) |
| > 270s | < 1,500 tokens | Any | Allow cold miss; cache savings may not justify warm-up overhead | Low (cold miss likely) |
| > 270s | > 1,500 tokens | Sequential | Mandatory warm-up call; restructure pipeline to reduce gap if possible | Medium with warm-up |
| > 270s | > 1,500 tokens | Independent | Batch all into one parallel dispatch; single cache write, all reads | High (>85%) with batching |
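The decision table can be encoded as a lookup function so an orchestrator can apply it per call. This is a sketch, not a canonical implementation: `cws_action` is a hypothetical name, the handling of exact boundary values (180s, 270s, 1,500 and 2,000 tokens) is an assumption, and the 60–180s band reuses the sequential rows for independent calls because the table does not list that case.

```python
def cws_action(gap_s: float, prompt_tokens: int, independent: bool) -> str:
    """Map (gap, prompt size, dependency) to the action named in the table."""
    if gap_s < 60:
        return "no action"
    if gap_s <= 180:
        # Assumption: the table only lists sequential rows for this band,
        # so independent calls get the same answer here.
        if prompt_tokens < 2000:
            return "maintain sequence"
        return "consider warm-up at 150s"
    if gap_s <= 270:
        return "batch in parallel" if independent else "warm-up at 240s"
    # Gap exceeds the 270 s warm budget.
    if prompt_tokens < 1500:
        return "allow cold miss"
    return "batch in parallel" if independent else "mandatory warm-up"
```

For example, `cws_action(300, 2800, independent=False)` lands on the mandatory warm-up row.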
Reading the Table
The breakeven point for a warm-up call is approximately 1,500 tokens of shared prefix. Below that threshold, the cost of the warm-up call (one API round-trip plus minimal tokens) approaches or exceeds the savings from avoiding a cache miss. Above 1,500 tokens, warm-up calls pay for themselves on a single avoided miss.
The rows for independent subagents are where the biggest savings live. If your pipeline has five independent subagents that each take 200 seconds to produce output, the naive sequential approach runs them in order over 1,000 seconds. Every 200-second gap sits close to the 270-second warm budget, so any slow run that pushes a gap past the TTL pays the write rate again on a large prompt. The CWS batch approach fires all five within one warm window, with the first call seeding the cache and the other four reading it. Total time: a little over 200 seconds. Total cache cost: one write plus four reads.
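To put numbers on "one write plus four reads", the prefix cost can be expressed in base-input-token equivalents. The multipliers below (1.25× for a 5-minute cache write, 0.10× for a cache read) are assumptions based on published pricing at the time of writing and should be checked against the current price list; `prefix_cost` is a hypothetical helper.

```python
WRITE_MULTIPLIER = 1.25  # assumed premium for a 5-minute cache write
READ_MULTIPLIER = 0.10   # assumed discount for a cache read


def prefix_cost(prefix_tokens: int, writes: int, reads: int) -> float:
    """Cost of the shared prefix, in base-input-token equivalents."""
    return prefix_tokens * (writes * WRITE_MULTIPLIER
                            + reads * READ_MULTIPLIER)


batched = prefix_cost(3000, writes=1, reads=4)
all_cold = prefix_cost(3000, writes=5, reads=0)
```

With these assumed multipliers, one write plus four reads on a 3,000-token prefix comes to about 4,950 base-token equivalents, against about 18,750 if all five calls miss.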
Worked Example: Code Review Pipeline
Consider a pipeline that reviews a pull request. It runs the following subagents in order:
- Diff parser — reads the raw diff, extracts changed files and line ranges (avg 45s)
- Security scanner — checks for OWASP patterns (avg 90s, calls external tool)
- Style linter — checks code conventions (avg 40s)
- Complexity analyzer — estimates cyclomatic complexity of changed functions (avg 75s)
- Summary writer — synthesizes all prior outputs into a review comment (depends on 1–4)
System prompt: 2,800 tokens (includes coding guidelines, style rules, security checklist).
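Naive sequential execution: 1 → 2 → 3 → 4 → 5. Total time: 45 + 90 + 40 + 75 = 250 seconds of agent time before the summary writer even starts, and more once orchestrator overhead between steps is added. Each gap on its own is under the 5-minute TTL, but the chain leaves little slack: a slow external security scan or a stalled step can push a gap past the 270-second warm budget and force a cache rewrite.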
CWS analysis: Agents 2, 3, and 4 are all independent of each other (they each only need agent 1's output). Agent 5 depends on all four.
CWS-optimized schedule:
- T=0: Dispatch agent 1 (diff parser)
- T=45: Agents 2, 3, and 4 fire in parallel (all three in a single warm window since cache was just written at T=0)
- T=45+90=135: All three complete. Gap since last call: ~90s — within window.
- T=135: Dispatch agent 5 (summary writer); the cache is still warm because the T=45 calls refreshed the TTL
Total time: 135 + summary_time seconds. Cache writes: 1 (agent 1 at T=0). Cache reads: 4 (agents 2, 3, and 4 at T=45; agent 5 at T=135), since every call presents the identical 2,800-token prefix and each hit refreshes the TTL. Zero warm-up calls needed. The 2,800-token system prompt is paid at the write rate once and at the read rate four times, and the pipeline reaches the summary writer at T=135 rather than after the full sequential chain, which leaves far less room for a gap to drift past the warm budget.
Anti-Patterns the CWS Framework Prevents
The Context-Stuffing Trap
Injecting retrieved documents, file contents, or database records into the system prompt rather than the user turn is the most common way to break prompt caching entirely. Every call gets a unique system prompt because the retrieved content differs. Result: zero cache hits, ever. CWS Phase 1 (Stabilize) catches this during the audit step.
The Alias Trap
Using floating model aliases like claude-opus-4-8 instead of pinned checkpoint IDs means a model update between your first and fifth subagent call produces two different cache namespaces. The cache appears to work — the first few calls hit — then silently misses after a checkpoint rotation. Pin the full model ID in all orchestration configs.
The Sequential Default
Most orchestrators default to sequential execution because it is simpler to reason about. For cache-warm purposes, sequential execution is often the worst choice when you have independent subagents. CWS requires a dependency analysis step precisely to force the question: does this actually need to wait for the previous result, or does it just happen to be ordered that way in the code?
The Long-Running Tool Trap
When a subagent calls an external tool (a web search, a database query, a code execution environment), that tool call happens between API invocations. If the tool takes more than 5 minutes, the cache is cold by the time the result comes back. CWS handles this by flagging any tool with a P95 latency above 240 seconds as a “cache boundary tool” and issuing a warm-up call while the tool is still running, at the 240-second mark, so the prefix is still cached when the result is passed to the next subagent. A warm-up sent only after a 5-minute tool call returns arrives too late: the entry has already expired and the warm-up itself pays the write rate.
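The sketch below keeps the prefix warm while a slow tool runs by firing a warm-up on a fixed interval until the tool returns. It assumes an asyncio-based orchestrator; `run_with_keepalive` and the `send_warmup` callback are hypothetical names, and the callback is expected to issue the actual warm-up request.

```python
import asyncio
from typing import Any, Awaitable, Callable


async def run_with_keepalive(tool: Awaitable[Any],
                             send_warmup: Callable[[], Awaitable[None]],
                             interval_s: float = 240.0) -> Any:
    """Await `tool`, calling `send_warmup` every `interval_s` until it ends."""
    task = asyncio.ensure_future(tool)
    while True:
        # Wait for the tool, but give up after one interval.
        done, _ = await asyncio.wait({task}, timeout=interval_s)
        if done:
            return task.result()  # re-raises if the tool failed
        await send_warmup()       # tool still running: refresh the TTL
```

A tool that finishes inside the first interval triggers no warm-ups at all, so wrapping fast tools costs nothing extra.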
Implementing CWS in a Real Orchestrator
The CWS framework is model-agnostic in principle, but its practical home is the Claude Agent SDK or any custom orchestrator wrapping the Anthropic Messages API. Here is the implementation checklist:
Instrumentation (required for Phase 2)
Add a thin wrapper around every API call that records dispatch timestamp, response timestamp, and the usage object from the response. Parse cache_creation_input_tokens and cache_read_input_tokens out of the usage block. Log these per call. Without this data, you cannot measure Phase 2 and the scheduler in Phase 3 is flying blind.
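A thin wrapper of this kind can be a plain higher-order function. The sketch below assumes your client function takes a request dictionary and returns the parsed JSON response as a dictionary with a `usage` key; `instrumented` and `call_api` are hypothetical names, and an SDK that returns typed objects would need attribute access instead.

```python
import time
from typing import Callable


def instrumented(call_api: Callable[[dict], dict],
                 log: list[dict]) -> Callable[[dict], dict]:
    """Wrap `call_api` so every call appends a timing/usage record to `log`."""
    def wrapper(request: dict) -> dict:
        dispatched = time.monotonic()
        response = call_api(request)
        usage = response.get("usage", {})
        log.append({
            "dispatched_at": dispatched,
            "responded_at": time.monotonic(),
            "cache_creation_input_tokens":
                usage.get("cache_creation_input_tokens", 0),
            "cache_read_input_tokens":
                usage.get("cache_read_input_tokens", 0),
        })
        return response
    return wrapper
```

Swap the wrapped function in wherever the orchestrator makes API calls, and the log accumulates one record per call with no other code changes.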
Warm-Up Call Implementation
A warm-up call is a real API call with the full cached prefix and a minimal user message. It looks exactly like a real call but the user content is just an acknowledgement token. The system prompt must be identical — same content, same cache_control block positions — to hit the same cache entry. In the Claude API:
    POST /v1/messages
    {
      "model": "claude-sonnet-4-6-20260514",
      "max_tokens": 5,
      "system": [
        {
          "type": "text",
          "text": "[your full stable system prompt]",
          "cache_control": {"type": "ephemeral"}
        }
      ],
      "messages": [
        {"role": "user", "content": "ack"}
      ]
    }
The response will be one or two tokens. The cost is negligible. The cache TTL resets. You have bought another 5 minutes.
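The safest way to guarantee that the warm-up prefix matches is to derive the warm-up request from a real one instead of writing it by hand. A minimal sketch, assuming requests are plain dictionaries shaped like the payload above; `make_warmup_request` is a hypothetical helper.

```python
import copy


def make_warmup_request(real_request: dict, ack_text: str = "ack") -> dict:
    """Clone a real request, keeping the prefix and swapping in a no-op turn."""
    warmup = copy.deepcopy(real_request)  # leave the original untouched
    # Model, system blocks, and cache_control markers are copied verbatim,
    # so the warm-up addresses the same cache entry as the real call.
    warmup["messages"] = [{"role": "user", "content": ack_text}]
    warmup["max_tokens"] = 5
    return warmup
```

Because the function copies everything except the messages and the token cap, any future change to the system prompt flows into the warm-up automatically.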
Parallel Dispatch
Parallel dispatch means firing all independent-branch subagents via Promise.all (in Node.js) or asyncio.gather (in Python) rather than awaiting each one sequentially. All of them present the same cached prefix. If the prefix is already cached from an earlier call, every parallel call reads it. If it is not, be aware that a cache entry only becomes available once the first response has begun, so calls that arrive at the same instant may each pay the write rate. In that case, dispatch one call first and fan out the rest after its response starts.
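In Python, one layer of independent agents can be dispatched as sketched below. `dispatch_layer` and `call_agent` are hypothetical names; `call_agent` stands in for whatever coroutine sends one subagent request. The optional `prime_first` flag is a conservative assumption for the case where the prefix is not cached yet: it lets one call finish before the rest fan out, so the others can read the entry it created.

```python
import asyncio
from typing import Awaitable, Callable


async def dispatch_layer(agent_ids: list[str],
                         call_agent: Callable[[str], Awaitable[str]],
                         prime_first: bool = False) -> dict[str, str]:
    """Run independent agents concurrently and collect results by agent ID."""
    results: dict[str, str] = {}
    pending = list(agent_ids)
    if prime_first and pending:
        # Seed the cache with a single call before fanning out.
        first = pending.pop(0)
        results[first] = await call_agent(first)
    outputs = await asyncio.gather(*(call_agent(a) for a in pending))
    results.update(zip(pending, outputs))
    return results
```

Leave `prime_first` off when an earlier call in the same warm window has already written the prefix, since every parallel call will then read it anyway.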
Dependency DAG
Before running any multi-agent job, your orchestrator should build a dependency graph of the subagents. Tools like WOWHOW's developer tools can help visualize and audit agent pipelines. The minimal implementation: a dictionary mapping each agent ID to its list of prerequisite agent IDs. Topological sort gives you the execution layers. All agents in the same layer are independent and can be batched together in Phase 4.
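The minimal implementation described above fits in a few lines with the standard library's `graphlib`. In this sketch, `execution_layers` is a hypothetical helper, and the agent names are illustrative labels for the code review pipeline from the worked example.

```python
from graphlib import TopologicalSorter


def execution_layers(prerequisites: dict[str, list[str]]) -> list[list[str]]:
    """Group agents into layers; agents in one layer can run in parallel."""
    sorter = TopologicalSorter(prerequisites)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    layers: list[list[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # all prerequisites satisfied
        layers.append(ready)
        sorter.done(*ready)
    return layers


pipeline = {
    "diff_parser": [],
    "security": ["diff_parser"],
    "style": ["diff_parser"],
    "complexity": ["diff_parser"],
    "summary": ["diff_parser", "security", "style", "complexity"],
}
layers = execution_layers(pipeline)
```

For this pipeline the result has three layers: the diff parser alone, then the security, style, and complexity agents together, then the summary writer.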
CWS Tier Classification
The framework classifies pipelines into three tiers based on their cache-warm efficiency. This classification helps you prioritize optimization effort — a Tier 1 pipeline is already optimal; a Tier 3 pipeline has the most room for improvement.
| Tier | Cache hit rate | Avg inter-call gap | Batching used | Warm-up calls used | Action |
| --- | --- | --- | --- | --- | --- |
| Tier 1 (Warm) | >85% | <180s | Yes (where independent) | Rarely needed | No change. Monitor for regression. |
| Tier 2 (Leaking) | 50–85% | 180–300s | Partial | Not in use | Add warm-up calls at cache boundaries; audit prompt stability. |
| Tier 3 (Cold) | <50% | >300s or highly variable | No | Not in use | Full CWS audit: stabilize prefix, add instrumentation, build dependency DAG, batch and warm-up. |
Most production pipelines that have never been cache-audited land at Tier 3. The instrumentation step from Phase 2 will reveal this quickly: if you see cache_creation_input_tokens equal to cache_read_input_tokens across your calls (a 50% hit rate), you are paying full price half the time. If cache_creation_input_tokens dominates on every single call, you are Tier 3 with zero effective caching.
When CWS Does Not Apply
The framework is not universal. Three scenarios where it adds no value:
Small shared contexts under 1,024 tokens. The Anthropic API enforces a minimum cacheable prompt length, which is 1,024 tokens on many models and higher on some. If your system prompt is 500 tokens, there is nothing to cache and CWS scheduling is irrelevant. Expand your system prompt with useful reference material before worrying about cache orchestration.
Single-call pipelines. If your workflow is one call per job with no shared prefix across jobs, there is no multi-call cache to warm. CWS applies exclusively to pipelines that make multiple calls that share a stable prefix within the same job.
Latency-sensitive real-time flows. The warm-up call adds a round-trip. If your pipeline has a hard sub-second latency budget, a warm-up call is not compatible. In real-time flows, the economics change: users accept slightly higher token cost in exchange for zero added latency. Accept the occasional cold miss rather than inserting warm-up calls.
Putting CWS Into Your Workflow
Start with instrumentation. You cannot optimize what you cannot measure. Add the two-field usage extraction to every API call you make this week and plot your cache hit rate over a 24-hour window. If you are below 70%, you have a Tier 2 or Tier 3 pipeline and the optimization is almost certainly worth the engineering time.
Next, run the Phase 1 stability audit. Check your system prompt for any per-call dynamic content. Move it to the user turn. This single change alone often gets a pipeline from 40% to 80% cache hit rate with no scheduling changes at all.
Then draw the dependency DAG. Tools like WOWHOW’s AI tooling catalog list utilities for pipeline visualization, or you can sketch it manually. The DAG takes 15 minutes for most pipelines and immediately shows you which agents can be batched.
Finally, add warm-up calls at the boundaries identified by the Phase 3 scheduling table. Monitor the cache hit rate for 48 hours post-deployment. A well-implemented CWS pipeline should stabilize above 80% hit rate on the shared prefix within one or two tuning iterations.
If you are building on WOWHOW developer tools and want to trace prompt cache behavior across a live pipeline, the Pro Vault includes structured observability templates for multi-agent token accounting — including per-call cache hit/miss logging that maps directly to the CWS tier classification. The difference between a Tier 3 cold pipeline and a Tier 1 warm one on a 3,000-token system prompt running 20 daily jobs is approximately 1.1M tokens per month. At standard API pricing, that pays for the observability tooling many times over.