The 14 Cache-Break Patterns
Pattern 1: Dynamic content in the system prompt
The most expensive pattern. Anything that changes between sessions — current date, user name, account tier, feature flags — does not belong in the system prompt.
// BREAKS CACHE — date changes every day
const systemPrompt = `You are an assistant. Today is ${new Date().toISOString().slice(0, 10)}.`
// CORRECT — inject dynamic content as the first user message
const systemPrompt = `You are an assistant.`
const firstUserMessage = `Context: today is ${new Date().toISOString().slice(0, 10)}.
User request: ${userRequest}`
Pattern 2: Tool definition ordering that varies between requests
If your tool definitions are built dynamically from a feature flags object or a user permissions set, the ordering can vary between requests. Even if the same 8 tools are present, a different ordering is a different prefix.
// BREAKS CACHE: iteration follows insertion order, which depends on how userPermissions was built for this request
const tools = Object.values(userPermissions).map(p => buildToolDef(p))
// CORRECT — always sort tool definitions by a stable key before sending
const tools = Object.values(userPermissions)
  .map(p => buildToolDef(p))
  .sort((a, b) => a.name.localeCompare(b.name))
Pattern 3: Request IDs or trace headers in the API payload
Some observability setups inject trace IDs or request IDs into the API call body rather than the headers. If this ends up in the messages array — even in a metadata field — it breaks caching.
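A minimal sketch of keeping trace IDs out of the cached prefix, assuming your observability layer can read them from an HTTP header instead. The `x-trace-id` header name and the `buildRequest` helper are hypothetical and not part of any SDK; the request body is trimmed to the fields that matter here.

```typescript
interface Message {
  role: 'user' | 'assistant'
  content: string
}

interface OutgoingRequest {
  headers: Record<string, string>
  body: string
}

// Hypothetical helper: the trace ID travels in a header, so the serialized
// body (which carries the prompt prefix) is identical across calls.
function buildRequest(
  model: string,
  system: string,
  messages: Message[],
  traceId: string,
): OutgoingRequest {
  return {
    headers: { 'x-trace-id': traceId }, // assumed header name
    body: JSON.stringify({ model, system, messages }),
  }
}

const a = buildRequest('model-id', 'You are an assistant.', [{ role: 'user', content: 'Hi' }], 'trace-1')
const b = buildRequest('model-id', 'You are an assistant.', [{ role: 'user', content: 'Hi' }], 'trace-2')
console.log(a.body === b.body) // true
```

Comparing serialized bodies like this in a unit test is a cheap way to catch a per-request value leaking into the payload.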
Pattern 4: Serializing the same data in a different order
JSON.stringify emits keys in property order, which for ordinary string keys is insertion order. Two objects holding the same data but built along different code paths (a database row versus a merged config object, for example) can therefore serialize differently. If you are embedding serialized objects in your prompts (tool outputs, context data), pin the serialization to a deterministic format.
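// BREAKS CACHE: key order follows insertion order, which can differ between code paths
const contextStr = JSON.stringify(contextObj)
// CORRECT: sort keys at every nesting level (an array replacer would drop nested keys it does not list)
const sortKeys = (_key: string, value: unknown) =>
  value !== null && typeof value === 'object' && !Array.isArray(value)
    ? Object.fromEntries(Object.entries(value).sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0)))
    : value
const contextStr = JSON.stringify(contextObj, sortKeys)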
Pattern 5: Including image bytes that vary between sessions
Screenshots, dynamically generated charts, and user avatars change frequently. Any image in your prompt prefix that changes between requests breaks caching for everything after it.
Fix: put images in the user message (end of prefix), not in the system prompt. For images that are truly static (a fixed UI reference screenshot, a company logo), include them early and they will cache normally.
Pattern 6: Different whitespace in system prompts across code paths
This sounds trivial. It is not. If your system prompt is assembled by concatenating strings from different sources — a template literal here, a variable there — trailing newlines, double spaces, and inconsistent indentation all break caching. Normalize your system prompts to a canonical form before sending.
function normalizePrompt(prompt: string): string {
  return prompt
    .split('\n')
    .map(line => line.trimEnd()) // Remove trailing whitespace per line
    .join('\n')
    .replace(/\n{3,}/g, '\n\n') // Collapse 3+ newlines to 2
    .trim()
}
Pattern 7: Not using cache_control breakpoints
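This is an omission, not an action. Without explicit cache_control: { type: "ephemeral" } breakpoints on your long static content, you cannot rely on caching being applied at all. Note that a bare { type: "ephemeral" } breakpoint uses the default 5-minute TTL; the 1-hour TTL has to be requested explicitly with ttl: "1h" and is billed at a higher cache-write rate.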
For CLAUDE.md-loaded content and tool definitions, always set explicit breakpoints:
const messages = [
  {
    role: 'user',
    content: [
      {
        type: 'text',
        text: staticSystemContext, // Long, rarely-changing context
        cache_control: { type: 'ephemeral' }, // Default 5-minute TTL; add ttl: '1h' for a 1-hour cache
      },
      {
        type: 'text',
        text: currentUserMessage, // Variable, no cache breakpoint
      }
    ]
  }
]
Pattern 8: Conversation history not preserved between API calls
If your client rebuilds conversation history from a database on each API call and the serialization is slightly different each time — different timestamp format, different metadata fields included — the history prefix changes and caching breaks. Store and replay the exact API payload, not a reconstructed version.
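One way to avoid reconstruction drift is to serialize each message exactly once and replay the stored string. The sketch below assumes an in-memory store; `ConversationLog` is a hypothetical name, and a real implementation would persist each serialized message as an opaque string.

```typescript
interface Message {
  role: 'user' | 'assistant'
  content: string
}

// Hypothetical store: each message is serialized once and never rebuilt.
class ConversationLog {
  private serialized: string[] = []

  // Serialize at the moment the message is first sent or received.
  append(message: Message): void {
    this.serialized.push(JSON.stringify(message))
  }

  // Replay the stored strings verbatim instead of rebuilding from other fields.
  toRequestJson(): string {
    return `[${this.serialized.join(',')}]`
  }
}

const log = new ConversationLog()
log.append({ role: 'user', content: 'What is prompt caching?' })
log.append({ role: 'assistant', content: 'It reuses a previously processed prefix.' })
const firstReplay = log.toRequestJson()

log.append({ role: 'user', content: 'How long does it live?' })
// Earlier turns stay a byte-for-byte prefix of the later payload
// (ignoring the closing bracket of the first array).
console.log(log.toRequestJson().startsWith(firstReplay.slice(0, -1))) // true
```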
Pattern 9: Model version not pinned
Caches are per model version. If you use "model": "claude-sonnet-latest" and Anthropic rotates that alias to a new model version mid-month, your cache is empty. Pin to a specific version: "claude-sonnet-4-6".
Pattern 10: Tool output format changing between tool versions
If a tool you call in one turn returns JSON with fields in one format, and an updated version of that tool returns the same data with different field names or additional fields, the tool output goes into conversation history with a different format — breaking cache for all subsequent turns.
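A possible guard is to project every tool result onto a pinned shape before it enters conversation history. This is only a sketch: the weather tool, its field names, and `toStableToolOutput` are all hypothetical.

```typescript
// Hypothetical stable shape for one tool's output as it appears in history.
interface WeatherOutput {
  city: string
  temp_c: number
}

// Project whatever the tool returns onto the pinned shape: fixed fields, fixed order.
function toStableToolOutput(raw: Record<string, unknown>): string {
  const stable: WeatherOutput = {
    city: String(raw['city']),
    temp_c: Number(raw['temp_c']),
  }
  return JSON.stringify(stable)
}

const fromToolV1 = { city: 'Oslo', temp_c: 4 }
const fromToolV2 = { temp_c: 4, humidity: 80, city: 'Oslo' } // reordered, extra field
console.log(toStableToolOutput(fromToolV1)) // {"city":"Oslo","temp_c":4}
console.log(toStableToolOutput(fromToolV1) === toStableToolOutput(fromToolV2)) // true
```

This only stabilizes new turns. Tool outputs already stored in history should still be replayed exactly as they were first sent.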
Pattern 11: Parallel requests racing on the same cache key
A cache entry only becomes readable once the first request that creates it has been processed far enough to write it. If you fire two identical requests simultaneously, neither will cache-hit; both are computed fresh. For high-throughput use cases, send one request first and let its response begin before releasing the rest. A small jitter (50-200ms) only helps if the first response starts within that window.
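One way to avoid the race without relying on timing is to send a single request first and fan out the rest once it returns. A minimal sketch, assuming the cache entry is usable by the time the first call resolves; `warmThenFanOut` and `send` are hypothetical names, with `send` standing in for whatever function issues one API call.

```typescript
// Hypothetical helper: issue one request to warm the cache, then run the rest in parallel.
async function warmThenFanOut<Req, Res>(
  requests: Req[],
  send: (request: Req) => Promise<Res>,
): Promise<Res[]> {
  if (requests.length === 0) return []
  const [first, ...rest] = requests
  // The first call pays the cache-write cost and populates the cache.
  const firstResult = await send(first)
  // The remaining calls are only started after the first one has resolved.
  const restResults = await Promise.all(rest.map(r => send(r)))
  return [firstResult, ...restResults]
}
```

The trade-off is one extra round trip of latency for the batch in exchange for the remaining requests being eligible for cache reads.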
Pattern 12: Beta headers changing between requests
The anthropic-beta header is part of the cache key. If you use beta features on some requests and not others, or if you test with different beta flags, you get separate caches per beta configuration. Pick a consistent beta configuration and use it uniformly.
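A small sketch of building the header value in one place so that every code path sends the same string. The flag names below are made up for illustration, and the comma-separated format is an assumption to verify against the API reference.

```typescript
// Hypothetical flag names; substitute the beta flags your application actually uses.
const ENABLED_BETAS = ['feature-b-2025-01-01', 'feature-a-2025-01-01', 'feature-b-2025-01-01']

// Deduplicate and sort so the header value does not depend on call-site ordering.
function canonicalBetaHeader(flags: string[]): string {
  return Array.from(new Set(flags)).sort().join(',')
}

const headers = { 'anthropic-beta': canonicalBetaHeader(ENABLED_BETAS) }
console.log(headers['anthropic-beta']) // feature-a-2025-01-01,feature-b-2025-01-01
```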
Pattern 13: System prompt loaded from a file that gets reformatted
If your CLAUDE.md or system prompt file is in a git repository where an editor auto-formats on save (Prettier, ESLint, trailing newline normalization), a reformatting commit will change the system prompt content and invalidate the cache for all users until it re-warms.
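A CI check can catch accidental reformatting by comparing the prompt text against a recorded fingerprint. The sketch below uses 32-bit FNV-1a purely as a cheap, non-cryptographic checksum; both function names are hypothetical, and a real setup might prefer SHA-256 over the file bytes.

```typescript
// 32-bit FNV-1a over UTF-16 code units: a cheap, non-cryptographic checksum.
function fingerprint(text: string): string {
  let hash = 0x811c9dc5
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i)
    hash = Math.imul(hash, 0x01000193) >>> 0
  }
  return hash.toString(16)
}

// Hypothetical CI check: true when the prompt text no longer matches the
// fingerprint recorded the last time the prompt was changed on purpose.
function promptChangedUnexpectedly(promptText: string, recordedFingerprint: string): boolean {
  return fingerprint(promptText) !== recordedFingerprint
}

const prompt = 'You are an assistant.\n'
const recorded = fingerprint(prompt)
console.log(promptChangedUnexpectedly(prompt, recorded)) // false
// A formatter stripping the trailing newline is the kind of change this is meant to flag.
console.log(promptChangedUnexpectedly(prompt.trimEnd(), recorded))
```

Requiring the recorded fingerprint to be updated in the same commit turns a silent cache invalidation into a deliberate, reviewed change.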
Pattern 14: Empty messages in conversation history
Some frameworks insert empty assistant messages as turn separators or debugging artifacts. An empty string and a message with content are different tokens — and empty messages in the history change the prefix for everything that follows.
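A minimal sketch of stripping empty messages before the history is sent, assuming messages carry either a string or a list of text blocks. The type and function names are illustrative only.

```typescript
interface TextBlock {
  type: 'text'
  text: string
}

interface Message {
  role: 'user' | 'assistant'
  content: string | TextBlock[]
}

// A message is empty when it has no non-whitespace text in any form.
function isEmpty(message: Message): boolean {
  if (typeof message.content === 'string') return message.content.trim() === ''
  return message.content.every(block => block.text.trim() === '')
}

function dropEmptyMessages(history: Message[]): Message[] {
  return history.filter(m => !isEmpty(m))
}

const history: Message[] = [
  { role: 'user', content: 'Hi' },
  { role: 'assistant', content: '' }, // framework-inserted separator
  { role: 'user', content: 'Still there?' },
]
console.log(dropEmptyMessages(history).length) // 2
```

Dropping a separator can leave two consecutive messages with the same role, so check how your client and the API handle that. Apply the filter before the history is stored, not on replay, so stored payloads stay stable.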
The 6 Rules That Prevent These Patterns From Returning
Rule 1: Static front, variable back
System prompt and tool definitions are the static prefix. User messages and tool outputs are the variable suffix. Never violate this ordering. If you need to inject session-specific context, inject it as the first user message, not as part of the system prompt.
Rule 2: Pin everything that can rotate
Model version, API version, beta headers, tool definition versions. If any of these can change without your explicit intent, it will — and when it does, your cache empties. Pin to specific versions and upgrade deliberately.
Rule 3: Normalize before sending
Run every string that goes into your prompt through a normalization function. Trim trailing whitespace, collapse multiple newlines, ensure deterministic JSON serialization. Build this normalization into your prompt construction layer, not as a one-off fix.
Rule 4: Measure, not assume
The Anthropic API returns usage.cache_read_input_tokens and usage.cache_creation_input_tokens in every response. Log these. Calculate your cache hit rate per session type. If a session type drops below 80%, investigate immediately — something changed.
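// Usage mirrors the usage object on an API response; logger is your application's structured logger.
function logCacheStats(usage: Usage, sessionType: string): void {
  const cached = usage.cache_read_input_tokens ?? 0
  const created = usage.cache_creation_input_tokens ?? 0
  // input_tokens counts only uncached input, so add cache reads and writes for the true total
  const total = usage.input_tokens + cached + created
  const hitRate = total > 0 ? cached / total : 0
  if (hitRate < 0.8) {
    logger.warn({ sessionType, hitRate, cached, total }, 'Cache hit rate below threshold')
  }
}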
Rule 5: One canonical system prompt per agent type
Multiple code paths that construct the same logical system prompt in slightly different ways will inevitably diverge. Define one function per agent type that returns the system prompt, and call only that function. Never inline the system prompt text.
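A sketch of what a single source of truth might look like, using a lookup table keyed by agent type as a variant of one function per agent. The agent types, prompt texts, and function names are hypothetical.

```typescript
type AgentType = 'support' | 'codeReview'

interface BuiltRequest {
  system: string
  messages: { role: 'user'; content: string }[]
}

// Hypothetical prompt texts; the point is that each lives in exactly one place.
const SYSTEM_PROMPTS: Record<AgentType, string> = {
  support: 'You are a support assistant.',
  codeReview: 'You are a code reviewer.',
}

// The only function allowed to produce a system prompt.
function getSystemPrompt(agent: AgentType): string {
  return SYSTEM_PROMPTS[agent].trim()
}

// Session-specific context goes into the first user message, never the system prompt.
function buildFirstRequest(agent: AgentType, sessionContext: string, userRequest: string): BuiltRequest {
  return {
    system: getSystemPrompt(agent),
    messages: [{ role: 'user', content: `Context: ${sessionContext}\nUser request: ${userRequest}` }],
  }
}

const monday = buildFirstRequest('support', 'today is 2025-01-06', 'Reset my password')
const tuesday = buildFirstRequest('support', 'today is 2025-01-07', 'Cancel my plan')
console.log(monday.system === tuesday.system) // true
```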
Rule 6: Replay exact API payloads, not reconstructed messages
For conversation history, store the exact messages array you sent, together with each assistant message exactly as the API returned it, and replay it verbatim. Do not reconstruct it from a database schema. Reconstruction introduces format drift over time. The messages array IS the canonical conversation state.
Measuring Your Current State
Before implementing any of these fixes, measure where you actually are. Here is a diagnostic script that reads a session's API logs and calculates cache efficiency:
interface CacheStats {
  hitRate: number
  cacheHits: number
  cacheMisses: number
  estimatedMonthlySavings: number
  worstPatterns: string[]
}

// ApiCallLog is your own log record type; it needs a `usage` field copied from each API response.
// detectCacheBreakPatterns is a separate heuristic that is not shown here.
async function analyzeCacheEfficiency(sessionLogs: ApiCallLog[]): Promise<CacheStats> {
  let totalCacheRead = 0
  let totalCacheCreation = 0
  let totalRegularInput = 0
  for (const log of sessionLogs) {
    totalCacheRead += log.usage.cache_read_input_tokens ?? 0
    totalCacheCreation += log.usage.cache_creation_input_tokens ?? 0
    totalRegularInput += log.usage.input_tokens
  }
  const totalInput = totalCacheRead + totalCacheCreation + totalRegularInput
  const hitRate = totalInput > 0 ? totalCacheRead / totalInput : 0
  // At $3/M for Sonnet input and $0.30/M for cache reads, each cached token
  // saves $2.70/M, i.e. 0.9 × $3/M. The cache-write surcharge is ignored here.
  const savingsInLogs = totalCacheRead * 0.9 * (3 / 1_000_000)
  return {
    hitRate,
    cacheHits: totalCacheRead,
    cacheMisses: totalRegularInput + totalCacheCreation,
    // Assumes sessionLogs covers one day of traffic
    estimatedMonthlySavings: savingsInLogs * 30,
    worstPatterns: detectCacheBreakPatterns(sessionLogs),
  }
}
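The script calls detectCacheBreakPatterns without showing it. What follows is one hypothetical way to fill that gap, not the original implementation: a heuristic that flags calls after the first one in a session that read nothing from the cache. The `Usage` and `ApiCallLog` shapes are assumptions based on the usage fields named earlier.

```typescript
interface Usage {
  input_tokens: number
  cache_read_input_tokens?: number | null
  cache_creation_input_tokens?: number | null
}

interface ApiCallLog {
  usage: Usage
}

// Heuristic sketch: after the first call in a session, a call that reads nothing
// from the cache suggests the prefix changed, the entry expired, or no breakpoint was set.
function detectCacheBreakPatterns(sessionLogs: ApiCallLog[]): string[] {
  const findings: string[] = []
  sessionLogs.forEach((log, index) => {
    if (index === 0) return // the first call is expected to write the cache, not read it
    const read = log.usage.cache_read_input_tokens ?? 0
    const created = log.usage.cache_creation_input_tokens ?? 0
    if (read === 0 && created > 0) {
      findings.push(`call ${index}: full cache rewrite (${created} tokens), prefix may have changed`)
    } else if (read === 0 && created === 0) {
      findings.push(`call ${index}: no cache activity, breakpoint may be missing`)
    }
  })
  return findings
}
```

This cannot say which of the 14 patterns caused a miss. It only narrows down which call to diff against the one before it.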
The Before and After
Here is the actual before/after on my production setup after systematically eliminating these 14 patterns over 3 weeks:
| Metric | Before | After | Change |
| --- | --- | --- | --- |
| Cache hit rate | 43% | 91% | +48pp |
| Monthly Claude bill | $340 | $68 | -80% |
| Average response latency | 4.2s | 2.1s | -50% |
| Cache-break incidents/week | ~12 | 1-2 | -85% |
The latency improvement is a side effect I did not expect. Cache hits skip the full attention computation over cached tokens — for a 20,000-token system prompt, that is a significant fraction of the processing time. At 91% hit rate, most requests are computing attention only over the variable suffix, not the full prompt.
What the High-Traffic Teams Do Differently
After publishing an early version of this analysis on Hacker News (67 comments, 3 engineers from Anthropic customers confirmed similar numbers), a few patterns emerged from teams running Claude at 10M+ tokens per day:
They build a prompt construction layer, not inline prompts. Every prompt goes through a typed constructor that enforces the static-front/variable-back ordering and normalizes whitespace. No developer ever writes a raw string into an API call.
They have a staging cache environment. New system prompt versions are tested in a staging environment that logs cache metrics before they ship to production. A system prompt change that drops cache hit rate below 85% in staging blocks the deploy.
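They treat cache miss spikes as incidents. Not degraded performance incidents, but actual engineering incidents with a postmortem. Cost tracks hit rate directly: on a 10M token/day workload, a 10-point drop in cache hit rate moves about 1M tokens per day from the $0.30/M cache-read price to the $3/M input price, roughly $2.70/day at the Sonnet prices used above, and the figure grows linearly with volume ($270/day at 1B tokens/day). At sufficient volume, that is a real incident.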
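The staging gate described above could be as small as the following sketch. It assumes you can collect the usage objects from staging traffic; the function names are hypothetical and the 0.85 default comes from the threshold mentioned above.

```typescript
interface Usage {
  input_tokens: number
  cache_read_input_tokens?: number | null
  cache_creation_input_tokens?: number | null
}

// Hit rate = cached reads divided by all input tokens (reads + writes + uncached).
function cacheHitRate(usages: Usage[]): number {
  let read = 0
  let total = 0
  for (const u of usages) {
    const r = u.cache_read_input_tokens ?? 0
    read += r
    total += r + (u.cache_creation_input_tokens ?? 0) + u.input_tokens
  }
  return total > 0 ? read / total : 0
}

// Hypothetical deploy gate: false means the deploy should be blocked.
function passesCacheGate(usages: Usage[], threshold = 0.85): boolean {
  return cacheHitRate(usages) >= threshold
}

const healthy: Usage[] = [{ input_tokens: 100, cache_read_input_tokens: 900 }]
const regressed: Usage[] = [{ input_tokens: 500, cache_read_input_tokens: 500 }]
console.log(passesCacheGate(healthy)) // true
console.log(passesCacheGate(regressed)) // false
```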
The 14 patterns in this article are not obscure edge cases. Every production team running Claude at volume hits all of them. The difference between a $68/month bill and a $340/month bill is entirely in how systematically you eliminate them.
The AI API Cost Calculator at wowhow.cloud lets you model the impact of different cache hit rates on your specific usage volume before you commit to the optimization work.