TL;DR

Your Claude Code cache hit rate is probably below 50%. Anthropic's own team treats sub-90% as a SEV. Here are the 14 cache-break patterns and 6 rules that cut my bill 5x.

Three weeks ago, Anthropic's own engineering team quietly published a blog post revealing that their internal tooling treats a prompt cache hit rate below 90% as a severity-2 incident — the same classification as a partial service outage. That number reframed everything I thought I knew about how to run Claude Code in production.

My cache hit rate at the time: 43%. My monthly Claude Code bill: $340. After identifying and fixing 9 of the 14 cache-break patterns in this article, my hit rate is 91% and my bill is $68. Same work volume. Same output quality. The difference is entirely in what I stopped accidentally breaking.

This article covers the 14 patterns that silently destroy cache hit rates — from the obvious (dynamic timestamps in system prompts) to the subtle (tool definition ordering that changes between requests) — and the 6 rules that prevent them from coming back.

How Prompt Caching Actually Works

Claude's prompt caching operates on a prefix-match rule: the API compares the beginning of your current request against requests cached in the last 5 minutes (or up to 1 hour with the cache_control header). If the prefix matches exactly — including every token of every tool definition, every system prompt character, every image byte — the cache hits and you pay 10% of the normal input cost for those tokens.

The critical implication: cache matching is exact prefix matching, not semantic matching. "You are a helpful assistant." and "You are a helpful assistant. " (trailing space) are different cache keys. Changing one character anywhere in the prefix before your variable content invalidates the entire cache for everything after that character.

This is different from how most developers intuitively think about caching. In HTTP caching, you have cache keys and TTLs. In prompt caching, you have a prefix that either matches or does not. The architecture reward is: keep everything static at the front, put everything variable at the end.

The Prefix Match Rule in Practice

Here is the exact ordering that maximizes cache hits:

[Position 1] System prompt — completely static, never changes between requests
[Position 2] Tool definitions — static JSON, same order every time
[Position 3] Conversation history — grows, but prefix still matches up to the new turn
[Position 4] Current user message — variable, always last

Violating this ordering is Cache-Break Pattern #1. It is also the most common pattern I see in production setups. Developers put dynamic content — user preferences, session metadata, current timestamps — in the system prompt because it feels like the right place for context. It is not. Dynamic content in the system prompt breaks every single cache entry in the conversation.

Metric	Before	After	Change
Cache hit rate	43%	91%	+48pp
Monthly Claude bill	$340	$68	-80%
Average response latency	4.2s	2.1s	-50%
Cache-break incidents/week	~12	1-2	-85%

How Prompt Caching Actually Works

The Prefix Match Rule in Practice

Try Our Free Tools

JSON Formatter & Validator

cURL to Code Converter

More from AI Tools & Tutorials

Imagen 3 & 4 Shut Down June 24: Migrate to Gemini Image (2026)

The 14 Cache-Break Patterns

Pattern 1: Dynamic content in the system prompt

Pattern 2: Tool definition ordering that varies between requests

Pattern 3: Request IDs or trace headers in the API payload

Pattern 4: Serializing the same data in a different order

Pattern 5: Including image bytes that vary between sessions

Pattern 6: Different whitespace in system prompts across code paths

Pattern 7: Not using cache_control breakpoints

Pattern 8: Conversation history not preserved between API calls

Pattern 9: Model version not pinned

Pattern 10: Tool output format changing between tool versions

Pattern 11: Parallel requests racing on the same cache key

Pattern 12: Beta headers changing between requests

Pattern 13: System prompt loaded from a file that gets reformatted

Pattern 14: Empty messages in conversation history

The 6 Rules That Prevent These Patterns From Returning

Rule 1: Static front, variable back

Rule 2: Pin everything that can rotate

Rule 3: Normalize before sending

Rule 4: Measure, not assume

Rule 5: One canonical system prompt per agent type

Rule 6: Replay exact API payloads, not reconstructed messages

Measuring Your Current State

The Before and After

What the High-Traffic Teams Do Differently

Sources

Ready to ship faster?

One insight, every Monday. 7am IST. Zero fluff.

Comments · 0

Article stats

Regex Playground

Base64 Encoder / Decoder

UUID Generator

Grok Build Agent Dashboard: Run 8 Parallel Coding Agents From One Screen

Build an MCP Server in TypeScript (2026): Claude Code Guide

Income Tax Calculator India 2025-26: Complete Guide

OpenAI Codex Goal Mode Is Now GA — Multi-Hour Autonomous Coding Sessions

GitHub Copilot Token Billing Week 1: What Developers Are Actually Paying