The most-starred GitHub repository this week is a text file. Andrej Karpathy endorsed a single, well-structured CLAUDE.md file for Claude Code and it picked up 28,000 stars in seven days. That is not a statement about the celebrity effect — it is a statement about how much signal developers were looking for on the question of how to write better prompts for AI coding agents. The five techniques below are the ones that produce consistent, production-grade results in 2026, ordered from highest to lowest practical leverage for most developers.
1. System Prompt Engineering: The 4-Block Layout Strategy
A CLAUDE.md file is just a system prompt that persists on disk. The reason the karpathy-skills repository went viral is that it demonstrated a structural pattern most developers had not seen written down clearly: a well-organized system prompt produces dramatically more consistent agent behavior than an unorganized one of equal length. The 4-Block Layout is the most reliable structure for both CLAUDE.md files and API system prompts:
## 1. SYSTEM INSTRUCTIONS
What the model IS, what it MUST and MUST NOT do.
Constraints, forbidden patterns, required patterns.
This block is high-trust: the model obeys it strictly.
## 2. PROJECT CONTEXT
What the codebase IS, the tech stack, architecture, key files.
What matters about THIS project that changes how you should code.
Conventions: naming, error handling, logging, test patterns.
## 3. DATA INPUTS
What data the model will receive in user messages.
Schema descriptions, example inputs, edge cases to expect.
How to validate or handle malformed input.
## 4. OUTPUT CONTRACTS
Exact format requirements for every output type.
Examples of GOOD output.
Examples of BAD output (what to avoid).
Validation criteria the output must meet.
The 4-block structure works because it matches how frontier models process context. Instructions in the system prompt get more weight than instructions buried in a user message. Separating behavioral rules (Block 1) from project knowledge (Block 2) from input schemas (Block 3) from output specs (Block 4) eliminates the ambiguity that causes models to interpolate incorrectly across categories. A model that sees “always use TypeScript strict mode” in Block 1 treats it differently than the same sentence in Block 2 — the former is a hard rule, the latter is a description of what the codebase already does.
For CLAUDE.md specifically: put behavioral rules at the top (the MUST NOT list), project stack and key file paths in the middle, and output format contracts at the bottom. Every line you add should answer the question “does the model need this to do its job better?” not “would it be nice for the model to know this?” Trim ruthlessly. A 50-line CLAUDE.md that is carefully written outperforms a 500-line one that mixes rules with commentary. See the developer Q&A post for a concrete 4-section CLAUDE.md template you can start from.
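As a minimal sketch, a 4-block CLAUDE.md might read as follows (the stack, file paths, and rules here are illustrative, not from a real project):

```markdown
## 1. SYSTEM INSTRUCTIONS
- You MUST use TypeScript strict mode in all new files.
- You MUST NOT modify files under /src/generated/.
- Always run the test suite before declaring a task complete.

## 2. PROJECT CONTEXT
- Next.js app, PostgreSQL via Prisma.
- Key files: /src/lib/db.ts (data access), /src/app/api/ (route handlers).
- Errors are thrown as typed AppError instances, never returned as raw strings.

## 3. DATA INPUTS
- User messages contain a task description and optional file paths.
- If a referenced file does not exist, say so instead of guessing its contents.

## 4. OUTPUT CONTRACTS
- Code changes as unified diffs, one file per diff block.
- End every task with a one-line summary of what changed.
```

Note how each block stays in its lane: Block 1 is rules, Block 2 is facts about the codebase, Blocks 3 and 4 are input and output contracts.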
2. Chain-of-Thought vs Tree-of-Thoughts: When to Use Each
Chain-of-Thought (CoT) prompting adds a step-by-step reasoning instruction to your prompt. The canonical formulation is “Let’s think step by step” appended to the question. The technique reliably improves performance on tasks that require sequential reasoning: math word problems, code debugging, logical deduction, and multi-condition policy evaluation. The critical constraint: CoT only shows meaningful gains on models with roughly 100 billion parameters or more. On smaller models it can actually hurt performance by introducing premature intermediate conclusions.
Tree-of-Thoughts (ToT) extends CoT by branching the reasoning process. Instead of a single sequential chain, the model generates multiple alternative reasoning paths at each decision point, evaluates them, and pursues the most promising branch. The practical implementation:
Solve this problem by exploring multiple approaches.
For each approach:
1. State the approach in one sentence.
2. Work through the first 3 steps.
3. Evaluate: does this approach look promising? Why or why not?
After exploring at least 3 approaches, select the most promising
one and complete the solution using that approach only.
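The ToT scaffold above is plain text, so it is easy to generate programmatically. A minimal TypeScript sketch (the function name is illustrative):

```typescript
// Wrap a problem statement in the Tree-of-Thoughts scaffold shown above.
function buildTreeOfThoughtsPrompt(problem: string): string {
  return [
    "Solve this problem by exploring multiple approaches.",
    "For each approach:",
    "1. State the approach in one sentence.",
    "2. Work through the first 3 steps.",
    "3. Evaluate: does this approach look promising? Why or why not?",
    "After exploring at least 3 approaches, select the most promising",
    "one and complete the solution using that approach only.",
    "",
    `Problem: ${problem}`,
  ].join("\n");
}

const prompt = buildTreeOfThoughtsPrompt(
  "Design a rate limiter for a multi-tenant API."
);
console.log(prompt);
```

Keeping the scaffold in one place means every ToT call in your pipeline uses identical instructions, which makes results comparable across runs.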
When to use each:
| Technique | Best for | Avoid when |
|---|---|---|
| Chain-of-Thought | Math, debugging, logic, sequential procedures | Models well under ~100B parameters, simple factual lookup |
| Tree-of-Thoughts | Strategic planning, architecture decisions, open-ended problem solving | Tasks with a single correct path, time-sensitive generation |
| Neither | Summarization, translation, classification, simple extraction | N/A (these tasks do not benefit from explicit reasoning steps) |
ToT is slower and more expensive than CoT by 3–5x in token cost, because the model generates and evaluates multiple paths before committing. Use it when the decision has high downstream cost — architecture choices, security design, complex refactoring plans — and avoid it when the reasoning path is straightforward and the primary goal is speed.
3. Agentic Prompts: Goal-Directed Multi-Step Workflows
A chat prompt asks the model a question and expects an answer. An agentic prompt gives the model a goal, a set of tools, success criteria, and an explicit decision policy for when to ask versus when to act. The distinction is not merely semantic: it fundamentally changes how the model allocates its planning budget and how it handles uncertainty.
A production-grade agentic system prompt has five required components:
GOAL: [Single clear objective describing what done looks like]
TOOLS AVAILABLE:
- read_file(path): Returns file contents
- write_file(path, content): Writes content to path
- run_command(cmd): Executes shell command, returns stdout/stderr
- search_codebase(query): Semantic search over project files
SUCCESS CRITERIA:
- [Measurable condition 1: e.g., "All TypeScript files compile with zero errors"]
- [Measurable condition 2: e.g., "All existing tests pass"]
- [Measurable condition 3: e.g., "No new lint warnings introduced"]
FAILURE MODES TO AVOID:
- Do not modify files outside /src/ unless explicitly instructed
- Do not install npm packages without confirmation
- If a test was passing before your changes and now fails, stop and report
DECISION POLICY:
- Proceed autonomously for: file reads, code generation, test runs
- Pause and ask for: deleting files, external API calls, config changes
- Hard stop: any operation that would affect production data
The failure modes and decision policy sections are the most commonly omitted in practice, and they are the most important. A model without explicit failure modes tends to keep trying variations when it encounters an unexpected result rather than surfacing the problem. A model without a decision policy either asks for permission too often (frustrating for autonomous workflows) or acts on everything without checkpoints (dangerous for workflows that touch infrastructure). For agentic prompts in Claude Code, the CLAUDE.md is where the decision policy lives permanently; you supplement it per-task with goal-specific success criteria in the user message.
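Because the five components are structural, you can assemble them from data rather than hand-writing each prompt. A minimal TypeScript sketch, with an illustrative interface (not a real SDK):

```typescript
// The five required components of an agentic system prompt, as data.
interface AgentPromptSpec {
  goal: string;
  tools: string[]; // one "name(args): description" line per tool
  successCriteria: string[];
  failureModes: string[];
  decisionPolicy: string[];
}

// Assemble the components into the layout shown above.
function buildAgentSystemPrompt(spec: AgentPromptSpec): string {
  const section = (title: string, lines: string[]): string =>
    `${title}:\n${lines.map((l) => `- ${l}`).join("\n")}`;
  return [
    `GOAL: ${spec.goal}`,
    section("TOOLS AVAILABLE", spec.tools),
    section("SUCCESS CRITERIA", spec.successCriteria),
    section("FAILURE MODES TO AVOID", spec.failureModes),
    section("DECISION POLICY", spec.decisionPolicy),
  ].join("\n\n");
}

const systemPrompt = buildAgentSystemPrompt({
  goal: "Fix the failing build",
  tools: ["read_file(path): Returns file contents"],
  successCriteria: ["All existing tests pass"],
  failureModes: ["Do not install npm packages without confirmation"],
  decisionPolicy: ["Pause and ask before any config change"],
});
console.log(systemPrompt);
```

Treating the prompt as a typed spec also makes the omissions visible: a missing `failureModes` or `decisionPolicy` field fails at compile time instead of silently shipping an under-specified agent.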
4. Structured Output Prompting: Getting Deterministic JSON Every Time
Getting a model to return valid, schema-conformant JSON reliably — not most of the time, but every time, including edge cases and unusual inputs — requires a specific prompting approach. The three-layer method:
Layer 1: Schema-first instruction. Define the exact output schema before any task description. The model prioritizes schema constraints more reliably when they appear before the instruction rather than after it.
You MUST return ONLY a valid JSON object matching this schema exactly.
Do not include any text before or after the JSON.
Do not include markdown code fences.
Schema:
{
"status": "success" | "error",
"data": {
"items": Array<{ id: string, name: string, score: number }>,
"total": number
} | null,
"error_message": string | null
}
Task: [your actual task here]
Layer 2: Few-shot examples of valid output. Provide 1–2 complete examples of correctly formatted output for representative inputs. Models learn the output structure from examples faster than from schema descriptions alone, particularly for nested structures.
Layer 3: Validator agent step. For production pipelines, add a validation step as a second model call. Pass the first model’s output and the schema to a second call with the prompt “Does this JSON conform to the schema exactly? If not, return the corrected JSON. If yes, return it unchanged.” This catches the 2–5% of outputs that pass regex validation but fail semantic schema constraints.
Common failure patterns to guard against: the model adding a prose explanation before the JSON (“Here is the requested JSON:”), wrapping the JSON in markdown code fences, using single quotes instead of double quotes, and omitting required fields that have no natural value for the given input (instead of returning null or empty string as specified). All four are more reliably prevented by few-shot examples than by rule-based instructions alone.
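A cheap local guard catches most of these patterns before (or instead of) the validator-agent call. A TypeScript sketch, assuming the schema from Layer 1 (the function name is illustrative):

```typescript
// Mirrors the schema from Layer 1 above.
interface ExtractionResult {
  status: "success" | "error";
  data: {
    items: { id: string; name: string; score: number }[];
    total: number;
  } | null;
  error_message: string | null;
}

function parseModelJson(raw: string): ExtractionResult {
  // Drop everything outside the outermost braces: this strips prose like
  // "Here is the requested JSON:" and markdown code fences alike.
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  if (start === -1 || end <= start) throw new Error("No JSON object found");
  // Single quotes instead of double quotes also fail here, loudly.
  const obj = JSON.parse(raw.slice(start, end + 1));
  // Enforce required fields rather than silently accepting omissions.
  for (const field of ["status", "data", "error_message"]) {
    if (!(field in obj)) throw new Error(`Missing required field: ${field}`);
  }
  return obj as ExtractionResult;
}
```

On a thrown error, retry the model call or route to the Layer 3 validator agent; the point is that malformed output fails fast instead of propagating downstream.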
5. Self-Consistency Prompting: Reducing Hallucinations Without a Bigger Model
Self-consistency prompting generates multiple independent reasoning paths for the same question and selects the most consistent answer across them. It was introduced in a Google Research paper and has since become one of the most practically useful techniques for reducing hallucinations on arithmetic, logic, and factual recall tasks without upgrading to a larger or more expensive model.
The implementation: run the same prompt 3–5 times with temperature > 0 (temperature 0.7–1.0 works well) to get diverse reasoning paths, then aggregate the final answers by majority vote. For factual claims, require that the cited source appear in at least 2 of 3 paths before accepting it.
// Pseudocode for self-consistency aggregation.
// `model.generate` and `extractFinalAnswer` stand in for your model client
// and answer-extraction logic.
const paths = await Promise.all(
  Array.from({ length: 5 }, () =>
    model.generate(prompt, { temperature: 0.8 })
  )
)
const answers = paths.map(extractFinalAnswer)

// Majority vote: count identical final answers, pick the most frequent.
const counts = answers.reduce((acc, a) => {
  acc[a] = (acc[a] ?? 0) + 1
  return acc
}, {} as Record<string, number>)
const winner = Object.entries(counts)
  .sort(([, a], [, b]) => b - a)[0][0]
The cost tradeoff is straightforward: 5 calls at temperature 0.8 costs 5x a single call but significantly less than switching to a larger model, and the accuracy gain on arithmetic and multi-step logic tasks is typically larger than the gain from switching models (assuming you are already on a strong frontier model). Self-consistency works best for tasks with verifiable single answers (math, code correctness, factual lookup). It does not help much for open-ended generation tasks like writing or summarization, where there is no ground-truth “correct” answer to vote toward.
One practical optimization: use temperature 0 for a first-pass answer, then use temperature 0.8 for 2 additional paths only if the first answer falls into a category you have identified as high-hallucination-risk (dates, statistics, API method signatures). This gives you self-consistency benefits at lower cost by targeting it where it matters rather than applying it uniformly. For developers building pipelines that use structured output (Technique 4) and self-consistency together, the combination is particularly powerful for data extraction tasks where accuracy matters and the cost of a wrong extraction is higher than the cost of 3–5 model calls.
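That escalation logic is a small amount of code. A self-contained TypeScript sketch, with a mock `model` and illustrative risk heuristics (swap in your real client and your own risk categories):

```typescript
// Mock model client for the sketch; replace with a real API call.
const model = {
  async generate(
    _prompt: string,
    _opts: { temperature: number }
  ): Promise<string> {
    return "Answer: 42";
  },
};
const extractFinalAnswer = (text: string): string => text.trim();

// Illustrative high-hallucination-risk patterns: dates, percentages,
// method-call signatures.
const RISK_PATTERNS = [/\b\d{4}\b/, /\d+(\.\d+)?\s*%/, /\w+\.\w+\(/];
const looksHighRisk = (answer: string): boolean =>
  RISK_PATTERNS.some((re) => re.test(answer));

async function answerWithTargetedConsistency(prompt: string): Promise<string> {
  // Cheap path: one deterministic call at temperature 0.
  const first = extractFinalAnswer(
    await model.generate(prompt, { temperature: 0 })
  );
  if (!looksHighRisk(first)) return first;
  // Escalation path: two extra sampled runs, majority vote over all three.
  const extras = await Promise.all(
    Array.from({ length: 2 }, () => model.generate(prompt, { temperature: 0.8 }))
  );
  const answers = [first, ...extras.map(extractFinalAnswer)];
  const counts = new Map<string, number>();
  for (const a of answers) counts.set(a, (counts.get(a) ?? 0) + 1);
  return [...counts.entries()].sort((x, y) => y[1] - x[1])[0][0];
}
```

The design choice worth copying is that the risk check runs on the *answer*, not the prompt: a question you did not expect to be risky can still produce a date or a statistic that deserves verification.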
Applying These in Order
The correct application order by leverage: start with Technique 1 (fix your system prompt structure) before any other optimization. A well-structured system prompt reduces the baseline error rate across all task types. Then apply Technique 3 (agentic prompt design) if you are building multi-step workflows. Add Technique 4 (structured output) if reliable JSON is a requirement. Reach for Technique 2 (CoT/ToT) on specific tasks that require sequential or branching reasoning. Apply Technique 5 (self-consistency) as a final accuracy booster for high-stakes outputs where the 5x cost is justified by error reduction. The mistake most developers make is reaching for self-consistency or tree-of-thoughts first — before fixing the system prompt, which is the highest-leverage, lowest-cost improvement available. For a practical example of all five techniques applied in a real codebase, see how the WOWHOW tools are built with Claude Code and a CLAUDE.md that encodes Techniques 1 through 4 explicitly.
Written by
Anup Karanjkar
Expert contributor at WOWHOW. Writing about AI, development, automation, and building products that ship.