The repos going viral on GitHub right now — mattpocock’s skills repository at 55K stars, forrestchang’s Andrej Karpathy skills collection at 107K, shanraisshan’s Claude Code best-practices compendium trending past 20K — prove one thing with their combined star counts: everyone knows the skill system matters. The community has figured out that how you configure Claude Code is at least as important as the model itself. What the community has not figured out, and what those viral repos can’t teach you by themselves, is that copying someone else’s CLAUDE.md is like copying their .bashrc. The file that makes Matt’s workflow sing will collide with your project conventions, your team constraints, your deployment pipeline, and your mental model. You’ll paste 300 lines of someone else’s hard-won experience into a single config file, watch your agent become subtly less reliable, and never understand why.
I manage twelve production projects with Claude Code. A marketplace for developer tools with 1,800-plus products. A custom WordPress pipeline. A Cloudflare Worker fleet. A cron-driven SEO research agent that runs overnight. Across those twelve projects, I have tried every configuration approach that the community has proposed: monolithic CLAUDE.md, minimal CLAUDE.md, no CLAUDE.md at all, skills-only, hooks-only, and every combination. What I have converged on is an architecture I call the 3-Layer Agent Harness. It separates concerns in a way that makes each layer independently comprehensible, auditable, and improvable. It also happens to be the architecture that Anthropic’s own Claude Code documentation describes, though the docs don’t give it a name or explain why the separation matters at the scale of real production work.
This post is the explanation. It covers the cognitive budget problem that makes bloated CLAUDE.md files actively harmful, the precise role of each layer in the harness, the places where I have seen developers use the wrong layer for the wrong job, and the audit loop that keeps the whole system healthy over time. By the end, you will have enough to build your own stack from zero — one that fits your projects rather than someone else’s.
The 150-Instruction Budget Nobody Tells You About
HumanLayer, the company that builds human-in-the-loop approval tooling for AI agents, published internal research in early 2026 documenting something that most Claude Code power users have felt but never quantified: LLM instruction compliance degrades significantly around the 150-instruction mark.[1] Below roughly 150 distinct behavioral directives, models follow instructions with high reliability. Above that threshold, compliance drops in a pattern that is not random — the model does not simply follow every instruction slightly less often. Instead, it begins dropping specific categories of instruction: the most recently added ones, the ones that conflict with strong priors from training, and the ones buried deep in a long document.
The practical implication is sharp. A CLAUDE.md file with 300 lines of behavioral instructions is not a file where Claude follows all 300 lines 60% of the time. It is a file where Claude follows the first 100 lines with high reliability and treats the remaining 200 lines as optional context that may or may not influence behavior depending on the specific query, the surrounding conversation, and whatever else happens to be loaded into the context window at the moment. You have written 200 lines of instructions that feel like rules but function like suggestions.
The problem compounds because the system prompt — the invisible prelude that Claude Code injects before your CLAUDE.md — already consumes roughly 50 of that 150-instruction budget. Claude Code’s own behavioral defaults, tool use policies, and safety instructions live in that system prompt. They are not visible to you, and they cannot be overridden by anything in your CLAUDE.md. Every instruction you write in CLAUDE.md is competing for the remaining 100 slots in a budget you cannot directly see.
When I audited my own original CLAUDE.md — a file I had accumulated over eight months of active development, adding rules after every incident — it ran to 670 lines. I had been adding instructions faster than I was auditing them, and the result was a file where many rules I considered critical were functionally invisible to the agent. The rules I had added most recently, to address the most recent incidents, were the least reliably followed. The architecture that fixed this is not a better way to write CLAUDE.md. It is a way to stop using CLAUDE.md for everything that CLAUDE.md is bad at.
Layer 1 — CLAUDE.md (The Constitution)
The right mental model for CLAUDE.md is a constitutional document: it establishes the principles, the non-negotiables, and the structural conventions of your project. It does not enumerate procedures. It does not describe how to perform specific tasks. It does not contain workflow automation. A well-written constitution is short enough that everyone can remember its key provisions, and CLAUDE.md should be the same.
In practice, this means CLAUDE.md should answer four questions. First, what is this project? A brief description of the product, the stack, and the architecture — enough that a competent developer reading it for the first time would understand what they are working on. Second, what are the absolute prohibitions? The things that must never happen regardless of context: never use ‘any’ in TypeScript, never add dark mode (I removed it from my site in April 2026 and do not want it back), never write a cache rule that misorders public and private routes. Third, what are the trust-boundary conditions? The paths and operations where security constraints apply automatically. Fourth, where do the extended rules live? References to the skill files and rule files that contain the detailed guidance, so the agent knows they exist and can load them on demand.
The structural skeleton that has worked for me across twelve projects looks like this:
# Project Name
## What This Is
[Two sentences: product description + primary tech stack]
## Architecture
[Directory map with one-line purpose per path]
## Hard Rules (Never Violate)
[5-8 absolute prohibitions with brief rationale]
## Decision Engine
[When to upgrade to subagent, when to read extended rules]
## Conditional Rules (loaded automatically)
- `.claude/rules/deploy-invariants.md` — deploy pipeline constraints
- `.claude/rules/trust-boundary.md` — security checklist for auth/payment paths
- `.claude/rules/seo-standards.md` — SEO checklist for new content
@AGENTS.md
That skeleton translates to roughly 80-120 lines in practice. I know this because I spent a day reducing my 670-line CLAUDE.md to 108 lines in March 2026, and the agent’s reliability on the hard rules improved measurably. The rules that had been buried on line 400 were suddenly being followed because they were now on line 40. Nothing else changed. Same rules, same agent, same projects. The reduction in file length was the intervention.
The conditional rules pattern deserves specific attention. Claude Code supports referencing external files from CLAUDE.md using the @filename syntax, and the Claude Code documentation describes “Skills” as files that are auto-discovered from SKILL.md frontmatter.[2] What this means in practice is that you can offload detailed guidance to separate files that are only pulled into context when relevant. The deploy invariants file — 22 hard rules each from a real production incident — does not need to be in the main CLAUDE.md. It needs to be referenced from CLAUDE.md and loaded automatically when the agent touches deployment-adjacent files. This is precisely what the conditional loading pattern enables.
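A conditional rule file is plain markdown with no frontmatter and no triggers. As a hedged sketch of the shape (the rules below are illustrative, not my actual 22):

# Deploy Invariants (.claude/rules/deploy-invariants.md)
Loaded when the task touches deploy-adjacent files.

1. Cache rules must match private routes before any public catch-all.
   Rationale: a misordered rule serves authenticated pages from the public cache.
2. Never deploy on a failing typecheck, even for "content-only" changes.
3. Never inline secrets into build scripts; read them from the environment at deploy time.

Each rule carries its rationale inline, because a rule without a why is the first thing a future session will talk itself out of.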
The failure mode I see most often in other developers’ configurations is using CLAUDE.md as a running log of lessons learned. Every time something goes wrong, a new rule gets appended. This is a reasonable instinct — you want to prevent the same mistake from recurring — but it turns CLAUDE.md into an append-only ledger that grows without bound. The right place for a new rule depends on its nature: deploy-time constraints belong in the deploy invariants file, security constraints belong in the trust-boundary file, and only the highest-level project-wide prohibitions belong in CLAUDE.md itself. The discipline of routing rules to the right layer is what prevents the file from becoming an instruction graveyard.
Layer 2 — Skills (The Specialists)
Skills are the second layer, and they are the layer that most developers underuse. Claude Code’s official documentation defines Skills as guidance documents with SKILL.md frontmatter that can be auto-discovered and loaded when triggered by specific user patterns or slash commands.[3] The distinction the docs draw between Skills, Commands, and Subagents is worth internalizing explicitly: Skills provide guidance and are loaded into context when triggered, Commands are manually invoked slash commands that typically invoke a skill, and Subagents are isolated workers that execute tasks in separate contexts. They are three different mechanisms that solve three different problems, and conflating them is a common source of architectural confusion.
A Skill file has a specific structure. The frontmatter describes the trigger conditions — the user intents or patterns that cause the skill to be loaded. The body contains the guidance: the procedures, the patterns, the constraints, the examples that apply to the specific domain the skill covers. When a user’s request matches a trigger condition, the skill is pulled into the active context. When it does not match, the skill is invisible and contributes nothing to the context window. This is the mechanism that solves the instruction budget problem: instead of loading all behavioral guidance for all domains simultaneously, you load only what is relevant to the current task.
The anatomy of a well-structured skill file looks like this:
---
name: blog-writer
description: Write and publish SEO-optimized blog posts for wowhow.cloud
triggers:
- write a blog post
- add a blog post
- new blog
- /new-blog
---
# Blog Writer Skill
## Mode Selection
[Guidance for choosing PILLAR vs STANDARD vs QUICK mode]
## PILLAR Mode (8k-12k words)
[Content structure, citation requirements, prose style rules]
## File Operations
[Exact sequence of file edits: 2026-XX.ts → blog-posts.ts → build → commit]
## Quality Gates
[Word count thresholds, noindex guard rules, SEO checklist]
The guidance inside the skill can be as detailed as needed because it is only loaded when the user is actually writing a blog post. It does not compete with deploy rules, security constraints, or TypeScript conventions for space in the instruction budget. The blog writer skill on my projects runs to about 200 lines and covers mode selection, content structure, citation formats, internal linking requirements, and the exact sequence of file operations. None of that detail would survive in CLAUDE.md — it would be invisible by line 150.
The research I have seen from community power users — including the work documented in Matt Pocock’s skills repository and the patterns in the Karpathy skills collection — converges on 8-12 well-chosen skills as the practical range for a senior developer’s daily workflow. Below eight, you are likely either duplicating guidance across CLAUDE.md and skills (a maintenance problem) or leaving common workflows without specialized guidance. Above twelve, you accumulate skills that trigger infrequently enough that you forget they exist, and the skill descriptions themselves take up context real estate when the agent tries to decide which skill to load.
My current stack for the WOWHOW marketplace has eleven active skills: blog-writer, build-tool, new-product, deploy, seo-dominator, security-hardening, competitive-intel-operative, product-qa-enforcer, analytics-oracle, growth-coordinator, and cloudflare. Each covers a domain that is distinct enough from the others that a separate guidance document is warranted, and common enough in my workflow that I interact with it at least a few times per week. The twelfth skill I maintain is a research-tools skill for deep-dive investigations, which triggers infrequently but when it does, the specialized guidance it provides is irreplaceable.
The mistake I see most often with skills is treating them as documentation repositories rather than behavioral guidance. A skill that says “here is how our blog system works” is a README, not a skill. A skill that says “when writing a blog post, follow this sequence of operations in this order, apply these quality gates, and commit in this exact way” is behavioral guidance that changes what the agent does. The distinction matters because behavioral guidance is what makes the agent reliable on repeated tasks. Documentation informs; behavioral guidance operationalizes.
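The contrast is easier to see side by side. Using the blog pipeline's file names, with illustrative wording, a documentary line reads:

Blog posts are registered in blog-posts.ts and built before deploy.

The behavioral version of the same fact reads:

When writing a blog post: create the post file, register it in blog-posts.ts, run the build, and commit only if the build passes. If the post is under 400 words, add the noindex tag before committing.

The first tells the agent what is true. The second tells it what to do, in what order, with a gate it cannot skip.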
Layer 3 — Hooks (The Guarantees)
Hooks are the third layer, and they are categorically different from the first two. CLAUDE.md and Skills are advisory: they shape what Claude does by providing context and guidance, but they cannot guarantee execution. Claude might follow a CLAUDE.md instruction 95% of the time. A well-written skill might achieve 99% compliance on the tasks it covers. But neither mechanism can guarantee that something happens every time, unconditionally, regardless of what else is in the context. Hooks can.
Claude Code’s hook system allows you to define shell commands that execute deterministically at specific points in the agent’s workflow: before a tool call, after a tool call, on session start, on session end. The hooks run in the host environment with the permissions of the process running Claude Code. They are not subject to the instruction budget. They do not compete with CLAUDE.md guidance. They are not advisory. When a pre-tool-call hook fires, it fires. When a post-tool-call hook fires, it fires. The agent cannot reason its way around them.
The settings.json configuration for hooks lives at the project level and looks like this:
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [
          {
            "type": "command",
            "command": "/Users/you/.claude/hooks/pre-bash-check.sh"
          }
        ]
      }
    ],
    "PostToolUse": [
      {
        "matcher": "Write|Edit",
        "hooks": [
          {
            "type": "command",
            "command": "/Users/you/.claude/hooks/post-write-lint.sh"
          }
        ]
      }
    ],
    "Stop": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "/Users/you/.claude/hooks/session-end-log.sh"
          }
        ]
      }
    ]
  }
}
What should go in hooks? The things that must happen unconditionally. Not the things you want to happen most of the time — those belong in CLAUDE.md or Skills. The things that must happen every time, even when the agent is mid-flow on a complex task and has not been explicitly reminded of the rule. In my stack, I use hooks for four categories: pre-commit validation (running the TypeScript compiler before any file write is finalized), linting on file writes (ESLint runs after every Edit or Write tool call), route security checks (a script that verifies new API route handlers include authorization middleware), and session-end retros (a prompt that asks “what did you learn this session?” and saves the response to the correct memory file).
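As a concrete example, here is a minimal sketch of the pre-bash-check.sh referenced in the settings above, assuming the documented hook contract: a PreToolUse hook receives the pending tool call as JSON on stdin, and exit code 2 blocks the call and returns stderr to the agent.

#!/bin/bash
# pre-bash-check.sh: a minimal sketch that blocks obviously destructive commands.
# Assumes the PreToolUse stdin JSON carries the command at .tool_input.command.
cmd=$(jq -r '.tool_input.command // empty')
case "$cmd" in
  *"rm -rf"*|*FLUSHDB*|*"DROP TABLE"*)
    >&2 echo "BLOCKED by pre-bash-check: command matches a destructive pattern."
    exit 2  # block the tool call; stderr goes back to the agent as feedback
    ;;
esac
exit 0

This deliberately overlaps with the permissions deny list discussed later: the deny list is the first gate, and the hook is where checks live that are too contextual to express as a permission pattern.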
The session retro hook deserves elaboration because it addresses a failure mode that I watched burn months of accumulated agent learning before I fixed it. Claude Code sessions are stateless. Everything the agent learns within a session — the edge cases it encounters, the workarounds it discovers, the patterns it identifies — evaporates at the end of the session unless it is explicitly persisted somewhere. The mechanism for persistence is writing to CLAUDE.md, a skill file, or a memory file. Without a hook that prompts for this at session end, it happens inconsistently. With a hook, it happens every session. The question “What did you learn this session, and where does that learning belong?” at session end has added more durable improvements to my configuration than any other single change.
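A minimal sketch of that retro hook, assuming the documented Stop-hook contract: session JSON arrives on stdin, a stop_hook_active flag marks re-entry, and exit code 2 holds the session open with stderr delivered to the agent as an instruction. The log path is illustrative.

#!/bin/bash
# session-end-retro.sh: a sketch that forces a retro before the session may end.
input=$(cat)
# If we already prompted once this session, allow the stop (avoids a loop).
if [ "$(echo "$input" | jq -r '.stop_hook_active // false')" = "true" ]; then
  exit 0
fi
>&2 echo "Session retro before stopping: 1) What changed that CLAUDE.md or the \
skills do not describe? 2) What did you discover that is undocumented? \
3) Which file should be updated? Append your answers to .claude/logs/retro.md."
exit 2  # hold the session open; the agent answers, then the next stop succeeds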
The other category where hooks have proven indispensable is security validation. I touched on this pattern in my AI code security post from earlier this month: the research is unambiguous that AI agents generate authorization logic less reliably than feature logic. A pre-commit hook that runs a static analysis check for authorization middleware on new route handlers is not a backup for good CLAUDE.md instructions about security — it is a fundamentally different category of guarantee. The instruction might be missed. The hook cannot be missed.
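A sketch of that static check as a PostToolUse hook. The route glob and the requireAuth middleware name are stand-ins for whatever your project actually uses:

#!/bin/bash
# post-write-route-check.sh: a sketch that flags new API routes missing auth.
file=$(jq -r '.tool_input.file_path // empty')
if [[ "$file" == */app/api/*route.ts ]] && ! grep -q "requireAuth" "$file"; then
  >&2 echo "SECURITY: $file defines an API route with no requireAuth() call."
  exit 2  # stderr is surfaced to the agent so it fixes the route immediately
fi
exit 0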
When to Use Subagents Instead
Subagents are often described alongside Skills and Hooks as though they are a fourth configuration option, but they are architecturally distinct in a way that matters for the design decision. Skills and Hooks configure a single agent session. Subagents spawn separate agent sessions, each with their own context window, their own tool access, and their own CLAUDE.md loading. They communicate with the parent session through file I/O, not through shared context.
The defining characteristic of a task that belongs in a subagent is isolation: the task is self-contained enough that it does not need access to the parent session’s accumulated context, and large enough that running it in the parent session would consume context real estate that the parent needs for other work. Generating fifty product descriptions is a subagent task. Auditing twenty API routes for authorization gaps is a subagent task. Writing a single blog post in a session dedicated to that blog post is a subagent task. Fixing a bug in a file you just edited is not — the parent agent already has the context, and spawning a subagent would lose it.
The delegation pattern I use for parallelizable work looks like this:
// Parent agent instruction pattern for subagent delegation
const subagentTask = `
You are a focused subagent. Your single task:
1. Read the list of product slugs from /tmp/products-to-describe.json
2. For each slug, write an SEO-optimized 160-character meta description
3. Write all results to /tmp/product-descriptions-output.json
4. Report: how many completed, any errors
Do not read or modify any files outside those two paths.
Do not run any shell commands except reading/writing those files.
Report DONE when complete.
`;
The explicit constraint list — do not touch files outside these two paths, do not run shell commands — is not paranoia. It is the trust boundary that makes parallel subagent execution safe. Without it, subagents running in parallel can interfere with each other’s file operations or, worse, with the parent session’s active work. The bounded subagent pattern is what allows me to run three or four parallel agents against different product categories simultaneously without the kind of file-write conflicts that would corrupt the output.
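Mechanically, I run these bounded subagents as parallel headless sessions. A sketch using the claude CLI's -p (print mode) flag; the category names and /tmp paths are illustrative:

# One bounded subagent per product category, all in parallel.
for category in cli-tools templates prompt-packs; do
  claude -p "Read /tmp/${category}-slugs.json. For each slug, write an \
SEO-optimized 160-character meta description. Write all results to \
/tmp/${category}-out.json. Do not touch any other files or run any other \
commands. Report DONE when complete." &
done
wait  # block until every subagent has reported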
One pattern I have seen misused is spawning a subagent for every non-trivial task as a way of “keeping the main session clean.” This is context-laundering: it feels organized because each task gets its own session, but it means the agent never accumulates the cross-task context that makes it better at the second task than the first. The right question to ask before spawning a subagent is not “is this task big enough to deserve its own session?” but “does this task genuinely not need the context from the current session?” If the answer is yes, spawn. If the answer is no, keep it in the parent session and let the accumulated context work for you.
The Architecture That Scales Across 12 Projects
The directory layout that has proven stable across twelve projects, from a single-developer marketplace to a multi-agent SEO pipeline, follows a consistent structure. The key insight is that the three layers live in three distinct locations, and the separation is physical as well as conceptual.
project-root/
├── CLAUDE.md                       # Layer 1: The Constitution (80-120 lines)
├── AGENTS.md                       # Extended project context (linked from CLAUDE.md)
└── .claude/
    ├── settings.json               # Layer 3: Hooks configuration
    ├── agents/
    │   └── verification-agent.md   # Subagent specification files
    ├── rules/
    │   ├── deploy-invariants.md    # Conditional rules (loaded by path)
    │   ├── trust-boundary.md       # Conditional rules (loaded by path)
    │   └── seo-standards.md        # Conditional rules (loaded by path)
    └── hooks/
        ├── pre-bash-check.sh       # Layer 3: Hook implementations
        ├── post-write-lint.sh
        └── session-end-retro.sh

~/.claude/                          # User level: lives in the home directory, not the repo
└── skills/                         # Layer 2: Global skills
    ├── blog-writer/
    │   └── SKILL.md
    ├── deploy/
    │   └── SKILL.md
    └── security-hardening/
        └── SKILL.md
The placement of skills at the user level rather than the project level is a deliberate architectural choice. Skills that encode how to write blog posts for wowhow.cloud, how to build new tools, and how to handle the deploy pipeline are transferable across all projects on the same workstation. When I start a new project, I do not rebuild the skill library from scratch — I get the accumulated intelligence of all previous projects as a baseline. Project-specific constraints live in CLAUDE.md and the conditional rules files, but the procedural intelligence lives in skills that travel with the developer.
The hooks configuration in settings.json lives at the project level because hook behavior is inherently project-specific. The pre-commit validation hooks need to know the project’s TypeScript configuration, the linting rules, the specific security patterns to check. A hook that works correctly for my Next.js marketplace would need modification for a pure Node.js CLI project. This is why the hooks themselves (the shell scripts) can live at the user level in ~/.claude/hooks/ as reusable utilities, while the settings.json that configures which hooks fire on which events lives at the project level in .claude/settings.json.
The conditional rules files deserve their own discussion. These are not skills — they do not have SKILL.md frontmatter and they do not trigger on user intent patterns. They are extended documentation that the agent loads when it detects it is working in a specific area. The mechanism for this loading is a reference in CLAUDE.md: when the agent sees a path reference to .claude/rules/deploy-invariants.md in the main CLAUDE.md file, it knows that file exists and can pull it into context when the task is deployment-adjacent. This is distinct from the skill auto-discovery mechanism — it is context-sensitive loading based on file path references, not pattern matching on user intent.
For developers building tools on top of WOWHOW’s infrastructure, the tools registry and associated architecture patterns are documented in the tools-architecture rule file. The same separation principle applies: CLAUDE.md knows the tools architecture rule file exists, and the agent loads it when building new tools. The detailed guidance about tool page structure, registry entries, and sitemap updates never needs to be in CLAUDE.md itself.
The Monthly Skill Audit — What to Keep, What to Kill
A skill system that grows without bounds eventually recreates the bloated CLAUDE.md problem in a different location. Skills that are never triggered still exist as entries in the auto-discovery list, and the process of deciding which skill to load adds latency and consumes context. More importantly, skills that were accurate six months ago may no longer reflect how the project actually works. The deployment pipeline changes. The TypeScript configuration evolves. A tool that was useful for a while gets replaced by a better one. Skills that encode obsolete procedures are not neutral — they are actively harmful because they generate confident wrong behavior.
I run a monthly skill audit that takes about 30 minutes and follows a fixed script. The audit starts with listing every skill in the library and answering three questions for each one: Was this skill triggered at least twice in the past month? Is the procedure it describes still accurate? Does it produce better outcomes than the agent would produce without it? A skill that fails any of those three questions gets either updated or archived. Archived skills go into a ~/.claude/skills/archive/ directory rather than being deleted outright, on the theory that a procedure that was useful once might be useful again.
The audit script I use to check trigger frequency looks like this:
#!/bin/bash
# skill-audit.sh — check skill trigger frequency from session logs
# Run monthly from ~/.claude/

SKILLS_DIR="$HOME/.claude/skills"
LOG_DIR="$HOME/.claude/logs"
AUDIT_PERIOD_DAYS=30

echo "=== SKILL AUDIT: $(date +%Y-%m-%d) ==="
echo ""

for skill_dir in "$SKILLS_DIR"/*/; do
  skill_name=$(basename "$skill_dir")
  # Count how many session logs from the audit window mention this skill
  trigger_count=$(find "$LOG_DIR" -name "*.json" -mtime "-$AUDIT_PERIOD_DAYS" 2>/dev/null \
    | xargs -r grep -l "$skill_name" 2>/dev/null | wc -l)
  if [ "$trigger_count" -lt 2 ]; then
    echo "REVIEW: $skill_name — triggered $trigger_count times (below threshold)"
  else
    echo "OK: $skill_name — triggered $trigger_count times"
  fi
done

echo ""
echo "Skills to review: archive or update if no longer accurate"
The question “does it produce better outcomes than the agent would produce without it?” is harder to answer than trigger frequency, and it requires the session retro data I mentioned in the hooks section. When the session-end hook prompts “what did you learn this session?”, one of the questions it asks is whether any skill produced unexpected or incorrect behavior. Those notes, accumulated over a month, are the raw material for the quality judgment in the skill audit. Without the retro hook, this judgment would be based on impressions. With it, it is based on documented observations.
The monthly audit is also when I update skills to reflect project evolution. When I added the blog noindex guard in April 2026 — blog posts under 400 words get the noindex meta tag to protect AdSense eligibility — that rule needed to be in the blog-writer skill, not in CLAUDE.md. The audit cadence is what ensures that rule actually made it into the skill file and that the skill file still accurately reflects the current threshold. Without a scheduled audit, skill files drift from project reality, and drifted skills are worse than no skills.
Building Your Own Stack From Zero
The most common question I get from developers starting with this architecture is where to begin. The answer is not to build all three layers simultaneously. It is to build them in dependency order, because each layer is easier to build correctly when the previous layer is stable.
Start with CLAUDE.md. Write the project description, the tech stack summary, and the five absolute prohibitions that you know from experience are the most important. Keep it under 100 lines. Resist the urge to add procedures — those belong in skills. Resist the urge to add workflow-specific guidance — that belongs in skills. Resist the urge to add security constraints for specific paths — those belong in conditional rule files. If you find yourself writing something that starts with “when the user asks to...”, stop and make a note that this is a skill candidate. When the CLAUDE.md feels incomplete because it lacks all the procedures you want the agent to know, that discomfort is correct: CLAUDE.md should feel incomplete because it is a constitution, not an encyclopedia.
Add skills in order of workflow frequency. What is the task you do most often with Claude Code? For most developers, it is writing code in a specific style for a specific framework. Build a skill that captures your conventions: the component structure, the state management patterns, the testing approach, the file organization. Make it behavioral (“always do X before Y”) rather than documentary (“here is how X works”). Test it on three or four real tasks and observe where it produces correct behavior and where it diverges from your intent. Iterate on the skill file based on those observations, not based on what you think should work in theory.
The settings.json for your first hooks should be minimal. Start with a single post-write hook that runs your linter. This is the hook that gives you immediate, observable feedback — you can see it firing after every file write, and you can verify that it is catching the issues it is supposed to catch. Once you have one hook working correctly and providing value, add the next. The session-end retro hook is the second one I recommend, because it is what feeds the skill improvement loop. The security validation hooks can come third, once you have a stable baseline and you know what patterns to check for.
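The script behind that first hook can be a few lines. A sketch, assuming the PostToolUse stdin JSON carries the written file's path at .tool_input.file_path and that stderr from a failing hook is shown to the agent:

#!/bin/bash
# post-write-lint.sh: a sketch that lints whatever file the agent just wrote.
file=$(jq -r '.tool_input.file_path // empty')
case "$file" in
  *.ts|*.tsx)
    # Send ESLint's findings to stderr so the agent sees and fixes them.
    if ! npx eslint "$file" >&2; then
      exit 2
    fi
    ;;
esac
exit 0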
The full settings.json for a mature implementation looks like this:
{
  "model": "claude-sonnet-4-6",
  "permissions": {
    "allow": [
      "Bash(npm run lint:*)",
      "Bash(npm run build:*)",
      "Bash(git add:*)",
      "Bash(git commit:*)",
      "Bash(git push:*)"
    ],
    "deny": [
      "Bash(rm -rf:*)",
      "Bash(*FLUSHDB*)",
      "Bash(*DROP TABLE*)"
    ]
  },
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Write|Edit",
        "hooks": [
          {
            "type": "command",
            "command": "~/.claude/hooks/post-write-lint.sh"
          }
        ]
      }
    ],
    "Stop": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "~/.claude/hooks/session-end-retro.sh"
          }
        ]
      }
    ]
  },
  "autocompact": true,
  "autocompactThreshold": 0.75
}
The permissions block is the part of settings.json that most developers overlook in their first implementation. It is an allowlist-and-denylist mechanism that controls which shell commands the agent is allowed to run through the Bash tool. The deny list for destructive operations — rm -rf, Redis FLUSHDB, SQL DROP TABLE — is the kind of guarantee that should not live in CLAUDE.md where it competes for instruction budget. It lives in settings.json where it is enforced by the harness rather than requested from the model.
By the time you have a stable CLAUDE.md, four to six well-tested skills, and three hooks running reliably, you have a harness. Not a configuration file. A system with layers, each doing the job it is designed to do, each independently auditable, each improvable without breaking the others. The difference is not academic. A configuration file is something you paste from someone else’s GitHub. A harness is something you built for your specific projects and that reflects the accumulated intelligence of every session you have run with it.
I have found the best tools for building this kind of system live on WOWHOW’s developer tools catalog — starter templates and workflow kits that give you the structure without the months of trial and error. But regardless of where you start, the architecture matters more than the specific files. Get the layering right, and the rest follows.
The Session Retro That Closes the Loop
Every architectural pattern has a failure mode that shows up not at the start but after six months, when the initial discipline has faded and the system has been patched and extended by someone — you, six months from now — who no longer remembers every decision that went into the original design. For the 3-Layer Agent Harness, that failure mode is skill drift: skills that were accurate when written, gradually diverging from the project reality as the project evolves, producing confident wrong behavior that is hard to diagnose because the skill file still exists and still triggers and still looks reasonable when you read it.
The session retro hook is the mechanism that prevents this. At the end of every session, it fires. It asks three questions: what changed in this session that is different from what the skills and CLAUDE.md describe? What did you discover about the project that was not documented anywhere? What should be updated to reflect what you learned? The answers go into a structured log file. The monthly audit reads those logs. The skill files get updated. The loop closes.
This is the piece that the viral GitHub repos cannot give you, because it is not a file you can copy. It is a discipline that you build into your workflow through a hook that fires every session regardless of whether you remember to run it. The 107K stars on the Karpathy skills collection represent the artifact of someone’s accumulated experience, distilled into a specific configuration at a specific point in time. What the stars cannot capture is the ongoing practice that kept that configuration accurate as the underlying project evolved. That practice is what you are building when you set up the session retro hook, and it is the part of the architecture that, in my experience, makes the biggest difference over time.
The graphify knowledge graph tool I wrote about earlier this year integrates with this loop directly: it ingests session retro outputs and maps them to the codebase graph, so you can see which parts of the project are generating the most learning (and therefore the most likely to have skill files that need updating). It is the infrastructure layer underneath the retro discipline, and it is open-source for exactly this reason.
The three-layer harness is not a finished product. It is a system that gets better every session, driven by a retro hook that ensures learning is never lost, a monthly audit that ensures skills stay accurate, and a CLAUDE.md that stays short enough that its rules are actually followed. Build those three properties into your configuration, and you have something qualitatively different from a CLAUDE.md file, no matter how many stars the file you started from has on GitHub.
References
[1] HumanLayer. LLM Instruction Compliance Research: The 150-Instruction Degradation Threshold. 2026. humanlayer.dev
[2] Anthropic. Claude Code Documentation: Skills. docs.anthropic.com
[3] Anthropic. Claude Code Documentation: Skills vs Commands vs Subagents. docs.anthropic.com
[4] Pocock, Matt. Skills for Real Engineers. GitHub, 2026. github.com/mattpocock/skills
[5] Zhang, Forrest. Andrej Karpathy Skills Collection. GitHub, 2026. github.com/forrestchang/andrej-karpathy-skills
[6] Raisshan, Shan. Claude Code Best Practice: 84 Proven Patterns. GitHub, 2026. github.com/shanraisshan/claude-code-best-practice
Written by Anup Karanjkar, expert contributor at WOWHOW. Writing about AI, development, automation, and building products that ship.