What score does an agent spec need before going to production?

The WOWHOW Spec-Density framework sets 80 as the minimum threshold for a production agent. Scores between 65 and 79 are acceptable for staging or low-stakes automation. Anything below 65 is draft quality and will encounter at least one unhandled failure in the first 48 hours of a real run.
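As a minimal sketch, the three bands above can be written as a small lookup. The function name `readiness_tier` and the tier labels are illustrative assumptions, not part of the framework; only the cut-offs (80 and 65) come from the text.

```python
def readiness_tier(score: int) -> str:
    """Map a 0-100 Spec-Density Score to the readiness band described above.

    The labels are hypothetical names chosen for this sketch.
    """
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    if score >= 80:
        return "production"   # meets the minimum production threshold
    if score >= 65:
        return "staging"      # acceptable for staging or low-stakes automation
    return "draft"            # draft quality


if __name__ == "__main__":
    for s in (92, 79, 65, 64):
        print(s, readiness_tier(s))
```

A gate like this could sit in a spec review checklist or a CI step, so a spec has to reach 80 before the agent build starts.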

Why do agent specs fail even when the model is capable?

Agent failures almost never come from the model. They come from specs that leave runtime decisions unresolved. An agent with no failure modes listed will hallucinate a recovery path when it hits an error.

How is the Failure Modes dimension different from error handling in code?

Code error handling is about exceptions the runtime surfaces. Failure modes in the Spec-Density framework are about conditions the agent encounters at runtime that require a decision: what to do when an API returns 429, when a target directory is missing, or when the token budget runs out mid-run.
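One way to make that distinction concrete is to write each failure mode as data rather than as a try/except. The sketch below uses the three conditions named above. The `FailureMode` class, its field names, and every recovery and escalation string are assumptions made for illustration; the rubric only asks that each entry carry a detection condition, an agent behavior, and an escalation path.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    name: str
    detection: str        # how the agent recognises the condition
    agent_behavior: str   # what the agent does when it detects it
    escalation: str       # what happens if recovery is not possible


# Hypothetical entries: the responses are examples, not framework defaults.
FAILURE_MODES = [
    FailureMode(
        name="rate_limited",
        detection="API responds with HTTP 429",
        agent_behavior="pause, then retry a bounded number of times",
        escalation="stop the run and report the last successful step",
    ),
    FailureMode(
        name="target_directory_missing",
        detection="target directory does not exist before a write",
        agent_behavior="do not create it; skip the affected item",
        escalation="list skipped items in the final report",
    ),
    FailureMode(
        name="token_budget_exhausted",
        detection="remaining token budget falls below a set floor",
        agent_behavior="finish the current item and stop taking new ones",
        escalation="hand off with a summary of unfinished items",
    ),
]


def incomplete_entries(modes):
    """Return the names of failure modes that leave any required field blank."""
    return [
        m.name
        for m in modes
        if not all(s.strip() for s in (m.detection, m.agent_behavior, m.escalation))
    ]
```

Note that three entries would not reach the full band on their own, since the rubric asks for at least four enumerated failure modes.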

Can a 100-point spec still produce a failing agent?

Yes. The Spec-Density Score measures structural completeness, not correctness. A spec can score 100 because it has entries in all six dimensions, but still have constraints that are wrong for the specific task, examples that miss the actual edge cases, or a rollback procedure that is technically uns

The Spec-Density Score: Agent Spec Quality 2026

Q: What is the Spec-Density Score for AI agents?

The Spec-Density Score is a WOWHOW 0–100 rubric for grading an agent spec before any code is written. It scores six dimensions: constraints, acceptance criteria, examples, failure modes, tool scope, and rollback. Each dimension is weighted at 17 points (rollback at 15).

TL;DR

The WOWHOW Spec-Density Score is a 0–100 rubric that grades an AI agent spec across six dimensions: constraints, acceptance criteria, examples, failure modes, tool scope, and rollback. Specs below 50 reliably break in production within the first week. Score your spec before writing a single line of agent code.

A spec that scores below 50 on the WOWHOW Spec-Density Score will produce an agent that fails in production within the first week. The cause is not a bad model; the spec gave the model nothing solid to stand on. Across dozens of agent build cycles, a clear pattern emerges: the gap between specs that work and specs that don’t isn’t about length or clever prompting. It’s about density, meaning how much load-bearing information is packed into each dimension. The Spec-Density Score is a 0–100 rubric across six dimensions that lets you audit a spec before a single line of agent code is written. This post walks through the scoring table, explains why each dimension predicts failure, and shows a worked example on a real-world agent spec draft.

Why Agent Specs Fail

Agents differ from traditional software in one critical way: they make decisions at runtime that you cannot fully anticipate at write time. A function either returns the right value or it throws. An agent can misinterpret an ambiguous instruction and silently do the wrong thing for 200 rows of data before anyone notices.

That asymmetry is what makes spec quality so consequential. When you write a function spec, ambiguity surfaces at compile time or in the first unit test. When you write an agent spec, ambiguity surfaces at 2am when the agent has consumed 40,000 tokens and is confidently doing the wrong thing.

The failure modes cluster into six buckets, which is exactly what the Spec-Density Score measures:

Constraints — what the agent must never do
Acceptance criteria — what “done” looks like
Examples — concrete input/output pairs
Failure modes — explicit enumeration of known bad paths
Tool scope — exactly which tools the agent may call and when
Rollback — how to undo what the agent did

Each dimension is scored 0–17 (except Rollback, which is weighted at 15), giving a maximum of 100 points. A score under 50 is a red flag. Under 35, stop and rewrite before building.

The Spec-Density Score: Scoring Table

The table below is the complete WOWHOW Spec-Density rubric. Each dimension has three bands: 0 (missing or useless), partial (exists but incomplete), and full (ship-ready). The weights reflect empirical importance, not symmetry — Failure Modes and Rollback are the two most commonly skipped dimensions and the two that cause the most expensive production incidents.

Dimension	Weight	0 points — Missing/Useless	Partial (half weight)	Full points — Ship-Ready
1. Constraints	17	No constraints listed, or only “be accurate” / “be safe” platitudes	At least one hard constraint, but no distinction between hard limits and soft preferences	Hard constraints (NEVER do X) and soft constraints (prefer Y) are explicitly separated; each constraint has a reason
2. Acceptance Criteria	17	No success definition, or “the task is complete when it looks right”	Some criteria exist but are subjective (“output should be clean”) or missing edge cases	Criteria are machine-checkable: specific field values, status codes, file paths, record counts, or observable side effects
3. Examples	17	No input/output examples provided	One example exists but it is the happy path only; no edge cases or boundary inputs	At least 3 examples: happy path, one edge case, one near-failure case. Each example has input, expected output, and why
4. Failure Modes	17	No failure modes listed; spec assumes success	One or two failures named (“API might be down”) but no recovery path or detection heuristic	At least 4 failure modes enumerated. Each has: detection condition, agent behavior on detect, escalation path if unrecoverable
5. Tool Scope	17	No tool list; agent infers what tools to use	Tools are named but no per-tool constraints (“use the search tool” with no rate limit, no forbidden queries, no auth context)	Every tool is listed with: allowed operations, forbidden operations, rate/cost guard, and auth/secret context. Unlisted tools are explicitly off-limits
6. Rollback	15	No rollback path; agent actions are irreversible by design or oversight	Rollback is mentioned (“can be undone”) but no concrete steps or pre-condition checks	Rollback is a named procedure: pre-action snapshot, rollback trigger condition, exact rollback steps, and verification that rollback succeeded

Partial scores use half the dimension weight (rounded down). So a Constraints dimension that is partial scores 8, not 0 or 17.
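The arithmetic can be sketched in a few lines. The weights and the round-down rule come from the table and the sentence above; the dictionary keys, band names, and function names are assumptions made for this example.

```python
# Weights from the scoring table: five dimensions at 17, Rollback at 15.
WEIGHTS = {
    "constraints": 17,
    "acceptance_criteria": 17,
    "examples": 17,
    "failure_modes": 17,
    "tool_scope": 17,
    "rollback": 15,
}


def dimension_score(dimension: str, band: str) -> int:
    """Score one dimension given its band: 'missing', 'partial', or 'full'."""
    weight = WEIGHTS[dimension]
    if band == "full":
        return weight
    if band == "partial":
        return weight // 2  # half the weight, rounded down
    if band == "missing":
        return 0
    raise ValueError(f"unknown band: {band!r}")


def spec_density_score(bands: dict) -> int:
    """Sum the six dimension scores; every dimension must be rated."""
    unrated = set(WEIGHTS) - set(bands)
    if unrated:
        raise ValueError(f"unrated dimensions: {sorted(unrated)}")
    return sum(dimension_score(d, bands[d]) for d in WEIGHTS)


if __name__ == "__main__":
    all_partial = {d: "partial" for d in WEIGHTS}
    print(spec_density_score(all_partial))  # 5 * 8 + 7 = 47
```

One consequence of rounding down: a spec that is only partial on every dimension totals 47, which is under the 50-point red-flag line.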

How to Calculate Your Score

Worked Example: A File-Organizing Agent Spec

The Original Draft Spec

The Rewritten Spec

The Dimensions That Kill Agents Most Often

Failure Modes: The Most Skipped Dimension

Tool Scope: The Dimension That Creates Security Incidents

Rollback: The Dimension That Determines Whether Mistakes Are Recoverable

Common Scoring Traps

Trap 1: Mistaking length for density

Trap 2: Accepting “see the code” as a failure mode

Trap 3: Scoring partial when the dimension is actually missing

When to Score: The Spec Review Gate

Using the Score With AI-Assisted Spec Writing

The Score Does Not Replace Judgment

Dimension	Weight	Assessment
Constraints	17	Zero constraints. Nothing says “never delete,” “never touch node_modules,” “never move files with open git changes.”
Acceptance Criteria	17	“Report what changed” is too vague. No definition of “correct folders” or “naming convention.”
Examples	17	No examples whatsoever.
Failure Modes	17	No failure modes. What if the target folder doesn’t exist? What if two files would resolve to the same name after rename?
Tool Scope	17	No tools specified. The agent will infer access to filesystem read/write, git, possibly shell exec.
Rollback	15	No rollback. Once files move, they move.
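The Rollback row is the gap that makes a mistake permanent. As a rough sketch of what a named rollback procedure could look like for a file-moving agent, the code below journals every move before the next one starts, then reverses the journal and verifies the result. The function names, the JSON journal format, and the choice to refuse overwrites are all assumptions made for illustration; the rubric itself only asks for a pre-action snapshot, a trigger condition, exact steps, and verification.

```python
import json
import shutil
from pathlib import Path


def move_with_journal(moves, journal_path):
    """Apply (src, dst) moves, recording each one so it can be undone."""
    journal_file = Path(journal_path)
    journal = []
    journal_file.write_text(json.dumps(journal))  # journal exists before any action
    for src, dst in moves:
        src, dst = Path(src), Path(dst)
        if dst.exists():
            # Refuse to overwrite: an overwritten file could not be restored.
            raise FileExistsError(f"refusing to overwrite {dst}")
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(src), str(dst))
        journal.append([str(src), str(dst)])
        journal_file.write_text(json.dumps(journal))  # persist after every move
    return journal


def rollback(journal_path):
    """Undo journaled moves in reverse order, then verify the result.

    Directories created during the run are left in place.
    """
    journal = json.loads(Path(journal_path).read_text())
    for src, dst in reversed(journal):
        shutil.move(dst, src)
    # Verification: every file is back at its source and gone from its target.
    return all(Path(src).exists() and not Path(dst).exists() for src, dst in journal)
```

Writing the journal after each move, rather than once at the end, means a run that stops halfway can still be rolled back to its starting state.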

Dimension	Weight	Assessment	Score
Constraints	17	Hard and soft constraints explicitly separated, each with a stated reason.	17
Acceptance Criteria	17	Machine-checkable: git diff output, report file existence, naming pattern, file extension pattern.	17
Examples	17	Three examples: happy path, no-op edge case, conflict near-failure. Each has input, expected output, reason.	17
Failure Modes	17	Five failures enumerated. Each has detection condition, agent response, and escalation or exit path.	17
Tool Scope	17	Every tool named. Forbidden operations explicit. Shell exec specifically prohibited.	17
Rollback	15	Named procedure. Snapshot before first action. Rollback script. Verification step. Cleanup condition.	15
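The Tool Scope row lends itself to a default-deny check. The sketch below is one possible shape, assuming a hypothetical policy table: the tool names, operation names, and call limits are invented for the example, and the auth/secret context the rubric also asks for is left out to keep it short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolPolicy:
    allowed: frozenset     # operations the agent may perform
    forbidden: frozenset   # operations named as off-limits
    max_calls: int         # simple rate/cost guard for one run


# Hypothetical scope for a file-organizing agent. Shell exec is not
# listed, so the default-deny rule below prohibits it.
TOOL_SCOPE = {
    "filesystem": ToolPolicy(
        allowed=frozenset({"read", "move", "rename"}),
        forbidden=frozenset({"delete"}),
        max_calls=500,
    ),
    "git": ToolPolicy(
        allowed=frozenset({"status", "diff"}),
        forbidden=frozenset({"commit", "push"}),
        max_calls=50,
    ),
}


def is_call_allowed(tool: str, operation: str, calls_so_far: int = 0) -> bool:
    """Return True only if the call is inside the declared tool scope."""
    policy = TOOL_SCOPE.get(tool)
    if policy is None:
        return False  # unlisted tools are off-limits
    if operation in policy.forbidden:
        return False
    if calls_so_far >= policy.max_calls:
        return False  # rate/cost guard reached
    return operation in policy.allowed
```

The design choice worth noting is the last line: an operation that is neither allowed nor forbidden is still rejected, which matches the rule that anything unlisted is out of scope.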