How to Calculate Your Score
Read your spec once per dimension. Assign 0, partial, or full. Sum the scores. That’s it. The math is deliberately simple because the hard work is the reading, not the arithmetic.
Score interpretation:
- 85–100: Ship-ready. This spec will carry a production agent.
- 65–84: Build-ready with known gaps. Acceptable for a staging agent or a low-stakes automation. Fix gaps before production.
- 50–64: Draft quality. The agent will encounter at least one unhandled failure in the first 48 hours. Rewrite the lowest-scoring dimensions before building.
- 35–49: Prototype only. Use this spec to generate a skeleton, then throw it away and rewrite from scratch with what you learned.
- 0–34: Do not build. This spec will produce an agent that destroys time, money, or data. Stop here.
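The arithmetic and the bands above can be sketched in a few lines. This is a minimal illustration, not an official WOWHOW tool: the dictionary keys and function names are assumptions made for this example, while the weights and band boundaries are the ones stated in this post. What counts as a "partial" score for a dimension is left to the rubric itself.

```python
# Weights from the rubric: five dimensions at 17 points, Rollback at 15.
WEIGHTS = {
    "constraints": 17,
    "acceptance_criteria": 17,
    "examples": 17,
    "failure_modes": 17,
    "tool_scope": 17,
    "rollback": 15,
}

# Lower bound of each band, highest first, taken from the interpretation list.
BANDS = [
    (85, "Ship-ready"),
    (65, "Build-ready with known gaps"),
    (50, "Draft quality"),
    (35, "Prototype only"),
    (0, "Do not build"),
]


def total_score(scores: dict[str, int]) -> int:
    """Sum per-dimension scores after checking each against its weight."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("score every dimension exactly once")
    for name, value in scores.items():
        if not 0 <= value <= WEIGHTS[name]:
            raise ValueError(f"{name} must be between 0 and {WEIGHTS[name]}")
    return sum(scores.values())


def interpret(total: int) -> str:
    """Return the band label for a 0-100 total."""
    for floor, label in BANDS:
        if total >= floor:
            return label
    raise ValueError("total must be non-negative")
```

Calling `interpret(total_score(scores))` on a fully scored spec returns the band label, and the validation step rejects a score sheet that skips a dimension.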
Worked Example: A File-Organizing Agent Spec
Here is a real spec draft (condensed for this post) from a file-organizing agent task — the kind of thing an autonomous coding assistant might tackle when given “clean up this repository.”
The Original Draft Spec
“The agent should scan the repo, identify misplaced files, and move them to the correct folders according to the project conventions. It should also rename files that don’t follow the naming convention. When done, it should report what changed.”
Score this against the rubric:
| Dimension | Weight | Assessment | Score |
| --- | --- | --- | --- |
| Constraints | 17 | Zero constraints. Nothing says “never delete,” “never touch node_modules,” “never move files with open git changes.” | 0 |
| Acceptance Criteria | 17 | “Report what changed” is too vague. No definition of “correct folders” or “naming convention.” | 0 |
| Examples | 17 | No examples whatsoever. | 0 |
| Failure Modes | 17 | No failure modes. What if the target folder doesn’t exist? What if two files would resolve to the same name after rename? | 0 |
| Tool Scope | 17 | No tools specified. The agent will infer access to filesystem read/write, git, possibly shell exec. | 0 |
| Rollback | 15 | No rollback. Once files move, they move. | 0 |
Total: 0/100. This is a three-sentence task description, not a spec. An agent built from this will happily rename your README.md to readme.md, move your .env somewhere “logical,” and report nothing if the run crashes halfway through.
The Rewritten Spec
Here is the same agent spec rewritten using the Spec-Density framework:
Constraints (hard): Never delete any file. Never touch files inside node_modules/, .git/, or any directory whose name starts with a dot. Never move a file that has uncommitted git changes (check via git status --short before each move). Never rename a file if the target name already exists. Reason: the agent cannot know whether a “misplaced” file is load-bearing in its current location.
Constraints (soft): Prefer minimal changes. If a file is within one directory level of its “correct” location, flag it for human review rather than moving it automatically.
Acceptance Criteria: After the run, git diff --stat HEAD shows only file renames and moves, zero content changes. A reorganization-report.md exists at repo root containing: files moved (source → destination), files renamed (old name → new name), files skipped (with reason), and files flagged for human review. All items in src/ follow the kebab-case.ts naming pattern. All items in tests/ end in .test.ts.
Examples:
- Happy path: src/Components/UserCard.tsx → src/components/user-card.tsx. Expected output: move confirmed in report, git shows rename, no content diff.
- Edge case: src/utils/helpers.ts already correctly named. Expected output: no action taken, file not listed in report.
- Near-failure: two files would rename to the same target, e.g., UserCard.tsx and user-card.tsx both in scope. Expected output: both flagged for human review, neither moved, conflict logged with both source paths.
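The three examples can be turned into a small planning step that runs before anything moves. The sketch below is an illustration under assumptions: `to_kebab_case` and `plan_renames` are hypothetical names, the conversion rule (hyphen at each lower-to-upper boundary, extension kept) is one plausible reading of "kebab-case", and only the file name is handled, not the directory half of the happy-path move.

```python
import re
from collections import defaultdict
from pathlib import PurePosixPath


def to_kebab_case(filename: str) -> str:
    """Convert the stem of a file name to kebab-case, keeping the extension."""
    path = PurePosixPath(filename)
    stem, suffix = path.stem, path.suffix
    # Insert a hyphen at lower-to-upper boundaries: "UserCard" -> "User-Card".
    stem = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "-", stem)
    # Treat underscores and whitespace as word separators too.
    stem = re.sub(r"[_\s]+", "-", stem)
    return stem.lower() + suffix


def plan_renames(paths: list[str]) -> tuple[dict[str, str], dict[str, list[str]]]:
    """Split in-scope paths into safe renames and conflicts for human review."""
    by_target = defaultdict(list)
    for source in paths:
        p = PurePosixPath(source)
        target = str(p.with_name(to_kebab_case(p.name)))
        by_target[target].append(source)

    renames, conflicts = {}, {}
    for target, sources in by_target.items():
        if len(sources) > 1:
            # Near-failure example: flag every source, move none of them.
            conflicts[target] = sorted(sources)
        elif sources[0] != target:
            renames[sources[0]] = target
        # Otherwise the file is already correctly named: no action, no report line.
    return renames, conflicts
```

Because the plan is computed up front, the conflict case is caught before the first file moves rather than halfway through the run.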
Failure Modes:
- Target directory does not exist: create it only if the spec explicitly maps to that path; otherwise flag for review.
- Git is dirty (uncommitted changes in the file being considered): skip that file, log it as “skipped — uncommitted changes”.
- Name conflict after rename: flag both files, move neither.
- File is binary (image, woff, pdf): skip unless explicitly in scope for this run.
- Agent token budget exhausted mid-run: write a partial report immediately, mark it as “INCOMPLETE — resumed run needed”, exit cleanly.
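The "git is dirty" entry depends on reading `git status --short` correctly. A minimal sketch, assuming the standard short format (two status columns, a space, then the path, with renames printed as `old -> new`); `dirty_paths` and `git_status_short` are hypothetical helper names. Untracked files are treated as dirty here, which is a conservative choice rather than something the spec states, and quoted paths with escape sequences are not decoded.

```python
import subprocess


def git_status_short(repo_dir: str) -> str:
    """Run the one read-only git command the spec allows and return its output."""
    result = subprocess.run(
        ["git", "status", "--short"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def dirty_paths(status_output: str) -> set[str]:
    """Return every path that `git status --short` reports as changed."""
    dirty = set()
    for line in status_output.splitlines():
        if len(line) < 4:
            continue  # too short to hold "XY path"
        entry = line[3:]  # skip the two status columns and the separator space
        # Renames and copies are printed as "old -> new"; treat both sides as dirty.
        for path in entry.split(" -> "):
            dirty.add(path.strip('"'))
    return dirty
```

Before each move the agent would check `path in dirty_paths(git_status_short(repo_dir))` and, on a match, skip the file and log it with the reason given in the failure mode entry.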
Tool Scope: Filesystem read (any path outside forbidden directories). Filesystem write: move and rename only. The agent creates nothing except reorganization-report.md, .reorganization-snapshot.json, and target directories permitted under Failure Modes, and it deletes nothing. git status --short, read-only. Report writer to reorganization-report.md. Shell exec is off-limits (no npm install, no git commit, no arbitrary commands). The agent does not have permission to push, commit, or stage changes.
Rollback: Before the first file move, create a snapshot file at .reorganization-snapshot.json listing every planned move with source and destination. Rollback trigger: the agent or a human runs node rollback-reorg.js which reads the snapshot and reverses each move in reverse order. Rollback verification: git diff HEAD returns empty after rollback. The snapshot file is deleted only after human confirms the reorg is final.
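The snapshot-then-reverse procedure is short enough to sketch. The spec names a Node script, rollback-reorg.js; the version below is a Python stand-in for illustration only, and the function names and the JSON layout (a list of source/destination records) are assumptions, since the spec fixes only the snapshot file name.

```python
import json
import shutil
from pathlib import Path

SNAPSHOT_NAME = ".reorganization-snapshot.json"  # file name taken from the spec


def write_snapshot(repo: Path, planned_moves: list[tuple[str, str]]) -> Path:
    """Record every planned (source, destination) pair before the first move."""
    snapshot = repo / SNAPSHOT_NAME
    records = [{"source": s, "destination": d} for s, d in planned_moves]
    snapshot.write_text(json.dumps(records, indent=2))
    return snapshot


def apply_moves(repo: Path, planned_moves: list[tuple[str, str]]) -> None:
    """Perform the planned moves in order."""
    for source, destination in planned_moves:
        target = repo / destination
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(repo / source), str(target))


def rollback(repo: Path) -> int:
    """Reverse recorded moves in reverse order; return how many were undone."""
    records = json.loads((repo / SNAPSHOT_NAME).read_text())
    undone = 0
    for record in reversed(records):
        moved = repo / record["destination"]
        if not moved.exists():
            continue  # this move never happened, for example after a partial run
        original = repo / record["source"]
        original.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(moved), str(original))
        undone += 1
    return undone
```

The rollback leaves the snapshot file in place, matching the rule that it is deleted only after a human confirms the reorg is final. It also leaves behind any directories it emptied, which git does not track.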
Score this rewrite:
| Dimension | Weight | Assessment | Score |
| --- | --- | --- | --- |
| Constraints | 17 | Hard and soft constraints explicitly separated, each with a stated reason. | 17 |
| Acceptance Criteria | 17 | Machine-checkable: git diff output, report file existence, naming pattern, file extension pattern. | 17 |
| Examples | 17 | Three examples: happy path, no-op edge case, conflict near-failure. Each has input, expected output, reason. | 17 |
| Failure Modes | 17 | Five failures enumerated. Each has detection condition, agent response, and escalation or exit path. | 17 |
| Tool Scope | 17 | Every tool named. Forbidden operations explicit. Shell exec specifically prohibited. | 17 |
| Rollback | 15 | Named procedure. Snapshot before first action. Rollback script. Verification step. Cleanup condition. | 15 |
Total: 100/100. That does not mean the agent will never fail. It means the spec gives the agent everything it needs to handle failure gracefully instead of silently.
The Dimensions That Kill Agents Most Often
Failure Modes: The Most Skipped Dimension
Specs written by engineers who know the system well tend to skip failure modes because the engineer mentally simulates the happy path and stops there. The agent has no such mental model. It will encounter the failure mode the engineer “obviously” assumed could never happen, and it will have no instruction for what to do next. So it hallucinates a recovery path, which is worse than doing nothing.
The minimum useful failure mode entry has three parts: the detection condition (“when the API returns 429”), the agent behavior (“wait 60 seconds and retry once”), and the escalation path (“if the second attempt also fails, write the failed items to a retry queue and exit with status code 2”). Anything less is a placeholder, not a failure mode.
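To make the three parts concrete, here is the 429 entry written as code. It is a sketch under assumptions: `send` is a hypothetical callable that returns an HTTP status code, the retry queue is modeled as a JSON file, and the run is assumed to continue through the remaining items before exiting with status 2. Other error statuses are deliberately left unhandled because the entry only covers 429.

```python
import json
from typing import Callable, Iterable

EXIT_OK = 0
EXIT_RETRY_QUEUED = 2  # status code named in the example entry


def run_with_failure_mode(
    items: Iterable[str],
    send: Callable[[str], int],      # hypothetical: returns an HTTP status code
    sleep: Callable[[float], None],  # injected so the wait can be faked in tests
    retry_queue_path: str,
    wait_seconds: float = 60,
) -> int:
    """Process items, applying the 429 failure mode to each one."""
    failed = []
    for item in items:
        status = send(item)
        if status == 429:        # detection condition
            sleep(wait_seconds)  # agent behavior: wait, then retry once
            status = send(item)
        if status == 429:        # escalation: the second attempt also failed
            failed.append(item)
    if failed:
        with open(retry_queue_path, "w") as fh:
            json.dump(failed, fh)
        return EXIT_RETRY_QUEUED
    return EXIT_OK
```

Each of the three branches maps to one clause of the written entry, which is a useful test of whether a failure mode is specific enough: if it cannot be turned into a branch, it is still a placeholder.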
Tool Scope: The Dimension That Creates Security Incidents
Agents with undefined tool scope will call the most powerful tool available when a lower-powered one would suffice. An agent allowed to “use the database tool” with no further constraints will write DELETE queries if it decides that’s the cleanest way to solve the problem. Not out of malice — because you told it to solve the problem and it has access to a tool that can do it.
Tool scope entries need four fields: allowed operations (read-only? specific write types?), forbidden operations (never DELETE, never DROP, never shell exec), rate or cost guard (maximum API calls per run, maximum rows returned), and auth context (which credential does this tool use, and does the agent have permission to use it for this specific task or just generally?). A tool that is not listed is not available. That sentence should appear verbatim in every agent spec.
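One way to enforce the four fields is a deny-by-default registry that every tool call passes through. The sketch below is illustrative: `ToolScope`, `ToolDenied`, and `authorize` are hypothetical names, and the cost guard is simplified to a per-run call count.

```python
from dataclasses import dataclass


@dataclass
class ToolScope:
    allowed_ops: frozenset    # allowed operations
    forbidden_ops: frozenset  # forbidden operations, checked even if also allowed
    max_calls: int            # rate or cost guard for a single run
    credential: str           # auth context: which credential this tool uses
    calls_made: int = 0


class ToolDenied(Exception):
    """Raised when a call falls outside the declared scope."""


def authorize(registry: dict, tool: str, op: str) -> ToolScope:
    """Allow a call only if the tool is listed and the operation is in scope."""
    scope = registry.get(tool)
    if scope is None:
        raise ToolDenied(f"{tool}: a tool that is not listed is not available")
    if op in scope.forbidden_ops or op not in scope.allowed_ops:
        raise ToolDenied(f"{tool}: operation {op!r} is not allowed")
    if scope.calls_made >= scope.max_calls:
        raise ToolDenied(f"{tool}: call budget of {scope.max_calls} exhausted")
    scope.calls_made += 1
    return scope
```

A registry with a single read-only database entry would then reject both a DELETE on that tool and any call to an unlisted shell tool, which is the behavior the verbatim sentence is meant to guarantee.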
Rollback: The Dimension That Determines Whether Mistakes Are Recoverable
Most agent specs treat rollback as an afterthought — “we can undo it if needed.” But “we can undo it” is not a rollback plan. A rollback plan names: the pre-action state capture (snapshot, backup, transaction log), the trigger condition that initiates rollback (human command? automated detection of bad state?), the exact steps to reverse the agent’s actions, and a verification test that confirms the system is back to pre-run state.
The classic failure here is building a spec for an agent that sends emails, posts to Slack, or calls an external webhook — and not noting that those actions are irreversible. If your rollback dimension says “N/A — actions are irreversible,” that is a full-score entry. It means you thought about it. It does not mean the spec is bad. What kills you is an agent that sends 400 emails before you notice the bug, and you never wrote down that emails were permanent.
Common Scoring Traps
Trap 1: Mistaking length for density
A spec can be 3,000 words and score 15 on the Spec-Density rubric. Word count is not density. A spec that spends 800 words explaining the business context and 20 words on constraints scores 0 on constraints regardless of total length. The rubric measures what is present, not how much text surrounds it.
Trap 2: Accepting “see the code” as a failure mode
Engineers sometimes write “for error handling, see the existing error handler.” That is not a failure mode in the Spec-Density sense. A failure mode is a condition the agent might encounter, not a code pattern in the surrounding infrastructure. The agent cannot read your error handler. It needs explicit instruction.
Trap 3: Scoring partial when the dimension is actually missing
The partial band exists for dimensions that are started but not finished. If a spec says “the agent should handle errors gracefully,” that is not a partial score on Failure Modes — it is 0, because no failure mode is actually specified. Partial means: at least one concrete entry exists, but not enough entries to cover the known failure space. “Handle errors gracefully” is an aspiration, not an entry.
When to Score: The Spec Review Gate
The Spec-Density Score works best as a gate at a specific point in the agent development workflow: after the spec is drafted but before any code is written. Running the score at this point costs 15 minutes and potentially saves 15 hours of debugging a half-built agent.
Three useful insertion points for teams:
- Pre-build gate: Any agent spec must score 65+ before the first implementation session begins. Below 65, the spec author rewrites the failing dimensions and re-scores.
- Pre-production gate: Any agent going to production must score 80+. The gap between 65 and 80 is usually Rollback and edge-case Failure Modes — the dimensions that matter when the agent is running unattended.
- Post-incident review: After any agent incident, score the spec that produced the failing agent. The dimension with the lowest score is almost always the root cause category. This is not blame assignment — it is a systematic way to identify which spec dimension your team habitually underweights.
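The first two gates reduce to threshold checks, and the post-incident step to finding the lowest-scoring dimensions. A minimal sketch, assuming a team wants this in a review script; the stage names and function names are hypothetical, the thresholds are the ones listed above, and ties on the lowest score are all reported.

```python
# Thresholds taken from the two gates described above.
GATE_MINIMUMS = {
    "pre-build": 65,
    "pre-production": 80,
}


def passes_gate(total: int, stage: str) -> bool:
    """Return True when the total score meets the minimum for the given stage."""
    return total >= GATE_MINIMUMS[stage]


def lowest_dimensions(scores: dict[str, int]) -> list[str]:
    """Return every dimension tied for the lowest raw score, sorted by name."""
    lowest = min(scores.values())
    return sorted(name for name, value in scores.items() if value == lowest)
```

In a post-incident review, the output of `lowest_dimensions` is a starting point for the conversation rather than a verdict, since the lowest score is described as "almost always" the root cause category, not always.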
Using the Score With AI-Assisted Spec Writing
If you use an LLM to help draft agent specs, the Spec-Density Score doubles as a prompt structure. Instead of asking “write me a spec for X,” ask for each dimension explicitly: “List at least 4 failure modes for this agent, including a detection condition, agent response, and escalation path for each.” Then score the output. When a model produces an impressive-sounding spec that still scores 0 on Failure Modes, the score tells you exactly where to push back.
The score also flags exposure to prompt injection. A Constraints dimension that scores 0 means the agent has no hard limits, which means a crafted input can redirect it arbitrarily. A spec that scores 17 on Constraints has explicit NEVER instructions that the agent can treat as inviolable, making injection harder to execute silently.
The Score Does Not Replace Judgment
A 100-point spec is not automatically a good spec. It is a complete spec. The score measures structural completeness — the presence of load-bearing information in each dimension. It does not measure whether the constraints are the right constraints, whether the examples cover the actual edge cases, or whether the rollback procedure is technically sound.
Think of it as a pre-flight checklist, not a quality guarantee. A pilot who completes every checklist item correctly is still responsible for whether the destination is correct. The Spec-Density Score tells you the plane has fuel, not that you should make the trip.
What it eliminates is the class of failures that come from forgetting to think about a dimension entirely — which, based on the agent builds that fail most visibly, is the majority of production incidents.
Before your next agent build: score the spec. If any dimension is below 8 points, stop and fix it. That fifteen-minute audit is the highest-ROI investment in any agent project, and it costs nothing but attention.
People Also Ask
What is the Spec-Density Score for AI agents?
The Spec-Density Score is a WOWHOW 0–100 rubric for grading an agent spec before any code is written. It scores six dimensions: constraints, acceptance criteria, examples, failure modes, tool scope, and rollback. The first five dimensions are weighted at 17 points each and rollback at 15. Specs below 65 are not build-ready.
What score does an agent spec need before going to production?
The WOWHOW Spec-Density framework sets 80 as the minimum threshold for a production agent. Scores between 65 and 79 are acceptable for staging or low-stakes automation. Anything below 65 is draft quality and will encounter at least one unhandled failure in the first 48 hours of a real run.
Why do agent specs fail even when the model is capable?
Agent failures almost never come from the model. They come from specs that leave runtime decisions unresolved. An agent with no failure modes listed will hallucinate a recovery path when it hits an error. An agent with no tool scope will call the most powerful tool available, which creates security and data-integrity incidents. The spec is the primary failure surface, not the model.
How is the Failure Modes dimension different from error handling in code?
Code error handling is about exceptions the runtime surfaces. Failure modes in the Spec-Density framework are about conditions the agent encounters at runtime that require a decision: what to do when an API returns 429, when a target directory is missing, or when the token budget runs out mid-run. Each entry needs a detection condition, an agent response, and an escalation path — not a generic catch block.
Can a 100-point spec still produce a failing agent?
Yes. The Spec-Density Score measures structural completeness, not correctness. A spec can score 100 because it has entries in all six dimensions, but still have constraints that are wrong for the specific task, examples that miss the actual edge cases, or a rollback procedure that is technically unsound. Think of it as a pre-flight checklist, not a guarantee the destination is right.