AI agent evaluation framework: score agents on Tool-Selection Accuracy, Planning Quality, and Rollback-ability. The WOWHOW Triangle gives you one composite score.
Most agent benchmarks measure whether a task completed — not why it failed, what it broke, or whether recovery was possible. The WOWHOW Agent Evaluation Triangle is a scoring framework that captures three axes responsible for the vast majority of real-world agent failures: Tool-Selection Accuracy (TSA), Planning Quality (PQ), and Rollback-ability (RA). Score each axis on a 0–10 scale, combine them with a weighted harmonic mean, and you get a single composite score — the Triangle Score (T-Score) — that predicts reliability far better than simple task-completion rates. This post defines each axis precisely, explains how to measure it, and shows how the scores interact.
Why Task Completion Rates Lie
A task-completion benchmark rewards agents that stumble into the right answer. Consider an agent that calls six wrong tools before landing on the correct one, executes a three-step plan in the wrong order and recovers by luck, and leaves temporary files and locked database rows behind on failure. That agent might still score 80% on a completion benchmark. Against a real codebase or a live API, it causes incidents.
The failure modes that actually hurt production systems cluster into three categories. First, an agent picks the wrong tool entirely (calls a write endpoint when a read endpoint existed, triggers a destructive operation when a safe preview was available). Second, an agent plans a multi-step operation in a sequence that creates dependency failures mid-run. Third, an agent leaves state behind when it fails — partial writes, orphaned resources, ambiguous intermediate results. These are the three axes of the Triangle.
The Three Axes Defined
Axis 1: Tool-Selection Accuracy (TSA)
TSA measures how often an agent selects the minimal, correct tool for each sub-task, on the first attempt. It does not credit an agent for eventually reaching the right tool after false starts. The scoring unit is the individual tool-call decision, not the task.
Formally: given a sequence of N tool-call decisions in a test run, TSA = (number of correct first-pick decisions) / N. “Correct” means (a) the tool is the right category (read vs. write, destructive vs. safe), (b) the parameters are valid, and (c) a safer or cheaper alternative did not exist in the tool registry. That third criterion is the one most evaluation harnesses skip. An agent that uses a full database scan when an indexed lookup was available gets a wrong mark on criterion (c), even if the scan returns correct results.
Scoring 0–10: multiply the raw fraction by 10. A TSA of 8.5 means 85% of first-pick tool decisions were correct and minimal. Below 6.0 in practice you will see significant token waste and cascading retries. Above 9.0 is rare without explicit tool-selection training.
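As a minimal sketch of the TSA definition above, the following Python marks a decision correct only when all three criteria hold, then scales the fraction to 0-10. The `ToolDecision` class and its field names are hypothetical, introduced here for illustration; how each boolean gets judged is left to your harness.

```python
from dataclasses import dataclass


@dataclass
class ToolDecision:
    """One first-pick tool-call decision, judged against the three criteria."""
    right_category: bool  # (a) read vs. write, destructive vs. safe
    valid_params: bool    # (b) parameters are valid
    minimal: bool         # (c) no safer or cheaper tool existed in the registry

    @property
    def correct(self) -> bool:
        return self.right_category and self.valid_params and self.minimal


def tsa_score(decisions: list[ToolDecision]) -> float:
    """Return TSA on the 0-10 scale: (correct first picks / N) * 10."""
    if not decisions:
        raise ValueError("TSA is undefined for a run with no tool-call decisions")
    correct = sum(d.correct for d in decisions)
    return 10.0 * correct / len(decisions)
```

A full scan that returns correct results would be recorded as `ToolDecision(True, True, False)` and count as a wrong first pick.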
Axis 2: Planning Quality (PQ)
PQ measures the structural soundness of the agent’s plan before execution begins. It applies to tasks that require 3 or more sequential or branching steps. For tasks with fewer than 3 steps, PQ defaults to 10 (no planning surface to measure).
PQ is scored on four sub-criteria, each worth 2.5 points:
- Dependency ordering: steps that produce outputs required by later steps come first. An agent that tries to read a file it hasn’t created yet fails this.
- Branch coverage: the plan names at least one failure branch per step that has a non-trivial failure probability. A plan that reads “call API, process result” with no handling for API timeout fails this.
- Scope control: the plan does not include steps that are outside the stated objective. Agents that scope-creep into “while I’m here, I’ll also refactor…” fail this.
- Reversibility tagging: each destructive or write step is explicitly marked as such in the plan, with a stated rollback action. This connects directly to Axis 3.
To measure PQ, inspect the agent’s scratchpad or chain-of-thought before it executes the first tool call. If the agent does not surface a plan before acting, PQ = 0 for any task with 3+ steps — absence of a plan is itself the failure mode.
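The PQ rubric can be sketched as a small function. This is an illustration under assumptions, not a reference implementation: the function and key names are hypothetical, each sub-criterion is given a credit fraction between 0 and 1 (so partial credit is possible), and any task with fewer than 3 steps is treated as having no planning surface.

```python
SUB_CRITERIA = (
    "dependency_ordering",
    "branch_coverage",
    "scope_control",
    "reversibility_tagging",
)
POINTS_PER_CRITERION = 2.5


def pq_score(step_count: int, plan_surfaced: bool, credit: dict[str, float]) -> float:
    """Return PQ on the 0-10 scale from per-criterion credit fractions (0.0 to 1.0)."""
    if step_count < 3:
        return 10.0  # assumption: fewer than 3 steps means nothing to plan
    if not plan_surfaced:
        return 0.0   # no externalized plan on a 3+ step task
    total = 0.0
    for name in SUB_CRITERIA:
        fraction = credit[name]
        if not 0.0 <= fraction <= 1.0:
            raise ValueError(f"credit for {name} must be between 0 and 1")
        total += POINTS_PER_CRITERION * fraction
    return total
```

Passing fractions rather than booleans lets a reviewer record cases such as a plan that tags some, but not all, of its write steps.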
Axis 3: Rollback-ability (RA)
RA measures whether a failed or interrupted agent run leaves the system in a recoverable state. It is the most expensive axis to measure because you must actually cause failures during evaluation — not just observe success runs.
RA is scored per failure injection. For a given test task, inject failures at three points: early (before any writes), mid-run (after at least one write), and late (after all writes, before final confirmation). After each injection, measure:
- Were all partial writes reversed or flagged for manual review?
- Were all locks (file locks, row locks, API rate limit credits) released?
- Is the system state identical to pre-run state, or has the agent left it in an unambiguous intermediate state that a human can resolve in under 5 minutes?
Score: 10 points for full reversal with no manual intervention needed. 7 points for an unambiguous intermediate state with a clear human recovery path documented. 3 points for partial reversal with ambiguous leftovers. 0 for silent failure with no state documentation.
Average across the three injection points to get the RA score. An agent that cleans up perfectly on early failure (10), leaves orphaned but clearly documented resources on mid-run failure (7), and fails silently on late failure (0) scores roughly 5.7 ((10 + 7 + 0) / 3), far below what production use requires.
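The RA rubric and averaging step can be expressed in a few lines. In this sketch the enum and function names are hypothetical; the point values come straight from the rubric above.

```python
from enum import Enum


class Recovery(Enum):
    """Rubric tiers for the state left behind after one failure injection."""
    FULL_REVERSAL = 10            # no manual intervention needed
    DOCUMENTED_INTERMEDIATE = 7   # unambiguous state, clear human recovery path
    AMBIGUOUS_PARTIAL = 3         # partial reversal with ambiguous leftovers
    SILENT_FAILURE = 0            # no state documentation at all


def ra_score(early: Recovery, mid: Recovery, late: Recovery) -> float:
    """Average the rubric points across the three injection points."""
    return (early.value + mid.value + late.value) / 3
```

Keeping the tiers in an enum makes it harder for a reviewer to record a score the rubric does not define.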
The Triangle Score (T-Score)
The three axes are not equally weighted in the composite. Tool-Selection Accuracy is the highest-frequency failure surface and carries the most weight. Planning Quality matters most for complex, long-horizon tasks. Rollback-ability is non-negotiable in destructive contexts but less critical for read-only workloads.
The WOWHOW T-Score uses a weighted harmonic mean:
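T-Score = (w_TSA + w_PQ + w_RA) / ( (w_TSA / TSA) + (w_PQ / PQ) + (w_RA / RA) )
Default weights (these sum to 3, so the numerator is 3 by default; dividing by the weight sum keeps the T-Score on the 0–10 scale when the weights are adjusted):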
w_TSA = 1.2
w_PQ = 1.0
w_RA = 0.8
The harmonic mean punishes low outliers harder than an arithmetic mean would. An agent with TSA=9, PQ=9, RA=2 gets a T-Score of roughly 4.7 (3 / (0.133 + 0.111 + 0.400)), which is not a passing grade despite two strong axes, because the harmonic mean amplifies the RA failure. This is intentional: a single catastrophic rollback failure disqualifies an agent for production write workloads regardless of how well it selects tools or plans.
The weights are adjustable for workload context. A read-only research agent can drop w_RA to 0.3 (low stakes on state cleanup). A financial transaction agent should raise w_RA to 2.0 (irreversible operations everywhere). The T-Score formula is the same; only the weights change.
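A minimal sketch of the composite follows, with two choices that are assumptions of this example rather than part of the framework's text. First, it normalizes by the sum of the weights (which equals 3 under the defaults), so adjusted weights still produce a score between 0 and 10. Second, it returns 0 when any axis is 0, which is the limit of the harmonic mean and avoids a division by zero when, for example, PQ = 0.

```python
def t_score(
    tsa: float,
    pq: float,
    ra: float,
    w_tsa: float = 1.2,
    w_pq: float = 1.0,
    w_ra: float = 0.8,
) -> float:
    """Weighted harmonic mean of the three axis scores, each on a 0-10 scale."""
    axes = ((tsa, w_tsa), (pq, w_pq), (ra, w_ra))
    for score, weight in axes:
        if not 0.0 <= score <= 10.0:
            raise ValueError("axis scores must be between 0 and 10")
        if weight <= 0.0:
            raise ValueError("weights must be positive")
    if any(score == 0.0 for score, _ in axes):
        return 0.0  # assumption: a zero axis zeroes the composite
    total_weight = sum(weight for _, weight in axes)
    return total_weight / sum(weight / score for score, weight in axes)
```

With equal axis scores the function returns that same score (up to floating-point rounding) whatever the weights, which is a quick sanity check when you change them.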
The Full Scoring Table
| T-Score Range | Label | Typical Failure Mode | Recommended Use |
|---|---|---|---|
| 9.0 – 10.0 | Production-Ready | Rare; usually edge-case tool configs | Autonomous production runs, minimal human review |
| 7.0 – 8.9 | Supervised Production | Occasional wrong-tool retries or thin rollback docs | Production with human-in-loop on destructive steps |
| 5.0 – 6.9 | Staging-Only | Planning gaps or inconsistent rollback on mid-run failure | Non-production environments; read-heavy tasks only |
| 3.0 – 4.9 | Prototype | Frequent tool retry loops, missing branch coverage | Demos, sandboxed experiments; never write operations |
| 0.0 – 2.9 | Unsafe | Silent failures, no rollback, destructive scope creep | Do not deploy; return to training |
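For pipelines that route agents automatically, the table can be reduced to a lookup. This sketch assumes each band starts at its lower bound, so a score such as 8.95 falls in Supervised Production; the function name is hypothetical.

```python
def t_score_label(score: float) -> str:
    """Map a T-Score to the label from the scoring table."""
    if not 0.0 <= score <= 10.0:
        raise ValueError("T-Score must be between 0 and 10")
    if score >= 9.0:
        return "Production-Ready"
    if score >= 7.0:
        return "Supervised Production"
    if score >= 5.0:
        return "Staging-Only"
    if score >= 3.0:
        return "Prototype"
    return "Unsafe"
```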
How to Measure Each Axis in Practice
Measuring TSA
You need a test suite of tasks with ground-truth tool sequences. Build it by logging human experts performing the same tasks and recording which tool they called first, with what parameters. For each agent run, extract the tool-call trace from your observability layer (LangSmith, Honeycomb, or a plain JSONL trace file). Compare the agent’s first tool-call per decision point against the expert trace.
One practical shortcut: if your tool registry has explicit “safe” and “destructive” labels, check whether the agent ever calls a destructive tool when a safe equivalent existed. This catches the highest-severity TSA failures without a full expert trace.
Minimum test suite size: 50 decision points across at least 10 distinct task types. Below that, TSA estimates have too much variance to be actionable.
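The trace comparison and the safe/destructive shortcut might look like the sketch below. The JSONL schema (one object per tool call with `decision_id` and `tool` keys) is a hypothetical format chosen for illustration, not the export format of any particular observability tool, and the comparison checks tool names only.

```python
import json


def first_picks(trace_jsonl: str) -> dict[str, dict]:
    """Map each decision_id to the first tool call recorded for it."""
    picks: dict[str, dict] = {}
    for line in trace_jsonl.splitlines():
        if line.strip():
            call = json.loads(line)
            picks.setdefault(call["decision_id"], call)  # later retries are ignored
    return picks


def compare_to_expert(trace_jsonl: str, expert: dict[str, str]) -> tuple[int, int]:
    """Return (correct first picks, total decision points) against an expert trace."""
    picks = first_picks(trace_jsonl)
    correct = sum(
        1
        for decision_id, tool in expert.items()
        if picks.get(decision_id, {}).get("tool") == tool
    )
    return correct, len(expert)


def destructive_when_safe_existed(
    trace_jsonl: str, safe_equivalent: dict[str, str]
) -> list[dict]:
    """Flag every call to a destructive tool that has a registered safe equivalent."""
    calls = [json.loads(line) for line in trace_jsonl.splitlines() if line.strip()]
    return [call for call in calls if call["tool"] in safe_equivalent]
```

The flagged calls are candidates for review, not automatic failures: whether the safe equivalent would have sufficed for that sub-task still needs a human or an expert trace to confirm.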
Measuring PQ
PQ requires the agent to externalize its plan. If your agent does not produce a scratchpad or structured plan step, you cannot score PQ — which means PQ = 0 for that agent by definition. This is an important design signal: agents that plan silently are unauditable.
For agents that do externalize plans, build a scoring rubric against the four sub-criteria. One person can score PQ for a test suite in about 2 hours if the tasks are well-defined. The bottleneck is ambiguous tasks — where “correct ordering” depends on interpretation. Fix this by writing tasks with explicit pre/post-conditions rather than vague goals.
Automate dependency ordering and scope control checks with static analysis: extract step inputs and outputs, build a dependency graph, and check topological sort order. This takes PQ scoring from 2 hours to 10 minutes for the mechanical parts, leaving only branch coverage and reversibility tagging for human review.
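One way to sketch the mechanical checks: represent each plan step as a dict with `name`, `inputs`, and `outputs` (a hypothetical structure you would extract from the scratchpad), then verify that the plan's order is a valid topological order and that every step contributes to the objective.

```python
def dependency_violations(steps: list[dict], initial: set[str]) -> list[tuple]:
    """Return (index, step name, missing inputs) for steps that run too early."""
    available = set(initial)
    violations = []
    for index, step in enumerate(steps):
        missing = set(step["inputs"]) - available
        if missing:
            violations.append((index, step["name"], sorted(missing)))
        available |= set(step["outputs"])
    return violations


def out_of_scope_steps(steps: list[dict], objective_outputs: set[str]) -> list[str]:
    """Return steps whose outputs are neither requested nor consumed by another step."""
    consumed = {name for step in steps for name in step["inputs"]}
    needed = set(objective_outputs) | consumed
    return [step["name"] for step in steps if not set(step["outputs"]) & needed]
```

An empty list from both functions means the plan passes the two automatable sub-criteria. The scope check is deliberately simple and will miss a chain of out-of-scope steps that feed each other, so treat it as a first filter.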
Measuring RA
RA requires a real test environment you can dirty and reset. Use Docker or a database snapshot. The injection protocol:
- Early injection: kill the agent process (SIGKILL, not SIGTERM) after the first tool call returns and before any write has run. Check system state.
- Mid-run injection: kill after 50% of planned steps complete. For non-deterministic plans, kill after the first write operation completes instead.
- Late injection: kill after all writes complete but before the agent sends its final confirmation message. This is the most dangerous injection point because all damage is done but no cleanup has run.
After each injection, snapshot the diff between current state and pre-run state. Score per the RA rubric above. Reset to snapshot before the next injection.
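For file-based tasks, the state diff can be sketched with content hashes. This example covers files under a single directory only; database rows, locks, and remote resources need their own snapshot mechanism. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def snapshot(root: str) -> dict[str, str]:
    """Map each file path (relative to root) to the SHA-256 of its contents."""
    base = Path(root)
    return {
        str(path.relative_to(base)): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(base.rglob("*"))
        if path.is_file()
    }


def diff(before: dict[str, str], after: dict[str, str]) -> dict[str, list[str]]:
    """List files added, removed, or changed between two snapshots."""
    return {
        "added": sorted(after.keys() - before.keys()),
        "removed": sorted(before.keys() - after.keys()),
        "changed": sorted(
            path for path in before.keys() & after.keys() if before[path] != after[path]
        ),
    }
```

Take one snapshot before the run and one after each injection; a diff with three empty lists corresponds to the full-reversal tier of the rubric, at least for file state.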
The late injection test in particular reveals agents that perform cleanup as a final step rather than as a guard registered at plan-creation time. A well-designed agent registers rollback handlers when it opens resources, not when it closes them. Agents that only clean up in their final message score 0 on late injection.
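Because SIGKILL cannot be caught, an in-process cleanup handler never runs under this protocol. One possible design, sketched here as an assumption rather than the framework's prescribed method, is a durable journal: the agent records the undo action before it creates each file, and a separate recovery step replays the journal after a kill. The class and function names are hypothetical, and the sketch handles only newly created files.

```python
import json
import os
from pathlib import Path


class RollbackJournal:
    """Append-only record of undo actions, written before each write happens."""

    def __init__(self, journal_path: str) -> None:
        self.journal_path = Path(journal_path)

    def register_new_file(self, target: str) -> None:
        """Record that `target` must be deleted if the run does not commit."""
        with open(self.journal_path, "a", encoding="utf-8") as journal:
            journal.write(json.dumps({"undo": "delete", "path": str(target)}) + "\n")
            journal.flush()
            os.fsync(journal.fileno())  # entry is on disk before the file is created

    def commit(self) -> None:
        """The run finished and confirmed, so its outputs are kept."""
        self.journal_path.unlink(missing_ok=True)


def recover(journal_path: str) -> list[str]:
    """Run from outside the agent after a kill: undo every journaled write."""
    journal = Path(journal_path)
    if not journal.exists():
        return []
    removed = []
    for line in journal.read_text(encoding="utf-8").splitlines():
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # torn final line from a kill mid-append
        target = Path(entry["path"])
        if entry["undo"] == "delete" and target.exists():
            target.unlink()
            removed.append(str(target))
    journal.unlink()
    return removed
```

An agent built this way has a recovery path at every injection point, because the rollback information exists before the damage does.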
A Worked Example: File-Processing Agent
Consider an agent tasked with: “Parse all CSV files in /data/uploads/, validate each row, write passing rows to /data/clean/, write a summary report to /data/reports/.”
TSA measurement: The agent is observed calling a full directory listing (expensive) before checking if the uploads directory is empty (cheap). It also calls a write-to-file tool before verifying disk space. Both are wrong-tool-for-minimality decisions. Out of 12 decision points, 9 are correct on first pick. TSA = 9/12 = 0.75, or 7.5 on the 0–10 scale.
PQ measurement: The agent produces a plan with 5 steps. Dependency ordering: correct (read before write). Branch coverage: missing, with no handling for malformed CSV headers. Scope control: the agent adds an unrequested step to archive old files in /data/uploads/. Reversibility tagging: only the final write step is marked as reversible, so it earns half credit. Sub-criteria scores: 2.5 + 0 + 0 + 1.25 = 3.75. The sub-criteria already sum on the 0–10 scale, so PQ = 3.75.
RA measurement: Early injection (after first CSV read): no writes occurred, system state clean. Score: 10. Mid-run injection (after 3 of 8 CSVs written to /data/clean/): the agent leaves partial /data/clean/ files with no manifest. A human can identify and delete them but must know what to look for. Score: 7. Late injection (all clean-file writes done, report not yet written): /data/clean/ is fully populated, no report exists, and there is no log of what ran. State is recoverable but opaque, which falls between the rubric’s 7-point and 3-point tiers. Score: 5. RA = (10 + 7 + 5) / 3 = 7.33.
T-Score calculation (default weights w_TSA=1.2, w_PQ=1.0, w_RA=0.8):
T-Score = 3 / ( (1.2/7.5) + (1.0/3.75) + (0.8/7.33) )
= 3 / ( 0.160 + 0.267 + 0.109 )
= 3 / 0.536
= 5.60
Label: Staging-Only. The agent is unsuitable for production file-processing because Planning Quality is critically low — it would scope-creep into archiving files it was not asked to touch, and it has no handling for malformed input. The RA mid-run score of 7 is acceptable but not excellent. TSA is borderline.
The fix is targeted: improve the plan’s branch coverage and remove scope-creep steps. That moves PQ from 3.75 toward 7.5, which would push the T-Score to approximately 7.5 (3 / (0.160 + 0.133 + 0.109) ≈ 7.45), well inside Supervised Production territory.
Calibrating Weights for Your Workload
The default weights (TSA=1.2, PQ=1.0, RA=0.8) suit general-purpose coding and data-processing agents. Different workloads shift the balance:
| Workload Type | w_TSA | w_PQ | w_RA | Rationale |
|---|---|---|---|---|
| Read-only research / summarization | 1.5 | 0.8 | 0.3 | No state damage possible; tool efficiency matters most |
| Database migration / ETL | 1.0 | 1.2 | 2.0 | Partial writes are catastrophic; planning must be airtight |
| API orchestration (3rd-party calls) | 1.5 | 1.0 | 1.5 | Wrong tool = money spent; failed calls may be non-refundable |
| Code generation / refactoring | 1.0 | 1.5 | 0.8 | Long-horizon planning quality dominates; git handles rollback |
| Infrastructure provisioning | 1.1 | 1.3 | 2.0 | Abandoned cloud resources cost money indefinitely |
Never set any weight to 0. Every axis has some relevance in every workload. Setting w_RA = 0 for a read-only agent might seem reasonable, but agents that score 0 on RA often have systemic issues, such as missing cleanup code or no error handling, that will eventually bite you when the agent is repurposed or extended.
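To see how calibration changes the verdict, the sketch below scores the same axis results under different rows of the table. The preset keys are hypothetical names for the table's workload types, and the function divides by the sum of the weights, an assumption of this example that keeps every preset on the 0-10 scale.

```python
# (w_TSA, w_PQ, w_RA) per workload, taken from the table above
WORKLOAD_WEIGHTS = {
    "default": (1.2, 1.0, 0.8),
    "read_only_research": (1.5, 0.8, 0.3),
    "database_migration_etl": (1.0, 1.2, 2.0),
    "api_orchestration": (1.5, 1.0, 1.5),
    "code_generation": (1.0, 1.5, 0.8),
    "infrastructure_provisioning": (1.1, 1.3, 2.0),
}


def workload_t_score(scores: tuple[float, float, float], workload: str = "default") -> float:
    """Weighted harmonic mean of (TSA, PQ, RA) using a workload's weight preset."""
    weights = WORKLOAD_WEIGHTS[workload]
    if any(score == 0 for score in scores):
        return 0.0  # assumption: a zero axis zeroes the composite
    return sum(weights) / sum(w / s for w, s in zip(weights, scores))
```

For an agent with strong tool selection and planning but weak rollback, such as (9, 9, 2), the read-only preset scores it higher than the default weights do, and the migration preset scores it lower, which matches the rationale column.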
What the Triangle Reveals That Benchmarks Hide
Standard benchmarks like SWE-bench and GAIA score task resolution. They are valuable but incomplete for deployment decisions. The Triangle adds three dimensions that resolution scores obscure:
First, TSA exposes token economics. An agent with TSA 6.0 gets roughly 40% of its first-pick tool decisions wrong, and each wrong pick costs at least one retry. As an illustration, a task that costs $0.12 on a TSA-9.0 agent can cost around $0.20 on a TSA-6.0 agent. At scale, that gap compounds. TSA is a cost predictor, not just a quality signal.
Second, PQ predicts mid-run incident rate. Our internal observation across test suites is that agents with PQ below 5.0 encounter at least one mid-run dependency failure on tasks with 5 or more steps. The causal path is straightforward: missing branch coverage means the agent has no fallback when a predictable error occurs, so it either halts or improvises — and improvisation mid-run usually makes state worse.
Third, RA separates dev-environment from production agents. In development, failures are cheap. In production, an agent that leaves orphaned database rows, open file handles, or partially-committed transactions causes incidents that take engineering hours to resolve. RA below 7.0 means you are running without a safety net on every write operation.
Integrating the Triangle Into Your Evaluation Pipeline
The Triangle is not a replacement for task-completion benchmarks — it is a complement. Run your existing benchmark suite for baseline capability. Then run Triangle evaluation on the subset of task types you actually deploy in production.
A practical sequence: build a 30-task test suite covering your top production use cases. For each task, log the full tool-call trace. Score TSA from the trace. Score PQ from the pre-execution scratchpad. Run the RA injection protocol on a disposable environment. Calculate T-Score. Label each agent by T-Score range before routing it to production, staging, or back to training.
Rerun Triangle evaluation after every significant model or prompt update. T-Scores shift after fine-tuning — sometimes in unexpected directions. An update that improves TSA often degrades PQ, because a model that “just acts” more aggressively on the first right-looking tool spends less time planning. The Triangle makes this trade-off visible before you discover it in production.
If you want to test agents against real-world tooling before deploying them, the WOWHOW tools collection includes several task-complexity calculators and structured-output validators useful for building evaluation harnesses. The full catalog has templates for building evaluation logging pipelines in TypeScript.
The Failure Mode This Framework Was Designed Around
The Triangle originated from a specific class of failure we observed repeatedly: an agent completes the stated task with no errors reported, but leaves the system in a state that breaks the next agent in the pipeline. The first agent’s TSA and PQ were fine. Its RA was catastrophic — it left lock files, uncommitted transactions, and partial caches that the second agent read as ground truth.
Task-completion metrics showed 100% success for the first agent. The incident looked like the second agent’s fault. It wasn’t. The failure originated in RA=0 on a late-run injection point that nobody had tested.
The Triangle forces RA into scope before deployment, not after the first production incident. That is the only time it is cheap to test — before you need to.
If you are building or evaluating agents for production write workloads, the first thing to score is RA on a late-run injection. If that number is below 7.0, nothing else in the evaluation matters until it is fixed. Explore WOWHOW Pro Vault for structured agent evaluation templates and harness starters you can adapt for your stack.
Written by
WOWHOW
The WOWHOW team brings 14+ years of production engineering experience. Every tool and product in the catalog is personally built, tested, and curated.