Most AI agent failures are invisible until they're catastrophic. This pack gives you the WOWHOW 14-Mode Agent Failure Taxonomy — a structured observability framework built for engineers shipping production agents today. You get a 14-category log event schema, a pre-wired OpenTelemetry hook configuration, a CSV triage scorecard, and a step-by-step runbook for diagnosing the exact failure class within five minutes of an alert. Designed for backend engineers, ML platform teams, and AI infra leads running LLM-powered agents on any orchestration stack (LangChain, LlamaIndex, CrewAI, custom tool loops, or raw API calls).
What You Get
- WOWHOW 14-Mode Failure Taxonomy (README.md) — The full framework: each failure mode defined, its observable signal, a worked example, and the correct instrumentation response. Covers everything from silent-tool-skip to compounding-context-drift.
- Log Schema Spec (log-schema.json) — A JSON Schema file defining the canonical log event structure for all 14 failure modes. Drop into any structured-logging pipeline (Pino, Winston, structlog, Cloud Logging). Each field is documented with purpose and cardinality.
- OpenTelemetry Hook Config (otel-hooks.yaml) — Ready-to-use span attribute rules, sampling hints, and alert thresholds for each failure mode. Works with any OTel-compatible collector (Grafana Tempo, Jaeger, Datadog, Honeycomb). Paste in, adjust your exporter endpoint, done.
- Triage Scorecard (triage-scorecard.csv) — A 14-row weighted CSV: failure mode, severity score (1–5), mean-time-to-detect estimate, first log field to check, and recommended escalation path. Import into any spreadsheet or incident tracker as your on-call reference card.
- Triage Runbook (triage-runbook.md) — A decision-tree runbook. Starts from the symptom your on-call sees (wrong output, infinite loop, tool not called, context too long) and walks to exact failure mode in under 10 steps. Includes copy-paste log queries for Elasticsearch, Loki, and BigQuery.
How to Use
- Read README.md first. Internalize the 14 failure modes and which one maps to the last incident your agent had.
- Copy log-schema.json into your agent's logging module. Instrument your tool-call wrappers, LLM call sites, and memory read/write points to emit events conforming to this schema.
- Import otel-hooks.yaml into your OTel collector config. Update the
exporter.endpoint field and the service.name attribute. Restart the collector.
- Open triage-scorecard.csv in your spreadsheet tool. Sort by severity or by the failure modes most common in your system. Pin it to your incident channel.
- When an alert fires, open triage-runbook.md and follow the decision tree from the symptom column. You should have a failure-mode diagnosis within five minutes and a log query to confirm it.
Who This Is For
- Backend engineers who shipped an LLM agent and are now debugging it in production for the first time
- ML platform teams building internal agent infrastructure and need a shared vocabulary for failure classification
- AI infra leads writing post-mortems who want a systematic taxonomy instead of a narrative
- DevOps and SRE teams who inherited an agent system and need an on-call runbook fast
- Founders and CTOs at AI-first startups who want observability rigor before their first major outage
This pack saves the hours it takes to discover empirically what can go wrong with an agent. The taxonomy is original WOWHOW IP drawn from systematic analysis of failure patterns across tool-calling, memory, planning, and context-management subsystems. Browse more developer tools and frameworks in the WOWHOW catalog, or use the AI model cost calculator to estimate the operational cost of running your agent at scale before you instrument it.