The agent returned a 200. The customer lost $4,200. The monitoring dashboard showed green. This is the failure mode that keeps me up at night — not the one where your AI agent crashes with a stack trace, but the one where it succeeds at the wrong thing with complete confidence and nobody notices for three days.
I run AI agents in production. A content research pipeline, a pricing sync agent across 2,000-plus products, a nightly SEO task executor. I have had every variety of failure: hard crashes, timeout loops, runaway tool calls, authentication expiries. Those failures are easy. They show up in your error logs. They page you at 2am. They get fixed. The failure that cost that customer $4,200 was none of those. It was a step-10 output that was confidently, politely, completely wrong — and tracing it back to the tool call at step 3 that set up the error took four engineers and a full day of log archaeology.
According to data from enterprise AI deployment surveys published in early 2026, 89% of organizations have implemented some form of observability for their AI systems. Only 31% have measurement frameworks — defined as documented KPIs, baselines, and a process for evaluating whether the outputs are actually correct. The remaining 58% are watching request counts and latency distributions while entirely missing the question that matters: was the answer right?
This post is the architecture I wish I had before that incident. I will cover why traditional monitoring fails for multi-step agents, the five observability primitives you actually need, how to build an eval pipeline from production traces, how the major platforms compare, and the shadow agent problem that will bite you if you skip the governance layer.
Why Traditional Monitoring Fails for AI Agents
Traditional application monitoring was built for a world where correctness is binary. A function either throws an exception or it does not. A database query either returns results or it errors. An HTTP endpoint either returns 200 or it does not. The entire observability stack — Datadog, New Relic, Prometheus, PagerDuty — is optimized for detecting deviations from a deterministic expected behavior. When everything is deterministic, the absence of errors means things are working.
AI agents violate this assumption at every level. An agent can execute all tool calls successfully, return HTTP 200, complete within SLA, and produce output that is wrong in a way that only a domain expert would recognize. The monitoring dashboard showing green is not lying to you — it is answering a different question than the one you need answered. It is telling you that the plumbing worked. It cannot tell you whether the output was good.
The multi-step nature of agent workflows compounds the problem. In a ten-step reasoning chain, an error at step 3 does not propagate as an exception. It propagates as subtly wrong context that informs step 4, which informs step 5, which informs the final answer. By the time you see the output, the root cause is nine steps upstream and you have no trace linking the wrong conclusion to the wrong tool call that started it. This is not a hypothetical failure mode. It is the default failure mode for any agent that chains tool calls together and does not instrument each step independently.
Quality issues are now the primary barrier to running agents in production: 32% of surveyed organizations cite them as their top obstacle, ahead of latency problems, cost overruns, and infrastructure failures combined. Quality problems are invisible to infrastructure monitoring because quality is a semantic property of outputs, not a syntactic property of HTTP responses. Detecting them requires a different instrumentation layer entirely.
The Five Observability Primitives You Actually Need
After running agents in production and reading post-mortems from teams that have been doing this longer, I have converged on five primitives that actually cover the failure surface of agentic systems. These are not the five things vendors pitch. They are the five things that would have caught every production incident I have personally experienced or traced through other teams’ post-mortems.
Distributed tracing with step-level spans. Each tool call, each LLM invocation, each decision branch in your agent workflow needs its own span. The span should capture the input, the output, the latency, the token counts, and any structured metadata the step produces. The trace should be queryable by run ID so you can replay exactly what happened in any production execution. This is the foundation everything else builds on. Without it, debugging a wrong output means guessing at causality from log fragments.
OpenTelemetry is the right instrumentation layer here, and it integrates with all four major platforms. Here is the setup that works for a TypeScript agent:
// otel-agent-tracing.ts — step-level spans for every agent action
// (assumes the @opentelemetry/* 1.x SDK packages)
import { trace, context, createContextKey, SpanStatusCode } from '@opentelemetry/api'
import { Resource } from '@opentelemetry/resources'
import { NodeTracerProvider, BatchSpanProcessor } from '@opentelemetry/sdk-trace-node'
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http'

// Context key used to propagate the agent run ID into every span
export const RUN_ID_KEY = createContextKey('agent.run_id')

const provider = new NodeTracerProvider({
  resource: Resource.default().merge(
    new Resource({ 'service.name': 'wowhow-content-agent', 'service.version': '2.1.0' })
  ),
})
provider.addSpanProcessor(
  new BatchSpanProcessor(new OTLPTraceExporter({ url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT }))
)
provider.register()

const tracer = trace.getTracer('agent-tracer')

export async function tracedToolCall<T>(
  toolName: string,
  input: unknown,
  fn: () => Promise<T>
): Promise<T> {
  return tracer.startActiveSpan(`tool.${toolName}`, async (span) => {
    span.setAttributes({
      'tool.name': toolName,
      'tool.input': JSON.stringify(input ?? null).slice(0, 1024), // truncate large inputs
      'agent.run_id': (context.active().getValue(RUN_ID_KEY) as string) ?? 'unknown',
    })
    try {
      const result = await fn()
      span.setAttributes({ 'tool.output': JSON.stringify(result ?? null).slice(0, 1024) })
      span.setStatus({ code: SpanStatusCode.OK })
      return result
    } catch (err) {
      span.recordException(err as Error)
      span.setStatus({ code: SpanStatusCode.ERROR })
      throw err
    } finally {
      span.end()
    }
  })
}
Multi-turn conversation replay. For agents that operate over multiple turns — handling a support ticket thread, researching a topic across several queries — you need the ability to replay any production conversation exactly as it happened. Not a summary, not a log of outputs. The full message history, tool call sequence, and intermediate states, reconstructed so a human reviewer can follow the exact reasoning path the agent took. This is what makes the step-3 root cause findable instead of invisible.
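What that looks like in practice is a flat event log per run that can be re-sorted into the order the agent experienced it. A minimal sketch, assuming you persist every turn and tool call keyed by run ID (the ConversationEvent shape and the fetchEventsForRun accessor are hypothetical, not any platform's API):
// conversation-replay.ts: rebuild a production run in the order the agent saw it (illustrative sketch)
interface ConversationEvent {
  runId: string
  turn: number                              // which conversational turn this event belongs to
  role: 'user' | 'assistant' | 'tool'
  content: string                           // message text or serialized tool payload
  toolName?: string
  timestamp: string
}

// Hypothetical accessor over whatever store holds your spans and events (e.g. the OTel backend)
declare function fetchEventsForRun(runId: string): Promise<ConversationEvent[]>

export async function replayConversation(runId: string): Promise<ConversationEvent[]> {
  const events = await fetchEventsForRun(runId)
  // Sort by turn, then by timestamp within a turn, so a reviewer can walk the
  // exact reasoning path: message in, tool calls, intermediate state, message out.
  return events.sort((a, b) =>
    a.turn !== b.turn ? a.turn - b.turn : a.timestamp.localeCompare(b.timestamp)
  )
}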
Online evaluation. A sample of production outputs, evaluated automatically against quality rubrics, running continuously. Not just at deployment time, not just in staging. In production, on real traffic, so you detect when model behavior drifts after a provider update, after your prompt changes, or after input distribution shifts. The evaluations do not need to be exhaustive — sampling 5% of production traffic with a fast evaluator gives you statistical sensitivity to drift without burning your evaluation budget.
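A rough sketch of what that sampling hook can look like in the agent's response path; the SAMPLE_RATE value and the recordEvalScores sink are assumptions, and evaluateOutput stands in for the LLM-as-judge function shown later in this post:
// online-eval-sampler.ts: score a fixed fraction of production outputs (illustrative sketch)
type Rubric = { name: string; description: string; scoringGuide: Record<number, string> }

// Assumed helpers: evaluateOutput is the LLM-as-judge evaluator described later in this post;
// recordEvalScores writes to whatever store your quality alerts read from.
declare function evaluateOutput(input: string, output: string, rubrics: Rubric[]): Promise<Record<string, number>>
declare function recordEvalScores(runId: string, scores: Record<string, number>): Promise<void>

const SAMPLE_RATE = 0.05 // 5% of production traffic is usually enough to detect drift

export async function maybeEvaluate(
  runId: string,
  input: string,
  output: string,
  rubrics: Rubric[]
): Promise<void> {
  if (Math.random() >= SAMPLE_RATE) return // skip most traffic to keep evaluator cost bounded
  const scores = await evaluateOutput(input, output, rubrics)
  await recordEvalScores(runId, scores)
}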
Semantic alerting. Alerts that trigger on output quality metrics, not just infrastructure metrics. Error rate for tool calls is a good infrastructure alert. A drop in evaluation pass rate from 94% to 87% over 24 hours is a semantic alert, and it is the one that actually tells you something is wrong with the agent’s behavior. Most teams have the former and none of the latter.
Data curation loop. The traces from your production runs are the highest-quality training and evaluation data you will ever have. They show exactly what real users asked, exactly what the agent produced, and — after review — whether the output was good. A data curation loop systematically captures interesting production traces (high-confidence correct, high-confidence wrong, borderline cases) and routes them into your eval dataset. This is how your evaluation suite gets smarter over time rather than stale.
Building an Eval Pipeline From Production Traces
The most effective eval pipeline I have built does not start with carefully constructed test cases. It starts with production traces. Real inputs, real agent behavior, real outputs reviewed by a human expert or a strong evaluator model. That dataset is ground truth in a way that synthetic benchmarks never are.
Here is the pipeline architecture. Production traces are sampled, filtered for interest (high latency, low confidence scores, or random sampling for baseline coverage), and routed to a review queue. Reviewers label outputs as correct, incorrect, or borderline. Labeled traces become evaluation cases. The eval suite runs against every deployment candidate. The pass rate becomes the gate.
// trace-to-eval-pipeline.ts — promote production traces to eval cases
interface ProductionTrace {
runId: string
input: string
steps: Array<{ tool: string; input: unknown; output: unknown; latencyMs: number }>
finalOutput: string
confidenceScore: number
timestamp: string
}
interface EvalCase {
id: string
input: string
expectedOutputPattern: string // regex or semantic description
expectedSteps: string[] // required tool calls in order
sourceTrace: string // runId for lineage
reviewer: string
reviewedAt: string
}
async function promoteTraceToEvalCase(
  trace: ProductionTrace,
  review: { correct: boolean; expectedOutput: string; reviewer: string }
): Promise<EvalCase | null> {
  // Wrong outputs are MORE valuable as eval cases — they define failure modes.
  // Correct outputs are sampled at 20% to maintain eval diversity without bloating the suite.
  const shouldPromote = !review.correct || Math.random() < 0.2
  if (!shouldPromote) return null
  return {
    id: `eval-${trace.runId}`,
    input: trace.input,
    expectedOutputPattern: review.expectedOutput,
    expectedSteps: trace.steps.map(s => s.tool),
    sourceTrace: trace.runId,
    reviewer: review.reviewer,
    reviewedAt: new Date().toISOString(),
  }
}
The custom evaluator layer is where the real quality signal comes from. An LLM-as-judge evaluator, given a well-defined rubric, can score agent outputs for correctness, relevance, and absence of hallucination at a cost that makes continuous evaluation economically viable. The key is giving the evaluator enough context: the original input, the expected behavior, the actual output, and a structured rubric that converts a subjective quality judgment into a numeric score.
// custom-evaluator.ts — LLM-as-judge for production output quality
interface EvalRubric {
name: string
description: string
scoringGuide: Record<number, string> // 1-5 scale with explicit definitions
}
const FACTUAL_ACCURACY_RUBRIC: EvalRubric = {
name: 'factual_accuracy',
description: 'Does the output contain only claims that are verifiable from the input context?',
scoringGuide: {
5: 'All claims are directly supported by input context. No hallucination.',
4: 'All major claims supported. Minor unsupported details that do not affect correctness.',
3: 'Core answer correct but includes 1-2 unsupported or unverifiable claims.',
2: 'Mixed accuracy. Key claims are wrong or unverifiable.',
1: 'Output is predominantly incorrect or hallucinates critical facts.',
},
}
// callEvaluatorModel is assumed to wrap your judge-model API call and return parsed JSON
declare function callEvaluatorModel(prompt: string): Promise<{ score: number; reasoning: string }>
async function evaluateOutput(
input: string,
output: string,
rubrics: EvalRubric[]
): Promise<Record<string, number>> {
const scores: Record<string, number> = {}
for (const rubric of rubrics) {
const prompt = `You are evaluating AI agent output quality.
Input: ${input}
Output: ${output}
Rubric: ${rubric.description}
${Object.entries(rubric.scoringGuide).map(([score, desc]) => `${score}: ${desc}`).join('\n')}
Respond with JSON: {"score": <1-5>, "reasoning": "<one sentence>"}`
const result = await callEvaluatorModel(prompt)
scores[rubric.name] = result.score
}
return scores
}
Once you have an eval pipeline running on production traces, the quality alert becomes straightforward: compare rolling average scores against a threshold and alert when you cross it.
// quality-alerting.ts — semantic alerts on eval score degradation
interface QualityAlert {
metric: string
currentScore: number
threshold: number
windowHours: number
samplesEvaluated: number
alertAt: string
}
async function checkQualityAlerts(
evalScores: Array<{ score: number; timestamp: string; metric: string }>,
thresholds: Record<string, number>
): Promise<QualityAlert[]> {
const alerts: QualityAlert[] = []
const windowStart = new Date(Date.now() - 24 * 60 * 60 * 1000).toISOString()
for (const [metric, threshold] of Object.entries(thresholds)) {
const recentScores = evalScores.filter(
s => s.metric === metric && s.timestamp >= windowStart
)
if (recentScores.length < 10) continue // insufficient sample
const avgScore = recentScores.reduce((sum, s) => sum + s.score, 0) / recentScores.length
if (avgScore < threshold) {
alerts.push({
metric,
currentScore: avgScore,
threshold,
windowHours: 24,
samplesEvaluated: recentScores.length,
alertAt: new Date().toISOString(),
})
}
}
return alerts
}
// Usage: alert if factual accuracy drops below 3.5/5.0 over 24h
const QUALITY_THRESHOLDS = { factual_accuracy: 3.5, relevance: 3.8, completeness: 3.2 }
Tool Comparison — Braintrust vs Langfuse vs Arize Phoenix vs LangSmith
I have used all four in production contexts, and they each make a different bet on what the hard problem actually is. Choosing the wrong one does not kill your observability practice, but it does mean fighting the tool’s design instead of working with it.
Braintrust is built eval-first. The core product is a dataset-driven evaluation system with a human review UI, a score tracking timeline, and tight CI/CD integration. The tracing and monitoring capabilities exist as first-class features but feel downstream of the eval philosophy. If your primary problem is measuring and improving output quality across deployments, Braintrust is the best-designed tool for that specific job. It is not the cheapest option and it does not self-host, which matters for teams with data residency requirements.
Langfuse makes the opposite bet: it is built observability-first, with eval as a secondary layer you add on top. It is open source, self-hostable, and the logging SDK is genuinely lightweight. For teams that need data to stay on their own infrastructure — regulated industries, anything involving PII-heavy inputs — Langfuse is often the only viable option. The eval tooling is catching up to Braintrust but is not there yet. The tracing UI is excellent. I run a self-hosted Langfuse instance for the WOWHOW pipelines.
Arize Phoenix is the OSS option with the deepest ML observability lineage. Arize as a company comes from production ML monitoring, and Phoenix inherits that focus on embedding drift, data distribution monitoring, and the kind of statistical analysis that matters when your inputs shift over time. If you are running RAG pipelines or any agent whose quality degrades when the retrieval corpus drifts, Phoenix has capabilities the other three lack. The tradeoff is a steeper learning curve and a UI that is more analytics dashboard than operator console.
LangSmith is the default for anyone running LangChain, and it has earned its position through sheer scale — LangSmith has processed over 15 billion traces. The instrumentation is automatic if you are already in the LangChain ecosystem, and the hub of shared prompts and evaluators is genuinely useful for teams that want community-sourced eval patterns. Outside the LangChain ecosystem it is still capable but loses its key advantage. The pricing scales with trace volume, which can become a constraint at high-throughput production loads.
The practical decision framework: self-hosting requirement points you to Langfuse or Phoenix. LangChain-native stack points you to LangSmith. Eval-first culture or heavy human review workflows point you to Braintrust. All four support OpenTelemetry as an ingestion path, so you are not locked in — the instrumentation you write today works against any of them.
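The portability claim is worth making concrete. A minimal sketch, assuming the platform you pick publishes an OTLP ingest endpoint and an auth header (the env var values are placeholders; check your platform's docs for the real ones):
// otlp-backend-swap.ts: switching platforms is an exporter config change, not a rewrite (illustrative)
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http'

// The OTLP HTTP exporter honors the standard OTEL_EXPORTER_OTLP_ENDPOINT and
// OTEL_EXPORTER_OTLP_HEADERS environment variables, so pointing the same
// tracedToolCall instrumentation at a different backend means changing
// configuration, not code:
//   OTEL_EXPORTER_OTLP_ENDPOINT=<your platform's OTLP ingest URL>
//   OTEL_EXPORTER_OTLP_HEADERS=<your platform's auth header, e.g. an API key>
export const exporter = new OTLPTraceExporter()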
The Shadow Agent Problem
Here is the failure mode that will define the second half of 2026 for organizations that have been aggressively deploying AI agents: shadow agents. An agent built by one team without registering it in a central inventory. An agent a developer spun up for a one-time task that is still running six months later. An agent that was built on top of a deprecated internal API and now silently uses a fallback that nobody intended as a production endpoint.
Shadow agents exist in every organization that has been deploying agents faster than its governance processes have matured. They are not malicious. They are the natural byproduct of capable developers using capable tools to solve real problems quickly. The problem is that a shadow agent, by definition, has no eval pipeline, no quality alerting, no trace retention, and no owner who will respond when it starts producing wrong outputs.
The discovery problem is harder than it sounds. Agents are not like microservices that register themselves with a service mesh. They are processes that make LLM API calls. The only reliable way to find them is to audit your LLM API usage at the account level and cross-reference it against your known-registered agents.
# shadow-agent-discovery.py — find unregistered agents via API key audit
import httpx
from collections import defaultdict
from datetime import datetime, timedelta

REGISTERED_AGENTS = {
    "content-research-agent",
    "pricing-sync-agent",
    "seo-executor-agent",
    # ... your known agents
}

async def discover_shadow_agents(api_key: str, lookback_days: int = 30) -> list[dict]:
    """
    Query LLM provider usage logs for API calls not attributable to
    registered agents. Requires usage metadata (user tag or custom header)
    to be set on all registered agent calls.

    NOTE: the endpoint path, query params, and response fields below are
    illustrative -- adapt them to your provider's actual usage-reporting API.
    """
    cutoff = datetime.utcnow() - timedelta(days=lookback_days)
    async with httpx.AsyncClient() as client:
        resp = await client.get(
            "https://api.anthropic.com/v1/usage",
            headers={"x-api-key": api_key, "anthropic-version": "2023-06-01"},
            params={"start_time": cutoff.isoformat(), "limit": 1000},
        )
    usage_records = resp.json().get("data", [])

    shadow_candidates = []
    for record in usage_records:
        agent_tag = record.get("metadata", {}).get("agent_id", "UNTAGGED")
        if agent_tag not in REGISTERED_AGENTS:
            shadow_candidates.append({
                "agent_tag": agent_tag,
                "model": record.get("model"),
                "input_tokens": record.get("input_tokens", 0),
                "output_tokens": record.get("output_tokens", 0),
                "timestamp": record.get("created_at"),
            })

    # Group by agent_tag to see usage patterns
    by_agent: dict[str, list[dict]] = defaultdict(list)
    for r in shadow_candidates:
        by_agent[r["agent_tag"]].append(r)

    return [
        {
            "agent_tag": tag,
            "call_count": len(records),
            "total_tokens": sum(r["input_tokens"] + r["output_tokens"] for r in records),
            "first_seen": min(r["timestamp"] for r in records),
            "last_seen": max(r["timestamp"] for r in records),
        }
        for tag, records in by_agent.items()
    ]
The discovery script is the easy part. The harder part is what you do with the results. A shadow agent that has been running for months, used by a business team for a real workflow, cannot be simply shut down without disruption. The practical response is a registration amnesty: give teams a window to register their shadow agents, bring them under the standard observability stack, and assign an owner. Make the bar for registration low enough that compliance is easy. The agents that do not get registered during the amnesty period are the ones you shut down — because if no one registered them, no one is responsible for them, and no one will respond when they start generating wrong outputs at scale.
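Making the bar low in practice means asking for a single record per agent. A sketch of the minimal registry entry I would require during an amnesty window (the field names are illustrative, not a standard schema):
// agent-registry.ts: the minimum a registration amnesty should ask for (illustrative)
interface RegisteredAgent {
  agentId: string             // the tag every LLM call from this agent must carry
  owner: string               // the person who responds when quality alerts fire
  purpose: string             // one sentence: the business workflow this agent serves
  tracingEnabled: boolean     // emits step-level spans through the shared OTel setup
  evalSuiteId: string | null  // the eval dataset gating its deployments, if one exists
  registeredAt: string
  reviewCadenceDays: number   // how often the owner re-confirms the agent should still exist
}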
The connection between shadow agents and the quality crisis is direct. Organizations reporting the highest rates of AI quality problems in production are the same organizations with the weakest agent registration practices. When you do not know an agent exists, you cannot evaluate it. When you cannot evaluate it, you find out it is wrong the same way that customer found out: after the damage is done, with a 200 response code in the logs and no trace of what actually happened.
The investment in observability infrastructure pays back unevenly but reliably. The first few months feel like instrumentation overhead. The first time your quality alert fires two hours after a provider model update and saves you from a day of wrong outputs reaching real users, the ROI calculation becomes obvious. I have had that experience once. It changed how I think about observability from an optional enhancement to a deployment prerequisite. You should not ship an agent to production that you cannot evaluate, and you should not evaluate an agent you cannot trace.
For more on building agents that are production-ready from day one, see the related posts on why 88% of agent pilots never reach production and the 3-layer agent harness pattern for keeping configuration complexity under control. The WOWHOW tools catalog includes utilities for structured agent logging that integrate with all four platforms covered here.