AI agent observability: why HTTP 200 hides wrong answers, the 5 primitives you need, and Braintrust vs Langfuse vs Arize compared.
The agent returned a 200. The customer lost $4,200. The monitoring dashboard showed green. This is the failure mode that keeps me up at night — not the one where your AI agent crashes with a stack trace, but the one where it succeeds at the wrong thing with complete confidence and nobody notices for three days.
I run AI agents in production. A content research pipeline, a pricing sync agent across 2,000-plus products, a nightly SEO task executor. I have had every variety of failure: hard crashes, timeout loops, runaway tool calls, authentication expiries. Those failures are easy. They show up in your error logs. They page you at 2am. They get fixed. The failure that cost that customer $4,200 was none of those. It was a step-10 output that was confidently, politely, completely wrong — and tracing it back to the tool call at step 3 that set up the error took four engineers and a full day of log archaeology.
According to data from enterprise AI deployment surveys published in early 2026, 89% of organizations have implemented some form of observability for their AI systems. Only 31% have measurement frameworks — defined as documented KPIs, baselines, and a process for evaluating whether the outputs are actually correct. The remaining 58% are watching request counts and latency distributions while entirely missing the question that matters: was the answer right?
This post is the architecture I wish I had before that incident. I will cover why traditional monitoring fails for multi-step agents, the five observability primitives you actually need, how to build an eval pipeline from production traces, how the major platforms compare, and the shadow agent problem that will bite you if you skip the governance layer.
Why Traditional Monitoring Fails for AI Agents
Traditional application monitoring was built for a world where correctness is binary. A function either throws an exception or it does not. A database query either returns results or it errors. An HTTP endpoint either returns 200 or it does not. The entire observability stack — Datadog, New Relic, Prometheus, PagerDuty — is optimized for detecting deviations from a deterministic expected behavior. When everything is deterministic, the absence of errors means things are working.
AI agents violate this assumption at every level. An agent can execute all tool calls successfully, return HTTP 200, complete within SLA, and produce output that is wrong in a way that only a domain expert would recognize. The monitoring dashboard showing green is not lying to you — it is answering a different question than the one you need answered. It is telling you that the plumbing worked. It cannot tell you whether the output was good.
The multi-step nature of agent workflows compounds the problem. In a ten-step reasoning chain, an error at step 3 does not propagate as an exception. It propagates as subtly wrong context that informs step 4, which informs step 5, which informs the final answer. By the time you see the output, the root cause is seven steps upstream and you have no trace linking the wrong conclusion to the wrong tool call that started it. This is not a hypothetical failure mode. It is the default failure mode for any agent that chains tool calls together and does not instrument each step independently.
Quality issues are now the primary barrier to running agents in production: 32% of surveyed organizations cite them, which puts them ahead of latency problems, cost overruns, and infrastructure failures combined. Quality problems are invisible to infrastructure monitoring because quality is a semantic property of outputs, not a syntactic property of HTTP responses. Detecting them requires a different instrumentation layer entirely.
The Five Observability Primitives You Actually Need
After running agents in production and reading post-mortems from teams that have been doing this longer, I have converged on five primitives that actually cover the failure surface of agentic systems. These are not the five things vendors pitch. They are the five things that would have caught every production incident I have personally experienced or traced through other teams’ post-mortems.
Distributed tracing with step-level spans. Each tool call, each LLM invocation, each decision branch in your agent workflow needs its own span. The span should capture the input, the output, the latency, the token counts, and any structured metadata the step produces. The trace should be queryable by run ID so you can replay exactly what happened in any production execution. This is the foundation everything else builds on. Without it, debugging a wrong output means guessing at causality from log fragments.
OpenTelemetry is the right instrumentation layer here, and it integrates with all four major platforms. Here is the setup that works for a TypeScript agent:
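// otel-agent-tracing.ts: step-level spans for every agent action
// Written against the OpenTelemetry JS SDK 1.x API (Resource class, addSpanProcessor).
import { trace, context, createContextKey, SpanStatusCode } from '@opentelemetry/api'
import { Resource } from '@opentelemetry/resources'
import { BatchSpanProcessor } from '@opentelemetry/sdk-trace-base'
import { NodeTracerProvider } from '@opentelemetry/sdk-trace-node'
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http'

// Context key for the current run ID. The caller is assumed to set it on the
// active context (context.active().setValue(RUN_ID_KEY, runId)) before the run starts.
export const RUN_ID_KEY = createContextKey('agent.run_id')

const provider = new NodeTracerProvider({
  resource: Resource.default().merge(
    new Resource({ 'service.name': 'wowhow-content-agent', 'service.version': '2.1.0' })
  ),
})
provider.addSpanProcessor(
  new BatchSpanProcessor(new OTLPTraceExporter({ url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT }))
)
provider.register()

const tracer = trace.getTracer('agent-tracer')

// Serialize a value for a span attribute without letting serialization break the tool call.
// JSON.stringify returns undefined for undefined and throws on circular structures.
function toAttr(value: unknown): string {
  try {
    return (JSON.stringify(value) ?? String(value)).slice(0, 1024) // truncate large payloads
  } catch {
    return '[unserializable]'
  }
}

export async function tracedToolCall<T>(
  toolName: string,
  input: unknown,
  fn: () => Promise<T>
): Promise<T> {
  return tracer.startActiveSpan(`tool.${toolName}`, async (span) => {
    span.setAttributes({
      'tool.name': toolName,
      'tool.input': toAttr(input),
      'agent.run_id': (context.active().getValue(RUN_ID_KEY) as string | undefined) ?? 'unknown',
    })
    try {
      const result = await fn()
      span.setAttributes({ 'tool.output': toAttr(result) })
      span.setStatus({ code: SpanStatusCode.OK })
      return result
    } catch (err) {
      span.recordException(err as Error)
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: err instanceof Error ? err.message : String(err),
      })
      throw err
    } finally {
      span.end() // always close the span, on success or failure
    }
  })
}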
Multi-turn conversation replay. For agents that operate over multiple turns — handling a support ticket thread, researching a topic across several queries — you need the ability to replay any production conversation exactly as it happened. Not a summary, not a log of outputs. The full message history, tool call sequence, and intermediate states, reconstructed so a human reviewer can follow the exact reasoning path the agent took. This is what makes the step-3 root cause findable instead of invisible.
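To make the replay idea concrete, here is a minimal sketch of what a replayable record could look like. The event shape, the `replay` and `firstFlagged` helpers, and the `price_lookup` tool are all hypothetical names I am using for illustration, not part of any platform's API. The point is only that if every message, tool call, and tool result is stored with a sequence number, you can rebuild what the agent had seen at each step.

```typescript
// Hypothetical event-log shape; none of these names come from a real library.
type AgentEvent =
  | { seq: number; kind: 'message'; role: 'user' | 'assistant'; content: string }
  | { seq: number; kind: 'tool_call'; tool: string; input: unknown }
  | { seq: number; kind: 'tool_result'; tool: string; output: unknown }

interface ReplayFrame {
  event: AgentEvent
  // Every event that had already happened before this one.
  contextSoFar: AgentEvent[]
}

// Rebuild the run in order, pairing each event with the context that preceded it.
function replay(events: AgentEvent[]): ReplayFrame[] {
  const ordered = [...events].sort((a, b) => a.seq - b.seq)
  return ordered.map((event, i) => ({ event, contextSoFar: ordered.slice(0, i) }))
}

// Return the earliest frame that a reviewer-supplied predicate flags.
function firstFlagged(
  frames: ReplayFrame[],
  isSuspect: (event: AgentEvent) => boolean
): ReplayFrame | undefined {
  return frames.find((frame) => isSuspect(frame.event))
}

// Usage: events may come back from the log store out of order.
const frames = replay([
  { seq: 3, kind: 'tool_result', tool: 'price_lookup', output: { price: 0 } },
  { seq: 1, kind: 'message', role: 'user', content: 'Sync prices for SKU 42' },
  { seq: 2, kind: 'tool_call', tool: 'price_lookup', input: { sku: 42 } },
])
const suspect = firstFlagged(
  frames,
  (e) => e.kind === 'tool_result' && (e.output as { price?: number }).price === 0
)
```

Here `suspect` points at the zero-price tool result, and `suspect.contextSoFar` holds the two events that led up to it, which is the material a reviewer needs to follow the reasoning path.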
Online evaluation. A sample of production outputs, evaluated automatically against quality rubrics, running continuously. Not just at deployment time, not just in staging. In production, on real traffic, so you detect when model behavior drifts after a provider update, after your prompt changes, or after input distribution shifts. The evaluations do not need to be exhaustive — sampling 5% of production traffic with a fast evaluator gives you statistical sensitivity to drift without burning your evaluation budget.
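A sketch of how the sampling decision could work, assuming each run has a stable string ID. The `hash32`, `shouldEvaluate`, and `maybeEvaluate` names and the `Evaluator` contract are my own placeholders. I hash the run ID instead of calling `Math.random()` so that the same run always gets the same decision; whether the sampled share lands close to the target rate depends on how well the hash spreads your particular IDs.

```typescript
// FNV-1a style 32-bit hash over UTF-16 code units; any stable hash would do here.
function hash32(s: string): number {
  let h = 0x811c9dc5
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i)
    h = Math.imul(h, 0x01000193)
  }
  return h >>> 0 // force an unsigned 32-bit result
}

// Deterministic sampling: the same run ID always gets the same decision,
// so retries and replays of a run are either all evaluated or all skipped.
function shouldEvaluate(runId: string, sampleRate = 0.05): boolean {
  return hash32(runId) / 0x100000000 < sampleRate
}

// Hypothetical evaluator contract: a pass/fail verdict for one input/output pair.
type Evaluator = (input: string, output: string) => Promise<boolean>

interface EvalRecord {
  runId: string
  passed: boolean
  evaluatedAt: number
}

// Evaluate a run only if it falls in the sample; returns null when skipped.
async function maybeEvaluate(
  runId: string,
  input: string,
  output: string,
  evaluate: Evaluator,
  sampleRate = 0.05
): Promise<EvalRecord | null> {
  if (!shouldEvaluate(runId, sampleRate)) return null
  return { runId, passed: await evaluate(input, output), evaluatedAt: Date.now() }
}
```

The records this produces are what the pass-rate metrics and alerts are computed from, so they should be stored alongside the trace for the same run ID.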
Semantic alerting. Alerts that trigger on output quality metrics, not just infrastructure metrics. Error rate for tool calls is a good infrastructure alert. A drop in evaluation pass rate from 94% to 87% over 24 hours is a semantic alert, and it is the one that actually tells you something is wrong with the agent’s behavior. Most teams have the former and none of the latter.
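As an illustration of what a semantic alert rule could look like, the sketch below compares the evaluation pass rate in the most recent window against the window before it. The function names, the 5-point drop threshold, and the 50-sample minimum are assumptions I picked for the example, not recommendations from any vendor.

```typescript
interface EvalResult {
  timestamp: number // epoch milliseconds
  passed: boolean
}

interface PassRateAlert {
  baseline: number // pass rate in the previous window
  current: number // pass rate in the latest window
  drop: number // baseline minus current
}

function passRate(results: EvalResult[]): number {
  return results.filter((r) => r.passed).length / results.length
}

// Fire when the pass rate fell by at least maxDrop between two adjacent windows.
function checkPassRateDrop(
  results: EvalResult[],
  now: number,
  windowMs = 24 * 60 * 60 * 1000,
  maxDrop = 0.05,
  minSamples = 50
): PassRateAlert | null {
  const current = results.filter((r) => r.timestamp > now - windowMs && r.timestamp <= now)
  const previous = results.filter(
    (r) => r.timestamp > now - 2 * windowMs && r.timestamp <= now - windowMs
  )
  // Too few samples in either window makes the comparison noise, so stay silent.
  if (current.length < minSamples || previous.length < minSamples) return null
  const baseline = passRate(previous)
  const rate = passRate(current)
  const drop = baseline - rate
  return drop >= maxDrop ? { baseline, current: rate, drop } : null
}
```

With these defaults, the 94% to 87% example above is a 7-point drop and would fire, while a move from 94% to 92% would not.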
Data curation loop. The traces from your production runs are the highest-quality training and evaluation data you will ever have. They show exactly what real users asked, exactly what the agent produced, and — after review — whether the output was good. A data curation loop systematically captures interesting production traces (high-confidence correct, high-confidence wrong, borderline cases) and routes them into your eval dataset. This is how your evaluation suite gets smarter over time rather than stale.
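One way the routing step could be sketched, assuming each production trace already carries an evaluator score between 0 and 1. The bucket names, the 0.2 and 0.8 thresholds, and the per-bucket cap are hypothetical values for illustration; human review labels would be attached to the curated traces afterwards.

```typescript
type Bucket = 'confident_correct' | 'confident_wrong' | 'borderline'

// Hypothetical trace summary: evalScore is the automatic evaluator's score in [0, 1].
interface ScoredTrace {
  runId: string
  evalScore: number
}

// Map a score onto the three kinds of trace worth keeping.
function bucketFor(trace: ScoredTrace, low = 0.2, high = 0.8): Bucket {
  if (trace.evalScore >= high) return 'confident_correct'
  if (trace.evalScore <= low) return 'confident_wrong'
  return 'borderline'
}

// Add traces to the eval dataset, skipping duplicates and capping each bucket
// so that no single bucket crowds out the others.
function curate(
  dataset: Map<Bucket, ScoredTrace[]>,
  traces: ScoredTrace[],
  maxPerBucket = 500
): void {
  for (const trace of traces) {
    const bucket = bucketFor(trace)
    const existing = dataset.get(bucket) ?? []
    const isDuplicate = existing.some((t) => t.runId === trace.runId)
    if (!isDuplicate && existing.length < maxPerBucket) {
      dataset.set(bucket, [...existing, trace])
    }
  }
}
```

Running `curate` on each day's sampled traces keeps the dataset growing from real traffic, and the cap is a crude guard against the dataset filling up with easy, confidently correct cases.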