Why agents need different telemetry than traditional services
Traditional service observability covers latency, error rate, and saturation. Those metrics don't catch the failure modes agents exhibit. An agent can return a 200 OK with hallucinated content. A retry loop can quietly burn $400 in tokens behind a healthy latency curve. Eval pass-rate can drift down 8 points across a quarter while every traditional dashboard stays green.
The fix isn't to build separate AI tooling - it's to extend your existing observability with span attributes that capture model, prompt, tokens, cost, and outcome. OpenTelemetry's GenAI semantic conventions [1] give you a vendor-neutral format that Datadog, New Relic, Grafana, and Honeycomb all consume. The on-call engineer reads agent traces in the same UI as the rest of the stack.
What to instrument (and what to skip)
- OTel span per LLM call + tool call: standardized attributes for model name/provider, prompt hash, tokens (prompt + completion), cost, latency, and outcome.
- Trace correlation to user/session: a user ID (or anonymous session) on every span, so a user-reported issue can be debugged end-to-end.
- Eval pass-rate per release: sampled production traffic re-run against the eval suite nightly, with pass-rate broken out by release version and release-over-release deltas.
- Cost-per-action distribution: p50, p95, and p99 cost per action. Drift in p99 is the canary for runaway loops.
- Refusal + escalation rate: how often the agent refuses a request or escalates to a human. Drift in either is a signal.
Source: Techimax engagement telemetry, anonymized customer rollout
| Week | USD per action |
|---|---|
| Wk 1 | 0.06 |
| Wk 2 | 0.07 |
| Wk 3 | 0.07 |
| Wk 4 | 0.09 |
| Wk 5 | 0.11 |
| Wk 6 | 0.18 |
| Wk 7 | 0.32 |
| Wk 8 | 0.1 |
| Wk 9 | 0.08 |
| Wk 10 | 0.07 |
| Wk 11 | 0.07 |
| Wk 12 | 0.07 |
What to alarm on
Six alarms cover most of the production incidents we see. Wire them to your existing on-call rotation; don't build a separate AI on-call.
| Alarm | Threshold | Why |
|---|---|---|
| Eval pass-rate dip | ≥ 2 percentage points sustained over 24h | Drift indicator; pages on-call before customers notice |
| Cost-per-action p99 | ≥ 2× rolling 7-day baseline | Loop / context bloat early-warning |
| Refusal rate | ≥ 1.5× rolling 7-day baseline | Often the canary for upstream policy / KB changes |
| Tool error rate | ≥ 5% over 1h | Upstream API outage or contract drift |
| Reviewer queue depth | ≥ 1.5× SLA at p95 | Reviewer team underwater; risk of escalation timeout |
| Token-per-action p99 | ≥ 1.8× baseline | Context bloat; retrieval pipeline drift |
// Wrap every model call + tool call. Same APM consumes them as
// the rest of your services - no separate "AI dashboard".
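// Assumed context for this snippet: a tracer from the OTel JS API, e.g.
//   import { trace } from "@opentelemetry/api";
//   const tracer = trace.getTracer("agent-runtime");
// and a sha256() helper (e.g. node:crypto createHash("sha256")).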
const span = tracer.startSpan("agent.tool_call", {
attributes: {
"agent.id": agent.id,
"agent.version": agent.version,
"model.name": model.name,
"model.provider": model.provider,
"tool.name": tool.name,
"prompt.hash": sha256(prompt),
"tokens.prompt": tokenCount.prompt,
"tokens.completion": tokenCount.completion,
"cost.usd": cost.usd,
"outcome": outcome, // ok | retry | escalate | refuse
"trace.parent_action_id": parentActionId,
"tenant.id": tenant.id,
"user.session": sessionHash, // hashed; PII-safe
},
});
Closing the loop: production telemetry into the eval suite
The highest-leverage observability move isn't an alarm - it's the eval flywheel. Sample 1–5% of production traces nightly, replay them against the eval suite, and surface failures into the next morning's standup. The eval suite hardens automatically; production drift becomes visible the next day, not the next quarter.
Concrete pattern: a Lambda runs at 03:00 local time, pulls a 2% trace sample, runs each through the eval grader (LLM-graded with calibrated thresholds), commits any new failure cases to a `reviewable.jsonl` file in the eval repo. A human triages weekly and promotes the real failures into the eval suite. The suite gets harder; the agent stays in front of drift [3].
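A minimal sketch of that nightly job in TypeScript. The `fetchTraceSample` and `gradeTrace` helpers and the `evals/reviewable.jsonl` path are illustrative stand-ins for your trace backend, your eval grader, and your eval repo layout - not a specific product API:

import { appendFileSync } from "fs";

// Hypothetical helpers - swap in your trace backend and your eval grader.
declare function fetchTraceSample(sinceMs: number, sampleRate: number): Promise<AgentTrace[]>;
declare function gradeTrace(trace: AgentTrace): Promise<{ pass: boolean; reason: string }>;

interface AgentTrace {
  traceId: string;
  releaseVersion: string;
  input: string;
  output: string;
}

// Nightly job (cron / Lambda at 03:00): pull a 2% sample of the last day's
// traces, grade each one, and append failures for weekly human triage.
export async function nightlyReplay(): Promise<void> {
  const DAY_MS = 24 * 60 * 60 * 1000;
  const sample = await fetchTraceSample(Date.now() - DAY_MS, 0.02);
  for (const trace of sample) {
    const verdict = await gradeTrace(trace);
    if (!verdict.pass) {
      const row = { ...trace, reason: verdict.reason, gradedAt: new Date().toISOString() };
      appendFileSync("evals/reviewable.jsonl", JSON.stringify(row) + "\n");
    }
  }
}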
Source: Techimax engagement telemetry, 2024–2026 (mean across 14 rollouts)
| Drift source | Hours to detect, before nightly replay | Hours to detect, with nightly replay |
|---|---|---|
| Provider model update | 168 | 14 |
| Knowledge-base drift | 240 | 22 |
| Prompt regression | 96 | 6 |
| Tool contract drift | 120 | 10 |
Common incidents and what telemetry catches each
We've debugged enough production agent incidents to recognize the recurring shapes. The good news: each one has a telemetry signature that, with the instrumentation above in place, paints the diagnosis on the dashboard. Here's the field guide we hand new engineers at engagement kickoff.
| Signature | Likely cause | First check | Mitigation |
|---|---|---|---|
| Cost p99 doubles, latency stable | Tool-call retry loop | Retry counts on tool spans | Cap retries; circuit-break |
| Eval pass-rate drops 3+ pts overnight | Provider model update | Model version on spans | Pin version; regression-test |
| Refusal rate spikes 2× | Prompt or KB drift | Diff KB sources / prompt changes | Roll back or recalibrate |
| Tool 5xx rate spikes | Upstream contract drift | Upstream provider status | Failover; bump tool version |
| Reviewer queue grows linearly | Routing too cautious | Auto-vs-review ratio per intent | Recalibrate routing threshold |
| Context-token p95 climbs slowly | Retrieval bloat | Retrieved-tokens histogram | Re-rank + truncate cap |
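Two of the mitigations above - cap retries and circuit-break - are cheap to build into the tool-call wrapper itself. A minimal sketch; the limits and the `callTool` signature are illustrative assumptions, not a prescribed implementation:

// Sketch: bounded retries plus a per-tool circuit breaker around tool calls.
// MAX_RETRIES, the breaker settings, and callTool() are illustrative choices.
declare function callTool(name: string, args: unknown): Promise<unknown>;

const MAX_RETRIES = 2;              // hard cap - no unbounded retry loops
const BREAKER_THRESHOLD = 5;        // consecutive failures before the breaker opens
const BREAKER_COOLDOWN_MS = 60_000; // how long the breaker stays open

const breaker = new Map<string, { failures: number; openUntil: number }>();

export async function guardedToolCall(name: string, args: unknown): Promise<unknown> {
  const state = breaker.get(name) ?? { failures: 0, openUntil: 0 };
  if (Date.now() < state.openUntil) {
    throw new Error(`circuit open for tool ${name}`); // escalate instead of looping
  }
  let lastError: unknown;
  for (let attempt = 0; attempt <= MAX_RETRIES; attempt++) {
    try {
      const result = await callTool(name, args);
      breaker.set(name, { failures: 0, openUntil: 0 }); // success resets the breaker
      return result;
    } catch (err) {
      lastError = err;
      state.failures += 1;
      if (state.failures >= BREAKER_THRESHOLD) {
        state.openUntil = Date.now() + BREAKER_COOLDOWN_MS;
      }
      breaker.set(name, state);
    }
  }
  throw lastError; // retry cap reached - surface the error instead of burning tokens
}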
The on-call engineer at 2 AM should not have to learn a new tool to debug a stuck agent. AI observability that doesn't live in your existing APM doesn't survive contact with on-call.
References
- [1] OpenTelemetry GenAI semantic conventions - OpenTelemetry SIG (2025)
- [2] SRE practices for ML systems - Google SRE Workbook (2024)
- [3] Continuous evaluation for LLM applications - Anthropic engineering (2025)
- [4] OWASP Top 10 for LLM applications - OWASP (2024)
- [5] Datadog observability for LLM applications - Datadog (2025)
Frequently asked questions
Can we use Datadog / New Relic / Grafana?
Yes - all three accept OpenTelemetry. Use what your org uses. We don't recommend a separate "AI APM".
What about LangSmith / Helicone / Phoenix?
Useful as supplementary tools for prompt-level debugging. Don't make them the system of record. Your APM should still be the source of truth so the on-call rotation works.
Should we trace prompts in full?
Hash + sample. Full prompts are large, expensive to store, and may contain PII. Hash on every span; sample full prompts at 0.1–1% for offline analysis.
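A sketch of the hash-plus-sample pattern. The 0.5% rate and the `logFullPrompt` sink are assumptions; the sink should be a separate, access-controlled store, not the APM:

import { createHash } from "crypto";

// logFullPrompt() is a placeholder for a separate, access-controlled store.
declare function logFullPrompt(promptHash: string, prompt: string): void;

const FULL_PROMPT_SAMPLE_RATE = 0.005; // 0.5%, within the 0.1–1% range above

export function recordPrompt(prompt: string): string {
  const promptHash = createHash("sha256").update(prompt).digest("hex");
  if (Math.random() < FULL_PROMPT_SAMPLE_RATE) {
    logFullPrompt(promptHash, prompt); // full text only for the sampled fraction
  }
  return promptHash; // this is what goes on the span as prompt.hash
}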
How do we handle PII in traces?
Redact at the SDK boundary before the span is recorded. Tokenize PII fields; store the mapping in a separate, more-protected store; the trace contains tokens. This pattern preserves debug utility without leaking PII into the APM tier [4].
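One way to sketch that boundary: scrub attribute values before they ever reach `startSpan`. The regexes below are minimal examples of PII detection, and `vaultTokenize` stands in for the separate mapping store - both are assumptions, not a complete redaction policy:

// Sketch: tokenize PII in attribute values before the span is recorded.
// vaultTokenize() stands in for the more-protected store holding token -> value.
declare function vaultTokenize(value: string): string;

const EMAIL_RE = /[\w.+-]+@[\w-]+\.[\w.-]+/g;
const PHONE_RE = /\+?\d[\d\s().-]{7,}\d/g;

export function redactAttributes(
  attrs: Record<string, string | number>
): Record<string, string | number> {
  const out: Record<string, string | number> = {};
  for (const [key, value] of Object.entries(attrs)) {
    out[key] =
      typeof value === "string"
        ? value.replace(EMAIL_RE, vaultTokenize).replace(PHONE_RE, vaultTokenize)
        : value;
  }
  return out;
}

// Usage: tracer.startSpan("agent.tool_call", { attributes: redactAttributes({ ... }) })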
Do we need a separate retention policy for AI traces?
Sometimes - regulated workloads (HIPAA, SOX) often require longer retention for outputs and access logs than for general APM data. Keep two tiers: 30-day APM retention for debug; 6-year cold-storage retention for audit-required lineage. Don't try to make the APM hold both.
How big is the storage cost of GenAI tracing?
Modest. Span attributes are bytes; the cost driver is full-prompt sampling. A typical mid-volume agent (1M actions/month) generates 4–8GB of trace metadata per month at 1% full-prompt sampling. Order-of-magnitude smaller than the model API spend.
What's a good first alarm to wire?
Cost-per-action p99 over 2× rolling 7-day baseline. It's easy to instrument, it catches the highest-cost failure mode (retry loops), and it doesn't false-alarm. Wire that on day one; layer eval pass-rate drift on day two.
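A sketch of that day-one check. `queryCostP99` (reads the metric from whatever APM you already run) and `page` (fires the on-call alert) are hypothetical stand-ins, not a vendor API:

// Sketch: page when cost-per-action p99 exceeds 2x the rolling 7-day baseline.
declare function queryCostP99(fromMs: number, toMs: number): Promise<number>;
declare function page(alert: string, context: Record<string, unknown>): Promise<void>;

const DAY_MS = 24 * 60 * 60 * 1000;

export async function checkCostAlarm(): Promise<void> {
  const now = Date.now();
  const current = await queryCostP99(now - DAY_MS, now);               // last 24h
  const baseline = await queryCostP99(now - 8 * DAY_MS, now - DAY_MS); // prior 7 days
  if (current >= 2 * baseline) {
    await page("agent.cost_per_action.p99", {
      current,
      baseline,
      firstCheck: "retry counts on tool spans",
    });
  }
}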