
AI observability: what to measure when agents go to production

What to instrument, what to alarm on, and how to keep the AI dashboard from becoming a separate observability tool nobody learns. Telemetry patterns for production agents.

Techimax Engineering · Forward-deployed engineering team · 13 min read · Updated May 10, 2026

Why agents need different telemetry than traditional services

Traditional service observability covers latency, error rate, and saturation. Those metrics don't catch the failure modes agents exhibit. An agent can return a 200 OK with hallucinated content. A retry loop can quietly burn $400 in tokens behind a healthy latency curve. Eval pass-rate can drift down 8 points over a quarter while every traditional dashboard stays green.

The fix isn't to build separate AI tooling - it's to extend your existing observability with span attributes that capture model, prompt, tokens, cost, and outcome. OpenTelemetry's GenAI semantic conventions [1] give you a vendor-neutral format that Datadog, New Relic, Grafana, and Honeycomb all consume. The on-call engineer reads agent traces in the same UI as the rest of the stack.
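For reference, those conventions standardize the attribute names themselves. A minimal sketch of what a conforming LLM-call span looks like, using gen_ai.* names from the spec [1]; the conventions are still evolving, so verify the exact names before standardizing on them, and treat the model id below as a placeholder:

import { trace } from "@opentelemetry/api";

// Illustrative only: attribute names follow the OTel GenAI semantic
// conventions [1], which are still stabilizing - check the current spec.
const tracer = trace.getTracer("agent");

const llmSpan = tracer.startSpan("chat", {
  attributes: {
    "gen_ai.operation.name":      "chat",
    "gen_ai.system":              "anthropic",        // provider
    "gen_ai.request.model":       "example-model-id", // placeholder id
    "gen_ai.usage.input_tokens":  1200,
    "gen_ai.usage.output_tokens": 350,
  },
});
llmSpan.end();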

What to instrument (and what to skip)

Telemetry on every agent action
  • OTel span per LLM call + tool call

    Standardized attributes: model name/provider, prompt hash, tokens (prompt + completion), cost, latency, outcome.

  • Trace correlation to user/session

    User ID (or anonymous session) on every span. Lets you debug a user-reported issue end-to-end.

  • Eval pass-rate per release

    Sampled production traffic re-run against the eval suite nightly. Pass-rate by release version, with release-over-release deltas.

  • Cost-per-action distribution

    p50, p95, p99 cost per action. Drift in p99 is the canary for runaway loops.

  • Refusal + escalation rate

    How often the agent refuses a request or escalates to human. Drift in either is a signal.

Chart · Cost-per-action p99 (USD per action) over a 12-week production rollout (alarm threshold marked)
Source: Techimax engagement telemetry, anonymized customer rollout

Week  | USD per action
Wk 1  | 0.06
Wk 2  | 0.07
Wk 3  | 0.07
Wk 4  | 0.09
Wk 5  | 0.11
Wk 6  | 0.18
Wk 7  | 0.32
Wk 8  | 0.10
Wk 9  | 0.08
Wk 10 | 0.07
Wk 11 | 0.07
Wk 12 | 0.07

What to alarm on

The alarms below cover most of the production incidents we see. Wire them into your existing on-call rotation; don't build a separate AI on-call.

Alarm                | Threshold                                 | Why
Eval pass-rate dip   | ≥ 2 percentage points sustained over 24h  | Drift indicator; pages on-call before customers notice
Cost-per-action p99  | ≥ 2× rolling 7-day baseline               | Loop / context-bloat early warning
Refusal rate         | ≥ 1.5× rolling 7-day baseline             | Often the canary for upstream policy / KB changes
Tool error rate      | ≥ 5% over 1h                              | Upstream API outage or contract drift
Reviewer queue depth | ≥ 1.5× SLA at p95                         | Reviewer team underwater; risk of escalation timeout
Token-per-action p99 | ≥ 1.8× baseline                           | Context bloat; retrieval-pipeline drift
Default alarm thresholds for production agentic systems
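Most APMs express the static thresholds directly; the rolling-baseline ones sometimes need a small scheduled check. A minimal sketch of the cost-per-action alarm, where queryPercentile and pageOnCall are hypothetical hooks into your metrics backend and paging tool:

// Sketch: page when cost-per-action p99 exceeds 2x the rolling 7-day
// baseline. queryPercentile and pageOnCall are placeholders for your stack.
type QueryPercentile = (metric: string, windowHours: number) => Promise<number>;
type PageOnCall = (message: string) => Promise<void>;

export async function checkCostAlarm(
  queryPercentile: QueryPercentile,
  pageOnCall: PageOnCall,
): Promise<void> {
  const current  = await queryPercentile("agent.cost_usd.p99", 1);      // last hour
  const baseline = await queryPercentile("agent.cost_usd.p99", 24 * 7); // rolling 7 days

  if (baseline > 0 && current >= 2 * baseline) {
    await pageOnCall(
      `cost-per-action p99 at $${current.toFixed(2)} is ` +
      `${(current / baseline).toFixed(1)}x the 7-day baseline ($${baseline.toFixed(2)})`,
    );
  }
}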
OpenTelemetry span attributes we add to every agent action

import { trace } from "@opentelemetry/api";
import { createHash } from "node:crypto";

const tracer = trace.getTracer("agent");
const sha256 = (s: string) => createHash("sha256").update(s).digest("hex");

// Wrap every model call + tool call. The same APM consumes them as
// the rest of your services - no separate "AI dashboard".
const span = tracer.startSpan("agent.tool_call", {
  attributes: {
    "agent.id":               agent.id,
    "agent.version":          agent.version,
    "model.name":             model.name,
    "model.provider":         model.provider,
    "tool.name":              tool.name,
    "prompt.hash":            sha256(prompt),
    "tokens.prompt":          tokenCount.prompt,
    "tokens.completion":      tokenCount.completion,
    "cost.usd":               cost.usd,
    "outcome":                outcome,        // ok | retry | escalate | refuse
    "trace.parent_action_id": parentActionId,
    "tenant.id":              tenant.id,
    "user.session":           sessionHash,    // hashed; PII-safe
  },
});
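The attribute map above is set at span start; in practice the outcome is only known after the call, so the span is closed around the real work. A minimal sketch of that other half, where the executeTool callback is a placeholder for your own tool dispatcher:

import type { Span } from "@opentelemetry/api";

// Close the span around the real call so outcome, errors, and latency
// land on the same span. executeTool is a placeholder for your dispatcher.
async function runToolCall<T>(
  span: Span,
  executeTool: () => Promise<T>,
): Promise<T> {
  try {
    const result = await executeTool();
    span.setAttribute("outcome", "ok");
    return result;
  } catch (err) {
    span.recordException(err as Error);
    span.setAttribute("outcome", "retry"); // or "escalate", per your policy
    throw err;
  } finally {
    span.end();
  }
}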

Closing the loop: production telemetry into the eval suite

The highest-leverage observability move isn't an alarm - it's the eval flywheel. Sample 1–5% of production traces nightly, replay them against the eval suite, and surface failures into the next morning's standup. The eval suite hardens automatically; production drift becomes visible the next day, not the next quarter.

Concrete pattern: a Lambda runs at 03:00 local time, pulls a 2% trace sample, runs each through the eval grader (LLM-graded with calibrated thresholds), commits any new failure cases to a `reviewable.jsonl` file in the eval repo. A human triages weekly and promotes the real failures into the eval suite. The suite gets harder; the agent stays in front of drift [3].
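A minimal sketch of that job, with the trace store and grader stubbed behind hypothetical interfaces (sampleTraces, gradeTrace); only the reviewable.jsonl output matches the pattern described above:

import { appendFileSync } from "node:fs";

// Nightly flywheel sketch. sampleTraces and gradeTrace are hypothetical
// interfaces over your trace store and eval grader; reviewable.jsonl is
// the triage queue a human reviews weekly.
type ProdTrace = { id: string; input: string; output: string };
type Verdict = { pass: boolean; reason?: string };

export async function nightlyEvalReplay(
  sampleTraces: (opts: { fraction: number; sinceHours: number }) => Promise<ProdTrace[]>,
  gradeTrace: (t: ProdTrace) => Promise<Verdict>,
): Promise<void> {
  const traces = await sampleTraces({ fraction: 0.02, sinceHours: 24 }); // 2% sample

  for (const t of traces) {
    const verdict = await gradeTrace(t); // LLM-graded, calibrated thresholds
    if (!verdict.pass) {
      appendFileSync(
        "reviewable.jsonl",
        JSON.stringify({ traceId: t.id, reason: verdict.reason }) + "\n",
      );
    }
  }
}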

Chart · Time to detect a quality regression (hours): with vs without an eval flywheel
Source: Techimax engagement telemetry, 2024–2026 (mean across 14 rollouts)

Regression type       | Without flywheel | With flywheel
Provider model update | 168 h            | 14 h
Knowledge-base drift  | 240 h            | 22 h
Prompt regression     | 96 h             | 6 h
Tool contract drift   | 120 h            | 10 h

Common incidents and what telemetry catches each

We've debugged enough production agent incidents to recognize the recurring shapes. The good news: each one has a telemetry signature that, when instrumented properly, puts the diagnosis right on the dashboard. Here's the field guide we hand new engineers at engagement kickoff.

Signature                             | Likely cause            | First check                      | Mitigation
Cost p99 doubles, latency stable      | Tool-call retry loop    | Retry counts on tool spans       | Cap retries; circuit-break
Eval pass-rate drops 3+ pts overnight | Provider model update   | Model version on spans           | Pin version; regression-test
Refusal rate spikes 2×                | Prompt or KB drift      | Diff KB sources / prompt changes | Roll back or recalibrate
Tool 5xx rate spikes                  | Upstream contract drift | Upstream provider status         | Failover; bump tool version
Reviewer queue grows linearly         | Routing too cautious    | Auto-vs-review ratio per intent  | Recalibrate routing threshold
Context-token p95 climbs slowly       | Retrieval bloat         | Retrieved-tokens histogram       | Re-rank + truncate cap
Production agent incident playbook by telemetry signature

The on-call engineer at 2 AM should not have to learn a new tool to debug a stuck agent. AI observability that doesn't live in your existing APM doesn't survive contact with on-call.

References

  [1] OpenTelemetry GenAI semantic conventions - OpenTelemetry SIG (2025)
  [2] SRE practices for ML systems - Google SRE Workbook (2024)
  [3] Continuous evaluation for LLM applications - Anthropic engineering (2025)
  [4] OWASP Top 10 for LLM applications - OWASP (2024)
  [5] Datadog observability for LLM applications - Datadog (2025)

Frequently asked questions

Can we use Datadog / New Relic / Grafana?

Yes - all three accept OpenTelemetry. Use what your org uses. We don't recommend a separate "AI APM".

What about LangSmith / Helicone / Phoenix?

Useful as supplementary tools for prompt-level debugging. Don't make them the system of record. Your APM should still be the source of truth so the on-call rotation works.

Should we trace prompts in full?

Hash + sample. Full prompts are large, expensive to store, and may contain PII. Hash on every span; sample full prompts at 0.1–1% for offline analysis.
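A minimal sketch of the hash-plus-sample policy at the point where span attributes are built (the 0.5% rate and the prompt.sample attribute name are illustrative choices, not a convention):

import { createHash } from "node:crypto";

// Hash every prompt; attach the full text only for a small random sample.
// The rate and attribute names here are illustrative, not a standard.
const FULL_PROMPT_SAMPLE_RATE = 0.005; // 0.5%

function promptAttributes(prompt: string): Record<string, string> {
  const attrs: Record<string, string> = {
    "prompt.hash": createHash("sha256").update(prompt).digest("hex"),
  };
  if (Math.random() < FULL_PROMPT_SAMPLE_RATE) {
    attrs["prompt.sample"] = prompt; // stored rarely, for offline analysis
  }
  return attrs;
}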

How do we handle PII in traces?

Redact at the SDK boundary before the span is recorded. Tokenize PII fields; store the mapping in a separate, more-protected store; the trace contains tokens. This pattern preserves debug utility without leaking PII into the APM tier [4].
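One way to implement that boundary, sketched with a single email pattern and a hypothetical tokenStore that keeps the token-to-value mapping in a separately protected store:

// Redact before the attribute ever reaches the span. tokenStore is a
// placeholder for a separate, more tightly protected mapping store.
const EMAIL = /[\w.+-]+@[\w-]+\.[\w.]+/g;

async function redact(
  value: string,
  tokenStore: { put: (original: string) => Promise<string> },
): Promise<string> {
  const matches = value.match(EMAIL) ?? [];
  let redacted = value;
  for (const pii of matches) {
    const token = await tokenStore.put(pii);    // e.g. "pii_7f3a..."
    redacted = redacted.split(pii).join(token); // the trace only ever sees the token
  }
  return redacted;
}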

Do we need a separate retention policy for AI traces?

Sometimes - regulated workloads (HIPAA, SOX) often require longer retention for outputs and access logs than for general APM data. Keep two tiers: 30-day APM retention for debug; 6-year cold-storage retention for audit-required lineage. Don't try to make the APM hold both.

How big is the storage cost of GenAI tracing?

Modest. Span attributes are bytes; the cost driver is full-prompt sampling. A typical mid-volume agent (1M actions/month) generates 4–8GB of trace metadata per month at 1% full-prompt sampling. Order-of-magnitude smaller than the model API spend.

What's a good first alarm to wire?

Cost-per-action p99 over 2× rolling 7-day baseline. It's easy to instrument, it catches the highest-cost failure mode (retry loops), and it doesn't false-alarm. Wire that on day one; layer eval pass-rate drift on day two.
