Where 78% of agent demos die
LangChain's 2024 State of AI Agents survey reported that 78% of organizations had built an agent demo, but only 22% had gotten one into production [1]. We've sat inside dozens of those abandoned projects. The failures always fall into one of seven buckets.
Source: LangChain State of AI Agents 2024; Techimax AI Rescue engagements 2023–2026
| Failure mode | Share of failed projects (%) |
|---|---|
| No eval discipline | 28 |
| Cost / latency surprises | 19 |
| Prompt-injection / guardrails | 14 |
| Brittle tool contracts | 13 |
| No observability | 11 |
| Human-handoff cliffs | 9 |
| Drift without alarms | 6 |
The seven-item production-readiness checklist
- Calibrated eval suite (≥ 50 cases) gating CI
Cases cover golden paths, adversarial inputs, refusals, and citation requirements. A PR can't merge below the calibrated threshold; a minimal sketch of the CI gate follows this checklist.
- Cost telemetry per agent action
Per-message, per-tool-call, per-LLM-call cost and latency. Alarms wired to your APM. Spend caps enforced at the gateway, not in code.
- Prompt-injection defenses with red-team evidence
Documented red-team exercise; injection cases in the eval suite; PII redaction at the SDK boundary.
- Tool contracts that survive ambiguity
Strongly-typed schemas; fuzzed inputs; failure modes mapped to retry/escalate/refuse paths. No bare LLM tool-call without a typed validator.
- Per-action tracing in your APM
OpenTelemetry traces with span attributes for prompt hash, tool name, model, token cost, and outcome. Same APM your existing services use - no separate "AI dashboard".
- Human-in-loop at the right cliffs
Categorical mapping of decisions to human-required vs auto. Reviewer queues with calibrated SLAs. Don't ship an agent that promises 100% automation if the failure cost > $X.
- Drift alarms wired to evals
Eval suite re-runs nightly against production traffic samples. Pass-rate dip > 2% pages on-call. Drift you can't see is drift you can't fix.
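To make the first item concrete, here's a minimal sketch of a CI eval gate, in the same TypeScript style as the tracing snippet later in this post. runEvalSuite is a hypothetical helper standing in for whatever eval harness you run, and the 92% threshold and suite names are placeholders you'd calibrate against your own baseline.
// Hypothetical CI gate: fail the build when the eval suite regresses.
// runEvalSuite, the suite names, and the threshold are placeholders.
import { runEvalSuite } from "./evals"; // assumed: returns Array<{ id: string; passed: boolean }>

const CALIBRATED_THRESHOLD = 0.92; // set from your measured baseline, not a guess
const MINIMUM_CASES = 50;

async function main(): Promise<void> {
  const results = await runEvalSuite({
    suites: ["golden-path", "adversarial", "refusal", "citation"],
  });
  const passRate = results.filter((r) => r.passed).length / results.length;
  console.log(`eval pass rate ${(passRate * 100).toFixed(1)}% over ${results.length} cases`);

  if (results.length < MINIMUM_CASES || passRate < CALIBRATED_THRESHOLD) {
    process.exit(1); // block the merge: suite too small or quality regressed
  }
}

main();
The same script, pointed at sampled production traffic on a nightly schedule, doubles as the drift alarm in the last item.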
Cost modeling: what to instrument before you launch
The cost failure mode is often the most surprising. A demo that costs $0.04 per interaction can become $4 per interaction when an agent loops on a malformed tool response, or when a 200-token system prompt becomes a 20K-token context after retrieval injection. The shape of these surprises is predictable.
| Failure mode | Typical cost multiplier | Fix |
|---|---|---|
| Tool-call retry loop | 5–40× | Bound retries; circuit-break on repeat tool-call patterns |
| Retrieval context bloat | 8–15× | Re-rank + truncate; cap retrieved tokens per call |
| Streaming abandoned mid-call | 2–4× | Cancel-aware streaming; bill on completed tokens only |
| Background agent loops | 10–100× | Per-trace token caps with hard kill at threshold |
| Prompt regression on model swap | 1.5–3× | Eval-gated provider switching; never swap blind |
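A rough sketch of the guard that catches the first and fourth rows before they reach the invoice: a per-trace budget enforced in the gateway or agent runtime. The class name and the numeric limits below are illustrative, not a specific product's API.
// Illustrative per-trace budget: bounded tool calls plus a hard token cap.
// The limits are placeholders -- calibrate them against your cost telemetry.
class TraceBudget {
  private tokensUsed = 0;
  private callsPerTool = new Map<string, number>();

  constructor(
    private readonly maxTokens = 50_000,    // hard kill threshold per trace
    private readonly maxCallsPerTool = 3,   // bounds retry loops on a single tool
  ) {}

  recordTokens(count: number): void {
    this.tokensUsed += count;
    if (this.tokensUsed > this.maxTokens) {
      throw new Error(`trace token cap exceeded: ${this.tokensUsed} > ${this.maxTokens}`);
    }
  }

  recordToolCall(toolName: string): void {
    const n = (this.callsPerTool.get(toolName) ?? 0) + 1;
    this.callsPerTool.set(toolName, n);
    if (n > this.maxCallsPerTool) {
      // Circuit-break instead of letting the agent loop on the same tool.
      throw new Error(`circuit breaker tripped: ${toolName} called ${n} times in one trace`);
    }
  }
}
Call recordTokens after every model response and recordToolCall before dispatching each tool; the thrown error should route to the agent's escalate path, not a silent retry.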
Guardrails: prompt injection, jailbreak, exfiltration
Anthropic's prompt-injection benchmarks show that even the most capable models can be coerced via indirect injection (poisoned retrieval content, tool-output payloads) without an explicit system-level defense [2]. Production agents need three layers: model-level resistance (system prompts + best-in-class providers), platform-level guardrails (PII redaction, prompt allow-lists, exfiltration filters at the gateway), and content-level evals (red-team cases in the eval suite).
Source: Anthropic + academic red-team data 2024–2025
| Defense layers (cumulative) | % of injection attempts breached |
|---|---|
| No defenses | 64 |
| + System prompt only | 41 |
| + Gateway redaction | 18 |
| + Eval-suite red-team cases | 6 |
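As a rough illustration of the platform layer, here's a sketch of a gateway-side filter that redacts obvious PII and flags suspected injection payloads in retrieved content before it reaches the model. The patterns are deliberately naive placeholders; a real deployment would plug in your redaction service and the rules your red-team exercise actually produced.
// Naive gateway-side guardrail sketch: redact PII, flag likely injection.
// Both pattern lists are placeholders for your real redaction and red-team rules.
const EMAIL_PATTERN = /[\w.+-]+@[\w-]+\.[\w.]+/g;
const INJECTION_HINTS = [/ignore (all|previous) instructions/i, /reveal (the|your) system prompt/i];

interface ScreenResult {
  content: string;
  flagged: boolean; // flagged content routes to refuse/escalate, never to the model
}

function screenRetrievedContent(raw: string): ScreenResult {
  const content = raw.replace(EMAIL_PATTERN, "[REDACTED_EMAIL]");
  const flagged = INJECTION_HINTS.some((pattern) => pattern.test(raw));
  return { content, flagged };
}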
Where to put humans in the loop (and where not to)
Human-in-loop is a load-bearing engineering decision, not a checkbox. Put humans on the cliff edge - the high-blast-radius decisions where a wrong agent action is unrecoverable - and trust the agent on routine paths. The wrong pattern is to gate every action; that destroys the velocity that justified the agent in the first place.
| Decision class | Default policy | Why |
|---|---|---|
| Read-only retrieval, summarization | Auto | Reversible; low blast radius |
| Outbound customer messaging | Auto + sample-review | Reversible if caught fast |
| Money movement, refunds | Human required above $X threshold | Reversal cost > review cost |
| Schema/data mutations | Human required for non-idempotent changes | Migration risk |
| Regulatory submissions, contracts | Human required, no exception | Legal exposure |
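One way to keep that table from living as tribal knowledge is a small policy map the agent runtime consults before any side-effecting action. The class names and thresholds below are illustrative; the $X from the table becomes an explicit constant.
// Illustrative policy map: who approves what, decided per action class.
// Class names and thresholds are placeholders -- map them to your own actions.
type Policy = "auto" | "auto_sample_review" | "human_required";

interface ActionContext {
  decisionClass: "read_only" | "outbound_message" | "money_movement" | "data_mutation" | "regulatory";
  amountUsd?: number;
  idempotent?: boolean;
}

const MONEY_THRESHOLD_USD = 500; // the $X from the table above

function policyFor(action: ActionContext): Policy {
  switch (action.decisionClass) {
    case "read_only":
      return "auto";
    case "outbound_message":
      return "auto_sample_review";
    case "money_movement":
      return (action.amountUsd ?? Infinity) > MONEY_THRESHOLD_USD ? "human_required" : "auto";
    case "data_mutation":
      return action.idempotent ? "auto" : "human_required";
    default:
      return "human_required"; // regulatory work and anything unknown fails closed
  }
}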
Observability: the same APM as the rest of your stack
We strongly advise against "AI dashboards" as a separate observability surface. Every agent action is a request - wire it into the OpenTelemetry traces your APM already ingests. Tag spans with prompt hash, model name, tool, token counts, and outcome. The on-call engineer paged at 2 AM should not have to learn a new tool to debug a stuck agent.
// Wrap every model call with a span; we use attributes the APM
// can group on without a custom UI.
import { trace } from "@opentelemetry/api";

const tracer = trace.getTracer("agent-runtime");

const span = tracer.startSpan("agent.tool_call", {
  attributes: {
    "agent.id": agent.id,
    "agent.version": agent.version,
    "model.name": model.name,
    "model.provider": model.provider,
    "tool.name": tool.name,
    "prompt.hash": sha256(prompt),
    "trace.parent_action_id": parentActionId,
  },
});
try {
  // ... execute the tool call, then record what it actually cost ...
  span.setAttributes({
    "tokens.prompt": tokenCount.prompt,
    "tokens.completion": tokenCount.completion,
    "cost.usd": cost.usd,
    "outcome": outcome, // ok | retry | escalate | refuse
  });
} finally {
  span.end(); // always close the span, even if the tool call throws
}
What to do this week
- Pull the seven-item checklist into a Notion / Linear page. Score your agent honestly. Anything below 4/7 is not production-ready, regardless of how the demo went.
- Pick the lowest-scoring item and budget 5 days to close it. The eval suite is the highest-leverage item if you don't have one.
- Wire OpenTelemetry traces into your APM today, not next sprint. Every hour of production traffic without telemetry is a debugging deficit.
- Schedule a red-team session for prompt injection. 90 minutes with a security engineer; document the cases that worked; add them to the eval suite.
- Set spend caps at the gateway, not in application code. Application-level caps fail open under retry storms.
References
- [1] State of AI Agents 2024 - LangChain (2024)
- [2] Indirect prompt injection benchmarks - Anthropic Trust & Safety (2024)
- [3] OpenTelemetry semantic conventions for GenAI - OpenTelemetry SIG (2025)
Frequently asked questions
How is this different from a regular pre-launch checklist?
It's specific to the failure modes agentic systems exhibit - looping tool calls, prompt injection through retrieval, drift, cost surprises that don't happen with traditional services. A regular pre-launch checklist won't catch any of these.
Can we ship an agent without an eval suite?
Technically yes; in practice no. Without evals you have no way to detect regressions when models update, prompts change, or retrieval drifts. You'll find out from customer complaints. We've never seen a long-lived production agent without an eval suite - they all either get one or get rolled back.
What's the ROI on this checklist?
Direct: prevented cost overruns (gateway caps catch the typical 5–40× cost spikes); prevented incidents (red-teamed injection cases). Indirect: shipping confidence - teams with the checklist in place ship 3–4× more agent updates per quarter because each release is safe.
Does this apply to internal agents or just customer-facing ones?
Both. Internal agents that touch HR data, finance data, or production systems carry the same blast radius as customer-facing ones. The cost and reputational risk profiles differ; the engineering discipline is the same.
How long does it take to retrofit this onto an existing production agent?
Typical AI Rescue engagements run 3–4 weeks for the full checklist on a single agent. Eval suite is week one; observability + cost controls week two; guardrails + red-team week three; drift alarms week four. We do this kind of work all the time.