
From demo to production: the agentic AI engineering checklist

78% of agent demos fail to reach production. Here's the engineering checklist that separates the 22% that survive - orchestration, evals, telemetry, guardrails, cost controls, and the boring stuff nobody puts in keynote slides.

Techimax Engineering · Forward-deployed engineering team · 13 min read · Updated April 2, 2026

Where 78% of agent demos die

LangChain's 2024 State of AI Agents survey reported that 78% of organizations had built an agent demo, but only 22% had gotten one into production [1]. We've sat inside dozens of those abandoned projects. The cause of death always falls into one of seven buckets.

Why agent projects fail to reach production (n=380 retrospectives)
Source: LangChain State of AI Agents 2024; Techimax AI Rescue engagements 2023–2026

Failure bucket | Share of retrospectives (%)
No eval discipline | 28
Cost / latency surprises | 19
Prompt-injection / guardrails | 14
Brittle tool contracts | 13
No observability | 11
Human-handoff cliffs | 9
Drift without alarms | 6

The seven-item production-readiness checklist

Run this list before flipping the production flag
  • Calibrated eval suite (≥ 50 cases) gating CI

    Cases cover golden paths, adversarial inputs, refusals, and citation requirements. PRs can't merge below the calibrated threshold.

  • Cost telemetry per agent action

    Per-message, per-tool-call, per-LLM-call cost and latency. Alarms wired to your APM. Spend caps enforced at the gateway, not in code.

  • Prompt-injection defenses with red-team evidence

    Documented red-team exercise; injection cases in the eval suite; PII redaction at the SDK boundary.

  • Tool contracts that survive ambiguity

    Strongly-typed schemas; fuzzed inputs; failure modes mapped to retry/escalate/refuse paths. No bare LLM tool-call without a typed validator.

  • Per-action tracing in your APM

    OpenTelemetry traces with span attributes for prompt hash, tool name, model, token cost, and outcome. Same APM your existing services use - no separate "AI dashboard".

  • Human-in-loop at the right cliffs

    Categorical mapping of decisions to human-required vs auto. Reviewer queues with calibrated SLAs. Don't ship an agent that promises 100% automation if the failure cost > $X.

  • Drift alarms wired to evals

    Eval suite re-runs nightly against production traffic samples. Pass-rate dip > 2% pages on-call. Drift you can't see is drift you can't fix.
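The first and last checklist items share one mechanism: a pass-rate gate over the eval suite. A minimal sketch of the CI side - the `EvalCase` shape, `gateOnEvals` name, and the 0.92 threshold are illustrative assumptions, not a specific eval framework:

```typescript
// Minimal CI eval gate: run every case, compute the pass rate,
// and block the merge below a calibrated threshold.
// EvalCase, gateOnEvals, and the 0.92 threshold are illustrative.
interface EvalCase {
  name: string;
  kind: "golden" | "adversarial" | "refusal" | "citation";
  run: () => boolean; // true = the agent's output passed this case
}

function gateOnEvals(cases: EvalCase[], threshold = 0.92): boolean {
  if (cases.length < 50) {
    throw new Error(`Need >= 50 eval cases, got ${cases.length}`);
  }
  const passed = cases.filter((c) => c.run()).length;
  const passRate = passed / cases.length;
  console.log(`eval pass rate: ${(passRate * 100).toFixed(1)}%`);
  return passRate >= threshold; // false => CI marks the PR as blocked
}
```

The same gate, re-run nightly against sampled production traffic and compared with the previous day's pass rate, is the drift alarm from the last checklist item.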

Cost modeling: what to instrument before you launch

Cost is often the most surprising failure mode. A demo that costs $0.04 per interaction can become $4 per interaction when an agent loops on a malformed tool response, or when a 200-token system prompt balloons into a 20K-token context after retrieval injection. The shape of these surprises is predictable.

Failure mode | Typical cost multiplier | Fix
Tool-call retry loop | 5–40× | Bound retries; circuit-break on repeat tool-call patterns
Retrieval context bloat | 8–15× | Re-rank + truncate; cap retrieved tokens per call
Streaming abandoned mid-call | 2–4× | Cancel-aware streaming; bill on completed tokens only
Background agent loops | 10–100× | Per-trace token caps with hard kill at threshold
Prompt regression on model swap | 1.5–3× | Eval-gated provider switching; never swap blind

Cost surprises we see in the first month of production traffic
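The two biggest multipliers in the table - retry loops and background loops - are contained by the same mechanic: a hard bound checked before every call. A sketch with illustrative limits (the class name, retry count, and token budget are assumptions, not a specific orchestrator's API):

```typescript
// Circuit breaker for agent loops: refuse further tool calls once a
// per-tool retry bound is hit, and hard-kill the trace once a token
// budget is exceeded. maxRetries and tokenBudget values are illustrative.
class TraceBudget {
  private retries = new Map<string, number>(); // tool name -> call count
  private tokensUsed = 0;
  private maxRetries: number;
  private tokenBudget: number;

  constructor(maxRetries = 3, tokenBudget = 50_000) {
    this.maxRetries = maxRetries;
    this.tokenBudget = tokenBudget;
  }

  allowToolCall(tool: string): boolean {
    const n = this.retries.get(tool) ?? 0;
    if (n >= this.maxRetries) return false; // break the loop; escalate
    this.retries.set(tool, n + 1);
    return true;
  }

  recordTokens(count: number): boolean {
    this.tokensUsed += count;
    return this.tokensUsed <= this.tokenBudget; // false => hard kill
  }
}
```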

Guardrails: prompt injection, jailbreak, exfiltration

Anthropic's prompt-injection benchmarks show that even the most capable models can be coerced via indirect injection (poisoned retrieval content, tool-output payloads) without an explicit system-level defense [2]. Production agents need three layers: model-level resistance (system prompts + best-of-class providers), platform-level guardrails (PII redaction, prompt allow-lists, exfiltration filters at the gateway), and content-level evals (red-team cases in the eval suite).

Successful indirect prompt-injection attempts by defense layer (% breached)
Source: Anthropic + academic red-team data 2024–2025

Defense layer | % breached
No defenses | 64
+ System prompt only | 41
+ Gateway redaction | 18
+ Eval-suite red-team cases | 6
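Gateway redaction, the layer behind the biggest drop in the numbers above, can start as pattern filters applied to retrieved content before it ever reaches the model. A deliberately minimal sketch - the two patterns (email, US-style SSN) are illustrative only; production gateways use dedicated PII detectors:

```typescript
// Redact obvious PII from retrieved content before it enters the
// prompt. The two patterns below (email, US-style SSN) are
// illustrative; real gateways use purpose-built PII detectors.
const REDACTIONS: Array<[RegExp, string]> = [
  [/[\w.+-]+@[\w-]+\.[\w.]+/g, "[EMAIL]"],
  [/\b\d{3}-\d{2}-\d{4}\b/g, "[SSN]"],
];

function redact(text: string): string {
  return REDACTIONS.reduce((t, [re, label]) => t.replace(re, label), text);
}
```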

Where to put humans in the loop (and where not to)

Human-in-loop is a load-bearing engineering decision, not a checkbox. Put humans on the cliff edge - the high-blast-radius decisions where a wrong agent action is unrecoverable - and trust the agent on routine paths. The wrong pattern is to gate every action; that destroys the velocity that justified the agent in the first place.

Decision class | Default policy | Why
Read-only retrieval, summarization | Auto | Reversible; low blast radius
Outbound customer messaging | Auto + sample-review | Reversible if caught fast
Money movement, refunds | Human required above $X threshold | Reversal cost > review cost
Schema/data mutations | Human required for non-idempotent changes | Migration risk
Regulatory submissions, contracts | Human required, no exception | Legal exposure

Human-in-loop default policy by decision class
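The default-policy table can be enforced mechanically at the orchestration layer. A sketch assuming the decision classes above; the $500 money-movement threshold is an illustrative stand-in for "$X", and the type names are ours, not a framework's:

```typescript
// Route each agent decision to a review policy by decision class.
// Classes mirror the table above; the $500 threshold is a
// placeholder for whatever "$X" is in your business.
type Policy = "auto" | "auto_sample_review" | "human_required";

interface Decision {
  kind: "retrieval" | "outbound_message" | "money_movement"
      | "schema_mutation" | "regulatory";
  amountUsd?: number;
  idempotent?: boolean;
}

function policyFor(d: Decision): Policy {
  switch (d.kind) {
    case "retrieval":        return "auto";
    case "outbound_message": return "auto_sample_review";
    case "money_movement":   return (d.amountUsd ?? 0) > 500 ? "human_required" : "auto";
    case "schema_mutation":  return d.idempotent ? "auto" : "human_required";
    case "regulatory":       return "human_required"; // no exception
  }
}
```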

Observability: the same APM as the rest of your stack

We strongly advise against "AI dashboards" as a separate observability surface. Every agent action is a request - wire it into the OpenTelemetry traces your APM already ingests. Tag spans with prompt hash, model name, tool, token counts, and outcome. The on-call engineer who pages at 2 AM should not have to learn a new tool to debug a stuck agent.

OpenTelemetry span attributes we add to every agent action
// Wrap every model call with a span; we use attributes the APM
// can group on without a custom UI.
const span = tracer.startSpan("agent.tool_call", {
  attributes: {
    "agent.id":               agent.id,
    "agent.version":          agent.version,
    "model.name":             model.name,
    "model.provider":         model.provider,
    "tool.name":              tool.name,
    "prompt.hash":            sha256(prompt),
    "tokens.prompt":          tokenCount.prompt,
    "tokens.completion":      tokenCount.completion,
    "cost.usd":               cost.usd,
    "outcome":                outcome,        // ok | retry | escalate | refuse
    "trace.parent_action_id": parentActionId,
  },
});
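The `cost.usd` attribute implies a per-call cost computation upstream of the span. A sketch with an illustrative rate card - the model names and per-million-token rates below are placeholders, not real provider pricing:

```typescript
// Compute per-call USD cost from token counts and a rate table.
// Rates and model names below are illustrative placeholders; keep
// the real rate card in config so pricing changes don't need a deploy.
interface Rates { promptPerMTok: number; completionPerMTok: number }

const RATES: Record<string, Rates> = {
  "example-large": { promptPerMTok: 3.0,  completionPerMTok: 15.0 },
  "example-small": { promptPerMTok: 0.25, completionPerMTok: 1.25 },
};

function costUsd(model: string, promptToks: number, completionToks: number): number {
  const r = RATES[model];
  if (!r) throw new Error(`no rate card for model ${model}`);
  return (promptToks * r.promptPerMTok + completionToks * r.completionPerMTok) / 1_000_000;
}
```

The result is what gets written into the span's `cost.usd` attribute and summed by the gateway's spend caps.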

What to do this week

  1. Pull the seven-item checklist into a Notion / Linear page. Score your agent honestly. Anything below 4/7 is not production-ready, regardless of how the demo went.
  2. Pick the lowest-scoring item and budget 5 days to close it. Eval suite is the highest leverage if you don't have one.
  3. Wire OpenTelemetry traces into your APM today, not next sprint. Every hour of production traffic without telemetry is a debugging deficit.
  4. Schedule a red-team session for prompt injection. 90 minutes with a security engineer; document the cases that worked; add them to the eval suite.
  5. Set spend caps at the gateway, not in application code. Application-level caps fail open under retry storms.
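Point 5 deserves a concrete shape. A gateway-side spend cap is a single reservation check in front of every LLM call, so a retry storm fails closed instead of multiplying spend. The class and cap values here are an illustrative sketch, not a specific gateway's API:

```typescript
// Gateway-level spend cap: every LLM request reserves its estimated
// cost before it is forwarded. Once the cap is hit the gateway
// refuses, so retry storms fail closed. Values are illustrative.
class SpendCap {
  private spentUsd = 0;
  private dailyCapUsd: number;

  constructor(dailyCapUsd: number) {
    this.dailyCapUsd = dailyCapUsd;
  }

  tryReserve(estimatedUsd: number): boolean {
    if (this.spentUsd + estimatedUsd > this.dailyCapUsd) {
      return false; // refuse at the gateway; caller sees e.g. a 429
    }
    this.spentUsd += estimatedUsd;
    return true;
  }
}
```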

References

  1. State of AI Agents 2024 - LangChain (2024)
  2. Indirect prompt injection benchmarks - Anthropic Trust & Safety (2024)
  3. OpenTelemetry semantic conventions for GenAI - OpenTelemetry SIG (2025)

Frequently asked questions

How is this different from a regular pre-launch checklist?

It's specific to the failure modes agentic systems exhibit - looping tool calls, prompt injection through retrieval, drift, cost surprises that don't happen with traditional services. A regular pre-launch checklist won't catch any of these.

Can we ship an agent without an eval suite?

Technically yes; in practice no. Without evals you have no way to detect regressions when models update, prompts change, or retrieval drifts. You'll find out from customer complaints. We've never seen a long-lived production agent without an eval suite - they all either get one or get rolled back.

What's the ROI on this checklist?

Direct: prevented cost overruns (the 5–40× cost spikes above are caught by gateway caps); prevented incidents (red-teamed injection cases). Indirect: shipping confidence - teams with the checklist in place ship 3–4× more agent updates per quarter because each release carries bounded risk.

Does this apply to internal agents or just customer-facing ones?

Both. Internal agents that touch HR data, finance data, or production systems carry the same blast radius as customer-facing ones. The cost and reputational risk profiles differ; the engineering discipline is the same.

How long does it take to retrofit this onto an existing production agent?

Typical AI Rescue engagements run 3–4 weeks for the full checklist on a single agent. Eval suite is week one; observability + cost controls week two; guardrails + red-team week three; drift alarms week four. We do this kind of work all the time.

Talk to engineering

Ready to ship the patterns from this post?

Tell us where you are. A senior forward-deployed engineer replies within 24 hours with a written plan tailored to your stack - never an SDR.

  • Practical engineering review of your current setup
  • Eval discipline + observability + cost controls
  • Free 60-min working session, no sales pitch
