
Evals as the product spec: a different way to ship AI features

Stop writing acceptance criteria - write evals. A practical guide to designing eval suites that pull triple duty as your product spec, your CI gate, and your trust signal for production.

Techimax Engineering - Forward-deployed engineering team - 10 min read - Updated March 18, 2026

Why prose PRDs fail for AI features

Traditional acceptance criteria - "the assistant should respond accurately and helpfully" - are unfalsifiable. Two reasonable engineers will disagree on whether a given response meets them. A model swap can change the answer without changing the prose. A provider tweak can break it silently.

PRD prose was designed for deterministic systems. AI features aren't deterministic; they're held to spec through calibration against example cases. The right primitive is the calibration set itself.

What a good eval suite looks like

The five layers of an enterprise-grade eval suite
  • Golden path cases

    30–50 representative customer interactions with expected behaviors. The "happy path" - but specified in cases, not prose.

  • Adversarial cases

    At least two cases per failure category: wrong tool calls, refusal failures, hallucination temptations, prompt-injection attempts. Cases come from real production traces.

  • Regression cases

    Every bug ever fixed gets a case. Eval suite grows over time; nothing breaks twice.

  • Citation/grounding cases

    For RAG, every output citation is verified against the corpus. Hallucinated citations fail the suite.

  • Refusal cases

    What the agent should not answer. Out-of-scope queries, policy-violating asks, regulated-information requests.
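Taken together, the five layers fit naturally into one tagged case type. A minimal sketch in TypeScript - the field names, the `EvalLayer` union, and the grader kinds are illustrative, not a real schema:

```typescript
// Illustrative eval-case shape covering the five layers; not a real API.
type EvalLayer = "golden" | "adversarial" | "regression" | "grounding" | "refusal";

type Grader =
  | { kind: "contains_facts"; facts: string[] }
  | { kind: "refuses"; reason: string }
  | { kind: "cites_source"; required: boolean };

interface EvalCase {
  id: string;
  layer: EvalLayer;
  prompt: string;
  graders: Grader[];     // behavior checks, not string matches
  sourceTrace?: string;  // production trace this case came from, if any
}

// A regression case: every bug ever fixed gets one of these.
const case142: EvalCase = {
  id: "refund-policy-142",
  layer: "regression",
  prompt: "What is our refund policy?",
  graders: [{ kind: "contains_facts", facts: ["30 days"] }],
};
```

Tagging cases by layer also makes the suite auditable: you can see at a glance which layers are thin.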

Eval-suite size and quality lift over a 12-month engagement

| Week | Eval cases |
|------|------------|
| 1    | 35         |
| 4    | 80         |
| 12   | 220        |
| 26   | 460        |
| 52   | 920        |

Source: Aggregate Techimax engagement telemetry, 50+ pods, 2024–2026

Writing eval cases that survive the next model swap

The temptation is to write tight cases - "output must contain the string X". Don't. Models that swap providers, get fine-tuned, or just get retrained will phrase things differently. Tight cases create false negatives.

Write cases that test behaviors, not strings. Use LLM-graded evaluators where exact-match doesn't apply (an LLM scoring "did the agent refuse this out-of-scope request appropriately"). Calibrate the grader against human review.

An eval case that survives model swaps
// ❌ Brittle - passes/fails on string match
{
  prompt: "What is our refund policy?",
  expected: "Our refund policy is 30 days from purchase",
}

// ✅ Behavior-graded - survives model swaps
{
  prompt: "What is our refund policy?",
  graders: [
    { kind: "contains_facts", facts: ["30 days", "from purchase"] },
    { kind: "tone", target: "concise, professional", min: 0.7 },
    { kind: "cites_source", required: true },
    { kind: "no_hallucination", against: kbSnapshot },
  ],
}
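Structured graders like `contains_facts` and `cites_source` are each a few lines. A sketch, narrowed to two grader kinds; the `[source: ...]` citation marker is an assumption about the output format, not a real convention:

```typescript
type Grader =
  | { kind: "contains_facts"; facts: string[] }
  | { kind: "cites_source"; required: boolean };

// True when the output satisfies the grader. Fact matching is
// case-insensitive so paraphrase-level rewording still passes.
function grade(output: string, grader: Grader): boolean {
  switch (grader.kind) {
    case "contains_facts":
      return grader.facts.every((f) =>
        output.toLowerCase().includes(f.toLowerCase())
      );
    case "cites_source":
      // Assumes citations render as [source: ...] markers.
      return !grader.required || /\[source:[^\]]+\]/.test(output);
  }
}

const output =
  "Refunds are accepted within 30 days from purchase. [source: policy.md]";
console.log(grade(output, { kind: "contains_facts", facts: ["30 days", "from purchase"] })); // true
console.log(grade(output, { kind: "cites_source", required: true }));                        // true
```

LLM-graded checks like tone slot into the same interface; only their implementation calls a model.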

Eval-gating CI without slowing developers down

The first objection to eval-gated CI is always speed. "Running 200 evals on every PR will take 30 minutes; that kills my flow." The objection is real but solvable.

Three techniques compose: stratified sampling (run 30 critical cases on every PR; full 200 on main), parallel execution (eval cases are embarrassingly parallel - run 50 at a time), and result caching (only re-run cases whose dependencies changed). Combined, an eval-gated PR adds 2–5 minutes - comparable to a unit-test suite.
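The three techniques compose in a few dozen lines. A sketch of the selection and execution loop - `runCase`, `depsHash`, and the worker count are assumptions, not a real CI API:

```typescript
interface SuiteCase { id: string; critical: boolean; depsHash: string; }

const cache = new Map<string, boolean>(); // depsHash -> cached pass/fail

// Stratified sampling: critical cases on every PR, full suite on main.
function selectCases(all: SuiteCase[], branch: string): SuiteCase[] {
  return branch === "main" ? all : all.filter((c) => c.critical);
}

// Parallel execution with result caching: N workers pull from a shared
// queue; cases whose dependencies are unchanged are answered from cache.
async function runSuite(
  cases: SuiteCase[],
  runCase: (c: SuiteCase) => Promise<boolean>,
  concurrency = 50
): Promise<boolean> {
  const queue = [...cases];
  const results: boolean[] = [];
  const workers = Array.from({ length: concurrency }, async () => {
    for (let c = queue.shift(); c; c = queue.shift()) {
      const pass = cache.get(c.depsHash) ?? (await runCase(c));
      cache.set(c.depsHash, pass);
      results.push(pass);
    }
  });
  await Promise.all(workers);
  return results.every(Boolean);
}
```

On a PR, `selectCases(all, branchName)` returns only the critical stratum; the nightly job passes `"main"` and gets the full suite.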

| Suite                      | When               | Cases   | Median time |
|----------------------------|--------------------|---------|-------------|
| Critical path              | Every PR           | 30–50   | 90 sec      |
| Full smoke                 | Main + nightly     | 200–400 | 5–8 min     |
| Drift suite (live samples) | Nightly            | 100     | 3 min       |
| Adversarial / red-team     | Weekly + on demand | 150     | 10 min      |

Stratified eval-gating: where each suite runs

The telemetry → eval flywheel

Production traces are the most valuable eval input you have. Pipe them back. Sample 0.1–1% of production traces weekly into a triage queue; add the interesting ones (especially failures) as eval cases. Suite quality compounds; bug-recurrence drops to near zero.
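The sampling step is the only new machinery. A sketch, assuming a flat trace log and a 0.5% sample rate; the trace shape is hypothetical:

```typescript
interface Trace { id: string; failed: boolean; }

// Failures always enter triage; successes are sampled at sampleRate.
function sampleForTriage(traces: Trace[], sampleRate = 0.005): Trace[] {
  return traces.filter((t) => t.failed || Math.random() < sampleRate);
}

// Triaged traces that reveal a real gap become eval cases - that is
// how the suite compounds week over week.
```

Keeping every failure while sampling successes is what makes the flywheel cheap: the triage queue stays small but never misses the cases most likely to become regressions.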

What to do this week

  1. Pick one AI feature. Write 30 eval cases - golden, adversarial, refusal - by end of week.
  2. Stand up the eval runner in CI. Block merge below the calibrated threshold.
  3. Add an eval case for every bug fix going forward. No exceptions.
  4. Wire one production trace per day into the eval triage queue. Review weekly.
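Step 2's merge gate reduces to one comparison. A minimal sketch - the 0.95 threshold is an example, not a prescription:

```typescript
// True when the suite's pass rate clears the calibrated threshold.
// Wire the boolean to the CI exit code to block merges on failure.
function gatePasses(passed: number, total: number, threshold = 0.95): boolean {
  return total > 0 && passed / total >= threshold;
}

// e.g. in the CI entrypoint:
// process.exit(gatePasses(passed, total) ? 0 : 1);
```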


Frequently asked questions

How big should the initial eval suite be?

30–50 cases is enough to get value on day one. Below 30 the suite is too narrow to catch regressions; above 50 you're over-investing before you've seen production traffic. Grow it from there using real traces.

What grader should we use?

Mix exact-match (where appropriate), LLM-graded (for behavior + tone), and structured graders (citation present, tool called correctly, schema valid). Calibrate LLM graders against human review on a 50-case sample monthly.
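The monthly calibration check is easy to automate: run the LLM grader and a human reviewer over the same sample and compare verdicts. A sketch with a hypothetical sample shape:

```typescript
interface CalibrationSample {
  caseId: string;
  graderPass: boolean; // LLM grader's verdict
  humanPass: boolean;  // human reviewer's verdict
}

// Fraction of cases where grader and human agree. Recalibrate the
// grader prompt when this drops below your target (e.g. ~0.9).
function agreementRate(samples: CalibrationSample[]): number {
  const agreed = samples.filter((s) => s.graderPass === s.humanPass).length;
  return agreed / samples.length;
}
```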

Can I share eval suites across products?

Some pieces - refusal cases, security cases, citation cases - yes. Behavior cases are usually product-specific. We recommend a shared library of "baseline" cases plus per-product layers.

How does this apply to RAG?

Especially well. RAG eval cases test (a) retrieval correctness - was the right doc retrieved? (b) grounding - was the answer factually anchored to retrieved docs? (c) citation - was the source cited correctly? Each layer needs evals.
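The three layers can be graded independently on a single RAG trace. A sketch - the result shape and field names are assumptions:

```typescript
interface RagResult {
  retrievedDocIds: string[]; // what the retriever returned
  answer: string;            // generated answer
  citedDocId?: string;       // source the answer cites, if any
}

// One verdict per layer: retrieval, grounding, citation.
function gradeRag(r: RagResult, expectedDocId: string, requiredFact: string) {
  return {
    retrieval: r.retrievedDocIds.includes(expectedDocId),
    grounding: r.answer.includes(requiredFact),
    citation: r.citedDocId === expectedDocId,
  };
}
```

Reporting the layers separately matters: a grounding failure with correct retrieval points at the prompt, while a retrieval failure points at the index.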

Does this require a special platform?

No, but it benefits from one. You can stand up evals in plain Jest/Pytest with structured graders. We built our Eval Platform because at 500+ cases per product across many products, custom infra paid for itself fast.

Talk to engineering

Ready to ship the patterns from this post?

Tell us where you are. A senior forward-deployed engineer replies within 24 hours with a written plan tailored to your stack - never an SDR.

  • Practical engineering review of your current setup
  • Eval discipline + observability + cost controls
  • Free 60-min working session, no sales pitch

Senior reply within 24h

Drop your details and we'll match you with an engineer who's shipped in your industry.

By submitting, you agree to our privacy policy. We'll never share your information.