
From RAG demo to RAG that survives production

Most RAG demos work. Most production RAG doesn't. The retrieval, ranking, grounding, and update-pipeline patterns that separate impressive demos from systems your customers can rely on.

Techimax Engineering · Forward-deployed engineering team · 11 min read

Where production RAG actually breaks

We've shipped RAG into healthcare, BFSI, telecom, and SaaS. The systems that work look different from each other in domain detail, but the engineering patterns are identical. The systems that fail also fail in identical ways - and almost always at the same five points.

The five RAG failure modes we see most
  • Dense-only retrieval

    Misses entity-keyed queries. Customer asks "order #ABC-1234"; vectors return semantically similar orders. BM25 + dense + filters fixes this.

  • No re-ranking

    Top-10 dense results contain noise. A cross-encoder re-rank cuts the noise and the answer quality jumps.

  • Missing or hallucinated citations

    Customers trust answers that link to sources. Grounded citation requires the LLM to copy IDs - and your code to verify them.

  • Stale corpus

    Doc updated yesterday; agent quotes last week's version. Incremental indexing on every doc-source webhook fixes this.

  • No grounding eval

    If your eval suite doesn't check that the answer is supported by retrieved docs, your suite passes when the model hallucinates plausibly.

Hybrid retrieval: BM25 + dense + filters

Dense vectors are great at semantic similarity but miserable at exact matches. BM25 is great at exact matches but blind to paraphrase. Combine them: run both queries in parallel, use reciprocal rank fusion to combine top-K, then apply structured filters (tenant, date range, language). The cost is one extra index and a fusion step. The recall lift is substantial.
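The fusion step can be sketched in a few lines. This is a minimal reciprocal rank fusion, assuming each retriever returns a ranked list of doc IDs; `rrf_fuse`, the `k=60` constant (from the original Cormack et al. paper), and the toy doc IDs are illustrative, not from a specific library:

```python
def rrf_fuse(rankings, k=60, top_k=10):
    """Reciprocal rank fusion: each retriever contributes 1/(k + rank)
    per doc; docs appearing in several rankings accumulate score.

    rankings: list of ranked doc-ID lists, one per retriever
    (e.g. one from BM25, one from the dense index).
    """
    scores = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

bm25 = ["d3", "d1", "d7"]   # exact-match hits
dense = ["d1", "d5", "d3"]  # semantic hits
fused = rrf_fuse([bm25, dense])
# d1 and d3 appear in both lists, so they fuse to the top
```

Structured filters (tenant, date range, language) are applied to both candidate sets before fusion, so the fused list never contains out-of-scope docs.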

Recall@10 by retrieval mechanism on enterprise Q&A benchmark
Source: public BEIR benchmarks + Techimax customer benchmarks, 2024–2026

  Retrieval mechanism       Recall@10
  Dense only                78%
  BM25 only                 62%
  Hybrid (RRF)              91%
  Hybrid + cross-encoder    96%

Grounding: make the model copy IDs

The most reliable way to get grounded answers is to require the model to cite a doc ID for every claim, then verify those IDs against the retrieved set. If a claimed citation isn't in the retrieved set, the answer is rejected and re-generated.

Pair this with an eval grader that scores grounding directly: "is the answer supported by the cited docs?" Score with an LLM grader calibrated against human review. Below 0.85 grounding score, fail the eval.
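A minimal sketch of the verify-and-reject loop. The `[doc:<id>]` citation format, `verify_citations`, and the doc IDs are assumptions for illustration; adapt the pattern to whatever citation markup your prompt enforces:

```python
import re

def verify_citations(answer: str, retrieved_ids: set) -> bool:
    """Reject the answer unless every cited doc ID is in the retrieved set.

    Assumes citations appear as [doc:<id>] markers in the answer text.
    An answer with no citations at all is also rejected.
    """
    cited = set(re.findall(r"\[doc:([\w-]+)\]", answer))
    if not cited:
        return False
    return cited <= retrieved_ids  # every claimed citation must be retrievable

answer = "Refunds take 5 days [doc:kb-142]."
assert verify_citations(answer, {"kb-142", "kb-007"})
assert not verify_citations(answer, {"kb-007"})  # hallucinated citation -> reject
```

On rejection, re-generate with the same retrieved set; if it fails twice, surface "no grounded answer" rather than an unverified one.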

Incremental ingestion - never bulk-rebuild in prod

Bulk re-indexing is a footgun. It leaves the corpus stale for hours during the rebuild and gets expensive above roughly 1M docs. Use incremental ingestion: webhooks from your CMS / SharePoint / Confluence trigger a per-doc re-embed and per-doc upsert. Add a per-doc version field; old versions are tombstoned, not deleted, until in-flight queries drain.
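The upsert-with-tombstone pattern, sketched against an in-memory dict standing in for the vector store. `upsert_doc`, `live_chunks`, and `fake_embed` are hypothetical names for illustration:

```python
def upsert_doc(index, doc_id, chunks, embed, version):
    """Per-doc incremental upsert, as triggered by a doc-source webhook.

    Old rows are tombstoned rather than deleted, so in-flight queries
    can drain against the version they started on; a background job
    hard-deletes tombstoned rows later.
    """
    for row in index.get(doc_id, []):
        row["tombstoned"] = True
    index.setdefault(doc_id, []).extend(
        {"chunk": c, "vector": embed(c), "version": version, "tombstoned": False}
        for c in chunks
    )

def live_chunks(index, doc_id):
    """What new queries see: only the non-tombstoned rows."""
    return [r["chunk"] for r in index.get(doc_id, []) if not r["tombstoned"]]

index = {}
fake_embed = lambda text: [0.0]  # stand-in for a real embedding call
upsert_doc(index, "kb-142", ["v1 text"], fake_embed, version=1)
upsert_doc(index, "kb-142", ["v2 text"], fake_embed, version=2)
assert live_chunks(index, "kb-142") == ["v2 text"]
```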

Indexing strategy by corpus size

  Corpus size     Strategy                                   Refresh latency
  < 100K docs     Full re-index nightly is OK                < 24h
  100K–1M docs    Incremental + weekly compaction            < 1h
  1M–10M docs     Strict incremental; sharded compaction     < 15min
  > 10M docs      Streaming ingest; multi-region replicas    < 2min

What to do this sprint

  1. Add a 100-case grounding eval suite. Score each case for: retrieval correctness, grounding, citation accuracy.
  2. Stand up hybrid retrieval (BM25 + dense + RRF). Most vector DBs ship both index types; turn them on.
  3. Wire incremental ingestion from your top 3 doc sources (CMS, Confluence, SharePoint). Webhooks; not nightly batches.
  4. Add a cross-encoder re-rank step on top-50 → top-10. Even a small re-ranker (e.g., Cohere rerank-3, BGE reranker-v2) is enough.
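Step 4 can be sketched dependency-free. `rerank` and `overlap_score` are illustrative; in production `score_fn` would be a real cross-encoder call (e.g. a sentence-transformers CrossEncoder or the Cohere rerank API), not word overlap:

```python
def rerank(query, docs, score_fn, top_k=10):
    """Cross-encoder re-rank of retrieval candidates (top-50 -> top-10).

    score_fn(query, text) is a parameter standing in for the real
    scorer so the sketch has no model dependency.
    """
    return sorted(docs, key=lambda d: score_fn(query, d["text"]), reverse=True)[:top_k]

def overlap_score(query, text):
    """Toy scorer: fraction of query words found in the doc text."""
    q = set(query.lower().split())
    return len(q & set(text.lower().split())) / max(len(q), 1)

docs = [
    {"id": "a", "text": "shipping policy for orders"},
    {"id": "b", "text": "refund policy for damaged orders"},
    {"id": "c", "text": "careers page"},
]
top = rerank("refund policy", docs, overlap_score, top_k=2)
assert [d["id"] for d in top] == ["b", "a"]
```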

References

  1. BEIR: Benchmarking IR (academic, 2024)
  2. Reciprocal rank fusion (Cormack et al., 2009)

Frequently asked questions

Pinecone vs pgvector vs Weaviate?

All three work for sub-10M-doc corpora. We default to pgvector for teams already on Postgres (operational simplicity); Pinecone for serverless/burst workloads; Weaviate when hybrid retrieval is the primary requirement. The vector DB is rarely the differentiator; the retrieval pipeline around it is.

How do we handle multi-tenant isolation?

Tenant ID as a structured filter on every query. We don't trust LLMs with multi-tenant access. The filter is enforced at the retrieval layer, not the prompt.
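A sketch of what "enforced at the retrieval layer" means; `retrieve` and the in-memory `rows` are stand-ins, since real stores express the same thing as a metadata filter (a WHERE clause in pgvector, a filter dict in Pinecone or Weaviate):

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def retrieve(query_vec, tenant_id, rows, top_k=10):
    """Tenant isolation in the retrieval layer, not the prompt.

    The hard filter runs before similarity scoring, so no prompt
    injection can surface another tenant's docs.
    """
    candidates = [r for r in rows if r["tenant_id"] == tenant_id]
    candidates.sort(key=lambda r: dot(query_vec, r["vector"]), reverse=True)
    return candidates[:top_k]

rows = [
    {"id": "a1", "tenant_id": "acme", "vector": [1.0, 0.0]},
    {"id": "b1", "tenant_id": "beta", "vector": [1.0, 0.0]},
]
hits = retrieve([1.0, 0.0], "acme", rows)
assert all(r["tenant_id"] == "acme" for r in hits)
```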

Does fine-tuning beat RAG?

Rarely, for enterprise knowledge. Fine-tuning encodes static knowledge into a model that retrains slowly and updates expensively. RAG keeps knowledge external and updates on every doc change. We use fine-tuning for behavior (tone, format) and RAG for knowledge.

What chunk size?

300–600 tokens for prose; per-row for structured data; per-section for technical docs. Smaller than that fragments context; larger dilutes embeddings. Test with grounding evals - chunk size is empirical.
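For prose, the baseline is fixed-size chunking with overlap; `chunk_tokens` is an illustrative sketch, assuming the doc is already tokenized (e.g. with tiktoken):

```python
def chunk_tokens(tokens, size=500, overlap=50):
    """Fixed-size chunking with overlap for prose.

    size=500 sits inside the 300-600 token range above; the overlap
    keeps sentences that straddle a chunk boundary retrievable from
    both sides. Tune both with grounding evals, as above.
    """
    chunks, start = [], 0
    step = size - overlap
    while start < len(tokens):
        chunks.append(tokens[start:start + size])
        start += step
    return chunks

chunks = chunk_tokens(list(range(1200)))
assert len(chunks) == 3
assert chunks[1][0] == 450  # 50-token overlap with the previous chunk
```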

Talk to engineering

Ready to ship the patterns from this post?

Tell us where you are. A senior forward-deployed engineer replies within 24 hours with a written plan tailored to your stack - never an SDR.

  • Practical engineering review of your current setup
  • Eval discipline + observability + cost controls
  • Free 60-min working session, no sales pitch
