Where production RAG actually breaks
We've shipped RAG into healthcare, BFSI, telecom, and SaaS. The systems that work differ in domain detail, but the engineering patterns are identical. The systems that fail also fail in identical ways - and almost always at the same five points.
- Dense-only retrieval: misses entity-keyed queries. A customer asks for "order #ABC-1234"; vectors return semantically similar orders. BM25 + dense + filters fixes this.
- No re-ranking: top-10 dense results contain noise. A cross-encoder re-rank cuts the noise and answer quality jumps.
- Missing or hallucinated citations: customers trust answers that link to sources. Grounded citation requires the LLM to copy IDs - and your code to verify them.
- Stale corpus: a doc updated yesterday; the agent quotes last week's version. Incremental indexing on every doc-source webhook fixes this.
- No grounding eval: if your eval suite doesn't check that the answer is supported by the retrieved docs, your suite passes when the model hallucinates plausibly.
Hybrid retrieval: BM25 + dense + filters
Dense vectors are great at semantic similarity but miserable at exact matches. BM25 is great at exact matches but blind to paraphrase. Combine them: run both queries in parallel, use reciprocal rank fusion to combine top-K, then apply structured filters (tenant, date range, language). The cost is one extra index and a fusion step. The recall lift is substantial.
Source: public BEIR benchmarks + Techimax customer benchmarks, 2024–2026

| Retrieval strategy | Recall@10 (%) |
|---|---|
| Dense only | 78 |
| BM25 only | 62 |
| Hybrid (RRF) | 91 |
| Hybrid + cross-encoder | 96 |
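A minimal sketch of the fusion step, assuming each retriever returns a ranked list of doc IDs; k=60 is the conventional constant from Cormack et al. [2]:

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked lists of doc IDs; k=60 is the common default."""
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Run both retrievers in parallel, fuse, then apply structured filters.
bm25_hits  = ["doc7", "doc2", "doc9"]   # from the keyword index
dense_hits = ["doc2", "doc4", "doc7"]   # from the vector index
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
# -> doc2 and doc7 rise to the top because both retrievers agree on them
```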
Grounding: make the model copy IDs
The most reliable way to get grounded answers is to require the model to cite a doc ID for every claim, then verify those IDs against the retrieved set. If a claimed citation isn't in the retrieved set, the answer is rejected and re-generated.
Pair this with an eval grader that scores grounding directly: "is the answer supported by the cited docs?" Score with an LLM grader calibrated against human review. Below 0.85 grounding score, fail the eval.
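A minimal sketch of the verify-and-retry loop; the `[doc:...]` citation token format and the `generate_answer` callable are placeholders for whatever your stack uses:

```python
import re

CITATION = re.compile(r"\[doc:([A-Za-z0-9_-]+)\]")  # e.g. "... [doc:kb-123]"

def grounded_answer(query, retrieved_ids, generate_answer, max_attempts=3):
    """Reject and regenerate any answer whose citations aren't in the retrieved set."""
    for _ in range(max_attempts):
        answer = generate_answer(query, retrieved_ids)  # your LLM call
        cited = set(CITATION.findall(answer))
        # Require at least one citation, and every cited ID must be verifiable.
        if cited and cited <= set(retrieved_ids):
            return answer
    raise ValueError("could not produce a grounded answer")
```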
Incremental ingestion - never bulk-rebuild in prod
Bulk re-indexing is a footgun. It leaves the corpus stale for hours during the rebuild and is expensive at corpus sizes above 1M docs. Use incremental ingestion: webhooks from your CMS / SharePoint / Confluence trigger a per-doc re-embed and per-doc upsert. Add a per-doc version field; old versions are tombstoned, not deleted, until in-flight queries drain.
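A minimal webhook handler sketch; `event`, `embedder`, and `store` are hypothetical stand-ins for your doc-source payload, embedding client, and vector DB wrapper:

```python
def handle_doc_webhook(event, embedder, store):
    """Per-doc incremental ingest: re-embed and upsert on every source change."""
    doc_id, version = event["doc_id"], event["version"]
    if event["action"] == "deleted":
        store.tombstone(doc_id)  # keep until in-flight queries drain, then purge
        return
    chunks = event["body"].split("\n\n")  # naive chunking, just for the sketch
    for i, chunk in enumerate(chunks):
        store.upsert(
            id=f"{doc_id}:{i}",
            vector=embedder.embed(chunk),
            metadata={"doc_id": doc_id, "version": version},
        )
    store.tombstone_versions(doc_id, older_than=version)  # supersede stale chunks
```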
| Corpus size | Strategy | Refresh latency |
|---|---|---|
| < 100K docs | Nightly full re-index acceptable | < 24h |
| 100K–1M docs | Incremental + weekly compaction | < 1h |
| 1M–10M docs | Strict incremental; sharded compaction | < 15min |
| > 10M docs | Streaming ingest; multi-region replicas | < 2min |
What to do this sprint
- Add a 100-case grounding eval suite. Score each case for: retrieval correctness, grounding, citation accuracy.
- Stand up hybrid retrieval (BM25 + dense + RRF). Most vector DBs include both indices; just turn it on.
- Wire incremental ingestion from your top 3 doc sources (CMS, Confluence, SharePoint). Webhooks, not nightly batches.
- Add a cross-encoder re-rank step on top-50 → top-10. Even a small re-ranker (e.g., Cohere rerank-3, BGE reranker-v2) is enough.
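The re-rank step is a few lines with sentence-transformers; the BGE model name below is one example, swap in whatever re-ranker you've benchmarked:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")  # any small cross-encoder works

def rerank(query, candidates, top_n=10):
    """Score (query, passage) pairs jointly; keep the best top_n of the ~50 fused hits."""
    scores = reranker.predict([(query, c["text"]) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    return [c for c, _ in ranked[:top_n]]
```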
References
- [1] BEIR: Benchmarking IR (academic, 2024)
- [2] Reciprocal rank fusion, Cormack et al. (2009)
Frequently asked questions
Pinecone vs pgvector vs Weaviate?
All three work for sub-10M-doc corpora. We default to pgvector for teams already on Postgres (operational simplicity); Pinecone for serverless/burst workloads; Weaviate when hybrid retrieval is the primary requirement. The vector DB is rarely the differentiator; the retrieval pipeline around it is.
How do we handle multi-tenant isolation?
Tenant ID as a structured filter on every query. We don't trust LLMs with multi-tenant access. The filter is enforced at the retrieval layer, not the prompt.
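As a sketch, here is what that looks like with pgvector (table and column names are hypothetical):

```python
def retrieve_for_tenant(conn, tenant_id, query_vec, top_k=10):
    """Tenant isolation enforced in SQL at the retrieval layer, never in the prompt."""
    # Assumes pgvector's psycopg adapter is registered for the vector parameter.
    return conn.execute(
        """
        SELECT doc_id, body
        FROM chunks
        WHERE tenant_id = %s          -- hard filter: other tenants are unreachable
        ORDER BY embedding <=> %s     -- pgvector cosine-distance operator
        LIMIT %s
        """,
        (tenant_id, query_vec, top_k),
    ).fetchall()
```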
Does fine-tuning beat RAG?
Rarely, for enterprise knowledge. Fine-tuning encodes static knowledge into a model that retrains slowly and updates expensively. RAG keeps knowledge external and updates on every doc change. We use fine-tuning for behavior (tone, format) and RAG for knowledge.
What chunk size?
300–600 tokens for prose; per-row for structured data; per-section for technical docs. Smaller than that fragments context; larger dilutes embeddings. Test with grounding evals - chunk size is empirical.
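A minimal token-window chunker for the prose case, assuming a tiktoken tokenizer; sweep `target` against your grounding evals rather than guessing:

```python
import tiktoken  # assumption: OpenAI-style tokenizer; swap in your model's

def chunk_prose(text, target=450, overlap=50):
    """Split prose into ~300-600-token windows with a small overlap."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    step = target - overlap
    return [enc.decode(tokens[i:i + target]) for i in range(0, len(tokens), step)]
```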