Most RAG Systems Fail Before Generation
When a RAG system performs poorly, the model usually gets the blame. In practice, the failure is more often upstream: badly sized chunks, weak metadata, poor ranking, stale content, or missing evaluation.
What We Optimize First
Chunking Strategy
Chunk by meaning, not by arbitrary token count. Contracts, FAQs, product docs, and tickets each need different chunk boundaries. Good chunking often improves relevance more than a model upgrade does.
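As a rough illustration of chunking by meaning, the sketch below packs whole paragraphs into chunks instead of cutting at a fixed length. It assumes blank lines mark paragraph boundaries, which suits product docs better than contracts or tickets. The function name and the max_chars default are hypothetical, not taken from any particular library.

```python
import re


def chunk_by_paragraph(text: str, max_chars: int = 800) -> list[str]:
    """Pack whole paragraphs into chunks, never splitting inside a paragraph."""
    # Assumption: a blank line separates paragraphs.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        # A paragraph longer than max_chars is kept whole as its own chunk.
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks
```

Other document types would likely need a different splitter, for example clause numbers in contracts or one question-and-answer pair per chunk in FAQs.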
Metadata and Filters
Retrieval improves noticeably when documents carry metadata such as source type, product area, version, language, and freshness. Apply filters before ranking whenever possible, so the ranker only scores documents that could be valid answers.
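A minimal sketch of filtering before ranking, using an in-memory list in place of a real vector store. The Doc fields mirror the metadata listed above. The score field stands in for a first-pass similarity score, and filter_then_rank is a hypothetical helper, not an API from any specific library.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Doc:
    text: str
    source_type: str
    product_area: str
    version: str
    language: str
    updated: date   # freshness signal
    score: float    # assumed first-pass similarity score


def filter_then_rank(docs, *, updated_since=None, top_k=3, **required):
    """Drop documents that fail metadata filters, then rank the remainder."""
    eligible = [
        d for d in docs
        if all(getattr(d, field) == value for field, value in required.items())
        and (updated_since is None or d.updated >= updated_since)
    ]
    return sorted(eligible, key=lambda d: d.score, reverse=True)[:top_k]
```

A call such as filter_then_rank(docs, version="v2", language="en") would exclude a high-scoring document in the wrong language before it can crowd out a valid one.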
Reranking
First-pass retrieval finds candidates; reranking picks the best context from them. Many teams stop after the first pass. A lightweight reranker often gives the biggest quality jump for the lowest cost.
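The two stages can be sketched as follows. Both scorers here are toy stand-ins chosen so the example runs without dependencies: first_pass uses plain term overlap, and toy_score rewards chunks that are dense in query terms. In a real system the first pass would typically be vector or keyword search, and score_fn would wrap an actual reranking model. All function names are hypothetical.

```python
import re


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def first_pass(query: str, chunks: list[str], k: int = 20) -> list[str]:
    """Cheap recall-oriented stage: keep chunks sharing any query term."""
    q = set(tokenize(query))
    scored = [(len(q & set(tokenize(c))), c) for c in chunks]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:k]]


def rerank(query: str, candidates: list[str], score_fn, top_n: int = 3) -> list[str]:
    """Precision-oriented stage: rescore only the short candidate list."""
    return sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)[:top_n]


def toy_score(query: str, chunk: str) -> float:
    """Stand-in for a reranking model: share of chunk tokens that are query terms."""
    q = set(tokenize(query))
    tokens = tokenize(chunk)
    return sum(t in q for t in tokens) / len(tokens) if tokens else 0.0
```

Because score_fn is passed in, the reranker can be swapped or evaluated on its own without touching the first-pass retriever.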
Latency Without Losing Quality
- Cache high-frequency queries and embeddings
- Limit retrieved chunks aggressively, then rerank
- Summarize oversized documents offline, not at request time
- Store canonical answers for the top repetitive intents
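Two of the items above, caching embeddings and storing canonical answers, can be sketched with the standard library alone. The embed function is a placeholder for a real embedding call, and CANONICAL_ANSWERS with its single entry is a hypothetical example. A production setup would more likely use a shared cache than an in-process one.

```python
from functools import lru_cache


def normalize(query: str) -> str:
    """Collapse case and whitespace so near-identical queries share a cache key."""
    return " ".join(query.lower().split())


# Hypothetical canonical answers for the most repetitive intents.
CANONICAL_ANSWERS = {
    "how do i reset my password": "Use the reset link on the login page.",
}


def embed(text: str) -> tuple[float, ...]:
    # Placeholder: a real system would call an embedding model here.
    return tuple(float(ord(ch)) for ch in text[:8])


@lru_cache(maxsize=10_000)
def cached_embedding(normalized_query: str) -> tuple[float, ...]:
    return embed(normalized_query)


def answer(query: str) -> tuple[str, object]:
    key = normalize(query)
    if key in CANONICAL_ANSWERS:
        # Skip retrieval and generation entirely for known intents.
        return ("canonical", CANONICAL_ANSWERS[key])
    # Retrieval and generation are elided; only the cached embedding is shown.
    return ("retrieval", cached_embedding(key))
```

Normalizing before lookup matters more than cache size in this sketch: without it, trivial differences in casing or spacing would each miss the cache.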
RAG Checklist
- Source freshness tracked and alerting configured
- Chunking evaluated on real user questions
- Reranker performance measured separately from generator performance
- Groundedness score logged for every production answer
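For the last checklist item, here is a minimal sketch of logging a groundedness score per answer. The score is a crude lexical proxy, assumed purely for illustration: the share of answer sentences whose words mostly appear in the retrieved context. Production systems often use an entailment model or an LLM judge instead. The function names, logger name, and 0.6 threshold are all hypothetical.

```python
import json
import logging
import re

logger = logging.getLogger("rag.groundedness")


def _tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def groundedness(answer: str, contexts: list[str], threshold: float = 0.6) -> float:
    """Fraction of answer sentences lexically supported by the retrieved context."""
    context_tokens = set().union(*(_tokens(c) for c in contexts))
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    if not sentences:
        return 0.0
    supported = 0
    for sentence in sentences:
        toks = _tokens(sentence)
        if toks and len(toks & context_tokens) / len(toks) >= threshold:
            supported += 1
    return supported / len(sentences)


def log_answer(query: str, answer: str, contexts: list[str]) -> float:
    """Emit one structured log line per production answer."""
    score = groundedness(answer, contexts)
    logger.info(json.dumps({"query": query, "groundedness": round(score, 3)}))
    return score
```

Logging the score as structured JSON keeps it queryable, so drops in groundedness can be tracked over time alongside the freshness and reranker metrics above.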