Retrieval-Augmented Generation demos well. You embed a document corpus, retrieve the top-k chunks relevant to a query, inject them into a prompt, and the model answers with real context instead of hallucinating. It looks reliable.
Production is different. Users ask questions you didn't anticipate. The retriever surfaces low-confidence results. The model hedges, or worse, confidently asserts something the retrieved documents don't actually support. Trust erodes fast once users notice inaccuracies.
The gap between a RAG demo and a RAG product is almost entirely about validation, fallbacks, and honesty — not the retrieval or generation mechanics themselves.
Retrieval quality: don't trust the top-k blindly
The default pattern is to retrieve top-k chunks and pass them all. The problem: similarity search is a ranking algorithm, not a relevance filter. The "most similar" result might still be completely irrelevant to the question — particularly for queries that fall outside the corpus.
Add a relevance threshold. If the highest-scoring retrieved chunk is below a minimum similarity score, treat it as a failed retrieval rather than passing weak context:
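// Minimum similarity for the top hit; calibrate per embedding model and corpus.
const RELEVANCE_THRESHOLD = 0.72

async function retrieveWithConfidence(
  query: string,
  k: number = 5,
): Promise<{ chunks: string[]; confident: boolean }> {
  // Each result is a [document, score] pair, best match first.
  // Assumes a higher score means more similar; some stores return a distance instead.
  const results = await vectorStore.similaritySearchWithScore(query, k)
  const confident = results.length > 0 && results[0][1] >= RELEVANCE_THRESHOLD
  // Pass no context at all when the best hit is weak.
  const chunks = confident ? results.map(([doc]) => doc.pageContent) : []
  return { chunks, confident }
}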
The threshold value depends on your embedding model and corpus — calibrate it against a labelled eval set, not intuition.
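One way to do that calibration, as a rough sketch: record the top similarity score for each query in a labelled set, mark whether the corpus can actually answer it, and sweep candidate thresholds. The `LabelledQuery` shape and the `evaluateThreshold` and `pickThreshold` helpers are hypothetical names for illustration, and optimising for F1 is an assumption; you may prefer to weight precision more heavily if wrong answers cost more than refusals.

```typescript
// Hypothetical eval record: the top similarity score the retriever returned
// for a query, plus a human label saying whether the corpus can answer it.
interface LabelledQuery {
  topScore: number
  answerable: boolean
}

interface ThresholdResult {
  threshold: number
  precision: number
  recall: number
  f1: number
}

// Score one candidate threshold, where "confident" means topScore >= threshold.
function evaluateThreshold(data: LabelledQuery[], threshold: number): ThresholdResult {
  let tp = 0
  let fp = 0
  let fn = 0
  for (const item of data) {
    const confident = item.topScore >= threshold
    if (confident && item.answerable) tp++
    else if (confident && !item.answerable) fp++
    else if (!confident && item.answerable) fn++
  }
  const precision = tp + fp === 0 ? 0 : tp / (tp + fp)
  const recall = tp + fn === 0 ? 0 : tp / (tp + fn)
  const f1 = precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall)
  return { threshold, precision, recall, f1 }
}

// Sweep the candidates and keep the one with the best F1 (first wins on ties).
function pickThreshold(data: LabelledQuery[], candidates: number[]): ThresholdResult {
  let best: ThresholdResult | null = null
  for (const t of candidates) {
    const result = evaluateThreshold(data, t)
    if (best === null || result.f1 > best.f1) best = result
  }
  if (best === null) throw new Error("need at least one candidate threshold")
  return best
}
```

Rerun the sweep whenever the embedding model or the corpus changes materially, since score distributions shift with both.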
The honest no-answer
When retrieval confidence is low, the right answer is to say so. Not a hallucinated response, not a confident statement unsupported by context, but an explicit "I don't have enough information in the documents to answer this."
This sounds like a failure mode. It's actually a trust-building feature. Users learn quickly what the system knows and doesn't know. A system that admits uncertainty is one users can calibrate their reliance on. A system that confidently produces plausible-but-wrong answers is one users stop trusting entirely after the first mistake.
Build the "I don't know" path as a first-class response type, not an error state.
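A minimal sketch of what "first-class" could look like in the type system, assuming the `{ chunks, confident }` result shape from `retrieveWithConfidence`. The `RagResponse` type, the `respond` and `render` functions, and the injected `generate` callback are hypothetical names, not part of any library.

```typescript
// Hypothetical response shape: "no_answer" is a normal variant, not an exception.
type RagResponse =
  | { kind: "answer"; text: string; sources: string[] }
  | { kind: "no_answer"; reason: "low_retrieval_confidence"; message: string }

const NO_ANSWER_MESSAGE =
  "I don't have enough information in the documents to answer this."

// `generate` stands in for the model call. It is only invoked when retrieval
// was confident, so weak context never reaches the prompt.
async function respond(
  retrieval: { chunks: string[]; confident: boolean },
  generate: (chunks: string[]) => Promise<string>,
): Promise<RagResponse> {
  if (!retrieval.confident) {
    return {
      kind: "no_answer",
      reason: "low_retrieval_confidence",
      message: NO_ANSWER_MESSAGE,
    }
  }
  const text = await generate(retrieval.chunks)
  return { kind: "answer", text, sources: retrieval.chunks }
}

// The UI handles both variants in the same switch, so the no-answer path
// gets designed and tested like any other response.
function render(response: RagResponse): string {
  switch (response.kind) {
    case "answer":
      return response.text
    case "no_answer":
      return response.message
  }
}
```

Because the union is exhaustive, adding a third variant later (say, a partial answer) makes the compiler flag every place that needs to handle it.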
Grounding checks: source citation is not enough
Citing the source chunk doesn't prove the generated answer is faithful to it. Models can cite a passage and then assert something the passage doesn't say, or synthesise across passages in ways that introduce subtle inaccuracies.
Two approaches that work in production:
Entailment check: after generating the answer, run a secondary prompt that asks "does this answer follow from the provided context?" with a structured yes/no output. Reject and regenerate on a no. Adds latency but catches the most damaging errors.
Quote grounding: instruct the model to quote the specific sentence from the context that supports each claim. If it can't quote, it can't claim. This works well for structured Q&A and document review tasks; it's too rigid for conversational use.
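Both checks can be sketched in a few lines. This is an illustration under assumptions rather than a reference implementation: the `QuotedClaim` shape is a hypothetical output format you would instruct the model to produce, and `generate` and `isEntailed` are injected stand-ins for your own model calls.

```typescript
// Hypothetical claim shape: each claim carries the verbatim sentence it relies on.
interface QuotedClaim {
  claim: string
  quote: string
}

// Collapse whitespace and case so line wraps in a chunk don't cause false misses.
function normalise(s: string): string {
  return s.replace(/\s+/g, " ").trim().toLowerCase()
}

// Quote grounding: a claim counts as grounded only if its quote appears in some chunk.
function splitByGrounding(
  claims: QuotedClaim[],
  chunks: string[],
): { grounded: QuotedClaim[]; ungrounded: QuotedClaim[] } {
  const haystacks = chunks.map(normalise)
  const grounded: QuotedClaim[] = []
  const ungrounded: QuotedClaim[] = []
  for (const c of claims) {
    const q = normalise(c.quote)
    const found = q.length > 0 && haystacks.some((h) => h.includes(q))
    if (found) grounded.push(c)
    else ungrounded.push(c)
  }
  return { grounded, ungrounded }
}

// Entailment check: generate, ask a judge, and regenerate on a "no",
// up to maxAttempts. Returns null when no attempt passes.
async function generateWithEntailmentCheck(
  generate: () => Promise<string>,
  isEntailed: (answer: string) => Promise<boolean>,
  maxAttempts: number = 2,
): Promise<string | null> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const answer = await generate()
    if (await isEntailed(answer)) return answer
  }
  return null
}
```

Capping the retries keeps the added latency bounded. A `null` result can be routed to the same "I don't know" response used for low-confidence retrieval.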
Chunk design matters more than embedding model
Engineers spend a lot of time benchmarking embedding models. They spend comparatively little time on chunking strategy, which often has a larger effect on retrieval quality.
Useful chunking principles:
- Chunk at semantic boundaries (paragraph, section) rather than fixed token count
- Include a small overlap between chunks to avoid splitting context across boundaries
- Prepend document metadata (title, section heading, date) to each chunk before embedding
- For structured documents (FAQs, policy docs), keep question and answer together in one chunk
The embedding model is a commodity. Chunking is where you leave the most quality on the table.
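The first three principles can be sketched as a small chunker. It is a simplified illustration with assumptions: paragraphs are separated by blank lines, size is measured in characters rather than tokens, the overlap is the last paragraph of the previous chunk, and `DocMeta` and `chunkDocument` are hypothetical names. Keeping FAQ question and answer pairs together would need a format-specific splitter and is not handled here.

```typescript
// Hypothetical metadata carried with each document.
interface DocMeta {
  title: string
  section?: string
  date?: string
}

// Split at paragraph boundaries, pack paragraphs up to a soft character limit,
// repeat the last paragraph of each chunk at the start of the next as overlap,
// and prepend metadata so it is embedded along with the text.
function chunkDocument(text: string, meta: DocMeta, maxChars: number = 1200): string[] {
  const paragraphs = text
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter((p) => p.length > 0)

  const headerParts: string[] = [meta.title]
  if (meta.section) headerParts.push(meta.section)
  if (meta.date) headerParts.push(meta.date)
  const header = headerParts.join(" | ")

  const chunks: string[] = []
  let current: string[] = []
  let size = 0
  let fresh = 0 // paragraphs in `current` that have not been emitted yet

  for (const p of paragraphs) {
    // Flush only when there is new material, so an overlap-only chunk is never emitted.
    if (fresh > 0 && size + p.length > maxChars) {
      chunks.push(header + "\n\n" + current.join("\n\n"))
      current = current.slice(-1) // carry the last paragraph forward as overlap
      size = current.reduce((n, s) => n + s.length, 0)
      fresh = 0
    }
    current.push(p)
    size += p.length
    fresh++
  }
  if (fresh > 0) chunks.push(header + "\n\n" + current.join("\n\n"))
  return chunks
}
```

The limit is soft: a single long paragraph, or an overlap paragraph plus a long successor, can exceed `maxChars` rather than being cut mid-sentence.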
Monitoring in production
RAG is not set-and-forget. You need:
- Retrieval hit rate (% of queries where confident: true)
- Answer acceptance rate (if users can thumbs-up/down)
- Low-confidence query logs for corpus gap analysis
- Latency percentiles for retrieval and generation separately
The corpus gap analysis is the most actionable: questions that consistently fail to retrieve confident results tell you what content is missing. Fill those gaps and your product gets measurably better without changing any code.
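As a rough sketch, all four metrics fall out of a single per-query log record. The `QueryLog` shape and the `summarise` helper are hypothetical, and the percentile uses the simple nearest-rank method; a real deployment would more likely push these fields to an existing metrics system.

```typescript
// Hypothetical per-query log record.
interface QueryLog {
  query: string
  confident: boolean
  retrievalMs: number
  generationMs: number | null // null when no generation ran
  feedback?: "up" | "down"
}

// Nearest-rank percentile; returns NaN for an empty list.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return NaN
  const sorted = values.slice().sort((a, b) => a - b)
  const rank = Math.min(sorted.length, Math.max(1, Math.ceil((p / 100) * sorted.length)))
  return sorted[rank - 1] ?? NaN
}

function summarise(logs: QueryLog[]) {
  const total = logs.length
  const confidentCount = logs.filter((l) => l.confident).length
  const ups = logs.filter((l) => l.feedback === "up").length
  const downs = logs.filter((l) => l.feedback === "down").length
  const retrieval = logs.map((l) => l.retrievalMs)
  const generation: number[] = []
  for (const l of logs) {
    if (l.generationMs !== null) generation.push(l.generationMs)
  }
  return {
    hitRate: total === 0 ? NaN : confidentCount / total,
    // Acceptance is computed only over queries that received feedback.
    acceptanceRate: ups + downs === 0 ? null : ups / (ups + downs),
    retrievalP50: percentile(retrieval, 50),
    retrievalP95: percentile(retrieval, 95),
    generationP95: percentile(generation, 95),
    // Raw material for corpus gap analysis.
    lowConfidenceQueries: logs.filter((l) => !l.confident).map((l) => l.query),
  }
}
```

Grouping `lowConfidenceQueries` by topic, even manually at first, is usually enough to spot which content is missing.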
RAG done well is a product discipline, not just a technical one.