
RAG in practice: what actually breaks at scale

Building a retrieval-augmented system is straightforward in a notebook. Keeping it honest in production is not.

Feb 10, 2026 · 7 min read

Retrieval-augmented generation sounds simple on paper: embed the user's query, pull the closest documents, inject them into the prompt, get an answer. The five-line demo in every LLM tutorial makes it look like a solved problem.

It is not.

I have built two production RAG systems: one for internal documentation search, one for a customer-facing support bot. The failures I remember are not the ones in the papers. Here is what actually went wrong.

#The retrieval ceiling

The most common failure mode is one the benchmark numbers hide: your retrieval is fine at finding related documents, but the answer is not in them.

The model does not tell you this. It answers anyway, blending what it retrieved with what it knows. The output looks confident. The user has no idea the ground truth was never in the context.

The fix is a confidence gate, not a better embedding model:

lib/rag.ts
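// Assumed to exist elsewhere: retrieve(), generate(), and RETRIEVAL_THRESHOLD.
// Assumes a higher score means a closer match.
async function answer(query: string) {
  const docs = await retrieve(query, { topK: 5 });
  const scores = docs.map(d => d.score);

  // Math.max() of an empty list is -Infinity, so zero results also take this branch.
  if (Math.max(...scores) < RETRIEVAL_THRESHOLD) {
    return { answer: null, reason: 'no-relevant-docs' };
  }

  return generate(query, docs);
}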

Return null and fall back to an "I don't have information on that" message. Users handle uncertainty better than confident hallucinations.

#Chunk boundaries are information loss

Splitting documents into fixed-size chunks is the tutorial default. It is also how you guarantee that any sentence spanning a chunk boundary never reaches the model in one piece.
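To make the boundary problem concrete, here is a minimal sketch of naive fixed-size splitting. It uses whitespace-separated words as a stand-in for tokens, and `fixed_size_chunks` and the sample text are hypothetical, made up for illustration.

```python
def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Split text every `size` words, ignoring sentence boundaries.

    Words are a rough stand-in for tokens here (an assumption for illustration).
    """
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


# Made-up sample document with two short sentences.
doc = (
    "Refunds are available within 30 days. "
    "After 30 days, refunds require manager approval and a written reason."
)

chunks = fixed_size_chunks(doc, size=10)
for chunk in chunks:
    print(repr(chunk))
```

The second sentence is cut between "refunds" and "require", so neither chunk contains the whole rule. A query about late refunds can retrieve the condition without the consequence, or the other way around.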

The fix I use now is overlap plus semantic splitting. Instead of splitting every N tokens, I split on paragraph boundaries and overlap each chunk with the tail of the previous one:

chunking.py
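def count_tokens(text: str) -> int:
    # Placeholder so the example runs: counts whitespace-separated words.
    # Swap in your model's tokenizer for real token counts.
    return len(text.split())


def chunk_with_overlap(text: str, max_tokens: int = 512, overlap: int = 64):
    """Split on paragraph boundaries, prefixing each chunk with the previous chunk's tail."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    chunks, current, prev_tail = [], [], ""

    def emit() -> str:
        body = "\n\n".join(current)
        # Separate the carried-over tail from the new text instead of gluing them together.
        chunks.append(f"{prev_tail}\n\n{body}" if prev_tail else body)
        # Carry forward the last `overlap` words (a rough stand-in for tokens).
        tail_words = body.split()[-overlap:] if overlap > 0 else []
        return " ".join(tail_words)

    for para in paragraphs:
        current.append(para)
        if sum(count_tokens(p) for p in current) >= max_tokens:
            prev_tail = emit()
            current = []

    if current:
        emit()
    return chunks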

Not perfect. Still better than splitting mid-sentence.

#The context window is not a buffer

When retrieval is working well, you want to inject all five retrieved documents. When it is working badly, you are injecting five irrelevant documents that actively hurt the answer.

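The pattern that helped: re-rank the retrieved chunks with a cross-encoder before injecting. A bi-encoder embeds the query and each document separately, so it finds candidates fast; a cross-encoder reads the query and a candidate together, so it scores them more accurately. Inject only the top two or three after re-ranking.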

The latency cost is real. It is worth it.
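The re-ranking step can be sketched roughly as follows. This is a minimal illustration, not the production code: `rerank` and `toy_overlap_score` are hypothetical names, and the word-overlap scorer is only a stand-in so the example runs without a model. In a real system `score_pair` would call a cross-encoder.

```python
from typing import Callable, Sequence


def rerank(
    query: str,
    candidates: Sequence[str],
    score_pair: Callable[[str, str], float],
    keep: int = 3,
) -> list[str]:
    """Score every (query, candidate) pair and keep the `keep` best."""
    scored = [(score_pair(query, text), text) for text in candidates]
    # Sort on score only, highest first; the sort is stable, so ties keep retrieval order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:keep]]


def toy_overlap_score(query: str, text: str) -> float:
    """Stand-in scorer: fraction of query words that appear in the text.

    A real system would call a cross-encoder model here instead.
    """
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / max(len(query_words), 1)


# Made-up first-stage results, in the order the retriever returned them.
candidates = [
    "Office hours and holiday schedule",
    "How to reset your password from the login page",
    "Password policy: minimum length and rotation",
    "Reset a forgotten password using the recovery email",
    "Cafeteria menu for the week",
]

top = rerank("reset forgotten password", candidates, toy_overlap_score, keep=2)
print(top)
```

Passing the scorer in as a function keeps the pipeline testable without loading a model, and makes it easy to measure the latency of the scoring step on its own.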

#Metadata is not optional

Every chunk needs metadata: source document, section heading, creation date, author. Not for the user — for the model.

When you inject [Source: Engineering Handbook, Section: Onboarding, Updated: 2025-11] alongside the chunk text, the model gets signal about how much to trust what it is reading. A chunk from a document updated three years ago deserves less weight than one updated last month.

The embedding knows nothing about when the document was written. The model does — if you tell it.
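One way to attach that header, as a minimal sketch: the `Chunk` fields and the `render_for_prompt` helper are assumptions for illustration, and the chunk text is made up. Only the bracketed header format comes from the example above.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Chunk:
    text: str
    source: str
    section: str
    updated: date


def render_for_prompt(chunk: Chunk) -> str:
    """Prefix the chunk text with a one-line provenance header."""
    header = (
        f"[Source: {chunk.source}, "
        f"Section: {chunk.section}, "
        f"Updated: {chunk.updated:%Y-%m}]"
    )
    return f"{header}\n{chunk.text}"


# Hypothetical chunk, for illustration only.
chunk = Chunk(
    text="New hires get laptop access on day one.",
    source="Engineering Handbook",
    section="Onboarding",
    updated=date(2025, 11, 3),
)
print(render_for_prompt(chunk))
```

Keeping the metadata as structured fields and rendering it only at prompt time means the same fields stay available for filtering or sorting by date before anything is injected.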

#The eval problem

You cannot know if your RAG system is getting worse without an eval set. An eval set for RAG is harder to build than for classification: the right answer often depends on what documents you have, and the documents change.

The minimum I recommend: 50 golden question-answer pairs, curated by hand, each tagged with the source chunk that should have been retrieved. Run retrieval against them weekly. Alert if precision@1 drops by more than 5 percentage points; with 50 questions, that means three or more newly missed questions.

That is not fancy. It would have caught every silent regression I have seen in production.
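A minimal sketch of that weekly check, assuming the golden set is stored as (question, expected chunk ID) pairs and the retriever can be wrapped as a function returning the ID of its top result. The function names, chunk IDs, and the fake retriever are all hypothetical.

```python
from typing import Callable


def precision_at_1(
    golden: list[tuple[str, str]],
    retrieve_top_id: Callable[[str], str],
) -> float:
    """Percentage of golden questions whose top-ranked chunk is the expected one."""
    if not golden:
        raise ValueError("golden set is empty")
    hits = sum(
        1 for question, expected_id in golden
        if retrieve_top_id(question) == expected_id
    )
    return 100.0 * hits / len(golden)


def should_alert(baseline: float, current: float, max_drop: float = 5.0) -> bool:
    """True when precision@1 fell by more than `max_drop` percentage points."""
    return baseline - current > max_drop


# Tiny made-up golden set and a fake retriever, for illustration only.
golden = [
    ("How do I reset my password?", "auth-03"),
    ("What is the refund window?", "billing-11"),
    ("Who approves late refunds?", "billing-12"),
    ("How do I request a laptop?", "it-07"),
]
fake_index = {
    "How do I reset my password?": "auth-03",
    "What is the refund window?": "billing-11",
    "Who approves late refunds?": "billing-11",  # wrong chunk ranked first
    "How do I request a laptop?": "it-07",
}

current = precision_at_1(golden, fake_index.__getitem__)
print(current, should_alert(baseline=100.0, current=current))
```

Comparing against a stored baseline rather than a fixed target keeps the alert meaningful as the document set changes, as long as the baseline is refreshed whenever the golden set itself is edited.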


RAG is not plug-and-play. The primitives — embed, retrieve, generate — are simple. Making the system honest, stable, and debuggable is the actual work.