RAG (retrieval-augmented generation) is a pattern where a model answers using passages pulled from your documents at query time, instead of relying only on what it memorized in training. Use it when answers must be grounded in a specific, changing body of knowledge — and when you need to cite where each answer came from.
The five steps
Strip away the jargon and a RAG query is five steps: the question becomes an embedding; the embedding is matched against an index of your documents; the top passages are retrieved; those passages are handed to the model as context; the model answers, grounded in what it was given — with citations back to the source.
- Embed the question into a vector.
- Retrieve the most similar passages from your indexed corpus.
- Ground the model by placing those passages in the prompt.
- Generate an answer constrained to that context.
- Cite the passages used, so a human can verify.
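To make the five steps concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the three-document `CORPUS`, the passage ids, and the helper names `embed`, `cosine`, `retrieve`, and `build_prompt` are hypothetical, and the bag-of-words "embedding" is a toy stand-in for a real embedding model. The model call itself is left out, so steps four and five appear only as the prompt that would be sent.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus; in practice these would be chunks of your documents.
CORPUS = {
    "policy-12": "Employees may carry over up to five unused vacation days into the next year.",
    "policy-07": "Expense reports must be submitted within thirty days of purchase.",
    "policy-31": "Remote work requires written approval from a direct manager.",
}

def embed(text):
    """Step 1 (toy version): a bag-of-words count vector, not a real embedding model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question, k=2):
    """Step 2: return the k (id, text) pairs most similar to the question."""
    q = embed(question)
    ranked = sorted(CORPUS.items(), key=lambda item: cosine(q, embed(item[1])), reverse=True)
    return ranked[:k]

def build_prompt(question, passages):
    """Step 3: ground the model by placing the passages, tagged with ids, in the prompt."""
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in passages)
    return (
        "Answer using only the passages below and cite passage ids in brackets.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )

question = "How many vacation days can I carry over?"
passages = retrieve(question)
prompt = build_prompt(question, passages)
# Steps 4 and 5: send `prompt` to whichever model you use; the bracketed ids
# in its answer are what a human checks against the source documents.
```

Tagging each passage with an id inside the prompt is one simple way to make citation possible: the model can only point back to sources it was shown by name.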
When to use it — and when not to
RAG earns its keep when answers must be grounded in a specific, changing corpus you control: policies, contracts, past responses, technical docs. It is the wrong tool when the task is reasoning over structured data (query the data directly instead), when the corpus is small enough to fit in the prompt (just put it there), or when freshness and citation don't matter.
The three failure modes
Almost every struggling enterprise RAG build we see is failing in one of three places — and none of them is the model.
- Retrieval, not generation. If the right passage never gets retrieved, no model can answer well. Most "the AI is wrong" complaints are retrieval problems wearing a generation costume.
- Chunking. How you split documents decides what can be found. Split badly and you sever the answer from its context.
- No evaluation. Without an eval set, you are tuning on vibes. You cannot improve what you do not measure.
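As an illustration of the chunking point, one common mitigation is to split on overlapping windows, so that a sentence falling on a boundary still appears whole in at least one chunk. The sketch below is one assumption among many possible strategies, not a recommendation: the function name `chunk_words` and the default sizes are hypothetical, and real pipelines often split on headings or paragraphs instead of raw word counts.

```python
def chunk_words(text, size=50, overlap=10):
    """Split text into windows of `size` words, each sharing `overlap` words with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    words = text.split()
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks

# Usage: twelve placeholder words, windows of five, overlapping by two.
sample = " ".join(f"w{i}" for i in range(12))
sample_chunks = chunk_words(sample, size=5, overlap=2)
```

The overlap trades index size for safety: larger overlaps store more duplicate text but make it less likely that an answer is cut off from the sentence that gives it meaning.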
Nine times out of ten, "the AI gave a bad answer" is really "the right document never made it into the prompt." Fix retrieval before you touch the model.
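Checking whether the right document made it into the prompt can be measured directly. The sketch below assumes a small hand-built eval set of (question, expected passage id) pairs and a `retrieve` function that returns ranked passage ids; both are hypothetical, as is the name `hit_rate_at_k`. With one expected passage per question, this number is the same thing as recall at k.

```python
def hit_rate_at_k(eval_set, retrieve, k=5):
    """Fraction of questions whose expected passage id appears in the top-k retrieved ids."""
    if not eval_set:
        raise ValueError("eval_set is empty")
    hits = 0
    for question, expected_id in eval_set:
        if expected_id in retrieve(question)[:k]:
            hits += 1
    return hits / len(eval_set)

# Usage with a stand-in retriever: a lookup table of pre-ranked ids per question.
fake_rankings = {
    "q1": ["a", "b"],
    "q2": ["c", "d"],
    "q3": ["x", "e"],
}
eval_set = [("q1", "a"), ("q2", "d"), ("q3", "z")]
top1 = hit_rate_at_k(eval_set, lambda q: fake_rankings[q], k=1)
top2 = hit_rate_at_k(eval_set, lambda q: fake_rankings[q], k=2)
```

Because the metric never calls the model, it isolates retrieval: if the score is low, no amount of prompt or model tuning will fix the answers.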
The takeaway
RAG is plumbing, not magic. The teams that win treat it like an information-retrieval problem with a language model on the end — and they instrument retrieval quality first.
