What is RAG (Retrieval-Augmented Generation)? Explained

Why RAG exists

A language model only knows what it learned during training. That means it has a knowledge cutoff, it doesn’t know your private or internal data, and when it’s unsure it may make something up. Retrieval-Augmented Generation (RAG) addresses these gaps without retraining: at the moment of the question, you retrieve the relevant facts from your own sources and hand them to the model as context, so the answer is grounded in real, current, citable material.

How it works

RAG has two phases. First you index your knowledge once; then you retrieve from it per question.

Indexing (once, ahead of time)

Chunk your documents into passages small enough to be specific but large enough to be meaningful.
Embed each chunk: use an embedding model to turn it into a vector (a list of numbers) that captures its meaning.
Store the vectors in a vector database that can search by similarity.
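
The three indexing steps above can be sketched in a few lines. This is a minimal illustration, not a production recipe: chunk_text uses fixed word windows with overlap (one of many possible chunking strategies), toy_embed is a hypothetical stand-in that counts words instead of calling a real embedding model, and the "vector database" is just a Python list. All function names and default sizes here are assumptions made for the example.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, max_words: int = 40, overlap: int = 10) -> list[str]:
    """Split text into overlapping word windows (one simple chunking strategy)."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be non-negative and smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reached the end of the text
    return chunks


def toy_embed(text: str) -> dict[str, float]:
    """Hypothetical stand-in for an embedding model: unit-length word counts."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}


def build_index(documents: dict[str, str]) -> list[dict]:
    """Toy 'vector store': a list of records holding source, text, and vector."""
    index = []
    for source, text in documents.items():
        for chunk_id, chunk in enumerate(chunk_text(text)):
            index.append({
                "source": source,
                "chunk_id": chunk_id,
                "text": chunk,
                "vector": toy_embed(chunk),
            })
    return index


# A made-up 100-word document, just to exercise the pipeline.
docs = {"handbook.txt": " ".join(f"word{i}" for i in range(100))}
index = build_index(docs)
print(len(index), "chunks indexed")
```

The overlap between windows is a design choice: it reduces the chance that a sentence carrying the answer is cut in half at a chunk boundary, at the cost of storing some text twice.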

Retrieval + generation (every query)

Embed the question the same way.
Search the vector store for the chunks whose meaning is closest to the question (semantic search), and optionally rerank them for relevance.
Augment the prompt: drop the top chunks into the model’s context window alongside the question.
Generate: the model answers using those chunks, and can cite them.
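
The query-time steps above might look like the following sketch. It assumes the same hypothetical toy_embed stand-in for an embedding model, a three-passage index invented for the example, and a prompt wording that is only one reasonable choice. The final generation call is left out on purpose, since it depends on whichever model API you use; the sketch stops at the augmented prompt.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> dict[str, float]:
    """Hypothetical stand-in for an embedding model: unit-length word counts."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}


def similarity(a: dict[str, float], b: dict[str, float]) -> float:
    """Dot product of two unit-length sparse vectors (equals cosine similarity)."""
    return sum(weight * b.get(word, 0.0) for word, weight in a.items())


def retrieve(question: str, index: list[dict], k: int = 2) -> list[dict]:
    """Embed the question the same way as the chunks, then keep the k closest."""
    query_vector = toy_embed(question)
    ranked = sorted(index, key=lambda rec: similarity(query_vector, rec["vector"]),
                    reverse=True)
    return ranked[:k]


def build_prompt(question: str, chunks: list[dict]) -> str:
    """Augment: place numbered chunks next to the question so they can be cited."""
    context = "\n".join(
        f"[{n}] ({c['source']}) {c['text']}" for n, c in enumerate(chunks, start=1)
    )
    return (
        "Answer using only the numbered context below, and cite the numbers you rely on.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Invented passages standing in for an already-built index.
passages = [
    ("handbook.txt", "Employees accrue 20 vacation days per year."),
    ("finance.txt", "Expense reports are due by the fifth of each month."),
    ("it.txt", "The office VPN requires multi-factor authentication."),
]
index = [{"source": s, "text": t, "vector": toy_embed(t)} for s, t in passages]

question = "How many vacation days do employees get?"
top_chunks = retrieve(question, index, k=2)
prompt = build_prompt(question, top_chunks)
print(prompt)  # this string is what you would send to the model
```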

Under the hood: what “retrieve the closest chunks” actually means (optional)

“Closest” is literal geometry. Every chunk was turned into an embedding (a vector), and so is the question. Retrieval scores the question against each chunk with cosine similarity, the cosine of the angle between the two vectors, and keeps the top k: the handful of chunks whose vectors point in the direction closest to the question’s.
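
To make the geometry concrete, here is a small sketch of cosine similarity and top-k selection in plain Python. The three-dimensional vectors and chunk names are invented for illustration; real embeddings typically have hundreds or thousands of dimensions, and a real system would use a vector library rather than Python lists.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between a and b: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as having no direction
    return dot / (norm_a * norm_b)


def top_k(query: list[float], chunk_vectors: dict[str, list[float]], k: int):
    """Score every chunk against the query and keep the k highest scores."""
    scored = [(cosine_similarity(query, vec), name)
              for name, vec in chunk_vectors.items()]
    scored.sort(reverse=True)
    return scored[:k]


# Made-up vectors: the first axis loosely stands for "refunds and returns".
chunk_vectors = {
    "refund policy": [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.9, 0.0],
    "return window": [0.8, 0.2, 0.1],
}
query = [1.0, 0.0, 0.0]

for score, name in top_k(query, chunk_vectors, k=2):
    print(f"{name}: {score:.3f}")
```

Because cosine similarity depends only on direction, a long chunk and a short chunk about the same topic can score equally well; vector length does not matter.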

Two knobs decide quality. The first is k, how many chunks you pull: too few and the passage that holds the answer may be missing; too many and you flood the context window with noise. The second is reranking: many systems have a second, slower model re-score those top k for true relevance before they go into the prompt. Speed is a separate matter: a vector database uses approximate nearest-neighbor search, so it does not have to compare the question against every stored vector one by one, even when there are millions.
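
The retrieve-then-rerank pattern can be sketched as two sorting passes: a cheap scorer picks k candidates, then a more careful scorer reorders only those. Both scorers below are hypothetical toys chosen so the example runs without any model: word overlap stands in for vector search, and a shared-word-pair count stands in for the slower reranking model. The function names and the k and keep parameters are assumptions for illustration.

```python
from typing import Callable

Scorer = Callable[[str, str], float]


def retrieve_then_rerank(question: str, chunks: list[str], first_stage: Scorer,
                         reranker: Scorer, k: int = 3, keep: int = 2) -> list[str]:
    """Stage 1: cheap scorer keeps k candidates. Stage 2: reranker reorders them."""
    candidates = sorted(chunks, key=lambda c: first_stage(question, c),
                        reverse=True)[:k]
    reranked = sorted(candidates, key=lambda c: reranker(question, c), reverse=True)
    return reranked[:keep]


def word_overlap(question: str, chunk: str) -> float:
    """Toy first stage: number of distinct words shared with the question."""
    return float(len(set(question.lower().split()) & set(chunk.lower().split())))


def toy_reranker(question: str, chunk: str) -> float:
    """Toy stand-in for a reranking model: counts shared adjacent word pairs."""
    def bigrams(text: str) -> set:
        words = text.lower().split()
        return set(zip(words, words[1:]))
    return float(len(bigrams(question) & bigrams(chunk)))


chunks = [
    "my manager will reset the budget and password rules",  # shares words only
    "to reset my password open account settings",           # shares the phrase
    "shipping takes five days",
    "password length must be twelve characters",
]
result = retrieve_then_rerank("reset my password", chunks, word_overlap, toy_reranker)
print(result[0])
```

In this example the first stage ties the first two chunks on word overlap, and the reranker is what moves the chunk that actually contains the phrase to the top. The reranker only ever sees k chunks, which is why it can afford to be slower.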

RAG vs fine-tuning

A common question is whether to use RAG or fine-tune. They solve different problems. Fine-tuning changes how a model behaves (its tone, format, or skill at a task) and bakes knowledge in at training time, which is slow and static. RAG gives a model facts at query time: current, private, and easy to update by just changing the documents. For “answer from our knowledge base / latest docs,” RAG is usually the right tool; for “always respond in this exact style,” fine-tuning is. Many systems use both.

Where RAG goes wrong

Bad chunking. Chunks that are too big bury the answer; chunks that are too small lose the context around it. This is one of the most common failures.
Retrieval misses. If the right chunk isn’t retrieved, the model answers without it, often just as confidently. Good retrieval and reranking often matter more than the choice of model.
Context limits. You can only fit so many chunks; relevance ranking decides which ones earn a place in that budget.
Garbage in. RAG grounds the model in your sources; if those are wrong or stale, so is the answer.
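
The context-limit point can be illustrated with a small budget-packing sketch: walk the chunks in relevance order and add each one only if it still fits. This is one possible policy under stated assumptions, not a standard algorithm. The default token counter simply counts whitespace-separated words, which is a rough approximation; a real system would use the tokenizer of the model it calls.

```python
from typing import Callable


def pack_context(ranked_chunks: list[str], budget_tokens: int,
                 count_tokens: Callable[[str], int] = lambda text: len(text.split()),
                 ) -> tuple[list[str], int]:
    """Greedily keep chunks, best first, while the total stays within the budget."""
    selected: list[str] = []
    used = 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost > budget_tokens:
            continue  # skip this chunk; a shorter, lower-ranked one may still fit
        selected.append(chunk)
        used += cost
    return selected, used


# Chunks are assumed to arrive already sorted by relevance, best first.
ranked = [
    "alpha beta gamma delta epsilon",  # 5 words
    "one two three four",              # 4 words
    "red green blue",                  # 3 words
]
selected, used = pack_context(ranked, budget_tokens=8)
print(selected, used)
```

Skipping an oversized chunk and continuing is a design choice; stopping at the first chunk that does not fit is stricter about rank order but can leave budget unused.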

FAQ

Does RAG stop hallucinations?

It reduces them by grounding answers in retrieved facts and enabling citations, but it doesn’t eliminate them: the model can still misread a chunk or fill a retrieval gap with a guess. Treat RAG as strong grounding, not a guarantee.

RAG or a bigger context window?

Long context lets you paste more, but it costs more, runs slower, and is prone to “lost in the middle,” where the model overlooks facts buried in the middle of a long prompt. RAG sends only the relevant pieces, which is cheaper and often more accurate. They combine well: retrieve the best chunks, then spend the context window on those rather than on everything you have.

Do I need a vector database?

For anything beyond a handful of documents, yes: it’s what makes similarity search fast at scale. For a tiny set you can get away with searching in memory.
