AI Explained · Pattern

What is RAG (Retrieval-Augmented Generation)?

RAG lets a model answer from your documents instead of only its training data, by retrieving the relevant pieces at query time and putting them in the context.

Why RAG exists

A language model only knows what it learned during training. That means it has a knowledge cutoff, it doesn’t know your private or internal data, and when it’s unsure it may make something up. Retrieval-Augmented Generation (RAG) addresses this without retraining: at the moment of the question, you retrieve the relevant facts from your own sources and hand them to the model as context, so the answer is grounded in real, current, citable material.

How it works

RAG has two phases. First you index your knowledge once; then you retrieve from it per question.

Indexing (once, ahead of time)

  • Chunk your documents into passages small enough to be specific but large enough to be meaningful.
  • Embed each chunk: use an embedding model to turn it into a vector (a list of numbers) that captures its meaning.
  • Store the vectors in a vector database that can search by similarity.
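
The indexing steps above can be sketched in a few lines. This is a minimal illustration, not a production pipeline: `chunk_text`, `toy_embed`, and `build_index` are hypothetical names, the word-window sizes are arbitrary, and `toy_embed` is only a stand-in that hashes words into buckets. It captures word overlap rather than meaning, so a real system would call an embedding model at that point and write to a vector database instead of a Python list.

```python
import hashlib
import math
import re

DIM = 64  # toy vector size; real embedding models produce far larger vectors


def chunk_text(text: str, max_words: int = 40, overlap: int = 10) -> list[str]:
    """Split text into overlapping word windows so a thought is less likely to be cut in half."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def toy_embed(text: str) -> list[float]:
    """Stand-in for an embedding model (assumption): hashed bag of words, L2-normalized."""
    vec = [0.0] * DIM
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        # sha256 gives the same bucket on every run; the built-in hash() is salted per process
        bucket = int(hashlib.sha256(token.encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def build_index(documents: dict[str, str]) -> list[dict]:
    """Chunk and embed every document; the returned list plays the role of the vector store."""
    index = []
    for doc_id, text in documents.items():
        for i, chunk in enumerate(chunk_text(text)):
            index.append(
                {"doc_id": doc_id, "chunk_id": i, "text": chunk, "vector": toy_embed(chunk)}
            )
    return index
```

Keeping `doc_id` and `chunk_id` next to each vector is what later lets an answer cite where a passage came from.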

Retrieval + generation (every query)

  • Embed the question the same way.
  • Search the vector store for the chunks whose meaning is closest to the question (semantic search), then optionally rerank them for relevance.
  • Augment the prompt: drop the top chunks into the model’s context window alongside the question.
  • Generate: the model answers using those chunks, and can cite them.
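
The query-time steps above might look like the following sketch. Everything here is illustrative: `embed` is a placeholder that counts words (so it matches shared vocabulary, not meaning), the three sample chunks are invented, and the prompt wording is just one reasonable choice. The final generate step is left out because it depends on which model API you call.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Placeholder embedding (assumption): sparse word counts instead of a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: list[str], top_k: int = 2) -> list[tuple[float, str]]:
    """Embed the question the same way as the chunks and return the closest top_k."""
    q = embed(question)
    scored = [(cosine(q, embed(c)), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def build_prompt(question: str, hits: list[tuple[float, str]]) -> str:
    """Augment: number the retrieved chunks so the model can cite them."""
    sources = "\n".join(f"[{i}] {text}" for i, (_, text) in enumerate(hits, start=1))
    return (
        "Answer using only the sources below and cite them by number. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )


chunks = [
    "Refunds are issued within 14 days of a return request.",
    "Our office is closed on public holidays.",
    "Return requests must include the original order number.",
]
hits = retrieve("How long do refunds take?", chunks)
prompt = build_prompt("How long do refunds take?", hits)
print(prompt)
```

With word counts, the question only matches the first chunk because both contain "refunds"; a real embedding model would also be expected to pull in passages that use different words for the same idea.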

RAG vs fine-tuning

A common question is whether to use RAG or fine-tune. They solve different problems. Fine-tuning changes how a model behaves (its tone, format, or skill at a task) and bakes knowledge in at training time, which is slow and static. RAG gives a model facts at query time: current, private, and easy to update just by changing the documents. For “answer from our knowledge base / latest docs,” RAG is usually the right tool; for “always respond in this exact style,” fine-tuning is. Many systems use both.

Where RAG goes wrong

  • Bad chunking. Chunks that are too big bury the answer; chunks that are too small lose the context around it. This is the most common failure.
  • Retrieval misses. If the right chunk isn’t retrieved, the model answers without it, and just as confidently. Good retrieval and reranking often matter more than the choice of model.
  • Context limits. You can only fit so many chunks; relevance ranking decides which ones earn a place in that budget.
  • Garbage in. RAG grounds the model in your sources; if those are wrong or stale, so is the answer.
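
To make the context-limit point concrete, here is one simple way to spend a fixed budget on ranked chunks. It is a sketch under assumptions: `pack_context` is a hypothetical helper, the input is assumed to be sorted most relevant first, and the default token count (one token per whitespace-separated word) is a rough approximation that the model’s own tokenizer should replace.

```python
def pack_context(ranked_chunks: list[str], max_tokens: int, count_tokens=None) -> list[str]:
    """Greedily keep the highest-ranked chunks that still fit in the token budget."""
    if count_tokens is None:
        # Rough assumption: one token per word. Real tokenizers usually count more.
        count_tokens = lambda text: len(text.split())
    kept, used = [], 0
    for chunk in ranked_chunks:  # assumed sorted most relevant first
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            continue  # too big for what is left; a smaller, lower-ranked chunk may still fit
        kept.append(chunk)
        used += cost
    return kept


ranked = ["a b c", "d e f g", "h i"]
print(pack_context(ranked, max_tokens=5))  # ['a b c', 'h i']
```

Skipping an oversized chunk rather than stopping is a design choice: it fills the budget more fully, at the cost of sometimes including a less relevant passage.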

FAQ

Does RAG stop hallucinations?

It reduces them by grounding answers in retrieved facts and enabling citations, but it doesn’t eliminate them: the model can still misread a chunk or fill a retrieval gap with a guess. Treat RAG as strong grounding, not a guarantee.

RAG or a bigger context window?

Long context lets you paste more, but it’s costlier, slower, and prone to “lost in the middle,” where material buried mid-prompt gets less attention. RAG sends only the relevant pieces, which is cheaper and often more accurate. They combine well: retrieve the best chunks, then spend the context window on those.

Do I need a vector database?

For anything beyond a handful of documents, yes: it’s what makes similarity search fast at scale. For a tiny set you can get away with searching in memory.
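
For the tiny-set case, searching in memory can be as plain as comparing the query against every stored vector. The sketch below assumes the vectors were already produced by some embedding model; `top_k_in_memory` and the two-dimensional sample vectors are made up for illustration.

```python
import heapq
import math


def top_k_in_memory(query_vec, vectors, k=3):
    """Exhaustive cosine search over a dict of {chunk_id: vector}."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    # Every stored vector is scored on every query, so cost grows with collection size.
    scored = ((cosine(query_vec, vec), chunk_id) for chunk_id, vec in vectors.items())
    return heapq.nlargest(k, scored)


vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
best = top_k_in_memory([1.0, 0.1], vectors, k=2)
print([chunk_id for _, chunk_id in best])  # ['a', 'c']
```

Because each query touches every vector, this stays comfortable only while the collection is small; vector databases avoid that full scan by building an index over the vectors.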

Related

RAG fills the context window; tools reach data via MCP. More in AI Explained.