Retrieval-Augmented Generation (RAG)

RAG is the technique of retrieving relevant documents from a data store and passing them to a large language model (LLM) as context before it answers. Definition, architecture, and small and midsize business (SMB) use cases.

By Kadin Nestler · May 28, 2026 · Updated May 28, 2026

Why RAG exists

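LLMs have two well-known limitations: they tend to hallucinate when asked about facts that were not in their training data, and they have no access to your private data (your customer records, your policies, your inventory). RAG addresses both. The retrieval step pulls fresh, private, or domain-specific documents at query time. The generation step grounds the answer in those documents. This reduces hallucination but does not eliminate it, since the model can still misread or ignore what was retrieved. Researchers at Facebook AI Research and their collaborators introduced the term in a 2020 paper (Lewis et al., "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks") that has since been widely cited.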

How a RAG pipeline is built

  • Index time: documents are split into chunks, embedded into vectors, and stored in a vector database (for example Pinecone, Weaviate, Chroma, or PostgreSQL with the pgvector extension).
  • Query time: the user question is embedded with the same embedding model, so it lands in the same vector space, and is matched against the stored vectors to find the top-K most relevant chunks.
  • Generation: those chunks are inserted into the LLM prompt with the question and the model generates a grounded answer.
  • Optional: re-ranking, hybrid search (vector + keyword), and citation extraction so the answer points back to its sources.
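
The steps above can be sketched in a few dozen lines of Python. This is a minimal illustration and not a production design. The `embed` function is a toy hashed bag-of-words stand-in for a real embedding model, the in-memory list stands in for a vector database, and all function names, file names, and sample documents are hypothetical. The final prompt string is what would be sent to whichever LLM you use.

```python
import hashlib
import math
import re

DIM = 1024  # size of the toy embedding space (an arbitrary assumption)


def embed(text: str) -> list[float]:
    """Toy stand-in for an embedding model: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * DIM
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        # A stable hash (unlike the built-in hash()) keeps vectors comparable across runs.
        bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def chunk(text: str, max_words: int = 40) -> list[str]:
    """Split a document into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def build_index(docs: dict[str, str]) -> list[tuple[str, str, list[float]]]:
    """Index time: chunk every document and store (source, chunk, vector)."""
    return [(name, c, embed(c)) for name, text in docs.items() for c in chunk(text)]


def retrieve(index, question: str, k: int = 2):
    """Query time: embed the question, return the top-k chunks by cosine similarity."""
    q = embed(question)
    # Vectors are unit length, so the dot product equals cosine similarity.
    scored = [(sum(a * b for a, b in zip(q, vec)), name, c) for name, c, vec in index]
    return sorted(scored, key=lambda row: row[0], reverse=True)[:k]


def build_prompt(question: str, hits) -> str:
    """Generation: place the retrieved chunks in the prompt ahead of the question."""
    context = "\n".join(f"[{name}] {c}" for _, name, c in hits)
    return (
        "Answer using only the context below and cite the source in brackets.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Hypothetical sample documents.
docs = {
    "returns.txt": "Customers may return unused items within 30 days for a full refund.",
    "shipping.txt": "Standard shipping takes five business days. Express shipping takes two.",
}
question = "How many days do customers have to return items?"
index = build_index(docs)
hits = retrieve(index, question, k=1)
prompt = build_prompt(question, hits)
print(prompt)
```

Tagging each chunk with its source file is what makes the optional citation step possible: the model can point back to the bracketed name. In a real system, `embed` would call an embedding model and `retrieve` would be a query against the vector database.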

Where RAG fits in a business

Customer support agents that quote the company knowledge base. Internal Q&A bots over HR policies, SOPs, or compliance manuals. Sales reps querying historical deal notes. Legal teams searching across thousands of contracts. Any domain where the answer must come from your documents, not the model's training data.

When RAG fails

  • Bad chunking — splitting documents at arbitrary boundaries loses context and breaks retrieval.
  • Embedding mismatch — the embedding model that indexed the docs is different from the one that queries them.
  • Stale index — the docs changed but the index did not refresh.
  • Insufficient top-K — only 2-3 chunks retrieved when the answer needs 8-10.
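
Long-context models (for example, Claude with a 200K-token window or Gemini with up to 2M tokens) reduce some of the need for RAG by letting you place entire document sets directly in the prompt. At scale, though, cost and latency still favor RAG, because every query pays to process the whole context rather than a handful of retrieved chunks.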
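
Some of the failure modes listed above can be guarded against with simple checks. The sketch below shows one possible approach, with assumptions throughout: `chunk_by_sentence` packs whole sentences into chunks with a one-sentence overlap instead of cutting at arbitrary offsets, and `index_problems` compares a manifest saved at index time (a hypothetical structure holding the embedding model name and a hash of each document) against the current state to flag embedding mismatch and a stale index. The model names and documents are placeholders.

```python
import hashlib
import re


def chunk_by_sentence(text: str, max_words: int = 60, overlap: int = 1) -> list[str]:
    """Pack whole sentences into chunks, repeating the last `overlap` sentences
    at the start of the next chunk. max_words is a soft target, not a hard cap."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], []
    for sentence in sentences:
        words_so_far = sum(len(s.split()) for s in current)
        if current and words_so_far + len(sentence.split()) > max_words:
            chunks.append(" ".join(current))
            # Carry the tail of the finished chunk forward as shared context.
            current = current[-overlap:] if overlap else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


def fingerprint(text: str) -> str:
    """Content hash used to detect that a document changed after indexing."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def index_problems(manifest: dict, query_model: str, current_docs: dict[str, str]) -> list[str]:
    """Compare the manifest saved at index time with the current model and documents."""
    problems = []
    if manifest["embedding_model"] != query_model:
        problems.append(
            f"embedding mismatch: indexed with {manifest['embedding_model']}, "
            f"querying with {query_model}"
        )
    for name, text in current_docs.items():
        if manifest["doc_hashes"].get(name) != fingerprint(text):
            problems.append(f"stale index: {name} changed or was never indexed")
    for name in sorted(manifest["doc_hashes"].keys() - current_docs.keys()):
        problems.append(f"stale index: {name} was deleted but is still indexed")
    return problems


# Hypothetical example: the policy changed and the query side uses another model.
docs_v1 = {"policy.txt": "Refunds are issued within 30 days."}
manifest = {
    "embedding_model": "embed-model-a",
    "doc_hashes": {name: fingerprint(text) for name, text in docs_v1.items()},
}
docs_v2 = {"policy.txt": "Refunds are issued within 14 days."}
problems = index_problems(manifest, "embed-model-b", docs_v2)
for problem in problems:
    print(problem)
```

Running a check like this before serving queries, or on a schedule, turns two silent failures into visible ones. Insufficient top-K has no equivalent check; it is usually tuned by raising K and measuring answer quality on real questions.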

What it means for your business

If a vendor is selling you an AI that answers questions about your business, ask whether it is fine-tuned (expensive to train and slow to update) or RAG-based (cheaper, and updated as soon as a new document is indexed). For most SMB use cases, RAG is the better fit.

Related terms

  • Vector Database — A vector database stores embeddings and finds similar items by approximate nearest-neighbor search. Definition, top vendors, and when you actually need one.
  • Embedding — An embedding is a numeric vector that represents the meaning of text, an image, or audio. Definition, top embedding models, and how they power search.
  • Large Language Model (LLM) — A Large Language Model is a transformer-based neural network trained on trillions of tokens to predict the next token. Definition, key models, and business use.
  • AI Knowledge Base — An AI knowledge base is a structured corpus of documents an AI agent retrieves from to answer questions. Definition, architecture, and SMB setup tips.
  • AI Grounding — Grounding is the practice of tying AI outputs to verified source material. Definition, techniques, and why it is the primary defense against hallucination.