    Free chapter

    Simple RAG: The Foundation of Document-Powered AI

    Teaching a language model to look things up before answering, instead of guessing from memory.

    Prerequisites: Basic understanding of what a language model (LLM) does. Familiarity with the idea of a database.

    The Problem

    Imagine you have a 200-page report about climate change. You need a quick answer: “What does chapter 2 say about the main cause of climate change?” So you ask a language model.

    The model gives you a generic, textbook-style answer about greenhouse gases. It sounds reasonable. But it is not drawn from your report. The model has never seen your document. It is answering from memory, from patterns learned during training. It might even state facts that sound right but are nowhere in your source material. This is called hallucination, and it is one of the biggest risks of using language models for knowledge work.

    Now scale this up. Your company has thousands of internal documents: policy manuals, research papers, product specs, legal contracts. Employees ask questions about these every day. A language model that answers from memory alone is useless here. It does not know what is in your documents. The contrast is stark: without access to your documents, the model guesses and often gets it wrong. With access, it reads the relevant passage and answers correctly. The following illustration captures this difference.

    Without RAG, a robot answers from memory and gets it wrong. With RAG, the same robot reads the source document and answers correctly.

    The Core Idea

    RAG, short for Retrieval-Augmented Generation, solves this by adding a simple step: before the model answers, it first searches your documents for relevant information.

    Think of it as the difference between a closed-book exam and an open-book exam. In a closed-book exam, the student relies entirely on memory, staring at the blank page with growing uncertainty. In an open-book exam, the student can look up the relevant section, read it, and write an informed answer with confidence. RAG gives your language model an open book.

    A closed-book student struggles with a question mark overhead, while an open-book student writes confidently with a reference book on their desk and a lightbulb above.

    The core insight: you do not need to retrain the model or stuff entire documents into the prompt. You store your documents in a searchable format, retrieve only the relevant pieces, and hand those pieces to the model as context.

    The model does what it does best: read the context and generate a clear, grounded answer. The rest of this chapter explains exactly how this works, step by step.

    How It Works

    The simple RAG pipeline has two phases: an offline preparation phase (done once per document) and an online query phase (done each time a user asks a question).

    Phase 1: Preparing the Knowledge Base

    Step 1: Load the Document

    The process starts by extracting raw text from your source documents. A PDF is parsed page by page, converting visual content into plain text. This is a straightforward conversion step, but it matters. Poor text extraction leads to poor retrieval downstream.

    Think of this as a librarian transcribing a book onto index cards so it can be cataloged and searched.

    Step 2: Split Into Chunks

    A 200-page document is far too large to search as a single block. The system splits the text into smaller pieces called chunks. A common default is chunks of about 1,000 characters (roughly a long paragraph), with 200 characters of overlap between consecutive chunks.

    Why overlap? Imagine a critical explanation that spans two paragraphs. If you split the text exactly between them, neither chunk captures the full thought. Overlap reduces this risk: each chunk begins with the last ~200 characters of the previous chunk, so a sentence that straddles a boundary usually appears whole in at least one of the two. The diagram below shows this: notice how the orange-highlighted bands at the bottom of one chunk match the top of the next, representing the shared overlap region.

    A full document on the left is split into three separate chunks on the right, with orange bands showing the overlap regions shared between consecutive chunks.
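
    The splitting rule described above can be sketched as a sliding window over characters. This is a minimal illustration, not the companion notebook's implementation: the function name split_into_chunks and the digit-only sample text are hypothetical, and the defaults simply mirror the 1,000/200 figures mentioned in the text.

```python
def split_into_chunks(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Hypothetical sample: 2,500 characters of repeating digits.
sample = "".join(str(i % 10) for i in range(2500))
chunks = split_into_chunks(sample)

# The tail of each chunk is repeated at the head of the next one.
shared = chunks[0][-200:] == chunks[1][:200]
```

    Splitting on raw character offsets can cut a word or sentence in half. Production splitters often prefer paragraph or sentence boundaries and fall back to character counts only when needed; the sketch ignores that refinement to keep the overlap mechanics visible.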

    Step 3: Clean the Text

    Raw text extracted from PDFs often carries formatting artifacts: stray tab characters, extra whitespace, broken line endings. A cleaning step normalizes the text so that the next stages work reliably. This is a small but important detail. Dirty text leads to noisy embeddings.
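
    As a rough sketch of such a cleaning step, the function below handles the three artifacts named above using only Python's standard re module. The name clean_text and the exact normalization rules are assumptions chosen for illustration; a real pipeline would tune them to its own documents.

```python
import re


def clean_text(text: str) -> str:
    """Normalize common PDF-extraction artifacts in a piece of text."""
    # Broken line endings: convert Windows and old-Mac endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Stray tabs and extra whitespace: collapse runs of spaces/tabs to one space.
    text = re.sub(r"[ \t]+", " ", text)
    # Drop spaces that sit directly before or after a line break.
    text = re.sub(r" *\n *", "\n", text)
    # Keep at most one blank line between paragraphs.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


raw = "Greenhouse\tgases   trap heat.\r\n\r\n\r\n  CO2 is the  main driver. \n"
cleaned = clean_text(raw)
```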

    Step 4: Create Embeddings

    This is the most important transformation in the pipeline. Each text chunk is converted into an embedding: a list of numbers (a vector) that captures the meaning of the text. These are not random numbers. An embedding model computes them so that two chunks about similar topics produce vectors that are close together in vector space, while chunks about different topics produce vectors that are far apart.

    Think of embeddings as GPS coordinates for meaning. Just as GPS coordinates let you find places that are geographically close, embedding vectors let you find text passages that are semantically close. A chunk about “greenhouse gas emissions” and a chunk about “carbon dioxide in the atmosphere” will have nearby vectors, even though they use different words. The illustration below shows this: the blue and orange text cards map to dots that cluster together (labeled “similar”) because they cover related topics, while the gray card maps to a distant dot (labeled “different”).

    Three text cards in blue, orange, and gray are mapped into a coordinate grid labeled “Meaning Space.” The blue and orange dots cluster together labeled “similar,” while the gray dot sits apart labeled “different.”
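
    One common way to measure "close together" is cosine similarity, which compares the direction of two vectors. The sketch below uses hand-made three-dimensional vectors purely for illustration: the numbers are invented, not the output of any embedding model, and real embeddings typically have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # similarity is undefined for a zero vector; treat it as unrelated
    return dot / (norm_a * norm_b)


# Invented vectors standing in for real embeddings (assumption, for illustration only).
greenhouse_chunk = [0.9, 0.8, 0.1]  # "greenhouse gas emissions"
carbon_chunk = [0.8, 0.9, 0.2]      # "carbon dioxide in the atmosphere"
unrelated_chunk = [0.1, 0.0, 0.9]   # a chunk about something else entirely

similar = cosine_similarity(greenhouse_chunk, carbon_chunk)
different = cosine_similarity(greenhouse_chunk, unrelated_chunk)
```

    With these made-up values, similar comes out close to 1 and different comes out much lower, which is the pattern the "Meaning Space" illustration depicts.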

    Step 5: Store in a Vector Database

    All embedding vectors are stored in a specialized database or index optimized for similarity search. A popular choice is FAISS (Facebook AI Similarity Search), an open-source library that organizes vectors so that finding the nearest neighbors to any query vector is extremely fast, even with millions of entries.

    This is like organizing a library not alphabetically or by publication date, but by topic similarity. Books about related subjects sit on adjacent shelves. When you need information about a specific topic, you go straight to the right neighborhood and browse nearby books. The image below illustrates this contrast: a traditional library where a reader walks between distant shelves to find related books, versus a conceptual library where books on science, energy, and policy are clustered by meaning, putting relevant material within arm’s reach.

    A traditional library with alphabetical shelves where a reader searches far apart, versus a modern conceptual library where books on science, energy, and policy are clustered together by topic.
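
    To make the idea concrete, here is a deliberately naive in-memory store that keeps each vector next to its text and scans every entry on each search. The class TinyVectorStore and its methods are hypothetical and are not FAISS's API; libraries such as FAISS exist precisely to replace this linear scan with optimized index structures.

```python
import math


class TinyVectorStore:
    """Brute-force store: holds (vector, text) pairs and scans all of them per query."""

    def __init__(self) -> None:
        self._vectors: list[list[float]] = []
        self._texts: list[str] = []

    def add(self, vector: list[float], text: str) -> None:
        """Store one chunk's embedding together with its original text."""
        self._vectors.append(list(vector))
        self._texts.append(text)

    def search(self, query: list[float], k: int = 2) -> list[tuple[str, float]]:
        """Return the k stored texts nearest to `query` by Euclidean distance."""
        scored = sorted(
            (math.dist(query, vec), text)
            for vec, text in zip(self._vectors, self._texts)
        )
        return [(text, dist) for dist, text in scored[:k]]


# Invented vectors again; a real system would get these from an embedding model.
store = TinyVectorStore()
store.add([0.9, 0.8, 0.1], "greenhouse gas emissions")
store.add([0.8, 0.9, 0.2], "carbon dioxide in the atmosphere")
store.add([0.1, 0.0, 0.9], "library opening hours")

results = store.search([0.85, 0.85, 0.1], k=2)
nearest_texts = [text for text, _ in results]
```

    A linear scan costs time proportional to the number of stored vectors, which is fine for one document and impractical for millions of chunks.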

    The entire preparation phase can be summarized in five steps:

    Algorithm Sketch: Document Preparation
    
    1. Load document and extract raw text
    2. Split text into chunks of ~1000 characters with ~200 character overlap
    3. Clean each chunk (remove formatting artifacts)
    4. Convert each chunk into an embedding vector
    5. Store all vectors in a vector store for fast similarity search

    Phase 2: Answering a Query

    With the knowledge base prepared, the system is ready to answer questions. This phase runs every time a user submits a query.

    Step 6: Embed the Query

    When a user asks a question, that question is converted into an embedding vector using the same embedding model that was applied to the document chunks. This is critical: the query must live in the same “meaning space” as the stored chunks so that distances between them are meaningful. Vectors produced by different models are not comparable.

    Step 7: Find the Nearest Chunks

    The system searches the vector store for the chunks whose embedding vectors are closest to the query vector. A typical setup retrieves the top 2 to 5 most similar chunks. These represent the parts of the document most likely to contain the answer.

    For example, given the query “What is the main cause of climate change?” against a climate report, the system might retrieve a chunk discussing greenhouse gases and fossil fuels, and another covering modern scientific observations about human-driven climate change. Both are directly relevant to the question. In the diagram below, the orange dot represents the query in meaning space, and the blue highlighted dots are the nearest stored chunks that the system retrieves, connected by dashed lines to show proximity.

    A scatter plot of stored document chunks as gray dots in meaning space. An orange query dot appears with a magnifying glass, and the three nearest chunks are highlighted in blue with dashed lines showing their proximity.

    Step 8: Generate the Answer

    The retrieved chunks are passed to the language model along with the original question. The model reads the relevant context and generates an answer that is grounded in the actual document content, not in its training memory. This is the payoff. The model is no longer guessing. It is reading the relevant paragraphs and summarizing them for you.
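
    How the chunks and the question are combined is a design choice. One plausible layout is sketched below; the function name build_prompt and the instruction wording are assumptions rather than a prescribed template, and the call that actually sends the prompt to a language model is left out.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Assemble retrieved chunks and the user's question into a single prompt string."""
    # Number the chunks so the model (and a human reader) can tell them apart.
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_prompt(
    "What is the main cause of climate change?",
    [
        "Greenhouse gases from burning fossil fuels trap heat in the atmosphere.",
        "Modern observations attribute recent warming to human activity.",
    ],
)
```

    Telling the model to rely only on the supplied context, and to admit when the context is insufficient, is one way to discourage it from falling back on training memory.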

    The diagram below shows the complete pipeline end-to-end. The top row is the preparation phase: a document is split into chunks, the chunks are transformed into vectors, and the vectors are stored. The bottom row is the query phase: a question is converted into a vector and used to search the database, and the retrieved chunks feed into the final answer.

    The full RAG pipeline in two rows. Top row labeled “Prepare” shows document, split, transform, store. Bottom row labeled “Query” shows question, formulate, search, retrieve, answer.

    Algorithm Sketch: Query Processing
    
    1. Receive user question
    2. Convert question into an embedding vector
    3. Search vector store for top-K nearest chunk vectors
    4. Retrieve the corresponding text chunks
    5. Pass question + retrieved chunks to the language model
    6. Model generates answer grounded in the retrieved context
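
    The six steps above can be wired together in a few lines. Everything here is a toy stand-in under stated assumptions: toy_embed counts words instead of calling a real embedding model (so it matches shared words, not meaning), fake_llm is a hypothetical placeholder for a language model call, and the three chunks are invented.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in for an embedding model: lowercase word counts (not semantic)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def prepare(chunks: list[str]) -> list[tuple[Counter, str]]:
    """Preparation phase: embed every chunk once and keep it next to its text."""
    return [(toy_embed(chunk), chunk) for chunk in chunks]


def answer_query(question, index, generate, k=2):
    # Steps 1-2: receive the question and embed it with the same function as the chunks.
    query_vec = toy_embed(question)
    # Steps 3-4: rank stored chunks by similarity and keep the top-k texts.
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    retrieved = [text for _, text in ranked[:k]]
    # Step 5: pass the question plus the retrieved chunks to the model.
    prompt = "Context:\n" + "\n---\n".join(retrieved) + f"\n\nQuestion: {question}"
    # Step 6: the model generates an answer from that prompt.
    return generate(prompt), retrieved


def fake_llm(prompt: str) -> str:
    """Hypothetical placeholder; a real system would call a language model here."""
    return f"(model answer based on {len(prompt)} characters of prompt)"


chunks = [
    "Burning fossil fuels releases greenhouse gases, the main cause of recent climate change.",
    "Sea levels rose during the twentieth century as glaciers and ice sheets melted.",
    "The report was prepared by an international panel over several years.",
]
index = prepare(chunks)
answer, retrieved = answer_query("What is the main cause of climate change?", index, fake_llm)
```

    Passing the model in as the generate argument keeps retrieval testable on its own, without any model call.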

    The Role of Chunk Size

    Chunk size is one of the most consequential configuration decisions in simple RAG, and there is no single right answer. It is a tradeoff:

    Too small (e.g., 100 characters): Each chunk is just a sentence or two. It may be very specific, but it lacks surrounding context. The model gets a precise fragment but cannot understand the bigger picture.

    Too large (e.g., 5,000 characters): Each chunk covers a page or more. The relevant information is buried in a sea of unrelated text. Retrieval still works, but a single embedding now has to represent many topics at once, so the signal is diluted by noise.

    Just right (e.g., 1,000 characters with 200 overlap): Each chunk is roughly a full paragraph. It carries enough context to be meaningful on its own, but is focused enough to rank well in similarity search. The three panels below show this visually: tiny scattered fragments with no context, balanced medium-sized pieces, and oversized blocks where the relevant text is a tiny fraction of the whole.

    Three panels comparing chunk sizes. “Too Small” shows scattered tiny fragments with no context. “Just Right” shows balanced medium-sized pieces. “Too Large” shows oversized blocks where the relevant portion is a tiny highlighted area.
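
    A little arithmetic shows how strongly chunk size shapes the index. The sketch below estimates how many chunks a sliding-window splitter would produce. The 400,000-character document length is an assumption (200 pages at roughly 2,000 characters per page), as is scaling the overlap to 20% of each chunk size to match the 1,000/200 default.

```python
import math


def estimate_chunk_count(total_chars: int, chunk_size: int, overlap: int) -> int:
    """Estimate how many overlapping fixed-size chunks cover `total_chars` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if total_chars <= 0:
        return 0
    if total_chars <= chunk_size:
        return 1
    step = chunk_size - overlap  # each new chunk adds this many unseen characters
    return math.ceil((total_chars - overlap) / step)


DOCUMENT_CHARS = 400_000  # assumed size of a 200-page report

# Chunk size -> estimated number of chunks, with overlap fixed at 20% of the size.
estimates = {
    size: estimate_chunk_count(DOCUMENT_CHARS, size, size // 5)
    for size in (100, 1000, 5000)
}
# estimates == {100: 5000, 1000: 500, 5000: 100}
```

    Under these assumptions, retrieving the top 3 chunks hands the model about 300 characters at the small setting and about 15,000 at the large one, which is the context-versus-noise tradeoff described above.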

    Where Simple RAG Falls Short

    Simple RAG handles focused, factual questions well. Ask “What is the main cause of climate change?” and the system retrieves the right paragraphs and produces an accurate, grounded answer.

    But consider a harder question: “Compare the economic impacts of climate change across different sectors discussed in the report.” This requires information scattered across multiple sections of the document. A system that retrieves only two or three chunks will miss most of the relevant material. The answer will be incomplete or shallow.

    Similarly, questions that require reasoning across multiple concepts, synthesizing definitions, or connecting ideas from different parts of a document will expose the limits of retrieving a small, fixed number of chunks from a single search.

    These limitations are not bugs. They are the natural boundaries of the simplest possible RAG design. They are also exactly what motivates the advanced techniques covered in later chapters: better chunking strategies, multi-query retrieval, reranking, and query transformation.

    When to Use This

    Best for:

    • Answering factual questions about a single document or small collection
    • Building a quick prototype to validate that RAG helps your use case
    • Internal knowledge bases with straightforward lookup queries
    • Situations where simplicity and low cost matter more than perfect accuracy

    Overkill when:

    • Your documents are short enough to fit entirely in the model’s context window
    • You only need keyword search (traditional search engines work fine)
    • You never update your document collection and could pre-compute all answers

    Tradeoffs:

    Factor          Impact
    Latency         Low: one embedding call + one vector search + one LLM call
    Complexity      Low: fewest moving parts of any RAG approach
    Cost            Low: embedding is cheap, only one LLM call per query
    Accuracy gain   Moderate: strong for focused factual queries, weak for multi-hop or complex questions

    Compared to using an LLM without RAG: Simple RAG grounds the model’s answers in your actual documents, reducing hallucination and enabling responses about content the model never saw during training. The tradeoff is that answer quality depends entirely on retrieval quality. If the wrong chunks are retrieved, the answer will be wrong or incomplete. And with a fixed chunk size and a single retrieval strategy, simple RAG has limited ability to handle nuanced or multi-part questions.


    Key Takeaways

    • RAG adds a “look it up” step before the language model answers, grounding responses in actual source material instead of training memory.
    • Documents are split into overlapping chunks to ensure no information is lost at the boundaries between pieces.
    • Embeddings convert text into numerical vectors where similar meanings are placed close together, enabling search by meaning rather than keywords.
    • A vector store organizes these vectors for fast similarity search, even across millions of chunks.
    • Simple RAG is the foundation that all advanced techniques build upon. Its limitations, including fixed chunk sizes, single-strategy retrieval, and no query refinement, are exactly what the rest of this book addresses.

    Companion Notebook

    The working implementation of this technique is in simple_rag.ipynb in the RAG Techniques repository.