
RAG: Retrieval-Augmented Generation

Advanced RAG architecture including chunking, vector databases, retrieval strategies, and evaluation


RAG: Retrieval-Augmented Generation (Deep Dive)

In Lesson 1, we introduced RAG as a way to ground LLM responses in external knowledge. Now we go deep: production-grade chunking strategies, embedding models, vector databases, retrieval techniques, and evaluation frameworks.

Why RAG Over Fine-Tuning?

RAG is preferred when: (1) your knowledge base changes frequently, (2) you need source attribution, (3) you want to avoid the cost of fine-tuning, or (4) you need to combine information from multiple sources. Fine-tuning bakes knowledge into weights; RAG keeps knowledge external and updatable.

Chunking Strategies

Before embedding, documents must be split into chunks — passages small enough to be individually embedded and retrieved. Chunk size critically affects RAG quality.

Fixed-Size Chunking

Split text into chunks of N characters (or tokens) with optional overlap.

def fixed_size_chunk(text, chunk_size=500, overlap=50):
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start = end - overlap
    return chunks

Pros: Simple, predictable chunk sizes.
Cons: May split sentences or paragraphs mid-thought.

Semantic Chunking

Split at natural boundaries (paragraphs, sections, sentences) and merge small chunks up to a size limit.

import re

def semantic_chunk(text, max_size=500):
    # Split on double newlines (paragraph boundaries)
    paragraphs = re.split(r'\n\n+', text)
    chunks = []
    current = ""
    for para in paragraphs:
        if len(current) + len(para) < max_size:
            current += para + "\n\n"
        else:
            if current:
                chunks.append(current.strip())
            current = para + "\n\n"
    if current:
        chunks.append(current.strip())
    return chunks

Pros: Preserves semantic coherence.
Cons: Variable chunk sizes.

Recursive Chunking

Used by LangChain's RecursiveCharacterTextSplitter. Tries to split on the largest natural boundary first ("\n\n"), then falls back to smaller ones ("\n", ". ", " ").
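The idea can be sketched as follows. This is a simplified illustration, not LangChain's actual implementation (which also merges small pieces back together up to the size limit):

```python
def recursive_chunk(text, separators=("\n\n", "\n", ". ", " "), max_size=500):
    """Split on the largest separator first; recurse into oversized pieces."""
    if len(text) <= max_size or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    chunks = []
    for piece in text.split(sep):
        if len(piece) <= max_size:
            chunks.append(piece)
        else:
            # Piece is still too big: retry with the next-smaller separator
            chunks.extend(recursive_chunk(piece, rest, max_size))
    return [c for c in chunks if c.strip()]
```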

| Strategy | Best For |
|---|---|
| Fixed-size | Uniform content (e.g., product descriptions) |
| Semantic | Structured documents (articles, reports) |
| Recursive | General-purpose, mixed content |

Chunk Size Guidelines

  • Too small (< 100 tokens): Loses context, retrieves fragments
  • Too large (> 1000 tokens): Dilutes relevance, wastes context window
  • Sweet spot: 200–500 tokens with 10–20% overlap

Always Add Overlap

Chunk overlap (typically 10–20% of chunk size) ensures that information at chunk boundaries is not lost. If a key fact spans two chunks, the overlap ensures at least one chunk contains the complete fact.

Embedding Models

Embedding models convert text into dense vectors that capture semantic meaning. Similar texts produce similar vectors.

| Model | Dimensions | Speed | Quality | Open Source |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | Fast | Good | Yes |
| all-mpnet-base-v2 | 768 | Medium | Better | Yes |
| text-embedding-3-small (OpenAI) | 1536 | API | Very Good | No |
| text-embedding-3-large (OpenAI) | 3072 | API | Excellent | No |
| embed-v3 (Cohere) | 1024 | API | Excellent | No |

Choosing an embedding model:

  • Prototyping: Use all-MiniLM-L6-v2 (free, fast, runs locally)
  • Production (cost-sensitive): OpenAI text-embedding-3-small
  • Production (quality-first): OpenAI text-embedding-3-large or Cohere embed-v3

Vector Databases

Vector databases are purpose-built to store, index, and search high-dimensional vectors efficiently.

| Database | Type | Best For |
|---|---|---|
| Chroma | Embedded (local) | Prototyping, small datasets |
| Pinecone | Managed cloud | Production, zero-ops |
| Weaviate | Self-hosted / cloud | Hybrid search, GraphQL |
| pgvector | PostgreSQL extension | Teams already using Postgres |
| Qdrant | Self-hosted / cloud | High performance, filtering |
| FAISS | Library (Meta) | Research, maximum speed |

Chroma Example

import chromadb

client = chromadb.Client()
collection = client.create_collection("my_docs")

# Add documents (Chroma embeds them automatically)
collection.add(
    documents=["Doc 1 text...", "Doc 2 text..."],
    ids=["doc1", "doc2"],
    metadatas=[{"source": "wiki"}, {"source": "blog"}],
)

# Query
results = collection.query(
    query_texts=["What is machine learning?"],
    n_results=3,
)
print(results["documents"])

Retrieval Strategies

Similarity Search (Basic)

Find the k vectors closest to the query vector using cosine similarity or L2 distance. Simple but may return redundant results.
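A brute-force version over in-memory vectors makes the mechanics concrete (a sketch only; real vector databases use approximate indexes such as HNSW to avoid scanning everything):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query_vec, doc_vecs, k=3):
    """Return indices of the k most similar vectors, best first."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:k]
```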

MMR (Maximal Marginal Relevance)

Balances relevance and diversity. After finding the most relevant chunk, each subsequent chunk is chosen to be relevant to the query but different from already-selected chunks.

MMR = argmax [ lambda * sim(doc, query) - (1 - lambda) * max(sim(doc, selected)) ]

  • lambda = 1.0: Pure relevance (same as similarity search)
  • lambda = 0.5: Balance relevance and diversity
  • lambda = 0.0: Maximum diversity
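The greedy selection loop can be sketched directly from the formula (a minimal illustration; `mmr_select` and `cos` are names chosen here, not a library API):

```python
import math

def cos(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr_select(query_vec, doc_vecs, k=3, lam=0.5):
    """Greedily pick k docs, trading query relevance against redundancy."""
    selected, remaining = [], list(range(len(doc_vecs)))
    while remaining and len(selected) < k:
        best = max(
            remaining,
            key=lambda i: lam * cos(query_vec, doc_vecs[i])
            - (1 - lam) * max((cos(doc_vecs[i], doc_vecs[j]) for j in selected),
                              default=0.0),
        )
        selected.append(best)
        remaining.remove(best)
    return selected
```

With a high lambda the second pick is the near-duplicate of the first; with a low lambda it is the most dissimilar document.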
Hybrid Search

Combines semantic search (embeddings) with keyword search (BM25/TF-IDF). This handles cases where exact keyword matches matter (e.g., product IDs, technical terms).

    final_score = alpha * semantic_score + (1 - alpha) * keyword_score
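One simple way to apply this formula, sketched here with min-max normalization so the two score scales are comparable (`hybrid_scores` is a name chosen for illustration):

```python
def hybrid_scores(semantic, keyword, alpha=0.7):
    """Blend normalized semantic and keyword scores per document id."""
    def norm(scores):
        # Min-max normalize so both score types live in [0, 1]
        lo, hi = min(scores.values()), max(scores.values())
        return {d: (s - lo) / (hi - lo) if hi > lo else 0.0 for d, s in scores.items()}
    sem, kw = norm(semantic), norm(keyword)
    return {
        d: alpha * sem.get(d, 0.0) + (1 - alpha) * kw.get(d, 0.0)
        for d in set(sem) | set(kw)
    }
```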
    

Production RAG Patterns

Re-Ranking

After initial retrieval (fast, approximate), use a cross-encoder model to re-rank the top results more accurately.

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# Initial retrieval returns 20 candidates
candidates = [...]

# Re-rank with cross-encoder
scores = reranker.predict([(query, doc) for doc in candidates])

# Sort by re-ranked score
reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
top_docs = [doc for doc, score in reranked[:5]]

Query Expansion

Use the LLM to rewrite or expand the user's query before retrieval:

Original: "Python web frameworks"
Expanded: "Python web frameworks Django Flask FastAPI comparison"
    

HyDE (Hypothetical Document Embeddings)

Instead of embedding the query directly, ask the LLM to generate a hypothetical answer, then embed that answer for retrieval. The hypothetical answer is closer in embedding space to the actual relevant documents.

Query: "How does photosynthesis work?"
HyDE: "Photosynthesis is the process by which plants convert sunlight,
       water, and CO2 into glucose and oxygen using chlorophyll..."

Embed the HyDE text -> search -> retrieve -> generate final answer
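The flow reduces to a small skeleton. Here `llm`, `embed`, and `search` are caller-supplied placeholders, not any particular library's API:

```python
def hyde_retrieve(query, llm, embed, search, k=3):
    """HyDE: embed a hypothetical answer instead of the raw query.

    llm, embed, and search are caller-supplied callables (placeholders).
    """
    # Generate a hypothetical answer, then use its embedding for retrieval
    hypothetical = llm(f"Write a short passage that answers: {query}")
    return search(embed(hypothetical), k=k)
```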

The RAG Evaluation Challenge

RAG systems have multiple failure points: bad chunking, poor embeddings, wrong documents retrieved, or the LLM ignoring the context. You need to evaluate each stage independently to diagnose issues.

RAG Evaluation

The RAGAS framework provides standardized metrics for evaluating RAG systems:

| Metric | What It Measures |
|---|---|
| Faithfulness | Is the answer supported by the retrieved context? (no hallucination) |
| Answer Relevancy | Does the answer actually address the question? |
| Context Precision | Are the retrieved documents relevant to the question? |
| Context Recall | Were all necessary documents retrieved? |

Manual Evaluation Approach

A simple word-overlap heuristic gives a rough faithfulness signal (production systems typically use an LLM judge to extract and verify individual claims):

import re

def evaluate_faithfulness(answer: str, context: str) -> float:
    """Fraction of answer claims (sentences) supported by the context."""
    # Naive heuristic: a claim is supported if most of its words appear in context
    claims = [s.strip() for s in re.split(r"[.!?]+", answer) if s.strip()]
    context_words = set(context.lower().split())
    supported = sum(
        1 for c in claims
        if sum(w in context_words for w in c.lower().split()) / len(c.split()) > 0.5
    )
    return supported / len(claims) if claims else 0.0

def evaluate_context_precision(retrieved_docs: list, relevant_docs: list) -> float:
    """What fraction of retrieved docs are actually relevant?"""
    relevant_set = set(relevant_docs)
    hits = sum(1 for doc in retrieved_docs if doc in relevant_set)
    return hits / len(retrieved_docs) if retrieved_docs else 0.0

def evaluate_context_recall(retrieved_docs: list, relevant_docs: list) -> float:
    """What fraction of relevant docs were actually retrieved?"""
    retrieved_set = set(retrieved_docs)
    hits = sum(1 for doc in relevant_docs if doc in retrieved_set)
    return hits / len(relevant_docs) if relevant_docs else 0.0