
RAG: Retrieval-Augmented Generation

Advanced RAG architecture including chunking, vector databases, retrieval strategies, and evaluation


RAG: Retrieval-Augmented Generation (Deep Dive)

In Lesson 1, we introduced RAG as a way to ground LLM responses in external knowledge. Now we go deep: production-grade chunking strategies, embedding models, vector databases, retrieval techniques, and evaluation frameworks.

Why RAG Over Fine-Tuning?

RAG is preferred when: (1) your knowledge base changes frequently, (2) you need source attribution, (3) you want to avoid the cost of fine-tuning, or (4) you need to combine information from multiple sources. Fine-tuning bakes knowledge into weights; RAG keeps knowledge external and updatable.

Chunking Strategies

Before embedding, documents must be split into chunks — passages small enough to be individually embedded and retrieved. Chunk size critically affects RAG quality.

Fixed-Size Chunking

Split text into chunks of N characters (or tokens) with optional overlap.

def fixed_size_chunk(text, chunk_size=500, overlap=50):
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start = end - overlap
    return chunks

Pros: Simple, predictable chunk sizes.
Cons: May split sentences or paragraphs mid-thought.

Semantic Chunking

Split at natural boundaries (paragraphs, sections, sentences) and merge small chunks up to a size limit.

import re

def semantic_chunk(text, max_size=500):
    # Split on double newlines (paragraph boundaries)
    paragraphs = re.split(r'\n\n+', text)
    chunks = []
    current = ""
    for para in paragraphs:
        if len(current) + len(para) < max_size:
            current += para + "\n\n"
        else:
            if current:
                chunks.append(current.strip())
            current = para + "\n\n"
    if current:
        chunks.append(current.strip())
    return chunks

Pros: Preserves semantic coherence.
Cons: Variable chunk sizes.

Recursive Chunking

Used by LangChain's RecursiveCharacterTextSplitter. Tries to split on the largest natural boundary first ("\n\n"), then falls back to smaller ones ("\n", ". ", " ").
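The idea can be sketched as follows. This is a simplified illustration, not LangChain's actual implementation (which also merges small pieces back together up to the size limit):

```python
def recursive_chunk(text, separators=("\n\n", "\n", ". ", " "), max_size=500):
    """Split on the largest separator first; recurse into oversized pieces."""
    if len(text) <= max_size or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    chunks = []
    for piece in text.split(sep):
        if len(piece) <= max_size:
            chunks.append(piece)
        else:
            # Piece is still too big: retry with the next-smaller separator
            chunks.extend(recursive_chunk(piece, rest, max_size))
    return [c for c in chunks if c.strip()]
```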

| Strategy | Best For |
|---|---|
| Fixed-size | Uniform content (e.g., product descriptions) |
| Semantic | Structured documents (articles, reports) |
| Recursive | General-purpose, mixed content |

Chunk Size Guidelines

  • Too small (< 100 tokens): Loses context, retrieves fragments
  • Too large (> 1000 tokens): Dilutes relevance, wastes context window
  • Sweet spot: 200–500 tokens with 10–20% overlap

Always Add Overlap

Chunk overlap (typically 10–20% of chunk size) ensures that information at chunk boundaries is not lost. If a key fact spans two chunks, the overlap ensures at least one chunk contains the complete fact.

Embedding Models

Embedding models convert text into dense vectors that capture semantic meaning. Similar texts produce similar vectors.

| Model | Dimensions | Speed | Quality | Open Source |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | Fast | Good | Yes |
| all-mpnet-base-v2 | 768 | Medium | Better | Yes |
| text-embedding-3-small (OpenAI) | 1536 | API | Very Good | No |
| text-embedding-3-large (OpenAI) | 3072 | API | Excellent | No |
| embed-v3 (Cohere) | 1024 | API | Excellent | No |

Choosing an embedding model:

  • Prototyping: Use all-MiniLM-L6-v2 (free, fast, runs locally)
  • Production (cost-sensitive): OpenAI text-embedding-3-small
  • Production (quality-first): OpenAI text-embedding-3-large or Cohere embed-v3

Vector Databases

Vector databases are purpose-built to store, index, and search high-dimensional vectors efficiently.

| Database | Type | Best For |
|---|---|---|
| Chroma | Embedded (local) | Prototyping, small datasets |
| Pinecone | Managed cloud | Production, zero-ops |
| Weaviate | Self-hosted / cloud | Hybrid search, GraphQL |
| pgvector | PostgreSQL extension | Teams already using Postgres |
| Qdrant | Self-hosted / cloud | High performance, filtering |
| FAISS | Library (Meta) | Research, maximum speed |

Chroma Example

import chromadb

client = chromadb.Client()
collection = client.create_collection("my_docs")

# Add documents (Chroma embeds them automatically)
collection.add(
    documents=["Doc 1 text...", "Doc 2 text..."],
    ids=["doc1", "doc2"],
    metadatas=[{"source": "wiki"}, {"source": "blog"}],
)

# Query
results = collection.query(
    query_texts=["What is machine learning?"],
    n_results=3,
)
print(results["documents"])

Retrieval Strategies

Similarity Search (Basic)

Find the k vectors closest to the query vector using cosine similarity or L2 distance. Simple but may return redundant results.
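A brute-force version over in-memory vectors makes the mechanics concrete (a sketch only; real vector databases use approximate indexes such as HNSW to avoid scanning everything):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query_vec, doc_vecs, k=3):
    """Return indices of the k most similar vectors, best first."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:k]
```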

MMR (Maximal Marginal Relevance)

Balances relevance and diversity. After finding the most relevant chunk, each subsequent chunk is chosen to be relevant to the query but different from already-selected chunks.

MMR = argmax [ lambda * sim(doc, query) - (1 - lambda) * max(sim(doc, selected)) ]

  • lambda = 1.0: Pure relevance (same as similarity search)
  • lambda = 0.5: Balance relevance and diversity
  • lambda = 0.0: Maximum diversity
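The greedy selection loop can be sketched directly from the formula (a minimal illustration; `mmr_select` and `cos` are names chosen here, not a library API):

```python
import math

def cos(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr_select(query_vec, doc_vecs, k=3, lam=0.5):
    """Greedily pick k docs, trading query relevance against redundancy."""
    selected, remaining = [], list(range(len(doc_vecs)))
    while remaining and len(selected) < k:
        best = max(
            remaining,
            key=lambda i: lam * cos(query_vec, doc_vecs[i])
            - (1 - lam) * max((cos(doc_vecs[i], doc_vecs[j]) for j in selected),
                              default=0.0),
        )
        selected.append(best)
        remaining.remove(best)
    return selected
```

With a high lambda the second pick is the near-duplicate of the first; with a low lambda it is the most dissimilar document.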
Hybrid Search

Combines semantic search (embeddings) with keyword search (BM25/TF-IDF). This handles cases where exact keyword matches matter (e.g., product IDs, technical terms).

    final_score = alpha * semantic_score + (1 - alpha) * keyword_score
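One simple way to apply this formula, sketched here with min-max normalization so the two score scales are comparable (`hybrid_scores` is a name chosen for illustration):

```python
def hybrid_scores(semantic, keyword, alpha=0.7):
    """Blend normalized semantic and keyword scores per document id."""
    def norm(scores):
        # Min-max normalize so both score types live in [0, 1]
        lo, hi = min(scores.values()), max(scores.values())
        return {d: (s - lo) / (hi - lo) if hi > lo else 0.0 for d, s in scores.items()}
    sem, kw = norm(semantic), norm(keyword)
    return {
        d: alpha * sem.get(d, 0.0) + (1 - alpha) * kw.get(d, 0.0)
        for d in set(sem) | set(kw)
    }
```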
    

Production RAG Patterns

Re-Ranking

After initial retrieval (fast, approximate), use a cross-encoder model to re-rank the top results more accurately.

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# Initial retrieval returns 20 candidates
candidates = [...]

# Re-rank with cross-encoder
scores = reranker.predict([(query, doc) for doc in candidates])

# Sort by re-ranked score
reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
top_docs = [doc for doc, score in reranked[:5]]

Query Expansion

Use the LLM to rewrite or expand the user's query before retrieval:

Original: "Python web frameworks"
Expanded: "Python web frameworks Django Flask FastAPI comparison"
    

HyDE (Hypothetical Document Embeddings)

Instead of embedding the query directly, ask the LLM to generate a hypothetical answer, then embed that answer for retrieval. The hypothetical answer is closer in embedding space to the actual relevant documents.

Query: "How does photosynthesis work?"
HyDE: "Photosynthesis is the process by which plants convert sunlight,
       water, and CO2 into glucose and oxygen using chlorophyll..."

Embed the HyDE text -> search -> retrieve -> generate final answer
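The flow reduces to a small skeleton. Here `llm`, `embed`, and `search` are caller-supplied placeholders, not any particular library's API:

```python
def hyde_retrieve(query, llm, embed, search, k=3):
    """HyDE: embed a hypothetical answer instead of the raw query.

    llm, embed, and search are caller-supplied callables (placeholders).
    """
    # Generate a hypothetical answer, then use its embedding for retrieval
    hypothetical = llm(f"Write a short passage that answers: {query}")
    return search(embed(hypothetical), k=k)
```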

The RAG Evaluation Challenge

RAG systems have multiple failure points: bad chunking, poor embeddings, wrong documents retrieved, or the LLM ignoring the context. You need to evaluate each stage independently to diagnose issues.

RAG Evaluation

The RAGAS framework provides standardized metrics for evaluating RAG systems:

| Metric | What It Measures |
|---|---|
| Faithfulness | Is the answer supported by the retrieved context? (no hallucination) |
| Answer Relevancy | Does the answer actually address the question? |
| Context Precision | Are the retrieved documents relevant to the question? |
| Context Recall | Were all necessary documents retrieved? |

Manual Evaluation Approach

A simple word-overlap heuristic gives a rough faithfulness signal (production systems typically use an LLM judge to extract and verify individual claims):

import re

def evaluate_faithfulness(answer: str, context: str) -> float:
    """Fraction of answer claims (sentences) supported by the context."""
    # Naive heuristic: a claim is supported if most of its words appear in context
    claims = [s.strip() for s in re.split(r"[.!?]+", answer) if s.strip()]
    context_words = set(context.lower().split())
    supported = sum(
        1 for c in claims
        if sum(w in context_words for w in c.lower().split()) / len(c.split()) > 0.5
    )
    return supported / len(claims) if claims else 0.0

def evaluate_context_precision(retrieved_docs: list, relevant_docs: list) -> float:
    """What fraction of retrieved docs are actually relevant?"""
    relevant_set = set(relevant_docs)
    hits = sum(1 for doc in retrieved_docs if doc in relevant_set)
    return hits / len(retrieved_docs) if retrieved_docs else 0.0

def evaluate_context_recall(retrieved_docs: list, relevant_docs: list) -> float:
    """What fraction of relevant docs were actually retrieved?"""
    retrieved_set = set(retrieved_docs)
    hits = sum(1 for doc in relevant_docs if doc in retrieved_set)
    return hits / len(relevant_docs) if relevant_docs else 0.0