Tuesday, June 24, 2025

Retrieval Augmented Generation - From Zero to Hero

Part 1: Retrieval Augmented Generation for Dummies


Introduction


In the quiet hum of a software engineer’s day, there is a moment when the volume of documentation, code snippets, and reference manuals feels like a tidal wave. You might recall a time when you needed a precise summary of your company’s API behavior or a deep dive into a library you barely remember. You open a chat window with an LLM, type your question, and watch it confidently generate an answer—only to discover that it has hallucinated details, cited functions that don’t exist, or drawn on information that’s hopelessly out of date. This is where Retrieval-Augmented Generation, or RAG for short, steps into the spotlight.


Imagine walking into a grand library that contains volumes on every technology under the sun. Instead of the model wandering the stacks at random, you enlist a clever librarian who knows exactly which shelves hold the most relevant pages. You hand over your question, the librarian rushes off, slices out a handful of the most pertinent paragraphs, and delivers them back for your LLM to read. The result is an answer that is grounded in real text, up-to-date, and less prone to fiction. By combining the librarian’s retrieval prowess with the LLM’s generative flair, you get the best of both worlds: precision and fluency.


Turning Your Document Corpus into Vectors


Before our librarian can fetch the right pages, she needs a card catalog that describes every volume in terms she can compare. In the world of RAG, that catalog is a vector store filled with embeddings. An embedding is simply a list of numbers—a vector—that captures the meaning of a piece of text in a form a computer can measure. We transform each document (or chunk of a document) into an embedding, then store those embeddings in an index that supports similarity search.


Here is a concrete example in Python to show how you might take a folder of plain-text files, compute embeddings for each file using an open-source model, and load them into a FAISS index.


In the code below, we begin by loading a SentenceTransformer model that produces 384-dimensional embeddings. We then read every file in a directory called “docs,” split each file into one chunk (for simplicity), compute its embedding, and add it to a FAISS index along with a small lookup table that maps vector IDs back to the original text.


# Introduction to the code  

# The following code demonstrates how to build a simple vector store.  

# We use the sentence-transformers library to compute embeddings and FAISS to store them.  

from sentence_transformers import SentenceTransformer  

import faiss  

import os  


# Initialize the embedding model  

model = SentenceTransformer('all-MiniLM-L6-v2')  

embedding_dim = 384  # this model produces 384-dimensional vectors  

index = faiss.IndexFlatL2(embedding_dim)  


# This list will map vector IDs back to text chunks  

id_to_text = []  


# Walk through every text file in the docs directory  

for filename in os.listdir('docs'):  

    if filename.endswith('.txt'):  

        with open(os.path.join('docs', filename), 'r', encoding='utf-8') as f:  

            text = f.read()  


        # Compute the embedding for this entire document  

        embedding = model.encode([text])  


        # Add the vector to the index and remember its text  

        index.add(embedding)  

        id_to_text.append(text)  


# At this point, index.ntotal tells us how many vectors we have  

print(f"Indexed {index.ntotal} documents.")  


After running this snippet, you will have a FAISS index containing one vector per text file, and a Python list called id_to_text that stores the original text for each vector ID in the order they were inserted. The IndexFlatL2 structure uses plain Euclidean distance to find the nearest neighbors; for many applications this is fast enough, though FAISS offers much more sophisticated indexes if your corpus grows large.
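
If your corpus does grow large, one common next step is an inverted-file (IVF) index, which clusters the vectors and searches only a few clusters per query. The snippet below is a minimal sketch of that idea, assuming the same 384-dimensional embeddings as above; the random vectors stand in for your real document embeddings, and the cluster count is only a starting point.

# A minimal sketch of a FAISS IVF index for larger corpora
# (assumes 384-dimensional embeddings; the random vectors are placeholders for real data)
import faiss
import numpy as np

embedding_dim = 384
nlist = 100  # number of clusters; a rough rule of thumb is around the square root of the corpus size

quantizer = faiss.IndexFlatL2(embedding_dim)
ivf_index = faiss.IndexIVFFlat(quantizer, embedding_dim, nlist)

# IVF indexes must be trained on a representative sample of vectors before adding data
sample_vectors = np.random.random((10000, embedding_dim)).astype("float32")
ivf_index.train(sample_vectors)
ivf_index.add(sample_vectors)

# nprobe sets how many clusters are scanned per query: higher means slower but more accurate search
ivf_index.nprobe = 10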


Now that each chunk of text lives in vector form, our librarian can compare a user’s question—also turned into a vector—with all stored vectors, find a handful of the closest ones, and hand them over for the LLM to inspect. 


Pulling Relevant Chunks and Building the Prompt


Before we hand our question over to the LLM, we must ask our librarian to fetch the most relevant passages. To do that, we transform the user’s question into the same vector space as our documents, compare it against every stored vector in the FAISS index, and pick the top hits. Once we have those hits, we stitch them together with a simple prompt template so the LLM can read the retrieved context before answering.


Here is a detailed introduction to the code that follows. First, we show how to compute the embedding for the incoming query using the same SentenceTransformer model. Then we perform a similarity search on the FAISS index, asking for, say, the five closest vectors. FAISS returns both the distances and the integer IDs of those vectors. We use our id_to_text lookup list to map each ID back to its original text chunk. Finally, we assemble a prompt string by prefixing a short instruction for the LLM, inserting each retrieved passage separated by a delimiter, and appending the user’s original question at the end.


# Introduction to the code

# This snippet takes a question string, computes its embedding, finds the top 5 closest document vectors in FAISS,

# maps each result back to its text, and constructs a single prompt string that combines those texts and the question.

from sentence_transformers import SentenceTransformer

import faiss


# Assume `model`, `index`, and `id_to_text` already exist from the indexing step


def build_rag_prompt(question, top_k=5):

    # Compute the embedding for the question

    query_vec = model.encode([question])


    # Perform similarity search to retrieve top_k nearest neighbors

    distances, indices = index.search(query_vec, top_k)


    # Map retrieved indices back to text passages

    retrieved_passages = []

    for idx in indices[0]:

        retrieved_passages.append(id_to_text[idx])


    # Build a prompt by combining a short instruction, the retrieved passages, and the question

    prompt_parts = []

    prompt_parts.append("You are an AI assistant with access to the following documents:")

    for i, passage in enumerate(retrieved_passages, start=1):

        prompt_parts.append(f"[Document {i}]\n{passage}\n")

    prompt_parts.append(f"Using only the above documents, answer the following question:\n{question}")


    # Join all parts into one string with clear separators

    full_prompt = "\n---\n".join(prompt_parts)

    return full_prompt


# Example usage

user_question = "What authentication methods does our API support?"

rag_prompt = build_rag_prompt(user_question, top_k=5)

print(rag_prompt)


When you run this code, you will see a prompt that begins by telling the assistant it has access to numbered documents, then shows each retrieved passage separated by a line of dashes, and finally repeats the user’s question. This structure ensures that the LLM’s generation is grounded in the actual text.



In a real-world setting you often need to support both cloud-based LLMs and models you run yourself. The two snippets below show how you might send the same rag_prompt first to OpenAI’s Chat API and then to a local HuggingFace model.


First we send our prompt to OpenAI’s remote API. We assume you have installed the openai package (this example uses the pre-1.0 openai.ChatCompletion interface) and set your OPENAI_API_KEY in the environment. The code constructs a chat completion request with our assembled prompt as the user message and then prints out the assistant’s reply.


# Introduction to the code

# This example submits the RAG prompt to OpenAI’s chat endpoint.

# It reads your API key from the environment, sends the prompt, and prints the response.

import os

import openai


# Make sure your API key is set as an environment variable

openai.api_key = os.getenv("OPENAI_API_KEY")


# rag_prompt is the string built by build_rag_prompt(...)

response = openai.ChatCompletion.create(

    model="gpt-3.5-turbo",

    messages=[

        {"role": "system", "content": "You are a helpful assistant."},

        {"role": "user",   "content": rag_prompt}

    ],

    max_tokens=512,

    temperature=0.2

)


# Extract and print the assistant’s answer

answer = response["choices"][0]["message"]["content"]

print("OpenAI’s response:")

print(answer)



Next we perform the same retrieval-augmented generation against a model running in our own environment. In this case we use the transformers library to load a causal-LM model and tokenizer. We encode the prompt, generate with sampling, and decode the result. In practice you would pick a larger model and tune generation settings for your latency and quality requirements.



# Introduction to the code

# This example loads a local HuggingFace model and runs text generation on the assembled RAG prompt.

from transformers import AutoTokenizer, AutoModelForCausalLM

import torch


# Load model and tokenizer from the HuggingFace hub

tokenizer = AutoTokenizer.from_pretrained("gpt2")

model     = AutoModelForCausalLM.from_pretrained("gpt2")


# Tokenize the prompt and move tensors to the appropriate device

inputs = tokenizer(rag_prompt, return_tensors="pt")

input_ids = inputs["input_ids"]


# Generate a continuation with a simple sampling strategy

output_ids = model.generate(

    input_ids,

    max_length=input_ids.shape[1] + 200,

    do_sample=True,

    top_p=0.95,

    temperature=0.7

)


# Decode and print the generated text, skipping the prompt itself

full_output = tokenizer.decode(output_ids[0], skip_special_tokens=True)

generated_answer = full_output[len(rag_prompt):].strip()

print("Local model’s response:")

print(generated_answer)



Both approaches use the same assembled prompt but trade off different concerns. The remote API offloads model hosting and scaling, while the local model gives you complete control over data privacy and latency.


Advanced Retrieval Techniques


To build a truly robust RAG system, sometimes the first pass of similarity search is not enough. You may notice that the nearest neighbors by simple vector distance still include passages that are off-topic or less useful. By applying a second layer of scrutiny, or by organizing your corpus into smarter pieces, you can dramatically improve the quality of what your librarian brings back. Below we explore three such techniques: reranking with a cross-encoder, splitting text into overlapping windows, and chaining multiple retrieval hops.


Reranking with a Cross-Encoder


Rather than trusting the raw distance from your FAISS index alone, you can use a powerful cross-encoder model to reassess the relevance of each retrieved passage. A cross-encoder takes both the query and a candidate passage together and outputs a single relevance score. In effect, you ask the model to read each passage more carefully and say how well it answers the question.



# Introduction to the code  

# This snippet shows how to use a CrossEncoder to rerank the top_k passages returned by FAISS.  

from sentence_transformers import CrossEncoder


# Initialize a cross-encoder model fine-tuned for reranking

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')


def rerank_passages(question, passages):

    # Prepare input pairs for the cross-encoder: (question, passage)

    pairs = [[question, p] for p in passages]


    # Compute relevance scores for each pair

    scores = reranker.predict(pairs)


    # Pair up each passage with its score, sort by descending score

    scored = list(zip(passages, scores))

    scored.sort(key=lambda x: x[1], reverse=True)


    # Return passages ordered by their reranked score

    return [p for p, _ in scored]


# Example usage

initial_passages = ['first candidate text', 'second candidate text', 'third candidate text']

refined = rerank_passages("How does authentication work?", initial_passages)

print("Reranked passages in order of relevance:")

for idx, text in enumerate(refined, start=1):

    print(f"{idx}: {text}")



After running this code, you will see that the passages most closely matching your question rise to the top. By replacing your raw FAISS hits with these reranked passages, your LLM prompt always begins with the strongest candidates.
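
To make that concrete, here is a small integration sketch. It assumes the model, index, and id_to_text objects from the indexing step and the rerank_passages function above; it pulls a generous candidate set from FAISS and keeps only the strongest few passages for the prompt.

# A small integration sketch: wide FAISS retrieval followed by cross-encoder reranking
# (assumes model, index, id_to_text, and rerank_passages from the earlier examples)
def retrieve_and_rerank(question, candidate_k=20, final_k=5):
    # Pull a broad candidate set with the fast vector search
    query_vec = model.encode([question])
    _, ids = index.search(query_vec, candidate_k)
    candidates = [id_to_text[i] for i in ids[0] if i != -1]

    # Let the cross-encoder reorder the candidates, then keep the strongest few
    reranked = rerank_passages(question, candidates)
    return reranked[:final_k]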


Creating Overlapping Text Chunks


Long documents can sometimes bury key sentences in the middle of a chunk that is too big, causing embeddings to blur important details. To avoid this, you split your text into smaller, overlapping windows so that each critical sentence appears in multiple chunks. Overlap ensures that no sentence is left in isolation, improving the chance that your retrieval step will grab exactly the lines that matter.


# Introduction to the code  

# This example shows how to split a long text into overlapping chunks of a given size and stride.  

def chunk_text_with_overlap(text, chunk_size=200, overlap=50):

    words = text.split()

    chunks = []

    start = 0

    # Slide a window through the tokens

    while start < len(words):

        end = start + chunk_size

        chunk = " ".join(words[start:end])

        chunks.append(chunk)

        # Move the window forward by chunk_size minus overlap

        start += (chunk_size - overlap)

    return chunks


# Example usage

long_doc = ("In a modern RAG system it is important that even details buried deep in a document "

            "can be retrieved accurately. By splitting text into overlapping chunks you ensure "

            "that no sentence is orphaned in a chunk that never surfaces.")

pieces = chunk_text_with_overlap(long_doc, chunk_size=20, overlap=5)

print("Generated chunks:")

for piece in pieces:

    print(f"— {piece}")



When you run this function, each chunk will share its last few words with the next one. After encoding each chunk and indexing them, your similarity search will find the exact snippet where the answer lives, rather than an entire page-length blob.
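
If you want to feed these chunks into the vector store from the first section, the indexing loop barely changes. The short sketch below assumes the model, index, and id_to_text objects from that earlier example.

# Indexing overlapping chunks instead of whole files
# (assumes model, index, and id_to_text from the earlier indexing example)
for chunk in chunk_text_with_overlap(long_doc, chunk_size=200, overlap=50):
    chunk_embedding = model.encode([chunk])
    index.add(chunk_embedding)
    id_to_text.append(chunk)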


Multi-Hop Retrieval


Sometimes a single retrieval step is not enough to answer complex questions that require chaining information from different parts of your corpus. In multi-hop retrieval you perform an initial query, extract key terms or partial answers, and then use those to issue a second retrieval. This process can continue for as many hops as needed, each time narrowing in on the precise facts that together form the final answer.


# Introduction to the code  

# This snippet demonstrates a two-hop retrieval process.  

def multi_hop_query(question, hops=2, top_k=3):

    current_query = question

    collected_passages = []


    for hop in range(hops):


        # Compute vector for the current query and retrieve from FAISS

        vec = model.encode([current_query])

        _, ids = index.search(vec, top_k)

        passages = [id_to_text[i] for i in ids[0]]


        # Optionally rerank each hop

        passages = rerank_passages(current_query, passages)

        # Add the top passage of this hop to our growing context

        best = passages[0]

        collected_passages.append(best)


        # Formulate the next query by appending the found passage

        current_query = f"{question} Given this information: {best}"


    # At the end, return all collected passages

    return collected_passages


# Example usage

hops_result = multi_hop_query("Which encryption algorithm secures our tokens?", hops=2)

print("Multi-hop retrieved passages:")

for hop_idx, passage in enumerate(hops_result, start=1):

    print(f"Hop {hop_idx}: {passage}")


Measuring and Tuning Retrieval Performance


Any retrieval step is only as good as its ability to bring back passages that truly answer your questions. To find out whether your vector-store and similarity search are doing the job, you need concrete metrics. Two of the most informative are precision at k and recall at k. Precision at k tells you what fraction of the top k retrieved passages are actually relevant. Recall at k tells you what fraction of all the relevant passages in your entire corpus appear in that top k. By tracking these numbers, you can see whether your indexing strategy, chunk size, overlap, or reranker is improving the librarian’s accuracy.


Here is a Python example that demonstrates how you might compute precision@k and recall@k for a small evaluation set. We assume you have, for each test question, a list of ground-truth document IDs that are relevant, and that your retrieval code produces a list of retrieved document IDs in order of similarity.


# Introduction to the code  

# This snippet computes precision@k and recall@k given  

# ground_truth: a dict mapping question IDs to sets of relevant doc IDs  

# retrieved: a dict mapping question IDs to lists of retrieved doc IDs in rank order  

def compute_metrics(ground_truth, retrieved, k):

    precision_scores = []

    recall_scores    = []

    for qid, relevant_set in ground_truth.items():

        retrieved_k = retrieved[qid][:k]

        true_positives = set(retrieved_k).intersection(relevant_set)

        num_true = len(true_positives)

        precision = num_true / float(k)

        recall    = num_true / float(len(relevant_set)) if relevant_set else 0.0

        precision_scores.append(precision)

        recall_scores.append(recall)

    avg_precision = sum(precision_scores) / len(precision_scores)

    avg_recall    = sum(recall_scores)    / len(recall_scores)

    return avg_precision, avg_recall


# Example usage

ground_truth = {

    "q1": {"doc3", "doc7"},

    "q2": {"doc2"},

}

retrieved = {

    "q1": ["doc3", "doc1", "doc7", "doc8", "doc5"],

    "q2": ["doc4", "doc2", "doc6", "doc1", "doc9"],

}

p_at_3, r_at_3 = compute_metrics(ground_truth, retrieved, k=3)

print(f"Average precision@3: {p_at_3:.2f}")

print(f"Average recall@3:    {r_at_3:.2f}")


After running this code on the sample evaluation set above, you will see an average precision@3 of 0.50 and an average recall@3 of 1.00, which tell you that, on average, half of the top-three retrieved passages were relevant and that every relevant document appeared in the top three. By experimenting with different chunk sizes, overlap strides, or adding a reranking stage, you can watch these metrics rise on your own evaluation data.


Prompt Engineering Strategies for RAG


Even when your retrieval step is rock solid, the way you ask the LLM to read and process those passages can make all the difference. A well-crafted prompt guides the model to treat retrieved text as authoritative, to cite sources when needed, and to avoid hallucination. One effective approach is to split your prompt into three clear sections: a system instruction that sets the context, a concatenation of numbered retrieved passages, and a final user question that explicitly tells the model to use only the supplied documents.


In practice you might implement a small templating function in Python that fills in these sections cleanly. Below is an example that shows how to build a chat prompt structure for OpenAI’s API.


# Introduction to the code  

# This snippet constructs a list of messages for the OpenAI ChatCompletion API  

# by injecting a system message, each retrieved passage as a user message,  

# and then the real query as a final user message.  

def build_chat_messages(passages, question):

    messages = []

    # The first message sets the assistant’s role and task

    messages.append({

        "role": "system",

        "content": (

            "You are an expert assistant who only answers questions using the provided documents. "

            "If the answer is not in the documents, say you don’t know."

        )

    })

    # Each retrieved passage becomes its own user message

    for i, text in enumerate(passages, start=1):

        messages.append({

            "role": "user",

            "content": f"[Document {i}]\n{text}"

        })

    # The final user message is the actual question

    messages.append({

        "role": "user",

        "content": f"Please answer the following question using only the above documents:\n{question}"

    })

    return messages


# Example usage (assumes openai has been imported and configured as in the earlier cloud example)

top_passages = ["Authentication uses tokens signed with HMAC.", "The API supports OAuth2 and API keys."]

question    = "What methods secure our API?"

chat_messages = build_chat_messages(top_passages, question)

response = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=chat_messages)

print(response["choices"][0]["message"]["content"])



By isolating each document into its own user message, the LLM can distinguish sources and even refer back to “[Document 2]” if needed. You might also experiment with few-shot examples by including one or two sample Q&A pairs at the top of the prompt so the model sees exactly how you want factual grounding and citation style to work.
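
As a rough sketch of that idea, the snippet below inserts one invented example exchange (the document text and answer are purely illustrative) right after the system message produced by build_chat_messages.

# A sketch of few-shot prompting: one invented example exchange is placed after the
# system message so the model sees the desired grounding and citation style
few_shot_example = [
    {
        "role": "user",
        "content": "[Document 1]\nSessions expire after 30 minutes of inactivity.\n\n"
                   "Please answer the following question using only the above documents:\n"
                   "How long do sessions last?"
    },
    {
        "role": "assistant",
        "content": "According to [Document 1], sessions expire after 30 minutes of inactivity."
    },
]

chat_messages = build_chat_messages(top_passages, question)
# Keep the system message first, then the example exchange, then the real documents and question
chat_messages = chat_messages[:1] + few_shot_example + chat_messages[1:]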


Having now covered performance metrics and prompt construction, the next logical step is to consider how to observe and tune your RAG system in production. 


Making RAG Production-Ready


In production, you must treat your RAG system like any other critical service. You cannot rely on anecdotal impressions of “it feels fast enough”; you need hard data. By instrumenting every step of retrieval and generation, pushing metrics into a centralized store, and wiring up dashboards and alerts, you gain visibility into the health of your system. You will know at a glance if retrieval latencies spike during peak traffic, if generation costs suddenly climb, or if your precision at k starts to drift downward. Armed with that data, you can scale components, tune parameters, and roll out improvements with confidence.


Below is a Python example that shows how to measure and expose the most important metrics—retrieval latency and generation latency—using the Prometheus client library. You wrap your existing functions with timing code, record histograms, and serve an HTTP endpoint that Prometheus can scrape.


# Introduction to the code  

# The following snippet instruments RAG retrieval and LLM generation  

# using prometheus_client to collect latency histograms and expose /metrics.  

from prometheus_client import start_http_server, Histogram  

import time  


# Create histograms for retrieval and generation latencies  

retrieval_latency = Histogram(  

    'rag_retrieval_latency_seconds',  

    'Time taken to perform document retrieval'  

)  

generation_latency = Histogram(  

    'rag_generation_latency_seconds',  

    'Time taken to perform LLM generation'  

)  


# Example wrapper around the retrieval function  

@retrieval_latency.time()  

def retrieve_documents(query):  

    # This calls your build_rag_prompt or vector search logic  

    return actual_retrieval_logic(query)  


# Example wrapper around the generation function  

@generation_latency.time()  

def generate_answer(prompt):  

    # This calls your OpenAI API or local model generate function  

    return actual_generation_logic(prompt)  


if __name__ == '__main__':  

    # Start up the Prometheus metrics server on port 8000  

    start_http_server(8000)  

    # Main loop or web service initialization  

    while True:  

        # Example usage (get_next_user_query, assemble_prompt, and deliver_answer are placeholders for your own service logic)

        q = get_next_user_query()  

        docs = retrieve_documents(q)  

        prompt = assemble_prompt(docs, q)  

        answer = generate_answer(prompt)  

        deliver_answer(answer)  



Once this service is running, you configure your Prometheus server to scrape http://your-service:8000/metrics every fifteen seconds. In Grafana you build a panel showing the 95th percentile retrieval latency over time, and another for the average generation latency. You set an alert to fire if retrieval latency exceeds two seconds at the 95th percentile for three consecutive scrapes.


Beyond instrumentation, you must ensure your service can handle increased load without breaking. Vector stores like FAISS in “flat” mode are fast for small to medium corpora, but if your data grows or your query rate spikes, you will want a distributed solution such as Qdrant or Weaviate. These systems shard your embeddings across nodes, replicate for redundancy, and offer horizontal scaling. You connect to them through a client library rather than managing raw FAISS indexes.
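
To give a flavor of what that looks like, here is a minimal sketch using the qdrant-client package. It assumes a Qdrant instance reachable at localhost:6333 and reuses the SentenceTransformer model and id_to_text list from the indexing step; treat it as an illustration rather than a full migration guide.

# A minimal sketch of storing and querying embeddings in Qdrant instead of a local FAISS index
# (assumes a Qdrant server at localhost:6333, the qdrant-client package, and the model
# and id_to_text objects from the indexing step)
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

client = QdrantClient(url="http://localhost:6333")

# Create a collection sized for our 384-dimensional embeddings
client.recreate_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

# Upsert document vectors along with their original text as payload
points = [
    PointStruct(id=i, vector=model.encode(text).tolist(), payload={"text": text})
    for i, text in enumerate(id_to_text)
]
client.upsert(collection_name="docs", points=points)

# Query with an embedded question and read back the most similar passages
hits = client.search(
    collection_name="docs",
    query_vector=model.encode("What authentication methods does our API support?").tolist(),
    limit=5,
)
for hit in hits:
    print(hit.score, hit.payload["text"])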


To scale your inference layer, you will typically package your retrieval and generation logic into a stateless microservice. A Kubernetes cluster can then run multiple replicas behind a load-balancer, automatically adding pods when CPU or memory usage climbs, and trimming them back when traffic subsides. Here is a Kubernetes Horizontal Pod Autoscaler configuration that scales your RAG service between two and ten replicas based on CPU utilization.


# Introduction to the code  

# This YAML snippet configures the Kubernetes Horizontal Pod Autoscaler  

# for a deployment named rag-inference-service to maintain around 50% CPU usage.  

apiVersion: autoscaling/v2  

kind: HorizontalPodAutoscaler  

metadata:  

  name: rag-inference-hpa  

spec:  

  scaleTargetRef:  

    apiVersion: apps/v1  

    kind: Deployment  

    name: rag-inference-service  

  minReplicas: 2  

  maxReplicas: 10  

  metrics:  

  - type: Resource  

    resource:  

      name: cpu  

      target:  

        type: Utilization  

        averageUtilization: 50  



With this in place, your service automatically adapts to demand. You should also implement a simple warm-up routine at container startup that runs a dummy query through retrieval and generation so that your model tensors are loaded and JIT-compiled before real traffic arrives. This avoids the dreaded “cold start” penalty for the first user of each pod.


Finally, you want to tie quality metrics like precision at k and recall at k into your monitoring. By running a small, representative evaluation set on a schedule (for example via a CI job or a nightly batch), you can push those metrics into Prometheus as gauges. If precision@5 drops below your threshold, you trigger an alert for your team to investigate indexing parameters or retriever quality.
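
One lightweight way to do that push is through the Prometheus Pushgateway. The sketch below assumes a Pushgateway reachable at localhost:9091 and reuses the compute_metrics function and the ground_truth and retrieved dictionaries from the evaluation example earlier.

# Pushing nightly quality metrics to a Prometheus Pushgateway as gauges
# (assumes a Pushgateway at localhost:9091 plus compute_metrics, ground_truth, and retrieved
# from the earlier evaluation example)
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()
precision_gauge = Gauge('rag_precision_at_5', 'Average precision@5 on the evaluation set', registry=registry)
recall_gauge = Gauge('rag_recall_at_5', 'Average recall@5 on the evaluation set', registry=registry)

p_at_5, r_at_5 = compute_metrics(ground_truth, retrieved, k=5)
precision_gauge.set(p_at_5)
recall_gauge.set(r_at_5)

# The job label keeps these batch metrics separate from the live service metrics
push_to_gateway('localhost:9091', job='rag_nightly_eval', registry=registry)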


Warm-Up Script for Dockerized Deployment


When a new container starts, the model weights and any just-in-time compilations must load into memory before the first real request arrives. To avoid that initial latency spike, you can bake a warm-up routine into your Docker image that runs a dummy RAG query as soon as the service is ready. Below is an example of a simple shell script named warmup.sh that issues one retrieval-plus-generation call against localhost and then exits.


# Introduction to the code  

# This warmup.sh script waits for the RAG service to be available on port 8080  

# then sends a dummy question to prime both retrieval and LLM generation.  

# Place this script in your Docker image and invoke it before or alongside your service.  


#!/usr/bin/env bash  

set -e  


# Wait until the service is listening on port 8080  

until nc -z localhost 8080; do  

  echo "Waiting for RAG service on port 8080…"  

  sleep 1  

done  


echo "RAG service is up, sending warm-up request…"  


# A simple JSON payload with a placeholder question  

PAYLOAD='{"question":"What is the warm-up token?"}'  


# Send the request and discard the response  

curl -s -X POST \  

  http://localhost:8080/rag \  

  -H "Content-Type: application/json" \  

  -d "$PAYLOAD" > /dev/null  


echo "Warm-up request complete."  



To incorporate this into your Docker image, you might extend your Dockerfile so that your entrypoint runs warmup.sh in the background before starting the main service. For example, you copy the script into the image, make it executable, and then use a shell-based ENTRYPOINT that launches warmup.sh and your Python server in parallel.


Integrating RAG Tests into Your CI/CD Pipeline


Continuous integration can ensure that every code change preserves retrieval quality. You begin by writing a pytest suite that loads a small set of evaluation questions along with their ground-truth document IDs. Each test calls your retrieval function and asserts that precision at a chosen k remains above a threshold. Below is an example test module named test_rag_quality.py.


# Introduction to the code  

# This pytest module defines a single test that checks precision@3 on a tiny evaluation set.  

import pytest  


# Import your retrieval logic and your metrics function  

from my_rag_service import retrieve_documents, compute_metrics  


@pytest.fixture  

def evaluation_set():  

    # A dict mapping question strings to sets of relevant doc IDs  

    return {  

        "How do we handle authentication?": {"doc_api_key", "doc_oauth"},  

        "Where is user data stored?": {"doc_storage"}  

    }  


def test_precision_at_3(evaluation_set):  

    ground_truth = {}  

    retrieved = {}  

    # For each question, call your retrieval function and record the top-3 document IDs  

    for q, relevant_ids in evaluation_set.items():  

        ground_truth[q] = relevant_ids  

        docs = retrieve_documents(q, top_k=3)  

        # Assume retrieve_documents returns a list of doc IDs in rank order  

        retrieved[q] = docs  

    precision, _ = compute_metrics(ground_truth, retrieved, k=3)  

    # Assert that average precision stays above 0.66  

    assert precision >= 0.66, f"precision@3 dropped to {precision:.2f}"  



With this test in place, you configure your CI system—whether GitHub Actions, GitLab CI, or Jenkins—to install dependencies, spin up any required services, and then run pytest. If the precision falls below your threshold, the build fails and your team is alerted before bad code merges.


For a GitHub Actions example, you add a workflow file that checks out your code, sets up Python, installs your package, and invokes pytest. You can even schedule a nightly run that exercises a larger evaluation set and pushes the resulting precision and recall metrics into your monitoring system via an API call or Prometheus pushgateway.


Bringing It All Together


You have now journeyed from the moment you first imagined a librarian fetching just the right pages, all the way through the nuts and bolts of building a retrieval-augmented generation system. You learned how to turn raw documents into vectors, how to index them for fast similarity search, and how to assemble those retrieved passages into a coherent prompt so that an LLM—whether hosted in the cloud or running on your own hardware—can produce grounded, accurate answers.


Along the way you discovered advanced techniques to sharpen your retrieval step. You saw how a cross-encoder reranker can reorder initial candidates by true relevance. You observed how overlapping chunks ensure no important sentence gets lost. You practiced multi-hop queries that chain together context across multiple parts of your corpus. You measured precision at k and recall at k to know exactly how good your system really is. You crafted prompt templates that make it clear the model must rely only on the supplied documents.


Finally, you learned to treat your RAG pipeline as a first-class production service. You instrumented retrieval and generation latencies with Prometheus, scaled your inference with Kubernetes autoscaling, warmed up containers so users never see a cold start, and wove quality tests into your CI/CD pipeline so that drops in retrieval performance fail your build before they ever reach production.


With these building blocks in place, you possess everything required to design, implement, and operate a robust, accurate, and scalable RAG system. Your next move is simple: choose a modest set of internal documents or code snippets that your team relies on, follow the indexing and retrieval tutorial from this article, integrate a local or remote LLM, and measure your first precision@k. From there, experiment with overlap, reranking, and multi-hop strategies until you hit your quality targets.





Part 2: Advanced RAG - BUILDING A SOPHISTICATED RAG SYSTEM IN 10 STEPS 

INTRODUCTION


Large Language Models (LLMs) are powerful, but they are not omniscient. While they can generate fluent and contextually rich text, their knowledge is frozen at training time and they are prone to hallucinations. As a result, any application that relies solely on a language model’s parameters for answering user questions risks producing inaccurate or outdated content.


Retrieval-Augmented Generation (RAG) offers a remedy. Instead of depending entirely on internal knowledge, a RAG system supplements the language model by dynamically retrieving relevant documents from an external knowledge base at query time. The retrieved documents are used to construct a prompt that grounds the LLM’s response in real, verifiable information. In other words, RAG turns an LLM into an open-book system: it still speaks eloquently, but now with evidence.


The RAG approach combines two subsystems: a retriever and a generator. The retriever scans a corpus of indexed documents for material related to a given query, returning the top matching passages. The generator then takes those passages and produces a coherent answer. The quality of the final output depends not only on the generator’s fluency but also on the retriever’s ability to surface relevant context and the system’s strategy for formatting, reranking, and decoding.


This article guides you step-by-step through the construction of a full RAG system. It covers both sparse (BM25) and dense (embedding-based) retrieval techniques, demonstrates how to construct hybrid retrievers that combine lexical and semantic signals, and explains how to rerank retrieved documents and generated answers using neural scoring models. It also shows how to control output generation with top-p and top-k sampling, how to design fallbacks when retrieval fails, and how to evaluate retrieval and generation quality in production. All examples are given in detailed code using real libraries (such as FAISS, SentenceTransformers, Hugging Face Transformers, and CrossEncoder) and are grounded in actual behavior.


By the end of this article, you will understand not just how to build a RAG system—but also how to control it, monitor it, debug it, and improve it.



Step One: Introduction to Retrieval-Augmented Generation


Retrieval-augmented generation (often abbreviated as RAG) has emerged as a powerful way to ground a language model’s output in factual materials. Rather than asking a model to hallucinate answers purely from its internal parameters, the RAG approach feeds the model with relevant pieces of text that have been retrieved from a large collection of documents. In this way, the model can quote or paraphrase actual source material, reducing the risk of fabricating information and improving factual accuracy.


In a typical RAG system, the workflow begins when the user issues a query. The system first transforms that query into a form suitable for searching over an index of documents. For sparse retrieval, this might involve simple token matching and term-frequency scoring; for dense retrieval, it might involve encoding the query and documents into high-dimensional vectors. The top results from this search become “context passages” that are concatenated with the user’s query and sent to the language model. The model then generates an answer that draws on both its own learned knowledge and the retrieved context.


Because the retrieval step can dramatically affect the quality of the final answer, careful consideration must be given to how documents are chunked and indexed, how similarity scores are computed, and how to combine or rerank results. The rest of this article will walk through each of these concerns in detail, starting with a deep dive into the foundational retrieval algorithms.



Step Two: Understanding BM25 and Building a Sparse Retriever


The core of a sparse retrieval method lies in scoring documents according to how well their terms match the query terms balanced by how common or rare those terms are across the entire collection. BM25 applies a formula that rewards higher term frequency in a document while penalizing terms that appear very often in the corpus. The intuition is that a term appearing many times in a given document is likely more relevant to that document’s topic, but if that term appears in almost every document, it loses discriminatory power.


BM25 begins by computing an inverse document frequency weight for each term based on the total number of documents and the number of documents containing that term. Terms that occur in fewer documents receive a larger weight. When ranking, each document’s score for a given query is the sum over all query terms of the product of that term’s inverse document frequency weight and a normalized term-frequency factor. The normalization dampens the raw term frequency using two parameters, conventionally called k1 and b, which control how quickly the benefit of repeated terms saturates and how strongly long documents are penalized.
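
To make the formula concrete before reaching for a library, here is a from-scratch sketch of the Okapi scoring function. The values k1 = 1.5 and b = 0.75 are common defaults rather than anything prescribed here, and production code should rely on a tested implementation such as the one used in the example that follows.

# A from-scratch sketch of the Okapi BM25 score for one query against one document
# (k1 and b are common default values; use a tested library such as rank_bm25 in practice)
import math

def bm25_score(query_tokens, doc_tokens, corpus_tokens, k1=1.5, b=0.75):
    N = len(corpus_tokens)                                  # total number of documents
    avgdl = sum(len(d) for d in corpus_tokens) / N          # average document length in tokens
    score = 0.0
    for term in query_tokens:
        n_t = sum(1 for d in corpus_tokens if term in d)    # documents containing the term
        idf = math.log((N - n_t + 0.5) / (n_t + 0.5) + 1)   # rarer terms receive larger weights
        tf = doc_tokens.count(term)                         # raw term frequency in this document
        norm = tf + k1 * (1 - b + b * len(doc_tokens) / avgdl)
        score += idf * tf * (k1 + 1) / norm
    return score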


To see BM25 in action, the following example uses the “rank_bm25” Python library. The code first tokenizes a small set of documents using a simple whitespace split. It then constructs a BM25 index with default parameters that implement the Okapi BM25 formula. Finally, it issues a sample query, computes BM25 scores against all documents, and retrieves the top scoring document texts in order of relevance. This illustrates how to turn BM25 into a working sparse retriever component.


from rank_bm25 import BM25Okapi


# Prepare a small corpus of documents

documents = [

    "Retrieval augmented generation combines a retriever with a generator.",

    "BM25 is a classical algorithm for scoring the relevance of documents.",

    "Dense vector retrieval uses embeddings to capture semantic similarity."

]


# Tokenize each document into a list of words

tokenized_corpus = [doc.lower().split() for doc in documents]


# Create the BM25 index

bm25 = BM25Okapi(tokenized_corpus)


# Define a user query and tokenize it

query = "what is bm25 relevance scoring"

tokenized_query = query.lower().split()


# Compute BM25 scores for each document

scores = bm25.get_scores(tokenized_query)


# Retrieve the top two documents based on BM25 score

top_docs = bm25.get_top_n(tokenized_query, documents, n=2)


print("BM25 scores:", scores)

print("Top documents:", top_docs)


When you run this code, you will see an array of floating-point BM25 scores corresponding to each document in the original list. The highest scores indicate the documents whose terms best match the query according to BM25’s balance of term frequency, rarity, and document length normalization. The call to get_top_n returns the actual document strings sorted in descending order of relevance.


This simple implementation demonstrates how BM25 can be woven into a retriever that selects the most relevant passages for any incoming query. With the sparse retriever in place, the system can now feed those top passages as contextual grounding for a language model to generate answers.



Step Three: Dense Retrieval with Embeddings and FAISS


Dense retrieval replaces the hand-crafted term weighting of BM25 with a semantic representation of text in a continuous vector space. Instead of counting how often each word appears, this approach uses a pretrained embedding model to transform each document and each query into a high-dimensional vector. Similarity is then computed by measuring the distance or angle between vectors, which captures semantic relationships that simple term matching cannot. For example, two documents that use different words to describe the same concept can end up close together in the vector space, allowing the retriever to surface relevant passages even when they share few exact tokens with the query.


To illustrate dense retrieval, the following code uses the SentenceTransformer library to obtain embeddings and FAISS to perform efficient nearest-neighbor search. First, the code loads a transformer model and encodes a small corpus of document texts into fixed-size vectors. Then it builds a FAISS index on those vectors. Finally, it encodes a sample query into a vector and searches for the nearest document embeddings, retrieving the most semantically similar documents.


# Introduction to the dense retrieval code example

# This code example shows how to load a sentence-transformers model to encode documents and queries,

# build a FAISS index on the document embeddings for fast similarity search,

# and retrieve the top two documents most semantically similar to a given query.

#

# First, SentenceTransformer encodes each text into a 384-dimensional vector.

# FAISS then indexes those vectors on CPU using an inner-product index.

# At query time, the inner product between the query vector and each document vector

# is computed, and the documents with the highest scores are returned as the best matches.


from sentence_transformers import SentenceTransformer

import faiss

import numpy as np


# Step 1: Prepare the corpus of documents

documents = [

    "Retrieval augmented generation grounds language models with retrieved facts.",

    "Dense retrieval finds semantically similar texts using vector embeddings.",

    "BM25 relies on token matching and term frequency for sparse retrieval."

]


# Step 2: Load a pretrained embedding model

model = SentenceTransformer('all-MiniLM-L6-v2')


# Step 3: Encode the documents into vectors

document_embeddings = model.encode(documents, convert_to_numpy=True)


# Step 4: Build a FAISS index for inner-product similarity

dimension = document_embeddings.shape[1]

index = faiss.IndexFlatIP(dimension)

faiss.normalize_L2(document_embeddings)      # normalize vectors for cosine similarity

index.add(document_embeddings)


# Step 5: Encode a query and perform retrieval

query = "how does dense semantic search work"

query_embedding = model.encode([query], convert_to_numpy=True)

faiss.normalize_L2(query_embedding)


# Step 6: Search for the top 2 nearest neighbors

top_k = 2

scores, indices = index.search(query_embedding, top_k)


# Step 7: Map indices back to document texts

top_documents = [documents[i] for i in indices[0]]


print("Similarity scores:", scores[0])

print("Retrieved documents:", top_documents)


When this code runs, you will see a pair of similarity scores corresponding to the two documents whose embeddings are closest to the query embedding in cosine similarity. The printed document texts illustrate which passages the dense retriever considers most semantically relevant, even when the query and documents do not share many exact words. By combining such a retriever with BM25 you can capture both lexical and semantic matching, which often yields better overall coverage.


Having explored sparse and dense retrieval side by side, the next step is to integrate these methods into a unified RAG pipeline. We will examine how to chunk large documents into passages, how to build hybrid indexes combining BM25 and vector scores, and how to assemble retrieval results as context for the language model. 



Step Four: Chunking Documents into Passages for Indexing


Before any retrieval can occur, long documents must be split into smaller passages that serve as the basic units of search and context. If you feed an entire novel or a multi-page report into your retriever, you lose precision: a single document may contain dozens of topics, and matching terms in one section should not boost relevance for queries about another section. By contrast, well-chosen passages keep each chunk focused on a coherent idea, making both sparse and dense retrieval more accurate.


A common strategy is to use a sliding-window approach over tokens rather than raw characters or words. Token counts correlate with how much context a language model sees at once, and tokenization handles punctuation, subwords, and whitespace in a consistent way. You decide on a window size—for example, 512 tokens—and an overlap size—perhaps 128 tokens—so that adjacent passages share context across their boundary. This overlap smooths out cases where an important sentence straddles the cut point.


To illustrate, the following code example shows how to load a GPT-2 tokenizer from the Hugging Face Transformers library, define chunk and overlap sizes, and iterate through a long text to generate overlapping token spans. Each span is then converted back to plain text for indexing. You can adjust the window and overlap lengths to suit your retrieval budget and how much context you wish each passage to carry.


from transformers import GPT2TokenizerFast


# Load a pretrained GPT-2 tokenizer

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")


# Example long document

long_text = (

    "Retrieval-augmented generation requires slicing documents into passages. "

    "If passages are too long, retrieval loses precision. "

    "If passages are too short, you may lose context. "

    "Striking the right balance and using overlap preserves coherent ideas across boundaries."

)


# Define token window size and overlap size

window_size = 50

overlap_size = 10


# Tokenize the entire document once

tokens = tokenizer.encode(long_text)


# List to hold text chunks

passages = []


# Slide over tokens with the given window and overlap

start = 0

while start < len(tokens):

    end = start + window_size

    chunk_tokens = tokens[start:end]

    # Decode back to text and add to passages

    chunk_text = tokenizer.decode(chunk_tokens, clean_up_tokenization_spaces=True)

    passages.append(chunk_text)

    # Advance start by window minus overlap

    start += window_size - overlap_size


# Inspect the resulting passages

print(f"Total passages created: {len(passages)}")

for i, p in enumerate(passages):

    print(f"Passage {i+1}: {p}")


When this code runs, the output reports how many passages it created from the original text and then prints each passage. You will see that consecutive passages share the last ten tokens of the previous window. By indexing these passages rather than whole documents, your retrieval component can return highly focused snippets that directly address the user’s query.


With passages prepared, the next phase is to build indexes that support both sparse and dense search—and even hybrid strategies that combine their strengths. 



Step Five: Constructing a Hybrid Index by Merging BM25 and Embedding Similarity


To capture both precise keyword matches and deeper semantic relationships, a hybrid retriever computes a relevance score for each passage from the classic BM25 algorithm and from vector similarity, then merges these scores into a single ranking. This approach leverages the fact that term-frequency signals excel when queries share vocabulary with documents, while embedding distances discover relevance even when wording differs. By weighting and summing the two signals, you can tune the retriever to favor exact matches, semantic matches, or a balanced combination.


In the code example below, we first build a BM25 index over tokenized passages and then build a FAISS index over the same passages’ embeddings from a sentence transformer. At query time, the script computes BM25 scores and normalized FAISS cosine similarities, rescales them so they lie in the same range, and then forms a weighted sum. The weights alpha and beta control the relative importance of lexical and semantic scores. Finally, the passages are sorted by the hybrid score and the top results are returned.


# Introduction to the hybrid retrieval code example

# This code example demonstrates building a retriever that merges BM25 scores

# and dense vector similarities. It constructs both indexes on the same corpus,

# then at query time computes scores from each index, normalizes them,

# applies user-defined weights, and ranks passages by the combined score.


from rank_bm25 import BM25Okapi

from sentence_transformers import SentenceTransformer

import faiss

import numpy as np


# Prepare a small set of passages

passages = [

    "Retrieval augmented generation grounds outputs in external documents.",

    "Dense embeddings capture semantic similarity beyond exact terms.",

    "BM25 ranks documents by term frequency and inverse document frequency."

]


# Tokenize passages for BM25

tokenized_passages = [p.lower().split() for p in passages]

bm25 = BM25Okapi(tokenized_passages)


# Encode embeddings for FAISS

model = SentenceTransformer('all-MiniLM-L6-v2')

embeddings = model.encode(passages, convert_to_numpy=True)

faiss.normalize_L2(embeddings)

dimension = embeddings.shape[1]

index = faiss.IndexFlatIP(dimension)

index.add(embeddings)


# Define hybrid weight parameters

alpha = 0.5    # weight for BM25 score

beta = 0.5     # weight for dense similarity


# Process a query

query = "how to combine lexical and semantic retrieval"

tokenized_query = query.lower().split()


# Compute BM25 scores

bm25_scores = bm25.get_scores(tokenized_query)

# Compute dense similarity scores

query_emb = model.encode([query], convert_to_numpy=True)

faiss.normalize_L2(query_emb)

raw_dense_scores, dense_indices = index.search(query_emb, len(passages))

# FAISS returns results sorted by similarity, so map the scores back into corpus order
# so that each position lines up with the corresponding BM25 score
dense_scores = np.zeros(len(passages))
dense_scores[dense_indices[0]] = raw_dense_scores[0]


# Normalize both score arrays to 0–1 range

def normalize(x):

    min_x, max_x = np.min(x), np.max(x)

    if max_x - min_x < 1e-6:

        return np.zeros_like(x)

    return (x - min_x) / (max_x - min_x)


norm_bm25 = normalize(bm25_scores)

norm_dense = normalize(dense_scores)


# Compute hybrid scores

hybrid_scores = alpha * norm_bm25 + beta * norm_dense


# Rank passages by hybrid score

ranked_indices = np.argsort(-hybrid_scores)

top_k = 2

top_passages = [passages[i] for i in ranked_indices[:top_k]]

top_scores = [hybrid_scores[i] for i in ranked_indices[:top_k]]


print("Top passages by hybrid score:")

for passage, score in zip(top_passages, top_scores):

    print(f"{score:.4f}  {passage}")


When you run this example, you will observe that passages can rank differently than with BM25 or dense retrieval alone. By adjusting alpha and beta, you can increase emphasis on exact term matches or rely more on semantic similarity. This hybrid scoring often yields more robust retrieval, especially for queries that mix lexical and conceptual cues.



Step Six: Assembling Retrieved Passages into a Prompt and Controlling Generation with Top-p and Top-k Sampling


Once the retriever returns the most relevant passages, those passages must be woven into the prompt you send to your language model. The simplest approach is to place each passage in sequence, separated by clear delimiters, then append the user’s question or instruction. This gives the model explicit context to ground its answer. After assembling the prompt, you call the model with generation-time parameters such as top-p and top-k sampling to influence how creative or focused the response will be.


In the example below, you will see how to load a pretrained causal-LM transformer and tokenizer from the Hugging Face library, how to build a prompt that combines retrieved passages with the user’s query, and how to configure top-p and top-k sampling in the call to generate. The code shows why each generation parameter matters and how to tune them to balance factual precision against creativity.


# Introduction to prompt assembly and sampling parameters

# This code example illustrates how to take a list of retrieved passages and a user question,

# concatenate them into a single prompt for a causal language model,

# and then invoke the model’s generate method with top-p and top-k sampling enabled.

#

# The model uses top-p (nucleus) sampling to keep only the smallest set of candidate tokens

# whose cumulative probability mass reaches p, discarding the unlikely tail.

# It also uses top-k sampling to restrict choices to the k highest-probability tokens.

# Temperature scales the logits to control randomness.


from transformers import AutoTokenizer, AutoModelForCausalLM


# Step 1: Load tokenizer and model

tokenizer = AutoTokenizer.from_pretrained("gpt2")

model = AutoModelForCausalLM.from_pretrained("gpt2")


# Step 2: Prepare retrieved passages and user question

retrieved_passages = [

    "Passage 1: Retrieval augmented generation grounds outputs in source documents.",

    "Passage 2: Hybrid retrieval combines BM25 and dense embeddings for robust relevance."

]

user_question = "How does retrieval augmented generation improve answer accuracy?"


# Step 3: Assemble the prompt

# We join passages with two newlines between them, then add a separator and the question.

context = "\n\n".join(retrieved_passages)

prompt = f"{context}\n\nUser: {user_question}\nAssistant:"


# Step 4: Tokenize the prompt

input_ids = tokenizer.encode(prompt, return_tensors="pt")


# Step 5: Configure generation parameters

max_new_tokens = 150

top_p = 0.92          # nucleus sampling threshold: cumulative probability mass

top_k = 50            # restrict to highest-probability 50 tokens each step

temperature = 0.7     # softens or sharpens the probability distribution


# Step 6: Generate the response

outputs = model.generate(

    input_ids,

    max_new_tokens=max_new_tokens,

    top_p=top_p,

    top_k=top_k,

    temperature=temperature,

    do_sample=True,             # enable sampling (otherwise it would be greedy)

    eos_token_id=tokenizer.eos_token_id

)


# Step 7: Decode and print the model’s answer

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

# The generated_text includes the prompt and the new tokens, so we strip the prompt prefix.

answer = generated_text[len(prompt):].strip()

print("Assistant’s answer:")

print(answer)


When you run this code, the model first reads the two retrieved passages, then sees the user’s question. The generate call with do_sample=True uses nucleus (top-p) sampling at threshold 0.92 to keep only the smallest set of tokens whose combined probability is at least 92 percent. By also applying top-k sampling at 50, the generation limits itself to the fifty most likely tokens at each step. Temperature of 0.7 makes the distribution slightly sharper than default, reducing randomness but still allowing diverse word choices.


By adjusting top_p closer to 1.0, you allow more candidate tokens and hence more varied output. By lowering top_k to a smaller integer, you force the model to choose from an even narrower set of words, which can improve coherence at the cost of creativity. Setting temperature higher than 1.0 makes the model more adventurous, while a value below 1.0 makes it more conservative.


Next, you can experiment with reranking the raw generations by scoring each candidate answer against the question to pick the most relevant one, or you can use beam search with early stopping for more deterministic outputs. 
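
For the deterministic route, the short sketch below reuses the gpt2 tokenizer, model, and input_ids from the sampling example and switches the generate call to beam search with early stopping.

# A minimal sketch of beam search with early stopping as the deterministic alternative
# (reuses tokenizer, model, and input_ids from the sampling example above)
beam_outputs = model.generate(
    input_ids,
    max_new_tokens=150,
    num_beams=4,                   # keep the four most promising partial sequences at each step
    early_stopping=True,           # stop once every beam has produced an end-of-sequence token
    do_sample=False,               # no sampling, so the search is fully deterministic
    eos_token_id=tokenizer.eos_token_id
)
print(tokenizer.decode(beam_outputs[0], skip_special_tokens=True))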



Step Seven: Reranking Retrieved Passages Using a Neural Cross-Encoder


When the retriever returns a set of candidate passages, their initial ranking may not perfectly reflect true relevance. A neural reranker examines each query-passage pair jointly and produces a more precise score because it can attend to interactions between the query words and the passage words. In practice, this second-stage reranker is often a cross-encoder model that takes as input a concatenation of the query and the passage and outputs a single relevance score. This approach is more computationally expensive than the first-stage retriever, but by limiting it to only the top K candidates it can dramatically improve precision at the top of the list.


The code example below shows how to use the sentence-transformers CrossEncoder class with a pretrained model fine-tuned on MS MARCO. The script takes a list of passages and a user question, applies the cross-encoder to each query-passage pair to obtain relevance scores, and then sorts the passages by descending score to produce the reranked list. Each step is explained in detail to clarify how the cross-encoder reranker is integrated into the pipeline.


from sentence_transformers import CrossEncoder


# Introduction to the neural reranking code example

# This example demonstrates how to apply a cross-encoder reranker to the top candidate passages.

# The cross-encoder model is a MiniLM checkpoint fine-tuned on the MS MARCO passage-ranking dataset.

# For each passage, the model predicts a relevance score given the query and passage as input.

# Finally, the passages are sorted by these scores to yield a more accurate top-K ranking.


# Step 1: Load the cross-encoder reranker model

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')


# Step 2: Define the user query and initial candidate passages

query = "how to combine lexical and semantic retrieval"

candidate_passages = [

    "Retrieval augmented generation grounds outputs in external documents.",

    "Dense embeddings capture semantic similarity beyond exact terms.",

    "BM25 ranks documents by term frequency and inverse document frequency."

]


# Step 3: Prepare input pairs for the reranker

# Each input consists of the query string and one passage string

rerank_inputs = [[query, passage] for passage in candidate_passages]


# Step 4: Compute relevance scores for each pair

# The CrossEncoder returns a float score for each query-passage pair

scores = reranker.predict(rerank_inputs)


# Step 5: Pair up passages with their scores and sort

# We sort by score in descending order to rank the most relevant passages first

scored_passages = list(zip(candidate_passages, scores))

scored_passages.sort(key=lambda x: x[1], reverse=True)


# Step 6: Extract the reranked passage texts (ready to be passed downstream as prompt context)

reranked_passages = [passage for passage, score in scored_passages]


# Step 7: Print out the passages in their new order

print("Reranked passages:")

for passage, score in scored_passages:

    print(f"{score:.4f}  {passage}")


This code will output the passages sorted by the cross-encoder’s relevance scores, which typically elevates more contextually appropriate snippets above those that merely share keywords. By applying this reranking step after the initial fast retrieval, the system balances speed and precision: the first stage filters down to a manageable candidate set, and the second stage uses a more powerful model to fine-tune the ranking.
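
To make the hand-off between the two stages explicit, here is a compact sketch that wires them together end to end, using the same model names as above; the passages and the candidate-pool size of three are placeholders you would replace with your own corpus and latency budget.

from sentence_transformers import SentenceTransformer, CrossEncoder
import faiss

# Illustrative two-stage retrieve-then-rerank sketch; the passages are placeholders.
passages = [
    "BM25 ranks documents using term frequency and inverse document frequency.",
    "Dense embeddings capture semantic similarity beyond exact term overlap.",
    "Hybrid retrieval merges lexical and semantic scores for robust ranking.",
    "Cross-encoders jointly encode a query and a passage to score relevance."
]

# Stage 1: fast bi-encoder retrieval over the whole collection
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
passage_embeddings = bi_encoder.encode(passages, convert_to_numpy=True)
faiss.normalize_L2(passage_embeddings)
index = faiss.IndexFlatIP(passage_embeddings.shape[1])
index.add(passage_embeddings)

query = "how to combine lexical and semantic retrieval"
query_embedding = bi_encoder.encode([query], convert_to_numpy=True)
faiss.normalize_L2(query_embedding)
_, candidate_ids = index.search(query_embedding, 3)   # keep a small candidate pool
candidates = [passages[i] for i in candidate_ids[0]]

# Stage 2: slower but more precise cross-encoder reranking of the candidate pool only
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([[query, candidate] for candidate in candidates])
reranked = [c for c, _ in sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)]

print("Final order after reranking:")
for passage in reranked:
    print(" -", passage)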



Step Eight: Reranking Multiple Generated Answer Candidates for Greater Coherence


When you invoke a language model with sampling enabled, it may produce several plausible answers to the same prompt. Each candidate can vary in style, length, and factual precision. By generating a set of answers and then applying a second-stage scoring model, you can select the one that best balances relevance to the question and internal coherence. A common technique is to use a neural cross-encoder reranker that considers the question and each candidate answer together and outputs a relevance or coherence score for each pair.


The code example below shows how to generate five candidate answers with top-p and top-k sampling, strip out the prompt to isolate each answer, and then apply a cross-encoder reranker to score each question-answer pair. Finally, the answers are sorted by their reranker scores and the highest-scoring answer is selected as the final output.


from transformers import AutoTokenizer, AutoModelForCausalLM

from sentence_transformers import CrossEncoder


# This example generates multiple answers to the same prompt and uses a cross-encoder

# reranker to pick the most coherent and relevant one.

#

# First, we load a causal language model and tokenizer for generation.

# Then we generate several answer sequences using nucleus (top-p) and top-k sampling.

# After decoding, we prepare inputs for the cross-encoder by pairing the user question

# with each generated answer. The reranker produces a float score for each pair.

# Finally, we sort the answers by descending score and print them in order.


# Load tokenizer and model for generation

gen_tokenizer = AutoTokenizer.from_pretrained("gpt2")

gen_model     = AutoModelForCausalLM.from_pretrained("gpt2")


# Load a cross-encoder reranker fine-tuned for relevance scoring

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")


# Define retrieved passages, question, and assemble the prompt

retrieved = [

    "Passage: Retrieval augmented generation grounds outputs in source documents.",

    "Passage: Hybrid retrieval combines lexical BM25 with semantic embeddings."

]

question = "How does retrieval augmented generation improve answer accuracy?"

prompt   = "\n\n".join(retrieved) + "\n\nUser: " + question + "\nAssistant:"


# Tokenize the prompt

input_ids = gen_tokenizer.encode(prompt, return_tensors="pt")


# Generate multiple answer candidates with sampling

outputs = gen_model.generate(

    input_ids,

    max_new_tokens=100,

    do_sample=True,

    top_p=0.9,

    top_k=40,

    temperature=0.8,

    num_return_sequences=5,

    eos_token_id=gen_tokenizer.eos_token_id,

    return_dict_in_generate=True,

    output_scores=False

)


# Decode and strip the prompt to get answer texts

candidates = []

for seq in outputs.sequences:

    full = gen_tokenizer.decode(seq, skip_special_tokens=True)

    answer = full[len(prompt):].strip()

    candidates.append(answer)


# Prepare inputs for the reranker: one [question, answer] pair per candidate

rerank_inputs = [[question, ans] for ans in candidates]


# Compute relevance/coherence scores for each candidate

scores = reranker.predict(rerank_inputs)


# Pair candidates with their scores and sort by descending score

scored = list(zip(candidates, scores))

scored.sort(key=lambda pair: pair[1], reverse=True)


# Print the reranked candidates

print("Candidates ranked by reranker score:")

for ans, score in scored:

    print(f"{score:.3f}  {ans}")


When you run this code, it will display each generated answer along with its coherence score as judged by the cross-encoder. The answer with the highest score typically best addresses the question while maintaining internal consistency and fluency.



Step Nine: Fallback Strategies When Retrieval Returns Poor Results


When the retriever fails to return passages that meet a minimum relevance threshold, the overall RAG pipeline risks generating content that is ungrounded or hallucinated. To avoid this, you can implement fallback strategies that detect low-confidence retrieval and pivot to alternative behaviors. A common approach is to measure the highest similarity score among the retrieved passages, compare it to a predefined cutoff, and if it falls below that cutoff, trigger one of several fallback actions. These might include invoking a broader search over an external knowledge source, asking the user to clarify or refine their query, or generating an answer template that signals uncertainty.


In the code example below, you will see how to implement a simple confidence check on the dense retrieval scores. If the top score is above a threshold, the passages are used as context in the prompt. If the top score is below the threshold, the system instead calls a web search API—represented here by a placeholder function—to fetch snippets from a search engine and uses those snippets in place of the original passages. This ensures that the model always has some grounded context, and it gives you a hook to insert more sophisticated fallbacks in the future.


# Introduction to the fallback strategy code example

# This code example demonstrates how to detect low-confidence retrieval

# by checking the top dense similarity score against a threshold.

# If retrieval confidence is too low, it falls back to a web search API

# to obtain external snippets for grounding. Otherwise, it proceeds

# with the originally retrieved passages.


from sentence_transformers import SentenceTransformer

import faiss

import numpy as np


# Placeholder for an external search API function

def web_search_snippets(query_text, num_snippets=3):

    """

    Call an external search engine and return a list of text snippets.

    In a real system, this might use a REST API to a knowledge base or web search endpoint.

    """

    # For demonstration, we return dummy snippets, truncated to the requested count.

    snippets = [

        "External snippet one related to your query.",

        "External snippet two containing factual information.",

        "External snippet three offering background context."

    ]

    return snippets[:num_snippets]


# Step 1: Load the embedding model and build the FAISS index

model = SentenceTransformer('all-MiniLM-L6-v2')

passages = [

    "This is a passage about retrieval augmented generation.",

    "This passage discusses semantic embeddings.",

    "This passage explains BM25 scoring."

]

embeddings = model.encode(passages, convert_to_numpy=True)

faiss.normalize_L2(embeddings)

index = faiss.IndexFlatIP(embeddings.shape[1])

index.add(embeddings)


# Step 2: Encode the user query and perform dense retrieval

query = "Explain how RAG handles missing information"

query_emb = model.encode([query], convert_to_numpy=True)

faiss.normalize_L2(query_emb)

top_k = 3

scores, indices = index.search(query_emb, top_k)

top_scores = scores.flatten()

top_passages = [passages[i] for i in indices.flatten()]


# Step 3: Apply a confidence threshold to decide on fallback

confidence_threshold = 0.3

if top_scores[0] < confidence_threshold:

    # Retrieval confidence is too low, invoke fallback

    fallback_passages = web_search_snippets(query, num_snippets=top_k)

    context_passages = fallback_passages

    print("Fallback activated. Using external snippets as context.")

else:

    # Retrieval confidence is acceptable, proceed normally

    context_passages = top_passages

    print("Retrieval confidence sufficient. Using retrieved passages.")


# Step 4: Assemble the prompt with the chosen context passages

user_question = "How does retrieval-augmented generation handle missing information?"  # the end user's phrasing; the retrieval query above is a shorter paraphrase of it

prompt = ""

for i, passage in enumerate(context_passages):

    prompt += f"Context passage {i+1}: {passage}\n\n"

prompt += f"User: {user_question}\nAssistant:"


print("Final prompt sent to the model:")

print(prompt)


In this example, the code first retrieves the top three passages and examines the highest similarity score. If the score is below the confidence threshold, the system replaces those passages with snippets obtained from a fallback search function. The assembled prompt always includes context—either from the primary retriever or the fallback—so that the language model is never asked to answer with zero grounding.


An alternative or complementary fallback is to prompt the user for clarification when context is insufficient. This could mean generating a response such as, “I’m not finding enough information; could you rephrase or add detail to your question?” By combining automated fallbacks with user interactions, you can guide the system toward reliable answers without letting it hallucinate.
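
Here is a minimal sketch of that clarification behavior, building on the confidence check above; the answer_or_clarify and ask_user_to_clarify helpers, the threshold value, and the wording of the message are all illustrative assumptions rather than part of any library.

# Illustrative clarification fallback; helper names, threshold, and wording are assumptions.
CLARIFICATION_THRESHOLD = 0.3

def ask_user_to_clarify(question):
    # In a chat interface this message would be returned instead of a generated answer.
    return (
        "I'm not finding enough information to answer reliably. "
        f"Could you rephrase or add detail to your question: \"{question}\"?"
    )

def answer_or_clarify(top_score, context_passages, question):
    # If retrieval confidence is too low and no fallback context is available,
    # ask for clarification rather than risking an ungrounded answer.
    if top_score < CLARIFICATION_THRESHOLD and not context_passages:
        return ask_user_to_clarify(question)
    # Otherwise assemble the grounded prompt exactly as in the example above.
    prompt = ""
    for i, passage in enumerate(context_passages):
        prompt += f"Context passage {i+1}: {passage}\n\n"
    prompt += f"User: {question}\nAssistant:"
    return prompt

print(answer_or_clarify(0.12, [], "How does RAG handle missing information?"))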


With fallback mechanisms in place, the final piece of the RAG architecture is to define evaluation metrics to monitor retrieval quality and end-to-end answer precision in production. 


Step Ten: Evaluation Metrics for Monitoring a RAG System in Production


To ensure that a RAG system continues to serve reliable, accurate answers over time, it is essential to define quantitative measures of both retrieval quality and end-to-end answer performance. Tracking retrieval metrics alerts you when the retriever begins returning less relevant passages, while generation metrics tell you how well the language model is leveraging those passages to produce correct, fluent responses.


One core retrieval metric is recall at rank K. Recall at rank K measures the fraction of queries for which at least one of the known relevant documents appears among the top K retrieved passages. High recall@K indicates that the retriever is finding relevant sources, even if they are not always the very top result. Another important metric is mean reciprocal rank (MRR), which averages the inverse rank of the first relevant document across all queries; higher MRR means relevant passages tend to appear earlier in the ranking.
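
Written as formulas, with Q the set of evaluation queries, rel(q) the passages judged relevant to query q, top_K(q) the K highest-ranked retrieved passages, and rank_q the rank of the first relevant passage retrieved for q, the two metrics are:

\text{Recall@}K = \frac{1}{|Q|} \sum_{q \in Q} \mathbf{1}\left[\, \mathrm{rel}(q) \cap \mathrm{top}_K(q) \neq \varnothing \,\right]
\qquad
\mathrm{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\mathrm{rank}_q}

Here \mathbf{1}[\cdot] is the indicator function, and 1/rank_q is taken to be zero when no relevant passage is retrieved for q, which matches how the script below skips such queries when accumulating reciprocal ranks while still counting them in the denominator.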


The following code example shows how to compute recall@K and MRR for a batch of queries. It assumes that for each query you have a list of retrieved passage identifiers and a set of ground-truth relevant passage identifiers. This script loops through each query, checks where the first relevant passage appears in the retrieved list, and accumulates both metrics.



# Introduction to retrieval metrics code example

# This code example demonstrates computing recall at K and mean reciprocal rank (MRR)

# for a batch of queries. For each query, retrieved_ids is the list of passage identifiers

# returned by the retriever, ordered by decreasing relevance. ground_truth_ids is the set

# of passage identifiers known to be relevant. We compute recall@K by checking whether any

# relevant id appears in the top K retrieved, and compute reciprocal rank by finding the

# position of the first relevant passage.


def evaluate_retrieval(retrieved_ids_list, ground_truth_ids_list, K):

    total_queries = len(retrieved_ids_list)

    recall_count = 0

    reciprocal_ranks = 0.0


    for retrieved_ids, ground_truth_ids in zip(retrieved_ids_list, ground_truth_ids_list):

        # Determine whether any ground-truth ID appears in the top K

        top_k_ids = retrieved_ids[:K]

        if any(doc_id in ground_truth_ids for doc_id in top_k_ids):

            recall_count += 1


        # Compute reciprocal rank: inverse of the position of first relevant passage

        rank = None

        for idx, doc_id in enumerate(retrieved_ids, start=1):

            if doc_id in ground_truth_ids:

                rank = idx

                break

        if rank is not None:

            reciprocal_ranks += 1.0 / rank


    recall_at_k = recall_count / total_queries

    mrr = reciprocal_ranks / total_queries

    return recall_at_k, mrr


# Example usage:

# Suppose we have two queries. For the first, retriever returned passages [5,3,7] and

# ground truth relevant passages are {3,8}. For the second, retriever returned [2,9,4]

# and ground truth is {9}.

retrieved_ids_list   = [[5,3,7], [2,9,4]]

ground_truth_ids_list = [{3,8}, {9}]

recall_at_3, mrr = evaluate_retrieval(retrieved_ids_list, ground_truth_ids_list, K=3)

print(f"Recall@3: {recall_at_3:.2f}")

print(f"MRR: {mrr:.2f}")



After running this script, recall@3 will reflect the fraction of queries where a relevant passage appeared in the top three, and MRR will summarize how early relevant passages tend to appear. By logging these metrics over time, you can detect drift in your document collection or changes in query patterns that degrade retriever performance.


Equally important is measuring the quality of the answers generated by the language model once it has its context. Automated metrics such as BLEU, ROUGE, or METEOR compare generated text against reference answers, largely by examining n-gram overlap (METEOR additionally credits stems and synonyms). While these metrics correlate imperfectly with human judgment, they provide fast, repeatable signals. For tasks with well-defined correct answers—such as question answering on a fixed knowledge base—you can also compute Exact Match, the percentage of generated answers that exactly match the reference.


The next code example illustrates how to compute a corpus-level BLEU score for a batch of generated answers against reference answers using the SacreBLEU library. It shows how to prepare the list of hypothesis texts and the list of reference sets in the layout SacreBLEU expects, call the corpus-level BLEU scorer, and read the result as a score between 0 and 100 that reflects n-gram overlap with the references.


# Introduction to generation metrics code example

# This example uses SacreBLEU to compute a corpus-level BLEU score

# between generated model outputs and reference answers. BLEU evaluates

# precision of n-gram matches; here we compute a single aggregate score

# across all examples.


from sacrebleu import corpus_bleu


# Example generated answers and corresponding reference answers

generated_answers = [

    "Retrieval augmented generation improves accuracy by grounding in source texts.",

    "Dense embeddings allow semantic matching beyond exact keyword overlap."

]

# SacreBLEU expects the references organized as one list per reference set,

# each set containing exactly one reference string per generated answer.

reference_answers = [

    [

        "Retrieval augmented generation grounds its answers in external documents to improve accuracy.",

        "Semantic vector embeddings enable retrieval of text that shares meaning even without exact keywords."

    ]

]


# Compute corpus BLEU score

bleu = corpus_bleu(generated_answers, reference_answers)

print(f"Corpus BLEU: {bleu.score:.2f}")


After computing the BLEU score, you obtain a number between 0 and 100 that reflects how closely your model’s responses match the reference answers at the n-gram level. Tracking BLEU over deployment helps you spot regressions when you update model versions or adjust retrieval parameters.
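
Exact Match, mentioned earlier, takes only a few lines to compute. The sketch below applies a light normalization (lowercasing, trimming whitespace, stripping punctuation) before comparing; those normalization choices are assumptions and should mirror how your reference answers are written.

import string

def normalize(text):
    # Lowercase, trim whitespace, and strip punctuation before comparison
    text = text.lower().strip()
    return text.translate(str.maketrans("", "", string.punctuation))

def exact_match(predictions, gold_answers):
    # Fraction of generated answers that match their reference exactly after normalization
    matches = sum(
        1 for pred, gold in zip(predictions, gold_answers)
        if normalize(pred) == normalize(gold)
    )
    return matches / len(predictions)

predictions = ["Paris", "42 kilometers"]
gold_answers = ["Paris.", "42 km"]
print(f"Exact Match: {exact_match(predictions, gold_answers):.2f}")  # 0.50 on this toy pair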


For a production system, you should log retrieval and generation metrics together on a regular schedule—daily or weekly. Visual dashboards can display recall@K and MRR trends alongside BLEU or Exact Match curves. When any metric drops below a warning threshold, automated alerts can notify your team to investigate.
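
As a starting point for that kind of monitoring, the sketch below appends one metrics record per run to a JSON Lines file and prints a warning whenever a metric falls below its threshold; the file name, thresholds, and metric values are placeholders, and in production you would feed in the numbers produced by the scripts above and route alerts to your own notification channel.

import json
from datetime import date

# Placeholder thresholds and metric values; substitute the numbers your pipeline actually produces.
THRESHOLDS = {"recall_at_3": 0.80, "mrr": 0.60, "bleu": 20.0}
todays_metrics = {"recall_at_3": 0.85, "mrr": 0.55, "bleu": 23.4}

# Append one record per run to a JSON Lines log that a dashboard can read later.
record = {"date": date.today().isoformat(), **todays_metrics}
with open("rag_metrics.jsonl", "a", encoding="utf-8") as log_file:
    log_file.write(json.dumps(record) + "\n")

# Print a simple alert whenever a metric falls below its warning threshold.
for name, value in todays_metrics.items():
    if value < THRESHOLDS[name]:
        print(f"ALERT: {name} = {value:.2f} is below the warning threshold of {THRESHOLDS[name]:.2f}")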



CONCLUSION


Retrieval-Augmented Generation is more than a clever engineering pattern. It is a fundamental shift in how we use language models: from static, sealed brains to dynamic, evidence-driven reasoning agents. By fusing retrieval with generation, RAG systems retain the fluency and flexibility of LLMs while grounding their outputs in external, verifiable sources. This makes them not only more trustworthy, but also adaptable: knowledge can be updated instantly in the retrieval corpus without retraining the model.


Throughout this article, we walked through every layer of a complete RAG pipeline. You saw how to implement sparse retrieval using BM25, how to encode documents into high-dimensional vectors for dense retrieval using transformers, and how to build a hybrid retriever that balances term frequency with semantic similarity. You learned how to slice long documents into token-based chunks, how to use FAISS for fast approximate search, and how to combine passages into prompts for a generator model. You also saw how to tune decoding behavior using top-p and top-k sampling, how to rerank both retrieval and generation outputs using neural cross-encoders, and how to implement robust fallbacks when retrieval fails.


Finally, we explored evaluation: recall@K, MRR, BLEU scores, and production metrics to keep your system observable and debuggable. All examples were designed to run with both local and remote models—meaning the architecture is portable, privacy-compliant, and cloud-independent if needed.


By mastering these components, you are now equipped to build RAG systems that are reliable, explainable, and production-ready. Whether you are developing an internal chatbot for your company, a search assistant for a knowledge base, or a research tool for science literature, RAG provides the scaffolding for grounded, dynamic language understanding.


In short: with RAG, your model doesn’t just talk. It reads first.
