INTRODUCTION: WELCOME TO THE WORLD OF LLM CHATBOTS
In this tutorial, we will embark on a journey to build your very first Large Language Model chatbot from scratch. If you have never worked with LLMs programmatically before, do not worry. We will take this step by step, explaining every concept and every line of code along the way.
By the end of this tutorial, you will understand how to create a chatbot using four major frameworks: HuggingFace, LangChain, LangGraph, and LlamaIndex. You will also learn how to enhance your chatbot with Retrieval Augmented Generation, a technique that allows your chatbot to answer questions based on your own documents and data.
Before we begin, let me explain what we will be building. An LLM chatbot is a program that can have conversations with users by understanding their messages and generating intelligent responses. Think of it like having a conversation with a knowledgeable assistant who can help answer questions, provide information, or simply chat.
WHAT YOU NEED TO KNOW BEFORE STARTING
You should have basic Python programming knowledge. This means you should understand variables, functions, and how to run Python scripts. You do not need any prior experience with machine learning or artificial intelligence. We will explain everything from the ground up.
You will need a computer with Python 3.8 or higher installed. You will also need an internet connection to download the necessary libraries and models. Additionally, you may need API keys from services like OpenAI or Anthropic, though we will also show you how to use free, open-source models that run locally on your machine.
PART 1: UNDERSTANDING AND BUILDING A BASIC LLM CHATBOT
SECTION 1: FUNDAMENTAL CONCEPTS YOU MUST UNDERSTAND
Before writing any code, we need to understand what we are actually building. Let me explain the key concepts in simple terms.
What is a Large Language Model?
A Large Language Model, or LLM, is a type of artificial intelligence that has been trained on massive amounts of text from the internet, books, and other sources. Through this training, the model learns patterns in language and can generate human-like text in response to prompts. You can think of an LLM as a very sophisticated autocomplete system that can understand context and generate coherent, relevant responses.
When you send a message to an LLM, the model processes your text and predicts what the most appropriate response would be based on all the patterns it learned during training. The model does not truly understand language the way humans do, but it has learned statistical patterns that allow it to generate remarkably intelligent responses.
Understanding Tokens
LLMs do not actually work with words directly. Instead, they work with tokens. A token is a piece of text that could be a word, part of a word, or even a punctuation mark. For example, the sentence “Hello world!” might be split into three tokens: “Hello”, “ world”, and “!”. The model converts your text into tokens, processes these tokens, and then converts the output tokens back into readable text.
This is important to understand because LLMs have limits on how many tokens they can process at once. This limit is called the context window. If your conversation becomes too long, you may need to truncate older messages to stay within this limit.
The Concept of a Prompt
A prompt is the input you give to an LLM. In a chatbot, each message from the user becomes part of the prompt. A well-crafted prompt can significantly improve the quality of the responses you get from the model. For example, instead of just sending “Tell me about Paris”, you might send “You are a knowledgeable travel guide. Tell me about the top attractions in Paris for first-time visitors.”
How Chatbot Memory Works
A basic LLM call is stateless, meaning it does not remember previous messages in the conversation. If you want your chatbot to remember what was discussed earlier, you need to maintain a conversation history and include relevant previous messages in each new prompt. This is called conversation memory or chat history management.
SECTION 2: SETTING UP YOUR DEVELOPMENT ENVIRONMENT
Now that you understand the basics, let us set up your computer to start building. We will do this step by step.
Step One: Creating a Project Directory
First, create a folder on your computer where you will store all your chatbot code. You can name it something like “my_llm_chatbot”. Open your terminal or command prompt and navigate to this directory.
Step Two: Setting Up a Virtual Environment
A virtual environment is an isolated space where you can install Python packages without affecting other Python projects on your computer. This is considered best practice in Python development. Here is how to create one:
python -m venv chatbot_env
This command creates a virtual environment named “chatbot_env” in your current directory. Now activate it:
On Windows:
chatbot_env\Scripts\activate
On macOS or Linux:
source chatbot_env/bin/activate
When activated, you will see the environment name in your terminal prompt.
Step Three: Installing Core Dependencies
We will install the packages we need throughout this tutorial. Run these commands one by one:
pip install transformers torch
pip install langchain langchain-community langchain-openai
pip install langgraph
pip install llama-index
pip install openai anthropic
pip install chromadb sentence-transformers
pip install python-dotenv
Each of these packages serves a specific purpose. The transformers library is from HuggingFace and provides access to thousands of pre-trained models. Torch is PyTorch, a deep learning framework. The langchain packages provide tools for building LLM applications. LangGraph helps build stateful, multi-actor applications. LlamaIndex specializes in connecting LLMs to your data. The remaining packages handle API access, vector storage, and environment configuration.
Step Four: Organizing Your API Keys
If you plan to use commercial APIs like OpenAI, you will need API keys. Create a file named “.env” in your project directory and add your keys:
OPENAI_API_KEY=your_openai_key_here
ANTHROPIC_API_KEY=your_anthropic_key_here
Never commit this file to version control or share it publicly. The python-dotenv package will load these keys into your environment variables when needed.
SECTION 3: BUILDING YOUR FIRST CHATBOT WITH HUGGINGFACE
HuggingFace is a company that provides tools and a platform for working with machine learning models, particularly focused on natural language processing. Their Transformers library has become the de facto standard for working with pre-trained language models.
When to Use HuggingFace
You should use HuggingFace when you want direct access to open-source models that run locally on your machine, when you need fine-grained control over model behavior, when you want to experiment with different models easily, or when you prefer not to rely on external APIs for privacy or cost reasons.
Understanding the HuggingFace Approach
HuggingFace gives you direct access to the models themselves. You load a model into your computer’s memory and then send it text to generate responses. This means you have complete control but also means you need sufficient computational resources. Smaller models can run on regular laptops, while larger models may require powerful GPUs.
Step One: Importing Required Libraries
Let us create a file called “huggingface_chatbot.py” and start coding. First, we import the necessary libraries:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
The AutoModelForCausalLM class automatically selects the appropriate model architecture based on the model name you provide. The AutoTokenizer handles converting text to tokens and back. The torch library is PyTorch, which handles the underlying mathematical operations.
Step Two: Loading a Model and Tokenizer
Now we need to select a model and load it. For beginners, I recommend starting with a smaller model like GPT-2 or the newer Phi models from Microsoft:
def load_model_and_tokenizer(model_name="microsoft/phi-2"):
"""
Load a pre-trained language model and its tokenizer.
The model is responsible for generating text based on input prompts.
The tokenizer converts text to numbers (tokens) that the model can process
and converts the model's numeric output back to readable text.
Args:
model_name: The identifier for the model on HuggingFace Hub
Returns:
A tuple containing (tokenizer, model)
"""
print(f"Loading model {model_name}... This may take a few minutes.")
# Load the tokenizer for this specific model
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Load the actual model with specific settings
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.float16, # Use half precision to save memory
device_map="auto", # Automatically use GPU if available
trust_remote_code=True # Some models require custom code
)
print("Model loaded successfully!")
return tokenizer, model
This function does several important things. It downloads the model from HuggingFace’s servers if you do not already have it cached locally. The torch_dtype parameter tells PyTorch to use half-precision floating point numbers, which uses less memory and runs faster with minimal impact on quality. The device_map parameter automatically uses your GPU if you have one, otherwise it falls back to CPU.
Step Three: Creating the Generation Function
Now we need a function that takes user input and generates a response:
def generate_response(user_input, tokenizer, model, conversation_history=""):
"""
Generate a response to the user's input using the loaded model.
This function prepares the input, sends it to the model, and decodes
the model's output back into readable text.
Args:
user_input: The message from the user
tokenizer: The tokenizer for converting text to/from tokens
model: The language model that generates responses
conversation_history: Previous messages in the conversation
Returns:
The generated response as a string
"""
# Construct the full prompt including history
full_prompt = conversation_history + f"\nUser: {user_input}\nAssistant:"
# Convert text to tokens
inputs = tokenizer(full_prompt, return_tensors="pt")
# Move inputs to the same device as the model (CPU or GPU)
inputs = {key: value.to(model.device) for key, value in inputs.items()}
# Generate response tokens
with torch.no_grad(): # Disable gradient calculation for inference
outputs = model.generate(
**inputs,
max_new_tokens=200, # Maximum length of the response
temperature=0.7, # Controls randomness (0=deterministic, 1=creative)
do_sample=True, # Enable sampling for more diverse responses
top_p=0.9, # Nucleus sampling parameter
pad_token_id=tokenizer.eos_token_id # Padding token
)
# Decode the generated tokens back to text
full_response = tokenizer.decode(outputs[0], skip_special_tokens=True)
# Extract just the assistant's response
response = full_response.split("Assistant:")[-1].strip()
return response
This function is the heart of your chatbot. Let me explain the key parameters in the generate method. The max_new_tokens parameter limits how long the response can be. Temperature controls randomness - lower values make the model more focused and deterministic, while higher values make it more creative but potentially less coherent. The top_p parameter implements nucleus sampling, which helps generate more natural responses by sampling from the most likely tokens.
Step Four: Building the Chat Loop
Now we need a main loop that handles the conversation:
def main():
"""
Main function that runs the chatbot interaction loop.
This function loads the model, then repeatedly prompts the user for input,
generates responses, and maintains conversation history.
"""
print("Initializing HuggingFace Chatbot...")
# Load model and tokenizer
tokenizer, model = load_model_and_tokenizer()
# Initialize conversation history
conversation_history = "You are a helpful AI assistant."
print("\nChatbot is ready! Type 'quit' to exit.\n")
while True:
# Get user input
user_input = input("You: ").strip()
# Check for exit command
if user_input.lower() in ['quit', 'exit', 'bye']:
print("Goodbye!")
break
# Skip empty inputs
if not user_input:
continue
# Generate response
response = generate_response(user_input, tokenizer, model, conversation_history)
# Update conversation history
conversation_history += f"\nUser: {user_input}\nAssistant: {response}"
# Display response
print(f"Assistant: {response}\n")
if __name__ == "__main__":
main()
This creates a continuous loop where the user can type messages and receive responses. The conversation history accumulates all previous exchanges, allowing the model to maintain context. However, be aware that this history will eventually exceed the model’s context window, so in a production system you would need to implement truncation or summarization.
Understanding What We Built
You now have a complete working chatbot using HuggingFace. When you run this script, it loads a language model, processes your messages, and generates responses. The model runs entirely on your machine, which means your conversations stay private and you do not need to pay for API calls. However, the quality of responses depends on the size and capability of the model you choose.
SECTION 4: BUILDING A CHATBOT WITH LANGCHAIN
LangChain is a framework specifically designed for building applications powered by language models. It provides high-level abstractions that make it easier to build complex LLM applications without writing all the low-level code yourself.
When to Use LangChain
You should use LangChain when you want to quickly prototype LLM applications, when you need to integrate multiple components like memory and tools, when you want to easily switch between different LLM providers, or when you are building applications that need to chain multiple LLM calls together.
Understanding the LangChain Philosophy
LangChain thinks of LLM applications as chains of components. A component might be an LLM, a prompt template, a memory system, or a tool that the LLM can use. You connect these components together to create sophisticated behaviors. This modular approach makes it easy to build and modify complex applications.
Step One: Importing LangChain Components
Create a new file called “langchain_chatbot.py” and start with imports:
from langchain.chat_models import ChatOpenAI
from langchain.schema import HumanMessage, AIMessage, SystemMessage
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationChain
from dotenv import load_dotenv
import os
The ChatOpenAI class is a wrapper around OpenAI’s chat models. The schema classes represent different types of messages. ConversationBufferMemory stores the conversation history. ConversationChain combines all these components into a working chatbot.
Step Two: Setting Up the LLM Connection
Now we configure the connection to the LLM:
def initialize_chatbot():
"""
Initialize the LangChain chatbot with memory and LLM configuration.
This function sets up a conversation chain that includes memory management
and configures the connection to the LLM provider (OpenAI in this case).
Returns:
A ConversationChain object ready to use
"""
# Load environment variables from .env file
load_dotenv()
# Initialize the chat model with specific parameters
llm = ChatOpenAI(
model_name="gpt-3.5-turbo", # The specific model to use
temperature=0.7, # Controls response creativity
openai_api_key=os.getenv("OPENAI_API_KEY") # API key from environment
)
# Set up memory to track conversation history
memory = ConversationBufferMemory(
return_messages=True, # Return messages in a format the model understands
memory_key="history" # The key used to store history in the chain
)
# Create the conversation chain
conversation = ConversationChain(
llm=llm,
memory=memory,
verbose=True # Print detailed logs of what's happening
)
return conversation
This function creates all the components needed for a conversation. The ChatOpenAI instance connects to OpenAI’s API using your key. The ConversationBufferMemory automatically tracks all messages in the conversation. The ConversationChain ties everything together and handles the flow of information between components.
Step Three: Creating the Chat Interface
Now let us create a function to handle the conversation:
def chat(conversation, user_message):
"""
Send a message to the chatbot and get a response.
This function handles a single interaction: sending the user's message
to the LLM and receiving the response. The conversation chain
automatically manages memory, so context is maintained.
Args:
conversation: The ConversationChain instance
user_message: The message from the user
Returns:
The chatbot's response as a string
"""
try:
# Send the message and get response
# The conversation chain automatically includes memory in the prompt
response = conversation.predict(input=user_message)
return response
except Exception as e:
return f"Error: {str(e)}"
This simple function is all you need to interact with the chatbot. LangChain handles all the complexity of managing conversation history, formatting prompts, and making API calls. This is the power of using a framework like LangChain.
Step Four: Building the Main Loop
Here is the main loop that runs the chatbot:
def main():
"""
Main function that runs the LangChain chatbot.
This initializes the chatbot and runs a loop where users can
have a continuous conversation.
"""
print("Initializing LangChain Chatbot...")
# Initialize the conversation chain
conversation = initialize_chatbot()
print("\nChatbot is ready! Type 'quit' to exit.\n")
while True:
# Get user input
user_input = input("You: ").strip()
# Check for exit command
if user_input.lower() in ['quit', 'exit', 'bye']:
print("Goodbye!")
break
# Skip empty inputs
if not user_input:
continue
# Get and display response
response = chat(conversation, user_input)
print(f"Assistant: {response}\n")
if __name__ == "__main__":
main()
Comparing to HuggingFace
Notice how much simpler this code is compared to the HuggingFace version. LangChain abstracts away the details of tokenization, prompt construction, and memory management. You get a working chatbot with automatic conversation history in just a few lines of code. The tradeoff is that you have less control over the low-level details and you rely on external API services.
Using Local Models with LangChain
While we used OpenAI in this example, LangChain also supports local models through HuggingFace. You can replace the ChatOpenAI line with:
from langchain.llms import HuggingFacePipeline
llm = HuggingFacePipeline.from_model_id(
model_id="microsoft/phi-2",
task="text-generation",
model_kwargs={"temperature": 0.7, "max_length": 512}
)
This gives you the convenience of LangChain with the privacy and cost benefits of local models.
SECTION 5: BUILDING A CHATBOT WITH LANGGRAPH
LangGraph is a newer framework from the creators of LangChain that focuses on building stateful, multi-actor applications with LLMs. It is particularly powerful when you need to build complex workflows with multiple steps and decision points.
When to Use LangGraph
You should use LangGraph when you need to build chatbots with complex state management, when your application requires multiple steps or decision branches, when you want to implement agentic behavior where the LLM can make decisions about what to do next, or when you need fine-grained control over the conversation flow.
Understanding LangGraph’s State Machine Approach
LangGraph treats your application as a state machine. A state machine is a system that can be in different states and transitions between states based on inputs or decisions. In a chatbot, the state might include the conversation history, the current topic, the user’s mood, or any other information you want to track. LangGraph provides tools to define states, transitions, and the logic that determines how your application flows from one state to another.
Step One: Understanding LangGraph Concepts
Before we code, let us understand the key concepts in LangGraph. A graph in LangGraph represents your application’s structure. Nodes are individual operations or steps in your application. Edges connect nodes and define how your application flows from one step to another. State is the data that flows through your graph and gets updated at each node.
Step Two: Setting Up a Basic LangGraph Chatbot
Create a file called “langgraph_chatbot.py”:
from langgraph.graph import StateGraph, END
from langchain.chat_models import ChatOpenAI
from langchain.schema import HumanMessage, AIMessage, SystemMessage
from typing import TypedDict, Annotated, Sequence
import operator
from dotenv import load_dotenv
import os
The StateGraph class is the main building block for creating LangGraph applications. The END constant marks the end of a graph execution. The TypedDict and Annotated types help define our state structure in a type-safe way.
Step Three: Defining the Application State
In LangGraph, you must define what information your application tracks:
class ChatState(TypedDict):
"""
Define the state structure for our chatbot.
This state object will be passed through each node in the graph
and can be modified at each step.
Attributes:
messages: The list of messages in the conversation
user_input: The current message from the user
"""
messages: Annotated[Sequence[HumanMessage | AIMessage], operator.add]
user_input: str
This state definition tells LangGraph what data flows through our application. The messages field uses operator.add which means each node can append to this list rather than replacing it. This is perfect for accumulating conversation history.
Step Four: Creating Graph Nodes
Nodes are functions that take the current state and return an updated state:
def chatbot_node(state: ChatState) -> ChatState:
"""
The main chatbot logic node.
This node receives the current state, sends the conversation to the LLM,
and returns an updated state with the LLM's response added.
Args:
state: The current state containing conversation history
Returns:
Updated state with the new AI message added
"""
# Initialize the LLM
load_dotenv()
llm = ChatOpenAI(
model_name="gpt-3.5-turbo",
temperature=0.7,
openai_api_key=os.getenv("OPENAI_API_KEY")
)
# Get the conversation history
messages = list(state["messages"])
# Add system message if this is the start of conversation
if len(messages) == 0:
messages.insert(0, SystemMessage(content="You are a helpful AI assistant."))
# Add the latest user input
messages.append(HumanMessage(content=state["user_input"]))
# Get response from LLM
response = llm(messages)
# Return updated state with AI response added
return {
"messages": [response],
"user_input": state["user_input"]
}
This node does the actual work of generating responses. It takes the current state, formats the conversation history properly, calls the LLM, and returns the updated state. Notice that we return a dictionary with the new message, and LangGraph’s operator.add annotation automatically appends it to the existing messages.
Step Five: Building the Graph
Now we construct the graph by adding nodes and edges:
def create_chatbot_graph():
"""
Create and configure the LangGraph graph for the chatbot.
This function defines the structure of our application by adding nodes
and edges that determine the flow of execution.
Returns:
A compiled graph ready to execute
"""
# Create a new state graph with our state definition
workflow = StateGraph(ChatState)
# Add the chatbot node
workflow.add_node("chatbot", chatbot_node)
# Set the entry point - where execution starts
workflow.set_entry_point("chatbot")
# Add an edge from chatbot to END - conversation ends after each response
workflow.add_edge("chatbot", END)
# Compile the graph into an executable application
app = workflow.compile()
return app
This creates a very simple graph with just one node. The graph starts at the chatbot node, executes it, and then ends. In more complex applications, you would have multiple nodes with conditional edges that decide which node to execute next based on the state.
Step Six: Running the Chatbot
Here is the main loop that uses our graph:
def main():
"""
Main function that runs the LangGraph chatbot.
"""
print("Initializing LangGraph Chatbot...")
# Create the graph
app = create_chatbot_graph()
# Initialize conversation state
state = {
"messages": [],
"user_input": ""
}
print("\nChatbot is ready! Type 'quit' to exit.\n")
while True:
# Get user input
user_input = input("You: ").strip()
# Check for exit command
if user_input.lower() in ['quit', 'exit', 'bye']:
print("Goodbye!")
break
# Skip empty inputs
if not user_input:
continue
# Update state with new input
state["user_input"] = user_input
# Execute the graph with current state
result = app.invoke(state)
# Extract the latest AI message
latest_message = result["messages"][-1].content
# Update state with new messages
state["messages"] = result["messages"]
# Display response
print(f"Assistant: {latest_message}\n")
if __name__ == "__main__":
main()
Understanding What Makes LangGraph Different
At first glance, this might seem more complicated than the LangChain example. However, LangGraph’s power becomes apparent when you build more complex applications. You can easily add nodes for different behaviors, implement conditional logic to route conversations differently based on user intent, add nodes that call external APIs or tools, and create loops where the LLM can refine its response based on feedback.
For example, you could add a node that classifies user intent and then route to different specialized nodes based on that intent. You could add a node that validates the LLM’s response before showing it to the user. These capabilities make LangGraph extremely powerful for production applications.
SECTION 6: BUILDING A CHATBOT WITH LLAMAINDEX
LlamaIndex is a framework specifically designed for connecting LLMs to your data. While the other frameworks focus on the LLM interaction itself, LlamaIndex specializes in indexing, retrieving, and using external data with LLMs.
When to Use LlamaIndex
You should use LlamaIndex when your chatbot needs to answer questions based on your documents, when you have a knowledge base you want to make searchable through natural language, when you need to work with structured or semi-structured data, or when you are building a question-answering system over your own data.
Understanding the LlamaIndex Approach
LlamaIndex thinks about LLM applications in terms of data ingestion, indexing, and retrieval. You first ingest your data into LlamaIndex, which processes and indexes it. When a user asks a question, LlamaIndex retrieves the most relevant pieces of information from your index and provides them to the LLM as context. The LLM then generates a response based on this retrieved information.
Step One: Setting Up LlamaIndex
Create a file called “llamaindex_chatbot.py”:
from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
from llama_index.llms import OpenAI
from llama_index.memory import ChatMemoryBuffer
from dotenv import load_dotenv
import os
The VectorStoreIndex is LlamaIndex’s main indexing structure. SimpleDirectoryReader helps load documents. ServiceContext configures the LLM and other services. ChatMemoryBuffer manages conversation history.
Step Two: Creating a Simple Chat Engine
Let us start with a basic chatbot without any external data:
def initialize_chat_engine():
"""
Initialize a LlamaIndex chat engine.
This creates a chat engine that can have conversations with memory.
Later we will enhance this with document retrieval capabilities.
Returns:
A chat engine ready for conversation
"""
# Load environment variables
load_dotenv()
# Configure the LLM
llm = OpenAI(
model="gpt-3.5-turbo",
temperature=0.7,
api_key=os.getenv("OPENAI_API_KEY")
)
# Create a service context with our LLM
service_context = ServiceContext.from_defaults(llm=llm)
# Create an empty index
index = VectorStoreIndex([], service_context=service_context)
# Create a chat engine with memory
chat_engine = index.as_chat_engine(
chat_mode="simple",
memory=ChatMemoryBuffer.from_defaults(token_limit=4000),
verbose=True
)
return chat_engine
This creates a chat engine with conversation memory. The token_limit parameter ensures that the conversation history does not exceed the model’s context window. LlamaIndex automatically manages truncating old messages when this limit is reached.
Step Three: Implementing the Chat Loop
The chat loop for LlamaIndex is straightforward:
def main():
"""
Main function that runs the LlamaIndex chatbot.
"""
print("Initializing LlamaIndex Chatbot...")
# Create chat engine
chat_engine = initialize_chat_engine()
print("\nChatbot is ready! Type 'quit' to exit.\n")
while True:
# Get user input
user_input = input("You: ").strip()
# Check for exit command
if user_input.lower() in ['quit', 'exit', 'bye']:
print("Goodbye!")
break
# Skip empty inputs
if not user_input:
continue
# Get response from chat engine
response = chat_engine.chat(user_input)
# Display response
print(f"Assistant: {response}\n")
if __name__ == "__main__":
main()
Understanding LlamaIndex’s Role
At this point, you might wonder why you would use LlamaIndex for a basic chatbot when LangChain is simpler. The answer is that LlamaIndex really shines when you add document retrieval, which we will cover in Part 2. For now, just understand that LlamaIndex provides excellent tools for managing conversation memory and integrates seamlessly with its retrieval capabilities.
SECTION 7: COMPARING FRAMEWORKS AND WHEN TO COMBINE THEM
Now that you have seen all four frameworks, let us discuss when to use each one and when to combine them in a single application.
HuggingFace: Direct Model Access
Use HuggingFace when you want complete control over the model, when you need to run models locally for privacy or cost reasons, when you are experimenting with different models to find the best one, or when you need to fine-tune models for specific tasks. HuggingFace gives you the lowest level of abstraction, which means maximum flexibility but also maximum complexity.
The main advantage is that you own your infrastructure and data never leaves your control. The main disadvantage is that you need to manage more complexity and need sufficient computational resources.
LangChain: Rapid Prototyping and Integration
Use LangChain when you want to quickly build a prototype, when you need to integrate multiple components like memory and tools, when you want to easily switch between different LLM providers, or when you are building chains of operations. LangChain provides a good balance between flexibility and convenience.
The main advantage is rapid development with high-level abstractions. The main disadvantage is that you are somewhat locked into LangChain’s way of doing things, and debugging can be challenging when things go wrong inside the framework.
LangGraph: Complex Stateful Applications
Use LangGraph when you need complex state management, when your application has multiple steps with conditional logic, when you are building agentic systems that make decisions, or when you need fine-grained control over application flow. LangGraph is the most sophisticated of the frameworks and requires more upfront design work.
The main advantage is that it excels at complex workflows and provides excellent observability. The main disadvantage is that it has a steeper learning curve and might be overkill for simple applications.
LlamaIndex: Document-Based Question Answering
Use LlamaIndex when your primary use case involves retrieving and using external documents, when you have a knowledge base to make searchable, when you need to work with structured data, or when building question-answering systems. LlamaIndex is purpose-built for retrieval augmented generation, which we will cover in Part 2.
The main advantage is exceptional document handling and retrieval capabilities. The main disadvantage is that it is less flexible for general LLM application patterns that do not involve document retrieval.
Combining Frameworks in Practice
In real-world applications, you often combine frameworks to leverage the strengths of each. Here are common combinations:
You might use HuggingFace models with LangChain’s abstractions. This gives you the privacy and cost benefits of local models with the convenience of LangChain. The code looks like this:
from langchain.llms import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
# Load HuggingFace model
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
# Create pipeline
pipe = pipeline(
"text-generation",
model=model,
tokenizer=tokenizer,
max_new_tokens=200
)
# Wrap in LangChain
llm = HuggingFacePipeline(pipeline=pipe)
# Now use with any LangChain components
from langchain.chains import ConversationChain
conversation = ConversationChain(llm=llm)
You might use LangChain for orchestration with LlamaIndex for retrieval. This combines LangChain’s workflow capabilities with LlamaIndex’s retrieval expertise:
from langchain.chains import RetrievalQA
from llama_index import VectorStoreIndex
from langchain.llms import OpenAI
# Create LlamaIndex index
index = VectorStoreIndex.from_documents(documents)
# Convert to LangChain retriever
retriever = index.as_retriever()
# Use in LangChain chain
qa_chain = RetrievalQA.from_chain_type(
llm=OpenAI(),
retriever=retriever
)
You might use LangGraph for orchestration with LlamaIndex for retrieval. This gives you LangGraph’s stateful workflows with LlamaIndex’s document capabilities.
The key principle is to choose frameworks based on your specific needs and combine them where it makes sense. Do not feel obligated to use only one framework. Each has its strengths, and they are designed to work together.
PART 2: ADDING RETRIEVAL AUGMENTED GENERATION (RAG)
SECTION 8: UNDERSTANDING RETRIEVAL AUGMENTED GENERATION
Now that you can build basic chatbots, let us enhance them with the ability to answer questions based on your own documents. This technique is called Retrieval Augmented Generation, or RAG for short.
What is RAG and Why Do We Need It?
An LLM is trained on a huge corpus of text and learns general knowledge about the world. However, it has three fundamental limitations. First, it only knows information from its training data, which has a cutoff date. Second, it does not know anything about your personal documents, company data, or private information. Third, it sometimes generates plausible-sounding but incorrect information, a phenomenon called hallucination.
RAG solves these problems by retrieving relevant information from your documents and providing it to the LLM as context. Instead of relying solely on the model’s training, the LLM generates responses based on the actual text you provide. This makes responses more accurate, up-to-date, and grounded in your specific data.
How RAG Works: The Complete Process
Let me walk you through what happens when you use RAG. First, you prepare your data by loading your documents and splitting them into smaller chunks. This is necessary because LLMs have limited context windows and work better with focused pieces of information.
Second, you convert these text chunks into embeddings. An embedding is a mathematical representation of text as a vector of numbers. Text with similar meanings will have similar embeddings. This allows us to find relevant information mathematically.
Third, you store these embeddings in a vector database. A vector database is optimized for finding similar vectors quickly.
When a user asks a question, the RAG system converts the question into an embedding using the same process. It then searches the vector database for the chunks with the most similar embeddings. These are the chunks most semantically related to the question. The system retrieves these relevant chunks and provides them to the LLM along with the user’s question. The LLM then generates a response based on this retrieved context.
Understanding Embeddings More Deeply
An embedding is a list of numbers that represents the semantic meaning of text. For example, the sentence “The cat sat on the mat” might be represented as a vector with 768 numbers. The sentence “The feline rested on the rug” would have a similar vector because the meanings are related, even though the words are different.
Embeddings are created by specialized models trained to capture semantic similarity. When you convert text to embeddings, you are essentially mapping language into a mathematical space where the distance between points represents semantic similarity.
The Vector Database Concept
A vector database stores embeddings and provides fast similarity search. When you query with an embedding, the database uses algorithms like cosine similarity or Euclidean distance to find the most similar stored embeddings. This retrieval happens in milliseconds even with millions of stored vectors.
Common vector databases include ChromaDB, which is lightweight and perfect for prototyping, Pinecone, which is a managed cloud service, Weaviate, which offers rich filtering capabilities, and FAISS from Facebook AI Research, which is extremely fast for local use.
SECTION 9: IMPLEMENTING RAG WITH HUGGINGFACE
Let us implement RAG using HuggingFace. We will use HuggingFace models for both the LLM and the embedding model, and ChromaDB as our vector database.
Step One: Installing Additional Dependencies
You will need some additional packages:
pip install chromadb sentence-transformers pypdf
ChromaDB is our vector database. Sentence-transformers provides embedding models from HuggingFace. PyPDF helps us read PDF documents.
Step Two: Creating the Document Processing Pipeline
Create a file called “huggingface_rag_chatbot.py”:
from sentence_transformers import SentenceTransformer
from transformers import AutoModelForCausalLM, AutoTokenizer
import chromadb
import torch
from pathlib import Path
Now let us create functions to process documents:
class DocumentProcessor:
"""
Handles loading and chunking documents for RAG.
This class provides methods to read text files and split them
into manageable chunks that fit within LLM context windows.
"""
def __init__(self, chunk_size=500, chunk_overlap=50):
"""
Initialize the document processor.
Args:
chunk_size: Maximum number of characters per chunk
chunk_overlap: Number of characters to overlap between chunks
"""
self.chunk_size = chunk_size
self.chunk_overlap = chunk_overlap
def load_text_file(self, file_path):
"""
Load a text file and return its contents.
Args:
file_path: Path to the text file
Returns:
The file contents as a string
"""
with open(file_path, 'r', encoding='utf-8') as f:
return f.read()
def split_into_chunks(self, text):
"""
Split text into overlapping chunks.
Chunking is necessary because LLMs have limited context windows.
Overlap helps ensure important information is not lost at boundaries.
Args:
text: The text to split
Returns:
A list of text chunks
"""
chunks = []
start = 0
while start < len(text):
# Calculate end position for this chunk
end = start + self.chunk_size
# Extract chunk
chunk = text[start:end]
# Only add non-empty chunks
if chunk.strip():
chunks.append(chunk)
# Move start position forward, accounting for overlap
start += self.chunk_size - self.chunk_overlap
return chunks
This class handles the first step of RAG: preparing documents. The chunking strategy uses overlapping windows to ensure that information is not lost when text is split across chunk boundaries.
Step Three: Creating the Vector Store
Now we need to create embeddings and store them:
class VectorStore:
"""
Manages embedding creation and vector database operations.
This class wraps ChromaDB and a sentence transformer model
to provide easy document storage and retrieval.
"""
def __init__(self, collection_name="documents"):
"""
Initialize the vector store.
Args:
collection_name: Name for the ChromaDB collection
"""
# Initialize the embedding model
print("Loading embedding model...")
self.embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
# Initialize ChromaDB client
self.client = chromadb.Client()
# Create or get the collection
self.collection = self.client.create_collection(
name=collection_name,
metadata={"hnsw:space": "cosine"} # Use cosine similarity
)
print("Vector store ready!")
def add_documents(self, chunks):
"""
Add document chunks to the vector store.
This method converts text chunks to embeddings and stores them
along with their text content.
Args:
chunks: List of text chunks to add
"""
print(f"Adding {len(chunks)} chunks to vector store...")
# Generate embeddings for all chunks
embeddings = self.embedding_model.encode(chunks)
# Create IDs for each chunk
ids = [f"chunk_{i}" for i in range(len(chunks))]
# Add to ChromaDB
self.collection.add(
embeddings=embeddings.tolist(),
documents=chunks,
ids=ids
)
print("Documents added successfully!")
def search(self, query, n_results=3):
"""
Search for relevant documents given a query.
This method converts the query to an embedding and finds
the most similar document chunks in the vector store.
Args:
query: The search query
n_results: Number of results to return
Returns:
List of relevant document chunks
"""
# Convert query to embedding
query_embedding = self.embedding_model.encode([query])
# Search in ChromaDB
results = self.collection.query(
query_embeddings=query_embedding.tolist(),
n_results=n_results
)
# Extract and return the document texts
return results['documents'][0]
This class encapsulates all vector database operations. It uses the all-MiniLM-L6-v2 model for embeddings, which is a good balance between speed and quality. The search method performs semantic search to find relevant chunks.
Step Four: Creating the RAG Chatbot
Now let us tie everything together:
class RAGChatbot:
"""
A complete RAG chatbot using HuggingFace models.
This class combines document retrieval with LLM generation
to answer questions based on your documents.
"""
def __init__(self, model_name="microsoft/phi-2"):
"""
Initialize the RAG chatbot.
Args:
model_name: HuggingFace model identifier
"""
# Initialize document processor
self.doc_processor = DocumentProcessor()
# Initialize vector store
self.vector_store = VectorStore()
# Load LLM
print(f"Loading language model {model_name}...")
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
self.model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.float16,
device_map="auto",
trust_remote_code=True
)
print("Model loaded!")
# Initialize conversation history
self.conversation_history = []
def load_documents(self, file_path):
"""
Load and index documents from a file.
Args:
file_path: Path to the document file
"""
# Load document
text = self.doc_processor.load_text_file(file_path)
# Split into chunks
chunks = self.doc_processor.split_into_chunks(text)
# Add to vector store
self.vector_store.add_documents(chunks)
def generate_response(self, user_input):
"""
Generate a response using RAG.
This method retrieves relevant context and generates
a response based on that context.
Args:
user_input: The user's question
Returns:
The generated response
"""
# Retrieve relevant context
relevant_chunks = self.vector_store.search(user_input, n_results=3)
# Construct context from retrieved chunks
context = "\n\n".join(relevant_chunks)
# Build the prompt with context
prompt = f"""Based on the following context, please answer the question.
```
Context:
{context}
Question: {user_input}
Answer:”””
```
# Add to conversation history
full_prompt = "\n".join(self.conversation_history) + "\n" + prompt
# Generate response
inputs = self.tokenizer(full_prompt, return_tensors="pt")
inputs = {k: v.to(self.model.device) for k, v in inputs.items()}
with torch.no_grad():
outputs = self.model.generate(
**inputs,
max_new_tokens=200,
temperature=0.7,
do_sample=True,
pad_token_id=self.tokenizer.eos_token_id
)
# Decode response
full_response = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
response = full_response.split("Answer:")[-1].strip()
# Update conversation history
self.conversation_history.append(f"User: {user_input}\nAssistant: {response}")
return response
```
Step Five: The Main Loop
Here is how to use the RAG chatbot:
def main():
"""
Main function to run the HuggingFace RAG chatbot.
"""
print("Initializing HuggingFace RAG Chatbot...")
# Initialize chatbot
chatbot = RAGChatbot()
# Load documents (you need to provide a text file)
print("\nPlease provide the path to your document file:")
file_path = input("File path: ").strip()
if Path(file_path).exists():
chatbot.load_documents(file_path)
print("\nDocuments loaded and indexed!")
else:
print("File not found. Starting without documents.")
print("\nChatbot is ready! Type 'quit' to exit.\n")
while True:
user_input = input("You: ").strip()
if user_input.lower() in ['quit', 'exit', 'bye']:
print("Goodbye!")
break
if not user_input:
continue
response = chatbot.generate_response(user_input)
print(f"Assistant: {response}\n")
if __name__ == "__main__":
main()
Understanding What We Built
You now have a complete RAG system using HuggingFace. When a user asks a question, the system searches your documents for relevant information, retrieves it, and provides it to the LLM as context. This allows the LLM to answer based on your specific documents rather than just its training data.
SECTION 10: IMPLEMENTING RAG WITH LANGCHAIN
LangChain makes RAG significantly easier with high-level abstractions for document loading, splitting, embedding, and retrieval.
Step One: Setting Up LangChain RAG
Create a file called “langchain_rag_chatbot.py”:
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
from dotenv import load_dotenv
import os
LangChain provides specialized components for each part of the RAG pipeline. The document loader reads files, the text splitter chunks documents, the embeddings class creates vector representations, the vector store manages retrieval, and the retrieval chain ties everything together.
Step Two: Creating the RAG Pipeline
Here is a complete RAG implementation:
class LangChainRAGChatbot:
"""
A RAG chatbot using LangChain's high-level abstractions.
This class demonstrates how LangChain simplifies RAG implementation
by providing pre-built components for each step.
"""
def __init__(self):
"""Initialize the LangChain RAG chatbot."""
load_dotenv()
# Initialize embeddings model
print("Initializing embeddings model...")
self.embeddings = HuggingFaceEmbeddings(
model_name="all-MiniLM-L6-v2"
)
# Initialize LLM
self.llm = ChatOpenAI(
model_name="gpt-3.5-turbo",
temperature=0.7,
openai_api_key=os.getenv("OPENAI_API_KEY")
)
# Initialize memory
self.memory = ConversationBufferMemory(
memory_key="chat_history",
return_messages=True,
output_key="answer"
)
# Vector store will be initialized when documents are loaded
self.vectorstore = None
self.qa_chain = None
print("Chatbot initialized!")
def load_documents(self, file_path):
"""
Load and index documents for RAG.
This method handles the complete pipeline: loading, splitting,
embedding, and storing documents.
Args:
file_path: Path to the document file
"""
print(f"Loading documents from {file_path}...")
# Load the document
loader = TextLoader(file_path, encoding='utf-8')
documents = loader.load()
# Split into chunks
# RecursiveCharacterTextSplitter tries to split on natural boundaries
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=500,
chunk_overlap=50,
length_function=len,
separators=["\n\n", "\n", " ", ""] # Try these in order
)
chunks = text_splitter.split_documents(documents)
print(f"Created {len(chunks)} chunks")
# Create vector store
print("Creating embeddings and vector store...")
self.vectorstore = Chroma.from_documents(
documents=chunks,
embedding=self.embeddings,
collection_name="langchain_rag"
)
# Create the conversational retrieval chain
self.qa_chain = ConversationalRetrievalChain.from_llm(
llm=self.llm,
retriever=self.vectorstore.as_retriever(search_kwargs={"k": 3}),
memory=self.memory,
return_source_documents=True,
verbose=True
)
print("Documents indexed successfully!")
def chat(self, user_input):
"""
Chat with the RAG bot.
Args:
user_input: The user's question
Returns:
The bot's response
"""
if self.qa_chain is None:
return "Please load documents first using load_documents()."
# Get response from the chain
result = self.qa_chain({"question": user_input})
# Extract the answer
answer = result["answer"]
# Optionally, you can also see which source documents were used
source_docs = result.get("source_documents", [])
return answer, source_docs
Step Three: The Main Loop
def main():
"""
Main function to run the LangChain RAG chatbot.
"""
print("Initializing LangChain RAG Chatbot...")
# Initialize chatbot
chatbot = LangChainRAGChatbot()
# Load documents
print("\nPlease provide the path to your document file:")
file_path = input("File path: ").strip()
from pathlib import Path
if Path(file_path).exists():
chatbot.load_documents(file_path)
else:
print("File not found. Exiting.")
return
print("\nChatbot is ready! Type 'quit' to exit.")
print("The bot will answer questions based on your documents.\n")
while True:
user_input = input("You: ").strip()
if user_input.lower() in ['quit', 'exit', 'bye']:
print("Goodbye!")
break
if not user_input:
continue
# Get response
answer, sources = chatbot.chat(user_input)
print(f"Assistant: {answer}\n")
# Optionally show sources
show_sources = input("Show source documents? (y/n): ").strip().lower()
if show_sources == 'y':
print("\nSource documents used:")
for i, doc in enumerate(sources, 1):
print(f"\nSource {i}:")
print(doc.page_content[:200] + "...")
print()
if __name__ == "__main__":
main()
Understanding LangChain’s RAG Advantages
Notice how much simpler this is compared to the HuggingFace implementation. LangChain provides pre-built components that handle all the complexity. The ConversationalRetrievalChain automatically manages retrieval, prompt construction, and conversation history. The RecursiveCharacterTextSplitter intelligently splits text on natural boundaries. The integration between components is seamless.
This is the power of using a framework designed for RAG. You can focus on your application logic rather than the plumbing.
SECTION 11: IMPLEMENTING RAG WITH LANGGRAPH
LangGraph allows you to build more sophisticated RAG systems with custom logic for retrieval, generation, and response validation.
Step One: Setting Up LangGraph RAG
Create a file called “langgraph_rag_chatbot.py”:
from langgraph.graph import StateGraph, END
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import HumanMessage, AIMessage, SystemMessage
from typing import TypedDict, Annotated, Sequence, List
import operator
from dotenv import load_dotenv
Step Two: Defining State for RAG
Our state needs to track more information for RAG:
class RAGState(TypedDict):
"""
State definition for the RAG chatbot.
This tracks all information needed for retrieval and generation.
Attributes:
messages: Conversation history
user_input: Current user question
retrieved_docs: Documents retrieved for current question
final_answer: The generated response
"""
messages: Annotated[Sequence[HumanMessage | AIMessage], operator.add]
user_input: str
retrieved_docs: List[str]
final_answer: str
This state tracks not just the conversation but also the retrieved documents and the final answer. This allows us to implement multi-step workflows where we can inspect and modify what happens at each stage.
Step Three: Creating RAG Nodes
Now we create nodes for each step in the RAG process:
class LangGraphRAG:
"""
A RAG system built with LangGraph for maximum control.
"""
def __init__(self):
"""Initialize the LangGraph RAG system."""
load_dotenv()
# Initialize components
self.embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
self.llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.7)
self.vectorstore = None
self.app = None
def load_documents(self, file_path):
"""Load and index documents."""
loader = TextLoader(file_path, encoding='utf-8')
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=500,
chunk_overlap=50
)
chunks = text_splitter.split_documents(documents)
self.vectorstore = Chroma.from_documents(
documents=chunks,
embedding=self.embeddings
)
# Build the graph after documents are loaded
self.app = self._build_graph()
print(f"Loaded and indexed {len(chunks)} chunks")
def retrieve_node(self, state: RAGState) -> RAGState:
"""
Node that retrieves relevant documents.
This node searches the vector store for documents
relevant to the user's question.
Args:
state: Current state
Returns:
Updated state with retrieved documents
"""
user_input = state["user_input"]
# Retrieve relevant documents
docs = self.vectorstore.similarity_search(user_input, k=3)
# Extract text from documents
doc_texts = [doc.page_content for doc in docs]
return {
"retrieved_docs": doc_texts,
"user_input": user_input,
"messages": []
}
def generate_node(self, state: RAGState) -> RAGState:
"""
Node that generates a response based on retrieved documents.
This node constructs a prompt with the retrieved context
and generates an answer using the LLM.
Args:
state: Current state with retrieved documents
Returns:
Updated state with the generated answer
"""
# Get retrieved documents
context = "\n\n".join(state["retrieved_docs"])
# Construct prompt
prompt = f"""Based on the following context, please answer the question.
```
Context:
{context}
Question: {state[‘user_input’]}
Please provide a clear and concise answer based on the context provided.”””
```
# Get existing messages
messages = list(state["messages"])
# Add system message if needed
if not messages or not isinstance(messages[0], SystemMessage):
messages.insert(0, SystemMessage(content="You are a helpful assistant that answers questions based on provided context."))
# Add user message
messages.append(HumanMessage(content=prompt))
# Generate response
response = self.llm(messages)
return {
"messages": [response],
"final_answer": response.content,
"user_input": state["user_input"],
"retrieved_docs": state["retrieved_docs"]
}
def _build_graph(self):
"""
Build the LangGraph workflow.
This creates a graph with nodes for retrieval and generation.
Returns:
Compiled graph application
"""
workflow = StateGraph(RAGState)
# Add nodes
workflow.add_node("retrieve", self.retrieve_node)
workflow.add_node("generate", self.generate_node)
# Define the flow
workflow.set_entry_point("retrieve")
workflow.add_edge("retrieve", "generate")
workflow.add_edge("generate", END)
return workflow.compile()
def chat(self, user_input):
"""
Process a user question through the RAG pipeline.
Args:
user_input: The user's question
Returns:
The generated answer
"""
if self.app is None:
return "Please load documents first."
# Create initial state
state = {
"user_input": user_input,
"messages": [],
"retrieved_docs": [],
"final_answer": ""
}
# Run the graph
result = self.app.invoke(state)
return result["final_answer"]
Step Four: The Main Loop
def main():
"""
Main function to run the LangGraph RAG chatbot.
"""
print("Initializing LangGraph RAG Chatbot...")
chatbot = LangGraphRAG()
print("\nPlease provide the path to your document file:")
file_path = input("File path: ").strip()
from pathlib import Path
if Path(file_path).exists():
chatbot.load_documents(file_path)
else:
print("File not found. Exiting.")
return
print("\nChatbot is ready! Type 'quit' to exit.\n")
while True:
user_input = input("You: ").strip()
if user_input.lower() in ['quit', 'exit', 'bye']:
print("Goodbye!")
break
if not user_input:
continue
answer = chatbot.chat(user_input)
print(f"Assistant: {answer}\n")
if __name__ == "__main__":
main()
Understanding LangGraph’s RAG Benefits
LangGraph gives you complete control over the RAG pipeline. You can easily add nodes for query rewriting, response validation, or iterative refinement. For example, you could add a node that checks if the retrieved documents are relevant and retrieves more if needed. You could add a node that validates the generated answer for factual consistency with the sources. This level of control is difficult to achieve with other frameworks.
SECTION 12: IMPLEMENTING RAG WITH LLAMAINDEX
LlamaIndex is purpose-built for RAG and makes it remarkably simple to implement sophisticated retrieval systems.
Step One: Basic RAG with LlamaIndex
Create a file called “llamaindex_rag_chatbot.py”:
from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
from llama_index.llms import OpenAI
from llama_index.embeddings import HuggingFaceEmbedding
from llama_index.memory import ChatMemoryBuffer
from dotenv import load_dotenv
import os
Step Two: Creating the RAG System
LlamaIndex makes RAG incredibly straightforward:
class LlamaIndexRAG:
"""
A RAG chatbot using LlamaIndex.
LlamaIndex is designed specifically for RAG, so this implementation
is remarkably concise while still being powerful.
"""
def __init__(self):
"""Initialize the LlamaIndex RAG system."""
load_dotenv()
# Configure LLM
self.llm = OpenAI(
model="gpt-3.5-turbo",
temperature=0.7,
api_key=os.getenv("OPENAI_API_KEY")
)
# Configure embeddings
self.embed_model = HuggingFaceEmbedding(
model_name="all-MiniLM-L6-v2"
)
# Create service context
self.service_context = ServiceContext.from_defaults(
llm=self.llm,
embed_model=self.embed_model
)
self.index = None
self.chat_engine = None
print("LlamaIndex RAG initialized!")
def load_documents(self, file_path):
"""
Load and index documents.
LlamaIndex handles all the complexity of loading, chunking,
embedding, and indexing in just a few lines.
Args:
file_path: Path to document file or directory
"""
print(f"Loading documents from {file_path}...")
# Load documents
# If you pass a directory, it will load all files in it
from pathlib import Path
if Path(file_path).is_dir():
documents = SimpleDirectoryReader(file_path).load_data()
else:
# For a single file, create a temporary directory reader
import tempfile
import shutil
temp_dir = tempfile.mkdtemp()
shutil.copy(file_path, temp_dir)
documents = SimpleDirectoryReader(temp_dir).load_data()
print(f"Loaded {len(documents)} documents")
# Create index
print("Creating index...")
self.index = VectorStoreIndex.from_documents(
documents,
service_context=self.service_context,
show_progress=True
)
# Create chat engine with memory
self.chat_engine = self.index.as_chat_engine(
chat_mode="context", # Use retrieval-augmented generation
memory=ChatMemoryBuffer.from_defaults(token_limit=4000),
system_prompt=(
"You are a helpful assistant that answers questions "
"based on the provided documents. Always cite the source "
"when possible and admit when you don't know something."
),
verbose=True
)
print("Documents indexed and chat engine ready!")
def chat(self, user_input):
"""
Chat with the RAG system.
Args:
user_input: The user's question
Returns:
The generated response
"""
if self.chat_engine is None:
return "Please load documents first."
# Get response
response = self.chat_engine.chat(user_input)
return str(response)
def reset_conversation(self):
"""Reset the conversation history."""
if self.chat_engine:
self.chat_engine.reset()
Step Three: The Main Loop
def main():
"""
Main function to run the LlamaIndex RAG chatbot.
"""
print("Initializing LlamaIndex RAG Chatbot...")
chatbot = LlamaIndexRAG()
print("\nPlease provide the path to your document file or directory:")
file_path = input("Path: ").strip()
from pathlib import Path
if Path(file_path).exists():
chatbot.load_documents(file_path)
else:
print("Path not found. Exiting.")
return
print("\nChatbot is ready! Type 'quit' to exit, 'reset' to clear history.\n")
while True:
user_input = input("You: ").strip()
if user_input.lower() in ['quit', 'exit', 'bye']:
print("Goodbye!")
break
if user_input.lower() == 'reset':
chatbot.reset_conversation()
print("Conversation history cleared.\n")
continue
if not user_input:
continue
response = chatbot.chat(user_input)
print(f"Assistant: {response}\n")
if __name__ == "__main__":
main()
Understanding LlamaIndex’s RAG Strengths
LlamaIndex shines in its simplicity for RAG applications. With just a few lines of code, you get sophisticated document processing, embedding generation, indexing, retrieval, and generation with conversation memory. The framework handles chunking strategies, embedding model integration, and prompt engineering for RAG automatically.
LlamaIndex also provides advanced features like response synthesis modes, query transformations, and sub-question query engines that break complex questions into simpler sub-questions. These advanced features make it easy to build production-quality RAG systems.
SECTION 13: ADVANCED RAG CONCEPTS AND BEST PRACTICES
Now that you have seen RAG implementations across all frameworks, let me share important concepts and best practices.
Chunking Strategies Matter
The way you split documents into chunks significantly affects RAG quality. Too large chunks include irrelevant information. Too small chunks lack context. A good starting point is 400 to 600 characters with 50 to 100 characters of overlap. For technical documents, splitting on section boundaries works better than fixed sizes.
You should experiment with different chunk sizes for your specific use case. Monitor which chunk sizes lead to the most relevant retrievals and best answers.
Choosing the Right Embedding Model
The embedding model determines how well the system understands semantic similarity. Smaller models like all-MiniLM-L6-v2 are fast but less accurate. Larger models like instructor-xl or e5-large-v2 are more accurate but slower and require more memory.
For most applications, all-MiniLM-L6-v2 provides a good balance. For production systems where accuracy is critical, consider larger models or domain-specific embeddings trained on your type of content.
Retrieval Quality is Critical
The quality of your RAG system depends primarily on retrieval quality. If the system retrieves irrelevant documents, even the best LLM cannot generate good answers. You should implement logging to track which documents get retrieved for each query. Review these logs regularly to identify retrieval problems.
Consider implementing hybrid search that combines semantic search with keyword search. This catches cases where semantic similarity alone might miss exact term matches that are important.
Managing Context Window Limits
Even with RAG, you can exceed the model’s context window when you have long retrieved documents and long conversation history. Implement truncation strategies that prioritize recent conversation turns and most relevant retrieved passages.
LlamaIndex handles this automatically with its token limit parameter. For custom implementations, you need to track token counts and truncate intelligently.
Handling Unanswerable Questions
Sometimes the retrieved documents do not contain the answer to a question. Your system should detect this and respond appropriately rather than hallucinating an answer. You can prompt the LLM to say “I cannot answer this based on the provided documents” when it lacks information.
You can also implement a relevance check where you ask the LLM to rate how relevant the retrieved documents are before generating an answer.
Metadata and Filtering
In production systems, you often want to filter documents based on metadata like date, author, or document type. Most vector databases support metadata filtering. For example, in ChromaDB you can filter results:
results = collection.query(
query_embeddings=query_embedding,
where={"date": {"$gte": "2024-01-01"}},
n_results=5
)
This retrieves only documents from 2024 or later. Metadata filtering significantly improves relevance when you have large document collections.
Citation and Source Attribution
Users need to know which documents the answers come from. Implement citation by tracking which chunks were used and displaying them with the response. LangChain and LlamaIndex provide this through source_documents. For custom implementations, return the chunk IDs along with the generated text.
Evaluating RAG Quality
You should systematically evaluate your RAG system. Create a test set of questions with known correct answers. Measure retrieval accuracy by checking if relevant documents are retrieved. Measure answer quality by comparing generated answers to reference answers. Track these metrics over time as you make improvements.
Production Considerations
For production systems, you need to consider additional aspects. Implement caching for common queries to reduce costs and latency. Use async operations to handle multiple concurrent users. Implement rate limiting to prevent abuse. Monitor costs carefully since embeddings and LLM calls can get expensive at scale. Consider using open-source models for embeddings to reduce costs.
CONCLUSION AND NEXT STEPS
You have now learned how to build LLM chatbots from scratch using four major frameworks. You started with basic chatbots and then enhanced them with Retrieval Augmented Generation to answer questions based on your documents.
Each framework has its strengths. HuggingFace gives you direct control and privacy. LangChain enables rapid prototyping with high-level abstractions. LangGraph provides sophisticated state management for complex applications. LlamaIndex excels at document-based question answering.
Where to Go From Here
To deepen your knowledge, I recommend building a complete application that solves a real problem you have. Perhaps build a chatbot that can answer questions about your company’s documentation, or create a personal assistant that knows about your notes and files. Real projects teach you far more than tutorials.
Experiment with different models to understand the tradeoffs between size, speed, and quality. Try fine-tuning models on your specific domain to improve performance. Explore advanced RAG techniques like query rewriting, hypothetical document embeddings, or fusion retrieval that combines multiple retrieval strategies.
Study the documentation for each framework deeply. I have only scratched the surface of what each framework can do. LangChain has tools for web scraping and API calls. LangGraph supports sophisticated multi-agent systems. LlamaIndex has advanced query engines that can handle complex reasoning.
Most importantly, remember that building with LLMs is still a rapidly evolving field. New techniques and best practices emerge constantly. Stay curious, keep experimenting, and do not be afraid to try unconventional approaches.
You now have the foundation to build powerful LLM applications. The key is to start simple, iterate based on what you learn, and gradually increase complexity as needed. Good luck on your journey building with large language models!
No comments:
Post a Comment