INTRODUCTION: WELCOME TO THE WORLD OF LLM CHATBOTS

In this tutorial, you will build your first Large Language Model (LLM) chatbot from scratch. If you have never worked with LLMs programmatically before, do not worry: we will go step by step and explain each concept and each piece of code along the way.

By the end of this tutorial, you will understand how to create a chatbot using four major frameworks: HuggingFace, LangChain, LangGraph, and LlamaIndex. You will also learn how to enhance your chatbot with Retrieval Augmented Generation, a technique that allows your chatbot to answer questions based on your own documents and data.

Before we begin, let me explain what we will be building. An LLM chatbot is a program that can have conversations with users by understanding their messages and generating intelligent responses. Think of it like having a conversation with a knowledgeable assistant who can help answer questions, provide information, or simply chat.

WHAT YOU NEED TO KNOW BEFORE STARTING

You should have basic Python programming knowledge. This means you should understand variables, functions, and how to run Python scripts. You do not need any prior experience with machine learning or artificial intelligence. We will explain everything from the ground up.

You will need a computer with a recent version of Python 3 installed. Python 3.10 or higher is a safe choice, because current releases of several libraries used in this tutorial no longer support older versions. You will also need an internet connection to download the necessary libraries and models. Additionally, you may need API keys from services like OpenAI or Anthropic, though we will also show you how to use free, open-source models that run locally on your machine.

PART 1: UNDERSTANDING AND BUILDING A BASIC LLM CHATBOT

SECTION 1: FUNDAMENTAL CONCEPTS YOU MUST UNDERSTAND

Before writing any code, we need to understand what we are actually building. Let me explain the key concepts in simple terms.

What is a Large Language Model?

A Large Language Model, or LLM, is a type of artificial intelligence that has been trained on massive amounts of text from the internet, books, and other sources. Through this training, the model learns patterns in language and can generate human-like text in response to prompts. You can think of an LLM as a very sophisticated autocomplete system that can understand context and generate coherent, relevant responses.

When you send a message to an LLM, the model processes your text and generates a response one token at a time, each time predicting a likely next piece of text based on the patterns it learned during training. The model does not truly understand language the way humans do, but it has learned statistical patterns that allow it to generate remarkably intelligent responses.

Understanding Tokens

LLMs do not actually work with words directly. Instead, they work with tokens. A token is a piece of text that could be a word, part of a word, or even a punctuation mark. For example, the sentence “Hello world!” might be split into three tokens: “Hello”, “ world”, and “!”. The model converts your text into tokens, processes these tokens, and then converts the output tokens back into readable text.
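To make the idea concrete, here is a toy sketch of tokenization. It is an assumption-laden illustration, not how a real model tokenizes: real tokenizers use learned subword vocabularies, while the hypothetical toy_tokenize function below just splits on a simple pattern. It does reproduce the “Hello world!” split described above and shows how tokens map to numeric IDs.

```python
import re

def toy_tokenize(text):
    """Split text into word pieces (keeping a leading space) and punctuation marks."""
    return re.findall(r" ?\w+|[^\w\s]", text)

text = "Hello world!"
tokens = toy_tokenize(text)

# Assign each distinct token a numeric ID, in order of first appearance
vocab = {tok: i for i, tok in enumerate(dict.fromkeys(tokens))}
token_ids = [vocab[tok] for tok in tokens]

print(tokens)      # the text pieces
print(token_ids)   # the numbers a model would actually process
```

Joining the tokens back together recovers the original text, which mirrors the decode step that turns model output back into readable text.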

This is important to understand because LLMs have limits on how many tokens they can process at once. This limit is called the context window. If your conversation becomes too long, you may need to truncate older messages to stay within this limit.

The Concept of a Prompt

A prompt is the input you give to an LLM. In a chatbot, each message from the user becomes part of the prompt. A well-crafted prompt can significantly improve the quality of the responses you get from the model. For example, instead of just sending “Tell me about Paris”, you might send “You are a knowledgeable travel guide. Tell me about the top attractions in Paris for first-time visitors.”

How Chatbot Memory Works

A basic LLM call is stateless, meaning it does not remember previous messages in the conversation. If you want your chatbot to remember what was discussed earlier, you need to maintain a conversation history and include relevant previous messages in each new prompt. This is called conversation memory or chat history management.
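The following minimal sketch shows the idea of chat history management without any real model. The fake_llm function and the ChatSession class are hypothetical stand-ins introduced only for illustration: the "model" sees nothing except the prompt it receives, so the session must resend the history on every turn.

```python
def fake_llm(prompt):
    """Stand-in for a stateless model call: it only sees the prompt it is given."""
    return f"(reply based on {prompt.count('User:')} user message(s))"

class ChatSession:
    """Keeps the conversation history and rebuilds the full prompt on each turn."""

    def __init__(self, system_prompt):
        self.system_prompt = system_prompt
        self.history = []  # list of (role, text) pairs, oldest first

    def ask(self, user_text):
        self.history.append(("User", user_text))
        turns = "\n".join(f"{role}: {text}" for role, text in self.history)
        prompt = f"{self.system_prompt}\n{turns}\nAssistant:"
        reply = fake_llm(prompt)
        self.history.append(("Assistant", reply))
        return reply

session = ChatSession("You are a helpful AI assistant.")
first = session.ask("My name is Ada.")
second = session.ask("What is my name?")
```

On the second call the prompt contains both user messages, which is the only reason a real model could answer the follow-up question.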

SECTION 2: SETTING UP YOUR DEVELOPMENT ENVIRONMENT

Now that you understand the basics, let us set up your computer to start building. We will do this step by step.

Step One: Creating a Project Directory

First, create a folder on your computer where you will store all your chatbot code. You can name it something like “my_llm_chatbot”. Open your terminal or command prompt and navigate to this directory.

Step Two: Setting Up a Virtual Environment

A virtual environment is an isolated space where you can install Python packages without affecting other Python projects on your computer. This is considered best practice in Python development. Here is how to create one:

python -m venv chatbot_env

This command creates a virtual environment named “chatbot_env” in your current directory. Now activate it:

On Windows:

chatbot_env\Scripts\activate

On macOS or Linux:

source chatbot_env/bin/activate

When activated, you will see the environment name in your terminal prompt.

Step Three: Installing Core Dependencies

We will install the packages we need throughout this tutorial. Run these commands one by one:

pip install transformers torch accelerate

pip install langchain langchain-community langchain-openai

pip install langgraph

pip install llama-index

pip install openai anthropic

pip install chromadb sentence-transformers

pip install python-dotenv

Each of these packages serves a specific purpose. The transformers library is from HuggingFace and provides access to thousands of pre-trained models. Torch is PyTorch, a deep learning framework, and accelerate is the HuggingFace helper library that transformers needs for the automatic device placement we use later in this tutorial. The langchain packages provide tools for building LLM applications. LangGraph helps build stateful, multi-actor applications. LlamaIndex specializes in connecting LLMs to your data. The remaining packages handle API access, vector storage, and environment configuration.

Step Four: Organizing Your API Keys

If you plan to use commercial APIs like OpenAI, you will need API keys. Create a file named “.env” in your project directory and add your keys:

OPENAI_API_KEY=your_openai_key_here

ANTHROPIC_API_KEY=your_anthropic_key_here

Never commit this file to version control or share it publicly. The python-dotenv package will load these keys into your environment variables when needed.
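A small helper can catch a missing or placeholder key early, before any API call fails with a less obvious error. This is only a sketch: get_api_key is a hypothetical helper name, and the check for values starting with “your_” simply assumes you copied the placeholder lines above without editing them. In a real script you would call load_dotenv() first so the values from .env are present in the environment.

```python
import os

def get_api_key(name, env=None):
    """Return the named key from the environment, or raise a clear error."""
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    # Treat an unset key or an unedited placeholder as missing
    if not value or value.startswith("your_"):
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value

# Example with an explicit dictionary instead of the real environment
example_env = {"OPENAI_API_KEY": "sk-example"}
key = get_api_key("OPENAI_API_KEY", example_env)
```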

SECTION 3: BUILDING YOUR FIRST CHATBOT WITH HUGGINGFACE

HuggingFace is a company that provides tools and a platform for working with machine learning models, particularly focused on natural language processing. Their Transformers library has become the de facto standard for working with pre-trained language models.

When to Use HuggingFace

You should use HuggingFace when you want direct access to open-source models that run locally on your machine, when you need fine-grained control over model behavior, when you want to experiment with different models easily, or when you prefer not to rely on external APIs for privacy or cost reasons.

Understanding the HuggingFace Approach

HuggingFace gives you direct access to the models themselves. You load a model into your computer’s memory and then send it text to generate responses. This means you have complete control but also means you need sufficient computational resources. Smaller models can run on regular laptops, while larger models may require powerful GPUs.

Step One: Importing Required Libraries

Let us create a file called “huggingface_chatbot.py” and start coding. First, we import the necessary libraries:

from transformers import AutoModelForCausalLM, AutoTokenizer

import torch

The AutoModelForCausalLM class automatically selects the appropriate model architecture based on the model name you provide. The AutoTokenizer handles converting text to tokens and back. The torch library is PyTorch, which handles the underlying mathematical operations.

Step Two: Loading a Model and Tokenizer

Now we need to select a model and load it. For beginners, I recommend starting with a smaller model like GPT-2 or the newer Phi models from Microsoft:

def load_model_and_tokenizer(model_name="microsoft/phi-2"):
    """
    Load a pre-trained language model and its tokenizer.

    The model is responsible for generating text based on input prompts.
    The tokenizer converts text to numbers (tokens) that the model can process
    and converts the model's numeric output back to readable text.

    Args:
        model_name: The identifier for the model on HuggingFace Hub

    Returns:
        A tuple containing (tokenizer, model)
    """
    print(f"Loading model {model_name}... This may take a few minutes.")

    # Load the tokenizer for this specific model
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Load the actual model with specific settings
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16,  # Use half precision to save memory
        device_map="auto",          # Automatically use GPU if available
        trust_remote_code=True      # Some models require custom code
    )

    print("Model loaded successfully!")
    return tokenizer, model

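This function does several important things. It downloads the model from HuggingFace’s servers if you do not already have it cached locally. The torch_dtype parameter tells PyTorch to store the weights as half-precision (16-bit) floating point numbers, which halves memory use compared with 32-bit floats and usually has minimal impact on quality. On a GPU this also tends to run faster; on a CPU-only machine half precision can be slow or unsupported for some operations, so you may need torch.float32 there. The device_map parameter places the model on your GPU if you have one and otherwise falls back to CPU; it relies on the accelerate package being installed.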
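To see why precision matters, it helps to do the arithmetic. The sketch below estimates the memory needed just to hold the weights, assuming roughly 2.7 billion parameters for microsoft/phi-2 and ignoring everything else a running model needs (activations, caches, framework overhead). The function name weight_memory_gb is made up for this example.

```python
# Bytes used to store one parameter at each precision
BYTES_PER_PARAM = {"float32": 4, "float16": 2}

def weight_memory_gb(num_params, dtype):
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

PHI2_PARAMS = 2.7e9  # approximate parameter count, used here as an assumption

full = weight_memory_gb(PHI2_PARAMS, "float32")
half = weight_memory_gb(PHI2_PARAMS, "float16")

print(f"float32: ~{full:.1f} GB")
print(f"float16: ~{half:.1f} GB")
```

Under these assumptions the weights need about 10.8 GB in float32 and about 5.4 GB in float16, which is often the difference between fitting on a consumer GPU and not fitting at all.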

Step Three: Creating the Generation Function

Now we need a function that takes user input and generates a response:

def generate_response(user_input, tokenizer, model, conversation_history=""):
    """
    Generate a response to the user's input using the loaded model.

    This function prepares the input, sends it to the model, and decodes
    the model's output back into readable text.

    Args:
        user_input: The message from the user
        tokenizer: The tokenizer for converting text to/from tokens
        model: The language model that generates responses
        conversation_history: Previous messages in the conversation

    Returns:
        The generated response as a string
    """
    # Construct the full prompt including history
    full_prompt = conversation_history + f"\nUser: {user_input}\nAssistant:"

    # Convert text to tokens
    inputs = tokenizer(full_prompt, return_tensors="pt")

    # Move inputs to the same device as the model (CPU or GPU)
    inputs = {key: value.to(model.device) for key, value in inputs.items()}

    # Generate response tokens
    with torch.no_grad():  # Disable gradient calculation for inference
        outputs = model.generate(
            **inputs,
            max_new_tokens=200,   # Maximum length of the response
            temperature=0.7,      # Controls randomness (lower = more focused, higher = more varied)
            do_sample=True,       # Enable sampling for more diverse responses
            top_p=0.9,            # Nucleus sampling parameter
            pad_token_id=tokenizer.eos_token_id  # Padding token
        )

    # Decode only the newly generated tokens, skipping the prompt we sent in
    prompt_length = inputs["input_ids"].shape[1]
    response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)

    # The model may keep writing and invent the next "User:" turn; cut it off there
    response = response.split("\nUser:")[0].strip()

    return response

This function is the heart of your chatbot. Let me explain the key parameters in the generate method. The max_new_tokens parameter limits how many tokens the response can contain. Temperature controls randomness: lower values make the model more focused and predictable, while higher values make it more varied but potentially less coherent. The top_p parameter implements nucleus sampling, which restricts sampling to the smallest set of most likely tokens whose combined probability reaches top_p (here 0.9), cutting off the long tail of unlikely tokens. Both temperature and top_p only take effect because do_sample is set to True.
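If you want to see what temperature and top_p do to the numbers, here is a small pure-Python sketch. It is not the code the transformers library runs internally; the function names and the four made-up scores are assumptions chosen for illustration.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose total probability reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}  # renormalize over the kept tokens

logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for four candidate tokens
cool = softmax_with_temperature(logits, 0.5)
warm = softmax_with_temperature(logits, 1.5)
nucleus = top_p_filter(softmax_with_temperature(logits, 1.0), 0.9)

print([round(p, 3) for p in cool])
print([round(p, 3) for p in warm])
print(sorted(nucleus))  # indices of the tokens that survive the top_p cut
```

With these scores, the low temperature concentrates most of the probability on the first token, the high temperature spreads it out, and the top_p cut drops the least likely token before sampling.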

Step Four: Building the Chat Loop

Now we need a main loop that handles the conversation:

def main():
    """
    Main function that runs the chatbot interaction loop.

    This function loads the model, then repeatedly prompts the user for input,
    generates responses, and maintains conversation history.
    """
    print("Initializing HuggingFace Chatbot...")

    # Load model and tokenizer
    tokenizer, model = load_model_and_tokenizer()

    # Initialize conversation history
    conversation_history = "You are a helpful AI assistant."

    print("\nChatbot is ready! Type 'quit' to exit.\n")

    while True:
        # Get user input
        user_input = input("You: ").strip()

        # Check for exit command
        if user_input.lower() in ['quit', 'exit', 'bye']:
            print("Goodbye!")
            break

        # Skip empty inputs
        if not user_input:
            continue

        # Generate response
        response = generate_response(user_input, tokenizer, model, conversation_history)

        # Update conversation history
        conversation_history += f"\nUser: {user_input}\nAssistant: {response}"

        # Display response
        print(f"Assistant: {response}\n")

if __name__ == "__main__":
    main()

This creates a continuous loop where the user can type messages and receive responses. The conversation history accumulates all previous exchanges, allowing the model to maintain context. However, be aware that this history will eventually exceed the model’s context window, so in a production system you would need to implement truncation or summarization.
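One simple way to keep the history bounded is to store the exchanges as a list and rebuild the prompt from only the most recent ones. The sketch below assumes that approach; trim_history is a hypothetical helper, and counting turns is a rough stand-in for counting tokens with the real tokenizer.

```python
def trim_history(system_prompt, turns, max_turns):
    """Build the history text from the system prompt and the most recent turns.

    turns is a list of (user_text, assistant_text) pairs, oldest first.
    """
    recent = turns[-max_turns:] if max_turns > 0 else []
    lines = [system_prompt]
    for user_text, assistant_text in recent:
        lines.append(f"User: {user_text}")
        lines.append(f"Assistant: {assistant_text}")
    return "\n".join(lines)

turns = [
    ("Hi", "Hello!"),
    ("What is a token?", "A piece of text."),
    ("And a prompt?", "The input to the model."),
]
history = trim_history("You are a helpful AI assistant.", turns, max_turns=2)
print(history)
```

The system prompt is always kept, while the oldest exchange is dropped once the limit is reached. A more careful version would measure the prompt with the tokenizer and drop turns until it fits the model’s context window.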

Understanding What We Built

You now have a complete working chatbot using HuggingFace. When you run this script, it loads a language model, processes your messages, and generates responses. The model runs entirely on your machine, which means your conversations stay private and you do not need to pay for API calls. However, the quality of responses depends on the size and capability of the model you choose.

SECTION 4: BUILDING A CHATBOT WITH LANGCHAIN

LangChain is a framework specifically designed for building applications powered by language models. It provides high-level abstractions that make it easier to build complex LLM applications without writing all the low-level code yourself.

When to Use LangChain

You should use LangChain when you want to quickly prototype LLM applications, when you need to integrate multiple components like memory and tools, when you want to easily switch between different LLM providers, or when you are building applications that need to chain multiple LLM calls together.

Understanding the LangChain Philosophy

LangChain thinks of LLM applications as chains of components. A component might be an LLM, a prompt template, a memory system, or a tool that the LLM can use. You connect these components together to create sophisticated behaviors. This modular approach makes it easy to build and modify complex applications.
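The chain idea can be sketched in plain Python before touching the framework. Nothing below is LangChain code: make_prompt_template, fake_llm, and chain are hypothetical names, and the "model" is a stand-in that just echoes its prompt. The point is only that each component takes the previous component's output as its input.

```python
def make_prompt_template(template):
    """Component 1: fill a template with the user's question."""
    def fill(question):
        return template.format(question=question)
    return fill

def fake_llm(prompt):
    """Component 2: stand-in for a model call (a real chain would call an LLM here)."""
    return f"  ANSWER[{prompt}]  "

def chain(*steps):
    """Connect components so that each one's output feeds the next."""
    def run(value):
        for step in steps:
            value = step(value)
        return value
    return run

# Prompt template -> model -> output cleanup
qa_chain = chain(make_prompt_template("Q: {question}"), fake_llm, str.strip)
answer = qa_chain("What is a chain?")
print(answer)
```

Swapping one component for another, for example replacing fake_llm with a different model wrapper, leaves the rest of the chain untouched. That modularity is what the framework is built around.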

Step One: Importing LangChain Components

Create a new file called “langchain_chatbot.py” and start with imports:

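from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, AIMessage, SystemMessage
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationChain
from dotenv import load_dotenv
import os

The ChatOpenAI class is a wrapper around OpenAI’s chat models; it lives in the langchain-openai package we installed earlier. The message classes represent different types of messages. ConversationBufferMemory stores the conversation history. ConversationChain combines all these components into a working chatbot. ConversationBufferMemory and ConversationChain are legacy LangChain APIs that recent releases mark as deprecated, so if these imports fail or emit warnings, check the documentation for your installed version.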

Step Two: Setting Up the LLM Connection

Now we configure the connection to the LLM:

def initialize_chatbot():
    """
    Initialize the LangChain chatbot with memory and LLM configuration.

    This function sets up a conversation chain that includes memory management
    and configures the connection to the LLM provider (OpenAI in this case).

    Returns:
        A ConversationChain object ready to use
    """
    # Load environment variables from .env file
    load_dotenv()

    # Initialize the chat model with specific parameters
    llm = ChatOpenAI(
        model_name="gpt-3.5-turbo",                 # The specific model to use
        temperature=0.7,                            # Controls response creativity
        openai_api_key=os.getenv("OPENAI_API_KEY")  # API key from environment
    )

    # Set up memory to track conversation history
    memory = ConversationBufferMemory(
        return_messages=True,  # Return messages in a format the model understands
        memory_key="history"   # The key used to store history in the chain
    )

    # Create the conversation chain
    conversation = ConversationChain(
        llm=llm,
        memory=memory,
        verbose=True  # Print detailed logs of what's happening
    )

    return conversation

This function creates all the components needed for a conversation. The ChatOpenAI instance connects to OpenAI’s API using your key. The ConversationBufferMemory automatically tracks all messages in the conversation. The ConversationChain ties everything together and handles the flow of information between components.

Step Three: Creating the Chat Interface

Now let us create a function to handle the conversation:

def chat(conversation, user_message):
    """
    Send a message to the chatbot and get a response.

    This function handles a single interaction: sending the user's message
    to the LLM and receiving the response. The conversation chain
    automatically manages memory, so context is maintained.

    Args:
        conversation: The ConversationChain instance
        user_message: The message from the user

    Returns:
        The chatbot's response as a string
    """
    try:
        # Send the message and get response
        # The conversation chain automatically includes memory in the prompt
        response = conversation.predict(input=user_message)
        return response
    except Exception as e:
        return f"Error: {str(e)}"

This simple function is all you need to interact with the chatbot. LangChain handles all the complexity of managing conversation history, formatting prompts, and making API calls. This is the power of using a framework like LangChain.

Step Four: Building the Main Loop

Here is the main loop that runs the chatbot:

def main():
    """
    Main function that runs the LangChain chatbot.

    This initializes the chatbot and runs a loop where users can
    have a continuous conversation.
    """
    print("Initializing LangChain Chatbot...")

    # Initialize the conversation chain
    conversation = initialize_chatbot()

    print("\nChatbot is ready! Type 'quit' to exit.\n")

    while True:
        # Get user input
        user_input = input("You: ").strip()

        # Check for exit command
        if user_input.lower() in ['quit', 'exit', 'bye']:
            print("Goodbye!")
            break

        # Skip empty inputs
        if not user_input:
            continue

        # Get and display response
        response = chat(conversation, user_input)
        print(f"Assistant: {response}\n")

if __name__ == "__main__":
    main()

Comparing to HuggingFace

Notice how much simpler this code is compared to the HuggingFace version. LangChain abstracts away the details of tokenization, prompt construction, and memory management. You get a working chatbot with automatic conversation history in just a few lines of code. The tradeoff is that you have less control over the low-level details and you rely on external API services.

Using Local Models with LangChain

While we used OpenAI in this example, LangChain also supports local models through HuggingFace. You can replace the ChatOpenAI line with:

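from langchain_community.llms import HuggingFacePipeline

llm = HuggingFacePipeline.from_model_id(
    model_id="microsoft/phi-2",
    task="text-generation",
    # Generation settings are passed to the underlying transformers pipeline
    pipeline_kwargs={"max_new_tokens": 200, "do_sample": True, "temperature": 0.7}
)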

This gives you the convenience of LangChain with the privacy and cost benefits of local models.

SECTION 5: BUILDING A CHATBOT WITH LANGGRAPH

LangGraph is a newer framework from the creators of LangChain that focuses on building stateful, multi-actor applications with LLMs. It is particularly powerful when you need to build complex workflows with multiple steps and decision points.

When to Use LangGraph

You should use LangGraph when you need to build chatbots with complex state management, when your application requires multiple steps or decision branches, when you want to implement agentic behavior where the LLM can make decisions about what to do next, or when you need fine-grained control over the conversation flow.

Understanding LangGraph’s State Machine Approach

LangGraph treats your application as a state machine. A state machine is a system that can be in different states and transitions between states based on inputs or decisions. In a chatbot, the state might include the conversation history, the current topic, the user’s mood, or any other information you want to track. LangGraph provides tools to define states, transitions, and the logic that determines how your application flows from one state to another.

Step One: Understanding LangGraph Concepts

Before we code, let us understand the key concepts in LangGraph. A graph in LangGraph represents your application’s structure. Nodes are individual operations or steps in your application. Edges connect nodes and define how your application flows from one step to another. State is the data that flows through your graph and gets updated at each node.
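Here is a toy graph runner in plain Python that makes these four concepts tangible. It is a sketch for building intuition, not LangGraph's implementation: the MiniGraph class and its behavior are assumptions, with method names chosen to resemble the ones we use with the real library.

```python
END = "__end__"  # marker for "stop here", chosen for this sketch

class MiniGraph:
    """Toy graph runner: nodes are functions, edges name the next node to run."""

    def __init__(self):
        self.nodes, self.edges, self.entry = {}, {}, None

    def add_node(self, name, fn):
        self.nodes[name] = fn

    def add_edge(self, source, target):
        self.edges[source] = target

    def set_entry_point(self, name):
        self.entry = name

    def invoke(self, state):
        current = self.entry
        while current != END:
            update = self.nodes[current](state)  # each node returns a partial update
            state = {**state, **update}          # merge the update into the state
            current = self.edges[current]        # follow the edge to the next node
        return state

def greet(state):
    return {"reply": f"Hello, {state['name']}!"}

def shout(state):
    return {"reply": state["reply"].upper()}

graph = MiniGraph()
graph.add_node("greet", greet)
graph.add_node("shout", shout)
graph.set_entry_point("greet")
graph.add_edge("greet", "shout")
graph.add_edge("shout", END)

result = graph.invoke({"name": "Ada"})
print(result["reply"])
```

The state dictionary travels through both nodes, and each node only returns the keys it wants to change.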

Step Two: Setting Up a Basic LangGraph Chatbot

Create a file called “langgraph_chatbot.py”:

from langgraph.graph import StateGraph, END
from langchain_openai import ChatOpenAI
from langchain_core.messages import BaseMessage, HumanMessage, AIMessage, SystemMessage
from typing import TypedDict, Annotated, Sequence
import operator
from dotenv import load_dotenv
import os

The StateGraph class is the main building block for creating LangGraph applications. The END constant marks the end of a graph execution. The TypedDict and Annotated types help define our state structure in a type-safe way. BaseMessage is the common parent class of HumanMessage, AIMessage, and SystemMessage.

Step Three: Defining the Application State

In LangGraph, you must define what information your application tracks:

class ChatState(TypedDict):
    """
    Define the state structure for our chatbot.

    This state object will be passed through each node in the graph
    and can be modified at each step.

    Attributes:
        messages: The list of messages in the conversation
        user_input: The current message from the user
    """
    messages: Annotated[Sequence[BaseMessage], operator.add]
    user_input: str

This state definition tells LangGraph what data flows through our application. The messages field is typed with BaseMessage so that it can hold any kind of message, and it uses operator.add, which means each node can append to this list rather than replacing it. This is perfect for accumulating conversation history.
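The effect of attaching operator.add to a field can be imitated in a few lines. The sketch below is an assumption about the general idea, not LangGraph's actual merge code: the hypothetical apply_update function combines fields that have a reducer and overwrites the rest.

```python
import operator

def apply_update(state, update, reducers):
    """Merge a node's update into the state.

    Keys with a reducer are combined with the existing value;
    all other keys are simply replaced.
    """
    merged = dict(state)
    for key, value in update.items():
        if key in reducers:
            merged[key] = reducers[key](state.get(key, []), value)
        else:
            merged[key] = value
    return merged

reducers = {"messages": operator.add}  # list + list concatenates

state = {"messages": ["user: hi"], "user_input": "hi"}
update = {"messages": ["assistant: hello!"], "user_input": "next question"}
new_state = apply_update(state, update, reducers)

print(new_state["messages"])
print(new_state["user_input"])
```

The messages list grows because its reducer concatenates, while user_input has no reducer and is overwritten by the latest value.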

Step Four: Creating Graph Nodes

Nodes are functions that take the current state and return an updated state:

def chatbot_node(state: ChatState) -> ChatState:
    """
    The main chatbot logic node.

    This node receives the current state, sends the conversation to the LLM,
    and returns an updated state with the new messages added.

    Args:
        state: The current state containing conversation history

    Returns:
        Updated state with the new user and AI messages added
    """
    # Initialize the LLM
    load_dotenv()
    llm = ChatOpenAI(
        model_name="gpt-3.5-turbo",
        temperature=0.7,
        openai_api_key=os.getenv("OPENAI_API_KEY")
    )

    # Wrap the latest user input as a message
    user_message = HumanMessage(content=state["user_input"])

    # Build the prompt: system message, stored history, then the new user message
    messages = [SystemMessage(content="You are a helpful AI assistant.")]
    messages += list(state["messages"])
    messages.append(user_message)

    # Get response from LLM
    response = llm.invoke(messages)

    # Return only the new messages; the operator.add reducer appends them
    return {
        "messages": [user_message, response],
        "user_input": state["user_input"]
    }

This node does the actual work of generating responses. It takes the current state, builds the message list from the system message, the stored history, and the latest user input, calls the LLM, and returns the updated state. Notice that we return a dictionary containing only the new messages (the user’s message and the model’s reply), and LangGraph’s operator.add annotation automatically appends them to the existing messages. Returning the user’s message as well matters: if the node returned only the reply, the stored history would contain the assistant’s half of the conversation and none of the user’s.

Step Five: Building the Graph

Now we construct the graph by adding nodes and edges:

def create_chatbot_graph():
    """
    Create and configure the LangGraph graph for the chatbot.

    This function defines the structure of our application by adding nodes
    and edges that determine the flow of execution.

    Returns:
        A compiled graph ready to execute
    """
    # Create a new state graph with our state definition
    workflow = StateGraph(ChatState)

    # Add the chatbot node
    workflow.add_node("chatbot", chatbot_node)

    # Set the entry point - where execution starts
    workflow.set_entry_point("chatbot")

    # Add an edge from chatbot to END - conversation ends after each response
    workflow.add_edge("chatbot", END)

    # Compile the graph into an executable application
    app = workflow.compile()

    return app

This creates a very simple graph with just one node. The graph starts at the chatbot node, executes it, and then ends. In more complex applications, you would have multiple nodes with conditional edges that decide which node to execute next based on the state.

Step Six: Running the Chatbot

Here is the main loop that uses our graph:

def main():
    """
    Main function that runs the LangGraph chatbot.
    """
    print("Initializing LangGraph Chatbot...")

    # Create the graph
    app = create_chatbot_graph()

    # Initialize conversation state
    state = {
        "messages": [],
        "user_input": ""
    }

    print("\nChatbot is ready! Type 'quit' to exit.\n")

    while True:
        # Get user input
        user_input = input("You: ").strip()

        # Check for exit command
        if user_input.lower() in ['quit', 'exit', 'bye']:
            print("Goodbye!")
            break

        # Skip empty inputs
        if not user_input:
            continue

        # Update state with new input
        state["user_input"] = user_input

        # Execute the graph with current state
        result = app.invoke(state)

        # Extract the latest AI message
        latest_message = result["messages"][-1].content

        # Update state with new messages
        state["messages"] = result["messages"]

        # Display response
        print(f"Assistant: {latest_message}\n")

if __name__ == "__main__":
    main()

Understanding What Makes LangGraph Different

At first glance, this might seem more complicated than the LangChain example. However, LangGraph’s power becomes apparent when you build more complex applications. You can easily add nodes for different behaviors, implement conditional logic to route conversations differently based on user intent, add nodes that call external APIs or tools, and create loops where the LLM can refine its response based on feedback.

For example, you could add a node that classifies user intent and then route to different specialized nodes based on that intent. You could add a node that validates the LLM’s response before showing it to the user. These capabilities make LangGraph extremely powerful for production applications.
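The intent-routing idea can be sketched without any framework. Everything in this example is hypothetical: the keyword check stands in for an LLM-based classifier, and the two handler functions stand in for specialized nodes. In LangGraph the same decision would be expressed as a conditional edge.

```python
def classify_intent(state):
    """Toy classifier node: a real system might ask the LLM to label the intent."""
    text = state["user_input"].lower()
    is_billing = any(word in text for word in ("invoice", "refund", "charge"))
    return {"intent": "billing" if is_billing else "general"}

def billing_node(state):
    return {"reply": "Let me look into your billing question."}

def general_node(state):
    return {"reply": "Happy to chat! What would you like to know?"}

# Maps each intent label to the node that should handle it
HANDLERS = {"billing": billing_node, "general": general_node}

def run_turn(user_input):
    state = {"user_input": user_input}
    state.update(classify_intent(state))            # first node: label the intent
    state.update(HANDLERS[state["intent"]](state))  # routing step: pick the next node
    return state

billing_result = run_turn("I was charged twice, can I get a refund?")
general_result = run_turn("Tell me something interesting.")
print(billing_result["reply"])
print(general_result["reply"])
```

Adding a new specialty is a matter of writing one more handler and registering it, which is the kind of growth path that graph-based designs make easy.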

SECTION 6: BUILDING A CHATBOT WITH LLAMAINDEX

LlamaIndex is a framework specifically designed for connecting LLMs to your data. While the other frameworks focus on the LLM interaction itself, LlamaIndex specializes in indexing, retrieving, and using external data with LLMs.

When to Use LlamaIndex

You should use LlamaIndex when your chatbot needs to answer questions based on your documents, when you have a knowledge base you want to make searchable through natural language, when you need to work with structured or semi-structured data, or when you are building a question-answering system over your own data.

Understanding the LlamaIndex Approach

LlamaIndex thinks about LLM applications in terms of data ingestion, indexing, and retrieval. You first ingest your data into LlamaIndex, which processes and indexes it. When a user asks a question, LlamaIndex retrieves the most relevant pieces of information from your index and provides them to the LLM as context. The LLM then generates a response based on this retrieved information.
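The ingest, index, and retrieve flow can be illustrated with a deliberately simple sketch. It is not how LlamaIndex works internally: real indexes compare embedding vectors, while this example just counts shared words, and all function names and sample documents are made up for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """Ingest and index: store a word-count table next to each document."""
    return [(doc, Counter(tokenize(doc))) for doc in documents]

def retrieve(index, question, top_k=1):
    """Retrieve: rank documents by how many question words they share."""
    question_words = Counter(tokenize(question))
    scored = [(sum((question_words & counts).values()), doc) for doc, counts in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:top_k] if score > 0]

def build_prompt(question, context_docs):
    """Give the retrieved text to the LLM as context for its answer."""
    context = "\n".join(context_docs)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

docs = [
    "The office is open Monday to Friday from 9am to 5pm.",
    "Refunds are processed within 14 days of the return.",
]
index = build_index(docs)
question = "When are refunds processed?"
hits = retrieve(index, question)
prompt = build_prompt(question, hits)
print(prompt)
```

The prompt that would be sent to the LLM now contains the one document relevant to the question, which is the essence of retrieval augmented generation.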

Step One: Setting Up LlamaIndex

Create a file called “llamaindex_chatbot.py”:

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.openai import OpenAI
from llama_index.core.memory import ChatMemoryBuffer
from dotenv import load_dotenv
import os

These import paths apply to LlamaIndex 0.10 and later, where the core classes live under llama_index.core and the older ServiceContext has been replaced by the global Settings object. The VectorStoreIndex is LlamaIndex’s main indexing structure. SimpleDirectoryReader helps load documents. Settings configures the LLM and other services. ChatMemoryBuffer manages conversation history.

Step Two: Creating a Simple Chat Engine

Let us start with a basic chatbot without any external data:

def initialize_chat_engine():
    """
    Initialize a LlamaIndex chat engine.

    This creates a chat engine that can have conversations with memory.
    Later we will enhance this with document retrieval capabilities.

    Returns:
        A chat engine ready for conversation
    """
    # Load environment variables
    load_dotenv()

    # Configure the LLM
    llm = OpenAI(
        model="gpt-3.5-turbo",
        temperature=0.7,
        api_key=os.getenv("OPENAI_API_KEY")
    )

    # Register our LLM as the default for LlamaIndex components
    Settings.llm = llm

    # Create an empty index
    index = VectorStoreIndex([])

    # Create a chat engine with memory
    chat_engine = index.as_chat_engine(
        chat_mode="simple",
        memory=ChatMemoryBuffer.from_defaults(token_limit=4000),
        verbose=True
    )

    return chat_engine

This creates a chat engine with conversation memory. The token_limit parameter ensures that the conversation history does not exceed the model’s context window. LlamaIndex automatically manages truncating old messages when this limit is reached.

Step Three: Implementing the Chat Loop

The chat loop for LlamaIndex is straightforward:

def main():
    """
    Main function that runs the LlamaIndex chatbot.
    """
    print("Initializing LlamaIndex Chatbot...")

    # Create chat engine
    chat_engine = initialize_chat_engine()

    print("\nChatbot is ready! Type 'quit' to exit.\n")

    while True:
        # Get user input
        user_input = input("You: ").strip()

        # Check for exit command
        if user_input.lower() in ['quit', 'exit', 'bye']:
            print("Goodbye!")
            break

        # Skip empty inputs
        if not user_input:
            continue

        # Get response from chat engine
        response = chat_engine.chat(user_input)

        # Display response
        print(f"Assistant: {response}\n")

if __name__ == "__main__":
    main()

Understanding LlamaIndex’s Role

At this point, you might wonder why you would use LlamaIndex for a basic chatbot when LangChain is simpler. The answer is that LlamaIndex really shines when you add document retrieval, which we will cover in Part 2. For now, just understand that LlamaIndex provides excellent tools for managing conversation memory and integrates seamlessly with its retrieval capabilities.

SECTION 7: COMPARING FRAMEWORKS AND WHEN TO COMBINE THEM

Now that you have seen all four frameworks, let us discuss when to use each one and when to combine them in a single application.

HuggingFace: Direct Model Access

Use HuggingFace when you want complete control over the model, when you need to run models locally for privacy or cost reasons, when you are experimenting with different models to find the best one, or when you need to fine-tune models for specific tasks. HuggingFace gives you the lowest level of abstraction, which means maximum flexibility but also maximum complexity.

The main advantage is that you own your infrastructure and data never leaves your control. The main disadvantage is that you need to manage more complexity and need sufficient computational resources.

LangChain: Rapid Prototyping and Integration

Use LangChain when you want to quickly build a prototype, when you need to integrate multiple components like memory and tools, when you want to easily switch between different LLM providers, or when you are building chains of operations. LangChain provides a good balance between flexibility and convenience.

The main advantage is rapid development with high-level abstractions. The main disadvantage is that you are somewhat locked into LangChain’s way of doing things, and debugging can be challenging when things go wrong inside the framework.

LangGraph: Complex Stateful Applications

Use LangGraph when you need complex state management, when your application has multiple steps with conditional logic, when you are building agentic systems that make decisions, or when you need fine-grained control over application flow. LangGraph is the most sophisticated of the frameworks and requires more upfront design work.

The main advantage is that it excels at complex workflows and provides excellent observability. The main disadvantage is that it has a steeper learning curve and might be overkill for simple applications.

LlamaIndex: Document-Based Question Answering

Use LlamaIndex when your primary use case involves retrieving and using external documents, when you have a knowledge base to make searchable, when you need to work with structured data, or when building question-answering systems. LlamaIndex is purpose-built for retrieval augmented generation, which we will cover in Part 2.

The main advantage is exceptional document handling and retrieval capabilities. The main disadvantage is that it is less flexible for general LLM application patterns that do not involve document retrieval.

Combining Frameworks in Practice

In real-world applications, you often combine frameworks to leverage the strengths of each. Here are common combinations:

You might use HuggingFace models with LangChain’s abstractions. This gives you the privacy and cost benefits of local models with the convenience of LangChain. The code looks like this:

from langchain.llms import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Load HuggingFace model
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")

# Create pipeline
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=200
)

# Wrap in LangChain
llm = HuggingFacePipeline(pipeline=pipe)

# Now use with any LangChain components
from langchain.chains import ConversationChain

conversation = ConversationChain(llm=llm)

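You might use LangChain for orchestration with LlamaIndex for retrieval. This combines LangChain’s workflow capabilities with LlamaIndex’s retrieval expertise. A LlamaIndex retriever is not itself a LangChain retriever, so the sketch below wraps it in a small adapter class. The name LlamaIndexRetrieverAdapter is our own and not part of either library, the exact base-class hooks may differ between LangChain versions, and documents is assumed to be a list of already-loaded LlamaIndex documents:

from typing import Any, List

from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
from langchain.schema import BaseRetriever, Document
from llama_index import VectorStoreIndex

class LlamaIndexRetrieverAdapter(BaseRetriever):
    """Expose a LlamaIndex retriever through LangChain's retriever interface."""

    llama_retriever: Any

    def _get_relevant_documents(self, query: str, *, run_manager=None) -> List[Document]:
        # Convert LlamaIndex nodes into LangChain Document objects
        nodes = self.llama_retriever.retrieve(query)
        return [
            Document(page_content=n.node.get_content(), metadata=n.node.metadata or {})
            for n in nodes
        ]

# Create LlamaIndex index
index = VectorStoreIndex.from_documents(documents)

# Wrap the LlamaIndex retriever so LangChain can call it
retriever = LlamaIndexRetrieverAdapter(llama_retriever=index.as_retriever())

# Use in LangChain chain
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    retriever=retriever
)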

You might use LangGraph for orchestration with LlamaIndex for retrieval. This gives you LangGraph’s stateful workflows with LlamaIndex’s document capabilities.

The key principle is to choose frameworks based on your specific needs and combine them where it makes sense. Do not feel obligated to use only one framework. Each has its strengths, and they are designed to work together.

PART 2: ADDING RETRIEVAL AUGMENTED GENERATION (RAG)

SECTION 8: UNDERSTANDING RETRIEVAL AUGMENTED GENERATION

Now that you can build basic chatbots, let us enhance them with the ability to answer questions based on your own documents. This technique is called Retrieval Augmented Generation, or RAG for short.

What is RAG and Why Do We Need It?

An LLM is trained on a huge corpus of text and learns general knowledge about the world. However, it has three fundamental limitations. First, it only knows information from its training data, which has a cutoff date. Second, it does not know anything about your personal documents, company data, or private information. Third, it sometimes generates plausible-sounding but incorrect information, a phenomenon called hallucination.

RAG mitigates these problems by retrieving relevant information from your documents and providing it to the LLM as context. Instead of relying solely on the model’s training, the LLM generates responses based on the actual text you provide. This makes responses more accurate, up-to-date, and grounded in your specific data, although the model can still make mistakes if the retrieved text is irrelevant or incomplete.

How RAG Works: The Complete Process

Let me walk you through what happens when you use RAG. First, you prepare your data by loading your documents and splitting them into smaller chunks. This is necessary because LLMs have limited context windows and work better with focused pieces of information.

Second, you convert these text chunks into embeddings. An embedding is a mathematical representation of text as a vector of numbers. Text with similar meanings will have similar embeddings. This allows us to find relevant information mathematically.

Third, you store these embeddings in a vector database. A vector database is optimized for finding similar vectors quickly.

When a user asks a question, the RAG system converts the question into an embedding using the same process. It then searches the vector database for the chunks with the most similar embeddings. These are the chunks most semantically related to the question. The system retrieves these relevant chunks and provides them to the LLM along with the user’s question. The LLM then generates a response based on this retrieved context.

Understanding Embeddings More Deeply

An embedding is a list of numbers that represents the semantic meaning of text. For example, the sentence “The cat sat on the mat” might be represented as a vector with 768 numbers. The sentence “The feline rested on the rug” would have a similar vector because the meanings are related, even though the words are different.

Embeddings are created by specialized models trained to capture semantic similarity. When you convert text to embeddings, you are essentially mapping language into a mathematical space where the distance between points represents semantic similarity.
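To make the idea of distance in a mathematical space concrete, here is a minimal sketch of cosine similarity written with the standard library only. The three-number vectors are invented for illustration; real embedding models produce hundreds of dimensions, and the values below do not come from any actual model.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional "embeddings", made up for this example
cat_on_mat = [0.9, 0.1, 0.2]
feline_on_rug = [0.8, 0.2, 0.3]
stock_report = [0.1, 0.9, 0.7]

sim_related = cosine_similarity(cat_on_mat, feline_on_rug)
sim_unrelated = cosine_similarity(cat_on_mat, stock_report)

# The two sentences about a cat score higher than the unrelated one
print(sim_related > sim_unrelated)
```

A value close to 1 means the vectors point in nearly the same direction, which is how similar meanings show up numerically.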

The Vector Database Concept

A vector database stores embeddings and provides fast similarity search. When you query with an embedding, the database ranks stored embeddings by a distance measure such as cosine similarity or Euclidean distance. To avoid comparing the query against every stored vector, most vector databases use an approximate nearest neighbor index such as HNSW, which typically keeps retrieval in the range of milliseconds even with millions of stored vectors.
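The following sketch shows what a similarity search computes, using an exhaustive scan over a tiny in-memory dictionary. It is an illustration only: the chunk IDs and two-dimensional vectors are invented, and a real vector database replaces the full scan with an index so it does not have to visit every vector.

```python
import math

def cosine_distance(a, b):
    """Return 1 minus the cosine similarity, so smaller means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def top_k(query, stored, k=2):
    """Return the IDs of the k stored vectors closest to the query."""
    ranked = sorted(stored, key=lambda item_id: cosine_distance(query, stored[item_id]))
    return ranked[:k]

# Hypothetical stored embeddings, keyed by chunk ID
stored = {
    "chunk_0": [1.0, 0.0],
    "chunk_1": [0.7, 0.7],
    "chunk_2": [0.0, 1.0],
}

nearest = top_k([0.9, 0.1], stored, k=2)
print(nearest)
```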

Common options include ChromaDB, which is lightweight and well suited to prototyping, Pinecone, which is a managed cloud service, Weaviate, which offers rich filtering capabilities, and FAISS from Facebook AI Research, which is a similarity search library rather than a full database and is extremely fast for local use.

SECTION 9: IMPLEMENTING RAG WITH HUGGINGFACE

Let us implement RAG using HuggingFace. We will use HuggingFace models for both the LLM and the embedding model, and ChromaDB as our vector database.

Step One: Installing Additional Dependencies

You will need some additional packages:

pip install chromadb sentence-transformers pypdf

ChromaDB is our vector database. Sentence-transformers provides embedding models from HuggingFace. PyPDF lets you read PDF documents, although the example in this section loads plain text files only.

Step Two: Creating the Document Processing Pipeline

Create a file called “huggingface_rag_chatbot.py”:

from sentence_transformers import SentenceTransformer
from transformers import AutoModelForCausalLM, AutoTokenizer
import chromadb
import torch
from pathlib import Path

Now let us create functions to process documents:

class DocumentProcessor:
    """
    Handles loading and chunking documents for RAG.

    This class provides methods to read text files and split them
    into manageable chunks that fit within LLM context windows.
    """

    def __init__(self, chunk_size=500, chunk_overlap=50):
        """
        Initialize the document processor.

        Args:
            chunk_size: Maximum number of characters per chunk
            chunk_overlap: Number of characters to overlap between chunks
        """
        # The start position advances by chunk_size - chunk_overlap,
        # so the overlap must be smaller or the loop would never finish
        if chunk_overlap >= chunk_size:
            raise ValueError("chunk_overlap must be smaller than chunk_size")
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap

    def load_text_file(self, file_path):
        """
        Load a text file and return its contents.

        Args:
            file_path: Path to the text file

        Returns:
            The file contents as a string
        """
        with open(file_path, 'r', encoding='utf-8') as f:
            return f.read()

    def split_into_chunks(self, text):
        """
        Split text into overlapping chunks.

        Chunking is necessary because LLMs have limited context windows.
        Overlap helps ensure important information is not lost at boundaries.

        Args:
            text: The text to split

        Returns:
            A list of text chunks
        """
        chunks = []
        start = 0

        while start < len(text):
            # Calculate end position for this chunk
            end = start + self.chunk_size

            # Extract chunk
            chunk = text[start:end]

            # Only add non-empty chunks
            if chunk.strip():
                chunks.append(chunk)

            # Move start position forward, accounting for overlap
            start += self.chunk_size - self.chunk_overlap

        return chunks

This class handles the first step of RAG: preparing documents. The chunking strategy uses overlapping windows to ensure that information is not lost when text is split across chunk boundaries.
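To see how the overlapping windows line up, the sketch below computes only the start and end positions that a sliding window of this kind produces. The helper name chunk_spans is our own and not part of the tutorial code. It also rejects an overlap that is not smaller than the chunk size, since the start position would then never advance.

```python
def chunk_spans(text_length, chunk_size=500, chunk_overlap=50):
    """Return (start, end) character positions for overlapping chunks."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    spans = []
    start = 0
    while start < text_length:
        # The last chunk is cut off at the end of the text
        spans.append((start, min(start + chunk_size, text_length)))
        start += step
    return spans

# A 1000-character document with the default settings
spans = chunk_spans(1000)
print(spans)  # [(0, 500), (450, 950), (900, 1000)]
```

Each chunk starts 450 characters after the previous one, so characters 450 to 499 appear in both the first and second chunk.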

Step Three: Creating the Vector Store

Now we need to create embeddings and store them:

class VectorStore:
    """
    Manages embedding creation and vector database operations.

    This class wraps ChromaDB and a sentence transformer model
    to provide easy document storage and retrieval.
    """

    def __init__(self, collection_name="documents"):
        """
        Initialize the vector store.

        Args:
            collection_name: Name for the ChromaDB collection
        """
        # Initialize the embedding model
        print("Loading embedding model...")
        self.embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

        # Initialize ChromaDB client (in-memory by default)
        self.client = chromadb.Client()

        # Create or get the collection
        # get_or_create_collection does not fail if the collection already exists
        self.collection = self.client.get_or_create_collection(
            name=collection_name,
            metadata={"hnsw:space": "cosine"}  # Use cosine similarity
        )
        print("Vector store ready!")

    def add_documents(self, chunks):
        """
        Add document chunks to the vector store.

        This method converts text chunks to embeddings and stores them
        along with their text content.

        Args:
            chunks: List of text chunks to add
        """
        print(f"Adding {len(chunks)} chunks to vector store...")

        # Generate embeddings for all chunks
        embeddings = self.embedding_model.encode(chunks)

        # Create IDs for each chunk, continuing after any existing entries
        # so that a second call does not reuse IDs from the first
        offset = self.collection.count()
        ids = [f"chunk_{offset + i}" for i in range(len(chunks))]

        # Add to ChromaDB
        self.collection.add(
            embeddings=embeddings.tolist(),
            documents=chunks,
            ids=ids
        )
        print("Documents added successfully!")

    def search(self, query, n_results=3):
        """
        Search for relevant documents given a query.

        This method converts the query to an embedding and finds
        the most similar document chunks in the vector store.

        Args:
            query: The search query
            n_results: Number of results to return

        Returns:
            List of relevant document chunks
        """
        # Nothing to search if no documents have been added yet
        count = self.collection.count()
        if count == 0:
            return []

        # Convert query to embedding
        query_embedding = self.embedding_model.encode([query])

        # Search in ChromaDB, never asking for more results than exist
        results = self.collection.query(
            query_embeddings=query_embedding.tolist(),
            n_results=min(n_results, count)
        )

        # Extract and return the document texts
        return results['documents'][0]

This class encapsulates all vector database operations. It uses the all-MiniLM-L6-v2 model for embeddings, which is a good balance between speed and quality. The search method performs semantic search to find relevant chunks.

Step Four: Creating the RAG Chatbot

Now let us tie everything together:

class RAGChatbot:
    """
    A complete RAG chatbot using HuggingFace models.

    This class combines document retrieval with LLM generation
    to answer questions based on your documents.
    """

    def __init__(self, model_name="microsoft/phi-2"):
        """
        Initialize the RAG chatbot.

        Args:
            model_name: HuggingFace model identifier
        """
        # Initialize document processor
        self.doc_processor = DocumentProcessor()

        # Initialize vector store
        self.vector_store = VectorStore()

        # Load LLM
        print(f"Loading language model {model_name}...")
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype=torch.float16,
            device_map="auto",
            trust_remote_code=True
        )
        print("Model loaded!")

        # Initialize conversation history
        self.conversation_history = []

    def load_documents(self, file_path):
        """
        Load and index documents from a file.

        Args:
            file_path: Path to the document file
        """
        # Load document
        text = self.doc_processor.load_text_file(file_path)

        # Split into chunks
        chunks = self.doc_processor.split_into_chunks(text)

        # Add to vector store
        self.vector_store.add_documents(chunks)

    def generate_response(self, user_input):
        """
        Generate a response using RAG.

        This method retrieves relevant context and generates
        a response based on that context.

        Args:
            user_input: The user's question

        Returns:
            The generated response
        """
        # Retrieve relevant context
        relevant_chunks = self.vector_store.search(user_input, n_results=3)

        # Construct context from retrieved chunks
        context = "\n\n".join(relevant_chunks)

        # Build the prompt with context
        prompt = (
            "Based on the following context, please answer the question.\n\n"
            f"Context:\n{context}\n\n"
            f"Question: {user_input}\n\n"
            "Answer:"
        )

        # Prepend earlier turns so the model sees the conversation so far
        full_prompt = "\n".join(self.conversation_history) + "\n" + prompt

        # Generate response
        inputs = self.tokenizer(full_prompt, return_tensors="pt")
        inputs = {k: v.to(self.model.device) for k, v in inputs.items()}

        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=200,
                temperature=0.7,
                do_sample=True,
                pad_token_id=self.tokenizer.eos_token_id
            )

        # Decode only the newly generated tokens, skipping the prompt,
        # so text in the context that contains "Answer:" cannot confuse us
        prompt_length = inputs["input_ids"].shape[-1]
        response = self.tokenizer.decode(
            outputs[0][prompt_length:],
            skip_special_tokens=True
        ).strip()

        # Update conversation history
        self.conversation_history.append(f"User: {user_input}\nAssistant: {response}")

        return response

Step Five: The Main Loop

Here is how to use the RAG chatbot:

def main():
    """
    Main function to run the HuggingFace RAG chatbot.
    """
    print("Initializing HuggingFace RAG Chatbot...")

    # Initialize chatbot
    chatbot = RAGChatbot()

    # Load documents (you need to provide a text file)
    print("\nPlease provide the path to your document file:")
    file_path = input("File path: ").strip()

    if Path(file_path).exists():
        chatbot.load_documents(file_path)
        print("\nDocuments loaded and indexed!")
    else:
        print("File not found. Starting without documents.")

    print("\nChatbot is ready! Type 'quit' to exit.\n")

    while True:
        user_input = input("You: ").strip()

        if user_input.lower() in ['quit', 'exit', 'bye']:
            print("Goodbye!")
            break

        if not user_input:
            continue

        response = chatbot.generate_response(user_input)
        print(f"Assistant: {response}\n")

if __name__ == "__main__":
    main()

Understanding What We Built

You now have a complete RAG system using HuggingFace. When a user asks a question, the system searches your documents for relevant information, retrieves it, and provides it to the LLM as context. This allows the LLM to answer based on your specific documents rather than just its training data.

SECTION 10: IMPLEMENTING RAG WITH LANGCHAIN

LangChain makes RAG significantly easier with high-level abstractions for document loading, splitting, embedding, and retrieval.

Step One: Setting Up LangChain RAG

Create a file called “langchain_rag_chatbot.py”:

from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
from dotenv import load_dotenv
import os

LangChain provides specialized components for each part of the RAG pipeline. The document loader reads files, the text splitter chunks documents, the embeddings class creates vector representations, the vector store manages retrieval, and the retrieval chain ties everything together.

Step Two: Creating the RAG Pipeline

Here is a complete RAG implementation:

class LangChainRAGChatbot:
    """
    A RAG chatbot using LangChain's high-level abstractions.

    This class demonstrates how LangChain simplifies RAG implementation
    by providing pre-built components for each step.
    """

    def __init__(self):
        """Initialize the LangChain RAG chatbot."""
        load_dotenv()

        # Initialize embeddings model
        print("Initializing embeddings model...")
        self.embeddings = HuggingFaceEmbeddings(
            model_name="all-MiniLM-L6-v2"
        )

        # Initialize LLM
        self.llm = ChatOpenAI(
            model_name="gpt-3.5-turbo",
            temperature=0.7,
            openai_api_key=os.getenv("OPENAI_API_KEY")
        )

        # Initialize memory
        self.memory = ConversationBufferMemory(
            memory_key="chat_history",
            return_messages=True,
            output_key="answer"
        )

        # Vector store will be initialized when documents are loaded
        self.vectorstore = None
        self.qa_chain = None

        print("Chatbot initialized!")

    def load_documents(self, file_path):
        """
        Load and index documents for RAG.

        This method handles the complete pipeline: loading, splitting,
        embedding, and storing documents.

        Args:
            file_path: Path to the document file
        """
        print(f"Loading documents from {file_path}...")

        # Load the document
        loader = TextLoader(file_path, encoding='utf-8')
        documents = loader.load()

        # Split into chunks
        # RecursiveCharacterTextSplitter tries to split on natural boundaries
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=500,
            chunk_overlap=50,
            length_function=len,
            separators=["\n\n", "\n", " ", ""]  # Try these in order
        )
        chunks = text_splitter.split_documents(documents)
        print(f"Created {len(chunks)} chunks")

        # Create vector store
        print("Creating embeddings and vector store...")
        self.vectorstore = Chroma.from_documents(
            documents=chunks,
            embedding=self.embeddings,
            collection_name="langchain_rag"
        )

        # Create the conversational retrieval chain
        self.qa_chain = ConversationalRetrievalChain.from_llm(
            llm=self.llm,
            retriever=self.vectorstore.as_retriever(search_kwargs={"k": 3}),
            memory=self.memory,
            return_source_documents=True,
            verbose=True
        )

        print("Documents indexed successfully!")

    def chat(self, user_input):
        """
        Chat with the RAG bot.

        Args:
            user_input: The user's question

        Returns:
            A tuple of the bot's response and the source documents used
        """
        if self.qa_chain is None:
            # Return the same (answer, sources) shape as the normal path
            # so callers can always unpack two values
            return "Please load documents first using load_documents().", []

        # Get response from the chain
        result = self.qa_chain({"question": user_input})

        # Extract the answer
        answer = result["answer"]

        # Optionally, you can also see which source documents were used
        source_docs = result.get("source_documents", [])

        return answer, source_docs

Step Three: The Main Loop

def main():
    """
    Main function to run the LangChain RAG chatbot.
    """
    print("Initializing LangChain RAG Chatbot...")

    # Initialize chatbot
    chatbot = LangChainRAGChatbot()

    # Load documents
    print("\nPlease provide the path to your document file:")
    file_path = input("File path: ").strip()

    from pathlib import Path

    if Path(file_path).exists():
        chatbot.load_documents(file_path)
    else:
        print("File not found. Exiting.")
        return

    print("\nChatbot is ready! Type 'quit' to exit.")
    print("The bot will answer questions based on your documents.\n")

    while True:
        user_input = input("You: ").strip()

        if user_input.lower() in ['quit', 'exit', 'bye']:
            print("Goodbye!")
            break

        if not user_input:
            continue

        # Get response
        answer, sources = chatbot.chat(user_input)
        print(f"Assistant: {answer}\n")

        # Optionally show sources
        show_sources = input("Show source documents? (y/n): ").strip().lower()
        if show_sources == 'y':
            print("\nSource documents used:")
            for i, doc in enumerate(sources, 1):
                print(f"\nSource {i}:")
                print(doc.page_content[:200] + "...")
            print()

if __name__ == "__main__":
    main()

Understanding LangChain’s RAG Advantages

Notice how much simpler this is compared to the HuggingFace implementation. LangChain provides pre-built components that handle all the complexity. The ConversationalRetrievalChain automatically manages retrieval, prompt construction, and conversation history. The RecursiveCharacterTextSplitter intelligently splits text on natural boundaries. The integration between components is seamless.

This is the power of using a framework designed for RAG. You can focus on your application logic rather than the plumbing.

SECTION 11: IMPLEMENTING RAG WITH LANGGRAPH

LangGraph allows you to build more sophisticated RAG systems with custom logic for retrieval, generation, and response validation.

Step One: Setting Up LangGraph RAG

Create a file called “langgraph_rag_chatbot.py”:

from langgraph.graph import StateGraph, END
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import HumanMessage, AIMessage, SystemMessage
from typing import TypedDict, Annotated, Sequence, List, Union
import operator
from dotenv import load_dotenv

Step Two: Defining State for RAG

Our state needs to track more information for RAG:

class RAGState(TypedDict):
    """
    State definition for the RAG chatbot.

    This tracks all information needed for retrieval and generation.

    Attributes:
        messages: Conversation history
        user_input: Current user question
        retrieved_docs: Documents retrieved for current question
        final_answer: The generated response
    """
    # Union[...] is used instead of "X | Y", which needs Python 3.10 or newer
    messages: Annotated[Sequence[Union[HumanMessage, AIMessage]], operator.add]
    user_input: str
    retrieved_docs: List[str]
    final_answer: str

This state tracks not just the conversation but also the retrieved documents and the final answer. This allows us to implement multi-step workflows where we can inspect and modify what happens at each stage.

Step Three: Creating RAG Nodes

Now we create nodes for each step in the RAG process:

class LangGraphRAG:
    """
    A RAG system built with LangGraph for maximum control.
    """

    def __init__(self):
        """Initialize the LangGraph RAG system."""
        load_dotenv()

        # Initialize components
        self.embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
        self.llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.7)
        self.vectorstore = None
        self.app = None

    def load_documents(self, file_path):
        """Load and index documents."""
        loader = TextLoader(file_path, encoding='utf-8')
        documents = loader.load()

        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=500,
            chunk_overlap=50
        )
        chunks = text_splitter.split_documents(documents)

        self.vectorstore = Chroma.from_documents(
            documents=chunks,
            embedding=self.embeddings
        )

        # Build the graph after documents are loaded
        self.app = self._build_graph()

        print(f"Loaded and indexed {len(chunks)} chunks")

    def retrieve_node(self, state: RAGState) -> RAGState:
        """
        Node that retrieves relevant documents.

        This node searches the vector store for documents
        relevant to the user's question.

        Args:
            state: Current state

        Returns:
            Updated state with retrieved documents
        """
        user_input = state["user_input"]

        # Retrieve relevant documents
        docs = self.vectorstore.similarity_search(user_input, k=3)

        # Extract text from documents
        doc_texts = [doc.page_content for doc in docs]

        return {
            "retrieved_docs": doc_texts,
            "user_input": user_input,
            "messages": []
        }

    def generate_node(self, state: RAGState) -> RAGState:
        """
        Node that generates a response based on retrieved documents.

        This node constructs a prompt with the retrieved context
        and generates an answer using the LLM.

        Args:
            state: Current state with retrieved documents

        Returns:
            Updated state with the generated answer
        """
        # Get retrieved documents
        context = "\n\n".join(state["retrieved_docs"])

        # Construct prompt
        prompt = (
            "Based on the following context, please answer the question.\n\n"
            f"Context:\n{context}\n\n"
            f"Question: {state['user_input']}\n\n"
            "Please provide a clear and concise answer based on the context provided."
        )

        # Get existing messages
        messages = list(state["messages"])

        # Add system message if needed
        if not messages or not isinstance(messages[0], SystemMessage):
            messages.insert(0, SystemMessage(content="You are a helpful assistant that answers questions based on provided context."))

        # Add user message
        messages.append(HumanMessage(content=prompt))

        # Generate response
        response = self.llm(messages)

        return {
            "messages": [response],
            "final_answer": response.content,
            "user_input": state["user_input"],
            "retrieved_docs": state["retrieved_docs"]
        }

    def _build_graph(self):
        """
        Build the LangGraph workflow.

        This creates a graph with nodes for retrieval and generation.

        Returns:
            Compiled graph application
        """
        workflow = StateGraph(RAGState)

        # Add nodes
        workflow.add_node("retrieve", self.retrieve_node)
        workflow.add_node("generate", self.generate_node)

        # Define the flow
        workflow.set_entry_point("retrieve")
        workflow.add_edge("retrieve", "generate")
        workflow.add_edge("generate", END)

        return workflow.compile()

    def chat(self, user_input):
        """
        Process a user question through the RAG pipeline.

        Args:
            user_input: The user's question

        Returns:
            The generated answer
        """
        if self.app is None:
            return "Please load documents first."

        # Create initial state
        state = {
            "user_input": user_input,
            "messages": [],
            "retrieved_docs": [],
            "final_answer": ""
        }

        # Run the graph
        result = self.app.invoke(state)

        return result["final_answer"]

Step Four: The Main Loop

def main():
    """
    Main function to run the LangGraph RAG chatbot.
    """
    print("Initializing LangGraph RAG Chatbot...")

    chatbot = LangGraphRAG()

    print("\nPlease provide the path to your document file:")
    file_path = input("File path: ").strip()

    from pathlib import Path

    if Path(file_path).exists():
        chatbot.load_documents(file_path)
    else:
        print("File not found. Exiting.")
        return

    print("\nChatbot is ready! Type 'quit' to exit.\n")

    while True:
        user_input = input("You: ").strip()

        if user_input.lower() in ['quit', 'exit', 'bye']:
            print("Goodbye!")
            break

        if not user_input:
            continue

        answer = chatbot.chat(user_input)
        print(f"Assistant: {answer}\n")

if __name__ == "__main__":
    main()

Understanding LangGraph’s RAG Benefits

LangGraph gives you complete control over the RAG pipeline. You can easily add nodes for query rewriting, response validation, or iterative refinement. For example, you could add a node that checks if the retrieved documents are relevant and retrieves more if needed. You could add a node that validates the generated answer for factual consistency with the sources. This level of control is difficult to achieve with other frameworks.
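As a minimal sketch of the relevance-check idea, the routing function below decides which node should run after retrieval. The function names, the "no_context" node, and the min_docs threshold are our own illustrative choices. The functions are plain Python so they can be tried without LangGraph installed, and the comments at the end show one way they might be wired into the graph using add_conditional_edges.

```python
def route_after_retrieval(state, min_docs=1):
    """Return the name of the next node based on what retrieval found."""
    # Ignore chunks that contain only whitespace
    docs = [d for d in state.get("retrieved_docs", []) if d.strip()]
    return "generate" if len(docs) >= min_docs else "no_context"

def no_context_node(state):
    """Fallback node used when nothing useful was retrieved."""
    return {
        "final_answer": "I could not find anything relevant in the loaded documents."
    }

# Possible wiring inside _build_graph, replacing the fixed
# retrieve -> generate edge (an assumption about how you structure the graph):
#
# workflow.add_node("no_context", no_context_node)
# workflow.add_conditional_edges(
#     "retrieve",
#     route_after_retrieval,
#     {"generate": "generate", "no_context": "no_context"},
# )
# workflow.add_edge("no_context", END)

print(route_after_retrieval({"retrieved_docs": ["Some relevant text."]}))
print(route_after_retrieval({"retrieved_docs": []}))
```

A stricter check could ask the LLM to grade each retrieved chunk, at the cost of an extra model call per question.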

SECTION 12: IMPLEMENTING RAG WITH LLAMAINDEX

LlamaIndex is purpose-built for RAG and makes it remarkably simple to implement sophisticated retrieval systems.

Step One: Basic RAG with LlamaIndex

Create a file called “llamaindex_rag_chatbot.py”:

from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
from llama_index.llms import OpenAI
from llama_index.embeddings import HuggingFaceEmbedding
from llama_index.memory import ChatMemoryBuffer
from dotenv import load_dotenv
import os

Step Two: Creating the RAG System

LlamaIndex makes RAG incredibly straightforward:

class LlamaIndexRAG:
    """
    A RAG chatbot using LlamaIndex.

    LlamaIndex is designed specifically for RAG, so this implementation
    is remarkably concise while still being powerful.
    """

    def __init__(self):
        """Initialize the LlamaIndex RAG system."""
        load_dotenv()

        # Configure LLM
        self.llm = OpenAI(
            model="gpt-3.5-turbo",
            temperature=0.7,
            api_key=os.getenv("OPENAI_API_KEY")
        )

        # Configure embeddings
        self.embed_model = HuggingFaceEmbedding(
            model_name="all-MiniLM-L6-v2"
        )

        # Create service context
        self.service_context = ServiceContext.from_defaults(
            llm=self.llm,
            embed_model=self.embed_model
        )

        self.index = None
        self.chat_engine = None

        print("LlamaIndex RAG initialized!")

    def load_documents(self, file_path):
        """
        Load and index documents.

        LlamaIndex handles all the complexity of loading, chunking,
        embedding, and indexing in just a few lines.

        Args:
            file_path: Path to document file or directory
        """
        print(f"Loading documents from {file_path}...")

        # Load documents
        # If you pass a directory, it will load all files in it
        from pathlib import Path

        if Path(file_path).is_dir():
            documents = SimpleDirectoryReader(file_path).load_data()
        else:
            # For a single file, pass it directly through input_files
            # instead of copying it into a temporary directory
            documents = SimpleDirectoryReader(input_files=[file_path]).load_data()

        print(f"Loaded {len(documents)} documents")

        # Create index
        print("Creating index...")
        self.index = VectorStoreIndex.from_documents(
            documents,
            service_context=self.service_context,
            show_progress=True
        )

        # Create chat engine with memory
        self.chat_engine = self.index.as_chat_engine(
            chat_mode="context",  # Use retrieval-augmented generation
            memory=ChatMemoryBuffer.from_defaults(token_limit=4000),
            system_prompt=(
                "You are a helpful assistant that answers questions "
                "based on the provided documents. Always cite the source "
                "when possible and admit when you don't know something."
            ),
            verbose=True
        )

        print("Documents indexed and chat engine ready!")

    def chat(self, user_input):
        """
        Chat with the RAG system.

        Args:
            user_input: The user's question

        Returns:
            The generated response
        """
        if self.chat_engine is None:
            return "Please load documents first."

        # Get response
        response = self.chat_engine.chat(user_input)

        return str(response)

    def reset_conversation(self):
        """Reset the conversation history."""
        if self.chat_engine:
            self.chat_engine.reset()

Step Three: The Main Loop

def main():
    """
    Main function to run the LlamaIndex RAG chatbot.
    """
    print("Initializing LlamaIndex RAG Chatbot...")

    chatbot = LlamaIndexRAG()

    print("\nPlease provide the path to your document file or directory:")
    file_path = input("Path: ").strip()

    from pathlib import Path

    if Path(file_path).exists():
        chatbot.load_documents(file_path)
    else:
        print("Path not found. Exiting.")
        return

    print("\nChatbot is ready! Type 'quit' to exit, 'reset' to clear history.\n")

    while True:
        user_input = input("You: ").strip()

        if user_input.lower() in ['quit', 'exit', 'bye']:
            print("Goodbye!")
            break

        if user_input.lower() == 'reset':
            chatbot.reset_conversation()
            print("Conversation history cleared.\n")
            continue

        if not user_input:
            continue

        response = chatbot.chat(user_input)
        print(f"Assistant: {response}\n")

if __name__ == "__main__":
    main()

Understanding LlamaIndex’s RAG Strengths

LlamaIndex shines in its simplicity for RAG applications. With just a few lines of code, you get sophisticated document processing, embedding generation, indexing, retrieval, and generation with conversation memory. The framework handles chunking strategies, embedding model integration, and prompt engineering for RAG automatically.

LlamaIndex also provides advanced features like response synthesis modes, query transformations, and sub-question query engines that break complex questions into simpler sub-questions. These advanced features make it easy to build production-quality RAG systems.

SECTION 13: ADVANCED RAG CONCEPTS AND BEST PRACTICES

Now that you have seen RAG implementations across all frameworks, let me share important concepts and best practices.

Chunking Strategies Matter

The way you split documents into chunks significantly affects RAG quality. Chunks that are too large include irrelevant information. Chunks that are too small lack context. A good starting point is 400 to 600 characters with 50 to 100 characters of overlap. For technical documents, splitting on section boundaries often works better than fixed sizes.
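One way to respect natural boundaries is to keep paragraphs whole and pack as many as fit into each chunk. The sketch below assumes paragraphs are separated by blank lines. The function name pack_paragraphs and the 600-character limit are illustrative choices, and it is a simplified alternative to a fixed-size splitter rather than a replacement for a full text splitter.

```python
def pack_paragraphs(text, max_chars=600):
    """Group whole paragraphs into chunks of at most max_chars characters.

    A single paragraph longer than max_chars is kept as one oversized
    chunk rather than being cut in the middle.
    """
    chunks = []
    current = ""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        candidate = paragraph if not current else current + "\n\n" + paragraph
        if len(candidate) <= max_chars or not current:
            # Still fits, or this is the first paragraph of a new chunk
            current = candidate
        else:
            # Adding this paragraph would exceed the limit: start a new chunk
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks

sample = "A" * 300 + "\n\n" + "B" * 200 + "\n\n" + "C" * 400
chunks = pack_paragraphs(sample, max_chars=600)
print([len(c) for c in chunks])  # [502, 400]
```

The first two paragraphs fit together (300 + 2 + 200 characters), while the third starts a new chunk because adding it would exceed the limit.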

You should experiment with different chunk sizes for your specific use case. Monitor which chunk sizes lead to the most relevant retrievals and best answers.

Choosing the Right Embedding Model

The embedding model determines how well the system understands semantic similarity. Smaller models like all-MiniLM-L6-v2 are fast but less accurate. Larger models like instructor-xl or e5-large-v2 are more accurate but slower and require more memory.

For most applications, all-MiniLM-L6-v2 provides a good balance. For production systems where accuracy is critical, consider larger models or domain-specific embeddings trained on your type of content.

Retrieval Quality is Critical

The quality of your RAG system depends primarily on retrieval quality. If the system retrieves irrelevant documents, even the best LLM cannot generate good answers. You should implement logging to track which documents get retrieved for each query. Review these logs regularly to identify retrieval problems.

Consider implementing hybrid search that combines semantic search with keyword search. This catches cases where semantic similarity alone might miss exact term matches that are important.
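One common way to merge a semantic ranking with a keyword ranking is reciprocal rank fusion. This technique is not covered elsewhere in this tutorial. The sketch below is a minimal version that assumes you already have two ranked lists of chunk IDs, and the IDs and the constant k=60 are illustrative.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of IDs into one list, best first.

    Each ID earns 1 / (k + rank) from every list it appears in,
    so items ranked well by more than one method rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical results from a semantic search and a keyword search
semantic_results = ["chunk_2", "chunk_0", "chunk_5"]
keyword_results = ["chunk_0", "chunk_7", "chunk_2"]

fused = reciprocal_rank_fusion([semantic_results, keyword_results])
print(fused)  # ['chunk_0', 'chunk_2', 'chunk_7', 'chunk_5']
```

Because the method uses only ranks, it does not require the two searches to produce comparable scores.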

Managing Context Window Limits

Even with RAG, you can exceed the model’s context window when long retrieved documents combine with a long conversation history. Implement truncation strategies that prioritize recent conversation turns and the most relevant retrieved passages.

LlamaIndex handles the conversation-history side automatically through the token_limit parameter of its chat memory buffer. For custom implementations, you need to track token counts yourself and truncate intelligently.
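For a custom implementation, the truncation could look roughly like the sketch below. The names fit_to_budget and history_share are hypothetical, and count_tokens is a deliberately crude stand-in that counts whitespace-separated words. In a real system you would replace it with the tokenizer of your model.

```python
from typing import List, Tuple


def count_tokens(text: str) -> int:
    """Crude estimate: one token per whitespace-separated word (an assumption)."""
    return len(text.split())


def fit_to_budget(
    history: List[str],                 # conversation turns, oldest first
    passages: List[Tuple[str, float]],  # (passage text, relevance score)
    max_tokens: int,
    history_share: float = 0.4,
) -> Tuple[List[str], List[str]]:
    """Keep the newest turns and the most relevant passages within max_tokens."""
    history_budget = int(max_tokens * history_share)
    kept_history: List[str] = []
    used = 0
    for turn in reversed(history):      # walk from newest to oldest
        cost = count_tokens(turn)
        if used + cost > history_budget:
            break
        kept_history.append(turn)
        used += cost
    kept_history.reverse()              # restore chronological order

    passage_budget = max_tokens - used  # unused history budget goes to passages
    kept_passages: List[str] = []
    for text, _score in sorted(passages, key=lambda p: p[1], reverse=True):
        cost = count_tokens(text)
        if cost <= passage_budget:
            kept_passages.append(text)
            passage_budget -= cost
    return kept_history, kept_passages
```

Remember to reserve part of the context window for the system prompt and for the answer itself, so max_tokens should be smaller than the full window of the model.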

Handling Unanswerable Questions

Sometimes the retrieved documents do not contain the answer to a question. Your system should detect this and respond appropriately rather than hallucinating an answer. You can prompt the LLM to say “I cannot answer this based on the provided documents” when it lacks information.

You can also implement a relevance check where you ask the LLM to rate how relevant the retrieved documents are before generating an answer.
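Here is a minimal sketch of both ideas. Note that it swaps in a cheaper similarity-score threshold for the LLM-based relevance rating described above. The names build_grounded_prompt and answer_or_refuse are hypothetical, generate stands for whatever function calls your LLM, and min_score = 0.3 is an assumed cutoff that depends on your embedding model and presumes that higher scores mean more similar.

```python
from typing import Callable, List, Tuple

REFUSAL = "I cannot answer this based on the provided documents."


def build_grounded_prompt(question: str, passages: List[str]) -> str:
    """Build a prompt that tells the model to refuse when context is missing."""
    context = "\n\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer the question using only the context below.\n"
        f'If the context does not contain the answer, reply exactly: "{REFUSAL}"\n\n'
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


def answer_or_refuse(
    question: str,
    hits: List[Tuple[str, float]],   # (passage text, similarity score)
    generate: Callable[[str], str],  # your LLM call, passed in by the caller
    min_score: float = 0.3,
) -> str:
    """Skip the LLM call entirely when no retrieved passage looks relevant."""
    relevant = [text for text, score in hits if score >= min_score]
    if not relevant:
        return REFUSAL
    return generate(build_grounded_prompt(question, relevant))
```

The threshold check saves an LLM call for clearly off-topic questions, while the prompt instruction covers the cases where passages look similar but still do not contain the answer.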

Metadata and Filtering

In production systems, you often want to filter documents based on metadata like date, author, or document type. Most vector databases support metadata filtering. For example, in ChromaDB you can filter results:

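results = collection.query(
    query_embeddings=query_embedding,
    # ChromaDB's range operators ($gt, $gte, $lt, $lte) compare numbers only,
    # so this assumes each chunk stores its date as an integer such as 20240101.
    where={"date": {"$gte": 20240101}},
    n_results=5
)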

This retrieves only documents from 2024 or later. Metadata filtering significantly improves relevance when you have large document collections.

Citation and Source Attribution

Users need to know which documents the answers come from. Implement citation by tracking which chunks were used and displaying them with the response. LangChain returns them as source_documents when you set return_source_documents=True, and LlamaIndex exposes them as source_nodes on the response object. For custom implementations, return the chunk IDs along with the generated text.
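For a custom implementation, the citation step can be as small as the sketch below. It assumes each chunk is a dictionary with "source" and "chunk_id" keys, which is a made-up layout for illustration, so adapt the keys to however you store your chunks.

```python
from typing import Dict, List


def format_answer_with_sources(answer: str, chunks: List[Dict[str, str]]) -> str:
    """Append a numbered source list built from the chunks sent to the LLM."""
    lines = [answer, "", "Sources:"]
    seen = set()
    for chunk in chunks:
        key = (chunk["source"], chunk["chunk_id"])
        if key in seen:  # list each chunk only once
            continue
        seen.add(key)
        lines.append(f"[{len(seen)}] {chunk['source']} (chunk {chunk['chunk_id']})")
    return "\n".join(lines)
```

Pass in exactly the chunks that went into the prompt, so the list reflects what the model actually saw.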

Evaluating RAG Quality

You should systematically evaluate your RAG system. Create a test set of questions with known correct answers. Measure retrieval accuracy by checking if relevant documents are retrieved. Measure answer quality by comparing generated answers to reference answers. Track these metrics over time as you make improvements.
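The retrieval half of such an evaluation can be sketched as follows. It computes two common metrics, hit rate and mean reciprocal rank, over a test set. The function name evaluate_retrieval and the shape of the test set are assumptions for illustration, and the sketch does not cover answer quality.

```python
from typing import Callable, Dict, List, Set


def evaluate_retrieval(
    test_set: List[Dict],                       # items: {"question": str, "relevant_ids": set}
    retrieve: Callable[[str, int], List[str]],  # returns ranked chunk IDs
    k: int = 5,
) -> Dict[str, float]:
    """Compute hit rate and mean reciprocal rank (MRR) at k over a test set."""
    if not test_set:
        return {"hit_rate": 0.0, "mrr": 0.0}
    hits = 0
    reciprocal_ranks = 0.0
    for item in test_set:
        relevant: Set[str] = item["relevant_ids"]
        ranked = retrieve(item["question"], k)[:k]
        for rank, chunk_id in enumerate(ranked, start=1):
            if chunk_id in relevant:
                hits += 1
                reciprocal_ranks += 1.0 / rank
                break  # only the first relevant result counts
    n = len(test_set)
    return {"hit_rate": hits / n, "mrr": reciprocal_ranks / n}
```

Hit rate tells you how often at least one relevant chunk appears in the top k results, and MRR tells you how close to the top it appears. Running this after every change to chunking or embeddings gives you the trend over time.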

Production Considerations

For production systems, you need to consider additional aspects. Implement caching for common queries to reduce costs and latency. Use async operations to handle multiple concurrent users. Implement rate limiting to prevent abuse. Monitor costs carefully since embeddings and LLM calls can get expensive at scale. Consider using open-source models for embeddings to reduce costs.
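As an illustration of the caching point, here is a small in-memory cache with a time-to-live. QueryCache is a hypothetical class written for this tutorial, and it only works inside a single process. With several servers you would need a shared store instead.

```python
import time
from typing import Callable, Dict, Tuple


class QueryCache:
    """Tiny in-memory cache with a time-to-live, keyed by normalized query."""

    def __init__(self, ttl_seconds: float = 300.0, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store: Dict[str, Tuple[float, str]] = {}

    @staticmethod
    def _key(query: str) -> str:
        return " ".join(query.lower().split())  # ignore case and extra spaces

    def get_or_compute(self, query: str, compute: Callable[[str], str]) -> str:
        key = self._key(query)
        cached = self._store.get(key)
        now = self.clock()
        if cached is not None and now - cached[0] < self.ttl:
            return cached[1]  # fresh hit: skip the expensive call
        answer = compute(query)
        self._store[key] = (now, answer)
        return answer
```

You would pass your full RAG pipeline as the compute function. Keep the time-to-live short if your documents change often, because a cached answer does not see new documents.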

CONCLUSION AND NEXT STEPS

You have now learned how to build LLM chatbots from scratch using four major frameworks. You started with basic chatbots and then enhanced them with Retrieval Augmented Generation to answer questions based on your documents.

Each framework has its strengths. HuggingFace gives you direct control and privacy. LangChain enables rapid prototyping with high-level abstractions. LangGraph provides sophisticated state management for complex applications. LlamaIndex excels at document-based question answering.

Where to Go From Here

To deepen your knowledge, I recommend building a complete application that solves a real problem you have. Perhaps build a chatbot that can answer questions about your company’s documentation, or create a personal assistant that knows about your notes and files. Real projects teach you far more than tutorials.

Experiment with different models to understand the tradeoffs between size, speed, and quality. Try fine-tuning models on your specific domain to improve performance. Explore advanced RAG techniques like query rewriting, hypothetical document embeddings, or fusion retrieval that combines multiple retrieval strategies.

Study the documentation for each framework deeply. I have only scratched the surface of what each framework can do. LangChain has tools for web scraping and API calls. LangGraph supports sophisticated multi-agent systems. LlamaIndex has advanced query engines that can handle complex reasoning.

Most importantly, remember that building with LLMs is still a rapidly evolving field. New techniques and best practices emerge constantly. Stay curious, keep experimenting, and do not be afraid to try unconventional approaches.

You now have the foundation to build powerful LLM applications. The key is to start simple, iterate based on what you learn, and gradually increase complexity as needed. Good luck on your journey building with large language models!

Hitchhiker's Guide to AI, Software Architecture, and Everything Else

Friday, June 12, 2026

BUILDING YOUR FIRST LLM CHATBOT: FROM BASICS TO RAG - A Guide for Beginners with Zero LLM Experience

