FOREWORD
If you have ever wondered how ChatGPT works under the hood, or how companies are building AI assistants that can browse the web, write code, and take actions in the real world, then you are holding exactly the right guide. This tutorial will take you from zero knowledge about Large Language Models all the way to building your own agentic AI systems that can reason, plan, and act. We will move slowly and deliberately, building each concept on top of the last, so that by the time you reach the end, you will not just have working code but a genuine mental model of what is happening inside these systems.
No prior experience with LLMs is assumed. You should be comfortable reading Python code and have a basic understanding of functions and dictionaries. Everything else will be explained as we go. All code in this tutorial requires Python 3.9 or newer. Python 3.10 or newer is recommended for the best experience.
CHAPTER 1: WHAT IS A LARGE LANGUAGE MODEL, REALLY?
Before writing a single line of code, we need to build an accurate mental model of what a Large Language Model actually is. Many tutorials skip this step and jump straight into API calls, which leaves learners confused about why things work the way they do. We will not make that mistake.
A Large Language Model is, at its core, a mathematical function. It takes a sequence of text as input and produces a probability distribution over possible next tokens as output. A token is roughly a word or a piece of a word. The word "running" might be a single token, while "unbelievable" might be split into "un", "believ", and "able". The model was trained on an enormous corpus of text, perhaps hundreds of billions of tokens scraped from books, websites, scientific papers, and code repositories. During training, the model learned to predict what token comes next in a sequence, over and over again, billions of times, until its internal parameters settled into a configuration that captures an astonishing amount of knowledge about language, facts, reasoning patterns, and even code.
The figure below shows this conceptually:
 Input Text             LLM                Output
+-----------+       +-----------+       +-----------+
| "The sky  | ----> | billions  | ----> | "blue"    |
| is very   |       | of math   |       | (0.72)    |
| ________" |       | operations|       | "clear"   |
+-----------+       +-----------+       | (0.11)    |
                                        | "dark"    |
                                        | (0.08)    |
                                        | ...       |
                                        +-----------+
The model does not "know" things the way a database knows things. It has internalized statistical patterns so deeply that it can generate text that is coherent, contextually appropriate, and often, though not always, factually accurate. When you ask it a question, it is not looking up an answer. It is generating the most probable continuation of the conversation given everything it saw during training. This is also why models sometimes "hallucinate": a fluent, plausible-sounding continuation is not guaranteed to be a true one.
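The "probability distribution over next tokens" idea can be made concrete with a few lines of code. The distribution below is invented for illustration; a real model computes one over a vocabulary of tens of thousands of tokens:

```python
import random

# A toy next-token distribution for the prompt "The sky is very ___".
# These probabilities are made up for illustration; a real model
# produces a distribution over its entire vocabulary.
next_token_probs = {
    "blue": 0.72,
    "clear": 0.11,
    "dark": 0.08,
    "falling": 0.05,
    "green": 0.04,
}

# Greedy decoding: always pick the single most probable token.
greedy_choice = max(next_token_probs, key=next_token_probs.get)
print(greedy_choice)  # -> "blue"

# Sampling: pick a token at random, weighted by probability.
# This is what introduces variety between runs.
sampled_choice = random.choices(
    population=list(next_token_probs),
    weights=list(next_token_probs.values()),
)[0]
print(sampled_choice)  # usually "blue", occasionally another token
```

Generating a full sentence is just this step repeated: the chosen token is appended to the input, and the model produces a fresh distribution for the next position.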
This distinction matters enormously when you start building applications. The model is a text-completion engine. Everything else, the chat interface, the memory, the ability to use tools, is built on top of that fundamental capability. Understanding this will help you debug problems, design better prompts, and architect more reliable systems.
Modern LLMs are accessed through what is called a chat completion API. Instead of sending raw text and getting raw text back, you send a structured list of messages, each with a role (system, user, or assistant), and the model generates the next assistant message. This structure gives you a clean way to provide context, instructions, and conversation history.
The three roles work as follows. The system role is where you place instructions that define the model's behavior for the entire conversation. Think of it as a briefing you give to a new employee before they start their shift. The user role contains messages from the human participant in the conversation. The assistant role contains the model's previous responses. By including previous assistant messages in the conversation, you give the model the context it needs to continue coherently.
[system] "You are a helpful assistant."
[user] "What is the capital of France?"
[assistant] "The capital of France is Paris."
[user] "What is its population?"
[assistant] ??? <-- model generates this
The model sees the entire message list and generates the next assistant turn. This is why LLMs are said to be stateless: they do not remember previous conversations on their own. Every time you make a new API call, you must send the full conversation history yourself. This is a crucial architectural insight that will shape everything we build.
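To make the statelessness concrete, here is the conversation above as the literal data structure you would send on the next API call. The final assistant message is an invented placeholder; the point is the shape of the list, not its contents:

```python
# The full conversation history, sent on EVERY request. The model
# has no memory of earlier turns; it only "remembers" them because
# they are present in this list.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
    {"role": "user", "content": "What is its population?"},
]

# After the model replies, you append its answer so the NEXT call
# includes it too. Forgetting this append is the most common cause
# of a chatbot that "forgets" what was just said.
messages.append(
    {"role": "assistant", "content": "(model's reply goes here)"}
)
```

Every chat application you have ever used is doing some version of this bookkeeping behind the scenes.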
CHAPTER 2: SETTING UP YOUR ENVIRONMENT
We will work with two categories of LLMs throughout this tutorial. The first category is local models, which run entirely on your own machine. The second category is remote models, which run on a provider's servers and are accessed via an API key.
For local models, we will use two popular tools: Ollama and LM Studio. Ollama is a command-line tool that makes it trivially easy to download and run open-source models like Llama 3, Mistral, and Phi-3. LM Studio is a graphical application that provides a user-friendly interface for the same purpose and also exposes a local server that mimics the OpenAI API format. Both tools are free and run on Mac, Windows, and Linux.
For remote models, we will use the OpenAI API, which gives access to GPT-4o and other powerful models. You will need an OpenAI account and an API key for those sections. The concepts transfer directly to other providers like Anthropic (Claude) or Google (Gemini), since they all follow similar patterns.
Start by creating a fresh Python virtual environment. This keeps your project's dependencies isolated from other Python projects on your machine, which prevents version conflicts and makes your work reproducible.
Open your terminal and run the following commands:
python -m venv llm-tutorial-env
source llm-tutorial-env/bin/activate
pip install openai python-dotenv
On Windows, the activation command is:
llm-tutorial-env\Scripts\activate
The openai package is the official Python client for the OpenAI API. Importantly, it also works with any server that implements the OpenAI-compatible API format, which includes both Ollama and LM Studio. The python-dotenv library lets us store sensitive values like API keys in a file called .env rather than hardcoding them in our scripts.
Create a file called .env in your project directory with the following content. Replace the placeholder with your actual OpenAI API key when you have one:
OPENAI_API_KEY=sk-your-actual-key-goes-here
OPENAI_MODEL=gpt-4o-mini
Now install Ollama by visiting https://ollama.com and following the instructions for your operating system. Once installed, open a terminal and pull a model. We will use Llama 3.2 with 3 billion parameters, which is small enough to run on most modern laptops:
ollama pull llama3.2
After the download completes, you can verify it works by running:
ollama run llama3.2
Type a message and press Enter. If you get a coherent response, your local setup is working. Press Ctrl+D or type /bye to exit. Ollama automatically starts a local server on port 11434 whenever it is running, and that server speaks the OpenAI API format. This means we can use the exact same Python code to talk to Ollama as we use to talk to OpenAI, just with a different base URL.
If you prefer LM Studio, download it from https://lmstudio.ai, install a model through its interface, and then start the local server from the "Local Server" tab. LM Studio's server runs on port 1234 by default.
CHAPTER 3: YOUR FIRST API CALL - TALKING TO A LOCAL MODEL
With the environment ready, we can write our first program. We will start with a local model because it costs nothing to run, works offline, and lets you experiment freely without worrying about API costs. The concepts are identical to using a remote model.
The program below demonstrates the absolute minimum required to send a message to an LLM and receive a response. Study it carefully, because every more complex example in this tutorial builds on exactly this pattern.
Save the following as chapter_03_first_call.py and run it with the command "python chapter_03_first_call.py":
# chapter_03_first_call.py
#
# This is our very first interaction with a Large Language Model.
# We use the openai Python package, which can talk to any server
# that implements the OpenAI-compatible chat completion API.
# Ollama provides exactly such a server on localhost port 11434.
import os
from openai import OpenAI
# Create a client that points to our local Ollama server.
# The base_url tells the client where to send requests.
# The api_key is required by the openai library but Ollama
# does not actually validate it, so any non-empty string works.
client = OpenAI(
base_url="http://localhost:11434/v1",
api_key="ollama" # Ollama ignores this value
)
# The model name must match what you pulled with "ollama pull".
# You can list available models by running "ollama list" in
# your terminal.
MODEL = "llama3.2"
def ask_question(question: str) -> str:
"""
Send a single question to the LLM and return its text response.
Args:
question: The question or prompt to send to the model.
Returns:
The model's response as a plain string.
"""
# We construct the messages list. For a single question,
# we only need one user message. The model will generate
# the assistant's reply.
messages = [
{
"role": "user",
"content": question
}
]
# This is the actual API call. The model processes the
# entire messages list and generates a completion.
# max_tokens limits how long the response can be.
# temperature controls randomness: 0.0 is deterministic,
# 1.0 is more creative and varied.
response = client.chat.completions.create(
model=MODEL,
messages=messages,
max_tokens=512,
temperature=0.7
)
# The response object has a nested structure.
# response.choices is a list of possible completions.
# We always take the first one (index 0).
# .message.content gives us the text of the response.
return response.choices[0].message.content
# Entry point: run this when the script is executed directly.
if __name__ == "__main__":
question = "Explain what a neural network is in three sentences."
print("Question:", question)
print()
answer = ask_question(question)
print("Answer:", answer)
Within a few seconds of running this script, you should see the model's response printed to your terminal. If you get a connection error, make sure Ollama is running in the background. You can start it by running "ollama serve" in a separate terminal window.
Let us examine what just happened in detail. The OpenAI client object is a thin wrapper that knows how to format HTTP requests and parse HTTP responses. When we call client.chat.completions.create(), the library serializes our messages list into a JSON body and sends an HTTP POST request to http://localhost:11434/v1/chat/completions. The Ollama server receives this request, loads the model into memory (or reuses it if already loaded), runs the forward pass through the neural network one token at a time, and returns the completed response. (If you pass stream=True, the server instead sends tokens back as they are generated; we will stick with non-streaming calls for simplicity.) The client parses the JSON reply into the response object we see.
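If you are curious what actually goes over the wire, the request body is plain JSON. This sketch builds the same payload by hand without sending it, which is handy when debugging with tools like curl:

```python
import json

# The JSON body that client.chat.completions.create() serializes
# and POSTs to http://localhost:11434/v1/chat/completions.
# The field names here mirror the arguments we passed in Python.
payload = {
    "model": "llama3.2",
    "messages": [
        {
            "role": "user",
            "content": "Explain what a neural network is.",
        }
    ],
    "max_tokens": 512,
    "temperature": 0.7,
}

body = json.dumps(payload, indent=2)
print(body)
```

Seeing the raw payload demystifies the client library: it is doing little more than this serialization, plus authentication headers and response parsing.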
The temperature parameter deserves special attention because you will use it constantly. At temperature 0.0, the model always picks the single most probable next token. This makes responses effectively deterministic and consistent (small numerical differences can still cause occasional variation), which is ideal for tasks where you need reliable, repeatable outputs like data extraction or classification. At higher temperatures, the model samples from the probability distribution more broadly, introducing variety and creativity. For creative writing or brainstorming, temperatures between 0.7 and 1.0 work well. For factual question answering or code generation, staying between 0.0 and 0.3 tends to produce better results.
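The effect of temperature can be shown numerically. Under the hood, a model divides its raw scores (logits) by the temperature before converting them to probabilities, so low temperatures sharpen the distribution and high temperatures flatten it. A self-contained sketch with made-up logits:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities, scaled by temperature."""
    scaled = [x / temperature for x in logits]
    # Subtract the max before exponentiating for numerical stability.
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Invented logits for three candidate tokens: "blue", "clear", "dark".
logits = [4.0, 2.0, 1.5]

low = softmax_with_temperature(logits, temperature=0.2)
high = softmax_with_temperature(logits, temperature=1.5)

# At low temperature, nearly all probability mass lands on "blue".
# At high temperature, the distribution spreads out, so sampling
# picks the less likely tokens far more often.
print([round(p, 3) for p in low])
print([round(p, 3) for p in high])
```

Note that temperature 0.0 cannot be plugged into this formula directly (division by zero); real implementations treat it as a special case and simply take the argmax.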
Now let us make the same call to a remote OpenAI model. The code is almost identical. The only differences are the base URL (which we omit, letting the client use OpenAI's servers by default) and the API key (which must be a real, valid key). Save the following as chapter_03_remote_call.py:
# chapter_03_remote_call.py
#
# This script demonstrates calling the OpenAI API (remote model).
# It is structurally identical to the local call, showing that
# the same code pattern works for both local and remote LLMs.
import os
from openai import OpenAI
from dotenv import load_dotenv
# Load environment variables from the .env file.
# This keeps our API key out of the source code.
load_dotenv()
# When no base_url is provided, the client defaults to
# https://api.openai.com/v1, which is OpenAI's production server.
# The api_key is read from the OPENAI_API_KEY environment variable.
client = OpenAI(
api_key=os.getenv("OPENAI_API_KEY")
)
# gpt-4o-mini is a fast, affordable model that is excellent
# for learning and experimentation.
MODEL = os.getenv("OPENAI_MODEL", "gpt-4o-mini")
def ask_question(question: str) -> str:
"""
Send a single question to the OpenAI API and return the response.
Args:
question: The question or prompt to send.
Returns:
The model's response as a plain string.
"""
messages = [
{
"role": "user",
"content": question
}
]
response = client.chat.completions.create(
model=MODEL,
messages=messages,
max_tokens=512,
temperature=0.7
)
return response.choices[0].message.content
if __name__ == "__main__":
question = "Explain what a neural network is in three sentences."
print("Question:", question)
print()
answer = ask_question(question)
print("Answer:", answer)
The structural similarity between these two scripts is intentional and important. It demonstrates that the OpenAI API format has become a de facto standard. By writing your code against this interface, you can switch between local and remote models with minimal changes. This is a powerful architectural principle: program to an interface, not an implementation. We will build on this by creating a small abstraction layer in the next chapter.
CHAPTER 4: BUILDING A REUSABLE LLM CLIENT
As we build more complex programs, we do not want to repeat the client setup code everywhere. More importantly, we want to be able to switch between local and remote models easily, perhaps for testing a feature locally before deploying it against a more powerful remote model. This is a perfect opportunity to apply clean architecture principles.
We will create a module called llm_client.py that provides a unified interface for both local and remote models. This module will be imported by every subsequent example in this tutorial. The design uses a dictionary of provider configurations so that adding a new provider requires only a single new entry, with no changes to the class itself.
Save the following as llm_client.py:
# llm_client.py
#
# A clean, reusable abstraction over local and remote LLM APIs.
# This module provides a single LLMClient class that can be
# configured to talk to Ollama, LM Studio, or OpenAI with
# no changes to the calling code.
#
# Requires Python 3.9+
import os
from typing import Dict, List, Optional
from openai import OpenAI
from dotenv import load_dotenv
load_dotenv()
# These constants define the known provider configurations.
# Adding a new provider is as simple as adding a new entry here.
PROVIDERS: Dict[str, Dict] = {
"ollama": {
"base_url": "http://localhost:11434/v1",
"api_key": "ollama",
"default_model": "llama3.2"
},
"lmstudio": {
"base_url": "http://localhost:1234/v1",
"api_key": "lmstudio",
"default_model": "local-model"
},
"openai": {
"base_url": None, # Use the openai library's default
"api_key": os.getenv("OPENAI_API_KEY"),
"default_model": os.getenv("OPENAI_MODEL", "gpt-4o-mini")
}
}
class LLMClient:
"""
A unified client for interacting with LLMs from different providers.
This class wraps the OpenAI Python client and handles
provider-specific configuration, making it easy to switch
between local and remote models.
Usage:
# Connect to a local Ollama model
client = LLMClient(provider="ollama")
# Connect to OpenAI
client = LLMClient(provider="openai")
# Use a specific model
client = LLMClient(provider="ollama", model="mistral")
"""
def __init__(
self,
provider: str = "ollama",
model: Optional[str] = None,
temperature: float = 0.7,
max_tokens: int = 1024
):
"""
Initialize the LLM client for a specific provider.
Args:
provider: One of "ollama", "lmstudio", or "openai".
model: The model name to use. If None, uses the
provider's default model.
            temperature: Sampling temperature. Typically 0.0 to 1.0;
                most providers accept values up to 2.0.
max_tokens: Maximum number of tokens in the response.
Raises:
ValueError: If an unknown provider name is given.
"""
if provider not in PROVIDERS:
raise ValueError(
f"Unknown provider '{provider}'. "
f"Choose from: {list(PROVIDERS.keys())}"
)
config = PROVIDERS[provider]
self.provider = provider
self.model = model or config["default_model"]
self.temperature = temperature
self.max_tokens = max_tokens
# Build the OpenAI client with provider-specific settings.
# We only pass base_url when it is not None so the openai
# library uses its own default for the openai provider.
client_kwargs: Dict = {"api_key": config["api_key"]}
if config["base_url"] is not None:
client_kwargs["base_url"] = config["base_url"]
self._client = OpenAI(**client_kwargs)
def chat(
self,
messages: List[Dict],
temperature: Optional[float] = None,
max_tokens: Optional[int] = None
) -> str:
"""
Send a list of messages to the LLM and return the text response.
Args:
messages: A list of message dicts with "role" and
"content" keys.
temperature: Override the instance temperature for this call.
max_tokens: Override the instance max_tokens for this call.
Returns:
The model's response as a plain string.
"""
response = self._client.chat.completions.create(
model=self.model,
messages=messages,
temperature=(
temperature if temperature is not None
else self.temperature
),
max_tokens=(
max_tokens if max_tokens is not None
else self.max_tokens
)
)
return response.choices[0].message.content
def complete(
self,
prompt: str,
system: Optional[str] = None
) -> str:
"""
A convenience method for simple single-turn interactions.
This builds the messages list automatically from a plain
prompt string and an optional system instruction.
Args:
prompt: The user's question or instruction.
system: Optional system message to set the model's behavior.
Returns:
The model's response as a plain string.
"""
messages: List[Dict] = []
# Add the system message first if one was provided.
if system:
messages.append({"role": "system", "content": system})
# Add the user's prompt.
messages.append({"role": "user", "content": prompt})
return self.chat(messages)
def __repr__(self) -> str:
return (
f"LLMClient(provider='{self.provider}', "
f"model='{self.model}', "
f"temperature={self.temperature})"
)
This class encapsulates all the configuration complexity and gives us a clean, simple interface. The complete() method handles the common case of a single prompt with an optional system instruction. The chat() method gives us full control over the messages list for more complex scenarios. Let us verify it works with a quick test. Save the following as chapter_04_test_client.py:
# chapter_04_test_client.py
#
# A simple test to verify our LLMClient works correctly
# with both local and remote providers.
from llm_client import LLMClient
def test_local_model() -> None:
"""Test the client against a locally running Ollama model."""
print("=== Testing Local Model (Ollama) ===")
# Create a client for Ollama with a low temperature for
# consistent, factual responses.
client = LLMClient(provider="ollama", temperature=0.1)
print(f"Client configured: {client}")
# Use the convenience method for a simple one-shot query.
response = client.complete(
prompt="What is 2 + 2? Answer with just the number.",
system="You are a precise mathematical assistant."
)
print(f"Response: {response}")
print()
def test_remote_model() -> None:
"""Test the client against the OpenAI API."""
print("=== Testing Remote Model (OpenAI) ===")
client = LLMClient(provider="openai", temperature=0.1)
print(f"Client configured: {client}")
response = client.complete(
prompt="What is 2 + 2? Answer with just the number.",
system="You are a precise mathematical assistant."
)
print(f"Response: {response}")
print()
if __name__ == "__main__":
# Run the local test first since it costs nothing.
test_local_model()
# Uncomment the line below when you have an OpenAI API key.
# test_remote_model()
With this foundation in place, we can now move on to one of the most important skills in working with LLMs: prompt engineering.
CHAPTER 5: PROMPT ENGINEERING - THE ART OF TALKING TO MODELS
Prompt engineering is the practice of crafting inputs to an LLM in a way that reliably produces the outputs you want. It sounds simple, but it is genuinely a skill that takes practice to develop. The good news is that the underlying principles are logical and learnable. In this chapter, we will cover the techniques that will serve you in 90% of real-world applications.
The most important principle is this: be explicit. LLMs are trained to be helpful and to infer intent from context, which means they will fill in any ambiguity with their best guess. Sometimes that guess is right. Often it is not. The more precisely you specify what you want, the more reliably you will get it.
Let us explore this with a concrete example. Suppose you want the model to summarize a document. A naive prompt might look like this:
"Summarize this document."
This leaves enormous ambiguity. How long should the summary be? Should it be in bullet points or prose? Should it include technical details or be accessible to a general audience? Should it focus on conclusions or methodology? The model will make all these decisions for you, and the results will be inconsistent across different runs.
A well-engineered prompt specifies all of these things explicitly. The following example demonstrates four important prompting techniques working together. Save it as chapter_05_prompt_engineering.py:
# chapter_05_prompt_engineering.py
#
# This module demonstrates core prompt engineering techniques.
# Each function shows a different principle with a clear example.
# Run this file directly to see all four techniques in action.
from llm_client import LLMClient
# We will use a local model for all examples in this chapter.
# Switch provider to "openai" if you prefer to use the remote API.
client = LLMClient(provider="ollama", temperature=0.3)
# --- TECHNIQUE 1: Role Assignment ---
# Giving the model a specific persona or role dramatically
# improves the relevance and tone of its responses.
def demonstrate_role_assignment() -> None:
"""Show how assigning a role changes the model's response style."""
print("--- Technique 1: Role Assignment ---")
# Without a role: the model gives a generic response.
generic_response = client.complete(
prompt="Explain recursion.",
system="You are a helpful assistant."
)
print("Generic response:")
print(generic_response[:300])
print()
# With a specific role: the response is targeted and appropriate.
expert_response = client.complete(
prompt="Explain recursion.",
system=(
"You are an experienced computer science professor "
"teaching a first-year undergraduate course. "
"Use simple analogies and avoid jargon. "
"Keep your explanation under 100 words."
)
)
print("Expert role response:")
print(expert_response[:300])
print()
# --- TECHNIQUE 2: Few-Shot Prompting ---
# Providing examples of the desired input-output format
# is one of the most powerful prompting techniques.
# The model learns the pattern from your examples and
# applies it to new inputs without further instruction.
def demonstrate_few_shot() -> None:
"""Show how examples guide the model to the correct output format."""
print("--- Technique 2: Few-Shot Prompting ---")
# We want to extract sentiment from product reviews.
# Instead of describing the format, we demonstrate it with examples.
few_shot_prompt = (
"Classify the sentiment of each review as POSITIVE, NEGATIVE, "
"or NEUTRAL.\nRespond with only the classification word.\n\n"
'Review: "This product exceeded all my expectations!"\n'
"Sentiment: POSITIVE\n\n"
'Review: "It arrived on time but the packaging was damaged."\n'
"Sentiment: NEUTRAL\n\n"
'Review: "Completely useless. Broke after one day."\n'
"Sentiment: NEGATIVE\n\n"
'Review: "I\'ve been using this for six months and it still '
'works perfectly."\n'
"Sentiment:"
)
response = client.complete(
prompt=few_shot_prompt,
system="You are a sentiment analysis engine."
)
print(f"Sentiment classification: {response.strip()}")
print()
# --- TECHNIQUE 3: Chain of Thought ---
# For complex reasoning tasks, asking the model to think
# step by step dramatically improves accuracy.
# This works because the model can use its intermediate
# reasoning steps as context for subsequent steps.
def demonstrate_chain_of_thought() -> None:
"""Show how step-by-step reasoning improves complex task accuracy."""
print("--- Technique 3: Chain of Thought ---")
problem = (
"A train leaves City A at 9:00 AM traveling at 60 mph. "
"Another train leaves City B at 10:00 AM traveling at 90 mph "
"toward City A. The cities are 300 miles apart. "
"At what time do the trains meet?"
)
# Without chain of thought: the model may jump to a wrong answer.
direct_response = client.complete(
prompt=problem,
system="Answer with just the time."
)
print(f"Direct answer: {direct_response.strip()}")
# With chain of thought: the model reasons through the problem.
cot_response = client.complete(
prompt=(
problem
+ "\n\nThink through this step by step before giving "
+ "your final answer."
),
system="You are a careful mathematical reasoner."
)
print(f"Chain of thought answer:\n{cot_response}")
print()
# --- TECHNIQUE 4: Output Format Specification ---
# Specifying the exact format of the output makes
# responses much easier to parse programmatically.
def demonstrate_output_format() -> None:
"""Show how to get structured, parseable output from an LLM."""
print("--- Technique 4: Output Format Specification ---")
prompt = (
"Extract the following information from the text below and "
"format it exactly as shown in the template. "
"Do not add any extra text or explanation.\n\n"
"Template:\n"
"NAME: [person's full name]\n"
"COMPANY: [company name]\n"
"ROLE: [job title]\n"
"EMAIL: [email address if present, otherwise 'not provided']\n\n"
"Text:\n"
'"Hi, I\'m Sarah Chen, the Lead Engineer at Quantum Dynamics. '
"You can reach me at s.chen@quantumdynamics.io for any "
'technical questions."'
)
response = client.complete(
prompt=prompt,
system="You are a precise data extraction assistant."
)
print("Extracted information:")
print(response)
print()
if __name__ == "__main__":
demonstrate_role_assignment()
demonstrate_few_shot()
demonstrate_chain_of_thought()
demonstrate_output_format()
The techniques demonstrated above are not independent tricks. They work together. A production prompt might combine role assignment (to set the model's expertise and tone), few-shot examples (to demonstrate the exact output format), and chain-of-thought instructions (to improve reasoning quality) all in a single system message. The art lies in knowing which combination to use for a given task.
One more technique deserves mention before we move on: the importance of negative instructions. Telling the model what NOT to do can be just as important as telling it what to do. If you are building a customer service bot for a software company, you might include instructions like "Do not discuss competitor products" or "Do not make promises about future features." Models generally follow these constraints well when they are stated clearly in the system message.
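As a sketch of how the chapter's techniques combine in practice, here is a single system message that layers role assignment, a format-demonstrating example, a chain-of-thought instruction, and negative constraints. The product, format, and wording are all invented for illustration:

```python
# A hypothetical production system prompt combining the techniques
# from this chapter in one message. Everything here is illustrative.
system_prompt = (
    # Role assignment: sets expertise and tone.
    "You are a senior support engineer for a project management app. "
    "You are concise, friendly, and never speculate.\n\n"
    # Output format specification, demonstrated with one example.
    "Answer every question using exactly this format:\n"
    "DIAGNOSIS: [one-sentence summary of the likely cause]\n"
    "STEPS: [numbered troubleshooting steps]\n\n"
    "Example:\n"
    "DIAGNOSIS: The sync failure is likely caused by an expired "
    "session.\n"
    "STEPS: 1. Log out. 2. Log back in. 3. Retry the sync.\n\n"
    # Chain-of-thought instruction: reason before answering.
    "Think through the user's problem step by step before writing "
    "your DIAGNOSIS.\n\n"
    # Negative instructions: constraints on what NOT to do.
    "Do not discuss competitor products. Do not promise future "
    "features."
)

# This string would then be passed as the system message, e.g.:
#   client.complete(prompt=user_question, system=system_prompt)
```

There is no single right combination; the appropriate mix depends on the task, and iterating on the system message against real inputs is a normal part of development.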
CHAPTER 6: BUILDING A CONVERSATIONAL CHATBOT WITH MEMORY
Now that we understand how to craft effective prompts, we are ready to build something genuinely useful: a chatbot that maintains a coherent conversation over multiple turns. As we established in Chapter 1, LLMs are stateless. They do not remember anything between API calls. To create the illusion of memory, we must maintain the conversation history ourselves and send it with every request.
This pattern is called a conversation buffer. It is the foundation of every chatbot, from the simplest toy project to production systems serving millions of users. The key insight is that the conversation history is just a Python list of dictionaries, and we append to it after every exchange.
The following implementation builds a complete interactive chatbot with a configurable persona. Save it as chapter_06_chatbot.py:
# chapter_06_chatbot.py
#
# A complete conversational chatbot with persistent memory.
# This demonstrates the conversation buffer pattern, which is
# the foundation of all LLM-based chat applications.
from typing import Dict, List, Optional
from llm_client import LLMClient
class Chatbot:
"""
A conversational chatbot that maintains full conversation history.
The chatbot stores all messages in a list and sends the complete
history with every API call, giving the model the context it needs
to respond coherently across multiple turns.
Attributes:
name: The chatbot's display name.
client: The LLMClient used to make API calls.
history: The list of all messages in the conversation.
max_history_turns: Maximum turns to retain before trimming.
"""
def __init__(
self,
name: str = "Assistant",
system_prompt: str = "You are a helpful assistant.",
provider: str = "ollama",
model: Optional[str] = None,
temperature: float = 0.7,
max_history_turns: int = 20
):
"""
Initialize the chatbot with a persona and configuration.
Args:
name: Display name for the chatbot.
system_prompt: Instructions that define the chatbot's
behavior for the entire conversation.
provider: LLM provider ("ollama", "lmstudio",
or "openai").
model: Specific model name, or None for the
provider default.
temperature: Sampling temperature for response
generation.
max_history_turns: Maximum number of conversation turns
to keep. Older turns are dropped to
stay within token limits.
"""
self.name = name
self.max_history_turns = max_history_turns
# Initialize the LLM client with the specified provider.
self.client = LLMClient(
provider=provider,
model=model,
temperature=temperature
)
# The history starts with just the system message.
# Every subsequent message will be appended here.
self.history: List[Dict] = [
{"role": "system", "content": system_prompt}
]
def chat(self, user_message: str) -> str:
"""
Send a user message and get the chatbot's response.
This method appends the user message to history, trims if
needed, calls the LLM with the full history, appends the
assistant's response, and returns the response text.
Args:
user_message: The text message from the user.
Returns:
The chatbot's response as a string.
"""
# Append the user's message to the conversation history.
self.history.append({
"role": "user",
"content": user_message
})
# Trim history if it has grown too long, keeping the system
# message and only the most recent turns.
self._trim_history()
# Send the full history to the LLM and get a response.
response_text = self.client.chat(self.history)
# Append the assistant's response to history so the model
# will have context for the next turn.
self.history.append({
"role": "assistant",
"content": response_text
})
return response_text
def _trim_history(self) -> None:
"""
Remove old messages to prevent exceeding token limits.
We keep the system message and the most recent N turns.
Each turn consists of one user message and one assistant
message, so max_history_turns * 2 gives us the number of
non-system messages to retain.
"""
# The system message is always at index 0.
# Non-system messages start at index 1.
non_system_messages = self.history[1:]
max_messages = self.max_history_turns * 2
if len(non_system_messages) > max_messages:
# Keep only the most recent messages.
trimmed = non_system_messages[-max_messages:]
self.history = [self.history[0]] + trimmed
def reset(self) -> None:
"""
Clear the conversation history, keeping only the system message.
Use this to start a fresh conversation with the same chatbot
configuration without creating a new instance.
"""
system_message = self.history[0]
self.history = [system_message]
print(f"[{self.name}] Conversation history cleared.")
def show_history(self) -> None:
"""Print the full conversation history for debugging purposes."""
print(
f"\n=== Conversation History "
f"({len(self.history)} messages) ==="
)
for i, msg in enumerate(self.history):
role = msg["role"].upper()
# Truncate long messages for display readability.
content = msg["content"][:100]
suffix = "..." if len(msg["content"]) > 100 else ""
print(f"[{i}] {role}: {content}{suffix}")
print("=" * 50)
def run_interactive_chat() -> None:
"""
Launch an interactive command-line chat session.
This function creates a chatbot and enters a loop where the user
can type messages and receive responses. Type 'quit' to exit,
'reset' to clear history, or 'history' to inspect all messages.
"""
# Create a chatbot with a specific persona.
# Try changing the system_prompt to create different personalities.
bot = Chatbot(
name="Aria",
system_prompt=(
"You are Aria, a friendly and knowledgeable AI assistant "
"specializing in science and technology. You explain complex "
"topics clearly and enjoy using real-world analogies. "
"You are enthusiastic but concise, keeping responses under "
"150 words unless the user asks for more detail."
),
provider="ollama",
temperature=0.7,
max_history_turns=10
)
print(f"Chat with {bot.name} (powered by {bot.client.model})")
print(
"Commands: 'quit' to exit, 'reset' to clear history, "
"'history' to inspect"
)
print("-" * 60)
while True:
# Get user input, handling keyboard interrupts gracefully.
try:
user_input = input("You: ").strip()
except (KeyboardInterrupt, EOFError):
print("\nGoodbye!")
break
# Handle special commands before sending to the LLM.
if not user_input:
continue
if user_input.lower() == "quit":
print("Goodbye!")
break
if user_input.lower() == "reset":
bot.reset()
continue
if user_input.lower() == "history":
bot.show_history()
continue
# Send the message and print the response.
response = bot.chat(user_input)
print(f"\n{bot.name}: {response}\n")
if __name__ == "__main__":
run_interactive_chat()
Run this script and have a multi-turn conversation. Ask a question, then ask a follow-up that references your previous question. You will see that the model correctly understands what "it" or "that" refers to in your follow-up, because it has the full conversation history available.
Notice the _trim_history method. This is a critical piece of production engineering that many tutorials omit. Every LLM has a context window, which is the maximum number of tokens it can process in a single call. For smaller local models, this might be 4,096 or 8,192 tokens. For large remote models, it can be 128,000 or more. If your conversation history grows beyond the context window, the API call will fail. The trim method prevents this by discarding the oldest messages while always keeping the system message. A more sophisticated implementation might use a summarization technique, where old messages are summarized and the summary is kept rather than discarded, but the buffer approach is sufficient for most applications.
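The turn-count buffer above can also be made token-aware, using a rough heuristic of about four characters per token for English text. The sketch below assumes the same message-dict format as our Chatbot class; the `estimate_tokens` helper and its 4-characters-per-token ratio are illustrative assumptions, not exact tokenizer counts (a real system would use a tokenizer library such as tiktoken).

```python
# Token-aware history trimming (sketch). The 4-characters-per-token
# ratio is a rough heuristic, not an exact tokenizer count.
from typing import Dict, List


def estimate_tokens(message: Dict) -> int:
    """Crudely estimate the token count of one chat message."""
    return len(message["content"]) // 4 + 4  # +4 for role overhead


def trim_to_budget(history: List[Dict], max_tokens: int) -> List[Dict]:
    """
    Keep the system message plus as many of the most recent
    messages as fit within max_tokens.
    """
    system = history[0]
    budget = max_tokens - estimate_tokens(system)
    kept: List[Dict] = []
    # Walk backwards from the newest message, keeping what fits.
    for msg in reversed(history[1:]):
        cost = estimate_tokens(msg)
        if cost > budget:
            break
        kept.append(msg)
        budget -= cost
    return [system] + list(reversed(kept))


if __name__ == "__main__":
    hist = [{"role": "system", "content": "You are helpful."}]
    for i in range(50):
        hist.append({"role": "user", "content": f"question {i} " * 10})
        hist.append({"role": "assistant", "content": f"answer {i} " * 10})
    trimmed = trim_to_budget(hist, max_tokens=500)
    print(len(trimmed))  # far fewer messages than the original history
```

The design choice is the same as in _trim_history: the system message is sacred and the newest messages matter most, so we spend the budget from the end of the conversation backwards.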
CHAPTER 7: STRUCTURED OUTPUT - GETTING JSON FROM LLMs
One of the most common requirements in real applications is getting structured data out of an LLM rather than free-form text. You might want to extract entities from a document, classify text into categories, or have the model fill out a form. The challenge is that LLMs naturally produce text, and parsing free-form text programmatically is fragile.
There are two main approaches to this problem. The first approach is prompt-based: you instruct the model very carefully to output JSON and then parse the result. The second approach is to use the JSON mode or structured output features that some APIs provide, which constrain the model's output to valid JSON. We will cover both.
Let us start with the prompt-based approach, since it works with any model including local ones. Save the following as chapter_07_structured_output.py:
# chapter_07_structured_output.py
#
# Techniques for extracting structured data from LLM responses.
# We cover prompt-based JSON extraction with validation and retry
# logic, which works with any LLM including local models.
import json
import re
from typing import Any, Dict, List, Optional
from llm_client import LLMClient
# Low temperature is critical for structured output tasks.
# Higher temperatures increase the chance of malformed JSON.
client = LLMClient(provider="ollama", temperature=0.1)
def extract_json_from_text(text: str) -> Optional[Dict]:
"""
Extract a JSON object from a text string that may contain
surrounding explanation or markdown code fences.
LLMs sometimes wrap JSON in markdown code blocks. This function
handles that case gracefully by trying multiple extraction
strategies in order of reliability.
Args:
text: The raw text from the LLM response.
Returns:
A parsed Python dict, or None if no valid JSON was found.
"""
# Strategy 1: Try to parse the entire text as JSON directly.
try:
return json.loads(text.strip())
except json.JSONDecodeError:
pass
# Strategy 2: Look for a JSON block inside code fences.
# The regex looks for content between ```json and ``` markers.
code_fence_pattern = r"```(?:json)?\s*(\{.*?\})\s*```"
match = re.search(code_fence_pattern, text, re.DOTALL)
if match:
try:
return json.loads(match.group(1))
except json.JSONDecodeError:
pass
# Strategy 3: Find any JSON object anywhere in the text.
# This looks for the outermost { } pair and tries to parse it.
brace_pattern = r"\{.*\}"
match = re.search(brace_pattern, text, re.DOTALL)
if match:
try:
return json.loads(match.group(0))
except json.JSONDecodeError:
pass
# All strategies failed: return None to signal failure.
return None
def extract_person_info(text: str) -> Dict:
"""
Extract structured information about a person from unstructured text.
This function uses a carefully crafted prompt to instruct the model
to output a specific JSON schema, then validates and returns the
result. It retries up to three times if the model produces
malformed JSON.
Args:
text: Unstructured text containing information about a person.
Returns:
A dictionary with the extracted person information.
Raises:
ValueError: If the model fails to produce valid JSON after
all retries are exhausted.
"""
system = (
"You are a precise data extraction engine. "
"You always respond with valid JSON only, no other text. "
"Never add explanations or markdown formatting."
)
# The prompt specifies the exact schema we want.
# Providing the schema explicitly is much more reliable than
# describing it in natural language alone.
prompt = (
"Extract information from the following text and return it "
"as JSON matching this exact schema:\n\n"
"{\n"
' "full_name": "string or null",\n'
' "age": "integer or null",\n'
' "occupation": "string or null",\n'
' "location": "string or null",\n'
' "skills": ["list", "of", "strings"],\n'
' "contact": {\n'
' "email": "string or null",\n'
' "phone": "string or null"\n'
" }\n"
"}\n\n"
f"Text to analyze:\n{text}\n\n"
"Return only the JSON object, nothing else."
)
# Initialize raw_response so it is always defined, even if the
# loop body never executes (which cannot happen, but satisfies
# static analysis tools).
raw_response = ""
# We try up to 3 times in case the model produces malformed JSON.
for attempt in range(3):
raw_response = client.chat([
{"role": "system", "content": system},
{"role": "user", "content": prompt}
])
parsed = extract_json_from_text(raw_response)
if parsed is not None:
return parsed
print(f" [Attempt {attempt + 1} failed, retrying...]")
raise ValueError(
f"Failed to extract valid JSON after 3 attempts. "
f"Last response: {raw_response}"
)
def classify_support_ticket(ticket_text: str) -> Dict:
"""
Classify a customer support ticket into structured categories.
This demonstrates using structured output for classification tasks,
which is one of the most common real-world LLM applications.
Args:
ticket_text: The raw text of the support ticket.
Returns:
A dict containing category, priority, sentiment, and summary.
Raises:
ValueError: If the model response cannot be parsed as JSON.
"""
system = (
"You are a customer support triage system. "
"Analyze support tickets and return structured JSON "
"classifications. Return only valid JSON, no other text."
)
prompt = (
"Analyze this support ticket and classify it using the "
"following JSON schema:\n\n"
"{\n"
' "category": "one of: billing, technical, account, '
'shipping, other",\n'
' "priority": "one of: low, medium, high, critical",\n'
' "sentiment": "one of: positive, neutral, frustrated, '
'angry",\n'
' "requires_human": true or false,\n'
' "summary": "one sentence summary of the issue",\n'
' "suggested_action": "brief description of recommended '
'next step"\n'
"}\n\n"
f"Support ticket:\n{ticket_text}"
)
raw_response = client.chat([
{"role": "system", "content": system},
{"role": "user", "content": prompt}
])
result = extract_json_from_text(raw_response)
if result is None:
raise ValueError(
f"Could not parse classification response: {raw_response}"
)
return result
if __name__ == "__main__":
print("=== Person Information Extraction ===")
sample_text = (
"Meet Dr. James Okafor, a 34-year-old data scientist based "
"in Berlin, Germany. He works at a leading AI research lab "
"and specializes in natural language processing, computer "
"vision, and reinforcement learning. James has a PhD in "
"Computer Science from TU Berlin. You can reach him at "
"james.okafor@ailab.de or call +49-30-12345678."
)
try:
info = extract_person_info(sample_text)
print(json.dumps(info, indent=2))
except ValueError as e:
print(f"Extraction failed: {e}")
print()
print("=== Support Ticket Classification ===")
ticket = (
"Subject: URGENT - Cannot access my account for 3 days!!!\n\n"
"I have been locked out of my account since Monday and I have "
"an important presentation tomorrow that requires the files I "
"stored in your system. I've tried resetting my password 5 "
"times and nothing works. I'm extremely frustrated. My account "
"email is user@company.com. Please help immediately!"
)
try:
classification = classify_support_ticket(ticket)
print(json.dumps(classification, indent=2))
except ValueError as e:
print(f"Classification failed: {e}")
The retry logic in extract_person_info is an important pattern. LLMs occasionally produce malformed output, especially smaller local models. Rather than crashing on the first failure, a robust application retries with the same prompt. In production systems, you might also log these failures to identify patterns and improve your prompts over time.
For applications using the OpenAI API, there is a more reliable approach called structured outputs, where you provide a JSON schema and the API guarantees the response will conform to it. This uses constrained decoding under the hood, which means the model is mathematically prevented from generating tokens that would violate the schema. Save the following as chapter_07_openai_structured.py:
# chapter_07_openai_structured.py
#
# Using OpenAI's native structured output feature for guaranteed
# JSON schema compliance. This only works with the OpenAI API
# (not local models) and requires gpt-4o or newer.
import json
import os
from typing import Dict
from openai import OpenAI
from dotenv import load_dotenv
load_dotenv()
# We use the OpenAI client directly here rather than our LLMClient
# wrapper because the structured output feature requires passing
# the response_format parameter, which our wrapper does not expose.
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
def classify_with_schema(text: str) -> Dict:
"""
Classify text using OpenAI's structured output feature.
The response_format parameter with a JSON schema guarantees
that the response will be valid JSON matching the schema.
No parsing or retry logic is needed.
Args:
text: The text to classify.
Returns:
A dict guaranteed to match the specified schema.
"""
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{
"role": "system",
"content": "Classify the given text."
},
{
"role": "user",
"content": f"Classify this text: {text}"
}
],
# The response_format parameter enables structured output.
# The model is constrained to produce valid JSON matching
# this schema exactly.
response_format={
"type": "json_schema",
"json_schema": {
"name": "text_classification",
"schema": {
"type": "object",
"properties": {
"category": {
"type": "string",
"enum": [
"positive",
"negative",
"neutral"
]
},
"confidence": {
"type": "number",
"minimum": 0.0,
"maximum": 1.0
},
"reasoning": {
"type": "string"
}
},
"required": [
"category",
"confidence",
"reasoning"
],
"additionalProperties": False
},
"strict": True
}
}
)
# With structured output, we can parse directly without
# additional error handling because the API guarantees
# valid JSON conforming to our schema.
return json.loads(response.choices[0].message.content)
if __name__ == "__main__":
# This example requires a valid OpenAI API key.
sample = (
"I absolutely love this product. "
"Best purchase I've made all year!"
)
result = classify_with_schema(sample)
print(json.dumps(result, indent=2))
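When you cannot use the API-side guarantee (for example with a local model), a lightweight local check of the same schema is a useful safety net before the data flows into the rest of your application. The following is a minimal hand-rolled validator sketch for the classification schema above; in a real project you would more likely reach for the third-party jsonschema package.

```python
# Local validation of a classification result (sketch). This mirrors
# the schema used above: category must be one of three values,
# confidence must be a number in [0, 1], reasoning must be a string.
from typing import Any, Dict, List

ALLOWED_CATEGORIES = {"positive", "negative", "neutral"}


def validate_classification(data: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means valid."""
    problems: List[str] = []
    if data.get("category") not in ALLOWED_CATEGORIES:
        problems.append(f"bad category: {data.get('category')!r}")
    conf = data.get("confidence")
    if not isinstance(conf, (int, float)) or not 0.0 <= conf <= 1.0:
        problems.append(f"bad confidence: {conf!r}")
    if not isinstance(data.get("reasoning"), str):
        problems.append("reasoning must be a string")
    return problems


if __name__ == "__main__":
    good = {"category": "positive", "confidence": 0.9,
            "reasoning": "clearly happy"}
    bad = {"category": "meh", "confidence": 1.5, "reasoning": None}
    print(validate_classification(good))  # []
    print(validate_classification(bad))   # three problems reported
```

Returning a list of problems rather than raising on the first one makes the failures loggable, which helps you spot patterns in what the model gets wrong.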
CHAPTER 8: TOOL USE AND FUNCTION CALLING
We have now reached one of the most transformative capabilities of modern LLMs: the ability to use tools. This is the bridge between LLMs as text generators and LLMs as active agents that can interact with the world.
The core idea is elegant. You define a set of tools (Python functions) and describe them to the model. When the model determines that it needs information or capabilities beyond its training data, it can request to call one of those tools. Your code executes the function, gets the result, and sends it back to the model. The model then uses that result to formulate its final response.
This is not the model executing code directly. The model outputs a structured request saying "I want to call function X with these arguments." Your code intercepts that request, calls the actual function, and feeds the result back. The model never touches your system directly. This design keeps the model sandboxed while still giving it the ability to access real-world information and capabilities.
The following diagram illustrates the tool use loop:
User: "What is the weather in Berlin?"
|
v
LLM decides it needs weather data
|
v
LLM outputs: {"tool": "get_weather", "args": {"city": "Berlin"}}
|
v
Your code calls get_weather("Berlin") --> "18C, partly cloudy"
|
v
You send the result back to the LLM
|
v
LLM generates: "The weather in Berlin is 18 degrees Celsius..."
|
v
User receives the final answer
Let us implement this pattern from scratch. We will first build a manual implementation to understand the mechanics, then show how to use the OpenAI tool-calling API. Save the following as chapter_08_tool_use.py:
# chapter_08_tool_use.py
#
# Implementing tool use (function calling) with LLMs.
# We first build a manual implementation to understand the mechanics,
# then show how to use the OpenAI tool-calling API.
#
# This file also exports get_current_time, calculate, and
# search_knowledge_base for use by chapter_08_openai_tools.py.
import json
import math
import re
import datetime
from typing import Any, Dict, List, Optional
from llm_client import LLMClient
# ==========================================================================
# PART 1: DEFINING OUR TOOLS
# These are regular Python functions that the LLM can request to call.
# They represent capabilities that the model does not have on its own:
# real-time data, calculations, and knowledge retrieval.
# ==========================================================================
def get_current_time() -> str:
"""
Return the current date and time.
This is a simple example of a tool that provides real-time
information that the model cannot know from its training data.
Returns:
A formatted string with the current date and time.
"""
now = datetime.datetime.now()
return now.strftime(
"Current date and time: %A, %B %d, %Y at %H:%M:%S"
)
def calculate(expression: str) -> str:
"""
Safely evaluate a mathematical expression.
We use a restricted evaluation approach to prevent code injection.
Only basic math operations and a curated set of math functions
are allowed.
Args:
expression: A mathematical expression as a string,
e.g. "2 ** 10 + 5" or "sqrt(144)"
Returns:
The result of the calculation as a string, or an error message.
"""
# Define a safe set of allowed names for evaluation.
# This prevents the expression from accessing dangerous built-ins.
safe_names: Dict[str, Any] = {
"abs": abs, "round": round, "min": min, "max": max,
"sqrt": math.sqrt, "pow": math.pow, "log": math.log,
"sin": math.sin, "cos": math.cos, "tan": math.tan,
"pi": math.pi, "e": math.e
}
try:
# eval() with restricted globals blocks casual misuse, but it is
# not a true sandbox; treat this as demo code, not a
# production-safe evaluator for untrusted input.
result = eval(expression, {"__builtins__": {}}, safe_names)
return f"The result of '{expression}' is {result}"
except Exception as ex:
return f"Error evaluating '{expression}': {str(ex)}"
def search_knowledge_base(query: str) -> str:
"""
Search a simulated knowledge base for information.
In a real application, this would query a vector database,
a search engine, or a company's internal documentation system.
For this tutorial, we return canned responses to demonstrate
the pattern without requiring external services.
Args:
query: The search query string.
Returns:
Relevant information from the knowledge base as a string.
"""
# Simulated knowledge base entries.
knowledge: Dict[str, str] = {
"python": (
"Python is a high-level, interpreted programming language "
"known for its readability and versatility. Created by "
"Guido van Rossum in 1991."
),
"llm": (
"Large Language Models (LLMs) are neural networks trained "
"on massive text datasets to understand and generate human "
"language. Examples include GPT-4, Claude, and Llama."
),
"OpenAI": (
"OpenAI is a US technology company that focuses on Generative AI models,"
"founded in 2015. It operates in areas including LLM models, VLM models, "
"image and video generation, coding agents."
)
}
# Simple keyword matching for demonstration purposes.
query_lower = query.lower()
for keyword, info in knowledge.items():
if keyword in query_lower:
return info
return f"No specific information found for query: '{query}'"
# ==========================================================================
# PART 2: THE TOOL REGISTRY
# A registry maps tool names to their implementations and descriptions.
# The descriptions are what we send to the LLM so it knows what tools
# exist and when to use them.
# ==========================================================================
TOOL_REGISTRY: Dict[str, Dict] = {
"get_current_time": {
"function": get_current_time,
"description": (
"Returns the current date and time. Use this when the user "
"asks about the current time, date, day of the week, etc."
),
"parameters": {} # This tool takes no parameters
},
"calculate": {
"function": calculate,
"description": (
"Evaluates a mathematical expression and returns the result. "
"Use this for any arithmetic, algebra, or math calculations."
),
"parameters": {
"expression": (
"A valid Python mathematical expression as a string, "
"e.g. '2 ** 10' or 'sqrt(144)'"
)
}
},
"search_knowledge_base": {
"function": search_knowledge_base,
"description": (
"Searches the internal knowledge base for information on a "
"topic. Use this when you need factual information about "
"Python, LLMs, OpenAI, or other topics."
),
"parameters": {
"query": (
"The search query describing what information is needed"
)
}
}
}
# ==========================================================================
# PART 3: THE MANUAL TOOL-USE AGENT
# This implementation shows the mechanics without using the OpenAI
# function-calling API. The model is instructed to output JSON when
# it wants to call a tool, and we parse that JSON ourselves.
# ==========================================================================
def build_tool_description_prompt() -> str:
"""
Build a text description of all available tools for the system prompt.
Returns:
A formatted string describing all tools and their parameters.
"""
lines: List[str] = ["You have access to the following tools:\n"]
for tool_name, tool_info in TOOL_REGISTRY.items():
lines.append(f"Tool: {tool_name}")
lines.append(f"Description: {tool_info['description']}")
if tool_info["parameters"]:
lines.append("Parameters:")
for param_name, param_desc in tool_info["parameters"].items():
lines.append(f" - {param_name}: {param_desc}")
else:
lines.append("Parameters: none")
lines.append("") # Blank line between tool entries
lines.append(
"When you need to use a tool, respond with ONLY this JSON:\n"
'{"tool_call": {"name": "tool_name", "args": {"param": "value"}}}\n'
"\n"
"When you have the information you need and are ready to give a "
"final answer to the user, respond normally in plain text. "
"Do not use the JSON format for your final answer."
)
return "\n".join(lines)
class ManualToolAgent:
"""
An agent that uses tools through a manual JSON-based protocol.
This class demonstrates the fundamental mechanics of tool use
without relying on provider-specific APIs. It works with any
LLM that can follow instructions to output JSON when it needs
to call a tool.
"""
def __init__(self, provider: str = "ollama"):
"""
Initialize the agent with an LLM client and tool descriptions.
Args:
provider: The LLM provider to use ("ollama" or "openai").
"""
self.client = LLMClient(provider=provider, temperature=0.1)
self.system_prompt = build_tool_description_prompt()
def run(self, user_query: str, max_steps: int = 5) -> str:
"""
Process a user query, using tools as needed.
The agent runs a loop: it sends the query to the LLM, checks
if the LLM wants to call a tool, executes the tool if so, and
repeats until the LLM produces a final text answer.
Args:
user_query: The user's question or request.
max_steps: Maximum number of tool calls before forcing
a final response.
Returns:
The agent's final answer as a string.
"""
messages: List[Dict] = [
{"role": "system", "content": self.system_prompt},
{"role": "user", "content": user_query}
]
print(f"\n[Agent] Processing: '{user_query}'")
for step in range(max_steps):
# Ask the LLM what to do next.
response = self.client.chat(messages)
print(
f"[Agent Step {step + 1}] LLM response: "
f"{response[:150]}..."
)
# Try to parse the response as a tool call.
tool_call = self._parse_tool_call(response)
if tool_call is None:
# No tool call found: this is the final answer.
print("[Agent] Final answer received.")
return response
# Execute the requested tool.
tool_name = tool_call.get("name")
tool_args = tool_call.get("args", {})
if not tool_name or tool_name not in TOOL_REGISTRY:
# The model hallucinated a tool that does not exist,
# or the name field was missing from the JSON.
tool_result = (
f"Error: Tool '{tool_name}' does not exist. "
f"Available tools: {list(TOOL_REGISTRY.keys())}"
)
else:
tool_function = TOOL_REGISTRY[tool_name]["function"]
print(f"[Agent] Calling tool: {tool_name}({tool_args})")
try:
tool_result = tool_function(**tool_args)
except TypeError as ex:
# The model supplied wrong or missing argument names.
tool_result = f"Error calling '{tool_name}': {ex}"
print(f"[Agent] Tool result: {tool_result}")
# Add the tool call and result to the conversation history
# so the model can see what happened and continue reasoning.
messages.append({"role": "assistant", "content": response})
messages.append({
"role": "user",
"content": (
f"Tool result for '{tool_name}': {tool_result}"
)
})
# If we reach max_steps without a final answer, force one.
messages.append({
"role": "user",
"content": (
"Please provide your final answer based on the "
"information gathered so far."
)
})
return self.client.chat(messages)
def _parse_tool_call(self, response: str) -> Optional[Dict]:
"""
Try to parse a tool call from the LLM's response.
Args:
response: The raw text response from the LLM.
Returns:
A dict with "name" and "args" keys, or None if no
tool call was found in the response.
"""
# Strategy 1: Try to parse the entire response as JSON.
try:
data = json.loads(response.strip())
# Guard against valid JSON that is not an object (e.g. a number).
if isinstance(data, dict) and "tool_call" in data:
return data["tool_call"]
except json.JSONDecodeError:
pass
# Strategy 2: Try to find JSON embedded anywhere in the text.
brace_match = re.search(r"\{.*\}", response, re.DOTALL)
if brace_match:
try:
data = json.loads(brace_match.group(0))
if "tool_call" in data:
return data["tool_call"]
except json.JSONDecodeError:
pass
# No tool call found: the response is a final answer.
return None
if __name__ == "__main__":
agent = ManualToolAgent(provider="ollama")
queries = [
"What time is it right now?",
"What is the square root of 144 plus 50?",
"Tell me about Large Language Models."
]
for query in queries:
print("\n" + "=" * 60)
answer = agent.run(query)
print(f"\nFinal Answer: {answer}")
print("=" * 60)
This manual implementation is educational because it makes every step explicit. In production, you would use the OpenAI function-calling API, which handles the JSON protocol automatically and is more reliable. The following example shows how that works. Save it as chapter_08_openai_tools.py:
# chapter_08_openai_tools.py
#
# Using the OpenAI function-calling API for robust tool use.
# This approach is more reliable than the manual JSON protocol
# because the API handles the formatting and parsing for you.
# It only works with OpenAI (or compatible APIs), not all local
# models.
import json
import os
from openai import OpenAI
from dotenv import load_dotenv
# Import our tool functions from the previous example.
# In a real project, these would live in a dedicated tools module.
from chapter_08_tool_use import (
get_current_time,
calculate,
search_knowledge_base
)
load_dotenv()
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
MODEL = "gpt-4o-mini"
# Tool definitions in the format required by the OpenAI API.
# This is a JSON Schema description of each tool's interface.
# The model uses these descriptions to decide when and how to call
# each tool.
OPENAI_TOOLS = [
{
"type": "function",
"function": {
"name": "get_current_time",
"description": (
"Returns the current date and time. Use when the user "
"asks about the current time, date, or day of the week."
),
"parameters": {
"type": "object",
"properties": {},
"required": []
}
}
},
{
"type": "function",
"function": {
"name": "calculate",
"description": (
"Evaluates a mathematical expression. Use for arithmetic, "
"algebra, or any numerical calculations."
),
"parameters": {
"type": "object",
"properties": {
"expression": {
"type": "string",
"description": (
"A Python mathematical expression, "
"e.g. '2 ** 10 + sqrt(16)'"
)
}
},
"required": ["expression"]
}
}
},
{
"type": "function",
"function": {
"name": "search_knowledge_base",
"description": (
"Searches the internal knowledge base. Use when you need "
"factual information about Python, LLMs, or organizations."
),
"parameters": {
"type": "object",
"properties": {
"query": {
"type": "string",
"description": "The search query"
}
},
"required": ["query"]
}
}
}
]
# Map tool names to their Python implementations for dispatch.
TOOL_IMPLEMENTATIONS = {
"get_current_time": get_current_time,
"calculate": calculate,
"search_knowledge_base": search_knowledge_base
}
def run_agent_with_tools(user_query: str) -> str:
"""
Run an agent loop using the OpenAI function-calling API.
The OpenAI API handles the tool-call protocol. When the model
wants to call a tool, the response contains structured tool_calls
objects instead of plain text. We execute the tools and feed
results back until the model produces a final text answer.
Args:
user_query: The user's question or request.
Returns:
The agent's final text response.
"""
messages = [
{
"role": "system",
"content": "You are a helpful assistant with access to tools."
},
{"role": "user", "content": user_query}
]
print(f"\n[Agent] Query: '{user_query}'")
# The agent loop continues until the model produces a text
# response (indicating a final answer) rather than a tool call.
while True:
response = client.chat.completions.create(
model=MODEL,
messages=messages,
tools=OPENAI_TOOLS,
# "auto" lets the model decide whether to use a tool.
# "required" forces tool use; "none" disables it.
tool_choice="auto"
)
message = response.choices[0].message
# Check if the model wants to call one or more tools.
if message.tool_calls:
# Add the assistant's message (with tool calls) to history.
messages.append(message)
# Execute each requested tool call in sequence.
for tool_call in message.tool_calls:
func_name = tool_call.function.name
func_args = json.loads(tool_call.function.arguments)
print(f"[Agent] Calling: {func_name}({func_args})")
# Dispatch to the correct tool implementation.
if func_name in TOOL_IMPLEMENTATIONS:
result = TOOL_IMPLEMENTATIONS[func_name](**func_args)
else:
result = f"Error: Unknown tool '{func_name}'"
print(f"[Agent] Result: {result}")
# Add the tool result to the message history.
# The "tool" role is specific to the OpenAI API format
# and must include the tool_call_id for correlation.
messages.append({
"role": "tool",
"tool_call_id": tool_call.id,
"content": str(result)
})
else:
# No tool calls in this response: it is the final answer.
final_answer = message.content
print("[Agent] Final answer produced.")
return final_answer
if __name__ == "__main__":
# These examples require a valid OpenAI API key.
test_queries = [
"What day of the week is it today?",
(
"Calculate the compound interest on $10,000 at 5% for "
"3 years. Use the formula 10000 * (1.05 ** 3)"
),
"What can you tell me about OpenAI?"
]
for query in test_queries:
print("\n" + "=" * 60)
answer = run_agent_with_tools(query)
print(f"\nFinal Answer:\n{answer}")
print("=" * 60)
The key difference between the manual approach and the OpenAI API approach is reliability. With the manual approach, the model must follow your JSON protocol exactly, and even small deviations can break your parser. With the OpenAI API, the model outputs tool calls as structured objects in the API response, completely separate from the text content. There is no parsing ambiguity. For production applications using OpenAI, always use the native function-calling API. For local models, the manual approach is your best option, though some local model servers are beginning to support function calling natively.
CHAPTER 9: BUILDING AN AGENT FROM SCRATCH
We have been using the word "agent" informally. Now it is time to define it precisely and build one properly. An agent is an LLM-powered system that can perceive its environment, reason about what actions to take, execute those actions, observe the results, and repeat this loop until a goal is achieved. The key word is "loop." Unlike a simple chatbot that responds to each message independently, an agent can pursue a goal over multiple steps, adapting its strategy based on what it learns along the way.
The most influential framework for thinking about agents is called ReAct, which stands for Reasoning and Acting. In the ReAct pattern, the agent explicitly alternates between two types of outputs. The first type is a Thought, where the agent reasons about the current situation and what to do next. The second type is an Action, where the agent specifies a tool to call and what arguments to pass. After each action, the agent receives an Observation (the tool's output), which it uses to inform its next thought. This cycle continues until the agent produces a Final Answer.
The ReAct pattern looks like this:
Question: What is the population of the capital of France?
Thought: I need to find the capital of France, then find its population.
Action: web_search
Action Input: {"query": "capital of France"}
Observation: "Paris is the capital of France."
Thought: Now I need the population of Paris.
Action: web_search
Action Input: {"query": "population of Paris"}
Observation: "Paris has a population of approximately 2.1 million."
Thought: I now have all the information I need to answer.
Final Answer: The capital of France is Paris, with a population of
approximately 2.1 million people.
This explicit reasoning trace has several important benefits. It makes the agent's decision-making transparent and debuggable. It helps the model stay on track for complex multi-step tasks. And it allows you to inspect the agent's reasoning to identify where things go wrong when they do.
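Before looking at the full agent, it helps to see how little machinery the trace format actually needs: each step is just labeled lines that can be pulled apart with regular expressions. The sketch below parses a single step using the Action / Action Input / Final Answer labels from the trace shown above; these label names are this chapter's convention, not a universal standard.

```python
# Parsing one step of a ReAct-style trace (sketch). We look for an
# "Action" / "Action Input" pair, or a "Final Answer", in the model's
# latest output. Labels follow the trace format shown above.
import json
import re
from typing import Dict


def parse_react_step(text: str) -> Dict:
    """
    Classify a ReAct output as either a tool action or a final answer.
    Returns {"type": "action", "tool": ..., "args": ...} or
    {"type": "final", "answer": ...} or {"type": "unknown"}.
    """
    # Check for a final answer first: it terminates the loop.
    final = re.search(r"Final Answer:\s*(.+)", text, re.DOTALL)
    if final:
        return {"type": "final", "answer": final.group(1).strip()}
    action = re.search(r"Action:\s*(\w+)", text)
    action_input = re.search(r"Action Input:\s*(\{.*?\})", text, re.DOTALL)
    if action and action_input:
        try:
            args = json.loads(action_input.group(1))
        except json.JSONDecodeError:
            return {"type": "unknown"}
        return {"type": "action", "tool": action.group(1), "args": args}
    return {"type": "unknown"}


if __name__ == "__main__":
    step = (
        "Thought: I need the population of Paris.\n"
        "Action: web_search\n"
        'Action Input: {"query": "population of Paris"}'
    )
    print(parse_react_step(step))
```

Checking for Final Answer first matters: it is the loop's termination condition, and a step that contains both labels should end the loop rather than trigger another tool call.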
Save the following complete ReAct agent implementation as chapter_09_react_agent.py:
# chapter_09_react_agent.py
#
# A complete implementation of the ReAct (Reasoning + Acting) agent
# pattern. This agent explicitly reasons about each step before taking
# action, making its decision process transparent and debuggable.
import re
import json
import math
import datetime
from typing import Any, Dict, List, Optional, Tuple
from llm_client import LLMClient
# ==========================================================================
# TOOL DEFINITIONS
# Each tool is a plain Python function. We keep tools simple and focused:
# each one does exactly one thing and does it well.
# ==========================================================================
def web_search(query: str) -> str:
"""
Simulate a web search (returns canned results for demonstration).
In a real application, this would call a search API like
Brave Search, SerpAPI, or DuckDuckGo. For this tutorial,
we return realistic-looking simulated results so you can
run everything without external API keys.
Args:
query: The search query string.
Returns:
Simulated search results as a formatted string.
"""
results: Dict[str, str] = {
"python programming": (
"Python is a versatile programming language used in web "
"development, data science, AI, and automation. Latest "
"version: Python 3.12 (2023). Creator: Guido van Rossum. "
"First released: 1991."
),
"eiffel tower height": (
"The Eiffel Tower in Paris, France stands 330 meters "
"(1,083 feet) tall including its broadcast antenna. The "
"iron structure itself is 300 meters. Built: 1887-1889. "
"Architect: Gustave Eiffel."
),
"machine learning": (
"Machine learning is a subset of AI where systems learn "
"from data. Key types: supervised learning, unsupervised "
"learning, reinforcement learning. Popular frameworks: "
"TensorFlow, PyTorch, scikit-learn."
),
"openai gpt": (
"GPT (Generative Pre-trained Transformer) is a family of "
"LLMs by OpenAI. GPT-4o is the latest flagship model as "
"of 2024. GPT-4o-mini offers a balance of capability and "
"cost."
)
}
query_lower = query.lower()
for key, result in results.items():
if any(word in query_lower for word in key.split()):
return f"Search results for '{query}':\n{result}"
return (
f"Search results for '{query}': No specific results found. "
f"Try a different query."
)
def calculator(expression: str) -> str:
"""
Evaluate a mathematical expression in a restricted namespace.
Args:
expression: A mathematical expression string,
e.g. "round(0.15 * 847)"
Returns:
The computed result as a string, or an error message.
"""
safe_env: Dict[str, Any] = {
"abs": abs, "round": round, "sqrt": math.sqrt,
"pow": pow, "log": math.log, "log10": math.log10,
"sin": math.sin, "cos": math.cos, "tan": math.tan,
"pi": math.pi, "e": math.e, "ceil": math.ceil,
"floor": math.floor
}
try:
result = eval(expression, {"__builtins__": {}}, safe_env)
return f"Result: {result}"
except Exception as ex:
return f"Calculation error: {str(ex)}"
def get_date_info() -> str:
"""
Return current date and time information.
Returns:
Formatted date and time string with day, week number, and
day of year.
"""
now = datetime.datetime.now()
return (
f"Current date: {now.strftime('%A, %B %d, %Y')}\n"
f"Current time: {now.strftime('%H:%M:%S')}\n"
f"Day of year: {now.timetuple().tm_yday}\n"
f"Week number: {now.isocalendar()[1]}"
)
def text_analyzer(text: str) -> str:
"""
Analyze basic statistics of a text passage.
Args:
text: The text to analyze.
Returns:
A formatted summary of text statistics.
"""
words = text.split()
sentences = re.split(r'[.!?]+', text)
sentences = [s.strip() for s in sentences if s.strip()]
chars = len(text)
unique_words = len(
set(word.lower().strip('.,!?;:') for word in words)
)
avg_words = len(words) / max(len(sentences), 1)
return (
f"Text analysis results:\n"
f" Characters: {chars}\n"
f" Words: {len(words)}\n"
f" Unique words: {unique_words}\n"
f" Sentences: {len(sentences)}\n"
f" Average words per sentence: {avg_words:.1f}"
)
# Registry mapping tool names to their implementations and descriptions.
TOOLS: Dict[str, Dict] = {
"web_search": {
"func": web_search,
"description": (
"Search the web for information. Args: query (string)"
)
},
"calculator": {
"func": calculator,
"description": (
"Evaluate a math expression. Args: expression (string)"
)
},
"get_date_info": {
"func": get_date_info,
"description": (
"Get current date and time information. Args: none"
)
},
"text_analyzer": {
"func": text_analyzer,
"description": (
"Analyze word count, sentences, etc. of a text passage. "
"Args: text (string)"
)
}
}
# ==========================================================================
# THE REACT AGENT
# The system prompt uses Python's str.format() method.
# We use {{}} to produce literal { } braces in the formatted string,
# and {tool_descriptions} as the actual placeholder.
# ==========================================================================
REACT_SYSTEM_PROMPT = (
"You are a helpful AI agent that solves problems step by step.\n\n"
"You have access to the following tools:\n"
"{tool_descriptions}\n\n"
"To use a tool, output your reasoning and action in this EXACT format:\n\n"
"Thought: [your reasoning about what to do next]\n"
"Action: [tool_name]\n"
"Action Input: [the input to the tool, as a JSON object]\n\n"
"After receiving the tool result (shown as 'Observation:'), continue:\n\n"
"Thought: [reasoning about the observation]\n"
"Action: [next tool if needed]\n"
"Action Input: [input as JSON, use {{}} for no arguments]\n\n"
"When you have enough information to answer completely, output:\n\n"
"Thought: I now have all the information needed to answer.\n"
"Final Answer: [your complete, helpful answer to the user]\n\n"
"Rules:\n"
"- Always start with a Thought before taking any Action.\n"
"- Always use valid JSON for Action Input.\n"
"- Never make up information; use tools to find facts.\n"
"- Be thorough but concise in your final answer."
)
class ReActAgent:
"""
An agent implementing the ReAct (Reasoning + Acting) pattern.
The agent maintains a conversation history of its reasoning steps
and iterates through the Think-Act-Observe cycle until it reaches
a final answer or hits the maximum number of steps.
Attributes:
client: The LLM client used for reasoning.
max_steps: Maximum number of think-act-observe cycles.
verbose: Whether to print the agent's reasoning steps.
"""
def __init__(
self,
provider: str = "ollama",
model: Optional[str] = None,
max_steps: int = 8,
verbose: bool = True
):
"""
Initialize the ReAct agent.
Args:
provider: LLM provider to use ("ollama" or "openai").
model: Specific model name, or None for provider default.
max_steps: Maximum reasoning steps before forcing a final
answer.
verbose: If True, print each step of the agent's reasoning.
"""
# Build tool descriptions for the system prompt.
tool_descriptions = "\n".join(
f"- {name}: {info['description']}"
for name, info in TOOLS.items()
)
# Format the system prompt, inserting the tool descriptions.
self.system_prompt = REACT_SYSTEM_PROMPT.format(
tool_descriptions=tool_descriptions
)
self.client = LLMClient(
provider=provider,
model=model,
temperature=0.1 # Low temperature for consistent reasoning
)
self.max_steps = max_steps
self.verbose = verbose
def run(self, question: str) -> str:
"""
Run the agent on a question and return the final answer.
The agent iterates through the ReAct loop: it generates a
thought and action, executes the action, observes the result,
and repeats until it produces a Final Answer.
Args:
question: The user's question or task.
Returns:
The agent's final answer as a string.
"""
if self.verbose:
print(f"\n{'=' * 60}")
print(f"AGENT TASK: {question}")
print('=' * 60)
messages: List[Dict] = [
{"role": "system", "content": self.system_prompt},
{"role": "user", "content": f"Question: {question}"}
]
for step in range(self.max_steps):
if self.verbose:
print(f"\n--- Step {step + 1} ---")
# Ask the LLM to continue the reasoning trace.
response = self.client.chat(messages)
if self.verbose:
print(response)
# Check if the agent has reached a final answer.
if "Final Answer:" in response:
final_answer = self._extract_final_answer(response)
if self.verbose:
print(f"\n{'=' * 60}")
print(f"FINAL ANSWER: {final_answer}")
print('=' * 60)
return final_answer
# Parse the action from the response.
action, action_input = self._parse_action(response)
if action is None:
# The model did not follow the format. Nudge it back.
observation = (
"Format error: Could not parse your action. "
"Please follow the exact format: "
"Thought: ... then Action: [tool_name] then "
"Action Input: {json}"
)
elif action == "Final Answer":
# The model sometimes writes "Final Answer" as the action.
return self._extract_final_answer(response)
elif action not in TOOLS:
observation = (
f"Error: Tool '{action}' does not exist. "
f"Available tools: {list(TOOLS.keys())}"
)
else:
# Execute the tool and capture the observation.
try:
tool_func = TOOLS[action]["func"]
if action_input:
observation = tool_func(**action_input)
else:
observation = tool_func()
except Exception as ex:
observation = f"Tool execution error: {str(ex)}"
if self.verbose:
print(f"\nObservation: {observation}")
# Append the agent's response and the observation to the
# messages list, building up the full reasoning trace.
messages.append({"role": "assistant", "content": response})
messages.append({
"role": "user",
"content": (
f"Observation: {observation}\n\n"
"Continue your reasoning:"
)
})
# If max steps reached, ask for a best-effort final answer.
if self.verbose:
print("\n[Agent] Maximum steps reached. Requesting final answer.")
messages.append({
"role": "user",
"content": (
"You have reached the maximum number of steps. "
"Please provide your best Final Answer based on "
"what you have learned so far."
)
})
response = self.client.chat(messages)
return self._extract_final_answer(response) or response
def _parse_action(self, response: str) -> Tuple[Optional[str], Dict]:
"""
Parse the action name and input from the agent's response.
Args:
response: The raw text from the LLM.
Returns:
A tuple of (action_name, action_input_dict).
Returns (None, {}) if parsing fails.
"""
# Extract the action name from the "Action: ..." line.
action_match = re.search(r"Action:\s*(.+?)(?:\n|$)", response)
if not action_match:
return None, {}
action = action_match.group(1).strip()
# Extract the action input JSON from the "Action Input: ..." line.
input_match = re.search(
r"Action Input:\s*(\{.*?\}|\[\]|\{\})",
response,
re.DOTALL
)
action_input: Dict = {}
if input_match:
try:
action_input = json.loads(input_match.group(1))
except json.JSONDecodeError:
# If JSON parsing fails, wrap the raw value as a fallback.
raw_input = input_match.group(1).strip()
action_input = {"query": raw_input}
return action, action_input
def _extract_final_answer(self, response: str) -> str:
"""
Extract the final answer text from the agent's response.
Args:
response: The raw text containing "Final Answer:".
Returns:
The text after "Final Answer:", stripped of whitespace.
"""
match = re.search(r"Final Answer:\s*(.+)", response, re.DOTALL)
if match:
return match.group(1).strip()
return response.strip()
if __name__ == "__main__":
agent = ReActAgent(provider="ollama", verbose=True)
test_questions = [
"What is 15% of 847, rounded to the nearest whole number?",
(
"Search for information about machine learning and "
"summarize the key types."
),
(
"What day of the week is it today, and what week number "
"of the year is it?"
)
]
for question in test_questions:
answer = agent.run(question)
print(f"\nSummary Answer: {answer}\n")
print("-" * 60)
Run this script and watch the agent's reasoning unfold step by step. You will see it formulate thoughts, choose tools, observe results, and adjust its reasoning based on what it learns. This is the essence of agentic behavior: not just responding to a prompt, but actively pursuing a goal through a sequence of reasoned actions.
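Stripped of prompt engineering and error handling, the control flow of the agent above reduces to a short loop. The following sketch runs entirely offline: a scripted function stands in for the LLM, and the action format is simplified to a single "tool_name|input" line. None of these names come from the chapter's modules; they exist only to make the Think-Act-Observe cycle visible in isolation.

```python
def react_loop(question, llm, tools, max_steps=8):
    # Minimal Think-Act-Observe loop. The real agent parses JSON
    # arguments; here a "tool_name|raw_input" line keeps it short.
    messages = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        response = llm(messages)
        messages.append({"role": "assistant", "content": response})
        if "Final Answer:" in response:
            return response.split("Final Answer:", 1)[1].strip()
        action_line = next(
            line for line in response.splitlines()
            if line.startswith("Action:")
        )
        name, _, arg = action_line[len("Action:"):].strip().partition("|")
        observation = tools[name.strip()](arg.strip())
        messages.append(
            {"role": "user", "content": f"Observation: {observation}"}
        )
    return "Step limit reached."

# A scripted stand-in for the model, so the loop runs without any API.
script = iter([
    "Thought: I need to compute this.\nAction: calculator|2 + 2",
    "Thought: Done.\nFinal Answer: The result is 4.",
])
tools = {"calculator": lambda expr: str(eval(expr, {"__builtins__": {}}, {}))}
print(react_loop("What is 2 + 2?", lambda messages: next(script), tools))
# The result is 4.
```

Everything else in the chapter's implementation, the system prompt, the JSON parsing, the format-error nudges, exists to make a real, less obedient model behave like this scripted one.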
CHAPTER 10: MEMORY SYSTEMS FOR AGENTS
The agents we have built so far have a significant limitation: they forget everything between conversations. Each time you start a new session, the agent has no memory of previous interactions. For many applications, this is fine. For others, such as a personal assistant that should remember your preferences, or a research agent that should build on previous findings, persistent memory is essential.
There are three main types of memory in agentic systems. The first type is in-context memory, which is the conversation history we have been using throughout this tutorial. It is fast and requires no extra infrastructure, but it is limited by the context window size and disappears when the session ends. The second type is external memory, where information is stored in a database outside the model and retrieved when relevant. This is persistent and can scale to enormous amounts of information. The third type is in-weights memory, which is knowledge baked into the model during training or fine-tuning. This is the most permanent form but also the most expensive to update.
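The practical difference between the first two types is easy to demonstrate. In the sketch below (a toy illustration, not part of the tutorial's modules; the filename is arbitrary), the in-context memory is just a Python list that dies with the process, while the external memory survives as a JSON file that a later session can reload.

```python
import json
import os
import tempfile

# In-context memory: the messages list. It lives in RAM and vanishes
# when the process exits or the context window fills up.
in_context = [
    {"role": "user", "content": "My name is Alex."},
    {"role": "assistant", "content": "Nice to meet you, Alex!"},
]

# External memory: the same fact written to durable storage.
path = os.path.join(tempfile.gettempdir(), "memory_demo.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump({"user_name": "Alex"}, f)

# A "new session" can reload the fact even though the original
# conversation history is gone.
with open(path, "r", encoding="utf-8") as f:
    restored = json.load(f)

print(restored["user_name"])  # Alex
```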
For most applications, a combination of in-context and external memory is the right approach. We will implement a simple but effective external memory system using a JSON file as the storage backend. In production, you would replace this with a proper database, and for semantic search over memories, you would use a vector database. The pattern, however, is identical. Save the following as chapter_10_agent_memory.py:
# chapter_10_agent_memory.py
#
# A persistent memory system for LLM agents.
# We implement a simple key-value memory store backed by a JSON file,
# plus a more sophisticated episodic memory that stores and retrieves
# past interactions based on keyword relevance.
import json
import os
import re
import datetime
from typing import Any, Dict, List, Optional, Tuple
from llm_client import LLMClient
class PersistentMemory:
"""
A simple persistent key-value memory store backed by a JSON file.
This allows an agent to save and retrieve specific pieces of
information across multiple sessions. Think of it as the agent's
long-term memory for facts it has been explicitly told to remember.
Attributes:
filepath: Path to the JSON file used for storage.
data: The in-memory dictionary of stored facts.
"""
def __init__(self, filepath: str = "agent_memory.json"):
"""
Initialize the memory store, loading existing data if available.
Args:
filepath: Path to the JSON file for persistent storage.
"""
self.filepath = filepath
self.data: Dict = self._load()
def _load(self) -> Dict:
"""
Load memory from the JSON file.
Returns:
The loaded dictionary, or an empty dict if the file does
not exist or cannot be parsed.
"""
if os.path.exists(self.filepath):
try:
with open(self.filepath, "r", encoding="utf-8") as f:
return json.load(f)
except (json.JSONDecodeError, IOError):
return {}
return {}
def _save(self) -> None:
"""Persist the current in-memory data to the JSON file."""
with open(self.filepath, "w", encoding="utf-8") as f:
json.dump(self.data, f, indent=2, ensure_ascii=False)
def store(self, key: str, value: Any) -> str:
"""
Store a value under a given key.
Args:
key: The memory key (e.g., "user_name", "deadline").
value: The value to store. Must be JSON-serializable.
Returns:
A confirmation message string.
"""
self.data[key] = {
"value": value,
"stored_at": datetime.datetime.now().isoformat()
}
self._save()
return f"Stored: {key} = {value}"
def retrieve(self, key: str) -> Optional[Any]:
"""
Retrieve a value by key.
Args:
key: The memory key to look up.
Returns:
The stored value, or None if the key does not exist.
"""
entry = self.data.get(key)
if entry:
return entry["value"]
return None
def retrieve_all(self) -> Dict:
"""
Return all stored memories as a flat dictionary.
Returns:
A dict mapping keys to their stored values, without
the internal metadata such as timestamps.
"""
return {
key: entry["value"]
for key, entry in self.data.items()
}
def forget(self, key: str) -> str:
"""
Remove a specific memory entry.
Args:
key: The key to remove from memory.
Returns:
A confirmation or error message string.
"""
if key in self.data:
del self.data[key]
self._save()
return f"Forgotten: {key}"
return f"Key '{key}' not found in memory."
def search(self, query: str) -> Dict:
"""
Search memories for entries whose keys or values contain
the query string.
This is a simple substring search. A production system would
use semantic search with embeddings for much better recall
across paraphrased or conceptually similar queries.
Args:
query: The search term to look for.
Returns:
A dict of matching key-value pairs.
"""
query_lower = query.lower()
results: Dict = {}
for key, entry in self.data.items():
key_matches = query_lower in key.lower()
value_matches = query_lower in str(entry["value"]).lower()
if key_matches or value_matches:
results[key] = entry["value"]
return results
def __len__(self) -> int:
return len(self.data)
def __repr__(self) -> str:
return (
f"PersistentMemory(filepath='{self.filepath}', "
f"entries={len(self)})"
)
class EpisodicMemory:
"""
An episodic memory system that stores and retrieves past
conversations.
Episodic memory records what happened during past interactions,
allowing the agent to recall relevant past experiences when faced
with similar situations. This is analogous to human episodic
memory: the ability to remember specific events from the past.
Attributes:
filepath: Path to the JSON file for storage.
episodes: List of stored episode records.
max_episodes: Maximum number of episodes to retain.
"""
def __init__(
self,
filepath: str = "agent_episodes.json",
max_episodes: int = 100
):
"""
Initialize episodic memory.
Args:
filepath: Path to the JSON file for storage.
max_episodes: Maximum episodes to keep. Oldest are
discarded when the limit is exceeded.
"""
self.filepath = filepath
self.max_episodes = max_episodes
self.episodes: List[Dict] = self._load()
def _load(self) -> List[Dict]:
"""
Load episodes from the JSON file.
Returns:
The loaded list of episode dicts, or an empty list if
the file does not exist or cannot be parsed.
"""
if os.path.exists(self.filepath):
try:
with open(self.filepath, "r", encoding="utf-8") as f:
return json.load(f)
except (json.JSONDecodeError, IOError):
return []
return []
def _save(self) -> None:
"""Persist the current episodes list to the JSON file."""
with open(self.filepath, "w", encoding="utf-8") as f:
json.dump(self.episodes, f, indent=2, ensure_ascii=False)
def record(
self,
user_query: str,
agent_response: str,
tools_used: Optional[List[str]] = None
) -> None:
"""
Record a new episode (a completed interaction).
Args:
user_query: The user's original question or request.
agent_response: The agent's final response.
tools_used: List of tool names used during the
interaction. Defaults to empty list.
"""
episode: Dict = {
"timestamp": datetime.datetime.now().isoformat(),
"user_query": user_query,
"agent_response": agent_response,
"tools_used": tools_used or [],
"keywords": self._extract_keywords(
user_query + " " + agent_response
)
}
self.episodes.append(episode)
# Trim to max_episodes if needed, keeping the most recent.
if len(self.episodes) > self.max_episodes:
self.episodes = self.episodes[-self.max_episodes:]
self._save()
def _extract_keywords(self, text: str) -> List[str]:
"""
Extract simple keywords from text for later retrieval.
In production, you would use embeddings for semantic search.
This simple approach filters words by length to remove common
short stop words.
Args:
text: The text to extract keywords from.
Returns:
A list of unique significant words from the text.
"""
# Extract words of 4+ characters (removes most stop words).
words = re.findall(r'\b[a-zA-Z]{4,}\b', text.lower())
# Additional stop words to filter out.
stop_words = {
"this", "that", "with", "have", "from", "they", "will",
"been", "were", "what", "when", "where", "which", "your",
"their", "there", "about", "would", "could", "should",
"more", "also", "than", "then", "some", "into", "over"
}
keywords = [w for w in words if w not in stop_words]
# Return unique keywords only.
return list(set(keywords))
def recall(self, query: str, top_k: int = 3) -> List[Dict]:
"""
Retrieve the most relevant past episodes for a given query.
Relevance is measured by keyword overlap between the query
and stored episode keywords. This is a simple but effective
approach for small episode stores.
Args:
query: The current query to find relevant past episodes for.
top_k: Maximum number of episodes to return.
Returns:
A list of the most relevant episode records, sorted by
relevance score in descending order.
"""
query_keywords = set(self._extract_keywords(query))
if not query_keywords or not self.episodes:
return []
# Score each episode by keyword overlap with the query.
scored_episodes: List[Tuple[int, Dict]] = []
for episode in self.episodes:
episode_keywords = set(episode.get("keywords", []))
overlap = len(query_keywords & episode_keywords)
if overlap > 0:
scored_episodes.append((overlap, episode))
# Sort by score descending and return the top_k results.
scored_episodes.sort(key=lambda x: x[0], reverse=True)
return [ep for _, ep in scored_episodes[:top_k]]
def format_for_context(self, episodes: List[Dict]) -> str:
"""
Format recalled episodes as a string for inclusion in the
LLM's prompt context.
Args:
episodes: List of episode records from recall().
Returns:
A formatted string summarizing past relevant interactions.
"""
if not episodes:
return "No relevant past interactions found."
lines: List[str] = ["Relevant past interactions:"]
for i, ep in enumerate(episodes, 1):
# Use only the date portion of the ISO timestamp.
timestamp = ep["timestamp"][:10]
lines.append(f"\n[Past interaction {i} - {timestamp}]")
lines.append(f"User asked: {ep['user_query'][:200]}")
lines.append(
f"Agent responded: {ep['agent_response'][:300]}"
)
return "\n".join(lines)
class MemoryEnhancedAgent:
"""
An agent with both persistent key-value memory and episodic memory.
This agent can remember specific facts across sessions (persistent
memory) and recall relevant past interactions when answering new
questions (episodic memory). It also automatically extracts and
stores personal facts mentioned by the user.
"""
def __init__(self, provider: str = "ollama"):
"""
Initialize the memory-enhanced agent.
Args:
provider: The LLM provider to use ("ollama" or "openai").
"""
self.client = LLMClient(provider=provider, temperature=0.3)
self.persistent_memory = PersistentMemory("agent_facts.json")
self.episodic_memory = EpisodicMemory("agent_episodes.json")
def run(self, user_query: str) -> str:
"""
Process a query using both persistent and episodic memory.
This method enriches the system prompt with relevant memories
before calling the LLM, then records the interaction afterward.
Args:
user_query: The user's question or request.
Returns:
The agent's response as a string.
"""
# Recall relevant past episodes for context.
past_episodes = self.episodic_memory.recall(user_query, top_k=2)
episode_context = self.episodic_memory.format_for_context(
past_episodes
)
# Retrieve all persistent facts for context.
known_facts = self.persistent_memory.retrieve_all()
facts_context = (
"Known facts about the user:\n"
+ json.dumps(known_facts, indent=2)
if known_facts
else "No specific facts stored about the user yet."
)
# Build a context-rich system prompt that includes both
# persistent facts and relevant episodic memories.
system_prompt = (
"You are a helpful personal assistant with memory.\n\n"
f"{facts_context}\n\n"
f"{episode_context}\n\n"
"When the user shares personal information (name, "
"preferences, goals, etc.), acknowledge it warmly. "
"When answering questions, use your memory of past "
"interactions to provide personalized, contextually "
"aware responses.\n\n"
"If the user asks you to remember something, confirm "
"that you will. If the user asks what you remember, "
"tell them based on the known facts above."
)
response = self.client.complete(
prompt=user_query,
system=system_prompt
)
# Attempt to automatically extract and store any personal
# facts the user mentioned in their query. This runs as a
# separate LLM call and silently skips on failure.
self._auto_extract_facts(user_query)
# Record this interaction in episodic memory for future recall.
self.episodic_memory.record(
user_query=user_query,
agent_response=response
)
return response
def _auto_extract_facts(self, user_query: str) -> None:
"""
Attempt to automatically extract and store facts from user input.
This uses the LLM to identify personal information in the
user's message and stores it in persistent memory. It makes
one additional LLM call per user message, so it doubles the
API usage. In production, you might batch this or run it
asynchronously.
Args:
user_query: The user's message to analyze for facts.
"""
extraction_prompt = (
"Analyze this message and extract any personal facts the "
"user is sharing about themselves.\n"
"Return a JSON object with fact_key: fact_value pairs, "
"or an empty object {} if none are present.\n\n"
"Only extract clear, explicit facts (name, age, location, "
"job, preferences, goals). Do not infer or guess. "
"Use snake_case for keys.\n\n"
f'Message: "{user_query}"\n\n'
"Return only the JSON object, nothing else:"
)
raw = self.client.complete(
prompt=extraction_prompt,
system=(
"You are a precise fact extraction system. "
"Return only valid JSON."
)
)
# Try to parse and store the extracted facts silently.
try:
match = re.search(r"\{.*\}", raw, re.DOTALL)
if match:
facts: Dict = json.loads(match.group(0))
for key, value in facts.items():
if key and value: # Skip empty keys or values
self.persistent_memory.store(key, value)
except (json.JSONDecodeError, AttributeError):
pass # Silently skip if extraction fails
if __name__ == "__main__":
agent = MemoryEnhancedAgent(provider="ollama")
print("Memory-Enhanced Agent Demo")
print(
"The agent will remember information across interactions."
)
print("-" * 60)
# Simulate a multi-turn conversation to demonstrate memory.
interactions = [
"Hi! My name is Alex and I'm a software engineer.",
"I'm working on a project involving machine learning.",
"What do you know about me so far?",
"Can you suggest some resources for my ML project?"
]
for user_input in interactions:
print(f"\nUser: {user_input}")
response = agent.run(user_input)
print(f"Agent: {response}")
print()
print("Persistent memory contents:")
print(json.dumps(agent.persistent_memory.retrieve_all(), indent=2))
The memory system above is deliberately simple so you can understand every part of it. In a production system, you would replace the JSON file backend with a proper database like PostgreSQL or Redis, and you would replace the keyword-based search with semantic search using vector embeddings. The embeddings approach converts text into numerical vectors and finds memories that are semantically similar to the current query, even if they use different words. Libraries like ChromaDB, Pinecone, and Weaviate make this straightforward to implement once you understand the pattern shown here.
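To make the embeddings idea concrete before reaching for a library, here is a toy version of similarity-based recall. A real system would obtain its vectors from an embedding model; here a bag-of-words count vector stands in for a learned embedding, which is enough to show the retrieval pattern: embed the query, score every stored memory by cosine similarity, and return the top matches. All names below are illustrative.

```python
import math
import re
from collections import Counter

def toy_embed(text):
    # Stand-in for a learned embedding: a bag-of-words count vector.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine_similarity(a, b):
    # Standard cosine similarity over sparse count vectors.
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def recall(query, memories, top_k=2):
    # Rank stored memories by vector similarity to the query.
    q = toy_embed(query)
    scored = sorted(
        memories,
        key=lambda m: cosine_similarity(q, toy_embed(m)),
        reverse=True,
    )
    return scored[:top_k]

memories = [
    "The user is a software engineer named Alex.",
    "The user is building a machine learning project.",
    "The user prefers short answers.",
]
print(recall("What ML resources suit my project?", memories, top_k=1))
```

Swapping `toy_embed` for a call to a real embedding model turns this into genuine semantic search, because learned embeddings place paraphrases near each other even when they share no words; the scoring and ranking code stays exactly the same.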
CHAPTER 11: A COMPLETE AGENTIC APPLICATION
We have now covered all the building blocks. In this final chapter, we will assemble them into a complete, production-quality agentic application: a Research Assistant that can search for information, perform calculations, maintain memory across sessions, and produce well-structured reports. This application demonstrates how all the concepts from previous chapters work together in a coherent system.
The architecture of our Research Assistant looks like this:
          +------------------+
          |  User Interface  |   (command-line in this tutorial)
          +--------+---------+
                   |
                   v
          +------------------+
          |  Research Agent  |   (ReAct loop with memory)
          +--------+---------+
                   |
 +--------+--------+--------+--------+
 |        |        |        |        |
 v        v        v        v        v
Search  Calc    Memory  Summarize Format
 Tool   Tool    System     Tool     Tool
 |        |        |        |        |
 +--------+--------+--------+--------+
                   |
                   v
          +------------------+
          |   LLM Backend    |   (Ollama local or OpenAI remote)
          +------------------+
The application is designed so that switching between local and remote LLMs requires changing a single configuration value at the top of the file. Save the following as chapter_11_research_assistant.py:
# chapter_11_research_assistant.py
#
# A complete Research Assistant application combining:
# - ReAct agent loop for multi-step reasoning
# - Persistent and episodic memory
# - Multiple specialized research tools
# - Configurable LLM backend (local or remote)
# - Clean, modular architecture
#
# This is the capstone example of the tutorial. Study how all
# the pieces from previous chapters fit together here.
import json
import re
import math
import datetime
import os
from typing import Any, Dict, List, Optional, Tuple
from llm_client import LLMClient
from chapter_10_agent_memory import PersistentMemory, EpisodicMemory
# ==========================================================================
# CONFIGURATION
# Change PROVIDER to switch between "ollama" and "openai".
# Change MODEL to use a specific model within the same provider.
# Setting MODEL to None uses the provider's configured default.
# ==========================================================================
PROVIDER = "ollama" # Change to "openai" for the remote model
MODEL = None # None uses the provider's default model
# ==========================================================================
# RESEARCH TOOLS
# Each tool is a focused, well-documented function that does one thing.
# This follows the single-responsibility principle from clean architecture.
# ==========================================================================
def research_search(topic: str, depth: str = "overview") -> str:
"""
Search for research information on a topic.
Args:
topic: The research topic to search for.
depth: "overview" for a brief summary, "detailed" for more
comprehensive information.
Returns:
Research information as a formatted string.
"""
knowledge_base: Dict[str, Dict[str, str]] = {
"artificial intelligence": {
"overview": (
"Artificial Intelligence (AI) is the simulation of human "
"intelligence in machines. Key subfields include machine "
"learning, natural language processing, computer vision, "
"and robotics. AI applications range from recommendation "
"systems to autonomous vehicles."
),
"detailed": (
"AI encompasses several major paradigms: symbolic AI "
"(rule-based systems), connectionist AI (neural networks), "
"and hybrid approaches. Modern AI is dominated by deep "
"learning, which uses multi-layer neural networks trained "
"on large datasets. Key milestones include Deep Blue (1997), "
"AlphaGo (2016), GPT-3 (2020), and ChatGPT (2022). Current "
"challenges include alignment, interpretability, and energy "
"consumption."
)
},
"climate change": {
"overview": (
"Climate change refers to long-term shifts in global "
"temperatures and weather patterns. Since the 1800s, human "
"activities have been the main driver, primarily through "
"burning fossil fuels. Effects include rising sea levels, "
"extreme weather events, and biodiversity loss."
),
"detailed": (
"The IPCC reports that global temperatures have risen "
"approximately 1.1 degrees Celsius above pre-industrial "
"levels. Key greenhouse gases include CO2, methane, and "
"nitrous oxide. Mitigation strategies include renewable "
"energy transition, carbon capture, and reforestation. "
"The Paris Agreement (2015) aims to limit warming to "
"1.5-2 degrees Celsius."
)
},
"quantum computing": {
"overview": (
"Quantum computing uses quantum mechanical phenomena like "
"superposition and entanglement to process information. "
"Unlike classical bits (0 or 1), quantum bits (qubits) can "
"exist in multiple states simultaneously, enabling certain "
"calculations to be performed exponentially faster."
),
"detailed": (
"Quantum computers excel at specific problems: factoring "
"large numbers (Shor's algorithm), searching unsorted "
"databases (Grover's algorithm), and simulating quantum "
"systems. Current challenges include decoherence, error "
"rates, and scaling. Major players include IBM, Google, "
"IonQ, and Rigetti. Google claimed 'quantum supremacy' in "
"2019 with their Sycamore processor."
)
}
}
topic_lower = topic.lower()
for key, content in knowledge_base.items():
if any(word in topic_lower for word in key.split()):
return content.get(depth, content["overview"])
available = ", ".join(knowledge_base.keys())
return (
f"No detailed research found for '{topic}'. "
f"Available topics: {available}. "
f"Try rephrasing your query."
)
def calculate_statistics(numbers_str: str) -> str:
"""
Calculate basic statistics for a list of numbers.
Args:
numbers_str: A comma-separated string of numbers,
e.g. "1, 2, 3, 4, 5"
Returns:
A formatted string with the statistical summary, or an error
message if the input cannot be parsed.
"""
try:
numbers = [float(n.strip()) for n in numbers_str.split(",")]
if not numbers:
return "Error: No numbers provided."
n = len(numbers)
total = sum(numbers)
mean = total / n
sorted_nums = sorted(numbers)
# Calculate median based on whether n is odd or even.
if n % 2 == 0:
median = (sorted_nums[n // 2 - 1] + sorted_nums[n // 2]) / 2
else:
median = sorted_nums[n // 2]
# Calculate population standard deviation.
variance = sum((x - mean) ** 2 for x in numbers) / n
std_dev = math.sqrt(variance)
return (
f"Statistics for {n} numbers:\n"
f" Sum: {total:.4f}\n"
f" Mean: {mean:.4f}\n"
f" Median: {median:.4f}\n"
f" Min: {min(numbers):.4f}\n"
f" Max: {max(numbers):.4f}\n"
f" Std Dev: {std_dev:.4f}\n"
f" Range: {max(numbers) - min(numbers):.4f}"
)
except ValueError as ex:
return f"Error parsing numbers: {str(ex)}"
def generate_outline(topic: str, sections: int = 5) -> str:
"""
Generate a structured outline for a research report.
Args:
topic: The report topic.
sections: Number of main sections to include (max 8).
Returns:
A formatted outline as a string.
"""
section_templates = [
"Executive Summary",
"Introduction and Background",
"Current State of the Field",
"Key Findings and Analysis",
"Challenges and Limitations",
"Future Directions and Opportunities",
"Conclusions and Recommendations",
"References and Further Reading"
]
# Clamp sections to the available templates.
num_sections = min(max(sections, 1), len(section_templates))
selected_sections = section_templates[:num_sections]
lines: List[str] = [
f"Research Report Outline: {topic}",
"=" * 50
]
for i, section in enumerate(selected_sections, 1):
lines.append(f"{i}. {section}")
lines.append(f" - Key points for this section")
lines.append(f" - Supporting evidence and examples")
return "\n".join(lines)
def format_report(
title: str,
content: str,
format_type: str = "plain"
) -> str:
"""
Format research content into a structured report.
Args:
title: The report title.
content: The main content of the report.
format_type: "plain" for plain text, "structured" for a
more decorated layout.
Returns:
A formatted report string.
"""
timestamp = datetime.datetime.now().strftime("%B %d, %Y")
separator = "=" * 60
if format_type == "structured":
return (
f"{separator}\n"
f"RESEARCH REPORT\n"
f"Title: {title}\n"
f"Date: {timestamp}\n"
f"{separator}\n\n"
f"{content}\n\n"
f"{separator}\n"
f"End of Report\n"
f"{separator}"
)
else:
return (
f"Report: {title}\n"
f"Date: {timestamp}\n\n"
f"{content}"
)
# ==========================================================================
# THE RESEARCH ASSISTANT AGENT
# ==========================================================================
# Registry of all tools available to the Research Assistant.
RESEARCH_TOOLS: Dict[str, Dict] = {
"research_search": {
"func": research_search,
"description": (
"Search for research information on a topic. "
"Args: topic (string), depth ('overview' or 'detailed')"
)
},
"calculate_statistics": {
"func": calculate_statistics,
"description": (
"Calculate statistics (mean, median, std dev, etc.) for a "
"list of numbers. "
"Args: numbers_str (comma-separated numbers as a string)"
)
},
"generate_outline": {
"func": generate_outline,
"description": (
"Generate a structured outline for a research report. "
"Args: topic (string), sections (integer, default 5)"
)
},
"format_report": {
"func": format_report,
"description": (
"Format research content into a structured report. "
"Args: title (string), content (string), "
"format_type ('plain' or 'structured')"
)
}
}
# The system prompt for the Research Assistant.
# Uses str.format() so {tool_descriptions} is the only placeholder.
RESEARCH_SYSTEM_PROMPT = (
"You are an expert Research Assistant with access to research tools.\n\n"
"Your goal is to help users research topics, analyze information, "
"and produce well-structured reports. You approach every task "
"methodically:\n\n"
"1. Understand exactly what the user needs.\n"
"2. Search for relevant information using the research_search tool.\n"
"3. Analyze and synthesize the information.\n"
"4. Structure your findings clearly.\n"
"5. Produce a comprehensive, accurate response.\n\n"
"Available tools:\n"
"{tool_descriptions}\n\n"
"Follow this EXACT format for tool use:\n\n"
"Thought: [your reasoning]\n"
"Action: [tool_name]\n"
"Action Input: {{\"param\": \"value\"}}\n\n"
"After receiving an Observation, continue reasoning. When done:\n\n"
"Thought: I have gathered sufficient information.\n"
"Final Answer: [your comprehensive response]\n\n"
"Be thorough, accurate, and always base your answers on what "
"you found using the tools."
)
class ResearchAssistant:
"""
A complete Research Assistant with memory and multi-step reasoning.
This is the capstone class of the tutorial, combining all concepts:
the ReAct agent loop, persistent memory, episodic memory, and
multiple specialized research tools.
Attributes:
client: The LLM client for reasoning.
system_prompt: The formatted system prompt with tool info.
persistent_memory: Long-term key-value fact storage.
episodic_memory: Storage and retrieval of past interactions.
verbose: Whether to print reasoning steps.
max_steps: Maximum ReAct loop iterations.
"""
def __init__(
self,
provider: str = PROVIDER,
model: Optional[str] = MODEL,
verbose: bool = True
):
"""
Initialize the Research Assistant.
Args:
provider: LLM provider ("ollama" or "openai").
model: Specific model name, or None for provider default.
verbose: Whether to print reasoning steps during research.
"""
# Build tool descriptions and format the system prompt.
tool_descriptions = "\n".join(
f"- {name}: {info['description']}"
for name, info in RESEARCH_TOOLS.items()
)
self.system_prompt = RESEARCH_SYSTEM_PROMPT.format(
tool_descriptions=tool_descriptions
)
self.client = LLMClient(
provider=provider,
model=model,
temperature=0.2, # Low temperature for research accuracy
max_tokens=2048
)
# Initialize both memory systems with dedicated storage files.
self.persistent_memory = PersistentMemory("research_facts.json")
self.episodic_memory = EpisodicMemory("research_episodes.json")
self.verbose = verbose
self.max_steps = 8
def research(self, query: str) -> str:
"""
Conduct research on a query using the full agent loop.
This method enriches the query with relevant memories, runs
the ReAct loop to gather information using tools, records the
session in episodic memory, and returns the final answer.
Args:
query: The research question or task.
Returns:
The agent's comprehensive research response.
"""
if self.verbose:
print(f"\n{'=' * 60}")
print(f"RESEARCH QUERY: {query}")
print('=' * 60)
# Enrich the query with relevant memories from past sessions.
past_episodes = self.episodic_memory.recall(query, top_k=2)
memory_context = ""
if past_episodes:
memory_context = (
"\n\nRelevant past research sessions:\n"
+ self.episodic_memory.format_for_context(past_episodes)
)
known_facts = self.persistent_memory.retrieve_all()
if known_facts:
memory_context += (
f"\n\nKnown user context: {json.dumps(known_facts)}"
)
# Append memory context to the query if any exists.
enriched_query = query
if memory_context:
enriched_query = (
f"{query}\n\n[Context from memory:{memory_context}]"
)
# Prepare the initial message list for the ReAct loop.
messages: List[Dict] = [
{"role": "system", "content": self.system_prompt},
{"role": "user", "content": enriched_query}
]
tools_used: List[str] = []
final_answer: Optional[str] = None
# Run the ReAct loop.
for step in range(self.max_steps):
if self.verbose:
print(f"\n--- Reasoning Step {step + 1} ---")
response = self.client.chat(messages)
if self.verbose:
# Truncate long responses for display clarity.
display = response[:400]
print(display)
if len(response) > 400:
print("... [truncated for display]")
# Check for a final answer in the response.
if "Final Answer:" in response:
final_answer = self._extract_section(
response, "Final Answer:"
)
break
# Parse and execute the tool call.
action, action_input = self._parse_action(response)
if action is None:
observation = (
"Format error: Please use the exact "
"Thought/Action/Action Input format."
)
elif action not in RESEARCH_TOOLS:
observation = (
f"Unknown tool '{action}'. "
f"Available: {list(RESEARCH_TOOLS.keys())}"
)
else:
tools_used.append(action)
tool_func = RESEARCH_TOOLS[action]["func"]
try:
observation = tool_func(**action_input)
if self.verbose:
print(f"\nObservation: {observation[:200]}...")
except TypeError as ex:
observation = (
f"Tool call error (wrong arguments): {str(ex)}"
)
except Exception as ex:
observation = f"Tool error: {str(ex)}"
# Append to conversation and continue the loop.
messages.append({"role": "assistant", "content": response})
messages.append({
"role": "user",
"content": (
f"Observation: {observation}\n\n"
"Continue your research:"
)
})
# If no final answer was produced within max_steps, request one.
if final_answer is None:
messages.append({
"role": "user",
"content": (
"Please provide your Final Answer based on the "
"research conducted so far."
)
})
response = self.client.chat(messages)
final_answer = (
self._extract_section(response, "Final Answer:")
or response
)
# Record this research session in episodic memory.
self.episodic_memory.record(
user_query=query,
# Store only a summary to keep the episode file manageable.
agent_response=final_answer[:500],
tools_used=tools_used
)
if self.verbose:
print(f"\n{'=' * 60}")
print("RESEARCH COMPLETE")
print(f"Tools used: {tools_used}")
print('=' * 60)
return final_answer
def remember(self, key: str, value: str) -> str:
"""
Store a fact in persistent memory.
Args:
key: The fact key (e.g., "user_role").
value: The fact value (e.g., "researcher").
Returns:
A confirmation message from the memory store.
"""
return self.persistent_memory.store(key, value)
def _parse_action(self, response: str) -> Tuple[Optional[str], Dict]:
"""
Parse the action name and input from the agent's response.
Args:
response: The raw text from the LLM.
Returns:
A tuple of (action_name, action_input_dict).
Returns (None, {}) if no action can be parsed.
"""
# Extract the action name from the "Action: ..." line.
action_match = re.search(r"Action:\s*(.+?)(?:\n|$)", response)
if not action_match:
return None, {}
action = action_match.group(1).strip()
# Extract the action input JSON from the "Action Input: ..." line.
input_match = re.search(
r"Action Input:\s*(\{.*?\})",
response,
re.DOTALL
)
action_input: Dict = {}
if input_match:
try:
action_input = json.loads(input_match.group(1))
except json.JSONDecodeError:
action_input = {}
return action, action_input
def _extract_section(
self,
text: str,
marker: str
) -> Optional[str]:
"""
Extract text following a specific marker string.
Args:
text: The full text to search within.
marker: The marker string to find (e.g., "Final Answer:").
Returns:
The text after the marker, stripped of whitespace, or None
if the marker is not found.
"""
idx = text.find(marker)
if idx != -1:
return text[idx + len(marker):].strip()
return None
# ==========================================================================
# MAIN APPLICATION ENTRY POINT
# ==========================================================================
def main() -> None:
"""
Run the Research Assistant in interactive mode.
The user can ask research questions, and the assistant will
use its tools and memory to provide comprehensive answers.
Special commands: 'quit' to exit, 'memory' to see stored facts.
"""
print("=" * 60)
print("RESEARCH ASSISTANT")
print(f"Powered by: {PROVIDER.upper()}")
print("Type 'quit' to exit, 'memory' to see stored facts")
print("=" * 60)
assistant = ResearchAssistant(provider=PROVIDER, verbose=True)
# Pre-store some user context to demonstrate persistent memory.
assistant.remember("user_role", "researcher")
assistant.remember("preferred_detail_level", "detailed")
while True:
try:
query = input("\nResearch Query: ").strip()
except (KeyboardInterrupt, EOFError):
print("\nGoodbye!")
break
if not query:
continue
if query.lower() == "quit":
print("Goodbye!")
break
if query.lower() == "memory":
facts = assistant.persistent_memory.retrieve_all()
print("\nStored Facts:")
print(json.dumps(facts, indent=2))
continue
# Conduct the research and display the result.
answer = assistant.research(query)
print(f"\nRESEARCH ANSWER:\n{answer}")
if __name__ == "__main__":
main()
This Research Assistant demonstrates the full power of what we have built. It combines the clean LLM client abstraction from Chapter 4, the prompt engineering principles from Chapter 5, the conversation management from Chapter 6, the structured tool use from Chapter 8, the ReAct reasoning loop from Chapter 9, and the memory systems from Chapter 10. Each component is independently testable and replaceable, which is the hallmark of good software architecture.
CHAPTER 12: WHERE TO GO FROM HERE
You have now built a solid foundation in LLM application development and agentic AI. You understand how language models work conceptually, how to communicate with them via API, how to engineer effective prompts, how to maintain conversation state, how to extract structured data, how to give models access to tools, how to build reasoning agents, and how to add persistent memory. These are the core skills that underpin virtually every LLM application in production today.
The natural next steps from here branch in several directions, each of which is a rich area of study in its own right.
The first direction is Retrieval-Augmented Generation, commonly known as RAG. RAG is the technique of connecting an LLM to a large corpus of documents, such as a company's internal knowledge base, so that the model can retrieve and cite relevant passages when answering questions. This solves the fundamental limitation that LLMs can only know what was in their training data. The key components are a document chunking strategy, an embedding model to convert text into vectors, a vector database to store and search those vectors, and a retrieval step that runs before the LLM generates its response. Libraries like LangChain, LlamaIndex, and ChromaDB make this straightforward to implement.
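The retrieval step described above can be sketched with nothing but the standard library. This is only a toy: the `embed` function below uses bag-of-words term counts in place of a real embedding model, and a plain list stands in for a vector database. The function names are illustrative, not part of any library.

```python
import math
from collections import Counter

# Toy "embedding": bag-of-words term counts. A real RAG system would
# use a trained embedding model and a vector database instead.
def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, chunks: list, top_k: int = 2) -> list:
    """Rank document chunks by similarity to the query."""
    query_vec = embed(query)
    scored = [(cosine_similarity(query_vec, embed(c)), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for score, chunk in scored[:top_k] if score > 0]

chunks = [
    "Qubits can exist in superposition of states.",
    "The Paris Agreement aims to limit global warming.",
    "Neural networks are trained on large datasets.",
]
context = retrieve("How are neural networks trained?", chunks, top_k=1)
# The retrieved chunks would be prepended to the LLM prompt as context.
```

The production pipeline has the same shape: embed the query, find the nearest stored chunks, and inject them into the prompt before generation.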
The second direction is multi-agent systems. Instead of a single agent with multiple tools, you can build systems where multiple specialized agents collaborate. One agent might specialize in research, another in writing, and a third in fact-checking. A coordinator agent routes tasks to the appropriate specialist. This architecture scales to complex, long-horizon tasks that would overwhelm a single agent. The AutoGen library from Microsoft and the CrewAI framework are excellent starting points for multi-agent systems.
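The coordinator pattern can be sketched as follows. Here the "agents" are plain functions and routing is keyword-based; in a real system each specialist would be its own LLM-backed agent, and the coordinator would itself use an LLM to choose the route. All names here are illustrative.

```python
# Specialist "agents" as stand-in functions.
def research_agent(task: str) -> str:
    return f"[research] gathered sources for: {task}"

def writing_agent(task: str) -> str:
    return f"[writing] drafted text for: {task}"

def fact_check_agent(task: str) -> str:
    return f"[fact-check] verified claims in: {task}"

# Keyword-to-specialist routing table.
SPECIALISTS = {
    "research": research_agent,
    "write": writing_agent,
    "draft": writing_agent,
    "verify": fact_check_agent,
    "check": fact_check_agent,
}

def coordinator(task: str) -> str:
    """Route a task to the first specialist whose keyword it mentions."""
    task_lower = task.lower()
    for keyword, agent in SPECIALISTS.items():
        if keyword in task_lower:
            return agent(task)
    return research_agent(task)  # default specialist

result = coordinator("Please draft an introduction about qubits")
```

Replacing the keyword lookup with an LLM call that returns a specialist name is the usual next step, and it is exactly the structured-output problem from Chapter 7.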
The third direction is fine-tuning. While prompt engineering can accomplish a great deal, there are tasks where you need the model itself to behave differently than its base training allows. Fine-tuning involves training the model on a curated dataset of examples that demonstrate the behavior you want. With techniques like LoRA (Low-Rank Adaptation), you can fine-tune even large models on consumer hardware. The Hugging Face ecosystem, particularly the transformers and peft libraries, is the standard toolkit for this work.
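The arithmetic behind LoRA can be shown with toy pure-Python matrices. Instead of updating a large weight matrix W (d by d), LoRA trains two small matrices B (d by r) and A (r by d) with rank r much smaller than d, and uses W' = W + BA at inference time. This sketch shows only that arithmetic, not the actual peft or transformers API.

```python
# Minimal matrix helpers (lists of lists) so the example is self-contained.
def matmul(X, Y):
    rows, inner, cols = len(X), len(Y), len(Y[0])
    return [[sum(X[i][k] * Y[k][j] for k in range(inner))
             for j in range(cols)] for i in range(rows)]

def matadd(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

d, r = 4, 1  # full dimension vs. adapter rank
# W is the frozen pretrained weight (identity here for clarity).
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
B = [[0.1] for _ in range(d)]   # d x r, trainable
A = [[1.0, 2.0, 3.0, 4.0]]      # r x d, trainable

delta = matmul(B, A)            # rank-1 update, d x d
W_adapted = matadd(W, delta)    # effective weight W' = W + B @ A

# Trainable parameters: d*r + r*d = 8, versus d*d = 16 for a full
# update; the savings grow dramatically as d reaches thousands.
```

The peft library applies exactly this decomposition to selected layers of a transformer while keeping the original weights frozen.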
The fourth direction is evaluation and testing. As you build more complex agentic systems, you need rigorous ways to measure whether they are working correctly. This is a surprisingly deep problem: how do you evaluate whether an agent's reasoning is sound, or whether its responses are accurate and helpful? Frameworks like RAGAS (for RAG evaluation), LangSmith (for tracing and evaluation), and custom LLM-as-judge approaches are the current state of the art.
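A minimal evaluation harness might look like the sketch below. Each test case pairs a query with keywords a correct answer should mention; a real setup might instead send each (query, answer) pair to a stronger model acting as a judge. The `dummy_agent` here is a stand-in for a real agent.

```python
def evaluate(agent_fn, test_cases) -> float:
    """Score an agent: fraction of cases where all keywords appear."""
    passed = 0
    for case in test_cases:
        answer = agent_fn(case["query"]).lower()
        if all(kw.lower() in answer for kw in case["expected_keywords"]):
            passed += 1
    return passed / len(test_cases)

def dummy_agent(query: str) -> str:
    # Stand-in for a real agent; always answers about qubits.
    return "Qubits use superposition and entanglement."

cases = [
    {"query": "What do qubits use?",
     "expected_keywords": ["superposition", "entanglement"]},
    {"query": "What is the Paris Agreement?",
     "expected_keywords": ["climate"]},
]
score = evaluate(dummy_agent, cases)  # 0.5: one of two cases passes
```

Even this crude keyword check catches regressions when you change a prompt or swap models, which is the core discipline the evaluation frameworks formalize.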
The fifth direction is production deployment. The code in this tutorial runs locally and is optimized for learning. A production system needs to handle concurrent users, manage costs, implement rate limiting, log interactions for debugging, monitor for failures, and ensure security. FastAPI is an excellent framework for building LLM-powered APIs, and platforms like Render, Railway, and AWS make deployment straightforward.
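One concern from that list, rate limiting, can be sketched with a token bucket in pure Python. A real deployment would typically enforce this in API middleware or a shared store such as Redis rather than in-process state, so treat this as an illustration of the algorithm only.

```python
import time

class TokenBucket:
    """Allow bursts up to `capacity` requests, refilling over time."""

    def __init__(self, capacity: int, refill_per_second: float):
        self.capacity = capacity
        self.tokens = float(capacity)
        self.refill_per_second = refill_per_second
        self.last_refill = time.monotonic()

    def allow_request(self) -> bool:
        """Consume one token if available, refilling based on elapsed time."""
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.last_refill = now
        self.tokens = min(self.capacity,
                          self.tokens + elapsed * self.refill_per_second)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

bucket = TokenBucket(capacity=3, refill_per_second=0.5)
results = [bucket.allow_request() for _ in range(5)]
# The first 3 requests pass; the 4th and 5th are throttled until refill.
```

Keeping one bucket per user ID (in a dictionary, or keyed in Redis) turns this into per-user rate limiting.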
The most important thing you can do right now is to start building. Pick a problem you care about, apply the patterns from this tutorial, and iterate. The field is moving extraordinarily fast, but the foundational concepts we have covered (the API interaction model, prompt engineering, tool use, and the agent loop) are stable and will remain relevant regardless of which specific models or frameworks dominate in the future. The best way to develop intuition for this technology is to use it, break it, fix it, and use it again.
Good luck, and enjoy the journey.
APPENDIX A: QUICK REFERENCE
CREATING AN LLM CLIENT
To connect to Ollama (local), use base_url "http://localhost:11434/v1" and any non-empty string as the api_key. To connect to LM Studio (local), use base_url "http://localhost:1234/v1". To connect to OpenAI (remote), omit base_url and provide a valid api_key from your environment variables.
THE MESSAGES FORMAT
Every chat completion call takes a list of message dictionaries. The system role sets the model's behavior and should always be the first message. The user role contains the human's input. The assistant role contains the model's previous responses and must be included to maintain conversation context across multiple turns.
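A concrete example of the format described above, with the conversation content invented purely for illustration:

```python
# A multi-turn messages list. The assistant turn from earlier in the
# conversation is included so the model can resolve "What about Italy?"
messages = [
    {"role": "system", "content": "You are a concise geography tutor."},
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
    {"role": "user", "content": "What about Italy?"},  # relies on context
]

# Every entry has exactly the keys "role" and "content".
roles = [m["role"] for m in messages]
```

If the assistant turn were omitted, the model would have no way to know what "What about Italy?" refers to.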
TEMPERATURE GUIDE
A temperature of 0.0 produces deterministic, consistent outputs and is best for data extraction, classification, and code generation. A temperature between 0.3 and 0.5 balances consistency with some variation and works well for question answering and summarization. A temperature between 0.7 and 1.0 produces creative, varied outputs and is best for brainstorming, creative writing, and generating diverse options.
THE REACT AGENT LOOP
The ReAct loop consists of four steps that repeat until completion. In the Thought step, the model reasons about the current situation and what to do next. In the Action step, the model specifies a tool to call and its arguments. In the Observation step, your code executes the tool and returns the result to the model. In the Final Answer step, the model produces its response when it has sufficient information to answer the user's question.
KEY LIBRARIES
The openai package is the Python client for the OpenAI API and works with any OpenAI-compatible server including Ollama and LM Studio. The python-dotenv package loads environment variables from a .env file, keeping API keys out of source code. The typing module provides type hint classes like Dict, List, Optional, and Tuple for use in Python 3.9 code.
FILE SUMMARY
The following files are created in this tutorial, in the order they appear:
llm_client.py - the shared LLM client module used by all other files
chapter_03_first_call.py - the first local API call
chapter_03_remote_call.py - the first remote API call
chapter_04_test_client.py - tests for the LLMClient class
chapter_05_prompt_engineering.py - four prompting techniques
chapter_06_chatbot.py - the conversational chatbot
chapter_07_structured_output.py - JSON extraction from LLMs
chapter_07_openai_structured.py - OpenAI's structured output feature
chapter_08_tool_use.py - the manual tool-use agent
chapter_08_openai_tools.py - the OpenAI function-calling agent
chapter_09_react_agent.py - the full ReAct agent
chapter_10_agent_memory.py - the memory systems
chapter_11_research_assistant.py - the complete capstone application
APPENDIX B: TROUBLESHOOTING COMMON ISSUES
CONNECTION REFUSED WHEN CALLING OLLAMA
This means the Ollama server is not running. Open a terminal and run "ollama serve" to start it. On some systems, Ollama starts automatically as a background service after installation. You can verify it is running by visiting http://localhost:11434 in your browser, which should show a simple status page saying "Ollama is running".
MODEL NOT FOUND ERROR
This means you are requesting a model that has not been downloaded yet. Run "ollama pull model_name" to download it, replacing model_name with the model you want to use. Run "ollama list" to see all models that are already downloaded on your system.
AUTHENTICATION ERROR WITH OPENAI
This means your API key is invalid, expired, or not being loaded correctly. Verify that your .env file exists in the same directory as your script, that it contains OPENAI_API_KEY=your_actual_key with no extra spaces, and that you are calling load_dotenv() before accessing the environment variable. Check that your key is active in the OpenAI dashboard at platform.openai.com.
JSON PARSING FAILURES IN STRUCTURED OUTPUT
Smaller local models sometimes fail to produce valid JSON even with explicit instructions. The most reliable fixes are to lower the temperature to 0.0 or 0.1, to provide a concrete example of the expected JSON in your prompt, to add the instruction "Return ONLY the JSON object, no other text" to your system message, and to implement retry logic as shown in Chapter 7.
CONTEXT LENGTH EXCEEDED
This error means the total number of tokens in your messages list exceeds the model's context window. Implement the history trimming pattern from Chapter 6, reduce the max_tokens in your API call, or switch to a model with a larger context window. For local models, check the model's documentation for its context window size.
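The trimming pattern can be sketched as follows. Token counts are approximated by word counts here to keep the example dependency-free; a real implementation would use the model's tokenizer for accuracy.

```python
def trim_history(messages, max_words: int = 50):
    """Keep the system message plus the most recent turns that fit."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    kept, used = [], 0
    for msg in reversed(rest):          # walk newest-first
        words = len(msg["content"].split())
        if used + words > max_words:
            break                       # everything older is dropped too
        kept.insert(0, msg)
        used += words
    return system + kept

history = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "old question " * 30},  # 60 words
    {"role": "user", "content": "recent question"},     # 2 words
]
trimmed = trim_history(history, max_words=50)
# The oversized old turn is dropped; the system and recent turns survive.
```

The key design choice is that the system message is always preserved, since dropping it silently changes the model's behavior mid-conversation.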
AGENT STUCK IN A LOOP
If your agent keeps calling the same tool repeatedly without making progress, add a step counter and a maximum step limit as shown in Chapter 9. Also examine your system prompt to ensure the stopping condition (Final Answer) is clearly defined. Sometimes adding an explicit instruction like "If you have called the same tool twice with the same arguments, stop and provide your best answer with the information you have" can break loops effectively.
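Repeated-call detection can be sketched as a small helper you call once per loop iteration, just before executing the tool. The threshold and function names are illustrative.

```python
import json

def make_loop_detector(max_repeats: int = 2):
    """Return a checker that flags identical repeated tool calls."""
    seen = {}
    def is_looping(action: str, action_input: dict) -> bool:
        # JSON with sorted keys gives a stable signature for the call.
        signature = action + ":" + json.dumps(action_input, sort_keys=True)
        seen[signature] = seen.get(signature, 0) + 1
        return seen[signature] > max_repeats
    return is_looping

is_looping = make_loop_detector(max_repeats=2)
calls = [
    ("research_search", {"topic": "AI"}),
    ("research_search", {"topic": "AI"}),
    ("research_search", {"topic": "AI"}),  # third identical call: loop
]
flags = [is_looping(a, i) for a, i in calls]
```

When the checker fires, the agent loop can substitute an observation such as "You already have this result; provide your Final Answer" instead of executing the tool again.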
IMPORT ERRORS BETWEEN CHAPTERS
Some chapter files import from other chapter files. For example, chapter_08_openai_tools.py imports from chapter_08_tool_use.py, and chapter_11_research_assistant.py imports from chapter_10_agent_memory.py. Make sure all files are saved in the same directory before running them. If you rename files, update the import statements accordingly.
SLOW RESPONSES FROM LOCAL MODELS
Local models run on your CPU or GPU and can be significantly slower than remote API calls, especially for larger models. If responses are too slow, try a smaller model (llama3.2 at 3B parameters is a good starting point), ensure Ollama is using your GPU if available (check with "ollama ps"), or reduce max_tokens to limit the length of generated responses.