Written in the spirit of a technical wiki, this article walks you through one of the most clarifying and practically useful mental models in modern AI engineering: Andrej Karpathy's taxonomy of memory for Large Language Model (LLM) agents. By the time you finish reading, you will understand not only what the four types of memory are, but why they exist, how they differ from each other, and how to implement each one in real Python code that works with both a local Ollama model and a remote OpenAI-compatible API.
QUICK REFERENCE: THE FOUR MEMORY TYPES AT A GLANCE
IN-WEIGHTS MEMORY
Location: Neural network parameters (model weights)
Persistence: Permanent (until fine-tuned)
Capacity: Vast but fixed at training time
Speed: Instantaneous (implicit in every forward pass)
Update cost: Very high (requires fine-tuning)
Best for: General world knowledge, reasoning ability, language fluency
Limitation: Static, opaque, cannot be surgically updated
IN-CONTEXT MEMORY
Location: The model's context window (active prompt)
Persistence: Volatile (lost when context is cleared)
Capacity: Limited (thousands to hundreds of thousands of tokens)
Speed: Fast (direct attention over all tokens)
Update cost: Free (just add to the message list)
Best for: Conversational continuity, injecting retrieved context
Limitation: Limited size, volatile, expensive for very long contexts
EXTERNAL RETRIEVAL MEMORY
Location: Vector database, file system, knowledge graph
Persistence: Persistent (survives restarts)
Capacity: Essentially unlimited
Speed: Slower (requires embedding + similarity search)
Update cost: Low (add/update/delete individual entries)
Best for: Domain knowledge, user preferences, episodic history
Limitation: Requires retrieval infrastructure, quality depends on curation
KV CACHE MEMORY
Location: GPU memory (during inference)
Persistence: Session-scoped (or cross-session with prompt caching)
Capacity: Limited by GPU memory
Speed: Very fast (avoids recomputation)
Update cost: Automatic (managed by the inference engine)
Best for: Accelerating inference for stable prompt prefixes
Limitation: Invalidated by any change to the cached prefix
CHAPTER ONE: THE PROBLEM THAT MEMORY SOLVES
Before diving into Karpathy's taxonomy, it is worth spending a moment understanding the fundamental problem that makes memory a first-class concern in agentic AI systems. If you have ever chatted with a raw, unadorned LLM endpoint, you have already experienced the problem firsthand: the model is brilliant in the moment and completely amnesiac the next. Every time you send a new request, the model wakes up with no recollection of who you are, what you discussed yesterday, or what preferences you expressed last week. It is like hiring the world's most knowledgeable consultant, only to discover that they suffer from severe anterograde amnesia and must be re-briefed from scratch at the start of every meeting.
This is not a bug. It is a direct consequence of how transformer-based language models work at their core. A model is a mathematical function: it takes a sequence of tokens as input and produces a probability distribution over the next token as output. That function is stateless. It does not carry hidden state between calls the way a running process does. The weights of the network encode everything the model "knows," and those weights do not change during inference.
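This statelessness is easy to demonstrate in code. The sketch below uses a toy stand-in for a model (a plain Python function, not a real LLM) to show that any "memory" across turns exists only because the caller re-sends the history; `toy_model` and its behavior are illustrative assumptions, not a real API.

```python
# A stateless "model": it can only answer from the tokens it is handed.
# toy_model is a stand-in for a real chat endpoint; nothing is stored
# between calls.
def toy_model(messages: list[dict]) -> str:
    # Scan the provided context for a self-introduction.
    for msg in messages:
        if msg["role"] == "user" and "My name is" in msg["content"]:
            name = msg["content"].split("My name is")[-1].strip(" .")
            return f"Your name is {name}."
    return "I have no idea who you are."

# Call 1: the user introduces themselves.
history = [{"role": "user", "content": "My name is Alex."}]
print(toy_model(history))  # "Your name is Alex."

# Call 2 WITHOUT the history: the "model" has no recollection.
print(toy_model([{"role": "user", "content": "What is my name?"}]))

# Call 2 WITH the history re-sent: memory is the caller's job.
history.append({"role": "user", "content": "What is my name?"})
print(toy_model(history))  # "Your name is Alex."
```

Every memory technique in this article is, at bottom, a strategy for deciding what to put back into that input sequence on each call.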
CHAPTER TWO: KARPATHY'S LLM OS ANALOGY
Andrej Karpathy, former Director of AI at Tesla and one of the founding members of OpenAI, introduced a powerful and clarifying analogy: the LLM as an operating system. This is not a casual metaphor. It is a precise structural comparison that maps every major component of a traditional OS to a corresponding component in an LLM-based agent system.
In a traditional operating system, the CPU is the computational core that executes instructions. The RAM is the fast, volatile working memory that holds whatever the CPU is currently processing. The disk is the slow, persistent storage that holds everything else. System calls are the interface through which programs request services from the OS kernel.
In Karpathy's LLM OS, the LLM itself is the CPU. It is the reasoning engine that interprets instructions, evaluates context, and produces outputs. The context window — the finite sequence of tokens the model can attend to at once — is the RAM. It is fast, immediately accessible, but strictly limited in size and completely volatile: when the context is cleared, everything in it is gone.
External storage systems, whether they are vector databases, relational databases, file systems, or knowledge graphs, are the disk. They are slow relative to the context window but persistent and essentially unlimited in capacity. Tool calls and API invocations are the system calls — the mechanism by which the LLM reaches out to the external world to do things it cannot do on its own.
Agents, in this analogy, are the long-running applications that run on top of this OS. Just as a web server is a process that runs continuously, handles requests, manages state, and coordinates with the OS, an AI agent is a process that runs continuously, handles tasks, manages memory, and coordinates with the LLM.
This analogy is not just intellectually satisfying. It is practically useful because it immediately tells you what the design constraints are. RAM is fast but small, so you must be selective about what you put in the context window. Disk is large but slow, so you need efficient indexing to retrieve what you need quickly. The CPU has no memory of its own between processes, so you must explicitly manage state. Every design decision in an agentic system can be traced back to these constraints.
CHAPTER THREE: THE FOUR TYPES OF MEMORY
Within this OS analogy, Karpathy identifies four distinct types of memory that an LLM agent can use. Each type has a different location, a different persistence model, a different capacity, and a different update mechanism. Understanding all four, and knowing when to use each one, is the core skill of an agentic AI architect.
The four types are:
In-Weights Memory (Parametric Memory)
In-Context Memory (Working Memory)
External Retrieval Memory (Non-Parametric Memory)
KV Cache Memory (Computational Memory)
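For readers who prefer code to prose, the taxonomy can be sketched as plain data. The field values restate the quick-reference section above; the class name, attribute names, and the "CPU cache" analog for the KV cache are this article's own shorthand, not an established API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MemoryType:
    name: str
    os_analog: str     # component in Karpathy's LLM OS analogy
    persistence: str
    update_cost: str

TAXONOMY = [
    MemoryType("in-weights", "the CPU itself", "permanent until fine-tuned", "very high"),
    MemoryType("in-context", "RAM", "volatile, session-scoped", "free"),
    MemoryType("external-retrieval", "disk", "persistent", "low"),
    MemoryType("kv-cache", "CPU cache", "session-scoped", "automatic"),
]

# Example: which memory types survive a restart?
persistent = [m.name for m in TAXONOMY
              if "persistent" in m.persistence or "permanent" in m.persistence]
print(persistent)  # ['in-weights', 'external-retrieval']
```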
MEMORY TYPE 1: IN-WEIGHTS MEMORY (PARAMETRIC MEMORY)
In-weights memory is the knowledge that is baked into the model's parameters during training. When a model is trained on hundreds of billions of tokens of text, the gradient descent process slowly adjusts billions of floating-point numbers — the weights of the neural network — until those weights encode a compressed statistical representation of everything the model has seen. The result is a kind of crystallized knowledge: the model "knows" that Paris is the capital of France, that the derivative of sin(x) is cos(x), that Python uses indentation to define code blocks, and millions of other facts, not because it looks them up, but because that knowledge is embedded in the geometry of its weight space.
This is parametric memory because the knowledge is stored in the parameters of the model. It is the most fundamental type of memory, and it is the only type that is always present. Every other type of memory is optional and must be explicitly engineered. In-weights memory comes for free with the model.
The extraordinary thing about in-weights memory is its density. A model with 70 billion parameters, stored in 16-bit floating point, occupies roughly 140 gigabytes. Yet those 140 gigabytes encode a working knowledge of virtually every domain of human knowledge, from quantum mechanics to Renaissance poetry to how to write a SQL JOIN query. No other storage medium achieves anything close to this information density per byte.
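The arithmetic behind that figure is simple: a 16-bit float occupies two bytes per parameter.

```python
params = 70e9          # 70 billion parameters
bytes_per_param = 2    # fp16 / bf16: 16 bits = 2 bytes
total_gb = params * bytes_per_param / 1e9
print(f"{total_gb:.0f} GB")  # 140 GB
```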
But in-weights memory has three serious limitations that make it insufficient on its own for agentic systems.
The first limitation is that it is static. Once training is complete, the weights are frozen. The model cannot learn new facts during inference. If a new programming language is invented after the model's training cutoff, the model will not know about it. If a company's internal policy changes, the model will not know. The world moves on; the model's weights do not.
The second limitation is that updating it is extraordinarily expensive. To teach the model new facts by modifying its weights, you must perform fine-tuning, a process that requires significant GPU compute, careful dataset curation, and hyperparameter tuning. This is not something you can do in response to a user's request in real time.
The third limitation is that it is opaque. You cannot inspect the weights and say "here is where the model stores the fact that the speed of light is 299,792,458 meters per second." The knowledge is distributed across millions of weights in a way that is not interpretable. This makes it impossible to surgically update or delete specific facts.
Despite these limitations, in-weights memory is the foundation on which everything else is built. It is what gives the model its reasoning ability, its language fluency, its general world knowledge, and its capacity to use the other three types of memory effectively.
The mechanism for updating in-weights memory is fine-tuning. In the context of Karpathy's Software 2.0 concept, fine-tuning is analogous to a git commit: you are making a permanent change to the "codebase" of the model's knowledge. The following example shows how you might create a custom model with a specific persona using the Ollama Modelfile mechanism — the closest analog to in-weights memory available without a full GPU-based fine-tuning run.
# in_weights_memory.py
# Demonstrates persona baking via Ollama Modelfiles.
# pip install requests
import os
import subprocess
import textwrap
from typing import Optional
import requests
class OllamaModelfileManager:
"""
Manages the creation of custom Ollama models via Modelfiles.
A Modelfile specifies the base model, a system prompt, and
various parameters. When you create a model from a Modelfile,
you bake a specific persona and set of instructions into the
model's default behavior — the closest thing to in-weights memory
available without a full GPU-based fine-tuning run.
"""
def __init__(self, ollama_base_url: str = "http://localhost:11434"):
self.ollama_base_url = ollama_base_url
def create_modelfile(
self,
base_model: str,
system_prompt: str,
temperature: float = 0.7,
context_length: int = 4096,
output_path: str = "./Modelfile"
) -> str:
"""
Generates a Modelfile with the given configuration.
Args:
base_model: The Ollama base model to build on
(e.g., "llama3.2", "mistral").
system_prompt: The system prompt to bake into the model.
This defines the model's persona and
default behavior.
temperature: Sampling temperature (0.0 = deterministic,
1.0 = creative).
context_length: The context window size in tokens.
output_path: Where to write the Modelfile.
Returns:
The path to the written Modelfile.
"""
modelfile_content = textwrap.dedent(f"""
FROM {base_model}
PARAMETER temperature {temperature}
PARAMETER num_ctx {context_length}
SYSTEM \"\"\"
{system_prompt}
\"\"\"
""").strip()
with open(output_path, "w", encoding="utf-8") as f:
f.write(modelfile_content)
print(f"[MODELFILE] Written to {output_path}")
return output_path
def build_model(
self,
model_name: str,
modelfile_path: str = "./Modelfile"
) -> bool:
"""
Calls the Ollama CLI to build a custom model from a Modelfile.
Args:
model_name: The name to give the new custom model.
modelfile_path: Path to the Modelfile.
Returns:
True if the build succeeded, False otherwise.
"""
try:
result = subprocess.run(
["ollama", "create", model_name, "-f", modelfile_path],
capture_output=True,
text=True,
timeout=300
)
if result.returncode == 0:
print(f"[MODELFILE] Model '{model_name}' built successfully.")
return True
else:
print(f"[MODELFILE] Build failed: {result.stderr}")
return False
except FileNotFoundError:
print("[MODELFILE] Error: 'ollama' CLI not found. Is Ollama installed?")
return False
def test_model(
self,
model_name: str,
test_prompt: str = "Introduce yourself briefly."
) -> Optional[str]:
"""
Sends a test prompt to the custom model and returns the response.
"""
try:
response = requests.post(
f"{self.ollama_base_url}/api/chat",
json={
"model": model_name,
"messages": [{"role": "user", "content": test_prompt}],
"stream": False
},
timeout=120
)
response.raise_for_status()
return response.json()["message"]["content"]
except requests.RequestException as e:
print(f"[MODELFILE] Test failed: {e}")
return None
if __name__ == "__main__":
manager = OllamaModelfileManager()
# Define a domain-specific system prompt to bake into the model.
# This could be any persona: a medical assistant, a legal researcher,
# a software architect, etc.
DOMAIN_SYSTEM_PROMPT = textwrap.dedent("""
You are an expert software architect with deep knowledge of
distributed systems, API design, and cloud-native patterns.
You always:
- Provide precise, technically accurate answers.
- Reference relevant design patterns and trade-offs.
- Suggest the most maintainable and scalable solution.
- Use standard industry terminology.
- Flag performance-critical considerations clearly.
""").strip()
# Step 1: Create the Modelfile
modelfile_path = manager.create_modelfile(
base_model="llama3.2",
system_prompt=DOMAIN_SYSTEM_PROMPT,
temperature=0.3, # Lower temperature for deterministic answers
context_length=8192
)
# Step 2: Build the custom model
success = manager.build_model(
model_name="software-architect",
modelfile_path=modelfile_path
)
if success:
# Step 3: Test the custom model
print("\n=== Testing Custom Model ===")
response = manager.test_model(
model_name="software-architect",
test_prompt="What are the trade-offs between REST and gRPC?"
)
if response:
print(f"Response: {response}")
When you bake a system prompt into a Modelfile, you are not truly modifying the model's weights. You are creating a configuration wrapper that prepends the system prompt to every conversation. True in-weights modification requires GPU-based fine-tuning using frameworks like Hugging Face's Transformers library with PEFT techniques such as LoRA. That process is beyond the scope of this article, but the conceptual point is clear: in-weights memory is the bedrock, and modifying it is expensive but permanent.
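For readers who want to see the shape of a true weight update, the sketch below outlines a LoRA setup with Hugging Face's peft library. The base model name and hyperparameter values are illustrative choices, not recommendations, and actually running this requires a GPU, a dataset, and a training loop that is out of scope for this article.

```python
# Sketch only: requires `pip install transformers peft` and a GPU.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# The base model name is an example; substitute any causal LM you can access.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")

lora_config = LoraConfig(
    r=8,                                   # rank of the low-rank update matrices
    lora_alpha=16,                         # scaling factor for the update
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

# Wraps the frozen base model; only the small adapter matrices train.
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

The appeal of LoRA is exactly the cost argument made above: instead of rewriting all 70 billion weights, you train a few million adapter weights and merge them in, making the "git commit" to the model's knowledge far cheaper, though still a training job rather than a runtime operation.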
MEMORY TYPE 2: IN-CONTEXT MEMORY (WORKING MEMORY)
In-context memory is the information that lives inside the model's context window during a single inference call. It is the most immediate, most flexible, and most widely used form of agent memory. Everything the model can "see" right now — the system prompt, the conversation history, the results of tool calls, retrieved documents, intermediate reasoning steps — is in-context memory.
The context window is the model's RAM. Like RAM, it is fast: the model can attend to any token in the context window with equal ease. Like RAM, it is volatile: when the context is cleared or the session ends, everything in it is gone. And like RAM, it is limited: even the most capable models today have context windows measured in hundreds of thousands of tokens, which sounds large until you realize that a single book is roughly 100,000 tokens, and a complex agentic task might involve dozens of tool call results, each of which is thousands of tokens long.
Karpathy has emphasized the concept of "context engineering" as one of the most important skills in building effective LLM applications. Context engineering is the art and science of deciding what information to put into the context window, in what order, and in what format, to maximize the quality of the model's output. It is the evolution beyond "prompt engineering," which focused on crafting clever instructions. Context engineering is about constructing an entire information ecosystem for the model to reason within.
The following example demonstrates a simple but complete in-context memory manager. It maintains a rolling conversation history, enforces a token budget to prevent the context from overflowing, and works with both a local Ollama model and a remote OpenAI-compatible API.
# in_context_memory.py
# A production-quality in-context memory manager for LLM agents.
# Implements a rolling window strategy to manage context window limits.
# pip install openai tiktoken requests
import os
import time
from typing import Optional
from dataclasses import dataclass, field
import requests
@dataclass
class Message:
"""
Represents a single message in the conversation history.
Immutable once created to prevent accidental mutation of history.
"""
role: str # "system", "user", or "assistant"
content: str
timestamp: float = field(default_factory=time.time)
def to_api_dict(self) -> dict:
"""
Converts the message to the format expected by
OpenAI-compatible chat completion APIs.
"""
return {"role": self.role, "content": self.content}
class InContextMemoryManager:
"""
Manages the in-context memory (context window) for an LLM agent.
This class maintains a conversation history and implements a rolling
window strategy: when the estimated token count exceeds the budget,
the oldest non-system messages are dropped to make room for new ones.
This mirrors how human working memory works: recent information is
retained while older details fade.
Supports both:
- Local Ollama models (via the Ollama REST API)
- Remote OpenAI-compatible APIs (via the openai Python library)
"""
def __init__(
self,
system_prompt: str,
max_context_tokens: int = 4096,
backend: str = "ollama",
model_name: str = "llama3.2",
ollama_base_url: str = "http://localhost:11434",
openai_api_key: Optional[str] = None,
openai_model: str = "gpt-4o-mini"
):
"""
Args:
system_prompt: The system prompt that defines the agent's
role. This is always kept in context.
max_context_tokens: The maximum number of tokens to keep in
the context window before pruning.
backend: Either "ollama" for local inference or
"openai" for remote API calls.
model_name: The Ollama model name (e.g., "llama3.2").
ollama_base_url: The base URL of the local Ollama server.
openai_api_key: The OpenAI API key (for remote backend).
openai_model: The OpenAI model name (for remote backend).
"""
self.system_prompt = system_prompt
self.max_context_tokens = max_context_tokens
self.backend = backend
self.model_name = model_name
self.ollama_base_url = ollama_base_url
self.openai_model = openai_model
# The conversation history always starts with the system message.
# The system message is never pruned; it is the permanent anchor
# of the agent's identity and instructions.
self.history: list[Message] = [
Message(role="system", content=system_prompt)
]
if backend == "openai":
from openai import OpenAI
api_key = openai_api_key or os.environ.get("OPENAI_API_KEY")
self.openai_client = OpenAI(api_key=api_key)
def _estimate_tokens(self, text: str) -> int:
"""
Estimates the token count for a piece of text.
Uses tiktoken for accuracy when available; falls back to a
word-count heuristic (words * 1.3) otherwise.
"""
try:
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
return len(enc.encode(text))
except ImportError:
return int(len(text.split()) * 1.3)
def _total_context_tokens(self) -> int:
"""Returns the estimated total token count for the full history."""
return sum(self._estimate_tokens(msg.content) for msg in self.history)
def _prune_history(self) -> None:
"""
Removes the oldest non-system messages from the history to bring
the total token count under the budget. The system message (index 0)
is always preserved. This implements a FIFO (first-in, first-out)
eviction policy: the oldest messages are evicted first.
"""
while (
self._total_context_tokens() > self.max_context_tokens
and len(self.history) > 2
):
# Remove the oldest non-system message (index 1)
removed = self.history.pop(1)
print(
f"[PRUNE] Removed old message (role={removed.role}, "
f"~{self._estimate_tokens(removed.content)} tokens)"
)
def add_user_message(self, content: str) -> None:
"""Adds a user message to the conversation history."""
self.history.append(Message(role="user", content=content))
self._prune_history()
def add_assistant_message(self, content: str) -> None:
"""Adds an assistant message to the conversation history."""
self.history.append(Message(role="assistant", content=content))
def _call_ollama(self) -> str:
"""
Sends the current context to the local Ollama server and returns
the model's response as a string.
"""
payload = {
"model": self.model_name,
"messages": [msg.to_api_dict() for msg in self.history],
"stream": False
}
response = requests.post(
f"{self.ollama_base_url}/api/chat",
json=payload,
timeout=120
)
response.raise_for_status()
return response.json()["message"]["content"]
def _call_openai(self) -> str:
"""
Sends the current context to the OpenAI API and returns the
model's response as a string.
"""
response = self.openai_client.chat.completions.create(
model=self.openai_model,
messages=[msg.to_api_dict() for msg in self.history]
)
return response.choices[0].message.content
def chat(self, user_input: str) -> str:
"""
The main entry point for interacting with the agent. Adds the
user's message to the context, calls the LLM, stores the
response, and returns it.
Args:
user_input: The user's message.
Returns:
The assistant's response as a string.
"""
self.add_user_message(user_input)
print(
f"[CONTEXT] Sending {len(self.history)} messages "
f"(~{self._total_context_tokens()} tokens) to {self.backend}"
)
if self.backend == "ollama":
response_text = self._call_ollama()
else:
response_text = self._call_openai()
self.add_assistant_message(response_text)
return response_text
def get_history_summary(self) -> str:
"""Returns a human-readable summary of the current context state."""
lines = [
f"Backend: {self.backend}",
f"Messages: {len(self.history)}",
f"Approx tokens: {self._total_context_tokens()}",
f"Token budget: {self.max_context_tokens}",
"---"
]
for i, msg in enumerate(self.history):
preview = msg.content[:60].replace("\n", " ")
lines.append(f" [{i}] {msg.role:10s} | {preview}...")
return "\n".join(lines)
if __name__ == "__main__":
# Choose your backend: "ollama" for local, "openai" for remote
BACKEND = "ollama"
agent = InContextMemoryManager(
system_prompt=(
"You are a helpful assistant specializing in Python programming. "
"You remember everything the user tells you within this conversation."
),
max_context_tokens=2048,
backend=BACKEND,
model_name="llama3.2"
)
print("=== In-Context Memory Demo ===\n")
# Turn 1: Establish a preference
response = agent.chat("My name is Alex and I prefer type hints in Python.")
print(f"Assistant: {response}\n")
# Turn 2: Ask something that requires remembering Turn 1
response = agent.chat("Write me a function to reverse a string.")
print(f"Assistant: {response}\n")
# Turn 3: Verify the agent remembers the preference from Turn 1
response = agent.chat("What is my name, and what coding style do I prefer?")
print(f"Assistant: {response}\n")
print("\n=== Context Window State ===")
print(agent.get_history_summary())
The code above demonstrates several important principles of in-context memory management. The system message is treated as sacred: it is always the first message in the history and is never pruned, because it defines the agent's identity and instructions. Without the system message, the model loses its persona and behavioral guidelines.
The pruning strategy is a rolling window: when the context grows too large, the oldest non-system messages are dropped first. This is a simple but effective strategy that mirrors how human working memory works. More sophisticated strategies might use importance scoring to decide which messages to drop, or might summarize old messages into a compressed form before dropping them.
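A summarization-based variant can be sketched with plain message dicts. The `summarize` callable below is a placeholder you would back with an LLM call (via Ollama or OpenAI, as elsewhere in this article); the function name and the summary format are this article's assumptions, not part of the manager above.

```python
from typing import Callable

def compress_history(history: list[dict], keep_recent: int,
                     summarize: Callable[[str], str]) -> list[dict]:
    """Replace all but the system message and the most recent messages
    with a single assistant-role summary message."""
    system, rest = history[0], history[1:]
    if len(rest) <= keep_recent:
        return history
    old, recent = rest[:-keep_recent], rest[-keep_recent:]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old)
    summary = {"role": "assistant",
               "content": f"[Summary of earlier conversation] {summarize(transcript)}"}
    return [system, summary] + recent

# Usage with a stub summarizer (a real agent would call the LLM here):
fake_summarize = lambda text: f"{len(text.splitlines())} earlier messages condensed."
history = [{"role": "system", "content": "You are helpful."}] + [
    {"role": "user", "content": f"message {i}"} for i in range(6)
]
compressed = compress_history(history, keep_recent=2, summarize=fake_summarize)
# compressed now holds: system, one summary message, and the 2 newest messages
```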
The token estimation is a critical detail. LLM APIs bill by the token, not by the character or word, and context limits are measured in tokens too, yet different models use different tokenization schemes. The tiktoken library provides accurate token counts for OpenAI models. For Ollama models, the word-count approximation is usually good enough for budget management, though you should be aware that it may be off by 10–20%.
MEMORY TYPE 3: EXTERNAL RETRIEVAL MEMORY (NON-PARAMETRIC MEMORY)
External retrieval memory is the most powerful and most architecturally complex type of memory in Karpathy's taxonomy. It is the mechanism by which an agent can access information that is too large to fit in the context window, too dynamic to be baked into the model's weights, and too important to be left to chance. In the OS analogy, this is the disk: vast, persistent, and accessible through a retrieval interface.
The canonical implementation of external retrieval memory is Retrieval-Augmented Generation, or RAG. In a RAG system, documents are split into chunks, each chunk is converted into a dense vector embedding, and those embeddings are stored in a vector database. When the agent needs information, it converts the query into an embedding, searches the vector database for the most similar chunks, and injects those chunks into the context window. The model then reasons over the retrieved content to produce its answer.
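Before the full ChromaDB implementation later in this chapter, it helps to see the core retrieval mechanic in isolation. The sketch below uses hand-made toy vectors in place of real embeddings, so the numbers are stand-ins for model output, not actual embedding values.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy "embedding" store: vectors are hand-made stand-ins for an embedding model.
store = {
    "Paris is the capital of France.": [0.9, 0.1, 0.0],
    "The derivative of sin(x) is cos(x).": [0.0, 0.2, 0.9],
    "France is in western Europe.": [0.7, 0.4, 0.2],
}

query_vec = [0.85, 0.2, 0.05]  # pretend this embeds "Tell me about France"
ranked = sorted(store, key=lambda text: cosine_similarity(store[text], query_vec),
                reverse=True)
print(ranked[0])  # the most similar chunk
```

A vector database does exactly this comparison, but with approximate nearest-neighbor indexes (such as HNSW) so it scales to millions of chunks instead of three.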
But Karpathy has proposed a more ambitious and intellectually interesting approach that goes beyond simple RAG. He calls it "compilation over retrieval," and it is worth understanding deeply because it represents a qualitative shift in how we think about agent memory.
The core insight is this: raw documents are like source code. They are verbose, redundant, written for human readers rather than machine consumers, and full of context that is important for the original author but irrelevant for a future query. When you do naive RAG, you are essentially running an interpreter: every time a query comes in, you re-parse the raw source material, extract the relevant bits, and hand them to the model. This works, but it is inefficient and produces inconsistent results because the same information might be expressed differently in different source documents.
Karpathy's proposal is to use the LLM itself as a compiler. Instead of storing raw documents and retrieving them at query time, you run the LLM over the raw documents once, upfront, and have it synthesize the information into a structured, coherent, interlinked knowledge base — typically a collection of Markdown files organized like a wiki. This compiled knowledge base is the "executable": it is faster to query, more coherent, and easier to maintain than a collection of raw documents.
This three-layer architecture consists of:
Layer 1 — Raw Sources: The immutable collection of original documents: PDFs, web pages, research papers, meeting notes, code comments, anything that contains information the agent needs. These are never modified; they are the ground truth.
Layer 2 — LLM-Owned Wiki: A structured collection of Markdown files that the LLM has compiled from the raw sources. Each file covers a specific topic, is written in a consistent style, resolves contradictions between sources, and contains links to related topics.
Layer 3 — Schema Configuration: A document that tells the LLM how to organize the wiki, what topics to cover, how to handle edge cases, and what quality standards to maintain. It is the "build system" of the knowledge base.
The difference between the simple RAG approach and the compiled knowledge base approach is profound. In the RAG approach, an informal note like "So I was reading about distributed systems the other day and it's pretty impressive" would be stored verbatim and potentially retrieved verbatim. The LLM would have to parse this informal, verbose prose at query time, every single time a relevant question is asked. In the compiled approach, the LLM transforms this informal note into a clean, structured wiki page once, and all future queries benefit from that upfront investment.
This is exactly the compiler analogy that Karpathy uses. A compiler transforms human-readable source code into efficient machine code once. Every subsequent execution of the program benefits from that compilation. Similarly, the LLM compiler transforms verbose, informal raw text into structured knowledge once, and every subsequent query benefits from that compilation.
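The compile step itself is just an LLM call with a fixed prompt, run once per source document rather than once per query. In the sketch below, `llm` is a placeholder callable you would back with Ollama or OpenAI as in the other examples; the prompt wording and the `compile_note` name are this article's assumptions, not a standard interface.

```python
from typing import Callable

COMPILE_PROMPT = (
    "Rewrite the following raw note as a concise wiki entry in Markdown. "
    "Use a '# Title' heading, remove filler, keep every fact, and write "
    "in a neutral reference style.\n\nRAW NOTE:\n{note}"
)

def compile_note(note: str, llm: Callable[[str], str]) -> str:
    """Run the LLM 'compiler' once over a raw source note."""
    return llm(COMPILE_PROMPT.format(note=note))

# Usage with a stub LLM so the sketch is runnable without a server:
stub_llm = lambda prompt: "# Distributed Systems\nNotes compiled from raw source."
page = compile_note("So I was reading about distributed systems...", stub_llm)
print(page.splitlines()[0])  # "# Distributed Systems"
```

The resulting Markdown pages, not the raw notes, are what get chunked and embedded, so every future retrieval pays the parsing cost exactly once, at compile time.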
The following code demonstrates both approaches — first the classic RAG pipeline, then the compiled knowledge base:
# external_retrieval_memory.py
# Implements both classic RAG and Karpathy's "compilation over retrieval"
# approach for external agent memory.
# pip install chromadb openai requests
import os
import json
import textwrap
from pathlib import Path
from datetime import datetime
from typing import Optional
import requests
# ---------------------------------------------------------------------------
# CHAPTER A: CLASSIC RAG PIPELINE
# ---------------------------------------------------------------------------
class LocalEmbedder:
"""
Generates text embeddings using a locally running Ollama model.
Uses the nomic-embed-text model by default, which is optimized
for semantic similarity tasks.
"""
def __init__(
self,
model: str = "nomic-embed-text",
ollama_base_url: str = "http://localhost:11434"
):
self.model = model
self.ollama_base_url = ollama_base_url
def embed(self, text: str) -> list[float]:
"""Returns the embedding vector for a piece of text."""
response = requests.post(
f"{self.ollama_base_url}/api/embeddings",
json={"model": self.model, "prompt": text},
timeout=60
)
response.raise_for_status()
return response.json()["embedding"]
class SimpleRAGMemory:
"""
Implements the classic Retrieval-Augmented Generation (RAG) pattern
for external agent memory.
Documents are split into overlapping chunks, embedded, and stored in
a persistent ChromaDB collection. At query time, the most semantically
similar chunks are retrieved and returned for injection into the
agent's context window.
"""
def __init__(
self,
collection_name: str = "agent_knowledge",
persist_directory: str = "./rag_store",
chunk_size: int = 512,
chunk_overlap: int = 64,
embedding_model: str = "nomic-embed-text",
ollama_base_url: str = "http://localhost:11434"
):
import chromadb
self.chunk_size = chunk_size
self.chunk_overlap = chunk_overlap
self.embedder = LocalEmbedder(
model=embedding_model,
ollama_base_url=ollama_base_url
)
self.client = chromadb.PersistentClient(path=persist_directory)
self.collection = self.client.get_or_create_collection(
name=collection_name,
metadata={"hnsw:space": "cosine"}
)
def _chunk_text(self, text: str) -> list[str]:
"""
Splits a long text into overlapping chunks. Overlapping chunks
ensure that sentences or paragraphs that fall near chunk boundaries
are still retrievable in their full context.
"""
chunks = []
start = 0
while start < len(text):
end = min(start + self.chunk_size, len(text))
chunks.append(text[start:end])
start += self.chunk_size - self.chunk_overlap
return chunks
def ingest_document(
self,
text: str,
source_name: str,
metadata: Optional[dict] = None
) -> int:
"""
Ingests a document into the vector store. The document is split
into chunks, each chunk is embedded, and all chunks are stored
with their source metadata.
Args:
text: The full text of the document.
source_name: A human-readable name for the source (e.g.,
a filename or URL). Used for citation.
metadata: Optional additional metadata to store with each chunk.
Returns:
The number of chunks ingested.
"""
chunks = self._chunk_text(text)
ids, embeddings, documents, metadatas = [], [], [], []
for i, chunk in enumerate(chunks):
chunk_id = f"{source_name}::chunk_{i}"
embedding = self.embedder.embed(chunk)
chunk_meta = {"source": source_name, "chunk_index": i}
if metadata:
chunk_meta.update(metadata)
ids.append(chunk_id)
embeddings.append(embedding)
documents.append(chunk)
metadatas.append(chunk_meta)
self.collection.upsert(
ids=ids,
embeddings=embeddings,
documents=documents,
metadatas=metadatas
)
print(f"[RAG] Ingested '{source_name}': {len(chunks)} chunks stored.")
return len(chunks)
def retrieve(
self,
query: str,
n_results: int = 5
) -> list[dict]:
"""
Retrieves the most semantically relevant chunks for a given query.
Args:
query: The search query (usually the user's question).
n_results: The maximum number of chunks to retrieve.
Returns:
A list of dicts, each containing 'text', 'source', and
'distance' (lower distance = higher similarity).
"""
if self.collection.count() == 0:
print("[RAG] Warning: collection is empty. Ingest documents first.")
return []
query_embedding = self.embedder.embed(query)
results = self.collection.query(
query_embeddings=[query_embedding],
n_results=min(n_results, self.collection.count()),
include=["documents", "metadatas", "distances"]
)
retrieved = []
for doc, meta, dist in zip(
results["documents"][0],
results["metadatas"][0],
results["distances"][0]
):
retrieved.append({
"text": doc,
"source": meta.get("source", "unknown"),
"chunk_index": meta.get("chunk_index", 0),
"distance": round(dist, 4)
})
return retrieved
def format_context_block(self, retrieved_chunks: list[dict]) -> str:
"""
Formats retrieved chunks into a clean context block suitable for
injection into an LLM prompt. Each chunk is labeled with its
source for traceability.
"""
if not retrieved_chunks:
return "No relevant information found in the knowledge base."
lines = ["=== RETRIEVED KNOWLEDGE ==="]
for i, chunk in enumerate(retrieved_chunks, 1):
lines.append(
f"\n[Source {i}: {chunk['source']} "
f"(relevance: {1 - chunk['distance']:.2f})]"
)
lines.append(chunk["text"])
lines.append("\n=== END OF RETRIEVED KNOWLEDGE ===")
return "\n".join(lines)
# This class implements the full RAG agent loop, combining the
# SimpleRAGMemory retrieval system with an LLM for generation.
import requests as req_lib
class RAGAgent:
"""
A complete RAG-based agent that combines external retrieval memory
with LLM generation. Implements the retrieve-augment-generate loop.
The agent:
1. Receives a user query.
2. Retrieves relevant chunks from the vector store.
3. Injects those chunks into the LLM's context window.
4. Generates a grounded, cited response.
"""
def __init__(
self,
rag_memory: SimpleRAGMemory,
backend: str = "ollama",
model_name: str = "llama3.2",
ollama_base_url: str = "http://localhost:11434",
openai_api_key: Optional[str] = None,
openai_model: str = "gpt-4o-mini",
n_retrieved_chunks: int = 5
):
self.memory = rag_memory
self.backend = backend
self.model_name = model_name
self.ollama_base_url = ollama_base_url
self.n_retrieved_chunks = n_retrieved_chunks
if backend == "openai":
from openai import OpenAI
api_key = openai_api_key or os.environ.get("OPENAI_API_KEY")
self.openai_client = OpenAI(api_key=api_key)
self.openai_model = openai_model
def _build_prompt(self, query: str, context_block: str) -> list[dict]:
"""
Constructs the message list for the LLM API call. The system
message instructs the model to use only the provided context,
which reduces hallucination by grounding the model in retrieved
facts rather than its parametric memory.
"""
system_message = textwrap.dedent("""
You are a knowledgeable assistant with access to a curated
knowledge base. When answering questions, you MUST:
1. Base your answer primarily on the retrieved knowledge provided.
2. Cite your sources by referencing the [Source N] labels.
3. If the retrieved knowledge does not contain enough information
to answer the question, say so clearly rather than guessing.
4. Be concise and precise.
""").strip()
user_message = textwrap.dedent(f"""
RETRIEVED CONTEXT:
{context_block}
USER QUESTION:
{query}
Please answer the question based on the retrieved context above.
""").strip()
return [
{"role": "system", "content": system_message},
{"role": "user", "content": user_message}
]
def answer(self, query: str) -> dict:
"""
Answers a query using the RAG pipeline.
Args:
query: The user's question.
Returns:
A dict containing 'answer', 'sources', and 'chunks_used'.
"""
# Step 1: Retrieve relevant chunks from external memory
retrieved = self.memory.retrieve(query, n_results=self.n_retrieved_chunks)
context_block = self.memory.format_context_block(retrieved)
# Step 2: Build the augmented prompt
messages = self._build_prompt(query, context_block)
# Step 3: Generate the response
if self.backend == "ollama":
response = req_lib.post(
f"{self.ollama_base_url}/api/chat",
json={
"model": self.model_name,
"messages": messages,
"stream": False
},
timeout=120
)
response.raise_for_status()
answer_text = response.json()["message"]["content"]
else:
response = self.openai_client.chat.completions.create(
model=self.openai_model,
messages=messages
)
answer_text = response.choices[0].message.content
return {
"answer": answer_text,
"sources": list({c["source"] for c in retrieved}),
"chunks_used": len(retrieved)
}
# ---------------------------------------------------------------------------
# CHAPTER B: COMPILED KNOWLEDGE BASE (KARPATHY'S "COMPILATION OVER RETRIEVAL")
# ---------------------------------------------------------------------------
class CompiledKnowledgeBase:
"""
Implements Karpathy's "LLM as compiler" concept for agent memory.
Architecture:
Layer 1: Raw sources (immutable, original documents)
Layer 2: LLM-compiled wiki (structured Markdown files)
Layer 3: Schema config (tells the LLM how to organize the wiki)
The LLM processes raw sources once, upfront, and synthesizes them
into a coherent wiki. Subsequent queries are answered from the wiki,
which is faster and more consistent than re-parsing raw documents.
"""
def __init__(
self,
wiki_directory: str = "./compiled_wiki",
backend: str = "ollama",
model_name: str = "llama3.2",
ollama_base_url: str = "http://localhost:11434",
openai_api_key: Optional[str] = None,
openai_model: str = "gpt-4o-mini"
):
self.wiki_dir = Path(wiki_directory)
self.wiki_dir.mkdir(parents=True, exist_ok=True)
self.sources_dir = self.wiki_dir / "raw_sources"
self.sources_dir.mkdir(exist_ok=True)
self.pages_dir = self.wiki_dir / "pages"
self.pages_dir.mkdir(exist_ok=True)
self.backend = backend
self.model_name = model_name
self.ollama_base_url = ollama_base_url
index_file = self.wiki_dir / "index.json"
if index_file.exists():
with open(index_file, "r") as f:
self.index = json.load(f)
else:
self.index = {"pages": {}, "sources": {}}
if backend == "openai":
from openai import OpenAI
api_key = openai_api_key or os.environ.get("OPENAI_API_KEY")
self.openai_client = OpenAI(api_key=api_key)
self.openai_model = openai_model
def _save_index(self) -> None:
with open(self.wiki_dir / "index.json", "w") as f:
json.dump(self.index, f, indent=2)
def _call_llm(self, messages: list[dict]) -> str:
"""Calls the configured LLM backend and returns the response text."""
if self.backend == "ollama":
response = requests.post(
f"{self.ollama_base_url}/api/chat",
json={
"model": self.model_name,
"messages": messages,
"stream": False
},
timeout=180
)
response.raise_for_status()
return response.json()["message"]["content"]
else:
response = self.openai_client.chat.completions.create(
model=self.openai_model,
messages=messages
)
return response.choices[0].message.content
def ingest_and_compile(
self,
raw_text: str,
source_name: str,
topic_hint: Optional[str] = None
) -> str:
"""
The core "compilation" step. Takes a raw document and uses the LLM
to synthesize it into a structured wiki page.
This is the key difference from RAG: instead of storing the raw
text and retrieving it later, we transform it NOW into a structured,
queryable format. The LLM does the heavy lifting once, upfront.
"""
# Save the raw source (Layer 1)
source_file = self.sources_dir / source_name
with open(source_file, "w", encoding="utf-8") as f:
f.write(raw_text)
topic_instruction = (
f"The document is about: {topic_hint}." if topic_hint
else "Infer the topic from the document content."
)
compile_prompt = textwrap.dedent(f"""
You are a knowledge base compiler. Your job is to transform
raw source documents into clean, structured wiki pages.
{topic_instruction}
Transform the following raw document into a well-structured
Markdown wiki page. The page should:
- Have a clear title (H1)
- Be organized with logical sections (H2, H3)
- Extract and highlight key facts, definitions, and relationships
- Remove redundancy and informal language
- Use bullet points and tables where appropriate
- Be written for a technical audience
RAW DOCUMENT:
{raw_text}
OUTPUT (Markdown wiki page only, no preamble):
""").strip()
messages = [{"role": "user", "content": compile_prompt}]
compiled_content = self._call_llm(messages)
# Save the compiled wiki page (Layer 2)
page_filename = f"{source_name.replace('.', '_')}_compiled.md"
page_file = self.pages_dir / page_filename
with open(page_file, "w", encoding="utf-8") as f:
f.write(f"<!-- Compiled from: {source_name} -->\n")
f.write(f"<!-- Compiled at: {datetime.now().isoformat()} -->\n\n")
f.write(compiled_content)
self.index["pages"][page_filename] = {
"source": source_name,
"topic": topic_hint or "auto-detected",
"compiled_at": datetime.now().isoformat(),
"file": str(page_file)
}
self.index["sources"][source_name] = page_filename
self._save_index()
print(f"[COMPILE] '{source_name}' -> '{page_filename}'")
return page_filename
def query(self, question: str) -> dict:
"""
Answers a question by searching the compiled wiki pages.
Unlike RAG, this searches structured, pre-compiled content
rather than raw document chunks.
Args:
question: The user's question.
Returns:
A dict with 'answer' and 'pages_consulted'.
"""
wiki_content = []
for page_file in self.pages_dir.glob("*_compiled.md"):
with open(page_file, "r", encoding="utf-8") as f:
content = f.read()
wiki_content.append(f"=== {page_file.name} ===\n{content}")
if not wiki_content:
return {
"answer": "The knowledge base is empty. Please ingest documents first.",
"pages_consulted": []
}
combined_wiki = "\n\n".join(wiki_content)
query_prompt = textwrap.dedent(f"""
You have access to the following compiled knowledge base.
Answer the question using only the information in the knowledge base.
Cite the specific wiki page(s) you used.
KNOWLEDGE BASE:
{combined_wiki}
QUESTION: {question}
ANSWER:
""").strip()
messages = [
{
"role": "system",
"content": (
"You are a precise assistant that answers questions "
"strictly from the provided knowledge base. "
"Always cite your sources."
)
},
{"role": "user", "content": query_prompt}
]
answer = self._call_llm(messages)
return {
"answer": answer,
"pages_consulted": [p.name for p in self.pages_dir.glob("*_compiled.md")]
}
def lint(self) -> str:
"""
Runs a "lint" pass over the wiki. The LLM reviews the compiled
pages for contradictions, gaps, stale information, and quality
issues. This is the "self-healing" aspect of the compiled knowledge
base that Karpathy describes.
Returns:
A lint report as a string.
"""
wiki_content = []
for page_file in self.pages_dir.glob("*_compiled.md"):
with open(page_file, "r", encoding="utf-8") as f:
wiki_content.append(f"=== {page_file.name} ===\n{f.read()}")
if not wiki_content:
return "No wiki pages to lint."
combined_wiki = "\n\n".join(wiki_content)
lint_prompt = textwrap.dedent(f"""
Review the following knowledge base wiki for quality issues.
Identify:
1. Contradictions between pages
2. Missing information or gaps
3. Stale or potentially outdated content
4. Inconsistent terminology or formatting
WIKI CONTENT:
{combined_wiki}
LINT REPORT:
""").strip()
messages = [{"role": "user", "content": lint_prompt}]
return self._call_llm(messages)
if __name__ == "__main__":
GEN_BACKEND = "ollama"
# --- Classic RAG Demo ---
print("=== Classic RAG Demo ===\n")
rag = SimpleRAGMemory(
collection_name="demo_knowledge",
persist_directory="./rag_demo_store"
)
rag.ingest_document(
text=textwrap.dedent("""
Python is a high-level, interpreted programming language known for
its clear syntax and readability. It supports multiple programming
paradigms including procedural, object-oriented, and functional
programming. Python's standard library is extensive, and its package
ecosystem (PyPI) contains hundreds of thousands of third-party packages.
Python is widely used in web development, data science, machine learning,
automation, and scientific computing.
""").strip(),
source_name="python_overview.txt"
)
rag.ingest_document(
text=textwrap.dedent("""
FastAPI is a modern, fast web framework for building APIs with Python,
based on standard Python type hints. It is built on top of Starlette
for the web parts and Pydantic for the data parts. FastAPI automatically
generates OpenAPI documentation and supports async/await natively.
It is one of the fastest Python web frameworks available.
""").strip(),
source_name="fastapi_overview.txt"
)
agent = RAGAgent(
rag_memory=rag,
backend=GEN_BACKEND,
model_name="llama3.2"
)
query = "What is FastAPI and what is it built on?"
print(f"\nQuery: {query}")
result = agent.answer(query)
print(f"\nAnswer: {result['answer']}")
print(f"Sources: {result['sources']}")
print(f"Chunks used: {result['chunks_used']}")
# --- Compiled Knowledge Base Demo ---
print("\n\n=== Compiled Knowledge Base Demo ===\n")
kb = CompiledKnowledgeBase(
wiki_directory="./compiled_wiki_demo",
backend=GEN_BACKEND,
model_name="llama3.2"
)
kb.ingest_and_compile(
raw_text=textwrap.dedent("""
So I was reading about vector databases the other day and they're
pretty cool. Basically they store embeddings — these high-dimensional
vectors that represent the semantic meaning of text. When you want to
find similar documents, you just embed your query and find the nearest
vectors. ChromaDB is a popular open-source one. Pinecone is a managed
cloud service. The main algorithms used are HNSW and IVF for approximate
nearest neighbor search.
""").strip(),
source_name="vector_db_notes.txt",
topic_hint="Vector databases and embedding-based retrieval"
)
result = kb.query("What algorithms do vector databases use for search?")
print(f"\nAnswer: {result['answer']}")
print(f"Pages consulted: {result['pages_consulted']}")
MEMORY TYPE 4: KV CACHE MEMORY (COMPUTATIONAL MEMORY)
The KV cache is the most technically esoteric of the four memory types, and it is the one that most developers interact with indirectly rather than explicitly. Understanding it is nonetheless important because it explains certain performance characteristics of LLM systems and opens up advanced optimization possibilities.
To understand the KV cache, you need a small piece of how the transformer attention mechanism works. For every token in the input sequence, the model computes three vectors: a Query vector (Q), a Key vector (K), and a Value vector (V). In a causal decoder, the attention mechanism takes the dot product of the current token's Query with the Keys of every token up to and including itself, and turns those scores into weights that determine how much attention to pay to each position. It then uses those attention weights to compute a weighted sum of the Value vectors.
The crucial insight is this: when you are generating a response token by token, the Key and Value vectors for all the tokens you have already processed do not change. They are the same on every generation step. Without caching, you would recompute them from scratch on every step, which is enormously wasteful. The KV cache solves this by storing the Key and Value vectors for every token that has been processed, so they can be reused on subsequent generation steps.
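This mechanism can be sketched in a few lines of NumPy: each decoding step appends one Key row and one Value row to a cache and reuses all earlier rows instead of recomputing them. This is a single-head toy model, with the learned Q/K/V projections collapsed to identity, purely to make the caching pattern visible:

```python
import numpy as np

def attend(q, K, V):
    # Single-head attention: one query vector against all cached Keys/Values.
    scores = K @ q / np.sqrt(q.shape[0])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

rng = np.random.default_rng(0)
d = 8
K_cache = np.zeros((0, d))
V_cache = np.zeros((0, d))
outputs = []
for step in range(4):
    x = rng.normal(size=d)            # stand-in for the new token's hidden state
    k = v = q = x                     # real models apply learned W_q, W_k, W_v here
    K_cache = np.vstack([K_cache, k]) # computed once, then reused every later step
    V_cache = np.vstack([V_cache, v])
    outputs.append(attend(q, K_cache, V_cache))
```

Without the cache, every loop iteration would recompute the K and V rows for all previous tokens; with it, each step does only the work for the newest token.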
In Karpathy's OS analogy, the KV cache is a form of computational working memory. It is not memory in the sense of storing facts or documents; it is memory in the sense of storing intermediate computational state that would be expensive to recompute. It is analogous to the CPU's L1/L2 cache: not the main RAM, but a fast, specialized cache that dramatically accelerates computation.
The KV cache has several important implications for agent system design.
First, it means that the cost of processing a long system prompt is paid only once per session, not once per token generated. If you have a 10,000-token system prompt, the KV cache stores the K and V vectors for all 10,000 tokens after the first forward pass. Subsequent generation steps reuse those cached vectors, making the effective cost of the long system prompt amortized over the entire conversation.
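The arithmetic behind this amortization is easy to sketch. The token counts below are illustrative assumptions, not measurements:

```python
# Prefill-cost amortization for a cached 10,000-token system prompt.
prompt_tokens = 10_000        # stable system prompt (cached after the first call)
new_tokens_per_call = 200     # assumed per-call user message + generated output
calls = 50

tokens_without_cache = calls * (prompt_tokens + new_tokens_per_call)
tokens_with_cache = prompt_tokens + calls * new_tokens_per_call

print(tokens_without_cache)   # 510000 tokens of prefill work
print(tokens_with_cache)      # 20000 tokens, roughly 25x less
```

The longer the stable prompt and the more calls that share it, the more lopsided this ratio becomes.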
Second, some LLM providers, including Anthropic and OpenAI, offer explicit "prompt caching" features that persist the KV cache across API calls. If you send the same long system prompt at the beginning of every API call, the provider can retain the computed KV state for that prefix and charge a reduced rate for subsequent calls that reuse it. This is a significant cost optimization for agents with large, stable system prompts.
Third, the KV cache is the primary memory bottleneck for long-context inference. Its size grows linearly with the context length, the number of layers, and the number of key-value heads (times the head dimension and the bytes per value). For very long contexts, the KV cache can consume tens of gigabytes of GPU memory, which is why long-context inference is expensive and why context window management matters so much.
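A back-of-the-envelope estimator makes this concrete. The sketch below assumes a grouped-query-attention model with illustrative, roughly Llama-8B-like dimensions; check your model's card for real values:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   dtype_bytes=2, batch_size=1):
    # Factor of 2: one K tensor and one V tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes * batch_size

# Assumed dimensions: 32 layers, 8 KV heads, head_dim 128,
# fp16 values (2 bytes), a 128k-token context, batch of 1.
size_gb = kv_cache_bytes(32, 8, 128, 128_000) / 1e9
print(f"{size_gb:.1f} GB")  # ~16.8 GB for a single sequence
```

At these dimensions a single long-context sequence already fills most of a 24 GB consumer GPU, before the weights themselves are counted.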
The key architectural principle is that the stable prefix — the part of the prompt that never changes — must always come first. If you put dynamic content before the stable system prompt, you invalidate the cache on every call, losing all the performance benefits. This is why the standard message structure for LLM APIs puts the system message first: it is the most stable part of the prompt and benefits the most from caching.
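One cheap guard against accidental prefix invalidation is to fingerprint the stable portion of the message list between calls: if the fingerprint changes, the cacheable prefix changed and the cache will be cold. A minimal sketch; the helper name and the one-message prefix are assumptions:

```python
import hashlib
import json

def prefix_fingerprint(messages: list[dict], stable_count: int = 1) -> str:
    # Hash the first `stable_count` messages. If this value differs between
    # two calls, the cacheable prefix changed and the KV cache is invalidated.
    blob = json.dumps(messages[:stable_count], sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:12]

call_1 = [{"role": "system", "content": "You are a helpful assistant."},
          {"role": "user", "content": "First question"}]
call_2 = [{"role": "system", "content": "You are a helpful assistant."},
          {"role": "user", "content": "A different question"}]

# Same system message -> same fingerprint -> the prefix is cache-reusable.
assert prefix_fingerprint(call_1) == prefix_fingerprint(call_2)
```

Logging this fingerprint alongside latency metrics makes it easy to spot a deploy that quietly edited the system prompt and tanked cache hit rates.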
For Ollama specifically, the keep_alive parameter controls how long the model, and with it the runner's KV cache, stays resident. When you set keep_alive to "10m", Ollama keeps the model loaded in GPU memory for 10 minutes after the last request. During that window, requests that share the same prompt prefix can reuse the cached prefix state and see dramatically reduced latency.
# kv_cache_aware_agent.py
# Demonstrates KV cache-aware prompt design and monitoring.
# Shows how to structure prompts to maximize cache hit rates,
# and how to use OpenAI's prompt caching feature.
# pip install openai requests tiktoken
import os
import time
import textwrap
from typing import Optional
import requests
class KVCacheAwareAgent:
"""
An agent designed to maximize KV cache efficiency.
Key design principles:
1. The system prompt (stable prefix) is always placed first and
never changes between calls, maximizing cache hits.
2. Dynamic content (user messages, retrieved context) is placed
after the stable prefix, where it does not invalidate the cache.
3. Cache performance metrics are tracked to measure efficiency.
This design is critical for production agents with large system
prompts, where cache hits can reduce latency by 50-80% and cost
by up to 90% (on providers that support prompt caching).
"""
def __init__(
self,
stable_system_prompt: str,
backend: str = "ollama",
model_name: str = "llama3.2",
ollama_base_url: str = "http://localhost:11434",
openai_api_key: Optional[str] = None,
openai_model: str = "gpt-4o-mini"
):
"""
Args:
stable_system_prompt: The system prompt that remains constant
across all calls. This is the "cacheable
prefix" that benefits from KV caching.
Make this as large and stable as possible.
"""
self.stable_system_prompt = stable_system_prompt
self.backend = backend
self.model_name = model_name
self.ollama_base_url = ollama_base_url
# Metrics tracking for cache performance analysis
self.call_count = 0
self.total_latency_ms = 0.0
self.cached_tokens_total = 0
self.uncached_tokens_total = 0
if backend == "openai":
from openai import OpenAI
api_key = openai_api_key or os.environ.get("OPENAI_API_KEY")
self.openai_client = OpenAI(api_key=api_key)
self.openai_model = openai_model
# Ollama handles KV caching internally and automatically
def _estimate_tokens(self, text: str) -> int:
"""Estimates token count using tiktoken."""
try:
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
return len(enc.encode(text))
except ImportError:
return int(len(text.split()) * 1.3)
def call_with_cache_awareness(
self,
user_message: str,
dynamic_context: Optional[str] = None
) -> dict:
"""
Makes an LLM call with KV cache-optimized prompt structure.
The prompt is structured as:
[STABLE SYSTEM PROMPT] <- cached after first call
[DYNAMIC CONTEXT] <- changes per call, not cached
[USER MESSAGE] <- changes per call, not cached
The stable system prompt must always come first and must not
change between calls. Any change to the stable prefix invalidates
the entire cache.
Args:
user_message: The user's current message.
dynamic_context: Optional context that changes per call
(e.g., retrieved documents, tool results).
"""
# Build the user turn: dynamic context + actual question
if dynamic_context:
user_content = (
f"DYNAMIC CONTEXT:\n{dynamic_context}\n\n"
f"USER MESSAGE:\n{user_message}"
)
else:
user_content = user_message
messages = [
{"role": "system", "content": self.stable_system_prompt},
{"role": "user", "content": user_content}
]
start_time = time.time()
cache_metrics = {}
if self.backend == "ollama":
response = requests.post(
f"{self.ollama_base_url}/api/chat",
json={
"model": self.model_name,
"messages": messages,
"stream": False,
# keep_alive keeps the model (and its KV cache) warm
# in GPU memory between requests.
"keep_alive": "10m"
},
timeout=120
)
response.raise_for_status()
data = response.json()
answer = data["message"]["content"]
# Ollama returns timing information we can use to infer
# cache behavior. A very fast prompt evaluation time
# suggests the KV cache was warm.
eval_duration_ns = data.get("prompt_eval_duration", 0)
cache_metrics = {
"prompt_eval_ms": eval_duration_ns / 1_000_000,
"total_duration_ms": data.get("total_duration", 0) / 1_000_000,
"prompt_tokens": data.get("prompt_eval_count", 0),
"response_tokens": data.get("eval_count", 0),
# A very low prompt_eval_ms relative to token count
# is a strong signal that the KV cache was hit.
"cache_likely_hit": (
eval_duration_ns > 0
and (eval_duration_ns / 1_000_000)
< data.get("prompt_eval_count", 1) * 0.5
)
}
else: # OpenAI backend
response = self.openai_client.chat.completions.create(
model=self.openai_model,
messages=messages
)
answer = response.choices[0].message.content
usage = response.usage
cached_tokens = getattr(
getattr(usage, "prompt_tokens_details", None),
"cached_tokens", 0
) or 0
uncached_tokens = usage.prompt_tokens - cached_tokens
self.cached_tokens_total += cached_tokens
self.uncached_tokens_total += uncached_tokens
cache_metrics = {
"cached_tokens": cached_tokens,
"uncached_tokens": uncached_tokens,
"cache_rate": (
cached_tokens / usage.prompt_tokens
if usage.prompt_tokens > 0 else 0
)
}
latency_ms = (time.time() - start_time) * 1000
self.call_count += 1
self.total_latency_ms += latency_ms
return {
"response": answer,
"latency_ms": round(latency_ms, 1),
"cache_metrics": cache_metrics
}
def get_performance_report(self) -> str:
"""Returns a summary of cache performance across all calls."""
if self.call_count == 0:
return "No calls made yet."
avg_latency = self.total_latency_ms / self.call_count
total_tokens = self.cached_tokens_total + self.uncached_tokens_total
cache_rate = (
self.cached_tokens_total / total_tokens
if total_tokens > 0 else 0
)
return textwrap.dedent(f"""
KV Cache Performance Report
===========================
Total calls: {self.call_count}
Average latency: {avg_latency:.1f} ms
Total cached tokens: {self.cached_tokens_total:,}
Total uncached tokens: {self.uncached_tokens_total:,}
Overall cache rate: {cache_rate:.1%}
Estimated cost saving: {cache_rate * 90:.1f}% (vs no caching)
""").strip()
if __name__ == "__main__":
# This large, stable system prompt is the "cacheable prefix".
# In a real system, this might contain tool definitions, company
# policies, domain knowledge, or a compiled knowledge base excerpt.
LARGE_STABLE_PROMPT = textwrap.dedent("""
You are an expert software engineering assistant with comprehensive
knowledge of:
LANGUAGES & RUNTIMES:
- Python, TypeScript, Go, Rust, Java
- CPython internals, V8, GraalVM
FRAMEWORKS & LIBRARIES:
- FastAPI, Django, Flask (Python web)
- React, Next.js, Vue (frontend)
- PyTorch, JAX, scikit-learn (ML)
INFRASTRUCTURE:
- Docker, Kubernetes, Helm
- AWS, GCP, Azure cloud services
- PostgreSQL, Redis, Kafka, Elasticsearch
STANDARDS & PRACTICES:
- REST, GraphQL, gRPC API design
- CI/CD pipelines and GitOps
- Twelve-Factor App methodology
- OWASP security guidelines
RESPONSE GUIDELINES:
- Always specify the exact version when relevant.
- Flag security-critical information with [SECURITY] prefix.
- Provide code examples when applicable.
- Reference the relevant documentation or RFC when known.
- Use metric units throughout.
""").strip()
agent = KVCacheAwareAgent(
stable_system_prompt=LARGE_STABLE_PROMPT,
backend="ollama",
model_name="llama3.2"
)
# Make several calls. After the first call, Ollama's internal KV
# cache should be warm, and subsequent calls should be faster.
questions = [
"What is the difference between asyncio and threading in Python?",
"How do I implement rate limiting in a FastAPI application?",
"What are the trade-offs between PostgreSQL and Redis for session storage?"
]
for question in questions:
print(f"\nQ: {question}")
result = agent.call_with_cache_awareness(question)
print(f"A: {result['response'][:200]}...")
print(f" Latency: {result['latency_ms']} ms")
print(f" Cache metrics: {result['cache_metrics']}")
print(f"\n{agent.get_performance_report()}")
CHAPTER FOUR: PUTTING IT ALL TOGETHER — A UNIFIED MEMORY ARCHITECTURE
Now that we have explored all four memory types individually, let us build a unified agent that uses all four types simultaneously, as a real production agent would. This architecture demonstrates how the four memory types complement each other: in-weights memory provides the foundation, in-context memory provides the working space, external retrieval provides the knowledge base, and KV cache optimization provides the performance.
# unified_memory_agent.py
# A production-quality agent that orchestrates all four types of memory.
# pip install chromadb openai requests tiktoken
import os
import time
import hashlib
import textwrap
from dataclasses import dataclass, field
from typing import Optional
import requests
@dataclass
class MemoryEntry:
"""
Represents a single entry in the agent's episodic memory.
Episodic memory records specific events and interactions,
allowing the agent to recall what happened and when.
"""
content: str
memory_type: str # "episodic", "semantic", or "procedural"
source: str # Where this memory came from
timestamp: float = field(default_factory=time.time)
importance: float = 1.0 # 0.0 to 1.0; higher = more important
def to_dict(self) -> dict:
return {
"content": self.content,
"memory_type": self.memory_type,
"source": self.source,
"timestamp": self.timestamp,
"importance": self.importance
}
class UnifiedMemoryAgent:
"""
A production-quality agent that orchestrates all four types of
memory described in Karpathy's LLM OS taxonomy.
Memory architecture:
- In-weights: The base LLM's parametric knowledge (implicit)
- In-context: Rolling conversation history (explicit, managed)
- External: ChromaDB vector store for long-term knowledge (explicit)
- KV cache: Stable system prompt prefix (implicit, optimized)
The agent follows a "memory-first" design philosophy: before
generating any response, it always consults its external memory
to ground its answer in retrieved facts rather than relying solely
on its parametric memory (which may be outdated or incomplete).
"""
# The stable system prompt is the KV-cacheable prefix. It is designed
# to be large, stable, and placed first in every API call.
STABLE_SYSTEM_PROMPT = textwrap.dedent("""
You are a knowledgeable, helpful assistant with access to a
persistent memory system. You have four types of memory:
1. PARAMETRIC KNOWLEDGE: Your training data (always available).
2. CONVERSATION HISTORY: What has been discussed in this session.
3. RETRIEVED KNOWLEDGE: Facts retrieved from your knowledge base.
4. PROCEDURAL KNOWLEDGE: How to perform specific tasks.
When answering questions:
- Prioritize RETRIEVED KNOWLEDGE over PARAMETRIC KNOWLEDGE.
- Always cite the source of retrieved information.
- If you are using parametric knowledge, say so explicitly.
- Be concise, precise, and honest about uncertainty.
- If you don't know something, say so rather than guessing.
You maintain continuity across conversations by storing important
facts in your external memory. When the user shares important
information (preferences, facts, corrections), acknowledge that
you are storing it for future reference.
""").strip()
def __init__(
self,
backend: str = "ollama",
model_name: str = "llama3.2",
embedding_model: str = "nomic-embed-text",
ollama_base_url: str = "http://localhost:11434",
openai_api_key: Optional[str] = None,
openai_model: str = "gpt-4o-mini",
openai_embedding_model: str = "text-embedding-3-small",
memory_store_path: str = "./unified_agent_memory",
max_context_tokens: int = 3000,
n_retrieved_chunks: int = 4
):
self.backend = backend
self.model_name = model_name
self.embedding_model = embedding_model
self.ollama_base_url = ollama_base_url
self.max_context_tokens = max_context_tokens
self.n_retrieved_chunks = n_retrieved_chunks
# Initialize the external memory store (ChromaDB)
import chromadb
self.chroma_client = chromadb.PersistentClient(path=memory_store_path)
self.knowledge_collection = self.chroma_client.get_or_create_collection(
name="knowledge_base",
metadata={"hnsw:space": "cosine"}
)
self.episodic_collection = self.chroma_client.get_or_create_collection(
name="episodic_memory",
metadata={"hnsw:space": "cosine"}
)
if backend == "openai":
from openai import OpenAI
api_key = openai_api_key or os.environ.get("OPENAI_API_KEY")
self.openai_client = OpenAI(api_key=api_key)
self.openai_model = openai_model
self.openai_embedding_model = openai_embedding_model
# In-context memory: the conversation history.
# The system message (stable prefix) is always index 0.
self.conversation_history: list[dict] = [
{"role": "system", "content": self.STABLE_SYSTEM_PROMPT}
]
def _get_embedding(self, text: str) -> list[float]:
"""Generates an embedding for the given text."""
if self.backend == "ollama":
response = requests.post(
f"{self.ollama_base_url}/api/embeddings",
json={"model": self.embedding_model, "prompt": text},
timeout=60
)
response.raise_for_status()
return response.json()["embedding"]
else:
response = self.openai_client.embeddings.create(
model=self.openai_embedding_model,
input=text
)
return response.data[0].embedding
def _estimate_tokens(self, text: str) -> int:
"""Estimates token count."""
try:
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
return len(enc.encode(text))
except ImportError:
return int(len(text.split()) * 1.3)
def _call_llm(self, messages: list[dict]) -> str:
"""Calls the configured LLM and returns the response text."""
if self.backend == "ollama":
response = requests.post(
f"{self.ollama_base_url}/api/chat",
json={
"model": self.model_name,
"messages": messages,
"stream": False,
"keep_alive": "10m" # Keep model warm for KV cache reuse
},
timeout=180
)
response.raise_for_status()
return response.json()["message"]["content"]
else:
response = self.openai_client.chat.completions.create(
model=self.openai_model,
messages=messages
)
return response.choices[0].message.content
def _retrieve_relevant_memories(self, query: str) -> str:
"""
Retrieves relevant memories from both the knowledge base and
episodic memory, and formats them for injection into the prompt.
"""
results = []
query_embedding = self._get_embedding(query)
# Search the knowledge base (semantic memory)
if self.knowledge_collection.count() > 0:
kb_results = self.knowledge_collection.query(
query_embeddings=[query_embedding],
n_results=min(self.n_retrieved_chunks,
self.knowledge_collection.count()),
include=["documents", "metadatas", "distances"]
)
for doc, meta, dist in zip(
kb_results["documents"][0],
kb_results["metadatas"][0],
kb_results["distances"][0]
):
if dist < 0.5: # Only include sufficiently relevant results
results.append(
f"[Knowledge Base | {meta.get('category', 'general')} | "
f"relevance: {1 - dist:.2f}]\n{doc}"
)
# Search episodic memory
if self.episodic_collection.count() > 0:
ep_results = self.episodic_collection.query(
query_embeddings=[query_embedding],
n_results=min(2, self.episodic_collection.count()),
include=["documents", "metadatas", "distances"]
)
for doc, meta, dist in zip(
ep_results["documents"][0],
ep_results["metadatas"][0],
ep_results["distances"][0]
):
if dist < 0.4:
results.append(
f"[Episodic Memory | {meta.get('source', 'session')} | "
f"relevance: {1 - dist:.2f}]\n{doc}"
)
if not results:
return ""
return (
"=== RETRIEVED MEMORIES ===\n"
+ "\n\n".join(results)
+ "\n=== END RETRIEVED MEMORIES ==="
)
def _store_episodic_memory(self, content: str, source: str) -> None:
"""
Stores a piece of information in the episodic memory collection.
This is how the agent "remembers" things that have scrolled out
of its context window.
"""
entry_id = hashlib.sha256(
f"{source}::{content[:100]}".encode()
).hexdigest()[:16]
embedding = self._get_embedding(content)
self.episodic_collection.upsert(
ids=[entry_id],
embeddings=[embedding],
documents=[content],
metadatas=[{"source": source, "stored_at": time.time()}]
)
def _prune_conversation_history(self) -> None:
"""
Prunes the conversation history to stay within the token budget.
The system message (index 0) is always preserved.
Before dropping old messages, important facts are extracted and
stored in episodic memory so they are not permanently lost.
"""
total_tokens = sum(
self._estimate_tokens(msg["content"])
for msg in self.conversation_history
)
while (
total_tokens > self.max_context_tokens
and len(self.conversation_history) > 2
):
old_message = self.conversation_history.pop(1)
# Memory consolidation: move important content to episodic
# memory before discarding from the context window.
if len(old_message["content"]) > 50:
self._store_episodic_memory(
content=old_message["content"],
source=f"conversation_history_{old_message['role']}"
)
print(
f"[MEMORY] Consolidated old {old_message['role']} "
f"message into episodic memory."
)
total_tokens = sum(
self._estimate_tokens(msg["content"])
for msg in self.conversation_history
)
def remember_fact(self, fact: str, category: str = "general") -> None:
"""
Explicitly stores a fact in the knowledge base. This is the
agent's "semantic memory" store: general facts and knowledge
that should persist across sessions.
"""
fact_id = hashlib.sha256(fact.encode()).hexdigest()[:16]
embedding = self._get_embedding(fact)
self.knowledge_collection.upsert(
ids=[fact_id],
embeddings=[embedding],
documents=[fact],
metadatas=[{"category": category, "stored_at": time.time()}]
)
print(f"[MEMORY] Stored fact: '{fact[:60]}...'")
def chat(self, user_input: str) -> str:
"""
The main agent interaction loop. Integrates all four memory types:
1. In-weights memory: The LLM's parametric knowledge is always
available implicitly through the model itself.
2. External retrieval: Before generating a response, we retrieve
relevant memories and inject them into the context.
3. In-context memory: The conversation history is maintained and
injected into every call, giving the model continuity.
4. KV cache: The stable system prompt (STABLE_SYSTEM_PROMPT) is
always first, maximizing cache hit rates across calls.
"""
# Step 1: Retrieve relevant memories from external storage.
retrieved_context = self._retrieve_relevant_memories(user_input)
# Step 2: Build the augmented user message.
if retrieved_context:
augmented_user_message = (
f"{retrieved_context}\n\n"
f"Based on the above retrieved memories and your own knowledge, "
f"please answer:\n\n{user_input}"
)
else:
augmented_user_message = user_input
# Step 3: Add to conversation history (in-context memory).
self.conversation_history.append({
"role": "user",
"content": augmented_user_message
})
# Step 4: Prune history if it exceeds the token budget.
self._prune_conversation_history()
# Step 5: Call the LLM with the full context.
response_text = self._call_llm(self.conversation_history)
# Step 6: Add the assistant's response to conversation history.
self.conversation_history.append({
"role": "assistant",
"content": response_text
})
# Step 7: Store the user's input in episodic memory.
self._store_episodic_memory(
content=f"User said: {user_input}",
source="current_session"
)
return response_text
def get_memory_status(self) -> dict:
"""Returns a summary of the agent's current memory state."""
context_tokens = sum(
self._estimate_tokens(msg["content"])
for msg in self.conversation_history
)
return {
"in_context_messages": len(self.conversation_history),
"in_context_tokens_approx": context_tokens,
"context_budget": self.max_context_tokens,
"context_utilization": f"{context_tokens / self.max_context_tokens:.1%}",
"knowledge_base_entries": self.knowledge_collection.count(),
"episodic_memory_entries": self.episodic_collection.count(),
"backend": self.backend,
"model": self.model_name
}
if __name__ == "__main__":
print("=== Unified Memory Agent Demo ===\n")
agent = UnifiedMemoryAgent(
backend="ollama",
model_name="llama3.2",
embedding_model="nomic-embed-text",
memory_store_path="./unified_demo_memory",
max_context_tokens=2500
)
# Pre-load some semantic knowledge into the knowledge base.
agent.remember_fact(
"The user's name is Jordan and they are a senior backend engineer.",
category="user_profile"
)
agent.remember_fact(
"Jordan prefers Go for high-throughput services and Python for data pipelines.",
category="user_preferences"
)
agent.remember_fact(
"The current project uses PostgreSQL 15 and Redis 7 for caching.",
category="project_context"
)
# Turn 1: A question that benefits from retrieved knowledge
print("Turn 1:")
response = agent.chat(
"Can you help me choose a language for a new microservice?"
)
print(f"Agent: {response}\n")
# Turn 2: Follow-up that tests conversational continuity
print("Turn 2:")
response = agent.chat(
"What database should I use for storing session data in that service?"
)
print(f"Agent: {response}\n")
print("\n=== Memory Status ===")
import json
print(json.dumps(agent.get_memory_status(), indent=2))
The unified agent above demonstrates the full interplay of all four memory types in a single coherent system. The stable system prompt at the top of every message list is the KV cache optimization. The conversation history maintained in self.conversation_history is the in-context memory. The ChromaDB collections are the external retrieval memory. And the underlying LLM's parametric knowledge is the in-weights memory, always present as the foundation on which everything else rests.
The memory consolidation step — where old conversation history is moved to episodic memory before being pruned from the context window — is particularly important. Without this step, information that scrolls out of the context window is permanently lost. With it, the agent can retrieve that information later if it becomes relevant again, giving the agent a form of long-term memory that persists across sessions.
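This consolidate-before-prune flow can be sketched in isolation. The toy below uses a plain list as the context window and a dict as the long-term store; `prune_with_consolidation` and all names here are illustrative, not part of the agent class above.

```python
# Toy sketch of consolidate-before-prune: a list stands in for the
# context window, a dict stands in for the episodic store.

def prune_with_consolidation(history, long_term, max_messages):
    """Drop the oldest non-system messages, archiving them first."""
    while len(history) > max_messages and len(history) > 2:
        old = history.pop(1)  # index 0 is the system message
        # Consolidate before discarding so nothing is permanently lost.
        key = f"{old['role']}:{len(long_term)}"
        long_term[key] = old["content"]
    return history, long_term

history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "My API key rotates on Fridays."},
    {"role": "assistant", "content": "Noted."},
    {"role": "user", "content": "What day does my key rotate?"},
]
store = {}
history, store = prune_with_consolidation(history, store, max_messages=3)
print(len(history))           # 3
print(list(store.values()))   # ['My API key rotates on Fridays.']
```

The pruned fact is gone from the context window but recoverable from the store, which is exactly the property the agent's episodic collection provides at scale.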
CHAPTER FIVE: CONTEXT ENGINEERING — THE ART OF FILLING THE CONTEXT WINDOW
Karpathy has described context engineering as "the delicate art and science of filling the context window with just the right information for the next step." This is a more mature and more powerful concept than the earlier notion of prompt engineering. Prompt engineering is about writing clever instructions. Context engineering is about designing the entire information ecosystem that the model reasons within.
There are several key principles of good context engineering that every agentic developer should internalize.
The first principle is relevance over completeness. It is tempting to give the model as much information as possible, on the theory that more context means better answers. But this is wrong. Irrelevant information in the context window is not neutral; it actively degrades performance by distracting the model and increasing the probability of the "lost in the middle" phenomenon, where the model fails to attend to information in the middle of a very long context. You should retrieve and inject only the information that is directly relevant to the current query.
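A minimal sketch of this principle, assuming cosine distances returned by a vector search; the 0.5 threshold, the cap of three chunks, and all names are illustrative values to tune per corpus:

```python
# "Relevance over completeness": filter retrieved chunks by a distance
# threshold and a hard cap instead of injecting everything.

def filter_relevant(hits, max_distance=0.5, max_chunks=3):
    """Keep only the closest chunks, never more than max_chunks."""
    relevant = [(doc, d) for doc, d in hits if d < max_distance]
    relevant.sort(key=lambda pair: pair[1])  # closest first
    return [doc for doc, _ in relevant[:max_chunks]]

hits = [
    ("FastAPI supports async handlers.", 0.21),
    ("The office coffee machine is broken.", 0.83),
    ("Pydantic v2 has a Rust core.", 0.44),
]
print(filter_relevant(hits))
# ['FastAPI supports async handlers.', 'Pydantic v2 has a Rust core.']
```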
The second principle is structure over prose. Information injected into the context window should be structured clearly, with labels, headers, and delimiters that make it easy for the model to identify what is what. A block of retrieved text labeled [Source: API Design Guide | Relevance: 0.92] is far more useful than the same text without labels, because the model can use the label to calibrate how much to trust and prioritize the information.
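A sketch of this labeling convention; the exact label format and delimiters are an illustrative choice, not a requirement of any API:

```python
# "Structure over prose": wrap each retrieved chunk in a labeled,
# delimited block so the model can tell sources apart and weigh
# them by relevance.

def format_block(text: str, source: str, relevance: float) -> str:
    return f"[Source: {source} | Relevance: {relevance:.2f}]\n{text}"

def assemble(chunks: list[tuple[str, str, float]]) -> str:
    body = "\n\n".join(format_block(t, s, r) for t, s, r in chunks)
    return f"=== RETRIEVED CONTEXT ===\n{body}\n=== END RETRIEVED CONTEXT ==="

context = assemble([
    ("Use connection pooling for PostgreSQL.", "db_guide.md", 0.92),
    ("Prefer UUIDv7 keys for new tables.", "schema_notes.md", 0.71),
])
print(context)
```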
The third principle is recency bias. In a conversation history, the most recent messages are usually the most relevant. When you must prune the context, prune the oldest messages first. This mirrors human working memory: recent events are vivid and accessible, while older ones fade.
The fourth principle is stable prefix first. As discussed in the KV cache section, the stable parts of the prompt should always come first. This means the system message, which contains the agent's identity, instructions, and any large, stable knowledge blocks, should always be at index 0 in the message list, and it should never change between calls.
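The cache-friendliness of a stable prefix can be made concrete: if the system message is byte-for-byte identical across calls and dynamic content always comes last, consecutive prompts share a long common prefix. The sketch below (all names illustrative) measures that shared prefix over serialized message lists.

```python
# "Stable prefix first": consecutive prompts built this way share a
# long common prefix, which is exactly what a KV cache can reuse.
import json

SYSTEM = {"role": "system", "content": "You are a code review assistant."}

def build_prompt(history, new_user_message):
    # Stable system message first, dynamic content last.
    return [SYSTEM] + history + [{"role": "user", "content": new_user_message}]

def shared_prefix_chars(a, b):
    """Length of the common character prefix of two serialized prompts."""
    n = 0
    for ca, cb in zip(json.dumps(a), json.dumps(b)):
        if ca != cb:
            break
        n += 1
    return n

call_1 = build_prompt([], "Review this diff.")
turns = [
    {"role": "user", "content": "Review this diff."},
    {"role": "assistant", "content": "Looks fine."},
]
call_2 = build_prompt(turns, "Now check the tests.")
# All of call_1 except its closing bracket is a reusable prefix of call_2.
print(shared_prefix_chars(call_1, call_2))
```

Had the system message changed between calls, or had dynamic content been prepended, the shared prefix would collapse to almost nothing and every token would be recomputed.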
# context_engineer.py
# Implements all four context engineering principles in a reusable utility.
# pip install tiktoken
from dataclasses import dataclass
@dataclass
class ContextBlock:
"""
Represents a single block of content to be included in the context.
Blocks are ranked by priority; higher priority blocks are included
first when the context window is limited.
"""
content: str
label: str
priority: int # 1 = highest priority, 10 = lowest
block_type: str # "system", "retrieved", "history", "tool_result"
relevance_score: float = 1.0 # 0.0 to 1.0
class ContextEngineer:
"""
Constructs optimal context windows by applying Karpathy's context
engineering principles:
1. Relevance over completeness: only include what matters.
2. Structure over prose: use clear labels and delimiters.
3. Recency bias: prefer recent information when pruning.
4. Stable prefix first: system message always comes first.
"""
def __init__(self, token_budget: int = 4096):
"""
Args:
token_budget: The maximum number of tokens for the entire
context window, including the response.
Leave headroom for the model's response
(typically 512-2048 tokens).
"""
self.token_budget = token_budget
self.blocks: list[ContextBlock] = []
def _estimate_tokens(self, text: str) -> int:
"""Estimates token count."""
try:
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
return len(enc.encode(text))
except ImportError:
return int(len(text.split()) * 1.3)
def add_block(self, block: ContextBlock) -> None:
"""Adds a context block to the pool of available content."""
self.blocks.append(block)
def add_system_prompt(self, content: str) -> None:
"""
Adds the system prompt as the highest-priority block.
The system prompt is always included and always comes first.
"""
self.add_block(ContextBlock(
content=content,
label="SYSTEM",
priority=1,
block_type="system",
relevance_score=1.0
))
def add_retrieved_knowledge(
self,
content: str,
source: str,
relevance: float
) -> None:
"""
Adds a retrieved knowledge chunk. Retrieved chunks are sorted
by relevance score; the most relevant chunks are included first
when the context window is limited.
"""
self.add_block(ContextBlock(
content=content,
label=f"RETRIEVED | {source} | relevance={relevance:.2f}",
priority=3,
block_type="retrieved",
relevance_score=relevance
))
def add_conversation_turn(
self,
role: str,
content: str,
recency_rank: int
) -> None:
"""
Adds a conversation history turn. More recent turns have lower
priority numbers (higher priority) to implement recency bias.
"""
self.add_block(ContextBlock(
content=content,
label=f"HISTORY | {role.upper()} | recency_rank={recency_rank}",
priority=4 + recency_rank,
block_type="history",
relevance_score=1.0 / recency_rank
))
def build_context(self) -> list[dict]:
"""
Builds the optimal context window from the available blocks,
respecting the token budget and applying all four principles.
Returns:
A list of message dicts ready for the LLM API.
"""
# Separate the system block (always included first)
system_blocks = [b for b in self.blocks if b.block_type == "system"]
other_blocks = [b for b in self.blocks if b.block_type != "system"]
# Sort other blocks: by priority first, then by relevance score
other_blocks.sort(key=lambda b: (b.priority, -b.relevance_score))
# Calculate remaining budget after the system prompt
system_content = "\n\n".join(b.content for b in system_blocks)
remaining_budget = (
self.token_budget - self._estimate_tokens(system_content)
)
# Greedily include other blocks until the budget is exhausted
included_blocks = []
for block in other_blocks:
block_tokens = self._estimate_tokens(block.content)
if block_tokens <= remaining_budget:
included_blocks.append(block)
remaining_budget -= block_tokens
else:
print(
f"[CONTEXT] Dropped block '{block.label}' "
f"({block_tokens} tokens, budget remaining: {remaining_budget})"
)
# Assemble the final message list
messages = [{"role": "system", "content": system_content}]
tool_results = [b for b in included_blocks if b.block_type == "tool_result"]
retrieved = [b for b in included_blocks if b.block_type == "retrieved"]
history = sorted(
[b for b in included_blocks if b.block_type == "history"],
key=lambda b: -b.relevance_score # Most recent first
)
context_parts = []
if retrieved:
context_parts.append("=== RETRIEVED KNOWLEDGE ===")
for block in retrieved:
context_parts.append(f"[{block.label}]\n{block.content}")
context_parts.append("=== END RETRIEVED KNOWLEDGE ===")
if tool_results:
context_parts.append("=== TOOL RESULTS ===")
for block in tool_results:
context_parts.append(f"[{block.label}]\n{block.content}")
context_parts.append("=== END TOOL RESULTS ===")
if context_parts:
messages.append({
"role": "user",
"content": "\n\n".join(context_parts)
})
for block in reversed(history):
role = "user" if "USER" in block.label else "assistant"
messages.append({"role": role, "content": block.content})
return messages
if __name__ == "__main__":
engineer = ContextEngineer(token_budget=2000)
# Add the stable system prompt (always first, always included)
engineer.add_system_prompt(
"You are a helpful software engineering assistant. "
"Always cite your sources and flag performance-critical information."
)
# Add retrieved knowledge chunks (sorted by relevance)
engineer.add_retrieved_knowledge(
content="FastAPI supports async/await natively via Starlette.",
source="fastapi_docs.txt",
relevance=0.95
)
engineer.add_retrieved_knowledge(
content="Pydantic v2 introduced a Rust-based validation core.",
source="pydantic_release_notes.txt",
relevance=0.72
)
engineer.add_retrieved_knowledge(
content="gRPC uses HTTP/2 and Protocol Buffers for transport.",
source="grpc_overview.txt",
relevance=0.61
)
# Add conversation history (most recent = rank 1)
engineer.add_conversation_turn(
role="user",
content="What web frameworks does Python support?",
recency_rank=2
)
engineer.add_conversation_turn(
role="assistant",
content="Python supports FastAPI, Django, Flask, and many others.",
recency_rank=1
)
# Build and display the optimized context
messages = engineer.build_context()
print("\n=== Built Context ===")
for i, msg in enumerate(messages):
preview = msg["content"][:100].replace("\n", " ")
print(f" [{i}] role={msg['role']:10s} | {preview}...")
CHAPTER SIX: THE MEMORY LIFECYCLE — CONSOLIDATION, FORGETTING, AND UPDATING
One of the most sophisticated aspects of Karpathy's memory framework is the recognition that memory is not just about storage and retrieval. It is also about the lifecycle of information: how memories are formed, how they are consolidated from short-term to long-term storage, how they are updated when new information contradicts old information, and how they are forgotten when they are no longer relevant.
Human memory is not a perfect recording device. It is a dynamic, constructive system that continuously reorganizes, updates, and prunes information based on relevance, recency, and emotional salience. Effective agent memory systems should aspire to similar dynamics.
Memory consolidation is the process of moving information from in-context memory (short-term) to external retrieval memory (long-term). This happens naturally when the context window fills up and old messages must be pruned. But it should also happen proactively: when the agent encounters important information, it should explicitly store it in the knowledge base rather than relying on it remaining in the context window.
Memory updating is the process of revising stored information when new information contradicts it. This is one of the hardest problems in agent memory design. A naive system will simply add the new information alongside the old, resulting in a knowledge base full of contradictions. A sophisticated system will detect the contradiction and resolve it, either by updating the old entry, flagging it for human review, or using the LLM to synthesize a reconciled version.
Memory forgetting is the process of removing information that is no longer relevant. This is important for performance (a smaller knowledge base is faster to search) and for accuracy (stale information can mislead the model). The lint operation in the compiled knowledge base example above is one implementation of this concept.
# memory_lifecycle.py
# Implements memory consolidation, updating, and forgetting for LLM agents.
# Demonstrates the full lifecycle of agent memory management.
# pip install chromadb openai requests
import os
import time
import hashlib
import textwrap
from typing import Optional
import requests
class MemoryLifecycleManager:
"""
Manages the full lifecycle of agent memory:
- Consolidation: short-term -> long-term memory transfer
- Updating: revising stored memories when contradictions arise
- Forgetting: removing stale or irrelevant memories
This implements the "memory as a first-class citizen" philosophy
that Karpathy advocates for agentic AI systems.
"""
def __init__(
self,
backend: str = "ollama",
model_name: str = "llama3.2",
embedding_model: str = "nomic-embed-text",
ollama_base_url: str = "http://localhost:11434",
openai_api_key: Optional[str] = None,
openai_model: str = "gpt-4o-mini",
memory_path: str = "./lifecycle_memory"
):
import chromadb
self.backend = backend
self.model_name = model_name
self.embedding_model = embedding_model
self.ollama_base_url = ollama_base_url
self.client = chromadb.PersistentClient(path=memory_path)
self.memories = self.client.get_or_create_collection(
name="memories",
metadata={"hnsw:space": "cosine"}
)
if backend == "openai":
from openai import OpenAI
api_key = openai_api_key or os.environ.get("OPENAI_API_KEY")
self.openai_client = OpenAI(api_key=api_key)
self.openai_model = openai_model
def _get_embedding(self, text: str) -> list[float]:
"""Generates an embedding for the given text."""
if self.backend == "ollama":
response = requests.post(
f"{self.ollama_base_url}/api/embeddings",
json={"model": self.embedding_model, "prompt": text},
timeout=60
)
response.raise_for_status()
return response.json()["embedding"]
else:
response = self.openai_client.embeddings.create(
model="text-embedding-3-small", input=text
)
return response.data[0].embedding
def _call_llm(self, prompt: str) -> str:
"""Calls the LLM for reasoning tasks (contradiction detection, etc.)."""
messages = [{"role": "user", "content": prompt}]
if self.backend == "ollama":
response = requests.post(
f"{self.ollama_base_url}/api/chat",
json={"model": self.model_name, "messages": messages, "stream": False},
timeout=120
)
response.raise_for_status()
return response.json()["message"]["content"]
else:
response = self.openai_client.chat.completions.create(
model=self.openai_model, messages=messages
)
return response.choices[0].message.content
def consolidate(self, content: str, importance: float = 0.5) -> str:
"""
Consolidates a piece of information into long-term memory.
Before storing, checks for existing contradictory memories
and resolves conflicts using the LLM.
Args:
content: The information to consolidate.
importance: How important this memory is (0.0 to 1.0).
Higher importance memories are retained longer
during forgetting operations.
Returns:
The ID of the stored memory entry.
"""
embedding = self._get_embedding(content)
similar = self.memories.query(
query_embeddings=[embedding],
n_results=min(3, max(1, self.memories.count())),
include=["documents", "metadatas", "distances"]
)
        # Check for potential contradictions among similar memories
        if similar["documents"][0]:
            for existing_id, existing_doc, dist in zip(
                similar["ids"][0],
                similar["documents"][0],
                similar["distances"][0]
            ):
                if dist < 0.2:
                    contradiction_check = self._call_llm(textwrap.dedent(f"""
                        Do these two statements contradict each other?
                        Answer with only "YES" or "NO" followed by a brief explanation.
                        Statement A: {existing_doc}
                        Statement B: {content}
                    """).strip())
                    if contradiction_check.upper().startswith("YES"):
                        print(
                            f"[MEMORY] Contradiction detected!\n"
                            f"  Existing: {existing_doc[:80]}...\n"
                            f"  New: {content[:80]}..."
                        )
                        reconciled = self._call_llm(textwrap.dedent(f"""
                            These two statements contradict each other.
                            Synthesize a single, accurate statement that
                            reconciles them. If one is clearly more recent
                            or more authoritative, prefer it.
                            Statement A (older): {existing_doc}
                            Statement B (newer): {content}
                            Reconciled statement:
                        """).strip())
                        content = reconciled.strip()
                        # Re-embed so the stored vector matches the
                        # reconciled text, and delete the superseded entry
                        # so the contradiction does not linger in the store.
                        embedding = self._get_embedding(content)
                        self.memories.delete(ids=[existing_id])
                        print(f"[MEMORY] Reconciled to: {content[:80]}...")
memory_id = hashlib.sha256(
f"{content}::{time.time()}".encode()
).hexdigest()[:16]
self.memories.add(
ids=[memory_id],
embeddings=[embedding],
documents=[content],
metadatas=[{
"importance": importance,
"stored_at": time.time(),
"last_accessed": time.time(),
"access_count": 0
}]
)
print(f"[MEMORY] Consolidated: '{content[:60]}...' (id={memory_id})")
return memory_id
def recall(self, query: str, n_results: int = 5) -> list[dict]:
"""
Retrieves relevant memories and updates their access metadata.
Frequently accessed memories are less likely to be forgotten.
"""
if self.memories.count() == 0:
return []
embedding = self._get_embedding(query)
results = self.memories.query(
query_embeddings=[embedding],
n_results=min(n_results, self.memories.count()),
include=["documents", "metadatas", "distances"]
)
recalled = []
for mem_id, doc, meta, dist in zip(
results["ids"][0],
results["documents"][0],
results["metadatas"][0],
results["distances"][0]
):
# Update access metadata
meta["last_accessed"] = time.time()
meta["access_count"] = meta.get("access_count", 0) + 1
self.memories.update(ids=[mem_id], metadatas=[meta])
recalled.append({
"id": mem_id,
"content": doc,
"relevance": round(1 - dist, 4),
"importance": meta.get("importance", 0.5),
"access_count": meta.get("access_count", 0)
})
return recalled
def forget(
self,
age_threshold_days: float = 30.0,
min_importance: float = 0.3
) -> int:
"""
Implements selective forgetting: removes memories that are old,
rarely accessed, and of low importance.
A memory is forgotten if ALL of the following are true:
- It is older than age_threshold_days.
- Its importance score is below min_importance.
- It has been accessed fewer than 3 times.
Args:
age_threshold_days: Memories older than this are candidates.
min_importance: Memories below this threshold are candidates.
Returns:
The number of memories forgotten.
"""
if self.memories.count() == 0:
return 0
all_memories = self.memories.get(include=["metadatas", "documents"])
forgotten_count = 0
now = time.time()
age_threshold_seconds = age_threshold_days * 86400
ids_to_delete = []
for mem_id, meta, doc in zip(
all_memories["ids"],
all_memories["metadatas"],
all_memories["documents"]
):
stored_at = meta.get("stored_at", now)
age_seconds = now - stored_at
importance = meta.get("importance", 0.5)
access_count = meta.get("access_count", 0)
should_forget = (
age_seconds > age_threshold_seconds
and importance < min_importance
and access_count < 3
)
if should_forget:
ids_to_delete.append(mem_id)
print(
f"[FORGET] Removing: '{doc[:60]}...' "
f"(age={age_seconds / 86400:.1f}d, "
f"importance={importance:.2f}, "
f"accesses={access_count})"
)
if ids_to_delete:
self.memories.delete(ids=ids_to_delete)
forgotten_count = len(ids_to_delete)
print(
f"[FORGET] Forgot {forgotten_count} memories. "
f"{self.memories.count()} remain."
)
return forgotten_count
if __name__ == "__main__":
manager = MemoryLifecycleManager(
backend="ollama",
model_name="llama3.2",
memory_path="./lifecycle_demo"
)
# Consolidate some memories
manager.consolidate(
"Python 3.12 introduced significant performance improvements over 3.11.",
importance=0.8
)
manager.consolidate(
"Python 3.11 is faster than 3.12.", # Contradicts the above!
importance=0.6
)
manager.consolidate(
"FastAPI 0.100 introduced Pydantic v2 support.",
importance=0.5
)
# Recall memories related to a query
print("\n=== Recall: Python performance ===")
memories = manager.recall("Which Python version is fastest?")
for m in memories:
print(f" [{m['relevance']:.2f}] {m['content'][:80]}")
    # Simulate forgetting. All demo memories have importance >= 0.5,
    # so loosen both thresholds so the pass actually prunes something.
    print("\n=== Forgetting pass ===")
    manager.forget(age_threshold_days=0.0, min_importance=0.6)
The memory lifecycle manager above demonstrates one of the most important and often overlooked aspects of agent memory design: the system must actively manage its own memory, not just passively store and retrieve. The contradiction detection and resolution mechanism is particularly powerful because it prevents the knowledge base from accumulating conflicting information over time, which would degrade the quality of the agent's responses.
CONCLUSION: WHY THIS MATTERS FOR THE FUTURE OF AGENTIC AI
Karpathy's memory taxonomy is more than an academic classification. It is a practical engineering framework that gives developers a clear mental model for designing, building, and debugging agentic AI systems. By understanding the four types of memory and their respective trade-offs, you can make principled architectural decisions rather than ad-hoc ones.
In-weights memory is the foundation: it is what makes the model intelligent, but it is expensive to update and opaque to inspect. You rely on it for general reasoning and world knowledge, but you should not rely on it for domain-specific, up-to-date, or confidential information.
In-context memory is the workspace: it is fast and flexible, but limited and volatile. You use it to maintain conversational continuity and to inject retrieved context, but you must actively manage it to prevent overflow and to consolidate important information before it is lost.
External retrieval memory is the knowledge base: it is persistent, scalable, and updatable, but requires retrieval infrastructure and careful curation. Karpathy's "compilation over retrieval" insight suggests that you should invest in compiling raw information into structured knowledge before storing it, rather than storing raw documents and hoping the retrieval system can make sense of them at query time.
KV cache memory is the performance layer: it is largely invisible to the developer but has profound implications for latency and cost. By designing your prompts with a stable prefix first, you maximize cache hit rates and reduce the effective cost of every API call.
An agent that can remember what it has learned, update its knowledge when the world changes, forget what is no longer relevant, and efficiently retrieve what it needs is qualitatively more powerful than one that starts fresh with every request. Karpathy's taxonomy gives us the vocabulary and the conceptual tools to build such agents.
The code in this article is a starting point, not an ending point. Real production systems will need more sophisticated contradiction resolution, more nuanced importance scoring, more efficient retrieval strategies, and more careful attention to the security and privacy implications of persistent memory. But the architecture described here — four types of memory working in concert, managed through a principled lifecycle, and engineered for optimal context window utilization — is the right foundation to build on.
FURTHER READING AND RESOURCES
Andrej Karpathy's talks and writings are the primary source for the concepts in this article. His "State of GPT" talk from Microsoft Build 2023 introduced many of these ideas to a broad audience. His blog at karpathy.ai contains deeper explorations of Software 2.0, Software 3.0, and the LLM OS concept. His X (formerly Twitter) account @karpathy is a continuous stream of insights on LLMs, agents, and AI systems design.
For the technical implementation of vector stores, the ChromaDB documentation at docs.trychroma.com is excellent. For Ollama, the documentation at ollama.com covers model management, the REST API, and Modelfile syntax. For the OpenAI API, the platform.openai.com documentation covers prompt caching, embeddings, and the chat completions API in detail.
The LangChain and LlamaIndex frameworks provide higher-level abstractions over many of the patterns demonstrated in this article, and are worth studying once you have a solid understanding of the underlying concepts. The LangGraph library, in particular, provides a principled framework for building stateful, multi-step agents with explicit memory management.