The AI Memory System Inspired by an Ancient Technique, Co-Created by Milla Jovovich and Ben Sigman
PREFACE: THE PROBLEM THAT STARTED IT ALL
Imagine hiring a brilliant assistant who reads everything you give them, answers every question perfectly, and then wakes up the next morning with absolutely no memory of who you are. Every single day. Forever. You would explain your project from scratch, re-establish your preferences, re-describe your team, and re-share every decision you made together. This is, more or less, the daily reality of working with large language models.
This problem has a name in the AI community: AI amnesia. It is not a bug in the traditional sense. It is a fundamental architectural constraint. Language models process a fixed window of tokens, and when a conversation ends, that window closes. Nothing persists. The model does not dream about your codebase at night. It does not remember that you prefer tabs over spaces, that your database migration failed last Tuesday, or that your colleague Soren spent three weeks debugging a race condition in the authentication service.
For years, developers have tried to solve this with summarization. The idea is simple enough: at the end of each session, ask the model to summarize what happened, store that summary, and inject it at the start of the next session. The problem is that summarization is lossy by definition. When you ask a model to condense a three-hour debugging session into a paragraph, it makes judgment calls about what matters. Sometimes it is right. Sometimes it discards the one detail that would have saved you two hours next week. The model does not know what it does not know it will need.
In April 2026, Milla Jovovich and developer Ben Sigman released MemPalace on GitHub, and within two days it had accumulated over 23,000 stars and nearly 3,000 forks. The project proposed a different philosophy entirely: store everything verbatim, organize it spatially using an ancient mnemonic technique, and retrieve it deterministically without ever calling an external API. The ancient technique in question is the method of loci, also known as the memory palace, a strategy used by Greek and Roman orators to memorize hours of speeches by mentally placing ideas in rooms of an imaginary building and walking through those rooms during recall.
This tutorial will take you through every aspect of MemPalace: its philosophy, its architecture, its data structures, its compression system, its benchmark results, and how to integrate it into your own applications using local and remote language models across different hardware backends including NVIDIA CUDA, Apple MLX, and Vulkan-capable GPUs.
CHAPTER ONE: THE PHILOSOPHY OF VERBATIM MEMORY
Before we touch a single line of code, it is worth spending time with the core idea that separates MemPalace from every other AI memory system that came before it. That idea is the verbatim-first philosophy.
Most memory systems for AI work on the assumption that storage is expensive and context windows are precious. Therefore, they compress aggressively. They summarize conversations, extract entities, build knowledge graphs from inferred relationships, and throw away the raw text. The result is a system that is efficient but brittle. The compressed representation is only as good as the model that created it, and that model had no way of knowing which details would matter six months later.
MemPalace takes the opposite position. Storage is cheap. Context windows, while limited, are growing. Modern embedding models are remarkably good at finding relevant passages in large corpora. Therefore, the right strategy is to store everything exactly as it was said, organize it so that retrieval is fast and accurate, and let the embedding model do the work of finding what matters.
This is not a new idea in information science. Archivists and librarians have known for centuries that the original document is always more valuable than any summary of it. What MemPalace does is apply this principle systematically to AI memory, with a retrieval architecture sophisticated enough to make it practical.
The verbatim-first philosophy has several important consequences for the system design. First, it means that ChromaDB, the vector database at the heart of MemPalace, stores the actual text of conversations and documents rather than summaries. Second, it means that the write path, the process of ingesting new information, does not need to call an LLM at all. Classification, chunking, and organization are handled by deterministic regex heuristics and keyword scoring. Third, it means that the system can answer questions about the past with high fidelity, because the past is still there, word for word.
The benchmark results support this approach. In raw verbatim mode, MemPalace achieved 96.6% Recall@5 on the LongMemEval benchmark, meaning the correct answer appeared among the top five retrieved results for 96.6% of queries, without making a single external API call. This is a remarkable result for a system that runs entirely on your local machine.
CHAPTER TWO: THE MEMORY PALACE ARCHITECTURE
The method of loci works because the human brain is extraordinarily good at spatial memory. We remember places, routes, and rooms far better than we remember abstract lists. By associating information with locations in a familiar space, we give our brains a retrieval cue that is far more reliable than rote memorization.
MemPalace translates this spatial metaphor into a hierarchical data structure. Understanding this hierarchy is essential to understanding how the system works, so we will spend considerable time here before moving to implementation details.
THE WING
A wing is the top-level container in a MemPalace. Think of it as an entire building within your memory palace complex. Each wing represents a distinct domain: a project, a person, a client, a research area, or any other top-level category that makes sense for your use case. If you are a solo developer working on three projects simultaneously, you might have three wings: one for each project. If you are managing a team, you might have a wing for each team member in addition to wings for each project.
Wings are important because they provide the first level of isolation. A search within a specific wing only retrieves information from that wing, which dramatically reduces noise and improves precision. A search across all wings is also possible when you need to find connections between domains.
THE HALL
Halls are corridors that run through every wing in the palace. They represent recurring memory types that appear in every domain. The default halls in MemPalace are facts, events, discoveries, preferences, and advice. Every wing has all of these halls, which means that if you want to find all the facts across all your projects, you can walk down the facts hall and look into each wing's room along that corridor.
This is a subtle but powerful design decision. It means that memory is organized along two orthogonal axes simultaneously: by domain (wing) and by type (hall). You can retrieve all facts about a specific project, or all facts across all projects, or all events within a specific project, depending on what you need.
THE ROOM
Rooms are specific topics within a wing. If a wing represents a software project, its rooms might be named authentication, billing, deployment, database-schema, and frontend. Rooms are created automatically by MemPalace when it detects that a conversation or document is about a specific topic. The topic detection uses keyword scoring and regex heuristics, not an LLM, which keeps the process fast and free.
Rooms are where the actual retrieval happens. When you search for information about authentication in your project wing, MemPalace navigates to the authentication room and searches within it. This spatial navigation dramatically reduces the search space compared to a flat vector search over all stored documents.
THE DRAWER
Drawers are where the raw, verbatim text lives. Every room has drawers, and each drawer contains an original document or conversation chunk exactly as it was stored. Drawers are the source of truth in MemPalace. Everything else in the system is either an index into the drawers or a derived view of their contents.
In the implementation, drawers correspond to entries in the ChromaDB collection. Each entry contains the full text of a chunk, along with metadata that records which wing, hall, and room it belongs to, when it was stored, and what source it came from.
THE CLOSET
Closets are adjacent to drawers and contain compressed summaries that point back to the original drawer content. They are intended for quick human scanning, not for retrieval accuracy. The distinction is important: when MemPalace retrieves information to give to an LLM, it retrieves from drawers, not closets. Closets are a convenience feature for human operators who want to browse the palace without reading every verbatim document.
THE TUNNEL
Tunnels are the most architecturally interesting feature of MemPalace. When the same room name appears in two different wings, a tunnel is automatically created between them. This tunnel represents a cross-domain connection: the authentication room in your web project wing is connected by a tunnel to the authentication room in your mobile project wing.
Tunnels enable a form of associative retrieval that flat vector search cannot provide. When you search for authentication information, MemPalace can follow tunnels to find related information in other wings, surfacing connections that you might not have thought to look for explicitly.
The following ASCII diagram illustrates the complete hierarchy:
MEMORY PALACE
|
+-- WING: project-alpha
|   |
|   +-- HALL: facts
|   |   +-- ROOM: authentication
|   |   |   +-- DRAWER: raw_chunk_001.txt (verbatim text)
|   |   |   +-- DRAWER: raw_chunk_002.txt (verbatim text)
|   |   |   +-- CLOSET: summary_auth.aaak (compressed summary)
|   |   |
|   |   +-- ROOM: billing
|   |       +-- DRAWER: raw_chunk_003.txt
|   |
|   +-- HALL: events
|       +-- ROOM: deployment
|           +-- DRAWER: raw_chunk_004.txt
|
+-- WING: project-beta
|   |
|   +-- HALL: facts
|       +-- ROOM: authentication   <-- TUNNEL connects to project-alpha/facts/authentication
|           +-- DRAWER: raw_chunk_005.txt
|
+-- TUNNEL: project-alpha/authentication <-> project-beta/authentication
This diagram shows two wings, each with their own halls and rooms, and a tunnel connecting the authentication rooms across wings. The drawers contain verbatim text, and the closets contain compressed summaries.
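The tunnel rule is simple enough to sketch in a few lines: whenever two wings share a room name, a link is recorded. The palace layout and the tuple representation below are illustrative, not MemPalace's actual internals.

```python
from itertools import combinations

def find_tunnels(palace: dict[str, set[str]]) -> list[tuple[str, str, str]]:
    """Given a mapping of wing name -> set of room names, return one
    (wing_a, wing_b, room) tuple for every room shared by two wings."""
    tunnels = []
    for wing_a, wing_b in combinations(sorted(palace), 2):
        for room in sorted(palace[wing_a] & palace[wing_b]):
            tunnels.append((wing_a, wing_b, room))
    return tunnels

palace = {
    "project-alpha": {"authentication", "billing", "deployment"},
    "project-beta": {"authentication", "payments"},
}
print(find_tunnels(palace))
# [('project-alpha', 'project-beta', 'authentication')]
```

Because the rule is a pure set intersection, tunnel discovery is deterministic and can be rerun cheaply whenever a new room is created.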
CHAPTER THREE: THE TECHNOLOGY STACK
MemPalace is built on a deliberately minimal set of dependencies. This is a conscious design choice: every dependency is a potential point of failure, a version conflict waiting to happen, and a barrier to adoption. The core stack consists of three components.
CHROMADB: THE VECTOR STORE
ChromaDB is an open-source vector database designed specifically for AI applications. It stores text alongside its vector embedding, which is a numerical representation of the text's semantic meaning. When you search ChromaDB with a query, it converts your query to a vector and finds the stored texts whose vectors are most similar. This is semantic search: it finds texts that mean something similar to your query, not just texts that contain the same keywords.
In MemPalace, ChromaDB stores all drawer contents. Each drawer entry is a document in the ChromaDB collection, with metadata fields that encode its position in the palace hierarchy. When MemPalace performs a retrieval, it queries ChromaDB with the user's question, optionally filtered by wing, hall, or room metadata, and retrieves the most semantically relevant drawer contents.
ChromaDB uses its default embedding model out of the box, which is a sentence transformer model that runs entirely locally. No API key is required. No data leaves your machine. The embedding model converts text to vectors on your local CPU or GPU, and ChromaDB stores those vectors in a local SQLite file.
SQLITE: THE KNOWLEDGE GRAPH
While ChromaDB handles semantic search over verbatim text, SQLite handles structured facts and their temporal relationships. MemPalace maintains a temporal knowledge graph in SQLite that stores entity-relationship triples of the form (subject, predicate, object) along with validity timestamps.
A temporal knowledge graph is a knowledge graph where every fact has an associated time range during which it is true. For example, the fact that a particular developer is working on the authentication module might be true from January to March but no longer true in April when they switch to a different project. In a standard knowledge graph, updating this fact would overwrite the old one. In a temporal knowledge graph, the old fact is preserved with a valid_to timestamp, and the new fact is added with a valid_from timestamp. This allows you to query the state of the world at any point in time.
The SQLite schema for the knowledge graph looks conceptually like this:
TABLE: entity_relations
+---------+-----------+----------+------------+------------+
| subject | predicate | object   | valid_from | valid_to   |
+---------+-----------+----------+------------+------------+
| soren   | works-on  | auth     | 2026-01-15 | 2026-03-30 |
| soren   | works-on  | billing  | 2026-04-01 | NULL       |
| project | uses      | postgres | 2025-06-01 | NULL       |
+---------+-----------+----------+------------+------------+
A NULL value in valid_to means the fact is currently true. This schema allows MemPalace to answer questions like "what was Soren working on in February?" by filtering for rows where valid_from is before the query date and valid_to is after it or NULL.
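The point-in-time query pattern can be reproduced with nothing but the standard library. This sketch uses the conceptual schema from the table above; the actual table and column names inside MemPalace may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE entity_relations (
        subject TEXT, predicate TEXT, object TEXT,
        valid_from TEXT, valid_to TEXT
    )
""")
conn.executemany(
    "INSERT INTO entity_relations VALUES (?, ?, ?, ?, ?)",
    [
        ("soren", "works-on", "auth", "2026-01-15", "2026-03-30"),
        ("soren", "works-on", "billing", "2026-04-01", None),
        ("project", "uses", "postgres", "2025-06-01", None),
    ],
)

def facts_at(conn, subject, predicate, when):
    """Return objects for (subject, predicate) that were valid at `when`.
    ISO-8601 date strings compare correctly as plain text."""
    rows = conn.execute(
        """
        SELECT object FROM entity_relations
        WHERE subject = ? AND predicate = ?
          AND valid_from <= ?
          AND (valid_to IS NULL OR valid_to >= ?)
        """,
        (subject, predicate, when, when),
    )
    return [r[0] for r in rows]

print(facts_at(conn, "soren", "works-on", "2026-02-15"))  # ['auth']
print(facts_at(conn, "soren", "works-on", "2026-05-01"))  # ['billing']
```

Note that the IS NULL branch is what encodes "currently true": open-ended facts match every query date at or after their valid_from.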
PYYAML: CONFIGURATION
PyYAML handles configuration files, which describe the structure of the palace, the locations of data sources to mine, and various behavioral parameters. YAML is a natural choice for configuration because it is human-readable, supports hierarchical structures, and is familiar to most developers.
CHAPTER FOUR: AAAK COMPRESSION
AAAK is one of the most discussed and debated features of MemPalace. It is described as an aggressive abbreviation dialect that any LLM can read without a decoder. The name stands for Aggressive Abbreviation for AI Knowledge, and the idea is to pack repeated entities into fewer tokens so that more context can fit into a model's context window.
The core mechanism of AAAK is entity coding. When a piece of text mentions the same entity repeatedly, AAAK replaces subsequent mentions with a short code. For example, if a conversation mentions the authentication service seventeen times, AAAK might replace it with the code A1 after the first occurrence, along with a header that maps A1 to authentication service. This can dramatically reduce token count for texts that discuss a small number of entities repeatedly.
AAAK also uses structural markers to replace common phrases and sentence patterns with abbreviated forms. The compression is designed to be readable by LLMs because the codes are consistent and the mapping is always provided in a header that the model reads first.
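To make entity coding concrete, here is a toy compressor in the spirit of AAAK. It is not the real AAAK dialect (the actual code assignment and structural markers are more elaborate, and the header format here is invented), but it shows the header-plus-codes shape that lets an LLM read the compressed text directly.

```python
def aaak_compress(text: str, entities: list[str]) -> str:
    """Replace each entity's mentions after the first with a short code
    (A1, A2, ...), prepending a header that maps codes back to entities."""
    header_parts = []
    for i, entity in enumerate(entities, start=1):
        code = f"A{i}"
        first = text.find(entity)
        if first == -1:
            continue  # entity never appears; no code needed
        # Keep the first mention verbatim so the text stays self-explanatory,
        # then code every later mention.
        keep = first + len(entity)
        text = text[:keep] + text[keep:].replace(entity, code)
        header_parts.append(f"{code}={entity}")
    return "[AAAK " + "; ".join(header_parts) + "]\n" + text

doc = ("The authentication service issues tokens. "
       "The authentication service also validates them, and the "
       "authentication service logs every failure.")

print(aaak_compress(doc, ["authentication service"]))
```

Even this toy version hints at why AAAK is lossy in practice: string substitution cannot tell when a coded mention carried nuance (capitalization, plurality, emphasis) that the code erases.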
However, it is important to understand the honest assessment of AAAK's performance. The developers initially claimed 30x compression with zero information loss, which turned out to be overstated. Independent testing on the LongMemEval benchmark showed that AAAK mode reduced recall accuracy from 96.6% to 84.2% compared to raw verbatim mode. The developers have since acknowledged that AAAK is lossy and that the lossless claim was incorrect.
This means that AAAK is a trade-off: you get smaller storage and faster context loading at the cost of some retrieval accuracy. For applications where token efficiency is critical and some accuracy loss is acceptable, AAAK may be appropriate. For applications where accuracy is paramount, raw verbatim mode is the better choice.
The default mode for MemPalace is raw verbatim, which is why the benchmark results are so strong. AAAK is an optional experimental feature that you can enable in the configuration.
CHAPTER FIVE: INSTALLATION AND SETUP
Now we move from philosophy and architecture to practice. Setting up MemPalace requires Python 3.10 or higher, and the installation process is straightforward.
The recommended installation method uses pipx, which installs MemPalace in an isolated environment and makes its command-line tools available globally without polluting your system Python installation. If you do not have pipx installed, you can install it with pip and then use it to install MemPalace.
pip install pipx
pipx ensurepath
pipx install mempalace
If you prefer to install MemPalace directly into a project's virtual environment, the standard pip approach works as well. The following example shows how to set up a complete project environment from scratch, which is the approach we will use throughout this tutorial.
# Create a new project directory and navigate into it
mkdir mempalace-tutorial
cd mempalace-tutorial
# Create a virtual environment with Python 3.11
python3.11 -m venv venv
# Activate the virtual environment
# On macOS and Linux:
source venv/bin/activate
# On Windows:
# venv\Scripts\activate
# Install MemPalace and its dependencies
pip install mempalace chromadb pyyaml
After installation, verify that everything is working by running the MemPalace help command:
mempalace --help
You should see a list of available commands including init, mine, search, wake-up, and serve. If you see this output, your installation is complete.
INITIALIZING A PALACE
Before you can store any memories, you need to initialize a palace. The init command creates the directory structure and configuration files that MemPalace needs.
mempalace init --name "my-project" --path ~/.mempalace
This command creates a directory at the specified path containing a config.yaml file, an empty ChromaDB collection, and an empty SQLite database. The config.yaml file is where you define the structure of your palace: which wings to create, which halls to include in each wing, and where to find data sources to mine.
A minimal config.yaml looks like this:
# ~/.mempalace/config.yaml
# This file defines the structure of your memory palace.
# Wings are top-level domains. Add one per project or person.

palace:
  name: "my-project"
  version: "1.0"

wings:
  - name: "project-alpha"
    description: "Main web application project"
    halls:
      - facts
      - events
      - discoveries
      - preferences
      - advice

storage:
  chromadb_path: "~/.mempalace/chroma"
  sqlite_path: "~/.mempalace/knowledge.db"
  compression: "raw"        # Options: raw, aaak

retrieval:
  default_k: 5              # Number of results to return per search
  wake_up_tokens: 170       # Maximum tokens for startup context
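Reading a configuration like this in Python is a one-liner with PyYAML. The sketch below parses an abbreviated config from an inline string so it is self-contained; MemPalace itself would read the file from disk.

```python
import yaml

config_text = """
palace:
  name: "my-project"
wings:
  - name: "project-alpha"
    halls: [facts, events, discoveries, preferences, advice]
retrieval:
  default_k: 5
"""

# safe_load refuses arbitrary Python object tags, unlike plain yaml.load
config = yaml.safe_load(config_text)

print(config["palace"]["name"])         # my-project
print(config["wings"][0]["halls"][0])   # facts
print(config["retrieval"]["default_k"]) # 5
```

Because YAML maps directly to nested dictionaries and lists, the rest of the application can treat the configuration as ordinary Python data.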
CHAPTER SIX: MINING DATA INTO THE PALACE
Mining is the process of ingesting existing data into MemPalace. The system can mine several types of sources: project directories containing code and documentation, exported conversation histories from Claude, ChatGPT, or Slack, and arbitrary text files.
The mining process is entirely deterministic and requires no LLM calls. MemPalace reads each source file, splits it into chunks using a sliding window approach, scores each chunk against keyword lists to determine which room it belongs to, and stores it in the appropriate location in the palace hierarchy.
MINING A PROJECT DIRECTORY
To mine a software project, point MemPalace at the project root. It will recursively scan all files, skip binary files and common noise directories like node_modules and .git, and ingest the rest.
mempalace mine --source ~/projects/my-app --wing project-alpha --mode project
The mode flag tells MemPalace what kind of data to expect. In project mode, it applies heuristics tuned for code and documentation. It recognizes common patterns like function definitions, class declarations, configuration keys, and markdown headings, and uses these as signals for room assignment.
MINING CONVERSATION HISTORIES
Conversation mining is where MemPalace really shines. If you have been using Claude or ChatGPT for months, you likely have hundreds of conversations containing decisions, discoveries, and context that you have completely forgotten. MemPalace can ingest these conversations and make them searchable.
mempalace mine --source ~/Downloads/claude-export.json --wing project-alpha --mode convos
The conversation export format varies by platform. MemPalace includes parsers for the standard export formats of Claude and ChatGPT. For other formats, you can write a simple preprocessing script to convert your data to the plain text format that MemPalace accepts.
UNDERSTANDING THE MINING PIPELINE
It is worth understanding what happens inside the mining pipeline, because this knowledge will help you troubleshoot issues and tune the system for your specific use case.
When MemPalace receives a source file, it first determines the file type and applies the appropriate parser. For plain text and markdown files, the parser is trivial: it reads the file and splits it into chunks. For JSON conversation exports, the parser extracts the message content from each turn, labels it with the speaker role, and concatenates turns into coherent chunks.
After parsing, each chunk goes through the room classifier. The classifier scores the chunk against a set of keyword lists, one per room type. The keyword lists are defined in the configuration and can be customized. A chunk about database connection pooling would score highly against keywords like database, connection, pool, timeout, and query, and would be assigned to the database room if one exists in the current wing.
After classification, the chunk is embedded by ChromaDB's default embedding model and stored in the collection with metadata encoding its palace position. The entire process for a typical conversation export takes a few seconds to a few minutes depending on the size of the export and the speed of your hardware.
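The chunk-and-classify stage of that pipeline can be sketched as follows. The window sizes, keyword lists, and the "general" fallback room are illustrative assumptions, not MemPalace's actual defaults.

```python
def sliding_chunks(text: str, size: int = 60, overlap: int = 15) -> list[str]:
    """Split text into overlapping word windows so that statements
    straddling a boundary appear whole in at least one chunk."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

ROOM_KEYWORDS = {
    "database": {"database", "connection", "pool", "timeout", "query", "schema"},
    "authentication": {"auth", "login", "token", "jwt", "password", "session"},
}

def classify_room(chunk: str, default: str = "general") -> str:
    """Score the chunk against each room's keyword list; highest score wins.
    Purely deterministic: no LLM call, no network access."""
    tokens = {w.strip(".,").lower() for w in chunk.split()}
    scores = {room: len(tokens & kw) for room, kw in ROOM_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default

chunk = "Raised the connection pool timeout after the database kept dropping queries."
print(classify_room(chunk))  # database
```

The overlap matters more than it looks: without it, a decision recorded across a chunk boundary could be split so that neither half is retrievable on its own.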
CHAPTER SEVEN: SEARCHING THE PALACE
Retrieval is where all the architectural decisions pay off. MemPalace provides several retrieval modes, each suited to different use cases.
THE BASIC SEARCH
The simplest retrieval operation is a semantic search across the entire palace or within a specific wing.
mempalace search "what decisions were made about the database schema"
This command queries ChromaDB with the provided text, retrieves the top five most semantically similar drawer contents, and prints them to the terminal. The output includes the text of each result, its palace location (wing, hall, room), its source file, and a similarity score.
You can narrow the search to a specific wing to reduce noise:
mempalace search "authentication decisions" --wing project-alpha
Or narrow it further to a specific room:
mempalace search "JWT token expiry" --wing project-alpha --room authentication
THE WAKE-UP CONTEXT
The wake-up feature is designed for the beginning of a new AI session. It generates a compact summary of the most critical facts about the current palace, designed to fit within approximately 170 tokens. This gives the LLM just enough context to orient itself without consuming a significant portion of its context window.
mempalace wake-up --wing project-alpha
The wake-up output is a structured text block that lists the most recently accessed facts, the most frequently referenced entities, and any critical flags that have been set manually. It is designed to be pasted directly into a system prompt or prepended to the first user message in a new session.
THE KNOWLEDGE GRAPH QUERY
For structured queries about entities and their relationships, MemPalace provides a knowledge graph query interface that operates on the SQLite temporal graph.
mempalace kg-query --subject "soren" --predicate "works-on"
This returns all known facts about what Soren works on, including historical facts with their validity timestamps. You can add a temporal filter to ask about a specific point in time:
mempalace kg-query --subject "soren" --predicate "works-on" --at "2026-02-15"
CHAPTER EIGHT: PYTHON API INTEGRATION
The command-line interface is useful for exploration and testing, but real applications need to integrate MemPalace programmatically. MemPalace provides a Python API that exposes all of its functionality.
The following example demonstrates a complete integration pattern. We will build a MemoryManager class that wraps MemPalace and provides a clean interface for storing and retrieving memories in an application. This class follows clean architecture principles by separating concerns and hiding implementation details behind a well-defined interface.
"""
memory_manager.py
A clean-architecture wrapper around MemPalace that provides
a simple interface for AI memory operations. This module
handles all interaction with the MemPalace Python API,
including initialization, storage, and retrieval.
"""
import logging
from dataclasses import dataclass
from typing import Optional
from pathlib import Path
# MemPalace imports - these become available after pip install mempalace
from mempalace import MemPalace
from mempalace.models import SearchResult, WakeUpContext
# Configure module-level logging so callers can control verbosity
logger = logging.getLogger(__name__)
@dataclass
class MemoryConfig:
"""
Configuration for the MemoryManager.
All paths are resolved relative to the user's home directory
if they begin with a tilde, following Unix conventions.
"""
# Path where MemPalace stores its ChromaDB and SQLite files
storage_path: str = "~/.mempalace"
# The wing to use for this application instance
wing_name: str = "default"
# Number of results to return from semantic searches
search_k: int = 5
# Maximum tokens for wake-up context
wake_up_tokens: int = 170
# Whether to use AAAK compression (False = raw verbatim, recommended)
use_aaak: bool = False
def resolved_storage_path(self) -> Path:
"""Return the storage path with tilde expanded."""
return Path(self.storage_path).expanduser()
class MemoryManager:
    """
    High-level interface for AI memory operations using MemPalace.

    This class provides a clean, testable interface that hides the
    details of ChromaDB, SQLite, and the palace hierarchy from
    calling code. Applications should interact with memory exclusively
    through this class.

    Usage:
        config = MemoryConfig(wing_name="my-project")
        manager = MemoryManager(config)
        manager.initialize()

        # Store a memory
        manager.store(
            text="We decided to use PostgreSQL for the main database.",
            room="database",
            hall="decisions",
        )

        # Retrieve relevant memories
        results = manager.search("database technology choice")
        for result in results:
            print(result.text)
    """

    def __init__(self, config: MemoryConfig):
        """
        Initialize the MemoryManager with the given configuration.

        The MemPalace instance is not created here; call initialize()
        before using any other methods. This separation allows the
        manager to be constructed without side effects.
        """
        self.config = config
        self._palace: Optional[MemPalace] = None

    def initialize(self) -> None:
        """
        Create or connect to the MemPalace storage.

        This method creates the storage directory if it does not exist,
        initializes the ChromaDB collection and SQLite database, and
        prepares the palace for use. It is safe to call multiple times;
        subsequent calls are no-ops if the palace is already initialized.
        """
        if self._palace is not None:
            logger.debug("MemoryManager already initialized, skipping.")
            return

        storage_path = self.config.resolved_storage_path()
        storage_path.mkdir(parents=True, exist_ok=True)
        logger.info(f"Initializing MemPalace at {storage_path}")

        # Create the MemPalace instance with our configuration
        self._palace = MemPalace(
            storage_path=str(storage_path),
            default_wing=self.config.wing_name,
            compression="aaak" if self.config.use_aaak else "raw",
            wake_up_tokens=self.config.wake_up_tokens,
        )
        logger.info(
            f"MemPalace initialized. Wing: {self.config.wing_name}, "
            f"Compression: {'AAAK' if self.config.use_aaak else 'raw verbatim'}"
        )

    def _require_initialized(self) -> MemPalace:
        """
        Return the palace instance, raising if not initialized.

        This guard method prevents confusing AttributeError messages
        when callers forget to call initialize() first.
        """
        if self._palace is None:
            raise RuntimeError(
                "MemoryManager has not been initialized. "
                "Call initialize() before using any other methods."
            )
        # Explicit assertion allows static analysis tools (mypy, pylance)
        # to narrow the type from Optional[MemPalace] to MemPalace.
        assert self._palace is not None
        return self._palace
    def store(
        self,
        text: str,
        room: str,
        hall: str = "facts",
        source: str = "manual",
        metadata: Optional[dict] = None,
    ) -> str:
        """
        Store a piece of text in the memory palace.

        The text is stored verbatim in the specified room and hall
        within the configured wing. No LLM calls are made during
        storage; classification and embedding happen locally.

        Args:
            text: The verbatim text to store.
            room: The room within the wing (e.g., "authentication").
            hall: The hall type (e.g., "facts", "events", "decisions").
            source: A label identifying where this text came from.
            metadata: Optional additional metadata to store with the text.

        Returns:
            The unique identifier assigned to this memory by ChromaDB.
        """
        palace = self._require_initialized()

        # Merge caller-provided metadata with required palace metadata
        combined_metadata: dict = {
            "wing": self.config.wing_name,
            "hall": hall,
            "room": room,
            "source": source,
        }
        if metadata:
            combined_metadata.update(metadata)

        memory_id = palace.store(
            text=text,
            wing=self.config.wing_name,
            hall=hall,
            room=room,
            metadata=combined_metadata,
        )
        logger.debug(
            f"Stored memory {memory_id} in "
            f"{self.config.wing_name}/{hall}/{room}"
        )
        return memory_id
    def search(
        self,
        query: str,
        room: Optional[str] = None,
        hall: Optional[str] = None,
        k: Optional[int] = None,
    ) -> list[SearchResult]:
        """
        Search the memory palace for relevant content.

        Performs a semantic search using ChromaDB's vector similarity.
        Results are ordered by relevance, with the most relevant first.
        Optionally filter by room and/or hall to narrow the search space.

        Args:
            query: The natural language query to search for.
            room: Optional room filter (e.g., "authentication").
            hall: Optional hall filter (e.g., "facts").
            k: Number of results to return. Defaults to config value.

        Returns:
            A list of SearchResult objects ordered by relevance.
        """
        palace = self._require_initialized()
        effective_k = k if k is not None else self.config.search_k

        results = palace.search(
            query=query,
            wing=self.config.wing_name,
            room=room,
            hall=hall,
            k=effective_k,
        )
        logger.debug(
            f"Search for '{query[:50]}...' returned {len(results)} results"
        )
        return results
    def wake_up(self) -> WakeUpContext:
        """
        Generate a compact startup context for a new AI session.

        Returns a WakeUpContext object containing a text field suitable
        for injection into an LLM system prompt. The text is guaranteed
        to fit within the configured wake_up_tokens limit.
        """
        palace = self._require_initialized()
        return palace.wake_up(wing=self.config.wing_name)

    def mine_directory(self, path: str, mode: str = "project") -> int:
        """
        Mine a directory of files into the memory palace.

        Recursively scans the directory, parses each file according
        to the specified mode, and stores the results in the palace.
        Returns the number of chunks successfully stored.

        Args:
            path: Path to the directory to mine.
            mode: Mining mode. "project" for code/docs, "convos" for
                conversation exports.

        Returns:
            The number of memory chunks stored.
        """
        palace = self._require_initialized()
        resolved_path = Path(path).expanduser().resolve()

        if not resolved_path.exists():
            raise FileNotFoundError(
                f"Mining source directory not found: {resolved_path}"
            )

        logger.info(f"Mining {resolved_path} in {mode} mode")
        count = palace.mine(
            source=str(resolved_path),
            wing=self.config.wing_name,
            mode=mode,
        )
        logger.info(f"Mining complete. Stored {count} memory chunks.")
        return count
Two implementation details in this module deserve a brief note. The explicit assert self._palace is not None in _require_initialized, placed after the guard check, allows static analysis tools such as mypy and Pylance to narrow the type from Optional[MemPalace] to MemPalace without relying on control-flow inference alone. Similarly, combined_metadata carries an explicit dict annotation to satisfy strict type checkers.
CHAPTER NINE: INTEGRATING WITH LOCAL LLMS
Now we come to the part of the tutorial that brings everything together: using MemPalace as the memory layer for a local language model. We will build a complete chat application that maintains persistent memory across sessions, using MemPalace for storage and retrieval, and that supports multiple LLM backends: Ollama (which auto-selects among CUDA, Metal, ROCm, and Vulkan), Apple MLX, and llama.cpp with direct GPU acceleration.
The architecture we will build follows a retrieval-augmented generation pattern. When the user sends a message, the application first queries MemPalace for relevant memories, then constructs a prompt that includes those memories as context, then sends the prompt to the LLM, and finally stores the conversation turn in MemPalace for future retrieval.
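Before we build the real thing, the shape of that loop can be sketched in a few lines. The functions retrieve, generate_reply, and store_turn below are hypothetical stand-ins, not MemPalace or backend APIs; the full implementations follow in this chapter and the next.

```python
def retrieve(query: str) -> list[str]:
    """Stand-in for a MemPalace search; returns relevant memory texts."""
    return ["Previous decision: use PostgreSQL for the orders service."]

def generate_reply(prompt: str) -> str:
    """Stand-in for an LLM backend call."""
    return "Based on your earlier decision to use PostgreSQL, yes."

stored_turns: list[str] = []

def store_turn(user: str, reply: str) -> None:
    """Stand-in for writing the turn back into the palace."""
    stored_turns.append(f"User: {user}\nAssistant: {reply}")

def chat_turn(user_message: str) -> str:
    # 1. Retrieve memories relevant to this message.
    memories = retrieve(user_message)
    # 2. Build a prompt that places the memories before the question.
    prompt = (
        "MEMORIES:\n" + "\n".join(memories)
        + f"\n\nUser: {user_message}\nAssistant:"
    )
    # 3. Generate a response.
    reply = generate_reply(prompt)
    # 4. Store the turn so future sessions can retrieve it.
    store_turn(user_message, reply)
    return reply

reply = chat_turn("Which database did we pick?")
```

Every piece of code in these two chapters is an elaboration of one of these four steps.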
SETTING UP THE LLM BACKENDS
We need a unified interface that can work with different LLM backends. The following module defines an abstract base class and concrete implementations for Ollama, Apple MLX, and llama.cpp with Vulkan support.
"""
llm_backends.py
Unified interface for multiple LLM backends, supporting:
- Ollama (CUDA, Metal, ROCm, or Vulkan via llama.cpp underneath)
- Apple MLX directly via mlx-lm
- llama.cpp via llama-cpp-python with explicit GPU layer offloading
All backends implement the LLMBackend abstract interface, so the
rest of the application is completely decoupled from the choice
of backend.
"""
import abc
import logging
import platform
from typing import Generator, Optional
logger = logging.getLogger(__name__)
class LLMBackend(abc.ABC):
"""
Abstract base class for LLM backends.
All backends must implement generate() and stream_generate().
The generate() method returns the complete response as a string.
The stream_generate() method yields tokens as they are produced,
which is important for interactive applications where users
expect to see output appear progressively.
"""
@abc.abstractmethod
def generate(self, prompt: str, max_tokens: int = 512) -> str:
"""Generate a complete response to the given prompt."""
...
@abc.abstractmethod
def stream_generate(
self, prompt: str, max_tokens: int = 512
) -> Generator[str, None, None]:
"""Yield string tokens progressively as the model generates them."""
...
@abc.abstractmethod
def is_available(self) -> bool:
"""Return True if this backend is available on the current system."""
...
class OllamaBackend(LLMBackend):
"""
LLM backend using Ollama's local inference server.
Ollama automatically selects the best available GPU backend:
CUDA on NVIDIA GPUs, Metal on Apple Silicon, ROCm on AMD GPUs,
and Vulkan as a fallback for other GPU types. This makes it
the most portable option across different hardware.
Prerequisites:
Install Ollama from https://ollama.ai and pull a model:
ollama pull llama3.2
Install the Python client:
pip install ollama
"""
def __init__(self, model: str = "llama3.2", host: str = "http://localhost:11434"):
"""
Initialize the Ollama backend.
Args:
model: The Ollama model name to use (e.g., "llama3.2",
"mistral", "phi3", "gemma2").
host: The Ollama server URL. Defaults to localhost.
"""
self.model = model
self.host = host
self._client = None
def _get_client(self):
"""Lazily initialize the Ollama client on first use."""
if self._client is None:
try:
import ollama
self._client = ollama.Client(host=self.host)
except ImportError:
raise RuntimeError(
"The 'ollama' package is not installed. "
"Install it with: pip install ollama"
)
return self._client
def is_available(self) -> bool:
"""
Check if the Ollama server is running and the model is available.
The Ollama Python SDK returns model objects whose `.model`
attribute holds the full model tag (e.g. "llama3.2:latest").
We check for an exact match first, then fall back to a
prefix match so that "llama3.2" matches "llama3.2:latest".
Substring matching is intentionally avoided to prevent
"llama3" from falsely matching "llama3.2:latest".
"""
try:
client = self._get_client()
response = client.list()
# Each entry in response.models has a `.model` attribute
# containing the full tag string, e.g. "llama3.2:latest".
available_tags = [m.model for m in response.models]
# Exact match (e.g. user passed "llama3.2:latest")
if self.model in available_tags:
return True
# Prefix match: "llama3.2" should match "llama3.2:latest"
# but NOT match "llama3.2.1:latest" — we anchor to the colon.
return any(
tag == self.model or tag.startswith(self.model + ":")
for tag in available_tags
)
except Exception as e:
logger.debug(f"Ollama not available: {e}")
return False
def generate(self, prompt: str, max_tokens: int = 512) -> str:
"""
Generate a complete response using Ollama.
The response is collected in full before returning, which
is appropriate for non-interactive use cases.
"""
client = self._get_client()
response = client.generate(
model=self.model,
prompt=prompt,
options={"num_predict": max_tokens},
)
return response.response
def stream_generate(
self, prompt: str, max_tokens: int = 512
) -> Generator[str, None, None]:
"""
Stream string tokens from Ollama as they are generated.
Each chunk from the Ollama streaming API is a GenerateResponse
object whose `.response` attribute holds the newly generated
text segment. Empty segments (sent in the final done-chunk)
are skipped via the truthiness guard.
"""
client = self._get_client()
stream = client.generate(
model=self.model,
prompt=prompt,
options={"num_predict": max_tokens},
stream=True,
)
for chunk in stream:
if chunk.response:
yield chunk.response
class AppleMLXBackend(LLMBackend):
"""
LLM backend using Apple's MLX framework for Apple Silicon.
MLX is Apple's machine learning framework optimized for the
unified memory architecture of M-series chips. It allows the
GPU and CPU to share the same RAM pool, eliminating costly
data transfers and enabling efficient inference on Apple hardware.
This backend is only available on macOS with Apple Silicon (M1+).
It uses the mlx-lm library, which provides MLX-optimized versions
of popular open-source models.
Prerequisites:
pip install mlx-lm
# Models are downloaded automatically on first use from
# the Hugging Face Hub.
"""
def __init__(self, model: str = "mlx-community/Llama-3.2-3B-Instruct-4bit"):
"""
Initialize the Apple MLX backend.
Args:
model: The Hugging Face model ID for an MLX-compatible model.
The mlx-community organization on Hugging Face hosts
pre-converted MLX versions of popular models.
"""
self.model_name = model
self._model = None
self._tokenizer = None # mlx_lm.load() returns (model, tokenizer)
def _load_model(self):
"""Load the MLX model and tokenizer on first use."""
if self._model is None:
if platform.system() != "Darwin":
raise RuntimeError(
"Apple MLX backend is only available on macOS."
)
try:
from mlx_lm import load
logger.info(f"Loading MLX model: {self.model_name}")
self._model, self._tokenizer = load(self.model_name)
logger.info("MLX model loaded successfully.")
except ImportError:
raise RuntimeError(
"The 'mlx-lm' package is not installed. "
"Install it with: pip install mlx-lm"
)
def is_available(self) -> bool:
"""
Check if Apple MLX is available on this system.
MLX only installs successfully on macOS with Apple Silicon
(M1 or later). We verify the platform first, then attempt
to import mlx.core. A successful import is sufficient proof
that the GPU backend is available, because MLX will not
install at all on non-Apple-Silicon hardware.
"""
if platform.system() != "Darwin":
return False
try:
import mlx.core # noqa: F401 — import is the availability check
return True
except ImportError:
return False
def generate(self, prompt: str, max_tokens: int = 512) -> str:
"""
Generate a complete response using Apple MLX.
MLX inference runs entirely on the Apple Silicon GPU using
the unified memory architecture, which means no data needs
to be copied between CPU and GPU memory.
"""
self._load_model()
from mlx_lm import generate as mlx_generate
result = mlx_generate(
self._model,
self._tokenizer,
prompt=prompt,
max_tokens=max_tokens,
verbose=False,
)
return result
def stream_generate(
self, prompt: str, max_tokens: int = 512
) -> Generator[str, None, None]:
"""
Stream string tokens from Apple MLX.
mlx_lm.stream_generate yields GenerationResult objects, not
raw strings. Each GenerationResult has a `.text` attribute
containing the newly decoded text segment for that step.
We extract `.text` before yielding so that callers always
receive plain strings, consistent with the LLMBackend contract.
"""
self._load_model()
from mlx_lm import stream_generate as mlx_stream
for generation_result in mlx_stream(
self._model,
self._tokenizer,
prompt=prompt,
max_tokens=max_tokens,
):
# generation_result is a GenerationResult object.
# Its .text attribute holds the decoded text for this token.
yield generation_result.text
class LlamaCppBackend(LLMBackend):
"""
LLM backend using llama.cpp via the llama-cpp-python bindings.
llama.cpp is a highly optimized C++ inference engine that supports
multiple GPU backends through compile-time flags:
- CUDA: NVIDIA GPUs (fastest on NVIDIA hardware)
- Metal: Apple Silicon GPUs (alternative to MLX)
- Vulkan: Cross-platform GPU acceleration (AMD, Intel, NVIDIA)
- CPU: Fallback for systems without compatible GPUs
The n_gpu_layers parameter controls how many transformer layers
are offloaded to the GPU. Setting it to -1 offloads all layers,
which is the recommended setting when GPU VRAM is sufficient.
Prerequisites:
# For CUDA support:
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python
# For Vulkan support:
CMAKE_ARGS="-DGGML_VULKAN=on" pip install llama-cpp-python
# For Apple Metal support:
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python
# CPU only (no special flags needed):
pip install llama-cpp-python
# Download a GGUF model, for example:
# https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF
"""
def __init__(
self,
model_path: str,
n_gpu_layers: int = -1,
n_ctx: int = 4096,
verbose: bool = False,
):
"""
Initialize the llama.cpp backend.
Args:
model_path: Path to the GGUF model file.
n_gpu_layers: Number of layers to offload to GPU.
-1 means offload all layers (recommended).
0 means CPU-only inference.
n_ctx: Context window size in tokens.
verbose: Whether to print llama.cpp debug output.
"""
self.model_path = model_path
self.n_gpu_layers = n_gpu_layers
self.n_ctx = n_ctx
self.verbose = verbose
self._llm = None
def _load_model(self):
"""Load the GGUF model on first use."""
if self._llm is None:
try:
from llama_cpp import Llama
logger.info(
f"Loading GGUF model: {self.model_path} "
f"({self.n_gpu_layers} GPU layers)"
)
self._llm = Llama(
model_path=self.model_path,
n_gpu_layers=self.n_gpu_layers,
n_ctx=self.n_ctx,
verbose=self.verbose,
)
logger.info("GGUF model loaded successfully.")
except ImportError:
raise RuntimeError(
"The 'llama-cpp-python' package is not installed. "
"See class docstring for installation instructions."
)
def is_available(self) -> bool:
"""Check if the model file exists and llama-cpp-python is installed."""
import os
if not os.path.exists(self.model_path):
logger.debug(f"Model file not found: {self.model_path}")
return False
try:
import llama_cpp # noqa: F401
return True
except ImportError:
return False
def generate(self, prompt: str, max_tokens: int = 512) -> str:
"""
Generate a complete response using llama.cpp.
Uses the __call__ (basic completion) API, which returns a
CompletionResponse dict. The generated text is at
output["choices"][0]["text"].
The model runs on whatever GPU backend was selected at
compile time (CUDA, Vulkan, Metal, or CPU).
"""
self._load_model()
output = self._llm(
prompt,
max_tokens=max_tokens,
stop=["User:", "\n\n\n"],
echo=False,
)
return output["choices"][0]["text"].strip()
def stream_generate(
self, prompt: str, max_tokens: int = 512
) -> Generator[str, None, None]:
"""
Stream string tokens from llama.cpp.
Uses the __call__ (basic completion) API with stream=True,
which yields CompletionChunk dicts. For the basic completion
API (as opposed to create_chat_completion), each chunk's
text is at chunk["choices"][0]["text"] — there is no "delta"
wrapper. Empty strings are skipped to avoid spurious yields.
llama.cpp supports native token streaming, which is efficient
because tokens are yielded as soon as they are generated without
buffering the complete response.
"""
self._load_model()
stream = self._llm(
prompt,
max_tokens=max_tokens,
stop=["User:", "\n\n\n"],
echo=False,
stream=True,
)
for chunk in stream:
token = chunk["choices"][0]["text"]
if token:
yield token
def create_best_available_backend(
ollama_model: str = "llama3.2",
mlx_model: str = "mlx-community/Llama-3.2-3B-Instruct-4bit",
gguf_model_path: Optional[str] = None,
) -> LLMBackend:
"""
Factory function that selects the best available LLM backend.
The selection priority is:
1. Ollama (most portable, handles GPU selection automatically)
2. Apple MLX (best performance on Apple Silicon when Ollama unavailable)
3. llama.cpp with GGUF (when a model file is provided)
4. Raises RuntimeError if no backend is available
This function is the recommended way to obtain an LLM backend
in applications that want to work across different hardware.
"""
# Try Ollama first - it works on all platforms and auto-selects GPU
ollama = OllamaBackend(model=ollama_model)
if ollama.is_available():
logger.info(f"Using Ollama backend with model: {ollama_model}")
return ollama
# Try Apple MLX on macOS
mlx = AppleMLXBackend(model=mlx_model)
if mlx.is_available():
logger.info(f"Using Apple MLX backend with model: {mlx_model}")
return mlx
# Try llama.cpp if a model path was provided
if gguf_model_path:
llamacpp = LlamaCppBackend(model_path=gguf_model_path)
if llamacpp.is_available():
logger.info(
f"Using llama.cpp backend with model: {gguf_model_path}"
)
return llamacpp
raise RuntimeError(
"No LLM backend is available. Please ensure one of the following:\n"
" 1. Ollama is installed and running (https://ollama.ai)\n"
" 2. mlx-lm is installed on macOS with Apple Silicon\n"
" 3. llama-cpp-python is installed and a GGUF model path is provided"
)
CHAPTER TEN: BUILDING THE COMPLETE CHAT APPLICATION
With the MemoryManager and LLM backends in place, we can now build the complete chat application. This application maintains persistent memory across sessions, retrieves relevant context before each response, and stores each conversation turn for future retrieval.
"""
memory_chat.py
A complete chat application with persistent AI memory powered by
MemPalace. Memories persist across sessions, so the AI can recall
information from previous conversations.
Usage:
python memory_chat.py --wing my-project --backend ollama
python memory_chat.py --wing my-project --backend mlx
python memory_chat.py --wing my-project --backend llamacpp \
--model-path ~/models/llama-3.2-3b-instruct.Q4_K_M.gguf
Commands during chat:
/search <query> - Search memories without generating a response
/wakeup - Show the current wake-up context
/mine <path> - Mine a directory into memory
/quit - Exit the application
"""
import argparse
import logging
import sys
from pathlib import Path
from typing import Optional
from memory_manager import MemoryConfig, MemoryManager
from mempalace.models import SearchResult
from llm_backends import (
OllamaBackend,
AppleMLXBackend,
LlamaCppBackend,
create_best_available_backend,
LLMBackend,
)
# Configure logging to show INFO messages to stderr,
# keeping stdout clean for the chat output
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s [%(levelname)s] %(name)s: %(message)s",
stream=sys.stderr,
)
logger = logging.getLogger(__name__)
def build_prompt(
user_message: str,
memory_context: str,
wake_up_context: str,
conversation_history: list[dict[str, str]],
) -> str:
"""
Construct the full prompt for the LLM.
The prompt has four sections:
1. System instructions that establish the AI's role
2. Wake-up context: critical facts loaded at session start
3. Relevant memories: retrieved by MemPalace for this specific query
4. Recent conversation history: the last 3 full turns (6 entries)
of this session, kept short to preserve context-window budget
The memory sections are placed before the conversation history
so that the model can refer to them while formulating its response.
Args:
user_message: The current user input.
memory_context: Relevant memories retrieved from MemPalace.
wake_up_context: Critical facts from the wake-up context.
conversation_history: Recent turns in the current session, each
entry a dict with "role" and "content" keys.
Returns:
A formatted prompt string ready for the LLM.
"""
# Keep the last 6 entries = 3 full turns (1 user + 1 assistant each)
history_text = ""
for turn in conversation_history[-6:]:
role = "User" if turn["role"] == "user" else "Assistant"
history_text += f"{role}: {turn['content']}\n"
# Assemble the complete prompt
prompt = f"""You are a helpful AI assistant with access to a persistent memory system.
You remember information from previous conversations and use it to provide
contextually aware, accurate responses.
CRITICAL FACTS (always relevant):
{wake_up_context}
RELEVANT MEMORIES (retrieved for this query):
{memory_context if memory_context else "No specific memories found for this query."}
RECENT CONVERSATION:
{history_text}
User: {user_message}
Assistant:"""
return prompt
def format_search_results(results: list[SearchResult]) -> str:
"""
Format MemPalace search results into a readable context block.
Each result is formatted with its palace location and a relevance
score, followed by the verbatim text. Results are separated by
a divider line for readability.
Args:
results: A list of SearchResult objects from MemPalace.
Returns:
A formatted string suitable for inclusion in an LLM prompt.
"""
if not results:
return ""
formatted_parts = []
for i, result in enumerate(results, 1):
location = f"{result.wing}/{result.hall}/{result.room}"
score = f"{result.score:.3f}"
formatted_parts.append(
f"[Memory {i} | Location: {location} | Relevance: {score}]\n"
f"{result.text}"
)
return "\n---\n".join(formatted_parts)
def store_conversation_turn(
manager: MemoryManager,
user_message: str,
assistant_response: str,
) -> None:
"""
Store a conversation turn in MemPalace for future retrieval.
Both the user message and the assistant response are stored
together as a single memory unit. This preserves the conversational
context, so the response remains intelligible when it is retrieved
on its own in a future session.
The room is set to "conversations" and the hall to "events",
which places these memories in a consistent location within
the palace hierarchy.
Storage failures are logged as warnings rather than raised as
exceptions, so that a transient storage error does not crash
an otherwise healthy chat session.
"""
combined_text = (
f"User: {user_message}\n"
f"Assistant: {assistant_response}"
)
try:
manager.store(
text=combined_text,
room="conversations",
hall="events",
source="live-chat",
)
except Exception as exc:
logger.warning(
f"Failed to store conversation turn in MemPalace: {exc}. "
"This turn will not be available in future sessions."
)
def run_chat(manager: MemoryManager, backend: LLMBackend) -> None:
"""
Run the interactive chat loop.
This function handles the main interaction loop, including
special commands, memory retrieval, prompt construction,
LLM inference, and memory storage. It runs until the user
types /quit or sends an EOF signal (Ctrl+D on Unix).
The wing name is read directly from manager.config.wing_name
rather than being passed as a redundant parameter.
"""
wing_name = manager.config.wing_name
print(f"\nMemPalace Chat | Wing: {wing_name}")
print("=" * 50)
print("Commands: /search <query>, /wakeup, /mine <path>, /quit")
print("=" * 50)
# Load the wake-up context once at session start
wake_up = manager.wake_up()
wake_up_text = wake_up.text if wake_up else "No wake-up context available."
print(f"\nWake-up context loaded ({len(wake_up_text.split())} words).\n")
# Maintain conversation history for the current session.
# Each entry is {"role": "user"|"assistant", "content": str}.
conversation_history: list[dict[str, str]] = []
while True:
try:
user_input = input("You: ").strip()
except (EOFError, KeyboardInterrupt):
print("\nGoodbye!")
break
if not user_input:
continue
# Handle special commands
if user_input.startswith("/quit"):
print("Goodbye!")
break
if user_input.startswith("/wakeup"):
print(f"\nWake-up context:\n{wake_up_text}\n")
continue
if user_input.startswith("/search "):
query = user_input[8:].strip()
results = manager.search(query, k=5)
print(f"\nSearch results for '{query}':")
print(format_search_results(results))
print()
continue
if user_input.startswith("/mine "):
path = user_input[6:].strip()
try:
count = manager.mine_directory(path)
print(f"\nMined {count} memory chunks from {path}\n")
except FileNotFoundError as e:
print(f"\nError: {e}\n")
continue
# Retrieve relevant memories for the current query
search_results = manager.search(user_input, k=5)
memory_context = format_search_results(search_results)
# Build the complete prompt
prompt = build_prompt(
user_message=user_input,
memory_context=memory_context,
wake_up_context=wake_up_text,
conversation_history=conversation_history,
)
# Stream the response from the LLM
print("Assistant: ", end="", flush=True)
response_tokens: list[str] = []
for token in backend.stream_generate(prompt, max_tokens=512):
print(token, end="", flush=True)
response_tokens.append(token)
print() # Newline after the streamed response
full_response = "".join(response_tokens)
# Only store and update history if the model produced output.
# An empty response indicates a backend error; we skip storage
# to avoid polluting the palace with empty entries.
if full_response.strip():
# Update conversation history for this session
conversation_history.append(
{"role": "user", "content": user_input}
)
conversation_history.append(
{"role": "assistant", "content": full_response}
)
# Store this turn in MemPalace for future sessions
store_conversation_turn(manager, user_input, full_response)
else:
logger.warning("LLM returned an empty response; turn not stored.")
def main() -> None:
"""
Entry point for the memory chat application.
Parses command-line arguments, initializes the MemoryManager
and LLM backend, and starts the interactive chat loop.
"""
parser = argparse.ArgumentParser(
description="Chat with an AI that remembers everything."
)
parser.add_argument(
"--wing",
default="default",
help="The memory palace wing to use (default: 'default')",
)
parser.add_argument(
"--backend",
choices=["ollama", "mlx", "llamacpp", "auto"],
default="auto",
help="LLM backend to use (default: auto-detect)",
)
parser.add_argument(
"--ollama-model",
default="llama3.2",
help="Ollama model name (default: llama3.2)",
)
parser.add_argument(
"--mlx-model",
default="mlx-community/Llama-3.2-3B-Instruct-4bit",
help="MLX model ID from Hugging Face",
)
parser.add_argument(
"--model-path",
default=None,
help="Path to GGUF model file for llama.cpp backend",
)
parser.add_argument(
"--gpu-layers",
type=int,
default=-1,
help="GPU layers for llama.cpp (-1 = all layers, 0 = CPU only)",
)
parser.add_argument(
"--storage-path",
default="~/.mempalace",
help="Path for MemPalace storage (default: ~/.mempalace)",
)
args = parser.parse_args()
# Initialize MemoryManager
config = MemoryConfig(
storage_path=args.storage_path,
wing_name=args.wing,
)
manager = MemoryManager(config)
manager.initialize()
# Select and initialize the LLM backend.
# The if/elif chain is exhaustive given argparse's choices constraint,
# but we add a final else to keep static analysis tools (mypy, pylance)
# from flagging `backend` as potentially unbound.
backend: LLMBackend
if args.backend == "auto":
backend = create_best_available_backend(
ollama_model=args.ollama_model,
mlx_model=args.mlx_model,
gguf_model_path=args.model_path,
)
elif args.backend == "ollama":
backend = OllamaBackend(model=args.ollama_model)
elif args.backend == "mlx":
backend = AppleMLXBackend(model=args.mlx_model)
elif args.backend == "llamacpp":
if not args.model_path:
print("Error: --model-path is required for the llamacpp backend.")
sys.exit(1)
backend = LlamaCppBackend(
model_path=args.model_path,
n_gpu_layers=args.gpu_layers,
)
else:
# Unreachable given argparse choices, but satisfies type checkers
# and makes the exhaustiveness of the chain explicit.
raise RuntimeError(f"Unknown backend: {args.backend!r}")
# Start the chat loop
run_chat(manager, backend)
if __name__ == "__main__":
main()
CHAPTER ELEVEN: THE MCP SERVER AND AGENT INTEGRATION
MemPalace exposes its functionality through a Model Context Protocol (MCP) server. MCP is an open standard for how AI models communicate with external tools and data sources. The server allows MCP-compatible clients such as Claude Desktop and Cursor to call MemPalace tools directly, without the user having to manually retrieve memories and paste them into the conversation.
The MCP server communicates using JSON-RPC 2.0 over stdin/stdout, which means it can be launched as a subprocess by any MCP-compatible client. The server exposes 24 tools organized into four categories.
The read tools allow the AI to search memories, retrieve specific drawers, query the knowledge graph, and read diary entries. The write tools allow the AI to store new memories, update knowledge graph facts, and write diary entries. The knowledge graph tools provide temporal query capabilities, allowing the AI to ask about the state of entities at specific points in time. The diary tools manage per-agent timestamped logs that allow specialist agents to maintain continuity across sessions.
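The wire format is plain JSON-RPC 2.0, one JSON message per line over the subprocess's stdin/stdout. The snippet below constructs what a tools/call request for the search tool looks like. The tool name mempalace_search comes from the text above, but the argument names ("query", "k") are illustrative assumptions, not the documented MemPalace schema.

```python
import json

# A JSON-RPC 2.0 request as an MCP client would send it to the server.
# "tools/call" is the MCP method for invoking a named tool; the
# arguments dict shape here is an assumption for illustration.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "mempalace_search",
        "arguments": {"query": "auth service race condition", "k": 5},
    },
}

# Each message travels as a single line of serialized JSON.
wire = json.dumps(request)
decoded = json.loads(wire)
```

The response comes back as a matching JSON-RPC message with the same id, carrying the tool's result in its result field.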
To configure Claude Desktop to use MemPalace as an MCP server, you add an entry to Claude's configuration file. The exact location of this file depends on your operating system, but on macOS it is typically at ~/Library/Application Support/Claude/claude_desktop_config.json.
{
"mcpServers": {
"mempalace": {
"command": "mempalace",
"args": ["serve", "--wing", "my-project"],
"env": {}
}
}
}
Once this configuration is in place, Claude will automatically have access to all 24 MemPalace tools. When you ask Claude a question that requires historical context, it can call mempalace_search to retrieve relevant memories before formulating its response. When Claude makes a decision or discovers something important, it can call mempalace_store to preserve that information for future sessions.
The diary system is particularly interesting for multi-agent applications. Each specialist agent in a MemPalace-powered system has its own diary, which is a chronological log of that agent's observations, decisions, and reflections. When a new session begins, the agent reads its own diary to reconstruct its expertise and context. This gives agents a form of persistent identity that survives session boundaries.
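Conceptually, a diary is nothing more than an append-only, timestamped, per-agent log that is replayed at session start. The sketch below implements that idea from scratch as a JSONL file; it illustrates the concept and is not the MemPalace diary API.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

def append_diary(diary_path: Path, agent: str, entry: str) -> None:
    """Append one timestamped observation to an agent's diary (JSONL)."""
    record = {
        "agent": agent,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "entry": entry,
    }
    with diary_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

def read_diary(diary_path: Path) -> list[dict]:
    """Read the full chronological log back at session start."""
    if not diary_path.exists():
        return []
    with diary_path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# One diary file per agent; a temp dir stands in for the palace storage.
diary = Path(tempfile.mkdtemp()) / "reviewer.jsonl"
append_diary(diary, "reviewer", "Flagged a race condition in auth service.")
append_diary(diary, "reviewer", "Race condition fixed; verified under load.")
entries = read_diary(diary)
```

Because the log is append-only and timestamped, an agent that reads it at startup sees its own history in order, which is exactly the continuity the diary tools provide.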
CHAPTER TWELVE: BENCHMARK RESULTS AND HONEST ASSESSMENT
The LongMemEval benchmark is the standard evaluation framework for AI memory systems. It tests a system's ability to retrieve relevant information from a large corpus of conversations and documents, measuring recall at various k values (the number of results returned per query).
MemPalace's results on this benchmark are genuinely impressive in raw verbatim mode. The 96.6% recall at 5 score means that for 96.6% of queries, the correct answer appears somewhere in the top 5 retrieved results. This is achieved without any external API calls, using only ChromaDB's default local embedding model. For comparison, many commercial memory systems that use expensive LLM-based reranking and cloud infrastructure achieve similar or lower scores.
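Recall at k is simple to compute: given, for each query, the ranked IDs the system returned and the set of IDs that actually answer the query, it is the fraction of queries with at least one relevant ID in the top k. A minimal sketch with made-up data:

```python
def recall_at_k(
    retrieved: list[list[str]],  # ranked result IDs, one list per query
    relevant: list[set[str]],    # relevant IDs, one set per query
    k: int,
) -> float:
    """Fraction of queries with at least one relevant ID in the top k."""
    hits = sum(
        1
        for ranked, gold in zip(retrieved, relevant)
        if any(doc_id in gold for doc_id in ranked[:k])
    )
    return hits / len(retrieved)

# Toy example: three queries; the first two hit within the top 5.
retrieved = [
    ["d3", "d9", "d1", "d7", "d2"],
    ["d4", "d8", "d5", "d6", "d0"],
    ["d9", "d9", "d9", "d9", "d9"],
]
relevant = [{"d1"}, {"d8"}, {"d2"}]
score = recall_at_k(retrieved, relevant, k=5)  # 2 of 3 queries hit
```

A 96.6% recall at 5 therefore means the hit condition held for 96.6% of the benchmark's queries.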
However, it is important to understand what this benchmark measures and what it does not. LongMemEval tests retrieval accuracy: can the system find the relevant passage? It does not test whether the LLM correctly uses that passage to answer a question, whether the retrieved context is sufficient for complex reasoning, or whether the system performs well on domains very different from the benchmark corpus.
The AAAK compression results tell an honest story about the trade-offs involved. The drop from 96.6% to 84.2% when using AAAK compression is significant and should be taken seriously. The developers' initial claim of lossless compression was incorrect, and their subsequent acknowledgment of this fact is a mark of intellectual honesty that reflects well on the project.
The 100% hybrid score mentioned in some documentation refers to a configuration that uses LLM-based reranking as a post-processing step after the initial ChromaDB retrieval. This configuration does make external API calls and is therefore not purely local. The 96.6% score is the one that is relevant for fully local, zero-API-cost operation.
CHAPTER THIRTEEN: REAL-WORLD USE CASES
The developer community has found numerous practical applications for MemPalace since its release. Several patterns have emerged as particularly valuable.
Engineering teams use MemPalace to preserve the rationale behind architectural decisions. When a team decides to use PostgreSQL instead of MongoDB, or to adopt a particular authentication pattern, the reasoning behind that decision is often scattered across Slack threads, pull request comments, and design documents. Six months later, when a new team member asks why the system is designed this way, nobody can reconstruct the full reasoning. MemPalace can mine all of these sources and make the complete decision history searchable.
Solo developers working on multiple projects use MemPalace to maintain context across projects. When you switch from one project to another and back again, it is easy to lose track of where you were, what you were trying to accomplish, and what decisions you had made. A MemPalace wing for each project, mined from your conversation history with AI assistants, gives you instant recall of your past thinking.
Research teams use MemPalace to maintain a complete audit trail of their research process. Every critique, revision, and dead end is preserved verbatim. This is valuable not just for recall but for understanding how the research evolved, which is often as important as the final conclusions.
Incident response teams use MemPalace to build institutional memory around production incidents. When a particular type of failure occurs, MemPalace can surface the mitigation steps from previous similar incidents, potentially saving hours of debugging time.
CHAPTER FOURTEEN: INSTALLATION REQUIREMENTS SUMMARY AND QUICK START
To bring everything together, here is a complete quick-start guide that you can follow to get a working MemPalace-powered chat application running on your machine.
The first step is to ensure you have Python 3.10 or higher installed. You can check your Python version by running python3 --version in your terminal.
The second step is to install the required packages. The exact set depends on which LLM backend you want to use.
# Core MemPalace dependencies (always required)
pip install mempalace chromadb pyyaml
# For Ollama backend (recommended for most users)
pip install ollama
# Then install Ollama itself from https://ollama.ai
# and pull a model:
ollama pull llama3.2
# For Apple MLX backend (macOS with M1+ only)
pip install mlx-lm
# For llama.cpp with CUDA (NVIDIA GPUs)
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python
# For llama.cpp with Vulkan (AMD, Intel, or other GPUs)
CMAKE_ARGS="-DGGML_VULKAN=on" pip install llama-cpp-python
The third step is to initialize your memory palace:
mempalace init --name "my-project" --path ~/.mempalace
The fourth step is to optionally mine some existing data. If you have conversation exports from Claude or ChatGPT, this is a great time to ingest them:
mempalace mine --source ~/Downloads/claude-export.json \
--wing my-project --mode convos
The fifth step is to save the three Python files from this tutorial (memory_manager.py, llm_backends.py, and memory_chat.py) and run the chat application:
python memory_chat.py --wing my-project --backend auto
The auto backend selection will probe your system and choose the best available option. If Ollama is running with a model pulled, it will use that. If you are on a Mac with Apple Silicon and mlx-lm is installed, it will use MLX. If you have a GGUF model file and llama-cpp-python installed, it will use that.
CONCLUSION: WHAT MEMPALACE REPRESENTS
MemPalace is more than a clever implementation of vector search with a spatial metaphor. It represents a philosophical position about how AI memory should work: store everything, organize spatially, retrieve deterministically, and never discard information prematurely.
The project's rapid adoption, accumulating over 23,000 GitHub stars within two days of release, suggests that this philosophy resonates deeply with the developer community. The AI amnesia problem is real, it is frustrating, and it is holding back the practical utility of AI assistants in long-running, complex projects.
The honest acknowledgment that AAAK compression is lossy, after initially overclaiming its capabilities, is a sign of a project that values accuracy over marketing. This kind of intellectual honesty is rare and valuable in the open-source AI space.
As language models continue to improve and context windows continue to grow, the relative importance of external memory systems may shift. But for the foreseeable future, systems like MemPalace fill a genuine gap: they give AI assistants the ability to remember, and in doing so, they make those assistants dramatically more useful for the complex, long-running work that matters most.
The ancient Greeks used memory palaces to deliver hours-long speeches from memory. MemPalace has given that technique a new home in the age of artificial intelligence, and the result is a system that is both technically impressive and philosophically coherent. That is a rare combination, and it is worth your time to explore it.
APPENDIX: DEPENDENCY REFERENCE
The following list summarizes all dependencies discussed in this tutorial, their purposes, and their installation commands.
mempalace: The core MemPalace library. Install with pip install mempalace.
chromadb: Vector database for semantic search over verbatim text. Install with pip install chromadb. Used automatically by MemPalace.
pyyaml: YAML configuration file parser. Install with pip install pyyaml. Used automatically by MemPalace.
ollama: Python client for the Ollama local LLM server. Install with pip install ollama. Requires the Ollama application from ollama.ai.
mlx-lm: Apple MLX inference library for Apple Silicon. Install with pip install mlx-lm. Only available on macOS with M1 or later chips.
llama-cpp-python: Python bindings for llama.cpp. Installation varies by GPU backend. See Chapter Nine for the exact CMAKE_ARGS for CUDA, Vulkan, and Metal.
The GitHub repository for MemPalace is located at github.com/milla-jovovich/mempalace and is licensed under the MIT license, which means you are free to use it in commercial and personal projects as long as you retain the copyright and license notice.