Monday, April 13, 2026

MEMPALACE: A COMPLETE TECHNICAL TUTORIAL


 


The AI Memory System Inspired by an Ancient Technique, Co-Created by Milla Jovovich and Ben Sigman


PREFACE: THE PROBLEM THAT STARTED IT ALL

Imagine hiring a brilliant assistant who reads everything you give them, answers every question perfectly, and then wakes up the next morning with absolutely no memory of who you are. Every single day. Forever. You would explain your project from scratch, re-establish your preferences, re-describe your team, and re-share every decision you made together. This is, more or less, the daily reality of working with large language models.

This problem has a name in the AI community: AI amnesia. It is not a bug in the traditional sense. It is a fundamental architectural constraint. Language models process a fixed window of tokens, and when a conversation ends, that window closes. Nothing persists. The model does not dream about your codebase at night. It does not remember that you prefer tabs over spaces, that your database migration failed last Tuesday, or that your colleague Soren spent three weeks debugging a race condition in the authentication service.

For years, developers have tried to solve this with summarization. The idea is simple enough: at the end of each session, ask the model to summarize what happened, store that summary, and inject it at the start of the next session. The problem is that summarization is lossy by definition. When you ask a model to condense a three-hour debugging session into a paragraph, it makes judgment calls about what matters. Sometimes it is right. Sometimes it discards the one detail that would have saved you two hours next week. The model does not know what it does not know it will need.

In April 2026, Milla Jovovich and developer Ben Sigman released MemPalace on GitHub, and within two days it had accumulated over 23,000 stars and nearly 3,000 forks. The project proposed a different philosophy entirely: store everything verbatim, organize it spatially using an ancient mnemonic technique, and retrieve it deterministically without ever calling an external API. The ancient technique in question is the method of loci, also known as the memory palace, a strategy used by Greek and Roman orators to memorize hours of speeches by mentally placing ideas in rooms of an imaginary building and walking through those rooms during recall.

This tutorial will take you through every aspect of MemPalace: its philosophy, its architecture, its data structures, its compression system, its benchmark results, and how to integrate it into your own applications using local and remote language models across different hardware backends including NVIDIA CUDA, Apple MLX, and Vulkan-capable GPUs.


CHAPTER ONE: THE PHILOSOPHY OF VERBATIM MEMORY

Before we touch a single line of code, it is worth spending time with the core idea that separates MemPalace from every other AI memory system that came before it. That idea is the verbatim-first philosophy.

Most memory systems for AI work on the assumption that storage is expensive and context windows are precious. Therefore, they compress aggressively. They summarize conversations, extract entities, build knowledge graphs from inferred relationships, and throw away the raw text. The result is a system that is efficient but brittle. The compressed representation is only as good as the model that created it, and that model had no way of knowing which details would matter six months later.

MemPalace takes the opposite position. Storage is cheap. Context windows, while limited, are growing. Modern embedding models are remarkably good at finding relevant passages in large corpora. Therefore, the right strategy is to store everything exactly as it was said, organize it so that retrieval is fast and accurate, and let the embedding model do the work of finding what matters.

This is not a new idea in information science. Archivists and librarians have known for centuries that the original document is always more valuable than any summary of it. What MemPalace does is apply this principle systematically to AI memory, with a retrieval architecture sophisticated enough to make it practical.

The verbatim-first philosophy has several important consequences for the system design. First, it means that ChromaDB, the vector database at the heart of MemPalace, stores the actual text of conversations and documents rather than summaries. Second, it means that the write path, the process of ingesting new information, does not need to call an LLM at all. Classification, chunking, and organization are handled by deterministic regex heuristics and keyword scoring. Third, it means that the system can answer questions about the past with high fidelity, because the past is still there, word for word.

The benchmark results support this approach. In raw verbatim mode, MemPalace achieved a Recall@5 of 96.6% on the LongMemEval benchmark, meaning the correct answer appeared among the top five retrieved results 96.6% of the time, without making a single external API call. This is a remarkable result for a system that runs entirely on your local machine.


CHAPTER TWO: THE MEMORY PALACE ARCHITECTURE

The method of loci works because the human brain is extraordinarily good at spatial memory. We remember places, routes, and rooms far better than we remember abstract lists. By associating information with locations in a familiar space, we give our brains a retrieval cue that is far more reliable than rote memorization.

MemPalace translates this spatial metaphor into a hierarchical data structure. Understanding this hierarchy is essential to understanding how the system works, so we will spend considerable time here before moving to implementation details.


THE WING

A wing is the top-level container in a MemPalace. Think of it as an entire building within your memory palace complex. Each wing represents a distinct domain: a project, a person, a client, a research area, or any other top-level category that makes sense for your use case. If you are a solo developer working on three projects simultaneously, you might have three wings: one for each project. If you are managing a team, you might have a wing for each team member in addition to wings for each project.

Wings are important because they provide the first level of isolation. A search within a specific wing only retrieves information from that wing, which dramatically reduces noise and improves precision. A search across all wings is also possible when you need to find connections between domains.


THE HALL

Halls are corridors that run through every wing in the palace. They represent recurring memory types that appear in every domain. The default halls in MemPalace are facts, events, discoveries, preferences, and advice. Every wing has all of these halls, which means that if you want to find all the facts across all your projects, you can walk down the facts hall and look into each wing's room along that corridor.

This is a subtle but powerful design decision. It means that memory is organized along two orthogonal axes simultaneously: by domain (wing) and by type (hall). You can retrieve all facts about a specific project, or all facts across all projects, or all events within a specific project, depending on what you need.


THE ROOM

Rooms are specific topics within a wing. If a wing represents a software project, its rooms might be named authentication, billing, deployment, database-schema, and frontend. Rooms are created automatically by MemPalace when it detects that a conversation or document is about a specific topic. The topic detection uses keyword scoring and regex heuristics, not an LLM, which keeps the process fast and free.

Rooms are where the actual retrieval happens. When you search for information about authentication in your project wing, MemPalace navigates to the authentication room and searches within it. This spatial navigation dramatically reduces the search space compared to a flat vector search over all stored documents.


THE DRAWER

Drawers are where the raw, verbatim text lives. Every room has drawers, and each drawer contains an original document or conversation chunk exactly as it was stored. Drawers are the source of truth in MemPalace. Everything else in the system is either an index into the drawers or a derived view of their contents.

In the implementation, drawers correspond to entries in the ChromaDB collection. Each entry contains the full text of a chunk, along with metadata that records which wing, hall, and room it belongs to, when it was stored, and what source it came from.
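To make this concrete, here is a sketch of what a single drawer record conceptually contains. The field names are illustrative, not MemPalace's exact schema:

```python
# A drawer is a verbatim chunk plus its palace coordinates.
drawer_entry = {
    "id": "raw_chunk_001",
    "document": "We decided to rotate JWT signing keys every 90 days.",
    "metadata": {
        "wing": "project-alpha",
        "hall": "facts",
        "room": "authentication",
        "source": "claude-export.json",
        "stored_at": "2026-04-13",
    },
}

# The verbatim text is always preserved; everything else in the
# system is an index into records like this one.
print(drawer_entry["metadata"]["room"])  # authentication
```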


THE CLOSET

Closets are adjacent to drawers and contain compressed summaries that point back to the original drawer content. They are intended for quick human scanning, not for retrieval accuracy. The distinction is important: when MemPalace retrieves information to give to an LLM, it retrieves from drawers, not closets. Closets are a convenience feature for human operators who want to browse the palace without reading every verbatim document.


THE TUNNEL

Tunnels are the most architecturally interesting feature of MemPalace. When the same room name appears in two different wings, a tunnel is automatically created between them. This tunnel represents a cross-domain connection: the authentication room in your web project wing is connected by a tunnel to the authentication room in your mobile project wing.

Tunnels enable a form of associative retrieval that flat vector search cannot provide. When you search for authentication information, MemPalace can follow tunnels to find related information in other wings, surfacing connections that you might not have thought to look for explicitly.
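Tunnel creation can be sketched as a set intersection over room names. The following helper is illustrative, not MemPalace's actual implementation; it finds every pair of wings that share a room:

```python
from itertools import combinations

def find_tunnels(palace: dict[str, set[str]]) -> list[tuple[str, str, str]]:
    """Return (wing_a, wing_b, room) triples for every room name
    that appears in more than one wing."""
    tunnels = []
    for (wing_a, rooms_a), (wing_b, rooms_b) in combinations(palace.items(), 2):
        # Shared room names between the two wings become tunnels.
        for room in sorted(rooms_a & rooms_b):
            tunnels.append((wing_a, wing_b, room))
    return tunnels

palace = {
    "project-alpha": {"authentication", "billing"},
    "project-beta": {"authentication", "analytics"},
}
print(find_tunnels(palace))  # [('project-alpha', 'project-beta', 'authentication')]
```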


The following ASCII diagram illustrates the complete hierarchy:


MEMORY PALACE
|
+-- WING: project-alpha
|   |
|   +-- HALL: facts
|   |   +-- ROOM: authentication
|   |   |   +-- DRAWER: raw_chunk_001.txt  (verbatim text)
|   |   |   +-- DRAWER: raw_chunk_002.txt  (verbatim text)
|   |   |   +-- CLOSET: summary_auth.aaak  (compressed summary)
|   |   |
|   |   +-- ROOM: billing
|   |       +-- DRAWER: raw_chunk_003.txt
|   |
|   +-- HALL: events
|       +-- ROOM: deployment
|           +-- DRAWER: raw_chunk_004.txt
|
+-- WING: project-beta
|   |
|   +-- HALL: facts
|       +-- ROOM: authentication  <-- TUNNEL connects to project-alpha/facts/authentication
|           +-- DRAWER: raw_chunk_005.txt
|
+-- TUNNEL: project-alpha/authentication <-> project-beta/authentication


This diagram shows two wings, each with its own halls and rooms, and a tunnel connecting the authentication rooms across wings. The drawers contain verbatim text, and the closets contain compressed summaries.


CHAPTER THREE: THE TECHNOLOGY STACK

MemPalace is built on a deliberately minimal set of dependencies. This is a conscious design choice: every dependency is a potential point of failure, a version conflict waiting to happen, and a barrier to adoption. The core stack consists of three components.


CHROMADB: THE VECTOR STORE

ChromaDB is an open-source vector database designed specifically for AI applications. It stores text alongside its vector embedding, which is a numerical representation of the text's semantic meaning. When you search ChromaDB with a query, it converts your query to a vector and finds the stored texts whose vectors are most similar. This is semantic search: it finds texts that mean something similar to your query, not just texts that contain the same keywords.
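The similarity computation underlying this is typically cosine similarity between embedding vectors. A toy illustration with hand-made three-dimensional vectors (real embedding models produce hundreds of dimensions, and ChromaDB handles the embedding step for you):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: the dot product of two vectors divided
    by the product of their magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" for two stored texts.
docs = {
    "postgres migration failed on tuesday": [0.9, 0.1, 0.2],
    "jwt tokens expire after one hour":     [0.1, 0.9, 0.3],
}
# Imagine this vector came from embedding the query "database error".
query_vec = [0.8, 0.2, 0.1]

best = max(docs, key=lambda d: cosine(docs[d], query_vec))
print(best)  # postgres migration failed on tuesday
```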

In MemPalace, ChromaDB stores all drawer contents. Each drawer entry is a document in the ChromaDB collection, with metadata fields that encode its position in the palace hierarchy. When MemPalace performs a retrieval, it queries ChromaDB with the user's question, optionally filtered by wing, hall, or room metadata, and retrieves the most semantically relevant drawer contents.

ChromaDB uses its default embedding model out of the box, which is a sentence transformer model that runs entirely locally. No API key is required. No data leaves your machine. The embedding model converts text to vectors on your local CPU or GPU, and ChromaDB stores those vectors in a local SQLite file.


SQLITE: THE KNOWLEDGE GRAPH

While ChromaDB handles semantic search over verbatim text, SQLite handles structured facts and their temporal relationships. MemPalace maintains a temporal knowledge graph in SQLite that stores entity-relationship triples of the form (subject, predicate, object) along with validity timestamps.

A temporal knowledge graph is a knowledge graph where every fact has an associated time range during which it is true. For example, the fact that a particular developer is working on the authentication module might be true from January to March but no longer true in April when they switch to a different project. In a standard knowledge graph, updating this fact would overwrite the old one. In a temporal knowledge graph, the old fact is preserved with a valid_to timestamp, and the new fact is added with a valid_from timestamp. This allows you to query the state of the world at any point in time.

The SQLite schema for the knowledge graph looks conceptually like this:


TABLE: entity_relations
+-----------+------------+----------+------------+------------+
| subject   | predicate  | object   | valid_from | valid_to   |
+-----------+------------+----------+------------+------------+
| soren     | works-on   | auth     | 2026-01-15 | 2026-03-30 |
| soren     | works-on   | billing  | 2026-04-01 | NULL       |
| project   | uses       | postgres | 2025-06-01 | NULL       |
+-----------+------------+----------+------------+------------+


A NULL value in valid_to means the fact is currently true. This schema allows MemPalace to answer questions like "what was Soren working on in February?" by filtering for rows where valid_from is before the query date and valid_to is after it or NULL.
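That query can be expressed directly in SQL. A minimal sketch using Python's built-in sqlite3 module and the rows from the table above (the schema here is conceptual, mirroring the diagram rather than MemPalace's exact DDL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE entity_relations (
        subject TEXT, predicate TEXT, object TEXT,
        valid_from TEXT, valid_to TEXT
    )
""")
conn.executemany(
    "INSERT INTO entity_relations VALUES (?, ?, ?, ?, ?)",
    [
        ("soren", "works-on", "auth", "2026-01-15", "2026-03-30"),
        ("soren", "works-on", "billing", "2026-04-01", None),
        ("project", "uses", "postgres", "2025-06-01", None),
    ],
)

# "What was Soren working on in February 2026?"
# ISO date strings compare correctly as text, so plain <= / >= work.
rows = conn.execute(
    """
    SELECT object FROM entity_relations
    WHERE subject = ? AND predicate = ?
      AND valid_from <= ?
      AND (valid_to IS NULL OR valid_to >= ?)
    """,
    ("soren", "works-on", "2026-02-15", "2026-02-15"),
).fetchall()
print(rows)  # [('auth',)]
```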


PYYAML: CONFIGURATION

PyYAML handles configuration files, which describe the structure of the palace, the locations of data sources to mine, and various behavioral parameters. YAML is a natural choice for configuration because it is human-readable, supports hierarchical structures, and is familiar to most developers.
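Loading such a file is a one-liner with PyYAML's safe_load. A toy fragment mirroring the shape of a palace config:

```python
import yaml

config_text = """
palace:
  name: my-project
wings:
  - name: project-alpha
    halls: [facts, events]
"""

# safe_load parses the YAML into plain dicts and lists without
# executing arbitrary constructors, which is why it is preferred
# over yaml.load for configuration files.
config = yaml.safe_load(config_text)
print(config["wings"][0]["halls"])  # ['facts', 'events']
```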


CHAPTER FOUR: AAAK COMPRESSION

AAAK is one of the most discussed and debated features of MemPalace. It is described as an aggressive abbreviation dialect that any LLM can read without a decoder. The name stands for Aggressive Abbreviation for AI Knowledge, and the idea is to pack repeated entities into fewer tokens so that more context can fit into a model's context window.

The core mechanism of AAAK is entity coding. When a piece of text mentions the same entity repeatedly, AAAK replaces subsequent mentions with a short code. For example, if a conversation mentions the authentication service seventeen times, AAAK might replace it with the code A1 after the first occurrence, along with a header that maps A1 to authentication service. This can dramatically reduce token count for texts that discuss a small number of entities repeatedly.
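The mechanism can be illustrated with a toy encoder. This sketch is not MemPalace's actual AAAK implementation; it only demonstrates the keep-first-mention, code-later-mentions idea:

```python
def aaak_encode(text: str, entities: list[str]) -> str:
    """Replace repeated entity mentions with short codes, keeping the
    first occurrence intact and prepending a code-to-entity header."""
    header_lines = []
    for i, entity in enumerate(entities, start=1):
        code = f"A{i}"
        first = text.find(entity)
        if first == -1:
            continue
        # Keep text up to and including the first mention verbatim,
        # then substitute the code for every later mention.
        before = text[: first + len(entity)]
        after = text[first + len(entity):].replace(entity, code)
        text = before + after
        header_lines.append(f"{code}={entity}")
    return "\n".join(header_lines) + "\n" + text

sample = ("The authentication service failed. We restarted the "
          "authentication service and the authentication service recovered.")
encoded = aaak_encode(sample, ["authentication service"])
print(encoded)
```

The header line A1=authentication service is what lets any LLM expand the codes without a dedicated decoder.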

AAAK also uses structural markers to replace common phrases and sentence patterns with abbreviated forms. The compression is designed to be readable by LLMs because the codes are consistent and the mapping is always provided in a header that the model reads first.

However, it is important to understand the honest assessment of AAAK's performance. The developers initially claimed 30x compression with zero information loss, which turned out to be overstated. Independent testing on the LongMemEval benchmark showed that AAAK mode reduced recall accuracy from 96.6% to 84.2% compared to raw verbatim mode. The developers have since acknowledged that AAAK is lossy and that the lossless claim was incorrect.

This means that AAAK is a trade-off: you get smaller storage and faster context loading at the cost of some retrieval accuracy. For applications where token efficiency is critical and some accuracy loss is acceptable, AAAK may be appropriate. For applications where accuracy is paramount, raw verbatim mode is the better choice.

The default mode for MemPalace is raw verbatim, which is why the benchmark results are so strong. AAAK is an optional experimental feature that you can enable in the configuration.


CHAPTER FIVE: INSTALLATION AND SETUP

Now we move from philosophy and architecture to practice. Setting up MemPalace requires Python 3.10 or higher, and the installation process is straightforward.

The recommended installation method uses pipx, which installs MemPalace in an isolated environment and makes its command-line tools available globally without polluting your system Python installation. If you do not have pipx installed, you can install it with pip and then use it to install MemPalace.

pip install pipx
pipx ensurepath
pipx install mempalace


If you prefer to install MemPalace directly into a project's virtual environment, the standard pip approach works as well. The following example shows how to set up a complete project environment from scratch, which is the approach we will use throughout this tutorial.


# Create a new project directory and navigate into it
mkdir mempalace-tutorial
cd mempalace-tutorial

# Create a virtual environment with Python 3.11
python3.11 -m venv venv

# Activate the virtual environment
# On macOS and Linux:
source venv/bin/activate
# On Windows:
# venv\Scripts\activate

# Install MemPalace and its dependencies
pip install mempalace chromadb pyyaml


After installation, verify that everything is working by running the MemPalace help command:


mempalace --help


You should see a list of available commands including init, mine, search, wake-up, and serve. If you see this output, your installation is complete.


INITIALIZING A PALACE

Before you can store any memories, you need to initialize a palace. The init command creates the directory structure and configuration files that MemPalace needs.

mempalace init --name "my-project" --path ~/.mempalace


This command creates a directory at the specified path containing a config.yaml file, an empty ChromaDB collection, and an empty SQLite database. The config.yaml file is where you define the structure of your palace: which wings to create, which halls to include in each wing, and where to find data sources to mine.

A minimal config.yaml looks like this:


# ~/.mempalace/config.yaml
# This file defines the structure of your memory palace.
# Wings are top-level domains. Add one per project or person.

palace:
  name: "my-project"
  version: "1.0"

wings:
  - name: "project-alpha"
    description: "Main web application project"
    halls:
      - facts
      - events
      - discoveries
      - preferences
      - advice

storage:
  chromadb_path: "~/.mempalace/chroma"
  sqlite_path: "~/.mempalace/knowledge.db"
  compression: "raw"  # Options: raw, aaak

retrieval:
  default_k: 5  # Number of results to return per search
  wake_up_tokens: 170  # Maximum tokens for startup context


CHAPTER SIX: MINING DATA INTO THE PALACE

Mining is the process of ingesting existing data into MemPalace. The system can mine several types of sources: project directories containing code and documentation, exported conversation histories from Claude, ChatGPT, or Slack, and arbitrary text files.

The mining process is entirely deterministic and requires no LLM calls. MemPalace reads each source file, splits it into chunks using a sliding window approach, scores each chunk against keyword lists to determine which room it belongs to, and stores it in the appropriate location in the palace hierarchy.


MINING A PROJECT DIRECTORY

To mine a software project, point MemPalace at the project root. It will recursively scan all files, skip binary files and common noise directories like node_modules and .git, and ingest the rest.

mempalace mine --source ~/projects/my-app --wing project-alpha --mode project


The mode flag tells MemPalace what kind of data to expect. In project mode, it applies heuristics tuned for code and documentation. It recognizes common patterns like function definitions, class declarations, configuration keys, and markdown headings, and uses these as signals for room assignment.


MINING CONVERSATION HISTORIES

Conversation mining is where MemPalace really shines. If you have been using Claude or ChatGPT for months, you likely have hundreds of conversations containing decisions, discoveries, and context that you have completely forgotten. MemPalace can ingest these conversations and make them searchable.

mempalace mine --source ~/Downloads/claude-export.json --wing project-alpha --mode convos


The conversation export format varies by platform. MemPalace includes parsers for the standard export formats of Claude and ChatGPT. For other formats, you can write a simple preprocessing script to convert your data to the plain text format that MemPalace accepts.


UNDERSTANDING THE MINING PIPELINE

It is worth understanding what happens inside the mining pipeline, because this knowledge will help you troubleshoot issues and tune the system for your specific use case.

When MemPalace receives a source file, it first determines the file type and applies the appropriate parser. For plain text and markdown files, the parser is trivial: it reads the file and splits it into chunks. For JSON conversation exports, the parser extracts the message content from each turn, labels it with the speaker role, and concatenates turns into coherent chunks.

After parsing, each chunk goes through the room classifier. The classifier scores the chunk against a set of keyword lists, one per room type. The keyword lists are defined in the configuration and can be customized. A chunk about database connection pooling would score highly against keywords like database, connection, pool, timeout, and query, and would be assigned to the database room if one exists in the current wing.

After classification, the chunk is embedded by ChromaDB's default embedding model and stored in the collection with metadata encoding its palace position. The entire process for a typical conversation export takes a few seconds to a few minutes depending on the size of the export and the speed of your hardware.
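The chunking and classification steps can be sketched in a few lines. The window sizes and keyword lists below are illustrative, not MemPalace's defaults:

```python
def chunk_text(text: str, window: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows, a simple
    stand-in for MemPalace's sliding-window chunker."""
    step = window - overlap
    return [text[i : i + window] for i in range(0, max(len(text) - overlap, 1), step)]

# One keyword set per room; real configs would be larger and tunable.
ROOM_KEYWORDS = {
    "database": {"database", "connection", "pool", "timeout", "query"},
    "authentication": {"auth", "token", "jwt", "login", "password"},
}

def classify_room(chunk: str, keywords: dict[str, set[str]] = ROOM_KEYWORDS) -> str:
    """Score a chunk against each room's keyword list and return the
    highest-scoring room, falling back to 'general' on no matches."""
    words = set(chunk.lower().split())
    scores = {room: len(words & kws) for room, kws in keywords.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "general"

print(classify_room("The connection pool hit its timeout during a long query"))
# database
```

Because every step is deterministic string processing, the same input always lands in the same room, with no LLM calls and no API costs.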


CHAPTER SEVEN: SEARCHING THE PALACE

Retrieval is where all the architectural decisions pay off. MemPalace provides several retrieval modes, each suited to different use cases.


THE BASIC SEARCH

The simplest retrieval operation is a semantic search across the entire palace or within a specific wing.

mempalace search "what decisions were made about the database schema"


This command queries ChromaDB with the provided text, retrieves the top five most semantically similar drawer contents, and prints them to the terminal. The output includes the text of each result, its palace location (wing, hall, room), its source file, and a similarity score.

You can narrow the search to a specific wing to reduce noise:

mempalace search "authentication decisions" --wing project-alpha

Or narrow it further to a specific room:

mempalace search "JWT token expiry" --wing project-alpha --room authentication


THE WAKE-UP CONTEXT

The wake-up feature is designed for the beginning of a new AI session. It generates a compact summary of the most critical facts about the current palace, designed to fit within approximately 170 tokens. This gives the LLM just enough context to orient itself without consuming a significant portion of its context window.

mempalace wake-up --wing project-alpha


The wake-up output is a structured text block that lists the most recently accessed facts, the most frequently referenced entities, and any critical flags that have been set manually. It is designed to be pasted directly into a system prompt or prepended to the first user message in a new session.


THE KNOWLEDGE GRAPH QUERY

For structured queries about entities and their relationships, MemPalace provides a knowledge graph query interface that operates on the SQLite temporal graph.

mempalace kg-query --subject "soren" --predicate "works-on"


This returns all known facts about what Soren works on, including historical facts with their validity timestamps. You can add a temporal filter to ask about a specific point in time:

mempalace kg-query --subject "soren" --predicate "works-on" --at "2026-02-15"


CHAPTER EIGHT: PYTHON API INTEGRATION

The command-line interface is useful for exploration and testing, but real applications need to integrate MemPalace programmatically. MemPalace provides a Python API that exposes all of its functionality.

The following example demonstrates a complete integration pattern. We will build a MemoryManager class that wraps MemPalace and provides a clean interface for storing and retrieving memories in an application. This class follows clean architecture principles by separating concerns and hiding implementation details behind a well-defined interface.


"""

memory_manager.py


A clean-architecture wrapper around MemPalace that provides

a simple interface for AI memory operations. This module

handles all interaction with the MemPalace Python API,

including initialization, storage, and retrieval.

"""


import logging

from dataclasses import dataclass

from typing import Optional

from pathlib import Path


# MemPalace imports - these become available after pip install mempalace

from mempalace import MemPalace

from mempalace.models import SearchResult, WakeUpContext


# Configure module-level logging so callers can control verbosity

logger = logging.getLogger(__name__)



@dataclass

class MemoryConfig:

    """

    Configuration for the MemoryManager.


    All paths are resolved relative to the user's home directory

    if they begin with a tilde, following Unix conventions.

    """

    # Path where MemPalace stores its ChromaDB and SQLite files

    storage_path: str = "~/.mempalace"


    # The wing to use for this application instance

    wing_name: str = "default"


    # Number of results to return from semantic searches

    search_k: int = 5


    # Maximum tokens for wake-up context

    wake_up_tokens: int = 170


    # Whether to use AAAK compression (False = raw verbatim, recommended)

    use_aaak: bool = False


    def resolved_storage_path(self) -> Path:

        """Return the storage path with tilde expanded."""

        return Path(self.storage_path).expanduser()



class MemoryManager:

    """

    High-level interface for AI memory operations using MemPalace.


    This class provides a clean, testable interface that hides the

    details of ChromaDB, SQLite, and the palace hierarchy from

    calling code. Applications should interact with memory exclusively

    through this class.


    Usage:

        config = MemoryConfig(wing_name="my-project")

        manager = MemoryManager(config)

        manager.initialize()


        # Store a memory

        manager.store(

            text="We decided to use PostgreSQL for the main database.",

            room="database",

            hall="decisions"

        )


        # Retrieve relevant memories

        results = manager.search("database technology choice")

        for result in results:

            print(result.text)

    """


    def __init__(self, config: MemoryConfig):

        """

        Initialize the MemoryManager with the given configuration.


        The MemPalace instance is not created here; call initialize()

        before using any other methods. This separation allows the

        manager to be constructed without side effects.

        """

        self.config = config

        self._palace: Optional[MemPalace] = None


    def initialize(self) -> None:

        """

        Create or connect to the MemPalace storage.


        This method creates the storage directory if it does not exist,

        initializes the ChromaDB collection and SQLite database, and

        prepares the palace for use. It is safe to call multiple times;

        subsequent calls are no-ops if the palace is already initialized.

        """

        if self._palace is not None:

            logger.debug("MemoryManager already initialized, skipping.")

            return


        storage_path = self.config.resolved_storage_path()

        storage_path.mkdir(parents=True, exist_ok=True)


        logger.info(f"Initializing MemPalace at {storage_path}")


        # Create the MemPalace instance with our configuration

        self._palace = MemPalace(

            storage_path=str(storage_path),

            default_wing=self.config.wing_name,

            compression="aaak" if self.config.use_aaak else "raw",

            wake_up_tokens=self.config.wake_up_tokens,

        )


        logger.info(

            f"MemPalace initialized. Wing: {self.config.wing_name}, "

            f"Compression: {'AAAK' if self.config.use_aaak else 'raw verbatim'}"

        )


    def _require_initialized(self) -> MemPalace:

        """

        Return the palace instance, raising if not initialized.


        This guard method prevents confusing AttributeError messages

        when callers forget to call initialize() first.

        """

        if self._palace is None:

            raise RuntimeError(

                "MemoryManager has not been initialized. "

                "Call initialize() before using any other methods."

            )

        # Explicit assertion allows static analysis tools (mypy, pylance)

        # to narrow the type from Optional[MemPalace] to MemPalace.

        assert self._palace is not None

        return self._palace


    def store(

        self,

        text: str,

        room: str,

        hall: str = "facts",

        source: str = "manual",

        metadata: Optional[dict] = None,

    ) -> str:

        """

        Store a piece of text in the memory palace.


        The text is stored verbatim in the specified room and hall

        within the configured wing. No LLM calls are made during

        storage; classification and embedding happen locally.


        Args:

            text:     The verbatim text to store.

            room:     The room within the wing (e.g., "authentication").

            hall:     The hall type (e.g., "facts", "events", "decisions").

            source:   A label identifying where this text came from.

            metadata: Optional additional metadata to store with the text.


        Returns:

            The unique identifier assigned to this memory by ChromaDB.

        """

        palace = self._require_initialized()


        # Merge caller-provided metadata with required palace metadata

        combined_metadata: dict = {

            "wing": self.config.wing_name,

            "hall": hall,

            "room": room,

            "source": source,

        }

        if metadata:

            combined_metadata.update(metadata)


        memory_id = palace.store(

            text=text,

            wing=self.config.wing_name,

            hall=hall,

            room=room,

            metadata=combined_metadata,

        )


        logger.debug(

            f"Stored memory {memory_id} in "

            f"{self.config.wing_name}/{hall}/{room}"

        )

        return memory_id


    def search(

        self,

        query: str,

        room: Optional[str] = None,

        hall: Optional[str] = None,

        k: Optional[int] = None,

    ) -> list[SearchResult]:

        """

        Search the memory palace for relevant content.


        Performs a semantic search using ChromaDB's vector similarity.

        Results are ordered by relevance, with the most relevant first.

        Optionally filter by room and/or hall to narrow the search space.


        Args:

            query: The natural language query to search for.

            room:  Optional room filter (e.g., "authentication").

            hall:  Optional hall filter (e.g., "facts").

            k:     Number of results to return. Defaults to config value.


        Returns:

            A list of SearchResult objects ordered by relevance.

        """

        palace = self._require_initialized()

        effective_k = k if k is not None else self.config.search_k


        results = palace.search(

            query=query,

            wing=self.config.wing_name,

            room=room,

            hall=hall,

            k=effective_k,

        )


        logger.debug(

            f"Search for '{query[:50]}...' returned {len(results)} results"

        )

        return results


    def wake_up(self) -> WakeUpContext:

        """

        Generate a compact startup context for a new AI session.


        Returns a WakeUpContext object containing a text field suitable

        for injection into an LLM system prompt. The text is guaranteed

        to fit within the configured wake_up_tokens limit.

        """

        palace = self._require_initialized()

        return palace.wake_up(wing=self.config.wing_name)


    def mine_directory(self, path: str, mode: str = "project") -> int:

        """

        Mine a directory of files into the memory palace.


        Recursively scans the directory, parses each file according

        to the specified mode, and stores the results in the palace.

        Returns the number of chunks successfully stored.


        Args:

            path: Path to the directory to mine.

            mode: Mining mode. "project" for code/docs, "convos" for

                  conversation exports.


        Returns:

            The number of memory chunks stored.

        """

        palace = self._require_initialized()

        resolved_path = Path(path).expanduser().resolve()


        if not resolved_path.exists():

            raise FileNotFoundError(

                f"Mining source directory not found: {resolved_path}"

            )


        logger.info(f"Mining {resolved_path} in {mode} mode")

        count = palace.mine(

            source=str(resolved_path),

            wing=self.config.wing_name,

            mode=mode,

        )

        logger.info(f"Mining complete. Stored {count} memory chunks.")

        return count


The key changes from the original in this module are as follows. First, the unused field import from dataclasses has been removed. Second, the _require_initialized method now includes an explicit assert self._palace is not None statement after the guard check, which allows static analysis tools such as mypy and pylance to correctly narrow the type from Optional[MemPalace] to MemPalace without relying on control-flow inference alone. Third, combined_metadata is explicitly typed as dict to satisfy strict type checkers.
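The assert-based narrowing pattern is worth seeing in isolation. The following is a minimal, self-contained sketch; the Resource and Holder names are invented for illustration and are not MemPalace classes:

```python
from typing import Optional


class Resource:
    def ping(self) -> str:
        return "ok"


class Holder:
    def __init__(self) -> None:
        self._resource: Optional[Resource] = None

    def initialize(self) -> None:
        self._resource = Resource()

    def _require_initialized(self) -> Resource:
        if self._resource is None:
            raise RuntimeError("Call initialize() first.")
        # After the guard above, this assert narrows Optional[Resource]
        # to Resource for static analyzers such as mypy and pylance,
        # without depending on their control-flow inference.
        assert self._resource is not None
        return self._resource


holder = Holder()
holder.initialize()
print(holder._require_initialized().ping())  # prints: ok
```

At runtime the assert is redundant with the guard; its value is purely for the type checker, which is exactly the trade-off described above.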


CHAPTER NINE: INTEGRATING WITH LOCAL LLMS

Now we come to the part of the tutorial that brings everything together: using MemPalace as the memory layer for a local language model. We will build a complete chat application that maintains persistent memory across sessions, using MemPalace for storage and retrieval, and supports multiple LLM backends: Ollama (which covers CUDA, Metal, and Vulkan), Apple MLX, and llama.cpp with direct GPU acceleration.

The architecture we will build follows a retrieval-augmented generation pattern. When the user sends a message, the application first queries MemPalace for relevant memories, constructs a prompt that includes those memories as context, sends the prompt to the LLM, and finally stores the conversation turn in MemPalace for future retrieval.
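Before we introduce real backends, the four-step loop can be sketched with stand-ins. FakeMemory and FakeLLM below are toy stubs invented purely for illustration; they are not MemPalace or llm_backends code, but the shape of chat_turn is the shape of the application we will build:

```python
class FakeMemory:
    """Toy stand-in for MemPalace: stores strings, ranks by keyword overlap."""

    def __init__(self) -> None:
        self.entries: list[str] = []

    def search(self, query: str, k: int = 3) -> list[str]:
        words = set(query.lower().split())
        scored = sorted(
            self.entries,
            key=lambda e: len(words & set(e.lower().split())),
            reverse=True,
        )
        return scored[:k]

    def store(self, text: str) -> None:
        self.entries.append(text)


class FakeLLM:
    """Toy stand-in for an LLM backend: reports how much context it saw."""

    def generate(self, prompt: str) -> str:
        return f"(answer based on {prompt.count('MEMORY:')} memories)"


def chat_turn(memory: FakeMemory, llm: FakeLLM, user_message: str) -> str:
    # 1. Query the memory store for relevant memories
    memories = memory.search(user_message)
    # 2. Construct a prompt that includes them as context
    context = "\n".join(f"MEMORY: {m}" for m in memories)
    prompt = f"{context}\nUser: {user_message}\nAssistant:"
    # 3. Send the prompt to the LLM
    response = llm.generate(prompt)
    # 4. Store the turn for future retrieval
    memory.store(f"User: {user_message} Assistant: {response}")
    return response
```

Every piece of machinery in the rest of this chapter replaces one of these stubs: MemPalace replaces FakeMemory, and the backends below replace FakeLLM.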


SETTING UP THE LLM BACKENDS

We need a unified interface that can work with different LLM backends. The following module defines an abstract base class and concrete implementations for Ollama, Apple MLX, and llama.cpp with Vulkan support.


"""

llm_backends.py


Unified interface for multiple LLM backends, supporting:

  - Ollama (CUDA, Metal/MLX, Vulkan via llama.cpp underneath)

  - Apple MLX directly via mlx-lm

  - llama.cpp via llama-cpp-python with explicit GPU layer offloading


All backends implement the LLMBackend abstract interface, so the

rest of the application is completely decoupled from the choice

of backend.

"""


import abc

import logging

import platform

from typing import Generator, Optional


logger = logging.getLogger(__name__)



class LLMBackend(abc.ABC):

    """

    Abstract base class for LLM backends.


    All backends must implement generate() and stream_generate().

    The generate() method returns the complete response as a string.

    The stream_generate() method yields tokens as they are produced,

    which is important for interactive applications where users

    expect to see output appear progressively.

    """


    @abc.abstractmethod

    def generate(self, prompt: str, max_tokens: int = 512) -> str:

        """Generate a complete response to the given prompt."""

        ...


    @abc.abstractmethod

    def stream_generate(

        self, prompt: str, max_tokens: int = 512

    ) -> Generator[str, None, None]:

        """Yield string tokens progressively as the model generates them."""

        ...


    @abc.abstractmethod

    def is_available(self) -> bool:

        """Return True if this backend is available on the current system."""

        ...



class OllamaBackend(LLMBackend):

    """

    LLM backend using Ollama's local inference server.


    Ollama automatically selects the best available GPU backend:

    CUDA on NVIDIA GPUs, Metal on Apple Silicon, ROCm on AMD GPUs,

    and Vulkan as a fallback for other GPU types. This makes it

    the most portable option across different hardware.


    Prerequisites:

        Install Ollama from https://ollama.ai and pull a model:

            ollama pull llama3.2

        Install the Python client:

            pip install ollama

    """


    def __init__(self, model: str = "llama3.2", host: str = "http://localhost:11434"):

        """

        Initialize the Ollama backend.


        Args:

            model: The Ollama model name to use (e.g., "llama3.2",

                   "mistral", "phi3", "gemma2").

            host:  The Ollama server URL. Defaults to localhost.

        """

        self.model = model

        self.host = host

        self._client = None


    def _get_client(self):

        """Lazily initialize the Ollama client on first use."""

        if self._client is None:

            try:

                import ollama

                self._client = ollama.Client(host=self.host)

            except ImportError:

                raise RuntimeError(

                    "The 'ollama' package is not installed. "

                    "Install it with: pip install ollama"

                )

        return self._client


    def is_available(self) -> bool:

        """

        Check if the Ollama server is running and the model is available.


        The Ollama Python SDK returns model objects whose `.model`

        attribute holds the full model tag (e.g. "llama3.2:latest").

        We check for an exact match first, then fall back to a

        prefix match so that "llama3.2" matches "llama3.2:latest".

        Substring matching is intentionally avoided to prevent

        "llama3" from falsely matching "llama3.2:latest".

        """

        try:

            client = self._get_client()

            response = client.list()

            # Each entry in response.models has a `.model` attribute

            # containing the full tag string, e.g. "llama3.2:latest".

            available_tags = [m.model for m in response.models]

            # Exact match (e.g. user passed "llama3.2:latest")

            if self.model in available_tags:

                return True

            # Prefix match: "llama3.2" should match "llama3.2:latest"

            # but NOT match "llama3.2.1:latest" — we anchor to the colon.

            return any(

                tag == self.model or tag.startswith(self.model + ":")

                for tag in available_tags

            )

        except Exception as e:

            logger.debug(f"Ollama not available: {e}")

            return False


    def generate(self, prompt: str, max_tokens: int = 512) -> str:

        """

        Generate a complete response using Ollama.


        The response is collected in full before returning, which

        is appropriate for non-interactive use cases.

        """

        client = self._get_client()

        response = client.generate(

            model=self.model,

            prompt=prompt,

            options={"num_predict": max_tokens},

        )

        return response.response


    def stream_generate(

        self, prompt: str, max_tokens: int = 512

    ) -> Generator[str, None, None]:

        """

        Stream string tokens from Ollama as they are generated.


        Each chunk from the Ollama streaming API is a GenerateResponse

        object whose `.response` attribute holds the newly generated

        text segment. Empty segments (sent in the final done-chunk)

        are skipped via the truthiness guard.

        """

        client = self._get_client()

        stream = client.generate(

            model=self.model,

            prompt=prompt,

            options={"num_predict": max_tokens},

            stream=True,

        )

        for chunk in stream:

            if chunk.response:

                yield chunk.response



class AppleMLXBackend(LLMBackend):

    """

    LLM backend using Apple's MLX framework for Apple Silicon.


    MLX is Apple's machine learning framework optimized for the

    unified memory architecture of M-series chips. It allows the

    GPU and CPU to share the same RAM pool, eliminating costly

    data transfers and enabling efficient inference on Apple hardware.


    This backend is only available on macOS with Apple Silicon (M1+).

    It uses the mlx-lm library, which provides MLX-optimized versions

    of popular open-source models.


    Prerequisites:

        pip install mlx-lm

        # Models are downloaded automatically on first use from

        # the Hugging Face Hub.

    """


    def __init__(self, model: str = "mlx-community/Llama-3.2-3B-Instruct-4bit"):

        """

        Initialize the Apple MLX backend.


        Args:

            model: The Hugging Face model ID for an MLX-compatible model.

                   The mlx-community organization on Hugging Face hosts

                   pre-converted MLX versions of popular models.

        """

        self.model_name = model

        self._model = None

        self._tokenizer = None  # mlx_lm.load() returns (model, tokenizer)


    def _load_model(self):

        """Load the MLX model and tokenizer on first use."""

        if self._model is None:

            if platform.system() != "Darwin":

                raise RuntimeError(

                    "Apple MLX backend is only available on macOS."

                )

            try:

                from mlx_lm import load

                logger.info(f"Loading MLX model: {self.model_name}")

                self._model, self._tokenizer = load(self.model_name)

                logger.info("MLX model loaded successfully.")

            except ImportError:

                raise RuntimeError(

                    "The 'mlx-lm' package is not installed. "

                    "Install it with: pip install mlx-lm"

                )


    def is_available(self) -> bool:

        """

        Check if Apple MLX is available on this system.


        MLX only installs successfully on macOS with Apple Silicon

        (M1 or later). We verify the platform first, then attempt

        to import mlx.core. A successful import is sufficient proof

        that the GPU backend is available, because MLX will not

        install at all on non-Apple-Silicon hardware.

        """

        if platform.system() != "Darwin":

            return False

        try:

            import mlx.core  # noqa: F401 — import is the availability check

            return True

        except ImportError:

            return False


    def generate(self, prompt: str, max_tokens: int = 512) -> str:

        """

        Generate a complete response using Apple MLX.


        MLX inference runs entirely on the Apple Silicon GPU using

        the unified memory architecture, which means no data needs

        to be copied between CPU and GPU memory.

        """

        self._load_model()

        from mlx_lm import generate as mlx_generate

        result = mlx_generate(

            self._model,

            self._tokenizer,

            prompt=prompt,

            max_tokens=max_tokens,

            verbose=False,

        )

        return result


    def stream_generate(

        self, prompt: str, max_tokens: int = 512

    ) -> Generator[str, None, None]:

        """

        Stream string tokens from Apple MLX.


        mlx_lm.stream_generate yields GenerationResult objects, not

        raw strings. Each GenerationResult has a `.text` attribute

        containing the newly decoded text segment for that step.

        We extract `.text` before yielding so that callers always

        receive plain strings, consistent with the LLMBackend contract.

        """

        self._load_model()

        from mlx_lm import stream_generate as mlx_stream

        for generation_result in mlx_stream(

            self._model,

            self._tokenizer,

            prompt=prompt,

            max_tokens=max_tokens,

        ):

            # generation_result is a GenerationResult object.

            # Its .text attribute holds the decoded text for this token.

            yield generation_result.text



class LlamaCppBackend(LLMBackend):

    """

    LLM backend using llama.cpp via the llama-cpp-python bindings.


    llama.cpp is a highly optimized C++ inference engine that supports

    multiple GPU backends through compile-time flags:

      - CUDA:   NVIDIA GPUs (fastest on NVIDIA hardware)

      - Metal:  Apple Silicon GPUs (alternative to MLX)

      - Vulkan: Cross-platform GPU acceleration (AMD, Intel, NVIDIA)

      - CPU:    Fallback for systems without compatible GPUs


    The n_gpu_layers parameter controls how many transformer layers

    are offloaded to the GPU. Setting it to -1 offloads all layers,

    which is the recommended setting when GPU VRAM is sufficient.


    Prerequisites:

        # For CUDA support:

        CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python


        # For Vulkan support:

        CMAKE_ARGS="-DGGML_VULKAN=on" pip install llama-cpp-python


        # For Apple Metal support:

        CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python


        # CPU only (no special flags needed):

        pip install llama-cpp-python


        # Download a GGUF model, for example:

        # https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF

    """


    def __init__(

        self,

        model_path: str,

        n_gpu_layers: int = -1,

        n_ctx: int = 4096,

        verbose: bool = False,

    ):

        """

        Initialize the llama.cpp backend.


        Args:

            model_path:   Path to the GGUF model file.

            n_gpu_layers: Number of layers to offload to GPU.

                          -1 means offload all layers (recommended).

                          0 means CPU-only inference.

            n_ctx:        Context window size in tokens.

            verbose:      Whether to print llama.cpp debug output.

        """

        self.model_path = model_path

        self.n_gpu_layers = n_gpu_layers

        self.n_ctx = n_ctx

        self.verbose = verbose

        self._llm = None


    def _load_model(self):

        """Load the GGUF model on first use."""

        if self._llm is None:

            try:

                from llama_cpp import Llama

                logger.info(

                    f"Loading GGUF model: {self.model_path} "

                    f"({self.n_gpu_layers} GPU layers)"

                )

                self._llm = Llama(

                    model_path=self.model_path,

                    n_gpu_layers=self.n_gpu_layers,

                    n_ctx=self.n_ctx,

                    verbose=self.verbose,

                )

                logger.info("GGUF model loaded successfully.")

            except ImportError:

                raise RuntimeError(

                    "The 'llama-cpp-python' package is not installed. "

                    "See class docstring for installation instructions."

                )


    def is_available(self) -> bool:

        """Check if the model file exists and llama-cpp-python is installed."""

        import os

        if not os.path.exists(self.model_path):

            logger.debug(f"Model file not found: {self.model_path}")

            return False

        try:

            import llama_cpp  # noqa: F401

            return True

        except ImportError:

            return False


    def generate(self, prompt: str, max_tokens: int = 512) -> str:

        """

        Generate a complete response using llama.cpp.


        Uses the __call__ (basic completion) API, which returns a

        CompletionResponse dict. The generated text is at

        output["choices"][0]["text"].


        The model runs on whatever GPU backend was selected at

        compile time (CUDA, Vulkan, Metal, or CPU).

        """

        self._load_model()

        output = self._llm(

            prompt,

            max_tokens=max_tokens,

            stop=["User:", "\n\n\n"],

            echo=False,

        )

        return output["choices"][0]["text"].strip()


    def stream_generate(

        self, prompt: str, max_tokens: int = 512

    ) -> Generator[str, None, None]:

        """

        Stream string tokens from llama.cpp.


        Uses the __call__ (basic completion) API with stream=True,

        which yields CompletionChunk dicts. For the basic completion

        API (as opposed to create_chat_completion), each chunk's

        text is at chunk["choices"][0]["text"] — there is no "delta"

        wrapper. Empty strings are skipped to avoid spurious yields.


        llama.cpp supports native token streaming, which is efficient

        because tokens are yielded as soon as they are generated without

        buffering the complete response.

        """

        self._load_model()

        stream = self._llm(

            prompt,

            max_tokens=max_tokens,

            stop=["User:", "\n\n\n"],

            echo=False,

            stream=True,

        )

        for chunk in stream:

            token = chunk["choices"][0]["text"]

            if token:

                yield token



def create_best_available_backend(

    ollama_model: str = "llama3.2",

    mlx_model: str = "mlx-community/Llama-3.2-3B-Instruct-4bit",

    gguf_model_path: Optional[str] = None,

) -> LLMBackend:

    """

    Factory function that selects the best available LLM backend.


    The selection priority is:

      1. Ollama (most portable, handles GPU selection automatically)

      2. Apple MLX (best performance on Apple Silicon when Ollama unavailable)

      3. llama.cpp with GGUF (when a model file is provided)

      4. Raises RuntimeError if no backend is available


    This function is the recommended way to obtain an LLM backend

    in applications that want to work across different hardware.

    """

    # Try Ollama first - it works on all platforms and auto-selects GPU

    ollama = OllamaBackend(model=ollama_model)

    if ollama.is_available():

        logger.info(f"Using Ollama backend with model: {ollama_model}")

        return ollama


    # Try Apple MLX on macOS

    mlx = AppleMLXBackend(model=mlx_model)

    if mlx.is_available():

        logger.info(f"Using Apple MLX backend with model: {mlx_model}")

        return mlx


    # Try llama.cpp if a model path was provided

    if gguf_model_path:

        llamacpp = LlamaCppBackend(model_path=gguf_model_path)

        if llamacpp.is_available():

            logger.info(

                f"Using llama.cpp backend with model: {gguf_model_path}"

            )

            return llamacpp


    raise RuntimeError(

        "No LLM backend is available. Please ensure one of the following:\n"

        "  1. Ollama is installed and running (https://ollama.ai)\n"

        "  2. mlx-lm is installed on macOS with Apple Silicon\n"

        "  3. llama-cpp-python is installed and a GGUF model path is provided"

    )
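Because the rest of the application depends only on the LLMBackend interface, you can exercise the chat loop without any model installed. The EchoBackend below is a hypothetical backend written for illustration; it is not part of llm_backends.py, but it satisfies the same three-method contract (the interface is abbreviated here for self-containment):

```python
import abc
from typing import Generator


class LLMBackend(abc.ABC):
    # Abbreviated copy of the interface defined in llm_backends.py.
    @abc.abstractmethod
    def generate(self, prompt: str, max_tokens: int = 512) -> str: ...

    @abc.abstractmethod
    def stream_generate(
        self, prompt: str, max_tokens: int = 512
    ) -> Generator[str, None, None]: ...

    @abc.abstractmethod
    def is_available(self) -> bool: ...


class EchoBackend(LLMBackend):
    """Trivial backend for testing the chat loop without a real model."""

    def generate(self, prompt: str, max_tokens: int = 512) -> str:
        # Collect the streamed tokens into one string, mirroring how
        # real backends relate generate() to stream_generate().
        return "".join(self.stream_generate(prompt, max_tokens))

    def stream_generate(
        self, prompt: str, max_tokens: int = 512
    ) -> Generator[str, None, None]:
        # Yield the last line of the prompt word by word, the way a
        # real model yields tokens progressively.
        for word in prompt.splitlines()[-1].split()[:max_tokens]:
            yield word + " "

    def is_available(self) -> bool:
        return True
```

A backend like this is also useful in unit tests, where the factory's auto-detection would otherwise make test behavior depend on what happens to be installed on the machine.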


CHAPTER TEN: BUILDING THE COMPLETE CHAT APPLICATION

With the MemoryManager and LLM backends in place, we can now build the complete chat application. This application maintains persistent memory across sessions, retrieves relevant context before each response, and stores each conversation turn for future retrieval.


"""

memory_chat.py


A complete chat application with persistent AI memory powered by

MemPalace. Memories persist across sessions, so the AI can recall

information from previous conversations.


Usage:

    python memory_chat.py --wing my-project --backend ollama

    python memory_chat.py --wing my-project --backend mlx

    python memory_chat.py --wing my-project --backend llamacpp \

        --model-path ~/models/llama-3.2-3b-instruct.Q4_K_M.gguf


Commands during chat:

    /search <query>   - Search memories without generating a response

    /wakeup           - Show the current wake-up context

    /mine <path>      - Mine a directory into memory

    /quit             - Exit the application

"""


import argparse

import logging

import sys



from memory_manager import MemoryConfig, MemoryManager

from mempalace.models import SearchResult

from llm_backends import (

    OllamaBackend,

    AppleMLXBackend,

    LlamaCppBackend,

    create_best_available_backend,

    LLMBackend,

)


# Configure logging to show INFO messages to stderr,

# keeping stdout clean for the chat output

logging.basicConfig(

    level=logging.INFO,

    format="%(asctime)s [%(levelname)s] %(name)s: %(message)s",

    stream=sys.stderr,

)

logger = logging.getLogger(__name__)



def build_prompt(

    user_message: str,

    memory_context: str,

    wake_up_context: str,

    conversation_history: list[dict[str, str]],

) -> str:

    """

    Construct the full prompt for the LLM.


    The prompt has four sections:

      1. System instructions that establish the AI's role

      2. Wake-up context: critical facts loaded at session start

      3. Relevant memories: retrieved by MemPalace for this specific query

      4. Recent conversation history: the last 3 full turns (6 entries)

         of this session, kept short to preserve context-window budget


    The memory sections are placed before the conversation history

    so that the model can refer to them while formulating its response.


    Args:

        user_message:          The current user input.

        memory_context:        Relevant memories retrieved from MemPalace.

        wake_up_context:       Critical facts from the wake-up context.

        conversation_history:  Recent turns in the current session, each

                               entry a dict with "role" and "content" keys.


    Returns:

        A formatted prompt string ready for the LLM.

    """

    # Keep the last 6 entries = 3 full turns (1 user + 1 assistant each)

    history_text = ""

    for turn in conversation_history[-6:]:

        role = "User" if turn["role"] == "user" else "Assistant"

        history_text += f"{role}: {turn['content']}\n"


    # Assemble the complete prompt

    prompt = f"""You are a helpful AI assistant with access to a persistent memory system.

You remember information from previous conversations and use it to provide

contextually aware, accurate responses.


CRITICAL FACTS (always relevant):

{wake_up_context}


RELEVANT MEMORIES (retrieved for this query):

{memory_context if memory_context else "No specific memories found for this query."}


RECENT CONVERSATION:

{history_text}

User: {user_message}

Assistant:"""


    return prompt



def format_search_results(results: list[SearchResult]) -> str:

    """

    Format MemPalace search results into a readable context block.


    Each result is formatted with its palace location and a relevance

    score, followed by the verbatim text. Results are separated by

    a divider line for readability.


    Args:

        results: A list of SearchResult objects from MemPalace.


    Returns:

        A formatted string suitable for inclusion in an LLM prompt.

    """

    if not results:

        return ""


    formatted_parts = []

    for i, result in enumerate(results, 1):

        location = f"{result.wing}/{result.hall}/{result.room}"

        score = f"{result.score:.3f}"

        formatted_parts.append(

            f"[Memory {i} | Location: {location} | Relevance: {score}]\n"

            f"{result.text}"

        )


    return "\n---\n".join(formatted_parts)



def store_conversation_turn(

    manager: MemoryManager,

    user_message: str,

    assistant_response: str,

) -> None:

    """

    Store a conversation turn in MemPalace for future retrieval.


    Both the user message and the assistant response are stored

    together as a single memory unit. This preserves the conversational

    context, which is important for understanding the response in

    isolation when it is retrieved in a future session.


    The room is set to "conversations" and the hall to "events",

    which places these memories in a consistent location within

    the palace hierarchy.


    Storage failures are logged as warnings rather than raised as

    exceptions, so that a transient storage error does not crash

    an otherwise healthy chat session.

    """

    combined_text = (

        f"User: {user_message}\n"

        f"Assistant: {assistant_response}"

    )

    try:

        manager.store(

            text=combined_text,

            room="conversations",

            hall="events",

            source="live-chat",

        )

    except Exception as exc:

        logger.warning(

            f"Failed to store conversation turn in MemPalace: {exc}. "

            "This turn will not be available in future sessions."

        )



def run_chat(manager: MemoryManager, backend: LLMBackend) -> None:

    """

    Run the interactive chat loop.


    This function handles the main interaction loop, including

    special commands, memory retrieval, prompt construction,

    LLM inference, and memory storage. It runs until the user

    types /quit or sends an EOF signal (Ctrl+D on Unix).


    The wing name is read directly from manager.config.wing_name

    rather than being passed as a redundant parameter.

    """

    wing_name = manager.config.wing_name


    print(f"\nMemPalace Chat | Wing: {wing_name}")

    print("=" * 50)

    print("Commands: /search <query>, /wakeup, /mine <path>, /quit")

    print("=" * 50)


    # Load the wake-up context once at session start

    wake_up = manager.wake_up()

    wake_up_text = wake_up.text if wake_up else "No wake-up context available."

    print(f"\nWake-up context loaded ({len(wake_up_text.split())} words).\n")


    # Maintain conversation history for the current session.

    # Each entry is {"role": "user"|"assistant", "content": str}.

    conversation_history: list[dict[str, str]] = []


    while True:

        try:

            user_input = input("You: ").strip()

        except (EOFError, KeyboardInterrupt):

            print("\nGoodbye!")

            break


        if not user_input:

            continue


        # Handle special commands

        if user_input.startswith("/quit"):

            print("Goodbye!")

            break


        if user_input.startswith("/wakeup"):

            print(f"\nWake-up context:\n{wake_up_text}\n")

            continue


        if user_input.startswith("/search "):

            query = user_input[8:].strip()

            results = manager.search(query, k=5)

            print(f"\nSearch results for '{query}':")

            print(format_search_results(results))

            print()

            continue


        if user_input.startswith("/mine "):

            path = user_input[6:].strip()

            try:

                count = manager.mine_directory(path)

                print(f"\nMined {count} memory chunks from {path}\n")

            except FileNotFoundError as e:

                print(f"\nError: {e}\n")

            continue


        # Retrieve relevant memories for the current query

        search_results = manager.search(user_input, k=5)

        memory_context = format_search_results(search_results)


        # Build the complete prompt

        prompt = build_prompt(

            user_message=user_input,

            memory_context=memory_context,

            wake_up_context=wake_up_text,

            conversation_history=conversation_history,

        )


        # Stream the response from the LLM

        print("Assistant: ", end="", flush=True)

        response_tokens: list[str] = []


        for token in backend.stream_generate(prompt, max_tokens=512):

            print(token, end="", flush=True)

            response_tokens.append(token)


        print()  # Newline after the streamed response

        full_response = "".join(response_tokens)


        # Only store and update history if the model produced output.

        # An empty response indicates a backend error; we skip storage

        # to avoid polluting the palace with empty entries.

        if full_response.strip():

            # Update conversation history for this session

            conversation_history.append(

                {"role": "user", "content": user_input}

            )

            conversation_history.append(

                {"role": "assistant", "content": full_response}

            )

            # Store this turn in MemPalace for future sessions

            store_conversation_turn(manager, user_input, full_response)

        else:

            logger.warning("LLM returned an empty response; turn not stored.")



def main() -> None:

    """

    Entry point for the memory chat application.


    Parses command-line arguments, initializes the MemoryManager

    and LLM backend, and starts the interactive chat loop.

    """

    parser = argparse.ArgumentParser(

        description="Chat with an AI that remembers everything."

    )

    parser.add_argument(

        "--wing",

        default="default",

        help="The memory palace wing to use (default: 'default')",

    )

    parser.add_argument(

        "--backend",

        choices=["ollama", "mlx", "llamacpp", "auto"],

        default="auto",

        help="LLM backend to use (default: auto-detect)",

    )

    parser.add_argument(

        "--ollama-model",

        default="llama3.2",

        help="Ollama model name (default: llama3.2)",

    )

    parser.add_argument(

        "--mlx-model",

        default="mlx-community/Llama-3.2-3B-Instruct-4bit",

        help="MLX model ID from Hugging Face",

    )

    parser.add_argument(

        "--model-path",

        default=None,

        help="Path to GGUF model file for llama.cpp backend",

    )

    parser.add_argument(

        "--gpu-layers",

        type=int,

        default=-1,

        help="GPU layers for llama.cpp (-1 = all layers, 0 = CPU only)",

    )

    parser.add_argument(

        "--storage-path",

        default="~/.mempalace",

        help="Path for MemPalace storage (default: ~/.mempalace)",

    )


    args = parser.parse_args()


    # Initialize MemoryManager

    config = MemoryConfig(

        storage_path=args.storage_path,

        wing_name=args.wing,

    )

    manager = MemoryManager(config)

    manager.initialize()


    # Select and initialize the LLM backend.

    # The if/elif chain is exhaustive given argparse's choices constraint,

    # but we add a final else to keep static analysis tools (mypy, pylance)

    # from flagging `backend` as potentially unbound.

    backend: LLMBackend

    if args.backend == "auto":

        backend = create_best_available_backend(

            ollama_model=args.ollama_model,

            mlx_model=args.mlx_model,

            gguf_model_path=args.model_path,

        )

    elif args.backend == "ollama":

        backend = OllamaBackend(model=args.ollama_model)

    elif args.backend == "mlx":

        backend = AppleMLXBackend(model=args.mlx_model)

    elif args.backend == "llamacpp":

        if not args.model_path:

            # parser.error prints the usage message to stderr and exits
            # with status 2, which is the idiomatic argparse failure path.
            parser.error("--model-path is required for the llamacpp backend.")

        backend = LlamaCppBackend(

            model_path=args.model_path,

            n_gpu_layers=args.gpu_layers,

        )

    else:

        # Unreachable given argparse choices, but satisfies type checkers

        # and makes the exhaustiveness of the chain explicit.

        raise RuntimeError(f"Unknown backend: {args.backend!r}")


    # Start the chat loop

    run_chat(manager, backend)



if __name__ == "__main__":

    main()


CHAPTER ELEVEN: THE MCP SERVER AND AGENT INTEGRATION

MemPalace exposes its functionality through a Model Context Protocol (MCP) server. MCP is an open standard for how AI models communicate with external tools and data sources. The server allows AI assistants such as Claude, ChatGPT, and Cursor to call MemPalace tools directly, without the user having to manually retrieve memories and paste them into the conversation.

The MCP server communicates using JSON-RPC 2.0 over stdin/stdout, which means it can be launched as a subprocess by any MCP-compatible client. The server exposes 24 tools organized into four categories.

The read tools allow the AI to search memories, retrieve specific drawers, query the knowledge graph, and read diary entries. The write tools allow the AI to store new memories, update knowledge graph facts, and write diary entries. The knowledge graph tools provide temporal query capabilities, allowing the AI to ask about the state of entities at specific points in time. The diary tools manage per-agent timestamped logs that allow specialist agents to maintain continuity across sessions.
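Under the hood, a tool invocation is just a JSON-RPC 2.0 message written to the server's stdin. The sketch below shows the general shape of an MCP `tools/call` request; the `mempalace_search` tool name comes from this chapter, but the argument names (`query`, `top_k`) are assumptions rather than MemPalace's documented schema.

```python
import json

# A JSON-RPC 2.0 request asking the server to invoke one of its tools via
# the MCP "tools/call" method. The "arguments" keys are illustrative.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "mempalace_search",
        "arguments": {"query": "why did we choose PostgreSQL?", "top_k": 5},
    },
}

# MCP over stdio frames each message as one line of JSON.
wire_message = json.dumps(request) + "\n"
print(wire_message, end="")
```

The matching response arrives on the server's stdout as another line of JSON carrying the same `id`, which is how the client pairs requests with replies.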

To configure Claude Desktop to use MemPalace as an MCP server, you add an entry to Claude's configuration file. The exact location of this file depends on your operating system, but on macOS it is typically at ~/Library/Application Support/Claude/claude_desktop_config.json.


{

  "mcpServers": {

    "mempalace": {

      "command": "mempalace",

      "args": ["serve", "--wing", "my-project"],

      "env": {}

    }

  }

}


Once this configuration is in place, Claude will automatically have access to all 24 MemPalace tools. When you ask Claude a question that requires historical context, it can call mempalace_search to retrieve relevant memories before formulating its response. When Claude makes a decision or discovers something important, it can call mempalace_store to preserve that information for future sessions.

The diary system is particularly interesting for multi-agent applications. Each specialist agent in a MemPalace-powered system has its own diary, which is a chronological log of that agent's observations, decisions, and reflections. When a new session begins, the agent reads its own diary to reconstruct its expertise and context. This gives agents a form of persistent identity that survives session boundaries.
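The diary mechanics can be approximated with an append-only JSON Lines file, one timestamped entry per line, which any agent can replay at session start. This is an illustrative sketch of the idea, not MemPalace's actual on-disk format or API.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def append_diary_entry(diary_path: Path, agent: str, text: str) -> dict:
    """Append one timestamped entry to an agent's diary (JSON Lines)."""
    entry = {
        "agent": agent,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "text": text,
    }
    with diary_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
    return entry


def read_diary(diary_path: Path) -> list[dict]:
    """Read all entries in chronological (append) order."""
    if not diary_path.exists():
        return []
    with diary_path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

An append-only log is a natural fit here: it preserves ordering, survives crashes mid-write better than rewriting a single file, and makes "read everything since my last session" trivial.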


CHAPTER TWELVE: BENCHMARK RESULTS AND HONEST ASSESSMENT

LongMemEval is a widely used benchmark for evaluating AI memory systems. It tests a system's ability to retrieve relevant information from a large corpus of conversations and documents, measuring recall at various k values (the number of results returned per query).

MemPalace's results on this benchmark are genuinely impressive in raw verbatim mode. The recall@5 score of 96.6% means that for 96.6% of queries, the correct answer appears somewhere in the top 5 retrieved results. This is achieved without any external API calls, using only ChromaDB's default local embedding model. For comparison, many commercial memory systems that use expensive LLM-based reranking and cloud infrastructure achieve similar or lower scores.
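Recall@k itself is a simple metric. A minimal sketch of how it is computed over a set of queries, where each query has one known correct answer:

```python
def recall_at_k(results: list[list[str]], answers: list[str], k: int) -> float:
    """Fraction of queries whose correct answer appears in the top-k results.

    `results[i]` is the ranked list of retrieved items for query i, and
    `answers[i]` is the single correct item for that query.
    """
    hits = sum(
        1
        for retrieved, answer in zip(results, answers)
        if answer in retrieved[:k]
    )
    return hits / len(answers)
```

A recall@5 of 96.6% therefore says nothing about where in the top 5 the answer lands, only that it is present, which is why the benchmark is a measure of retrieval rather than of end-to-end answer quality.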

However, it is important to understand what this benchmark measures and what it does not. LongMemEval tests retrieval accuracy: can the system find the relevant passage? It does not test whether the LLM correctly uses that passage to answer a question, whether the retrieved context is sufficient for complex reasoning, or whether the system performs well on domains very different from the benchmark corpus.

The AAAK compression results tell an honest story about the trade-offs involved. The drop from 96.6% to 84.2% when using AAAK compression is significant and should be taken seriously. The developers' initial claim of lossless compression was incorrect, and their subsequent acknowledgment of this fact is a mark of intellectual honesty that reflects well on the project.

The 100% hybrid score mentioned in some documentation refers to a configuration that uses LLM-based reranking as a post-processing step after the initial ChromaDB retrieval. This configuration does make external API calls and is therefore not purely local. The 96.6% score is the one that is relevant for fully local, zero-API-cost operation.
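Conceptually, the hybrid configuration adds a reranking pass on top of the local retrieval. A minimal sketch is below, where `llm_score` stands in for a call to a hosted model scoring each candidate's relevance to the query; that call is exactly the step that makes the hybrid mode non-local.

```python
from typing import Callable


def hybrid_rerank(
    query: str,
    candidates: list[str],
    llm_score: Callable[[str, str], float],
    k: int = 5,
) -> list[str]:
    """Re-order locally retrieved candidates by an LLM-assigned relevance
    score, then keep the top k. `llm_score(query, passage) -> float` is a
    stand-in for an external API call."""
    return sorted(
        candidates, key=lambda passage: llm_score(query, passage), reverse=True
    )[:k]
```

The trade-off is the usual one: the reranker can promote the right passage when the embedding model ranks it fourth or fifth, but it reintroduces per-query cost, latency, and a cloud dependency.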


CHAPTER THIRTEEN: REAL-WORLD USE CASES

The developer community has found numerous practical applications for MemPalace since its release. Several patterns have emerged as particularly valuable.

Engineering teams use MemPalace to preserve the rationale behind architectural decisions. When a team decides to use PostgreSQL instead of MongoDB, or to adopt a particular authentication pattern, the reasoning behind that decision is often scattered across Slack threads, pull request comments, and design documents. Six months later, when a new team member asks why the system is designed this way, nobody can reconstruct the full reasoning. MemPalace can mine all of these sources and make the complete decision history searchable.

Solo developers working on multiple projects use MemPalace to maintain context across projects. When you switch from one project to another and back again, it is easy to lose track of where you were, what you were trying to accomplish, and what decisions you had made. A MemPalace wing for each project, mined from your conversation history with AI assistants, gives you instant recall of your past thinking.

Research teams use MemPalace to maintain a complete audit trail of their research process. Every critique, revision, and dead end is preserved verbatim. This is valuable not just for recall but for understanding how the research evolved, which is often as important as the final conclusions.

Incident response teams use MemPalace to build institutional memory around production incidents. When a particular type of failure occurs, MemPalace can surface the mitigation steps from previous similar incidents, potentially saving hours of debugging time.


CHAPTER FOURTEEN: INSTALLATION REQUIREMENTS SUMMARY AND QUICK START

To bring everything together, here is a complete quick-start guide that you can follow to get a working MemPalace-powered chat application running on your machine.

The first step is to ensure you have Python 3.10 or higher installed. You can check your Python version by running python3 --version in your terminal.

The second step is to install the required packages. The exact set depends on which LLM backend you want to use.


# Core MemPalace dependencies (always required)

pip install mempalace chromadb pyyaml


# For Ollama backend (recommended for most users)

pip install ollama

# Then install Ollama itself from https://ollama.ai

# and pull a model:

ollama pull llama3.2


# For Apple MLX backend (macOS with M1+ only)

pip install mlx-lm


# For llama.cpp with CUDA (NVIDIA GPUs)

CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python


# For llama.cpp with Vulkan (AMD, Intel, or other GPUs)

CMAKE_ARGS="-DGGML_VULKAN=on" pip install llama-cpp-python


The third step is to initialize your memory palace:

mempalace init --name "my-project" --path ~/.mempalace


The fourth step is to optionally mine some existing data. If you have conversation exports from Claude or ChatGPT, this is a great time to ingest them:

mempalace mine --source ~/Downloads/claude-export.json \

    --wing my-project --mode convos


The fifth step is to save the three Python files from this tutorial (memory_manager.py, llm_backends.py, and memory_chat.py) and run the chat application:

python memory_chat.py --wing my-project --backend auto


The auto backend selection will probe your system and choose the best available option. If Ollama is running with a model pulled, it will use that. If you are on a Mac with Apple Silicon and mlx-lm is installed, it will use MLX. If you supplied --model-path pointing at a GGUF file and llama-cpp-python is installed, it will fall back to llama.cpp.


CONCLUSION: WHAT MEMPALACE REPRESENTS

MemPalace is more than a clever implementation of vector search with a spatial metaphor. It represents a philosophical position about how AI memory should work: store everything, organize spatially, retrieve deterministically, and never discard information prematurely.

The project's rapid adoption, accumulating over 23,000 GitHub stars within two days of release, suggests that this philosophy resonates deeply with the developer community. The AI amnesia problem is real, it is frustrating, and it is holding back the practical utility of AI assistants in long-running, complex projects.

The honest acknowledgment that AAAK compression is lossy, after initially overclaiming its capabilities, is a sign of a project that values accuracy over marketing. This kind of intellectual honesty is rare and valuable in the open-source AI space.

As language models continue to improve and context windows continue to grow, the relative importance of external memory systems may shift. But for the foreseeable future, systems like MemPalace fill a genuine gap: they give AI assistants the ability to remember, and in doing so, they make those assistants dramatically more useful for the complex, long-running work that matters most.

The ancient Greeks used memory palaces to deliver hours-long speeches from memory. MemPalace has given that technique a new home in the age of artificial intelligence, and the result is a system that is both technically impressive and philosophically coherent. That is a rare combination, and it is worth your time to explore it.


APPENDIX: DEPENDENCY REFERENCE

The following table summarizes all dependencies discussed in this tutorial, their purposes, and their installation commands.

mempalace: The core MemPalace library. Install with pip install mempalace.


chromadb: Vector database for semantic search over verbatim text. Install with pip install chromadb. Used automatically by MemPalace.


pyyaml: YAML configuration file parser. Install with pip install pyyaml. Used automatically by MemPalace.


ollama: Python client for the Ollama local LLM server. Install with pip install ollama. Requires the Ollama application from ollama.ai.


mlx-lm: Apple MLX inference library for Apple Silicon. Install with pip install mlx-lm. Only available on macOS with M1 or later chips.


llama-cpp-python: Python bindings for llama.cpp. Installation varies by GPU backend. See Chapter Nine for the exact CMAKE_ARGS for CUDA, Vulkan, and Metal.


The GitHub repository for MemPalace is located at github.com/milla-jovovich/mempalace and is licensed under the MIT license, which permits use in commercial and personal projects as long as the copyright and license notice are preserved.
