Saturday, May 02, 2026

WIKI LLM: HOW ANDREJ KARPATHY'S IDEA TURNS YOUR AI INTO A SELF-MAINTAINING KNOWLEDGE MACHINE




CHAPTER ONE: THE PROBLEM THAT NOBODY TALKS ABOUT ENOUGH

Every developer who has built something serious with a large language model eventually runs into the same wall. It does not announce itself dramatically. It creeps up on you quietly, usually around the third or fourth sprint of a project, when you realize that your AI-powered application is doing redundant work, burning through tokens, and somehow still not getting smarter over time. The wall is called the memory problem, and it is arguably the most important unsolved engineering challenge in applied AI today.

To understand why this problem is so fundamental, you need to understand what an LLM actually is at runtime. A large language model, when you call its API, is a completely stateless function. It takes a sequence of tokens as input and produces a sequence of tokens as output. It has no memory of the last time you called it. It has no awareness that it has answered a similar question before. It does not accumulate wisdom from repeated exposure to your data. Every single API call starts from zero, with only the information you explicitly pack into the context window of that particular call.

This is not a bug. It is a deliberate design choice that makes LLMs scalable, parallelizable, and predictable. But it creates a profound engineering problem for anyone who wants to build a system that gets smarter over time, maintains a coherent understanding of a domain, or answers complex questions by synthesizing information from many different sources.


The Context Window Is Not as Big as It Looks

The context window is the LLM's working memory. As of April 2026, the leading frontier models have reached impressive milestones in this area:

  • GPT-5.5 (OpenAI, released April 23, 2026) supports a 256,000-token context window with exceptional instruction persistence across long contexts.
  • Claude Opus 4.7 (Anthropic, released April 16, 2026) supports a remarkable 1,000,000-token input context window with up to 128,000 output tokens — tied with Gemini 3.1 Pro for the largest production-available context of any current frontier model.
  • Gemini 3.1 Pro (Google DeepMind, released February 2026) also supports a 1,000,000-token context window and is particularly well-suited to multimodal and document-heavy workloads.

These numbers sound enormous until you start doing the math. A single research paper might consume 8,000 to 15,000 tokens. A moderately sized codebase might consume 200,000 tokens or more. A year's worth of meeting notes, emails, and documentation for a medium-sized team could easily run into the millions of tokens. No context window, however large, can hold everything you might want the model to know.
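A quick back-of-the-envelope calculation makes this concrete. Using the rough rule of thumb of about four characters per token for English text (a heuristic, not an exact tokenizer count), the numbers above fall out directly:

```python
def estimate_tokens(char_count: int, chars_per_token: float = 4.0) -> int:
    """Rough token estimate using the common ~4 characters/token heuristic."""
    return int(char_count / chars_per_token)

# One ~50,000-character research paper: roughly 12,500 tokens
paper_tokens = estimate_tokens(50_000)

# A year of team notes: 250 working days at ~20,000 characters per day
yearly_tokens = estimate_tokens(250 * 20_000)

print(paper_tokens)               # 12500
print(yearly_tokens)              # 1250000
print(yearly_tokens > 1_000_000)  # True: overflows even a 1M-token window
```

Even a single year of routine team output overflows the largest available context window, before a single paper or codebase is added.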

And even if it could, there is a subtler problem. Research has consistently shown that LLM performance degrades as the context grows longer. The phenomenon is sometimes called the "lost in the middle" problem: when relevant information is buried in the middle of a very long context, the model is significantly less likely to use it correctly than when the same information appears near the beginning or end. A one-million-token context window does not give you one million tokens of perfect, uniform attention. It gives you something more like a spotlight that gets dimmer and less focused as the context grows. Even Claude Opus 4.7 and Gemini 3.1 Pro, both of which advertise one-million-token windows, show measurable accuracy degradation when the relevant information is buried deep in a long context.

So what do engineers do? They reach for one of several standard approaches, each of which solves part of the problem while creating new problems of its own.


CHAPTER TWO: THE STANDARD APPROACHES AND THEIR LIMITATIONS

The Naive Full-Context Approach

The most naive approach is simply to stuff everything into the context window and hope for the best. For small, well-defined tasks, it works fine. For anything involving a large or growing knowledge base, it quickly becomes untenable. The cost grows linearly with the amount of information you include, latency grows as well, and the quality of the model's responses often degrades as the context becomes cluttered with information that is not relevant to the current query.

Summarization

The next approach that most developers discover is summarization. Instead of keeping the full text of every document in the context, you ask the model to summarize older or less relevant material, and you keep only the summaries. This is better — it reduces token count and keeps the most important information accessible. But summarization is lossy. When you compress a ten-page research paper into a three-paragraph summary, you inevitably discard details that might turn out to be important later. And crucially, the summaries are generated independently, so they do not cross-reference each other. The model that summarizes document A has no idea what document B says, and vice versa. You end up with a collection of isolated summaries rather than an integrated understanding.

Retrieval-Augmented Generation (RAG)

The approach that has become the industry standard is Retrieval-Augmented Generation, universally known as RAG. RAG is genuinely clever and solves real problems, so it deserves a careful explanation.

In a RAG system, your documents are preprocessed before any queries are made. Each document is split into chunks, typically a few hundred to a few thousand tokens each. Each chunk is then passed through an embedding model, which converts it into a high-dimensional vector that captures its semantic meaning. These vectors are stored in a specialized vector database such as Pinecone, Weaviate, Chroma, or FAISS. When a user asks a question, the question is also converted into an embedding vector, and the vector database performs a nearest-neighbor search to find the chunks whose embeddings are most similar to the question embedding. These retrieved chunks are then inserted into the context window along with the question, and the LLM generates an answer based on this augmented context.
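The whole pipeline can be sketched in a few lines. The version below substitutes a toy bag-of-words "embedding" and an in-memory cosine search for a real embedding model and vector database (both substitutions are assumptions made purely to keep the sketch self-contained), but the retrieve-then-augment shape is the same:

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'. A real system would call an
    embedding model here instead."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Nearest-neighbor search over chunk embeddings, as a vector DB would do."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


chunks = [
    "self attention lets each token attend to every other token",
    "gradient descent updates weights to minimize the loss",
    "the attention weights come from query and key vectors",
]
print(retrieve("how does self attention work", chunks, k=2))
# the two attention-related chunks rank first; the retrieved text would
# then be pasted into the context window ahead of the question
```

A production system would replace embed with calls to an embedding API and retrieve with a vector-database query; those swaps change the quality, not the architecture.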

RAG is a significant improvement over naive full-context stuffing. It scales to arbitrarily large document collections because you only retrieve the relevant chunks for each query. It is cost-effective because you only pay for the tokens in the retrieved chunks. It supports source attribution because you know exactly which chunks were retrieved. And it handles updates gracefully because you can add new documents to the vector database without retraining the model.

But RAG has a fundamental limitation that becomes apparent when you think carefully about what it is actually doing. RAG is a retrieval system. It finds pieces of text that are semantically similar to your query and hands them to the model. It does not synthesize, integrate, or reason about the knowledge in your document collection. Every single query starts from scratch. The model reads the retrieved chunks as if it has never seen them before, derives whatever insights it can from them in the context of the current query, and then discards that work entirely. The next query starts over.

This means that RAG systems do not accumulate knowledge. They do not get smarter over time. They do not notice when two documents contradict each other. They do not build up a coherent understanding of the relationships between concepts in your domain. Every query is a fresh start, and all the synthesis work done for previous queries is thrown away.

There is also a more subtle problem with RAG that practitioners often discover the hard way. The quality of RAG retrieval depends heavily on the quality of the chunking strategy. If a concept is spread across multiple chunks, or if a chunk contains information about multiple unrelated concepts, the retrieval quality suffers. Tuning a RAG system for high recall and precision on a specific domain is a significant engineering effort, and the results are often fragile.
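The fragility is easy to demonstrate. A naive fixed-size chunker (sketched below with illustrative sizes chosen for the example) happily splits a concept's definition across a chunk boundary, so no single retrieved chunk contains the whole idea:

```python
def chunk_fixed(text: str, size: int) -> list[str]:
    """Naive fixed-size chunking: split every `size` characters,
    with no regard for sentence or concept boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]


doc = ("The transformer was introduced in 2017. "
       "Its core primitive is self-attention, which lets every position "
       "attend to every other position in the sequence.")

chunks = chunk_fixed(doc, 60)
print(chunks[0])  # ends mid-phrase: "...Its core primitive i"
print(chunks[1])  # starts "s self-attention, which lets every..."
```

The definition of self-attention straddles two chunks, so whichever single chunk the retriever returns, the model sees only part of the concept. Real chunkers split on sentences or headings, but the boundary problem never fully disappears.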

Other approaches exist as well. Some systems use fine-tuning to bake domain knowledge into the model's weights, but fine-tuning is expensive, requires labeled training data, and produces a model that is frozen at a point in time and cannot easily incorporate new information. Some systems use agent loops with tool calls to databases or search engines, which is powerful but complex and expensive. Some systems use hierarchical summarization, where documents are summarized at multiple levels of granularity, but this still does not solve the integration and cross-referencing problem.

This is the landscape that Andrej Karpathy surveyed when he formalized the LLM Wiki pattern in April 2026. And what he proposed is, in retrospect, surprisingly simple and elegant.


CHAPTER THREE: THE WIKI LLM CONCEPT

Karpathy's key insight is captured in a single analogy: treat your raw documents as source code and the LLM as a compiler. The wiki is the compiled binary.

Think about what a compiler does. It takes human-readable source code and transforms it into an optimized, structured, executable artifact. The source code is the canonical representation of intent, but the binary is what you actually run. When you add new source code, you do not re-run the entire compilation from scratch for every execution. You compile once, cache the result, and only recompile when the source changes. The compilation process is expensive, but it happens once and the result is reused many times.

Now apply this analogy to knowledge management. Your raw documents — papers, articles, notes, data — are the source code. They are the canonical representation of the knowledge you want the system to have. But instead of feeding them raw to the LLM every time a question is asked, you compile them first. The LLM reads the raw sources, extracts the key information, synthesizes it, cross-references related concepts, identifies contradictions, and writes all of this into a structured collection of Markdown files. This collection is the wiki, and it is the compiled binary of your knowledge base.

When a question is asked, the LLM does not go back to the raw sources. It reads the wiki, which already contains the synthesized, cross-referenced, integrated understanding of all the raw sources. The synthesis work was done once, at ingest time, and the result is reused for every subsequent query. Knowledge accumulates in the wiki over time. Each new document that is ingested makes the wiki richer, more interconnected, and more useful.
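The "recompile only when the source changes" half of the analogy is cheap to enforce mechanically. Here is a hedged sketch (the manifest file name and layout are assumptions for illustration, not part of Karpathy's description): hash each raw file and re-ingest only the ones whose hash is new or changed.

```python
import hashlib
import json
import os


def file_hash(path: str) -> str:
    """SHA-256 of a file's bytes: a cheap 'has the source changed?' check."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def needs_ingest(raw_dir: str, manifest_path: str) -> list[str]:
    """Return the raw files that are new or changed since the last compile.

    Assumes raw_dir contains only files, and that the manifest maps
    filename -> hash recorded after the previous successful ingest.
    """
    manifest = {}
    if os.path.exists(manifest_path):
        with open(manifest_path) as f:
            manifest = json.load(f)
    pending = []
    for name in sorted(os.listdir(raw_dir)):
        path = os.path.join(raw_dir, name)
        if manifest.get(name) != file_hash(path):
            pending.append(path)
    return pending
```

After a successful ingest, the caller writes the new hashes back into the manifest, so unchanged documents are never re-compiled.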

This is the fundamental difference between RAG and the LLM Wiki:

| Dimension | RAG | LLM Wiki |
| --- | --- | --- |
| Knowledge model | Retrieval at query time | Compilation at ingest time |
| State | Stateless | Stateful |
| Knowledge accumulation | None | Compounds over time |
| Cross-referencing | None | Explicit and maintained |
| Contradiction detection | None | Built into lint operation |
| Recall | Best-effort (~70–85%) | 100% (for ingested content) |
| Ingest cost | Low | High (~40,000–60,000 tokens/source) |
| Query cost | Moderate | Low (~4,000–5,000 tokens) |
| Break-even | N/A | ~8–10 queries per topic |

CHAPTER FOUR: THE THREE-LAYER ARCHITECTURE

The architecture that implements this idea has three layers, and understanding each layer is essential to understanding how the system works.

Layer 1 — The Raw Sources Directory

This is where your original documents live: PDFs, Markdown files, text files, HTML pages, whatever you have. This directory is immutable. Documents are added to it but never modified or deleted. It is the ground truth, the source code in Karpathy's analogy. Every claim in the wiki can be traced back to a specific document in the raw sources directory.

Layer 2 — The Wiki Directory

This is where the LLM writes and maintains its structured knowledge base. Each file in the wiki directory is a Markdown file representing a single concept, entity, comparison, timeline, or summary. The LLM creates these files, updates them when new information is ingested, and maintains the cross-references between them. Humans can read the wiki files — they are designed to be readable and useful to humans — but the LLM is responsible for maintaining them. The wiki directory is the compiled binary in Karpathy's analogy.

Two special files live in the wiki directory:

  • index.md — A content-oriented catalog of every page in the wiki, updated by the LLM on every ingest operation.
  • log.md — An append-only ledger of every operation performed, providing a full audit trail of how the wiki evolved over time.

Layer 3 — The Schema File

This is a configuration document, typically named CLAUDE.md or SCHEMA.md, that tells the LLM how to maintain the wiki. It specifies the structure of wiki pages, the conventions for cross-referencing, the rules for handling contradictions, the format for citations, and the procedures for the three main operations: ingest, query, and lint. The schema file is what turns a generic LLM into a specialized wiki maintenance agent.


CHAPTER FIVE: THE THREE CORE OPERATIONS

The Ingest Operation

The ingest operation is triggered when a new document is added to the raw sources directory. The LLM reads the new document and performs several tasks:

  1. Identifies the key concepts and entities discussed in the document.
  2. For each concept or entity, checks whether a wiki page already exists.
  3. If a page exists, updates it with any new information from the document, flagging any contradictions with existing content.
  4. If no page exists, creates a new one following the schema's page template.
  5. Updates index.md to include the new or modified pages.
  6. Appends an entry to log.md recording what was ingested and what changes were made.
  7. Updates the cross-references between pages to reflect any new relationships discovered in the document.

The ingest operation is where most of the token cost is concentrated. Reading a document, understanding its content, and integrating it into an existing wiki is a complex, expensive operation. But it is done once per document, and the result persists. Every subsequent query benefits from the work done during ingest.
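The break-even figure from the comparison table can be reproduced with simple arithmetic. The RAG per-query figure below is an assumption standing in for the table's "moderate"; the other numbers come from the table's illustrative ranges:

```python
def break_even_queries(ingest_tokens: int,
                       rag_query_tokens: int,
                       wiki_query_tokens: int) -> float:
    """Queries needed before the one-time ingest cost is repaid by
    the cheaper per-query cost of reading the compiled wiki."""
    saved_per_query = rag_query_tokens - wiki_query_tokens
    return ingest_tokens / saved_per_query


# ~50k-token ingest, ~10k-token RAG queries (assumed), ~4.5k-token wiki queries
print(round(break_even_queries(50_000, 10_000, 4_500), 1))  # 9.1
```

Around nine queries per topic, the expensive compilation pays for itself, and every query after that is pure savings.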

The Query Operation

The query operation is triggered when a user asks a question. The LLM reads the relevant wiki pages and synthesizes an answer from their content. Because the wiki already contains synthesized, cross-referenced information, the model can answer complex questions that would require reading and integrating multiple raw documents. The answer includes citations to specific wiki pages, which in turn cite specific raw sources. If the answer reveals a gap in the wiki's coverage, the LLM can create a new page to fill that gap, turning every query into an opportunity to improve the wiki.

The Lint Operation

The lint operation is a periodic health check of the entire wiki. The LLM reads through all the wiki pages and looks for:

  • Contradictions between pages
  • Orphaned pages that are not linked from anywhere
  • Missing backlinks where a page references another but the reference is not reciprocated
  • Factual drift where a page's content has become inconsistent with the raw sources it cites
  • Coverage gaps where important concepts are not yet represented in the wiki

The lint operation is analogous to running a linter or a test suite on a codebase. It catches problems before they accumulate into serious inconsistencies.
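Several of these checks are purely structural and need no LLM at all. Given the [[Page Name]] cross-reference convention, orphan detection and missing-backlink detection reduce to a link-graph computation. A sketch, operating on in-memory page bodies (the page contents here are invented for illustration):

```python
import re


def link_graph(pages: dict[str, str]) -> dict[str, set[str]]:
    """Map each page title to the set of [[...]] targets it links to."""
    return {title: set(re.findall(r"\[\[([^\]]+)\]\]", body))
            for title, body in pages.items()}


def orphans(pages: dict[str, str]) -> set[str]:
    """Pages that no other page links to."""
    graph = link_graph(pages)
    linked = set().union(*graph.values()) if graph else set()
    return set(pages) - linked


def missing_backlinks(pages: dict[str, str]) -> set[tuple[str, str]]:
    """(a, b) pairs where a links to b but b does not link back to a."""
    graph = link_graph(pages)
    return {(a, b) for a, targets in graph.items()
            for b in targets
            if b in graph and a not in graph[b]}


pages = {
    "Transformer": "Uses [[Attention]]. See [[BERT]].",
    "Attention": "Core primitive of the [[Transformer]].",
    "BERT": "A bidirectional model.",   # no backlink to Transformer
    "Scratchpad": "Unlinked notes.",    # orphan
}
print(orphans(pages))            # {'Scratchpad'}
print(missing_backlinks(pages))  # {('Transformer', 'BERT')}
```

Running these deterministic checks first, and reserving the LLM for the semantic checks (contradictions, factual drift, coverage gaps), keeps lint costs down.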


CHAPTER SIX: THE WIKI PAGE STRUCTURE

A well-designed wiki page has a consistent structure that the LLM can parse, update, and cross-reference reliably. Karpathy's design uses YAML frontmatter for machine-readable metadata and Markdown for human-readable content.

Here is what a well-structured wiki page looks like:

---
title: "Transformer Architecture"
topics: ["deep-learning", "attention", "neural-networks"]
sources:
  - "raw/attention-is-all-you-need.pdf"
  - "raw/bert-paper.pdf"
created: "2026-04-01"
updated: "2026-04-22"
---

# Transformer Architecture

## Summary

The Transformer is a neural network architecture introduced by Vaswani et al.
in 2017 that relies entirely on attention mechanisms, dispensing with
recurrence and convolutions. It has become the dominant architecture for
natural language processing and has been extended to vision, audio, and
multimodal tasks.

## Key Concepts

The self-attention mechanism allows each position in a sequence to attend to
all other positions, computing a weighted sum of their value representations.
The weights are determined by the compatibility of query and key vectors.

The multi-head attention mechanism runs multiple attention operations in
parallel, allowing the model to attend to information from different
representation subspaces simultaneously.

## Cross-References

- [[Attention Mechanism]] — The core computational primitive of the Transformer
- [[BERT]] — A bidirectional Transformer pre-trained on masked language modeling
- [[GPT Architecture]] — A unidirectional Transformer pre-trained on next-token
  prediction

## Contradictions and Open Questions

None currently identified.

## Sources

[1] Vaswani et al., "Attention Is All You Need", 2017
    raw/attention-is-all-you-need.pdf
[2] Devlin et al., "BERT: Pre-training of Deep Bidirectional Transformers", 2018
    raw/bert-paper.pdf

Notice the important design choices here. The YAML frontmatter provides machine-readable metadata that the LLM can use to filter, search, and update pages without parsing the full Markdown content. The summary section gives a quick overview that can be included in other pages' cross-references without duplicating the full content. The cross-references section uses double-bracket notation familiar from tools like Obsidian. The contradictions section is explicitly reserved for flagging inconsistencies, making it easy for the lint operation to find and review them. And the sources section provides full traceability back to the raw documents.
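The frontmatter split itself is mechanical. The Chapter Nine implementation uses the python-frontmatter library for this; here is a dependency-free sketch that handles the simple scalar keys (it deliberately ignores multi-line list values such as sources):

```python
def split_frontmatter(text: str) -> tuple[dict, str]:
    """Minimal frontmatter splitter: ({key: value}, body).
    Handles only simple 'key: value' scalar lines."""
    _, fm, body = text.split("---", 2)
    meta = {}
    for line in fm.strip().splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip().strip('"')
    return meta, body.strip()


page_text = """---
title: "Transformer Architecture"
created: "2026-04-01"
---

# Transformer Architecture

Body content here.
"""

meta, body = split_frontmatter(page_text)
print(meta["title"])    # Transformer Architecture
print(meta["created"])  # 2026-04-01
```

This is why the metadata matters: the LLM (or any plain script) can filter pages by title, topic, or update date without reading the Markdown body at all.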


CHAPTER SEVEN: THE SCHEMA FILE

The schema file is the brain of the system. It transforms a general-purpose LLM into a specialized wiki maintenance agent. A good schema file is precise, comprehensive, and concise — ideally under 300 lines, as LLMs have limited instruction-following capacity for very long system prompts. Here is a production-quality example:

# Wiki Maintenance Schema

## Your Role

You are the maintainer of this knowledge wiki. Your job is to keep the wiki
accurate, consistent, and well-organized. You process raw source documents
and compile them into structured wiki pages. You answer questions by reading
the wiki, not the raw sources. You periodically audit the wiki for quality.

## Directory Structure

raw/        Immutable source documents. Never modify files here.
wiki/       LLM-generated Markdown pages. You write and update these.
wiki/index.md   Master index of all wiki pages. Update on every ingest.
wiki/log.md     Append-only operation log. Append on every operation.

## Page Template

Every wiki page must follow this exact structure:

---
title: "[Concept Name]"
topics: ["tag1", "tag2"]
sources:
  - "raw/filename.pdf"
created: "YYYY-MM-DD"
updated: "YYYY-MM-DD"
---

# [Concept Name]

## Summary
[2-4 sentence overview]

## Key Concepts
[Detailed content organized by sub-topic]

## Cross-References
- [[Page Name]] — [One-line description of the relationship]

## Contradictions and Open Questions
[Any contradictions with other pages or open questions]

## Sources
[Numbered list of citations with file paths]

## Ingest Procedure

When asked to ingest a new document:
1. Read the document carefully.
2. Identify all key concepts and entities.
3. For each concept, check whether a wiki page already exists.
4. Update existing pages with new information. Flag contradictions explicitly.
5. Create new pages for concepts not yet in the wiki.
6. Update wiki/index.md.
7. Append to wiki/log.md: date, operation, files affected, summary of changes.

## Query Procedure

When asked a question:
1. Identify which wiki pages are relevant.
2. Read those pages.
3. Synthesize an answer with citations to wiki pages.
4. If the answer reveals a coverage gap, create a new wiki page.
5. Append to wiki/log.md: date, query, pages consulted, gap pages created.

## Lint Procedure

When asked to lint the wiki:
1. Read all wiki pages.
2. Check for contradictions between pages.
3. Check for orphaned pages with no incoming links.
4. Check for missing backlinks.
5. Check for coverage gaps.
6. Report all findings and ask for permission before making changes.
7. Append to wiki/log.md: date, lint results, changes made.

## Cross-Reference Conventions

Use [[Page Name]] notation for all cross-references.
Every cross-reference must include a one-line description of the relationship.
When you create a cross-reference from page A to page B, also add a
cross-reference from page B to page A.

## Contradiction Handling

When you find information in a new source that contradicts an existing wiki
page, do not silently overwrite the existing content. Instead:
1. Add the new information to the page.
2. Add an entry to the Contradictions section describing the conflict.
3. Note which sources support each position.
4. Flag the page in wiki/log.md for human review.

CHAPTER EIGHT: CHOOSING YOUR LLM FOR WIKI MAINTENANCE

The choice of LLM for wiki maintenance matters significantly. The three leading frontier models as of April 2026 each have distinct strengths relevant to this use case.

Claude Opus 4.7 is the strongest choice for wiki maintenance tasks. Its 1,000,000-token context window means it can hold the entire wiki in context for lint operations on large knowledge bases. Its leading performance on SWE-bench Verified (87.6%) reflects a deep ability to understand and maintain complex, interconnected structured content — exactly what wiki maintenance requires. Its task budgets feature, which manages token expenditure during autonomous agent runs, is particularly valuable for controlling costs during expensive ingest operations. For a wiki with 100+ pages and dozens of source documents, Claude Opus 4.7 is the recommended choice.

GPT-5.5 is the strongest choice for query operations where the user wants fast, accurate answers with strong web research integration. Its 90.1% score on BrowseComp reflects exceptional ability to synthesize information from multiple sources — a skill that translates directly to answering complex questions from a wiki. Its 256,000-token context window is sufficient for most query operations, though it may require selective page loading for lint operations on large wikis. Its tendency to hallucinate rather than admit uncertainty is worth monitoring in wiki contexts, where accuracy is paramount.

Gemini 3.1 Pro is the strongest choice for wikis that contain multimodal content — charts, diagrams, images, PDFs with complex layouts, and video transcripts. Its 94.3% score on GPQA Diamond (graduate-level science reasoning) makes it particularly valuable for technical and scientific knowledge bases. Its competitive pricing ($2 per million input tokens versus $5 for both Opus 4.7 and GPT-5.5) makes it attractive for high-volume ingest operations where cost is a concern.

A pragmatic production architecture uses Claude Opus 4.7 for ingest and lint (where accuracy, context length, and structured reasoning matter most) and GPT-5.5 for query (where speed and synthesis quality matter most), with Gemini 3.1 Pro handling any multimodal source documents during ingest preprocessing.


CHAPTER NINE: A COMPLETE PYTHON IMPLEMENTATION

Here is a complete, production-ready Python implementation of the LLM Wiki pattern. It is designed to be modular, provider-agnostic, and easy to extend.

Project Structure

llmwiki/
├── raw/                    # Immutable source documents
├── wiki/                   # LLM-generated Markdown pages
│   ├── index.md
│   └── log.md
├── SCHEMA.md               # Wiki maintenance schema
├── llmwiki/
│   ├── __init__.py
│   ├── config.py           # Configuration and API client setup
│   ├── ingest.py           # Ingest operation
│   ├── query.py            # Query operation
│   ├── lint.py             # Lint operation
│   ├── wiki_io.py          # File I/O utilities
│   └── preprocessor.py     # Document-to-Markdown conversion
├── main.py                 # CLI entry point
└── requirements.txt

requirements.txt

anthropic>=0.25.0
openai>=1.30.0
google-generativeai>=0.7.0
markitdown[all]>=0.1.0
python-frontmatter>=1.1.0
rich>=13.0.0
click>=8.1.0
python-dotenv>=1.0.0

llmwiki/config.py

import os
from dataclasses import dataclass
from dotenv import load_dotenv
import anthropic
import openai

load_dotenv()


@dataclass
class WikiConfig:
    """Central configuration for the LLM Wiki system."""
    raw_dir: str = "raw"
    wiki_dir: str = "wiki"
    schema_path: str = "SCHEMA.md"
    index_path: str = "wiki/index.md"
    log_path: str = "wiki/log.md"

    # Model assignments by operation
    ingest_model: str = "claude-opus-4-7"
    query_model: str = "gpt-5.5"
    lint_model: str = "claude-opus-4-7"

    # Token budgets
    ingest_max_tokens: int = 8192
    query_max_tokens: int = 4096
    lint_max_tokens: int = 16384


def get_anthropic_client() -> anthropic.Anthropic:
    """Return a configured Anthropic client."""
    api_key = os.environ.get("ANTHROPIC_API_KEY")
    if not api_key:
        raise EnvironmentError(
            "ANTHROPIC_API_KEY environment variable is not set."
        )
    return anthropic.Anthropic(api_key=api_key)


def get_openai_client() -> openai.OpenAI:
    """Return a configured OpenAI client."""
    api_key = os.environ.get("OPENAI_API_KEY")
    if not api_key:
        raise EnvironmentError(
            "OPENAI_API_KEY environment variable is not set."
        )
    return openai.OpenAI(api_key=api_key)

llmwiki/wiki_io.py

import os
import frontmatter
from datetime import datetime
from typing import Optional


def read_schema(schema_path: str) -> str:
    """Read the wiki schema file."""
    with open(schema_path, "r", encoding="utf-8") as f:
        return f.read()


def read_wiki_page(page_path: str) -> Optional[dict]:
    """
    Read a wiki page and return its frontmatter metadata and content.
    Returns None if the file does not exist.
    """
    if not os.path.exists(page_path):
        return None
    post = frontmatter.load(page_path)
    return {
        "metadata": post.metadata,
        "content": post.content,
        "raw": post
    }


def write_wiki_page(page_path: str, content: str) -> None:
    """Write content to a wiki page, creating directories as needed."""
    os.makedirs(os.path.dirname(page_path), exist_ok=True)
    with open(page_path, "w", encoding="utf-8") as f:
        f.write(content)


def list_wiki_pages(wiki_dir: str) -> list[str]:
    """Return a list of all Markdown file paths in the wiki directory."""
    pages = []
    for root, _, files in os.walk(wiki_dir):
        for filename in files:
            if filename.endswith(".md"):
                pages.append(os.path.join(root, filename))
    return sorted(pages)


def read_all_wiki_pages(wiki_dir: str) -> str:
    """
    Read and concatenate all wiki pages into a single string.
    Used for lint operations that need full wiki context.
    """
    pages = list_wiki_pages(wiki_dir)
    combined = []
    for page_path in pages:
        with open(page_path, "r", encoding="utf-8") as f:
            combined.append(f"### FILE: {page_path}\n\n{f.read()}")
    return "\n\n---\n\n".join(combined)


def append_to_log(log_path: str, entry: str) -> None:
    """Append an entry to the operation log."""
    # datetime.utcnow() is deprecated; build an explicitly UTC-aware timestamp.
    from datetime import timezone
    timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")
    log_entry = f"\n## {timestamp}\n\n{entry}\n"
    os.makedirs(os.path.dirname(log_path), exist_ok=True)
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(log_entry)


def ensure_wiki_structure(config) -> None:
    """
    Ensure that the wiki directory and required files exist.
    Creates index.md and log.md if they are missing.
    """
    os.makedirs(config.raw_dir, exist_ok=True)
    os.makedirs(config.wiki_dir, exist_ok=True)

    if not os.path.exists(config.index_path):
        with open(config.index_path, "w", encoding="utf-8") as f:
            f.write("# Wiki Index\n\n*No pages yet. Run an ingest to begin.*\n")

    if not os.path.exists(config.log_path):
        with open(config.log_path, "w", encoding="utf-8") as f:
            f.write("# Operation Log\n\n*No operations yet.*\n")

llmwiki/preprocessor.py

import os
from markitdown import MarkItDown


def convert_to_markdown(file_path: str) -> str:
    """
    Convert a source document to Markdown using MarkItDown.
    Supports PDF, DOCX, PPTX, XLSX, HTML, images, and more.
    """
    converter = MarkItDown()
    result = converter.convert(file_path)
    return result.text_content


def load_raw_document(file_path: str) -> str:
    """
    Load a raw source document, converting to Markdown if necessary.
    Plain text and Markdown files are returned as-is.
    """
    _, ext = os.path.splitext(file_path.lower())

    if ext in (".md", ".txt", ".rst"):
        with open(file_path, "r", encoding="utf-8") as f:
            return f.read()
    else:
        return convert_to_markdown(file_path)

llmwiki/ingest.py

import os
from rich.console import Console
from .config import WikiConfig, get_anthropic_client
from .wiki_io import (
    read_schema,
    read_all_wiki_pages,
    append_to_log,
)
from .preprocessor import load_raw_document

console = Console()


def build_ingest_prompt(
    schema: str,
    document_content: str,
    document_path: str,
    existing_wiki: str,
) -> str:
    """Construct the ingest prompt for the LLM."""
    return f"""You are a wiki maintenance agent. Below is your schema, the
current state of the wiki, and a new source document to ingest.

Follow the Ingest Procedure in the schema exactly.

<schema>
{schema}
</schema>

<existing_wiki>
{existing_wiki if existing_wiki else "The wiki is currently empty."}
</existing_wiki>

<new_document>
Source path: {document_path}

{document_content}
</new_document>

Perform the ingest operation now. For each wiki page you create or update,
output the FULL file content in a code block with the file path as the
language identifier, like this:

```wiki/page-name.md
[full page content here]
```

After all page outputs, provide a brief log entry summarizing what you did.
Output the log entry inside <log></log> tags."""


def parse_llm_ingest_response(response_text: str, wiki_dir: str) -> dict:
    """
    Parse the LLM's ingest response to extract file writes and log entry.
    Returns a dict with 'files' (dict of path->content) and 'log' (str).
    """
    import re

    files = {}
    # Match code blocks with file paths as language identifiers
    pattern = r"```(wiki/[^\n]+\.md)\n(.*?)```"
    matches = re.findall(pattern, response_text, re.DOTALL)
    for file_path, content in matches:
        files[file_path.strip()] = content.strip()

    # Extract log entry
    log_match = re.search(r"<log>(.*?)</log>", response_text, re.DOTALL)
    log_entry = log_match.group(1).strip() if log_match else response_text[-500:]

    return {"files": files, "log": log_entry}


def ingest_document(document_path: str, config: WikiConfig) -> None:
    """Ingest a single document into the wiki."""
    console.print(f"\n[bold cyan]Ingesting:[/bold cyan] {document_path}")

    # Load and convert the document
    console.print("  Loading and converting document...")
    document_content = load_raw_document(document_path)
    console.print(f"  Document loaded: {len(document_content):,} characters")

    # Load schema and existing wiki
    schema = read_schema(config.schema_path)
    existing_wiki = read_all_wiki_pages(config.wiki_dir)

    # Build prompt and call LLM
    prompt = build_ingest_prompt(
        schema=schema,
        document_content=document_content,
        document_path=document_path,
        existing_wiki=existing_wiki,
    )

    console.print("  Calling Claude Opus 4.7 for ingest...")
    client = get_anthropic_client()

    response = client.messages.create(
        model=config.ingest_model,
        max_tokens=config.ingest_max_tokens,
        messages=[{"role": "user", "content": prompt}],
    )

    response_text = response.content[0].text

    # Parse and write the results
    parsed = parse_llm_ingest_response(response_text, config.wiki_dir)

    for file_path, content in parsed["files"].items():
        console.print(f"  Writing: {file_path}")
        os.makedirs(os.path.dirname(file_path), exist_ok=True)
        with open(file_path, "w", encoding="utf-8") as f:
            f.write(content)

    # Append to log
    log_entry = (
        f"**Operation:** Ingest\n"
        f"**Source:** {document_path}\n"
        f"**Pages written:** {list(parsed['files'].keys())}\n\n"
        f"{parsed['log']}"
    )
    append_to_log(config.log_path, log_entry)

    console.print(
        f"  [green]Done.[/green] Wrote {len(parsed['files'])} page(s)."
    )

### `llmwiki/query.py`

```python
from rich.console import Console
from .config import WikiConfig, get_openai_client
from .wiki_io import read_schema, read_all_wiki_pages, append_to_log

console = Console()


def build_query_prompt(schema: str, wiki_content: str, question: str) -> str:
    """Construct the query prompt for the LLM."""
    return f"""You are a wiki query agent. Below is your schema and the
current wiki. Answer the user's question by reading the wiki.

Follow the Query Procedure in the schema exactly.

<schema>
{schema}
</schema>

<wiki>
{wiki_content}
</wiki>

<question>
{question}
</question>

Provide a thorough answer with citations to specific wiki pages. If you
identify a coverage gap, describe what new page should be created and why.
"""


def query_wiki(question: str, config: WikiConfig) -> str:
    """
    Query the wiki to answer a question.
    Returns the LLM's answer as a string.
    """
    console.print(f"\n[bold cyan]Query:[/bold cyan] {question}")

    schema = read_schema(config.schema_path)
    wiki_content = read_all_wiki_pages(config.wiki_dir)

    if not wiki_content:
        return "The wiki is empty. Please ingest some documents first."

    prompt = build_query_prompt(schema, wiki_content, question)

    console.print("  Calling GPT-5.5 for query...")
    client = get_openai_client()

    response = client.chat.completions.create(
        model=config.query_model,
        max_tokens=config.query_max_tokens,
        messages=[{"role": "user", "content": prompt}],
    )

    answer = response.choices[0].message.content

    # Log the query
    log_entry = (
        f"**Operation:** Query\n"
        f"**Question:** {question}\n\n"
        f"**Answer summary:** {answer[:300]}..."
    )
    append_to_log(config.log_path, log_entry)

    return answer
```

### `llmwiki/lint.py`

```python
from rich.console import Console
from .config import WikiConfig, get_anthropic_client
from .wiki_io import read_schema, read_all_wiki_pages, append_to_log

console = Console()


def build_lint_prompt(schema: str, wiki_content: str) -> str:
    """Construct the lint prompt for the LLM."""
    return f"""You are a wiki audit agent. Below is your schema and the
complete wiki. Perform a thorough lint operation.

Follow the Lint Procedure in the schema exactly.

<schema>
{schema}
</schema>

<wiki>
{wiki_content}
</wiki>

Perform the lint operation now. Report:
1. All contradictions found between pages
2. All orphaned pages (no incoming links)
3. All missing backlinks
4. All coverage gaps you identify
5. Your recommended remediation actions

Be thorough and specific. Cite exact page names and line content where
relevant. Do NOT make changes — report only. Changes require human approval.
"""


def lint_wiki(config: WikiConfig) -> str:
    """
    Perform a lint health check of the entire wiki.
    Returns the lint report as a string.
    """
    console.print("\n[bold cyan]Linting wiki...[/bold cyan]")

    schema = read_schema(config.schema_path)
    wiki_content = read_all_wiki_pages(config.wiki_dir)

    if not wiki_content:
        return "The wiki is empty. Nothing to lint."

    prompt = build_lint_prompt(schema, wiki_content)

    console.print("  Calling Claude Opus 4.7 for lint...")
    client = get_anthropic_client()

    response = client.messages.create(
        model=config.lint_model,
        max_tokens=config.lint_max_tokens,
        messages=[{"role": "user", "content": prompt}],
    )

    report = response.content[0].text

    # Log the lint operation
    log_entry = (
        f"**Operation:** Lint\n\n"
        f"**Report summary:** {report[:500]}..."
    )
    append_to_log(config.log_path, log_entry)

    return report
```

### `main.py` — The CLI Entry Point

```python
import click
import os
from rich.console import Console
from rich.markdown import Markdown
from llmwiki.config import WikiConfig
from llmwiki.wiki_io import ensure_wiki_structure
from llmwiki.ingest import ingest_document
from llmwiki.query import query_wiki
from llmwiki.lint import lint_wiki

console = Console()


@click.group()
def cli():
    """LLM Wiki — a self-maintaining knowledge base powered by frontier LLMs."""
    pass


@cli.command()
@click.argument("document_path")
def ingest(document_path: str):
    """Ingest a document into the wiki."""
    if not os.path.exists(document_path):
        console.print(f"[red]Error:[/red] File not found: {document_path}")
        return

    config = WikiConfig()
    ensure_wiki_structure(config)
    ingest_document(document_path, config)


@cli.command()
@click.argument("question")
def query(question: str):
    """Query the wiki with a natural language question."""
    config = WikiConfig()
    ensure_wiki_structure(config)
    answer = query_wiki(question, config)
    console.print("\n[bold green]Answer:[/bold green]")
    console.print(Markdown(answer))


@cli.command()
def lint():
    """Run a health check on the entire wiki."""
    config = WikiConfig()
    ensure_wiki_structure(config)
    report = lint_wiki(config)
    console.print("\n[bold yellow]Lint Report:[/bold yellow]")
    console.print(Markdown(report))


@cli.command()
def ingest_all():
    """Ingest all documents in the raw/ directory."""
    # Note: click exposes this command as `ingest-all` on the command line
    config = WikiConfig()
    ensure_wiki_structure(config)

    raw_files = []
    for root, _, files in os.walk(config.raw_dir):
        for filename in files:
            raw_files.append(os.path.join(root, filename))

    if not raw_files:
        console.print("[yellow]No files found in raw/ directory.[/yellow]")
        return

    console.print(f"Found {len(raw_files)} file(s) to ingest.")
    for file_path in raw_files:
        ingest_document(file_path, config)

    console.print("\n[bold green]All documents ingested.[/bold green]")


if __name__ == "__main__":
    cli()
```

CHAPTER TEN: WHEN TO USE THE LLM WIKI VS. RAG

The LLM Wiki is not a universal replacement for RAG. It is a better tool for a specific class of problems. Here is a practical decision guide:

Choose the LLM Wiki when:

  • Your knowledge base is curated and relatively stable (not changing every hour)
  • You need synthesis across multiple sources, not just retrieval of individual facts
  • Your knowledge base is under approximately 100,000 tokens (roughly 400,000 characters) in compiled wiki form
  • You want a human-readable, auditable knowledge base
  • You need cross-referencing and contradiction detection
  • You query the same topics repeatedly (break-even at ~8–10 queries per topic)

Choose RAG when:

  • Your document collection is very large (tens of thousands of documents or more)
  • Your content changes frequently (news feeds, live databases, real-time data)
  • You have many concurrent users with diverse, unpredictable query patterns
  • You need to retrieve specific verbatim passages from source documents
  • Cost per ingest is a primary constraint

Choose a hybrid approach when:

  • You have a stable core knowledge base (use the wiki) plus a large dynamic corpus (use RAG)
  • You want the wiki's synthesis quality for common queries and RAG's breadth for edge cases
  • You are building a production system that needs to scale beyond the wiki's sweet spot
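If you want the decision guide as executable shorthand, here is a rough heuristic sketch. The thresholds come straight from the bullets above (the 100,000-token sweet spot, the "not changing every hour" stability test, and the ~8–10-query break-even); the function name and signature are ours, not part of the article's codebase:

```python
def recommend_architecture(corpus_tokens: int,
                           changes_per_day: float,
                           queries_per_topic: float) -> str:
    """Rough wiki-vs-RAG heuristic based on the decision guide above."""
    wiki_fits = corpus_tokens <= 100_000   # wiki sweet spot from the text
    stable = changes_per_day < 1           # "not changing every hour"
    repeated = queries_per_topic >= 8      # break-even at ~8-10 queries per topic

    if wiki_fits and stable and repeated:
        return "llm-wiki"
    if not wiki_fits and stable and repeated:
        return "hybrid"  # stable core wiki plus RAG for the long tail
    return "rag"

print(recommend_architecture(60_000, 0.1, 20))       # small, stable, re-queried
print(recommend_architecture(5_000_000, 0.1, 20))    # too big for the wiki alone
print(recommend_architecture(60_000, 50, 3))         # fast-changing, ad hoc queries
```

Real deployments will weigh more factors (concurrency, verbatim-retrieval needs, ingest budget), but a first cut like this is useful for triage.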

CHAPTER ELEVEN: COST ANALYSIS

Understanding the economics of the LLM Wiki is essential for production deployment. Here is a realistic cost model using April 2026 pricing:

Ingest cost per document (Claude Opus 4.7 at $5/M input, $25/M output):

  • Average document: ~10,000 tokens input (document + schema + existing wiki)
  • Average LLM output per ingest: ~3,000 tokens (new/updated pages + log)
  • Cost per ingest: (10,000 × $0.000005) + (3,000 × $0.000025) = $0.125 per document

Query cost per question (GPT-5.5 at $5/M input, $30/M output):

  • Average wiki context loaded: ~20,000 tokens
  • Average answer: ~500 tokens
  • Cost per query: (20,000 × $0.000005) + (500 × $0.000030) = $0.115 per query

Lint cost per run (Claude Opus 4.7, full wiki):

  • Full wiki context (100 pages): ~80,000 tokens
  • Lint report: ~2,000 tokens
  • Cost per lint: (80,000 × $0.000005) + (2,000 × $0.000025) = $0.45 per lint run

For a knowledge base of 50 documents queried 20 times each, the total cost is approximately:

  • Ingest: 50 × $0.125 = $6.25
  • Queries: 1,000 × $0.115 = $115.00
  • Lint (weekly, 3 months): 12 × $0.45 = $5.40
  • Total: ~$126.65

This compares favorably with a RAG system at similar scale, which would spend a similar amount on queries alone while providing lower synthesis quality and no contradiction detection.
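The arithmetic above is simple enough to reproduce in a few lines. This sketch treats the April 2026 per-million-token prices quoted in this chapter as given assumptions:

```python
def call_cost(tokens_in: int, tokens_out: int,
              in_per_m: float, out_per_m: float) -> float:
    """Cost of one LLM call given per-million-token prices in dollars."""
    return tokens_in / 1e6 * in_per_m + tokens_out / 1e6 * out_per_m

# Prices from the text: Claude Opus 4.7 at $5/$25 per M, GPT-5.5 at $5/$30 per M
ingest = call_cost(10_000, 3_000, 5, 25)   # per-document ingest
query  = call_cost(20_000,   500, 5, 30)   # per-question query
lint   = call_cost(80_000, 2_000, 5, 25)   # per full-wiki lint run

# 50 documents, 20 queries each, weekly lint for 3 months
total = 50 * ingest + 1_000 * query + 12 * lint
print(f"${total:.2f}")
```

Keeping the model in code makes it easy to re-run the break-even analysis whenever providers change their pricing.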


CONCLUSION

The LLM Wiki pattern is one of the most elegant ideas to emerge from the applied AI community in 2026. It reframes the question from "how do we retrieve the right information at query time?" to "how do we compile our knowledge into a form that makes every query easy?" It is the difference between a library with no card catalog and a library with a brilliant, constantly-updated reference librarian who has read every book and remembers how they all connect.

The pattern is not magic. It has real costs, real limitations, and a real sweet spot. It works best for curated, stable knowledge bases of moderate size. It requires careful schema design and thoughtful page structure. And it depends on the quality of the frontier models that power it — which, as of April 2026, with Claude Opus 4.7, GPT-5.5, and Gemini 3.1 Pro all operating at genuinely remarkable levels of capability, has never been higher.

The implementation presented in this article is a starting point. Production deployments will want to add streaming responses, caching of wiki page embeddings for faster retrieval, a web interface, version control integration via Git, and monitoring of token usage and costs. But the core architecture — raw sources, compiled wiki, schema file, and three operations — is sound, and it scales gracefully as you add those layers.

The most important thing is to start. Pick a domain you care about, drop a few documents into raw/, write a schema, and run your first ingest. Watch the wiki grow. Ask it a question. Run a lint. You will quickly develop an intuition for how the system works and what it needs. And you will find, as many practitioners have, that there is something genuinely delightful about watching an AI build and maintain a knowledge base that gets smarter every time you add a new document.

That is the promise of the LLM Wiki. It is not just a retrieval system. It is a knowledge compiler. And it is ready to use today.
