Monday, May 18, 2026

THE SELF-EVOLVING AGENT: FROM HUMBLE CHATBOT TO LIVING ARCHITECTURE



PREFACE: WHY THIS MATTERS

Imagine hiring a brand-new junior developer on their first day. They know the basics: they can write code, answer questions, look things up, and use the tools you hand them. But here is the remarkable part: every thirty minutes, they sit quietly, reflect on every conversation they have had, figure out what new skill would have made them more useful, and then they teach themselves that skill. By the end of the week, they are not the same developer you hired on Monday morning. They have grown. They have changed. They have become something more capable than they were, organically, without you having to send them to a single training course. That is the vision at the heart of the self-evolving agent. It is not science fiction. It is an architecture you can build today, using real protocols, real language models, and real engineering discipline. This article is your complete guide to understanding, designing, and implementing such a system from the ground up. We will begin with the simplest possible starting point: a lightweight LLM chatbot that can call tools. We will then layer on a persistent memory system inspired by Andrej Karpathy's LLM Wiki concept. We will add the thirty-minute reflection loop that is the engine of self-improvement. And finally, we will describe how the agent dynamically generates new tools, new adapters, and even new sub-agents, integrating them into itself at runtime. By the end of this journey, you will understand every constituent of a system that starts small and grows powerful over time. Let us begin.

CHAPTER ONE: THE FOUNDATION - UNDERSTANDING THE ARCHITECTURE BEFORE WRITING CODE

SECTION 1.1: THE FOUR PILLARS OF A SELF-EVOLVING AGENT

Before we write a single line of code, we need to understand what we are building at a conceptual level. A self-evolving agent rests on four interdependent pillars, and understanding how they relate to each other is the most important thing you can do before you start.

The first pillar is the LLM Core. This is the reasoning engine at the center of everything. The LLM does not just answer questions; it plans, reflects, writes code, evaluates its own outputs, and makes decisions about what to do next. In our architecture, the LLM is not a passive text generator. It is an active decision-maker that orchestrates all the other components.

The second pillar is the Tool Layer, implemented according to the Model Context Protocol (MCP) as standardized by late 2025. Tools are the agent's hands. They let it search the web, read files, write files, execute code, call APIs, and interact with external systems. Without tools, the LLM is a brilliant mind locked in a dark room. With tools, it can reach out and touch the world.

The third pillar is the Memory System, implemented as a living wiki inspired by Andrej Karpathy's LLM Wiki concept. This is the agent's long-term memory. It is not a simple database of facts. It is a structured, interlinked knowledge base that the agent actively maintains, updates, and queries. It remembers what happened in past sessions, what tools it has built, what it has learned, and what it still needs to learn.

The fourth pillar is the Reflection and Evolution Engine. Every thirty minutes, the agent pauses its normal operation, reads its own memory, analyses what it has been asked to do, identifies gaps in its capabilities, and then generates and integrates new functionality. This is the heartbeat of self-evolution. Without this pillar, the agent is just a very good chatbot. With it, the agent is a living system.

These four pillars do not operate in isolation. The LLM Core uses the Tool Layer to interact with the world. The Tool Layer writes results into the Memory System. The Memory System feeds the Reflection Engine with the raw material it needs to reason about growth. The Reflection Engine generates new tools and capabilities that expand the Tool Layer. And the expanded Tool Layer makes the LLM Core more capable. It is a virtuous cycle.

SECTION 1.2: THE MODEL CONTEXT PROTOCOL (MCP) - THE UNIVERSAL TOOL LANGUAGE

The Model Context Protocol, or MCP, is the backbone of our tool layer. It was introduced by Anthropic in November 2024 and had matured significantly by December 2025, when the Agentic AI Foundation (under the Linux Foundation) took stewardship of the standard. Understanding MCP is not optional for building this system. It is essential.

MCP is built on a client-server architecture that will feel familiar to anyone who has worked with the Language Server Protocol (LSP) used in code editors. The analogy is intentional and illuminating. Just as LSP allows any editor to talk to any language server using a standard protocol, MCP allows any LLM host to talk to any tool server using a standard protocol.

The three roles in an MCP system are the Host, the Client, and the Server. The Host is the application that contains the LLM and manages the overall user experience. In our case, the Host is the self-evolving agent itself. The Client lives inside the Host and speaks MCP on behalf of the LLM. The Server is an external process that exposes tools, resources, and prompts.

Communication happens over JSON-RPC 2.0, which is a lightweight, language-agnostic remote procedure call protocol. Every message is a JSON object with a specific structure. For local servers, communication happens over stdio (standard input and output). For remote servers, it happens over HTTP: the original specification used HTTP with Server-Sent Events (SSE), and the 2025 revisions of the specification replaced that transport with Streamable HTTP.

A tool in MCP is defined by three things: a name, a human-readable description, and a JSON Schema that describes its input parameters. When the LLM wants to use a tool, it emits a structured tool call. The MCP Client intercepts this, routes it to the appropriate MCP Server, and returns the result. The LLM never needs to know the implementation details of the tool. It only needs to know the tool's name and what it does.

Here is what a minimal MCP server looks like in Python, using the official MCP SDK.
Notice how the @mcp.tool() decorator does all the heavy lifting of registering the tool and generating its JSON Schema from the type hints:

    from mcp.server.fastmcp import FastMCP

    # Create the MCP server instance with a human-readable name.
    # This name is used for discovery and logging.
    mcp = FastMCP("EvolverAgent-ToolServer")


    @mcp.tool()
    def web_search(query: str, max_results: int = 5) -> str:
        """
        Search the web for information about a given query.

        Args:
            query: The search query string.
            max_results: Maximum number of results to return (default 5).

        Returns:
            A formatted string containing search results with titles,
            URLs, and snippets.
        """
        # In a real implementation, this would call a search API.
        # _call_search_api and _format_results are placeholders that are
        # not defined in this listing.
        # The docstring becomes the tool description that the LLM reads.
        results = _call_search_api(query, max_results)
        return _format_results(results)


    if __name__ == "__main__":
        # Run the server over stdio for local communication.
        mcp.run(transport="stdio")

The beauty of this design is that the LLM sees a clean, well-described interface. It reads the docstring as the tool's description and uses the type hints to understand what arguments to provide. The MCP framework automatically generates the JSON Schema that the LLM needs to call the tool correctly.

When the MCP Client sends a tools/list request to this server, it receives back a JSON structure that looks like this:

    {
      "tools": [
        {
          "name": "web_search",
          "description": "Search the web for information...",
          "inputSchema": {
            "type": "object",
            "properties": {
              "query": {
                "type": "string",
                "description": "The search query string."
              },
              "max_results": {
                "type": "integer",
                "description": "Maximum number of results.",
                "default": 5
              }
            },
            "required": ["query"]
          }
        }
      ]
    }

This JSON structure is what the LLM actually sees when it decides which tools to use. It is the menu from which the LLM orders. The richer and clearer the descriptions, the better the LLM's tool selection will be.
One of the most important features of MCP for our self-evolving agent is that the tool registry is dynamic. We can add new tools to a running server, and the LLM will see them the next time the MCP Client calls tools/list. A server that declares the listChanged capability can also send a notifications/tools/list_changed notification to prompt that refresh. This is the mechanism that makes runtime tool integration possible. When our reflection engine generates a new tool, it registers it with the MCP server, and the agent gains access to it without any restart.
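To make the JSON-RPC 2.0 exchange in this section concrete, the sketch below builds a tool invocation and its reply by hand. The tools/call method name, the name/arguments parameters, and the content/isError result shape follow the MCP specification; the helper names and the sample result text are assumptions made for illustration, and a real client would let the SDK construct these messages.

```python
import json
from itertools import count

# Each JSON-RPC request carries an id so the reply can be matched to it.
_next_id = count(1)


def make_tool_call_request(name: str, arguments: dict) -> dict:
    """Build a JSON-RPC 2.0 request that asks a server to run one tool."""
    return {
        "jsonrpc": "2.0",
        "id": next(_next_id),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


def make_tool_call_result(request: dict, text: str,
                          is_error: bool = False) -> dict:
    """Build the matching JSON-RPC 2.0 response (hypothetical helper)."""
    return {
        "jsonrpc": "2.0",
        "id": request["id"],  # Echo the request id back.
        "result": {
            "content": [{"type": "text", "text": text}],
            "isError": is_error,
        },
    }


request = make_tool_call_request(
    "web_search", {"query": "MCP transports", "max_results": 3}
)
response = make_tool_call_result(request, "(search results would go here)")

# This string is what would travel over stdio or HTTP.
wire = json.dumps(request)
print(wire)
```

The id field is what lets a client keep several tool calls in flight at once and still pair each result with the call that produced it.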

SECTION 1.3: THE AGENT LOOP - THE HEARTBEAT OF OPERATION

Before we go deeper into any individual component, let us understand the agent loop. This is the fundamental cycle that drives all agent behavior. Every agentic system, no matter how complex, reduces to some version of this loop.

The loop begins when the agent receives a user message. The LLM processes the message, its current context, and its knowledge of available tools. It then decides whether to respond directly or to use one or more tools. If it decides to use a tool, it emits a structured tool call. The system executes the tool, returns the result to the LLM, and the LLM incorporates the result into its reasoning. This continues until the LLM decides it has enough information to give a final response to the user.

In pseudocode, the core agent loop looks like this:

    def agent_loop(user_message: str, context: AgentContext,
                   max_steps: int = 20) -> str:
        """
        The fundamental agent reasoning loop.

        This function drives all agent behavior. It continues until the
        LLM produces a final response with no pending tool calls, or
        until max_steps LLM calls have been made.
        """
        # Add the user message to the conversation history.
        context.messages.append({
            "role": "user",
            "content": user_message
        })

        # Keep looping until the LLM gives a final answer. The step cap
        # stops a confused model from calling tools forever.
        for _ in range(max_steps):
            # Ask the LLM what to do next, given the current context
            # and the list of available tools.
            response = llm.complete(
                messages=context.messages,
                tools=context.tool_registry.get_tool_schemas(),
                system_prompt=context.system_prompt
            )

            # If the LLM wants to call a tool, execute it.
            if response.has_tool_calls():
                # Record the assistant turn that requested the tools, so
                # that every tool result has a matching call in history.
                context.messages.append({
                    "role": "assistant",
                    "content": response.text,
                    "tool_calls": response.tool_calls
                })
                for tool_call in response.tool_calls:
                    result = context.tool_registry.execute(
                        tool_name=tool_call.name,
                        arguments=tool_call.arguments
                    )
                    # Add the tool result to the conversation so the
                    # LLM can see what happened.
                    context.messages.append({
                        "role": "tool",
                        "tool_call_id": tool_call.id,
                        "content": result
                    })

            # If the LLM gave a final text response, we are done.
            elif response.has_text_content():
                final_response = response.text
                context.messages.append({
                    "role": "assistant",
                    "content": final_response
                })
                return final_response

            # Neither tool calls nor text: stop instead of spinning.
            else:
                break

        raise RuntimeError("Agent loop ended without a final response.")

This loop is simple, but it is the engine of everything. The self-evolving agent is built around this loop, with the reflection engine running as a parallel process that periodically enriches the context and expands the tool registry.
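To watch the loop run end to end without a real model, the sketch below swaps the LLM for a scripted stand-in that requests one tool call and then answers. ScriptedLLM, ToolCall, Reply, TOOLS, and run_agent are hypothetical names invented for this illustration; only the control flow mirrors the loop described in this section.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    id: str
    name: str
    arguments: dict


@dataclass
class Reply:
    text: str = ""
    tool_calls: list = field(default_factory=list)


class ScriptedLLM:
    """Stand-in for a real model: asks for one tool call, then answers."""

    def complete(self, messages: list) -> Reply:
        tool_results = [m for m in messages if m["role"] == "tool"]
        if not tool_results:
            return Reply(tool_calls=[ToolCall("call_1", "add", {"a": 2, "b": 3})])
        return Reply(text=f"The sum is {tool_results[-1]['content']}.")


# A one-entry tool registry; real tools would live behind MCP.
TOOLS = {"add": lambda a, b: a + b}


def run_agent(user_message: str, llm: ScriptedLLM,
              max_steps: int = 5) -> tuple[str, list]:
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_steps):
        reply = llm.complete(messages)
        if reply.tool_calls:
            # Keep the assistant turn that requested the tools.
            messages.append({"role": "assistant", "content": reply.text,
                             "tool_calls": reply.tool_calls})
            for call in reply.tool_calls:
                result = TOOLS[call.name](**call.arguments)
                messages.append({"role": "tool", "tool_call_id": call.id,
                                 "content": str(result)})
        else:
            messages.append({"role": "assistant", "content": reply.text})
            return reply.text, messages
    raise RuntimeError("agent did not finish within max_steps")


answer, history = run_agent("What is 2 + 3?", ScriptedLLM())
print(answer)
```

The resulting history contains four turns (user, assistant tool request, tool result, final assistant answer), which is the shape of a single tool-using exchange.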

CHAPTER TWO: THE MEMORY SYSTEM - THE AGENT'S LIVING KNOWLEDGE BASE

SECTION 2.1: WHY SIMPLE MEMORY IS NOT ENOUGH

Most LLM applications handle memory in one of two ways. The first approach is to stuff the entire conversation history into the context window and hope it fits. This works for short conversations but fails catastrophically for long-running agents that need to remember things from days or weeks ago. The context window is finite, and important information gets pushed out as new information comes in. The second approach is Retrieval-Augmented Generation, or RAG. In RAG, you embed documents into a vector database and retrieve the most semantically similar chunks when you need them. RAG is powerful, but it has a fundamental limitation: it treats knowledge as a static collection of raw documents. It does not build a deeper understanding of those documents. It does not notice when two documents contradict each other. It does not synthesize information across multiple sources into a coherent picture. It just retrieves chunks and hopes the LLM can figure out the rest. Andrej Karpathy's LLM Wiki concept addresses both of these limitations with an elegant insight: instead of storing raw documents and retrieving them, you use the LLM to compile raw documents into a structured, interlinked wiki. The wiki is the memory. The LLM reads the wiki, not the raw documents, when it needs to answer a question. And the wiki grows and improves over time as the agent ingests new information. The analogy Karpathy uses is that of a compiler. Raw source documents are like source code. The wiki is like the compiled binary. You do not ship source code to users; you ship the compiled binary. Similarly, you do not query raw documents; you query the compiled wiki. The compilation step, performed by the LLM, adds structure, cross-links, summaries, and reconciliation of contradictions that raw retrieval cannot provide.

SECTION 2.2: THE WIKI ARCHITECTURE

Our wiki is a collection of Markdown files stored in a directory on disk. Each file represents a page about a specific topic. Pages are interlinked using wiki-style links. The wiki has a schema that defines how pages are structured, what metadata they carry, and how they relate to each other.

The wiki has three types of pages. Concept pages describe a single idea, technology, or entity. They contain a summary, key facts, related concepts, and a list of sources. Session pages record what happened in a specific interaction session: what the user asked, what tools were used, what was learned, and what new capabilities were identified as needed. Capability pages describe tools, adapters, and agents that the system has built or is planning to build. They record the tool's purpose, its implementation, its performance, and any known limitations.

The directory structure of the wiki looks like this:

    wiki/
    |-- concepts/
    |   |-- python_asyncio.md
    |   |-- mcp_protocol.md
    |   |-- vector_databases.md
    |   `-- ...
    |-- sessions/
    |   |-- session_2025-12-01_09-30-00.md
    |   |-- session_2025-12-01_14-05-12.md
    |   `-- ...
    |-- capabilities/
    |   |-- web_search_tool.md
    |   |-- email_adapter.md
    |   |-- calendar_agent.md
    |   `-- ...
    |-- schema.md
    `-- index.md

Session files are named session_YYYY-MM-DD_HH-MM-SS.md, the format that the record_session method in the next section writes and that the reflection engine later parses.

The index.md file is the entry point. It contains a table of contents and a high-level summary of everything the agent knows. The schema.md file defines the structure that all pages must follow. This structure is important because the LLM needs to know what to expect when it reads a page.

A concept page follows this template:

    # [Concept Name]

    ## Summary
    [A 2-3 sentence summary of the concept, written for an LLM reader.]

    ## Key Facts
    [Bullet points of the most important facts about this concept.]

    ## Related Concepts
    [Links to related pages: [[other_concept]], [[another_concept]]]

    ## Sources
    [List of raw source documents that contributed to this page.]

    ## Last Updated
    [Timestamp of the last update.]

    ## Confidence
    [High / Medium / Low - how confident the agent is in this information.]

The Confidence field is particularly important for our self-evolving agent. When the agent reflects on its memory, it pays special attention to low-confidence pages and plans to gather more information about those topics.
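Because every concept page follows the same template, plain string handling is enough to read its fields back. The following is a minimal sketch under that assumption: the sample page content and the helper names split_sections, extract_links, and needs_more_research are hypothetical and not part of the original design.

```python
import re

# A small page that follows the concept template (sample content only).
SAMPLE_PAGE = """# MCP Protocol

## Summary
A client-server protocol for exposing tools to LLM hosts.

## Related Concepts
[[python_asyncio]], [[vector_databases]]

## Confidence
Low
"""

# Matches wiki-style links such as [[other_concept]].
WIKI_LINK = re.compile(r"\[\[([^\[\]]+)\]\]")


def split_sections(page: str) -> dict[str, str]:
    """Map each '## Heading' to the text that follows it."""
    sections: dict[str, list[str]] = {}
    current = None
    for line in page.splitlines():
        if line.startswith("## "):
            current = line[3:].strip()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {name: "\n".join(body).strip() for name, body in sections.items()}


def extract_links(page: str) -> list[str]:
    """Return the targets of all wiki-style links on the page."""
    return WIKI_LINK.findall(page)


def needs_more_research(page: str) -> bool:
    """True when the page marks its own confidence as low."""
    confidence = split_sections(page).get("Confidence", "")
    return confidence.lower().startswith("low")


sections = split_sections(SAMPLE_PAGE)
links = extract_links(SAMPLE_PAGE)
print(sections["Confidence"], links, needs_more_research(SAMPLE_PAGE))
```

A reflection pass could use a check like needs_more_research to build its list of topics worth another search.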

SECTION 2.3: THE WIKI MEMORY MANAGER

The WikiMemoryManager is the component that manages all interactions with the wiki. It provides four core operations: ingest, query, update, and lint. These map directly to Karpathy's original design.

The ingest operation takes a raw document (a web search result, a user-provided document, or a tool output) and uses the LLM to extract key information from it, update existing wiki pages, create new pages if needed, and add cross-links between related pages.

The query operation takes a natural language question and uses the LLM to find the most relevant wiki pages, read them, and synthesize an answer. If the answer is good enough, it can be filed back into the wiki as a new page, allowing knowledge to compound.

The update operation is called after every session to record what happened, what was learned, and what new capabilities were identified.

The lint operation runs periodically to find contradictions, orphaned pages, outdated information, and structural problems in the wiki.

Here is the core of the WikiMemoryManager class. The update operation appears as record_session. The lint operation and the private prompt-building and parsing helpers (_build_ingest_prompt, _parse_page_updates, _build_query_prompt, _build_session_record_prompt, and _parse_page_list) are not shown in this listing:

    from datetime import datetime
    from pathlib import Path


    class WikiMemoryManager:
        """
        Manages the agent's long-term memory as a structured wiki.

        This class implements Karpathy's LLM Wiki concept, treating raw
        information as source code and the wiki as the compiled binary.
        The LLM is the compiler that transforms raw information into
        structured, interlinked knowledge.
        """

        def __init__(self, wiki_root: str, llm_client):
            """
            Initialize the wiki memory manager.

            Args:
                wiki_root: Path to the root directory of the wiki.
                llm_client: An LLM client for performing wiki operations.
            """
            self.wiki_root = Path(wiki_root)
            self.llm = llm_client
            self._ensure_wiki_structure()

        def _ensure_wiki_structure(self):
            """Create the wiki directory structure if it does not exist."""
            for subdir in ["concepts", "sessions", "capabilities"]:
                (self.wiki_root / subdir).mkdir(parents=True, exist_ok=True)

            # Create the index if it does not exist yet.
            index_path = self.wiki_root / "index.md"
            if not index_path.exists():
                index_path.write_text(
                    "# Agent Wiki Index\n\n"
                    "This wiki is the long-term memory of the self-evolving "
                    "agent.\n\n"
                    "## Concepts\n\n"
                    "## Sessions\n\n"
                    "## Capabilities\n"
                )

        def ingest(self, raw_content: str, source_label: str) -> list[str]:
            """
            Ingest a raw document into the wiki.

            The LLM reads the raw content, identifies key concepts,
            updates existing pages, and creates new pages as needed.

            Args:
                raw_content: The raw text content to ingest.
                source_label: A human-readable label for the source.

            Returns:
                A list of wiki page paths that were created or updated.
            """
            # Build the prompt for the LLM to perform the ingestion.
            existing_pages = self._list_all_pages()
            prompt = self._build_ingest_prompt(
                raw_content, source_label, existing_pages
            )

            # Ask the LLM to produce a list of page updates.
            response = self.llm.complete(prompt)
            updates = self._parse_page_updates(response)

            # Apply the updates to the wiki files.
            root = self.wiki_root.resolve()
            updated_paths = []
            for page_path, page_content in updates.items():
                full_path = (root / page_path).resolve()
                # Page paths come from LLM output, so refuse anything
                # that would land outside the wiki directory.
                if not full_path.is_relative_to(root):
                    continue
                full_path.parent.mkdir(parents=True, exist_ok=True)
                full_path.write_text(page_content)
                updated_paths.append(str(full_path))

            # Update the index to reflect new pages.
            self._update_index(updated_paths)
            return updated_paths

        def query(self, question: str) -> str:
            """
            Query the wiki to answer a natural language question.

            The LLM searches for relevant pages, reads them, and
            synthesizes an answer with citations.

            Args:
                question: The natural language question to answer.

            Returns:
                A synthesized answer with citations to wiki pages.
            """
            # First, find the most relevant pages using keyword search
            # and LLM-guided relevance ranking.
            relevant_pages = self._find_relevant_pages(question)
            page_contents = {
                page: (self.wiki_root / page).read_text()
                for page in relevant_pages
                if (self.wiki_root / page).exists()
            }

            # Ask the LLM to synthesize an answer from the page contents.
            prompt = self._build_query_prompt(question, page_contents)
            answer = self.llm.complete(prompt)
            return answer

        def record_session(self, session_data: dict) -> str:
            """
            Record a completed session into the wiki.

            Args:
                session_data: A dictionary containing session metadata,
                    conversation summary, tools used, and identified
                    capability gaps.

            Returns:
                The path to the created session page.
            """
            timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
            session_id = f"session_{timestamp}"
            page_path = self.wiki_root / "sessions" / f"{session_id}.md"

            # Use the LLM to write a well-structured session summary.
            prompt = self._build_session_record_prompt(session_data)
            page_content = self.llm.complete(prompt)
            page_path.write_text(page_content)
            return str(page_path)

        def _find_relevant_pages(self, question: str) -> list[str]:
            """
            Find wiki pages relevant to a given question.

            Uses a combination of keyword matching and LLM-guided
            relevance ranking to find the best pages.
            """
            all_pages = self._list_all_pages()

            # Simple keyword matching as a first pass.
            keywords = question.lower().split()
            candidates = []
            for page in all_pages:
                page_name = page.lower()
                if any(kw in page_name for kw in keywords):
                    candidates.append(page)

            # If we have too many candidates, use the LLM to rank them.
            if len(candidates) > 10:
                prompt = (
                    f"Given the question: '{question}'\n"
                    f"Rank these wiki pages by relevance (most relevant "
                    f"first, return top 5):\n"
                    + "\n".join(candidates)
                )
                ranked = self.llm.complete(prompt)
                candidates = self._parse_page_list(ranked)[:5]

            return candidates[:10]  # Never read more than 10 pages at once.

        def _list_all_pages(self) -> list[str]:
            """Return a list of all wiki page paths, relative to wiki_root."""
            pages = []
            for path in self.wiki_root.rglob("*.md"):
                pages.append(str(path.relative_to(self.wiki_root)))
            return pages

        def _update_index(self, new_pages: list[str]):
            """Update the wiki index to include newly created pages."""
            index_path = self.wiki_root / "index.md"
            current_index = index_path.read_text()
            prompt = (
                f"Update this wiki index to include these new pages:\n"
                f"{new_pages}\n\nCurrent index:\n{current_index}\n\n"
                f"Return the complete updated index."
            )
            updated_index = self.llm.complete(prompt)
            index_path.write_text(updated_index)

This class is the heart of the memory system. Notice how every operation involves the LLM as an active participant. The LLM does not just retrieve information; it compiles, synthesizes, and structures it. This is what distinguishes the LLM Wiki approach from simple RAG.
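The lint operation is described in this section but does not appear in the class listing. As a rough illustration of its structural half only, the sketch below scans a wiki directory for broken links and orphaned pages without involving the LLM. It assumes that a link such as [[mcp_protocol]] resolves to a file named mcp_protocol.md anywhere in the wiki, which the original does not specify; lint_links is a hypothetical helper, and finding contradictions or outdated facts would still need the LLM.

```python
import re
import tempfile
from pathlib import Path

WIKI_LINK = re.compile(r"\[\[([^\[\]]+)\]\]")


def lint_links(wiki_root: Path) -> dict[str, list[str]]:
    """Report links to missing pages and pages that nothing links to."""
    # Assumption: a link target is the file name without its extension.
    pages = {p.stem: p for p in wiki_root.rglob("*.md")}
    linked = set()
    broken = []
    for name, path in pages.items():
        for target in WIKI_LINK.findall(path.read_text(encoding="utf-8")):
            if target in pages:
                linked.add(target)
            else:
                broken.append(f"{name} -> {target}")
    # The entry-point pages are never expected to have inbound links.
    exempt = {"index", "schema"}
    orphans = sorted(set(pages) - linked - exempt)
    return {"broken_links": sorted(broken), "orphans": orphans}


# Build a tiny throwaway wiki to exercise the check.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "concepts").mkdir()
    (root / "index.md").write_text("[[mcp_protocol]]", encoding="utf-8")
    (root / "concepts" / "mcp_protocol.md").write_text(
        "See [[json_rpc]].", encoding="utf-8")
    (root / "concepts" / "vector_databases.md").write_text(
        "No links here.", encoding="utf-8")
    report = lint_links(root)

print(report)
```

A report like this could be handed to the LLM as part of a lint prompt, so that the model spends its effort on semantic problems rather than on bookkeeping.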

SECTION 2.4: WEB SEARCH AS MEMORY INGESTION

The web search tool is the primary way the agent acquires new knowledge. When the agent searches the web, it does not just use the search result to answer the user's question and then forget it. It ingests the search result into the wiki, so that future queries can benefit from what was learned.

This is the key insight that makes the web search tool so powerful in our architecture. Every search is not just a retrieval operation; it is a learning operation. The agent gets smarter with every search it performs.

The web search tool is implemented as an MCP tool that wraps a search API (such as SerpAPI, Brave Search, or Tavily) and automatically ingests the results into the wiki:

    import asyncio
    import os
    from typing import Optional

    import httpx
    from mcp.server.fastmcp import FastMCP
    from wiki_memory import WikiMemoryManager

    mcp = FastMCP("WebSearchTool")

    # The wiki manager is shared across all tool invocations.
    # In production, this would be injected via dependency injection.
    wiki_manager: Optional[WikiMemoryManager] = None


    @mcp.tool()
    async def web_search_and_learn(
        query: str,
        max_results: int = 5,
        ingest_into_wiki: bool = True
    ) -> str:
        """
        Search the web for information and optionally learn from results.

        This tool searches the web using the configured search API,
        formats the results for the LLM, and optionally ingests the
        results into the agent's long-term wiki memory so that future
        queries can benefit from this knowledge.

        Args:
            query: The search query. Be specific for better results.
            max_results: Number of results to fetch (1-10, default 5).
            ingest_into_wiki: If True, results are ingested into the wiki
                for long-term retention (default True).

        Returns:
            Formatted search results with titles, URLs, and snippets.
        """
        # Clamp max_results to a safe range.
        max_results = max(1, min(10, max_results))

        # Perform the actual web search via the search API.
        raw_results = await _call_search_api(query, max_results)

        # Format the results for immediate LLM consumption.
        formatted = _format_search_results(raw_results)

        # Ingest into the wiki for long-term memory if requested.
        if ingest_into_wiki and wiki_manager is not None:
            # WikiMemoryManager.ingest is synchronous, so run it in a
            # worker thread to keep the event loop free. The tool still
            # waits for ingestion to finish before it returns.
            # In production, this would use a proper task queue.
            await asyncio.to_thread(
                wiki_manager.ingest,
                raw_content=formatted,
                source_label=f"web_search:{query}"
            )

        return formatted


    async def _call_search_api(query: str, max_results: int) -> list[dict]:
        """
        Call the configured search API and return raw results.

        This function is intentionally separated from the tool definition
        so that the search backend can be swapped without changing the
        tool interface.
        """
        # Using Tavily as an example search API.
        async with httpx.AsyncClient() as client:
            response = await client.post(
                "https://api.tavily.com/search",
                json={
                    "api_key": os.environ["TAVILY_API_KEY"],
                    "query": query,
                    "max_results": max_results,
                    "search_depth": "basic"
                }
            )
            response.raise_for_status()
            return response.json().get("results", [])


    def _format_search_results(results: list[dict]) -> str:
        """Format raw search results into a readable string for the LLM."""
        if not results:
            return "No results found for this query."

        lines = []
        for i, result in enumerate(results, 1):
            lines.append(f"Result {i}: {result.get('title', 'No title')}")
            lines.append(f"URL: {result.get('url', 'No URL')}")
            lines.append(f"Snippet: {result.get('content', 'No content')}")
            lines.append("")  # Empty line between results.
        return "\n".join(lines)

The dual purpose of this tool, serving the immediate query while also building long-term memory, is what makes the self-evolving agent's knowledge compound over time. The agent does not just answer questions; it learns from every interaction.
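The listing's comment mentions that production code would hand ingestion to a proper task queue. One way to approximate that inside a single process is an asyncio.Queue drained by a background worker, so the tool can return as soon as the search results are formatted. The sketch below is an assumption-laden illustration: ingestion_worker and demo are hypothetical names, fake_ingest stands in for the synchronous WikiMemoryManager.ingest method, and a real deployment might prefer an external queue.

```python
import asyncio


async def ingestion_worker(queue: asyncio.Queue, ingest) -> None:
    """Drain the queue forever, running each ingest in a worker thread."""
    while True:
        raw_content, source_label = await queue.get()
        try:
            # ingest is a blocking function, so keep it off the event loop.
            await asyncio.to_thread(ingest, raw_content, source_label)
        except Exception as exc:
            # One bad document must not stop later ingestions.
            print(f"ingestion failed for {source_label}: {exc}")
        finally:
            queue.task_done()


async def demo() -> list:
    ingested = []

    def fake_ingest(raw_content: str, source_label: str) -> None:
        # Stand-in for WikiMemoryManager.ingest (assumption for the demo).
        ingested.append(source_label)

    queue: asyncio.Queue = asyncio.Queue()
    worker = asyncio.create_task(ingestion_worker(queue, fake_ingest))

    # A tool would enqueue and return immediately instead of awaiting ingest.
    queue.put_nowait(("Result 1: sample text", "web_search:mcp"))
    queue.put_nowait(("Result 1: sample text", "web_search:asyncio"))

    # Wait until the worker has processed everything, then shut it down.
    await queue.join()
    worker.cancel()
    await asyncio.gather(worker, return_exceptions=True)
    return ingested


ingested_labels = asyncio.run(demo())
print(ingested_labels)
```

With a single worker the documents are ingested in the order they were queued, which also avoids two ingestions rewriting the same wiki page at the same time.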

CHAPTER THREE: THE REFLECTION ENGINE - THE AGENT THINKS ABOUT ITSELF

SECTION 3.1: THE PHILOSOPHY OF REFLECTION

Every thirty minutes, the agent stops what it is doing and thinks. Not about the user's next question, but about itself. It asks: What have I been doing? What have I struggled with? What tools did I wish I had? What knowledge was I missing? What would make me more useful? This reflection process is not mystical. It is a structured, systematic analysis of the agent's recent history, implemented as a carefully designed prompt that asks the LLM to reason about its own performance and capabilities. The output of reflection is not just insights; it is an action plan. The agent does not just identify gaps; it fills them. The philosophical underpinning here is metacognition, which is thinking about thinking. Humans who are good at metacognition tend to be better learners because they can identify their own knowledge gaps and address them deliberately. We are giving the agent this same capability. The reflection loop runs on a separate thread, completely independently of the main agent loop. It does not interrupt conversations. It runs quietly in the background, every thirty minutes, improving the agent while the agent continues to serve users.

SECTION 3.2: THE REFLECTION LOOP IMPLEMENTATION

The ReflectionEngine class manages the thirty-minute cycle. It reads recent session records from the wiki, analyses them, identifies capability gaps, and triggers the code generation pipeline to fill those gaps.

    import json
    import logging
    import threading
    from datetime import datetime, timedelta
    from typing import Callable

    from wiki_memory import WikiMemoryManager

    logger = logging.getLogger(__name__)


    class ReflectionEngine:
        """
        The self-improvement engine of the evolving agent.

        This class runs a background thread that periodically reflects
        on the agent's recent history, identifies capability gaps, and
        triggers the generation of new tools and capabilities.

        The reflection cycle runs every REFLECTION_INTERVAL_SECONDS
        (default: 1800 seconds = 30 minutes). During each cycle, the
        engine performs the following steps:

        1. Read recent session records from the wiki.
        2. Analyse patterns in user requests and tool failures.
        3. Identify missing capabilities that would have been useful.
        4. Generate specifications for new tools or adapters.
        5. Trigger the CapabilityGenerator to implement and integrate them.
        6. Record the reflection and its outcomes in the wiki.
        """

        REFLECTION_INTERVAL_SECONDS = 1800  # 30 minutes

        def __init__(
            self,
            wiki_manager: WikiMemoryManager,
            llm_client,
            capability_generator,
            on_new_capability: Callable
        ):
            """
            Initialize the reflection engine.

            Args:
                wiki_manager: The wiki memory manager for reading history
                    and recording reflection outcomes.
                llm_client: The LLM client for performing reflection.
                capability_generator: The component that generates and
                    integrates new capabilities.
                on_new_capability: Callback invoked when a new capability
                    has been successfully integrated.
            """
            self.wiki = wiki_manager
            self.llm = llm_client
            self.generator = capability_generator
            self.on_new_capability = on_new_capability
            self._stop_event = threading.Event()
            self._thread = None
            self._reflection_count = 0

        def start(self):
            """Start the background reflection thread."""
            if self._thread is not None and self._thread.is_alive():
                logger.warning("Reflection engine is already running.")
                return

            self._stop_event.clear()
            self._thread = threading.Thread(
                target=self._reflection_loop,
                name="ReflectionEngine",
                daemon=True  # Dies when the main process dies.
            )
            self._thread.start()
            logger.info("Reflection engine started. Cycle: %d seconds.",
                        self.REFLECTION_INTERVAL_SECONDS)

        def stop(self):
            """Stop the background reflection thread gracefully."""
            self._stop_event.set()
            if self._thread is not None:
                self._thread.join(timeout=30)
            logger.info("Reflection engine stopped.")

        def _reflection_loop(self):
            """
            The main loop of the reflection engine.

            This runs in a background thread and wakes up every
            REFLECTION_INTERVAL_SECONDS to perform a reflection cycle.
            """
            # Wait for the first interval before reflecting, so the agent
            # has some history to reflect on.
            self._stop_event.wait(timeout=self.REFLECTION_INTERVAL_SECONDS)

            while not self._stop_event.is_set():
                try:
                    self._perform_reflection_cycle()
                except Exception as e:
                    # Never let an exception kill the reflection thread.
                    logger.error("Reflection cycle failed: %s", e,
                                 exc_info=True)

                # Wait for the next cycle.
                self._stop_event.wait(
                    timeout=self.REFLECTION_INTERVAL_SECONDS
                )

        def _perform_reflection_cycle(self):
            """
            Execute a single reflection cycle.

            This is the core of the self-improvement process. It reads
            recent history, analyses it, and generates new capabilities.
            """
            self._reflection_count += 1
            cycle_id = self._reflection_count
            logger.info("Starting reflection cycle #%d", cycle_id)

            # Step 1: Gather recent session data from the wiki.
            recent_sessions = self._gather_recent_sessions(hours=1)
            if not recent_sessions:
                logger.info("No recent sessions to reflect on. Skipping.")
                return

            # Step 2: Gather the current capability inventory.
            current_capabilities = self._gather_current_capabilities()

            # Step 3: Ask the LLM to reflect and identify gaps.
            reflection_result = self._perform_llm_reflection(
                recent_sessions, current_capabilities
            )

            # Step 4: For each identified gap, generate a new capability.
            new_capabilities = []
            for gap in reflection_result.get("capability_gaps", []):
                try:
                    capability = self.generator.generate(gap)
                    if capability is not None:
                        new_capabilities.append(capability)
                        # Notify the main agent loop about the new tool.
                        self.on_new_capability(capability)
                        logger.info("New capability integrated: %s",
                                    capability.name)
                except Exception as e:
                    logger.error("Failed to generate capability for gap "
                                 "'%s': %s", gap.get("name", "?"), e)

            # Step 5: Record the reflection and its outcomes in the wiki.
            self._record_reflection(
                cycle_id, reflection_result, new_capabilities
            )
            logger.info("Reflection cycle #%d complete. "
                        "New capabilities: %d", cycle_id,
                        len(new_capabilities))

        def _gather_recent_sessions(self, hours: int) -> list[str]:
            """
            Gather session records from the last N hours.

            Returns a list of session page contents as strings.
            """
            cutoff = datetime.now() - timedelta(hours=hours)
            sessions_dir = self.wiki.wiki_root / "sessions"
            recent = []

            for session_file in sorted(sessions_dir.glob("*.md")):
                # Parse the timestamp from the filename.
                try:
                    # Filename format: session_YYYY-MM-DD_HH-MM-SS.md
                    name = session_file.stem.replace("session_", "")
                    session_time = datetime.strptime(
                        name, "%Y-%m-%d_%H-%M-%S"
                    )
                    if session_time >= cutoff:
                        recent.append(session_file.read_text())
                except (ValueError, OSError):
                    continue  # Skip files with unexpected names.

            return recent

        def _gather_current_capabilities(self) -> list[str]:
            """
            Gather the names and descriptions of all current capabilities.
            """
            caps_dir = self.wiki.wiki_root / "capabilities"
            capabilities = []
            for cap_file in caps_dir.glob("*.md"):
                # Read just the first few lines for a quick summary.
                content = cap_file.read_text()
                summary = "\n".join(content.splitlines()[:10])
                capabilities.append(summary)
            return capabilities

        def _perform_llm_reflection(
            self,
            recent_sessions: list[str],
            current_capabilities: list[str]
        ) -> dict:
            """
            Ask the LLM to reflect on recent history and identify gaps.

            This is the most important method in the reflection engine.
            The quality of the reflection prompt determines the quality
            of the self-improvement.

            Returns:
                A dictionary with keys:
                - "observations": What the LLM noticed about recent
                  sessions.
                - "capability_gaps": A list of gap descriptions, each with
                  keys "name", "description", "priority", "type".
                - "wiki_updates": Suggested updates to wiki pages.
            """
            sessions_text = "\n\n---\n\n".join(recent_sessions)
            caps_text = "\n\n---\n\n".join(current_capabilities)

            reflection_prompt = f"""
    You are the metacognitive reflection module of a self-evolving AI agent.
    Your job is to analyse recent interaction sessions and identify what new
    capabilities the agent needs to become more useful.

    RECENT SESSIONS (last 1 hour):
    {sessions_text}

    CURRENT CAPABILITIES:
    {caps_text}

    Please reflect deeply and systematically. Consider:
    1. What types of requests did users make that the agent struggled with?
    2. What tools did the agent wish it had but did not?
    3. What external systems or APIs would have been useful to integrate?
    4. What knowledge was missing from the wiki?
    5. What repetitive tasks could be automated with a new tool?
    6. What new sub-agents could be spawned to handle specialized domains?

    Respond with a JSON object with this exact structure:
    {{
      "observations": "Your narrative observations about recent sessions.",
      "capability_gaps": [
        {{
          "name": "short_snake_case_name",
          "description": "What this capability does and why it is needed.",
          "priority": "high|medium|low",
          "type": "tool|adapter|agent",
          "implementation_hints": "Specific technical hints for implementation."
        }}
      ],
      "wiki_updates": [
        {{
          "page": "relative/path/to/page.md",
          "reason": "Why this page needs updating."
        }}
      ]
    }}

    Be specific, practical, and honest. Only suggest capabilities that are
    technically feasible and genuinely useful based on the evidence in the
    session records.
    """
            response_text = self.llm.complete(reflection_prompt)

            # Parse the JSON response. If parsing fails, return a safe
            # default.
            try:
                # Extract JSON from the response (it might have
                # surrounding text).
                json_start = response_text.find("{")
                json_end = response_text.rfind("}") + 1
                json_str = response_text[json_start:json_end]
                return json.loads(json_str)
            except (json.JSONDecodeError, ValueError) as e:
                logger.warning("Failed to parse reflection response: %s", e)
                return {"observations": response_text,
                        "capability_gaps": [],
                        "wiki_updates": []}

        def _record_reflection(
            self,
            cycle_id: int,
            reflection_result: dict,
            new_capabilities: list
        ):
            """Record the reflection cycle and its outcomes in the wiki."""
            timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
            page_path = (
                self.wiki.wiki_root / "sessions" /
                f"reflection_{timestamp}.md"
            )

            content = f"""# Reflection Cycle #{cycle_id}

    ## Timestamp
    {timestamp}

    ## Observations
    {reflection_result.get("observations", "No observations recorded.")}

    ## Capability Gaps Identified
    """
            for gap in reflection_result.get("capability_gaps", []):
                content += (
                    f"\n### {gap.get('name', 'unnamed')}\n"
                    f"- **Type**: {gap.get('type', 'unknown')}\n"
                    f"- **Priority**: {gap.get('priority', 'medium')}\n"
                    f"- **Description**: {gap.get('description', '')}\n"
                )

            content += "\n## New Capabilities 
Generated\n" for cap in new_capabilities: content += f"\n- {cap.name}: {cap.description}\n" page_path.write_text(content)
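The reflection engine extracts JSON by slicing from the first "{" to the last "}" in the LLM reply, which fails whenever the model adds trailing text that itself contains a brace. The sketch below shows one possible alternative built on the standard library's json.JSONDecoder.raw_decode. The helper name extract_first_json_object and the sample reply are illustrative assumptions, not part of the engine above.

```python
import json
from typing import Optional


def extract_first_json_object(text: str) -> Optional[dict]:
    """Return the first decodable JSON object embedded in text, or None.

    Hypothetical helper: it tries every "{" in turn and lets the JSON
    decoder decide where the object ends, instead of guessing with rfind.
    """
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            value, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            value = None
        if isinstance(value, dict):
            return value
        start = text.find("{", start + 1)
    return None


# A reply with chatter before and after the JSON, including a stray brace.
reply = 'Sure! {"capability_gaps": [], "observations": "ok"} Hope that helps {smile}.'
print(extract_first_json_object(reply))
```

A first-to-last-brace slice of the same reply would include the trailing "{smile}" and fail to parse, so the engine would fall back to an empty gap list even though the model answered correctly.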

SECTION 3.3: WHAT THE AGENT REFLECTS ON

The reflection prompt is the most important piece of engineering in the entire system. It determines what the agent notices, what it prioritizes, and what it decides to build. Let us walk through what a real reflection might look like. Imagine the agent has had three sessions in the past hour. In the first session, a user asked it to send an email, and the agent had to tell the user it could not do that. In the second session, a user asked for a summary of a PDF document, and the agent had to ask the user to paste the text manually because it had no PDF reading tool. In the third session, a user asked about the current weather, and the agent had to use a generic web search instead of a dedicated weather API. When the reflection engine reads these three sessions, it identifies three clear capability gaps. The first gap is an email sending tool. The second gap is a PDF reading tool. The third gap is a weather API adapter. It assigns priorities based on how frequently each gap appeared and how severely it impacted the user experience. Email sending gets high priority because the user was completely blocked. PDF reading also gets high priority because the only workaround pushed the work back onto the user. Weather gets medium priority because the agent found a workaround on its own. The reflection engine then passes these three gap descriptions to the CapabilityGenerator, which is the subject of the next chapter.
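In the system described here the LLM itself assigns each priority, but the reasoning in the example (frequency plus severity) can be made concrete with a small deterministic heuristic. The following is a minimal sketch under stated assumptions: the GapEvidence class, the severity labels, and the scoring thresholds are all hypothetical and only mirror the three sessions above.

```python
from dataclasses import dataclass

# Hypothetical severity scale: how badly the gap hurt the user.
SEVERITY_WEIGHT = {"blocked": 3, "degraded": 2, "minor": 1}


@dataclass
class GapEvidence:
    name: str
    severity: str      # one of the SEVERITY_WEIGHT keys
    occurrences: int   # number of sessions in which the gap appeared


def assign_priority(gap: GapEvidence) -> str:
    """Combine severity and frequency into high/medium/low."""
    # Repeated occurrences raise the score, capped so that frequency
    # alone cannot dominate severity.
    score = SEVERITY_WEIGHT[gap.severity] + min(gap.occurrences, 3) - 1
    if score >= 3:
        return "high"
    if score == 2:
        return "medium"
    return "low"


gaps = [
    GapEvidence("email_sender", "blocked", 1),
    GapEvidence("pdf_reader", "blocked", 1),
    GapEvidence("weather_adapter", "degraded", 1),
]
for gap in gaps:
    print(gap.name, assign_priority(gap))
```

A rule like this could also serve as a sanity check on the priorities the LLM returns, flagging cases where the model's label disagrees with the session evidence.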

CHAPTER FOUR: THE CAPABILITY GENERATOR - THE AGENT BUILDS ITS OWN TOOLS

SECTION 4.1: CODE GENERATION AS SELF-EXTENSION

The CapabilityGenerator is the component that turns reflection insights into running code. It takes a gap description and produces a working MCP tool, adapter, or sub-agent that is immediately integrated into the running system. This is the most technically challenging part of the self-evolving agent. Generating code that works correctly, is safe to execute, and integrates cleanly with the existing system requires careful engineering. We need to think about code generation, code validation, sandboxed execution, and dynamic registration. The process has five stages. In the specification stage, the LLM takes the gap description and produces a detailed technical specification for the new capability, including its interface, its dependencies, and its implementation approach. In the generation stage, the LLM writes the actual code based on the specification. In the validation stage, the generated code is checked for syntax errors, security issues, and compliance with the MCP interface. In the testing stage, the code is executed in a sandboxed environment to verify that it works correctly. In the integration stage, the validated code is registered with the MCP server and made available to the agent. Each stage is a checkpoint. If a stage fails, the process stops and the failure is recorded in the wiki. The agent does not integrate broken code. Safety and correctness come before capability expansion.
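The checkpoint behaviour described above (run the stages in order, stop at the first failure, report which stage failed) can be sketched independently of any LLM. In this minimal example the names run_pipeline, build_stages, and validate are illustrative assumptions, and the stages are toy stand-ins in which only validation does real work.

```python
import ast
from typing import Any, Callable, Optional

# A stage is a (name, function) pair; the function returns the payload
# for the next stage, or None to signal failure.
Stage = tuple[str, Callable[[Any], Optional[Any]]]


def run_pipeline(stages: list[Stage], payload: Any) -> dict:
    """Run stages in order and stop at the first one that fails.

    A stage fails by raising an exception or by returning None.
    """
    for name, stage_fn in stages:
        try:
            payload = stage_fn(payload)
        except Exception as exc:
            return {"ok": False, "failed_stage": name, "reason": str(exc)}
        if payload is None:
            return {"ok": False, "failed_stage": name,
                    "reason": "stage returned None"}
    return {"ok": True, "result": payload}


def validate(code: str) -> str:
    ast.parse(code)  # Raises SyntaxError on invalid Python.
    return code


def build_stages(generated_code: str) -> list[Stage]:
    """Toy stand-ins for the five stages of the pipeline."""
    return [
        ("specification", lambda gap: {"tool_name": gap["name"]}),
        ("generation", lambda spec: generated_code),
        ("validation", validate),
        ("testing", lambda code: code),
        ("integration", lambda code: f"registered {len(code)} bytes"),
    ]


good = run_pipeline(build_stages("def pdf_reader():\n    return 'ok'\n"),
                    {"name": "pdf_reader"})
bad = run_pipeline(build_stages("def pdf_reader(:\n    pass\n"),
                   {"name": "pdf_reader"})
print(good["ok"], bad["ok"], bad["failed_stage"])
```

Returning a failure record instead of raising keeps the caller simple: the record can be written to the wiki as-is, and nothing downstream of the failed stage ever runs.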

SECTION 4.2: THE CAPABILITY GENERATOR IMPLEMENTATION

import ast
import importlib.util
import json
import logging
import os
import subprocess
import sys
import tempfile
from dataclasses import dataclass
from datetime import datetime
from pathlib import Path
from typing import Optional

logger = logging.getLogger(__name__)


@dataclass
class GeneratedCapability:
    """
    Represents a newly generated capability (tool, adapter, or agent).

    This dataclass holds all the information about a generated capability,
    including its source code, its MCP registration information, and
    its test results.
    """
    name: str
    description: str
    capability_type: str  # "tool", "adapter", or "agent"
    source_code: str
    module_path: str
    test_passed: bool
    test_output: str


class CapabilityGenerator:
    """
    Generates new capabilities from gap descriptions.

    This class implements the five-stage pipeline for turning a
    capability gap description into a working, integrated MCP tool,
    adapter, or sub-agent.

    The pipeline is designed to be safe: code is always validated and
    tested before integration. Failed generations are recorded but
    never integrated.
    """

    # Directory where generated capabilities are stored.
    CAPABILITIES_DIR = "generated_capabilities"

    def __init__(self, llm_client, mcp_server, wiki_manager):
        """
        Initialize the capability generator.

        Args:
            llm_client: LLM client for code generation.
            mcp_server: The running MCP server for tool registration.
            wiki_manager: For recording generated capabilities.
        """
        self.llm = llm_client
        self.mcp_server = mcp_server
        self.wiki = wiki_manager
        Path(self.CAPABILITIES_DIR).mkdir(exist_ok=True)

    def generate(self, gap: dict) -> Optional[GeneratedCapability]:
        """
        Generate a new capability from a gap description.

        This is the main entry point for the five-stage pipeline.

        Args:
            gap: A dictionary with keys "name", "description", "type",
                 "priority", "implementation_hints".

        Returns:
            A GeneratedCapability if all stages pass, or None if any
            stage fails.
        """
        gap_name = gap.get("name", "unnamed_capability")
        logger.info("Generating capability: %s", gap_name)

        # Stage 1: Generate a detailed technical specification.
        spec = self._generate_specification(gap)
        if spec is None:
            logger.warning("Spec generation failed for: %s", gap_name)
            return None

        # Stage 2: Generate the source code from the specification.
        source_code = self._generate_code(spec, gap)
        if source_code is None:
            logger.warning("Code generation failed for: %s", gap_name)
            return None

        # Stage 3: Validate the generated code for safety and correctness.
        validation_result = self._validate_code(source_code, gap_name)
        if not validation_result["passed"]:
            logger.warning("Validation failed for %s: %s",
                           gap_name, validation_result["reason"])
            # Try to fix the code once before giving up.
            source_code = self._attempt_fix(
                source_code, validation_result["reason"], gap
            )
            if source_code is None:
                return None
            # Re-validate the fixed code.
            validation_result = self._validate_code(source_code, gap_name)
            if not validation_result["passed"]:
                return None

        # Stage 4: Test the code in a separate process.
        test_result = self._test_code(source_code, gap)
        if not test_result["passed"]:
            # A failed test is a failed stage: do not integrate the code.
            logger.warning("Test failed for %s: %s",
                           gap_name, test_result["output"])
            return None

        # Stage 5: Integrate the capability into the running system.
        capability = self._integrate_capability(
            gap, source_code, test_result
        )

        # Record the new capability in the wiki.
        if capability is not None:
            self._record_capability_in_wiki(capability)

        return capability

    def _generate_specification(self, gap: dict) -> Optional[dict]:
        """
        Generate a detailed technical specification for the capability.

        The specification includes the tool's interface, dependencies,
        implementation approach, and test cases.
        """
        prompt = f"""
        You are a senior software architect designing a new MCP tool for
        a self-evolving AI agent. Generate a detailed technical
        specification for the following capability gap:

        Gap Name: {gap.get("name")}
        Gap Description: {gap.get("description")}
        Gap Type: {gap.get("type")} (tool, adapter, or agent)
        Implementation Hints: {gap.get("implementation_hints", "None provided.")}

        Produce a JSON specification with this structure:
        {{
            "tool_name": "snake_case_name",
            "tool_description": "Clear description for the LLM to understand.",
            "parameters": [
                {{
                    "name": "param_name",
                    "type": "str|int|float|bool|list|dict",
                    "description": "What this parameter does.",
                    "required": true
                }}
            ],
            "returns": "Description of what the tool returns.",
            "dependencies": ["list", "of", "pip", "packages"],
            "implementation_approach": "Step-by-step implementation plan.",
            "test_cases": [
                {{
                    "description": "What this test verifies.",
                    "input": {{"param": "value"}},
                    "expected_behavior": "What should happen."
                }}
            ]
        }}
        """
        response = self.llm.complete(prompt)
        try:
            json_str = self._extract_json(response)
            return json.loads(json_str)
        except Exception:
            return None

    def _generate_code(
        self, spec: dict, gap: dict
    ) -> Optional[str]:
        """
        Generate Python source code for the capability.

        The generated code must be a valid MCP tool definition that
        follows the project's coding standards.
        """
        prompt = f"""
        You are an expert Python developer implementing an MCP tool for
        a self-evolving AI agent. Write clean, well-documented Python code
        for the following specification:

        SPECIFICATION:
        {json.dumps(spec, indent=2)}

        REQUIREMENTS:
        1. Use the FastMCP framework: from mcp.server.fastmcp import FastMCP
        2. Define the tool using the @mcp.tool() decorator.
        3. Include comprehensive docstrings (they become the tool description).
        4. Use type hints for all parameters and return values.
        5. Handle errors gracefully and return informative error messages.
        6. Do NOT include the mcp.run() call (the server is managed externally).
        7. Use only the dependencies listed in the specification.
        8. Follow PEP 8 style guidelines.
        9. The function must be named exactly: {spec.get("tool_name")}

        Return ONLY the Python code, with no surrounding text or markdown.
        """
        code = self._strip_code_fences(self.llm.complete(prompt))
        return code if code else None

    @staticmethod
    def _strip_code_fences(text: str) -> str:
        """Strip a surrounding markdown code fence if the LLM added one."""
        text = text.strip()
        if text.startswith("```"):
            lines = text.split("\n")[1:]
            # Only drop the last line if it really is a closing fence.
            if lines and lines[-1].strip().startswith("```"):
                lines = lines[:-1]
            text = "\n".join(lines)
        return text

    def _validate_code(
        self, source_code: str, name: str
    ) -> dict:
        """
        Validate generated code for syntax errors and safety issues.

        This method performs static analysis to catch obvious problems
        before we try to execute the code.

        Returns:
            A dict with "passed" (bool) and "reason" (str) keys.
        """
        # Check 1: Python syntax validation using the AST parser.
        try:
            tree = ast.parse(source_code)
        except SyntaxError as e:
            return {
                "passed": False,
                "reason": f"Syntax error: {e}"
            }

        # Check 2: Security scan for dangerous operations.
        # This substring scan is a coarse first filter, not a sandbox:
        # it can be evaded (for example by aliasing an import) and it
        # can reject harmless code that merely mentions a pattern.
        dangerous_patterns = [
            "os.system",
            "subprocess.call",
            "__import__",
            "eval(",
            "exec(",
            "open(",  # File access must be explicit and controlled.
            "shutil.rmtree",
            "os.remove",
        ]
        for pattern in dangerous_patterns:
            if pattern in source_code:
                return {
                    "passed": False,
                    "reason": f"Dangerous pattern detected: {pattern}"
                }

        # Check 3: Verify the @mcp.tool() decorator is present.
        if "@mcp.tool()" not in source_code:
            return {
                "passed": False,
                "reason": "Missing @mcp.tool() decorator."
            }

        # Check 4: Verify that at least one function is defined.
        # (The function name is not compared against the specification
        # here, because only the gap name is passed to this method.)
        func_names = [
            node.name for node in ast.walk(tree)
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        ]
        if not func_names:
            return {
                "passed": False,
                "reason": "No function definition found in generated code."
            }

        return {"passed": True, "reason": "All checks passed."}

    def _attempt_fix(
        self, broken_code: str, error: str, gap: dict
    ) -> Optional[str]:
        """
        Ask the LLM to fix broken generated code.

        This gives the generator one chance to self-correct before
        the generation attempt is abandoned.
        """
        prompt = f"""
        The following Python code for an MCP tool has a problem:

        PROBLEM: {error}

        BROKEN CODE:
        {broken_code}

        Please fix the code to resolve the problem. Return ONLY the fixed
        Python code, with no surrounding text or markdown.
        """
        fixed = self._strip_code_fences(self.llm.complete(prompt))
        return fixed if fixed else None

    def _test_code(
        self, source_code: str, gap: dict
    ) -> dict:
        """
        Test the generated code in a separate subprocess.

        We run the code in a separate Python process with a timeout to
        prevent infinite loops or hanging operations. The test verifies
        that the code can be imported without errors. Note that a
        subprocess isolates the agent from crashes and hangs, but it
        does not by itself restrict file or network access.

        Returns:
            A dict with "passed" (bool) and "output" (str) keys.
        """
        # Write the code to a temporary file.
        with tempfile.NamedTemporaryFile(
            mode="w", suffix=".py", delete=False
        ) as tmp:
            tmp.write(source_code)
            tmp_path = tmp.name

        try:
            # Run the code in a subprocess with a 30-second timeout.
            # We use a simple import test: if the module imports
            # without error, the basic structure is correct.
            # The path is embedded with !r so that backslashes and
            # quotes in the path are escaped correctly.
            test_script = f"""
import sys
sys.path.insert(0, '.')
try:
    import importlib.util
    spec = importlib.util.spec_from_file_location("test_module", {tmp_path!r})
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    print("IMPORT_SUCCESS")
except Exception as e:
    print(f"IMPORT_FAILED: {{e}}")
"""
            result = subprocess.run(
                [sys.executable, "-c", test_script],
                capture_output=True,
                text=True,
                timeout=30
            )
            output = result.stdout + result.stderr
            passed = "IMPORT_SUCCESS" in output
            return {"passed": passed, "output": output}

        except subprocess.TimeoutExpired:
            return {"passed": False, "output": "Test timed out after 30s."}
        finally:
            # Always clean up the temporary file.
            os.unlink(tmp_path)

    def _integrate_capability(
        self,
        gap: dict,
        source_code: str,
        test_result: dict
    ) -> Optional[GeneratedCapability]:
        """
        Integrate a validated capability into the running MCP server.

        This method saves the generated code to the capabilities
        directory and dynamically registers it with the MCP server.
        """
        name = gap.get("name", "unnamed")
        cap_path = Path(self.CAPABILITIES_DIR) / f"{name}.py"

        # Save the source code to the capabilities directory.
        cap_path.write_text(source_code, encoding="utf-8")

        # Dynamically load the module and register its tools.
        try:
            spec = importlib.util.spec_from_file_location(
                f"capability_{name}", str(cap_path)
            )
            module = importlib.util.module_from_spec(spec)
            spec.loader.exec_module(module)

            # Register any new tools found in the module with the
            # MCP server. The MCP server will make them available
            # on the next tools/list call.
            self.mcp_server.register_module(module)

            return GeneratedCapability(
                name=name,
                description=gap.get("description", ""),
                capability_type=gap.get("type", "tool"),
                source_code=source_code,
                module_path=str(cap_path),
                test_passed=test_result["passed"],
                test_output=test_result["output"]
            )
        except Exception as e:
            logger.error("Failed to integrate capability %s: %s", name, e)
            return None

    def _record_capability_in_wiki(self, capability: GeneratedCapability):
        """Record a newly generated capability in the wiki."""
        page_path = (
            self.wiki.wiki_root / "capabilities" /
            f"{capability.name}.md"
        )
        page_path.parent.mkdir(parents=True, exist_ok=True)
        content = f"""# Capability: {capability.name}

## Type
{capability.capability_type}

## Description
{capability.description}

## Status
{"Operational" if capability.test_passed else "Degraded (test failed)"}

## Source File
{capability.module_path}

## Test Output
{capability.test_output}

## Generated At
{datetime.now().isoformat()}
"""
        page_path.write_text(content, encoding="utf-8")

    def _extract_json(self, text: str) -> str:
        """Extract a JSON object from a text that may contain other content."""
        start = text.find("{")
        end = text.rfind("}") + 1
        if start == -1 or end == 0:
            raise ValueError("No JSON object found in text.")
        return text[start:end]
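The security scan in _validate_code matches raw substrings, so it rejects harmless code that merely mentions "open(" in a docstring or defines a function called reopen(...), and it never checks the function name the generation prompt asked for. One possible refinement, sketched below, inspects call nodes in the syntax tree instead. The helper find_violations and the two banned-name sets are assumptions for illustration. They mirror the patterns listed above and are not a complete security policy.

```python
import ast

# Hypothetical deny-lists mirroring the substring patterns in the article.
BANNED_CALLS = {"eval", "exec", "__import__", "open"}
BANNED_ATTRS = {("os", "system"), ("os", "remove"),
                ("shutil", "rmtree"), ("subprocess", "call")}


def find_violations(source_code: str, expected_function: str) -> list[str]:
    """Return a list of problems found in generated code (empty if none)."""
    tree = ast.parse(source_code)
    problems = []
    for node in ast.walk(tree):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        if isinstance(func, ast.Name) and func.id in BANNED_CALLS:
            problems.append(f"call to {func.id}()")
        elif (isinstance(func, ast.Attribute)
              and isinstance(func.value, ast.Name)
              and (func.value.id, func.attr) in BANNED_ATTRS):
            problems.append(f"call to {func.value.id}.{func.attr}()")
    # Check that the function the specification asked for really exists.
    defined = {n.name for n in ast.walk(tree)
               if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))}
    if expected_function not in defined:
        problems.append(f"missing function {expected_function}")
    return problems


harmless = '''
def reopen_ticket(ticket_id: str) -> str:
    """Mentioning eval( or open( in a docstring is not a call."""
    return f"reopened {ticket_id}"
'''
risky = "import os\ndef cleanup(path):\n    os.remove(path)\n"

print(find_violations(harmless, "reopen_ticket"))
print(find_violations(risky, "send_email"))
```

This is still static analysis and can be evaded, for instance with "from os import remove" or getattr tricks, so it complements rather than replaces running generated code in an isolated process.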

SECTION 4.3: GENERATING DIFFERENT TYPES OF CAPABILITIES

The CapabilityGenerator can produce three types of capabilities. Understanding the differences between them is important for understanding how the agent grows.

A tool is the simplest type of capability. It is a Python function decorated with @mcp.tool() that performs a specific, well-defined operation. Examples include a weather API tool, a PDF reader tool, an email sender tool, or a currency converter tool. Tools are stateless and self-contained. They take inputs, perform an operation, and return outputs.

An adapter is a more complex type of capability that provides a bridge between the agent and an external application or system. An adapter might connect the agent to an industrial middleware instance, a SAP system, a Jira board, or a Confluence wiki. Adapters are more complex than tools because they need to manage authentication, session state, and the specific data models of the external system. An adapter typically exposes multiple tools that share a common connection or authentication context.

An agent is the most complex type of capability. It is a specialized sub-agent that handles a specific domain of tasks. For example, a data analysis agent might be spawned to handle all requests involving data processing, statistical analysis, and visualization. A code review agent might be spawned to handle all requests involving code quality assessment. Sub-agents have their own memory, their own tools, and their own reasoning loops. They communicate with the main agent through a well-defined interface.

The following example shows what a generated email capability might look like after the reflection engine identifies that users frequently ask to send emails. It is packaged under the name EmailAdapter but exposes a single tool:

from mcp.server.fastmcp import FastMCP
import smtplib
from email.mime.text import MIMEText
from email.mime.multipart import MIMEMultipart
import os

# Note: this module creates its own FastMCP instance so that the
# @mcp.tool() decorator is available. A host server could supply a
# shared instance instead when the module is registered.
mcp = FastMCP("EmailAdapter")


@mcp.tool()
def send_email(
    to: str,
    subject: str,
    body: str,
    cc: str = "",
    html: bool = False
) -> str:
    """
    Send an email to one or more recipients.

    This tool sends an email using the SMTP server configured in the
    environment variables SMTP_HOST, SMTP_PORT, SMTP_USER, and
    SMTP_PASSWORD.

    Args:
        to: Recipient email address(es), comma-separated for multiple.
        subject: The email subject line.
        body: The email body text (plain text or HTML).
        cc: Optional CC recipients, comma-separated.
        html: If True, the body is treated as HTML (default: False).

    Returns:
        A confirmation message, or an error description if sending
        failed.
    """
    try:
        # Build the MIME message.
        msg = MIMEMultipart("alternative")
        msg["From"] = os.environ["SMTP_USER"]
        msg["To"] = to
        msg["Subject"] = subject
        if cc:
            msg["Cc"] = cc

        # Attach the body as the appropriate MIME type.
        mime_type = "html" if html else "plain"
        msg.attach(MIMEText(body, mime_type))

        # Connect to the SMTP server and send.
        smtp_host = os.environ.get("SMTP_HOST", "localhost")
        smtp_port = int(os.environ.get("SMTP_PORT", "587"))

        with smtplib.SMTP(smtp_host, smtp_port) as server:
            server.starttls()
            server.login(
                os.environ["SMTP_USER"],
                os.environ["SMTP_PASSWORD"]
            )
            # sendmail() expects one address per list entry, so the
            # comma-separated "to" and "cc" strings are split here.
            recipients = [
                addr.strip()
                for addr in f"{to},{cc}".split(",")
                if addr.strip()
            ]
            server.sendmail(
                os.environ["SMTP_USER"],
                recipients,
                msg.as_string()
            )

        return f"Email sent successfully to {to}. Subject: {subject}"

    except smtplib.SMTPException as e:
        return f"Failed to send email: SMTP error - {e}"
    except KeyError as e:
        return (f"Failed to send email: Missing environment variable "
                f"{e}. Please configure SMTP settings.")
    except Exception as e:
        return f"Failed to send email: Unexpected error - {e}"

This generated tool follows all the principles we established: it has a comprehensive docstring, it handles errors gracefully, it uses environment variables for configuration rather than hardcoded credentials, and it returns informative messages for both success and failure cases.
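The difference between a stateless tool and an adapter, as described in this section, is that an adapter's tools share one connection or authentication context. The sketch below illustrates only that structural idea. IssueTrackerAdapter, its in-memory storage, and the example URL are hypothetical stand-ins, and no real issue-tracker API or MCP registration is involved.

```python
class IssueTrackerAdapter:
    """Hypothetical adapter: several tools share one authenticated context."""

    def __init__(self, base_url: str, api_token: str):
        # Shared context: every tool below uses the same endpoint and token.
        self.base_url = base_url.rstrip("/")
        self._api_token = api_token
        # In-memory stand-in for the remote system, so the sketch runs offline.
        self._issues: dict[str, str] = {}

    def create_issue(self, title: str) -> str:
        """Tool 1: create an issue and return a confirmation message."""
        issue_id = f"ISSUE-{len(self._issues) + 1}"
        self._issues[issue_id] = title
        return f"Created {issue_id} at {self.base_url}"

    def get_issue(self, issue_id: str) -> str:
        """Tool 2: look up an issue created through the same context."""
        title = self._issues.get(issue_id)
        if title is None:
            return f"Error: {issue_id} not found"
        return title

    def tools(self) -> dict:
        """Expose the adapter's tools by name, all bound to this instance."""
        return {"create_issue": self.create_issue,
                "get_issue": self.get_issue}


adapter = IssueTrackerAdapter("https://tracker.example/", "secret-token")
tools = adapter.tools()
print(tools["create_issue"]("Fix login timeout"))
print(tools["get_issue"]("ISSUE-1"))
```

Because the tools are bound methods of one object, credentials and session state live in a single place, which is what separates an adapter from a loose collection of unrelated tools.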

CHAPTER FIVE: THE COMPLETE SYSTEM - WIRING EVERYTHING TOGETHER

SECTION 5.1: THE AGENT ORCHESTRATOR

Now that we have all the individual components, we need to wire them together into a coherent system. The SelfEvolvingAgent class is the top-level orchestrator that manages all the components and presents a unified interface to the outside world.

import asyncio
import logging
from datetime import datetime
from pathlib import Path

logger = logging.getLogger(__name__)


class SelfEvolvingAgent:
    """
    The top-level orchestrator of the self-evolving agent system.

    This class wires together all four pillars of the architecture:
    1. The LLM Core (via an LLM client)
    2. The Tool Layer (via the MCP server and tool registry)
    3. The Memory System (via the WikiMemoryManager)
    4. The Reflection Engine (via the ReflectionEngine)

    Usage:
        agent = SelfEvolvingAgent(config)
        agent.start()
        response = agent.chat("What is the weather in Munich today?")
        agent.stop()
    """

    def __init__(self, config: dict):
        """
        Initialize the self-evolving agent.

        Args:
            config: Configuration dictionary with keys:
                    - llm_model: The LLM model identifier.
                    - llm_api_key: API key for the LLM provider.
                    - wiki_root: Path to the wiki directory.
                    - capabilities_dir: Path for generated capabilities.
                      (Not consumed here: CapabilityGenerator uses its
                      CAPABILITIES_DIR class attribute.)
        """
        self.config = config
        self._session_data = []
        self._is_running = False

        # Initialize the LLM client.
        self.llm = self._create_llm_client(config)

        # Initialize the wiki memory manager.
        self.wiki = WikiMemoryManager(
            wiki_root=config.get("wiki_root", "./wiki"),
            llm_client=self.llm
        )

        # Initialize the MCP tool registry.
        self.tool_registry = MCPToolRegistry()
        self._register_core_tools()

        # Initialize the capability generator.
        self.generator = CapabilityGenerator(
            llm_client=self.llm,
            mcp_server=self.tool_registry,
            wiki_manager=self.wiki
        )

        # Initialize the reflection engine.
        self.reflection_engine = ReflectionEngine(
            wiki_manager=self.wiki,
            llm_client=self.llm,
            capability_generator=self.generator,
            on_new_capability=self._on_new_capability
        )

        # Build the initial system prompt.
        self.system_prompt = self._build_system_prompt()
        logger.info("SelfEvolvingAgent initialized.")

    def start(self):
        """Start the agent and its background processes."""
        if self._is_running:
            logger.warning("Agent is already running.")
            return
        self._is_running = True
        self.reflection_engine.start()
        logger.info("Agent started. Reflection engine active.")

    def stop(self):
        """Stop the agent and all background processes gracefully."""
        self.reflection_engine.stop()
        # Persist any interactions that have not been flushed yet, so
        # they are not lost when the process exits.
        self._flush_session_data()
        self._is_running = False
        logger.info("Agent stopped.")

    def chat(self, user_message: str) -> str:
        """
        Process a user message and return the agent's response.

        This is the main public interface of the agent. It runs the
        agent loop until the LLM produces a final response.

        Args:
            user_message: The user's natural language message.

        Returns:
            The agent's response as a string.
        """
        # Start tracking this interaction for session recording.
        interaction_start = datetime.now()
        # Nothing in this method appends to tools_used, so it is
        # recorded as an empty list unless the agent loop is extended
        # to report which tools it called.
        tools_used = []

        context = AgentContext(
            messages=[],
            tool_registry=self.tool_registry,
            system_prompt=self.system_prompt
        )

        # First, check the wiki for relevant context.
        wiki_context = self.wiki.query(user_message)
        if wiki_context:
            # Inject wiki context as a system note.
            context.messages.append({
                "role": "system",
                "content": (
                    f"[Wiki Memory Context]\n{wiki_context}\n"
                    f"[End Wiki Context]"
                )
            })

        # Run the main agent loop.
        response = agent_loop(user_message, context)

        # Record this interaction in the session data.
        self._session_data.append({
            "timestamp": interaction_start.isoformat(),
            "user_message": user_message,
            "response": response,
            "tools_used": tools_used,
            "duration_seconds": (
                datetime.now() - interaction_start
            ).total_seconds()
        })

        # Periodically save session data to the wiki.
        # (The reflection engine will process it on the next cycle.)
        if len(self._session_data) >= 5:
            self._flush_session_data()

        return response

    def _register_core_tools(self):
        """
        Register the core tools that the agent starts with.

        These are the tools the agent has from day one. Additional
        tools are added dynamically by the reflection engine.
        """
        # Register the web search and learn tool.
        self.tool_registry.register_module(
            WebSearchToolModule(wiki_manager=self.wiki)
        )
        # Register the wiki query tool.
        self.tool_registry.register_module(
            WikiQueryToolModule(wiki_manager=self.wiki)
        )
        # Register the code execution tool (sandboxed).
        self.tool_registry.register_module(
            SandboxedCodeExecutionModule()
        )
        logger.info("Core tools registered: web_search, wiki_query, "
                    "code_execution.")

    def _on_new_capability(self, capability: GeneratedCapability):
        """
        Callback invoked when the reflection engine generates a new
        capability. Updates the system prompt to inform the LLM about
        the new tool.

        Note: this runs on the reflection thread. The new prompt takes
        effect for the next chat() call, which reads self.system_prompt
        when it builds its context.
        """
        logger.info("New capability available: %s", capability.name)
        # Rebuild the system prompt to include the new capability.
        self.system_prompt = self._build_system_prompt()

    def _build_system_prompt(self) -> str:
        """
        Build the system prompt for the LLM.

        The system prompt tells the LLM what it is, what it can do,
        and how to behave. It is rebuilt whenever new capabilities
        are added.
        """
        capabilities_summary = self._get_capabilities_summary()
        return f"""You are a self-evolving AI assistant. You have access to a
growing set of tools that expand over time as you learn from interactions.
You also have a persistent wiki memory that you can query and update.

Current capabilities:
{capabilities_summary}

Guidelines:
- Always check your wiki memory before searching the web.
- Use tools proactively when they would improve your answer.
- Be honest about what you do not know.
- When you cannot do something, describe what capability would help.
  (This information is used to improve you in the next reflection cycle.)
"""

    def _get_capabilities_summary(self) -> str:
        """Return a brief summary of all registered capabilities."""
        tools = self.tool_registry.get_tool_schemas()
        lines = []
        for tool in tools:
            lines.append(
                f"- {tool['name']}: {tool['description'][:80]}..."
            )
        return "\n".join(lines) if lines else "Core tools only."

    def _flush_session_data(self):
        """Save accumulated session data to the wiki."""
        if not self._session_data:
            return
        self.wiki.record_session({
            "interactions": self._session_data,
            "session_start": self._session_data[0]["timestamp"],
            "session_end": datetime.now().isoformat(),
            "total_interactions": len(self._session_data)
        })
        self._session_data = []

SECTION 5.2: THE MCP TOOL REGISTRY

The MCPToolRegistry is the component that manages the dynamic registration and discovery of tools. It wraps the MCP server and provides a clean interface for adding tools at runtime.

import logging
import threading
from typing import Callable

logger = logging.getLogger(__name__)


class MCPToolRegistry:
    """
    A dynamic registry for MCP tools.

    This class manages the collection of available tools and supports
    runtime registration of new tools generated by the reflection
    engine. It acts as the bridge between the agent's tool layer and
    the MCP protocol.

    The registry maintains a list of tool schemas (for the LLM to read)
    and a dispatch table (for routing tool calls to their
    implementations).
    """

    def __init__(self):
        """Initialize an empty tool registry."""
        self._tools: dict[str, Callable] = {}
        self._schemas: list[dict] = []
        self._lock = threading.Lock()  # Thread-safe for reflection engine.

    def register_module(self, module):
        """
        Register all MCP tools found in a Python module.

        This method inspects the module for callables that carry a
        _mcp_tool_schema attribute and registers them in the dispatch
        table. The registry relies on that attribute being attached
        when a tool is defined; callables without it are ignored.

        Args:
            module: A Python module object containing tool definitions.
        """
        with self._lock:
            for attr_name in dir(module):
                attr = getattr(module, attr_name)
                if callable(attr) and hasattr(attr, "_mcp_tool_schema"):
                    schema = attr._mcp_tool_schema
                    if schema["name"] in self._tools:
                        # Re-registering a tool replaces its schema
                        # instead of listing it twice.
                        self._schemas = [
                            s for s in self._schemas
                            if s["name"] != schema["name"]
                        ]
                    self._tools[schema["name"]] = attr
                    self._schemas.append(schema)
                    logger.info("Registered tool: %s", schema["name"])

    def execute(self, tool_name: str, arguments: dict) -> str:
        """
        Execute a tool by name with the given arguments.

        Args:
            tool_name: The name of the tool to execute.
            arguments: A dictionary of argument name-value pairs.

        Returns:
            The tool's return value as a string.
        """
        with self._lock:
            tool_fn = self._tools.get(tool_name)
            available = list(self._tools.keys())

        if tool_fn is None:
            return (f"Error: Tool '{tool_name}' is not registered. "
                    f"Available tools: {available}")

        # The tool runs outside the lock so that a slow tool does not
        # block registration of new tools.
        try:
            result = tool_fn(**arguments)
            return str(result)
        except TypeError as e:
            return f"Error: Invalid arguments for tool '{tool_name}': {e}"
        except Exception as e:
            return f"Error: Tool '{tool_name}' raised an exception: {e}"

    def get_tool_schemas(self) -> list[dict]:
        """
        Return the JSON schemas for all registered tools.

        This is what the LLM reads to understand what tools are
        available and how to call them.
        """
        with self._lock:
            return list(self._schemas)  # Return a copy for thread safety.

    def get_tool_count(self) -> int:
        """Return the number of registered tools."""
        with self._lock:
            return len(self._tools)
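The registry above discovers tools by looking for a _mcp_tool_schema attribute on each callable, so something has to attach that attribute when a tool is defined. The sketch below shows one way this could be done with a small decorator built on the standard inspect module. The decorator name describe_tool and the schema layout are assumptions made for illustration and should not be read as the behaviour of any particular MCP library.

```python
import inspect
from typing import Callable


def describe_tool(fn: Callable) -> Callable:
    """Hypothetical decorator: attach the schema attribute the registry expects."""
    params = {}
    for name, param in inspect.signature(fn).parameters.items():
        if param.annotation is inspect.Parameter.empty:
            type_name = "any"
        else:
            type_name = getattr(param.annotation, "__name__", "any")
        params[name] = {
            "type": type_name,
            # A parameter without a default value is required.
            "required": param.default is inspect.Parameter.empty,
        }
    fn._mcp_tool_schema = {
        "name": fn.__name__,
        "description": inspect.getdoc(fn) or "",
        "parameters": params,
    }
    return fn


@describe_tool
def convert_currency(amount: float, rate: float = 1.0) -> str:
    """Convert an amount using a fixed exchange rate."""
    return f"{amount * rate:.2f}"


schema = convert_currency._mcp_tool_schema
print(schema["name"], schema["parameters"])
```

Deriving the schema from the signature and docstring keeps the description the LLM reads in step with the code that actually runs, which matters when the code is itself machine-generated.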

SECTION 5.3: THE STARTUP SEQUENCE

When the agent starts for the first time, it is a lightweight framework with just three core tools: web search, wiki query, and sandboxed code execution. It has an empty wiki with just the index and schema pages. It has no generated capabilities. It is, in a sense, a newborn. But from the very first interaction, it begins to learn. Every web search enriches the wiki. Every session is recorded. Every thirty minutes, the reflection engine wakes up, reads the recent history, and identifies what new capabilities would make the agent more useful. Over time, the agent grows. The startup sequence looks like this:

import logging
import os


def main():
    """
    Entry point for the self-evolving agent.

    This function initializes the agent with its configuration,
    starts all background processes, and enters the main interaction
    loop.
    """
    # Load configuration from environment variables.
    # LLM_API_KEY is required; a missing key raises KeyError here.
    config = {
        "llm_model": os.environ.get("LLM_MODEL", "claude-3-5-sonnet"),
        "llm_api_key": os.environ["LLM_API_KEY"],
        "wiki_root": os.environ.get("WIKI_ROOT", "./wiki"),
        "capabilities_dir": os.environ.get(
            "CAPABILITIES_DIR", "./generated_capabilities"
        ),
    }

    # Initialize and start the agent.
    agent = SelfEvolvingAgent(config)
    agent.start()

    print("Self-Evolving Agent is running.")
    print(f"Wiki: {config['wiki_root']}")
    print(f"Tools: {agent.tool_registry.get_tool_count()} registered")
    print("Type 'quit' to exit.\n")

    # Main interaction loop.
    try:
        while True:
            user_input = input("You: ").strip()
            if not user_input:
                continue
            if user_input.lower() in ("quit", "exit", "q"):
                break
            response = agent.chat(user_input)
            print(f"\nAgent: {response}\n")
    except KeyboardInterrupt:
        print("\nInterrupted by user.")
    finally:
        print("Stopping agent...")
        agent.stop()
        print("Agent stopped. Goodbye.")


if __name__ == "__main__":
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s [%(name)s] %(levelname)s: %(message)s"
    )
    main()

CHAPTER SIX: THE GROWTH TRAJECTORY - HOW THE AGENT EVOLVES OVER TIME

SECTION 6.1: DAY ONE - THE NEWBORN AGENT

On day one, the agent is a lightweight framework. It has three tools, an empty wiki, and no generated capabilities. But it is already useful. It can answer questions using its LLM knowledge, search the web for current information, and learn from every search it performs. A typical day-one interaction might look like this:

    User:  What is the current status of the MCP protocol?

    Agent: [Checks wiki - empty]
           [Calls web_search_and_learn("MCP protocol status 2025")]
           [Receives search results]
           [Ingests results into wiki: creates concepts/mcp_protocol.md]
           [Synthesizes answer from results]

    Agent: The Model Context Protocol (MCP) is currently under the
           stewardship of the Agentic AI Foundation (AAIF), a directed
           fund under the Linux Foundation, as of December 2025. The
           latest stable version was released on November 25, 2025...

The wiki now has a page about MCP. The next time a user asks about MCP, the agent will find this page in the wiki and use it as context, potentially without needing to search the web again.
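The wiki-first behaviour in this interaction can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the article's implementation: `TinyWiki`, `answer`, and the `search_fn` callback are hypothetical names, and an in-memory dictionary stands in for the markdown pages on disk.

```python
from typing import Callable, Optional


class TinyWiki:
    """Hypothetical in-memory stand-in for the wiki's concept pages."""

    def __init__(self) -> None:
        self._pages = {}

    def lookup(self, topic: str) -> Optional[str]:
        # Topics are matched case-insensitively in this sketch.
        return self._pages.get(topic.lower())

    def ingest(self, topic: str, text: str) -> None:
        self._pages[topic.lower()] = text


def answer(topic: str, wiki: TinyWiki,
           search_fn: Callable[[str], str]) -> tuple:
    """Return (source, text): use the wiki if it has a page, else search."""
    cached = wiki.lookup(topic)
    if cached is not None:
        return ("wiki", cached)
    result = search_fn(topic)    # Only hit the web on a wiki miss.
    wiki.ingest(topic, result)   # Learn from the search for next time.
    return ("web", result)
```

The second question about the same topic is served from the wiki, which is the effect the day-one example describes.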

SECTION 6.2: DAY TWO - THE FIRST REFLECTION

After thirty minutes of operation, the reflection engine wakes up for the first time. It reads the session records accumulated so far and notices several patterns. It notices that users frequently ask about documents but the agent has no way to read PDF or Word files. It notices that users ask about data but the agent has no dedicated data analysis tool, only generic sandboxed code execution. It notices that one user asked to schedule a meeting but the agent could not do it. The reflection engine generates three capability gap descriptions and passes them to the CapabilityGenerator. The generator produces three new tools: a PDF reader, a data analysis tool, and a calendar integration adapter. Each tool goes through the five-stage pipeline: specification, generation, validation, testing, and integration. By the time the second reflection cycle runs (one hour into operation), the agent has six tools instead of three: twice as many as it had when it started.
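In the article's design the LLM itself reflects on the session records. As a hedged illustration of the underlying idea, the sketch below counts unmet requests by category before anything is handed to the model. The record fields `category` and `handled` and the function name `find_capability_gaps` are assumptions made for this example; the default limit of three mirrors the max_capabilities_per_cycle setting from the safety configuration.

```python
from collections import Counter


def find_capability_gaps(records: list, min_count: int = 2,
                         limit: int = 3) -> list:
    """Return the most frequent categories of unhandled requests.

    Each record is assumed to be a dict with a 'category' string and a
    'handled' boolean. Categories seen fewer than min_count times are
    ignored so that one-off requests do not trigger tool generation.
    """
    unmet = Counter(r["category"] for r in records if not r["handled"])
    frequent = [cat for cat, n in unmet.most_common() if n >= min_count]
    return frequent[:limit]
```

A prefilter like this keeps the reflection prompt short: only the recurring gaps need to be described to the LLM in detail.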

SECTION 6.3: WEEK ONE - EXPONENTIAL GROWTH

The growth is not linear; it is exponential in the early stages. Each new tool enables new types of interactions, which generate new session data, which reveals new capability gaps, which generate more tools. The agent bootstraps itself into a progressively more capable system. By the end of week one, a typical agent might have accumulated between twenty and fifty tools, depending on the diversity of user requests. It might have adapters for email, calendar, Slack, GitHub, and several internal APIs. It might have specialized sub-agents for data analysis, code review, and document summarization. Its wiki might contain hundreds of pages covering the topics that users have asked about most frequently. The agent's system prompt grows longer as new capabilities are added. The LLM can cover more kinds of tasks because it has more tools to choose from and more wiki context to draw on, provided the tool descriptions stay precise enough for it to tell the tools apart.

SECTION 6.4: THE STABILIZATION PHASE

After the initial period of rapid growth, the agent enters a stabilization phase. The most obvious capability gaps have been filled. New reflection cycles still run every thirty minutes, but they identify fewer and fewer critical gaps. The agent's growth slows from exponential to logarithmic. In the stabilization phase, the reflection engine shifts its focus from generating new tools to improving existing ones. It notices that the PDF reader tool sometimes fails on scanned PDFs and generates an improved version with OCR support. It notices that the email adapter does not support attachments and generates an updated version. It notices that the data analysis tool is slow on large datasets and generates an optimized version. This continuous improvement of existing capabilities is just as important as the generation of new ones. The agent does not just grow wider; it grows deeper.

SECTION 6.5: THE MEMORY COMPOUNDING EFFECT

One of the most powerful effects of the LLM Wiki memory system is knowledge compounding. Every piece of information the agent ingests makes it slightly better at answering future questions. Over time, these small improvements compound into a significant advantage. After a month of operation, the agent's wiki might contain thousands of pages covering the specific topics that matter most to its users. When a user asks a question about a topic the agent has encountered before, the agent can answer from its wiki memory without needing to search the web. This makes responses faster, more accurate, and more tailored to the specific context of the organization using the agent. The wiki also captures institutional knowledge. If a user explains a company-specific process or terminology, the agent ingests this into the wiki and uses it in future interactions. The agent becomes progressively more aligned with the specific needs and context of its users.

CHAPTER SEVEN: SAFETY, SECURITY, AND GOVERNANCE

SECTION 7.1: THE RISKS OF SELF-MODIFICATION

A self-evolving agent that can generate and execute its own code is a powerful but potentially dangerous system. We need to think carefully about the risks and how to mitigate them. The most obvious risk is that the agent generates code with security vulnerabilities. A generated tool that makes HTTP requests might be vulnerable to server-side request forgery (SSRF). A generated tool that processes user input might be vulnerable to injection attacks. A generated tool that reads files might be vulnerable to path traversal attacks. A second risk is that the agent generates code that does something unintended. The LLM might misunderstand the gap description and generate a tool that does something different from what was intended. Without proper validation and testing, this broken tool could be integrated into the system and cause problems. A third risk is that the reflection engine identifies a capability gap that should not be filled. For example, it might identify that users frequently ask it to delete files, and generate a file deletion tool. But file deletion is a dangerous operation that should require explicit human approval. We address these risks through a combination of technical controls and governance policies.
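To make one of these risks concrete, here is a small sketch of a path traversal guard that a file-reading tool could apply before touching the disk. The function name `resolve_inside` and the idea of a single allowed base directory are assumptions for illustration; the article does not prescribe this helper.

```python
from pathlib import Path


def resolve_inside(base_dir: str, user_path: str) -> Path:
    """Resolve user_path under base_dir, rejecting anything that escapes it.

    Both paths are fully resolved first, so '..' segments and absolute
    paths cannot point outside the allowed directory.
    """
    base = Path(base_dir).resolve()
    candidate = (base / user_path).resolve()
    if not candidate.is_relative_to(base):  # Requires Python 3.9+.
        raise ValueError(f"Path escapes the allowed directory: {user_path}")
    return candidate
```

A request for "concepts/mcp.md" resolves normally, while "../secret.txt" or an absolute path such as "/etc/passwd" raises ValueError instead of being read.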

SECTION 7.2: TECHNICAL SAFETY CONTROLS

The technical safety controls in our system operate at four levels. At the code generation level, the LLM is given explicit instructions to avoid dangerous patterns. The system prompt for code generation includes a list of forbidden operations and explains why they are forbidden. The LLM is also instructed to use environment variables for all credentials and to never hardcode sensitive information. At the validation level, the _validate_code method performs static analysis to detect dangerous patterns before the code is executed. This is a whitelist/blacklist approach: certain operations are always forbidden, and the code must always include certain required elements (like the @mcp.tool() decorator). At the testing level, generated code is executed in a sandboxed subprocess with a timeout. Running in a separate process keeps the generated code out of the main process's memory, and the timeout prevents infinite loops from hanging the system. A plain subprocess does not by itself restrict file system or network access; that requires OS-level isolation such as the containers discussed in Section 9.3. At the integration level, a human-in-the-loop review can be added for high-risk capabilities. The system can be configured to require human approval before integrating any capability of type "adapter" or "agent", while allowing simple "tool" capabilities to be integrated automatically. The configuration for the safety controls looks like this:

    SAFETY_CONFIG = {
        # Capability types that require human approval before integration.
        "require_approval_for": ["adapter", "agent"],

        # Operations that are always forbidden in generated code.
        "forbidden_patterns": [
            "os.system",
            "subprocess.call",
            "subprocess.run",
            "__import__",
            "eval(",
            "exec(",
            "shutil.rmtree",
            "os.remove",
            "os.unlink",
            "open(",
        ],

        # Maximum execution time for generated code tests (seconds).
        "test_timeout_seconds": 30,

        # Maximum number of new capabilities per reflection cycle.
        "max_capabilities_per_cycle": 3,

        # Whether to allow network access in generated tools.
        "allow_network_access": True,

        # Whether to allow file system access in generated tools.
        "allow_filesystem_access": False,
    }

The max_capabilities_per_cycle limit is particularly important. It prevents the agent from generating too many new capabilities in a single cycle, which could overwhelm the system or introduce too many changes at once.

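The _validate_code method itself is not shown in this section. As a hedged sketch of what static validation could look like, the example below walks the Python AST and reports calls that appear on a denylist. The names `FORBIDDEN_CALLS` and `find_forbidden_calls` are hypothetical, and the list simply mirrors the forbidden_patterns entries with the trailing parentheses removed.

```python
import ast

# Hypothetical denylist, derived from the forbidden_patterns setting.
FORBIDDEN_CALLS = {
    "os.system", "subprocess.call", "subprocess.run", "__import__",
    "eval", "exec", "shutil.rmtree", "os.remove", "os.unlink", "open",
}


def _dotted_name(node: ast.AST) -> str:
    """Rebuild a dotted name such as 'os.system' from a call target."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Attribute):
        prefix = _dotted_name(node.value)
        return f"{prefix}.{node.attr}" if prefix else node.attr
    return ""


def find_forbidden_calls(source: str) -> list:
    """Return the sorted, de-duplicated forbidden call names in source."""
    tree = ast.parse(source)  # Raises SyntaxError on invalid code.
    hits = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            name = _dotted_name(node.func)
            if name in FORBIDDEN_CALLS:
                hits.add(name)
    return sorted(hits)
```

Compared with plain substring matching, an AST check does not fire on the text "eval(" inside a string literal. It is still easy to evade, for example with `from os import system` or an import alias, so it complements the sandbox rather than replacing it.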
SECTION 7.3: THE CAPABILITY APPROVAL WORKFLOW

For capabilities that require human approval, the system implements a simple approval workflow. When the reflection engine generates a high-risk capability, it does not integrate it immediately. Instead, it saves the generated code to a "pending" directory and notifies a human operator. The human operator can review the generated code, test it manually, and then approve or reject it. If approved, the capability is moved from the "pending" directory to the "capabilities" directory and registered with the MCP server. If rejected, the capability is archived with a note explaining why it was rejected, so the reflection engine does not try to generate the same capability again. This workflow ensures that the agent can grow autonomously for low-risk capabilities while maintaining human oversight for high-risk ones. It is a practical balance between autonomy and safety.
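The file movements in this workflow can be sketched as follows. This is an illustration under assumptions: the function names, the one-file-per-capability layout, and the `.rejected.json` note format are all hypothetical, and registration with the MCP server is left out.

```python
import json
import shutil
from pathlib import Path


def approve(name: str, pending: Path, active: Path) -> Path:
    """Move an approved capability from the pending directory to the
    capabilities directory. Registering it as a tool is a separate step."""
    active.mkdir(parents=True, exist_ok=True)
    target = active / f"{name}.py"
    shutil.move(str(pending / f"{name}.py"), str(target))
    return target


def reject(name: str, reason: str, pending: Path, archive: Path) -> Path:
    """Archive a rejected capability with a note explaining the rejection."""
    archive.mkdir(parents=True, exist_ok=True)
    shutil.move(str(pending / f"{name}.py"), str(archive / f"{name}.py"))
    note = archive / f"{name}.rejected.json"
    note.write_text(json.dumps({"name": name, "reason": reason}),
                    encoding="utf-8")
    return note


def was_rejected(name: str, archive: Path) -> bool:
    """Let the reflection engine skip gaps a human already turned down."""
    return (archive / f"{name}.rejected.json").exists()
```

Before generating a capability, the reflection engine could call `was_rejected` and skip any name that already has a rejection note.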

SECTION 7.4: AUDIT LOGGING AND OBSERVABILITY

Every action the agent takes is logged. Every tool call, every wiki update, every reflection cycle, every generated capability, and every integration decision is recorded in a structured audit log. This log is the foundation of observability for the self-evolving agent. The audit log serves several purposes. It allows operators to understand what the agent has been doing and why. It provides the raw material for post-hoc analysis of the agent's behavior. It enables rollback: if a generated capability causes problems, the audit log shows exactly when it was integrated and what it does, making it easy to remove. The audit log is also fed back into the wiki. Reflection cycles read the audit log as part of their analysis, giving the reflection engine a complete picture of the agent's behavior, not just the user-facing session records.
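A structured audit log can be as simple as one JSON object per line. The sketch below assumes a JSON Lines file and an event shape with `ts`, `type`, and `details` fields; these names, and the two helper functions, are illustrative choices rather than the article's actual log format.

```python
import json
import time
from pathlib import Path


def append_audit_event(log_path: Path, event_type: str, **details) -> dict:
    """Append one structured event to a JSON Lines audit log and return it."""
    event = {"ts": time.time(), "type": event_type, "details": details}
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(event, sort_keys=True) + "\n")
    return event


def events_for_capability(log_path: Path, name: str) -> list:
    """Find every event that mentions a capability, e.g. to plan a rollback."""
    matches = []
    with log_path.open(encoding="utf-8") as f:
        for line in f:
            event = json.loads(line)
            if event["details"].get("capability") == name:
                matches.append(event)
    return matches
```

Because each line is self-contained, the same file can be tailed by operators, filtered for rollback decisions, and summarized into the wiki for the reflection engine.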

CHAPTER EIGHT: ADVANCED TOPICS AND FUTURE DIRECTIONS

SECTION 8.1: MULTI-AGENT ARCHITECTURES

As the agent grows, it may spawn specialized sub-agents to handle specific domains. These sub-agents are themselves self-evolving agents, but with a narrower scope. They have their own wikis, their own tools, and their own reflection engines. The main agent communicates with sub-agents through a well-defined interface. When the main agent receives a request that falls within a sub-agent's domain, it delegates the request to the sub-agent and incorporates the sub-agent's response into its own response. This hierarchical architecture allows the system to scale to handle complex, multi-domain tasks. The main agent acts as an orchestrator, routing requests to the appropriate specialist. Each specialist agent grows and improves independently, but they share knowledge through a common wiki namespace. A sub-agent is spawned by the CapabilityGenerator when the reflection engine identifies that a specific domain requires specialized handling. The specification for a sub-agent includes its domain, its initial tools, its wiki namespace, and its communication interface with the main agent.
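The delegation step can be illustrated with a very small router. In the architecture described here the LLM would normally decide which specialist to call; the keyword matching below is a deliberate simplification, and the `Router` class with its `register` and `dispatch` methods is a hypothetical interface invented for this sketch.

```python
from typing import Callable

Handler = Callable[[str], str]


class Router:
    """Hypothetical orchestrator-side router keyed by domain keywords."""

    def __init__(self, default: Handler) -> None:
        self._default = default   # The main agent handles unmatched requests.
        self._routes = []         # List of (keyword set, sub-agent handler).

    def register(self, keywords: set, handler: Handler) -> None:
        self._routes.append(({k.lower() for k in keywords}, handler))

    def dispatch(self, request: str) -> str:
        words = set(request.lower().split())
        for keywords, handler in self._routes:
            if words & keywords:  # Any shared keyword selects the sub-agent.
                return handler(request)
        return self._default(request)
```

Each handler stands in for a sub-agent's chat interface, so the main agent can fold the returned text into its own response.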

SECTION 8.2: CROSS-SESSION LEARNING AND KNOWLEDGE TRANSFER

One of the most powerful aspects of the LLM Wiki memory system is that knowledge persists across sessions and across users. When one user teaches the agent something, all future users benefit from that knowledge. This cross-session learning is particularly valuable in an enterprise context. When a subject matter expert interacts with the agent and provides detailed domain knowledge, that knowledge is compiled into the wiki and becomes available to all future users. The agent becomes progressively more knowledgeable about the specific domain of the organization using it. Knowledge transfer between agents is also possible. If an organization runs multiple instances of the self-evolving agent (for example, one for each department), the wikis of these instances can be synchronized. A concept page created by the engineering department's agent can be shared with the marketing department's agent, allowing knowledge to flow across the organization.

SECTION 8.3: FINE-TUNING AND WEIGHT UPDATES

The architecture we have described so far is entirely based on in-context learning: the agent learns by updating its wiki and its tool registry, not by updating the weights of the underlying LLM. This is a deliberate design choice. Weight updates require significant computational resources, careful data curation, and rigorous evaluation. They are not something that can be done every thirty minutes in a production system. However, the wiki and session data that the agent accumulates over time is extremely valuable training data. Over longer time horizons (weeks or months), this data can be used to fine-tune the underlying LLM to be more aligned with the specific needs of the organization. The fine-tuned model can then be deployed as the new LLM Core of the agent, giving it a permanent improvement in its baseline capabilities. This creates a two-speed learning system. The fast loop (every thirty minutes) updates the wiki and tool registry. The slow loop (every few months) updates the LLM weights. Together, they enable both rapid adaptation and deep, permanent learning.

SECTION 8.4: EVALUATING THE AGENT'S GROWTH

How do we know if the agent is actually getting better? We need metrics. The most important metrics for a self-evolving agent are not the usual LLM benchmarks. They are operational metrics that measure the agent's actual usefulness in its specific deployment context. The first metric is the tool utilization rate, which measures what percentage of user requests are successfully handled using the available tools. A rising tool utilization rate indicates that the agent is building the right tools for its users. The second metric is the wiki hit rate, which measures what percentage of user queries find relevant information in the wiki without needing a web search. A rising wiki hit rate indicates that the agent's knowledge base is growing in the right direction. The third metric is the capability gap rate, which measures how many capability gaps are identified per reflection cycle. A declining gap rate indicates that the agent is approaching a stable, comprehensive capability set for its deployment context. The fourth metric is the user satisfaction score, which can be collected through simple thumbs-up/thumbs-down feedback after each interaction. This is the ultimate measure of whether the agent's growth is actually making it more useful. These metrics should be tracked over time and visualized in a dashboard that operators can use to monitor the agent's growth and identify areas that need attention.
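The four metrics reduce to simple ratios once each interaction is recorded with a few flags. The sketch below assumes a hypothetical `Interaction` record and a list of gap counts, one per reflection cycle; the field and function names are illustrative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Interaction:
    """One user request, reduced to the fields the metrics need."""
    handled_by_tool: bool       # A tool call successfully handled the request.
    wiki_hit: bool              # Answered from the wiki, no web search needed.
    thumbs_up: Optional[bool]   # None when the user gave no feedback.


def _ratio(part: float, whole: int) -> float:
    """Safe division: 0.0 when there is nothing to measure yet."""
    return part / whole if whole else 0.0


def growth_metrics(interactions: list, gaps_per_cycle: list) -> dict:
    """Compute the four operational metrics described above."""
    total = len(interactions)
    rated = [i for i in interactions if i.thumbs_up is not None]
    return {
        "tool_utilization_rate": _ratio(
            sum(i.handled_by_tool for i in interactions), total),
        "wiki_hit_rate": _ratio(sum(i.wiki_hit for i in interactions), total),
        # Average number of gaps identified per reflection cycle.
        "capability_gap_rate": _ratio(sum(gaps_per_cycle), len(gaps_per_cycle)),
        # Share of thumbs-up among interactions that received feedback.
        "user_satisfaction": _ratio(sum(i.thumbs_up for i in rated), len(rated)),
    }
```

Computing these per day or per week, rather than once over all history, gives the trend lines a dashboard needs.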

CHAPTER NINE: PUTTING IT ALL TOGETHER - THE COMPLETE PICTURE

SECTION 9.1: THE ARCHITECTURE DIAGRAM IN ASCII

Let us draw the complete architecture of the self-evolving agent in ASCII. This diagram shows all the components and how they relate to each other.

+------------------------------------------------------------------+
|                       SELF-EVOLVING AGENT                        |
|                                                                  |
|  +------------------+        +------------------------------+    |
|  |  USER INTERFACE  |        |           LLM CORE           |    |
|  |  (CLI / API /    |<------>|    (Claude / GPT / Llama)    |    |
|  |   Web UI)        |        |                              |    |
|  +------------------+        +----------+-------------------+    |
|                                         |                        |
|                              +----------v-------------------+    |
|                              |      AGENT ORCHESTRATOR      |    |
|                              |  (SelfEvolvingAgent class)   |    |
|                              +--+-------+----------+--------+    |
|                                 |       |          |             |
|              +------------------+       |    +-----+----------+  |
|              |                          |    |   REFLECTION   |  |
|  +-----------v-----------+              |    |     ENGINE     |  |
|  |   MCP TOOL REGISTRY   |              |    | (30 min loop)  |  |
|  |                       |              |    +-----+----------+  |
|  | [web_search_and_learn]|              |          |             |
|  | [wiki_query]          |              |          |             |
|  | [code_execution]      |              |    +-----v----------+  |
|  | [email_sender] <-- dynamically       |    |   CAPABILITY   |  |
|  | [pdf_reader]   <-- added by          |    |   GENERATOR    |  |
|  | [weather_api]  <-- reflection        |    | (5-stage pipe) |  |
|  | [...]          <-- engine            |    +----------------+  |
|  +-----------+-----------+              |                        |
|              |                          |                        |
|  +-----------v-----------+              |                        |
|  |  WIKI MEMORY MANAGER  |<-------------+                        |
|  |                       |                                       |
|  |  concepts/            |                                       |
|  |    mcp_protocol.md    |                                       |
|  |    python_asyncio.md  |                                       |
|  |    [...]              |                                       |
|  |  sessions/            |                                       |
|  |    session_001.md     |                                       |
|  |    reflection_001.md  |                                       |
|  |    [...]              |                                       |
|  |  capabilities/        |                                       |
|  |    email_sender.md    |                                       |
|  |    pdf_reader.md      |                                       |
|  |    [...]              |                                       |
|  +-----------------------+                                       |
|                                                                  |
+------------------------------------------------------------------+

External World:
  - Web Search APIs (Tavily, Brave, SerpAPI)
  - SMTP Servers (email)
  - Calendar APIs (Google Calendar, Outlook)
  - Enterprise Systems (SAP, Jira, Confluence)
  - File Systems (local, S3, SharePoint)
  - [... dynamically discovered and integrated ...]

SECTION 9.2: THE INFORMATION FLOW

Understanding how information flows through the system is crucial for understanding why it works. Let us trace the information flow for a single user interaction. The user sends a message. The AgentOrchestrator receives it and queries the wiki for relevant context. The wiki returns any relevant pages it finds. The orchestrator builds a context object containing the conversation history, the wiki context, and the list of available tool schemas. It passes this context to the LLM Core via the agent loop. The LLM Core reads the context and decides what to do. If it decides to use a tool, it emits a structured tool call. The agent loop intercepts the tool call and routes it to the MCPToolRegistry. The registry finds the appropriate tool function and executes it. The result is added to the conversation history. The LLM Core reads the result and continues reasoning. When the LLM Core produces a final text response, the agent loop returns it to the AgentOrchestrator. The orchestrator records the interaction in the session data. If the session data has accumulated enough interactions, it flushes the session data to the wiki. The orchestrator returns the response to the user. In the background, the ReflectionEngine is running its thirty-minute cycle. It reads the session records from the wiki, asks the LLM to reflect on them, identifies capability gaps, and passes them to the CapabilityGenerator. The generator produces new tools, validates them, tests them, and integrates them into the MCPToolRegistry. The orchestrator is notified of each new capability and rebuilds the system prompt to include it. This information flow is continuous, parallel, and self-reinforcing. The agent never stops learning, never stops improving, and never stops growing.
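The foreground path traced above (context in, tool call out, result back in, final text out) can be sketched as a short loop. This is a minimal illustration, not the article's agent loop: the `llm` callable, the reply shape with `tool`, `arguments`, and `text` keys, and the step limit are all assumptions made for the example.

```python
from typing import Callable


def run_agent_loop(llm: Callable[[list], dict], tools: dict,
                   user_message: str, max_steps: int = 5) -> str:
    """Alternate between the LLM and tools until the LLM returns text.

    The llm callable receives the conversation history and returns either
    {"tool": name, "arguments": {...}} or {"text": final_answer}.
    """
    history = [{"role": "user", "content": user_message}]
    for _ in range(max_steps):
        reply = llm(history)
        if "tool" not in reply:
            return reply["text"]          # Final response for the user.
        name = reply["tool"]
        args = reply.get("arguments", {})
        if name in tools:
            try:
                result = str(tools[name](**args))
            except Exception as e:        # Report failures back to the LLM.
                result = f"Error: Tool '{name}' raised an exception: {e}"
        else:
            result = f"Error: Tool '{name}' not found."
        # The tool result becomes part of the context for the next step.
        history.append({"role": "tool", "name": name, "content": result})
    return "Stopped: step limit reached without a final answer."
```

The step limit is a design choice for the sketch: it keeps a confused model from calling tools forever, at the cost of occasionally cutting off a long but legitimate chain of calls.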

SECTION 9.3: DEPLOYMENT CONSIDERATIONS

Deploying a self-evolving agent in a production environment requires careful thought about infrastructure, security, and operations. From an infrastructure perspective, the agent needs a persistent file system for the wiki and the generated capabilities directory. This should be backed by a reliable storage system (not just a local disk) so that the agent's accumulated knowledge survives restarts and deployments. In a cloud environment, this might be an NFS mount, an S3 bucket with a FUSE adapter, or a managed file storage service. The agent also needs a reliable LLM API connection. If the LLM API is unavailable, the agent cannot function. Consider implementing a fallback strategy: if the primary LLM is unavailable, fall back to a local model (such as a quantized Llama model) for basic functionality. From a security perspective, the generated capabilities directory should be treated as untrusted code. It should be executed in a sandboxed environment with limited permissions. In a production deployment, consider using Docker containers or WebAssembly sandboxes for executing generated code. From an operational perspective, the agent needs monitoring and alerting. Monitor the reflection cycle duration (if it takes too long, something is wrong), the number of failed capability generations (a high failure rate indicates a problem with the code generation prompt), and the wiki size (a rapidly growing wiki might indicate that ingestion is not deduplicating properly).
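The fallback strategy mentioned above can be sketched as an ordered list of backends that are tried in turn. The function name `complete_with_fallback` and the `(name, callable)` backend shape are assumptions for illustration; a real deployment would wrap its actual API client and local model behind such callables.

```python
from typing import Callable, Sequence


def complete_with_fallback(
    prompt: str,
    backends: Sequence[tuple],
) -> tuple:
    """Try each (name, call) backend in order; return (name, completion).

    Any exception from a backend is treated as 'unavailable' and the next
    backend is tried. If all fail, the collected errors are raised together.
    """
    errors = []
    for name, call in backends:
        try:
            return name, call(prompt)
        except Exception as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError("All LLM backends failed: " + "; ".join(errors))
```

Returning the backend name alongside the completion lets the audit log record when the agent was running on the reduced-capability local model.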

CONCLUSION: THE LIVING AGENT

We have traveled a long way in this article. We started with the simplest possible building block: an LLM that can call tools using the Model Context Protocol. We added a persistent memory system inspired by Andrej Karpathy's LLM Wiki concept, turning every web search into a learning opportunity and every session into a page in the agent's growing knowledge base. We added a reflection engine that wakes up every thirty minutes, reads the agent's history, and identifies what new capabilities would make it more useful. And we added a capability generator that turns those insights into working code, validates it, tests it, and integrates it into the running system. The result is an agent that starts as a lightweight framework and grows more powerful over time. It is not a static system that you deploy and forget. It is a living system that learns, adapts, and improves. It is, in a meaningful sense, a system that builds itself. This is not magic. Every component we have described is built on solid engineering principles: clean interfaces, separation of concerns, defensive programming, and careful error handling. The LLM is not doing anything mysterious; it is performing structured reasoning tasks (reflection, code generation, knowledge compilation) that it is well-suited for. The system works because it channels the LLM's capabilities into a well-designed feedback loop. The most important insight in this article is perhaps the simplest one: the agent's growth is driven by its interactions with users. Every question a user asks, every tool call that fails, every capability gap that is exposed is raw material for the reflection engine. The agent grows in the direction that its users need it to grow. It is not evolving randomly; it is evolving purposefully, guided by the needs of the people it serves. This is the vision of the self-evolving agent: not an AI that replaces human judgment, but an AI that continuously improves its ability to support it. 
An AI that starts small, grows large, and never stops learning. The code is ready. The architecture is clear. The only thing left is to build it.

APPENDIX: KEY DEPENDENCIES AND VERSIONS

The following Python packages are required for the core system. All version numbers reflect the state of the ecosystem as of December 2025.

    mcp[cli]>=1.2.0          # Model Context Protocol SDK
    anthropic>=0.40.0        # Anthropic Claude API client
    httpx>=0.27.0            # Async HTTP client for API calls
    pydantic>=2.8.0          # Data validation and settings management
    python-dotenv>=1.0.0     # Environment variable management
    pathlib                  # Built-in: file system path handling
    threading                # Built-in: background thread management
    ast                      # Built-in: Python AST for code validation
    importlib                # Built-in: dynamic module loading
    subprocess               # Built-in: sandboxed code execution
    json                     # Built-in: JSON parsing and generation
    logging                  # Built-in: structured audit logging

Optional dependencies for specific generated capabilities:

    smtplib                  # Built-in: email sending
    pypdf>=4.0.0             # PDF reading
    pandas>=2.2.0            # Data analysis
    matplotlib>=3.9.0        # Data visualization
    requests>=2.32.0         # Synchronous HTTP requests

The project structure on disk looks like this:

    self_evolving_agent/
    |-- main.py                      # Entry point (startup sequence)
    |-- agent_orchestrator.py        # SelfEvolvingAgent class
    |-- agent_loop.py                # Core agent reasoning loop
    |-- mcp_tool_registry.py         # MCPToolRegistry class
    |-- wiki_memory.py               # WikiMemoryManager class
    |-- reflection_engine.py         # ReflectionEngine class
    |-- capability_generator.py      # CapabilityGenerator class
    |-- core_tools/
    |   |-- web_search.py            # web_search_and_learn tool
    |   |-- wiki_query.py            # wiki_query tool
    |   `-- code_execution.py        # sandboxed_code_execution tool
    |-- generated_capabilities/      # Dynamically generated tools
    |   |-- email_sender.py
    |   |-- pdf_reader.py
    |   `-- [...]
    |-- wiki/                        # The agent's long-term memory
    |   |-- index.md
    |   |-- schema.md
    |   |-- concepts/
    |   |-- sessions/
    |   `-- capabilities/
    |-- config/
    |   `-- safety_config.json       # Safety and governance settings
    |-- tests/
    |   |-- test_wiki_memory.py
    |   |-- test_reflection_engine.py
    |   `-- test_capability_generator.py
    `-- requirements.txt