Friday, April 03, 2026

Self-Extending AI: How to Teach an LLM to Build Its Own Tools


 



+------------------------------------------------------------------+
|                                                                  |
|   "What if the hammer you need doesn't exist yet --              |
|    so you teach the carpenter to forge it on the spot?"          |
|                                                                  |
+------------------------------------------------------------------+

TABLE OF CONTENTS

  1. The Problem Nobody Talks About Enough
  2. A Brief Map of the Territory
  3. The Architecture of a Self-Extending Agent
  4. The Tool Registry: A Living Catalogue
  5. The Code Generator: Teaching the LLM to Write Tools
  6. The Validator: Trust, But Verify
  7. The MCP Server: Plugging Into the Protocol
  8. The Agent Loop: Where Everything Comes Alive
  9. Dynamic Agents: Beyond Tools
 10. Security, Safety, and the Responsible Path Forward
 11. Conclusion: The Beginning of Something Larger

1. THE PROBLEM NOBODY TALKS ABOUT ENOUGH

Every developer who has built a serious tool-based LLM application has hit the same invisible wall. You design your system carefully. You write a dozen tools -- functions that let the model search the web, query a database, send an email, parse a PDF, call an API. You test everything. The demo goes beautifully. And then, inevitably, a user types something like this:

  "Can you calculate the Haversine distance between these two GPS
   coordinates and then plot the result on an ASCII map?"

And your system freezes. Not because the LLM is confused -- the LLM knows exactly what to do. It freezes because the tool simply does not exist. You never wrote a Haversine calculator. You never wrote an ASCII map plotter. The model knows the answer, but it cannot reach it. It is like a surgeon who understands the procedure perfectly but has been handed the wrong instrument tray.

This is the fundamental tension at the heart of tool-based agentic AI: the set of tools you define at design time is always, inevitably, incomplete. The world is infinite. Your tool list is not.

The conventional responses to this problem are unsatisfying. You can try to anticipate every possible user need and write tools in advance -- but this is a fool's errand that leads to bloated, unmaintainable codebases. You can tell the user "sorry, that capability is not available" -- but this is a failure mode that erodes trust and utility. You can redeploy the system every time a new tool is needed -- but this breaks the promise of a living, responsive AI system.

So here is the question that should keep every AI engineer up at night:

+------------------------------------------------------------------+
|                                                                  |
|   What if the LLM could write the missing tool itself,           |
|   validate it, register it, and use it -- all in real time,      |
|   without any human intervention and without restarting?         |
|                                                                  |
+------------------------------------------------------------------+

The answer, as this article will demonstrate in considerable detail, is yes. Not only is this possible, but it is architecturally elegant, practically useful, and -- with the right safeguards -- surprisingly safe. And the implications extend far beyond tools. The same mechanism that lets an agent generate a missing function can be used to generate and integrate entirely new sub-agents, new reasoning strategies, and new orchestration patterns. We are talking about a system that grows its own cognitive apparatus on demand.

This article walks through the complete architecture of such a system, built in Python using the Model Context Protocol (MCP), asyncio, and a live tool registry. We will examine each layer in depth, understand why each design decision was made, and look at the moments where things can go wrong and how to prevent them.

Readers are assumed to have a working understanding of LLMs, function calling, and the basics of agentic AI. What follows is not a beginner's tutorial -- it is an engineering deep-dive into one of the most interesting problems in applied AI today.

The complete code is available in a GitHub repository.


2. A BRIEF MAP OF THE TERRITORY

Before we descend into code, it is worth establishing a shared mental model of the landscape we are navigating.

Modern agentic AI systems are built around a loop. The agent receives a goal, reasons about what actions to take, invokes tools to carry out those actions, observes the results, and reasons again. This loop continues until the goal is achieved or the agent gives up. The tools are the agent's hands -- the mechanisms by which it affects the world outside its own context window.

The Model Context Protocol, or MCP, is an open standard developed by Anthropic that formalises how tools are described, discovered, and invoked. An MCP server exposes a list of tools via a "tools/list" endpoint and handles invocations via a "tools/call" endpoint. An MCP client -- typically the agent -- queries the server to discover what tools are available and then calls them as needed. This clean separation between tool definition and tool invocation is what makes MCP so powerful as a foundation for dynamic systems.
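On the wire, both endpoints are ordinary JSON-RPC methods. The shapes below are a hedged sketch of the payloads involved (field names follow the MCP specification; the haversine_distance tool itself is illustrative):

```python
# A tools/list result: each tool advertises its name, description,
# and a JSON Schema for its inputs (note the protocol's camelCase).
list_response = {
    "tools": [
        {
            "name": "haversine_distance",
            "description": "Great-circle distance between two coordinates.",
            "inputSchema": {
                "type": "object",
                "properties": {
                    "lat1": {"type": "number"}, "lon1": {"type": "number"},
                    "lat2": {"type": "number"}, "lon2": {"type": "number"},
                },
                "required": ["lat1", "lon1", "lat2", "lon2"],
            },
        }
    ]
}

# A tools/call request: the client names the tool and passes arguments
# matching the advertised schema.
call_request = {
    "method": "tools/call",
    "params": {
        "name": "haversine_distance",
        "arguments": {"lat1": 48.8566, "lon1": 2.3522,
                      "lat2": 51.5074, "lon2": -0.1278},
    },
}
```

Because the agent discovers tools by reading this list rather than by linking against them, nothing in the protocol prevents the list from changing between two requests -- which is exactly the property the dynamic system exploits.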

  STANDARD (STATIC) AGENTIC SYSTEM

  +------------+       tools/list        +------------------+
  |            | ----------------------> |                  |
  |   AGENT    |                         |   MCP SERVER     |
  |   (LLM)    | <---------------------- | (fixed tool set) |
  |            |       tools/call        |                  |
  +------------+                         +------------------+
        |
        | "I need a Haversine tool..."
        |
        v
   [DEAD END -- tool does not exist]

In a static system, the tool list is fixed at startup. The server reads a configuration, loads a set of pre-written functions, and serves them forever. This is simple, predictable, and brittle.

Now consider what happens when we make the tool registry dynamic:

  SELF-EXTENDING AGENTIC SYSTEM

  +------------+    tools/list (live)    +------------------+
  |            | ----------------------> |                  |
  |   AGENT    |                         |   MCP SERVER     |
  |   (LLM)    | <---------------------- | (dynamic tools)  |
  |            |       tools/call        |                  |
  +------------+                         +--------+---------+
        |                                         |
        | "I need a Haversine tool"               |
        |                                         v
        |                              +----------+---------+
        +----------------------------> |                    |
          generate_and_register_tool   |  TOOL REGISTRY     |
                                       |  (grows at runtime)|
                                       +----------+---------+
                                                  |
                                       +----------+---------+
                                       |                    |
                                       |  CODE GENERATOR    |
                                       |  (LLM writes code) |
                                       +--------------------+

The agent, upon discovering that a tool does not exist, does not give up. Instead, it calls a special meta-tool called "generate_and_register_tool", passing a natural language description of what it needs. The code generator -- itself powered by an LLM -- writes the function, the validator checks it for safety and correctness, the registry stores it, and the MCP server immediately makes it available. The agent then calls the newly created tool as if it had always been there.

The system has extended itself. No restart. No human intervention. No deployment pipeline. The gap in capability has been filled, in real time, by the system itself.

This is the architecture we will now examine in detail.


3. THE ARCHITECTURE OF A SELF-EXTENDING AGENT

Good software architecture tells a story. Each component has a clear responsibility, a clear interface, and a clear reason to exist. The self-extending agent we are building is composed of five major components, each of which we will examine in its own section.

+------------------------------------------------------------------+
|                   SELF-EXTENDING AGENT SYSTEM                    |
+------------------------------------------------------------------+
|                                                                  |
|   +----------+     +----------+     +----------+                 |
|   |          |     |          |     |          |                 |
|   |  AGENT   +---->+   MCP    +---->+  TOOL    |                 |
|   |  LOOP    |     |  SERVER  |     | REGISTRY |                 |
|   |          |<----+          |<----+          |                 |
|   +----------+     +----+-----+     +----+-----+                 |
|                         |                |                       |
|                         |                |                       |
|                    +----v-----+     +----v-----+                 |
|                    |          |     |          |                 |
|                    |  CODE    |     |   TOOL   |                 |
|                    | GENERATOR|     |VALIDATOR |                 |
|                    |          |     |          |                 |
|                    +----------+     +----------+                 |
|                                                                  |
+------------------------------------------------------------------+

The five components and their responsibilities are as follows.

The Agent Loop is the orchestrating intelligence. It holds the conversation history, decides when to call tools, interprets tool results, and determines when a goal has been achieved. It is the component that first notices a tool is missing and decides to generate one.

The MCP Server is the protocol boundary. It translates between the agent's tool calls and the registry's internal representation. It also hosts the meta-tools -- the built-in tools that allow the agent to manage the registry itself.

The Tool Registry is the living catalogue of available tools. It stores tool metadata, source code, callable functions, and usage statistics. It is thread-safe, async-native, and designed to grow at runtime.

The Code Generator is the creative engine. It takes a natural language description of a desired capability and produces working Python code, complete with type annotations and a docstring that the registry uses to build the tool's JSON Schema.

The Tool Validator is the safety layer. Before any generated code is allowed into the registry, the validator performs static analysis, checking for dangerous patterns, verifying the function signature, and ensuring the code is structurally sound.

These five components form a closed loop of capability generation. Let us examine each one in turn.


4. THE TOOL REGISTRY: A LIVING CATALOGUE

The Tool Registry is the heart of the system. Everything else exists to serve it or to consume from it. Its job is deceptively simple: keep track of what tools exist, store their code and metadata, and make them callable. But the implementation details reveal a rich set of engineering challenges.

The first challenge is concurrency. In an async Python system, multiple agent turns may be running concurrently. One turn might be calling a tool while another is registering a new one. Without careful synchronisation, this leads to race conditions, stale reads, and corrupted state. The registry uses asyncio.Lock to protect all mutations, but with a critical subtlety: the lock is acquired and released within a single coroutine frame, never held across an await boundary that could cause a deadlock.
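The lock discipline can be sketched as follows (MiniRegistry and demo are illustrative names; the real registry wraps far more bookkeeping):

```python
import asyncio
from typing import Any


class MiniRegistry:
    """Minimal sketch of the lock discipline: mutate under the lock,
    never hold it across an await of external work."""

    def __init__(self) -> None:
        self._tools: dict[str, Any] = {}
        self._lock = asyncio.Lock()

    async def register(self, name: str, fn: Any) -> None:
        async with self._lock:      # held only for the dict mutation
            self._tools[name] = fn

    async def call(self, name: str, *args: Any) -> Any:
        async with self._lock:      # take a snapshot under the lock...
            fn = self._tools[name]
        return fn(*args)            # ...then invoke outside it


async def demo() -> int:
    reg = MiniRegistry()
    # Concurrent registrations are serialised by the lock.
    await asyncio.gather(
        *(reg.register(f"t{i}", lambda x=i: x) for i in range(10))
    )
    return await reg.call("t3")
```

The key point is the second method: the callable is looked up inside the critical section, but executed outside it, so a slow tool never blocks registrations.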

The second challenge is the representation of a tool. A tool is not just a function -- it is a bundle of related information. It has a name, a description, a JSON Schema describing its inputs, the original source code (for introspection and debugging), a callable reference, usage statistics, and tags for categorisation. We capture all of this in a dataclass called ToolEntry:

from collections.abc import Callable
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any

@dataclass
class ToolEntry:
    name: str
    description: str
    input_schema: dict[str, Any]
    source_code: str
    callable_fn: Callable[..., Any]
    tags: list[str] = field(default_factory=list)
    call_count: int = 0
    last_error: str | None = None
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

Notice that the ToolEntry carries both the source code as a string and the callable_fn as a live Python callable. This duality is intentional and important. The source code is stored for transparency -- so the agent can inspect what a tool does, and so a human operator can audit the system's self-generated code. The callable is stored for performance -- so that invocation does not require re-parsing or re-compiling the source on every call.

The third challenge is the JSON Schema. MCP requires every tool to declare its input schema in JSON Schema format. For hand-written tools, you write this schema manually. For dynamically generated tools, you need to derive it automatically from the function's type annotations. The registry does this using Python's inspect module combined with a type-annotation-to-JSON-Schema converter:

def _annotation_to_json_schema(annotation: Any) -> dict[str, Any]:
    if annotation is str:
        return {"type": "string"}
    if annotation is int:
        return {"type": "integer"}
    if annotation is float:
        return {"type": "number"}
    if annotation is bool:
        return {"type": "boolean"}
    if annotation is list:
        return {"type": "array"}
    if annotation is dict:
        return {"type": "object"}
    return {}

This function is called for each parameter in the generated function's signature. The result is assembled into a complete JSON Schema object that the MCP server can serve to the agent, allowing the agent to understand exactly what arguments the new tool expects.
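That assembly step can be sketched with inspect.signature. The converter from the previous snippet is repeated inline in compressed form so this sketch stands alone; build_input_schema and example are illustrative names:

```python
import inspect
from typing import Any

# Compressed inline version of the _annotation_to_json_schema converter.
_SIMPLE = {str: "string", int: "integer", float: "number",
           bool: "boolean", list: "array", dict: "object"}

def _annotation_to_json_schema(annotation: Any) -> dict[str, Any]:
    return {"type": _SIMPLE[annotation]} if annotation in _SIMPLE else {}

def build_input_schema(fn: Any) -> dict[str, Any]:
    """Assemble an MCP-style input schema from a function's signature."""
    properties: dict[str, Any] = {}
    required: list[str] = []
    for name, param in inspect.signature(fn).parameters.items():
        properties[name] = _annotation_to_json_schema(param.annotation)
        if param.default is inspect.Parameter.empty:
            required.append(name)  # parameters without defaults are required
    return {"type": "object", "properties": properties, "required": required}

def example(city: str, population: int = 0) -> str:
    return f"{city}: {population}"
```

Parameters with defaults are left out of the "required" list, which lets the agent omit them when calling the tool.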

The fourth challenge is the registration process itself. When a new tool arrives -- as source code from the generator -- the registry must execute that code in a controlled namespace and extract the resulting function. This is done using Python's built-in exec() function, which is powerful and dangerous in equal measure. The registry uses a restricted namespace and relies on the validator (described in the next section) to ensure the code is safe before exec() is ever called:

async def register_from_source(
    self, source_code: str, tags: list[str] | None = None
) -> ToolEntry:
    # SAFE_BUILTINS is a curated allow-list of builtins defined
    # elsewhere in the module; passing the full __builtins__ mapping
    # here would hand generated code helpers like open().
    namespace: dict[str, Any] = {"__builtins__": SAFE_BUILTINS}
    exec(compile(source_code, "<generated>", "exec"), namespace)
    # ... extract function, build schema, store entry ...

The compile() call before exec() is not just a performance optimisation -- it also provides a second opportunity to catch syntax errors before they corrupt the registry's state.
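A hedged sketch of that guard (safe_compile is an illustrative name, not the registry's actual method):

```python
from typing import Any

def safe_compile(source_code: str) -> Any:
    """Compile generated source, surfacing syntax errors as a clean
    failure before any registry state is touched."""
    try:
        return compile(source_code, "<generated>", "exec")
    except SyntaxError as exc:
        # The registry can report this message back to the
        # generator's retry loop for self-correction.
        raise ValueError(f"Generated code is not valid Python: {exc}") from exc
```

Because compilation happens before registration, a syntactically broken submission fails fast and leaves the registry exactly as it was.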

The fifth and final challenge is introspection. The agent needs to be able to ask the registry questions: What tools exist? How many times has each tool been called? What was the last error? The registry exposes a get_stats() method that returns a structured summary of all this information, which the agent can use to make intelligent decisions about whether to reuse an existing tool or generate a new one.

  TOOL REGISTRY -- INTERNAL STATE

  +---------------------------------------------------------+
  |  _tools: dict[str, ToolEntry]                           |
  |                                                         |
  |  "haversine_distance"                                   |
  |    +- description: "Calculates distance between..."     |
  |    +- input_schema: {lat1, lon1, lat2, lon2}            |
  |    +- source_code: "def haversine_distance(...):\n..."  |
  |    +- callable_fn: <function haversine_distance>        |
  |    +- call_count: 7                                     |
  |    +- last_error: None                                  |
  |    +- tags: ["math", "geography"]                       |
  |                                                         |
  |  "ascii_map_plotter"                                    |
  |    +- description: "Renders a simple ASCII map..."      |
  |    +- ...                                               |
  +---------------------------------------------------------+

The registry is, in a very real sense, the agent's long-term procedural memory. It accumulates tools across sessions (if persisted to disk), grows smarter over time, and allows the agent to avoid regenerating tools it has already created. This is a form of learning that does not require retraining the underlying model.


5. THE CODE GENERATOR: TEACHING THE LLM TO WRITE TOOLS

The Code Generator is where the magic happens, and also where the most interesting engineering challenges live. Its job is to take a natural language description of a desired capability and produce a working, well-formed Python function that the registry can accept.

The generator is itself powered by an LLM -- the same model that drives the agent, or a separate one dedicated to code generation. This creates a fascinating recursive structure: the LLM is using an LLM to extend its own capabilities. The outer LLM decides it needs a tool; the inner LLM writes the tool; the outer LLM then uses it.

  CODE GENERATION PIPELINE

  Natural Language Description
           |
           v
  +--------+--------+
  |                 |
  |  PROMPT BUILDER |  <-- injects schema requirements,
  |                 |      safety constraints, examples
  +--------+--------+
           |
           v
  +--------+--------+
  |                 |
  |   LLM API CALL  |  <-- temperature low (0.2),
  |                 |      deterministic preferred
  +--------+--------+
           |
           v
  +--------+--------+
  |                 |
  |  CODE EXTRACTOR |  <-- strips markdown fences,
  |                 |      isolates the function
  +--------+--------+
           |
           v
  +--------+--------+
  |                 |
  |   VALIDATOR     |  <-- static analysis, safety checks
  |                 |
  +--------+--------+
           |
           v
  +--------+--------+
  |                 |
  |    REGISTRY     |  <-- exec(), store, serve
  |                 |
  +-----------------+

The prompt engineering for code generation is critically important and deserves careful attention. A naive prompt like "write a Python function that calculates Haversine distance" will produce code, but it will likely be missing type annotations, will have an inconsistent style, and may import libraries that are not available in the execution environment. A well-engineered prompt is far more specific:

SYSTEM_PROMPT = """
You are an expert Python engineer writing tool functions for an
AI agent system. You must follow these rules without exception:

1. Write exactly ONE Python function with the given name.
2. Every parameter must have a type annotation.
3. The return type must be annotated.
4. The function must have a Google-style docstring with an Args
   section and a Returns section.
5. Use only the Python standard library. Do not import third-party
   packages unless explicitly told they are available.
6. The function must be synchronous (not async).
7. Handle errors gracefully and return meaningful error messages.
8. Do not include any code outside the function definition.
"""

Each of these constraints exists for a concrete reason. The single function requirement prevents the generator from producing helper classes or module-level state that the registry cannot handle. The type annotation requirement enables automatic JSON Schema generation. The docstring requirement provides the description that the agent uses to decide when to call the tool. The standard-library-only constraint is a safety measure that prevents dependency hell. The synchronous requirement simplifies the execution model in the registry.
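A function that satisfies all eight rules looks something like this -- a plausible generator output for the Haversine request, not a canonical one (note the import placed inside the body, since rule 8 forbids code outside the function definition):

```python
def haversine_distance(lat1: float, lon1: float,
                       lat2: float, lon2: float) -> float:
    """Calculate the great-circle distance between two GPS coordinates.

    Args:
        lat1: Latitude of the first point in decimal degrees.
        lon1: Longitude of the first point in decimal degrees.
        lat2: Latitude of the second point in decimal degrees.
        lon2: Longitude of the second point in decimal degrees.

    Returns:
        The distance between the two points in kilometres.
    """
    import math  # stdlib only (rule 5), inside the body (rule 8)

    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))
```

Everything the registry needs is recoverable from this one definition: the annotations yield the JSON Schema, and the docstring yields the tool description.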

The code extraction step deserves special mention. LLMs, even when instructed to produce only code, frequently wrap their output in Markdown fences (triple backticks). The extractor must handle this gracefully, stripping the fences and any surrounding prose to isolate the raw Python source. A robust extractor uses a combination of regex matching and heuristic line-by-line scanning:

import re

def extract_code(raw_response: str) -> str:
    # Try to extract from a markdown code fence first.
    fence_match = re.search(
        r"```(?:python)?\n(.*?)```", raw_response, re.DOTALL
    )
    if fence_match:
        return fence_match.group(1).strip()
    # Fall back to finding the first 'def' line.
    lines = raw_response.splitlines()
    start = next(
        (i for i, line in enumerate(lines) if line.startswith("def ")),
        None,
    )
    if start is not None:
        return "\n".join(lines[start:]).strip()
    raise ValueError("No function definition found in LLM response.")

The temperature setting for the code generation LLM call is an important tuning parameter. Higher temperatures produce more creative but less reliable code. Lower temperatures produce more predictable, syntactically correct code but may miss creative solutions to unusual problems. In practice, a temperature of 0.2 strikes the right balance for code generation -- low enough to be reliable, high enough to handle unusual capability descriptions without getting stuck in degenerate patterns.

One of the most powerful features of the generation pipeline is retry logic. If the validator rejects the generated code, the error message from the validator is fed back into the LLM as a correction prompt:

for attempt in range(max_retries):
    raw = await self._call_llm(prompt)
    code = extract_code(raw)
    valid, errors = await self._validator.validate(code)
    if valid:
        return await self._registry.register_from_source(code, tags)
    # Feed errors back to the LLM for self-correction.
    prompt = self._build_correction_prompt(prompt, code, errors)

raise RuntimeError(f"Failed after {max_retries} attempts.")

This retry-with-feedback loop is a microcosm of the broader agentic pattern: observe, reason, act, observe again. The code generator is itself a tiny agent, trying to satisfy the validator's requirements through iterative refinement. In practice, most well-described capabilities are generated correctly on the first attempt. The retry loop handles edge cases and unusual requirements.


6. THE VALIDATOR: TRUST, BUT VERIFY

The validator is the component that makes the entire system safe enough to run in production. Without it, the code generator is a mechanism for arbitrary code execution -- a security nightmare. With it, the system can confidently accept and run LLM-generated code within well-defined boundaries.

The validator operates entirely on the Abstract Syntax Tree (AST) of the generated code, never executing it. This is the key insight: you can learn an enormous amount about what code does by analysing its structure, without running it and risking the consequences of malicious or buggy behaviour.

  VALIDATION PIPELINE

  Source Code (string)
           |
           v
  +--------+--------+
  |                 |
  |   ast.parse()   |  --> SyntaxError caught here
  |                 |
  +--------+--------+
           |
           v
  +--------+--------+
  |                 |
  | STRUCTURE CHECK |  --> exactly one top-level function?
  |                 |      correct name?
  +--------+--------+
           |
           v
  +--------+--------+
  |                 |
  | SIGNATURE CHECK |  --> all params annotated?
  |                 |      return type present?
  +--------+--------+
           |
           v
  +--------+--------+
  |                 |
  | SECURITY SCAN   |  --> forbidden calls? dangerous imports?
  |                 |      network access? file system writes?
  +--------+--------+
           |
           v
  +--------+--------+
  |                 |
  | COMPLEXITY CHECK|  --> too many lines? too deeply nested?
  |                 |
  +-----------------+

The security scan is the most important step. It walks the AST looking for patterns that indicate dangerous behaviour. The list of forbidden patterns includes direct calls to exec() or eval() (which would allow the generated code to execute further arbitrary code), imports of modules like os, sys, subprocess, socket, and shutil (which provide access to the file system, network, and process management), and any use of __import__ or importlib (which could be used to circumvent the import blacklist):

import ast

FORBIDDEN_IMPORTS = {
    "os", "sys", "subprocess", "socket", "shutil",
    "importlib", "ctypes", "pickle", "marshal",
}

FORBIDDEN_CALLS = {
    "exec", "eval", "compile", "__import__",
    "open", "breakpoint",
}

class SecurityVisitor(ast.NodeVisitor):
    def __init__(self) -> None:
        self.errors: list[str] = []

    def visit_Import(self, node: ast.Import) -> None:
        for alias in node.names:
            root = alias.name.split(".")[0]
            if root in FORBIDDEN_IMPORTS:
                self.errors.append(
                    f"Forbidden import: '{alias.name}'"
                )
        self.generic_visit(node)

    def visit_ImportFrom(self, node: ast.ImportFrom) -> None:
        # 'from os import system' must be caught as well.
        root = (node.module or "").split(".")[0]
        if root in FORBIDDEN_IMPORTS:
            self.errors.append(f"Forbidden import: '{node.module}'")
        self.generic_visit(node)

    def visit_Call(self, node: ast.Call) -> None:
        if isinstance(node.func, ast.Name):
            if node.func.id in FORBIDDEN_CALLS:
                self.errors.append(
                    f"Forbidden call: '{node.func.id}()'"
                )
        self.generic_visit(node)

The complexity check is a subtler safety measure. Code that is extremely long or deeply nested is harder to reason about and more likely to contain bugs or hidden malicious logic. By setting limits on the number of lines (say, 150) and the maximum nesting depth (say, 5), the validator ensures that generated tools remain simple, auditable, and predictable.
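The nesting-depth half of that check reduces to a short recursive walk over the AST. A minimal sketch (max_nesting_depth is an illustrative name, and the set of node types counted as nesting is a reasonable choice, not the only one):

```python
import ast

def max_nesting_depth(source_code: str) -> int:
    """Report the deepest block-nesting level in the given source."""
    # Node types that open a new indented block.
    NESTING = (ast.If, ast.For, ast.While, ast.With, ast.Try,
               ast.FunctionDef, ast.AsyncFunctionDef)

    def depth(node: ast.AST) -> int:
        child_depth = max(
            (depth(c) for c in ast.iter_child_nodes(node)), default=0
        )
        return child_depth + (1 if isinstance(node, NESTING) else 0)

    return depth(ast.parse(source_code))
```

A validator can then simply reject any submission where this value exceeds its configured limit.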

The signature check ensures that the generated function has the name that was requested, that all parameters have type annotations, and that the return type is declared. This is not just a style requirement -- it is a functional requirement, because the registry's schema generation depends on these annotations being present.
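The signature check, too, is pure AST inspection. A hedged sketch (check_signature is an illustrative name; the real validator folds this into its error list):

```python
import ast

def check_signature(source_code: str, expected_name: str) -> list[str]:
    """Verify the function's name, parameter annotations, and
    return annotation, returning a list of precise error messages."""
    errors: list[str] = []
    fn = ast.parse(source_code).body[0]
    if not isinstance(fn, ast.FunctionDef):
        return ["Top-level statement is not a function definition."]
    if fn.name != expected_name:
        errors.append(
            f"Expected function '{expected_name}', got '{fn.name}'."
        )
    for arg in fn.args.args:
        if arg.annotation is None:
            errors.append(
                f"Parameter '{arg.arg}' is missing a type annotation."
            )
    if fn.returns is None:
        errors.append("Return type annotation is missing.")
    return errors
```

Each message names the exact parameter at fault, which is what makes the generator's correction prompts effective.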

When the validator passes all checks, it returns a clean success signal. When it fails, it returns a structured list of error messages that are precise enough for the code generator's retry loop to act on. The quality of these error messages directly affects the quality of the self-correction loop -- vague errors produce vague corrections.

It is worth being honest about the limits of static analysis. A sufficiently determined adversary can write malicious code that passes AST-based checks. The validator is a strong first line of defence, not an impenetrable wall. In production systems, additional layers of defence -- sandboxed execution environments, resource limits, human review queues for generated code -- should be considered. The validator buys you a great deal of safety for very little cost, but it is not a substitute for a comprehensive security posture.


7. THE MCP SERVER: PLUGGING INTO THE PROTOCOL

The MCP Server is the component that makes the system interoperable with the broader ecosystem of MCP-compatible clients and agents. It translates between the agent's protocol-level tool calls and the registry's internal Python API.

The server is built on the low-level MCP Python SDK Server class, which gives us fine-grained control over how tools are listed and called. This is important because we need the tool list to be dynamic -- read fresh from the registry on every "tools/list" request -- rather than computed once at startup.

The most important design decision in the MCP server is the separation between meta-tools and dynamic tools. Meta-tools are the built-in tools that allow the agent to manage the registry: generate a new tool, list existing tools, inspect a tool's source code, remove a tool, and get registry statistics. These meta-tools are always present, regardless of what dynamic tools have been registered. Dynamic tools are everything the registry has accumulated at runtime.

  MCP SERVER -- TOOL NAMESPACE

  +-----------------------------------------------------------+
  |  META-TOOLS (always present)                              |
  |    - generate_and_register_tool                           |
  |    - list_registered_tools                                |
  |    - get_tool_source                                      |
  |    - remove_tool                                          |
  |    - get_registry_stats                                   |
  +-----------------------------------------------------------+
  |  DYNAMIC TOOLS (grow at runtime)                          |
  |    - haversine_distance          (generated at 14:03:22)  |
  |    - ascii_map_plotter           (generated at 14:05:11)  |
  |    - compound_interest_calc      (generated at 14:09:44)  |
  |    - ...                                                  |
  +-----------------------------------------------------------+

The "tools/list" handler is elegantly simple precisely because of the registry's clean API. It reads the current state of the registry on every call, ensuring that any tool registered since the last list request is immediately visible:

@self._server.list_tools()
async def handle_list_tools() -> list[mcp_types.Tool]:
    meta_tools = make_meta_tools()
    dynamic_entries = await self._registry.get_all_tools()
    dynamic_tools = [entry_to_mcp_tool(e) for e in dynamic_entries]
    return meta_tools + dynamic_tools

The entry_to_mcp_tool() conversion function is the bridge between the registry's internal ToolEntry representation and the MCP protocol's Tool descriptor. It extracts the name, description, and input schema from the ToolEntry and packages them into the format the protocol expects. This conversion is trivial in code but architecturally significant -- it is the point where the registry's internal world meets the external protocol world.
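The shape of that bridge can be sketched with plain dicts (the real implementation constructs mcp.types.Tool objects; the abbreviated ToolEntry here carries only the fields the conversion touches):

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class ToolEntry:  # abbreviated version of the registry's dataclass
    name: str
    description: str
    input_schema: dict[str, Any] = field(default_factory=dict)

def entry_to_mcp_tool(entry: ToolEntry) -> dict[str, Any]:
    """Map the registry's internal entry onto the MCP Tool
    descriptor shape."""
    return {
        "name": entry.name,
        "description": entry.description,
        "inputSchema": entry.input_schema,  # the protocol's camelCase
    }
```

The camelCase field name is the one piece of impedance mismatch: the registry speaks snake_case Python, the protocol speaks JSON.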

The "tools/call" handler routes incoming calls to either the meta-tool dispatcher or the registry's call_tool() method. Error handling here is critical: any exception from a tool call must be caught and returned as a structured error message, not allowed to propagate up and crash the server. The agent needs to be able to read the error, understand what went wrong, and decide how to proceed -- perhaps by generating a corrected version of the tool.
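The catch-all pattern can be sketched as follows (dispatch_tool_call and the dict result shape are illustrative; the real handler returns MCP content objects, and the registry call may itself be async):

```python
from collections.abc import Callable
from typing import Any

async def dispatch_tool_call(
    registry_call: Callable[..., Any],
    name: str,
    arguments: dict[str, Any],
) -> dict[str, Any]:
    """Route a tools/call request; exceptions become structured
    errors the agent can read instead of crashing the server."""
    try:
        result = registry_call(name, **arguments)
        return {"is_error": False, "content": str(result)}
    except Exception as exc:
        # Surface the failure so the agent can reason about it --
        # perhaps by regenerating a corrected version of the tool.
        return {"is_error": True, "content": f"{type(exc).__name__}: {exc}"}
```

Returning the exception type and message verbatim is deliberate: it gives the agent the same diagnostic a human developer would read.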

One subtle but important implementation detail concerns the MCP SDK's initialisation options. The get_capabilities() method requires a NotificationOptions object -- not None, not omitted, but an explicit instance. Passing None causes an AttributeError deep in the SDK when it tries to access the tools_changed attribute of the notification options. The correct import path is from mcp.server, not from mcp.server.models, a distinction that is easy to miss and painful to debug:

from mcp.server.models import InitializationOptions
from mcp.server import NotificationOptions   # NOT from mcp.server.models

# ...

capabilities = self._server.get_capabilities(
    notification_options=NotificationOptions(),
    experimental_capabilities={},
)

This kind of subtle SDK-version-specific detail is exactly the sort of thing that turns a two-hour integration into a two-day debugging session. Document it. Comment it. Never assume the obvious import path is the correct one.


8. THE AGENT LOOP: WHERE EVERYTHING COMES ALIVE

The agent loop is where all the components come together and the system begins to exhibit genuinely intelligent behaviour. Understanding the loop in detail is essential to understanding why the self-extension mechanism works as smoothly as it does.

The loop begins with a user message. The agent assembles its current context -- the conversation history, the system prompt, and the list of available tools from the MCP server -- and sends this to the LLM. The LLM responds with either a final answer or a tool call request.

  AGENT LOOP -- DETAILED FLOW

  User Message
       |
       v
  +----+----+
  |         |
  | Assemble|  <-- history + system prompt + tool list
  | Context |
  |         |
  +----+----+
       |
       v
  +----+----+
  |         |
  |  LLM    |  <-- OpenAI-compatible API call
  |  Call   |
  |         |
  +----+----+
       |
       +------------------+------------------+
       |                  |                  |
       v                  v                  v
  [Final Answer]   [Tool Call:         [Tool Call:
                    known tool]         unknown tool]
       |                  |                  |
       v                  v                  v
  [Return to       [Execute via        [Call generate_
   User]            Registry]           and_register_tool,
                         |              then retry]
                         v
                   [Observe Result]
                         |
                         v
                   [Back to LLM Call]

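The diagram above can be condensed into a loop skeleton. The interface names here (llm_call, registry, the reply dictionary shape) are assumptions for illustration, not verbatim from the project:

```python
async def run_turn(user_message: str, history: list, llm_call, registry) -> str:
    """One agent turn: iterate until the LLM produces a final answer."""
    history.append({"role": "user", "content": user_message})
    while True:
        tools = registry.list_tools()           # refreshed every iteration,
        reply = await llm_call(history, tools)  # so newly generated tools appear
        if reply.get("tool_call") is None:
            return reply["content"]             # final answer: leave the loop
        call = reply["tool_call"]
        try:
            result = await registry.call_tool(call["name"], call["args"])
            observation = str(result)
        except KeyError:
            # Unknown tool (this sketch's registry raises KeyError for it):
            # surface the gap so the LLM can decide to generate the tool.
            observation = f"Tool '{call['name']}' does not exist."
        except Exception as exc:
            observation = f"Tool error: {exc}"
        history.append({"role": "tool", "content": observation})
```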
The most interesting path through this diagram is the rightmost one: the case where the LLM determines that it needs a tool that does not exist. In a well-prompted system, the LLM is instructed to check the available tool list before attempting a task, and to call generate_and_register_tool if the required capability is absent.

The system prompt for the agent is carefully crafted to encourage this behaviour. It explains the meta-tools, gives examples of when to use them, and explicitly instructs the model to prefer reusing existing tools over generating new ones. This last point is important for efficiency: generating a tool takes several seconds and an LLM API call, so the agent should only do it when genuinely necessary.
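A hedged sketch of such a system prompt follows. The meta-tool names match the article, but the exact wording is an assumption, not the project's actual prompt:

```python
# Illustrative system prompt for a self-extending agent.
SYSTEM_PROMPT = """\
You are an agent with a dynamic tool set.

Rules:
1. Before starting a task, check the available tool list.
2. PREFER an existing tool over generating a new one -- generation is
   slow and costs an extra LLM call.
3. If no existing tool fits, call generate_and_register_tool with a
   precise description: purpose, parameter names and types, return type.
4. If a generated tool fails at runtime, inspect it with get_tool_source,
   then regenerate it with a corrected description.
"""
```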

The conversation flow for a self-extension event looks something like this in practice:

  USER:  "Calculate the Haversine distance between Paris
          (48.8566 N, 2.3522 E) and London (51.5074 N, 0.1278 W)"

  AGENT: [checks tool list -- no Haversine tool found]
         [calls generate_and_register_tool with description:
          "Calculate the great-circle distance between two GPS
           coordinates using the Haversine formula. Parameters:
           lat1, lon1, lat2, lon2 as floats in decimal degrees.
           Returns distance in kilometres as a float."]

  SYSTEM: [LLM generates code, validator approves, registry stores]
          [tool 'haversine_distance' now available]

  AGENT: [calls haversine_distance(48.8566, 2.3522, 51.5074, -0.1278)]

  TOOL:  343.56

  AGENT: "The Haversine distance between Paris and London is
          approximately 343.56 kilometres."

The entire self-extension event -- from the agent noticing the missing tool to having a result -- takes a few seconds. From the user's perspective, the system simply answered the question. The self-extension happened invisibly, in the background, as a natural part of the agent's reasoning process.

The agent loop also handles the case where a generated tool fails at runtime. If a tool raises an exception, the error is returned to the agent as a structured message. The agent can then inspect the tool's source code using get_tool_source, reason about what went wrong, call generate_and_register_tool with a corrected description, and try again. This creates a self-healing loop that can recover from generation errors without human intervention.

One important implementation detail in the agent loop is the management of the tool list across turns. The agent fetches the tool list at the beginning of each turn, not once at startup. This ensures that tools generated in earlier turns are available in later turns. Without this, the agent would generate the same tool over and over, never realising it had already created it.


9. DYNAMIC AGENTS: BEYOND TOOLS

Everything discussed so far has focused on dynamically generating tools -- individual functions that perform specific computations or actions. But the same architecture can be extended to generate and integrate entirely new agents. This is where the concept of a self-extending system becomes genuinely profound.

Consider the difference between a tool and an agent. A tool is a stateless function: it takes inputs, performs a computation, and returns an output. An agent is a stateful reasoning loop: it maintains context, makes sequences of decisions, calls tools of its own, and pursues goals over multiple steps. An agent is, in a sense, a tool that can think.

In the MCP framework, an agent can be exposed as a tool. From the calling agent's perspective, invoking a sub-agent looks identical to invoking a simple function -- it sends arguments and receives a result. The fact that the sub-agent internally runs a multi-step reasoning loop is an implementation detail hidden behind the tool interface.

  DYNAMIC AGENT GENERATION

  Orchestrator Agent
         |
         | "I need a sub-agent that can research a topic,
         |  summarise findings, and check facts"
         |
         v
  generate_and_register_tool
         |
         v
  Code Generator
         |
         | [generates an agent function that internally
         |  runs its own LLM loop with its own tools]
         |
         v
  Tool Registry
         |
         v
  "research_and_verify_agent" now available as a tool
         |
         v
  Orchestrator calls research_and_verify_agent("quantum computing")
         |
         v
  Sub-agent runs its own loop, returns structured summary

The generated agent function is a Python function that, internally, instantiates an LLM client, runs a reasoning loop, and returns a structured result. From the registry's perspective, it is just another callable. From the orchestrator's perspective, it is just another tool. But from the system's perspective, it is a new cognitive module that did not exist five minutes ago.

This pattern -- orchestrators generating sub-agents on demand -- suggests a path toward genuinely adaptive AI systems. An orchestrator that encounters a complex, multi-step task it cannot handle with its current tools can generate a specialised sub-agent for that task, delegate to it, and incorporate its results. The sub-agent can, in turn, generate its own tools or sub-agents if needed. The system grows its own cognitive architecture in response to the demands placed on it.

The implications are significant enough to warrant a moment of reflection. We are describing a system that can, in principle, expand its own capabilities without limit, in any direction, in response to any demand. This is extraordinarily powerful. It is also, as we will discuss in the next section, a capability that demands extraordinary care.

There are practical limits, of course. Generated agents are only as good as the LLM that generates them. Complex agent architectures are difficult to specify in natural language. The validator's constraints limit what kinds of code can be generated. And the standard-library-only constraint means that generated agents cannot use specialised frameworks without explicit allowlisting. But within these limits, the space of possible dynamically generated agents is vast.

One particularly promising application is the generation of domain-specific agents. An orchestrator handling a medical query might generate a sub-agent specialised in medical literature search and clinical reasoning. An orchestrator handling a financial query might generate a sub-agent specialised in market data analysis and risk calculation. Each of these sub-agents is generated once, stored in the registry, and reused for all subsequent queries of the same type. Over time, the system accumulates a library of specialised cognitive modules, each tailored to a specific domain or task type.

This is, in a very real sense, a form of learning. Not the gradient-descent kind of learning that updates model weights -- but a higher-level, architectural kind of learning that accumulates reusable cognitive structures. The system gets better at handling the kinds of tasks it has seen before, not by changing its underlying model, but by building up a library of tools and agents that encode hard-won problem-solving strategies.


10. SECURITY, SAFETY, AND THE RESPONSIBLE PATH FORWARD

No article about dynamically executing LLM-generated code would be complete without an honest, detailed discussion of the risks. The self-extending agent is a powerful system, and powerful systems can cause serious harm if deployed carelessly.

The threat model has several distinct layers, and each requires its own mitigation strategy.

  THREAT MODEL

  +----------------------------------------------------------+
  |  THREAT LAYER 1: Prompt Injection                        |
  |  A malicious user crafts a prompt that causes the agent  |
  |  to generate a tool with harmful behaviour.              |
  |  MITIGATION: Validator, sandboxing, human review queue.  |
  +----------------------------------------------------------+
  |  THREAT LAYER 2: LLM Hallucination                       |
  |  The code generator produces plausible-looking but       |
  |  incorrect code that passes validation but fails at      |
  |  runtime in subtle ways.                                 |
  |  MITIGATION: Retry loop, runtime error capture, testing. |
  +----------------------------------------------------------+
  |  THREAT LAYER 3: Resource Exhaustion                     |
  |  A generated tool enters an infinite loop or allocates   |
  |  unbounded memory.                                       |
  |  MITIGATION: Execution timeouts, memory limits,          |
  |              asyncio.wait_for() wrappers.                |
  +----------------------------------------------------------+
  |  THREAT LAYER 4: Registry Pollution                      |
  |  The registry accumulates many low-quality or redundant  |
  |  tools, degrading performance and confusing the agent.   |
  |  MITIGATION: Tool quality scoring, automatic pruning,    |
  |              human review of the tool catalogue.         |
  +----------------------------------------------------------+
  |  THREAT LAYER 5: Cascading Failures                      |
  |  A generated sub-agent generates its own sub-agents,     |
  |  which generate further sub-agents, consuming resources  |
  |  without bound.                                          |
  |  MITIGATION: Recursion depth limits, generation budgets, |
  |              circuit breakers.                           |
  +----------------------------------------------------------+

The validator addresses the most immediate security concern -- arbitrary code execution -- but it is not sufficient on its own. A defence-in-depth approach is required. The execution environment for generated tools should be sandboxed, ideally using a container or a restricted Python interpreter that limits access to the file system, network, and process management. The asyncio.wait_for() wrapper should be applied to all tool calls to enforce execution time limits. Memory usage should be monitored and capped.
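The timeout wrapper mentioned above is a few lines of asyncio. This is a minimal sketch; the structured-error shape is an assumption to match the rest of the system:

```python
import asyncio


async def call_with_timeout(tool_coro_fn, arguments: dict,
                            timeout_s: float = 10.0) -> dict:
    """Run one tool call under a hard wall-clock limit.

    asyncio.wait_for cancels the awaited call when the limit expires,
    turning a runaway tool into a structured, readable error the agent
    can reason about.
    """
    try:
        result = await asyncio.wait_for(tool_coro_fn(**arguments),
                                        timeout=timeout_s)
        return {"is_error": False, "content": str(result)}
    except asyncio.TimeoutError:
        return {"is_error": True,
                "content": f"Tool exceeded the {timeout_s}s execution limit."}
```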

The prompt injection threat is particularly subtle. A user who understands the system's architecture might craft a capability description that sounds innocent but produces harmful code. For example, a description like "a tool that reads configuration from the environment and returns it" might produce code that exfiltrates environment variables. The validator's import blacklist catches the most obvious cases (os.environ requires the os module), but creative attackers may find ways around it. Human review of generated code, at least for sensitive deployments, is a prudent additional safeguard.

The registry pollution problem is underappreciated. As the system runs over time, it accumulates tools. Some of these tools are high-quality and frequently used. Others are one-off solutions to unusual problems, generated once and never called again. A large, cluttered registry degrades the agent's ability to find the right tool -- the tool list becomes so long that the LLM's attention is diluted across hundreds of options. Automatic pruning strategies -- removing tools that have not been called in a certain time window, or that have a high error rate -- help keep the registry lean and useful.

The cascading agent generation problem is the most exotic risk but also the most important to think about in advance. If agents can generate sub-agents, and sub-agents can generate their own sub-agents, the system has the potential to grow without bound. A generation budget -- a hard limit on the number of tools or agents that can be generated in a single session -- is a simple and effective safeguard. A recursion depth limit prevents sub-agents from spawning their own sub-agents beyond a configurable depth.
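Both safeguards fit in a small guard object checked before every generation attempt. The class and method names below are illustrative, not from the project:

```python
class GenerationBudget:
    """Hard limits on self-extension: a cap on total generations per
    session and a cap on sub-agent recursion depth."""

    def __init__(self, max_generations: int = 10, max_depth: int = 2):
        self.max_generations = max_generations
        self.max_depth = max_depth
        self.used = 0

    def authorise(self, depth: int) -> None:
        """Raise if one more generation would exceed either limit;
        otherwise record it against the budget."""
        if depth > self.max_depth:
            raise RuntimeError(
                f"recursion depth {depth} exceeds limit {self.max_depth}")
        if self.used >= self.max_generations:
            raise RuntimeError(
                f"generation budget of {self.max_generations} is spent")
        self.used += 1
```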

Finally, it is worth reflecting on the broader ethical dimension. A system that can extend its own capabilities is a system that can surprise its operators. The tools it generates may solve problems in unexpected ways. The agents it creates may exhibit emergent behaviours that were not anticipated. This is not necessarily bad -- surprising solutions are often the best ones -- but it requires a culture of careful monitoring, transparent logging, and willingness to intervene.

The system described in this article logs all generated code, all validation results, all tool calls, and all errors. This logging is not optional -- it is the foundation of the human oversight that makes the system trustworthy. An operator who can see exactly what code was generated, when, why, and with what results, is an operator who can maintain meaningful control over a self-extending system.


11. CONCLUSION: THE BEGINNING OF SOMETHING LARGER

We began with a simple, uncomfortable observation: tool-based AI systems are always incomplete. The world is infinite; your tool list is not. Every static system will eventually encounter a user need it cannot meet.

The self-extending agent architecture described in this article is a principled, practical response to that observation. By combining a dynamic tool registry, an LLM-powered code generator, a static analysis validator, and the Model Context Protocol, we can build systems that grow their own capabilities in real time, in response to real user needs, without human intervention and without restarting.

The key insights of this architecture are worth restating clearly.

The separation between meta-tools and dynamic tools is what makes the system self-referential without being self-destructive. The agent can manage its own tool set using the same interface it uses to call tools, without any special-casing in the agent loop itself.

The validator is what makes the system safe enough to deploy. Static AST analysis cannot catch every possible threat, but it catches the most common and most dangerous ones, and it does so without executing any code. The cost of validation is negligible; the benefit is enormous.

The retry-with-feedback loop is what makes the system robust. LLMs make mistakes. Generated code is sometimes wrong. A system that can observe its own failures, reason about them, and try again is a system that can recover from the inevitable imperfections of LLM-generated code.

The living tool list -- fetched fresh on every agent turn -- is what makes the system coherent across time. Tools generated in earlier turns are available in later turns. The agent's capabilities accumulate, turn by turn, building toward a richer and richer problem-solving repertoire.

And the extension to dynamic agents is what makes the system genuinely open-ended. When the unit of dynamic generation is not a function but an agent -- a reasoning loop with its own tools and its own context -- the system can grow cognitive structures of arbitrary complexity. It can specialise, delegate, and orchestrate in ways that were not anticipated at design time.

+------------------------------------------------------------------+
|                                                                  |
|   The most interesting AI systems are not the ones that          |
|   know the most at the start.                                    |
|                                                                  |
|   They are the ones that learn the fastest --                    |
|   not by updating their weights,                                 |
|   but by building their own tools.                               |
|                                                                  |
+------------------------------------------------------------------+

We are at the very beginning of understanding what self-extending AI systems can do. The architecture described here is a starting point, not a destination. The tool registry will grow more sophisticated. The validator will become more nuanced. The code generator will learn to produce better code from richer descriptions. The agent loop will develop more intelligent strategies for deciding when to generate versus when to reuse.

But the fundamental insight -- that an AI system can be given the ability to build its own tools, and that this ability transforms a static, brittle system into a dynamic, adaptive one -- that insight is here to stay. The hammer that forges its own hammers is a different kind of tool entirely.

And we have only just begun to understand what it can build.


  +------------------------------------------------------------+
  |  APPENDIX: COMPONENT SUMMARY                               |
  +------------------------------------------------------------+
  |                                                            |
  |  tool_registry.py   -- ToolEntry, DynamicToolRegistry      |
  |  code_generator.py  -- CodeGenerationPipeline              |
  |  tool_validator.py  -- ToolValidator, SecurityVisitor      |
  |  mcp_server.py      -- DynamicMCPServer, meta-tools        |
  |  agent.py           -- Agent loop, tool-call dispatch      |
  |  main.py            -- Entry point, component wiring       |
  |                                                            |
  +------------------------------------------------------------+
  |  KEY DEPENDENCIES                                          |
  +------------------------------------------------------------+
  |                                                            |
  |  mcp          -- Model Context Protocol Python SDK         |
  |  anyio        -- Async I/O abstraction layer               |
  |  openai       -- OpenAI-compatible API client              |
  |  asyncio      -- Python standard library async runtime     |
  |                                                            |
  +------------------------------------------------------------+
