FOREWORD: WHY YOUR APPLICATION DESERVES A BRAIN

Imagine you are sitting in front of your favourite text processor. You have written a long technical document, maybe a software design specification or a research paper, and you suddenly realize that every occurrence of the word "one" inside a headline needs to become the word "two" because the project version changed. A simple find-and-replace would change every single occurrence in the entire document, including the body text, footnotes, captions, and code listings, which is exactly what you do not want. You want something that understands context. You want something that knows the difference between a headline and a paragraph. You want, in short, a language model.

Large Language Models, or LLMs, are not magic. They are very large neural networks trained on enormous amounts of text, and they have learned to predict what comes next in a sequence of tokens with astonishing accuracy. That prediction ability, when combined with careful prompting and a well-designed integration layer, turns an LLM into a reasoning engine that can interpret natural language instructions, understand the structure of a document, and produce structured output that your application can act upon.

This tutorial is about exactly that: how you, as an engineer, take an existing application and extend it with an LLM so that users can give natural language commands that the application executes intelligently. We will use a text processor as our running example because it is rich enough to illustrate every important concept, from simple text transformations to figure generation and appendix creation, but every principle we discuss applies equally to IDEs, spreadsheet tools, CAD systems, data analysis pipelines, and any other application you can imagine.

We will cover the architecture of an LLM-augmented application, the difference between local and remote LLMs and when to choose each, the mechanics of prompting and structured output, the concept of tool use and function calling, the orchestration of multi-step agentic workflows, multimodal extensions for generating and embedding figures, and the practical engineering details that separate a toy prototype from a production-ready system.

Every code example in this tutorial is real, grounded in actual APIs and libraries, and explained in enough detail that you can run it yourself. Let us begin.

PART ONE: UNDERSTANDING THE LANDSCAPE

CHAPTER 1: WHAT IS AN LLM AND WHY DOES IT MATTER FOR APPLICATION DEVELOPERS?

A Large Language Model is a neural network, almost always based on the Transformer architecture introduced by Vaswani et al. in 2017, that has been trained to model the probability distribution of text. Given a sequence of tokens (roughly, word fragments), the model predicts the probability of the next token. By sampling from this distribution repeatedly, the model generates coherent, contextually appropriate text.
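The sampling step described above can be sketched in a few lines. This is a toy illustration, not how any particular model is implemented: the function name sample_next_token and the scores in toy_scores are invented for the example, and a real model produces scores over tens of thousands of tokens rather than three.

```python
import math
import random
from typing import Dict, Optional


def sample_next_token(scores: Dict[str, float],
                      temperature: float = 1.0,
                      rng: Optional[random.Random] = None) -> str:
    """Pick one token from a softmax over raw scores (logits)."""
    if temperature <= 0:
        # Greedy decoding: always take the highest-scoring token.
        return max(scores, key=scores.get)

    rng = rng or random.Random()
    # Dividing by the temperature sharpens (<1) or flattens (>1)
    # the resulting probability distribution.
    scaled = {tok: s / temperature for tok, s in scores.items()}
    peak = max(scaled.values())  # subtracted for numerical stability
    weights = [math.exp(s - peak) for s in scaled.values()]
    return rng.choices(list(scaled), weights=weights, k=1)[0]


# Hypothetical scores for the token that follows
# "Move fib() one tab to the".
toy_scores = {" right": 4.0, " left": 2.5, " banana": -3.0}

greedy = sample_next_token(toy_scores, temperature=0)
sampled = sample_next_token(toy_scores, temperature=0.8,
                            rng=random.Random(42))
```

A low temperature makes the most likely token dominate, which is why editing assistants are usually run with a small value.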

What makes modern LLMs remarkable for application developers is not just that they generate fluent text. It is that they have internalized an enormous amount of knowledge about the world, about programming languages, about document structure, about reasoning patterns, and about how humans express intent in natural language. When you ask an LLM to "move the function fib() one tab to the right," it understands what a function is, what indentation means in the context of source code, and what "one tab" means in terms of spaces. A traditional regular expression or AST transformation tool could do this too, but it would require you to write the rule explicitly. The LLM infers the rule from your natural language description.

This is the core value proposition: LLMs let users express intent in natural language, and your application translates that intent into action. The LLM becomes the universal interpreter between human thought and machine operation.

The practical implication for you as an engineer is that you are no longer building a fixed set of commands. You are building an open-ended interface where the user's vocabulary is the entire English language (or any other language the model supports), and your job is to design the system that maps that vocabulary onto the operations your application can perform.

CHAPTER 2: LOCAL VS. REMOTE LLMs - CHOOSING YOUR ENGINE

Before you write a single line of integration code, you need to decide where your LLM will run. This is not a trivial decision, and it has significant consequences for latency, cost, privacy, capability, and deployment complexity.

Remote LLMs are models that run on someone else's infrastructure and are accessed via an API over the internet. The most prominent examples are OpenAI's GPT-5.5, Anthropic's Claude 4.6 Sonnet, Google's Gemini 3.1 Pro, and Mistral's large models. These models are extremely capable, regularly updated, and require no hardware investment on your part. You pay per token consumed. The downsides are that your data leaves your network (a serious concern for confidential documents), latency can never drop below the network round-trip time, costs can accumulate quickly at scale, and you are dependent on the provider's availability and pricing decisions.

Local LLMs are models that run on hardware you control, either on the user's machine or on your own servers. The tooling ecosystem for local LLMs has matured enormously. Ollama (https://ollama.com) is currently the most developer-friendly way to run local LLMs. It packages models like Llama 4.0, Mistral, Phi-4, Gemma 3, Qwen 3.5, and many others into a simple server that exposes an OpenAI-compatible REST API. LM Studio (https://lmstudio.ai) provides a graphical interface for the same purpose. llama.cpp (https://github.com/ggerganov/llama.cpp) is the underlying inference engine that most of these tools use, and it supports quantized models that run efficiently on consumer hardware, including Apple Silicon Macs and NVIDIA GPUs.

The practical decision matrix looks like this: if your application handles sensitive or confidential documents, such as legal contracts, medical records, or proprietary engineering specifications, you should strongly prefer a local LLM. If you need the highest possible reasoning capability and your data sensitivity allows it, a remote model like GPT-5.5 is hard to beat. Many production systems use a hybrid approach: a local model handles routine tasks and sensitive data, while a remote model is invoked only for complex reasoning tasks on sanitized or non-sensitive content.

The beautiful engineering insight is that, because Ollama exposes an OpenAI-compatible API, you can write your integration code once and switch between local and remote models by changing a single configuration parameter. We will exploit this throughout the tutorial.

Here is a concrete illustration of the API surface you will be working with. When you run Ollama locally, it starts a server on port 11434. When you use OpenAI's API, you point to https://api.openai.com. The request and response formats are the same in both cases; only the base URL, the model name, and the Authorization header differ, as shown in the two JSON examples below.

Request to a local Ollama server:

POST http://localhost:11434/v1/chat/completions
Content-Type: application/json

{
  "model": "llama3.1:8b",
  "messages": [
    {
      "role": "system",
      "content": "You are a document editing assistant."
    },
    {
      "role": "user",
      "content": "Change all headlines containing 'one' to use 'two' instead."
    }
  ],
  "temperature": 0.2
}

The equivalent request to OpenAI's remote API:

POST https://api.openai.com/v1/chat/completions
Authorization: Bearer sk-...
Content-Type: application/json

{
  "model": "gpt-4o",
  "messages": [
    {
      "role": "system",
      "content": "You are a document editing assistant."
    },
    {
      "role": "user",
      "content": "Change all headlines containing 'one' to use 'two' instead."
    }
  ],
  "temperature": 0.2
}

This API compatibility is not an accident. It is a deliberate design choice by the open-source community to ensure portability, and it is one of the most important engineering facts you need to know when building LLM-augmented applications.
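To make the "single configuration parameter" idea concrete, here is a minimal sketch that builds either of the two requests above from one backend name. The BACKENDS table and the build_chat_request helper are names made up for this illustration; the sketch only constructs the request and does not send it.

```python
import json
from typing import List, Optional, Tuple

# Hypothetical configuration table: one entry per backend.
BACKENDS = {
    "local": {
        "base_url": "http://localhost:11434/v1",
        "model": "llama3.1:8b",
    },
    "remote": {
        "base_url": "https://api.openai.com/v1",
        "model": "gpt-4o",
    },
}


def build_chat_request(backend_name: str,
                       messages: List[dict],
                       api_key: Optional[str] = None,
                       temperature: float = 0.2) -> Tuple[str, dict, str]:
    """Return (url, headers, json_body) for a chat completions call."""
    cfg = BACKENDS[backend_name]
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Only the remote backend needs a real key.
        headers["Authorization"] = f"Bearer {api_key}"
    body = {
        "model": cfg["model"],
        "messages": messages,
        "temperature": temperature,
    }
    return cfg["base_url"] + "/chat/completions", headers, json.dumps(body)


messages = [
    {"role": "system", "content": "You are a document editing assistant."},
    {"role": "user",
     "content": "Change all headlines containing 'one' to use 'two' instead."},
]

local_url, local_headers, local_body = build_chat_request("local", messages)
remote_url, remote_headers, remote_body = build_chat_request(
    "remote", messages, api_key="test-key")
```

Everything except the URL, the model name, and the Authorization header is shared, so the rest of the integration code does not need to know which backend is active.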

PART TWO: ARCHITECTURE - HOW TO WIRE AN LLM INTO YOUR APPLICATION

CHAPTER 3: THE FOUR-LAYER ARCHITECTURE

A well-designed LLM-augmented application has four distinct layers, and understanding each layer's responsibility is essential before you write any code.

The first layer is the Application Core. This is your existing application: the text processor, the IDE, the spreadsheet tool. It has its own data model (a document object, an AST, a spreadsheet grid), its own rendering engine, and its own set of operations it can perform. You do not rewrite this layer. You extend it.

The second layer is the Tool Layer. This is a set of functions that expose the application's capabilities to the LLM in a structured way. Each tool has a name, a description in natural language, and a schema that defines its parameters. The LLM reads these descriptions and decides which tools to call and with what arguments. We will spend a great deal of time on this layer because it is where most of the engineering work happens.

The third layer is the Orchestration Layer. This is the code that manages the conversation with the LLM, sends tool call results back to the model, handles multi-step reasoning, and decides when the task is complete. In simple cases, this is a straightforward request-response loop. In complex agentic scenarios, it becomes a state machine or even a graph of reasoning steps.

The fourth layer is the LLM Backend. This is the actual language model, running either locally via Ollama or remotely via an API. The orchestration layer communicates with this backend through the OpenAI-compatible chat completions API.

Here is a diagram of these four layers:

+----------------------------------------------------------+
|                    USER INTERFACE                        |
|  (Natural language command bar:                          |
|   e.g., "Move fib() one tab to the right")              |
+----------------------------------------------------------+
                          |
                          v
+----------------------------------------------------------+
|               ORCHESTRATION LAYER                        |
|  - Builds system prompt with document context            |
|  - Sends messages to LLM backend                         |
|  - Receives tool call requests from LLM                  |
|  - Dispatches tool calls to Tool Layer                   |
|  - Loops until task is complete                          |
+----------------------------------------------------------+
      |                                      |
      v                                      v
+-------------------+          +----------------------------+
|   TOOL LAYER      |          |      LLM BACKEND           |
|  replace_text()   |          |  Local: Ollama/llama.cpp   |
|  indent_code()    |          |  Remote: OpenAI/Anthropic  |
|  set_font()       |          |  (OpenAI-compatible API)   |
|  add_appendix()   |          +----------------------------+
|  generate_fig()   |
|  read_section()   |
+-------------------+
      |
      v
+----------------------------------------------------------+
|               APPLICATION CORE                           |
|  Document Object Model: paragraphs, headings, styles,    |
|  code blocks, figures, sections, appendices              |
+----------------------------------------------------------+

This architecture is clean, extensible, and testable. Each layer has a single responsibility. The Tool Layer is the most important interface because it is the contract between the LLM's reasoning and your application's capabilities. If you define your tools well, the LLM will use them correctly. If you define them poorly, you will spend hours debugging mysterious failures.
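One way to keep the Tool Layer contract in a single place is to store each tool's name, description, parameter schema, and implementation together. The following is a sketch under that assumption; the Tool and ToolRegistry classes and the count_headings example tool are hypothetical and are not part of any library. The schema layout it emits is the function calling format discussed in Chapter 4.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Tool:
    """One capability of the Application Core exposed to the LLM."""
    name: str
    description: str
    parameters: dict
    implementation: Callable[..., dict]

    def to_schema(self) -> dict:
        # The shape the LLM backend expects in the "tools" list.
        return {
            "type": "function",
            "function": {
                "name": self.name,
                "description": self.description,
                "parameters": self.parameters,
            },
        }


class ToolRegistry:
    """Tool Layer: holds definitions and dispatches calls by name."""

    def __init__(self) -> None:
        self._tools: Dict[str, Tool] = {}

    def register(self, tool: Tool) -> None:
        if tool.name in self._tools:
            raise ValueError(f"Duplicate tool name: {tool.name}")
        self._tools[tool.name] = tool

    def schemas(self) -> List[dict]:
        return [tool.to_schema() for tool in self._tools.values()]

    def call(self, name: str, arguments: dict) -> dict:
        tool = self._tools.get(name)
        if tool is None:
            return {"error": f"Unknown tool: {name}"}
        return tool.implementation(**arguments)


# Stand-in for the Application Core: a list of heading texts.
headings = ["Chapter one: Introduction", "Section one: Background"]


def count_headings() -> dict:
    return {"status": "ok", "count": len(headings)}


registry = ToolRegistry()
registry.register(Tool(
    name="count_headings",
    description="Returns the number of headings in the document.",
    parameters={"type": "object", "properties": {}},
    implementation=count_headings,
))
```

Because the definition and the implementation live in one object, the list sent to the LLM and the dispatch table cannot drift apart.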

CHAPTER 4: TOOL USE AND FUNCTION CALLING - THE HEART OF THE INTEGRATION

Tool use, also called function calling, is the mechanism by which an LLM requests that your application execute a specific function with specific arguments. It was introduced by OpenAI in June 2023 and has since been adopted by virtually every major LLM provider and many open-source models including Llama 3.1, Mistral, Qwen 2.5, and Phi-3.5.

The mechanism works as follows. You define a set of tools as JSON schemas and include them in your API request. The LLM, instead of generating a text response, generates a structured JSON object that specifies which tool to call and what arguments to pass. Your orchestration layer receives this JSON, executes the corresponding function, and sends the result back to the LLM as a new message. The LLM then either calls another tool or generates a final text response indicating that the task is complete.

This is a profoundly important design because it means the LLM is not directly modifying your document. The LLM is reasoning about what needs to be done and requesting that your application do it. Your application remains in control at all times. You can validate the LLM's requests before executing them, log every action for auditing, implement undo/redo, and enforce safety constraints.

Let us look at a concrete tool definition. Suppose we want to give the LLM the ability to replace text in headings only. Here is how we define this tool in the OpenAI function calling format:

{
  "type": "function",
  "function": {
    "name": "replace_in_headings",
    "description": "Replaces all occurrences of a search string with a replacement string, but only within heading paragraphs (H1, H2, H3, etc.). Does not affect body text, captions, footnotes, or code blocks.",
    "parameters": {
      "type": "object",
      "properties": {
        "search": {
          "type": "string",
          "description": "The exact text to search for within headings."
        },
        "replace": {
          "type": "string",
          "description": "The text that will replace each occurrence of the search string."
        },
        "case_sensitive": {
          "type": "boolean",
          "description": "Whether the search should be case-sensitive. Defaults to false.",
          "default": false
        }
      },
      "required": ["search", "replace"]
    }
  }
}

Notice how the description is written in plain English and is very specific about what the tool does and, crucially, what it does NOT do. This specificity is essential. The LLM uses the description to decide whether this is the right tool for the job. If your description is vague, the LLM may call the wrong tool or call the right tool with wrong assumptions about its behavior.
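Because the application stays in control, it can check the arguments the LLM supplies against the tool's parameter schema before executing anything. The sketch below covers only required fields and basic types; validate_arguments and JSON_TYPES are illustrative names, and a production system would more likely rely on a full JSON Schema validator.

```python
# Maps JSON Schema type names to the Python types json.loads produces.
JSON_TYPES = {
    "string": str,
    "boolean": bool,
    "integer": int,
    "number": (int, float),
    "object": dict,
    "array": list,
}


def validate_arguments(parameters: dict, arguments: dict) -> list:
    """Return a list of human-readable problems (empty if none)."""
    errors = []
    properties = parameters.get("properties", {})

    for name in parameters.get("required", []):
        if name not in arguments:
            errors.append(f"missing required argument: {name}")

    for name, value in arguments.items():
        spec = properties.get(name)
        if spec is None:
            errors.append(f"unexpected argument: {name}")
            continue
        expected = JSON_TYPES.get(spec.get("type"))
        if expected is None:
            continue  # no type declared; nothing to check
        # bool is a subclass of int in Python, so reject True/False
        # explicitly wherever the schema does not ask for a boolean.
        is_bool = isinstance(value, bool)
        wrong_type = (not isinstance(value, expected)
                      or (is_bool and spec["type"] != "boolean"))
        if wrong_type:
            errors.append(
                f"argument '{name}' should be {spec['type']}, "
                f"got {type(value).__name__}"
            )
    return errors


replace_params = {
    "type": "object",
    "properties": {
        "search": {"type": "string"},
        "replace": {"type": "string"},
        "case_sensitive": {"type": "boolean", "default": False},
    },
    "required": ["search", "replace"],
}

ok = validate_arguments(replace_params, {"search": "one", "replace": "two"})
missing = validate_arguments(replace_params, {"search": "one"})
```

Returning the error list to the LLM as the tool result gives the model a chance to correct its own call on the next step.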

Now let us look at the full orchestration loop in Python. Before we do, here is the complete set of imports that all subsequent code examples in this tutorial depend on. Gathering them in one place avoids the confusion of scattered, inconsistent import statements across individual snippets:

# -----------------------------------------------------------------------
# CONSOLIDATED IMPORTS FOR ALL CODE EXAMPLES IN THIS TUTORIAL
# -----------------------------------------------------------------------
import json
import re
import time
import shutil
import logging
import base64
import requests
from pathlib import Path
from datetime import datetime

# python-docx: pip install python-docx
from docx import Document
from docx.shared import Pt, Inches
from docx.oxml import OxmlElement
from docx.oxml.ns import qn
from docx.enum.text import WD_ALIGN_PARAGRAPH, WD_BREAK

# openai: pip install openai
from openai import OpenAI

# anthropic: pip install anthropic
import anthropic

logger = logging.getLogger(__name__)

With imports out of the way, here is the orchestration loop. This example uses the openai Python library, which works with both OpenAI's API and with Ollama's OpenAI-compatible endpoint by simply changing the base_url parameter:

# -----------------------------------------------------------------------
# LLM CLIENT CONFIGURATION
# -----------------------------------------------------------------------

# For Ollama (local) - no data leaves your machine:
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
MODEL  = "llama3.1:8b"

# For OpenAI (remote) - comment out the above two lines and use these:
# client = OpenAI(api_key="sk-your-key-here")
# MODEL  = "gpt-4o"

# -----------------------------------------------------------------------
# TOOL DEFINITIONS - what the LLM is allowed to call
# -----------------------------------------------------------------------

TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "replace_in_headings",
            "description": (
                "Replaces all occurrences of a search string with a "
                "replacement string, but only within heading paragraphs "
                "(H1, H2, H3). Does not affect body text or code blocks."
            ),
            "parameters": {
                "type": "object",
                "properties": {
                    "search":  {"type": "string"},
                    "replace": {"type": "string"},
                    "case_sensitive": {
                        "type": "boolean",
                        "default": False
                    }
                },
                "required": ["search", "replace"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "indent_function",
            "description": (
                "Adds indentation to every line of a named Python function "
                "(the def line and its body) found in a code block paragraph. "
                "The indent_spaces parameter specifies how many additional "
                "spaces to prepend to each of those lines."
            ),
            "parameters": {
                "type": "object",
                "properties": {
                    "function_name": {"type": "string"},
                    "indent_spaces": {
                        "type": "integer",
                        "default": 4
                    }
                },
                "required": ["function_name"]
            }
        }
    }
    # Additional tools (read_section, set_style_font, add_appendix,
    # generate_figure, insert_figure, insert_text_after,
    # replace_section_content) are defined in later chapters.
    # In a real application, all tools would be listed here.
]

# -----------------------------------------------------------------------
# TOOL DISPATCH REGISTRY
# Maps tool names to their Python implementations.
# Extend this dict whenever you add a new tool.
# -----------------------------------------------------------------------

TOOL_IMPLEMENTATIONS = {
    "replace_in_headings":    lambda **kw: tool_replace_in_headings(**kw),
    "indent_function":        lambda **kw: tool_indent_function(**kw),
    # Additional entries added as tools are defined in later chapters:
    # "read_section":           lambda **kw: tool_read_section(**kw),
    # "set_style_font":         lambda **kw: tool_set_style_font(**kw),
    # "add_appendix":           lambda **kw: tool_add_appendix(**kw),
    # "generate_figure":        lambda **kw: tool_generate_figure(**kw),
    # "insert_figure":          lambda **kw: tool_insert_figure(**kw),
    # "insert_text_after":      lambda **kw: tool_insert_text_after(**kw),
    # "replace_section_content":lambda **kw: tool_replace_section_content(**kw),
}

# -----------------------------------------------------------------------
# HELPER: build the initial message list for any agent call
# -----------------------------------------------------------------------

def build_initial_messages(user_instruction: str,
                            document_context: str) -> list:
    """
    Constructs the opening system + user messages for the agent loop.
    The system message gives the LLM its role and the document context.
    The user message contains the natural language instruction to execute.
    """
    return [
        {
            "role": "system",
            "content": (
                "You are a precise document editing assistant. "
                "You have access to tools that modify the document. "
                "Use them to fulfill the user's instruction exactly. "
                "Here is the current document structure:\n\n"
                + document_context
            )
        },
        {
            "role": "user",
            "content": user_instruction
        }
    ]

# -----------------------------------------------------------------------
# CORE AGENT LOOP
# -----------------------------------------------------------------------

def run_agent(user_instruction: str,
              document_context: str,
              client: OpenAI,
              model: str) -> str:
    """
    Runs the LLM agent loop for a single user instruction.
    Keeps calling the LLM until it stops requesting tool calls,
    then returns the LLM's final human-readable summary.

    Parameters
    ----------
    user_instruction  : The natural language command from the user.
    document_context  : A serialized summary of the document structure.
    client            : An OpenAI-compatible client (local or remote).
    model             : The model identifier string (e.g. "llama3.1:8b").
    """
    messages = build_initial_messages(user_instruction, document_context)

    while True:
        response = client.chat.completions.create(
            model=model,
            messages=messages,
            tools=TOOLS,
            tool_choice="auto"
        )

        message = response.choices[0].message

        if message.tool_calls:
            # Append the assistant's tool-call request to the history
            # so the LLM can see what it already asked for.
            messages.append(message)

            for tool_call in message.tool_calls:
                func_name = tool_call.function.name
                func_args = json.loads(tool_call.function.arguments)

                # Dispatch to the registered implementation.
                result = TOOL_IMPLEMENTATIONS.get(
                    func_name,
                    lambda **_: {"error": f"Unknown tool: {func_name}"}
                )(**func_args)

                # Return the result to the LLM as a tool message.
                messages.append({
                    "role":        "tool",
                    "tool_call_id": tool_call.id,
                    "content":     json.dumps(result)
                })

        else:
            # The LLM produced a plain text response: task is complete.
            return message.content

Let us pause and appreciate what is happening in this orchestration loop. The while True loop is the agentic loop. It runs until the LLM decides it has finished the task and returns a plain text response instead of a tool call. This is how multi-step tasks work: the LLM might call replace_in_headings, receive the result, decide it also needs to call indent_function, receive that result, and only then conclude that the task is complete. Each iteration of the loop is one reasoning step.
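The loop as written runs until the model stops requesting tools, so a model that keeps calling tools would never return. A common safeguard is a step limit combined with defensive parsing of the arguments. The sketch below illustrates the idea with plain dictionaries instead of the openai client so that it can run without a model; run_bounded_agent, call_llm, and dispatch are hypothetical names, and the reply format is a simplification of the real response objects.

```python
import json
from typing import Callable, List


def run_bounded_agent(call_llm: Callable[[List[dict]], dict],
                      dispatch: Callable[[str, dict], dict],
                      messages: List[dict],
                      max_steps: int = 8) -> str:
    """Agent loop that gives up after max_steps LLM round trips."""
    for _ in range(max_steps):
        reply = call_llm(messages)
        tool_calls = reply.get("tool_calls") or []
        if not tool_calls:
            # Plain text reply: the task is complete.
            return reply.get("content", "")

        messages.append({"role": "assistant", "content": None,
                         "tool_calls": tool_calls})
        for call in tool_calls:
            try:
                args = json.loads(call["arguments"])
                result = dispatch(call["name"], args)
            except (json.JSONDecodeError, TypeError) as exc:
                # Malformed JSON or wrong argument names: report the
                # problem to the model instead of crashing the loop.
                result = {"error": f"Bad call to {call['name']}: {exc}"}
            messages.append({"role": "tool",
                             "tool_call_id": call["id"],
                             "content": json.dumps(result)})

    raise RuntimeError(f"Agent did not finish within {max_steps} steps")


# Scripted stand-in for the LLM: one tool call, then a final answer.
script = iter([
    {"tool_calls": [{"id": "call_1", "name": "replace_in_headings",
                     "arguments": '{"search": "one", "replace": "two"}'}]},
    {"content": "Updated the headings."},
])
history: List[dict] = []
final = run_bounded_agent(
    call_llm=lambda msgs: next(script),
    dispatch=lambda name, args: {"status": "ok", "tool": name},
    messages=history,
)
```

The right value for max_steps depends on how many tool calls a legitimate task can need; eight is an arbitrary choice for the example.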

The document_context parameter is critically important. It is how you give the LLM the information it needs to reason about your document. We will discuss what to include in this context in detail in the next chapter.

CHAPTER 5: DOCUMENT CONTEXT - GIVING THE LLM EYES

An LLM cannot see your document directly. It can only read text that you include in its context window. Therefore, you need to serialize your document into a text representation that gives the LLM enough information to reason about it correctly.

The challenge is that a real document can be very large, and LLM context windows, while growing (GPT-4o supports 128,000 tokens, Llama 3.1 70B supports 128,000 tokens as well), are not infinite. More importantly, including irrelevant content wastes tokens and can confuse the model. You need to be selective and strategic about what you include.

A good document context representation for a text processor should include the document's structural outline (headings and their levels), the text content of sections that are relevant to the current task, the names and locations of code blocks, and the current styles and formatting applied to different elements. You do not need to include the full text of every paragraph for every task.

Here is an example of a compact but informative document context representation. This is plain text that you would pass as the document_context argument to build_initial_messages():

DOCUMENT STRUCTURE SUMMARY
==========================
Title: "Fibonacci Algorithms: A Comparative Study"
Total paragraphs: 47
Total code blocks: 3
Total headings: 6

HEADINGS:
  [H1] para_id=1  "Chapter one: Introduction"
  [H2] para_id=5  "Section one: Background"
  [H1] para_id=12 "Chapter two: Recursive Approaches"
  [H2] para_id=15 "Section one: Naive Recursion"
  [H3] para_id=18 "The fib() Function"
  [H1] para_id=31 "Chapter three: Iterative Approaches"

CODE BLOCKS:
  [CODE] para_id=19  lang=python  func=fib()       lines=8
  [CODE] para_id=22  lang=python  func=fib_memo()  lines=12
  [CODE] para_id=35  lang=python  func=fib_iter()  lines=6

STYLES IN USE:
  Heading1 : font=Arial 16pt Bold
  Heading2 : font=Arial 14pt Bold
  Body     : font=Times New Roman 12pt
  Code     : font=Courier New 10pt

This representation is compact (it would consume perhaps 300 tokens) yet contains everything the LLM needs to handle the instruction "change every word 'one' in a headline to 'two'." The LLM can see that three headings contain the word "one": para_id=1 ("Chapter one: Introduction"), para_id=5 ("Section one: Background"), and para_id=15 ("Section one: Naive Recursion"). It therefore knows to call replace_in_headings with search="one" and replace="two".
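A summary like this is usually generated from the application's data model rather than written by hand. The following sketch builds a similar text from plain Python lists; build_document_context and the tuple layouts it accepts are assumptions made for the example, and the spacing of the output is simplified compared with the listing above.

```python
from typing import Dict, List, Tuple


def build_document_context(title: str,
                           total_paragraphs: int,
                           headings: List[Tuple[int, int, str]],
                           code_blocks: List[Tuple[int, str, str, int]],
                           styles: Dict[str, str]) -> str:
    """Serialize a document outline into a compact text summary.

    headings    : (level, para_id, text) tuples
    code_blocks : (para_id, language, function_name, line_count) tuples
    styles      : style name -> font description
    """
    lines = [
        "DOCUMENT STRUCTURE SUMMARY",
        "==========================",
        f'Title: "{title}"',
        f"Total paragraphs: {total_paragraphs}",
        # Counts are derived from the lists so they cannot disagree
        # with the entries printed below.
        f"Total code blocks: {len(code_blocks)}",
        f"Total headings: {len(headings)}",
        "",
        "HEADINGS:",
    ]
    for level, para_id, text in headings:
        lines.append(f'  [H{level}] para_id={para_id} "{text}"')

    lines += ["", "CODE BLOCKS:"]
    for para_id, lang, func, n_lines in code_blocks:
        lines.append(
            f"  [CODE] para_id={para_id} lang={lang} "
            f"func={func}() lines={n_lines}"
        )

    lines += ["", "STYLES IN USE:"]
    for style_name, font in styles.items():
        lines.append(f"  {style_name}: font={font}")

    return "\n".join(lines)


context = build_document_context(
    title="Fibonacci Algorithms: A Comparative Study",
    total_paragraphs=47,
    headings=[(1, 1, "Chapter one: Introduction"),
              (2, 5, "Section one: Background")],
    code_blocks=[(19, "python", "fib", 8)],
    styles={"Heading1": "Arial 16pt Bold", "Code": "Courier New 10pt"},
)
```

The returned string can be passed directly as the document_context argument of build_initial_messages().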

When the task requires the LLM to read and understand the actual content of a section, for example to beautify or extend a paragraph, you include the full text of that section in the context. You can do this dynamically: start with the structural summary, and if the LLM calls a tool like read_section(), you return the full text of that section and the LLM can then reason about its content.

Here is the tool definition for reading a section, followed by its Python implementation:

# Tool definition (add this entry to the TOOLS list):
READ_SECTION_TOOL = {
    "type": "function",
    "function": {
        "name": "read_section",
        "description": (
            "Returns the full text content of a document section identified "
            "by its heading text or paragraph ID. Call this before editing a "
            "section so you can understand its current content before "
            "deciding what changes to make."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "heading_text": {
                    "type": "string",
                    "description": (
                        "The exact text of the heading that starts "
                        "the section."
                    )
                },
                "para_id": {
                    "type": "integer",
                    "description": (
                        "The paragraph ID of the heading, as shown in "
                        "the document structure summary. An alternative "
                        "to heading_text."
                    )
                }
            }
        }
    }
}

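# Implementation:
# Note: 'document' below stands for your application's own document
# model, which is assumed to provide a get_section_text() method.
# A python-docx Document object has no such method.
def tool_read_section(heading_text: str = None,
                      para_id: int = None) -> dict: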
    """
    Returns the full text of a section from the document object model.
    Searches by heading_text first; falls back to para_id if provided.
    """
    section_text = document.get_section_text(
        heading_text=heading_text,
        para_id=para_id
    )
    if section_text is None:
        return {"error": "Section not found."}

    return {
        "status":     "ok",
        "heading":    heading_text or f"para_id={para_id}",
        "content":    section_text,
        "word_count": len(section_text.split())
    }

The LLM will call read_section() first, receive the content, and then call whatever editing tool is appropriate. This two-step pattern, read then write, is a fundamental pattern in agentic document editing and it mirrors how a careful human editor would work.

PART THREE: CONCRETE USE CASES IN DEPTH

CHAPTER 6: CONTEXT-AWARE TEXT REPLACEMENT IN HEADINGS

Let us now implement the first use case completely: replacing the word "one" with "two" but only in headings. This seems simple, but it illustrates several important principles about how the LLM's reasoning interacts with your application's data model.

The user types into the command bar: "Change every word 'one' in a headline to 'two'."

The orchestration layer builds the system prompt with the document context (the heading list we showed earlier) and sends the request to the LLM. The LLM sees the headings, identifies that para_id=1 ("Chapter one: Introduction"), para_id=5 ("Section one: Background"), and para_id=15 ("Section one: Naive Recursion") all contain the word "one", and calls the replace_in_headings tool.

Here is the complete implementation of the tool_replace_in_headings function, using python-docx as the document library:

# We assume 'doc' is a loaded python-docx Document object,
# opened earlier with: doc = Document("my_document.docx")

# Heading styles in python-docx use these exact style name strings.
HEADING_STYLES = {
    "Heading 1", "Heading 2", "Heading 3",
    "Heading 4", "Heading 5", "Heading 6"
}

def tool_replace_in_headings(search: str,
                              replace: str,
                              case_sensitive: bool = False) -> dict:
    """
    Replaces 'search' with 'replace' in all heading paragraphs only.
    Operates at the run level to preserve bold, italic, and other
    character-level formatting within the heading.
    Returns a summary of every change made.
    """
    changes_made = []
    flags = 0 if case_sensitive else re.IGNORECASE

    for i, para in enumerate(doc.paragraphs):
        if para.style.name in HEADING_STYLES:
            original_text = para.text

            # Only process this heading if the search term appears in it.
            if re.search(re.escape(search), original_text, flags):

                # Replace at the run level, not the paragraph level.
                # Setting para.text directly would destroy all run-level
                # formatting (bold, italic, font overrides, etc.).
                for run in para.runs:
                    if re.search(re.escape(search), run.text, flags):
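                        # Pass the replacement as a function so that any
                        # backslashes in 'replace' are inserted literally.
                        run.text = re.sub(
                            re.escape(search),
                            lambda _match: replace,
                            run.text,
                            flags=flags
                        )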

                changes_made.append({
                    "para_id": i,
                    "style":   para.style.name,
                    "before":  original_text,
                    "after":   para.text   # para.text re-reads all runs
                })

    doc.save("my_document.docx")

    return {
        "status":        "ok",
        "changes_count": len(changes_made),
        "changes":       changes_made
    }

There is an important subtlety here that deserves explanation. In python-docx, a paragraph is made up of one or more "runs," where each run is a contiguous sequence of characters that share the same formatting (font, bold, italic, etc.). If you replace text at the paragraph level by setting para.text, you destroy all the runs and lose all the formatting. Therefore, you must replace text at the run level, iterating through each run individually. One limitation follows from this: a search string that is split across two runs (for example, when only part of a word is bold) is not matched by the per-run search, so such a heading appears in the changes list with identical "before" and "after" text. The LLM does not need to know any of these details, because it is calling your tool, not writing the implementation. This is the correct division of responsibility.
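A per-run search cannot find a term that straddles a run boundary, which happens whenever formatting changes in the middle of a word. One possible way around this is sketched below on plain strings standing in for run texts: matches are located in the joined text and then mapped back onto the individual runs. The function replace_across_runs is hypothetical and is not a python-docx API.

```python
import re
from typing import List


def replace_across_runs(run_texts: List[str],
                        search: str,
                        replace: str,
                        case_sensitive: bool = False) -> List[str]:
    """Replace 'search' even when a match spans several runs.

    The replacement text is placed in the run where the match starts;
    the matched characters are removed from every later run it touches.
    The number of runs never changes, so formatting stays attached.
    """
    texts = list(run_texts)
    if not search:
        return texts

    flags = 0 if case_sensitive else re.IGNORECASE
    full = "".join(texts)

    # Offsets of each run within the joined text (original lengths).
    spans = []
    pos = 0
    for t in texts:
        spans.append((pos, pos + len(t)))
        pos += len(t)

    # Work right to left so earlier offsets stay valid after edits.
    for match in reversed(list(re.finditer(re.escape(search), full, flags))):
        start, end = match.span()
        inserted = False
        for i, (run_start, run_end) in enumerate(spans):
            lo, hi = max(start, run_start), min(end, run_end)
            if lo >= hi:
                continue  # this run is not part of the match
            piece = "" if inserted else replace
            inserted = True
            texts[i] = (texts[i][:lo - run_start]
                        + piece
                        + texts[i][hi - run_start:])
    return texts


# "one" is split between the first and second run.
new_texts = replace_across_runs(["Chapter o", "ne", ": Intro"], "one", "two")
```

With python-docx, the result could presumably be written back with a loop such as `for run, text in zip(para.runs, new_texts): run.text = text`, which leaves each run's formatting in place; that step is not exercised here.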

After the tool executes, it returns a structured result to the LLM. The LLM reads this result, sees that two changes were made (para_id=1 and para_id=5), and generates a final response to the user: "I have updated two headings. 'Chapter one: Introduction' is now 'Chapter two: Introduction', and 'Section one: Background' is now 'Section two: Background'." The user gets a clear, human-readable confirmation of exactly what was changed. This is far more useful than a silent operation with no feedback, and it is something you get for free because the LLM is generating the confirmation message based on the actual tool results.

CHAPTER 7: INDENTING A SPECIFIC FUNCTION IN A CODE BLOCK

The second use case is: "Move the function fib() defined in the code block one tab (4 spaces) to the right."

This is more interesting because it requires the LLM to understand that "move to the right" means "add indentation to each line of the function," and that "one tab" means 4 spaces in the context of Python code. A traditional macro would require the user to specify the exact character count and the exact line range. The LLM infers all of this from the natural language instruction.

def tool_indent_function(function_name: str,
                          indent_spaces: int = 4) -> dict:
    """
    Adds 'indent_spaces' spaces to the beginning of every line of the
    named Python function found in any paragraph styled as 'Code'.

    Scope detection is whitespace-based: the function body ends when a
    non-empty line is encountered that does not start with a space or tab
    (i.e., a top-level definition or statement follows).
    Empty lines within the function body are preserved as-is.
    """
    indent       = " " * indent_spaces
    changes_made = []

    for i, para in enumerate(doc.paragraphs):
        if para.style.name != "Code":
            continue

        code_text = para.text

        # Only process code blocks that contain the target function.
        if f"def {function_name}(" not in code_text:
            continue

        lines           = code_text.split("\n")
        new_lines       = []
        inside_function = False
        lines_indented  = 0

        for line in lines:
            stripped = line.strip()

            # Detect the start of the target function definition.
            if stripped.startswith(f"def {function_name}("):
                inside_function = True

            # Detect the end of the function: a non-empty line at the
            # top level (no leading whitespace) that is NOT the function
            # definition itself signals that the function scope has ended.
            elif inside_function and line and not line[0].isspace():
                inside_function = False

            if inside_function:
                new_lines.append(indent + line)
                lines_indented += 1
            else:
                new_lines.append(line)

        new_code = "\n".join(new_lines)

        # Replace the paragraph content while preserving the Code style.
        # para.clear() removes all runs; we then add one new run with
        # the correct font settings for code.
        para.clear()
        run           = para.add_run(new_code)
        run.font.name = "Courier New"
        run.font.size = Pt(10)

        changes_made.append({
            "para_id":       i,
            "function":      function_name,
            "lines_indented": lines_indented
        })

    doc.save("my_document.docx")
    return {"status": "ok", "changes": changes_made}

This example illustrates a key point about the relationship between LLM reasoning and tool implementation. The LLM correctly identifies that the user wants to indent the fib() function and calls the tool with function_name="fib" and indent_spaces=4. The tool then performs the actual text manipulation using Python string operations. The LLM does not need to know how to parse Python code or manipulate docx runs. It only needs to know that the tool exists and what it does.

However, you will notice that the function detection logic in the tool is somewhat naive. It uses a simple string search and whitespace-based scope detection. For a production system, you would want to use a proper Python parser (the ast module in the standard library) to correctly identify function boundaries. The LLM's job is to decide WHAT to do; your tool's job is to do it CORRECTLY. Never compromise on the correctness of your tool implementations just because the LLM is handling the high-level reasoning.
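As a rough illustration of the parser-based approach, the sketch below uses the ast module to find the exact line range of the function and indents only those lines. It works on a plain string; the function name indent_function_source is hypothetical, and the sketch assumes Python 3.8 or later (for end_lineno) and a code block that parses as valid Python. Decorator lines are not included in the range, and the lines of a multi-line string literal inside the function are shifted along with the code.

```python
import ast


def indent_function_source(code: str, function_name: str,
                           indent_spaces: int = 4) -> str:
    """Indent the lines of `function_name` as located by the ast module.

    Returns `code` unchanged if the function is not found. Raises
    SyntaxError if `code` is not valid Python.
    """
    tree = ast.parse(code)
    target = None
    for node in ast.walk(tree):
        if (isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
                and node.name == function_name):
            target = node
            break
    if target is None:
        return code

    pad = " " * indent_spaces
    lines = code.split("\n")
    # lineno and end_lineno are 1-based and inclusive.
    for idx in range(target.lineno - 1, target.end_lineno):
        if lines[idx].strip():          # leave blank lines untouched
            lines[idx] = pad + lines[idx]
    return "\n".join(lines)


sample = (
    "def fib(n):\n"
    "    return n if n < 2 else fib(n - 1) + fib(n - 2)\n"
    "\n"
    "print(fib(5))"
)
print(indent_function_source(sample, "fib"))
```

Because the name comparison is exact, a neighbouring function such as fib_helper() is never touched, and the end of the function comes from the parser rather than from a guess based on leading whitespace.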

CHAPTER 8: ASSIGNING A DIFFERENT FONT TO ALL CODE LISTINGS

The third use case demonstrates style manipulation: "Assign the font 'JetBrains Mono' to all code listings, but not to the rest of the text."

This is a global style operation. In a well-structured document, code listings should all use the same paragraph style (e.g., "Code" or "Preformatted Text"), so changing the font for all code listings is equivalent to modifying the "Code" style definition. Here is the tool:

def tool_set_style_font(style_name: str,
                         font_name: str,
                         font_size_pt: float = None) -> dict:
    """
    Changes the font (and optionally the size) of a named paragraph style
    throughout the document. Because all paragraphs using that style
    inherit from the style definition, this single change propagates
    automatically to every paragraph that uses it.

    style_name   : e.g. "Code", "Heading 1", "Body Text"
    font_name    : e.g. "JetBrains Mono", "Arial", "Times New Roman"
    font_size_pt : optional new font size in points (e.g. 10.0)
    """
    try:
        style = doc.styles[style_name]
    except KeyError:
        return {"error": f"Style '{style_name}' not found in document."}

    style.font.name = font_name
    if font_size_pt is not None:
        style.font.size = Pt(font_size_pt)

    # Count paragraphs affected so the LLM can report accurately.
    affected = sum(
        1 for p in doc.paragraphs if p.style.name == style_name
    )

    doc.save("my_document.docx")
    return {
        "status":              "ok",
        "style_modified":      style_name,
        "new_font":            font_name,
        "paragraphs_affected": affected
    }

When the user says "Assign the font 'JetBrains Mono' to all code listings," the LLM maps "code listings" to the "Code" paragraph style and calls tool_set_style_font with style_name="Code" and font_name="JetBrains Mono". The tool modifies the style definition, and the change propagates to every paragraph that uses that style. This is the power of style-based document formatting. One caveat: a font set directly on a run overrides the style, so code paragraphs rewritten by tool_indent_function in Chapter 7, which sets "Courier New" on the run, keep that font until the run-level override is removed.

The LLM's response to the user might be: "Done. I have changed the font of the 'Code' style to 'JetBrains Mono'. This affects all 3 code block paragraphs in your document. The rest of the text remains unchanged."

CHAPTER 9: ADDING A CORRECTLY FORMATTED APPENDIX

Now we tackle a more complex task: "Add an appendix with the correct format."

This is interesting because "correct format" is context-dependent. The LLM needs to understand what an appendix looks like in the context of this particular document. It needs to look at the existing document structure, identify the heading styles in use, determine the appropriate heading level for an appendix, and create a new section at the end of the document with the right formatting.

This is a multi-step task. The LLM will first call read_section() or a similar tool to understand the document structure, then call add_appendix() with the appropriate parameters. Here is the tool implementation:

def tool_add_appendix(title: str,
                       content: str,
                       heading_style: str = "Heading 1",
                       label_prefix: str = "Appendix") -> dict:
    """
    Adds a new appendix section at the end of the document, preceded
    by a page break. The appendix heading uses the specified style.

    title        : The appendix identifier and name,
                   e.g. "A: Glossary of Terms"
    content      : The body text of the appendix. Separate paragraphs
                   with double newlines.
    heading_style: The paragraph style for the appendix heading.
    label_prefix : Prepended to the title, e.g. "Appendix".
    """
    # Validate that the requested heading style exists.
    if heading_style not in doc.styles:
        return {"error": f"Style '{heading_style}' not found."}

    # Insert a page break at the end of the last paragraph so the
    # appendix always starts on a fresh page.
    last_para = doc.paragraphs[-1]
    run = last_para.add_run()
    run.add_break(WD_BREAK.PAGE)

    # Add the appendix heading with the correct style.
    full_title    = f"{label_prefix} {title}"
    heading_para  = doc.add_paragraph(full_title)
    heading_para.style = doc.styles[heading_style]

    # Add the body content. Double newlines delimit paragraphs.
    para_texts = [t.strip() for t in content.split("\n\n") if t.strip()]
    for para_text in para_texts:
        doc.add_paragraph(para_text)

    doc.save("my_document.docx")
    return {
        "status":          "ok",
        "appendix_title":  full_title,
        "paragraphs_added": len(para_texts) + 1   # +1 for the heading
    }

But here is where the LLM's contextual understanding really shines. The user said "add an appendix with the correct format" without specifying what the appendix should contain. The LLM, having read the document context, knows that the document is about Fibonacci algorithms. It might respond: "I can add an appendix to your document. What would you like the appendix to contain? For example, I could add a glossary of terms, a list of references, or a mathematical proof of the Fibonacci sequence's properties." This is the LLM acting as an intelligent assistant, not just a command executor.

If the user responds "Add a glossary of terms used in the document," the LLM will call read_section() for each major section, extract the technical terms, and then call add_appendix() with a well-formatted glossary as the content. This is a multi-step agentic workflow that would be impractical to implement with a traditional macro system.
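The listing in this chapter is only the implementation of the appendix tool; what the LLM actually sees is a schema describing it. A possible definition in the OpenAI function-calling format is sketched below. The tool name add_appendix and the wording of the descriptions are assumptions derived from the docstring, so adapt them to the tool registry you built earlier.

```python
import json

# Hypothetical schema for the appendix tool, mirroring its docstring.
ADD_APPENDIX_TOOL = {
    "type": "function",
    "function": {
        "name": "add_appendix",
        "description": (
            "Adds a new appendix section at the end of the document, "
            "preceded by a page break."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "title": {
                    "type": "string",
                    "description": "Identifier and name, "
                                   "e.g. 'A: Glossary of Terms'."
                },
                "content": {
                    "type": "string",
                    "description": "Body text. Separate paragraphs "
                                   "with double newlines."
                },
                "heading_style": {
                    "type": "string",
                    "description": "Paragraph style for the appendix "
                                   "heading. Defaults to 'Heading 1'."
                },
                "label_prefix": {
                    "type": "string",
                    "description": "Prepended to the title. "
                                   "Defaults to 'Appendix'."
                }
            },
            # Only the arguments without defaults are mandatory.
            "required": ["title", "content"]
        }
    }
}

print(json.dumps(ADD_APPENDIX_TOOL, indent=2))
```

Listing only title and content as required lets the LLM omit the heading style when the document structure summary gives it no reason to deviate from the default.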

CHAPTER 10: BEAUTIFYING AND EXTENDING A TEXT BLOCK

This use case is perhaps the most powerful demonstration of what LLMs bring to document editing: "Beautify and extend the introduction section."

Here, the LLM is not just executing a structural operation. It is reading the existing text, understanding its meaning, and generating improved text. This requires the LLM to act as both a reader and a writer.

The workflow proceeds in clear steps. First, the orchestration layer sends the user's instruction along with the document structure summary. Second, the LLM calls read_section() with heading_text="Introduction" to get the full text of the introduction. Third, the LLM reads the text, reasons about how to improve it (better word choice, more engaging opening, additional context, clearer structure), and generates the improved text. Fourth, the LLM calls replace_section_content() with the improved text. Fifth, the tool replaces the content of the introduction section in the document.

Here is the replace_section_content tool. Notice that instead of building paragraph XML manually (which requires careful namespace handling), we use doc.add_paragraph() and then move the resulting element into the correct position using lxml's addnext(), which is both safer and more readable:

def tool_replace_section_content(heading_text: str,
                                  new_content: str) -> dict:
    """
    Replaces the body paragraphs of a section (identified by its heading)
    with new_content. The heading paragraph itself is preserved unchanged.

    heading_text : The exact text of the section's heading paragraph.
    new_content  : The full replacement text. Separate paragraphs with
                   double newlines.
    """
    # --- Phase 1: locate the heading and collect old body paragraphs ---
    heading_para   = None
    paras_to_remove = []

    for para in doc.paragraphs:
        if heading_para is None:
            if (para.text == heading_text
                    and para.style.name in HEADING_STYLES):
                heading_para = para
        else:
            # Collect body paragraphs until the next heading is reached.
            if para.style.name in HEADING_STYLES:
                break
            paras_to_remove.append(para)

    if heading_para is None:
        return {"error": f"Heading '{heading_text}' not found."}

    # --- Phase 2: remove the old body paragraphs from the XML tree ---
    for para in paras_to_remove:
        para._element.getparent().remove(para._element)

    # --- Phase 3: insert new paragraphs after the heading ---
    # We create each paragraph via doc.add_paragraph() (which appends it
    # to the end of the document) and then immediately move its XML
    # element to the correct position using lxml's addnext().
    # Inserting in reverse order ensures the first paragraph ends up
    # immediately after the heading.
    new_para_texts = [t.strip() for t in new_content.split("\n\n")
                      if t.strip()]

    for para_text in reversed(new_para_texts):
        new_para = doc.add_paragraph(para_text)
        # Move the new paragraph element to just after the heading.
        heading_para._element.addnext(new_para._element)

    doc.save("my_document.docx")
    return {
        "status":            "ok",
        "section":           heading_text,
        "old_paragraph_count": len(paras_to_remove),
        "new_paragraph_count": len(new_para_texts)
    }

The key insight here is that the LLM is doing two fundamentally different kinds of work in this workflow. First, it is reasoning about the document structure to identify which section to modify and which tool to call. Second, it is generating the improved text content. Both of these are things the LLM is very good at, and neither requires any special programming beyond the tool definitions and the orchestration loop we have already built.
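The reverse-order insertion in Phase 3 of the tool is easy to get wrong, so it can help to see it on a plain list first. In this sketch a Python list stands in for the document body and list.insert() stands in for lxml's addnext(); the helper name insert_after is hypothetical.

```python
def insert_after(body: list, anchor: str, new_items: list) -> None:
    """Insert new_items directly after anchor, preserving their order."""
    position = body.index(anchor) + 1
    # Every item is placed directly after the anchor, which pushes the
    # previously inserted items down. Iterating in reverse therefore
    # leaves the items in their original order.
    for item in reversed(new_items):
        body.insert(position, item)


body = ["Introduction", "Naive Recursion"]
insert_after(body, "Introduction", ["para 1", "para 2", "para 3"])
print(body)
# prints ['Introduction', 'para 1', 'para 2', 'para 3', 'Naive Recursion']
```

Iterating in forward order with the same insertion point would leave the new paragraphs in reverse, with "para 3" directly under the heading.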

For the text generation step, you may want to use a more capable model than for the structural reasoning steps. This is where the hybrid architecture becomes valuable: use a fast local model (e.g., Llama 3.1 8B) for structural operations and route text generation tasks to a more capable model (e.g., GPT-4o or Llama 3.1 70B) for better quality output. This brings us naturally to the topic of multimodal extensions, where the choice of model becomes even more consequential.

PART FOUR: MULTIMODAL EXTENSIONS - GENERATING AND EMBEDDING FIGURES

CHAPTER 11: ASKING THE LLM TO READ A SECTION AND CREATE A VISUAL

One of the most exciting capabilities of modern LLM ecosystems is the ability to generate images from text descriptions. Models like DALL-E 3 (via OpenAI's API), Stable Diffusion (via local tools like AUTOMATIC1111 or ComfyUI), and Flux (via Replicate or local deployment) can generate high-quality images from natural language descriptions.

The workflow for "read section X and create a figure for it" unfolds in five steps. In step one, the LLM calls read_section() to get the text of the target section. In step two, the LLM generates a detailed image prompt based on the section's content. For a technical document, this might be a diagram description rather than a photorealistic image prompt. In step three, the LLM calls generate_figure() with the image prompt. In step four, the image generation tool sends the prompt to an image generation API and saves the resulting image to disk. In step five, the LLM calls insert_figure() to embed the image into the document at the appropriate location.

Here is the generate_figure tool implementation using OpenAI's DALL-E 3 API:

# A separate OpenAI client for image generation.
# This can point to the same or a different endpoint than the text client.
image_client = OpenAI(api_key="sk-your-openai-key-here")

def tool_generate_figure(prompt: str,
                          filename: str,
                          style: str = "technical diagram",
                          size: str = "1024x1024") -> dict:
    """
    Generates an image using DALL-E 3 and saves it to the 'figures/'
    subdirectory. Returns the saved file path for use by insert_figure().

    prompt   : Natural language description of the desired image.
    filename : The base filename (without extension) to save as.
    style    : Visual style hint, e.g. "technical diagram", "flowchart".
    size     : "1024x1024", "1792x1024", or "1024x1792".
    """
    # Prepend a style directive to the user's prompt so DALL-E 3 produces
    # output appropriate for a technical document.
    full_prompt = (
        f"Create a {style} that shows: {prompt}. "
        f"Use a clean, professional visual style suitable for a technical "
        f"document. White background, clear labels, no decorative elements."
    )

    response = image_client.images.generate(
        model="dall-e-3",
        prompt=full_prompt,
        size=size,
        quality="standard",
        n=1,
        response_format="b64_json"
    )

    image_data  = base64.b64decode(response.data[0].b64_json)
    output_path = Path(f"figures/{filename}.png")
    output_path.parent.mkdir(exist_ok=True)
    output_path.write_bytes(image_data)

    return {
        "status":         "ok",
        "file_path":      str(output_path),
        "revised_prompt": response.data[0].revised_prompt
    }

For local image generation using Stable Diffusion via the AUTOMATIC1111 API (which runs on localhost:7860 by default), no data leaves your network:

def tool_generate_figure_local(prompt: str,
                                filename: str,
                                negative_prompt: str = "",
                                steps: int = 30) -> dict:
    """
    Generates an image using a locally running Stable Diffusion server
    (AUTOMATIC1111 / sd-webui). The request goes to localhost only, so
    no data leaves your machine.

    prompt          : The positive prompt describing the desired image.
    filename        : Base filename (without extension) to save as.
    negative_prompt : Things to avoid in the generated image.
    steps           : Number of diffusion steps (more = higher quality
                      but slower; 20-30 is a good range).
    """
    payload = {
        "prompt":          prompt,
        "negative_prompt": negative_prompt or "blurry, low quality, text",
        "steps":           steps,
        "width":           768,
        "height":          512,
        "cfg_scale":       7,
        "sampler_name":    "DPM++ 2M Karras"
    }

    response = requests.post(
        "http://localhost:7860/sdapi/v1/txt2img",
        json=payload,
        timeout=120
    )
    response.raise_for_status()

    image_data  = base64.b64decode(response.json()["images"][0])
    output_path = Path(f"figures/{filename}.png")
    output_path.parent.mkdir(exist_ok=True)
    output_path.write_bytes(image_data)

    return {"status": "ok", "file_path": str(output_path)}
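Both figure tools build the output path directly from a filename chosen by the LLM. A small guard like the following keeps that path inside the figures/ directory. It is a sketch only: the helper name safe_figure_path and the allowed-character rule are assumptions, not part of the tools above.

```python
import re
from pathlib import Path

FIGURES_DIR = Path("figures")


def safe_figure_path(filename: str, suffix: str = ".png") -> Path:
    """Return a path inside FIGURES_DIR for an untrusted base filename."""
    # Keep only the final path component, which drops any "../" prefix.
    base = Path(filename).name
    # Collapse every run of other characters into a single underscore.
    base = re.sub(r"[^A-Za-z0-9_-]+", "_", base).strip("_")
    if not base:
        raise ValueError(f"Unusable figure filename: {filename!r}")
    return FIGURES_DIR / f"{base}{suffix}"


print(safe_figure_path("recursion tree fib(5)"))
print(safe_figure_path("../../etc/passwd"))
```

Inside the tools, output_path = safe_figure_path(filename) would replace the f-string that currently builds the path, and a ValueError can be turned into an {"error": ...} result so the LLM can retry with a different name.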

And here is the insert_figure tool that embeds the generated image into the document at the correct location:

def tool_insert_figure(file_path: str,
                        after_heading: str,
                        caption: str = "",
                        width_inches: float = 5.0) -> dict:
    """
    Inserts an image into the document immediately after the specified
    heading paragraph, with an optional caption below it.

    file_path     : Path to the image file on disk (PNG, JPEG, etc.).
    after_heading : The exact text of the heading after which to insert.
    caption       : Optional caption text displayed below the figure.
    width_inches  : Display width of the figure in inches (default 5.0).
    """
    if not Path(file_path).exists():
        return {"error": f"Image file not found: {file_path}"}

    # Locate the target heading paragraph.
    target_para = None
    for para in doc.paragraphs:
        if para.text == after_heading and para.style.name in HEADING_STYLES:
            target_para = para
            break

    if target_para is None:
        return {"error": f"Heading '{after_heading}' not found."}

    # Create the image paragraph via doc.add_paragraph() so python-docx
    # manages the XML correctly, then move it to the right position.
    img_para           = doc.add_paragraph()
    img_para.alignment = WD_ALIGN_PARAGRAPH.CENTER
    img_run            = img_para.add_run()
    img_run.add_picture(file_path, width=Inches(width_inches))

    # Move the image paragraph to just after the target heading.
    target_para._element.addnext(img_para._element)

    # Add caption below the image if one was provided.
    if caption:
        caption_style = (doc.styles["Caption"]
                         if "Caption" in doc.styles
                         else doc.styles["Normal"])
        cap_para           = doc.add_paragraph(caption)
        cap_para.style     = caption_style
        cap_para.alignment = WD_ALIGN_PARAGRAPH.CENTER
        # Insert caption immediately after the image paragraph.
        img_para._element.addnext(cap_para._element)

    doc.save("my_document.docx")
    return {
        "status":               "ok",
        "figure_inserted_after": after_heading,
        "caption":              caption
    }

Let us trace through a complete example of this workflow. The user types: "Read the section on Naive Recursion and create a figure showing the recursion tree for fib(5), then insert it into the document."

The LLM calls read_section(heading_text="Naive Recursion") and receives the section text, which explains how the recursive Fibonacci algorithm works. The LLM then calls generate_figure() with a prompt along the lines of: "A recursion tree diagram showing the recursive calls of fib(5), with nodes labeled fib(5), fib(4), fib(3), etc., showing how the tree branches and where calls overlap, in a clean technical diagram style." The image generation service returns a PNG file saved to disk. The LLM then calls insert_figure() with the file path, the heading "Naive Recursion" as the insertion point, and the caption "Figure 1: Recursion tree for fib(5), illustrating the exponential growth of recursive calls."

The entire workflow, from natural language instruction to embedded figure, takes perhaps 15 to 30 seconds (dominated by the image generation time) and requires zero manual steps from the user.

PART FIVE: ANSWERING GENERAL QUESTIONS AND INTEGRATING ANSWERS

CHAPTER 12: THE DOCUMENT AS A LIVING KNOWLEDGE BASE

One of the most natural extensions of an LLM-augmented text processor is the ability to ask general questions and integrate the answers directly into the document. "Explain what an LLM is" is a perfect example. The user wants an explanation, and they want it inserted into their document at a specific location.

This requires a slightly different workflow. Instead of the LLM calling tools to modify the document, the LLM first generates the answer as text, and then the user (or the LLM, in an agentic mode) decides where to insert it.

Here is the tool for inserting generated text, followed by the orchestration function that handles the question-and-insert workflow:

def tool_insert_text_after(heading_text: str,
                            content: str,
                            style_name: str = "Body Text") -> dict:
    """
    Inserts new paragraphs of text immediately after the specified heading.
    Paragraphs in 'content' are delimited by double newlines.
    Each new paragraph uses the specified style_name.

    heading_text : Exact text of the heading to insert after.
    content      : The text to insert (double-newline-separated paragraphs).
    style_name   : Paragraph style for the new text (default "Body Text").
    """
    target_para = None
    for para in doc.paragraphs:
        if (para.text == heading_text
                and para.style.name in HEADING_STYLES):
            target_para = para
            break

    if target_para is None:
        return {"error": f"Heading '{heading_text}' not found."}

    resolved_style = (doc.styles[style_name]
                      if style_name in doc.styles
                      else doc.styles["Normal"])

    new_para_texts = [t.strip() for t in content.split("\n\n") if t.strip()]

    # Insert in reverse order so the first paragraph ends up first.
    for para_text in reversed(new_para_texts):
        new_para       = doc.add_paragraph(para_text)
        new_para.style = resolved_style
        target_para._element.addnext(new_para._element)

    doc.save("my_document.docx")
    return {
        "status":             "ok",
        "inserted_after":     heading_text,
        "paragraphs_inserted": len(new_para_texts)
    }


def run_qa_and_insert(question: str,
                       insert_after_heading: str,
                       document_context: str) -> str:
    """
    Answers a general question and inserts the answer into the document.
    The instruction tells the LLM to generate a thorough answer and then
    call insert_text_after() to place it in the document.
    """
    # run_agent() takes an instruction and a document context rather
    # than a message list, so the task-specific guidance goes into the
    # instruction itself instead of a separate system message.
    instruction = (
        f"Please answer this question: '{question}'. "
        "First generate a thorough, well-structured answer, then insert "
        "that answer into the document after the heading "
        f"'{insert_after_heading}' using the insert_text_after tool. "
        "Write in a style consistent with the document's existing "
        "content. Use clear, professional language."
    )

    # Reuse the standard agent loop with the full tool set.
    task_client, task_model = router.get_client("text_generation")
    return run_agent(
        user_instruction=instruction,
        document_context=document_context,
        client=task_client,
        model=task_model
    )

When the user asks "Explain what an LLM is and insert the explanation after the Introduction heading," the LLM generates a well-written explanation of Large Language Models, tailored to the technical level of the document (which it knows from the document context), and then calls insert_text_after() to place it in the document. The result is a seamlessly integrated explanation that matches the document's style and tone.

This capability transforms the text processor from a passive editing tool into an active writing partner. The user can ask questions, request explanations, ask for examples, and have all of this content automatically integrated into their document at the right location.

PART SIX: INTEGRATING DIFFERENT LLM MODELS - THE MULTI-MODEL ARCHITECTURE

CHAPTER 13: ROUTING TASKS TO THE RIGHT MODEL

Not all tasks are equal, and not all models are equal. A sophisticated LLM-augmented application should be able to route different types of tasks to the most appropriate model. This is called model routing, and it is a key architectural pattern in production agentic systems.

Consider the following task taxonomy for our text processor. Structural operations (replace text in headings, change fonts, add sections) require precise instruction following and structured output but do not require deep reasoning or creativity. A small, fast model like Llama 3.1 8B or Phi-3.5 Mini is perfectly adequate and will respond in under a second on modern hardware. Text generation tasks (beautify a section, write an appendix, answer a question) require good writing quality, broad knowledge, and the ability to maintain a consistent style, making a medium or large model like Llama 3.1 70B, Mistral Large, or GPT-4o more appropriate. Multimodal tasks (read a section and describe what figure to generate) may require a vision-capable model if the document contains existing images that the LLM needs to understand, with GPT-4o, Claude 3.5 Sonnet, and LLaVA (a local multimodal model available via Ollama) being suitable candidates.

Here is a model router implementation:

class ModelRouter:
    """
    Routes LLM requests to the appropriate model based on task type.
    Add new task types and model configurations to TASK_MODEL_MAP as
    your application's needs grow.
    """

    TASK_MODEL_MAP = {
        "structural": {
            "base_url": "http://localhost:11434/v1",
            "api_key":  "ollama",
            "model":    "llama3.1:8b"
        },
        "text_generation": {
            "base_url": "http://localhost:11434/v1",
            "api_key":  "ollama",
            "model":    "llama3.1:70b"
        },
        "multimodal": {
            "base_url": "http://localhost:11434/v1",
            "api_key":  "ollama",
            "model":    "llava:13b"
        },
        "high_quality": {
            "base_url": "https://api.openai.com/v1",
            "api_key":  "sk-your-openai-key",
            "model":    "gpt-4o"
        }
    }

    def get_client(self, task_type: str) -> tuple:
        """Returns (OpenAI client, model string) for the given task type."""
        config = self.TASK_MODEL_MAP.get(
            task_type,
            self.TASK_MODEL_MAP["structural"]   # safe default
        )
        return (
            OpenAI(base_url=config["base_url"], api_key=config["api_key"]),
            config["model"]
        )

    def classify_task(self, user_instruction: str) -> str:
        """
        Classifies the user's instruction into a task type using keyword
        matching. In a production system you might replace this with a
        small dedicated classifier model for higher accuracy.
        """
        text = user_instruction.lower()

        if any(kw in text for kw in
               ["replace", "indent", "font", "style", "move", "rename"]):
            return "structural"

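        # "appendix" (not "add appendix") so that instructions such as
        # "Add an appendix with the correct format" are matched.
        if any(kw in text for kw in
               ["write", "explain", "beautify", "extend", "improve",
                "generate text", "appendix"]):
            return "text_generation"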

        if any(kw in text for kw in
               ["figure", "image", "diagram", "visual", "picture"]):
            return "multimodal"

        return "structural"   # Default to the fast local model.


router = ModelRouter()

def run_smart_agent(user_instruction: str,
                    document_context: str) -> str:
    """
    Entry point for all user instructions. Classifies the task,
    selects the appropriate model, and runs the agent loop.
    """
    task_type          = router.classify_task(user_instruction)
    task_client, model = router.get_client(task_type)

    logger.info(f"[Router] Task: {task_type!r}  Model: {model}")

    return run_agent(
        user_instruction=user_instruction,
        document_context=document_context,
        client=task_client,
        model=model
    )

This routing architecture gives you the best of all worlds: speed and privacy for routine operations, quality for creative tasks, and multimodal capability when needed. It also gives you cost control: you only invoke expensive remote models when the task genuinely requires them.
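Substring matching, as used in classify_task(), also fires inside longer words: "remove" contains "move", so an instruction like "Improve the wording and remove filler" is classified as structural before the text-generation keywords are ever checked. The sketch below matches keywords only at the start of a word. The function name classify_task_by_words and the keyword table are illustrative assumptions modelled on the router above, not a replacement for a proper classifier.

```python
import re

# Task types are checked in insertion order, mirroring classify_task().
KEYWORDS = {
    "structural": ["replace", "indent", "font", "style", "move", "rename"],
    "text_generation": ["write", "explain", "beautify", "extend",
                        "improve", "appendix"],
    "multimodal": ["figure", "image", "diagram", "visual", "picture"],
}

# One compiled pattern per task type. The leading \b anchors each
# keyword at a word start, so "moved" matches "move" but "remove"
# does not.
PATTERNS = {
    task_type: re.compile(
        r"\b(?:" + "|".join(re.escape(kw) for kw in keywords) + ")"
    )
    for task_type, keywords in KEYWORDS.items()
}


def classify_task_by_words(user_instruction: str) -> str:
    """Return the first task type whose keywords start a word."""
    text = user_instruction.lower()
    for task_type, pattern in PATTERNS.items():
        if pattern.search(text):
            return task_type
    return "structural"   # same default as the router


print(classify_task_by_words("Improve the wording and remove filler"))
# prints text_generation
```

Anchoring only the start of the word keeps inflected forms such as "indenting" or "fonts" working while avoiding matches in the middle of unrelated words.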

CHAPTER 14: ADDING A NEW LOCAL MODEL - THE PLUGIN PATTERN

One of the most powerful aspects of the OpenAI-compatible API standard is that adding a new model to your application is as simple as adding a new entry to the configuration. Let us say you want to add support for Google's Gemma 2 27B model, which is available via Ollama:

# Pull the model with Ollama (run this once in your terminal):
ollama pull gemma2:27b

# Then register it in your router - no other code changes needed:
ModelRouter.TASK_MODEL_MAP["gemma_large"] = {
    "base_url": "http://localhost:11434/v1",
    "api_key":  "ollama",
    "model":    "gemma2:27b"
}

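That is all the registration requires. Because Ollama handles the model management (downloading, quantization, memory management) and exposes a standard API, your orchestration code does not change at all. For the router to actually select the new model, classify_task() must be able to return the new key, or you can assign the entry to an existing task type instead. This is the plugin pattern applied to LLM backends.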
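If model entries are going to be added by configuration rather than by code, it helps to validate them when they are loaded. The sketch below assumes entries use the same three keys shown above (base_url, api_key, model); load_model_entries(), REQUIRED_KEYS, and the inline CONFIG string are hypothetical names introduced for illustration, and task_model_map is a plain dict standing in for ModelRouter.TASK_MODEL_MAP.

```python
import json

# Keys every entry is assumed to need, based on the example entry above.
REQUIRED_KEYS = {"base_url", "api_key", "model"}


def load_model_entries(config_text: str) -> dict:
    """Parse a JSON object of model entries and check each one is complete."""
    entries = json.loads(config_text)
    for name, entry in entries.items():
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            raise ValueError(
                f"Model entry {name!r} is missing keys: {sorted(missing)}"
            )
    return entries


# In a real application this text would come from a configuration file.
CONFIG = """
{
  "gemma_large": {
    "base_url": "http://localhost:11434/v1",
    "api_key": "ollama",
    "model": "gemma2:27b"
  }
}
"""

task_model_map = {}   # stand-in for ModelRouter.TASK_MODEL_MAP
task_model_map.update(load_model_entries(CONFIG))
print(task_model_map["gemma_large"]["model"])
```

Failing fast on an incomplete entry is a deliberate choice here: a missing base_url is much easier to diagnose at startup than as a connection error in the middle of an agent run.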

For remote models from providers that do not expose an OpenAI-compatible API, you can write a thin adapter. Here is an adapter for Anthropic's Claude that presents the same interface as the OpenAI client, allowing the ModelRouter to use Claude models without any changes to the orchestration code:

import json

import anthropic


class AnthropicAdapter:
    """
    Wraps the Anthropic Python SDK to present an OpenAI-compatible
    interface. This allows ModelRouter and run_agent() to use Claude
    models without any changes to the orchestration layer.

    Usage:
        adapter = AnthropicAdapter(api_key="sk-ant-...")
        client, model = adapter, "claude-3-5-sonnet-20241022"
        result = run_agent(instruction, context, client, model)
    """

    def __init__(self, api_key: str):
        self._client  = anthropic.Anthropic(api_key=api_key)
        # Mimic the OpenAI client's attribute hierarchy so that
        # run_agent()'s call to client.chat.completions.create() works.
        self.chat         = self
        self.completions  = self

    def create(self,
               model: str,
               messages: list,
               tools: list = None,
               **kwargs) -> object:
        """
        Translates an OpenAI-format chat.completions.create() call into
        an Anthropic messages.create() call and returns a response object
        that looks like an OpenAI ChatCompletion.
        """
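        # Separate the system message (Anthropic takes it separately).
        system_msg = next(
            (m["content"] for m in messages if m["role"] == "system"), ""
        )
        # NOTE: the remaining messages are forwarded unchanged. That is fine
        # for plain user/assistant text, but OpenAI-style tool messages
        # (role "tool", assistant "tool_calls") must first be translated
        # into Anthropic tool_use / tool_result content blocks before a
        # multi-turn tool loop will work through this adapter.
        user_messages = [m for m in messages if m["role"] != "system"]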

        # Convert OpenAI tool schemas to Anthropic tool schemas.
        anthropic_tools = []
        if tools:
            for tool in tools:
                f = tool["function"]
                anthropic_tools.append({
                    "name":         f["name"],
                    "description":  f["description"],
                    "input_schema": f["parameters"]
                })

        response = self._client.messages.create(
            model=model,
            system=system_msg,
            messages=user_messages,
            tools=anthropic_tools if anthropic_tools else anthropic.NOT_GIVEN,
            max_tokens=4096
        )

        return self._wrap_response(response)

    def _wrap_response(self, response) -> object:
        """
        Wraps an Anthropic Message object in a lightweight namespace
        object that mimics the structure of an OpenAI ChatCompletion,
        specifically the response.choices[0].message interface that
        run_agent() depends on.
        """
        # Local import keeps this method self-contained.
        from types import SimpleNamespace

        text_parts = []
        tool_calls = []

        for block in response.content:
            if block.type == "text":
                text_parts.append(block.text)
            elif block.type == "tool_use":
                # Wrap the Anthropic tool_use block so it looks like an
                # OpenAI tool_call object with .id, .function.name, and
                # .function.arguments attributes. Collect every block so
                # that multiple tool calls in one response are all kept.
                tool_calls.append(SimpleNamespace(
                    id=block.id,
                    type="function",
                    function=SimpleNamespace(
                        name=block.name,
                        arguments=json.dumps(block.input)
                    )
                ))

        message = SimpleNamespace(
            content="".join(text_parts) or None,
            tool_calls=tool_calls or None
        )
        return SimpleNamespace(choices=[SimpleNamespace(message=message)])

In a production system, you would use LiteLLM (https://github.com/BerriAI/litellm), which provides a unified interface to over 100 LLM providers and handles all the format conversion automatically, along with rate limiting, retries, fallbacks, and cost tracking.
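If you do keep a hand-written adapter, the part that needs the most care is the conversation history. Anthropic's Messages API represents tool calls as tool_use blocks inside an assistant message and tool results as tool_result blocks inside a user message, so OpenAI-style history has to be translated before it is sent. The following is a minimal sketch of that translation; to_anthropic_messages() is a hypothetical helper, and it assumes the history is stored as plain dicts in the OpenAI format rather than as SDK objects.

```python
import json


def to_anthropic_messages(messages: list) -> list:
    """Translate OpenAI-style chat history (dicts) into Anthropic messages.

    System messages are skipped because Anthropic takes the system prompt
    as a separate parameter.
    """
    converted = []
    for msg in messages:
        role = msg["role"]
        if role == "system":
            continue

        if role == "assistant" and msg.get("tool_calls"):
            # An assistant turn that requested tools becomes a list of
            # content blocks: optional text followed by tool_use blocks.
            blocks = []
            if msg.get("content"):
                blocks.append({"type": "text", "text": msg["content"]})
            for call in msg["tool_calls"]:
                blocks.append({
                    "type": "tool_use",
                    "id": call["id"],
                    "name": call["function"]["name"],
                    "input": json.loads(call["function"]["arguments"]),
                })
            converted.append({"role": "assistant", "content": blocks})

        elif role == "tool":
            block = {
                "type": "tool_result",
                "tool_use_id": msg["tool_call_id"],
                "content": msg["content"],
            }
            last = converted[-1] if converted else None
            # Consecutive tool results are merged into one user message.
            if (last and last["role"] == "user"
                    and isinstance(last["content"], list)):
                last["content"].append(block)
            else:
                converted.append({"role": "user", "content": [block]})

        else:
            converted.append({"role": role, "content": msg["content"]})
    return converted


history = [
    {"role": "system", "content": "You edit documents."},
    {"role": "user", "content": "Indent fib() by 4 spaces."},
    {"role": "assistant", "content": None, "tool_calls": [{
        "id": "call_1", "type": "function",
        "function": {"name": "indent_function",
                     "arguments": '{"function_name": "fib", "indent_spaces": 4}'},
    }]},
    {"role": "tool", "tool_call_id": "call_1", "content": '{"ok": true}'},
]

for item in to_anthropic_messages(history):
    print(item["role"], item["content"])
```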

PART SEVEN: PRODUCTION ENGINEERING CONSIDERATIONS

CHAPTER 15: ERROR HANDLING, RETRY LOGIC, AND SAFETY

A production LLM integration must handle failures gracefully. LLMs can hallucinate tool names, generate invalid JSON, call tools with incorrect argument types, or simply fail to complete a task. Your orchestration layer must be robust to all of these failure modes.
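Incorrect argument types can be caught before a tool runs by checking the parsed arguments against the parameter schema the tool declares. The sketch below is deliberately small: validate_args() and JSON_TYPES are hypothetical names, the check covers only required keys and top-level primitive types, and the sample schema for indent_function is an assumed shape based on the arguments used later in the walkthrough. A full JSON Schema validator would cover nested objects, enums, and ranges.

```python
# Mapping from JSON Schema type names to Python types.
JSON_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "array": list,
    "object": dict,
}


def validate_args(schema: dict, args: dict) -> list:
    """Return a list of human-readable problems; empty means valid."""
    errors = []
    props = schema.get("properties", {})

    for name in schema.get("required", []):
        if name not in args:
            errors.append(f"missing required argument '{name}'")

    for name, value in args.items():
        if name not in props:
            errors.append(f"unexpected argument '{name}'")
            continue
        type_name = props[name].get("type")
        expected = JSON_TYPES.get(type_name)
        if expected is None:
            continue   # No type declared, or one this sketch does not cover.
        # bool is a subclass of int in Python, so reject it explicitly
        # wherever the schema asks for something other than a boolean.
        wrong_bool = isinstance(value, bool) and expected is not bool
        if wrong_bool or not isinstance(value, expected):
            errors.append(
                f"argument '{name}' should be {type_name}, "
                f"got {type(value).__name__}"
            )
    return errors


# Assumed schema for the indent_function tool, for illustration only.
INDENT_SCHEMA = {
    "type": "object",
    "properties": {
        "function_name": {"type": "string"},
        "indent_spaces": {"type": "integer"},
    },
    "required": ["function_name", "indent_spaces"],
}

print(validate_args(INDENT_SCHEMA, {"function_name": "fib", "indent_spaces": "4"}))
```

Returning the problems as text rather than raising makes it easy to send them back to the LLM as a tool result, so the model can correct its own call on the next iteration.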

Here is a production-grade orchestration loop with error handling, retry logic, and a maximum iteration limit:

def run_agent_robust(user_instruction: str,
                     document_context: str,
                     client: OpenAI,
                     model: str,
                     max_iterations: int = 10,
                     retry_on_error: bool = True) -> dict:
    """
    A production-grade agent loop with comprehensive error handling.

    Returns a dict with keys:
        status        : "ok", "error", or "max_iterations_reached"
        result        : The LLM's final text response (if status=="ok")
        error         : The exception message (if status=="error")
        actions_taken : List of {tool, args, result} dicts
        iterations    : Number of loop iterations executed
    """
    messages      = build_initial_messages(user_instruction, document_context)
    actions_taken = []
    iteration     = 0
    api_failures  = 0   # consecutive failed API calls; drives the back-off

    while iteration < max_iterations:
        iteration += 1
        logger.info(f"Agent iteration {iteration}/{max_iterations}")

        # --- LLM API call with retry on errors ---
        try:
            response = client.chat.completions.create(
                model=model,
                messages=messages,
                tools=TOOLS,
                tool_choice="auto",
                timeout=30.0
            )
            api_failures = 0
        except Exception as exc:
            api_failures += 1
            logger.error(f"LLM API call failed (iteration {iteration}): {exc}")
            if retry_on_error and iteration < max_iterations:
                # Exponential back-off based on consecutive failures,
                # capped so that a late failure never waits for minutes.
                wait = min(2 ** api_failures, 30)
                logger.info(f"Retrying in {wait}s ...")
                time.sleep(wait)
                continue
            return {
                "status":        "error",
                "error":         str(exc),
                "actions_taken": actions_taken,
                "iterations":    iteration
            }

        message = response.choices[0].message

        if message.tool_calls:
            messages.append(message)

            for tool_call in message.tool_calls:
                func_name = tool_call.function.name
                logger.info(f"Tool call requested: {func_name}")

                # --- Validate that the tool is registered ---
                if func_name not in TOOL_IMPLEMENTATIONS:
                    error_result = {
                        "error": (
                            f"Tool '{func_name}' does not exist. "
                            f"Available tools: "
                            f"{list(TOOL_IMPLEMENTATIONS.keys())}"
                        )
                    }
                    messages.append({
                        "role":         "tool",
                        "tool_call_id": tool_call.id,
                        "content":      json.dumps(error_result)
                    })
                    continue

                # --- Parse the JSON arguments safely ---
                try:
                    func_args = json.loads(tool_call.function.arguments)
                except json.JSONDecodeError as exc:
                    error_result = {
                        "error": f"Invalid JSON in tool arguments: {exc}"
                    }
                    messages.append({
                        "role":         "tool",
                        "tool_call_id": tool_call.id,
                        "content":      json.dumps(error_result)
                    })
                    continue

                # --- Execute the tool safely ---
                try:
                    result = TOOL_IMPLEMENTATIONS[func_name](**func_args)
                    actions_taken.append({
                        "tool":   func_name,
                        "args":   func_args,
                        "result": result
                    })
                except Exception as exc:
                    result = {"error": f"Tool execution failed: {exc}"}
                    logger.error(f"Tool {func_name} raised: {exc}")

                messages.append({
                    "role":         "tool",
                    "tool_call_id": tool_call.id,
                    "content":      json.dumps(result)
                })

        else:
            # No tool calls: the LLM has finished the task.
            return {
                "status":        "ok",
                "result":        message.content,
                "actions_taken": actions_taken,
                "iterations":    iteration
            }

    # Reached the iteration cap without a final response.
    return {
        "status":        "max_iterations_reached",
        "actions_taken": actions_taken,
        "iterations":    iteration
    }

The max_iterations limit is a critical safety mechanism. Without it, a buggy tool or a confused LLM could cause an infinite loop. Ten iterations is usually more than enough for even complex multi-step tasks. If a task genuinely requires more steps, you should reconsider whether it should be broken into smaller sub-tasks.

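The exponential back-off on API failures (the wait doubles before each retry) is a standard pattern for handling transient network errors and rate limiting. It keeps your application from hammering a failing API endpoint. Two caveats apply to the loop above: every retry also consumes one of the max_iterations, and the broad except clause retries on any exception, so in a real system you would restrict retries to errors that are actually transient, such as timeouts and rate-limit responses.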
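The retry logic can also be factored out of the agent loop so that it is reusable and testable on its own. The following is a minimal sketch under stated assumptions: call_with_backoff() is a hypothetical helper, the default retry_on tuple uses built-in exception types as placeholders for whatever transient errors your client library raises, and the sleep parameter exists only so the delay can be replaced in tests.

```python
import random
import time


def call_with_backoff(func,
                      max_attempts: int = 4,
                      base_delay: float = 1.0,
                      max_delay: float = 30.0,
                      retry_on: tuple = (TimeoutError, ConnectionError),
                      sleep=time.sleep):
    """Call func(), retrying on the given exceptions with capped
    exponential back-off plus random jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise   # Out of attempts: let the caller see the error.
            delay = min(base_delay * 2 ** (attempt - 1), max_delay)
            # Jitter spreads out retries from clients that failed together.
            sleep(delay + random.uniform(0, delay / 2))


# Demonstration with a function that fails twice and then succeeds.
attempts = {"count": 0}

def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TimeoutError("simulated timeout")
    return "ok"

delays = []   # Record the delays instead of actually sleeping.
result = call_with_backoff(flaky_call, sleep=delays.append)
print(result, [round(d, 2) for d in delays])
```

Exceptions outside retry_on propagate immediately, which keeps programming errors such as a malformed request from being retried as if they were network problems.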

CHAPTER 16: UNDO/REDO AND DOCUMENT VERSIONING

Any application that allows automated modifications to documents must support undo and redo. This is especially important for LLM-driven modifications because the user may not be able to predict exactly what the LLM will do, and they need a safety net.

The correct approach is to save a snapshot of the document both before AND after each LLM operation. The before-snapshot is used for undo (restoring the previous state), and the after-snapshot is used for redo (re-applying a previously undone operation). Here is a correct implementation:

class DocumentVersionManager:
    """
    Maintains before/after snapshots of the document for undo/redo support.

    Undo stack entries are (before_path, after_path, description) tuples.
    Redo stack entries have the same structure.
    """

    def __init__(self, doc_path: str, max_versions: int = 50):
        self.doc_path     = Path(doc_path)
        self.versions_dir = self.doc_path.parent / ".doc_versions"
        self.versions_dir.mkdir(exist_ok=True)
        self.max_versions = max_versions
        self.undo_stack   = []   # list of (before_path, after_path, desc)
        self.redo_stack   = []

    def _snapshot(self, label: str) -> Path:
        """Saves the current document to a timestamped snapshot file."""
        ts   = datetime.now().strftime("%Y%m%d_%H%M%S_%f")
        # Reuse the document's own extension rather than assuming .docx.
        path = self.versions_dir / f"{label}_{ts}{self.doc_path.suffix}"
        shutil.copy2(self.doc_path, path)
        return path

    def begin_operation(self, description: str) -> Path:
        """
        Call this BEFORE an LLM operation.
        Saves the current state as the 'before' snapshot and returns
        the path so that end_operation() can pair it with the 'after'.
        """
        return self._snapshot("before")

    def end_operation(self, before_path: Path, description: str) -> None:
        """
        Call this AFTER a successful LLM operation.
        Saves the current (modified) state as the 'after' snapshot and
        pushes both snapshots onto the undo stack.
        """
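        after_path = self._snapshot("after")
        self.undo_stack.append((before_path, after_path, description))

        # A new operation invalidates the redo history. Delete the
        # orphaned snapshot files so they do not accumulate on disk.
        for old_before, old_after, _ in self.redo_stack:
            old_before.unlink(missing_ok=True)
            old_after.unlink(missing_ok=True)
        self.redo_stack.clear()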

        # Trim the oldest entries if we exceed the limit.
        while len(self.undo_stack) > self.max_versions:
            old_before, old_after, _ = self.undo_stack.pop(0)
            old_before.unlink(missing_ok=True)
            old_after.unlink(missing_ok=True)

    def undo(self) -> bool:
        """
        Restores the document to its state before the last operation.
        Returns True on success, False if there is nothing to undo.
        """
        if not self.undo_stack:
            return False
        before_path, after_path, desc = self.undo_stack.pop()
        self.redo_stack.append((before_path, after_path, desc))
        shutil.copy2(before_path, self.doc_path)
        logger.info(f"Undid: {desc}")
        return True

    def redo(self) -> bool:
        """
        Re-applies the most recently undone operation.
        Returns True on success, False if there is nothing to redo.
        """
        if not self.redo_stack:
            return False
        before_path, after_path, desc = self.redo_stack.pop()
        self.undo_stack.append((before_path, after_path, desc))
        shutil.copy2(after_path, self.doc_path)
        logger.info(f"Redid: {desc}")
        return True

    def get_history(self) -> list:
        """Returns the undo history as a list of description strings."""
        return [desc for _, _, desc in reversed(self.undo_stack)]


# -----------------------------------------------------------------------
# Wrapper that integrates versioning with every LLM operation
# -----------------------------------------------------------------------

version_manager = DocumentVersionManager("my_document.docx")

def run_llm_operation(user_instruction: str,
                       document_context: str) -> dict:
    """
    Runs an LLM operation with automatic before/after snapshotting.
    On failure, automatically rolls back to the pre-operation state.
    """
    before_path = version_manager.begin_operation(user_instruction)

    result = run_agent_robust(
        user_instruction=user_instruction,
        document_context=document_context,
        client=client,
        model=MODEL
    )

    if result["status"] == "ok":
        version_manager.end_operation(before_path, user_instruction)
    else:
        # Auto-rollback: restore the document to its pre-operation state.
        shutil.copy2(before_path, version_manager.doc_path)
        logger.warning(f"Operation failed; document rolled back. "
                        f"Reason: {result.get('error', result['status'])}")

    return result

The key correction over the naive approach is that we now save snapshots both before and after each operation. The undo() method restores the before-snapshot, and the redo() method restores the after-snapshot. This gives you a complete, correct undo/redo history. The auto-rollback on failure is a particularly valuable safety net: if the LLM operation fails for any reason, the document is automatically restored to its pre-operation state without any action required from the user.
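The rollback in run_llm_operation() is triggered by the status that the agent returns, so an exception raised inside the agent or a tool would bypass it. One way to cover that case is a context manager that restores the file whenever its body raises. This is a minimal sketch, not part of the versioning class above: rollback_on_error() is a hypothetical helper, and it keeps its backup in a temporary directory rather than in the undo history.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def rollback_on_error(doc_path: Path):
    """Restore doc_path to its original contents if the body raises."""
    with tempfile.TemporaryDirectory() as tmp:
        backup = Path(tmp) / doc_path.name
        shutil.copy2(doc_path, backup)
        try:
            yield
        except Exception:
            # Put the original file back, then let the error propagate.
            shutil.copy2(backup, doc_path)
            raise


# Demonstration on a throwaway text file.
with tempfile.TemporaryDirectory() as workdir:
    doc = Path(workdir) / "demo.txt"
    doc.write_text("original text")

    try:
        with rollback_on_error(doc):
            doc.write_text("half-finished edit")
            raise RuntimeError("simulated tool crash")
    except RuntimeError:
        pass

    restored = doc.read_text()
    print(restored)
```

Re-raising after the restore is intentional: the caller still learns that the operation failed, but the document is never left in a half-edited state.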

CHAPTER 17: STREAMING RESPONSES AND PROGRESSIVE UI

For long text generation tasks (beautifying a section, writing an appendix), the user should not have to stare at a blank screen while the LLM generates the response. Streaming allows you to display the generated text progressively as it arrives, which dramatically improves the perceived responsiveness of the application. Both the OpenAI API and Ollama support streaming via Server-Sent Events.

The streaming agent loop is more complex than the standard loop because tool call arguments arrive in fragments that must be accumulated before they can be parsed as JSON. The following implementation handles this correctly:

def run_agent_streaming(user_instruction: str,
                         document_context: str,
                         on_token_callback) -> dict:
    """
    Runs the agent with streaming output for text generation tasks.

    on_token_callback(token: str) -> None
        Called with each new text token as it arrives from the LLM.
        Use this to update a UI text widget in real time.

    Returns the same dict structure as run_agent_robust().
    """
    messages      = build_initial_messages(user_instruction, document_context)
    actions_taken = []

    # We run one streaming request at a time. If the LLM calls tools,
    # we execute them and then start a new streaming request.
    while True:
        accumulated_text  = ""
        tool_calls_buffer = {}   # index -> {id, name, arguments_str}

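        # stream=True makes create() return an iterator of
        # ChatCompletionChunk objects; each chunk's choices[0].delta
        # carries the incremental text and tool call fragments.
        with client.chat.completions.create(
            model=MODEL,
            messages=messages,
            tools=TOOLS,
            tool_choice="auto",
            stream=True
        ) as stream:
            for chunk in stream: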
                if not chunk.choices:
                    continue
                delta = chunk.choices[0].delta

                # Accumulate streamed text tokens.
                if delta.content:
                    accumulated_text += delta.content
                    on_token_callback(delta.content)

                # Accumulate streamed tool call fragments.
                # Each chunk may carry a partial tool call; we buffer
                # them by index and reassemble after the stream ends.
                if delta.tool_calls:
                    for tc_chunk in delta.tool_calls:
                        idx = tc_chunk.index
                        if idx not in tool_calls_buffer:
                            tool_calls_buffer[idx] = {
                                "id":            tc_chunk.id or "",
                                "name":          "",
                                "arguments_str": ""
                            }
                        if tc_chunk.function.name:
                            tool_calls_buffer[idx]["name"] += (
                                tc_chunk.function.name
                            )
                        if tc_chunk.function.arguments:
                            tool_calls_buffer[idx]["arguments_str"] += (
                                tc_chunk.function.arguments
                            )

        # --- After the stream ends, process what we received ---

        if tool_calls_buffer:
            # The LLM requested tool calls. Execute them and loop.
            # Reconstruct a message object for the history.
            messages.append({
                "role":       "assistant",
                "content":    accumulated_text or None,
                "tool_calls": [
                    {
                        "id":       buf["id"],
                        "type":     "function",
                        "function": {
                            "name":      buf["name"],
                            "arguments": buf["arguments_str"]
                        }
                    }
                    for buf in tool_calls_buffer.values()
                ]
            })

            for buf in tool_calls_buffer.values():
                try:
                    func_args = json.loads(buf["arguments_str"])
                    result    = TOOL_IMPLEMENTATIONS.get(
                        buf["name"],
                        lambda **_: {"error": f"Unknown: {buf['name']}"}
                    )(**func_args)
                    actions_taken.append({
                        "tool":   buf["name"],
                        "args":   func_args,
                        "result": result
                    })
                except Exception as exc:
                    result = {"error": str(exc)}

                messages.append({
                    "role":         "tool",
                    "tool_call_id": buf["id"],
                    "content":      json.dumps(result)
                })

        else:
            # No tool calls: the LLM is done.
            return {
                "status":        "ok",
                "result":        accumulated_text,
                "actions_taken": actions_taken
            }

In a desktop application built with PyQt6 or Tkinter, the on_token_callback would update a text widget in real time, showing the user the LLM's output as it is generated. This makes the application feel far more responsive, particularly for text generation tasks where the LLM may be writing several paragraphs of content. Note that the streaming loop above runs until the LLM stops requesting tools; a production version should add the same max_iterations guard and error handling as run_agent_robust().
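GUI toolkits such as Tkinter generally expect widgets to be updated from the main thread, so if the agent runs in a worker thread, the callback should hand tokens over rather than touch the widget directly. The sketch below shows one common way to do that with a thread-safe queue. The names on_token(), drain_tokens(), and fake_agent() are hypothetical, and fake_agent() merely simulates a streaming run so that the example is self-contained.

```python
import queue
import threading

token_queue = queue.Queue()   # Thread-safe hand-off between worker and UI.


def on_token(token: str) -> None:
    """Callback passed to the agent; safe to call from a worker thread."""
    token_queue.put(token)


def drain_tokens(append_text) -> int:
    """Run on the UI thread: move all queued tokens into the widget.

    append_text is whatever function appends text to your widget.
    Returns the number of tokens that were moved.
    """
    moved = 0
    while True:
        try:
            token = token_queue.get_nowait()
        except queue.Empty:
            return moved
        append_text(token)
        moved += 1


def fake_agent(callback) -> None:
    """Stand-in for a streaming agent run; emits a few tokens."""
    for token in ["Memo", "ization", " caches", " results."]:
        callback(token)


worker = threading.Thread(target=fake_agent, args=(on_token,))
worker.start()
worker.join()

shown = []                    # Stand-in for the text widget's contents.
drain_tokens(shown.append)
print("".join(shown))
```

In a Tkinter application, drain_tokens() would typically be called from a function that reschedules itself with the widget's after() method, so the queue is polled every few tens of milliseconds without blocking the event loop.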

PART EIGHT: PUTTING IT ALL TOGETHER - A COMPLETE EXAMPLE

CHAPTER 18: THE COMPLETE SYSTEM IN ACTION

Let us now trace through a complete, realistic session with our LLM-augmented text processor to see how all the pieces fit together.

The user opens a document called "fibonacci_study.docx" which has the following structure (this is a schematic illustration, not actual markup syntax):

Document: "fibonacci_study.docx"
-------------------------------------------------------
[Heading 1]  Chapter one: Introduction
[Body]       This chapter introduces the Fibonacci sequence...
[Heading 1]  Chapter two: Recursive Approaches
[Heading 2]  Section one: Naive Recursion
[Body]       The naive recursive approach computes fib(n) by...
[Code]       def fib(n):
                 if n <= 1: return n
                 return fib(n-1) + fib(n-2)
[Heading 2]  Section two: Memoization
[Body]       Memoization avoids redundant computation by...
[Heading 1]  Chapter three: Iterative Approaches
-------------------------------------------------------

The user types the following sequence of commands into the LLM command bar.

Command 1: "Change every word 'one' in a headline to 'two'."

The router classifies this as a structural task and routes it to Llama 3.1 8B. The LLM calls replace_in_headings(search="one", replace="two"). The tool finds two headings containing the word "one" and modifies them: "Chapter one: Introduction" becomes "Chapter two: Introduction", and "Section one: Naive Recursion" becomes "Section two: Naive Recursion". The heading "Section two: Memoization" does not contain "one" and is left unchanged. The LLM confirms: "Done. I updated 2 headings to replace 'one' with 'two'."
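A detail worth getting right in a tool like this is whole-word matching, so that "one" is not replaced inside longer words. The following is a minimal sketch of the idea on a simplified document model: this replace_in_headings() is a hypothetical stand-in that works on a list of (style, text) tuples rather than on a real document file, and the sample headings are invented for illustration.

```python
import re


def replace_in_headings(paragraphs: list, search: str, replace: str):
    """Replace whole-word occurrences of `search` in heading paragraphs.

    paragraphs is a list of (style, text) tuples. Returns the updated
    list and the number of headings that were changed. Matching is
    case-sensitive in this sketch.
    """
    pattern = re.compile(rf"\b{re.escape(search)}\b")
    updated, changed = [], 0
    for style, text in paragraphs:
        if style.startswith("Heading"):
            text, hits = pattern.subn(replace, text)
            if hits:
                changed += 1
        updated.append((style, text))
    return updated, changed


sample_doc = [
    ("Heading 1", "Part one: Components"),
    ("Body", "Only one body paragraph mentions one."),
    ("Heading 2", "Milestone one"),
]

new_doc, count = replace_in_headings(sample_doc, "one", "two")
for style, text in new_doc:
    print(f"[{style}] {text}")
print("headings changed:", count)
```

The word boundary anchors leave "Components" intact even though it contains the letters "one", and the style check leaves the body paragraph alone, which is exactly the distinction a plain find-and-replace cannot make.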

Command 2: "Move the fib() function 4 spaces to the right."

The router classifies this as structural. The LLM calls indent_function(function_name="fib", indent_spaces=4). The code block becomes:

[Code]           def fib(n):
                     if n <= 1: return n
                     return fib(n-1) + fib(n-2)

(Each line of the function has been shifted 4 spaces to the right.)
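For the shifting step itself, Python's standard library already provides what is needed. The sketch below shows only the core text operation using textwrap.indent(); how the indent_function tool locates the function inside the document is not shown here, and the variable names are illustrative.

```python
import textwrap

code = (
    "def fib(n):\n"
    "    if n <= 1: return n\n"
    "    return fib(n-1) + fib(n-2)"
)

# Prefix every non-blank line with four spaces, preserving the
# relative indentation between the def line and its body.
shifted = textwrap.indent(code, " " * 4)
print(shifted)
```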

Command 3: "Change the font of all code listings to JetBrains Mono 10pt."

The LLM calls tool_set_style_font(style_name="Code", font_name="JetBrains Mono", font_size_pt=10). All code blocks in the document now use JetBrains Mono 10pt. The LLM confirms: "Done. The 'Code' style now uses JetBrains Mono 10pt. This affects all 3 code block paragraphs."

Command 4: "Beautify and extend the introduction section."

The router classifies this as text_generation and routes it to Llama 3.1 70B. The LLM calls read_section(heading_text="Chapter two: Introduction") to get the current text, generates an improved and extended version, and calls replace_section_content() to update the document. The user sees the improved text appear in the document.

Command 5: "Read the Naive Recursion section and create a figure showing the recursion tree for fib(4), then insert it after that section's heading."

The router classifies this as multimodal. The LLM calls read_section(), then generate_figure() with an appropriate technical diagram prompt, then insert_figure(). The document now contains a generated diagram of the fib(4) recursion tree with a caption.

Command 6: "Add an appendix titled 'A: Complexity Analysis' with a brief explanation of the time and space complexity of each algorithm discussed."

The router classifies this as text_generation. The LLM calls read_section() for each algorithm section to understand the content, generates the complexity analysis text, and calls tool_add_appendix() to add it to the document with a page break and correct heading style.

Command 7: "Explain what memoization is and insert the explanation after the 'Section two: Memoization' heading."

The LLM generates a clear, well-structured explanation of memoization and calls insert_text_after() to place it in the document immediately after the correct heading.

In approximately two to three minutes of natural language interaction, the user has performed seven complex document editing operations that would have taken significantly longer with traditional tools. More importantly, several of these operations, including beautifying the introduction, generating the recursion tree figure, and writing the complexity analysis appendix, would have required the user to do substantial intellectual work themselves. The LLM has genuinely augmented the user's capabilities, not just automated routine tasks.

CONCLUSION: THE INTELLIGENCE LAYER IS NOW WITHIN REACH

We have covered a great deal of ground in this tutorial. We started with the fundamental question of why an application deserves a brain, and we ended with a complete, working architecture for an LLM-augmented text processor that can understand natural language commands, reason about document structure, generate and embed figures, answer general questions, and route tasks to the most appropriate model.

The key engineering insights to carry forward are these. The OpenAI-compatible API standard means you can write your integration code once and switch between any LLM provider by changing a configuration parameter. Tool use is the mechanism that keeps your application in control: the LLM reasons about what to do, but your code does the actual work, which means you can validate, audit, and undo every action. The document context is how you give the LLM the information it needs to reason correctly, and being strategic about what you include keeps your token usage efficient. The four-layer architecture gives you a clean separation of concerns that makes your system testable, maintainable, and extensible over time. The model router pattern lets you use the right model for each task, balancing speed, quality, cost, and privacy in a principled way. And the versioning system with correct before/after snapshots ensures that users can always recover from unexpected LLM behaviour.

The LLM is not a replacement for your application's logic. It is an intelligence layer that sits on top of your existing capabilities and makes them accessible through natural language. Your application remains the expert on its own domain; the LLM is the translator between human intent and machine operation. Together, they create something more powerful than either could be alone.

Hitchhiker's Guide to AI, Software Architecture, and Everything Else

Saturday, June 27, 2026

TEACHING YOUR APPLICATION TO THINK