Motivation
There is a peculiar irony at the heart of modern artificial intelligence. We have built systems of staggering complexity — neural networks with hundreds of billions of parameters, trained on essentially the entire written output of human civilization — and yet the quality of what these systems produce depends enormously on something as seemingly simple as how you phrase your question. A few words changed, a sentence restructured, a role assigned or withheld, and the difference between a mediocre response and a genuinely brilliant one can be the difference between a tool that transforms your work and one that frustrates you into abandoning it. This is the domain of prompt engineering, and understanding it deeply is the first step toward automating it.
Prompt engineering is not, as some dismissive observers have suggested, merely a fancy term for "talking to a chatbot." It is a genuine discipline that sits at the intersection of linguistics, cognitive science, software engineering, and machine learning theory. A skilled prompt engineer understands how large language models process text, what kinds of instructions they respond to most reliably, how their training shapes their tendencies and blind spots, and how to construct inputs that consistently elicit high-quality outputs. The best prompt engineers think of themselves not as users of a tool but as collaborators with a very particular kind of mind — one that is extraordinarily capable in some dimensions and surprisingly fragile in others.
To understand why prompt quality matters so much, you need to understand something fundamental about how large language models work. At their core, these models are next-token predictors. Given a sequence of tokens (roughly, word fragments), they compute a probability distribution over what token should come next, sample from that distribution, append the result, and repeat. The entire behavior of the model — its apparent reasoning, its creativity, its factual recall, its tone — emerges from this simple process applied billions of times over. What this means in practice is that the model does not "understand" your prompt in the way a human colleague would. It processes it as a statistical context that shapes the probability distribution over subsequent tokens. A well-crafted prompt creates a context in which high-quality, relevant, accurate tokens are the most probable next choices. A poorly crafted prompt creates a context in which the model wanders, hallucinates, misunderstands, or produces generic filler.
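The generation loop described above can be sketched in a few lines. The sketch below is a toy simulation: `toy_next_token_distribution` stands in for the real neural network, and the tiny vocabulary is invented, but the sample-append-repeat structure is the same one a real model follows.

```python
import random

# Toy illustration of the next-token generation loop.
# A real model computes the distribution with billions of parameters;
# here we simply favor tokens that have not appeared yet.
VOCAB = ["the", "cat", "sat", "on", "mat", "."]

def toy_next_token_distribution(context: list) -> dict:
    weights = {tok: (0.1 if tok in context else 1.0) for tok in VOCAB}
    total = sum(weights.values())
    return {tok: w / total for tok, w in weights.items()}

def generate(prompt: list, max_new_tokens: int = 5, seed: int = 0) -> list:
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        dist = toy_next_token_distribution(tokens)
        # Sample one token from the distribution, append it, repeat.
        choices, probs = zip(*dist.items())
        tokens.append(rng.choices(choices, weights=probs, k=1)[0])
    return tokens

out = generate(["the", "cat"])
```

The prompt's role in this picture is simply to shape the distribution at every step: everything the model "knows" about your intent lives in that context.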
Consider a concrete example. Suppose you want an LLM to help you debug a piece of Python code. You could write your prompt as simply as "fix my code" followed by a code block. The model will produce something — it always does — but what it produces might be a superficial fix that misses the root cause, or a rewrite in a completely different style than you intended, or a response that explains what the code does rather than fixing it. Now consider a more carefully engineered version of the same request. You assign the model a role as a senior software engineer specializing in Python performance optimization. You provide context about what the code is supposed to do. You specify the constraints — preserve the existing API, do not change the function signatures, explain each change with a comment. You ask for a step-by-step analysis before the fix. You specify the output format. The same underlying model, given this richer context, will produce a dramatically better result. The model has not changed. The task has not changed. Only the prompt has changed.
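To make the contrast concrete, here is roughly what the two versions of the request might look like side by side. The code snippet and constraint list are invented for illustration; only the structure of the engineered version (role, context, constraints, step-by-step request) follows the recipe above.

```python
code_snippet = "def mean(xs):\n    return sum(xs) / len(xs)"

# The naive version: the model must guess intent, style, and scope.
naive_prompt = f"fix my code\n\n{code_snippet}"

# The engineered version makes every implicit expectation explicit.
engineered_prompt = f"""You are a senior software engineer specializing in Python.

Context: `mean` is called on user-supplied lists that may be empty.

Task: fix the crash on empty input.

Constraints:
1. Preserve the existing API: do not change the function signature.
2. Explain each change with an inline comment.

First analyze the root cause step by step, then give the fixed code.

{code_snippet}"""
```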
This sensitivity to prompt formulation is both the great opportunity and the great challenge of working with LLMs. The opportunity is that with the right prompting techniques, you can unlock capabilities that seem almost magical. The challenge is that crafting those prompts requires expertise that most users simply do not have, and that the expertise required varies significantly from model to model.
The Model Diversity Problem
Here is where things get genuinely interesting, and where many users run into unexpected trouble. The prompt engineering community has developed a rich set of best practices over the past several years, but a dirty secret lurks beneath the surface of all those best practices: they are not universal. A prompt that works beautifully with GPT-4o may produce mediocre results with Llama 3, and a prompt optimized for Claude may confuse Mistral. This is not a minor inconvenience — it is a fundamental characteristic of the current LLM landscape, and understanding why it happens is essential to building a prompt optimizer that actually works.
This model-to-model variation has several causes, and they run deep. The first reason is training data and fine-tuning. Each major LLM family has been trained on different corpora, fine-tuned with different human feedback datasets, and optimized for different use cases. OpenAI's models have been fine-tuned extensively with RLHF (Reinforcement Learning from Human Feedback) to follow instructions in a particular style. Anthropic's Claude models have been trained with Constitutional AI, a process that instills specific values and response patterns. Meta's Llama models, in their base form, are raw next-token predictors that require careful prompting to behave as instruction-following assistants. The fine-tuned variants of Llama (like Llama-Instruct) have been adapted for instruction following, but with different techniques and different training data than the commercial models.
The second reason is tokenization. Different models use different tokenizers, which means they literally see different sequences of tokens when they process your prompt. This matters because the model's attention mechanisms operate on tokens, and the way your prompt is tokenized affects how the model processes it. A prompt that uses special delimiter tokens that exist in one model's vocabulary may not have equivalents in another model's vocabulary, leading to degraded performance.
The third reason is context window and attention patterns. Models with larger context windows have been trained to handle long-range dependencies differently than models with smaller windows. A prompt that relies on the model attending to information provided early in a long context may work well with a model trained on million-token contexts but poorly with a model whose effective attention span is more limited.
The fourth reason is system prompt support. Some models have been explicitly trained to treat system prompts as high-priority instructions that override user requests. Others treat system prompts as just another part of the input context, giving them no special weight. Some local models, particularly base models that have been lightly fine-tuned, may not have been trained with system prompts at all, and injecting a system prompt into the conversation may confuse rather than guide them.
The fifth reason is the specific instruction-following conventions that each model has internalized. Claude models respond exceptionally well to XML-tagged structure — wrapping your context in tags like context, task, and output_format produces noticeably better results than unstructured prose. GPT models respond well to numbered steps and explicit format specifications. Llama models, particularly the instruct variants, have been trained with specific chat templates that include special tokens marking the boundaries between system, user, and assistant turns, and failing to use these templates correctly can significantly degrade performance.
To make this concrete, consider how the same conceptual request — "analyze this financial document and extract key metrics" — would need to be phrased differently for different models. For a Claude model, you would wrap the document in XML tags, use a structured output_format specification, and leverage Claude's strength with long documents by providing rich context. For a GPT model, you would use a clear numbered list of the metrics to extract, specify the output as a JSON object with defined fields, and keep the system prompt focused and direct. For a local Llama model running via llama-cpp-python, you would need to ensure the chat template is applied correctly, keep the prompt concise enough to fit within the effective context window, and avoid relying on capabilities that may have been reduced by quantization.
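As a rough sketch, here is how the two API-model phrasings might look. The sample document and metric names are invented; for a local Llama model, the same request would additionally need to be wrapped in the model's chat template.

```python
# Illustrative only: the same extraction request phrased for two
# different model families.
document = "Q3 revenue was $4.2M, up 12% year over year; gross margin held at 61%."

# Claude-style: XML tags delimit context, task, and output format.
claude_prompt = (
    f"<context>\n{document}\n</context>\n"
    "<task>\nExtract the key financial metrics from the document above.\n</task>\n"
    "<output_format>\nOne metric per line, as \"name: value\".\n</output_format>"
)

# GPT-style: a numbered list of fields plus an explicit JSON schema.
gpt_prompt = (
    "Extract these metrics from the document below:\n"
    "1. revenue\n"
    "2. revenue_growth_yoy\n"
    "3. gross_margin\n"
    "Respond with a JSON object using exactly those keys.\n\n"
    f"Document: {document}"
)
```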
This is the fundamental problem that a prompt optimizer must solve: given a raw, user-written prompt and a target model, produce an optimized version of that prompt that is specifically tailored to that model's strengths, conventions, and requirements. And it must do this automatically, without requiring the user to have expert knowledge of prompt engineering.
The Anatomy of a Prompt
Before we can build a system that optimizes prompts, we need to understand what a prompt actually consists of. Modern LLM interactions are structured around a conversation format with distinct roles, and understanding these roles is essential to understanding what optimization means in practice.
The system prompt is the highest-level instruction layer. It establishes the model's persona, defines its constraints, sets the context for the entire conversation, and specifies behavioral guidelines. A well-crafted system prompt is like a job description given to a highly capable employee before they start work — it tells them who they are, what they are trying to accomplish, what they should and should not do, and how they should communicate. The system prompt is processed first and typically receives the highest weight in the model's attention, making it the most powerful lever for shaping model behavior.
The user prompt is the actual request or input from the human side of the conversation. This is where the specific task is described, where input data is provided, and where constraints specific to this particular request are stated. The user prompt should be clear, specific, and complete — it should not assume that the model knows things that have not been stated, and it should not leave ambiguity about what a successful response looks like.
The assistant primer is a less commonly used but powerful technique. By providing the beginning of the model's response, you can steer the model toward a particular format, tone, or approach. If you want the model to respond with a JSON object, starting the assistant turn with an opening brace can dramatically increase the reliability of structured output. If you want the model to begin with a specific phrase or adopt a particular framing, the assistant primer is the tool for that.
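A minimal sketch of the primer technique, using an OpenAI-style messages list. One hedge applies: Anthropic's Messages API accepts a trailing assistant message and continues from it, while other APIs may need different mechanisms, so treat this as a pattern rather than a universal recipe.

```python
# The conversation ends with an incomplete assistant turn.
# A model that supports prefilling continues from the opening brace,
# which strongly biases it toward emitting raw JSON.
messages = [
    {"role": "system", "content": "You extract metrics and reply only with JSON."},
    {"role": "user", "content": "Revenue was $4.2M and margin was 61%."},
    {"role": "assistant", "content": "{"},  # the primer
]
```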
The full combined prompt is what the model actually sees — the concatenation of all these elements, formatted according to the model's specific chat template. For a model like Llama 3, this involves special tokens like the beginning-of-sequence token, role markers, and end-of-turn tokens. For OpenAI's API, the formatting is handled by the API itself based on the messages array you provide. For local models running without an API layer, you must apply the chat template manually.
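As an illustration, the Llama 3 template can be rendered by hand. The helper below is hypothetical; in practice a library such as Hugging Face's `tokenizer.apply_chat_template` does this for you, but writing it out makes the structure visible.

```python
def apply_llama3_template(system: str, user: str, primer: str = "") -> str:
    # Manually renders the Llama 3 chat template: a begin-of-text
    # token, then one header/content/end-of-turn block per message,
    # ending with an open assistant turn (optionally primed).
    return (
        "<|begin_of_text|>"
        f"<|start_header_id|>system<|end_header_id|>\n\n{system}<|eot_id|>"
        f"<|start_header_id|>user<|end_header_id|>\n\n{user}<|eot_id|>"
        f"<|start_header_id|>assistant<|end_header_id|>\n\n{primer}"
    )

full = apply_llama3_template(
    system="You are a concise assistant.",
    user="Summarize: the cat sat on the mat.",
    primer="Summary:",
)
```

The assistant turn is deliberately left without a closing token: the model's job is to complete it.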
Understanding this structure is what makes it possible to build an optimizer that produces not just a better user prompt, but a complete, model-specific prompt package consisting of an optimized system prompt, an optimized user prompt, and an appropriate assistant primer.
Prompt Engineering Patterns and Anti-Patterns
The field of prompt engineering has developed a rich vocabulary of patterns — reusable techniques that reliably improve model performance across a wide range of tasks. A prompt optimizer must know these patterns and know when to apply them.
Chain-of-Thought prompting is perhaps the most impactful single technique in the prompt engineer's toolkit. The core insight is that LLMs produce better answers to complex questions when they are instructed to reason through the problem step by step before giving a final answer. This works because the intermediate reasoning steps become part of the context that the model attends to when generating the final answer, effectively giving the model more "working memory" to use. For tasks involving mathematics, logic, multi-step reasoning, or complex analysis, adding a phrase like "Think through this step by step before giving your final answer" can dramatically improve accuracy. For reasoning models like OpenAI's o3 or o4-mini, this instruction is unnecessary because the model performs chain-of-thought reasoning internally, and adding it explicitly can actually reduce performance by interfering with the model's internal process.
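This conditional logic is easy to encode. The helper below is a hypothetical sketch of how an optimizer might apply the rule: add the explicit instruction for conventional models, and leave the prompt untouched for reasoning models.

```python
def maybe_add_cot(user_prompt: str, is_reasoning_model: bool) -> str:
    # Reasoning models (o3-style) already deliberate internally,
    # so the explicit instruction is omitted for them.
    if is_reasoning_model:
        return user_prompt
    return (
        user_prompt
        + "\n\nThink through this step by step before giving your final answer."
    )
```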
Role Assignment is the technique of giving the model a specific expert persona. Telling the model "You are a senior software engineer with 20 years of experience in distributed systems" before asking a question about database architecture produces noticeably better results than asking the same question without the role assignment. The reason is that the role assignment activates a particular cluster of knowledge and communication patterns in the model's learned representations. It is not magic — the model does not actually become a senior engineer — but it shifts the probability distribution over responses toward the kind of response a senior engineer would give.
Few-Shot prompting is the technique of providing examples of the desired input-output mapping before presenting the actual task. If you want the model to extract structured data from unstructured text in a specific format, showing it two or three examples of input text paired with correctly formatted output is far more reliable than describing the format in words alone. The model learns the pattern from the examples and applies it to the new input. Few-shot prompting is particularly powerful for tasks where the desired output format is complex or unusual, where the task requires a specific style or tone, or where the model has a tendency to misunderstand the task when described abstractly.
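A minimal sketch of few-shot assembly, with invented example pairs:

```python
# Each pair teaches the input-output mapping by demonstration,
# which is more reliable than describing the format in words.
examples = [
    ("Meeting moved to 3pm Friday", '{"event": "meeting", "time": "3pm Friday"}'),
    ("Dentist appointment Tuesday 9am", '{"event": "dentist", "time": "Tuesday 9am"}'),
]

def build_few_shot_prompt(examples, new_input: str) -> str:
    parts = ["Extract the event and time as JSON.\n"]
    for text, output in examples:
        parts.append(f"Input: {text}\nOutput: {output}\n")
    # The final, unanswered pair is the actual task.
    parts.append(f"Input: {new_input}\nOutput:")
    return "\n".join(parts)

prompt = build_few_shot_prompt(examples, "Lunch with Sam at noon")
```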
XML Structuring is a technique that has become particularly associated with Anthropic's Claude models, though it is useful with other models as well. By wrapping different sections of your prompt in XML-style tags, you create clear boundaries that help the model understand the structure of the input. A prompt that wraps the background context in a context tag, the task description in a task tag, and the output specification in an output_format tag is much easier for the model to parse correctly than an equivalent prompt written as unstructured prose. Claude models have been specifically trained to attend to these structural cues, making XML structuring especially effective for them.
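A small helper can generate this structure mechanically. This is a hypothetical sketch, not part of the system built later in the chapter:

```python
def wrap_sections(**sections: str) -> str:
    # Wraps each named section in matching XML-style tags.
    # Keyword order is preserved, so sections appear as given.
    return "\n".join(
        f"<{name}>\n{body}\n</{name}>" for name, body in sections.items()
    )

structured = wrap_sections(
    context="The report covers Q3 2024 sales across three regions.",
    task="Summarize the three most important findings.",
    output_format="A numbered list, one finding per line.",
)
```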
Self-Reflection and Verification is a technique where you ask the model to review its own output before finalizing it. After generating an initial response, the model is asked to check for errors, inconsistencies, or gaps, and to revise accordingly. This can be implemented as a two-step process (generate, then verify) or as a single prompt that asks the model to generate, verify, and revise in sequence. The technique works because the verification step creates a new context in which errors that were statistically likely during generation become statistically unlikely, since the model is now in a "checking" mode rather than a "generating" mode.
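The two-step variant can be sketched as a plain function that calls the model twice. Here `call_llm` is a stand-in for any function that sends a prompt and returns text; a fake backend is used so the two-pass structure is visible.

```python
def generate_with_verification(call_llm, task_prompt: str) -> str:
    # Pass 1: generate a draft answer.
    draft = call_llm(task_prompt)
    # Pass 2: put the model in "checking" mode on its own draft.
    review_prompt = (
        f"Here is a draft answer to the task below.\n\n"
        f"Task: {task_prompt}\n\nDraft: {draft}\n\n"
        "Check the draft for errors, inconsistencies, or gaps, "
        "then output the corrected final answer only."
    )
    return call_llm(review_prompt)

# A fake backend records each prompt it receives.
calls = []
def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    return f"response #{len(calls)}"

final = generate_with_verification(fake_llm, "Add 2 and 2.")
```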
On the anti-pattern side, vague instructions are the most common failure mode. Prompts that say "write something about climate change" or "help me with my code" are so underspecified that the model has no reliable way to produce what the user actually wants. The model will produce something — it always does — but the probability that it produces exactly what the user wanted is low. A good prompt optimizer will detect vague instructions and expand them with specificity.
Task overloading is the anti-pattern of cramming too many unrelated tasks into a single prompt. LLMs have limited "attention" in the sense that when a prompt contains many different requests, the model may address some well and others poorly, or may blend them in unexpected ways. Breaking complex multi-task prompts into focused single-task prompts, or at minimum clearly delineating the tasks with numbered steps or XML tags, significantly improves reliability.
Assumed context is the anti-pattern of writing prompts that assume the model knows things it does not know. LLMs do not have access to your codebase, your company's internal documentation, your personal history with a project, or any other information that was not provided in the prompt or was not part of their training data. A prompt that says "fix the bug in the authentication module" without providing the code for the authentication module is almost certain to produce a hallucinated response.
The Case for Automation
Given the complexity of prompt engineering — the dozens of patterns to know, the model-specific conventions to remember, the anti-patterns to avoid, the structural requirements to satisfy — it becomes clear why automating the process is so valuable. Even experienced prompt engineers spend significant time crafting and iterating on prompts. For non-expert users, the gap between what they write and what they need is often enormous. A prompt optimizer bridges this gap automatically.
The key insight behind a prompt optimizer is elegant: use a powerful LLM to optimize prompts for other LLMs. The optimizer model is given a meta-prompt — a carefully crafted system prompt that encodes all the knowledge of an expert prompt engineer — and is asked to transform a raw user prompt into an optimized version tailored to a specific target model. The optimizer model essentially acts as an expert consultant who takes your rough draft and produces a polished, professional version.
This approach has several important advantages. It scales to any target model, because the optimizer can be given model-specific tips as part of its meta-prompt. It improves consistently, because the meta-prompt encodes best practices that are applied every time. It frees users from needing to learn prompt engineering themselves, because the expertise is embedded in the system. And it can be updated as new best practices are discovered or as new models are released, simply by updating the meta-prompt.
The Architecture of a Prompt Optimizer
A well-designed prompt optimizer consists of several layers that work together to transform raw prompts into optimized ones. The hardware abstraction layer handles the detection and management of available compute resources — CUDA GPUs, Apple Silicon with MLX, Vulkan-capable GPUs, or CPU fallback. The adapter layer provides a uniform interface to different LLM backends, whether they are remote APIs or local models. The pattern library encodes the prompt engineering knowledge that the optimizer applies. The optimizer engine orchestrates the entire process, building the meta-prompt, calling the optimizer model, and parsing the result.
Let us begin building this system from the ground up, starting with the hardware abstraction layer, because the choice of compute backend affects everything else in the system.
Hardware Detection and Backend Selection
The first challenge in building a local LLM system is detecting what hardware is available and selecting the appropriate backend. Modern systems may have NVIDIA GPUs (requiring CUDA), AMD or Intel GPUs (requiring Vulkan or ROCm), Apple Silicon (requiring MLX or Metal), or only a CPU. The detection must be robust, fast, and non-crashing — attempting to import a library that is not installed should not bring down the entire application.
The following module implements a hardware detection system that cascades through available backends in order of performance, selecting the best available option. Each detection function is isolated and catches all exceptions, ensuring that a missing library or incompatible hardware never propagates an error upward.
# hardware_detector.py
#
# Detects available compute backends for local LLM inference.
# Detection order: MLX (Apple Silicon) -> CUDA -> Vulkan -> CPU.
# Each detector is fully isolated; failures are logged, not raised.
from __future__ import annotations
import logging
import platform
import sys
from dataclasses import dataclass, field
from functools import lru_cache
from typing import List
logger = logging.getLogger(__name__)
@dataclass
class HardwareProfile:
"""
Describes the compute capabilities of the current machine.
All fields default to safe/conservative values so that callers
can read them without checking for None.
"""
has_mlx: bool = False # Apple Silicon via mlx-lm
has_cuda: bool = False # NVIDIA GPU via llama-cpp CUDA build
has_vulkan: bool = False # Any Vulkan-capable GPU
has_cpu: bool = True # Always available as final fallback
cuda_device_name: str = ""
cuda_vram_gb: float = 0.0
apple_chip: str = ""
vulkan_device_name: str = ""
available_backends: List[str] = field(default_factory=list)
@property
def best_backend(self) -> str:
"""
Returns the name of the highest-performance available backend.
The order reflects real-world performance on typical LLM workloads:
MLX is extremely efficient on Apple Silicon unified memory,
CUDA is the gold standard for NVIDIA GPUs,
Vulkan provides GPU acceleration on non-NVIDIA hardware,
and CPU is the universal fallback.
"""
if self.has_mlx:
return "mlx"
if self.has_cuda:
return "cuda"
if self.has_vulkan:
return "vulkan"
return "cpu"
@property
def llama_cpp_gpu_layers(self) -> int:
"""
Returns the n_gpu_layers value for llama-cpp-python.
A value of -1 tells llama-cpp to offload ALL transformer
layers to the GPU, maximizing inference speed. A value of 0
means CPU-only inference. We use -1 whenever any GPU backend
is available, because partial offloading is rarely beneficial
and adds configuration complexity.
"""
if self.has_cuda or self.has_vulkan:
return -1
return 0
def describe(self) -> str:
"""Returns a human-readable summary of detected hardware."""
lines = [f"Best backend: {self.best_backend.upper()}"]
if self.has_mlx:
lines.append(f" Apple MLX: {self.apple_chip}")
if self.has_cuda:
lines.append(
f" NVIDIA CUDA: {self.cuda_device_name} "
f"({self.cuda_vram_gb:.1f} GB VRAM)"
)
if self.has_vulkan:
lines.append(f" Vulkan: {self.vulkan_device_name}")
lines.append(" CPU: always available")
return "\n".join(lines)
class HardwareDetector:
"""
Performs one-time hardware detection and caches the result.
All detection methods are static and fully exception-safe.
"""
@staticmethod
@lru_cache(maxsize=1)
def detect() -> HardwareProfile:
"""
Runs the full detection cascade and returns a cached profile.
The lru_cache decorator ensures this expensive operation runs
exactly once per process lifetime, regardless of how many
times detect() is called.
"""
profile = HardwareProfile()
profile.available_backends = ["cpu"]
# Detection runs in priority order. Each step is independent.
if sys.platform == "darwin":
profile = HardwareDetector._probe_mlx(profile)
if sys.platform != "darwin":
profile = HardwareDetector._probe_cuda(profile)
if not profile.has_cuda:
profile = HardwareDetector._probe_vulkan(profile)
logger.info(
"Hardware detection complete:\n%s", profile.describe()
)
return profile
@staticmethod
def _probe_mlx(profile: HardwareProfile) -> HardwareProfile:
"""
Checks for Apple MLX availability. MLX requires macOS on
Apple Silicon (arm64). Running under Rosetta on Intel hardware
is detected and excluded, because MLX would not have GPU access.
"""
try:
# Verify we are on native Apple Silicon, not Rosetta.
if platform.machine() != "arm64":
logger.info(
"macOS detected but machine is %s, not arm64. "
"Skipping MLX.", platform.machine()
)
return profile
# Attempt to import both mlx core and mlx_lm.
# If either is missing, we fall through to the except block.
import mlx.core # type: ignore # noqa: F401
import mlx_lm # type: ignore # noqa: F401
# Retrieve the chip name for display purposes.
import subprocess
try:
result = subprocess.run(
["sysctl", "-n", "machdep.cpu.brand_string"],
capture_output=True, text=True, timeout=3
)
profile.apple_chip = result.stdout.strip() or "Apple Silicon"
except Exception:
profile.apple_chip = "Apple Silicon"
profile.has_mlx = True
profile.available_backends.insert(0, "mlx")
logger.info("MLX available: %s", profile.apple_chip)
except ImportError:
logger.info(
"mlx or mlx_lm not installed. "
"Run: pip install mlx mlx-lm"
)
except Exception as exc:
logger.warning("MLX probe failed unexpectedly: %s", exc)
return profile
    @staticmethod
    def _probe_cuda(profile: HardwareProfile) -> HardwareProfile:
        """
        Checks for NVIDIA CUDA availability via PyTorch, which can
        positively identify an NVIDIA device and report its name and
        VRAM capacity. Note that llama-cpp-python's generic
        llama_supports_gpu_offload() check is NOT used here: it returns
        True for any GPU-enabled build (CUDA, Vulkan, or ROCm alike),
        so relying on it would mislabel Vulkan machines as CUDA and
        prevent the Vulkan probe from ever running.
        """
        try:
            import torch  # type: ignore
            if torch.cuda.is_available():
                props = torch.cuda.get_device_properties(0)
                profile.has_cuda = True
                profile.cuda_device_name = props.name
                profile.cuda_vram_gb = props.total_memory / (1024 ** 3)
                profile.available_backends.insert(0, "cuda")
                logger.info(
                    "CUDA detected via PyTorch: %s (%.1f GB)",
                    profile.cuda_device_name,
                    profile.cuda_vram_gb,
                )
        except ImportError:
            logger.info(
                "PyTorch not installed; skipping CUDA probe. "
                "GPU offload may still be detected by the Vulkan probe."
            )
        except Exception as exc:
            logger.warning("PyTorch CUDA probe failed: %s", exc)
        return profile
@staticmethod
def _probe_vulkan(profile: HardwareProfile) -> HardwareProfile:
"""
Checks for Vulkan GPU availability via llama-cpp-python.
Vulkan support in llama-cpp-python is compiled in at build time
using CMAKE_ARGS="-DGGML_VULKAN=on". If the installed build
supports GPU offload and CUDA was not detected, we infer Vulkan
(or ROCm) is the active GPU backend.
Note: llama-cpp-python does not expose a per-backend query API,
so we use the presence of GPU offload support combined with the
absence of CUDA as our Vulkan indicator. This is accurate for
the vast majority of deployment scenarios.
"""
try:
from llama_cpp import llama_cpp as _lc # type: ignore
if (
hasattr(_lc, "llama_supports_gpu_offload")
and _lc.llama_supports_gpu_offload()
and not profile.has_cuda
):
profile.has_vulkan = True
profile.vulkan_device_name = "Vulkan-capable GPU"
profile.available_backends.insert(0, "vulkan")
logger.info(
"Vulkan/ROCm GPU offload detected via llama-cpp."
)
except Exception as exc:
logger.debug("Vulkan probe failed: %s", exc)
return profile
The HardwareProfile dataclass is the central data structure of this layer. It is a simple, immutable-in-practice container that holds the results of hardware detection. The best_backend property implements the priority cascade in a single, readable place, making it easy to understand and modify the priority order. The llama_cpp_gpu_layers property encapsulates the logic for translating hardware availability into the specific parameter that llama-cpp-python needs, keeping this implementation detail out of the adapter layer.
The detection methods deserve careful attention. Each one is wrapped in a broad exception handler, not because we are being lazy about error handling, but because the failure modes here are genuinely unpredictable. A user might have a partially installed CUDA toolkit, a version mismatch between PyTorch and the CUDA driver, or a build of llama-cpp-python that was compiled without GPU support. In all these cases, the right behavior is to log the issue and fall back to the next option, not to crash. The lru_cache decorator on the detect() method is a small but important optimization — hardware detection involves subprocess calls and library imports, and there is no reason to repeat it more than once per process.
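To show how a caller might consume the profile, here is a hypothetical sketch of backend-to-adapter selection at startup. The adapter names are invented for illustration; only the backend strings come from HardwareProfile.best_backend.

```python
# Hypothetical startup logic mapping a detected backend string
# to an adapter. Adapter names are illustrative placeholders.
def select_adapter_name(best_backend: str) -> str:
    mapping = {
        "mlx": "MLXAdapter",
        "cuda": "LlamaCppAdapter (CUDA build)",
        "vulkan": "LlamaCppAdapter (Vulkan build)",
    }
    # CPU, or anything unrecognized, falls back to the CPU build.
    return mapping.get(best_backend, "LlamaCppAdapter (CPU)")
```

In the application, this function would be fed HardwareDetector.detect().best_backend.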
The Adapter Layer
With hardware detection in place, we can build the adapter layer. The adapter pattern is the right architectural choice here because it allows the rest of the system to interact with any LLM — whether it is a remote API or a local model running on any supported hardware — through a single, uniform interface. Adding a new backend requires only implementing the adapter interface; the optimizer engine and all other components remain unchanged.
The base adapter defines the interface that all concrete adapters must implement. It also provides shared utilities that all adapters can use, particularly the JSON response parser that handles the quirks of different models' output formats.
# adapters/base.py
#
# Abstract base class for all LLM adapters.
# Defines the interface and provides shared parsing utilities.
from __future__ import annotations
import json
import logging
import re
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import List, Optional
logger = logging.getLogger(__name__)
@dataclass
class OptimizationRequest:
"""
Encapsulates everything the optimizer needs to produce
an optimized prompt. Passed unchanged to all adapters,
so the adapter interface stays stable as requirements evolve.
"""
raw_prompt: str
target_model_id: str
target_model_config: dict
system_instructions: str # The meta-prompt built by the engine
temperature: float = 0.7
@dataclass
class OptimizationResult:
"""
The structured output of the optimization process.
Each field corresponds to a distinct component of the
optimized prompt package that the UI will display.
"""
optimized_system_prompt: str
optimized_assistant_primer: str
optimized_user_prompt: str
full_combined_prompt: str
patterns_applied: List[str] = field(default_factory=list)
anti_patterns_avoided: List[str] = field(default_factory=list)
model_recommendation: Optional[str] = None
recommendation_reason: Optional[str] = None
model_used: str = ""
backend_info: str = ""
def extract_json_from_llm_output(raw: str) -> dict:
"""
Robustly extracts a JSON object from LLM output.
LLMs frequently wrap JSON in markdown code fences, prepend
explanatory text, or (in the case of reasoning models like
DeepSeek R1 or OpenAI o3) emit a chain-of-thought block before
the actual answer. This function handles all of these cases by:
1. Stripping thinking/reasoning blocks enclosed in XML-like tags.
2. Removing markdown code fences (```json ... ``` or ``` ... ```).
3. Finding the last complete JSON object in the remaining text,
which is the actual answer after any preamble.
Returns an empty dict if no valid JSON can be extracted, rather
than raising an exception, so callers can apply fallback logic.
"""
text = raw.strip()
# Step 1: Remove chain-of-thought blocks from reasoning models.
# These appear as <think>...</think> or <thinking>...</thinking>.
text = re.sub(
r"<think(?:ing)?>\s*.*?\s*</think(?:ing)?>",
"",
text,
flags=re.DOTALL | re.IGNORECASE
)
text = text.strip()
    # Step 2: Extract content from markdown code fences.
    # We take the LAST fenced block to skip any preamble fences.
    fence_matches = re.findall(r"```(?:json)?\s*([\s\S]*?)```", text)
    if fence_matches:
        text = fence_matches[-1].strip()
# Step 3: Find the last complete JSON object in the text.
# Walking backwards from the last closing brace ensures we
# capture the full answer even when preamble text contains
# partial JSON-like structures.
last_close = text.rfind("}")
if last_close == -1:
logger.warning("No closing brace found in LLM output.")
return {}
depth = 0
start_idx = -1
for i in range(last_close, -1, -1):
if text[i] == "}":
depth += 1
elif text[i] == "{":
depth -= 1
if depth == 0:
start_idx = i
break
if start_idx == -1:
logger.warning("Could not find matching opening brace.")
return {}
candidate = text[start_idx : last_close + 1]
try:
return json.loads(candidate)
except json.JSONDecodeError as exc:
logger.warning(
"JSON parse failed after extraction: %s. "
"Candidate (first 200 chars): %s",
exc, candidate[:200]
)
return {}
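To see the three extraction steps in action before wiring the function into the adapters, here is a trimmed, self-contained re-implementation run against a fabricated reasoning-model reply (the function name and sample text are illustrative, not part of the codebase):

```python
import json
import re

def extract_last_json(raw: str) -> dict:
    # Step 1: strip chain-of-thought blocks.
    text = re.sub(
        r"<think(?:ing)?>.*?</think(?:ing)?>", "", raw,
        flags=re.DOTALL | re.IGNORECASE,
    ).strip()
    # Step 2: prefer the last fenced block, if any.
    blocks = re.findall(r"```(?:json)?\s*([\s\S]*?)```", text)
    if blocks:
        text = blocks[-1].strip()
    # Step 3: walk backwards from the last closing brace.
    last = text.rfind("}")
    depth, start = 0, -1
    for i in range(last, -1, -1):
        if text[i] == "}":
            depth += 1
        elif text[i] == "{":
            depth -= 1
            if depth == 0:
                start = i
                break
    if last == -1 or start == -1:
        return {}
    try:
        return json.loads(text[start:last + 1])
    except json.JSONDecodeError:
        return {}

reply = (
    "<think>The user wants JSON; let me draft it.</think>\n"
    "Here is the result:\n"
    '```json\n{"patterns_applied": ["role_assignment"]}\n```'
)
print(extract_last_json(reply))
# → {'patterns_applied': ['role_assignment']}
```

Because matching starts from the final closing brace, the walk correctly skips partial JSON-like fragments in any preamble.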
class BaseLLMAdapter(ABC):
"""
Abstract base for all LLM adapters.
Subclasses implement optimize() for their specific backend.
"""
def __init__(self, model_config: dict) -> None:
self.model_config = model_config
self.model_name: str = model_config.get("api_model_name", "")
        # Prefer the model-recommended temperature, falling back to a
        # moderate 0.7 default when the config does not specify one.
        self.temperature: float = model_config.get(
            "recommended_temperature", 0.7
        )
self.is_reasoning_model: bool = model_config.get(
"is_reasoning_model", False
)
@abstractmethod
async def optimize(
self, request: OptimizationRequest
) -> OptimizationResult:
"""
Sends the optimization request to the LLM and returns
a structured result. Must be implemented by all subclasses.
"""
...
@abstractmethod
def is_available(self) -> bool:
"""
Returns True if this adapter can be used on the current
machine with the current configuration. Called during
startup to determine which adapters to register.
"""
...
def get_backend_label(self) -> str:
"""
Returns a human-readable label for the backend this adapter
uses. Displayed in the UI to help users understand where
inference is running.
"""
return self.model_config.get("provider", "unknown")
def _build_result_from_dict(
self, data: dict, request: OptimizationRequest
) -> OptimizationResult:
"""
Constructs an OptimizationResult from a parsed JSON dict,
applying safe defaults for any missing fields. This ensures
that a partially successful LLM response still produces a
usable result rather than crashing downstream code.
"""
return OptimizationResult(
optimized_system_prompt=data.get(
"optimized_system_prompt",
"You are a helpful AI assistant."
),
optimized_assistant_primer=data.get(
"optimized_assistant_primer",
"Understood. I will help you with this task."
),
optimized_user_prompt=data.get(
"optimized_user_prompt",
request.raw_prompt
),
full_combined_prompt=data.get(
"full_combined_prompt",
request.raw_prompt
),
patterns_applied=data.get("patterns_applied", []),
anti_patterns_avoided=data.get("anti_patterns_avoided", []),
model_recommendation=data.get("model_recommendation"),
recommendation_reason=data.get("recommendation_reason"),
model_used=self.model_name,
backend_info=self.get_backend_label(),
)
The extract_json_from_llm_output function deserves special attention because it solves a problem that every production LLM application eventually encounters. LLMs are trained to be helpful and communicative, which means they often add explanatory text around structured outputs even when you explicitly ask them not to. Reasoning models like OpenAI's o3 or DeepSeek R1 are particularly prone to this because their chain-of-thought process produces substantial text before the actual answer. The three-step extraction process — strip thinking blocks, extract from code fences, find the last complete JSON object — handles the most common real-world output formats, and the empty-dict return value gives callers a clean signal to apply fallback logic instead of crashing.
Now let us implement the remote adapters. These adapters communicate with cloud LLM APIs and are the simplest to implement because the API providers handle all the complexity of model loading, hardware management, and inference.
# adapters/openai_adapter.py
#
# Adapter for OpenAI's GPT and o-series models.
# Handles both standard and reasoning models correctly.
from __future__ import annotations
import os
from openai import AsyncOpenAI
from adapters.base import (
BaseLLMAdapter,
OptimizationRequest,
OptimizationResult,
extract_json_from_llm_output,
)
class OpenAIAdapter(BaseLLMAdapter):
"""
Connects to the OpenAI API for cloud-based optimization.
Key design decisions:
- max_retries=0 prevents the SDK from silently retrying on 429
or 500 errors, which could cause double-billing or hide
transient failures that the caller should handle.
- Reasoning models (o3, o4-mini) must not receive a temperature
parameter; the API rejects it with a validation error.
"""
def __init__(self, model_config: dict) -> None:
super().__init__(model_config)
self._client = AsyncOpenAI(
api_key=os.getenv("OPENAI_API_KEY"),
timeout=float(os.getenv("CLOUD_TIMEOUT", "120")),
max_retries=0,
)
async def optimize(
self, request: OptimizationRequest
) -> OptimizationResult:
messages = [
{
"role": "system",
"content": request.system_instructions,
},
{
"role": "user",
"content": self._build_user_content(request),
},
]
# Reasoning models do not accept a temperature parameter.
# Standard models benefit from a controlled temperature.
call_kwargs: dict = {
"model": self.model_name,
"messages": messages,
}
if not self.is_reasoning_model:
call_kwargs["temperature"] = self.temperature
response = await self._client.chat.completions.create(
**call_kwargs
)
raw = response.choices[0].message.content or ""
data = extract_json_from_llm_output(raw)
return self._build_result_from_dict(data, request)
def _build_user_content(self, request: OptimizationRequest) -> str:
"""
Constructs the user-turn content for the optimization request.
The target model's configuration is injected here so the
optimizer has full context about what it is optimizing for.
"""
cfg = request.target_model_config
strengths = ", ".join(cfg.get("strengths", []))
return (
f"TARGET MODEL: {cfg.get('name', 'Unknown')}\n"
f"PROVIDER: {cfg.get('provider', 'unknown')}\n"
f"CONTEXT WINDOW: {cfg.get('context_window', 'unknown')} tokens\n"
f"MODEL STRENGTHS: {strengths}\n"
f"MODEL NOTES: {cfg.get('notes', '')}\n\n"
f"RAW PROMPT TO OPTIMIZE:\n"
f'"""\n{request.raw_prompt}\n"""\n\n'
f"Return ONLY a valid JSON object with these exact fields:\n"
f"{{\n"
f' "optimized_system_prompt": "...",\n'
f' "optimized_assistant_primer": "...",\n'
f' "optimized_user_prompt": "...",\n'
f' "full_combined_prompt": "...",\n'
f' "patterns_applied": ["pattern1", "pattern2"],\n'
f' "anti_patterns_avoided": ["anti1"],\n'
f' "model_recommendation": null,\n'
f' "recommendation_reason": null\n'
f"}}"
)
def is_available(self) -> bool:
return bool(os.getenv("OPENAI_API_KEY"))
The Anthropic adapter follows the same pattern but uses Claude's XML-structured prompt format, which is specifically optimized for how Claude processes instructions. Notice how the user content is wrapped in semantic XML tags rather than plain prose — this is not arbitrary style, but a deliberate choice based on how Claude models have been trained to parse structured inputs.
# adapters/anthropic_adapter.py
#
# Adapter for Anthropic's Claude model family.
# Uses XML-structured prompts to leverage Claude's training.
from __future__ import annotations
import os
import anthropic
from adapters.base import (
BaseLLMAdapter,
OptimizationRequest,
OptimizationResult,
extract_json_from_llm_output,
)
class AnthropicAdapter(BaseLLMAdapter):
"""
Connects to the Anthropic API for Claude-based optimization.
Claude models respond especially well to XML-structured prompts,
so the user content is wrapped in semantic XML tags throughout.
"""
def __init__(self, model_config: dict) -> None:
super().__init__(model_config)
self._client = anthropic.AsyncAnthropic(
api_key=os.getenv("ANTHROPIC_API_KEY"),
timeout=float(os.getenv("CLOUD_TIMEOUT", "120")),
max_retries=0,
)
async def optimize(
self, request: OptimizationRequest
) -> OptimizationResult:
# Claude's temperature must be in [0.0, 1.0].
safe_temp = max(0.0, min(self.temperature, 1.0))
response = await self._client.messages.create(
model=self.model_name,
max_tokens=4096,
temperature=safe_temp,
system=request.system_instructions,
messages=[
{
"role": "user",
"content": self._build_user_content(request),
}
],
)
raw = response.content[0].text if response.content else ""
data = extract_json_from_llm_output(raw)
return self._build_result_from_dict(data, request)
def _build_user_content(self, request: OptimizationRequest) -> str:
"""
Builds XML-structured user content for Claude.
The XML tags create clear semantic boundaries that Claude's
attention mechanism is specifically trained to recognize and
respect, producing more reliable structured output.
"""
cfg = request.target_model_config
strengths = ", ".join(cfg.get("strengths", []))
return (
f"<target_model>\n"
f" <name>{cfg.get('name', 'Unknown')}</name>\n"
f" <provider>{cfg.get('provider', 'unknown')}</provider>\n"
f" <context_window>"
f"{cfg.get('context_window', 'unknown')} tokens"
f"</context_window>\n"
f" <strengths>{strengths}</strengths>\n"
f" <notes>{cfg.get('notes', '')}</notes>\n"
f"</target_model>\n\n"
f"<raw_prompt>\n{request.raw_prompt}\n</raw_prompt>\n\n"
f"<output_format>\n"
f"Return ONLY a valid JSON object with these exact fields:\n"
f"{{\n"
f' "optimized_system_prompt": "...",\n'
f' "optimized_assistant_primer": "...",\n'
f' "optimized_user_prompt": "...",\n'
f' "full_combined_prompt": "...",\n'
f' "patterns_applied": ["pattern1"],\n'
f' "anti_patterns_avoided": ["anti1"],\n'
f' "model_recommendation": null,\n'
f' "recommendation_reason": null\n'
f"}}\n"
f"</output_format>"
)
def is_available(self) -> bool:
return bool(os.getenv("ANTHROPIC_API_KEY"))
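The is_available contract is what lets startup code decide which adapters to register. The sketch below shows that registration pattern with a stub class standing in for the real adapters; the StubAdapter name and the registry shape are assumptions for illustration only:

```python
import os

class StubAdapter:
    """Stands in for OpenAIAdapter / AnthropicAdapter in this demo."""

    def __init__(self, name: str, env_var: str) -> None:
        self.name = name
        self._env_var = env_var

    def is_available(self) -> bool:
        # Same rule as the real adapters: usable only when the
        # relevant API key is present in the environment.
        return bool(os.getenv(self._env_var))

candidates = [
    StubAdapter("openai", "OPENAI_API_KEY"),
    StubAdapter("anthropic", "ANTHROPIC_API_KEY"),
]
registry = {a.name: a for a in candidates if a.is_available()}
print(sorted(registry))  # only the adapters whose keys are configured
```

At startup the real application performs exactly this filter over every adapter class, so a machine with no API keys and no GPU still gets a working (if smaller) registry.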
Local Model Adapters: Where Hardware Meets Software
The local model adapters are where the hardware abstraction layer becomes essential. These adapters must load models from disk (or download them from HuggingFace Hub), configure them for the available hardware, and run inference in a way that does not block the async event loop. The two primary local backends are llama-cpp-python (which supports CUDA, Vulkan, Metal, and CPU) and mlx-lm (which provides native Apple Silicon acceleration via the MLX framework).
The llama-cpp-python adapter is the most versatile of the local adapters because a single build of the library — compiled with the appropriate flags — can target CUDA, Vulkan, or CPU. The hardware profile we built earlier tells us exactly how to configure it. A key architectural decision here is the use of a dedicated single-threaded executor for all llama-cpp inference calls. The llama-cpp-python Llama class is not thread-safe for concurrent inference on the same model instance, so serializing all calls through a single thread is the correct pattern for async applications that need to call synchronous, non-thread-safe C++ libraries.
# adapters/llamacpp_adapter.py
#
# Local inference adapter using llama-cpp-python.
# Supports CUDA, Vulkan, Metal, and CPU backends automatically.
# Models are downloaded as GGUF files from HuggingFace Hub.
from __future__ import annotations
import asyncio
import logging
import os
from concurrent.futures import ThreadPoolExecutor
from typing import Optional
from adapters.base import (
BaseLLMAdapter,
OptimizationRequest,
OptimizationResult,
extract_json_from_llm_output,
)
from core.hardware_detector import HardwareDetector
logger = logging.getLogger(__name__)
# A dedicated single-threaded executor for llama-cpp inference.
# llama-cpp-python is not thread-safe for concurrent inference on the
# same model instance, so we serialize all inference calls through
# a single thread. This is the correct pattern for async applications
# that need to call synchronous, non-thread-safe C++ libraries.
_LLAMA_EXECUTOR = ThreadPoolExecutor(
max_workers=1,
thread_name_prefix="llamacpp-inference"
)
# Guard the import so the module loads even without llama-cpp-python.
_LLAMACPP_AVAILABLE = False
try:
from llama_cpp import Llama # type: ignore
_LLAMACPP_AVAILABLE = True
except ImportError:
pass
class LlamaCppAdapter(BaseLLMAdapter):
"""
Runs GGUF models locally via llama-cpp-python.
The adapter lazy-loads the model on first use to avoid blocking
the application startup. An asyncio.Lock ensures that concurrent
requests do not trigger multiple simultaneous model loads.
GPU backend selection is fully automatic:
- CUDA: compile with CMAKE_ARGS="-DGGML_CUDA=on"
- Vulkan: compile with CMAKE_ARGS="-DGGML_VULKAN=on"
- Metal: automatic on macOS (no special compile flag needed)
- CPU: always available as fallback
"""
def __init__(self, model_config: dict) -> None:
super().__init__(model_config)
self._model: Optional[object] = None
self._load_lock = asyncio.Lock()
self._loaded = False
# Model source: prefer a local file, fall back to HuggingFace.
self.gguf_local_path: str = model_config.get(
"gguf_local_path", ""
)
self.gguf_repo_id: str = model_config.get("gguf_repo_id", "")
self.gguf_filename: str = model_config.get(
"gguf_filename", "*.Q4_K_M.gguf"
)
# Determine GPU layer count from hardware profile.
hw = HardwareDetector.detect()
self._n_gpu_layers: int = model_config.get(
"n_gpu_layers", hw.llama_cpp_gpu_layers
)
self._n_ctx: int = min(
model_config.get("context_window", 4096), 8192
)
self._max_tokens: int = model_config.get("max_tokens", 2048)
self._timeout: float = float(
os.getenv("LOCAL_INFERENCE_TIMEOUT", "300")
)
# Build a human-readable backend label for the UI.
self._backend_label = (
f"llama-cpp ({hw.best_backend.upper()})"
)
logger.info(
"LlamaCppAdapter configured: model=%s backend=%s "
"n_gpu_layers=%d n_ctx=%d",
self.model_name,
hw.best_backend,
self._n_gpu_layers,
self._n_ctx,
)
async def _ensure_loaded(self) -> None:
"""
Lazy-loads the model on first call. The asyncio.Lock prevents
a race condition where multiple concurrent requests all try to
        load the model simultaneously. The double-check after acquiring
        the lock (the second 'if self._loaded' inside the lock) is the
        standard double-checked pattern for lazy initialization under
        concurrent access.
"""
if self._loaded:
return
async with self._load_lock:
if self._loaded:
return
            loop = asyncio.get_running_loop()
self._model = await loop.run_in_executor(
_LLAMA_EXECUTOR,
self._load_model_sync,
)
self._loaded = True
def _load_model_sync(self) -> object:
"""
Synchronous model loading, runs in the thread executor.
Downloads the GGUF file from HuggingFace Hub if no local
path is provided. The HF_TOKEN environment variable is used
for gated models (Llama 3, Gemma, etc.).
"""
hf_token = os.getenv("HF_TOKEN")
if self.gguf_local_path and os.path.exists(self.gguf_local_path):
logger.info(
"Loading GGUF from local path: %s", self.gguf_local_path
)
return Llama(
model_path=self.gguf_local_path,
n_gpu_layers=self._n_gpu_layers,
n_ctx=self._n_ctx,
verbose=False,
)
if not self.gguf_repo_id:
raise RuntimeError(
f"LlamaCppAdapter: no gguf_local_path or gguf_repo_id "
f"configured for model '{self.model_name}'"
)
logger.info(
"Downloading GGUF from HuggingFace: %s / %s",
self.gguf_repo_id,
self.gguf_filename,
)
        # huggingface_hub reads the HF_TOKEN environment variable
        # automatically when downloading gated models, so no explicit
        # token argument is needed here.
        return Llama.from_pretrained(
            repo_id=self.gguf_repo_id,
            filename=self.gguf_filename,
            n_gpu_layers=self._n_gpu_layers,
            n_ctx=self._n_ctx,
            verbose=False,
        )
async def optimize(
self, request: OptimizationRequest
) -> OptimizationResult:
if not _LLAMACPP_AVAILABLE:
raise RuntimeError(
"llama-cpp-python is not installed. "
"See requirements.txt for installation instructions."
)
await self._ensure_loaded()
messages = [
{
"role": "system",
"content": request.system_instructions,
},
{
"role": "user",
"content": self._build_user_content(request),
},
]
        loop = asyncio.get_running_loop()
def _run_inference() -> str:
# create_chat_completion handles the chat template
# formatting internally, so we pass structured messages
# rather than a raw formatted string.
result = self._model.create_chat_completion(
messages=messages,
temperature=self.temperature,
max_tokens=self._max_tokens,
stop=["</s>", "<|eot_id|>", "<|end|>"],
)
return result["choices"][0]["message"]["content"] or ""
# Run inference in the dedicated executor and apply a timeout
# to prevent a hung local model from blocking indefinitely.
raw = await asyncio.wait_for(
loop.run_in_executor(_LLAMA_EXECUTOR, _run_inference),
timeout=self._timeout,
)
data = extract_json_from_llm_output(raw)
result = self._build_result_from_dict(data, request)
result.backend_info = self._backend_label
return result
def _build_user_content(self, request: OptimizationRequest) -> str:
cfg = request.target_model_config
return (
f"Optimize the following prompt for the target model.\n\n"
f"TARGET MODEL: {cfg.get('name', 'Unknown')}\n"
f"CONTEXT WINDOW: {cfg.get('context_window', 'unknown')} tokens\n"
f"MODEL NOTES: {cfg.get('notes', '')}\n\n"
f"RAW PROMPT:\n{request.raw_prompt}\n\n"
f"Return ONLY a valid JSON object:\n"
f"{{\n"
f' "optimized_system_prompt": "...",\n'
f' "optimized_assistant_primer": "...",\n'
f' "optimized_user_prompt": "...",\n'
f' "full_combined_prompt": "...",\n'
f' "patterns_applied": [],\n'
f' "anti_patterns_avoided": [],\n'
f' "model_recommendation": null,\n'
f' "recommendation_reason": null\n'
f"}}"
)
def is_available(self) -> bool:
return _LLAMACPP_AVAILABLE
def get_backend_label(self) -> str:
return self._backend_label
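The lazy-load-with-lock pattern used by _ensure_loaded generalizes well beyond llama-cpp. Here is a dependency-free sketch (class and names invented for the demo) showing that eight concurrent first calls trigger exactly one load:

```python
import asyncio

class LazyResource:
    def __init__(self) -> None:
        self._value = None
        self._loaded = False
        self._lock = asyncio.Lock()
        self.load_count = 0

    async def get(self):
        if self._loaded:                 # fast path, no lock needed
            return self._value
        async with self._lock:
            if self._loaded:             # double-check inside the lock
                return self._value
            self.load_count += 1
            await asyncio.sleep(0.01)    # simulate a slow model load
            self._value = "model"
            self._loaded = True
        return self._value

async def main() -> None:
    res = LazyResource()
    results = await asyncio.gather(*(res.get() for _ in range(8)))
    print(results.count("model"), res.load_count)  # → 8 1

asyncio.run(main())
```

The first caller acquires the lock and performs the load; the other seven suspend on the lock, then hit the inner check and return immediately. Without the inner check, each of them would load the model again.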
The MLX adapter takes a different approach because the MLX framework is designed specifically for Apple Silicon and provides its own patterns for efficient inference. MLX operations are lazy by default — they are not executed until their results are needed — which means that the generate function, while synchronous from Python's perspective, drives the GPU cores of Apple Silicon through the Metal backend in a highly optimized way. One of the most important details in this adapter is the call to apply_chat_template, which formats the messages list into the specific prompt string that the model expects, including any special tokens. Using the wrong format significantly degrades output quality for instruction-tuned models, and this single call handles all of that complexity automatically.
# adapters/mlx_adapter.py
#
# Local inference adapter for Apple Silicon using Apple MLX.
# Provides native Metal GPU acceleration via the mlx-lm library.
# Only available on macOS with Apple Silicon (M1/M2/M3/M4 chips).
from __future__ import annotations
import asyncio
import logging
import os
import sys
from typing import Optional, Tuple
from adapters.base import (
BaseLLMAdapter,
OptimizationRequest,
OptimizationResult,
extract_json_from_llm_output,
)
logger = logging.getLogger(__name__)
# Conditional import: mlx_lm is only available on macOS Apple Silicon.
# We guard the import so the module can be loaded on any platform
# without errors; the is_available() method will return False on
# non-Apple-Silicon systems, preventing the adapter from being used.
_MLX_AVAILABLE = False
_mlx_load = None
_mlx_generate = None
if sys.platform == "darwin":
try:
from mlx_lm import load as _mlx_load_fn # type: ignore
from mlx_lm import generate as _mlx_generate_fn # type: ignore
_mlx_load = _mlx_load_fn
_mlx_generate = _mlx_generate_fn
_MLX_AVAILABLE = True
except ImportError:
pass
class MLXAdapter(BaseLLMAdapter):
"""
Runs HuggingFace models natively on Apple Silicon via MLX.
    MLX is Apple's open-source machine learning framework, built around
    Apple Silicon's unified memory architecture. Unlike CUDA, which
    requires explicit data transfers between CPU and GPU memory, MLX
    operates on unified memory that both the CPU and the GPU can access
    directly, so tensors never need to be copied between devices. This
    makes MLX extremely efficient for LLM inference on M-series chips.
Models are specified by their HuggingFace repo ID. The mlx_lm.load()
function handles downloading and caching automatically. Quantized
models from the mlx-community namespace (e.g., 4-bit quantized
Llama 3) are recommended for best performance.
"""
def __init__(self, model_config: dict) -> None:
super().__init__(model_config)
self._model = None
self._tokenizer = None
self._loaded = False
self._load_lock = asyncio.Lock()
# MLX models are identified by their HuggingFace repo ID.
self._hf_repo_id: str = model_config.get(
"hf_repo_id",
model_config.get("api_model_name", "")
)
self._max_tokens: int = model_config.get("max_tokens", 4096)
self._timeout: float = float(
os.getenv("LOCAL_INFERENCE_TIMEOUT", "300")
)
async def _ensure_loaded(self) -> None:
"""
Lazy-loads the MLX model on first use.
Model download from HuggingFace Hub happens automatically
inside mlx_lm.load() and is cached in the HF_HOME directory.
"""
if self._loaded:
return
async with self._load_lock:
if self._loaded:
return
logger.info(
"Loading MLX model: %s", self._hf_repo_id
)
            # huggingface_hub reads the HF_TOKEN environment variable
            # automatically for gated models, so mlx_lm.load() needs
            # no explicit token argument (and does not accept one).
            def _load_sync() -> Tuple:
                return _mlx_load(
                    self._hf_repo_id,
                    tokenizer_config={"trust_remote_code": True},
                )
# Run the blocking model load in the default thread pool.
# Unlike llama-cpp, MLX model loading is thread-safe, so
# we do not need a dedicated single-threaded executor here.
self._model, self._tokenizer = await asyncio.to_thread(
_load_sync
)
self._loaded = True
logger.info("MLX model loaded: %s", self._hf_repo_id)
async def optimize(
self, request: OptimizationRequest
) -> OptimizationResult:
if not _MLX_AVAILABLE:
raise RuntimeError(
"Apple MLX is not available. "
"Requires macOS with Apple Silicon and mlx-lm installed."
)
await self._ensure_loaded()
messages = [
{
"role": "system",
"content": request.system_instructions,
},
{
"role": "user",
"content": self._build_user_content(request),
},
]
def _format_and_generate() -> str:
# apply_chat_template formats the messages list into the
# specific prompt string that this model expects, including
# any special tokens (BOS, EOS, role markers, etc.).
# This is critical: using the wrong format degrades output
# quality significantly for instruction-tuned models.
if hasattr(self._tokenizer, "apply_chat_template"):
prompt = self._tokenizer.apply_chat_template(
messages,
add_generation_prompt=True,
tokenize=False,
)
else:
# Fallback for models without a chat template.
# This is a generic Llama-2-style format.
system = messages[0]["content"]
user = messages[1]["content"]
prompt = (
f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n"
f"{user} [/INST]"
)
return _mlx_generate(
self._model,
self._tokenizer,
prompt=prompt,
max_tokens=self._max_tokens,
verbose=False,
)
# MLX generate is synchronous but uses Metal GPU acceleration.
# We run it in a thread to avoid blocking the event loop.
raw = await asyncio.wait_for(
asyncio.to_thread(_format_and_generate),
timeout=self._timeout,
)
data = extract_json_from_llm_output(raw)
result = self._build_result_from_dict(data, request)
result.backend_info = f"Apple MLX ({self._get_chip_name()})"
return result
def _build_user_content(self, request: OptimizationRequest) -> str:
cfg = request.target_model_config
return (
f"Optimize the following prompt for the target model.\n\n"
f"TARGET MODEL: {cfg.get('name', 'Unknown')}\n"
f"CONTEXT WINDOW: "
f"{cfg.get('context_window', 'unknown')} tokens\n"
f"MODEL NOTES: {cfg.get('notes', '')}\n\n"
f"RAW PROMPT:\n{request.raw_prompt}\n\n"
f"Return ONLY a valid JSON object:\n"
f"{{\n"
f' "optimized_system_prompt": "...",\n'
f' "optimized_assistant_primer": "...",\n'
f' "optimized_user_prompt": "...",\n'
f' "full_combined_prompt": "...",\n'
f' "patterns_applied": [],\n'
f' "anti_patterns_avoided": [],\n'
f' "model_recommendation": null,\n'
f' "recommendation_reason": null\n'
f"}}"
)
def _get_chip_name(self) -> str:
"""Retrieves the Apple Silicon chip name for display."""
try:
import subprocess
result = subprocess.run(
["sysctl", "-n", "machdep.cpu.brand_string"],
capture_output=True, text=True, timeout=3
)
return result.stdout.strip() or "Apple Silicon"
except Exception:
return "Apple Silicon"
def is_available(self) -> bool:
return _MLX_AVAILABLE and sys.platform == "darwin"
def get_backend_label(self) -> str:
return f"Apple MLX ({self._get_chip_name()})"
The Meta-Prompt: Teaching an LLM to Be a Prompt Engineer
The most intellectually fascinating component of a prompt optimizer is the meta-prompt — the system prompt given to the optimizer model that encodes all the knowledge of an expert prompt engineer. This is where the real magic happens, and getting it right requires both deep knowledge of prompt engineering and careful attention to how the optimizer model processes instructions.
The meta-prompt must accomplish several things simultaneously. It must establish the optimizer model's role as a world-class prompt engineering expert. It must provide a comprehensive catalog of best-practice patterns, with enough detail that the model understands not just what each pattern is but when and how to apply it. It must provide an equally comprehensive catalog of anti-patterns to avoid. It must provide model-specific tips for the target model, so that the optimizer can tailor its output to the specific characteristics of the model the user has chosen. And it must specify the exact output format required, with enough precision that the model reliably produces parseable JSON.
The following optimizer engine implementation shows how all of these elements are assembled into a coherent system. Pay particular attention to the use of Python's string.Template with safe_substitute rather than format strings: the template body contains literal curly braces in its JSON output example, which str.format would misinterpret as placeholders, while Template's $-style placeholders leave braces alone. Using safe_substitute rather than substitute additionally means that any unrecognized $-sequence is left intact instead of raising a KeyError.
# core/optimizer.py
#
# The core prompt optimization engine.
# Assembles the meta-prompt and orchestrates the optimization pipeline.
from __future__ import annotations
import asyncio
import logging
import os
from string import Template
from typing import Optional
from adapters.base import OptimizationRequest, OptimizationResult
from core.pattern_library import PATTERNS, ANTI_PATTERNS, MODEL_TIPS
logger = logging.getLogger(__name__)
# The meta-prompt template. Uses Python's string.Template with $-style
# substitution rather than str.format, because the template body
# contains literal curly braces (the JSON example below) that format
# strings would misread as placeholders. safe_substitute is used
# (rather than substitute) so that any unrecognized $name sequence is
# left intact instead of raising KeyError.
_META_PROMPT_TEMPLATE = Template(
"""You are a world-class Prompt Engineering Expert with comprehensive
knowledge of all major LLM architectures, training methodologies, and
the latest research in prompt engineering as of 2025-2026.
Your expertise spans:
- GPT-series models (OpenAI): instruction following, structured outputs,
reasoning models (o3, o4-mini), vision, and function calling.
- Claude models (Anthropic): Constitutional AI, XML-structured prompts,
long-context reasoning, and multi-agent orchestration.
- Gemini models (Google): multimodal reasoning, grounding, and massive
context windows up to 2 million tokens.
- Mistral models: multilingual tasks, European data residency, and
efficient mixture-of-experts architectures.
- Local models (Llama, Mistral, Qwen, DeepSeek, Phi): quantization
effects, context window limitations, chat template requirements,
and the performance differences between CUDA, Vulkan, and CPU inference.
- Apple MLX models: unified memory advantages, optimal quantization
levels for different M-series chips, and throughput characteristics.
Your task is to transform a raw, unoptimized user prompt into a
production-ready, model-specific prompt package. You must:
1. Analyze the raw prompt to understand the user's actual intent,
even when it is expressed vaguely or incompletely.
2. Apply the most relevant prompt engineering patterns from the list
below, selecting those that will most improve the specific prompt.
3. Eliminate all anti-patterns present in the raw prompt.
4. Tailor every aspect of the optimized prompt to the specific
target model's strengths, conventions, and requirements.
5. Produce a complete prompt package: system prompt, assistant primer,
and user prompt, plus a full combined version.
6. Recommend a better model if one exists for the specific task type,
with a clear explanation of why it would be superior.
BEST PRACTICE PATTERNS TO APPLY:
$patterns
ANTI-PATTERNS TO DETECT AND ELIMINATE:
$anti_patterns
MODEL-SPECIFIC OPTIMIZATION TIPS FOR THE TARGET MODEL:
$model_tips
OUTPUT REQUIREMENTS:
Return ONLY a valid JSON object. No preamble, no explanation, no
markdown formatting. The JSON must contain exactly these fields:
{
"optimized_system_prompt": "complete system prompt text",
"optimized_assistant_primer": "opening of assistant response",
"optimized_user_prompt": "complete optimized user prompt",
"full_combined_prompt": "all sections combined for direct use",
"patterns_applied": ["list", "of", "pattern", "names", "used"],
"anti_patterns_avoided": ["list", "of", "anti-patterns", "fixed"],
"model_recommendation": "model_id or null if current model is fine",
"recommendation_reason": "explanation or null"
}"""
)
def _escape_dollars(text: str) -> str:
    """
    Escapes literal dollar signs in text as '$$'.
    Strictly speaking, string.Template scans only the template body,
    not the substituted values, so this is a purely defensive measure:
    it matters only if the assembled prompt is ever passed through a
    Template again. The cost is that a literal '$' in pattern text
    reaches the optimizer model doubled, which is acceptable here.
    """
    return text.replace("$", "$$")
def _build_patterns_text() -> str:
"""
Formats the pattern library into a readable text block for
injection into the meta-prompt. Each pattern is formatted as
a concise description with its benefit and when to apply it.
"""
lines = []
for key, pattern in PATTERNS.items():
lines.append(
f"- {pattern['name']}: {pattern['description']} "
f"(Benefit: {pattern['benefit']}. "
f"Apply when: {pattern['when_to_apply']})"
)
return _escape_dollars("\n".join(lines))
def _build_anti_patterns_text() -> str:
"""
Formats the anti-pattern library into a readable text block.
Each anti-pattern includes its name, what it looks like, and
the specific fix the optimizer should apply.
"""
lines = []
for key, ap in ANTI_PATTERNS.items():
lines.append(
f"- {ap['name']}: {ap['description']} "
f"Fix: {ap['fix']}"
)
return _escape_dollars("\n".join(lines))
def _build_model_tips_text(
provider: str, model_name: str
) -> str:
"""
Retrieves and formats model-specific tips for the target model.
Looks up tips by provider first, then by specific model name
within that provider, falling back to provider-level defaults.
"""
provider_tips = MODEL_TIPS.get(provider, {})
specific_tips = provider_tips.get(
model_name,
provider_tips.get("default", ["Apply general best practices."])
)
return _escape_dollars(
"\n".join(f"- {tip}" for tip in specific_tips)
)
class PromptOptimizer:
"""
Orchestrates the full prompt optimization pipeline.
The optimizer is model-agnostic: it accepts any adapter that
implements BaseLLMAdapter and uses it to run the optimization.
This means the same optimizer can use GPT-4o for cloud-based
optimization or a local Llama model for privacy-sensitive use cases.
"""
def __init__(self) -> None:
self._cloud_timeout = float(
os.getenv("CLOUD_INFERENCE_TIMEOUT", "120")
)
self._local_timeout = float(
os.getenv("LOCAL_INFERENCE_TIMEOUT", "300")
)
async def optimize(
self,
raw_prompt: str,
target_model_config: dict,
optimizer_adapter,
request_id: str = "",
) -> OptimizationResult:
"""
Runs the full optimization pipeline:
1. Builds the model-specific meta-prompt.
2. Constructs the optimization request.
3. Calls the optimizer adapter with a timeout guard.
4. Returns the structured result.
        The timeout guard (asyncio.wait_for) is critical for local
        models, which can hang indefinitely if the model file is
        corrupted, if the hardware runs out of memory, or if inference
        stalls for any other reason. Without this guard a hung request
        would never complete, and because local inference is serialized
        through a single worker thread, it would also stall every
        subsequent local request.
"""
provider = target_model_config.get("provider", "default")
model_name = target_model_config.get("api_model_name", "")
model_type = target_model_config.get("type", "cloud")
# Build the meta-prompt with model-specific knowledge injected.
system_instructions = _META_PROMPT_TEMPLATE.safe_substitute(
patterns=_build_patterns_text(),
anti_patterns=_build_anti_patterns_text(),
model_tips=_build_model_tips_text(provider, model_name),
)
request = OptimizationRequest(
raw_prompt=raw_prompt,
target_model_id=target_model_config.get("id", ""),
target_model_config=target_model_config,
system_instructions=system_instructions,
temperature=target_model_config.get(
"recommended_temperature", 0.7
),
)
# Select timeout based on whether the optimizer is local or cloud.
timeout = (
self._local_timeout
if model_type == "local"
else self._cloud_timeout
)
logger.info(
"[%s] Optimizing for target='%s' provider='%s' timeout=%.0fs",
request_id,
target_model_config.get("name", "unknown"),
provider,
timeout,
)
try:
return await asyncio.wait_for(
optimizer_adapter.optimize(request),
timeout=timeout,
)
except asyncio.TimeoutError:
# Raise from None so the user sees one clean error message
# instead of a chained TimeoutError traceback.
raise RuntimeError(
f"Optimization timed out after {timeout:.0f}s. "
f"The optimizer model took too long to respond. "
f"Try a faster model or increase the timeout setting."
) from None
The meta-prompt template is the intellectual heart of the entire system. Every word in it has been chosen deliberately. The role assignment at the beginning — "You are a world-class Prompt Engineering Expert" — is not mere flattery. It activates the cluster of knowledge and communication patterns in the optimizer model's learned representations that are associated with expert prompt engineering. The enumeration of specific model families and their characteristics gives the optimizer model the context it needs to make model-specific decisions. The numbered list of tasks provides a clear, ordered procedure that the model can follow reliably.
The injection of the pattern library, anti-pattern library, and model-specific tips is what makes the optimizer genuinely knowledgeable rather than merely well-intentioned. Without this injection, the optimizer model would rely entirely on its training data, which may be outdated or incomplete for the specific models and use cases the user is working with. By injecting this knowledge at runtime, we can update the optimizer's knowledge simply by updating the pattern library, without retraining any model.
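To make the injection mechanism concrete, here is a minimal sketch of what the template plumbing could look like. The `_META_PROMPT_TEMPLATE` name and the `safe_substitute` call come from the optimizer code above; the template's actual wording is an assumption for illustration only.

```python
# Hedged sketch of the meta-prompt template. string.Template is a
# natural fit here because safe_substitute() leaves any literal '$'
# inside injected pattern text untouched rather than raising.
from string import Template

_META_PROMPT_TEMPLATE = Template(
    "You are a world-class Prompt Engineering Expert.\n\n"
    "Apply these proven patterns where relevant:\n"
    "${patterns}\n\n"
    "Detect and eliminate these anti-patterns:\n"
    "${anti_patterns}\n\n"
    "Model-specific guidance for the target model:\n"
    "${model_tips}\n"
)

# Runtime injection: the placeholders are filled from the pattern
# library at optimization time, not baked into the template.
instructions = _META_PROMPT_TEMPLATE.safe_substitute(
    patterns="- Chain-of-Thought Reasoning: ...",
    anti_patterns="- Vague Instructions: ...",
    model_tips="- Use XML tags to structure prompt sections.",
)
```

Because the placeholders are substituted at runtime, editing the pattern library changes the optimizer's behavior on the very next request.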
The Pattern Library in Depth
The pattern library is the knowledge base that the optimizer draws on. It is not just a list of pattern names — each entry includes a description, a benefit statement, and a "when to apply" condition that helps the optimizer decide which patterns are relevant for a given prompt. A well-designed pattern library is what separates a prompt optimizer that produces generic improvements from one that produces genuinely expert-level results. The library is also the component that is easiest to extend — adding a new pattern requires only adding a new entry to the dictionary, and the optimizer immediately gains the ability to apply it.
# core/pattern_library.py
#
# The knowledge base for the prompt optimization engine.
# Contains patterns, anti-patterns, and model-specific tips.
# All entries follow a consistent schema for uniform processing.
from __future__ import annotations
# Each pattern entry follows this schema:
# name: Human-readable pattern name.
# description: What the pattern does and how it works.
# benefit: The measurable improvement it provides.
# when_to_apply: Conditions under which the pattern is most useful.
PATTERNS: dict = {
"chain_of_thought": {
"name": "Chain-of-Thought Reasoning",
"description": (
"Instruct the model to reason step-by-step before giving "
"its final answer. Include the phrase 'Think through this "
"step by step' or 'Let's work through this carefully'."
),
"benefit": (
"Dramatically improves accuracy on multi-step reasoning, "
"math, logic, and complex analysis tasks by giving the "
"model working memory through intermediate steps."
),
"when_to_apply": (
"Any task involving multiple reasoning steps, mathematics, "
"logic puzzles, debugging, or complex analysis. Do NOT apply "
"to reasoning models (o3, o4-mini) that reason internally."
),
},
"role_assignment": {
"name": "Expert Role Assignment",
"description": (
"Assign the model a specific expert persona in the system "
"prompt. Be specific: 'You are a senior Python engineer "
"specializing in async systems' outperforms 'You are a "
"helpful assistant'."
),
"benefit": (
"Shifts the probability distribution over responses toward "
"expert-level vocabulary, depth, and accuracy. Reduces "
"generic or superficial responses."
),
"when_to_apply": (
"Almost always. Every prompt benefits from a specific role "
"assignment. Match the role to the domain of the task."
),
},
"few_shot": {
"name": "Few-Shot Exemplars",
"description": (
"Provide 2-5 input/output example pairs before the actual "
"task. Format them consistently and ensure they cover the "
"range of cases the model will encounter."
),
"benefit": (
"Teaches the model the exact output format, style, and "
"level of detail required without lengthy prose instructions. "
"Especially powerful for structured data extraction."
),
"when_to_apply": (
"When the output format is complex, unusual, or highly "
"specific. When the model repeatedly misunderstands the "
"task despite clear instructions."
),
},
"xml_structuring": {
"name": "XML Semantic Structuring",
"description": (
"Wrap distinct sections of the prompt in descriptive XML "
"tags: <context>, <task>, <constraints>, <output_format>, "
"<examples>. Use closing tags consistently."
),
"benefit": (
"Creates unambiguous boundaries between prompt sections, "
"reducing misinterpretation. Especially powerful with "
"Claude models, which are specifically trained for this."
),
"when_to_apply": (
"Always for Claude models. For other models when the prompt "
"has multiple distinct sections that the model might confuse."
),
},
"output_specification": {
"name": "Explicit Output Specification",
"description": (
"Specify the exact format, length, structure, and content "
"of the desired output. For JSON, provide the exact schema. "
"For prose, specify word count, tone, and structure."
),
"benefit": (
"Eliminates ambiguity about what a successful response looks "
"like. Reduces post-processing effort and format errors."
),
"when_to_apply": (
"Always. Every prompt should specify what a good response "
"looks like, even if only briefly."
),
},
"constraint_specification": {
"name": "Explicit Constraint Listing",
"description": (
"List all constraints, limitations, and requirements "
"explicitly. Use numbered constraints for clarity. "
"Include both positive constraints (must do) and negative "
"constraints (must not do)."
),
"benefit": (
"Prevents the model from violating requirements that the "
"user considers obvious but the model has no way to infer."
),
"when_to_apply": (
"Whenever the task has non-obvious constraints, style "
"requirements, or scope limitations."
),
},
"self_verification": {
"name": "Self-Verification and Reflection",
"description": (
"Ask the model to review its own output for errors, "
"omissions, and inconsistencies before finalizing. "
"Include: 'Before responding, verify that your answer "
"satisfies all stated constraints.'"
),
"benefit": (
"Catches errors that are statistically unlikely in a "
"verification context even if they were likely during "
"initial generation. Improves accuracy and completeness."
),
"when_to_apply": (
"High-stakes tasks where accuracy is critical. Tasks with "
"multiple constraints that are easy to partially violate."
),
},
"context_priming": {
"name": "Rich Context Priming",
"description": (
"Provide all relevant background information the model needs "
"to answer correctly. Do not assume the model knows your "
"codebase, domain conventions, or project history."
),
"benefit": (
"Eliminates hallucination caused by the model filling gaps "
"in its context with plausible-sounding but incorrect "
"information."
),
"when_to_apply": (
"Whenever the task involves domain-specific knowledge, "
"proprietary systems, or context that was not in the model's "
"training data."
),
},
}
# Each anti-pattern entry follows this schema:
# name: Human-readable anti-pattern name.
# description: What the anti-pattern looks like and why it fails.
# fix: The specific transformation to apply to eliminate it.
ANTI_PATTERNS: dict = {
"vague_instructions": {
"name": "Vague Instructions",
"description": (
"Prompts that use underspecified verbs like 'help', 'write', "
"'fix', or 'improve' without specifying what success looks "
"like, what constraints apply, or what the output should be."
),
"fix": (
"Replace vague verbs with specific action descriptions. "
"Add output format, length, tone, and quality criteria. "
"Specify what the user actually needs, not just what they want."
),
},
"assumed_context": {
"name": "Assumed Context",
"description": (
"Prompts that reference information the model cannot know: "
"internal codebases, proprietary documentation, personal "
"project history, or domain-specific conventions not in "
"the model's training data."
),
"fix": (
"Provide all necessary context inline. Paste relevant code, "
"documentation excerpts, or background information directly "
"into the prompt. Never assume the model knows your context."
),
},
"task_overloading": {
"name": "Task Overloading",
"description": (
"Prompts that request multiple unrelated tasks in a single "
"message, causing the model to address some tasks well and "
"others poorly, or to blend them in unexpected ways."
),
"fix": (
"Break multi-task prompts into focused single-task prompts, "
"or clearly number and separate each task with explicit "
"section headers or XML tags."
),
},
"no_format_spec": {
"name": "Missing Output Format Specification",
"description": (
"Prompts that do not specify the desired output format, "
"leaving the model to choose between prose, lists, JSON, "
"code, tables, or any other format it finds plausible."
),
"fix": (
"Always specify the desired output format explicitly. "
"For structured data, provide the exact schema. For prose, "
"specify length, tone, and structure."
),
},
"cot_on_reasoning_model": {
"name": "Redundant Chain-of-Thought on Reasoning Models",
"description": (
"Applying chain-of-thought instructions to models that "
"already reason internally (o3, o4-mini, DeepSeek R1). "
"This interferes with the model's internal reasoning process "
"and can reduce output quality."
),
"fix": (
"Remove chain-of-thought instructions when targeting "
"reasoning models. Trust the model's internal process and "
"focus the prompt on the task and output format instead."
),
},
"role_mismatch": {
"name": "Role-Task Mismatch",
"description": (
"Assigning a role that does not match the task domain, "
"such as asking a 'creative writing assistant' to perform "
"financial analysis, or a 'data scientist' to write poetry."
),
"fix": (
"Match the assigned role precisely to the task domain. "
"The role should describe the exact expertise needed for "
"the specific task, not a generic helpful assistant."
),
},
}
# Model-specific tips organized by provider and then by model name.
# The 'default' key under each provider applies to all models from
# that provider when no model-specific tips are available.
MODEL_TIPS: dict = {
"openai": {
"default": [
"Use numbered steps for multi-part instructions.",
"Specify JSON output schemas explicitly with field names and types.",
"Use 'You are a [specific expert role]' in the system prompt.",
"For GPT-4o, leverage vision capabilities when images are relevant.",
],
"o3": [
"Do NOT include chain-of-thought instructions; o3 reasons internally.",
"Focus the prompt entirely on the task and desired output format.",
"o3 excels at complex multi-step reasoning; trust its process.",
"Specify output format precisely; o3 follows format specs reliably.",
],
"o4-mini": [
"Do NOT include chain-of-thought instructions; o4-mini reasons internally.",
"o4-mini is optimized for speed and cost; keep prompts concise.",
"Excellent for coding tasks; specify language and style guide.",
"Specify output format precisely for reliable structured output.",
],
},
"anthropic": {
"default": [
"Use XML tags to structure all prompt sections: <context>, <task>, <output_format>.",
"Claude excels at long-document analysis; provide full documents, not excerpts.",
"Use <thinking> tags to encourage visible reasoning when needed.",
"Claude follows Constitutional AI principles; frame requests constructively.",
"Specify output format inside <output_format> XML tags for best results.",
],
"claude-opus-4-5": [
"Claude Opus is the most capable model; use for the most complex tasks.",
"Leverage its exceptional long-context reasoning with full XML structuring.",
"Ideal for nuanced analysis, complex writing, and multi-step reasoning.",
],
},
"meta": {
"default": [
"Ensure the correct chat template is applied for the specific Llama variant.",
"Keep prompts within the effective context window (4096-8192 tokens for most GGUF builds).",
"Quantized models (Q4_K_M) may struggle with very complex structured output; simplify JSON schemas.",
"Include chain-of-thought instructions for reasoning tasks.",
"Llama 3 Instruct responds well to numbered instructions and explicit constraints.",
],
},
"mistral": {
"default": [
"Mistral models use [INST] and [/INST] markers in their native format.",
"Excellent for multilingual tasks; specify the target language explicitly.",
"Mistral NeMo and larger variants handle structured output reliably.",
"Keep system prompts concise; Mistral models respond well to focused instructions.",
],
},
"google": {
"default": [
"Gemini models support extremely long contexts; leverage this for document analysis.",
"Use Gemini's grounding capability for tasks requiring current information.",
"Gemini 1.5 Pro and 2.0 handle multimodal inputs natively.",
"Specify output format clearly; Gemini follows explicit format instructions well.",
],
},
"default": {
"default": [
"Apply role assignment, chain-of-thought, and explicit output specification.",
"Ensure the chat template appropriate for this model is applied.",
"Test with a simple prompt first to verify the model is responding correctly.",
],
},
}
The pattern library is designed for extensibility. Adding support for a new model family requires only adding a new entry to the MODEL_TIPS dictionary. Adding a new prompt engineering technique requires only adding a new entry to the PATTERNS dictionary. The optimizer engine picks up these changes automatically on the next run, because the pattern text is built dynamically from the library at optimization time rather than being hardcoded into the meta-prompt.
The Adapter Registry and Factory
With all the adapters implemented, we need a central registry that knows which adapters are available, which models each adapter can serve, and how to instantiate the right adapter for a given model selection. The registry pattern is the right choice here because it decouples the optimizer engine from the specific set of available adapters, making it easy to add new adapters without modifying the engine.
# core/adapter_registry.py
#
# Central registry for all LLM adapters.
# Discovers available adapters at startup and provides
# a factory method for creating adapter instances.
from __future__ import annotations
import logging
import os
from typing import Dict, List, Optional, Type
from adapters.base import BaseLLMAdapter
from adapters.openai_adapter import OpenAIAdapter
from adapters.anthropic_adapter import AnthropicAdapter
from adapters.llamacpp_adapter import LlamaCppAdapter
from adapters.mlx_adapter import MLXAdapter
logger = logging.getLogger(__name__)
# The full model catalog. Each entry defines a model that the optimizer
# can target. The 'optimizer_adapter_class' field specifies which adapter
# class to use when this model is selected as the OPTIMIZER (not the target).
MODEL_CATALOG: Dict[str, dict] = {
# Cloud models
"gpt-4o": {
"id": "gpt-4o",
"name": "GPT-4o",
"provider": "openai",
"api_model_name": "gpt-4o",
"type": "cloud",
"context_window": 128000,
"recommended_temperature": 0.7,
"is_reasoning_model": False,
"optimizer_adapter_class": "OpenAIAdapter",
"strengths": ["coding", "reasoning", "vision", "structured output"],
"notes": "Flagship OpenAI model. Excellent at following complex instructions.",
},
"o3": {
"id": "o3",
"name": "OpenAI o3",
"provider": "openai",
"api_model_name": "o3",
"type": "cloud",
"context_window": 200000,
"recommended_temperature": 1.0,
"is_reasoning_model": True,
"optimizer_adapter_class": "OpenAIAdapter",
"strengths": ["complex reasoning", "mathematics", "science", "coding"],
"notes": "Reasoning model. Do NOT add chain-of-thought instructions.",
},
"claude-opus-4-5": {
"id": "claude-opus-4-5",
"name": "Claude Opus 4.5",
"provider": "anthropic",
"api_model_name": "claude-opus-4-5",
"type": "cloud",
"context_window": 200000,
"recommended_temperature": 0.7,
"is_reasoning_model": False,
"optimizer_adapter_class": "AnthropicAdapter",
"strengths": ["long documents", "nuanced writing", "analysis", "coding"],
"notes": "Use XML-structured prompts. Exceptional long-context performance.",
},
"claude-sonnet-4-5": {
"id": "claude-sonnet-4-5",
"name": "Claude Sonnet 4.5",
"provider": "anthropic",
"api_model_name": "claude-sonnet-4-5",
"type": "cloud",
"context_window": 200000,
"recommended_temperature": 0.7,
"is_reasoning_model": False,
"optimizer_adapter_class": "AnthropicAdapter",
"strengths": ["balanced performance", "coding", "analysis"],
"notes": "Best price/performance ratio in the Claude family.",
},
# Local models (llama-cpp-python)
"llama3-8b-local": {
"id": "llama3-8b-local",
"name": "Llama 3 8B Instruct (Local)",
"provider": "meta",
"api_model_name": "llama3-8b-local",
"type": "local",
"context_window": 8192,
"recommended_temperature": 0.7,
"is_reasoning_model": False,
"optimizer_adapter_class": "LlamaCppAdapter",
"gguf_repo_id": "bartowski/Meta-Llama-3-8B-Instruct-GGUF",
"gguf_filename": "*Q4_K_M*",
"strengths": ["general tasks", "instruction following", "privacy"],
"notes": (
"Runs locally. Apply chat template. "
"Q4_K_M quantization balances quality and speed."
),
},
"llama3-70b-local": {
"id": "llama3-70b-local",
"name": "Llama 3 70B Instruct (Local)",
"provider": "meta",
"api_model_name": "llama3-70b-local",
"type": "local",
"context_window": 8192,
"recommended_temperature": 0.7,
"is_reasoning_model": False,
"optimizer_adapter_class": "LlamaCppAdapter",
"gguf_repo_id": "bartowski/Meta-Llama-3-70B-Instruct-GGUF",
"gguf_filename": "*Q4_K_M*",
"strengths": ["complex reasoning", "coding", "analysis", "privacy"],
"notes": (
"Requires ~40GB VRAM for Q4_K_M. "
"Significantly outperforms 8B on complex tasks."
),
},
# Apple MLX models
"llama3-8b-mlx": {
"id": "llama3-8b-mlx",
"name": "Llama 3 8B Instruct (Apple MLX)",
"provider": "meta",
"api_model_name": "llama3-8b-mlx",
"type": "local",
"context_window": 8192,
"recommended_temperature": 0.7,
"is_reasoning_model": False,
"optimizer_adapter_class": "MLXAdapter",
"hf_repo_id": "mlx-community/Meta-Llama-3-8B-Instruct-4bit",
"strengths": ["Apple Silicon optimized", "privacy", "speed on M-series"],
"notes": (
"Optimized for Apple M1/M2/M3/M4 chips. "
"Uses unified memory; no VRAM limit applies."
),
},
}
# Maps adapter class names to their actual classes.
_ADAPTER_CLASS_MAP: Dict[str, Type[BaseLLMAdapter]] = {
"OpenAIAdapter": OpenAIAdapter,
"AnthropicAdapter": AnthropicAdapter,
"LlamaCppAdapter": LlamaCppAdapter,
"MLXAdapter": MLXAdapter,
}
class AdapterRegistry:
"""
Discovers and manages all available LLM adapters.
Provides a factory method for creating adapter instances
based on model selection.
"""
def __init__(self) -> None:
self._available_models: Dict[str, dict] = {}
self._discover_available_models()
def _discover_available_models(self) -> None:
"""
Tests each model in the catalog to determine if its
required adapter is available on this machine. Models
whose adapters are unavailable are excluded from the
available models list but not from the catalog.
"""
for model_id, config in MODEL_CATALOG.items():
adapter_class_name = config.get(
"optimizer_adapter_class", ""
)
adapter_class = _ADAPTER_CLASS_MAP.get(adapter_class_name)
if adapter_class is None:
logger.warning(
"Unknown adapter class '%s' for model '%s'. Skipping.",
adapter_class_name, model_id
)
continue
# Instantiate a temporary adapter to check availability.
# This is safe because adapter __init__ methods do not
# load models or make network calls.
try:
temp_adapter = adapter_class(config)
if temp_adapter.is_available():
self._available_models[model_id] = config
logger.info(
"Model available: %s (%s)",
config["name"],
adapter_class_name
)
else:
logger.info(
"Model unavailable (missing credentials or library): %s",
config["name"]
)
except Exception as exc:
logger.warning(
"Error checking availability for '%s': %s",
model_id, exc
)
def get_available_models(self) -> List[dict]:
"""Returns all models available on this machine."""
return list(self._available_models.values())
def get_model_config(self, model_id: str) -> Optional[dict]:
"""Returns the configuration for a specific model, or None."""
return MODEL_CATALOG.get(model_id)
def create_adapter(self, model_id: str) -> BaseLLMAdapter:
"""
Creates and returns an adapter instance for the given model.
Raises ValueError if the model is not available.
"""
config = MODEL_CATALOG.get(model_id)
if config is None:
raise ValueError(
f"Unknown model ID: '{model_id}'. "
f"Available models: {list(MODEL_CATALOG.keys())}"
)
adapter_class_name = config.get("optimizer_adapter_class", "")
adapter_class = _ADAPTER_CLASS_MAP.get(adapter_class_name)
if adapter_class is None:
raise ValueError(
f"Unknown adapter class '{adapter_class_name}' "
f"for model '{model_id}'."
)
return adapter_class(config)
Putting It All Together: The Main Entry Point
With all components in place, the main entry point ties everything together into a working application. This module demonstrates how the registry, optimizer, and adapters collaborate to transform a raw prompt into an optimized one. It is designed to be run from the command line for testing and demonstration purposes.
# main.py
#
# Entry point for the prompt optimizer.
# Demonstrates the full optimization pipeline end-to-end.
# Usage: python main.py --model gpt-4o --prompt "your prompt here"
# python main.py --model llama3-8b-local --prompt "your prompt here"
from __future__ import annotations
import argparse
import asyncio
import json
import logging
import sys
import uuid
from core.adapter_registry import AdapterRegistry
from core.optimizer import PromptOptimizer
# Configure logging to show INFO-level messages with timestamps.
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s [%(levelname)s] %(name)s: %(message)s",
datefmt="%H:%M:%S",
)
logger = logging.getLogger(__name__)
def _print_result(result) -> None:
"""
Prints the optimization result in a readable format.
Each section is clearly labeled and separated for easy reading.
"""
separator = "=" * 70
print(f"\n{separator}")
print("OPTIMIZATION COMPLETE")
print(f"Optimizer model used: {result.model_used}")
print(f"Backend: {result.backend_info}")
print(separator)
print("\nOPTIMIZED SYSTEM PROMPT:")
print("-" * 40)
print(result.optimized_system_prompt)
print("\nOPTIMIZED USER PROMPT:")
print("-" * 40)
print(result.optimized_user_prompt)
print("\nASSISTANT PRIMER:")
print("-" * 40)
print(result.optimized_assistant_primer)
if result.patterns_applied:
print("\nPATTERNS APPLIED:")
for pattern in result.patterns_applied:
print(f" + {pattern}")
if result.anti_patterns_avoided:
print("\nANTI-PATTERNS FIXED:")
for ap in result.anti_patterns_avoided:
print(f" - {ap}")
if result.model_recommendation:
print(f"\nMODEL RECOMMENDATION: {result.model_recommendation}")
print(f"Reason: {result.recommendation_reason}")
print(f"\n{separator}\n")
async def _run(args: argparse.Namespace) -> int:
"""
Async entry point. Initializes the registry, creates the adapter,
runs the optimizer, and prints the result.
Returns 0 on success, 1 on failure.
"""
registry = AdapterRegistry()
available = registry.get_available_models()
if not available:
print(
"ERROR: No models are available. "
"Set OPENAI_API_KEY or ANTHROPIC_API_KEY, "
"or install llama-cpp-python / mlx-lm.",
file=sys.stderr
)
return 1
# If the user asked for a model list, print it and exit.
if args.list_models:
print("\nAvailable models:")
for m in available:
print(f" {m['id']:30s} {m['name']} ({m['type']})")
print()
return 0
# Validate the requested model.
target_config = registry.get_model_config(args.target_model)
if target_config is None:
print(
f"ERROR: Unknown target model '{args.target_model}'. "
f"Use --list-models to see available options.",
file=sys.stderr
)
return 1
# Create the optimizer adapter (the model that RUNS the optimization).
optimizer_model_id = args.optimizer_model or args.target_model
try:
optimizer_adapter = registry.create_adapter(optimizer_model_id)
except ValueError as exc:
print(f"ERROR: {exc}", file=sys.stderr)
return 1
# Run the optimization.
optimizer = PromptOptimizer()
request_id = str(uuid.uuid4())[:8]
print(
f"\nOptimizing prompt for target model: {target_config['name']}"
)
print(f"Using optimizer model: {optimizer_model_id}")
print("Please wait...\n")
try:
result = await optimizer.optimize(
raw_prompt=args.prompt,
target_model_config=target_config,
optimizer_adapter=optimizer_adapter,
request_id=request_id,
)
_print_result(result)
return 0
except RuntimeError as exc:
print(f"ERROR: {exc}", file=sys.stderr)
return 1
def main() -> None:
parser = argparse.ArgumentParser(
description="Prompt Optimizer: Transform raw prompts into "
"model-specific, production-ready prompt packages.",
formatter_class=argparse.RawDescriptionHelpFormatter,
)
parser.add_argument(
"--target-model",
default="gpt-4o",
help="ID of the model to optimize the prompt FOR.",
)
parser.add_argument(
"--optimizer-model",
default=None,
help=(
"ID of the model to USE for optimization. "
"Defaults to the target model."
),
)
parser.add_argument(
"--prompt",
default="Help me write better code.",
help="The raw prompt to optimize.",
)
parser.add_argument(
"--list-models",
action="store_true",
help="List all available models and exit.",
)
args = parser.parse_args()
exit_code = asyncio.run(_run(args))
sys.exit(exit_code)
if __name__ == "__main__":
main()
Evaluation, Iteration, and the Road Ahead
Building a prompt optimizer is not a one-time engineering effort — it is an ongoing process of evaluation and refinement. The optimizer is only as good as its meta-prompt, its pattern library, and its model-specific knowledge. All three of these components need to be updated as the LLM landscape evolves, as new models are released, and as new prompt engineering research is published.
Evaluating prompt quality is itself a non-trivial problem. The most rigorous approach is to define a set of benchmark tasks with known correct outputs, run both the raw and optimized prompts against the target model, and measure the improvement in output quality using automated metrics (BLEU score, ROUGE score, exact match) combined with human evaluation. For production systems, A/B testing — where some users receive the raw prompt and others receive the optimized prompt, and the results are compared — provides the most ecologically valid measure of improvement.
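The benchmark approach can be sketched as a small harness. `run_model` is a hypothetical stand-in for a call to the target model (not part of the system above), and the metric shown is exact match, the simplest of the metrics mentioned.

```python
# Hedged sketch of a benchmark harness. run_model is a hypothetical
# callable returning the target model's output for a prompt; here a
# stub simulates a model that only answers correctly when prompted
# to reason step by step (illustrative only).
from typing import Callable, List, Tuple

def exact_match_accuracy(
    run_model: Callable[[str], str],
    cases: List[Tuple[str, str]],  # (prompt, expected_output) pairs
) -> float:
    if not cases:
        return 0.0
    hits = sum(
        1 for prompt, expected in cases
        if run_model(prompt).strip() == expected.strip()
    )
    return hits / len(cases)

def stub_model(prompt: str) -> str:
    return "4" if "step by step" in prompt else "5"

raw_cases = [("What is 2+2?", "4")]
optimized_cases = [("What is 2+2? Think step by step.", "4")]
improvement = (exact_match_accuracy(stub_model, optimized_cases)
               - exact_match_accuracy(stub_model, raw_cases))  # → 1.0
```

The same harness shape works for softer metrics: swap the equality check for a BLEU or ROUGE score and average instead of counting hits.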
A simpler but still valuable evaluation approach is to use a powerful LLM as a judge. After generating both a raw-prompt response and an optimized-prompt response from the target model, a separate judge model (typically a powerful cloud model like GPT-4o or Claude Opus) is asked to evaluate both responses on dimensions like accuracy, completeness, relevance, and format adherence. This approach is fast, scalable, and surprisingly reliable, though it has known biases (judge models tend to prefer longer responses and responses that match their own style).
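The judge pattern, including a position swap to counter the known position bias of judge models, can be sketched as follows. `judge` is a hypothetical callable wrapping a call to the judge model, and the single-letter verdict parsing is an assumption for illustration.

```python
# Hedged sketch of pairwise LLM-as-judge comparison with position
# swapping. `judge` is a hypothetical callable that sends a prompt
# to a judge model and returns verdict text containing "A" or "B".
from typing import Callable

JUDGE_TEMPLATE = (
    "Compare two responses to the same task on accuracy, "
    "completeness, relevance, and format adherence.\n"
    "Task: {task}\nResponse A: {a}\nResponse B: {b}\n"
    "Answer with exactly one letter: A or B."
)

def pairwise_judge(
    judge: Callable[[str], str], task: str, raw: str, optimized: str
) -> str:
    # Run both orderings; count a win only if the optimized response
    # wins regardless of which position it occupies.
    first = judge(JUDGE_TEMPLATE.format(task=task, a=raw, b=optimized))
    second = judge(JUDGE_TEMPLATE.format(task=task, a=optimized, b=raw))
    opt_wins_first = "B" in first.upper()
    opt_wins_second = "A" in second.upper()
    if opt_wins_first and opt_wins_second:
        return "optimized"
    if not opt_wins_first and not opt_wins_second:
        return "raw"
    return "tie"  # verdict flipped with position: inconclusive
```

Treating a position-dependent verdict as a tie is a deliberately conservative design choice: it discards exactly the comparisons where the judge's bias, rather than response quality, determined the outcome.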
The feedback loop is the mechanism by which the optimizer improves over time. Every optimization run produces data: which patterns were applied, which anti-patterns were detected, what the target model was, and whether the user found the result useful. This data can be used to fine-tune the meta-prompt, to add new patterns that are consistently effective, to remove patterns that are rarely applied, and to improve model-specific tips based on observed performance. Over time, a well-maintained prompt optimizer becomes increasingly effective as its knowledge base grows.
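The per-run data that feeds this loop can be captured with a simple record. This is a hedged sketch; the field names are assumptions, not part of the system described above.

```python
# Hedged sketch: recording per-run optimization data so the pattern
# library can be tuned later. Field names are illustrative.
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List

@dataclass
class OptimizationRecord:
    request_id: str
    target_model_id: str
    patterns_applied: List[str] = field(default_factory=list)
    anti_patterns_fixed: List[str] = field(default_factory=list)
    user_rating: int = 0  # e.g. 1-5, collected after the fact

def append_record(path: str, record: OptimizationRecord) -> None:
    # JSON Lines: one record per line, trivially appendable.
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(asdict(record)) + "\n")

def pattern_effectiveness(path: str) -> Dict[str, float]:
    # Average user rating per applied pattern.
    totals: Dict[str, tuple] = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            rec = json.loads(line)
            for p in rec["patterns_applied"]:
                n, s = totals.get(p, (0, 0))
                totals[p] = (n + 1, s + rec["user_rating"])
    return {p: s / n for p, (n, s) in totals.items()}
```

A periodic pass over this log is enough to surface patterns whose average rating is consistently low, which are candidates for rewording or removal from the library.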
The architecture described in this article is deliberately designed to support this kind of iterative improvement. The pattern library is a simple Python dictionary that can be updated without touching any other component. The meta-prompt template is a string that can be refined without changing the optimizer engine. The adapter layer can be extended with new backends without modifying the registry. This separation of concerns is not just good software engineering — it is what makes the system maintainable and improvable over the long term.
The future of prompt optimization is likely to involve more automation of the optimization process itself. Rather than relying on a hand-crafted meta-prompt, future systems may use reinforcement learning to discover prompt transformations that reliably improve output quality across a diverse set of tasks and models. Rather than a static pattern library, future systems may maintain a dynamic library that is updated automatically based on observed performance. Rather than a single optimization pass, future systems may iterate through multiple rounds of optimization, testing each version against the target model and selecting the best one.
But even as the field advances, the fundamental insight at the heart of prompt optimization will remain valid: the quality of what an LLM produces is determined not just by the model's capabilities, but by the quality of the instructions it receives. A prompt optimizer is, at its core, a system for ensuring that every user — regardless of their level of prompt engineering expertise — can give any LLM the instructions it needs to perform at its best. That is a goal worth building toward, and the architecture described in this article provides a solid foundation for doing exactly that.