Thursday, April 23, 2026

CONTEXT ENGINEERING: THE NEW FRONTIER - How Filling the Right Window at the Right Time Transforms AI




 

FOREWORD: THE MOMENT EVERYTHING CHANGED

There is a moment every developer working with Large Language Models eventually reaches. You have spent hours crafting the perfect prompt. You have been precise, structured, and even poetic in your phrasing. You have added examples, specified the tone, defined the output format, and triple-checked your grammar. You send the request to the model and receive back something that is confidently, fluently, and utterly wrong. The model did not fail because your prompt was bad. It failed because it did not know enough. It lacked the right documents, the right history, the right tools, and the right memory. It was like asking a brilliant surgeon to perform an operation while blindfolded and without instruments. No amount of eloquent instruction compensates for a missing scalpel.

This realization is the birthplace of Context Engineering, a discipline that has quietly become one of the most important skills in applied AI development. Andrej Karpathy, one of the most respected voices in the field and a founding member of OpenAI, crystallized the idea in 2025 when he described context engineering as "the delicate art and science of filling the context window with just the right information for the next step." That single sentence contains more practical wisdom than most textbooks on AI application development.

This article is a thorough exploration of what Context Engineering is, how it differs from and complements Prompt Engineering, who builds it, how its quality is measured, what its pitfalls are, and how you can implement it today using real code that runs on NVIDIA GPUs via CUDA, Apple Silicon via MLX, and other GPUs via Vulkan. We will travel from first principles to production-grade architecture, with code examples along the way to keep things grounded. Fasten your seatbelt. The context window is open.

CHAPTER ONE: WHAT IS CONTEXT ENGINEERING?

To understand Context Engineering, you must first understand what a context window actually is. Every Large Language Model, whether it is GPT-4, Llama 3, Mistral, Gemma, or any other transformer-based system, processes text by converting it into tokens and then attending to all of those tokens simultaneously. The context window is the maximum number of tokens the model can see at once. Think of it as the model's working memory, its RAM. Everything the model knows about the current task must fit inside that window at inference time. A modern context window might hold 8,000 tokens, 128,000 tokens, or even more than a million tokens in the most recent frontier models. But regardless of its size, it is always finite, and every token you place inside it is a deliberate choice. The model cannot go outside the window to retrieve information on its own. It cannot remember what you said yesterday unless you tell it. It cannot look up a database unless you have already placed the relevant rows inside the window or given it a tool to do so.

Karpathy's analogy is worth dwelling on. He compares the LLM to a CPU and the context window to RAM. The CPU is extraordinarily powerful, but it can only compute on what is currently loaded into RAM. The operating system decides what gets loaded, when, and in what form. Context Engineering is the discipline of being that operating system. It is the art and science of deciding what goes into the context window, in what order, in what format, compressed to what degree, and refreshed at what frequency.

This is fundamentally different from Prompt Engineering, and the distinction matters enormously. Prompt Engineering asks the question: "How should I phrase this instruction?" Context Engineering asks the question: "What does the model need to know right now, and how do I make sure it knows it?" Both questions are important. Neither is sufficient on its own.
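To make the finiteness concrete, here is a minimal sketch of what happens when candidate content exceeds the window. It uses the rough 4-characters-per-token heuristic that reappears later in this article; a real system would use the model's own tokenizer, and the greedy cut-off policy shown here is just one illustrative choice:

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    return max(1, len(text) // 4)


def fits_in_window(chunks: list[str], window_size: int) -> list[str]:
    """Greedily keep chunks, in order, until the window is full."""
    kept, used = [], 0
    for chunk in chunks:
        cost = estimate_tokens(chunk)
        if used + cost > window_size:
            break  # everything past this point is invisible to the model
        kept.append(chunk)
        used += cost
    return kept


chunks = ["a" * 400, "b" * 400, "c" * 400]  # ~100 tokens each
print(len(fits_in_window(chunks, window_size=250)))  # → 2
```

The third chunk simply never reaches the model. Deciding *which* chunks deserve the budget, rather than cutting off whatever arrives last, is precisely what the rest of this article is about.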

THE SIX CONSTITUENTS OF A CONTEXT WINDOW

A well-engineered context window is not a single blob of text. It is a carefully structured assembly of distinct layers, each serving a specific purpose. Understanding these layers is the foundation of Context Engineering.

The first constituent is the System Prompt. This is the foundational layer that establishes who the model is, what it can and cannot do, what tone it should adopt, and what constraints govern its behavior. A good system prompt is not just a few lines of instruction. In production agentic systems, it can run to hundreds or even thousands of tokens, defining the model's persona, its ethical guardrails, its output format preferences, and its awareness of the tools available to it. The system prompt is the constitution of the model's behavior for this session.

The second constituent is Long-Term Memory. LLMs are stateless by nature. When a session ends, the model forgets everything. Long-term memory is the mechanism by which information persists across sessions. It is typically implemented as an external database, a vector store, or a structured file system. When a new session begins, the context engine queries this external memory, retrieves the most relevant facts, and injects them into the context window. The model then appears to "remember" things it was told weeks ago, even though it is actually reading them fresh from an external source.

The third constituent is Retrieved Documents, which is the domain of Retrieval-Augmented Generation, or RAG. Rather than trying to cram an entire knowledge base into the context window (which would be impossible for any reasonably sized knowledge base), RAG systems retrieve only the most relevant chunks of information at query time and inject them into the context. The model then reasons over these retrieved documents as if it had always known them. This is how you give a model access to proprietary company documents, real-time web data, or specialized technical manuals without fine-tuning it.

The fourth constituent is Tool Definitions. Modern LLMs can be given access to external tools: functions that search the web, query databases, call APIs, execute code, or interact with file systems. But the model can only use a tool if it knows the tool exists and understands how to call it. Tool definitions, typically expressed as JSON schemas or structured descriptions, must be placed in the context window. The Model Context Protocol, or MCP, has emerged as a standardized way to define and expose these tools to LLMs, making tool integration modular and interoperable.

The fifth constituent is Conversation History. In a multi-turn dialogue, the model needs to know what has already been said. This is the conversation history, the running transcript of the interaction. Managing conversation history is surprisingly subtle. Keeping the entire history verbatim quickly fills the context window. Summarizing it too aggressively loses important details. The context engineer must decide how much history to keep, when to summarize it, and how to structure it so the model can follow the thread of the conversation.

The sixth constituent is the Current Task or User Query. This is the immediate input that triggered the current inference step. Thanks to recency effects in how models attend to long contexts, it typically receives stronger attention than anything else in the window. Everything else in the context window exists to support the model's ability to respond well to this single input.
The following diagram illustrates how these six layers stack together to form a complete context window:

+--------------------------------------------------+
| CONTEXT WINDOW (e.g., 128,000 tokens total)      |
|                                                  |
| [SYSTEM PROMPT]                                  |
|   Role, constraints, persona, output format      |
|                                                  |
| [LONG-TERM MEMORY]                               |
|   Facts retrieved from external memory store     |
|                                                  |
| [RETRIEVED DOCUMENTS (RAG)]                      |
|   Relevant chunks from knowledge base            |
|                                                  |
| [TOOL DEFINITIONS]                               |
|   Available functions/APIs the model can call    |
|                                                  |
| [CONVERSATION HISTORY]                           |
|   Recent turns of the dialogue (summarized)      |
|                                                  |
| [CURRENT TASK / USER QUERY]                      |
|   The immediate input for this inference step    |
+--------------------------------------------------+

Each layer competes for the finite space available. The context engineer's job is to allocate that space wisely, ensuring that the most relevant information is always present and that noise, redundancy, and irrelevance are ruthlessly eliminated.
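As a concrete illustration of the fourth constituent, here is what a tool definition looks like in the widely used OpenAI function-calling schema format. The `search_orders` tool and its fields are hypothetical; the point is the shape of the schema, which is what actually occupies tokens in the context window:

```python
# A hypothetical tool definition in OpenAI function-calling format.
# Placing this schema in the context window is what makes the tool
# visible to the model; nothing is "installed" in the model itself.
search_orders_tool = {
    "type": "function",
    "function": {
        "name": "search_orders",
        "description": "Look up a customer's recent orders by customer ID.",
        "parameters": {
            "type": "object",
            "properties": {
                "customer_id": {
                    "type": "string",
                    "description": "The internal customer identifier.",
                },
                "limit": {
                    "type": "integer",
                    "description": "Maximum number of orders to return.",
                },
            },
            "required": ["customer_id"],
        },
    },
}
```

Every tool you expose costs context budget, which is one more reason to expose only the tools the current task plausibly needs.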

CHAPTER TWO: THE WORKFLOW OF CONTEXT ENGINEERING

Context Engineering does not happen in a single place or at a single moment. It is a continuous process that spans the entire lifecycle of an AI application, from initial design through deployment and ongoing maintenance. Understanding who does what and when is essential for building effective systems.

During the design phase, the AI architect is the primary context engineer. This person makes the foundational decisions: What is the model's role? What external knowledge sources will it need? What tools should it have access to? How will conversation history be managed? What are the token budget allocations for each layer? These decisions shape the entire system architecture and are extraordinarily difficult to change later without significant rework.

During the development phase, the software engineer takes over as the primary context engineer. This person implements the retrieval pipelines, the memory systems, the tool integrations, and the context assembly logic. They write the code that dynamically builds the context window at inference time, pulling from databases, vector stores, and APIs to assemble the optimal information payload for each request. This is where Context Engineering becomes a software engineering discipline in its own right.

During the deployment and operations phase, the MLOps engineer and the data engineer share context engineering responsibilities. The MLOps engineer monitors context quality metrics in production, detecting when the model's performance degrades because the context is stale, poisoned, or overflowing. The data engineer maintains the knowledge bases, vector stores, and memory systems that feed the context pipeline, ensuring that the information available for retrieval is accurate, up-to-date, and well-organized.

The end user also participates in context engineering, though usually without realizing it. Every message a user sends becomes part of the conversation history and thus part of the context.
Users who provide rich, detailed inputs are inadvertently doing excellent context engineering. Users who send terse, ambiguous messages are creating context engineering challenges that the system must compensate for.

The following code demonstrates a simplified but realistic context assembly pipeline. It shows how a context engineer might programmatically construct the context window for a single inference step, drawing from multiple sources:

```python
"""
context_assembler.py

A modular context assembly pipeline that constructs the optimal context
window for a single LLM inference step. This module follows clean
architecture principles by separating concerns: retrieval, memory, tool
loading, and assembly are all independent operations.

This code is designed to work with both local LLMs (via llama-cpp-python
or Ollama) and remote LLMs (via OpenAI-compatible APIs).
"""

from __future__ import annotations

import time
from dataclasses import dataclass, field
from typing import Any


# ---------------------------------------------------------------------------
# Data classes representing the distinct layers of the context window.
# Each layer is a first-class citizen with its own type and token budget.
# ---------------------------------------------------------------------------

@dataclass
class SystemPromptLayer:
    """
    The foundational layer. Defines the model's role, constraints, and
    behavioral guidelines. This is set once per session and rarely
    changes during a conversation.
    """
    content: str
    token_budget: int = 2000


@dataclass
class MemoryEntry:
    """
    A single fact or summary retrieved from long-term memory.
    Includes a relevance score so the assembler can prioritize the most
    important memories when the budget is tight.
    """
    content: str
    relevance_score: float  # 0.0 to 1.0
    source: str  # e.g., "session_2024-01-15" or "user_profile"
    timestamp: float = field(default_factory=time.time)


@dataclass
class RetrievedDocument:
    """
    A chunk of text retrieved from a RAG knowledge base.
    Includes provenance information so the model can cite sources and
    the system can audit what information influenced the response.
    """
    content: str
    source_document: str
    relevance_score: float
    chunk_index: int


@dataclass
class ToolDefinition:
    """
    A description of a tool available to the model.
    Follows the OpenAI function-calling schema format, which is also
    compatible with most open-source LLM frameworks.
    """
    name: str
    description: str
    parameters: dict[str, Any]


@dataclass
class ConversationTurn:
    """
    A single turn in the conversation history.
    Role is either 'user' or 'assistant'.
    """
    role: str
    content: str
    timestamp: float = field(default_factory=time.time)


@dataclass
class AssembledContext:
    """
    The final assembled context window, ready to be sent to the LLM.
    Contains both the structured messages list (for chat APIs) and a
    token count estimate for budget tracking.
    """
    messages: list[dict[str, str]]
    tool_definitions: list[dict[str, Any]]
    estimated_token_count: int
    assembly_metadata: dict[str, Any]


# ---------------------------------------------------------------------------
# The ContextAssembler is the heart of the context engineering pipeline.
# It takes raw inputs from all sources and produces a single, optimized
# AssembledContext ready for LLM inference.
# ---------------------------------------------------------------------------

class ContextAssembler:
    """
    Assembles the optimal context window for a single LLM inference step.

    This class implements the four core strategies of context engineering:
    - SELECT:   Choose only the most relevant information from each source.
    - COMPRESS: Summarize or trim when budgets are tight.
    - ORDER:    Place critical information at the beginning and end, not in
                the middle (mitigates the "lost in the middle" effect).
    - ISOLATE:  Keep layers clearly separated to prevent context poisoning.
    """

    # A rough approximation: 1 token is approximately 4 characters in English.
    # For production use, replace this with a proper tokenizer (e.g., tiktoken).
    CHARS_PER_TOKEN = 4

    def __init__(
        self,
        total_token_budget: int = 8192,
        system_prompt_budget: int = 1500,
        memory_budget: int = 1000,
        rag_budget: int = 3000,
        history_budget: int = 2000,
        query_budget: int = 512,
    ):
        """
        Initialize the assembler with explicit token budgets for each layer.
        The sum of all budgets should not exceed total_token_budget.
        Leaving some headroom (e.g., 192 tokens here) is good practice to
        account for formatting overhead and response tokens.
        """
        self.total_token_budget = total_token_budget
        self.budgets = {
            "system_prompt": system_prompt_budget,
            "memory": memory_budget,
            "rag": rag_budget,
            "history": history_budget,
            "query": query_budget,
        }

    def _estimate_tokens(self, text: str) -> int:
        """Rough token count estimate. Replace with tiktoken in production."""
        return max(1, len(text) // self.CHARS_PER_TOKEN)

    def _select_memories(
        self,
        memories: list[MemoryEntry],
        budget: int,
    ) -> list[MemoryEntry]:
        """
        Select the most relevant memories that fit within the token budget.
        Memories are sorted by relevance score (highest first) so that the
        most important facts are always included when space is limited.
        """
        sorted_memories = sorted(
            memories, key=lambda m: m.relevance_score, reverse=True
        )
        selected = []
        tokens_used = 0
        for memory in sorted_memories:
            cost = self._estimate_tokens(memory.content)
            if tokens_used + cost <= budget:
                selected.append(memory)
                tokens_used += cost
        return selected

    def _select_documents(
        self,
        documents: list[RetrievedDocument],
        budget: int,
    ) -> list[RetrievedDocument]:
        """
        Select the most relevant RAG documents within the token budget.
        Uses the same relevance-first selection strategy as memory.
        Critically, we cap at 5 documents to avoid the "lost in the middle"
        effect, where the model ignores documents buried deep in a long list.
        """
        sorted_docs = sorted(
            documents, key=lambda d: d.relevance_score, reverse=True
        )
        # Cap at 5 documents regardless of budget: more than 5 retrieved
        # documents rarely improves quality and often degrades it.
        top_docs = sorted_docs[:5]
        selected = []
        tokens_used = 0
        for doc in top_docs:
            cost = self._estimate_tokens(doc.content)
            if tokens_used + cost <= budget:
                selected.append(doc)
                tokens_used += cost
        return selected

    def _trim_history(
        self,
        history: list[ConversationTurn],
        budget: int,
    ) -> list[ConversationTurn]:
        """
        Trim conversation history to fit within the token budget.
        We keep the MOST RECENT turns, not the oldest, because recent
        context is almost always more relevant than distant history.
        Older turns should already be summarized into long-term memory.
        """
        trimmed = []
        tokens_used = 0
        # Iterate in reverse (most recent first) and prepend to preserve order.
        for turn in reversed(history):
            cost = self._estimate_tokens(turn.content)
            if tokens_used + cost <= budget:
                trimmed.insert(0, turn)
                tokens_used += cost
            else:
                # Stop as soon as we cannot fit the next turn.
                break
        return trimmed

    def assemble(
        self,
        system_prompt: SystemPromptLayer,
        memories: list[MemoryEntry],
        documents: list[RetrievedDocument],
        tools: list[ToolDefinition],
        history: list[ConversationTurn],
        current_query: str,
    ) -> AssembledContext:
        """
        Assemble the complete context window from all available sources.

        The assembly order follows the "primacy and recency" principle:
        1. System prompt (always first, highest priority)
        2. Long-term memory (stable background knowledge)
        3. Retrieved documents (task-specific knowledge)
        4. Conversation history (recent context)
        5. Current query (always last, highest attention weight)

        This ordering ensures the most important information is at the
        beginning and end of the context, where LLMs pay most attention.
        """
        # --- Step 1: Select and compress each layer ---
        selected_memories = self._select_memories(
            memories, self.budgets["memory"]
        )
        selected_documents = self._select_documents(
            documents, self.budgets["rag"]
        )
        trimmed_history = self._trim_history(
            history, self.budgets["history"]
        )

        # --- Step 2: Build the messages list ---
        messages: list[dict[str, str]] = []

        # System prompt is always the first message.
        system_content = system_prompt.content

        # Inject memories into the system message for clean separation.
        if selected_memories:
            memory_block = "\n\n[RELEVANT MEMORIES FROM PREVIOUS SESSIONS]\n"
            for mem in selected_memories:
                memory_block += f"- (source: {mem.source}) {mem.content}\n"
            system_content += memory_block

        # Inject retrieved documents as a clearly labeled block.
        if selected_documents:
            rag_block = "\n\n[RETRIEVED KNOWLEDGE BASE DOCUMENTS]\n"
            for i, doc in enumerate(selected_documents, start=1):
                rag_block += (
                    f"\nDocument {i} (source: {doc.source_document}):\n"
                    f"{doc.content}\n"
                )
            system_content += rag_block

        messages.append({"role": "system", "content": system_content})

        # Add trimmed conversation history.
        for turn in trimmed_history:
            messages.append({"role": turn.role, "content": turn.content})

        # Current query is always the final user message.
        messages.append({"role": "user", "content": current_query})

        # --- Step 3: Serialize tool definitions ---
        tool_schemas = [
            {
                "type": "function",
                "function": {
                    "name": tool.name,
                    "description": tool.description,
                    "parameters": tool.parameters,
                },
            }
            for tool in tools
        ]

        # --- Step 4: Estimate total token usage ---
        total_text = " ".join(m["content"] for m in messages)
        estimated_tokens = self._estimate_tokens(total_text)

        # --- Step 5: Build assembly metadata for observability ---
        metadata = {
            "memories_selected": len(selected_memories),
            "documents_selected": len(selected_documents),
            "history_turns": len(trimmed_history),
            "tools_available": len(tools),
            "estimated_tokens": estimated_tokens,
            "token_budget": self.total_token_budget,
            "budget_utilization": estimated_tokens / self.total_token_budget,
        }

        return AssembledContext(
            messages=messages,
            tool_definitions=tool_schemas,
            estimated_token_count=estimated_tokens,
            assembly_metadata=metadata,
        )
```

The code above is deliberately verbose and explicit. In a real production system, you would likely split these classes across multiple modules, add proper logging, integrate a real tokenizer like tiktoken or the model's own tokenizer, and connect the retrieval methods to actual vector databases and memory stores. But the structure shown here captures the essential logic of context assembly: select, compress, order, and isolate.
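As noted, the 4-characters-per-token heuristic should give way to a real tokenizer in production. One minimal sketch of that swap, falling back to the heuristic when tiktoken is not installed (the `cl100k_base` encoding is one common choice; the right encoding depends on your model):

```python
def count_tokens(text: str) -> int:
    """Prefer an exact tokenizer; fall back to the rough heuristic."""
    try:
        import tiktoken  # pip install tiktoken
        encoding = tiktoken.get_encoding("cl100k_base")
        return len(encoding.encode(text))
    except ImportError:
        # The same ~4-chars-per-token approximation used by the assembler.
        return max(1, len(text) // 4)


print(count_tokens("Context engineering fills the window deliberately."))
```

Because both paths return a plain integer, the assembler's budget logic is unchanged whichever tokenizer is available.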

Notice how the assembler never simply concatenates everything it receives. It makes deliberate choices about what to include, what to exclude, and in what order to present information. This is the essence of Context Engineering as a discipline: intentional, measurable, and systematic management of the information the model receives.

CHAPTER THREE: PROMPT ENGINEERING AND CONTEXT ENGINEERING AS PARTNERS

It would be a mistake to frame Prompt Engineering and Context Engineering as competitors. They are complementary disciplines that operate at different levels of abstraction and address different failure modes. Understanding how they interact is essential for building robust AI applications.

Prompt Engineering is the craft of writing effective instructions. It focuses on a single interaction and asks: given that the model will receive this context, what is the best way to phrase the instruction so the model understands exactly what is needed? Prompt Engineering techniques include zero-shot prompting (just asking directly), few-shot prompting (providing examples of the desired input-output pattern), chain-of-thought prompting (asking the model to reason step by step before answering), and role prompting (asking the model to adopt a specific persona or expertise level).
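A minimal sketch of two of these techniques combined, few-shot examples plus a chain-of-thought instruction, assembled into a single prompt string (the sentiment-classification task and example reviews are illustrative stand-ins):

```python
# Hypothetical few-shot examples demonstrating the desired input-output pattern.
FEW_SHOT_EXAMPLES = [
    ("The battery died after two days.", "negative"),
    ("Setup took thirty seconds and it just works.", "positive"),
]


def build_prompt(query: str) -> str:
    """Combine a task instruction, a chain-of-thought cue, and few-shot
    examples into one prompt, ending with the unanswered query."""
    lines = [
        "Classify the sentiment of each review as positive or negative.",
        "Think step by step before giving your final answer.",
        "",
    ]
    for review, label in FEW_SHOT_EXAMPLES:
        lines.append(f"Review: {review}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    lines.append(f"Review: {query}")
    lines.append("Sentiment:")
    return "\n".join(lines)


print(build_prompt("The sound quality is stunning."))
```

Note that the few-shot examples themselves occupy context tokens, which is exactly where Prompt Engineering decisions turn into Context Engineering budgets.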

Context Engineering is the infrastructure that makes Prompt Engineering effective. It ensures that when the model receives a beautifully crafted prompt, it also has all the information it needs to respond correctly. A perfect prompt without proper context is like a perfect question asked of someone who has amnesia and no reference materials. A rich context without a clear prompt is like giving someone a library and asking them to "do something useful."

The most powerful AI systems combine both disciplines. The system prompt, which is the primary artifact of Prompt Engineering in a production system, lives inside the context window and is therefore also a product of Context Engineering. The few-shot examples that guide the model's output format are carefully selected pieces of context. The chain-of-thought instruction that tells the model to reason before answering is a prompt engineering technique that only works if the context provides enough information for the reasoning to be grounded in fact.

Consider this example. A customer support agent needs to answer a question about a product warranty. Prompt Engineering determines how the instruction is phrased: "You are a helpful customer support agent. Answer the customer's question accurately and concisely, citing the specific warranty terms that apply." Context Engineering determines what information is available: the specific warranty document for the product the customer purchased, the customer's purchase history, any previous support tickets, and the current date (to determine if the warranty is still active). Without Context Engineering, the beautifully crafted prompt will produce a hallucinated warranty answer. Without Prompt Engineering, the rich context will produce a disorganized, verbose, or off-topic response.

The following code snippet shows how Prompt Engineering and Context Engineering work together in a single inference call. The prompt template is the product of Prompt Engineering, while the variables injected into it are the product of Context Engineering:

"""
prompt_context_integration.py

Demonstrates the integration of Prompt Engineering (the template structure)
with Context Engineering (the dynamic content injection). This pattern is
sometimes called "structured prompting with dynamic context injection."

The PromptTemplate class is a Prompt Engineering artifact.
The ContextInjector class is a Context Engineering artifact.
Together they produce the final inference payload.
"""

from __future__ import annotations

import textwrap
from dataclasses import dataclass
from typing import Any


# ---------------------------------------------------------------------------
# Prompt Engineering artifact: a reusable, versioned prompt template.
# The template defines the structure and tone of the instruction.
# It uses named placeholders for all dynamic content.
# ---------------------------------------------------------------------------

@dataclass
class PromptTemplate:
    """
    A versioned, reusable prompt template.

    The template string uses {placeholder} syntax for dynamic content.
    All placeholders must be filled by the ContextInjector before the
    prompt is sent to the LLM. This enforces a clean separation between
    the instruction structure (Prompt Engineering) and the dynamic
    content (Context Engineering).
    """
    name: str
    version: str
    template: str
    required_placeholders: list[str]

    def validate(self, context: dict[str, Any]) -> list[str]:
        """
        Validate that all required placeholders are present in the context.
        Returns a list of missing placeholder names (empty if all present).
        This prevents silent failures where a missing variable produces
        a malformed prompt that the model tries to interpret literally.
        """
        return [
            placeholder
            for placeholder in self.required_placeholders
            if placeholder not in context
        ]

    def render(self, context: dict[str, Any]) -> str:
        """
        Render the template by substituting all placeholders with
        their corresponding values from the context dictionary.
        Raises ValueError if any required placeholder is missing.
        """
        missing = self.validate(context)
        if missing:
            raise ValueError(
                f"Template '{self.name}' v{self.version} is missing "
                f"required context keys: {missing}"
            )
        return self.template.format(**context)


# ---------------------------------------------------------------------------
# A library of reusable prompt templates. In a production system, these
# would be stored in a version-controlled repository and loaded at startup.
# ---------------------------------------------------------------------------

CUSTOMER_SUPPORT_TEMPLATE = PromptTemplate(
    name="customer_support_warranty",
    version="2.1",
    template=textwrap.dedent("""
        You are a knowledgeable and empathetic customer support specialist
        for {company_name}. Your role is to help customers understand their
        warranty coverage accurately and honestly.

        IMPORTANT CONSTRAINTS:
        - Only cite warranty terms that appear in the WARRANTY DOCUMENT below.
        - If the warranty has expired, say so clearly and compassionately.
        - Never invent coverage that is not explicitly stated in the document.
        - If you are uncertain, say so and offer to escalate to a human agent.

        WARRANTY DOCUMENT:
        {warranty_document}

        CUSTOMER PURCHASE HISTORY:
        - Product: {product_name}
        - Purchase Date: {purchase_date}
        - Order ID: {order_id}

        PREVIOUS SUPPORT INTERACTIONS:
        {previous_tickets}

        Today's Date: {current_date}

        Please answer the customer's question based solely on the above
        information. Be concise, accurate, and empathetic.
    """).strip(),
    required_placeholders=[
        "company_name",
        "warranty_document",
        "product_name",
        "purchase_date",
        "order_id",
        "previous_tickets",
        "current_date",
    ],
)


# ---------------------------------------------------------------------------
# Context Engineering artifact: the ContextInjector assembles the dynamic
# values that will fill the prompt template's placeholders. In a real
# system, each of these values would come from a different source:
# a database, a vector store, a calendar API, etc.
# ---------------------------------------------------------------------------

class ContextInjector:
    """
    Assembles the dynamic context values needed to render a PromptTemplate.

    Each method in this class represents a context source. In production,
    these methods would call real databases, APIs, and retrieval systems.
    Here they are stubs that illustrate the pattern clearly.
    """

    def __init__(self, company_name: str, current_date: str):
        self.company_name = company_name
        self.current_date = current_date

    def fetch_warranty_document(self, product_id: str) -> str:
        """
        Retrieves the warranty document for a given product from the
        knowledge base. In production, this would query a vector store
        or a document database using the product_id as the retrieval key.
        """
        # Stub: in production, call your RAG retrieval pipeline here.
        return (
            "Standard Limited Warranty: This product is covered for 24 months "
            "from the date of purchase against manufacturing defects. "
            "Accidental damage, water damage, and normal wear are not covered. "
            "To claim warranty service, contact support with your order ID."
        )

    def fetch_purchase_history(
        self, customer_id: str
    ) -> dict[str, str]:
        """
        Retrieves the customer's purchase history from the CRM database.
        Returns a dictionary with product name, purchase date, and order ID.
        """
        # Stub: in production, query your CRM or order management system.
        return {
            "product_name":  "ProScan X200 Wireless Headphones",
            "purchase_date": "2024-03-15",
            "order_id":      "ORD-2024-88421",
        }

    def fetch_previous_tickets(self, customer_id: str) -> str:
        """
        Retrieves a summary of previous support interactions for this customer.
        In production, this would query the support ticket system and
        potentially summarize long ticket histories using an LLM.
        """
        # Stub: in production, query your helpdesk system.
        return "No previous support tickets found for this customer."

    def build_context(
        self,
        customer_id: str,
        product_id: str,
    ) -> dict[str, Any]:
        """
        Assembles the complete context dictionary for the prompt template.
        This single method orchestrates all the individual data fetches
        and returns a unified context payload ready for template rendering.
        """
        purchase_history = self.fetch_purchase_history(customer_id)
        return {
            "company_name":      self.company_name,
            "warranty_document": self.fetch_warranty_document(product_id),
            "product_name":      purchase_history["product_name"],
            "purchase_date":     purchase_history["purchase_date"],
            "order_id":          purchase_history["order_id"],
            "previous_tickets":  self.fetch_previous_tickets(customer_id),
            "current_date":      self.current_date,
        }


# ---------------------------------------------------------------------------
# Putting it all together: rendering the final prompt by combining the
# Prompt Engineering template with the Context Engineering injector.
# ---------------------------------------------------------------------------

def build_support_prompt(
    customer_id: str,
    product_id: str,
) -> str:
    """
    Builds the complete, ready-to-send system prompt for the customer
    support agent by combining the prompt template with dynamic context.
    """
    injector = ContextInjector(
        company_name="Acme Electronics",
        current_date="2026-04-22",
    )
    context = injector.build_context(customer_id, product_id)
    rendered = CUSTOMER_SUPPORT_TEMPLATE.render(context)
    return rendered


if __name__ == "__main__":
    prompt = build_support_prompt(
        customer_id="CUST-10042",
        product_id="PROD-X200",
    )
    print(prompt)

This example illustrates a crucial architectural principle: the prompt template and the context assembly logic should be maintained separately. The template is owned by the product team and the AI engineers who craft the instructions. The context assembly logic is owned by the backend engineers who know where the data lives and how to retrieve it efficiently. This separation of concerns makes both components easier to test, version, and improve independently.

PART FOUR: RUNNING LLMs LOCALLY AND REMOTELY WITH GPU SUPPORT

One of the most exciting developments in applied AI is the ability to run powerful language models locally on consumer hardware. This matters for Context Engineering because local models give you complete control over the inference pipeline, including the context window size, the tokenizer, and the memory management strategy. You are not at the mercy of a third-party API's context limits or pricing model.

The primary tool for local LLM inference is llama.cpp, a C++ implementation of LLM inference that supports multiple GPU backends. It runs on NVIDIA GPUs via CUDA, on Apple Silicon via Metal, and on AMD and other GPUs via Vulkan. (Apple's MLX framework is a separate, increasingly popular option for Apple Silicon with its own mlx-lm package; the examples below use llama.cpp's Metal backend.) The Python bindings for llama.cpp are provided by the llama-cpp-python package.

For remote LLMs, the OpenAI API format has become the de facto standard. Most major LLM providers, including Anthropic, Mistral, and Groq, offer OpenAI-compatible APIs. Local inference servers like Ollama and LM Studio also expose OpenAI-compatible endpoints, which means the same client code can talk to a local Llama 3 model or a remote GPT-4o with minimal changes.

The following code implements a unified LLM client that automatically detects the available GPU backend and routes inference accordingly. It supports CUDA for NVIDIA GPUs, MLX for Apple Silicon, and Vulkan as a cross-platform fallback, while also supporting remote OpenAI-compatible APIs:

"""
llm_client.py

A unified LLM client that supports:
  - Local inference via llama-cpp-python (CUDA, Metal/MLX, Vulkan)
  - Remote inference via any OpenAI-compatible API
    (OpenAI, Anthropic via proxy, Ollama, LM Studio, etc.)

GPU backend detection is automatic: the client inspects the available
hardware and selects the optimal backend without manual configuration.

Installation:
  For CUDA (NVIDIA):
    CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python
  For Metal (Apple Silicon):
    CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python
  For Vulkan (AMD/Intel/other):
    CMAKE_ARGS="-DGGML_VULKAN=on" pip install llama-cpp-python
  For remote APIs:
    pip install openai
"""

from __future__ import annotations

import os
import platform
import subprocess
import sys
from abc import ABC, abstractmethod
from dataclasses import dataclass
from enum import Enum, auto
from typing import Any


# ---------------------------------------------------------------------------
# GPU backend enumeration. The client uses this to select the right
# inference path and to report which backend is active.
# ---------------------------------------------------------------------------

class GPUBackend(Enum):
    CUDA   = auto()   # NVIDIA GPUs via CUDA
    METAL  = auto()   # Apple Silicon via Metal / MLX
    VULKAN = auto()   # Cross-platform: AMD, Intel, and others via Vulkan
    CPU    = auto()   # CPU-only fallback (slow but universally compatible)
    REMOTE = auto()   # Remote API (no local GPU needed)


# ---------------------------------------------------------------------------
# Backend detection: inspects the runtime environment to determine which
# GPU acceleration is available. This runs once at startup.
# ---------------------------------------------------------------------------

class BackendDetector:
    """
    Detects the optimal GPU backend for local LLM inference.

    Detection order (highest performance first):
      1. CUDA  - if nvidia-smi is available and returns a valid GPU
      2. METAL - if running on macOS with Apple Silicon (arm64)
      3. VULKAN - if vulkaninfo is available (AMD/Intel/other GPUs)
      4. CPU   - universal fallback
    """

    @staticmethod
    def _command_exists(command: str) -> bool:
        """Check whether a shell command is available on this system."""
        try:
            subprocess.run(
                [command, "--version"],
                capture_output=True,
                timeout=5,
            )
            return True
        except (FileNotFoundError, subprocess.TimeoutExpired):
            return False

    @staticmethod
    def _has_nvidia_gpu() -> bool:
        """Detect NVIDIA GPU via nvidia-smi."""
        try:
            result = subprocess.run(
                ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
                capture_output=True,
                text=True,
                timeout=10,
            )
            return result.returncode == 0 and bool(result.stdout.strip())
        except (FileNotFoundError, subprocess.TimeoutExpired):
            return False

    @staticmethod
    def _is_apple_silicon() -> bool:
        """Detect Apple Silicon (M1/M2/M3/M4) by checking platform info."""
        return (
            platform.system() == "Darwin"
            and platform.machine() == "arm64"
        )

    @staticmethod
    def _has_vulkan() -> bool:
        """Detect Vulkan support via vulkaninfo."""
        try:
            result = subprocess.run(
                ["vulkaninfo", "--summary"],
                capture_output=True,
                text=True,
                timeout=10,
            )
            return result.returncode == 0
        except (FileNotFoundError, subprocess.TimeoutExpired):
            return False

    @classmethod
    def detect(cls) -> GPUBackend:
        """
        Detect and return the best available GPU backend.
        Logs the detection result to help with debugging.
        """
        if cls._has_nvidia_gpu():
            print("[BackendDetector] CUDA backend detected (NVIDIA GPU).")
            return GPUBackend.CUDA
        if cls._is_apple_silicon():
            print("[BackendDetector] Metal/MLX backend detected (Apple Silicon).")
            return GPUBackend.METAL
        if cls._has_vulkan():
            print("[BackendDetector] Vulkan backend detected.")
            return GPUBackend.VULKAN
        print("[BackendDetector] No GPU detected. Falling back to CPU.")
        return GPUBackend.CPU


# ---------------------------------------------------------------------------
# Abstract base class for all LLM clients. Defines the interface that
# both local and remote clients must implement. This allows the rest of
# the application to be completely agnostic about where inference runs.
# ---------------------------------------------------------------------------

@dataclass
class InferenceRequest:
    """
    A single inference request, including the assembled context messages
    and any tool definitions available to the model.
    """
    messages: list[dict[str, str]]
    tools: list[dict[str, Any]] | None = None
    max_tokens: int = 1024
    temperature: float = 0.7
    top_p: float = 0.9


@dataclass
class InferenceResponse:
    """
    The model's response to an inference request.
    Includes the generated text and metadata about the inference.
    """
    content: str
    tool_calls: list[dict[str, Any]] | None
    prompt_tokens: int
    completion_tokens: int
    backend: GPUBackend


class BaseLLMClient(ABC):
    """Abstract base class defining the LLM client interface."""

    @abstractmethod
    def infer(self, request: InferenceRequest) -> InferenceResponse:
        """
        Run inference and return the model's response.
        All implementations must honor the InferenceRequest contract.
        """
        ...

    @abstractmethod
    def get_backend(self) -> GPUBackend:
        """Return the GPU backend this client is using."""
        ...


# ---------------------------------------------------------------------------
# Local LLM client using llama-cpp-python.
# Supports CUDA, Metal/MLX, and Vulkan depending on how the package
# was compiled. The n_gpu_layers parameter controls how many model layers
# are offloaded to the GPU: -1 means "offload everything possible."
# ---------------------------------------------------------------------------

class LocalLLMClient(BaseLLMClient):
    """
    Local LLM inference client using llama-cpp-python.

    This client loads a GGUF model file and runs inference locally.
    GPU acceleration is provided by whichever backend llama-cpp-python
    was compiled with (CUDA, Metal, or Vulkan).

    Parameters:
        model_path:    Path to the GGUF model file.
        n_ctx:         Context window size in tokens. Larger values use
                       more VRAM/RAM but allow longer contexts.
        n_gpu_layers:  Number of model layers to offload to GPU.
                       Use -1 to offload all layers (recommended).
        backend:       The GPU backend in use (for reporting purposes).
    """

    def __init__(
        self,
        model_path: str,
        n_ctx: int = 8192,
        n_gpu_layers: int = -1,
        backend: GPUBackend = GPUBackend.CPU,
    ):
        # Lazy import: only import llama_cpp if we are actually using
        # the local client. This avoids import errors on systems where
        # llama-cpp-python is not installed.
        try:
            from llama_cpp import Llama
        except ImportError:
            raise ImportError(
                "llama-cpp-python is not installed. "
                "Install it with: pip install llama-cpp-python\n"
                "For GPU support, see the module docstring for build flags."
            )

        self._backend = backend
        print(
            f"[LocalLLMClient] Loading model from {model_path} "
            f"with n_ctx={n_ctx}, n_gpu_layers={n_gpu_layers} "
            f"(backend: {backend.name})"
        )

        # The Llama constructor handles all GPU setup internally.
        # When compiled with CUDA support, n_gpu_layers > 0 will
        # automatically use CUDA. Same for Metal and Vulkan.
        self._model = Llama(
            model_path=model_path,
            n_ctx=n_ctx,
            n_gpu_layers=n_gpu_layers,
            verbose=False,   # Set to True for detailed inference logs.
        )

    def get_backend(self) -> GPUBackend:
        return self._backend

    def infer(self, request: InferenceRequest) -> InferenceResponse:
        """
        Run local inference using llama-cpp-python's chat completion API.
        The API mirrors the OpenAI chat completions format, making it
        easy to switch between local and remote inference.
        """
        kwargs: dict[str, Any] = {
            "messages":    request.messages,
            "max_tokens":  request.max_tokens,
            "temperature": request.temperature,
            "top_p":       request.top_p,
        }
        # Only include tools if the model supports function calling.
        if request.tools:
            kwargs["tools"] = request.tools

        response = self._model.create_chat_completion(**kwargs)

        choice = response["choices"][0]
        message = choice["message"]

        return InferenceResponse(
            content=message.get("content") or "",
            tool_calls=message.get("tool_calls"),
            prompt_tokens=response["usage"]["prompt_tokens"],
            completion_tokens=response["usage"]["completion_tokens"],
            backend=self._backend,
        )


# ---------------------------------------------------------------------------
# Remote LLM client using the OpenAI-compatible API.
# Works with OpenAI, Anthropic (via proxy), Mistral, Groq, Ollama,
# LM Studio, and any other provider that implements the OpenAI API spec.
# ---------------------------------------------------------------------------

class RemoteLLMClient(BaseLLMClient):
    """
    Remote LLM inference client using the OpenAI-compatible API.

    To use with Ollama (local server with remote-style API):
        base_url = "http://localhost:11434/v1"
        api_key  = "ollama"  # Ollama does not require a real key.

    To use with OpenAI:
        base_url = "https://api.openai.com/v1"
        api_key  = os.environ["OPENAI_API_KEY"]

    To use with LM Studio:
        base_url = "http://localhost:1234/v1"
        api_key  = "lm-studio"
    """

    def __init__(
        self,
        model_name: str,
        base_url: str = "https://api.openai.com/v1",
        api_key: str | None = None,
    ):
        try:
            from openai import OpenAI
        except ImportError:
            raise ImportError(
                "openai package is not installed. "
                "Install it with: pip install openai"
            )

        resolved_key = api_key or os.environ.get("OPENAI_API_KEY", "none")
        self._client = OpenAI(base_url=base_url, api_key=resolved_key)
        self._model_name = model_name
        print(
            f"[RemoteLLMClient] Connected to {base_url} "
            f"using model '{model_name}'."
        )

    def get_backend(self) -> GPUBackend:
        return GPUBackend.REMOTE

    def infer(self, request: InferenceRequest) -> InferenceResponse:
        """
        Run remote inference via the OpenAI-compatible API.
        Handles both standard chat completions and tool-calling responses.
        """
        kwargs: dict[str, Any] = {
            "model":       self._model_name,
            "messages":    request.messages,
            "max_tokens":  request.max_tokens,
            "temperature": request.temperature,
            "top_p":       request.top_p,
        }
        if request.tools:
            kwargs["tools"] = request.tools

        response = self._client.chat.completions.create(**kwargs)
        choice = response.choices[0]
        message = choice.message

        # Serialize tool calls to plain dicts for a uniform response format.
        tool_calls = None
        if message.tool_calls:
            tool_calls = [
                {
                    "id":   tc.id,
                    "type": tc.type,
                    "function": {
                        "name":      tc.function.name,
                        "arguments": tc.function.arguments,
                    },
                }
                for tc in message.tool_calls
            ]

        return InferenceResponse(
            content=message.content or "",
            tool_calls=tool_calls,
            prompt_tokens=response.usage.prompt_tokens,
            completion_tokens=response.usage.completion_tokens,
            backend=GPUBackend.REMOTE,
        )


# ---------------------------------------------------------------------------
# Factory function: creates the appropriate client based on configuration.
# This is the single entry point for the rest of the application.
# ---------------------------------------------------------------------------

def create_llm_client(
    mode: str = "auto",
    model_path: str | None = None,
    model_name: str = "gpt-4o-mini",
    base_url: str = "https://api.openai.com/v1",
    api_key: str | None = None,
    n_ctx: int = 8192,
) -> BaseLLMClient:
    """
    Factory function that creates the appropriate LLM client.

    Parameters:
        mode:       "local"  - always use local inference
                    "remote" - always use remote API
                    "auto"   - use local if model_path is provided,
                               otherwise use remote
        model_path: Path to a local GGUF model file (for local mode).
        model_name: Model name for the remote API.
        base_url:   Base URL for the remote API endpoint.
        api_key:    API key for the remote endpoint.
        n_ctx:      Context window size for local inference.
    """
    if mode == "remote" or (mode == "auto" and model_path is None):
        return RemoteLLMClient(
            model_name=model_name,
            base_url=base_url,
            api_key=api_key,
        )

    # Local mode requires a model file; fail fast with a clear error
    # instead of passing None into LocalLLMClient.
    if model_path is None:
        raise ValueError(
            "mode='local' requires model_path to point to a GGUF model file."
        )

    # For local mode, detect the best available GPU backend.
    backend = BackendDetector.detect()
    # Offload all layers to GPU if any GPU is available.
    n_gpu_layers = -1 if backend != GPUBackend.CPU else 0

    return LocalLLMClient(
        model_path=model_path,
        n_ctx=n_ctx,
        n_gpu_layers=n_gpu_layers,
        backend=backend,
    )

The factory function at the bottom of this module is the key to making the rest of the application GPU-agnostic. The calling code never needs to know whether it is talking to a local CUDA-accelerated Llama 3 model or a remote GPT-4o. It simply calls create_llm_client() and then calls infer() on the returned object. The GPU backend detection happens automatically, and the appropriate inference path is selected without any manual configuration.

This design is particularly valuable in team environments where different developers have different hardware. A developer on a MacBook Pro with Apple Silicon will automatically get Metal acceleration. A developer on a Linux workstation with an NVIDIA GPU will automatically get CUDA. A developer without a dedicated GPU will fall back to CPU inference (which is slower but still functional for development and testing). And in production, the same code can be pointed at a remote API endpoint without any changes.

CHAPTER FIVE: MEASURING CONTEXT QUALITY

One of the most important and underappreciated aspects of Context Engineering is measurement. You cannot improve what you cannot measure, and context quality is notoriously difficult to measure because its effects are indirect. The context does not appear in the output directly. Its quality is only visible through the quality of the model's responses.

The field has converged on several key metrics for evaluating context quality, each measuring a different dimension of how well the context serves the model's needs.

Contextual Relevancy measures whether the information in the context is actually relevant to the query being answered. A context full of accurate but irrelevant information is just as harmful as a context full of inaccurate information, because it wastes the model's attention budget and can distract it from the information it actually needs. Relevancy is typically measured by embedding both the query and the retrieved documents into a vector space and computing their cosine similarity.

Faithfulness, also called Groundedness, measures whether the model's response is actually supported by the information in the context. A model that produces a response that contradicts or goes beyond the context is hallucinating, and this is a direct failure of context engineering. Faithfulness is typically measured using an LLM-as-a-judge approach, where a second LLM is asked to evaluate whether each claim in the response can be traced to a specific passage in the context.

Contextual Recall measures whether the context contains all the information needed to answer the query correctly. A context that is relevant but incomplete will lead to partial or incorrect answers. Recall is harder to measure than relevancy because it requires knowing in advance what information is needed, which is only possible when you have ground-truth answers for your test cases.
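When ground-truth statements exist for each test query, recall can be approximated directly. The sketch below uses token overlap as a cheap stand-in for the embedding or LLM-based support check a production evaluator would use; the `contextual_recall` helper and the 0.8 threshold are illustrative choices, not part of any particular framework.

```python
def _token_overlap(statement: str, document: str) -> float:
    """Fraction of the statement's tokens that also appear in the document."""
    stmt_tokens = set(statement.lower().split())
    doc_tokens = set(document.lower().split())
    if not stmt_tokens:
        return 0.0
    return len(stmt_tokens & doc_tokens) / len(stmt_tokens)


def contextual_recall(
    ground_truth_statements: list[str],
    context_documents: list[str],
    support_threshold: float = 0.8,
) -> float:
    """
    Fraction of ground-truth statements supported by at least one document
    in the assembled context. 1.0 means the context contains everything
    needed to answer the query correctly.
    """
    if not ground_truth_statements:
        return 1.0
    supported = sum(
        1
        for stmt in ground_truth_statements
        if any(_token_overlap(stmt, doc) >= support_threshold
               for doc in context_documents)
    )
    return supported / len(ground_truth_statements)


truths = ["The warranty period is 24 months", "Returns require an order ID"]
docs = ["The warranty period for this product is 24 months from purchase."]
print(contextual_recall(truths, docs))  # 0.5: one of two statements covered
```

The same skeleton upgrades cleanly: swap `_token_overlap` for an embedding similarity or an LLM-as-a-judge support check without touching the recall arithmetic.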

The "Needle in a Haystack" test is a particularly important evaluation for long contexts. It works by placing a specific piece of information (the "needle") at a specific position within a long context (the "haystack") and then asking the model a question that can only be answered by finding the needle. By varying the position of the needle (beginning, middle, or end of the context), you can measure whether the model is suffering from the "lost in the middle" effect, where information placed in the middle of a long context is effectively ignored.
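The mechanics of the test are simple enough to sketch. The harness below assumes an `llm_answer(prompt) -> str` wrapper around whichever client you use; the needle text and filler sentence are placeholders, and a real run would also vary haystack length and repeat each position several times.

```python
from typing import Callable

NEEDLE = "The secret launch code is AURORA-7."
FILLER = "The quarterly report showed steady growth across all regions."


def build_haystack(needle_position: float, total_sentences: int = 200) -> str:
    """
    Build a long context with the needle placed at a relative depth
    (0.0 = start, 0.5 = middle, 1.0 = end).
    """
    sentences = [FILLER] * total_sentences
    index = min(int(needle_position * total_sentences), total_sentences - 1)
    sentences.insert(index, NEEDLE)
    return " ".join(sentences)


def run_needle_test(
    llm_answer: Callable[[str], str],
    positions: tuple[float, ...] = (0.0, 0.25, 0.5, 0.75, 1.0),
) -> dict[float, bool]:
    """Ask for the needle at each depth and record whether it was found."""
    results = {}
    for pos in positions:
        prompt = (
            f"{build_haystack(pos)}\n\n"
            "Based only on the text above, what is the secret launch code?"
        )
        results[pos] = "AURORA-7" in llm_answer(prompt)
    return results
```

A position map where the middle depths fail while the edges succeed is the classic signature of the "lost in the middle" effect.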

The following code implements a lightweight context quality evaluator that measures relevancy and faithfulness using embedding similarity and an LLM-as-a-judge approach:

"""
context_quality_evaluator.py

A lightweight context quality evaluation framework that measures:
  1. Contextual Relevancy: Are the retrieved documents relevant to the query?
  2. Faithfulness:         Is the model's response grounded in the context?
  3. Context Utilization:  What fraction of the context budget was used?

For production use, consider integrating with dedicated evaluation
frameworks like RAGAS, TruLens, or DeepEval, which provide more
sophisticated metrics and evaluation pipelines.
"""

from __future__ import annotations

import math
from dataclasses import dataclass
from typing import Any


# ---------------------------------------------------------------------------
# Simple cosine similarity for embedding-based relevancy measurement.
# In production, use numpy or scipy for better performance.
# ---------------------------------------------------------------------------

def cosine_similarity(vec_a: list[float], vec_b: list[float]) -> float:
    """
    Compute the cosine similarity between two embedding vectors.
    Returns a value between -1.0 (opposite) and 1.0 (identical).
    Values above 0.7 are generally considered semantically similar.
    """
    if len(vec_a) != len(vec_b):
        raise ValueError("Embedding vectors must have the same dimension.")

    dot_product = sum(a * b for a, b in zip(vec_a, vec_b))
    magnitude_a = math.sqrt(sum(a * a for a in vec_a))
    magnitude_b = math.sqrt(sum(b * b for b in vec_b))

    if magnitude_a == 0.0 or magnitude_b == 0.0:
        return 0.0

    return dot_product / (magnitude_a * magnitude_b)


@dataclass
class ContextQualityReport:
    """
    A structured report of context quality metrics for a single inference.
    All scores are normalized to the range [0.0, 1.0] where 1.0 is best.
    """
    query: str
    avg_relevancy_score: float    # Average similarity of docs to query
    min_relevancy_score: float    # Worst-performing document
    faithfulness_score: float     # Fraction of claims grounded in context
    context_utilization: float    # Fraction of token budget used
    num_documents: int            # Number of documents in context
    hallucination_risk: str       # "LOW", "MEDIUM", or "HIGH"

    def summary(self) -> str:
        """Return a human-readable summary of the quality report."""
        return (
            f"Context Quality Report\n"
            f"  Query:               {self.query[:60]}...\n"
            f"  Avg Relevancy:       {self.avg_relevancy_score:.2f}\n"
            f"  Min Relevancy:       {self.min_relevancy_score:.2f}\n"
            f"  Faithfulness:        {self.faithfulness_score:.2f}\n"
            f"  Context Utilization: {self.context_utilization:.1%}\n"
            f"  Documents:           {self.num_documents}\n"
            f"  Hallucination Risk:  {self.hallucination_risk}\n"
        )


class ContextQualityEvaluator:
    """
    Evaluates the quality of an assembled context window.

    This evaluator uses two complementary approaches:
      1. Embedding-based relevancy scoring (fast, no LLM call needed)
      2. LLM-as-a-judge faithfulness scoring (slower, more accurate)

    The evaluator is designed to run asynchronously in production,
    logging quality metrics without blocking the main inference path.
    """

    # Relevancy threshold below which a document is considered noise.
    RELEVANCY_THRESHOLD = 0.65

    # Faithfulness threshold below which hallucination risk is HIGH.
    FAITHFULNESS_THRESHOLD = 0.80

    def __init__(self, llm_client: Any, embedding_function: Any):
        """
        Parameters:
            llm_client:         An instance of BaseLLMClient for
                                LLM-as-a-judge faithfulness evaluation.
            embedding_function: A callable that takes a string and returns
                                a list of floats (the embedding vector).
                                Can be a local or remote embedding model.
        """
        self._llm = llm_client
        self._embed = embedding_function

    def evaluate_relevancy(
        self,
        query: str,
        documents: list[str],
    ) -> tuple[float, float]:
        """
        Evaluate the relevancy of retrieved documents to the query.

        Embeds both the query and each document, then computes cosine
        similarity. Returns (average_score, minimum_score).

        A low minimum score indicates that at least one irrelevant document
        has been injected into the context, which can degrade performance.
        """
        if not documents:
            return 0.0, 0.0

        query_embedding = self._embed(query)
        scores = [
            cosine_similarity(query_embedding, self._embed(doc))
            for doc in documents
        ]

        return sum(scores) / len(scores), min(scores)

    def evaluate_faithfulness(
        self,
        query: str,
        context: str,
        response: str,
    ) -> float:
        """
        Evaluate whether the model's response is grounded in the context.

        Uses the LLM-as-a-judge pattern: a second LLM call evaluates
        whether each claim in the response can be traced to the context.
        Returns a score between 0.0 (fully hallucinated) and 1.0 (fully
        grounded).

        This is the most expensive metric to compute but also the most
        important for detecting hallucinations in production.
        """
        from llm_client import InferenceRequest  # Local import to avoid circular deps.

        judge_prompt = (
            "You are a strict factual accuracy evaluator. "
            "Your task is to determine what fraction of the claims in "
            "the RESPONSE are directly supported by the CONTEXT.\n\n"
            f"CONTEXT:\n{context}\n\n"
            f"RESPONSE:\n{response}\n\n"
            "Instructions:\n"
            "1. List each factual claim in the RESPONSE.\n"
            "2. For each claim, state whether it is SUPPORTED or "
            "UNSUPPORTED by the CONTEXT.\n"
            "3. On the final line, write only: SCORE: X.XX\n"
            "   where X.XX is the fraction of claims that are SUPPORTED "
            "(e.g., SCORE: 0.75 means 3 out of 4 claims are supported).\n"
            "Be strict: if a claim cannot be directly traced to the "
            "CONTEXT, it is UNSUPPORTED."
        )

        judge_request = InferenceRequest(
            messages=[
                {"role": "system", "content": "You are a factual accuracy evaluator."},
                {"role": "user",   "content": judge_prompt},
            ],
            max_tokens=512,
            temperature=0.0,  # Zero temperature for deterministic evaluation.
        )

        judge_response = self._llm.infer(judge_request)

        # Parse the SCORE line from the judge's response.
        for line in reversed(judge_response.content.splitlines()):
            if line.strip().startswith("SCORE:"):
                try:
                    return float(line.split(":")[1].strip())
                except (ValueError, IndexError):
                    pass

        # If parsing fails, return a conservative middle score.
        return 0.5

    def evaluate(
        self,
        query: str,
        documents: list[str],
        context: str,
        response: str,
        tokens_used: int,
        token_budget: int,
    ) -> ContextQualityReport:
        """
        Run the full evaluation suite and return a ContextQualityReport.
        """
        avg_rel, min_rel = self.evaluate_relevancy(query, documents)
        faithfulness    = self.evaluate_faithfulness(query, context, response)
        utilization     = tokens_used / token_budget if token_budget > 0 else 0.0

        # Determine hallucination risk. Faithfulness is the primary signal:
        # a poorly grounded response is high-risk even when retrieval looked
        # relevant, so relevancy only upgrades an already-faithful response.
        if faithfulness >= self.FAITHFULNESS_THRESHOLD and min_rel >= self.RELEVANCY_THRESHOLD:
            risk = "LOW"
        elif faithfulness >= 0.60:
            risk = "MEDIUM"
        else:
            risk = "HIGH"

        return ContextQualityReport(
            query=query,
            avg_relevancy_score=avg_rel,
            min_relevancy_score=min_rel,
            faithfulness_score=faithfulness,
            context_utilization=utilization,
            num_documents=len(documents),
            hallucination_risk=risk,
        )

The evaluator above implements the LLM-as-a-judge pattern, which has become the gold standard for faithfulness evaluation in production RAG systems. The key insight is that evaluating whether a response is grounded in a context is itself a language understanding task, and LLMs are very good at language understanding tasks. By using a second LLM call to evaluate the first, you get a nuanced, human-like assessment of faithfulness that simple string-matching or keyword-overlap metrics cannot provide.

The zero temperature setting on the judge LLM is important. You want the evaluation to be deterministic and consistent, not creative. Setting temperature to 0.0 ensures that the same query, context, and response always produce the same faithfulness score, which is essential for tracking quality trends over time.

CHAPTER SIX: THE FOUR CORE STRATEGIES OF CONTEXT ENGINEERING

Context Engineering practitioners have converged on four fundamental strategies for managing the information that flows into the context window. These strategies are sometimes called WRITE, SELECT, COMPRESS, and ISOLATE, and together they form a complete toolkit for handling any context management challenge.

The WRITE strategy involves persisting information outside the context window in external memory stores. When an agent completes a task, it writes a summary of what it learned to a database, a file, or a vector store. The next time a relevant query arrives, that information can be retrieved and injected back into the context. This is how you give a stateless LLM the appearance of persistent memory. The WRITE strategy is the foundation of all long-term memory systems for LLMs.
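A minimal sketch of the WRITE strategy, assuming a JSON file as the external store, might look like the following. The `MemoryStore` class, its file path, and its topic-keyed schema are illustrative; production systems typically use a database or vector store.

```python
import json
from pathlib import Path


class MemoryStore:
    """Persists agent learnings outside the context window (WRITE strategy)."""

    def __init__(self, path: str = "agent_memory.json"):
        self._path = Path(path)

    def write(self, topic: str, summary: str) -> None:
        """Persist a learned summary under a topic key."""
        memories = self._load()
        memories.setdefault(topic, []).append(summary)
        self._path.write_text(json.dumps(memories, indent=2))

    def recall(self, topic: str) -> list[str]:
        """Retrieve stored summaries for a topic, ready for context injection."""
        return self._load().get(topic, [])

    def _load(self) -> dict[str, list[str]]:
        if self._path.exists():
            return json.loads(self._path.read_text())
        return {}


store = MemoryStore()
store.write("billing", "Customer CUST-10042 prefers email invoices.")
print(store.recall("billing")[-1])  # prints the just-written summary
```

On the next session, `recall("billing")` returns the stored summaries, which can be injected into the system prompt to give the stateless model the appearance of memory.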

The SELECT strategy involves choosing only the most relevant information from available sources and injecting it into the context. This is the domain of RAG systems, where a retrieval pipeline selects the top-k most relevant document chunks from a knowledge base. But SELECT is broader than RAG. It also applies to tool selection (which of the available tools should be included in the context for this particular query?), memory selection (which past memories are relevant to the current task?), and history selection (which conversation turns are most important to retain?).
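In its simplest form, SELECT is a ranking problem. The sketch below uses token overlap as a stand-in for the embedding similarity a real retrieval pipeline would compute; the same top-k pattern applies unchanged to selecting tools, memories, or history turns.

```python
def select_top_k(query: str, candidates: list[str], k: int = 3) -> list[str]:
    """Return the k candidates that share the most tokens with the query."""
    query_tokens = set(query.lower().split())

    def score(text: str) -> int:
        # Stand-in scorer: replace with cosine similarity over embeddings
        # in a production retrieval pipeline.
        return len(query_tokens & set(text.lower().split()))

    return sorted(candidates, key=score, reverse=True)[:k]


snippets = [
    "Warranty coverage lasts 24 months from the purchase date.",
    "Our office hours are 9am to 5pm on weekdays.",
    "Warranty claims require the original order ID.",
]
print(select_top_k("How do I file a warranty claim?", snippets, k=2))
```

The point of the pattern is what it leaves out: the office-hours snippet never reaches the context window, so it cannot waste the model's attention budget.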

The COMPRESS strategy involves reducing the size of information before injecting it into the context. Summarization is the most common compression technique: instead of including the full text of a long document, you include a concise summary that captures the essential information. Conversation history compression is particularly important in long-running dialogues, where the full history would quickly overflow the context window. A common pattern is to summarize the oldest portion of the history into a compact "session summary" and keep only the most recent turns verbatim.

The ISOLATE strategy involves separating different types of context into independent sub-contexts, typically by using sub-agents. Instead of one agent managing a massive context window that contains everything, you decompose the task into subtasks and assign each subtask to a specialized agent with a clean, focused context window. The main orchestrating agent then synthesizes the outputs of the sub-agents. This approach dramatically reduces the risk of context overflow and the "lost in the middle" effect, because each sub-agent only needs to attend to a small, highly relevant context.
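The ISOLATE strategy can be sketched as a fan-out/fan-in pattern. The `run_llm` callable below is assumed to be your own inference wrapper, and the role-to-context decomposition is purely illustrative; the essential point is that each sub-agent's message list contains only its own focused context.

```python
from typing import Callable

# A "run_llm" function takes a messages list and returns the model's text.
RunLLM = Callable[[list[dict[str, str]]], str]


def run_subagent(
    run_llm: RunLLM,
    role: str,
    subtask: str,
    focused_context: str,
) -> str:
    """Run one sub-agent with a clean context containing only its subtask."""
    messages = [
        {"role": "system", "content": f"You are a {role}. {focused_context}"},
        {"role": "user", "content": subtask},
    ]
    return run_llm(messages)


def orchestrate(run_llm: RunLLM, question: str, contexts: dict[str, str]) -> str:
    """Fan the question out to specialist sub-agents, then synthesize."""
    findings = [
        f"[{role}] {run_subagent(run_llm, role, question, ctx)}"
        for role, ctx in contexts.items()
    ]
    # The orchestrator's context contains only the sub-agents' outputs,
    # never their raw inputs, which is what keeps it small.
    synthesis_messages = [
        {"role": "system", "content": "Synthesize the findings into one answer."},
        {"role": "user", "content": f"Question: {question}\n" + "\n".join(findings)},
    ]
    return run_llm(synthesis_messages)
```

If each of three sub-agents reads a 50,000-token document, the orchestrator still sees only three short findings, not 150,000 tokens of raw material.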

The following code demonstrates the COMPRESS strategy with a rolling summary approach for conversation history management:

"""
conversation_memory.py

Implements a rolling conversation memory manager using the COMPRESS strategy.
As the conversation grows, older turns are summarized and stored as a compact
"session summary" to free up context window space for recent turns.

This is one of the most practically important Context Engineering patterns
for production chatbots and agentic systems.
"""

from __future__ import annotations

import time
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ConversationTurn:
    """A single turn in the conversation."""
    role: str       # "user" or "assistant"
    content: str
    timestamp: float = field(default_factory=time.time)

    def to_message(self) -> dict[str, str]:
        """Convert to the standard messages format for LLM APIs."""
        return {"role": self.role, "content": self.content}


class RollingConversationMemory:
    """
    Manages conversation history with automatic compression.

    When the conversation exceeds max_recent_turns, the oldest turns
    are summarized by the LLM and stored as a compact session summary.
    This keeps the context window usage bounded regardless of how long
    the conversation runs.

    The compression is triggered automatically inside add_turn(), so the
    caller never needs to think about memory management explicitly.
    This is an important clean architecture principle: memory management
    should be an implementation detail, not a caller responsibility.

    Parameters:
        llm_client:       The LLM client to use for generating summaries.
        max_recent_turns: Maximum number of recent turns to keep verbatim.
        turns_to_compress: Number of old turns to compress when triggered.
    """

    def __init__(
        self,
        llm_client: Any,
        max_recent_turns: int = 10,
        turns_to_compress: int = 6,
    ):
        self._llm = llm_client
        self._max_recent = max_recent_turns
        self._turns_to_compress = turns_to_compress
        self._turns: list[ConversationTurn] = []
        self._session_summary: str = ""

    def add_turn(self, role: str, content: str) -> None:
        """
        Add a new turn to the conversation history.
        Compression is triggered automatically if needed.
        """
        self._turns.append(ConversationTurn(role=role, content=content))
        if len(self._turns) > self._max_recent:
            self._compress_oldest_turns()

    def _compress_oldest_turns(self) -> None:
        """
        Summarize the oldest turns and merge the summary into the
        session summary. The compressed turns are then removed from
        the verbatim history.

        This method is called automatically and should not be called
        directly by application code.
        """
        from llm_client import InferenceRequest  # Avoid circular imports.

        # Take the oldest turns for compression.
        turns_to_compress = self._turns[:self._turns_to_compress]
        self._turns = self._turns[self._turns_to_compress:]

        # Build a transcript of the turns to be compressed.
        transcript = "\n".join(
            f"{turn.role.upper()}: {turn.content}"
            for turn in turns_to_compress
        )

        # Build the compression prompt. Note how the existing summary
        # is included so the new summary can build on it cumulatively.
        compression_prompt = (
            "You are a conversation summarizer. Create a concise but "
            "complete summary of the following conversation excerpt. "
            "If there is an existing summary, integrate the new information "
            "into it rather than replacing it.\n\n"
        )
        if self._session_summary:
            compression_prompt += (
                f"EXISTING SUMMARY:\n{self._session_summary}\n\n"
                f"NEW CONVERSATION TO INTEGRATE:\n{transcript}\n\n"
                "Write an updated summary that incorporates both the existing "
                "summary and the new conversation. Be concise but preserve "
                "all important facts, decisions, and context."
            )
        else:
            compression_prompt += (
                f"CONVERSATION TO SUMMARIZE:\n{transcript}\n\n"
                "Write a concise summary that captures all important facts, "
                "decisions, and context from this conversation."
            )

        summary_request = InferenceRequest(
            messages=[
                {
                    "role":    "system",
                    "content": "You are a precise conversation summarizer.",
                },
                {
                    "role":    "user",
                    "content": compression_prompt,
                },
            ],
            max_tokens=400,
            temperature=0.3,  # Low temperature for consistent summaries.
        )

        response = self._llm.infer(summary_request)
        self._session_summary = response.content
        print(
            f"[Memory] Compressed {len(turns_to_compress)} turns. "
            f"Summary length: {len(self._session_summary)} chars."
        )

    def get_context_messages(self) -> list[dict[str, str]]:
        """
        Return the conversation history in a format ready for injection
        into the context window.

        If a session summary exists, it is prepended as a system message
        so the model has access to the compressed history before reading
        the recent verbatim turns.
        """
        messages: list[dict[str, str]] = []

        if self._session_summary:
            messages.append({
                "role":    "system",
                "content": (
                    "[CONVERSATION SUMMARY - Earlier parts of this conversation]\n"
                    + self._session_summary
                ),
            })

        messages.extend(turn.to_message() for turn in self._turns)
        return messages

    @property
    def total_turns(self) -> int:
        """Total number of turns in the conversation (including compressed)."""
        return len(self._turns)

    @property
    def has_summary(self) -> bool:
        """Whether any turns have been compressed into a summary."""
        return bool(self._session_summary)

The rolling summary approach shown above is elegant in its simplicity but powerful in its effect. A conversation that would otherwise overflow an 8,192-token context window after perhaps 20 exchanges can now run indefinitely, because the oldest information is continuously folded into a compact, bounded summary. The model always has access to the full history of the conversation, just at different levels of detail: recent turns are verbatim, older turns are summarized.

CHAPTER SEVEN: THE MODEL CONTEXT PROTOCOL (MCP) AND TOOL INTEGRATION

No discussion of Context Engineering would be complete without addressing the Model Context Protocol, or MCP. Introduced by Anthropic in late 2024 and rapidly adopted across the industry, MCP is a standardized protocol for connecting LLMs to external tools and data sources. It solves a problem that had been plaguing the field for years: every LLM framework had its own way of defining and calling tools, making it impossible to reuse tool implementations across different models and frameworks.

MCP defines a universal schema for tool definitions, a standardized way for LLMs to call tools, and a protocol for tool servers to expose their capabilities to any MCP-compatible client. Think of it as USB-C for LLM tools: a single standard connector that works with any device.

From a Context Engineering perspective, MCP is important because tool definitions consume context window space. Every tool you make available to the model requires a JSON schema description that explains what the tool does, what parameters it accepts, and what it returns. If you give the model access to 50 tools, those definitions might consume 5,000 or more tokens of context space. Context Engineering requires you to be selective: only include the tools that are relevant to the current task.
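As a rough illustration of that overhead, the token cost of a set of tool schemas can be estimated with the common heuristic of roughly four characters per token. The function below is a hedged sketch, not a real tokenizer; in production you would count tokens with the model's actual tokenizer.

```python
# Rough estimate of the context window cost of tool definitions, using
# the ~4 characters per token rule of thumb. For real budgeting, use
# the model's own tokenizer instead of this heuristic.

import json


def estimate_tool_tokens(tool_schemas: list[dict]) -> int:
    """Approximate token cost of a list of JSON tool schemas."""
    serialized = json.dumps(tool_schemas)
    return len(serialized) // 4
```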

The following code demonstrates a simple MCP-style tool registry that dynamically selects the most relevant tools for a given query, implementing the SELECT strategy for tool context management:

"""
tool_registry.py

A dynamic tool registry that implements the SELECT strategy for tool
context management. Instead of always including all available tools in
the context window, this registry selects only the tools most relevant
to the current query.

This is critical for large tool libraries where including all tool
definitions would consume too much of the context window budget.

The tool selection uses embedding-based similarity, the same technique
used for RAG document retrieval. This makes tool selection a special
case of the broader context selection problem.
"""

from __future__ import annotations

import json
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Tool:
    """
    Represents a single tool available to the LLM.

    The description field is critical: it must be clear, specific, and
    unambiguous. The LLM decides which tool to call based entirely on
    the description. A vague description leads to wrong tool selection.
    A description that overlaps with another tool leads to confusion.
    """
    name: str
    description: str
    parameters: dict[str, Any]
    handler: Callable[..., Any]
    # Tags help with embedding-free filtering when embeddings are unavailable.
    tags: list[str] = field(default_factory=list)

    def to_schema(self) -> dict[str, Any]:
        """
        Serialize the tool to the OpenAI function-calling schema format.
        This schema is what gets injected into the context window.
        """
        return {
            "type": "function",
            "function": {
                "name":        self.name,
                "description": self.description,
                "parameters":  self.parameters,
            },
        }

    def execute(self, **kwargs: Any) -> Any:
        """
        Execute the tool with the given parameters.
        Wraps the handler in a try/except to prevent tool errors from
        crashing the agent loop. Returns an error message instead,
        which the model can use to decide how to proceed.
        """
        try:
            return self.handler(**kwargs)
        except Exception as exc:
            return {"error": str(exc), "tool": self.name}


class ToolRegistry:
    """
    A registry of available tools with dynamic selection capabilities.

    Tools are registered with descriptions and tags. At query time,
    the registry selects the most relevant subset of tools to include
    in the context window, keeping tool definition overhead bounded.

    In production, the selection would use embedding similarity.
    Here, we use tag-based filtering as a simpler demonstration that
    still captures the essential SELECT strategy.
    """

    def __init__(self, max_tools_per_query: int = 8):
        """
        Parameters:
            max_tools_per_query: Maximum number of tool definitions to
                                 include in any single context window.
                                 8 is a reasonable default that balances
                                 capability with context efficiency.
        """
        self._tools: dict[str, Tool] = {}
        self._max_tools = max_tools_per_query

    def register(self, tool: Tool) -> None:
        """Register a tool in the registry."""
        self._tools[tool.name] = tool
        print(f"[ToolRegistry] Registered tool: '{tool.name}'")

    def select_tools(
        self,
        query: str,
        required_tags: list[str] | None = None,
    ) -> list[Tool]:
        """
        Select the most relevant tools for the given query.

        Selection priority:
          1. Tools whose tags match required_tags (always included).
          2. Tools whose description contains keywords from the query.
          3. Fill remaining slots with general-purpose tools.

        Returns at most max_tools_per_query tools.
        """
        query_lower = query.lower()
        query_keywords = set(query_lower.split())

        scored_tools: list[tuple[float, Tool]] = []
        for tool in self._tools.values():
            score = 0.0

            # Tag match: highest priority.
            if required_tags:
                matching_tags = sum(
                    1 for tag in tool.tags if tag in required_tags
                )
                score += matching_tags * 10.0

            # Keyword match in description: medium priority.
            desc_lower = tool.description.lower()
            matching_keywords = sum(
                1 for kw in query_keywords if kw in desc_lower
            )
            score += matching_keywords * 1.0

            scored_tools.append((score, tool))

        # Sort by score descending, then alphabetically for determinism.
        scored_tools.sort(key=lambda x: (-x[0], x[1].name))

        selected = [tool for _, tool in scored_tools[:self._max_tools]]
        print(
            f"[ToolRegistry] Selected {len(selected)} tools "
            f"for query: '{query[:50]}...'"
        )
        return selected

    def get_schemas(
        self,
        query: str,
        required_tags: list[str] | None = None,
    ) -> list[dict[str, Any]]:
        """
        Get the JSON schemas for the selected tools, ready for injection
        into the context window as tool definitions.
        """
        selected = self.select_tools(query, required_tags)
        return [tool.to_schema() for tool in selected]

    def execute_tool(
        self,
        tool_name: str,
        arguments: str | dict[str, Any],
    ) -> Any:
        """
        Execute a tool by name with the given arguments.

        Arguments can be provided as a JSON string (as returned by the
        LLM's tool_calls response) or as a pre-parsed dictionary.
        """
        if tool_name not in self._tools:
            return {"error": f"Unknown tool: '{tool_name}'"}

        if isinstance(arguments, str):
            try:
                parsed_args = json.loads(arguments)
            except json.JSONDecodeError as exc:
                return {"error": f"Invalid JSON arguments: {exc}"}
        else:
            parsed_args = arguments

        return self._tools[tool_name].execute(**parsed_args)


# ---------------------------------------------------------------------------
# Example tool definitions demonstrating clean, unambiguous descriptions.
# Each tool has a single, clear purpose and well-defined parameters.
# ---------------------------------------------------------------------------

def _create_example_registry() -> ToolRegistry:
    """
    Create a sample tool registry with a few example tools.
    This demonstrates the pattern for registering tools with the registry.
    """
    registry = ToolRegistry(max_tools_per_query=5)

    # Tool 1: Web search
    registry.register(Tool(
        name="search_web",
        description=(
            "Search the internet for current information about a topic. "
            "Use this when you need up-to-date facts, news, or information "
            "that may not be in your training data. Returns a list of "
            "relevant search results with titles, URLs, and snippets."
        ),
        parameters={
            "type": "object",
            "properties": {
                "query": {
                    "type":        "string",
                    "description": "The search query to submit to the web.",
                },
                "max_results": {
                    "type":        "integer",
                    "description": "Maximum number of results to return (1-10).",
                    "default":     5,
                },
            },
            "required": ["query"],
        },
        handler=lambda query, max_results=5: {
            "results": [f"[Stub] Search result for: {query}"]
        },
        tags=["search", "web", "information"],
    ))

    # Tool 2: Calculator
    registry.register(Tool(
        name="calculate",
        description=(
            "Evaluate a mathematical expression and return the numeric result. "
            "Use this for any arithmetic, algebraic, or mathematical computation. "
            "Do NOT use this for text processing or non-numeric operations."
        ),
        parameters={
            "type": "object",
            "properties": {
                "expression": {
                    "type":        "string",
                    "description": "A valid Python mathematical expression, e.g. '2 ** 10 + 42'.",
                },
            },
            "required": ["expression"],
        },
        handler=lambda expression: {
            "result": eval(  # noqa: S307 - safe for demo; use sympy in production
                expression,
                {"__builtins__": {}},
                {},
            )
        },
        tags=["math", "calculate", "arithmetic"],
    ))

    return registry

The tool registry pattern shown here implements a crucial insight: tool selection is a form of context selection. The same principles that govern RAG document retrieval, namely relevancy scoring, budget constraints, and the primacy of the most relevant information, apply equally to tool selection. By treating tools as first-class context citizens, you can build systems that scale to hundreds of available tools without overwhelming the context window.

CHAPTER EIGHT: PITFALLS AND HOW TO AVOID THEM

Context Engineering is powerful, but it introduces failure modes that are qualitatively different from the failure modes of Prompt Engineering. A bad prompt produces a bad response. A bad context can produce a confidently wrong response, which is far more dangerous because it is harder to detect.

Context Poisoning is the most insidious pitfall. It occurs when incorrect, outdated, or malicious information enters the context window and is treated by the model as ground truth. The model has no way to distinguish between information that was carefully curated by a trusted engineer and information that was injected by an attacker via a prompt injection attack or retrieved from a corrupted knowledge base. Once poisoned information is in the context, the model will reason from it faithfully, producing responses that are internally coherent but factually wrong.

The mitigation for context poisoning is provenance tracking and validation. Every piece of information injected into the context should have a traceable source, and that source should be validated before injection. Documents retrieved from a knowledge base should be checked against their last-verified timestamp. Information retrieved from external APIs should be validated against expected schemas. User-provided information should be treated with appropriate skepticism, especially in agentic systems where the user's input could contain adversarial instructions.
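One minimal way to encode this discipline is to make provenance a mandatory property of every injectable chunk. The types and names below are illustrative assumptions, not from any particular framework; the point is that a chunk with no source, or a stale verification timestamp, never reaches the context.

```python
# Minimal sketch of provenance-gated injection: a chunk carries its
# source and last-verified timestamp, and is only injectable if both
# pass a trust check. All names here are illustrative.

import time
from dataclasses import dataclass


@dataclass
class ProvenancedChunk:
    """A context chunk that carries its provenance with it."""
    content: str
    source: str            # e.g. a document URI or API endpoint
    last_verified: float   # unix timestamp of the last trust check


def is_injectable(chunk: ProvenancedChunk, max_age_days: int = 30) -> bool:
    """Only chunks with a known, recently verified source may be injected."""
    if not chunk.source:
        return False
    age_seconds = time.time() - chunk.last_verified
    return age_seconds <= max_age_days * 86400
```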

The "Lost in the Middle" effect is a well-documented phenomenon in LLM research. Studies have shown that LLMs exhibit a U-shaped attention pattern over long contexts: they pay the most attention to information at the very beginning and very end of the context, and significantly less attention to information in the middle. This means that if you place your most important retrieved document in the middle of a long context, the model may effectively ignore it, even though it is technically present in the context window.

The mitigation is strategic ordering. Always place the most critical information at the beginning or end of the context. Use the system prompt (which is always at the beginning) for the most important instructions and constraints. Place the current user query (which is always at the end) for maximum attention. If you must include multiple retrieved documents, place the most relevant one first and the second most relevant one last, with less relevant documents in the middle.
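This "sandwich" ordering can be expressed as a small reordering step over documents already ranked by relevance; a minimal sketch:

```python
# Minimal sketch of strategic ordering against the "lost in the middle"
# effect: given documents ranked most-relevant-first, place rank 1 at
# the start, rank 2 at the end, and everything else in the middle.

def order_for_attention(docs_by_relevance: list[str]) -> list[str]:
    """Reorder ranked documents so the best material sits at the edges."""
    if len(docs_by_relevance) <= 2:
        return list(docs_by_relevance)
    first, second, *rest = docs_by_relevance
    return [first, *rest, second]
```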

Context Overflow occurs when the assembled context exceeds the model's maximum context window size. In a simple system, this would cause an error. In a more sophisticated system, the overflow would be silently truncated, causing the model to lose access to information it needs. Context overflow is particularly dangerous in agentic systems where the context grows with each tool call and each step of reasoning. Without careful budget management, a long-running agent can exhaust its context window mid-task and begin producing incoherent responses.
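A simple defense is to enforce the budget explicitly at assembly time, dropping whole low-priority chunks rather than letting the runtime truncate silently. The sketch below uses the rough four-characters-per-token heuristic; a real system would use the model's tokenizer.

```python
# Minimal sketch of explicit budget enforcement against context
# overflow: keep the highest-priority chunks that fit, and drop whole
# chunks instead of truncating mid-text. The ~4 chars/token estimate
# is a heuristic, not a real tokenizer.

def fit_to_budget(chunks_by_priority: list[str], budget_tokens: int) -> list[str]:
    """Greedily keep highest-priority chunks within the token budget."""
    kept: list[str] = []
    used = 0
    for chunk in chunks_by_priority:
        cost = max(1, len(chunk) // 4)
        if used + cost > budget_tokens:
            continue  # drop the whole chunk rather than truncate it
        kept.append(chunk)
        used += cost
    return kept
```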

Context Rot is a subtler problem that develops over time. It occurs when the information in a knowledge base becomes stale, but the retrieval system continues to inject it into the context as if it were current. A product specification that was accurate six months ago but has since been updated, a policy document that has been superseded, and a customer record that reflects an old address are all examples of context rot. The mitigation is aggressive metadata management: every document in the knowledge base should have a last-verified timestamp, and retrieval systems should penalize or exclude documents that have not been verified recently.
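One concrete way to apply such a penalty is to fold document age into the retrieval score itself. The exponential half-life decay below is an illustrative choice, not a standard formula; the half-life parameter would be tuned per knowledge base.

```python
# Minimal sketch of freshness-weighted retrieval scoring against
# context rot: a document's score halves every half_life_days since it
# was last verified, so stale documents sink in the ranking even when
# they are topically relevant. The decay form is an assumption.

import time


def freshness_weighted_score(
    relevance: float,
    last_verified: float,
    half_life_days: float = 90.0,
) -> float:
    """Combine retrieval relevance with an exponential freshness decay."""
    age_days = (time.time() - last_verified) / 86400
    return relevance * 0.5 ** (age_days / half_life_days)
```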

Semantic Noise is the problem of retrieving documents that are close to the query in embedding space but contextually irrelevant. Vector similarity is a powerful retrieval mechanism, but it is not perfect. A query about "Python memory management" might retrieve documents about "Python snakes in wildlife management" if the knowledge base contains such documents and the embedding model has not learned to distinguish these contexts. The mitigation is hybrid search, which combines vector similarity with keyword-based BM25 search, and reranking, which uses a more sophisticated cross-encoder model to re-score the top-k retrieved documents before injecting them into the context.
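The blending step of hybrid search can be as simple as a weighted sum of the normalized dense and sparse scores. The sketch below assumes both scores are already normalized to [0, 1]; the alpha weight is an assumption to be tuned per corpus.

```python
# Minimal sketch of hybrid search score fusion: blend a dense (vector)
# score with a sparse (keyword/BM25-style) score. The keyword component
# rescues queries where the embedding model conflates word senses.

def hybrid_score(vector_sim: float, keyword_score: float,
                 alpha: float = 0.7) -> float:
    """Weighted blend of dense and sparse scores, both in [0, 1]."""
    return alpha * vector_sim + (1 - alpha) * keyword_score
```

In the snake example above, the wildlife document may score high on vector similarity, but its keyword overlap with "memory management" is near zero, so the blended score drops it below the genuinely relevant documents.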

The following code demonstrates a simple context validation layer that checks for common context quality issues before injection:

"""
context_validator.py

A pre-injection context validation layer that checks for common context
quality issues before the assembled context is sent to the LLM.

Validation catches problems early (before inference) rather than late
(after a bad response has been generated and potentially acted upon).
This is the "fail fast" principle applied to Context Engineering.
"""

from __future__ import annotations

import time
from dataclasses import dataclass
from enum import Enum, auto
from typing import Any


class ValidationSeverity(Enum):
    """Severity levels for validation issues."""
    INFO    = auto()   # Informational, no action required.
    WARNING = auto()   # Potential issue, monitor closely.
    ERROR   = auto()   # Serious issue, consider blocking inference.


@dataclass
class ValidationIssue:
    """A single validation issue found during context inspection."""
    severity: ValidationSeverity
    issue_type: str
    description: str
    affected_layer: str


@dataclass
class ValidationResult:
    """The result of validating an assembled context."""
    is_valid: bool
    issues: list[ValidationIssue]
    token_count: int
    token_budget: int

    def has_errors(self) -> bool:
        """Return True if any ERROR-severity issues were found."""
        return any(
            issue.severity == ValidationSeverity.ERROR
            for issue in self.issues
        )

    def summary(self) -> str:
        """Return a human-readable validation summary."""
        lines = [
            f"Context Validation: {'PASS' if self.is_valid else 'FAIL'}",
            f"  Token usage: {self.token_count}/{self.token_budget} "
            f"({self.token_count/self.token_budget:.1%})",
            f"  Issues found: {len(self.issues)}",
        ]
        for issue in self.issues:
            lines.append(
                f"  [{issue.severity.name}] {issue.issue_type} "
                f"({issue.affected_layer}): {issue.description}"
            )
        return "\n".join(lines)


class ContextValidator:
    """
    Validates an assembled context window before LLM inference.

    Checks performed:
      1. Token budget compliance (ERROR if exceeded, WARNING near the limit)
      2. Document staleness (WARNING if older than max_document_age_days)
      3. Document relevancy (WARNING if a document scores below min_relevancy_score)
      4. Prompt injection pattern detection (ERROR if suspicious patterns found)
    """

    # Patterns that may indicate prompt injection attempts.
    # This is a simplified list; production systems use more sophisticated
    # detection, including embedding-based anomaly detection.
    INJECTION_PATTERNS = [
        "ignore previous instructions",
        "ignore all prior instructions",
        "disregard your system prompt",
        "you are now",
        "new instructions:",
        "override:",
        "jailbreak",
    ]

    def __init__(
        self,
        token_budget: int = 8192,
        max_document_age_days: int = 30,
        min_relevancy_score: float = 0.60,
    ):
        self._token_budget = token_budget
        self._max_age_seconds = max_document_age_days * 86400
        self._min_relevancy = min_relevancy_score

    def _check_token_budget(
        self,
        token_count: int,
    ) -> list[ValidationIssue]:
        """Check that the context fits within the token budget."""
        issues = []
        if token_count > self._token_budget:
            issues.append(ValidationIssue(
                severity=ValidationSeverity.ERROR,
                issue_type="TOKEN_OVERFLOW",
                description=(
                    f"Context uses {token_count} tokens but budget is "
                    f"{self._token_budget}. Inference will fail or truncate."
                ),
                affected_layer="all",
            ))
        elif token_count > self._token_budget * 0.90:
            issues.append(ValidationIssue(
                severity=ValidationSeverity.WARNING,
                issue_type="TOKEN_NEAR_LIMIT",
                description=(
                    f"Context uses {token_count/self._token_budget:.1%} of "
                    f"budget. Little headroom for response generation."
                ),
                affected_layer="all",
            ))
        return issues

    def _check_injection_patterns(
        self,
        messages: list[dict[str, str]],
    ) -> list[ValidationIssue]:
        """
        Scan all user-provided content for prompt injection patterns.
        Only user messages are scanned; system messages are trusted.
        """
        issues = []
        for msg in messages:
            if msg.get("role") != "user":
                continue
            content_lower = msg.get("content", "").lower()
            for pattern in self.INJECTION_PATTERNS:
                if pattern in content_lower:
                    issues.append(ValidationIssue(
                        severity=ValidationSeverity.ERROR,
                        issue_type="INJECTION_PATTERN",
                        description=(
                            f"Potential prompt injection detected: '{pattern}'"
                        ),
                        affected_layer="user_message",
                    ))
                    break  # One issue per message is sufficient.
        return issues

    def _check_document_staleness(
        self,
        documents: list[dict[str, Any]],
    ) -> list[ValidationIssue]:
        """
        Check whether retrieved documents are fresh enough and relevant
        enough for injection. Documents with a 'last_verified' timestamp
        older than max_document_age_days are flagged as potentially stale;
        documents with a 'relevancy_score' below min_relevancy_score are
        flagged as weak matches.
        """
        issues = []
        now = time.time()
        for doc in documents:
            last_verified = doc.get("last_verified")
            if last_verified is None:
                issues.append(ValidationIssue(
                    severity=ValidationSeverity.WARNING,
                    issue_type="MISSING_PROVENANCE",
                    description=(
                        f"Document '{doc.get('source', 'unknown')}' has no "
                        f"last_verified timestamp. Cannot assess staleness."
                    ),
                    affected_layer="rag_documents",
                ))
            elif now - last_verified > self._max_age_seconds:
                age_days = (now - last_verified) / 86400
                issues.append(ValidationIssue(
                    severity=ValidationSeverity.WARNING,
                    issue_type="STALE_DOCUMENT",
                    description=(
                        f"Document '{doc.get('source', 'unknown')}' is "
                        f"{age_days:.0f} days old (limit: "
                        f"{self._max_age_seconds/86400:.0f} days)."
                    ),
                    affected_layer="rag_documents",
                ))

            relevancy = doc.get("relevancy_score")
            if relevancy is not None and relevancy < self._min_relevancy:
                issues.append(ValidationIssue(
                    severity=ValidationSeverity.WARNING,
                    issue_type="LOW_RELEVANCY",
                    description=(
                        f"Document '{doc.get('source', 'unknown')}' scored "
                        f"{relevancy:.2f}, below the {self._min_relevancy:.2f} "
                        f"relevancy threshold."
                    ),
                    affected_layer="rag_documents",
                ))
        return issues

    def validate(
        self,
        messages: list[dict[str, str]],
        token_count: int,
        documents: list[dict[str, Any]] | None = None,
    ) -> ValidationResult:
        """
        Run all validation checks and return a ValidationResult.

        Parameters:
            messages:    The assembled messages list for the LLM.
            token_count: Estimated token count for the assembled context.
            documents:   Optional list of retrieved documents with metadata.
        """
        all_issues: list[ValidationIssue] = []

        all_issues.extend(self._check_token_budget(token_count))
        all_issues.extend(self._check_injection_patterns(messages))
        if documents:
            all_issues.extend(self._check_document_staleness(documents))

        has_errors = any(
            i.severity == ValidationSeverity.ERROR for i in all_issues
        )

        return ValidationResult(
            is_valid=not has_errors,
            issues=all_issues,
            token_count=token_count,
            token_budget=self._token_budget,
        )

The validator above implements the "fail fast" principle: it catches context quality problems before they reach the LLM, where they would be expensive to detect and potentially dangerous to ignore. The prompt injection detection is particularly important in production systems where user input is injected into the context. Adversarial users can attempt to override the system prompt by embedding instructions in their messages, and a context validator is the first line of defense against such attacks.

CHAPTER NINE: ALTERNATIVES TO CONTEXT ENGINEERING

Context Engineering is not the only approach to improving LLM performance, and a thoughtful practitioner should understand the alternatives and when they are more appropriate.

Fine-tuning is the process of continuing to train a pre-trained LLM on a domain-specific dataset. Where Context Engineering teaches the model what to know at inference time, fine-tuning teaches the model what to know at training time. Fine-tuning is appropriate when you need the model to have deeply internalized knowledge that should inform every response, such as a specific writing style, a domain-specific vocabulary, or a particular reasoning pattern. However, fine-tuning is expensive, requires significant data, and produces a model that is frozen at a point in time. It cannot be updated dynamically as knowledge changes, which makes it poorly suited for applications where the relevant information changes frequently.

Retrieval-Augmented Generation, or RAG, is often discussed as an alternative to Context Engineering, but this framing is misleading. RAG is actually one of the core techniques within Context Engineering. The SELECT strategy is implemented via RAG. What people sometimes mean when they contrast RAG with Context Engineering is the difference between a simple, single-stage RAG pipeline and a full Context Engineering system that includes memory, tool integration, conversation management, and quality measurement in addition to retrieval.

Long-context models represent another alternative. As context windows have grown from 4,096 tokens to 128,000 tokens and beyond, it has become tempting to simply stuff everything into the context and let the model sort it out. This approach, sometimes called "context stuffing," avoids the complexity of retrieval pipelines and memory management. However, it has significant limitations. Long contexts are expensive to process because the computational cost of the transformer's attention mechanism scales quadratically with context length. Long contexts also suffer more severely from the "lost in the middle" effect. And for most applications, the relevant information is a tiny fraction of the total available information, making context stuffing wasteful even when it is technically feasible.
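
The quadratic scaling claim is easy to see with a back-of-the-envelope calculation. The sketch below simply counts entries in the n-by-n attention score matrix, ignoring optimizations such as fused or sparse attention kernels, so it is an upper-bound intuition rather than a precise cost model:

```python
# Back-of-the-envelope comparison of self-attention cost at two
# context lengths. Attention over n tokens builds an n x n score
# matrix, so compute and memory grow quadratically with n.
def attention_score_entries(n_tokens: int) -> int:
    """Number of entries in the n x n attention score matrix."""
    return n_tokens * n_tokens

short_ctx = 4_096
long_ctx = 128_000

ratio = attention_score_entries(long_ctx) / attention_score_entries(short_ctx)
print(f"{long_ctx} tokens costs {ratio:.0f}x more attention compute "
      f"than {short_ctx} tokens")  # roughly 977x
```

A 31x increase in context length thus costs roughly 977x in attention computation, which is why context stuffing is rarely the economical choice even when the window is large enough.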

Agentic decomposition is an alternative that addresses the context window limitation by breaking large tasks into smaller subtasks, each handled by a separate agent with a clean, focused context window. This is the ISOLATE strategy taken to its logical conclusion. Instead of one agent managing a massive context, a hierarchy of specialized agents each manages a small, highly relevant context. The main orchestrating agent synthesizes their outputs. This approach scales to arbitrarily complex tasks but introduces coordination overhead and requires careful design of the inter-agent communication protocol.
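
A minimal sketch of this decomposition pattern, with `run_llm` as a stand-in for a real model call and the task names purely illustrative:

```python
# Minimal sketch of the ISOLATE strategy: an orchestrator splits a
# task into subtasks, each handled with its own small, focused
# context, then synthesizes the results.
def run_llm(context: str, task: str) -> str:
    # Placeholder: a real system would send `context` and `task`
    # to the model here and return its completion.
    return f"answer({task})"

def orchestrate(task: str, subtasks: dict[str, str]) -> str:
    """Run each subtask against only its own context, then merge."""
    partial_results = []
    for name, focused_context in subtasks.items():
        # Each sub-agent sees ONLY the context relevant to its subtask,
        # never the full accumulated history of the orchestrator.
        partial_results.append(run_llm(focused_context, name))
    # The orchestrator synthesizes sub-agent outputs into one answer.
    return " | ".join(partial_results)

result = orchestrate(
    "summarize quarterly report",
    {
        "extract_revenue": "revenue tables only...",
        "extract_risks": "risk disclosures only...",
    },
)
print(result)  # answer(extract_revenue) | answer(extract_risks)
```

The coordination overhead mentioned above lives in the `orchestrate` function: in a real system, its synthesis step is itself an LLM call with its own carefully assembled context.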

The honest answer is that none of these alternatives is universally superior. The best approach depends on the specific application, the nature of the knowledge required, the frequency with which that knowledge changes, the computational budget, and the acceptable latency. In practice, production AI systems typically combine multiple approaches: fine-tuning for stable domain knowledge, Context Engineering for dynamic and task-specific knowledge, and agentic decomposition for complex multi-step tasks.

CHAPTER TEN: BEST PRACTICES FOR PRODUCTION CONTEXT ENGINEERING

Drawing together everything we have covered, the following principles represent the current state of best practice for Context Engineering in production AI systems.

The first principle is to design context-first, not LLM-first. Before writing a single line of code, ask: what information does the model need to perform this task correctly? Map out all the sources of that information, how it will be retrieved, how it will be compressed, and how it will be ordered in the context window. Only then should you think about which model to use and how to phrase the instructions. The LLM is a component within a larger context-driven system, not the center of the universe.

The second principle is to make every token earn its place. The context window is a finite resource, and every token you include is a token that cannot be used for something else. Before adding any information to the context, ask: does this information materially improve the model's ability to respond to this query? If the answer is not a clear yes, leave it out. Relevancy over completeness, always.
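
One concrete way to enforce this principle is a greedy budget filter: admit candidate snippets in descending relevance order until the token budget is spent. The sketch below uses a naive whitespace count as a token estimate; a real pipeline would use the model's own tokenizer:

```python
# Sketch of "every token earns its place": greedily keep the most
# relevant snippets that fit within a fixed token budget.
def fill_budget(candidates: list[tuple[float, str]], budget: int) -> list[str]:
    """candidates: (relevance_score, text) pairs. Returns kept texts."""
    kept, used = [], 0
    for score, text in sorted(candidates, key=lambda c: c[0], reverse=True):
        cost = len(text.split())  # crude token estimate; use a real tokenizer
        if used + cost <= budget:
            kept.append(text)
            used += cost
    return kept

docs = [(0.9, "refund policy details"),
        (0.4, "unrelated blog post"),
        (0.7, "shipping terms")]
print(fill_budget(docs, budget=6))
# ['refund policy details', 'shipping terms']
```

Note that the low-scoring document is dropped even though it would technically fit after the others were skipped; relevancy order, not fill order, decides who gets in.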

The third principle is to separate concerns rigorously. The system prompt (Prompt Engineering) should be maintained separately from the context assembly logic (Context Engineering). The retrieval pipeline should be separate from the memory management system. The tool registry should be separate from the conversation history manager. Each component should have a single, clear responsibility and a well-defined interface. This makes each component easier to test, debug, and improve independently.
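
In Python, these well-defined interfaces can be expressed as structural protocols. The sketch below is illustrative: the class and method names are assumptions, not an API from any particular library:

```python
# Sketch of rigorous separation of concerns: each context component
# is defined by a narrow interface, so retrieval, memory, and tools
# can be implemented, tested, and swapped independently.
from typing import Protocol

class Retriever(Protocol):
    def retrieve(self, query: str, k: int) -> list[str]: ...

class MemoryStore(Protocol):
    def recall(self, user_id: str) -> list[str]: ...

class InMemoryRetriever:
    """Toy retriever satisfying the Retriever protocol."""
    def __init__(self, docs: list[str]) -> None:
        self.docs = docs
    def retrieve(self, query: str, k: int) -> list[str]:
        # Trivial substring match; a real retriever would use embeddings.
        return [d for d in self.docs if query in d][:k]

r: Retriever = InMemoryRetriever(["alpha report", "beta report"])
print(r.retrieve("alpha", k=1))  # ['alpha report']
```

Because the context assembler depends only on the `Retriever` protocol, the toy implementation can be replaced by a vector-database-backed one without touching any other component.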

The fourth principle is to measure everything. Instrument your context pipeline to track relevancy scores, faithfulness scores, context utilization rates, token costs, and response quality metrics. Without measurement, you are flying blind. With measurement, you can identify which components are working well and which need improvement, and you can detect degradation before it becomes a user-visible problem.
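
A minimal instrumentation layer can be as simple as logging a small metrics record per request and aggregating over time. The field names below are illustrative, chosen to match the metrics discussed in this article:

```python
# Sketch of per-request context pipeline instrumentation, so quality
# degradation shows up in dashboards rather than user complaints.
from dataclasses import dataclass, field
from statistics import mean

@dataclass
class ContextMetrics:
    relevancy: float     # mean retrieval relevance for this request, 0..1
    utilization: float   # fraction of the token budget actually used, 0..1
    token_cost: int      # tokens sent to the model

@dataclass
class MetricsLog:
    records: list[ContextMetrics] = field(default_factory=list)

    def log(self, m: ContextMetrics) -> None:
        self.records.append(m)

    def mean_relevancy(self) -> float:
        return mean(m.relevancy for m in self.records)

log = MetricsLog()
log.log(ContextMetrics(relevancy=0.8, utilization=0.6, token_cost=1200))
log.log(ContextMetrics(relevancy=0.6, utilization=0.9, token_cost=1800))
print(f"mean relevancy: {log.mean_relevancy():.2f}")  # mean relevancy: 0.70
```

In production this log would feed a time-series store so that a drop in mean relevancy triggers an alert before users notice degraded answers.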

The fifth principle is to validate before inference. Implement a context validation layer that checks for token overflow, stale documents, injection patterns, and other quality issues before the context is sent to the LLM. Catching problems early is always cheaper than dealing with the consequences of a bad response.

The sixth principle is to plan for context rot. Every piece of information in your knowledge base will eventually become stale. Design your knowledge base management processes with this in mind: track last-verified timestamps, implement automated freshness checks, and build workflows for updating and re-verifying documents on a regular schedule.
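
An automated freshness check built on last-verified timestamps can be sketched in a few lines. The document shapes and re-verification interval below are assumptions for illustration:

```python
# Sketch of a context rot check: every document carries a
# last-verified timestamp, and anything older than its
# re-verification interval is flagged for human review.
from datetime import datetime, timedelta

def stale_documents(docs: list[dict], now: datetime,
                    max_age: timedelta) -> list[str]:
    """Return ids of documents whose last verification is too old."""
    return [d["id"] for d in docs if now - d["last_verified"] > max_age]

now = datetime(2026, 4, 23)
docs = [
    {"id": "pricing-v3", "last_verified": datetime(2026, 4, 1)},
    {"id": "api-guide", "last_verified": datetime(2025, 10, 1)},
]
print(stale_documents(docs, now, max_age=timedelta(days=90)))
# ['api-guide']
```

Running such a check on a schedule, and excluding flagged documents from retrieval until re-verified, turns context rot from a silent failure into a visible maintenance queue.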

The seventh principle is to guard against the "lost in the middle" effect by placing critical information at the beginning and end of the context. Use the system prompt for the most important constraints and instructions. Place the current user query last. If you must include multiple retrieved documents, use a reranker to ensure the most relevant document is first and the second most relevant is last.
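
The first-and-last placement rule can be sketched as a small reordering step applied after reranking. This assumes the input list is already sorted by relevance, most relevant first:

```python
# Sketch of "edge placement" to counter the lost-in-the-middle effect:
# put the most relevant document first, the runner-up last, and the
# rest in the middle where attention is weakest.
def edge_order(ranked_docs: list[str]) -> list[str]:
    """ranked_docs[0] is most relevant. Best goes first, runner-up last."""
    if len(ranked_docs) < 3:
        return ranked_docs
    best, runner_up, *rest = ranked_docs
    return [best] + rest + [runner_up]

print(edge_order(["doc_a", "doc_b", "doc_c", "doc_d"]))
# ['doc_a', 'doc_c', 'doc_d', 'doc_b']
```

The least relevant documents land in the middle of the context, which is exactly where a lapse of model attention does the least damage.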

The eighth principle is to start small and grow deliberately. Begin with a minimal context: a system prompt, a single retrieval source, and the user query. Measure the quality of responses. Add complexity only when measurement shows that a specific type of information would materially improve quality. This iterative approach prevents the common mistake of building an elaborate context pipeline that is difficult to debug and maintain but only marginally better than a simple one.

CONCLUSION: THE OPERATING SYSTEM FOR INTELLIGENCE

Context Engineering is not a replacement for Prompt Engineering. It is the infrastructure that makes Prompt Engineering effective at scale. Prompt Engineering crafts the instructions. Context Engineering ensures the model has everything it needs to follow those instructions correctly.

Karpathy's analogy is the right one to end with. The LLM is the CPU. The context window is RAM. The context engineer is the operating system. Just as a modern operating system is far more complex than the applications it runs, the context engineering infrastructure of a production AI system is often far more complex than the prompts it serves. And just as a well-designed operating system makes applications faster, more reliable, and more capable, a well-designed context engineering system makes LLMs faster, more reliable, and more capable.

The field is young and moving fast. The techniques described in this article represent the current state of the art, but they will be superseded by better approaches as the community learns more about how LLMs process and use context. What will not change is the fundamental insight: the quality of an LLM's output is bounded by the quality of its input, and the quality of its input is the responsibility of the context engineer.

Fill the window wisely. The model will do the rest.

APPENDIX: QUICK REFERENCE

Context Window Layers (in recommended order):

  1. System Prompt - Role, constraints, persona, output format
  2. Long-Term Memory - Facts retrieved from external memory store
  3. Retrieved Documents - Task-specific knowledge from RAG pipeline
  4. Tool Definitions - Available functions the model can call
  5. Conversation History - Recent turns (older turns compressed)
  6. Current Query - The immediate user input
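
The six layers above can be assembled with a simple ordered join. The layer contents below are placeholders; a real pipeline would fill each slot from its own subsystem:

```python
# Sketch of assembling the six context layers in the recommended order.
# Empty or missing layers are simply skipped.
def assemble_context(layers: dict[str, str]) -> str:
    order = ["system_prompt", "long_term_memory", "retrieved_documents",
             "tool_definitions", "conversation_history", "current_query"]
    return "\n\n".join(layers[k] for k in order if layers.get(k))

ctx = assemble_context({
    "system_prompt": "You are a support assistant.",
    "long_term_memory": "User prefers concise answers.",
    "retrieved_documents": "Refund policy: 30 days.",
    "tool_definitions": "lookup_order(order_id)",
    "conversation_history": "User: hi / Assistant: hello",
    "current_query": "Where is my refund?",
})
print(ctx.startswith("You are a support assistant."))  # True
print(ctx.endswith("Where is my refund?"))             # True
```

Note that the ordering itself encodes the lost-in-the-middle defense: the system prompt opens the window and the current query closes it.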

The Four Core Strategies:

  WRITE    - Persist information to external memory for future retrieval
  SELECT   - Choose only the most relevant information from each source
  COMPRESS - Summarize or trim information to fit within token budgets
  ISOLATE  - Use sub-agents with focused contexts for complex tasks

Key Quality Metrics:

  Contextual Relevancy - Are retrieved documents relevant to the query?
  Faithfulness         - Is the response grounded in the context?
  Contextual Recall    - Does the context contain all needed information?
  Context Utilization  - What fraction of the token budget is being used?
  Hallucination Risk   - Composite risk score based on the above metrics

Common Pitfalls:

  Context Poisoning  - Incorrect or malicious information in the context
  Lost in the Middle - Critical information ignored when buried in context
  Context Overflow   - Context exceeds the model's maximum window size
  Context Rot        - Stale information that was once accurate but is no longer
  Semantic Noise     - Vectorially similar but contextually irrelevant docs

GPU Backend Support (via llama-cpp-python):

  CUDA   - NVIDIA GPUs        (compile with CMAKE_ARGS="-DGGML_CUDA=on")
  Metal  - Apple Silicon      (compile with CMAKE_ARGS="-DGGML_METAL=on")
  Vulkan - AMD/Intel/other    (compile with CMAKE_ARGS="-DGGML_VULKAN=on")
  CPU    - Universal fallback (no special compilation flags needed)