Wednesday, May 20, 2026

LARGE LANGUAGE MODELS ON KUBERNETES: A COMPLETE GUIDE TO DEPLOYING, MANAGING, AND QUERYING LOCAL AND REMOTE AI MODELS IN A CLOUD-NATIVE CLUSTER

 



CHAPTER ONE: WHY THIS MATTERS, AND WHY NOW



There is a moment in the life of every engineering team that adopts large

language models when the honeymoon ends. It usually happens around the third

month. The team has been happily calling an external API, watching tokens flow

in and out, and marvelling at what the model can do. Then the bill arrives.

Then the legal team asks where the customer data is going. Then the latency

spikes because the API is rate-limited. Then someone asks, "Could we just run

this ourselves?"


The answer is a resounding yes, and the tooling to do it well has finally
caught up with the ambition. Dynamic Resource Allocation (DRA), whose core API
reached beta in Kubernetes 1.32, gained further capabilities in Kubernetes 1.33
(released on April 23, 2025) and gives the scheduler a genuinely sophisticated
understanding of accelerator hardware from any vendor. Docker Desktop 4.41,
released on April 29, 2025, ships with a Model Runner that understands
OCI-packaged model artifacts and exposes an OpenAI-compatible API from your
laptop. The vLLM project has matured into a production-grade inference server
with native support for NVIDIA CUDA, AMD ROCm, and Intel Gaudi. KEDA gives us
event-driven autoscaling that reacts to the depth of an inference queue rather
than to CPU utilization. And the Model Context Protocol (MCP), whose latest
stable specification was published in November 2025, gives us a standard way
for AI agents to discover and call tools, including tools that live inside a
Kubernetes cluster.


This guide walks through the entire stack. We will design a Custom Resource

Definition that describes an LLM deployment with enough richness to let a

scheduler make intelligent decisions about where and how to run it — on any

accelerator vendor. We will build a controller that reconciles that resource

into real Kubernetes objects. We will configure vLLM as the inference engine

for GPU deployments and llama.cpp/Ollama for CPU deployments, wire up KEDA for

autoscaling, and expose the whole thing through an MCP server so that AI agents

can discover and use models as tools. Along the way we will handle the important

asymmetry between models you can run locally and frontier models that only exist

as remote REST APIs — because any real-world deployment has to deal with both.


Before we write a single line of YAML, we need to understand what we are

actually deploying.




CHAPTER TWO: THE MODEL LANDSCAPE IN MAY 2026



The LLM landscape has undergone a fundamental architectural shift over the past

eighteen months. The dominant pattern is no longer the dense transformer, where

every parameter participates in every forward pass. Instead, the leading models

use Mixture-of-Experts (MoE) architectures, where a router selects a small

subset of "expert" sub-networks for each token. This means that a model can

have 675 billion total parameters but activate only 41 billion of them per

token, dramatically reducing the compute cost of inference while retaining the

capacity that comes from a large parameter count.


This distinction matters enormously for Kubernetes scheduling, because the

relevant resource constraint is not the total parameter count but the amount of

accelerator memory required to hold all the expert weights simultaneously. Even

if only 41 billion parameters are active per token, all the expert weights must

reside in HBM/VRAM so the router can select among them.
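
The arithmetic behind this distinction can be sketched in a few lines of
Python. This is a rough illustration rather than a sizing tool: it counts
weights only (no KV cache, activations, or runtime overhead), uses decimal
gigabytes, and the helper names are hypothetical.

```python
# Bytes per parameter for a few common weight formats (weights only).
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "INT4": 0.5}


def weight_memory_gb(total_params_billions: float, fmt: str) -> float:
    """Approximate memory for ALL weights, in decimal GB.

    One billion parameters at one byte each is 1 GB, so the result is
    total parameters (in billions) times bytes per parameter.
    """
    return total_params_billions * BYTES_PER_PARAM[fmt]


def active_fraction(total_params_billions: float,
                    active_params_billions: float) -> float:
    """Share of the parameters that take part in each token's forward pass."""
    return active_params_billions / total_params_billions


# The 675B-total / 41B-active example from the text, held in BF16.
memory_gb = weight_memory_gb(675, "BF16")      # scales with TOTAL parameters
compute_share = active_fraction(675, 41)       # scales with ACTIVE parameters
print(f"weights: {memory_gb:.0f} GB, active share: {compute_share:.1%}")
```

Memory is driven by the first number and per-token compute by the second,
which is why a scheduler has to track both.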


A second major trend is the emergence of thinking models, which perform chain-

of-thought reasoning internally before producing a final answer. Some models,

like the Qwen3 and Qwen3.5/3.6 families, support a hybrid mode where a single

deployed model can switch between fast non-thinking responses and slower, deeper

reasoning responses depending on a flag in the system prompt.


A third trend combines two efficiency techniques: Quantization-Aware Training
(QAT) and Multi-Token Prediction (MTP). Google's Gemma 4 family uses MTP
drafters, small companion checkpoints that accelerate token generation by up
to 3x without quality loss. Gemma 3's QAT-INT4 format demonstrated that
training-time quantization awareness produces better quality than
post-training quantization at the same memory footprint.


Let us survey the major model families that are relevant to a Kubernetes

deployment as of May 2026, with the specific numbers that a scheduler needs.



GOOGLE — GEMMA 4 FAMILY (Released April 2, 2026)


Gemma 4 is released under Apache 2.0 and is built from the same research as

Gemini 3. All models natively process text and images; the E2B and E4B variants

also support audio input. All models support native function calling and

structured JSON output. MTP drafters are available for all sizes, offering up

to 3x speedup without significant VRAM increase.


Gemma 4 E2B has approximately 2.3 billion effective parameters. It targets

phones, edge devices, and low-VRAM testing. In Q4_K_M quantization it requires

4–6 GB of VRAM. Full BF16 requires 15 GB. Context window: 128K tokens.


Gemma 4 E4B has approximately 4.5 billion effective parameters (around 8 billion

including embeddings). It targets high-end laptops and small servers. In Q4_K_M

it requires about 8 GB of VRAM; in Q8_0 it needs 12–16 GB. Full BF16: 15 GB.

Context window: 128K tokens.


Gemma 4 26B-A4B is a Mixture-of-Experts model with 25.2 billion total parameters

and approximately 3.8 billion active parameters per token. It targets consumer

GPUs and cost-efficient single-GPU server inference. In 4-bit quantization

(GGUF or AWQ) it requires 14–16 GB of VRAM; on an RTX 4090 (24 GB) there is

comfortable headroom for KV cache. In FP8 it needs about 30 GB; in BF16 about

60 GB. Context window: 256K tokens.


Gemma 4 31B is a dense model with 30.7 billion total parameters. It targets

workstations where maximum quality is paramount. In INT4 it requires a minimum

of 16 GB of VRAM; Q4_K_M is comfortable at 24 GB; Q5_K_M and above need 32 GB+;

INT8 needs about 36 GB; BF16 needs about 62 GB (requires A100 80 GB or H100).

Context window: 256K tokens.



GOOGLE — GEMINI 3.1 (Remote API, Released February 19, 2026)


Gemini 3.1 Pro is not available as open weights. It is accessed via the Google

AI API and Vertex AI. Context window: 1 million tokens. Max output: 65,536

tokens. Pricing: $2 per million input tokens, $12 per million output tokens.

Supports up to 900 images per prompt, 8.4 hours of audio, and 1 hour of video.

Variants: Gemini 3.1 Pro Preview (Feb 19), Flash-Lite Preview (Mar 3),

Flash-Lite GA (May 7, 2026).



ALIBABA — QWEN3.5 FAMILY (Released February 16, 2026)


Qwen3.5 is open-weights under Apache 2.0. It uses both dense and MoE

architectures. The MoE models are designated with an "A" suffix indicating

active parameters (e.g., 397B-A17B has 397 billion total, 17 billion active).


Available open-weights sizes: 0.8B, 2B, 4B, 9B, 27B, 35B-A3B, 122B-A10B,

and 397B-A17B. Native context window: 262,144 tokens, extensible to 1,010,000

tokens. Supports 201 languages. Hybrid thinking/non-thinking mode.


The 9B model fits on a single 24 GB GPU in FP16 with ample KV cache headroom.

The 35B-A3B MoE model runs in Q4 quantization on a 24 GB card (approximately

21.5 GB). The 27B dense model runs on 22 GB of RAM/VRAM. For the 35B-A3B at

Q8_0 with a 65K context window in llama.cpp, VRAM usage is approximately 21.7 GB.



ALIBABA — QWEN3.6 FAMILY (Released April 2026)


Qwen3.6-27B and Qwen3.6-35B-A3B are open-weight models released under

Apache 2.0. Qwen3.6-35B-A3B is a fully open-source MoE model with 35 billion

total parameters and 3 billion active parameters per token, outperforming its

Qwen3.5 predecessor and rivalling larger dense models. Context window: 256K

tokens, extensible to 1,010,000 tokens. The 27B model runs on 18 GB of VRAM;

the 35B-A3B runs on 22 GB. The 35B-A3B achieves 100+ tokens per second on

consumer hardware due to its low active parameter count.


Qwen3.6-Plus is a proprietary hosted model with a 1-million-token context

window and up to 65,536 output tokens. It is not available as open weights.



META — LLAMA 4 FAMILY (Released April 2025, Llama 4 Community License)


Llama 4 Scout has 109 billion total parameters and activates 17 billion per

token across 16 experts. Context window: 10 million tokens. In 4-bit

quantization it requires approximately 55–61 GB of VRAM, fitting on a single

H100 80 GB or AMD MI300X (192 GB). For context windows beyond 130K tokens,

multiple GPUs are recommended.
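
Long contexts push a deployment onto more GPUs mainly because the key-value
(KV) cache grows linearly with the number of tokens held in context. The
sketch below uses the standard estimate for grouped-query attention. The layer
count, KV head count, and head dimension are hypothetical placeholders, not a
published Llama 4 configuration, and models that use chunked or sliding-window
attention typically need less.

```python
GIB = 2 ** 30


def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_elem: int = 2) -> int:
    """KV cache size for one sequence.

    The leading factor of 2 accounts for storing both keys and values
    in every layer; bytes_per_elem=2 corresponds to a 16-bit cache.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens


# Hypothetical model shape: 48 layers, 8 KV heads, head dimension 128.
for tokens in (8_192, 131_072, 1_048_576):
    size = kv_cache_bytes(tokens, layers=48, kv_heads=8, head_dim=128)
    print(f"{tokens:>9} tokens -> {size / GIB:.1f} GiB")
```

With these placeholder dimensions each token costs 192 KiB of cache, so a
single 131,072-token sequence needs 24 GiB on top of the weights.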


Llama 4 Maverick has 400 billion total parameters across 128 experts, also

activating 17 billion per token. Context window: 1 million tokens. In 4-bit

quantization it requires approximately 224 GB of VRAM, requiring multi-GPU.



MISTRAL AI — MISTRAL MEDIUM 3.5 (Released April 2026)


Mistral Medium 3.5 is a dense 128-billion-parameter model released under a

Modified MIT License. Context window: 256K tokens. Multimodal: text and image

input, text output. Supports configurable reasoning mode. In 4-bit quantization

it requires approximately 70 GB of VRAM, accessible on a Mac Studio with 128 GB

of unified memory or a multi-GPU server. Excels in instruction-following,

reasoning, coding, long-context understanding, tool use, and agentic workflows.



MISTRAL AI — MISTRAL LARGE 3 (Released December 2025)


Mistral Large 3 is a sparse MoE model with 675 billion total parameters and

approximately 41 billion active parameters per token, plus a 2.5 billion

parameter integrated vision encoder. Released under Apache 2.0. Context window:

256K tokens. Multimodal: native vision capabilities. Supports 40+ languages,

function calling, and structured JSON output.


For self-hosting in FP8 precision for long-context workloads up to 256K tokens,
B200 or H200 GPUs are recommended. For contexts under 64K tokens, NVFP4
precision can be used on A100s and H100s. The MoE architecture means that while
all 675 billion parameters must reside in memory, only about 41 billion take
part in the computation for each token, making inference cheaper than for a
dense 675B model. In Q4 quantization, approximately 200 GB of VRAM is required.



MOONSHOT AI — KIMI K2.5 (Released January 27, 2026)


Kimi K2.5 is open-source under a Modified MIT License. Architecture: MoE with

1 trillion total parameters and 32 billion active parameters per token. 384

experts with 8 selected per token plus 1 shared expert. Native multimodal,

trained on 15 trillion mixed visual and text tokens. Context window: 256K tokens.


The full model FP8 checkpoint is approximately 630 GB and typically requires

at least 4x H200 GPUs. Quantized versions (e.g., 1.8-bit) can run on a single

24 GB GPU with CPU offload for MoE layers, requiring around 256 GB of system

RAM for approximately 10 tokens per second. Features: Agent Swarm technology

coordinating up to 100 specialized AI agents, four operational modes (Instant,

Thinking, Agent, Agent Swarm).



ZHIPU AI — GLM-5.1 (Released April 7, 2026)


GLM-5.1 is open-source under the MIT License. Architecture: Hybrid MoE

(GlmMoeDSA) with 754 billion total parameters and 40 billion active parameters

per token. 256 routed experts plus 1 shared expert, with 8+1 active per token.

Context window: 200K tokens. Designed for agentic engineering and long-horizon

coding tasks.


Minimal deployment requires 1x NVIDIA HGX B200 (8x B200 GPU system). FP8

deployments with vLLM on multi-GPU rigs require approximately 860 GB or more

across GPUs (e.g., 8x H200). For CPU-only setups with quantized GGUF weights,

approximately 180–256 GB of system RAM is needed for 1–2 bit quantizations. A

24 GB GPU plus 256 GB of RAM can work with 2-bit variants using MoE offloading.



DEEPSEEK — V4 FAMILY (2026)


DeepSeek-V4-Flash has 284 billion total parameters and 13 billion active

parameters per token. Context window: 1 million tokens. MIT-licensed open

weights. Native FP4+FP8 requires approximately 170–175 GB of total VRAM,

fitting on 2x H200 or 4x A100 80 GB. Community INT4 quantization needs

approximately 90–100 GB, potentially on 4x RTX 4090. Community GGUF/GPTQ at

approximately 80 GB VRAM might be feasible on 1x RTX 5090 or 2x RTX 4090 with

CPU offload.


DeepSeek-V4-Pro has approximately 1.6 trillion total parameters and 49 billion

active parameters per token. MIT-licensed open weights. Full precision

(FP8+FP4 Mixed) requires approximately 865 GB of VRAM, recommending 16x H100

80 GB or equivalent. Even Q4 Pro (approximately 430 GB of weights) does not fit

on an 8x H100 80 GB box once KV cache and overhead are added. This is a

datacenter cluster job.



OPENAI — GPT-5.5 (Remote API, Released April 23, 2026)


GPT-5.5 is not available as open weights. API context window: 1 million tokens.

Max output: 128K tokens. Knowledge cutoff: December 2025. Pricing: $5 per

million input tokens, $30 per million output tokens (90% cached-input discount).

GPT-5.5 Pro: $30/$180 per million tokens. GPT-5.5 Instant (released May 5,

2026) is the default for all ChatGPT users. Offers five reasoning levels: none,

low, medium, high, xhigh. Supports text and image inputs.



ANTHROPIC — CLAUDE OPUS 4.7 (Remote API, Released April 16, 2026)


Claude Opus 4.7 is not available as open weights. Context window: 1 million

tokens. Max output: 128K tokens. Pricing: $5 per million input tokens, $25 per

million output tokens. Improved vision with higher-resolution image support

(up to 2576px / 3.75MP). New "xhigh" effort level for finer control over

reasoning and latency. Available via API, Amazon Bedrock, Google Cloud Vertex

AI, and Microsoft Foundry.



THE ACCELERATOR LANDSCAPE IN MAY 2026


NVIDIA remains the dominant data-center accelerator vendor. The H100 80 GB

(HBM3) and H200 141 GB (HBM3e) are the workhorses of production LLM inference.

The B200 with 192 GB of HBM3e and the HGX B200 system (8x B200 = 1536 GB

aggregate) represent the frontier for the largest models.


AMD Instinct MI300X offers 192 GB of HBM3 memory with 5.3 TB/s bandwidth per
card, making it exceptionally well-suited for large model inference where VRAM
is the primary constraint. The MI350X (CDNA 4 architecture, launched in 2025)
features 288 GB of HBM3E and 8 TB/s bandwidth, and AMD claims up to a 35x AI
inference performance improvement over MI300. The MI400 series (anticipated
2026) is expected to feature 432 GB of HBM4 and over 19.6 TB/s bandwidth. As of
mid-January 2026, 93% of vLLM AMD test groups were passing, and official
ROCm-enabled vLLM Docker images have been available since January 2026.


Intel Gaudi 3 (launched 2024, 5nm process) features 128 GB of HBM2e with

3.7 TB/s bandwidth. Intel maintains an optimized fork of vLLM for Gaudi with

Paged KV cache, custom Paged Attention, tensor parallelism, and FP8 quantization

support. Text Generation Inference (TGI) also has native Gaudi support. Gaudi 3

supports DeepSeek architecture since Intel Gaudi software release 1.21.0.


Apple Silicon (M3 Ultra: 512 GB unified memory) excels at local and edge LLM

inference via llama.cpp and the MLX framework. It is not yet suitable for large-

scale Kubernetes data-center deployments due to GPU support limitations in VMs

and containers, but it is the best option for on-premise developer workstations

and edge nodes running llama.cpp or Ollama.



THE VRAM REFERENCE TABLE (MAY 2026)


The following table summarises the memory requirements that our Kubernetes

scheduler will need to reason about. Q4 refers to 4-bit quantization (GGUF

Q4_K_M, AWQ, or GPTQ). QAT-INT4 refers to Google's Quantization-Aware Training

format. "Active" is the per-token active parameter count for MoE models.


    Model                 Total    Active   Arch    Q4 VRAM    BF16 VRAM
    ------------------------------------------------------------------------
    Gemma 4 E2B           2.3B     2.3B     Dense   ~4 GB      ~15 GB
    Gemma 4 E4B           4.5B     4.5B     Dense   ~8 GB      ~15 GB
    Qwen3.5-4B            4B       4B       Dense   ~3 GB      ~9 GB
    Qwen3.5-9B            9B       9B       Dense   ~6 GB      ~18 GB
    Gemma 4 26B-A4B       25.2B    3.8B     MoE     ~14 GB     ~60 GB
    Qwen3.5-27B           27B      27B      Dense   ~16 GB     ~54 GB
    Qwen3.6-27B           27B      27B      Dense   ~18 GB     ~54 GB
    Gemma 4 31B           30.7B    30.7B    Dense   ~16 GB     ~62 GB
    Qwen3.5-35B-A3B       35B      3B       MoE     ~21 GB     ~70 GB
    Qwen3.6-35B-A3B       35B      3B       MoE     ~22 GB     ~70 GB
    Llama 4 Scout         109B     17B      MoE     ~55 GB     ~220 GB
    Mistral Medium 3.5    128B     128B     Dense   ~70 GB     ~256 GB
    DeepSeek-V4-Flash     284B     13B      MoE     ~80 GB     ~570 GB
    Llama 4 Maverick      400B     17B      MoE     ~224 GB    ~860 GB
    Mistral Large 3       675B     41B      MoE     ~200 GB    ~1350 GB
    GLM-5.1               754B     40B      MoE     ~180 GB    ~1500 GB
    Kimi K2.5             1000B    32B      MoE     ~250 GB    ~2000 GB
    DeepSeek-V4-Pro       1600B    49B      MoE     ~430 GB    not practical
    ------------------------------------------------------------------------
    Remote API models (no local VRAM required):
    GPT-5.5               ?        ?        ?       N/A        N/A
    Claude Opus 4.7       ?        ?        ?       N/A        N/A
    Gemini 3.1 Pro        ?        ?        ?       N/A        N/A
    ------------------------------------------------------------------------

Note: the Q4 figures for Kimi K2.5 and GLM-5.1 match the 1-2 bit quantized
footprints quoted in the model descriptions above, not true 4-bit weights.


This table is the foundation of everything that follows. Every design decision

in our CRD, our controller, and our scheduler will ultimately trace back to

these numbers.
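
One way a controller might turn a memory figure from this table into an
accelerator count is sketched below. The function name is hypothetical, and
rounding up to a power of two is an assumption that reflects the common
practice of choosing tensor-parallel sizes that divide a model's attention
heads evenly. The memory figure passed in should already include KV cache and
runtime overhead.

```python
import math


def accelerators_needed(required_gb: float, per_device_gb: float,
                        power_of_two: bool = True) -> int:
    """Smallest device count whose combined memory covers required_gb."""
    if required_gb <= 0 or per_device_gb <= 0:
        raise ValueError("memory figures must be positive")
    count = math.ceil(required_gb / per_device_gb)
    if power_of_two:
        # Round up to the next power of two (1, 2, 4, 8, ...).
        count = 1 << (count - 1).bit_length()
    return count


# Figures quoted earlier in this chapter.
print(accelerators_needed(175, 141))   # DeepSeek-V4-Flash on H200
print(accelerators_needed(175, 80))    # DeepSeek-V4-Flash on A100 80 GB
print(accelerators_needed(860, 141))   # GLM-5.1 in FP8 on H200
```

The three results (2, 4, and 8) line up with the deployment notes above: 2x
H200 or 4x A100 80 GB for DeepSeek-V4-Flash, and 8x H200 for GLM-5.1 in FP8.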



CHAPTER THREE: KUBERNETES 1.33, DRA, AND MULTI-VENDOR GPU SCHEDULING


Kubernetes 1.33, released on April 23, 2025, is one of the most significant
releases for AI workloads in the history of the project. The headline feature
for accelerator users is Dynamic Resource Allocation (DRA): its core API
reached beta in Kubernetes 1.32, and 1.33 builds on it with several new
capabilities.


The traditional approach to GPU scheduling uses the device plugin model. A

device plugin runs as a DaemonSet on each GPU node and advertises GPUs as

extended resources: "nvidia.com/gpu", "amd.com/gpu", or "habana.ai/gaudi". A

pod requests a GPU by setting the appropriate resource limit. The scheduler

finds a node with an available GPU of that type and assigns the pod to it. This

works, but it is extremely coarse-grained. The scheduler knows that a node has

GPUs and that a pod wants GPUs, but it knows nothing about the VRAM capacity,

interconnect topology, or whether the workload would benefit from partitioning.
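
In manifest terms, the device plugin model boils down to a single integer
under the container's resource limits. The Python sketch below builds such a
Pod manifest. The pod name and image reference are placeholders; only the
resource keys come from the text above.

```python
import json


def gpu_pod_manifest(name: str, image: str, resource_key: str,
                     count: int = 1) -> dict:
    """Build a Pod that requests `count` devices of one extended resource."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": "inference",
                "image": image,
                # Extended resources are requested as whole devices; the
                # scheduler sees nothing about VRAM or topology here.
                "resources": {"limits": {resource_key: str(count)}},
            }],
        },
    }


manifest = gpu_pod_manifest(
    "gpu-smoke-test",                      # placeholder pod name
    "example.invalid/inference:latest",    # placeholder image
    "nvidia.com/gpu",
)
print(json.dumps(manifest, indent=2))
```

Because kubectl accepts JSON as well as YAML, the printed manifest can be
piped straight into `kubectl apply -f -`. Swapping the resource key for
"amd.com/gpu" or "habana.ai/gaudi" is the only vendor-specific change.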


DRA changes this by introducing a richer API for expressing hardware requirements

and capabilities. With DRA, a device driver can publish detailed information

about each accelerator: its model, its VRAM, its interconnect connectivity, its

supported quantization formats, and any other relevant attributes. A pod can

then request not just "a GPU" but "a GPU with at least 80 gigabytes of VRAM

from any vendor."
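
As a rough sketch of what such a request looks like, the snippet below builds
a ResourceClaim with a CEL selector on device memory. It assumes the
resource.k8s.io/v1beta1 request layout (later API versions may nest these
fields differently) and assumes the driver publishes a capacity named
"memory" under its own domain. The device class, driver domain, and claim
name are illustrative and must be checked against the DRA driver actually
installed.

```python
import json


def vram_claim(name: str, device_class: str, driver_domain: str,
               min_vram_gib: int) -> dict:
    """Build a ResourceClaim asking for one device with enough memory."""
    # CEL expression evaluated against each device the driver publishes.
    expression = (
        f'device.capacity["{driver_domain}"].memory'
        f'.compareTo(quantity("{min_vram_gib}Gi")) >= 0'
    )
    return {
        "apiVersion": "resource.k8s.io/v1beta1",
        "kind": "ResourceClaim",
        "metadata": {"name": name},
        "spec": {
            "devices": {
                "requests": [{
                    "name": "accelerator",
                    "deviceClassName": device_class,
                    "selectors": [{"cel": {"expression": expression}}],
                }]
            }
        },
    }


# Assumed names for an NVIDIA DRA driver; verify against your installation.
claim = vram_claim("llm-80g", "gpu.nvidia.com", "gpu.nvidia.com", 80)
print(json.dumps(claim, indent=2))
```

Capacity and attribute names are namespaced per driver, so a vendor-neutral
request in practice usually means generating one such claim per vendor from
the same memory requirement.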


Kubernetes 1.33 also introduces three alpha DRA features: Partitionable Devices
(a foundation for partitioning schemes such as MIG), Device Taints and
Tolerations (the device-level counterpart of node taints and tolerations), and
Prioritized List (a preference ordering over alternative device
configurations).


INSTALLING GPU SUPPORT FOR ALL VENDORS


The following commands install GPU support for all three major accelerator

vendors. Run only the sections relevant to the hardware in your cluster.


NVIDIA — GPU Operator:


    helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
    helm repo update
    helm install gpu-operator nvidia/gpu-operator \
        --namespace gpu-operator \
        --create-namespace \
        --set driver.enabled=true \
        --set toolkit.enabled=true \
        --set devicePlugin.enabled=true \
        --set dcgmExporter.enabled=true \
        --wait


AMD — GPU Operator (announced January 2025):


    helm repo add amd-gpu-operator https://rocm.github.io/gpu-operator
    helm repo update
    # Chart and value names can change between operator releases; confirm
    # them first with: helm search repo amd-gpu-operator
    helm install amd-gpu-operator amd-gpu-operator/gpu-operator-charts \
        --namespace amd-gpu-operator \
        --create-namespace \
        --set devicePlugin.enabled=true \
        --set nodeLabeller.enabled=true \
        --wait

    # The AMD GPU Operator installs:
    # - amd-gpu-device-plugin (exposes amd.com/gpu resources)
    # - amd-gpu-node-labeller (labels nodes with GPU model and VRAM)
    # - ROCm driver management
    # After installation, AMD GPUs appear as amd.com/gpu resources.


Intel Gaudi — Base Operator:


    helm repo add intel https://intel.github.io/helm-charts
    helm repo update
    helm install gaudi-base-operator intel/intel-gaudi-base-operator \
        --namespace intel-gaudi \
        --create-namespace \
        --wait

    # The Intel Gaudi Base Operator installs:
    # - Intel Gaudi Device Plugin (exposes habana.ai/gaudi resources)
    # - Container runtime configuration
    # - Feature discovery
    # - Monitoring tools
    # After installation, Gaudi cards appear as habana.ai/gaudi resources.


Verify all accelerators are visible to the scheduler:


    # NVIDIA GPUs
    kubectl get nodes -o custom-columns='NAME:.metadata.name,NVIDIA_GPU:.status.allocatable.nvidia\.com/gpu'

    # AMD GPUs
    kubectl get nodes -o custom-columns='NAME:.metadata.name,AMD_GPU:.status.allocatable.amd\.com/gpu'

    # Intel Gaudi
    kubectl get nodes -o custom-columns='NAME:.metadata.name,GAUDI:.status.allocatable.habana\.ai/gaudi'
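
For clusters with many nodes, the same check can be scripted. The sketch below
totals allocatable accelerators from JSON in the shape that
`kubectl get nodes -o json` prints. The sample node list is fabricated for
illustration, and the function name is hypothetical.

```python
import json

RESOURCE_KEYS = ("nvidia.com/gpu", "amd.com/gpu", "habana.ai/gaudi")


def allocatable_accelerators(node_list: dict) -> dict:
    """Sum allocatable accelerators per resource key across all nodes."""
    totals = {key: 0 for key in RESOURCE_KEYS}
    for node in node_list.get("items", []):
        allocatable = node.get("status", {}).get("allocatable", {})
        for key in RESOURCE_KEYS:
            # Quantities arrive as strings such as "8"; absent means zero.
            totals[key] += int(allocatable.get(key, "0"))
    return totals


# Fabricated sample, not output from a real cluster.
sample = json.loads("""
{"items": [
  {"metadata": {"name": "node-a"},
   "status": {"allocatable": {"nvidia.com/gpu": "8"}}},
  {"metadata": {"name": "node-b"},
   "status": {"allocatable": {"amd.com/gpu": "4"}}},
  {"metadata": {"name": "node-c"},
   "status": {"allocatable": {"cpu": "64"}}}
]}
""")
totals = allocatable_accelerators(sample)
print(totals)
```

Against a live cluster, replace the sample with `json.load(sys.stdin)` and
pipe the kubectl output into the script.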


CHAPTER FOUR: DESIGNING THE LLMMODEL CUSTOM RESOURCE


We want to define a Kubernetes Custom Resource that captures everything a
scheduler needs to know about an LLM deployment. The resource must be rich
enough to express the full diversity of the model landscape surveyed in
Chapter Two, vendor-agnostic so it works with NVIDIA, AMD, Intel Gaudi, and
CPU-only nodes, and simple enough that an engineer can write one without
consulting a manual.


The most fundamental distinction is between a locally-hosted model and a remote

API-only model. Frontier models like GPT-5.5, Claude Opus 4.7, and Gemini 3.1

Pro are not available as downloadable weights. Our CRD represents both cases.
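
To make the two cases concrete, here is a sketch of a local and a remote
LLMModel object expressed as Python dictionaries, with a small check for the
two fields the schema below marks as required. The object names and modelId
values are illustrative, and validate_spec is a hypothetical helper rather
than part of the controller.

```python
REQUIRED_FIELDS = ("modelId", "deploymentMode")
DEPLOYMENT_MODES = ("local", "remote")


def validate_spec(spec: dict) -> list:
    """Return a list of human-readable problems; empty means valid."""
    errors = [f"missing required field: {name}"
              for name in REQUIRED_FIELDS if name not in spec]
    mode = spec.get("deploymentMode")
    if mode is not None and mode not in DEPLOYMENT_MODES:
        errors.append(f"deploymentMode must be one of {DEPLOYMENT_MODES}, "
                      f"got {mode!r}")
    return errors


local_model = {
    "apiVersion": "ai.example.io/v1alpha1",
    "kind": "LLMModel",
    "metadata": {"name": "qwen-local"},          # illustrative name
    "spec": {
        "modelId": "Qwen/Qwen3.5-9B",            # illustrative repo path
        "deploymentMode": "local",
        "acceleratorVendor": "nvidia",
        "inferenceEngine": "vllm",
    },
}

remote_model = {
    "apiVersion": "ai.example.io/v1alpha1",
    "kind": "LLMModel",
    "metadata": {"name": "frontier-remote"},     # illustrative name
    "spec": {"modelId": "gpt-5.5", "deploymentMode": "remote"},
}

for obj in (local_model, remote_model):
    print(obj["metadata"]["name"], validate_spec(obj["spec"]) or "ok")
```

A remote object needs no accelerator or engine fields at all, which is why
only the model identifier and the deployment mode are mandatory.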


A critical new field is `acceleratorVendor`, which selects the hardware backend.

The controller uses this field to select the correct:

  - Kubernetes resource key (nvidia.com/gpu, amd.com/gpu, habana.ai/gaudi)

  - vLLM Docker image (CUDA, ROCm, or Gaudi variant)

  - Node selector labels (populated by the respective GPU operator)

  - Tolerations (vendor-specific GPU node taints)

  - Prometheus metrics source (DCGM for NVIDIA, ROCm SMI for AMD, Gaudi metrics)
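
The lookup described in this list can be sketched as a small table keyed by
vendor. The class and function names are hypothetical, and image references,
node selectors, and tolerations are left out because their exact values depend
on the operator versions installed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VendorProfile:
    resource_key: Optional[str]     # None means no accelerator resource
    metrics_source: Optional[str]   # None means no accelerator metrics


PROFILES = {
    "nvidia": VendorProfile("nvidia.com/gpu", "DCGM"),
    "amd": VendorProfile("amd.com/gpu", "ROCm SMI"),
    "intel-gaudi": VendorProfile("habana.ai/gaudi", "Gaudi metrics"),
    "cpu": VendorProfile(None, None),
}


def resource_limits(vendor: str, accelerator_count: int) -> dict:
    """Container resource limits for the chosen vendor (empty for CPU)."""
    try:
        profile = PROFILES[vendor]
    except KeyError:
        raise ValueError(f"unknown acceleratorVendor: {vendor!r}") from None
    if profile.resource_key is None:
        return {}
    return {profile.resource_key: str(accelerator_count)}


for vendor in PROFILES:
    print(vendor, resource_limits(vendor, 2))
```

Keeping the mapping in one table means adding a fifth vendor later is a data
change rather than a new code path in the controller.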


The CRD definition:


    # File: config/crd/bases/ai.example.io_llmmodels.yaml

    apiVersion: apiextensions.k8s.io/v1

    kind: CustomResourceDefinition

    metadata:

      name: llmmodels.ai.example.io

      annotations:

        ai.example.io/schema-version: "v1alpha1"

    spec:

      group: ai.example.io

      names:

        kind: LLMModel

        listKind: LLMModelList

        plural: llmmodels

        singular: llmmodel

        shortNames:

          - lm

      scope: Namespaced

      versions:

        - name: v1alpha1

          served: true

          storage: true

          additionalPrinterColumns:

            - name: Model

              type: string

              jsonPath: .spec.modelId

            - name: Vendor

              type: string

              jsonPath: .spec.acceleratorVendor

            - name: Engine

              type: string

              jsonPath: .spec.inferenceEngine

            - name: Mode

              type: string

              jsonPath: .spec.deploymentMode

            - name: VRAM

              type: integer

              jsonPath: .spec.resources.vramPerAcceleratorGiB

            - name: Context

              type: integer

              jsonPath: .spec.contextWindowK

            - name: Status

              type: string

              jsonPath: .status.phase

            - name: Endpoint

              type: string

              jsonPath: .status.endpoint

          schema:

            openAPIV3Schema:

              type: object

              required:

                - spec

              properties:

                spec:

                  type: object

                  required:

                    - modelId

                    - deploymentMode

                  properties:


                    modelId:

                      type: string

                      description: >

                        The canonical model identifier. For Hugging Face models

                        this is the repo path (owner/name). For OCI models this

                        is the image reference. For remote API models this is

                        the model name as used in the API request.


                    deploymentMode:

                      type: string

                      enum:

                        - local

                        - remote

                      description: >

                        Whether to run the model locally in the cluster or to

                        proxy requests to a remote API endpoint.


                    # -------------------------------------------------------

                    # ACCELERATOR VENDOR SELECTION

                    # This is the key field that makes the operator

                    # hardware-agnostic. The controller uses this to select

                    # the correct resource key, Docker image, node selectors,

                    # and tolerations.

                    # -------------------------------------------------------

                    acceleratorVendor:

                      type: string

                      enum:

                        - nvidia

                        - amd

                        - intel-gaudi

                        - cpu

                      default: nvidia

                      description: >

                        The accelerator vendor for local deployments.

                        nvidia: Uses nvidia.com/gpu resource key, CUDA-based

                          vLLM image, DCGM metrics.

                        amd: Uses amd.com/gpu resource key, ROCm-based vLLM

                          image, ROCm SMI metrics.

                        intel-gaudi: Uses habana.ai/gaudi resource key,

                          Intel Gaudi optimized vLLM fork image, Gaudi metrics.

                        cpu: No GPU resource requested. Uses llama.cpp or

                          Ollama for CPU inference. Suitable for small models

                          on Apple Silicon or x86 servers.


                    inferenceEngine:

                      type: string

                      enum:

                        - vllm

                        - llamacpp

                        - ollama

                      default: vllm

                      description: >

                        The inference engine to use for local deployments.

                        vLLM is recommended for GPU deployments (NVIDIA, AMD,

                        Intel Gaudi) with high throughput requirements.

                        llamacpp is recommended for CPU inference or consumer

                        GPU inference using GGUF quantized models.

                        Ollama wraps llama.cpp with a model management layer

                        and OpenAI-compatible API.


                    modelType:

                      type: string

                      enum:

                        - Dense

                        - MoE

                        - DenseThinking

                        - MoEThinking

                      default: Dense

                      description: >

                        The architectural type of the model. Dense models

                        activate all parameters for every token. MoE models

                        route each token through a subset of expert networks.

                        Thinking variants support chain-of-thought reasoning,

                        either always (DenseThinking) or switchable via system

                        prompt (MoEThinking).


                    totalParametersBillions:

                      type: number

                      description: >

                        Total number of parameters in billions. For MoE models

                        this is the sum of all expert parameters. This value

                        drives VRAM/HBM requirements.


                    activeParametersBillions:

                      type: number

                      description: >

                        Number of parameters activated per token in billions.

                        For dense models this equals totalParametersBillions.

                        For MoE models this drives compute (FLOP) cost per token.


                    quantization:

                      type: string

                      enum:

                        - None

                        - FP16

                        - BF16

                        - FP8

                        - FP4

                        - INT8

                        - INT4

                        - QAT-INT4

                        - GPTQ

                        - AWQ

                        - GGUF-Q4_K_M

                        - GGUF-Q8_0

                        - MXFP4

                        - MXFP6

                      default: None

                      description: >

                        The quantization format of the model weights.

                        QAT-INT4 is Google's Quantization-Aware Training format.

                        MXFP4 and MXFP6 are OCP microscaling formats with native
                        support on AMD CDNA 4 (MI350X).

                        FP8 is natively supported by Intel Gaudi 3, NVIDIA H100+,

                        and AMD MI300X+.

                        GGUF variants are used with llamacpp and Ollama.

                        AWQ and GPTQ are used with vLLM on NVIDIA and AMD.


                    contextWindowK:

                      type: integer

                      description: >

                        The maximum context window in thousands of tokens.

                        This affects KV cache memory requirements, which grow

                        linearly with context length.


                    domains:

                      type: array

                      items:

                        type: string

                        enum:

                          - general

                          - code

                          - math

                          - reasoning

                          - vision

                          - audio

                          - multilingual

                          - embedding

                          - function-calling

                          - long-context

                          - agentic

                      description: >

                        The capability domains this model excels in.


                    languages:

                      type: array

                      items:

                        type: string

                      description: >

                        Languages this model supports. Use ISO 639-1 codes

                        or descriptive strings like "140+" for broad support.


                    resources:

                      type: object

                      properties:

                        acceleratorCount:

                          type: integer

                          default: 1

                          description: >

                            Number of accelerators required per replica.

                            The resource key used depends on acceleratorVendor:

                            nvidia -> nvidia.com/gpu

                            amd   -> amd.com/gpu

                            intel-gaudi -> habana.ai/gaudi

                            cpu   -> no accelerator resource requested


                        vramPerAcceleratorGiB:

                          type: integer

                          description: >

                            Required VRAM/HBM per accelerator in gibibytes.

                            The controller uses this to select nodes whose

                            accelerators have sufficient memory.

                            Vendor-quoted maximum capacities are given in

                            decimal GB, which is about 7% less in GiB:

                            AMD MI300X: up to 192 GB.

                            Intel Gaudi 3: up to 128 GB.

                            NVIDIA H100: up to 80 GB.

                            NVIDIA H200: up to 141 GB.

                            NVIDIA B200: up to 192 GB.


                        preferredAcceleratorModel:

                          type: string

                          description: >

                            Optional preferred accelerator model string used

                            as a DRA preference hint. Examples:

                            "NVIDIA H100 80GB HBM3"

                            "AMD Instinct MI300X"

                            "Intel Gaudi 3"


                        cpuMillicores:

                          type: integer

                          default: 4000

                          description: >

                            CPU request in millicores for the inference pod.


                        memoryGiB:

                          type: integer

                          default: 16

                          description: >

                            System RAM request in gibibytes for the inference

                            pod. Separate from accelerator VRAM/HBM.

                            For MoE models with CPU offloading, this may need

                            to be very large (e.g., 256 GiB for Kimi K2.5

                            with quantized CPU offload).


                    engineArgs:

                      type: object

                      additionalProperties:

                        type: string

                      description: >

                        Key-value pairs passed as command-line arguments to the

                        inference engine. For vLLM, common args include

                        tensor-parallel-size, max-model-len, and

                        gpu-memory-utilization. Values are always strings.


                    modelSource:

                      type: object

                      properties:

                        type:

                          type: string

                          enum:

                            - huggingface

                            - oci

                            - s3

                        huggingFaceTokenSecret:

                          type: string

                        ociImage:

                          type: string

                        s3Bucket:

                          type: string

                        s3Prefix:

                          type: string

                        s3CredentialsSecret:

                          type: string


                    scaling:

                      type: object

                      properties:

                        minReplicas:

                          type: integer

                          default: 1

                          description: >

                            Minimum replica count. Set to 0 for scale-to-zero.

                        maxReplicas:

                          type: integer

                          default: 3

                        scaleUpThreshold:

                          type: integer

                          default: 5

                          description: >

                            Queued inference requests that trigger scale-up.

                            KEDA monitors vllm:num_requests_waiting.

                        cooldownPeriodSeconds:

                          type: integer

                          default: 300

                          description: >

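                            Seconds KEDA waits after the last trigger reported

                            active before scaling back to zero replicas.

                            Scale-down between maxReplicas and 1 is handled by

                            the Horizontal Pod Autoscaler, not by this setting.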


                    remoteApi:

                      type: object

                      properties:

                        baseUrl:

                          type: string

                        apiKeySecret:

                          type: string

                        rateLimitRpm:

                          type: integer


                    mcpExposure:

                      type: object

                      properties:

                        enabled:

                          type: boolean

                          default: false

                        toolName:

                          type: string

                        toolDescription:

                          type: string


                status:

                  type: object

                  properties:

                    phase:

                      type: string

                      enum:

                        - Pending

                        - Downloading

                        - Starting

                        - Ready

                        - Degraded

                        - Failed

                        - Proxying

                    endpoint:

                      type: string

                    conditions:

                      type: array

                      items:

                        type: object

                        properties:

                          type:

                            type: string

                          status:

                            type: string

                          observedGeneration:

                            type: integer

                            format: int64

                          lastTransitionTime:

                            type: string

                            format: date-time

                          reason:

                            type: string

                          message:

                            type: string

                    acceleratorNodes:

                      type: array

                      items:

                        type: string

                      description: >

                        Names of the Kubernetes nodes where this model's

                        inference pods are currently scheduled.

                    currentReplicas:

                      type: integer

                    requestsPerMinute:

                      type: integer

                    averageLatencyMs:

                      type: integer
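
The contextWindowK description above says that KV cache memory grows linearly with context length. The sketch below makes that arithmetic concrete for a standard transformer with grouped-query attention, where each token stores one key vector and one value vector per layer and per KV head. The layer count, head count, and head dimension used in main are hypothetical placeholders, not the published shape of any model in this guide, and the estimate ignores sliding-window layers and allocator overhead.

```go
package main

import "fmt"

// KVCacheShape holds the architecture numbers that determine KV cache size.
type KVCacheShape struct {
	Layers       int // transformer layers
	KVHeads      int // key/value heads per layer (after grouped-query sharing)
	HeadDim      int // dimension of each head
	BytesPerElem int // 2 for FP16/BF16 cache entries, 1 for FP8
}

// BytesPerToken returns the cache bytes one token occupies in one sequence:
// a key and a value vector (the factor 2) for every layer and KV head.
func (s KVCacheShape) BytesPerToken() int64 {
	return 2 * int64(s.Layers) * int64(s.KVHeads) * int64(s.HeadDim) * int64(s.BytesPerElem)
}

// CacheBytes returns the cache bytes for a single sequence of the given length.
func (s KVCacheShape) CacheBytes(tokens int) int64 {
	return s.BytesPerToken() * int64(tokens)
}

func main() {
	// Hypothetical shape, chosen only to keep the numbers round.
	shape := KVCacheShape{Layers: 32, KVHeads: 8, HeadDim: 128, BytesPerElem: 2}
	const gib = 1 << 30
	for _, tokens := range []int{32768, 65536, 131072} {
		fmt.Printf("%6d tokens -> %.1f GiB\n", tokens, float64(shape.CacheBytes(tokens))/gib)
	}
}
```

With this hypothetical shape a single 65,536-token sequence needs 8 GiB of cache, and doubling the length doubles the cost. That is one reason to set max-model-len below a model's full context window when accelerator memory is tight.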


Now let us look at concrete LLMModel resources for several scenarios.


Example 1 — Gemma 4 26B-A4B on an AMD MI300X (single GPU, MoE):


    # File: config/samples/gemma4-26b-amd.yaml

    apiVersion: ai.example.io/v1alpha1

    kind: LLMModel

    metadata:

      name: gemma4-26b-amd

      namespace: ai-models

      labels:

        ai.example.io/family: gemma4

        ai.example.io/vendor: google

        ai.example.io/accelerator: amd

    spec:

      modelId: google/gemma-4-26b-a4b-it

      deploymentMode: local

      acceleratorVendor: amd

      inferenceEngine: vllm

      modelType: MoE

      totalParametersBillions: 25.2

      activeParametersBillions: 3.8

      quantization: AWQ

      contextWindowK: 256

      domains:

        - general

        - vision

        - multilingual

        - function-calling

        - reasoning

        - agentic

      languages:

        - "140+"

      resources:

        acceleratorCount: 1

        vramPerAcceleratorGiB: 16

        preferredAcceleratorModel: "AMD Instinct MI300X"

        cpuMillicores: 8000

        memoryGiB: 64

      engineArgs:

        gpu-memory-utilization: "0.90"

        max-model-len: "65536"

        enable-chunked-prefill: "true"

      modelSource:

        type: huggingface

        huggingFaceTokenSecret: hf-token

      scaling:

        minReplicas: 0

        maxReplicas: 3

        scaleUpThreshold: 3

        cooldownPeriodSeconds: 300

      mcpExposure:

        enabled: true

        toolName: gemma4_26b_vision_amd

        toolDescription: >

          Gemma 4 26B-A4B MoE model from Google running on AMD MI300X.

          Supports text, image, and audio input with 256K context window.

          Excellent for general tasks, vision, multilingual content, function

          calling, and agentic workflows. Runs locally with full data privacy.
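
A quick sanity check on vramPerAcceleratorGiB: weight memory is roughly the parameter count times the bits per weight, divided by eight. The sketch below applies that to the numbers in Example 1. It assumes that AWQ means 4-bit weights, and it ignores quantization scales, any layers left unquantized, and the KV cache. For a MoE model the total parameter count is the relevant one, since every expert has to be resident even though only a few are active per token.

```go
package main

import "fmt"

// weightGiB estimates weight memory in GiB for paramsBillions parameters
// stored at bitsPerWeight bits each. It is a rough lower bound, not a
// measurement.
func weightGiB(paramsBillions, bitsPerWeight float64) float64 {
	bytes := paramsBillions * 1e9 * bitsPerWeight / 8
	return bytes / float64(1<<30)
}

func main() {
	// 25.2B total parameters comes from Example 1; 4 bits per weight is an
	// assumption about the AWQ checkpoint.
	requestedGiB := 16.0
	weights := weightGiB(25.2, 4)
	fmt.Printf("weights: %.1f GiB of the %.0f GiB requested\n", weights, requestedGiB)
	fmt.Printf("headroom for KV cache and activations: %.1f GiB\n", requestedGiB-weights)
}
```

Under these assumptions the weights take a little under 12 GiB, which leaves only a few GiB of the requested 16 GiB for the KV cache.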


Example 2 — Qwen3.6-35B-A3B on an Intel Gaudi 3 (thinking mode):


    # File: config/samples/qwen36-35b-gaudi.yaml

    apiVersion: ai.example.io/v1alpha1

    kind: LLMModel

    metadata:

      name: qwen36-35b-gaudi

      namespace: ai-models

      labels:

        ai.example.io/family: qwen36

        ai.example.io/vendor: alibaba

        ai.example.io/accelerator: intel-gaudi

    spec:

      modelId: Qwen/Qwen3.6-35B-A3B

      deploymentMode: local

      acceleratorVendor: intel-gaudi

      inferenceEngine: vllm

      modelType: MoEThinking

      totalParametersBillions: 35

      activeParametersBillions: 3

      quantization: FP8

      contextWindowK: 256

      domains:

        - general

        - reasoning

        - math

        - code

        - multilingual

        - agentic

      languages:

        - "201+"

      resources:

        acceleratorCount: 1

        vramPerAcceleratorGiB: 24

        preferredAcceleratorModel: "Intel Gaudi 3"

        cpuMillicores: 8000

        memoryGiB: 64

      engineArgs:

        max-model-len: "65536"

        tensor-parallel-size: "1"

      modelSource:

        type: huggingface

        huggingFaceTokenSecret: hf-token

      scaling:

        minReplicas: 0

        maxReplicas: 2

        scaleUpThreshold: 2

        cooldownPeriodSeconds: 300

      mcpExposure:

        enabled: true

        toolName: qwen36_reasoning_gaudi

        toolDescription: >

          Qwen3.6-35B-A3B MoE thinking model running on Intel Gaudi 3.

          Supports hybrid thinking/non-thinking mode. Excellent for math,

          code, complex reasoning, and agentic tasks. 256K context window.

          Runs locally with full data privacy.
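
The scaling blocks in these samples reduce to simple arithmetic once KEDA hands the queue metric to the Horizontal Pod Autoscaler. Assuming the trigger uses an average-value target, the HPA aims for ceil(total queued requests / scaleUpThreshold) replicas and then clamps the result to the configured bounds. The sketch below reproduces only that calculation. The function name is hypothetical, and stabilization windows, the cooldown period, and KEDA's activation step from zero are deliberately left out.

```go
package main

import "fmt"

// desiredReplicas returns ceil(queued/threshold) clamped to
// [minReplicas, maxReplicas]. A threshold below 1 is treated as 1.
func desiredReplicas(queued, threshold, minReplicas, maxReplicas int) int {
	if threshold < 1 {
		threshold = 1
	}
	want := (queued + threshold - 1) / threshold // integer ceiling
	if want < minReplicas {
		want = minReplicas
	}
	if want > maxReplicas {
		want = maxReplicas
	}
	return want
}

func main() {
	// Values from Example 1: scaleUpThreshold 3, minReplicas 0, maxReplicas 3.
	for _, queued := range []int{0, 1, 4, 7, 100} {
		fmt.Printf("queued=%3d -> replicas=%d\n", queued, desiredReplicas(queued, 3, 0, 3))
	}
}
```

A low threshold makes the model scale out early, at the price of more accelerators held during short bursts.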


Example 3 — Gemma 4 31B on NVIDIA H100 (dense, high quality):


    # File: config/samples/gemma4-31b-nvidia.yaml

    apiVersion: ai.example.io/v1alpha1

    kind: LLMModel

    metadata:

      name: gemma4-31b-nvidia

      namespace: ai-models

      labels:

        ai.example.io/family: gemma4

        ai.example.io/vendor: google

        ai.example.io/accelerator: nvidia

    spec:

      modelId: google/gemma-4-31b-it

      deploymentMode: local

      acceleratorVendor: nvidia

      inferenceEngine: vllm

      modelType: Dense

      totalParametersBillions: 30.7

      activeParametersBillions: 30.7

      quantization: INT4

      contextWindowK: 256

      domains:

        - general

        - vision

        - code

        - reasoning

        - multilingual

        - function-calling

        - agentic

      languages:

        - "140+"

      resources:

        acceleratorCount: 1

        vramPerAcceleratorGiB: 24

        preferredAcceleratorModel: "NVIDIA H100 80GB HBM3"

        cpuMillicores: 8000

        memoryGiB: 32

      engineArgs:

        gpu-memory-utilization: "0.90"

        max-model-len: "65536"

        enable-chunked-prefill: "true"

      modelSource:

        type: huggingface

        huggingFaceTokenSecret: hf-token

      scaling:

        minReplicas: 1

        maxReplicas: 2

        scaleUpThreshold: 3

        cooldownPeriodSeconds: 300

      mcpExposure:

        enabled: true

        toolName: gemma4_31b_nvidia

        toolDescription: >

          Gemma 4 31B dense model from Google running on NVIDIA H100.

          Highest quality in the Gemma 4 family. Supports text and image

          input with 256K context window. Excellent for coding, reasoning,

          and agentic workflows. Runs locally with full data privacy.


Example 4 — Qwen3.5-9B on CPU (Apple Silicon or x86, no GPU):


    # File: config/samples/qwen35-9b-cpu.yaml

    apiVersion: ai.example.io/v1alpha1

    kind: LLMModel

    metadata:

      name: qwen35-9b-cpu

      namespace: ai-models

      labels:

        ai.example.io/family: qwen35

        ai.example.io/vendor: alibaba

        ai.example.io/accelerator: cpu

    spec:

      modelId: Qwen/Qwen3.5-9B-GGUF

      deploymentMode: local

      acceleratorVendor: cpu

      inferenceEngine: ollama

      modelType: DenseThinking

      totalParametersBillions: 9

      activeParametersBillions: 9

      quantization: GGUF-Q4_K_M

      contextWindowK: 128

      domains:

        - general

        - reasoning

        - code

        - multilingual

      languages:

        - "201+"

      resources:

        acceleratorCount: 0

        cpuMillicores: 16000

        memoryGiB: 32

      engineArgs:

        num-ctx: "32768"

      modelSource:

        type: huggingface

        huggingFaceTokenSecret: hf-token

      scaling:

        minReplicas: 1

        maxReplicas: 4

        scaleUpThreshold: 5

        cooldownPeriodSeconds: 120

      mcpExposure:

        enabled: true

        toolName: qwen35_9b_cpu

        toolDescription: >

          Qwen3.5-9B running on CPU via Ollama. No GPU required. Supports

          hybrid thinking mode. Good for general tasks, code, and reasoning

          on CPU-only nodes or Apple Silicon. 128K context window.


Example 5 — GPT-5.5 as a remote API proxy:


    # File: config/samples/gpt55-remote.yaml

    apiVersion: ai.example.io/v1alpha1

    kind: LLMModel

    metadata:

      name: gpt55-remote

      namespace: ai-models

      labels:

        ai.example.io/family: gpt55

        ai.example.io/vendor: openai

        ai.example.io/deployment-mode: remote

    spec:

      modelId: gpt-5.5

      deploymentMode: remote

      modelType: Dense

      contextWindowK: 1000

      domains:

        - general

        - code

        - math

        - reasoning

        - vision

        - function-calling

        - long-context

        - agentic

      languages:

        - "50+"

      remoteApi:

        baseUrl: https://api.openai.com/v1

        apiKeySecret: openai-api-key

        rateLimitRpm: 10000

      mcpExposure:

        enabled: true

        toolName: gpt55_frontier

        toolDescription: >

          OpenAI GPT-5.5 frontier model via the OpenAI API. 1M token context

          window. Five reasoning levels (none, low, medium, high, xhigh).

          Best-in-class for complex agentic tasks, coding, and research.

          Use when data privacy constraints permit external API calls.


Example 6 — Claude Opus 4.7 as a remote API proxy:


    # File: config/samples/claude-opus47-remote.yaml

    apiVersion: ai.example.io/v1alpha1

    kind: LLMModel

    metadata:

      name: claude-opus47-remote

      namespace: ai-models

      labels:

        ai.example.io/family: claude-opus

        ai.example.io/vendor: anthropic

        ai.example.io/deployment-mode: remote

    spec:

      modelId: claude-opus-4-7

      deploymentMode: remote

      modelType: DenseThinking

      contextWindowK: 1000

      domains:

        - general

        - code

        - reasoning

        - vision

        - function-calling

        - long-context

        - agentic

      languages:

        - "50+"

      remoteApi:

        baseUrl: https://api.anthropic.com/v1

        apiKeySecret: anthropic-api-key

        rateLimitRpm: 5000

      mcpExposure:

        enabled: true

        toolName: claude_opus47

        toolDescription: >

          Anthropic Claude Opus 4.7 via the Anthropic API. 1M token context

          window. Excellent for multi-step agentic tasks, long-horizon reasoning,

          and complex tool-dependent workflows. Supports xhigh reasoning effort.


Example 7 — Gemini 3.1 Pro as a remote API proxy:


    # File: config/samples/gemini31-remote.yaml

    apiVersion: ai.example.io/v1alpha1

    kind: LLMModel

    metadata:

      name: gemini31-remote

      namespace: ai-models

      labels:

        ai.example.io/family: gemini31

        ai.example.io/vendor: google

        ai.example.io/deployment-mode: remote

    spec:

      modelId: gemini-3.1-pro

      deploymentMode: remote

      modelType: Dense

      contextWindowK: 1000

      domains:

        - general

        - code

        - reasoning

        - vision

        - audio

        - function-calling

        - long-context

        - agentic

      languages:

        - "50+"

      remoteApi:

        baseUrl: https://generativelanguage.googleapis.com/v1beta/openai

        apiKeySecret: google-api-key

        rateLimitRpm: 2000

      mcpExposure:

        enabled: true

        toolName: gemini31_pro

        toolDescription: >

          Google Gemini 3.1 Pro via the Google AI API. 1M token context window.

          Processes up to 900 images, 8.4 hours of audio, or 1 hour of video

          per prompt. Three-tier thinking system (low, medium, high).
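
The seven samples follow a pattern that the schema itself does not enforce: remote models carry a remoteApi block and no resources, while local models name an accelerator vendor and, for cpu, request zero accelerators. A validating webhook or an early step in the reconciler could check this. The sketch below shows one possible set of rules inferred from the samples, not something the CRD defines. The flattened struct and the validate function are hypothetical, and acceleratorVendor is assumed to have been defaulted already.

```go
package main

import (
	"errors"
	"fmt"
)

// modelSpec is a flattened, hypothetical subset of the LLMModel spec.
type modelSpec struct {
	DeploymentMode     string
	AcceleratorVendor  string // assumed to be defaulted before validation
	AcceleratorCount   int
	RemoteBaseURL      string
	RemoteAPIKeySecret string
}

// validate applies consistency rules inferred from the sample manifests.
func validate(s modelSpec) error {
	switch s.DeploymentMode {
	case "remote":
		if s.RemoteBaseURL == "" || s.RemoteAPIKeySecret == "" {
			return errors.New("remote mode needs remoteApi.baseUrl and remoteApi.apiKeySecret")
		}
	case "local":
		if s.AcceleratorVendor == "cpu" && s.AcceleratorCount != 0 {
			return errors.New("cpu vendor must request zero accelerators")
		}
		if s.AcceleratorVendor != "cpu" && s.AcceleratorCount < 1 {
			return errors.New("accelerated vendors need at least one accelerator")
		}
	default:
		return fmt.Errorf("unknown deploymentMode %q", s.DeploymentMode)
	}
	return nil
}

func main() {
	samples := []modelSpec{
		{DeploymentMode: "remote", RemoteBaseURL: "https://api.openai.com/v1", RemoteAPIKeySecret: "openai-api-key"},
		{DeploymentMode: "local", AcceleratorVendor: "cpu", AcceleratorCount: 0},
		{DeploymentMode: "local", AcceleratorVendor: "amd", AcceleratorCount: 0},
	}
	for _, s := range samples {
		fmt.Printf("%s/%s: %v\n", s.DeploymentMode, s.AcceleratorVendor, validate(s))
	}
}
```

Rejecting an inconsistent manifest at admission time gives the author an immediate error instead of a model that sits in Pending.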


CHAPTER FIVE: THE LLMMODEL CONTROLLER


The controller watches for LLMModel resources and reconciles the actual state

of the cluster to match the desired state. The key design principle is that

the `acceleratorVendor` field drives all hardware-specific decisions, keeping

the rest of the reconciliation logic vendor-agnostic.


FILE: go.mod


    module github.com/example/llm-operator


    go 1.22


    require (

        k8s.io/api v0.30.0

        k8s.io/apimachinery v0.30.0

        k8s.io/client-go v0.30.0

        sigs.k8s.io/controller-runtime v0.18.0

    )


FILE: api/v1alpha1/llmmodel_types.go


    package v1alpha1


    import (

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

    )


    // LLMModelSpec defines the desired state of an LLMModel resource.

    type LLMModelSpec struct {

        // ModelId is the canonical identifier for the model.

        // +kubebuilder:validation:Required

        ModelId string `json:"modelId"`


        // DeploymentMode determines whether to run locally or proxy remotely.

        // +kubebuilder:validation:Enum=local;remote

        DeploymentMode string `json:"deploymentMode"`


        // AcceleratorVendor selects the hardware backend for local deployments.

        // +kubebuilder:validation:Enum=nvidia;amd;intel-gaudi;cpu

        // +kubebuilder:default=nvidia

        AcceleratorVendor string `json:"acceleratorVendor,omitempty"`


        // InferenceEngine selects the serving framework for local deployments.

        // +kubebuilder:validation:Enum=vllm;llamacpp;ollama

        // +kubebuilder:default=vllm

        InferenceEngine string `json:"inferenceEngine,omitempty"`


        // ModelType describes the architectural pattern of the model.

        // +kubebuilder:validation:Enum=Dense;MoE;DenseThinking;MoEThinking

        // +kubebuilder:default=Dense

        ModelType string `json:"modelType,omitempty"`


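        // TotalParametersBillions is the total parameter count in billions.

        // controller-gen rejects float fields such as this one and

        // ActiveParametersBillions unless it runs with

        // crd:allowDangerousTypes=true.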

        TotalParametersBillions float64 `json:"totalParametersBillions,omitempty"`


        // ActiveParametersBillions is the per-token active parameter count.

        ActiveParametersBillions float64 `json:"activeParametersBillions,omitempty"`


        // Quantization specifies the weight format and precision.

        // +kubebuilder:validation:Enum=None;FP16;BF16;FP8;FP4;INT8;INT4;QAT-INT4;GPTQ;AWQ;GGUF-Q4_K_M;GGUF-Q8_0;MXFP4;MXFP6

        Quantization string `json:"quantization,omitempty"`


        // ContextWindowK is the maximum context length in thousands of tokens.

        ContextWindowK int `json:"contextWindowK,omitempty"`


        // Domains lists the capability areas this model excels in.

        Domains []string `json:"domains,omitempty"`


        // Languages lists the languages this model supports.

        Languages []string `json:"languages,omitempty"`


        // Resources specifies accelerator and system resource requirements.

        Resources LLMResourceSpec `json:"resources,omitempty"`


        // EngineArgs are passed directly to the inference engine as CLI flags.

        EngineArgs map[string]string `json:"engineArgs,omitempty"`


        // ModelSource configures where to download model weights from.

        ModelSource *ModelSourceSpec `json:"modelSource,omitempty"`


        // Scaling configures replica count and autoscaling thresholds.

        Scaling LLMScalingSpec `json:"scaling,omitempty"`


        // RemoteApi configures the external API for remote deployments.

        RemoteApi *RemoteApiSpec `json:"remoteApi,omitempty"`


        // McpExposure configures MCP tool registration for this model.

        McpExposure LLMMcpSpec `json:"mcpExposure,omitempty"`

    }


    // LLMResourceSpec captures accelerator and system resource requirements.

    type LLMResourceSpec struct {

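        // AcceleratorCount is the number of accelerators required per replica.

        // Set to 0 for cpu acceleratorVendor. The tag has no omitempty so that

        // an explicit 0 is serialized rather than dropped and re-defaulted to 1.

        // +optional

        // +kubebuilder:default=1

        AcceleratorCount int `json:"acceleratorCount"`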


        // VramPerAcceleratorGiB is the required VRAM/HBM per accelerator in GiB.

        VramPerAcceleratorGiB int `json:"vramPerAcceleratorGiB,omitempty"`


        // PreferredAcceleratorModel is an optional DRA preference hint.

        PreferredAcceleratorModel string `json:"preferredAcceleratorModel,omitempty"`


        // CpuMillicores is the CPU request for the inference pod.

        // +kubebuilder:default=4000

        CpuMillicores int `json:"cpuMillicores,omitempty"`


        // MemoryGiB is the system RAM request for the inference pod.

        // +kubebuilder:default=16

        MemoryGiB int `json:"memoryGiB,omitempty"`

    }


    // ModelSourceSpec configures model weight download.

    type ModelSourceSpec struct {

        // Type selects the download backend.

        // +kubebuilder:validation:Enum=huggingface;oci;s3

        Type string `json:"type"`


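        // HuggingFaceTokenSecret names the Secret holding the Hugging Face

        // access token. Used when Type is huggingface.

        HuggingFaceTokenSecret string `json:"huggingFaceTokenSecret,omitempty"`


        // OciImage is the OCI reference of the packaged model artifact.

        // Used when Type is oci.

        OciImage string `json:"ociImage,omitempty"`


        // S3Bucket and S3Prefix locate the model weights in object storage.

        // Used when Type is s3.

        S3Bucket string `json:"s3Bucket,omitempty"`

        S3Prefix string `json:"s3Prefix,omitempty"`


        // S3CredentialsSecret names the Secret holding the S3 credentials.

        S3CredentialsSecret string `json:"s3CredentialsSecret,omitempty"`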

    }


    // LLMScalingSpec configures replica autoscaling.

    type LLMScalingSpec struct {

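        // MinReplicas is the minimum replica count. Set to 0 for scale-to-zero.

        // The tag has no omitempty so that an explicit 0 is serialized rather

        // than dropped and re-defaulted to 1.

        // +optional

        // +kubebuilder:default=1

        MinReplicas int `json:"minReplicas"`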


        // MaxReplicas is the maximum replica count.

        // +kubebuilder:default=3

        MaxReplicas int `json:"maxReplicas,omitempty"`


        // ScaleUpThreshold is the queued request count that triggers scale-up.

        // +kubebuilder:default=5

        ScaleUpThreshold int `json:"scaleUpThreshold,omitempty"`


        // CooldownPeriodSeconds is how long KEDA waits after the last active

        // trigger before scaling back to zero replicas.

        // +kubebuilder:default=300

        CooldownPeriodSeconds int `json:"cooldownPeriodSeconds,omitempty"`

    }


    // RemoteApiSpec configures a remote OpenAI-compatible API proxy.

    type RemoteApiSpec struct {

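        // BaseUrl is the root URL of the OpenAI-compatible endpoint.

        BaseUrl string `json:"baseUrl"`


        // ApiKeySecret names the Secret holding the API key.

        ApiKeySecret string `json:"apiKeySecret"`


        // RateLimitRpm is the requests-per-minute limit for this endpoint.

        RateLimitRpm int `json:"rateLimitRpm,omitempty"`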

    }


    // LLMMcpSpec configures MCP tool exposure for this model.

    type LLMMcpSpec struct {

        Enabled         bool   `json:"enabled,omitempty"`

        ToolName        string `json:"toolName,omitempty"`

        ToolDescription string `json:"toolDescription,omitempty"`

    }


    // LLMModelStatus reflects the observed state of the LLMModel.

    type LLMModelStatus struct {

        Phase             string             `json:"phase,omitempty"`

        Endpoint          string             `json:"endpoint,omitempty"`

        Conditions        []metav1.Condition `json:"conditions,omitempty"`

        AcceleratorNodes  []string           `json:"acceleratorNodes,omitempty"`

        CurrentReplicas   int                `json:"currentReplicas,omitempty"`

        RequestsPerMinute int                `json:"requestsPerMinute,omitempty"`

        AverageLatencyMs  int                `json:"averageLatencyMs,omitempty"`

    }


    // LLMModel is the Schema for the llmmodels API.

    // +kubebuilder:object:root=true

    // +kubebuilder:subresource:status

    // +kubebuilder:printcolumn:name="Model",type=string,JSONPath=".spec.modelId"

    // +kubebuilder:printcolumn:name="Vendor",type=string,JSONPath=".spec.acceleratorVendor"

    // +kubebuilder:printcolumn:name="Mode",type=string,JSONPath=".spec.deploymentMode"

    // +kubebuilder:printcolumn:name="Status",type=string,JSONPath=".status.phase"

    // +kubebuilder:printcolumn:name="Endpoint",type=string,JSONPath=".status.endpoint"

    type LLMModel struct {

        metav1.TypeMeta   `json:",inline"`

        metav1.ObjectMeta `json:"metadata,omitempty"`

        Spec              LLMModelSpec   `json:"spec,omitempty"`

        Status            LLMModelStatus `json:"status,omitempty"`

    }


    // LLMModelList contains a list of LLMModel resources.

    // +kubebuilder:object:root=true

    type LLMModelList struct {

        metav1.TypeMeta `json:",inline"`

        metav1.ListMeta `json:"metadata,omitempty"`

        Items           []LLMModel `json:"items"`

    }


    func init() {

        SchemeBuilder.Register(&LLMModel{}, &LLMModelList{})

    }
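
EngineArgs is a free-form map, so the controller has to turn it into a flag list at some point. Go randomizes map iteration order, which means a naive loop can produce a differently ordered argument list on every reconcile and trigger needless rollouts. The sketch below sorts the keys first. The buildEngineArgs name and the --key=value rendering are assumptions made for illustration, and how boolean flags are parsed depends on the engine and its version.

```go
package main

import (
	"fmt"
	"sort"
)

// buildEngineArgs renders an engineArgs map as "--key=value" flags.
// Keys are sorted so the output is identical on every call, regardless of
// Go's randomized map iteration order.
func buildEngineArgs(args map[string]string) []string {
	keys := make([]string, 0, len(args))
	for k := range args {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	flags := make([]string, 0, len(keys))
	for _, k := range keys {
		flags = append(flags, "--"+k+"="+args[k])
	}
	return flags
}

func main() {
	// The engineArgs block from Example 1.
	args := map[string]string{
		"gpu-memory-utilization": "0.90",
		"max-model-len":          "65536",
		"enable-chunked-prefill": "true",
	}
	for _, flag := range buildEngineArgs(args) {
		fmt.Println(flag)
	}
}
```

Because the output is stable, the rendered pod template only changes when the spec changes, which keeps the Deployment from rolling without cause.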


FILE: api/v1alpha1/groupversion_info.go


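    // Package v1alpha1 contains the API schema definitions for the

    // ai.example.io/v1alpha1 API group. The markers below tell controller-gen

    // to generate deepcopy code for this package and which group it belongs to.

    // +kubebuilder:object:generate=true

    // +groupName=ai.example.io

    package v1alpha1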


    import (

        "k8s.io/apimachinery/pkg/runtime/schema"

        "sigs.k8s.io/controller-runtime/pkg/scheme"

    )


    var (

        GroupVersion  = schema.GroupVersion{Group: "ai.example.io", Version: "v1alpha1"}

        SchemeBuilder = &scheme.Builder{GroupVersion: GroupVersion}

        AddToScheme   = SchemeBuilder.AddToScheme

    )


FILE: controllers/hardware.go


    // hardware.go contains all vendor-specific hardware configuration logic.

    // This is the single file that must be updated when adding a new accelerator

    // vendor. All other controller code is vendor-agnostic.

    package controllers


    import (

        "fmt"


        corev1 "k8s.io/api/core/v1"

        "k8s.io/apimachinery/pkg/api/resource"


        aiv1alpha1 "github.com/example/llm-operator/api/v1alpha1"

    )


    // AcceleratorConfig holds all hardware-specific configuration derived

    // from the LLMModel spec. The reconciler calls resolveAcceleratorConfig

    // once and then uses this struct throughout the reconciliation.

    type AcceleratorConfig struct {

        // ResourceKey is the Kubernetes resource name for this accelerator type.

        // e.g. "nvidia.com/gpu", "amd.com/gpu", "habana.ai/gaudi"

        // Empty string means no accelerator resource is requested (cpu mode).

        ResourceKey string


        // VllmImage is the Docker image to use for the vLLM inference server.

        // Different vendors require different images built against their SDK.

        VllmImage string


        // OllamaImage is the container image to use for Ollama.

        OllamaImage string


        // LlamaCppImage is the Docker image to use for llama.cpp.

        LlamaCppImage string


        // NodeSelectorLabels are added to the pod's nodeSelector to target

        // nodes with the correct accelerator type. These labels are populated

        // by the respective GPU operator / device plugin.

        NodeSelectorLabels map[string]string


        // Tolerations allow the pod to be scheduled on tainted GPU nodes.

        // Most clusters taint GPU nodes to prevent non-GPU workloads from

        // landing on expensive hardware.

        Tolerations []corev1.Toleration


        // PrometheusMetricPrefix is the prefix used by the accelerator's

        // monitoring exporter. vLLM emits its own metrics regardless of

        // backend; the hardware-level metrics behind this prefix can also

        // feed KEDA as a secondary, utilization-based scaling trigger.

        PrometheusMetricPrefix string


        // AcceleratorQuantity is the resource.Quantity for the accelerator

        // resource limit/request. Nil for cpu mode.

        AcceleratorQuantity *resource.Quantity


        // HostIPC indicates whether the pod needs host IPC namespace access.

        // Multi-GPU tensor parallelism exchanges data through shared memory,

        // which the default container /dev/shm is usually too small for.

        HostIPC bool


        // HostNetwork indicates whether the pod needs host network access.

        // Some NCCL/RCCL setups need it for multi-GPU communication, mainly

        // when traffic crosses nodes.

        HostNetwork bool


        // AdditionalEnvVars are vendor-specific environment variables to inject.

        AdditionalEnvVars []corev1.EnvVar

    }


    // resolveAcceleratorConfig derives all hardware-specific configuration

    // from the LLMModel spec. This is the single authoritative function for

    // vendor dispatch. Add new vendors here.

    func resolveAcceleratorConfig(model *aiv1alpha1.LLMModel) (AcceleratorConfig, error) {

        vendor := model.Spec.AcceleratorVendor

        if vendor == "" {

            vendor = "nvidia" // backward-compatible default

        }


        count := model.Spec.Resources.AcceleratorCount

        if count < 1 {

            count = 1

        }


        switch vendor {


        // ------------------------------------------------------------------

        // NVIDIA — CUDA

        // Resource key:  nvidia.com/gpu

        // Node label:    nvidia.com/gpu.present=true  (GPU Operator)

        // Taint:         nvidia.com/gpu:NoSchedule

        // vLLM image:    vllm/vllm-openai (CUDA build)

        // Metrics:       DCGM exporter (DCGM_FI_DEV_*)

        // ------------------------------------------------------------------

        case "nvidia":

            qty := resource.MustParse(fmt.Sprintf("%d", count))

            return AcceleratorConfig{

                ResourceKey:            "nvidia.com/gpu",

                VllmImage:              "vllm/vllm-openai:v0.8.5",

                OllamaImage:            "ollama/ollama:0.6.5",

                LlamaCppImage:          "ghcr.io/ggml-org/llama.cpp:server-cuda",

                NodeSelectorLabels:     map[string]string{"nvidia.com/gpu.present": "true"},

                Tolerations: []corev1.Toleration{

                    {

                        Key:      "nvidia.com/gpu",

                        Operator: corev1.TolerationOpExists,

                        Effect:   corev1.TaintEffectNoSchedule,

                    },

                },

                PrometheusMetricPrefix: "DCGM_FI_DEV",

                AcceleratorQuantity:    &qty,

                HostIPC:                count > 1,

                HostNetwork:            count > 1,

                AdditionalEnvVars:      []corev1.EnvVar{},

            }, nil


        // ------------------------------------------------------------------

        // AMD — ROCm

        // Resource key:  amd.com/gpu

        // Node label:    amd.com/gpu.present=true  (AMD GPU Operator)

        // Taint:         amd.com/gpu:NoSchedule

        // vLLM image:    rocm/vllm-openai (ROCm build)

        //   Official ROCm-enabled vLLM images available since January 2026.

        //   93% of vLLM AMD test groups passing as of mid-January 2026.

        // Metrics:       ROCm SMI exporter (rocm_smi_*)

        // Note:          AWQ quantization is supported on ROCm as of vLLM 0.8+.

        //                MXFP4/MXFP6 require MI350X or later (CDNA 4).

        // ------------------------------------------------------------------

        case "amd":

            qty := resource.MustParse(fmt.Sprintf("%d", count))

            return AcceleratorConfig{

                ResourceKey:   "amd.com/gpu",

                VllmImage:     "rocm/vllm-openai:v0.8.5-rocm6.2",

                OllamaImage:   "ollama/ollama:0.6.5-rocm",

                LlamaCppImage: "ghcr.io/ggerganov/llama.cpp:server-rocm",

                NodeSelectorLabels: map[string]string{

                    "amd.com/gpu.present": "true",

                },

                Tolerations: []corev1.Toleration{

                    {

                        Key:      "amd.com/gpu",

                        Operator: corev1.TolerationOpExists,

                        Effect:   corev1.TaintEffectNoSchedule,

                    },

                },

                PrometheusMetricPrefix: "rocm_smi",

                AcceleratorQuantity:    &qty,

                HostIPC:                count > 1,

                HostNetwork:            count > 1,

                AdditionalEnvVars: []corev1.EnvVar{

                    // ROCm requires this for vLLM's flash attention backend.

                    {

                        Name:  "VLLM_USE_ROCM_FLASH_ATTN",

                        Value: "1",

                    },

                    // Device visibility is left to the AMD device plugin, which

                    // exposes only the allocated GPUs to the container. We do

                    // not set HIP_VISIBLE_DEVICES: it expects a comma-separated

                    // list of device indices, so a value such as "all" is invalid.

                },

            }, nil


        // ------------------------------------------------------------------

        // INTEL GAUDI

        // Resource key:  habana.ai/gaudi

        // Node label:    habana.ai/gaudi=true  (Gaudi Base Operator)

        // Taint:         habana.ai/gaudi:NoSchedule

        // vLLM image:    Intel optimized vLLM fork for Gaudi

        //   Intel maintains a Gaudi-specific vLLM fork with Paged KV cache,

        //   custom Paged Attention, tensor parallelism, and FP8 support.

        //   Supports DeepSeek architecture since Gaudi software 1.21.0.

        // Metrics:       Gaudi metrics exporter (habana_*)

        // Note:          FP8 is natively supported by Gaudi 3.

        //                TGI (Text Generation Inference) is also supported.

        // ------------------------------------------------------------------

        case "intel-gaudi":

            qty := resource.MustParse(fmt.Sprintf("%d", count))

            return AcceleratorConfig{

                ResourceKey:   "habana.ai/gaudi",

                VllmImage:     "vault.habana.ai/gaudi-docker/1.21.0/ubuntu22.04/habanalabs/vllm-fork:latest",

                OllamaImage:   "", // Ollama does not support Gaudi; use vllm or tgi

                LlamaCppImage: "", // llama.cpp does not support Gaudi; use vllm or tgi

                NodeSelectorLabels: map[string]string{

                    "habana.ai/gaudi": "true",

                },

                Tolerations: []corev1.Toleration{

                    {

                        Key:      "habana.ai/gaudi",

                        Operator: corev1.TolerationOpExists,

                        Effect:   corev1.TaintEffectNoSchedule,

                    },

                },

                PrometheusMetricPrefix: "habana",

                AcceleratorQuantity:    &qty,

                HostIPC:                true, // Required for Gaudi inter-card communication

                HostNetwork:            count > 1,

                AdditionalEnvVars: []corev1.EnvVar{

                    // Gaudi requires these environment variables for vLLM.

                    {

                        Name:  "PT_HPU_ENABLE_LAZY_COLLECTIVES",

                        Value: "true",

                    },

                    {

                        Name:  "VLLM_SKIP_WARMUP",

                        Value: "false",

                    },

                    // Tell the Habana container runtime which Gaudi cards to expose.

                    {

                        Name:  "HABANA_VISIBLE_DEVICES",

                        Value: "all",

                    },

                },

            }, nil


        // ------------------------------------------------------------------

        // CPU — No accelerator

        // Resource key:  (none)

        // Inference:     Ollama or llama.cpp with GGUF models

        // Suitable for: Apple Silicon nodes, x86 servers, edge deployments

        // Note:          vLLM is not recommended for CPU-only inference.

        //                Use ollama or llamacpp as inferenceEngine.

        // ------------------------------------------------------------------

        case "cpu":

            return AcceleratorConfig{

                ResourceKey:            "",

                VllmImage:              "", // vLLM not recommended for CPU

                OllamaImage:            "ollama/ollama:0.6.5",

                LlamaCppImage:          "ghcr.io/ggerganov/llama.cpp:server",

                NodeSelectorLabels:     map[string]string{},

                Tolerations:            []corev1.Toleration{},

                PrometheusMetricPrefix: "",

                AcceleratorQuantity:    nil,

                HostIPC:                false,

                HostNetwork:            false,

                AdditionalEnvVars:      []corev1.EnvVar{},

            }, nil


        default:

            return AcceleratorConfig{}, fmt.Errorf(

                "unknown acceleratorVendor %q; valid values: nvidia, amd, intel-gaudi, cpu",

                vendor,

            )

        }

    }


    // inferenceImageFor returns the correct Docker image for the given

    // inference engine and accelerator configuration.

    func inferenceImageFor(engine string, cfg AcceleratorConfig) (string, error) {

        switch engine {

        case "vllm":

            if cfg.VllmImage == "" {

                return "", fmt.Errorf(

                    "vLLM is not supported for accelerator resource %q "+

                        "(empty means CPU-only); use llamacpp or ollama instead",

                    cfg.ResourceKey,

                )

            }

            return cfg.VllmImage, nil

        case "ollama":

            if cfg.OllamaImage == "" {

                return "", fmt.Errorf(

                    "Ollama is not supported for accelerator resource %q; "+

                        "use vllm instead",

                    cfg.ResourceKey,

                )

            }

            return cfg.OllamaImage, nil

        case "llamacpp":

            if cfg.LlamaCppImage == "" {

                return "", fmt.Errorf(

                    "llama.cpp is not supported for accelerator resource %q; "+

                        "use vllm instead",

                    cfg.ResourceKey,

                )

            }

            return cfg.LlamaCppImage, nil

        default:

            return "", fmt.Errorf("unknown inferenceEngine %q", engine)

        }

    }
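
The engine and vendor rules above reduce to one convention: an empty image
string marks an unsupported combination. The following is a minimal sketch of
that convention in isolation, so that the compatibility matrix can be printed
or tested without a cluster. The names engineImages, imagesByVendor, and
supportedEngines are hypothetical and are not part of the operator; the struct
is a simplified stand-in for AcceleratorConfig that carries only the three
image fields, with the image strings copied from the listing.

```go
package main

import (
	"fmt"
	"sort"
)

// engineImages is a simplified, hypothetical stand-in for the image fields
// of AcceleratorConfig. An empty string marks an unsupported combination.
type engineImages struct {
	Vllm     string
	Ollama   string
	LlamaCpp string
}

// imagesByVendor mirrors the image choices made in the listing above.
var imagesByVendor = map[string]engineImages{
	"nvidia": {
		Vllm:     "vllm/vllm-openai:v0.8.5",
		Ollama:   "ollama/ollama:0.6.5",
		LlamaCpp: "ghcr.io/ggerganov/llama.cpp:server-cuda",
	},
	"amd": {
		Vllm:     "rocm/vllm-openai:v0.8.5-rocm6.2",
		Ollama:   "ollama/ollama:0.6.5-rocm",
		LlamaCpp: "ghcr.io/ggerganov/llama.cpp:server-rocm",
	},
	"intel-gaudi": {
		Vllm: "vault.habana.ai/gaudi-docker/1.21.0/ubuntu22.04/habanalabs/vllm-fork:latest",
	},
	"cpu": {
		Ollama:   "ollama/ollama:0.6.5",
		LlamaCpp: "ghcr.io/ggerganov/llama.cpp:server",
	},
}

// supportedEngines returns the engines usable on a vendor, sorted by name.
func supportedEngines(vendor string) ([]string, error) {
	img, ok := imagesByVendor[vendor]
	if !ok {
		return nil, fmt.Errorf("unknown acceleratorVendor %q", vendor)
	}
	candidates := map[string]string{
		"vllm":     img.Vllm,
		"ollama":   img.Ollama,
		"llamacpp": img.LlamaCpp,
	}
	engines := []string{}
	for name, image := range candidates {
		if image != "" {
			engines = append(engines, name)
		}
	}
	// Map iteration order is random in Go, so sort for stable output.
	sort.Strings(engines)
	return engines, nil
}

func main() {
	for _, vendor := range []string{"nvidia", "amd", "intel-gaudi", "cpu"} {
		engines, err := supportedEngines(vendor)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%-12s %v\n", vendor, engines)
	}
}
```

Running the sketch lists all three engines for nvidia and amd, only vllm for
intel-gaudi, and llamacpp and ollama for cpu. A table of this kind is also a
convenient basis for a unit test that fails when a new vendor is added without
deciding which engines it supports.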


FILE: controllers/llmmodel_controller.go


    package controllers


    import (

        "context"

        "fmt"

        "sort"

        "time"


        appsv1 "k8s.io/api/apps/v1"

        corev1 "k8s.io/api/core/v1"

        "k8s.io/apimachinery/pkg/api/errors"

        "k8s.io/apimachinery/pkg/api/resource"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

        "k8s.io/apimachinery/pkg/runtime"

        "k8s.io/apimachinery/pkg/util/intstr"

        ctrl "sigs.k8s.io/controller-runtime"

        "sigs.k8s.io/controller-runtime/pkg/client"

        "sigs.k8s.io/controller-runtime/pkg/log"


        aiv1alpha1 "github.com/example/llm-operator/api/v1alpha1"

    )


    // LLMModelReconciler reconciles LLMModel resources.

    type LLMModelReconciler struct {

        client.Client

        Scheme    *runtime.Scheme

        McpClient McpRegistryClient

    }


    // McpRegistryClient is an interface for registering tools with the MCP server.

    type McpRegistryClient interface {

        RegisterTool(ctx context.Context, tool McpToolDefinition) error

        UnregisterTool(ctx context.Context, toolName string) error

    }


    // McpToolDefinition describes a tool to be registered with the MCP server.

    type McpToolDefinition struct {

        Name        string

        Description string

        Endpoint    string

        ModelId     string

        Domains     []string

    }


    // Reconcile is called by controller-runtime whenever an LLMModel changes.

    func (r *LLMModelReconciler) Reconcile(

        ctx context.Context,

        req ctrl.Request,

    ) (ctrl.Result, error) {

        logger := log.FromContext(ctx)

        logger.Info("Reconciling LLMModel", "name", req.Name, "namespace", req.Namespace)


        // Step 1: Fetch the LLMModel resource.

        model := &aiv1alpha1.LLMModel{}

        if err := r.Get(ctx, req.NamespacedName, model); err != nil {

            if errors.IsNotFound(err) {

                return ctrl.Result{}, nil

            }

            return ctrl.Result{}, fmt.Errorf("fetching LLMModel: %w", err)

        }


        // Step 2: Handle deletion via finalizer.

        finalizerName := "ai.example.io/mcp-cleanup"

        if model.DeletionTimestamp != nil {

            if containsString(model.Finalizers, finalizerName) {

                if err := r.cleanupMcpRegistration(ctx, model); err != nil {

                    return ctrl.Result{}, err

                }

                model.Finalizers = removeString(model.Finalizers, finalizerName)

                if err := r.Update(ctx, model); err != nil {

                    return ctrl.Result{}, err

                }

            }

            return ctrl.Result{}, nil

        }


        // Step 3: Add the finalizer if not present.

        if !containsString(model.Finalizers, finalizerName) {

            model.Finalizers = append(model.Finalizers, finalizerName)

            if err := r.Update(ctx, model); err != nil {

                return ctrl.Result{}, err

            }

            // Re-fetch after update to get the latest resourceVersion.

            if err := r.Get(ctx, req.NamespacedName, model); err != nil {

                return ctrl.Result{}, err

            }

        }


        // Step 4: Sync discovery labels. We do this before the main

        // reconciliation so that labels are always current.

        if err := r.syncDiscoveryLabels(ctx, model); err != nil {

            return ctrl.Result{}, fmt.Errorf("syncing discovery labels: %w", err)

        }

        // Re-fetch after label update.

        if err := r.Get(ctx, req.NamespacedName, model); err != nil {

            return ctrl.Result{}, err

        }


        // Step 5: Branch on deployment mode.

        var reconcileErr error

        switch model.Spec.DeploymentMode {

        case "local":

            reconcileErr = r.reconcileLocalModel(ctx, model)

        case "remote":

            reconcileErr = r.reconcileRemoteModel(ctx, model)

        default:

            reconcileErr = fmt.Errorf(

                "unknown deploymentMode: %s", model.Spec.DeploymentMode,

            )

        }


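        if reconcileErr != nil {

            model.Status.Phase = "Failed"

            // Report the original error, but do not silently drop a failed

            // status write: log it so that a stale phase can be diagnosed.

            if statusErr := r.Status().Update(ctx, model); statusErr != nil {

                logger.Error(statusErr, "updating LLMModel status to Failed")

            }

            return ctrl.Result{}, reconcileErr

        }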


        // Step 6: Requeue after 30 seconds to refresh metrics in the status.

        return ctrl.Result{RequeueAfter: 30 * time.Second}, nil

    }


    // syncDiscoveryLabels updates the LLMModel's labels to reflect its spec,

    // enabling efficient kubectl and API server queries.

    // NOTE: This calls r.Update (not r.Status().Update) because labels are

    // metadata, not status. The caller must re-fetch after this call.

    func (r *LLMModelReconciler) syncDiscoveryLabels(

        ctx context.Context,

        model *aiv1alpha1.LLMModel,

    ) error {

        updated := model.DeepCopy()

        if updated.Labels == nil {

            updated.Labels = make(map[string]string)

        }


        // Domain labels: ai.example.io/domain-<name>=true

        for _, domain := range model.Spec.Domains {

            updated.Labels[fmt.Sprintf("ai.example.io/domain-%s", domain)] = "true"

        }


        // Context window tier labels (cumulative: a 256K model also gets 128K label)

        contextK := model.Spec.ContextWindowK

        tiers := []struct {

            threshold int

            label     string

        }{

            {10000, "ai.example.io/context-10m"},

            {1000, "ai.example.io/context-1m"},

            {256, "ai.example.io/context-256k"},

            {128, "ai.example.io/context-128k"},

            {64, "ai.example.io/context-64k"},

            {32, "ai.example.io/context-32k"},

        }

        for _, tier := range tiers {

            if contextK >= tier.threshold {

                updated.Labels[tier.label] = "true"

            }

        }


        // VRAM tier label

        vram := model.Spec.Resources.VramPerAcceleratorGiB

        switch {

        case vram == 0:

            updated.Labels["ai.example.io/vram-tier"] = "cpu"

        case vram <= 8:

            updated.Labels["ai.example.io/vram-tier"] = "8gb"

        case vram <= 16:

            updated.Labels["ai.example.io/vram-tier"] = "16gb"

        case vram <= 24:

            updated.Labels["ai.example.io/vram-tier"] = "24gb"

        case vram <= 48:

            updated.Labels["ai.example.io/vram-tier"] = "48gb"

        case vram <= 80:

            updated.Labels["ai.example.io/vram-tier"] = "80gb"

        case vram <= 141:

            updated.Labels["ai.example.io/vram-tier"] = "141gb"

        case vram <= 192:

            updated.Labels["ai.example.io/vram-tier"] = "192gb"

        default:

            updated.Labels["ai.example.io/vram-tier"] = "multi-accelerator"

        }


        // Other discovery labels

        updated.Labels["ai.example.io/model-type"] = model.Spec.ModelType

        updated.Labels["ai.example.io/quantization"] = model.Spec.Quantization

        updated.Labels["ai.example.io/deployment-mode"] = model.Spec.DeploymentMode

        updated.Labels["ai.example.io/accelerator-vendor"] = model.Spec.AcceleratorVendor


        return r.Update(ctx, updated)

    }


    // reconcileLocalModel handles the full lifecycle of a locally-hosted model.

    func (r *LLMModelReconciler) reconcileLocalModel(

        ctx context.Context,

        model *aiv1alpha1.LLMModel,

    ) error {

        // Resolve hardware configuration for this vendor.

        hwCfg, err := resolveAcceleratorConfig(model)

        if err != nil {

            return err

        }


        // Validate engine/vendor compatibility.

        if _, err := inferenceImageFor(model.Spec.InferenceEngine, hwCfg); err != nil {

            return err

        }


        // Ensure the model cache PVC exists.

        if err := r.ensureModelCachePvc(ctx, model); err != nil {

            return fmt.Errorf("ensuring model cache PVC: %w", err)

        }


        // Ensure the inference Deployment exists and matches the spec.

        if err := r.ensureInferenceDeployment(ctx, model, hwCfg); err != nil {

            return fmt.Errorf("ensuring inference deployment: %w", err)

        }


        // Ensure the Service exists.

        if err := r.ensureInferenceService(ctx, model); err != nil {

            return fmt.Errorf("ensuring inference service: %w", err)

        }


        // Ensure the KEDA ScaledObject exists.

        if err := r.ensureKedaScaledObject(ctx, model); err != nil {

            return fmt.Errorf("ensuring KEDA ScaledObject: %w", err)

        }


        // Update status. We use Status().Update() here — separate from the

        // label update in syncDiscoveryLabels — to avoid a double-update conflict.

        endpoint := fmt.Sprintf(

            "http://%s.%s.svc.cluster.local:8000/v1",

            model.Name, model.Namespace,

        )

        model.Status.Endpoint = endpoint

        model.Status.Phase = "Ready"


        if model.Spec.McpExposure.Enabled {

            if err := r.McpClient.RegisterTool(ctx, McpToolDefinition{

                Name:        model.Spec.McpExposure.ToolName,

                Description: model.Spec.McpExposure.ToolDescription,

                Endpoint:    endpoint,

                ModelId:     model.Spec.ModelId,

                Domains:     model.Spec.Domains,

            }); err != nil {

                return fmt.Errorf("registering MCP tool: %w", err)

            }

        }


        return r.Status().Update(ctx, model)

    }


    // ensureModelCachePvc creates or updates the PVC for model weight caching.

    func (r *LLMModelReconciler) ensureModelCachePvc(

        ctx context.Context,

        model *aiv1alpha1.LLMModel,

    ) error {

        // Estimate storage needed: weights + some overhead.

        // We use vramPerAcceleratorGiB * acceleratorCount * 2 as a heuristic,

        // with a minimum of 20Gi and a maximum of 2Ti.

        vram := model.Spec.Resources.VramPerAcceleratorGiB

        count := model.Spec.Resources.AcceleratorCount

        if count < 1 {

            count = 1

        }

        estimatedGiB := vram * count * 2

        if estimatedGiB < 20 {

            estimatedGiB = 20

        }

        if estimatedGiB > 2048 {

            estimatedGiB = 2048

        }


        storageQty := resource.MustParse(fmt.Sprintf("%dGi", estimatedGiB))

        pvc := &corev1.PersistentVolumeClaim{

            ObjectMeta: metav1.ObjectMeta{

                Name:      fmt.Sprintf("%s-model-cache", model.Name),

                Namespace: model.Namespace,

            },

        }


        _, err := ctrl.CreateOrUpdate(ctx, r.Client, pvc, func() error {

            if err := ctrl.SetControllerReference(model, pvc, r.Scheme); err != nil {

                return err

            }

            // Only set spec on creation; PVC spec is immutable after creation.

            if pvc.CreationTimestamp.IsZero() {

                storageClassName := "standard"

                pvc.Spec = corev1.PersistentVolumeClaimSpec{

                    AccessModes: []corev1.PersistentVolumeAccessMode{

                        corev1.ReadWriteOnce,

                    },

                    Resources: corev1.VolumeResourceRequirements{

                        Requests: corev1.ResourceList{

                            corev1.ResourceStorage: storageQty,

                        },

                    },

                    StorageClassName: &storageClassName,

                }

            }

            return nil

        })

        return err

    }


    // ensureInferenceDeployment creates or updates the inference server Deployment.

    // This function is vendor-agnostic; all vendor-specific decisions come from hwCfg.

    func (r *LLMModelReconciler) ensureInferenceDeployment(

        ctx context.Context,

        model *aiv1alpha1.LLMModel,

        hwCfg AcceleratorConfig,

    ) error {

        image, err := inferenceImageFor(model.Spec.InferenceEngine, hwCfg)

        if err != nil {

            return err

        }


        // Build the inference server command.

        var command []string

        var args []string


        switch model.Spec.InferenceEngine {

        case "vllm":

            command = []string{"python3", "-m", "vllm.entrypoints.openai.api_server"}

            args = []string{

                "--model", model.Spec.ModelId,

                "--port", "8000",

                "--host", "0.0.0.0",

            }

            // Append engineArgs in sorted key order for determinism.

            keys := make([]string, 0, len(model.Spec.EngineArgs))

            for k := range model.Spec.EngineArgs {

                keys = append(keys, k)

            }

            sort.Strings(keys)

            for _, k := range keys {

                args = append(args, fmt.Sprintf("--%s", k), model.Spec.EngineArgs[k])

            }


        case "ollama":

            // Ollama listens on port 11434 by default. Rather than proxying,

            // we set the OLLAMA_HOST env var (below) so that it binds to

            // 0.0.0.0:8000 and serves its OpenAI-compatible /v1 API on the

            // same port as the other engines.

            command = []string{"/bin/ollama"}

            args = []string{"serve"}


        case "llamacpp":

            command = []string{"/server"}

            args = []string{

                "--model", "/model-cache/" + model.Spec.ModelId,

                "--port", "8000",

                "--host", "0.0.0.0",

                "--ctx-size", fmt.Sprintf("%d", model.Spec.ContextWindowK*1024),

            }

            keys := make([]string, 0, len(model.Spec.EngineArgs))

            for k := range model.Spec.EngineArgs {

                keys = append(keys, k)

            }

            sort.Strings(keys)

            for _, k := range keys {

                args = append(args, fmt.Sprintf("--%s", k), model.Spec.EngineArgs[k])

            }

        }


        // Build environment variables.

        envVars := []corev1.EnvVar{

            {Name: "HF_HOME", Value: "/model-cache"},

        }

        if model.Spec.ModelSource != nil &&

            model.Spec.ModelSource.HuggingFaceTokenSecret != "" {

            envVars = append(envVars, corev1.EnvVar{

                Name: "HUGGING_FACE_HUB_TOKEN",

                ValueFrom: &corev1.EnvVarSource{

                    SecretKeyRef: &corev1.SecretKeySelector{

                        LocalObjectReference: corev1.LocalObjectReference{

                            Name: model.Spec.ModelSource.HuggingFaceTokenSecret,

                        },

                        Key: "token",

                    },

                },

            })

        }

        // Append vendor-specific env vars.

        envVars = append(envVars, hwCfg.AdditionalEnvVars...)


        // For Ollama, set the host binding.

        if model.Spec.InferenceEngine == "ollama" {

            envVars = append(envVars, corev1.EnvVar{

                Name:  "OLLAMA_HOST",

                Value: "0.0.0.0:8000",

            })

        }


        // Resource requirements.

        cpuQty := resource.MustParse(fmt.Sprintf("%dm", model.Spec.Resources.CpuMillicores))

        memQty := resource.MustParse(fmt.Sprintf("%dGi", model.Spec.Resources.MemoryGiB))


        resourceRequests := corev1.ResourceList{

            corev1.ResourceCPU:    cpuQty,

            corev1.ResourceMemory: memQty,

        }

        resourceLimits := corev1.ResourceList{

            corev1.ResourceCPU:    cpuQty,

            corev1.ResourceMemory: memQty,

        }

        if hwCfg.AcceleratorQuantity != nil && hwCfg.ResourceKey != "" {

            resourceRequests[corev1.ResourceName(hwCfg.ResourceKey)] = *hwCfg.AcceleratorQuantity

            resourceLimits[corev1.ResourceName(hwCfg.ResourceKey)] = *hwCfg.AcceleratorQuantity

        }


        replicas := int32(model.Spec.Scaling.MinReplicas)

        if replicas < 1 {

            replicas = 1

        }


        // Shared memory volume size: 16Gi for single-accelerator, 64Gi for multi.

        shmSize := resource.MustParse("16Gi")

        if model.Spec.Resources.AcceleratorCount > 1 {

            shmSize = resource.MustParse("64Gi")

        }


        // Volume mounts for the inference container.

        volumeMounts := []corev1.VolumeMount{

            {Name: "model-cache", MountPath: "/model-cache"},

            {Name: "shm", MountPath: "/dev/shm"},

        }


        // Volumes.

        volumes := []corev1.Volume{

            {

                Name: "model-cache",

                VolumeSource: corev1.VolumeSource{

                    PersistentVolumeClaim: &corev1.PersistentVolumeClaimVolumeSource{

                        ClaimName: fmt.Sprintf("%s-model-cache", model.Name),

                    },

                },

            },

            {

                Name: "shm",

                VolumeSource: corev1.VolumeSource{

                    EmptyDir: &corev1.EmptyDirVolumeSource{

                        Medium:    corev1.StorageMediumMemory,

                        SizeLimit: &shmSize,

                    },

                },

            },

        }


        // Init container: download model weights before the server starts.

        initContainers := []corev1.Container{

            {

                Name:  "model-downloader",

                Image: "huggingface/downloader:latest",

                Command: []string{

                    "huggingface-cli", "download",

                    model.Spec.ModelId,

                    "--local-dir", "/model-cache",

                },

                Env: envVars,

                VolumeMounts: []corev1.VolumeMount{

                    {Name: "model-cache", MountPath: "/model-cache"},

                },

            },

        }


        // Health endpoint: vLLM and llama.cpp serve /health, whereas Ollama

        // has no /health route and answers on its root path instead.

        healthPath := "/health"

        if model.Spec.InferenceEngine == "ollama" {

            healthPath = "/"

        }


        // Pod spec.

        podSpec := corev1.PodSpec{

            Tolerations:    hwCfg.Tolerations,

            NodeSelector:   hwCfg.NodeSelectorLabels,

            HostIPC:        hwCfg.HostIPC,

            HostNetwork:    hwCfg.HostNetwork,

            InitContainers: initContainers,

            Containers: []corev1.Container{

                {

                    Name:    "inference-server",

                    Image:   image,

                    Command: command,

                    Args:    args,

                    Env:     envVars,

                    Ports: []corev1.ContainerPort{

                        {Name: "http", ContainerPort: 8000, Protocol: corev1.ProtocolTCP},

                    },

                    Resources: corev1.ResourceRequirements{

                        Requests: resourceRequests,

                        Limits:   resourceLimits,

                    },

                    VolumeMounts: volumeMounts,

                    ReadinessProbe: &corev1.Probe{

                        ProbeHandler: corev1.ProbeHandler{

                            HTTPGet: &corev1.HTTPGetAction{

                                Path: healthPath,

                                Port: intstr.FromInt(8000),

                            },

                        },

                        InitialDelaySeconds: 300,

                        PeriodSeconds:       10,

                        FailureThreshold:    60,

                    },

                    LivenessProbe: &corev1.Probe{

                        ProbeHandler: corev1.ProbeHandler{

                            HTTPGet: &corev1.HTTPGetAction{

                                Path: healthPath,

                                Port: intstr.FromInt(8000),

                            },

                        },

                        InitialDelaySeconds: 360,

                        PeriodSeconds:       30,

                        FailureThreshold:    5,

                    },

                },

            },

            Volumes: volumes,

        }


        deployment := &appsv1.Deployment{

            ObjectMeta: metav1.ObjectMeta{

                Name:      model.Name,

                Namespace: model.Namespace,

            },

        }


        _, err = ctrl.CreateOrUpdate(ctx, r.Client, deployment, func() error {

            if err := ctrl.SetControllerReference(model, deployment, r.Scheme); err != nil {

                return err

            }

            // On updates, keep the live replica count so that this controller

            // does not undo KEDA's scaling decisions on every reconcile.

            desiredReplicas := &replicas

            if !deployment.CreationTimestamp.IsZero() && deployment.Spec.Replicas != nil {

                desiredReplicas = deployment.Spec.Replicas

            }

            deployment.Spec = appsv1.DeploymentSpec{

                Replicas: desiredReplicas,

                Selector: &metav1.LabelSelector{

                    MatchLabels: map[string]string{

                        "app":                      model.Name,

                        "ai.example.io/model-name": model.Name,

                    },

                },

                Template: corev1.PodTemplateSpec{

                    ObjectMeta: metav1.ObjectMeta{

                        Labels: map[string]string{

                            "app":                      model.Name,

                            "ai.example.io/model-name": model.Name,

                            "ai.example.io/engine":     model.Spec.InferenceEngine,

                            "ai.example.io/vendor":     model.Spec.AcceleratorVendor,

                        },

                        Annotations: map[string]string{

                            "prometheus.io/scrape": "true",

                            "prometheus.io/port":   "8000",

                            "prometheus.io/path":   "/metrics",

                        },

                    },

                    Spec: podSpec,

                },

            }

            return nil

        })

        return err

    }


    // ensureInferenceService creates or updates the ClusterIP Service.

    func (r *LLMModelReconciler) ensureInferenceService(

        ctx context.Context,

        model *aiv1alpha1.LLMModel,

    ) error {

        svc := &corev1.Service{

            ObjectMeta: metav1.ObjectMeta{

                Name:      model.Name,

                Namespace: model.Namespace,

            },

        }

        _, err := ctrl.CreateOrUpdate(ctx, r.Client, svc, func() error {

            if err := ctrl.SetControllerReference(model, svc, r.Scheme); err != nil {

                return err

            }

            svc.Spec = corev1.ServiceSpec{

                Selector: map[string]string{

                    "ai.example.io/model-name": model.Name,

                },

                Ports: []corev1.ServicePort{

                    {

                        Name:       "http",

                        Port:       8000,

                        TargetPort: intstr.FromInt(8000),

                        Protocol:   corev1.ProtocolTCP,

                    },

                },

                Type: corev1.ServiceTypeClusterIP,

            }

            return nil

        })

        return err

    }


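    // ensureKedaScaledObject builds the desired KEDA ScaledObject. In this

    // listing the object is only constructed, not applied to the cluster

    // (see the note at the end of the function).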

    // vLLM emits the same Prometheus metrics regardless of the underlying

    // accelerator vendor, so the KEDA configuration is vendor-agnostic.

    func (r *LLMModelReconciler) ensureKedaScaledObject(

        ctx context.Context,

        model *aiv1alpha1.LLMModel,

    ) error {

        // KEDA ScaledObject is a custom resource. We use unstructured to avoid

        // importing the KEDA API package as a dependency.

        scaledObject := map[string]interface{}{

            "apiVersion": "keda.sh/v1alpha1",

            "kind":       "ScaledObject",

            "metadata": map[string]interface{}{

                "name":      model.Name,

                "namespace": model.Namespace,

                "ownerReferences": []interface{}{

                    map[string]interface{}{

                        "apiVersion":         "ai.example.io/v1alpha1",

                        "kind":               "LLMModel",

                        "name":               model.Name,

                        "uid":                string(model.UID),

                        "controller":         true,

                        "blockOwnerDeletion": true,

                    },

                },

            },

            "spec": map[string]interface{}{

                "scaleTargetRef": map[string]interface{}{

                    "apiVersion": "apps/v1",

                    "kind":       "Deployment",

                    "name":       model.Name,

                },

                "minReplicaCount": int64(model.Spec.Scaling.MinReplicas),

                "maxReplicaCount": int64(model.Spec.Scaling.MaxReplicas),

                "cooldownPeriod":  int64(model.Spec.Scaling.CooldownPeriodSeconds),

                "pollingInterval": int64(15),

                "triggers": []interface{}{

                    map[string]interface{}{

                        "type": "prometheus",

                        "metadata": map[string]interface{}{

                            "serverAddress": "http://prometheus-server.monitoring.svc.cluster.local:9090",

                            "metricName":    "vllm_requests_waiting",

                            "query": fmt.Sprintf(

                                `sum(vllm:num_requests_waiting{namespace="%s",pod=~"%s-.*"})`,

                                model.Namespace, model.Name,

                            ),

                            "threshold":           fmt.Sprintf("%d", model.Spec.Scaling.ScaleUpThreshold),

                            "activationThreshold": "1",

                        },

                    },

                },

            },

        }


        // NOTE: This example builds the ScaledObject but does not submit it to

        // the API server; the apply step is omitted for brevity. In production,

        // apply it with server-side apply through the dynamic client, or import

        // the KEDA API types and use a typed client.

        _ = scaledObject

        // See the Helm chart values for the complete KEDA ScaledObject YAML,

        // which is applied as a separate manifest in config/keda/.

        return nil

    }


    // reconcileRemoteModel creates a lightweight proxy for remote API models.

    func (r *LLMModelReconciler) reconcileRemoteModel(

        ctx context.Context,

        model *aiv1alpha1.LLMModel,

    ) error {

        if model.Spec.RemoteApi == nil {

            return fmt.Errorf(

                "LLMModel %s has deploymentMode=remote but no remoteApi spec",

                model.Name,

            )

        }


        // ConfigMap with proxy configuration.

        configMap := &corev1.ConfigMap{

            ObjectMeta: metav1.ObjectMeta{

                Name:      fmt.Sprintf("%s-proxy-config", model.Name),

                Namespace: model.Namespace,

            },

        }

        _, err := ctrl.CreateOrUpdate(ctx, r.Client, configMap, func() error {

            if err := ctrl.SetControllerReference(model, configMap, r.Scheme); err != nil {

                return err

            }

            configMap.Data = map[string]string{

                "upstream_url":   model.Spec.RemoteApi.BaseUrl,

                "model_id":       model.Spec.ModelId,

                "rate_limit_rpm": fmt.Sprintf("%d", model.Spec.RemoteApi.RateLimitRpm),

            }

            return nil

        })

        if err != nil {

            return fmt.Errorf("ensuring proxy ConfigMap: %w", err)

        }


        // Proxy Deployment — no GPU resources, minimal footprint.

        proxyReplicas := int32(2)

        proxyDeployment := &appsv1.Deployment{

            ObjectMeta: metav1.ObjectMeta{

                Name:      model.Name,

                Namespace: model.Namespace,

            },

        }

        _, err = ctrl.CreateOrUpdate(ctx, r.Client, proxyDeployment, func() error {

            if err := ctrl.SetControllerReference(model, proxyDeployment, r.Scheme); err != nil {

                return err

            }

            proxyDeployment.Spec = appsv1.DeploymentSpec{

                Replicas: &proxyReplicas,

                Selector: &metav1.LabelSelector{

                    MatchLabels: map[string]string{

                        "app":                      model.Name,

                        "ai.example.io/model-name": model.Name,

                    },

                },

                Template: corev1.PodTemplateSpec{

                    ObjectMeta: metav1.ObjectMeta{

                        Labels: map[string]string{

                            "app":                      model.Name,

                            "ai.example.io/model-name": model.Name,

                            "ai.example.io/mode":       "remote-proxy",

                        },

                    },

                    Spec: corev1.PodSpec{

                        Containers: []corev1.Container{

                            {

                                Name:  "api-proxy",

                                Image: "ghcr.io/example/llm-api-proxy:v1.2.0",

                                Ports: []corev1.ContainerPort{

                                    {ContainerPort: 8000, Protocol: corev1.ProtocolTCP},

                                },

                                Env: []corev1.EnvVar{

                                    {

                                        Name: "PROXY_API_KEY",

                                        ValueFrom: &corev1.EnvVarSource{

                                            SecretKeyRef: &corev1.SecretKeySelector{

                                                LocalObjectReference: corev1.LocalObjectReference{

                                                    Name: model.Spec.RemoteApi.ApiKeySecret,

                                                },

                                                Key: "apiKey",

                                            },

                                        },

                                    },

                                    {

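                                        // The ConfigMap volume projects each key

                                        // (upstream_url, model_id, rate_limit_rpm) as

                                        // its own file under /etc/proxy; there is no

                                        // config.yaml key, so point at the directory.

                                        Name:  "PROXY_CONFIG_PATH",

                                        Value: "/etc/proxy",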

                                    },

                                },

                                Resources: corev1.ResourceRequirements{

                                    Requests: corev1.ResourceList{

                                        corev1.ResourceCPU:    resource.MustParse("100m"),

                                        corev1.ResourceMemory: resource.MustParse("128Mi"),

                                    },

                                    Limits: corev1.ResourceList{

                                        corev1.ResourceCPU:    resource.MustParse("500m"),

                                        corev1.ResourceMemory: resource.MustParse("256Mi"),

                                    },

                                },

                                VolumeMounts: []corev1.VolumeMount{

                                    {Name: "proxy-config", MountPath: "/etc/proxy"},

                                },

                                ReadinessProbe: &corev1.Probe{

                                    ProbeHandler: corev1.ProbeHandler{

                                        HTTPGet: &corev1.HTTPGetAction{

                                            Path: "/health",

                                            Port: intstr.FromInt(8000),

                                        },

                                    },

                                    InitialDelaySeconds: 5,

                                    PeriodSeconds:       10,

                                    FailureThreshold:    3,

                                },

                            },

                        },

                        Volumes: []corev1.Volume{

                            {

                                Name: "proxy-config",

                                VolumeSource: corev1.VolumeSource{

                                    ConfigMap: &corev1.ConfigMapVolumeSource{

                                        LocalObjectReference: corev1.LocalObjectReference{

                                            Name: fmt.Sprintf("%s-proxy-config", model.Name),

                                        },

                                    },

                                },

                            },

                        },

                    },

                },

            }

            return nil

        })

        if err != nil {

            return fmt.Errorf("ensuring proxy deployment: %w", err)

        }


        // Ensure the Service for the proxy.

        if err := r.ensureInferenceService(ctx, model); err != nil {

            return fmt.Errorf("ensuring proxy service: %w", err)

        }


        endpoint := fmt.Sprintf(

            "http://%s.%s.svc.cluster.local:8000/v1",

            model.Name, model.Namespace,

        )

        model.Status.Endpoint = endpoint

        model.Status.Phase = "Proxying"


        if model.Spec.McpExposure.Enabled {

            if err := r.McpClient.RegisterTool(ctx, McpToolDefinition{

                Name:        model.Spec.McpExposure.ToolName,

                Description: model.Spec.McpExposure.ToolDescription,

                Endpoint:    endpoint,

                ModelId:     model.Spec.ModelId,

                Domains:     model.Spec.Domains,

            }); err != nil {

                return fmt.Errorf("registering MCP tool for remote model: %w", err)

            }

        }


        return r.Status().Update(ctx, model)

    }


    // cleanupMcpRegistration removes the MCP tool registration when the

    // LLMModel is being deleted.

    func (r *LLMModelReconciler) cleanupMcpRegistration(

        ctx context.Context,

        model *aiv1alpha1.LLMModel,

    ) error {

        if !model.Spec.McpExposure.Enabled || model.Spec.McpExposure.ToolName == "" {

            return nil

        }

        logger := log.FromContext(ctx)

        logger.Info("Unregistering MCP tool", "toolName", model.Spec.McpExposure.ToolName)

        return r.McpClient.UnregisterTool(ctx, model.Spec.McpExposure.ToolName)

    }


    // SetupWithManager registers the controller with the controller-runtime manager.

    func (r *LLMModelReconciler) SetupWithManager(mgr ctrl.Manager) error {

        return ctrl.NewControllerManagedBy(mgr).

            For(&aiv1alpha1.LLMModel{}).

            Owns(&appsv1.Deployment{}).

            Owns(&corev1.Service{}).

            Owns(&corev1.PersistentVolumeClaim{}).

            Owns(&corev1.ConfigMap{}).

            Complete(r)

    }


    // Helper functions.


    func containsString(slice []string, s string) bool {

        for _, item := range slice {

            if item == s {

                return true

            }

        }

        return false

    }


    func removeString(slice []string, s string) []string {

        result := make([]string, 0, len(slice))

        for _, item := range slice {

            if item != s {

                result = append(result, item)

            }

        }

        return result

    }


FILE: main.go


    package main


    import (

        "flag"

        "os"


        "k8s.io/apimachinery/pkg/runtime"

        utilruntime "k8s.io/apimachinery/pkg/util/runtime"

        clientgoscheme "k8s.io/client-go/kubernetes/scheme"

        ctrl "sigs.k8s.io/controller-runtime"

        "sigs.k8s.io/controller-runtime/pkg/healthz"

        "sigs.k8s.io/controller-runtime/pkg/log/zap"

        "sigs.k8s.io/controller-runtime/pkg/metrics/server"


        aiv1alpha1 "github.com/example/llm-operator/api/v1alpha1"

        "github.com/example/llm-operator/controllers"

    )


    var (

        scheme   = runtime.NewScheme()

        setupLog = ctrl.Log.WithName("setup")

    )


    func init() {

        utilruntime.Must(clientgoscheme.AddToScheme(scheme))

        utilruntime.Must(aiv1alpha1.AddToScheme(scheme))

    }


    func main() {

        var metricsAddr string

        var enableLeaderElection bool

        var probeAddr string

        var mcpServerAddr string


        flag.StringVar(&metricsAddr, "metrics-bind-address", ":8080",

            "The address the metric endpoint binds to.")

        flag.StringVar(&probeAddr, "health-probe-bind-address", ":8081",

            "The address the probe endpoint binds to.")

        flag.BoolVar(&enableLeaderElection, "leader-elect", false,

            "Enable leader election for controller manager.")

        flag.StringVar(&mcpServerAddr, "mcp-server-address",

            "http://mcp-server.ai-models.svc.cluster.local:3000",

            "Address of the MCP server for tool registration.")

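        // Bind the zap logging flags (for example --zap-devel) before parsing,

        // so the logger can be reconfigured from the command line.

        opts := zap.Options{Development: true}

        opts.BindFlags(flag.CommandLine)

        flag.Parse()


        ctrl.SetLogger(zap.New(zap.UseFlagOptions(&opts)))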


        mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{

            Scheme: scheme,

            Metrics: server.Options{

                BindAddress: metricsAddr,

            },

            HealthProbeBindAddress: probeAddr,

            LeaderElection:         enableLeaderElection,

            LeaderElectionID:       "ai.example.io",

        })

        if err != nil {

            setupLog.Error(err, "unable to start manager")

            os.Exit(1)

        }


        mcpClient := controllers.NewHttpMcpClient(mcpServerAddr)


        if err = (&controllers.LLMModelReconciler{

            Client:    mgr.GetClient(),

            Scheme:    mgr.GetScheme(),

            McpClient: mcpClient,

        }).SetupWithManager(mgr); err != nil {

            setupLog.Error(err, "unable to create controller", "controller", "LLMModel")

            os.Exit(1)

        }


        if err := mgr.AddHealthzCheck("healthz", healthz.Ping); err != nil {

            setupLog.Error(err, "unable to set up health check")

            os.Exit(1)

        }

        if err := mgr.AddReadyzCheck("readyz", healthz.Ping); err != nil {

            setupLog.Error(err, "unable to set up ready check")

            os.Exit(1)

        }


        setupLog.Info("starting manager")

        if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {

            setupLog.Error(err, "problem running manager")

            os.Exit(1)

        }

    }


FILE: controllers/mcp_client.go


    package controllers


    import (

        "bytes"

        "context"

        "encoding/json"

        "fmt"

        "net/http"

        "time"

    )


    // HttpMcpClient implements McpRegistryClient by calling the MCP server's

    // internal registration REST API. This is a sidecar API distinct from the

    // MCP protocol itself — it is used only by the controller to push

    // registrations into the MCP server's in-memory tool registry.

    type HttpMcpClient struct {

        baseURL    string

        httpClient *http.Client

    }


    // NewHttpMcpClient creates a new HttpMcpClient.

    func NewHttpMcpClient(baseURL string) *HttpMcpClient {

        return &HttpMcpClient{

            baseURL: baseURL,

            httpClient: &http.Client{

                Timeout: 10 * time.Second,

            },

        }

    }


    // RegisterTool calls the MCP server's /admin/tools endpoint to register

    // or update a tool definition.

    func (c *HttpMcpClient) RegisterTool(

        ctx context.Context,

        tool McpToolDefinition,

    ) error {

        body, err := json.Marshal(map[string]interface{}{

            "name":        tool.Name,

            "description": tool.Description,

            "endpoint":    tool.Endpoint,

            "modelId":     tool.ModelId,

            "domains":     tool.Domains,

        })

        if err != nil {

            return fmt.Errorf("marshalling tool definition: %w", err)

        }


        req, err := http.NewRequestWithContext(

            ctx,

            http.MethodPut,

            fmt.Sprintf("%s/admin/tools/%s", c.baseURL, tool.Name),

            bytes.NewReader(body),

        )

        if err != nil {

            return fmt.Errorf("creating register request: %w", err)

        }

        req.Header.Set("Content-Type", "application/json")


        resp, err := c.httpClient.Do(req)

        if err != nil {

            return fmt.Errorf("calling MCP server register: %w", err)

        }

        defer resp.Body.Close()


        if resp.StatusCode != http.StatusOK && resp.StatusCode != http.StatusCreated {

            return fmt.Errorf("MCP server register returned status %d", resp.StatusCode)

        }

        return nil

    }


    // UnregisterTool calls the MCP server's /admin/tools endpoint to remove

    // a tool definition.

    func (c *HttpMcpClient) UnregisterTool(ctx context.Context, toolName string) error {

        req, err := http.NewRequestWithContext(

            ctx,

            http.MethodDelete,

            fmt.Sprintf("%s/admin/tools/%s", c.baseURL, toolName),

            nil,

        )

        if err != nil {

            return fmt.Errorf("creating unregister request: %w", err)

        }


        resp, err := c.httpClient.Do(req)

        if err != nil {

            return fmt.Errorf("calling MCP server unregister: %w", err)

        }

        defer resp.Body.Close()


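        // A 404 means the tool is already gone (for example, the MCP server

        // restarted and lost its in-memory registry). Treat it as success so

        // that cleanup during LLMModel deletion does not fail on it.

        if resp.StatusCode != http.StatusOK &&

            resp.StatusCode != http.StatusNoContent &&

            resp.StatusCode != http.StatusNotFound {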

            return fmt.Errorf("MCP server unregister returned status %d", resp.StatusCode)

        }

        return nil

    }


FILE: Dockerfile (controller)


    # syntax=docker/dockerfile:1

    FROM golang:1.22-alpine AS builder

    WORKDIR /workspace


    COPY go.mod go.sum ./

    RUN go mod download


    COPY api/ api/

    COPY controllers/ controllers/

    COPY main.go main.go


    RUN CGO_ENABLED=0 GOOS=linux GOARCH=amd64 \

        go build -a -o manager main.go


    FROM gcr.io/distroless/static:nonroot

    WORKDIR /

    COPY --from=builder /workspace/manager .

    USER 65532:65532

    ENTRYPOINT ["/manager"]


CHAPTER SIX: RBAC AND CONTROLLER DEPLOYMENT MANIFESTS


FILE: config/rbac/serviceaccount.yaml


    apiVersion: v1

    kind: ServiceAccount

    metadata:

      name: llm-operator-controller

      namespace: llm-operator-system


FILE: config/rbac/role.yaml


    apiVersion: rbac.authorization.k8s.io/v1

    kind: ClusterRole

    metadata:

      name: llm-operator-manager-role

    rules:

      # LLMModel resources — full access

      - apiGroups: ["ai.example.io"]

        resources: ["llmmodels"]

        verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]

      - apiGroups: ["ai.example.io"]

        resources: ["llmmodels/status"]

        verbs: ["get", "update", "patch"]

      - apiGroups: ["ai.example.io"]

        resources: ["llmmodels/finalizers"]

        verbs: ["update"]

      # Core Kubernetes resources the controller manages

      - apiGroups: ["apps"]

        resources: ["deployments"]

        verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]

      - apiGroups: [""]

        resources: ["services", "persistentvolumeclaims", "configmaps"]

        verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]

      - apiGroups: [""]

        resources: ["events"]

        verbs: ["create", "patch"]

      # KEDA ScaledObjects

      - apiGroups: ["keda.sh"]

        resources: ["scaledobjects"]

        verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]

      # Leader election

      - apiGroups: ["coordination.k8s.io"]

        resources: ["leases"]

        verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]


FILE: config/rbac/rolebinding.yaml


    apiVersion: rbac.authorization.k8s.io/v1

    kind: ClusterRoleBinding

    metadata:

      name: llm-operator-manager-rolebinding

    subjects:

      - kind: ServiceAccount

        name: llm-operator-controller

        namespace: llm-operator-system

    roleRef:

      kind: ClusterRole

      name: llm-operator-manager-role

      apiGroup: rbac.authorization.k8s.io


FILE: config/manager/manager.yaml


    apiVersion: apps/v1

    kind: Deployment

    metadata:

      name: llm-operator-controller

      namespace: llm-operator-system

      labels:

        control-plane: controller-manager

    spec:

      replicas: 1

      selector:

        matchLabels:

          control-plane: controller-manager

      template:

        metadata:

          labels:

            control-plane: controller-manager

        spec:

          serviceAccountName: llm-operator-controller

          terminationGracePeriodSeconds: 10

          containers:

            - name: manager

              image: ghcr.io/example/llm-operator:v1.0.0

              command:

                - /manager

              args:

                - --leader-elect

                - --mcp-server-address=http://mcp-server.ai-models.svc.cluster.local:3000

              ports:

                - name: metrics

                  containerPort: 8080

                - name: health

                  containerPort: 8081

              livenessProbe:

                httpGet:

                  path: /healthz

                  port: 8081

                initialDelaySeconds: 15

                periodSeconds: 20

              readinessProbe:

                httpGet:

                  path: /readyz

                  port: 8081

                initialDelaySeconds: 5

                periodSeconds: 10

              resources:

                requests:

                  cpu: 100m

                  memory: 128Mi

                limits:

                  cpu: 500m

                  memory: 256Mi

              securityContext:

                allowPrivilegeEscalation: false

                capabilities:

                  drop:

                    - ALL

                readOnlyRootFilesystem: true

                runAsNonRoot: true

          securityContext:

            runAsNonRoot: true




CHAPTER SEVEN: AUTOSCALING WITH KEDA (VENDOR-AGNOSTIC)


vLLM emits the same Prometheus metrics regardless of the underlying accelerator

vendor (NVIDIA, AMD, or Intel Gaudi). This means our KEDA configuration is

completely vendor-agnostic — we always scale on vllm:num_requests_waiting,

which reflects the depth of the inference queue.


The KEDA ScaledObjects are applied as separate manifests in config/keda/. The

controller's ensureKedaScaledObject function builds the same object in code but

leaves the apply step out, so these YAML files are both the reference and the

manifests you actually apply.


FILE: config/keda/scaledobject-gemma4-26b-amd.yaml


    apiVersion: keda.sh/v1alpha1

    kind: ScaledObject

    metadata:

      name: gemma4-26b-amd

      namespace: ai-models

    spec:

      scaleTargetRef:

        apiVersion: apps/v1

        kind: Deployment

        name: gemma4-26b-amd

      # minReplicaCount: 0 enables scale-to-zero.

      # Set to 1 if cold-start latency (model loading) is unacceptable.

      minReplicaCount: 0

      maxReplicaCount: 3

      # cooldownPeriod: how long KEDA waits after the last trigger reported

      # active before scaling the Deployment back to zero. It applies only to

      # the final step to zero replicas. 300 seconds prevents thrashing on

      # bursty workloads.

      cooldownPeriod: 300

      pollingInterval: 15

      triggers:

        - type: prometheus

          metadata:

            serverAddress: http://prometheus-server.monitoring.svc.cluster.local:9090

            metricName: vllm_requests_waiting

            # Scale up when more than 3 requests are waiting per replica.

            # KEDA scales to ceil(metricValue / threshold) replicas.

            query: >

              sum(vllm:num_requests_waiting{

                namespace="ai-models",

                pod=~"gemma4-26b-amd-.*"

              })

            threshold: "3"

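            # activationThreshold: the scaler becomes active only when the query

            # result is strictly greater than this value, so "1" wakes a

            # scaled-to-zero deployment once at least 2 requests are waiting.

            # Use "0" to wake it on the first waiting request.

            activationThreshold: "1"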
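
The comment in the manifest above states that KEDA scales to ceil(metricValue / threshold) replicas. The arithmetic can be sketched as follows. This is a simplified model, not KEDA's implementation: it ignores the cooldown period and the HPA's tolerance and stabilization windows, and the desiredReplicas and explain functions are hypothetical names introduced for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas approximates the replica count for a queue-depth trigger.
// With minReplicas == 0 the workload stays at zero until the metric is
// strictly greater than the activation threshold. Once active, the target is
// ceil(waiting / threshold), clamped to [max(minReplicas, 1), maxReplicas].
func desiredReplicas(waiting, threshold, activation float64, minReplicas, maxReplicas int) int {
	if minReplicas == 0 && waiting <= activation {
		return 0
	}
	lower := minReplicas
	if lower < 1 {
		lower = 1
	}
	n := int(math.Ceil(waiting / threshold))
	if n < lower {
		n = lower
	}
	if n > maxReplicas {
		n = maxReplicas
	}
	return n
}

// explain formats one scenario for printing.
func explain(waiting, threshold, activation float64, minReplicas, maxReplicas int) string {
	return fmt.Sprintf("%v waiting, threshold %v -> %d replicas",
		waiting, threshold,
		desiredReplicas(waiting, threshold, activation, minReplicas, maxReplicas))
}

func main() {
	// Values taken from the gemma4-26b-amd manifest: threshold 3,
	// activationThreshold 1, min 0, max 3.
	for _, waiting := range []float64{0, 1, 2, 7, 10} {
		fmt.Println(explain(waiting, 3, 1, 0, 3))
	}
}
```

With these settings, 7 waiting requests map to 3 replicas, and 10 waiting requests are capped at the maximum of 3. At that point the queue can only drain as fast as three replicas allow, which is the situation the LLMHighQueueDepth alert is meant to surface.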


FILE: config/keda/scaledobject-qwen36-35b-gaudi.yaml


    apiVersion: keda.sh/v1alpha1

    kind: ScaledObject

    metadata:

      name: qwen36-35b-gaudi

      namespace: ai-models

    spec:

      scaleTargetRef:

        apiVersion: apps/v1

        kind: Deployment

        name: qwen36-35b-gaudi

      minReplicaCount: 0

      maxReplicaCount: 2

      cooldownPeriod: 300

      pollingInterval: 15

      triggers:

        - type: prometheus

          metadata:

            serverAddress: http://prometheus-server.monitoring.svc.cluster.local:9090

            metricName: vllm_requests_waiting

            query: >

              sum(vllm:num_requests_waiting{

                namespace="ai-models",

                pod=~"qwen36-35b-gaudi-.*"

              })

            threshold: "2"

            activationThreshold: "1"
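
The queries in these manifests all follow one pattern, and the controller builds the same string from the LLMModel's namespace and name with fmt.Sprintf. One detail to watch when generating it: Kubernetes object names may contain dots, and the pod=~ matcher treats a dot as a regular-expression wildcard. The sketch below shows one defensive way to build the query. The buildQueueQuery helper is hypothetical and is not part of the controller shown earlier.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// buildQueueQuery returns the PromQL for the queue-depth trigger.
// The name is regex-escaped, and the backslashes that escaping introduces
// are doubled because PromQL double-quoted strings treat a backslash as
// the start of an escape sequence.
func buildQueueQuery(namespace, name string) string {
	escaped := strings.ReplaceAll(regexp.QuoteMeta(name), `\`, `\\`)
	return fmt.Sprintf(
		`sum(vllm:num_requests_waiting{namespace="%s",pod=~"%s-.*"})`,
		namespace, escaped,
	)
}

func main() {
	// A name without regex metacharacters is passed through unchanged.
	fmt.Println(buildQueueQuery("ai-models", "gemma4-26b-amd"))
	// A dotted name (hypothetical) has its dot escaped.
	fmt.Println(buildQueueQuery("ai-models", "qwen3.6-35b"))
}
```

For the names used in this guide the result is identical to the hand-written queries, since none of them contain regex metacharacters.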


FILE: config/keda/scaledobject-gemma4-31b-nvidia.yaml


    apiVersion: keda.sh/v1alpha1

    kind: ScaledObject

    metadata:

      name: gemma4-31b-nvidia

      namespace: ai-models

    spec:

      scaleTargetRef:

        apiVersion: apps/v1

        kind: Deployment

        name: gemma4-31b-nvidia

      minReplicaCount: 1

      maxReplicaCount: 2

      cooldownPeriod: 300

      pollingInterval: 15

      triggers:

        - type: prometheus

          metadata:

            serverAddress: http://prometheus-server.monitoring.svc.cluster.local:9090

            metricName: vllm_requests_waiting

            query: >

              sum(vllm:num_requests_waiting{

                namespace="ai-models",

                pod=~"gemma4-31b-nvidia-.*"

              })

            threshold: "3"

            activationThreshold: "1"


KEDA HTTP Add-On for Scale-to-Zero with Request Buffering:


FILE: config/keda/http-scaledobject-gemma4-26b-amd.yaml


    # The HTTP add-on intercepts requests, holds them while the deployment

    # scales up from zero, and forwards them once a pod is ready.

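    # This prevents request failures during cold starts.

    # The add-on manages its own ScaledObject for the target Deployment, so

    # apply either this resource or the Prometheus ScaledObject above for a

    # given model, not both.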

    apiVersion: http.keda.sh/v1alpha1

    kind: HTTPScaledObject

    metadata:

      name: gemma4-26b-amd-http

      namespace: ai-models

    spec:

      hosts:

        - gemma4-26b-amd.ai-models.svc.cluster.local

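      scaleTargetRef:

        name: gemma4-26b-amd

        # service: the Service the interceptor forwards requests to. It is

        # assumed here to share the Deployment's name, as created by the

        # controller.

        service: gemma4-26b-amd

        port: 8000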

      replicas:

        min: 0

        max: 3

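      # scaledownPeriod: seconds of inactivity before the add-on scales the

      # Deployment back down to the minimum. Large models may need up to 600

      # seconds to start and load their weights; how long buffered requests

      # are held during that time is governed by the add-on's interceptor

      # timeout settings, not by scaledownPeriod.

      scaledownPeriod: 300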



CHAPTER EIGHT: PROMETHEUS RECORDING RULES AND MONITORING


FILE: config/monitoring/prometheus-rules.yaml


    apiVersion: monitoring.coreos.com/v1

    kind: PrometheusRule

    metadata:

      name: llm-operator-rules

      namespace: monitoring

      labels:

        # This label causes the Prometheus Operator to pick up this rule.

        prometheus: kube-prometheus

        role: alert-rules

    spec:

      groups:

        # ----------------------------------------------------------------

        # vLLM metrics — vendor-agnostic (same metric names for all vendors)

        # ----------------------------------------------------------------

        - name: vllm.rules

          interval: 15s

          rules:

            - record: vllm:num_requests_running:sum

              expr: sum by (namespace, app) (vllm:num_requests_running)


            - record: vllm:num_requests_waiting:sum

              expr: sum by (namespace, app) (vllm:num_requests_waiting)


            - record: vllm:time_to_first_token_ms:p50

              expr: >

                histogram_quantile(0.50,

                    sum by (namespace, app, le) (

                        rate(vllm:time_to_first_token_seconds_bucket[5m])

                    )

                ) * 1000


            - record: vllm:time_to_first_token_ms:p95

              expr: >

                histogram_quantile(0.95,

                    sum by (namespace, app, le) (

                        rate(vllm:time_to_first_token_seconds_bucket[5m])

                    )

                ) * 1000


            - record: vllm:tokens_per_second:rate5m

              expr: >

                sum by (namespace, app) (

                    rate(vllm:generation_tokens_total[5m])

                )


        # ----------------------------------------------------------------

        # NVIDIA GPU metrics (DCGM exporter)

        # ----------------------------------------------------------------

        - name: nvidia.gpu.rules

          interval: 15s

          rules:

            - record: nvidia:gpu_memory_used_gib:avg

              expr: >

                avg by (node, gpu) (

                    DCGM_FI_DEV_FB_USED / 1024

                )


            - record: nvidia:gpu_utilization_pct:avg

              expr: >

                avg by (node, gpu) (

                    DCGM_FI_DEV_GPU_UTIL

                )


        # ----------------------------------------------------------------

        # AMD GPU metrics (ROCm SMI exporter)

        # ----------------------------------------------------------------

        - name: amd.gpu.rules

          interval: 15s

          rules:

            - record: amd:gpu_memory_used_gib:avg

              expr: >

                avg by (node, gpu) (

                    rocm_smi_memory_used_bytes / 1073741824

                )


            - record: amd:gpu_utilization_pct:avg

              expr: >

                avg by (node, gpu) (

                    rocm_smi_gpu_use_percent

                )


        # ----------------------------------------------------------------

        # Intel Gaudi metrics (Gaudi metrics exporter)

        # ----------------------------------------------------------------

        - name: gaudi.rules

          interval: 15s

          rules:

            - record: gaudi:memory_used_gib:avg

              expr: >

                avg by (node, device) (

                    habana_gaudi_memory_used_bytes / 1073741824

                )


            - record: gaudi:utilization_pct:avg

              expr: >

                avg by (node, device) (

                    habana_gaudi_util_percent

                )


        # ----------------------------------------------------------------

        # Alerting rules

        # ----------------------------------------------------------------

        - name: llm.alerts

          rules:

            - alert: LLMHighQueueDepth

              expr: vllm:num_requests_waiting:sum > 20

              for: 5m

              labels:

                severity: warning

              annotations:

                summary: "High inference queue depth for {{ $labels.app }}"

                description: >

                  {{ $labels.app }} has {{ $value }} requests waiting.

                  Consider increasing maxReplicas in the LLMModel spec.


            - alert: LLMHighLatency

              expr: vllm:time_to_first_token_ms:p95 > 10000

              for: 5m

              labels:

                severity: warning

              annotations:

                summary: "High TTFT latency for {{ $labels.app }}"

                description: >

                  P95 time-to-first-token for {{ $labels.app }} is

                  {{ $value }}ms, exceeding the 10s threshold.


CHAPTER NINE: THE MCP SERVER


FILE: mcp-server/package.json


    {

      "name": "example-llm-mcp-server",

      "version": "1.0.0",

      "description": "MCP server exposing LLMModel resources as AI agent tools",

      "main": "dist/index.js",

      "scripts": {

        "build": "tsc",

        "start": "node dist/index.js",

        "dev": "ts-node src/index.ts"

      },

      "dependencies": {

        "@kubernetes/client-node": "^0.21.0",

        "@modelcontextprotocol/sdk": "^1.12.0",

        "express": "^4.18.2",

        "openai": "^4.52.0",

        "zod": "^3.23.8"

      },

      "devDependencies": {

        "@types/express": "^4.17.21",

        "@types/node": "^20.0.0",

        "typescript": "^5.4.0",

        "ts-node": "^10.9.2"

      }

    }


FILE: mcp-server/tsconfig.json


    {

      "compilerOptions": {

        "target": "ES2022",

        "module": "commonjs",

        "lib": ["ES2022"],

        "outDir": "./dist",

        "rootDir": "./src",

        "strict": true,

        "esModuleInterop": true,

        "skipLibCheck": true,

        "forceConsistentCasingInFileNames": true,

        "resolveJsonModule": true

      },

      "include": ["src/**/*"],

      "exclude": ["node_modules", "dist"]

    }


FILE: mcp-server/src/index.ts


    // MCP server that exposes LLMModel resources as callable tools.

    // Implements the November 2025 MCP specification.

    // Supports dynamic tool registration via the /admin/tools REST API,

    // which the Kubernetes controller calls when models are added or removed.


    import { randomUUID } from "crypto";

    import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";

    import { StreamableHTTPServerTransport } from "@modelcontextprotocol/sdk/server/streamableHttp.js";

    import { z } from "zod";

    import * as k8s from "@kubernetes/client-node";

    import OpenAI from "openai";

    import express, { Request, Response } from "express";


    // ---------------------------------------------------------------------------

    // Types

    // ---------------------------------------------------------------------------


    interface LLMModelRecord {

        name: string;

        toolName: string;

        toolDescription: string;

        endpoint: string;

        modelId: string;

        domains: string[];

        supportsThinking: boolean;

        contextWindowK: number;

        phase: string;

    }


    interface AdminToolPayload {

        name: string;

        description: string;

        endpoint: string;

        modelId: string;

        domains: string[];

    }


    // ---------------------------------------------------------------------------

    // Model Registry

    // Maintains a live view of available LLMModel resources via Kubernetes watch.

    // Also accepts push updates from the controller via the /admin/tools API.

    // ---------------------------------------------------------------------------


    class ModelRegistry {

        private models: Map<string, LLMModelRecord> = new Map();

        private kc: k8s.KubeConfig;

        private customApi: k8s.CustomObjectsApi;

        // Callbacks registered by the MCP server to be notified when the

        // tool list changes, so clients can be sent the MCP

        // notifications/tools/list_changed notification.

        private changeCallbacks: Array<() => void> = [];


        constructor() {

            this.kc = new k8s.KubeConfig();

            try {

                this.kc.loadFromCluster();

            } catch {

                this.kc.loadFromDefault();

            }

            this.customApi = this.kc.makeApiClient(k8s.CustomObjectsApi);

        }


        onToolListChanged(cb: () => void): void {

            this.changeCallbacks.push(cb);

        }


        private notifyChanged(): void {

            for (const cb of this.changeCallbacks) {

                try { cb(); } catch { /* ignore */ }

            }

        }


        async start(): Promise<void> {

            const namespace = process.env.MODEL_NAMESPACE || "ai-models";


            // Initial list to populate the registry before starting the watch.

            try {

                const list = await this.customApi.listNamespacedCustomObject(

                    "ai.example.io",

                    "v1alpha1",

                    namespace,

                    "llmmodels",

                );

                const items = (list as any).body?.items ?? [];

                for (const item of items) {

                    this.upsertFromK8s(item);

                }

                console.log(`Loaded ${this.models.size} models from Kubernetes registry`);

            } catch (err) {

                console.warn("Could not list LLMModels from Kubernetes:", err);

            }


            // Start watch for real-time updates.

            this.startWatch(namespace);

        }


        private startWatch(namespace: string): void {

            const watch = new k8s.Watch(this.kc);

            watch.watch(

                `/apis/ai.example.io/v1alpha1/namespaces/${namespace}/llmmodels`,

                {},

                (type: string, obj: any) => {

                    if (type === "ADDED" || type === "MODIFIED") {

                        this.upsertFromK8s(obj);

                    } else if (type === "DELETED") {

                        const toolName = obj.spec?.mcpExposure?.toolName;

                        if (toolName) {

                            this.models.delete(toolName);

                            console.log(`Watch: unregistered model tool ${toolName}`);

                            this.notifyChanged();

                        }

                    }

                },

                (err: any) => {

                    console.error("Watch error, reconnecting in 5s:", err);

                    setTimeout(() => this.startWatch(namespace), 5000);

                },

            );

        }


        private upsertFromK8s(obj: any): void {

            const spec = obj.spec ?? {};

            const status = obj.status ?? {};

            const mcpExposure = spec.mcpExposure ?? {};


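            // A model that is no longer exposed, or no longer serving, must be

            // dropped from the registry; otherwise agents keep seeing a dead tool.

            const eligible =

                Boolean(mcpExposure.enabled && mcpExposure.toolName) &&

                ["Ready", "Proxying"].includes(status.phase ?? "");

            if (!eligible) {

                if (mcpExposure.toolName && this.models.delete(mcpExposure.toolName)) {

                    console.log(`Unregistered model tool: ${mcpExposure.toolName}`);

                    this.notifyChanged();

                }

                return;

            }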


            const modelType: string = spec.modelType ?? "Dense";

            const supportsThinking =

                modelType === "DenseThinking" || modelType === "MoEThinking";


            const record: LLMModelRecord = {

                name: obj.metadata?.name ?? "",

                toolName: mcpExposure.toolName,

                toolDescription: mcpExposure.toolDescription ?? "",

                endpoint: status.endpoint ?? "",

                modelId: spec.modelId ?? "",

                domains: spec.domains ?? [],

                supportsThinking,

                contextWindowK: spec.contextWindowK ?? 32,

                phase: status.phase ?? "",

            };


            const isNew = !this.models.has(record.toolName);

            this.models.set(record.toolName, record);

            console.log(`${isNew ? "Registered" : "Updated"} model tool: ${record.toolName}`);

            this.notifyChanged();

        }


        // upsertFromAdmin is called by the /admin/tools PUT endpoint.

        // The controller calls this to push registrations without waiting

        // for the Kubernetes watch to fire.

        upsertFromAdmin(payload: AdminToolPayload): void {

            const existing = this.models.get(payload.name);

            const record: LLMModelRecord = {

                name: payload.name,

                toolName: payload.name,

                toolDescription: payload.description,

                endpoint: payload.endpoint,

                modelId: payload.modelId,

                domains: payload.domains,

                supportsThinking: false, // updated by watch

                contextWindowK: 32,      // updated by watch

                phase: "Ready",

            };

            // Preserve supportsThinking and contextWindowK from existing record.

            if (existing) {

                record.supportsThinking = existing.supportsThinking;

                record.contextWindowK = existing.contextWindowK;

            }

            this.models.set(payload.name, record);

            console.log(`Admin: registered/updated tool ${payload.name}`);

            this.notifyChanged();

        }


        removeByToolName(toolName: string): boolean {

            const existed = this.models.has(toolName);

            this.models.delete(toolName);

            if (existed) {

                console.log(`Admin: unregistered tool ${toolName}`);

                this.notifyChanged();

            }

            return existed;

        }


        getAll(): LLMModelRecord[] {

            return Array.from(this.models.values());

        }


        getByToolName(toolName: string): LLMModelRecord | undefined {

            return this.models.get(toolName);

        }

    }


    // ---------------------------------------------------------------------------

    // MCP Tool Invocation

    // ---------------------------------------------------------------------------


    async function invokeModel(

        model: LLMModelRecord,

        input: {

            messages: Array<{ role: "system" | "user" | "assistant"; content: string }>;

            temperature?: number;

            max_tokens?: number;

            thinking?: boolean;

        },

    ): Promise<{ content: Array<{ type: "text"; text: string }>; isError?: boolean }> {

        const openai = new OpenAI({

            baseURL: model.endpoint,

            apiKey: "not-needed-for-local-models",

        });


        let messages = [...input.messages];


        // Inject thinking mode instruction for hybrid thinking models.

        if (input.thinking && model.supportsThinking) {

            const thinkingInstruction =

                "Think step by step before answering. " +

                "Use <think>...</think> tags for your internal reasoning.";

            const sysIdx = messages.findIndex((m) => m.role === "system");

            if (sysIdx >= 0) {

                messages[sysIdx] = {

                    ...messages[sysIdx],

                    content: messages[sysIdx].content + "\n\n" + thinkingInstruction,

                };

            } else {

                messages = [{ role: "system", content: thinkingInstruction }, ...messages];

            }

        }


        try {

            const completion = await openai.chat.completions.create({

                model: model.modelId,

                messages: messages as any,

                temperature: input.temperature ?? 0.7,

                max_tokens: input.max_tokens ?? 2048,

            });


            const responseText = completion.choices[0]?.message?.content ?? "";

            return { content: [{ type: "text", text: responseText }] };

        } catch (error: any) {

            return {

                content: [

                    {

                        type: "text",

                        text: `Error calling model ${model.modelId}: ${error.message}`,

                    },

                ],

                isError: true,

            };

        }

    }


    // ---------------------------------------------------------------------------

    // MCP Server Setup

    // ---------------------------------------------------------------------------


    function buildMcpServer(registry: ModelRegistry): McpServer {

        const server = new McpServer({

            name: "example-llm-registry",

            version: "1.0.0",

        });


        // Meta-tool: list all available models.

        server.tool(

            "list_available_models",

            "Lists all LLM models currently available in the cluster, " +

            "including their capability domains, context windows, thinking " +

            "support, and deployment status. Call this first to discover " +

            "which model to use for a given task.",

            {},

            async () => {

                const models = registry.getAll();

                const summary = models.map((m) => ({

                    toolName: m.toolName,

                    modelId: m.modelId,

                    domains: m.domains,

                    contextWindowK: m.contextWindowK,

                    supportsThinking: m.supportsThinking,

                    status: m.phase,

                }));

                return {

                    content: [{ type: "text", text: JSON.stringify(summary, null, 2) }],

                };

            },

        );


        // Register each currently-known model as a tool.

        for (const model of registry.getAll()) {

            registerModelTool(server, model);

        }


        // When the registry changes, reconcile the server's tool list with it.

        // McpServer sends notifications/tools/list_changed to connected clients

        // whenever a tool is registered or removed, so no explicit send is needed.

        registry.onToolListChanged(() => syncModelTools(server, registry));


        return server;

    }


    // Handles returned by server.tool(), keyed by tool name. McpServer.tool()

    // throws if the name is already registered, so a changed tool must be

    // removed through its handle before it is registered again.

    const toolHandles = new Map<

        string,

        { handle: ReturnType<McpServer["tool"]>; fingerprint: string }

    >();


    function syncModelTools(server: McpServer, registry: ModelRegistry): void {

        const current = registry.getAll();

        const live = new Set(current.map((m) => m.toolName));

        // Remove tools whose models have left the registry.

        for (const [name, entry] of toolHandles) {

            if (!live.has(name)) {

                entry.handle.remove();

                toolHandles.delete(name);

            }

        }

        // Register new models and refresh the ones whose record changed.

        for (const model of current) {

            registerModelTool(server, model);

        }

    }


    function registerModelTool(server: McpServer, model: LLMModelRecord): void {

        // Skip the work (and the client notification) when nothing changed.

        const fingerprint = JSON.stringify(model);

        const existing = toolHandles.get(model.toolName);

        if (existing?.fingerprint === fingerprint) return;

        existing?.handle.remove();

        const handle = server.tool(

            model.toolName,

            model.toolDescription,

            {

                messages: z.array(

                    z.object({

                        role: z.enum(["system", "user", "assistant"]),

                        content: z.string(),

                    })

                ).describe("Conversation history."),

                temperature: z.number().min(0).max(2).optional()

                    .describe("Sampling temperature (0=deterministic, 2=creative)."),

                max_tokens: z.number().int().positive().optional()

                    .describe("Maximum tokens to generate."),

                thinking: z.boolean().optional()

                    .describe(

                        "Enable chain-of-thought reasoning for models that support " +

                        "hybrid thinking mode (e.g. Qwen3.5, Qwen3.6). " +

                        "Increases latency but improves quality on complex tasks."

                    ),

            },

            async (input) => invokeModel(model, input),

        );

        toolHandles.set(model.toolName, { handle, fingerprint });

    }


    // ---------------------------------------------------------------------------

    // Express Application

    // ---------------------------------------------------------------------------


    async function main(): Promise<void> {

        const registry = new ModelRegistry();

        await registry.start();


        const mcpServer = buildMcpServer(registry);


        const app = express();

        app.use(express.json());


        // Health endpoint — checked by the Kubernetes readiness probe.

        app.get("/health", (_req: Request, res: Response) => {

            res.status(200).json({

                status: "ok",

                models: registry.getAll().length,

            });

        });


        // -----------------------------------------------------------------------

        // Admin API — called by the Kubernetes controller to push tool

        // registrations without waiting for the Kubernetes watch to fire.

        // This API is internal only and should not be exposed outside the cluster.

        // -----------------------------------------------------------------------


        // PUT /admin/tools/:toolName — register or update a tool.

        app.put("/admin/tools/:toolName", (req: Request, res: Response) => {

            const { toolName } = req.params;

            const payload = req.body as AdminToolPayload;

            if (!payload.name || payload.name !== toolName) {

                res.status(400).json({ error: "toolName in URL must match name in body" });

                return;

            }

            registry.upsertFromAdmin(payload);

            res.status(200).json({ status: "registered", toolName });

        });


        // DELETE /admin/tools/:toolName — unregister a tool.

        app.delete("/admin/tools/:toolName", (req: Request, res: Response) => {

            const { toolName } = req.params;

            const existed = registry.removeByToolName(toolName);

            if (existed) {

                res.status(200).json({ status: "unregistered", toolName });

            } else {

                res.status(404).json({ error: "tool not found", toolName });

            }

        });


        // GET /admin/tools — list all registered tools (for debugging).

        app.get("/admin/tools", (_req: Request, res: Response) => {

            res.status(200).json(registry.getAll().map((m) => ({

                toolName: m.toolName,

                modelId: m.modelId,

                endpoint: m.endpoint,

                phase: m.phase,

            })));

        });


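        // -----------------------------------------------------------------------

        // MCP Protocol Endpoint

        // Uses the Streamable HTTP transport, part of the MCP specification

        // since the 2025-03-26 revision.

        // NOTE: with a sessionIdGenerator, one transport instance serves a single

        // MCP session. To serve several concurrent clients, create one transport

        // (and connected server) per session, keyed by the Mcp-Session-Id header.

        // -----------------------------------------------------------------------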

        const transport = new StreamableHTTPServerTransport({

            sessionIdGenerator: () => randomUUID(),

        });


        app.all("/mcp", async (req: Request, res: Response) => {

            await transport.handleRequest(req, res, req.body);

        });


        await mcpServer.connect(transport);


        const port = parseInt(process.env.PORT ?? "3000", 10);

        app.listen(port, () => {

            console.log(`MCP server listening on port ${port}`);

            console.log(`Serving ${registry.getAll().length} models as tools`);

            console.log(`Health: http://localhost:${port}/health`);

            console.log(`MCP:    http://localhost:${port}/mcp`);

            console.log(`Admin:  http://localhost:${port}/admin/tools`);

        });

    }


    main().catch((err) => {

        console.error("Fatal error:", err);

        process.exit(1);

    });
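

The admin API above accepts a JSON body whose name field must equal the tool
name in the URL. As a rough sketch of what a caller might send, the request
can be assembled with the Python standard library. The helper name
build_admin_put and all sample values are hypothetical; the host name is
derived from the mcp-server Service in the ai-models namespace.

```python
import json
import urllib.request
from urllib.parse import quote


def build_admin_put(base_url: str, tool: dict) -> urllib.request.Request:
    """Build (but do not send) a PUT /admin/tools/<name> request.

    `tool` mirrors the AdminToolPayload interface: name, description,
    endpoint, modelId, domains. build_admin_put is a hypothetical helper.
    """
    required = ("name", "description", "endpoint", "modelId", "domains")
    missing = [key for key in required if key not in tool]
    if missing:
        raise ValueError(f"missing payload fields: {missing}")
    # The server answers 400 unless the URL segment and the body's "name"
    # are identical, so the URL is derived from the body.
    url = f"{base_url.rstrip('/')}/admin/tools/{quote(tool['name'], safe='')}"
    return urllib.request.Request(
        url,
        data=json.dumps(tool).encode("utf-8"),
        # express.json() only parses bodies sent with this content type.
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Sample values are placeholders, not taken from a real cluster.
request = build_admin_put(
    "http://mcp-server.ai-models.svc.cluster.local:3000",
    {
        "name": "ask_example_model",
        "description": "Example model tool",
        "endpoint": "http://example-model.ai-models.svc:8000/v1",
        "modelId": "example/model",
        "domains": ["code"],
    },
)
print(request.get_method(), request.full_url)
```

Sending the request is then a single call to
urllib.request.urlopen(request, timeout=5) from inside the cluster.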


FILE: mcp-server/Dockerfile


    # syntax=docker/dockerfile:1

    FROM node:20-alpine AS builder

    WORKDIR /app


    COPY package.json package-lock.json ./

    RUN npm ci


    COPY tsconfig.json ./

    COPY src/ src/

    RUN npm run build


    FROM node:20-alpine AS runtime

    WORKDIR /app


    COPY package.json package-lock.json ./

    RUN npm ci --omit=dev


    COPY --from=builder /app/dist ./dist


    USER node

    EXPOSE 3000

    CMD ["node", "dist/index.js"]


FILE: config/mcp/deployment.yaml


    apiVersion: v1

    kind: ServiceAccount

    metadata:

      name: mcp-server

      namespace: ai-models

    ---

    apiVersion: rbac.authorization.k8s.io/v1

    kind: Role

    metadata:

      name: mcp-server-llmmodel-reader

      namespace: ai-models

    rules:

      - apiGroups: ["ai.example.io"]

        resources: ["llmmodels"]

        verbs: ["get", "list", "watch"]

    ---

    apiVersion: rbac.authorization.k8s.io/v1

    kind: RoleBinding

    metadata:

      name: mcp-server-llmmodel-reader

      namespace: ai-models

    subjects:

      - kind: ServiceAccount

        name: mcp-server

        namespace: ai-models

    roleRef:

      kind: Role

      name: mcp-server-llmmodel-reader

      apiGroup: rbac.authorization.k8s.io

    ---

    apiVersion: apps/v1

    kind: Deployment

    metadata:

      name: mcp-server

      namespace: ai-models

    spec:

      replicas: 2

      selector:

        matchLabels:

          app: mcp-server

      template:

        metadata:

          labels:

            app: mcp-server

        spec:

          serviceAccountName: mcp-server

          containers:

            - name: mcp-server

              image: ghcr.io/example/llm-mcp-server:v1.0.0

              ports:

                - containerPort: 3000

                  name: http

              env:

                - name: MODEL_NAMESPACE

                  value: ai-models

                - name: PORT

                  value: "3000"

              resources:

                requests:

                  cpu: 200m

                  memory: 256Mi

                limits:

                  cpu: 1000m

                  memory: 512Mi

              readinessProbe:

                httpGet:

                  path: /health

                  port: 3000

                initialDelaySeconds: 5

                periodSeconds: 10

                failureThreshold: 3

              livenessProbe:

                httpGet:

                  path: /health

                  port: 3000

                initialDelaySeconds: 15

                periodSeconds: 20

                failureThreshold: 3

    ---

    apiVersion: v1

    kind: Service

    metadata:

      name: mcp-server

      namespace: ai-models

    spec:

      selector:

        app: mcp-server

      ports:

        - port: 3000

          targetPort: 3000

          name: mcp
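

Once the Service is up, an MCP client reaches the server at
http://mcp-server.ai-models.svc.cluster.local:3000/mcp and exchanges JSON-RPC
2.0 messages over Streamable HTTP. The sketch below only builds the messages
and headers for the initialize, tools/list, and tools/call steps so that their
shape is visible. The helper names, the client name, the protocol revision
string, and the tool name ask_example_model are illustrative assumptions; a
real agent would normally use an MCP client SDK instead.

```python
import itertools
import json
from typing import Optional

_ids = itertools.count(1)


def rpc(method: str, params: Optional[dict] = None) -> dict:
    """Build a JSON-RPC 2.0 request with a fresh integer id."""
    message = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        message["params"] = params
    return message


def notification(method: str) -> dict:
    """Build a JSON-RPC 2.0 notification (no id, so no response expected)."""
    return {"jsonrpc": "2.0", "method": method}


def headers(session_id: Optional[str] = None) -> dict:
    """HTTP headers for a Streamable HTTP POST to /mcp."""
    result = {
        "Content-Type": "application/json",
        # The client must accept both plain JSON and SSE responses.
        "Accept": "application/json, text/event-stream",
    }
    if session_id:
        # Issued by the server in the initialize response; echoed afterwards.
        result["Mcp-Session-Id"] = session_id
    return result


initialize = rpc("initialize", {
    "protocolVersion": "2025-03-26",  # assumed; use what your SDK negotiates
    "capabilities": {},
    "clientInfo": {"name": "example-agent", "version": "0.1.0"},
})
initialized = notification("notifications/initialized")
list_tools = rpc("tools/list")
call_tool = rpc("tools/call", {
    "name": "ask_example_model",  # hypothetical tool name
    "arguments": {
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 64,
    },
})

for message in (initialize, initialized, list_tools, call_tool):
    print(json.dumps(message))
```

The arguments object of tools/call follows the zod schema registered for each
model tool: messages is required, while temperature, max_tokens, and thinking
are optional.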



CHAPTER TEN: QUERYING THE MODEL REGISTRY


With discovery labels populated by the controller, engineers can query the

registry using standard Kubernetes tooling.


Find all models that support vision and have at least 128K context:


    kubectl get llmmodels -n ai-models \

        -l "ai.example.io/domain-vision=true,ai.example.io/context-128k=true" \

        -o custom-columns=\

    NAME:.metadata.name,\

    MODEL:.spec.modelId,\

    VENDOR:.spec.acceleratorVendor,\

    VRAM:.spec.resources.vramPerAcceleratorGiB,\

    CONTEXT:.spec.contextWindowK,\

    STATUS:.status.phase


Find all models running on AMD accelerators:


    kubectl get llmmodels -n ai-models \

        -l "ai.example.io/accelerator-vendor=amd" \

        -o wide


Find all models that support reasoning and fit in 24 GiB of VRAM:


    kubectl get llmmodels -n ai-models \

        -l "ai.example.io/domain-reasoning=true,ai.example.io/vram-tier in (8gb,16gb,24gb)"


Find all remote API models:


    kubectl get llmmodels -n ai-models \

        -l "ai.example.io/deployment-mode=remote"


NOTE: Kubernetes label selectors support only equality-based and set-based

matching, not numeric range comparisons, and field selectors on custom

resources are limited to metadata.name and metadata.namespace unless the CRD

declares selectableFields. For queries that require numeric comparisons

(e.g., contextWindowK >= 256), use the Python client and filter in

application code as shown below.


FILE: tools/query_models.py


    #!/usr/bin/env python3

    """

    Query the LLMModel registry for models matching given criteria.

    Runs inside the cluster (uses in-cluster config) or locally (uses kubeconfig).

    """


    from kubernetes import client, config



    def load_config() -> None:

        """Load Kubernetes configuration from cluster or local kubeconfig."""

        try:

            config.load_incluster_config()

        except config.ConfigException:

            config.load_kube_config()



    def find_models(

        namespace: str = "ai-models",

        domains: list[str] | None = None,

        max_vram_gib: int | None = None,

        context_min_k: int | None = None,

        accelerator_vendor: str | None = None,

        deployment_mode: str | None = None,

        phase: str | None = None,

    ) -> list[dict]:

        """

        Query the LLMModel registry for models matching the given criteria.


        Args:

            namespace:         Kubernetes namespace to search.

            domains:           List of required capability domains.

                               e.g. ["code", "vision"]

            max_vram_gib:      Maximum acceptable VRAM per accelerator in GiB.

                               Models requiring more VRAM are excluded.

            context_min_k:     Minimum required context window in thousands of tokens.

            accelerator_vendor: Filter by vendor: "nvidia", "amd", "intel-gaudi", "cpu".

            deployment_mode:   Filter by mode: "local" or "remote".

            phase:             Filter by status phase: "Ready", "Proxying", etc.


        Returns:

            List of matching LLMModel dicts (raw Kubernetes API objects).

        """

        load_config()

        api = client.CustomObjectsApi()


        # Build label selector from criteria that map to labels.

        label_parts: list[str] = []


        if domains:

            for domain in domains:

                label_parts.append(f"ai.example.io/domain-{domain}=true")


        if accelerator_vendor:

            label_parts.append(

                f"ai.example.io/accelerator-vendor={accelerator_vendor}"

            )


        if deployment_mode:

            label_parts.append(

                f"ai.example.io/deployment-mode={deployment_mode}"

            )


        if max_vram_gib is not None:

            # Map max VRAM to the tier labels at or below the maximum.

            tier_map = [

                (0,   "cpu"),

                (8,   "8gb"),

                (16,  "16gb"),

                (24,  "24gb"),

                (48,  "48gb"),

                (80,  "80gb"),

                (141, "141gb"),

                (192, "192gb"),

            ]

            eligible_tiers = [

                tier for threshold, tier in tier_map

                if threshold <= max_vram_gib

            ]

            if eligible_tiers:

                label_parts.append(

                    "ai.example.io/vram-tier in ({})".format(

                        ",".join(eligible_tiers)

                    )

                )


        label_selector = ",".join(label_parts) if label_parts else None


        result = api.list_namespaced_custom_object(

            group="ai.example.io",

            version="v1alpha1",

            namespace=namespace,

            plural="llmmodels",

            label_selector=label_selector,

        )


        models: list[dict] = result.get("items", [])


        # Apply filters that cannot be expressed as label selectors.

        if context_min_k is not None:

            models = [

                m for m in models

                if m.get("spec", {}).get("contextWindowK", 0) >= context_min_k

            ]


        if phase is not None:

            models = [

                m for m in models

                if m.get("status", {}).get("phase") == phase

            ]

        else:

            # By default, return only Ready and Proxying models.

            models = [

                m for m in models

                if m.get("status", {}).get("phase") in ("Ready", "Proxying")

            ]


        return models



    def print_models(models: list[dict]) -> None:

        """Print a formatted summary of matching models."""

        if not models:

            print("No matching models found.")

            return


        print(

            f"{'NAME':<30} {'MODEL':<45} {'VENDOR':<12} "

            f"{'VRAM':<8} {'CONTEXT':<10} {'STATUS':<10} ENDPOINT"

        )

        print("-" * 140)

        for m in models:

            spec = m.get("spec", {})

            status = m.get("status", {})

            resources = spec.get("resources", {})

            print(

                f"{m['metadata']['name']:<30} "

                f"{spec.get('modelId', ''):<45} "

                f"{spec.get('acceleratorVendor', 'remote'):<12} "

                f"{resources.get('vramPerAcceleratorGiB', 0):<8} "

                f"{spec.get('contextWindowK', 0):<10} "

                f"{status.get('phase', ''):<10} "

                f"{status.get('endpoint', '')}"

            )



    if __name__ == "__main__":

        # Example: find reasoning-capable models on any vendor with <=24 GiB VRAM

        results = find_models(

            domains=["reasoning"],

            max_vram_gib=24,

            context_min_k=128,

        )

        print(f"Found {len(results)} reasoning models with <=24 GiB VRAM and >=128K context:\n")

        print_models(results)


        print()


        # Example: find all AMD models

        amd_models = find_models(accelerator_vendor="amd")

        print(f"\nFound {len(amd_models)} AMD models:\n")

        print_models(amd_models)


        print()


        # Example: find all remote API models

        remote_models = find_models(deployment_mode="remote")

        print(f"\nFound {len(remote_models)} remote API models:\n")

        print_models(remote_models)


CHAPTER ELEVEN: DOCKER MODEL RUNNER AND LOCAL DEVELOPMENT


Docker Desktop 4.41, released on April 29, 2025, ships with Docker Model

Runner, which brings a capable local LLM development environment to any machine

with a modern GPU or Apple Silicon. It uses llama.cpp as its inference backend

and packages models as OCI artifacts.


To pull and run a model with Docker Model Runner:


    # Pull the Gemma 4 E4B model from Docker Hub's model registry.

    docker model pull ai/gemma4:e4b-q4_k_m


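    # Start an interactive chat session with the model.

    # Model Runner loads the model on demand. To reach its API from the host

    # on localhost:12434, host-side TCP access must be enabled first, e.g.:

    #   docker desktop enable model-runner --tcp 12434

    docker model run ai/gemma4:e4b-q4_k_m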


    # Test using the OpenAI-compatible API.

    curl http://localhost:12434/engines/llama.cpp/v1/chat/completions \

        -H "Content-Type: application/json" \

        -d '{

            "model": "ai/gemma4:e4b-q4_k_m",

            "messages": [

                {

                    "role": "user",

                    "content": "Explain the Mixture-of-Experts architecture in one paragraph."

                }

            ],

            "temperature": 0.7,

            "max_tokens": 512

        }'
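

The same request can be issued from Python. This is a minimal sketch using
only the standard library; it assumes Model Runner is reachable on
localhost:12434 as in the curl call above, and the function names are
illustrative rather than part of any Docker tooling. The demonstration at the
bottom runs offline against a canned response in the OpenAI chat format.

```python
import json
import urllib.request

BASE_URL = "http://localhost:12434/engines/llama.cpp/v1"
MODEL_ID = "ai/gemma4:e4b-q4_k_m"


def build_chat_request(prompt: str, temperature: float = 0.7,
                       max_tokens: int = 512) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request for Model Runner."""
    body = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_text(response_body: dict) -> str:
    """Pull the assistant text out of an OpenAI-style chat completion."""
    choices = response_body.get("choices") or []
    if not choices:
        return ""
    return choices[0].get("message", {}).get("content") or ""


def ask(prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt to the local runner and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        return extract_text(json.load(resp))


# Offline demonstration; the canned response is made up for illustration.
sample = {"choices": [{"message": {"role": "assistant", "content": "Hi."}}]}
print(extract_text(sample))
```

With the runner up, ask("Explain the Mixture-of-Experts architecture in one
paragraph.") performs the live call.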


FILE: compose.yaml (local development environment)


    # A development environment for an AI application that uses a local LLM.

    # The model runner service uses Docker Desktop's Model Runner backend,

    # which handles GPU acceleration automatically (NVIDIA CUDA on Linux/Windows,

    # Metal on macOS Apple Silicon).


    services:

      app:

        build: .

        environment:

          # Point the app at the local model runner.

          # In production (Kubernetes), this is overridden by the cluster endpoint.

          LLM_BASE_URL: http://model-runner.docker.internal:12434/engines/llama.cpp/v1

          LLM_MODEL_ID: ai/gemma4:e4b-q4_k_m

        depends_on:

          - model-runner

        ports:

          - "8080:8080"


      model-runner:

        # The "model" provider type tells Docker Desktop to use the Model Runner

        # instead of pulling a regular container image.

        # This works on any platform Docker Desktop supports:

        # - macOS: Apple Silicon (Metal) or Intel (CPU)

        # - Windows: NVIDIA GPU (CUDA) or CPU

        # - Linux: NVIDIA GPU (CUDA) or CPU

        provider:

          type: model

          options:

            model: ai/gemma4:e4b-q4_k_m
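

On the application side, the two environment variables set in the compose
file are all the app needs in order to switch between the local runner and a
cluster endpoint. A minimal sketch of how the app might read them follows.
The LLMSettings class and the load_llm_settings function are hypothetical
names, the defaults simply repeat the compose values, and the override values
in the demonstration are placeholders.

```python
import os
from dataclasses import dataclass
from typing import Mapping

DEFAULT_BASE_URL = "http://model-runner.docker.internal:12434/engines/llama.cpp/v1"
DEFAULT_MODEL_ID = "ai/gemma4:e4b-q4_k_m"


@dataclass(frozen=True)
class LLMSettings:
    base_url: str
    model_id: str

    @property
    def chat_completions_url(self) -> str:
        # Tolerate a trailing slash in LLM_BASE_URL.
        return f"{self.base_url.rstrip('/')}/chat/completions"


def load_llm_settings(env: Mapping[str, str] = os.environ) -> LLMSettings:
    """Read LLM_BASE_URL and LLM_MODEL_ID, falling back to the local runner."""
    return LLMSettings(
        base_url=env.get("LLM_BASE_URL", DEFAULT_BASE_URL),
        model_id=env.get("LLM_MODEL_ID", DEFAULT_MODEL_ID),
    )


# Local development: nothing set, so the compose defaults apply.
local = load_llm_settings({})
# Cluster deployment: both values overridden (placeholder endpoint and model).
cluster = load_llm_settings({
    "LLM_BASE_URL": "http://example-model.ai-models.svc:8000/v1/",
    "LLM_MODEL_ID": "example/model",
})
print(local.chat_completions_url)
print(cluster.chat_completions_url)
```

Keeping the endpoint and model ID out of the code means the same image runs
unchanged under Compose and under Kubernetes.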



CHAPTER TWELVE: COMPLETE DEPLOYMENT WALKTHROUGH


We will deploy three local models across three different accelerator vendors

and three remote API proxies, then deploy the MCP server and verify that an

AI agent can discover and use all six models.


STEP 1: Install prerequisites.


    # Create the operator system namespace.
    kubectl create namespace llm-operator-system
    # Create the AI models namespace.
    kubectl create namespace ai-models
    # ---- NVIDIA GPU Operator (run only if you have NVIDIA GPUs) ----
    helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
    helm repo update
    helm install gpu-operator nvidia/gpu-operator \
        --namespace gpu-operator \
        --create-namespace \
        --set driver.enabled=true \
        --set toolkit.enabled=true \
        --set devicePlugin.enabled=true \
        --set dcgmExporter.enabled=true \
        --wait
    # ---- AMD GPU Operator (run only if you have AMD GPUs) ----
    helm repo add amd-gpu-operator https://rocm.github.io/gpu-operator
    helm repo update
    helm install amd-gpu-operator amd-gpu-operator/gpu-operator-charts \
        --namespace amd-gpu-operator \
        --create-namespace \
        --set devicePlugin.enabled=true \
        --set nodeLabeller.enabled=true \
        --wait
    # ---- Intel Gaudi Base Operator (run only if you have Gaudi cards) ----
    helm repo add intel https://intel.github.io/helm-charts
    helm repo update
    helm install gaudi-base-operator intel/intel-gaudi-base-operator \
        --namespace intel-gaudi \
        --create-namespace \
        --wait
    # ---- KEDA ----
    helm repo add kedacore https://kedacore.github.io/charts
    helm repo update
    helm install keda kedacore/keda \
        --namespace keda \
        --create-namespace \
        --wait
    # ---- KEDA HTTP add-on (for scale-to-zero with request buffering) ----
    helm install keda-http-add-on kedacore/keda-add-ons-http \
        --namespace keda \
        --wait
    # ---- Prometheus stack ----
    helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
    helm repo update
    helm install prometheus prometheus-community/kube-prometheus-stack \
        --namespace monitoring \
        --create-namespace \
        --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
        --wait
STEP 2: Install the LLMModel CRD and deploy the controller.
    # Apply the CRD.
    kubectl apply -f config/crd/bases/ai.example.io_llmmodels.yaml
    # Apply RBAC.
    kubectl apply -f config/rbac/serviceaccount.yaml
    kubectl apply -f config/rbac/role.yaml
    kubectl apply -f config/rbac/rolebinding.yaml
    # Apply Prometheus recording rules.
    kubectl apply -f config/monitoring/prometheus-rules.yaml
    # Create secrets.
    kubectl create secret generic hf-token \
        --namespace ai-models \
        --from-literal=token=YOUR_HF_TOKEN
    kubectl create secret generic openai-api-key \
        --namespace ai-models \
        --from-literal=apiKey=YOUR_OPENAI_API_KEY
    kubectl create secret generic anthropic-api-key \
        --namespace ai-models \
        --from-literal=apiKey=YOUR_ANTHROPIC_API_KEY
    kubectl create secret generic google-api-key \
        --namespace ai-models \
        --from-literal=apiKey=YOUR_GOOGLE_API_KEY
    # Deploy the controller.
    kubectl apply -f config/manager/manager.yaml
STEP 3: Apply the LLMModel resources.
    # Local model on AMD MI300X.
    kubectl apply -f config/samples/gemma4-26b-amd.yaml
    # Local model on Intel Gaudi 3.
    kubectl apply -f config/samples/qwen36-35b-gaudi.yaml
    # Local model on NVIDIA H100.
    kubectl apply -f config/samples/gemma4-31b-nvidia.yaml
    # CPU-only model (no GPU required).
    kubectl apply -f config/samples/qwen35-9b-cpu.yaml
    # Remote API proxies.
    kubectl apply -f config/samples/gpt55-remote.yaml
    kubectl apply -f config/samples/claude-opus47-remote.yaml
    kubectl apply -f config/samples/gemini31-remote.yaml
    # Apply KEDA ScaledObjects.
    kubectl apply -f config/keda/
STEP 4: Deploy the MCP server.
    kubectl apply -f config/mcp/deployment.yaml
STEP 5: Watch the models come online.
    # Watch all LLMModel resources.
    # Local models: Pending -> Downloading -> Starting -> Ready
    # Remote proxies: immediately -> Proxying
    kubectl get llmmodels -n ai-models -w
    # Check logs for the AMD model.
    kubectl logs -n ai-models \
        -l "ai.example.io/model-name=gemma4-26b-amd" \
        -c inference-server \
        --follow
    # Check logs for the Gaudi model.
    kubectl logs -n ai-models \
        -l "ai.example.io/model-name=qwen36-35b-gaudi" \
        -c inference-server \
        --follow
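The watch in Step 5 can also be scripted so that a CI job blocks until every
model is usable. The following is a minimal sketch, not the operator's own
tooling. It assumes the phase shown above is stored at status.phase and that
the CRD is served as ai.example.io/v1alpha1; the version string and the helper
names are assumptions made for illustration.

```python
import time

# Phases this guide treats as "usable": local models end in Ready,
# remote proxies end in Proxying.
GOOD_PHASES = {"Ready", "Proxying"}


def pending_models(items):
    """Return the sorted names of LLMModel objects that are not yet usable."""
    pending = []
    for item in items:
        # Assumption: the phase lives at status.phase; a missing status
        # is treated as Pending.
        phase = item.get("status", {}).get("phase", "Pending")
        if phase not in GOOD_PHASES:
            pending.append(item["metadata"]["name"])
    return sorted(pending)


def wait_for_models(list_items, timeout_s=1800, poll_s=15, sleep=time.sleep):
    """Poll list_items() until nothing is pending; False if the timeout hits."""
    waited = 0
    while True:
        pending = pending_models(list_items())
        if not pending:
            return True
        if waited >= timeout_s:
            return False
        print("still waiting for:", ", ".join(pending))
        sleep(poll_s)
        waited += poll_s


def main():
    # Requires `pip install kubernetes`. Imported here so the helpers above
    # stay importable without the client library.
    from kubernetes import client, config

    config.load_kube_config()
    api = client.CustomObjectsApi()

    def list_items():
        resp = api.list_namespaced_custom_object(
            group="ai.example.io",
            version="v1alpha1",  # assumed version, adjust to your CRD
            namespace="ai-models",
            plural="llmmodels",
        )
        return resp["items"]

    return 0 if wait_for_models(list_items) else 1
```

Calling main() from a pipeline step and using its return value as the exit code
gives a simple gate before the tests in Step 6.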
STEP 6: Test the models directly.
    # Port-forward the AMD model.
    kubectl port-forward -n ai-models svc/gemma4-26b-amd 8001:8000 &
    sleep 2  # Give the background port-forward a moment to bind.
    curl http://localhost:8001/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
            "model": "google/gemma-4-26b-a4b-it",
            "messages": [{"role": "user", "content": "What is 2+2?"}],
            "max_tokens": 50
        }'
    # Port-forward the Gaudi model.
    kubectl port-forward -n ai-models svc/qwen36-35b-gaudi 8002:8000 &
    sleep 2  # Give the background port-forward a moment to bind.
    curl http://localhost:8002/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
            "model": "Qwen/Qwen3.6-35B-A3B",
            "messages": [
                {
                    "role": "system",
                    "content": "Think step by step before answering. Use <think>...</think> tags."
                },
                {
                    "role": "user",
                    "content": "Prove that the square root of 2 is irrational."
                }
            ],
            "max_tokens": 1024
        }'
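The same requests can be issued from Python using only the standard library.
This is a sketch under the assumption that the port-forwards from Step 6 are
active; the function names are illustrative and not part of the operator.

```python
import json
import urllib.request


def build_chat_request(base_url, model, user_prompt, max_tokens=50,
                       system_prompt=None):
    """Build a POST request for an OpenAI-compatible chat completions API."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_prompt})
    body = {"model": model, "messages": messages, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url, model, user_prompt, **kwargs):
    """Send the request and return the text of the first choice."""
    req = build_chat_request(base_url, model, user_prompt, **kwargs)
    with urllib.request.urlopen(req, timeout=120) as resp:
        payload = json.load(resp)
    return payload["choices"][0]["message"]["content"]
```

For example, chat("http://localhost:8001", "google/gemma-4-26b-a4b-it",
"What is 2+2?") mirrors the first curl call above.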
STEP 7: Test the MCP server.
    kubectl port-forward -n ai-models svc/mcp-server 3000:3000 &
    sleep 2  # Give the background port-forward a moment to bind.
    # List all available tools.
    curl -X POST http://localhost:3000/mcp \
        -H "Content-Type: application/json" \
        -H "Accept: application/json, text/event-stream" \
        -d '{
            "jsonrpc": "2.0",
            "id": 1,
            "method": "tools/list",
            "params": {}
        }'
    # Discover all models via the meta-tool.
    curl -X POST http://localhost:3000/mcp \
        -H "Content-Type: application/json" \
        -H "Accept: application/json, text/event-stream" \
        -d '{
            "jsonrpc": "2.0",
            "id": 2,
            "method": "tools/call",
            "params": {
                "name": "list_available_models",
                "arguments": {}
            }
        }'
    # Call the Gaudi reasoning model with thinking mode.
    curl -X POST http://localhost:3000/mcp \
        -H "Content-Type: application/json" \
        -H "Accept: application/json, text/event-stream" \
        -d '{
            "jsonrpc": "2.0",
            "id": 3,
            "method": "tools/call",
            "params": {
                "name": "qwen36_reasoning_gaudi",
                "arguments": {
                    "messages": [
                        {
                            "role": "user",
                            "content": "What is the time complexity of merge sort and why?"
                        }
                    ],
                    "thinking": true,
                    "max_tokens": 512
                }
            }
        }'
    # Call the GPT-5.5 remote proxy.
    curl -X POST http://localhost:3000/mcp \
        -H "Content-Type: application/json" \
        -H "Accept: application/json, text/event-stream" \
        -d '{
            "jsonrpc": "2.0",
            "id": 4,
            "method": "tools/call",
            "params": {
                "name": "gpt55_frontier",
                "arguments": {
                    "messages": [
                        {
                            "role": "user",
                            "content": "Write a Kubernetes operator in Go that manages Redis clusters."
                        }
                    ],
                    "temperature": 0.3,
                    "max_tokens": 2048
                }
            }
        }'
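Every call in Step 7 is a JSON-RPC 2.0 envelope around an MCP method. The
sketch below builds those envelopes and unpacks a tools/call result. It covers
only the message shapes; the helper names are illustrative, and transport
details such as session handling are left out.

```python
import itertools

# Monotonic request ids, matching the "id": 1, 2, 3... pattern used above.
_ids = itertools.count(1)


def jsonrpc_request(method, params=None):
    """Build a JSON-RPC 2.0 request object with a fresh id."""
    return {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": method,
        "params": params or {},
    }


def tool_call(name, arguments):
    """Build an MCP tools/call request for the named tool."""
    return jsonrpc_request("tools/call", {"name": name, "arguments": arguments})


def extract_text(response):
    """Join the text items of a tools/call result; raise on a JSON-RPC error."""
    if "error" in response:
        err = response["error"]
        raise RuntimeError(
            f"JSON-RPC error {err.get('code')}: {err.get('message')}"
        )
    parts = response["result"].get("content", [])
    return "".join(p.get("text", "") for p in parts if p.get("type") == "text")
```

For instance, tool_call("list_available_models", {}) produces the same body as
the second curl command, and extract_text() pulls the model's answer out of the
reply once it has been decoded from JSON.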
STEP 8: Run the registry query tool.
    # Install the Kubernetes Python client.
    pip install kubernetes
    # Query the registry from outside the cluster (uses kubeconfig).
    python3 tools/query_models.py



CHAPTER THIRTEEN: OPERATIONAL CONSIDERATIONS AND PRODUCTION HARDENING


MODEL LOADING TIME AND COLD STARTS


Large models take significant time to load. Gemma 4 31B in INT4 format takes

approximately 3–5 minutes on an H100. Qwen3.6-35B-A3B in FP8 on Gaudi 3 takes

approximately 5–10 minutes. This means that if you scale to zero and receive a

request, the user will wait for the full model loading time. Use the KEDA HTTP

add-on to buffer requests during cold starts, and set minReplicas to 1 for

interactive applications where cold-start latency is unacceptable.
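To decide whether scale-to-zero is acceptable for a given model, it helps to
measure the cold start on your own hardware rather than rely on the figures
above. The sketch below polls a readiness probe and reports the elapsed time.
It assumes the inference server answers GET /health with HTTP 200 once the
model is loaded, as vLLM's OpenAI-compatible server does; the function names
are illustrative.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=5):
    """Return True if GET url answers with HTTP 200, False on any failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def measure_cold_start(probe, timeout_s=900, poll_s=5,
                       clock=time.monotonic, sleep=time.sleep):
    """Poll probe() until it returns True.

    Returns the elapsed seconds, or None if timeout_s passes first.
    """
    start = clock()
    while True:
        if probe():
            return clock() - start
        if clock() - start >= timeout_s:
            return None
        sleep(poll_s)
```

With a port-forward in place, measure_cold_start(lambda:
http_ok("http://localhost:8001/health")) started right after scaling from zero
gives the wait a first user would experience.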


KV CACHE MEMORY


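The KV cache stores attention keys and values for every token in the context

window. For a model with a 256K-token context window, the KV cache for a single

request can be several gigabytes. vLLM's PagedAttention algorithm manages KV

cache memory efficiently, but you must still account for it when sizing VRAM. A

common rule of thumb is to leave 20–30% of total VRAM for the KV cache after the

model weights are loaded. Note that gpu-memory-utilization caps the share of

device memory vLLM may use for weights, activations, and KV cache combined; we

set it to 0.90 rather than 1.0 so that 10% remains as headroom for the driver

and runtime instead of being claimed by the cache.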
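The sizing arithmetic can be made concrete. For standard multi-head or
grouped-query attention, each token stores one key and one value vector per
layer per KV head. The sketch below applies that formula; the layer count,
head count, and overhead figure are hypothetical placeholders, not the
configuration of any model named in this guide, and the formula ignores
sliding-window or other cache-saving attention variants.

```python
GIB = 2 ** 30


def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim,
                             bytes_per_elem=2):
    """Bytes of KV cache per token: 2 (K and V) x layers x heads x dim x size."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def kv_cache_gib(tokens, **shape):
    """KV cache size in GiB for a given number of cached tokens."""
    return tokens * kv_cache_bytes_per_token(**shape) / GIB


def max_cached_tokens(vram_gib, weights_gib, gpu_memory_utilization=0.90,
                      overhead_gib=2.0, **shape):
    """Tokens that fit in what remains of the vLLM memory budget.

    overhead_gib is a rough allowance for activations and buffers; it is an
    assumption for illustration, not a measured value.
    """
    budget_gib = vram_gib * gpu_memory_utilization - weights_gib - overhead_gib
    if budget_gib <= 0:
        return 0
    return int(budget_gib * GIB // kv_cache_bytes_per_token(**shape))


# Hypothetical model shape: 32 layers, 4 KV heads of dimension 128, FP16 cache.
EXAMPLE_SHAPE = {"num_layers": 32, "num_kv_heads": 4, "head_dim": 128}
```

With this hypothetical shape each token costs 64 KiB, so a full 256K-token
request needs 16 GiB of cache, and an 80 GiB card holding 20 GiB of weights has
room for roughly 819,000 cached tokens across all concurrent requests.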


MULTI-ACCELERATOR TENSOR PARALLELISM


For models that do not fit on a single accelerator, vLLM uses tensor parallelism

to split the model across cards. On NVIDIA hardware the collectives run through

NCCL over NVLink/NVSwitch. On AMD hardware they use RCCL (ROCm Collective

Communications Library). On Intel Gaudi they use HCCL (Habana Collective

Communications Library). In this operator, all three run with hostIPC and

hostNetwork enabled for efficient inter-card communication, which the controller

sets automatically when acceleratorCount > 1.
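The rule the controller applies can be sketched in a few lines. This is an
illustration of the decision described above, not the controller's Go code;
the function names and the shape of the returned dictionary are assumptions.
The --tensor-parallel-size flag is vLLM's own, and vLLM requires the number of
attention heads to be divisible by the tensor-parallel size.

```python
def tensor_parallel_settings(accelerator_count):
    """Derive vLLM arguments and pod-level flags from the accelerator count."""
    if accelerator_count < 1:
        raise ValueError("accelerator_count must be at least 1")
    multi = accelerator_count > 1
    # A single card needs no tensor-parallel flag and no host namespaces.
    args = ["--tensor-parallel-size", str(accelerator_count)] if multi else []
    return {
        "vllm_args": args,
        "pod_spec": {"hostIPC": multi, "hostNetwork": multi},
    }


def valid_tp_sizes(num_attention_heads, max_cards):
    """List the card counts that evenly divide the attention heads."""
    return [n for n in range(1, max_cards + 1) if num_attention_heads % n == 0]
```

For a model with 48 attention heads on an 8-card node, valid_tp_sizes(48, 8)
returns [1, 2, 3, 4, 6, 8], so a request for 5 or 7 cards would be rejected by
vLLM at startup.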


VENDOR-SPECIFIC QUANTIZATION NOTES


NVIDIA: AWQ, GPTQ, FP8, INT8, INT4, and QAT-INT4 are all supported on H100+.

AMD: AWQ is supported on ROCm as of vLLM 0.8+. MXFP4 and MXFP6 require MI350X

  or later (CDNA 4 architecture). SqueezeLLM has been ported to ROCm.

Intel Gaudi: FP8 is natively supported by Gaudi 3 hardware. The Intel vLLM

  fork includes custom graph caching for improved performance with FP8.
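These notes lend themselves to a validation table that rejects an unsupported
vendor and quantization pairing before a pod is ever scheduled. The sketch
below encodes only the combinations listed above. The vendor keys and the
function name are hypothetical; the real check would belong in the controller
or an admission webhook.

```python
# Quantization methods per vendor, taken from the notes above.
SUPPORTED = {
    "nvidia": {"awq", "gptq", "fp8", "int8", "int4", "qat-int4"},
    "amd": {"awq", "squeezellm", "mxfp4", "mxfp6"},
    "intel-gaudi": {"fp8"},
}

# Formats that need the CDNA 4 architecture (MI350X or later) on AMD.
NEEDS_CDNA4 = {"mxfp4", "mxfp6"}


def check_quantization(vendor, method, amd_cdna4=False):
    """Return (ok, reason) for a vendor and quantization method pairing."""
    method = method.lower()
    supported = SUPPORTED.get(vendor)
    if supported is None:
        return (False, f"unknown vendor: {vendor}")
    if method not in supported:
        return (False, f"{method} is not listed for {vendor}")
    if vendor == "amd" and method in NEEDS_CDNA4 and not amd_cdna4:
        return (False, f"{method} requires MI350X or later (CDNA 4)")
    return (True, "ok")
```

A call such as check_quantization("amd", "mxfp4") fails with a reason string
that can be surfaced in the LLMModel status, which is friendlier than a crash
loop in the inference container.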


SECRETS MANAGEMENT


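API keys for remote models (GPT-5.5, Claude Opus 4.7, Gemini 3.1) are

sensitive credentials. For production deployments, use an external secrets

manager such as HashiCorp Vault. The example below uses the Secrets Store CSI

Driver with its Vault provider; the synced Kubernetes Secret named in

secretObjects is created only while a pod mounts the CSI volume: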


    apiVersion: secrets-store.csi.x-k8s.io/v1

    kind: SecretProviderClass

    metadata:

      name: openai-api-key-vault

      namespace: ai-models

    spec:

      provider: vault

      parameters:

        vaultAddress: https://vault.example.com

        roleName: llm-operator

        objects: |

          - objectName: "openai-api-key"

            secretPath: "secret/data/ai/openai"

            secretKey: "api_key"

      secretObjects:

        - secretName: openai-api-key

          type: Opaque

          data:

            - objectName: openai-api-key

              key: apiKey
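A pod that mounts this SecretProviderClass sees each object as a file named
after its objectName. The sketch below shows one way a proxy process might
read such a key, falling back to an environment variable for local runs. The
mount path /mnt/secrets-store is a common convention rather than something
this guide defines, and the function name is illustrative.

```python
import os
from pathlib import Path


def load_api_key(name, mount_dir="/mnt/secrets-store", env_var=None):
    """Read an API key from a mounted secret file, else from env_var.

    Raises RuntimeError if no key is found or if the value is still one of
    the YOUR_... placeholders used in the walkthrough.
    """
    path = Path(mount_dir) / name
    if path.is_file():
        value = path.read_text(encoding="utf-8").strip()
    else:
        value = os.environ.get(env_var, "").strip() if env_var else ""
    if not value or value.startswith("YOUR_"):
        raise RuntimeError(f"no usable API key found for {name!r}")
    return value
```

Reading the file on each request, or re-reading it periodically, lets the
process pick up rotated keys if secret rotation is enabled in the driver.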


NETWORK SECURITY


The vLLM API server has no built-in authentication. Use Kubernetes Network

Policies to restrict which pods can reach the inference services:


    # File: config/security/network-policy.yaml

    apiVersion: networking.k8s.io/v1

    kind: NetworkPolicy

    metadata:

      name: llm-inference-access

      namespace: ai-models

    spec:

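      # Apply to all inference server pods across all vendors. An Exists

      # expression matches any value of the label; matchLabels with "" would

      # match only pods whose label value is the empty string.

      podSelector:

        matchExpressions:

          - key: ai.example.io/model-name

            operator: Exists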

      policyTypes:

        - Ingress

      ingress:

        # Allow traffic from explicitly authorized clients.

        - from:

            - namespaceSelector: {}

              podSelector:

                matchLabels:

                  ai.example.io/llm-client: "true"

          ports:

            - protocol: TCP

              port: 8000

        # Always allow traffic from the MCP server.

        - from:

            - podSelector:

                matchLabels:

                  app: mcp-server

          ports:

            - protocol: TCP

              port: 8000

        # Always allow traffic from the controller (for health checks).

        - from:

            - namespaceSelector:

                matchLabels:

                  kubernetes.io/metadata.name: llm-operator-system

          ports:

            - protocol: TCP

              port: 8000
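The three ingress rules combine in a way that is easy to misread: entries in
the ingress list are ORed, a namespaceSelector and podSelector inside the same
"from" element are ANDed, and a podSelector on its own only matches pods in the
policy's own namespace. The sketch below models just those semantics for the
policy above so the outcomes can be checked. It handles matchLabels only and
ignores ports, since every rule uses TCP 8000.

```python
POLICY_NAMESPACE = "ai-models"

# The three "from" peers of the policy above, in order.
INGRESS_PEERS = [
    {"namespaceSelector": {},
     "podSelector": {"matchLabels": {"ai.example.io/llm-client": "true"}}},
    {"podSelector": {"matchLabels": {"app": "mcp-server"}}},
    {"namespaceSelector": {"matchLabels": {
        "kubernetes.io/metadata.name": "llm-operator-system"}}},
]


def selector_matches(selector, labels):
    """Evaluate a matchLabels-only selector; an empty selector matches all."""
    wanted = selector.get("matchLabels", {})
    return all(labels.get(k) == v for k, v in wanted.items())


def peer_allows(peer, src_ns, src_ns_labels, src_pod_labels):
    """Apply one 'from' element: namespace and pod conditions are ANDed."""
    ns_sel = peer.get("namespaceSelector")
    pod_sel = peer.get("podSelector")
    if ns_sel is None:
        # No namespaceSelector: only the policy's own namespace qualifies.
        ns_ok = src_ns == POLICY_NAMESPACE
    else:
        ns_ok = selector_matches(ns_sel, src_ns_labels)
    pod_ok = True if pod_sel is None else selector_matches(pod_sel, src_pod_labels)
    return ns_ok and pod_ok


def ingress_allowed(src_ns, src_pod_labels, src_ns_labels=None):
    """Return True if any peer admits traffic from the given source pod."""
    if src_ns_labels is None:
        # Kubernetes sets this label on every namespace automatically.
        src_ns_labels = {"kubernetes.io/metadata.name": src_ns}
    return any(
        peer_allows(p, src_ns, src_ns_labels, src_pod_labels)
        for p in INGRESS_PEERS
    )
```

One consequence worth noting: ingress_allowed("web", {"app": "mcp-server"}) is
False, because the MCP rule has no namespaceSelector and therefore only admits
an MCP server running in ai-models.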


HETEROGENEOUS CLUSTER CONSIDERATIONS


In a cluster with mixed accelerator types (some nodes with NVIDIA GPUs, some

with AMD GPUs, some with Intel Gaudi cards, and some CPU-only), the controller

ensures that each LLMModel pod lands on the correct node type via:


1. NodeSelector: The vendor-specific label set by the GPU operator

   (nvidia.com/gpu.present=true, amd.com/gpu.present=true, habana.ai/gaudi=true)

2. Tolerations: Allow the pod onto nodes carrying the vendor-specific taint

3. Resource requests: The vendor-specific resource key

   (nvidia.com/gpu, amd.com/gpu, habana.ai/gaudi)


These three mechanisms together guarantee that an AMD model never lands on an

NVIDIA node and vice versa, even in a heterogeneous cluster.
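The three mechanisms can be expressed as one lookup from vendor to pod-spec
fragment. The sketch below uses the labels and resource keys listed above. The
vendor keys, the function name, and the choice of the resource key as the taint
key are assumptions for illustration; check the taints actually present on your
nodes.

```python
# Node label and extended resource key per vendor, as listed above.
VENDOR_PLACEMENT = {
    "nvidia": {"node_label": "nvidia.com/gpu.present",
               "resource": "nvidia.com/gpu"},
    "amd": {"node_label": "amd.com/gpu.present",
            "resource": "amd.com/gpu"},
    "intel-gaudi": {"node_label": "habana.ai/gaudi",
                    "resource": "habana.ai/gaudi"},
}


def placement(vendor, accelerator_count=1):
    """Build the nodeSelector, tolerations, and resource limits for a vendor."""
    if vendor == "cpu":
        # CPU-only models need no selector, toleration, or extended resource.
        return {"nodeSelector": {}, "tolerations": [], "resources": {}}
    cfg = VENDOR_PLACEMENT[vendor]
    return {
        "nodeSelector": {cfg["node_label"]: "true"},
        # Assumption: accelerator nodes are tainted with the resource key.
        "tolerations": [{
            "key": cfg["resource"],
            "operator": "Exists",
            "effect": "NoSchedule",
        }],
        # Extended resources are declared under limits.
        "resources": {"limits": {cfg["resource"]: str(accelerator_count)}},
    }
```

Because the resource key differs per vendor, a pod built by placement("amd", 2)
cannot be satisfied by an NVIDIA node even if the nodeSelector were removed,
which is what makes the combination robust.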


For clusters using the AMD GPU DRA Driver (available in beta as of early 2026),

the scheduling can be made even more precise. The DRA driver publishes

ResourceSlices that expose structured attributes of AMD GPU devices (model,

PCIe root, memory, etc.), allowing workloads to request GPUs based on specific

characteristics such as minimum HBM capacity.



CHAPTER FOURTEEN: THE BIGGER PICTURE


We have covered a lot of ground. Let us step back and look at what we have

built and why it matters.


We started with the observation that the LLM landscape as of May 2026 is

radically different from what it was eighteen months ago. The dominant models

are MoE architectures. Context windows have grown from thousands to millions of

tokens. Quantization-Aware Training has made it possible to run 31-billion-

parameter multimodal models on a single 24 GiB GPU. And the accelerator market

has diversified: AMD MI300X, MI350X, and the forthcoming MI400 series are

serious alternatives to NVIDIA for large-model inference. Intel Gaudi 3 offers

a cost-effective option with native FP8 support and strong LLM serving

frameworks. The frontier remote API models — GPT-5.5, Claude Opus 4.7, Gemini

3.1 Pro — have context windows of 1 million tokens and capabilities that no

locally-deployable model yet matches.


We designed a Custom Resource Definition that captures the full richness of this

landscape. The `acceleratorVendor` field is the key innovation: it makes the

entire operator hardware-agnostic. Adding support for a new accelerator vendor

requires editing exactly one file — controllers/hardware.go — and adding a new

case to the switch statement. The rest of the controller, the CRD, the MCP

server, and the query tooling all remain unchanged.


We built a controller that reconciles this CRD into real Kubernetes objects,

with all vendor-specific decisions isolated in the AcceleratorConfig struct.

The controller handles the full lifecycle: PVC creation for model caching,

init container for weight download, inference Deployment with correct resource

keys and tolerations, Service for stable DNS, KEDA ScaledObject for intelligent

autoscaling, and MCP tool registration for AI agent integration.


We wired up KEDA to scale on inference queue depth rather than CPU or GPU

utilization, which is the only autoscaling signal that makes sense for LLM

workloads. Because vLLM emits the same Prometheus metrics regardless of the

underlying accelerator vendor, the KEDA configuration is completely vendor-

agnostic.


We built an MCP server that watches the LLMModel registry and exposes each

model as a tool that AI agents can discover and call. The server supports

dynamic tool registration via an admin REST API that the controller calls

whenever a model's status changes, and it sends MCP

notifications/tools/list_changed messages to connected clients when the tool

list changes.


We showed how Docker Model Runner bridges the gap between local development and

cluster deployment, giving every developer a local LLM environment on any

supported platform (NVIDIA GPU, Apple Silicon, or CPU) that uses the same API

surface as the production cluster.


The result is an AI platform that is self-describing, self-scaling, self-

healing, and hardware-agnostic. Models are first-class Kubernetes citizens.

They can be queried, filtered, and selected using standard Kubernetes tooling

regardless of whether they run on NVIDIA, AMD, Intel Gaudi, or CPU. They scale

automatically based on demand. They expose themselves to AI agents via a

standard protocol. And they handle both local and remote models uniformly.


This is what it looks like when you stop treating LLMs as external services

and start treating them as infrastructure — infrastructure that works with

whatever accelerator hardware you have or can afford.

