CHAPTER ONE: WHY THIS MATTERS, AND WHY NOW

There is a moment in the life of every engineering team that adopts large
language models when the honeymoon ends. It usually happens around the third
month. The team has been happily calling an external API, watching tokens flow
in and out, and marvelling at what the model can do. Then the bill arrives.
Then the legal team asks where the customer data is going. Then the latency
spikes because the API is rate-limited. Then someone asks, "Could we just run
this ourselves?"

The answer is a resounding yes, and the tooling to do it well has finally
caught up with the ambition. Kubernetes 1.33, released on April 23, 2025,
ships Dynamic Resource Allocation (DRA) as a beta API, giving the scheduler a
genuinely sophisticated understanding of accelerator hardware from any vendor.
Docker Desktop 4.41, released on April 29, 2025, ships with a Model Runner
that understands OCI-packaged model artifacts and exposes an OpenAI-compatible
API from your laptop. The vLLM project has matured into a production-grade
inference server with native support for NVIDIA CUDA, AMD ROCm, and Intel
Gaudi. KEDA gives us event-driven autoscaling that reacts to the depth of an
inference queue rather than to CPU utilization. And the Model Context Protocol
(MCP), whose latest stable specification was published in November 2025, gives
us a standard way for AI agents to discover and call tools, including tools
that live inside a Kubernetes cluster.

This guide walks through the entire stack. We will design a Custom Resource
Definition that describes an LLM deployment with enough richness to let a
scheduler make intelligent decisions about where and how to run it on any
accelerator vendor. We will build a controller that reconciles that resource
into real Kubernetes objects. We will configure vLLM as the inference engine
for GPU deployments and llama.cpp/Ollama for CPU deployments, wire up KEDA for
autoscaling, and expose the whole thing through an MCP server so that AI agents
can discover and use models as tools. Along the way we will handle the important
asymmetry between models you can run locally and frontier models that only exist
as remote REST APIs, because any real-world deployment has to deal with both.

Before we write a single line of YAML, we need to understand what we are
actually deploying.

CHAPTER TWO: THE MODEL LANDSCAPE IN MAY 2026

The LLM landscape has undergone a fundamental architectural shift over the past
eighteen months. The dominant pattern is no longer the dense transformer, where
every parameter participates in every forward pass. Instead, the leading models
use Mixture-of-Experts (MoE) architectures, where a router selects a small
subset of "expert" sub-networks for each token. This means that a model can
have 675 billion total parameters but activate only 41 billion of them per
token, dramatically reducing the compute cost of inference while retaining the
capacity that comes from a large parameter count.
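To make the routing idea concrete, here is a minimal sketch of top-k expert selection. It is a toy and not any particular model's router: the eight scalar "experts", the logits, and the names `top_k_route` and `moe_layer` are all illustrative assumptions.

```python
import math

def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over the selected logits only, so the k weights sum to 1.
    peak = max(router_logits[i] for i in ranked)
    exps = [math.exp(router_logits[i] - peak) for i in ranked]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(ranked, exps)]

def moe_layer(token, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs by router weight."""
    return sum(w * experts[i](token) for i, w in top_k_route(router_logits, k))

# Eight toy experts; each scalar function stands in for a full sub-network.
experts = [lambda x, scale=scale: scale * x for scale in range(1, 9)]
logits = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 0.2]

selected = top_k_route(logits, k=2)
output = moe_layer(1.0, experts, logits, k=2)
print("experts used:", [i for i, _ in selected], "of", len(experts))
print("mixed output:", round(output, 4))
```

Only two of the eight experts do any work for this token, which is where the compute saving comes from. All eight still have to exist in memory, because the next token may route elsewhere.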

This distinction matters enormously for Kubernetes scheduling, because the
relevant resource constraint is not the active parameter count but the amount
of accelerator memory required to hold all the expert weights simultaneously.
Even if only 41 billion parameters are active per token, all the expert weights
must reside in HBM/VRAM so the router can select among them.
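The sizing rule can be checked with simple arithmetic. The sketch below estimates weight memory from a parameter count and a bytes-per-parameter figure. It covers weights only, so treat the result as a lower bound that excludes KV cache and runtime overhead. The function name and the precision table are assumptions made for illustration.

```python
# Approximate storage cost per parameter for each precision (assumed values).
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0}

def weight_memory_gb(params_billions, precision):
    """Weights-only memory in GB (10^9 bytes); excludes KV cache and overhead."""
    return params_billions * BYTES_PER_PARAM[precision]

# The 675B-total / 41B-active MoE example from the text.
total_b, active_b = 675, 41

resident = weight_memory_gb(total_b, "BF16")   # what must actually sit in memory
naive = weight_memory_gb(active_b, "BF16")     # what sizing by active params suggests

print(f"resident weights: {resident:.0f} GB")
print(f"naive estimate:   {naive:.0f} GB")
print(f"underestimate:    {resident / naive:.1f}x")
```

Sizing by active parameters would underestimate the memory requirement by more than an order of magnitude, which is why the scheduler needs the total count.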

A second major trend is the emergence of thinking models, which perform
chain-of-thought reasoning internally before producing a final answer. Some
models, like the Qwen3 and Qwen3.5/3.6 families, support a hybrid mode where a
single deployed model can switch between fast non-thinking responses and slower,
deeper reasoning responses depending on a flag in the system prompt.

A third trend is the spread of efficiency techniques such as Quantization-Aware
Training (QAT) and Multi-Token Prediction (MTP). Google's Gemma 4 family uses
MTP drafters, small companion checkpoints that accelerate token generation by
up to 3x without quality loss. Gemma 3's QAT-INT4 format demonstrated that
training-time quantization awareness produces better quality than post-training
quantization at the same memory footprint.

Let us survey the major model families that are relevant to a Kubernetes
deployment as of May 2026, with the specific numbers that a scheduler needs.

GOOGLE — GEMMA 4 FAMILY (Released April 2, 2026)

Gemma 4 is released under Apache 2.0 and is built from the same research as
Gemini 3. All models natively process text and images; the E2B and E4B variants
also support audio input. All models support native function calling and
structured JSON output. MTP drafters are available for all sizes, offering up
to 3x speedup without significant VRAM increase.

Gemma 4 E2B has approximately 2.3 billion effective parameters. It targets
phones, edge devices, and low-VRAM testing. In Q4_K_M quantization it requires
4–6 GB of VRAM. Full BF16 requires 15 GB. Context window: 128K tokens.

Gemma 4 E4B has approximately 4.5 billion effective parameters (around 8 billion
including embeddings). It targets high-end laptops and small servers. In Q4_K_M
it requires about 8 GB of VRAM; in Q8_0 it needs 12–16 GB. Full BF16: 15 GB.
Context window: 128K tokens.

Gemma 4 26B-A4B is a Mixture-of-Experts model with 25.2 billion total parameters
and approximately 3.8 billion active parameters per token. It targets consumer
GPUs and cost-efficient single-GPU server inference. In 4-bit quantization
(GGUF or AWQ) it requires 14–16 GB of VRAM; on an RTX 4090 (24 GB) there is
comfortable headroom for KV cache. In FP8 it needs about 30 GB; in BF16 about
60 GB. Context window: 256K tokens.

Gemma 4 31B is a dense model with 30.7 billion total parameters. It targets
workstations where maximum quality is paramount. In INT4 it requires a minimum
of 16 GB of VRAM; Q4_K_M is comfortable at 24 GB; Q5_K_M and above need 32 GB+;
INT8 needs about 36 GB; BF16 needs about 62 GB (requires A100 80 GB or H100).
Context window: 256K tokens.

GOOGLE — GEMINI 3.1 (Remote API, Released February 19, 2026)

Gemini 3.1 Pro is not available as open weights. It is accessed via the Google
AI API and Vertex AI. Context window: 1 million tokens. Max output: 65,536
tokens. Pricing: $2 per million input tokens, $12 per million output tokens.
Supports up to 900 images per prompt, 8.4 hours of audio, and 1 hour of video.
Variants: Gemini 3.1 Pro Preview (Feb 19), Flash-Lite Preview (Mar 3),
Flash-Lite GA (May 7, 2026).

ALIBABA — QWEN3.5 FAMILY (Released February 16, 2026)

Qwen3.5 is open-weights under Apache 2.0. It uses both dense and MoE
architectures. The MoE models are designated with an "A" suffix indicating
active parameters (e.g., 397B-A17B has 397 billion total, 17 billion active).
Available open-weights sizes: 0.8B, 2B, 4B, 9B, 27B, 35B-A3B, 122B-A10B,
and 397B-A17B. Native context window: 262,144 tokens, extensible to 1,010,000
tokens. Supports 201 languages. Hybrid thinking/non-thinking mode.

The 9B model fits on a single 24 GB GPU in FP16 with ample KV cache headroom.
The 35B-A3B MoE model runs in Q4 quantization on a 24 GB card (approximately
21.5 GB). The 27B dense model runs on 22 GB of RAM/VRAM. For the 35B-A3B at
Q8_0 with a 65K context window in llama.cpp, VRAM usage is approximately 21.7 GB.

ALIBABA — QWEN3.6 FAMILY (Released April 2026)

Qwen3.6-27B and Qwen3.6-35B-A3B are open-weight models released under
Apache 2.0. Qwen3.6-35B-A3B is a fully open-source MoE model with 35 billion
total parameters and 3 billion active parameters per token, outperforming its
Qwen3.5 predecessor and rivalling larger dense models. Context window: 256K
tokens, extensible to 1,010,000 tokens. The 27B model runs on 18 GB of VRAM;
the 35B-A3B runs on 22 GB. The 35B-A3B achieves 100+ tokens per second on
consumer hardware due to its low active parameter count.

Qwen3.6-Plus is a proprietary hosted model with a 1-million-token context
window and up to 65,536 output tokens. It is not available as open weights.

META — LLAMA 4 FAMILY (Released April 2025, Llama 4 Community License)

Llama 4 Scout has 109 billion total parameters and activates 17 billion per
token across 16 experts. Context window: 10 million tokens. In 4-bit
quantization it requires approximately 55–61 GB of VRAM, fitting on a single
H100 80 GB or AMD MI300X (192 GB). For context windows beyond 130K tokens,
multiple GPUs are recommended.

Llama 4 Maverick has 400 billion total parameters across 128 experts, also
activating 17 billion per token. Context window: 1 million tokens. In 4-bit
quantization it requires approximately 224 GB of VRAM, which means a multi-GPU
deployment.

MISTRAL AI — MISTRAL MEDIUM 3.5 (Released April 2026)

Mistral Medium 3.5 is a dense 128-billion-parameter model released under a
Modified MIT License. Context window: 256K tokens. Multimodal: text and image
input, text output. Supports configurable reasoning mode. In 4-bit quantization
it requires approximately 70 GB of VRAM, accessible on a Mac Studio with 128 GB
of unified memory or a multi-GPU server. Excels in instruction-following,
reasoning, coding, long-context understanding, tool use, and agentic workflows.

MISTRAL AI — MISTRAL LARGE 3 (Released December 2025)

Mistral Large 3 is a sparse MoE model with 675 billion total parameters and
approximately 41 billion active parameters per token, plus a 2.5 billion
parameter integrated vision encoder. Released under Apache 2.0. Context window:
256K tokens. Multimodal: native vision capabilities. Supports 40+ languages,
function calling, and structured JSON output.

For self-hosting in FP8 precision for long-context workloads up to 256K tokens,
B200 or H200 GPUs are recommended. For contexts under 64K tokens, NVFP4
precision can be used on A100s and H100s. The MoE architecture means that while
all 675 billion parameters must reside in memory, only about 41 billion take
part in the computation for each token, making inference cheaper than it would
be for a dense 675B model. In Q4 quantization, approximately 200 GB of VRAM is
required.

MOONSHOT AI — KIMI K2.5 (Released January 27, 2026)

Kimi K2.5 is open-source under a Modified MIT License. Architecture: MoE with
1 trillion total parameters and 32 billion active parameters per token. 384
experts with 8 selected per token plus 1 shared expert. Native multimodal,
trained on 15 trillion mixed visual and text tokens. Context window: 256K tokens.

The full model FP8 checkpoint is approximately 630 GB and typically requires
at least 4x H200 GPUs. Quantized versions (e.g., 1.8-bit) can run on a single
24 GB GPU with CPU offload for MoE layers, requiring around 256 GB of system
RAM for approximately 10 tokens per second. Features: Agent Swarm technology
coordinating up to 100 specialized AI agents, four operational modes (Instant,
Thinking, Agent, Agent Swarm).

ZHIPU AI — GLM-5.1 (Released April 7, 2026)

GLM-5.1 is open-source under the MIT License. Architecture: Hybrid MoE
(GlmMoeDSA) with 754 billion total parameters and 40 billion active parameters
per token. 256 routed experts plus 1 shared expert, with 8+1 active per token.
Context window: 200K tokens. Designed for agentic engineering and long-horizon
coding tasks.

Minimal deployment requires 1x NVIDIA HGX B200 (8x B200 GPU system). FP8
deployments with vLLM on multi-GPU rigs require approximately 860 GB or more
across GPUs (e.g., 8x H200). For CPU-only setups with quantized GGUF weights,
approximately 180–256 GB of system RAM is needed for 1–2 bit quantizations. A
24 GB GPU plus 256 GB of RAM can work with 2-bit variants using MoE offloading.

DEEPSEEK — V4 FAMILY (2026)

DeepSeek-V4-Flash has 284 billion total parameters and 13 billion active
parameters per token. Context window: 1 million tokens. MIT-licensed open
weights. Native FP4+FP8 requires approximately 170–175 GB of total VRAM,
fitting on 2x H200 or 4x A100 80 GB. Community INT4 quantization needs
approximately 90–100 GB, potentially on 4x RTX 4090. Community GGUF/GPTQ at
approximately 80 GB VRAM might be feasible on 1x RTX 5090 or 2x RTX 4090 with
CPU offload.

DeepSeek-V4-Pro has approximately 1.6 trillion total parameters and 49 billion
active parameters per token. MIT-licensed open weights. Full precision
(FP8+FP4 Mixed) requires approximately 865 GB of VRAM, recommending 16x H100
80 GB or equivalent. Even Q4 Pro (approximately 430 GB of weights) does not fit
on an 8x H100 80 GB box once KV cache and overhead are added. This is a
datacenter cluster job.

OPENAI — GPT-5.5 (Remote API, Released April 23, 2026)

GPT-5.5 is not available as open weights. API context window: 1 million tokens.
Max output: 128K tokens. Knowledge cutoff: December 2025. Pricing: $5 per
million input tokens, $30 per million output tokens (90% cached-input discount).
GPT-5.5 Pro: $30/$180 per million tokens. GPT-5.5 Instant (released May 5,
2026) is the default for all ChatGPT users. Offers five reasoning levels: none,
low, medium, high, xhigh. Supports text and image inputs.

ANTHROPIC — CLAUDE OPUS 4.7 (Remote API, Released April 16, 2026)

Claude Opus 4.7 is not available as open weights. Context window: 1 million
tokens. Max output: 128K tokens. Pricing: $5 per million input tokens, $25 per
million output tokens. Improved vision with higher-resolution image support
(up to 2576px / 3.75MP). New "xhigh" effort level for finer control over
reasoning and latency. Available via API, Amazon Bedrock, Google Cloud Vertex
AI, and Microsoft Foundry.
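The per-token prices quoted for the three remote APIs turn into a monthly bill with a few lines of arithmetic. The sketch below uses only the list prices given above and ignores cached-input discounts; the token volumes are hypothetical and the function name is an assumption.

```python
# (input USD, output USD) per million tokens, as quoted in this chapter.
PRICES_PER_MILLION = {
    "Gemini 3.1 Pro": (2.0, 12.0),
    "GPT-5.5": (5.0, 30.0),
    "Claude Opus 4.7": (5.0, 25.0),
}

def monthly_cost_usd(model, input_tokens, output_tokens):
    """List-price cost for one month of traffic, without any discounts."""
    input_price, output_price = PRICES_PER_MILLION[model]
    return (input_tokens / 1e6) * input_price + (output_tokens / 1e6) * output_price

# Hypothetical workload: 500M input tokens and 100M output tokens per month.
workload = {"input_tokens": 500_000_000, "output_tokens": 100_000_000}

for model in PRICES_PER_MILLION:
    print(f"{model:16s} ${monthly_cost_usd(model, **workload):,.0f}")
```

Numbers like these are what make the "could we run this ourselves" question concrete: they give a budget against which accelerator hardware can be compared.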

THE ACCELERATOR LANDSCAPE IN MAY 2026

NVIDIA remains the dominant data-center accelerator vendor. The H100 80 GB
(HBM3) and H200 141 GB (HBM3e) are the workhorses of production LLM inference.
The B200 with 192 GB of HBM3e and the HGX B200 system (8x B200 = 1536 GB
aggregate) represent the frontier for the largest models.

AMD Instinct MI300X offers 192 GB of HBM3 memory with 5.3 TB/s bandwidth per
card, making it exceptionally well-suited for large model inference where VRAM
is the primary constraint. The MI350X (CDNA 4 architecture, launched in 2025)
features 288 GB of HBM3E and 8 TB/s bandwidth, with AMD claiming up to 35x AI
inference performance improvement over MI300. The MI400 series (anticipated
2026) is expected to feature 432 GB of HBM4 and over 19.6 TB/s bandwidth. As
of mid-January 2026, 93% of vLLM AMD test groups were passing, and official
ROCm-enabled vLLM Docker images have been available since January 2026.

Intel Gaudi 3 (launched 2024, 5nm process) features 128 GB of HBM2e with
3.7 TB/s bandwidth. Intel maintains an optimized fork of vLLM for Gaudi with
Paged KV cache, custom Paged Attention, tensor parallelism, and FP8 quantization
support. Text Generation Inference (TGI) also has native Gaudi support. Gaudi 3
has supported the DeepSeek architecture since Intel Gaudi software release
1.21.0.

Apple Silicon (M3 Ultra: 512 GB unified memory) excels at local and edge LLM
inference via llama.cpp and the MLX framework. It is not yet suitable for
large-scale Kubernetes data-center deployments due to GPU support limitations
in VMs and containers, but it is the best option for on-premise developer
workstations and edge nodes running llama.cpp or Ollama.

THE VRAM REFERENCE TABLE (MAY 2026)

The following table summarises the memory requirements that our Kubernetes
scheduler will need to reason about. Q4 refers to 4-bit quantization (GGUF
Q4_K_M, AWQ, or GPTQ). QAT-INT4 refers to Google's Quantization-Aware Training
format. "Active" is the per-token active parameter count for MoE models.

Model                      Total    Active   Arch    Q4 VRAM    BF16 VRAM
-------------------------------------------------------------------------
Gemma 4 E2B                2.3B     2.3B     Dense   ~4 GB      ~15 GB
Gemma 4 E4B                4.5B     4.5B     Dense   ~8 GB      ~15 GB
Qwen3.5-4B                 4B       4B       Dense   ~3 GB      ~9 GB
Qwen3.5-9B                 9B       9B       Dense   ~6 GB      ~18 GB
Gemma 4 26B-A4B (MoE)      25.2B    3.8B     MoE     ~14 GB     ~60 GB
Qwen3.5-27B                27B      27B      Dense   ~16 GB     ~54 GB
Qwen3.6-27B                27B      27B      Dense   ~18 GB     ~54 GB
Gemma 4 31B (Dense)        30.7B    30.7B    Dense   ~16 GB     ~62 GB
Qwen3.5-35B-A3B (MoE)      35B      3B       MoE     ~21 GB     ~70 GB
Qwen3.6-35B-A3B (MoE)      35B      3B       MoE     ~22 GB     ~70 GB
Llama 4 Scout (MoE)        109B     17B      MoE     ~55 GB     ~220 GB
Mistral Medium 3.5         128B     128B     Dense   ~70 GB     ~256 GB
DeepSeek-V4-Flash (MoE)    284B     13B      MoE     ~80 GB     ~570 GB
Kimi K2.5 (MoE)            1000B    32B      MoE     ~250 GB    ~2000 GB
Llama 4 Maverick (MoE)     400B     17B      MoE     ~224 GB    ~860 GB
GLM-5.1 (MoE)              754B     40B      MoE     ~180 GB    ~1500 GB
Mistral Large 3 (MoE)      675B     41B      MoE     ~200 GB    ~1350 GB
DeepSeek-V4-Pro (MoE)      1600B    49B      MoE     ~430 GB    not pract.
-------------------------------------------------------------------------
Remote API models (no local VRAM required):
GPT-5.5                    ~?       ~?       ?       N/A        N/A
Claude Opus 4.7            ~?       ~?       ?       N/A        N/A
Gemini 3.1 Pro             ~?       ~?       ?       N/A        N/A
-------------------------------------------------------------------------

This table is the foundation of everything that follows. Every design decision
in our CRD, our controller, and our scheduler will ultimately trace back to
these numbers.
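As a first taste of how a scheduler might consume these figures, the sketch below encodes a handful of rows from the table and filters them against the memory of a given accelerator. The 10% headroom reserved for KV cache and runtime overhead is an assumption chosen for illustration, as are the class and function names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelRow:
    name: str
    q4_gb: int     # "Q4 VRAM" column
    bf16_gb: int   # "BF16 VRAM" column

# A subset of the reference table above.
TABLE = [
    ModelRow("Qwen3.5-9B", 6, 18),
    ModelRow("Gemma 4 26B-A4B", 14, 60),
    ModelRow("Gemma 4 31B", 16, 62),
    ModelRow("Llama 4 Scout", 55, 220),
    ModelRow("Mistral Large 3", 200, 1350),
]

def fits(row, precision, vram_gb, count=1, headroom=0.10):
    """True if the model fits in count accelerators after reserving headroom."""
    needed = row.q4_gb if precision == "Q4" else row.bf16_gb
    usable = vram_gb * count * (1 - headroom)
    return needed <= usable

def candidates(precision, vram_gb, count=1):
    return [row.name for row in TABLE if fits(row, precision, vram_gb, count)]

print("Q4 on 1x 80 GB:   ", candidates("Q4", 80))
print("BF16 on 1x 80 GB: ", candidates("BF16", 80))
print("BF16 on 8x 192 GB:", candidates("BF16", 192, count=8))
```

Treating several accelerators as one pool is itself a simplification; tensor parallelism adds per-device overhead that a real placement decision would need to account for.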

CHAPTER THREE: KUBERNETES 1.33, DRA, AND MULTI-VENDOR GPU SCHEDULING

Kubernetes 1.33, released on April 23, 2025, is arguably the most significant
release for AI workloads in the history of the project so far. The headline
feature for accelerator users is Dynamic Resource Allocation (DRA), which is
available as a beta API in this release.

The traditional approach to GPU scheduling uses the device plugin model. A
device plugin runs as a DaemonSet on each GPU node and advertises GPUs as
extended resources: "nvidia.com/gpu", "amd.com/gpu", or "habana.ai/gaudi". A
pod requests a GPU by setting the appropriate resource limit. The scheduler
finds a node with an available GPU of that type and assigns the pod to it. This
works, but it is extremely coarse-grained. The scheduler knows that a node has
GPUs and that a pod wants GPUs, but it knows nothing about the VRAM capacity,
interconnect topology, or whether the workload would benefit from partitioning.
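The device plugin request is easiest to appreciate when written out. The sketch below builds a minimal Pod manifest as a Python dictionary and prints it as JSON, which `kubectl apply -f -` accepts. The image reference is a placeholder and the function name is an assumption; only the resource keys come from the text above.

```python
import json

def gpu_pod(name, image, resource_key, count):
    """Build a Pod manifest that requests accelerators as an extended resource."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": "inference",
                "image": image,
                # The only thing the pod can say is "how many devices of this type".
                "resources": {"limits": {resource_key: count}},
            }],
        },
    }

# Placeholder image; substitute a real inference image for your vendor.
pod = gpu_pod("inference-test", "registry.example.com/inference:placeholder",
              "amd.com/gpu", 1)
print(json.dumps(pod, indent=2))
```

Notice that the manifest has no field for memory capacity or topology. An integer count per resource key is the entire vocabulary, which is the limitation DRA addresses.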

DRA changes this by introducing a richer API for expressing hardware requirements
and capabilities. With DRA, a device driver can publish detailed information
about each accelerator: its model, its VRAM, its interconnect connectivity, its
supported quantization formats, and any other relevant attributes. A pod can
then request not just "a GPU" but "a GPU with at least 80 gigabytes of VRAM
from any vendor."
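A request of that kind might be expressed as a ResourceClaimTemplate with a CEL selector. The sketch below builds one as a Python dictionary, assuming the `resource.k8s.io/v1beta1` layout; later API versions nest the request fields differently. The device class name and the capacity domain are hypothetical. Real values are defined by each DRA driver, and a genuinely vendor-neutral request would also need a DeviceClass and a capacity name that several drivers agree on.

```python
import json

def vram_claim_template(name, device_class, capacity_domain, min_vram):
    """Build a ResourceClaimTemplate asking for one device with enough memory."""
    # CEL expression evaluated against each device a driver has published.
    expression = (
        f'device.capacity["{capacity_domain}"].memory'
        f'.compareTo(quantity("{min_vram}")) >= 0'
    )
    return {
        "apiVersion": "resource.k8s.io/v1beta1",
        "kind": "ResourceClaimTemplate",
        "metadata": {"name": name},
        "spec": {
            "spec": {
                "devices": {
                    "requests": [{
                        "name": "accelerator",
                        "deviceClassName": device_class,
                        "selectors": [{"cel": {"expression": expression}}],
                    }]
                }
            }
        },
    }

# Hypothetical names: neither the class nor the domain exists by default.
template = vram_claim_template(
    name="accelerator-80gi",
    device_class="accelerators.example.com",
    capacity_domain="accelerators.example.com",
    min_vram="80Gi",
)
print(json.dumps(template, indent=2))
```

A pod would then reference the template under `spec.resourceClaims` instead of setting an extended resource limit.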

Kubernetes 1.33 also introduces several alpha DRA features: Partitionable
Devices (the foundation for MIG-style partitioning), Device Taints and
Tolerations (which mirror node taints and tolerations at the device level), and
Prioritized List (preference ordering over device configurations).

INSTALLING GPU SUPPORT FOR ALL VENDORS

The following commands install GPU support for all three major accelerator
vendors. Run only the sections relevant to the hardware in your cluster.

NVIDIA — GPU Operator:

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set toolkit.enabled=true \
  --set devicePlugin.enabled=true \
  --set dcgmExporter.enabled=true \
  --wait

AMD — GPU Operator (announced January 2025):

helm repo add amd-gpu-operator https://rocm.github.io/gpu-operator
helm repo update
# Chart name as published in AMD's install docs; confirm it with:
#   helm search repo amd-gpu-operator
helm install amd-gpu-operator amd-gpu-operator/gpu-operator-charts \
  --namespace amd-gpu-operator \
  --create-namespace \
  --set devicePlugin.enabled=true \
  --set nodeLabeller.enabled=true \
  --wait

# The AMD GPU Operator installs:
# - amd-gpu-device-plugin (exposes amd.com/gpu resources)
# - amd-gpu-node-labeller (labels nodes with GPU model and VRAM)
# - ROCm driver management
# After installation, AMD GPUs appear as amd.com/gpu resources.

Intel Gaudi — Base Operator:

helm repo add intel https://intel.github.io/helm-charts
helm repo update
helm install gaudi-base-operator intel/intel-gaudi-base-operator \
  --namespace intel-gaudi \
  --create-namespace \
  --wait

# The Intel Gaudi Base Operator installs:
# - Intel Gaudi Device Plugin (exposes habana.ai/gaudi resources)
# - Container runtime configuration
# - Feature discovery
# - Monitoring tools
# After installation, Gaudi cards appear as habana.ai/gaudi resources.

Verify all accelerators are visible to the scheduler:

# NVIDIA GPUs
kubectl get nodes -o custom-columns='NAME:.metadata.name,NVIDIA_GPU:.status.allocatable.nvidia\.com/gpu'

# AMD GPUs
kubectl get nodes -o custom-columns='NAME:.metadata.name,AMD_GPU:.status.allocatable.amd\.com/gpu'

# Intel Gaudi
kubectl get nodes -o custom-columns='NAME:.metadata.name,GAUDI:.status.allocatable.habana\.ai/gaudi'
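For a cluster-wide summary rather than a per-node listing, the same information can be totalled from the JSON form of the node list. This is a small sketch that assumes the standard `status.allocatable` layout returned by `kubectl get nodes -o json`; the sample data and the function name are made up for illustration.

```python
RESOURCE_KEYS = ("nvidia.com/gpu", "amd.com/gpu", "habana.ai/gaudi")

def accelerator_totals(node_list):
    """Sum allocatable accelerators per resource key across all nodes."""
    totals = {key: 0 for key in RESOURCE_KEYS}
    for node in node_list.get("items", []):
        allocatable = node.get("status", {}).get("allocatable", {})
        for key in RESOURCE_KEYS:
            # Extended resources are reported as integer strings, e.g. "8".
            totals[key] += int(allocatable.get(key, "0"))
    return totals

# Made-up sample in the shape of `kubectl get nodes -o json`.
sample = {
    "items": [
        {"metadata": {"name": "gpu-a"}, "status": {"allocatable": {"nvidia.com/gpu": "8"}}},
        {"metadata": {"name": "gpu-b"}, "status": {"allocatable": {"amd.com/gpu": "4"}}},
        {"metadata": {"name": "cpu-a"}, "status": {"allocatable": {"cpu": "32"}}},
    ]
}

print(accelerator_totals(sample))
```

To run it against a live cluster, replace `sample` with `json.load(sys.stdin)` and pipe the `kubectl` output into the script.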

CHAPTER FOUR: DESIGNING THE LLMMODEL CUSTOM RESOURCE

We want to define a Kubernetes Custom Resource that captures everything a
scheduler needs to know about an LLM deployment. The resource must be rich
enough to express the full diversity of the model landscape surveyed in Chapter
Two, vendor-agnostic so it works with NVIDIA, AMD, Intel Gaudi, and CPU-only
nodes, and simple enough that an engineer can write one without consulting a
manual.

The most fundamental distinction is between a locally-hosted model and a remote
API-only model. Frontier models like GPT-5.5, Claude Opus 4.7, and Gemini 3.1
Pro are not available as downloadable weights. Our CRD represents both cases.

A critical new field is `acceleratorVendor`, which selects the hardware backend.
The controller uses this field to select the correct:
- Kubernetes resource key (nvidia.com/gpu, amd.com/gpu, habana.ai/gaudi)
- vLLM Docker image (CUDA, ROCm, or Gaudi variant)
- Node selector labels (populated by the respective GPU operator)
- Tolerations (vendor-specific GPU node taints)
- Prometheus metrics source (DCGM for NVIDIA, ROCm SMI for AMD, Gaudi metrics)
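One way to picture this selection is as a lookup table keyed by vendor. The sketch below is not the controller built later in this guide; it only illustrates the idea. The resource keys and metrics sources come from the list above, while the image references are placeholders and the names `VendorProfile` and `resource_limits` are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class VendorProfile:
    resource_key: Optional[str]    # None means no accelerator is requested
    image: str                     # placeholder image reference
    metrics_source: Optional[str]

PROFILES = {
    "nvidia": VendorProfile("nvidia.com/gpu",
                            "registry.example.com/vllm-cuda:placeholder", "DCGM"),
    "amd": VendorProfile("amd.com/gpu",
                         "registry.example.com/vllm-rocm:placeholder", "ROCm SMI"),
    "intel-gaudi": VendorProfile("habana.ai/gaudi",
                                 "registry.example.com/vllm-gaudi:placeholder",
                                 "Gaudi metrics"),
    "cpu": VendorProfile(None, "registry.example.com/llamacpp:placeholder", None),
}

def resource_limits(vendor, accelerator_count):
    """Return the container resource limits implied by the vendor choice."""
    profile = PROFILES[vendor]
    if profile.resource_key is None:
        return {}
    return {profile.resource_key: accelerator_count}

for vendor in PROFILES:
    print(f"{vendor:12s} -> {resource_limits(vendor, 1)}")
```

Keeping the vendor differences in data rather than in branching code means a new accelerator type becomes a new table entry instead of a new code path.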

The CRD definition:

# File: config/crd/bases/ai.example.io_llmmodels.yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  # A CRD's name must be <plural>.<group>.
  name: llmmodels.ai.example.io
  annotations:
    ai.example.io/schema-version: "v1alpha1"
spec:
  group: ai.example.io
  names:
    kind: LLMModel
    listKind: LLMModelList
    plural: llmmodels
    singular: llmmodel
    shortNames:
      - lm
  scope: Namespaced
  versions:
    - name: v1alpha1
      served: true
      storage: true
      additionalPrinterColumns:
        - name: Model
          type: string
          jsonPath: .spec.modelId
        - name: Vendor
          type: string
          jsonPath: .spec.acceleratorVendor
        - name: Engine
          type: string
          jsonPath: .spec.inferenceEngine
        - name: Mode
          type: string
          jsonPath: .spec.deploymentMode
        # Integer fields need an integer column type to be displayed.
        - name: VRAM
          type: integer
          jsonPath: .spec.resources.vramPerAcceleratorGiB
        - name: Context
          type: integer
          jsonPath: .spec.contextWindowK
        - name: Status
          type: string
          jsonPath: .status.phase
        - name: Endpoint
          type: string
          jsonPath: .status.endpoint
      schema:
        openAPIV3Schema:
          type: object
          required:
            - spec
          properties:
            spec:
              type: object
              required:
                - modelId
                - deploymentMode
              properties:
                modelId:
                  type: string
                  description: >
                    The canonical model identifier. For Hugging Face models
                    this is the repo path (owner/name). For OCI models this
                    is the image reference. For remote API models this is
                    the model name as used in the API request.
                deploymentMode:
                  type: string
                  enum:
                    - local
                    - remote
                  description: >
                    Whether to run the model locally in the cluster or to
                    proxy requests to a remote API endpoint.
                # -------------------------------------------------------
                # ACCELERATOR VENDOR SELECTION
                # This is the key field that makes the operator
                # hardware-agnostic. The controller uses this to select
                # the correct resource key, Docker image, node selectors,
                # and tolerations.
                # -------------------------------------------------------
                acceleratorVendor:
                  type: string
                  enum:
                    - nvidia
                    - amd
                    - intel-gaudi
                    - cpu
                  default: nvidia
                  description: >
                    The accelerator vendor for local deployments.
                    nvidia: Uses nvidia.com/gpu resource key, CUDA-based
                    vLLM image, DCGM metrics.
                    amd: Uses amd.com/gpu resource key, ROCm-based vLLM
                    image, ROCm SMI metrics.
                    intel-gaudi: Uses habana.ai/gaudi resource key,
                    Intel Gaudi optimized vLLM fork image, Gaudi metrics.
                    cpu: No GPU resource requested. Uses llama.cpp or
                    Ollama for CPU inference. Suitable for small models
                    on Apple Silicon or x86 servers.
                inferenceEngine:
                  type: string
                  enum:
                    - vllm
                    - llamacpp
                    - ollama
                  default: vllm
                  description: >
                    The inference engine to use for local deployments.
                    vLLM is recommended for GPU deployments (NVIDIA, AMD,
                    Intel Gaudi) with high throughput requirements.
                    llamacpp is recommended for CPU inference or consumer
                    GPU inference using GGUF quantized models.
                    Ollama wraps llama.cpp with a model management layer
                    and OpenAI-compatible API.
                modelType:
                  type: string
                  enum:
                    - Dense
                    - MoE
                    - DenseThinking
                    - MoEThinking
                  default: Dense
                  description: >
                    The architectural type of the model. Dense models
                    activate all parameters for every token. MoE models
                    route each token through a subset of expert networks.
                    Thinking variants support chain-of-thought reasoning,
                    either always (DenseThinking) or switchable via system
                    prompt (MoEThinking).
                totalParametersBillions:
                  type: number
                  description: >
                    Total number of parameters in billions. For MoE models
                    this is the sum of all expert parameters. This value
                    drives VRAM/HBM requirements.
                activeParametersBillions:
                  type: number
                  description: >
                    Number of parameters activated per token in billions.
                    For dense models this equals totalParametersBillions.
                    For MoE models this drives compute (FLOP) cost per token.
                quantization:
                  type: string
                  enum:
                    - None
                    - FP16
                    - BF16
                    - FP8
                    - FP4
                    - INT8
                    - INT4
                    - QAT-INT4
                    - GPTQ
                    - AWQ
                    - GGUF-Q4_K_M
                    - GGUF-Q8_0
                    - MXFP4
                    - MXFP6
                  default: None
                  description: >
                    The quantization format of the model weights.
                    QAT-INT4 is Google's Quantization-Aware Training format.
                    MXFP4 and MXFP6 are AMD CDNA 4 (MI350X) native formats.
                    FP8 is natively supported by Intel Gaudi 3, NVIDIA H100+,
                    and AMD MI300X+.
                    GGUF variants are used with llamacpp and Ollama.
                    AWQ and GPTQ are used with vLLM on NVIDIA and AMD.
                contextWindowK:
                  type: integer
                  description: >
                    The maximum context window in thousands of tokens.
                    This affects KV cache memory requirements, which grow
                    linearly with context length.
                domains:
                  type: array
                  items:
                    type: string
                    enum:
                      - general
                      - code
                      - math
                      - reasoning
                      - vision
                      - audio
                      - multilingual
                      - embedding
                      - function-calling
                      - long-context
                      - agentic
                  description: >
                    The capability domains this model excels in.
                languages:
                  type: array
                  items:
                    type: string
                  description: >
                    Languages this model supports. Use ISO 639-1 codes
                    or descriptive strings like "140+" for broad support.
                resources:
                  type: object
                  properties:
                    acceleratorCount:
                      type: integer
                      default: 1
                      description: >
                        Number of accelerators required per replica.
                        The resource key used depends on acceleratorVendor:
                        nvidia -> nvidia.com/gpu
                        amd -> amd.com/gpu
                        intel-gaudi -> habana.ai/gaudi
                        cpu -> no accelerator resource requested
                    vramPerAcceleratorGiB:
                      type: integer
                      description: >
                        Required VRAM/HBM per accelerator in gibibytes.
                        The controller uses this to select nodes whose
                        accelerators have sufficient memory.
                        For AMD MI300X this can be up to 192 GiB.
                        For Intel Gaudi 3 this is up to 128 GiB.
                        For NVIDIA H100 this is up to 80 GiB.
                        For NVIDIA H200 this is up to 141 GiB.
                        For NVIDIA B200 this is up to 192 GiB.
                    preferredAcceleratorModel:
                      type: string
                      description: >
                        Optional preferred accelerator model string used
                        as a DRA preference hint. Examples:
                        "NVIDIA H100 80GB HBM3"
                        "AMD Instinct MI300X"
                        "Intel Gaudi 3"
                    cpuMillicores:
                      type: integer
                      default: 4000
                      description: >
                        CPU request in millicores for the inference pod.
                    memoryGiB:
                      type: integer
                      default: 16
                      description: >
                        System RAM request in gibibytes for the inference
                        pod. Separate from accelerator VRAM/HBM.
                        For MoE models with CPU offloading, this may need
                        to be very large (e.g., 256 GiB for Kimi K2.5
                        with quantized CPU offload).
                engineArgs:
                  type: object
                  additionalProperties:
                    type: string
                  description: >
                    Key-value pairs passed as command-line arguments to the
                    inference engine. For vLLM, common args include
                    tensor-parallel-size, max-model-len, and
                    gpu-memory-utilization. Values are always strings.
                modelSource:
                  type: object
                  properties:
                    type:
                      type: string
                      enum:
                        - huggingface
                        - oci
                        - s3
                    huggingFaceTokenSecret:
                      type: string
                    ociImage:
                      type: string
                    s3Bucket:
                      type: string
                    s3Prefix:
                      type: string
                    s3CredentialsSecret:
                      type: string
                scaling:
                  type: object
                  properties:
                    minReplicas:
                      type: integer
                      default: 1
                      description: >
                        Minimum replica count. Set to 0 for scale-to-zero.
                    maxReplicas:
                      type: integer
                      default: 3
                    scaleUpThreshold:
                      type: integer
                      default: 5
                      description: >
                        Queued inference requests that trigger scale-up.
                        KEDA monitors vllm:num_requests_waiting.
                    cooldownPeriodSeconds:
                      type: integer
                      default: 300
                      description: >
                        Seconds KEDA waits after the last scale-down
                        trigger before actually scaling down.
                remoteApi:
                  type: object
                  properties:
                    baseUrl:
                      type: string
                    apiKeySecret:
                      type: string
                    rateLimitRpm:
                      type: integer
                mcpExposure:
                  type: object
                  properties:
                    enabled:
                      type: boolean
                      default: false
                    toolName:
                      type: string
                    toolDescription:
                      type: string
            status:
              type: object
              properties:
                phase:
                  type: string
                  enum:
                    - Pending
                    - Downloading
                    - Starting
                    - Ready
                    - Degraded
                    - Failed
                    - Proxying
                endpoint:
                  type: string
                conditions:
                  type: array
                  items:
                    type: object
                    properties:
                      type:
                        type: string
                      status:
                        type: string
                      lastTransitionTime:
                        type: string
                      reason:
                        type: string
                      message:
                        type: string
                acceleratorNodes:
                  type: array
                  items:
                    type: string
                  description: >
                    Names of the Kubernetes nodes where this model's
                    inference pods are currently scheduled.
                currentReplicas:
                  type: integer
                requestsPerMinute:
                  type: integer
                averageLatencyMs:
                  type: integer
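The contextWindowK description notes that KV cache memory grows linearly with context length. The sketch below shows where that linearity comes from for standard attention: one key tensor and one value tensor are stored per layer for every token. The layer count, KV head count, and head dimension used here describe a hypothetical model, not one from Chapter Two, and architectures with sliding-window or compressed attention will deviate from this formula.

```python
def kv_cache_gib(context_tokens, layers, kv_heads, head_dim,
                 bytes_per_value=2, batch=1):
    """KV cache size in GiB for standard attention (BF16 by default)."""
    # The leading 2 counts one key tensor plus one value tensor per layer.
    total_bytes = (2 * layers * kv_heads * head_dim
                   * bytes_per_value * context_tokens * batch)
    return total_bytes / 2**30

# Hypothetical model shape, chosen only to keep the arithmetic readable.
shape = {"layers": 48, "kv_heads": 8, "head_dim": 128}

for tokens in (65_536, 131_072, 262_144):
    print(f"{tokens:>7} tokens -> {kv_cache_gib(tokens, **shape):.0f} GiB")
```

Doubling the context doubles the cache, so a 256K window costs four times the memory of a 64K window for the same model. This is why the sample resources that follow cap `max-model-len` well below the advertised context window.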

Now let us look at concrete LLMModel resources for several scenarios.

Example 1 — Gemma 4 26B-A4B on an AMD MI300X (single GPU, MoE):

# File: config/samples/gemma4-26b-amd.yaml
apiVersion: ai.example.io/v1alpha1
kind: LLMModel
metadata:
  name: gemma4-26b-amd   # required field; name assumed from the file name
  namespace: ai-models
  labels:
    ai.example.io/family: gemma4
    ai.example.io/vendor: google
    ai.example.io/accelerator: amd
spec:
  modelId: google/gemma-4-26b-a4b-it
  deploymentMode: local
  acceleratorVendor: amd
  inferenceEngine: vllm
  modelType: MoE
  totalParametersBillions: 25.2
  activeParametersBillions: 3.8
  quantization: AWQ
  contextWindowK: 256
  domains:
    - general
    - vision
    - multilingual
    - function-calling
    - reasoning
    - agentic
  languages:
    - "140+"
  resources:
    acceleratorCount: 1
    vramPerAcceleratorGiB: 16
    preferredAcceleratorModel: "AMD Instinct MI300X"
    cpuMillicores: 8000
    memoryGiB: 64
  engineArgs:
    gpu-memory-utilization: "0.90"
    max-model-len: "65536"
    enable-chunked-prefill: "true"
  modelSource:
    type: huggingface
    huggingFaceTokenSecret: hf-token
  scaling:
    minReplicas: 0
    maxReplicas: 3
    scaleUpThreshold: 3
    cooldownPeriodSeconds: 300
  mcpExposure:
    enabled: true
    toolName: gemma4_26b_vision_amd
    toolDescription: >
      Gemma 4 26B-A4B MoE model from Google running on AMD MI300X.
      Supports text and image input with 256K context window.
      Excellent for general tasks, vision, multilingual content, function
      calling, and agentic workflows. Runs locally with full data privacy.
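The scaling block in Example 1 can be read as a simple rule. The sketch below is a simplified model of it, assuming the queue depth is divided by scaleUpThreshold and rounded up, in the way the Horizontal Pod Autoscaler treats an average-value target. It deliberately ignores cooldownPeriodSeconds and KEDA's activation logic, and the function name is an assumption.

```python
import math

def desired_replicas(queued_requests, scale_up_threshold,
                     min_replicas, max_replicas):
    """Replica count implied by the current queue depth (simplified)."""
    if queued_requests <= 0:
        # An empty queue falls back to the floor, which may be zero.
        return min_replicas
    wanted = math.ceil(queued_requests / scale_up_threshold)
    return max(min_replicas, min(max_replicas, wanted))

# Values from Example 1: scale-to-zero, at most 3 replicas, threshold of 3.
example = {"scale_up_threshold": 3, "min_replicas": 0, "max_replicas": 3}

for queue in (0, 1, 4, 7, 20):
    print(f"queue={queue:>2} -> replicas={desired_replicas(queue, **example)}")
```

With minReplicas set to 0 the model consumes no accelerator while idle, at the price of a cold start when the first request arrives.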

Example 2 — Qwen3.6-35B-A3B on an Intel Gaudi 3 (thinking mode):

# File: config/samples/qwen36-35b-gaudi.yaml
apiVersion: ai.example.io/v1alpha1
kind: LLMModel
metadata:
  name: qwen36-35b-gaudi   # required field; name assumed from the file name
  namespace: ai-models
  labels:
    ai.example.io/family: qwen36
    ai.example.io/vendor: alibaba
    ai.example.io/accelerator: intel-gaudi
spec:
  modelId: Qwen/Qwen3.6-35B-A3B
  deploymentMode: local
  acceleratorVendor: intel-gaudi
  inferenceEngine: vllm
  modelType: MoEThinking
  totalParametersBillions: 35
  activeParametersBillions: 3
  quantization: FP8
  contextWindowK: 256
  domains:
    - general
    - reasoning
    - math
    - code
    - multilingual
    - agentic
  languages:
    - "201+"
  resources:
    acceleratorCount: 1
    # FP8 stores about one byte per parameter, so the weights alone need
    # roughly 35 GB. The 22 GB figure quoted in Chapter Two applies to 4-bit
    # quantization, so the request is sized above the FP8 weight footprint.
    vramPerAcceleratorGiB: 48
    preferredAcceleratorModel: "Intel Gaudi 3"
    cpuMillicores: 8000
    memoryGiB: 64
  engineArgs:
    max-model-len: "65536"
    tensor-parallel-size: "1"
  modelSource:
    type: huggingface
    huggingFaceTokenSecret: hf-token
  scaling:
    minReplicas: 0
    maxReplicas: 2
    scaleUpThreshold: 2
    cooldownPeriodSeconds: 300
  mcpExposure:
    enabled: true
    toolName: qwen36_reasoning_gaudi
    toolDescription: >
      Qwen3.6-35B-A3B MoE thinking model running on Intel Gaudi 3.
      Supports hybrid thinking/non-thinking mode. Excellent for math,
      code, complex reasoning, and agentic tasks. 256K context window.
      Runs locally with full data privacy.

Example 3 — Gemma 4 31B on NVIDIA H100 (dense, high quality):

# File: config/samples/gemma4-31b-nvidia.yaml
apiVersion: ai.example.io/v1alpha1
kind: LLMModel
metadata:
  name: gemma4-31b-nvidia  # required by Kubernetes; assumed here from the sample filename
  namespace: ai-models
  labels:
    ai.example.io/family: gemma4
    ai.example.io/vendor: google
    ai.example.io/accelerator: nvidia
spec:
  modelId: google/gemma-4-31b-it
  deploymentMode: local
  acceleratorVendor: nvidia
  inferenceEngine: vllm
  modelType: Dense
  totalParametersBillions: 30.7
  activeParametersBillions: 30.7
  quantization: INT4
  contextWindowK: 256
  domains:
    - general
    - vision
    - code
    - reasoning
    - multilingual
    - function-calling
    - agentic
  languages:
    - "140+"
  resources:
    acceleratorCount: 1
    vramPerAcceleratorGiB: 24
    preferredAcceleratorModel: "NVIDIA H100 80GB HBM3"
    cpuMillicores: 8000
    memoryGiB: 32
  engineArgs:
    gpu-memory-utilization: "0.90"
    max-model-len: "65536"
    enable-chunked-prefill: "true"
  modelSource:
    type: huggingface
    huggingFaceTokenSecret: hf-token
  scaling:
    minReplicas: 1
    maxReplicas: 2
    scaleUpThreshold: 3
    cooldownPeriodSeconds: 300
  mcpExposure:
    enabled: true
    toolName: gemma4_31b_nvidia
    toolDescription: >
      Gemma 4 31B dense model from Google running on NVIDIA H100.
      Highest quality in the Gemma 4 family. Supports text and image
      input with 256K context window. Excellent for coding, reasoning,
      and agentic workflows. Runs locally with full data privacy.

Example 4 — Qwen3.5-9B on CPU (Apple Silicon or x86, no GPU):

# File: config/samples/qwen35-9b-cpu.yaml
apiVersion: ai.example.io/v1alpha1
kind: LLMModel
metadata:
  name: qwen35-9b-cpu  # required by Kubernetes; assumed here from the sample filename
  namespace: ai-models
  labels:
    ai.example.io/family: qwen35
    ai.example.io/vendor: alibaba
    ai.example.io/accelerator: cpu
spec:
  modelId: Qwen/Qwen3.5-9B-GGUF
  deploymentMode: local
  acceleratorVendor: cpu
  inferenceEngine: ollama
  modelType: DenseThinking
  totalParametersBillions: 9
  activeParametersBillions: 9
  quantization: GGUF-Q4_K_M
  contextWindowK: 128
  domains:
    - general
    - reasoning
    - code
    - multilingual
  languages:
    - "201+"
  resources:
    acceleratorCount: 0
    cpuMillicores: 16000
    memoryGiB: 32
  engineArgs:
    num-ctx: "32768"
  modelSource:
    type: huggingface
    huggingFaceTokenSecret: hf-token
  scaling:
    minReplicas: 1
    maxReplicas: 4
    scaleUpThreshold: 5
    cooldownPeriodSeconds: 120
  mcpExposure:
    enabled: true
    toolName: qwen35_9b_cpu
    toolDescription: >
      Qwen3.5-9B running on CPU via Ollama. No GPU required. Supports
      hybrid thinking mode. Good for general tasks, code, and reasoning
      on CPU-only nodes or Apple Silicon. 128K context window.

Example 5 — GPT-5.5 as a remote API proxy:

# File: config/samples/gpt55-remote.yaml
apiVersion: ai.example.io/v1alpha1
kind: LLMModel
metadata:
  name: gpt55-remote  # required by Kubernetes; assumed here from the sample filename
  namespace: ai-models
  labels:
    ai.example.io/family: gpt55
    ai.example.io/vendor: openai
    ai.example.io/deployment-mode: remote
spec:
  modelId: gpt-5.5
  deploymentMode: remote
  modelType: Dense
  contextWindowK: 1000
  domains:
    - general
    - code
    - math
    - reasoning
    - vision
    - function-calling
    - long-context
    - agentic
  languages:
    - "50+"
  remoteApi:
    baseUrl: https://api.openai.com/v1
    apiKeySecret: openai-api-key
    rateLimitRpm: 10000
  mcpExposure:
    enabled: true
    toolName: gpt55_frontier
    toolDescription: >
      OpenAI GPT-5.5 frontier model via the OpenAI API. 1M token context
      window. Five reasoning levels (none, low, medium, high, xhigh).
      Best-in-class for complex agentic tasks, coding, and research.
      Use when data privacy constraints permit external API calls.

Example 6 — Claude Opus 4.7 as a remote API proxy:

# File: config/samples/claude-opus47-remote.yaml
apiVersion: ai.example.io/v1alpha1
kind: LLMModel
metadata:
  name: claude-opus47-remote  # required by Kubernetes; assumed here from the sample filename
  namespace: ai-models
  labels:
    ai.example.io/family: claude-opus
    ai.example.io/vendor: anthropic
    ai.example.io/deployment-mode: remote
spec:
  modelId: claude-opus-4-7
  deploymentMode: remote
  modelType: DenseThinking
  contextWindowK: 1000
  domains:
    - general
    - code
    - reasoning
    - vision
    - function-calling
    - long-context
    - agentic
  languages:
    - "50+"
  remoteApi:
    baseUrl: https://api.anthropic.com/v1
    apiKeySecret: anthropic-api-key
    rateLimitRpm: 5000
  mcpExposure:
    enabled: true
    toolName: claude_opus47
    toolDescription: >
      Anthropic Claude Opus 4.7 via the Anthropic API. 1M token context
      window. Excellent for multi-step agentic tasks, long-horizon reasoning,
      and complex tool-dependent workflows. Supports xhigh reasoning effort.

Example 7 — Gemini 3.1 Pro as a remote API proxy:

# File: config/samples/gemini31-remote.yaml
apiVersion: ai.example.io/v1alpha1
kind: LLMModel
metadata:
  name: gemini31-remote  # required by Kubernetes; assumed here from the sample filename
  namespace: ai-models
  labels:
    ai.example.io/family: gemini31
    ai.example.io/vendor: google
    ai.example.io/deployment-mode: remote
spec:
  modelId: gemini-3.1-pro
  deploymentMode: remote
  modelType: Dense
  contextWindowK: 1000
  domains:
    - general
    - code
    - reasoning
    - vision
    - audio
    - function-calling
    - long-context
    - agentic
  languages:
    - "50+"
  remoteApi:
    baseUrl: https://generativelanguage.googleapis.com/v1beta/openai
    apiKeySecret: google-api-key
    rateLimitRpm: 2000
  mcpExposure:
    enabled: true
    toolName: gemini31_pro
    toolDescription: >
      Google Gemini 3.1 Pro via the Google AI API. 1M token context window.
      Processes up to 900 images, 8.4 hours of audio, or 1 hour of video
      per prompt. Three-tier thinking system (low, medium, high).
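
The `totalParametersBillions`, `quantization`, and `vramPerAcceleratorGiB` fields in the local samples are linked by simple arithmetic: the weights occupy roughly the parameter count times the bytes per parameter. The Go sketch below is one way to sanity-check a sample before applying it. The bytes-per-parameter table and the function name `weightFootprintGiB` are illustrative assumptions, not part of the operator, and the estimate covers weights only (no KV cache or activations).

```go
package main

import "fmt"

// bytesPerParam maps a quantization label from the LLMModel spec to an
// approximate weight size in bytes per parameter. These figures are
// rough assumptions for illustration only.
var bytesPerParam = map[string]float64{
	"FP16": 2, "BF16": 2,
	"FP8": 1, "INT8": 1,
	"INT4": 0.5, "AWQ": 0.5, "GPTQ": 0.5,
}

// weightFootprintGiB estimates the memory needed for model weights alone.
// It returns false if the quantization label is not in the table.
func weightFootprintGiB(totalParamsBillions float64, quant string) (float64, bool) {
	bpp, ok := bytesPerParam[quant]
	if !ok {
		return 0, false
	}
	const gib = 1 << 30
	return totalParamsBillions * 1e9 * bpp / gib, true
}

func main() {
	// A mixture-of-experts model activates few parameters per token, but
	// every expert must still be resident, so the total count is used.
	samples := []struct {
		name   string
		params float64
		quant  string
	}{
		{"35B at FP8", 35, "FP8"},
		{"30.7B at INT4", 30.7, "INT4"},
	}
	for _, s := range samples {
		g, _ := weightFootprintGiB(s.params, s.quant)
		fmt.Printf("%s: about %.1f GiB of weights\n", s.name, g)
	}
}
```

Comparing the result against `vramPerAcceleratorGiB * acceleratorCount` gives a quick lower bound; the real requirement is higher once the KV cache for the chosen `max-model-len` is added.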

CHAPTER FIVE: THE LLMMODEL CONTROLLER

The controller watches for LLMModel resources and reconciles the actual state

of the cluster to match the desired state. The key design principle is that

the `acceleratorVendor` field drives all hardware-specific decisions, keeping

the rest of the reconciliation logic vendor-agnostic.

FILE: go.mod

module github.com/example/llm-operator

go 1.22

require (

k8s.io/api v0.30.0

k8s.io/apimachinery v0.30.0

k8s.io/client-go v0.30.0

sigs.k8s.io/controller-runtime v0.18.0

)

FILE: api/v1alpha1/llmmodel_types.go

package v1alpha1

import (

metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

)

// LLMModelSpec defines the desired state of an LLMModel resource.

type LLMModelSpec struct {

// ModelId is the canonical identifier for the model.

// +kubebuilder:validation:Required

ModelId string `json:"modelId"`

// DeploymentMode determines whether to run locally or proxy remotely.

// +kubebuilder:validation:Enum=local;remote

DeploymentMode string `json:"deploymentMode"`

// AcceleratorVendor selects the hardware backend for local deployments.

// +kubebuilder:validation:Enum=nvidia;amd;intel-gaudi;cpu

// +kubebuilder:default=nvidia

AcceleratorVendor string `json:"acceleratorVendor,omitempty"`

// InferenceEngine selects the serving framework for local deployments.

// +kubebuilder:validation:Enum=vllm;llamacpp;ollama

// +kubebuilder:default=vllm

InferenceEngine string `json:"inferenceEngine,omitempty"`

// ModelType describes the architectural pattern of the model.

// +kubebuilder:validation:Enum=Dense;MoE;DenseThinking;MoEThinking

// +kubebuilder:default=Dense

ModelType string `json:"modelType,omitempty"`

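// TotalParametersBillions is the total parameter count in billions.

// Note: controller-gen rejects float fields such as this one and

// ActiveParametersBillions unless CRD generation runs with

// crd:allowDangerousTypes=true.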

TotalParametersBillions float64 `json:"totalParametersBillions,omitempty"`

// ActiveParametersBillions is the per-token active parameter count.

ActiveParametersBillions float64 `json:"activeParametersBillions,omitempty"`

// Quantization specifies the weight format and precision.

// +kubebuilder:validation:Enum=None;FP16;BF16;FP8;FP4;INT8;INT4;QAT-INT4;GPTQ;AWQ;GGUF-Q4_K_M;GGUF-Q8_0;MXFP4;MXFP6

Quantization string `json:"quantization,omitempty"`

// ContextWindowK is the maximum context length in thousands of tokens.

ContextWindowK int `json:"contextWindowK,omitempty"`

// Domains lists the capability areas this model excels in.

Domains []string `json:"domains,omitempty"`

// Languages lists the languages this model supports.

Languages []string `json:"languages,omitempty"`

// Resources specifies accelerator and system resource requirements.

Resources LLMResourceSpec `json:"resources,omitempty"`

// EngineArgs are passed directly to the inference engine as CLI flags.

EngineArgs map[string]string `json:"engineArgs,omitempty"`

// ModelSource configures where to download model weights from.

ModelSource *ModelSourceSpec `json:"modelSource,omitempty"`

// Scaling configures replica count and autoscaling thresholds.

Scaling LLMScalingSpec `json:"scaling,omitempty"`

// RemoteApi configures the external API for remote deployments.

RemoteApi *RemoteApiSpec `json:"remoteApi,omitempty"`

// McpExposure configures MCP tool registration for this model.

McpExposure LLMMcpSpec `json:"mcpExposure,omitempty"`

}

// LLMResourceSpec captures accelerator and system resource requirements.

type LLMResourceSpec struct {

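// AcceleratorCount is the number of accelerators required per replica.

// Set to 0 for the cpu acceleratorVendor. The JSON tag has no omitempty so

// that an explicit 0 is not dropped, and then re-defaulted to 1, when the

// controller writes the object back.

// +kubebuilder:default=1

// +optional

AcceleratorCount int `json:"acceleratorCount"`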

// VramPerAcceleratorGiB is the required VRAM/HBM per accelerator in GiB.

VramPerAcceleratorGiB int `json:"vramPerAcceleratorGiB,omitempty"`

// PreferredAcceleratorModel is an optional DRA preference hint.

PreferredAcceleratorModel string `json:"preferredAcceleratorModel,omitempty"`

// CpuMillicores is the CPU request for the inference pod.

// +kubebuilder:default=4000

CpuMillicores int `json:"cpuMillicores,omitempty"`

// MemoryGiB is the system RAM request for the inference pod.

// +kubebuilder:default=16

MemoryGiB int `json:"memoryGiB,omitempty"`

}

// ModelSourceSpec configures model weight download.

type ModelSourceSpec struct {

// Type selects the download backend.

// +kubebuilder:validation:Enum=huggingface;oci;s3

Type string `json:"type"`

HuggingFaceTokenSecret string `json:"huggingFaceTokenSecret,omitempty"`

OciImage string `json:"ociImage,omitempty"`

S3Bucket string `json:"s3Bucket,omitempty"`

S3Prefix string `json:"s3Prefix,omitempty"`

S3CredentialsSecret string `json:"s3CredentialsSecret,omitempty"`

}

// LLMScalingSpec configures replica autoscaling.

type LLMScalingSpec struct {

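// MinReplicas is the minimum replica count. Set to 0 for scale-to-zero.

// The JSON tag has no omitempty so that an explicit 0 is not dropped, and

// then re-defaulted to 1, when the controller writes the object back.

// +kubebuilder:default=1

// +optional

MinReplicas int `json:"minReplicas"`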

// MaxReplicas is the maximum replica count.

// +kubebuilder:default=3

MaxReplicas int `json:"maxReplicas,omitempty"`

// ScaleUpThreshold is the queued request count that triggers scale-up.

// +kubebuilder:default=5

ScaleUpThreshold int `json:"scaleUpThreshold,omitempty"`

// CooldownPeriodSeconds is how long KEDA waits before scaling down.

// +kubebuilder:default=300

CooldownPeriodSeconds int `json:"cooldownPeriodSeconds,omitempty"`

}

// RemoteApiSpec configures a remote OpenAI-compatible API proxy.

type RemoteApiSpec struct {

BaseUrl string `json:"baseUrl"`

ApiKeySecret string `json:"apiKeySecret"`

RateLimitRpm int `json:"rateLimitRpm,omitempty"`

}

// LLMMcpSpec configures MCP tool exposure for this model.

type LLMMcpSpec struct {

Enabled bool `json:"enabled,omitempty"`

ToolName string `json:"toolName,omitempty"`

ToolDescription string `json:"toolDescription,omitempty"`

}

// LLMModelStatus reflects the observed state of the LLMModel.

type LLMModelStatus struct {

Phase string `json:"phase,omitempty"`

Endpoint string `json:"endpoint,omitempty"`

Conditions []metav1.Condition `json:"conditions,omitempty"`

AcceleratorNodes []string `json:"acceleratorNodes,omitempty"`

CurrentReplicas int `json:"currentReplicas,omitempty"`

RequestsPerMinute int `json:"requestsPerMinute,omitempty"`

AverageLatencyMs int `json:"averageLatencyMs,omitempty"`

}

// LLMModel is the Schema for the llmmodels API.

// +kubebuilder:object:root=true

// +kubebuilder:subresource:status

// +kubebuilder:printcolumn:name="Model",type=string,JSONPath=".spec.modelId"

// +kubebuilder:printcolumn:name="Vendor",type=string,JSONPath=".spec.acceleratorVendor"

// +kubebuilder:printcolumn:name="Mode",type=string,JSONPath=".spec.deploymentMode"

// +kubebuilder:printcolumn:name="Status",type=string,JSONPath=".status.phase"

// +kubebuilder:printcolumn:name="Endpoint",type=string,JSONPath=".status.endpoint"

type LLMModel struct {

metav1.TypeMeta `json:",inline"`

metav1.ObjectMeta `json:"metadata,omitempty"`

Spec LLMModelSpec `json:"spec,omitempty"`

Status LLMModelStatus `json:"status,omitempty"`

}

// LLMModelList contains a list of LLMModel resources.

// +kubebuilder:object:root=true

type LLMModelList struct {

metav1.TypeMeta `json:",inline"`

metav1.ListMeta `json:"metadata,omitempty"`

Items []LLMModel `json:"items"`

}

func init() {

SchemeBuilder.Register(&LLMModel{}, &LLMModelList{})

}
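
The comments on `LLMScalingSpec` describe `scaleUpThreshold` as the queued request count that triggers scale-up. The sketch below shows how such a threshold typically turns into a replica count, assuming the autoscaler applies the Horizontal Pod Autoscaler's average-value rule (desired replicas = ceil(total metric / target per replica)) and clamps the result to the configured bounds. The function name `desiredReplicas` is hypothetical, and the real behaviour depends on how the KEDA trigger is configured.

```go
package main

import "fmt"

// desiredReplicas applies an average-value scaling rule: enough replicas
// so that each handles at most `threshold` queued requests, clamped to
// [minReplicas, maxReplicas]. This is an illustrative assumption about
// the autoscaler, not code from the operator.
func desiredReplicas(queued, threshold, minReplicas, maxReplicas int) int {
	if threshold < 1 {
		threshold = 1 // guard against division by zero
	}
	n := (queued + threshold - 1) / threshold // integer ceil(queued/threshold)
	if n < minReplicas {
		n = minReplicas
	}
	if n > maxReplicas {
		n = maxReplicas
	}
	return n
}

func main() {
	// Values mirror a spec with minReplicas=0, maxReplicas=3, scaleUpThreshold=3.
	for _, q := range []int{0, 1, 4, 20} {
		fmt.Printf("queued=%d -> replicas=%d\n", q, desiredReplicas(q, 3, 0, 3))
	}
}
```

With `minReplicas: 0`, an empty queue yields zero replicas, which is the scale-to-zero case the field comment mentions.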

FILE: api/v1alpha1/groupversion_info.go

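// Package v1alpha1 contains the API schema definitions for the ai.example.io group.
// +kubebuilder:object:generate=true
// +groupName=ai.example.io
package v1alpha1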

import (

"k8s.io/apimachinery/pkg/runtime/schema"

"sigs.k8s.io/controller-runtime/pkg/scheme"

)

var (

GroupVersion = schema.GroupVersion{Group: "ai.example.io", Version: "v1alpha1"}

SchemeBuilder = &scheme.Builder{GroupVersion: GroupVersion}

AddToScheme = SchemeBuilder.AddToScheme

)

FILE: controllers/hardware.go

// hardware.go contains all vendor-specific hardware configuration logic.

// This is the single file that must be updated when adding a new accelerator

// vendor. All other controller code is vendor-agnostic.

package controllers

import (

"fmt"

corev1 "k8s.io/api/core/v1"

"k8s.io/apimachinery/pkg/api/resource"

aiv1alpha1 "github.com/example/llm-operator/api/v1alpha1"

)

// AcceleratorConfig holds all hardware-specific configuration derived

// from the LLMModel spec. The reconciler calls resolveAcceleratorConfig

// once and then uses this struct throughout the reconciliation.

type AcceleratorConfig struct {

// ResourceKey is the Kubernetes resource name for this accelerator type.

// e.g. "nvidia.com/gpu", "amd.com/gpu", "habana.ai/gaudi"

// Empty string means no accelerator resource is requested (cpu mode).

ResourceKey string

// VllmImage is the Docker image to use for the vLLM inference server.

// Different vendors require different images built against their SDK.

VllmImage string

// OllamaImage is the Docker image to use for Ollama (cpu/apple silicon).

OllamaImage string

// LlamaCppImage is the Docker image to use for llama.cpp.

LlamaCppImage string

// NodeSelectorLabels are added to the pod's nodeSelector to target

// nodes with the correct accelerator type. These labels are populated

// by the respective GPU operator / device plugin.

NodeSelectorLabels map[string]string

// Tolerations allow the pod to be scheduled on tainted GPU nodes.

// Most clusters taint GPU nodes to prevent non-GPU workloads from

// landing on expensive hardware.

Tolerations []corev1.Toleration

// PrometheusMetricPrefix is the prefix used by the accelerator's

// monitoring exporter. vLLM emits its own metrics regardless of

// backend; the hardware-level metrics serve as a secondary KEDA

// trigger for scaling on accelerator utilization.

PrometheusMetricPrefix string

// AcceleratorQuantity is the resource.Quantity for the accelerator

// resource limit/request. Nil for cpu mode.

AcceleratorQuantity *resource.Quantity

// HostIPC indicates whether the pod needs host IPC namespace access.

// Required for multi-GPU tensor parallelism on some vendors.

HostIPC bool

// HostNetwork indicates whether the pod needs host network access.

// Required for multi-GPU NCCL/RCCL communication on some vendors.

HostNetwork bool

// AdditionalEnvVars are vendor-specific environment variables to inject.

AdditionalEnvVars []corev1.EnvVar

}

// resolveAcceleratorConfig derives all hardware-specific configuration

// from the LLMModel spec. This is the single authoritative function for

// vendor dispatch. Add new vendors here.

func resolveAcceleratorConfig(model *aiv1alpha1.LLMModel) (AcceleratorConfig, error) {

vendor := model.Spec.AcceleratorVendor

if vendor == "" {

vendor = "nvidia" // backward-compatible default

}

count := model.Spec.Resources.AcceleratorCount

if count < 1 {

count = 1

}

switch vendor {

// ------------------------------------------------------------------

// NVIDIA — CUDA

// Resource key: nvidia.com/gpu

// Node label: nvidia.com/gpu.present=true (GPU Operator)

// Taint: nvidia.com/gpu:NoSchedule

// vLLM image: vllm/vllm-openai (CUDA build)

// Metrics: DCGM exporter (DCGM_FI_DEV_*)

// ------------------------------------------------------------------

	case "nvidia":
		qty := resource.MustParse(fmt.Sprintf("%d", count))
		return AcceleratorConfig{
			ResourceKey:        "nvidia.com/gpu",
			VllmImage:          "vllm/vllm-openai:v0.8.5",
			OllamaImage:        "ollama/ollama:0.6.5",
			LlamaCppImage:      "ghcr.io/ggerganov/llama.cpp:server-cuda",
			NodeSelectorLabels: map[string]string{"nvidia.com/gpu.present": "true"},
			Tolerations: []corev1.Toleration{
				{
					Key:      "nvidia.com/gpu",
					Operator: corev1.TolerationOpExists,
					Effect:   corev1.TaintEffectNoSchedule,
				},
			},
			PrometheusMetricPrefix: "DCGM_FI_DEV",
			AcceleratorQuantity:    &qty,
			HostIPC:                count > 1,
			HostNetwork:            count > 1,
			AdditionalEnvVars:      []corev1.EnvVar{},
		}, nil

// ------------------------------------------------------------------

// AMD — ROCm

// Resource key: amd.com/gpu

// Node label: amd.com/gpu.present=true (AMD GPU Operator)

// Taint: amd.com/gpu:NoSchedule

// vLLM image: rocm/vllm-openai (ROCm build)

// Official ROCm-enabled vLLM images available since January 2026.

// 93% of vLLM AMD test groups passing as of mid-January 2026.

// Metrics: ROCm SMI exporter (rocm_smi_*)

// Note: AWQ quantization is supported on ROCm as of vLLM 0.8+.

// MXFP4/MXFP6 require MI350X or later (CDNA 4).

// ------------------------------------------------------------------

	case "amd":
		qty := resource.MustParse(fmt.Sprintf("%d", count))
		return AcceleratorConfig{
			ResourceKey:   "amd.com/gpu",
			VllmImage:     "rocm/vllm-openai:v0.8.5-rocm6.2",
			OllamaImage:   "ollama/ollama:0.6.5-rocm",
			LlamaCppImage: "ghcr.io/ggerganov/llama.cpp:server-rocm",
			NodeSelectorLabels: map[string]string{
				"amd.com/gpu.present": "true",
			},
			Tolerations: []corev1.Toleration{
				{
					Key:      "amd.com/gpu",
					Operator: corev1.TolerationOpExists,
					Effect:   corev1.TaintEffectNoSchedule,
				},
			},
			PrometheusMetricPrefix: "rocm_smi",
			AcceleratorQuantity:    &qty,
			HostIPC:                count > 1,
			HostNetwork:            count > 1,
			AdditionalEnvVars: []corev1.EnvVar{
				// Tell ROCm which GPU devices to use. The device plugin
				// sets ROCR_VISIBLE_DEVICES automatically, but we set
				// HIP_VISIBLE_DEVICES for compatibility with older ROCm.
				{
					Name:  "HIP_VISIBLE_DEVICES",
					Value: "all",
				},
				// ROCm requires this for vLLM's flash attention backend.
				{
					Name:  "VLLM_USE_ROCM_FLASH_ATTN",
					Value: "1",
				},
			},
		}, nil

// ------------------------------------------------------------------

// INTEL GAUDI

// Resource key: habana.ai/gaudi

// Node label: habana.ai/gaudi=true (Gaudi Base Operator)

// Taint: habana.ai/gaudi:NoSchedule

// vLLM image: Intel optimized vLLM fork for Gaudi

// Intel maintains a Gaudi-specific vLLM fork with Paged KV cache,

// custom Paged Attention, tensor parallelism, and FP8 support.

// Supports DeepSeek architecture since Gaudi software 1.21.0.

// Metrics: Gaudi metrics exporter (habana_*)

// Note: FP8 is natively supported by Gaudi 3.

// TGI (Text Generation Inference) is also supported.

// ------------------------------------------------------------------

	case "intel-gaudi":
		qty := resource.MustParse(fmt.Sprintf("%d", count))
		return AcceleratorConfig{
			ResourceKey:   "habana.ai/gaudi",
			VllmImage:     "vault.habana.ai/gaudi-docker/1.21.0/ubuntu22.04/habanalabs/vllm-fork:latest",
			OllamaImage:   "", // Ollama does not support Gaudi; use vllm or tgi
			LlamaCppImage: "", // llama.cpp does not support Gaudi; use vllm or tgi
			NodeSelectorLabels: map[string]string{
				"habana.ai/gaudi": "true",
			},
			Tolerations: []corev1.Toleration{
				{
					Key:      "habana.ai/gaudi",
					Operator: corev1.TolerationOpExists,
					Effect:   corev1.TaintEffectNoSchedule,
				},
			},
			PrometheusMetricPrefix: "habana",
			AcceleratorQuantity:    &qty,
			HostIPC:                true, // Required for Gaudi inter-card communication
			HostNetwork:            count > 1,
			AdditionalEnvVars: []corev1.EnvVar{
				// Gaudi requires these environment variables for vLLM.
				{
					Name:  "PT_HPU_ENABLE_LAZY_COLLECTIVES",
					Value: "true",
				},
				{
					Name:  "VLLM_SKIP_WARMUP",
					Value: "false",
				},
				// Gaudi uses Habana's collective communications library.
				{
					Name:  "HABANA_VISIBLE_DEVICES",
					Value: "all",
				},
			},
		}, nil

// ------------------------------------------------------------------

// CPU — No accelerator

// Resource key: (none)

// Inference: Ollama or llama.cpp with GGUF models

// Suitable for: Apple Silicon nodes, x86 servers, edge deployments

// Note: vLLM is not recommended for CPU-only inference.

// Use ollama or llamacpp as inferenceEngine.

// ------------------------------------------------------------------

case "cpu":

return AcceleratorConfig{

ResourceKey: "",

VllmImage: "", // vLLM not recommended for CPU

OllamaImage: "ollama/ollama:0.6.5",

LlamaCppImage: "ghcr.io/ggerganov/llama.cpp:server",

NodeSelectorLabels: map[string]string{},

Tolerations: []corev1.Toleration{},

PrometheusMetricPrefix: "",

AcceleratorQuantity: nil,

HostIPC: false,

HostNetwork: false,

AdditionalEnvVars: []corev1.EnvVar{},

}, nil

	default:
		return AcceleratorConfig{}, fmt.Errorf(
			"unknown acceleratorVendor %q; valid values: nvidia, amd, intel-gaudi, cpu",
			vendor,
		)
	}
}

// inferenceImageFor returns the correct Docker image for the given

// inference engine and accelerator configuration.

func inferenceImageFor(engine string, cfg AcceleratorConfig) (string, error) {
	// AcceleratorConfig carries no vendor name, so error messages identify
	// the target by its resource key. CPU mode has an empty key, which
	// would otherwise print as "".
	target := cfg.ResourceKey
	if target == "" {
		target = "cpu"
	}

	switch engine {
	case "vllm":
		if cfg.VllmImage == "" {
			return "", fmt.Errorf(
				"vLLM is not supported for accelerator %q; "+
					"use llamacpp or ollama instead",
				target,
			)
		}
		return cfg.VllmImage, nil
	case "ollama":
		if cfg.OllamaImage == "" {
			return "", fmt.Errorf(
				"Ollama is not supported for accelerator %q; "+
					"use vllm instead",
				target,
			)
		}
		return cfg.OllamaImage, nil
	case "llamacpp":
		if cfg.LlamaCppImage == "" {
			return "", fmt.Errorf(
				"llama.cpp is not supported for accelerator %q; "+
					"use vllm instead",
				target,
			)
		}
		return cfg.LlamaCppImage, nil
	default:
		return "", fmt.Errorf("unknown inferenceEngine %q", engine)
	}
}
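
Because `resolveAcceleratorConfig` leaves an image field empty when an engine is unavailable for a vendor, the valid vendor and engine pairs can be read off as a small matrix. The sketch below models that rule with plain maps so that it runs without the Kubernetes libraries. The `supported` table and the `checkEngine` helper are simplified stand-ins (assumptions), filled in from the image fields in the listing above.

```go
package main

import "fmt"

// supported records which inference engines have a container image for
// each accelerator vendor. It is a simplified stand-in for the image
// fields of AcceleratorConfig.
var supported = map[string]map[string]bool{
	"nvidia":      {"vllm": true, "ollama": true, "llamacpp": true},
	"amd":         {"vllm": true, "ollama": true, "llamacpp": true},
	"intel-gaudi": {"vllm": true},
	"cpu":         {"ollama": true, "llamacpp": true},
}

// checkEngine returns nil when the pair is usable and a descriptive
// error otherwise.
func checkEngine(vendor, engine string) error {
	engines, ok := supported[vendor]
	if !ok {
		return fmt.Errorf("unknown acceleratorVendor %q", vendor)
	}
	if !engines[engine] {
		return fmt.Errorf("engine %q is not supported on vendor %q", engine, vendor)
	}
	return nil
}

func main() {
	pairs := [][2]string{{"amd", "vllm"}, {"intel-gaudi", "ollama"}, {"cpu", "vllm"}}
	for _, p := range pairs {
		if err := checkEngine(p[0], p[1]); err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Printf("ok: %s on %s\n", p[1], p[0])
	}
}
```

A table like this is also a convenient basis for a unit test that walks every vendor and engine combination whenever a new vendor is added.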

FILE: controllers/llmmodel_controller.go

package controllers

import (

"context"

"fmt"

"sort"

"time"

appsv1 "k8s.io/api/apps/v1"

corev1 "k8s.io/api/core/v1"

"k8s.io/apimachinery/pkg/api/errors"

"k8s.io/apimachinery/pkg/api/resource"

metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

"k8s.io/apimachinery/pkg/runtime"

"k8s.io/apimachinery/pkg/util/intstr"

ctrl "sigs.k8s.io/controller-runtime"

"sigs.k8s.io/controller-runtime/pkg/client"

"sigs.k8s.io/controller-runtime/pkg/log"

aiv1alpha1 "github.com/example/llm-operator/api/v1alpha1"

)

// LLMModelReconciler reconciles LLMModel resources.

type LLMModelReconciler struct {

client.Client

Scheme *runtime.Scheme

McpClient McpRegistryClient

}

// McpRegistryClient is an interface for registering tools with the MCP server.

type McpRegistryClient interface {

RegisterTool(ctx context.Context, tool McpToolDefinition) error

UnregisterTool(ctx context.Context, toolName string) error

}

// McpToolDefinition describes a tool to be registered with the MCP server.

type McpToolDefinition struct {

Name string

Description string

Endpoint string

ModelId string

Domains []string

}

// Reconcile is called by controller-runtime whenever an LLMModel changes.

func (r *LLMModelReconciler) Reconcile(

ctx context.Context,

req ctrl.Request,

) (ctrl.Result, error) {

logger := log.FromContext(ctx)

logger.Info("Reconciling LLMModel", "name", req.Name, "namespace", req.Namespace)

// Step 1: Fetch the LLMModel resource.

model := &aiv1alpha1.LLMModel{}

if err := r.Get(ctx, req.NamespacedName, model); err != nil {

if errors.IsNotFound(err) {

return ctrl.Result{}, nil

}

return ctrl.Result{}, fmt.Errorf("fetching LLMModel: %w", err)

}

	// Step 2: Handle deletion via finalizer.
	finalizerName := "ai.example.io/mcp-cleanup"
	if model.DeletionTimestamp != nil {
		if containsString(model.Finalizers, finalizerName) {
			if err := r.cleanupMcpRegistration(ctx, model); err != nil {
				return ctrl.Result{}, err
			}
			model.Finalizers = removeString(model.Finalizers, finalizerName)
			if err := r.Update(ctx, model); err != nil {
				return ctrl.Result{}, err
			}
		}
		return ctrl.Result{}, nil
	}

	// Step 3: Add the finalizer if not present.
	if !containsString(model.Finalizers, finalizerName) {
		model.Finalizers = append(model.Finalizers, finalizerName)
		if err := r.Update(ctx, model); err != nil {
			return ctrl.Result{}, err
		}
		// r.Update writes the server's response back into model, so it
		// already carries the new resourceVersion. A re-fetch here would
		// read from the informer cache, which may still be stale.
	}

// Step 4: Sync discovery labels. We do this before the main

// reconciliation so that labels are always current.

if err := r.syncDiscoveryLabels(ctx, model); err != nil {

return ctrl.Result{}, fmt.Errorf("syncing discovery labels: %w", err)

}

// Re-fetch after label update.

if err := r.Get(ctx, req.NamespacedName, model); err != nil {

return ctrl.Result{}, err

}

// Step 5: Branch on deployment mode.

var reconcileErr error

switch model.Spec.DeploymentMode {

case "local":

reconcileErr = r.reconcileLocalModel(ctx, model)

case "remote":

reconcileErr = r.reconcileRemoteModel(ctx, model)

default:

reconcileErr = fmt.Errorf(

"unknown deploymentMode: %s", model.Spec.DeploymentMode,

)

}

if reconcileErr != nil {

model.Status.Phase = "Failed"

_ = r.Status().Update(ctx, model)

return ctrl.Result{}, reconcileErr

}

// Step 6: Requeue after 30 seconds to refresh metrics in the status.

return ctrl.Result{RequeueAfter: 30 * time.Second}, nil

}

// syncDiscoveryLabels updates the LLMModel's labels to reflect its spec,

// enabling efficient kubectl and API server queries.

// NOTE: This calls r.Update (not r.Status().Update) because labels are

// metadata, not status. The caller must re-fetch after this call.

func (r *LLMModelReconciler) syncDiscoveryLabels(

ctx context.Context,

model *aiv1alpha1.LLMModel,

) error {

updated := model.DeepCopy()

if updated.Labels == nil {

updated.Labels = make(map[string]string)

}

// Domain labels: ai.example.io/domain-<name>=true

for _, domain := range model.Spec.Domains {

updated.Labels[fmt.Sprintf("ai.example.io/domain-%s", domain)] = "true"

}

	// Context window tier labels (cumulative: a 256K model also gets the
	// 128K, 64K, and 32K labels). Models below 32K get no tier label.
	contextK := model.Spec.ContextWindowK
	tiers := []struct {
		threshold int
		label     string
	}{
		{10000, "ai.example.io/context-10m"},
		{1000, "ai.example.io/context-1m"},
		{256, "ai.example.io/context-256k"},
		{128, "ai.example.io/context-128k"},
		{64, "ai.example.io/context-64k"},
		{32, "ai.example.io/context-32k"},
	}
	for _, tier := range tiers {
		if contextK >= tier.threshold {
			updated.Labels[tier.label] = "true"
		}
	}

// VRAM tier label

vram := model.Spec.Resources.VramPerAcceleratorGiB

switch {

case vram == 0:

updated.Labels["ai.example.io/vram-tier"] = "cpu"

case vram <= 8:

updated.Labels["ai.example.io/vram-tier"] = "8gb"

case vram <= 16:

updated.Labels["ai.example.io/vram-tier"] = "16gb"

case vram <= 24:

updated.Labels["ai.example.io/vram-tier"] = "24gb"

case vram <= 48:

updated.Labels["ai.example.io/vram-tier"] = "48gb"

case vram <= 80:

updated.Labels["ai.example.io/vram-tier"] = "80gb"

case vram <= 141:

updated.Labels["ai.example.io/vram-tier"] = "141gb"

case vram <= 192:

updated.Labels["ai.example.io/vram-tier"] = "192gb"

default:

updated.Labels["ai.example.io/vram-tier"] = "multi-accelerator"

}

// Other discovery labels

updated.Labels["ai.example.io/model-type"] = model.Spec.ModelType

updated.Labels["ai.example.io/quantization"] = model.Spec.Quantization

updated.Labels["ai.example.io/deployment-mode"] = model.Spec.DeploymentMode

updated.Labels["ai.example.io/accelerator-vendor"] = model.Spec.AcceleratorVendor

return r.Update(ctx, updated)

}
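
As an aside, the cumulative tier rule used by `syncDiscoveryLabels` can be isolated as a pure function, which makes it easy to test without a cluster. The sketch below is a standalone illustration: the helper name `contextTierLabels` is hypothetical, and it assumes the lowest tier starts at 32K.

```go
package main

import "fmt"

// tier pairs a minimum context window (in thousands of tokens) with the
// label applied to models that meet it.
type tier struct {
	minK  int
	label string
}

// tiers is ordered from largest to smallest. The 32K floor is an
// assumption made for this sketch.
var tiers = []tier{
	{10000, "ai.example.io/context-10m"},
	{1000, "ai.example.io/context-1m"},
	{256, "ai.example.io/context-256k"},
	{128, "ai.example.io/context-128k"},
	{64, "ai.example.io/context-64k"},
	{32, "ai.example.io/context-32k"},
}

// contextTierLabels returns every tier label a model qualifies for, so a
// selector for a smaller tier also matches models with larger windows.
func contextTierLabels(contextK int) []string {
	var out []string
	for _, t := range tiers {
		if contextK >= t.minK {
			out = append(out, t.label)
		}
	}
	return out
}

func main() {
	for _, k := range []int{16, 128, 1000} {
		labels := contextTierLabels(k)
		fmt.Printf("%dK -> %d labels: %v\n", k, len(labels), labels)
	}
}
```

With cumulative labels, a selector such as `kubectl get llmmodels -n ai-models -l ai.example.io/context-128k=true` matches every model whose window is at least 128K, not only those at exactly 128K.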

// reconcileLocalModel handles the full lifecycle of a locally-hosted model.

func (r *LLMModelReconciler) reconcileLocalModel(

ctx context.Context,

model *aiv1alpha1.LLMModel,

) error {

// Resolve hardware configuration for this vendor.

hwCfg, err := resolveAcceleratorConfig(model)

if err != nil {

return err

}

// Validate engine/vendor compatibility.

if _, err := inferenceImageFor(model.Spec.InferenceEngine, hwCfg); err != nil {

return err

}

// Ensure the model cache PVC exists.

if err := r.ensureModelCachePvc(ctx, model); err != nil {

return fmt.Errorf("ensuring model cache PVC: %w", err)

}

// Ensure the inference Deployment exists and matches the spec.

if err := r.ensureInferenceDeployment(ctx, model, hwCfg); err != nil {

return fmt.Errorf("ensuring inference deployment: %w", err)

}

// Ensure the Service exists.

if err := r.ensureInferenceService(ctx, model); err != nil {

return fmt.Errorf("ensuring inference service: %w", err)

}

// Ensure the KEDA ScaledObject exists.

if err := r.ensureKedaScaledObject(ctx, model); err != nil {

return fmt.Errorf("ensuring KEDA ScaledObject: %w", err)

}

// Update status. We use Status().Update() here — separate from the

// label update in syncDiscoveryLabels — to avoid a double-update conflict.

endpoint := fmt.Sprintf(

"http://%s.%s.svc.cluster.local:8000/v1",

model.Name, model.Namespace,

)

model.Status.Endpoint = endpoint

model.Status.Phase = "Ready"

	if model.Spec.McpExposure.Enabled {
		if err := r.McpClient.RegisterTool(ctx, McpToolDefinition{
			Name:        model.Spec.McpExposure.ToolName,
			Description: model.Spec.McpExposure.ToolDescription,
			Endpoint:    endpoint,
			ModelId:     model.Spec.ModelId,
			Domains:     model.Spec.Domains,
		}); err != nil {
			return fmt.Errorf("registering MCP tool: %w", err)
		}
	}

	return r.Status().Update(ctx, model)
}

// ensureModelCachePvc creates or updates the PVC for model weight caching.

func (r *LLMModelReconciler) ensureModelCachePvc(

ctx context.Context,

model *aiv1alpha1.LLMModel,

) error {

	// Estimate the storage needed: weights plus some overhead.
	// The heuristic is vramPerAcceleratorGiB * acceleratorCount * 2,
	// clamped to a minimum of 20Gi and a maximum of 2Ti.
	vram := model.Spec.Resources.VramPerAcceleratorGiB
	count := model.Spec.Resources.AcceleratorCount
	if count < 1 {
		count = 1
	}
	estimatedGiB := vram * count * 2
	if estimatedGiB < 20 {
		estimatedGiB = 20
	}
	if estimatedGiB > 2048 {
		estimatedGiB = 2048
	}
	storageQty := resource.MustParse(fmt.Sprintf("%dGi", estimatedGiB))

	pvc := &corev1.PersistentVolumeClaim{
		ObjectMeta: metav1.ObjectMeta{
			Name:      fmt.Sprintf("%s-model-cache", model.Name),
			Namespace: model.Namespace,
		},
	}

	_, err := ctrl.CreateOrUpdate(ctx, r.Client, pvc, func() error {
		if err := ctrl.SetControllerReference(model, pvc, r.Scheme); err != nil {
			return err
		}
		// Only set spec on creation; PVC spec is immutable after creation.
		if pvc.CreationTimestamp.IsZero() {
			storageClassName := "standard"
			pvc.Spec = corev1.PersistentVolumeClaimSpec{
				AccessModes: []corev1.PersistentVolumeAccessMode{
					corev1.ReadWriteOnce,
				},
				Resources: corev1.VolumeResourceRequirements{
					Requests: corev1.ResourceList{
						corev1.ResourceStorage: storageQty,
					},
				},
				StorageClassName: &storageClassName,
			}
		}
		return nil
	})
	return err
}
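
As an aside, the sizing rule in `ensureModelCachePvc` is easy to check in isolation. The sketch below restates it as a pure function, following the code in the listing (which multiplies by 2). The function name `cacheSizeGiB` is hypothetical and the program is a standalone illustration, not part of the controller.

```go
package main

import "fmt"

// cacheSizeGiB restates the PVC sizing rule: twice the total accelerator
// memory, clamped to the range [20, 2048] GiB.
func cacheSizeGiB(vramPerAcceleratorGiB, acceleratorCount int) int {
	if acceleratorCount < 1 {
		acceleratorCount = 1
	}
	size := vramPerAcceleratorGiB * acceleratorCount * 2
	if size < 20 {
		size = 20 // floor: CPU-only models report 0 GiB of VRAM
	}
	if size > 2048 {
		size = 2048 // ceiling: 2Ti
	}
	return size
}

func main() {
	cases := []struct{ vram, count int }{{0, 0}, {24, 1}, {192, 8}}
	for _, c := range cases {
		fmt.Printf("vram=%d count=%d -> %dGi\n", c.vram, c.count, cacheSizeGiB(c.vram, c.count))
	}
}
```

Note that a CPU-only model always lands on the 20Gi floor under this rule, because its VRAM figure is zero; a heuristic based on parameter count and quantization would track GGUF file sizes more closely.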

// ensureInferenceDeployment creates or updates the inference server Deployment.

// This function is vendor-agnostic; all vendor-specific decisions come from hwCfg.

func (r *LLMModelReconciler) ensureInferenceDeployment(

ctx context.Context,

model *aiv1alpha1.LLMModel,

hwCfg AcceleratorConfig,

) error {

image, err := inferenceImageFor(model.Spec.InferenceEngine, hwCfg)

if err != nil {

return err

}

// Build the inference server command.

var command []string

var args []string

switch model.Spec.InferenceEngine {

case "vllm":

command = []string{"python3", "-m", "vllm.entrypoints.openai.api_server"}

args = []string{

"--model", model.Spec.ModelId,

"--port", "8000",

"--host", "0.0.0.0",

}

// Append engineArgs in sorted key order for determinism.

keys := make([]string, 0, len(model.Spec.EngineArgs))

for k := range model.Spec.EngineArgs {

keys = append(keys, k)

}

sort.Strings(keys)

for _, k := range keys {

args = append(args, fmt.Sprintf("--%s", k), model.Spec.EngineArgs[k])

}

case "ollama":

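        // Ollama listens on port 11434 by default. We set the OLLAMA_HOST
        // env var (see the environment variables below) so that it binds
        // to 0.0.0.0:8000 instead, matching the container port used by the
        // other engines. Ollama serves its OpenAI-compatible API under /v1.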

command = []string{"/bin/ollama"}

args = []string{"serve"}

case "llamacpp":

command = []string{"/server"}

args = []string{

"--model", "/model-cache/" + model.Spec.ModelId,

"--port", "8000",

"--host", "0.0.0.0",

"--ctx-size", fmt.Sprintf("%d", model.Spec.ContextWindowK*1024),

}

        // Append engineArgs in sorted key order for determinism.
        keys := make([]string, 0, len(model.Spec.EngineArgs))
        for k := range model.Spec.EngineArgs {
            keys = append(keys, k)
        }
        sort.Strings(keys)
        for _, k := range keys {
            args = append(args, fmt.Sprintf("--%s", k), model.Spec.EngineArgs[k])
        }
    }

// Build environment variables.

envVars := []corev1.EnvVar{

{Name: "HF_HOME", Value: "/model-cache"},

}

    if model.Spec.ModelSource != nil &&
        model.Spec.ModelSource.HuggingFaceTokenSecret != "" {
        envVars = append(envVars, corev1.EnvVar{
            Name: "HUGGING_FACE_HUB_TOKEN",
            ValueFrom: &corev1.EnvVarSource{
                SecretKeyRef: &corev1.SecretKeySelector{
                    LocalObjectReference: corev1.LocalObjectReference{
                        Name: model.Spec.ModelSource.HuggingFaceTokenSecret,
                    },
                    Key: "token",
                },
            },
        })
    }

// Append vendor-specific env vars.

envVars = append(envVars, hwCfg.AdditionalEnvVars...)

// For Ollama, set the host binding.

if model.Spec.InferenceEngine == "ollama" {

envVars = append(envVars, corev1.EnvVar{

Name: "OLLAMA_HOST",

Value: "0.0.0.0:8000",

})

}

// Resource requirements.

cpuQty := resource.MustParse(fmt.Sprintf("%dm", model.Spec.Resources.CpuMillicores))

memQty := resource.MustParse(fmt.Sprintf("%dGi", model.Spec.Resources.MemoryGiB))

resourceRequests := corev1.ResourceList{

corev1.ResourceCPU: cpuQty,

corev1.ResourceMemory: memQty,

}

resourceLimits := corev1.ResourceList{

corev1.ResourceCPU: cpuQty,

corev1.ResourceMemory: memQty,

}

if hwCfg.AcceleratorQuantity != nil && hwCfg.ResourceKey != "" {

resourceRequests[corev1.ResourceName(hwCfg.ResourceKey)] = *hwCfg.AcceleratorQuantity

resourceLimits[corev1.ResourceName(hwCfg.ResourceKey)] = *hwCfg.AcceleratorQuantity

}

replicas := int32(model.Spec.Scaling.MinReplicas)

if replicas < 1 {

replicas = 1

}

// Shared memory volume size: 16Gi for single-accelerator, 64Gi for multi.

shmSize := resource.MustParse("16Gi")

if model.Spec.Resources.AcceleratorCount > 1 {

shmSize = resource.MustParse("64Gi")

}

// Volume mounts for the inference container.

volumeMounts := []corev1.VolumeMount{

{Name: "model-cache", MountPath: "/model-cache"},

{Name: "shm", MountPath: "/dev/shm"},

}

// Volumes.

    volumes := []corev1.Volume{
        {
            Name: "model-cache",
            VolumeSource: corev1.VolumeSource{
                PersistentVolumeClaim: &corev1.PersistentVolumeClaimVolumeSource{
                    ClaimName: fmt.Sprintf("%s-model-cache", model.Name),
                },
            },
        },
        {
            Name: "shm",
            VolumeSource: corev1.VolumeSource{
                EmptyDir: &corev1.EmptyDirVolumeSource{
                    Medium:    corev1.StorageMediumMemory,
                    SizeLimit: &shmSize,
                },
            },
        },
    }

// Init container: download model weights before the server starts.

    initContainers := []corev1.Container{
        {
            Name:  "model-downloader",
            Image: "huggingface/downloader:latest",
            Command: []string{
                "huggingface-cli", "download",
                model.Spec.ModelId,
                "--local-dir", "/model-cache",
            },
            Env: envVars,
            VolumeMounts: []corev1.VolumeMount{
                {Name: "model-cache", MountPath: "/model-cache"},
            },
        },
    }

// Pod spec.

    podSpec := corev1.PodSpec{
        Tolerations:    hwCfg.Tolerations,
        NodeSelector:   hwCfg.NodeSelectorLabels,
        HostIPC:        hwCfg.HostIPC,
        HostNetwork:    hwCfg.HostNetwork,
        InitContainers: initContainers,
        Containers: []corev1.Container{
            {
                Name:    "inference-server",
                Image:   image,
                Command: command,
                Args:    args,
                Env:     envVars,
                Ports: []corev1.ContainerPort{
                    {Name: "http", ContainerPort: 8000, Protocol: corev1.ProtocolTCP},
                },
                Resources: corev1.ResourceRequirements{
                    Requests: resourceRequests,
                    Limits:   resourceLimits,
                },
                VolumeMounts: volumeMounts,
                // Readiness: wait 300s, then allow up to 60 failed probes
                // (10 more minutes) for large models to finish loading.
                ReadinessProbe: &corev1.Probe{
                    ProbeHandler: corev1.ProbeHandler{
                        HTTPGet: &corev1.HTTPGetAction{
                            Path: "/health",
                            Port: intstr.FromInt(8000),
                        },
                    },
                    InitialDelaySeconds: 300,
                    PeriodSeconds:       10,
                    FailureThreshold:    60,
                },
                LivenessProbe: &corev1.Probe{
                    ProbeHandler: corev1.ProbeHandler{
                        HTTPGet: &corev1.HTTPGetAction{
                            Path: "/health",
                            Port: intstr.FromInt(8000),
                        },
                    },
                    InitialDelaySeconds: 360,
                    PeriodSeconds:       30,
                    FailureThreshold:    5,
                },
            },
        },
        Volumes: volumes,
    }

    deployment := &appsv1.Deployment{
        ObjectMeta: metav1.ObjectMeta{
            Name:      model.Name,
            Namespace: model.Namespace,
        },
    }
    _, err = ctrl.CreateOrUpdate(ctx, r.Client, deployment, func() error {
        if err := ctrl.SetControllerReference(model, deployment, r.Scheme); err != nil {
            return err
        }
        deployment.Spec = appsv1.DeploymentSpec{
            Replicas: &replicas,
            Selector: &metav1.LabelSelector{
                MatchLabels: map[string]string{
                    "app":                      model.Name,
                    "ai.example.io/model-name": model.Name,
                },
            },
            Template: corev1.PodTemplateSpec{
                ObjectMeta: metav1.ObjectMeta{
                    Labels: map[string]string{
                        "app":                      model.Name,
                        "ai.example.io/model-name": model.Name,
                        "ai.example.io/engine":     model.Spec.InferenceEngine,
                        "ai.example.io/vendor":     model.Spec.AcceleratorVendor,
                    },
                    Annotations: map[string]string{
                        "prometheus.io/scrape": "true",
                        "prometheus.io/port":   "8000",
                        "prometheus.io/path":   "/metrics",
                    },
                },
                Spec: podSpec,
            },
        }
        return nil
    })
    return err

}

// ensureInferenceService creates or updates the ClusterIP Service.

func (r *LLMModelReconciler) ensureInferenceService(

ctx context.Context,

model *aiv1alpha1.LLMModel,

) error {

    svc := &corev1.Service{
        ObjectMeta: metav1.ObjectMeta{
            Name:      model.Name,
            Namespace: model.Namespace,
        },
    }
    _, err := ctrl.CreateOrUpdate(ctx, r.Client, svc, func() error {
        if err := ctrl.SetControllerReference(model, svc, r.Scheme); err != nil {
            return err
        }
        // Set fields individually rather than replacing svc.Spec wholesale:
        // on update, overwriting the spec would clear the allocated
        // ClusterIP, which the API server does not allow.
        svc.Spec.Selector = map[string]string{
            "ai.example.io/model-name": model.Name,
        }
        svc.Spec.Ports = []corev1.ServicePort{
            {
                Name:       "http",
                Port:       8000,
                TargetPort: intstr.FromInt(8000),
                Protocol:   corev1.ProtocolTCP,
            },
        }
        svc.Spec.Type = corev1.ServiceTypeClusterIP
        return nil
    })
    return err

}

// ensureKedaScaledObject creates or updates the KEDA ScaledObject.

// vLLM emits the same Prometheus metrics regardless of the underlying

// accelerator vendor, so the KEDA configuration is vendor-agnostic.

func (r *LLMModelReconciler) ensureKedaScaledObject(

ctx context.Context,

model *aiv1alpha1.LLMModel,

) error {

    // KEDA ScaledObject is a custom resource. We build it as an unstructured
    // map to avoid importing the KEDA API package as a dependency.
    scaledObject := map[string]interface{}{
        "apiVersion": "keda.sh/v1alpha1",
        "kind":       "ScaledObject",
        "metadata": map[string]interface{}{
            "name":      model.Name,
            "namespace": model.Namespace,
            "ownerReferences": []interface{}{
                map[string]interface{}{
                    "apiVersion":         "ai.example.io/v1alpha1",
                    "kind":               "LLMModel",
                    "name":               model.Name,
                    "uid":                string(model.UID),
                    "controller":         true,
                    "blockOwnerDeletion": true,
                },
            },
        },
        "spec": map[string]interface{}{
            "scaleTargetRef": map[string]interface{}{
                "apiVersion": "apps/v1",
                "kind":       "Deployment",
                "name":       model.Name,
            },
            "minReplicaCount": int64(model.Spec.Scaling.MinReplicas),
            "maxReplicaCount": int64(model.Spec.Scaling.MaxReplicas),
            "cooldownPeriod":  int64(model.Spec.Scaling.CooldownPeriodSeconds),
            "pollingInterval": int64(15),
            "triggers": []interface{}{
                map[string]interface{}{
                    "type": "prometheus",
                    "metadata": map[string]interface{}{
                        "serverAddress": "http://prometheus-server.monitoring.svc.cluster.local:9090",
                        "metricName":    "vllm_requests_waiting",
                        "query": fmt.Sprintf(
                            `sum(vllm:num_requests_waiting{namespace="%s",pod=~"%s-.*"})`,
                            model.Namespace, model.Name,
                        ),
                        "threshold": fmt.Sprintf("%d", model.Spec.Scaling.ScaleUpThreshold),
                        // KEDA activates a scaler only when the metric is strictly
                        // greater than activationThreshold, so "0" wakes a
                        // scaled-to-zero Deployment on the first waiting request.
                        "activationThreshold": "0",
                    },
                },
            },
        },
    }

    // NOTE: Applying this object requires the dynamic client (for example
    // with server-side apply) or the KEDA API types. That implementation is
    // omitted here for brevity, so this function currently builds the
    // ScaledObject and discards it.
    // See the Helm chart values for the complete KEDA ScaledObject YAML,
    // which is applied as a separate manifest in config/keda/.
    _ = scaledObject
    return nil

}

// reconcileRemoteModel creates a lightweight proxy for remote API models.

func (r *LLMModelReconciler) reconcileRemoteModel(

ctx context.Context,

model *aiv1alpha1.LLMModel,

) error {

if model.Spec.RemoteApi == nil {

return fmt.Errorf(

"LLMModel %s has deploymentMode=remote but no remoteApi spec",

model.Name,

)

}

// ConfigMap with proxy configuration.

    configMap := &corev1.ConfigMap{
        ObjectMeta: metav1.ObjectMeta{
            Name:      fmt.Sprintf("%s-proxy-config", model.Name),
            Namespace: model.Namespace,
        },
    }
    _, err := ctrl.CreateOrUpdate(ctx, r.Client, configMap, func() error {
        // Propagate the error instead of silently dropping it; without an
        // owner reference the ConfigMap would not be garbage-collected.
        if err := ctrl.SetControllerReference(model, configMap, r.Scheme); err != nil {
            return err
        }
        configMap.Data = map[string]string{
            "upstream_url":   model.Spec.RemoteApi.BaseUrl,
            "model_id":       model.Spec.ModelId,
            "rate_limit_rpm": fmt.Sprintf("%d", model.Spec.RemoteApi.RateLimitRpm),
        }
        return nil
    })

if err != nil {

return fmt.Errorf("ensuring proxy ConfigMap: %w", err)

}

// Proxy Deployment — no GPU resources, minimal footprint.

proxyReplicas := int32(2)

    proxyDeployment := &appsv1.Deployment{
        ObjectMeta: metav1.ObjectMeta{
            Name:      model.Name,
            Namespace: model.Namespace,
        },
    }
    _, err = ctrl.CreateOrUpdate(ctx, r.Client, proxyDeployment, func() error {
        // Propagate the error instead of silently dropping it.
        if err := ctrl.SetControllerReference(model, proxyDeployment, r.Scheme); err != nil {
            return err
        }
        proxyDeployment.Spec = appsv1.DeploymentSpec{
            Replicas: &proxyReplicas,
            Selector: &metav1.LabelSelector{
                MatchLabels: map[string]string{
                    "app":                      model.Name,
                    "ai.example.io/model-name": model.Name,
                },
            },
            Template: corev1.PodTemplateSpec{
                ObjectMeta: metav1.ObjectMeta{
                    Labels: map[string]string{
                        "app":                      model.Name,
                        "ai.example.io/model-name": model.Name,
                        "ai.example.io/mode":       "remote-proxy",
                    },
                },
                Spec: corev1.PodSpec{
                    Containers: []corev1.Container{
                        {
                            Name:  "api-proxy",
                            Image: "ghcr.io/example/llm-api-proxy:v1.2.0",
                            Ports: []corev1.ContainerPort{
                                {ContainerPort: 8000, Protocol: corev1.ProtocolTCP},
                            },
                            Env: []corev1.EnvVar{
                                {
                                    Name: "PROXY_API_KEY",
                                    ValueFrom: &corev1.EnvVarSource{
                                        SecretKeyRef: &corev1.SecretKeySelector{
                                            LocalObjectReference: corev1.LocalObjectReference{
                                                Name: model.Spec.RemoteApi.ApiKeySecret,
                                            },
                                            Key: "apiKey",
                                        },
                                    },
                                },
                                {
                                    Name:  "PROXY_CONFIG_PATH",
                                    Value: "/etc/proxy/config.yaml",
                                },
                            },
                            Resources: corev1.ResourceRequirements{
                                Requests: corev1.ResourceList{
                                    corev1.ResourceCPU:    resource.MustParse("100m"),
                                    corev1.ResourceMemory: resource.MustParse("128Mi"),
                                },
                                Limits: corev1.ResourceList{
                                    corev1.ResourceCPU:    resource.MustParse("500m"),
                                    corev1.ResourceMemory: resource.MustParse("256Mi"),
                                },
                            },
                            VolumeMounts: []corev1.VolumeMount{
                                {Name: "proxy-config", MountPath: "/etc/proxy"},
                            },
                            ReadinessProbe: &corev1.Probe{
                                ProbeHandler: corev1.ProbeHandler{
                                    HTTPGet: &corev1.HTTPGetAction{
                                        Path: "/health",
                                        Port: intstr.FromInt(8000),
                                    },
                                },
                                InitialDelaySeconds: 5,
                                PeriodSeconds:       10,
                                FailureThreshold:    3,
                            },
                        },
                    },
                    Volumes: []corev1.Volume{
                        {
                            Name: "proxy-config",
                            VolumeSource: corev1.VolumeSource{
                                ConfigMap: &corev1.ConfigMapVolumeSource{
                                    LocalObjectReference: corev1.LocalObjectReference{
                                        Name: fmt.Sprintf("%s-proxy-config", model.Name),
                                    },
                                },
                            },
                        },
                    },
                },
            },
        }
        return nil
    })

if err != nil {

return fmt.Errorf("ensuring proxy deployment: %w", err)

}

// Ensure the Service for the proxy.

if err := r.ensureInferenceService(ctx, model); err != nil {

return fmt.Errorf("ensuring proxy service: %w", err)

}

endpoint := fmt.Sprintf(

"http://%s.%s.svc.cluster.local:8000/v1",

model.Name, model.Namespace,

)

model.Status.Endpoint = endpoint

model.Status.Phase = "Proxying"

    if model.Spec.McpExposure.Enabled {
        if err := r.McpClient.RegisterTool(ctx, McpToolDefinition{
            Name:        model.Spec.McpExposure.ToolName,
            Description: model.Spec.McpExposure.ToolDescription,
            Endpoint:    endpoint,
            ModelId:     model.Spec.ModelId,
            Domains:     model.Spec.Domains,
        }); err != nil {
            return fmt.Errorf("registering MCP tool for remote model: %w", err)
        }
    }

return r.Status().Update(ctx, model)

}

// cleanupMcpRegistration removes the MCP tool registration when the

// LLMModel is being deleted.

func (r *LLMModelReconciler) cleanupMcpRegistration(

ctx context.Context,

model *aiv1alpha1.LLMModel,

) error {

if !model.Spec.McpExposure.Enabled || model.Spec.McpExposure.ToolName == "" {

return nil

}

logger := log.FromContext(ctx)

logger.Info("Unregistering MCP tool", "toolName", model.Spec.McpExposure.ToolName)

return r.McpClient.UnregisterTool(ctx, model.Spec.McpExposure.ToolName)

}

// SetupWithManager registers the controller with the controller-runtime manager.

func (r *LLMModelReconciler) SetupWithManager(mgr ctrl.Manager) error {

return ctrl.NewControllerManagedBy(mgr).

For(&aiv1alpha1.LLMModel{}).

Owns(&appsv1.Deployment{}).

Owns(&corev1.Service{}).

Owns(&corev1.PersistentVolumeClaim{}).

Owns(&corev1.ConfigMap{}).

Complete(r)

}

// Helper functions.

// containsString reports whether s is present in slice.
func containsString(slice []string, s string) bool {
    for _, item := range slice {
        if item == s {
            return true
        }
    }
    return false
}

// removeString returns a copy of slice with every occurrence of s removed.
func removeString(slice []string, s string) []string {
    result := make([]string, 0, len(slice))
    for _, item := range slice {
        if item != s {
            result = append(result, item)
        }
    }
    return result
}
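The vLLM and llama.cpp branches of ensureInferenceDeployment repeat the same sorted-key loop over engineArgs. As a minimal sketch, that loop could be factored into a single helper. The name buildEngineArgs is hypothetical and does not appear in the controller above; the sketch also assumes engineArgs is a map[string]string, as the indexing in the listing suggests.

```go
package main

import (
	"fmt"
	"sort"
)

// buildEngineArgs is a hypothetical helper (not part of the listing above).
// It flattens a map of engine arguments into "--key", "value" pairs, visiting
// keys in sorted order so the resulting slice is identical on every reconcile
// even though Go randomizes map iteration order.
func buildEngineArgs(engineArgs map[string]string) []string {
	keys := make([]string, 0, len(engineArgs))
	for k := range engineArgs {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	args := make([]string, 0, 2*len(keys))
	for _, k := range keys {
		args = append(args, "--"+k, engineArgs[k])
	}
	return args
}

func main() {
	args := buildEngineArgs(map[string]string{
		"tensor-parallel-size": "2",
		"max-model-len":        "8192",
	})
	fmt.Println(args)
}
```

Determinism matters here because CreateOrUpdate compares the mutated object with the stored one. An argument list whose order changed between reconciles would look like a spec change and could trigger a needless rollout.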

FILE: main.go

package main

import (

"flag"

"os"

"k8s.io/apimachinery/pkg/runtime"

utilruntime "k8s.io/apimachinery/pkg/util/runtime"

clientgoscheme "k8s.io/client-go/kubernetes/scheme"

ctrl "sigs.k8s.io/controller-runtime"

"sigs.k8s.io/controller-runtime/pkg/healthz"

"sigs.k8s.io/controller-runtime/pkg/log/zap"

"sigs.k8s.io/controller-runtime/pkg/metrics/server"

aiv1alpha1 "github.com/example/llm-operator/api/v1alpha1"

"github.com/example/llm-operator/controllers"

)

var (

scheme = runtime.NewScheme()

setupLog = ctrl.Log.WithName("setup")

)

func init() {

utilruntime.Must(clientgoscheme.AddToScheme(scheme))

utilruntime.Must(aiv1alpha1.AddToScheme(scheme))

}

func main() {

var metricsAddr string

var enableLeaderElection bool

var probeAddr string

var mcpServerAddr string

flag.StringVar(&metricsAddr, "metrics-bind-address", ":8080",

"The address the metric endpoint binds to.")

flag.StringVar(&probeAddr, "health-probe-bind-address", ":8081",

"The address the probe endpoint binds to.")

flag.BoolVar(&enableLeaderElection, "leader-elect", false,

"Enable leader election for controller manager.")

flag.StringVar(&mcpServerAddr, "mcp-server-address",

"http://mcp-server.ai-models.svc.cluster.local:3000",

"Address of the MCP server for tool registration.")

flag.Parse()

opts := zap.Options{Development: true}

ctrl.SetLogger(zap.New(zap.UseFlagOptions(&opts)))

    mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
        Scheme: scheme,
        Metrics: server.Options{
            BindAddress: metricsAddr,
        },
        HealthProbeBindAddress: probeAddr,
        LeaderElection:         enableLeaderElection,
        LeaderElectionID:       "ai.example.io",
    })

if err != nil {

setupLog.Error(err, "unable to start manager")

os.Exit(1)

}

mcpClient := controllers.NewHttpMcpClient(mcpServerAddr)

if err = (&controllers.LLMModelReconciler{

Client: mgr.GetClient(),

Scheme: mgr.GetScheme(),

McpClient: mcpClient,

}).SetupWithManager(mgr); err != nil {

setupLog.Error(err, "unable to create controller", "controller", "LLMModel")

os.Exit(1)

}

if err := mgr.AddHealthzCheck("healthz", healthz.Ping); err != nil {

setupLog.Error(err, "unable to set up health check")

os.Exit(1)

}

if err := mgr.AddReadyzCheck("readyz", healthz.Ping); err != nil {

setupLog.Error(err, "unable to set up ready check")

os.Exit(1)

}

setupLog.Info("starting manager")

    if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
        setupLog.Error(err, "problem running manager")
        os.Exit(1)
    }
}

FILE: controllers/mcp_client.go

package controllers

import (

"bytes"

"context"

"encoding/json"

"fmt"

"net/http"

"time"

)

// HttpMcpClient implements McpRegistryClient by calling the MCP server's

// internal registration REST API. This is a sidecar API distinct from the

// MCP protocol itself — it is used only by the controller to push

// registrations into the MCP server's in-memory tool registry.

type HttpMcpClient struct {

baseURL string

httpClient *http.Client

}

// NewHttpMcpClient creates a new HttpMcpClient.

func NewHttpMcpClient(baseURL string) *HttpMcpClient {
    return &HttpMcpClient{
        baseURL: baseURL,
        httpClient: &http.Client{
            Timeout: 10 * time.Second,
        },
    }
}

// RegisterTool calls the MCP server's /admin/tools endpoint to register

// or update a tool definition.

func (c *HttpMcpClient) RegisterTool(

ctx context.Context,

tool McpToolDefinition,

) error {

body, err := json.Marshal(map[string]interface{}{

"name": tool.Name,

"description": tool.Description,

"endpoint": tool.Endpoint,

"modelId": tool.ModelId,

"domains": tool.Domains,

})

if err != nil {

return fmt.Errorf("marshalling tool definition: %w", err)

}

req, err := http.NewRequestWithContext(

ctx,

http.MethodPut,

fmt.Sprintf("%s/admin/tools/%s", c.baseURL, tool.Name),

bytes.NewReader(body),

)

if err != nil {

return fmt.Errorf("creating register request: %w", err)

}

req.Header.Set("Content-Type", "application/json")

resp, err := c.httpClient.Do(req)

if err != nil {

return fmt.Errorf("calling MCP server register: %w", err)

}

defer resp.Body.Close()

if resp.StatusCode != http.StatusOK && resp.StatusCode != http.StatusCreated {

return fmt.Errorf("MCP server register returned status %d", resp.StatusCode)

}

return nil

}

// UnregisterTool calls the MCP server's /admin/tools endpoint to remove

// a tool definition.

func (c *HttpMcpClient) UnregisterTool(ctx context.Context, toolName string) error {

req, err := http.NewRequestWithContext(

ctx,

http.MethodDelete,

fmt.Sprintf("%s/admin/tools/%s", c.baseURL, toolName),

nil,

)

if err != nil {

return fmt.Errorf("creating unregister request: %w", err)

}

resp, err := c.httpClient.Do(req)

if err != nil {

return fmt.Errorf("calling MCP server unregister: %w", err)

}

defer resp.Body.Close()

if resp.StatusCode != http.StatusOK && resp.StatusCode != http.StatusNoContent {

return fmt.Errorf("MCP server unregister returned status %d", resp.StatusCode)

}

return nil

}

FILE: Dockerfile (controller)

# syntax=docker/dockerfile:1

FROM golang:1.22-alpine AS builder

WORKDIR /workspace

COPY go.mod go.sum ./

RUN go mod download

COPY api/ api/

COPY controllers/ controllers/

COPY main.go main.go

RUN CGO_ENABLED=0 GOOS=linux GOARCH=amd64 \
    go build -a -o manager main.go

FROM gcr.io/distroless/static:nonroot

WORKDIR /

COPY --from=builder /workspace/manager .

USER 65532:65532

ENTRYPOINT ["/manager"]

CHAPTER SIX: RBAC AND CONTROLLER DEPLOYMENT MANIFESTS

FILE: config/rbac/serviceaccount.yaml

apiVersion: v1
kind: ServiceAccount
metadata:
  name: llm-operator-controller
  namespace: llm-operator-system

FILE: config/rbac/role.yaml

apiVersion: rbac.authorization.k8s.io/v1

kind: ClusterRole

metadata:

rules:
  # LLMModel resources: full access
  - apiGroups: ["ai.example.io"]
    resources: ["llmmodels"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: ["ai.example.io"]
    resources: ["llmmodels/status"]
    verbs: ["get", "update", "patch"]
  - apiGroups: ["ai.example.io"]
    resources: ["llmmodels/finalizers"]
    verbs: ["update"]
  # Core Kubernetes resources the controller manages
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: [""]
    resources: ["services", "persistentvolumeclaims", "configmaps"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: [""]
    resources: ["events"]
    verbs: ["create", "patch"]
  # KEDA ScaledObjects
  - apiGroups: ["keda.sh"]
    resources: ["scaledobjects"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  # Leader election
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]

FILE: config/rbac/rolebinding.yaml

apiVersion: rbac.authorization.k8s.io/v1

kind: ClusterRoleBinding

metadata:

subjects:
  - kind: ServiceAccount
    name: llm-operator-controller
    namespace: llm-operator-system
roleRef:
  kind: ClusterRole
  apiGroup: rbac.authorization.k8s.io

FILE: config/manager/manager.yaml

apiVersion: apps/v1

kind: Deployment

metadata:
  namespace: llm-operator-system
  labels:
    control-plane: controller-manager
spec:
  replicas: 1
  selector:
    matchLabels:
      control-plane: controller-manager
  template:
    metadata:
      labels:
        control-plane: controller-manager
    spec:
      serviceAccountName: llm-operator-controller
      terminationGracePeriodSeconds: 10
      containers:
        - name: manager
          image: ghcr.io/example/llm-operator:v1.0.0
          command:
            - /manager
          args:
            - --leader-elect
            - --mcp-server-address=http://mcp-server.ai-models.svc.cluster.local:3000
          ports:
            - name: metrics
              containerPort: 8080
            - name: health
              containerPort: 8081
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8081
            initialDelaySeconds: 15
            periodSeconds: 20
          readinessProbe:
            httpGet:
              path: /readyz
              port: 8081
            initialDelaySeconds: 5
            periodSeconds: 10
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 256Mi
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - ALL
            readOnlyRootFilesystem: true
            runAsNonRoot: true
      securityContext:
        runAsNonRoot: true

CHAPTER SEVEN: AUTOSCALING WITH KEDA (VENDOR-AGNOSTIC)

vLLM emits the same Prometheus metrics regardless of the underlying accelerator

vendor (NVIDIA, AMD, or Intel Gaudi). This means our KEDA configuration is

completely vendor-agnostic — we always scale on vllm:num_requests_waiting,

which reflects the depth of the inference queue.

The KEDA ScaledObjects are applied as separate manifests in config/keda/. The
controller is intended to create them programmatically via the dynamic client,
but the ensureKedaScaledObject listing omits that step, so for now these YAML
files are both the reference and the manifests to apply manually.
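To make the scaling arithmetic concrete, here is a rough sketch of how a single queue-depth trigger maps to a replica count. It assumes the trigger uses an average-value target, so the desired count is ceil(metric / threshold), and it follows KEDA's documented rule that a scaler is active only when the metric is strictly greater than activationThreshold. The function desiredReplicas is a hypothetical illustration, not KEDA code, and it ignores the HPA tolerance band, stabilization windows, the cooldown period, and multiple triggers.

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas approximates the steady-state replica count for one
// trigger. Hypothetical helper for illustration only.
func desiredReplicas(metric, threshold, activation float64, minReplicas, maxReplicas int) int {
	// An inactive trigger lets a minReplicaCount of 0 scale to zero.
	// Activation is a strict greater-than comparison.
	if metric <= activation && minReplicas == 0 {
		return 0
	}
	n := int(math.Ceil(metric / threshold))
	floor := minReplicas
	if floor < 1 {
		floor = 1 // an active workload needs at least one replica
	}
	if n < floor {
		n = floor
	}
	if n > maxReplicas {
		n = maxReplicas
	}
	return n
}

func main() {
	// threshold 3, activationThreshold 0, min 0, max 3
	for _, waiting := range []float64{0, 1, 7, 10} {
		fmt.Printf("waiting=%v -> replicas=%d\n",
			waiting, desiredReplicas(waiting, 3, 0, 0, 3))
	}
}
```

With seven waiting requests and a threshold of 3 the sketch yields three replicas; with ten it would want four but is capped at maxReplicaCount.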

FILE: config/keda/scaledobject-gemma4-26b-amd.yaml

apiVersion: keda.sh/v1alpha1

kind: ScaledObject

metadata:
  name: gemma4-26b-amd
  namespace: ai-models
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gemma4-26b-amd
  # minReplicaCount: 0 enables scale-to-zero.
  # Set to 1 if cold-start latency (model loading) is unacceptable.
  minReplicaCount: 0
  maxReplicaCount: 3
  # cooldownPeriod: how long to wait after the last trigger reported active
  # before scaling back to zero. It applies only to the final step down to
  # zero replicas; 300 seconds prevents thrashing on bursty workloads.
  cooldownPeriod: 300
  pollingInterval: 15
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-server.monitoring.svc.cluster.local:9090
        metricName: vllm_requests_waiting
        # Scale up when more than 3 requests are waiting per replica.
        # KEDA scales to ceil(metricValue / threshold) replicas.
        query: >
          sum(vllm:num_requests_waiting{
          namespace="ai-models",
          pod=~"gemma4-26b-amd-.*"
          })
        threshold: "3"
        # activationThreshold: KEDA activates the scaler only when the metric
        # is strictly greater than this value, so "0" wakes up a
        # scaled-to-zero deployment when at least 1 request is waiting.
        activationThreshold: "0"

FILE: config/keda/scaledobject-qwen36-35b-gaudi.yaml

apiVersion: keda.sh/v1alpha1

kind: ScaledObject

metadata:
  name: qwen36-35b-gaudi
  namespace: ai-models
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: qwen36-35b-gaudi
  minReplicaCount: 0
  maxReplicaCount: 2
  cooldownPeriod: 300
  pollingInterval: 15
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-server.monitoring.svc.cluster.local:9090
        metricName: vllm_requests_waiting
        query: >
          sum(vllm:num_requests_waiting{
          namespace="ai-models",
          pod=~"qwen36-35b-gaudi-.*"
          })
        threshold: "2"
        # Activation is a strict greater-than comparison, so "0" wakes the
        # deployment on the first waiting request.
        activationThreshold: "0"

FILE: config/keda/scaledobject-gemma4-31b-nvidia.yaml

apiVersion: keda.sh/v1alpha1

kind: ScaledObject

metadata:
  name: gemma4-31b-nvidia
  namespace: ai-models
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gemma4-31b-nvidia
  minReplicaCount: 1
  maxReplicaCount: 2
  cooldownPeriod: 300
  pollingInterval: 15
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-server.monitoring.svc.cluster.local:9090
        metricName: vllm_requests_waiting
        query: >
          sum(vllm:num_requests_waiting{
          namespace="ai-models",
          pod=~"gemma4-31b-nvidia-.*"
          })
        threshold: "3"
        activationThreshold: "1"

KEDA HTTP Add-On for Scale-to-Zero with Request Buffering:

# File: config/keda/http-scaledobject-gemma4-26b-amd.yaml

# The HTTP add-on intercepts requests, holds them while the deployment

# scales up from zero, and forwards them once a pod is ready.

# This prevents request failures during cold starts.

apiVersion: http.keda.sh/v1alpha1

kind: HTTPScaledObject

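metadata:
  namespace: ai-models
spec:
  hosts:
    - gemma4-26b-amd.ai-models.svc.cluster.local
  scaleTargetRef:
    port: 8000
  replicas:
    min: 0
    max: 3
  # scaledownPeriod: seconds to wait after the last request before scaling
  # back down. This is not the cold-start timeout: large models may need up
  # to 600 seconds to start and load their weights, and how long buffered
  # requests are held during that time is configured separately.
  scaledownPeriod: 300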

CHAPTER EIGHT: PROMETHEUS RECORDING RULES AND MONITORING

FILE: config/monitoring/prometheus-rules.yaml

apiVersion: monitoring.coreos.com/v1

kind: PrometheusRule

metadata:
  namespace: monitoring
  labels:
    # This label causes the Prometheus Operator to pick up this rule.
    prometheus: kube-prometheus
    role: alert-rules
spec:
  groups:
    # ----------------------------------------------------------------
    # vLLM metrics: vendor-agnostic (same metric names for all vendors)
    # ----------------------------------------------------------------
    - name: vllm.rules
      interval: 15s
      rules:
        - record: vllm:num_requests_running:sum
          expr: sum by (namespace, app) (vllm:num_requests_running)
        - record: vllm:num_requests_waiting:sum
          expr: sum by (namespace, app) (vllm:num_requests_waiting)
        - record: vllm:time_to_first_token_ms:p50
          expr: >
            histogram_quantile(0.50,
              sum by (namespace, app, le) (
                rate(vllm:time_to_first_token_seconds_bucket[5m])
              )
            ) * 1000
        - record: vllm:time_to_first_token_ms:p95
          expr: >
            histogram_quantile(0.95,
              sum by (namespace, app, le) (
                rate(vllm:time_to_first_token_seconds_bucket[5m])
              )
            ) * 1000
        - record: vllm:tokens_per_second:rate5m
          expr: >
            sum by (namespace, app) (
              rate(vllm:generation_tokens_total[5m])
            )

    # ----------------------------------------------------------------
    # NVIDIA GPU metrics (DCGM exporter)
    # ----------------------------------------------------------------
    - name: nvidia.gpu.rules
      interval: 15s
      rules:
        - record: nvidia:gpu_memory_used_gib:avg
          expr: >
            avg by (node, gpu) (
              DCGM_FI_DEV_FB_USED / 1024
            )
        - record: nvidia:gpu_utilization_pct:avg
          expr: >
            avg by (node, gpu) (
              DCGM_FI_DEV_GPU_UTIL
            )
    # ----------------------------------------------------------------
    # AMD GPU metrics (ROCm SMI exporter)
    # ----------------------------------------------------------------
    - name: amd.gpu.rules
      interval: 15s
      rules:
        - record: amd:gpu_memory_used_gib:avg
          expr: >
            avg by (node, gpu) (
              rocm_smi_memory_used_bytes / 1073741824
            )
        - record: amd:gpu_utilization_pct:avg
          expr: >
            avg by (node, gpu) (
              rocm_smi_gpu_use_percent
            )
    # ----------------------------------------------------------------
    # Intel Gaudi metrics (Gaudi metrics exporter)
    # ----------------------------------------------------------------
    - name: gaudi.rules
      interval: 15s
      rules:
        - record: gaudi:memory_used_gib:avg
          expr: >
            avg by (node, device) (
              habana_gaudi_memory_used_bytes / 1073741824
            )
        - record: gaudi:utilization_pct:avg
          expr: >
            avg by (node, device) (
              habana_gaudi_util_percent
            )

    # ----------------------------------------------------------------
    # Alerting rules
    # ----------------------------------------------------------------
    - name: llm.alerts
      rules:
        - alert: LLMHighQueueDepth
          expr: vllm:num_requests_waiting:sum > 20
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High inference queue depth for {{ $labels.app }}"
            description: >
              {{ $labels.app }} has {{ $value }} requests waiting.
              Consider increasing maxReplicas in the LLMModel spec.
        - alert: LLMHighLatency
          expr: vllm:time_to_first_token_ms:p95 > 10000
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High TTFT latency for {{ $labels.app }}"
            description: >
              P95 time-to-first-token for {{ $labels.app }} is
              {{ $value }}ms, exceeding the 10s threshold.

CHAPTER NINE: THE MCP SERVER

FILE: mcp-server/package.json

{
  "name": "example-llm-mcp-server",
  "version": "1.0.0",
  "description": "MCP server exposing LLMModel resources as AI agent tools",
  "main": "dist/index.js",
  "scripts": {
    "build": "tsc",
    "start": "node dist/index.js",
    "dev": "ts-node src/index.ts"
  },
  "dependencies": {
    "@kubernetes/client-node": "^0.21.0",
    "@modelcontextprotocol/sdk": "^1.12.0",
    "express": "^4.18.2",
    "openai": "^4.52.0",
    "zod": "^3.23.8"
  },
  "devDependencies": {
    "@types/express": "^4.17.21",
    "@types/node": "^20.0.0",
    "typescript": "^5.4.0",
    "ts-node": "^10.9.2"
  }
}

FILE: mcp-server/tsconfig.json

{
  "compilerOptions": {
    "target": "ES2022",
    "module": "commonjs",
    "lib": ["ES2022"],
    "outDir": "./dist",
    "rootDir": "./src",
    "strict": true,
    "esModuleInterop": true,
    "skipLibCheck": true,
    "forceConsistentCasingInFileNames": true,
    "resolveJsonModule": true
  },
  "include": ["src/**/*"],
  "exclude": ["node_modules", "dist"]
}

FILE: mcp-server/src/index.ts

// MCP server that exposes LLMModel resources as callable tools.

// Implements the November 2025 MCP specification.

// Supports dynamic tool registration via the /admin/tools REST API,

// which the Kubernetes controller calls when models are added or removed.

import { randomUUID } from "crypto";

import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";

import { StreamableHTTPServerTransport } from "@modelcontextprotocol/sdk/server/streamableHttp.js";

import { z } from "zod";

import * as k8s from "@kubernetes/client-node";

import OpenAI from "openai";

import express, { Request, Response } from "express";

// ---------------------------------------------------------------------------

// Types

// ---------------------------------------------------------------------------

interface LLMModelRecord {

toolName: string;

toolDescription: string;

endpoint: string;

modelId: string;

domains: string[];

supportsThinking: boolean;

contextWindowK: number;

phase: string;

}

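// Body of PUT /admin/tools/:name, mirroring the JSON that the controller's
// HttpMcpClient.RegisterTool sends.
interface AdminToolPayload {
  name: string;
  description: string;
  endpoint: string;
  modelId: string;
  domains: string[];
}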

// ---------------------------------------------------------------------------

// Model Registry

// Maintains a live view of available LLMModel resources via Kubernetes watch.

// Also accepts push updates from the controller via the /admin/tools API.

// ---------------------------------------------------------------------------

class ModelRegistry {

private models: Map<string, LLMModelRecord> = new Map();

private kc: k8s.KubeConfig;

private customApi: k8s.CustomObjectsApi;

// Callbacks registered by the MCP server to be notified when the

// tool list changes, so it can send MCP tools/list_changed notifications.

private changeCallbacks: Array<() => void> = [];

constructor() {

this.kc = new k8s.KubeConfig();

try {

this.kc.loadFromCluster();

} catch {

this.kc.loadFromDefault();

}

this.customApi = this.kc.makeApiClient(k8s.CustomObjectsApi);

}

onToolListChanged(cb: () => void): void {

this.changeCallbacks.push(cb);

}

  private notifyChanged(): void {
    for (const cb of this.changeCallbacks) {
      try { cb(); } catch { /* ignore */ }
    }
  }

async start(): Promise<void> {

const namespace = process.env.MODEL_NAMESPACE || "ai-models";

// Initial list to populate the registry before starting the watch.

try {

const list = await this.customApi.listNamespacedCustomObject(

"ai.example.io",

"v1alpha1",

namespace,

"llmmodels",

);

const items = (list as any).body?.items ?? [];

for (const item of items) {

this.upsertFromK8s(item);

}

console.log(`Loaded ${this.models.size} models from Kubernetes registry`);

} catch (err) {

console.warn("Could not list LLMModels from Kubernetes:", err);

}

// Start watch for real-time updates.

this.startWatch(namespace);

}

private startWatch(namespace: string): void {

const watch = new k8s.Watch(this.kc);

watch.watch(

`/apis/ai.example.io/v1alpha1/namespaces/${namespace}/llmmodels`,

{},

(type: string, obj: any) => {

if (type === "ADDED" || type === "MODIFIED") {

this.upsertFromK8s(obj);

} else if (type === "DELETED") {

const toolName = obj.spec?.mcpExposure?.toolName;

if (toolName) {

this.models.delete(toolName);

console.log(`Watch: unregistered model tool ${toolName}`);

this.notifyChanged();

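}

}

},

(err: any) => {

console.error("Watch error, reconnecting in 5s:", err);

setTimeout(() => this.startWatch(namespace), 5000);

},

).catch((err: any) => {

// watch() returns a promise; retry if the watch could not be established.

console.error("Could not start watch, retrying in 5s:", err);

setTimeout(() => this.startWatch(namespace), 5000);

});

}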

private upsertFromK8s(obj: any): void {

const spec = obj.spec ?? {};

const status = obj.status ?? {};

const mcpExposure = spec.mcpExposure ?? {};

if (!mcpExposure.enabled || !mcpExposure.toolName) return;

if (!["Ready", "Proxying"].includes(status.phase ?? "")) return;

const modelType: string = spec.modelType ?? "Dense";

const supportsThinking =

modelType === "DenseThinking" || modelType === "MoEThinking";

const record: LLMModelRecord = {

toolName: mcpExposure.toolName,

toolDescription: mcpExposure.toolDescription ?? "",

endpoint: status.endpoint ?? "",

modelId: spec.modelId ?? "",

domains: spec.domains ?? [],

supportsThinking,

contextWindowK: spec.contextWindowK ?? 32,

phase: status.phase ?? "",

};

const isNew = !this.models.has(record.toolName);

this.models.set(record.toolName, record);

console.log(`${isNew ? "Registered" : "Updated"} model tool: ${record.toolName}`);

this.notifyChanged();

}

// upsertFromAdmin is called by the /admin/tools PUT endpoint.

// The controller calls this to push registrations without waiting

// for the Kubernetes watch to fire.

upsertFromAdmin(payload: AdminToolPayload): void {

const existing = this.models.get(payload.name);

const record: LLMModelRecord = {

toolName: payload.name,

toolDescription: payload.description,

endpoint: payload.endpoint,

modelId: payload.modelId,

domains: payload.domains,

supportsThinking: false, // updated by watch

contextWindowK: 32, // updated by watch

phase: "Ready",

};

// Preserve supportsThinking and contextWindowK from existing record.

if (existing) {

record.supportsThinking = existing.supportsThinking;

record.contextWindowK = existing.contextWindowK;

}

this.models.set(payload.name, record);

console.log(`Admin: registered/updated tool ${payload.name}`);

this.notifyChanged();

}

removeByToolName(toolName: string): boolean {

const existed = this.models.has(toolName);

this.models.delete(toolName);

if (existed) {

console.log(`Admin: unregistered tool ${toolName}`);

this.notifyChanged();

}

return existed;

}

getAll(): LLMModelRecord[] {

return Array.from(this.models.values());

}

getByToolName(toolName: string): LLMModelRecord | undefined {

return this.models.get(toolName);

}

}

// ---------------------------------------------------------------------------

// MCP Tool Invocation

// ---------------------------------------------------------------------------

async function invokeModel(

model: LLMModelRecord,

input: {

messages: Array<{ role: "system" | "user" | "assistant"; content: string }>;

temperature?: number;

max_tokens?: number;

thinking?: boolean;

},

): Promise<{ content: Array<{ type: "text"; text: string }>; isError?: boolean }> {

const openai = new OpenAI({

baseURL: model.endpoint,

apiKey: "not-needed-for-local-models",

});

let messages = [...input.messages];

// Inject thinking mode instruction for hybrid thinking models.

if (input.thinking && model.supportsThinking) {

const thinkingInstruction =

"Think step by step before answering. " +

"Use <think>...</think> tags for your internal reasoning.";

const sysIdx = messages.findIndex((m) => m.role === "system");

if (sysIdx >= 0) {

messages[sysIdx] = {

...messages[sysIdx],

content: messages[sysIdx].content + "\n\n" + thinkingInstruction,

};

} else {

messages = [{ role: "system", content: thinkingInstruction }, ...messages];

}

}

try {

const completion = await openai.chat.completions.create({

model: model.modelId,

messages: messages as any,

temperature: input.temperature ?? 0.7,

max_tokens: input.max_tokens ?? 2048,

});

const responseText = completion.choices[0]?.message?.content ?? "";

return { content: [{ type: "text", text: responseText }] };

} catch (error: any) {

return {

content: [

{

type: "text",

text: `Error calling model ${model.modelId}: ${error.message}`,

},

],

isError: true,

};

}

}

// ---------------------------------------------------------------------------

// MCP Server Setup

// ---------------------------------------------------------------------------

function buildMcpServer(registry: ModelRegistry): McpServer {

const server = new McpServer({

name: "llm-mcp-server", // illustrative server name

version: "1.0.0",

});

// Meta-tool: list all available models.

server.tool(

"list_available_models",

"Lists all LLM models currently available in the cluster, " +

"including their capabilities, context windows, accelerator vendor, " +

"and deployment status. Call this first to discover which model " +

"to use for a given task.",

{},

async () => {

const models = registry.getAll();

const summary = models.map((m) => ({

toolName: m.toolName,

modelId: m.modelId,

domains: m.domains,

contextWindowK: m.contextWindowK,

supportsThinking: m.supportsThinking,

status: m.phase,

}));

return {

content: [{ type: "text", text: JSON.stringify(summary, null, 2) }],

};

},

);

// Register each currently-known model as a tool, keeping the handle that

// server.tool() returns so the tool can be removed later.

const handles = new Map<string, ReturnType<McpServer["tool"]>>();

const syncTools = (): void => {

const live = new Set(registry.getAll().map((m) => m.toolName));

// Remove tools whose models have left the registry.

for (const [name, handle] of handles) {

if (!live.has(name)) {

handle.remove();

handles.delete(name);

}

}

// Register tools for models that are new to the registry.

for (const name of live) {

if (!handles.has(name)) {

handles.set(name, registerModelTool(server, registry, name));

}

}

};

syncTools();

// When the registry changes, bring the server's tool list back in line.

// The MCP SDK sends tools/list_changed to connected clients whenever a

// tool is registered or removed, so no explicit notification is needed.

registry.onToolListChanged(syncTools);

return server;

}

function registerModelTool(

server: McpServer,

registry: ModelRegistry,

toolName: string,

): ReturnType<McpServer["tool"]> {

// server.tool() throws if the name is already registered, so each name is

// registered once. The handler looks the record up at call time, so endpoint

// and capability changes take effect without re-registering the tool.

return server.tool(

toolName,

registry.getByToolName(toolName)?.toolDescription ?? "",

{

messages: z.array(

z.object({

role: z.enum(["system", "user", "assistant"]),

content: z.string(),

})

).describe("Conversation history."),

temperature: z.number().min(0).max(2).optional()

.describe("Sampling temperature (0=deterministic, 2=creative)."),

max_tokens: z.number().int().positive().optional()

.describe("Maximum tokens to generate."),

thinking: z.boolean().optional()

.describe(

"Enable chain-of-thought reasoning for models that support " +

"hybrid thinking mode (e.g. Qwen3.5, Qwen3.6). " +

"Increases latency but improves quality on complex tasks."

),

},

async (input) => {

const model = registry.getByToolName(toolName);

if (!model) {

return {

content: [{ type: "text" as const, text: `Model tool ${toolName} is no longer available.` }],

isError: true,

};

}

return invokeModel(model, input);

},

);

}

// ---------------------------------------------------------------------------

// Express Application

// ---------------------------------------------------------------------------

async function main(): Promise<void> {

const registry = new ModelRegistry();

await registry.start();

const mcpServer = buildMcpServer(registry);

const app = express();

app.use(express.json());

// Health endpoint — checked by the Kubernetes readiness probe.

app.get("/health", (_req: Request, res: Response) => {

res.status(200).json({

status: "ok",

models: registry.getAll().length,

});

});

// -----------------------------------------------------------------------

// Admin API — called by the Kubernetes controller to push tool

// registrations without waiting for the Kubernetes watch to fire.

// This API is internal only and should not be exposed outside the cluster.

// -----------------------------------------------------------------------

// PUT /admin/tools/:toolName — register or update a tool.

app.put("/admin/tools/:toolName", (req: Request, res: Response) => {

const { toolName } = req.params;

const payload = req.body as AdminToolPayload;

if (!payload.name || payload.name !== toolName) {

res.status(400).json({ error: "toolName in URL must match name in body" });

return;

}

registry.upsertFromAdmin(payload);

res.status(200).json({ status: "registered", toolName });

});

// DELETE /admin/tools/:toolName — unregister a tool.

app.delete("/admin/tools/:toolName", (req: Request, res: Response) => {

const { toolName } = req.params;

const existed = registry.removeByToolName(toolName);

if (existed) {

res.status(200).json({ status: "unregistered", toolName });

} else {

res.status(404).json({ error: "tool not found", toolName });

}

});

// GET /admin/tools — list all registered tools (for debugging).

app.get("/admin/tools", (_req: Request, res: Response) => {

res.status(200).json(registry.getAll().map((m) => ({

toolName: m.toolName,

modelId: m.modelId,

endpoint: m.endpoint,

phase: m.phase,

})));

});

// -----------------------------------------------------------------------

// MCP Protocol Endpoint

// Uses StreamableHTTP transport (November 2025 MCP specification).

// -----------------------------------------------------------------------

const transport = new StreamableHTTPServerTransport({

sessionIdGenerator: () => randomUUID(),

});

app.all("/mcp", async (req: Request, res: Response) => {

await transport.handleRequest(req, res, req.body);

});

await mcpServer.connect(transport);

const port = parseInt(process.env.PORT ?? "3000", 10);

app.listen(port, () => {

console.log(`MCP server listening on port ${port}`);

console.log(`Serving ${registry.getAll().length} models as tools`);

console.log(`Health: http://localhost:${port}/health`);

console.log(`MCP: http://localhost:${port}/mcp`);

console.log(`Admin: http://localhost:${port}/admin/tools`);

});

}

main().catch((err) => {

console.error("Fatal error:", err);

process.exit(1);

});

FILE: mcp-server/Dockerfile

# syntax=docker/dockerfile:1

FROM node:20-alpine AS builder

WORKDIR /app

COPY package.json package-lock.json ./

RUN npm ci

COPY tsconfig.json ./

COPY src/ src/

RUN npm run build

FROM node:20-alpine AS runtime

WORKDIR /app

COPY package.json package-lock.json ./

RUN npm ci --omit=dev

COPY --from=builder /app/dist ./dist

USER node

EXPOSE 3000

CMD ["node", "dist/index.js"]

FILE: config/mcp/deployment.yaml

apiVersion: v1
kind: ServiceAccount
metadata:
  name: mcp-server
  namespace: ai-models
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: mcp-server  # assumed name; must match roleRef.name below
  namespace: ai-models
rules:
  - apiGroups: ["ai.example.io"]
    resources: ["llmmodels"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: mcp-server  # assumed name
  namespace: ai-models
subjects:
  - kind: ServiceAccount
    name: mcp-server
    namespace: ai-models
roleRef:
  kind: Role
  name: mcp-server
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mcp-server  # assumed name
  namespace: ai-models
spec:
  replicas: 2
  selector:
    matchLabels:
      app: mcp-server
  template:
    metadata:
      labels:
        app: mcp-server
    spec:
      serviceAccountName: mcp-server
      containers:
        - name: mcp-server
          image: ghcr.io/example/llm-mcp-server:v1.0.0
          ports:
            - containerPort: 3000
          env:
            - name: MODEL_NAMESPACE
              value: ai-models
            - name: PORT
              value: "3000"
          resources:
            requests:
              cpu: 200m
              memory: 256Mi
            limits:
              cpu: 1000m
              memory: 512Mi
          readinessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 10
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 15
            periodSeconds: 20
            failureThreshold: 3
---
apiVersion: v1
kind: Service
metadata:
  name: mcp-server
  namespace: ai-models
spec:
  selector:
    app: mcp-server
  ports:
    - port: 3000
      targetPort: 3000

CHAPTER TEN: QUERYING THE MODEL REGISTRY

With discovery labels populated by the controller, engineers can query the

registry using standard Kubernetes tooling.

Find all models that support vision and have at least 128K context:

kubectl get llmmodels -n ai-models \

-l "ai.example.io/domain-vision=true,ai.example.io/context-128k=true" \

-o custom-columns=\

NAME:.metadata.name,\

MODEL:.spec.modelId,\

VENDOR:.spec.acceleratorVendor,\

VRAM:.spec.resources.vramPerAcceleratorGiB,\

CONTEXT:.spec.contextWindowK,\

STATUS:.status.phase

Find all models running on AMD accelerators:

kubectl get llmmodels -n ai-models \

-l "ai.example.io/accelerator-vendor=amd" \

-o wide

Find all models that support reasoning and fit in 24 GiB of VRAM:

kubectl get llmmodels -n ai-models \

-l "ai.example.io/domain-reasoning=true,\

ai.example.io/vram-tier in (8gb,16gb,24gb)"

Find all remote API models:

kubectl get llmmodels -n ai-models \

-l "ai.example.io/deployment-mode=remote"

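NOTE: Kubernetes label selectors support only equality and set-based matching,

not numeric range comparisons, and field selectors on custom resources are

limited to metadata.name and metadata.namespace unless the CRD declares

selectableFields. For queries that require numeric comparisons (e.g.,

contextWindowK >= 256) or filtering on status, use the Python client and

filter in application code as shown below.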

FILE: tools/query_models.py

#!/usr/bin/env python3

"""

Query the LLMModel registry for models matching given criteria.

Runs inside the cluster (uses in-cluster config) or locally (uses kubeconfig).

"""

from kubernetes import client, config


def load_config() -> None:
    """Load Kubernetes configuration from cluster or local kubeconfig."""
    try:
        config.load_incluster_config()
    except config.ConfigException:
        config.load_kube_config()


def find_models(
    namespace: str = "ai-models",
    domains: list[str] | None = None,
    max_vram_gib: int | None = None,
    context_min_k: int | None = None,
    accelerator_vendor: str | None = None,
    deployment_mode: str | None = None,
    phase: str | None = None,
) -> list[dict]:
    """
    Query the LLMModel registry for models matching the given criteria.

    Args:
        namespace: Kubernetes namespace to search.
        domains: List of required capability domains,
            e.g. ["code", "vision"].
        max_vram_gib: Maximum acceptable VRAM per accelerator in GiB.
            Models requiring more VRAM are excluded.
        context_min_k: Minimum required context window in thousands of tokens.
        accelerator_vendor: Filter by vendor: "nvidia", "amd", "intel-gaudi", "cpu".
        deployment_mode: Filter by mode: "local" or "remote".
        phase: Filter by status phase: "Ready", "Proxying", etc.

    Returns:
        List of matching LLMModel dicts (raw Kubernetes API objects).
    """
    load_config()
    api = client.CustomObjectsApi()

    # Build label selector from criteria that map to labels.
    label_parts: list[str] = []
    if domains:
        for domain in domains:
            label_parts.append(f"ai.example.io/domain-{domain}=true")
    if accelerator_vendor:
        label_parts.append(
            f"ai.example.io/accelerator-vendor={accelerator_vendor}"
        )
    if deployment_mode:
        label_parts.append(
            f"ai.example.io/deployment-mode={deployment_mode}"
        )
    if max_vram_gib is not None:
        # Map max VRAM to the tier labels at or below the maximum.
        tier_map = [
            (0, "cpu"),
            (8, "8gb"),
            (16, "16gb"),
            (24, "24gb"),
            (48, "48gb"),
            (80, "80gb"),
            (141, "141gb"),
            (192, "192gb"),
        ]
        eligible_tiers = [
            tier for threshold, tier in tier_map
            if threshold <= max_vram_gib
        ]
        if eligible_tiers:
            label_parts.append(
                "ai.example.io/vram-tier in ({})".format(
                    ",".join(eligible_tiers)
                )
            )

    label_selector = ",".join(label_parts) if label_parts else None
    result = api.list_namespaced_custom_object(
        group="ai.example.io",
        version="v1alpha1",
        namespace=namespace,
        plural="llmmodels",
        label_selector=label_selector,
    )
    models: list[dict] = result.get("items", [])

    # Apply filters that cannot be expressed as label selectors.
    if context_min_k is not None:
        models = [
            m for m in models
            if m.get("spec", {}).get("contextWindowK", 0) >= context_min_k
        ]
    if phase is not None:
        models = [
            m for m in models
            if m.get("status", {}).get("phase") == phase
        ]
    else:
        # By default, return only Ready and Proxying models.
        models = [
            m for m in models
            if m.get("status", {}).get("phase") in ("Ready", "Proxying")
        ]
    return models


def print_models(models: list[dict]) -> None:
    """Print a formatted summary of matching models."""
    if not models:
        print("No matching models found.")
        return
    print(
        f"{'NAME':<30} {'MODEL':<45} {'VENDOR':<12} "
        f"{'VRAM':<8} {'CONTEXT':<10} {'STATUS':<10} ENDPOINT"
    )
    print("-" * 140)
    for m in models:
        spec = m.get("spec", {})
        status = m.get("status", {})
        resources = spec.get("resources", {})
        print(
            f"{m['metadata']['name']:<30} "
            f"{spec.get('modelId', ''):<45} "
            f"{spec.get('acceleratorVendor', 'remote'):<12} "
            f"{resources.get('vramPerAcceleratorGiB', 0):<8} "
            f"{spec.get('contextWindowK', 0):<10} "
            f"{status.get('phase', ''):<10} "
            f"{status.get('endpoint', '')}"
        )


if __name__ == "__main__":
    # Example: find reasoning-capable models on any vendor with <=24 GiB VRAM
    results = find_models(
        domains=["reasoning"],
        max_vram_gib=24,
        context_min_k=128,
    )
    print(f"Found {len(results)} reasoning models with <=24 GiB VRAM and >=128K context:\n")
    print_models(results)
    print()

    # Example: find all AMD models
    amd_models = find_models(accelerator_vendor="amd")
    print(f"\nFound {len(amd_models)} AMD models:\n")
    print_models(amd_models)
    print()

    # Example: find all remote API models
    remote_models = find_models(deployment_mode="remote")
    print(f"\nFound {len(remote_models)} remote API models:\n")
    print_models(remote_models)

CHAPTER ELEVEN: DOCKER MODEL RUNNER AND LOCAL DEVELOPMENT

Docker Desktop 4.41, released on April 29, 2025, ships with Docker Model

Runner, which brings a capable local LLM development environment to any machine

with a modern GPU or Apple Silicon. It uses llama.cpp as its inference backend

and packages models as OCI artifacts.

To pull and run a model with Docker Model Runner:

# Pull the Gemma 4 E4B model from Docker Hub's model registry.

docker model pull ai/gemma4:e4b-q4_k_m

# Run the model through Model Runner's inference server.

# From the host, the API is served on localhost:12434 once host-side TCP access is enabled.

docker model run ai/gemma4:e4b-q4_k_m

# Test using the OpenAI-compatible API.

curl http://localhost:12434/engines/llama.cpp/v1/chat/completions \

-H "Content-Type: application/json" \

-d '{

"model": "ai/gemma4:e4b-q4_k_m",

"messages": [

{

"role": "user",

"content": "Explain the Mixture-of-Experts architecture in one paragraph."

}

],

"temperature": 0.7,

"max_tokens": 512

}'

FILE: compose.yaml (local development environment)

# A development environment for an AI application that uses a local LLM.
# The model runner service uses Docker Desktop's Model Runner backend,
# which handles GPU acceleration automatically (NVIDIA CUDA on Linux/Windows,
# Metal on macOS Apple Silicon).
services:
  app:
    build: .
    environment:
      # Point the app at the local model runner. From inside a container the
      # runner is reached by hostname alone, without the host-side port 12434.
      # In production (Kubernetes), this is overridden by the cluster endpoint.
      LLM_BASE_URL: http://model-runner.docker.internal/engines/llama.cpp/v1
      LLM_MODEL_ID: ai/gemma4:e4b-q4_k_m
    depends_on:
      - model-runner
    ports:
      - "8080:8080"
  model-runner:
    # The "model" provider type tells Docker Desktop to use the Model Runner
    # instead of pulling a regular container image.
    # This works on any platform Docker Desktop supports:
    # - macOS: Apple Silicon (Metal) or Intel (CPU)
    # - Windows: NVIDIA GPU (CUDA) or CPU
    # - Linux: NVIDIA GPU (CUDA) or CPU
    provider:
      type: model
      options:
        model: ai/gemma4:e4b-q4_k_m

CHAPTER TWELVE: COMPLETE DEPLOYMENT WALKTHROUGH

We will deploy four local models (one on each of three accelerator vendors

plus one CPU-only model) and three remote API proxies, then deploy the MCP

server and verify that an AI agent can discover and use all seven models.

STEP 1: Install prerequisites.
# Create the operator system namespace.
kubectl create namespace llm-operator-system
# Create the AI models namespace.
kubectl create namespace ai-models
# ---- NVIDIA GPU Operator (run only if you have NVIDIA GPUs) ----
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
--namespace gpu-operator \
--create-namespace \
--set driver.enabled=true \
--set toolkit.enabled=true \
--set devicePlugin.enabled=true \
--set dcgmExporter.enabled=true \
--wait
# ---- AMD GPU Operator (run only if you have AMD GPUs) ----
helm repo add amd-gpu-operator https://rocm.github.io/gpu-operator
helm repo update
helm install amd-gpu-operator amd-gpu-operator/gpu-operator \
--namespace amd-gpu-operator \
--create-namespace \
--set devicePlugin.enabled=true \
--set nodeLabeller.enabled=true \
--wait
# ---- Intel Gaudi Base Operator (run only if you have Gaudi cards) ----
helm repo add intel https://intel.github.io/helm-charts
helm repo update
helm install gaudi-base-operator intel/intel-gaudi-base-operator \
--namespace intel-gaudi \
--create-namespace \
--wait
# ---- KEDA ----
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda \
--namespace keda \
--create-namespace \
--wait
# ---- KEDA HTTP add-on (for scale-to-zero with request buffering) ----
helm install keda-http-add-on kedacore/keda-add-ons-http \
--namespace keda \
--wait
# ---- Prometheus stack ----
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--create-namespace \
--set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
--wait
STEP 2: Install the LLMModel CRD and deploy the controller.
# Apply the CRD.
kubectl apply -f config/crd/bases/ai.example.io_llmmodels.yaml
# Apply RBAC.
kubectl apply -f config/rbac/serviceaccount.yaml
kubectl apply -f config/rbac/role.yaml
kubectl apply -f config/rbac/rolebinding.yaml
# Apply Prometheus recording rules.
kubectl apply -f config/monitoring/prometheus-rules.yaml
# Create secrets.
kubectl create secret generic hf-token \
--namespace ai-models \
--from-literal=token=YOUR_HF_TOKEN
kubectl create secret generic openai-api-key \
--namespace ai-models \
--from-literal=apiKey=YOUR_OPENAI_API_KEY
kubectl create secret generic anthropic-api-key \
--namespace ai-models \
--from-literal=apiKey=YOUR_ANTHROPIC_API_KEY
kubectl create secret generic google-api-key \
--namespace ai-models \
--from-literal=apiKey=YOUR_GOOGLE_API_KEY
# Deploy the controller.
kubectl apply -f config/manager/manager.yaml
STEP 3: Apply the LLMModel resources.
# Local model on AMD MI300X.
kubectl apply -f config/samples/gemma4-26b-amd.yaml
# Local model on Intel Gaudi 3.
kubectl apply -f config/samples/qwen36-35b-gaudi.yaml
# Local model on NVIDIA H100.
kubectl apply -f config/samples/gemma4-31b-nvidia.yaml
# CPU-only model (no GPU required).
kubectl apply -f config/samples/qwen35-9b-cpu.yaml
# Remote API proxies.
kubectl apply -f config/samples/gpt55-remote.yaml
kubectl apply -f config/samples/claude-opus47-remote.yaml
kubectl apply -f config/samples/gemini31-remote.yaml
# Apply KEDA ScaledObjects.
kubectl apply -f config/keda/
STEP 4: Deploy the MCP server.
kubectl apply -f config/mcp/deployment.yaml
STEP 5: Watch the models come online.
# Watch all LLMModel resources.
# Local models: Pending -> Downloading -> Starting -> Ready
# Remote proxies: immediately -> Proxying
kubectl get llmmodels -n ai-models -w
# Check logs for the AMD model.
kubectl logs -n ai-models \
-l "ai.example.io/model-name=gemma4-26b-amd" \
-c inference-server \
--follow
# Check logs for the Gaudi model.
kubectl logs -n ai-models \
-l "ai.example.io/model-name=qwen36-35b-gaudi" \
-c inference-server \
--follow
STEP 6: Test the models directly.
# Port-forward the AMD model.
kubectl port-forward -n ai-models svc/gemma4-26b-amd 8001:8000 &
curl http://localhost:8001/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "google/gemma-4-26b-a4b-it",
"messages": [{"role": "user", "content": "What is 2+2?"}],
"max_tokens": 50
}'
# Port-forward the Gaudi model.
kubectl port-forward -n ai-models svc/qwen36-35b-gaudi 8002:8000 &
curl http://localhost:8002/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3.6-35B-A3B",
"messages": [
{
"role": "system",
"content": "Think step by step before answering. Use <think>...</think> tags."
},
{
"role": "user",
"content": "Prove that the square root of 2 is irrational."
}
],
"max_tokens": 1024
}'
STEP 7: Test the MCP server.
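kubectl port-forward -n ai-models svc/mcp-server 3000:3000 &
# The Streamable HTTP transport is session-based. Initialize first, capture
# the Mcp-Session-Id response header, and send it with every later request.
# Each POST must also accept both JSON and SSE responses; replies may arrive
# as SSE "data:" lines.
SESSION_ID=$(curl -si -X POST http://localhost:3000/mcp \
-H "Content-Type: application/json" \
-H "Accept: application/json, text/event-stream" \
-d '{
"jsonrpc": "2.0",
"id": 0,
"method": "initialize",
"params": {
"protocolVersion": "2025-11-25",
"capabilities": {},
"clientInfo": {"name": "curl", "version": "1.0.0"}
}
}' | grep -i '^mcp-session-id:' | awk '{print $2}' | tr -d '\r')
# Complete the handshake.
curl -X POST http://localhost:3000/mcp \
-H "Content-Type: application/json" \
-H "Accept: application/json, text/event-stream" \
-H "Mcp-Session-Id: $SESSION_ID" \
-d '{"jsonrpc": "2.0", "method": "notifications/initialized"}'
# List all available tools.
curl -X POST http://localhost:3000/mcp \
-H "Content-Type: application/json" \
-H "Accept: application/json, text/event-stream" \
-H "Mcp-Session-Id: $SESSION_ID" \
-d '{
"jsonrpc": "2.0",
"id": 1,
"method": "tools/list",
"params": {}
}'
# Discover all models via the meta-tool.
curl -X POST http://localhost:3000/mcp \
-H "Content-Type: application/json" \
-H "Accept: application/json, text/event-stream" \
-H "Mcp-Session-Id: $SESSION_ID" \
-d '{
"jsonrpc": "2.0",
"id": 2,
"method": "tools/call",
"params": {
"name": "list_available_models",
"arguments": {}
}
}'
# Call the Gaudi reasoning model with thinking mode.
curl -X POST http://localhost:3000/mcp \
-H "Content-Type: application/json" \
-H "Accept: application/json, text/event-stream" \
-H "Mcp-Session-Id: $SESSION_ID" \
-d '{
"jsonrpc": "2.0",
"id": 3,
"method": "tools/call",
"params": {
"name": "qwen36_reasoning_gaudi",
"arguments": {
"messages": [
{
"role": "user",
"content": "What is the time complexity of merge sort and why?"
}
],
"thinking": true,
"max_tokens": 512
}
}
}'
# Call the GPT-5.5 remote proxy.
curl -X POST http://localhost:3000/mcp \
-H "Content-Type: application/json" \
-H "Accept: application/json, text/event-stream" \
-H "Mcp-Session-Id: $SESSION_ID" \
-d '{
"jsonrpc": "2.0",
"id": 4,
"method": "tools/call",
"params": {
"name": "gpt55_frontier",
"arguments": {
"messages": [
{
"role": "user",
"content": "Write a Kubernetes operator in Go that manages Redis clusters."
}
],
"temperature": 0.3,
"max_tokens": 2048
}
}
}'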
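
The JSON-RPC envelopes used in this step are regular enough to generate in code. The sketch below is a minimal illustration: it builds a tools/call request and extracts JSON-RPC messages from a response body that arrives as Server-Sent Events. The helper names tools_call and parse_sse_messages are assumptions made for this example, and the parser assumes each message fits on a single "data:" line.

```python
import json


def tools_call(name: str, arguments: dict, request_id: int) -> dict:
    """Build a JSON-RPC 2.0 envelope for the MCP tools/call method."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


def parse_sse_messages(body: str) -> list[dict]:
    """Extract JSON-RPC messages from an SSE response body.

    Assumes one complete JSON document per 'data:' line; other SSE
    fields such as 'event:' and 'id:' are ignored.
    """
    messages = []
    for line in body.splitlines():
        if line.startswith("data:"):
            payload = line[len("data:"):].strip()
            if payload:
                messages.append(json.loads(payload))
    return messages


if __name__ == "__main__":
    request = tools_call("list_available_models", {}, request_id=2)
    print(json.dumps(request, indent=2))
```

A test harness can serialize the result of tools_call as the POST body and feed the raw response text to parse_sse_messages before inspecting the result field.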
STEP 8: Run the registry query tool.
# Install the Kubernetes Python client.
pip install kubernetes
# Query the registry from outside the cluster (uses kubeconfig).
python3 tools/query_models.py

CHAPTER THIRTEEN: OPERATIONAL CONSIDERATIONS AND PRODUCTION HARDENING

MODEL LOADING TIME AND COLD STARTS

Large models take significant time to load. Gemma 4 31B in INT4 format takes

approximately 3–5 minutes on an H100. Qwen3.6-35B-A3B in FP8 on Gaudi 3 takes

approximately 5–10 minutes. This means that if you scale to zero and receive a

request, the user will wait for the full model loading time. Use the KEDA HTTP

add-on to buffer requests during cold starts, and set minReplicas to 1 for

interactive applications where cold-start latency is unacceptable.

KV CACHE MEMORY

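The KV cache stores attention keys and values for every token in a request's

context. For a model with a 256K token context window, the KV cache for a

single long request can be several gigabytes. vLLM's PagedAttention algorithm

manages KV cache memory efficiently, but you must still account for it when

sizing VRAM. The gpu-memory-utilization setting caps the fraction of each

accelerator's memory that vLLM may claim; whatever is left of that budget after

weights and activation workspace becomes KV cache. As a rule of thumb, choose a

model and quantization that leave 20–30% of total VRAM for KV cache, and set

gpu-memory-utilization to 0.90 rather than 1.0 so that the remaining 10% stays

free for the driver and runtime.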
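
To make the sizing concrete, the per-request KV cache can be estimated from the model's shape. The sketch below uses the usual formula for a transformer with grouped-query attention: one key tensor and one value tensor per layer. The layer count, KV-head count, and head dimension in the example are illustrative assumptions, not the published shape of any model discussed in this guide.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    num_tokens: int,
    bytes_per_element: int = 2,
) -> int:
    """Estimate the KV cache size in bytes for one sequence.

    The leading factor of 2 covers the separate key and value tensors
    stored per layer. bytes_per_element is 2 for FP16/BF16 and 1 for
    an FP8 KV cache.
    """
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_element


def to_gib(num_bytes: int) -> float:
    """Convert bytes to GiB."""
    return num_bytes / 2**30


if __name__ == "__main__":
    # Hypothetical shape: 48 layers, 8 KV heads, head dimension 128, FP16 cache.
    shape = dict(num_layers=48, num_kv_heads=8, head_dim=128)
    per_token = kv_cache_bytes(num_tokens=1, **shape)
    print(f"Per token: {per_token / 1024:.0f} KiB")
    for tokens in (32 * 1024, 256 * 1024):
        size = to_gib(kv_cache_bytes(num_tokens=tokens, **shape))
        print(f"{tokens} tokens: {size:.1f} GiB")
```

Under these assumed numbers a 32K-token request already needs about 6 GiB, and the figure grows linearly with context length, which is why long-context deployments often pair a smaller KV dtype with a lower concurrency limit.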

MULTI-ACCELERATOR TENSOR PARALLELISM

For models requiring multiple accelerators, vLLM uses tensor parallelism to

split the model across cards. On NVIDIA hardware the collectives run through

NCCL over NVLink/NVSwitch. On AMD hardware they use RCCL (ROCm Collective

Communications Library). On Intel Gaudi they use HCCL (Habana Collective

Communications Library). In all three cases the controller sets HostIPC and

HostNetwork on the pod whenever acceleratorCount > 1, so that the cards can

communicate efficiently.

VENDOR-SPECIFIC QUANTIZATION NOTES

NVIDIA: AWQ, GPTQ, FP8, INT8, INT4, and QAT-INT4 are all supported on H100+.

AMD: AWQ is supported on ROCm as of vLLM 0.8+. MXFP4 and MXFP6 require MI350X

or later (CDNA 4 architecture). SqueezeLLM has been ported to ROCm.

Intel Gaudi: FP8 is natively supported by Gaudi 3 hardware. The Intel vLLM

fork includes custom graph caching for improved performance with FP8.

SECRETS MANAGEMENT

API keys for remote models (GPT-5.5, Claude Opus 4.7, Gemini 3.1) are

sensitive credentials. For production deployments, use an external secrets

manager like HashiCorp Vault through the Secrets Store CSI Driver and its

Vault provider:

apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: openai-api-key  # assumed name
  namespace: ai-models
spec:
  provider: vault
  parameters:
    vaultAddress: https://vault.example.com
    roleName: llm-operator
    objects: |
      - objectName: "openai-api-key"
        secretPath: "secret/data/ai/openai"
        secretKey: "api_key"
  secretObjects:
    - secretName: openai-api-key
      type: Opaque
      data:
        - objectName: openai-api-key
          key: apiKey

NETWORK SECURITY

The vLLM API server has no built-in authentication. Use Kubernetes Network

Policies to restrict which pods can reach the inference services:

# File: config/security/network-policy.yaml

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: llm-inference-ingress  # assumed name
  namespace: ai-models
spec:
  # Apply to all inference server pods across all vendors: every pod that
  # carries the model-name label, whatever its value.
  podSelector:
    matchExpressions:
      - key: ai.example.io/model-name
        operator: Exists
  policyTypes:
    - Ingress
  ingress:
    # Allow traffic from explicitly authorized clients in any namespace.
    - from:
        - namespaceSelector: {}
          podSelector:
            matchLabels:
              ai.example.io/llm-client: "true"
      ports:
        - protocol: TCP
          port: 8000
    # Always allow traffic from the MCP server.
    - from:
        - podSelector:
            matchLabels:
              app: mcp-server
      ports:
        - protocol: TCP
          port: 8000
    # Always allow traffic from the controller (for health checks).
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: llm-operator-system
      ports:
        - protocol: TCP
          port: 8000

HETEROGENEOUS CLUSTER CONSIDERATIONS

In a cluster with mixed accelerator types (some nodes with NVIDIA GPUs, some

with AMD GPUs, some with Intel Gaudi cards, and some CPU-only), the controller

ensures that each LLMModel pod lands on the correct node type via:

1. NodeSelector: The vendor-specific label set by the GPU operator

(nvidia.com/gpu.present=true, amd.com/gpu.present=true, habana.ai/gaudi=true)

2. Tolerations: The vendor-specific taint applied to GPU nodes

3. Resource requests: The vendor-specific resource key

(nvidia.com/gpu, amd.com/gpu, habana.ai/gaudi)

These three mechanisms together guarantee that an AMD model never lands on an

NVIDIA node and vice versa, even in a heterogeneous cluster.

For clusters using the AMD GPU DRA Driver (available in beta as of early 2026),

the scheduling can be made even more precise. The DRA driver publishes

ResourceSlices that expose structured attributes of AMD GPU devices (model,

PCIe root, memory, etc.), allowing workloads to request GPUs based on specific

characteristics such as minimum HBM capacity.

CHAPTER FOURTEEN: THE BIGGER PICTURE

We have covered a lot of ground. Let us step back and look at what we have

built and why it matters.

We started with the observation that the LLM landscape as of May 2026 is

radically different from what it was eighteen months ago. The dominant models

are MoE architectures. Context windows have grown from thousands to millions of

tokens. Quantization-Aware Training has made it possible to run 31-billion-

parameter multimodal models on a single 24 GiB GPU. And the accelerator market

has diversified: AMD MI300X, MI350X, and the forthcoming MI400 series are

serious alternatives to NVIDIA for large-model inference. Intel Gaudi 3 offers

a cost-effective option with native FP8 support and strong LLM serving

frameworks. The frontier remote API models — GPT-5.5, Claude Opus 4.7, Gemini

3.1 Pro — have context windows of 1 million tokens and capabilities that no

locally deployable model yet matches.

We designed a Custom Resource Definition that captures the full richness of this

landscape. The `acceleratorVendor` field is the key innovation: it makes the

entire operator hardware-agnostic. Adding support for a new accelerator vendor

requires editing exactly one file — controllers/hardware.go — and adding a new

case to the switch statement. The rest of the controller, the CRD, the MCP

server, and the query tooling all remain unchanged.

We built a controller that reconciles this CRD into real Kubernetes objects,

with all vendor-specific decisions isolated in the AcceleratorConfig struct.

The controller handles the full lifecycle: PVC creation for model caching,

init container for weight download, inference Deployment with correct resource

keys and tolerations, Service for stable DNS, KEDA ScaledObject for intelligent

autoscaling, and MCP tool registration for AI agent integration.

We wired up KEDA to scale on inference queue depth rather than CPU or GPU

utilization, since queue depth tracks LLM serving load far more directly than

either. Because vLLM emits the same Prometheus metrics regardless of the

underlying accelerator vendor, the KEDA configuration is completely vendor-

agnostic.
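The arithmetic behind queue-depth scaling is worth seeing once. KEDA hands the metric to the Horizontal Pod Autoscaler, which for an average-value target computes roughly the total metric divided by the per-replica target, rounded up and clamped to the replica bounds. The sketch below reproduces only that core step. It ignores the HPA tolerance band, stabilization windows, and KEDA's cooldown period, and the threshold and replica bounds used here are illustrative assumptions rather than the values from this guide's ScaledObjects. The queue depth would typically come from a vLLM metric such as `vllm:num_requests_waiting`.

```python
import math


def desired_replicas(waiting_requests: float, target_per_replica: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Approximate the replica count for an average-value queue-depth target."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    # Total queued requests divided by how many each replica should hold.
    raw = math.ceil(waiting_requests / target_per_replica)
    # Clamp to the bounds configured on the scaler.
    return max(min_replicas, min(max_replicas, raw))


# Hypothetical settings: 5 queued requests per replica, 1 to 4 replicas.
for queued in (0, 12, 100):
    print(queued, desired_replicas(queued, 5, 1, 4))
```

Nothing in this calculation refers to the accelerator, which is why the same trigger definition can serve NVIDIA, AMD, and Gaudi deployments.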

We built an MCP server that watches the LLMModel registry and exposes each

model as a tool that AI agents can discover and call. The server supports

dynamic tool registration via an admin REST API that the controller calls

whenever a model's status changes, and it sends MCP

notifications/tools/list_changed messages to connected clients when the tool

list changes.
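The notification flow can be sketched in a few lines. In MCP, a server that declares the `listChanged` capability for tools tells clients that the list has changed by sending a JSON-RPC notification with the method `notifications/tools/list_changed`; clients then call `tools/list` again. The sketch below is written in Python for brevity and is not the TypeScript server from this guide. The `ToolRegistry` class, the `query_<model>` naming scheme, and the `send` callback are all hypothetical.

```python
import json

# JSON-RPC notification: no "id" field, so no response is expected.
LIST_CHANGED = {"jsonrpc": "2.0", "method": "notifications/tools/list_changed"}


class ToolRegistry:
    """Tracks one tool per ready model and notifies when the tool names change."""

    def __init__(self, send):
        self._tools = {}   # tool name -> inference endpoint
        self._send = send  # callable that delivers a serialized message

    def sync(self, ready_models: dict) -> bool:
        """Replace the tool set; return True if a notification was sent."""
        # HYPOTHETICAL naming scheme: one "query_<model>" tool per model.
        new_tools = {f"query_{name}": url for name, url in ready_models.items()}
        # Only the set of tool names matters to clients listing tools.
        changed = new_tools.keys() != self._tools.keys()
        self._tools = new_tools
        if changed:
            self._send(json.dumps(LIST_CHANGED))
        return changed


sent = []
registry = ToolRegistry(sent.append)
registry.sync({"gemma4-26b": "http://gemma4-26b:8000"})  # new tool: notifies
registry.sync({"gemma4-26b": "http://gemma4-26b:8000"})  # unchanged: silent
print(sent)
```

Sending the notification only when the set of tool names actually changes keeps connected clients from re-fetching the list on every reconcile loop.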

We showed how Docker Model Runner bridges the gap between local development and

cluster deployment, giving every developer a local LLM environment on any

platform — NVIDIA GPU, AMD GPU, or Apple Silicon — that uses the same API

surface as the production cluster.

The result is an AI platform that is self-describing, self-scaling, self-

healing, and hardware-agnostic. Models are first-class Kubernetes citizens.

They can be queried, filtered, and selected using standard Kubernetes tooling

regardless of whether they run on NVIDIA, AMD, Intel Gaudi, or CPU. They scale

automatically based on demand. They expose themselves to AI agents via a

standard protocol. And they handle both local and remote models uniformly.

This is what it looks like when you stop treating LLMs as external services

and start treating them as infrastructure — infrastructure that works with

whatever accelerator hardware you have or can afford.

Wednesday, May 20, 2026

LARGE LANGUAGE MODELS ON KUBERNETES: A COMPLETE GUIDE TO DEPLOYING, MANAGING, AND QUERYING LOCAL AND REMOTE AI MODELS IN A CLOUD-NATIVE CLUSTER

