CHAPTER ONE: WHY THIS MATTERS, AND WHY NOW
There is a moment in the life of every engineering team that adopts large
language models when the honeymoon ends. It usually happens around the third
month. The team has been happily calling an external API, watching tokens flow
in and out, and marvelling at what the model can do. Then the bill arrives.
Then the legal team asks where the customer data is going. Then the latency
spikes because the API is rate-limited. Then someone asks, "Could we just run
this ourselves?"
The answer is a resounding yes, and the tooling to do it
well has finally caught up with the ambition. Kubernetes 1.33, released on
April 23, 2025, extended Dynamic Resource Allocation (DRA), which has been in
beta since 1.32, giving the scheduler a far richer understanding of accelerator
hardware from any vendor. Docker Desktop 4.41, released on April 29, 2025,
ships with
a Model Runner that understands OCI-packaged model artifacts and exposes an
OpenAI-compatible API from your laptop. The vLLM project has matured into a
production-grade inference server with native support for NVIDIA CUDA, AMD
ROCm, and Intel Gaudi. KEDA gives us event-driven autoscaling that reacts to
the depth of an inference queue rather than to CPU utilization. And the Model
Context Protocol (MCP), whose latest stable specification was published in
November 2025, gives us a standard way for AI agents to discover and call
tools — including tools that live inside a Kubernetes cluster.
This guide walks through the entire stack. We will design a Custom Resource
Definition that describes an LLM deployment with enough richness to let a
scheduler make intelligent decisions about where and how to run it — on any
accelerator vendor. We will build a controller that reconciles that resource
into real Kubernetes objects. We will configure vLLM as the inference engine
for GPU deployments and llama.cpp/Ollama for CPU deployments, wire up KEDA for
autoscaling, and expose the whole thing through an MCP server so that AI agents
can discover and use models as tools. Along the way we will handle the important
asymmetry between models you can run locally and frontier models that only exist
as remote REST APIs — because any real-world deployment has to deal with both.
Before we write a single line of YAML, we need to understand what we are
actually deploying.
CHAPTER TWO: THE MODEL LANDSCAPE IN MAY 2026
The LLM landscape has undergone a fundamental architectural shift over the past
eighteen months. The dominant pattern is no longer the dense transformer, where
every parameter participates in every forward pass. Instead, the leading models
use Mixture-of-Experts (MoE) architectures, where a router selects a small
subset of "expert" sub-networks for each token. This means that a model can
have 675 billion total parameters but activate only 41 billion of them per
token, dramatically reducing the compute cost of inference while retaining the
capacity that comes from a large parameter count.
This distinction matters enormously for Kubernetes scheduling, because the
relevant resource constraint is not the total parameter count but the amount of
accelerator memory required to hold all the expert weights simultaneously. Even
if only 41 billion parameters are active per token, all the expert weights must
reside in HBM/VRAM so the router can select among them.
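The split between memory and compute can be made concrete with a little arithmetic. The sketch below is illustrative only: it counts the weights alone (no KV cache, activations, or runtime overhead), and the helper names are invented for this guide rather than taken from any library.

```python
def weight_memory_gb(total_params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in decimal GB."""
    # One billion parameters at 8 bits each occupy one GB.
    return total_params_billions * bits_per_weight / 8


def active_fraction(total_params_billions: float, active_params_billions: float) -> float:
    """Share of the weights that take part in computing a single token."""
    return active_params_billions / total_params_billions


# The 675B-total / 41B-active example from the text above.
total_b, active_b = 675, 41
print(f"FP8 weights:  {weight_memory_gb(total_b, 8):.0f} GB")
print(f"BF16 weights: {weight_memory_gb(total_b, 16):.0f} GB")
print(f"active share: {active_fraction(total_b, active_b):.1%}")
```

Memory is sized by the first function and per-token compute by the second, which is why the scheduler has to track both numbers separately.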
A second major trend is the emergence of thinking models, which perform chain-
of-thought reasoning internally before producing a final answer. Some models,
like the Qwen3 and Qwen3.5/3.6 families, support a hybrid mode where a single
deployed model can switch between fast non-thinking responses and slower, deeper
reasoning responses depending on a flag in the system prompt.
A third trend is the pairing of two efficiency techniques: Quantization-Aware
Training (QAT) and Multi-Token Prediction (MTP). Google's Gemma 4 family uses
MTP drafters, small companion checkpoints that accelerate token generation by
up to 3x without quality loss. Gemma 3's QAT-INT4 format demonstrated that
training-time quantization awareness produces better quality than post-training
quantization at the same memory footprint.
Let us survey the major model families that are relevant to a Kubernetes
deployment as of May 2026, with the specific numbers that a scheduler needs.
GOOGLE — GEMMA 4 FAMILY (Released April 2, 2026)
Gemma 4 is released under Apache 2.0 and is built from the same research as
Gemini 3. All models natively process text and images; the E2B and E4B variants
also support audio input. All models support native function calling and
structured JSON output. MTP drafters are available for all sizes, offering up
to 3x speedup without significant VRAM increase.
Gemma 4 E2B has approximately 2.3 billion effective parameters. It targets
phones, edge devices, and low-VRAM testing. In Q4_K_M quantization it requires
4–6 GB of VRAM. Full BF16 requires 15 GB. Context window: 128K tokens.
Gemma 4 E4B has approximately 4.5 billion effective parameters (around 8 billion
including embeddings). It targets high-end laptops and small servers. In Q4_K_M
it requires about 8 GB of VRAM; in Q8_0 it needs 12–16 GB. Full BF16: 15 GB.
Context window: 128K tokens.
Gemma 4 26B-A4B is a Mixture-of-Experts model with 25.2 billion total parameters
and approximately 3.8 billion active parameters per token. It targets consumer
GPUs and cost-efficient single-GPU server inference. In 4-bit quantization
(GGUF or AWQ) it requires 14–16 GB of VRAM; on an RTX 4090 (24 GB) there is
comfortable headroom for KV cache. In FP8 it needs about 30 GB; in BF16 about
60 GB. Context window: 256K tokens.
Gemma 4 31B is a dense model with 30.7 billion total parameters. It targets
workstations where maximum quality is paramount. In INT4 it requires a minimum
of 16 GB of VRAM; Q4_K_M is comfortable at 24 GB; Q5_K_M and above need 32 GB+;
INT8 needs about 36 GB; BF16 needs about 62 GB (requires A100 80 GB or H100).
Context window: 256K tokens.
GOOGLE — GEMINI 3.1 (Remote API, Released February 19, 2026)
Gemini 3.1 Pro is not available as open weights. It is accessed via the Google
AI API and Vertex AI. Context window: 1 million tokens. Max output: 65,536
tokens. Pricing: $2 per million input tokens, $12 per million output tokens.
Supports up to 900 images per prompt, 8.4 hours of audio, and 1 hour of video.
Variants: Gemini 3.1 Pro Preview (Feb 19), Flash-Lite Preview (Mar 3),
Flash-Lite GA (May 7, 2026).
ALIBABA — QWEN3.5 FAMILY (Released February 16, 2026)
Qwen3.5 is open-weights under Apache 2.0. It uses both dense and MoE
architectures. The MoE models are designated with an "A" suffix indicating
active parameters (e.g., 397B-A17B has 397 billion total, 17 billion active).
Available open-weights sizes: 0.8B, 2B, 4B, 9B, 27B, 35B-A3B, 122B-A10B,
and 397B-A17B. Native context window: 262,144 tokens, extensible to 1,010,000
tokens. Supports 201 languages. Hybrid thinking/non-thinking mode.
The 9B model fits on a single 24 GB GPU in FP16 with ample KV cache headroom.
The 35B-A3B MoE model runs in Q4 quantization on a 24 GB card (approximately
21.5 GB). The 27B dense model runs on 22 GB of RAM/VRAM. For the 35B-A3B at
Q8_0 with a 65K context window in llama.cpp, VRAM usage is approximately 21.7 GB.
ALIBABA — QWEN3.6 FAMILY (Released April 2026)
Qwen3.6-27B and Qwen3.6-35B-A3B are open-weight models released under
Apache 2.0. Qwen3.6-35B-A3B is a fully open-source MoE model with 35 billion
total parameters and 3 billion active parameters per token, outperforming its
Qwen3.5 predecessor and rivalling larger dense models. Context window: 256K
tokens, extensible to 1,010,000 tokens. The 27B model runs on 18 GB of VRAM;
the 35B-A3B runs on 22 GB. The 35B-A3B achieves 100+ tokens per second on
consumer hardware due to its low active parameter count.
Qwen3.6-Plus is a proprietary hosted model with a 1-million-token context
window and up to 65,536 output tokens. It is not available as open weights.
META — LLAMA 4 FAMILY (Released April 2025, Llama 4 Community License)
Llama 4 Scout has 109 billion total parameters and activates 17 billion per
token across 16 experts. Context window: 10 million tokens. In 4-bit
quantization it requires approximately 55–61 GB of VRAM, fitting on a single
H100 80 GB or AMD MI300X (192 GB). For context windows beyond 130K tokens,
multiple GPUs are recommended.
Llama 4 Maverick has 400 billion total parameters across 128 experts, also
activating 17 billion per token. Context window: 1 million tokens. In 4-bit
quantization it requires approximately 224 GB of VRAM, requiring multi-GPU.
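Turning a memory requirement into an accelerator count is simple division, rounded up. The sketch below assumes that only about 90% of each card is usable for the model (mirroring a typical gpu-memory-utilization setting) and rounds the count up to a power of two, a common but not universal choice for tensor parallelism. Both assumptions, and the function names, are illustrative.

```python
import math


def min_gpu_count(required_gb: float, gpu_memory_gb: float,
                  usable_fraction: float = 0.9) -> int:
    """Smallest number of identical GPUs whose usable memory covers the requirement."""
    usable_per_gpu = gpu_memory_gb * usable_fraction
    return math.ceil(required_gb / usable_per_gpu)


def next_power_of_two(n: int) -> int:
    """Round a GPU count up to 1, 2, 4, 8, ... for tensor parallelism."""
    return 1 if n <= 1 else 1 << (n - 1).bit_length()


# Llama 4 Maverick at 4-bit: roughly 224 GB according to the text above.
for name, memory_gb in (("H100 80 GB", 80), ("MI300X 192 GB", 192)):
    count = min_gpu_count(224, memory_gb)
    print(f"{name}: at least {count}, tensor-parallel size {next_power_of_two(count)}")
```

The same calculation puts Llama 4 Scout at roughly 61 GB on a single H100 80 GB, in line with the single-GPU claim above.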
MISTRAL AI — MISTRAL MEDIUM 3.5 (Released April 2026)
Mistral Medium 3.5 is a dense 128-billion-parameter model released under a
Modified MIT License. Context window: 256K tokens. Multimodal: text and image
input, text output. Supports configurable reasoning mode. In 4-bit quantization
it requires approximately 70 GB of VRAM, accessible on a Mac Studio with 128 GB
of unified memory or a multi-GPU server. Excels in instruction-following,
reasoning, coding, long-context understanding, tool use, and agentic workflows.
MISTRAL AI — MISTRAL LARGE 3 (Released December 2025)
Mistral Large 3 is a sparse MoE model with 675 billion total parameters and
approximately 41 billion active parameters per token, plus a 2.5 billion
parameter integrated vision encoder. Released under Apache 2.0. Context window:
256K tokens. Multimodal: native vision capabilities. Supports 40+ languages,
function calling, and structured JSON output.
For self-hosting in FP8 precision for long-context workloads up to 256K tokens,
B200 or H200 GPUs are recommended. For contexts under 64K tokens, NVFP4
precision can be used on A100s and H100s. The MoE architecture means that while
all 675 billion parameters must reside in memory, only 41 billion take part in
computing each token, making inference cheaper than for a dense 675B model.
In Q4 quantization, approximately 200 GB of VRAM is required.
MOONSHOT AI — KIMI K2.5 (Released January 27, 2026)
Kimi K2.5 is open-source under a Modified MIT License. Architecture: MoE with
1 trillion total parameters and 32 billion active parameters per token. 384
experts with 8 selected per token plus 1 shared expert. Native multimodal,
trained on 15 trillion mixed visual and text tokens. Context window: 256K tokens.
The full model FP8 checkpoint is approximately 630 GB and typically requires
at least 4x H200 GPUs. Quantized versions (e.g., 1.8-bit) can run on a single
24 GB GPU with CPU offload for MoE layers, requiring around 256 GB of system
RAM for approximately 10 tokens per second. Features: Agent Swarm technology
coordinating up to 100 specialized AI agents, four operational modes (Instant,
Thinking, Agent, Agent Swarm).
ZHIPU AI — GLM-5.1 (Released April 7, 2026)
GLM-5.1 is open-source under the MIT License. Architecture: Hybrid MoE
(GlmMoeDSA) with 754 billion total parameters and 40 billion active parameters
per token. 256 routed experts plus 1 shared expert, with 8+1 active per token.
Context window: 200K tokens. Designed for agentic engineering and long-horizon
coding tasks.
Minimal deployment requires 1x NVIDIA HGX B200 (8x B200 GPU system). FP8
deployments with vLLM on multi-GPU rigs require approximately 860 GB or more
across GPUs (e.g., 8x H200). For CPU-only setups with quantized GGUF weights,
approximately 180–256 GB of system RAM is needed for 1–2 bit quantizations. A
24 GB GPU plus 256 GB of RAM can work with 2-bit variants using MoE offloading.
DEEPSEEK — V4 FAMILY (2026)
DeepSeek-V4-Flash has 284 billion total parameters and 13 billion active
parameters per token. Context window: 1 million tokens. MIT-licensed open
weights. Native FP4+FP8 requires approximately 170–175 GB of total VRAM,
fitting on 2x H200 or 4x A100 80 GB. Community INT4 quantization needs
approximately 90–100 GB, potentially on 4x RTX 4090. Community GGUF/GPTQ at
approximately 80 GB VRAM might be feasible on 1x RTX 5090 or 2x RTX 4090 with
CPU offload.
DeepSeek-V4-Pro has approximately 1.6 trillion total parameters and 49 billion
active parameters per token. MIT-licensed open weights. Full precision
(FP8+FP4 Mixed) requires approximately 865 GB of VRAM, recommending 16x H100
80 GB or equivalent. Even Q4 Pro (approximately 430 GB of weights) does not fit
on an 8x H100 80 GB box once KV cache and overhead are added. This is a
datacenter cluster job.
OPENAI — GPT-5.5 (Remote API, Released April 23, 2026)
GPT-5.5 is not available as open weights. API context window: 1 million tokens.
Max output: 128K tokens. Knowledge cutoff: December 2025. Pricing: $5 per
million input tokens, $30 per million output tokens (90% cached-input discount).
GPT-5.5 Pro: $30/$180 per million tokens. GPT-5.5 Instant (released May 5,
2026) is the default for all ChatGPT users. Offers five reasoning levels: none,
low, medium, high, xhigh. Supports text and image inputs.
ANTHROPIC — CLAUDE OPUS 4.7 (Remote API, Released April 16, 2026)
Claude Opus 4.7 is not available as open weights. Context window: 1 million
tokens. Max output: 128K tokens. Pricing: $5 per million input tokens, $25 per
million output tokens. Improved vision with higher-resolution image support
(up to 2576px / 3.75MP). New "xhigh" effort level for finer control over
reasoning and latency. Available via API, Amazon Bedrock, Google Cloud Vertex
AI, and Microsoft Foundry.
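For the remote models the relevant budget is money rather than VRAM. The sketch below estimates spend from the list prices quoted in this chapter; it ignores cached-input discounts and any tiered pricing, and the dictionary keys are shorthand labels invented here, not official API model names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_per_million: float   # USD per million input tokens
    output_per_million: float  # USD per million output tokens


# List prices as quoted in this chapter; keys are illustrative labels.
PRICES = {
    "gemini-3.1-pro": Pricing(2.0, 12.0),
    "gpt-5.5": Pricing(5.0, 30.0),
    "claude-opus-4.7": Pricing(5.0, 25.0),
}


def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated cost of a workload at list price, without discounts."""
    price = PRICES[model]
    return (input_tokens * price.input_per_million
            + output_tokens * price.output_per_million) / 1_000_000


# A hypothetical day of traffic: 10M input tokens and 2M output tokens.
for model in PRICES:
    print(f"{model}: ${cost_usd(model, 10_000_000, 2_000_000):.2f} per day")
```

Numbers like these are what a team weighs against the cost of running an open-weights model on its own accelerators.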
THE ACCELERATOR LANDSCAPE IN MAY 2026
NVIDIA remains the dominant data-center accelerator vendor. The H100 80 GB
(HBM3) and H200 141 GB (HBM3e) are the workhorses of production LLM inference.
The B200 with 192 GB of HBM3e and the HGX B200 system (8x B200 = 1536 GB
aggregate) represent the frontier for the largest models.
AMD Instinct MI300X offers 192 GB of HBM3 memory with 5.3 TB/s bandwidth per
card, making it exceptionally well-suited for large model inference where VRAM
is the primary constraint. The MI350X (CDNA 4 architecture, launched in 2025)
features 288 GB of HBM3E and 8 TB/s bandwidth, with AMD claiming up to 35x AI
inference performance improvement over the MI300 generation. The MI400 series
(anticipated 2026) is expected to feature 432 GB of HBM4 and over 19.6 TB/s
bandwidth. As of mid-January 2026, 93% of vLLM AMD test groups were passing,
and official ROCm-enabled vLLM Docker images have been available since
January 2026.
Intel Gaudi 3 (launched 2024, 5nm process) features 128 GB of HBM2e with
3.7 TB/s bandwidth. Intel maintains an optimized fork of vLLM for Gaudi with
Paged KV cache, custom Paged Attention, tensor parallelism, and FP8 quantization
support. Text Generation Inference (TGI) also has native Gaudi support. Gaudi 3
supports DeepSeek architecture since Intel Gaudi software release 1.21.0.
Apple Silicon (M3 Ultra: 512 GB unified memory) excels at local and edge LLM
inference via llama.cpp and the MLX framework. It is not yet suitable for large-
scale Kubernetes data-center deployments due to GPU support limitations in VMs
and containers, but it is the best option for on-premise developer workstations
and edge nodes running llama.cpp or Ollama.
THE VRAM REFERENCE TABLE (MAY 2026)
The following table summarises the memory requirements that our Kubernetes
scheduler will need to reason about. Q4 refers to 4-bit quantization (GGUF
Q4_K_M, AWQ, or GPTQ). QAT-INT4 refers to Google's Quantization-Aware Training
format. "Active" is the per-token active parameter count for MoE models.
Model                      Total    Active   Arch    Q4 VRAM    BF16 VRAM
--------------------------------------------------------------------------
Gemma 4 E2B                2.3B     2.3B     Dense   ~4 GB      ~15 GB
Gemma 4 E4B                4.5B     4.5B     Dense   ~8 GB      ~15 GB
Qwen3.5-4B                 4B       4B       Dense   ~3 GB      ~9 GB
Qwen3.5-9B                 9B       9B       Dense   ~6 GB      ~18 GB
Gemma 4 26B-A4B (MoE)      25.2B    3.8B     MoE     ~14 GB     ~60 GB
Qwen3.5-27B                27B      27B      Dense   ~16 GB     ~54 GB
Qwen3.6-27B                27B      27B      Dense   ~18 GB     ~54 GB
Gemma 4 31B (Dense)        30.7B    30.7B    Dense   ~16 GB     ~62 GB
Qwen3.5-35B-A3B (MoE)      35B      3B       MoE     ~21 GB     ~70 GB
Qwen3.6-35B-A3B (MoE)      35B      3B       MoE     ~22 GB     ~70 GB
Llama 4 Scout (MoE)        109B     17B      MoE     ~55 GB     ~220 GB
Mistral Medium 3.5         128B     128B     Dense   ~70 GB     ~256 GB
DeepSeek-V4-Flash (MoE)    284B     13B      MoE     ~80 GB     ~570 GB
Kimi K2.5 (MoE)            1000B    32B      MoE     ~250 GB    ~2000 GB
Llama 4 Maverick (MoE)     400B     17B      MoE     ~224 GB    ~860 GB
GLM-5.1 (MoE)              754B     40B      MoE     ~180 GB    ~1500 GB
Mistral Large 3 (MoE)      675B     41B      MoE     ~200 GB    ~1350 GB
DeepSeek-V4-Pro (MoE)      1600B    49B      MoE     ~430 GB    not pract.
--------------------------------------------------------------------------
Remote API models (no local VRAM required):
GPT-5.5                    ~?       ~?       ?       N/A        N/A
Claude Opus 4.7            ~?       ~?       ?       N/A        N/A
Gemini 3.1 Pro             ~?       ~?       ?       N/A        N/A
--------------------------------------------------------------------------
This table is the foundation of everything that follows. Every design decision
in our CRD, our controller, and our scheduler will ultimately trace back to
these numbers.
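To show how a scheduler might consume the table, the sketch below encodes a handful of its Q4 rows as data and asks which models fit a given memory budget. The 20% headroom reserved for KV cache and runtime overhead is an assumption chosen for illustration, as are the function and variable names.

```python
# A few Q4 VRAM figures (GB) copied from the reference table above.
Q4_VRAM_GB = {
    "Gemma 4 E2B": 4,
    "Qwen3.5-9B": 6,
    "Gemma 4 26B-A4B": 14,
    "Qwen3.6-35B-A3B": 22,
    "Llama 4 Scout": 55,
    "Mistral Medium 3.5": 70,
    "Llama 4 Maverick": 224,
}


def models_that_fit(vram_gb: float, gpu_count: int = 1,
                    headroom_fraction: float = 0.2) -> list[str]:
    """Models whose Q4 weights fit after reserving headroom, smallest first."""
    budget = vram_gb * gpu_count * (1 - headroom_fraction)
    fitting = [(need, name) for name, need in Q4_VRAM_GB.items() if need <= budget]
    return [name for need, name in sorted(fitting)]


print("24 GB card:  ", models_that_fit(24))
print("80 GB card:  ", models_that_fit(80))
print("8 x 80 GB:   ", models_that_fit(80, gpu_count=8))
```

A real scheduler would also account for context length and batch size, which the headroom fraction only approximates.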
CHAPTER THREE: KUBERNETES 1.33, DRA, AND MULTI-VENDOR GPU SCHEDULING
Kubernetes 1.33, released on April 23, 2025, is a significant release for AI
workloads. The headline for accelerator users is Dynamic Resource Allocation
(DRA): its core API has been in beta since 1.32, and 1.33 extends it further.
The traditional approach to GPU scheduling uses the device plugin model. A
device plugin runs as a DaemonSet on each GPU node and advertises GPUs as
extended resources: "nvidia.com/gpu", "amd.com/gpu", or "habana.ai/gaudi". A
pod requests a GPU by setting the appropriate resource limit. The scheduler
finds a node with an available GPU of that type and assigns the pod to it. This
works, but it is extremely coarse-grained. The scheduler knows that a node has
GPUs and that a pod wants GPUs, but it knows nothing about the VRAM capacity,
interconnect topology, or whether the workload would benefit from partitioning.
DRA changes this by introducing a richer API for expressing hardware requirements
and capabilities. With DRA, a device driver can publish detailed information
about each accelerator: its model, its VRAM, its interconnect connectivity, its
supported quantization formats, and any other relevant attributes. A pod can
then request not just "a GPU" but "a GPU with at least 80 gigabytes of VRAM
from any vendor."
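A request of that kind is expressed as a ResourceClaim with a CEL selector. The sketch below builds such a claim as a plain Python dictionary, modelled on the resource.k8s.io/v1beta1 shape; check the field names against the API version your cluster actually serves. The device class name, the driver domain, and the "memory" capacity key are hypothetical, since each DRA driver publishes its own names, and a truly vendor-neutral claim would depend on a device class that spans several drivers.

```python
import json


def build_resource_claim(name: str, device_class: str,
                         driver_domain: str, min_memory: str) -> dict:
    """Build a ResourceClaim manifest asking for one device with enough memory."""
    # CEL expression evaluated by the scheduler against each published device.
    expression = (
        f'device.capacity["{driver_domain}"].memory'
        f'.compareTo(quantity("{min_memory}")) >= 0'
    )
    return {
        "apiVersion": "resource.k8s.io/v1beta1",
        "kind": "ResourceClaim",
        "metadata": {"name": name},
        "spec": {
            "devices": {
                "requests": [
                    {
                        "name": "accelerator",
                        "deviceClassName": device_class,
                        "selectors": [{"cel": {"expression": expression}}],
                    }
                ]
            }
        },
    }


# Hypothetical device class and driver domain, for illustration only.
claim = build_resource_claim("llm-accelerator", "gpu.example.com",
                             "gpu.example.com", "80Gi")
print(json.dumps(claim, indent=2))
```

The printed JSON can be applied with kubectl like any other manifest once the names match what the installed driver publishes.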
Kubernetes 1.33 also introduces, as alpha features, Partitionable Devices (the
foundation for partitioning schemes such as MIG), Device Taints and Tolerations
(the device-level counterpart of node taints and tolerations), and Prioritized
List (an ordered list of acceptable device configurations).
INSTALLING GPU SUPPORT FOR ALL VENDORS
The following commands install GPU support for all three major accelerator
vendors. Run only the sections relevant to the hardware in your cluster.
NVIDIA — GPU Operator:
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set toolkit.enabled=true \
  --set devicePlugin.enabled=true \
  --set dcgmExporter.enabled=true \
  --wait
AMD — GPU Operator (announced January 2025):
helm repo add amd-gpu-operator https://rocm.github.io/gpu-operator
helm repo update
helm install amd-gpu-operator amd-gpu-operator/gpu-operator-charts \
  --namespace amd-gpu-operator \
  --create-namespace \
  --set devicePlugin.enabled=true \
  --set nodeLabeller.enabled=true \
  --wait
# The AMD GPU Operator installs:
#   - amd-gpu-device-plugin (exposes amd.com/gpu resources)
#   - amd-gpu-node-labeller (labels nodes with GPU model and VRAM)
#   - ROCm driver management
# After installation, AMD GPUs appear as amd.com/gpu resources.
Intel Gaudi — Base Operator:
helm repo add intel https://intel.github.io/helm-charts
helm repo update
helm install gaudi-base-operator intel/intel-gaudi-base-operator \
  --namespace intel-gaudi \
  --create-namespace \
  --wait
# The Intel Gaudi Base Operator installs:
#   - Intel Gaudi Device Plugin (exposes habana.ai/gaudi resources)
#   - Container runtime configuration
#   - Feature discovery
#   - Monitoring tools
# After installation, Gaudi cards appear as habana.ai/gaudi resources.
Verify all accelerators are visible to the scheduler:
# NVIDIA GPUs
kubectl get nodes -o custom-columns=\
NAME:.metadata.name,\
NVIDIA_GPU:.status.allocatable."nvidia\.com/gpu"
# AMD GPUs
kubectl get nodes -o custom-columns=\
NAME:.metadata.name,\
AMD_GPU:.status.allocatable."amd\.com/gpu"
# Intel Gaudi
kubectl get nodes -o custom-columns=\
NAME:.metadata.name,\
GAUDI:.status.allocatable."habana\.ai/gaudi"
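For clusters with many nodes it can be handier to summarise the same information programmatically. The sketch below takes the parsed output of `kubectl get nodes -o json` and reports allocatable accelerators per node; the sample input is fabricated for illustration and the function name is an invention of this guide.

```python
ACCELERATOR_KEYS = ("nvidia.com/gpu", "amd.com/gpu", "habana.ai/gaudi")


def allocatable_accelerators(node_list: dict) -> dict:
    """Map node name -> {resource key: count} for nodes that expose accelerators."""
    summary = {}
    for node in node_list.get("items", []):
        allocatable = node.get("status", {}).get("allocatable", {})
        # The API reports extended resource quantities as strings such as "8".
        counts = {key: int(allocatable[key])
                  for key in ACCELERATOR_KEYS if key in allocatable}
        if counts:
            summary[node["metadata"]["name"]] = counts
    return summary


# Fabricated sample in the shape returned by `kubectl get nodes -o json`.
sample = {
    "items": [
        {"metadata": {"name": "gpu-a"},
         "status": {"allocatable": {"cpu": "64", "nvidia.com/gpu": "8"}}},
        {"metadata": {"name": "gpu-b"},
         "status": {"allocatable": {"cpu": "96", "amd.com/gpu": "4"}}},
        {"metadata": {"name": "cpu-only"},
         "status": {"allocatable": {"cpu": "32"}}},
    ]
}
print(allocatable_accelerators(sample))
```

In practice the dictionary would come from `json.load` on the kubectl output rather than from an inline sample.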
CHAPTER FOUR: DESIGNING THE LLMMODEL CUSTOM RESOURCE
We want to define a Kubernetes Custom Resource that captures everything a
scheduler needs to know about an LLM deployment. The resource must be rich
enough to express the full diversity of the model landscape surveyed in Chapter
Two, vendor-agnostic so it works with NVIDIA, AMD, Intel Gaudi, and CPU-only
nodes, and simple enough that an engineer can write one without consulting a
manual.
The most fundamental distinction is between a locally hosted model and a remote
API-only model. Frontier models like GPT-5.5, Claude Opus 4.7, and Gemini 3.1
Pro are not available as downloadable weights. Our CRD represents both cases
through its deploymentMode field.
A critical new field is `acceleratorVendor`, which selects the hardware backend.
The controller uses this field to select the correct:
- Kubernetes resource key (nvidia.com/gpu, amd.com/gpu, habana.ai/gaudi)
- vLLM Docker image (CUDA, ROCm, or Gaudi variant)
- Node selector labels (populated by the respective GPU operator)
- Tolerations (vendor-specific GPU node taints)
- Prometheus metrics source (DCGM for NVIDIA, ROCm SMI for AMD, Gaudi metrics)
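The first item in that list, the resource key, is the simplest to illustrate. The sketch below shows one way a controller might turn the vendor and the resource fields into a container resources block. It is an assumption about how such a controller could work rather than the implementation built later in this guide, and the function name is invented here.

```python
# Vendor -> extended resource key, as listed above; cpu requests no accelerator.
RESOURCE_KEYS = {
    "nvidia": "nvidia.com/gpu",
    "amd": "amd.com/gpu",
    "intel-gaudi": "habana.ai/gaudi",
    "cpu": None,
}


def container_resources(vendor: str, accelerator_count: int,
                        cpu_millicores: int, memory_gib: int) -> dict:
    """Build the resources block for an inference container."""
    if vendor not in RESOURCE_KEYS:
        raise ValueError(f"unknown acceleratorVendor: {vendor}")
    resources = {
        "requests": {"cpu": f"{cpu_millicores}m", "memory": f"{memory_gib}Gi"},
        "limits": {},
    }
    key = RESOURCE_KEYS[vendor]
    if key is not None and accelerator_count > 0:
        # Extended resources are requested through limits.
        resources["limits"][key] = str(accelerator_count)
    return resources


print(container_resources("amd", 1, 8000, 64))
print(container_resources("cpu", 0, 16000, 32))
```

Keeping the mapping in one table means that supporting a further vendor is a one-line change rather than a new code path.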
The CRD definition:
# File: config/crd/bases/ai.example.io_llmmodels.yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: llmmodels.ai.example.io
  annotations:
    ai.example.io/schema-version: "v1alpha1"
spec:
  group: ai.example.io
  names:
    kind: LLMModel
    listKind: LLMModelList
    plural: llmmodels
    singular: llmmodel
    shortNames:
      - lm
  scope: Namespaced
  versions:
    - name: v1alpha1
      served: true
      storage: true
      additionalPrinterColumns:
        - name: Model
          type: string
          jsonPath: .spec.modelId
        - name: Vendor
          type: string
          jsonPath: .spec.acceleratorVendor
        - name: Engine
          type: string
          jsonPath: .spec.inferenceEngine
        - name: Mode
          type: string
          jsonPath: .spec.deploymentMode
        - name: VRAM
          type: integer
          jsonPath: .spec.resources.vramPerAcceleratorGiB
        - name: Context
          type: integer
          jsonPath: .spec.contextWindowK
        - name: Status
          type: string
          jsonPath: .status.phase
        - name: Endpoint
          type: string
          jsonPath: .status.endpoint
      schema:
        openAPIV3Schema:
          type: object
          required:
            - spec
          properties:
            spec:
              type: object
              required:
                - modelId
                - deploymentMode
              properties:
                modelId:
                  type: string
                  description: >
                    The canonical model identifier. For Hugging Face models
                    this is the repo path (owner/name). For OCI models this
                    is the image reference. For remote API models this is
                    the model name as used in the API request.
                deploymentMode:
                  type: string
                  enum:
                    - local
                    - remote
                  description: >
                    Whether to run the model locally in the cluster or to
                    proxy requests to a remote API endpoint.
                # -------------------------------------------------------
                # ACCELERATOR VENDOR SELECTION
                # This is the key field that makes the operator
                # hardware-agnostic. The controller uses this to select
                # the correct resource key, Docker image, node selectors,
                # and tolerations.
                # -------------------------------------------------------
                acceleratorVendor:
                  type: string
                  enum:
                    - nvidia
                    - amd
                    - intel-gaudi
                    - cpu
                  default: nvidia
                  description: >
                    The accelerator vendor for local deployments.
                    nvidia: Uses nvidia.com/gpu resource key, CUDA-based
                    vLLM image, DCGM metrics.
                    amd: Uses amd.com/gpu resource key, ROCm-based vLLM
                    image, ROCm SMI metrics.
                    intel-gaudi: Uses habana.ai/gaudi resource key,
                    Intel Gaudi optimized vLLM fork image, Gaudi metrics.
                    cpu: No GPU resource requested. Uses llama.cpp or
                    Ollama for CPU inference. Suitable for small models
                    on Apple Silicon or x86 servers.
                inferenceEngine:
                  type: string
                  enum:
                    - vllm
                    - llamacpp
                    - ollama
                  default: vllm
                  description: >
                    The inference engine to use for local deployments.
                    vLLM is recommended for GPU deployments (NVIDIA, AMD,
                    Intel Gaudi) with high throughput requirements.
                    llamacpp is recommended for CPU inference or consumer
                    GPU inference using GGUF quantized models.
                    Ollama wraps llama.cpp with a model management layer
                    and OpenAI-compatible API.
                modelType:
                  type: string
                  enum:
                    - Dense
                    - MoE
                    - DenseThinking
                    - MoEThinking
                  default: Dense
                  description: >
                    The architectural type of the model. Dense models
                    activate all parameters for every token. MoE models
                    route each token through a subset of expert networks.
                    Thinking variants support chain-of-thought reasoning,
                    either always (DenseThinking) or switchable via system
                    prompt (MoEThinking).
                totalParametersBillions:
                  type: number
                  description: >
                    Total number of parameters in billions. For MoE models
                    this is the sum of all expert parameters. This value
                    drives VRAM/HBM requirements.
                activeParametersBillions:
                  type: number
                  description: >
                    Number of parameters activated per token in billions.
                    For dense models this equals totalParametersBillions.
                    For MoE models this drives compute (FLOP) cost per token.
                quantization:
                  type: string
                  enum:
                    - None
                    - FP16
                    - BF16
                    - FP8
                    - FP4
                    - INT8
                    - INT4
                    - QAT-INT4
                    - GPTQ
                    - AWQ
                    - GGUF-Q4_K_M
                    - GGUF-Q8_0
                    - MXFP4
                    - MXFP6
                  default: None
                  description: >
                    The quantization format of the model weights.
                    QAT-INT4 is Google's Quantization-Aware Training format.
                    MXFP4 and MXFP6 are AMD CDNA 4 (MI350X) native formats.
                    FP8 is natively supported by Intel Gaudi 3, NVIDIA H100+,
                    and AMD MI300X+.
                    GGUF variants are used with llamacpp and Ollama.
                    AWQ and GPTQ are used with vLLM on NVIDIA and AMD.
                contextWindowK:
                  type: integer
                  description: >
                    The maximum context window in thousands of tokens.
                    This affects KV cache memory requirements, which grow
                    linearly with context length.
                domains:
                  type: array
                  items:
                    type: string
                    enum:
                      - general
                      - code
                      - math
                      - reasoning
                      - vision
                      - audio
                      - multilingual
                      - embedding
                      - function-calling
                      - long-context
                      - agentic
                  description: >
                    The capability domains this model excels in.
                languages:
                  type: array
                  items:
                    type: string
                  description: >
                    Languages this model supports. Use ISO 639-1 codes
                    or descriptive strings like "140+" for broad support.
                resources:
                  type: object
                  properties:
                    acceleratorCount:
                      type: integer
                      default: 1
                      description: >
                        Number of accelerators required per replica.
                        The resource key used depends on acceleratorVendor:
                          nvidia -> nvidia.com/gpu
                          amd -> amd.com/gpu
                          intel-gaudi -> habana.ai/gaudi
                          cpu -> no accelerator resource requested
                    vramPerAcceleratorGiB:
                      type: integer
                      description: >
                        Required VRAM/HBM per accelerator in gibibytes.
                        The controller uses this to select nodes whose
                        accelerators have sufficient memory.
                        For AMD MI300X this can be up to 192 GiB.
                        For Intel Gaudi 3 this is up to 128 GiB.
                        For NVIDIA H100 this is up to 80 GiB.
                        For NVIDIA H200 this is up to 141 GiB.
                        For NVIDIA B200 this is up to 192 GiB.
                    preferredAcceleratorModel:
                      type: string
                      description: >
                        Optional preferred accelerator model string used
                        as a DRA preference hint. Examples:
                          "NVIDIA H100 80GB HBM3"
                          "AMD Instinct MI300X"
                          "Intel Gaudi 3"
                    cpuMillicores:
                      type: integer
                      default: 4000
                      description: >
                        CPU request in millicores for the inference pod.
                    memoryGiB:
                      type: integer
                      default: 16
                      description: >
                        System RAM request in gibibytes for the inference
                        pod. Separate from accelerator VRAM/HBM.
                        For MoE models with CPU offloading, this may need
                        to be very large (e.g., 256 GiB for Kimi K2.5
                        with quantized CPU offload).
                engineArgs:
                  type: object
                  additionalProperties:
                    type: string
                  description: >
                    Key-value pairs passed as command-line arguments to the
                    inference engine. For vLLM, common args include
                    tensor-parallel-size, max-model-len, and
                    gpu-memory-utilization. Values are always strings.
                modelSource:
                  type: object
                  properties:
                    type:
                      type: string
                      enum:
                        - huggingface
                        - oci
                        - s3
                    huggingFaceTokenSecret:
                      type: string
                    ociImage:
                      type: string
                    s3Bucket:
                      type: string
                    s3Prefix:
                      type: string
                    s3CredentialsSecret:
                      type: string
                scaling:
                  type: object
                  properties:
                    minReplicas:
                      type: integer
                      default: 1
                      description: >
                        Minimum replica count. Set to 0 for scale-to-zero.
                    maxReplicas:
                      type: integer
                      default: 3
                    scaleUpThreshold:
                      type: integer
                      default: 5
                      description: >
                        Queued inference requests that trigger scale-up.
                        KEDA monitors vllm:num_requests_waiting.
                    cooldownPeriodSeconds:
                      type: integer
                      default: 300
                      description: >
                        Seconds KEDA waits after the last scale-down
                        trigger before actually scaling down.
                remoteApi:
                  type: object
                  properties:
                    baseUrl:
                      type: string
                    apiKeySecret:
                      type: string
                    rateLimitRpm:
                      type: integer
                mcpExposure:
                  type: object
                  properties:
                    enabled:
                      type: boolean
                      default: false
                    toolName:
                      type: string
                    toolDescription:
                      type: string
            status:
              type: object
              properties:
                phase:
                  type: string
                  enum:
                    - Pending
                    - Downloading
                    - Starting
                    - Ready
                    - Degraded
                    - Failed
                    - Proxying
                endpoint:
                  type: string
                conditions:
                  type: array
                  items:
                    type: object
                    properties:
                      type:
                        type: string
                      status:
                        type: string
                      lastTransitionTime:
                        type: string
                      reason:
                        type: string
                      message:
                        type: string
                acceleratorNodes:
                  type: array
                  items:
                    type: string
                  description: >
                    Names of the Kubernetes nodes where this model's
                    inference pods are currently scheduled.
                currentReplicas:
                  type: integer
                requestsPerMinute:
                  type: integer
                averageLatencyMs:
                  type: integer
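The OpenAPI schema checks types and enum values, but some rules span several fields and cannot be expressed there. The sketch below shows the kind of cross-field validation a controller or admission webhook might add. The specific rules (for example, that a cpu deployment sets acceleratorCount to 0) are assumptions drawn from the field descriptions above, not requirements enforced by the CRD itself.

```python
VENDORS = {"nvidia", "amd", "intel-gaudi", "cpu"}
ENGINES = {"vllm", "llamacpp", "ollama"}
MODES = {"local", "remote"}


def validate_spec(spec: dict) -> list[str]:
    """Return human-readable problems; an empty list means the spec passed."""
    problems = []
    for field in ("modelId", "deploymentMode"):
        if not spec.get(field):
            problems.append(f"missing required field: {field}")

    mode = spec.get("deploymentMode")
    if mode is not None and mode not in MODES:
        problems.append(f"deploymentMode must be one of {sorted(MODES)}")
    if mode == "remote" and not spec.get("remoteApi", {}).get("baseUrl"):
        problems.append("remote deployments need remoteApi.baseUrl")

    if mode == "local":
        # Fall back to the schema defaults when a field is omitted.
        vendor = spec.get("acceleratorVendor", "nvidia")
        engine = spec.get("inferenceEngine", "vllm")
        count = spec.get("resources", {}).get("acceleratorCount", 1)
        if vendor not in VENDORS:
            problems.append(f"unknown acceleratorVendor: {vendor}")
        if engine not in ENGINES:
            problems.append(f"unknown inferenceEngine: {engine}")
        if vendor == "cpu" and count != 0:
            problems.append("cpu deployments should set acceleratorCount to 0")
        if vendor != "cpu" and count < 1:
            problems.append("accelerated deployments need acceleratorCount >= 1")

    total = spec.get("totalParametersBillions")
    active = spec.get("activeParametersBillions")
    if total is not None and active is not None and active > total:
        problems.append("activeParametersBillions cannot exceed totalParametersBillions")
    return problems


print(validate_spec({"modelId": "example/model", "deploymentMode": "remote"}))
```

Returning a list rather than raising on the first failure lets the controller report every problem in a single status condition.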
Now let us look at concrete LLMModel resources for several scenarios.
Example 1 — Gemma 4 26B-A4B on an AMD MI300X (single GPU, MoE):
# File: config/samples/gemma4-26b-amd.yaml
apiVersion: ai.example.io/v1alpha1
kind: LLMModel
metadata:
  name: gemma4-26b-amd
  namespace: ai-models
  labels:
    ai.example.io/family: gemma4
    ai.example.io/vendor: google
    ai.example.io/accelerator: amd
spec:
  modelId: google/gemma-4-26b-a4b-it
  deploymentMode: local
  acceleratorVendor: amd
  inferenceEngine: vllm
  modelType: MoE
  totalParametersBillions: 25.2
  activeParametersBillions: 3.8
  quantization: AWQ
  contextWindowK: 256
  domains:
    - general
    - vision
    - multilingual
    - function-calling
    - reasoning
    - agentic
  languages:
    - "140+"
  resources:
    acceleratorCount: 1
    vramPerAcceleratorGiB: 16
    preferredAcceleratorModel: "AMD Instinct MI300X"
    cpuMillicores: 8000
    memoryGiB: 64
  engineArgs:
    gpu-memory-utilization: "0.90"
    max-model-len: "65536"
    enable-chunked-prefill: "true"
  modelSource:
    type: huggingface
    huggingFaceTokenSecret: hf-token
  scaling:
    minReplicas: 0
    maxReplicas: 3
    scaleUpThreshold: 3
    cooldownPeriodSeconds: 300
  mcpExposure:
    enabled: true
    toolName: gemma4_26b_vision_amd
    toolDescription: >
      Gemma 4 26B-A4B MoE model from Google running on AMD MI300X.
      Supports text and image input with 256K context window.
      Excellent for general tasks, vision, multilingual content, function
      calling, and agentic workflows. Runs locally with full data privacy.
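The engineArgs map in this example has to become a command line at some point. The sketch below shows one plausible translation. Treating "true" as a bare flag and silently dropping "false" is a simplification assumed here; an engine whose boolean options default to on would need explicit negation instead. The function name is invented for this guide.

```python
def engine_args_to_cli(engine_args: dict[str, str]) -> list[str]:
    """Turn an engineArgs map into a flat list of command-line tokens."""
    cli = []
    for key in sorted(engine_args):  # sorted for a stable, diff-friendly order
        value = engine_args[key]
        flag = f"--{key}"
        if value.lower() == "true":
            cli.append(flag)            # boolean switch: flag with no value
        elif value.lower() == "false":
            continue                    # assumed to be the engine's default
        else:
            cli.extend([flag, value])
    return cli


# The engineArgs from the Gemma 4 26B-A4B example above.
args = {
    "gpu-memory-utilization": "0.90",
    "max-model-len": "65536",
    "enable-chunked-prefill": "true",
}
command = ["vllm", "serve", "google/gemma-4-26b-a4b-it"] + engine_args_to_cli(args)
print(" ".join(command))
```

Keeping every value a string in the CRD, as the schema does, means this translation never has to guess how a number should be formatted.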
Example 2 — Qwen3.6-35B-A3B on an Intel Gaudi 3 (thinking mode):
# File: config/samples/qwen36-35b-gaudi.yaml
apiVersion: ai.example.io/v1alpha1
kind: LLMModel
metadata:
  name: qwen36-35b-gaudi
  namespace: ai-models
  labels:
    ai.example.io/family: qwen36
    ai.example.io/vendor: alibaba
    ai.example.io/accelerator: intel-gaudi
spec:
  modelId: Qwen/Qwen3.6-35B-A3B
  deploymentMode: local
  acceleratorVendor: intel-gaudi
  inferenceEngine: vllm
  modelType: MoEThinking
  totalParametersBillions: 35
  activeParametersBillions: 3
  quantization: FP8
  contextWindowK: 256
  domains:
    - general
    - reasoning
    - math
    - code
    - multilingual
    - agentic
  languages:
    - "201+"
  resources:
    acceleratorCount: 1
    # FP8 stores one byte per parameter, so the 35B weights alone need
    # about 35 GB before any KV cache is allocated.
    vramPerAcceleratorGiB: 48
    preferredAcceleratorModel: "Intel Gaudi 3"
    cpuMillicores: 8000
    memoryGiB: 64
  engineArgs:
    max-model-len: "65536"
    tensor-parallel-size: "1"
  modelSource:
    type: huggingface
    huggingFaceTokenSecret: hf-token
  scaling:
    minReplicas: 0
    maxReplicas: 2
    scaleUpThreshold: 2
    cooldownPeriodSeconds: 300
  mcpExposure:
    enabled: true
    toolName: qwen36_reasoning_gaudi
    toolDescription: >
      Qwen3.6-35B-A3B MoE thinking model running on Intel Gaudi 3.
      Supports hybrid thinking/non-thinking mode. Excellent for math,
      code, complex reasoning, and agentic tasks. 256K context window.
      Runs locally with full data privacy.
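This example advertises a 256K context window but caps max-model-len at 65,536 tokens, and the reason is KV cache memory. The sketch below uses the standard estimate for conventional attention: two tensors (keys and values) per layer per token. The layer count, KV head count, and head dimension are made-up round numbers chosen for illustration, not the real configuration of Qwen3.6 or any other model, and architectures with hybrid or linear attention will differ.

```python
def kv_cache_gib(tokens: int, layers: int, kv_heads: int, head_dim: int,
                 bytes_per_value: int = 2) -> float:
    """KV cache size in GiB for one sequence of the given length."""
    # Factor 2: one key vector and one value vector per layer per token.
    total_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value * tokens
    return total_bytes / 2**30


# Hypothetical architecture: 48 layers, 8 KV heads, head dimension 128, FP16 cache.
for tokens in (65_536, 262_144):
    size = kv_cache_gib(tokens, layers=48, kv_heads=8, head_dim=128)
    print(f"{tokens:>7} tokens -> {size:.0f} GiB")
```

Under these assumptions, quadrupling the context quadruples the cache, which is the linear growth that the contextWindowK description refers to.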
Example 3 — Gemma 4 31B on NVIDIA H100 (dense, high quality):
# File: config/samples/gemma4-31b-nvidia.yaml
apiVersion: ai.example.io/v1alpha1
kind: LLMModel
metadata:
  name: gemma4-31b-nvidia
  namespace: ai-models
  labels:
    ai.example.io/family: gemma4
    ai.example.io/vendor: google
    ai.example.io/accelerator: nvidia
spec:
  modelId: google/gemma-4-31b-it
  deploymentMode: local
  acceleratorVendor: nvidia
  inferenceEngine: vllm
  modelType: Dense
  totalParametersBillions: 30.7
  activeParametersBillions: 30.7
  quantization: INT4
  contextWindowK: 256
  domains:
    - general
    - vision
    - code
    - reasoning
    - multilingual
    - function-calling
    - agentic
  languages:
    - "140+"
  resources:
    acceleratorCount: 1
    vramPerAcceleratorGiB: 24
    preferredAcceleratorModel: "NVIDIA H100 80GB HBM3"
    cpuMillicores: 8000
    memoryGiB: 32
  engineArgs:
    gpu-memory-utilization: "0.90"
    max-model-len: "65536"
    enable-chunked-prefill: "true"
  modelSource:
    type: huggingface
    huggingFaceTokenSecret: hf-token
  scaling:
    minReplicas: 1
    maxReplicas: 2
    scaleUpThreshold: 3
    cooldownPeriodSeconds: 300
  mcpExposure:
    enabled: true
    toolName: gemma4_31b_nvidia
    toolDescription: >
      Gemma 4 31B dense model from Google running on NVIDIA H100.
      Highest quality in the Gemma 4 family. Supports text and image
      input with 256K context window. Excellent for coding, reasoning,
      and agentic workflows. Runs locally with full data privacy.
Example 4 — Qwen3.5-9B on CPU (Apple Silicon or x86, no GPU):
# File: config/samples/qwen35-9b-cpu.yaml
apiVersion: ai.example.io/v1alpha1
kind: LLMModel
metadata:
  name: qwen35-9b-cpu
  namespace: ai-models
  labels:
    ai.example.io/family: qwen35
    ai.example.io/vendor: alibaba
    ai.example.io/accelerator: cpu
spec:
  modelId: Qwen/Qwen3.5-9B-GGUF
  deploymentMode: local
  acceleratorVendor: cpu
  inferenceEngine: ollama
  modelType: DenseThinking
  totalParametersBillions: 9
  activeParametersBillions: 9
  quantization: GGUF-Q4_K_M
  contextWindowK: 128
  domains:
    - general
    - reasoning
    - code
    - multilingual
  languages:
    - "201+"
  resources:
    acceleratorCount: 0
    cpuMillicores: 16000
    memoryGiB: 32
  engineArgs:
    num-ctx: "32768"
  modelSource:
    type: huggingface
    huggingFaceTokenSecret: hf-token
  scaling:
    minReplicas: 1
    maxReplicas: 4
    scaleUpThreshold: 5
    cooldownPeriodSeconds: 120
  mcpExposure:
    enabled: true
    toolName: qwen35_9b_cpu
    toolDescription: >
      Qwen3.5-9B running on CPU via Ollama. No GPU required. Supports
      hybrid thinking mode. Good for general tasks, code, and reasoning
      on CPU-only nodes or Apple Silicon. 128K context window.
Example 5 — GPT-5.5 as a remote API proxy:
# File: config/samples/gpt55-remote.yaml
apiVersion: ai.example.io/v1alpha1
kind: LLMModel
metadata:
  name: gpt55-remote
  namespace: ai-models
  labels:
    ai.example.io/family: gpt55
    ai.example.io/vendor: openai
    ai.example.io/deployment-mode: remote
spec:
  modelId: gpt-5.5
  deploymentMode: remote
  modelType: Dense
  contextWindowK: 1000
  domains:
    - general
    - code
    - math
    - reasoning
    - vision
    - function-calling
    - long-context
    - agentic
  languages:
    - "50+"
  remoteApi:
    baseUrl: https://api.openai.com/v1
    apiKeySecret: openai-api-key
    rateLimitRpm: 10000
  mcpExposure:
    enabled: true
    toolName: gpt55_frontier
    toolDescription: >
      OpenAI GPT-5.5 frontier model via the OpenAI API. 1M token context
      window. Five reasoning levels (none, low, medium, high, xhigh).
      Best-in-class for complex agentic tasks, coding, and research.
      Use when data privacy constraints permit external API calls.
Example 6 — Claude Opus 4.7 as a remote API proxy:
# File: config/samples/claude-opus47-remote.yaml
apiVersion: ai.example.io/v1alpha1
kind: LLMModel
metadata:
  name: claude-opus47-remote
  namespace: ai-models
  labels:
    ai.example.io/family: claude-opus
    ai.example.io/vendor: anthropic
    ai.example.io/deployment-mode: remote
spec:
  modelId: claude-opus-4-7
  deploymentMode: remote
  modelType: DenseThinking
  contextWindowK: 1000
  domains:
    - general
    - code
    - reasoning
    - vision
    - function-calling
    - long-context
    - agentic
  languages:
    - "50+"
  remoteApi:
    baseUrl: https://api.anthropic.com/v1
    apiKeySecret: anthropic-api-key
    rateLimitRpm: 5000
  mcpExposure:
    enabled: true
    toolName: claude_opus47
    toolDescription: >
      Anthropic Claude Opus 4.7 via the Anthropic API. 1M token context
      window. Excellent for multi-step agentic tasks, long-horizon reasoning,
      and complex tool-dependent workflows. Supports xhigh reasoning effort.
Example 7 — Gemini 3.1 Pro as a remote API proxy:
# File: config/samples/gemini31-remote.yaml
apiVersion: ai.example.io/v1alpha1
kind: LLMModel
metadata:
  name: gemini31-remote
  namespace: ai-models
  labels:
    ai.example.io/family: gemini31
    ai.example.io/vendor: google
    ai.example.io/deployment-mode: remote
spec:
  modelId: gemini-3.1-pro
  deploymentMode: remote
  modelType: Dense
  contextWindowK: 1000
  domains:
    - general
    - code
    - reasoning
    - vision
    - audio
    - function-calling
    - long-context
    - agentic
  languages:
    - "50+"
  remoteApi:
    baseUrl: https://generativelanguage.googleapis.com/v1beta/openai
    apiKeySecret: google-api-key
    rateLimitRpm: 2000
  mcpExposure:
    enabled: true
    toolName: gemini31_pro
    toolDescription: >
      Google Gemini 3.1 Pro via the Google AI API. 1M token context window.
      Processes up to 900 images, 8.4 hours of audio, or 1 hour of video
      per prompt. Three-tier thinking system (low, medium, high).
CHAPTER FIVE: THE LLMMODEL CONTROLLER
The controller watches for LLMModel resources and reconciles the actual state
of the cluster to match the desired state. The key design principle is that
the `acceleratorVendor` field drives all hardware-specific decisions, keeping
the rest of the reconciliation logic vendor-agnostic.
FILE: go.mod
module github.com/example/llm-operator
go 1.22
require (
k8s.io/api v0.30.0
k8s.io/apimachinery v0.30.0
k8s.io/client-go v0.30.0
sigs.k8s.io/controller-runtime v0.18.0
)
FILE: api/v1alpha1/llmmodel_types.go
package v1alpha1
import (
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)
// LLMModelSpec defines the desired state of an LLMModel resource.
type LLMModelSpec struct {
// ModelId is the canonical identifier for the model.
// +kubebuilder:validation:Required
ModelId string `json:"modelId"`
// DeploymentMode determines whether to run locally or proxy remotely.
// +kubebuilder:validation:Enum=local;remote
DeploymentMode string `json:"deploymentMode"`
// AcceleratorVendor selects the hardware backend for local deployments.
// +kubebuilder:validation:Enum=nvidia;amd;intel-gaudi;cpu
// +kubebuilder:default=nvidia
AcceleratorVendor string `json:"acceleratorVendor,omitempty"`
// InferenceEngine selects the serving framework for local deployments.
// +kubebuilder:validation:Enum=vllm;llamacpp;ollama
// +kubebuilder:default=vllm
InferenceEngine string `json:"inferenceEngine,omitempty"`
// ModelType describes the architectural pattern of the model.
// +kubebuilder:validation:Enum=Dense;MoE;DenseThinking;MoEThinking
// +kubebuilder:default=Dense
ModelType string `json:"modelType,omitempty"`
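	// TotalParametersBillions is the total parameter count in billions.
	// NOTE: controller-gen refuses to generate a CRD schema for float fields
	// unless it is run with crd:allowDangerousTypes=true.
	TotalParametersBillions float64 `json:"totalParametersBillions,omitempty"`
	// ActiveParametersBillions is the per-token active parameter count.
	// The same float caveat applies.
	ActiveParametersBillions float64 `json:"activeParametersBillions,omitempty"`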
// Quantization specifies the weight format and precision.
// +kubebuilder:validation:Enum=None;FP16;BF16;FP8;FP4;INT8;INT4;QAT-INT4;GPTQ;AWQ;GGUF-Q4_K_M;GGUF-Q8_0;MXFP4;MXFP6
Quantization string `json:"quantization,omitempty"`
// ContextWindowK is the maximum context length in thousands of tokens.
ContextWindowK int `json:"contextWindowK,omitempty"`
// Domains lists the capability areas this model excels in.
Domains []string `json:"domains,omitempty"`
// Languages lists the languages this model supports.
Languages []string `json:"languages,omitempty"`
// Resources specifies accelerator and system resource requirements.
Resources LLMResourceSpec `json:"resources,omitempty"`
// EngineArgs are passed directly to the inference engine as CLI flags.
EngineArgs map[string]string `json:"engineArgs,omitempty"`
// ModelSource configures where to download model weights from.
ModelSource *ModelSourceSpec `json:"modelSource,omitempty"`
// Scaling configures replica count and autoscaling thresholds.
Scaling LLMScalingSpec `json:"scaling,omitempty"`
// RemoteApi configures the external API for remote deployments.
RemoteApi *RemoteApiSpec `json:"remoteApi,omitempty"`
// McpExposure configures MCP tool registration for this model.
McpExposure LLMMcpSpec `json:"mcpExposure,omitempty"`
}
// LLMResourceSpec captures accelerator and system resource requirements.
type LLMResourceSpec struct {
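	// AcceleratorCount is the number of accelerators required per replica.
	// Set to 0 for cpu acceleratorVendor.
	// NOTE: with omitempty and a default of 1, an explicit 0 is dropped when
	// the controller writes the object back, and the API server re-applies
	// the default. The controller treats any count below 1 as 1 and ignores
	// it in cpu mode, so this is harmless here.
	// +kubebuilder:default=1
	AcceleratorCount int `json:"acceleratorCount,omitempty"`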
// VramPerAcceleratorGiB is the required VRAM/HBM per accelerator in GiB.
VramPerAcceleratorGiB int `json:"vramPerAcceleratorGiB,omitempty"`
// PreferredAcceleratorModel is an optional DRA preference hint.
PreferredAcceleratorModel string `json:"preferredAcceleratorModel,omitempty"`
// CpuMillicores is the CPU request for the inference pod.
// +kubebuilder:default=4000
CpuMillicores int `json:"cpuMillicores,omitempty"`
// MemoryGiB is the system RAM request for the inference pod.
// +kubebuilder:default=16
MemoryGiB int `json:"memoryGiB,omitempty"`
}
// ModelSourceSpec configures model weight download.
type ModelSourceSpec struct {
// Type selects the download backend.
// +kubebuilder:validation:Enum=huggingface;oci;s3
Type string `json:"type"`
HuggingFaceTokenSecret string `json:"huggingFaceTokenSecret,omitempty"`
OciImage string `json:"ociImage,omitempty"`
S3Bucket string `json:"s3Bucket,omitempty"`
S3Prefix string `json:"s3Prefix,omitempty"`
S3CredentialsSecret string `json:"s3CredentialsSecret,omitempty"`
}
// LLMScalingSpec configures replica autoscaling.
type LLMScalingSpec struct {
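	// MinReplicas is the minimum replica count. Set to 0 for scale-to-zero.
	// NOTE: with omitempty and a default of 1, an explicit 0 is dropped when
	// the controller writes the object back (for example when it adds the
	// finalizer or syncs labels), and the API server re-applies the default.
	// Preserving scale-to-zero across controller updates needs a pointer
	// type (*int) so that 0 and "unset" can be told apart.
	// +kubebuilder:default=1
	MinReplicas int `json:"minReplicas,omitempty"`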
// MaxReplicas is the maximum replica count.
// +kubebuilder:default=3
MaxReplicas int `json:"maxReplicas,omitempty"`
// ScaleUpThreshold is the queued request count that triggers scale-up.
// +kubebuilder:default=5
ScaleUpThreshold int `json:"scaleUpThreshold,omitempty"`
// CooldownPeriodSeconds is how long KEDA waits before scaling down.
// +kubebuilder:default=300
CooldownPeriodSeconds int `json:"cooldownPeriodSeconds,omitempty"`
}
// RemoteApiSpec configures a remote OpenAI-compatible API proxy.
type RemoteApiSpec struct {
BaseUrl string `json:"baseUrl"`
ApiKeySecret string `json:"apiKeySecret"`
RateLimitRpm int `json:"rateLimitRpm,omitempty"`
}
// LLMMcpSpec configures MCP tool exposure for this model.
type LLMMcpSpec struct {
Enabled bool `json:"enabled,omitempty"`
ToolName string `json:"toolName,omitempty"`
ToolDescription string `json:"toolDescription,omitempty"`
}
// LLMModelStatus reflects the observed state of the LLMModel.
type LLMModelStatus struct {
Phase string `json:"phase,omitempty"`
Endpoint string `json:"endpoint,omitempty"`
Conditions []metav1.Condition `json:"conditions,omitempty"`
AcceleratorNodes []string `json:"acceleratorNodes,omitempty"`
CurrentReplicas int `json:"currentReplicas,omitempty"`
RequestsPerMinute int `json:"requestsPerMinute,omitempty"`
AverageLatencyMs int `json:"averageLatencyMs,omitempty"`
}
// LLMModel is the Schema for the llmmodels API.
// +kubebuilder:object:root=true
// +kubebuilder:subresource:status
// +kubebuilder:printcolumn:name="Model",type=string,JSONPath=".spec.modelId"
// +kubebuilder:printcolumn:name="Vendor",type=string,JSONPath=".spec.acceleratorVendor"
// +kubebuilder:printcolumn:name="Mode",type=string,JSONPath=".spec.deploymentMode"
// +kubebuilder:printcolumn:name="Status",type=string,JSONPath=".status.phase"
// +kubebuilder:printcolumn:name="Endpoint",type=string,JSONPath=".status.endpoint"
type LLMModel struct {
metav1.TypeMeta `json:",inline"`
metav1.ObjectMeta `json:"metadata,omitempty"`
Spec LLMModelSpec `json:"spec,omitempty"`
Status LLMModelStatus `json:"status,omitempty"`
}
// LLMModelList contains a list of LLMModel resources.
// +kubebuilder:object:root=true
type LLMModelList struct {
metav1.TypeMeta `json:",inline"`
metav1.ListMeta `json:"metadata,omitempty"`
Items []LLMModel `json:"items"`
}
func init() {
SchemeBuilder.Register(&LLMModel{}, &LLMModelList{})
}
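The kubebuilder markers above validate each field on its own. They cannot express rules that span several fields, such as "a remote model needs remoteApi" or "the cpu vendor should not request accelerators". The following is a minimal sketch of such a cross-field check, not part of the operator as written. The type specView is a hypothetical, trimmed-down stand-in for LLMModelSpec, validateSpec is a hypothetical helper, and the rules themselves are assumptions read off the field comments above.

```go
package main

import (
	"errors"
	"fmt"
)

// specView is a hypothetical stand-in for LLMModelSpec that carries only
// the fields the cross-field rules below need.
type specView struct {
	DeploymentMode    string
	AcceleratorVendor string
	AcceleratorCount  int
	HasRemoteAPI      bool // true when spec.remoteApi is set
	MinReplicas       int
	MaxReplicas       int
}

// validateSpec collects every rule violation and returns them joined into
// one error, or nil when the spec is consistent.
func validateSpec(s specView) error {
	var errs []error
	switch s.DeploymentMode {
	case "local":
		if s.HasRemoteAPI {
			errs = append(errs, errors.New("remoteApi is only used when deploymentMode is remote"))
		}
		if s.AcceleratorVendor == "cpu" && s.AcceleratorCount != 0 {
			errs = append(errs, errors.New("acceleratorVendor cpu expects acceleratorCount 0"))
		}
		if s.MinReplicas < 0 || s.MaxReplicas < s.MinReplicas {
			errs = append(errs, fmt.Errorf(
				"scaling bounds are inconsistent: minReplicas=%d, maxReplicas=%d",
				s.MinReplicas, s.MaxReplicas))
		}
	case "remote":
		if !s.HasRemoteAPI {
			errs = append(errs, errors.New("deploymentMode remote requires remoteApi"))
		}
	default:
		errs = append(errs, fmt.Errorf("unknown deploymentMode %q", s.DeploymentMode))
	}
	// errors.Join returns nil when errs is empty.
	return errors.Join(errs...)
}

func main() {
	// Mirrors the CPU sample: local, cpu vendor, no accelerators, 1 to 4 replicas.
	ok := specView{DeploymentMode: "local", AcceleratorVendor: "cpu", MinReplicas: 1, MaxReplicas: 4}
	fmt.Println("cpu sample valid:", validateSpec(ok) == nil)

	// A remote model without remoteApi is rejected.
	bad := specView{DeploymentMode: "remote"}
	fmt.Println("remote without remoteApi:", validateSpec(bad))
}
```

A check like this could run at the start of reconciliation or in a validating webhook; which one fits depends on whether invalid objects should be rejected at admission or merely marked Failed.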
FILE: api/v1alpha1/groupversion_info.go
package v1alpha1
import (
"k8s.io/apimachinery/pkg/runtime/schema"
"sigs.k8s.io/controller-runtime/pkg/scheme"
)
var (
GroupVersion = schema.GroupVersion{Group: "ai.example.io", Version: "v1alpha1"}
SchemeBuilder = &scheme.Builder{GroupVersion: GroupVersion}
AddToScheme = SchemeBuilder.AddToScheme
)
FILE: controllers/hardware.go
// hardware.go contains all vendor-specific hardware configuration logic.
// This is the single file that must be updated when adding a new accelerator
// vendor. All other controller code is vendor-agnostic.
package controllers
import (
"fmt"
corev1 "k8s.io/api/core/v1"
"k8s.io/apimachinery/pkg/api/resource"
aiv1alpha1 "github.com/example/llm-operator/api/v1alpha1"
)
// AcceleratorConfig holds all hardware-specific configuration derived
// from the LLMModel spec. The reconciler calls resolveAcceleratorConfig
// once and then uses this struct throughout the reconciliation.
type AcceleratorConfig struct {
// ResourceKey is the Kubernetes resource name for this accelerator type.
// e.g. "nvidia.com/gpu", "amd.com/gpu", "habana.ai/gaudi"
// Empty string means no accelerator resource is requested (cpu mode).
ResourceKey string
// VllmImage is the Docker image to use for the vLLM inference server.
// Different vendors require different images built against their SDK.
VllmImage string
// OllamaImage is the Docker image to use for Ollama (cpu/apple silicon).
OllamaImage string
// LlamaCppImage is the Docker image to use for llama.cpp.
LlamaCppImage string
// NodeSelectorLabels are added to the pod's nodeSelector to target
// nodes with the correct accelerator type. These labels are populated
// by the respective GPU operator / device plugin.
NodeSelectorLabels map[string]string
// Tolerations allow the pod to be scheduled on tainted GPU nodes.
// Most clusters taint GPU nodes to prevent non-GPU workloads from
// landing on expensive hardware.
Tolerations []corev1.Toleration
// PrometheusMetricPrefix is the prefix used by the accelerator's
// monitoring exporter. vLLM emits its own metrics regardless of
// backend, but we also expose hardware-level metrics via KEDA for
// GPU utilization-based scaling as a secondary trigger.
PrometheusMetricPrefix string
// AcceleratorQuantity is the resource.Quantity for the accelerator
// resource limit/request. Nil for cpu mode.
AcceleratorQuantity *resource.Quantity
// HostIPC indicates whether the pod needs host IPC namespace access.
// Required for multi-GPU tensor parallelism on some vendors.
HostIPC bool
// HostNetwork indicates whether the pod needs host network access.
// Required for multi-GPU NCCL/RCCL communication on some vendors.
HostNetwork bool
// AdditionalEnvVars are vendor-specific environment variables to inject.
AdditionalEnvVars []corev1.EnvVar
}
// resolveAcceleratorConfig derives all hardware-specific configuration
// from the LLMModel spec. This is the single authoritative function for
// vendor dispatch. Add new vendors here.
func resolveAcceleratorConfig(model *aiv1alpha1.LLMModel) (AcceleratorConfig, error) {
vendor := model.Spec.AcceleratorVendor
if vendor == "" {
vendor = "nvidia" // backward-compatible default
}
count := model.Spec.Resources.AcceleratorCount
if count < 1 {
count = 1
}
switch vendor {
// ------------------------------------------------------------------
// NVIDIA — CUDA
// Resource key: nvidia.com/gpu
// Node label: nvidia.com/gpu.present=true (GPU Operator)
// Taint: nvidia.com/gpu:NoSchedule
// vLLM image: vllm/vllm-openai (CUDA build)
// Metrics: DCGM exporter (DCGM_FI_DEV_*)
// ------------------------------------------------------------------
case "nvidia":
qty := resource.MustParse(fmt.Sprintf("%d", count))
return AcceleratorConfig{
ResourceKey: "nvidia.com/gpu",
VllmImage: "vllm/vllm-openai:v0.8.5",
OllamaImage: "ollama/ollama:0.6.5",
LlamaCppImage: "ghcr.io/ggerganov/llama.cpp:server-cuda",
NodeSelectorLabels: map[string]string{"nvidia.com/gpu.present": "true"},
Tolerations: []corev1.Toleration{
{
Key: "nvidia.com/gpu",
Operator: corev1.TolerationOpExists,
Effect: corev1.TaintEffectNoSchedule,
},
},
PrometheusMetricPrefix: "DCGM_FI_DEV",
AcceleratorQuantity: &qty,
HostIPC: count > 1,
HostNetwork: count > 1,
AdditionalEnvVars: []corev1.EnvVar{},
}, nil
// ------------------------------------------------------------------
// AMD — ROCm
// Resource key: amd.com/gpu
// Node label: amd.com/gpu.present=true (AMD GPU Operator)
// Taint: amd.com/gpu:NoSchedule
// vLLM image: rocm/vllm-openai (ROCm build)
// Official ROCm-enabled vLLM images available since January 2026.
// 93% of vLLM AMD test groups passing as of mid-January 2026.
// Metrics: ROCm SMI exporter (rocm_smi_*)
// Note: AWQ quantization is supported on ROCm as of vLLM 0.8+.
// MXFP4/MXFP6 require MI350X or later (CDNA 4).
// ------------------------------------------------------------------
case "amd":
qty := resource.MustParse(fmt.Sprintf("%d", count))
return AcceleratorConfig{
ResourceKey: "amd.com/gpu",
VllmImage: "rocm/vllm-openai:v0.8.5-rocm6.2",
OllamaImage: "ollama/ollama:0.6.5-rocm",
LlamaCppImage: "ghcr.io/ggerganov/llama.cpp:server-rocm",
NodeSelectorLabels: map[string]string{
"amd.com/gpu.present": "true",
},
Tolerations: []corev1.Toleration{
{
Key: "amd.com/gpu",
Operator: corev1.TolerationOpExists,
Effect: corev1.TaintEffectNoSchedule,
},
},
PrometheusMetricPrefix: "rocm_smi",
AcceleratorQuantity: &qty,
HostIPC: count > 1,
HostNetwork: count > 1,
			AdditionalEnvVars: []corev1.EnvVar{
				// Device visibility is left to the AMD device plugin, which
				// exposes only the allocated GPUs to the container.
				// HIP_VISIBLE_DEVICES is deliberately not set: it expects a
				// comma-separated list of device indices, so a value of
				// "all" risks hiding every GPU from the process.
				// ROCm requires this for vLLM's flash attention backend.
				{
					Name:  "VLLM_USE_ROCM_FLASH_ATTN",
					Value: "1",
				},
			},
}, nil
// ------------------------------------------------------------------
// INTEL GAUDI
// Resource key: habana.ai/gaudi
// Node label: habana.ai/gaudi=true (Gaudi Base Operator)
// Taint: habana.ai/gaudi:NoSchedule
// vLLM image: Intel optimized vLLM fork for Gaudi
// Intel maintains a Gaudi-specific vLLM fork with Paged KV cache,
// custom Paged Attention, tensor parallelism, and FP8 support.
// Supports DeepSeek architecture since Gaudi software 1.21.0.
// Metrics: Gaudi metrics exporter (habana_*)
// Note: FP8 is natively supported by Gaudi 3.
// TGI (Text Generation Inference) is also supported.
// ------------------------------------------------------------------
case "intel-gaudi":
qty := resource.MustParse(fmt.Sprintf("%d", count))
return AcceleratorConfig{
ResourceKey: "habana.ai/gaudi",
VllmImage: "vault.habana.ai/gaudi-docker/1.21.0/ubuntu22.04/habanalabs/vllm-fork:latest",
OllamaImage: "", // Ollama does not support Gaudi; use vllm or tgi
LlamaCppImage: "", // llama.cpp does not support Gaudi; use vllm or tgi
NodeSelectorLabels: map[string]string{
"habana.ai/gaudi": "true",
},
Tolerations: []corev1.Toleration{
{
Key: "habana.ai/gaudi",
Operator: corev1.TolerationOpExists,
Effect: corev1.TaintEffectNoSchedule,
},
},
PrometheusMetricPrefix: "habana",
AcceleratorQuantity: &qty,
HostIPC: true, // Required for Gaudi inter-card communication
HostNetwork: count > 1,
AdditionalEnvVars: []corev1.EnvVar{
// Gaudi requires these environment variables for vLLM.
{
Name: "PT_HPU_ENABLE_LAZY_COLLECTIVES",
Value: "true",
},
{
Name: "VLLM_SKIP_WARMUP",
Value: "false",
},
// Gaudi uses Habana's collective communications library.
{
Name: "HABANA_VISIBLE_DEVICES",
Value: "all",
},
},
}, nil
// ------------------------------------------------------------------
// CPU — No accelerator
// Resource key: (none)
// Inference: Ollama or llama.cpp with GGUF models
// Suitable for: Apple Silicon nodes, x86 servers, edge deployments
// Note: vLLM is not recommended for CPU-only inference.
// Use ollama or llamacpp as inferenceEngine.
// ------------------------------------------------------------------
case "cpu":
return AcceleratorConfig{
ResourceKey: "",
VllmImage: "", // vLLM not recommended for CPU
OllamaImage: "ollama/ollama:0.6.5",
LlamaCppImage: "ghcr.io/ggerganov/llama.cpp:server",
NodeSelectorLabels: map[string]string{},
Tolerations: []corev1.Toleration{},
PrometheusMetricPrefix: "",
AcceleratorQuantity: nil,
HostIPC: false,
HostNetwork: false,
AdditionalEnvVars: []corev1.EnvVar{},
}, nil
default:
return AcceleratorConfig{}, fmt.Errorf(
"unknown acceleratorVendor %q; valid values: nvidia, amd, intel-gaudi, cpu",
vendor,
)
}
}
// inferenceImageFor returns the correct Docker image for the given
// inference engine and accelerator configuration.
func inferenceImageFor(engine string, cfg AcceleratorConfig) (string, error) {
	// AcceleratorConfig carries no vendor name, so errors identify the
	// target by its resource key. The key is empty in cpu mode.
	target := cfg.ResourceKey
	if target == "" {
		target = "cpu"
	}
	switch engine {
	case "vllm":
		if cfg.VllmImage == "" {
			return "", fmt.Errorf(
				"vLLM is not supported for accelerator target %q; "+
					"use llamacpp or ollama instead",
				target,
			)
		}
		return cfg.VllmImage, nil
	case "ollama":
		if cfg.OllamaImage == "" {
			return "", fmt.Errorf(
				"Ollama is not supported for accelerator target %q; "+
					"use vllm instead",
				target,
			)
		}
		return cfg.OllamaImage, nil
	case "llamacpp":
		if cfg.LlamaCppImage == "" {
			return "", fmt.Errorf(
				"llama.cpp is not supported for accelerator target %q; "+
					"use vllm instead",
				target,
			)
		}
		return cfg.LlamaCppImage, nil
	default:
		return "", fmt.Errorf("unknown inferenceEngine %q", engine)
	}
}
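Which engine and vendor combinations inferenceImageFor accepts follows from which image fields resolveAcceleratorConfig leaves empty. The sketch below restates that matrix as data, derived from the four vendor cases above, so it can be reviewed or tested at a glance. The names supportedEngines and checkEngine are hypothetical and are not part of hardware.go.

```go
package main

import (
	"fmt"
	"slices"
)

// supportedEngines restates which image fields each vendor case in
// resolveAcceleratorConfig populates. Hypothetical helper for illustration.
var supportedEngines = map[string][]string{
	"nvidia":      {"vllm", "ollama", "llamacpp"},
	"amd":         {"vllm", "ollama", "llamacpp"},
	"intel-gaudi": {"vllm"},
	"cpu":         {"ollama", "llamacpp"},
}

// checkEngine reports whether the engine can be used with the vendor and,
// if not, names both the vendor and the engines that would work.
func checkEngine(vendor, engine string) error {
	engines, ok := supportedEngines[vendor]
	if !ok {
		return fmt.Errorf("unknown acceleratorVendor %q", vendor)
	}
	if !slices.Contains(engines, engine) {
		return fmt.Errorf(
			"inferenceEngine %q is not available for acceleratorVendor %q; supported: %v",
			engine, vendor, engines,
		)
	}
	return nil
}

func main() {
	combos := [][2]string{
		{"nvidia", "vllm"},
		{"intel-gaudi", "ollama"},
		{"cpu", "vllm"},
	}
	for _, c := range combos {
		if err := checkEngine(c[0], c[1]); err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("ok:", c[0], c[1])
	}
}
```

Keeping the matrix as data makes it cheap to test, but it duplicates what the image fields already encode, so a change to one must be mirrored in the other.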
FILE: controllers/llmmodel_controller.go
package controllers
import (
"context"
"fmt"
"sort"
"time"
appsv1 "k8s.io/api/apps/v1"
corev1 "k8s.io/api/core/v1"
"k8s.io/apimachinery/pkg/api/errors"
"k8s.io/apimachinery/pkg/api/resource"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/apimachinery/pkg/runtime"
"k8s.io/apimachinery/pkg/util/intstr"
ctrl "sigs.k8s.io/controller-runtime"
"sigs.k8s.io/controller-runtime/pkg/client"
"sigs.k8s.io/controller-runtime/pkg/log"
aiv1alpha1 "github.com/example/llm-operator/api/v1alpha1"
)
// LLMModelReconciler reconciles LLMModel resources.
type LLMModelReconciler struct {
client.Client
Scheme *runtime.Scheme
McpClient McpRegistryClient
}
// McpRegistryClient is an interface for registering tools with the MCP server.
type McpRegistryClient interface {
RegisterTool(ctx context.Context, tool McpToolDefinition) error
UnregisterTool(ctx context.Context, toolName string) error
}
// McpToolDefinition describes a tool to be registered with the MCP server.
type McpToolDefinition struct {
Name string
Description string
Endpoint string
ModelId string
Domains []string
}
// Reconcile is called by controller-runtime whenever an LLMModel changes.
func (r *LLMModelReconciler) Reconcile(
ctx context.Context,
req ctrl.Request,
) (ctrl.Result, error) {
logger := log.FromContext(ctx)
logger.Info("Reconciling LLMModel", "name", req.Name, "namespace", req.Namespace)
// Step 1: Fetch the LLMModel resource.
model := &aiv1alpha1.LLMModel{}
if err := r.Get(ctx, req.NamespacedName, model); err != nil {
if errors.IsNotFound(err) {
return ctrl.Result{}, nil
}
return ctrl.Result{}, fmt.Errorf("fetching LLMModel: %w", err)
}
// Step 2: Handle deletion via finalizer.
finalizerName := "ai.example.io/mcp-cleanup"
if model.DeletionTimestamp != nil {
if containsString(model.Finalizers, finalizerName) {
if err := r.cleanupMcpRegistration(ctx, model); err != nil {
return ctrl.Result{}, err
}
model.Finalizers = removeString(model.Finalizers, finalizerName)
if err := r.Update(ctx, model); err != nil {
return ctrl.Result{}, err
}
}
return ctrl.Result{}, nil
}
// Step 3: Add the finalizer if not present.
if !containsString(model.Finalizers, finalizerName) {
model.Finalizers = append(model.Finalizers, finalizerName)
if err := r.Update(ctx, model); err != nil {
return ctrl.Result{}, err
}
// Re-fetch after update to get the latest resourceVersion.
if err := r.Get(ctx, req.NamespacedName, model); err != nil {
return ctrl.Result{}, err
}
}
// Step 4: Sync discovery labels. We do this before the main
// reconciliation so that labels are always current.
if err := r.syncDiscoveryLabels(ctx, model); err != nil {
return ctrl.Result{}, fmt.Errorf("syncing discovery labels: %w", err)
}
// Re-fetch after label update.
if err := r.Get(ctx, req.NamespacedName, model); err != nil {
return ctrl.Result{}, err
}
// Step 5: Branch on deployment mode.
var reconcileErr error
switch model.Spec.DeploymentMode {
case "local":
reconcileErr = r.reconcileLocalModel(ctx, model)
case "remote":
reconcileErr = r.reconcileRemoteModel(ctx, model)
default:
reconcileErr = fmt.Errorf(
"unknown deploymentMode: %s", model.Spec.DeploymentMode,
)
}
if reconcileErr != nil {
model.Status.Phase = "Failed"
_ = r.Status().Update(ctx, model)
return ctrl.Result{}, reconcileErr
}
// Step 6: Requeue after 30 seconds to refresh metrics in the status.
return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
}
// syncDiscoveryLabels updates the LLMModel's labels to reflect its spec,
// enabling efficient kubectl and API server queries.
// NOTE: This calls r.Update (not r.Status().Update) because labels are
// metadata, not status. The caller must re-fetch after this call.
func (r *LLMModelReconciler) syncDiscoveryLabels(
ctx context.Context,
model *aiv1alpha1.LLMModel,
) error {
updated := model.DeepCopy()
if updated.Labels == nil {
updated.Labels = make(map[string]string)
}
// Domain labels: ai.example.io/domain-<name>=true
for _, domain := range model.Spec.Domains {
updated.Labels[fmt.Sprintf("ai.example.io/domain-%s", domain)] = "true"
}
// Context window tier labels (cumulative: a 256K model also gets 128K label)
contextK := model.Spec.ContextWindowK
tiers := []struct {
threshold int
label string
}{
{10000, "ai.example.io/context-10m"},
{1000, "ai.example.io/context-1m"},
{256, "ai.example.io/context-256k"},
{128, "ai.example.io/context-128k"},
{64, "ai.example.io/context-64k"},
		{32, "ai.example.io/context-32k"}, // models below 32K get no tier label
}
for _, tier := range tiers {
if contextK >= tier.threshold {
updated.Labels[tier.label] = "true"
}
}
// VRAM tier label
vram := model.Spec.Resources.VramPerAcceleratorGiB
switch {
case vram == 0:
updated.Labels["ai.example.io/vram-tier"] = "cpu"
case vram <= 8:
updated.Labels["ai.example.io/vram-tier"] = "8gb"
case vram <= 16:
updated.Labels["ai.example.io/vram-tier"] = "16gb"
case vram <= 24:
updated.Labels["ai.example.io/vram-tier"] = "24gb"
case vram <= 48:
updated.Labels["ai.example.io/vram-tier"] = "48gb"
case vram <= 80:
updated.Labels["ai.example.io/vram-tier"] = "80gb"
case vram <= 141:
updated.Labels["ai.example.io/vram-tier"] = "141gb"
case vram <= 192:
updated.Labels["ai.example.io/vram-tier"] = "192gb"
default:
updated.Labels["ai.example.io/vram-tier"] = "multi-accelerator"
}
// Other discovery labels
updated.Labels["ai.example.io/model-type"] = model.Spec.ModelType
updated.Labels["ai.example.io/quantization"] = model.Spec.Quantization
updated.Labels["ai.example.io/deployment-mode"] = model.Spec.DeploymentMode
updated.Labels["ai.example.io/accelerator-vendor"] = model.Spec.AcceleratorVendor
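	// Skip the write when nothing changed, so that a steady-state reconcile
	// does not issue an Update every 30 seconds. updated.Labels is a superset
	// of model.Labels, so checking each of its entries is sufficient.
	changed := false
	for k, v := range updated.Labels {
		if old, ok := model.Labels[k]; !ok || old != v {
			changed = true
			break
		}
	}
	if !changed {
		return nil
	}
	return r.Update(ctx, updated)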
}
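One limitation of syncDiscoveryLabels as written is that it only ever adds labels: removing a domain from spec.domains leaves its ai.example.io/domain-<name> label in place. The following sketch shows one way to compute the label set so that stale domain labels are dropped. It is an illustration under assumptions, not the operator's code; the function reconcileDomainLabels is hypothetical and works on plain maps rather than on an LLMModel.

```go
package main

import (
	"fmt"
	"strings"
)

const domainPrefix = "ai.example.io/domain-"

// reconcileDomainLabels returns a copy of current in which the domain labels
// match domains exactly, plus a flag reporting whether anything changed.
// Labels outside the domain prefix are copied through untouched.
func reconcileDomainLabels(current map[string]string, domains []string) (map[string]string, bool) {
	want := make(map[string]bool, len(domains))
	for _, d := range domains {
		want[domainPrefix+d] = true
	}

	out := make(map[string]string, len(current)+len(domains))
	changed := false
	for k, v := range current {
		if strings.HasPrefix(k, domainPrefix) && !want[k] {
			changed = true // stale domain label: drop it
			continue
		}
		out[k] = v
	}
	for k := range want {
		if out[k] != "true" {
			out[k] = "true"
			changed = true
		}
	}
	return out, changed
}

func main() {
	current := map[string]string{
		"ai.example.io/domain-math": "true",
		"ai.example.io/domain-code": "true",
		"team":                      "platform",
	}
	// "math" was removed from the spec and "vision" was added.
	next, changed := reconcileDomainLabels(current, []string{"code", "vision"})
	fmt.Println(changed, len(next)) // prints "true 3"
}
```

The changed flag also tells the caller when no write is needed at all.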
// reconcileLocalModel handles the full lifecycle of a locally-hosted model.
func (r *LLMModelReconciler) reconcileLocalModel(
ctx context.Context,
model *aiv1alpha1.LLMModel,
) error {
// Resolve hardware configuration for this vendor.
hwCfg, err := resolveAcceleratorConfig(model)
if err != nil {
return err
}
// Validate engine/vendor compatibility.
if _, err := inferenceImageFor(model.Spec.InferenceEngine, hwCfg); err != nil {
return err
}
// Ensure the model cache PVC exists.
if err := r.ensureModelCachePvc(ctx, model); err != nil {
return fmt.Errorf("ensuring model cache PVC: %w", err)
}
// Ensure the inference Deployment exists and matches the spec.
if err := r.ensureInferenceDeployment(ctx, model, hwCfg); err != nil {
return fmt.Errorf("ensuring inference deployment: %w", err)
}
// Ensure the Service exists.
if err := r.ensureInferenceService(ctx, model); err != nil {
return fmt.Errorf("ensuring inference service: %w", err)
}
// Ensure the KEDA ScaledObject exists.
if err := r.ensureKedaScaledObject(ctx, model); err != nil {
return fmt.Errorf("ensuring KEDA ScaledObject: %w", err)
}
// Update status. We use Status().Update() here — separate from the
// label update in syncDiscoveryLabels — to avoid a double-update conflict.
endpoint := fmt.Sprintf(
"http://%s.%s.svc.cluster.local:8000/v1",
model.Name, model.Namespace,
)
model.Status.Endpoint = endpoint
model.Status.Phase = "Ready"
if model.Spec.McpExposure.Enabled {
if err := r.McpClient.RegisterTool(ctx, McpToolDefinition{
Name: model.Spec.McpExposure.ToolName,
Description: model.Spec.McpExposure.ToolDescription,
Endpoint: endpoint,
ModelId: model.Spec.ModelId,
Domains: model.Spec.Domains,
}); err != nil {
return fmt.Errorf("registering MCP tool: %w", err)
}
}
return r.Status().Update(ctx, model)
}
// ensureModelCachePvc creates or updates the PVC for model weight caching.
func (r *LLMModelReconciler) ensureModelCachePvc(
ctx context.Context,
model *aiv1alpha1.LLMModel,
) error {
	// Estimate storage needed: weights plus some overhead.
	// We use vramPerAcceleratorGiB * acceleratorCount * 2 as a heuristic,
	// with a minimum of 20Gi and a maximum of 2Ti.
vram := model.Spec.Resources.VramPerAcceleratorGiB
count := model.Spec.Resources.AcceleratorCount
if count < 1 {
count = 1
}
estimatedGiB := vram * count * 2
if estimatedGiB < 20 {
estimatedGiB = 20
}
if estimatedGiB > 2048 {
estimatedGiB = 2048
}
storageQty := resource.MustParse(fmt.Sprintf("%dGi", estimatedGiB))
pvc := &corev1.PersistentVolumeClaim{
ObjectMeta: metav1.ObjectMeta{
Name: fmt.Sprintf("%s-model-cache", model.Name),
Namespace: model.Namespace,
},
}
_, err := ctrl.CreateOrUpdate(ctx, r.Client, pvc, func() error {
if err := ctrl.SetControllerReference(model, pvc, r.Scheme); err != nil {
return err
}
// Only set spec on creation; PVC spec is immutable after creation.
if pvc.CreationTimestamp.IsZero() {
storageClassName := "standard"
pvc.Spec = corev1.PersistentVolumeClaimSpec{
AccessModes: []corev1.PersistentVolumeAccessMode{
corev1.ReadWriteOnce,
},
Resources: corev1.VolumeResourceRequirements{
Requests: corev1.ResourceList{
corev1.ResourceStorage: storageQty,
},
},
StorageClassName: &storageClassName,
}
}
return nil
})
return err
}
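The sizing rule inside ensureModelCachePvc is easier to reason about, and to unit test, when it is pulled out as a pure function. The sketch below restates it with the factor of 2 that the code applies; the function name cacheSizeGiB is hypothetical, and the third test case is an invented configuration chosen only to hit the upper clamp.

```go
package main

import "fmt"

// cacheSizeGiB restates the sizing rule from ensureModelCachePvc:
// vram * count * 2, clamped to the range [20, 2048] GiB.
// A count below 1 is treated as 1, as in the controller.
func cacheSizeGiB(vramPerAcceleratorGiB, acceleratorCount int) int {
	if acceleratorCount < 1 {
		acceleratorCount = 1
	}
	size := vramPerAcceleratorGiB * acceleratorCount * 2
	if size < 20 {
		size = 20
	}
	if size > 2048 {
		size = 2048
	}
	return size
}

func main() {
	cases := []struct {
		name        string
		vram, count int
	}{
		{"qwen35-9b-cpu", 0, 0},              // no VRAM declared: floor applies
		{"gemma4-31b-nvidia", 24, 1},         // 24 * 1 * 2
		{"hypothetical 8 x 192 GiB", 192, 8}, // 3072, clamped to the ceiling
	}
	for _, c := range cases {
		fmt.Printf("%-26s %4d GiB\n", c.name, cacheSizeGiB(c.vram, c.count))
	}
}
```

The three cases come out at 20, 48, and 2048 GiB. Note that a CPU model declares no VRAM, so it always receives the 20 GiB floor regardless of how large its weights are; a larger GGUF file would need a different input to the heuristic.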
// ensureInferenceDeployment creates or updates the inference server Deployment.
// This function is vendor-agnostic; all vendor-specific decisions come from hwCfg.
func (r *LLMModelReconciler) ensureInferenceDeployment(
ctx context.Context,
model *aiv1alpha1.LLMModel,
hwCfg AcceleratorConfig,
) error {
image, err := inferenceImageFor(model.Spec.InferenceEngine, hwCfg)
if err != nil {
return err
}
// Build the inference server command.
var command []string
var args []string
switch model.Spec.InferenceEngine {
case "vllm":
command = []string{"python3", "-m", "vllm.entrypoints.openai.api_server"}
args = []string{
"--model", model.Spec.ModelId,
"--port", "8000",
"--host", "0.0.0.0",
}
// Append engineArgs in sorted key order for determinism.
keys := make([]string, 0, len(model.Spec.EngineArgs))
for k := range model.Spec.EngineArgs {
keys = append(keys, k)
}
sort.Strings(keys)
for _, k := range keys {
args = append(args, fmt.Sprintf("--%s", k), model.Spec.EngineArgs[k])
}
case "ollama":
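		// Ollama listens on port 11434 by default. The OLLAMA_HOST env var
		// set further down rebinds it to 0.0.0.0:8000, so it shares the
		// Service port with the other engines. Ollama serves its built-in
		// OpenAI-compatible API under /v1 on that port; no separate adapter
		// is involved. Note that engineArgs are not appended to this command.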
command = []string{"/bin/ollama"}
args = []string{"serve"}
case "llamacpp":
command = []string{"/server"}
args = []string{
"--model", "/model-cache/" + model.Spec.ModelId,
"--port", "8000",
"--host", "0.0.0.0",
"--ctx-size", fmt.Sprintf("%d", model.Spec.ContextWindowK*1024),
}
keys := make([]string, 0, len(model.Spec.EngineArgs))
for k := range model.Spec.EngineArgs {
keys = append(keys, k)
}
sort.Strings(keys)
for _, k := range keys {
args = append(args, fmt.Sprintf("--%s", k), model.Spec.EngineArgs[k])
}
}
// Build environment variables.
envVars := []corev1.EnvVar{
{Name: "HF_HOME", Value: "/model-cache"},
}
if model.Spec.ModelSource != nil &&
model.Spec.ModelSource.HuggingFaceTokenSecret != "" {
envVars = append(envVars, corev1.EnvVar{
Name: "HUGGING_FACE_HUB_TOKEN",
ValueFrom: &corev1.EnvVarSource{
SecretKeyRef: &corev1.SecretKeySelector{
LocalObjectReference: corev1.LocalObjectReference{
Name: model.Spec.ModelSource.HuggingFaceTokenSecret,
},
Key: "token",
},
},
})
}
// Append vendor-specific env vars.
envVars = append(envVars, hwCfg.AdditionalEnvVars...)
// For Ollama, set the host binding.
if model.Spec.InferenceEngine == "ollama" {
envVars = append(envVars, corev1.EnvVar{
Name: "OLLAMA_HOST",
Value: "0.0.0.0:8000",
})
}
// Resource requirements.
cpuQty := resource.MustParse(fmt.Sprintf("%dm", model.Spec.Resources.CpuMillicores))
memQty := resource.MustParse(fmt.Sprintf("%dGi", model.Spec.Resources.MemoryGiB))
resourceRequests := corev1.ResourceList{
corev1.ResourceCPU: cpuQty,
corev1.ResourceMemory: memQty,
}
resourceLimits := corev1.ResourceList{
corev1.ResourceCPU: cpuQty,
corev1.ResourceMemory: memQty,
}
if hwCfg.AcceleratorQuantity != nil && hwCfg.ResourceKey != "" {
resourceRequests[corev1.ResourceName(hwCfg.ResourceKey)] = *hwCfg.AcceleratorQuantity
resourceLimits[corev1.ResourceName(hwCfg.ResourceKey)] = *hwCfg.AcceleratorQuantity
}
replicas := int32(model.Spec.Scaling.MinReplicas)
if replicas < 1 {
replicas = 1
}
// Shared memory volume size: 16Gi for single-accelerator, 64Gi for multi.
shmSize := resource.MustParse("16Gi")
if model.Spec.Resources.AcceleratorCount > 1 {
shmSize = resource.MustParse("64Gi")
}
// Volume mounts for the inference container.
volumeMounts := []corev1.VolumeMount{
{Name: "model-cache", MountPath: "/model-cache"},
{Name: "shm", MountPath: "/dev/shm"},
}
// Volumes.
volumes := []corev1.Volume{
{
Name: "model-cache",
VolumeSource: corev1.VolumeSource{
PersistentVolumeClaim: &corev1.PersistentVolumeClaimVolumeSource{
ClaimName: fmt.Sprintf("%s-model-cache", model.Name),
},
},
},
{
Name: "shm",
VolumeSource: corev1.VolumeSource{
EmptyDir: &corev1.EmptyDirVolumeSource{
Medium: corev1.StorageMediumMemory,
SizeLimit: &shmSize,
},
},
},
}
// Init container: download model weights before the server starts.
initContainers := []corev1.Container{
{
Name: "model-downloader",
Image: "huggingface/downloader:latest",
Command: []string{
"huggingface-cli", "download",
model.Spec.ModelId,
"--local-dir", "/model-cache",
},
Env: envVars,
VolumeMounts: []corev1.VolumeMount{
{Name: "model-cache", MountPath: "/model-cache"},
},
},
}
// Pod spec.
podSpec := corev1.PodSpec{
Tolerations: hwCfg.Tolerations,
NodeSelector: hwCfg.NodeSelectorLabels,
HostIPC: hwCfg.HostIPC,
HostNetwork: hwCfg.HostNetwork,
InitContainers: initContainers,
Containers: []corev1.Container{
{
Name: "inference-server",
Image: image,
Command: command,
Args: args,
Env: envVars,
Ports: []corev1.ContainerPort{
{Name: "http", ContainerPort: 8000, Protocol: corev1.ProtocolTCP},
},
Resources: corev1.ResourceRequirements{
Requests: resourceRequests,
Limits: resourceLimits,
},
VolumeMounts: volumeMounts,
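// The startup probe allows up to 15 minutes (90 x 10s) for the model
// weights to load. Readiness and liveness checks only begin once it
// succeeds, so a slow load cannot trip the liveness probe and restart
// the container halfway through.
StartupProbe: &corev1.Probe{
ProbeHandler: corev1.ProbeHandler{
HTTPGet: &corev1.HTTPGetAction{
Path: "/health",
Port: intstr.FromInt(8000),
},
},
PeriodSeconds: 10,
FailureThreshold: 90,
},
ReadinessProbe: &corev1.Probe{
ProbeHandler: corev1.ProbeHandler{
HTTPGet: &corev1.HTTPGetAction{
Path: "/health",
Port: intstr.FromInt(8000),
},
},
PeriodSeconds: 10,
FailureThreshold: 3,
},
LivenessProbe: &corev1.Probe{
ProbeHandler: corev1.ProbeHandler{
HTTPGet: &corev1.HTTPGetAction{
Path: "/health",
Port: intstr.FromInt(8000),
},
},
PeriodSeconds: 30,
FailureThreshold: 5,
},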
},
},
Volumes: volumes,
}
deployment := &appsv1.Deployment{
ObjectMeta: metav1.ObjectMeta{
Name: model.Name,
Namespace: model.Namespace,
},
}
_, err = ctrl.CreateOrUpdate(ctx, r.Client, deployment, func() error {
if err := ctrl.SetControllerReference(model, deployment, r.Scheme); err != nil {
return err
}
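// Seed the replica count only when the Deployment is first created.
// Afterwards KEDA (through its HPA) owns spec.replicas, and resetting it
// on every reconcile would undo scale-up and scale-to-zero decisions.
desiredReplicas := deployment.Spec.Replicas
if desiredReplicas == nil {
desiredReplicas = &replicas
}
deployment.Spec = appsv1.DeploymentSpec{
Replicas: desiredReplicas,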
Selector: &metav1.LabelSelector{
MatchLabels: map[string]string{
"app": model.Name,
"ai.example.io/model-name": model.Name,
},
},
Template: corev1.PodTemplateSpec{
ObjectMeta: metav1.ObjectMeta{
Labels: map[string]string{
"app": model.Name,
"ai.example.io/model-name": model.Name,
"ai.example.io/engine": model.Spec.InferenceEngine,
"ai.example.io/vendor": model.Spec.AcceleratorVendor,
},
Annotations: map[string]string{
"prometheus.io/scrape": "true",
"prometheus.io/port": "8000",
"prometheus.io/path": "/metrics",
},
},
Spec: podSpec,
},
}
return nil
})
return err
}
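The vllm and llamacpp branches above repeat the same sorted-key loop. As a minimal sketch, that loop could be pulled into one helper; the name appendEngineArgs is hypothetical and not part of the controller listing. Sorting matters because Go randomizes map iteration order, so an unsorted loop could produce a different argument order on each reconcile and trigger a needless rollout of the pod template.

```go
package main

import (
	"fmt"
	"sort"
)

// appendEngineArgs appends every key/value pair as "--key value", visiting
// keys in sorted order so the result is identical on every call.
// (Hypothetical helper; the controller listing inlines this loop.)
func appendEngineArgs(args []string, engineArgs map[string]string) []string {
	keys := make([]string, 0, len(engineArgs))
	for k := range engineArgs {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		args = append(args, "--"+k, engineArgs[k])
	}
	return args
}

func main() {
	base := []string{"--model", "example/model", "--port", "8000"}
	extra := map[string]string{
		"tensor-parallel-size": "2",
		"max-model-len":        "8192",
	}
	// max-model-len sorts before tensor-parallel-size, regardless of the
	// order in which the map happens to be iterated.
	fmt.Println(appendEngineArgs(base, extra))
}
```

Both engine branches could then call the helper with their own base arguments, which keeps the ordering rule in one place.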
// ensureInferenceService creates or updates the ClusterIP Service.
func (r *LLMModelReconciler) ensureInferenceService(
ctx context.Context,
model *aiv1alpha1.LLMModel,
) error {
svc := &corev1.Service{
ObjectMeta: metav1.ObjectMeta{
Name: model.Name,
Namespace: model.Namespace,
},
}
_, err := ctrl.CreateOrUpdate(ctx, r.Client, svc, func() error {
if err := ctrl.SetControllerReference(model, svc, r.Scheme); err != nil {
return err
}
svc.Spec = corev1.ServiceSpec{
Selector: map[string]string{
"ai.example.io/model-name": model.Name,
},
Ports: []corev1.ServicePort{
{
Name: "http",
Port: 8000,
TargetPort: intstr.FromInt(8000),
Protocol: corev1.ProtocolTCP,
},
},
Type: corev1.ServiceTypeClusterIP,
}
return nil
})
return err
}
// ensureKedaScaledObject creates or updates the KEDA ScaledObject.
// vLLM emits the same Prometheus metrics regardless of the underlying
// accelerator vendor, so the KEDA configuration is vendor-agnostic.
func (r *LLMModelReconciler) ensureKedaScaledObject(
ctx context.Context,
model *aiv1alpha1.LLMModel,
) error {
// KEDA ScaledObject is a custom resource. We use unstructured to avoid
// importing the KEDA API package as a dependency.
scaledObject := map[string]interface{}{
"apiVersion": "keda.sh/v1alpha1",
"kind": "ScaledObject",
"metadata": map[string]interface{}{
"name": model.Name,
"namespace": model.Namespace,
"ownerReferences": []interface{}{
map[string]interface{}{
"apiVersion": "ai.example.io/v1alpha1",
"kind": "LLMModel",
"name": model.Name,
"uid": string(model.UID),
"controller": true,
"blockOwnerDeletion": true,
},
},
},
"spec": map[string]interface{}{
"scaleTargetRef": map[string]interface{}{
"apiVersion": "apps/v1",
"kind": "Deployment",
"name": model.Name,
},
"minReplicaCount": int64(model.Spec.Scaling.MinReplicas),
"maxReplicaCount": int64(model.Spec.Scaling.MaxReplicas),
"cooldownPeriod": int64(model.Spec.Scaling.CooldownPeriodSeconds),
"pollingInterval": int64(15),
"triggers": []interface{}{
map[string]interface{}{
"type": "prometheus",
"metadata": map[string]interface{}{
"serverAddress": "http://prometheus-server.monitoring.svc.cluster.local:9090",
"metricName": "vllm_requests_waiting",
"query": fmt.Sprintf(
`sum(vllm:num_requests_waiting{namespace="%s",pod=~"%s-.*"})`,
model.Namespace, model.Name,
),
"threshold": fmt.Sprintf("%d", model.Spec.Scaling.ScaleUpThreshold),
"activationThreshold": "1",
},
},
},
},
}
// NOTE: this listing builds the ScaledObject but does not apply it.
// To apply it from the controller, wrap the map in an
// unstructured.Unstructured and create or patch it through the
// controller's client, or import the KEDA API types instead.
// Until then, the equivalent manifests in config/keda/ are applied
// separately.
_ = scaledObject
return nil
}
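The query in the trigger above interpolates the model name straight into a PromQL regex matcher. Kubernetes object names may contain dots, which a regex treats as "any character", so a stricter variant might escape the name first. The sketch below is one way to do that with the standard library only; queueDepthQuery is a hypothetical helper name, and the second model name in main is made up for illustration.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// queueDepthQuery builds the PromQL expression for the KEDA trigger.
// regexp.QuoteMeta escapes regex metacharacters in the model name, and
// strconv.Quote emits a double-quoted literal with backslashes escaped,
// which matches PromQL's double-quoted string syntax.
func queueDepthQuery(namespace, name string) string {
	podPattern := regexp.QuoteMeta(name) + "-.*"
	return fmt.Sprintf(
		"sum(vllm:num_requests_waiting{namespace=%s,pod=~%s})",
		strconv.Quote(namespace), strconv.Quote(podPattern),
	)
}

func main() {
	// A name without metacharacters yields the same query as the listing.
	fmt.Println(queueDepthQuery("ai-models", "gemma4-26b-amd"))
	// A dotted name (hypothetical) gets its dot escaped.
	fmt.Println(queueDepthQuery("ai-models", "qwen3.6-35b"))
}
```

For names made only of letters, digits, and hyphens the output is unchanged, so the escaping only comes into play for the names that would otherwise over-match.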
// reconcileRemoteModel creates a lightweight proxy for remote API models.
func (r *LLMModelReconciler) reconcileRemoteModel(
ctx context.Context,
model *aiv1alpha1.LLMModel,
) error {
if model.Spec.RemoteApi == nil {
return fmt.Errorf(
"LLMModel %s has deploymentMode=remote but no remoteApi spec",
model.Name,
)
}
// ConfigMap with proxy configuration.
configMap := &corev1.ConfigMap{
ObjectMeta: metav1.ObjectMeta{
Name: fmt.Sprintf("%s-proxy-config", model.Name),
Namespace: model.Namespace,
},
}
_, err := ctrl.CreateOrUpdate(ctx, r.Client, configMap, func() error {
if err := ctrl.SetControllerReference(model, configMap, r.Scheme); err != nil {
return err
}
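// The proxy container reads PROXY_CONFIG_PATH (/etc/proxy/config.yaml),
// so the settings are stored under a single config.yaml key; each
// ConfigMap key becomes one file in the mounted directory.
configMap.Data = map[string]string{
"config.yaml": fmt.Sprintf(
"upstream_url: %q\nmodel_id: %q\nrate_limit_rpm: %d\n",
model.Spec.RemoteApi.BaseUrl,
model.Spec.ModelId,
model.Spec.RemoteApi.RateLimitRpm,
),
}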
return nil
})
if err != nil {
return fmt.Errorf("ensuring proxy ConfigMap: %w", err)
}
// Proxy Deployment — no GPU resources, minimal footprint.
proxyReplicas := int32(2)
proxyDeployment := &appsv1.Deployment{
ObjectMeta: metav1.ObjectMeta{
Name: model.Name,
Namespace: model.Namespace,
},
}
_, err = ctrl.CreateOrUpdate(ctx, r.Client, proxyDeployment, func() error {
if err := ctrl.SetControllerReference(model, proxyDeployment, r.Scheme); err != nil {
return err
}
proxyDeployment.Spec = appsv1.DeploymentSpec{
Replicas: &proxyReplicas,
Selector: &metav1.LabelSelector{
MatchLabels: map[string]string{
"app": model.Name,
"ai.example.io/model-name": model.Name,
},
},
Template: corev1.PodTemplateSpec{
ObjectMeta: metav1.ObjectMeta{
Labels: map[string]string{
"app": model.Name,
"ai.example.io/model-name": model.Name,
"ai.example.io/mode": "remote-proxy",
},
},
Spec: corev1.PodSpec{
Containers: []corev1.Container{
{
Name: "api-proxy",
Image: "ghcr.io/example/llm-api-proxy:v1.2.0",
Ports: []corev1.ContainerPort{
{ContainerPort: 8000, Protocol: corev1.ProtocolTCP},
},
Env: []corev1.EnvVar{
{
Name: "PROXY_API_KEY",
ValueFrom: &corev1.EnvVarSource{
SecretKeyRef: &corev1.SecretKeySelector{
LocalObjectReference: corev1.LocalObjectReference{
Name: model.Spec.RemoteApi.ApiKeySecret,
},
Key: "apiKey",
},
},
},
{
Name: "PROXY_CONFIG_PATH",
Value: "/etc/proxy/config.yaml",
},
},
Resources: corev1.ResourceRequirements{
Requests: corev1.ResourceList{
corev1.ResourceCPU: resource.MustParse("100m"),
corev1.ResourceMemory: resource.MustParse("128Mi"),
},
Limits: corev1.ResourceList{
corev1.ResourceCPU: resource.MustParse("500m"),
corev1.ResourceMemory: resource.MustParse("256Mi"),
},
},
VolumeMounts: []corev1.VolumeMount{
{Name: "proxy-config", MountPath: "/etc/proxy"},
},
ReadinessProbe: &corev1.Probe{
ProbeHandler: corev1.ProbeHandler{
HTTPGet: &corev1.HTTPGetAction{
Path: "/health",
Port: intstr.FromInt(8000),
},
},
InitialDelaySeconds: 5,
PeriodSeconds: 10,
FailureThreshold: 3,
},
},
},
Volumes: []corev1.Volume{
{
Name: "proxy-config",
VolumeSource: corev1.VolumeSource{
ConfigMap: &corev1.ConfigMapVolumeSource{
LocalObjectReference: corev1.LocalObjectReference{
Name: fmt.Sprintf("%s-proxy-config", model.Name),
},
},
},
},
},
},
},
}
return nil
})
if err != nil {
return fmt.Errorf("ensuring proxy deployment: %w", err)
}
// Ensure the Service for the proxy.
if err := r.ensureInferenceService(ctx, model); err != nil {
return fmt.Errorf("ensuring proxy service: %w", err)
}
endpoint := fmt.Sprintf(
"http://%s.%s.svc.cluster.local:8000/v1",
model.Name, model.Namespace,
)
model.Status.Endpoint = endpoint
model.Status.Phase = "Proxying"
if model.Spec.McpExposure.Enabled {
if err := r.McpClient.RegisterTool(ctx, McpToolDefinition{
Name: model.Spec.McpExposure.ToolName,
Description: model.Spec.McpExposure.ToolDescription,
Endpoint: endpoint,
ModelId: model.Spec.ModelId,
Domains: model.Spec.Domains,
}); err != nil {
return fmt.Errorf("registering MCP tool for remote model: %w", err)
}
}
return r.Status().Update(ctx, model)
}
// cleanupMcpRegistration removes the MCP tool registration when the
// LLMModel is being deleted.
func (r *LLMModelReconciler) cleanupMcpRegistration(
ctx context.Context,
model *aiv1alpha1.LLMModel,
) error {
if !model.Spec.McpExposure.Enabled || model.Spec.McpExposure.ToolName == "" {
return nil
}
logger := log.FromContext(ctx)
logger.Info("Unregistering MCP tool", "toolName", model.Spec.McpExposure.ToolName)
return r.McpClient.UnregisterTool(ctx, model.Spec.McpExposure.ToolName)
}
// SetupWithManager registers the controller with the controller-runtime manager.
func (r *LLMModelReconciler) SetupWithManager(mgr ctrl.Manager) error {
return ctrl.NewControllerManagedBy(mgr).
For(&aiv1alpha1.LLMModel{}).
Owns(&appsv1.Deployment{}).
Owns(&corev1.Service{}).
Owns(&corev1.PersistentVolumeClaim{}).
Owns(&corev1.ConfigMap{}).
Complete(r)
}
// Helper functions.
func containsString(slice []string, s string) bool {
for _, item := range slice {
if item == s {
return true
}
}
return false
}
func removeString(slice []string, s string) []string {
result := make([]string, 0, len(slice))
for _, item := range slice {
if item != s {
result = append(result, item)
}
}
return result
}
FILE: main.go
package main
import (
"flag"
"os"
"k8s.io/apimachinery/pkg/runtime"
utilruntime "k8s.io/apimachinery/pkg/util/runtime"
clientgoscheme "k8s.io/client-go/kubernetes/scheme"
ctrl "sigs.k8s.io/controller-runtime"
"sigs.k8s.io/controller-runtime/pkg/healthz"
"sigs.k8s.io/controller-runtime/pkg/log/zap"
"sigs.k8s.io/controller-runtime/pkg/metrics/server"
aiv1alpha1 "github.com/example/llm-operator/api/v1alpha1"
"github.com/example/llm-operator/controllers"
)
var (
scheme = runtime.NewScheme()
setupLog = ctrl.Log.WithName("setup")
)
func init() {
utilruntime.Must(clientgoscheme.AddToScheme(scheme))
utilruntime.Must(aiv1alpha1.AddToScheme(scheme))
}
func main() {
var metricsAddr string
var enableLeaderElection bool
var probeAddr string
var mcpServerAddr string
flag.StringVar(&metricsAddr, "metrics-bind-address", ":8080",
"The address the metric endpoint binds to.")
flag.StringVar(&probeAddr, "health-probe-bind-address", ":8081",
"The address the probe endpoint binds to.")
flag.BoolVar(&enableLeaderElection, "leader-elect", false,
"Enable leader election for controller manager.")
flag.StringVar(&mcpServerAddr, "mcp-server-address",
"http://mcp-server.ai-models.svc.cluster.local:3000",
"Address of the MCP server for tool registration.")
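// Bind the zap logging flags before parsing; UseFlagOptions only sees
// flags that were registered on the flag set.
opts := zap.Options{Development: true}
opts.BindFlags(flag.CommandLine)
flag.Parse()
ctrl.SetLogger(zap.New(zap.UseFlagOptions(&opts)))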
mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
Scheme: scheme,
Metrics: server.Options{
BindAddress: metricsAddr,
},
HealthProbeBindAddress: probeAddr,
LeaderElection: enableLeaderElection,
LeaderElectionID: "ai.example.io",
})
if err != nil {
setupLog.Error(err, "unable to start manager")
os.Exit(1)
}
mcpClient := controllers.NewHttpMcpClient(mcpServerAddr)
if err = (&controllers.LLMModelReconciler{
Client: mgr.GetClient(),
Scheme: mgr.GetScheme(),
McpClient: mcpClient,
}).SetupWithManager(mgr); err != nil {
setupLog.Error(err, "unable to create controller", "controller", "LLMModel")
os.Exit(1)
}
if err := mgr.AddHealthzCheck("healthz", healthz.Ping); err != nil {
setupLog.Error(err, "unable to set up health check")
os.Exit(1)
}
if err := mgr.AddReadyzCheck("readyz", healthz.Ping); err != nil {
setupLog.Error(err, "unable to set up ready check")
os.Exit(1)
}
setupLog.Info("starting manager")
if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
setupLog.Error(err, "problem running manager")
os.Exit(1)
}
}
FILE: controllers/mcp_client.go
package controllers

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
	"time"
)

// HttpMcpClient implements McpRegistryClient by calling the MCP server's
// internal registration REST API. This admin API is separate from the MCP
// protocol itself; only the controller uses it, to push registrations into
// the MCP server's in-memory tool registry.
type HttpMcpClient struct {
	baseURL    string
	httpClient *http.Client
}

// NewHttpMcpClient creates a new HttpMcpClient.
func NewHttpMcpClient(baseURL string) *HttpMcpClient {
	return &HttpMcpClient{
		baseURL: baseURL,
		httpClient: &http.Client{
			Timeout: 10 * time.Second,
		},
	}
}

// toolURL returns the admin endpoint for one tool. The name is path-escaped
// because it comes from user-supplied spec fields.
func (c *HttpMcpClient) toolURL(toolName string) string {
	return fmt.Sprintf("%s/admin/tools/%s", c.baseURL, url.PathEscape(toolName))
}

// RegisterTool calls the MCP server's /admin/tools endpoint to register
// or update a tool definition.
func (c *HttpMcpClient) RegisterTool(
	ctx context.Context,
	tool McpToolDefinition,
) error {
	body, err := json.Marshal(map[string]interface{}{
		"name":        tool.Name,
		"description": tool.Description,
		"endpoint":    tool.Endpoint,
		"modelId":     tool.ModelId,
		"domains":     tool.Domains,
	})
	if err != nil {
		return fmt.Errorf("marshalling tool definition: %w", err)
	}
	req, err := http.NewRequestWithContext(
		ctx,
		http.MethodPut,
		c.toolURL(tool.Name),
		bytes.NewReader(body),
	)
	if err != nil {
		return fmt.Errorf("creating register request: %w", err)
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := c.httpClient.Do(req)
	if err != nil {
		return fmt.Errorf("calling MCP server register: %w", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK && resp.StatusCode != http.StatusCreated {
		return fmt.Errorf("MCP server register returned status %d", resp.StatusCode)
	}
	return nil
}

// UnregisterTool calls the MCP server's /admin/tools endpoint to remove
// a tool definition.
func (c *HttpMcpClient) UnregisterTool(ctx context.Context, toolName string) error {
	req, err := http.NewRequestWithContext(
		ctx,
		http.MethodDelete,
		c.toolURL(toolName),
		nil,
	)
	if err != nil {
		return fmt.Errorf("creating unregister request: %w", err)
	}
	resp, err := c.httpClient.Do(req)
	if err != nil {
		return fmt.Errorf("calling MCP server unregister: %w", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK && resp.StatusCode != http.StatusNoContent {
		return fmt.Errorf("MCP server unregister returned status %d", resp.StatusCode)
	}
	return nil
}
FILE: Dockerfile (controller)
# syntax=docker/dockerfile:1
FROM golang:1.22-alpine AS builder
WORKDIR /workspace
COPY go.mod go.sum ./
RUN go mod download
COPY api/ api/
COPY controllers/ controllers/
COPY main.go main.go
RUN CGO_ENABLED=0 GOOS=linux GOARCH=amd64 \
go build -a -o manager main.go
FROM gcr.io/distroless/static:nonroot
WORKDIR /
COPY --from=builder /workspace/manager .
USER 65532:65532
ENTRYPOINT ["/manager"]
CHAPTER SIX: RBAC AND CONTROLLER DEPLOYMENT MANIFESTS
FILE: config/rbac/serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: llm-operator-controller
  namespace: llm-operator-system
FILE: config/rbac/role.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: llm-operator-manager-role
rules:
  # LLMModel resources: full access
  - apiGroups: ["ai.example.io"]
    resources: ["llmmodels"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: ["ai.example.io"]
    resources: ["llmmodels/status"]
    verbs: ["get", "update", "patch"]
  - apiGroups: ["ai.example.io"]
    resources: ["llmmodels/finalizers"]
    verbs: ["update"]
  # Core Kubernetes resources the controller manages
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: [""]
    resources: ["services", "persistentvolumeclaims", "configmaps"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: [""]
    resources: ["events"]
    verbs: ["create", "patch"]
  # KEDA ScaledObjects
  - apiGroups: ["keda.sh"]
    resources: ["scaledobjects"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  # Leader election
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
FILE: config/rbac/rolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: llm-operator-manager-rolebinding
subjects:
  - kind: ServiceAccount
    name: llm-operator-controller
    namespace: llm-operator-system
roleRef:
  kind: ClusterRole
  name: llm-operator-manager-role
  apiGroup: rbac.authorization.k8s.io
FILE: config/manager/manager.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-operator-controller
  namespace: llm-operator-system
  labels:
    control-plane: controller-manager
spec:
  replicas: 1
  selector:
    matchLabels:
      control-plane: controller-manager
  template:
    metadata:
      labels:
        control-plane: controller-manager
    spec:
      serviceAccountName: llm-operator-controller
      terminationGracePeriodSeconds: 10
      containers:
        - name: manager
          image: ghcr.io/example/llm-operator:v1.0.0
          command:
            - /manager
          args:
            - --leader-elect
            - --mcp-server-address=http://mcp-server.ai-models.svc.cluster.local:3000
          ports:
            - name: metrics
              containerPort: 8080
            - name: health
              containerPort: 8081
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8081
            initialDelaySeconds: 15
            periodSeconds: 20
          readinessProbe:
            httpGet:
              path: /readyz
              port: 8081
            initialDelaySeconds: 5
            periodSeconds: 10
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 256Mi
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - ALL
            readOnlyRootFilesystem: true
            runAsNonRoot: true
      securityContext:
        runAsNonRoot: true
CHAPTER SEVEN: AUTOSCALING WITH KEDA (VENDOR-AGNOSTIC)
vLLM emits the same Prometheus metrics regardless of the underlying accelerator
vendor (NVIDIA, AMD, or Intel Gaudi). This means our KEDA configuration is
completely vendor-agnostic: we always scale on vllm:num_requests_waiting,
which reflects the depth of the inference queue.
The KEDA ScaledObjects are applied as separate manifests in config/keda/. The
controller's ensureKedaScaledObject function builds an equivalent object, but
the listing earlier in this guide stops short of applying it, so these YAML
files are both the reference and what you actually apply to the cluster.
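To make the scaling arithmetic concrete, here is a minimal sketch of how the replica count follows from the queue depth. It assumes the simplified steady-state rule of ceil(queueDepth / threshold) clamped to the configured bounds, with the activation threshold deciding only whether a deployment leaves zero. It ignores HPA tolerance, stabilization windows, and the cooldown period, and the type and function names are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// scalingPolicy mirrors the ScaledObject fields used in this chapter.
type scalingPolicy struct {
	MinReplicas         int
	MaxReplicas         int
	Threshold           float64 // target waiting requests per replica
	ActivationThreshold float64 // must be exceeded to leave zero replicas
}

// desiredReplicas approximates the steady-state replica count for a total
// queue depth. This is a simplification of what KEDA and the HPA do.
func desiredReplicas(p scalingPolicy, queueDepth float64) int {
	// Activation only matters when scale-to-zero is enabled.
	if p.MinReplicas == 0 && queueDepth <= p.ActivationThreshold {
		return 0
	}
	n := int(math.Ceil(queueDepth / p.Threshold))
	// Once active, at least one replica runs even if minReplicas is 0.
	lower := p.MinReplicas
	if lower < 1 {
		lower = 1
	}
	if n < lower {
		n = lower
	}
	if n > p.MaxReplicas {
		n = p.MaxReplicas
	}
	return n
}

func main() {
	// Values taken from the gemma4-26b-amd ScaledObject in this chapter.
	amd := scalingPolicy{MinReplicas: 0, MaxReplicas: 3, Threshold: 3, ActivationThreshold: 1}
	for _, depth := range []float64{0, 1, 2, 7, 30} {
		fmt.Printf("queue=%v replicas=%d\n", depth, desiredReplicas(amd, depth))
	}
}
```

With these values a queue depth of 7 asks for ceil(7 / 3) = 3 replicas, and a depth of 30 is capped at maxReplicaCount.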
FILE: config/keda/scaledobject-gemma4-26b-amd.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: gemma4-26b-amd
  namespace: ai-models
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gemma4-26b-amd
  # minReplicaCount: 0 enables scale-to-zero.
  # Set to 1 if cold-start latency (model loading) is unacceptable.
  minReplicaCount: 0
  maxReplicaCount: 3
  # cooldownPeriod: how long to wait after the last trigger reported
  # active before scaling back to zero. It applies only to the step to
  # zero; scaling between 1 and N replicas is handled by the HPA.
  # 300 seconds prevents thrashing on bursty workloads.
  cooldownPeriod: 300
  pollingInterval: 15
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-server.monitoring.svc.cluster.local:9090
        metricName: vllm_requests_waiting
        # Scale up when more than 3 requests are waiting per replica.
        # The target works out to roughly ceil(metricValue / threshold)
        # replicas.
        query: >
          sum(vllm:num_requests_waiting{
            namespace="ai-models",
            pod=~"gemma4-26b-amd-.*"
          })
        threshold: "3"
        # activationThreshold: a scaled-to-zero deployment wakes up only
        # when the query result is strictly greater than this value, so
        # "1" means at least 2 waiting requests. Use "0" to wake on the
        # first one.
        activationThreshold: "1"
FILE: config/keda/scaledobject-qwen36-35b-gaudi.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: qwen36-35b-gaudi
  namespace: ai-models
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: qwen36-35b-gaudi
  minReplicaCount: 0
  maxReplicaCount: 2
  cooldownPeriod: 300
  pollingInterval: 15
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-server.monitoring.svc.cluster.local:9090
        metricName: vllm_requests_waiting
        query: >
          sum(vllm:num_requests_waiting{
            namespace="ai-models",
            pod=~"qwen36-35b-gaudi-.*"
          })
        threshold: "2"
        activationThreshold: "1"
FILE: config/keda/scaledobject-gemma4-31b-nvidia.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: gemma4-31b-nvidia
  namespace: ai-models
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gemma4-31b-nvidia
  minReplicaCount: 1
  maxReplicaCount: 2
  cooldownPeriod: 300
  pollingInterval: 15
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-server.monitoring.svc.cluster.local:9090
        metricName: vllm_requests_waiting
        query: >
          sum(vllm:num_requests_waiting{
            namespace="ai-models",
            pod=~"gemma4-31b-nvidia-.*"
          })
        threshold: "3"
        activationThreshold: "1"
KEDA HTTP Add-On for Scale-to-Zero with Request Buffering:
FILE: config/keda/http-scaledobject-gemma4-26b-amd.yaml
# The HTTP add-on intercepts requests, holds them while the deployment
# scales up from zero, and forwards them once a pod is ready.
# This prevents request failures during cold starts.
apiVersion: http.keda.sh/v1alpha1
kind: HTTPScaledObject
metadata:
  name: gemma4-26b-amd-http
  namespace: ai-models
spec:
  hosts:
    - gemma4-26b-amd.ai-models.svc.cluster.local
  scaleTargetRef:
    name: gemma4-26b-amd
    port: 8000
  replicas:
    min: 0
    max: 3
  # scaledownPeriod: seconds without traffic before the deployment is
  # scaled back down. How long buffered requests are held during a cold
  # start is a separate timeout on the add-on; for large models it needs
  # to cover pod start plus model loading (for example, 600 seconds).
  scaledownPeriod: 300
CHAPTER EIGHT: PROMETHEUS RECORDING RULES AND MONITORING
FILE: config/monitoring/prometheus-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: llm-operator-rules
  namespace: monitoring
  labels:
    # This label causes the Prometheus Operator to pick up this rule.
    prometheus: kube-prometheus
    role: alert-rules
spec:
  groups:
    # ----------------------------------------------------------------
    # vLLM metrics: vendor-agnostic (same metric names for all vendors)
    # ----------------------------------------------------------------
    - name: vllm.rules
      interval: 15s
      rules:
        - record: vllm:num_requests_running:sum
          expr: sum by (namespace, app) (vllm:num_requests_running)
        - record: vllm:num_requests_waiting:sum
          expr: sum by (namespace, app) (vllm:num_requests_waiting)
        - record: vllm:time_to_first_token_ms:p50
          expr: >
            histogram_quantile(0.50,
              sum by (namespace, app, le) (
                rate(vllm:time_to_first_token_seconds_bucket[5m])
              )
            ) * 1000
        - record: vllm:time_to_first_token_ms:p95
          expr: >
            histogram_quantile(0.95,
              sum by (namespace, app, le) (
                rate(vllm:time_to_first_token_seconds_bucket[5m])
              )
            ) * 1000
        - record: vllm:tokens_per_second:rate5m
          expr: >
            sum by (namespace, app) (
              rate(vllm:generation_tokens_total[5m])
            )
    # ----------------------------------------------------------------
    # NVIDIA GPU metrics (DCGM exporter)
    # ----------------------------------------------------------------
    - name: nvidia.gpu.rules
      interval: 15s
      rules:
        - record: nvidia:gpu_memory_used_gib:avg
          expr: >
            avg by (node, gpu) (
              DCGM_FI_DEV_FB_USED / 1024
            )
        - record: nvidia:gpu_utilization_pct:avg
          expr: >
            avg by (node, gpu) (
              DCGM_FI_DEV_GPU_UTIL
            )
    # ----------------------------------------------------------------
    # AMD GPU metrics (ROCm SMI exporter)
    # ----------------------------------------------------------------
    - name: amd.gpu.rules
      interval: 15s
      rules:
        - record: amd:gpu_memory_used_gib:avg
          expr: >
            avg by (node, gpu) (
              rocm_smi_memory_used_bytes / 1073741824
            )
        - record: amd:gpu_utilization_pct:avg
          expr: >
            avg by (node, gpu) (
              rocm_smi_gpu_use_percent
            )
    # ----------------------------------------------------------------
    # Intel Gaudi metrics (Gaudi metrics exporter)
    # ----------------------------------------------------------------
    - name: gaudi.rules
      interval: 15s
      rules:
        - record: gaudi:memory_used_gib:avg
          expr: >
            avg by (node, device) (
              habana_gaudi_memory_used_bytes / 1073741824
            )
        - record: gaudi:utilization_pct:avg
          expr: >
            avg by (node, device) (
              habana_gaudi_util_percent
            )
    # ----------------------------------------------------------------
    # Alerting rules
    # ----------------------------------------------------------------
    - name: llm.alerts
      rules:
        - alert: LLMHighQueueDepth
          expr: vllm:num_requests_waiting:sum > 20
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High inference queue depth for {{ $labels.app }}"
            description: >
              {{ $labels.app }} has {{ $value }} requests waiting.
              Consider increasing maxReplicas in the LLMModel spec.
        - alert: LLMHighLatency
          expr: vllm:time_to_first_token_ms:p95 > 10000
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High TTFT latency for {{ $labels.app }}"
            description: >
              P95 time-to-first-token for {{ $labels.app }} is
              {{ $value }}ms, exceeding the 10s threshold.
CHAPTER NINE: THE MCP SERVER
FILE: mcp-server/package.json
{
"name": "example-llm-mcp-server",
"version": "1.0.0",
"description": "MCP server exposing LLMModel resources as AI agent tools",
"main": "dist/index.js",
"scripts": {
"build": "tsc",
"start": "node dist/index.js",
"dev": "ts-node src/index.ts"
},
"dependencies": {
"@kubernetes/client-node": "^0.21.0",
"@modelcontextprotocol/sdk": "^1.12.0",
"express": "^4.18.2",
"openai": "^4.52.0",
"zod": "^3.23.8"
},
"devDependencies": {
"@types/express": "^4.17.21",
"@types/node": "^20.0.0",
"typescript": "^5.4.0",
"ts-node": "^10.9.2"
}
}
FILE: mcp-server/tsconfig.json
{
"compilerOptions": {
"target": "ES2022",
"module": "commonjs",
"lib": ["ES2022"],
"outDir": "./dist",
"rootDir": "./src",
"strict": true,
"esModuleInterop": true,
"skipLibCheck": true,
"forceConsistentCasingInFileNames": true,
"resolveJsonModule": true
},
"include": ["src/**/*"],
"exclude": ["node_modules", "dist"]
}
FILE: mcp-server/src/index.ts
// MCP server that exposes LLMModel resources as callable tools.
// Implements the November 2025 MCP specification.
// Supports dynamic tool registration via the /admin/tools REST API,
// which the Kubernetes controller calls when models are added or removed.
import { randomUUID } from "crypto";
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StreamableHTTPServerTransport } from "@modelcontextprotocol/sdk/server/streamableHttp.js";
import { z } from "zod";
import * as k8s from "@kubernetes/client-node";
import OpenAI from "openai";
import express, { Request, Response } from "express";
// ---------------------------------------------------------------------------
// Types
// ---------------------------------------------------------------------------
interface LLMModelRecord {
name: string;
toolName: string;
toolDescription: string;
endpoint: string;
modelId: string;
domains: string[];
supportsThinking: boolean;
contextWindowK: number;
phase: string;
}
interface AdminToolPayload {
name: string;
description: string;
endpoint: string;
modelId: string;
domains: string[];
}
// ---------------------------------------------------------------------------
// Model Registry
// Maintains a live view of available LLMModel resources via Kubernetes watch.
// Also accepts push updates from the controller via the /admin/tools API.
// ---------------------------------------------------------------------------
class ModelRegistry {
private models: Map<string, LLMModelRecord> = new Map();
private kc: k8s.KubeConfig;
private customApi: k8s.CustomObjectsApi;
// Callbacks registered by the MCP server to be notified when the
// tool list changes, so it can send MCP tools/list_changed notifications.
private changeCallbacks: Array<() => void> = [];
constructor() {
this.kc = new k8s.KubeConfig();
try {
this.kc.loadFromCluster();
} catch {
this.kc.loadFromDefault();
}
this.customApi = this.kc.makeApiClient(k8s.CustomObjectsApi);
}
onToolListChanged(cb: () => void): void {
this.changeCallbacks.push(cb);
}
private notifyChanged(): void {
for (const cb of this.changeCallbacks) {
try { cb(); } catch { /* ignore */ }
}
}
async start(): Promise<void> {
const namespace = process.env.MODEL_NAMESPACE || "ai-models";
// Initial list to populate the registry before starting the watch.
try {
const list = await this.customApi.listNamespacedCustomObject(
"ai.example.io",
"v1alpha1",
namespace,
"llmmodels",
);
const items = (list as any).body?.items ?? [];
for (const item of items) {
this.upsertFromK8s(item);
}
console.log(`Loaded ${this.models.size} models from Kubernetes registry`);
} catch (err) {
console.warn("Could not list LLMModels from Kubernetes:", err);
}
// Start watch for real-time updates.
this.startWatch(namespace);
}
private startWatch(namespace: string): void {
const watch = new k8s.Watch(this.kc);
watch.watch(
`/apis/ai.example.io/v1alpha1/namespaces/${namespace}/llmmodels`,
{},
(type: string, obj: any) => {
if (type === "ADDED" || type === "MODIFIED") {
this.upsertFromK8s(obj);
} else if (type === "DELETED") {
const toolName = obj.spec?.mcpExposure?.toolName;
if (toolName) {
this.models.delete(toolName);
console.log(`Watch: unregistered model tool ${toolName}`);
this.notifyChanged();
}
}
},
(err: any) => {
console.error("Watch error, reconnecting in 5s:", err);
setTimeout(() => this.startWatch(namespace), 5000);
},
);
}
private upsertFromK8s(obj: any): void {
const spec = obj.spec ?? {};
const status = obj.status ?? {};
const mcpExposure = spec.mcpExposure ?? {};
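if (!mcpExposure.enabled || !mcpExposure.toolName) return;
if (!["Ready", "Proxying"].includes(status.phase ?? "")) {
// The model is not in a servable phase: drop any stale registration so
// agents stop seeing a tool that can no longer answer.
if (this.models.delete(mcpExposure.toolName)) this.notifyChanged();
return;
}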
const modelType: string = spec.modelType ?? "Dense";
const supportsThinking =
modelType === "DenseThinking" || modelType === "MoEThinking";
const record: LLMModelRecord = {
name: obj.metadata?.name ?? "",
toolName: mcpExposure.toolName,
toolDescription: mcpExposure.toolDescription ?? "",
endpoint: status.endpoint ?? "",
modelId: spec.modelId ?? "",
domains: spec.domains ?? [],
supportsThinking,
contextWindowK: spec.contextWindowK ?? 32,
phase: status.phase ?? "",
};
const isNew = !this.models.has(record.toolName);
this.models.set(record.toolName, record);
console.log(`${isNew ? "Registered" : "Updated"} model tool: ${record.toolName}`);
this.notifyChanged();
}
// upsertFromAdmin is called by the /admin/tools PUT endpoint.
// The controller calls this to push registrations without waiting
// for the Kubernetes watch to fire.
upsertFromAdmin(payload: AdminToolPayload): void {
const existing = this.models.get(payload.name);
const record: LLMModelRecord = {
name: payload.name,
toolName: payload.name,
toolDescription: payload.description,
endpoint: payload.endpoint,
modelId: payload.modelId,
domains: payload.domains,
supportsThinking: false, // updated by watch
contextWindowK: 32, // updated by watch
phase: "Ready",
};
// Preserve supportsThinking and contextWindowK from existing record.
if (existing) {
record.supportsThinking = existing.supportsThinking;
record.contextWindowK = existing.contextWindowK;
}
this.models.set(payload.name, record);
console.log(`Admin: registered/updated tool ${payload.name}`);
this.notifyChanged();
}
removeByToolName(toolName: string): boolean {
const existed = this.models.has(toolName);
this.models.delete(toolName);
if (existed) {
console.log(`Admin: unregistered tool ${toolName}`);
this.notifyChanged();
}
return existed;
}
getAll(): LLMModelRecord[] {
return Array.from(this.models.values());
}
getByToolName(toolName: string): LLMModelRecord | undefined {
return this.models.get(toolName);
}
}
// ---------------------------------------------------------------------------
// MCP Tool Invocation
// ---------------------------------------------------------------------------
async function invokeModel(
model: LLMModelRecord,
input: {
messages: Array<{ role: "system" | "user" | "assistant"; content: string }>;
temperature?: number;
max_tokens?: number;
thinking?: boolean;
},
): Promise<{ content: Array<{ type: string; text: string }>; isError?: boolean }> {
const openai = new OpenAI({
baseURL: model.endpoint,
apiKey: "not-needed-for-local-models",
});
let messages = [...input.messages];
// Inject thinking mode instruction for hybrid thinking models.
if (input.thinking && model.supportsThinking) {
const thinkingInstruction =
"Think step by step before answering. " +
"Use <think>...</think> tags for your internal reasoning.";
const sysIdx = messages.findIndex((m) => m.role === "system");
if (sysIdx >= 0) {
messages[sysIdx] = {
...messages[sysIdx],
content: messages[sysIdx].content + "\n\n" + thinkingInstruction,
};
} else {
messages = [{ role: "system", content: thinkingInstruction }, ...messages];
}
}
try {
const completion = await openai.chat.completions.create({
model: model.modelId,
messages: messages as any,
temperature: input.temperature ?? 0.7,
max_tokens: input.max_tokens ?? 2048,
});
const responseText = completion.choices[0]?.message?.content ?? "";
return { content: [{ type: "text", text: responseText }] };
} catch (error: any) {
return {
content: [
{
type: "text",
text: `Error calling model ${model.modelId}: ${error.message}`,
},
],
isError: true,
};
}
}
// ---------------------------------------------------------------------------
// MCP Server Setup
// ---------------------------------------------------------------------------
function buildMcpServer(registry: ModelRegistry): McpServer {
const server = new McpServer({
name: "example-llm-registry",
version: "1.0.0",
});
// Meta-tool: list all available models.
server.tool(
"list_available_models",
"Lists all LLM models currently available in the cluster, " +
"including their capability domains, context windows, thinking support, " +
"and deployment status. Call this first to discover which model " +
"to use for a given task.",
{},
async () => {
const models = registry.getAll();
const summary = models.map((m) => ({
toolName: m.toolName,
modelId: m.modelId,
domains: m.domains,
contextWindowK: m.contextWindowK,
supportsThinking: m.supportsThinking,
status: m.phase,
}));
return {
content: [{ type: "text", text: JSON.stringify(summary, null, 2) }],
};
},
);
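  // Track the handle returned by server.tool() for each model so that a tool
  // can be removed again when its model leaves the registry. This assumes an
  // MCP TypeScript SDK version in which tool() returns a handle with remove().
  const registered = new Map<string, { remove(): void }>();
  const syncTools = (): void => {
    const current = new Set(registry.getAll().map((m) => m.toolName));
    // Remove tools whose model is no longer in the registry.
    for (const [toolName, handle] of registered) {
      if (!current.has(toolName)) {
        handle.remove();
        registered.delete(toolName);
      }
    }
    // Register tools for models that appeared since the last sync.
    for (const toolName of current) {
      if (!registered.has(toolName)) {
        registered.set(toolName, registerModelTool(server, registry, toolName));
      }
    }
  };
  // Register each currently-known model as a tool.
  syncTools();
  // When the registry changes, bring the tool list back in sync so that MCP
  // clients receive the tools/list_changed notification (November 2025 MCP
  // spec). Recent SDK versions send it themselves when a tool is added or
  // removed; the explicit call below is a fallback.
  registry.onToolListChanged(() => {
    syncTools();
    try {
      (server as any).sendToolListChanged?.();
    } catch { /* SDK version may not support this yet */ }
  });
  return server;
}
function registerModelTool(
  server: McpServer,
  registry: ModelRegistry,
  toolName: string,
): { remove(): void } {
  // server.tool() throws if the name is already registered, so each tool name
  // is registered exactly once. The handler looks the record up at call time,
  // so endpoint or modelId changes take effect without re-registering.
  const description = registry.getByToolName(toolName)?.toolDescription ?? "";
  return server.tool(
    toolName,
    description,
    {
      messages: z.array(
        z.object({
          role: z.enum(["system", "user", "assistant"]),
          content: z.string(),
        })
      ).describe("Conversation history."),
      temperature: z.number().min(0).max(2).optional()
        .describe("Sampling temperature (0=deterministic, 2=creative)."),
      max_tokens: z.number().int().positive().optional()
        .describe("Maximum tokens to generate."),
      thinking: z.boolean().optional()
        .describe(
          "Enable chain-of-thought reasoning for models that support " +
          "hybrid thinking mode (e.g. Qwen3.5, Qwen3.6). " +
          "Increases latency but improves quality on complex tasks."
        ),
    },
    async (input) => {
      const model = registry.getByToolName(toolName);
      if (!model) {
        return {
          content: [
            {
              type: "text" as const,
              text: `Model tool ${toolName} is no longer available.`,
            },
          ],
          isError: true,
        };
      }
      return invokeModel(model, input);
    },
  );
}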
// ---------------------------------------------------------------------------
// Express Application
// ---------------------------------------------------------------------------
async function main(): Promise<void> {
const registry = new ModelRegistry();
await registry.start();
const mcpServer = buildMcpServer(registry);
const app = express();
app.use(express.json());
// Health endpoint — checked by the Kubernetes readiness probe.
app.get("/health", (_req: Request, res: Response) => {
res.status(200).json({
status: "ok",
models: registry.getAll().length,
});
});
// -----------------------------------------------------------------------
// Admin API — called by the Kubernetes controller to push tool
// registrations without waiting for the Kubernetes watch to fire.
// This API is internal only and should not be exposed outside the cluster.
// -----------------------------------------------------------------------
// PUT /admin/tools/:toolName — register or update a tool.
app.put("/admin/tools/:toolName", (req: Request, res: Response) => {
const { toolName } = req.params;
const payload = req.body as AdminToolPayload;
if (!payload.name || payload.name !== toolName) {
res.status(400).json({ error: "toolName in URL must match name in body" });
return;
}
registry.upsertFromAdmin(payload);
res.status(200).json({ status: "registered", toolName });
});
// DELETE /admin/tools/:toolName — unregister a tool.
app.delete("/admin/tools/:toolName", (req: Request, res: Response) => {
const { toolName } = req.params;
const existed = registry.removeByToolName(toolName);
if (existed) {
res.status(200).json({ status: "unregistered", toolName });
} else {
res.status(404).json({ error: "tool not found", toolName });
}
});
// GET /admin/tools — list all registered tools (for debugging).
app.get("/admin/tools", (_req: Request, res: Response) => {
res.status(200).json(registry.getAll().map((m) => ({
toolName: m.toolName,
modelId: m.modelId,
endpoint: m.endpoint,
phase: m.phase,
})));
});
// -----------------------------------------------------------------------
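// MCP Protocol Endpoint
// Uses StreamableHTTP transport (November 2025 MCP specification).
// NOTE: with a sessionIdGenerator the transport is stateful and serves a
// single MCP session. A client must send "initialize" first and then echo
// the returned Mcp-Session-Id header on every later request. Serving several
// concurrent clients requires one transport (and one McpServer) per session.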
// -----------------------------------------------------------------------
const transport = new StreamableHTTPServerTransport({
sessionIdGenerator: () => randomUUID(),
});
app.all("/mcp", async (req: Request, res: Response) => {
await transport.handleRequest(req, res, req.body);
});
await mcpServer.connect(transport);
const port = parseInt(process.env.PORT ?? "3000", 10);
app.listen(port, () => {
console.log(`MCP server listening on port ${port}`);
console.log(`Serving ${registry.getAll().length} models as tools`);
console.log(`Health: http://localhost:${port}/health`);
console.log(`MCP: http://localhost:${port}/mcp`);
console.log(`Admin: http://localhost:${port}/admin/tools`);
});
}
main().catch((err) => {
console.error("Fatal error:", err);
process.exit(1);
});
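The thinking instruction injected by invokeModel asks the model to wrap its reasoning in <think>...</think> tags, so the text a tool call returns may contain both the reasoning and the final answer. The following is a minimal sketch, in Python, of how a client might separate the two. The function name split_thinking is hypothetical, and the parsing assumes well-formed, non-nested tags.

```python
import re

# Non-greedy match so that several separate <think> blocks are each captured.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a model response.

    Assumes non-nested, well-formed <think>...</think> blocks. Everything
    outside the tags is treated as the answer.
    """
    reasoning = "\n".join(part.strip() for part in _THINK_RE.findall(text))
    answer = _THINK_RE.sub("", text).strip()
    return reasoning, answer
```

A response without any tags comes back with an empty reasoning string, so callers can apply the function unconditionally.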
FILE: mcp-server/Dockerfile
# syntax=docker/dockerfile:1
FROM node:20-alpine AS builder
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci
COPY tsconfig.json ./
COPY src/ src/
RUN npm run build
FROM node:20-alpine AS runtime
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci --omit=dev
COPY --from=builder /app/dist ./dist
USER node
EXPOSE 3000
CMD ["node", "dist/index.js"]
FILE: config/mcp/deployment.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: mcp-server
  namespace: ai-models
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: mcp-server-llmmodel-reader
  namespace: ai-models
rules:
  - apiGroups: ["ai.example.io"]
    resources: ["llmmodels"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: mcp-server-llmmodel-reader
  namespace: ai-models
subjects:
  - kind: ServiceAccount
    name: mcp-server
    namespace: ai-models
roleRef:
  kind: Role
  name: mcp-server-llmmodel-reader
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mcp-server
  namespace: ai-models
spec:
  replicas: 2
  selector:
    matchLabels:
      app: mcp-server
  template:
    metadata:
      labels:
        app: mcp-server
    spec:
      serviceAccountName: mcp-server
      containers:
        - name: mcp-server
          image: ghcr.io/example/llm-mcp-server:v1.0.0
          ports:
            - containerPort: 3000
              name: http
          env:
            - name: MODEL_NAMESPACE
              value: ai-models
            - name: PORT
              value: "3000"
          resources:
            requests:
              cpu: 200m
              memory: 256Mi
            limits:
              cpu: 1000m
              memory: 512Mi
          readinessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 10
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 15
            periodSeconds: 20
            failureThreshold: 3
---
apiVersion: v1
kind: Service
metadata:
  name: mcp-server
  namespace: ai-models
spec:
  selector:
    app: mcp-server
  ports:
    - port: 3000
      targetPort: 3000
      name: mcp
CHAPTER TEN: QUERYING THE MODEL REGISTRY
With discovery labels populated by the controller, engineers can query the
registry using standard Kubernetes tooling.
Find all models that support vision and have at least 128K context:
kubectl get llmmodels -n ai-models \
-l "ai.example.io/domain-vision=true,ai.example.io/context-128k=true" \
-o custom-columns=\
NAME:.metadata.name,\
MODEL:.spec.modelId,\
VENDOR:.spec.acceleratorVendor,\
VRAM:.spec.resources.vramPerAcceleratorGiB,\
CONTEXT:.spec.contextWindowK,\
STATUS:.status.phase
Find all models running on AMD accelerators:
kubectl get llmmodels -n ai-models \
-l "ai.example.io/accelerator-vendor=amd" \
-o wide
Find all models that support reasoning and fit in 24 GiB of VRAM:
kubectl get llmmodels -n ai-models \
-l "ai.example.io/domain-reasoning=true,\
ai.example.io/vram-tier in (8gb,16gb,24gb)"
Find all remote API models:
kubectl get llmmodels -n ai-models \
-l "ai.example.io/deployment-mode=remote"
NOTE: Kubernetes label selectors support only equality and set-based matching,
not numeric range comparisons, and field selectors on custom resources are
limited to fields that the CRD explicitly declares as selectable. For queries
that require numeric comparisons (e.g., contextWindowK >= 256), use the Python
client and filter in application code as shown below.
FILE: tools/query_models.py
#!/usr/bin/env python3
"""
Query the LLMModel registry for models matching given criteria.
Runs inside the cluster (uses in-cluster config) or locally (uses kubeconfig).
"""
import sys

from kubernetes import client, config


def load_config() -> None:
    """Load Kubernetes configuration from cluster or local kubeconfig."""
    try:
        config.load_incluster_config()
    except config.ConfigException:
        config.load_kube_config()


def find_models(
    namespace: str = "ai-models",
    domains: list[str] | None = None,
    max_vram_gib: int | None = None,
    context_min_k: int | None = None,
    accelerator_vendor: str | None = None,
    deployment_mode: str | None = None,
    phase: str | None = None,
) -> list[dict]:
    """
    Query the LLMModel registry for models matching the given criteria.

    Args:
        namespace:          Kubernetes namespace to search.
        domains:            List of required capability domains,
                            e.g. ["code", "vision"].
        max_vram_gib:       Maximum acceptable VRAM per accelerator in GiB.
                            Models requiring more VRAM are excluded.
        context_min_k:      Minimum required context window in thousands of
                            tokens.
        accelerator_vendor: Filter by vendor: "nvidia", "amd", "intel-gaudi",
                            "cpu".
        deployment_mode:    Filter by mode: "local" or "remote".
        phase:              Filter by status phase: "Ready", "Proxying", etc.

    Returns:
        List of matching LLMModel dicts (raw Kubernetes API objects).
    """
    load_config()
    api = client.CustomObjectsApi()

    # Build label selector from criteria that map to labels.
    label_parts: list[str] = []
    if domains:
        for domain in domains:
            label_parts.append(f"ai.example.io/domain-{domain}=true")
    if accelerator_vendor:
        label_parts.append(
            f"ai.example.io/accelerator-vendor={accelerator_vendor}"
        )
    if deployment_mode:
        label_parts.append(
            f"ai.example.io/deployment-mode={deployment_mode}"
        )
    if max_vram_gib is not None:
        # Map max VRAM to the tier labels at or below the maximum.
        tier_map = [
            (0, "cpu"),
            (8, "8gb"),
            (16, "16gb"),
            (24, "24gb"),
            (48, "48gb"),
            (80, "80gb"),
            (141, "141gb"),
            (192, "192gb"),
        ]
        eligible_tiers = [
            tier for threshold, tier in tier_map
            if threshold <= max_vram_gib
        ]
        if eligible_tiers:
            label_parts.append(
                "ai.example.io/vram-tier in ({})".format(
                    ",".join(eligible_tiers)
                )
            )

    label_selector = ",".join(label_parts) if label_parts else None
    result = api.list_namespaced_custom_object(
        group="ai.example.io",
        version="v1alpha1",
        namespace=namespace,
        plural="llmmodels",
        label_selector=label_selector,
    )
    models: list[dict] = result.get("items", [])

    # Apply filters that cannot be expressed as label selectors.
    if context_min_k is not None:
        models = [
            m for m in models
            if m.get("spec", {}).get("contextWindowK", 0) >= context_min_k
        ]
    if phase is not None:
        models = [
            m for m in models
            if m.get("status", {}).get("phase") == phase
        ]
    else:
        # By default, return only Ready and Proxying models.
        models = [
            m for m in models
            if m.get("status", {}).get("phase") in ("Ready", "Proxying")
        ]
    return models


def print_models(models: list[dict]) -> None:
    """Print a formatted summary of matching models."""
    if not models:
        print("No matching models found.")
        return
    print(
        f"{'NAME':<30} {'MODEL':<45} {'VENDOR':<12} "
        f"{'VRAM':<8} {'CONTEXT':<10} {'STATUS':<10} ENDPOINT"
    )
    print("-" * 140)
    for m in models:
        spec = m.get("spec", {})
        status = m.get("status", {})
        resources = spec.get("resources", {})
        print(
            f"{m['metadata']['name']:<30} "
            f"{spec.get('modelId', ''):<45} "
            f"{spec.get('acceleratorVendor', 'remote'):<12} "
            f"{resources.get('vramPerAcceleratorGiB', 0):<8} "
            f"{spec.get('contextWindowK', 0):<10} "
            f"{status.get('phase', ''):<10} "
            f"{status.get('endpoint', '')}"
        )


if __name__ == "__main__":
    # Example: find reasoning-capable models on any vendor with <=24 GiB VRAM
    results = find_models(
        domains=["reasoning"],
        max_vram_gib=24,
        context_min_k=128,
    )
    print(f"Found {len(results)} reasoning models with <=24 GiB VRAM and >=128K context:\n")
    print_models(results)
    print()

    # Example: find all AMD models
    amd_models = find_models(accelerator_vendor="amd")
    print(f"\nFound {len(amd_models)} AMD models:\n")
    print_models(amd_models)
    print()

    # Example: find all remote API models
    remote_models = find_models(deployment_mode="remote")
    print(f"\nFound {len(remote_models)} remote API models:\n")
    print_models(remote_models)
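The selector-building part of find_models can be exercised without a cluster if it is factored into a pure function. The sketch below shows one way to do that. build_label_selector is a hypothetical helper, not part of tools/query_models.py, and the tier list is copied from the listing above.

```python
from typing import List, Optional

# VRAM tiers as (upper bound in GiB, label value), copied from tier_map.
VRAM_TIERS = [
    (0, "cpu"), (8, "8gb"), (16, "16gb"), (24, "24gb"),
    (48, "48gb"), (80, "80gb"), (141, "141gb"), (192, "192gb"),
]


def build_label_selector(
    domains: Optional[List[str]] = None,
    max_vram_gib: Optional[int] = None,
    accelerator_vendor: Optional[str] = None,
) -> Optional[str]:
    """Build a Kubernetes label selector string, or None if no criteria apply."""
    parts = [f"ai.example.io/domain-{d}=true" for d in (domains or [])]
    if accelerator_vendor:
        parts.append(f"ai.example.io/accelerator-vendor={accelerator_vendor}")
    if max_vram_gib is not None:
        # Keep every tier whose bound does not exceed the requested maximum.
        tiers = [label for bound, label in VRAM_TIERS if bound <= max_vram_gib]
        if tiers:
            parts.append(f"ai.example.io/vram-tier in ({','.join(tiers)})")
    return ",".join(parts) or None
```

Keeping this logic free of API calls means a unit test can pin down the exact selector strings before anything is sent to the API server.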
CHAPTER ELEVEN: DOCKER MODEL RUNNER AND LOCAL DEVELOPMENT
Docker Desktop 4.41, released on April 29, 2025, ships with Docker Model
Runner, which brings a capable local LLM development environment to any machine
with a modern GPU or Apple Silicon. It uses llama.cpp as its inference backend
and packages models as OCI artifacts.
To pull and run a model with Docker Model Runner:
# Pull the Gemma 4 E4B model from Docker Hub's model registry.
docker model pull ai/gemma4:e4b-q4_k_m
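# Host-side TCP access is off by default in Docker Desktop. Enable it so the
# API is reachable on localhost:12434.
docker desktop enable model-runner --tcp 12434
# Run the model with a one-off prompt to confirm that it loads.
docker model run ai/gemma4:e4b-q4_k_m "Say hello in one sentence."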
# Test using the OpenAI-compatible API.
curl http://localhost:12434/engines/llama.cpp/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "ai/gemma4:e4b-q4_k_m",
"messages": [
{
"role": "user",
"content": "Explain the Mixture-of-Experts architecture in one paragraph."
}
],
"temperature": 0.7,
"max_tokens": 512
}'
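The same request can be issued from application code. The sketch below uses only the Python standard library and splits building the request from sending it, so the first half can be checked without a running model. The helper names build_chat_request and send_chat_request are hypothetical, and the base URL is assumed to be the one used in the curl example.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str,
                       max_tokens: int = 512) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request: urllib.request.Request,
                      timeout: float = 120.0) -> str:
    """Send the request and return the text of the first choice."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return payload["choices"][0]["message"]["content"]
```

With the model runner listening, send_chat_request(build_chat_request("http://localhost:12434/engines/llama.cpp/v1", "ai/gemma4:e4b-q4_k_m", "Hello")) would return the generated text.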
FILE: compose.yaml (local development environment)
# A development environment for an AI application that uses a local LLM.
# The model runner service uses Docker Desktop's Model Runner backend,
# which handles GPU acceleration automatically (NVIDIA CUDA on Linux/Windows,
# Metal on macOS Apple Silicon).
services:
  app:
    build: .
    environment:
      # Point the app at the local model runner.
      # In production (Kubernetes), this is overridden by the cluster endpoint.
      # From inside a container the runner is reached on the default HTTP
      # port; port 12434 applies only to host-side access via localhost.
      LLM_BASE_URL: http://model-runner.docker.internal/engines/llama.cpp/v1
      LLM_MODEL_ID: ai/gemma4:e4b-q4_k_m
    depends_on:
      - model-runner
    ports:
      - "8080:8080"
  model-runner:
    # The "model" provider type tells Docker Desktop to use the Model Runner
    # instead of pulling a regular container image.
    # This works on any platform Docker Desktop supports:
    # - macOS: Apple Silicon (Metal) or Intel (CPU)
    # - Windows: NVIDIA GPU (CUDA) or CPU
    # - Linux: NVIDIA GPU (CUDA) or CPU
    provider:
      type: model
      options:
        model: ai/gemma4:e4b-q4_k_m
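Because the Compose file and the cluster differ only in the values of LLM_BASE_URL and LLM_MODEL_ID, the application can read both once at startup and fail fast if either is missing. The following is a minimal sketch of that pattern. LLMSettings and load_llm_settings are hypothetical names, not part of any library.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class LLMSettings:
    base_url: str
    model_id: str


def load_llm_settings(env: Mapping[str, str] = os.environ) -> LLMSettings:
    """Read LLM_BASE_URL and LLM_MODEL_ID, failing fast if either is missing."""
    missing = [k for k in ("LLM_BASE_URL", "LLM_MODEL_ID") if not env.get(k)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    # Strip a trailing slash so callers can append "/chat/completions" safely.
    return LLMSettings(
        base_url=env["LLM_BASE_URL"].rstrip("/"),
        model_id=env["LLM_MODEL_ID"],
    )
```

Passing the environment as a parameter keeps the function testable with a plain dictionary.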
CHAPTER TWELVE: COMPLETE DEPLOYMENT WALKTHROUGH
We will deploy four local models (one on each of three accelerator vendors,
plus one CPU-only model) and three remote API proxies, then deploy the MCP
server and verify that an AI agent can discover and use all seven models.
STEP 1: Install prerequisites.
# Create the operator system namespace.
kubectl create namespace llm-operator-system
# Create the AI models namespace.
kubectl create namespace ai-models
# ---- NVIDIA GPU Operator (run only if you have NVIDIA GPUs) ----
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
--namespace gpu-operator \
--create-namespace \
--set driver.enabled=true \
--set toolkit.enabled=true \
--set devicePlugin.enabled=true \
--set dcgmExporter.enabled=true \
--wait
# ---- AMD GPU Operator (run only if you have AMD GPUs) ----
helm repo add amd-gpu-operator https://rocm.github.io/gpu-operator
helm repo update
helm install amd-gpu-operator amd-gpu-operator/gpu-operator-charts \
--namespace amd-gpu-operator \
--create-namespace \
--set devicePlugin.enabled=true \
--set nodeLabeller.enabled=true \
--wait
# ---- Intel Gaudi Base Operator (run only if you have Gaudi cards) ----
helm repo add intel https://intel.github.io/helm-charts
helm repo update
helm install gaudi-base-operator intel/intel-gaudi-base-operator \
--namespace intel-gaudi \
--create-namespace \
--wait
# ---- KEDA ----
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda \
--namespace keda \
--create-namespace \
--wait
# ---- KEDA HTTP add-on (for scale-to-zero with request buffering) ----
helm install keda-http-add-on kedacore/keda-add-ons-http \
--namespace keda \
--wait
# ---- Prometheus stack ----
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--create-namespace \
--set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
--wait
STEP 2: Install the LLMModel CRD and deploy the controller.
# Apply the CRD.
kubectl apply -f config/crd/bases/ai.example.io_llmmodels.yaml
# Apply RBAC.
kubectl apply -f config/rbac/serviceaccount.yaml
kubectl apply -f config/rbac/role.yaml
kubectl apply -f config/rbac/rolebinding.yaml
# Apply Prometheus recording rules.
kubectl apply -f config/monitoring/prometheus-rules.yaml
# Create secrets.
kubectl create secret generic hf-token \
--namespace ai-models \
--from-literal=token=YOUR_HF_TOKEN
kubectl create secret generic openai-api-key \
--namespace ai-models \
--from-literal=apiKey=YOUR_OPENAI_API_KEY
kubectl create secret generic anthropic-api-key \
--namespace ai-models \
--from-literal=apiKey=YOUR_ANTHROPIC_API_KEY
kubectl create secret generic google-api-key \
--namespace ai-models \
--from-literal=apiKey=YOUR_GOOGLE_API_KEY
# Deploy the controller.
kubectl apply -f config/manager/manager.yaml
STEP 3: Apply the LLMModel resources.
# Local model on AMD MI300X.
kubectl apply -f config/samples/gemma4-26b-amd.yaml
# Local model on Intel Gaudi 3.
kubectl apply -f config/samples/qwen36-35b-gaudi.yaml
# Local model on NVIDIA H100.
kubectl apply -f config/samples/gemma4-31b-nvidia.yaml
# CPU-only model (no GPU required).
kubectl apply -f config/samples/qwen35-9b-cpu.yaml
# Remote API proxies.
kubectl apply -f config/samples/gpt55-remote.yaml
kubectl apply -f config/samples/claude-opus47-remote.yaml
kubectl apply -f config/samples/gemini31-remote.yaml
# Apply KEDA ScaledObjects.
kubectl apply -f config/keda/
STEP 4: Deploy the MCP server.
kubectl apply -f config/mcp/deployment.yaml
STEP 5: Watch the models come online.
# Watch all LLMModel resources.
# Local models: Pending -> Downloading -> Starting -> Ready
# Remote proxies: immediately -> Proxying
kubectl get llmmodels -n ai-models -w
# Check logs for the AMD model.
kubectl logs -n ai-models \
-l "ai.example.io/model-name=gemma4-26b-amd" \
-c inference-server \
--follow
# Check logs for the Gaudi model.
kubectl logs -n ai-models \
-l "ai.example.io/model-name=qwen36-35b-gaudi" \
-c inference-server \
--follow
STEP 6: Test the models directly.
# Port-forward the AMD model.
kubectl port-forward -n ai-models svc/gemma4-26b-amd 8001:8000 &
curl http://localhost:8001/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "google/gemma-4-26b-a4b-it",
"messages": [{"role": "user", "content": "What is 2+2?"}],
"max_tokens": 50
}'
# Port-forward the Gaudi model.
kubectl port-forward -n ai-models svc/qwen36-35b-gaudi 8002:8000 &
curl http://localhost:8002/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3.6-35B-A3B",
"messages": [
{
"role": "system",
"content": "Think step by step before answering. Use <think>...</think> tags."
},
{
"role": "user",
"content": "Prove that the square root of 2 is irrational."
}
],
"max_tokens": 1024
}'
STEP 7: Test the MCP server.
kubectl port-forward -n ai-models svc/mcp-server 3000:3000 &
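# Every request to the Streamable HTTP endpoint needs an Accept header that
# covers both JSON and SSE, and a session must be opened with "initialize"
# before any other method is called. The session ID is returned in the
# Mcp-Session-Id response header. Responses arrive as SSE "data:" lines.
SESSION_ID=$(curl -s -D - -o /dev/null -X POST http://localhost:3000/mcp \
  -H "Content-Type: application/json" \
  -H "Accept: application/json, text/event-stream" \
  -d '{
    "jsonrpc": "2.0",
    "id": 0,
    "method": "initialize",
    "params": {
      "protocolVersion": "2025-11-25",
      "capabilities": {},
      "clientInfo": {"name": "curl", "version": "0.0.0"}
    }
  }' | tr -d '\r' | awk 'tolower($1) == "mcp-session-id:" {print $2}')
# Tell the server that initialization is complete.
curl -X POST http://localhost:3000/mcp \
  -H "Content-Type: application/json" \
  -H "Accept: application/json, text/event-stream" \
  -H "Mcp-Session-Id: $SESSION_ID" \
  -d '{
    "jsonrpc": "2.0",
    "method": "notifications/initialized"
  }'
# List all available tools.
curl -X POST http://localhost:3000/mcp \
  -H "Content-Type: application/json" \
  -H "Accept: application/json, text/event-stream" \
  -H "Mcp-Session-Id: $SESSION_ID" \
  -d '{
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/list",
    "params": {}
  }'
# Discover all models via the meta-tool.
curl -X POST http://localhost:3000/mcp \
  -H "Content-Type: application/json" \
  -H "Accept: application/json, text/event-stream" \
  -H "Mcp-Session-Id: $SESSION_ID" \
  -d '{
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {
      "name": "list_available_models",
      "arguments": {}
    }
  }'
# Call the Gaudi reasoning model with thinking mode.
curl -X POST http://localhost:3000/mcp \
  -H "Content-Type: application/json" \
  -H "Accept: application/json, text/event-stream" \
  -H "Mcp-Session-Id: $SESSION_ID" \
  -d '{
    "jsonrpc": "2.0",
    "id": 3,
    "method": "tools/call",
    "params": {
      "name": "qwen36_reasoning_gaudi",
      "arguments": {
        "messages": [
          {
            "role": "user",
            "content": "What is the time complexity of merge sort and why?"
          }
        ],
        "thinking": true,
        "max_tokens": 512
      }
    }
  }'
# Call the GPT-5.5 remote proxy.
curl -X POST http://localhost:3000/mcp \
  -H "Content-Type: application/json" \
  -H "Accept: application/json, text/event-stream" \
  -H "Mcp-Session-Id: $SESSION_ID" \
  -d '{
    "jsonrpc": "2.0",
    "id": 4,
    "method": "tools/call",
    "params": {
      "name": "gpt55_frontier",
      "arguments": {
        "messages": [
          {
            "role": "user",
            "content": "Write a Kubernetes operator in Go that manages Redis clusters."
          }
        ],
        "temperature": 0.3,
        "max_tokens": 2048
      }
    }
  }'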
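An MCP client that talks to the Streamable HTTP endpoint programmatically has to build the same JSON-RPC envelopes and headers as the curl calls in this step. The sketch below covers only message construction and response parsing, using the standard library. All function names are hypothetical, the protocol version string is assumed to be the one for the November 2025 specification, and the SSE parser assumes each event carries its JSON on a single data line.

```python
import json
from typing import Any, Dict, List, Optional

MCP_ACCEPT = "application/json, text/event-stream"


def mcp_headers(session_id: Optional[str] = None) -> Dict[str, str]:
    """Headers for a POST to /mcp; the session ID is added once known."""
    headers = {"Content-Type": "application/json", "Accept": MCP_ACCEPT}
    if session_id:
        headers["Mcp-Session-Id"] = session_id
    return headers


def jsonrpc_request(request_id: int, method: str,
                    params: Optional[Dict[str, Any]] = None) -> str:
    """Serialize a JSON-RPC 2.0 request."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": method,
        "params": params or {},
    })


def initialize_request(request_id: int = 0) -> str:
    """First message of every session."""
    return jsonrpc_request(request_id, "initialize", {
        "protocolVersion": "2025-11-25",  # assumed version string
        "capabilities": {},
        "clientInfo": {"name": "example-client", "version": "0.0.0"},
    })


def tool_call_request(request_id: int, tool_name: str,
                      arguments: Dict[str, Any]) -> str:
    """Serialize a tools/call request for one of the registered model tools."""
    return jsonrpc_request(
        request_id, "tools/call", {"name": tool_name, "arguments": arguments}
    )


def parse_sse_messages(body: str) -> List[Dict[str, Any]]:
    """Extract JSON-RPC messages from a text/event-stream response body."""
    messages = []
    for line in body.splitlines():
        if line.startswith("data:"):
            messages.append(json.loads(line[len("data:"):].strip()))
    return messages
```

Any HTTP client can then send these bodies; the only state to carry between calls is the session ID from the initialize response.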
STEP 8: Run the registry query tool.
# Install the Kubernetes Python client.
pip install kubernetes
# Query the registry from outside the cluster (uses kubeconfig).
python3 tools/query_models.py
CHAPTER THIRTEEN: OPERATIONAL CONSIDERATIONS AND PRODUCTION HARDENING
MODEL LOADING TIME AND COLD STARTS
Large models take significant time to load. Gemma 4 31B in INT4 format takes
approximately 3–5 minutes on an H100. Qwen3.6-35B-A3B in FP8 on Gaudi 3 takes
approximately 5–10 minutes. This means that if you scale to zero and receive a
request, the user will wait for the full model loading time. Use the KEDA HTTP
add-on to buffer requests during cold starts, and set minReplicas to 1 for
interactive applications where cold-start latency is unacceptable.
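A rough way to judge whether buffering is acceptable is to estimate how many requests pile up while the model loads. The sketch below is back-of-the-envelope arithmetic only: it assumes a steady arrival rate, and the function name is hypothetical.

```python
import math


def requests_buffered_during_cold_start(arrival_rate_per_s: float,
                                        load_time_s: float) -> int:
    """Upper estimate of requests that arrive while the model is loading."""
    return math.ceil(arrival_rate_per_s * load_time_s)


# One request every two seconds during a five-minute load.
print(requests_buffered_during_cold_start(0.5, 5 * 60))  # prints 150
```

If that backlog, or the wait behind it, is more than the application can tolerate, keeping one replica warm is the simpler choice.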
KV CACHE MEMORY
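The KV cache stores attention keys and values for all tokens in the context
window. For a model with a 256K token context window, the KV cache for a single
full-length request can range from several gigabytes to tens of gigabytes,
depending on the layer count, the number of KV heads, and the cache precision.
vLLM's PagedAttention algorithm manages KV cache memory efficiently, but you
must account for it when sizing VRAM. vLLM claims the fraction of VRAM given by
gpu-memory-utilization, loads the weights into it, and uses what remains of
that budget for KV cache. A rule of thumb is to leave 20–30% of total VRAM for
KV cache after the weights are loaded. We set gpu-memory-utilization to 0.90
rather than 1.0 so that about 10% of VRAM stays free for the driver context and
allocator fragmentation.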
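The size of the KV cache follows from the model's attention geometry: two tensors (keys and values) per layer, each holding one vector per KV head per token. The sketch below works through that arithmetic. The model dimensions in the example are hypothetical and do not describe any specific model in this guide, and the estimate ignores activation memory and assumes every layer uses full attention.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   num_tokens: int, bytes_per_value: int = 2) -> int:
    """KV cache size in bytes for num_tokens cached tokens.

    The leading factor 2 accounts for storing both keys and values.
    bytes_per_value is 2 for FP16/BF16 and 1 for an FP8 cache.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


def max_cached_tokens(total_vram_gib: float, gpu_memory_utilization: float,
                      weights_gib: float, per_token_bytes: int) -> int:
    """Tokens that fit in the VRAM left over after the weights are loaded."""
    budget_gib = total_vram_gib * gpu_memory_utilization - weights_gib
    return max(0, int(budget_gib * 2**30 // per_token_bytes))


# Hypothetical geometry: 48 layers, 8 KV heads, head dimension 128, FP16 cache.
per_token = kv_cache_bytes(48, 8, 128, num_tokens=1)
full_context = kv_cache_bytes(48, 8, 128, num_tokens=256 * 1024)
print(per_token)              # 196608 bytes, i.e. 192 KiB per token
print(full_context / 2**30)   # 48.0 GiB for one 256K-token request
```

Under these assumptions a single full-length request would need 48 GiB of cache, which is why long-context deployments often use an FP8 cache or cap the maximum sequence length below the model's nominal window.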
MULTI-ACCELERATOR TENSOR PARALLELISM
For models requiring multiple accelerators, vLLM uses tensor parallelism to
split the model across cards. On NVIDIA hardware this uses NCCL over
NVLink/NVSwitch. On AMD hardware it uses RCCL (ROCm Collective Communications
Library). On Intel Gaudi it uses HCCL (Habana Collective Communications
Library). All three require HostIPC and HostNetwork for efficient inter-card
communication, which the controller sets automatically when
acceleratorCount > 1.
VENDOR-SPECIFIC QUANTIZATION NOTES
NVIDIA: AWQ, GPTQ, FP8, INT8, INT4, and QAT-INT4 are all supported on H100+.
AMD: AWQ is supported on ROCm as of vLLM 0.8+. MXFP4 and MXFP6 require MI350X
or later (CDNA 4 architecture). SqueezeLLM has been ported to ROCm.
Intel Gaudi: FP8 is natively supported by Gaudi 3 hardware. The Intel vLLM
fork includes custom graph caching for improved performance with FP8.
SECRETS MANAGEMENT
API keys for remote models (GPT-5.5, Claude Opus 4.7, Gemini 3.1) are
sensitive credentials. For production deployments, use an external secrets
manager such as HashiCorp Vault. The example below uses the Secrets Store CSI
Driver with its Vault provider to sync a Vault secret into a Kubernetes Secret:
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: openai-api-key-vault
  namespace: ai-models
spec:
  provider: vault
  parameters:
    vaultAddress: https://vault.example.com
    roleName: llm-operator
    objects: |
      - objectName: "openai-api-key"
        secretPath: "secret/data/ai/openai"
        secretKey: "api_key"
  secretObjects:
    - secretName: openai-api-key
      type: Opaque
      data:
        - objectName: openai-api-key
          key: apiKey
NETWORK SECURITY
The vLLM API server has no built-in authentication. Use Kubernetes Network
Policies to restrict which pods can reach the inference services:
# File: config/security/network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: llm-inference-access
  namespace: ai-models
spec:
  # Apply to all inference server pods across all vendors. An Exists
  # expression matches any value of the label; matchLabels with an empty
  # string would match only pods whose label value is empty.
  podSelector:
    matchExpressions:
      - key: ai.example.io/model-name
        operator: Exists
  policyTypes:
    - Ingress
  ingress:
    # Allow traffic from explicitly authorized clients in any namespace.
    - from:
        - namespaceSelector: {}
          podSelector:
            matchLabels:
              ai.example.io/llm-client: "true"
      ports:
        - protocol: TCP
          port: 8000
    # Always allow traffic from the MCP server.
    - from:
        - podSelector:
            matchLabels:
              app: mcp-server
      ports:
        - protocol: TCP
          port: 8000
    # Always allow traffic from the controller (for health checks).
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: llm-operator-system
      ports:
        - protocol: TCP
          port: 8000
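Label selector semantics are easy to get wrong when the goal is "every pod that carries this label, whatever its value". The sketch below models the two selector forms in plain Python to make the difference concrete. It is an illustration of the matching rules, not code from the Kubernetes API server, and the function names are hypothetical.

```python
from typing import Dict


def matches_match_labels(pod_labels: Dict[str, str],
                         match_labels: Dict[str, str]) -> bool:
    """matchLabels: every listed key must be present with exactly that value."""
    return all(pod_labels.get(k) == v for k, v in match_labels.items())


def matches_exists(pod_labels: Dict[str, str], key: str) -> bool:
    """matchExpressions with operator Exists: the key is present, any value."""
    return key in pod_labels


pod = {"ai.example.io/model-name": "gemma4-26b-amd"}

# An empty-string value only matches pods whose label value is empty.
print(matches_match_labels(pod, {"ai.example.io/model-name": ""}))  # False
# Exists matches regardless of the value.
print(matches_exists(pod, "ai.example.io/model-name"))              # True
```

A policy meant to cover every inference pod therefore needs the Exists form, since each pod carries its own model name as the label value.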
HETEROGENEOUS CLUSTER CONSIDERATIONS
In a cluster with mixed accelerator types (some nodes with NVIDIA GPUs, some
with AMD GPUs, some with Intel Gaudi cards, and some CPU-only), the controller
ensures that each LLMModel pod lands on the correct node type via:
1. NodeSelector: The vendor-specific label set by the GPU operator
(nvidia.com/gpu.present=true, amd.com/gpu.present=true, habana.ai/gaudi=true)
2. Tolerations: The vendor-specific taint applied to GPU nodes
3. Resource requests: The vendor-specific resource key
(nvidia.com/gpu, amd.com/gpu, habana.ai/gaudi)
These three mechanisms together guarantee that an AMD model never lands on an
NVIDIA node and vice versa, even in a heterogeneous cluster.
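The three mechanisms above amount to a per-vendor lookup table. The sketch below expresses that table in Python for illustration only; the controller itself keeps this logic in Go. The node labels and resource keys are the ones listed above, while the toleration keys and the NoSchedule effect are assumptions that depend on how the GPU nodes in a given cluster are tainted.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class AcceleratorScheduling:
    node_label: str      # label set by the vendor's operator
    resource_key: str    # extended resource requested by the pod
    toleration_key: str  # ASSUMED taint key; adjust to the cluster's taints


SCHEDULING: Dict[str, AcceleratorScheduling] = {
    "nvidia": AcceleratorScheduling(
        "nvidia.com/gpu.present", "nvidia.com/gpu", "nvidia.com/gpu"),
    "amd": AcceleratorScheduling(
        "amd.com/gpu.present", "amd.com/gpu", "amd.com/gpu"),
    "intel-gaudi": AcceleratorScheduling(
        "habana.ai/gaudi", "habana.ai/gaudi", "habana.ai/gaudi"),
}


def pod_scheduling_fragment(vendor: str, accelerator_count: int) -> Dict[str, Any]:
    """Return the vendor-specific parts of a pod spec as a plain dict.

    Raises KeyError for a vendor that is not in the table.
    """
    cfg = SCHEDULING[vendor]
    return {
        "nodeSelector": {cfg.node_label: "true"},
        "tolerations": [{
            "key": cfg.toleration_key,
            "operator": "Exists",
            "effect": "NoSchedule",
        }],
        "resources": {"limits": {cfg.resource_key: str(accelerator_count)}},
    }
```

Because the resource key differs per vendor, a pod built for one vendor can never be satisfied by a node that only advertises another vendor's resource, even if the node selector were omitted.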
For clusters using the AMD GPU DRA Driver (available in beta as of early 2026),
the scheduling can be made even more precise. The DRA driver publishes
ResourceSlices that expose structured attributes of AMD GPU devices (model,
PCIe root, memory, etc.), allowing workloads to request GPUs based on specific
characteristics such as minimum HBM capacity.
CHAPTER FOURTEEN: THE BIGGER PICTURE
We have covered a lot of ground. Let us step back and look at what we have
built and why it matters.
We started with the observation that the LLM landscape as of May 2026 is
radically different from what it was eighteen months ago. The dominant models
are MoE architectures. Context windows have grown from thousands to millions of
tokens. Quantization-Aware Training has made it possible to run 31-billion-
parameter multimodal models on a single 24 GiB GPU. And the accelerator market
has diversified: AMD MI300X, MI350X, and the forthcoming MI400 series are
serious alternatives to NVIDIA for large-model inference. Intel Gaudi 3 offers
a cost-effective option with native FP8 support and strong LLM serving
frameworks. The frontier remote API models — GPT-5.5, Claude Opus 4.7, Gemini
3.1 Pro — have context windows of 1 million tokens and capabilities that no
locally-deployable model yet matches.
We designed a Custom Resource Definition that captures the full richness of this
landscape. The `acceleratorVendor` field is the key innovation: it makes the
entire operator hardware-agnostic. Adding support for a new accelerator vendor
requires editing exactly one file — controllers/hardware.go — and adding a new
case to the switch statement. The rest of the controller, the CRD, the MCP
server, and the query tooling all remain unchanged.
We built a controller that reconciles this CRD into real Kubernetes objects,
with all vendor-specific decisions isolated in the AcceleratorConfig struct.
The controller handles the full lifecycle: PVC creation for model caching,
init container for weight download, inference Deployment with correct resource
keys and tolerations, Service for stable DNS, KEDA ScaledObject for intelligent
autoscaling, and MCP tool registration for AI agent integration.
We wired up KEDA to scale on inference queue depth rather than CPU or GPU
utilization, because queue depth measures the backlog of waiting requests
directly, whereas utilization on a busy inference server stays high whether or
not requests are waiting. Because vLLM emits the same Prometheus metrics
regardless of the underlying accelerator vendor, the KEDA configuration is
completely vendor-agnostic.
We built an MCP server that watches the LLMModel registry and exposes each
model as a tool that AI agents can discover and call. The server supports
dynamic tool registration via an admin REST API that the controller calls
whenever a model's status changes, and it sends MCP tools/list_changed
notifications to connected clients when the tool list changes.
We showed how Docker Model Runner bridges the gap between local development and
cluster deployment, giving every developer a local LLM environment (on an
NVIDIA GPU, Apple Silicon, or plain CPU) that uses the same OpenAI-compatible
API surface as the production cluster.
The result is an AI platform that is self-describing, self-scaling, self-
healing, and hardware-agnostic. Models are first-class Kubernetes citizens.
They can be queried, filtered, and selected using standard Kubernetes tooling
regardless of whether they run on NVIDIA, AMD, Intel Gaudi, or CPU. They scale
automatically based on demand. They expose themselves to AI agents via a
standard protocol. And local and remote models are handled uniformly.
This is what it looks like when you stop treating LLMs as external services
and start treating them as infrastructure — infrastructure that works with
whatever accelerator hardware you have or can afford.