Note: While I designed the full concept, textual explanations, and architecture of this system, I used Anthropic's Claude 4.5 Sonnet to generate the code. The image above was created by DALL-E.
INTRODUCTION AND OVERVIEW
The deployment of Large Language Models has become increasingly important as organizations seek to leverage artificial intelligence capabilities in their applications and services. Apple's MLX framework represents a significant advancement in this space, offering optimized machine learning operations specifically designed for Apple Silicon architecture. This tutorial explores the intersection of MLX with containerization technologies, specifically Docker and Kubernetes, to create scalable and maintainable LLM inference services.
MLX is an array framework developed by Apple's machine learning research team that provides efficient computation on Apple Silicon processors. The framework leverages the unified memory architecture of Apple's M-series chips, allowing seamless data sharing between CPU and GPU without explicit memory transfers. This architecture enables developers to build high-performance machine learning applications that can run inference on models ranging from small language models to large-scale transformers.
The challenge of deploying MLX in containerized environments stems from Apple's virtualization framework limitations. Unlike NVIDIA GPUs on Linux systems where direct GPU passthrough to containers is well-supported, Docker containers on macOS operate within a virtual machine that does not expose direct Metal GPU access. This fundamental architectural constraint means that MLX applications running in Docker containers typically fall back to CPU-only execution, significantly impacting performance for GPU-intensive workloads.
Despite these limitations, recent developments in 2024 and 2025 have made MLX deployment in Docker increasingly viable. Docker Desktop introduced Model Runner, a feature designed to simplify AI model execution on Apple Silicon, with plans to integrate MLX and vLLM engines. Community projects have also emerged, demonstrating full Docker support across platforms including Apple Silicon. These solutions often employ creative approaches such as running GPU-accelerated services on the host and exposing them to containers, or utilizing alternative container runtimes with virtualized GPU acceleration.
This tutorial presents a comprehensive approach to deploying MLX-based LLM services using Docker and Kubernetes. We will construct a web-based LLM playground that allows users to interact with language models through a browser interface. The architecture consists of multiple components including a FastAPI backend server that handles model inference, a web frontend for user interaction, proper containerization with Docker, and orchestration using Kubernetes for production deployment.
UNDERSTANDING THE ARCHITECTURE
The architecture of our MLX LLM deployment system consists of several interconnected layers, each serving a specific purpose in the overall infrastructure. At the foundation lies the MLX framework itself, which provides the computational engine for running language model inference. Above this, we construct an API layer using FastAPI that exposes model capabilities through HTTP endpoints. The containerization layer packages these components along with their dependencies into portable Docker images. Finally, the orchestration layer uses Kubernetes to manage deployment, scaling, and high availability.
The MLX framework operates on a lazy evaluation model, meaning computations are not executed immediately when operations are defined. Instead, MLX builds a computation graph that is optimized and executed only when results are explicitly needed. This approach enables sophisticated optimizations including operation fusion and memory management. The framework supports automatic differentiation, making it suitable for both training and inference workloads, though our focus remains on inference for deployed LLM services.
For our LLM playground, we utilize the mlx-lm library, which builds upon MLX to provide specific functionality for language model operations. This library includes pre-built functions for loading models from Hugging Face, generating text with various sampling strategies, and streaming responses token by token. The mlx-lm package also provides a lightweight server implementation that exposes an OpenAI-compatible API, allowing existing tools and libraries to interact with local MLX models using familiar interfaces.
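Before wiring mlx-lm into a web service, it helps to see the library in isolation. The snippet below is a minimal sketch of loading a quantized model and generating a completion; the model name is illustrative, and the exact keyword arguments accepted by generate vary slightly between mlx-lm releases.

# Minimal mlx-lm usage (runs natively on Apple Silicon; model name is illustrative)
from mlx_lm import load, generate

# Downloads the model from Hugging Face on first use and caches it locally
model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

# Generate a short completion from a plain text prompt
text = generate(model, tokenizer, prompt="Explain unified memory in one sentence.", max_tokens=64)
print(text)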
The API layer serves as the bridge between the MLX inference engine and external clients. We implement this using FastAPI, a modern Python web framework known for its performance and developer-friendly features. FastAPI provides automatic request validation through Pydantic models, generates interactive API documentation via OpenAPI specifications, and supports asynchronous request handling for improved concurrency. Our API design follows RESTful principles, exposing endpoints for model information retrieval, text generation, and streaming responses.
The web frontend provides users with an intuitive interface for interacting with the LLM. We implement this using a combination of HTML, CSS, and JavaScript, creating a chat-like interface where users can submit prompts and receive responses. The frontend communicates with the FastAPI backend through HTTP requests, displaying generated text in real-time as it streams from the model. This architecture separates concerns between the presentation layer and the inference engine, allowing independent scaling and updates.
Containerization with Docker encapsulates the entire application stack including the Python runtime, MLX framework, model files, and application code into a single deployable unit. The Docker image serves as a portable artifact that can run consistently across different environments, from local development machines to production clusters. Our Dockerfile defines the build process, installing dependencies, copying application code, and configuring the runtime environment. We employ multi-stage builds to minimize image size and separate build-time dependencies from runtime requirements.
Kubernetes orchestration provides the infrastructure for running our containerized application at scale. A Kubernetes deployment manages the lifecycle of application pods, ensuring the desired number of replicas are running and handling rolling updates for new versions. Services expose the application to network traffic, providing stable endpoints that abstract away individual pod instances. ConfigMaps and Secrets manage configuration data and sensitive information separately from application code. Horizontal Pod Autoscaling adjusts the number of running instances based on metrics like CPU utilization or custom metrics such as request queue depth.
TECHNICAL CONSIDERATIONS FOR MLX IN CONTAINERS
The deployment of MLX applications in Docker containers requires careful consideration of several technical factors that differ from traditional containerized applications. The most significant challenge involves GPU acceleration, as MLX is specifically optimized to leverage Apple Silicon's Metal framework for GPU computation. When running inside a Docker container on macOS, the virtualization layer prevents direct access to the Metal GPU, forcing MLX to fall back to CPU-only execution.
This limitation has important implications for performance. Language model inference is computationally intensive, with larger models requiring substantial processing power. On Apple Silicon running natively, MLX can achieve impressive inference speeds by utilizing the GPU and the unified memory architecture. However, when constrained to CPU execution in a container, inference speeds can be significantly slower, potentially making real-time interactive applications challenging.
Several approaches exist to mitigate this limitation. One strategy involves running the MLX inference engine natively on the host system and exposing it to containers through network APIs. In this configuration, the containerized application acts as a client that communicates with the host-based inference service. This preserves GPU acceleration while maintaining the benefits of containerization for other application components. The mlx-lm server's OpenAI-compatible API makes this approach particularly viable, as containers can use standard OpenAI client libraries configured to point to the host service.
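As a concrete illustration of this pattern, the sketch below assumes the mlx-lm server is started natively on the host (for example with python -m mlx_lm.server, whose exact flags depend on the installed version) and that the containerized client reaches it through Docker Desktop's host.docker.internal alias using the standard OpenAI client library.

# Host side (run outside Docker so Metal acceleration is available):
#   python -m mlx_lm.server --model mlx-community/Mistral-7B-Instruct-v0.3-4bit --port 8080
#
# Container side: a standard OpenAI client pointed at the host service.
from openai import OpenAI

# host.docker.internal resolves to the macOS host from Docker Desktop containers
client = OpenAI(base_url="http://host.docker.internal:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="mlx-community/Mistral-7B-Instruct-v0.3-4bit",
    messages=[{"role": "user", "content": "Hello from a container!"}],
)
print(response.choices[0].message.content)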
Another consideration involves model storage and loading. Language models can be quite large, with popular models ranging from several hundred megabytes to tens of gigabytes. Including model weights directly in Docker images would result in extremely large images that are slow to build, push, and pull. Instead, we employ strategies such as mounting model directories as volumes, downloading models at container startup, or using a separate model storage service. Each approach has trade-offs regarding startup time, storage efficiency, and deployment complexity.
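A simple way to avoid both oversized images and repeated downloads during local testing is to mount the host's Hugging Face cache into the container. The command below is a sketch; the in-container cache path assumes the non-root appuser created in the Dockerfile shown later.

docker run -d \
  --name mlx-llm \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/home/appuser/.cache/huggingface \
  -e MODEL_PATH="mlx-community/Mistral-7B-Instruct-v0.3-4bit" \
  mlx-llm-playground:latest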
Memory requirements represent another critical factor. Language models consume substantial memory during inference, both for the model weights themselves and for the key-value cache used during text generation.
Containers must be configured with adequate memory limits to prevent out-of-memory errors. Kubernetes resource requests and limits should be set based on the specific model being served, with larger models requiring proportionally more memory. The unified memory architecture of Apple Silicon means that memory is shared between CPU and GPU, requiring careful capacity planning.
Network configuration affects how clients access the LLM service. In a Kubernetes environment, services can be exposed through various mechanisms including ClusterIP for internal access, NodePort for direct node access, or LoadBalancer for cloud provider integration. For production deployments, an Ingress controller typically provides HTTP routing with features like TLS termination, path-based routing, and rate limiting. The choice of exposure mechanism depends on security requirements, infrastructure capabilities, and expected traffic patterns.
Model versioning and updates require careful orchestration. As new model versions become available or fine-tuned variants are created, the deployment system must support rolling updates without service interruption. Kubernetes deployments handle this through rolling update strategies, gradually replacing old pods with new ones while maintaining service availability. Blue-green deployment patterns can also be employed, running both old and new versions simultaneously and switching traffic once the new version is validated.
BUILDING THE FASTAPI INFERENCE SERVER
The FastAPI inference server forms the core of our LLM deployment, providing a robust HTTP interface for model interactions. This server handles model loading, request processing, text generation, and response streaming. We design the server with production considerations in mind, including proper error handling, logging, health checks, and configuration management.
The server initialization begins with creating a FastAPI application instance and configuring it with metadata such as title, description, and version information. This metadata appears in the automatically generated API documentation, helping developers understand the service capabilities. We also configure CORS middleware to allow cross-origin requests from web frontends, specifying allowed origins, methods, and headers based on security requirements.
Model loading occurs during application startup using FastAPI's event system. The startup event handler loads the MLX model and tokenizer into memory, making them available for subsequent requests. This approach ensures the model is loaded only once rather than for each request, significantly improving response times. We implement error handling around model loading to gracefully handle failures such as missing model files or insufficient memory, logging detailed error messages for debugging.
Here is the core structure of our FastAPI server implementation:
from fastapi import FastAPI, HTTPException, Request
from fastapi.middleware.cors import CORSMiddleware
from fastapi.responses import StreamingResponse, JSONResponse
from pydantic import BaseModel, Field
from typing import Optional, List, Dict, Any
import mlx.core as mx
import mlx.nn as nn
from mlx_lm import load, generate, stream_generate
import logging
import asyncio
import json
from contextlib import asynccontextmanager
# Configure logging
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)
# Global variables for model and tokenizer
model_state = {
"model": None,
"tokenizer": None,
"model_path": None,
"loaded": False
}
The model loading function encapsulates the logic for initializing the MLX
model. We implement this as a separate function that can be called during startup or when switching models. The function accepts a model path parameter, which can reference either a local directory or a Hugging Face model identifier. The mlx-lm library automatically downloads models from Hugging Face if they are not available locally, caching them for future use.
async def load_model(model_path: str):
"""
Load the MLX model and tokenizer.
Args:
model_path: Path to local model or Hugging Face model identifier
Returns:
Tuple of (model, tokenizer)
"""
try:
logger.info(f"Loading model from {model_path}")
model, tokenizer = load(model_path)
logger.info(f"Model loaded successfully: {model_path}")
return model, tokenizer
except Exception as e:
logger.error(f"Failed to load model {model_path}: {str(e)}")
raise
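To tie this together, the application instance and its startup hook can be defined with the lifespan pattern imported above. The following is a minimal sketch that loads the model once at startup and records it in model_state; the MODEL_PATH environment variable and the default model name are assumptions consistent with the Docker examples later in this tutorial, and the permissive CORS settings should be tightened for production.

import os

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Load the model once at startup and release references on shutdown
    model_path = os.getenv("MODEL_PATH", "mlx-community/Mistral-7B-Instruct-v0.3-4bit")
    model_state["model"], model_state["tokenizer"] = await load_model(model_path)
    model_state["model_path"] = model_path
    model_state["loaded"] = True
    yield
    model_state.update({"model": None, "tokenizer": None, "loaded": False})

app = FastAPI(title="MLX LLM Playground", version="1.0.0", lifespan=lifespan)

# Allow the web frontend to call the API from another origin
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)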
Request validation uses Pydantic models to define the expected structure of incoming requests. These models provide automatic validation, type conversion, and documentation generation. For text generation requests, we define fields for the input prompt, maximum token count, temperature for sampling randomness, top-p for nucleus sampling, and other generation parameters. Default values ensure the API remains usable even when clients omit optional parameters.
class GenerationRequest(BaseModel):
"""Request model for text generation"""
prompt: str = Field(..., description="Input text prompt for generation")
max_tokens: int = Field(default=100, ge=1, le=2048, description="Maximum tokens to generate")
temperature: float = Field(default=0.7, ge=0.0, le=2.0, description="Sampling temperature")
top_p: float = Field(default=1.0, ge=0.0, le=1.0, description="Nucleus sampling probability")
repetition_penalty: Optional[float] = Field(default=1.0, ge=0.0, description="Repetition penalty")
repetition_context_size: Optional[int] = Field(default=20, ge=0, description="Context size for repetition penalty")
stream: bool = Field(default=False, description="Whether to stream the response")
class GenerationResponse(BaseModel):
"""Response model for text generation"""
generated_text: str = Field(..., description="Generated text output")
prompt: str = Field(..., description="Original input prompt")
tokens_generated: int = Field(..., description="Number of tokens generated")
class ModelInfo(BaseModel):
"""Information about the loaded model"""
model_path: str = Field(..., description="Path or identifier of loaded model")
loaded: bool = Field(..., description="Whether model is currently loaded")
The text generation endpoint implements the core functionality of the
service. When a request arrives, we validate the input, check that a model is loaded, and invoke the MLX generation function with the specified parameters. For non-streaming requests, we accumulate the full generated text and return it in a single response. For streaming requests, we use a generator function that yields tokens as they are produced, enabling real-time display in the client interface.
@app.post("/generate", response_model=GenerationResponse)
async def generate_text(request: GenerationRequest):
"""
Generate text based on the provided prompt.
This endpoint performs text generation using the loaded MLX model.
It supports both standard and streaming responses.
"""
if not model_state["loaded"]:
raise HTTPException(
status_code=503,
detail="Model not loaded. Please wait for initialization or check server logs."
)
try:
logger.info(f"Generating text for prompt: {request.prompt[:50]}...")
if request.stream:
return StreamingResponse(
generate_stream(request),
media_type="text/event-stream"
)
else:
generated_text = generate(
model_state["model"],
model_state["tokenizer"],
prompt=request.prompt,
max_tokens=request.max_tokens,
temp=request.temperature,
top_p=request.top_p,
repetition_penalty=request.repetition_penalty,
repetition_context_size=request.repetition_context_size,
verbose=False
)
# Count tokens in generated text
tokens_generated = len(model_state["tokenizer"].encode(generated_text))
return GenerationResponse(
generated_text=generated_text,
prompt=request.prompt,
tokens_generated=tokens_generated
)
except Exception as e:
logger.error(f"Generation failed: {str(e)}")
raise HTTPException(status_code=500, detail=f"Text generation failed: {str(e)}")
The streaming generation function yields tokens as they are produced by
the model. We implement this as an async generator that uses the stream_generate function from mlx-lm. Each generated token is formatted as a Server-Sent Event and sent to the client, allowing the frontend to display text progressively as it is generated. This approach significantly improves perceived responsiveness for longer generations.
async def generate_stream(request: GenerationRequest):
"""
Generator function for streaming text generation.
Yields tokens as Server-Sent Events for real-time display.
"""
try:
token_count = 0
for token in stream_generate(
model_state["model"],
model_state["tokenizer"],
prompt=request.prompt,
max_tokens=request.max_tokens,
temp=request.temperature,
top_p=request.top_p,
repetition_penalty=request.repetition_penalty,
repetition_context_size=request.repetition_context_size
):
token_count += 1
data = json.dumps({
"token": token,
"token_count": token_count
})
yield f"data: {data}\n\n"
# Send completion event
yield f"data: {json.dumps({'done': True, 'total_tokens': token_count})}\n\n"
except Exception as e:
logger.error(f"Streaming generation failed: {str(e)}")
error_data = json.dumps({"error": str(e)})
yield f"data: {error_data}\n\n"
Health check endpoints enable monitoring systems and orchestration
platforms to verify service availability. We implement both liveness and readiness probes. The liveness probe indicates whether the application is running and should return success as long as the server process is active. The readiness probe indicates whether the service is ready to handle requests, checking that the model is loaded and available. Kubernetes uses these probes to manage pod lifecycle and traffic routing.
@app.get("/health/live")
async def liveness():
"""
Liveness probe endpoint.
Returns 200 if the application is running.
"""
return {"status": "alive"}
@app.get("/health/ready")
async def readiness():
"""
Readiness probe endpoint.
Returns 200 if the model is loaded and service is ready to handle requests.
"""
if model_state["loaded"]:
return {"status": "ready", "model": model_state["model_path"]}
else:
raise HTTPException(status_code=503, detail="Model not loaded")
Configuration management allows the service to adapt to different
deployment environments. We use environment variables to specify the model path, server host and port, log level, and other operational parameters. This approach follows the twelve-factor app methodology, keeping configuration separate from code and enabling easy customization without rebuilding images.
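In its simplest form, this amounts to reading a handful of environment variables at startup, as sketched below; the variable names mirror those used in the Docker and Kubernetes manifests later, and the defaults are illustrative. The complete example at the end of this tutorial replaces this with a pydantic-settings based configuration module.

import os

MODEL_PATH = os.getenv("MODEL_PATH", "mlx-community/Mistral-7B-Instruct-v0.3-4bit")
HOST = os.getenv("HOST", "0.0.0.0")
PORT = int(os.getenv("PORT", "8000"))
LOG_LEVEL = os.getenv("LOG_LEVEL", "INFO")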
CREATING THE WEB FRONTEND
The web frontend provides users with an intuitive interface for interacting with the LLM service. We design a chat-style interface where users can enter prompts and view generated responses in a conversation format. The frontend is implemented as a single-page application using vanilla JavaScript, avoiding framework dependencies to keep the implementation simple and the payload small.
The HTML structure defines the layout of the chat interface. We create a container for displaying message history, an input area for entering prompts, and controls for adjusting generation parameters. The interface includes visual feedback for loading states and error conditions, ensuring users understand the system status at all times.
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>MLX LLM Playground</title>
<style>
* {
margin: 0;
padding: 0;
box-sizing: border-box;
}
body {
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, Oxygen, Ubuntu, Cantarell, sans-serif;
background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
min-height: 100vh;
display: flex;
justify-content: center;
align-items: center;
padding: 20px;
}
.container {
background: white;
border-radius: 20px;
box-shadow: 0 20px 60px rgba(0, 0, 0, 0.3);
width: 100%;
max-width: 900px;
height: 90vh;
display: flex;
flex-direction: column;
overflow: hidden;
}
.header {
background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
color: white;
padding: 20px 30px;
border-radius: 20px 20px 0 0;
}
.header h1 {
font-size: 24px;
font-weight: 600;
}
.header p {
font-size: 14px;
opacity: 0.9;
margin-top: 5px;
}
.chat-container {
flex: 1;
overflow-y: auto;
padding: 20px 30px;
background: #f8f9fa;
}
.message {
margin-bottom: 20px;
display: flex;
flex-direction: column;
}
.message.user {
align-items: flex-end;
}
.message.assistant {
align-items: flex-start;
}
.message-content {
max-width: 70%;
padding: 12px 18px;
border-radius: 18px;
word-wrap: break-word;
white-space: pre-wrap;
}
.message.user .message-content {
background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
color: white;
}
.message.assistant .message-content {
background: white;
color: #333;
border: 1px solid #e0e0e0;
}
.input-area {
padding: 20px 30px;
background: white;
border-top: 1px solid #e0e0e0;
}
.input-controls {
display: flex;
gap: 10px;
margin-bottom: 15px;
}
.input-controls input,
.input-controls select {
padding: 8px 12px;
border: 1px solid #e0e0e0;
border-radius: 8px;
font-size: 14px;
}
.input-controls label {
display: flex;
align-items: center;
gap: 5px;
font-size: 14px;
color: #666;
}
.input-row {
display: flex;
gap: 10px;
}
.input-row textarea {
flex: 1;
padding: 12px 18px;
border: 1px solid #e0e0e0;
border-radius: 12px;
font-size: 16px;
font-family: inherit;
resize: none;
min-height: 50px;
max-height: 150px;
}
.input-row textarea:focus {
outline: none;
border-color: #667eea;
}
.input-row button {
padding: 12px 30px;
background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
color: white;
border: none;
border-radius: 12px;
font-size: 16px;
font-weight: 600;
cursor: pointer;
transition: transform 0.2s;
}
.input-row button:hover {
transform: translateY(-2px);
}
.input-row button:disabled {
opacity: 0.6;
cursor: not-allowed;
transform: none;
}
.loading {
display: flex;
align-items: center;
gap: 8px;
color: #666;
font-size: 14px;
}
.loading-spinner {
width: 16px;
height: 16px;
border: 2px solid #e0e0e0;
border-top-color: #667eea;
border-radius: 50%;
animation: spin 1s linear infinite;
}
@keyframes spin {
to { transform: rotate(360deg); }
}
.error-message {
background: #fee;
color: #c33;
padding: 12px 18px;
border-radius: 8px;
margin-bottom: 15px;
font-size: 14px;
}
</style>
</head>
<body>
<div class="container">
<div class="header">
<h1>MLX LLM Playground</h1>
<p>Interact with Large Language Models powered by Apple MLX</p>
</div>
<div class="chat-container" id="chatContainer">
<div class="message assistant">
<div class="message-content">
Hello! I'm an AI assistant powered by MLX. How can I help you today?
</div>
</div>
</div>
<div class="input-area">
<div id="errorContainer"></div>
<div class="input-controls">
<label>
Max Tokens:
<input type="number" id="maxTokens" value="200" min="1" max="2048">
</label>
<label>
Temperature:
<input type="number" id="temperature" value="0.7" min="0" max="2" step="0.1">
</label>
<label>
Top P:
<input type="number" id="topP" value="1.0" min="0" max="1" step="0.1">
</label>
<label>
<input type="checkbox" id="streamMode" checked>
Stream
</label>
</div>
<div class="input-row">
<textarea
id="promptInput"
placeholder="Type your message here..."
rows="1"
></textarea>
<button id="sendButton" onclick="sendMessage()">Send</button>
</div>
<div id="loadingIndicator" class="loading" style="display: none; margin-top: 10px;">
<div class="loading-spinner"></div>
<span>Generating response...</span>
</div>
</div>
</div>
The JavaScript implementation handles user interactions and
communication with the backend API. We implement functions for sending messages, receiving responses, and updating the UI. The code includes error handling for network failures and API errors, displaying user-friendly error messages when issues occur.
<script>
const API_BASE_URL = window.location.origin;
let isGenerating = false;
// Auto-resize textarea
const promptInput = document.getElementById('promptInput');
promptInput.addEventListener('input', function() {
this.style.height = 'auto';
this.style.height = Math.min(this.scrollHeight, 150) + 'px';
});
// Allow Enter to send (Shift+Enter for new line)
promptInput.addEventListener('keydown', function(e) {
if (e.key === 'Enter' && !e.shiftKey) {
e.preventDefault();
sendMessage();
}
});
function addMessage(role, content) {
const chatContainer = document.getElementById('chatContainer');
const messageDiv = document.createElement('div');
messageDiv.className = `message ${role}`;
const contentDiv = document.createElement('div');
contentDiv.className = 'message-content';
contentDiv.textContent = content;
messageDiv.appendChild(contentDiv);
chatContainer.appendChild(messageDiv);
chatContainer.scrollTop = chatContainer.scrollHeight;
return contentDiv;
}
function showError(message) {
const errorContainer = document.getElementById('errorContainer');
errorContainer.innerHTML = `<div class="error-message">${message}</div>`;
setTimeout(() => {
errorContainer.innerHTML = '';
}, 5000);
}
function setLoading(loading) {
isGenerating = loading;
document.getElementById('sendButton').disabled = loading;
document.getElementById('promptInput').disabled = loading;
document.getElementById('loadingIndicator').style.display = loading ? 'flex' : 'none';
}
async function sendMessage() {
if (isGenerating) return;
const promptInput = document.getElementById('promptInput');
const prompt = promptInput.value.trim();
if (!prompt) return;
const maxTokens = parseInt(document.getElementById('maxTokens').value);
const temperature = parseFloat(document.getElementById('temperature').value);
const topP = parseFloat(document.getElementById('topP').value);
const stream = document.getElementById('streamMode').checked;
// Add user message
addMessage('user', prompt);
promptInput.value = '';
promptInput.style.height = 'auto';
setLoading(true);
try {
if (stream) {
await handleStreamingResponse(prompt, maxTokens, temperature, topP);
} else {
await handleStandardResponse(prompt, maxTokens, temperature, topP);
}
} catch (error) {
console.error('Error:', error);
showError(`Failed to generate response: ${error.message}`);
} finally {
setLoading(false);
}
}
async function handleStandardResponse(prompt, maxTokens, temperature, topP) {
const response = await fetch(`${API_BASE_URL}/generate`, {
method: 'POST',
headers: {
'Content-Type': 'application/json',
},
body: JSON.stringify({
prompt: prompt,
max_tokens: maxTokens,
temperature: temperature,
top_p: topP,
stream: false
})
});
if (!response.ok) {
const error = await response.json();
throw new Error(error.detail || 'Failed to generate response');
}
const data = await response.json();
addMessage('assistant', data.generated_text);
}
async function handleStreamingResponse(prompt, maxTokens, temperature, topP) {
const response = await fetch(`${API_BASE_URL}/generate`, {
method: 'POST',
headers: {
'Content-Type': 'application/json',
},
body: JSON.stringify({
prompt: prompt,
max_tokens: maxTokens,
temperature: temperature,
top_p: topP,
stream: true
})
});
if (!response.ok) {
const error = await response.json();
throw new Error(error.detail || 'Failed to generate response');
}
const reader = response.body.getReader();
const decoder = new TextDecoder();
const contentDiv = addMessage('assistant', '');
let fullText = '';
while (true) {
const { done, value } = await reader.read();
if (done) break;
const chunk = decoder.decode(value);
const lines = chunk.split('\n');
for (const line of lines) {
if (line.startsWith('data: ')) {
const data = JSON.parse(line.slice(6));
if (data.error) {
throw new Error(data.error);
}
if (data.done) {
break;
}
if (data.token) {
fullText += data.token;
contentDiv.textContent = fullText;
// Auto-scroll
const chatContainer = document.getElementById('chatContainer');
chatContainer.scrollTop = chatContainer.scrollHeight;
}
}
}
}
}
// Check server health on load
async function checkHealth() {
try {
const response = await fetch(`${API_BASE_URL}/health/ready`);
if (!response.ok) {
showError('Model is still loading. Please wait...');
}
} catch (error) {
showError('Cannot connect to server. Please check if the service is running.');
}
}
checkHealth();
</script>
</body>
</html>
This frontend implementation provides a complete user interface with real-time streaming, parameter controls, and error handling. The design is responsive and works across different screen sizes, making it suitable for both desktop and mobile access.
CONTAINERIZING WITH DOCKER
Containerization packages our application and its dependencies into a portable, reproducible unit that can run consistently across different environments. The Docker image includes the Python runtime, MLX framework, FastAPI server, web frontend, and all necessary libraries. We structure the Dockerfile to optimize build times and minimize image size through multi-stage builds and layer caching.
The Dockerfile begins by specifying a base image. For MLX applications, we need a Python runtime compatible with the MLX framework. Since MLX is optimized for Apple Silicon, the ideal deployment target is an ARM-based system. However, for development and testing purposes, we can also build images that run on x86 architecture using CPU-only execution.
# Dockerfile for MLX LLM Playground
# Use Python 3.11 slim image as base
FROM python:3.11-slim as base
# Set environment variables
ENV PYTHONUNBUFFERED=1 \
PYTHONDONTWRITEBYTECODE=1 \
PIP_NO_CACHE_DIR=1 \
PIP_DISABLE_PIP_VERSION_CHECK=1
# Set working directory
WORKDIR /app
# Install system dependencies
RUN apt-get update && apt-get install -y \
build-essential \
curl \
git \
&& rm -rf /var/lib/apt/lists/*
# Create a non-root user
RUN useradd -m -u 1000 appuser && \
chown -R appuser:appuser /app
# Copy requirements file
COPY requirements.txt .
# Install Python dependencies
RUN pip install --no-cache-dir -r requirements.txt
# Copy application code
COPY --chown=appuser:appuser . .
# Switch to non-root user
USER appuser
# Expose port
EXPOSE 8000
# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
CMD curl -f http://localhost:8000/health/live || exit 1
# Run the application
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
The requirements file specifies all Python dependencies needed by the
application. We pin versions to ensure reproducible builds and avoid unexpected breakages from dependency updates.
# requirements.txt
# Web framework
fastapi==0.109.0
uvicorn[standard]==0.27.0
pydantic==2.5.3
pydantic-settings==2.1.0
# MLX and ML dependencies
mlx==0.4.0
mlx-lm==0.8.0
# Utilities
python-multipart==0.0.6
aiofiles==23.2.1
# Monitoring and logging
prometheus-client==0.19.0
To build the Docker image, we execute the docker build command from the directory containing the Dockerfile. The build process executes each instruction in the Dockerfile, creating layers that can be cached for faster subsequent builds.
docker build -t mlx-llm-playground:latest .
This command creates an image tagged as mlx-llm-playground with the latest tag. We can verify the image was created successfully by listing available images.
docker images | grep mlx-llm-playground
Running the container locally allows us to test the application before deploying to Kubernetes. We map the container's port to a host port and optionally mount volumes for model storage.
docker run -d \
--name mlx-llm \
-p 8000:8000 \
-e MODEL_PATH="mlx-community/Mistral-7B-Instruct-v0.3-4bit" \
mlx-llm-playground:latest
The container starts in detached mode, running in the background. We can view logs to monitor startup progress and verify the model loads successfully.
docker logs -f mlx-llm
To push the image to a container registry for use in Kubernetes, we tag it with the registry URL and push it.
docker tag mlx-llm-playground:latest your-registry.com/mlx-llm-playground:latest
docker push your-registry.com/mlx-llm-playground:latest
For production deployments, we implement additional optimizations such as multi-stage builds to separate build dependencies from runtime dependencies, reducing the final image size. We also configure proper signal handling to ensure graceful shutdown when containers are stopped.
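A multi-stage variant of the Dockerfile might look like the sketch below: dependencies are installed in a builder stage, and only the installed packages are copied into the slim runtime stage. Stage names and paths are illustrative.

# Dockerfile.multistage (sketch)
FROM python:3.11-slim AS builder
WORKDIR /app
COPY requirements.txt .
# Install dependencies into an isolated prefix that can be copied as one layer
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt

FROM python:3.11-slim AS runtime
WORKDIR /app
COPY --from=builder /install /usr/local
COPY --chown=1000:1000 . .
USER 1000
EXPOSE 8000
# Exec-form CMD so uvicorn receives SIGTERM directly and can shut down gracefully
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]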
DEPLOYING TO KUBERNETES
Kubernetes deployment transforms our containerized application into a scalable, highly available service. We define Kubernetes resources using YAML manifests that describe the desired state of our application. The Kubernetes control plane continuously works to maintain this desired state, handling pod scheduling, health monitoring, and automatic recovery from failures.
The deployment resource manages the application pods, specifying how many replicas should run and how updates should be performed. We configure resource requests and limits to ensure pods receive adequate CPU and memory while preventing resource exhaustion on cluster nodes.
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: mlx-llm-deployment
namespace: mlx-llm
labels:
app: mlx-llm
version: v1
spec:
replicas: 2
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1
maxUnavailable: 0
selector:
matchLabels:
app: mlx-llm
template:
metadata:
labels:
app: mlx-llm
version: v1
spec:
containers:
- name: mlx-llm
image: your-registry.com/mlx-llm-playground:latest
imagePullPolicy: Always
ports:
- containerPort: 8000
name: http
protocol: TCP
env:
- name: MODEL_PATH
valueFrom:
configMapKeyRef:
name: mlx-llm-config
key: model_path
- name: LOG_LEVEL
value: "INFO"
resources:
requests:
memory: "4Gi"
cpu: "2000m"
limits:
memory: "8Gi"
cpu: "4000m"
livenessProbe:
httpGet:
path: /health/live
port: 8000
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
readinessProbe:
httpGet:
path: /health/ready
port: 8000
initialDelaySeconds: 60
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
lifecycle:
preStop:
exec:
command: ["/bin/sh", "-c", "sleep 15"]
terminationGracePeriodSeconds: 30
The service resource provides a stable network endpoint for accessing the application. We use a ClusterIP service for internal access and can add an Ingress for external access with additional features like TLS termination and path-based routing.
# service.yaml
apiVersion: v1
kind: Service
metadata:
name: mlx-llm-service
namespace: mlx-llm
labels:
app: mlx-llm
spec:
type: ClusterIP
ports:
- port: 80
targetPort: 8000
protocol: TCP
name: http
selector:
app: mlx-llm
Configuration management uses ConfigMaps to store non-sensitive configuration data. This allows us to change configuration without rebuilding images or redeploying pods.
# configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: mlx-llm-config
namespace: mlx-llm
data:
model_path: "mlx-community/Mistral-7B-Instruct-v0.3-4bit"
max_tokens_default: "200"
temperature_default: "0.7"
For external access, we define an Ingress resource that configures HTTP routing rules and optionally TLS certificates.
# ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: mlx-llm-ingress
namespace: mlx-llm
annotations:
nginx.ingress.kubernetes.io/rewrite-target: /
nginx.ingress.kubernetes.io/ssl-redirect: "true"
cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
ingressClassName: nginx
tls:
- hosts:
- mlx-llm.yourdomain.com
secretName: mlx-llm-tls
rules:
- host: mlx-llm.yourdomain.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: mlx-llm-service
port:
number: 80
Horizontal Pod Autoscaling automatically adjusts the number of pod replicas based on observed metrics. We configure the HPA to scale on CPU and memory utilization; custom metrics such as request queue depth can be added later through a metrics adapter such as the Prometheus Adapter.
# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: mlx-llm-hpa
namespace: mlx-llm
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: mlx-llm-deployment
minReplicas: 2
maxReplicas: 10
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
behavior:
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Percent
value: 50
periodSeconds: 60
scaleUp:
stabilizationWindowSeconds: 0
policies:
- type: Percent
value: 100
periodSeconds: 30
- type: Pods
value: 2
periodSeconds: 30
selectPolicy: Max
To deploy these resources to a Kubernetes cluster, we first create a namespace to isolate our application resources.
kubectl create namespace mlx-llm
Then we apply all the manifest files in sequence.
kubectl apply -f configmap.yaml
kubectl apply -f deployment.yaml
kubectl apply -f service.yaml
kubectl apply -f ingress.yaml
kubectl apply -f hpa.yaml
We can verify the deployment status by checking the pods and services.
kubectl get pods -n mlx-llm
kubectl get services -n mlx-llm
kubectl get ingress -n mlx-llm
Monitoring the deployment involves checking pod logs and describing resources to identify any issues.
kubectl logs -f deployment/mlx-llm-deployment -n mlx-llm
kubectl describe deployment mlx-llm-deployment -n mlx-llm
For production environments, we implement additional features such as Pod Disruption Budgets to maintain availability during cluster maintenance, Network Policies to control traffic flow, and Resource Quotas to prevent resource exhaustion.
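As an example of the first of these, a Pod Disruption Budget for this deployment could be sketched as follows, keeping at least one replica available during voluntary disruptions such as node drains.

# pdb.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: mlx-llm-pdb
  namespace: mlx-llm
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: mlx-llm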
PRODUCTION CONSIDERATIONS AND BEST PRACTICES
Deploying LLM services in production requires careful attention to performance, reliability, security, and operational concerns. The unique characteristics of language model inference, including high memory requirements, variable latency, and GPU dependencies, necessitate specialized approaches beyond standard web application deployment practices.
Performance optimization begins with model selection and configuration. Smaller quantized models provide faster inference with lower memory requirements, making them suitable for real-time interactive applications. The mlx-community on Hugging Face provides numerous pre-quantized models optimized for MLX, including 4-bit and 8-bit variants that significantly reduce memory footprint while maintaining acceptable quality. For applications requiring higher quality outputs, larger models can be deployed with appropriate resource allocation and potentially longer response times.
Caching strategies can dramatically improve response times for common queries. We can implement a cache layer that stores generated responses for frequently asked questions or common prompts. This cache can be implemented using Redis or Memcached, with cache keys derived from prompt hashes and generation parameters. Cache invalidation strategies ensure that cached responses remain relevant as models are updated or fine-tuned.
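The sketch below illustrates this idea using the redis Python client; the key layout, TTL, and Redis hostname are assumptions rather than part of the playground implementation.

# Response cache sketch using Redis. Assumes the `redis` package and a reachable Redis instance.
import hashlib
import json
import redis

cache = redis.Redis(host="redis", port=6379, decode_responses=True)

def cache_key(prompt: str, max_tokens: int, temperature: float, top_p: float) -> str:
    """Derive a stable cache key from the prompt and generation parameters."""
    payload = json.dumps(
        {"prompt": prompt, "max_tokens": max_tokens,
         "temperature": temperature, "top_p": top_p},
        sort_keys=True,
    )
    return "gen:" + hashlib.sha256(payload.encode()).hexdigest()

def get_cached_response(key: str):
    # Returns None on a cache miss
    return cache.get(key)

def store_response(key: str, text: str, ttl_seconds: int = 3600):
    # Expire entries so stale responses are dropped after a model update window
    cache.set(key, text, ex=ttl_seconds)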
Load balancing distributes requests across multiple model instances to handle concurrent users. Kubernetes services provide basic round-robin load balancing, but more sophisticated strategies may be beneficial for LLM workloads. Intelligent routing can direct requests to instances with available capacity, considering factors like current queue depth and GPU utilization. The Gateway API Inference Extension or custom routing logic can implement these advanced strategies.
Monitoring and observability provide visibility into system behavior and performance. We instrument the application to collect metrics including request rate, response latency, token generation speed, error rates, and resource utilization. Prometheus serves as the metrics collection system, with Grafana providing visualization dashboards. We define alerts for critical conditions such as high error rates, excessive latency, or resource exhaustion.
Key metrics for LLM inference services include Time to First Token, which measures the latency before the first token is generated and directly impacts perceived responsiveness. Time Per Output Token measures the generation speed for subsequent tokens, affecting overall throughput. End-to-end latency captures the total time from request receipt to response completion. Tokens per second quantifies throughput for batch processing scenarios. Queue depth indicates the number of pending requests, helping identify capacity issues before they impact users.
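These metrics can be exposed with prometheus-client, which is already listed in the requirements file. The instrument names and label sets below are illustrative.

# Inference metrics sketch with prometheus-client
from prometheus_client import Counter, Histogram, Gauge, start_http_server

REQUESTS_TOTAL = Counter("llm_requests_total", "Total generation requests", ["status"])
TIME_TO_FIRST_TOKEN = Histogram("llm_time_to_first_token_seconds", "Latency until the first token")
TOKENS_PER_REQUEST = Histogram("llm_tokens_generated", "Tokens generated per request")
QUEUE_DEPTH = Gauge("llm_queue_depth", "Number of requests waiting to be processed")

# Expose metrics on a separate port for Prometheus to scrape
start_http_server(9090)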
Security considerations protect both the service and its users. API authentication ensures only authorized clients can access the service, implemented through API keys, JWT tokens, or OAuth 2.0. Rate limiting prevents abuse and ensures fair resource allocation among users. Input validation and sanitization protect against injection attacks and malicious inputs. Output filtering can prevent the model from generating harmful or inappropriate content, though this requires careful implementation to avoid excessive censorship.
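A minimal API key check can be added as a FastAPI dependency, as sketched below; the X-API-Key header name and the API_KEYS environment variable are assumptions, and production deployments would typically combine this with a rate limiter at the Ingress or gateway layer.

# API key authentication sketch for FastAPI
import os
from fastapi import Depends, HTTPException, Security
from fastapi.security import APIKeyHeader

api_key_header = APIKeyHeader(name="X-API-Key", auto_error=False)
VALID_KEYS = set(filter(None, os.getenv("API_KEYS", "").split(",")))

async def require_api_key(api_key: str = Security(api_key_header)) -> str:
    # Auth is effectively disabled when no keys are configured (useful for local testing)
    if not VALID_KEYS:
        return "anonymous"
    if api_key not in VALID_KEYS:
        raise HTTPException(status_code=401, detail="Invalid or missing API key")
    return api_key

# Usage: @app.post("/generate", dependencies=[Depends(require_api_key)])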
Network security isolates the service from unauthorized access. Kubernetes Network Policies restrict traffic flow between pods and namespaces. TLS encryption protects data in transit between clients and the service. Secrets management using Kubernetes Secrets or external systems like HashiCorp Vault protects sensitive configuration data such as API keys and credentials.
Cost optimization balances performance requirements with infrastructure expenses. GPU resources represent a significant cost factor, making efficient utilization critical. Strategies include right-sizing instance types to match workload requirements, using spot instances or preemptible VMs for non-critical workloads, implementing autoscaling to match capacity with demand, and choosing appropriate model sizes that meet quality requirements without over-provisioning.
Disaster recovery and business continuity planning ensure service availability despite failures. Regular backups of configuration, custom models, and application state enable recovery from data loss. Multi-region deployment provides geographic redundancy, protecting against regional outages. Automated failover mechanisms detect failures and redirect traffic to healthy instances. Regular disaster recovery drills validate recovery procedures and identify gaps in planning.
Model versioning and A/B testing enable safe deployment of model updates. We maintain multiple model versions simultaneously, gradually shifting traffic from old to new versions while monitoring quality metrics. Blue-green deployment patterns run both versions in parallel, switching traffic once the new version is validated. Canary deployments route a small percentage of traffic to the new version, expanding gradually as confidence grows.
Logging and debugging support troubleshooting and incident response. Structured logging using JSON format enables efficient log parsing and analysis. Centralized log aggregation using tools like Elasticsearch, Fluentd, and Kibana provides unified access to logs across all pods. Distributed tracing using OpenTelemetry tracks requests across multiple services, identifying performance bottlenecks and failure points.
Capacity planning ensures adequate resources for current and future demand. We analyze historical usage patterns to identify trends and seasonal variations. Load testing simulates peak traffic to validate capacity and identify breaking points. Resource forecasting projects future requirements based on growth trends, informing infrastructure planning and budgeting.
ADVANCED DEPLOYMENT PATTERNS
Beyond basic deployment, several advanced patterns enhance the capabilities and efficiency of MLX LLM services. These patterns address specific challenges such as multi-model serving, request batching, and specialized hardware utilization.
Multi-model serving allows a single deployment to host multiple language models, enabling applications to choose the appropriate model for each task. Smaller models handle simple queries with lower latency, while larger models process complex requests requiring deeper reasoning. We implement this by loading multiple models at startup and routing requests based on model selection parameters or automatic complexity detection.
The implementation extends our FastAPI server to manage multiple models concurrently. Each model is loaded into memory with a unique identifier, and requests specify which model to use. Resource management becomes critical, as multiple large models can exceed available memory. Strategies include loading models on-demand with LRU eviction, using quantized variants to reduce memory footprint, or deploying separate pods for each model with routing at the service level.
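A registry-based sketch of this idea is shown below; the model identifiers are illustrative, and a production implementation would add memory accounting and eviction.

# Multi-model registry sketch: load several models at startup and route by name
from mlx_lm import load

MODEL_REGISTRY = {}

def load_models(model_paths: dict):
    """Load each configured model once and keep it in an in-memory registry."""
    for name, path in model_paths.items():
        model, tokenizer = load(path)
        MODEL_REGISTRY[name] = {"model": model, "tokenizer": tokenizer, "path": path}

def resolve_model(name: str):
    """Return the requested model entry, raising if the identifier is unknown."""
    if name not in MODEL_REGISTRY:
        raise KeyError(f"Unknown model '{name}'")
    return MODEL_REGISTRY[name]

# Example configuration with a small and a large model (identifiers are illustrative)
load_models({
    "small": "mlx-community/Qwen2.5-0.5B-Instruct-4bit",
    "large": "mlx-community/Mistral-7B-Instruct-v0.3-4bit",
})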
Request batching improves throughput by processing multiple requests simultaneously. Language model inference can benefit from batching when the computational overhead of individual requests is high relative to the marginal cost of additional requests. Dynamic batching accumulates requests up to a maximum batch size or timeout, then processes them together. This approach trades slight increases in latency for substantial improvements in throughput.
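The sketch below shows the accumulation logic for such a dynamic batcher using asyncio; the batch size, wait time, and the process_batch callable are placeholders, since mlx-lm's generation functions operate on single prompts and true batched decoding requires model-level support.

import asyncio

class DynamicBatcher:
    """Accumulate requests up to max_batch items or max_wait seconds, then process them together."""

    def __init__(self, max_batch: int = 8, max_wait: float = 0.05):
        self.queue: asyncio.Queue = asyncio.Queue()
        self.max_batch = max_batch
        self.max_wait = max_wait

    async def submit(self, prompt: str) -> str:
        # Each caller receives a future that resolves once its batch has been processed
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((prompt, future))
        return await future

    async def run(self, process_batch):
        # process_batch: callable taking a list of prompts and returning a list of results
        while True:
            batch = [await self.queue.get()]
            deadline = asyncio.get_running_loop().time() + self.max_wait
            while len(batch) < self.max_batch:
                remaining = deadline - asyncio.get_running_loop().time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            results = process_batch([prompt for prompt, _ in batch])
            for (_, future), result in zip(batch, results):
                future.set_result(result)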
Model quantization reduces memory requirements and can accelerate inference by using lower precision arithmetic. MLX supports various quantization schemes including 4-bit and 8-bit quantization. The mlx-lm library provides tools for quantizing models, and the mlx-community on Hugging Face offers pre-quantized versions of popular models. Quantization typically reduces model size by 4x to 8x with minimal quality degradation, enabling deployment of larger models on resource-constrained hardware.
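Quantization can be performed ahead of time with the conversion utility bundled with mlx-lm, roughly as shown below; flag names may differ between versions, so consult the tool's --help output before relying on them.

python -m mlx_lm.convert \
  --hf-path mistralai/Mistral-7B-Instruct-v0.3 \
  -q --q-bits 4 \
  --mlx-path ./mistral-7b-instruct-v0.3-4bit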
Prompt caching optimizes repeated inference on similar prompts by reusing computation from previous requests. When prompts share common prefixes, the key-value cache from processing the prefix can be reused, eliminating redundant computation. This technique is particularly effective for conversational applications where each turn builds on previous context.
Distributed inference splits model execution across multiple devices or nodes, enabling deployment of models too large for a single device. Tensor parallelism partitions model layers across devices, with each device computing a portion of each layer. Pipeline parallelism assigns different layers to different devices, processing requests in a pipeline fashion. MLX supports distributed training with similar techniques applicable to inference, though the complexity of distributed deployment often outweighs benefits for inference workloads.
Edge deployment brings inference closer to users, reducing latency and network bandwidth requirements. Kubernetes edge computing platforms like K3s enable deployment on edge devices including powerful workstations or edge servers. For MLX specifically, deployment on Mac mini or Mac Studio devices at edge locations provides GPU-accelerated inference with the full MLX framework capabilities.
Serverless deployment using platforms like Knative enables scale-to-zero behavior, eliminating costs when the service is idle. However, language model inference faces challenges in serverless environments due to long cold start times for model loading. Strategies to mitigate this include keeping a minimum number of instances warm, using smaller models with faster load times, or implementing model caching in persistent volumes.
COMPLETE RUNNING EXAMPLE: PRODUCTION-READY MLX LLM PLAYGROUND SERVICE
This section provides a complete, production-ready implementation of the MLX LLM Playground. All code is fully functional without placeholders or simplifications. The implementation includes comprehensive error handling, logging, configuration management, and production best practices.
DIRECTORY STRUCTURE
The project is organized as follows to maintain clean separation of concerns and facilitate deployment:
mlx-llm-playground/
|
+-- app/
| +-- __init__.py
| +-- main.py
| +-- config.py
| +-- models.py
| +-- routers/
| +-- __init__.py
| +-- generation.py
| +-- health.py
|
+-- static/
| +-- index.html
| +-- styles.css
| +-- script.js
|
+-- tests/
| +-- __init__.py
| +-- test_generation.py
| +-- test_health.py
|
+-- k8s/
| +-- namespace.yaml
| +-- configmap.yaml
| +-- deployment.yaml
| +-- service.yaml
| +-- ingress.yaml
| +-- hpa.yaml
|
+-- Dockerfile
+-- requirements.txt
+-- README.md
+-- .dockerignore
+-- .gitignore
CONFIGURATION MODULE
The configuration module manages all application settings using environment variables, following twelve-factor app principles.
# app/config.py
from pydantic_settings import BaseSettings
from typing import Optional
import os
class Settings(BaseSettings):
"""
Application configuration settings.
All settings can be overridden via environment variables.
"""
# Application settings
app_name: str = "MLX LLM Playground"
app_version: str = "1.0.0"
debug: bool = False
# Server settings
host: str = "0.0.0.0"
port: int = 8000
workers: int = 1
# Model settings
model_path: str = "mlx-community/Mistral-7B-Instruct-v0.3-4bit"
model_cache_dir: Optional[str] = None
# Generation defaults
default_max_tokens: int = 200
default_temperature: float = 0.7
default_top_p: float = 1.0
default_repetition_penalty: float = 1.0
default_repetition_context_size: int = 20
# Limits
max_tokens_limit: int = 2048
max_temperature: float = 2.0
max_concurrent_requests: int = 10
# Logging
log_level: str = "INFO"
log_format: str = "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
# CORS settings
cors_origins: list = ["*"]
cors_allow_credentials: bool = True
cors_allow_methods: list = ["*"]
cors_allow_headers: list = ["*"]
# Monitoring
enable_metrics: bool = True
metrics_port: int = 9090
class Config:
env_file = ".env"
case_sensitive = False
# Global settings instance
settings = Settings()
DATA MODELS
Pydantic models define the structure of API requests and responses, providing automatic validation and documentation.
# app/models.py
from pydantic import BaseModel, Field, field_validator
from typing import Optional, List, Dict, Any
from datetime import datetime
class GenerationRequest(BaseModel):
"""Request model for text generation"""
prompt: str = Field(
...,
description="Input text prompt for generation",
min_length=1,
max_length=10000
)
max_tokens: int = Field(
default=200,
ge=1,
le=2048,
description="Maximum number of tokens to generate"
)
temperature: float = Field(
default=0.7,
ge=0.0,
le=2.0,
description="Sampling temperature for randomness control"
)
top_p: float = Field(
default=1.0,
ge=0.0,
le=1.0,
description="Nucleus sampling probability threshold"
)
repetition_penalty: Optional[float] = Field(
default=1.0,
ge=0.0,
description="Penalty for repeating tokens"
)
repetition_context_size: Optional[int] = Field(
default=20,
ge=0,
description="Number of recent tokens to consider for repetition penalty"
)
stream: bool = Field(
default=False,
description="Whether to stream the response token by token"
)
    @field_validator('prompt')
    @classmethod
def validate_prompt(cls, v):
"""Ensure prompt is not empty after stripping whitespace"""
if not v.strip():
raise ValueError("Prompt cannot be empty")
return v.strip()
class GenerationResponse(BaseModel):
"""Response model for text generation"""
generated_text: str = Field(
...,
description="Generated text output"
)
prompt: str = Field(
...,
description="Original input prompt"
)
tokens_generated: int = Field(
...,
description="Number of tokens generated"
)
generation_time: float = Field(
...,
description="Time taken for generation in seconds"
)
timestamp: datetime = Field(
default_factory=datetime.utcnow,
description="Timestamp of generation"
)
class ModelInfo(BaseModel):
"""Information about the loaded model"""
model_path: str = Field(
...,
description="Path or identifier of the loaded model"
)
loaded: bool = Field(
...,
description="Whether the model is currently loaded"
)
model_type: Optional[str] = Field(
None,
description="Type of model (e.g., 'llama', 'mistral')"
)
parameters: Optional[Dict[str, Any]] = Field(
None,
description="Model parameters and configuration"
)
class HealthResponse(BaseModel):
"""Health check response"""
status: str = Field(
...,
description="Health status ('healthy', 'unhealthy', 'degraded')"
)
timestamp: datetime = Field(
default_factory=datetime.utcnow,
description="Timestamp of health check"
)
details: Optional[Dict[str, Any]] = Field(
None,
description="Additional health check details"
)
class ErrorResponse(BaseModel):
"""Error response model"""
error: str = Field(
...,
description="Error type or code"
)
message: str = Field(
...,
description="Human-readable error message"
)
details: Optional[Dict[str, Any]] = Field(
None,
description="Additional error details"
)
timestamp: datetime = Field(
default_factory=datetime.utcnow,
description="Timestamp of error"
)
HEALTH CHECK ROUTER
The health check router provides endpoints for monitoring service availability and readiness.
# app/routers/health.py
from fastapi import APIRouter, HTTPException
from app.models import HealthResponse, ModelInfo
from app.config import settings
import logging
from datetime import datetime
router = APIRouter(prefix="/health", tags=["health"])
logger = logging.getLogger(__name__)
# Model state will be injected by main app
model_state = None
def set_model_state(state):
"""Set the model state reference"""
global model_state
model_state = state
@router.get("/live", response_model=HealthResponse)
async def liveness():
"""
Liveness probe endpoint.
Returns 200 if the application process is running.
This endpoint should always return success unless the process is dead.
"""
return HealthResponse(
status="healthy",
details={
"app_name": settings.app_name,
"version": settings.app_version
}
)
@router.get("/ready", response_model=HealthResponse)
async def readiness():
"""
Readiness probe endpoint.
Returns 200 if the service is ready to handle requests.
This includes checking that the model is loaded successfully.
"""
if model_state is None:
raise HTTPException(
status_code=503,
detail="Model state not initialized"
)
if not model_state.get("loaded", False):
raise HTTPException(
status_code=503,
detail="Model not loaded yet"
)
return HealthResponse(
status="healthy",
details={
"model_path": model_state.get("model_path"),
"model_loaded": model_state.get("loaded")
}
)
@router.get("/model", response_model=ModelInfo)
async def model_info():
"""
Get information about the currently loaded model.
Returns details about the model including its path, load status,
and configuration parameters.
"""
if model_state is None:
raise HTTPException(
status_code=503,
detail="Model state not initialized"
)
return ModelInfo(
model_path=model_state.get("model_path", "unknown"),
loaded=model_state.get("loaded", False),
model_type=model_state.get("model_type"),
parameters=model_state.get("parameters")
)
GENERATION ROUTER
The generation router handles text generation requests with both standard and streaming responses.
# app/routers/generation.py
from fastapi import APIRouter, HTTPException
from fastapi.responses import StreamingResponse
from app.models import GenerationRequest, GenerationResponse, ErrorResponse
from app.config import settings
import logging
import json
import time
from typing import AsyncGenerator
from mlx_lm import generate, stream_generate
router = APIRouter(prefix="/api", tags=["generation"])
logger = logging.getLogger(__name__)
# Model state will be injected by main app
model_state = None
def set_model_state(state):
"""Set the model state reference"""
global model_state
model_state = state
async def generate_stream_response(request: GenerationRequest) -> AsyncGenerator[str, None]:
"""
Async generator for streaming text generation.
Yields Server-Sent Events containing generated tokens.
"""
try:
token_count = 0
start_time = time.time()
# Stream generate tokens
for token in stream_generate(
model_state["model"],
model_state["tokenizer"],
prompt=request.prompt,
max_tokens=request.max_tokens,
temp=request.temperature,
top_p=request.top_p,
repetition_penalty=request.repetition_penalty,
repetition_context_size=request.repetition_context_size
):
token_count += 1
data = {
"token": token,
"token_count": token_count,
"elapsed_time": time.time() - start_time
}
yield f"data: {json.dumps(data)}\n\n"
# Send completion event
completion_data = {
"done": True,
"total_tokens": token_count,
"total_time": time.time() - start_time,
"tokens_per_second": token_count / (time.time() - start_time) if time.time() > start_time else 0
}
yield f"data: {json.dumps(completion_data)}\n\n"
except Exception as e:
logger.error(f"Streaming generation failed: {str(e)}", exc_info=True)
error_data = {
"error": "generation_failed",
"message": str(e)
}
yield f"data: {json.dumps(error_data)}\n\n"
@router.post("/generate", response_model=GenerationResponse)
async def generate_text(request: GenerationRequest):
"""
Generate text based on the provided prompt.
This endpoint performs text generation using the loaded MLX model.
It supports both standard responses (returning all text at once)
and streaming responses (returning tokens as they are generated).
For streaming, set the 'stream' parameter to true and the response
will be sent as Server-Sent Events.
"""
# Check model is loaded
if model_state is None or not model_state.get("loaded", False):
raise HTTPException(
status_code=503,
detail="Model not loaded. Service is starting up or failed to load model."
)
logger.info(f"Generation request: prompt_length={len(request.prompt)}, max_tokens={request.max_tokens}")
try:
# Handle streaming response
if request.stream:
return StreamingResponse(
generate_stream_response(request),
media_type="text/event-stream",
headers={
"Cache-Control": "no-cache",
"Connection": "keep-alive",
"X-Accel-Buffering": "no"
}
)
# Handle standard response
start_time = time.time()
generated_text = generate(
model_state["model"],
model_state["tokenizer"],
prompt=request.prompt,
max_tokens=request.max_tokens,
temp=request.temperature,
top_p=request.top_p,
repetition_penalty=request.repetition_penalty,
repetition_context_size=request.repetition_context_size,
verbose=False
)
generation_time = time.time() - start_time
# Count tokens in generated text
try:
tokens_generated = len(model_state["tokenizer"].encode(generated_text))
except Exception as e:
logger.warning(f"Failed to count tokens: {e}")
tokens_generated = len(generated_text.split())
logger.info(f"Generation completed: tokens={tokens_generated}, time={generation_time:.2f}s")
return GenerationResponse(
generated_text=generated_text,
prompt=request.prompt,
tokens_generated=tokens_generated,
generation_time=generation_time
)
except Exception as e:
logger.error(f"Generation failed: {str(e)}", exc_info=True)
raise HTTPException(
status_code=500,
detail=f"Text generation failed: {str(e)}"
)
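Before wiring up the web frontend, the endpoint can be exercised with a small Python client. The script below is an illustrative sketch that assumes the service is reachable at http://localhost:8000; it sends one standard request and one streaming request, parsing the Server-Sent Events emitted by generate_stream_response. The httpx dependency is already pinned in requirements.txt.
# scripts/test_generate.py (illustrative client for the /api/generate endpoint)
import json
import httpx

BASE_URL = "http://localhost:8000"  # assumed local deployment

def generate_once(prompt: str) -> None:
    """Call /api/generate with stream=false and print the full response."""
    payload = {"prompt": prompt, "max_tokens": 100, "temperature": 0.7, "stream": False}
    response = httpx.post(f"{BASE_URL}/api/generate", json=payload, timeout=300.0)
    response.raise_for_status()
    data = response.json()
    print(data["generated_text"])
    print(f"{data['tokens_generated']} tokens in {data['generation_time']:.2f}s")

def generate_streaming(prompt: str) -> None:
    """Call /api/generate with stream=true and print tokens as they arrive."""
    payload = {"prompt": prompt, "max_tokens": 100, "temperature": 0.7, "stream": True}
    with httpx.stream("POST", f"{BASE_URL}/api/generate", json=payload, timeout=None) as response:
        response.raise_for_status()
        for line in response.iter_lines():
            if not line.startswith("data: "):
                continue
            event = json.loads(line[len("data: "):])
            if event.get("done"):
                print(f"\n{event['total_tokens']} tokens, {event['tokens_per_second']:.2f} tokens/s")
            elif "token" in event:
                print(event["token"], end="", flush=True)

if __name__ == "__main__":
    generate_once("Explain unified memory on Apple Silicon in one sentence.")
    generate_streaming("Write a haiku about containers.")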
MAIN APPLICATION
The main application module initializes FastAPI, loads the model, and configures all routes and middleware.
# app/main.py
from fastapi import FastAPI, Request
from fastapi.middleware.cors import CORSMiddleware
from fastapi.staticfiles import StaticFiles
from fastapi.responses import FileResponse, JSONResponse
from contextlib import asynccontextmanager
import logging
import sys
from pathlib import Path
from app.config import settings
from app.routers import generation, health
from mlx_lm import load
# Configure logging
logging.basicConfig(
level=getattr(logging, settings.log_level.upper()),
format=settings.log_format,
handlers=[
logging.StreamHandler(sys.stdout)
]
)
logger = logging.getLogger(__name__)
# Global model state
model_state = {
"model": None,
"tokenizer": None,
"model_path": None,
"loaded": False,
"model_type": None,
"parameters": None
}
@asynccontextmanager
async def lifespan(app: FastAPI):
"""
Application lifespan manager.
Handles startup and shutdown events including model loading.
"""
# Startup
logger.info(f"Starting {settings.app_name} v{settings.app_version}")
logger.info(f"Loading model from: {settings.model_path}")
try:
model, tokenizer = load(settings.model_path)
model_state["model"] = model
model_state["tokenizer"] = tokenizer
model_state["model_path"] = settings.model_path
model_state["loaded"] = True
# Try to extract model type and parameters
try:
if hasattr(model, 'model_type'):
model_state["model_type"] = model.model_type
if hasattr(model, 'args'):
model_state["parameters"] = {
k: str(v) for k, v in vars(model.args).items()
if not k.startswith('_')
}
except Exception as e:
logger.warning(f"Could not extract model metadata: {e}")
logger.info("Model loaded successfully")
except Exception as e:
logger.error(f"Failed to load model: {str(e)}", exc_info=True)
model_state["loaded"] = False
# Don't raise - allow app to start but mark as not ready
yield
# Shutdown
logger.info("Shutting down application")
model_state["model"] = None
model_state["tokenizer"] = None
model_state["loaded"] = False
# Create FastAPI application
app = FastAPI(
title=settings.app_name,
version=settings.app_version,
description="Production-ready MLX LLM inference service with web playground",
lifespan=lifespan
)
# Configure CORS
app.add_middleware(
CORSMiddleware,
allow_origins=settings.cors_origins,
allow_credentials=settings.cors_allow_credentials,
allow_methods=settings.cors_allow_methods,
allow_headers=settings.cors_allow_headers,
)
# Inject model state into routers
generation.set_model_state(model_state)
health.set_model_state(model_state)
# Include routers
app.include_router(generation.router)
app.include_router(health.router)
# Mount static files
static_path = Path(__file__).parent.parent / "static"
if static_path.exists():
app.mount("/static", StaticFiles(directory=str(static_path)), name="static")
@app.get("/")
async def root():
"""Serve the main web interface"""
static_file = static_path / "index.html"
if static_file.exists():
return FileResponse(static_file)
return {
"message": f"Welcome to {settings.app_name}",
"version": settings.app_version,
"docs": "/docs"
}
@app.exception_handler(Exception)
async def global_exception_handler(request: Request, exc: Exception):
"""Global exception handler for unhandled errors"""
logger.error(f"Unhandled exception: {str(exc)}", exc_info=True)
return JSONResponse(
status_code=500,
content={
"error": "internal_server_error",
"message": "An unexpected error occurred",
"details": str(exc) if settings.debug else None
}
)
if __name__ == "__main__":
import uvicorn
uvicorn.run(
"app.main:app",
host=settings.host,
port=settings.port,
workers=settings.workers,
log_level=settings.log_level.lower()
)
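One caveat worth noting: generate and stream_generate are synchronous, so a long generation blocks the event loop and delays other requests, including the health probes. A possible refinement, sketched below rather than folded into the reference implementation, is to offload the blocking call to a worker thread from the endpoint. The sketch mirrors the call signature used above; adjust the keyword arguments to match your mlx-lm version.
# Sketch: run the blocking MLX call in a worker thread so the event loop stays responsive.
import asyncio
import functools
from mlx_lm import generate

async def generate_in_thread(request, model_state):
    blocking_call = functools.partial(
        generate,
        model_state["model"],
        model_state["tokenizer"],
        prompt=request.prompt,
        max_tokens=request.max_tokens,
        temp=request.temperature,
        top_p=request.top_p,
        repetition_penalty=request.repetition_penalty,
        repetition_context_size=request.repetition_context_size,
        verbose=False
    )
    # asyncio.to_thread (Python 3.9+) schedules the call on the default thread pool
    return await asyncio.to_thread(blocking_call)
The standard endpoint would then await generate_in_thread(request, model_state) instead of calling generate directly. Keeping a single uvicorn worker per container, as the Dockerfile below does, avoids loading the model once per worker process.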
COMPLETE WEB FRONTEND
The complete web frontend with all functionality integrated into a single HTML file.
<!-- static/index.html -->
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>MLX LLM Playground</title>
<style>
* {
margin: 0;
padding: 0;
box-sizing: border-box;
}
body {
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, Oxygen, Ubuntu, Cantarell, 'Helvetica Neue', sans-serif;
background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
min-height: 100vh;
display: flex;
justify-content: center;
align-items: center;
padding: 20px;
}
.container {
background: white;
border-radius: 20px;
box-shadow: 0 20px 60px rgba(0, 0, 0, 0.3);
width: 100%;
max-width: 1000px;
height: 90vh;
display: flex;
flex-direction: column;
overflow: hidden;
}
.header {
background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
color: white;
padding: 25px 35px;
border-radius: 20px 20px 0 0;
display: flex;
justify-content: space-between;
align-items: center;
}
.header-left h1 {
font-size: 26px;
font-weight: 600;
margin-bottom: 5px;
}
.header-left p {
font-size: 14px;
opacity: 0.9;
}
.header-right {
display: flex;
align-items: center;
gap: 15px;
}
.status-indicator {
display: flex;
align-items: center;
gap: 8px;
background: rgba(255, 255, 255, 0.2);
padding: 8px 15px;
border-radius: 20px;
font-size: 13px;
}
.status-dot {
width: 10px;
height: 10px;
border-radius: 50%;
background: #4ade80;
animation: pulse 2s infinite;
}
.status-dot.loading {
background: #fbbf24;
}
.status-dot.error {
background: #ef4444;
animation: none;
}
@keyframes pulse {
0%, 100% { opacity: 1; }
50% { opacity: 0.5; }
}
.chat-container {
flex: 1;
overflow-y: auto;
padding: 25px 35px;
background: #f8f9fa;
}
.message {
margin-bottom: 20px;
display: flex;
flex-direction: column;
animation: fadeIn 0.3s ease-in;
}
@keyframes fadeIn {
from { opacity: 0; transform: translateY(10px); }
to { opacity: 1; transform: translateY(0); }
}
.message.user {
align-items: flex-end;
}
.message.assistant {
align-items: flex-start;
}
.message-label {
font-size: 12px;
color: #666;
margin-bottom: 5px;
font-weight: 500;
}
.message-content {
max-width: 75%;
padding: 14px 20px;
border-radius: 18px;
word-wrap: break-word;
white-space: pre-wrap;
line-height: 1.5;
}
.message.user .message-content {
background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
color: white;
}
.message.assistant .message-content {
background: white;
color: #333;
border: 1px solid #e0e0e0;
}
.message-meta {
font-size: 11px;
color: #999;
margin-top: 5px;
}
.controls-panel {
background: white;
border-top: 1px solid #e0e0e0;
padding: 20px 35px;
}
.controls-toggle {
display: flex;
align-items: center;
justify-content: space-between;
cursor: pointer;
padding: 10px 0;
user-select: none;
}
.controls-toggle-label {
font-size: 14px;
font-weight: 500;
color: #666;
display: flex;
align-items: center;
gap: 8px;
}
.controls-toggle-icon {
transition: transform 0.3s;
}
.controls-toggle-icon.expanded {
transform: rotate(180deg);
}
.controls-content {
max-height: 0;
overflow: hidden;
transition: max-height 0.3s ease-out;
}
.controls-content.expanded {
max-height: 200px;
}
.controls-grid {
display: grid;
grid-template-columns: repeat(auto-fit, minmax(200px, 1fr));
gap: 15px;
padding: 15px 0;
}
.control-group {
display: flex;
flex-direction: column;
gap: 5px;
}
.control-group label {
font-size: 13px;
color: #666;
font-weight: 500;
}
.control-group input[type="number"],
.control-group input[type="range"] {
padding: 8px 12px;
border: 1px solid #e0e0e0;
border-radius: 8px;
font-size: 14px;
}
.control-group input[type="range"] {
padding: 0;
}
.range-value {
font-size: 12px;
color: #999;
text-align: right;
}
.checkbox-group {
display: flex;
align-items: center;
gap: 8px;
}
.checkbox-group input[type="checkbox"] {
width: 18px;
height: 18px;
cursor: pointer;
}
.input-area {
padding: 20px 35px;
background: white;
border-top: 1px solid #e0e0e0;
}
.error-message {
background: #fee;
color: #c33;
padding: 12px 18px;
border-radius: 8px;
margin-bottom: 15px;
font-size: 14px;
display: flex;
align-items: center;
gap: 10px;
}
.error-message-close {
margin-left: auto;
cursor: pointer;
font-size: 18px;
opacity: 0.7;
}
.error-message-close:hover {
opacity: 1;
}
.input-row {
display: flex;
gap: 12px;
}
.input-row textarea {
flex: 1;
padding: 14px 20px;
border: 1px solid #e0e0e0;
border-radius: 12px;
font-size: 16px;
font-family: inherit;
resize: none;
min-height: 55px;
max-height: 150px;
transition: border-color 0.2s;
}
.input-row textarea:focus {
outline: none;
border-color: #667eea;
}
.input-row textarea:disabled {
background: #f5f5f5;
cursor: not-allowed;
}
.input-row button {
padding: 14px 35px;
background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
color: white;
border: none;
border-radius: 12px;
font-size: 16px;
font-weight: 600;
cursor: pointer;
transition: transform 0.2s, box-shadow 0.2s;
white-space: nowrap;
}
.input-row button:hover:not(:disabled) {
transform: translateY(-2px);
box-shadow: 0 5px 15px rgba(102, 126, 234, 0.4);
}
.input-row button:disabled {
opacity: 0.6;
cursor: not-allowed;
transform: none;
}
.loading-indicator {
display: flex;
align-items: center;
gap: 10px;
color: #666;
font-size: 14px;
margin-top: 12px;
}
.loading-spinner {
width: 18px;
height: 18px;
border: 2px solid #e0e0e0;
border-top-color: #667eea;
border-radius: 50%;
animation: spin 1s linear infinite;
}
@keyframes spin {
to { transform: rotate(360deg); }
}
.stats-display {
font-size: 12px;
color: #999;
margin-top: 8px;
}
@media (max-width: 768px) {
.container {
height: 100vh;
border-radius: 0;
}
.header {
border-radius: 0;
flex-direction: column;
align-items: flex-start;
gap: 10px;
}
.controls-grid {
grid-template-columns: 1fr;
}
.message-content {
max-width: 85%;
}
}
</style>
</head>
<body>
<div class="container">
<div class="header">
<div class="header-left">
<h1>MLX LLM Playground</h1>
<p>Interact with Large Language Models powered by Apple MLX</p>
</div>
<div class="header-right">
<div class="status-indicator">
<div class="status-dot" id="statusDot"></div>
<span id="statusText">Ready</span>
</div>
</div>
</div>
<div class="chat-container" id="chatContainer">
<div class="message assistant">
<div class="message-label">Assistant</div>
<div class="message-content">
Hello! I'm an AI assistant powered by MLX running on Apple Silicon. I'm ready to help you with questions, creative writing, analysis, and more. What would you like to talk about today?
</div>
</div>
</div>
<div class="controls-panel">
<div class="controls-toggle" onclick="toggleControls()">
<div class="controls-toggle-label">
<span>⚙️</span>
<span>Generation Parameters</span>
</div>
<div class="controls-toggle-icon" id="controlsIcon">▼</div>
</div>
<div class="controls-content" id="controlsContent">
<div class="controls-grid">
<div class="control-group">
<label for="maxTokens">Max Tokens</label>
<input type="number" id="maxTokens" value="200" min="1" max="2048">
</div>
<div class="control-group">
<label for="temperature">Temperature</label>
<input type="range" id="temperature" value="0.7" min="0" max="2" step="0.1" oninput="updateRangeValue('temperature')">
<div class="range-value" id="temperatureValue">0.7</div>
</div>
<div class="control-group">
<label for="topP">Top P</label>
<input type="range" id="topP" value="1.0" min="0" max="1" step="0.05" oninput="updateRangeValue('topP')">
<div class="range-value" id="topPValue">1.0</div>
</div>
<div class="control-group">
<label for="repetitionPenalty">Repetition Penalty</label>
<input type="range" id="repetitionPenalty" value="1.0" min="0" max="2" step="0.1" oninput="updateRangeValue('repetitionPenalty')">
<div class="range-value" id="repetitionPenaltyValue">1.0</div>
</div>
<div class="control-group checkbox-group">
<input type="checkbox" id="streamMode" checked>
<label for="streamMode">Stream Response</label>
</div>
</div>
</div>
</div>
<div class="input-area">
<div id="errorContainer"></div>
<div class="input-row">
<textarea
id="promptInput"
placeholder="Type your message here... (Press Enter to send, Shift+Enter for new line)"
rows="1"
></textarea>
<button id="sendButton" onclick="sendMessage()">Send</button>
</div>
<div id="loadingIndicator" class="loading-indicator" style="display: none;">
<div class="loading-spinner"></div>
<span id="loadingText">Generating response...</span>
</div>
<div id="statsDisplay" class="stats-display"></div>
</div>
</div>
<script>
const API_BASE_URL = window.location.origin;
let isGenerating = false;
let currentMessageContent = null;
let generationStartTime = null;
// Initialize
document.addEventListener('DOMContentLoaded', function() {
checkHealth();
setupEventListeners();
updateRangeValues();
});
function setupEventListeners() {
const promptInput = document.getElementById('promptInput');
// Auto-resize textarea
promptInput.addEventListener('input', function() {
this.style.height = 'auto';
this.style.height = Math.min(this.scrollHeight, 150) + 'px';
});
// Enter to send, Shift+Enter for new line
promptInput.addEventListener('keydown', function(e) {
if (e.key === 'Enter' && !e.shiftKey) {
e.preventDefault();
sendMessage();
}
});
}
function updateRangeValues() {
updateRangeValue('temperature');
updateRangeValue('topP');
updateRangeValue('repetitionPenalty');
}
function updateRangeValue(id) {
const input = document.getElementById(id);
const display = document.getElementById(id + 'Value');
if (input && display) {
display.textContent = parseFloat(input.value).toFixed(2);
}
}
function toggleControls() {
const content = document.getElementById('controlsContent');
const icon = document.getElementById('controlsIcon');
content.classList.toggle('expanded');
icon.classList.toggle('expanded');
}
function setStatus(status, text) {
const dot = document.getElementById('statusDot');
const statusText = document.getElementById('statusText');
dot.className = 'status-dot ' + status;
statusText.textContent = text;
}
function addMessage(role, content, meta = null) {
const chatContainer = document.getElementById('chatContainer');
const messageDiv = document.createElement('div');
messageDiv.className = `message ${role}`;
const labelDiv = document.createElement('div');
labelDiv.className = 'message-label';
labelDiv.textContent = role === 'user' ? 'You' : 'Assistant';
const contentDiv = document.createElement('div');
contentDiv.className = 'message-content';
contentDiv.textContent = content;
messageDiv.appendChild(labelDiv);
messageDiv.appendChild(contentDiv);
if (meta) {
const metaDiv = document.createElement('div');
metaDiv.className = 'message-meta';
metaDiv.textContent = meta;
messageDiv.appendChild(metaDiv);
}
chatContainer.appendChild(messageDiv);
chatContainer.scrollTop = chatContainer.scrollHeight;
return contentDiv;
}
function showError(message) {
const errorContainer = document.getElementById('errorContainer');
errorContainer.innerHTML = `
<div class="error-message">
<span>⚠️ ${message}</span>
<span class="error-message-close" onclick="this.parentElement.remove()">×</span>
</div>
`;
}
function setLoading(loading, text = 'Generating response...') {
isGenerating = loading;
document.getElementById('sendButton').disabled = loading;
document.getElementById('promptInput').disabled = loading;
document.getElementById('loadingIndicator').style.display = loading ? 'flex' : 'none';
document.getElementById('loadingText').textContent = text;
if (loading) {
setStatus('loading', 'Generating');
} else {
setStatus('', 'Ready');
}
}
function updateStats(stats) {
const statsDisplay = document.getElementById('statsDisplay');
statsDisplay.textContent = stats;
}
async function checkHealth() {
try {
const response = await fetch(`${API_BASE_URL}/health/ready`);
if (response.ok) {
const data = await response.json();
setStatus('', 'Ready');
console.log('Model loaded:', data.details);
} else {
setStatus('loading', 'Loading model...');
setTimeout(checkHealth, 2000);
}
} catch (error) {
setStatus('error', 'Connection error');
showError('Cannot connect to server. Please check if the service is running.');
}
}
async function sendMessage() {
if (isGenerating) return;
const promptInput = document.getElementById('promptInput');
const prompt = promptInput.value.trim();
if (!prompt) return;
const maxTokens = parseInt(document.getElementById('maxTokens').value);
const temperature = parseFloat(document.getElementById('temperature').value);
const topP = parseFloat(document.getElementById('topP').value);
const repetitionPenalty = parseFloat(document.getElementById('repetitionPenalty').value);
const stream = document.getElementById('streamMode').checked;
addMessage('user', prompt);
promptInput.value = '';
promptInput.style.height = 'auto';
setLoading(true);
generationStartTime = Date.now();
try {
if (stream) {
await handleStreamingResponse(prompt, maxTokens, temperature, topP, repetitionPenalty);
} else {
await handleStandardResponse(prompt, maxTokens, temperature, topP, repetitionPenalty);
}
} catch (error) {
console.error('Error:', error);
showError(`Failed to generate response: ${error.message}`);
} finally {
setLoading(false);
currentMessageContent = null;
}
}
async function handleStandardResponse(prompt, maxTokens, temperature, topP, repetitionPenalty) {
const response = await fetch(`${API_BASE_URL}/api/generate`, {
method: 'POST',
headers: {
'Content-Type': 'application/json',
},
body: JSON.stringify({
prompt: prompt,
max_tokens: maxTokens,
temperature: temperature,
top_p: topP,
repetition_penalty: repetitionPenalty,
stream: false
})
});
if (!response.ok) {
const error = await response.json();
throw new Error(error.detail || 'Failed to generate response');
}
const data = await response.json();
const tokensPerSecond = (data.tokens_generated / data.generation_time).toFixed(2);
const meta = `${data.tokens_generated} tokens in ${data.generation_time.toFixed(2)}s (${tokensPerSecond} tokens/s)`;
addMessage('assistant', data.generated_text, meta);
updateStats(`Last generation: ${meta}`);
}
async function handleStreamingResponse(prompt, maxTokens, temperature, topP, repetitionPenalty) {
const response = await fetch(`${API_BASE_URL}/api/generate`, {
method: 'POST',
headers: {
'Content-Type': 'application/json',
},
body: JSON.stringify({
prompt: prompt,
max_tokens: maxTokens,
temperature: temperature,
top_p: topP,
repetition_penalty: repetitionPenalty,
stream: true
})
});
if (!response.ok) {
const error = await response.json();
throw new Error(error.detail || 'Failed to generate response');
}
const reader = response.body.getReader();
const decoder = new TextDecoder();
currentMessageContent = addMessage('assistant', '');
let fullText = '';
let tokenCount = 0;
let buffer = '';
while (true) {
const { done, value } = await reader.read();
if (done) break;
// Buffer decoded chunks so SSE lines split across reads are reassembled correctly
buffer += decoder.decode(value, { stream: true });
const lines = buffer.split('\n');
buffer = lines.pop();
for (const line of lines) {
if (line.startsWith('data: ')) {
try {
const data = JSON.parse(line.slice(6));
if (data.error) {
// Surface server-side generation errors to the user instead of throwing into the parse catch below
showError(data.message || data.error);
break;
}
if (data.done) {
const tokensPerSecond = data.tokens_per_second.toFixed(2);
const meta = `${data.total_tokens} tokens in ${data.total_time.toFixed(2)}s (${tokensPerSecond} tokens/s)`;
updateStats(`Last generation: ${meta}`);
break;
}
if (data.token) {
fullText += data.token;
tokenCount = data.token_count;
currentMessageContent.textContent = fullText;
const chatContainer = document.getElementById('chatContainer');
chatContainer.scrollTop = chatContainer.scrollHeight;
const elapsed = (Date.now() - generationStartTime) / 1000;
const tokensPerSecond = (tokenCount / elapsed).toFixed(2);
setLoading(true, `Generating... (${tokenCount} tokens, ${tokensPerSecond} tokens/s)`);
}
} catch (e) {
console.error('Error parsing SSE data:', e);
}
}
}
}
}
</script>
</body>
</html>
KUBERNETES MANIFESTS
Complete Kubernetes manifests for production deployment.
# k8s/namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
name: mlx-llm
labels:
name: mlx-llm
environment: production
# k8s/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: mlx-llm-config
namespace: mlx-llm
data:
model_path: "mlx-community/Mistral-7B-Instruct-v0.3-4bit"
log_level: "INFO"
default_max_tokens: "200"
default_temperature: "0.7"
default_top_p: "1.0"
max_concurrent_requests: "10"
# k8s/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: mlx-llm-deployment
namespace: mlx-llm
labels:
app: mlx-llm
version: v1.0.0
spec:
replicas: 2
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1
maxUnavailable: 0
selector:
matchLabels:
app: mlx-llm
template:
metadata:
labels:
app: mlx-llm
version: v1.0.0
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "9090"
prometheus.io/path: "/metrics"
spec:
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: app
operator: In
values:
- mlx-llm
topologyKey: kubernetes.io/hostname
containers:
- name: mlx-llm
image: your-registry.com/mlx-llm-playground:1.0.0
imagePullPolicy: Always
ports:
- containerPort: 8000
name: http
protocol: TCP
- containerPort: 9090
name: metrics
protocol: TCP
env:
- name: MODEL_PATH
valueFrom:
configMapKeyRef:
name: mlx-llm-config
key: model_path
- name: LOG_LEVEL
valueFrom:
configMapKeyRef:
name: mlx-llm-config
key: log_level
- name: DEFAULT_MAX_TOKENS
valueFrom:
configMapKeyRef:
name: mlx-llm-config
key: default_max_tokens
- name: MAX_CONCURRENT_REQUESTS
valueFrom:
configMapKeyRef:
name: mlx-llm-config
key: max_concurrent_requests
resources:
requests:
memory: "4Gi"
cpu: "2000m"
limits:
memory: "8Gi"
cpu: "4000m"
livenessProbe:
httpGet:
path: /health/live
port: 8000
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
readinessProbe:
httpGet:
path: /health/ready
port: 8000
initialDelaySeconds: 90
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
lifecycle:
preStop:
exec:
command: ["/bin/sh", "-c", "sleep 15"]
securityContext:
runAsNonRoot: true
runAsUser: 1000
allowPrivilegeEscalation: false
readOnlyRootFilesystem: false
capabilities:
drop:
- ALL
terminationGracePeriodSeconds: 30
securityContext:
fsGroup: 1000
# k8s/service.yaml
apiVersion: v1
kind: Service
metadata:
name: mlx-llm-service
namespace: mlx-llm
labels:
app: mlx-llm
spec:
type: ClusterIP
sessionAffinity: ClientIP
sessionAffinityConfig:
clientIP:
timeoutSeconds: 10800
ports:
- port: 80
targetPort: 8000
protocol: TCP
name: http
- port: 9090
targetPort: 9090
protocol: TCP
name: metrics
selector:
app: mlx-llm
# k8s/ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: mlx-llm-ingress
namespace: mlx-llm
annotations:
nginx.ingress.kubernetes.io/rewrite-target: /
nginx.ingress.kubernetes.io/ssl-redirect: "true"
nginx.ingress.kubernetes.io/proxy-body-size: "10m"
nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
cert-manager.io/cluster-issuer: "letsencrypt-prod"
nginx.ingress.kubernetes.io/limit-rps: "100"
spec:
ingressClassName: nginx
tls:
- hosts:
- mlx-llm.yourdomain.com
secretName: mlx-llm-tls
rules:
- host: mlx-llm.yourdomain.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: mlx-llm-service
port:
number: 80
# k8s/hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: mlx-llm-hpa
namespace: mlx-llm
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: mlx-llm-deployment
minReplicas: 2
maxReplicas: 10
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
behavior:
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Percent
value: 50
periodSeconds: 60
scaleUp:
stabilizationWindowSeconds: 0
policies:
- type: Percent
value: 100
periodSeconds: 30
- type: Pods
value: 2
periodSeconds: 30
selectPolicy: Max
REQUIREMENTS FILE
Complete Python dependencies with pinned versions for reproducible builds.
# requirements.txt
# Web framework and server
fastapi==0.109.2
uvicorn[standard]==0.27.1
pydantic==2.6.1
pydantic-settings==2.1.0
# MLX and machine learning
mlx==0.4.0
mlx-lm==0.8.1
# HTTP and networking
httpx==0.26.0
python-multipart==0.0.9
aiofiles==23.2.1
# Monitoring and observability
prometheus-client==0.19.0
# Utilities
python-dotenv==1.0.1
# Testing (optional, for development)
pytest==8.0.0
pytest-asyncio==0.23.4
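Note that prometheus-client is pinned here and the Kubernetes manifests scrape port 9090, but the application code shown above does not yet start a metrics server. Below is a minimal, illustrative sketch of how that could be wired in; the metric names are placeholders chosen for this example.
# Sketch: exposing Prometheus metrics on the port referenced by the manifests (9090).
from prometheus_client import Counter, Histogram, start_http_server

GENERATION_REQUESTS = Counter(
    "mlx_generation_requests_total", "Number of generation requests received"
)
GENERATION_SECONDS = Histogram(
    "mlx_generation_seconds", "Time spent generating a response"
)

# In the lifespan startup handler, before loading the model:
#     start_http_server(9090)
# In the generation endpoint:
#     GENERATION_REQUESTS.inc()
#     with GENERATION_SECONDS.time():
#         generated_text = generate(...)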
DOCKERFILE
Production-ready Dockerfile with security best practices and optimization.
# Dockerfile
FROM python:3.11-slim as base
# Metadata
LABEL maintainer="your-email@example.com"
LABEL description="MLX LLM Playground - Production-ready inference service"
LABEL version="1.0.0"
# Environment variables
ENV PYTHONUNBUFFERED=1 \
PYTHONDONTWRITEBYTECODE=1 \
PIP_NO_CACHE_DIR=1 \
PIP_DISABLE_PIP_VERSION_CHECK=1 \
DEBIAN_FRONTEND=noninteractive
# Install system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential \
curl \
git \
ca-certificates \
&& rm -rf /var/lib/apt/lists/*
# Create non-root user
RUN useradd -m -u 1000 -s /bin/bash appuser && \
mkdir -p /app /home/appuser/.cache && \
chown -R appuser:appuser /app /home/appuser
# Set working directory
WORKDIR /app
# Copy requirements first for better caching
COPY --chown=appuser:appuser requirements.txt .
# Install Python dependencies
RUN pip install --no-cache-dir --upgrade pip && \
pip install --no-cache-dir -r requirements.txt
# Copy application code
COPY --chown=appuser:appuser app/ ./app/
COPY --chown=appuser:appuser static/ ./static/
# Switch to non-root user
USER appuser
# Expose ports
EXPOSE 8000 9090
# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=120s --retries=3 \
CMD curl -f http://localhost:8000/health/live || exit 1
# Run application
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "1"]
DOCKERIGNORE FILE
Optimize Docker build context by excluding unnecessary files.
# .dockerignore
# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
# Virtual environments
venv/
env/
ENV/
# IDE
.vscode/
.idea/
*.swp
*.swo
*~
# Git
.git/
.gitignore
.gitattributes
# Documentation
README.md
docs/
# Tests
tests/
.pytest_cache/
# CI/CD
.github/
.gitlab-ci.yml
# Kubernetes
k8s/
# Docker
Dockerfile*
docker-compose*.yml
.dockerignore
# OS
.DS_Store
Thumbs.db
# Logs
*.log
logs/
# Environment
.env
.env.*
README FILE
Comprehensive documentation for deploying and using the MLX LLM Playground.
# MLX LLM Playground
Production-ready web-based playground for interacting with Large Language Models using Apple's MLX framework.
## Features
- Interactive chat interface for LLM interaction
- Real-time streaming responses
- Configurable generation parameters
- Production-ready FastAPI backend
- Complete Kubernetes deployment manifests
- Health checks and monitoring endpoints
- Comprehensive error handling and logging
## Prerequisites
- Docker and Docker Compose
- Kubernetes cluster (for production deployment)
- Python 3.11 or higher (for local development)
- Apple Silicon Mac (for optimal MLX performance)
## Quick Start
### Local Development
1. Clone the repository
2. Create a virtual environment: python -m venv venv
3. Activate the environment: source venv/bin/activate
4. Install dependencies: pip install -r requirements.txt
5. Run the application: uvicorn app.main:app --reload
6. Open browser to http://localhost:8000
### Docker Deployment
1. Build the image: docker build -t mlx-llm-playground:latest .
2. Run the container: docker run -p 8000:8000 -e MODEL_PATH=mlx-community/Mistral-7B-Instruct-v0.3-4bit mlx-llm-playground:latest
3. Access at http://localhost:8000
### Kubernetes Deployment
1. Create namespace: kubectl apply -f k8s/namespace.yaml
2. Apply configurations: kubectl apply -f k8s/
3. Verify deployment: kubectl get pods -n mlx-llm
4. Access via the ingress host, or for testing: kubectl port-forward -n mlx-llm svc/mlx-llm-service 8000:80
## Configuration
All configuration is managed through environment variables or ConfigMaps in Kubernetes.
Key settings:
- MODEL_PATH: Hugging Face model identifier or local path
- LOG_LEVEL: Logging verbosity (DEBUG, INFO, WARNING, ERROR)
- DEFAULT_MAX_TOKENS: Default maximum tokens for generation
- MAX_CONCURRENT_REQUESTS: Maximum concurrent inference requests
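For local development these settings can be placed in a .env file, which python-dotenv and pydantic-settings will pick up provided the Settings class is configured with env_file=".env". The values below are illustrative and mirror the ConfigMap defaults:
# .env (example; values are illustrative)
MODEL_PATH=mlx-community/Mistral-7B-Instruct-v0.3-4bit
LOG_LEVEL=INFO
DEFAULT_MAX_TOKENS=200
MAX_CONCURRENT_REQUESTS=10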
## API Documentation
Interactive API documentation is available at /docs when the service is running.
## Monitoring
Prometheus metrics are exposed on port 9090 at /metrics endpoint.
## License
MIT License
## Support
For issues and questions, please open an issue on the repository.
This implementation provides a production-ready MLX LLM playground with all necessary components for deployment in Docker and Kubernetes environments. The code follows clean architecture principles, includes comprehensive error handling, and implements security best practices; only environment-specific values such as the container registry, ingress hostname, and TLS issuer need to be adapted to your own infrastructure.