Hitchhiker's Guide to AI, Software Architecture, and Everything Else: AN LLM CHAT MICROSERVICE: A GUIDE FOR DOCKER AND KUBERNETES DEPLOYMENT WITH MULTI-GPU ARCHITECTURE SUPPORT

INTRODUCTION: WHY BUILD AN LLM CHAT MICROSERVICE?

In the rapidly evolving landscape of artificial intelligence, Large Language Models have become indispensable tools for modern applications. However, deploying these powerful models in a production environment presents unique challenges. This article guides you through creating a robust, scalable LLM chat service that runs as a microservice in containerized environments.

The microservice architecture approach offers several compelling advantages. First, it provides isolation, meaning your LLM service runs independently from other application components. If the LLM service crashes or needs updates, your main application continues functioning. Second, it enables horizontal scaling, allowing you to run multiple instances of your LLM service to handle increased load. Third, it facilitates resource management, particularly important for GPU-intensive LLM operations where you need precise control over hardware allocation.

Using local LLM models instead of cloud-based APIs offers significant benefits. You maintain complete data privacy since no information leaves your infrastructure. You eliminate per-token costs associated with commercial APIs, making it economically viable for high-volume applications. You gain full control over model selection, fine-tuning, and updates. Most importantly, you avoid dependency on external services and their potential downtime or rate limits.

UNDERSTANDING THE FUNDAMENTAL COMPONENTS

Before diving into implementation, let us explore the essential building blocks of our system. Understanding these components deeply will help you make informed decisions throughout the development process.

What Are Large Language Models?

Large Language Models are neural networks trained on vast amounts of text data to understand and generate human-like text. Unlike traditional software that follows explicit rules, LLMs learn patterns from data and can perform tasks they were not explicitly programmed for. When you send a prompt to an LLM, it processes the text through billions of parameters to generate a contextually appropriate response.

The models we will use are quantized versions, meaning they have been compressed from their original size while retaining most of their capabilities. A model originally requiring 80GB of memory as 32-bit floats needs roughly 10GB to 12GB once quantized to about 4 bits per weight. This compression works by reducing the precision of the numerical weights, for example from 32-bit or 16-bit floating point to 4-bit integers plus a small number of scaling factors. The GGUF format, developed by the llama.cpp project, has become the standard for these quantized models because it supports efficient loading and inference across different hardware platforms.
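To make the idea of reduced precision concrete, here is a minimal sketch of symmetric 4-bit quantization applied to one small block of weights. It is a simplified illustration under stated assumptions, not the actual GGUF scheme: the function names and the sample block are made up for this example, and real llama.cpp quantization types use larger blocks and more elaborate scaling.

```python
from typing import List, Tuple


def quantize_block(weights: List[float], bits: int = 4) -> Tuple[float, List[int]]:
    """Quantize one block with a single shared scale (illustrative, not GGUF)."""
    qmax = 2 ** (bits - 1) - 1  # 7 for signed 4-bit integers
    absmax = max(abs(w) for w in weights)
    scale = absmax / qmax if absmax > 0 else 1.0
    # Round each weight to the nearest integer step and clamp to the valid range
    quantized = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return scale, quantized


def dequantize_block(scale: float, quantized: List[int]) -> List[float]:
    """Recover approximate weights from the integers and the shared scale."""
    return [q * scale for q in quantized]


block = [0.7, -0.28, 0.14, 0.0]  # hypothetical weights
scale, q = quantize_block(block)
restored = dequantize_block(scale, q)
print(q)         # small integers that fit in 4 bits each
print(restored)  # close to the original block, but not identical
```

Each weight is now stored as a 4-bit integer, and the block shares one floating-point scale. The rounding error per weight is at most half of one scale step, which is why quality degrades only slightly.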

What Is Docker and Why Use It?

Docker is a containerization platform that packages your application along with all its dependencies into a standardized unit called a container. Think of a container as a lightweight, isolated environment that contains everything needed to run your application: the code, runtime, system tools, libraries, and settings.

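The key advantage of Docker for LLM services is consistency. The phrase "it works on my machine" largely goes away because the container packages the same code and dependencies wherever it runs. One caveat applies to GPU workloads: the image must be built for the target GPU backend, so a Linux server with NVIDIA GPUs and a Kubernetes cluster with AMD GPUs need different image variants, and Docker on macOS runs containers inside a Linux virtual machine that cannot use the Apple Silicon GPU through Metal. Within each target, the containerized application behaves the same way from development to production.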

Docker containers differ from virtual machines in crucial ways. Virtual machines include an entire operating system, making them heavy and slow to start. Containers share the host operating system kernel, making them lightweight and fast. A container can start in seconds and uses only megabytes of memory for the container overhead itself, though our LLM will require substantial memory for the model.

What Is Kubernetes and Why Use It?

Kubernetes is an orchestration platform that manages containerized applications across a cluster of machines. While Docker handles individual containers, Kubernetes manages fleets of containers, deciding where to run them, how to scale them, and what to do when they fail.

For LLM services, Kubernetes provides critical capabilities. It offers automatic scaling based on demand, spinning up new instances when traffic increases and shutting them down when traffic decreases. It provides self-healing, automatically restarting failed containers and replacing unhealthy instances. It manages resource allocation, ensuring your GPU-hungry LLM containers get the hardware they need without starving other services. It handles load balancing, distributing incoming requests across multiple LLM instances for optimal performance.

Understanding Microservice Architecture

Microservice architecture structures an application as a collection of loosely coupled services. Instead of building one large monolithic application, you create multiple small services that each handle a specific business capability. Each service runs independently, has its own database if needed, and communicates with other services through well-defined APIs.

For our LLM chat service, the microservice approach means creating a dedicated service that does one thing well: processing chat requests using a local LLM. Other parts of your application, like user authentication, data storage, or frontend interfaces, run as separate services. This separation allows you to update your LLM service without touching other components, scale it independently based on AI workload, and even use different programming languages or frameworks for different services.

MULTI-GPU ARCHITECTURE SUPPORT: THE TECHNICAL FOUNDATION

One of the most challenging aspects of deploying LLM services is supporting diverse GPU architectures. Different hardware manufacturers use different programming models and libraries for GPU acceleration. Understanding these differences is essential for building a truly portable LLM service.

NVIDIA CUDA Architecture

NVIDIA GPUs use the CUDA (Compute Unified Device Architecture) platform for parallel computing. CUDA has been the dominant force in machine learning for years, with extensive library support and optimization. When you run an LLM on NVIDIA hardware, the inference engine uses CUDA cores to parallelize matrix operations, the fundamental mathematical operations in neural networks.

The llama.cpp library we will use supports CUDA through cuBLAS, NVIDIA's library for basic linear algebra operations. When compiled with CUDA support, llama.cpp automatically offloads computation to the GPU, dramatically accelerating inference. A response that might take 30 seconds on CPU could complete in 2 seconds on a modern NVIDIA GPU.

AMD ROCm Architecture

AMD's ROCm (Radeon Open Compute) platform provides GPU acceleration for AMD graphics cards. ROCm is an open-source alternative to CUDA, designed to support high-performance computing and machine learning workloads. While historically less mature than CUDA, ROCm has improved significantly and now provides competitive performance for LLM inference.

The llama.cpp library supports ROCm through HIP (Heterogeneous-computing Interface for Portability), which allows CUDA code to run on AMD GPUs with minimal modifications. When you compile llama.cpp with ROCm support, it uses AMD GPU cores for acceleration just as it would use NVIDIA CUDA cores.

Apple Metal Performance Shaders

Apple Silicon chips (M1, M2, M3, and their variants) include integrated GPUs that use the Metal framework for GPU computing. Metal Performance Shaders (MPS) provides optimized implementations of common computational patterns, including the matrix operations needed for neural networks.

The llama.cpp library includes Metal support specifically for Apple Silicon. When running on a Mac with Apple Silicon, the library automatically uses the integrated GPU for acceleration. This is particularly powerful because Apple's unified memory architecture allows the CPU and GPU to share the same memory pool, eliminating the need to copy data between separate CPU and GPU memory.

Intel GPU Architecture

Intel provides GPU acceleration through multiple technologies. Modern Intel CPUs with integrated graphics support OpenCL for general-purpose GPU computing. Intel also offers oneAPI, a unified programming model that works across CPUs, GPUs, and other accelerators. Additionally, Intel's Arc discrete GPUs provide substantial computational power for AI workloads.

The llama.cpp library supports Intel GPUs primarily through SYCL (a Khronos standard for heterogeneous programming that Intel's oneAPI toolkit implements) and OpenCL. This support enables acceleration on Intel integrated graphics as well as discrete Arc GPUs.

THE ARCHITECTURE OF OUR LLM MICROSERVICE

Now that we understand the components, let us design the architecture of our LLM chat microservice. A well-designed architecture makes the system easier to understand, maintain, and extend.

Service Architecture Overview

Our microservice follows a layered architecture with clear separation of concerns. At the top layer, we have the API layer, which handles HTTP requests and responses. This layer exposes REST endpoints that clients use to send chat messages and receive responses. It validates input, handles errors gracefully, and formats responses according to API specifications.

Below the API layer sits the service layer, which contains the business logic. This layer manages conversation context, handles prompt formatting, and orchestrates the interaction with the LLM. It implements features like conversation history management, token counting, and response streaming.

The inference layer interfaces directly with the LLM. This layer loads the model, manages GPU resources, and executes inference. It abstracts the complexity of the underlying llama.cpp library, providing a clean interface for the service layer.

Finally, the configuration layer manages all settings: model paths, GPU settings, performance parameters, and API configuration. This layer reads from environment variables and configuration files, allowing the service to adapt to different deployment environments without code changes.

Request Flow Through the System

When a client sends a chat request, it travels through our system in a well-defined path. The request first arrives at the API layer as an HTTP POST to the chat endpoint. The API layer validates the request structure, ensuring required fields are present and properly formatted.

The validated request moves to the service layer, which retrieves any relevant conversation history and constructs a complete prompt for the LLM. The prompt includes system instructions, conversation history, and the new user message, all formatted according to the specific model's expected format.

The service layer passes the formatted prompt to the inference layer, which feeds it to the loaded LLM. The model processes the prompt and generates tokens one at a time. For streaming responses, these tokens flow back through the layers immediately, allowing the client to receive partial responses as they are generated. For non-streaming responses, the inference layer collects all tokens before returning the complete response.

Finally, the response travels back through the service layer, which may perform post-processing like filtering or logging, and then through the API layer, which formats it as JSON and sends it to the client.
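The request flow described above can be sketched with plain functions, one per layer. This is only an illustration under stated assumptions: every function name here is hypothetical, the prompt format is a simple placeholder, and fake_inference stands in for the real model by yielding a canned reply.

```python
from typing import Dict, Iterator, List


def validate_request(payload: Dict[str, object]) -> str:
    """API layer: check that the required field is present and well formed."""
    message = payload.get("message")
    if not isinstance(message, str) or not message.strip():
        raise ValueError("'message' must be a non-empty string")
    return message


def build_prompt(history: List[Dict[str, str]], message: str) -> str:
    """Service layer: combine history and the new message into one prompt."""
    lines = [f"{m['role']}: {m['content']}" for m in history]
    lines.append(f"user: {message}")
    lines.append("assistant: ")
    return "\n".join(lines)


def fake_inference(prompt: str) -> Iterator[str]:
    """Inference layer stand-in: yields a canned reply token by token."""
    for token in ["Hello", ",", " world", "!"]:
        yield token


def handle_chat(
    payload: Dict[str, object],
    history: List[Dict[str, str]],
    stream: bool = False,
) -> Iterator[str]:
    """Run one request through all layers."""
    message = validate_request(payload)
    prompt = build_prompt(history, message)
    tokens = fake_inference(prompt)
    if stream:
        return tokens  # the caller receives tokens as they are produced
    return iter(["".join(tokens)])  # collect everything, then return once


print(list(handle_chat({"message": "hi"}, [], stream=True)))
print(list(handle_chat({"message": "hi"}, [])))
```

In streaming mode the caller sees four separate tokens, while in non-streaming mode it sees a single complete string.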

SETTING UP THE DEVELOPMENT ENVIRONMENT

Before writing code, we need to prepare our development environment. This section walks through the setup process step by step.

Installing Required System Dependencies

Our LLM service requires several system-level dependencies. On Ubuntu or Debian Linux, you need build tools for compiling native extensions, Python development headers, and GPU-specific libraries depending on your hardware.

For NVIDIA GPU support, install a CUDA toolkit version that your driver supports. The CUDA toolkit includes the compiler and libraries needed to build GPU-accelerated code. Running the nvidia-smi command displays GPU information, the driver version, and the highest CUDA version that driver supports. Running nvcc --version confirms that the toolkit itself is installed.

For AMD GPU support, install ROCm following AMD's official documentation. ROCm installation is more involved than CUDA, requiring specific kernel modules and libraries. After installation, verify it by running rocm-smi, the AMD equivalent of nvidia-smi.

For Apple Silicon, no additional installation is needed. The Metal framework comes pre-installed with macOS. The llama.cpp library will automatically detect and use Metal when running on Apple Silicon.

For Intel GPU support, install the Intel oneAPI toolkit or OpenCL runtime. The specific requirements depend on whether you are using integrated graphics or discrete Arc GPUs.
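As a rough aid when choosing a GPU setting, the following sketch guesses the available backend by looking for the vendor command-line tools mentioned above. It is a heuristic under stated assumptions: the function name is hypothetical, sycl-ls is the device-listing tool that ships with Intel's oneAPI compiler, and finding a tool on the PATH does not prove that a working GPU or driver is present.

```python
import platform
import shutil


def guess_gpu_type() -> str:
    """Guess a GPU backend name from the platform and installed vendor tools."""
    # Apple Silicon Macs report Darwin/arm64 and always ship with Metal
    if platform.system() == "Darwin" and platform.machine() == "arm64":
        return "metal"
    if shutil.which("nvidia-smi"):
        return "cuda"
    if shutil.which("rocm-smi"):
        return "rocm"
    if shutil.which("sycl-ls"):
        return "sycl"
    return "none"


print(f"Detected GPU backend: {guess_gpu_type()}")
```

The returned strings match the GPU type values used later in the configuration module, so the result can serve as a starting point for that setting.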

Installing Python and Dependencies

Our microservice uses Python 3.10 or later. Python 3.10 introduced several features we will use, including improved type hints and better error messages. Create a virtual environment to isolate our project dependencies from system Python packages. This isolation prevents version conflicts and makes the project reproducible.

The core Python dependencies include FastAPI, a modern web framework for building APIs with automatic documentation and validation. We use Uvicorn as the ASGI server to run FastAPI. The llama-cpp-python package provides Python bindings for llama.cpp, giving us access to efficient LLM inference with GPU support. We also need Pydantic for data validation and serialization.
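After installing the packages, a short check like the one below can confirm that the interpreter version and the core dependencies are in place. It is a sketch, not part of the service: the helper name is hypothetical, and the module list assumes the usual import names (llama_cpp for llama-cpp-python, pydantic_settings for the settings support used later).

```python
import sys
from importlib.util import find_spec
from typing import List

# Import names of the dependencies described above (assumed, see lead-in)
REQUIRED_MODULES = ["fastapi", "uvicorn", "llama_cpp", "pydantic", "pydantic_settings"]


def missing_modules(names: List[str]) -> List[str]:
    """Return the top-level modules that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    if sys.version_info < (3, 10):
        print("Python 3.10 or later is required")
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All core dependencies are importable")
```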

Obtaining LLM Models

To run a local LLM, you need to download model files. The most accessible source is Hugging Face, which hosts thousands of models in various formats. For our service, we need models in GGUF format, the format used by llama.cpp.

Popular model choices include Llama 2, an open-source model from Meta available in sizes from 7 billion to 70 billion parameters. Mistral 7B offers excellent performance for its size with a focus on instruction following. Phi-3 from Microsoft provides strong capabilities in a compact 3.8 billion parameter model. Qwen models from Alibaba offer multilingual support with strong performance.

When selecting a model, consider the quantization level. Q4_K_M quantization provides a good balance between quality and size, reducing model size to roughly 30 percent of the 16-bit original while maintaining most capabilities. Q5_K_M offers slightly better quality at a modest size increase. Q8_0 provides near-original quality but requires about half the memory of the 16-bit original.
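The size trade-off can be estimated with simple arithmetic: parameters times bits per weight, divided by eight to get bytes. The sketch below does this for a 7-billion-parameter model. The bits-per-weight figures are approximate values assumed for illustration, and the result covers the weights only, not the context cache or runtime overhead.

```python
def estimate_model_size_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 10^9 bytes)."""
    return n_params_billion * bits_per_weight / 8


# Approximate effective bits per weight (assumed values for illustration)
APPROX_BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q8_0": 8.5,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.85,
}

for name, bits in APPROX_BITS_PER_WEIGHT.items():
    size = estimate_model_size_gb(7, bits)
    print(f"{name:8s} ~{size:.1f} GB")
```

Comparing the estimate with your available GPU memory, while leaving headroom for the context, gives a quick first filter before downloading a model.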

Download your chosen model and note its path. We will configure our service to load this model at startup.

IMPLEMENTING THE LLM MICROSERVICE

Now we begin implementing our microservice. We will build it incrementally, explaining each component in detail.

Creating the Project Structure

A well-organized project structure makes the code easier to navigate and maintain. Create a directory for your project and organize it into logical modules. The main application code lives in an app directory. Configuration handling goes in a config module. The LLM inference logic resides in a models module. Business logic such as conversation handling lives in a services module. API endpoints are defined in a routes module. Shared utilities and helpers go in a utils module.

This structure follows clean architecture principles by separating concerns. The routes module depends on the services module, which in turn depends on the models module, but the models module does not know about HTTP or routing. This separation allows you to test the LLM inference logic independently of the web framework.
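One way to create this layout is a small script like the following sketch. The file paths follow the module names used in this article's code listings. The __init__.py files and the function name are assumptions added for illustration.

```python
from pathlib import Path
from typing import List

# Package directories and module files used throughout this article
PACKAGES = ["app", "app/models", "app/services", "app/routes", "app/utils"]
MODULES = [
    "app/config.py",
    "app/models/llm.py",
    "app/services/chat_service.py",
    "app/routes/chat.py",
]


def create_skeleton(root: Path) -> List[Path]:
    """Create empty packages and modules under root and return the file paths."""
    created: List[Path] = []
    for package in PACKAGES:
        directory = root / package
        directory.mkdir(parents=True, exist_ok=True)
        init_file = directory / "__init__.py"  # marks the directory as a package
        init_file.touch(exist_ok=True)
        created.append(init_file)
    for module in MODULES:
        path = root / module
        path.touch(exist_ok=True)
        created.append(path)
    return created


# Example usage (the directory name is hypothetical):
# create_skeleton(Path("llm-chat-service"))
```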

Implementing Configuration Management

Configuration management is critical for a production service. Hard-coding values makes the service inflexible and difficult to deploy in different environments. We use environment variables and configuration files to make the service adaptable.

Create a configuration module that defines all settings using Pydantic. Pydantic provides automatic validation and type conversion for configuration values. It can read from environment variables, providing sensible defaults when values are not specified.

Here is the configuration module:

# app/config.py
from enum import Enum
from functools import lru_cache
from typing import Optional

from pydantic_settings import BaseSettings, SettingsConfigDict


class GPUType(str, Enum):
    """Enumeration of supported GPU types"""

    CUDA = "cuda"
    ROCM = "rocm"
    METAL = "metal"
    SYCL = "sycl"
    NONE = "none"


class Settings(BaseSettings):
    """Application configuration settings"""

    # Read values from a .env file in addition to environment variables.
    # protected_namespaces is narrowed so that field names starting with
    # "model_" do not trigger Pydantic namespace warnings.
    model_config = SettingsConfigDict(
        env_file=".env",
        env_file_encoding="utf-8",
        protected_namespaces=("settings_",),
    )

    # Model configuration
    model_path: str
    model_name: str = "local-llm"
    context_length: int = 4096
    max_tokens: int = 2048
    temperature: float = 0.7
    top_p: float = 0.9
    top_k: int = 40
    repeat_penalty: float = 1.1

    # GPU configuration
    gpu_type: GPUType = GPUType.NONE
    n_gpu_layers: int = 0
    main_gpu: int = 0
    tensor_split: Optional[str] = None

    # Server configuration
    host: str = "0.0.0.0"
    port: int = 8000
    workers: int = 1
    log_level: str = "info"

    # Performance configuration
    n_threads: int = 4
    n_batch: int = 512

    # API configuration
    api_key: Optional[str] = None
    enable_streaming: bool = True
    max_concurrent_requests: int = 10


@lru_cache
def get_settings() -> Settings:
    """Get application settings singleton (created once, then cached)"""
    return Settings()

This configuration module defines all the settings our service needs. The model_path specifies where to find the GGUF model file. The context_length determines how much conversation history the model can consider. The max_tokens limits the length of generated responses. Temperature, top_p, and top_k control the randomness and creativity of responses. Lower temperature values produce more focused and deterministic outputs, while higher values increase creativity and randomness.

The GPU configuration section is particularly important. The gpu_type setting tells the service which GPU acceleration to use. The n_gpu_layers specifies how many model layers to offload to the GPU. Setting this to a value larger than the model's layer count, such as 99, offloads the entire model to GPU memory, maximizing performance; llama-cpp-python also accepts -1 to mean all layers. The main_gpu setting selects which GPU to use in multi-GPU systems. The tensor_split allows distributing the model across multiple GPUs, specified as a comma-separated list of proportions.

The server configuration controls how the HTTP server runs. The host setting determines which network interfaces to bind to. Using 0.0.0.0 makes the service accessible from any network interface, necessary for Docker containers. The port specifies which TCP port to listen on.

The performance configuration affects inference speed and resource usage. The n_threads setting controls CPU parallelism for operations not offloaded to GPU. The n_batch parameter affects how the model processes tokens internally, with higher values potentially improving throughput at the cost of memory.
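Because the settings are read from environment variables, a deployment is configured by setting one variable per field. By default pydantic-settings matches variable names to field names without regard to case, so model_path can be supplied as MODEL_PATH. The sketch below renders such a .env file from a dictionary. The helper name and all values, including the model path, are hypothetical.

```python
from typing import Dict


def render_env_file(values: Dict[str, object]) -> str:
    """Render settings as KEY=value lines suitable for a .env file."""
    lines = []
    for name, value in values.items():
        if isinstance(value, bool):
            value = str(value).lower()  # write booleans as true/false
        lines.append(f"{name.upper()}={value}")
    return "\n".join(lines) + "\n"


# Hypothetical settings for a two-GPU NVIDIA host
cuda_env = render_env_file(
    {
        "model_path": "/models/model.gguf",
        "gpu_type": "cuda",
        "n_gpu_layers": 99,
        "tensor_split": "0.6,0.4",
        "enable_streaming": True,
    }
)
print(cuda_env)
```

The same variables can be passed directly to a container at run time instead of being written to a file, which keeps the image itself free of environment-specific values.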

Implementing the LLM Inference Layer

The inference layer encapsulates all interaction with the LLM. This abstraction isolates the rest of the application from the specifics of the llama.cpp library, making it easier to swap implementations if needed.

Here is the inference layer implementation:

# app/models/llm.py
import logging
from typing import Any, Dict, Iterator, List, Optional

from llama_cpp import Llama

from app.config import GPUType, Settings

logger = logging.getLogger(__name__)


class LLMInference:
    """Handles LLM model loading and inference"""

    def __init__(self, settings: Settings):
        """
        Initialize the LLM inference engine

        Args:
            settings: Application settings containing model configuration
        """
        self.settings = settings
        self.model: Optional[Llama] = None
        self._initialize_model()

    def _initialize_model(self) -> None:
        """Load and initialize the LLM model with appropriate GPU settings"""
        logger.info(f"Loading model from {self.settings.model_path}")
        logger.info(f"GPU type: {self.settings.gpu_type}")
        logger.info(f"GPU layers: {self.settings.n_gpu_layers}")

        # Prepare model initialization parameters
        model_params = {
            "model_path": self.settings.model_path,
            "n_ctx": self.settings.context_length,
            "n_threads": self.settings.n_threads,
            "n_batch": self.settings.n_batch,
            "verbose": self.settings.log_level == "debug",
        }

        # Configure GPU acceleration based on type
        if self.settings.gpu_type != GPUType.NONE:
            model_params["n_gpu_layers"] = self.settings.n_gpu_layers

            if self.settings.gpu_type == GPUType.CUDA:
                # CUDA-specific settings
                model_params["main_gpu"] = self.settings.main_gpu
                if self.settings.tensor_split:
                    # Parse tensor split string into list of floats
                    splits = [float(x) for x in self.settings.tensor_split.split(",")]
                    model_params["tensor_split"] = splits
                logger.info("Configured for NVIDIA CUDA acceleration")

            elif self.settings.gpu_type == GPUType.ROCM:
                # ROCm uses same parameters as CUDA through HIP
                model_params["main_gpu"] = self.settings.main_gpu
                if self.settings.tensor_split:
                    splits = [float(x) for x in self.settings.tensor_split.split(",")]
                    model_params["tensor_split"] = splits
                logger.info("Configured for AMD ROCm acceleration")

            elif self.settings.gpu_type == GPUType.METAL:
                # Metal acceleration for Apple Silicon
                logger.info("Configured for Apple Metal acceleration")

            elif self.settings.gpu_type == GPUType.SYCL:
                # Intel GPU acceleration through SYCL
                model_params["main_gpu"] = self.settings.main_gpu
                logger.info("Configured for Intel SYCL acceleration")
        else:
            logger.info("Running on CPU only (no GPU acceleration)")

        try:
            self.model = Llama(**model_params)
            logger.info("Model loaded successfully")
        except Exception as e:
            logger.error(f"Failed to load model: {e}")
            raise

    def generate(
        self,
        prompt: str,
        max_tokens: Optional[int] = None,
        temperature: Optional[float] = None,
        top_p: Optional[float] = None,
        top_k: Optional[int] = None,
        repeat_penalty: Optional[float] = None,
        stop: Optional[List[str]] = None,
        stream: bool = False
    ) -> Iterator[Dict[str, Any]]:
        """
        Generate text from the model

        Args:
            prompt: Input text to generate from
            max_tokens: Maximum tokens to generate (uses config default if None)
            temperature: Sampling temperature (uses config default if None)
            top_p: Nucleus sampling parameter (uses config default if None)
            top_k: Top-k sampling parameter (uses config default if None)
            repeat_penalty: Repetition penalty (uses config default if None)
            stop: List of stop sequences
            stream: Whether to stream tokens as they are generated

        Yields:
            Dictionary containing generated text and metadata
        """
        if self.model is None:
            raise RuntimeError("Model not initialized")

        # Use provided parameters or fall back to configuration defaults
        generation_params = {
            "prompt": prompt,
            "max_tokens": max_tokens or self.settings.max_tokens,
            "temperature": temperature if temperature is not None else self.settings.temperature,
            "top_p": top_p if top_p is not None else self.settings.top_p,
            "top_k": top_k if top_k is not None else self.settings.top_k,
            "repeat_penalty": repeat_penalty if repeat_penalty is not None else self.settings.repeat_penalty,
            "stop": stop or [],
            "stream": stream,
            "echo": False,
        }

        logger.debug(f"Generating with params: {generation_params}")

        try:
            if stream:
                # Stream tokens as they are generated
                for output in self.model(**generation_params):
                    yield {
                        "text": output["choices"][0]["text"],
                        "finish_reason": output["choices"][0].get("finish_reason"),
                    }
            else:
                # Generate complete response
                output = self.model(**generation_params)
                yield {
                    "text": output["choices"][0]["text"],
                    "finish_reason": output["choices"][0].get("finish_reason"),
                    "usage": {
                        "prompt_tokens": output["usage"]["prompt_tokens"],
                        "completion_tokens": output["usage"]["completion_tokens"],
                        "total_tokens": output["usage"]["total_tokens"],
                    },
                }
        except Exception as e:
            logger.error(f"Generation failed: {e}")
            raise

    def get_model_info(self) -> Dict[str, Any]:
        """
        Get information about the loaded model

        Returns:
            Dictionary containing model metadata
        """
        return {
            "model_name": self.settings.model_name,
            "model_path": self.settings.model_path,
            "context_length": self.settings.context_length,
            "gpu_type": self.settings.gpu_type.value,
            "gpu_layers": self.settings.n_gpu_layers,
        }

This inference layer provides a clean interface for the rest of the application. The initialization method loads the model with appropriate GPU settings based on the configuration. The generate method handles both streaming and non-streaming inference, accepting parameters that control the generation process.

The GPU configuration logic deserves special attention. For CUDA and ROCm, we set the main_gpu parameter to select which GPU to use. The tensor_split parameter allows distributing the model across multiple GPUs. For example, a tensor_split of "0.6,0.4" would put 60 percent of the model on the first GPU and 40 percent on the second GPU. This is useful for very large models that do not fit in a single GPU's memory.
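Since tensor_split arrives as a plain string, it is worth validating before it reaches the model. The following is a minimal sketch of such a parser. The function name is hypothetical, and normalizing the proportions so they sum to one is a convenience chosen for this example, not something the inference code above requires.

```python
from typing import List, Optional


def parse_tensor_split(raw: Optional[str]) -> Optional[List[float]]:
    """Parse a string such as "0.6,0.4" into normalized per-GPU proportions."""
    if raw is None or not raw.strip():
        return None  # no split configured: use a single GPU
    try:
        parts = [float(piece) for piece in raw.split(",")]
    except ValueError as exc:
        raise ValueError(
            f"tensor_split must be comma-separated numbers, got {raw!r}"
        ) from exc
    if any(p < 0 for p in parts) or sum(parts) <= 0:
        raise ValueError("tensor_split values must be non-negative and not all zero")
    total = sum(parts)
    return [p / total for p in parts]


print(parse_tensor_split("3,1"))      # three quarters on GPU 0, one quarter on GPU 1
print(parse_tensor_split("0.6,0.4"))
```

Rejecting malformed input at startup produces a clear configuration error instead of a failure deep inside model loading.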

For Metal acceleration on Apple Silicon, we simply set n_gpu_layers to offload computation to the integrated GPU. The llama.cpp library handles the Metal-specific details automatically.

The generate method implements both streaming and non-streaming modes. In streaming mode, tokens are yielded as soon as they are generated, allowing clients to display partial responses. This significantly improves perceived responsiveness for long responses. In non-streaming mode, the complete response is generated before returning, which is simpler but requires the client to wait for the entire response.
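One common way to deliver streamed chunks to an HTTP client is the server-sent events format, where each chunk becomes a "data:" line followed by a blank line. The sketch below shows that framing for dictionaries shaped like the ones generate yields. The function name and the closing [DONE] marker are assumptions borrowed from common practice, not something this service defines.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def to_sse(chunks: Iterable[Dict[str, Any]]) -> Iterator[str]:
    """Frame each chunk as one server-sent event, then signal completion."""
    for chunk in chunks:
        yield f"data: {json.dumps(chunk)}\n\n"
    yield "data: [DONE]\n\n"  # assumed end-of-stream marker


# Stand-in chunks in the same shape as the streaming output above
sample_chunks = [
    {"text": "Hello", "finish_reason": None},
    {"text": " world", "finish_reason": "stop"},
]
for event in to_sse(sample_chunks):
    print(event, end="")
```

Because to_sse is itself a generator, it forwards each event as soon as the underlying chunk is available and never buffers the whole response.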

Implementing the Service Layer

The service layer sits between the API and the inference layer, handling business logic like conversation management and prompt formatting.

Here is the service layer implementation:

# app/services/chat_service.py
import logging
from typing import Any, Dict, Iterator, List, Optional

from app.config import Settings
from app.models.llm import LLMInference

logger = logging.getLogger(__name__)


class Message:
    """Represents a single message in a conversation"""

    def __init__(self, role: str, content: str):
        """
        Initialize a message

        Args:
            role: Message role (system, user, or assistant)
            content: Message content text
        """
        self.role = role
        self.content = content

    def to_dict(self) -> Dict[str, str]:
        """Convert message to dictionary"""
        return {"role": self.role, "content": self.content}


class Conversation:
    """Manages a conversation with message history"""

    def __init__(self, system_prompt: Optional[str] = None):
        """
        Initialize a conversation

        Args:
            system_prompt: Optional system prompt to set behavior
        """
        self.messages: List[Message] = []
        if system_prompt:
            self.messages.append(Message("system", system_prompt))

    def add_message(self, role: str, content: str) -> None:
        """
        Add a message to the conversation

        Args:
            role: Message role (user or assistant)
            content: Message content
        """
        self.messages.append(Message(role, content))

    def get_messages(self) -> List[Dict[str, str]]:
        """Get all messages as list of dictionaries"""
        return [msg.to_dict() for msg in self.messages]

    def format_for_model(self, model_format: str = "chatml") -> str:
        """
        Format conversation for model input

        Args:
            model_format: Format to use (chatml, llama2, etc.)

        Returns:
            Formatted prompt string
        """
        if model_format == "chatml":
            # ChatML format used by many models
            formatted = ""
            for msg in self.messages:
                formatted += f"<|im_start|>{msg.role}\n{msg.content}<|im_end|>\n"
            formatted += "<|im_start|>assistant\n"
            return formatted

        elif model_format == "llama2":
            # Llama 2 chat format
            formatted = ""
            system_msg = None

            # Extract system message if present
            messages = self.messages.copy()
            if messages and messages[0].role == "system":
                system_msg = messages.pop(0).content

            # Format with special tokens
            if system_msg:
                formatted = f"[INST] <<SYS>>\n{system_msg}\n<</SYS>>\n\n"

            for i, msg in enumerate(messages):
                if msg.role == "user":
                    if i == 0 and system_msg:
                        formatted += f"{msg.content} [/INST] "
                    else:
                        formatted += f"[INST] {msg.content} [/INST] "
                elif msg.role == "assistant":
                    formatted += f"{msg.content} "
            return formatted

        elif model_format == "alpaca":
            # Alpaca instruction format
            formatted = ""
            for msg in self.messages:
                if msg.role == "system":
                    formatted += f"{msg.content}\n\n"
                elif msg.role == "user":
                    formatted += f"### Instruction:\n{msg.content}\n\n"
                elif msg.role == "assistant":
                    formatted += f"### Response:\n{msg.content}\n\n"
            formatted += "### Response:\n"
            return formatted

        else:
            # Simple format as fallback
            formatted = ""
            for msg in self.messages:
                formatted += f"{msg.role}: {msg.content}\n"
            formatted += "assistant: "
            return formatted


class ChatService:
    """Service for handling chat interactions with the LLM"""

    def __init__(self, llm: LLMInference, settings: Settings):
        """
        Initialize chat service

        Args:
            llm: LLM inference engine
            settings: Application settings
        """
        self.llm = llm
        self.settings = settings
        self.conversations: Dict[str, Conversation] = {}

    def create_conversation(
        self,
        conversation_id: str,
        system_prompt: Optional[str] = None
    ) -> None:
        """
        Create a new conversation

        Args:
            conversation_id: Unique identifier for the conversation
            system_prompt: Optional system prompt
        """
        self.conversations[conversation_id] = Conversation(system_prompt)
        logger.info(f"Created conversation {conversation_id}")

    def get_conversation(self, conversation_id: str) -> Optional[Conversation]:
        """
        Get an existing conversation

        Args:
            conversation_id: Conversation identifier

        Returns:
            Conversation object or None if not found
        """
        return self.conversations.get(conversation_id)

    def delete_conversation(self, conversation_id: str) -> bool:
        """
        Delete a conversation

        Args:
            conversation_id: Conversation identifier

        Returns:
            True if deleted, False if not found
        """
        if conversation_id in self.conversations:
            del self.conversations[conversation_id]
            logger.info(f"Deleted conversation {conversation_id}")
            return True
        return False

    def chat(
        self,
        message: str,
        conversation_id: Optional[str] = None,
        system_prompt: Optional[str] = None,
        model_format: str = "chatml",
        stream: bool = False,
        **generation_params
    ) -> Iterator[Dict[str, Any]]:
        """
        Process a chat message and generate response

        Args:
            message: User message text
            conversation_id: Optional conversation ID for multi-turn chat
            system_prompt: Optional system prompt for stateless chat
            model_format: Prompt format to use
            stream: Whether to stream the response
            **generation_params: Additional generation parameters

        Yields:
            Response chunks with generated text
        """
        # Handle conversation context
        if conversation_id:
            conversation = self.get_conversation(conversation_id)
            if not conversation:
                raise ValueError(f"Conversation {conversation_id} not found")
            conversation.add_message("user", message)
        else:
            # Stateless chat - create temporary conversation
            conversation = Conversation(system_prompt)
            conversation.add_message("user", message)

        # Format prompt for model
        prompt = conversation.format_for_model(model_format)

        logger.info(f"Processing chat message (stream={stream})")
        logger.debug(f"Formatted prompt: {prompt}")

        # Generate response
        accumulated_text = ""
        for chunk in self.llm.generate(prompt=prompt, stream=stream, **generation_params):
            accumulated_text += chunk["text"]
            yield chunk

        # Add assistant response to conversation history
        if conversation_id:
            conversation.add_message("assistant", accumulated_text)

The service layer implements several important abstractions. The Message class represents individual messages in a conversation. The Conversation class manages message history and handles prompt formatting for different model types.

Different LLM models expect different prompt formats. The format_for_model method implements several common formats. ChatML format uses special tokens like <|im_start|> and <|im_end|> to delimit messages. Llama 2 format uses [INST] and [/INST] tokens with special handling for system prompts. Alpaca format uses structured sections with headers like "### Instruction:" and "### Response:".
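To see what such a prompt looks like in practice, the sketch below applies the ChatML rules to a two-message conversation. It is a standalone illustration that works on plain dictionaries. The function name and the sample messages are made up for this example.

```python
from typing import Dict, List


def format_chatml(messages: List[Dict[str, str]]) -> str:
    """Wrap each message in ChatML delimiters and open the assistant turn."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    parts.append("<|im_start|>assistant\n")  # the model continues from here
    return "".join(parts)


prompt = format_chatml(
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is GGUF?"},
    ]
)
print(prompt)
# <|im_start|>system
# You are a helpful assistant.<|im_end|>
# <|im_start|>user
# What is GGUF?<|im_end|>
# <|im_start|>assistant
```

The prompt ends with an open assistant turn, which is what cues the model to generate the reply rather than continue the user's text. Passing <|im_end|> as a stop sequence keeps generation from running into a new turn.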

The ChatService class provides high-level chat functionality. It manages multiple concurrent conversations, each identified by a unique ID. For stateless single-turn interactions, it creates temporary conversations. The chat method orchestrates the entire process: retrieving or creating a conversation, adding the user message, formatting the prompt, generating the response, and updating the conversation history.

Implementing the API Layer

The API layer exposes our service through HTTP endpoints. We use FastAPI, which provides automatic request validation, response serialization, and interactive API documentation.

Here is the API implementation:

# app/routes/chat.py

from fastapi import APIRouter, HTTPException, Depends
from fastapi.responses import StreamingResponse
from pydantic import BaseModel, Field
from typing import Optional, List, Dict, Any
import json
import logging

from app.services.chat_service import ChatService
from app.models.llm import LLMInference
from app.config import Settings, get_settings

logger = logging.getLogger(__name__)

router = APIRouter()


# Request and response models

class ChatMessage(BaseModel):
    """Single chat message"""
    role: str = Field(..., description="Message role (system, user, or assistant)")
    content: str = Field(..., description="Message content")


class ChatRequest(BaseModel):
    """Request for chat completion"""
    message: str = Field(..., description="User message to process")
    conversation_id: Optional[str] = Field(None, description="Conversation ID for multi-turn chat")
    system_prompt: Optional[str] = Field(None, description="System prompt for behavior control")
    model_format: str = Field("chatml", description="Prompt format (chatml, llama2, alpaca)")
    stream: bool = Field(False, description="Whether to stream the response")
    max_tokens: Optional[int] = Field(None, description="Maximum tokens to generate")
    temperature: Optional[float] = Field(None, description="Sampling temperature")
    top_p: Optional[float] = Field(None, description="Nucleus sampling parameter")
    top_k: Optional[int] = Field(None, description="Top-k sampling parameter")
    stop: Optional[List[str]] = Field(None, description="Stop sequences")


class ChatResponse(BaseModel):
    """Response from chat completion"""
    response: str = Field(..., description="Generated response text")
    conversation_id: Optional[str] = Field(None, description="Conversation ID if applicable")
    finish_reason: Optional[str] = Field(None, description="Reason generation stopped")
    usage: Optional[Dict[str, int]] = Field(None, description="Token usage statistics")


class ConversationRequest(BaseModel):
    """Request to create a conversation"""
    conversation_id: str = Field(..., description="Unique conversation identifier")
    system_prompt: Optional[str] = Field(None, description="System prompt for the conversation")


class ConversationResponse(BaseModel):
    """Response for conversation operations"""
    conversation_id: str = Field(..., description="Conversation identifier")
    messages: List[ChatMessage] = Field(..., description="Conversation messages")


class HealthResponse(BaseModel):
    """Health check response"""
    status: str = Field(..., description="Service status")
    model_info: Dict[str, Any] = Field(..., description="Model information")


# Dependency injection for services

_chat_service: Optional[ChatService] = None


def get_chat_service() -> ChatService:
    """Get chat service singleton"""
    global _chat_service
    if _chat_service is None:
        settings = get_settings()
        llm = LLMInference(settings)
        _chat_service = ChatService(llm, settings)
    return _chat_service


@router.post("/chat", response_model=ChatResponse)
async def chat(
    request: ChatRequest,
    chat_service: ChatService = Depends(get_chat_service)
):
    """
    Process a chat message and generate a response

    This endpoint supports both stateless single-turn chat and stateful
    multi-turn conversations. For multi-turn chat, create a conversation
    first and provide its ID in subsequent requests.
    """
    try:
        # Prepare generation parameters, forwarding only the ones the client set
        gen_params = {}
        if request.max_tokens is not None:
            gen_params["max_tokens"] = request.max_tokens
        if request.temperature is not None:
            gen_params["temperature"] = request.temperature
        if request.top_p is not None:
            gen_params["top_p"] = request.top_p
        if request.top_k is not None:
            gen_params["top_k"] = request.top_k
        if request.stop is not None:
            gen_params["stop"] = request.stop

        if request.stream:
            # Return streaming response
            async def generate():
                try:
                    for chunk in chat_service.chat(
                        message=request.message,
                        conversation_id=request.conversation_id,
                        system_prompt=request.system_prompt,
                        model_format=request.model_format,
                        stream=True,
                        **gen_params
                    ):
                        # Format as server-sent events
                        data = json.dumps(chunk)
                        yield f"data: {data}\n\n"
                    yield "data: [DONE]\n\n"
                except Exception as e:
                    logger.error(f"Streaming error: {e}")
                    error_data = json.dumps({"error": str(e)})
                    yield f"data: {error_data}\n\n"

            return StreamingResponse(
                generate(),
                media_type="text/event-stream"
            )
        else:
            # Return complete response
            response_text = ""
            finish_reason = None
            usage = None
            for chunk in chat_service.chat(
                message=request.message,
                conversation_id=request.conversation_id,
                system_prompt=request.system_prompt,
                model_format=request.model_format,
                stream=False,
                **gen_params
            ):
                response_text += chunk["text"]
                finish_reason = chunk.get("finish_reason")
                usage = chunk.get("usage")

            return ChatResponse(
                response=response_text,
                conversation_id=request.conversation_id,
                finish_reason=finish_reason,
                usage=usage
            )
    except ValueError as e:
        # Raised by the service when the conversation ID is unknown
        raise HTTPException(status_code=400, detail=str(e))
    except Exception as e:
        logger.error(f"Chat error: {e}", exc_info=True)
        raise HTTPException(status_code=500, detail="Internal server error")


@router.post("/conversations", response_model=ConversationResponse)
async def create_conversation(
    request: ConversationRequest,
    chat_service: ChatService = Depends(get_chat_service)
):
    """
    Create a new conversation for multi-turn chat

    Conversations maintain message history across multiple chat requests.
    Use the returned conversation_id in subsequent chat requests.
    """
    try:
        chat_service.create_conversation(
            conversation_id=request.conversation_id,
            system_prompt=request.system_prompt
        )
        conversation = chat_service.get_conversation(request.conversation_id)
        messages = [
            ChatMessage(role=msg["role"], content=msg["content"])
            for msg in conversation.get_messages()
        ]
        return ConversationResponse(
            conversation_id=request.conversation_id,
            messages=messages
        )
    except Exception as e:
        logger.error(f"Conversation creation error: {e}", exc_info=True)
        raise HTTPException(status_code=500, detail="Internal server error")


@router.get("/conversations/{conversation_id}", response_model=ConversationResponse)
async def get_conversation(
    conversation_id: str,
    chat_service: ChatService = Depends(get_chat_service)
):
    """
    Get an existing conversation with its message history
    """
    conversation = chat_service.get_conversation(conversation_id)
    if not conversation:
        raise HTTPException(status_code=404, detail="Conversation not found")
    messages = [
        ChatMessage(role=msg["role"], content=msg["content"])
        for msg in conversation.get_messages()
    ]
    return ConversationResponse(
        conversation_id=conversation_id,
        messages=messages
    )


@router.delete("/conversations/{conversation_id}")
async def delete_conversation(
    conversation_id: str,
    chat_service: ChatService = Depends(get_chat_service)
):
    """
    Delete a conversation and its history
    """
    deleted = chat_service.delete_conversation(conversation_id)
    if not deleted:
        raise HTTPException(status_code=404, detail="Conversation not found")
    return {"status": "deleted", "conversation_id": conversation_id}


@router.get("/health", response_model=HealthResponse)
async def health_check(
    chat_service: ChatService = Depends(get_chat_service)
):
    """
    Check service health and get model information
    """
    model_info = chat_service.llm.get_model_info()
    return HealthResponse(
        status="healthy",
        model_info=model_info
    )

The API layer defines several endpoints. The /chat endpoint is the primary interface for generating responses. It accepts a ChatRequest containing the user message and optional parameters. It supports both streaming and non-streaming responses. For streaming, it returns server-sent events that clients can process incrementally.
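To make the streaming format more concrete, here is a minimal client-side sketch that turns the server-sent event lines into decoded chunks. The helper name parse_sse is an assumption for this example, and the sample lines are hand-written stand-ins for what an HTTP client would read from the response body.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield decoded chunks from SSE lines until the [DONE] marker (hypothetical helper)."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data: "):
            continue  # skip blank separator lines and any other SSE fields
        payload = line[len("data: "):]
        if payload == "[DONE]":
            return  # the service signals the end of the stream this way
        yield json.loads(payload)


# Hand-written sample of the event stream
sample = [
    'data: {"text": " In", "finish_reason": null}',
    "",
    'data: {"text": " circuits", "finish_reason": null}',
    "",
    "data: [DONE]",
]
text = "".join(chunk["text"] for chunk in parse_sse(sample))
print(text)
```

A real client would also check each chunk for an "error" key, since the endpoint reports failures during streaming as a regular data event.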

The /conversations endpoints manage multi-turn conversations. The POST endpoint creates a new conversation with an optional system prompt. The GET endpoint retrieves conversation history. The DELETE endpoint removes a conversation and its history.

The /health endpoint provides a way to check if the service is running and get information about the loaded model. This is useful for monitoring and debugging.

The dependency injection pattern used here ensures that each worker process creates only one ChatService and one LLMInference instance and shares them across all requests it handles. This is critical because loading the LLM model is expensive and should happen only once per process.
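The same create-once behavior can be sketched without FastAPI. The example below uses functools.lru_cache as an alternative to a module-level global; FakeModel is a hypothetical stand-in for an expensive model load and is not part of the service.

```python
from functools import lru_cache


class FakeModel:
    """Stand-in for an expensive-to-load model (hypothetical)."""
    load_count = 0

    def __init__(self) -> None:
        # Count constructions so we can see how often "loading" happens
        FakeModel.load_count += 1


@lru_cache(maxsize=1)
def get_model() -> FakeModel:
    """Create the model on first call and return the cached instance afterwards."""
    return FakeModel()


first = get_model()
second = get_model()
print(first is second, FakeModel.load_count)
```

Either approach gives one instance per process. If you run several worker processes, each one loads its own copy of the model, which matters when sizing GPU memory.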

Creating the Main Application

Now we tie everything together in the main application file:

# app/main.py

from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
import logging

from app.routes import chat
from app.config import get_settings

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)

logger = logging.getLogger(__name__)


def create_app() -> FastAPI:
    """Create and configure the FastAPI application"""
    settings = get_settings()

    # Create FastAPI app
    app = FastAPI(
        title="LLM Chat Microservice",
        description="Local LLM chat service with multi-GPU support",
        version="1.0.0",
        docs_url="/docs",
        redoc_url="/redoc"
    )

    # Configure CORS for cross-origin requests
    app.add_middleware(
        CORSMiddleware,
        allow_origins=["*"],
        allow_credentials=True,
        allow_methods=["*"],
        allow_headers=["*"],
    )

    # Include routers
    app.include_router(chat.router, prefix="/api/v1", tags=["chat"])

    @app.on_event("startup")
    async def startup_event():
        """Initialize services on startup"""
        logger.info("Starting LLM Chat Microservice")
        logger.info(f"Model: {settings.model_path}")
        logger.info(f"GPU: {settings.gpu_type.value}")
        logger.info(f"Server: {settings.host}:{settings.port}")

    @app.on_event("shutdown")
    async def shutdown_event():
        """Cleanup on shutdown"""
        logger.info("Shutting down LLM Chat Microservice")

    return app


app = create_app()

if __name__ == "__main__":
    import uvicorn

    settings = get_settings()
    uvicorn.run(
        "app.main:app",
        host=settings.host,
        port=settings.port,
        log_level=settings.log_level,
        reload=False
    )

This main application file creates the FastAPI application and configures it. The CORS middleware allows the service to accept requests from web browsers running on different domains; the wildcard origins used here suit development, and a production deployment should list its allowed origins explicitly. The startup event logs important configuration information. The shutdown event provides a hook for cleanup if needed.

CONTAINERIZING THE SERVICE WITH DOCKER

Now that we have a working service, we need to containerize it. Docker allows us to package the service with all its dependencies, making it portable and easy to deploy.

Understanding Docker Images and Containers

A Docker image is a template that contains your application and everything it needs to run. An image is built from a Dockerfile, which specifies the build steps. A container is a running instance of an image. You can run multiple containers from the same image, each isolated from the others.

Docker images are built in layers. Each instruction in the Dockerfile creates a new layer. Layers are cached, so if you rebuild an image and a layer has not changed, Docker reuses the cached layer. This makes builds faster and more efficient.

Creating a Multi-Architecture Dockerfile

Our Dockerfile needs to support multiple GPU architectures. We will use build arguments to control which GPU support to include. Here is the Dockerfile:

# Dockerfile

# Use official Python base image
FROM python:3.11-slim AS base

# Set working directory
WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \
    build-essential \
    cmake \
    git \
    wget \
    && rm -rf /var/lib/apt/lists/*

# Create a build stage for compiling llama.cpp with GPU support
FROM base AS builder

# Build arguments for GPU support
ARG GPU_TYPE=none
ARG CUDA_VERSION=12.2.0
ARG ROCM_VERSION=5.7

# Install GPU-specific dependencies based on build argument
RUN if [ "$GPU_TYPE" = "cuda" ]; then \
        # Install CUDA toolkit
        wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.0-1_all.deb && \
        dpkg -i cuda-keyring_1.0-1_all.deb && \
        apt-get update && \
        apt-get install -y cuda-toolkit-$(echo $CUDA_VERSION | cut -d. -f1,2 | tr . -) && \
        rm cuda-keyring_1.0-1_all.deb; \
    elif [ "$GPU_TYPE" = "rocm" ]; then \
        # Install ROCm
        wget https://repo.radeon.com/amdgpu-install/latest/ubuntu/jammy/amdgpu-install_latest_all.deb && \
        apt-get update && \
        apt-get install -y ./amdgpu-install_latest_all.deb && \
        amdgpu-install -y --usecase=rocm && \
        rm amdgpu-install_latest_all.deb; \
    fi

# Copy requirements file
COPY requirements.txt .

# Install Python dependencies with GPU support
RUN if [ "$GPU_TYPE" = "cuda" ]; then \
        CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python --no-cache-dir; \
    elif [ "$GPU_TYPE" = "rocm" ]; then \
        CMAKE_ARGS="-DLLAMA_HIPBLAS=on" pip install llama-cpp-python --no-cache-dir; \
    elif [ "$GPU_TYPE" = "metal" ]; then \
        CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python --no-cache-dir; \
    elif [ "$GPU_TYPE" = "sycl" ]; then \
        CMAKE_ARGS="-DLLAMA_SYCL=on" pip install llama-cpp-python --no-cache-dir; \
    else \
        pip install llama-cpp-python --no-cache-dir; \
    fi

# Install other Python dependencies
RUN pip install --no-cache-dir -r requirements.txt

# Final runtime stage
FROM base AS runtime

# Copy installed packages from builder
COPY --from=builder /usr/local/lib/python3.11/site-packages /usr/local/lib/python3.11/site-packages
COPY --from=builder /usr/local/bin /usr/local/bin

# Copy application code
COPY app/ /app/app/

# Create directory for models
RUN mkdir -p /app/models

# Set environment variables
ENV PYTHONUNBUFFERED=1
ENV MODEL_PATH=/app/models/model.gguf

# Expose port
EXPOSE 8000

# Health check (standard library only; urlopen raises on connection errors and HTTP error statuses)
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/api/v1/health', timeout=5)"

# Run the application
CMD ["python", "-m", "uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]

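This Dockerfile uses a multi-stage build to keep the final image size reasonable. The builder stage compiles llama-cpp-python with appropriate GPU support based on the GPU_TYPE build argument. The runtime stage copies only the installed Python packages and scripts from the builder, leaving behind the GPU toolkit and intermediate build files. Two caveats apply: the runtime stage still inherits the compilers installed in the base stage, and a GPU build also needs the matching runtime libraries (for example the CUDA runtime and cuBLAS) to be present in the final image.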

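The GPU_TYPE argument controls which GPU support to compile. Setting it to "cuda" compiles with CUDA support. Setting it to "rocm" compiles with ROCm support. Setting it to "metal" compiles with Metal support for Apple Silicon. Setting it to "sycl" compiles with SYCL support for Intel GPUs. Setting it to "none" or omitting it compiles CPU-only support. The LLAMA_-prefixed CMake flags shown here match older llama-cpp-python releases; newer releases use GGML_-prefixed names such as GGML_CUDA, so pin the package version or check the flag names for the version you install.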

The HEALTHCHECK instruction tells Docker how to check if the container is healthy. It periodically calls the health endpoint and marks the container as unhealthy if the check fails. Docker and Docker Compose report this status, but Kubernetes ignores the Dockerfile HEALTHCHECK and relies on its own liveness and readiness probes, which we configure in the Deployment manifest.
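If you prefer a health check script over a one-line command, the logic can be sketched with the standard library alone, which avoids adding an HTTP client to the image. The function name is_healthy and the idea of a separate script file are assumptions for this example.

```python
import urllib.error
import urllib.request


def is_healthy(url: str, timeout: float = 5.0) -> bool:
    """Return True only if the endpoint answers with a 2xx status (hypothetical helper)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status
        return False


# Nothing is expected to listen on port 1, so this reports unhealthy
print(is_healthy("http://127.0.0.1:1/api/v1/health", timeout=1.0))
```

A script built on this would call sys.exit(0) when is_healthy returns True and sys.exit(1) otherwise, because Docker interprets exit code 0 as healthy and 1 as unhealthy.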

Creating the Requirements File

The requirements.txt file lists all Python dependencies:

# requirements.txt

fastapi==0.104.1

uvicorn[standard]==0.24.0

pydantic==2.5.0

pydantic-settings==2.1.0

python-multipart==0.0.6

Notice that llama-cpp-python is not in this file. We install it separately in the Dockerfile with appropriate CMAKE_ARGS for GPU support. This is necessary because the package needs to be compiled with specific flags for each GPU type.

Building Docker Images for Different GPU Types

To build an image with CUDA support:

docker build --build-arg GPU_TYPE=cuda -t llm-chat-service:cuda .

To build an image with ROCm support:

docker build --build-arg GPU_TYPE=rocm -t llm-chat-service:rocm .

To build an image with Metal support for Apple Silicon:

docker build --build-arg GPU_TYPE=metal -t llm-chat-service:metal .

To build an image with SYCL support for Intel GPUs:

docker build --build-arg GPU_TYPE=sycl -t llm-chat-service:sycl .

To build a CPU-only image:

docker build --build-arg GPU_TYPE=none -t llm-chat-service:cpu .

The build process takes several minutes because it compiles llama-cpp-python from source with GPU support. The resulting image contains everything needed to run the service.

Running the Docker Container

To run the container, you need to mount a volume containing your model file and set environment variables for configuration. Here is an example for CUDA:

docker run -d \
    --name llm-chat \
    --gpus all \
    -p 8000:8000 \
    -v /path/to/models:/app/models \
    -e MODEL_PATH=/app/models/your-model.gguf \
    -e GPU_TYPE=cuda \
    -e N_GPU_LAYERS=99 \
    llm-chat-service:cuda

The --gpus all flag makes all GPUs available to the container. For specific GPUs, use --gpus '"device=0,1"' to expose only GPUs 0 and 1.

For ROCm on AMD GPUs:

docker run -d \
    --name llm-chat \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add video \
    -p 8000:8000 \
    -v /path/to/models:/app/models \
    -e MODEL_PATH=/app/models/your-model.gguf \
    -e GPU_TYPE=rocm \
    -e N_GPU_LAYERS=99 \
    llm-chat-service:rocm

ROCm requires exposing specific devices and adding the container to the video group for GPU access.

For Metal on Apple Silicon:

docker run -d \
    --name llm-chat \
    -p 8000:8000 \
    -v /path/to/models:/app/models \
    -e MODEL_PATH=/app/models/your-model.gguf \
    -e GPU_TYPE=metal \
    -e N_GPU_LAYERS=99 \
    llm-chat-service:metal

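This command needs no additional device flags, but note that Docker on macOS runs containers inside a Linux virtual machine that has no access to the Metal GPU, so inference in this container runs on the CPU. To use Metal acceleration on Apple Silicon, run the service natively on macOS instead of in a container.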

Creating a Docker Compose File

Docker Compose simplifies running multi-container applications. Here is a docker-compose.yml file for our service:

# docker-compose.yml

version: '3.8'

services:
  llm-chat:
    build:
      context: .
      dockerfile: Dockerfile
      args:
        GPU_TYPE: ${GPU_TYPE:-none}
    image: llm-chat-service:${GPU_TYPE:-none}
    container_name: llm-chat
    ports:
      - "${PORT:-8000}:8000"
    volumes:
      - ${MODEL_DIR:-./models}:/app/models
    environment:
      - MODEL_PATH=${MODEL_PATH:-/app/models/model.gguf}
      - GPU_TYPE=${GPU_TYPE:-none}
      - N_GPU_LAYERS=${N_GPU_LAYERS:-0}
      - CONTEXT_LENGTH=${CONTEXT_LENGTH:-4096}
      - MAX_TOKENS=${MAX_TOKENS:-2048}
      - TEMPERATURE=${TEMPERATURE:-0.7}
      - LOG_LEVEL=${LOG_LEVEL:-info}
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu]
    restart: unless-stopped
    healthcheck:
      # Standard library only; urlopen raises on connection errors and HTTP error statuses
      test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8000/api/v1/health', timeout=5)"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s

This compose file uses environment variables for configuration, making it easy to customize without editing the file. Create a .env file with your settings:

# .env

GPU_TYPE=cuda

MODEL_DIR=/path/to/models

MODEL_PATH=/app/models/your-model.gguf

N_GPU_LAYERS=99

PORT=8000

CONTEXT_LENGTH=4096

MAX_TOKENS=2048

TEMPERATURE=0.7

LOG_LEVEL=info

Then start the service with:

docker-compose up -d

Docker Compose reads the .env file automatically and substitutes the values into the compose file.

DEPLOYING TO KUBERNETES

Kubernetes provides production-grade orchestration for containerized applications. Deploying our LLM service to Kubernetes enables automatic scaling, self-healing, and efficient resource management.

Understanding Kubernetes Concepts

Kubernetes organizes resources into several key abstractions. A Pod is the smallest deployable unit, typically containing one container. A Deployment manages a set of identical Pods, ensuring the desired number of replicas are running. A Service provides a stable network endpoint for accessing Pods, load balancing requests across replicas. A ConfigMap stores configuration data that Pods can consume. A PersistentVolume provides storage that persists beyond Pod lifecycles.

For GPU workloads, Kubernetes uses device plugins to expose GPUs to Pods. Each GPU vendor provides a device plugin that makes their GPUs available as schedulable resources. Pods can request GPU resources, and Kubernetes schedules them on nodes with available GPUs.

Installing GPU Support in Kubernetes

Before deploying our service, ensure your Kubernetes cluster has GPU support configured. For NVIDIA GPUs, install the NVIDIA device plugin:

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/main/nvidia-device-plugin.yml

For AMD GPUs, install the AMD device plugin:

kubectl create -f https://raw.githubusercontent.com/RadeonOpenCompute/k8s-device-plugin/master/k8s-ds-amdgpu-dp.yaml

For Intel GPUs, install the Intel device plugin:

kubectl apply -k 'https://github.com/intel/intel-device-plugins-for-kubernetes/deployments/gpu_plugin?ref=main'

These device plugins run as DaemonSets, meaning they run on every node in the cluster, exposing GPUs as schedulable resources.

Creating Kubernetes Manifests

We need several Kubernetes resources to deploy our service. Let us create them one by one.

First, create a ConfigMap for configuration:

# k8s/configmap.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: <configmap-name>  # required; use the same name in the Deployment's configMapRef
  namespace: default
data:
  MODEL_NAME: "local-llm"
  CONTEXT_LENGTH: "4096"
  MAX_TOKENS: "2048"
  TEMPERATURE: "0.7"
  TOP_P: "0.9"
  TOP_K: "40"
  REPEAT_PENALTY: "1.1"
  N_THREADS: "4"
  N_BATCH: "512"
  LOG_LEVEL: "info"
  HOST: "0.0.0.0"
  PORT: "8000"
  ENABLE_STREAMING: "true"
  MAX_CONCURRENT_REQUESTS: "10"

This ConfigMap stores non-sensitive configuration. We reference it in the Deployment to inject these values as environment variables.

Next, create a PersistentVolumeClaim for model storage:

# k8s/pvc.yaml

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-models-pvc  # matches the claimName used in the Deployment
  namespace: default
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 20Gi
  storageClassName: standard

This PVC requests 20GB of storage for model files. The ReadOnlyMany access mode allows multiple Pods to mount the volume simultaneously, which is safe because the models are read-only. In practice, you would populate this volume with your model files before deploying the service.

Now create the Deployment:

# k8s/deployment.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: <deployment-name>  # required; the HorizontalPodAutoscaler's scaleTargetRef must use the same name
  namespace: default
  labels:
    app: llm-chat
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-chat
  template:
    metadata:
      labels:
        app: llm-chat
    spec:
      containers:
        - name: llm-chat
          image: llm-chat-service:cuda
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 8000
              protocol: TCP
          env:
            - name: MODEL_PATH
              value: "/app/models/model.gguf"
            - name: GPU_TYPE
              value: "cuda"
            - name: N_GPU_LAYERS
              value: "99"
            - name: MAIN_GPU
              value: "0"
          envFrom:
            - configMapRef:
                name: <configmap-name>  # name of the ConfigMap defined in k8s/configmap.yaml
          volumeMounts:
            - name: models
              mountPath: /app/models
              readOnly: true
          resources:
            requests:
              memory: "8Gi"
              cpu: "2"
              nvidia.com/gpu: "1"
            limits:
              memory: "16Gi"
              cpu: "4"
              nvidia.com/gpu: "1"
          livenessProbe:
            httpGet:
              path: /api/v1/health
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 30
            timeoutSeconds: 10
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /api/v1/health
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: llm-models-pvc
      nodeSelector:
        accelerator: nvidia-gpu
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule

This Deployment creates two replicas of our LLM service. The replicas field specifies how many identical Pods to run. Kubernetes distributes these Pods across available nodes, providing redundancy and load distribution.

The resources section is critical for GPU workloads. The requests specify the minimum resources the Pod needs. Requesting nvidia.com/gpu: "1" tells Kubernetes this Pod needs one NVIDIA GPU. Kubernetes only schedules the Pod on nodes with available GPUs. The limits specify maximum resources the Pod can use. GPUs are extended resources that cannot be overcommitted, so the GPU request and limit must be equal; the Pod then has exclusive use of the GPU assigned to it.

The nodeSelector ensures Pods only run on nodes with NVIDIA GPUs. The tolerations allow Pods to run on nodes tainted for GPU workloads. Taints and tolerations are Kubernetes mechanisms for dedicating nodes to specific workloads.

The livenessProbe checks if the container is still running correctly. If the probe fails repeatedly, Kubernetes restarts the container. The readinessProbe checks if the container is ready to accept traffic. If the probe fails, Kubernetes removes the Pod from the Service load balancer until it passes again.

For AMD ROCm GPUs, modify the resources section:

resources:
  requests:
    memory: "8Gi"
    cpu: "2"
    amd.com/gpu: "1"
  limits:
    memory: "16Gi"
    cpu: "4"
    amd.com/gpu: "1"

And update the nodeSelector:

nodeSelector:
  accelerator: amd-gpu

For Intel GPUs, use:

resources:
  requests:
    memory: "8Gi"
    cpu: "2"
    gpu.intel.com/i915: "1"
  limits:
    memory: "16Gi"
    cpu: "4"
    gpu.intel.com/i915: "1"

Now create a Service to expose the Deployment:

# k8s/service.yaml

apiVersion: v1
kind: Service
metadata:
  name: llm-chat-service  # the name used by the kubectl port-forward command in the testing section
  namespace: default
  labels:
    app: llm-chat
spec:
  type: ClusterIP
  selector:
    app: llm-chat
  ports:
    - name: http
      port: 80
      targetPort: 8000
      protocol: TCP
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 3600

This Service creates a stable endpoint for accessing the LLM Pods. The ClusterIP type makes the Service accessible only within the cluster. For external access, you would use LoadBalancer or create an Ingress.

The sessionAffinity setting is important for stateful conversations. Setting it to ClientIP routes requests from the same client IP to the same Pod for the configured timeout (one hour here), as long as that Pod keeps running. This is useful if you store conversation state in memory, but the state is still lost when the Pod restarts or is scaled away. For production systems, you should therefore keep conversation state in external storage like Redis so that any replica can serve any request.
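One way to prepare for external conversation storage is to hide the storage behind a small interface. The sketch below is an assumption about how that could look: ConversationStore and InMemoryStore are hypothetical names, and the dictionary-backed class only imitates a key-value server by storing JSON strings under string keys. A Redis-backed class with the same three methods could then replace it without touching the chat logic.

```python
import json
from typing import Dict, List, Optional, Protocol


class ConversationStore(Protocol):
    """Hypothetical storage interface; not part of the service code shown earlier."""

    def save(self, conversation_id: str, messages: List[Dict[str, str]]) -> None: ...

    def load(self, conversation_id: str) -> Optional[List[Dict[str, str]]]: ...

    def delete(self, conversation_id: str) -> bool: ...


class InMemoryStore:
    """Dict-backed stand-in that keeps JSON strings, as a key-value server would."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def save(self, conversation_id: str, messages: List[Dict[str, str]]) -> None:
        # Serialize so the stored value is independent of in-process objects
        self._data[f"conversation:{conversation_id}"] = json.dumps(messages)

    def load(self, conversation_id: str) -> Optional[List[Dict[str, str]]]:
        raw = self._data.get(f"conversation:{conversation_id}")
        return json.loads(raw) if raw is not None else None

    def delete(self, conversation_id: str) -> bool:
        return self._data.pop(f"conversation:{conversation_id}", None) is not None


store: ConversationStore = InMemoryStore()
store.save("test-conv-123", [{"role": "user", "content": "What is 15 times 23?"}])
history = store.load("test-conv-123")
print(history)
```

With state held outside the Pods, session affinity becomes an optimization rather than a requirement.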

For external access, create an Ingress:

# k8s/ingress.yaml

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: <ingress-name>  # required; choose a name for this Ingress
  namespace: default
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "10m"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
spec:
  ingressClassName: nginx
  rules:
    - host: llm-chat.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: llm-chat-service  # the Service defined in k8s/service.yaml
                port:
                  number: 80

This Ingress routes external traffic to the Service. The annotations configure the NGINX ingress controller to handle large request bodies and long timeouts, both important for LLM services. Replace llm-chat.example.com with your actual domain.

For automatic scaling based on load, create a HorizontalPodAutoscaler:

# k8s/hpa.yaml

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: <hpa-name>  # required; choose a name for this autoscaler
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: <deployment-name>  # name of the Deployment defined in k8s/deployment.yaml
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 100
          periodSeconds: 30

This HorizontalPodAutoscaler automatically adjusts the number of replicas based on CPU and memory usage. Utilization is measured relative to each Pod's resource requests: when average CPU utilization exceeds 70 percent or memory utilization exceeds 80 percent, it scales up. When usage drops, it scales down. The behavior section controls how quickly scaling happens. Scaling up happens quickly to handle traffic spikes, while scaling down happens slowly to avoid thrashing.
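To get a feel for the numbers, the core of the autoscaler's calculation can be sketched as follows. Kubernetes documents the rule as the current replica count multiplied by the ratio of current to target metric value, rounded up, with no change when the ratio is within a tolerance (0.1 by default). The function below is a simplified model under those assumptions; it ignores stabilization windows, scaling policies, and Pods that are not ready, and when several metrics are configured the autoscaler uses the largest result.

```python
import math


def desired_replicas(current_replicas: int, current_utilization: float,
                     target_utilization: float, min_replicas: int = 2,
                     max_replicas: int = 10, tolerance: float = 0.1) -> int:
    """Simplified HPA rule: ceil(current * current/target), clamped to [min, max]."""
    ratio = current_utilization / target_utilization
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to the target; do not scale
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(2, 90, 70))  # CPU at 90% against a 70% target
print(desired_replicas(4, 72, 70))  # within tolerance
print(desired_replicas(4, 20, 70))  # low load, bounded by minReplicas
```

With the manifest above, two replicas at 90 percent CPU become three, four replicas at 72 percent stay at four, and four idle replicas shrink to the minimum of two.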

Note that GPU-based autoscaling is more complex because GPUs are discrete resources. Each Pod gets a whole GPU, so scaling happens in GPU-sized increments. For more sophisticated GPU-based autoscaling, you would use custom metrics based on GPU utilization or request queue depth.

Deploying to Kubernetes

To deploy all resources, apply the manifests in order:

kubectl apply -f k8s/configmap.yaml

kubectl apply -f k8s/pvc.yaml

kubectl apply -f k8s/deployment.yaml

kubectl apply -f k8s/service.yaml

kubectl apply -f k8s/ingress.yaml

kubectl apply -f k8s/hpa.yaml

Check the deployment status:

kubectl get deployments

kubectl get pods

kubectl get services

Watch the Pods start:

kubectl get pods -w

View logs from a Pod:

kubectl logs -f <pod-name>

If a Pod fails to start, describe it to see events and errors:

kubectl describe pod <pod-name>

Common issues include insufficient GPU resources, missing model files in the PersistentVolume, or incorrect environment variables. The describe command shows detailed information about why a Pod is not running.

TESTING THE SERVICE

Once deployed, test the service to ensure it works correctly. We will test both locally with Docker and in Kubernetes.

Testing with curl

The simplest test uses curl to send a request to the chat endpoint. For a local Docker container:

curl -X POST http://localhost:8000/api/v1/chat \
    -H "Content-Type: application/json" \
    -d '{
        "message": "What is the capital of France?",
        "system_prompt": "You are a helpful assistant.",
        "model_format": "chatml",
        "stream": false,
        "temperature": 0.7
    }'

This sends a simple question to the LLM. The response includes the generated text and metadata:

{
    "response": "The capital of France is Paris. Paris is not only the capital but also the largest city in France, known for its art, culture, fashion, and iconic landmarks like the Eiffel Tower.",
    "conversation_id": null,
    "finish_reason": "stop",
    "usage": {
        "prompt_tokens": 42,
        "completion_tokens": 38,
        "total_tokens": 80
    }
}

To test streaming responses:

curl -X POST http://localhost:8000/api/v1/chat \
    -H "Content-Type: application/json" \
    -d '{
        "message": "Write a short poem about AI.",
        "stream": true
    }' \
    --no-buffer

The --no-buffer flag prevents curl from buffering the output, allowing you to see tokens as they arrive. The response comes as server-sent events:

data: {"text": " In", "finish_reason": null}

data: {"text": " circuits", "finish_reason": null}

data: {"text": " deep", "finish_reason": null}

data: {"text": ",", "finish_reason": null}

data: [DONE]

Testing multi-turn conversations requires creating a conversation first:

curl -X POST http://localhost:8000/api/v1/conversations \
    -H "Content-Type: application/json" \
    -d '{
        "conversation_id": "test-conv-123",
        "system_prompt": "You are a helpful math tutor."
    }'

Then send messages with the conversation ID:

curl -X POST http://localhost:8000/api/v1/chat \
    -H "Content-Type: application/json" \
    -d '{
        "message": "What is 15 times 23?",
        "conversation_id": "test-conv-123"
    }'

Send a follow-up message:

curl -X POST http://localhost:8000/api/v1/chat \
    -H "Content-Type: application/json" \
    -d '{
        "message": "Now add 100 to that result.",
        "conversation_id": "test-conv-123"
    }'

The service maintains conversation context, so the second message refers to the previous result without repeating it.

Retrieve the conversation history:

curl http://localhost:8000/api/v1/conversations/test-conv-123

Delete the conversation when done:

curl -X DELETE http://localhost:8000/api/v1/conversations/test-conv-123

Testing in Kubernetes

For Kubernetes deployments, first port-forward to access the Service locally:

kubectl port-forward service/llm-chat-service 8000:80

Then use the same curl commands as above, connecting to localhost:8000.

Alternatively, if you configured an Ingress, access the service through its external domain:

curl -X POST https://llm-chat.example.com/api/v1/chat \

-H "Content-Type: application/json" \

-d '{

"message": "Hello, how are you?"

}'

MONITORING AND OBSERVABILITY

Production services require monitoring to track health, performance, and usage.

Kubernetes provides built-in tools, and you can add more sophisticated

monitoring with Prometheus and Grafana.

Viewing Logs

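Kubernetes does not merge logs across Pods on its own, but kubectl can read from several Pods at once through a label selector. View logs from all Pods in the

Deployment:

kubectl logs -l app=llm-chat --tail=100 -f

This follows logs from all Pods with the app=llm-chat label, showing the last

100 lines of each Pod and streaming new entries. When following more than five Pods, raise the concurrency limit with --max-log-requests.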

For structured logging, modify the application to output JSON logs. This makes

logs easier to parse and analyze with log aggregation tools like ELK

(Elasticsearch, Logstash, Kibana) or Loki.
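A minimal sketch of JSON logging with Python's standard logging module follows. The class and function names (JsonFormatter, configure_json_logging) and the choice of fields are assumptions for illustration; adapt them to whatever your log pipeline expects.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Format each log record as a single-line JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry)


def configure_json_logging(level: int = logging.INFO) -> None:
    """Send all root-logger output to stderr as JSON lines."""
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    root = logging.getLogger()
    root.handlers = [handler]
    root.setLevel(level)


configure_json_logging()
logging.getLogger("llm-chat").info("model loaded in %.1f s", 12.3)
```

Each log call then produces one JSON object per line, which log collectors can parse without custom patterns.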

Checking Resource Usage

Monitor resource usage with kubectl top:

kubectl top pods -l app=llm-chat

This shows CPU and memory usage for each Pod. For GPU usage, you need to install

a GPU monitoring solution like NVIDIA DCGM Exporter for NVIDIA GPUs or ROCm SMI

Exporter for AMD GPUs.

Health Checks

The health endpoint provides service status:

curl http://localhost:8000/api/v1/health

The response includes model information:

{

"status": "healthy",

"model_info": {

"model_name": "local-llm",

"model_path": "/app/models/model.gguf",

"context_length": 4096,

"gpu_type": "cuda",

"gpu_layers": 99

}

}

Kubernetes uses this endpoint for liveness and readiness probes. If the liveness probe fails repeatedly, Kubernetes restarts the container; if the readiness probe fails, the Pod is removed from the Service endpoints until it recovers.
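Deployment scripts and integration tests often need the same "wait until ready" behaviour that a readiness probe provides. The sketch below polls a health callable until it reports "healthy". The function name wait_until_healthy, its parameters, and the intermediate "loading" status are hypothetical; in practice the callable could wrap an HTTP GET against /api/v1/health.

```python
import time
from typing import Any, Callable, Dict


def wait_until_healthy(
    fetch_health: Callable[[], Dict[str, Any]],
    attempts: int = 5,
    delay_seconds: float = 2.0,
) -> bool:
    """Poll a health callable until it reports "healthy" or attempts run out."""
    for attempt in range(attempts):
        try:
            if fetch_health().get("status") == "healthy":
                return True
        except Exception:
            pass  # treat any failure (timeout, refused connection) as not ready yet
        if attempt < attempts - 1:
            time.sleep(delay_seconds)
    return False


# Stand-in callable that fails once, reports "loading" once, then succeeds
responses = iter(
    [ConnectionError("refused"), {"status": "loading"}, {"status": "healthy"}]
)


def fake_fetch() -> Dict[str, Any]:
    item = next(responses)
    if isinstance(item, Exception):
        raise item
    return item


print(wait_until_healthy(fake_fetch, attempts=5, delay_seconds=0.0))
```

Passing the fetch step in as a callable keeps the polling logic testable without a running service.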

BUILDING A CLIENT APPLICATION

To demonstrate using the LLM service, let us build a simple client application.

We will create both a Python client library and a command-line interface.

Python Client Library

First, create a reusable client library that encapsulates the API calls:

# client/llm_client.py

import requests

from typing import Optional, Dict, Any, Iterator, List

import json

class LLMClientError(Exception):

"""Base exception for LLM client errors"""

pass

class LLMClient:

"""Client for interacting with the LLM chat microservice"""

def __init__(self, base_url: str, api_key: Optional[str] = None, timeout: int = 300):

"""

Initialize the LLM client

Args:

base_url: Base URL of the LLM service (e.g., http://localhost:8000)

api_key: Optional API key for authentication

timeout: Request timeout in seconds

"""

self.base_url = base_url.rstrip('/')

self.api_key = api_key

self.timeout = timeout

self.session = requests.Session()

if api_key:

self.session.headers.update({'Authorization': f'Bearer {api_key}'})

def chat(

self,

message: str,

conversation_id: Optional[str] = None,

system_prompt: Optional[str] = None,

model_format: str = "chatml",

stream: bool = False,

max_tokens: Optional[int] = None,

temperature: Optional[float] = None,

top_p: Optional[float] = None,

top_k: Optional[int] = None,

stop: Optional[List[str]] = None

) -> Dict[str, Any]:

"""

Send a chat message and get a response

Args:

message: User message to send

conversation_id: Optional conversation ID for multi-turn chat

system_prompt: Optional system prompt for behavior control

model_format: Prompt format (chatml, llama2, alpaca)

stream: Whether to stream the response

max_tokens: Maximum tokens to generate

temperature: Sampling temperature

top_p: Nucleus sampling parameter

top_k: Top-k sampling parameter

stop: List of stop sequences

Returns:

Response dictionary with generated text and metadata. When stream is True, an iterator that yields text chunks is returned instead.

Raises:

LLMClientError: If the request fails

"""

url = f"{self.base_url}/api/v1/chat"

payload = {

"message": message,

"model_format": model_format,

"stream": stream

}

if conversation_id:

payload["conversation_id"] = conversation_id

if system_prompt:

payload["system_prompt"] = system_prompt

if max_tokens is not None:

payload["max_tokens"] = max_tokens

if temperature is not None:

payload["temperature"] = temperature

if top_p is not None:

payload["top_p"] = top_p

if top_k is not None:

payload["top_k"] = top_k

if stop is not None:

payload["stop"] = stop

try:

if stream:

return self._stream_chat(url, payload)

else:

response = self.session.post(url, json=payload, timeout=self.timeout)

response.raise_for_status()

return response.json()

except requests.exceptions.RequestException as e:

raise LLMClientError(f"Chat request failed: {e}")

def _stream_chat(self, url: str, payload: Dict[str, Any]) -> Iterator[str]:

"""

Stream chat response

Args:

url: API endpoint URL

payload: Request payload

Yields:

Text chunks as they are generated

Raises:

LLMClientError: If streaming fails

"""

try:

response = self.session.post(

url,

json=payload,

stream=True,

timeout=self.timeout

)

response.raise_for_status()

for line in response.iter_lines():

if line:

line = line.decode('utf-8')

if line.startswith('data: '):

data = line[6:] # Remove 'data: ' prefix

if data == '[DONE]':

break

try:

chunk = json.loads(data)

if 'text' in chunk:

yield chunk['text']

elif 'error' in chunk:

raise LLMClientError(f"Server error: {chunk['error']}")

except json.JSONDecodeError:

continue

except requests.exceptions.RequestException as e:

raise LLMClientError(f"Streaming request failed: {e}")

def create_conversation(

self,

conversation_id: str,

system_prompt: Optional[str] = None

) -> Dict[str, Any]:

"""

Create a new conversation

Args:

conversation_id: Unique identifier for the conversation

system_prompt: Optional system prompt

Returns:

Conversation details

Raises:

LLMClientError: If creation fails

"""

url = f"{self.base_url}/api/v1/conversations"

payload = {"conversation_id": conversation_id}

if system_prompt:

payload["system_prompt"] = system_prompt

try:

response = self.session.post(url, json=payload, timeout=self.timeout)

response.raise_for_status()

return response.json()

except requests.exceptions.RequestException as e:

raise LLMClientError(f"Conversation creation failed: {e}")

def get_conversation(self, conversation_id: str) -> Dict[str, Any]:

"""

Get conversation details and history

Args:

conversation_id: Conversation identifier

Returns:

Conversation details with message history

Raises:

LLMClientError: If retrieval fails

"""

url = f"{self.base_url}/api/v1/conversations/{conversation_id}"

try:

response = self.session.get(url, timeout=self.timeout)

response.raise_for_status()

return response.json()

except requests.exceptions.RequestException as e:

raise LLMClientError(f"Conversation retrieval failed: {e}")

def delete_conversation(self, conversation_id: str) -> Dict[str, Any]:

"""

Delete a conversation

Args:

conversation_id: Conversation identifier

Returns:

Deletion confirmation

Raises:

LLMClientError: If deletion fails

"""

url = f"{self.base_url}/api/v1/conversations/{conversation_id}"

try:

response = self.session.delete(url, timeout=self.timeout)

response.raise_for_status()

return response.json()

except requests.exceptions.RequestException as e:

raise LLMClientError(f"Conversation deletion failed: {e}")

def health_check(self) -> Dict[str, Any]:

"""

Check service health

Returns:

Health status and model information

Raises:

LLMClientError: If health check fails

"""

url = f"{self.base_url}/api/v1/health"

try:

response = self.session.get(url, timeout=10)

response.raise_for_status()

return response.json()

except requests.exceptions.RequestException as e:

raise LLMClientError(f"Health check failed: {e}")

def close(self):

"""Close the client session"""

self.session.close()

def __enter__(self):

"""Context manager entry"""

return self

def __exit__(self, exc_type, exc_val, exc_tb):

"""Context manager exit"""

self.close()

This client library provides a clean Python interface for the LLM service. It handles request formatting, error handling, and streaming responses. Context manager support lets you use it in a with statement so the underlying HTTP session is closed automatically.

Command-Line Interface

Now create a command-line interface using the client library:

# client/cli.py

import argparse

import sys

import uuid

from llm_client import LLMClient, LLMClientError

def print_streaming_response(client: LLMClient, message: str, **kwargs):

"""

Print a streaming response with real-time output

Args:

client: LLM client instance

message: Message to send

**kwargs: Additional chat parameters

"""

print("Assistant: ", end='', flush=True)

try:

for chunk in client.chat(message, stream=True, **kwargs):

print(chunk, end='', flush=True)

print() # New line after response

except LLMClientError as e:

print(f"\nError: {e}", file=sys.stderr)

sys.exit(1)

def print_complete_response(client: LLMClient, message: str, **kwargs):

"""

Print a complete response after generation finishes

Args:

client: LLM client instance

message: Message to send

**kwargs: Additional chat parameters

"""

try:

response = client.chat(message, stream=False, **kwargs)

print(f"Assistant: {response['response']}")

if response.get('usage'):

usage = response['usage']

print(f"\nTokens - Prompt: {usage['prompt_tokens']}, "

f"Completion: {usage['completion_tokens']}, "

f"Total: {usage['total_tokens']}")

except LLMClientError as e:

print(f"Error: {e}", file=sys.stderr)

sys.exit(1)

def interactive_mode(client: LLMClient, args: argparse.Namespace):

"""

Run interactive chat mode

Args:

client: LLM client instance

args: Command-line arguments

"""

# Create conversation if using multi-turn mode

conversation_id = None

if args.multi_turn:

conversation_id = f"cli-{uuid.uuid4()}"

try:

client.create_conversation(conversation_id, args.system_prompt)

print(f"Started conversation: {conversation_id}")

except LLMClientError as e:

print(f"Error creating conversation: {e}", file=sys.stderr)

sys.exit(1)

print("Interactive mode. Type 'exit' or 'quit' to end, 'clear' to start new conversation.")

print()

try:

while True:

try:

user_input = input("You: ").strip()

except EOFError:

break

if not user_input:

continue

if user_input.lower() in ['exit', 'quit']:

break

if user_input.lower() == 'clear':

if conversation_id:

client.delete_conversation(conversation_id)

conversation_id = f"cli-{uuid.uuid4()}"

client.create_conversation(conversation_id, args.system_prompt)

print("Started new conversation")

continue

# Send message

chat_params = {

'conversation_id': conversation_id,

'system_prompt': args.system_prompt if not conversation_id else None,

'model_format': args.format,

'temperature': args.temperature,

'max_tokens': args.max_tokens,

}

if args.stream:

print_streaming_response(client, user_input, **chat_params)

else:

print_complete_response(client, user_input, **chat_params)

print()

finally:

# Cleanup conversation

if conversation_id:

try:

client.delete_conversation(conversation_id)

except LLMClientError:

pass

def single_message_mode(client: LLMClient, args: argparse.Namespace):

"""

Send a single message and exit

Args:

client: LLM client instance

args: Command-line arguments

"""

chat_params = {

'system_prompt': args.system_prompt,

'model_format': args.format,

'temperature': args.temperature,

'max_tokens': args.max_tokens,

}

if args.stream:

print_streaming_response(client, args.message, **chat_params)

else:

print_complete_response(client, args.message, **chat_params)

def health_check_mode(client: LLMClient):

"""

Check service health and display information

Args:

client: LLM client instance

"""

try:

health = client.health_check()

print(f"Status: {health['status']}")

print("\nModel Information:")

for key, value in health['model_info'].items():

print(f" {key}: {value}")

except LLMClientError as e:

print(f"Error: {e}", file=sys.stderr)

sys.exit(1)

def main():

"""Main entry point for the CLI"""

parser = argparse.ArgumentParser(

description='Command-line client for LLM chat microservice',

formatter_class=argparse.RawDescriptionHelpFormatter,

epilog="""

Examples:

# Interactive chat with streaming

%(prog)s -i -s

# Single message

%(prog)s -m "What is the capital of France?"

# Multi-turn conversation

%(prog)s -i --multi-turn --system "You are a helpful math tutor"

# Check service health

%(prog)s --health

# Custom service URL

%(prog)s -u http://llm-chat.example.com -m "Hello"

"""

)

parser.add_argument(

'-u', '--url',

default='http://localhost:8000',

help='Base URL of the LLM service (default: http://localhost:8000)'

)

parser.add_argument(

'-k', '--api-key',

help='API key for authentication'

)

parser.add_argument(

'-i', '--interactive',

action='store_true',

help='Run in interactive mode'

)

parser.add_argument(

'-m', '--message',

help='Single message to send (non-interactive mode)'

)

parser.add_argument(

'--health',

action='store_true',

help='Check service health and display model information'

)

parser.add_argument(

'-s', '--stream',

action='store_true',

help='Stream responses in real-time'

)

parser.add_argument(

'--multi-turn',

action='store_true',

help='Enable multi-turn conversation mode (maintains context)'

)

parser.add_argument(

'--system',

dest='system_prompt',

help='System prompt to control assistant behavior'

)

parser.add_argument(

'--format',

choices=['chatml', 'llama2', 'alpaca'],

default='chatml',

help='Prompt format (default: chatml)'

)

parser.add_argument(

'--temperature',

type=float,

default=0.7,

help='Sampling temperature (default: 0.7)'

)

parser.add_argument(

'--max-tokens',

type=int,

default=2048,

help='Maximum tokens to generate (default: 2048)'

)

parser.add_argument(

'--timeout',

type=int,

default=300,

help='Request timeout in seconds (default: 300)'

)

args = parser.parse_args()

# Validate arguments

if not args.interactive and not args.message and not args.health:

parser.error('Either --interactive, --message, or --health is required')

# Create client

with LLMClient(args.url, args.api_key, args.timeout) as client:

if args.health:

health_check_mode(client)

elif args.interactive:

interactive_mode(client, args)

else:

single_message_mode(client, args)

if __name__ == '__main__':

main()

This command-line interface provides multiple modes of operation. Interactive mode allows ongoing conversations with the LLM. Single message mode sends one message and exits, useful for scripting. Health check mode verifies the service is running and displays model information.

The CLI supports all the features of the underlying service: streaming responses, multi-turn conversations, custom system prompts, and generation parameters. It provides a user-friendly interface for testing and using the LLM service.

Web-Based Client Application

For a more user-friendly interface, create a simple web application:

# client/web_app.py

from flask import Flask, render_template, request, jsonify, Response, stream_with_context

import uuid

import json

from llm_client import LLMClient, LLMClientError

app = Flask(__name__)

# Configuration

LLM_SERVICE_URL = "http://localhost:8000"

client = LLMClient(LLM_SERVICE_URL)

@app.route('/')

def index():

"""Render the main chat interface"""

return render_template('index.html')

@app.route('/api/chat', methods=['POST'])

def chat():

"""Handle chat requests"""

data = request.json

message = data.get('message')

conversation_id = data.get('conversation_id')

stream = data.get('stream', False)

if not message:

return jsonify({'error': 'Message is required'}), 400

try:

if stream:

def generate():

try:

for chunk in client.chat(

message=message,

conversation_id=conversation_id,

stream=True

):

yield f"data: {json.dumps({'text': chunk})}\n\n"

yield "data: [DONE]\n\n"

except LLMClientError as e:

yield f"data: {json.dumps({'error': str(e)})}\n\n"

return Response(

stream_with_context(generate()),

mimetype='text/event-stream'

)

else:

response = client.chat(

message=message,

conversation_id=conversation_id,

stream=False

)

return jsonify(response)

except LLMClientError as e:

return jsonify({'error': str(e)}), 500

@app.route('/api/conversations', methods=['POST'])

def create_conversation():

"""Create a new conversation"""

data = request.json

conversation_id = data.get('conversation_id') or str(uuid.uuid4())

system_prompt = data.get('system_prompt')

try:

result = client.create_conversation(conversation_id, system_prompt)

return jsonify(result)

except LLMClientError as e:

return jsonify({'error': str(e)}), 500

@app.route('/api/conversations/<conversation_id>', methods=['GET'])

def get_conversation(conversation_id):

"""Get conversation history"""

try:

result = client.get_conversation(conversation_id)

return jsonify(result)

except LLMClientError as e:

return jsonify({'error': str(e)}), 404

@app.route('/api/conversations/<conversation_id>', methods=['DELETE'])

def delete_conversation(conversation_id):

"""Delete a conversation"""

try:

result = client.delete_conversation(conversation_id)

return jsonify(result)

except LLMClientError as e:

return jsonify({'error': str(e)}), 404

@app.route('/api/health', methods=['GET'])

def health():

"""Check service health"""

try:

result = client.health_check()

return jsonify(result)

except LLMClientError as e:

return jsonify({'error': str(e)}), 500

if __name__ == '__main__':

app.run(host='0.0.0.0', port=5000, debug=False)  # debug mode exposes an interactive debugger; keep it off when listening on all interfaces

Create the HTML template for the web interface and save it as client/templates/index.html, where Flask's render_template looks for it:

<!DOCTYPE html>

<html lang="en">

<head>

<title>LLM Chat Interface</title>

<style>

* {

margin: 0;

padding: 0;

box-sizing: border-box;

}

body {

font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, Oxygen, Ubuntu, Cantarell, sans-serif;

background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);

height: 100vh;

display: flex;

justify-content: center;

align-items: center;

padding: 20px;

}

.container {

background: white;

border-radius: 10px;

box-shadow: 0 10px 40px rgba(0,0,0,0.2);

width: 100%;

max-width: 800px;

height: 90vh;

display: flex;

flex-direction: column;

}

.header {

padding: 20px;

border-bottom: 1px solid #e0e0e0;

display: flex;

justify-content: space-between;

align-items: center;

}

.header h1 {

font-size: 24px;

color: #333;

}

.status {

display: flex;

align-items: center;

gap: 8px;

}

.status-indicator {

width: 10px;

height: 10px;

border-radius: 50%;

background: #4caf50;

}

.status-indicator.offline {

background: #f44336;

}

.chat-area {

flex: 1;

overflow-y: auto;

padding: 20px;

display: flex;

flex-direction: column;

gap: 16px;

}

.message {

display: flex;

gap: 12px;

max-width: 80%;

}

.message.user {

align-self: flex-end;

flex-direction: row-reverse;

}

.message-avatar {

width: 40px;

height: 40px;

border-radius: 50%;

display: flex;

align-items: center;

justify-content: center;

font-weight: bold;

color: white;

flex-shrink: 0;

}

.message.user .message-avatar {

background: #667eea;

}

.message.assistant .message-avatar {

background: #764ba2;

}

.message-content {

background: #f5f5f5;

padding: 12px 16px;

border-radius: 12px;

line-height: 1.5;

}

.message.user .message-content {

background: #667eea;

color: white;

}

.input-area {

padding: 20px;

border-top: 1px solid #e0e0e0;

}

.input-container {

display: flex;

gap: 12px;

}

#message-input {

flex: 1;

padding: 12px 16px;

border: 2px solid #e0e0e0;

border-radius: 24px;

font-size: 14px;

outline: none;

transition: border-color 0.3s;

}

#message-input:focus {

border-color: #667eea;

}

#send-button {

padding: 12px 24px;

background: #667eea;

color: white;

border: none;

border-radius: 24px;

font-size: 14px;

font-weight: 600;

cursor: pointer;

transition: background 0.3s;

}

#send-button:hover {

background: #5568d3;

}

#send-button:disabled {

background: #ccc;

cursor: not-allowed;

}

.controls {

display: flex;

gap: 12px;

margin-bottom: 12px;

flex-wrap: wrap;

}

.control-button {

padding: 8px 16px;

background: #f5f5f5;

border: 1px solid #e0e0e0;

border-radius: 16px;

font-size: 12px;

cursor: pointer;

transition: all 0.3s;

}

.control-button:hover {

background: #e0e0e0;

}

.control-button.active {

background: #667eea;

color: white;

border-color: #667eea;

}

.loading {

display: flex;

gap: 4px;

padding: 12px 16px;

}

.loading-dot {

width: 8px;

height: 8px;

border-radius: 50%;

background: #999;

animation: loading 1.4s infinite ease-in-out both;

}

.loading-dot:nth-child(1) {

animation-delay: -0.32s;

}

.loading-dot:nth-child(2) {

animation-delay: -0.16s;

}

@keyframes loading {

0%, 80%, 100% {

transform: scale(0);

}

40% {

transform: scale(1);

}

}

</style>

</head>

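<body>

<div class="container">

<div class="header">

<h1>LLM Chat Interface</h1>

<div class="status">

<div class="status-indicator" id="status-indicator"></div>

<span id="status-text">Checking...</span>

</div>

</div>

<div class="chat-area" id="chat-area">

<div class="message assistant">

<div class="message-avatar">AI</div>

<div class="message-content">

Hello! I'm your AI assistant. How can I help you today?

</div>

</div>

</div>

<div class="input-area">

<div class="controls">

<button class="control-button active" id="stream-toggle">

Streaming: ON

</button>

<button class="control-button" id="multi-turn-toggle">

Multi-turn: OFF

</button>

<button class="control-button" id="clear-button">

Clear Chat

</button>

</div>

<div class="input-container">

<input

type="text"

id="message-input"

placeholder="Type your message..."

autocomplete="off"

>

<button id="send-button">Send</button>

</div>

</div>

</div>

<script>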

// State management

let isStreaming = true;

let isMultiTurn = false;

let conversationId = null;

let isProcessing = false;

// DOM elements

const chatArea = document.getElementById('chat-area');

const messageInput = document.getElementById('message-input');

const sendButton = document.getElementById('send-button');

const streamToggle = document.getElementById('stream-toggle');

const multiTurnToggle = document.getElementById('multi-turn-toggle');

const clearButton = document.getElementById('clear-button');

const statusIndicator = document.getElementById('status-indicator');

const statusText = document.getElementById('status-text');

// Check service health on load

checkHealth();

// Event listeners

sendButton.addEventListener('click', sendMessage);

messageInput.addEventListener('keypress', (e) => {

if (e.key === 'Enter' && !e.shiftKey) {

e.preventDefault();

sendMessage();

}

});

streamToggle.addEventListener('click', () => {

isStreaming = !isStreaming;

streamToggle.textContent = `Streaming: ${isStreaming ? 'ON' : 'OFF'}`;

streamToggle.classList.toggle('active');

});

multiTurnToggle.addEventListener('click', async () => {

isMultiTurn = !isMultiTurn;

multiTurnToggle.textContent = `Multi-turn: ${isMultiTurn ? 'ON' : 'OFF'}`;

multiTurnToggle.classList.toggle('active');

if (isMultiTurn && !conversationId) {

await createConversation();

} else if (!isMultiTurn && conversationId) {

await deleteConversation();

}

});

clearButton.addEventListener('click', clearChat);

// Functions

async function checkHealth() {

try {

const response = await fetch('/api/health');

const data = await response.json();

if (data.status === 'healthy') {

statusIndicator.classList.remove('offline');

statusText.textContent = 'Online';

} else {

statusIndicator.classList.add('offline');

statusText.textContent = 'Offline';

}

} catch (error) {

statusIndicator.classList.add('offline');

statusText.textContent = 'Offline';

}

}

async function createConversation() {

try {

const response = await fetch('/api/conversations', {

method: 'POST',

headers: {'Content-Type': 'application/json'},

body: JSON.stringify({})

});

const data = await response.json();

conversationId = data.conversation_id;

} catch (error) {

console.error('Failed to create conversation:', error);

}

}

async function deleteConversation() {

if (!conversationId) return;

try {

await fetch(`/api/conversations/${conversationId}`, {

method: 'DELETE'

});

conversationId = null;

} catch (error) {

console.error('Failed to delete conversation:', error);

}

}

function addMessage(role, content) {

const messageDiv = document.createElement('div');

messageDiv.className = `message ${role}`;

const avatar = document.createElement('div');

avatar.className = 'message-avatar';

avatar.textContent = role === 'user' ? 'You' : 'AI';

const contentDiv = document.createElement('div');

contentDiv.className = 'message-content';

contentDiv.textContent = content;

messageDiv.appendChild(avatar);

messageDiv.appendChild(contentDiv);

chatArea.appendChild(messageDiv);

chatArea.scrollTop = chatArea.scrollHeight;

return contentDiv;

}

function addLoadingIndicator() {

const messageDiv = document.createElement('div');

messageDiv.className = 'message assistant';

messageDiv.id = 'loading-message';

const avatar = document.createElement('div');

avatar.className = 'message-avatar';

avatar.textContent = 'AI';

const loadingDiv = document.createElement('div');

loadingDiv.className = 'loading';

loadingDiv.innerHTML = '<div class="loading-dot"></div><div class="loading-dot"></div><div class="loading-dot"></div>';

messageDiv.appendChild(avatar);

messageDiv.appendChild(loadingDiv);

chatArea.appendChild(messageDiv);

chatArea.scrollTop = chatArea.scrollHeight;

return messageDiv;

}

function removeLoadingIndicator() {

const loadingMessage = document.getElementById('loading-message');

if (loadingMessage) {

loadingMessage.remove();

}

}

async function sendMessage() {

if (isProcessing) return;

const message = messageInput.value.trim();

if (!message) return;

isProcessing = true;

sendButton.disabled = true;

messageInput.value = '';

// Add user message

addMessage('user', message);

try {

if (isStreaming) {

await handleStreamingResponse(message);

} else {

await handleCompleteResponse(message);

}

} catch (error) {

console.error('Error sending message:', error);

addMessage('assistant', 'Sorry, an error occurred. Please try again.');

} finally {

isProcessing = false;

sendButton.disabled = false;

messageInput.focus();

}

}

async function handleStreamingResponse(message) {

const response = await fetch('/api/chat', {

method: 'POST',

headers: {'Content-Type': 'application/json'},

body: JSON.stringify({

message: message,

conversation_id: conversationId,

stream: true

})

});

const reader = response.body.getReader();

const decoder = new TextDecoder();

let contentDiv = null;

let fullText = '';

let buffer = '';

let finished = false;

while (!finished) {

const {done, value} = await reader.read();

if (done) break;

// Keep any partial line in the buffer until the rest of it arrives

buffer += decoder.decode(value, {stream: true});

const lines = buffer.split('\n');

buffer = lines.pop();

for (const line of lines) {

if (!line.startsWith('data: ')) continue;

const data = line.slice(6);

if (data === '[DONE]') {

finished = true;

break;

}

try {

const json = JSON.parse(data);

if (json.text) {

if (!contentDiv) {

contentDiv = addMessage('assistant', '');

}

fullText += json.text;

contentDiv.textContent = fullText;

chatArea.scrollTop = chatArea.scrollHeight;

} else if (json.error) {

addMessage('assistant', `Error: ${json.error}`);

}

} catch (e) {

// Ignore parse errors

}

}

}

}

async function handleCompleteResponse(message) {

const loadingIndicator = addLoadingIndicator();

const response = await fetch('/api/chat', {

method: 'POST',

headers: {'Content-Type': 'application/json'},

body: JSON.stringify({

message: message,

conversation_id: conversationId,

stream: false

})

});

removeLoadingIndicator();

const data = await response.json();

if (data.response) {

addMessage('assistant', data.response);

} else if (data.error) {

addMessage('assistant', `Error: ${data.error}`);

}

}

async function clearChat() {

if (conversationId) {

await deleteConversation();

}

if (isMultiTurn) {

await createConversation();

}

chatArea.innerHTML = '';

addMessage('assistant', 'Chat cleared. How can I help you?');

}

</script>

</body>

</html>

This web application provides a polished chat interface with real-time streaming, multi-turn conversations, and visual feedback. Users can toggle streaming mode and multi-turn mode, clear the chat, and see the service status.
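One detail matters for any client that reads a streamed response in raw chunks, as the browser does through response.body.getReader(): a single event can be split across two chunks, so incomplete lines must be held back until the rest arrives. The sketch below shows the buffering idea in Python, the language of the other examples in this guide. The names SSEBuffer and collect_text are hypothetical, and the event format is assumed to match the "data: " lines produced by the web application.

```python
import json
from typing import Iterable, List


class SSEBuffer:
    """Accumulate raw byte chunks and emit only complete "data:" payloads."""

    def __init__(self) -> None:
        self._pending = b""

    def feed(self, chunk: bytes) -> List[str]:
        """Add a chunk and return the payloads of all lines completed so far."""
        self._pending += chunk
        # The last element is an incomplete line (possibly empty); keep it for later
        *complete, self._pending = self._pending.split(b"\n")
        payloads = []
        for raw in complete:
            line = raw.decode("utf-8").rstrip("\r")
            if line.startswith("data: "):
                payloads.append(line[len("data: "):])
        return payloads


def collect_text(chunks: Iterable[bytes]) -> str:
    """Concatenate the "text" fields from a chunked event stream."""
    buffer = SSEBuffer()
    text = []
    for chunk in chunks:
        for payload in buffer.feed(chunk):
            if payload == "[DONE]":
                return "".join(text)
            text.append(json.loads(payload).get("text", ""))
    return "".join(text)


# The second event is deliberately split across two chunks
chunks = [
    b'data: {"text": " In"}\n\ndata: {"te',
    b'xt": " circuits"}\n\n',
    b"data: [DONE]\n\n",
]
print(collect_text(chunks))
```

Splitting on the newline byte before decoding is safe for UTF-8 because multi-byte characters never contain that byte.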

To run the web application, install Flask along with the requests package that the client library imports:

pip install flask requests

Then start the server:

python client/web_app.py

Access the interface at http://localhost:5000 in your web browser.

PRODUCTION-READY COMPLETE EXAMPLE

Now let us provide the complete, production-ready code for the entire system. This includes all files needed to deploy and run the service.

Complete Project Structure:

llm-chat-microservice/
├── app/
│   ├── __init__.py
│   ├── main.py
│   ├── config.py
│   ├── models/
│   │   ├── __init__.py
│   │   └── llm.py
│   ├── services/
│   │   ├── __init__.py
│   │   └── chat_service.py
│   └── routes/
│       ├── __init__.py
│       └── chat.py
├── client/
│   ├── llm_client.py
│   ├── cli.py
│   ├── web_app.py
│   └── templates/
│       └── index.html
├── k8s/
│   ├── configmap.yaml
│   ├── pvc.yaml
│   ├── deployment.yaml
│   ├── service.yaml
│   ├── ingress.yaml
│   └── hpa.yaml
├── Dockerfile
├── docker-compose.yml
├── requirements.txt
├── .env.example
└── README.md

Complete app/__init__.py:

# app/__init__.py

"""

LLM Chat Microservice

A production-ready microservice for serving local LLM models with

multi-GPU support across NVIDIA CUDA, AMD ROCm, Apple Metal, and Intel SYCL.

"""

__version__ = "1.0.0"

Complete app/models/__init__.py:

# app/models/__init__.py

from app.models.llm import LLMInference

__all__ = ['LLMInference']

Complete app/services/__init__.py:

# app/services/__init__.py

from app.services.chat_service import ChatService, Conversation, Message

__all__ = ['ChatService', 'Conversation', 'Message']

Complete app/routes/__init__.py:

# app/routes/__init__.py

from app.routes import chat

__all__ = ['chat']

Complete .env.example:

# .env.example

# Copy this file to .env and configure for your environment

# Model configuration

MODEL_PATH=/app/models/model.gguf

MODEL_NAME=local-llm

CONTEXT_LENGTH=4096

MAX_TOKENS=2048

TEMPERATURE=0.7

TOP_P=0.9

TOP_K=40

REPEAT_PENALTY=1.1

# GPU configuration

# Options: cuda, rocm, metal, sycl, none

GPU_TYPE=cuda

N_GPU_LAYERS=99

MAIN_GPU=0

# For multi-GPU, specify tensor split as comma-separated values

# TENSOR_SPLIT=0.6,0.4

# Server configuration

HOST=0.0.0.0

PORT=8000

WORKERS=1

LOG_LEVEL=info

# Performance configuration

N_THREADS=4

N_BATCH=512

# API configuration

# API_KEY=your-secret-key-here

ENABLE_STREAMING=true

MAX_CONCURRENT_REQUESTS=10
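The GPU variables above are plain strings in the environment, so the application has to convert them before use. The following is a minimal sketch of that conversion, assuming TENSOR_SPLIT is a comma-separated list of fractions as described in the comments of .env.example. The names GPUSettings, parse_tensor_split, and load_gpu_settings are hypothetical; the guide's own app/config.py may parse these values differently.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping, Optional


@dataclass
class GPUSettings:
    gpu_type: str
    n_gpu_layers: int
    main_gpu: int
    tensor_split: Optional[List[float]]


def parse_tensor_split(raw: Optional[str]) -> Optional[List[float]]:
    """Turn "0.6,0.4" into [0.6, 0.4]; return None when unset or empty."""
    if not raw or not raw.strip():
        return None
    return [float(part) for part in raw.split(",") if part.strip()]


def load_gpu_settings(env: Mapping[str, str] = os.environ) -> GPUSettings:
    """Read GPU-related variables, falling back to the defaults in .env.example."""
    return GPUSettings(
        gpu_type=env.get("GPU_TYPE", "cuda").lower(),
        n_gpu_layers=int(env.get("N_GPU_LAYERS", "99")),
        main_gpu=int(env.get("MAIN_GPU", "0")),
        tensor_split=parse_tensor_split(env.get("TENSOR_SPLIT")),
    )


# Pass a plain dict instead of os.environ to try the parser in isolation
settings = load_gpu_settings({"GPU_TYPE": "CUDA", "TENSOR_SPLIT": "0.6,0.4"})
print(settings)
```

Accepting a mapping instead of reading os.environ directly makes the parsing easy to exercise in tests.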

Complete README.md:

# LLM Chat Microservice

A production-ready microservice for serving local Large Language Models with comprehensive multi-GPU architecture support including NVIDIA CUDA, AMD ROCm, Apple Metal, and Intel SYCL.

## Features

- Local LLM inference with no external dependencies

- Multi-GPU architecture support (NVIDIA, AMD, Apple, Intel)

- RESTful API with automatic documentation

- Streaming and non-streaming response modes

- Multi-turn conversation management

- Docker containerization with multi-stage builds

- Kubernetes deployment with auto-scaling

- Production-ready error handling and logging

- Comprehensive client libraries and CLI tools

## Quick Start

### Prerequisites

- Python 3.10 or later

- Docker (for containerized deployment)

- Kubernetes cluster (for orchestrated deployment)

- GPU drivers for your hardware (CUDA, ROCm, Metal, or SYCL)

- LLM model file in GGUF format

### Local Development

1. Clone the repository and navigate to the project directory

2. Create a virtual environment:

python -m venv venv

source venv/bin/activate # On Windows: venv\Scripts\activate

3. Install dependencies:

pip install -r requirements.txt

4. Download an LLM model in GGUF format and note its path

5. Configure environment variables:

cp .env.example .env

# Edit .env with your settings

6. Run the service:

python -m app.main

7. Access the API documentation at http://localhost:8000/docs

### Docker Deployment

Build the image with GPU support:

# For NVIDIA CUDA

docker build --build-arg GPU_TYPE=cuda -t llm-chat-service:cuda .

# For AMD ROCm

docker build --build-arg GPU_TYPE=rocm -t llm-chat-service:rocm .

# For Apple Metal

docker build --build-arg GPU_TYPE=metal -t llm-chat-service:metal .

# For Intel SYCL

docker build --build-arg GPU_TYPE=sycl -t llm-chat-service:sycl .

# CPU only

docker build --build-arg GPU_TYPE=none -t llm-chat-service:cpu .

Run the container:

docker run -d \

--name llm-chat \

--gpus all \

-p 8000:8000 \

-v /path/to/models:/app/models \

-e MODEL_PATH=/app/models/your-model.gguf \

-e GPU_TYPE=cuda \

-e N_GPU_LAYERS=99 \

llm-chat-service:cuda

### Docker Compose Deployment

Configure your environment:

cp .env.example .env

# Edit .env with your settings

Start the service:

docker-compose up -d

### Kubernetes Deployment

1. Ensure GPU device plugins are installed on your cluster

2. Create the model PersistentVolume and populate it with your model files

3. Deploy the service:

kubectl apply -f k8s/

4. Check deployment status:

kubectl get pods -l app=llm-chat

5. Access the service:

kubectl port-forward service/llm-chat-service 8000:80

## API Usage

### Health Check

GET /api/v1/health

Returns service status and model information.
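A deployment script often has to wait for the model to finish loading before it sends traffic. The following is a minimal sketch of such a wait loop. It assumes only that the health endpoint answers with HTTP 200 once the service is ready; the response fields are not specified here, so the code does not inspect them. The names `http_probe` and `wait_until_healthy` are illustrative and not part of the project.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if GET /api/v1/health answers with HTTP 200."""
    url = base_url.rstrip("/") + "/api/v1/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status.
        return False


def wait_until_healthy(
    probe: Callable[[], bool],
    attempts: int = 30,
    delay_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe() until it succeeds or the attempts run out."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False
```

A typical call would be `wait_until_healthy(lambda: http_probe("http://localhost:8000"))`. Passing the probe and the sleep function as arguments keeps the retry logic testable without a running service.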

### Single Message Chat

POST /api/v1/chat

Content-Type: application/json

{
  "message": "What is the capital of France?",
  "system_prompt": "You are a helpful assistant.",
  "model_format": "chatml",
  "stream": false,
  "temperature": 0.7,
  "max_tokens": 2048
}
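For callers that do not use the bundled client, the request above can be assembled with the Python standard library alone. This is a sketch under the assumption that unset optional fields may simply be omitted so that the service applies its own defaults; `build_chat_request` and `send` are hypothetical helper names.

```python
import json
import urllib.request
from typing import Any, Optional


def build_chat_request(
    base_url: str,
    message: str,
    *,
    system_prompt: Optional[str] = None,
    stream: bool = False,
    temperature: Optional[float] = None,
    max_tokens: Optional[int] = None,
    conversation_id: Optional[str] = None,
) -> urllib.request.Request:
    """Build (but do not send) a POST /api/v1/chat request."""
    payload: dict[str, Any] = {"message": message, "stream": stream}
    optional = {
        "system_prompt": system_prompt,
        "temperature": temperature,
        "max_tokens": max_tokens,
        "conversation_id": conversation_id,
    }
    # Leave out unset fields instead of sending explicit nulls.
    payload.update({k: v for k, v in optional.items() if v is not None})
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/v1/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 120.0) -> Any:
    """Send a prepared request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Separating construction from sending makes the payload easy to inspect or log before any network traffic occurs. The generous default timeout reflects that generation on large models can take a while.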

### Streaming Chat

POST /api/v1/chat

Content-Type: application/json

{
  "message": "Write a poem about AI.",
  "stream": true
}

Response is sent as server-sent events.
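A client that reads the stream directly needs to split it into events. The sketch below implements the basic `text/event-stream` framing: `data:` lines accumulate and a blank line ends the event. What each `data:` payload contains (plain text or JSON) is not specified here, so the function returns the payloads as raw strings; `iter_sse_data` is an illustrative name.

```python
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each server-sent event.

    Other fields (event:, id:, retry:) and ":" comment lines,
    which servers often use as keep-alives, are ignored.
    """
    buffer: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
            continue
        if line.startswith(":"):
            continue
        field, _, value = line.partition(":")
        if field == "data":
            # One optional space after the colon is not part of the value.
            buffer.append(value[1:] if value.startswith(" ") else value)
    if buffer:
        # Lenient choice: flush a last event that lacks a trailing blank line.
        yield "\n".join(buffer)
```

With `urllib`, the response object can be iterated line by line, decoding each line from UTF-8 before passing it to the parser.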

### Multi-Turn Conversation

Create a conversation:

POST /api/v1/conversations

Content-Type: application/json

{
  "conversation_id": "my-conversation",
  "system_prompt": "You are a helpful math tutor."
}

Send messages:

POST /api/v1/chat

Content-Type: application/json

{
  "message": "What is 15 times 23?",
  "conversation_id": "my-conversation"
}

Get conversation history:

GET /api/v1/conversations/my-conversation

Delete conversation:

DELETE /api/v1/conversations/my-conversation
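The four calls above form one lifecycle: create, chat, inspect, delete. The sketch below lists that sequence as plain data so it can be replayed by any HTTP client. One detail it illustrates is that the conversation ID appears both in JSON bodies (as is) and in URL paths (percent-encoded). `ApiCall` and `conversation_lifecycle` are hypothetical names, and whether the service accepts IDs that need encoding is an assumption; URL-safe IDs such as `my-conversation` avoid the question.

```python
from typing import Any, NamedTuple, Optional
from urllib.parse import quote


class ApiCall(NamedTuple):
    method: str
    path: str
    body: Optional[dict[str, Any]]


def conversation_lifecycle(
    conversation_id: str, system_prompt: str, messages: list[str]
) -> list[ApiCall]:
    """Return the ordered API calls for one complete conversation."""
    encoded_id = quote(conversation_id, safe="")
    calls = [
        ApiCall(
            "POST",
            "/api/v1/conversations",
            {"conversation_id": conversation_id, "system_prompt": system_prompt},
        )
    ]
    for message in messages:
        calls.append(
            ApiCall(
                "POST",
                "/api/v1/chat",
                {"message": message, "conversation_id": conversation_id},
            )
        )
    calls.append(ApiCall("GET", f"/api/v1/conversations/{encoded_id}", None))
    calls.append(ApiCall("DELETE", f"/api/v1/conversations/{encoded_id}", None))
    return calls
```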

## Client Usage

### Python Client Library

from client.llm_client import LLMClient

with LLMClient("http://localhost:8000") as client:
    # Single message
    response = client.chat("Hello, how are you?")
    print(response['response'])

    # Streaming
    for chunk in client.chat("Tell me a story", stream=True):
        print(chunk, end='', flush=True)

    # Multi-turn conversation
    client.create_conversation("conv-123", "You are a helpful assistant")
    client.chat("What is Python?", conversation_id="conv-123")
    client.chat("How do I install it?", conversation_id="conv-123")

### Command-Line Interface

# Interactive mode with streaming

python client/cli.py -i -s

# Single message

python client/cli.py -m "What is the capital of France?"

# Multi-turn conversation

python client/cli.py -i --multi-turn --system "You are a helpful tutor"

# Health check

python client/cli.py --health

# Custom service URL

python client/cli.py -u http://llm-chat.example.com -m "Hello"

### Web Interface

Start the web application:

python client/web_app.py

Access the interface at http://localhost:5000

## Configuration

All configuration is done through environment variables. See .env.example for all available options.

Key configuration parameters:

- MODEL_PATH: Path to the GGUF model file

- GPU_TYPE: GPU acceleration type (cuda, rocm, metal, sycl, none)

- N_GPU_LAYERS: Number of model layers to offload to GPU (any value at or above the model's layer count, such as 99, offloads all layers)

- CONTEXT_LENGTH: Maximum context window size

- MAX_TOKENS: Maximum tokens to generate per response

- TEMPERATURE: Sampling temperature (higher = more creative)
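To make the role of these variables concrete, here is a minimal sketch of loading and validating them with the standard library. It is not the project's `config.py`, which may work differently. The `Settings` class, the `load_settings` function, and every default value below are assumptions chosen for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping

VALID_GPU_TYPES = {"cuda", "rocm", "metal", "sycl", "none"}


@dataclass(frozen=True)
class Settings:
    model_path: str
    gpu_type: str
    n_gpu_layers: int
    context_length: int
    max_tokens: int
    temperature: float


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read and validate settings; the defaults are illustrative only."""
    model_path = env.get("MODEL_PATH", "")
    if not model_path:
        raise ValueError("MODEL_PATH must be set")
    gpu_type = env.get("GPU_TYPE", "none").lower()
    if gpu_type not in VALID_GPU_TYPES:
        raise ValueError(f"GPU_TYPE must be one of {sorted(VALID_GPU_TYPES)}")
    return Settings(
        model_path=model_path,
        gpu_type=gpu_type,
        n_gpu_layers=int(env.get("N_GPU_LAYERS", "0")),
        context_length=int(env.get("CONTEXT_LENGTH", "4096")),
        max_tokens=int(env.get("MAX_TOKENS", "2048")),
        temperature=float(env.get("TEMPERATURE", "0.7")),
    )
```

Failing at startup on a missing `MODEL_PATH` or an unknown `GPU_TYPE` surfaces configuration mistakes before the service begins the slow step of loading a model.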

## GPU Configuration

### NVIDIA CUDA

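Set GPU_TYPE=cuda and ensure the CUDA toolkit is installed. When running in Docker, the host also needs the NVIDIA Container Toolkit for the --gpus flag to work. The service automatically detects and uses available NVIDIA GPUs.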

For multi-GPU setups, use TENSOR_SPLIT to distribute the model:

TENSOR_SPLIT=0.6,0.4
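The value is a comma-separated list with one entry per GPU. The sketch below shows one way to turn that string into a list of fractions before handing it to the inference backend. It assumes the entries are relative proportions in device order, so `3,1` and `0.75,0.25` mean the same thing; `parse_tensor_split` is an illustrative name.

```python
def parse_tensor_split(raw: str) -> list[float]:
    """Parse a string such as "0.6,0.4" into normalized per-GPU fractions."""
    parts = [p.strip() for p in raw.split(",") if p.strip()]
    if not parts:
        raise ValueError("TENSOR_SPLIT is empty")
    values = [float(p) for p in parts]
    if any(v < 0 for v in values) or sum(values) <= 0:
        raise ValueError("TENSOR_SPLIT values must be non-negative and not all zero")
    total = sum(values)
    # Normalize so the fractions sum to 1 regardless of the input scale.
    return [v / total for v in values]
```

With two GPUs of unequal memory, giving the larger card the larger share (as in the 0.6 and 0.4 example) helps both cards fill at a similar rate.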

### AMD ROCm

Set GPU_TYPE=rocm and ensure ROCm is installed. The service uses HIP to interface with AMD GPUs.

### Apple Metal

Set GPU_TYPE=metal on Apple Silicon Macs. Metal support is automatic and uses the unified memory architecture for efficient inference.

### Intel GPUs

Set GPU_TYPE=sycl and ensure Intel oneAPI or OpenCL runtime is installed. Supports both integrated and discrete Intel GPUs.

## Performance Tuning

- Increase N_GPU_LAYERS to offload more computation to GPU

- Adjust N_BATCH for optimal throughput (higher values use more memory but speed up prompt processing)

- Set N_THREADS based on your CPU core count

- Use quantized models (such as Q4_K_M or Q5_K_M) to reduce memory use and speed up inference, at a small cost in output quality

- Enable streaming for better perceived responsiveness

## Monitoring

View logs:

docker logs -f llm-chat

kubectl logs -f -l app=llm-chat

Check resource usage:

docker stats llm-chat

kubectl top pods -l app=llm-chat

Monitor GPU usage:

nvidia-smi # NVIDIA

rocm-smi # AMD
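For NVIDIA hardware, the same information can be collected programmatically through the CSV query mode of `nvidia-smi`. The following sketch parses that output into structured records; `GpuStat`, `parse_nvidia_smi_csv`, and `query_gpus` are illustrative names, and an equivalent for `rocm-smi` is not covered here.

```python
import subprocess
from typing import NamedTuple

QUERY_FIELDS = "index,utilization.gpu,memory.used,memory.total"


class GpuStat(NamedTuple):
    index: int
    utilization_pct: int
    memory_used_mib: int
    memory_total_mib: int


def parse_nvidia_smi_csv(output: str) -> list[GpuStat]:
    """Parse output produced with --format=csv,noheader,nounits."""
    stats: list[GpuStat] = []
    for line in output.strip().splitlines():
        fields = [f.strip() for f in line.split(",")]
        if len(fields) != 4:
            continue
        try:
            stats.append(GpuStat(*(int(f) for f in fields)))
        except ValueError:
            # Skip rows with non-numeric placeholders such as "[N/A]".
            continue
    return stats


def query_gpus() -> list[GpuStat]:
    """Run nvidia-smi and return one record per visible GPU."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_nvidia_smi_csv(result.stdout)
```

Calling `query_gpus()` periodically and logging the records gives a simple utilization history without additional tooling.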

## Troubleshooting

### Model fails to load

- Verify MODEL_PATH points to a valid GGUF file

- Ensure sufficient memory (RAM or VRAM) for the model

- Check file permissions on the model file

### GPU not detected

- Verify GPU drivers are installed correctly

- Check GPU_TYPE matches your hardware

- Ensure Docker has GPU access (the --gpus flag for NVIDIA; for AMD ROCm, pass --device=/dev/kfd --device=/dev/dri instead)

- Verify Kubernetes GPU device plugin is running

### Slow inference

- Increase N_GPU_LAYERS to use more GPU

- Use a smaller or more quantized model

- Reduce CONTEXT_LENGTH if not needed

- Check GPU utilization with nvidia-smi or rocm-smi

### Out of memory errors

- Use a smaller model or a more aggressive quantization (fewer bits per weight, for example Q4_K_M instead of Q8_0)

- Reduce CONTEXT_LENGTH

- Reduce N_BATCH

- Offload fewer layers to GPU (reduce N_GPU_LAYERS)
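A rough estimate of memory demand helps decide which of these knobs to turn first. The sketch below uses two standard approximations: weight memory is the parameter count times the bits per weight, and the key-value cache grows linearly with context length. All inputs are values the reader must look up for a specific model. The function names are illustrative, and real usage will be somewhat higher because of compute buffers and runtime overhead.

```python
def weights_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights at a given quantization."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(
    n_layers: int,
    context_length: int,
    n_kv_heads: int,
    head_dim: int,
    bytes_per_element: int = 2,  # 2 bytes assumes a 16-bit cache
) -> int:
    """Approximate memory for the key-value cache of one sequence."""
    # The leading 2 accounts for one key and one value tensor per layer.
    return 2 * n_layers * context_length * n_kv_heads * head_dim * bytes_per_element


def gib(n_bytes: float) -> float:
    """Convert bytes to GiB."""
    return n_bytes / 1024**3
```

For example, a model with 32 layers and 32 key-value heads of dimension 128 needs 2 GiB of 16-bit cache at a context length of 4096. Halving CONTEXT_LENGTH halves that figure, which is why reducing it is an effective remedy.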

## Architecture

The service follows clean architecture principles with clear separation of concerns:

- API Layer (routes/): HTTP endpoints and request/response handling

- Service Layer (services/): Business logic and conversation management

- Inference Layer (models/): LLM loading and inference

- Configuration Layer (config.py): Settings and environment management

This architecture enables:

- Easy testing of individual components

- Flexibility to swap implementations

- Clear dependency flow

- Maintainable and extensible code
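The benefits listed above come from having the service layer depend on an interface and not on a concrete model. The following sketch shows the idea in miniature. It is not the project's code: `InferenceBackend`, `ChatService`, and `EchoBackend` are hypothetical names, and the prompt format is a deliberately naive placeholder.

```python
from typing import Protocol


class InferenceBackend(Protocol):
    """What the service layer requires from the inference layer."""

    def generate(self, prompt: str, max_tokens: int) -> str: ...


class ChatService:
    """Service layer: keeps per-conversation history, delegates generation."""

    def __init__(self, backend: InferenceBackend, max_tokens: int = 2048) -> None:
        self._backend = backend
        self._max_tokens = max_tokens
        self._history: dict[str, list[tuple[str, str]]] = {}

    def chat(self, conversation_id: str, message: str) -> str:
        turns = self._history.setdefault(conversation_id, [])
        turns.append(("user", message))
        # Placeholder prompt format; a real service applies a chat template.
        prompt = "\n".join(f"{role}: {text}" for role, text in turns)
        reply = self._backend.generate(prompt, self._max_tokens)
        turns.append(("assistant", reply))
        return reply


class EchoBackend:
    """Stand-in backend for tests; needs no model file or GPU."""

    def generate(self, prompt: str, max_tokens: int) -> str:
        return f"echo({len(prompt.splitlines())} turns)"
```

Because `ChatService` only sees the `generate` method, conversation handling can be unit-tested with `EchoBackend` in milliseconds, and a GPU-backed implementation can be swapped in without changing the service layer.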

## License

This project is provided as-is for educational and commercial use.

## Support

For issues and questions, please refer to the documentation or create an issue in the project repository.

This completes the comprehensive guide to building, deploying, and using an LLM chat microservice with full multi-GPU architecture support. The system is production-ready with proper error handling, logging, monitoring, and client tools. It supports deployment in Docker containers and Kubernetes clusters, with automatic scaling and self-healing capabilities. The included client applications demonstrate how to integrate the service into various types of applications, from command-line tools to web interfaces.

Saturday, June 27, 2026
