Saturday, June 27, 2026

AN LLM CHAT MICROSERVICE: A GUIDE FOR DOCKER AND KUBERNETES DEPLOYMENT WITH MULTI-GPU ARCHITECTURE SUPPORT




INTRODUCTION: WHY BUILD AN LLM CHAT MICROSERVICE?


Large Language Models (LLMs) have become core building blocks for modern applications. However, deploying them in a production environment presents distinct challenges. This article guides you through creating a robust, scalable LLM chat service that runs as a microservice in containerized environments.


The microservice architecture approach offers several compelling advantages. First, it provides isolation, meaning your LLM service runs independently from other application components. If the LLM service crashes or needs updates, your main application continues functioning. Second, it enables horizontal scaling, allowing you to run multiple instances of your LLM service to handle increased load. Third, it facilitates resource management, which is particularly important for GPU-intensive LLM operations where you need precise control over hardware allocation.


Using local LLMs instead of cloud-based APIs offers significant benefits. You maintain complete data privacy since no information leaves your infrastructure. You eliminate the per-token fees of commercial APIs, which can make high-volume applications more economical once hardware costs are accounted for. You gain full control over model selection, fine-tuning, and updates. Most importantly, you avoid dependency on external services and their potential downtime or rate limits.



UNDERSTANDING THE FUNDAMENTAL COMPONENTS


Before diving into implementation, let us explore the essential building blocks of our system. Understanding these components deeply will help you make informed decisions throughout the development process.



What Are Large Language Models?


Large Language Models are neural networks trained on vast amounts of text data to understand and generate human-like text. Unlike traditional software that follows explicit rules, LLMs learn patterns from data and can perform tasks they were not explicitly programmed for. When you send a prompt to an LLM, it processes the text through billions of parameters to generate a contextually appropriate response.


The models we will use are quantized versions, meaning they have been compressed from their original size while maintaining most of their capabilities. Quantization reduces the precision of the numerical weights, typically from 16-bit or 32-bit floating point to integers of around 4 bits. A 7 billion parameter model that needs about 28GB at 32-bit precision needs roughly 4GB at 4 bits. The GGUF format, developed by the llama.cpp project, has become the standard for these quantized models because it supports efficient loading and inference across different hardware platforms.
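The size arithmetic behind quantization can be sketched in a few lines. This is a rough estimate that counts only the weights; the function name and the 7 billion parameter count are illustrative assumptions, and real GGUF files add some overhead for metadata and mixed-precision tensors.

```python
def weights_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in gigabytes (10**9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    n_params = 7e9  # hypothetical 7-billion-parameter model
    for bits in (32, 16, 8, 4):
        # Each halving of the bit width halves the memory needed for weights.
        print(f"{bits:>2}-bit weights: {weights_size_gb(n_params, bits):5.1f} GB")
```

The inference engine also needs memory for the context (the KV cache) and scratch buffers, so the real footprint is somewhat larger than this figure.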



What Is Docker and Why Use It?


Docker is a containerization platform that packages your application along with all its dependencies into a standardized unit called a container. Think of a container as a lightweight, isolated environment that contains everything needed to run your application: the code, runtime, system tools, libraries, and settings.


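The key advantage of Docker for LLM services is consistency. The phrase "it works on my machine" loses most of its force because the container packages the same code, libraries, and settings everywhere. GPU access is the main exception. Each GPU backend (NVIDIA CUDA, AMD ROCm, Intel) needs an image built for that backend, plus the matching driver and container runtime support on the host. Docker on macOS generally runs containers inside a Linux virtual machine that does not expose the Metal GPU, so Metal acceleration applies when the service runs natively on the Mac rather than in a container.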


Docker containers differ from virtual machines in crucial ways. Virtual machines include an entire operating system, making them heavy and slow to start. Containers share the host operating system kernel, making them lightweight and fast. A container can start in seconds and uses only megabytes of memory for the container overhead itself, though our LLM will require substantial memory for the model.



What Is Kubernetes and Why Use It?


Kubernetes is an orchestration platform that manages containerized applications across a cluster of machines. While Docker handles individual containers, Kubernetes manages fleets of containers, deciding where to run them, how to scale them, and what to do when they fail.


For LLM services, Kubernetes provides critical capabilities. It offers automatic scaling based on demand, spinning up new instances when traffic increases and shutting them down when traffic decreases. It provides self-healing, automatically restarting failed containers and replacing unhealthy instances. It manages resource allocation, ensuring your GPU-hungry LLM containers get the hardware they need without starving other services. It handles load balancing, distributing incoming requests across multiple LLM instances for optimal performance.



Understanding Microservice Architecture


Microservice architecture structures an application as a collection of loosely coupled services. Instead of building one large monolithic application, you create multiple small services that each handle a specific business capability. Each service runs independently, has its own database if needed, and communicates with other services through well-defined APIs.


For our LLM chat service, the microservice approach means creating a dedicated service that does one thing well: processing chat requests using a local LLM. Other parts of your application, like user authentication, data storage, or frontend interfaces, run as separate services. This separation allows you to update your LLM service without touching other components, scale it independently based on AI workload, and even use different programming languages or frameworks for different services.



MULTI-GPU ARCHITECTURE SUPPORT: THE TECHNICAL FOUNDATION


One of the most challenging aspects of deploying LLM services is supporting diverse GPU architectures. Different hardware manufacturers use different programming models and libraries for GPU acceleration. Understanding these differences is essential for building a truly portable LLM service.



NVIDIA CUDA Architecture


NVIDIA GPUs use the CUDA (Compute Unified Device Architecture) platform for parallel computing. CUDA has been the dominant force in machine learning for years, with extensive library support and optimization. When you run an LLM on NVIDIA hardware, the inference engine uses CUDA cores to parallelize matrix operations, the fundamental mathematical operations in neural networks.


The llama.cpp library we will use supports CUDA through its own CUDA kernels together with cuBLAS, NVIDIA's library for basic linear algebra operations. When compiled with CUDA support, llama.cpp can offload model layers to the GPU, dramatically accelerating inference. A response that might take 30 seconds on CPU could complete in a few seconds on a modern NVIDIA GPU, depending on the model and hardware.



AMD ROCm Architecture


AMD's ROCm (Radeon Open Compute) platform provides GPU acceleration for AMD graphics cards. ROCm is an open-source alternative to CUDA, designed to support high-performance computing and machine learning workloads. While historically less mature than CUDA, ROCm has improved significantly and now provides competitive performance for LLM inference.


The llama.cpp library supports ROCm through HIP (Heterogeneous-computing Interface for Portability), which allows CUDA code to run on AMD GPUs with minimal modifications. When you compile llama.cpp with ROCm support, it uses AMD GPU cores for acceleration just as it would use NVIDIA CUDA cores.



Apple Metal Performance Shaders


Apple Silicon chips (M1, M2, M3, and their variants) include integrated GPUs that use the Metal framework for GPU computing. Metal Performance Shaders (MPS) provides optimized implementations of common computational patterns, including the matrix operations needed for neural networks.


The llama.cpp library includes a Metal backend specifically for Apple Silicon. When running natively on a Mac with Apple Silicon, the library can use the integrated GPU for acceleration. This is particularly powerful because Apple's unified memory architecture allows the CPU and GPU to share the same memory pool, eliminating the need to copy data between separate CPU and GPU memory.



Intel GPU Architecture


Intel provides GPU acceleration through multiple technologies. Modern Intel CPUs with integrated graphics support OpenCL for general-purpose GPU computing. Intel also offers oneAPI, a unified programming model that works across CPUs, GPUs, and other accelerators. Additionally, Intel's Arc discrete GPUs provide substantial computational power for AI workloads.


The llama.cpp library supports Intel GPUs primarily through SYCL, an open standard programming model from the Khronos Group that Intel implements in its oneAPI toolkit. This support enables acceleration on Intel integrated graphics as well as discrete Arc GPUs.



THE ARCHITECTURE OF OUR LLM MICROSERVICE


Now that we understand the components, let us design the architecture of our LLM chat microservice. A well-designed architecture makes the system easier to understand, maintain, and extend.



Service Architecture Overview


Our microservice follows a layered architecture with clear separation of concerns. At the top layer, we have the API layer, which handles HTTP requests and responses. This layer exposes REST endpoints that clients use to send chat messages and receive responses. It validates input, handles errors gracefully, and formats responses according to API specifications.


Below the API layer sits the service layer, which contains the business logic. This layer manages conversation context, handles prompt formatting, and orchestrates the interaction with the LLM. It implements features like conversation history management, token counting, and response streaming.
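One piece of the service layer's job, keeping conversation history within the model's context window, can be sketched as follows. This is an illustration under stated assumptions rather than the service's actual logic: the function names are hypothetical, and the token estimate of about four characters per token is a crude heuristic standing in for the model's real tokenizer.

```python
from typing import Dict, List


def estimate_tokens(text: str) -> int:
    """Crude token estimate: about 4 characters per token (an assumption)."""
    return max(1, len(text) // 4)


def trim_history(messages: List[Dict[str, str]], budget: int) -> List[Dict[str, str]]:
    """Drop the oldest non-system messages until the estimate fits the budget."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    def total(msgs: List[Dict[str, str]]) -> int:
        return sum(estimate_tokens(m["content"]) for m in msgs)

    # Always keep the system prompt and the most recent message.
    while len(rest) > 1 and total(system + rest) > budget:
        rest.pop(0)
    return system + rest


if __name__ == "__main__":
    history = [
        {"role": "system", "content": "Be brief."},
        {"role": "user", "content": "a" * 400},
        {"role": "assistant", "content": "b" * 400},
        {"role": "user", "content": "c" * 40},
    ]
    print([m["role"] for m in trim_history(history, budget=120)])
```

A production service would count tokens with the loaded model's tokenizer and would usually drop whole user/assistant pairs so the remaining history still alternates cleanly.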


The inference layer interfaces directly with the LLM. This layer loads the model, manages GPU resources, and executes inference. It abstracts the complexity of the underlying llama.cpp library, providing a clean interface for the service layer.


Finally, the configuration layer manages all settings: model paths, GPU settings, performance parameters, and API configuration. This layer reads from environment variables and configuration files, allowing the service to adapt to different deployment environments without code changes.



Request Flow Through the System


When a client sends a chat request, it travels through our system in a well-defined path. The request first arrives at the API layer as an HTTP POST to the chat endpoint. The API layer validates the request structure, ensuring required fields are present and properly formatted.


The validated request moves to the service layer, which retrieves any relevant conversation history and constructs a complete prompt for the LLM. The prompt includes system instructions, conversation history, and the new user message, all formatted according to the specific model's expected format.


The service layer passes the formatted prompt to the inference layer, which feeds it to the loaded LLM. The model processes the prompt and generates tokens one at a time. For streaming responses, these tokens flow back through the layers immediately, allowing the client to receive partial responses as they are generated. For non-streaming responses, the inference layer collects all tokens before returning the complete response.


Finally, the response travels back through the service layer, which may perform post-processing like filtering or logging, and then through the API layer, which formats it as JSON and sends it to the client.
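The path described above can be sketched end to end with plain functions. This is a simplified illustration, not the service's real code: every function name here is hypothetical, the prompt format is a plain "role: content" layout, and the inference step is a stand-in that yields fixed tokens instead of calling a model.

```python
import json
from typing import Dict, Iterator, List


def validate_request(body: Dict) -> str:
    """API layer: check that the required field is present and well formed."""
    message = body.get("message")
    if not isinstance(message, str) or not message.strip():
        raise ValueError("'message' must be a non-empty string")
    return message


def build_prompt(history: List[Dict[str, str]], message: str) -> str:
    """Service layer: combine history and the new message into one prompt."""
    lines = [f"{m['role']}: {m['content']}" for m in history]
    lines.append(f"user: {message}")
    lines.append("assistant:")
    return "\n".join(lines)


def fake_inference(prompt: str) -> Iterator[str]:
    """Inference layer stand-in: yields tokens one at a time."""
    yield from ["Hello", ",", " world"]


def handle_chat(body: Dict, history: List[Dict[str, str]]) -> str:
    """Run one non-streaming request through all three layers."""
    message = validate_request(body)
    prompt = build_prompt(history, message)
    text = "".join(fake_inference(prompt))  # collect every token first
    return json.dumps({"response": text})


if __name__ == "__main__":
    print(handle_chat({"message": "Hi"}, history=[]))
```

Because each layer is an ordinary function with a narrow input and output, any one of them can be tested or replaced without touching the others.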



SETTING UP THE DEVELOPMENT ENVIRONMENT


Before writing code, we need to prepare our development environment. This section walks through the setup process step by step.



Installing Required System Dependencies


Our LLM service requires several system-level dependencies. On Ubuntu or Debian Linux, you need build tools for compiling native extensions, Python development headers, and GPU-specific libraries depending on your hardware.


For NVIDIA GPU support, install a CUDA toolkit version that your driver supports. The CUDA toolkit includes the compiler and libraries needed to build GPU-accelerated code. Running the nvidia-smi command confirms that the driver can see the GPU and shows the highest CUDA version the driver supports. Running nvcc --version confirms that the toolkit itself is installed.


For AMD GPU support, install ROCm following AMD's official documentation. ROCm installation is more involved than CUDA, requiring specific kernel modules and libraries. After installation, verify it by running rocm-smi, the AMD equivalent of nvidia-smi.


For Apple Silicon, no additional installation is needed. The Metal framework comes pre-installed with macOS. The llama.cpp library will automatically detect and use Metal when running on Apple Silicon.


For Intel GPU support, install the Intel oneAPI toolkit or OpenCL runtime. The specific requirements depend on whether you are using integrated graphics or discrete Arc GPUs.
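A quick way to see which of these toolchains is present on a machine is to look for the command-line tools they install. The sketch below is a heuristic under stated assumptions: the function name is hypothetical, finding a tool on the PATH does not prove the GPU is usable, and sycl-ls is the device-listing tool shipped with Intel's oneAPI toolkit.

```python
import platform
import shutil


def detect_gpu_hint() -> str:
    """Guess a GPU backend from the host platform and installed tools."""
    # Apple Silicon Macs report Darwin/arm64 and always ship with Metal.
    if platform.system() == "Darwin" and platform.machine() == "arm64":
        return "metal"
    if shutil.which("nvidia-smi"):
        return "cuda"
    if shutil.which("rocm-smi"):
        return "rocm"
    if shutil.which("sycl-ls"):
        return "sycl"
    return "none"


if __name__ == "__main__":
    print(f"Detected backend hint: {detect_gpu_hint()}")
```

The result is only a hint for choosing build options. The backend actually used at runtime is whichever one the inference library was compiled with.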



Installing Python and Dependencies


Our microservice uses Python 3.10 or later. Python 3.10 introduced several features we will use, including improved type hints and better error messages. Create a virtual environment to isolate our project dependencies from system Python packages. This isolation prevents version conflicts and makes the project reproducible.


The core Python dependencies include FastAPI, a modern web framework for building APIs with automatic documentation and validation. We use Uvicorn as the ASGI server to run FastAPI. The llama-cpp-python package provides Python bindings for llama.cpp, giving us access to efficient LLM inference with GPU support. We also need Pydantic for data validation and serialization, together with the pydantic-settings package, which loads configuration from environment variables.



Obtaining LLM Models


To run a local LLM, you need to download model files. The most accessible source is Hugging Face, which hosts thousands of models in various formats. For our service, we need models in GGUF format, the format used by llama.cpp.


Popular model choices include Llama 2, an open-source model from Meta available in sizes from 7 billion to 70 billion parameters. Mistral 7B offers excellent performance for its size with a focus on instruction following. Phi-3 from Microsoft provides strong capabilities in a compact 3.8 billion parameter model. Qwen models from Alibaba offer multilingual support with strong performance.


When selecting a model, consider the quantization level. Q4_K_M quantization provides a good balance between quality and size, reducing model size to roughly 30 percent of the 16-bit original while maintaining most capabilities. Q5_K_M offers slightly better quality at a modest size increase. Q8_0 provides near-original quality but requires about half the memory of the 16-bit original.


Download your chosen model and note its path. We will configure our service to load this model at startup.



IMPLEMENTING THE LLM MICROSERVICE


Now we begin implementing our microservice. We will build it incrementally, explaining each component in detail.



Creating the Project Structure


A well-organized project structure makes the code easier to navigate and maintain. Create a directory for your project and organize it into logical modules. The main application code lives in an app directory. Configuration handling goes in a config module. The LLM inference logic resides in a models module. Conversation handling and prompt formatting live in a services module. API endpoints are defined in a routes module. Shared utilities and helpers go in a utils module.


This structure follows clean architecture principles by separating concerns. The routes module depends on the models module, but the models module does not know about HTTP or routing. This separation allows you to test the LLM inference logic independently of the web framework.



Implementing Configuration Management


Configuration management is critical for a production service. Hard-coding values makes the service inflexible and difficult to deploy in different environments. We use environment variables and configuration files to make the service adaptable.


Create a configuration module that defines all settings using the BaseSettings class from the pydantic-settings package. It provides automatic validation and type conversion for configuration values, and it reads each field from an environment variable of the same name (model_path from MODEL_PATH, for example), falling back to the declared default when the variable is not set.


Here is the configuration module:



# app/config.py

from enum import Enum
from functools import lru_cache
from typing import Optional

from pydantic_settings import BaseSettings, SettingsConfigDict


class GPUType(str, Enum):
    """Enumeration of supported GPU types"""
    CUDA = "cuda"
    ROCM = "rocm"
    METAL = "metal"
    SYCL = "sycl"
    NONE = "none"


class Settings(BaseSettings):
    """Application configuration settings"""

    # Read a local .env file if present. Narrowing protected_namespaces
    # avoids Pydantic's warning about field names that start with "model_".
    model_config = SettingsConfigDict(
        env_file=".env",
        env_file_encoding="utf-8",
        protected_namespaces=("settings_",),
    )

    # Model configuration (model_path has no default, so it is required)
    model_path: str
    model_name: str = "local-llm"
    context_length: int = 4096
    max_tokens: int = 2048
    temperature: float = 0.7
    top_p: float = 0.9
    top_k: int = 40
    repeat_penalty: float = 1.1

    # GPU configuration
    gpu_type: GPUType = GPUType.NONE
    n_gpu_layers: int = 0
    main_gpu: int = 0
    tensor_split: Optional[str] = None

    # Server configuration
    host: str = "0.0.0.0"
    port: int = 8000
    workers: int = 1
    log_level: str = "info"

    # Performance configuration
    n_threads: int = 4
    n_batch: int = 512

    # API configuration
    api_key: Optional[str] = None
    enable_streaming: bool = True
    max_concurrent_requests: int = 10


@lru_cache
def get_settings() -> Settings:
    """Get application settings singleton (built once, then cached)"""
    return Settings()



This configuration module defines all the settings our service needs. The model_path specifies where to find the GGUF model file. The context_length determines how much conversation history the model can consider. The max_tokens limits the length of generated responses. Temperature, top_p, and top_k control the randomness and creativity of responses. Lower temperature values produce more focused and deterministic outputs, while higher values increase creativity and randomness.
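To make the effect of temperature, top_k, and top_p concrete, here is a small pure-Python sketch of how such filters narrow a token distribution. It is an illustration only: the function name and the toy logits are invented for this example, temperature is assumed to be greater than zero, and the sampler inside llama.cpp differs in its details.

```python
import math
from typing import Dict


def filter_distribution(logits: Dict[str, float], temperature: float,
                        top_k: int, top_p: float) -> Dict[str, float]:
    """Return the renormalized token probabilities left after filtering."""
    # Temperature scaling followed by a numerically stable softmax.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(((tok, e / total) for tok, e in exps.items()),
                    key=lambda item: item[1], reverse=True)

    # top_k: keep only the k most likely tokens.
    ranked = ranked[:top_k]

    # top_p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    norm = sum(prob for _, prob in kept)
    return {tok: prob / norm for tok, prob in kept}


if __name__ == "__main__":
    toy_logits = {"a": 2.0, "b": 1.0, "c": 0.0, "d": -1.0}
    for temp in (1.0, 0.5):
        kept = filter_distribution(toy_logits, temp, top_k=3, top_p=0.9)
        print(temp, {tok: round(p, 3) for tok, p in kept.items()})
```

With these toy values, lowering the temperature concentrates probability on the most likely token, so fewer candidates survive the top_p cutoff and the output becomes more deterministic.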


The GPU configuration section is particularly important. The gpu_type setting tells the service which GPU backend to configure. The backend itself is fixed when llama-cpp-python is compiled, so this value must match the installed build. The n_gpu_layers setting specifies how many model layers to offload to the GPU. Setting it to a value above the model's layer count, such as 99, offloads the entire model to GPU memory, maximizing performance. The main_gpu setting selects which GPU to use in multi-GPU systems. The tensor_split setting allows distributing the model across multiple GPUs, specified as a comma-separated list of proportions.
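When the whole model does not fit in GPU memory, a rough calculation can suggest a starting value for n_gpu_layers. The sketch below rests on simplifying assumptions that are not from the original text: all layers are treated as equally sized, and reserve_gb is a hypothetical allowance for the context cache and scratch buffers. The function name is illustrative.

```python
def layers_that_fit(model_size_gb: float, n_layers: int,
                    vram_gb: float, reserve_gb: float = 1.5) -> int:
    """Estimate how many layers fit in VRAM, assuming equally sized layers."""
    per_layer_gb = model_size_gb / n_layers
    usable_gb = max(0.0, vram_gb - reserve_gb)
    # Never report more layers than the model actually has.
    return min(n_layers, int(usable_gb // per_layer_gb))


if __name__ == "__main__":
    # Hypothetical 4 GB quantized model with 32 layers.
    for vram in (8.0, 4.0, 1.0):
        print(f"{vram:>4} GB VRAM -> {layers_that_fit(4.0, 32, vram)} layers")
```

Treat the result as a first guess and adjust it after watching actual memory use, since context length and batch size also consume GPU memory.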


The server configuration controls how the HTTP server runs. The host setting determines which network interfaces to bind to. Using 0.0.0.0 makes the service accessible from any network interface, necessary for Docker containers. The port specifies which TCP port to listen on.


The performance configuration affects inference speed and resource usage. The n_threads setting controls CPU parallelism for operations not offloaded to GPU. The n_batch parameter affects how the model processes tokens internally, with higher values potentially improving throughput at the cost of memory.



Implementing the LLM Inference Layer


The inference layer encapsulates all interaction with the LLM. This abstraction isolates the rest of the application from the specifics of the llama.cpp library, making it easier to swap implementations if needed.


Here is the inference layer implementation:



# app/models/llm.py

import logging
from typing import Any, Dict, Iterator, List, Optional

from llama_cpp import Llama

from app.config import GPUType, Settings

logger = logging.getLogger(__name__)


class LLMInference:
    """Handles LLM model loading and inference"""

    def __init__(self, settings: Settings):
        """
        Initialize the LLM inference engine

        Args:
            settings: Application settings containing model configuration
        """
        self.settings = settings
        self.model: Optional[Llama] = None
        self._initialize_model()

    def _initialize_model(self) -> None:
        """Load and initialize the LLM model with appropriate GPU settings"""
        logger.info(f"Loading model from {self.settings.model_path}")
        logger.info(f"GPU type: {self.settings.gpu_type.value}")
        logger.info(f"GPU layers: {self.settings.n_gpu_layers}")

        # Prepare model initialization parameters
        model_params = {
            "model_path": self.settings.model_path,
            "n_ctx": self.settings.context_length,
            "n_threads": self.settings.n_threads,
            "n_batch": self.settings.n_batch,
            "verbose": self.settings.log_level == "debug",
        }

        # Configure GPU acceleration based on type
        if self.settings.gpu_type != GPUType.NONE:
            model_params["n_gpu_layers"] = self.settings.n_gpu_layers

            if self.settings.gpu_type in (GPUType.CUDA, GPUType.ROCM):
                # ROCm uses the same parameters as CUDA through HIP
                model_params["main_gpu"] = self.settings.main_gpu
                if self.settings.tensor_split:
                    # Parse tensor split string into list of floats
                    model_params["tensor_split"] = [
                        float(x) for x in self.settings.tensor_split.split(",")
                    ]
                if self.settings.gpu_type == GPUType.CUDA:
                    logger.info("Configured for NVIDIA CUDA acceleration")
                else:
                    logger.info("Configured for AMD ROCm acceleration")

            elif self.settings.gpu_type == GPUType.METAL:
                # Metal acceleration for Apple Silicon
                logger.info("Configured for Apple Metal acceleration")

            elif self.settings.gpu_type == GPUType.SYCL:
                # Intel GPU acceleration through SYCL
                model_params["main_gpu"] = self.settings.main_gpu
                logger.info("Configured for Intel SYCL acceleration")
        else:
            logger.info("Running on CPU only (no GPU acceleration)")

        try:
            self.model = Llama(**model_params)
            logger.info("Model loaded successfully")
        except Exception as e:
            logger.error(f"Failed to load model: {e}")
            raise

    def generate(
        self,
        prompt: str,
        max_tokens: Optional[int] = None,
        temperature: Optional[float] = None,
        top_p: Optional[float] = None,
        top_k: Optional[int] = None,
        repeat_penalty: Optional[float] = None,
        stop: Optional[List[str]] = None,
        stream: bool = False
    ) -> Iterator[Dict[str, Any]]:
        """
        Generate text from the model

        Args:
            prompt: Input text to generate from
            max_tokens: Maximum tokens to generate (uses config default if None)
            temperature: Sampling temperature (uses config default if None)
            top_p: Nucleus sampling parameter (uses config default if None)
            top_k: Top-k sampling parameter (uses config default if None)
            repeat_penalty: Repetition penalty (uses config default if None)
            stop: List of stop sequences
            stream: Whether to stream tokens as they are generated

        Yields:
            Dictionary containing generated text and metadata
        """
        if self.model is None:
            raise RuntimeError("Model not initialized")

        # Use provided parameters or fall back to configuration defaults
        generation_params = {
            "prompt": prompt,
            "max_tokens": max_tokens if max_tokens is not None else self.settings.max_tokens,
            "temperature": temperature if temperature is not None else self.settings.temperature,
            "top_p": top_p if top_p is not None else self.settings.top_p,
            "top_k": top_k if top_k is not None else self.settings.top_k,
            "repeat_penalty": repeat_penalty if repeat_penalty is not None else self.settings.repeat_penalty,
            "stop": stop or [],
            "stream": stream,
            "echo": False,
        }

        logger.debug(f"Generating with params: {generation_params}")

        try:
            if stream:
                # Stream tokens as they are generated
                for output in self.model(**generation_params):
                    yield {
                        "text": output["choices"][0]["text"],
                        "finish_reason": output["choices"][0].get("finish_reason"),
                    }
            else:
                # Generate complete response
                output = self.model(**generation_params)
                yield {
                    "text": output["choices"][0]["text"],
                    "finish_reason": output["choices"][0].get("finish_reason"),
                    "usage": {
                        "prompt_tokens": output["usage"]["prompt_tokens"],
                        "completion_tokens": output["usage"]["completion_tokens"],
                        "total_tokens": output["usage"]["total_tokens"],
                    }
                }
        except Exception as e:
            logger.error(f"Generation failed: {e}")
            raise

    def get_model_info(self) -> Dict[str, Any]:
        """
        Get information about the loaded model

        Returns:
            Dictionary containing model metadata
        """
        return {
            "model_name": self.settings.model_name,
            "model_path": self.settings.model_path,
            "context_length": self.settings.context_length,
            "gpu_type": self.settings.gpu_type.value,
            "gpu_layers": self.settings.n_gpu_layers,
        }



This inference layer provides a clean interface for the rest of the application. The initialization method loads the model with appropriate GPU settings based on the configuration. The generate method handles both streaming and non-streaming inference, accepting parameters that control the generation process.


The GPU configuration logic deserves special attention. For CUDA and ROCm, we set the main_gpu parameter to select which GPU to use. The tensor_split parameter allows distributing the model across multiple GPUs. For example, a tensor_split of "0.6,0.4" would put 60 percent of the model on the first GPU and 40 percent on the second GPU. This is useful for very large models that do not fit in a single GPU's memory.
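Because tensor_split arrives as a string from the environment, it is worth validating before the model loads. The helper below is a hypothetical sketch, not part of the service code shown above: it parses the comma-separated value, rejects empty or negative input, and normalizes the proportions so they sum to 1 for easy logging.

```python
from typing import List


def parse_tensor_split(value: str) -> List[float]:
    """Parse a string such as "0.6,0.4" into proportions that sum to 1."""
    # float() raises ValueError for non-numeric parts such as "abc".
    parts = [float(part) for part in value.split(",") if part.strip()]
    if not parts or any(p < 0 for p in parts) or sum(parts) == 0:
        raise ValueError(f"invalid tensor_split: {value!r}")
    total = sum(parts)
    return [p / total for p in parts]


if __name__ == "__main__":
    print(parse_tensor_split("0.6,0.4"))
    print(parse_tensor_split("3,1"))
```

Accepting ratios such as "3,1" as well as fractions keeps the setting forgiving, while the early ValueError surfaces a malformed value at startup rather than during model loading.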


For Metal acceleration on Apple Silicon, we simply set n_gpu_layers to offload computation to the integrated GPU. The llama.cpp library handles the Metal-specific details automatically.


The generate method implements both streaming and non-streaming modes. In streaming mode, tokens are yielded as soon as they are generated, allowing clients to display partial responses. This significantly improves perceived responsiveness for long responses. In non-streaming mode, the complete response is generated before returning, which is simpler but requires the client to wait for the entire response.
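The difference between the two modes is easiest to see from the caller's side. The sketch below uses a stand-in generator with fixed tokens in place of a real model; the function names are hypothetical, and only the shape of the yielded dictionaries (a "text" key and a "finish_reason" key) mirrors the generate method.

```python
from typing import Dict, Iterable, Iterator, Optional


def fake_generate(stream: bool) -> Iterator[Dict[str, Optional[str]]]:
    """Stand-in for generate: yields dictionaries with a "text" key."""
    tokens = ["Stream", "ing", " works"]
    if stream:
        # Streaming: one small chunk per token, as soon as it is available.
        for index, token in enumerate(tokens):
            is_last = index == len(tokens) - 1
            yield {"text": token, "finish_reason": "stop" if is_last else None}
    else:
        # Non-streaming: a single chunk holding the complete response.
        yield {"text": "".join(tokens), "finish_reason": "stop"}


def collect(chunks: Iterable[Dict[str, Optional[str]]]) -> str:
    """Join the text of every chunk into the full response."""
    return "".join(chunk["text"] or "" for chunk in chunks)


if __name__ == "__main__":
    for chunk in fake_generate(stream=True):
        print(repr(chunk["text"]))  # a client could display each piece at once
    print(collect(fake_generate(stream=False)))
```

Both modes produce the same final text. Streaming only changes when the caller sees it, which is why the same consuming loop works for either mode.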



Implementing the Service Layer


The service layer sits between the API and the inference layer, handling business logic like conversation management and prompt formatting.


Here is the service layer implementation:



# app/services/chat_service.py

import json
import logging
from typing import Any, Dict, Iterator, List, Optional

from app.config import Settings
from app.models.llm import LLMInference

logger = logging.getLogger(__name__)


class Message:
    """Represents a single message in a conversation"""

    def __init__(self, role: str, content: str):
        """
        Initialize a message

        Args:
            role: Message role (system, user, or assistant)
            content: Message content text
        """
        self.role = role
        self.content = content

    def to_dict(self) -> Dict[str, str]:
        """Convert message to dictionary"""
        return {"role": self.role, "content": self.content}



class Conversation:
    """Manages a conversation with message history"""

    def __init__(self, system_prompt: Optional[str] = None):
        """
        Initialize a conversation

        Args:
            system_prompt: Optional system prompt to set behavior
        """
        self.messages: List[Message] = []
        if system_prompt:
            self.messages.append(Message("system", system_prompt))

    def add_message(self, role: str, content: str) -> None:
        """
        Add a message to the conversation

        Args:
            role: Message role (user or assistant)
            content: Message content
        """
        self.messages.append(Message(role, content))

    def get_messages(self) -> List[Dict[str, str]]:
        """Get all messages as list of dictionaries"""
        return [msg.to_dict() for msg in self.messages]

    def format_for_model(self, model_format: str = "chatml") -> str:
        """
        Format conversation for model input

        Args:
            model_format: Format to use (chatml, llama2, etc.)

        Returns:
            Formatted prompt string
        """
        if model_format == "chatml":
            # ChatML format used by many models
            formatted = ""
            for msg in self.messages:
                formatted += f"<|im_start|>{msg.role}\n{msg.content}<|im_end|>\n"
            formatted += "<|im_start|>assistant\n"
            return formatted

        elif model_format == "llama2":
            # Llama 2 chat format
            formatted = ""
            system_msg = None

            # Extract system message if present
            messages = self.messages.copy()
            if messages and messages[0].role == "system":
                system_msg = messages.pop(0).content

            # Format with special tokens
            if system_msg:
                formatted = f"[INST] <<SYS>>\n{system_msg}\n<</SYS>>\n\n"

            for i, msg in enumerate(messages):
                if msg.role == "user":
                    if i == 0 and system_msg:
                        # The opening [INST] was already emitted with the system block
                        formatted += f"{msg.content} [/INST] "
                    else:
                        formatted += f"[INST] {msg.content} [/INST] "
                elif msg.role == "assistant":
                    formatted += f"{msg.content} "

            return formatted

        elif model_format == "alpaca":
            # Alpaca instruction format
            formatted = ""
            for msg in self.messages:
                if msg.role == "system":
                    formatted += f"{msg.content}\n\n"
                elif msg.role == "user":
                    formatted += f"### Instruction:\n{msg.content}\n\n"
                elif msg.role == "assistant":
                    formatted += f"### Response:\n{msg.content}\n\n"
            formatted += "### Response:\n"
            return formatted

        else:
            # Simple format as fallback
            formatted = ""
            for msg in self.messages:
                formatted += f"{msg.role}: {msg.content}\n"
            formatted += "assistant: "
            return formatted



class ChatService:

    """Service for handling chat interactions with the LLM"""

    

    def __init__(self, llm: LLMInference, settings: Settings):

        """

        Initialize chat service

        

        Args:

            llm: LLM inference engine

            settings: Application settings

        """

        self.llm = llm

        self.settings = settings

        self.conversations: Dict[str, Conversation] = {}

    

    def create_conversation(

        self,

        conversation_id: str,

        system_prompt: Optional[str] = None

    ) -> None:

        """

        Create a new conversation

        

        Args:

            conversation_id: Unique identifier for the conversation

            system_prompt: Optional system prompt

        """

        self.conversations[conversation_id] = Conversation(system_prompt)

        logger.info(f"Created conversation {conversation_id}")

    

    def get_conversation(self, conversation_id: str) -> Optional[Conversation]:

        """

        Get an existing conversation

        

        Args:

            conversation_id: Conversation identifier

            

        Returns:

            Conversation object or None if not found

        """

        return self.conversations.get(conversation_id)

    

    def delete_conversation(self, conversation_id: str) -> bool:

        """

        Delete a conversation

        

        Args:

            conversation_id: Conversation identifier

            

        Returns:

            True if deleted, False if not found

        """

        if conversation_id in self.conversations:

            del self.conversations[conversation_id]

            logger.info(f"Deleted conversation {conversation_id}")

            return True

        return False

    

    def chat(

        self,

        message: str,

        conversation_id: Optional[str] = None,

        system_prompt: Optional[str] = None,

        model_format: str = "chatml",

        stream: bool = False,

        **generation_params

    ) -> Iterator[Dict[str, Any]]:

        """

        Process a chat message and generate response

        

        Args:

            message: User message text

            conversation_id: Optional conversation ID for multi-turn chat

            system_prompt: Optional system prompt for stateless chat

            model_format: Prompt format to use

            stream: Whether to stream the response

            **generation_params: Additional generation parameters

            

        Yields:

            Response chunks with generated text

        """

        # Handle conversation context

        if conversation_id:

            conversation = self.get_conversation(conversation_id)

            if not conversation:

                raise ValueError(f"Conversation {conversation_id} not found")

            conversation.add_message("user", message)

        else:

            # Stateless chat - create temporary conversation

            conversation = Conversation(system_prompt)

            conversation.add_message("user", message)

        

        # Format prompt for model

        prompt = conversation.format_for_model(model_format)

        

        logger.info(f"Processing chat message (stream={stream})")

        logger.debug(f"Formatted prompt: {prompt}")

        

        # Generate response

        accumulated_text = ""

        for chunk in self.llm.generate(prompt=prompt, stream=stream, **generation_params):

            accumulated_text += chunk["text"]

            yield chunk

        

        # Add assistant response to conversation history

        if conversation_id:

            conversation.add_message("assistant", accumulated_text)



The service layer implements several important abstractions. The Message class represents individual messages in a conversation. The Conversation class manages message history and handles prompt formatting for different model types.


Different LLMs expect different prompt formats, and a model prompted in the wrong format usually produces noticeably worse output. The format_for_model method implements several common formats. ChatML format uses special tokens like <|im_start|> and <|im_end|> to delimit messages. Llama 2 format uses [INST] and [/INST] tokens with special handling for system prompts. Alpaca format uses structured sections with headers like "### Instruction:" and "### Response:".
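To make the differences concrete, the sketch below renders the same short exchange in the ChatML and Llama 2 layouts. It is a standalone illustration: the helper names render_chatml and render_llama2 are hypothetical and only mirror the string templates used in format_for_model above. Real models may additionally expect begin-of-sequence or end-of-sequence tokens, which are not shown here.

```python
from typing import List, Tuple

# (role, content) pairs stand in for the article's Message objects.
Turn = Tuple[str, str]


def render_chatml(turns: List[Turn]) -> str:
    """Wrap every turn in <|im_start|>/<|im_end|> and open an assistant turn."""
    out = ""
    for role, content in turns:
        out += f"<|im_start|>{role}\n{content}<|im_end|>\n"
    return out + "<|im_start|>assistant\n"


def render_llama2(turns: List[Turn]) -> str:
    """Fold an optional leading system turn into the first [INST] block."""
    turns = list(turns)
    system = turns.pop(0)[1] if turns and turns[0][0] == "system" else None
    out = f"[INST] <<SYS>>\n{system}\n<</SYS>>\n\n" if system else ""
    for i, (role, content) in enumerate(turns):
        if role == "user":
            if i == 0 and system:
                # The first user turn reuses the [INST] opened by the system block.
                out += f"{content} [/INST] "
            else:
                out += f"[INST] {content} [/INST] "
        elif role == "assistant":
            out += f"{content} "
    return out


turns = [("system", "Be brief."), ("user", "Hi")]
print(render_chatml(turns))
print(render_llama2(turns))
```

Printing both strings side by side is a quick way to check that the format you select matches the template documented for the model you actually load.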


The ChatService class provides high-level chat functionality. It manages multiple concurrent conversations, each identified by a unique ID. For stateless single-turn interactions, it creates temporary conversations. The chat method orchestrates the entire process: retrieving or creating a conversation, adding the user message, formatting the prompt, generating the response, and updating the conversation history.
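One consequence of chat being a generator is easy to miss: nothing runs until the caller starts iterating, and the assistant reply is appended to the history only once the caller has consumed every chunk. The following standalone sketch reproduces that control flow with a stub in place of the model. FakeLLM and MiniChat are hypothetical names used only for illustration; they are not part of the service.

```python
from typing import Dict, Iterator, List


class FakeLLM:
    """Stub engine (assumption): yields fixed chunks instead of running a model."""

    def generate(self, prompt: str, stream: bool = False) -> Iterator[Dict[str, str]]:
        for piece in ("Hello", ", ", "world"):
            yield {"text": piece}


class MiniChat:
    """Reduced version of the chat flow: record user turn, stream, record reply."""

    def __init__(self, llm: FakeLLM) -> None:
        self.llm = llm
        self.history: List[Dict[str, str]] = []

    def chat(self, message: str) -> Iterator[Dict[str, str]]:
        self.history.append({"role": "user", "content": message})
        accumulated = ""
        for chunk in self.llm.generate(prompt=message, stream=True):
            accumulated += chunk["text"]
            yield chunk
        # Reached only after the caller has exhausted the generator.
        self.history.append({"role": "assistant", "content": accumulated})


service = MiniChat(FakeLLM())
stream = service.chat("Hi")
print(len(service.history))   # generator not started yet, so nothing is recorded
first = next(stream)
print(len(service.history))   # user turn recorded, reply still pending
rest = list(stream)
print(service.history[-1])    # reply recorded after the last chunk
```

A client that disconnects mid-stream therefore leaves the user turn in the history without a matching assistant turn, which is worth handling if conversations are long-lived.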



Implementing the API Layer


The API layer exposes our service through HTTP endpoints. We use FastAPI, which provides automatic request validation, response serialization, and interactive API documentation.


Here is the API implementation:



# app/routes/chat.py


from fastapi import APIRouter, HTTPException, Depends

from fastapi.responses import StreamingResponse

from pydantic import BaseModel, Field

from typing import Optional, List, Dict, Any

import json

import logging

from app.services.chat_service import ChatService

from app.models.llm import LLMInference

from app.config import Settings, get_settings



logger = logging.getLogger(__name__)

router = APIRouter()



# Request and response models

class ChatMessage(BaseModel):

    """Single chat message"""

    role: str = Field(..., description="Message role (system, user, or assistant)")

    content: str = Field(..., description="Message content")



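class ChatRequest(BaseModel):

    """Request for chat completion"""

    # Pydantic v2 reserves the "model_" prefix and warns about model_format otherwise

    model_config = {"protected_namespaces": ()}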

    message: str = Field(..., description="User message to process")

    conversation_id: Optional[str] = Field(None, description="Conversation ID for multi-turn chat")

    system_prompt: Optional[str] = Field(None, description="System prompt for behavior control")

    model_format: str = Field("chatml", description="Prompt format (chatml, llama2, alpaca)")

    stream: bool = Field(False, description="Whether to stream the response")

    max_tokens: Optional[int] = Field(None, description="Maximum tokens to generate")

    temperature: Optional[float] = Field(None, description="Sampling temperature")

    top_p: Optional[float] = Field(None, description="Nucleus sampling parameter")

    top_k: Optional[int] = Field(None, description="Top-k sampling parameter")

    stop: Optional[List[str]] = Field(None, description="Stop sequences")



class ChatResponse(BaseModel):

    """Response from chat completion"""

    response: str = Field(..., description="Generated response text")

    conversation_id: Optional[str] = Field(None, description="Conversation ID if applicable")

    finish_reason: Optional[str] = Field(None, description="Reason generation stopped")

    usage: Optional[Dict[str, int]] = Field(None, description="Token usage statistics")



class ConversationRequest(BaseModel):

    """Request to create a conversation"""

    conversation_id: str = Field(..., description="Unique conversation identifier")

    system_prompt: Optional[str] = Field(None, description="System prompt for the conversation")



class ConversationResponse(BaseModel):

    """Response for conversation operations"""

    conversation_id: str = Field(..., description="Conversation identifier")

    messages: List[ChatMessage] = Field(..., description="Conversation messages")



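class HealthResponse(BaseModel):

    """Health check response"""

    # Pydantic v2 reserves the "model_" prefix and warns about model_info otherwise

    model_config = {"protected_namespaces": ()}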

    status: str = Field(..., description="Service status")

    model_info: Dict[str, Any] = Field(..., description="Model information")



# Dependency injection for services

_chat_service: Optional[ChatService] = None



def get_chat_service() -> ChatService:

    """Get chat service singleton"""

    global _chat_service

    if _chat_service is None:

        settings = get_settings()

        llm = LLMInference(settings)

        _chat_service = ChatService(llm, settings)

    return _chat_service



@router.post("/chat", response_model=ChatResponse)

async def chat(

    request: ChatRequest,

    chat_service: ChatService = Depends(get_chat_service)

):

    """

    Process a chat message and generate a response

    

    This endpoint supports both stateless single-turn chat and stateful

    multi-turn conversations. For multi-turn chat, create a conversation

    first and provide its ID in subsequent requests.

    """

    try:

        # Prepare generation parameters

        gen_params = {}

        if request.max_tokens is not None:

            gen_params["max_tokens"] = request.max_tokens

        if request.temperature is not None:

            gen_params["temperature"] = request.temperature

        if request.top_p is not None:

            gen_params["top_p"] = request.top_p

        if request.top_k is not None:

            gen_params["top_k"] = request.top_k

        if request.stop is not None:

            gen_params["stop"] = request.stop

        

        if request.stream:

            # Return streaming response

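            # A plain (non-async) generator is iterated by StreamingResponse in a

            # worker thread, so the blocking model call does not stall the event loop

            def generate():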

                try:

                    for chunk in chat_service.chat(

                        message=request.message,

                        conversation_id=request.conversation_id,

                        system_prompt=request.system_prompt,

                        model_format=request.model_format,

                        stream=True,

                        **gen_params

                    ):

                        # Format as server-sent events

                        data = json.dumps(chunk)

                        yield f"data: {data}\n\n"

                    yield "data: [DONE]\n\n"

                except Exception as e:

                    logger.error(f"Streaming error: {e}")

                    error_data = json.dumps({"error": str(e)})

                    yield f"data: {error_data}\n\n"

            

            return StreamingResponse(

                generate(),

                media_type="text/event-stream"

            )

        else:

            # Return complete response

            response_text = ""

            finish_reason = None

            usage = None

            

            for chunk in chat_service.chat(

                message=request.message,

                conversation_id=request.conversation_id,

                system_prompt=request.system_prompt,

                model_format=request.model_format,

                stream=False,

                **gen_params

            ):

                response_text += chunk["text"]

                finish_reason = chunk.get("finish_reason")

                usage = chunk.get("usage")

            

            return ChatResponse(

                response=response_text,

                conversation_id=request.conversation_id,

                finish_reason=finish_reason,

                usage=usage

            )

    

    except ValueError as e:

        raise HTTPException(status_code=400, detail=str(e))

    except Exception as e:

        logger.error(f"Chat error: {e}", exc_info=True)

        raise HTTPException(status_code=500, detail="Internal server error")



@router.post("/conversations", response_model=ConversationResponse)

async def create_conversation(

    request: ConversationRequest,

    chat_service: ChatService = Depends(get_chat_service)

):

    """

    Create a new conversation for multi-turn chat

    

    Conversations maintain message history across multiple chat requests.

    Use the returned conversation_id in subsequent chat requests.

    """

    try:

        chat_service.create_conversation(

            conversation_id=request.conversation_id,

            system_prompt=request.system_prompt

        )

        

        conversation = chat_service.get_conversation(request.conversation_id)

        messages = [

            ChatMessage(role=msg["role"], content=msg["content"])

            for msg in conversation.get_messages()

        ]

        

        return ConversationResponse(

            conversation_id=request.conversation_id,

            messages=messages

        )

    

    except Exception as e:

        logger.error(f"Conversation creation error: {e}", exc_info=True)

        raise HTTPException(status_code=500, detail="Internal server error")



@router.get("/conversations/{conversation_id}", response_model=ConversationResponse)

async def get_conversation(

    conversation_id: str,

    chat_service: ChatService = Depends(get_chat_service)

):

    """

    Get an existing conversation with its message history

    """

    conversation = chat_service.get_conversation(conversation_id)

    if not conversation:

        raise HTTPException(status_code=404, detail="Conversation not found")

    

    messages = [

        ChatMessage(role=msg["role"], content=msg["content"])

        for msg in conversation.get_messages()

    ]

    

    return ConversationResponse(

        conversation_id=conversation_id,

        messages=messages

    )



@router.delete("/conversations/{conversation_id}")

async def delete_conversation(

    conversation_id: str,

    chat_service: ChatService = Depends(get_chat_service)

):

    """

    Delete a conversation and its history

    """

    deleted = chat_service.delete_conversation(conversation_id)

    if not deleted:

        raise HTTPException(status_code=404, detail="Conversation not found")

    

    return {"status": "deleted", "conversation_id": conversation_id}



@router.get("/health", response_model=HealthResponse)

async def health_check(

    chat_service: ChatService = Depends(get_chat_service)

):

    """

    Check service health and get model information

    """

    model_info = chat_service.llm.get_model_info()

    return HealthResponse(

        status="healthy",

        model_info=model_info

    )



The API layer defines several endpoints. The /chat endpoint is the primary interface for generating responses. It accepts a ChatRequest containing the user message and optional parameters, and it supports both streaming and non-streaming responses. For streaming, it returns server-sent events that clients can process incrementally.
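On the client side, the stream can be decoded line by line. The sketch below is a minimal parser for the framing produced by the endpoint above: a data: prefix, a JSON payload, and a final [DONE] sentinel. The function name iter_sse_chunks is hypothetical, and the parser deliberately ignores other server-sent-event fields such as event: or id: because this service does not emit them.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def iter_sse_chunks(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield decoded JSON payloads from 'data: ...' lines until '[DONE]'."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # blank separators and unrelated fields are skipped
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        chunk = json.loads(payload)
        if "error" in chunk:
            # The endpoint reports mid-stream failures as an error object.
            raise RuntimeError(chunk["error"])
        yield chunk


# Simulated response body; a real client would iterate over the HTTP response lines.
body = [
    'data: {"text": "Hel"}\n', "\n",
    'data: {"text": "lo"}\n', "\n",
    "data: [DONE]\n", "\n",
]
print("".join(chunk["text"] for chunk in iter_sse_chunks(body)))
```

Because errors arrive inside an HTTP 200 stream, the client has to inspect each payload rather than rely on the status code alone.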


The /conversations endpoints manage multi-turn conversations. The POST endpoint creates a new conversation with an optional system prompt. The GET endpoint retrieves conversation history. The DELETE endpoint removes a conversation and its history.
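A typical multi-turn session therefore consists of one POST to create the conversation followed by repeated POSTs to /chat. The sketch below only builds the requests with the standard library so that the payload shapes are visible. The base URL http://localhost:8000/api/v1 assumes the router prefix and port used in this article, and the helper name build_post is hypothetical.

```python
import json
import urllib.request
from typing import Any, Dict

BASE_URL = "http://localhost:8000/api/v1"  # assumption: default host, port, and prefix


def build_post(path: str, payload: Dict[str, Any]) -> urllib.request.Request:
    """Create a JSON POST request without sending it."""
    return urllib.request.Request(
        url=f"{BASE_URL}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


create = build_post("/conversations", {
    "conversation_id": "demo-1",
    "system_prompt": "You are a concise assistant.",
})
ask = build_post("/chat", {
    "message": "What is a Pod?",
    "conversation_id": "demo-1",
    "max_tokens": 128,
})

for req in (create, ask):
    print(req.get_method(), req.full_url, req.data.decode("utf-8"))
    # To send: urllib.request.urlopen(req, timeout=120) once the service is running.
```

Omitting conversation_id from the /chat payload gives the stateless single-turn behavior instead.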


The /health endpoint provides a way to check if the service is running and get information about the loaded model. This is useful for monitoring and debugging.


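The dependency injection pattern used here ensures that only one instance of the ChatService and LLMInference is created per process and shared across all requests. This is critical because loading the model is expensive and should happen only once. Note that the model is loaded lazily, on the first request that needs the service, and that running several worker processes would load one copy of the model per worker.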
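The lazy initialization in get_chat_service is not guarded against two first requests arriving at the same moment, because FastAPI runs synchronous dependencies in a thread pool. If that race is a concern, one possible hardening is double-checked locking, sketched here with a hypothetical ExpensiveService standing in for the ChatService and its model.

```python
import threading
from typing import Optional


class ExpensiveService:
    """Placeholder (assumption) for an object that is costly to construct."""

    instances_created = 0

    def __init__(self) -> None:
        ExpensiveService.instances_created += 1


_service: Optional[ExpensiveService] = None
_service_lock = threading.Lock()


def get_service() -> ExpensiveService:
    """Return the process-wide instance, constructing it at most once."""
    global _service
    if _service is None:              # fast path once initialized
        with _service_lock:
            if _service is None:      # re-check while holding the lock
                _service = ExpensiveService()
    return _service


# Several threads asking at once still trigger a single construction.
threads = [threading.Thread(target=get_service) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(ExpensiveService.instances_created)
```

An alternative with the same effect is to construct the service eagerly in the application's startup hook, which also moves the model-loading delay out of the first request.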



Creating the Main Application


Now we tie everything together in the main application file:



# app/main.py


from fastapi import FastAPI

from fastapi.middleware.cors import CORSMiddleware

import logging

from app.routes import chat

from app.config import get_settings



# Configure logging

logging.basicConfig(

    level=logging.INFO,

    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'

)

logger = logging.getLogger(__name__)



def create_app() -> FastAPI:

    """Create and configure the FastAPI application"""

    

    settings = get_settings()

    

    # Create FastAPI app

    app = FastAPI(

        title="LLM Chat Microservice",

        description="Local LLM chat service with multi-GPU support",

        version="1.0.0",

        docs_url="/docs",

        redoc_url="/redoc"

    )

    

    # Configure CORS for cross-origin requests

    app.add_middleware(

        CORSMiddleware,

        allow_origins=["*"],

        allow_credentials=True,

        allow_methods=["*"],

        allow_headers=["*"],

    )

    

    # Include routers

    app.include_router(chat.router, prefix="/api/v1", tags=["chat"])

    

    @app.on_event("startup")

    async def startup_event():

        """Initialize services on startup"""

        logger.info("Starting LLM Chat Microservice")

        logger.info(f"Model: {settings.model_path}")

        logger.info(f"GPU: {settings.gpu_type.value}")

        logger.info(f"Server: {settings.host}:{settings.port}")

    

    @app.on_event("shutdown")

    async def shutdown_event():

        """Cleanup on shutdown"""

        logger.info("Shutting down LLM Chat Microservice")

    

    return app



app = create_app()



if __name__ == "__main__":

    import uvicorn

    settings = get_settings()

    uvicorn.run(

        "app.main:app",

        host=settings.host,

        port=settings.port,

        log_level=settings.log_level,

        reload=False

    )



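This main application file creates the FastAPI application and configures it. The CORS middleware allows the service to accept requests from web browsers running on different domains. The wildcard origin shown here is convenient during development, but in production you should list the allowed origins explicitly, especially when credentials are enabled. The startup event logs important configuration information. The shutdown event provides a hook for cleanup if needed.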



CONTAINERIZING THE SERVICE WITH DOCKER


Now that we have a working service, we need to containerize it. Docker allows us to package the service with all its dependencies, making it portable and easy to deploy.



Understanding Docker Images and Containers


A Docker image is a template that contains your application and everything it needs to run. An image is built from a Dockerfile, which specifies the build steps. A container is a running instance of an image. You can run multiple containers from the same image, each isolated from the others.


Docker images are built in layers. Each instruction in the Dockerfile creates a new layer. Layers are cached, so if you rebuild an image and a layer has not changed, Docker reuses the cached layer. This makes builds faster and more efficient.



Creating a Multi-Architecture Dockerfile


Our Dockerfile needs to support multiple GPU architectures. We will use build arguments to control which GPU support to include. Here is the Dockerfile:



# Dockerfile


# Use official Python base image

FROM python:3.11-slim as base


# Set working directory

WORKDIR /app


# Install system dependencies

RUN apt-get update && apt-get install -y \

    build-essential \

    cmake \

    git \

    wget \

    && rm -rf /var/lib/apt/lists/*


# Create a build stage for compiling llama.cpp with GPU support

FROM base as builder


# Build arguments for GPU support

ARG GPU_TYPE=none

ARG CUDA_VERSION=12.2.0

ARG ROCM_VERSION=5.7


# Install GPU-specific dependencies based on build argument

RUN if [ "$GPU_TYPE" = "cuda" ]; then \

        # Install CUDA toolkit

        wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.0-1_all.deb && \

        dpkg -i cuda-keyring_1.0-1_all.deb && \

        apt-get update && \

        apt-get install -y cuda-toolkit-$(echo $CUDA_VERSION | cut -d. -f1,2 | tr . -) && \

        rm cuda-keyring_1.0-1_all.deb; \

    elif [ "$GPU_TYPE" = "rocm" ]; then \

        # Install ROCm

        wget https://repo.radeon.com/amdgpu-install/latest/ubuntu/jammy/amdgpu-install_latest_all.deb && \

        apt-get install -y ./amdgpu-install_latest_all.deb && \

        amdgpu-install -y --usecase=rocm && \

        rm amdgpu-install_latest_all.deb; \

    fi


# Copy requirements file

COPY requirements.txt .


# Install Python dependencies with GPU support

# Note: the CMake option names depend on the llama-cpp-python release. Recent

# releases use the GGML_* names below; older releases used LLAMA_CUBLAS,

# LLAMA_HIPBLAS, LLAMA_METAL, and LLAMA_SYCL. Pin a version and check its README.

RUN if [ "$GPU_TYPE" = "cuda" ]; then \

        CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --no-cache-dir; \

    elif [ "$GPU_TYPE" = "rocm" ]; then \

        CMAKE_ARGS="-DGGML_HIPBLAS=on" pip install llama-cpp-python --no-cache-dir; \

    elif [ "$GPU_TYPE" = "metal" ]; then \

        CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python --no-cache-dir; \

    elif [ "$GPU_TYPE" = "sycl" ]; then \

        CMAKE_ARGS="-DGGML_SYCL=on" pip install llama-cpp-python --no-cache-dir; \

    else \

        pip install llama-cpp-python --no-cache-dir; \

    fi


# Install other Python dependencies

RUN pip install --no-cache-dir -r requirements.txt


# Final runtime stage

FROM base as runtime


# Copy installed packages from builder

COPY --from=builder /usr/local/lib/python3.11/site-packages /usr/local/lib/python3.11/site-packages

COPY --from=builder /usr/local/bin /usr/local/bin


# Copy application code

COPY app/ /app/app/


# Create directory for models

RUN mkdir -p /app/models


# Set environment variables

ENV PYTHONUNBUFFERED=1

ENV MODEL_PATH=/app/models/model.gguf


# Expose port

EXPOSE 8000


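# Health check (standard library only, because requests is not installed in this image)

HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \

    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/api/v1/health', timeout=5)"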


# Run the application

CMD ["python", "-m", "uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]



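This Dockerfile uses a multi-stage build. The builder stage compiles llama-cpp-python with GPU support selected by the GPU_TYPE build argument. The runtime stage copies only the installed Python packages from the builder, so the GPU toolkit installed in the builder is left behind. Two caveats apply. First, because the runtime stage also derives from the base stage, it still carries the compilers and tools installed there; moving those packages into the builder stage would shrink the final image. Second, a GPU build needs its vendor runtime libraries (for example the CUDA runtime and cuBLAS) to be present in the final image, so you will likely need to install them in the runtime stage or start it from a vendor-provided runtime image.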


The GPU_TYPE argument controls which GPU support to compile. Setting it to "cuda" compiles with CUDA support. Setting it to "rocm" compiles with ROCm support. Setting it to "metal" compiles with Metal support for Apple Silicon. Setting it to "sycl" compiles with SYCL support for Intel GPUs. Setting it to "none" or omitting it compiles CPU-only support.


The HEALTHCHECK instruction tells Docker how to check if the container is healthy. Docker periodically runs the command and marks the container as unhealthy if it fails. Docker and Docker Compose report this status. Kubernetes ignores the Dockerfile HEALTHCHECK and relies instead on the liveness and readiness probes defined in the Pod specification.
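A health probe that sticks to the Python standard library needs no extra package inside the image. The sketch below is one way to write such a probe as a small script. The function names is_healthy and exit_code are hypothetical. urllib.request.urlopen raises for connection failures and for HTTP error statuses, so both cases are reported as unhealthy.

```python
import urllib.error
import urllib.request


def is_healthy(url: str, timeout: float = 5.0) -> bool:
    """Return True only if the endpoint answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Covers refused connections, timeouts, HTTP 4xx/5xx, and malformed URLs.
        return False


def exit_code(url: str) -> int:
    """Map the probe result to a process exit status (0 healthy, 1 unhealthy)."""
    return 0 if is_healthy(url) else 1


# In a probe script, finish with:
#     sys.exit(exit_code("http://localhost:8000/api/v1/health"))
```

Saved as a file in the image, the script can be invoked from the HEALTHCHECK command in place of an inline one-liner, which keeps the Dockerfile readable.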



Creating the Requirements File


The requirements.txt file lists all Python dependencies:



# requirements.txt

fastapi==0.104.1

uvicorn[standard]==0.24.0

pydantic==2.5.0

pydantic-settings==2.1.0

python-multipart==0.0.6



Notice that llama-cpp-python is not in this file. We install it separately in the Dockerfile with appropriate CMAKE_ARGS for GPU support. This is necessary because the package needs to be compiled with specific flags for each GPU type.



Building Docker Images for Different GPU Types


To build an image with CUDA support:


docker build --build-arg GPU_TYPE=cuda -t llm-chat-service:cuda .


To build an image with ROCm support:


docker build --build-arg GPU_TYPE=rocm -t llm-chat-service:rocm .


To build an image with Metal support for Apple Silicon:


docker build --build-arg GPU_TYPE=metal -t llm-chat-service:metal .


To build an image with SYCL support for Intel GPUs:


docker build --build-arg GPU_TYPE=sycl -t llm-chat-service:sycl .


To build a CPU-only image:


docker build --build-arg GPU_TYPE=none -t llm-chat-service:cpu .



The build process takes several minutes because it compiles llama-cpp-python from source. The resulting image contains the service and its Python dependencies; the model file itself is mounted at run time.



Running the Docker Container


To run the container, you need to mount a volume containing your model file and set environment variables for configuration. Here is an example for CUDA:


docker run -d \

  --name llm-chat \

  --gpus all \

  -p 8000:8000 \

  -v /path/to/models:/app/models \

  -e MODEL_PATH=/app/models/your-model.gguf \

  -e GPU_TYPE=cuda \

  -e N_GPU_LAYERS=99 \

  llm-chat-service:cuda



The --gpus all flag makes all GPUs available to the container. For specific GPUs, use --gpus '"device=0,1"' to expose only GPUs 0 and 1.


For ROCm on AMD GPUs:


docker run -d \

  --name llm-chat \

  --device=/dev/kfd \

  --device=/dev/dri \

  --group-add video \

  -p 8000:8000 \

  -v /path/to/models:/app/models \

  -e MODEL_PATH=/app/models/your-model.gguf \

  -e GPU_TYPE=rocm \

  -e N_GPU_LAYERS=99 \

  llm-chat-service:rocm



ROCm requires exposing the /dev/kfd and /dev/dri devices and adding the container process to the video group for GPU access. On some distributions the render group is needed as well.


For Metal on Apple Silicon:


docker run -d \

  --name llm-chat \

  -p 8000:8000 \

  -v /path/to/models:/app/models \

  -e MODEL_PATH=/app/models/your-model.gguf \

  -e GPU_TYPE=metal \

  -e N_GPU_LAYERS=99 \

  llm-chat-service:metal



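No additional flags are needed for this variant, but be aware of a limitation: Docker on macOS runs containers inside a Linux virtual machine that has no access to Metal, so this container will typically fall back to CPU inference. To use Metal acceleration on Apple Silicon, run the service natively on the host instead of in a container.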



Creating a Docker Compose File


Docker Compose simplifies running multi-container applications. Here is a docker-compose.yml file for our service:



# docker-compose.yml

version: '3.8'


services:

  llm-chat:

    build:

      context: .

      dockerfile: Dockerfile

      args:

        GPU_TYPE: ${GPU_TYPE:-none}

    image: llm-chat-service:${GPU_TYPE:-none}

    container_name: llm-chat

    ports:

      - "${PORT:-8000}:8000"

    volumes:

      - ${MODEL_DIR:-./models}:/app/models

    environment:

      - MODEL_PATH=${MODEL_PATH:-/app/models/model.gguf}

      - GPU_TYPE=${GPU_TYPE:-none}

      - N_GPU_LAYERS=${N_GPU_LAYERS:-0}

      - CONTEXT_LENGTH=${CONTEXT_LENGTH:-4096}

      - MAX_TOKENS=${MAX_TOKENS:-2048}

      - TEMPERATURE=${TEMPERATURE:-0.7}

      - LOG_LEVEL=${LOG_LEVEL:-info}

    deploy:

      resources:

        reservations:

          devices:

            - driver: nvidia

              count: all

              capabilities: [gpu]

    restart: unless-stopped

    healthcheck:

      test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8000/api/v1/health', timeout=5)"]

      interval: 30s

      timeout: 10s

      retries: 3

      start_period: 60s



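This compose file uses environment variables for configuration, making it easy to customize without editing the file. Note that the deploy section reserves NVIDIA GPUs unconditionally, so remove or adapt it when running the CPU-only or ROCm variants. Create a .env file with your settings: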



# .env

GPU_TYPE=cuda

MODEL_DIR=/path/to/models

MODEL_PATH=/app/models/your-model.gguf

N_GPU_LAYERS=99

PORT=8000

CONTEXT_LENGTH=4096

MAX_TOKENS=2048

TEMPERATURE=0.7

LOG_LEVEL=info



Then start the service with:


docker-compose up -d



Docker Compose reads the .env file automatically and substitutes the values into the compose file.
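The ${NAME:-default} syntax used throughout the compose file falls back to the default when the variable is unset or empty. The short sketch below imitates that rule in Python purely to illustrate the resolution order. It is not how Compose is implemented, it handles only this one form of interpolation, and the function name resolve is hypothetical.

```python
import re
from typing import Mapping

_PATTERN = re.compile(r"\$\{(?P<name>[A-Za-z_][A-Za-z0-9_]*):-(?P<default>[^}]*)\}")


def resolve(template: str, env: Mapping[str, str]) -> str:
    """Replace each ${NAME:-default} with env[NAME], or the default if unset or empty."""

    def substitute(match: "re.Match[str]") -> str:
        value = env.get(match.group("name"), "")
        return value if value else match.group("default")

    return _PATTERN.sub(substitute, template)


# GPU_TYPE is set, PORT is set but empty, MODEL_DIR is not set at all.
env = {"GPU_TYPE": "cuda", "PORT": ""}
print(resolve("llm-chat-service:${GPU_TYPE:-none}", env))
print(resolve("${PORT:-8000}:8000", env))
print(resolve("${MODEL_DIR:-./models}:/app/models", env))
```

In practice this means that an empty entry such as PORT= in the .env file behaves the same as leaving the line out.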



DEPLOYING TO KUBERNETES


Kubernetes provides production-grade orchestration for containerized applications. Deploying our LLM service to Kubernetes enables automatic scaling, self-healing, and efficient resource management.



Understanding Kubernetes Concepts


Kubernetes organizes resources into several key abstractions. A Pod is the smallest deployable unit, typically containing one container. A Deployment manages a set of identical Pods, ensuring the desired number of replicas are running. A Service provides a stable network endpoint for accessing Pods, load balancing requests across replicas. A ConfigMap stores configuration data that Pods can consume. A PersistentVolume provides storage that persists beyond Pod lifecycles.


For GPU workloads, Kubernetes uses device plugins to expose GPUs to Pods. Each GPU vendor provides a device plugin that makes their GPUs available as schedulable resources. Pods can request GPU resources, and Kubernetes schedules them on nodes with available GPUs.



Installing GPU Support in Kubernetes


Before deploying our service, ensure your Kubernetes cluster has GPU support configured. For NVIDIA GPUs, install the NVIDIA device plugin:


kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/main/nvidia-device-plugin.yml



For AMD GPUs, install the AMD device plugin:


kubectl create -f https://raw.githubusercontent.com/RadeonOpenCompute/k8s-device-plugin/master/k8s-ds-amdgpu-dp.yaml



For Intel GPUs, install the Intel device plugin:


kubectl apply -k 'https://github.com/intel/intel-device-plugins-for-kubernetes/deployments/gpu_plugin?ref=main'



These device plugins run as DaemonSets, meaning one copy of the plugin runs on each eligible node in the cluster and advertises that node's GPUs as schedulable resources.



Creating Kubernetes Manifests


We need several Kubernetes resources to deploy our service. Let us create them one by one.


First, create a ConfigMap for configuration:



# k8s/configmap.yaml

apiVersion: v1

kind: ConfigMap

metadata:

  name: llm-chat-config

  namespace: default

data:

  MODEL_NAME: "local-llm"

  CONTEXT_LENGTH: "4096"

  MAX_TOKENS: "2048"

  TEMPERATURE: "0.7"

  TOP_P: "0.9"

  TOP_K: "40"

  REPEAT_PENALTY: "1.1"

  N_THREADS: "4"

  N_BATCH: "512"

  LOG_LEVEL: "info"

  HOST: "0.0.0.0"

  PORT: "8000"

  ENABLE_STREAMING: "true"

  MAX_CONCURRENT_REQUESTS: "10"



This ConfigMap stores non-sensitive configuration. We reference it in the Deployment to inject these values as environment variables.


Next, create a PersistentVolumeClaim for model storage:



# k8s/pvc.yaml

apiVersion: v1

kind: PersistentVolumeClaim

metadata:

  name: llm-models-pvc

  namespace: default

spec:

  accessModes:

    - ReadOnlyMany

  resources:

    requests:

      storage: 20Gi

  storageClassName: standard



This PVC requests 20 GiB of storage for model files. The ReadOnlyMany access mode allows multiple Pods to mount the volume simultaneously, which is safe because the models are read-only. Not every storage class supports ReadOnlyMany, so check what your cluster's provisioner offers. In practice, you would populate this volume with your model files before deploying the service.


Now create the Deployment:



# k8s/deployment.yaml

apiVersion: apps/v1

kind: Deployment

metadata:

  name: llm-chat-deployment

  namespace: default

  labels:

    app: llm-chat

spec:

  replicas: 2

  selector:

    matchLabels:

      app: llm-chat

  template:

    metadata:

      labels:

        app: llm-chat

    spec:

      containers:

      - name: llm-chat

        image: llm-chat-service:cuda

        imagePullPolicy: IfNotPresent

        ports:

        - containerPort: 8000

          name: http

          protocol: TCP

        env:

        - name: MODEL_PATH

          value: "/app/models/model.gguf"

        - name: GPU_TYPE

          value: "cuda"

        - name: N_GPU_LAYERS

          value: "99"

        - name: MAIN_GPU

          value: "0"

        envFrom:

        - configMapRef:

            name: llm-chat-config

        volumeMounts:

        - name: models

          mountPath: /app/models

          readOnly: true

        resources:

          requests:

            memory: "8Gi"

            cpu: "2"

            nvidia.com/gpu: "1"

          limits:

            memory: "16Gi"

            cpu: "4"

            nvidia.com/gpu: "1"

        livenessProbe:

          httpGet:

            path: /api/v1/health

            port: 8000

          initialDelaySeconds: 60

          periodSeconds: 30

          timeoutSeconds: 10

          failureThreshold: 3

        readinessProbe:

          httpGet:

            path: /api/v1/health

            port: 8000

          initialDelaySeconds: 30

          periodSeconds: 10

          timeoutSeconds: 5

          failureThreshold: 3

      volumes:

      - name: models

        persistentVolumeClaim:

          claimName: llm-models-pvc

      nodeSelector:

        accelerator: nvidia-gpu

      tolerations:

      - key: nvidia.com/gpu

        operator: Exists

        effect: NoSchedule



This Deployment creates two replicas of our LLM service. The replicas field specifies how many identical Pods to run. Kubernetes distributes these Pods across available nodes, providing redundancy and load distribution.


The resources section is critical for GPU workloads. The requests specify the resources the scheduler reserves for the Pod. Requesting nvidia.com/gpu: "1" tells Kubernetes this Pod needs one NVIDIA GPU, so Kubernetes only schedules the Pod on nodes with an unallocated GPU. The limits specify the maximum resources the Pod can use. For GPUs, the request must equal the limit, and GPUs are not shared or overcommitted between containers by default, so each Pod gets exclusive access to its GPU.


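The nodeSelector restricts Pods to nodes labeled accelerator=nvidia-gpu. Kubernetes does not add this label on its own, so label your GPU nodes accordingly or change the selector to a label your cluster already provides. The tolerations allow Pods to run on nodes tainted for GPU workloads. Taints and tolerations are Kubernetes mechanisms for dedicating nodes to specific workloads.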


The livenessProbe checks if the container is still running correctly. If the probe fails repeatedly, Kubernetes restarts the container. The readinessProbe checks if the container is ready to accept traffic. If the probe fails, Kubernetes removes the Pod from the Service load balancer until it passes again.
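Because large models can take a while to load, it helps to know how long a container has before a failing liveness probe triggers a restart. The sketch below works through that arithmetic for the probe values in the Deployment. It is an approximation under a simple assumption (the first probe runs after initialDelaySeconds and later probes follow every periodSeconds); the kubelet's actual timing can vary slightly, and the function name is illustrative.

```python
def seconds_until_restart(
    initial_delay: int,
    period: int,
    failure_threshold: int,
    timeout: int = 0,
) -> int:
    """Approximate time from container start until a never-passing
    liveness probe causes a restart."""
    # The first probe starts after initial_delay; each further probe
    # starts one period later. The restart follows the last allowed failure.
    last_probe_start = initial_delay + (failure_threshold - 1) * period
    # A probe that hangs is only counted as failed once it times out.
    return last_probe_start + timeout


if __name__ == "__main__":
    # Liveness values from the Deployment: 60s delay, 30s period,
    # 3 failures allowed, 10s timeout.
    print(seconds_until_restart(60, 30, 3, timeout=10))  # 130
```

With these settings the health endpoint has roughly two minutes to start answering. If model loading can take longer than that, the usual options are a larger initialDelaySeconds or a separate startupProbe.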


For AMD ROCm GPUs, modify the resources section:



        resources:

          requests:

            memory: "8Gi"

            cpu: "2"

            amd.com/gpu: "1"

          limits:

            memory: "16Gi"

            cpu: "4"

            amd.com/gpu: "1"



And update the nodeSelector:



      nodeSelector:

        accelerator: amd-gpu



For Intel GPUs, use:



        resources:

          requests:

            memory: "8Gi"

            cpu: "2"

            gpu.intel.com/i915: "1"

          limits:

            memory: "16Gi"

            cpu: "4"

            gpu.intel.com/i915: "1"



Now create a Service to expose the Deployment:



# k8s/service.yaml

apiVersion: v1

kind: Service

metadata:

  name: llm-chat-service

  namespace: default

  labels:

    app: llm-chat

spec:

  type: ClusterIP

  selector:

    app: llm-chat

  ports:

  - name: http

    port: 80

    targetPort: 8000

    protocol: TCP

  sessionAffinity: ClientIP

  sessionAffinityConfig:

    clientIP:

      timeoutSeconds: 3600



This Service creates a stable endpoint for accessing the LLM Pods. The ClusterIP type makes the Service accessible only within the cluster. For external access, you would use LoadBalancer or create an Ingress.


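The sessionAffinity setting matters if you keep conversation state in memory. Setting it to ClientIP makes the Service route requests from the same client IP to the same Pod for as long as the affinity timeout (3600 seconds here) has not expired and that Pod still exists. The affinity is based on the source IP the Service sees, so clients behind a shared proxy or NAT all land on one Pod, and traffic arriving through an Ingress controller may not follow this setting at all. For production systems, store conversation state externally, for example in Redis, so that any Pod can serve any request.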
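To make the idea of external conversation state concrete, here is a minimal sketch of a store that keeps each conversation as one JSON string under a key. Everything in it is hypothetical: the KeyValueBackend interface, the class names, and the conversation: key prefix are illustrative and are not taken from the service code. The in-memory backend only stands in for a real store so the example runs on its own; a Redis client offers comparable get, set, and delete operations, though it may return bytes rather than strings depending on how it is configured.

```python
import json
from typing import Dict, List, Optional, Protocol


class KeyValueBackend(Protocol):
    """Hypothetical minimal interface for an external key-value store."""

    def get(self, key: str) -> Optional[str]: ...
    def set(self, key: str, value: str) -> None: ...
    def delete(self, key: str) -> None: ...


class InMemoryBackend:
    """Dict-backed stand-in for tests; state is lost when the process exits."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        self._data[key] = value

    def delete(self, key: str) -> None:
        self._data.pop(key, None)


class ConversationStore:
    """Stores each conversation's message list as one JSON string."""

    def __init__(self, backend: KeyValueBackend) -> None:
        self.backend = backend

    @staticmethod
    def _key(conversation_id: str) -> str:
        return f"conversation:{conversation_id}"

    def load(self, conversation_id: str) -> List[Dict[str, str]]:
        raw = self.backend.get(self._key(conversation_id))
        return json.loads(raw) if raw else []

    def append(self, conversation_id: str, role: str, content: str) -> None:
        # Read-modify-write: not atomic, so concurrent writers to the
        # same conversation would need locking or a list data type.
        messages = self.load(conversation_id)
        messages.append({"role": role, "content": content})
        self.backend.set(self._key(conversation_id), json.dumps(messages))

    def delete(self, conversation_id: str) -> None:
        self.backend.delete(self._key(conversation_id))


if __name__ == "__main__":
    store = ConversationStore(InMemoryBackend())
    store.append("test-conv-123", "user", "What is 15 times 23?")
    print(store.load("test-conv-123"))
```

With state held outside the Pod, any replica can load the history for a conversation ID, which removes the need for session affinity.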


For external access, create an Ingress:



# k8s/ingress.yaml

apiVersion: networking.k8s.io/v1

kind: Ingress

metadata:

  name: llm-chat-ingress

  namespace: default

  annotations:

    nginx.ingress.kubernetes.io/proxy-body-size: "10m"

    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"

    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"

spec:

  ingressClassName: nginx

  rules:

  - host: llm-chat.example.com

    http:

      paths:

      - path: /

        pathType: Prefix

        backend:

          service:

            name: llm-chat-service

            port:

              number: 80



This Ingress routes external traffic to the Service. The annotations configure the NGINX ingress controller to handle large request bodies and long timeouts, both important for LLM services. Replace llm-chat.example.com with your actual domain.


For automatic scaling based on load, create a HorizontalPodAutoscaler:



# k8s/hpa.yaml

apiVersion: autoscaling/v2

kind: HorizontalPodAutoscaler

metadata:

  name: llm-chat-hpa

  namespace: default

spec:

  scaleTargetRef:

    apiVersion: apps/v1

    kind: Deployment

    name: llm-chat-deployment

  minReplicas: 2

  maxReplicas: 10

  metrics:

  - type: Resource

    resource:

      name: cpu

      target:

        type: Utilization

        averageUtilization: 70

  - type: Resource

    resource:

      name: memory

      target:

        type: Utilization

        averageUtilization: 80

  behavior:

    scaleDown:

      stabilizationWindowSeconds: 300

      policies:

      - type: Percent

        value: 50

        periodSeconds: 60

    scaleUp:

      stabilizationWindowSeconds: 60

      policies:

      - type: Percent

        value: 100

        periodSeconds: 30



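This HorizontalPodAutoscaler automatically adjusts the number of replicas based on CPU and memory usage. Utilization is measured relative to each Pod's resource requests, not its limits, and averaged across all Pods: when average CPU usage exceeds 70 percent of the requested CPU, or average memory usage exceeds 80 percent of the requested memory, it scales up. When usage drops, it scales down. Resource metrics require the Kubernetes Metrics Server to be installed in the cluster. The behavior section controls how quickly scaling happens. Scaling up happens quickly to handle traffic spikes, while scaling down happens slowly to avoid thrashing.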
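The replica count the autoscaler aims for follows a simple ratio rule: the current replica count is multiplied by the ratio of the observed metric to its target and rounded up. The sketch below reproduces that calculation with the minReplicas and maxReplicas values from the manifest. It is a simplified illustration: it assumes the default tolerance of 10 percent and ignores the behavior policies and stabilization windows, which further limit how fast the count may change. The function name is illustrative.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_utilization: float,
    target_utilization: float,
    min_replicas: int = 2,
    max_replicas: int = 10,
    tolerance: float = 0.1,
) -> int:
    """Replica count suggested by the HPA ratio rule for one metric."""
    ratio = current_utilization / target_utilization
    # Small deviations from the target are ignored to avoid flapping.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # The result is always kept within the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    print(desired_replicas(2, 90, 70))  # 3: CPU at 90% against a 70% target
    print(desired_replicas(4, 30, 70))  # 2: load dropped, scale down
    print(desired_replicas(2, 72, 70))  # 2: within tolerance, no change
```

When several metrics are configured, as in the manifest above, the autoscaler evaluates each one and uses the largest resulting replica count.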


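Note that GPU-based autoscaling is more complex because GPUs are discrete resources. Each Pod gets a whole GPU, so scaling happens in GPU-sized increments, and a new replica can only start if a node with a free GPU is available. CPU and memory can also be weak signals for a service whose work happens mostly on the GPU. For more sophisticated autoscaling, you would use custom metrics based on GPU utilization or request queue depth.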



Deploying to Kubernetes


To deploy all resources, apply the manifests in order:


kubectl apply -f k8s/configmap.yaml

kubectl apply -f k8s/pvc.yaml

kubectl apply -f k8s/deployment.yaml

kubectl apply -f k8s/service.yaml

kubectl apply -f k8s/ingress.yaml

kubectl apply -f k8s/hpa.yaml



Check the deployment status:


kubectl get deployments

kubectl get pods

kubectl get services



Watch the Pods start:


kubectl get pods -w



View logs from a Pod:


kubectl logs -f <pod-name>



If a Pod fails to start, describe it to see events and errors:


kubectl describe pod <pod-name>



Common issues include insufficient GPU resources, missing model files in the PersistentVolume, or incorrect environment variables. The describe command shows detailed information about why a Pod is not running.
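When several Pods are failing, scanning the describe output for each one gets tedious. The sketch below summarizes the most common failure signals from the JSON that kubectl get pods -l app=llm-chat -o json produces: Pods the scheduler could not place, and containers stuck in a waiting state. The function name and the sample data are illustrative; the sample is hand-written to resemble real output rather than captured from a cluster.

```python
from typing import Any, Dict, List, Tuple


def summarize_pod_problems(pod_list: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Return (pod name, reason) pairs for Pods that are not running."""
    problems = []
    for pod in pod_list.get("items", []):
        name = pod["metadata"]["name"]
        status = pod.get("status", {})
        # Unschedulable Pods carry the scheduler's explanation in the
        # PodScheduled condition (for example, not enough free GPUs).
        for cond in status.get("conditions", []):
            if cond.get("type") == "PodScheduled" and cond.get("status") == "False":
                reason = cond.get("message") or cond.get("reason", "unschedulable")
                problems.append((name, reason))
        # Containers that cannot start report a waiting reason such as
        # ImagePullBackOff or CrashLoopBackOff.
        for cs in status.get("containerStatuses", []):
            waiting = cs.get("state", {}).get("waiting")
            if waiting:
                problems.append((name, waiting.get("reason", "waiting")))
    return problems


if __name__ == "__main__":
    sample = {  # hand-written sample, shaped like kubectl's JSON output
        "items": [
            {
                "metadata": {"name": "llm-chat-a"},
                "status": {
                    "conditions": [
                        {
                            "type": "PodScheduled",
                            "status": "False",
                            "reason": "Unschedulable",
                            "message": "0/3 nodes are available: 3 Insufficient nvidia.com/gpu.",
                        }
                    ]
                },
            },
            {
                "metadata": {"name": "llm-chat-b"},
                "status": {
                    "containerStatuses": [
                        {"state": {"waiting": {"reason": "CrashLoopBackOff"}}}
                    ]
                },
            },
        ]
    }
    for pod_name, reason in summarize_pod_problems(sample):
        print(f"{pod_name}: {reason}")
```

A CrashLoopBackOff reason usually points at the application itself, such as a missing model file or a bad environment variable, so the next step in that case is the Pod's logs.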



TESTING THE SERVICE


Once deployed, test the service to ensure it works correctly. We will test both locally with Docker and in Kubernetes.



Testing with curl


The simplest test uses curl to send a request to the chat endpoint. For a local Docker container:


curl -X POST http://localhost:8000/api/v1/chat \
  -H "Content-Type: application/json" \
  -d '{
    "message": "What is the capital of France?",
    "system_prompt": "You are a helpful assistant.",
    "model_format": "chatml",
    "stream": false,
    "temperature": 0.7
  }'



This sends a simple question to the LLM. The response includes the generated text and metadata:


{

  "response": "The capital of France is Paris. Paris is not only the capital but also the largest city in France, known for its art, culture, fashion, and iconic landmarks like the Eiffel Tower.",

  "conversation_id": null,

  "finish_reason": "stop",

  "usage": {

    "prompt_tokens": 42,

    "completion_tokens": 38,

    "total_tokens": 80

  }

}



To test streaming responses:


curl -X POST http://localhost:8000/api/v1/chat \
  -H "Content-Type: application/json" \
  -d '{
    "message": "Write a short poem about AI.",
    "stream": true
  }' \
  --no-buffer



The --no-buffer flag prevents curl from buffering the output, allowing you to see tokens as they arrive. The response comes as server-sent events:


data: {"text": " In", "finish_reason": null}


data: {"text": " circuits", "finish_reason": null}


data: {"text": " deep", "finish_reason": null}


data: {"text": ",", "finish_reason": null}


data: [DONE]



Testing multi-turn conversations requires creating a conversation first:


curl -X POST http://localhost:8000/api/v1/conversations \
  -H "Content-Type: application/json" \
  -d '{
    "conversation_id": "test-conv-123",
    "system_prompt": "You are a helpful math tutor."
  }'



Then send messages with the conversation ID:


curl -X POST http://localhost:8000/api/v1/chat \
  -H "Content-Type: application/json" \
  -d '{
    "message": "What is 15 times 23?",
    "conversation_id": "test-conv-123"
  }'



Send a follow-up message:


curl -X POST http://localhost:8000/api/v1/chat \
  -H "Content-Type: application/json" \
  -d '{
    "message": "Now add 100 to that result.",
    "conversation_id": "test-conv-123"
  }'



The service maintains conversation context, so the second message can refer to the previous result without repeating it.


Retrieve the conversation history:


curl http://localhost:8000/api/v1/conversations/test-conv-123



Delete the conversation when done:


curl -X DELETE http://localhost:8000/api/v1/conversations/test-conv-123



Testing in Kubernetes


For Kubernetes deployments, first port-forward to access the Service locally:


kubectl port-forward service/llm-chat-service 8000:80



Then use the same curl commands as above, connecting to localhost:8000.


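Alternatively, if you configured an Ingress, access the service through its external domain. The example below uses HTTPS, which assumes TLS is configured for the host; the Ingress manifest shown earlier does not define a tls section, so use http:// until you add one:


curl -X POST https://llm-chat.example.com/api/v1/chat \
  -H "Content-Type: application/json" \
  -d '{
    "message": "Hello, how are you?"
  }'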



MONITORING AND OBSERVABILITY


Production services require monitoring to track health, performance, and usage. Kubernetes provides built-in tools, and you can add more sophisticated monitoring with Prometheus and Grafana.



Viewing Logs


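Kubernetes keeps the standard output and standard error of each container, and kubectl can read them from several Pods at once. View logs from all Pods in the Deployment:


kubectl logs -l app=llm-chat --tail=100 -f



This follows logs from all Pods with the app=llm-chat label, showing the last 100 lines from each Pod and streaming new entries. By default kubectl follows at most five Pods concurrently when a selector is used; raise --max-log-requests if the Deployment has scaled beyond that.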


For structured logging, modify the application to output JSON logs. This makes logs easier to parse and analyze with log aggregation tools like ELK (Elasticsearch, Logstash, Kibana) or Loki.
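One way to produce JSON logs with only the standard library is a custom logging.Formatter. The following is a minimal sketch, not the service's actual logging setup: the class name, the chosen field names, and the optional conversation_id field are assumptions made for illustration.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            entry["exception"] = self.formatException(record.exc_info)
        # Hypothetical extra field, supplied by the caller via
        # logger.info("...", extra={"conversation_id": "..."}).
        if hasattr(record, "conversation_id"):
            entry["conversation_id"] = record.conversation_id
        return json.dumps(entry)


def configure_json_logging(level: int = logging.INFO) -> None:
    """Replace the root logger's handlers with one JSON stream handler."""
    handler = logging.StreamHandler()  # writes to stderr by default
    handler.setFormatter(JsonFormatter())
    root = logging.getLogger()
    root.handlers.clear()
    root.addHandler(handler)
    root.setLevel(level)


if __name__ == "__main__":
    configure_json_logging()
    log = logging.getLogger("llm_chat")
    log.info("generated %d tokens", 38, extra={"conversation_id": "test-conv-123"})
```

Because the container runtime captures standard error along with standard output, these lines show up in kubectl logs unchanged and can be parsed by a log collector without custom patterns.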



Checking Resource Usage


Monitor resource usage with kubectl top:


kubectl top pods -l app=llm-chat



This shows CPU and memory usage for each Pod and depends on the Metrics Server being installed in the cluster. For GPU usage, you need to install a GPU monitoring solution like NVIDIA DCGM Exporter for NVIDIA GPUs or ROCm SMI Exporter for AMD GPUs.



Health Checks


The health endpoint provides service status:


curl http://localhost:8000/api/v1/health



The response includes model information:


{

  "status": "healthy",

  "model_info": {

    "model_name": "local-llm",

    "model_path": "/app/models/model.gguf",

    "context_length": 4096,

    "gpu_type": "cuda",

    "gpu_layers": 99

  }

}



Kubernetes uses this endpoint for liveness and readiness probes. If the endpoint returns an error or times out repeatedly, the readiness probe takes the Pod out of the Service's endpoints and the liveness probe restarts the container.
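The same endpoint is handy in scripts that need to wait for the service before running tests. The sketch below polls a health check until it succeeds or a deadline passes. The check itself is passed in as a function so the example stays self-contained; in practice it might issue a GET to /api/v1/health and compare the status field to "healthy". The function name and default values are illustrative.

```python
import time
from typing import Callable


def wait_until_healthy(
    check: Callable[[], bool],
    timeout: float = 120.0,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Call check() repeatedly until it returns True or timeout elapses."""
    deadline = clock() + timeout
    while True:
        try:
            if check():
                return True
        except Exception:
            # Connection errors while the model is still loading are
            # treated as "not ready yet" rather than as fatal.
            pass
        if clock() >= deadline:
            return False
        sleep(interval)


if __name__ == "__main__":
    attempts = {"count": 0}

    def fake_check() -> bool:
        # Stand-in for a real HTTP health request: succeeds on the third try.
        attempts["count"] += 1
        return attempts["count"] >= 3

    print(wait_until_healthy(fake_check, timeout=10, interval=0.01))  # True
```

The sleep and clock parameters exist only so the loop can be tested without real delays.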



BUILDING A CLIENT APPLICATION


To demonstrate using the LLM service, let us build a simple client application. We will create both a Python client library and a command-line interface.



Python Client Library


First, create a reusable client library that encapsulates the API calls:



# client/llm_client.py


import requests

from typing import Optional, Dict, Any, Iterator, List

import json



class LLMClientError(Exception):

    """Base exception for LLM client errors"""

    pass



class LLMClient:

    """Client for interacting with the LLM chat microservice"""

    

    def __init__(self, base_url: str, api_key: Optional[str] = None, timeout: int = 300):

        """

        Initialize the LLM client

        

        Args:

            base_url: Base URL of the LLM service (e.g., http://localhost:8000)

            api_key: Optional API key for authentication

            timeout: Request timeout in seconds

        """

        self.base_url = base_url.rstrip('/')

        self.api_key = api_key

        self.timeout = timeout

        self.session = requests.Session()

        

        if api_key:

            self.session.headers.update({'Authorization': f'Bearer {api_key}'})

    

    def chat(

        self,

        message: str,

        conversation_id: Optional[str] = None,

        system_prompt: Optional[str] = None,

        model_format: str = "chatml",

        stream: bool = False,

        max_tokens: Optional[int] = None,

        temperature: Optional[float] = None,

        top_p: Optional[float] = None,

        top_k: Optional[int] = None,

        stop: Optional[List[str]] = None

    ) -> "Dict[str, Any] | Iterator[str]":

        """

        Send a chat message and get a response

        

        Args:

            message: User message to send

            conversation_id: Optional conversation ID for multi-turn chat

            system_prompt: Optional system prompt for behavior control

            model_format: Prompt format (chatml, llama2, alpaca)

            stream: Whether to stream the response

            max_tokens: Maximum tokens to generate

            temperature: Sampling temperature

            top_p: Nucleus sampling parameter

            top_k: Top-k sampling parameter

            stop: List of stop sequences

            

        Returns:

            Response dictionary with generated text and metadata. When stream
            is True, an iterator yielding text chunks as they are generated.

            

        Raises:

            LLMClientError: If the request fails

        """

        url = f"{self.base_url}/api/v1/chat"

        

        payload = {

            "message": message,

            "model_format": model_format,

            "stream": stream

        }

        

        if conversation_id:

            payload["conversation_id"] = conversation_id

        if system_prompt:

            payload["system_prompt"] = system_prompt

        if max_tokens is not None:

            payload["max_tokens"] = max_tokens

        if temperature is not None:

            payload["temperature"] = temperature

        if top_p is not None:

            payload["top_p"] = top_p

        if top_k is not None:

            payload["top_k"] = top_k

        if stop is not None:

            payload["stop"] = stop

        

        try:

            if stream:

                return self._stream_chat(url, payload)

            else:

                response = self.session.post(url, json=payload, timeout=self.timeout)

                response.raise_for_status()

                return response.json()

        except requests.exceptions.RequestException as e:

            raise LLMClientError(f"Chat request failed: {e}")

    

    def _stream_chat(self, url: str, payload: Dict[str, Any]) -> Iterator[str]:

        """

        Stream chat response

        

        Args:

            url: API endpoint URL

            payload: Request payload

            

        Yields:

            Text chunks as they are generated

            

        Raises:

            LLMClientError: If streaming fails

        """

        try:
            # The with block releases the connection even if the consumer stops early
            with self.session.post(
                url,
                json=payload,
                stream=True,
                timeout=self.timeout
            ) as response:
                response.raise_for_status()

                for line in response.iter_lines():
                    if not line:
                        continue  # Blank lines separate events
                    line = line.decode('utf-8')
                    if not line.startswith('data: '):
                        continue
                    data = line[6:]  # Remove 'data: ' prefix
                    if data == '[DONE]':
                        break
                    try:
                        chunk = json.loads(data)
                    except json.JSONDecodeError:
                        continue  # Skip malformed events
                    if 'text' in chunk:
                        yield chunk['text']
                    elif 'error' in chunk:
                        raise LLMClientError(f"Server error: {chunk['error']}")
        except requests.exceptions.RequestException as e:
            raise LLMClientError(f"Streaming request failed: {e}") from e

    

    def create_conversation(

        self,

        conversation_id: str,

        system_prompt: Optional[str] = None

    ) -> Dict[str, Any]:

        """

        Create a new conversation

        

        Args:

            conversation_id: Unique identifier for the conversation

            system_prompt: Optional system prompt

            

        Returns:

            Conversation details

            

        Raises:

            LLMClientError: If creation fails

        """

        url = f"{self.base_url}/api/v1/conversations"

        

        payload = {"conversation_id": conversation_id}

        if system_prompt:

            payload["system_prompt"] = system_prompt

        

        try:

            response = self.session.post(url, json=payload, timeout=self.timeout)

            response.raise_for_status()

            return response.json()

        except requests.exceptions.RequestException as e:

            raise LLMClientError(f"Conversation creation failed: {e}")

    

    def get_conversation(self, conversation_id: str) -> Dict[str, Any]:

        """

        Get conversation details and history

        

        Args:

            conversation_id: Conversation identifier

            

        Returns:

            Conversation details with message history

            

        Raises:

            LLMClientError: If retrieval fails

        """

        url = f"{self.base_url}/api/v1/conversations/{conversation_id}"

        

        try:

            response = self.session.get(url, timeout=self.timeout)

            response.raise_for_status()

            return response.json()

        except requests.exceptions.RequestException as e:

            raise LLMClientError(f"Conversation retrieval failed: {e}")

    

    def delete_conversation(self, conversation_id: str) -> Dict[str, Any]:

        """

        Delete a conversation

        

        Args:

            conversation_id: Conversation identifier

            

        Returns:

            Deletion confirmation

            

        Raises:

            LLMClientError: If deletion fails

        """

        url = f"{self.base_url}/api/v1/conversations/{conversation_id}"

        

        try:

            response = self.session.delete(url, timeout=self.timeout)

            response.raise_for_status()

            return response.json()

        except requests.exceptions.RequestException as e:

            raise LLMClientError(f"Conversation deletion failed: {e}")

    

    def health_check(self) -> Dict[str, Any]:

        """

        Check service health

        

        Returns:

            Health status and model information

            

        Raises:

            LLMClientError: If health check fails

        """

        url = f"{self.base_url}/api/v1/health"

        

        try:

            response = self.session.get(url, timeout=10)

            response.raise_for_status()

            return response.json()

        except requests.exceptions.RequestException as e:

            raise LLMClientError(f"Health check failed: {e}")

    

    def close(self):

        """Close the client session"""

        self.session.close()

    

    def __enter__(self):

        """Context manager entry"""

        return self

    

    def __exit__(self, exc_type, exc_val, exc_tb):

        """Context manager exit"""

        self.close()



This client library provides a clean Python interface for the LLM service. It handles request formatting, error handling, and streaming responses. The context manager support allows using it with the with statement for automatic cleanup.



Command-Line Interface


Now create a command-line interface using the client library:



# client/cli.py


import argparse

import sys

import uuid

from llm_client import LLMClient, LLMClientError



def print_streaming_response(client: LLMClient, message: str, **kwargs):

    """

    Print a streaming response with real-time output

    

    Args:

        client: LLM client instance

        message: Message to send

        **kwargs: Additional chat parameters

    """

    print("Assistant: ", end='', flush=True)

    try:

        for chunk in client.chat(message, stream=True, **kwargs):

            print(chunk, end='', flush=True)

        print()  # New line after response

    except LLMClientError as e:

        print(f"\nError: {e}", file=sys.stderr)

        sys.exit(1)



def print_complete_response(client: LLMClient, message: str, **kwargs):

    """

    Print a complete response after generation finishes

    

    Args:

        client: LLM client instance

        message: Message to send

        **kwargs: Additional chat parameters

    """

    try:

        response = client.chat(message, stream=False, **kwargs)

        print(f"Assistant: {response['response']}")

        

        if response.get('usage'):

            usage = response['usage']

            print(f"\nTokens - Prompt: {usage['prompt_tokens']}, "

                  f"Completion: {usage['completion_tokens']}, "

                  f"Total: {usage['total_tokens']}")

    except LLMClientError as e:

        print(f"Error: {e}", file=sys.stderr)

        sys.exit(1)



def interactive_mode(client: LLMClient, args: argparse.Namespace):

    """

    Run interactive chat mode

    

    Args:

        client: LLM client instance

        args: Command-line arguments

    """

    # Create conversation if using multi-turn mode

    conversation_id = None

    if args.multi_turn:

        conversation_id = f"cli-{uuid.uuid4()}"

        try:

            client.create_conversation(conversation_id, args.system_prompt)

            print(f"Started conversation: {conversation_id}")

        except LLMClientError as e:

            print(f"Error creating conversation: {e}", file=sys.stderr)

            sys.exit(1)

    

    print("Interactive mode. Type 'exit' or 'quit' to end, 'clear' to start new conversation.")

    print()

    

    try:

        while True:

            try:

                user_input = input("You: ").strip()

            except (EOFError, KeyboardInterrupt):

                print()  # Finish the prompt line after Ctrl+D or Ctrl+C

                break

            

            if not user_input:

                continue

            

            if user_input.lower() in ['exit', 'quit']:

                break

            

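            if user_input.lower() == 'clear':
                if conversation_id:
                    new_id = f"cli-{uuid.uuid4()}"
                    try:
                        # Create the replacement first so a failure leaves the current conversation usable
                        client.create_conversation(new_id, args.system_prompt)
                        old_id, conversation_id = conversation_id, new_id
                        client.delete_conversation(old_id)
                        print("Started new conversation")
                    except LLMClientError as e:
                        print(f"Error resetting conversation: {e}", file=sys.stderr)
                else:
                    print("Nothing to clear: --multi-turn is not enabled")
                continue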

            

            # Send message

            chat_params = {

                'conversation_id': conversation_id,

                'system_prompt': args.system_prompt if not conversation_id else None,

                'model_format': args.format,

                'temperature': args.temperature,

                'max_tokens': args.max_tokens,

            }

            

            if args.stream:

                print_streaming_response(client, user_input, **chat_params)

            else:

                print_complete_response(client, user_input, **chat_params)

            

            print()

    

    finally:

        # Cleanup conversation

        if conversation_id:

            try:

                client.delete_conversation(conversation_id)

            except LLMClientError:

                pass



def single_message_mode(client: LLMClient, args: argparse.Namespace):

    """

    Send a single message and exit

    

    Args:

        client: LLM client instance

        args: Command-line arguments

    """

    chat_params = {

        'system_prompt': args.system_prompt,

        'model_format': args.format,

        'temperature': args.temperature,

        'max_tokens': args.max_tokens,

    }

    

    if args.stream:

        print_streaming_response(client, args.message, **chat_params)

    else:

        print_complete_response(client, args.message, **chat_params)



def health_check_mode(client: LLMClient):

    """

    Check service health and display information

    

    Args:

        client: LLM client instance

    """

    try:

        health = client.health_check()

        print(f"Status: {health['status']}")

        print("\nModel Information:")

        for key, value in health['model_info'].items():

            print(f"  {key}: {value}")

    except LLMClientError as e:

        print(f"Error: {e}", file=sys.stderr)

        sys.exit(1)



def main():

    """Main entry point for the CLI"""

    parser = argparse.ArgumentParser(

        description='Command-line client for LLM chat microservice',

        formatter_class=argparse.RawDescriptionHelpFormatter,

        epilog="""

Examples:

  # Interactive chat with streaming

  %(prog)s -i -s

  

  # Single message

  %(prog)s -m "What is the capital of France?"

  

  # Multi-turn conversation

  %(prog)s -i --multi-turn --system "You are a helpful math tutor"

  

  # Check service health

  %(prog)s --health

  

  # Custom service URL

  %(prog)s -u http://llm-chat.example.com -m "Hello"

        """

    )

    

    parser.add_argument(

        '-u', '--url',

        default='http://localhost:8000',

        help='Base URL of the LLM service (default: http://localhost:8000)'

    )

    

    parser.add_argument(

        '-k', '--api-key',

        help='API key for authentication'

    )

    

    parser.add_argument(

        '-i', '--interactive',

        action='store_true',

        help='Run in interactive mode'

    )

    

    parser.add_argument(

        '-m', '--message',

        help='Single message to send (non-interactive mode)'

    )

    

    parser.add_argument(

        '--health',

        action='store_true',

        help='Check service health and display model information'

    )

    

    parser.add_argument(

        '-s', '--stream',

        action='store_true',

        help='Stream responses in real-time'

    )

    

    parser.add_argument(

        '--multi-turn',

        action='store_true',

        help='Enable multi-turn conversation mode (maintains context)'

    )

    

    parser.add_argument(

        '--system',

        dest='system_prompt',

        help='System prompt to control assistant behavior'

    )

    

    parser.add_argument(

        '--format',

        choices=['chatml', 'llama2', 'alpaca'],

        default='chatml',

        help='Prompt format (default: chatml)'

    )

    

    parser.add_argument(

        '--temperature',

        type=float,

        default=0.7,

        help='Sampling temperature (default: 0.7)'

    )

    

    parser.add_argument(

        '--max-tokens',

        type=int,

        default=2048,

        help='Maximum tokens to generate (default: 2048)'

    )

    

    parser.add_argument(

        '--timeout',

        type=int,

        default=300,

        help='Request timeout in seconds (default: 300)'

    )

    

    args = parser.parse_args()

    

    # Validate arguments

    if not args.interactive and not args.message and not args.health:

        parser.error('Either --interactive, --message, or --health is required')

    

    # Create client

    with LLMClient(args.url, args.api_key, args.timeout) as client:

        if args.health:

            health_check_mode(client)

        elif args.interactive:

            interactive_mode(client, args)

        else:

            single_message_mode(client, args)



if __name__ == '__main__':

    main()



This command-line interface provides multiple modes of operation. Interactive mode allows ongoing conversations with the LLM. Single message mode sends one message and exits, useful for scripting. Health check mode verifies the service is running and displays model information.


The CLI supports all the features of the underlying service: streaming responses, multi-turn conversations, custom system prompts, and generation parameters. It provides a user-friendly interface for testing and using the LLM service.



Web-Based Client Application


For a more user-friendly interface, create a simple web application:



# client/web_app.py


from flask import Flask, render_template, request, jsonify, Response, stream_with_context

import uuid

import json

from llm_client import LLMClient, LLMClientError



app = Flask(__name__)


# Configuration

LLM_SERVICE_URL = "http://localhost:8000"

client = LLMClient(LLM_SERVICE_URL)



@app.route('/')

def index():

    """Render the main chat interface"""

    return render_template('index.html')



@app.route('/api/chat', methods=['POST'])

def chat():

    """Handle chat requests"""

    data = request.get_json(silent=True) or {}  # Tolerate missing or invalid JSON bodies

    message = data.get('message')

    conversation_id = data.get('conversation_id')

    stream = data.get('stream', False)

    

    if not message:

        return jsonify({'error': 'Message is required'}), 400

    

    try:

        if stream:

            def generate():

                try:

                    for chunk in client.chat(

                        message=message,

                        conversation_id=conversation_id,

                        stream=True

                    ):

                        yield f"data: {json.dumps({'text': chunk})}\n\n"

                    yield "data: [DONE]\n\n"

                except LLMClientError as e:

                    yield f"data: {json.dumps({'error': str(e)})}\n\n"

            

            return Response(

                stream_with_context(generate()),

                mimetype='text/event-stream'

            )

        else:

            response = client.chat(

                message=message,

                conversation_id=conversation_id,

                stream=False

            )

            return jsonify(response)

    

    except LLMClientError as e:

        return jsonify({'error': str(e)}), 500



@app.route('/api/conversations', methods=['POST'])

def create_conversation():

    """Create a new conversation"""

    data = request.get_json(silent=True) or {}  # Tolerate missing or invalid JSON bodies

    conversation_id = data.get('conversation_id') or str(uuid.uuid4())

    system_prompt = data.get('system_prompt')

    

    try:

        result = client.create_conversation(conversation_id, system_prompt)

        return jsonify(result)

    except LLMClientError as e:

        return jsonify({'error': str(e)}), 500



@app.route('/api/conversations/<conversation_id>', methods=['GET'])

def get_conversation(conversation_id):

    """Get conversation history"""

    try:

        result = client.get_conversation(conversation_id)

        return jsonify(result)

    except LLMClientError as e:

        return jsonify({'error': str(e)}), 404



@app.route('/api/conversations/<conversation_id>', methods=['DELETE'])

def delete_conversation(conversation_id):

    """Delete a conversation"""

    try:

        result = client.delete_conversation(conversation_id)

        return jsonify(result)

    except LLMClientError as e:

        return jsonify({'error': str(e)}), 404



@app.route('/api/health', methods=['GET'])

def health():

    """Check service health"""

    try:

        result = client.health_check()

        return jsonify(result)

    except LLMClientError as e:

        return jsonify({'error': str(e)}), 500



if __name__ == '__main__':

    # Debug mode exposes an interactive debugger; keep it off when binding to all interfaces
    app.run(host='0.0.0.0', port=5000, debug=False)



Create the HTML template for the web interface:



<!-- client/templates/index.html -->

<!DOCTYPE html>

<html lang="en">

<head>

    <meta charset="UTF-8">

    <meta name="viewport" content="width=device-width, initial-scale=1.0">

    <title>LLM Chat Interface</title>

    <style>

        * {

            margin: 0;

            padding: 0;

            box-sizing: border-box;

        }

        

        body {

            font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, Oxygen, Ubuntu, Cantarell, sans-serif;

            background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);

            height: 100vh;

            display: flex;

            justify-content: center;

            align-items: center;

            padding: 20px;

        }

        

        .container {

            background: white;

            border-radius: 10px;

            box-shadow: 0 10px 40px rgba(0,0,0,0.2);

            width: 100%;

            max-width: 800px;

            height: 90vh;

            display: flex;

            flex-direction: column;

        }

        

        .header {

            padding: 20px;

            border-bottom: 1px solid #e0e0e0;

            display: flex;

            justify-content: space-between;

            align-items: center;

        }

        

        .header h1 {

            font-size: 24px;

            color: #333;

        }

        

        .status {

            display: flex;

            align-items: center;

            gap: 8px;

        }

        

        .status-indicator {

            width: 10px;

            height: 10px;

            border-radius: 50%;

            background: #4caf50;

        }

        

        .status-indicator.offline {

            background: #f44336;

        }

        

        .chat-area {

            flex: 1;

            overflow-y: auto;

            padding: 20px;

            display: flex;

            flex-direction: column;

            gap: 16px;

        }

        

        .message {

            display: flex;

            gap: 12px;

            max-width: 80%;

        }

        

        .message.user {

            align-self: flex-end;

            flex-direction: row-reverse;

        }

        

        .message-avatar {

            width: 40px;

            height: 40px;

            border-radius: 50%;

            display: flex;

            align-items: center;

            justify-content: center;

            font-weight: bold;

            color: white;

            flex-shrink: 0;

        }

        

        .message.user .message-avatar {

            background: #667eea;

        }

        

        .message.assistant .message-avatar {

            background: #764ba2;

        }

        

        .message-content {

            background: #f5f5f5;

            padding: 12px 16px;

            border-radius: 12px;

            line-height: 1.5;

        }

        

        .message.user .message-content {

            background: #667eea;

            color: white;

        }

        

        .input-area {

            padding: 20px;

            border-top: 1px solid #e0e0e0;

        }

        

        .input-container {

            display: flex;

            gap: 12px;

        }

        

        #message-input {

            flex: 1;

            padding: 12px 16px;

            border: 2px solid #e0e0e0;

            border-radius: 24px;

            font-size: 14px;

            outline: none;

            transition: border-color 0.3s;

        }

        

        #message-input:focus {

            border-color: #667eea;

        }

        

        #send-button {

            padding: 12px 24px;

            background: #667eea;

            color: white;

            border: none;

            border-radius: 24px;

            font-size: 14px;

            font-weight: 600;

            cursor: pointer;

            transition: background 0.3s;

        }

        

        #send-button:hover {

            background: #5568d3;

        }

        

        #send-button:disabled {

            background: #ccc;

            cursor: not-allowed;

        }

        

        .controls {

            display: flex;

            gap: 12px;

            margin-bottom: 12px;

            flex-wrap: wrap;

        }

        

        .control-button {

            padding: 8px 16px;

            background: #f5f5f5;

            border: 1px solid #e0e0e0;

            border-radius: 16px;

            font-size: 12px;

            cursor: pointer;

            transition: all 0.3s;

        }

        

        .control-button:hover {

            background: #e0e0e0;

        }

        

        .control-button.active {

            background: #667eea;

            color: white;

            border-color: #667eea;

        }

        

        .loading {

            display: flex;

            gap: 4px;

            padding: 12px 16px;

        }

        

        .loading-dot {

            width: 8px;

            height: 8px;

            border-radius: 50%;

            background: #999;

            animation: loading 1.4s infinite ease-in-out both;

        }

        

        .loading-dot:nth-child(1) {

            animation-delay: -0.32s;

        }

        

        .loading-dot:nth-child(2) {

            animation-delay: -0.16s;

        }

        

        @keyframes loading {

            0%, 80%, 100% {

                transform: scale(0);

            }

            40% {

                transform: scale(1);

            }

        }

    </style>

</head>

<body>

    <div class="container">

        <div class="header">

            <h1>LLM Chat</h1>

            <div class="status">

                <div class="status-indicator" id="status-indicator"></div>

                <span id="status-text">Checking...</span>

            </div>

        </div>

        

        <div class="chat-area" id="chat-area">

            <div class="message assistant">

                <div class="message-avatar">AI</div>

                <div class="message-content">

                    Hello! I'm your AI assistant. How can I help you today?

                </div>

            </div>

        </div>

        

        <div class="input-area">

            <div class="controls">

                <button class="control-button active" id="stream-toggle">

                    Streaming: ON

                </button>

                <button class="control-button" id="multi-turn-toggle">

                    Multi-turn: OFF

                </button>

                <button class="control-button" id="clear-button">

                    Clear Chat

                </button>

            </div>

            <div class="input-container">

                <input 

                    type="text" 

                    id="message-input" 

                    placeholder="Type your message..."

                    autocomplete="off"

                >

                <button id="send-button">Send</button>

            </div>

        </div>

    </div>

    

    <script>

        // State management

        let isStreaming = true;

        let isMultiTurn = false;

        let conversationId = null;

        let isProcessing = false;

        

        // DOM elements

        const chatArea = document.getElementById('chat-area');

        const messageInput = document.getElementById('message-input');

        const sendButton = document.getElementById('send-button');

        const streamToggle = document.getElementById('stream-toggle');

        const multiTurnToggle = document.getElementById('multi-turn-toggle');

        const clearButton = document.getElementById('clear-button');

        const statusIndicator = document.getElementById('status-indicator');

        const statusText = document.getElementById('status-text');

        

        // Check service health on load

        checkHealth();

        

        // Event listeners

        sendButton.addEventListener('click', sendMessage);

        messageInput.addEventListener('keypress', (e) => {

            if (e.key === 'Enter' && !e.shiftKey) {

                e.preventDefault();

                sendMessage();

            }

        });

        

        streamToggle.addEventListener('click', () => {

            isStreaming = !isStreaming;

            streamToggle.textContent = `Streaming: ${isStreaming ? 'ON' : 'OFF'}`;

            streamToggle.classList.toggle('active');

        });

        

        multiTurnToggle.addEventListener('click', async () => {

            isMultiTurn = !isMultiTurn;

            multiTurnToggle.textContent = `Multi-turn: ${isMultiTurn ? 'ON' : 'OFF'}`;

            multiTurnToggle.classList.toggle('active');

            

            if (isMultiTurn && !conversationId) {

                await createConversation();

            } else if (!isMultiTurn && conversationId) {

                await deleteConversation();

            }

        });

        

        clearButton.addEventListener('click', clearChat);

        

        // Functions

        async function checkHealth() {

            try {

                const response = await fetch('/api/health');

                const data = await response.json();

                

                if (data.status === 'healthy') {

                    statusIndicator.classList.remove('offline');

                    statusText.textContent = 'Online';

                } else {

                    statusIndicator.classList.add('offline');

                    statusText.textContent = 'Offline';

                }

            } catch (error) {

                statusIndicator.classList.add('offline');

                statusText.textContent = 'Offline';

            }

        }

        

        async function createConversation() {

            try {

                const response = await fetch('/api/conversations', {

                    method: 'POST',

                    headers: {'Content-Type': 'application/json'},

                    body: JSON.stringify({})

                });

                if (!response.ok) {

                    throw new Error(`HTTP ${response.status}`);

                }

                const data = await response.json();

                conversationId = data.conversation_id;

            } catch (error) {

                console.error('Failed to create conversation:', error);

            }

        }

        

        async function deleteConversation() {

            if (!conversationId) return;

            

            try {

                await fetch(`/api/conversations/${conversationId}`, {

                    method: 'DELETE'

                });

                conversationId = null;

            } catch (error) {

                console.error('Failed to delete conversation:', error);

            }

        }

        

        function addMessage(role, content) {

            const messageDiv = document.createElement('div');

            messageDiv.className = `message ${role}`;

            

            const avatar = document.createElement('div');

            avatar.className = 'message-avatar';

            avatar.textContent = role === 'user' ? 'You' : 'AI';

            

            const contentDiv = document.createElement('div');

            contentDiv.className = 'message-content';

            contentDiv.textContent = content;

            

            messageDiv.appendChild(avatar);

            messageDiv.appendChild(contentDiv);

            chatArea.appendChild(messageDiv);

            chatArea.scrollTop = chatArea.scrollHeight;

            

            return contentDiv;

        }

        

        function addLoadingIndicator() {

            const messageDiv = document.createElement('div');

            messageDiv.className = 'message assistant';

            messageDiv.id = 'loading-message';

            

            const avatar = document.createElement('div');

            avatar.className = 'message-avatar';

            avatar.textContent = 'AI';

            

            const loadingDiv = document.createElement('div');

            loadingDiv.className = 'loading';

            loadingDiv.innerHTML = '<div class="loading-dot"></div><div class="loading-dot"></div><div class="loading-dot"></div>';

            

            messageDiv.appendChild(avatar);

            messageDiv.appendChild(loadingDiv);

            chatArea.appendChild(messageDiv);

            chatArea.scrollTop = chatArea.scrollHeight;

            

            return messageDiv;

        }

        

        function removeLoadingIndicator() {

            const loadingMessage = document.getElementById('loading-message');

            if (loadingMessage) {

                loadingMessage.remove();

            }

        }

        

        async function sendMessage() {

            if (isProcessing) return;

            

            const message = messageInput.value.trim();

            if (!message) return;

            

            isProcessing = true;

            sendButton.disabled = true;

            messageInput.value = '';

            

            // Add user message

            addMessage('user', message);

            

            try {

                if (isStreaming) {

                    await handleStreamingResponse(message);

                } else {

                    await handleCompleteResponse(message);

                }

            } catch (error) {

                console.error('Error sending message:', error);

                addMessage('assistant', 'Sorry, an error occurred. Please try again.');

            } finally {

                isProcessing = false;

                sendButton.disabled = false;

                messageInput.focus();

            }

        }

        

        async function handleStreamingResponse(message) {

            const response = await fetch('/api/chat', {

                method: 'POST',

                headers: {'Content-Type': 'application/json'},

                body: JSON.stringify({

                    message: message,

                    conversation_id: conversationId,

                    stream: true

                })

            });

            

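            if (!response.ok) {

                throw new Error(`HTTP ${response.status}`);

            }

            

            const reader = response.body.getReader();

            const decoder = new TextDecoder();

            let contentDiv = null;

            let fullText = '';

            let buffer = '';

            let finished = false;

            

            while (!finished) {

                const {done, value} = await reader.read();

                if (done) break;

                

                // stream: true keeps multi-byte characters intact across chunk boundaries

                buffer += decoder.decode(value, {stream: true});

                const lines = buffer.split('\n');

                // The last element may be an incomplete line; keep it for the next chunk

                buffer = lines.pop();

                

                for (const line of lines) {

                    if (!line.startsWith('data: ')) continue;

                    

                    const data = line.slice(6).trim();

                    if (data === '[DONE]') {

                        // Stop reading entirely, not just this batch of lines

                        finished = true;

                        break;

                    }

                    

                    try {

                        const json = JSON.parse(data);

                        if (json.text) {

                            if (!contentDiv) {

                                contentDiv = addMessage('assistant', '');

                            }

                            fullText += json.text;

                            contentDiv.textContent = fullText;

                            chatArea.scrollTop = chatArea.scrollHeight;

                        }

                    } catch (e) {

                        // Ignore malformed events

                    }

                }

            }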

        }

        

        async function handleCompleteResponse(message) {

            addLoadingIndicator();

            

            let response;

            try {

                response = await fetch('/api/chat', {

                    method: 'POST',

                    headers: {'Content-Type': 'application/json'},

                    body: JSON.stringify({

                        message: message,

                        conversation_id: conversationId,

                        stream: false

                    })

                });

            } finally {

                // Remove the indicator even if the request fails

                removeLoadingIndicator();

            }

            

            const data = await response.json();

            if (data.response) {

                addMessage('assistant', data.response);

            } else if (data.error) {

                addMessage('assistant', `Error: ${data.error}`);

            }

        }

        

        async function clearChat() {

            if (conversationId) {

                await deleteConversation();

                if (isMultiTurn) {

                    await createConversation();

                }

            }

            

            chatArea.innerHTML = '';

            addMessage('assistant', 'Chat cleared. How can I help you?');

        }

    </script>

</body>

</html>



This web application provides a polished chat interface with real-time streaming, multi-turn conversations, and visual feedback. Users can toggle streaming mode and multi-turn mode, clear the chat, and see the service status.


To run the web application, install Flask:


pip install flask


Then start the server:


python client/web_app.py


Access the interface at http://localhost:5000 in your web browser.



PRODUCTION-READY COMPLETE EXAMPLE


Now let us assemble the complete, production-ready code for the entire system. This includes all files needed to deploy and run the service.



Complete Project Structure:


llm-chat-microservice/

├── app/

│   ├── __init__.py

│   ├── main.py

│   ├── config.py

│   ├── models/

│   │   ├── __init__.py

│   │   └── llm.py

│   ├── services/

│   │   ├── __init__.py

│   │   └── chat_service.py

│   └── routes/

│       ├── __init__.py

│       └── chat.py

├── client/

│   ├── llm_client.py

│   ├── cli.py

│   ├── web_app.py

│   └── templates/

│       └── index.html

├── k8s/

│   ├── configmap.yaml

│   ├── pvc.yaml

│   ├── deployment.yaml

│   ├── service.yaml

│   ├── ingress.yaml

│   └── hpa.yaml

├── Dockerfile

├── docker-compose.yml

├── requirements.txt

├── .env.example

└── README.md



Complete app/__init__.py:



# app/__init__.py


"""

LLM Chat Microservice


A production-ready microservice for serving local LLM models with

multi-GPU support across NVIDIA CUDA, AMD ROCm, Apple Metal, and Intel SYCL.

"""


__version__ = "1.0.0"



Complete app/models/__init__.py:



# app/models/__init__.py


from app.models.llm import LLMInference


__all__ = ['LLMInference']



Complete app/services/__init__.py:



# app/services/__init__.py


from app.services.chat_service import ChatService, Conversation, Message


__all__ = ['ChatService', 'Conversation', 'Message']



Complete app/routes/__init__.py:



# app/routes/__init__.py


from app.routes import chat


__all__ = ['chat']



Complete .env.example:



# .env.example

# Copy this file to .env and configure for your environment


# Model configuration

MODEL_PATH=/app/models/model.gguf

MODEL_NAME=local-llm

CONTEXT_LENGTH=4096

MAX_TOKENS=2048

TEMPERATURE=0.7

TOP_P=0.9

TOP_K=40

REPEAT_PENALTY=1.1


# GPU configuration

# Options: cuda, rocm, metal, sycl, none

GPU_TYPE=cuda

N_GPU_LAYERS=99

MAIN_GPU=0

# For multi-GPU, specify tensor split as comma-separated values

# TENSOR_SPLIT=0.6,0.4


# Server configuration

HOST=0.0.0.0

PORT=8000

WORKERS=1

LOG_LEVEL=info


# Performance configuration

N_THREADS=4

N_BATCH=512


# API configuration

# API_KEY=your-secret-key-here

ENABLE_STREAMING=true

MAX_CONCURRENT_REQUESTS=10
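

To illustrate how these variables can be turned into typed settings, here is a minimal sketch. It is not the project's app/config.py; the names `EnvSettings` and `load_settings` are hypothetical, only a subset of the variables is covered, and the defaults simply mirror the example file above. The function reads from any mapping, so it does not parse the .env file itself.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

VALID_GPU_TYPES = {"cuda", "rocm", "metal", "sycl", "none"}


@dataclass(frozen=True)
class EnvSettings:
    """Hypothetical typed view of a few variables from .env.example."""
    model_path: str
    context_length: int
    temperature: float
    gpu_type: str
    n_gpu_layers: int
    tensor_split: Optional[list[float]]
    enable_streaming: bool


def _parse_bool(value: str) -> bool:
    # Accept the usual spellings of "true" found in env files.
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env: Mapping[str, str] = os.environ) -> EnvSettings:
    """Build settings from environment-style key/value pairs."""
    gpu_type = env.get("GPU_TYPE", "none").strip().lower()
    if gpu_type not in VALID_GPU_TYPES:
        raise ValueError(f"Unsupported GPU_TYPE: {gpu_type!r}")

    # TENSOR_SPLIT is optional; when set it is a comma-separated list.
    raw_split = env.get("TENSOR_SPLIT", "").strip()
    tensor_split = [float(p) for p in raw_split.split(",")] if raw_split else None

    return EnvSettings(
        model_path=env.get("MODEL_PATH", "/app/models/model.gguf"),
        context_length=int(env.get("CONTEXT_LENGTH", "4096")),
        temperature=float(env.get("TEMPERATURE", "0.7")),
        gpu_type=gpu_type,
        n_gpu_layers=int(env.get("N_GPU_LAYERS", "0")),
        tensor_split=tensor_split,
        enable_streaming=_parse_bool(env.get("ENABLE_STREAMING", "true")),
    )
```

Validating GPU_TYPE at startup means a typo fails fast instead of silently falling back to CPU inference.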



Complete README.md:



# LLM Chat Microservice


A production-ready microservice for serving local Large Language Models with comprehensive multi-GPU architecture support including NVIDIA CUDA, AMD ROCm, Apple Metal, and Intel SYCL.


## Features


- Local LLM inference with no dependency on external APIs

- Multi-GPU architecture support (NVIDIA, AMD, Apple, Intel)

- RESTful API with automatic documentation

- Streaming and non-streaming response modes

- Multi-turn conversation management

- Docker containerization with multi-stage builds

- Kubernetes deployment with auto-scaling

- Production-ready error handling and logging

- Comprehensive client libraries and CLI tools


## Quick Start


### Prerequisites


- Python 3.10 or later

- Docker (for containerized deployment)

- Kubernetes cluster (for orchestrated deployment)

- GPU drivers for your hardware (CUDA, ROCm, Metal, or SYCL)

- LLM model file in GGUF format


### Local Development


1. Clone the repository and navigate to the project directory


2. Create a virtual environment:

   python -m venv venv

   source venv/bin/activate  # On Windows: venv\Scripts\activate


3. Install dependencies:

   pip install -r requirements.txt


4. Download an LLM model in GGUF format and note its path


5. Configure environment variables:

   cp .env.example .env

   # Edit .env with your settings


6. Run the service:

   python -m app.main


7. Access the API documentation at http://localhost:8000/docs


### Docker Deployment


Build the image with GPU support:


# For NVIDIA CUDA

docker build --build-arg GPU_TYPE=cuda -t llm-chat-service:cuda .


# For AMD ROCm

docker build --build-arg GPU_TYPE=rocm -t llm-chat-service:rocm .


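# For Apple Metal (containers on macOS run in a Linux VM without Metal access; run the service natively for GPU acceleration)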

docker build --build-arg GPU_TYPE=metal -t llm-chat-service:metal .


# For Intel SYCL

docker build --build-arg GPU_TYPE=sycl -t llm-chat-service:sycl .


# CPU only

docker build --build-arg GPU_TYPE=none -t llm-chat-service:cpu .


Run the container:


docker run -d \

  --name llm-chat \

  --gpus all \

  -p 8000:8000 \

  -v /path/to/models:/app/models \

  -e MODEL_PATH=/app/models/your-model.gguf \

  -e GPU_TYPE=cuda \

  -e N_GPU_LAYERS=99 \

  llm-chat-service:cuda


### Docker Compose Deployment


Configure your environment:


cp .env.example .env

# Edit .env with your settings


Start the service:


docker-compose up -d


### Kubernetes Deployment


1. Ensure GPU device plugins are installed on your cluster


2. Create the model PersistentVolume and populate it with your model files


3. Deploy the service:

   kubectl apply -f k8s/


4. Check deployment status:

   kubectl get pods -l app=llm-chat


5. Access the service:

   kubectl port-forward service/llm-chat-service 8000:80


## API Usage


### Health Check


GET /api/v1/health


Returns service status and model information.


### Single Message Chat


POST /api/v1/chat

Content-Type: application/json


{

  "message": "What is the capital of France?",

  "system_prompt": "You are a helpful assistant.",

  "model_format": "chatml",

  "stream": false,

  "temperature": 0.7,

  "max_tokens": 2048

}


### Streaming Chat


POST /api/v1/chat

Content-Type: application/json


{

  "message": "Write a poem about AI.",

  "stream": true

}


Response is sent as server-sent events.
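

Network chunks do not necessarily align with event boundaries, so a client should buffer partial lines before parsing. The following is a minimal sketch of an incremental parser. It assumes the event format consumed by the web interface (`data: {"text": "..."}` lines, terminated by `data: [DONE]`); the class name `SSETextParser` is hypothetical and not part of the client library.

```python
import codecs
import json


class SSETextParser:
    """Incrementally extract text fragments from an SSE byte stream.

    Assumes each event is one line of the form `data: {"text": "..."}`
    and that the stream ends with `data: [DONE]`.
    """

    def __init__(self):
        # An incremental decoder keeps multi-byte UTF-8 characters intact
        # even when they are split across two network chunks.
        self._decoder = codecs.getincrementaldecoder("utf-8")()
        self._buffer = ""
        self.done = False

    def feed(self, chunk: bytes) -> list[str]:
        """Consume one chunk and return the text fragments it completed."""
        fragments = []
        self._buffer += self._decoder.decode(chunk)
        # Everything before the last newline is complete; keep the rest.
        *lines, self._buffer = self._buffer.split("\n")
        for line in lines:
            line = line.rstrip("\r")
            if self.done or not line.startswith("data: "):
                continue
            payload = line[len("data: "):]
            if payload == "[DONE]":
                self.done = True
                continue
            try:
                event = json.loads(payload)
            except json.JSONDecodeError:
                continue  # skip malformed events
            if isinstance(event, dict) and event.get("text"):
                fragments.append(event["text"])
        return fragments
```

Feed the parser each raw chunk as it arrives from the HTTP response and append the returned fragments to the output; stop reading once `parser.done` is true.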


### Multi-Turn Conversation


Create a conversation:


POST /api/v1/conversations

Content-Type: application/json


{

  "conversation_id": "my-conversation",

  "system_prompt": "You are a helpful math tutor."

}


Send messages:


POST /api/v1/chat

Content-Type: application/json


{

  "message": "What is 15 times 23?",

  "conversation_id": "my-conversation"

}


Get conversation history:


GET /api/v1/conversations/my-conversation


Delete conversation:


DELETE /api/v1/conversations/my-conversation


## Client Usage


### Python Client Library


from client.llm_client import LLMClient


with LLMClient("http://localhost:8000") as client:

    # Single message

    response = client.chat("Hello, how are you?")

    print(response['response'])

    

    # Streaming

    for chunk in client.chat("Tell me a story", stream=True):

        print(chunk, end='', flush=True)

    

    # Multi-turn conversation

    client.create_conversation("conv-123", "You are a helpful assistant")

    client.chat("What is Python?", conversation_id="conv-123")

    client.chat("How do I install it?", conversation_id="conv-123")


### Command-Line Interface


# Interactive mode with streaming

python client/cli.py -i -s


# Single message

python client/cli.py -m "What is the capital of France?"


# Multi-turn conversation

python client/cli.py -i --multi-turn --system "You are a helpful tutor"


# Health check

python client/cli.py --health


# Custom service URL

python client/cli.py -u http://llm-chat.example.com -m "Hello"


### Web Interface


Start the web application:


python client/web_app.py


Access the interface at http://localhost:5000


## Configuration


All configuration is done through environment variables. See .env.example for all available options.


Key configuration parameters:


- MODEL_PATH: Path to the GGUF model file

- GPU_TYPE: GPU acceleration type (cuda, rocm, metal, sycl, none)

- N_GPU_LAYERS: Number of model layers to offload to GPU (a value at or above the model's layer count, such as 99, offloads all layers)

- CONTEXT_LENGTH: Maximum context window size

- MAX_TOKENS: Maximum tokens to generate per response

- TEMPERATURE: Sampling temperature (higher values give more varied output, lower values more deterministic output)
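

To make the effect of TEMPERATURE concrete, the sketch below applies temperature scaling to a softmax over made-up logits. It illustrates the general idea only and is not the sampling code used by the inference backend; the logits are hypothetical.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
    for t in (0.2, 0.7, 1.5):
        probs = softmax_with_temperature(logits, t)
        print(t, [round(p, 3) for p in probs])
```

Low temperatures concentrate probability on the top-scoring token, while high temperatures flatten the distribution so that less likely tokens are sampled more often.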


## GPU Configuration


### NVIDIA CUDA


Set GPU_TYPE=cuda and ensure the CUDA toolkit is installed. The service will automatically detect and use available NVIDIA GPUs.


For multi-GPU setups, use TENSOR_SPLIT to distribute the model:

TENSOR_SPLIT=0.6,0.4
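

The values are relative proportions, one per GPU. As a rough illustration of what a split means in practice, the sketch below converts proportions into approximate layer counts per device. The helper `split_layers` is hypothetical, and the backend's actual allocation may differ.

```python
def split_layers(n_layers: int, tensor_split: list[float]) -> list[int]:
    """Distribute n_layers across GPUs in proportion to tensor_split."""
    if n_layers < 0 or not tensor_split or any(p < 0 for p in tensor_split):
        raise ValueError("need a non-negative layer count and proportions")
    total = sum(tensor_split)
    if total == 0:
        raise ValueError("at least one proportion must be positive")

    exact = [n_layers * p / total for p in tensor_split]
    counts = [int(x) for x in exact]  # round down first

    # Hand out the layers lost to rounding, largest remainder first.
    leftovers = n_layers - sum(counts)
    by_remainder = sorted(
        range(len(exact)), key=lambda i: exact[i] - counts[i], reverse=True
    )
    for i in by_remainder[:leftovers]:
        counts[i] += 1
    return counts


if __name__ == "__main__":
    # A hypothetical 32-layer model split 60/40 across two GPUs.
    print(split_layers(32, [0.6, 0.4]))
```

Giving the larger share to the GPU with more free VRAM is a common starting point; adjust the proportions if one device runs out of memory.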


### AMD ROCm


Set GPU_TYPE=rocm and ensure ROCm is installed. The service uses HIP to interface with AMD GPUs.


### Apple Metal


Set GPU_TYPE=metal on Apple Silicon Macs. Metal support is automatic and uses the unified memory architecture for efficient inference.


### Intel GPUs


Set GPU_TYPE=sycl and ensure Intel oneAPI or OpenCL runtime is installed. Supports both integrated and discrete Intel GPUs.


## Performance Tuning


- Increase N_GPU_LAYERS to offload more computation to GPU

- Adjust N_BATCH for optimal throughput (higher values use more memory but speed up prompt processing)

- Set N_THREADS based on your CPU core count

- Use quantized models (Q4_K_M or Q5_K_M) for better performance

- Enable streaming for better perceived responsiveness


## Monitoring


View logs:

docker logs -f llm-chat

kubectl logs -f -l app=llm-chat


Check resource usage:

docker stats llm-chat

kubectl top pods -l app=llm-chat


Monitor GPU usage:

nvidia-smi  # NVIDIA

rocm-smi    # AMD
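

For scripted monitoring on NVIDIA hardware, the output of nvidia-smi can be parsed programmatically. The sketch below uses the `--query-gpu` and `--format=csv,noheader,nounits` options; the function names are hypothetical, and fields that the driver reports as unavailable are returned as None.

```python
import subprocess
from typing import Optional

QUERY_FIELDS = "index,utilization.gpu,memory.used,memory.total"


def _to_int(field: str) -> Optional[int]:
    """Return the field as an int, or None for values such as [N/A]."""
    field = field.strip()
    return int(field) if field.isdigit() else None


def parse_gpu_stats(csv_text: str) -> list[dict]:
    """Parse `nvidia-smi --format=csv,noheader,nounits` output."""
    stats = []
    for line in csv_text.splitlines():
        if not line.strip():
            continue
        index, util, used, total = line.split(",")
        stats.append({
            "index": _to_int(index),
            "utilization_pct": _to_int(util),
            "memory_used_mib": _to_int(used),
            "memory_total_mib": _to_int(total),
        })
    return stats


def query_gpu_stats() -> list[dict]:
    """Run nvidia-smi and return one dict per GPU (needs NVIDIA drivers)."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_stats(result.stdout)
```

Calling `query_gpu_stats()` periodically and logging the result gives a simple view of utilization and memory pressure over time.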


## Troubleshooting


### Model fails to load


- Verify MODEL_PATH points to a valid GGUF file

- Ensure sufficient memory (RAM or VRAM) for the model

- Check file permissions on the model file


### GPU not detected


- Verify GPU drivers are installed correctly

- Check GPU_TYPE matches your hardware

- Ensure Docker has GPU access (--gpus flag for NVIDIA; device mappings such as --device=/dev/kfd --device=/dev/dri for AMD ROCm)

- Verify Kubernetes GPU device plugin is running


### Slow inference


- Increase N_GPU_LAYERS to use more GPU

- Use a smaller or more quantized model

- Reduce CONTEXT_LENGTH if not needed

- Check GPU utilization with nvidia-smi or rocm-smi


### Out of memory errors


- Use a smaller model or a more aggressively quantized one (for example Q4_K_M instead of Q8_0)

- Reduce CONTEXT_LENGTH

- Reduce N_BATCH

- Offload fewer layers to GPU (reduce N_GPU_LAYERS)
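

To see why CONTEXT_LENGTH matters for memory, the sketch below estimates the size of the key/value cache using the usual transformer accounting: 2 (keys and values) x layers x context length x KV heads x head dimension x bytes per element. It assumes 16-bit cache entries, and the model dimensions in the example are hypothetical values typical of a 7B-parameter model. Actual usage also includes the model weights and compute buffers.

```python
def kv_cache_bytes(
    n_layers: int,
    n_ctx: int,
    n_kv_heads: int,
    head_dim: int,
    bytes_per_element: int = 2,  # assumption: 16-bit cache entries
) -> int:
    """Estimate KV cache size: keys and values for every layer and position."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_element


def to_gib(n_bytes: int) -> float:
    """Convert a byte count to GiB."""
    return n_bytes / 2**30


if __name__ == "__main__":
    # Hypothetical dimensions: 32 layers, 32 KV heads, head dimension 128.
    for n_ctx in (2048, 4096, 8192):
        size = kv_cache_bytes(n_layers=32, n_ctx=n_ctx, n_kv_heads=32, head_dim=128)
        print(f"context {n_ctx}: {to_gib(size):.2f} GiB")
```

The estimate grows linearly with the context length, so halving CONTEXT_LENGTH halves the cache.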


## Architecture


The service follows clean architecture principles with clear separation of concerns:


- API Layer (routes/): HTTP endpoints and request/response handling

- Service Layer (services/): Business logic and conversation management

- Inference Layer (models/): LLM loading and inference

- Configuration Layer (config.py): Settings and environment management


This architecture enables:

- Easy testing of individual components

- Flexibility to swap implementations

- Clear dependency flow

- Maintainable and extensible code
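

As a small illustration of the "swap implementations" point, the sketch below shows a service layer that depends only on an interface, so a stand-in backend can replace the real model in tests. All names here (`InferenceBackend`, `EchoBackend`, `SimpleChatService`) are hypothetical and simplified; they are not the classes in app/.

```python
from typing import Protocol


class InferenceBackend(Protocol):
    """Anything that can turn a prompt into generated text."""

    def generate(self, prompt: str) -> str: ...


class EchoBackend:
    """Stand-in backend for tests; a real one would wrap the loaded model."""

    def generate(self, prompt: str) -> str:
        return f"echo: {prompt}"


class SimpleChatService:
    """Service layer: owns conversation state, delegates text generation."""

    def __init__(self, backend: InferenceBackend):
        self._backend = backend
        self._history: dict[str, list[tuple[str, str]]] = {}

    def chat(self, conversation_id: str, message: str) -> str:
        history = self._history.setdefault(conversation_id, [])
        history.append(("user", message))
        # Build the prompt from the whole conversation so far.
        prompt = "\n".join(f"{role}: {text}" for role, text in history)
        reply = self._backend.generate(prompt)
        history.append(("assistant", reply))
        return reply

    def history(self, conversation_id: str) -> list[tuple[str, str]]:
        """Return a copy of the stored messages for one conversation."""
        return list(self._history.get(conversation_id, []))
```

Because the service never imports the inference code directly, conversation logic can be unit-tested without loading a model or a GPU.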


## License


This project is provided as-is for educational and commercial use.


## Support


For issues and questions, please refer to the documentation or create an issue in the project repository.



This completes the comprehensive guide to building, deploying, and using an LLM chat microservice with full multi-GPU architecture support. The system is production-ready with proper error handling, logging, monitoring, and client tools. It supports deployment in Docker containers and Kubernetes clusters, with automatic scaling and self-healing capabilities. The included client applications demonstrate how to integrate the service into various types of applications, from command-line tools to web interfaces.
