INTRODUCTION: WHY BUILD AN LLM CHAT MICROSERVICE?
In the rapidly evolving landscape of artificial intelligence, Large Language Models have become indispensable tools for modern applications. However, deploying these powerful models in a production environment presents unique challenges. This article guides you through creating a robust, scalable LLM chat service that runs as a microservice in containerized environments.
The microservice architecture approach offers several compelling advantages. First, it provides isolation, meaning your LLM service runs independently from other application components. If the LLM service crashes or needs updates, your main application continues functioning. Second, it enables horizontal scaling, allowing you to run multiple instances of your LLM service to handle increased load. Third, it facilitates resource management, particularly important for GPU-intensive LLM operations where you need precise control over hardware allocation.
Using local LLM models instead of cloud-based APIs offers significant benefits. You maintain complete data privacy since no information leaves your infrastructure. You eliminate per-token costs associated with commercial APIs, making it economically viable for high-volume applications. You gain full control over model selection, fine-tuning, and updates. Most importantly, you avoid dependency on external services and their potential downtime or rate limits.
UNDERSTANDING THE FUNDAMENTAL COMPONENTS
Before diving into implementation, let us explore the essential building blocks of our system. Understanding these components deeply will help you make informed decisions throughout the development process.
What Are Large Language Models?
Large Language Models are neural networks trained on vast amounts of text data to understand and generate human-like text. Unlike traditional software that follows explicit rules, LLMs learn patterns from data and can perform tasks they were not explicitly programmed for. When you send a prompt to an LLM, it processes the text through billions of parameters to generate a contextually appropriate response.
The models we will use are quantized versions, meaning they have been compressed from their original size while maintaining most of their capabilities. Quantization reduces the precision of the numerical weights, typically from 16-bit floating point to around 4 bits per weight. A 7-billion-parameter model that needs about 14GB at 16-bit precision fits in roughly 4GB to 5GB after 4-bit quantization. The GGUF format, developed by the llama.cpp project, has become the standard for these quantized models because it supports efficient loading and inference across different hardware platforms.
What Is Docker and Why Use It?
Docker is a containerization platform that packages your application along with all its dependencies into a standardized unit called a container. Think of a container as a lightweight, isolated environment that contains everything needed to run your application: the code, runtime, system tools, libraries, and settings.
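The key advantage of Docker for LLM services is consistency. The "it works on my machine" problem largely disappears because the container carries the same code, libraries, and settings wherever it runs. GPU access is the exception. Images must be built for the target backend, for example CUDA for a Linux server with NVIDIA GPUs or ROCm for a Kubernetes cluster with AMD GPUs. Docker Desktop on macOS runs containers inside a Linux virtual machine that does not expose the Metal GPU, so on a MacBook with Apple Silicon a containerized service runs on the CPU, and Metal acceleration requires running the service natively.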
Docker containers differ from virtual machines in crucial ways. Virtual machines include an entire operating system, making them heavy and slow to start. Containers share the host operating system kernel, making them lightweight and fast. A container can start in seconds and uses only megabytes of memory for the container overhead itself, though our LLM will require substantial memory for the model.
What Is Kubernetes and Why Use It?
Kubernetes is an orchestration platform that manages containerized applications across a cluster of machines. While Docker handles individual containers, Kubernetes manages fleets of containers, deciding where to run them, how to scale them, and what to do when they fail.
For LLM services, Kubernetes provides critical capabilities. It offers automatic scaling based on demand, spinning up new instances when traffic increases and shutting them down when traffic decreases. It provides self-healing, automatically restarting failed containers and replacing unhealthy instances. It manages resource allocation, ensuring your GPU-hungry LLM containers get the hardware they need without starving other services. It handles load balancing, distributing incoming requests across multiple LLM instances for optimal performance.
Understanding Microservice Architecture
Microservice architecture structures an application as a collection of loosely coupled services. Instead of building one large monolithic application, you create multiple small services that each handle a specific business capability. Each service runs independently, has its own database if needed, and communicates with other services through well-defined APIs.
For our LLM chat service, the microservice approach means creating a dedicated service that does one thing well: processing chat requests using a local LLM. Other parts of your application, like user authentication, data storage, or frontend interfaces, run as separate services. This separation allows you to update your LLM service without touching other components, scale it independently based on AI workload, and even use different programming languages or frameworks for different services.
MULTI-GPU ARCHITECTURE SUPPORT: THE TECHNICAL FOUNDATION
One of the most challenging aspects of deploying LLM services is supporting diverse GPU architectures. Different hardware manufacturers use different programming models and libraries for GPU acceleration. Understanding these differences is essential for building a truly portable LLM service.
NVIDIA CUDA Architecture
NVIDIA GPUs use the CUDA (Compute Unified Device Architecture) platform for parallel computing. CUDA has been the dominant force in machine learning for years, with extensive library support and optimization. When you run an LLM on NVIDIA hardware, the inference engine uses CUDA cores to parallelize matrix operations, the fundamental mathematical operations in neural networks.
The llama.cpp library we will use supports NVIDIA GPUs through its CUDA backend, which combines custom kernels with cuBLAS, NVIDIA's library for basic linear algebra operations. When compiled with CUDA support, llama.cpp offloads the layers you select with the n_gpu_layers setting to the GPU, dramatically accelerating inference. A response that might take 30 seconds on CPU could complete in 2 seconds on a modern NVIDIA GPU.
AMD ROCm Architecture
AMD's ROCm (Radeon Open Compute) platform provides GPU acceleration for AMD graphics cards. ROCm is an open-source alternative to CUDA, designed to support high-performance computing and machine learning workloads. While historically less mature than CUDA, ROCm has improved significantly and now provides competitive performance for LLM inference.
The llama.cpp library supports ROCm through HIP (Heterogeneous-computing Interface for Portability), which allows CUDA code to run on AMD GPUs with minimal modifications. When you compile llama.cpp with ROCm support, it uses AMD GPU cores for acceleration just as it would use NVIDIA CUDA cores.
Apple Metal Performance Shaders
Apple Silicon chips (M1, M2, M3, and their variants) include integrated GPUs that use the Metal framework for GPU computing. Metal Performance Shaders (MPS) provides optimized implementations of common computational patterns, including the matrix operations needed for neural networks.
The llama.cpp library includes Metal support specifically for Apple Silicon. When running on a Mac with Apple Silicon, the library automatically uses the integrated GPU for acceleration. This is particularly powerful because Apple's unified memory architecture allows the CPU and GPU to share the same memory pool, eliminating the need to copy data between separate CPU and GPU memory.
Intel GPU Architecture
Intel provides GPU acceleration through multiple technologies. Modern Intel CPUs with integrated graphics support OpenCL for general-purpose GPU computing. Intel also offers oneAPI, a unified programming model that works across CPUs, GPUs, and other accelerators. Additionally, Intel's Arc discrete GPUs provide substantial computational power for AI workloads.
The llama.cpp library supports Intel GPUs primarily through SYCL (the Khronos Group's C++ programming model, which Intel's oneAPI toolkit implements) and OpenCL. This support enables acceleration on Intel integrated graphics as well as discrete Arc GPUs.
THE ARCHITECTURE OF OUR LLM MICROSERVICE
Now that we understand the components, let us design the architecture of our LLM chat microservice. A well-designed architecture makes the system easier to understand, maintain, and extend.
Service Architecture Overview
Our microservice follows a layered architecture with clear separation of concerns. At the top layer, we have the API layer, which handles HTTP requests and responses. This layer exposes REST endpoints that clients use to send chat messages and receive responses. It validates input, handles errors gracefully, and formats responses according to API specifications.
Below the API layer sits the service layer, which contains the business logic. This layer manages conversation context, handles prompt formatting, and orchestrates the interaction with the LLM. It implements features like conversation history management, token counting, and response streaming.
The inference layer interfaces directly with the LLM. This layer loads the model, manages GPU resources, and executes inference. It abstracts the complexity of the underlying llama.cpp library, providing a clean interface for the service layer.
Finally, the configuration layer manages all settings: model paths, GPU settings, performance parameters, and API configuration. This layer reads from environment variables and configuration files, allowing the service to adapt to different deployment environments without code changes.
Request Flow Through the System
When a client sends a chat request, it travels through our system in a well-defined path. The request first arrives at the API layer as an HTTP POST to the chat endpoint. The API layer validates the request structure, ensuring required fields are present and properly formatted.
The validated request moves to the service layer, which retrieves any relevant conversation history and constructs a complete prompt for the LLM. The prompt includes system instructions, conversation history, and the new user message, all formatted according to the specific model's expected format.
The service layer passes the formatted prompt to the inference layer, which feeds it to the loaded LLM. The model processes the prompt and generates tokens one at a time. For streaming responses, these tokens flow back through the layers immediately, allowing the client to receive partial responses as they are generated. For non-streaming responses, the inference layer collects all tokens before returning the complete response.
Finally, the response travels back through the service layer, which may perform post-processing like filtering or logging, and then through the API layer, which formats it as JSON and sends it to the client.
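The request flow above can be sketched with plain functions standing in for each layer. This is a minimal illustration, not the service's implementation: `validate_request`, `build_prompt`, `handle_chat`, and the `echo_model` stub are hypothetical names, and the simple "role: content" prompt layout is an assumption chosen for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ChatTurn:
    role: str
    content: str


def validate_request(payload: Dict[str, object]) -> str:
    """API layer: reject requests without a non-empty 'message' string."""
    message = payload.get("message")
    if not isinstance(message, str) or not message.strip():
        raise ValueError("'message' must be a non-empty string")
    return message


def build_prompt(history: List[ChatTurn], user_message: str) -> str:
    """Service layer: combine history and the new message into one prompt."""
    turns = history + [ChatTurn("user", user_message)]
    lines = [f"{turn.role}: {turn.content}" for turn in turns]
    return "\n".join(lines) + "\nassistant: "


def handle_chat(
    payload: Dict[str, object],
    history: List[ChatTurn],
    infer: Callable[[str], str],
) -> Dict[str, str]:
    """Walk one request through API -> service -> inference -> API."""
    message = validate_request(payload)
    prompt = build_prompt(history, message)
    text = infer(prompt)  # inference layer, stubbed in this sketch
    return {"response": text.strip()}


def echo_model(prompt: str) -> str:
    """Stand-in for the LLM: reports how many lines the prompt had."""
    return f" saw {len(prompt.splitlines())} prompt lines "


result = handle_chat(
    {"message": "Hello"},
    [ChatTurn("system", "Be brief.")],
    echo_model,
)
print(result)
```

Passing the inference step in as a callable keeps the upper layers testable without loading a model, which is the same separation the layered design aims for.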
SETTING UP THE DEVELOPMENT ENVIRONMENT
Before writing code, we need to prepare our development environment. This section walks through the setup process step by step.
Installing Required System Dependencies
Our LLM service requires several system-level dependencies. On Ubuntu or Debian Linux, you need build tools for compiling native extensions, Python development headers, and GPU-specific libraries depending on your hardware.
For NVIDIA GPU support, install the CUDA toolkit matching your driver version. The CUDA toolkit includes the compiler and libraries needed to build GPU-accelerated code. You can check that the driver sees your GPU by running the nvidia-smi command, which displays GPU information, the driver version, and the highest CUDA version that driver supports. To confirm that the toolkit itself is installed, run nvcc --version.
For AMD GPU support, install ROCm following AMD's official documentation. ROCm installation is more involved than CUDA, requiring specific kernel modules and libraries. After installation, verify it by running rocm-smi, the AMD equivalent of nvidia-smi.
For Apple Silicon, no additional installation is needed. The Metal framework comes pre-installed with macOS. The llama.cpp library will automatically detect and use Metal when running on Apple Silicon.
For Intel GPU support, install the Intel oneAPI toolkit or OpenCL runtime. The specific requirements depend on whether you are using integrated graphics or discrete Arc GPUs.
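As a rough aid for the per-vendor setup described above, the following sketch guesses which backend a host is likely to support by looking for the vendor command-line tools. The function name `detect_gpu_type` is hypothetical, and finding a tool on the PATH is only a hint that a driver stack is installed, not proof that llama.cpp was built with that backend. The `sycl-ls` check assumes the Intel oneAPI toolkit has been added to the PATH.

```python
import platform
import shutil
from typing import Callable, Optional


def detect_gpu_type(
    system: Optional[str] = None,
    machine: Optional[str] = None,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> str:
    """Guess a GPU backend name from the host platform and installed tools."""
    system = system or platform.system()
    machine = machine or platform.machine()

    # Apple Silicon Macs report Darwin/arm64 and ship with Metal.
    if system == "Darwin" and machine == "arm64":
        return "metal"
    # The vendor tools below are only a hint that a driver stack exists.
    if which("nvidia-smi"):
        return "cuda"
    if which("rocm-smi"):
        return "rocm"
    if which("sycl-ls"):
        return "sycl"
    return "none"


print(detect_gpu_type())
```

The `system`, `machine`, and `which` parameters exist so the logic can be exercised on any machine; in normal use the function is called without arguments.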
Installing Python and Dependencies
Our microservice uses Python 3.10 or later. Python 3.10 introduced several features we will use, including improved type hints and better error messages. Create a virtual environment to isolate our project dependencies from system Python packages. This isolation prevents version conflicts and makes the project reproducible.
The core Python dependencies include FastAPI, a modern web framework for building APIs with automatic documentation and validation. We use Uvicorn as the ASGI server to run FastAPI. The llama-cpp-python package provides Python bindings for llama.cpp, giving us access to efficient LLM inference with GPU support. We also need Pydantic for data validation and serialization, plus the separate pydantic-settings package, because the configuration module imports BaseSettings from pydantic_settings.
Obtaining LLM Models
To run a local LLM, you need to download model files. The most accessible source is Hugging Face, which hosts thousands of models in various formats. For our service, we need models in GGUF format, the format used by llama.cpp.
Popular model choices include Llama 2, an openly available model from Meta released under its own community license, in sizes from 7 billion to 70 billion parameters. Mistral 7B offers excellent performance for its size with a focus on instruction following. Phi-3 from Microsoft provides strong capabilities in a compact 3.8 billion parameter model. Qwen models from Alibaba offer multilingual support with strong performance.
When selecting a model, consider the quantization level. Q4_K_M quantization provides a good balance between quality and size, reducing model size to roughly 25 to 30 percent of the 16-bit original while maintaining most capabilities. Q5_K_M offers slightly better quality at a modest size increase. Q8_0 provides near-original quality but requires about half the memory of the 16-bit original.
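The size trade-off can be made concrete with a little arithmetic: weight storage is approximately the parameter count times the bits per weight, divided by eight. The sketch below assumes approximate bits-per-weight figures for each level; these are ballpark values, and real GGUF files vary with the model architecture. The helper names are hypothetical.

```python
from typing import Dict, Optional

# Approximate bits per weight (assumed values; actual files vary by model).
APPROX_BITS_PER_WEIGHT: Dict[str, float] = {
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.7,
    "Q8_0": 8.5,
    "F16": 16.0,
}


def estimate_size_gb(n_params_billion: float, quant: str) -> float:
    """Estimate weight storage in GB (10^9 bytes) for a quantization level."""
    bits = APPROX_BITS_PER_WEIGHT[quant]
    # (n * 1e9 params) * bits / 8 bytes, divided by 1e9 bytes per GB
    return n_params_billion * bits / 8


def best_quant_for_budget(n_params_billion: float, budget_gb: float) -> Optional[str]:
    """Return the highest-precision level whose weights fit in the budget."""
    fitting = [
        quant
        for quant in APPROX_BITS_PER_WEIGHT
        if estimate_size_gb(n_params_billion, quant) <= budget_gb
    ]
    if not fitting:
        return None
    return max(fitting, key=lambda quant: APPROX_BITS_PER_WEIGHT[quant])


for level in APPROX_BITS_PER_WEIGHT:
    print(level, round(estimate_size_gb(7, level), 2))
```

The estimate covers the weights only. The context cache and runtime buffers need additional memory, so leave headroom when comparing the result against available GPU or system memory.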
Download your chosen model and note its path. We will configure our service to load this model at startup.
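Before pointing the service at a downloaded file, it can help to confirm that the file is actually in GGUF format. GGUF files begin with the four ASCII bytes "GGUF", so a quick check is possible without loading the model. The helper name `looks_like_gguf` is hypothetical, and the check only inspects the header; it does not validate the rest of the file.

```python
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # the first four bytes of a GGUF file


def looks_like_gguf(path: str) -> bool:
    """Return True if the file exists and starts with the GGUF magic bytes."""
    file_path = Path(path)
    if not file_path.is_file():
        return False
    with file_path.open("rb") as handle:
        return handle.read(4) == GGUF_MAGIC
```

A call such as `looks_like_gguf("/models/my-model.gguf")` (the path is a placeholder) returns False for a missing file or an incomplete download that was saved as an HTML error page.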
IMPLEMENTING THE LLM MICROSERVICE
Now we begin implementing our microservice. We will build it incrementally, explaining each component in detail.
Creating the Project Structure
A well-organized project structure makes the code easier to navigate and maintain. Create a directory for your project and organize it into logical modules. The main application code lives in an app directory. Configuration handling goes in a config module. The LLM inference logic resides in a models module. Chat business logic, such as conversation management and prompt formatting, lives in a services module. API endpoints are defined in a routes module. Shared utilities and helpers go in a utils module.
This structure follows clean architecture principles by separating concerns. The routes module depends on the services and models modules, but the models module does not know about HTTP or routing. This separation allows you to test the LLM inference logic independently of the web framework.
Implementing Configuration Management
Configuration management is critical for a production service. Hard-coding values makes the service inflexible and difficult to deploy in different environments. We use environment variables and configuration files to make the service adaptable.
Create a configuration module that defines all settings using Pydantic. Pydantic provides automatic validation and type conversion for configuration values. It can read from environment variables, providing sensible defaults when values are not specified.
Here is the configuration module:
# app/config.py
from enum import Enum
from functools import lru_cache
from typing import Optional

from pydantic_settings import BaseSettings, SettingsConfigDict


class GPUType(str, Enum):
    """Enumeration of supported GPU types"""
    CUDA = "cuda"
    ROCM = "rocm"
    METAL = "metal"
    SYCL = "sycl"
    NONE = "none"


class Settings(BaseSettings):
    """Application configuration settings"""

    # Read values from a .env file when present. Overriding the protected
    # namespace avoids Pydantic v2 warnings for the model_* field names.
    model_config = SettingsConfigDict(
        env_file=".env",
        env_file_encoding="utf-8",
        protected_namespaces=("settings_",),
    )

    # Model configuration
    model_path: str
    model_name: str = "local-llm"
    context_length: int = 4096
    max_tokens: int = 2048
    temperature: float = 0.7
    top_p: float = 0.9
    top_k: int = 40
    repeat_penalty: float = 1.1

    # GPU configuration
    gpu_type: GPUType = GPUType.NONE
    n_gpu_layers: int = 0
    main_gpu: int = 0
    tensor_split: Optional[str] = None

    # Server configuration
    host: str = "0.0.0.0"
    port: int = 8000
    workers: int = 1
    log_level: str = "info"

    # Performance configuration
    n_threads: int = 4
    n_batch: int = 512

    # API configuration
    api_key: Optional[str] = None
    enable_streaming: bool = True
    max_concurrent_requests: int = 10


@lru_cache
def get_settings() -> Settings:
    """Get application settings singleton (built once, then cached)"""
    return Settings()
This configuration module defines all the settings our service needs. The model_path specifies where to find the GGUF model file. The context_length determines how much conversation history the model can consider. The max_tokens limits the length of generated responses. Temperature, top_p, and top_k control the randomness and creativity of responses. Lower temperature values produce more focused and deterministic outputs, while higher values increase creativity and randomness.
The GPU configuration section is particularly important. The gpu_type setting tells the service which GPU acceleration to use. The n_gpu_layers specifies how many model layers to offload to the GPU. Setting this to a value larger than the model's layer count, such as 99, or to -1 in llama-cpp-python, offloads every layer to GPU memory, which maximizes performance provided the model fits. The main_gpu setting selects which GPU to use in multi-GPU systems. The tensor_split allows distributing the model across multiple GPUs, specified as a comma-separated list of proportions.
The server configuration controls how the HTTP server runs. The host setting determines which network interfaces to bind to. Using 0.0.0.0 makes the service accessible from any network interface, necessary for Docker containers. The port specifies which TCP port to listen on.
The performance configuration affects inference speed and resource usage. The n_threads setting controls CPU parallelism for operations not offloaded to GPU. The n_batch parameter affects how the model processes tokens internally, with higher values potentially improving throughput at the cost of memory.
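The configuration also defines max_concurrent_requests, although this section does not show how it is enforced. One plausible approach, sketched here under that assumption, is an asyncio.Semaphore that caps how many inference calls run at once. The `ConcurrencyLimiter` class and the `fake_inference` stub are hypothetical and stand in for the real service code.

```python
import asyncio
from typing import Awaitable, Callable


class ConcurrencyLimiter:
    """Caps the number of jobs running inside `run` at the same time."""

    def __init__(self, max_concurrent: int):
        self._semaphore = asyncio.Semaphore(max_concurrent)
        self.active = 0  # jobs currently running
        self.peak = 0    # highest value `active` ever reached

    async def run(self, job: Callable[[], Awaitable[str]]) -> str:
        async with self._semaphore:  # waits here once the limit is reached
            self.active += 1
            self.peak = max(self.peak, self.active)
            try:
                return await job()
            finally:
                self.active -= 1


async def fake_inference() -> str:
    """Stand-in for a slow model call."""
    await asyncio.sleep(0.01)
    return "ok"


async def demo(n_requests: int, limit: int) -> ConcurrencyLimiter:
    limiter = ConcurrencyLimiter(limit)
    jobs = (limiter.run(fake_inference) for _ in range(n_requests))
    await asyncio.gather(*jobs)
    return limiter


limiter = asyncio.run(demo(n_requests=8, limit=3))
print(f"peak concurrency: {limiter.peak}")
```

Requests beyond the limit wait in line rather than failing. A service that prefers to reject excess load could instead check the semaphore and return an HTTP 429 response.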
Implementing the LLM Inference Layer
The inference layer encapsulates all interaction with the LLM. This abstraction isolates the rest of the application from the specifics of the llama.cpp library, making it easier to swap implementations if needed.
Here is the inference layer implementation:
# app/models/llm.py
import logging
from typing import Any, Dict, Iterator, List, Optional

from llama_cpp import Llama

from app.config import GPUType, Settings

logger = logging.getLogger(__name__)


class LLMInference:
    """Handles LLM model loading and inference"""

    def __init__(self, settings: Settings):
        """
        Initialize the LLM inference engine

        Args:
            settings: Application settings containing model configuration
        """
        self.settings = settings
        self.model: Optional[Llama] = None
        self._initialize_model()

    def _initialize_model(self) -> None:
        """Load and initialize the LLM model with appropriate GPU settings"""
        logger.info(f"Loading model from {self.settings.model_path}")
        logger.info(f"GPU type: {self.settings.gpu_type.value}")
        logger.info(f"GPU layers: {self.settings.n_gpu_layers}")

        # Prepare model initialization parameters
        model_params = {
            "model_path": self.settings.model_path,
            "n_ctx": self.settings.context_length,
            "n_threads": self.settings.n_threads,
            "n_batch": self.settings.n_batch,
            "verbose": self.settings.log_level == "debug",
        }

        # Configure GPU acceleration based on type
        if self.settings.gpu_type != GPUType.NONE:
            model_params["n_gpu_layers"] = self.settings.n_gpu_layers

            if self.settings.gpu_type == GPUType.CUDA:
                # CUDA-specific settings
                model_params["main_gpu"] = self.settings.main_gpu
                if self.settings.tensor_split:
                    # Parse tensor split string into list of floats
                    splits = [float(x) for x in self.settings.tensor_split.split(",")]
                    model_params["tensor_split"] = splits
                logger.info("Configured for NVIDIA CUDA acceleration")

            elif self.settings.gpu_type == GPUType.ROCM:
                # ROCm uses same parameters as CUDA through HIP
                model_params["main_gpu"] = self.settings.main_gpu
                if self.settings.tensor_split:
                    splits = [float(x) for x in self.settings.tensor_split.split(",")]
                    model_params["tensor_split"] = splits
                logger.info("Configured for AMD ROCm acceleration")

            elif self.settings.gpu_type == GPUType.METAL:
                # Metal acceleration for Apple Silicon
                logger.info("Configured for Apple Metal acceleration")

            elif self.settings.gpu_type == GPUType.SYCL:
                # Intel GPU acceleration through SYCL
                model_params["main_gpu"] = self.settings.main_gpu
                logger.info("Configured for Intel SYCL acceleration")
        else:
            logger.info("Running on CPU only (no GPU acceleration)")

        try:
            self.model = Llama(**model_params)
            logger.info("Model loaded successfully")
        except Exception as e:
            logger.error(f"Failed to load model: {e}")
            raise

    def generate(
        self,
        prompt: str,
        max_tokens: Optional[int] = None,
        temperature: Optional[float] = None,
        top_p: Optional[float] = None,
        top_k: Optional[int] = None,
        repeat_penalty: Optional[float] = None,
        stop: Optional[List[str]] = None,
        stream: bool = False
    ) -> Iterator[Dict[str, Any]]:
        """
        Generate text from the model

        Args:
            prompt: Input text to generate from
            max_tokens: Maximum tokens to generate (uses config default if None)
            temperature: Sampling temperature (uses config default if None)
            top_p: Nucleus sampling parameter (uses config default if None)
            top_k: Top-k sampling parameter (uses config default if None)
            repeat_penalty: Repetition penalty (uses config default if None)
            stop: List of stop sequences
            stream: Whether to stream tokens as they are generated

        Yields:
            Dictionary containing generated text and metadata
        """
        if self.model is None:
            raise RuntimeError("Model not initialized")

        # Use provided parameters or fall back to configuration defaults
        generation_params = {
            "prompt": prompt,
            "max_tokens": max_tokens if max_tokens is not None else self.settings.max_tokens,
            "temperature": temperature if temperature is not None else self.settings.temperature,
            "top_p": top_p if top_p is not None else self.settings.top_p,
            "top_k": top_k if top_k is not None else self.settings.top_k,
            "repeat_penalty": repeat_penalty if repeat_penalty is not None else self.settings.repeat_penalty,
            "stop": stop or [],
            "stream": stream,
            "echo": False,
        }
        logger.debug(f"Generating with params: {generation_params}")

        try:
            if stream:
                # Stream tokens as they are generated
                for output in self.model(**generation_params):
                    yield {
                        "text": output["choices"][0]["text"],
                        "finish_reason": output["choices"][0].get("finish_reason"),
                    }
            else:
                # Generate complete response; this is still yielded as a
                # single chunk, so callers iterate in both modes
                output = self.model(**generation_params)
                yield {
                    "text": output["choices"][0]["text"],
                    "finish_reason": output["choices"][0].get("finish_reason"),
                    "usage": {
                        "prompt_tokens": output["usage"]["prompt_tokens"],
                        "completion_tokens": output["usage"]["completion_tokens"],
                        "total_tokens": output["usage"]["total_tokens"],
                    }
                }
        except Exception as e:
            logger.error(f"Generation failed: {e}")
            raise

    def get_model_info(self) -> Dict[str, Any]:
        """
        Get information about the loaded model

        Returns:
            Dictionary containing model metadata
        """
        return {
            "model_name": self.settings.model_name,
            "model_path": self.settings.model_path,
            "context_length": self.settings.context_length,
            "gpu_type": self.settings.gpu_type.value,
            "gpu_layers": self.settings.n_gpu_layers,
        }
This inference layer provides a clean interface for the rest of the application. The initialization method loads the model with appropriate GPU settings based on the configuration. The generate method handles both streaming and non-streaming inference, accepting parameters that control the generation process.
The GPU configuration logic deserves special attention. For CUDA and ROCm, we set the main_gpu parameter to select which GPU to use. The tensor_split parameter allows distributing the model across multiple GPUs. For example, a tensor_split of "0.6,0.4" would put 60 percent of the model on the first GPU and 40 percent on the second GPU. This is useful for very large models that do not fit in a single GPU's memory.
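The tensor_split string is parsed with a bare list comprehension in the listing above, which accepts inputs such as a trailing comma only by failing with an unhelpful error. A slightly stricter parser might look like the following sketch. The function name `parse_tensor_split` and its validation rules are assumptions for illustration, not part of the original service.

```python
from typing import List


def parse_tensor_split(value: str) -> List[float]:
    """Parse a string such as '0.6,0.4' into [0.6, 0.4]."""
    parts = [part.strip() for part in value.split(",")]
    if not all(parts):
        raise ValueError(f"empty entry in tensor_split: {value!r}")
    splits = [float(part) for part in parts]  # raises ValueError on non-numbers
    if any(split < 0 for split in splits) or sum(splits) <= 0:
        raise ValueError("tensor_split needs non-negative values with a positive sum")
    return splits


print(parse_tensor_split("0.6, 0.4"))
```

Validating at startup means a typo in the environment variable is reported before the model begins loading, which can take a long time for large files.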
For Metal acceleration on Apple Silicon, we simply set n_gpu_layers to offload computation to the integrated GPU. The llama.cpp library handles the Metal-specific details automatically.
The generate method implements both streaming and non-streaming modes. In streaming mode, tokens are yielded as soon as they are generated, allowing clients to display partial responses. This significantly improves perceived responsiveness for long responses. In non-streaming mode, the complete response is generated before returning, which is simpler but requires the client to wait for the entire response.
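One detail worth illustrating is that generate is a generator function in both modes, so callers always iterate over it, even when stream is False. The sketch below uses a hypothetical `fake_generate` stub that mimics the shape of the yielded dictionaries, so it runs without a model.

```python
from typing import Dict, Iterable, Iterator, List, Optional


def fake_generate(
    tokens: List[str], stream: bool
) -> Iterator[Dict[str, Optional[str]]]:
    """Stub that yields chunks shaped like those of the generate method."""
    if stream:
        for index, token in enumerate(tokens):
            is_last = index == len(tokens) - 1
            yield {"text": token, "finish_reason": "stop" if is_last else None}
    else:
        # Non-streaming mode still yields, just a single complete chunk.
        yield {"text": "".join(tokens), "finish_reason": "stop"}


def collect(chunks: Iterable[Dict[str, Optional[str]]]) -> str:
    """Join the text of every chunk, whichever mode produced them."""
    return "".join(chunk["text"] or "" for chunk in chunks)


tokens = ["Hel", "lo", " world"]
streamed = list(fake_generate(tokens, stream=True))
complete = list(fake_generate(tokens, stream=False))
print(len(streamed), len(complete), collect(streamed))
```

Because both modes share one interface, the service layer can accumulate text with the same loop regardless of whether the client asked for streaming.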
Implementing the Service Layer
The service layer sits between the API and the inference layer, handling business logic like conversation management and prompt formatting.
Here is the service layer implementation:
# app/services/chat_service.py
import logging
from typing import Any, Dict, Iterator, List, Optional

from app.config import Settings
from app.models.llm import LLMInference

logger = logging.getLogger(__name__)


class Message:
    """Represents a single message in a conversation"""

    def __init__(self, role: str, content: str):
        """
        Initialize a message

        Args:
            role: Message role (system, user, or assistant)
            content: Message content text
        """
        self.role = role
        self.content = content

    def to_dict(self) -> Dict[str, str]:
        """Convert message to dictionary"""
        return {"role": self.role, "content": self.content}


class Conversation:
    """Manages a conversation with message history"""

    def __init__(self, system_prompt: Optional[str] = None):
        """
        Initialize a conversation

        Args:
            system_prompt: Optional system prompt to set behavior
        """
        self.messages: List[Message] = []
        if system_prompt:
            self.messages.append(Message("system", system_prompt))

    def add_message(self, role: str, content: str) -> None:
        """
        Add a message to the conversation

        Args:
            role: Message role (user or assistant)
            content: Message content
        """
        self.messages.append(Message(role, content))

    def get_messages(self) -> List[Dict[str, str]]:
        """Get all messages as list of dictionaries"""
        return [msg.to_dict() for msg in self.messages]

    def format_for_model(self, model_format: str = "chatml") -> str:
        """
        Format conversation for model input

        Args:
            model_format: Format to use (chatml, llama2, etc.)

        Returns:
            Formatted prompt string
        """
        if model_format == "chatml":
            # ChatML format used by many models
            formatted = ""
            for msg in self.messages:
                formatted += f"<|im_start|>{msg.role}\n{msg.content}<|im_end|>\n"
            formatted += "<|im_start|>assistant\n"
            return formatted

        elif model_format == "llama2":
            # Llama 2 chat format
            formatted = ""
            system_msg = None

            # Extract system message if present
            messages = self.messages.copy()
            if messages and messages[0].role == "system":
                system_msg = messages.pop(0).content

            # Format with special tokens
            if system_msg:
                formatted = f"[INST] <<SYS>>\n{system_msg}\n<</SYS>>\n\n"

            for i, msg in enumerate(messages):
                if msg.role == "user":
                    if i == 0 and system_msg:
                        # The opening [INST] was already emitted with the system block
                        formatted += f"{msg.content} [/INST] "
                    else:
                        formatted += f"[INST] {msg.content} [/INST] "
                elif msg.role == "assistant":
                    formatted += f"{msg.content} "
            return formatted

        elif model_format == "alpaca":
            # Alpaca instruction format
            formatted = ""
            for msg in self.messages:
                if msg.role == "system":
                    formatted += f"{msg.content}\n\n"
                elif msg.role == "user":
                    formatted += f"### Instruction:\n{msg.content}\n\n"
                elif msg.role == "assistant":
                    formatted += f"### Response:\n{msg.content}\n\n"
            formatted += "### Response:\n"
            return formatted

        else:
            # Simple format as fallback
            formatted = ""
            for msg in self.messages:
                formatted += f"{msg.role}: {msg.content}\n"
            formatted += "assistant: "
            return formatted


class ChatService:
    """Service for handling chat interactions with the LLM"""

    def __init__(self, llm: LLMInference, settings: Settings):
        """
        Initialize chat service

        Args:
            llm: LLM inference engine
            settings: Application settings
        """
        self.llm = llm
        self.settings = settings
        # In-memory store: lost on restart and not shared across processes
        self.conversations: Dict[str, Conversation] = {}

    def create_conversation(
        self,
        conversation_id: str,
        system_prompt: Optional[str] = None
    ) -> None:
        """
        Create a new conversation

        Args:
            conversation_id: Unique identifier for the conversation
            system_prompt: Optional system prompt
        """
        self.conversations[conversation_id] = Conversation(system_prompt)
        logger.info(f"Created conversation {conversation_id}")

    def get_conversation(self, conversation_id: str) -> Optional[Conversation]:
        """
        Get an existing conversation

        Args:
            conversation_id: Conversation identifier

        Returns:
            Conversation object or None if not found
        """
        return self.conversations.get(conversation_id)

    def delete_conversation(self, conversation_id: str) -> bool:
        """
        Delete a conversation

        Args:
            conversation_id: Conversation identifier

        Returns:
            True if deleted, False if not found
        """
        if conversation_id in self.conversations:
            del self.conversations[conversation_id]
            logger.info(f"Deleted conversation {conversation_id}")
            return True
        return False

    def chat(
        self,
        message: str,
        conversation_id: Optional[str] = None,
        system_prompt: Optional[str] = None,
        model_format: str = "chatml",
        stream: bool = False,
        **generation_params
    ) -> Iterator[Dict[str, Any]]:
        """
        Process a chat message and generate response

        Args:
            message: User message text
            conversation_id: Optional conversation ID for multi-turn chat
            system_prompt: Optional system prompt for stateless chat
            model_format: Prompt format to use
            stream: Whether to stream the response
            **generation_params: Additional generation parameters

        Yields:
            Response chunks with generated text
        """
        # Handle conversation context
        if conversation_id:
            conversation = self.get_conversation(conversation_id)
            if conversation is None:
                raise ValueError(f"Conversation {conversation_id} not found")
            conversation.add_message("user", message)
        else:
            # Stateless chat - create temporary conversation
            conversation = Conversation(system_prompt)
            conversation.add_message("user", message)

        # Format prompt for model
        prompt = conversation.format_for_model(model_format)
        logger.info(f"Processing chat message (stream={stream})")
        logger.debug(f"Formatted prompt: {prompt}")

        # Generate response
        accumulated_text = ""
        for chunk in self.llm.generate(prompt=prompt, stream=stream, **generation_params):
            accumulated_text += chunk["text"]
            yield chunk

        # Add assistant response to conversation history
        if conversation_id:
            conversation.add_message("assistant", accumulated_text)
The service layer implements several important abstractions. The Message class represents individual messages in a conversation. The Conversation class manages message history and handles prompt formatting for different model types.
Different LLM models expect different prompt formats. The format_for_model method implements several common formats. ChatML format uses special tokens like <|im_start|> and <|im_end|> to delimit messages. Llama 2 format uses [INST] and [/INST] tokens with special handling for system prompts. Alpaca format uses structured sections with headers like "### Instruction:" and "### Response:".
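To make the ChatML layout concrete, here is a standalone sketch that renders a two-message conversation the same way the chatml branch of format_for_model does. The function name `to_chatml` is hypothetical, and the `CHATML_STOP` list reflects a common convention of stopping at the end-of-message token; whether a given model needs it is an assumption to verify against that model's documentation.

```python
from typing import Dict, List

# Commonly used stop sequence for ChatML prompts (an assumption; check the
# documentation of the specific model).
CHATML_STOP = ["<|im_end|>"]


def to_chatml(messages: List[Dict[str, str]]) -> str:
    """Render messages in ChatML and leave the assistant turn open."""
    parts = [
        f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        for message in messages
    ]
    return "".join(parts) + "<|im_start|>assistant\n"


prompt = to_chatml(
    [
        {"role": "system", "content": "Be brief."},
        {"role": "user", "content": "Hi"},
    ]
)
print(prompt)
```

The prompt ends with an open assistant turn, so the model continues from there. Using the wrong format for a model usually does not raise an error but tends to degrade response quality, which makes the model_format setting easy to overlook.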
The ChatService class provides high-level chat functionality. It manages multiple concurrent conversations, each identified by a unique ID. Conversations are held in an in-process dictionary, so they are lost when the service restarts and are not shared between worker processes or replicas. For stateless single-turn interactions, it creates temporary conversations. The chat method orchestrates the entire process: retrieving or creating a conversation, adding the user message, formatting the prompt, generating the response, and updating the conversation history.
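The conversation history in the listing above grows without bound, so a long conversation will eventually exceed the model's context length. One possible mitigation, sketched here as an assumption rather than as part of the service, is to keep the system message and drop the oldest turns until the rest fits a token budget. The `approx_tokens` estimate of four characters per token is a crude heuristic; a real implementation would count tokens with the model's tokenizer.

```python
from typing import Dict, List


def approx_tokens(text: str) -> int:
    """Crude estimate: about four characters per token (an assumption)."""
    return max(1, len(text) // 4)


def trim_history(messages: List[Dict[str, str]], budget: int) -> List[Dict[str, str]]:
    """Keep the system message plus the newest messages that fit the budget."""
    system = [m for m in messages[:1] if m["role"] == "system"]
    rest = messages[len(system):]
    used = sum(approx_tokens(m["content"]) for m in system)

    kept: List[Dict[str, str]] = []
    for message in reversed(rest):  # walk from newest to oldest
        cost = approx_tokens(message["content"])
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    return system + list(reversed(kept))


history = [
    {"role": "system", "content": "s" * 8},
    {"role": "user", "content": "a" * 40},
    {"role": "assistant", "content": "b" * 40},
    {"role": "user", "content": "c" * 40},
]
trimmed = trim_history(history, budget=25)
print([m["role"] for m in trimmed])
```

The budget should be smaller than context_length by at least max_tokens, so that the generated response also fits in the context window.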
Implementing the API Layer
The API layer exposes our service through HTTP endpoints. We use FastAPI, which provides automatic request validation, response serialization, and interactive API documentation.
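For streamed responses, the API layer needs a wire format for the chunks. A common choice is Server-Sent Events, where each message is a line starting with "data: " followed by a blank line. The sketch below shows that encoding under the assumption that the service uses it; the `sse_events` name and the "[DONE]" sentinel are illustrative conventions borrowed from OpenAI-style APIs, not something the source specifies.

```python
import json
from typing import Dict, Iterable, Iterator


def sse_events(chunks: Iterable[Dict[str, object]]) -> Iterator[str]:
    """Encode each chunk as one Server-Sent Events message."""
    for chunk in chunks:
        yield f"data: {json.dumps(chunk)}\n\n"
    # Sentinel so clients can tell the stream ended (an assumed convention).
    yield "data: [DONE]\n\n"


events = list(sse_events([{"text": "Hi"}, {"text": "!"}]))
print("".join(events))
```

In FastAPI, a generator like this can be passed to StreamingResponse with media_type="text/event-stream", which sends each message to the client as soon as it is produced.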
Here is the API implementation:
# app/routes/chat.py
from fastapi import APIRouter, HTTPException, Depends
from fastapi.responses import StreamingResponse
from pydantic import BaseModel, ConfigDict, Field
from typing import Optional, List, Dict, Any
import json
import logging

from app.services.chat_service import ChatService
from app.models.llm import LLMInference
from app.config import Settings, get_settings

logger = logging.getLogger(__name__)
router = APIRouter()


# Request and response models
class ChatMessage(BaseModel):
    """Single chat message"""
    role: str = Field(..., description="Message role (system, user, or assistant)")
    content: str = Field(..., description="Message content")


class ChatRequest(BaseModel):
    """Request for chat completion"""
    # Pydantic v2 reserves the "model_" prefix; clearing protected_namespaces
    # avoids a warning for the model_format field.
    model_config = ConfigDict(protected_namespaces=())

    message: str = Field(..., description="User message to process")
    conversation_id: Optional[str] = Field(None, description="Conversation ID for multi-turn chat")
    system_prompt: Optional[str] = Field(None, description="System prompt for behavior control")
    model_format: str = Field("chatml", description="Prompt format (chatml, llama2, alpaca)")
    stream: bool = Field(False, description="Whether to stream the response")
    max_tokens: Optional[int] = Field(None, description="Maximum tokens to generate")
    temperature: Optional[float] = Field(None, description="Sampling temperature")
    top_p: Optional[float] = Field(None, description="Nucleus sampling parameter")
    top_k: Optional[int] = Field(None, description="Top-k sampling parameter")
    stop: Optional[List[str]] = Field(None, description="Stop sequences")


class ChatResponse(BaseModel):
    """Response from chat completion"""
    response: str = Field(..., description="Generated response text")
    conversation_id: Optional[str] = Field(None, description="Conversation ID if applicable")
    finish_reason: Optional[str] = Field(None, description="Reason generation stopped")
    usage: Optional[Dict[str, int]] = Field(None, description="Token usage statistics")


class ConversationRequest(BaseModel):
    """Request to create a conversation"""
    conversation_id: str = Field(..., description="Unique conversation identifier")
    system_prompt: Optional[str] = Field(None, description="System prompt for the conversation")


class ConversationResponse(BaseModel):
    """Response for conversation operations"""
    conversation_id: str = Field(..., description="Conversation identifier")
    messages: List[ChatMessage] = Field(..., description="Conversation messages")


class HealthResponse(BaseModel):
    """Health check response"""
    # Same reason as ChatRequest: model_info starts with the reserved prefix.
    model_config = ConfigDict(protected_namespaces=())

    status: str = Field(..., description="Service status")
    model_info: Dict[str, Any] = Field(..., description="Model information")


# Dependency injection for services
_chat_service: Optional[ChatService] = None


def get_chat_service() -> ChatService:
    """Get chat service singleton (created lazily on first use)"""
    global _chat_service
    if _chat_service is None:
        settings = get_settings()
        llm = LLMInference(settings)
        _chat_service = ChatService(llm, settings)
    return _chat_service
@router.post("/chat", response_model=ChatResponse)
async def chat(
    request: ChatRequest,
    chat_service: ChatService = Depends(get_chat_service)
):
    """
    Process a chat message and generate a response

    This endpoint supports both stateless single-turn chat and stateful
    multi-turn conversations. For multi-turn chat, create a conversation
    first and provide its ID in subsequent requests.
    """
    try:
        # Prepare generation parameters
        gen_params = {}
        if request.max_tokens is not None:
            gen_params["max_tokens"] = request.max_tokens
        if request.temperature is not None:
            gen_params["temperature"] = request.temperature
        if request.top_p is not None:
            gen_params["top_p"] = request.top_p
        if request.top_k is not None:
            gen_params["top_k"] = request.top_k
        if request.stop is not None:
            gen_params["stop"] = request.stop

        if request.stream:
            # Return streaming response
            async def generate():
                try:
                    for chunk in chat_service.chat(
                        message=request.message,
                        conversation_id=request.conversation_id,
                        system_prompt=request.system_prompt,
                        model_format=request.model_format,
                        stream=True,
                        **gen_params
                    ):
                        # Format as server-sent events
                        data = json.dumps(chunk)
                        yield f"data: {data}\n\n"
                    yield "data: [DONE]\n\n"
                except Exception as e:
                    logger.error(f"Streaming error: {e}")
                    error_data = json.dumps({"error": str(e)})
                    yield f"data: {error_data}\n\n"

            return StreamingResponse(
                generate(),
                media_type="text/event-stream"
            )
        else:
            # Return complete response
            response_text = ""
            finish_reason = None
            usage = None
            for chunk in chat_service.chat(
                message=request.message,
                conversation_id=request.conversation_id,
                system_prompt=request.system_prompt,
                model_format=request.model_format,
                stream=False,
                **gen_params
            ):
                response_text += chunk["text"]
                finish_reason = chunk.get("finish_reason")
                usage = chunk.get("usage")

            return ChatResponse(
                response=response_text,
                conversation_id=request.conversation_id,
                finish_reason=finish_reason,
                usage=usage
            )
    except ValueError as e:
        raise HTTPException(status_code=400, detail=str(e))
    except Exception as e:
        logger.error(f"Chat error: {e}", exc_info=True)
        raise HTTPException(status_code=500, detail="Internal server error")
@router.post("/conversations", response_model=ConversationResponse)
async def create_conversation(
    request: ConversationRequest,
    chat_service: ChatService = Depends(get_chat_service)
):
    """
    Create a new conversation for multi-turn chat

    Conversations maintain message history across multiple chat requests.
    Use the returned conversation_id in subsequent chat requests.
    """
    try:
        chat_service.create_conversation(
            conversation_id=request.conversation_id,
            system_prompt=request.system_prompt
        )
        conversation = chat_service.get_conversation(request.conversation_id)
        messages = [
            ChatMessage(role=msg["role"], content=msg["content"])
            for msg in conversation.get_messages()
        ]
        return ConversationResponse(
            conversation_id=request.conversation_id,
            messages=messages
        )
    except Exception as e:
        logger.error(f"Conversation creation error: {e}", exc_info=True)
        raise HTTPException(status_code=500, detail="Internal server error")


@router.get("/conversations/{conversation_id}", response_model=ConversationResponse)
async def get_conversation(
    conversation_id: str,
    chat_service: ChatService = Depends(get_chat_service)
):
    """
    Get an existing conversation with its message history
    """
    conversation = chat_service.get_conversation(conversation_id)
    if not conversation:
        raise HTTPException(status_code=404, detail="Conversation not found")
    messages = [
        ChatMessage(role=msg["role"], content=msg["content"])
        for msg in conversation.get_messages()
    ]
    return ConversationResponse(
        conversation_id=conversation_id,
        messages=messages
    )


@router.delete("/conversations/{conversation_id}")
async def delete_conversation(
    conversation_id: str,
    chat_service: ChatService = Depends(get_chat_service)
):
    """
    Delete a conversation and its history
    """
    deleted = chat_service.delete_conversation(conversation_id)
    if not deleted:
        raise HTTPException(status_code=404, detail="Conversation not found")
    return {"status": "deleted", "conversation_id": conversation_id}


@router.get("/health", response_model=HealthResponse)
async def health_check(
    chat_service: ChatService = Depends(get_chat_service)
):
    """
    Check service health and get model information
    """
    model_info = chat_service.llm.get_model_info()
    return HealthResponse(
        status="healthy",
        model_info=model_info
    )
The API layer defines several endpoints. The /chat endpoint is the primary interface for generating responses. It accepts a ChatRequest containing the user message and optional parameters. It supports both streaming and non-streaming responses. For streaming, it returns server-sent events that clients can process incrementally.
The /conversations endpoints manage multi-turn conversations. The POST endpoint creates a new conversation with an optional system prompt. The GET endpoint retrieves conversation history. The DELETE endpoint removes a conversation and its history.
The /health endpoint provides a way to check if the service is running and get information about the loaded model. This is useful for monitoring and debugging.
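The dependency injection pattern used here ensures that a single ChatService and LLMInference instance is created per process and shared across all requests that process handles. This is critical because loading the LLM model is expensive and should only happen once. Two consequences are worth knowing: the instance is created lazily, so the first request to any endpoint, including /health, triggers its creation, and running several worker processes creates one instance, and therefore one loaded model, per worker.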
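If the service is ever called from multiple threads, the lazy creation step can race. The sketch below shows one common way to guard it, double-checked locking with a threading.Lock. ExpensiveModel and get_model are hypothetical stand-ins for LLMInference and get_chat_service, used only to keep the example self-contained.

```python
import threading
from typing import Optional


class ExpensiveModel:
    """Stand-in for a costly resource such as a loaded LLM (hypothetical)."""
    load_count = 0  # counts how many times the "model" was constructed

    def __init__(self) -> None:
        ExpensiveModel.load_count += 1


_model: Optional[ExpensiveModel] = None
_model_lock = threading.Lock()


def get_model() -> ExpensiveModel:
    """Return the process-wide instance, creating it at most once."""
    global _model
    if _model is None:              # fast path once the instance exists
        with _model_lock:
            if _model is None:      # re-check: another thread may have won the race
                _model = ExpensiveModel()
    return _model
```

The second check inside the lock is what prevents two threads that both saw None from each constructing their own instance.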
Creating the Main Application
Now we tie everything together in the main application file:
# app/main.py
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
import logging

from app.routes import chat
from app.config import get_settings

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)


def create_app() -> FastAPI:
    """Create and configure the FastAPI application"""
    settings = get_settings()

    # Create FastAPI app
    app = FastAPI(
        title="LLM Chat Microservice",
        description="Local LLM chat service with multi-GPU support",
        version="1.0.0",
        docs_url="/docs",
        redoc_url="/redoc"
    )

    # Configure CORS for cross-origin requests
    app.add_middleware(
        CORSMiddleware,
        allow_origins=["*"],
        allow_credentials=True,
        allow_methods=["*"],
        allow_headers=["*"],
    )

    # Include routers
    app.include_router(chat.router, prefix="/api/v1", tags=["chat"])

    @app.on_event("startup")
    async def startup_event():
        """Initialize services on startup"""
        logger.info("Starting LLM Chat Microservice")
        logger.info(f"Model: {settings.model_path}")
        logger.info(f"GPU: {settings.gpu_type.value}")
        logger.info(f"Server: {settings.host}:{settings.port}")

    @app.on_event("shutdown")
    async def shutdown_event():
        """Cleanup on shutdown"""
        logger.info("Shutting down LLM Chat Microservice")

    return app


app = create_app()

if __name__ == "__main__":
    import uvicorn
    settings = get_settings()
    uvicorn.run(
        "app.main:app",
        host=settings.host,
        port=settings.port,
        log_level=settings.log_level,
        reload=False
    )
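This main application file creates the FastAPI application and configures it. The CORS middleware allows the service to accept requests from web browsers running on different domains. The wildcard settings shown are convenient for development only: browsers reject a literal wildcard origin on credentialed requests, so in production replace allow_origins=["*"] with an explicit list of the origins you trust. The startup event logs important configuration information. The shutdown event provides a hook for cleanup if needed. Recent FastAPI releases deprecate on_event in favor of a lifespan handler, although the on_event hooks still work with the version pinned in this article.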
CONTAINERIZING THE SERVICE WITH DOCKER
Now that we have a working service, we need to containerize it. Docker allows us
to package the service with all its dependencies, making it portable and easy to
deploy.
Understanding Docker Images and Containers
A Docker image is a template that contains your application and everything it needs to run. An image is built from a Dockerfile, which specifies the build steps. A container is a running instance of an image. You can run multiple containers from the same image, each isolated from the others.
Docker images are built in layers. Each instruction in the Dockerfile creates a new layer. Layers are cached, so if you rebuild an image and a layer has not changed, Docker reuses the cached layer. This makes builds faster and more efficient.
Creating a Multi-Architecture Dockerfile
Our Dockerfile needs to support multiple GPU architectures. We will use build arguments to control which GPU support to include. Here is the Dockerfile:
# Dockerfile
# Use official Python base image
FROM python:3.11-slim as base
# Set working directory
WORKDIR /app
# Install system dependencies
RUN apt-get update && apt-get install -y \
build-essential \
cmake \
git \
wget \
&& rm -rf /var/lib/apt/lists/*
# Create a build stage for compiling llama.cpp with GPU support
FROM base as builder
# Build arguments for GPU support
ARG GPU_TYPE=none
ARG CUDA_VERSION=12.2.0
ARG ROCM_VERSION=5.7
# Install GPU-specific dependencies based on build argument
RUN if [ "$GPU_TYPE" = "cuda" ]; then \
# Install CUDA toolkit
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.0-1_all.deb && \
dpkg -i cuda-keyring_1.0-1_all.deb && \
apt-get update && \
apt-get install -y cuda-toolkit-$(echo $CUDA_VERSION | cut -d. -f1,2 | tr . -) && \
rm cuda-keyring_1.0-1_all.deb; \
elif [ "$GPU_TYPE" = "rocm" ]; then \
# Install ROCm
wget https://repo.radeon.com/amdgpu-install/latest/ubuntu/jammy/amdgpu-install_latest_all.deb && \
apt-get install -y ./amdgpu-install_latest_all.deb && \
amdgpu-install -y --usecase=rocm && \
rm amdgpu-install_latest_all.deb; \
fi
# Copy requirements file
COPY requirements.txt .
# Install Python dependencies with GPU support
RUN if [ "$GPU_TYPE" = "cuda" ]; then \
CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python --no-cache-dir; \
elif [ "$GPU_TYPE" = "rocm" ]; then \
CMAKE_ARGS="-DLLAMA_HIPBLAS=on" pip install llama-cpp-python --no-cache-dir; \
elif [ "$GPU_TYPE" = "metal" ]; then \
CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python --no-cache-dir; \
elif [ "$GPU_TYPE" = "sycl" ]; then \
CMAKE_ARGS="-DLLAMA_SYCL=on" pip install llama-cpp-python --no-cache-dir; \
else \
pip install llama-cpp-python --no-cache-dir; \
fi
# Install other Python dependencies
RUN pip install --no-cache-dir -r requirements.txt
# Final runtime stage
FROM base as runtime
# Copy installed packages from builder
COPY --from=builder /usr/local/lib/python3.11/site-packages /usr/local/lib/python3.11/site-packages
COPY --from=builder /usr/local/bin /usr/local/bin
# Copy application code
COPY app/ /app/app/
# Create directory for models
RUN mkdir -p /app/models
# Set environment variables
ENV PYTHONUNBUFFERED=1
ENV MODEL_PATH=/app/models/model.gguf
# Expose port
EXPOSE 8000
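# Health check (standard library only: the requests package is not installed in this image,
# and urlopen raises on HTTP error statuses, which makes the check fail as intended)
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/api/v1/health', timeout=5)"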
# Run the application
CMD ["python", "-m", "uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
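This Dockerfile uses a multi-stage build. The builder stage compiles llama-cpp-python with the GPU support selected by the GPU_TYPE build argument, and the runtime stage copies only the installed Python packages from it. Two caveats apply. Because both stages derive from base, which installs build-essential, cmake, and git, the final image still contains those build tools; moving them into the builder stage would shrink it. The runtime stage also does not copy the CUDA or ROCm shared libraries installed in the builder, so a GPU build will likely need those libraries added to the final stage, for example by basing it on a vendor-provided runtime image. Note too that python:3.11-slim is Debian-based while the CUDA keyring URL targets Ubuntu 22.04, so starting from an nvidia/cuda base image may be more dependable for CUDA builds.
The GPU_TYPE argument controls which GPU support to compile. Setting it to "cuda" compiles with CUDA support. Setting it to "rocm" compiles with ROCm support. Setting it to "metal" requests Metal support, although Metal is a macOS framework and is not available inside a Linux container, so this option is mainly relevant when building natively on an Apple Silicon Mac. Setting it to "sycl" compiles with SYCL support for Intel GPUs. Setting it to "none" or omitting it compiles CPU-only support. The CMake option names shown match older llama-cpp-python releases; newer releases renamed them (for example, -DGGML_CUDA=on instead of -DLLAMA_CUBLAS=on), so check the project's README for the flags that match the version you install.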
The HEALTHCHECK instruction tells Docker how to check if the container is healthy. Docker periodically runs the command and marks the container as unhealthy after the configured number of consecutive failures. Docker and Docker Compose report this status, but Kubernetes ignores an image's HEALTHCHECK and relies instead on the liveness and readiness probes defined in the Pod spec.
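A health check command only needs to exit with status 0 when the service is healthy and non-zero otherwise. As a minimal sketch using only the standard library, the check could be written as a small function; is_healthy and its opener parameter are hypothetical names introduced here so the logic can be exercised without a running server.

```python
import urllib.error
import urllib.request
from typing import Callable


def is_healthy(url: str, timeout: float = 5.0,
               opener: Callable = urllib.request.urlopen) -> bool:
    """Return True only if the endpoint answers with a 2xx status."""
    try:
        with opener(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Covers refused connections, timeouts, and HTTP 4xx/5xx responses
        # (urlopen raises HTTPError, a URLError subclass, for those).
        return False
```

A wrapper script would call sys.exit(0 if is_healthy("http://localhost:8000/api/v1/health") else 1) so that Docker sees the right exit code.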
Creating the Requirements File
The requirements.txt file lists all Python dependencies:
# requirements.txt
fastapi==0.104.1
uvicorn[standard]==0.24.0
pydantic==2.5.0
pydantic-settings==2.1.0
python-multipart==0.0.6
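Notice that llama-cpp-python is not in this file. We install it separately in the Dockerfile with appropriate CMAKE_ARGS for GPU support. This is necessary because the package needs to be compiled with specific flags for each GPU type. The Dockerfile installs it without a version pin, unlike the packages above, so consider pinning a specific release there to keep builds reproducible.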
Building Docker Images for Different GPU Types
To build an image with CUDA support:
docker build --build-arg GPU_TYPE=cuda -t llm-chat-service:cuda .
To build an image with ROCm support:
docker build --build-arg GPU_TYPE=rocm -t llm-chat-service:rocm .
To build an image with Metal support for Apple Silicon:
docker build --build-arg GPU_TYPE=metal -t llm-chat-service:metal .
To build an image with SYCL support for Intel GPUs:
docker build --build-arg GPU_TYPE=sycl -t llm-chat-service:sycl .
To build a CPU-only image:
docker build --build-arg GPU_TYPE=none -t llm-chat-service:cpu .
The build process takes several minutes because it compiles llama-cpp-python from source with GPU support. The resulting image contains everything needed to run the service.
Running the Docker Container
To run the container, you need to mount a volume containing your model file and set environment variables for configuration. Here is an example for CUDA:
docker run -d \
--name llm-chat \
--gpus all \
-p 8000:8000 \
-v /path/to/models:/app/models \
-e MODEL_PATH=/app/models/your-model.gguf \
-e GPU_TYPE=cuda \
-e N_GPU_LAYERS=99 \
llm-chat-service:cuda
The --gpus all flag makes all GPUs available to the container and requires the NVIDIA Container Toolkit to be installed on the host. To expose only specific GPUs, for example GPUs 0 and 1, use --gpus '"device=0,1"'.
For ROCm on AMD GPUs:
docker run -d \
--name llm-chat \
--device=/dev/kfd \
--device=/dev/dri \
--group-add video \
-p 8000:8000 \
-v /path/to/models:/app/models \
-e MODEL_PATH=/app/models/your-model.gguf \
-e GPU_TYPE=rocm \
-e N_GPU_LAYERS=99 \
llm-chat-service:rocm
ROCm requires exposing specific devices and adding the container to the video group for GPU access.
For Metal on Apple Silicon:
docker run -d \
--name llm-chat \
-p 8000:8000 \
-v /path/to/models:/app/models \
-e MODEL_PATH=/app/models/your-model.gguf \
-e GPU_TYPE=metal \
-e N_GPU_LAYERS=99 \
llm-chat-service:metal
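No extra flags are needed, but be aware of a limitation: Docker on macOS runs containers inside a Linux virtual machine, which does not expose the Apple GPU through Metal, so this container will typically run inference on the CPU. To use Metal acceleration on Apple Silicon, run the service natively on macOS rather than in a container.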
Creating a Docker Compose File
Docker Compose simplifies running multi-container applications. Here is a
docker-compose.yml file for our service:
# docker-compose.yml
version: '3.8'

services:
  llm-chat:
    build:
      context: .
      dockerfile: Dockerfile
      args:
        GPU_TYPE: ${GPU_TYPE:-none}
    image: llm-chat-service:${GPU_TYPE:-none}
    container_name: llm-chat
    ports:
      - "${PORT:-8000}:8000"
    volumes:
      - ${MODEL_DIR:-./models}:/app/models
    environment:
      - MODEL_PATH=${MODEL_PATH:-/app/models/model.gguf}
      - GPU_TYPE=${GPU_TYPE:-none}
      - N_GPU_LAYERS=${N_GPU_LAYERS:-0}
      - CONTEXT_LENGTH=${CONTEXT_LENGTH:-4096}
      - MAX_TOKENS=${MAX_TOKENS:-2048}
      - TEMPERATURE=${TEMPERATURE:-0.7}
      - LOG_LEVEL=${LOG_LEVEL:-info}
    # NVIDIA GPU reservation; remove this block on hosts without NVIDIA GPUs
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped
    healthcheck:
      # Standard library only: the requests package is not installed in the image
      test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8000/api/v1/health', timeout=5)"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s
This compose file uses environment variables for configuration, making it easy to customize without editing the file. Create a .env file with your settings:
# .env
GPU_TYPE=cuda
MODEL_DIR=/path/to/models
MODEL_PATH=/app/models/your-model.gguf
N_GPU_LAYERS=99
PORT=8000
CONTEXT_LENGTH=4096
MAX_TOKENS=2048
TEMPERATURE=0.7
LOG_LEVEL=info
Then start the service with:
docker-compose up -d
Docker Compose reads the .env file automatically and substitutes the values into the compose file.
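The ${VAR:-default} syntax means "use VAR if it is set and non-empty, otherwise use the default". The sketch below is a simplified re-implementation of that rule for illustration only; it is not Compose's own code, handles just the ${VAR} and ${VAR:-default} forms, and the substitute function name is an assumption.

```python
import re
from typing import Mapping

# Matches ${NAME} and ${NAME:-default}
_PATTERN = re.compile(
    r"\$\{(?P<name>[A-Za-z_][A-Za-z0-9_]*)(?::-(?P<default>[^}]*))?\}"
)


def substitute(text: str, env: Mapping[str, str]) -> str:
    """Expand variables; ':-' falls back when the variable is unset or empty."""
    def repl(match: "re.Match") -> str:
        value = env.get(match.group("name"), "")
        if value:
            return value
        return match.group("default") or ""
    return _PATTERN.sub(repl, text)


# With no GPU_TYPE defined the image tag falls back to "none";
# with GPU_TYPE=cuda from the .env file it becomes "cuda".
print(substitute("image: llm-chat-service:${GPU_TYPE:-none}", {}))
print(substitute("image: llm-chat-service:${GPU_TYPE:-none}", {"GPU_TYPE": "cuda"}))
```

This is why the compose file works unchanged on a CPU-only laptop and on a GPU server: only the .env values differ.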
DEPLOYING TO KUBERNETES
Kubernetes provides production-grade orchestration for containerized
applications. Deploying our LLM service to Kubernetes enables automatic scaling, self-healing, and efficient resource management.
Understanding Kubernetes Concepts
Kubernetes organizes resources into several key abstractions. A Pod is the
smallest deployable unit, typically containing one container. A Deployment
manages a set of identical Pods, ensuring the desired number of replicas are running. A Service provides a stable network endpoint for accessing Pods, load balancing requests across replicas. A ConfigMap stores configuration data that Pods can consume. A PersistentVolume provides storage that persists beyond Pod lifecycles.
For GPU workloads, Kubernetes uses device plugins to expose GPUs to Pods. Each GPU vendor provides a device plugin that makes their GPUs available as schedulable resources. Pods can request GPU resources, and Kubernetes schedules them on nodes with available GPUs.
Installing GPU Support in Kubernetes
Before deploying our service, ensure your Kubernetes cluster has GPU support configured. For NVIDIA GPUs, install the NVIDIA device plugin:
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/main/nvidia-device-plugin.yml
For AMD GPUs, install the AMD device plugin:
kubectl create -f https://raw.githubusercontent.com/RadeonOpenCompute/k8s-device-plugin/master/k8s-ds-amdgpu-dp.yaml
For Intel GPUs, install the Intel device plugin:
kubectl apply -k 'https://github.com/intel/intel-device-plugins-for-kubernetes/deployments/gpu_plugin?ref=main'
These device plugins run as DaemonSets, meaning they run on every node in the cluster, exposing GPUs as schedulable resources.
Creating Kubernetes Manifests
We need several Kubernetes resources to deploy our service. Let us create them one by one.
First, create a ConfigMap for configuration:
# k8s/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: llm-chat-config
  namespace: default
data:
  MODEL_NAME: "local-llm"
  CONTEXT_LENGTH: "4096"
  MAX_TOKENS: "2048"
  TEMPERATURE: "0.7"
  TOP_P: "0.9"
  TOP_K: "40"
  REPEAT_PENALTY: "1.1"
  N_THREADS: "4"
  N_BATCH: "512"
  LOG_LEVEL: "info"
  HOST: "0.0.0.0"
  PORT: "8000"
  ENABLE_STREAMING: "true"
  MAX_CONCURRENT_REQUESTS: "10"
This ConfigMap stores non-sensitive configuration. We reference it in the
Deployment to inject these values as environment variables.
Next, create a PersistentVolumeClaim for model storage:
# k8s/pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-models-pvc
  namespace: default
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 20Gi
  storageClassName: standard
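This PVC requests 20Gi of storage for model files. The ReadOnlyMany access mode allows multiple Pods, including Pods on different nodes, to mount the volume read-only at the same time, which suits model files that never change at runtime. Not every storage class supports ReadOnlyMany, so check what your cluster's provisioner offers. In practice, you would populate this volume with your model files before deploying the service.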
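To judge whether a volume of this size is enough, a rough estimate of model file size is parameter count times bits per weight divided by eight. The sketch below applies that arithmetic; the function names, the 4.5 bits-per-weight figure, and the 10 percent headroom are illustrative assumptions, and real GGUF files differ somewhat because of metadata and mixed quantization.

```python
from typing import Iterable


def estimate_model_bytes(n_params: float, bits_per_weight: float) -> int:
    """Approximate on-disk size of a quantized model in bytes."""
    return int(n_params * bits_per_weight / 8)


def fits(volume_gib: float, model_bytes: Iterable[int], headroom: float = 0.1) -> bool:
    """Check whether the models fit while leaving a fraction of the volume free."""
    usable = volume_gib * 1024**3 * (1 - headroom)
    return sum(model_bytes) <= usable


# Hypothetical example: a 7-billion-parameter model at an assumed 4.5 bits per weight.
size = estimate_model_bytes(7e9, 4.5)
print(size)                   # 3937500000 bytes, roughly 3.7 GiB
print(fits(20, [size] * 4))   # True: four such files fit in 20Gi with headroom
```

Remember that the PVC unit Gi is binary (1024^3 bytes), which is slightly larger than a decimal gigabyte.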
Now create the Deployment:
# k8s/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-chat-deployment
  namespace: default
  labels:
    app: llm-chat
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-chat
  template:
    metadata:
      labels:
        app: llm-chat
    spec:
      containers:
        - name: llm-chat
          image: llm-chat-service:cuda
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 8000
              name: http
              protocol: TCP
          env:
            - name: MODEL_PATH
              value: "/app/models/model.gguf"
            - name: GPU_TYPE
              value: "cuda"
            - name: N_GPU_LAYERS
              value: "99"
            - name: MAIN_GPU
              value: "0"
          envFrom:
            - configMapRef:
                name: llm-chat-config
          volumeMounts:
            - name: models
              mountPath: /app/models
              readOnly: true
          resources:
            requests:
              memory: "8Gi"
              cpu: "2"
              nvidia.com/gpu: "1"
            limits:
              memory: "16Gi"
              cpu: "4"
              nvidia.com/gpu: "1"
          livenessProbe:
            httpGet:
              path: /api/v1/health
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 30
            timeoutSeconds: 10
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /api/v1/health
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: llm-models-pvc
      nodeSelector:
        accelerator: nvidia-gpu
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
This Deployment creates two replicas of our LLM service. The replicas field specifies how many identical Pods to run. Kubernetes distributes these Pods across available nodes, providing redundancy and load distribution.
The resources section is critical for GPU workloads. The requests specify the minimum resources the Pod needs, and the limits specify the maximum it can use. Requesting nvidia.com/gpu: "1" tells Kubernetes this Pod needs one NVIDIA GPU, so the scheduler only places it on a node with a free GPU. GPUs are extended resources: they are allocated as whole devices, cannot be overcommitted, and if you specify both a request and a limit the two values must be equal. By default, a Pod that is granted a GPU this way does not share it with other Pods.
The nodeSelector restricts Pods to nodes that carry the label accelerator=nvidia-gpu, which you need to apply to your GPU nodes yourself. The tolerations allow Pods to run on nodes tainted for GPU workloads. Taints and tolerations are Kubernetes mechanisms for dedicating nodes to specific workloads.
The livenessProbe checks if the container is still running correctly. If the probe fails repeatedly, Kubernetes restarts the container. The readinessProbe checks if the container is ready to accept traffic. If the probe fails, Kubernetes removes the Pod from the Service load balancer until it passes again. Large models can take longer to load than the liveness delay configured here, in which case raising initialDelaySeconds or adding a startupProbe avoids restart loops.
For AMD ROCm GPUs, modify the resources section:
resources:
  requests:
    memory: "8Gi"
    cpu: "2"
    amd.com/gpu: "1"
  limits:
    memory: "16Gi"
    cpu: "4"
    amd.com/gpu: "1"
And update the nodeSelector:
nodeSelector:
  accelerator: amd-gpu
For Intel GPUs, use:
resources:
  requests:
    memory: "8Gi"
    cpu: "2"
    gpu.intel.com/i915: "1"
  limits:
    memory: "16Gi"
    cpu: "4"
    gpu.intel.com/i915: "1"
Now create a Service to expose the Deployment:
# k8s/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: llm-chat-service
  namespace: default
  labels:
    app: llm-chat
spec:
  type: ClusterIP
  selector:
    app: llm-chat
  ports:
    - name: http
      port: 80
      targetPort: 8000
      protocol: TCP
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 3600
This Service creates a stable endpoint for accessing the LLM Pods. The ClusterIP type makes the Service accessible only within the cluster. For external access, you would use LoadBalancer or create an Ingress.
The sessionAffinity setting matters for stateful conversations. Setting it to ClientIP routes requests from the same client IP to the same Pod for up to the configured timeout, which helps when conversation state lives in each Pod's memory. The affinity is based on the source IP the Service sees, so traffic that arrives through a proxy or an ingress controller may all appear to come from one address, or may bypass the Service's affinity entirely. For production systems, store conversation state in external storage such as Redis so that any replica can serve any request.
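One way to prepare for external conversation storage is to hide the storage behind a small interface. The sketch below is an assumption about how that could look, not part of the service code: ConversationStore and InMemoryStore are hypothetical names, and messages are serialized to JSON strings because that is the form a key-value server would hold.

```python
import json
from typing import Dict, List, Optional, Protocol

Messages = List[Dict[str, str]]


class ConversationStore(Protocol):
    """Minimal interface a storage backend would need to provide."""
    def load(self, conversation_id: str) -> Optional[Messages]: ...
    def save(self, conversation_id: str, messages: Messages) -> None: ...
    def delete(self, conversation_id: str) -> bool: ...


class InMemoryStore:
    """Dict-backed store for local development and tests."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def load(self, conversation_id: str) -> Optional[Messages]:
        raw = self._data.get(conversation_id)
        return json.loads(raw) if raw is not None else None

    def save(self, conversation_id: str, messages: Messages) -> None:
        self._data[conversation_id] = json.dumps(messages)

    def delete(self, conversation_id: str) -> bool:
        # True if the conversation existed, mirroring the DELETE endpoint's 404 logic
        return self._data.pop(conversation_id, None) is not None
```

A Redis-backed class implementing the same three methods could then replace InMemoryStore without changes to the code that calls it, at which point session affinity is no longer needed for correctness.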
For external access, create an Ingress:
# k8s/ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: llm-chat-ingress
  namespace: default
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "10m"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
spec:
  ingressClassName: nginx
  rules:
    - host: llm-chat.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: llm-chat-service
                port:
                  number: 80
This Ingress routes external traffic to the Service. The annotations configure
the NGINX ingress controller to handle large request bodies and long timeouts,
both important for LLM services. Replace llm-chat.example.com with your actual
domain.
For automatic scaling based on load, create a HorizontalPodAutoscaler:
# k8s/hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-chat-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-chat-deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 100
          periodSeconds: 30
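This HorizontalPodAutoscaler automatically adjusts the number of replicas based on CPU and memory usage. Utilization is measured relative to each Pod's resource requests, so the targets here mean 70 percent of the requested 2 CPUs and 80 percent of the requested 8Gi of memory, averaged across Pods. When either metric exceeds its target, the autoscaler adds replicas; when usage drops, it removes them. The behavior section controls how quickly scaling happens. Scaling up happens quickly to handle traffic spikes, while scaling down happens slowly to avoid thrashing. Be cautious with the memory metric: a loaded model occupies a roughly constant amount of memory regardless of traffic, so memory utilization may stay above the target and hold the Deployment at maxReplicas.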
Note that GPU-based autoscaling is more complex because GPUs are discrete
resources. Each Pod gets a whole GPU, so scaling happens in GPU-sized increments.
For more sophisticated GPU-based autoscaling, you would use custom metrics based
on GPU utilization or request queue depth.
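The replica count the autoscaler aims for follows the formula given in the Kubernetes documentation: desired = ceil(current replicas × current metric / target metric), with no change when the ratio is within a tolerance of 1.0 (0.1 by default). The sketch below reproduces only that core calculation; the function names are assumptions, and it deliberately ignores the behavior policies and stabilization windows configured above.

```python
import math
from typing import Iterable, Tuple


def desired_replicas(current_replicas: int, current: float, target: float,
                     tolerance: float = 0.1) -> int:
    """Core HPA formula for a single metric."""
    ratio = current / target
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no scaling
    return math.ceil(current_replicas * ratio)


def hpa_decision(current_replicas: int, metrics: Iterable[Tuple[float, float]],
                 min_replicas: int, max_replicas: int) -> int:
    """With several metrics, the largest proposal wins, clamped to the bounds."""
    proposed = max(desired_replicas(current_replicas, cur, tgt) for cur, tgt in metrics)
    return max(min_replicas, min(max_replicas, proposed))


# Two replicas at 90% CPU against a 70% target: ceil(2 * 90 / 70) = 3
print(desired_replicas(2, 90, 70))
```

Because the largest proposal wins, a single metric that is persistently above target is enough to keep the replica count high, which is the risk with the memory metric for model-serving Pods.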
Deploying to Kubernetes
To deploy all resources, apply the manifests in order:
kubectl apply -f k8s/configmap.yaml
kubectl apply -f k8s/pvc.yaml
kubectl apply -f k8s/deployment.yaml
kubectl apply -f k8s/service.yaml
kubectl apply -f k8s/ingress.yaml
kubectl apply -f k8s/hpa.yaml
Check the deployment status:
kubectl get deployments
kubectl get pods
kubectl get services
Watch the Pods start:
kubectl get pods -w
View logs from a Pod:
kubectl logs -f <pod-name>
If a Pod fails to start, describe it to see events and errors:
kubectl describe pod <pod-name>
Common issues include insufficient GPU resources, missing model files in the
PersistentVolume, or incorrect environment variables. The describe command shows
detailed information about why a Pod is not running.
TESTING THE SERVICE
Once deployed, test the service to ensure it works correctly. We will test both
locally with Docker and in Kubernetes.
Testing with curl
The simplest test uses curl to send a request to the chat endpoint. For a local
Docker container:
curl -X POST http://localhost:8000/api/v1/chat \
-H "Content-Type: application/json" \
-d '{
"message": "What is the capital of France?",
"system_prompt": "You are a helpful assistant.",
"model_format": "chatml",
"stream": false,
"temperature": 0.7
}'
This sends a simple question to the LLM. The response includes the generated
text and metadata:
{
"response": "The capital of France is Paris. Paris is not only the capital but also the largest city in France, known for its art, culture, fashion, and iconic landmarks like the Eiffel Tower.",
"conversation_id": null,
"finish_reason": "stop",
"usage": {
"prompt_tokens": 42,
"completion_tokens": 38,
"total_tokens": 80
}
}
To test streaming responses:
curl -X POST http://localhost:8000/api/v1/chat \
-H "Content-Type: application/json" \
-d '{
"message": "Write a short poem about AI.",
"stream": true
}' \
--no-buffer
The --no-buffer flag prevents curl from buffering the output, allowing you to
see tokens as they arrive. The response comes as server-sent events:
data: {"text": " In", "finish_reason": null}
data: {"text": " circuits", "finish_reason": null}
data: {"text": " deep", "finish_reason": null}
data: {"text": ",", "finish_reason": null}
data: [DONE]
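A client consumes this stream by reading it line by line, keeping only the data: lines, and stopping at the [DONE] sentinel. The sketch below shows that parsing step in isolation; iter_sse_chunks is a hypothetical helper name, and it accepts any iterable of text lines so it can be fed from an HTTP response or from a list.

```python
import json
from typing import Dict, Iterable, Iterator


def iter_sse_chunks(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield decoded JSON payloads from 'data: ...' lines until [DONE]."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return  # end-of-stream sentinel sent by the service
        yield json.loads(payload)


sample = [
    'data: {"text": " In", "finish_reason": null}',
    "",
    'data: {"text": " circuits", "finish_reason": null}',
    "",
    "data: [DONE]",
]
print("".join(chunk["text"] for chunk in iter_sse_chunks(sample)))
```

A real client should also check each chunk for an "error" key, since the service reports streaming failures as a data event rather than as an HTTP error status.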
Testing multi-turn conversations requires creating a conversation first:
curl -X POST http://localhost:8000/api/v1/conversations \
-H "Content-Type: application/json" \
-d '{
"conversation_id": "test-conv-123",
"system_prompt": "You are a helpful math tutor."
}'
Then send messages with the conversation ID:
curl -X POST http://localhost:8000/api/v1/chat \
-H "Content-Type: application/json" \
-d '{
"message": "What is 15 times 23?",
"conversation_id": "test-conv-123"
}'
Send a follow-up message:
curl -X POST http://localhost:8000/api/v1/chat \
-H "Content-Type: application/json" \
-d '{
"message": "Now add 100 to that result.",
"conversation_id": "test-conv-123"
}'
The service maintains conversation context, so the second message refers to the
previous result without repeating it.
Retrieve the conversation history:
curl http://localhost:8000/api/v1/conversations/test-conv-123
Delete the conversation when done:
curl -X DELETE http://localhost:8000/api/v1/conversations/test-conv-123
Testing in Kubernetes
For Kubernetes deployments, first port-forward to access the Service locally:
kubectl port-forward service/llm-chat-service 8000:80
Then use the same curl commands as above, connecting to localhost:8000.
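Alternatively, if you configured an Ingress, access the service through its external domain. The example below uses https, which assumes you have added TLS to the Ingress; with the manifest exactly as shown earlier, use http instead: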
curl -X POST https://llm-chat.example.com/api/v1/chat \
-H "Content-Type: application/json" \
-d '{
"message": "Hello, how are you?"
}'
MONITORING AND OBSERVABILITY
Production services require monitoring to track health, performance, and usage.
Kubernetes provides built-in tools, and you can add more sophisticated
monitoring with Prometheus and Grafana.
Viewing Logs
Kubernetes keeps the logs of each container, and kubectl can stream them from several Pods at once using a label selector. View logs from all Pods in the Deployment:
kubectl logs -l app=llm-chat --tail=100 -f
This follows logs from all Pods with the app=llm-chat label, showing the last
100 lines and streaming new entries.
For structured logging, modify the application to output JSON logs. This makes
logs easier to parse and analyze with log aggregation tools like ELK
(Elasticsearch, Logstash, Kibana) or Loki.
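As a minimal sketch of JSON logging with only the standard library, a custom logging.Formatter can serialize each record to one JSON object per line. The class and function names and the chosen field names are assumptions; adapt the fields to whatever your log pipeline expects.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            # Include the traceback as a string when an exception was logged
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry)


def configure_json_logging(level: int = logging.INFO) -> None:
    """Replace the root logger's handlers with one JSON stream handler."""
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    root = logging.getLogger()
    root.handlers = [handler]
    root.setLevel(level)
```

Calling configure_json_logging() in place of logging.basicConfig(...) in app/main.py would switch the whole service to JSON output without touching the individual logger calls.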
Checking Resource Usage
Monitor resource usage with kubectl top:
kubectl top pods -l app=llm-chat
This shows CPU and memory usage for each Pod and requires the metrics-server add-on to be installed in the cluster. For GPU usage, you need to install a GPU monitoring solution like NVIDIA DCGM Exporter for NVIDIA GPUs or ROCm SMI Exporter for AMD GPUs.
Health Checks
The health endpoint provides service status:
curl http://localhost:8000/api/v1/health
The response includes model information:
{
  "status": "healthy",
  "model_info": {
    "model_name": "local-llm",
    "model_path": "/app/models/model.gguf",
    "context_length": 4096,
    "gpu_type": "cuda",
    "gpu_layers": 99
  }
}
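Kubernetes uses this endpoint for liveness and readiness probes. If the liveness
probe fails repeatedly, the kubelet restarts the container. If the readiness
probe fails, the Pod is removed from the Service endpoints and receives no
traffic until the probe passes again.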
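The same polling idea is useful outside Kubernetes, for example in a deployment script that must wait for the model to finish loading. The helper below is a hypothetical sketch: wait_until_healthy and its parameters are illustrative names, and the check callable stands in for whatever code fetches and parses the health response.

```python
import time
from typing import Any, Callable, Dict


def wait_until_healthy(
    check: Callable[[], Dict[str, Any]],
    attempts: int = 5,
    delay_seconds: float = 2.0,
) -> bool:
    """Poll a health check until it reports 'healthy' or attempts run out.

    `check` is any callable returning the parsed health response. It may
    raise on connection errors, which count as a failed attempt.
    """
    for attempt in range(attempts):
        try:
            if check().get("status") == "healthy":
                return True
        except Exception:
            pass  # treat any failure like a failed probe
        if attempt < attempts - 1:
            time.sleep(delay_seconds)
    return False
```

With the requests library, the check could be as simple as lambda: requests.get("http://localhost:8000/api/v1/health", timeout=10).json().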
BUILDING A CLIENT APPLICATION
To demonstrate using the LLM service, let us build a simple client application.
We will create both a Python client library and a command-line interface.
Python Client Library
First, create a reusable client library that encapsulates the API calls:
# client/llm_client.py
import json
from typing import Any, Dict, Iterator, List, Optional, Union

import requests


class LLMClientError(Exception):
    """Base exception for LLM client errors"""


class LLMClient:
    """Client for interacting with the LLM chat microservice"""

    def __init__(self, base_url: str, api_key: Optional[str] = None, timeout: int = 300):
        """
        Initialize the LLM client

        Args:
            base_url: Base URL of the LLM service (e.g., http://localhost:8000)
            api_key: Optional API key for authentication
            timeout: Request timeout in seconds
        """
        self.base_url = base_url.rstrip('/')
        self.api_key = api_key
        self.timeout = timeout
        self.session = requests.Session()
        if api_key:
            self.session.headers.update({'Authorization': f'Bearer {api_key}'})

    def chat(
        self,
        message: str,
        conversation_id: Optional[str] = None,
        system_prompt: Optional[str] = None,
        model_format: str = "chatml",
        stream: bool = False,
        max_tokens: Optional[int] = None,
        temperature: Optional[float] = None,
        top_p: Optional[float] = None,
        top_k: Optional[int] = None,
        stop: Optional[List[str]] = None
    ) -> Union[Dict[str, Any], Iterator[str]]:
        """
        Send a chat message and get a response

        Args:
            message: User message to send
            conversation_id: Optional conversation ID for multi-turn chat
            system_prompt: Optional system prompt for behavior control
            model_format: Prompt format (chatml, llama2, alpaca)
            stream: Whether to stream the response
            max_tokens: Maximum tokens to generate
            temperature: Sampling temperature
            top_p: Nucleus sampling parameter
            top_k: Top-k sampling parameter
            stop: List of stop sequences

        Returns:
            Response dictionary with generated text and metadata, or an
            iterator of text chunks when stream is True

        Raises:
            LLMClientError: If the request fails
        """
        url = f"{self.base_url}/api/v1/chat"
        payload = {
            "message": message,
            "model_format": model_format,
            "stream": stream
        }
        if conversation_id:
            payload["conversation_id"] = conversation_id
        if system_prompt:
            payload["system_prompt"] = system_prompt
        if max_tokens is not None:
            payload["max_tokens"] = max_tokens
        if temperature is not None:
            payload["temperature"] = temperature
        if top_p is not None:
            payload["top_p"] = top_p
        if top_k is not None:
            payload["top_k"] = top_k
        if stop is not None:
            payload["stop"] = stop

        if stream:
            # _stream_chat is a generator, so request errors surface while iterating
            return self._stream_chat(url, payload)
        try:
            response = self.session.post(url, json=payload, timeout=self.timeout)
            response.raise_for_status()
            return response.json()
        except requests.exceptions.RequestException as e:
            raise LLMClientError(f"Chat request failed: {e}") from e

    def _stream_chat(self, url: str, payload: Dict[str, Any]) -> Iterator[str]:
        """
        Stream chat response

        Args:
            url: API endpoint URL
            payload: Request payload

        Yields:
            Text chunks as they are generated

        Raises:
            LLMClientError: If streaming fails
        """
        try:
            # The context manager releases the connection even if iteration stops early
            with self.session.post(
                url,
                json=payload,
                stream=True,
                timeout=self.timeout
            ) as response:
                response.raise_for_status()
                for line in response.iter_lines():
                    if not line:
                        continue
                    line = line.decode('utf-8')
                    if not line.startswith('data: '):
                        continue
                    data = line[6:]  # Remove 'data: ' prefix
                    if data == '[DONE]':
                        break
                    try:
                        chunk = json.loads(data)
                    except json.JSONDecodeError:
                        continue  # Skip malformed events
                    if 'text' in chunk:
                        yield chunk['text']
                    elif 'error' in chunk:
                        raise LLMClientError(f"Server error: {chunk['error']}")
        except requests.exceptions.RequestException as e:
            raise LLMClientError(f"Streaming request failed: {e}") from e

    def create_conversation(
        self,
        conversation_id: str,
        system_prompt: Optional[str] = None
    ) -> Dict[str, Any]:
        """
        Create a new conversation

        Args:
            conversation_id: Unique identifier for the conversation
            system_prompt: Optional system prompt

        Returns:
            Conversation details

        Raises:
            LLMClientError: If creation fails
        """
        url = f"{self.base_url}/api/v1/conversations"
        payload = {"conversation_id": conversation_id}
        if system_prompt:
            payload["system_prompt"] = system_prompt
        try:
            response = self.session.post(url, json=payload, timeout=self.timeout)
            response.raise_for_status()
            return response.json()
        except requests.exceptions.RequestException as e:
            raise LLMClientError(f"Conversation creation failed: {e}") from e

    def get_conversation(self, conversation_id: str) -> Dict[str, Any]:
        """
        Get conversation details and history

        Args:
            conversation_id: Conversation identifier

        Returns:
            Conversation details with message history

        Raises:
            LLMClientError: If retrieval fails
        """
        url = f"{self.base_url}/api/v1/conversations/{conversation_id}"
        try:
            response = self.session.get(url, timeout=self.timeout)
            response.raise_for_status()
            return response.json()
        except requests.exceptions.RequestException as e:
            raise LLMClientError(f"Conversation retrieval failed: {e}") from e

    def delete_conversation(self, conversation_id: str) -> Dict[str, Any]:
        """
        Delete a conversation

        Args:
            conversation_id: Conversation identifier

        Returns:
            Deletion confirmation

        Raises:
            LLMClientError: If deletion fails
        """
        url = f"{self.base_url}/api/v1/conversations/{conversation_id}"
        try:
            response = self.session.delete(url, timeout=self.timeout)
            response.raise_for_status()
            return response.json()
        except requests.exceptions.RequestException as e:
            raise LLMClientError(f"Conversation deletion failed: {e}") from e

    def health_check(self) -> Dict[str, Any]:
        """
        Check service health

        Returns:
            Health status and model information

        Raises:
            LLMClientError: If health check fails
        """
        url = f"{self.base_url}/api/v1/health"
        try:
            response = self.session.get(url, timeout=10)
            response.raise_for_status()
            return response.json()
        except requests.exceptions.RequestException as e:
            raise LLMClientError(f"Health check failed: {e}") from e

    def close(self):
        """Close the client session"""
        self.session.close()

    def __enter__(self):
        """Context manager entry"""
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        """Context manager exit"""
        self.close()
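This client library provides a clean Python interface for the LLM service. It handles request formatting, error handling, and streaming responses. With stream=True, chat() returns an iterator of text chunks instead of a response dictionary, so callers loop over the result rather than indexing into it. The context manager support allows using the client in a with statement for automatic cleanup.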
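To make the streaming format concrete, the sketch below isolates the server-sent-event parsing into a pure function that can be exercised without a running service. The function name iter_sse_text is hypothetical; the event shapes (a 'data: ' prefix, JSON objects with a text or error key, and a final [DONE] marker) mirror what the client above expects.

```python
import json
from typing import Iterable, Iterator


def iter_sse_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield the 'text' field of each SSE data event until '[DONE]'."""
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank separators and non-data fields
        data = line[len("data: "):]
        if data == "[DONE]":
            return  # end-of-stream marker
        try:
            chunk = json.loads(data)
        except json.JSONDecodeError:
            continue  # ignore malformed events
        if "text" in chunk:
            yield chunk["text"]
        elif "error" in chunk:
            raise RuntimeError(f"Server error: {chunk['error']}")


if __name__ == "__main__":
    events = ['data: {"text": "Hel"}', "", 'data: {"text": "lo"}', "data: [DONE]"]
    print("".join(iter_sse_text(events)))
```

Keeping the parsing separate from the HTTP call makes it straightforward to unit test edge cases such as malformed events or an error payload arriving mid-stream.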
Command-Line Interface
Now create a command-line interface using the client library:
# client/cli.py
import argparse
import sys
import uuid

from llm_client import LLMClient, LLMClientError


def print_streaming_response(client: LLMClient, message: str, **kwargs):
    """
    Print a streaming response with real-time output

    Args:
        client: LLM client instance
        message: Message to send
        **kwargs: Additional chat parameters
    """
    print("Assistant: ", end='', flush=True)
    try:
        for chunk in client.chat(message, stream=True, **kwargs):
            print(chunk, end='', flush=True)
        print()  # New line after response
    except LLMClientError as e:
        print(f"\nError: {e}", file=sys.stderr)
        sys.exit(1)


def print_complete_response(client: LLMClient, message: str, **kwargs):
    """
    Print a complete response after generation finishes

    Args:
        client: LLM client instance
        message: Message to send
        **kwargs: Additional chat parameters
    """
    try:
        response = client.chat(message, stream=False, **kwargs)
        print(f"Assistant: {response['response']}")
        if response.get('usage'):
            usage = response['usage']
            print(f"\nTokens - Prompt: {usage['prompt_tokens']}, "
                  f"Completion: {usage['completion_tokens']}, "
                  f"Total: {usage['total_tokens']}")
    except LLMClientError as e:
        print(f"Error: {e}", file=sys.stderr)
        sys.exit(1)


def interactive_mode(client: LLMClient, args: argparse.Namespace):
    """
    Run interactive chat mode

    Args:
        client: LLM client instance
        args: Command-line arguments
    """
    # Create conversation if using multi-turn mode
    conversation_id = None
    if args.multi_turn:
        conversation_id = f"cli-{uuid.uuid4()}"
        try:
            client.create_conversation(conversation_id, args.system_prompt)
            print(f"Started conversation: {conversation_id}")
        except LLMClientError as e:
            print(f"Error creating conversation: {e}", file=sys.stderr)
            sys.exit(1)

    print("Interactive mode. Type 'exit' or 'quit' to end, 'clear' to start new conversation.")
    print()

    try:
        while True:
            try:
                user_input = input("You: ").strip()
            except EOFError:
                break
            if not user_input:
                continue
            if user_input.lower() in ['exit', 'quit']:
                break
            if user_input.lower() == 'clear':
                if conversation_id:
                    try:
                        client.delete_conversation(conversation_id)
                        # Switch to the new ID only after the server accepts it
                        new_id = f"cli-{uuid.uuid4()}"
                        client.create_conversation(new_id, args.system_prompt)
                        conversation_id = new_id
                        print("Started new conversation")
                    except LLMClientError as e:
                        print(f"Error resetting conversation: {e}", file=sys.stderr)
                continue

            # Send message
            chat_params = {
                'conversation_id': conversation_id,
                'system_prompt': args.system_prompt if not conversation_id else None,
                'model_format': args.format,
                'temperature': args.temperature,
                'max_tokens': args.max_tokens,
            }
            if args.stream:
                print_streaming_response(client, user_input, **chat_params)
            else:
                print_complete_response(client, user_input, **chat_params)
            print()
    finally:
        # Cleanup conversation
        if conversation_id:
            try:
                client.delete_conversation(conversation_id)
            except LLMClientError:
                pass


def single_message_mode(client: LLMClient, args: argparse.Namespace):
    """
    Send a single message and exit

    Args:
        client: LLM client instance
        args: Command-line arguments
    """
    chat_params = {
        'system_prompt': args.system_prompt,
        'model_format': args.format,
        'temperature': args.temperature,
        'max_tokens': args.max_tokens,
    }
    if args.stream:
        print_streaming_response(client, args.message, **chat_params)
    else:
        print_complete_response(client, args.message, **chat_params)


def health_check_mode(client: LLMClient):
    """
    Check service health and display information

    Args:
        client: LLM client instance
    """
    try:
        health = client.health_check()
        print(f"Status: {health['status']}")
        print("\nModel Information:")
        for key, value in health['model_info'].items():
            print(f"  {key}: {value}")
    except LLMClientError as e:
        print(f"Error: {e}", file=sys.stderr)
        sys.exit(1)


def main():
    """Main entry point for the CLI"""
    parser = argparse.ArgumentParser(
        description='Command-line client for LLM chat microservice',
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  # Interactive chat with streaming
  %(prog)s -i -s

  # Single message
  %(prog)s -m "What is the capital of France?"

  # Multi-turn conversation
  %(prog)s -i --multi-turn --system "You are a helpful math tutor"

  # Check service health
  %(prog)s --health

  # Custom service URL
  %(prog)s -u http://llm-chat.example.com -m "Hello"
"""
    )
    parser.add_argument(
        '-u', '--url',
        default='http://localhost:8000',
        help='Base URL of the LLM service (default: http://localhost:8000)'
    )
    parser.add_argument(
        '-k', '--api-key',
        help='API key for authentication'
    )
    parser.add_argument(
        '-i', '--interactive',
        action='store_true',
        help='Run in interactive mode'
    )
    parser.add_argument(
        '-m', '--message',
        help='Single message to send (non-interactive mode)'
    )
    parser.add_argument(
        '--health',
        action='store_true',
        help='Check service health and display model information'
    )
    parser.add_argument(
        '-s', '--stream',
        action='store_true',
        help='Stream responses in real-time'
    )
    parser.add_argument(
        '--multi-turn',
        action='store_true',
        help='Enable multi-turn conversation mode (maintains context)'
    )
    parser.add_argument(
        '--system',
        dest='system_prompt',
        help='System prompt to control assistant behavior'
    )
    parser.add_argument(
        '--format',
        choices=['chatml', 'llama2', 'alpaca'],
        default='chatml',
        help='Prompt format (default: chatml)'
    )
    parser.add_argument(
        '--temperature',
        type=float,
        default=0.7,
        help='Sampling temperature (default: 0.7)'
    )
    parser.add_argument(
        '--max-tokens',
        type=int,
        default=2048,
        help='Maximum tokens to generate (default: 2048)'
    )
    parser.add_argument(
        '--timeout',
        type=int,
        default=300,
        help='Request timeout in seconds (default: 300)'
    )
    args = parser.parse_args()

    # Validate arguments
    if not args.interactive and not args.message and not args.health:
        parser.error('Either --interactive, --message, or --health is required')

    # Create client
    with LLMClient(args.url, args.api_key, args.timeout) as client:
        if args.health:
            health_check_mode(client)
        elif args.interactive:
            interactive_mode(client, args)
        else:
            single_message_mode(client, args)


if __name__ == '__main__':
    main()
This command-line interface provides multiple modes of operation. Interactive mode allows ongoing conversations with the LLM. Single message mode sends one message and exits, useful for scripting. Health check mode verifies the service is running and displays model information.
The CLI supports all the features of the underlying service: streaming responses, multi-turn conversations, custom system prompts, and generation parameters. It provides a user-friendly interface for testing and using the LLM service.
Web-Based Client Application
For a more user-friendly interface, create a simple web application:
# client/web_app.py
import json
import uuid

from flask import Flask, render_template, request, jsonify, Response, stream_with_context

from llm_client import LLMClient, LLMClientError

app = Flask(__name__)

# Configuration
LLM_SERVICE_URL = "http://localhost:8000"
client = LLMClient(LLM_SERVICE_URL)


@app.route('/')
def index():
    """Render the main chat interface"""
    return render_template('index.html')


@app.route('/api/chat', methods=['POST'])
def chat():
    """Handle chat requests"""
    # silent=True returns None instead of raising on a missing or invalid JSON body
    data = request.get_json(silent=True) or {}
    message = data.get('message')
    conversation_id = data.get('conversation_id')
    stream = data.get('stream', False)

    if not message:
        return jsonify({'error': 'Message is required'}), 400

    try:
        if stream:
            def generate():
                try:
                    for chunk in client.chat(
                        message=message,
                        conversation_id=conversation_id,
                        stream=True
                    ):
                        yield f"data: {json.dumps({'text': chunk})}\n\n"
                    yield "data: [DONE]\n\n"
                except LLMClientError as e:
                    yield f"data: {json.dumps({'error': str(e)})}\n\n"

            return Response(
                stream_with_context(generate()),
                mimetype='text/event-stream'
            )
        else:
            response = client.chat(
                message=message,
                conversation_id=conversation_id,
                stream=False
            )
            return jsonify(response)
    except LLMClientError as e:
        return jsonify({'error': str(e)}), 500


@app.route('/api/conversations', methods=['POST'])
def create_conversation():
    """Create a new conversation"""
    data = request.get_json(silent=True) or {}
    conversation_id = data.get('conversation_id') or str(uuid.uuid4())
    system_prompt = data.get('system_prompt')
    try:
        result = client.create_conversation(conversation_id, system_prompt)
        return jsonify(result)
    except LLMClientError as e:
        return jsonify({'error': str(e)}), 500


@app.route('/api/conversations/<conversation_id>', methods=['GET'])
def get_conversation(conversation_id):
    """Get conversation history"""
    try:
        result = client.get_conversation(conversation_id)
        return jsonify(result)
    except LLMClientError as e:
        return jsonify({'error': str(e)}), 404


@app.route('/api/conversations/<conversation_id>', methods=['DELETE'])
def delete_conversation(conversation_id):
    """Delete a conversation"""
    try:
        result = client.delete_conversation(conversation_id)
        return jsonify(result)
    except LLMClientError as e:
        return jsonify({'error': str(e)}), 404


@app.route('/api/health', methods=['GET'])
def health():
    """Check service health"""
    try:
        result = client.health_check()
        return jsonify(result)
    except LLMClientError as e:
        return jsonify({'error': str(e)}), 500


if __name__ == '__main__':
    # Keep debug off when binding to 0.0.0.0: the Werkzeug debugger allows
    # arbitrary code execution by anyone who can reach the port
    app.run(host='0.0.0.0', port=5000, debug=False)
Create the HTML template for the web interface:
<!-- client/templates/index.html -->
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>LLM Chat Interface</title>
<style>
* {
margin: 0;
padding: 0;
box-sizing: border-box;
}
body {
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, Oxygen, Ubuntu, Cantarell, sans-serif;
background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
height: 100vh;
display: flex;
justify-content: center;
align-items: center;
padding: 20px;
}
.container {
background: white;
border-radius: 10px;
box-shadow: 0 10px 40px rgba(0,0,0,0.2);
width: 100%;
max-width: 800px;
height: 90vh;
display: flex;
flex-direction: column;
}
.header {
padding: 20px;
border-bottom: 1px solid #e0e0e0;
display: flex;
justify-content: space-between;
align-items: center;
}
.header h1 {
font-size: 24px;
color: #333;
}
.status {
display: flex;
align-items: center;
gap: 8px;
}
.status-indicator {
width: 10px;
height: 10px;
border-radius: 50%;
background: #4caf50;
}
.status-indicator.offline {
background: #f44336;
}
.chat-area {
flex: 1;
overflow-y: auto;
padding: 20px;
display: flex;
flex-direction: column;
gap: 16px;
}
.message {
display: flex;
gap: 12px;
max-width: 80%;
}
.message.user {
align-self: flex-end;
flex-direction: row-reverse;
}
.message-avatar {
width: 40px;
height: 40px;
border-radius: 50%;
display: flex;
align-items: center;
justify-content: center;
font-weight: bold;
color: white;
flex-shrink: 0;
}
.message.user .message-avatar {
background: #667eea;
}
.message.assistant .message-avatar {
background: #764ba2;
}
.message-content {
background: #f5f5f5;
padding: 12px 16px;
border-radius: 12px;
line-height: 1.5;
}
.message.user .message-content {
background: #667eea;
color: white;
}
.input-area {
padding: 20px;
border-top: 1px solid #e0e0e0;
}
.input-container {
display: flex;
gap: 12px;
}
#message-input {
flex: 1;
padding: 12px 16px;
border: 2px solid #e0e0e0;
border-radius: 24px;
font-size: 14px;
outline: none;
transition: border-color 0.3s;
}
#message-input:focus {
border-color: #667eea;
}
#send-button {
padding: 12px 24px;
background: #667eea;
color: white;
border: none;
border-radius: 24px;
font-size: 14px;
font-weight: 600;
cursor: pointer;
transition: background 0.3s;
}
#send-button:hover {
background: #5568d3;
}
#send-button:disabled {
background: #ccc;
cursor: not-allowed;
}
.controls {
display: flex;
gap: 12px;
margin-bottom: 12px;
flex-wrap: wrap;
}
.control-button {
padding: 8px 16px;
background: #f5f5f5;
border: 1px solid #e0e0e0;
border-radius: 16px;
font-size: 12px;
cursor: pointer;
transition: all 0.3s;
}
.control-button:hover {
background: #e0e0e0;
}
.control-button.active {
background: #667eea;
color: white;
border-color: #667eea;
}
.loading {
display: flex;
gap: 4px;
padding: 12px 16px;
}
.loading-dot {
width: 8px;
height: 8px;
border-radius: 50%;
background: #999;
animation: loading 1.4s infinite ease-in-out both;
}
.loading-dot:nth-child(1) {
animation-delay: -0.32s;
}
.loading-dot:nth-child(2) {
animation-delay: -0.16s;
}
@keyframes loading {
0%, 80%, 100% {
transform: scale(0);
}
40% {
transform: scale(1);
}
}
</style>
</head>
<body>
<div class="container">
<div class="header">
<h1>LLM Chat</h1>
<div class="status">
<div class="status-indicator" id="status-indicator"></div>
<span id="status-text">Checking...</span>
</div>
</div>
<div class="chat-area" id="chat-area">
<div class="message assistant">
<div class="message-avatar">AI</div>
<div class="message-content">
Hello! I'm your AI assistant. How can I help you today?
</div>
</div>
</div>
<div class="input-area">
<div class="controls">
<button class="control-button active" id="stream-toggle">
Streaming: ON
</button>
<button class="control-button" id="multi-turn-toggle">
Multi-turn: OFF
</button>
<button class="control-button" id="clear-button">
Clear Chat
</button>
</div>
<div class="input-container">
<input
type="text"
id="message-input"
placeholder="Type your message..."
autocomplete="off"
>
<button id="send-button">Send</button>
</div>
</div>
</div>
<script>
// State management
let isStreaming = true;
let isMultiTurn = false;
let conversationId = null;
let isProcessing = false;
// DOM elements
const chatArea = document.getElementById('chat-area');
const messageInput = document.getElementById('message-input');
const sendButton = document.getElementById('send-button');
const streamToggle = document.getElementById('stream-toggle');
const multiTurnToggle = document.getElementById('multi-turn-toggle');
const clearButton = document.getElementById('clear-button');
const statusIndicator = document.getElementById('status-indicator');
const statusText = document.getElementById('status-text');
// Check service health on load
checkHealth();
// Event listeners
sendButton.addEventListener('click', sendMessage);
messageInput.addEventListener('keypress', (e) => {
if (e.key === 'Enter' && !e.shiftKey) {
e.preventDefault();
sendMessage();
}
});
streamToggle.addEventListener('click', () => {
isStreaming = !isStreaming;
streamToggle.textContent = `Streaming: ${isStreaming ? 'ON' : 'OFF'}`;
streamToggle.classList.toggle('active');
});
multiTurnToggle.addEventListener('click', async () => {
isMultiTurn = !isMultiTurn;
multiTurnToggle.textContent = `Multi-turn: ${isMultiTurn ? 'ON' : 'OFF'}`;
multiTurnToggle.classList.toggle('active');
if (isMultiTurn && !conversationId) {
await createConversation();
} else if (!isMultiTurn && conversationId) {
await deleteConversation();
}
});
clearButton.addEventListener('click', clearChat);
// Functions
async function checkHealth() {
try {
const response = await fetch('/api/health');
const data = await response.json();
if (data.status === 'healthy') {
statusIndicator.classList.remove('offline');
statusText.textContent = 'Online';
} else {
statusIndicator.classList.add('offline');
statusText.textContent = 'Offline';
}
} catch (error) {
statusIndicator.classList.add('offline');
statusText.textContent = 'Offline';
}
}
async function createConversation() {
try {
const response = await fetch('/api/conversations', {
method: 'POST',
headers: {'Content-Type': 'application/json'},
body: JSON.stringify({})
});
const data = await response.json();
conversationId = data.conversation_id;
} catch (error) {
console.error('Failed to create conversation:', error);
}
}
async function deleteConversation() {
if (!conversationId) return;
try {
await fetch(`/api/conversations/${conversationId}`, {
method: 'DELETE'
});
conversationId = null;
} catch (error) {
console.error('Failed to delete conversation:', error);
}
}
function addMessage(role, content) {
const messageDiv = document.createElement('div');
messageDiv.className = `message ${role}`;
const avatar = document.createElement('div');
avatar.className = 'message-avatar';
avatar.textContent = role === 'user' ? 'You' : 'AI';
const contentDiv = document.createElement('div');
contentDiv.className = 'message-content';
contentDiv.textContent = content;
messageDiv.appendChild(avatar);
messageDiv.appendChild(contentDiv);
chatArea.appendChild(messageDiv);
chatArea.scrollTop = chatArea.scrollHeight;
return contentDiv;
}
function addLoadingIndicator() {
const messageDiv = document.createElement('div');
messageDiv.className = 'message assistant';
messageDiv.id = 'loading-message';
const avatar = document.createElement('div');
avatar.className = 'message-avatar';
avatar.textContent = 'AI';
const loadingDiv = document.createElement('div');
loadingDiv.className = 'loading';
loadingDiv.innerHTML = '<div class="loading-dot"></div><div class="loading-dot"></div><div class="loading-dot"></div>';
messageDiv.appendChild(avatar);
messageDiv.appendChild(loadingDiv);
chatArea.appendChild(messageDiv);
chatArea.scrollTop = chatArea.scrollHeight;
return messageDiv;
}
function removeLoadingIndicator() {
const loadingMessage = document.getElementById('loading-message');
if (loadingMessage) {
loadingMessage.remove();
}
}
async function sendMessage() {
if (isProcessing) return;
const message = messageInput.value.trim();
if (!message) return;
isProcessing = true;
sendButton.disabled = true;
messageInput.value = '';
// Add user message
addMessage('user', message);
try {
if (isStreaming) {
await handleStreamingResponse(message);
} else {
await handleCompleteResponse(message);
}
} catch (error) {
console.error('Error sending message:', error);
addMessage('assistant', 'Sorry, an error occurred. Please try again.');
} finally {
isProcessing = false;
sendButton.disabled = false;
messageInput.focus();
}
}
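async function handleStreamingResponse(message) {
    const response = await fetch('/api/chat', {
        method: 'POST',
        headers: {'Content-Type': 'application/json'},
        body: JSON.stringify({
            message: message,
            conversation_id: conversationId,
            stream: true
        })
    });
    if (!response.ok) {
        throw new Error(`Request failed with status ${response.status}`);
    }
    const reader = response.body.getReader();
    const decoder = new TextDecoder();
    let contentDiv = null;
    let fullText = '';
    let buffer = '';
    let finished = false;
    while (!finished) {
        const {done, value} = await reader.read();
        if (done) break;
        // stream: true keeps multi-byte characters intact across chunk boundaries
        buffer += decoder.decode(value, {stream: true});
        const lines = buffer.split('\n');
        // The last element may be an incomplete line; keep it for the next read
        buffer = lines.pop();
        for (const line of lines) {
            if (!line.startsWith('data: ')) continue;
            const data = line.slice(6);
            if (data === '[DONE]') {
                finished = true;
                break;
            }
            let json;
            try {
                json = JSON.parse(data);
            } catch (e) {
                continue; // Ignore malformed events
            }
            if (json.error) {
                // Surface server-side errors through sendMessage's catch block
                throw new Error(json.error);
            }
            if (json.text) {
                if (!contentDiv) {
                    contentDiv = addMessage('assistant', '');
                }
                fullText += json.text;
                contentDiv.textContent = fullText;
                chatArea.scrollTop = chatArea.scrollHeight;
            }
        }
    }
}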
async function handleCompleteResponse(message) {
    addLoadingIndicator();
    let data;
    try {
        const response = await fetch('/api/chat', {
            method: 'POST',
            headers: {'Content-Type': 'application/json'},
            body: JSON.stringify({
                message: message,
                conversation_id: conversationId,
                stream: false
            })
        });
        data = await response.json();
    } finally {
        // Remove the indicator even when the request or JSON parsing fails
        removeLoadingIndicator();
    }
    if (data.response) {
        addMessage('assistant', data.response);
    } else if (data.error) {
        addMessage('assistant', `Error: ${data.error}`);
    }
}
async function clearChat() {
if (conversationId) {
await deleteConversation();
if (isMultiTurn) {
await createConversation();
}
}
chatArea.innerHTML = '';
addMessage('assistant', 'Chat cleared. How can I help you?');
}
</script>
</body>
</html>
This web application provides a polished chat interface with real-time streaming, multi-turn conversations, and visual feedback. Users can toggle streaming mode and multi-turn mode, clear the chat, and see the service status.
To run the web application, install Flask:
pip install flask
Then start the server:
python client/web_app.py
Access the interface at http://localhost:5000 in your web browser.
PRODUCTION-READY COMPLETE EXAMPLE
Now let us provide the complete, production-ready code for the entire system. This includes all files needed to deploy and run the service.
Complete Project Structure:
llm-chat-microservice/
├── app/
│   ├── __init__.py
│   ├── main.py
│   ├── config.py
│   ├── models/
│   │   ├── __init__.py
│   │   └── llm.py
│   ├── services/
│   │   ├── __init__.py
│   │   └── chat_service.py
│   └── routes/
│       ├── __init__.py
│       └── chat.py
├── client/
│   ├── llm_client.py
│   ├── cli.py
│   ├── web_app.py
│   └── templates/
│       └── index.html
├── k8s/
│   ├── configmap.yaml
│   ├── pvc.yaml
│   ├── deployment.yaml
│   ├── service.yaml
│   ├── ingress.yaml
│   └── hpa.yaml
├── Dockerfile
├── docker-compose.yml
├── requirements.txt
├── .env.example
└── README.md
Complete app/__init__.py:
# app/__init__.py
"""
LLM Chat Microservice
A production-ready microservice for serving local LLM models with
multi-GPU support across NVIDIA CUDA, AMD ROCm, Apple Metal, and Intel SYCL.
"""
__version__ = "1.0.0"
Complete app/models/__init__.py:
# app/models/__init__.py
from app.models.llm import LLMInference
__all__ = ['LLMInference']
Complete app/services/__init__.py:
# app/services/__init__.py
from app.services.chat_service import ChatService, Conversation, Message
__all__ = ['ChatService', 'Conversation', 'Message']
Complete app/routes/__init__.py:
# app/routes/__init__.py
from app.routes import chat
__all__ = ['chat']
Complete .env.example:
# .env.example
# Copy this file to .env and configure for your environment
# Model configuration
MODEL_PATH=/app/models/model.gguf
MODEL_NAME=local-llm
CONTEXT_LENGTH=4096
MAX_TOKENS=2048
TEMPERATURE=0.7
TOP_P=0.9
TOP_K=40
REPEAT_PENALTY=1.1
# GPU configuration
# Options: cuda, rocm, metal, sycl, none
GPU_TYPE=cuda
N_GPU_LAYERS=99
MAIN_GPU=0
# For multi-GPU, specify tensor split as comma-separated values
# TENSOR_SPLIT=0.6,0.4
# Server configuration
HOST=0.0.0.0
PORT=8000
WORKERS=1
LOG_LEVEL=info
# Performance configuration
N_THREADS=4
N_BATCH=512
# API configuration
# API_KEY=your-secret-key-here
ENABLE_STREAMING=true
MAX_CONCURRENT_REQUESTS=10
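As a rough illustration of how these variables could be turned into typed settings, the sketch below reads a subset of them with the standard library only. The Settings class and load_settings function are hypothetical names for this example and are not the project's app/config.py; the defaults mirror the values in .env.example.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping, Optional


@dataclass(frozen=True)
class Settings:
    model_path: str
    context_length: int
    temperature: float
    gpu_type: str
    n_gpu_layers: int
    tensor_split: Optional[List[float]]
    enable_streaming: bool


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables, falling back to the
    defaults listed in .env.example."""
    raw_split = env.get("TENSOR_SPLIT", "").strip()
    return Settings(
        model_path=env.get("MODEL_PATH", "/app/models/model.gguf"),
        context_length=int(env.get("CONTEXT_LENGTH", "4096")),
        temperature=float(env.get("TEMPERATURE", "0.7")),
        gpu_type=env.get("GPU_TYPE", "cuda").lower(),
        n_gpu_layers=int(env.get("N_GPU_LAYERS", "99")),
        # "0.6,0.4" becomes [0.6, 0.4]; unset or empty means single-GPU
        tensor_split=[float(p) for p in raw_split.split(",")] if raw_split else None,
        enable_streaming=env.get("ENABLE_STREAMING", "true").lower() in ("1", "true", "yes"),
    )
```

Passing the environment as a mapping keeps the loader easy to test: load_settings({"TENSOR_SPLIT": "0.6,0.4"}) exercises the multi-GPU path without touching the real process environment.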
Complete README.md:
# LLM Chat Microservice
A production-ready microservice for serving local Large Language Models with comprehensive multi-GPU architecture support including NVIDIA CUDA, AMD ROCm, Apple Metal Performance Shaders, and Intel SYCL.
## Features
- Local LLM inference with no dependency on external APIs
- Multi-GPU architecture support (NVIDIA, AMD, Apple, Intel)
- RESTful API with automatic documentation
- Streaming and non-streaming response modes
- Multi-turn conversation management
- Docker containerization with multi-stage builds
- Kubernetes deployment with auto-scaling
- Production-ready error handling and logging
- Comprehensive client libraries and CLI tools
## Quick Start
### Prerequisites
- Python 3.10 or later
- Docker (for containerized deployment)
- Kubernetes cluster (for orchestrated deployment)
- GPU drivers for your hardware (CUDA, ROCm, Metal, or SYCL)
- LLM model file in GGUF format
### Local Development
1. Clone the repository and navigate to the project directory
2. Create a virtual environment:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
3. Install dependencies:
pip install -r requirements.txt
4. Download an LLM model in GGUF format and note its path
5. Configure environment variables:
cp .env.example .env
# Edit .env with your settings
6. Run the service:
python -m app.main
7. Access the API documentation at http://localhost:8000/docs
### Docker Deployment
Build the image with GPU support:
# For NVIDIA CUDA
docker build --build-arg GPU_TYPE=cuda -t llm-chat-service:cuda .
# For AMD ROCm
docker build --build-arg GPU_TYPE=rocm -t llm-chat-service:rocm .
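# For Apple Metal (containers on macOS run inside a Linux VM and cannot access
# the Metal GPU; run the service natively on the host to use Metal acceleration)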
docker build --build-arg GPU_TYPE=metal -t llm-chat-service:metal .
# For Intel SYCL
docker build --build-arg GPU_TYPE=sycl -t llm-chat-service:sycl .
# CPU only
docker build --build-arg GPU_TYPE=none -t llm-chat-service:cpu .
Run the container:
docker run -d \
--name llm-chat \
--gpus all \
-p 8000:8000 \
-v /path/to/models:/app/models \
-e MODEL_PATH=/app/models/your-model.gguf \
-e GPU_TYPE=cuda \
-e N_GPU_LAYERS=99 \
llm-chat-service:cuda
### Docker Compose Deployment
Configure your environment:
cp .env.example .env
# Edit .env with your settings
Start the service:
docker-compose up -d
### Kubernetes Deployment
1. Ensure GPU device plugins are installed on your cluster
2. Create the model PersistentVolume and populate it with your model files
3. Deploy the service:
kubectl apply -f k8s/
4. Check deployment status:
kubectl get pods -l app=llm-chat
5. Access the service:
kubectl port-forward service/llm-chat-service 8000:80
## API Usage
### Health Check
GET /api/v1/health
Returns service status and model information.
### Single Message Chat
POST /api/v1/chat
Content-Type: application/json
{
"message": "What is the capital of France?",
"system_prompt": "You are a helpful assistant.",
"model_format": "chatml",
"stream": false,
"temperature": 0.7,
"max_tokens": 2048
}
### Streaming Chat
POST /api/v1/chat
Content-Type: application/json
{
"message": "Write a poem about AI.",
"stream": true
}
Response is sent as server-sent events.
### Multi-Turn Conversation
Create a conversation:
POST /api/v1/conversations
Content-Type: application/json
{
"conversation_id": "my-conversation",
"system_prompt": "You are a helpful math tutor."
}
Send messages:
POST /api/v1/chat
Content-Type: application/json
{
"message": "What is 15 times 23?",
"conversation_id": "my-conversation"
}
Get conversation history:
GET /api/v1/conversations/my-conversation
Delete conversation:
DELETE /api/v1/conversations/my-conversation
## Client Usage
### Python Client Library
from client.llm_client import LLMClient
with LLMClient("http://localhost:8000") as client:
    # Single message
    response = client.chat("Hello, how are you?")
    print(response['response'])
    # Streaming
    for chunk in client.chat("Tell me a story", stream=True):
        print(chunk, end='', flush=True)
    # Multi-turn conversation
    client.create_conversation("conv-123", "You are a helpful assistant")
    client.chat("What is Python?", conversation_id="conv-123")
    client.chat("How do I install it?", conversation_id="conv-123")
### Command-Line Interface
# Interactive mode with streaming
python client/cli.py -i -s
# Single message
python client/cli.py -m "What is the capital of France?"
# Multi-turn conversation
python client/cli.py -i --multi-turn --system "You are a helpful tutor"
# Health check
python client/cli.py --health
# Custom service URL
python client/cli.py -u http://llm-chat.example.com -m "Hello"
### Web Interface
Start the web application:
python client/web_app.py
Access the interface at http://localhost:5000
## Configuration
All configuration is done through environment variables. See .env.example for all available options.
Key configuration parameters:
- MODEL_PATH: Path to the GGUF model file
- GPU_TYPE: GPU acceleration type (cuda, rocm, metal, sycl, none)
- N_GPU_LAYERS: Number of model layers to offload to GPU (a value at or above the model's layer count, such as 99, offloads all layers)
- CONTEXT_LENGTH: Maximum context window size
- MAX_TOKENS: Maximum tokens to generate per response
- TEMPERATURE: Sampling temperature (higher = more creative)
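To make the mapping from environment variables to typed settings concrete, here is a minimal sketch. It is not the service's actual config.py, which may be structured differently. The default values are assumptions chosen for illustration, and `Settings` and `load_settings` are hypothetical names.

```python
import os
from dataclasses import dataclass

VALID_GPU_TYPES = {"cuda", "rocm", "metal", "sycl", "none"}


@dataclass(frozen=True)
class Settings:
    model_path: str
    gpu_type: str
    n_gpu_layers: int
    context_length: int
    max_tokens: int
    temperature: float


def load_settings(env=None):
    """Read the key parameters from a mapping (os.environ by default)."""
    env = os.environ if env is None else env
    gpu_type = env.get("GPU_TYPE", "none").lower()
    if gpu_type not in VALID_GPU_TYPES:
        raise ValueError(f"GPU_TYPE must be one of {sorted(VALID_GPU_TYPES)}")
    return Settings(
        model_path=env["MODEL_PATH"],  # required: KeyError if missing
        gpu_type=gpu_type,
        n_gpu_layers=int(env.get("N_GPU_LAYERS", "0")),        # assumed default
        context_length=int(env.get("CONTEXT_LENGTH", "4096")),  # assumed default
        max_tokens=int(env.get("MAX_TOKENS", "2048")),          # assumed default
        temperature=float(env.get("TEMPERATURE", "0.7")),       # assumed default
    )
```

Validating values at startup turns a mistyped variable into an immediate, readable error instead of a failure during the first request.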
## GPU Configuration
### NVIDIA CUDA
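Set GPU_TYPE=cuda and ensure the NVIDIA driver and CUDA toolkit are installed. When running in Docker, the NVIDIA Container Toolkit is also required for the --gpus flag to work. The service will automatically detect and use available NVIDIA GPUs.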
For multi-GPU setups, use TENSOR_SPLIT to distribute the model:
TENSOR_SPLIT=0.6,0.4
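The values are read as relative proportions per GPU, so the example above places roughly 60% of the model on the first GPU and 40% on the second. This assumes the service passes them through to the inference backend unchanged. A hypothetical helper that validates and normalizes such a string might look like this:

```python
def parse_tensor_split(value):
    """Turn a string such as "0.6,0.4" into per-GPU fractions that sum to 1."""
    parts = [float(p) for p in value.split(",") if p.strip()]
    if not parts or any(p < 0 for p in parts) or sum(parts) == 0:
        raise ValueError(f"invalid TENSOR_SPLIT value: {value!r}")
    total = sum(parts)
    return [p / total for p in parts]
```

Because the result is normalized, ratios such as `3,1` and fractions such as `0.75,0.25` describe the same split.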
### AMD ROCm
Set GPU_TYPE=rocm and ensure ROCm is installed. The service uses HIP to interface with AMD GPUs.
### Apple Metal
Set GPU_TYPE=metal on Apple Silicon Macs. Metal support is automatic and uses the unified memory architecture for efficient inference.
### Intel GPUs
Set GPU_TYPE=sycl and ensure Intel oneAPI or OpenCL runtime is installed. Supports both integrated and discrete Intel GPUs.
## Performance Tuning
- Increase N_GPU_LAYERS to offload more computation to GPU
- Adjust N_BATCH for optimal throughput (higher = more memory, better speed)
- Set N_THREADS based on your CPU core count
- Use quantized models (Q4_K_M or Q5_K_M) to reduce memory use and speed up inference, at some cost in output quality
- Enable streaming for better perceived responsiveness
## Monitoring
View logs:
docker logs -f llm-chat
kubectl logs -f -l app=llm-chat
Check resource usage:
docker stats llm-chat
kubectl top pods -l app=llm-chat
Monitor GPU usage:
nvidia-smi # NVIDIA
rocm-smi # AMD
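For scripted monitoring on NVIDIA hardware, `nvidia-smi` can emit machine-readable CSV. The sketch below parses that output into dictionaries. The helper names are hypothetical, and the parser assumes every field is numeric, which may not hold on GPUs that report `[N/A]` for some values.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(csv_text):
    """Parse one CSV line per GPU: index, utilization %, used MiB, total MiB."""
    stats = []
    for line in csv_text.strip().splitlines():
        index, util, used, total = (int(field.strip()) for field in line.split(","))
        stats.append({
            "gpu": index,
            "util_pct": util,
            "mem_used_mib": used,
            "mem_total_mib": total,
        })
    return stats


def read_gpu_stats():
    """Run nvidia-smi and return the parsed statistics (requires an NVIDIA driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_stats(result.stdout)
```

Keeping the parser separate from the subprocess call lets it be tested on a machine without a GPU.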
## Troubleshooting
### Model fails to load
- Verify MODEL_PATH points to a valid GGUF file
- Ensure sufficient memory (RAM or VRAM) for the model
- Check file permissions on the model file
### GPU not detected
- Verify GPU drivers are installed correctly
- Check GPU_TYPE matches your hardware
- Ensure Docker has GPU access (--gpus flag)
- Verify Kubernetes GPU device plugin is running
### Slow inference
- Increase N_GPU_LAYERS to offload more layers to the GPU
- Use a smaller or more quantized model
- Reduce CONTEXT_LENGTH if not needed
- Check GPU utilization with nvidia-smi or rocm-smi
### Out of memory errors
- Use a smaller model or a more aggressively quantized one (fewer bits per weight)
- Reduce CONTEXT_LENGTH
- Reduce N_BATCH
- Offload fewer layers to GPU (reduce N_GPU_LAYERS)
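Reducing CONTEXT_LENGTH helps because the key/value cache grows linearly with the context window. A back-of-the-envelope estimate for a standard transformer is sketched below. The model shape used in the usage note is hypothetical, a 16-bit cache is assumed, and real memory use also includes the weights and compute buffers.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_length, bytes_per_value=2):
    """Estimate KV cache size in bytes.

    The leading factor 2 accounts for storing both keys and values
    for every layer and every position in the context window.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_length * bytes_per_value
```

For an illustrative model with 32 layers, 8 KV heads, and a head dimension of 128, `kv_cache_bytes(32, 8, 128, 8192)` gives 1 GiB, and halving the context to 4096 halves the estimate to 512 MiB.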
## Architecture
The service follows clean architecture principles with clear separation of concerns:
- API Layer (routes/): HTTP endpoints and request/response handling
- Service Layer (services/): Business logic and conversation management
- Inference Layer (models/): LLM loading and inference
- Configuration Layer (config.py): Settings and environment management
This architecture enables:
- Easy testing of individual components
- Flexibility to swap implementations
- Clear dependency flow
- Maintainable and extensible code
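The benefit of this layering can be sketched in a few lines. This is an illustration of the idea, not the project's actual code: `InferenceBackend`, `EchoBackend`, and `ChatService` are hypothetical names, and the stand-in backend replaces the real model so the service layer can be exercised without a GPU.

```python
from typing import Protocol


class InferenceBackend(Protocol):
    """What the service layer needs from the inference layer."""

    def generate(self, messages: list) -> str: ...


class EchoBackend:
    """Stand-in backend for tests; a real one would wrap the loaded model."""

    def generate(self, messages):
        return f"echo: {messages[-1]['content']}"


class ChatService:
    """Service layer: conversation management, independent of HTTP and of the model."""

    def __init__(self, backend):
        self._backend = backend
        self._conversations = {}

    def create_conversation(self, conversation_id, system_prompt=None):
        history = []
        if system_prompt:
            history.append({"role": "system", "content": system_prompt})
        self._conversations[conversation_id] = history

    def chat(self, conversation_id, message):
        history = self._conversations[conversation_id]
        history.append({"role": "user", "content": message})
        reply = self._backend.generate(history)
        history.append({"role": "assistant", "content": reply})
        return reply

    def history(self, conversation_id):
        return list(self._conversations[conversation_id])
```

Because `ChatService` depends only on the `generate` method, swapping the echo backend for a GPU-backed one requires no change to the conversation logic or to the tests that cover it.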
## License
This project is provided as-is for educational and commercial use.
## Support
For issues and questions, please refer to the documentation or create an issue in the project repository.
This completes the guide to building, deploying, and using an LLM chat microservice with support for multiple GPU backends (CUDA, ROCm, Metal, and SYCL). The system is production-ready, with error handling, logging, monitoring, and client tools. It can be deployed in Docker containers or on Kubernetes clusters, where it benefits from automatic scaling and self-healing. The included client applications show how to integrate the service into different kinds of applications, from command-line tools to web interfaces.