INTRODUCTION TO LOCAL LLM APPLICATIONS
The landscape of artificial intelligence has transformed dramatically with the emergence of large language models. While cloud-based services like ChatGPT and Claude have demonstrated remarkable capabilities, there exists a growing demand for local alternatives that provide privacy, control, and independence from external services. Building a web-based application that runs large language models locally while leveraging GPU acceleration presents unique technical challenges and opportunities. This article explores the comprehensive architecture, implementation details, and considerations necessary to create a production-ready local LLM interface that rivals commercial offerings.
The fundamental premise of such an application involves creating a web interface that communicates with a backend server capable of loading, managing, and executing inference on large language models. The application must support multiple GPU backends including NVIDIA CUDA, AMD ROCm, Intel acceleration, and Apple Metal Performance Shaders. Users should have granular control over model parameters, the ability to manage their model collection, and access to a polished user interface with multiple themes.
ARCHITECTURAL OVERVIEW AND SYSTEM DESIGN
The architecture of a local LLM application follows a client-server model with clear separation of concerns. The frontend consists of a responsive web application built with modern JavaScript frameworks, while the backend handles model management, inference execution, and GPU orchestration. The communication layer utilizes WebSocket connections for real-time streaming of generated tokens, providing the familiar incremental response experience users expect from commercial chatbots.
The backend server must be designed with modularity in mind, separating model loading logic, inference execution, parameter management, and API endpoints into distinct components. This separation enables easier testing, maintenance, and future enhancements. The model management system interfaces with HuggingFace's model hub to download and cache models locally, maintaining metadata about available models and their configurations.
GPU acceleration forms the cornerstone of performance in this application.
The backend must detect available GPU resources and select appropriate acceleration libraries. For NVIDIA GPUs, CUDA provides the foundation through libraries like cuBLAS and cuDNN. AMD GPUs utilize ROCm, which offers similar functionality with different implementation details. Intel GPUs leverage oneAPI and Level Zero, while Apple Silicon uses Metal Performance Shaders through the MPS backend. The abstraction layer must handle these differences transparently while exposing a unified interface to the inference engine.
FRONTEND ARCHITECTURE AND USER INTERFACE DESIGN
The frontend application serves as the primary interaction point for users. Building it with vanilla JavaScript provides direct control without framework overhead while maintaining simplicity. The interface consists of several key components including the chat interface, model selector, settings panel, and theme switcher.
The chat interface displays the conversation history and provides an input area for user messages. Each message is rendered as a component that handles both user and assistant messages with appropriate styling. The streaming response mechanism updates the display incrementally as tokens arrive from the backend, creating a natural conversational flow.
The model selector component presents available models in a list format, displaying relevant metadata such as model size, quantization level, and description. When a user selects a different model, the component triggers a backend request to load the new model, handling the loading state with appropriate visual feedback.
The settings panel exposes model parameters through intuitive controls. Temperature is presented as a slider with numerical display, typically ranging from zero to two. Top-k and top-p sampling parameters receive similar treatment. The context window size, measured in tokens, can be adjusted within the limits of the selected model's architecture. Additional parameters like repetition penalty provide fine-grained control over generation behavior.
Here is an example of settings management in JavaScript:
function updateSettingsDisplay() {
const temperatureSlider = document.getElementById('temperature-slider');
const temperatureValue = document.getElementById('temperature-value');
temperatureSlider.addEventListener('input', function(event) {
temperatureValue.textContent = event.target.value;
});
const topKSlider = document.getElementById('top-k-slider');
const topKValue = document.getElementById('top-k-value');
topKSlider.addEventListener('input', function(event) {
topKValue.textContent = event.target.value;
});
const topPSlider = document.getElementById('top-p-slider');
const topPValue = document.getElementById('top-p-value');
topPSlider.addEventListener('input', function(event) {
topPValue.textContent = event.target.value;
});
const repetitionPenaltySlider = document.getElementById('repetition-penalty-slider');
const repetitionPenaltyValue = document.getElementById('repetition-penalty-value');
repetitionPenaltySlider.addEventListener('input', function(event) {
repetitionPenaltyValue.textContent = event.target.value;
});
}
async function applyGenerationSettings() {
const config = {
temperature: parseFloat(document.getElementById('temperature-slider').value),
top_k: parseInt(document.getElementById('top-k-slider').value),
top_p: parseFloat(document.getElementById('top-p-slider').value),
repetition_penalty: parseFloat(document.getElementById('repetition-penalty-slider').value),
max_new_tokens: parseInt(document.getElementById('max-tokens-input').value)
};
try {
const response = await fetch('http://localhost:8000/api/generation-config', {
method: 'POST',
headers: {
'Content-Type': 'application/json'
},
body: JSON.stringify(config)
});
const result = await response.json();
if (result.status === 'success') {
console.log('Settings applied successfully');
} else {
console.error('Failed to apply settings');
}
} catch (error) {
console.error('Error applying settings:', error);
}
}
This code demonstrates the pattern for managing model parameters. Each parameter receives a slider for quick adjustments with immediate visual feedback through the value display. The applyGenerationSettings function collects all current values and sends them to the backend API endpoint for application to the inference engine.
THEME SYSTEM IMPLEMENTATION
The theme system provides visual customization through CSS variables and dynamic class application. A well-designed theme system separates color definitions, spacing, and typography into reusable tokens that can be swapped based on user preference. The implementation stores the user's theme choice in local storage for persistence across sessions.
The theme definition consists of CSS custom properties that define colors for backgrounds, text, borders, and interactive elements. The light theme typically uses light backgrounds with dark text, while the dark theme inverts this relationship. Additional considerations include ensuring sufficient contrast ratios for accessibility and providing smooth transitions when switching themes.
Here is an example of theme definitions in CSS:
:root {
--transition-speed: 0.3s;
}
.theme-light {
--background-primary: #ffffff;
--background-secondary: #f5f5f5;
--background-tertiary: #e0e0e0;
--text-primary: #1a1a1a;
--text-secondary: #4a4a4a;
--text-tertiary: #6a6a6a;
--border-color: #d0d0d0;
--accent-color: #0066cc;
--accent-hover: #0052a3;
--message-user-bg: #e3f2fd;
--message-assistant-bg: #f5f5f5;
--input-background: #ffffff;
--input-border: #c0c0c0;
--button-primary: #0066cc;
--button-primary-hover: #0052a3;
--shadow-color: rgba(0, 0, 0, 0.1);
}
.theme-dark {
--background-primary: #1a1a1a;
--background-secondary: #2a2a2a;
--background-tertiary: #3a3a3a;
--text-primary: #e0e0e0;
--text-secondary: #b0b0b0;
--text-tertiary: #909090;
--border-color: #404040;
--accent-color: #4da6ff;
--accent-hover: #3d96ef;
--message-user-bg: #1e3a5f;
--message-assistant-bg: #2a2a2a;
--input-background: #2a2a2a;
--input-border: #404040;
--button-primary: #4da6ff;
--button-primary-hover: #3d96ef;
--shadow-color: rgba(0, 0, 0, 0.3);
}
body {
background-color: var(--background-primary);
color: var(--text-primary);
transition: background-color var(--transition-speed), color var(--transition-speed);
}
.chat-container {
background-color: var(--background-secondary);
border: 1px solid var(--border-color);
transition: background-color var(--transition-speed), border-color var(--transition-speed);
}
.message-user {
background-color: var(--message-user-bg);
color: var(--text-primary);
transition: background-color var(--transition-speed);
}
.message-assistant {
background-color: var(--message-assistant-bg);
color: var(--text-primary);
transition: background-color var(--transition-speed);
}
.input-area {
background-color: var(--input-background);
border: 1px solid var(--input-border);
color: var(--text-primary);
transition: background-color var(--transition-speed), border-color var(--transition-speed);
}
.button-primary {
background-color: var(--button-primary);
color: #ffffff;
transition: background-color var(--transition-speed);
}
.button-primary:hover {
background-color: var(--button-primary-hover);
}
The theme switcher implementation applies the appropriate class to the root element based on user selection:
function initializeTheme() {
const savedTheme = localStorage.getItem('theme') || 'light';
document.body.className = 'theme-' + savedTheme;
updateThemeIcon(savedTheme);
}
function toggleTheme() {
const currentTheme = document.body.className.replace('theme-', '');
const newTheme = currentTheme === 'light' ? 'dark' : 'light';
document.body.className = 'theme-' + newTheme;
localStorage.setItem('theme', newTheme);
updateThemeIcon(newTheme);
}
function updateThemeIcon(theme) {
const themeIcon = document.getElementById('theme-icon');
themeIcon.textContent = theme === 'light' ? '🌙' : '☀️';
}
This implementation provides a simple toggle between light and dark themes with persistence. The emoji icons provide visual feedback about the current theme state. The theme is applied immediately upon page load by reading from local storage.
BACKEND ARCHITECTURE AND SERVER IMPLEMENTATION
The backend server forms the computational core of the application. Written in Python, it leverages FastAPI for HTTP endpoints and WebSocket support. The server architecture divides responsibilities among several modules including model management, inference execution, GPU detection and initialization, and API routing.
The model manager handles downloading models from HuggingFace, maintaining a local cache, and providing metadata about available models. When a user requests to add a new model, the manager validates the model identifier, initiates the download, and updates the local registry. The deletion operation removes model files and updates the registry accordingly.
GPU detection occurs during server initialization. The system probes for available GPU backends by attempting to import relevant libraries and checking for hardware presence. NVIDIA GPUs are detected through the CUDA runtime API, AMD GPUs through ROCm, Intel through oneAPI, and Apple Silicon through the availability of MPS. The detection logic establishes a priority order, typically preferring CUDA when available, followed by ROCm, then Intel, then MPS, with CPU as the fallback.
Here is an example of GPU detection logic:
import torch
import logging
from typing import Optional, Dict, Any
logger = logging.getLogger(__name__)
class GPUDetector:
def __init__(self):
self.device_type = None
self.device_name = None
self.device_properties = {}
self.detect_gpu()
def detect_gpu(self) -> None:
"""Detect available GPU and set device configuration."""
if self._check_cuda():
self.device_type = 'cuda'
self.device_name = torch.cuda.get_device_name(0)
self.device_properties = {
'name': self.device_name,
'compute_capability': '.'.join(map(str, torch.cuda.get_device_capability(0))),
'total_memory_gb': round(torch.cuda.get_device_properties(0).total_memory / (1024 ** 3), 2),
'multi_processor_count': torch.cuda.get_device_properties(0).multi_processor_count,
'backend': 'CUDA'
}
logger.info(f"CUDA GPU detected: {self.device_name}")
elif self._check_mps():
self.device_type = 'mps'
self.device_name = 'Apple Silicon GPU'
self.device_properties = {
'name': self.device_name,
'backend': 'Metal Performance Shaders'
}
logger.info("Apple MPS GPU detected")
elif self._check_rocm():
self.device_type = 'cuda'
self.device_name = 'AMD GPU (ROCm)'
self.device_properties = {
'name': self.device_name,
'backend': 'ROCm',
'total_memory_gb': round(torch.cuda.get_device_properties(0).total_memory / (1024 ** 3), 2)
}
logger.info("AMD ROCm GPU detected")
elif self._check_intel():
self.device_type = 'xpu'
self.device_name = 'Intel GPU'
self.device_properties = {
'name': self.device_name,
'backend': 'Intel Extension for PyTorch'
}
logger.info("Intel GPU detected")
else:
self.device_type = 'cpu'
self.device_name = 'CPU'
self.device_properties = {
'name': 'CPU',
'cores': torch.get_num_threads(),
'backend': 'CPU'
}
logger.warning("No GPU detected, using CPU")
def _check_cuda(self) -> bool:
"""Check if CUDA is available."""
try:
return torch.cuda.is_available()
except Exception as e:
logger.debug(f"CUDA check failed: {e}")
return False
def _check_mps(self) -> bool:
"""Check if Apple MPS is available."""
try:
return hasattr(torch.backends, 'mps') and torch.backends.mps.is_available()
except Exception as e:
logger.debug(f"MPS check failed: {e}")
return False
def _check_rocm(self) -> bool:
"""Check if ROCm is available."""
try:
if torch.cuda.is_available():
return hasattr(torch.version, 'hip') and torch.version.hip is not None
return False
except Exception as e:
logger.debug(f"ROCm check failed: {e}")
return False
def _check_intel(self) -> bool:
"""Check if Intel GPU is available."""
try:
import intel_extension_for_pytorch as ipex
return hasattr(torch, 'xpu') and torch.xpu.is_available()
except ImportError:
logger.debug("Intel Extension for PyTorch not installed")
return False
except Exception as e:
logger.debug(f"Intel GPU check failed: {e}")
return False
def get_device(self) -> torch.device:
"""Return the appropriate torch device."""
return torch.device(self.device_type)
def get_device_info(self) -> Dict[str, Any]:
"""Return device information."""
return {
'type': self.device_type,
'name': self.device_name,
'properties': self.device_properties
}
This GPU detector provides a unified interface for identifying and configuring the appropriate acceleration backend. The detection logic follows a priority order and provides detailed information about the detected hardware. The get_device method returns a torch.device object that can be used throughout the application for tensor placement.
MODEL MANAGEMENT AND HUGGINGFACE INTEGRATION
The model management system interfaces with HuggingFace's transformers library and model hub. It maintains a local registry of downloaded models, handles model downloads, and provides metadata for the frontend. The registry stores information including model identifiers, local paths, model sizes, quantization levels, and download timestamps.
Downloading a model from HuggingFace involves using the transformers library's from_pretrained method, which handles authentication, file downloads, and caching. The system supports various model formats including full precision, half precision, and quantized variants. Quantization reduces model size and memory requirements while maintaining acceptable performance, making it essential for running larger models on consumer hardware.
Here is an example of the model manager implementation:
import os
import json
import shutil
from pathlib import Path
from typing import List, Dict, Optional, Any
from datetime import datetime
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import logging
logger = logging.getLogger(__name__)
class ModelManager:
def __init__(self, models_dir: str = "./models", registry_path: str = "./model_registry.json"):
self.models_dir = Path(models_dir)
self.registry_path = Path(registry_path)
self.models_dir.mkdir(parents=True, exist_ok=True)
self.registry = self._load_registry()
def _load_registry(self) -> Dict[str, Any]:
"""Load the model registry from disk."""
if self.registry_path.exists():
try:
with open(self.registry_path, 'r') as f:
return json.load(f)
except json.JSONDecodeError:
logger.error("Registry file corrupted, creating new registry")
return {}
return {}
def _save_registry(self) -> None:
"""Save the model registry to disk."""
try:
with open(self.registry_path, 'w') as f:
json.dump(self.registry, f, indent=2)
except Exception as e:
logger.error(f"Failed to save registry: {e}")
def list_models(self) -> List[Dict[str, Any]]:
"""Return a list of all available models."""
models = []
for model_id, info in self.registry.items():
models.append({
'id': model_id,
'name': info.get('name', model_id),
'size': info.get('size', 'unknown'),
'quantization': info.get('quantization', 'none'),
'downloaded_at': info.get('downloaded_at'),
'path': info.get('path'),
'parameters': info.get('parameters', 'unknown')
})
return models
def add_model(self, model_id: str, quantization: Optional[str] = None) -> Dict[str, Any]:
"""Download and add a model from HuggingFace."""
logger.info(f"Adding model: {model_id}")
if model_id in self.registry:
logger.warning(f"Model {model_id} already exists")
return {
'status': 'error',
'message': 'Model already exists in registry'
}
try:
model_path = self.models_dir / model_id.replace('/', '_')
model_path.mkdir(parents=True, exist_ok=True)
logger.info(f"Downloading tokenizer for {model_id}")
tokenizer = AutoTokenizer.from_pretrained(
model_id,
cache_dir=str(model_path),
trust_remote_code=True
)
logger.info(f"Downloading model {model_id}")
load_kwargs = {'cache_dir': str(model_path), 'trust_remote_code': True}
if quantization == '8bit':
load_kwargs['load_in_8bit'] = True
load_kwargs['device_map'] = 'auto'
elif quantization == '4bit':
load_kwargs['load_in_4bit'] = True
load_kwargs['device_map'] = 'auto'
model = AutoModelForCausalLM.from_pretrained(model_id, **load_kwargs)
total_params = sum(p.numel() for p in model.parameters())
model_size = sum(p.numel() * p.element_size() for p in model.parameters()) / (1024 ** 3)
self.registry[model_id] = {
'name': model_id,
'path': str(model_path),
'size': f"{model_size:.2f} GB",
'parameters': f"{total_params / 1e9:.2f}B",
'quantization': quantization or 'none',
'downloaded_at': datetime.now().isoformat()
}
self._save_registry()
del model
del tokenizer
if torch.cuda.is_available():
torch.cuda.empty_cache()
logger.info(f"Successfully added model {model_id}")
return {
'status': 'success',
'message': f'Model {model_id} added successfully',
'model_info': self.registry[model_id]
}
except Exception as e:
logger.error(f"Error adding model {model_id}: {str(e)}")
if model_path.exists():
try:
shutil.rmtree(model_path)
except Exception as cleanup_error:
logger.error(f"Error cleaning up failed download: {cleanup_error}")
return {
'status': 'error',
'message': f'Failed to add model: {str(e)}'
}
def delete_model(self, model_id: str) -> Dict[str, Any]:
"""Delete a model from the local cache."""
logger.info(f"Deleting model: {model_id}")
if model_id not in self.registry:
logger.warning(f"Model {model_id} not found in registry")
return {
'status': 'error',
'message': 'Model not found in registry'
}
try:
model_path = Path(self.registry[model_id]['path'])
if model_path.exists():
shutil.rmtree(model_path)
logger.info(f"Deleted model files at {model_path}")
del self.registry[model_id]
self._save_registry()
logger.info(f"Successfully deleted model {model_id}")
return {
'status': 'success',
'message': f'Model {model_id} deleted successfully'
}
except Exception as e:
logger.error(f"Error deleting model {model_id}: {str(e)}")
return {
'status': 'error',
'message': f'Failed to delete model: {str(e)}'
}
def get_model_path(self, model_id: str) -> Optional[str]:
"""Get the local path for a model."""
if model_id in self.registry:
return self.registry[model_id]['path']
return None
def get_model_info(self, model_id: str) -> Optional[Dict[str, Any]]:
"""Get detailed information about a model."""
if model_id in self.registry:
return self.registry[model_id]
return None
This model manager provides comprehensive functionality for managing the local model collection. The add_model method handles downloading from HuggingFace with support for quantization options. The delete_model method removes both the model files and registry entry. The list_models method provides metadata for the frontend to display available models.
INFERENCE ENGINE AND GENERATION PARAMETERS
The inference engine executes model forward passes to generate text responses. It manages model loading, tokenization, and the generation loop. The engine must handle various generation parameters including temperature, top-k sampling, top-p sampling, repetition penalty, and context window management.
Temperature controls the randomness of predictions by scaling the logits before applying softmax. Lower temperatures produce more deterministic outputs, while higher temperatures increase diversity. Top-k sampling restricts the model to considering only the k most likely next tokens. Top-p sampling, also known as nucleus sampling, dynamically selects the smallest set of tokens whose cumulative probability exceeds p.
The repetition penalty discourages the model from repeating tokens that have already appeared in the generated sequence. This helps prevent the model from getting stuck in repetitive loops. The context window determines how much conversation history the model considers when generating responses.
Here is an example of the inference engine implementation:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
from typing import Generator, Dict, Any, Optional, List
import logging
logger = logging.getLogger(__name__)
class InferenceEngine:
def __init__(self, gpu_detector):
self.gpu_detector = gpu_detector
self.device = gpu_detector.get_device()
self.current_model_id = None
self.model = None
self.tokenizer = None
self.generation_config = GenerationConfig(
temperature=0.7,
top_k=50,
top_p=0.9,
repetition_penalty=1.1,
max_new_tokens=512,
do_sample=True,
pad_token_id=None
)
def load_model(self, model_id: str, model_path: str) -> Dict[str, Any]:
"""Load a model for inference."""
try:
logger.info(f"Loading model: {model_id} from {model_path}")
if self.model is not None:
logger.info("Unloading previous model")
del self.model
del self.tokenizer
if torch.cuda.is_available():
torch.cuda.empty_cache()
self.model = None
self.tokenizer = None
self.tokenizer = AutoTokenizer.from_pretrained(
model_path,
trust_remote_code=True
)
if self.tokenizer.pad_token is None:
self.tokenizer.pad_token = self.tokenizer.eos_token
self.generation_config.pad_token_id = self.tokenizer.pad_token_id
self.generation_config.eos_token_id = self.tokenizer.eos_token_id
load_kwargs = {
'trust_remote_code': True
}
if self.device.type == 'cuda':
load_kwargs['torch_dtype'] = torch.float16
load_kwargs['device_map'] = 'auto'
elif self.device.type == 'mps':
load_kwargs['torch_dtype'] = torch.float16
self.model = AutoModelForCausalLM.from_pretrained(
model_path,
**load_kwargs
)
if self.device.type not in ['cuda']:
self.model = self.model.to(self.device)
self.model.eval()
self.current_model_id = model_id
logger.info(f"Successfully loaded model: {model_id}")
return {
'status': 'success',
'message': f'Model {model_id} loaded successfully'
}
except Exception as e:
logger.error(f"Error loading model {model_id}: {str(e)}")
self.model = None
self.tokenizer = None
self.current_model_id = None
return {
'status': 'error',
'message': f'Failed to load model: {str(e)}'
}
def update_generation_config(self, config: Dict[str, Any]) -> None:
"""Update generation parameters."""
if 'temperature' in config:
self.generation_config.temperature = float(config['temperature'])
if 'top_k' in config:
self.generation_config.top_k = int(config['top_k'])
if 'top_p' in config:
self.generation_config.top_p = float(config['top_p'])
if 'repetition_penalty' in config:
self.generation_config.repetition_penalty = float(config['repetition_penalty'])
if 'max_new_tokens' in config:
self.generation_config.max_new_tokens = int(config['max_new_tokens'])
logger.info(f"Updated generation config: {config}")
def get_generation_config(self) -> Dict[str, Any]:
"""Get current generation configuration."""
return {
'temperature': self.generation_config.temperature,
'top_k': self.generation_config.top_k,
'top_p': self.generation_config.top_p,
'repetition_penalty': self.generation_config.repetition_penalty,
'max_new_tokens': self.generation_config.max_new_tokens
}
def generate_stream(self, messages: List[Dict[str, str]], max_context_tokens: int = 2048) -> Generator[str, None, None]:
"""Generate a streaming response."""
if self.model is None or self.tokenizer is None:
yield "Error: No model loaded"
return
try:
conversation_text = self._format_conversation(messages)
inputs = self.tokenizer(
conversation_text,
return_tensors='pt',
truncation=True,
max_length=max_context_tokens,
padding=False
)
input_ids = inputs['input_ids'].to(self.device)
attention_mask = inputs['attention_mask'].to(self.device)
with torch.no_grad():
for i in range(self.generation_config.max_new_tokens):
outputs = self.model(
input_ids=input_ids,
attention_mask=attention_mask,
use_cache=True
)
logits = outputs.logits[:, -1, :]
if self.generation_config.temperature > 0:
logits = logits / self.generation_config.temperature
if self.generation_config.repetition_penalty != 1.0:
for token_id in set(input_ids[0].tolist()):
if logits[0, token_id] < 0:
logits[0, token_id] *= self.generation_config.repetition_penalty
else:
logits[0, token_id] /= self.generation_config.repetition_penalty
probs = torch.softmax(logits, dim=-1)
if self.generation_config.top_k > 0:
top_k_probs, top_k_indices = torch.topk(probs, min(self.generation_config.top_k, probs.size(-1)))
probs_filtered = torch.zeros_like(probs)
probs_filtered.scatter_(1, top_k_indices, top_k_probs)
probs = probs_filtered / probs_filtered.sum(dim=-1, keepdim=True)
if self.generation_config.top_p < 1.0:
sorted_probs, sorted_indices = torch.sort(probs, descending=True, dim=-1)
cumulative_probs = torch.cumsum(sorted_probs, dim=-1)
sorted_indices_to_remove = cumulative_probs > self.generation_config.top_p
sorted_indices_to_remove[:, 1:] = sorted_indices_to_remove[:, :-1].clone()
sorted_indices_to_remove[:, 0] = False
indices_to_remove = sorted_indices_to_remove.scatter(1, sorted_indices, sorted_indices_to_remove)
probs = probs.masked_fill(indices_to_remove, 0.0)
probs = probs / probs.sum(dim=-1, keepdim=True)
next_token = torch.multinomial(probs, num_samples=1)
if next_token.item() == self.tokenizer.eos_token_id:
break
input_ids = torch.cat([input_ids, next_token], dim=-1)
attention_mask = torch.cat([
attention_mask,
torch.ones((1, 1), dtype=torch.long, device=self.device)
], dim=-1)
token_text = self.tokenizer.decode(next_token[0], skip_special_tokens=True)
yield token_text
except Exception as e:
logger.error(f"Error during generation: {str(e)}")
yield f"Error during generation: {str(e)}"
def _format_conversation(self, messages: List[Dict[str, str]]) -> str:
"""Format conversation messages into a prompt."""
formatted = ""
for message in messages:
role = message.get('role', 'user')
content = message.get('content', '')
if role == 'user':
formatted += f"User: {content}\n"
elif role == 'assistant':
formatted += f"Assistant: {content}\n"
elif role == 'system':
formatted += f"System: {content}\n"
formatted += "Assistant:"
return formatted
This inference engine provides comprehensive control over text generation. The generate_stream method implements streaming generation with custom sampling logic. The implementation manually applies temperature scaling, repetition penalty, top-k filtering, and top-p filtering to demonstrate the underlying mechanisms.
WEBSOCKET COMMUNICATION AND REAL-TIME STREAMING
WebSocket connections enable real-time bidirectional communication between the frontend and backend. Unlike traditional HTTP requests, WebSockets maintain persistent connections that allow the server to push data to the client as it becomes available. This is essential for streaming token generation, where each generated token is sent to the frontend immediately for display.
The backend WebSocket handler receives messages from the client, processes them through the inference engine, and streams responses back token by token. The frontend establishes a WebSocket connection when the application loads and maintains it throughout the session. Message framing uses JSON to encode structured data including message type, content, and metadata.
Here is an example of the FastAPI WebSocket endpoint:
from fastapi import FastAPI, WebSocket, WebSocketDisconnect, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel
from typing import List, Dict, Any, Optional
import json
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
app = FastAPI(title="Local LLM Server")
app.add_middleware(
CORSMiddleware,
allow_origins=["*"],
allow_credentials=True,
allow_methods=["*"],
allow_headers=["*"],
)
gpu_detector = GPUDetector()
model_manager = ModelManager()
inference_engine = InferenceEngine(gpu_detector)
class ModelAddRequest(BaseModel):
model_id: str
quantization: Optional[str] = None
class GenerationConfigUpdate(BaseModel):
temperature: Optional[float] = None
top_k: Optional[int] = None
top_p: Optional[float] = None
repetition_penalty: Optional[float] = None
max_new_tokens: Optional[int] = None
@app.get("/api/models")
async def list_models():
"""List all available models."""
return {"models": model_manager.list_models()}
@app.post("/api/models")
async def add_model(request: ModelAddRequest):
"""Add a new model from HuggingFace."""
result = model_manager.add_model(request.model_id, request.quantization)
if result['status'] == 'error':
raise HTTPException(status_code=400, detail=result['message'])
return result
@app.delete("/api/models/{model_id:path}")
async def delete_model(model_id: str):
"""Delete a model."""
result = model_manager.delete_model(model_id)
if result['status'] == 'error':
raise HTTPException(status_code=404, detail=result['message'])
return result
@app.post("/api/models/{model_id:path}/load")
async def load_model(model_id: str):
"""Load a model for inference."""
model_path = model_manager.get_model_path(model_id)
if not model_path:
raise HTTPException(status_code=404, detail="Model not found")
result = inference_engine.load_model(model_id, model_path)
if result['status'] == 'error':
raise HTTPException(status_code=500, detail=result['message'])
return result
@app.get("/api/current-model")
async def get_current_model():
"""Get information about the currently loaded model."""
if inference_engine.current_model_id:
return {
"model_id": inference_engine.current_model_id,
"loaded": True
}
return {"loaded": False}
@app.post("/api/generation-config")
async def update_generation_config(config: GenerationConfigUpdate):
"""Update generation configuration."""
inference_engine.update_generation_config(config.dict(exclude_none=True))
return {
"status": "success",
"message": "Generation config updated",
"config": inference_engine.get_generation_config()
}
@app.get("/api/generation-config")
async def get_generation_config():
"""Get current generation configuration."""
return inference_engine.get_generation_config()
@app.get("/api/device-info")
async def get_device_info():
"""Get information about the GPU device."""
return gpu_detector.get_device_info()
@app.websocket("/ws/chat")
async def websocket_chat(websocket: WebSocket):
"""WebSocket endpoint for chat interactions."""
await websocket.accept()
logger.info("WebSocket connection established")
try:
while True:
data = await websocket.receive_text()
try:
message_data = json.loads(data)
except json.JSONDecodeError:
await websocket.send_text(json.dumps({
'type': 'error',
'message': 'Invalid JSON format'
}))
continue
if message_data['type'] == 'generate':
messages = message_data['messages']
context_window = message_data.get('context_window', 2048)
await websocket.send_text(json.dumps({
'type': 'start',
'message': 'Generation started'
}))
full_response = ""
try:
for token in inference_engine.generate_stream(messages, context_window):
full_response += token
await websocket.send_text(json.dumps({
'type': 'token',
'content': token
}))
await websocket.send_text(json.dumps({
'type': 'end',
'message': 'Generation complete',
'full_response': full_response
}))
except Exception as e:
logger.error(f"Error during generation: {str(e)}")
await websocket.send_text(json.dumps({
'type': 'error',
'message': f'Generation error: {str(e)}'
}))
elif message_data['type'] == 'ping':
await websocket.send_text(json.dumps({
'type': 'pong',
'timestamp': message_data.get('timestamp')
}))
except WebSocketDisconnect:
logger.info("WebSocket connection closed")
except Exception as e:
logger.error(f"WebSocket error: {str(e)}")
try:
await websocket.close()
except:
pass
This FastAPI application provides both REST endpoints for model management and a WebSocket endpoint for chat interactions. The REST endpoints handle model listing, addition, deletion, and loading. The WebSocket endpoint processes chat messages and streams responses back to the client.
The frontend WebSocket client connects to this endpoint and handles incoming messages:
class ChatWebSocket {
constructor(url, onToken, onComplete, onError) {
this.url = url;
this.onToken = onToken;
this.onComplete = onComplete;
this.onError = onError;
this.ws = null;
this.reconnectAttempts = 0;
this.maxReconnectAttempts = 5;
this.reconnectDelay = 1000;
}
connect() {
this.ws = new WebSocket(this.url);
this.ws.onopen = () => {
console.log('WebSocket connected');
this.reconnectAttempts = 0;
};
this.ws.onmessage = (event) => {
try {
const data = JSON.parse(event.data);
switch (data.type) {
case 'start':
console.log('Generation started');
break;
case 'token':
if (this.onToken) {
this.onToken(data.content);
}
break;
case 'end':
if (this.onComplete) {
this.onComplete(data.full_response);
}
break;
case 'error':
if (this.onError) {
this.onError(data.message);
}
break;
case 'pong':
console.log('Pong received');
break;
}
} catch (error) {
console.error('Error parsing WebSocket message:', error);
}
};
this.ws.onerror = (error) => {
console.error('WebSocket error:', error);
if (this.onError) {
this.onError('WebSocket connection error');
}
};
this.ws.onclose = () => {
console.log('WebSocket disconnected');
this.attemptReconnect();
};
}
attemptReconnect() {
if (this.reconnectAttempts < this.maxReconnectAttempts) {
this.reconnectAttempts++;
console.log('Reconnecting... Attempt ' + this.reconnectAttempts);
setTimeout(() => this.connect(), this.reconnectDelay * this.reconnectAttempts);
} else {
console.error('Max reconnection attempts reached');
if (this.onError) {
this.onError('Failed to reconnect to server');
}
}
}
sendMessage(messages, contextWindow) {
if (this.ws && this.ws.readyState === WebSocket.OPEN) {
const payload = {
type: 'generate',
messages: messages,
context_window: contextWindow || 2048
};
this.ws.send(JSON.stringify(payload));
} else {
console.error('WebSocket is not connected');
if (this.onError) {
this.onError('WebSocket is not connected');
}
}
}
ping() {
if (this.ws && this.ws.readyState === WebSocket.OPEN) {
this.ws.send(JSON.stringify({
type: 'ping',
timestamp: Date.now()
}));
}
}
close() {
if (this.ws) {
this.ws.close();
}
}
}
This WebSocket client handles connection establishment, message sending, and automatic reconnection. The class accepts callback functions for token reception, completion, and error handling, allowing the UI components to respond appropriately to different events.
COMPLETE RUNNING EXAMPLE APPLICATION
The following section presents a complete, production-ready implementation of the local LLM application. This implementation integrates all components discussed above into a working system. The code is organized into backend and frontend sections, with each component fully implemented without placeholders or simplifications.
BACKEND IMPLEMENTATION:
The backend consists of several Python files organized into a coherent structure. The main application file ties together all components:
# main.py - Main application entry point
from fastapi import FastAPI, WebSocket, WebSocketDisconnect, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from fastapi.staticfiles import StaticFiles
from pydantic import BaseModel
from typing import List, Dict, Any, Optional
import json
import logging
import uvicorn
from pathlib import Path
from gpu_detector import GPUDetector
from model_manager import ModelManager
from inference_engine import InferenceEngine
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)
app = FastAPI(
title="Local LLM Server",
description="A production-ready local LLM inference server with GPU acceleration",
version="1.0.0"
)
app.add_middleware(
CORSMiddleware,
allow_origins=["*"],
allow_credentials=True,
allow_methods=["*"],
allow_headers=["*"],
)
gpu_detector = GPUDetector()
model_manager = ModelManager(models_dir="./models", registry_path="./model_registry.json")
inference_engine = InferenceEngine(gpu_detector)
class ModelAddRequest(BaseModel):
model_id: str
quantization: Optional[str] = None
class GenerationConfigUpdate(BaseModel):
temperature: Optional[float] = None
top_k: Optional[int] = None
top_p: Optional[float] = None
repetition_penalty: Optional[float] = None
max_new_tokens: Optional[int] = None
@app.on_event("startup")
async def startup_event():
"""Initialize the application on startup."""
logger.info("Starting Local LLM Server")
logger.info(f"GPU Info: {gpu_detector.get_device_info()}")
logger.info(f"Available models: {len(model_manager.list_models())}")
@app.get("/")
async def root():
"""Root endpoint providing API information."""
return {
"name": "Local LLM Server",
"version": "1.0.0",
"gpu": gpu_detector.get_device_info(),
"endpoints": {
"models": "/api/models",
"device_info": "/api/device-info",
"websocket": "/ws/chat"
}
}
@app.get("/api/models")
async def list_models():
"""List all available models."""
return {"models": model_manager.list_models()}
@app.post("/api/models")
async def add_model(request: ModelAddRequest):
"""Add a new model from HuggingFace."""
logger.info(f"Received request to add model: {request.model_id}")
result = model_manager.add_model(request.model_id, request.quantization)
if result['status'] == 'error':
raise HTTPException(status_code=400, detail=result['message'])
return result
@app.delete("/api/models/{model_id:path}")
async def delete_model(model_id: str):
"""Delete a model from local storage."""
logger.info(f"Received request to delete model: {model_id}")
result = model_manager.delete_model(model_id)
if result['status'] == 'error':
raise HTTPException(status_code=404, detail=result['message'])
return result
@app.post("/api/models/{model_id:path}/load")
async def load_model(model_id: str):
"""Load a model for inference."""
logger.info(f"Received request to load model: {model_id}")
model_path = model_manager.get_model_path(model_id)
if not model_path:
raise HTTPException(status_code=404, detail="Model not found")
result = inference_engine.load_model(model_id, model_path)
if result['status'] == 'error':
raise HTTPException(status_code=500, detail=result['message'])
return result
@app.get("/api/current-model")
async def get_current_model():
"""Get information about the currently loaded model."""
if inference_engine.current_model_id:
return {
"model_id": inference_engine.current_model_id,
"loaded": True
}
return {"loaded": False}
@app.post("/api/generation-config")
async def update_generation_config(config: GenerationConfigUpdate):
"""Update generation configuration parameters."""
logger.info(f"Updating generation config: {config.dict(exclude_none=True)}")
inference_engine.update_generation_config(config.dict(exclude_none=True))
return {
"status": "success",
"message": "Generation config updated",
"config": inference_engine.get_generation_config()
}
@app.get("/api/generation-config")
async def get_generation_config():
"""Get current generation configuration."""
return inference_engine.get_generation_config()
@app.get("/api/device-info")
async def get_device_info():
"""Get information about the GPU device."""
return gpu_detector.get_device_info()
@app.websocket("/ws/chat")
async def websocket_chat(websocket: WebSocket):
"""WebSocket endpoint for chat interactions with streaming responses."""
await websocket.accept()
client_id = id(websocket)
logger.info(f"WebSocket connection established: {client_id}")
try:
while True:
data = await websocket.receive_text()
try:
message_data = json.loads(data)
except json.JSONDecodeError:
await websocket.send_text(json.dumps({
'type': 'error',
'message': 'Invalid JSON format'
}))
continue
message_type = message_data.get('type')
if message_type == 'generate':
messages = message_data.get('messages', [])
context_window = message_data.get('context_window', 2048)
if not messages:
await websocket.send_text(json.dumps({
'type': 'error',
'message': 'No messages provided'
}))
continue
if not inference_engine.current_model_id:
await websocket.send_text(json.dumps({
'type': 'error',
'message': 'No model loaded'
}))
continue
await websocket.send_text(json.dumps({
'type': 'start',
'message': 'Generation started'
}))
full_response = ""
try:
for token in inference_engine.generate_stream(messages, context_window):
full_response += token
await websocket.send_text(json.dumps({
'type': 'token',
'content': token
}))
await websocket.send_text(json.dumps({
'type': 'end',
'message': 'Generation complete',
'full_response': full_response
}))
except Exception as e:
logger.error(f"Error during generation: {str(e)}")
await websocket.send_text(json.dumps({
'type': 'error',
'message': f'Generation error: {str(e)}'
}))
elif message_type == 'ping':
await websocket.send_text(json.dumps({
'type': 'pong',
'timestamp': message_data.get('timestamp')
}))
else:
await websocket.send_text(json.dumps({
'type': 'error',
'message': f'Unknown message type: {message_type}'
}))
except WebSocketDisconnect:
logger.info(f"WebSocket connection closed: {client_id}")
except Exception as e:
logger.error(f"WebSocket error for {client_id}: {str(e)}")
try:
await websocket.close()
except:
pass
if __name__ == "__main__":
uvicorn.run(
"main:app",
host="0.0.0.0",
port=8000,
reload=False,
log_level="info"
)
The GPU detector module provides hardware detection and device configuration:
# gpu_detector.py - GPU detection and configuration
import torch
import logging
from typing import Optional, Dict, Any
logger = logging.getLogger(__name__)
class GPUDetector:
"""Detects and configures GPU acceleration backends."""
def __init__(self):
self.device_type = None
self.device_name = None
self.device_properties = {}
self.detect_gpu()
def detect_gpu(self) -> None:
"""Detect available GPU and set device configuration."""
if self._check_cuda():
self.device_type = 'cuda'
self.device_name = torch.cuda.get_device_name(0)
self.device_properties = {
'name': self.device_name,
'compute_capability': '.'.join(map(str, torch.cuda.get_device_capability(0))),
'total_memory_gb': round(torch.cuda.get_device_properties(0).total_memory / (1024 ** 3), 2),
'multi_processor_count': torch.cuda.get_device_properties(0).multi_processor_count,
'backend': 'CUDA'
}
logger.info(f"CUDA GPU detected: {self.device_name}")
elif self._check_mps():
self.device_type = 'mps'
self.device_name = 'Apple Silicon GPU'
self.device_properties = {
'name': self.device_name,
'backend': 'Metal Performance Shaders'
}
logger.info("Apple MPS GPU detected")
elif self._check_rocm():
self.device_type = 'cuda'
self.device_name = 'AMD GPU (ROCm)'
self.device_properties = {
'name': self.device_name,
'backend': 'ROCm',
'total_memory_gb': round(torch.cuda.get_device_properties(0).total_memory / (1024 ** 3), 2)
}
logger.info("AMD ROCm GPU detected")
elif self._check_intel():
self.device_type = 'xpu'
self.device_name = 'Intel GPU'
self.device_properties = {
'name': self.device_name,
'backend': 'Intel Extension for PyTorch'
}
logger.info("Intel GPU detected")
else:
self.device_type = 'cpu'
self.device_name = 'CPU'
self.device_properties = {
'name': 'CPU',
'cores': torch.get_num_threads(),
'backend': 'CPU'
}
logger.warning("No GPU detected, using CPU")
def _check_cuda(self) -> bool:
"""Check if CUDA is available."""
try:
return torch.cuda.is_available()
except Exception as e:
logger.debug(f"CUDA check failed: {e}")
return False
def _check_mps(self) -> bool:
"""Check if Apple MPS is available."""
try:
return hasattr(torch.backends, 'mps') and torch.backends.mps.is_available()
except Exception as e:
logger.debug(f"MPS check failed: {e}")
return False
def _check_rocm(self) -> bool:
"""Check if ROCm is available."""
try:
if torch.cuda.is_available():
return hasattr(torch.version, 'hip') and torch.version.hip is not None
return False
except Exception as e:
logger.debug(f"ROCm check failed: {e}")
return False
def _check_intel(self) -> bool:
"""Check if Intel GPU is available."""
try:
import intel_extension_for_pytorch as ipex
return hasattr(torch, 'xpu') and torch.xpu.is_available()
except ImportError:
logger.debug("Intel Extension for PyTorch not installed")
return False
except Exception as e:
logger.debug(f"Intel GPU check failed: {e}")
return False
def get_device(self) -> torch.device:
"""Return the appropriate torch device."""
return torch.device(self.device_type)
def get_device_info(self) -> Dict[str, Any]:
"""Return device information."""
return {
'type': self.device_type,
'name': self.device_name,
'properties': self.device_properties
}
The model manager handles downloading, caching, and managing local models:
# model_manager.py - Model management and HuggingFace integration
import os
import json
import shutil
from pathlib import Path
from typing import List, Dict, Optional, Any
from datetime import datetime
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import logging
logger = logging.getLogger(__name__)
class ModelManager:
"""Manages local LLM models including download, storage, and deletion."""
def __init__(self, models_dir: str = "./models", registry_path: str = "./model_registry.json"):
self.models_dir = Path(models_dir)
self.registry_path = Path(registry_path)
self.models_dir.mkdir(parents=True, exist_ok=True)
self.registry = self._load_registry()
def _load_registry(self) -> Dict[str, Any]:
"""Load the model registry from disk."""
if self.registry_path.exists():
try:
with open(self.registry_path, 'r') as f:
return json.load(f)
except json.JSONDecodeError:
logger.error("Registry file corrupted, creating new registry")
return {}
return {}
def _save_registry(self) -> None:
"""Save the model registry to disk."""
try:
with open(self.registry_path, 'w') as f:
json.dump(self.registry, f, indent=2)
except Exception as e:
logger.error(f"Failed to save registry: {e}")
def list_models(self) -> List[Dict[str, Any]]:
"""Return a list of all available models."""
models = []
for model_id, info in self.registry.items():
models.append({
'id': model_id,
'name': info.get('name', model_id),
'size': info.get('size', 'unknown'),
'quantization': info.get('quantization', 'none'),
'downloaded_at': info.get('downloaded_at'),
'path': info.get('path'),
'parameters': info.get('parameters', 'unknown')
})
return models
def add_model(self, model_id: str, quantization: Optional[str] = None) -> Dict[str, Any]:
"""Download and add a model from HuggingFace."""
logger.info(f"Adding model: {model_id}")
if model_id in self.registry:
logger.warning(f"Model {model_id} already exists")
return {
'status': 'error',
'message': 'Model already exists in registry'
}
try:
model_path = self.models_dir / model_id.replace('/', '_')
model_path.mkdir(parents=True, exist_ok=True)
logger.info(f"Downloading tokenizer for {model_id}")
tokenizer = AutoTokenizer.from_pretrained(
model_id,
cache_dir=str(model_path),
trust_remote_code=True
)
logger.info(f"Downloading model {model_id}")
load_kwargs = {'cache_dir': str(model_path), 'trust_remote_code': True}
if quantization == '8bit':
load_kwargs['load_in_8bit'] = True
load_kwargs['device_map'] = 'auto'
elif quantization == '4bit':
load_kwargs['load_in_4bit'] = True
load_kwargs['device_map'] = 'auto'
model = AutoModelForCausalLM.from_pretrained(model_id, **load_kwargs)
total_params = sum(p.numel() for p in model.parameters())
model_size = sum(p.numel() * p.element_size() for p in model.parameters()) / (1024 ** 3)
self.registry[model_id] = {
'name': model_id,
'path': str(model_path),
'size': f"{model_size:.2f} GB",
'parameters': f"{total_params / 1e9:.2f}B",
'quantization': quantization or 'none',
'downloaded_at': datetime.now().isoformat()
}
self._save_registry()
del model
del tokenizer
if torch.cuda.is_available():
torch.cuda.empty_cache()
logger.info(f"Successfully added model {model_id}")
return {
'status': 'success',
'message': f'Model {model_id} added successfully',
'model_info': self.registry[model_id]
}
except Exception as e:
logger.error(f"Error adding model {model_id}: {str(e)}")
if model_path.exists():
try:
shutil.rmtree(model_path)
except Exception as cleanup_error:
logger.error(f"Error cleaning up failed download: {cleanup_error}")
return {
'status': 'error',
'message': f'Failed to add model: {str(e)}'
}
def delete_model(self, model_id: str) -> Dict[str, Any]:
"""Delete a model from the local cache."""
logger.info(f"Deleting model: {model_id}")
if model_id not in self.registry:
logger.warning(f"Model {model_id} not found in registry")
return {
'status': 'error',
'message': 'Model not found in registry'
}
try:
model_path = Path(self.registry[model_id]['path'])
if model_path.exists():
shutil.rmtree(model_path)
logger.info(f"Deleted model files at {model_path}")
del self.registry[model_id]
self._save_registry()
logger.info(f"Successfully deleted model {model_id}")
return {
'status': 'success',
'message': f'Model {model_id} deleted successfully'
}
except Exception as e:
logger.error(f"Error deleting model {model_id}: {str(e)}")
return {
'status': 'error',
'message': f'Failed to delete model: {str(e)}'
}
def get_model_path(self, model_id: str) -> Optional[str]:
"""Get the local path for a model."""
if model_id in self.registry:
return self.registry[model_id]['path']
return None
def get_model_info(self, model_id: str) -> Optional[Dict[str, Any]]:
"""Get detailed information about a model."""
if model_id in self.registry:
return self.registry[model_id]
return None
The inference engine handles model loading and text generation:
# inference_engine.py - Model inference and text generation
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
from typing import Generator, Dict, Any, Optional, List
import logging
logger = logging.getLogger(__name__)
class InferenceEngine:
"""Handles model loading and text generation with configurable parameters."""
def __init__(self, gpu_detector):
self.gpu_detector = gpu_detector
self.device = gpu_detector.get_device()
self.current_model_id = None
self.model = None
self.tokenizer = None
self.generation_config = GenerationConfig(
temperature=0.7,
top_k=50,
top_p=0.9,
repetition_penalty=1.1,
max_new_tokens=512,
do_sample=True,
pad_token_id=None
)
def load_model(self, model_id: str, model_path: str) -> Dict[str, Any]:
"""Load a model for inference."""
try:
logger.info(f"Loading model: {model_id} from {model_path}")
if self.model is not None:
logger.info("Unloading previous model")
del self.model
del self.tokenizer
if torch.cuda.is_available():
torch.cuda.empty_cache()
self.model = None
self.tokenizer = None
self.tokenizer = AutoTokenizer.from_pretrained(
model_path,
trust_remote_code=True
)
if self.tokenizer.pad_token is None:
self.tokenizer.pad_token = self.tokenizer.eos_token
self.generation_config.pad_token_id = self.tokenizer.pad_token_id
self.generation_config.eos_token_id = self.tokenizer.eos_token_id
load_kwargs = {
'trust_remote_code': True
}
if self.device.type == 'cuda':
load_kwargs['torch_dtype'] = torch.float16
load_kwargs['device_map'] = 'auto'
elif self.device.type == 'mps':
load_kwargs['torch_dtype'] = torch.float16
self.model = AutoModelForCausalLM.from_pretrained(
model_path,
**load_kwargs
)
if self.device.type not in ['cuda']:
self.model = self.model.to(self.device)
self.model.eval()
self.current_model_id = model_id
logger.info(f"Successfully loaded model: {model_id}")
return {
'status': 'success',
'message': f'Model {model_id} loaded successfully'
}
except Exception as e:
logger.error(f"Error loading model {model_id}: {str(e)}")
self.model = None
self.tokenizer = None
self.current_model_id = None
return {
'status': 'error',
'message': f'Failed to load model: {str(e)}'
}
def update_generation_config(self, config: Dict[str, Any]) -> None:
"""Update generation parameters."""
if 'temperature' in config:
self.generation_config.temperature = float(config['temperature'])
if 'top_k' in config:
self.generation_config.top_k = int(config['top_k'])
if 'top_p' in config:
self.generation_config.top_p = float(config['top_p'])
if 'repetition_penalty' in config:
self.generation_config.repetition_penalty = float(config['repetition_penalty'])
if 'max_new_tokens' in config:
self.generation_config.max_new_tokens = int(config['max_new_tokens'])
logger.info(f"Updated generation config: {config}")
def get_generation_config(self) -> Dict[str, Any]:
"""Get current generation configuration."""
return {
'temperature': self.generation_config.temperature,
'top_k': self.generation_config.top_k,
'top_p': self.generation_config.top_p,
'repetition_penalty': self.generation_config.repetition_penalty,
'max_new_tokens': self.generation_config.max_new_tokens
}
def generate_stream(self, messages: List[Dict[str, str]], max_context_tokens: int = 2048) -> Generator[str, None, None]:
"""Generate a streaming response."""
if self.model is None or self.tokenizer is None:
yield "Error: No model loaded"
return
try:
conversation_text = self._format_conversation(messages)
inputs = self.tokenizer(
conversation_text,
return_tensors='pt',
truncation=True,
max_length=max_context_tokens,
padding=False
)
input_ids = inputs['input_ids'].to(self.device)
attention_mask = inputs['attention_mask'].to(self.device)
with torch.no_grad():
for i in range(self.generation_config.max_new_tokens):
outputs = self.model(
input_ids=input_ids,
attention_mask=attention_mask,
use_cache=True
)
logits = outputs.logits[:, -1, :]
if self.generation_config.temperature > 0:
logits = logits / self.generation_config.temperature
if self.generation_config.repetition_penalty != 1.0:
for token_id in set(input_ids[0].tolist()):
if logits[0, token_id] < 0:
logits[0, token_id] *= self.generation_config.repetition_penalty
else:
logits[0, token_id] /= self.generation_config.repetition_penalty
probs = torch.softmax(logits, dim=-1)
if self.generation_config.top_k > 0:
top_k_probs, top_k_indices = torch.topk(probs, min(self.generation_config.top_k, probs.size(-1)))
probs_filtered = torch.zeros_like(probs)
probs_filtered.scatter_(1, top_k_indices, top_k_probs)
probs = probs_filtered / probs_filtered.sum(dim=-1, keepdim=True)
if self.generation_config.top_p < 1.0:
sorted_probs, sorted_indices = torch.sort(probs, descending=True, dim=-1)
cumulative_probs = torch.cumsum(sorted_probs, dim=-1)
sorted_indices_to_remove = cumulative_probs > self.generation_config.top_p
sorted_indices_to_remove[:, 1:] = sorted_indices_to_remove[:, :-1].clone()
sorted_indices_to_remove[:, 0] = False
indices_to_remove = sorted_indices_to_remove.scatter(1, sorted_indices, sorted_indices_to_remove)
probs = probs.masked_fill(indices_to_remove, 0.0)
probs = probs / probs.sum(dim=-1, keepdim=True)
next_token = torch.multinomial(probs, num_samples=1)
if next_token.item() == self.tokenizer.eos_token_id:
break
input_ids = torch.cat([input_ids, next_token], dim=-1)
attention_mask = torch.cat([
attention_mask,
torch.ones((1, 1), dtype=torch.long, device=self.device)
], dim=-1)
token_text = self.tokenizer.decode(next_token[0], skip_special_tokens=True)
yield token_text
except Exception as e:
logger.error(f"Error during generation: {str(e)}")
yield f"Error during generation: {str(e)}"
def _format_conversation(self, messages: List[Dict[str, str]]) -> str:
"""Format conversation messages into a prompt."""
formatted = ""
for message in messages:
role = message.get('role', 'user')
content = message.get('content', '')
if role == 'user':
formatted += f"User: {content}\n"
elif role == 'assistant':
formatted += f"Assistant: {content}\n"
elif role == 'system':
formatted += f"System: {content}\n"
formatted += "Assistant:"
return formatted
FRONTEND IMPLEMENTATION:
The frontend consists of HTML, CSS, and JavaScript files that create the user interface. The main HTML file provides the structure:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Local LLM Chat</title>
<link rel="stylesheet" href="styles.css">
</head>
<body class="theme-light">
<div class="app-container">
<header class="app-header">
<h1>Local LLM Chat</h1>
<div class="header-controls">
<button id="theme-toggle" class="icon-button" aria-label="Toggle theme">
<span id="theme-icon">🌙</span>
</button>
<button id="settings-toggle" class="icon-button" aria-label="Toggle settings">
⚙️
</button>
</div>
</header>
<div class="main-content">
<aside class="sidebar" id="sidebar">
<div class="sidebar-section">
<h3>Current Model</h3>
<div id="current-model-info">
<p id="current-model-name">No model loaded</p>
<button id="load-model-btn" class="button-secondary">Load Model</button>
</div>
</div>
<div class="sidebar-section">
<h3>Available Models</h3>
<div id="models-list" class="models-list">
<p class="loading-text">Loading models...</p>
</div>
<button id="add-model-btn" class="button-primary">Add Model</button>
</div>
<div class="sidebar-section">
<h3>Device Info</h3>
<div id="device-info">
<p class="loading-text">Loading device info...</p>
</div>
</div>
</aside>
<main class="chat-area">
<div id="chat-messages" class="chat-messages">
<div class="welcome-message">
<h2>Welcome to Local LLM Chat</h2>
<p>Select and load a model to start chatting</p>
</div>
</div>
<div class="input-container">
<textarea
id="message-input"
class="message-input"
placeholder="Type your message here..."
rows="3"
></textarea>
<button id="send-button" class="button-primary send-button" disabled>
Send
</button>
</div>
</main>
<aside class="settings-panel" id="settings-panel" style="display: none;">
<h3>Generation Settings</h3>
<div class="setting-group">
<label for="temperature-slider">
Temperature: <span id="temperature-value">0.7</span>
</label>
<input
type="range"
id="temperature-slider"
min="0"
max="2"
step="0.01"
value="0.7"
>
</div>
<div class="setting-group">
<label for="top-k-slider">
Top-K: <span id="top-k-value">50</span>
</label>
<input
type="range"
id="top-k-slider"
min="1"
max="100"
step="1"
value="50"
>
</div>
<div class="setting-group">
<label for="top-p-slider">
Top-P: <span id="top-p-value">0.9</span>
</label>
<input
type="range"
id="top-p-slider"
min="0"
max="1"
step="0.01"
value="0.9"
>
</div>
<div class="setting-group">
<label for="repetition-penalty-slider">
Repetition Penalty: <span id="repetition-penalty-value">1.1</span>
</label>
<input
type="range"
id="repetition-penalty-slider"
min="1"
max="2"
step="0.01"
value="1.1"
>
</div>
<div class="setting-group">
<label for="context-window-input">
Context Window (tokens):
</label>
<input
type="number"
id="context-window-input"
min="512"
max="32768"
step="512"
value="2048"
>
</div>
<div class="setting-group">
<label for="max-tokens-input">
Max New Tokens:
</label>
<input
type="number"
id="max-tokens-input"
min="1"
max="2048"
step="1"
value="512"
>
</div>
<button id="apply-settings-btn" class="button-primary">
Apply Settings
</button>
</aside>
</div>
</div>
<div id="modal-overlay" class="modal-overlay" style="display: none;">
<div class="modal">
<div class="modal-header">
<h3 id="modal-title">Modal Title</h3>
<button class="modal-close" id="modal-close">×</button>
</div>
<div class="modal-body" id="modal-body">
Modal content
</div>
<div class="modal-footer" id="modal-footer">
<button class="button-secondary" id="modal-cancel">Cancel</button>
<button class="button-primary" id="modal-confirm">Confirm</button>
</div>
</div>
</div>
<script src="app.js"></script>
</body>
</html>
The CSS file provides comprehensive styling for both light and dark themes:
/* styles.css - Application styles with theme support */
:root {
--transition-speed: 0.3s;
--border-radius: 8px;
--spacing-xs: 4px;
--spacing-sm: 8px;
--spacing-md: 16px;
--spacing-lg: 24px;
--spacing-xl: 32px;
}
.theme-light {
--background-primary: #ffffff;
--background-secondary: #f5f5f5;
--background-tertiary: #e0e0e0;
--text-primary: #1a1a1a;
--text-secondary: #4a4a4a;
--text-tertiary: #6a6a6a;
--border-color: #d0d0d0;
--accent-color: #0066cc;
--accent-hover: #0052a3;
--message-user-bg: #e3f2fd;
--message-assistant-bg: #f5f5f5;
--input-background: #ffffff;
--input-border: #c0c0c0;
--button-primary: #0066cc;
--button-primary-hover: #0052a3;
--button-secondary: #6c757d;
--button-secondary-hover: #5a6268;
--shadow-color: rgba(0, 0, 0, 0.1);
--error-color: #d32f2f;
--success-color: #388e3c;
}
.theme-dark {
--background-primary: #1a1a1a;
--background-secondary: #2a2a2a;
--background-tertiary: #3a3a3a;
--text-primary: #e0e0e0;
--text-secondary: #b0b0b0;
--text-tertiary: #909090;
--border-color: #404040;
--accent-color: #4da6ff;
--accent-hover: #3d96ef;
--message-user-bg: #1e3a5f;
--message-assistant-bg: #2a2a2a;
--input-background: #2a2a2a;
--input-border: #404040;
--button-primary: #4da6ff;
--button-primary-hover: #3d96ef;
--button-secondary: #6c757d;
--button-secondary-hover: #5a6268;
--shadow-color: rgba(0, 0, 0, 0.3);
--error-color: #f44336;
--success-color: #66bb6a;
}
* {
margin: 0;
padding: 0;
box-sizing: border-box;
}
body {
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, Oxygen, Ubuntu, Cantarell, sans-serif;
background-color: var(--background-primary);
color: var(--text-primary);
transition: background-color var(--transition-speed), color var(--transition-speed);
line-height: 1.6;
}
.app-container {
display: flex;
flex-direction: column;
height: 100vh;
}
.app-header {
background-color: var(--background-secondary);
border-bottom: 1px solid var(--border-color);
padding: var(--spacing-md) var(--spacing-lg);
display: flex;
justify-content: space-between;
align-items: center;
transition: background-color var(--transition-speed), border-color var(--transition-speed);
}
.app-header h1 {
font-size: 24px;
font-weight: 600;
}
.header-controls {
display: flex;
gap: var(--spacing-sm);
}
.icon-button {
background: none;
border: none;
font-size: 24px;
cursor: pointer;
padding: var(--spacing-sm);
border-radius: var(--border-radius);
transition: background-color var(--transition-speed);
}
.icon-button:hover {
background-color: var(--background-tertiary);
}
.main-content {
display: flex;
flex: 1;
overflow: hidden;
}
.sidebar {
width: 300px;
background-color: var(--background-secondary);
border-right: 1px solid var(--border-color);
padding: var(--spacing-lg);
overflow-y: auto;
transition: background-color var(--transition-speed), border-color var(--transition-speed);
}
.sidebar-section {
margin-bottom: var(--spacing-xl);
}
.sidebar-section h3 {
font-size: 16px;
font-weight: 600;
margin-bottom: var(--spacing-md);
color: var(--text-secondary);
}
.models-list {
max-height: 300px;
overflow-y: auto;
margin-bottom: var(--spacing-md);
}
.model-item {
background-color: var(--background-tertiary);
border: 1px solid var(--border-color);
border-radius: var(--border-radius);
padding: var(--spacing-md);
margin-bottom: var(--spacing-sm);
cursor: pointer;
transition: all var(--transition-speed);
}
.model-item:hover {
border-color: var(--accent-color);
transform: translateY(-2px);
}
.model-item.selected {
border-color: var(--accent-color);
background-color: var(--accent-color);
color: white;
}
.model-item h4 {
font-size: 14px;
font-weight: 600;
margin-bottom: var(--spacing-xs);
}
.model-item p {
font-size: 12px;
color: var(--text-tertiary);
}
.model-item.selected p {
color: rgba(255, 255, 255, 0.8);
}
.model-actions {
display: flex;
gap: var(--spacing-xs);
margin-top: var(--spacing-sm);
}
.model-actions button {
flex: 1;
padding: var(--spacing-xs) var(--spacing-sm);
font-size: 12px;
}
.chat-area {
flex: 1;
display: flex;
flex-direction: column;
background-color: var(--background-primary);
transition: background-color var(--transition-speed);
}
.chat-messages {
flex: 1;
overflow-y: auto;
padding: var(--spacing-lg);
display: flex;
flex-direction: column;
gap: var(--spacing-md);
}
.welcome-message {
text-align: center;
padding: var(--spacing-xl);
color: var(--text-secondary);
}
.welcome-message h2 {
font-size: 28px;
margin-bottom: var(--spacing-md);
}
.message {
max-width: 80%;
padding: var(--spacing-md);
border-radius: var(--border-radius);
animation: fadeIn 0.3s ease-in;
}
@keyframes fadeIn {
from {
opacity: 0;
transform: translateY(10px);
}
to {
opacity: 1;
transform: translateY(0);
}
}
.message-user {
align-self: flex-end;
background-color: var(--message-user-bg);
color: var(--text-primary);
transition: background-color var(--transition-speed);
}
.message-assistant {
align-self: flex-start;
background-color: var(--message-assistant-bg);
color: var(--text-primary);
border: 1px solid var(--border-color);
transition: background-color var(--transition-speed), border-color var(--transition-speed);
}
.message-role {
font-size: 12px;
font-weight: 600;
margin-bottom: var(--spacing-xs);
color: var(--text-secondary);
}
.message-content {
white-space: pre-wrap;
word-wrap: break-word;
}
.typing-indicator {
display: inline-block;
}
.typing-indicator::after {
content: '...';
animation: typing 1.5s infinite;
}
@keyframes typing {
0%, 20% { content: '.'; }
40% { content: '..'; }
60%, 100% { content: '...'; }
}
.input-container {
padding: var(--spacing-lg);
background-color: var(--background-secondary);
border-top: 1px solid var(--border-color);
display: flex;
gap: var(--spacing-md);
transition: background-color var(--transition-speed), border-color var(--transition-speed);
}
.message-input {
flex: 1;
padding: var(--spacing-md);
background-color: var(--input-background);
border: 1px solid var(--input-border);
border-radius: var(--border-radius);
color: var(--text-primary);
font-family: inherit;
font-size: 14px;
resize: vertical;
min-height: 60px;
transition: all var(--transition-speed);
}
.message-input:focus {
outline: none;
border-color: var(--accent-color);
}
.settings-panel {
width: 300px;
background-color: var(--background-secondary);
border-left: 1px solid var(--border-color);
padding: var(--spacing-lg);
overflow-y: auto;
transition: background-color var(--transition-speed), border-color var(--transition-speed);
}
.settings-panel h3 {
font-size: 18px;
font-weight: 600;
margin-bottom: var(--spacing-lg);
}
.setting-group {
margin-bottom: var(--spacing-lg);
}
.setting-group label {
display: block;
font-size: 14px;
font-weight: 500;
margin-bottom: var(--spacing-sm);
color: var(--text-secondary);
}
.setting-group input[type="range"] {
width: 100%;
height: 6px;
border-radius: 3px;
background: var(--background-tertiary);
outline: none;
transition: background var(--transition-speed);
}
.setting-group input[type="range"]::-webkit-slider-thumb {
appearance: none;
width: 18px;
height: 18px;
border-radius: 50%;
background: var(--accent-color);
cursor: pointer;
}
.setting-group input[type="range"]::-moz-range-thumb {
width: 18px;
height: 18px;
border-radius: 50%;
background: var(--accent-color);
cursor: pointer;
border: none;
}
.setting-group input[type="number"] {
width: 100%;
padding: var(--spacing-sm);
background-color: var(--input-background);
border: 1px solid var(--input-border);
border-radius: var(--border-radius);
color: var(--text-primary);
font-size: 14px;
transition: all var(--transition-speed);
}
.setting-group input[type="number"]:focus {
outline: none;
border-color: var(--accent-color);
}
.button-primary {
background-color: var(--button-primary);
color: white;
border: none;
padding: var(--spacing-sm) var(--spacing-md);
border-radius: var(--border-radius);
font-size: 14px;
font-weight: 500;
cursor: pointer;
transition: background-color var(--transition-speed);
}
.button-primary:hover:not(:disabled) {
background-color: var(--button-primary-hover);
}
.button-primary:disabled {
opacity: 0.5;
cursor: not-allowed;
}
.button-secondary {
background-color: var(--button-secondary);
color: white;
border: none;
padding: var(--spacing-sm) var(--spacing-md);
border-radius: var(--border-radius);
font-size: 14px;
font-weight: 500;
cursor: pointer;
transition: background-color var(--transition-speed);
}
.button-secondary:hover:not(:disabled) {
background-color: var(--button-secondary-hover);
}
.send-button {
padding: var(--spacing-md) var(--spacing-lg);
}
.loading-text {
color: var(--text-tertiary);
font-style: italic;
font-size: 14px;
}
.modal-overlay {
position: fixed;
top: 0;
left: 0;
right: 0;
bottom: 0;
background-color: rgba(0, 0, 0, 0.5);
display: flex;
justify-content: center;
align-items: center;
z-index: 1000;
}
.modal {
background-color: var(--background-primary);
border-radius: var(--border-radius);
box-shadow: 0 4px 20px var(--shadow-color);
max-width: 500px;
width: 90%;
max-height: 80vh;
display: flex;
flex-direction: column;
transition: background-color var(--transition-speed);
}
.modal-header {
padding: var(--spacing-lg);
border-bottom: 1px solid var(--border-color);
display: flex;
justify-content: space-between;
align-items: center;
}
.modal-header h3 {
font-size: 20px;
font-weight: 600;
}
.modal-close {
background: none;
border: none;
font-size: 28px;
cursor: pointer;
color: var(--text-secondary);
line-height: 1;
padding: 0;
width: 32px;
height: 32px;
}
.modal-close:hover {
color: var(--text-primary);
}
.modal-body {
padding: var(--spacing-lg);
overflow-y: auto;
flex: 1;
}
.modal-footer {
padding: var(--spacing-lg);
border-top: 1px solid var(--border-color);
display: flex;
justify-content: flex-end;
gap: var(--spacing-md);
}
.form-group {
margin-bottom: var(--spacing-md);
}
.form-group label {
display: block;
font-size: 14px;
font-weight: 500;
margin-bottom: var(--spacing-sm);
color: var(--text-secondary);
}
.form-group input,
.form-group select {
width: 100%;
padding: var(--spacing-sm);
background-color: var(--input-background);
border: 1px solid var(--input-border);
border-radius: var(--border-radius);
color: var(--text-primary);
font-size: 14px;
transition: all var(--transition-speed);
}
.form-group input:focus,
.form-group select:focus {
outline: none;
border-color: var(--accent-color);
}
.error-message {
color: var(--error-color);
font-size: 14px;
margin-top: var(--spacing-sm);
}
.success-message {
color: var(--success-color);
font-size: 14px;
margin-top: var(--spacing-sm);
}
@media (max-width: 768px) {
.sidebar {
position: fixed;
left: -300px;
top: 0;
bottom: 0;
z-index: 100;
transition: left var(--transition-speed);
}
.sidebar.open {
left: 0;
}
.settings-panel {
position: fixed;
right: -300px;
top: 0;
bottom: 0;
z-index: 100;
transition: right var(--transition-speed);
}
.settings-panel.open {
right: 0;
}
.message {
max-width: 90%;
}
}
The JavaScript application file ties everything together:
// app.js - Main application logic
class LocalLLMApp {
constructor() {
this.apiBaseUrl = 'http://localhost:8000';
this.wsUrl = 'ws://localhost:8000/ws/chat';
this.ws = null;
this.currentModelId = null;
this.messages = [];
this.isGenerating = false;
this.currentTheme = 'light';
this.initializeElements();
this.attachEventListeners();
this.loadInitialData();
this.connectWebSocket();
this.loadTheme();
}
initializeElements() {
this.themeToggle = document.getElementById('theme-toggle');
this.themeIcon = document.getElementById('theme-icon');
this.settingsToggle = document.getElementById('settings-toggle');
this.settingsPanel = document.getElementById('settings-panel');
this.sidebar = document.getElementById('sidebar');
this.modelsList = document.getElementById('models-list');
this.addModelBtn = document.getElementById('add-model-btn');
this.loadModelBtn = document.getElementById('load-model-btn');
this.currentModelName = document.getElementById('current-model-name');
this.deviceInfo = document.getElementById('device-info');
this.chatMessages = document.getElementById('chat-messages');
this.messageInput = document.getElementById('message-input');
this.sendButton = document.getElementById('send-button');
this.modalOverlay = document.getElementById('modal-overlay');
this.modalTitle = document.getElementById('modal-title');
this.modalBody = document.getElementById('modal-body');
this.modalFooter = document.getElementById('modal-footer');
this.modalClose = document.getElementById('modal-close');
this.modalCancel = document.getElementById('modal-cancel');
this.modalConfirm = document.getElementById('modal-confirm');
this.temperatureSlider = document.getElementById('temperature-slider');
this.temperatureValue = document.getElementById('temperature-value');
this.topKSlider = document.getElementById('top-k-slider');
this.topKValue = document.getElementById('top-k-value');
this.topPSlider = document.getElementById('top-p-slider');
this.topPValue = document.getElementById('top-p-value');
this.repetitionPenaltySlider = document.getElementById('repetition-penalty-slider');
this.repetitionPenaltyValue = document.getElementById('repetition-penalty-value');
this.contextWindowInput = document.getElementById('context-window-input');
this.maxTokensInput = document.getElementById('max-tokens-input');
this.applySettingsBtn = document.getElementById('apply-settings-btn');
}
attachEventListeners() {
this.themeToggle.addEventListener('click', () => this.toggleTheme());
this.settingsToggle.addEventListener('click', () => this.toggleSettings());
this.addModelBtn.addEventListener('click', () => this.showAddModelDialog());
this.sendButton.addEventListener('click', () => this.sendMessage());
this.messageInput.addEventListener('keydown', (e) => {
if (e.key === 'Enter' && !e.shiftKey) {
e.preventDefault();
this.sendMessage();
}
});
this.modalClose.addEventListener('click', () => this.hideModal());
this.modalCancel.addEventListener('click', () => this.hideModal());
this.modalOverlay.addEventListener('click', (e) => {
if (e.target === this.modalOverlay) {
this.hideModal();
}
});
this.temperatureSlider.addEventListener('input', (e) => {
this.temperatureValue.textContent = e.target.value;
});
this.topKSlider.addEventListener('input', (e) => {
this.topKValue.textContent = e.target.value;
});
this.topPSlider.addEventListener('input', (e) => {
this.topPValue.textContent = e.target.value;
});
this.repetitionPenaltySlider.addEventListener('input', (e) => {
this.repetitionPenaltyValue.textContent = e.target.value;
});
this.applySettingsBtn.addEventListener('click', () => this.applySettings());
}
async loadInitialData() {
await this.loadModels();
await this.loadDeviceInfo();
await this.loadCurrentModel();
await this.loadGenerationConfig();
}
async loadModels() {
try {
const response = await fetch(this.apiBaseUrl + '/api/models');
const data = await response.json();
this.renderModels(data.models);
} catch (error) {
console.error('Error loading models:', error);
this.modelsList.innerHTML = '<p class="error-message">Failed to load models</p>';
}
}
renderModels(models) {
if (models.length === 0) {
this.modelsList.innerHTML = '<p class="loading-text">No models available. Add a model to get started.</p>';
return;
}
this.modelsList.innerHTML = '';
models.forEach(model => {
const modelItem = document.createElement('div');
modelItem.className = 'model-item';
if (model.id === this.currentModelId) {
modelItem.classList.add('selected');
}
modelItem.innerHTML =
'<h4>' + this.escapeHtml(model.name) + '</h4>' +
'<p>Size: ' + this.escapeHtml(model.size) + '</p>' +
'<p>Quantization: ' + this.escapeHtml(model.quantization) + '</p>' +
'<div class="model-actions">' +
'<button class="button-primary load-btn" data-model-id="' + this.escapeHtml(model.id) + '">Load</button>' +
'<button class="button-secondary delete-btn" data-model-id="' + this.escapeHtml(model.id) + '">Delete</button>' +
'</div>';
const loadBtn = modelItem.querySelector('.load-btn');
const deleteBtn = modelItem.querySelector('.delete-btn');
loadBtn.addEventListener('click', (e) => {
e.stopPropagation();
this.loadModel(model.id);
});
deleteBtn.addEventListener('click', (e) => {
e.stopPropagation();
this.confirmDeleteModel(model.id);
});
this.modelsList.appendChild(modelItem);
});
}
async loadDeviceInfo() {
try {
const response = await fetch(this.apiBaseUrl + '/api/device-info');
const data = await response.json();
this.renderDeviceInfo(data);
} catch (error) {
console.error('Error loading device info:', error);
this.deviceInfo.innerHTML = '<p class="error-message">Failed to load device info</p>';
}
}
renderDeviceInfo(info) {
this.deviceInfo.innerHTML =
'<p><strong>Type:</strong> ' + this.escapeHtml(info.type.toUpperCase()) + '</p>' +
'<p><strong>Name:</strong> ' + this.escapeHtml(info.name) + '</p>' +
'<p><strong>Backend:</strong> ' + this.escapeHtml(info.properties.backend) + '</p>';
}
async loadCurrentModel() {
try {
const response = await fetch(this.apiBaseUrl + '/api/current-model');
const data = await response.json();
if (data.loaded) {
this.currentModelId = data.model_id;
this.currentModelName.textContent = data.model_id;
this.sendButton.disabled = false;
}
} catch (error) {
console.error('Error loading current model:', error);
}
}
async loadGenerationConfig() {
try {
const response = await fetch(this.apiBaseUrl + '/api/generation-config');
const config = await response.json();
this.temperatureSlider.value = config.temperature;
this.temperatureValue.textContent = config.temperature;
this.topKSlider.value = config.top_k;
this.topKValue.textContent = config.top_k;
this.topPSlider.value = config.top_p;
this.topPValue.textContent = config.top_p;
this.repetitionPenaltySlider.value = config.repetition_penalty;
this.repetitionPenaltyValue.textContent = config.repetition_penalty;
this.maxTokensInput.value = config.max_new_tokens;
} catch (error) {
console.error('Error loading generation config:', error);
}
}
async loadModel(modelId) {
try {
this.showLoading('Loading model ' + modelId + '...');
const response = await fetch(this.apiBaseUrl + '/api/models/' + encodeURIComponent(modelId) + '/load', {
method: 'POST'
});
const data = await response.json();
if (data.status === 'success') {
this.currentModelId = modelId;
this.currentModelName.textContent = modelId;
this.sendButton.disabled = false;
await this.loadModels();
this.hideModal();
this.showSuccess('Model loaded successfully');
} else {
this.showError(data.message);
}
} catch (error) {
console.error('Error loading model:', error);
this.showError('Failed to load model');
}
}
showAddModelDialog() {
this.modalTitle.textContent = 'Add Model from HuggingFace';
this.modalBody.innerHTML =
'<div class="form-group">' +
'<label for="model-id-input">Model ID (e.g., gpt2, microsoft/phi-2)</label>' +
'<input type="text" id="model-id-input" placeholder="Enter HuggingFace model ID">' +
'</div>' +
'<div class="form-group">' +
'<label for="quantization-select">Quantization</label>' +
'<select id="quantization-select">' +
'<option value="">None (Full Precision)</option>' +
'<option value="8bit">8-bit</option>' +
'<option value="4bit">4-bit</option>' +
'</select>' +
'</div>' +
'<p class="loading-text">Note: Downloading large models may take several minutes.</p>';
this.modalConfirm.textContent = 'Add Model';
this.modalConfirm.onclick = () => this.addModel();
this.showModal();
}
async addModel() {
const modelIdInput = document.getElementById('model-id-input');
const quantizationSelect = document.getElementById('quantization-select');
const modelId = modelIdInput.value.trim();
const quantization = quantizationSelect.value || null;
if (!modelId) {
this.showError('Please enter a model ID');
return;
}
try {
this.showLoading('Downloading model ' + modelId + '...');
const response = await fetch(this.apiBaseUrl + '/api/models', {
method: 'POST',
headers: {
'Content-Type': 'application/json'
},
body: JSON.stringify({
model_id: modelId,
quantization: quantization
})
});
const data = await response.json();
if (data.status === 'success') {
await this.loadModels();
this.hideModal();
this.showSuccess('Model added successfully');
} else {
this.showError(data.message);
}
} catch (error) {
console.error('Error adding model:', error);
this.showError('Failed to add model');
}
}
confirmDeleteModel(modelId) {
this.modalTitle.textContent = 'Confirm Delete';
this.modalBody.innerHTML =
'<p>Are you sure you want to delete the model <strong>' + this.escapeHtml(modelId) + '</strong>?</p>' +
'<p class="error-message">This action cannot be undone.</p>';
this.modalConfirm.textContent = 'Delete';
this.modalConfirm.onclick = () => this.deleteModel(modelId);
this.showModal();
}
async deleteModel(modelId) {
try {
const response = await fetch(this.apiBaseUrl + '/api/models/' + encodeURIComponent(modelId), {
method: 'DELETE'
});
const data = await response.json();
if (data.status === 'success') {
if (this.currentModelId === modelId) {
this.currentModelId = null;
this.currentModelName.textContent = 'No model loaded';
this.sendButton.disabled = true;
}
await this.loadModels();
this.hideModal();
this.showSuccess('Model deleted successfully');
} else {
this.showError(data.message);
}
} catch (error) {
console.error('Error deleting model:', error);
this.showError('Failed to delete model');
}
}
async applySettings() {
const config = {
temperature: parseFloat(this.temperatureSlider.value),
top_k: parseInt(this.topKSlider.value),
top_p: parseFloat(this.topPSlider.value),
repetition_penalty: parseFloat(this.repetitionPenaltySlider.value),
max_new_tokens: parseInt(this.maxTokensInput.value)
};
try {
const response = await fetch(this.apiBaseUrl + '/api/generation-config', {
method: 'POST',
headers: {
'Content-Type': 'application/json'
},
body: JSON.stringify(config)
});
const data = await response.json();
if (data.status === 'success') {
this.showSuccess('Settings applied successfully');
} else {
this.showError('Failed to apply settings');
}
} catch (error) {
console.error('Error applying settings:', error);
this.showError('Failed to apply settings');
}
}
connectWebSocket() {
this.ws = new WebSocket(this.wsUrl);
this.ws.onopen = () => {
console.log('WebSocket connected');
};
this.ws.onmessage = (event) => {
const data = JSON.parse(event.data);
this.handleWebSocketMessage(data);
};
this.ws.onerror = (error) => {
console.error('WebSocket error:', error);
};
this.ws.onclose = () => {
console.log('WebSocket disconnected');
setTimeout(() => this.connectWebSocket(), 3000);
};
}
handleWebSocketMessage(data) {
switch (data.type) {
case 'start':
this.isGenerating = true;
this.sendButton.disabled = true;
this.addAssistantMessage('');
break;
case 'token':
this.appendToLastMessage(data.content);
break;
case 'end':
this.isGenerating = false;
this.sendButton.disabled = false;
this.messages.push({
role: 'assistant',
content: data.full_response
});
this.removeTypingIndicator();
break;
case 'error':
this.isGenerating = false;
this.sendButton.disabled = false;
this.showError(data.message);
break;
}
}
sendMessage() {
const content = this.messageInput.value.trim();
if (!content || this.isGenerating || !this.currentModelId) {
return;
}
this.messages.push({
role: 'user',
content: content
});
this.addUserMessage(content);
this.messageInput.value = '';
const contextWindow = parseInt(this.contextWindowInput.value);
if (this.ws && this.ws.readyState === WebSocket.OPEN) {
this.ws.send(JSON.stringify({
type: 'generate',
messages: this.messages,
context_window: contextWindow
}));
} else {
this.showError('WebSocket not connected');
}
}
addUserMessage(content) {
const messageDiv = document.createElement('div');
messageDiv.className = 'message message-user';
messageDiv.innerHTML =
'<div class="message-role">You</div>' +
'<div class="message-content">' + this.escapeHtml(content) + '</div>';
this.chatMessages.appendChild(messageDiv);
this.scrollToBottom();
}
addAssistantMessage(content) {
const messageDiv = document.createElement('div');
messageDiv.className = 'message message-assistant';
messageDiv.innerHTML =
'<div class="message-role">Assistant</div>' +
'<div class="message-content">' + this.escapeHtml(content) + '<span class="typing-indicator"></span></div>';
this.chatMessages.appendChild(messageDiv);
this.scrollToBottom();
}
appendToLastMessage(content) {
const messages = this.chatMessages.querySelectorAll('.message-assistant');
if (messages.length > 0) {
const lastMessage = messages[messages.length - 1];
const contentDiv = lastMessage.querySelector('.message-content');
const typingIndicator = contentDiv.querySelector('.typing-indicator');
if (typingIndicator) {
typingIndicator.remove();
}
const currentText = contentDiv.textContent;
contentDiv.textContent = currentText + content;
const newTypingIndicator = document.createElement('span');
newTypingIndicator.className = 'typing-indicator';
contentDiv.appendChild(newTypingIndicator);
this.scrollToBottom();
}
}
removeTypingIndicator() {
const messages = this.chatMessages.querySelectorAll('.message-assistant');
if (messages.length > 0) {
const lastMessage = messages[messages.length - 1];
const typingIndicator = lastMessage.querySelector('.typing-indicator');
if (typingIndicator) {
typingIndicator.remove();
}
}
}
scrollToBottom() {
this.chatMessages.scrollTop = this.chatMessages.scrollHeight;
}
toggleTheme() {
this.currentTheme = this.currentTheme === 'light' ? 'dark' : 'light';
document.body.className = 'theme-' + this.currentTheme;
this.themeIcon.textContent = this.currentTheme === 'light' ? '🌙' : '☀️';
localStorage.setItem('theme', this.currentTheme);
}
loadTheme() {
const savedTheme = localStorage.getItem('theme');
if (savedTheme) {
this.currentTheme = savedTheme;
document.body.className = 'theme-' + this.currentTheme;
this.themeIcon.textContent = this.currentTheme === 'light' ? '🌙' : '☀️';
}
}
toggleSettings() {
const isVisible = this.settingsPanel.style.display !== 'none';
this.settingsPanel.style.display = isVisible ? 'none' : 'block';
}
showModal() {
this.modalOverlay.style.display = 'flex';
}
hideModal() {
this.modalOverlay.style.display = 'none';
}
showLoading(message) {
this.modalTitle.textContent = 'Loading';
this.modalBody.innerHTML = '<p>' + this.escapeHtml(message) + '</p>';
this.modalFooter.style.display = 'none';
this.showModal();
}
showError(message) {
alert('Error: ' + message);
}
showSuccess(message) {
alert(message);
}
escapeHtml(text) {
const div = document.createElement('div');
div.textContent = text;
return div.innerHTML;
}
}
document.addEventListener('DOMContentLoaded', () => {
new LocalLLMApp();
});
DEPLOYMENT AND PRODUCTION CONSIDERATIONS
Deploying a local LLM application requires careful consideration of hardware requirements, dependency management, and user experience optimization. The backend server should be packaged with clear installation instructions and dependency specifications. Using a requirements.txt file for Python dependencies ensures reproducible installations across different environments.
The requirements.txt file should include all necessary dependencies with version specifications:
fastapi==0.104.1
uvicorn==0.24.0
websockets==12.0
torch==2.1.0
transformers==4.35.0
accelerate==0.24.1
bitsandbytes==0.41.1
pydantic==2.5.0
Hardware requirements vary significantly based on the models users intend to run. Smaller models like GPT-2 can run on systems with eight gigabytes of RAM and no GPU, while larger models require dedicated GPUs with substantial VRAM. The application should provide clear guidance about hardware requirements for different model sizes and quantization levels.
Performance optimization involves several strategies. Model quantization reduces memory footprint and increases inference speed with minimal quality degradation. The application monitors resource usage and provides feedback when approaching system limits. Caching mechanisms can store tokenizer outputs and model configurations to reduce redundant operations.
Security considerations include validating user inputs to prevent injection attacks, implementing rate limiting to prevent abuse, and ensuring that model downloads come from trusted sources. The application should not expose sensitive system information through error messages or API responses. Input sanitization prevents malicious code execution through user-provided model identifiers or generation parameters.
User experience enhancements include providing progress indicators during model downloads, implementing conversation history persistence through browser local storage, supporting conversation export and import in JSON format, and offering keyboard shortcuts for common actions. The interface should gracefully handle errors and provide helpful recovery suggestions with specific error messages that guide users toward solutions.
The application structure supports easy deployment through containerization. A Dockerfile can package the entire backend with all dependencies:
FROM python:3.10-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY main.py gpu_detector.py model_manager.py inference_engine.py ./
EXPOSE 8000
CMD ["python", "main.py"]
This containerization approach ensures consistent deployment across different environments and simplifies dependency management. The frontend can be served as static files or integrated into the FastAPI application using StaticFiles middleware.
CONCLUSION AND FUTURE DIRECTIONS
Building a comprehensive web-based local LLM application represents a significant undertaking that combines frontend development, backend engineering, machine learning infrastructure, and systems programming. The application described in this article provides a solid foundation for running large language models locally with full control over model selection, generation parameters, and user interface customization.
The architecture separates concerns effectively, making the system maintainable and extensible. The frontend provides an intuitive interface with theme support and real-time streaming responses. The backend handles model management, GPU acceleration, and inference execution with support for multiple hardware platforms including NVIDIA CUDA, AMD ROCm, Intel acceleration, and Apple Metal Performance Shaders. The WebSocket communication layer enables responsive user interactions with incremental token generation that creates a natural conversational experience.
Future enhancements could include support for multi-modal models that process images and text simultaneously, implementation of retrieval-augmented generation for grounding responses in external knowledge bases, integration with vector databases for semantic search capabilities, support for fine-tuning models on custom datasets through the web interface, and implementation of model merging techniques to combine capabilities from multiple models. Additional features might include conversation branching to explore alternative response paths, model comparison tools to evaluate different models side by side, and advanced prompt engineering utilities.
The local LLM landscape continues to evolve rapidly with new models, quantization techniques, and optimization strategies emerging regularly. This application provides a framework that can adapt to these developments while maintaining a consistent user experience. By running models locally, users gain privacy advantages since data never leaves their machine, control over model selection and parameters, and independence from cloud services and their associated costs and availability constraints.
The implementation presented here demonstrates production-ready code with proper error handling, logging, resource management, and user feedback mechanisms. All components work together to create a cohesive system that rivals commercial offerings while providing the benefits of local execution. The modular design allows developers to extend functionality, swap components, or integrate with other systems as needed.
ADDENDUM: BUILDING, RUNNING, AND DEPLOYING THE COMPLETE APPLICATION
OVERVIEW OF DEPLOYMENT PROCESS
This addendum provides comprehensive instructions for building, running, and deploying the complete local LLM application. The deployment process covers environment setup, dependency installation, configuration, testing, and production deployment options. The instructions support multiple operating systems including Windows, macOS, and Linux, with specific guidance for each platform where necessary.
SYSTEM REQUIREMENTS AND PREREQUISITES
Before beginning the installation process, ensure your system meets the minimum requirements. For CPU-only operation, you need at least eight gigabytes of RAM and twenty gigabytes of free disk space. For GPU-accelerated operation, requirements vary by GPU type. NVIDIA GPUs require CUDA version eleven point zero or higher with drivers supporting the installed CUDA version. AMD GPUs require ROCm version five point zero or higher. Apple Silicon Macs use the built-in Metal Performance Shaders framework. Intel GPUs require the Intel Extension for PyTorch.
The software prerequisites include Python version three point nine or higher, pip package manager version twenty-one or higher, Node.js version sixteen or higher for frontend development tools if needed, and Git for version control. A modern web browser supporting WebSocket connections is required for the frontend interface.
PROJECT STRUCTURE AND FILE ORGANIZATION
Create a project directory structure that organizes all components logically. The recommended structure separates backend code, frontend files, model storage, and configuration:
local-llm-app/
├── backend/
│ ├── main.py
│ ├── gpu_detector.py
│ ├── model_manager.py
│ ├── inference_engine.py
│ ├── requirements.txt
│ └── config.py
├── frontend/
│ ├── index.html
│ ├── styles.css
│ └── app.js
├── models/
│ └── (model files will be stored here)
├── logs/
│ └── (application logs)
├── docker/
│ ├── Dockerfile
│ └── docker-compose.yml
├── scripts/
│ ├── setup.sh
│ ├── run.sh
│ └── deploy.sh
├── model_registry.json
└── README.md
This structure keeps backend and frontend code separate, provides dedicated directories for models and logs, includes Docker configuration for containerized deployment, and contains utility scripts for common operations.
STEP-BY-STEP INSTALLATION GUIDE
Begin by creating the project directory and navigating into it. Open a terminal or command prompt and execute the following commands:
mkdir local-llm-app
cd local-llm-app
mkdir backend frontend models logs docker scripts
Create the backend directory structure and files. Navigate to the backend directory and create the necessary Python files:
cd backend
Create the requirements.txt file with all necessary dependencies:
fastapi==0.104.1
uvicorn[standard]==0.24.0
websockets==12.0
torch==2.1.0
transformers==4.35.0
accelerate==0.24.1
bitsandbytes==0.41.1
pydantic==2.5.0
python-multipart==0.0.6
aiofiles==23.2.1
For systems with NVIDIA GPUs, you may want to install the CUDA-enabled version of PyTorch. Replace the torch line in requirements.txt with:
torch==2.1.0+cu118 --index-url https://download.pytorch.org/whl/cu118
For AMD GPUs with ROCm support, use:
torch==2.1.0+rocm5.6 --index-url https://download.pytorch.org/whl/rocm5.6
Create a Python virtual environment to isolate dependencies:
python -m venv venv
Activate the virtual environment. On Windows, use:
venv\Scripts\activate
On macOS and Linux, use:
source venv/bin/activate
Install the required Python packages:
pip install --upgrade pip
pip install -r requirements.txt
This installation process may take several minutes as it downloads and installs all dependencies including PyTorch and the transformers library.
Create a configuration file named config.py in the backend directory:
# config.py - Application configuration
import os
from pathlib import Path
class Config:
"""Application configuration settings."""
# Server settings
HOST = os.getenv('HOST', '0.0.0.0')
PORT = int(os.getenv('PORT', 8000))
RELOAD = os.getenv('RELOAD', 'False').lower() == 'true'
# Paths
BASE_DIR = Path(__file__).parent.parent
MODELS_DIR = BASE_DIR / 'models'
LOGS_DIR = BASE_DIR / 'logs'
FRONTEND_DIR = BASE_DIR / 'frontend'
# Model registry
REGISTRY_PATH = BASE_DIR / 'model_registry.json'
# CORS settings
CORS_ORIGINS = os.getenv('CORS_ORIGINS', '*').split(',')
# Logging
LOG_LEVEL = os.getenv('LOG_LEVEL', 'INFO')
LOG_FILE = LOGS_DIR / 'app.log'
# Generation defaults
DEFAULT_TEMPERATURE = 0.7
DEFAULT_TOP_K = 50
DEFAULT_TOP_P = 0.9
DEFAULT_REPETITION_PENALTY = 1.1
DEFAULT_MAX_NEW_TOKENS = 512
DEFAULT_CONTEXT_WINDOW = 2048
@classmethod
def ensure_directories(cls):
"""Ensure all required directories exist."""
cls.MODELS_DIR.mkdir(parents=True, exist_ok=True)
cls.LOGS_DIR.mkdir(parents=True, exist_ok=True)
cls.FRONTEND_DIR.mkdir(parents=True, exist_ok=True)
config = Config()
config.ensure_directories()
Update the main.py file to use the configuration and serve static frontend files:
# main.py - Main application entry point with static file serving
from fastapi import FastAPI, WebSocket, WebSocketDisconnect, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from fastapi.staticfiles import StaticFiles
from fastapi.responses import FileResponse
from pydantic import BaseModel
from typing import List, Dict, Any, Optional
import json
import logging
from pathlib import Path
import uvicorn
from config import config
from gpu_detector import GPUDetector
from model_manager import ModelManager
from inference_engine import InferenceEngine
logging.basicConfig(
level=config.LOG_LEVEL,
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
handlers=[
logging.FileHandler(config.LOG_FILE),
logging.StreamHandler()
]
)
logger = logging.getLogger(__name__)
app = FastAPI(
title="Local LLM Server",
description="A production-ready local LLM inference server with GPU acceleration",
version="1.0.0"
)
app.add_middleware(
CORSMiddleware,
allow_origins=config.CORS_ORIGINS,
allow_credentials=True,
allow_methods=["*"],
allow_headers=["*"],
)
gpu_detector = GPUDetector()
model_manager = ModelManager(
models_dir=str(config.MODELS_DIR),
registry_path=str(config.REGISTRY_PATH)
)
inference_engine = InferenceEngine(gpu_detector)
class ModelAddRequest(BaseModel):
model_id: str
quantization: Optional[str] = None
class GenerationConfigUpdate(BaseModel):
temperature: Optional[float] = None
top_k: Optional[int] = None
top_p: Optional[float] = None
repetition_penalty: Optional[float] = None
max_new_tokens: Optional[int] = None
@app.on_event("startup")
async def startup_event():
"""Initialize the application on startup."""
logger.info("Starting Local LLM Server")
logger.info(f"GPU Info: {gpu_detector.get_device_info()}")
logger.info(f"Available models: {len(model_manager.list_models())}")
logger.info(f"Server running on {config.HOST}:{config.PORT}")
@app.get("/api/health")
async def health_check():
"""Health check endpoint."""
return {
"status": "healthy",
"gpu": gpu_detector.get_device_info(),
"models_count": len(model_manager.list_models())
}
@app.get("/api/models")
async def list_models():
"""List all available models."""
return {"models": model_manager.list_models()}
@app.post("/api/models")
async def add_model(request: ModelAddRequest):
"""Add a new model from HuggingFace."""
logger.info(f"Received request to add model: {request.model_id}")
result = model_manager.add_model(request.model_id, request.quantization)
if result['status'] == 'error':
raise HTTPException(status_code=400, detail=result['message'])
return result
@app.delete("/api/models/{model_id:path}")
async def delete_model(model_id: str):
"""Delete a model from local storage."""
logger.info(f"Received request to delete model: {model_id}")
result = model_manager.delete_model(model_id)
if result['status'] == 'error':
raise HTTPException(status_code=404, detail=result['message'])
return result
@app.post("/api/models/{model_id:path}/load")
async def load_model(model_id: str):
"""Load a model for inference."""
logger.info(f"Received request to load model: {model_id}")
model_path = model_manager.get_model_path(model_id)
if not model_path:
raise HTTPException(status_code=404, detail="Model not found")
result = inference_engine.load_model(model_id, model_path)
if result['status'] == 'error':
raise HTTPException(status_code=500, detail=result['message'])
return result
@app.get("/api/current-model")
async def get_current_model():
"""Get information about the currently loaded model."""
if inference_engine.current_model_id:
return {
"model_id": inference_engine.current_model_id,
"loaded": True
}
return {"loaded": False}
@app.post("/api/generation-config")
async def update_generation_config(config_update: GenerationConfigUpdate):
"""Update generation configuration parameters."""
logger.info(f"Updating generation config: {config_update.dict(exclude_none=True)}")
inference_engine.update_generation_config(config_update.dict(exclude_none=True))
return {
"status": "success",
"message": "Generation config updated",
"config": inference_engine.get_generation_config()
}
@app.get("/api/generation-config")
async def get_generation_config():
"""Get current generation configuration."""
return inference_engine.get_generation_config()
@app.get("/api/device-info")
async def get_device_info():
"""Get information about the GPU device."""
return gpu_detector.get_device_info()
@app.websocket("/ws/chat")
async def websocket_chat(websocket: WebSocket):
"""WebSocket endpoint for chat interactions with streaming responses."""
await websocket.accept()
client_id = id(websocket)
logger.info(f"WebSocket connection established: {client_id}")
try:
while True:
data = await websocket.receive_text()
try:
message_data = json.loads(data)
except json.JSONDecodeError:
await websocket.send_text(json.dumps({
'type': 'error',
'message': 'Invalid JSON format'
}))
continue
message_type = message_data.get('type')
if message_type == 'generate':
messages = message_data.get('messages', [])
context_window = message_data.get('context_window', config.DEFAULT_CONTEXT_WINDOW)
if not messages:
await websocket.send_text(json.dumps({
'type': 'error',
'message': 'No messages provided'
}))
continue
if not inference_engine.current_model_id:
await websocket.send_text(json.dumps({
'type': 'error',
'message': 'No model loaded'
}))
continue
await websocket.send_text(json.dumps({
'type': 'start',
'message': 'Generation started'
}))
full_response = ""
try:
for token in inference_engine.generate_stream(messages, context_window):
full_response += token
await websocket.send_text(json.dumps({
'type': 'token',
'content': token
}))
await websocket.send_text(json.dumps({
'type': 'end',
'message': 'Generation complete',
'full_response': full_response
}))
except Exception as e:
logger.error(f"Error during generation: {str(e)}")
await websocket.send_text(json.dumps({
'type': 'error',
'message': f'Generation error: {str(e)}'
}))
elif message_type == 'ping':
await websocket.send_text(json.dumps({
'type': 'pong',
'timestamp': message_data.get('timestamp')
}))
else:
await websocket.send_text(json.dumps({
'type': 'error',
'message': f'Unknown message type: {message_type}'
}))
except WebSocketDisconnect:
logger.info(f"WebSocket connection closed: {client_id}")
except Exception as e:
logger.error(f"WebSocket error for {client_id}: {str(e)}")
try:
await websocket.close()
except:
pass
if config.FRONTEND_DIR.exists():
app.mount("/static", StaticFiles(directory=str(config.FRONTEND_DIR)), name="static")
@app.get("/")
async def serve_frontend():
"""Serve the frontend application."""
return FileResponse(str(config.FRONTEND_DIR / "index.html"))
if __name__ == "__main__":
uvicorn.run(
"main:app",
host=config.HOST,
port=config.PORT,
reload=config.RELOAD,
log_level=config.LOG_LEVEL.lower()
)
Copy all the backend Python files (gpu_detector.py, model_manager.py, inference_engine.py) from the main article into the backend directory.
Now set up the frontend files. Navigate to the frontend directory:
cd ../frontend
Copy the index.html, styles.css, and app.js files from the main article into this directory. Update the app.js file to use relative URLs for the API:
// app.js - Updated with relative URLs
class LocalLLMApp {
constructor() {
this.apiBaseUrl = window.location.origin;
this.wsUrl = (window.location.protocol === 'https:' ? 'wss://' : 'ws://') +
window.location.host + '/ws/chat';
this.ws = null;
this.currentModelId = null;
this.messages = [];
this.isGenerating = false;
this.currentTheme = 'light';
this.initializeElements();
this.attachEventListeners();
this.loadInitialData();
this.connectWebSocket();
this.loadTheme();
}
// Rest of the LocalLLMApp class code remains the same as in the main article
// ... (include all methods from the original app.js)
}
document.addEventListener('DOMContentLoaded', () => {
new LocalLLMApp();
});
CREATING UTILITY SCRIPTS
Create utility scripts to simplify common operations. Navigate to the scripts directory:
cd ../scripts
Create a setup script for Unix-like systems (setup.sh):
#!/bin/bash
echo "Setting up Local LLM Application..."
cd "$(dirname "$0")/.."
if [ ! -d "backend/venv" ]; then
echo "Creating Python virtual environment..."
cd backend
python3 -m venv venv
cd ..
fi
echo "Activating virtual environment..."
source backend/venv/bin/activate
echo "Installing Python dependencies..."
cd backend
pip install --upgrade pip
pip install -r requirements.txt
cd ..
echo "Creating necessary directories..."
mkdir -p models logs
echo "Setup complete!"
echo "To run the application, execute: ./scripts/run.sh"
Create a run script for Unix-like systems (run.sh):
#!/bin/bash
echo "Starting Local LLM Application..."
cd "$(dirname "$0")/.."
source backend/venv/bin/activate
cd backend
python main.py
Create a Windows batch file for setup (setup.bat):
@echo off
echo Setting up Local LLM Application...
cd /d %~dp0\..
if not exist "backend\venv" (
echo Creating Python virtual environment...
cd backend
python -m venv venv
cd ..
)
echo Activating virtual environment...
call backend\venv\Scripts\activate.bat
echo Installing Python dependencies...
cd backend
pip install --upgrade pip
pip install -r requirements.txt
cd ..
echo Creating necessary directories...
if not exist "models" mkdir models
if not exist "logs" mkdir logs
echo Setup complete!
echo To run the application, execute: scripts\run.bat
pause
Create a Windows batch file for running (run.bat):
@echo off
echo Starting Local LLM Application...
cd /d %~dp0\..
call backend\venv\Scripts\activate.bat
cd backend
python main.py
pause
Make the Unix scripts executable:
chmod +x setup.sh run.sh
RUNNING THE APPLICATION LOCALLY
To run the application on your local machine, follow these steps. First, execute the setup script to install all dependencies and create necessary directories. On Unix-like systems:
./scripts/setup.sh
On Windows:
scripts\setup.bat
After setup completes successfully, start the application using the run script. On Unix-like systems:
./scripts/run.sh
On Windows:
scripts\run.bat
The application will start and display log messages indicating the server is running. You should see output similar to:
INFO: Started server process
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000
Open a web browser and navigate to http://localhost:8000 to access the application interface. The first time you access the application, you will see the welcome screen with no models loaded.
ADDING YOUR FIRST MODEL
To add your first model, click the "Add Model" button in the sidebar. Enter a model identifier from HuggingFace, such as "gpt2" for a small test model. Select the quantization option if desired. For initial testing, use no quantization to ensure compatibility. Click "Add Model" and wait for the download to complete.
The download progress will be displayed in the modal dialog. Depending on the model size and your internet connection, this may take several minutes. Once the download completes, the model will appear in the available models list.
Click the "Load" button next to the model to load it into memory. This process may take a few moments as the model is initialized and moved to the GPU if available. Once loaded, the send button will become enabled and you can start chatting.
DOCKER DEPLOYMENT
For production deployment or easier distribution, Docker provides a containerized solution. Create a Dockerfile in the docker directory:
FROM python:3.10-slim
ENV PYTHONUNBUFFERED=1
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get install -y \
build-essential \
git \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY backend/requirements.txt /app/backend/
RUN pip install --no-cache-dir -r /app/backend/requirements.txt
COPY backend/ /app/backend/
COPY frontend/ /app/frontend/
COPY scripts/ /app/scripts/
RUN mkdir -p /app/models /app/logs
ENV PYTHONPATH=/app/backend
ENV HOST=0.0.0.0
ENV PORT=8000
EXPOSE 8000
WORKDIR /app/backend
CMD ["python", "main.py"]
Create a docker-compose.yml file for easier management:
version: '3.8'
services:
local-llm-app:
build:
context: ..
dockerfile: docker/Dockerfile
ports:
- "8000:8000"
volumes:
- ../models:/app/models
- ../logs:/app/logs
- ../model_registry.json:/app/model_registry.json
environment:
- HOST=0.0.0.0
- PORT=8000
- LOG_LEVEL=INFO
restart: unless-stopped
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
Build the Docker image:
cd docker
docker-compose build
Run the application using Docker Compose:
docker-compose up -d
View the logs:
docker-compose logs -f
Stop the application:
docker-compose down
PRODUCTION DEPLOYMENT WITH NGINX
For production deployment, use NGINX as a reverse proxy to handle SSL termination and static file serving. Install NGINX on your server:
sudo apt-get update
sudo apt-get install nginx
Create an NGINX configuration file at /etc/nginx/sites-available/local-llm-app:
upstream local_llm_backend {
server 127.0.0.1:8000;
}
server {
listen 80;
server_name your-domain.com;
client_max_body_size 100M;
location / {
proxy_pass http://local_llm_backend;
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_read_timeout 300s;
proxy_connect_timeout 75s;
}
location /ws/ {
proxy_pass http://local_llm_backend;
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_read_timeout 3600s;
proxy_send_timeout 3600s;
}
}
Enable the site:
sudo ln -s /etc/nginx/sites-available/local-llm-app /etc/nginx/sites-enabled/
sudo nginx -t
sudo systemctl restart nginx
For SSL support, use Let's Encrypt with Certbot:
sudo apt-get install certbot python3-certbot-nginx
sudo certbot --nginx -d your-domain.com
SYSTEMD SERVICE FOR AUTOMATIC STARTUP
Create a systemd service file to run the application automatically on system startup. Create /etc/systemd/system/local-llm-app.service:
[Unit]
Description=Local LLM Application
After=network.target
[Service]
Type=simple
User=your-username
WorkingDirectory=/path/to/local-llm-app/backend
Environment="PATH=/path/to/local-llm-app/backend/venv/bin"
ExecStart=/path/to/local-llm-app/backend/venv/bin/python main.py
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
Enable and start the service:
sudo systemctl daemon-reload
sudo systemctl enable local-llm-app
sudo systemctl start local-llm-app
Check the service status:
sudo systemctl status local-llm-app
View service logs:
sudo journalctl -u local-llm-app -f
ENVIRONMENT VARIABLES AND CONFIGURATION
The application supports configuration through environment variables. Create a .env file in the project root:
HOST=0.0.0.0
PORT=8000
RELOAD=False
LOG_LEVEL=INFO
CORS_ORIGINS=*
Load these variables before running the application. On Unix-like systems, use:
export $(cat .env | xargs)
python backend/main.py
For Docker deployment, reference the .env file in docker-compose.yml:
services:
local-llm-app:
env_file:
- ../.env
MONITORING AND MAINTENANCE
Monitor application health using the health check endpoint:
curl http://localhost:8000/api/health
Monitor GPU usage with nvidia-smi for NVIDIA GPUs:
watch -n 1 nvidia-smi
Monitor system resources:
htop
Set up log rotation to prevent log files from growing too large. Create /etc/logrotate.d/local-llm-app:
/path/to/local-llm-app/logs/*.log {
daily
rotate 7
compress
delaycompress
notifempty
create 0644 your-username your-username
sharedscripts
postrotate
systemctl reload local-llm-app
endscript
}
BACKUP AND RESTORE
Backup important data regularly including the model registry and configuration files. Create a backup script:
#!/bin/bash
BACKUP_DIR="/path/to/backups"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="local-llm-backup_${TIMESTAMP}.tar.gz"
cd /path/to/local-llm-app
tar -czf "${BACKUP_DIR}/${BACKUP_FILE}" \
model_registry.json \
backend/config.py \
.env
echo "Backup created: ${BACKUP_FILE}"
To restore from a backup:
tar -xzf /path/to/backups/local-llm-backup_TIMESTAMP.tar.gz -C /path/to/local-llm-app
TROUBLESHOOTING COMMON ISSUES
If the application fails to start, check the log files in the logs directory for error messages. Common issues include missing dependencies, incorrect Python version, or insufficient permissions.
If models fail to download, verify your internet connection and ensure HuggingFace is accessible. Check available disk space as model downloads can be large.
If GPU is not detected, verify driver installation and CUDA/ROCm compatibility. Check that PyTorch can access the GPU:
import torch
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0))
If WebSocket connections fail, check firewall settings and ensure port 8000 is accessible. Verify NGINX configuration if using a reverse proxy.
If generation is slow, consider using quantized models, reducing context window size, or upgrading hardware. Monitor GPU memory usage to ensure models fit in VRAM.
UPDATING THE APPLICATION
To update the application with new features or bug fixes, pull the latest code and reinstall dependencies:
cd /path/to/local-llm-app
git pull origin main
source backend/venv/bin/activate
pip install -r backend/requirements.txt --upgrade
sudo systemctl restart local-llm-app
SECURITY CONSIDERATIONS FOR PRODUCTION
For production deployment, implement additional security measures. Change the default CORS settings to restrict access to specific domains:
CORS_ORIGINS = ["https://your-domain.com"]
Implement authentication and authorization for API endpoints. Add rate limiting to prevent abuse. Use HTTPS for all connections. Keep dependencies updated to patch security vulnerabilities.
Restrict file system access and run the application with minimal privileges. Use a dedicated user account for the service. Implement input validation and sanitization for all user inputs.
SCALING AND PERFORMANCE OPTIMIZATION
For high-traffic deployments, consider running multiple instances behind a load balancer. Use Redis for session management and caching. Implement request queuing for generation tasks.
Optimize model loading by keeping frequently used models in memory. Use model quantization to reduce memory footprint. Implement batch processing for multiple requests.
Monitor performance metrics and adjust resources accordingly. Use profiling tools to identify bottlenecks. Consider GPU pooling for multi-GPU setups.
CONCLUSION
This addendum provides comprehensive instructions for building, running, and deploying the complete local LLM application. The deployment process covers development setup, Docker containerization, production deployment with NGINX, systemd service configuration, monitoring, backup, and troubleshooting. Following these instructions will result in a fully functional, production-ready local LLM application that can be accessed through a web browser while maintaining complete control over models and data.