Tuesday, March 10, 2026

Building Future-Proof LLM Applications: Mastering Multi-GPU Support and Configuration Management



The artificial intelligence landscape has exploded with possibilities as Large Language Models become increasingly accessible for local deployment. However, developers face a critical challenge that often gets overlooked in the rush to build AI-powered applications: how to create LLM applications that work seamlessly across different hardware platforms and can leverage both local and remote LLM endpoints. The ecosystem spans NVIDIA CUDA GPUs, AMD ROCm-enabled graphics cards, Apple’s Metal Performance Shaders, and various cloud API providers, each with unique requirements and optimizations.


This technical landscape creates a dilemma for developers. Do you lock your application to a single platform and limit your user base? Do you maintain separate codebases for different hardware configurations? Or do you build a flexible architecture that adapts to whatever hardware your users have available? The answer lies in thoughtful architecture design and sophisticated configuration management that puts the user in control.


Understanding the Modern GPU Landscape for LLM Deployment


The hardware landscape for running LLMs locally has become remarkably diverse, with each platform offering distinct advantages and trade-offs. NVIDIA’s CUDA ecosystem remains the gold standard for AI workloads, benefiting from years of optimization and universal framework support. The RTX 4090 with its 24GB of VRAM represents the pinnacle of consumer hardware for LLM inference, running models up to roughly 30B parameters entirely in VRAM with 4-bit quantization; a 4-bit 70B model needs around 40GB for its weights alone, so it only runs with partial offloading to system RAM, at a significant cost in inference speed.


AMD has made significant strides with ROCm, their open-source compute platform that rivals CUDA in many scenarios. The RX 7900 XTX offers competitive performance at a lower cost per gigabyte of VRAM, making it an attractive option for developers willing to navigate slightly more complex setup procedures. ROCm now supports leading frameworks like PyTorch and vLLM, with major improvements in Flash Attention and Paged Attention implementations that bring performance closer to CUDA levels.


Apple Silicon introduces a completely different paradigm with its unified memory architecture. The M3 Ultra with up to 192GB of unified RAM can run models that would be impossible on traditional discrete GPUs due to VRAM limitations. A Mac Studio can comfortably run 70B parameter models entirely in memory, achieving 8-12 tokens per second while consuming a fraction of the power compared to discrete GPU solutions. The Metal Performance Shaders backend in PyTorch provides seamless acceleration for these workloads.


The challenge for application developers is not just supporting these different platforms, but optimizing for each one’s unique characteristics. NVIDIA GPUs excel at parallel computation and benefit from techniques like tensor parallelism for multi-GPU setups. AMD GPUs require careful tuning of ROCm-specific parameters and may benefit from different memory management strategies. Apple Silicon leverages unified memory but may require different batch sizes and precision settings for optimal performance.
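To make this concrete, those per-platform choices can be encoded as a small capability-to-defaults mapping. The batch sizes and dtypes below are illustrative assumptions, not benchmarked recommendations:

```python
def platform_defaults(has_cuda: bool, is_rocm: bool, has_mps: bool,
                      gpu_count: int = 1) -> dict:
    """Map detected capabilities to illustrative tuning defaults.

    The specific batch sizes and dtypes are placeholder assumptions;
    real values should come from benchmarking on the target hardware.
    """
    if has_cuda and is_rocm:
        # ROCm uses the same 'cuda' device string but different tuning knobs
        return {"device": "cuda", "dtype": "float16", "batch_size": 8,
                "tunable_op": True}
    if has_cuda:
        return {"device": "cuda", "dtype": "bfloat16", "batch_size": 8,
                "tensor_parallel": gpu_count > 1}
    if has_mps:
        # Unified memory fits large models, but smaller batches often run faster
        return {"device": "mps", "dtype": "float16", "batch_size": 4}
    return {"device": "cpu", "dtype": "float32", "batch_size": 1}
```

Keeping this logic in one pure function also makes the platform-selection policy trivially unit-testable, independent of the hardware the tests run on.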


Device Detection and Runtime Adaptation Patterns


Creating applications that automatically detect and configure themselves for available hardware requires sophisticated device detection logic. The key is implementing a hierarchical preference system that attempts to use the best available hardware while providing graceful fallbacks to ensure your application runs everywhere.


Here is a comprehensive device detection implementation that handles all major GPU platforms:


import torch

import os

import logging

from typing import Tuple, Dict, Optional, List

from enum import Enum


class DeviceType(Enum):

    CUDA = "cuda"

    MPS = "mps"

    CPU = "cpu"


class DeviceInfo:

    def __init__(self, device_type: DeviceType, device_count: int = 1, 

                 memory_gb: Optional[float] = None, compute_capability: Optional[str] = None):

        self.device_type = device_type

        self.device_count = device_count

        self.memory_gb = memory_gb

        self.compute_capability = compute_capability

        self.is_rocm = self._detect_rocm()

    

    def _detect_rocm(self) -> bool:

        """Detect if we're running on ROCm instead of CUDA"""

        if self.device_type != DeviceType.CUDA:

            return False

        

        # Check for ROCm-specific environment variables

        rocm_vars = ['ROCM_PATH', 'HIP_PATH', 'ROCM_HOME']

        if any(os.getenv(var) for var in rocm_vars):

            return True

            

        # Check if PyTorch was built with ROCm; torch.version.hip is None on CUDA builds

        return getattr(torch.version, 'hip', None) is not None

    

    @property

    def platform_name(self) -> str:

        if self.device_type == DeviceType.CUDA:

            return "ROCm" if self.is_rocm else "CUDA"

        elif self.device_type == DeviceType.MPS:

            return "Apple MPS"

        else:

            return "CPU"


class DeviceManager:

    def __init__(self):

        self.logger = logging.getLogger(__name__)

        self._device_info = None

        

    def detect_available_devices(self) -> DeviceInfo:

        """Detect and return information about available compute devices"""

        if self._device_info is not None:

            return self._device_info

            

        device_info = self._probe_devices()

        self._device_info = device_info

        

        self.logger.info(f"Detected {device_info.platform_name} with {device_info.device_count} device(s)")

        if device_info.memory_gb:

            self.logger.info(f"Available memory: {device_info.memory_gb:.1f} GB")

            

        return device_info

    

    def _probe_devices(self) -> DeviceInfo:

        """Probe available devices in order of preference"""

        

        # First, try CUDA (includes ROCm)

        if torch.cuda.is_available():

            device_count = torch.cuda.device_count()

            

            # Get memory info for the primary device

            try:

                torch.cuda.set_device(0)

                memory_bytes = torch.cuda.get_device_properties(0).total_memory

                memory_gb = memory_bytes / (1024**3)

                

                # Get compute capability (CUDA) or architecture (ROCm)

                props = torch.cuda.get_device_properties(0)

                if hasattr(props, 'major') and hasattr(props, 'minor'):

                    compute_capability = f"{props.major}.{props.minor}"

                else:

                    compute_capability = props.name

                    

            except Exception as e:

                self.logger.warning(f"Could not get CUDA device properties: {e}")

                memory_gb = None

                compute_capability = None

                

            return DeviceInfo(DeviceType.CUDA, device_count, memory_gb, compute_capability)

        

        # Try Apple MPS

        if torch.backends.mps.is_available():

            # MPS doesn't have explicit device count, treat as single device

            # Memory is shared with system, so don't report specific GPU memory

            return DeviceInfo(DeviceType.MPS, 1, None, "Apple Silicon")

        

        # Check if MPS is built but not available

        if torch.backends.mps.is_built():

            self.logger.warning("MPS is built but not available. Check macOS version (12.3+ required)")

        

        # Fallback to CPU

        cpu_count = os.cpu_count() or 1

        self.logger.info("No GPU acceleration available, falling back to CPU")

        return DeviceInfo(DeviceType.CPU, cpu_count, None, None)

    

    def get_optimal_device(self, memory_required_gb: Optional[float] = None) -> torch.device:

        """Get the optimal PyTorch device for the given memory requirements"""

        device_info = self.detect_available_devices()

        

        if device_info.device_type == DeviceType.CPU:

            return torch.device("cpu")

        

        # Check memory requirements

        if memory_required_gb and device_info.memory_gb:

            if memory_required_gb > device_info.memory_gb * 0.9:  # Leave 10% headroom

                self.logger.warning(f"Required memory ({memory_required_gb:.1f}GB) exceeds "

                                  f"available memory ({device_info.memory_gb:.1f}GB)")

                return torch.device("cpu")

        

        if device_info.device_type == DeviceType.CUDA:

            return torch.device("cuda:0")

        elif device_info.device_type == DeviceType.MPS:

            return torch.device("mps")

        

        return torch.device("cpu")

    

    def configure_memory_management(self, device_info: DeviceInfo) -> None:

        """Configure memory management based on device type"""

        if device_info.device_type == DeviceType.CUDA:

            # Configure CUDA memory allocation strategy

            if device_info.is_rocm:

                # ROCm-specific optimizations

                os.environ.setdefault('PYTORCH_TUNABLEOP_ENABLED', '1')

                os.environ.setdefault('PYTORCH_TUNABLEOP_TUNING', '1')

                self.logger.info("Enabled ROCm TunableOp optimizations")

            else:

                # NVIDIA CUDA optimizations

                torch.backends.cuda.matmul.allow_tf32 = True

                torch.backends.cudnn.allow_tf32 = True

                self.logger.info("Enabled TensorFloat-32 for CUDA")

                

            # Common CUDA memory settings

            torch.cuda.empty_cache()

            

        elif device_info.device_type == DeviceType.MPS:

            # MPS-specific settings

            os.environ.setdefault('PYTORCH_ENABLE_MPS_FALLBACK', '1')

            self.logger.info("Enabled MPS fallback to CPU for unsupported operations")


Modern LLM frameworks have standardized around common detection patterns. For CUDA support, you check torch.cuda.is_available() and torch.cuda.device_count() to determine both availability and the number of available GPUs. ROCm detection uses the same CUDA interface by design, since AMD intentionally reused PyTorch’s CUDA APIs to minimize porting effort. This means torch.cuda.is_available() returns True on ROCm systems, and you use torch.device('cuda') even when running on AMD hardware.


Apple MPS detection requires different APIs: torch.backends.mps.is_available() and torch.backends.mps.is_built() tell you whether MPS acceleration is possible. The distinction matters because MPS might be built into PyTorch but unavailable due to OS version requirements or hardware limitations. Once confirmed, you create devices with torch.device('mps').
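The checks described above can be combined into a minimal probe. This sketch assumes a recent PyTorch build; the string labels it returns are this example's own convention:

```python
import torch

def probe_backend() -> str:
    """Label the best available backend, distinguishing ROCm from NVIDIA
    CUDA via torch.version.hip, which is set only on ROCm builds."""
    if torch.cuda.is_available():
        return "rocm" if getattr(torch.version, "hip", None) else "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    if torch.backends.mps.is_built():
        # Built but unavailable usually means an unsupported macOS version (MPS needs 12.3+)
        print("MPS built but unavailable; falling back to CPU")
    return "cpu"

backend = probe_backend()
# ROCm deliberately reuses the "cuda" device string
device = torch.device("cuda" if backend == "rocm" else backend)
```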


A robust device detection system implements a preference hierarchy that tries the best available option first. CUDA gets priority due to its maturity and broad model support, followed by MPS for Apple Silicon users, then CPU as the universal fallback. The system should also detect specific capabilities like available VRAM, compute capability versions, and multi-GPU configurations to make informed decisions about model loading strategies.


Smart applications go beyond simple device detection and implement capability-aware configuration. They might detect that a system has 12GB of VRAM and automatically select 4-bit quantization for larger models, or identify multi-GPU setups and enable tensor parallelism. This level of adaptation makes applications feel native to each platform rather than merely compatible.
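A sketch of that capability-aware selection might walk a precision ladder until the weights fit in the available VRAM. The bytes-per-parameter ratios here are rough assumptions for GGUF-style formats, and real footprints also include KV cache and activation memory:

```python
def pick_quantization(vram_gb: float, model_params_b: float) -> str:
    """Choose the highest precision whose weights fit in ~90% of VRAM.

    Bytes-per-parameter values are rough assumptions, not exact format sizes.
    """
    budget_gb = vram_gb * 0.9  # leave ~10% headroom
    ladder = [
        ("float16", 2.00),
        ("q8_0", 1.06),
        ("q5_1", 0.75),
        ("q4_0", 0.56),
    ]
    for quant, bytes_per_param in ladder:
        if model_params_b * bytes_per_param <= budget_gb:
            return quant
    return "offload"  # even 4-bit does not fit; offload or pick a smaller model
```

For example, a 13B model on a 12GB card lands on 5-bit quantization, while a 70B model on an 8GB card falls through the whole ladder and must be offloaded.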


Configuration Files: The Developer’s Secret Weapon


Configuration files represent the most powerful tool for creating user-controllable LLM applications. Rather than hardcoding device preferences and model parameters, well-architected applications expose these choices through hierarchical configuration systems that let users specify exactly how they want their application to behave.


YAML has emerged as the preferred format for LLM application configuration due to its human readability and excellent support for complex nested structures. A comprehensive configuration system needs to handle multiple concerns: hardware preferences, model selection, inference parameters, memory management, and fallback strategies. The key is designing a schema that balances flexibility with sensible defaults.


Here is a comprehensive configuration example that demonstrates all the key patterns:


# config.yaml - Production LLM Application Configuration

version: "1.0"

application:

  name: "Advanced LLM Assistant"

  logging_level: "INFO"

  

# Hardware and device configuration

hardware:

  # Device preference order - will try in this sequence

  preferred_devices: ["cuda", "mps", "cpu"]

  

  # Device-specific settings

  cuda:

    enabled: true

    device_ids: [0, 1]  # Use specific GPU IDs, empty for all

    memory_fraction: 0.9  # Use 90% of available VRAM

    allow_tf32: true

    enable_flash_attention: true

    # ROCm-specific settings

    rocm:

      enable_tunable_op: true

      hip_visible_devices: null  # null means use all

      

  mps:

    enabled: true

    fallback_to_cpu: true  # Fallback for unsupported ops

    memory_limit_gb: null  # null means use system default

    

  cpu:

    threads: null  # null means auto-detect

    memory_limit_gb: 8

    

# Model configuration

models:

  # Local models

  local:

    base_path: "./models"

    

    # Model definitions

    llama2_7b:

      path: "llama-2-7b-chat.gguf"

      context_length: 4096

      quantization: "q4_0"

      tensor_parallel: false

      memory_required_gb: 4.0

      supported_devices: ["cuda", "mps", "cpu"]

      

    llama2_70b:

      path: "llama-2-70b-chat.gguf"

      context_length: 4096

      quantization: "q4_0"

      tensor_parallel: true

      tensor_parallel_size: 2

      memory_required_gb: 40.0

      supported_devices: ["cuda"]  # Requires CUDA for multi-GPU

      

    codellama_13b:

      path: "codellama-13b-instruct.gguf"

      context_length: 8192

      quantization: "q5_1"

      tensor_parallel: false

      memory_required_gb: 8.0

      supported_devices: ["cuda", "mps", "cpu"]

      

  # Remote API models

  remote:

    openai:

      enabled: true

      api_key: "${OPENAI_API_KEY}"  # Environment variable reference

      base_url: "https://api.openai.com/v1"

      models:

        - "gpt-4"

        - "gpt-3.5-turbo"

      timeout: 30

      max_retries: 3

      

    anthropic:

      enabled: true

      api_key: "${ANTHROPIC_API_KEY}"

      base_url: "https://api.anthropic.com"

      models:

        - "claude-3-opus-20240229"

        - "claude-3-sonnet-20240229"

      timeout: 30

      max_retries: 3

      

    local_server:

      enabled: false

      base_url: "http://localhost:8000/v1"

      api_key: "local"

      models:

        - "local-model"


# Inference parameters

inference:

  # Default parameters (can be overridden per model)

  defaults:

    temperature: 0.7

    max_tokens: 2048

    top_p: 0.9

    top_k: 40

    repetition_penalty: 1.1

    stream: true

    

  # Model-specific overrides

  overrides:

    codellama_13b:

      temperature: 0.1  # Lower temperature for code generation

      max_tokens: 4096

      

    llama2_70b:

      batch_size: 1  # Large model, single batch

      

# Memory management

memory:

  # Global memory settings

  global:

    garbage_collect_threshold: 0.8  # GC when 80% memory used

    cache_size_mb: 1024

    

  # Device-specific memory management

  cuda:

    memory_pool: true

    empty_cache_threshold: 0.9

    

  mps:

    unified_memory_management: true

    

  cpu:

    max_memory_gb: 16


# Performance optimization

performance:

  # Compilation settings

  torch_compile: false  # Enable PyTorch 2.0 compilation

  flash_attention: true  # Use Flash Attention when available

  

  # Quantization settings

  quantization:

    default_precision: "float16"  # float32, float16, bfloat16

    dynamic_quantization: true

    

  # Batching

  batching:

    max_batch_size: 8

    batch_timeout_ms: 100


# Fallback and error handling

fallback:

  # Automatic fallback strategy

  enabled: true

  

  # Fallback chain for device failures

  device_fallback_chain:

    - "cuda"

    - "mps" 

    - "cpu"

    

  # Fallback chain for model loading failures

  model_fallback_chain:

    - "local"

    - "remote"

    

  # What to do when all devices fail

  final_fallback: "cpu"

  

  # Retry settings

  max_retries: 3

  retry_delay_ms: 1000


# Environment-specific overrides

environments:

  development:

    logging_level: "DEBUG"

    hardware:

      cuda:

        memory_fraction: 0.7  # Leave more memory for development tools

        

  production:

    logging_level: "WARNING"

    performance:

      torch_compile: true

      

  testing:

    models:

      local:

        llama2_7b:

          context_length: 512  # Smaller context for faster tests


Consider a configuration structure that allows users to specify their preferred execution strategy while providing automatic fallbacks. The hardware section might allow users to explicitly prefer CUDA over MPS, set memory limits for different device types, or disable certain acceleration methods if they encounter compatibility issues. Model configuration should support both local model paths and remote API endpoints, with parameters like quantization levels, context lengths, and batch sizes that can be adjusted per deployment scenario.


Here is the configuration management system that loads and validates these settings:


import yaml

import os

import logging

from typing import Dict, Any, Optional, List

from dataclasses import dataclass

from pathlib import Path


@dataclass

class ModelConfig:

    name: str

    path: Optional[str] = None

    context_length: int = 4096

    quantization: str = "q4_0"

    tensor_parallel: bool = False

    tensor_parallel_size: int = 1

    memory_required_gb: float = 4.0

    supported_devices: Optional[List[str]] = None

    

    def __post_init__(self):

        if self.supported_devices is None:

            self.supported_devices = ["cuda", "mps", "cpu"]


@dataclass

class HardwareConfig:

    preferred_devices: List[str]

    cuda_enabled: bool = True

    cuda_memory_fraction: float = 0.9

    mps_enabled: bool = True

    mps_fallback_to_cpu: bool = True

    cpu_threads: Optional[int] = None


@dataclass

class InferenceConfig:

    temperature: float = 0.7

    max_tokens: int = 2048

    top_p: float = 0.9

    stream: bool = True


class ConfigManager:

    def __init__(self, config_path: str = "config.yaml"):

        self.config_path = Path(config_path)

        self.logger = logging.getLogger(__name__)

        self._config = None

        self._validated = False

        

    def load_config(self) -> Dict[str, Any]:

        """Load and validate configuration from file"""

        if self._config is not None and self._validated:

            return self._config

            

        try:

            with open(self.config_path, 'r') as f:

                config_content = f.read()

                

            # Substitute environment variables

            config_content = self._substitute_env_vars(config_content)

            

            # Parse YAML

            self._config = yaml.safe_load(config_content)

            

            # Validate configuration

            self._validate_config()

            self._validated = True

            

            self.logger.info(f"Successfully loaded configuration from {self.config_path}")

            return self._config

            

        except FileNotFoundError:

            self.logger.error(f"Configuration file {self.config_path} not found")

            raise

        except yaml.YAMLError as e:

            self.logger.error(f"Invalid YAML in configuration file: {e}")

            raise

        except Exception as e:

            self.logger.error(f"Error loading configuration: {e}")

            raise

    

    def _substitute_env_vars(self, content: str) -> str:

        """Substitute environment variable references like ${VAR_NAME}"""

        import re

        

        def replace_env_var(match):

            var_name = match.group(1)

            return os.getenv(var_name, match.group(0))

        

        return re.sub(r'\$\{([^}]+)\}', replace_env_var, content)

    

    def _validate_config(self) -> None:

        """Validate configuration structure and values"""

        required_sections = ['hardware', 'models', 'inference']

        

        for section in required_sections:

            if section not in self._config:

                raise ValueError(f"Required configuration section '{section}' missing")

        

        # Validate hardware configuration

        hardware = self._config['hardware']

        if 'preferred_devices' not in hardware:

            raise ValueError("hardware.preferred_devices is required")

            

        valid_devices = ['cuda', 'mps', 'cpu']

        for device in hardware['preferred_devices']:

            if device not in valid_devices:

                raise ValueError(f"Invalid device '{device}'. Must be one of {valid_devices}")

        

        # Validate model configurations

        models = self._config['models']

        if 'local' in models:

            for model_name, model_config in models['local'].items():

                if model_name == 'base_path':

                    continue

                    

                if 'memory_required_gb' in model_config:

                    if model_config['memory_required_gb'] <= 0:

                        raise ValueError(f"Model {model_name}: memory_required_gb must be positive")

                        

                if 'supported_devices' in model_config:

                    for device in model_config['supported_devices']:

                        if device not in valid_devices:

                            raise ValueError(f"Model {model_name}: invalid device '{device}'")

        

        # Validate remote API configurations

        if 'remote' in models:

            for provider, config in models['remote'].items():

                if config.get('enabled', False):

                    if 'api_key' not in config:

                        raise ValueError(f"Remote provider {provider}: api_key is required")

                    if 'base_url' not in config:

                        raise ValueError(f"Remote provider {provider}: base_url is required")

    

    def get_hardware_config(self) -> HardwareConfig:

        """Get hardware configuration as a structured object"""

        config = self.load_config()

        hardware = config['hardware']

        

        return HardwareConfig(

            preferred_devices=hardware['preferred_devices'],

            cuda_enabled=hardware.get('cuda', {}).get('enabled', True),

            cuda_memory_fraction=hardware.get('cuda', {}).get('memory_fraction', 0.9),

            mps_enabled=hardware.get('mps', {}).get('enabled', True),

            mps_fallback_to_cpu=hardware.get('mps', {}).get('fallback_to_cpu', True),

            cpu_threads=hardware.get('cpu', {}).get('threads')

        )

    

    def get_model_config(self, model_name: str) -> Optional[ModelConfig]:

        """Get configuration for a specific model"""

        config = self.load_config()

        

        # Check local models

        local_models = config.get('models', {}).get('local', {})

        if model_name in local_models:

            model_data = local_models[model_name]

            base_path = local_models.get('base_path', './models')

            

            return ModelConfig(

                name=model_name,

                path=os.path.join(base_path, model_data.get('path', '')),

                context_length=model_data.get('context_length', 4096),

                quantization=model_data.get('quantization', 'q4_0'),

                tensor_parallel=model_data.get('tensor_parallel', False),

                tensor_parallel_size=model_data.get('tensor_parallel_size', 1),

                memory_required_gb=model_data.get('memory_required_gb', 4.0),

                supported_devices=model_data.get('supported_devices', ['cuda', 'mps', 'cpu'])

            )

        

        return None

    

    def get_inference_config(self, model_name: Optional[str] = None) -> InferenceConfig:

        """Get inference configuration with optional model-specific overrides"""

        config = self.load_config()

        inference = config.get('inference', {})

        

        # Start with defaults

        defaults = inference.get('defaults', {})

        result = InferenceConfig(

            temperature=defaults.get('temperature', 0.7),

            max_tokens=defaults.get('max_tokens', 2048),

            top_p=defaults.get('top_p', 0.9),

            stream=defaults.get('stream', True)

        )

        

        # Apply model-specific overrides

        if model_name:

            overrides = inference.get('overrides', {}).get(model_name, {})

            for key, value in overrides.items():

                if hasattr(result, key):

                    setattr(result, key, value)

        

        return result

    

    def get_available_models(self, device_type: Optional[str] = None) -> List[str]:

        """Get list of available models, optionally filtered by device support"""

        config = self.load_config()

        models = []

        

        # Local models

        local_models = config.get('models', {}).get('local', {})

        for model_name, model_config in local_models.items():

            if model_name == 'base_path':

                continue

                

            if device_type is None:

                models.append(model_name)

            elif device_type in model_config.get('supported_devices', []):

                models.append(model_name)

        

        # Remote models

        remote_providers = config.get('models', {}).get('remote', {})

        for provider, provider_config in remote_providers.items():

            if provider_config.get('enabled', False):

                for model in provider_config.get('models', []):

                    models.append(f"{provider}:{model}")

        

        return models

    

    def validate_model_device_compatibility(self, model_name: str, device_type: str) -> bool:

        """Check if a model is compatible with a specific device type"""

        model_config = self.get_model_config(model_name)

        if model_config is None:

            return False

            

        return device_type in model_config.supported_devices


Advanced configuration systems support environment variable interpolation, allowing sensitive information like API keys to be injected at runtime without storing them in configuration files. They also implement validation systems that catch configuration errors early and provide helpful error messages when hardware requirements aren’t met.


The most sophisticated implementations support configuration inheritance and composition, letting users define base configurations that can be extended for specific use cases. A base configuration might specify common model parameters, while derived configurations adjust settings for different hardware profiles or deployment environments.
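A minimal sketch of this composition pattern is a recursive merge that layers an environment-specific override (like the environments: section in the YAML above) onto a base configuration:

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Recursively layer `override` onto `base`, returning a new dict.

    Nested mappings merge key-by-key; scalars and lists are replaced outright.
    """
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


base = {"hardware": {"cuda": {"memory_fraction": 0.9, "allow_tf32": True}}}
dev_overrides = {"hardware": {"cuda": {"memory_fraction": 0.7}},
                 "logging_level": "DEBUG"}
config = deep_merge(base, dev_overrides)
# allow_tf32 survives from the base; memory_fraction comes from the override
```

Because the merge returns a fresh dictionary, the base configuration stays untouched and can be reused to derive other environment profiles.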


A Production-Ready Multi-Platform Implementation


Building a production-ready system requires careful attention to the integration between device detection, configuration management, and runtime adaptation. The architecture should cleanly separate concerns while providing a unified interface that applications can use without worrying about underlying platform differences.


Here is a complete implementation that ties together device detection, configuration management, and runtime adaptation:


import torch
import asyncio
import logging
from typing import Optional, Dict, Any, Callable, Union
from contextlib import contextmanager
from dataclasses import dataclass
from abc import ABC, abstractmethod


class LLMBackend(ABC):
    """Abstract base class for LLM backends"""
    
    @abstractmethod
    async def generate(self, prompt: str, **kwargs) -> str:
        pass
    
    @abstractmethod
    def get_memory_usage(self) -> Dict[str, float]:
        pass
    
    @abstractmethod
    def cleanup(self) -> None:
        pass


class LocalLLMBackend(LLMBackend):
    """Local LLM backend using PyTorch"""
    
    def __init__(self, model_config: ModelConfig, device: torch.device):
        self.model_config = model_config
        self.device = device
        self.model = None
        self.tokenizer = None
        self.logger = logging.getLogger(__name__)
        
    async def load_model(self) -> None:
        """Load the model onto the specified device"""
        try:
            self.logger.info(f"Loading {self.model_config.name} on {self.device}")
            
            # Platform-specific model loading optimizations
            if self.device.type == "cuda":
                await self._load_cuda_model()
            elif self.device.type == "mps":
                await self._load_mps_model()
            else:
                await self._load_cpu_model()
                
            self.logger.info(f"Model {self.model_config.name} loaded successfully")
            
        except Exception as e:
            self.logger.error(f"Failed to load model: {e}")
            raise
    
    async def _load_cuda_model(self) -> None:
        """CUDA-specific model loading with optimizations"""
        # Simulate model loading - replace with actual implementation
        await asyncio.sleep(0.1)  # Simulate loading time
        
        # CUDA-specific optimizations
        torch.backends.cudnn.benchmark = True
        if hasattr(torch.backends.cuda, 'enable_flash_sdp'):
            torch.backends.cuda.enable_flash_sdp(True)
            
        # Enable tensor parallelism if configured
        if self.model_config.tensor_parallel and torch.cuda.device_count() > 1:
            self.logger.info(f"Enabling tensor parallelism across {self.model_config.tensor_parallel_size} GPUs")
            
    async def _load_mps_model(self) -> None:
        """MPS-specific model loading"""
        await asyncio.sleep(0.1)
        
        # MPS optimizations
        import os
        os.environ['PYTORCH_ENABLE_MPS_FALLBACK'] = '1'
        
    async def _load_cpu_model(self) -> None:
        """CPU-specific model loading"""
        await asyncio.sleep(0.1)
        
        # CPU optimizations
        # ModelConfig defines no cpu_threads field, so fall back to PyTorch's default
        torch.set_num_threads(getattr(self.model_config, 'cpu_threads', None) or torch.get_num_threads())
        
    async def generate(self, prompt: str, **kwargs) -> str:
        """Generate text using the local model"""
        if self.model is None:
            raise RuntimeError("Model not loaded")
            
        # Simulate text generation - replace with actual implementation
        await asyncio.sleep(0.5)
        return f"Generated response for: {prompt[:50]}..."
    
    def get_memory_usage(self) -> Dict[str, float]:
        """Get current memory usage statistics"""
        if self.device.type == "cuda":
            return {
                "allocated_gb": torch.cuda.memory_allocated(self.device) / 1e9,
                "reserved_gb": torch.cuda.memory_reserved(self.device) / 1e9,
                "max_allocated_gb": torch.cuda.max_memory_allocated(self.device) / 1e9
            }
        elif self.device.type == "mps":
            return {
                "current_allocated_gb": torch.mps.current_allocated_memory() / 1e9,
                "driver_allocated_gb": torch.mps.driver_allocated_memory() / 1e9
            }
        else:
            import psutil
            return {
                "system_memory_gb": psutil.virtual_memory().used / 1e9,
                "available_memory_gb": psutil.virtual_memory().available / 1e9
            }
    
    def cleanup(self) -> None:
        """Clean up model resources"""
        if self.model is not None:
            del self.model
            self.model = None
            
        if self.device.type == "cuda":
            torch.cuda.empty_cache()
        elif self.device.type == "mps":
            torch.mps.empty_cache()


class RemoteLLMBackend(LLMBackend):
    """Remote API backend for LLM services"""
    
    def __init__(self, provider_config: Dict[str, Any]):
        self.provider_config = provider_config
        self.base_url = provider_config['base_url']
        self.api_key = provider_config['api_key']
        self.timeout = provider_config.get('timeout', 30)
        self.max_retries = provider_config.get('max_retries', 3)
        self.logger = logging.getLogger(__name__)
        
    async def generate(self, prompt: str, **kwargs) -> str:
        """Generate text using remote API"""
        import aiohttp
        
        headers = {
            "Authorization": f"Bearer {self.api_key}",
            "Content-Type": "application/json"
        }
        
        payload = {
            "messages": [{"role": "user", "content": prompt}],
            **kwargs
        }
        
        for attempt in range(self.max_retries):
            try:
                async with aiohttp.ClientSession(timeout=aiohttp.ClientTimeout(total=self.timeout)) as session:
                    async with session.post(f"{self.base_url}/chat/completions",
                                            headers=headers, json=payload) as response:
                        if response.status == 200:
                            data = await response.json()
                            return data['choices'][0]['message']['content']
                        else:
                            self.logger.warning(f"API request failed with status {response.status}")
                            
            except Exception as e:
                self.logger.warning(f"API request attempt {attempt + 1} failed: {e}")
                if attempt == self.max_retries - 1:
                    raise
                    
                await asyncio.sleep(2 ** attempt)  # Exponential backoff: 1s, 2s, 4s, ...
        
        raise RuntimeError(f"All {self.max_retries} API attempts failed")
    
    def get_memory_usage(self) -> Dict[str, float]:
        """Remote APIs don't have local memory usage"""
        return {"remote_api": 0.0}
    
    def cleanup(self) -> None:
        """Nothing to clean up for remote APIs"""
        pass


class LLMManager:
    """Main manager class that orchestrates device detection, configuration, and model loading"""
    
    def __init__(self, config_path: str = "config.yaml"):
        self.config_manager = ConfigManager(config_path)
        self.device_manager = DeviceManager()
        self.current_backend: Optional[LLMBackend] = None
        self.logger = logging.getLogger(__name__)
        
    async def initialize(self, model_name: str = None) -> None:
        """Initialize the LLM manager with optimal configuration"""
        try:
            # Load configuration
            config = self.config_manager.load_config()
            
            # Detect available hardware
            device_info = self.device_manager.detect_available_devices()
            
            # Configure memory management
            self.device_manager.configure_memory_management(device_info)
            
            # Select and load model
            if model_name:
                await self._load_specific_model(model_name, device_info)
            else:
                await self._load_optimal_model(device_info)
                
        except Exception as e:
            self.logger.error(f"Failed to initialize LLM manager: {e}")
            raise
    
    async def _load_specific_model(self, model_name: str, device_info: DeviceInfo) -> None:
        """Load a specific model with device compatibility checking"""
        model_config = self.config_manager.get_model_config(model_name)
        
        if model_config is None:
            # Try remote models
            if ":" in model_name:
                provider, model = model_name.split(":", 1)
                await self._load_remote_model(provider, model)
                return
            else:
                raise ValueError(f"Model {model_name} not found in configuration")
        
        # Check device compatibility
        device_type = device_info.device_type.value
        if device_type not in model_config.supported_devices:
            self.logger.warning(f"Model {model_name} doesn't support {device_type}, trying fallback")
            await self._try_fallback_devices(model_config, device_info)
            return
        
        # Check memory requirements
        if (device_info.memory_gb and 
            model_config.memory_required_gb > device_info.memory_gb * 0.9):
            self.logger.warning(f"Insufficient memory for {model_name}, trying fallback")
            await self._try_fallback_devices(model_config, device_info)
            return
        
        # Load local model
        device = self.device_manager.get_optimal_device(model_config.memory_required_gb)
        backend = LocalLLMBackend(model_config, device)
        await backend.load_model()
        self.current_backend = backend
        
    async def _try_fallback_devices(self, model_config: ModelConfig, device_info: DeviceInfo) -> None:
        """Try loading model on fallback devices"""
        config = self.config_manager.load_config()
        fallback_chain = config.get('fallback', {}).get('device_fallback_chain', ['cuda', 'mps', 'cpu'])
        
        for device_type in fallback_chain:
            if device_type in model_config.supported_devices:
                try:
                    device = torch.device(device_type)
                    backend = LocalLLMBackend(model_config, device)
                    await backend.load_model()
                    self.current_backend = backend
                    self.logger.info(f"Successfully loaded {model_config.name} on fallback device {device_type}")
                    return
                except Exception as e:
                    self.logger.warning(f"Failed to load on {device_type}: {e}")
                    continue
        
        raise RuntimeError(f"Failed to load {model_config.name} on any compatible device")
    
    async def _load_remote_model(self, provider: str, model: str) -> None:
        """Load a remote API model"""
        config = self.config_manager.load_config()
        remote_config = config.get('models', {}).get('remote', {}).get(provider)
        
        if not remote_config or not remote_config.get('enabled'):
            raise ValueError(f"Remote provider {provider} not configured or disabled")
        
        if model not in remote_config.get('models', []):
            raise ValueError(f"Model {model} not available from provider {provider}")
        
        backend = RemoteLLMBackend(remote_config)
        self.current_backend = backend
        
    async def _load_optimal_model(self, device_info: DeviceInfo) -> None:
        """Load the best available model for the detected hardware"""
        available_models = self.config_manager.get_available_models(device_info.device_type.value)
        
        if not available_models:
            raise RuntimeError("No compatible models found")
        
        # Simple heuristic: pick the largest model that fits in memory
        best_model = None
        best_memory = -1.0
        for model_name in available_models:
            if ":" in model_name:  # Skip remote models for auto-selection
                continue

            model_config = self.config_manager.get_model_config(model_name)
            fits = (device_info.memory_gb is None or
                    model_config.memory_required_gb <= device_info.memory_gb * 0.9)
            if fits and model_config.memory_required_gb > best_memory:
                best_memory = model_config.memory_required_gb
                best_model = model_name
        
        if best_model:
            await self._load_specific_model(best_model, device_info)
        else:
            raise RuntimeError("No suitable model found for available hardware")
    
    async def generate(self, prompt: str, **kwargs) -> str:
        """Generate text using the current backend"""
        if self.current_backend is None:
            raise RuntimeError("No model loaded. Call initialize() first.")
        
        # Apply inference configuration
        inference_config = self.config_manager.get_inference_config()
        generation_kwargs = {
            'temperature': kwargs.get('temperature', inference_config.temperature),
            'max_tokens': kwargs.get('max_tokens', inference_config.max_tokens),
            'top_p': kwargs.get('top_p', inference_config.top_p),
            'stream': kwargs.get('stream', inference_config.stream)
        }
        
        try:
            return await self.current_backend.generate(prompt, **generation_kwargs)
        except Exception as e:
            self.logger.error(f"Generation failed: {e}")
            # Implement fallback logic here if needed
            raise
    
    @contextmanager
    def monitor_performance(self):
        """Context manager for performance monitoring"""
        start_memory = None
        if self.current_backend:
            start_memory = self.current_backend.get_memory_usage()
        
        import time
        start_time = time.time()
        
        try:
            yield
        finally:
            end_time = time.time()
            duration = end_time - start_time
            
            if self.current_backend:
                end_memory = self.current_backend.get_memory_usage()
                self.logger.info(f"Operation completed in {duration:.2f}s")
                self.logger.info(f"Memory usage: {end_memory}")
    
    def get_status(self) -> Dict[str, Any]:
        """Get current system status"""
        device_info = self.device_manager.detect_available_devices()
        memory_usage = self.current_backend.get_memory_usage() if self.current_backend else {}
        
        return {
            "device_type": device_info.device_type.value,
            "device_count": device_info.device_count,
            "memory_gb": device_info.memory_gb,
            "platform": device_info.platform_name,
            "model_loaded": self.current_backend is not None,
            "memory_usage": memory_usage
        }
    
    async def cleanup(self) -> None:
        """Clean up resources"""
        if self.current_backend:
            self.current_backend.cleanup()
            self.current_backend = None


# Example usage demonstrating the complete system
async def main():
    """Example showing how to use the complete LLM management system"""
    
    # Initialize logging
    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger(__name__)
    
    llm_manager = None
    try:
        # Create LLM manager
        llm_manager = LLMManager("config.yaml")
        
        # Initialize with automatic model selection
        await llm_manager.initialize()
        
        # Get system status
        status = llm_manager.get_status()
        logger.info(f"System initialized: {status}")
        
        # Generate text with performance monitoring
        with llm_manager.monitor_performance():
            response = await llm_manager.generate(
                "Explain the benefits of configuration-driven LLM applications",
                temperature=0.8,
                max_tokens=1024
            )
            logger.info(f"Generated response: {response[:100]}...")
        
        # Try loading a specific model
        await llm_manager.initialize("llama2_7b")
        
        # Generate with the new model
        response = await llm_manager.generate("Write a Python function to detect GPU capabilities")
        logger.info(f"Code generation response: {response[:100]}...")
        
    except Exception as e:
        logger.error(f"Error in main: {e}")
    finally:
        if llm_manager is not None:
            await llm_manager.cleanup()


if __name__ == "__main__":
    asyncio.run(main())


The device management layer handles all platform-specific logic, presenting a consistent interface regardless of whether the application runs on CUDA, ROCm, or MPS. This abstraction includes memory management, with different strategies for discrete GPUs versus unified memory systems, and performance optimization that applies platform-specific techniques transparently.
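One way to sketch that idea is a small strategy table mapping each device type to its memory-management behavior. The knob names below are illustrative assumptions, not a real API: discrete GPUs get a capped, explicitly flushed VRAM pool, while unified-memory systems are largely left to the OS.

```python
from typing import Any, Dict

def memory_strategy(device_type: str) -> Dict[str, Any]:
    """Map a device type to memory-management behavior (illustrative knobs).

    Discrete GPUs (CUDA/ROCm) own a VRAM pool that should be capped and
    explicitly flushed; Apple's unified memory is shared with the OS, so
    pressure handling is mostly delegated to the system.
    """
    strategies = {
        "cuda": {"dedicated_pool": True,  "cap_fraction": 0.9,  "explicit_flush": True},
        "mps":  {"dedicated_pool": False, "cap_fraction": None, "explicit_flush": True},
        "cpu":  {"dedicated_pool": False, "cap_fraction": None, "explicit_flush": False},
    }
    # Unknown device types fall back to conservative CPU behavior
    return strategies.get(device_type, strategies["cpu"])
```

Keeping this mapping in one place is what lets the rest of the application stay platform-agnostic.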


Configuration validation becomes crucial in multi-platform deployments. The system needs to verify that requested configurations are possible on the target hardware, providing clear error messages and suggested alternatives when they’re not. For example, if a user requests tensor parallelism on a single-GPU system, the validator should explain why this isn’t possible and suggest alternative optimizations.
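A minimal sketch of that tensor-parallelism check might look like the following; the function name and message wording are illustrative, and the GPU count is passed in explicitly so the check stays testable on machines without GPUs:

```python
from typing import List

def check_tensor_parallel(requested_size: int, available_gpus: int) -> List[str]:
    """Return human-readable problems for a tensor-parallelism request.

    An empty list means the request is satisfiable on this hardware.
    """
    problems = []
    if requested_size > 1 and available_gpus < requested_size:
        problems.append(
            f"tensor_parallel_size={requested_size} requires {requested_size} "
            f"CUDA GPUs, but only {available_gpus} detected; consider 4-bit "
            "quantization or a smaller model instead"
        )
    return problems
```

Running every such check up front and reporting all problems at once spares users the frustration of fixing errors one reload at a time.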


Error handling and fallback strategies need particular attention in multi-platform systems. Hardware-specific failures should trigger automatic fallbacks to alternative execution strategies rather than application crashes. If CUDA initialization fails, the system should attempt MPS on Apple Silicon or fall back to CPU inference with appropriate user notification.
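The fallback chain itself reduces to a simple loop. The sketch below assumes each device name maps to a zero-argument loader that raises on failure; collected errors are surfaced only when every device in the chain fails:

```python
from typing import Any, Callable, Dict, List, Tuple

def load_with_fallback(chain: List[str],
                       loaders: Dict[str, Callable[[], Any]]) -> Tuple[str, Any]:
    """Try each device's loader in order; return (device, backend) for the
    first that succeeds, raising only if the whole chain is exhausted."""
    errors = {}
    for device in chain:
        if device not in loaders:
            continue
        try:
            return device, loaders[device]()
        except Exception as exc:  # any backend failure triggers fallback
            errors[device] = str(exc)
    raise RuntimeError(f"All devices in {chain} failed: {errors}")
```

Logging each intermediate failure (rather than swallowing it) is what makes the eventual user notification actionable.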


The runtime monitoring system should track performance metrics and resource usage across different platforms, helping users understand whether their configuration choices are optimal. This telemetry can inform automatic optimization suggestions and help identify when hardware upgrades might be beneficial.
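A rolling window of recent requests is often enough for this kind of telemetry. The class below is a sketch, not part of the system above; it tracks per-request latency and token counts so throughput can be reported per platform:

```python
from collections import deque

class InferenceTelemetry:
    """Rolling window of per-request latency and token counts (a sketch)."""

    def __init__(self, window: int = 100):
        self.samples = deque(maxlen=window)  # (duration_s, tokens) pairs

    def record(self, duration_s: float, tokens: int) -> None:
        self.samples.append((duration_s, tokens))

    def tokens_per_second(self) -> float:
        total_time = sum(d for d, _ in self.samples)
        total_tokens = sum(t for _, t in self.samples)
        return total_tokens / total_time if total_time else 0.0
```

Comparing this number against the throughput expected for the detected hardware is one way to surface "your configuration is suboptimal" suggestions automatically.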


Configuration Management Best Practices


Effective configuration management in LLM applications requires balancing flexibility with usability. The configuration schema should provide powerful options for advanced users while offering sensible defaults that work well for typical use cases. This dual approach lets applications be both approachable for newcomers and controllable for power users.


Here is an advanced configuration loader that demonstrates inheritance and composition patterns:


import yaml
import os
import logging
from typing import Dict, Any, List, Optional
from pathlib import Path
import copy


class AdvancedConfigManager:
    """Advanced configuration manager with inheritance and composition support"""

    def __init__(self, base_config_path: str = "config.yaml"):
        self.base_config_path = Path(base_config_path)
        self.config_search_paths = [
            Path.cwd() / "config",
            Path.home() / ".llm-app",
            Path("/etc/llm-app")
        ]
        self.loaded_configs = {}
        self.logger = logging.getLogger(__name__)

    def load_hierarchical_config(self, environment: str = None) -> Dict[str, Any]:
        """Load configuration with hierarchical merging"""

        # 1. Load base configuration
        base_config = self._load_single_config(self.base_config_path)

        # 2. Load user-specific overrides
        user_config_path = Path.home() / ".llm-app" / "config.yaml"
        if user_config_path.exists():
            user_config = self._load_single_config(user_config_path)
            base_config = self._deep_merge(base_config, user_config)
            self.logger.info(f"Applied user configuration from {user_config_path}")

        # 3. Load project-specific overrides
        project_config_path = Path.cwd() / "config.local.yaml"
        if project_config_path.exists():
            project_config = self._load_single_config(project_config_path)
            base_config = self._deep_merge(base_config, project_config)
            self.logger.info(f"Applied project configuration from {project_config_path}")

        # 4. Apply environment-specific overrides
        if environment:
            env_config = base_config.get('environments', {}).get(environment, {})
            if env_config:
                base_config = self._deep_merge(base_config, env_config)
                self.logger.info(f"Applied {environment} environment configuration")

        # 5. Apply environment variable overrides
        base_config = self._apply_env_overrides(base_config)

        return base_config

    def _load_single_config(self, config_path: Path) -> Dict[str, Any]:
        """Load a single configuration file with includes support"""

        if config_path in self.loaded_configs:
            return copy.deepcopy(self.loaded_configs[config_path])

        try:
            with open(config_path, 'r') as f:
                content = f.read()

            # Substitute environment variables
            content = self._substitute_env_vars(content)

            # Parse YAML (an empty file parses to None, so normalize to a dict)
            config = yaml.safe_load(content) or {}

            # Process includes
            if 'includes' in config:
                for include_path in config['includes']:
                    include_full_path = self._resolve_include_path(include_path, config_path)
                    if include_full_path and include_full_path.exists():
                        include_config = self._load_single_config(include_full_path)
                        config = self._deep_merge(include_config, config)

                # Remove includes from final config
                del config['includes']

            # Cache the loaded config
            self.loaded_configs[config_path] = copy.deepcopy(config)

            return config

        except Exception as e:
            self.logger.error(f"Failed to load config {config_path}: {e}")
            raise

    def _resolve_include_path(self, include_path: str, base_config_path: Path) -> Optional[Path]:
        """Resolve include path relative to base config or search paths"""
        include_path = Path(include_path)

        # Try relative to base config directory
        if not include_path.is_absolute():
            relative_path = base_config_path.parent / include_path
            if relative_path.exists():
                return relative_path

        # Try absolute path
        if include_path.is_absolute() and include_path.exists():
            return include_path

        # Try search paths
        for search_path in self.config_search_paths:
            full_path = search_path / include_path
            if full_path.exists():
                return full_path

        self.logger.warning(f"Include file {include_path} not found")
        return None

    def _deep_merge(self, base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
        """Deep merge two configuration dictionaries"""
        result = copy.deepcopy(base)

        for key, value in override.items():
            if key in result and isinstance(result[key], dict) and isinstance(value, dict):
                result[key] = self._deep_merge(result[key], value)
            elif key in result and isinstance(result[key], list) and isinstance(value, list):
                # For lists, extend rather than replace
                result[key].extend(value)
            else:
                result[key] = copy.deepcopy(value)

        return result

    def _apply_env_overrides(self, config: Dict[str, Any]) -> Dict[str, Any]:
        """Apply environment variable overrides using dot notation"""
        # Environment variables like LLM_HARDWARE_CUDA_ENABLED=false
        # override config['hardware']['cuda']['enabled']

        for env_var, value in os.environ.items():
            if not env_var.startswith('LLM_'):
                continue

            # Convert LLM_HARDWARE_CUDA_ENABLED to ['hardware', 'cuda', 'enabled'].
            # Note: splitting on '_' is ambiguous for keys that themselves
            # contain underscores (e.g. 'preferred_devices'); a double
            # underscore delimiter avoids this in practice.
            path_parts = env_var[4:].lower().split('_')  # Remove LLM_ prefix

            # Navigate to the parent container
            current = config
            for part in path_parts[:-1]:
                if part not in current:
                    current[part] = {}
                current = current[part]

            # Set the final value with type conversion
            final_key = path_parts[-1]
            current[final_key] = self._convert_env_value(value)

            self.logger.info(f"Applied environment override: {env_var}={value}")

        return config

    def _convert_env_value(self, value: str) -> Any:
        """Convert string environment variable values to appropriate types"""
        value = value.strip()

        # Boolean conversion
        if value.lower() in ('true', 'yes', '1', 'on'):
            return True
        elif value.lower() in ('false', 'no', '0', 'off'):
            return False

        # Number conversion
        try:
            if '.' in value:
                return float(value)
            else:
                return int(value)
        except ValueError:
            pass

        # List conversion (comma-separated)
        if ',' in value:
            return [item.strip() for item in value.split(',')]

        # Return as string
        return value

    def _substitute_env_vars(self, content: str) -> str:
        """Advanced environment variable substitution with defaults"""
        import re

        def replace_env_var(match):
            var_expression = match.group(1)

            # Handle ${VAR:default_value} syntax
            if ':' in var_expression:
                var_name, default_value = var_expression.split(':', 1)
                return os.getenv(var_name, default_value)
            else:
                return os.getenv(var_expression, match.group(0))

        return re.sub(r'\$\{([^}]+)\}', replace_env_var, content)

    def generate_config_template(self, output_path: str = "config.template.yaml") -> None:
        """Generate a configuration template with comments and examples"""
        template_content = """# LLM Application Configuration Template
# This file demonstrates all available configuration options

version: "1.0"

# Application settings
application:
  name: "LLM Assistant"
  logging_level: "INFO"  # DEBUG, INFO, WARNING, ERROR

# Hardware and device configuration
hardware:
  # Preferred device order - will try devices in this sequence
  preferred_devices: ["cuda", "mps", "cpu"]

  # CUDA/ROCm settings
  cuda:
    enabled: true
    device_ids: []  # Empty list means use all available GPUs
    memory_fraction: 0.9  # Use 90% of GPU memory
    allow_tf32: true  # Enable TensorFloat-32 on compatible hardware
    enable_flash_attention: true

    # ROCm-specific settings (only used on AMD hardware)
    rocm:
      enable_tunable_op: true  # Enable ROCm TunableOp optimizations
      hip_visible_devices: null  # null means all devices

  # Apple Metal Performance Shaders settings
  mps:
    enabled: true
    fallback_to_cpu: true  # Fallback for unsupported operations
    memory_limit_gb: null  # null means system manages memory

  # CPU settings
  cpu:
    threads: null  # null means auto-detect optimal thread count
    memory_limit_gb: 8

# Model configuration
models:
  # Local models stored on filesystem
  local:
    base_path: "./models"  # Base directory for local models

    # Example: Small model for development/testing
    llama2_7b:
      path: "llama-2-7b-chat.gguf"
      context_length: 4096
      quantization: "q4_0"  # q4_0, q5_1, q8_0, f16, f32
      tensor_parallel: false
      memory_required_gb: 4.0
      supported_devices: ["cuda", "mps", "cpu"]

    # Example: Large model requiring multi-GPU
    llama2_70b:
      path: "llama-2-70b-chat.gguf"
      context_length: 4096
      quantization: "q4_0"
      tensor_parallel: true
      tensor_parallel_size: 2  # Split across 2 GPUs
      memory_required_gb: 40.0
      supported_devices: ["cuda"]  # Requires CUDA for tensor parallel

  # Remote API endpoints
  remote:
    openai:
      enabled: false  # Set to true to enable
      api_key: "${OPENAI_API_KEY}"  # Environment variable
      base_url: "https://api.openai.com/v1"
      models: ["gpt-4", "gpt-3.5-turbo"]
      timeout: 30
      max_retries: 3

    anthropic:
      enabled: false
      api_key: "${ANTHROPIC_API_KEY}"
      base_url: "https://api.anthropic.com"
      models: ["claude-3-opus-20240229"]
      timeout: 30
      max_retries: 3

# Inference parameters
inference:
  defaults:
    temperature: 0.7        # Randomness (0.0 = deterministic, 1.0 = creative)
    max_tokens: 2048        # Maximum tokens to generate
    top_p: 0.9              # Nucleus sampling threshold
    top_k: 40               # Top-k sampling limit
    repetition_penalty: 1.1 # Penalty for repetition
    stream: true            # Stream responses token by token

  # Model-specific parameter overrides
  overrides:
    llama2_70b:
      batch_size: 1  # Large models may need smaller batches

# Environment-specific configurations
environments:
  development:
    application:
      logging_level: "DEBUG"
    hardware:
      cuda:
        memory_fraction: 0.7  # Leave more memory for dev tools

  production:
    application:
      logging_level: "WARNING"
    performance:
      torch_compile: true  # Enable optimizations in production

  testing:
    models:
      local:
        llama2_7b:
          context_length: 512  # Smaller context for faster tests

# Performance optimizations
performance:
  torch_compile: false     # Enable PyTorch 2.0 compilation
  flash_attention: true    # Use Flash Attention when available

  quantization:
    default_precision: "float16"  # float32, float16, bfloat16
    dynamic_quantization: true

  batching:
    max_batch_size: 8
    batch_timeout_ms: 100

# Fallback and error handling
fallback:
  enabled: true
  device_fallback_chain: ["cuda", "mps", "cpu"]
  model_fallback_chain: ["local", "remote"]
  final_fallback: "cpu"
  max_retries: 3
  retry_delay_ms: 1000

# Memory management
memory:
  global:
    garbage_collect_threshold: 0.8
    cache_size_mb: 1024

  cuda:
    memory_pool: true
    empty_cache_threshold: 0.9

  mps:
    unified_memory_management: true

  cpu:
    max_memory_gb: 16
"""

        with open(output_path, 'w') as f:
            f.write(template_content)

        self.logger.info(f"Configuration template generated: {output_path}")


# Validation system with detailed error reporting
class ConfigValidator:
    """Comprehensive configuration validator with detailed error reporting"""

    def __init__(self):
        self.errors: List[str] = []
        self.warnings: List[str] = []

    def validate(self, config: Dict[str, Any]) -> bool:
        """Validate configuration and return True if valid"""
        self.errors.clear()
        self.warnings.clear()

        self._validate_structure(config)
        self._validate_hardware_config(config.get('hardware', {}))
        self._validate_model_config(config.get('models', {}))
        self._validate_inference_config(config.get('inference', {}))
        self._validate_cross_references(config)

        return len(self.errors) == 0

    def _validate_structure(self, config: Dict[str, Any]) -> None:
        """Validate basic configuration structure"""
        required_sections = ['hardware', 'models', 'inference']
        for section in required_sections:
            if section not in config:
                self.errors.append(f"Required section '{section}' missing from configuration")

    def _validate_hardware_config(self, hardware: Dict[str, Any]) -> None:
        """Validate hardware configuration"""
        if 'preferred_devices' not in hardware:
            self.errors.append("hardware.preferred_devices is required")
            return

        valid_devices = ['cuda', 'mps', 'cpu']
        preferred = hardware['preferred_devices']

        if not isinstance(preferred, list) or not preferred:
            self.errors.append("hardware.preferred_devices must be a non-empty list")
            return

        for device in preferred:
            if device not in valid_devices:
                self.errors.append(f"Invalid device '{device}'. Valid devices: {valid_devices}")

        # Validate CUDA configuration
        cuda_config = hardware.get('cuda', {})
        if cuda_config.get('enabled', True):
            memory_fraction = cuda_config.get('memory_fraction', 0.9)
            if not 0.1 <= memory_fraction <= 1.0:
                self.errors.append("cuda.memory_fraction must be between 0.1 and 1.0")

            device_ids = cuda_config.get('device_ids', [])
            if device_ids and not all(isinstance(device_id, int) and device_id >= 0
                                      for device_id in device_ids):
                self.errors.append("cuda.device_ids must be a list of non-negative integers")

    def _validate_model_config(self, models: Dict[str, Any]) -> None:
        """Validate model configuration"""
        if not models.get('local') and not models.get('remote'):
            self.errors.append("At least one of models.local or models.remote must be configured")

        # Validate local models
        local = models.get('local', {})
        for model_name, model_config in local.items():
            if model_name == 'base_path':
                continue

            if not isinstance(model_config, dict):
                self.errors.append(f"Model {model_name} configuration must be an object")
                continue

            # Validate required fields
            if 'path' not in model_config:
                self.errors.append(f"Model {model_name}: 'path' field is required")

            memory_required = model_config.get('memory_required_gb', 0)
            if not isinstance(memory_required, (int, float)) or memory_required <= 0:
                self.errors.append(f"Model {model_name}: memory_required_gb must be a positive number")

            # Validate device support
            supported_devices = model_config.get('supported_devices', [])
            valid_devices = ['cuda', 'mps', 'cpu']
            for device in supported_devices:
                if device not in valid_devices:
                    self.errors.append(f"Model {model_name}: invalid supported device '{device}'")

        # Validate remote providers
        remote = models.get('remote', {})
        for provider, provider_config in remote.items():
            if not isinstance(provider_config, dict):
                self.errors.append(f"Remote provider {provider} configuration must be an object")
                continue

            if provider_config.get('enabled', False):
                required_fields = ['api_key', 'base_url', 'models']
                for field in required_fields:
                    if field not in provider_config:
                        self.errors.append(f"Remote provider {provider}: '{field}' is required")

                # Check for placeholder API keys
                api_key = provider_config.get('api_key', '')
                if api_key.startswith('${') and api_key.endswith('}'):
                    env_var = api_key[2:-1]
                    if not os.getenv(env_var):
                        self.warnings.append(f"Environment variable {env_var} not set for {provider}")

    def _validate_inference_config(self, inference: Dict[str, Any]) -> None:
        """Validate inference configuration"""
        defaults = inference.get('defaults', {})

        # Validate parameter ranges
        temperature = defaults.get('temperature', 0.7)
        if not isinstance(temperature, (int, float)) or not 0.0 <= temperature <= 2.0:
            self.errors.append("inference.defaults.temperature must be between 0.0 and 2.0")

        max_tokens = defaults.get('max_tokens', 2048)
        if not isinstance(max_tokens, int) or max_tokens <= 0:
            self.errors.append("inference.defaults.max_tokens must be a positive integer")

        top_p = defaults.get('top_p', 0.9)
        if not isinstance(top_p, (int, float)) or not 0.0 <= top_p <= 1.0:
            self.errors.append("inference.defaults.top_p must be between 0.0 and 1.0")

    def _validate_cross_references(self, config: Dict[str, Any]) -> None:
        """Validate cross-references between configuration sections"""
        # Check that fallback devices are valid
        fallback = config.get('fallback', {})
        device_chain = fallback.get('device_fallback_chain', [])
        preferred_devices = config.get('hardware', {}).get('preferred_devices', [])

        for device in device_chain:
            if device not in preferred_devices:
                self.warnings.append(f"Fallback device '{device}' not in preferred_devices list")

    def get_error_report(self) -> str:
        """Get formatted error and warning report"""
        report = []

        if self.errors:
            report.append("ERRORS:")
            for error in self.errors:
                report.append(f"  - {error}")

        if self.warnings:
            if report:
                report.append("")
            report.append("WARNINGS:")
            for warning in self.warnings:
                report.append(f"  - {warning}")

        

        return "\n".join(report) if report else "Configuration is valid."


Hierarchical configuration loading allows applications to merge settings from multiple sources: global defaults, user preferences, project-specific overrides, and runtime parameters. This system lets users maintain consistent preferences across projects while allowing per-project customization when needed. Environment-specific configurations become particularly important when deploying across different hardware environments.
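One way to sketch this layering is a recursive deep merge, where later sources override earlier ones key by key while nested sections are combined rather than replaced wholesale. The function and layer names below are illustrative, not part of the validator shown above:

```python
from typing import Any, Dict

def deep_merge(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Recursively merge 'override' into 'base'; later sources win on conflicts."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# Layering order: global defaults < user preferences < runtime overrides
global_defaults = {"inference": {"defaults": {"temperature": 0.7, "max_tokens": 2048}}}
user_prefs = {"inference": {"defaults": {"temperature": 0.2}}}
runtime = {"inference": {"defaults": {"max_tokens": 512}}}

config = deep_merge(deep_merge(global_defaults, user_prefs), runtime)
# config["inference"]["defaults"] == {"temperature": 0.2, "max_tokens": 512}
```

Because nested dictionaries merge rather than replace, a user can override a single inference parameter without restating the entire section.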


Secret management deserves special attention in configuration design. API keys, authentication tokens, and other sensitive values should never be stored directly in configuration files. Instead, configurations should reference environment variables or external secret management systems. This approach enables secure deployment in containerized environments and multi-user systems.
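The `${ENV_VAR}` placeholder convention that the validator above warns about can be resolved at load time, walking the configuration tree and substituting environment variables for placeholder strings. A minimal sketch (the helper name and example keys are illustrative):

```python
import os
import re
from typing import Any

# Matches values of the form "${SOME_ENV_VAR}" exactly
_PLACEHOLDER = re.compile(r"^\$\{([A-Za-z_][A-Za-z0-9_]*)\}$")

def resolve_secrets(node: Any) -> Any:
    """Replace '${VAR}' string values with os.environ['VAR'], recursively."""
    if isinstance(node, dict):
        return {k: resolve_secrets(v) for k, v in node.items()}
    if isinstance(node, list):
        return [resolve_secrets(v) for v in node]
    if isinstance(node, str):
        match = _PLACEHOLDER.match(node)
        if match:
            env_var = match.group(1)
            if env_var not in os.environ:
                raise KeyError(f"Environment variable {env_var} is not set")
            return os.environ[env_var]
    return node

os.environ["OPENAI_API_KEY"] = "sk-demo"  # set here for illustration only
cfg = resolve_secrets({"remote": {"openai": {"api_key": "${OPENAI_API_KEY}"}}})
# cfg["remote"]["openai"]["api_key"] == "sk-demo"
```

Failing loudly on a missing variable at load time is deliberate: a placeholder that silently survives into the runtime configuration would surface later as a confusing authentication error.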


Documentation and validation go hand in hand for configuration systems. Every configuration option should have clear documentation explaining its purpose, valid values, and interaction with other settings. Runtime validation should provide specific, actionable error messages that help users correct configuration issues quickly.


Version management becomes important as applications evolve. Configuration schemas should include version information and support migration from older formats. This forward compatibility ensures that user configurations continue working as applications are updated, reducing friction for long-term users.
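One common pattern for this is a chain of registered migrations, each upgrading the schema by one version, applied in sequence until the configuration reaches the current version. The version numbers and the field rename below are hypothetical examples, not taken from the application above:

```python
from typing import Any, Callable, Dict

# Map: schema version -> migration that upgrades a config to the next version
MIGRATIONS: Dict[int, Callable[[Dict[str, Any]], Dict[str, Any]]] = {}

def migration(from_version: int):
    """Decorator that registers a migration step for a given schema version."""
    def register(fn):
        MIGRATIONS[from_version] = fn
        return fn
    return register

@migration(1)
def v1_to_v2(config: Dict[str, Any]) -> Dict[str, Any]:
    # Hypothetical rename: single 'device' value became a 'preferred_devices' list
    hardware = config.setdefault("hardware", {})
    if "device" in hardware:
        hardware["preferred_devices"] = [hardware.pop("device")]
    config["version"] = 2
    return config

def migrate(config: Dict[str, Any], target_version: int) -> Dict[str, Any]:
    """Apply registered migrations until the config reaches target_version."""
    while config.get("version", 1) < target_version:
        current = config.get("version", 1)
        if current not in MIGRATIONS:
            raise ValueError(f"No migration registered from version {current}")
        config = MIGRATIONS[current](config)
    return config

old = {"version": 1, "hardware": {"device": "cuda"}}
new = migrate(old, 2)
# new == {"version": 2, "hardware": {"preferred_devices": ["cuda"]}}
```

Because each step only knows about adjacent versions, a config several versions old upgrades incrementally, and each migration stays small enough to review and test in isolation.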


The Path Forward: Recommendations for Modern LLM Applications


The future of LLM application development lies in platforms that abstract hardware complexity while preserving user choice and performance optimization. As the ecosystem matures, we can expect better standardization around device detection APIs and configuration patterns, making multi-platform development more straightforward.


For developers starting new projects, the recommendation is clear: design for multiple platforms from the beginning rather than retrofitting compatibility later. The incremental development cost is minimal compared to the architectural changes required to add multi-platform support to single-platform applications. Modern frameworks like MLC-LLM and vLLM provide excellent starting points with built-in multi-platform support.


Configuration-driven architecture represents a competitive advantage in the current landscape. Applications that let users control their deployment characteristics will appeal to a broader audience than those with fixed assumptions about hardware or usage patterns. The investment in sophisticated configuration management pays dividends in reduced support burden and increased user satisfaction.


Looking ahead, we can expect continued convergence in the underlying APIs across different compute platforms. Apple’s ongoing improvements to MPS, AMD’s advancement of ROCm, and industry standardization efforts suggest that the current platform-specific complexities may diminish over time. However, performance optimization will likely remain platform-specific, making configuration-driven approaches valuable even as compatibility improves.


The most successful LLM applications of the future will be those that combine powerful local inference capabilities with seamless cloud integration, automatic hardware optimization, and user-controlled configuration management. By implementing these patterns today, developers can build applications that remain competitive and useful regardless of how the underlying technology landscape evolves.


The era of AI democratization depends on applications that work everywhere, not just in ideal development environments. By embracing multi-platform architecture and configuration-driven design, developers can contribute to making advanced AI capabilities accessible to users regardless of their hardware preferences or constraints. This inclusive approach to AI application development will ultimately determine which tools succeed in the broader market and which remain niche technical curiosities.
