Introduction: The Dawn of Accessible Voice AI
The landscape of artificial intelligence has dramatically shifted in recent years, with large language models becoming increasingly sophisticated and accessible. Voice assistants, once the exclusive domain of tech giants with massive resources, can now be built using entirely open source components. This democratization of AI technology opens unprecedented opportunities for developers, researchers, and organizations to create customized voice interfaces tailored to specific needs.
Building a voice assistant involves orchestrating several complex components that must work seamlessly together. The primary challenge lies not just in implementing individual components, but in creating a cohesive system where speech recognition, language understanding, response generation, and speech synthesis operate in harmony. This article will guide you through creating such a system using only open source tools and libraries.
The open source approach offers several compelling advantages over proprietary solutions. First, it provides complete control over data privacy and security, as all processing can occur locally without sending sensitive information to external services. Second, it allows for unlimited customization and fine-tuning to meet specific requirements. Third, it eliminates ongoing costs associated with cloud-based APIs, making it economically viable for long-term deployment.
Our implementation will leverage several key open source projects. HuggingFace Transformers will provide access to state-of-the-art language models, while OpenAI's Whisper will handle speech recognition. For text-to-speech synthesis, we'll use Coqui TTS, and LangChain will orchestrate the conversation flow and memory management. The entire system will be built using Python, ensuring broad compatibility and ease of deployment.
Architecture Overview: Understanding the Voice Assistant Pipeline
A voice assistant operates through a sophisticated pipeline that transforms spoken input into meaningful responses and back to speech output. Understanding this architecture is crucial for successful implementation and optimization.
The pipeline begins with audio capture, where microphones record user speech in real-time. This raw audio data requires preprocessing to remove noise, normalize volume levels, and segment speech from silence. The preprocessed audio then flows to the speech-to-text component, which converts acoustic signals into textual representations.
Once we have text, the language model processes the user's intent and generates an appropriate response. This stage involves understanding context, maintaining conversation history, and potentially accessing external knowledge sources or APIs. The language model's textual response then moves to the text-to-speech synthesizer, which converts written words back into natural-sounding speech.
Throughout this pipeline, several supporting components ensure smooth operation. A conversation manager maintains dialogue state and context across multiple exchanges. An audio manager handles real-time streaming, buffering, and playback. Error handling and fallback mechanisms ensure graceful degradation when components fail or produce unexpected results.
The modular nature of this architecture allows for independent optimization and replacement of components. For instance, you might start with a smaller, faster language model for prototyping and later upgrade to a more capable but resource-intensive model for production deployment.
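To make this flow concrete, the sketch below traces a single interaction through the pipeline. The component and method names mirror the classes developed later in this article, while the audio source and sink objects are placeholders assumed here purely for illustration.
def run_single_interaction(stt, llm, tts, audio_source, audio_sink):
    """One pass through the voice assistant pipeline (illustrative sketch)."""
    audio = audio_source.record_utterance()                # capture and segment speech (placeholder API)
    stt_result = stt.transcribe_audio(audio_array=audio)   # speech -> text
    if not stt_result['text']:
        return                                             # nothing recognized; keep listening
    reply = llm.generate_response(stt_result['text'])      # understand context and respond
    waveform = tts.synthesize_speech(reply)                # text -> speech
    audio_sink.play(waveform)                              # play the synthesized response (placeholder API)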
Hardware Acceleration Support: Multi-Platform Optimization
Modern voice assistants must efficiently utilize available hardware acceleration across different platforms. Supporting NVIDIA CUDA, AMD ROCm, and Apple Silicon MPS ensures optimal performance regardless of the deployment environment.
Hardware detection and automatic configuration enable seamless deployment across different systems: the assistant identifies whatever acceleration hardware is present and configures each component accordingly, without manual intervention.
The hardware manager component serves as the foundation for multi-platform support. It detects the current platform architecture, identifies available acceleration devices, and selects the optimal configuration for maximum performance. This approach eliminates the need for manual configuration while ensuring that each component utilizes the best available hardware resources.
The platform detection mechanism identifies the operating system, processor architecture, and Python version to ensure compatibility with different hardware acceleration frameworks. This information guides the selection of appropriate device drivers and optimization strategies.
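The rest of this section presents the hardware manager method by method. As a point of reference, the following sketch shows one way the class could be assembled, including a minimal platform detection helper built on the standard library's platform module. The attributes available_devices and optimal_device are used by the methods that follow; the platform_info attribute and helper name are assumptions made here for illustration.
import os
import platform
import torch
class HardwareManager:
    def __init__(self):
        """Detect the platform and acceleration devices, then pick a default."""
        self.platform_info = self._detect_platform()
        self.available_devices = self._detect_devices()
        self.optimal_device = self._select_optimal_device()
    def _detect_platform(self):
        """Identify operating system, processor architecture, and Python version."""
        return {
            'system': platform.system(),            # e.g. 'Linux', 'Darwin', 'Windows'
            'machine': platform.machine(),          # e.g. 'x86_64', 'arm64'
            'python_version': platform.python_version()
        }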
Device detection encompasses multiple acceleration technologies. CUDA detection verifies GPU availability and enumerates device capabilities including memory capacity and compute capability. MPS detection specifically targets Apple Silicon processors and their unified memory architecture. ROCm detection identifies AMD GPU hardware and verifies driver installation.
def _detect_devices(self):
"""Detect available acceleration devices"""
devices = ['cpu']
# Check for CUDA (NVIDIA)
if torch.cuda.is_available():
cuda_count = torch.cuda.device_count()
devices.extend([f'cuda:{i}' for i in range(cuda_count)])
print(f"CUDA devices found: {cuda_count}")
for i in range(cuda_count):
gpu_name = torch.cuda.get_device_name(i)
memory = torch.cuda.get_device_properties(i).total_memory / 1e9
print(f" GPU {i}: {gpu_name} ({memory:.1f}GB)")
# Check for MPS (Apple Silicon)
if hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
devices.append('mps')
print("Apple Silicon MPS acceleration available")
# Check for ROCm (AMD)
if self._check_rocm_available():
devices.append('rocm')
print("AMD ROCm acceleration detected")
return devices
The device enumeration process provides detailed information about each available acceleration option. For CUDA devices, this includes GPU model names, memory capacity, and compute capabilities. This information enables intelligent device selection based on workload requirements and available resources.
ROCm detection requires special handling due to its installation complexity and varying support across different AMD GPU generations. The detection process checks for ROCm installation paths, environment variables, and PyTorch compilation flags to determine availability.
def _check_rocm_available(self):
"""Check if ROCm is available"""
try:
# Check for ROCm installation
rocm_paths = [
'/opt/rocm',
'/usr/local/rocm',
os.path.expanduser('~/rocm')
]
for path in rocm_paths:
if os.path.exists(path):
# Try to import torch with ROCm support
if hasattr(torch.version, 'hip') and torch.version.hip is not None:
return True
# Alternative check using environment variables
if 'ROCM_PATH' in os.environ or 'HIP_PATH' in os.environ:
return True
return False
except Exception:
return False
Device selection follows a priority hierarchy based on performance characteristics and compatibility. CUDA devices receive highest priority due to their mature ecosystem and broad software support. Apple Silicon MPS provides excellent performance for Mac users with unified memory architecture. ROCm offers competitive performance for AMD GPU users, while CPU serves as the universal fallback option.
def _select_optimal_device(self):
"""Select optimal device based on availability and performance"""
# Priority order: CUDA > MPS > ROCm > CPU
if any('cuda' in device for device in self.available_devices):
# Select CUDA device with most memory
if torch.cuda.is_available():
best_gpu = 0
max_memory = 0
for i in range(torch.cuda.device_count()):
memory = torch.cuda.get_device_properties(i).total_memory
if memory > max_memory:
max_memory = memory
best_gpu = i
return f'cuda:{best_gpu}'
elif 'mps' in self.available_devices:
return 'mps'
elif 'rocm' in self.available_devices:
return 'rocm'
else:
return 'cpu'
The device configuration process applies platform-specific optimizations to maximize performance and stability. CUDA configurations enable cuDNN benchmarking and memory management optimizations. MPS configurations set fallback options for unsupported operations. ROCm configurations specify graphics version overrides for compatibility.
def configure_torch_device(self, preferred_device=None):
"""Configure PyTorch device with proper settings"""
device = preferred_device or self.optimal_device
if device.startswith('cuda'):
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False
if torch.cuda.is_available():
torch.cuda.empty_cache()
elif device == 'mps':
# Configure MPS-specific settings
if hasattr(torch.backends.mps, 'is_available') and torch.backends.mps.is_available():
os.environ['PYTORCH_ENABLE_MPS_FALLBACK'] = '1'
elif device == 'rocm':
# Configure ROCm-specific settings
os.environ['HSA_OVERRIDE_GFX_VERSION'] = os.environ.get('HSA_OVERRIDE_GFX_VERSION', '10.3.0')
return device
Memory monitoring capabilities provide insights into resource utilization across different hardware platforms. This information enables dynamic optimization and helps identify potential bottlenecks or resource constraints.
def get_memory_info(self, device=None):
"""Get memory information for specified device"""
device = device or self.optimal_device
if device.startswith('cuda') and torch.cuda.is_available():
gpu_id = int(device.split(':')[1]) if ':' in device else 0
total = torch.cuda.get_device_properties(gpu_id).total_memory / 1e9
allocated = torch.cuda.memory_allocated(gpu_id) / 1e9
cached = torch.cuda.memory_reserved(gpu_id) / 1e9
return {
'total': total,
'allocated': allocated,
'cached': cached,
'free': total - allocated
}
elif device == 'mps':
# MPS memory info is limited
return {
'total': 'Unknown',
'allocated': 'Unknown',
'cached': 'Unknown',
'free': 'Unknown'
}
else:
import psutil
memory = psutil.virtual_memory()
return {
'total': memory.total / 1e9,
'allocated': (memory.total - memory.available) / 1e9,
'cached': 0,
'free': memory.available / 1e9
}
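Assuming the class assembly sketched earlier, the hardware manager can be exercised on its own to verify detection before the other components depend on it:
# Quick sanity check of hardware detection (assumes the HardwareManager
# skeleton sketched earlier in this section).
if __name__ == "__main__":
    hw = HardwareManager()
    print("Available devices:", hw.available_devices)
    print("Optimal device:", hw.optimal_device)
    print("Memory info:", hw.get_memory_info())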
Speech Recognition: Implementing Whisper for Robust STT
OpenAI's Whisper represents a breakthrough in open source speech recognition technology. Unlike traditional ASR systems that require extensive training on domain-specific data, Whisper demonstrates remarkable robustness across languages, accents, and audio conditions due to its training on diverse internet audio.
The enhanced Whisper implementation automatically configures itself for optimal performance across different hardware platforms while maintaining consistent functionality. Device-specific optimizations ensure maximum throughput while preserving transcription accuracy.
import whisper
import torch
import numpy as np
import os
from typing import Optional, Dict, Any
import warnings
class EnhancedWhisperSTT:
def __init__(self, model_size="base", device="auto", hardware_manager=None):
"""
Initialize enhanced Whisper speech-to-text engine with multi-platform support
Args:
model_size: Size of Whisper model (tiny, base, small, medium, large)
device: Computing device (auto, cpu, cuda, mps, rocm)
hardware_manager: HardwareManager instance for device configuration
"""
self.hardware_manager = hardware_manager or HardwareManager()
if device == "auto":
self.device = self.hardware_manager.configure_torch_device()
else:
self.device = self.hardware_manager.configure_torch_device(device)
print(f"Loading Whisper {model_size} model on {self.device}")
# Configure device-specific settings
self._configure_device_settings()
# Load model with device-specific optimizations
try:
self.model = whisper.load_model(model_size, device=self._get_whisper_device())
print(f"Whisper model loaded successfully on {self.device}")
except Exception as e:
print(f"Error loading Whisper model on {self.device}: {e}")
print("Falling back to CPU...")
self.device = "cpu"
self.model = whisper.load_model(model_size, device="cpu")
# Model configuration
self.model_size = model_size
self.sample_rate = 16000
Device-specific configuration optimizes performance characteristics for each hardware platform. CUDA configurations enable cuDNN benchmarking for faster convolution operations. MPS configurations set fallback options for operations not yet supported by Apple's Metal Performance Shaders. ROCm configurations specify graphics version overrides for AMD GPU compatibility.
def _configure_device_settings(self):
"""Configure device-specific settings for optimal performance"""
if self.device.startswith('cuda'):
# CUDA-specific optimizations
torch.backends.cudnn.benchmark = True
if torch.cuda.is_available():
torch.cuda.empty_cache()
elif self.device == 'mps':
# MPS-specific optimizations
# Disable some operations that might not be supported
os.environ['PYTORCH_ENABLE_MPS_FALLBACK'] = '1'
elif self.device == 'rocm':
# ROCm-specific optimizations
os.environ['HSA_OVERRIDE_GFX_VERSION'] = os.environ.get('HSA_OVERRIDE_GFX_VERSION', '10.3.0')
Whisper device compatibility requires careful handling due to varying support across different acceleration frameworks. While CUDA enjoys full support, MPS and ROCm may require CPU fallbacks for certain operations to ensure stability and compatibility.
def _get_whisper_device(self):
"""Get device string compatible with Whisper"""
if self.device.startswith('cuda'):
return self.device
elif self.device == 'mps':
# Whisper may not directly support MPS, use CPU as fallback
return "cpu"
elif self.device == 'rocm':
# ROCm support depends on PyTorch build
return "cpu" # Fallback to CPU for compatibility
else:
return "cpu"
The enhanced transcription method incorporates sophisticated error handling and performance optimization. Audio preprocessing ensures compatibility with Whisper's input requirements, while confidence scoring provides feedback about transcription quality.
def transcribe_audio(self, audio_path=None, audio_array=None, language=None,
temperature=0.0, best_of=5):
"""
Enhanced transcribe audio to text using Whisper with multi-platform support
Args:
audio_path: Path to audio file
audio_array: Numpy array containing audio data
language: Target language code (optional)
temperature: Sampling temperature for transcription
best_of: Number of candidates to generate
Returns:
dict: Transcription results with text and metadata
"""
try:
# Prepare transcription options
options = {
'language': language,
'temperature': temperature,
'best_of': best_of,
'fp16': self._use_fp16()
}
if audio_path:
result = self.model.transcribe(audio_path, **options)
elif audio_array is not None:
# Ensure audio is in correct format for Whisper
audio_array = self._preprocess_audio(audio_array)
result = self.model.transcribe(audio_array, **options)
else:
raise ValueError("Either audio_path or audio_array must be provided")
return {
'text': result['text'].strip(),
'language': result['language'],
'segments': result['segments'],
'confidence': self._calculate_confidence(result['segments']),
'processing_device': self.device
}
except Exception as e:
print(f"Error in speech recognition: {e}")
return {
'text': '',
'language': 'unknown',
'segments': [],
'confidence': 0.0,
'error': str(e)
}
Precision selection balances performance and accuracy based on hardware capabilities. FP16 precision provides significant speedup on modern GPUs while maintaining acceptable accuracy for most applications. Conservative fallbacks ensure stability on platforms with limited FP16 support.
def _use_fp16(self):
"""Determine if FP16 should be used based on device capabilities"""
if self.device.startswith('cuda'):
return torch.cuda.is_available()
elif self.device == 'mps':
# MPS supports FP16 but may have compatibility issues
return False # Conservative approach
elif self.device == 'rocm':
return False # Conservative approach for ROCm
else:
return False
Audio preprocessing ensures optimal input quality for Whisper transcription. Normalization prevents clipping and ensures consistent amplitude levels, while format conversion handles different input data types seamlessly.
def _preprocess_audio(self, audio_array):
"""Preprocess audio array for optimal transcription"""
# Ensure correct data type
if audio_array.dtype != np.float32:
audio_array = audio_array.astype(np.float32)
# Normalize audio to [-1, 1] range
max_val = np.max(np.abs(audio_array))
if max_val > 1.0:
audio_array = audio_array / max_val
# Ensure correct sample rate (Whisper expects 16kHz)
# Note: This is a simplified approach; proper resampling would be better
return audio_array
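When the incoming audio is not already at 16 kHz, a proper resampling step should precede transcription. A minimal helper is sketched below; it assumes SciPy is installed, which is an extra dependency beyond the imports shown for this class.
# Optional resampling helper (assumes SciPy is available; not part of the
# class as shown above).
from math import gcd
from scipy.signal import resample_poly
def resample_to_whisper_rate(audio_array, original_rate, target_rate=16000):
    """Resample a float32 waveform to Whisper's expected 16 kHz."""
    if original_rate == target_rate:
        return audio_array
    factor = gcd(int(original_rate), int(target_rate))
    return resample_poly(audio_array, target_rate // factor, original_rate // factor)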
Confidence calculation provides quantitative feedback about transcription quality. This information enables the system to request clarification when recognition confidence falls below acceptable thresholds, improving overall user experience.
def _calculate_confidence(self, segments):
"""Calculate average confidence score from segments"""
if not segments:
return 0.0
total_confidence = sum(segment.get('avg_logprob', 0) for segment in segments)
avg_logprob = total_confidence / len(segments)
# Convert log probability to confidence score (0-1)
confidence = max(0.0, min(1.0, (avg_logprob + 1) / 2))
return confidence
Device information reporting enables monitoring and debugging of speech recognition performance across different hardware platforms. This data helps identify optimization opportunities and troubleshoot platform-specific issues.
def get_device_info(self):
"""Get information about current device configuration"""
info = {
'device': self.device,
'model_size': self.model_size,
'fp16_enabled': self._use_fp16()
}
if self.hardware_manager:
memory_info = self.hardware_manager.get_memory_info(self.device)
info['memory'] = memory_info
return info
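A minimal usage sketch for the speech recognizer follows; the audio file name is illustrative.
# Example usage (the audio file path is illustrative).
stt = EnhancedWhisperSTT(model_size="base", device="auto")
result = stt.transcribe_audio(audio_path="sample_utterance.wav")
print("Transcript:", result['text'])
print("Confidence:", f"{result['confidence']:.2f}")
print("Device info:", stt.get_device_info())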
Language Model Integration: Leveraging HuggingFace Transformers
The language model serves as the brain of our voice assistant, processing user queries and generating contextually appropriate responses. HuggingFace Transformers provides access to thousands of pre-trained models, from lightweight options suitable for edge deployment to powerful models rivaling commercial offerings.
The enhanced language model implementation provides robust support across different hardware platforms while maintaining conversation quality and performance. Automatic device configuration and memory management ensure optimal resource utilization regardless of the deployment environment.
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import torch
from typing import List, Dict, Optional, Union
import gc
class EnhancedConversationalLLM:
def __init__(self, model_name="microsoft/DialoGPT-medium", device="auto",
max_length=512, hardware_manager=None):
"""
Initialize enhanced conversational language model with multi-platform support
Args:
model_name: HuggingFace model identifier
device: Computing device (auto, cpu, cuda, mps, rocm)
max_length: Maximum response length in tokens
hardware_manager: HardwareManager instance for device configuration
"""
self.hardware_manager = hardware_manager or HardwareManager()
if device == "auto":
self.device = self.hardware_manager.configure_torch_device()
else:
self.device = self.hardware_manager.configure_torch_device(device)
print(f"Loading language model {model_name} on {self.device}")
# Configure model loading parameters based on device
self.model_config = self._get_model_config()
try:
# Load tokenizer
self.tokenizer = AutoTokenizer.from_pretrained(
model_name,
trust_remote_code=True
)
# Load model with device-specific optimizations
self.model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=self.model_config['dtype'],
device_map=self.model_config['device_map'],
trust_remote_code=True,
low_cpu_mem_usage=True
)
# Move model to device if not using device_map
if self.model_config['device_map'] is None:
self.model = self.model.to(self.device)
print(f"Language model loaded successfully on {self.device}")
except Exception as e:
print(f"Error loading model on {self.device}: {e}")
print("Falling back to CPU...")
self.device = "cpu"
self.model_config = self._get_model_config()
self._load_model_cpu_fallback(model_name)
Model configuration adapts to hardware capabilities and constraints. CUDA configurations enable FP16 precision and automatic device mapping for multi-GPU systems. MPS configurations use FP32 precision for stability, while ROCm configurations balance performance and compatibility.
# Configure tokenizer
if self.tokenizer.pad_token is None:
self.tokenizer.pad_token = self.tokenizer.eos_token
self.max_length = max_length
self.conversation_history = []
self.model_name = model_name
# Performance tracking
self.generation_times = []
def _get_model_config(self):
"""Get model configuration based on device capabilities"""
config = {
'dtype': torch.float32,
'device_map': None,
'use_cache': True
}
if self.device.startswith('cuda'):
config['dtype'] = torch.float16
config['device_map'] = "auto"
elif self.device == 'mps':
# MPS has limited FP16 support, use FP32 for stability
config['dtype'] = torch.float32
config['device_map'] = None
elif self.device == 'rocm':
# ROCm configuration
config['dtype'] = torch.float16
config['device_map'] = None
else: # CPU
config['dtype'] = torch.float32
config['device_map'] = None
return config
CPU fallback handling ensures system reliability when primary acceleration methods fail. The fallback process maintains full functionality while providing clear feedback about the configuration change.
def _load_model_cpu_fallback(self, model_name):
"""Load model with CPU fallback configuration"""
self.model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.float32,
device_map=None,
trust_remote_code=True
)
self.model = self.model.to("cpu")
Response generation incorporates sophisticated context management and device-specific optimizations. The system maintains conversation history while applying memory management techniques to prevent resource exhaustion during extended interactions.
def generate_response(self, user_input: str, system_prompt: Optional[str] = None,
temperature: float = 0.7, max_new_tokens: int = 150) -> str:
"""
Generate response to user input with enhanced multi-platform support
Args:
user_input: User's message
system_prompt: Optional system instruction
temperature: Sampling temperature
max_new_tokens: Maximum new tokens to generate
Returns:
str: Generated response
"""
import time
start_time = time.time()
try:
# Prepare conversation context
if system_prompt and not self.conversation_history:
self.conversation_history.append(f"System: {system_prompt}")
# Add user input to history
self.conversation_history.append(f"User: {user_input}")
# Create input text with conversation context
context = self._build_context()
context += "\nAssistant:"
# Tokenize input with device-specific handling
inputs = self._tokenize_input(context)
# Generate response with device-optimized settings
response = self._generate_with_device_optimization(
inputs, temperature, max_new_tokens
)
# Clean and format response
response = self._clean_response(response, context)
# Add to conversation history
self.conversation_history.append(f"Assistant: {response}")
# Track performance
generation_time = time.time() - start_time
self.generation_times.append(generation_time)
return response
except Exception as e:
print(f"Error generating response: {e}")
return "I apologize, but I'm having trouble processing your request right now."
Context building manages conversation memory efficiently by maintaining recent exchanges while preventing unbounded memory growth. This approach ensures coherent responses while maintaining system stability during extended conversations.
def _build_context(self):
"""Build conversation context with memory management"""
# Keep last 10 exchanges to manage memory
recent_history = self.conversation_history[-20:] # 10 exchanges = 20 messages
return "\n".join(recent_history)
Input tokenization handles device-specific requirements and optimizations. The process ensures that tokenized input reaches the appropriate device while managing memory allocation efficiently.
def _tokenize_input(self, context):
"""Tokenize input with device-specific optimizations"""
inputs = self.tokenizer.encode(
context,
return_tensors="pt",
truncation=True,
max_length=self.max_length
)
# Move to appropriate device
if self.device != "cpu":
inputs = inputs.to(self.device)
return inputs
Device-optimized generation applies platform-specific acceleration techniques while maintaining consistent output quality. CUDA configurations leverage automatic mixed precision for faster inference, while other platforms use optimized settings for their respective architectures.
def _generate_with_device_optimization(self, inputs, temperature, max_new_tokens):
"""Generate response with device-specific optimizations"""
generation_kwargs = {
'max_length': inputs.shape[1] + max_new_tokens,
'num_return_sequences': 1,
'temperature': temperature,
'do_sample': True,
'top_p': 0.9,
'pad_token_id': self.tokenizer.eos_token_id,
'attention_mask': torch.ones_like(inputs)
}
# Device-specific optimizations
if self.device.startswith('cuda'):
generation_kwargs['use_cache'] = True
elif self.device == 'mps':
# MPS-specific adjustments
generation_kwargs['use_cache'] = True
elif self.device == 'rocm':
# ROCm-specific adjustments
generation_kwargs['use_cache'] = True
# Generate with memory management
with torch.no_grad():
if self.device.startswith('cuda'):
with torch.cuda.amp.autocast(enabled=self.model_config['dtype'] == torch.float16):
outputs = self.model.generate(inputs, **generation_kwargs)
else:
outputs = self.model.generate(inputs, **generation_kwargs)
# Decode response
response = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
# Clean up GPU memory if needed
if self.device.startswith('cuda'):
torch.cuda.empty_cache()
return response
Response cleaning ensures that generated text is well-formatted and appropriate for speech synthesis. The cleaning process removes artifacts, handles incomplete sentences, and ensures proper punctuation for natural-sounding speech output.
def _clean_response(self, response: str, context: str) -> str:
"""Clean and format the generated response"""
# Extract only the new response part
response = response[len(context):].strip()
# Remove common artifacts
response = response.replace("User:", "").replace("Assistant:", "")
# Split on newlines and take first complete sentence
lines = response.split('\n')
cleaned_response = lines[0].strip()
# Ensure response ends properly
if cleaned_response and not cleaned_response.endswith(('.', '!', '?')):
# Find last complete sentence
for punct in ['.', '!', '?']:
if punct in cleaned_response:
cleaned_response = cleaned_response[:cleaned_response.rfind(punct) + 1]
break
return cleaned_response if cleaned_response else "I understand."
Memory management includes both conversation history clearing and device-specific cache management. This ensures that the system can recover from memory pressure situations and maintain optimal performance over extended periods.
def clear_history(self):
"""Clear conversation history and free memory"""
self.conversation_history = []
# Force garbage collection
gc.collect()
# Clear device cache if applicable
if self.device.startswith('cuda'):
torch.cuda.empty_cache()
Performance statistics tracking enables monitoring and optimization of language model performance across different hardware platforms. This data helps identify bottlenecks and guide system tuning decisions.
def get_performance_stats(self):
"""Get performance statistics"""
if not self.generation_times:
return {}
return {
'avg_generation_time': sum(self.generation_times) / len(self.generation_times),
'min_generation_time': min(self.generation_times),
'max_generation_time': max(self.generation_times),
'total_generations': len(self.generation_times),
'device': self.device,
'model_name': self.model_name
}
def get_memory_usage(self):
"""Get current memory usage"""
if self.hardware_manager:
return self.hardware_manager.get_memory_info(self.device)
return {}
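A short usage sketch of the conversational model follows; it assumes the model weights can be downloaded from the HuggingFace Hub on first run.
# Example usage (downloads the model weights on first run).
llm = EnhancedConversationalLLM(model_name="microsoft/DialoGPT-medium", device="auto")
reply = llm.generate_response(
    "What can you help me with?",
    system_prompt="You are a concise, friendly voice assistant."
)
print("Assistant:", reply)
print("Performance:", llm.get_performance_stats())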
Text-to-Speech Synthesis: Creating Natural Voice Output
Converting text responses back to speech requires careful attention to naturalness, clarity, and emotional expression. The enhanced TTS implementation provides consistent voice synthesis across different hardware platforms while optimizing performance for each specific environment.
Modern neural TTS systems generate human-like speech with appropriate intonation, emphasis, and emotional expression. The implementation below applies device-specific optimizations while keeping output quality consistent across platforms.
import torch
import torchaudio
import numpy as np
from typing import Optional
import warnings
import os
# Suppress TTS warnings for cleaner output
warnings.filterwarnings("ignore", category=UserWarning)
class EnhancedNeuralTTS:
def __init__(self, model_name="tts_models/en/ljspeech/tacotron2-DDC",
device="auto", hardware_manager=None):
"""
Initialize enhanced neural text-to-speech system with multi-platform support
Args:
model_name: Coqui TTS model identifier
device: Computing device (auto, cpu, cuda, mps, rocm)
hardware_manager: HardwareManager instance for device configuration
"""
self.hardware_manager = hardware_manager or HardwareManager()
if device == "auto":
self.device = self.hardware_manager.configure_torch_device()
else:
self.device = self.hardware_manager.configure_torch_device(device)
print(f"Loading TTS model {model_name} on {self.device}")
# Configure TTS with device-specific settings
self._configure_tts_device()
try:
from TTS.api import TTS
# Initialize TTS with device support
self.tts = TTS(
model_name=model_name,
progress_bar=False,
gpu=self._use_gpu_acceleration()
)
print(f"TTS model loaded successfully on {self.device}")
except Exception as e:
print(f"Error loading TTS model: {e}")
print("Falling back to basic TTS configuration...")
self._setup_fallback_tts()
TTS device configuration applies platform-specific optimizations for speech synthesis. CUDA configurations enable cuDNN optimizations for faster convolution operations. MPS and ROCm configurations set appropriate fallback options for operations that may not be fully supported.
        # Get the model's output sample rate (fall back to 22.05 kHz if unavailable)
        if self.tts is not None and hasattr(self.tts, 'synthesizer') and hasattr(self.tts.synthesizer, 'output_sample_rate'):
            self.sample_rate = self.tts.synthesizer.output_sample_rate
        else:
            self.sample_rate = 22050
self.model_name = model_name
def _configure_tts_device(self):
"""Configure device-specific settings for TTS"""
if self.device.startswith('cuda'):
# CUDA-specific optimizations
torch.backends.cudnn.benchmark = True
if torch.cuda.is_available():
torch.cuda.empty_cache()
elif self.device == 'mps':
# MPS-specific optimizations
os.environ['PYTORCH_ENABLE_MPS_FALLBACK'] = '1'
elif self.device == 'rocm':
# ROCm-specific optimizations
os.environ['HSA_OVERRIDE_GFX_VERSION'] = os.environ.get('HSA_OVERRIDE_GFX_VERSION', '10.3.0')
GPU acceleration determination considers the capabilities and limitations of different TTS libraries across various hardware platforms. While CUDA enjoys broad support, MPS and ROCm may require CPU fallbacks for certain TTS models.
def _use_gpu_acceleration(self):
"""Determine if GPU acceleration should be used for TTS"""
if self.device.startswith('cuda'):
return torch.cuda.is_available()
elif self.device == 'mps':
# TTS library may not support MPS directly
return False
elif self.device == 'rocm':
# TTS library may not support ROCm directly
return False
else:
return False
Fallback TTS setup ensures system reliability when primary TTS models fail to load. The fallback process attempts simpler models before ultimately disabling voice output if no TTS capability can be established.
def _setup_fallback_tts(self):
"""Setup fallback TTS configuration"""
try:
from TTS.api import TTS
# Try with a simpler model
self.tts = TTS(model_name="tts_models/en/ljspeech/glow-tts", progress_bar=False, gpu=False)
except Exception:
# Ultimate fallback - this would need to be implemented with a different TTS library
print("Warning: Could not initialize any TTS model. Voice output will be disabled.")
self.tts = None
Speech synthesis incorporates advanced text preprocessing and device-specific optimizations to produce high-quality audio output. The process handles various input formats and applies normalization techniques for consistent results.
def synthesize_speech(self, text: str, output_path: Optional[str] = None,
speaker_idx: Optional[int] = None) -> np.ndarray:
"""
Convert text to speech with enhanced multi-platform support
Args:
text: Text to synthesize
output_path: Optional path to save audio file
speaker_idx: Optional speaker index for multi-speaker models
Returns:
np.ndarray: Audio waveform
"""
if self.tts is None:
print("TTS not available, returning empty audio")
return np.array([])
try:
# Preprocess text for better synthesis
processed_text = self._preprocess_text(text)
if not processed_text.strip():
return np.array([])
# Prepare synthesis arguments
synthesis_kwargs = {'text': processed_text}
if output_path:
synthesis_kwargs['file_path'] = output_path
if speaker_idx is not None:
synthesis_kwargs['speaker_idx'] = speaker_idx
# Generate speech with device-specific optimizations
wav = self._synthesize_with_device_optimization(**synthesis_kwargs)
# Process output based on device
audio_array = self._process_synthesis_output(wav, output_path)
return audio_array
except Exception as e:
print(f"Error in speech synthesis: {e}")
return np.array([])
Device-optimized synthesis applies platform-specific acceleration techniques while maintaining audio quality. CUDA synthesis can leverage automatic mixed precision where supported, while other platforms use optimized settings for their respective architectures.
def _synthesize_with_device_optimization(self, **kwargs):
"""Synthesize speech with device-specific optimizations"""
if self.device.startswith('cuda') and torch.cuda.is_available():
# CUDA optimizations
with torch.cuda.amp.autocast(enabled=False): # TTS may not support autocast
wav = self.tts.tts(**kwargs)
else:
# CPU/MPS/ROCm synthesis
wav = self.tts.tts(**kwargs)
return wav
Synthesis output processing ensures consistent audio format regardless of the underlying TTS implementation or hardware platform. The process handles various output types and applies necessary conversions for compatibility.
def _process_synthesis_output(self, wav, output_path):
"""Process synthesis output into consistent format"""
if output_path and os.path.exists(output_path):
# Load the saved file to return as array
try:
waveform, sample_rate = torchaudio.load(output_path)
return waveform.numpy().flatten()
except Exception:
# Fallback to direct wav processing
pass
# Process direct wav output
if isinstance(wav, torch.Tensor):
if self.device.startswith('cuda'):
wav = wav.cpu()
return wav.numpy().flatten()
elif isinstance(wav, np.ndarray):
return wav.flatten()
else:
return np.array(wav).flatten()
Enhanced text preprocessing improves speech synthesis quality by handling abbreviations, numbers, and special characters that might confuse TTS models. The preprocessing stage ensures that text is optimized for natural-sounding speech output.
def _preprocess_text(self, text: str) -> str:
"""Enhanced text preprocessing for better TTS output"""
# Remove or replace problematic characters
text = text.replace('\n', ' ').replace('\t', ' ')
# Handle common abbreviations
abbreviations = {
'Dr.': 'Doctor',
'Mr.': 'Mister',
'Mrs.': 'Missus',
'Ms.': 'Miss',
'Prof.': 'Professor',
'etc.': 'etcetera',
'vs.': 'versus',
'e.g.': 'for example',
'i.e.': 'that is',
'AI': 'A I',
'ML': 'M L',
'GPU': 'G P U',
'CPU': 'C P U',
'API': 'A P I'
}
for abbrev, expansion in abbreviations.items():
text = text.replace(abbrev, expansion)
# Handle numbers (enhanced implementation)
import re
# Replace simple numbers with words (0-100)
number_words = {
'0': 'zero', '1': 'one', '2': 'two', '3': 'three', '4': 'four', '5': 'five',
'6': 'six', '7': 'seven', '8': 'eight', '9': 'nine', '10': 'ten',
'11': 'eleven', '12': 'twelve', '13': 'thirteen', '14': 'fourteen',
'15': 'fifteen', '16': 'sixteen', '17': 'seventeen', '18': 'eighteen',
'19': 'nineteen', '20': 'twenty', '30': 'thirty', '40': 'forty',
'50': 'fifty', '60': 'sixty', '70': 'seventy', '80': 'eighty',
'90': 'ninety', '100': 'one hundred'
}
for num, word in number_words.items():
text = re.sub(r'\b' + num + r'\b', word, text)
# Handle URLs and email addresses
text = re.sub(r'http[s]?://\S+', 'web link', text)
text = re.sub(r'\S+@\S+\.\S+', 'email address', text)
# Clean up multiple spaces
text = re.sub(r'\s+', ' ', text).strip()
return text
Audio file saving incorporates enhanced format support and error handling. The process ensures compatibility across different platforms while providing fallback options when primary saving methods fail.
def save_audio(self, audio_array: np.ndarray, filename: str,
sample_rate: Optional[int] = None):
"""Save audio array to file with enhanced format support"""
if sample_rate is None:
sample_rate = self.sample_rate
# Ensure audio is in correct format
if audio_array.dtype != np.float32:
audio_array = audio_array.astype(np.float32)
# Normalize audio
max_val = np.max(np.abs(audio_array))
if max_val > 1.0:
audio_array = audio_array / max_val
# Save using torchaudio with device handling
try:
tensor_audio = torch.from_numpy(audio_array).unsqueeze(0)
if self.device.startswith('cuda'):
tensor_audio = tensor_audio.cpu() # Ensure CPU for saving
torchaudio.save(filename, tensor_audio, sample_rate)
except Exception as e:
print(f"Error saving audio: {e}")
# Fallback to scipy if available
try:
from scipy.io import wavfile
# Convert to int16 for scipy
audio_int16 = (audio_array * 32767).astype(np.int16)
wavfile.write(filename, sample_rate, audio_int16)
except ImportError:
print("Could not save audio file - no suitable library available")
Device information reporting provides insights into TTS configuration and performance characteristics. This information enables monitoring and optimization of speech synthesis across different hardware platforms.
def get_device_info(self):
"""Get information about current TTS device configuration"""
info = {
'device': self.device,
'model_name': self.model_name,
'sample_rate': self.sample_rate,
'gpu_acceleration': self._use_gpu_acceleration()
}
if self.hardware_manager:
memory_info = self.hardware_manager.get_memory_info(self.device)
info['memory'] = memory_info
return info
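A brief usage sketch of the synthesizer follows; the output file name is illustrative.
# Example usage (the output file name is illustrative).
tts = EnhancedNeuralTTS(device="auto")
audio = tts.synthesize_speech("Hello! Your voice assistant is ready.")
if audio.size > 0:
    tts.save_audio(audio, "greeting.wav")
    print(f"Saved greeting.wav at {tts.sample_rate} Hz")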
Orchestrating Conversations with LangChain
LangChain provides powerful abstractions for building complex conversational applications that go beyond simple question-and-answer interactions. It enables sophisticated conversation management, memory systems, and integration with external tools and knowledge sources.
The conversation manager handles dialogue state, maintains context across multiple exchanges, and provides memory management capabilities. This component ensures that conversations remain coherent and contextually relevant throughout extended interactions.
from langchain.memory import ConversationBufferWindowMemory, ConversationSummaryMemory
from langchain.schema import BaseMessage, HumanMessage, AIMessage
from langchain.callbacks.base import BaseCallbackHandler
from typing import Any, Dict, List, Optional
import json
import datetime
class ConversationManager:
    def __init__(self, window_size: int = 10, use_summary: bool = False, summary_llm=None):
        """
        Initialize conversation management system
        Args:
            window_size: Number of recent exchanges to keep in memory
            use_summary: Whether to use conversation summarization
            summary_llm: LangChain-compatible LLM used to produce summaries
                         (required when use_summary is True)
        """
        self.window_size = window_size
        self.use_summary = use_summary
        # Initialize memory system
        if use_summary:
            # Summary memory compresses older turns into a running summary,
            # which requires a summarization model
            self.memory = ConversationSummaryMemory(
                llm=summary_llm,
                return_messages=True
            )
        else:
            self.memory = ConversationBufferWindowMemory(
                k=window_size,
                return_messages=True
            )
# Conversation metadata
self.conversation_id = self._generate_conversation_id()
self.start_time = datetime.datetime.now()
self.turn_count = 0
Memory system selection balances between detailed history retention and computational efficiency. Window-based memory maintains recent exchanges in full detail, while summary-based memory compresses longer conversations into concise summaries that preserve important context.
def add_exchange(self, user_input: str, assistant_response: str, metadata: Optional[Dict] = None):
"""
Add a conversation exchange to memory
Args:
user_input: User's message
assistant_response: Assistant's response
metadata: Optional metadata about the exchange
"""
# Add to LangChain memory
self.memory.chat_memory.add_user_message(user_input)
self.memory.chat_memory.add_ai_message(assistant_response)
# Update conversation metadata
self.turn_count += 1
# Store additional metadata if provided
if metadata:
self._store_metadata(metadata)
Context retrieval provides formatted conversation history for language model consumption. The formatting process ensures that context is presented in a consistent manner that maximizes language model comprehension and response quality.
def get_conversation_context(self, max_tokens: Optional[int] = None) -> str:
"""
Get formatted conversation context for language model
Args:
max_tokens: Maximum tokens to include in context
Returns:
str: Formatted conversation history
"""
messages = self.memory.chat_memory.messages
if not messages:
return ""
# Format messages for context
context_parts = []
token_count = 0
for message in reversed(messages):
if isinstance(message, HumanMessage):
formatted = f"User: {message.content}"
elif isinstance(message, AIMessage):
formatted = f"Assistant: {message.content}"
else:
continue
# Rough token estimation (4 chars per token)
estimated_tokens = len(formatted) // 4
if max_tokens and token_count + estimated_tokens > max_tokens:
break
context_parts.append(formatted)
token_count += estimated_tokens
# Reverse to get chronological order
context_parts.reverse()
return "\n".join(context_parts)
Recent context extraction provides access to the most recent conversation exchanges for applications that need detailed information about immediate dialogue history. This capability supports features like conversation analysis and context-aware responses.
def get_recent_context(self, num_exchanges: int = 3) -> List[Dict]:
"""
Get recent conversation exchanges
Args:
num_exchanges: Number of recent exchanges to retrieve
Returns:
List[Dict]: Recent conversation exchanges
"""
messages = self.memory.chat_memory.messages
exchanges = []
# Group messages into exchanges (user + assistant pairs)
for i in range(0, len(messages) - 1, 2):
if i + 1 < len(messages):
user_msg = messages[i]
ai_msg = messages[i + 1]
if isinstance(user_msg, HumanMessage) and isinstance(ai_msg, AIMessage):
exchanges.append({
'user': user_msg.content,
'assistant': ai_msg.content,
'timestamp': getattr(user_msg, 'timestamp', None)
})
# Return most recent exchanges
return exchanges[-num_exchanges:] if exchanges else []
Memory management includes both conversation clearing and metadata tracking. The system maintains conversation statistics and provides summary information for analytics and debugging purposes.
def clear_memory(self):
"""Clear conversation memory"""
self.memory.clear()
self.turn_count = 0
self.start_time = datetime.datetime.now()
self.conversation_id = self._generate_conversation_id()
def get_conversation_summary(self) -> Dict:
"""Get summary of current conversation"""
return {
'conversation_id': self.conversation_id,
'start_time': self.start_time.isoformat(),
'duration_minutes': (datetime.datetime.now() - self.start_time).total_seconds() / 60,
'turn_count': self.turn_count,
'message_count': len(self.memory.chat_memory.messages)
}
def _generate_conversation_id(self) -> str:
"""Generate unique conversation identifier"""
import uuid
return str(uuid.uuid4())[:8]
def _store_metadata(self, metadata: Dict):
"""Store conversation metadata (placeholder for future enhancement)"""
# This could be extended to store metadata in a database
# or file system for conversation analytics
pass
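A minimal usage sketch shows how exchanges accumulate into retrievable context:
# Example usage of the conversation manager.
conversation = ConversationManager(window_size=10)
conversation.add_exchange(
    "What's the weather like?",
    "I don't have live weather data, but I can help you find a forecast."
)
conversation.add_exchange("Thanks anyway.", "You're welcome!")
print(conversation.get_conversation_context(max_tokens=200))
print(conversation.get_conversation_summary())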
Real-Time Audio Processing: Handling Streaming Audio
Real-time audio processing presents unique challenges in voice assistant implementation. The system must handle continuous audio streams, detect speech boundaries, and process audio chunks efficiently while maintaining low latency for natural conversation flow.
The audio processor manages continuous audio capture, voice activity detection, and speech segmentation. It operates in real-time while maintaining low latency and providing reliable speech boundary detection across various acoustic conditions.
import pyaudio
import numpy as np
import threading
import queue
import time
from collections import deque
import webrtcvad
class RealTimeAudioProcessor:
def __init__(self, sample_rate=16000, chunk_size=1024, channels=1):
"""
Initialize real-time audio processing system
Args:
sample_rate: Audio sample rate in Hz
chunk_size: Audio chunk size for processing
channels: Number of audio channels (1 for mono)
"""
self.sample_rate = sample_rate
self.chunk_size = chunk_size
self.channels = channels
self.format = pyaudio.paInt16
# Audio buffers and queues
self.audio_queue = queue.Queue()
self.recording_buffer = deque(maxlen=100) # Keep last 100 chunks
# Voice activity detection
self.vad = webrtcvad.Vad(2) # Aggressiveness level 0-3
# Processing state
self.is_recording = False
self.is_processing = False
self.speech_detected = False
self.silence_threshold = 30 # Chunks of silence before stopping
self.silence_counter = 0
# Initialize PyAudio
self.audio = pyaudio.PyAudio()
# Threading
self.audio_thread = None
self.processing_thread = None
self.stop_event = threading.Event()
Starting the listener launches dedicated capture and processing threads so that audio streaming never blocks other system components. This separation balances latency and computational load while keeping speech detection responsive.
def start_listening(self):
"""Start continuous audio listening"""
if self.is_recording:
return
self.is_recording = True
self.stop_event.clear()
# Start audio capture thread
self.audio_thread = threading.Thread(target=self._audio_capture_loop)
self.audio_thread.daemon = True
self.audio_thread.start()
# Start processing thread
self.processing_thread = threading.Thread(target=self._audio_processing_loop)
self.processing_thread.daemon = True
self.processing_thread.start()
print("Started listening for audio input...")
Audio capture operates in a dedicated thread to ensure continuous operation without blocking other system components. The capture loop handles audio streaming, buffering, and initial preprocessing for downstream analysis.
def _audio_capture_loop(self):
"""Main audio capture loop"""
try:
# Open audio stream
stream = self.audio.open(
format=self.format,
channels=self.channels,
rate=self.sample_rate,
input=True,
frames_per_buffer=self.chunk_size
)
print(f"Audio stream opened: {self.sample_rate}Hz, {self.chunk_size} samples/chunk")
while self.is_recording and not self.stop_event.is_set():
try:
# Read audio data
data = stream.read(self.chunk_size, exception_on_overflow=False)
# Convert to numpy array
audio_chunk = np.frombuffer(data, dtype=np.int16)
# Add to processing queue
if not self.audio_queue.full():
self.audio_queue.put(audio_chunk)
except Exception as e:
print(f"Error reading audio: {e}")
break
# Clean up
stream.stop_stream()
stream.close()
except Exception as e:
print(f"Error in audio capture: {e}")
Audio processing operates independently from capture to prevent blocking and ensure real-time performance. The processing loop handles voice activity detection, speech segmentation, and utterance completion detection.
def _audio_processing_loop(self):
"""Main audio processing loop"""
while self.is_recording and not self.stop_event.is_set():
try:
# Get audio chunk with timeout
audio_chunk = self.audio_queue.get(timeout=0.1)
# Process audio chunk
self._process_audio_chunk(audio_chunk)
except queue.Empty:
continue
except Exception as e:
print(f"Error processing audio: {e}")
Voice activity detection uses WebRTC VAD for robust speech detection across various acoustic conditions. The system handles different frame sizes and provides fallback detection methods for enhanced reliability.
def _process_audio_chunk(self, audio_chunk):
"""Process individual audio chunk"""
# Add to recording buffer
self.recording_buffer.append(audio_chunk)
# Voice activity detection
is_speech = self._detect_speech(audio_chunk)
if is_speech:
if not self.speech_detected:
print("Speech detected, starting recording...")
self.speech_detected = True
self.silence_counter = 0
else:
if self.speech_detected:
self.silence_counter += 1
# Check if we've had enough silence to stop recording
if self.silence_counter >= self.silence_threshold:
print("Speech ended, processing audio...")
self._process_complete_utterance()
self.speech_detected = False
self.silence_counter = 0
def _detect_speech(self, audio_chunk):
"""Detect speech in audio chunk using WebRTC VAD"""
try:
# Convert to bytes for VAD
audio_bytes = audio_chunk.tobytes()
# WebRTC VAD requires specific frame sizes
# For 16kHz: 160, 320, or 480 samples (10ms, 20ms, 30ms)
frame_size = 320 # 20ms at 16kHz
if len(audio_chunk) >= frame_size:
frame = audio_chunk[:frame_size].tobytes()
return self.vad.is_speech(frame, self.sample_rate)
return False
except Exception as e:
# Fallback to simple energy-based detection
return self._simple_speech_detection(audio_chunk)
Fallback speech detection provides reliability when WebRTC VAD encounters issues or unsupported audio formats. The energy-based approach offers basic speech detection capabilities for system resilience.
def _simple_speech_detection(self, audio_chunk):
"""Simple energy-based speech detection fallback"""
# Calculate RMS energy
rms = np.sqrt(np.mean(audio_chunk.astype(np.float32) ** 2))
# Simple threshold-based detection
return rms > 500 # Adjust threshold based on your environment
Utterance completion processing combines audio chunks into complete speech segments for transcription. The system manages buffer contents and triggers callbacks for downstream processing components.
def _process_complete_utterance(self):
"""Process complete speech utterance"""
if len(self.recording_buffer) < 5: # Too short to be meaningful
return
# Combine audio chunks
complete_audio = np.concatenate(list(self.recording_buffer))
# Clear buffer for next utterance
self.recording_buffer.clear()
# Trigger callback or add to processing queue
self._on_utterance_complete(complete_audio)
def _on_utterance_complete(self, audio_data):
"""Callback for complete utterance (override in subclass)"""
print(f"Complete utterance captured: {len(audio_data)} samples")
# This would typically trigger STT processing
Audio level monitoring provides feedback about input signal strength for user interface elements and system diagnostics. The monitoring system calculates real-time audio levels for display and debugging purposes.
def get_audio_levels(self):
"""Get current audio input levels for monitoring"""
if self.recording_buffer:
recent_audio = np.concatenate(list(self.recording_buffer)[-5:])
rms = np.sqrt(np.mean(recent_audio.astype(np.float32) ** 2))
return min(100, int(rms / 50)) # Scale to 0-100
return 0
System cleanup ensures proper resource management and graceful shutdown of audio processing components. The cleanup process stops all threads and releases audio system resources.
def stop_listening(self):
"""Stop audio listening"""
self.is_recording = False
self.stop_event.set()
if self.audio_thread:
self.audio_thread.join(timeout=1.0)
if self.processing_thread:
self.processing_thread.join(timeout=1.0)
print("Stopped listening for audio input.")
def cleanup(self):
"""Clean up audio resources"""
self.stop_listening()
if hasattr(self, 'audio'):
self.audio.terminate()
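To connect capture to transcription, the _on_utterance_complete callback can be overridden in a small subclass. The sketch below assumes the EnhancedWhisperSTT class from earlier and converts the raw int16 buffer into the float32 range Whisper expects.
# Example subclass that forwards complete utterances to Whisper
# (assumes the EnhancedWhisperSTT class defined earlier).
class TranscribingAudioProcessor(RealTimeAudioProcessor):
    def __init__(self, stt, **kwargs):
        super().__init__(**kwargs)
        self.stt = stt
    def _on_utterance_complete(self, audio_data):
        # Convert int16 PCM samples to float32 in [-1, 1] for Whisper
        audio_float = audio_data.astype(np.float32) / 32768.0
        result = self.stt.transcribe_audio(audio_array=audio_float)
        print("Heard:", result['text'])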
System Integration: Complete General Voice Assistant
The complete integration brings together all enhanced components into a cohesive general-purpose voice assistant. The system manages complex state transitions, handles errors gracefully, and provides comprehensive monitoring and debugging capabilities.
The integrated assistant supports multi-platform hardware acceleration, real-time speech processing, natural language understanding, and high-quality voice synthesis. The modular architecture enables easy customization and extension for specific use cases.
import asyncio
import threading
import time
import queue
from typing import Callable, Optional, Dict, Any
from enum import Enum
import json
class EnhancedAssistantState(Enum):
INITIALIZING = "initializing"
IDLE = "idle"
LISTENING = "listening"
PROCESSING_SPEECH = "processing_speech"
GENERATING_RESPONSE = "generating_response"
SYNTHESIZING_SPEECH = "synthesizing_speech"
SPEAKING = "speaking"
ERROR = "error"
class GeneralVoiceAssistant:
def __init__(self, config: Optional[Dict] = None):
"""
Initialize complete general voice assistant with multi-platform support
Args:
config: Configuration dictionary for customizing assistant behavior
"""
self.config = self._load_default_config()
if config:
self.config.update(config)
print("Initializing General Voice Assistant...")
print("=" * 60)
# Initialize hardware manager
self.hardware_manager = HardwareManager()
# Initialize all components with hardware optimization
self._initialize_components()
# System state
self.state = EnhancedAssistantState.INITIALIZING
self.is_running = False
self.conversation_active = False
# Callbacks
self.callbacks = {
'on_state_change': [],
'on_user_speech': [],
'on_assistant_response': [],
'on_error': [],
'on_audio_level': []
}
# Performance monitoring
self.performance_stats = {
'total_interactions': 0,
'successful_interactions': 0,
'error_count': 0,
'response_times': [],
'start_time': time.time()
}
# Audio processing queue
self.audio_processing_queue = queue.Queue(maxsize=10)
print("General Voice Assistant initialized successfully!")
print("=" * 60)
self._set_state(EnhancedAssistantState.IDLE)
Configuration management provides flexible customization of assistant behavior while maintaining sensible defaults. The configuration system supports model selection, performance tuning, and feature enabling across different deployment scenarios.
def _load_default_config(self):
"""Load default configuration"""
return {
'stt_model': 'base',
'llm_model': 'microsoft/DialoGPT-medium',
'tts_model': 'tts_models/en/ljspeech/tacotron2-DDC',
'max_conversation_length': 20,
'response_timeout': 30.0,
'audio_sample_rate': 16000,
'enable_voice_output': True,
'conversation_memory': True,
'system_prompt': """You are a helpful, friendly, and knowledgeable AI assistant.
You provide clear, concise, and accurate responses to user questions.
Keep your responses conversational and under 100 words when possible."""
}
Component initialization orchestrates the setup of all assistant subsystems with proper hardware optimization and error handling. The initialization process ensures that each component is configured for optimal performance on the available hardware.
def _initialize_components(self):
"""Initialize all assistant components with hardware optimization"""
print("Initializing components...")
# Speech-to-Text
print("Loading speech recognition...")
self.stt = EnhancedWhisperSTT(
model_size=self.config['stt_model'],
hardware_manager=self.hardware_manager
)
# Language Model
print("Loading language model...")
self.llm = EnhancedConversationalLLM(
model_name=self.config['llm_model'],
hardware_manager=self.hardware_manager
)
# Text-to-Speech
if self.config['enable_voice_output']:
print("Loading text-to-speech...")
self.tts = EnhancedNeuralTTS(
model_name=self.config['tts_model'],
hardware_manager=self.hardware_manager
)
else:
self.tts = None
print("Voice output disabled")
# Conversation Management
if self.config['conversation_memory']:
print("Initializing conversation management...")
self.conversation = ConversationManager(
window_size=self.config['max_conversation_length']
)
else:
self.conversation = None
# Audio Processing
print("Initializing audio processing...")
self.audio = RealTimeAudioProcessor(
sample_rate=self.config['audio_sample_rate']
)
# Configure audio processor callback
self.audio._on_utterance_complete = self._handle_audio_input
print("All components initialized successfully!")
System startup manages the transition from initialization to active operation. The startup process configures audio processing, initializes conversation context, and prepares the system for user interaction.
def start(self):
"""Start the general voice assistant"""
if self.is_running:
print("Assistant is already running")
return
print("\nStarting General Voice Assistant...")
print("=" * 60)
print("CAPABILITIES:")
print("- General conversation and Q&A")
print("- Multi-platform hardware acceleration")
print("- Real-time speech recognition")
print("- Natural language understanding")
print("- Voice synthesis and output")
print("- Conversation memory and context")
print("\nSay 'Hello' or ask any question to begin!")
print("Press Ctrl+C to stop")
print("=" * 60)
self.is_running = True
self.performance_stats['start_time'] = time.time()
# Add system prompt to conversation
if self.conversation and self.config['system_prompt']:
self.llm.generate_response("", self.config['system_prompt'])
# Start audio processing
self._set_state(EnhancedAssistantState.IDLE)
self.audio.start_listening()
# Start background processing thread
self.processing_thread = threading.Thread(target=self._background_processing_loop)
self.processing_thread.daemon = True
self.processing_thread.start()
print("Voice assistant is ready and listening!")
Audio input handling manages the flow from speech detection into the processing queue. The system uses asynchronous processing to maintain responsiveness while handling complex speech recognition and response generation tasks.
def _handle_audio_input(self, audio_data):
"""Handle complete audio utterance"""
if not self.is_running or self.state not in [EnhancedAssistantState.IDLE]:
return
# Add to processing queue
try:
self.audio_processing_queue.put(audio_data, block=False)
except queue.Full:
print("Audio processing queue full, dropping audio")
Background processing manages the complete voice assistant pipeline from speech recognition through response synthesis. The processing loop operates independently to maintain system responsiveness during computationally intensive operations.
def _background_processing_loop(self):
"""Background processing loop for audio input"""
while self.is_running:
try:
# Get audio data with timeout
audio_data = self.audio_processing_queue.get(timeout=1.0)
# Process the audio input
self._process_user_input(audio_data)
except queue.Empty:
continue
except Exception as e:
print(f"Error in background processing: {e}")
self._handle_error(e)
User input processing orchestrates the complete pipeline from speech recognition through response generation and synthesis. The process includes comprehensive error handling, performance monitoring, and state management.
def _process_user_input(self, audio_data):
"""Process user input through the complete pipeline"""
start_time = time.time()
self.performance_stats['total_interactions'] += 1
try:
# Step 1: Speech to Text
self._set_state(EnhancedAssistantState.PROCESSING_SPEECH)
print("\n[PROCESSING] Converting speech to text...")
stt_result = self.stt.transcribe_audio(audio_array=audio_data)
if not stt_result['text']:
print("[INFO] No speech detected or transcription failed")
self._set_state(EnhancedAssistantState.IDLE)
return
user_text = stt_result['text']
print(f"[USER] {user_text}")
print(f"[INFO] Confidence: {stt_result['confidence']:.2f}")
# Trigger callbacks
self._trigger_callbacks('on_user_speech', user_text, stt_result)
# Step 2: Generate Response
self._set_state(EnhancedAssistantState.GENERATING_RESPONSE)
print("[PROCESSING] Generating response...")
response_text = self.llm.generate_response(user_text)
if not response_text:
response_text = "I'm sorry, I didn't understand that. Could you please repeat?"
print(f"[ASSISTANT] {response_text}")
# Step 3: Add to conversation history
if self.conversation:
self.conversation.add_exchange(
user_text,
response_text,
{
'stt_confidence': stt_result['confidence'],
'processing_time': time.time() - start_time,
'device_info': self._get_device_summary()
}
)
# Step 4: Text to Speech (if enabled)
if self.config['enable_voice_output'] and self.tts:
self._set_state(EnhancedAssistantState.SYNTHESIZING_SPEECH)
print("[PROCESSING] Converting response to speech...")
audio_response = self.tts.synthesize_speech(response_text)
if len(audio_response) > 0:
self._set_state(EnhancedAssistantState.SPEAKING)
print("[PLAYING] Speaking response...")
# Play audio response
self._play_audio_response(audio_response)
# Trigger callbacks
self._trigger_callbacks('on_assistant_response', response_text, audio_response)
else:
print("[WARNING] TTS synthesis failed")
# Still trigger callback with text-only response
self._trigger_callbacks('on_assistant_response', response_text, None)
else:
# Text-only mode
self._trigger_callbacks('on_assistant_response', response_text, None)
# Record performance metrics
total_time = time.time() - start_time
self.performance_stats['response_times'].append(total_time)
self.performance_stats['successful_interactions'] += 1
print(f"[INFO] Total response time: {total_time:.2f} seconds")
print("-" * 60)
except Exception as e:
print(f"[ERROR] Error processing user input: {e}")
self.performance_stats['error_count'] += 1
self._handle_error(e)
finally:
# Return to idle state
self._set_state(EnhancedAssistantState.IDLE)
Audio playback handles voice output with enhanced error handling and fallback mechanisms. The playback system ensures consistent audio output across different platforms while providing graceful degradation when audio hardware issues occur.
def _play_audio_response(self, audio_data):
"""Play audio response to user with enhanced error handling"""
try:
import sounddevice as sd
# Ensure audio is in correct format
if audio_data.dtype != np.float32:
audio_data = audio_data.astype(np.float32)
# Normalize audio
max_val = np.max(np.abs(audio_data))
if max_val > 1.0:
audio_data = audio_data / max_val
# Play audio with device-specific settings
sample_rate = self.tts.sample_rate if self.tts else 22050
sd.play(audio_data, samplerate=sample_rate)
sd.wait() # Wait until playback is finished
except Exception as e:
print(f"[ERROR] Error playing audio: {e}")
# Fallback: save to file and notify user
try:
if self.tts:
self.tts.save_audio(audio_data, "last_response.wav")
print("[INFO] Audio response saved to last_response.wav")
except Exception:
print("[WARNING] Could not save audio response")
State management provides clear tracking of system status and enables proper coordination between different processing stages. The state system includes callback mechanisms for external monitoring and integration.
def _set_state(self, new_state: EnhancedAssistantState):
"""Update assistant state with callback triggers"""
if self.state != new_state:
old_state = self.state
self.state = new_state
state_change_msg = f"[STATE] {old_state.value} -> {new_state.value}"
if new_state in [EnhancedAssistantState.IDLE, EnhancedAssistantState.ERROR]:
print(state_change_msg)
self._trigger_callbacks('on_state_change', old_state, new_state)
Error handling includes both immediate recovery attempts and graceful degradation strategies. The error handling system attempts platform-specific recovery techniques while maintaining system stability.
def _handle_error(self, error):
"""Handle system errors with recovery attempts"""
self._set_state(EnhancedAssistantState.ERROR)
self._trigger_callbacks('on_error', error)
# Attempt recovery based on error type
if "CUDA" in str(error) or "GPU" in str(error):
print("[RECOVERY] GPU error detected, clearing cache...")
if torch.cuda.is_available():
torch.cuda.empty_cache()
# Brief pause before returning to idle
time.sleep(1)
Callback management enables extensible event handling for logging, monitoring, and integration with external systems. The callback system provides hooks for all major system events and state transitions.
def _trigger_callbacks(self, callback_type, *args):
"""Trigger registered callbacks"""
for callback in self.callbacks.get(callback_type, []):
try:
callback(*args)
except Exception as e:
print(f"[WARNING] Callback error: {e}")
def add_callback(self, callback_type: str, callback: Callable):
"""Add callback for specific events"""
if callback_type in self.callbacks:
self.callbacks[callback_type].append(callback)
else:
print(f"[WARNING] Unknown callback type: {callback_type}")
Text input processing provides direct text interaction capabilities for testing and text-only operation modes. This functionality enables debugging and development without requiring audio hardware.
def process_text_input(self, text: str) -> str:
"""Process text input directly (for testing or text-only mode)"""
try:
print(f"[USER] {text}")
response = self.llm.generate_response(text)
if self.conversation:
self.conversation.add_exchange(text, response)
print(f"[ASSISTANT] {response}")
# Synthesize speech if enabled
if self.config['enable_voice_output'] and self.tts:
audio_response = self.tts.synthesize_speech(response)
if len(audio_response) > 0:
self._play_audio_response(audio_response)
return response
except Exception as e:
print(f"[ERROR] Error processing text input: {e}")
return "I'm sorry, I encountered an error processing your request."
System status reporting provides comprehensive information about assistant performance, hardware utilization, and operational metrics. The status system enables monitoring and optimization of system performance.
def get_system_status(self) -> Dict[str, Any]:
"""Get comprehensive system status and performance metrics"""
uptime = time.time() - self.performance_stats['start_time']
avg_response_time = (
sum(self.performance_stats['response_times'][-10:]) /
len(self.performance_stats['response_times'][-10:])
if self.performance_stats['response_times'] else 0
)
status = {
'state': self.state.value,
'is_running': self.is_running,
'uptime_seconds': uptime,
'performance': {
'total_interactions': self.performance_stats['total_interactions'],
'successful_interactions': self.performance_stats['successful_interactions'],
'error_count': self.performance_stats['error_count'],
'success_rate': (
self.performance_stats['successful_interactions'] /
max(1, self.performance_stats['total_interactions'])
),
'average_response_time': avg_response_time,
'total_responses': len(self.performance_stats['response_times'])
},
'hardware': self._get_device_summary(),
'audio_level': self.audio.get_audio_levels() if hasattr(self.audio, 'get_audio_levels') else 0
}
if self.conversation:
status['conversation'] = self.conversation.get_conversation_summary()
return status
Device information aggregation provides comprehensive hardware status across all system components. This information enables performance monitoring and troubleshooting across different hardware platforms.
def _get_device_summary(self) -> Dict[str, Any]:
"""Get summary of device information across all components"""
summary = {
'hardware_platform': self.hardware_manager.platform_info,
'optimal_device': self.hardware_manager.optimal_device
}
if hasattr(self.stt, 'get_device_info'):
summary['stt'] = self.stt.get_device_info()
if hasattr(self.llm, 'get_performance_stats'):
summary['llm'] = self.llm.get_performance_stats()
if self.tts and hasattr(self.tts, 'get_device_info'):
summary['tts'] = self.tts.get_device_info()
return summary
System shutdown manages graceful termination of all components and resources. The shutdown process ensures proper cleanup while providing session summary information for analysis and debugging.
def stop(self):
"""Stop the voice assistant"""
if not self.is_running:
return
print("\nStopping General Voice Assistant...")
self.is_running = False
# Stop audio processing
self.audio.stop_listening()
# Clean up resources
self.audio.cleanup()
# Clear model memory
if hasattr(self.llm, 'clear_history'):
self.llm.clear_history()
self._set_state(EnhancedAssistantState.IDLE)
# Print final statistics
self._print_session_summary()
print("General Voice Assistant stopped.")
def _print_session_summary(self):
"""Print session summary statistics"""
status = self.get_system_status()
print("\n" + "=" * 60)
print("SESSION SUMMARY")
print("=" * 60)
print(f"Total interactions: {status['performance']['total_interactions']}")
print(f"Successful interactions: {status['performance']['successful_interactions']}")
print(f"Success rate: {status['performance']['success_rate']:.1%}")
print(f"Average response time: {status['performance']['average_response_time']:.2f}s")
print(f"Session duration: {status['uptime_seconds']:.0f} seconds")
print(f"Hardware platform: {status['hardware']['optimal_device']}")
print("=" * 60)
Interactive testing mode provides comprehensive text-based interaction for development and debugging. The testing mode includes command handling, status reporting, and conversation management capabilities.
def test_text_mode(self):
"""Test the assistant with text input in interactive mode"""
print("\n" + "=" * 60)
print("GENERAL VOICE ASSISTANT - TEXT MODE")
print("=" * 60)
print("Type your questions or statements below.")
print("Commands:")
print(" 'quit' or 'exit' - Exit text mode")
print(" 'status' - Show system status")
print(" 'clear' - Clear conversation history")
print(" 'help' - Show available commands")
print("=" * 60)
while True:
try:
user_input = input("\nYou: ").strip()
if user_input.lower() in ['quit', 'exit', 'bye', 'goodbye']:
print("Goodbye!")
break
if not user_input:
continue
if user_input.lower() == 'status':
status = self.get_system_status()
print(f"\nSystem Status:")
print(f" State: {status['state']}")
print(f" Interactions: {status['performance']['total_interactions']}")
print(f" Success rate: {status['performance']['success_rate']:.1%}")
print(f" Device: {status['hardware']['optimal_device']}")
continue
if user_input.lower() == 'clear':
if self.conversation:
self.conversation.clear_memory()
if hasattr(self.llm, 'clear_history'):
self.llm.clear_history()
print("Conversation history cleared.")
continue
if user_input.lower() == 'help':
print("\nThis is a general AI assistant. You can:")
print("- Ask questions about any topic")
print("- Have conversations")
print("- Request explanations")
print("- Get help with various tasks")
print("- Use voice commands (in voice mode)")
continue
# Process the input
response = self.process_text_input(user_input)
except KeyboardInterrupt:
print("\nExiting text mode...")
break
except Exception as e:
print(f"Error: {e}")
Running Example: Complete General Voice Assistant
The complete running example demonstrates the integration of all components into a functional general-purpose voice assistant. This implementation showcases multi-platform hardware support, comprehensive error handling, and extensible architecture.
# Example usage and testing
if __name__ == "__main__":
# Create configuration for the assistant
config = {
'stt_model': 'base', # Whisper model size
'llm_model': 'microsoft/DialoGPT-medium', # Language model
'tts_model': 'tts_models/en/ljspeech/tacotron2-DDC', # TTS model
'enable_voice_output': True, # Enable voice synthesis
'conversation_memory': True, # Enable conversation memory
'system_prompt': """You are a helpful, friendly, and knowledgeable AI assistant.
You provide clear, accurate, and conversational responses. Keep responses
concise but informative, typically under 100 words unless more detail is requested."""
}
# Initialize the general voice assistant
print("Initializing General Voice Assistant...")
assistant = GeneralVoiceAssistant(config)
# Add some example callbacks
def on_user_speech(text, stt_result):
# Log user speech to file or database
pass
def on_assistant_response(text, audio):
# Log assistant responses
pass
def on_error(error):
# Handle errors (logging, notifications, etc.)
print(f"Assistant error logged: {error}")
# Register callbacks
assistant.add_callback('on_user_speech', on_user_speech)
assistant.add_callback('on_assistant_response', on_assistant_response)
assistant.add_callback('on_error', on_error)
# Test in text mode first
print("\nTesting in text mode...")
assistant.test_text_mode()
# Uncomment to test voice mode
# print("\nStarting voice mode...")
# try:
# assistant.start()
#
# # Keep running until interrupted
# while True:
# time.sleep(1)
#
# # Print status every 60 seconds
# if int(time.time()) % 60 == 0:
# status = assistant.get_system_status()
# print(f"\n[STATUS] Interactions: {status['performance']['total_interactions']}, "
# f"Success rate: {status['performance']['success_rate']:.1%}, "
# f"Avg time: {status['performance']['average_response_time']:.2f}s")
#
# except KeyboardInterrupt:
# print("\nShutting down...")
# assistant.stop()
Performance Optimization and Deployment Considerations
Deploying a voice assistant in production requires careful attention to performance optimization, resource management, and scalability considerations. The enhanced implementation provides multiple optimization strategies for different deployment scenarios.
Model optimization techniques include quantization, pruning, and knowledge distillation to reduce memory usage and inference time. Hardware-specific optimizations leverage platform capabilities while maintaining compatibility across different systems.
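As one illustration of these techniques, the sketch below applies PyTorch dynamic quantization to a loaded language model. This is an assumption about how you might prepare a model for a CPU-only deployment, not code from the implementation above, and the actual savings depend on the model architecture.
# Illustrative sketch: dynamic int8 quantization of a causal LM (assumes a CPU deployment target)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "microsoft/DialoGPT-medium"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Quantize the linear layers to int8; weights shrink roughly 4x and CPU inference is often faster
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)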
Memory management strategies prevent resource exhaustion during extended operation. The system includes automatic garbage collection, device cache management, and conversation history pruning to maintain optimal performance over time.
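A minimal sketch of such a maintenance pass is shown below. The attributes it touches follow the class above, but the method name and the pruning thresholds are arbitrary assumptions rather than part of the implementation.
# Hypothetical periodic maintenance pass; thresholds are illustrative, not tuned values
import gc
import torch

def _run_maintenance(self):
    """Prune unbounded buffers and release cached device memory."""
    # Keep only the most recent response-time samples
    if len(self.performance_stats['response_times']) > 1000:
        self.performance_stats['response_times'] = self.performance_stats['response_times'][-100:]
    # Ask Python and the GPU allocator to release unused memory
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()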
Monitoring and analytics capabilities enable continuous optimization and troubleshooting. The system tracks performance metrics, error rates, and resource utilization to guide optimization efforts and identify potential issues.
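One lightweight way to expose those metrics is a background logger built on get_system_status(). The sketch below appends periodic JSON snapshots to a file; the function name, file path, and interval are assumptions for illustration.
# Illustrative status logger; file name and interval are arbitrary assumptions
import json
import threading
import time

def start_status_logger(assistant, path="assistant_metrics.jsonl", interval=60):
    """Append a JSON status snapshot every `interval` seconds while the assistant runs."""
    def _log_loop():
        while assistant.is_running:
            snapshot = assistant.get_system_status()
            snapshot['timestamp'] = time.time()
            with open(path, "a") as f:
                f.write(json.dumps(snapshot, default=str) + "\n")
            time.sleep(interval)
    thread = threading.Thread(target=_log_loop, daemon=True)
    thread.start()
    return thread

# Usage sketch: call after assistant.start() so is_running is True
# start_status_logger(assistant)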
Security considerations include input validation, output filtering, and resource access controls. The implementation provides hooks for security monitoring and includes safeguards against potential attacks or misuse.
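A minimal sketch of such an input guard is shown below: it caps input length and strips control characters before text reaches the language model. The length limit and the idea of calling it at the top of process_text_input are assumptions, not part of the implementation above.
# Hypothetical input guard; the length cap is an illustrative assumption
import re

MAX_INPUT_CHARS = 500

def sanitize_user_text(text: str) -> str:
    """Trim oversized input and remove non-printable control characters."""
    text = text[:MAX_INPUT_CHARS]
    text = re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f]', '', text)  # keep tabs and newlines, drop other control chars
    return text.strip()

# Example: text = sanitize_user_text(text) as the first step of process_text_input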
Scalability features enable deployment across different system configurations from edge devices to high-performance servers. The modular architecture supports horizontal scaling and load distribution for high-throughput applications.
The future of open source voice assistants continues to evolve with advances in model efficiency, multimodal capabilities, and edge computing optimization. This implementation provides a solid foundation for incorporating future developments while maintaining system stability and compatibility.
Conclusion
This comprehensive guide demonstrates how to build sophisticated voice assistants using entirely open source components with full multi-platform hardware support. The implementation showcases automatic hardware detection and optimization for NVIDIA CUDA, AMD ROCm, and Apple Silicon MPS acceleration.
The modular architecture enables continuous improvement and customization while the complete example provides a functional general-purpose assistant. With proper attention to optimization and deployment considerations, these systems can provide robust, privacy-preserving voice interfaces suitable for a wide range of applications.
The democratization of voice AI technology through open source tools opens new possibilities for innovation, customization, and deployment across diverse domains and use cases. By understanding and implementing these techniques, developers can create voice assistants that meet specific requirements while maintaining full control over functionality, privacy, and performance characteristics.