Hitchhiker's Guide to AI, Software Architecture, and Everything Else: BUILDING YOUR OWN VOICE-POWERED AI ASSISTANT

INTRODUCTION TO VOICE-POWERED AI SYSTEMS

Welcome to this comprehensive tutorial where you will learn how to build a complete voice-powered artificial intelligence system from scratch using only open source components. This system will allow you to speak to your computer, have your speech converted to text, processed by a large language model, and then receive spoken responses back. Think of it as creating your own personal AI assistant similar to commercial products, but with full control over every component and the ability to run everything locally on your own hardware.

The beauty of this project lies in its modular architecture. You will build three distinct but interconnected components. The first component handles speech-to-text conversion, transforming your spoken words into written text that computers can process. The second component does the reverse, taking text and converting it into natural-sounding speech. The third component sits in the middle, using a large language model to understand your questions and generate intelligent responses. When combined, these three components create a seamless conversational experience.

What makes this tutorial special is its focus on practical, production-ready code that works across different hardware platforms. Whether you have an NVIDIA graphics card, an AMD GPU, an Intel processor, or an Apple Silicon Mac, the code provided here will work for you. We will use Python as our primary programming language because of its excellent support for machine learning libraries and its readability for beginners.

UNDERSTANDING THE ARCHITECTURE

Before diving into code, let us understand how the system works at a high level. Imagine you are having a conversation with a friend. You speak, your friend listens and understands, thinks about what you said, formulates a response, and speaks back to you. Our system mimics this natural flow.

When you speak into your microphone, the audio signal contains your voice along with any background noise. The first challenge is to capture this audio cleanly and convert it into a format suitable for processing. Modern speech recognition systems use deep learning models trained on very large speech corpora; OpenAI's Whisper, the model used in this tutorial, was trained on roughly 680,000 hours of audio. These models have learned to recognize patterns in audio that correspond to specific words and phrases.

The speech recognition component continuously listens to your microphone, detects when you start speaking, captures the audio, and sends it to a specialized model that transcribes it into text. This text then flows into the language model component.

The language model is the brain of the system. It reads the transcribed text, understands the context and meaning, and generates an appropriate response. Large language models have been trained on vast amounts of text data and can engage in remarkably human-like conversations, answer questions, help with tasks, and much more.

Finally, the text-to-speech component takes the language model's response and converts it back into spoken audio. Modern text-to-speech systems use neural networks to generate natural-sounding voices that are far superior to the robotic voices of the past.

CHOOSING THE RIGHT OPEN SOURCE COMPONENTS

The open source ecosystem offers several excellent options for each component of our system. For speech recognition, we will use OpenAI's Whisper model, which has achieved state-of-the-art results and works remarkably well even in noisy environments. Whisper comes in several sizes, from tiny models that run quickly on modest hardware to large models that provide the highest accuracy.

For the language model component, we will support multiple backends. You can use local models through libraries like llama-cpp-python, transformers from Hugging Face, or connect to remote API services. This flexibility allows you to choose based on your hardware capabilities and privacy requirements.

For text-to-speech, we will use Coqui TTS, a powerful open source library that supports multiple languages and voice styles. Coqui TTS includes pre-trained models that sound natural and expressive.

To handle audio input and output, we will use the sounddevice library, which provides cross-platform access to the microphone and speakers, together with soundfile for reading and writing audio files. For audio processing and noise reduction, we will incorporate noisereduce, a library specifically designed to remove background noise from audio recordings.

SETTING UP YOUR DEVELOPMENT ENVIRONMENT

Before writing any code, you need to set up your development environment with all necessary dependencies. The setup process varies slightly depending on your operating system and hardware, but the core requirements remain the same.

You will need Python version 3.8 or higher. Most modern operating systems come with Python pre-installed, but you may need to install it separately. You should also set up a virtual environment to keep your project dependencies isolated from other Python projects on your system.

For GPU acceleration, you need to install the appropriate libraries for your hardware. NVIDIA users need CUDA and cuDNN. AMD users need ROCm. Apple Silicon users will use Metal Performance Shaders, which comes built into macOS. Intel users can leverage OpenVINO for optimized inference.

The installation process begins by creating a virtual environment and installing the core dependencies. Here is how you would set this up:

python -m venv voice_ai_env
source voice_ai_env/bin/activate  # On Windows: voice_ai_env\Scripts\activate
pip install torch torchvision torchaudio
pip install openai-whisper
pip install transformers accelerate
pip install TTS
pip install sounddevice soundfile
pip install noisereduce librosa
pip install numpy scipy
pip install llama-cpp-python

The torch installation command may need modification based on your GPU. For NVIDIA GPUs with CUDA support, you would install a CUDA-enabled version. For AMD GPUs, you would install the ROCm version. The PyTorch website provides a configuration tool that generates the exact installation command for your specific hardware.

BUILDING THE SPEECH-TO-TEXT COMPONENT

The speech-to-text component is responsible for capturing audio from your microphone and converting it into text. This component needs to handle several challenges: detecting when you start and stop speaking, filtering out background noise, and accurately transcribing your words even in less-than-perfect audio conditions.

We begin by creating a class that encapsulates all speech recognition functionality. This class will manage the Whisper model, handle audio recording, and provide a clean interface for the rest of our system to use.

The first consideration is audio format. Digital audio is represented as a series of numbers called samples. The sample rate determines how many samples are captured per second. CD-quality audio uses 44,100 samples per second, but for speech recognition, 16,000 samples per second is sufficient and reduces computational requirements.
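To make the trade-off concrete, the following minimal sketch computes how many samples and bytes an uncompressed recording occupies at each rate. The helper name recording_size is hypothetical, and the sketch assumes mono audio stored as 32-bit floats, the format used later in this tutorial.

```python
def recording_size(duration_s, sample_rate, bytes_per_sample=4, channels=1):
    """Return (sample_count, byte_count) for an uncompressed recording."""
    samples = int(duration_s * sample_rate) * channels
    return samples, samples * bytes_per_sample


cd_samples, cd_bytes = recording_size(5, 44100)
speech_samples, speech_bytes = recording_size(5, 16000)
print(cd_samples, cd_bytes)          # 220500 882000
print(speech_samples, speech_bytes)  # 80000 320000
```

A five-second clip at 16,000 samples per second is a little over a third of the size of the same clip at CD quality, which is why the lower rate is attractive for speech.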

Here is the foundation of our speech recognition class:

import whisper
import sounddevice as sd
import numpy as np
import noisereduce as nr
import queue
import threading
from scipy.io import wavfile

class SpeechRecognizer:
    def __init__(self, model_size='base', device='cpu', language='en'):
        """
        Initialize the speech recognizer with a Whisper model.

        Args:
            model_size: Size of Whisper model (tiny, base, small, medium, large)
            device: Computing device (cpu, cuda, mps)
            language: Language code for transcription
        """
        print(f"Loading Whisper {model_size} model on {device}...")
        self.model = whisper.load_model(model_size, device=device)
        self.device = device
        self.language = language
        self.sample_rate = 16000
        self.channels = 1
        self.audio_queue = queue.Queue()
        self.is_recording = False

This initialization method loads the Whisper model into memory. The model_size parameter allows you to choose between speed and accuracy. The tiny model is fastest but least accurate, while the large model provides the best accuracy but requires more computational power. The base model offers a good balance for most applications.

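The device parameter specifies where computations will run. On systems with NVIDIA GPUs, you would use 'cuda'. On AMD GPUs with a ROCm build of PyTorch, you would also use 'cuda', because ROCm builds expose the GPU through the same API. On Apple Silicon Macs, you would use 'mps'. On systems without GPU acceleration or for testing, you would use 'cpu'.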

Now we need to implement the audio recording functionality. The challenge here is to continuously monitor the microphone, detect when speech begins, capture the audio, and stop recording when speech ends. This is called voice activity detection.

A simple but effective approach uses energy-based detection. When you speak, the audio signal has higher energy than background noise. By monitoring the audio energy level, we can detect when speech starts and stops.

    def record_audio(self, duration=5, silence_threshold=0.01, silence_duration=1.5):
        """
        Record audio from microphone with automatic silence detection.

        Args:
            duration: Maximum recording duration in seconds
            silence_threshold: Energy threshold below which audio is considered silence
            silence_duration: Seconds of silence before stopping recording

        Returns:
            1-D numpy array containing mono audio samples
        """
        print("Listening... Speak now!")
        audio_buffer = []
        silence_samples = int(silence_duration * self.sample_rate)
        max_samples = int(duration * self.sample_rate)

        def audio_callback(indata, frames, time, status):
            """Called for each audio block from the microphone"""
            if status:
                print(f"Audio status: {status}")
            audio_buffer.append(indata.copy())

        with sd.InputStream(samplerate=self.sample_rate,
                            channels=self.channels,
                            callback=audio_callback,
                            dtype='float32'):
            total_samples = 0
            silence_counter = 0
            blocks_seen = 0
            # Record until silence detected or max duration reached.
            # Samples are counted directly because sd.default.blocksize
            # is 0 (variable block size) unless set explicitly.
            while total_samples < max_samples:
                sd.sleep(100)
                # Examine only the blocks that arrived since the last check
                new_blocks = audio_buffer[blocks_seen:]
                blocks_seen += len(new_blocks)
                for block in new_blocks:
                    total_samples += len(block)
                    energy = np.sqrt(np.mean(block**2))
                    if energy < silence_threshold:
                        silence_counter += len(block)
                    else:
                        silence_counter = 0
                if silence_counter >= silence_samples:
                    break

        if not audio_buffer:
            return np.array([])

        # Flatten the (samples, 1) blocks into a 1-D mono signal
        audio_data = np.concatenate(audio_buffer).flatten()
        print(f"Recording complete. Captured {len(audio_data)/self.sample_rate:.2f} seconds")
        return audio_data

This recording function uses a callback mechanism. The sounddevice library calls our audio_callback function repeatedly with small chunks of audio data. We accumulate these chunks in a buffer while monitoring the audio energy. When the energy stays below the silence threshold for the specified duration, we stop recording.
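The stopping rule can be tested without a microphone. The sketch below is a simplified, standalone model of the silence counter, not the recording method itself; the function name find_stop_block is hypothetical, and it assumes fixed-size blocks whose energies have already been computed.

```python
def find_stop_block(block_energies, block_size, sample_rate,
                    silence_threshold=0.01, silence_duration=1.5):
    """Return the index of the block at which recording would stop,
    or None if the silence window is never filled."""
    silence_samples = int(silence_duration * sample_rate)
    silent_run = 0
    for index, energy in enumerate(block_energies):
        if energy < silence_threshold:
            silent_run += block_size   # extend the current run of silence
        else:
            silent_run = 0             # speech resets the counter
        if silent_run >= silence_samples:
            return index
    return None


# 0.1-second blocks at 16 kHz: 5 loud blocks, then silence.
energies = [0.2] * 5 + [0.001] * 20
print(find_stop_block(energies, block_size=1600, sample_rate=16000))  # 19
```

Fifteen silent blocks of 1,600 samples fill the 1.5-second window, so recording stops at block index 19. Any loud block in between would reset the counter.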

The energy calculation uses root mean square, which gives us a single number representing the overall loudness of the audio. This is computed by squaring all the samples, taking their mean, and then taking the square root.
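As a minimal illustration of the calculation, here is a pure-Python version of the same formula. The recording code uses NumPy for speed; the function name rms and the synthetic test signals are assumptions made for this example.

```python
import math


def rms(samples):
    """Root mean square of a sequence of samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


# One second of a full-scale 440 Hz sine wave sampled at 16 kHz.
sine = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
# The same wave scaled down to mimic faint background noise.
quiet = [0.005 * s for s in sine]

print(round(rms(sine), 3))   # 0.707
print(rms(quiet) < 0.01)     # True
```

A full-scale sine wave has an RMS of about 0.707, well above the default threshold of 0.01, while the scaled-down signal falls below it and would be treated as silence.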

Background noise can significantly degrade transcription accuracy. To address this, we apply noise reduction before sending audio to the Whisper model. The noisereduce library uses spectral gating, a technique that analyzes the frequency spectrum of the audio and attenuates frequencies that appear to be noise.

    def reduce_noise(self, audio_data):
        """
        Apply noise reduction to audio data.

        Args:
            audio_data: numpy array of audio samples

        Returns:
            Noise-reduced audio array
        """
        if len(audio_data) == 0:
            return audio_data

        # Estimate noise profile from the first 0.5 seconds
        noise_sample_length = min(int(0.5 * self.sample_rate), len(audio_data))
        noise_sample = audio_data[:noise_sample_length]

        # Apply noise reduction. Stationary mode is used because
        # noisereduce only takes the y_noise clip into account there.
        reduced_noise = nr.reduce_noise(
            y=audio_data,
            sr=self.sample_rate,
            y_noise=noise_sample,
            stationary=True,
            prop_decrease=0.8
        )
        return reduced_noise

The noise reduction function assumes that the first half-second of audio contains primarily background noise. It uses this segment to build a noise profile, then removes similar patterns from the entire recording. The prop_decrease parameter controls how aggressively noise is removed. A value of 0.8 means we reduce noise by 80 percent, which provides good noise reduction while preserving speech quality.
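The effect of prop_decrease can be illustrated with a simplified model of spectral gating. This sketch is not the noisereduce implementation: it assumes each frequency bin is classified as noise simply by comparing its magnitude with a per-bin noise floor, and the function name apply_gate is hypothetical. Bins judged to be noise are scaled by 1 - prop_decrease, while the remaining bins pass unchanged.

```python
def apply_gate(magnitudes, noise_floor, prop_decrease=0.8):
    """Attenuate bins at or below the noise floor by prop_decrease."""
    gated = []
    for magnitude, floor in zip(magnitudes, noise_floor):
        is_noise = magnitude <= floor
        gain = 1.0 - prop_decrease if is_noise else 1.0
        gated.append(magnitude * gain)
    return gated


magnitudes = [0.5, 4.0, 0.25, 3.0]   # one frame of spectral magnitudes
noise_floor = [1.0, 1.0, 1.0, 1.0]   # estimated from the noise-only segment
print(apply_gate(magnitudes, noise_floor))
```

With prop_decrease set to 0.8, the two quiet bins keep roughly 20 percent of their magnitude and the two loud bins are untouched. Setting prop_decrease to 1.0 would silence the noise bins entirely, at the risk of audible artifacts.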

Now we can implement the transcription method that ties everything together:

    def transcribe(self, audio_data):
        """
        Transcribe audio data to text using Whisper.

        Args:
            audio_data: numpy array of audio samples

        Returns:
            Transcribed text string
        """
        if len(audio_data) == 0:
            return ""

        # Ensure audio is in the correct format for Whisper
        if audio_data.dtype != np.float32:
            audio_data = audio_data.astype(np.float32)

        # Whisper expects mono audio
        if len(audio_data.shape) > 1:
            audio_data = audio_data.mean(axis=1)

        # Normalize audio to [-1, 1] range
        max_val = np.abs(audio_data).max()
        if max_val > 0:
            audio_data = audio_data / max_val

        print("Transcribing audio...")
        result = self.model.transcribe(
            audio_data,
            language=self.language,
            fp16=(self.device == 'cuda')
        )

        transcribed_text = result['text'].strip()
        print(f"Transcribed: {transcribed_text}")
        return transcribed_text

The transcribe method prepares the audio data for Whisper and performs the actual transcription. Whisper expects audio in a specific format: 32-bit floating point values normalized to the range -1 to 1, with a sample rate of 16000 Hz. We ensure the audio meets these requirements before transcription.
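Audio from WAV files or other capture libraries often arrives as 16-bit integers with two channels rather than as mono floats. The following sketch shows the conversion in plain Python for clarity; in practice NumPy does the same work in a vectorized way. The function name pcm16_to_mono_float and the list-of-tuples input layout are assumptions made for this example.

```python
def pcm16_to_mono_float(frames):
    """Convert 16-bit PCM frames to mono floats in the range [-1, 1).

    frames is a list of tuples, one tuple of integer channel values per frame.
    """
    mono = []
    for frame in frames:
        average = sum(frame) / len(frame)   # downmix channels to mono
        mono.append(average / 32768.0)      # scale the int16 range to [-1, 1)
    return mono


stereo = [(16384, 16384), (-32768, -32768), (0, 32767)]
print(pcm16_to_mono_float(stereo))
```

Dividing by 32768 maps the most negative 16-bit value to exactly -1.0, so every result stays inside the range Whisper expects.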

The fp16 parameter enables half-precision floating point computations, which speeds up processing with minimal impact on accuracy. In this code it is enabled only when the device is 'cuda'; on other devices the model runs in 32-bit precision.

BUILDING THE TEXT-TO-SPEECH COMPONENT

The text-to-speech component converts text into natural-sounding speech. Modern neural text-to-speech systems can produce remarkably human-like voices with proper intonation, rhythm, and emotion.

We will use Coqui TTS, which provides pre-trained models for multiple languages and voices. The library handles the complex process of converting text into audio waveforms that can be played through speakers.

Here is the foundation of our text-to-speech class:

from TTS.api import TTS
import numpy as np
import sounddevice as sd
import soundfile as sf

class TextToSpeech:
    def __init__(self, model_name=None, device='cpu'):
        """
        Initialize text-to-speech engine.

        Args:
            model_name: Name of TTS model to use (None for default)
            device: Computing device (cpu, cuda, mps)
        """
        print(f"Loading TTS model on {device}...")

        # Use default English model if none specified
        if model_name is None:
            model_name = "tts_models/en/ljspeech/tacotron2-DDC"

        self.tts = TTS(model_name=model_name, progress_bar=False, gpu=(device=='cuda'))
        self.device = device
        self.sample_rate = 22050  # Standard TTS sample rate

The initialization loads a TTS model into memory. Coqui TTS supports many different models, each with different characteristics. The Tacotron2 model we use here produces high-quality speech and works well for English.

The sample rate for TTS is typically 22050 Hz, which is higher than the 16000 Hz we use for speech recognition. This higher rate allows for better audio quality in the synthesized speech.
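Because the two components use different rates, the playback rate must match the rate at which the audio was synthesized. The short sketch below shows what happens otherwise; the function name playback_seconds is hypothetical.

```python
def playback_seconds(num_samples, playback_rate):
    """Duration of a clip when its samples are played at playback_rate."""
    return num_samples / playback_rate


num_samples = 44100  # two seconds of audio synthesized at 22050 Hz
print(playback_seconds(num_samples, 22050))  # 2.0
print(playback_seconds(num_samples, 16000))  # 2.75625
```

Played at 16000 Hz, the same samples take about 38 percent longer, so the voice sounds slower and lower in pitch.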

Now we implement the synthesis method that converts text to audio:

    def synthesize(self, text):
        """
        Convert text to speech audio.

        Args:
            text: String of text to synthesize

        Returns:
            numpy array containing audio samples
        """
        if not text or len(text.strip()) == 0:
            return np.array([])

        print(f"Synthesizing speech: {text[:50]}...")

        # Generate audio from text
        audio_data = self.tts.tts(text=text)

        # Convert to numpy array if needed
        if not isinstance(audio_data, np.ndarray):
            audio_data = np.array(audio_data)

        return audio_data

The synthesize method is straightforward. It takes text as input and returns an audio waveform as a numpy array. The TTS library handles all the complex neural network computations internally.

To make the speech audible, we need to play it through the computer's speakers:

    def play_audio(self, audio_data):
        """
        Play audio through speakers.

        Args:
            audio_data: numpy array of audio samples
        """
        if len(audio_data) == 0:
            return

        print("Playing audio...")

        # Ensure audio is in correct format
        if audio_data.dtype != np.float32:
            audio_data = audio_data.astype(np.float32)

        # Normalize to prevent clipping
        max_val = np.abs(audio_data).max()
        if max_val > 1.0:
            audio_data = audio_data / max_val

        # Play audio and wait for completion
        sd.play(audio_data, samplerate=self.sample_rate)
        sd.wait()
        print("Playback complete")

The play_audio method uses sounddevice to send audio to the speakers. We normalize the audio to prevent clipping, which occurs when audio values exceed the valid range and causes distortion. The sd.wait() call blocks until playback completes, ensuring we do not try to play multiple audio streams simultaneously.
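The normalization rule is simple enough to check by hand. Here is a minimal pure-Python sketch of the same logic; the function name normalize_peak is hypothetical.

```python
def normalize_peak(samples):
    """Scale samples down only when the peak magnitude exceeds 1.0."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak > 1.0:
        return [s / peak for s in samples]
    return list(samples)


print(normalize_peak([0.5, -2.0, 1.0]))  # [0.25, -1.0, 0.5]
print(normalize_peak([0.5, -0.25]))      # [0.5, -0.25]
```

Audio that already fits in the valid range is left untouched, so quiet passages are not amplified.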

For convenience, we can combine synthesis and playback into a single method:

    def speak(self, text):
        """
        Synthesize text and play it immediately.

        Args:
            text: String of text to speak
        """
        audio_data = self.synthesize(text)
        self.play_audio(audio_data)

This speak method provides a simple interface for the rest of our system. Given text, it converts it to speech and plays it in one operation.

INTEGRATING THE LANGUAGE MODEL

The language model is the intelligence of our system. It processes the transcribed text, understands the meaning and context, and generates appropriate responses. We will create a flexible language model interface that supports both local models and remote API services.

The key design principle is abstraction. We define a common interface that all language model implementations must follow, allowing us to swap between different models without changing the rest of our code.

Here is the base class that defines this interface:

from abc import ABC, abstractmethod

class LanguageModelInterface(ABC):
    """Abstract base class for language model implementations"""

    @abstractmethod
    def generate_response(self, prompt, max_tokens=500, temperature=0.7):
        """
        Generate a response to the given prompt.

        Args:
            prompt: Input text to respond to
            max_tokens: Maximum length of response
            temperature: Randomness in generation (0.0 to 1.0)

        Returns:
            Generated response text
        """
        pass

    @abstractmethod
    def reset_conversation(self):
        """Reset conversation history"""
        pass

This abstract base class uses Python's ABC module to define methods that all language model implementations must provide. The generate_response method is the core functionality, while reset_conversation allows clearing the conversation history for multi-turn dialogues.
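One practical benefit of the interface is that a stand-in model can be written for testing the rest of the system without loading any weights. The sketch below repeats a trimmed copy of the interface so that it runs on its own; EchoLanguageModel is a hypothetical test double, not one of the tutorial's backends.

```python
from abc import ABC, abstractmethod


class LanguageModelInterface(ABC):
    """Trimmed copy of the interface so this example is self-contained."""

    @abstractmethod
    def generate_response(self, prompt, max_tokens=500, temperature=0.7):
        ...

    @abstractmethod
    def reset_conversation(self):
        ...


class EchoLanguageModel(LanguageModelInterface):
    """Hypothetical test double: repeats the prompt instead of running a model."""

    def __init__(self):
        self.conversation_history = []

    def generate_response(self, prompt, max_tokens=500, temperature=0.7):
        response = f"You said: {prompt}"
        self.conversation_history.append((prompt, response))
        return response

    def reset_conversation(self):
        self.conversation_history = []


model = EchoLanguageModel()
print(model.generate_response("hello"))  # You said: hello
```

Python refuses to instantiate LanguageModelInterface directly, or any subclass that omits one of the abstract methods, and raises TypeError instead. This catches incomplete implementations early.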

Now let us implement a local language model using llama-cpp-python, which provides efficient CPU and GPU inference for models in the GGUF format:

from llama_cpp import Llama

class LocalLanguageModel(LanguageModelInterface):
    def __init__(self, model_path, n_ctx=2048, n_gpu_layers=0):
        """
        Initialize local language model.

        Args:
            model_path: Path to GGUF model file
            n_ctx: Context window size
            n_gpu_layers: Number of layers to offload to GPU (0 for CPU only)
        """
        print(f"Loading local model from {model_path}...")
        self.model = Llama(
            model_path=model_path,
            n_ctx=n_ctx,
            n_gpu_layers=n_gpu_layers,
            verbose=False
        )
        self.conversation_history = []
        self.system_prompt = (
            "You are a helpful AI assistant. Provide clear, concise, "
            "and accurate responses to user questions."
        )

    def generate_response(self, prompt, max_tokens=500, temperature=0.7):
        """Generate response using local model"""
        # Add user message to history
        self.conversation_history.append({
            "role": "user",
            "content": prompt
        })

        # Build full prompt with history
        full_prompt = self._build_prompt()

        print("Generating response...")

        # Generate response
        response = self.model(
            full_prompt,
            max_tokens=max_tokens,
            temperature=temperature,
            stop=["User:", "Assistant:", "\n\n"],
            echo=False
        )
        response_text = response['choices'][0]['text'].strip()

        # Add assistant response to history
        self.conversation_history.append({
            "role": "assistant",
            "content": response_text
        })
        return response_text

    def _build_prompt(self):
        """Build prompt from conversation history"""
        prompt_parts = [self.system_prompt, "\n\n"]
        for message in self.conversation_history:
            role = message['role'].capitalize()
            content = message['content']
            prompt_parts.append(f"{role}: {content}\n\n")
        prompt_parts.append("Assistant:")
        return "".join(prompt_parts)

    def reset_conversation(self):
        """Clear conversation history"""
        self.conversation_history = []

The LocalLanguageModel class manages conversation history to enable multi-turn dialogues. Each time you ask a question, it is added to the history along with the model's response. This allows the model to maintain context across multiple exchanges.
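Because the history grows with every exchange, a long conversation will eventually exceed the context window set by n_ctx. One possible mitigation is to keep only the most recent messages. The sketch below is an assumption-laden illustration rather than part of the class: the function name trim_history is hypothetical, and it uses a word count as a crude stand-in for a token count, which a real tokenizer would measure differently.

```python
def trim_history(history, max_words=1000):
    """Keep the most recent messages whose combined word count fits the budget.

    The newest message is always kept, even if it alone exceeds the budget.
    """
    kept = []
    total = 0
    for message in reversed(history):
        words = len(message["content"].split())
        if kept and total + words > max_words:
            break
        kept.append(message)
        total += words
    return list(reversed(kept))


history = [
    {"role": "user", "content": "one two three"},
    {"role": "assistant", "content": "four five"},
    {"role": "user", "content": "six"},
]
trimmed = trim_history(history, max_words=3)
print([m["role"] for m in trimmed])  # ['assistant', 'user']
```

With a budget of three words, the oldest message is dropped while the two newest ones survive in their original order.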

The _build_prompt method formats the conversation history into a single prompt string that the model can process. This includes a system prompt that sets the model's behavior, followed by all previous user and assistant messages.
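To see the exact text the model receives, the same formatting logic can be run as a standalone function. This is a sketch that mirrors the method above; the free function build_prompt and the sample conversation are made up for this example.

```python
def build_prompt(system_prompt, history):
    """Format a system prompt and message history into one prompt string."""
    parts = [system_prompt, "\n\n"]
    for message in history:
        role = message["role"].capitalize()
        parts.append(f"{role}: {message['content']}\n\n")
    parts.append("Assistant:")
    return "".join(parts)


history = [
    {"role": "user", "content": "What is Whisper?"},
    {"role": "assistant", "content": "A speech recognition model."},
    {"role": "user", "content": "Who released it?"},
]
prompt = build_prompt("You are a helpful AI assistant.", history)
print(prompt)
```

The printed prompt starts with the system prompt, lists each turn as "User: ..." or "Assistant: ..." separated by blank lines, and ends with a bare "Assistant:" so that the model continues from there.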

The n_gpu_layers parameter controls GPU acceleration. Setting it to 0 runs entirely on CPU. Setting it to a positive number offloads that many transformer layers to the GPU, and setting it to -1 offloads all layers. Offloading can significantly speed up generation on systems with capable graphics cards.

For users who prefer using Hugging Face transformers, here is an alternative implementation:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

class HuggingFaceLanguageModel(LanguageModelInterface):
    def __init__(self, model_name, device='cpu'):
        """
        Initialize Hugging Face language model.

        Args:
            model_name: Name or path of model on Hugging Face
            device: Computing device (cpu, cuda, mps)
        """
        print(f"Loading {model_name} on {device}...")
        self.device = device
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype=torch.float16 if device == 'cuda' else torch.float32,
            low_cpu_mem_usage=True
        ).to(device)
        self.conversation_history = []

        # Set pad token if not defined
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token

    def generate_response(self, prompt, max_tokens=500, temperature=0.7):
        """Generate response using Hugging Face model"""
        # Add to conversation history
        self.conversation_history.append({
            "role": "user",
            "content": prompt
        })

        # Format conversation for model
        formatted_prompt = self._format_conversation()

        # Tokenize input
        inputs = self.tokenizer(
            formatted_prompt,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=2048
        ).to(self.device)

        print("Generating response...")

        # Generate response
        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=max_tokens,
                temperature=temperature,
                do_sample=True,
                top_p=0.9,
                pad_token_id=self.tokenizer.pad_token_id
            )

        # Decode only the newly generated tokens, skipping the prompt
        response_text = self.tokenizer.decode(
            outputs[0][inputs['input_ids'].shape[1]:],
            skip_special_tokens=True
        ).strip()

        # Add to history
        self.conversation_history.append({
            "role": "assistant",
            "content": response_text
        })
        return response_text

    def _format_conversation(self):
        """Format conversation history for model"""
        formatted = []
        for message in self.conversation_history:
            role = message['role']
            content = message['content']
            formatted.append(f"{role}: {content}")
        formatted.append("assistant:")
        return "\n".join(formatted)

    def reset_conversation(self):
        """Clear conversation history"""
        self.conversation_history = []

This Hugging Face implementation provides similar functionality but uses the transformers library instead of llama-cpp. It supports a wider variety of models but may be slower for CPU inference compared to llama-cpp's optimized implementation.

The torch.no_grad() context manager disables gradient computation, which is only needed during training. This reduces memory usage and speeds up inference.

HANDLING DIFFERENT GPU ARCHITECTURES

Supporting multiple GPU architectures requires careful device detection and configuration. Different hardware vendors use different libraries and APIs for GPU acceleration.

Here is a utility class that detects available hardware and configures the appropriate device:

import torch

class DeviceManager:
    """Manages device selection across different GPU architectures"""

    @staticmethod
    def get_optimal_device():
        """
        Detect and return the best available computing device.

        Returns:
            String indicating device type (cuda, mps, cpu)
        """
        # NVIDIA CUDA and AMD ROCm builds of PyTorch both report through torch.cuda
        if torch.cuda.is_available():
            gpu_name = torch.cuda.get_device_name(0)
            # torch.version.hip is set only on ROCm builds
            if getattr(torch.version, 'hip', None):
                print(f"Using AMD ROCm GPU: {gpu_name}")
            else:
                print(f"Using NVIDIA GPU: {gpu_name}")
            return 'cuda'

        # Check for Apple Metal Performance Shaders
        if hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
            print("Using Apple Metal Performance Shaders")
            return 'mps'

        # Fall back to CPU
        print("Using CPU (no GPU acceleration available)")
        return 'cpu'

    @staticmethod
    def get_gpu_layers_for_model(device, model_size_gb):
        """
        Determine optimal number of GPU layers based on available memory.

        Args:
            device: Device type string
            model_size_gb: Approximate model size in gigabytes

        Returns:
            Number of layers to offload to GPU
        """
        if device == 'cpu':
            return 0

        try:
            if device == 'cuda':
                # Get total GPU memory
                gpu_memory_gb = torch.cuda.get_device_properties(0).total_memory / (1024**3)

                # Reserve 2GB for other operations
                available_memory = gpu_memory_gb - 2.0

                # Estimate layers based on memory
                if available_memory >= model_size_gb:
                    return -1  # All layers
                elif available_memory >= model_size_gb * 0.5:
                    return 32  # Most layers
                elif available_memory >= model_size_gb * 0.25:
                    return 16  # Some layers
                else:
                    return 0  # CPU only
            elif device == 'mps':
                # Apple MPS typically has unified memory
                # Offload all layers as memory is shared
                return -1
        except Exception as e:
            print(f"Error detecting GPU memory: {e}")

        return 0

The DeviceManager class provides two key methods. The get_optimal_device method detects what hardware acceleration is available and returns the appropriate device string. The get_gpu_layers_for_model method determines how many model layers can fit in GPU memory, which is crucial for optimal performance with large language models.

For NVIDIA GPUs, we query the total GPU memory and estimate how many layers we can offload based on the model size. For Apple MPS, we can typically offload all layers because the GPU shares memory with the CPU.

AMD ROCm support needs no separate device string, because ROCm builds of PyTorch expose AMD GPUs through the same torch.cuda API. When torch.cuda.is_available() returns True, we check torch.version.hip, which is set only on ROCm builds, to report whether the GPU is an AMD or an NVIDIA device. In both cases the device string is 'cuda'.
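The fixed return values of 32 and 16 in get_gpu_layers_for_model implicitly assume a model with at least that many layers. As an alternative, the layer count can be scaled in proportion to the memory that is free. The sketch below is a hypothetical helper, not part of DeviceManager: it assumes all layers are the same size and that the caller knows the model's total layer count, for example from the model's metadata.

```python
def estimate_gpu_layers(total_memory_gb, model_size_gb, total_layers,
                        reserved_gb=2.0):
    """Estimate how many layers fit on the GPU, assuming equal-sized layers."""
    available = max(total_memory_gb - reserved_gb, 0.0)
    if available >= model_size_gb:
        return total_layers
    return int(total_layers * available / model_size_gb)


print(estimate_gpu_layers(24.0, 4.0, 32))  # 32
print(estimate_gpu_layers(4.0, 4.0, 32))   # 16
print(estimate_gpu_layers(1.5, 4.0, 32))   # 0
```

A 4 GB model fits entirely on a 24 GB card, half of it fits on a 4 GB card once 2 GB is reserved, and a card with less memory than the reserve gets no layers at all.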

CREATING THE MAIN VOICE ASSISTANT

Now we bring all components together into a cohesive voice assistant system. This main class orchestrates the speech recognition, language model, and text-to-speech components.

import time

class VoiceAssistant:
    def __init__(self,
                 whisper_model='base',
                 tts_model=None,
                 llm_type='local',
                 llm_model_path=None,
                 device=None):
        """
        Initialize the complete voice assistant system.

        Args:
            whisper_model: Whisper model size for speech recognition
            tts_model: TTS model name (None for default)
            llm_type: Type of LLM ('local' or 'huggingface')
            llm_model_path: Path to LLM model
            device: Computing device (None for auto-detection)
        """
        # Detect optimal device if not specified
        if device is None:
            device = DeviceManager.get_optimal_device()
        self.device = device

        # Initialize speech recognition
        self.speech_recognizer = SpeechRecognizer(
            model_size=whisper_model,
            device=device
        )

        # Initialize text-to-speech
        self.text_to_speech = TextToSpeech(
            model_name=tts_model,
            device=device
        )

        # Initialize language model
        if llm_type == 'local':
            if llm_model_path is None:
                raise ValueError("llm_model_path required for local LLM")
            gpu_layers = DeviceManager.get_gpu_layers_for_model(device, 4.0)
            self.language_model = LocalLanguageModel(
                model_path=llm_model_path,
                n_gpu_layers=gpu_layers
            )
        elif llm_type == 'huggingface':
            if llm_model_path is None:
                llm_model_path = 'gpt2'  # Default small model
            self.language_model = HuggingFaceLanguageModel(
                model_name=llm_model_path,
                device=device
            )
        else:
            raise ValueError(f"Unknown LLM type: {llm_type}")

        self.is_running = False
        print("Voice Assistant initialized successfully!")

The VoiceAssistant initialization creates instances of all three components and configures them to work together. The device parameter is automatically detected if not provided, ensuring the system uses the best available hardware.

Now we implement the core interaction loop:

    def process_voice_input(self):
        """
        Record audio, transcribe it, generate response, and speak it.

        Returns:
            Tuple of (user_text, assistant_response)
        """
        # Record and transcribe user speech
        audio_data = self.speech_recognizer.record_audio(
            duration=10,
            silence_threshold=0.01,
            silence_duration=2.0
        )

        if len(audio_data) == 0:
            print("No audio detected")
            return None, None

        # Apply noise reduction
        clean_audio = self.speech_recognizer.reduce_noise(audio_data)

        # Transcribe to text
        user_text = self.speech_recognizer.transcribe(clean_audio)

        if not user_text:
            print("Could not transcribe audio")
            return None, None

        # Generate response from language model
        assistant_response = self.language_model.generate_response(
            user_text,
            max_tokens=300,
            temperature=0.7
        )

        # Speak the response
        self.text_to_speech.speak(assistant_response)

        return user_text, assistant_response

The process_voice_input method implements a complete interaction cycle. It records audio from the microphone, applies noise reduction, transcribes the speech to text, generates a response using the language model, and speaks the response back to the user.
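The order of the stages can be exercised without a microphone, a model, or speakers by passing each stage in as a plain function. The sketch below is a standalone illustration of that flow, not the method above; run_turn and the lambda stand-ins are hypothetical.

```python
def run_turn(record, denoise, transcribe, generate, speak):
    """Run one interaction cycle using injected callables."""
    audio = record()
    if not audio:
        return None, None          # nothing was captured
    user_text = transcribe(denoise(audio))
    if not user_text:
        return None, None          # nothing could be transcribed
    response = generate(user_text)
    speak(response)
    return user_text, response


spoken = []
result = run_turn(
    record=lambda: [0.1, 0.2, 0.3],
    denoise=lambda audio: audio,
    transcribe=lambda audio: "what time is it",
    generate=lambda text: f"You asked: {text}",
    speak=spoken.append,
)
print(result)  # ('what time is it', 'You asked: what time is it')
print(spoken)  # ['You asked: what time is it']
```

Swapping a single stand-in, for example making record return an empty list, is enough to check each early-exit path in isolation.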

For continuous operation, we implement a run loop:

    def run_interactive(self):
        """
        Run the assistant in interactive mode.
        User can speak multiple times until they say 'exit' or 'quit'.
        """
        self.is_running = True

        print("\n" + "="*60)
        print("Voice Assistant is ready!")
        print("Speak your question or command.")
        print("Say 'exit' or 'quit' to stop.")
        print("="*60 + "\n")

        self.text_to_speech.speak(
            "Hello! I am your voice assistant. How can I help you today?"
        )

        while self.is_running:
            try:
                user_text, assistant_response = self.process_voice_input()

                if user_text is None:
                    continue

                # Check for exit commands. Whole words are compared so that
                # a word such as "quite" does not match "quit".
                words = {w.strip(".,!?;:'\"") for w in user_text.lower().split()}
                if words & {'exit', 'quit', 'goodbye', 'bye'}:
                    self.text_to_speech.speak(
                        "Goodbye! It was nice talking to you."
                    )
                    self.is_running = False
                    break

                print(f"\nYou: {user_text}")
                print(f"Assistant: {assistant_response}\n")

                # Small pause between interactions
                time.sleep(0.5)

            except KeyboardInterrupt:
                print("\nInterrupted by user")
                self.is_running = False
                break

            except Exception as e:
                print(f"Error during interaction: {e}")
                self.text_to_speech.speak(
                    "I encountered an error. Please try again."
                )

        print("\nVoice Assistant stopped.")

The run_interactive method creates a continuous conversation loop. The assistant greets the user, then repeatedly listens for input, processes it, and responds. The loop continues until the user says an exit command or presses Ctrl+C.
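Exit detection is easy to get subtly wrong: a substring test such as 'quit' in text also fires on the word "quite". A minimal sketch of whole-word matching follows; the names EXIT_WORDS and is_exit_command are hypothetical.

```python
EXIT_WORDS = {"exit", "quit", "goodbye", "bye"}


def is_exit_command(text):
    """Return True when any whole word in text is an exit word."""
    words = {word.strip(".,!?;:'\"") for word in text.lower().split()}
    return not EXIT_WORDS.isdisjoint(words)


print(is_exit_command("Okay, goodbye!"))          # True
print(is_exit_command("That is quite helpful."))  # False
```

Stripping punctuation matters here because transcriptions usually end sentences with a period or exclamation mark.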

Error handling is important for a robust system. We catch exceptions and provide feedback to the user rather than crashing.

We can also add a method for single-turn interactions:

    def ask_question(self, question_text):
        """
        Process a text question and return spoken response.

        Args:
            question_text: Question as text string

        Returns:
            Assistant's response text
        """
        print(f"Processing question: {question_text}")

        # Generate response
        response = self.language_model.generate_response(
            question_text,
            max_tokens=300,
            temperature=0.7
        )

        # Speak response
        self.text_to_speech.speak(response)

        return response

This ask_question method allows programmatic interaction with the assistant without requiring voice input. This is useful for testing and for applications where text input is more appropriate.

ADVANCED NOISE REDUCTION TECHNIQUES

While the basic noise reduction we implemented earlier works well for moderate background noise, more challenging environments require advanced techniques. Let us enhance our noise reduction capabilities.

Spectral subtraction is a classic technique that works by estimating the noise spectrum and subtracting it from the signal spectrum. We can implement a more sophisticated version:

import numpy as np
import librosa
import noisereduce as nr
from scipy.signal import wiener

class AdvancedNoiseReducer:
    """Advanced noise reduction using multiple techniques"""

    @staticmethod
    def spectral_subtraction(audio, sample_rate, noise_duration=0.5):
        """
        Apply spectral subtraction for noise reduction.

        Args:
            audio: Audio signal as numpy array
            sample_rate: Sample rate in Hz
            noise_duration: Duration of noise sample in seconds

        Returns:
            Noise-reduced audio
        """
        # Extract noise sample from beginning
        noise_samples = int(noise_duration * sample_rate)
        noise_segment = audio[:noise_samples]

        # Compute STFT of signal and noise
        stft_signal = librosa.stft(audio)
        stft_noise = librosa.stft(noise_segment)

        # Estimate noise power spectrum
        noise_power = np.mean(np.abs(stft_noise) ** 2, axis=1, keepdims=True)

        # Compute signal power spectrum
        signal_power = np.abs(stft_signal) ** 2

        # Subtract noise power with floor to prevent negative values
        clean_power = np.maximum(signal_power - noise_power, 0.1 * signal_power)

        # Reconstruct magnitude and phase
        magnitude = np.sqrt(clean_power)
        phase = np.angle(stft_signal)

        # Reconstruct STFT and inverse transform
        clean_stft = magnitude * np.exp(1j * phase)
        clean_audio = librosa.istft(clean_stft, length=len(audio))

        return clean_audio

    @staticmethod
    def wiener_filter(audio, sample_rate):
        """
        Apply Wiener filtering for noise reduction.

        Args:
            audio: Audio signal as numpy array
            sample_rate: Sample rate in Hz

        Returns:
            Filtered audio
        """
        # Apply Wiener filter
        filtered = wiener(audio)

        return filtered

    @staticmethod
    def combined_reduction(audio, sample_rate):
        """
        Apply multiple noise reduction techniques in sequence.

        Args:
            audio: Audio signal as numpy array
            sample_rate: Sample rate in Hz

        Returns:
            Noise-reduced audio
        """
        # First apply spectral subtraction
        audio = AdvancedNoiseReducer.spectral_subtraction(audio, sample_rate)

        # Then apply Wiener filtering
        audio = AdvancedNoiseReducer.wiener_filter(audio, sample_rate)

        # Finally apply basic noise reduction
        audio = nr.reduce_noise(y=audio, sr=sample_rate, stationary=True)

        return audio

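The spectral subtraction method works in the frequency domain. It converts the audio to a spectrogram using the Short-Time Fourier Transform, estimates the noise power spectrum from the first half second of the recording (which is assumed to contain only background noise, not speech), and subtracts this estimate from every frame of the signal. A floor of 10 percent of the original power keeps the result from going negative. The phase of the noisy signal is reused unchanged when the audio is reconstructed.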
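
To make the subtraction step concrete, here is a small worked sketch on made-up power values for three frequency bins. The helper name subtract_with_floor and the numbers are illustrative only and are not part of the assistant. It mirrors the np.maximum line in spectral_subtraction: wherever the noise estimate is larger than the signal power, the result is clamped to a fraction of the original power instead of dropping to zero or below.

```python
def subtract_with_floor(signal_power, noise_power, floor=0.1):
    """Subtract a per-bin noise estimate, never going below floor * signal."""
    return [max(s - n, floor * s) for s, n in zip(signal_power, noise_power)]


# Hypothetical power in three frequency bins of one STFT frame
signal_power = [4.0, 1.0, 0.5]
# Hypothetical noise estimate for the same three bins
noise_power = [1.0, 1.0, 1.0]

clean_power = subtract_with_floor(signal_power, noise_power)

# Bin 0: speech dominates, so the noise is simply subtracted.
# Bins 1 and 2: noise >= signal, so the floor (10% of the signal) is used.
print(clean_power)
```

The floor is what keeps spectral subtraction from producing isolated zeroed-out bins, which tend to be heard as "musical noise" artifacts.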

The Wiener filter is an optimal filter that minimizes the mean square error between the estimated clean signal and the true clean signal. It adapts to the local signal-to-noise ratio, providing more aggressive filtering in noisy regions and less filtering where the signal is strong.
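
The adaptive behavior described above can be sketched in pure Python. The function below is a simplified illustration, not scipy's implementation: the name local_wiener, the window size, and the edge handling (the window is clamped at the ends of the signal) are assumptions made for this example. It estimates a local mean and variance around each sample, pulls noise-dominated samples toward the local mean, and leaves high-variance samples mostly intact.

```python
def local_wiener(samples, window=3, noise_power=None):
    """Simplified 1-D Wiener filter based on local mean and variance."""
    n = len(samples)
    half = window // 2
    means, variances = [], []
    for i in range(n):
        # Clamp the window at the signal edges (an assumption of this sketch)
        lo, hi = max(0, i - half), min(n, i + half + 1)
        chunk = samples[lo:hi]
        mean = sum(chunk) / len(chunk)
        var = sum((x - mean) ** 2 for x in chunk) / len(chunk)
        means.append(mean)
        variances.append(var)

    if noise_power is None:
        # Without a noise estimate, use the average local variance
        noise_power = sum(variances) / n

    out = []
    for x, mean, var in zip(samples, means, variances):
        if var <= noise_power:
            # Noise-dominated region: keep only the local mean
            out.append(mean)
        else:
            # Signal-dominated region: gain approaches 1 as variance grows
            gain = 1.0 - noise_power / var
            out.append(mean + gain * (x - mean))
    return out


flat = local_wiener([0.5] * 8)
spiky = local_wiener([0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0])
print(flat)
print([round(v, 3) for v in spiky])
```

A constant input passes through unchanged, while the isolated spike is attenuated rather than removed, which is the trade-off the filter makes between smoothing noise and preserving signal.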

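The combined_reduction method applies all three techniques in sequence, which removes more noise than any single stage. Stacking filters can also distort speech, so it is worth comparing transcription accuracy with and without it before enabling it by default.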

OPTIMIZING FOR REAL-TIME PERFORMANCE

Real-time voice interaction requires careful optimization to minimize latency. Users expect responses within a second or two, which means every component must be as fast as possible.

For speech recognition, we can use streaming recognition instead of waiting for complete utterances:

class StreamingSpeechRecognizer(SpeechRecognizer):
    """Speech recognizer with streaming support for lower latency"""

    def __init__(self, model_size='base', device='cpu', language='en'):
        super().__init__(model_size, device, language)
        self.audio_buffer = []
        self.buffer_duration = 3.0  # Process every 3 seconds

    def stream_callback(self, indata, frames, time_info, status):
        """Callback for streaming audio"""
        # The third argument is named time_info so that it does not
        # shadow the time module.
        if status:
            print(f"Stream status: {status}")

        self.audio_buffer.append(indata.copy())

        # Check if we have enough audio to process
        total_samples = sum(len(chunk) for chunk in self.audio_buffer)
        buffer_seconds = total_samples / self.sample_rate

        if buffer_seconds >= self.buffer_duration:
            # Process accumulated audio as a 1-D array
            audio_data = np.concatenate(self.audio_buffer).flatten()
            self.audio_buffer = []

            # Transcribe in background thread
            threading.Thread(
                target=self._process_audio_chunk,
                args=(audio_data,)
            ).start()

    def _process_audio_chunk(self, audio_data):
        """Process audio chunk in background"""
        clean_audio = self.reduce_noise(audio_data)
        text = self.transcribe(clean_audio)

        if text:
            # Put result in queue for main thread
            self.audio_queue.put(text)

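This streaming recognizer processes audio in chunks rather than waiting for complete silence. It accumulates audio in a buffer and transcribes it every few seconds, allowing for lower latency in interactive applications. Because the chunk boundaries fall at fixed intervals, a word can be split across two chunks, so shorter buffers trade some accuracy for speed.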
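
The buffering logic can be tried out without a microphone or a Whisper model. The sketch below is a standalone illustration under assumed names (ChunkBuffer, push) that are not part of the assistant. It keeps a running sample count instead of re-summing the buffer on every callback, and releases a chunk once the configured duration has been collected.

```python
class ChunkBuffer:
    """Accumulate audio blocks and release them once enough time is buffered."""

    def __init__(self, sample_rate=16000, chunk_seconds=3.0):
        self.sample_rate = sample_rate
        self.chunk_samples = int(chunk_seconds * sample_rate)
        self.blocks = []
        self.buffered = 0  # running count of buffered samples

    def push(self, block):
        """Add one block; return a full chunk (list of samples) or None."""
        self.blocks.append(block)
        self.buffered += len(block)
        if self.buffered < self.chunk_samples:
            return None
        # Enough audio collected: flatten the blocks and start over
        chunk = [sample for b in self.blocks for sample in b]
        self.blocks, self.buffered = [], 0
        return chunk


buf = ChunkBuffer(sample_rate=16000, chunk_seconds=3.0)
block = [0.0] * 1600  # simulated 100 ms block, as a callback might deliver
results = [buf.push(block) for _ in range(30)]
ready = [r for r in results if r is not None]
print(len(ready), len(ready[0]))
```

In the real callback, the returned chunk would be handed to the background transcription thread, as _process_audio_chunk does above.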

For language model inference, we can use quantization to reduce model size and increase speed:

class QuantizedLanguageModel(LocalLanguageModel):
    """Language model with quantization for faster inference"""

    def __init__(self, model_path, n_ctx=2048, n_gpu_layers=0):
        """Initialize with quantized model"""
        print(f"Loading quantized model from {model_path}...")

        # The quantization level comes from the GGUF file itself,
        # so model_path should point to a quantized file (for example Q4_K_M)
        self.model = Llama(
            model_path=model_path,
            n_ctx=n_ctx,
            n_gpu_layers=n_gpu_layers,
            n_batch=512,  # Prompt tokens processed per batch
            verbose=False
        )

        self.conversation_history = []
        self.system_prompt = (
            "You are a helpful AI assistant. Provide clear, concise, "
            "and accurate responses to user questions."
        )

Quantization reduces the precision of model weights from 32-bit or 16-bit floating point to 8-bit or even 4-bit integers. This reduces memory usage and usually speeds up inference, at the cost of a small quality loss that grows as the bit width shrinks. With llama-cpp-python, the quantization level is a property of the GGUF file you load rather than a loader option, so choosing a quantized model means choosing a file whose name carries a suffix such as Q4_K_M (a 4-bit variant).
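
A quick back-of-the-envelope calculation shows why this matters. The sketch below estimates weight storage only, for a hypothetical 7-billion-parameter model. The function name is made up for this example, and real GGUF files are somewhat larger than these figures because quantized formats also store scaling factors and keep some tensors at higher precision.

```python
def estimate_weight_memory_gb(n_params, bits_per_weight):
    """Approximate weight storage in gigabytes (1 GB = 10**9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


n_params = 7e9  # hypothetical 7-billion-parameter model

sizes = {}
for label, bits in [("FP16", 16), ("8-bit", 8), ("4-bit", 4)]:
    sizes[label] = estimate_weight_memory_gb(n_params, bits)
    print(f"{label:>5}: {sizes[label]:.1f} GB")
```

Going from 16-bit to 4-bit weights cuts the estimate to a quarter, which is often the difference between a model that fits in GPU memory and one that does not. The context cache needs additional memory on top of the weights.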

HANDLING MULTIPLE LANGUAGES

Supporting multiple languages expands the usefulness of your voice assistant. Whisper supports over 90 languages, and many TTS models support multiple languages as well.

Here is an enhanced version that handles multiple languages:

class MultilingualVoiceAssistant(VoiceAssistant):
    """Voice assistant with multi-language support"""

    SUPPORTED_LANGUAGES = {
        'en': {'name': 'English', 'tts_model': 'tts_models/en/ljspeech/tacotron2-DDC'},
        'es': {'name': 'Spanish', 'tts_model': 'tts_models/es/mai/tacotron2-DDC'},
        'fr': {'name': 'French', 'tts_model': 'tts_models/fr/mai/tacotron2-DDC'},
        'de': {'name': 'German', 'tts_model': 'tts_models/de/thorsten/tacotron2-DDC'},
        'it': {'name': 'Italian', 'tts_model': 'tts_models/it/mai/tacotron2-DDC'},
        'pt': {'name': 'Portuguese', 'tts_model': 'tts_models/pt/cv/vits'},
        'zh': {'name': 'Chinese', 'tts_model': 'tts_models/zh-CN/baker/tacotron2-DDC'},
        'ja': {'name': 'Japanese', 'tts_model': 'tts_models/ja/kokoro/tacotron2-DDC'},
    }

    def __init__(self, language='en', **kwargs):
        """
        Initialize multilingual assistant.

        Args:
            language: Language code (en, es, fr, de, etc.)
            **kwargs: Additional arguments for VoiceAssistant
        """
        if language not in self.SUPPORTED_LANGUAGES:
            raise ValueError(f"Unsupported language: {language}")

        self.language = language
        lang_info = self.SUPPORTED_LANGUAGES[language]

        # Set TTS model for language if not specified
        if 'tts_model' not in kwargs:
            kwargs['tts_model'] = lang_info['tts_model']

        super().__init__(**kwargs)

        # Update speech recognizer language
        self.speech_recognizer.language = language

        print(f"Assistant configured for {lang_info['name']}")

    def change_language(self, new_language):
        """
        Change the assistant's language.

        Args:
            new_language: New language code
        """
        if new_language not in self.SUPPORTED_LANGUAGES:
            raise ValueError(f"Unsupported language: {new_language}")

        self.language = new_language
        lang_info = self.SUPPORTED_LANGUAGES[new_language]

        # Update speech recognizer
        self.speech_recognizer.language = new_language

        # Reload TTS model for new language
        self.text_to_speech = TextToSpeech(
            model_name=lang_info['tts_model'],
            device=self.device
        )

        # Reset conversation history
        self.language_model.reset_conversation()

        print(f"Language changed to {lang_info['name']}")

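The multilingual assistant maintains a dictionary of supported languages with their corresponding TTS models. When you change languages, it updates the speech recognizer, reloads the text-to-speech component with the matching model, and resets the conversation history. Model identifiers differ between Coqui TTS releases, so confirm each entry against the output of tts --list_models before relying on it.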

Whisper can automatically detect the language being spoken, which we can leverage:

    def auto_detect_language(self, audio_data):
        """
        Automatically detect the language of spoken audio.

        Args:
            audio_data: Audio samples as numpy array

        Returns:
            Detected language code
        """
        # Transcribe with language detection
        result = self.speech_recognizer.model.transcribe(
            audio_data,
            language=None  # Auto-detect
        )

        detected_language = result['language']
        print(f"Detected language: {detected_language}")

        return detected_language

This auto-detection capability allows the assistant to automatically adapt to the user's language without manual configuration.
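
One practical detail: the detected code may be a language that is missing from SUPPORTED_LANGUAGES, in which case change_language raises a ValueError. A small guard can map the detected code to a supported one or fall back to a default. The sketch below is one way to do this. The helper name resolve_language and the handling of region suffixes such as zh-CN are assumptions for illustration.

```python
# Language codes taken from the SUPPORTED_LANGUAGES dictionary above
SUPPORTED = {'en', 'es', 'fr', 'de', 'it', 'pt', 'zh', 'ja'}


def resolve_language(detected, supported=SUPPORTED, default='en'):
    """Return a supported language code, falling back to the default."""
    # Normalize case and drop any region suffix (for example 'zh-CN' -> 'zh')
    code = (detected or '').lower().split('-')[0]
    return code if code in supported else default


for candidate in ['fr', 'ko', 'zh-CN', None]:
    print(candidate, '->', resolve_language(candidate))
```

The assistant could then call change_language(resolve_language(detected)) only when the resolved code differs from the current language, which avoids reloading the TTS model unnecessarily.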

SAVING AND LOADING CONVERSATION HISTORY

For practical applications, you often want to save conversation history and resume later. Here is how to implement conversation persistence:

import json
import os
from datetime import datetime

class ConversationManager:
    """Manages conversation history with save/load capabilities"""

    def __init__(self, save_directory='conversations'):
        """
        Initialize conversation manager.

        Args:
            save_directory: Directory to store conversation files
        """
        self.save_directory = save_directory

        # Create directory if it doesn't exist
        os.makedirs(save_directory, exist_ok=True)

    def save_conversation(self, conversation_history, filename=None):
        """
        Save conversation history to file.

        Args:
            conversation_history: List of conversation messages
            filename: Optional filename (auto-generated if None)

        Returns:
            Path to saved file
        """
        if filename is None:
            timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
            filename = f"conversation_{timestamp}.json"

        filepath = os.path.join(self.save_directory, filename)

        # Prepare data for saving
        save_data = {
            'timestamp': datetime.now().isoformat(),
            'messages': conversation_history,
            'message_count': len(conversation_history)
        }

        # Save to JSON file
        with open(filepath, 'w', encoding='utf-8') as f:
            json.dump(save_data, f, indent=2, ensure_ascii=False)

        print(f"Conversation saved to {filepath}")

        return filepath

    def load_conversation(self, filename):
        """
        Load conversation history from file.

        Args:
            filename: Name of file to load

        Returns:
            List of conversation messages
        """
        filepath = os.path.join(self.save_directory, filename)

        with open(filepath, 'r', encoding='utf-8') as f:
            save_data = json.load(f)

        conversation_history = save_data['messages']

        print(f"Loaded conversation with {len(conversation_history)} messages")

        return conversation_history

    def list_conversations(self):
        """
        List all saved conversations.

        Returns:
            List of conversation filenames
        """
        files = [f for f in os.listdir(self.save_directory)
                 if f.endswith('.json')]

        return sorted(files, reverse=True)

The ConversationManager class provides methods to save conversations to JSON files and load them back. Each saved conversation includes a timestamp and the complete message history.
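
The save and load steps can be checked in isolation with a temporary directory. The sketch below assumes the message format used by the language model classes in this tutorial (dictionaries with role and content keys). The sample messages and the file name are invented for the example. Writing with ensure_ascii=False and reading back as UTF-8 keeps non-English text readable in the file.

```python
import json
import os
import tempfile

# Hypothetical two-message history containing non-ASCII characters
history = [
    {"role": "user", "content": "¿Qué hora es?"},
    {"role": "assistant", "content": "Son las tres."},
]

with tempfile.TemporaryDirectory() as save_dir:
    path = os.path.join(save_dir, "conversation_demo.json")

    # Save in the same layout that ConversationManager writes
    with open(path, "w", encoding="utf-8") as f:
        json.dump(
            {"messages": history, "message_count": len(history)},
            f, indent=2, ensure_ascii=False,
        )

    # Load it back and extract the message list
    with open(path, "r", encoding="utf-8") as f:
        restored = json.load(f)["messages"]

print(restored == history)
```

A round trip like this is a cheap regression test to keep around if you later change the file layout.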

We can integrate this into our voice assistant:

class PersistentVoiceAssistant(VoiceAssistant):
    """Voice assistant with conversation persistence"""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.conversation_manager = ConversationManager()

    def save_current_conversation(self, filename=None):
        """Save the current conversation"""
        return self.conversation_manager.save_conversation(
            self.language_model.conversation_history,
            filename
        )

    def load_conversation(self, filename):
        """Load a previous conversation"""
        history = self.conversation_manager.load_conversation(filename)
        self.language_model.conversation_history = history

        # Summarize loaded conversation
        message_count = len(history)
        self.text_to_speech.speak(
            f"Loaded conversation with {message_count} messages. "
            "We can continue where we left off."
        )

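This persistent assistant can save the current conversation on request and resume a previous one, making it more practical for extended use. Saving is not automatic: call save_current_conversation whenever you want to write the history to disk, for example when the interactive loop exits.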

IMPLEMENTING WAKE WORD DETECTION

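A professional voice assistant should activate only when you say a specific wake word, rather than recording and transcribing everything it hears. The microphone still stays open, but audio is only checked locally for the wake word until it is detected. This improves privacy and reduces false activations.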

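A dedicated wake word model such as Porcupine is the usual choice here. To keep the example free of access keys, the detector below uses a simple energy threshold as a stand-in: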

import time

import numpy as np
import sounddevice as sd

# Note: Porcupine (pvporcupine) requires an access key, so this example
# does not import it and uses a simple energy threshold instead.

class WakeWordDetector:
    """Detects wake words to activate the assistant"""

    def __init__(self, wake_word='computer', sensitivity=0.5):
        """
        Initialize wake word detector.

        Args:
            wake_word: Wake word to detect (computer, alexa, etc.)
            sensitivity: Detection sensitivity (0.0 to 1.0)
        """
        # For production, use a proper wake word detection library
        self.wake_word = wake_word
        self.sensitivity = sensitivity
        self.is_active = False

    def simple_energy_detector(self, audio_data, threshold=0.02):
        """
        Simple energy-based detection as alternative to Porcupine.
        Detects when audio energy exceeds threshold.

        Args:
            audio_data: Audio samples
            threshold: Energy threshold

        Returns:
            True if the audio is loud enough to count as an activation
        """
        energy = np.sqrt(np.mean(audio_data ** 2))
        return energy > threshold

    def listen_for_wake_word(self, sample_rate=16000, timeout=None):
        """
        Listen continuously for wake word.

        Args:
            sample_rate: Audio sample rate
            timeout: Optional timeout in seconds

        Returns:
            True when wake word detected, False on timeout
        """
        print(f"Listening for wake word '{self.wake_word}'...")

        start_time = time.time()

        def audio_callback(indata, frames, time_info, status):
            """Process audio for wake word detection"""
            if self.simple_energy_detector(indata):
                self.is_active = True

        with sd.InputStream(
            samplerate=sample_rate,
            channels=1,
            callback=audio_callback,
            dtype='float32'
        ):
            while not self.is_active:
                sd.sleep(100)

                if timeout and (time.time() - start_time) > timeout:
                    return False

        self.is_active = False
        return True

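This simple detector is energy-based: it fires on any sufficiently loud sound, not on a particular word, so the wake_word and sensitivity settings are stored but not yet used. For production use, you would want a more sophisticated system like Porcupine, Snowboy, or a custom neural network trained on your specific wake word.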
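
To get a feel for the threshold, the energy measure can be computed by hand. The sketch below reimplements the same root-mean-square calculation with the standard library only. The sample values are invented: one block resembles quiet background noise and the other a loud sound.

```python
import math


def rms_energy(samples):
    """Root-mean-square energy of a block of float samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


quiet = [0.005, -0.004, 0.006, -0.005]  # hypothetical background noise
loud = [0.5, -0.5, 0.5, -0.5]           # hypothetical loud sound
threshold = 0.02                        # same default as simple_energy_detector

triggered = [rms_energy(block) > threshold for block in (quiet, loud)]
print(rms_energy(quiet), rms_energy(loud), triggered)
```

Only the loud block crosses the threshold, and a door slam would do so just as well as the word "computer". That limitation is the reason a trained wake word model is preferable outside of experiments.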

We can integrate wake word detection into our assistant:

class WakeWordVoiceAssistant(VoiceAssistant):
    """Voice assistant with wake word activation"""

    def __init__(self, wake_word='computer', **kwargs):
        super().__init__(**kwargs)
        self.wake_word_detector = WakeWordDetector(wake_word=wake_word)

    def run_with_wake_word(self):
        """Run assistant with wake word activation"""
        print("\nVoice Assistant ready!")
        print(f"Say '{self.wake_word_detector.wake_word}' to activate.")
        print("Press Ctrl+C to exit.\n")

        try:
            while True:
                # Wait for wake word
                if self.wake_word_detector.listen_for_wake_word():
                    print("Wake word detected! Listening...")

                    # Play activation sound
                    self.text_to_speech.speak("Yes?")

                    # Process one interaction
                    user_text, response = self.process_voice_input()

                    if user_text:
                        print(f"\nYou: {user_text}")
                        print(f"Assistant: {response}\n")

                time.sleep(0.1)

        except KeyboardInterrupt:
            print("\nShutting down...")

This wake word assistant waits passively until it hears the wake word, then activates and processes one interaction before returning to passive listening mode.

COMPLETE RUNNING EXAMPLE

# voice_assistant_complete.py
# Complete production-ready voice-powered AI assistant
# Supports multiple GPU architectures and LLM backends

import whisper
import sounddevice as sd
import numpy as np
import noisereduce as nr
import queue
import threading
import torch
import time
import json
import os
import logging
from datetime import datetime
from abc import ABC, abstractmethod
from TTS.api import TTS
from llama_cpp import Llama
from transformers import AutoModelForCausalLM, AutoTokenizer
import librosa
from scipy.signal import wiener
from scipy.io import wavfile
import argparse
import sys


# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('voice_assistant.log'),
        logging.StreamHandler(sys.stdout)
    ]
)
logger = logging.getLogger(__name__)


class DeviceManager:
    """Manages device selection across different GPU architectures"""
    
    @staticmethod
    def get_optimal_device():
        """
        Detect and return the best available computing device.
        
        Returns:
            String indicating device type (cuda, mps, cpu)
        """
        if torch.cuda.is_available():
            device = 'cuda'
            gpu_name = torch.cuda.get_device_name(0)
            logger.info(f"Using NVIDIA GPU: {gpu_name}")
            return device
        
        if hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
            device = 'mps'
            logger.info("Using Apple Metal Performance Shaders")
            return device
        
        import platform
        if platform.system() == 'Linux':
            try:
                import subprocess
                result = subprocess.run(
                    ['rocm-smi'],
                    capture_output=True,
                    text=True,
                    timeout=2
                )
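                if result.returncode == 0:
                    # ROCm builds of PyTorch expose AMD GPUs through
                    # torch.cuda, which was already checked above.
                    logger.warning(
                        "AMD GPU found by rocm-smi, but this PyTorch build "
                        "cannot use it; falling back to CPU"
                    )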
            except (FileNotFoundError, subprocess.TimeoutExpired):
                pass
        
        logger.info("Using CPU (no GPU acceleration available)")
        return 'cpu'
    
    @staticmethod
    def get_gpu_layers_for_model(device, model_size_gb):
        """
        Determine optimal number of GPU layers based on available memory.
        
        Args:
            device: Device type string
            model_size_gb: Approximate model size in gigabytes
            
        Returns:
            Number of layers to offload to GPU
        """
        if device == 'cpu':
            return 0
        
        try:
            if device == 'cuda':
                gpu_memory_gb = torch.cuda.get_device_properties(0).total_memory / (1024**3)
                available_memory = gpu_memory_gb - 2.0
                
                if available_memory >= model_size_gb:
                    return -1
                elif available_memory >= model_size_gb * 0.5:
                    return 32
                elif available_memory >= model_size_gb * 0.25:
                    return 16
                else:
                    return 0
            
            elif device == 'mps':
                return -1
            
        except Exception as e:
            logger.error(f"Error detecting GPU memory: {e}")
            return 0
        
        return 0


class AdvancedNoiseReducer:
    """Advanced noise reduction using multiple techniques"""
    
    @staticmethod
    def spectral_subtraction(audio, sample_rate, noise_duration=0.5):
        """
        Apply spectral subtraction for noise reduction.
        
        Args:
            audio: Audio signal as numpy array
            sample_rate: Sample rate in Hz
            noise_duration: Duration of noise sample in seconds
            
        Returns:
            Noise-reduced audio
        """
        try:
            noise_samples = int(noise_duration * sample_rate)
            if noise_samples >= len(audio):
                noise_samples = len(audio) // 2
            
            noise_segment = audio[:noise_samples]
            
            stft_signal = librosa.stft(audio)
            stft_noise = librosa.stft(noise_segment)
            
            noise_power = np.mean(np.abs(stft_noise) ** 2, axis=1, keepdims=True)
            signal_power = np.abs(stft_signal) ** 2
            
            clean_power = np.maximum(signal_power - noise_power, 0.1 * signal_power)
            
            magnitude = np.sqrt(clean_power)
            phase = np.angle(stft_signal)
            
            clean_stft = magnitude * np.exp(1j * phase)
            clean_audio = librosa.istft(clean_stft, length=len(audio))
            
            return clean_audio
        except Exception as e:
            logger.error(f"Error in spectral subtraction: {e}")
            return audio
    
    @staticmethod
    def wiener_filter(audio, sample_rate):
        """
        Apply Wiener filtering for noise reduction.
        
        Args:
            audio: Audio signal as numpy array
            sample_rate: Sample rate in Hz
            
        Returns:
            Filtered audio
        """
        try:
            filtered = wiener(audio)
            return filtered
        except Exception as e:
            logger.error(f"Error in Wiener filtering: {e}")
            return audio
    
    @staticmethod
    def combined_reduction(audio, sample_rate):
        """
        Apply multiple noise reduction techniques in sequence.
        
        Args:
            audio: Audio signal as numpy array
            sample_rate: Sample rate in Hz
            
        Returns:
            Noise-reduced audio
        """
        try:
            audio = AdvancedNoiseReducer.spectral_subtraction(audio, sample_rate)
            audio = AdvancedNoiseReducer.wiener_filter(audio, sample_rate)
            audio = nr.reduce_noise(y=audio, sr=sample_rate, stationary=True)
            return audio
        except Exception as e:
            logger.error(f"Error in combined noise reduction: {e}")
            return audio


class SpeechRecognizer:
    """Handles speech-to-text conversion using Whisper"""
    
    def __init__(self, model_size='base', device='cpu', language='en', 
                 use_advanced_noise_reduction=False):
        """
        Initialize the speech recognizer with a Whisper model.
        
        Args:
            model_size: Size of Whisper model (tiny, base, small, medium, large)
            device: Computing device (cpu, cuda, mps)
            language: Language code for transcription
            use_advanced_noise_reduction: Whether to use advanced noise reduction
        """
        logger.info(f"Loading Whisper {model_size} model on {device}...")
        
        try:
            self.model = whisper.load_model(model_size, device=device)
            self.device = device
            self.language = language
            self.sample_rate = 16000
            self.channels = 1
            self.audio_queue = queue.Queue()
            self.is_recording = False
            self.use_advanced_noise_reduction = use_advanced_noise_reduction
            
            logger.info("Speech recognizer initialized successfully")
        except Exception as e:
            logger.error(f"Failed to initialize speech recognizer: {e}")
            raise
    
    def record_audio(self, duration=10, silence_threshold=0.01, silence_duration=2.0):
        """
        Record audio from microphone with automatic silence detection.
        
        Args:
            duration: Maximum recording duration in seconds
            silence_threshold: Energy threshold below which audio is considered silence
            silence_duration: Seconds of silence before stopping recording
            
        Returns:
            numpy array containing audio samples
        """
        logger.info("Starting audio recording...")
        
        audio_buffer = []
        silence_counter = 0
        silence_samples = int(silence_duration * self.sample_rate)
        has_speech = False
        
        def audio_callback(indata, frames, time_info, status):
            """Called for each audio block from the microphone"""
            if status:
                logger.warning(f"Audio status: {status}")
            audio_buffer.append(indata.copy())
        
        try:
            with sd.InputStream(
                samplerate=self.sample_rate,
                channels=self.channels,
                callback=audio_callback,
                dtype='float32'
            ):
                max_samples = int(duration * self.sample_rate)
                
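                # Count captured samples directly: sd.default.blocksize is 0
                # unless set explicitly, so it cannot measure recorded audio.
                poll_ms = 100
                poll_samples = int(self.sample_rate * poll_ms / 1000)
                
                while sum(len(block) for block in audio_buffer) < max_samples:
                    sd.sleep(poll_ms)
                    
                    if len(audio_buffer) > 0:
                        recent_audio = np.concatenate(audio_buffer[-10:])
                        energy = np.sqrt(np.mean(recent_audio**2))
                        
                        if energy > silence_threshold:
                            has_speech = True
                            silence_counter = 0
                        elif has_speech:
                            # Each poll adds roughly 100 ms of silence
                            silence_counter += poll_samples
                            if silence_counter >= silence_samples:
                                break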
            
            if not audio_buffer:
                logger.warning("No audio captured")
                return np.array([])
            
            # Flatten (frames, 1) blocks into a 1-D array for noise reduction
            audio_data = np.concatenate(audio_buffer).flatten()
            duration_recorded = len(audio_data) / self.sample_rate
            logger.info(f"Recording complete. Captured {duration_recorded:.2f} seconds")
            
            return audio_data
            
        except Exception as e:
            logger.error(f"Error during audio recording: {e}")
            return np.array([])
    
    def reduce_noise(self, audio_data):
        """
        Apply noise reduction to audio data.
        
        Args:
            audio_data: numpy array of audio samples
            
        Returns:
            Noise-reduced audio array
        """
        if len(audio_data) == 0:
            return audio_data
        
        try:
            if self.use_advanced_noise_reduction:
                logger.info("Applying advanced noise reduction...")
                reduced_noise = AdvancedNoiseReducer.combined_reduction(
                    audio_data,
                    self.sample_rate
                )
            else:
                logger.info("Applying basic noise reduction...")
                noise_sample_length = min(int(0.5 * self.sample_rate), len(audio_data))
                noise_sample = audio_data[:noise_sample_length]
                
                reduced_noise = nr.reduce_noise(
                    y=audio_data,
                    sr=self.sample_rate,
                    y_noise=noise_sample,
                    stationary=False,
                    prop_decrease=0.8
                )
            
            return reduced_noise
            
        except Exception as e:
            logger.error(f"Error during noise reduction: {e}")
            return audio_data
    
    def transcribe(self, audio_data):
        """
        Transcribe audio data to text using Whisper.
        
        Args:
            audio_data: numpy array of audio samples
            
        Returns:
            Transcribed text string
        """
        if len(audio_data) == 0:
            return ""
        
        try:
            if audio_data.dtype != np.float32:
                audio_data = audio_data.astype(np.float32)
            
            if len(audio_data.shape) > 1:
                audio_data = audio_data.mean(axis=1)
            
            max_val = np.abs(audio_data).max()
            if max_val > 0:
                audio_data = audio_data / max_val
            
            logger.info("Transcribing audio...")
            result = self.model.transcribe(
                audio_data,
                language=self.language,
                fp16=(self.device == 'cuda')
            )
            
            transcribed_text = result['text'].strip()
            logger.info(f"Transcribed: {transcribed_text}")
            
            return transcribed_text
            
        except Exception as e:
            logger.error(f"Error during transcription: {e}")
            return ""


class TextToSpeech:
    """Handles text-to-speech conversion using Coqui TTS"""
    
    def __init__(self, model_name=None, device='cpu'):
        """
        Initialize text-to-speech engine.
        
        Args:
            model_name: Name of TTS model to use (None for default)
            device: Computing device (cpu, cuda, mps)
        """
        logger.info(f"Loading TTS model on {device}...")
        
        try:
            if model_name is None:
                model_name = "tts_models/en/ljspeech/tacotron2-DDC"
            
            self.tts = TTS(
                model_name=model_name,
                progress_bar=False,
                gpu=(device == 'cuda')
            )
            self.device = device
            self.sample_rate = 22050
            
            logger.info("TTS engine initialized successfully")
            
        except Exception as e:
            logger.error(f"Failed to initialize TTS engine: {e}")
            raise
    
    def synthesize(self, text):
        """
        Convert text to speech audio.
        
        Args:
            text: String of text to synthesize
            
        Returns:
            numpy array containing audio samples
        """
        if not text or len(text.strip()) == 0:
            return np.array([])
        
        try:
            logger.info(f"Synthesizing speech for: {text[:50]}...")
            
            audio_data = self.tts.tts(text=text)
            
            if not isinstance(audio_data, np.ndarray):
                audio_data = np.array(audio_data)
            
            return audio_data
            
        except Exception as e:
            logger.error(f"Error during speech synthesis: {e}")
            return np.array([])
    
    def play_audio(self, audio_data):
        """
        Play audio through speakers.
        
        Args:
            audio_data: numpy array of audio samples
        """
        if len(audio_data) == 0:
            return
        
        try:
            logger.info("Playing audio...")
            
            if audio_data.dtype != np.float32:
                audio_data = audio_data.astype(np.float32)
            
            max_val = np.abs(audio_data).max()
            if max_val > 1.0:
                audio_data = audio_data / max_val
            
            sd.play(audio_data, samplerate=self.sample_rate)
            sd.wait()
            
            logger.info("Playback complete")
            
        except Exception as e:
            logger.error(f"Error during audio playback: {e}")
    
    def speak(self, text):
        """
        Synthesize text and play it immediately.
        
        Args:
            text: String of text to speak
        """
        audio_data = self.synthesize(text)
        self.play_audio(audio_data)


class LanguageModelInterface(ABC):
    """Abstract base class for language model implementations"""
    
    @abstractmethod
    def generate_response(self, prompt, max_tokens=500, temperature=0.7):
        """
        Generate a response to the given prompt.
        
        Args:
            prompt: Input text to respond to
            max_tokens: Maximum length of response
            temperature: Sampling temperature; higher values give more varied output (typically 0.0 to 1.0)
            
        Returns:
            Generated response text
        """
        pass
    
    @abstractmethod
    def reset_conversation(self):
        """Reset conversation history"""
        pass


class LocalLanguageModel(LanguageModelInterface):
    """Language model using llama-cpp-python for local inference"""
    
    def __init__(self, model_path, n_ctx=2048, n_gpu_layers=0):
        """
        Initialize local language model.
        
        Args:
            model_path: Path to GGUF model file
            n_ctx: Context window size
            n_gpu_layers: Number of layers to offload to GPU (0 for CPU only)
        """
        logger.info(f"Loading local model from {model_path}...")
        
        try:
            self.model = Llama(
                model_path=model_path,
                n_ctx=n_ctx,
                n_gpu_layers=n_gpu_layers,
                verbose=False
            )
            
            self.conversation_history = []
            self.system_prompt = (
                "You are a helpful AI assistant. Provide clear, concise, "
                "and accurate responses to user questions. Keep responses "
                "brief and conversational for voice interaction."
            )
            
            logger.info("Local language model initialized successfully")
            
        except Exception as e:
            logger.error(f"Failed to initialize local language model: {e}")
            raise
    
    def generate_response(self, prompt, max_tokens=500, temperature=0.7):
        """Generate response using local model"""
        
        try:
            self.conversation_history.append({
                "role": "user",
                "content": prompt
            })
            
            full_prompt = self._build_prompt()
            
            logger.info("Generating response from local model...")
            
            response = self.model(
                full_prompt,
                max_tokens=max_tokens,
                temperature=temperature,
                stop=["User:", "Assistant:", "\n\n"],
                echo=False
            )
            
            response_text = response['choices'][0]['text'].strip()
            
            self.conversation_history.append({
                "role": "assistant",
                "content": response_text
            })
            
            logger.info(f"Generated response: {response_text[:100]}...")
            
            return response_text
            
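        except Exception as e:
            logger.error(f"Error generating response: {e}")
            # Drop the unanswered user turn so a failed call does not leave
            # two consecutive user messages in the history.
            if (self.conversation_history
                    and self.conversation_history[-1]["role"] == "user"):
                self.conversation_history.pop()
            return "I apologize, but I encountered an error generating a response."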
    
    def _build_prompt(self):
        """Build prompt from conversation history"""
        prompt_parts = [self.system_prompt, "\n\n"]
        
        for message in self.conversation_history:
            role = message['role'].capitalize()
            content = message['content']
            prompt_parts.append(f"{role}: {content}\n\n")
        
        prompt_parts.append("Assistant:")
        
        return "".join(prompt_parts)
    
    def reset_conversation(self):
        """Clear conversation history"""
        self.conversation_history = []
        logger.info("Conversation history reset")


class HuggingFaceLanguageModel(LanguageModelInterface):
    """Language model using Hugging Face transformers"""
    
    def __init__(self, model_name, device='cpu'):
        """
        Initialize Hugging Face language model.
        
        Args:
            model_name: Name or path of model on Hugging Face
            device: Computing device (cpu, cuda, mps)
        """
        logger.info(f"Loading {model_name} on {device}...")
        
        try:
            self.device = device
            self.tokenizer = AutoTokenizer.from_pretrained(model_name)
            self.model = AutoModelForCausalLM.from_pretrained(
                model_name,
                torch_dtype=torch.float16 if device == 'cuda' else torch.float32,
                low_cpu_mem_usage=True
            ).to(device)
            
            self.conversation_history = []
            
            if self.tokenizer.pad_token is None:
                self.tokenizer.pad_token = self.tokenizer.eos_token
            
            logger.info("Hugging Face language model initialized successfully")
            
        except Exception as e:
            logger.error(f"Failed to initialize Hugging Face model: {e}")
            raise
    
    def generate_response(self, prompt, max_tokens=500, temperature=0.7):
        """Generate response using Hugging Face model"""
        
        try:
            self.conversation_history.append({
                "role": "user",
                "content": prompt
            })
            
            formatted_prompt = self._format_conversation()
            
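            # Leave room for the reply inside the model's context window
            # (gpt2, for example, only supports 1024 positions).
            context_size = getattr(
                self.model.config, "max_position_embeddings", 2048
            )
            max_prompt_tokens = max(1, min(2048, context_size - max_tokens))
            
            # Truncate from the left so that an overlong history loses its
            # oldest turns rather than the latest question and the trailing
            # "assistant:" cue.
            self.tokenizer.truncation_side = "left"
            inputs = self.tokenizer(
                formatted_prompt,
                return_tensors="pt",
                padding=True,
                truncation=True,
                max_length=max_prompt_tokens
            ).to(self.device)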
            
            logger.info("Generating response from Hugging Face model...")
            
            with torch.no_grad():
                outputs = self.model.generate(
                    **inputs,
                    max_new_tokens=max_tokens,
                    temperature=temperature,
                    do_sample=True,
                    top_p=0.9,
                    pad_token_id=self.tokenizer.pad_token_id
                )
            
            response_text = self.tokenizer.decode(
                outputs[0][inputs['input_ids'].shape[1]:],
                skip_special_tokens=True
            ).strip()
            
            self.conversation_history.append({
                "role": "assistant",
                "content": response_text
            })
            
            logger.info(f"Generated response: {response_text[:100]}...")
            
            return response_text
            
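        except Exception as e:
            logger.error(f"Error generating response: {e}")
            # Drop the unanswered user turn so a failed call does not leave
            # two consecutive user messages in the history.
            if (self.conversation_history
                    and self.conversation_history[-1]["role"] == "user"):
                self.conversation_history.pop()
            return "I apologize, but I encountered an error generating a response."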
    
    def _format_conversation(self):
        """Format conversation history for model"""
        formatted = []
        for message in self.conversation_history:
            role = message['role']
            content = message['content']
            formatted.append(f"{role}: {content}")
        formatted.append("assistant:")
        return "\n".join(formatted)
    
    def reset_conversation(self):
        """Clear conversation history"""
        self.conversation_history = []
        logger.info("Conversation history reset")


class ConversationManager:
    """Manages conversation history with save/load capabilities"""
    
    def __init__(self, save_directory='conversations'):
        """
        Initialize conversation manager.
        
        Args:
            save_directory: Directory to store conversation files
        """
        self.save_directory = save_directory
        os.makedirs(save_directory, exist_ok=True)
        logger.info(f"Conversation manager initialized with directory: {save_directory}")
    
    def save_conversation(self, conversation_history, filename=None):
        """
        Save conversation history to file.
        
        Args:
            conversation_history: List of conversation messages
            filename: Optional filename (auto-generated if None)
            
        Returns:
            Path to saved file
        """
        try:
            if filename is None:
                timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
                filename = f"conversation_{timestamp}.json"
            
            filepath = os.path.join(self.save_directory, filename)
            
            save_data = {
                'timestamp': datetime.now().isoformat(),
                'messages': conversation_history,
                'message_count': len(conversation_history)
            }
            
            with open(filepath, 'w', encoding='utf-8') as f:
                json.dump(save_data, f, indent=2, ensure_ascii=False)
            
            logger.info(f"Conversation saved to {filepath}")
            return filepath
            
        except Exception as e:
            logger.error(f"Error saving conversation: {e}")
            return None
    
    def load_conversation(self, filename):
        """
        Load conversation history from file.
        
        Args:
            filename: Name of file to load
            
        Returns:
            List of conversation messages
        """
        try:
            filepath = os.path.join(self.save_directory, filename)
            
            with open(filepath, 'r', encoding='utf-8') as f:
                save_data = json.load(f)
            
            conversation_history = save_data['messages']
            
            logger.info(f"Loaded conversation with {len(conversation_history)} messages")
            return conversation_history
            
        except Exception as e:
            logger.error(f"Error loading conversation: {e}")
            return []
    
    def list_conversations(self):
        """
        List all saved conversations.
        
        Returns:
            List of conversation filenames
        """
        try:
            files = [f for f in os.listdir(self.save_directory) 
                    if f.endswith('.json')]
            return sorted(files, reverse=True)
        except Exception as e:
            logger.error(f"Error listing conversations: {e}")
            return []


class VoiceAssistant:
    """Complete voice-powered AI assistant system"""
    
    def __init__(self,
                 whisper_model='base',
                 tts_model=None,
                 llm_type='local',
                 llm_model_path=None,
                 device=None,
                 use_advanced_noise_reduction=False,
                 save_conversations=True):
        """
        Initialize the complete voice assistant system.
        
        Args:
            whisper_model: Whisper model size for speech recognition
            tts_model: TTS model name (None for default)
            llm_type: Type of LLM ('local' or 'huggingface')
            llm_model_path: Path to LLM model
            device: Computing device (None for auto-detection)
            use_advanced_noise_reduction: Whether to use advanced noise reduction
            save_conversations: Whether to save conversation history
        """
        logger.info("Initializing Voice Assistant...")
        
        if device is None:
            device = DeviceManager.get_optimal_device()
        
        self.device = device
        self.save_conversations = save_conversations
        
        try:
            self.speech_recognizer = SpeechRecognizer(
                model_size=whisper_model,
                device=device,
                use_advanced_noise_reduction=use_advanced_noise_reduction
            )
            
            self.text_to_speech = TextToSpeech(
                model_name=tts_model,
                device=device
            )
            
            if llm_type == 'local':
                if llm_model_path is None:
                    raise ValueError("llm_model_path required for local LLM")
                
                gpu_layers = DeviceManager.get_gpu_layers_for_model(device, 4.0)
                self.language_model = LocalLanguageModel(
                    model_path=llm_model_path,
                    n_gpu_layers=gpu_layers
                )
            elif llm_type == 'huggingface':
                if llm_model_path is None:
                    llm_model_path = 'gpt2'
                
                self.language_model = HuggingFaceLanguageModel(
                    model_name=llm_model_path,
                    device=device
                )
            else:
                raise ValueError(f"Unknown LLM type: {llm_type}")
            
            if save_conversations:
                self.conversation_manager = ConversationManager()
            
            self.is_running = False
            
            logger.info("Voice Assistant initialized successfully!")
            
        except Exception as e:
            logger.error(f"Failed to initialize Voice Assistant: {e}")
            raise
    
    def process_voice_input(self):
        """
        Record audio, transcribe it, generate response, and speak it.
        
        Returns:
            Tuple of (user_text, assistant_response)
        """
        try:
            audio_data = self.speech_recognizer.record_audio(
                duration=10,
                silence_threshold=0.01,
                silence_duration=2.0
            )
            
            if len(audio_data) == 0:
                logger.warning("No audio detected")
                return None, None
            
            clean_audio = self.speech_recognizer.reduce_noise(audio_data)
            
            user_text = self.speech_recognizer.transcribe(clean_audio)
            
            if not user_text:
                logger.warning("Could not transcribe audio")
                return None, None
            
            assistant_response = self.language_model.generate_response(
                user_text,
                max_tokens=300,
                temperature=0.7
            )
            
            self.text_to_speech.speak(assistant_response)
            
            return user_text, assistant_response
            
        except Exception as e:
            logger.error(f"Error processing voice input: {e}")
            return None, None
    
    def run_interactive(self):
        """
        Run the assistant in interactive mode.
        User can speak multiple times until they say 'exit' or 'quit'.
        """
        self.is_running = True
        
        print("\n" + "=" * 60)
        print("Voice Assistant is ready!")
        print("Speak your question or command.")
        print("Say 'exit' or 'quit' to stop.")
        print("=" * 60 + "\n")
        
        try:
            self.text_to_speech.speak(
                "Hello! I am your voice assistant. How can I help you today?"
            )
            
            while self.is_running:
                try:
                    user_text, assistant_response = self.process_voice_input()
                    
                    if user_text is None:
                        continue
                    
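                    # Compare whole words so that, for example, "quite"
                    # does not trigger the "quit" command.
                    user_text_lower = user_text.lower().strip()
                    spoken_words = {
                        word.strip(".,!?;:\"'") for word in user_text_lower.split()
                    }
                    if spoken_words & {'exit', 'quit', 'goodbye', 'bye'}: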
                        self.text_to_speech.speak(
                            "Goodbye! It was nice talking to you."
                        )
                        self.is_running = False
                        break
                    
                    print(f"\nYou: {user_text}")
                    print(f"Assistant: {assistant_response}\n")
                    
                    time.sleep(0.5)
                    
                except KeyboardInterrupt:
                    logger.info("Interrupted by user")
                    self.is_running = False
                    break
                except Exception as e:
                    logger.error(f"Error during interaction: {e}")
                    self.text_to_speech.speak(
                        "I encountered an error. Please try again."
                    )
            
            if self.save_conversations:
                self.save_current_conversation()
            
            print("\nVoice Assistant stopped.")
            
        except Exception as e:
            logger.error(f"Error in interactive mode: {e}")
    
    def ask_question(self, question_text):
        """
        Process a text question and return spoken response.
        
        Args:
            question_text: Question as text string
            
        Returns:
            Assistant's response text
        """
        try:
            logger.info(f"Processing question: {question_text}")
            
            response = self.language_model.generate_response(
                question_text,
                max_tokens=300,
                temperature=0.7
            )
            
            self.text_to_speech.speak(response)
            
            return response
            
        except Exception as e:
            logger.error(f"Error processing question: {e}")
            return "I apologize, but I encountered an error."
    
    def save_current_conversation(self, filename=None):
        """Save the current conversation"""
        if not self.save_conversations:
            return None
        
        try:
            return self.conversation_manager.save_conversation(
                self.language_model.conversation_history,
                filename
            )
        except Exception as e:
            logger.error(f"Error saving conversation: {e}")
            return None
    
    def load_conversation(self, filename):
        """Load a previous conversation"""
        if not self.save_conversations:
            return
        
        try:
            history = self.conversation_manager.load_conversation(filename)
            self.language_model.conversation_history = history
            
            message_count = len(history)
            self.text_to_speech.speak(
                f"Loaded conversation with {message_count} messages. "
                "We can continue where we left off."
            )
        except Exception as e:
            logger.error(f"Error loading conversation: {e}")


def main():
    """Main entry point for the voice assistant application"""
    
    parser = argparse.ArgumentParser(
        description='Voice-Powered AI Assistant',
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  # Run with local LLM model
  python voice_assistant_complete.py --llm-type local --llm-model /path/to/model.gguf
  
  # Run with Hugging Face model
  python voice_assistant_complete.py --llm-type huggingface --llm-model gpt2
  
  # Use advanced noise reduction
  python voice_assistant_complete.py --llm-type local --llm-model model.gguf --advanced-noise
  
  # Specify device
  python voice_assistant_complete.py --llm-type local --llm-model model.gguf --device cuda
        """
    )
    
    parser.add_argument(
        '--whisper-model',
        type=str,
        default='base',
        choices=['tiny', 'base', 'small', 'medium', 'large'],
        help='Whisper model size (default: base)'
    )
    
    parser.add_argument(
        '--tts-model',
        type=str,
        default=None,
        help='TTS model name (default: automatic)'
    )
    
    parser.add_argument(
        '--llm-type',
        type=str,
        required=True,
        choices=['local', 'huggingface'],
        help='Type of language model to use'
    )
    
    parser.add_argument(
        '--llm-model',
        type=str,
        required=True,
        help='Path to LLM model file or Hugging Face model name'
    )
    
    parser.add_argument(
        '--device',
        type=str,
        default=None,
        choices=['cpu', 'cuda', 'mps'],
        help='Computing device (default: auto-detect)'
    )
    
    parser.add_argument(
        '--advanced-noise',
        action='store_true',
        help='Use advanced noise reduction techniques'
    )
    
    parser.add_argument(
        '--no-save',
        action='store_true',
        help='Do not save conversation history'
    )
    
    parser.add_argument(
        '--test-question',
        type=str,
        default=None,
        help='Test with a single question instead of interactive mode'
    )
    
    args = parser.parse_args()
    
    try:
        assistant = VoiceAssistant(
            whisper_model=args.whisper_model,
            tts_model=args.tts_model,
            llm_type=args.llm_type,
            llm_model_path=args.llm_model,
            device=args.device,
            use_advanced_noise_reduction=args.advanced_noise,
            save_conversations=not args.no_save
        )
        
        if args.test_question:
            print(f"\nTest Question: {args.test_question}")
            response = assistant.ask_question(args.test_question)
            print(f"Assistant Response: {response}\n")
        else:
            assistant.run_interactive()
        
    except Exception as e:
        logger.error(f"Fatal error: {e}")
        print(f"\nError: {e}")
        print("Please check the log file for details.")
        sys.exit(1)


if __name__ == '__main__':
    main()

DETAILED USAGE, CONFIGURATION, DEPLOYMENT, AND MANAGEMENT INSTRUCTIONS

SYSTEM REQUIREMENTS AND PREREQUISITES

Before you can use this voice assistant system, you need to ensure your computer meets certain requirements. The system requires Python version 3.8 or higher installed on your machine. You can verify your Python version by opening a terminal or command prompt and typing the command python --version or python3 --version. If Python is not installed or the version is too old, download and install the latest version from the official Python website at python.org.
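The same check can be made from inside Python, which is convenient at the top of a setup script. This is a minimal sketch; the helper name python_is_supported is an illustrative assumption and is not part of the tutorial code.

```python
import sys

MIN_VERSION = (3, 8)  # minimum version stated in this tutorial


def python_is_supported(version_info=None, minimum=MIN_VERSION):
    """Return True if the given (or running) interpreter meets the minimum."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= tuple(minimum)


if not python_is_supported():
    print(f"Python {MIN_VERSION[0]}.{MIN_VERSION[1]} or higher is required, "
          f"found {sys.version.split()[0]}")
```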

You will need a working microphone and speakers or headphones connected to your computer. The system uses your default audio input and output devices, so make sure these are properly configured in your operating system settings. Test your microphone by recording a voice memo or using your operating system's sound settings to verify it is working correctly.

For optimal performance, you should have at least 8 gigabytes of RAM available, though the system can run with less if you use smaller models. If you plan to use GPU acceleration, ensure you have the appropriate drivers installed for your graphics card. NVIDIA users need an up-to-date NVIDIA driver. The PyTorch packages from pytorch.org bundle their own CUDA runtime and cuDNN, but compiling llama-cpp-python with GPU support additionally requires a locally installed CUDA toolkit, version 11.7 or higher. AMD users on Linux need ROCm version 5.0 or higher. Apple Silicon Mac users need macOS 12.3 or higher for Metal Performance Shaders support.
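Once PyTorch is installed, a few lines of Python can report which of these accelerators is actually usable. The function below is a sketch under the assumption that PyTorch is the only detection mechanism; detect_device is a hypothetical name and is separate from the DeviceManager class used by the assistant.

```python
def detect_device():
    """Return 'cuda', 'mps', or 'cpu' depending on what PyTorch can use."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed yet, so only the CPU can be assumed.
        return "cpu"

    if torch.cuda.is_available():
        # ROCm builds of PyTorch for AMD GPUs also report through torch.cuda.
        return "cuda"

    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"

    return "cpu"


print(f"Best available device: {detect_device()}")
```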

Disk space requirements vary depending on which models you choose to use. The Whisper base model requires approximately 150 megabytes. Language models can range from 1 gigabyte for small models to over 50 gigabytes for large models. Text-to-speech models typically require 100 to 500 megabytes. Plan to have at least 10 gigabytes of free disk space for a basic setup with room for conversation history and logs.
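To check the free space before downloading models, the standard library is enough. A small sketch, with the 10 gigabyte figure taken from the paragraph above and the function names chosen only for illustration:

```python
import shutil

REQUIRED_GB = 10  # suggested free space for a basic setup


def free_space_gb(path="."):
    """Return free disk space at `path` in gigabytes (10**9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


def has_enough_space(path=".", required_gb=REQUIRED_GB):
    """Return True if at least `required_gb` gigabytes are free at `path`."""
    return free_space_gb(path) >= required_gb


print(f"Free space: {free_space_gb():.1f} GB, "
      f"enough for a basic setup: {has_enough_space()}")
```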

INSTALLATION PROCESS STEP BY STEP

Begin by creating a dedicated directory for your voice assistant project. Open a terminal or command prompt and navigate to where you want to store the project. Create a new directory with the command mkdir voice_assistant_project and then navigate into it with cd voice_assistant_project.

Create a Python virtual environment to isolate the project dependencies from your system Python installation. Run the command python -m venv venv on Windows or python3 -m venv venv on macOS and Linux. This creates a new virtual environment in a directory called venv.

Activate the virtual environment. On Windows, run the command venv\Scripts\activate. On macOS and Linux, run source venv/bin/activate. You should see the environment name appear in your command prompt, indicating the virtual environment is active.

Now install PyTorch, which is the foundation for most of the deep learning models used in this system. Visit pytorch.org and use their installation selector tool to get the exact command for your operating system and hardware. For example, on Windows with NVIDIA GPU support, you might run pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118. The plain command pip install torch torchvision torchaudio works for CPU use on any platform. On Linux it installs a CUDA-enabled build that still runs on the CPU, so if you want a smaller CPU-only download there, add --index-url https://download.pytorch.org/whl/cpu. On Apple Silicon Macs, the standard pip install torch torchvision torchaudio build includes Metal Performance Shaders support, which the assistant can use through the mps device.

Install the Whisper speech recognition library with the command pip install openai-whisper. This will download Whisper and its dependencies.

Install the Coqui TTS text-to-speech library with pip install TTS. Note that TTS is capitalized. This installation may take several minutes as it downloads various dependencies.

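Install the llama-cpp-python library for local language model support with pip install llama-cpp-python. If you have an NVIDIA GPU and want to use CUDA acceleration, you need to build it with specific flags. On Linux, use CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python. On Windows, set CMAKE_ARGS to -DGGML_CUDA=on as an environment variable before running pip install llama-cpp-python. Older releases of llama-cpp-python used -DLLAMA_CUBLAS=on for the same purpose, so check the project's README for the flag that matches your version. CUDA does not apply on Apple Silicon Macs, where the Metal backend is used instead; the README lists the corresponding flag.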

Install the Hugging Face transformers library with pip install transformers accelerate. The accelerate library provides optimizations for model loading and inference.

Install audio processing libraries with pip install sounddevice soundfile noisereduce librosa scipy. These libraries handle microphone input, speaker output, and noise reduction.

Install additional utility libraries with pip install numpy. NumPy should already be installed as a dependency of other packages, but this ensures you have it.
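After the installs finish, it can save time to confirm that every package is importable before running the assistant. The sketch below only looks the modules up without importing them. The import names are what these packages are normally imported as (for example, openai-whisper is imported as whisper and llama-cpp-python as llama_cpp); check_modules is a hypothetical helper.

```python
import importlib.util

# Import names differ from some pip names
# (openai-whisper -> whisper, llama-cpp-python -> llama_cpp).
REQUIRED_MODULES = [
    "torch", "whisper", "TTS", "llama_cpp", "transformers",
    "sounddevice", "soundfile", "noisereduce", "librosa", "scipy", "numpy",
]


def check_modules(names=REQUIRED_MODULES):
    """Map each module name to True if it can be found, else False."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


for name, found in check_modules().items():
    print(f"{name:15s} {'OK' if found else 'MISSING'}")
```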

Create a new file named voice_assistant_complete.py in your project directory and copy the entire running example code into this file. Save the file.

OBTAINING AND CONFIGURING LANGUAGE MODELS

For local language model inference using llama-cpp-python, you need to download a model in GGUF format. The Hugging Face model hub hosts many quantized models suitable for this purpose. Visit huggingface.co and search for models with GGUF in the name. Popular options include Llama 2, Mistral, and Phi models.

For example, to download a quantized Llama 2 7B model, search for "llama-2-7b-chat.Q4_K_M.gguf" on Hugging Face. Download the model file to your project directory. These files can be several gigabytes in size, so the download may take some time depending on your internet connection.

Alternatively, you can use the Hugging Face CLI to download models. Install it with pip install huggingface-hub and then use the command huggingface-cli download TheBloke/Llama-2-7B-Chat-GGUF llama-2-7b-chat.Q4_K_M.gguf --local-dir ./models to download a model to a local models directory.

For Hugging Face transformers models, you do not need to download anything manually. The library will automatically download models the first time you use them. However, you can pre-download models using Python code. Create a simple script that imports the transformers library and loads the model you want to use, such as AutoModelForCausalLM.from_pretrained("gpt2"). This will download the model to your Hugging Face cache directory.
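A pre-download script along those lines might look like the following sketch. The function name predownload is an assumption made for illustration; the two from_pretrained calls are the standard transformers loading calls.

```python
def predownload(model_name="gpt2"):
    """Download a model and its tokenizer into the Hugging Face cache."""
    # Imported inside the function so this file can be loaded even
    # before transformers is installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    AutoTokenizer.from_pretrained(model_name)
    AutoModelForCausalLM.from_pretrained(model_name)
    print(f"{model_name} has been downloaded to the local cache")
```

Calling predownload("gpt2") once, while you have a network connection, is enough; later runs of the assistant read the files from the cache.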

Note the path to your downloaded GGUF model file, as you will need to provide this path when running the voice assistant.
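Interrupted downloads and saved HTML error pages are a common cause of model loading failures, so a quick sanity check on the file can help. GGUF files begin with the four bytes "GGUF"; the sketch below checks only that marker and does not validate the rest of the file. The function name is an assumption.

```python
from pathlib import Path


def looks_like_gguf(model_path):
    """Return True if the file exists and starts with the GGUF magic bytes."""
    path = Path(model_path)
    if not path.is_file():
        return False
    with path.open("rb") as f:
        return f.read(4) == b"GGUF"
```

For example, looks_like_gguf("./models/llama-2-7b-chat.Q4_K_M.gguf") should return True for a complete download of the model mentioned above.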

BASIC USAGE AND RUNNING THE ASSISTANT

To run the voice assistant in its simplest form with a local GGUF model, use the following command from your project directory with the virtual environment activated. Replace the path with the actual path to your model file.

python voice_assistant_complete.py --llm-type local --llm-model ./models/llama-2-7b-chat.Q4_K_M.gguf

The system will initialize, which may take 30 seconds to a few minutes depending on your hardware. You will see log messages indicating the progress of loading each component. Once you see "Voice Assistant is ready!" and hear the greeting, you can start speaking.

Speak clearly into your microphone. The system will automatically detect when you start speaking and when you finish based on silence detection. After you stop speaking, the system will transcribe your speech, generate a response, and speak it back to you.
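To make the idea of silence detection concrete, here is a simplified, energy-based sketch. It is not the record_audio implementation from this tutorial; it only assumes that the silence_threshold and silence_duration parameters passed in process_voice_input mean "loudness below which a frame counts as quiet" and "seconds of quiet that end the recording". The frame length and function names are illustrative choices.

```python
import math


def frame_rms(frame):
    """Root-mean-square level of one frame of samples."""
    if len(frame) == 0:
        return 0.0
    return math.sqrt(sum(float(s) * float(s) for s in frame) / len(frame))


def end_of_speech(samples, sample_rate, silence_threshold=0.01,
                  silence_duration=2.0, frame_seconds=0.1):
    """
    Return the sample index at which recording could stop, or None.

    Recording may stop once speech has been heard and is followed by at
    least `silence_duration` seconds of frames below `silence_threshold`.
    """
    frame_len = max(1, round(sample_rate * frame_seconds))
    frames_needed = max(1, round(silence_duration / frame_seconds))
    heard_speech = False
    quiet_frames = 0

    for start in range(0, len(samples) - frame_len + 1, frame_len):
        level = frame_rms(samples[start:start + frame_len])
        if level >= silence_threshold:
            # Loud frame: speech is in progress, so restart the quiet count.
            heard_speech = True
            quiet_frames = 0
        elif heard_speech:
            quiet_frames += 1
            if quiet_frames >= frames_needed:
                return start + frame_len
    return None
```

Silence before the first loud frame is ignored, which is why the assistant keeps waiting until you begin to speak. The function accepts plain lists as well as NumPy arrays.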

To exit the assistant, say "exit", "quit", "goodbye", or "bye". You can also press Ctrl+C to stop the program immediately.

For testing with a Hugging Face model instead of a local model, use this command:

python voice_assistant_complete.py --llm-type huggingface --llm-model gpt2

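The gpt2 model is relatively small and will download quickly the first time you run this command. Subsequent runs will use the cached model. Keep in mind that gpt2 is a base model that was not trained to follow instructions or hold a conversation, so its answers will often be rambling or off-topic. It is useful for checking that the pipeline runs, not for judging response quality.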

ADVANCED CONFIGURATION OPTIONS

The voice assistant supports numerous command-line options to customize its behavior. To use a different Whisper model size for better accuracy, add the --whisper-model flag. For example, to use the small model which provides better accuracy than base:

python voice_assistant_complete.py --llm-type local --llm-model model.gguf --whisper-model small

Available Whisper model sizes from smallest to largest are tiny, base, small, medium, and large. Larger models provide better accuracy but require more memory and processing time.

To enable advanced noise reduction for challenging acoustic environments with significant background noise, add the --advanced-noise flag:

python voice_assistant_complete.py --llm-type local --llm-model model.gguf --advanced-noise

This applies multiple noise reduction algorithms in sequence, providing superior noise suppression at the cost of slightly increased processing time.

To explicitly specify which computing device to use, add the --device flag. This is useful if you have multiple GPUs or want to force CPU usage for testing:

python voice_assistant_complete.py --llm-type local --llm-model model.gguf --device cuda

Valid device options are cpu, cuda for NVIDIA GPUs, and mps for Apple Silicon Macs.

To disable automatic conversation saving, add the --no-save flag:

python voice_assistant_complete.py --llm-type local --llm-model model.gguf --no-save

By default, the system saves all conversations to JSON files in a conversations directory. This flag disables that behavior.

To test the system with a single text question without voice input, use the --test-question flag:

python voice_assistant_complete.py --llm-type local --llm-model model.gguf --test-question "What is artificial intelligence?"

This is useful for testing that the language model and text-to-speech components are working correctly without requiring microphone input.

MANAGING CONVERSATION HISTORY

The system automatically saves conversation history to JSON files in a conversations subdirectory. Each conversation is saved with a timestamp in the filename, such as conversation_20250126_143052.json.

To view saved conversations, navigate to the conversations directory and open any JSON file with a text editor. The file contains a timestamp, the complete message history, and a message count.
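If you prefer a readable transcript to raw JSON, a short script can format a saved file. This sketch relies only on the three keys written by save_conversation (timestamp, messages, message_count); the function name format_transcript is an assumption.

```python
import json


def format_transcript(filepath):
    """Return a saved conversation as readable 'Role: text' lines."""
    with open(filepath, "r", encoding="utf-8") as f:
        data = json.load(f)

    lines = [f"Saved: {data['timestamp']} ({data['message_count']} messages)"]
    for message in data["messages"]:
        lines.append(f"{message['role'].capitalize()}: {message['content']}")
    return "\n".join(lines)
```

For example, print(format_transcript("conversations/conversation_20250126_143052.json")) prints a header line followed by one line per message.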

To programmatically load and continue a previous conversation, you would need to modify the main function or create a custom script. The ConversationManager class provides methods for listing, loading, and saving conversations. Here is an example of how you might extend the system to support loading conversations:

Create a new Python file called continue_conversation.py with code that instantiates the VoiceAssistant, uses the conversation_manager to list available conversations, prompts the user to select one, loads it, and then runs the interactive mode. This allows you to resume previous discussions with full context.
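One possible shape for continue_conversation.py is sketched below. It uses only methods that appear in the running example (list_conversations, load_conversation, run_interactive); the helper names choose_conversation and resume, the menu format, and the model path in main are assumptions to adapt to your setup.

```python
def choose_conversation(filenames, selection):
    """Return the filename for a 1-based menu selection, or None."""
    try:
        index = int(selection) - 1
    except ValueError:
        return None
    if 0 <= index < len(filenames):
        return filenames[index]
    return None


def resume(assistant, ask=input):
    """List saved conversations, load the chosen one, and start talking."""
    filenames = assistant.conversation_manager.list_conversations()
    if not filenames:
        print("No saved conversations found.")
        return None

    for number, name in enumerate(filenames, start=1):
        print(f"{number}. {name}")

    chosen = choose_conversation(filenames, ask("Conversation number: "))
    if chosen is None:
        print("Invalid selection.")
        return None

    assistant.load_conversation(chosen)
    assistant.run_interactive()
    return chosen


def main():
    # Assumes this file sits next to voice_assistant_complete.py.
    from voice_assistant_complete import VoiceAssistant

    assistant = VoiceAssistant(
        llm_type="local",
        llm_model_path="./models/llama-2-7b-chat.Q4_K_M.gguf",
    )
    resume(assistant)
```

To run it as a script, end the file with the usual if __name__ == '__main__': main() guard. Because list_conversations sorts in reverse order, the most recent conversation appears as number 1.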

The conversation files are plain JSON, so you can also process them with other tools for analysis, export to other formats, or integration with other systems.

TROUBLESHOOTING COMMON ISSUES

If you encounter an error about CUDA not being available when you have an NVIDIA GPU, verify that you installed the CUDA-enabled version of PyTorch. Run python -c "import torch; print(torch.cuda.is_available())" to check. If it prints False, you need to reinstall PyTorch with CUDA support using the command from the PyTorch website.

If the microphone is not working or you see errors about audio devices, check your operating system's audio settings to ensure the correct microphone is selected as the default input device. On Linux, you may need to install additional audio libraries with sudo apt-get install portaudio19-dev python3-pyaudio.
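It also helps to see which devices the sounddevice library itself detects, since that is what the assistant uses for recording and playback. The sketch below wraps the query in error handling so that it reports a missing PortAudio installation instead of crashing; list_audio_devices is a hypothetical helper name.

```python
def list_audio_devices():
    """Return a text summary of the audio devices sounddevice can see."""
    try:
        import sounddevice as sd
    except (ImportError, OSError) as e:
        # OSError is raised when the PortAudio library itself is missing.
        return f"sounddevice is not usable: {e}"

    try:
        default_input = sd.query_devices(kind="input")
        default_output = sd.query_devices(kind="output")
        return (
            f"Default input:  {default_input['name']}\n"
            f"Default output: {default_output['name']}\n"
            f"All devices:\n{sd.query_devices()}"
        )
    except Exception as e:
        return f"Could not query audio devices: {e}"


print(list_audio_devices())
```

Running python -m sounddevice from the terminal prints a similar device table without any extra code.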

If you get errors about missing models when using Coqui TTS, the system may have failed to download the model automatically. Try running the TTS initialization separately in a Python shell to see more detailed error messages. You can also manually specify a different TTS model with the --tts-model flag.

If transcription quality is poor, try using a larger Whisper model with --whisper-model medium or --whisper-model large. Also ensure you are speaking clearly and that your microphone is positioned correctly. Enable advanced noise reduction if you are in a noisy environment.

If the language model responses are slow, you may not be using GPU acceleration effectively. Check the log output to see how many layers are being offloaded to the GPU. You can also try using a smaller or more heavily quantized model for faster inference.

If you encounter out-of-memory errors, you are likely using a model that is too large for your available RAM or VRAM. Try using a smaller model or a more aggressively quantized version. For GGUF models, look for versions with Q4 or Q5 in the name, which use 4-bit or 5-bit quantization.

DEPLOYMENT CONSIDERATIONS FOR PRODUCTION USE

For deployment in production environments, you should consider several additional factors beyond basic functionality. First, implement proper error handling and recovery mechanisms. The current system logs errors but may not gracefully handle all failure scenarios. Add try-except blocks around critical sections and implement automatic retry logic for transient failures.
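The retry logic mentioned above can be sketched as a small helper with exponential backoff. This is an illustration rather than part of the assistant's code: the retry function and its parameters are hypothetical, and which exception types count as transient depends on the component you wrap.

```python
import time


def retry(operation, attempts=3, base_delay=0.5, retry_on=(OSError,), sleep=time.sleep):
    """Call operation() and retry on transient errors with exponential backoff.

    retry_on lists the exception types treated as transient; any other
    exception propagates immediately. Once the attempts are exhausted,
    the last failure is re-raised.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
```

You might wrap an audio device open or a model download in retry, while letting programming errors such as ValueError surface right away.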

Consider implementing authentication and access control if the system will be used by multiple users or in a shared environment. You could add user identification through voice recognition or require a PIN code before activating the assistant.

For better performance in production, pre-load all models at startup rather than lazy loading. This increases startup time but eliminates delays during the first interaction. You can modify the initialization code to explicitly load and warm up all models.

Implement rate limiting to prevent abuse if the system is exposed to untrusted users. Track the number of requests per user or IP address and enforce reasonable limits.
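One simple way to enforce such limits is a sliding-window counter per client. The sketch below keeps everything in memory, which is enough for a single process; the class name and default limits are illustrative assumptions.

```python
import time
from collections import defaultdict, deque


class RateLimiter:
    """Allow at most max_requests per client within a sliding time window."""

    def __init__(self, max_requests=30, window_seconds=60.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock
        self._history = defaultdict(deque)  # client id -> request timestamps

    def allow(self, client_id):
        """Record a request and return True, or return False if over the limit."""
        now = self.clock()
        timestamps = self._history[client_id]
        # Drop timestamps that have left the window.
        while timestamps and now - timestamps[0] >= self.window_seconds:
            timestamps.popleft()
        if len(timestamps) >= self.max_requests:
            return False
        timestamps.append(now)
        return True
```

The client_id can be a user name or an IP address. If you run several worker processes, the counters would need to live in shared storage instead.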

Add monitoring and metrics collection to track system performance, error rates, and usage patterns. You could integrate with monitoring tools like Prometheus or send metrics to a logging service.

Consider deploying the system as a service that starts automatically on system boot. On Linux, you can create a systemd service file. On Windows, you can use the Task Scheduler or install it as a Windows service.

For multi-user scenarios, consider deploying the system as a web service with a REST API rather than a command-line application. This would allow multiple clients to connect to a single instance running on a server. You would need to implement request queuing and potentially multiple worker processes to handle concurrent requests.

PERFORMANCE OPTIMIZATION STRATEGIES

To optimize performance, start by profiling the system to identify bottlenecks. The Python cProfile module can help identify which functions consume the most time. Run the assistant with profiling enabled using

python -m cProfile -o profile.stats voice_assistant_complete.py

and then analyze the results with tools like snakeviz.
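If you prefer to profile a single function rather than the whole program, the standard library can do that from within Python. The helper below is a small sketch; profile_call is a hypothetical name, while cProfile and pstats are the standard modules.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile and return (result, report text).

    The report lists the `top` entries sorted by cumulative time.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()
```

The profile.stats file written by the command above can be inspected the same way, for example with pstats.Stats("profile.stats").sort_stats("cumulative").print_stats(20).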

For speech recognition, the Whisper model size has the largest impact on performance. The tiny model is approximately 10 times faster than the large model but less accurate. Choose the smallest model that provides acceptable accuracy for your use case.

For language model inference, quantization is the most effective optimization. Models quantized to 4-bit or 5-bit precision run significantly faster with minimal quality loss. If you are using Hugging Face models, consider converting them to GGUF format for better performance.

Enable GPU acceleration wherever possible. Check the log output to confirm that as many model layers as possible are being offloaded to the GPU. If the automatic detection is not optimal, you can modify the code to set the number of GPU layers manually.

For text-to-speech, the Tacotron2 models used by default are relatively slow. Consider using faster models like VITS or FastSpeech2 if available for your language. You can specify alternative models with the --tts-model flag.

Implement caching for common responses. If certain questions are asked frequently, you can cache the language model responses and return them immediately without regenerating. This is particularly effective for FAQ-style interactions.
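A response cache along these lines could be a small least-recently-used map keyed on a normalized form of the question. The class below is a sketch under that assumption and is not part of the assistant's code.

```python
from collections import OrderedDict


class ResponseCache:
    """Small LRU cache keyed on a normalized form of the user's question."""

    def __init__(self, max_entries=128):
        self.max_entries = max_entries
        self._entries = OrderedDict()

    @staticmethod
    def _key(question):
        # Collapse case and whitespace so trivially different inputs match.
        return " ".join(question.lower().split())

    def get(self, question):
        """Return the cached response, or None on a miss."""
        key = self._key(question)
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key]

    def put(self, question, response):
        """Store a response and evict the least recently used entry if full."""
        key = self._key(question)
        self._entries[key] = response
        self._entries.move_to_end(key)
        if len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)
```

Only cache answers that do not depend on the conversation so far. A question like "what did I just say?" should always go to the language model.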

Consider using model distillation to create smaller, faster versions of large models while retaining most of their capabilities. This is an advanced technique that requires additional training but can significantly improve inference speed.

SECURITY AND PRIVACY CONSIDERATIONS

The voice assistant processes potentially sensitive audio and text data, so security and privacy are important considerations. All audio processing happens locally on your machine by default, which provides good privacy protection. However, you should still take precautions.

Ensure that conversation history files are stored securely with appropriate file permissions. On Unix-like systems, set permissions to 600 so only the owner can read and write the files. Consider encrypting the conversation history files if they contain sensitive information.
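On Unix-like systems you can apply these permissions from Python as well as from the shell. The sketch below assumes the conversations directory described earlier; restrict_to_owner is a hypothetical helper name.

```python
import os
import stat
from pathlib import Path


def restrict_to_owner(directory="conversations"):
    """Set mode 600 on every JSON file and mode 700 on the directory itself."""
    folder = Path(directory)
    os.chmod(folder, stat.S_IRWXU)  # 0o700: only the owner may list and enter
    for path in folder.glob("*.json"):
        os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)  # 0o600: owner read/write
```

On Windows, os.chmod can only toggle the read-only flag, so access there has to be restricted through the file system's own access control settings.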

If you modify the system to support remote language models via API calls, ensure all communications use HTTPS to prevent eavesdropping. Implement proper API key management and never hard-code API keys in the source code. Use environment variables or secure configuration files.
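Reading a key from the environment takes only a few lines. In this sketch the variable name ASSISTANT_API_KEY is a made-up example, so substitute whatever name your remote provider or deployment uses.

```python
import os


def load_api_key(variable="ASSISTANT_API_KEY"):
    """Read an API key from the environment and fail loudly if it is missing."""
    key = os.environ.get(variable, "").strip()
    if not key:
        raise RuntimeError(f"Set the {variable} environment variable before starting.")
    return key
```

Failing at startup is usually preferable to discovering a missing key in the middle of a conversation.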

Be aware that the language models may inadvertently memorize and reproduce sensitive information from their training data. Do not rely on the assistant for handling highly confidential information without additional safeguards.

Implement input validation and sanitization to prevent injection attacks if you extend the system to interact with external services or databases. Never directly execute user input as code or system commands.

Consider implementing automatic deletion of old conversation history files to minimize the amount of stored personal data. You could add a cleanup function that deletes conversations older than a specified number of days.
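Such a cleanup function might look like the following sketch. It relies on each file's modification time rather than the timestamp in the filename, and the function name and the 30-day default are illustrative assumptions.

```python
import time
from pathlib import Path


def delete_old_conversations(directory="conversations", max_age_days=30, now=None):
    """Delete conversation files last modified more than max_age_days ago.

    Returns the list of deleted paths so the caller can log them.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # 86400 seconds per day
    deleted = []
    for path in Path(directory).glob("conversation_*.json"):
        if path.stat().st_mtime < cutoff:
            path.unlink()
            deleted.append(path)
    return deleted
```

You could call this once at startup, or schedule it to run daily.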

If deploying in a regulated environment such as healthcare or finance, ensure compliance with relevant regulations like HIPAA or GDPR. This may require additional security measures, audit logging, and data retention policies.

OPTIONAL ENHANCEMENT TASKS FOR READERS

The following tasks provide opportunities to extend and enhance the voice assistant system. These range from simple modifications to complex new features, allowing you to customize the system to your specific needs and learn more about the underlying technologies.

Task one is to implement wake word detection so the assistant only activates when you say a specific phrase like "Hey Assistant". You can use the Porcupine wake word detection library for this. A simple energy-based detector is easier to build, but it can only tell that someone is making sound, not which phrase was spoken, so it works as a rough activation trigger rather than a true wake word. Either way, this requires modifying the main loop to continuously monitor audio and only start full processing when the trigger fires.

Task two is to add support for multiple languages with automatic language detection. Modify the system to detect which language the user is speaking and automatically switch the text-to-speech model to match. Whisper can detect the spoken language, and Coqui TTS supports models for many languages. You would need to maintain a mapping of language codes to TTS models and implement logic to switch between them.
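As a starting point for task two, the mapping can be an ordinary dictionary with a fallback. The model names below follow Coqui TTS naming, but treat them as examples and confirm them against the model list of your installed version. The function name is hypothetical.

```python
DEFAULT_LANGUAGE = "en"

# Example mapping from language codes to Coqui TTS model names.
# Verify each name against your installed TTS version before relying on it.
TTS_MODELS = {
    "en": "tts_models/en/ljspeech/tacotron2-DDC",
    "de": "tts_models/de/thorsten/tacotron2-DDC",
    "es": "tts_models/es/mai/tacotron2-DDC",
    "fr": "tts_models/fr/mai/tacotron2-DDC",
}


def tts_model_for(language_code):
    """Return the TTS model for a language code, falling back to English."""
    # Accept forms like "EN" or "en-US" by keeping only the base language.
    code = (language_code or "").lower().split("-")[0]
    return TTS_MODELS.get(code, TTS_MODELS[DEFAULT_LANGUAGE])
```

Loading a TTS model takes time, so it is worth keeping already loaded models in memory and switching only when the detected language actually changes.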

Task three is to implement voice activity detection using a more sophisticated algorithm than simple energy thresholding. Look into using the WebRTC VAD library or training a small neural network to detect speech versus silence. This can improve the system's ability to handle varying background noise levels and different speaking styles.

Task four is to add emotion detection to analyze the user's emotional state from their voice and adjust the assistant's responses accordingly. You can use libraries like librosa to extract acoustic features and train a classifier to recognize emotions like happiness, sadness, anger, or frustration. The assistant could then generate more empathetic responses based on the detected emotion.

Task five is to implement speaker identification to recognize different users and maintain separate conversation histories for each person. This requires collecting voice samples from each user, extracting speaker embeddings using models like resemblyzer, and comparing new audio against stored embeddings to identify the speaker.

Task six is to add support for function calling, allowing the assistant to perform actions like setting timers, controlling smart home devices, or searching the web. Define a set of available functions with their parameters, modify the language model prompts to include function descriptions, and implement logic to parse function calls from the model output and execute them.
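The parsing and execution step for task six can be sketched as a registry plus a dispatcher. This assumes you prompt the model to answer with a JSON object of the form {"name": ..., "arguments": {...}} when it wants to call a function; that format, the registry, and the set_timer example are all illustrative choices.

```python
import json

FUNCTIONS = {}  # function name -> callable the model is allowed to invoke


def register(name):
    """Decorator that adds a function to the allowlist under the given name."""
    def decorator(func):
        FUNCTIONS[name] = func
        return func
    return decorator


@register("set_timer")
def set_timer(minutes):
    # Placeholder action; a real implementation would schedule the timer.
    return f"Timer set for {int(minutes)} minutes."


def dispatch(model_output):
    """Run a registered function if the output is a JSON call, else return None."""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return None  # ordinary text reply, not a function call
    if not isinstance(call, dict):
        return None
    name = call.get("name")
    arguments = call.get("arguments", {})
    if not isinstance(name, str) or name not in FUNCTIONS:
        return None  # unknown names are ignored, never executed
    if not isinstance(arguments, dict):
        return None
    return FUNCTIONS[name](**arguments)
```

Because only registered names are ever called, the model cannot trigger arbitrary code, which keeps this in line with the security advice earlier in the article.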

Task seven is to create a graphical user interface using a framework like PyQt or Tkinter. The GUI could display the conversation history, show real-time transcription, provide buttons for common actions, and visualize the audio waveform or speech recognition confidence. This makes the system more accessible to users who prefer graphical interfaces over command-line tools.

Task eight is to implement streaming speech recognition that transcribes audio in real-time as you speak rather than waiting for silence. This requires processing audio in small chunks and using a streaming-capable speech recognition model. You would need to modify the audio callback to send chunks to the recognizer continuously and update the transcription incrementally.

Task nine is to add support for multi-turn dialogue with explicit context management. Implement a dialogue state tracker that maintains information about the current topic, user preferences, and conversation goals. This allows the assistant to handle complex multi-step tasks and maintain coherent conversations over many turns.

Task ten is to create a plugin system that allows adding new capabilities without modifying the core code. Define a plugin interface that external modules can implement, and add a plugin loader that discovers and initializes plugins at startup. Plugins could add new commands, integrate with external services, or provide domain-specific knowledge.
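A minimal plugin loader for task ten might import every Python file in a plugins directory. The contract assumed here, where each plugin module defines a register() function returning a dictionary of command names to callables, is one possible design and not something the core code prescribes.

```python
import importlib.util
from pathlib import Path


def load_plugins(directory="plugins"):
    """Import every .py file in directory and collect the commands it offers.

    Assumed contract: a plugin module defines register(), which returns a
    dict mapping command names to callables. Modules without register()
    are imported but contribute nothing.
    """
    commands = {}
    for path in sorted(Path(directory).glob("*.py")):
        spec = importlib.util.spec_from_file_location(f"plugin_{path.stem}", path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)  # runs the plugin's top-level code
        if hasattr(module, "register"):
            commands.update(module.register())
    return commands
```

Importing a plugin executes its code, so load plugins only from a directory you control.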

Task eleven is to implement conversation summarization that automatically generates summaries of long conversations. Use a summarization model to condense conversation history, which can help manage context window limitations in language models and provide users with quick overviews of past discussions.
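For task eleven, the bookkeeping around the summarizer is the easy part and can be sketched independently of the model you choose. The function below assumes messages are dictionaries with "role" and "content" keys, and takes the summarizer as a plain callable so you can plug in the language model or a dedicated summarization model.

```python
def compact_history(messages, summarize, keep_recent=6):
    """Replace all but the most recent messages with a single summary message.

    summarize is any callable that takes a list of messages and returns a
    string. keep_recent must be at least 1.
    """
    if len(messages) <= keep_recent:
        return list(messages)
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = {
        "role": "system",
        "content": "Summary of earlier conversation: " + summarize(older),
    }
    return [summary] + list(recent)
```

Calling this whenever the history grows beyond a chosen length keeps the prompt within the context window while the latest turns stay verbatim.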

Task twelve is to add support for multimodal interaction by integrating image recognition. Allow users to show images to the camera while asking questions about them. Use a vision-language model like CLIP or LLaVA to understand the images and incorporate visual information into the conversation.

Task thirteen is to implement adaptive noise reduction that learns the characteristics of your specific acoustic environment over time. Collect background noise samples during silence periods, build a noise profile, and continuously update the noise reduction parameters based on the current environment.
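The noise profile in task thirteen can start as something very small: a running estimate of the background level that is updated only during silence. The sketch below tracks a single RMS value with an exponential moving average. A real noise reducer would track a spectrum per frequency band, so treat this as the simplest instance of the idea. The class name and default parameters are assumptions.

```python
import math


class NoiseFloorTracker:
    """Track background noise level with an exponential moving average."""

    def __init__(self, alpha=0.05, speech_ratio=3.0):
        self.alpha = alpha                # how quickly the estimate adapts
        self.speech_ratio = speech_ratio  # level above noise that counts as speech
        self.noise_rms = None

    @staticmethod
    def rms(samples):
        """Root mean square of a non-empty sequence of samples."""
        return math.sqrt(sum(s * s for s in samples) / len(samples))

    def is_speech(self, samples):
        """Classify one frame and update the noise estimate on non-speech frames."""
        level = self.rms(samples)
        if self.noise_rms is None:
            self.noise_rms = level  # the first frame seeds the estimate
            return False
        if level > self.noise_rms * self.speech_ratio:
            return True  # do not learn the noise floor from speech
        self.noise_rms = (1 - self.alpha) * self.noise_rms + self.alpha * level
        return False
```

Feeding each audio frame through is_speech lets the threshold rise when a fan switches on and fall again when the room goes quiet.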

Task fourteen is to create a web-based interface using a framework like Flask or FastAPI. This allows accessing the assistant through a web browser from any device on your network. Implement WebSocket communication for real-time audio streaming and response delivery.

Task fifteen is to add support for voice cloning, allowing the assistant to speak in a custom voice. Use tools like Coqui TTS's voice cloning capabilities to create a personalized voice from audio samples. This requires collecting clean audio recordings of the target voice and, depending on the model, either supplying them as reference audio or fine-tuning the TTS model on them.

These enhancement tasks provide numerous opportunities to deepen your understanding of voice AI systems and create a truly personalized assistant tailored to your specific needs and preferences. Start with simpler tasks and gradually work toward more complex features as you become more comfortable with the codebase and underlying technologies.

Hitchhiker's Guide to AI, Software Architecture, and Everything Else

Sunday, July 05, 2026
