INTRODUCTION
Voice assistants have become ubiquitous, yet building a custom solution remains an instructive challenge for makers and software engineers. At its core, a voice assistant must capture spoken words, transform those sounds into text, process the text to generate a response, and render that response back into audible speech. In this article a prototype system will be assembled from open-source speech-to-text and text-to-speech libraries, commodity hardware in the form of a single-board computer or microcontroller, and a large language model to handle natural language understanding and response generation. Each subsystem will be introduced conceptually, followed by a practical code example demonstrating a minimal working integration. By the end, you'll have a clear blueprint for constructing a flexible, self-hosted voice assistant that can run entirely within a private LAN or even on a single device.
HARDWARE SELECTION
Choosing the right hardware platform is a foundational step because it determines how much processing can occur locally and what peripherals can be supported. A single-board computer such as a Raspberry Pi offers a familiar Linux environment, generous RAM, and USB or I2S interfaces for microphones and speakers. It can host heavier models and run Python libraries directly. A microcontroller like the ESP32 is far more constrained in memory and CPU speed, yet it shines at low-power streaming and real-time I/O. If the architecture offloads transcription and inference to another machine, the ESP32 can serve as a simple audio capture and playback node that streams data over Wi-Fi.
Below are three illustrative hardware configurations for a self-hosted voice assistant, shown in the diagram above and described in detail below.
Standalone Single-Board Computer Configuration
In this setup a single-board computer such as a Raspberry Pi runs every subsystem locally. A USB microphone captures voice input, and a powered speaker or USB headset renders the synthesized speech. The Raspberry Pi hosts the speech-to-text engine (for example Vosk), the text-to-speech engine (for example pyttsx3), and a lightweight large-language-model instance or a client that forwards prompts to a server elsewhere on the LAN. This configuration maximizes self-containment and avoids network latency, though it requires that the Pi have enough RAM (ideally four gigabytes or more) to load models and handle audio processing in real time.
ESP32 as Edge Audio Node with Central SBC
Here an ESP32 microcontroller handles the raw audio I/O and wake-word detection. Its I2S interface attaches to a digital microphone, and a small speaker or headphone jack plays back audio. When the wake word is detected, the ESP32 streams the recorded audio frames over Wi-Fi to a central SBC. The SBC performs transcription, LLM inference, and synthesis, then sends the resulting audio back to the ESP32 for playback. This offloads the heavy lifting from the microcontroller, allowing extremely low-power, real-time wake-word filtering at the edge, while leveraging the Pi’s compute for language tasks.
Fully Distributed Setup with Remote LLM Server
In environments where the local SBC cannot host an LLM, or where a more powerful model is desired, the system may spread across three devices. A microcontroller or SBC in the room captures and pre-processes audio and handles wake-word detection. Once the audio is ready, it is sent over the LAN to a dedicated server or PC hosting the LLM. That server returns plain text, which is then fed back to the edge node for speech synthesis. This arrangement can support larger models with billions of parameters, at the cost of added network latency and the need for robust LAN security.
Each of these configurations balances trade-offs among latency, power consumption, privacy, and cost. Choosing among them depends on whether you prioritize total self-containment, minimal latency, or the flexibility to leverage more powerful language models on external hardware.
The following code example illustrates how to initialize an I2S microphone on the ESP32 using the Arduino framework, connect to a WLAN, and send buffered audio frames to a local server for transcription.
Introduction to the code example
The snippet below shows an Arduino sketch for the ESP32. It demonstrates how to configure the I2S peripheral to sample audio from a digital microphone, connect to a wireless network using SSID and password, and then package audio frames into HTTP POST requests to a transcription endpoint running elsewhere on the LAN. This example focuses on the key setup calls and the loop that reads and sends data. Error handling and retry logic are left out for clarity but would be required in production.
#include <WiFi.h>
#include "driver/i2s.h"

const char* ssid = "YourNetworkSSID";
const char* password = "YourNetworkPass";
const char* serverHost = "192.168.1.100"; // Transcription server IP
const int serverPort = 5000;              // Transcription API port

void setupI2SMic() {
  i2s_config_t i2sConfig = {
    .mode = (i2s_mode_t)(I2S_MODE_MASTER | I2S_MODE_RX),
    .sample_rate = 16000,
    .bits_per_sample = I2S_BITS_PER_SAMPLE_16BIT,
    .channel_format = I2S_CHANNEL_FMT_ONLY_LEFT,
    // Cast the combined flags so the initializer compiles cleanly in C++
    .communication_format = (i2s_comm_format_t)(I2S_COMM_FORMAT_I2S | I2S_COMM_FORMAT_I2S_MSB),
    .intr_alloc_flags = 0,
    .dma_buf_count = 4,
    .dma_buf_len = 512
  };
  i2s_pin_config_t pinConfig = {
    .bck_io_num = 26,
    .ws_io_num = 25,
    .data_out_num = I2S_PIN_NO_CHANGE,
    .data_in_num = 32
  };
  i2s_driver_install(I2S_NUM_0, &i2sConfig, 0, NULL);
  i2s_set_pin(I2S_NUM_0, &pinConfig);
}

void setup() {
  Serial.begin(115200);
  WiFi.begin(ssid, password);
  while (WiFi.status() != WL_CONNECTED) {
    delay(500);
    Serial.print(".");
  }
  Serial.println("Connected to Wi-Fi");
  setupI2SMic();
}

void loop() {
  static uint8_t i2sBuffer[1024];
  size_t bytesRead;
  // Read raw PCM from I2S
  i2s_read(I2S_NUM_0, i2sBuffer, sizeof(i2sBuffer), &bytesRead, portMAX_DELAY);
  WiFiClient client;
  if (client.connect(serverHost, serverPort)) {
    // For clarity each buffer is sent as its own request; a real design would
    // accumulate a full utterance before posting
    client.print("POST /transcribe HTTP/1.1\r\n");
    client.print("Host: "); client.print(serverHost); client.print("\r\n");
    client.print("Content-Type: application/octet-stream\r\n");
    client.print("Content-Length: "); client.print(bytesRead); client.print("\r\n");
    client.print("Connection: close\r\n\r\n");
    client.write(i2sBuffer, bytesRead);
    client.stop();
  }
  // In production you'd need to add retry or error handling here
}
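The ESP32 sketch above streams its buffers to a transcription endpoint that the article does not define. As a point of reference, one possible receiver is sketched below. It is a minimal, hypothetical implementation assuming Flask and the same Vosk model used later in this article; the /transcribe route and port 5000 are chosen only to mirror the values hard-coded in the ESP32 sketch.
import json

from flask import Flask, request, jsonify
from vosk import Model, KaldiRecognizer

app = Flask(__name__)
model = Model("vosk-model-small-en-us-0.15")  # same model folder as the later Pi examples
recognizer = KaldiRecognizer(model, 16000)    # sample rate must match the ESP32 I2S config

@app.route("/transcribe", methods=["POST"])
def transcribe():
    # The ESP32 posts raw 16-bit mono PCM as application/octet-stream
    pcm_bytes = request.get_data()
    if recognizer.AcceptWaveform(pcm_bytes):
        result = json.loads(recognizer.Result())
        return jsonify({"text": result.get("text", "")})
    partial = json.loads(recognizer.PartialResult())
    return jsonify({"partial": partial.get("partial", "")})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)  # port 5000 matches serverPort in the sketch above
Keeping a single long-lived recognizer lets consecutive chunks from the ESP32 accumulate into one utterance; a production server would add per-client sessions, authentication, and rate limiting.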
SPEECH-TO-TEXT SUBSYSTEM
On a single-board computer such as a Raspberry Pi, offline recognition can be achieved using Vosk, an open-source toolkit that provides bindings for Python. Vosk model files are downloaded once and loaded into memory; they perform well on machines with a few gigabytes of RAM. After installing the vosk Python package and placing the model folder on disk, audio chunks from a microphone are fed into a recognizer instance which emits partial transcripts as it processes the data. The following Python example demonstrates how to open the default ALSA microphone, read raw PCM frames, and obtain a final JSON transcript.
Introduction to the code example
This Python script shows how to initialize the Vosk recognizer with a local model directory. It uses the sounddevice library to capture audio at sixteen kilohertz in mono. Each buffer of audio is passed into the recognizer. When the recognizer indicates that an utterance has completed, the JSON result is parsed to extract the textual transcript. Adjusting the buffer size and model path will be necessary depending on the hardware and OS configuration.
import json
import sounddevice as sd
from vosk import Model, KaldiRecognizer

# Load the Vosk model from disk (ensure you have downloaded and unzipped it)
model = Model("vosk-model-small-en-us-0.15")
recognizer = KaldiRecognizer(model, 16000)

def callback(indata, frames, time, status):
    if status:
        print("Audio status:", status)
    # RawInputStream delivers a raw buffer of int16 samples; convert it to bytes
    pcm_bytes = bytes(indata)
    if recognizer.AcceptWaveform(pcm_bytes):
        result = json.loads(recognizer.Result())
        print("Transcript:", result.get("text", ""))
    else:
        # Optional: print partial results
        partial = json.loads(recognizer.PartialResult())
        print("Partial:", partial.get("partial", ""))

# Open a raw stream at 16 kHz, 16-bit, mono
with sd.RawInputStream(samplerate=16000, blocksize=8000,
                       dtype="int16", channels=1, callback=callback):
    print("Listening… press Ctrl+C to stop.")
    try:
        # Keep the stream open (roughly 16 minutes); extend or loop as needed
        sd.sleep(int(1e6))
    except KeyboardInterrupt:
        print("Stopped.")
TEXT-TO-SPEECH SUBSYSTEM
Generating audible responses can be done with pyttsx3, a Python library that wraps platform-specific TTS engines and works offline. Once installed, it allows selecting from available voices and controlling the speech rate and volume. After the engine is configured, calling its say method queues text for synthesis, and runAndWait blocks until the audio is played back through the system’s default output. The example below converts a simple string into speech and plays it immediately.
Introduction to the code example
The demonstration sets up the pyttsx3 engine, queries the installed voice list, and selects the first available voice. It then sets a moderate speech rate before instructing the engine to speak a greeting. Depending on the operating system and installed voices, the available options will vary. On Linux, pyttsx3 uses espeak or espeak-ng under the hood.
import pyttsx3
engine = pyttsx3.init()
voices = engine.getProperty('voices')
if voices:
    engine.setProperty('voice', voices[0].id)
engine.setProperty('rate', 150)
text = "Hello, I am your custom voice assistant."
engine.say(text)
engine.runAndWait()
LARGE LANGUAGE MODEL INTEGRATION AND HOSTING OPTIONS
The heart of a voice assistant is the component that interprets intent and formulates a response. A large language model can be hosted locally on the SBC if its memory footprint and compute requirements are met, or alternatively on a more powerful PC accessible via LAN. Lightweight open-source models such as GPT4All or small LLaMA variants can run on a Raspberry Pi 4 only if the model size is kept under a couple of gigabytes; otherwise inference times become prohibitive. A simple REST API can wrap either a locally hosted model or a cloud service. The snippet below illustrates a Python client that sends a user prompt as JSON to an HTTP endpoint and prints the text reply.
Introduction to the code example
This Python client uses the requests library to post a JSON payload containing the user’s query. It assumes there is a server listening on port eighty at the specified host that returns a JSON object with a “response” field. Error checking for network timeouts and invalid JSON would be necessary in production code.
import requests
def query_llm(prompt):
    url = "http://192.168.1.100:80/llm"
    payload = {"prompt": prompt}
    try:
        r = requests.post(url, json=payload, timeout=10)
        r.raise_for_status()
        data = r.json()
        return data.get("response", "")
    except requests.RequestException as e:
        print("LLM request failed:", e)
        return ""

if __name__ == "__main__":
    user_input = "What is the weather today in Berlin?"
    answer = query_llm(user_input)
    print("Assistant says:", answer)
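The client above assumes a server that exposes an /llm route and returns a JSON object with a "response" field; the article does not prescribe how that server is built. The sketch below is one hypothetical way to provide it, assuming Flask and the gpt4all Python bindings; the model filename is a placeholder for whichever local model you have actually downloaded.
from flask import Flask, request, jsonify
from gpt4all import GPT4All

app = Flask(__name__)
llm = GPT4All("orca-mini-3b-gguf2-q4_0.gguf")  # placeholder model file; substitute your own

@app.route("/llm", methods=["POST"])
def llm_endpoint():
    data = request.get_json(silent=True) or {}
    prompt = data.get("prompt", "")
    if not prompt:
        return jsonify({"response": ""})
    # Generate a short reply; tune max_tokens for latency versus verbosity
    reply = llm.generate(prompt, max_tokens=200)
    return jsonify({"response": reply})

if __name__ == "__main__":
    # The client above targets port 80; binding to it may require elevated
    # privileges, so adjust the port on both sides if needed.
    app.run(host="0.0.0.0", port=80)
Any other local inference stack (for example llama.cpp behind its own HTTP server) could be substituted, as long as it honors the same request and response shape.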
WAKE-WORD DETECTION
A reliable voice assistant should not transcribe every sound in its environment, nor should it forward every utterance to the language model. Wake-word detection serves as the gatekeeper that listens continuously for a predefined phrase, and only once that phrase is recognized does the system begin the heavier work of transcription and inference. In this way the assistant conserves CPU cycles, avoids unnecessary network traffic, and protects privacy by not sending unrelated speech to the LLM.
Under the hood a wake-word detector is a small neural network optimized to run in real time at low power. It ingests audio frames—typically sixteen-kilohertz, mono, sixteen-bit PCM—and outputs a probability or binary flag indicating whether the target phrase has been spoken. The core loop feeds successive audio chunks into the detector; when the output crosses a threshold, the detector emits a “trigger” event.
One popular solution for single-board computers is Porcupine from Picovoice, which offers ready-made keyword models and a permissive license for personal projects. The example below shows how to initialize the Porcupine engine in Python, capture audio via the sounddevice library, and react when the wake word is detected.
Introduction to the code example
This snippet demonstrates how to create a Porcupine instance with a built-in keyword, open an audio stream at the required sample rate, and inspect each frame for a detection. When the keyword is heard, the loop breaks and control returns to the main assistant logic. In production you would wrap this in a thread or integrate it into an asynchronous event loop to avoid blocking other tasks.
import json
import threading

import sounddevice as sd
from vosk import Model, KaldiRecognizer
import pvporcupine
import requests
import pyttsx3

# Initialize components
# "computer" is one of Porcupine's built-in keywords; newer Porcupine releases
# also require an access_key argument obtained from the Picovoice Console.
porcupine = pvporcupine.create(keywords=["computer"])
model = Model("vosk-model-small-en-us-0.15")
recognizer = KaldiRecognizer(model, 16000)
engine = pyttsx3.init()
engine.setProperty('rate', 150)
def wait_for_wake():
    detected = threading.Event()
    def callback(indata, frames, time, status):
        if status:
            return
        # Porcupine expects one frame of int16 samples per call
        pcm = indata[:, 0].tolist()
        if porcupine.process(pcm) >= 0:
            detected.set()
            raise sd.CallbackStop
    with sd.InputStream(samplerate=porcupine.sample_rate,
                        blocksize=porcupine.frame_length,
                        channels=1,
                        dtype="int16",
                        callback=callback):
        # Block until the callback signals a detection rather than sleeping
        # for a fixed interval
        detected.wait()
def transcribe():
    # Record up to 5 seconds of audio for a single utterance
    data = sd.rec(int(16000 * 5), samplerate=16000, channels=1, dtype='int16')
    sd.wait()
    pcm_bytes = data.tobytes()
    # Feed the whole utterance, then flush the recognizer for the final result
    recognizer.AcceptWaveform(pcm_bytes)
    result = json.loads(recognizer.FinalResult())
    return result.get("text", "")
def get_response(text):
    try:
        r = requests.post("http://192.168.1.100/llm", json={"prompt": text}, timeout=10)
        r.raise_for_status()
        return r.json().get("response", "")
    except requests.RequestException as e:
        print("LLM request failed:", e)
        return ""

try:
    while True:
        print("Awaiting wake word…")
        wait_for_wake()
        print("Wake word heard. Please speak.")
        user_text = transcribe()
        if not user_text:
            print("No speech detected, restarting.")
            continue
        print("You said:", user_text)
        reply = get_response(user_text)
        print("Assistant:", reply)
        engine.say(reply)
        engine.runAndWait()
except KeyboardInterrupt:
    print("Interrupted by user.")
finally:
    porcupine.delete()
SYSTEM ARCHITECTURE AND ORCHESTRATION
Bringing all subsystems together requires a main program that listens for the wake word, captures audio, obtains transcription, forwards the text to the LLM, and then plays back the synthesized response. A straightforward implementation uses a loop that blocks on audio input and then performs HTTP calls sequentially. For lower latency, threads or asynchronous I/O can overlap wake-word detection, transcription, inference, and synthesis. The simplified sketch below uses blocking calls for clarity. In a production system, proper error handling and resource management would be essential.
Introduction to the code example
The following Python script ties the wake-word detector, microphone capture, speech-to-text recognizer, LLM client, and text-to-speech engine into a single flow. It waits for the wake word, captures speech, prints the transcript, sends it to the LLM, and then speaks the returned text. It uses the previously shown Vosk, Porcupine, and pyttsx3 examples within a unified context.
import json
import threading

import sounddevice as sd
from vosk import Model, KaldiRecognizer
import pvporcupine
import requests
import pyttsx3

# "computer" is one of Porcupine's built-in keywords; newer Porcupine releases
# also require an access_key argument obtained from the Picovoice Console.
porcupine = pvporcupine.create(keywords=["computer"])
model = Model("vosk-model-small-en-us-0.15")
recognizer = KaldiRecognizer(model, 16000)
engine = pyttsx3.init()
engine.setProperty('rate', 150)
def wait_for_wake():
    detected = threading.Event()
    def callback(indata, frames, time, status):
        if status:
            return
        pcm = indata[:, 0].tolist()
        if porcupine.process(pcm) >= 0:
            detected.set()
            raise sd.CallbackStop
    with sd.InputStream(samplerate=porcupine.sample_rate,
                        blocksize=porcupine.frame_length,
                        channels=1,
                        dtype="int16",
                        callback=callback):
        # Wait for the callback to signal a detection
        detected.wait()
def transcribe():
    # Record up to 5 seconds of audio for a single utterance
    data = sd.rec(int(16000 * 5), samplerate=16000, channels=1, dtype='int16')
    sd.wait()
    raw = data.tobytes()
    # Flush the recognizer to obtain the final result for this utterance
    recognizer.AcceptWaveform(raw)
    result = json.loads(recognizer.FinalResult())
    return result.get("text", "")
def get_response(text):
    r = requests.post("http://192.168.1.100/llm", json={"prompt": text}, timeout=10)
    r.raise_for_status()
    return r.json().get("response", "")

try:
    while True:
        print("Awaiting wake word...")
        wait_for_wake()
        print("Wake word heard. Please speak.")
        user_text = transcribe()
        print("You said:", user_text)
        reply = get_response(user_text)
        print("Assistant:", reply)
        engine.say(reply)
        engine.runAndWait()
except KeyboardInterrupt:
    print("Shutting down.")
finally:
    porcupine.delete()
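As noted above, threads or asynchronous I/O can overlap the pipeline stages, which the blocking script does not show. The sketch below illustrates one minimal approach under that idea: a dedicated worker thread owns its own pyttsx3 engine and drains a queue, so the main loop can return to wake-word listening while a reply is still being spoken. The queue, worker function, and sentinel value are additions for illustration and do not appear in the original script.
import queue
import threading

import pyttsx3

speech_queue = queue.Queue()

def tts_worker():
    # A dedicated thread owns the TTS engine so synthesis never blocks the
    # wake-word / transcription loop running in the main thread.
    tts_engine = pyttsx3.init()
    tts_engine.setProperty('rate', 150)
    while True:
        text = speech_queue.get()
        if text is None:  # sentinel used to shut the worker down cleanly
            break
        tts_engine.say(text)
        tts_engine.runAndWait()

threading.Thread(target=tts_worker, daemon=True).start()

# In the main loop, replace engine.say(reply); engine.runAndWait() with:
#     speech_queue.put(reply)
# and enqueue None during shutdown to stop the worker.
One caveat: with playback and listening overlapped, the microphone can pick up the assistant's own speech, so in practice you may want to suppress wake-word detection or mute the input stream while the queue is non-empty.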
CONCLUSION
By combining commodity hardware, open-source speech libraries, wake-word detection, and a large language model, it is possible to construct a functional voice assistant within a private network. Performance will hinge on the choice of platform and model size, and network latency must be managed if inference is offloaded. Further enhancements might include securing API endpoints with authentication, adding custom skills via plug-in modules, and integrating with home automation protocols. The blueprint presented here serves as a starting point for makers and software engineers to explore voice-driven interfaces without relying on proprietary cloud services.