CHAPTER I: THE TALKING MACHINE - WHY VOICE MATTERS FOR AGENTIC AI
There is something deeply human about voice. Long before we learned to write, we spoke. Long before we typed, we talked. And now, as artificial intelligence matures from a curiosity into a genuine productivity tool, the most natural frontier is giving AI agents the ability to speak and to listen.
Agentic AI is not just a chatbot. It is a system in which one or more AI models are given tools, memory, and autonomy to pursue goals across multiple steps. An agent can search the web, write and execute code, send emails, query databases, and coordinate with other agents. But until recently, almost all of this happened through text. You typed. The agent replied in text. You read. This is fine, but it is not how most humans prefer to communicate when they are driving a car, cooking dinner, walking the dog, or simply feeling too tired to type.
Voice changes everything. When an agent can listen to you speak and respond in a natural voice, the interaction becomes fluid and ambient. You can have a conversation while doing something else entirely. You can ask a complex question in the same way you would ask a knowledgeable colleague, and receive an answer that sounds like a person talking back to you.
The good news is that the technology to do this is now mature, open-source, and surprisingly easy to wire together. OpenAI's Whisper model can transcribe speech with near-human accuracy, running entirely on your own machine. A library called edge-tts can produce Microsoft-quality neural voices for free, with no API key required. Ollama lets you run powerful language models like Llama 3, Mistral, or Phi-3 locally, with zero cloud costs and complete privacy. And the python-telegram-bot library makes it trivial to deploy your agent as a Telegram bot that can receive voice messages and reply with spoken audio.
This article will take you through all of it. We will start from first principles, build up the conceptual framework, compare every major option for both STT and TTS, introduce the fascinating open-source tool voicebox.sh, and then implement a complete, running, minimalistic agentic AI system with conversation history management that can talk to you directly from your terminal or remotely through Telegram. We will also show how the same architecture applies to Signal. Every code example is tested, commented, and designed to be clean and readable. Fasten your seatbelt. Your agent is about to find its voice.
CHAPTER II: THE ANATOMY OF A VOICE-CAPABLE AGENT
Before we write a single line of code, it is worth understanding the architecture we are building. A voice-capable agentic AI system has a small number of clearly defined layers, each with a specific responsibility. Getting these layers right from the start makes the system easy to extend, easy to debug, and easy to swap out components when better options appear.
The first layer is the Transport Layer. This is how audio gets into and out of the system. In a local setup, this means your microphone and your speakers. In a remote setup, it means a messaging platform like Telegram or Signal, which carries audio as file attachments. The transport layer does not care about what is in the audio. It simply moves bytes from one place to another.
The second layer is the Speech-to-Text Layer (STT). This layer receives raw audio and produces a text transcript. It is the agent's ears. The STT layer must handle different audio formats (OGG, WAV, MP3, FLAC), different languages, different accents, and varying levels of background noise. The quality of this layer determines how well the agent understands what you say.
The third layer is the Agent Core. This is the brain. It receives a text string from the STT layer, processes it in the context of the conversation history, decides what to do, calls any tools it needs, and produces a text response. The agent core is where the LLM lives, where memory is managed, and where the intelligence resides.
The fourth layer is the Text-to-Speech Layer (TTS). This layer receives the text response from the agent core and converts it into audio. It is the agent's mouth. The TTS layer must produce audio that sounds natural, is easy to understand, and is delivered in a format that the transport layer can send.
The fifth layer is the Conversation History Manager. This is a cross-cutting concern that sits between the transport layer and the agent core. It maintains a record of all previous turns in the conversation, so the agent can refer back to what was said earlier, maintain context, and avoid repeating itself.
Here is a diagram of the architecture in ASCII form:
+------------------+     audio      +------------------+
| TRANSPORT LAYER  |--------------->|    STT LAYER     |
| (mic / Telegram  |                | (Whisper / Vosk  |
|  / Signal)       |<---------------|  / cloud APIs)   |
+------------------+     audio      +------------------+
         |                                   |
         |                              text |
         v                                   v
+------------------+      text      +------------------+
|    TTS LAYER     |<---------------|    AGENT CORE    |
| (edge-tts / gTTS |                | (LLM + tools +   |
|  / pyttsx3 /     |                |  history)        |
|  voicebox.sh)    |                +------------------+
+------------------+                         |
                                             |
                                   +---------+---------+
                                   |  HISTORY MANAGER  |
                                   | (in-memory list / |
                                   |  SQLite / Redis)  |
                                   +-------------------+
Each of these layers is independently replaceable. You can swap Whisper for Deepgram without touching the agent core. You can swap edge-tts for ElevenLabs without touching the STT layer. You can swap Ollama for the OpenAI API without touching anything else. This is the power of clean architecture applied to agentic AI. Let us now explore each layer in depth, starting with the ears.
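The layering described above can be sketched with a few structural interfaces. This is only an illustration under stated assumptions, not the implementation built later in this article: the names SpeechToText, TextToSpeech, AgentCore, VoicePipeline, and the three stand-in classes are hypothetical, and the stand-ins pass text around as bytes so the example runs without any audio libraries.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Protocol


class SpeechToText(Protocol):
    """Anything that turns audio bytes into text (the 'ears')."""
    def transcribe(self, audio: bytes) -> str: ...


class AgentCore(Protocol):
    """Anything that turns a conversation history into a reply (the 'brain')."""
    def reply(self, history: List[Dict[str, str]]) -> str: ...


class TextToSpeech(Protocol):
    """Anything that turns text into audio bytes (the 'mouth')."""
    def synthesize(self, text: str) -> bytes: ...


@dataclass
class VoicePipeline:
    """Wires the layers together; each field can be replaced independently."""
    stt: SpeechToText
    agent: AgentCore
    tts: TextToSpeech
    history: List[Dict[str, str]] = field(default_factory=list)

    def handle_audio(self, audio: bytes) -> bytes:
        user_text = self.stt.transcribe(audio)
        self.history.append({"role": "user", "content": user_text})
        answer = self.agent.reply(self.history)
        self.history.append({"role": "assistant", "content": answer})
        return self.tts.synthesize(answer)


# Hypothetical stand-ins so the sketch runs without models or audio devices.
class FakeSTT:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")


class EchoAgent:
    def reply(self, history: List[Dict[str, str]]) -> str:
        return f"You said: {history[-1]['content']}"


class FakeTTS:
    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")


pipeline = VoicePipeline(stt=FakeSTT(), agent=EchoAgent(), tts=FakeTTS())
reply_audio = pipeline.handle_audio(b"hello")
print(reply_audio)
```

Because the pipeline depends only on the method names, replacing FakeSTT with a Whisper-backed class, or FakeTTS with an edge-tts-backed one, would not require changes to VoicePipeline itself.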
CHAPTER III: SPEECH-TO-TEXT - GIVING YOUR AGENT EARS
Speech-to-Text, also called Automatic Speech Recognition (ASR), is the process of converting an audio signal containing human speech into a string of text. This is a solved problem in the sense that the technology works reliably, but it is an unsolved problem in the sense that no single solution is perfect for every situation. The right choice depends on your constraints around privacy, latency, accuracy, cost, and language support.
OpenAI Whisper
Whisper is, without question, the most important development in open-source speech recognition in the last decade. Released by OpenAI in 2022 and continuously improved since, it is a transformer-based model trained on 680,000 hours of multilingual audio from the internet. It supports 99 languages, handles accents and background noise gracefully, and runs entirely on your local machine with no API key and no data leaving your premises.
Whisper comes in five sizes: tiny (39M parameters), base (74M), small (244M), medium (769M), and large (1.5B). The tiny model runs in real-time on a CPU and is good enough for clear speech in a quiet environment. The large model requires a GPU but produces near-human transcription accuracy even in difficult conditions. For most agent applications, the base or small model is the right starting point.
Installing Whisper is straightforward:
pip install openai-whisper
You also need ffmpeg on your system, which handles audio format conversion:
# On Ubuntu or Debian
sudo apt install ffmpeg
# On macOS with Homebrew
brew install ffmpeg
# On Windows with Chocolatey
choco install ffmpeg
The simplest possible use of Whisper in Python looks like this:
import whisper

# Load the model once at startup, not on every call.
# This is important because model loading takes several seconds.
model = whisper.load_model("base")


def transcribe_audio_file(file_path: str) -> str:
    """
    Transcribe an audio file to text using OpenAI Whisper.

    The function accepts any audio format that ffmpeg can read,
    including WAV, MP3, OGG, FLAC, and M4A. It returns the
    transcribed text as a plain string.

    Args:
        file_path: Path to the audio file on disk.

    Returns:
        The transcribed text, stripped of leading and trailing whitespace.
    """
    result = model.transcribe(file_path)
    return result["text"].strip()
The result dictionary returned by model.transcribe() contains not just the text but also language detection results, word-level timestamps (if you ask for them), and segment-level confidence information. For a basic agent, you only need result["text"], but the richer information is there when you need it.
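To make the richer fields concrete, here is a small sketch that reads the detected language and the per-segment data from a result dictionary. The helper name summarize_transcription and the 0.6 threshold are assumptions chosen for illustration, and the sample dictionary is hand-written in the shape Whisper returns so the example runs without loading a model.

```python
from typing import Any, Dict, List


def summarize_transcription(
    result: Dict[str, Any],
    max_no_speech: float = 0.6,  # illustrative threshold, not a Whisper default
) -> Dict[str, Any]:
    """
    Pull the detected language and the usable segments out of a
    Whisper-style result dictionary.

    Segments whose 'no_speech_prob' exceeds max_no_speech are dropped,
    which is one simple way to discard stretches that are probably noise.
    """
    segments: List[Dict[str, Any]] = result.get("segments", [])
    kept = [s for s in segments if s.get("no_speech_prob", 0.0) <= max_no_speech]
    return {
        "language": result.get("language"),
        "text": " ".join(s["text"].strip() for s in kept),
        "spans": [(s["start"], s["end"]) for s in kept],
    }


# Hand-written sample shaped like the output of model.transcribe().
sample_result = {
    "text": " Hello there. [noise]",
    "language": "en",
    "segments": [
        {"start": 0.0, "end": 1.2, "text": " Hello there.",
         "avg_logprob": -0.2, "no_speech_prob": 0.02},
        {"start": 1.2, "end": 2.0, "text": " [noise]",
         "avg_logprob": -1.4, "no_speech_prob": 0.91},
    ],
}

summary = summarize_transcription(sample_result)
print(summary)
```

With a real model you would pass the return value of model.transcribe(file_path) instead of sample_result. Word-level timestamps are only present if you call model.transcribe(file_path, word_timestamps=True).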
One important practical detail: Whisper internally works with 30-second chunks of audio. For longer recordings, it automatically segments and reassembles the transcript. This means you do not need to worry about chunking for most conversational use cases, since voice messages in a chat context are rarely longer than a minute or two.
Vosk: The Lightweight Offline Alternative
Vosk is a different beast from Whisper. Where Whisper is a large neural model that processes audio in batch mode (you give it a file and it gives you back text), Vosk is designed for real-time streaming recognition. It uses smaller, more efficient models and can run on devices as constrained as a Raspberry Pi.
The trade-off is accuracy. Vosk is not as accurate as Whisper, especially for accented speech or noisy environments. But if you need real-time word-by-word transcription, or if you are deploying on a device with very limited compute, Vosk is an excellent choice.
pip install vosk
You also need to download a language model separately from alphacephei.com/vosk. The English model is about 40MB for the small version and 1.8GB for the large version.
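The streaming style of recognition described above can be sketched as a loop that feeds audio to the recognizer in small chunks. This is a minimal sketch, assuming a mono 16-bit PCM WAV file; the function names and the chunk size of 4000 frames are illustrative choices. The recognizer is passed in as a parameter, so the loop can be exercised without a downloaded model, and vosk itself is only imported inside make_vosk_recognizer.

```python
import json
import wave


def transcribe_wav_streaming(wav_path: str, recognizer, frames_per_chunk: int = 4000) -> str:
    """
    Feed a WAV file to a Vosk-style recognizer chunk by chunk and
    return the concatenated transcript.

    The recognizer must provide AcceptWaveform(), Result(), and
    FinalResult(), as vosk.KaldiRecognizer does.
    """
    pieces = []
    with wave.open(wav_path, "rb") as wav:
        while True:
            data = wav.readframes(frames_per_chunk)
            if not data:
                break
            # AcceptWaveform() returns a true value when the recognizer
            # has decided that an utterance is complete.
            if recognizer.AcceptWaveform(data):
                pieces.append(json.loads(recognizer.Result()).get("text", ""))
        # Flush whatever is still buffered after the last chunk.
        pieces.append(json.loads(recognizer.FinalResult()).get("text", ""))
    return " ".join(p for p in pieces if p)


def make_vosk_recognizer(model_dir: str, sample_rate: int):
    """
    Build a real recognizer. model_dir is the folder of an unpacked
    Vosk model; sample_rate must match the audio (often 16000).
    """
    from vosk import KaldiRecognizer, Model  # imported lazily on purpose

    return KaldiRecognizer(Model(model_dir), sample_rate)
```

A real call would look like transcribe_wav_streaming("message.wav", make_vosk_recognizer("path/to/model", 16000)), where both paths are placeholders. For live word-by-word feedback you would additionally read recognizer.PartialResult() inside the loop.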
Cloud-Based STT: Deepgram, AssemblyAI, Google, Azure
Cloud-based STT services offer the highest accuracy and the most features, but they require an internet connection, an API key, and they send your audio data to a third-party server. For enterprise applications where privacy is not a concern, or where you need features like speaker diarization (identifying who said what), real-time streaming with very low latency, or automatic punctuation and formatting, cloud services are worth considering.
Deepgram is particularly popular in the agentic AI community because it offers a Python SDK, extremely low latency (under 300ms for streaming), and competitive pricing. AssemblyAI offers excellent accuracy and a generous free tier. Google Cloud Speech-to-Text and Microsoft Azure Speech Services are the enterprise-grade options with the most extensive language support.
For the purposes of this article, we will use Whisper as our primary STT engine because it is free, private, accurate, and requires no external accounts. We will show how to swap it out for a cloud service when needed.
The SpeechRecognition Library: A Unified Interface
The Python SpeechRecognition library (pip install SpeechRecognition) provides a unified interface to many different STT backends, including Google Web Speech API, CMU Sphinx, IBM Watson, Wit.ai, and others. It is useful for quick prototyping and for applications that need to switch between backends with minimal code changes. Recent versions also offer a recognize_whisper method, but it is only a thin wrapper that still requires the openai-whisper package to be installed, and the library's Google backend requires an internet connection and has usage limits.
For our agent system, we will use openai-whisper directly, which gives us more control and better performance than going through the SpeechRecognition wrapper.
CHAPTER IV: TEXT-TO-SPEECH - GIVING YOUR AGENT A VOICE
Text-to-Speech (TTS) is the inverse of STT: it takes a string of text and produces an audio signal that sounds like a human voice speaking that text. The quality of TTS has improved dramatically in the last five years, driven by the same neural network revolution that produced Whisper and the large language models. Modern neural TTS systems produce voices that are nearly indistinguishable from real human speech, with natural prosody, appropriate emphasis, and convincing emotional tone.
edge-tts: Free, Neural, No API Key Required
The edge-tts library is a Python wrapper around the TTS engine built into Microsoft Edge's browser. Microsoft uses this engine to power the "Read Aloud" feature in Edge, and it is backed by the same Azure Cognitive Services neural voice technology that costs money when accessed directly through the Azure API. The edge-tts library accesses it for free by mimicking the browser's requests.
This is, frankly, remarkable. You get access to over 300 neural voices in dozens of languages, all for free, with no API key, no account, and no usage limits (within reason). The voices sound genuinely good. For most agent applications, edge-tts is the right default choice.
pip install edge-tts
The library is asynchronous by design, which fits naturally into the async architecture of a Telegram bot. Here is a complete, well-commented example of how to use it:
import asyncio
import edge_tts
import os

# A curated selection of high-quality English voices available in edge-tts.
# Run 'python -m edge_tts --list-voices' to see all available voices.
VOICE_EN_FEMALE = "en-US-JennyNeural"
VOICE_EN_MALE = "en-US-GuyNeural"
VOICE_EN_ARIA = "en-US-AriaNeural"


async def text_to_speech_edge(
    text: str,
    output_path: str,
    voice: str = VOICE_EN_ARIA,
    rate: str = "+0%",
    volume: str = "+0%"
) -> str:
    """
    Convert text to speech using Microsoft Edge's neural TTS engine.

    This function is asynchronous and must be called with 'await' inside
    an async context, or with asyncio.run() from synchronous code.

    The 'rate' parameter controls speaking speed. Use "+20%" for faster
    speech or "-10%" for slower speech. The 'volume' parameter works
    similarly: "+10%" is louder, "-10%" is quieter.

    Args:
        text: The text to convert to speech.
        output_path: Where to save the resulting MP3 file.
        voice: The edge-tts voice name to use.
        rate: Speaking rate adjustment (e.g., "+0%", "+20%", "-10%").
        volume: Volume adjustment (e.g., "+0%", "+10%", "-5%").

    Returns:
        The output_path where the audio file was saved.

    Raises:
        edge_tts.exceptions.NoAudioReceived: If the TTS service returns
            no audio, which can happen with very short or empty text strings.
    """
    communicate = edge_tts.Communicate(text, voice, rate=rate, volume=volume)
    await communicate.save(output_path)
    return output_path


def text_to_speech_edge_sync(
    text: str,
    output_path: str,
    voice: str = VOICE_EN_ARIA
) -> str:
    """
    Synchronous wrapper around text_to_speech_edge for use in
    non-async contexts. This is a convenience function that creates
    a new event loop, runs the async function, and returns the result.

    Args:
        text: The text to convert to speech.
        output_path: Where to save the resulting MP3 file.
        voice: The edge-tts voice name to use.

    Returns:
        The output_path where the audio file was saved.
    """
    return asyncio.run(text_to_speech_edge(text, output_path, voice))
One important caveat: edge-tts requires an internet connection because it contacts Microsoft's servers. If you need fully offline TTS, you need a different solution.
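The rate and volume arguments shown in the edge-tts example are signed percentage strings such as "+20%" or "-10%". If the rest of your agent stores speed as a plain multiplier, a small conversion helper avoids hand-formatting those strings. This is a sketch; the name signed_percent is hypothetical and not part of edge-tts.

```python
def signed_percent(multiplier: float) -> str:
    """
    Convert a multiplier (1.0 means unchanged) into a signed percentage
    string in the "+20%" / "-10%" style used for rate and volume.

    Args:
        multiplier: 1.2 means 20 percent more, 0.9 means 10 percent less.

    Returns:
        A string with an explicit sign, a whole number, and a percent sign.
    """
    if multiplier <= 0:
        raise ValueError("multiplier must be positive")
    offset = round((multiplier - 1.0) * 100)
    return f"{offset:+d}%"


print(signed_percent(1.2))   # +20%
print(signed_percent(0.9))   # -10%
print(signed_percent(1.0))   # +0%
```

The result can be passed straight through, for example rate=signed_percent(1.2) when calling text_to_speech_edge.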
pyttsx3: Fully Offline, Cross-Platform
pyttsx3 is the workhorse of offline TTS in Python. It uses the TTS engine built into your operating system: SAPI5 on Windows, NSSpeechSynthesizer on macOS, and eSpeak on Linux. The voices are not as natural as edge-tts or cloud services, but they work completely offline and add almost no latency for short texts, because no network round trip is involved.
pip install pyttsx3
The Linux voice quality with eSpeak is noticeably robotic. On macOS, the built-in voices (especially the "Alex" voice) are quite good. On Windows, the SAPI5 voices are decent. If you are deploying on a Linux server and need offline TTS, consider Piper (discussed below) instead of pyttsx3.
import pyttsx3
import tempfile
import os


def text_to_speech_pyttsx3(text: str, output_path: str) -> str:
    """
    Convert text to speech using the system's built-in TTS engine via pyttsx3.

    This function works completely offline. The voice quality depends on
    the operating system: best on macOS, good on Windows, robotic on Linux.
    On Linux, this uses eSpeak under the hood. For better quality on Linux,
    consider using Piper TTS or edge-tts (which requires internet).

    Args:
        text: The text to convert to speech.
        output_path: Where to save the resulting WAV file. Note that
            pyttsx3 saves as WAV, not MP3.

    Returns:
        The output_path where the audio file was saved.
    """
    engine = pyttsx3.init()
    # Set a comfortable speaking rate. The default is often too fast.
    # 150 words per minute is a natural conversational pace.
    engine.setProperty("rate", 150)
    # Set volume to maximum (range is 0.0 to 1.0).
    engine.setProperty("volume", 1.0)
    # Save to file instead of playing through speakers.
    engine.save_to_file(text, output_path)
    engine.runAndWait()
    return output_path
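Because voice quality in pyttsx3 depends on which system voice is active, it can help to pick a voice by name before synthesizing. The sketch below uses the engine's "voices" and "voice" properties; the helper name select_voice and the substring-matching rule are assumptions made for illustration. The engine is passed in as a parameter, so the helper itself does not import pyttsx3.

```python
from typing import Optional


def select_voice(engine, keyword: str) -> Optional[str]:
    """
    Activate the first installed voice whose name contains 'keyword'.

    Args:
        engine: A pyttsx3 engine, as returned by pyttsx3.init().
        keyword: Case-insensitive fragment of the voice name, e.g. "alex".

    Returns:
        The id of the selected voice, or None if nothing matched
        (in which case the engine keeps its current voice).
    """
    wanted = keyword.lower()
    for voice in engine.getProperty("voices"):
        if wanted in (voice.name or "").lower():
            engine.setProperty("voice", voice.id)
            return voice.id
    return None
```

A typical use would be engine = pyttsx3.init() followed by select_voice(engine, "alex") before calling engine.save_to_file(). Which names are available depends entirely on the voices installed on the machine.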
gTTS: Google's Free TTS
gTTS (Google Text-to-Speech) uses Google's public TTS API, the same one that powers Google Translate's "listen" button. It produces good-quality audio, supports many languages, and is free for reasonable usage. Like edge-tts, it requires an internet connection.
pip install gTTS
from gtts import gTTS
import os


def text_to_speech_gtts(
    text: str,
    output_path: str,
    language: str = "en",
    slow: bool = False
) -> str:
    """
    Convert text to speech using Google's TTS service via gTTS.

    This requires an internet connection. The quality is good but not
    as natural as edge-tts's neural voices. It is, however, extremely
    simple to use and supports a very wide range of languages.

    The 'slow' parameter, when True, generates speech at a reduced
    speed, which can be useful for language learning applications.

    Args:
        text: The text to convert to speech.
        output_path: Where to save the resulting MP3 file.
        language: BCP-47 language code (e.g., "en", "de", "fr", "ja").
        slow: If True, speak more slowly.

    Returns:
        The output_path where the audio file was saved.
    """
    tts = gTTS(text=text, lang=language, slow=slow)
    tts.save(output_path)
    return output_path
ElevenLabs: The Gold Standard for Voice Quality
ElevenLabs offers the most realistic AI voices available today. The voices have natural emotion, appropriate emphasis, and convincing prosody. ElevenLabs also offers voice cloning, which lets you create a custom voice from a short audio sample. The trade-off is cost: ElevenLabs is a paid service, though it has a free tier with a limited monthly character quota.
pip install elevenlabs
For production agentic AI systems where voice quality is a key differentiator, ElevenLabs is worth the cost. For experimentation and development, edge-tts provides excellent quality for free.
Piper: The Best Offline Neural TTS
Piper is a fast, local neural TTS system developed by the Home Assistant community. It runs entirely on your machine, produces surprisingly natural voices, and is optimized for low-latency use on devices like the Raspberry Pi. Piper is the right choice when you need offline TTS with neural voice quality.
You run Piper as a command-line tool and call it from Python using subprocess, or use the piper-tts Python package. The voices are downloaded separately as ONNX model files from the Piper releases page on GitHub.
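Calling Piper through subprocess can be sketched as follows. This assumes the piper executable is on your PATH and that a voice model has already been downloaded; the function names are hypothetical, and the model filename in the usage note is only an example. The text is sent on standard input, and the audio is written to the path given with --output_file.

```python
import subprocess
from typing import List


def build_piper_command(
    model_path: str,
    output_path: str,
    executable: str = "piper",
) -> List[str]:
    """Assemble the Piper command line as an argument list (no shell quoting needed)."""
    return [executable, "--model", model_path, "--output_file", output_path]


def text_to_speech_piper(text: str, model_path: str, output_path: str) -> str:
    """
    Synthesize 'text' to a WAV file by running Piper as a child process.

    Raises:
        subprocess.CalledProcessError: If Piper exits with an error.
        FileNotFoundError: If the piper executable cannot be found.
    """
    command = build_piper_command(model_path, output_path)
    # Piper reads the text to speak from standard input.
    subprocess.run(command, input=text.encode("utf-8"), check=True)
    return output_path
```

A call might look like text_to_speech_piper("Hello from Piper", "en_US-lessac-medium.onnx", "hello.wav"). Passing the arguments as a list instead of a shell string means text containing quotes or other special characters cannot break the command.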
For our main example, we will use edge-tts as the default TTS engine because it offers the best balance of quality, simplicity, and cost (free). We will design the code so that swapping to any other engine requires changing only a single function.
CHAPTER V: COMPARING YOUR OPTIONS - WHEN TO USE WHAT
Making the right choice for STT and TTS depends on your specific requirements. Here is a systematic comparison that will help you decide.
For Speech-to-Text, the decision tree looks like this: If privacy is paramount and you cannot send audio to any external server, use Whisper (local) or Vosk (local, lower resource). If you need real-time streaming transcription with very low latency and privacy is less critical, use Deepgram or AssemblyAI. If you need speaker diarization (knowing who said what in a multi-speaker recording), use AssemblyAI or Deepgram. If you need to support a very wide range of languages with high accuracy, use Whisper large or Google Cloud Speech-to-Text. If you are deploying on a resource-constrained device like a Raspberry Pi, use Vosk with a small model.
For Text-to-Speech, the decision tree is similar: If you need the best possible voice quality and can afford a paid service, use ElevenLabs. If you need excellent quality for free and have an internet connection, use edge-tts. If you need good quality for free with internet, use gTTS. If you need fully offline operation with neural quality, use Piper. If you need fully offline operation and do not care about voice quality, use pyttsx3. If you want to experiment with voice cloning and local-first operation, use voicebox.sh.
For our tutorial, we will implement a pluggable STT/TTS architecture that defaults to Whisper + edge-tts, with clear instructions on how to swap each component.
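The two decision trees above can be written down as plain functions, which is handy if you want the choice of engine to come from configuration and not be hard-coded. This is one possible encoding, not a definitive ranking: the function names, parameter names, and the order in which the conditions are checked are assumptions, and where the text offers two options the sketch simply returns one of them.

```python
def choose_stt(
    offline_only: bool,
    needs_streaming: bool = False,
    needs_diarization: bool = False,
    low_resource: bool = False,
) -> str:
    """Return the name of an STT engine that fits the given constraints."""
    if offline_only:
        # Vosk covers small devices and real-time streaming without a network.
        return "vosk" if (low_resource or needs_streaming) else "whisper"
    if needs_diarization:
        return "assemblyai"
    if needs_streaming:
        return "deepgram"
    return "whisper"


def choose_tts(
    offline_only: bool,
    paid_ok: bool = False,
    neural_quality: bool = True,
) -> str:
    """Return the name of a TTS engine that fits the given constraints."""
    if offline_only:
        return "piper" if neural_quality else "pyttsx3"
    if paid_ok:
        return "elevenlabs"
    return "edge-tts"


# The defaults used in this article: private STT, free online TTS.
print(choose_stt(offline_only=True), choose_tts(offline_only=False))
```

The returned strings are only labels. In a pluggable design they would be looked up in a dictionary that maps each label to the function implementing that engine.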
CHAPTER VI: VOICEBOX.SH - THE OPEN-SOURCE VOICE STUDIO
Voicebox (voicebox.sh) deserves special attention because it is a genuinely exciting tool that goes beyond a simple library. It is a full desktop application and local AI voice studio that runs on macOS, Windows, and Linux. It is open-source (MIT license), available at github.com/jamiepine/voicebox, and it provides both a REST API and a Model Context Protocol (MCP) server for integration with AI agents.
What makes voicebox.sh special is its combination of features. It supports seven different TTS engines, voice cloning from as little as a few seconds of audio, system-wide dictation powered by OpenAI Whisper, a multi-track audio editor for creating conversations and podcasts, and post-processing effects like reverb, pitch shift, and compression. All of this runs locally on your machine, with no data leaving your device.
The REST API is particularly interesting for agentic AI integration. When voicebox.sh is running, it exposes endpoints at http://127.0.0.1:17493/. The key endpoints are:
The /speak endpoint (POST) accepts a JSON body with "text" and "profile" fields and plays the speech through your speakers immediately. This is ideal for a local agent that should speak its responses out loud.
The /generate endpoint (POST) accepts "text", "profile_id", "language", and "engine" fields and returns audio bytes that you can save to a file. This is ideal for generating audio files to send via Telegram or Signal.
The /transcribe endpoint (POST) accepts an audio file and returns a Whisper transcription. This gives you a clean STT API without having to manage the Whisper model yourself.
The /profiles endpoint (GET) lists all available voice profiles, including cloned voices and preset voices.
The MCP server at http://127.0.0.1:17493/mcp allows MCP-aware agents (like those built with Claude Code, Cursor, or Cline) to call voicebox.speak, voicebox.transcribe, and voicebox.list_profiles as native tools.
Here is how to integrate voicebox.sh into a Python agent using its REST API:
import requests
import json
from pathlib import Path
from typing import Optional

VOICEBOX_BASE_URL = "http://127.0.0.1:17493"


class VoiceboxClient:
    """
    A client for the voicebox.sh REST API.

    This class provides a clean Python interface to the voicebox.sh
    desktop application's API. Voicebox must be running on the local
    machine for this client to work.

    The client supports speaking text through the system speakers,
    generating audio files, transcribing audio, and listing available
    voice profiles.
    """

    def __init__(self, base_url: str = VOICEBOX_BASE_URL):
        """
        Initialize the client with the voicebox.sh API base URL.

        Args:
            base_url: The base URL of the voicebox.sh API.
                Defaults to the standard local address.
        """
        self.base_url = base_url
        self.session = requests.Session()
        self.session.headers.update({"Content-Type": "application/json"})

    def is_running(self) -> bool:
        """
        Check if voicebox.sh is currently running and accessible.

        Returns:
            True if the API is reachable, False otherwise.
        """
        try:
            response = self.session.get(
                f"{self.base_url}/profiles", timeout=2
            )
            return response.status_code == 200
        except requests.exceptions.RequestException:
            # Covers refused connections as well as timeouts.
            return False

    def speak(self, text: str, profile: str = "default") -> bool:
        """
        Speak text immediately through the system speakers.

        This is the simplest way to make your agent speak. It sends
        the text to voicebox.sh, which plays it through your speakers
        using the specified voice profile. The call blocks until
        voicebox.sh acknowledges the request, but audio playback
        may continue asynchronously.

        Args:
            text: The text to speak.
            profile: The name or ID of the voice profile to use.
                Use list_profiles() to see available profiles.

        Returns:
            True if the request was successful, False otherwise.
        """
        payload = {"text": text, "profile": profile}
        try:
            response = self.session.post(
                f"{self.base_url}/speak",
                data=json.dumps(payload),
                timeout=10
            )
            return response.status_code == 200
        except requests.exceptions.RequestException as e:
            print(f"[VoiceboxClient] speak() failed: {e}")
            return False

    def generate(
        self,
        text: str,
        output_path: str,
        profile_id: Optional[str] = None,
        language: str = "en"
    ) -> bool:
        """
        Generate an audio file from text without playing it.

        This is useful when you need to send audio via Telegram or
        Signal rather than playing it locally. The audio is saved
        to the specified output path.

        Args:
            text: The text to convert to speech.
            output_path: Where to save the generated audio file.
            profile_id: UUID of the voice profile to use. If None,
                the default profile is used.
            language: BCP-47 language code (e.g., "en", "de").

        Returns:
            True if the audio was generated and saved successfully.
        """
        payload = {"text": text, "language": language}
        if profile_id:
            payload["profile_id"] = profile_id
        try:
            response = self.session.post(
                f"{self.base_url}/generate",
                data=json.dumps(payload),
                timeout=30
            )
            if response.status_code == 200:
                Path(output_path).write_bytes(response.content)
                return True
            return False
        except requests.exceptions.RequestException as e:
            print(f"[VoiceboxClient] generate() failed: {e}")
            return False

    def transcribe(self, audio_path: str) -> str:
        """
        Transcribe an audio file to text using voicebox.sh's Whisper
        integration.

        Args:
            audio_path: Path to the audio file to transcribe.

        Returns:
            The transcribed text, or an empty string on failure.
        """
        try:
            with open(audio_path, "rb") as audio_file:
                # Use requests.post directly, not self.session: the session
                # forces a JSON Content-Type header, which would break the
                # multipart file upload.
                response = requests.post(
                    f"{self.base_url}/transcribe",
                    files={"file": audio_file},
                    timeout=60
                )
            if response.status_code == 200:
                return response.json().get("text", "").strip()
            return ""
        except (OSError, requests.exceptions.RequestException) as e:
            # OSError also covers a missing or unreadable audio file.
            print(f"[VoiceboxClient] transcribe() failed: {e}")
            return ""

    def list_profiles(self) -> list:
        """
        List all available voice profiles in voicebox.sh.

        Returns:
            A list of profile dictionaries, each containing at least
            'id' and 'name' keys. Returns an empty list on failure.
        """
        try:
            response = self.session.get(
                f"{self.base_url}/profiles", timeout=5
            )
            if response.status_code == 200:
                return response.json()
            return []
        except requests.exceptions.RequestException:
            return []
This client class is a clean, self-contained interface to voicebox.sh. You can drop it into any agent project and use it alongside or instead of the other TTS/STT options. The is_running() method is particularly useful for building a graceful fallback: if voicebox.sh is running, use it for the best voice quality; otherwise, fall back to edge-tts.
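The graceful fallback mentioned above can be sketched in a few lines. The function name synthesize_with_fallback and the returned labels are hypothetical. The first argument only needs the is_running() and generate() methods of the client shown earlier, and the fallback is any callable that takes the text and an output path, such as a synchronous edge-tts wrapper.

```python
from typing import Callable


def synthesize_with_fallback(
    text: str,
    output_path: str,
    voicebox,
    fallback: Callable[[str, str], object],
) -> str:
    """
    Try voicebox.sh first and use the fallback engine if it is not
    running or if generation fails.

    Returns:
        "voicebox" or "fallback", naming the engine that produced the file.
    """
    if voicebox.is_running() and voicebox.generate(text, output_path):
        return "voicebox"
    fallback(text, output_path)
    return "fallback"
```

With the pieces from this article, a call could look like synthesize_with_fallback("Hello", "reply.mp3", VoiceboxClient(), text_to_speech_edge_sync). Returning the engine name makes it easy to log which path was taken, which helps when the two engines produce different audio formats.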
CHAPTER VII: LLM BACKENDS - LOCAL AND REMOTE BRAINS
The intelligence in our agent system comes from a Large Language Model. We need to support both local LLMs (running on your own machine via Ollama) and remote LLMs (accessed via API, such as OpenAI's GPT-4 or Anthropic's Claude). The elegant solution is to use the OpenAI Python client library for both, because Ollama exposes an OpenAI-compatible API endpoint.
Ollama: Running LLMs Locally
Ollama is a tool that makes it trivially easy to download and run large language models on your local machine. You install it from ollama.com, and then pull models with a single command:
ollama pull llama3.2
ollama pull mistral
ollama pull phi3
Once a model is running, Ollama exposes it at http://localhost:11434/v1/ using the OpenAI API format. This means you can use the exact same Python code to talk to a local Llama 3 model as you would use to talk to OpenAI's GPT-4. You simply change the base_url and model name.
The OpenAI Python Client
The openai Python library is the standard way to interact with OpenAI's API, but because Ollama is OpenAI-compatible, we can use it for local models too.
pip install openai
Here is the LLM client abstraction we will use throughout this article:
from openai import OpenAI
from typing import List, Dict, Optional
import os

# Configuration constants for different LLM backends.
# These can also be loaded from environment variables or a config file.
OLLAMA_BASE_URL = "http://localhost:11434/v1"
OLLAMA_API_KEY = "ollama"  # Ollama ignores this, but the client requires it
OLLAMA_MODEL = "llama3.2"  # Change to any model you have pulled

OPENAI_BASE_URL = "https://api.openai.com/v1"
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "")
OPENAI_MODEL = "gpt-4o-mini"  # Cost-effective and capable

# A type alias for clarity. Each message is a dict with 'role' and 'content'.
MessageHistory = List[Dict[str, str]]


class LLMClient:
    """
    A unified client for both local (Ollama) and remote (OpenAI) LLMs.

    This class abstracts away the difference between local and remote
    LLM backends. Both use the OpenAI API format, so the only difference
    is the base_url, api_key, and model name.

    Usage:
        # For local Ollama:
        client = LLMClient(backend="ollama")

        # For remote OpenAI:
        client = LLMClient(backend="openai")

        # For any OpenAI-compatible API (e.g., Groq, Together AI):
        client = LLMClient(
            backend="custom",
            base_url="https://api.groq.com/openai/v1",
            api_key="your-groq-api-key",
            model="llama3-70b-8192"
        )
    """

    def __init__(
        self,
        backend: str = "ollama",
        base_url: Optional[str] = None,
        api_key: Optional[str] = None,
        model: Optional[str] = None,
        temperature: float = 0.7,
        max_tokens: int = 1024
    ):
        """
        Initialize the LLM client.

        Args:
            backend: One of "ollama", "openai", or "custom".
            base_url: Override the base URL (used with backend="custom").
            api_key: Override the API key (used with backend="custom").
            model: Override the model name.
            temperature: Controls randomness. 0.0 is deterministic,
                1.0 is very creative. 0.7 is a good default.
            max_tokens: Maximum number of tokens in the response.
        """
        self.temperature = temperature
        self.max_tokens = max_tokens

        if backend == "ollama":
            self.model = model or OLLAMA_MODEL
            self.client = OpenAI(
                base_url=base_url or OLLAMA_BASE_URL,
                api_key=api_key or OLLAMA_API_KEY
            )
        elif backend == "openai":
            self.model = model or OPENAI_MODEL
            self.client = OpenAI(
                base_url=base_url or OPENAI_BASE_URL,
                api_key=api_key or OPENAI_API_KEY
            )
        elif backend == "custom":
            if not base_url or not api_key or not model:
                raise ValueError(
                    "backend='custom' requires base_url, api_key, and model."
                )
            self.model = model
            self.client = OpenAI(base_url=base_url, api_key=api_key)
        else:
            raise ValueError(
                f"Unknown backend: '{backend}'. "
                f"Use 'ollama', 'openai', or 'custom'."
            )

        print(f"[LLMClient] Initialized with backend='{backend}', "
              f"model='{self.model}'")

    def chat(
        self,
        messages: MessageHistory,
        system_prompt: Optional[str] = None
    ) -> str:
        """
        Send a conversation history to the LLM and get a response.

        The messages list should contain the full conversation history
        in OpenAI format: a list of dicts with 'role' and 'content' keys.
        Roles can be 'user', 'assistant', or 'system'.

        If system_prompt is provided, it is prepended to the messages
        list as a system message. This is the cleanest way to give the
        agent its persona and instructions.

        Args:
            messages: The conversation history.
            system_prompt: Optional system prompt to prepend.

        Returns:
            The assistant's response as a plain string.

        Raises:
            Exception: If the API call fails for any reason.
        """
        # Build the full message list, prepending the system prompt if given.
        full_messages = []
        if system_prompt:
            full_messages.append({
                "role": "system",
                "content": system_prompt
            })
        full_messages.extend(messages)

        response = self.client.chat.completions.create(
            model=self.model,
            messages=full_messages,
            temperature=self.temperature,
            max_tokens=self.max_tokens
        )

        # Extract and return just the text content of the response.
        # The content field can be None, so fall back to an empty string.
        content = response.choices[0].message.content
        return (content or "").strip()

    def is_available(self) -> bool:
        """
        Check if the LLM backend is currently reachable.

        This sends a minimal request to verify the connection.
        Useful for startup health checks and graceful fallback logic.

        Returns:
            True if the backend responds successfully, False otherwise.
        """
        try:
            self.client.chat.completions.create(
                model=self.model,
                messages=[{"role": "user", "content": "ping"}],
                max_tokens=5
            )
            return True
        except Exception:
            return False
This LLMClient class is the heart of our agent's intelligence. Notice how it handles both Ollama and OpenAI with identical code, differing only in the constructor arguments. The chat() method takes a full conversation history, which is exactly what we need for a chatbot with memory. The system_prompt parameter lets us give the agent its personality and instructions without polluting the conversation history.
The is_available() method is a practical addition that lets the agent system perform a health check at startup and potentially fall back to a different backend if the primary one is unavailable.
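A startup fallback of this kind could be sketched as follows. The helper first_available() and the StubClient class are illustrative assumptions, not part of the article's code; any object that exposes is_available(), such as LLMClient, could be passed in their place.

```python
from typing import Iterable, Optional, Protocol


class HealthCheckable(Protocol):
    """Anything that exposes is_available(), for example an LLMClient."""

    def is_available(self) -> bool: ...


def first_available(
    clients: Iterable[HealthCheckable],
) -> Optional[HealthCheckable]:
    """Return the first client whose health check passes, or None."""
    for client in clients:
        if client.is_available():
            return client
    return None


class StubClient:
    """Hypothetical stand-in used only to demonstrate the helper."""

    def __init__(self, name: str, up: bool):
        self.name = name
        self.up = up

    def is_available(self) -> bool:
        return self.up


# Try the local backend first, then fall back to the remote one.
primary = StubClient("ollama", up=False)
secondary = StubClient("openai", up=True)
chosen = first_available([primary, secondary])
```

Ordering the candidates from preferred to least preferred keeps the policy in one place: the caller only has to handle the case where nothing is reachable and the helper returns None.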
CHAPTER VIII: BUILDING THE FULL SYSTEM - A MINIMALISTIC AGENTIC CHATBOT
Now we have all the pieces. Let us assemble them into a coherent system. We will build a minimalistic but complete agentic AI chatbot with the following properties:
The agent has a configurable persona defined by a system prompt. It maintains a full conversation history across all turns. It can receive input as text or as audio (via STT). It can respond as text or as audio (via TTS). It supports both local and remote LLM backends. It uses a pluggable architecture so that any STT or TTS engine can be swapped in with minimal code changes.
We start with the conversation history manager, because everything else depends on it.
The Conversation History Manager
Conversation history is what transforms a stateless LLM call into a stateful conversation. Without history, every message is treated as if it is the first message the agent has ever received. With history, the agent can refer to earlier parts of the conversation, remember what the user said, and build on previous responses.
The simplest possible history manager is a Python list of message dictionaries. Each dictionary has a "role" key (either "user" or "assistant") and a "content" key containing the text of that turn. This list grows with each turn and is passed in its entirety to the LLM on each call.
The practical problem with an unbounded list is that LLMs have a context window limit. If the conversation grows too long, the messages will exceed the model's maximum token count and the API call will fail or, depending on the backend, the oldest part of the prompt will be silently truncated. The solution is to implement a sliding window: keep only the most recent N turns, or trim the history when it exceeds a token budget.
from typing import List, Dict, Optional
from dataclasses import dataclass, field
import json
import os
from datetime import datetime
# A single message in the conversation.
Message = Dict[str, str]
@dataclass
class ConversationHistory:
    """
    Manages the conversation history for a single chat session.

    This class stores messages in memory and provides methods to add
    new messages, retrieve the history, and trim it when it grows too
    long. It also supports saving and loading history to/from a JSON
    file, which enables persistence across bot restarts.

    The max_turns parameter controls how many user+assistant turn pairs
    are kept in the active history. Older turns are dropped to stay
    within the LLM's context window. Setting max_turns to None keeps
    the full history (use with caution for long conversations).

    Attributes:
        session_id: A unique identifier for this conversation session.
        max_turns: Maximum number of turns to keep in active history.
        messages: The list of message dictionaries.
    """
    session_id: str = "default"
    max_turns: Optional[int] = 20
    messages: List[Message] = field(default_factory=list)

    def add_user_message(self, content: str) -> None:
        """
        Add a user message to the conversation history.

        Args:
            content: The text of the user's message.
        """
        self.messages.append({"role": "user", "content": content})
        self._trim_if_needed()

    def add_assistant_message(self, content: str) -> None:
        """
        Add an assistant message to the conversation history.

        Args:
            content: The text of the assistant's response.
        """
        self.messages.append({"role": "assistant", "content": content})

    def get_messages(self) -> List[Message]:
        """
        Get the current conversation history as a list of message dicts.

        This is the list you pass directly to LLMClient.chat().

        Returns:
            A copy of the current message list.
        """
        return list(self.messages)

    def _trim_if_needed(self) -> None:
        """
        Trim the history to stay within the max_turns limit.

        This keeps at most the most recent max_turns * 2 messages (each
        turn consists of one user message and one assistant message).
        The trimming happens after adding a user message. Slicing at
        that point can leave an assistant reply at the front whose
        question was dropped, so any leading non-user messages are
        removed and the window always starts with a user message.
        """
        if self.max_turns is None:
            return
        max_messages = self.max_turns * 2
        if len(self.messages) > max_messages:
            # Drop the oldest messages, keeping the most recent ones.
            self.messages = self.messages[-max_messages:]
            # Do not start the window with an orphaned assistant reply.
            while self.messages and self.messages[0]["role"] != "user":
                self.messages.pop(0)

    def clear(self) -> None:
        """
        Clear the entire conversation history.

        This starts a fresh conversation while keeping the session ID.
        """
        self.messages = []

    def save_to_file(self, directory: str = ".") -> str:
        """
        Save the conversation history to a JSON file.

        The filename is based on the session_id and the current date,
        so each day's conversation is saved separately.

        Args:
            directory: The directory where the file should be saved.

        Returns:
            The path to the saved file.
        """
        os.makedirs(directory, exist_ok=True)
        date_str = datetime.now().strftime("%Y-%m-%d")
        filename = f"history_{self.session_id}_{date_str}.json"
        filepath = os.path.join(directory, filename)
        data = {
            "session_id": self.session_id,
            "saved_at": datetime.now().isoformat(),
            "messages": self.messages
        }
        with open(filepath, "w", encoding="utf-8") as f:
            json.dump(data, f, indent=2, ensure_ascii=False)
        return filepath

    @classmethod
    def load_from_file(cls, filepath: str) -> "ConversationHistory":
        """
        Load a conversation history from a JSON file.

        Args:
            filepath: Path to the JSON file to load.

        Returns:
            A ConversationHistory instance with the loaded messages.

        Raises:
            FileNotFoundError: If the file does not exist.
            json.JSONDecodeError: If the file is not valid JSON.
        """
        with open(filepath, "r", encoding="utf-8") as f:
            data = json.load(f)
        history = cls(session_id=data.get("session_id", "default"))
        history.messages = data.get("messages", [])
        return history

    def __len__(self) -> int:
        """Return the number of messages in the history."""
        return len(self.messages)

    def __repr__(self) -> str:
        """Return a human-readable summary of the history."""
        return (
            f"ConversationHistory(session_id='{self.session_id}', "
            f"messages={len(self.messages)})"
        )
This ConversationHistory class is the foundation of our chatbot's memory. The _trim_if_needed() method bounds the history by turn count, which keeps us within the LLM's context window as long as individual messages stay reasonably short, and the save_to_file() and load_from_file() methods give us persistence across restarts. Notice that the class uses the dataclass decorator, which generates the init and comparison methods for us with minimal boilerplate; we define __repr__ ourselves so that the summary stays short instead of printing every message.
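Trimming by turn count is the simpler of the two sliding-window strategies mentioned earlier. As a complement, a token-budget variant might look like the sketch below. The function names are hypothetical, and the estimate of roughly four characters per token is a crude assumption; a real tokenizer for your model would be more accurate.

```python
from typing import Dict, List

Message = Dict[str, str]


def estimate_tokens(text: str) -> int:
    """Very rough estimate: about four characters per token (assumption)."""
    return max(1, len(text) // 4)


def trim_to_token_budget(messages: List[Message], budget: int) -> List[Message]:
    """Keep the newest messages whose estimated total fits within budget."""
    kept: List[Message] = []
    used = 0
    # Walk backwards so the most recent messages are considered first.
    for message in reversed(messages):
        cost = estimate_tokens(message["content"])
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    kept.reverse()
    # Start the window on a user message so the model sees whole exchanges.
    while kept and kept[0]["role"] != "user":
        kept.pop(0)
    return kept


history = [
    {"role": "user", "content": "a" * 40},       # about 10 tokens
    {"role": "assistant", "content": "b" * 40},  # about 10 tokens
    {"role": "user", "content": "c" * 40},       # about 10 tokens
]
trimmed = trim_to_token_budget(history, budget=25)
```

With a budget of 25, the two newest messages fit, but the window would then begin with an assistant reply, so that reply is dropped too and only the latest user message remains.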
The STT Engine Abstraction
We want to be able to swap STT engines without changing the rest of the code. The cleanest way to do this in Python is with a simple abstract base class and concrete implementations for each engine.
import abc
import os
import tempfile
from pathlib import Path
from typing import Optional
import whisper
class STTEngine(abc.ABC):
    """
    Abstract base class for Speech-to-Text engines.

    All STT engines must implement the transcribe() method, which
    accepts a path to an audio file and returns the transcribed text.
    The audio file can be in any format that the engine supports.
    """

    @abc.abstractmethod
    def transcribe(self, audio_path: str) -> str:
        """
        Transcribe an audio file to text.

        Args:
            audio_path: Path to the audio file on disk.

        Returns:
            The transcribed text as a plain string.
        """
        ...

    def transcribe_ogg(self, ogg_path: str) -> str:
        """
        Transcribe an OGG audio file (as received from Telegram).

        Telegram sends voice messages as OGG files with Opus encoding.
        Most STT engines prefer WAV format. This method converts the
        OGG to WAV using pydub/ffmpeg before transcribing.

        Args:
            ogg_path: Path to the OGG file.

        Returns:
            The transcribed text.
        """
        from pydub import AudioSegment

        # Create a temporary WAV file for the conversion.
        with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
            wav_path = tmp.name
        try:
            audio = AudioSegment.from_file(ogg_path, format="ogg")
            audio.export(wav_path, format="wav")
            return self.transcribe(wav_path)
        finally:
            # Always clean up the temporary file, even if transcription fails.
            if os.path.exists(wav_path):
                os.remove(wav_path)


class WhisperSTT(STTEngine):
    """
    STT engine using OpenAI's Whisper model running locally.

    This is the recommended default for most applications because
    it is free, private, accurate, and works offline.

    Model size guide:
        "tiny"   - Fastest, least accurate. Good for testing.
        "base"   - Good balance of speed and accuracy. Recommended default.
        "small"  - Better accuracy, slower. Good for production.
        "medium" - High accuracy, requires significant RAM/GPU.
        "large"  - Best accuracy, requires a GPU with 10GB+ VRAM.
    """

    def __init__(self, model_size: str = "base"):
        """
        Initialize the Whisper STT engine.

        The model is loaded once at initialization and reused for all
        subsequent transcription calls. Loading takes a few seconds
        the first time (and downloads the model if not cached).

        Args:
            model_size: The Whisper model size to use.
        """
        print(f"[WhisperSTT] Loading Whisper model '{model_size}'...")
        self.model = whisper.load_model(model_size)
        print("[WhisperSTT] Model loaded successfully.")

    def transcribe(self, audio_path: str) -> str:
        """
        Transcribe an audio file using the local Whisper model.

        Args:
            audio_path: Path to the audio file. Whisper accepts WAV,
                MP3, M4A, FLAC, OGG, and other formats.

        Returns:
            The transcribed text, stripped of whitespace.
        """
        result = self.model.transcribe(audio_path)
        return result["text"].strip()


class VoiceboxSTT(STTEngine):
    """
    STT engine using voicebox.sh's Whisper integration via REST API.

    Use this when voicebox.sh is running and you want to leverage its
    managed Whisper instance rather than loading Whisper yourself.
    This saves memory when voicebox.sh is already running for TTS.
    """

    def __init__(self, voicebox_client):
        """
        Initialize with a VoiceboxClient instance.

        Args:
            voicebox_client: An initialized VoiceboxClient instance.
        """
        self.client = voicebox_client

    def transcribe(self, audio_path: str) -> str:
        """
        Transcribe an audio file using voicebox.sh's API.

        Args:
            audio_path: Path to the audio file.

        Returns:
            The transcribed text.
        """
        return self.client.transcribe(audio_path)
The transcribe_ogg() method in the base class is a particularly useful addition. Telegram sends all voice messages as OGG files with Opus encoding. Rather than duplicating the OGG-to-WAV conversion logic in every STT engine, we put it in the base class where all engines can inherit it. This is a clean application of the Don't Repeat Yourself principle.
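One practical payoff of the abstraction is that the rest of the system can be tested without loading a speech model at all. The sketch below restates a minimal version of the STTEngine interface so that it runs on its own; CannedSTT is a hypothetical test double, not one of the article's engines.

```python
import abc
from typing import List


class STTEngine(abc.ABC):
    """Minimal restatement of the interface, so this snippet runs alone."""

    @abc.abstractmethod
    def transcribe(self, audio_path: str) -> str:
        ...


class CannedSTT(STTEngine):
    """Hypothetical test double: returns a fixed transcript for any file."""

    def __init__(self, transcript: str):
        self.transcript = transcript
        # Record every path we were asked to transcribe, for assertions.
        self.seen_paths: List[str] = []

    def transcribe(self, audio_path: str) -> str:
        self.seen_paths.append(audio_path)
        return self.transcript.strip()


stt = CannedSTT("  what is the weather today  ")
text = stt.transcribe("question.wav")
```

Because CannedSTT satisfies the same interface, it can be passed anywhere a WhisperSTT would be, which makes unit tests fast and deterministic.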
The TTS Engine Abstraction
We apply the same pattern to TTS:
import abc
import asyncio
import os
import tempfile
from typing import Optional
class TTSEngine(abc.ABC):
    """
    Abstract base class for Text-to-Speech engines.

    All TTS engines must implement the synthesize() method, which
    accepts a text string and a path where the audio file should be
    saved, and returns the path to the saved file.
    """

    @abc.abstractmethod
    def synthesize(self, text: str, output_path: str) -> str:
        """
        Convert text to speech and save the audio to a file.

        Args:
            text: The text to convert to speech.
            output_path: Where to save the audio file.

        Returns:
            The output_path where the audio was saved.
        """
        ...

    def synthesize_to_temp(self, text: str, suffix: str = ".mp3") -> str:
        """
        Convert text to speech and save to a temporary file.

        This is a convenience method that creates a temporary file,
        synthesizes the audio into it, and returns the path. The caller
        is responsible for deleting the temporary file when done.

        Args:
            text: The text to convert to speech.
            suffix: The file extension for the temporary file.

        Returns:
            Path to the temporary audio file.
        """
        with tempfile.NamedTemporaryFile(
            suffix=suffix, delete=False
        ) as tmp:
            output_path = tmp.name
        return self.synthesize(text, output_path)
class EdgeTTS(TTSEngine):
    """
    TTS engine using Microsoft Edge's neural voices via edge-tts.

    This provides excellent voice quality for free, with no API key.
    It requires an internet connection. The voices are neural and
    sound very natural.

    Popular English voices:
        "en-US-AriaNeural"    - Female, warm and conversational
        "en-US-JennyNeural"   - Female, friendly and clear
        "en-US-GuyNeural"     - Male, professional
        "en-GB-SoniaNeural"   - British female
        "en-AU-NatashaNeural" - Australian female

    Run 'python -m edge_tts --list-voices' to see all available voices.
    """

    def __init__(
        self,
        voice: str = "en-US-AriaNeural",
        rate: str = "+0%",
        volume: str = "+0%"
    ):
        """
        Initialize the EdgeTTS engine.

        Args:
            voice: The neural voice to use.
            rate: Speaking rate adjustment (e.g., "+10%", "-5%").
            volume: Volume adjustment (e.g., "+5%", "-10%").
        """
        self.voice = voice
        self.rate = rate
        self.volume = volume

    def synthesize(self, text: str, output_path: str) -> str:
        """
        Convert text to speech using edge-tts and save as MP3.

        edge-tts is async, so this sync wrapper drives it with
        asyncio.run(). Because asyncio.run() raises RuntimeError when
        an event loop is already running in the current thread (for
        example inside a Telegram handler), the coroutine is run in a
        short-lived worker thread in that case. The fallback blocks the
        calling loop until synthesis finishes, so inside async code
        prefer awaiting synthesize_async() instead.

        Args:
            text: The text to convert to speech.
            output_path: Where to save the MP3 file.

        Returns:
            The output_path where the audio was saved.
        """
        import concurrent.futures
        import edge_tts

        async def _run():
            communicate = edge_tts.Communicate(
                text, self.voice, rate=self.rate, volume=self.volume
            )
            await communicate.save(output_path)

        try:
            asyncio.get_running_loop()
        except RuntimeError:
            # No loop is running in this thread: asyncio.run() is safe.
            asyncio.run(_run())
        else:
            # A loop is already running here, so run the coroutine on a
            # fresh loop in a worker thread and wait for it to finish.
            with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
                pool.submit(asyncio.run, _run()).result()
        return output_path

    async def synthesize_async(self, text: str, output_path: str) -> str:
        """
        Async version of synthesize() for use in async contexts.

        Use this version inside async functions (e.g., Telegram handlers)
        to avoid creating a nested event loop.

        Args:
            text: The text to convert to speech.
            output_path: Where to save the MP3 file.

        Returns:
            The output_path where the audio was saved.
        """
        import edge_tts

        communicate = edge_tts.Communicate(
            text, self.voice, rate=self.rate, volume=self.volume
        )
        await communicate.save(output_path)
        return output_path
class GttsTTS(TTSEngine):
    """
    TTS engine using Google's Text-to-Speech service via gTTS.

    Good quality, free, supports many languages. Requires internet.
    Not as natural-sounding as EdgeTTS but very reliable.
    """

    def __init__(self, language: str = "en", slow: bool = False):
        """
        Initialize the gTTS engine.

        Args:
            language: BCP-47 language code (e.g., "en", "de", "fr").
            slow: If True, speak more slowly.
        """
        self.language = language
        self.slow = slow

    def synthesize(self, text: str, output_path: str) -> str:
        """
        Convert text to speech using gTTS and save as MP3.

        Args:
            text: The text to convert to speech.
            output_path: Where to save the MP3 file.

        Returns:
            The output_path where the audio was saved.
        """
        from gtts import gTTS

        tts = gTTS(text=text, lang=self.language, slow=self.slow)
        tts.save(output_path)
        return output_path


class Pyttsx3TTS(TTSEngine):
    """
    TTS engine using pyttsx3 (system TTS, fully offline).

    Works without internet. Quality depends on OS:
    best on macOS, good on Windows, robotic on Linux.
    Saves as WAV format (not MP3).
    """

    def __init__(self, rate: int = 150, volume: float = 1.0):
        """
        Initialize the pyttsx3 engine.

        Args:
            rate: Speaking rate in words per minute.
            volume: Volume level from 0.0 to 1.0.
        """
        self.rate = rate
        self.volume = volume

    def synthesize(self, text: str, output_path: str) -> str:
        """
        Convert text to speech using pyttsx3 and save as WAV.

        Note: pyttsx3 saves as WAV, not MP3. If you need MP3 for
        Telegram, convert with pydub after calling this method.

        Args:
            text: The text to convert to speech.
            output_path: Where to save the WAV file.

        Returns:
            The output_path where the audio was saved.
        """
        import pyttsx3

        engine = pyttsx3.init()
        engine.setProperty("rate", self.rate)
        engine.setProperty("volume", self.volume)
        engine.save_to_file(text, output_path)
        engine.runAndWait()
        return output_path
With these abstractions in place, the rest of the system never needs to know which specific STT or TTS engine is being used. It just calls transcribe() on an STTEngine and synthesize() on a TTSEngine. Swapping engines is as simple as changing the constructor call at the top of your main script.
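If you would rather choose the engine from configuration than edit a constructor call, a small registry is one way to do it. The sketch below is an illustration under assumptions: the stub classes stand in for EdgeTTS and GttsTTS so that it runs without those libraries installed, and the names TTS_REGISTRY and build_tts() are hypothetical.

```python
from typing import Callable, Dict


class StubEdgeTTS:
    """Stand-in for EdgeTTS so the snippet runs without edge-tts installed."""

    def synthesize(self, text: str, output_path: str) -> str:
        return output_path


class StubGttsTTS:
    """Stand-in for GttsTTS so the snippet runs without gTTS installed."""

    def synthesize(self, text: str, output_path: str) -> str:
        return output_path


# Map a short configuration name to a zero-argument constructor.
TTS_REGISTRY: Dict[str, Callable[[], object]] = {
    "edge": StubEdgeTTS,
    "gtts": StubGttsTTS,
}


def build_tts(name: str):
    """Create the TTS engine registered under name, or raise ValueError."""
    try:
        factory = TTS_REGISTRY[name.lower()]
    except KeyError:
        known = ", ".join(sorted(TTS_REGISTRY))
        raise ValueError(
            f"Unknown TTS engine '{name}'. Known engines: {known}"
        ) from None
    return factory()


tts = build_tts("edge")
```

In a real project the registry values would be the actual engine classes, and the name could come from a command-line flag or an environment variable.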
The Agent Core
Now we bring everything together in the Agent class. This is the central orchestrator that connects the conversation history, the LLM, the STT engine, and the TTS engine into a coherent whole.
import os
import tempfile
import logging
from typing import Optional
# We import our own modules defined earlier in this article.
# In a real project, these would be in separate files.
# from llm_client import LLMClient
# from conversation_history import ConversationHistory
# from stt_engines import STTEngine, WhisperSTT
# from tts_engines import TTSEngine, EdgeTTS, Pyttsx3TTS
logger = logging.getLogger(__name__)
# The default system prompt defines the agent's persona and behavior.
# This is where you give your agent its personality, its name, its
# areas of expertise, and any behavioral constraints you want to enforce.
DEFAULT_SYSTEM_PROMPT = """You are Aria, a helpful, friendly, and knowledgeable
AI assistant. You communicate clearly and concisely, with warmth and a touch of
wit. You remember the context of our conversation and refer back to earlier
points when relevant. When answering questions, you are accurate and honest,
and you say "I don't know" when you are uncertain rather than guessing.
When your response will be spoken aloud (converted to speech), keep it
conversational and avoid using markdown formatting, bullet points, or special
characters that do not translate well to spoken language. Use natural spoken
language instead."""
class VoiceAgent:
    """
    A minimalistic agentic AI system with voice capabilities.

    This agent can receive input as text or audio, process it using
    an LLM, maintain conversation history, and respond as text or audio.
    It is designed to be used both locally (direct terminal interaction)
    and remotely (via Telegram or Signal bots).

    The agent is stateful: it maintains a ConversationHistory object
    that persists across calls. Each call to process_text() or
    process_audio() adds to the history and uses it for context.

    Architecture:
        Input (text or audio)
            -> STT (if audio)
            -> ConversationHistory.add_user_message()
            -> LLMClient.chat(history)
            -> ConversationHistory.add_assistant_message()
            -> TTS (if voice output requested)
            -> Output (text or audio file path)
    """

    def __init__(
        self,
        llm_client: "LLMClient",
        stt_engine: "STTEngine",
        tts_engine: "TTSEngine",
        system_prompt: str = DEFAULT_SYSTEM_PROMPT,
        session_id: str = "default",
        max_turns: int = 20,
        history_dir: str = "./history"
    ):
        """
        Initialize the VoiceAgent.

        Args:
            llm_client: The LLM client to use for generating responses.
            stt_engine: The STT engine for transcribing audio input.
            tts_engine: The TTS engine for synthesizing audio output.
            system_prompt: The system prompt defining the agent's persona.
            session_id: A unique ID for this conversation session.
            max_turns: Maximum conversation turns to keep in history.
            history_dir: Directory for saving conversation history files.
        """
        self.llm = llm_client
        self.stt = stt_engine
        self.tts = tts_engine
        self.system_prompt = system_prompt
        self.history_dir = history_dir

        # Initialize the conversation history for this session.
        self.history = ConversationHistory(
            session_id=session_id,
            max_turns=max_turns
        )
        logger.info(
            f"[VoiceAgent] Initialized. Session: '{session_id}', "
            f"Max turns: {max_turns}"
        )

    def process_text(
        self,
        user_text: str,
        return_audio: bool = False
    ) -> dict:
        """
        Process a text input and return a text (and optionally audio) response.

        This is the core method of the agent. It adds the user's message
        to the history, calls the LLM with the full history, adds the
        response to the history, and optionally synthesizes the response
        as audio.

        Args:
            user_text: The user's text input.
            return_audio: If True, also synthesize the response as audio
                and include the audio file path in the result.

        Returns:
            A dictionary with the following keys:
                "text": The agent's text response.
                "audio_path": Path to the audio file (if return_audio=True),
                    or None if return_audio=False.
                "user_text": The original user input (for logging).
        """
        logger.info(f"[VoiceAgent] Processing text: '{user_text[:50]}...'")

        # Step 1: Add the user's message to the conversation history.
        self.history.add_user_message(user_text)

        # Step 2: Call the LLM with the full conversation history.
        # The system prompt is passed separately so it does not appear
        # in the history (which would waste context window space).
        try:
            response_text = self.llm.chat(
                messages=self.history.get_messages(),
                system_prompt=self.system_prompt
            )
        except Exception as e:
            logger.error(f"[VoiceAgent] LLM call failed: {e}")
            response_text = (
                "I'm sorry, I encountered an error processing your request. "
                "Please try again."
            )

        # Step 3: Add the assistant's response to the conversation history.
        self.history.add_assistant_message(response_text)
        logger.info(
            f"[VoiceAgent] Response generated: '{response_text[:50]}...'"
        )

        # Step 4: Optionally synthesize the response as audio.
        audio_path = None
        if return_audio:
            audio_path = self._synthesize_response(response_text)

        return {
            "text": response_text,
            "audio_path": audio_path,
            "user_text": user_text
        }

    def process_audio(
        self,
        audio_path: str,
        return_audio: bool = True
    ) -> dict:
        """
        Process an audio input and return a text (and optionally audio) response.

        This method first transcribes the audio to text using the STT engine,
        then calls process_text() with the transcription. It is the main
        entry point for voice-based interaction.

        Args:
            audio_path: Path to the audio file to transcribe.
            return_audio: If True, also synthesize the response as audio.
                Defaults to True for audio input (voice-in, voice-out).

        Returns:
            A dictionary with the following keys:
                "text": The agent's text response.
                "audio_path": Path to the audio response file (if requested).
                "user_text": The transcribed user input.
                "transcription": Same as user_text (alias for clarity).
        """
        logger.info(f"[VoiceAgent] Processing audio file: '{audio_path}'")

        # Step 1: Transcribe the audio to text.
        try:
            # Handle OGG files (from Telegram) specially.
            if audio_path.lower().endswith(".ogg"):
                user_text = self.stt.transcribe_ogg(audio_path)
            else:
                user_text = self.stt.transcribe(audio_path)
        except Exception as e:
            logger.error(f"[VoiceAgent] STT transcription failed: {e}")
            return {
                "text": "I could not understand the audio. Please try again.",
                "audio_path": None,
                "user_text": "",
                "transcription": ""
            }

        if not user_text:
            logger.warning("[VoiceAgent] STT returned empty transcription.")
            return {
                "text": "I did not catch that. Could you please repeat?",
                "audio_path": None,
                "user_text": "",
                "transcription": ""
            }

        logger.info(f"[VoiceAgent] Transcribed: '{user_text}'")

        # Step 2: Process the transcribed text as a normal text input.
        result = self.process_text(user_text, return_audio=return_audio)
        result["transcription"] = user_text
        return result

    def _synthesize_response(self, text: str) -> Optional[str]:
        """
        Synthesize a text response as audio and return the file path.

        This is an internal helper method that handles the TTS synthesis
        and any errors that might occur. It creates a temporary file for
        the audio, which the caller is responsible for deleting.

        Args:
            text: The text to synthesize.

        Returns:
            Path to the synthesized audio file, or None on failure.
        """
        try:
            # Determine the right file extension for the TTS engine.
            # EdgeTTS and gTTS produce MP3; pyttsx3 produces WAV.
            suffix = ".mp3"
            if isinstance(self.tts, Pyttsx3TTS):
                suffix = ".wav"
            audio_path = self.tts.synthesize_to_temp(text, suffix=suffix)
            logger.info(
                f"[VoiceAgent] Audio synthesized: '{audio_path}'"
            )
            return audio_path
        except Exception as e:
            logger.error(f"[VoiceAgent] TTS synthesis failed: {e}")
            return None

    def reset_history(self) -> None:
        """
        Clear the conversation history and start fresh.

        This is useful when the user wants to start a new conversation
        without creating a new agent instance.
        """
        self.history.clear()
        logger.info(
            f"[VoiceAgent] History cleared for session "
            f"'{self.history.session_id}'"
        )

    def save_history(self) -> str:
        """
        Save the current conversation history to a JSON file.

        Returns:
            The path to the saved history file.
        """
        path = self.history.save_to_file(self.history_dir)
        logger.info(f"[VoiceAgent] History saved to '{path}'")
        return path
The VoiceAgent class is the crown jewel of our architecture. Notice how it orchestrates all the other components while knowing almost nothing about their specific implementations. It does not know whether the LLM is Ollama or OpenAI. It does not know whether the STT engine is Whisper or Vosk. It does not know whether the TTS engine is edge-tts or gTTS. It just calls the abstract interfaces and lets the concrete implementations do the work. The one exception is _synthesize_response(), which checks for Pyttsx3TTS in order to pick the right file extension, because that engine writes WAV rather than MP3.
The process_text() and process_audio() methods both return a dictionary rather than a simple string. This is intentional: it makes the return value extensible without breaking existing code. If you later want to add confidence scores, detected language, or other metadata, you can add new keys to the dictionary without changing the method signature.
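To make that extensibility concrete, here is a small sketch. The AgentResult type and the "language" key are hypothetical additions for illustration; the article's methods return plain dictionaries.

```python
from typing import Optional, TypedDict


class AgentResult(TypedDict, total=False):
    """Hypothetical typed view of the dictionary returned by the agent."""

    text: str
    audio_path: Optional[str]
    user_text: str
    transcription: str
    language: str  # example of a key added later (assumption)


def describe(result: AgentResult) -> str:
    """A caller written against the original keys only."""
    spoken = "with audio" if result.get("audio_path") else "text only"
    return f"{result['text']} ({spoken})"


old_style: AgentResult = {
    "text": "Hello",
    "audio_path": None,
    "user_text": "Hi",
}
# Adding a new key does not affect callers that ignore it.
new_style: AgentResult = {**old_style, "language": "en"}
```

The describe() function behaves identically for both dictionaries, which is the property that a bare string return value could not offer.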
CHAPTER IX: GOING REMOTE - TELEGRAM AND SIGNAL INTEGRATION
Now that we have a working local agent, it is time to give it a remote presence. We will start with Telegram, which has excellent Python library support, and then discuss Signal.
Setting Up a Telegram Bot
To create a Telegram bot, you need to talk to BotFather. Open Telegram, search for @BotFather, and send it the /newbot command. Follow the prompts to give your bot a name and a username (which must end in "bot"). BotFather will give you a token that looks like this:
1234567890:ABCDefGhIJKlmNoPQRsTUVwxyZ
Keep this token secret. Anyone who has it can control your bot. Store it in an environment variable, not in your code.
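Reading the token from the environment might look like the sketch below. The variable name TELEGRAM_BOT_TOKEN and the helper load_bot_token() are assumptions chosen for illustration; use whatever name fits your deployment.

```python
import os


def load_bot_token(env_var: str = "TELEGRAM_BOT_TOKEN") -> str:
    """Read the bot token from the environment and fail loudly if missing."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your BotFather token."
        )
    return token
```

Failing at startup with a clear message is usually preferable to passing an empty token to the bot library and debugging an authentication error later.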
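You also need to install the python-telegram-bot library. The code in this chapter uses the asyncio-based API introduced in version 20, so make sure you get a recent release:
pip install "python-telegram-bot>=20"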
The Telegram Bot Handler
The Telegram bot handler is a thin layer that sits between the Telegram API and our VoiceAgent. Its job is to receive messages from Telegram, extract the relevant content (text or audio), pass it to the agent, and send the response back to the user.
One important design decision: we need a separate conversation history for each Telegram user. If two different users are talking to the bot simultaneously, they should each have their own history, not share a single one. We handle this by maintaining a dictionary of VoiceAgent instances keyed by Telegram user ID.
import asyncio
import logging
import os
import tempfile
from typing import Dict
from telegram import Update, Voice
from telegram.ext import (
Application,
CommandHandler,
MessageHandler,
filters,
ContextTypes
)
logger = logging.getLogger(__name__)
class TelegramVoiceBot:
    """
    A Telegram bot that wraps a VoiceAgent and handles voice and text messages.

    This bot maintains a separate VoiceAgent instance for each Telegram user,
    ensuring that conversation histories do not mix between users. Each agent
    instance uses the same LLM, STT, and TTS engines (shared resources) but
    has its own independent conversation history.

    The bot handles three types of interactions:
        1. Text messages: Passed to the agent as text, responded to with text.
        2. Voice messages: Transcribed by STT, processed by agent, responded
           to with both text (the transcription) and a voice message.
        3. Commands: /start (greeting), /reset (clear history), /help (usage).

    Usage:
        bot = TelegramVoiceBot(
            token="your-bot-token",
            llm_client=llm,
            stt_engine=stt,
            tts_engine=tts
        )
        bot.run()
    """

    def __init__(
        self,
        token: str,
        llm_client,
        stt_engine,
        tts_engine,
        system_prompt: str = DEFAULT_SYSTEM_PROMPT,
        max_turns_per_user: int = 20,
        history_dir: str = "./history"
    ):
        """
        Initialize the Telegram bot.

        Args:
            token: The Telegram bot token from BotFather.
            llm_client: The LLM client for generating responses.
            stt_engine: The STT engine for transcribing voice messages.
            tts_engine: The TTS engine for synthesizing voice responses.
            system_prompt: The agent's system prompt / persona.
            max_turns_per_user: Max conversation turns per user.
            history_dir: Directory for saving conversation histories.
        """
        self.token = token
        self.llm_client = llm_client
        self.stt_engine = stt_engine
        self.tts_engine = tts_engine
        self.system_prompt = system_prompt
        self.max_turns_per_user = max_turns_per_user
        self.history_dir = history_dir

        # Dictionary mapping Telegram user_id -> VoiceAgent instance.
        # Each user gets their own agent with their own conversation history.
        self._agents: Dict[int, "VoiceAgent"] = {}

    def _get_agent_for_user(
        self, user_id: int, username: str = ""
    ) -> "VoiceAgent":
        """
        Get or create a VoiceAgent for a specific Telegram user.

        If the user already has an agent, return it. If not, create a new
        one with a session ID based on the user's Telegram ID.

        Args:
            user_id: The Telegram user's numeric ID.
            username: The Telegram username (for logging purposes).

        Returns:
            The VoiceAgent instance for this user.
        """
        if user_id not in self._agents:
            session_id = f"telegram_{user_id}"
            logger.info(
                f"[TelegramBot] Creating new agent for user "
                f"'{username}' (ID: {user_id})"
            )
            self._agents[user_id] = VoiceAgent(
                llm_client=self.llm_client,
                stt_engine=self.stt_engine,
                tts_engine=self.tts_engine,
                system_prompt=self.system_prompt,
                session_id=session_id,
                max_turns=self.max_turns_per_user,
                history_dir=self.history_dir
            )
        return self._agents[user_id]

    async def handle_start(
        self,
        update: Update,
        context: ContextTypes.DEFAULT_TYPE
    ) -> None:
        """
        Handle the /start command.

        This is the first message a user sees when they open the bot.
        It introduces the bot and explains how to use it.
        """
        user = update.effective_user
        welcome_text = (
            f"Hello, {user.first_name}! I'm Aria, your AI voice assistant. "
            f"You can talk to me by:\n\n"
            f" - Sending a text message\n"
            f" - Sending a voice message (I'll transcribe and respond)\n\n"
            f"Commands:\n"
            f" /reset - Start a fresh conversation\n"
            f" /help - Show this help message\n\n"
            f"What would you like to talk about?"
        )
        await update.message.reply_text(welcome_text)

    async def handle_help(
        self,
        update: Update,
        context: ContextTypes.DEFAULT_TYPE
    ) -> None:
        """Handle the /help command."""
        await self.handle_start(update, context)

    async def handle_reset(
        self,
        update: Update,
        context: ContextTypes.DEFAULT_TYPE
    ) -> None:
        """
        Handle the /reset command.

        Clears the user's conversation history so they can start fresh.
        """
        user = update.effective_user
        agent = self._get_agent_for_user(user.id, user.username or "")
        agent.reset_history()
        await update.message.reply_text(
            "Conversation history cleared! Let's start fresh. "
            "What would you like to talk about?"
        )

    async def handle_text_message(
        self,
        update: Update,
        context: ContextTypes.DEFAULT_TYPE
    ) -> None:
        """
        Handle an incoming text message from a Telegram user.

        The message is passed to the user's VoiceAgent, which processes
        it and returns a text response. The response is sent back as
        a text message. No audio is generated for text-in/text-out.
        """
        user = update.effective_user
        user_text = update.message.text
        logger.info(
            f"[TelegramBot] Text from '{user.username}': '{user_text[:50]}'"
        )

        # Show a "typing..." indicator while processing.
        await context.bot.send_chat_action(
            chat_id=update.effective_chat.id,
            action="typing"
        )

        agent = self._get_agent_for_user(user.id, user.username or "")
        # process_text() blocks while it waits for the LLM, so run it in a
        # worker thread to keep the event loop responsive (Python 3.9+).
        result = await asyncio.to_thread(
            agent.process_text, user_text, return_audio=False
        )
        await update.message.reply_text(result["text"])
async def handle_voice_message(
self,
update: Update,
context: ContextTypes.DEFAULT_TYPE
) -> None:
"""
Handle an incoming voice message from a Telegram user.
The workflow is:
1. Download the OGG voice file from Telegram.
2. Pass it to the agent's process_audio() method.
3. The agent transcribes it, generates a text response,
and synthesizes an audio response.
4. Send the transcription as a text message (so the user
can see what was understood).
5. Send the audio response as a voice message.
6. Clean up temporary files.
"""
user = update.effective_user
voice: Voice = update.message.voice
logger.info(
f"[TelegramBot] Voice message from '{user.username}', "
f"duration: {voice.duration}s"
)
# Show a "recording audio" indicator while processing.
await context.bot.send_chat_action(
chat_id=update.effective_chat.id,
action="record_voice"
)
# Download the voice message from Telegram.
ogg_path = None
audio_response_path = None
try:
# Get the file object and download it.
voice_file = await context.bot.get_file(voice.file_id)
with tempfile.NamedTemporaryFile(
suffix=".ogg", delete=False
) as tmp:
ogg_path = tmp.name
await voice_file.download_to_drive(ogg_path)
logger.info(f"[TelegramBot] Downloaded voice to '{ogg_path}'")
# Process the audio through the agent.
agent = self._get_agent_for_user(user.id, user.username or "")
result = agent.process_audio(ogg_path, return_audio=True)
transcription = result.get("transcription", "")
response_text = result["text"]
audio_response_path = result.get("audio_path")
# Send the transcription so the user knows what was understood.
if transcription:
await update.message.reply_text(
f'I heard: "{transcription}"\n\n{response_text}'
)
else:
await update.message.reply_text(response_text)
# Send the audio response as a voice message.
if audio_response_path and os.path.exists(audio_response_path):
with open(audio_response_path, "rb") as audio_file:
await context.bot.send_voice(
chat_id=update.effective_chat.id,
voice=audio_file,
caption="Voice response"
)
except Exception as e:
logger.error(
f"[TelegramBot] Error processing voice message: {e}",
exc_info=True
)
await update.message.reply_text(
"I'm sorry, I had trouble processing your voice message. "
"Please try again or send a text message."
)
finally:
# Always clean up temporary files to avoid disk space issues.
for path in [ogg_path, audio_response_path]:
if path and os.path.exists(path):
try:
os.remove(path)
except OSError as e:
logger.warning(
f"[TelegramBot] Could not delete temp file "
f"'{path}': {e}"
)
def run(self) -> None:
"""
Start the Telegram bot and begin polling for updates.
This method blocks until the bot is stopped (e.g., by pressing
Ctrl+C). It sets up all the message handlers and starts the
Telegram long-polling loop.
"""
logger.info("[TelegramBot] Starting bot...")
application = Application.builder().token(self.token).build()
# Register command handlers.
application.add_handler(CommandHandler("start", self.handle_start))
application.add_handler(CommandHandler("help", self.handle_help))
application.add_handler(CommandHandler("reset", self.handle_reset))
# Register message handlers.
# The order matters: more specific filters should come first.
application.add_handler(
MessageHandler(filters.VOICE, self.handle_voice_message)
)
application.add_handler(
MessageHandler(filters.TEXT & ~filters.COMMAND,
self.handle_text_message)
)
logger.info("[TelegramBot] Bot is running. Press Ctrl+C to stop.")
application.run_polling(allowed_updates=Update.ALL_TYPES)
The TelegramVoiceBot class is a clean, well-organized handler for all Telegram interactions. The _get_agent_for_user() method is the key to multi-user support: it lazily creates a new VoiceAgent for each new user and caches it for subsequent messages from the same user. This means each user has their own private conversation history, their own context, and their own agent state.
The handle_voice_message() method deserves special attention because it orchestrates the most complex workflow in the system. It downloads the OGG file from Telegram, passes it to the agent, sends both a text response (showing the transcription) and an audio response (the synthesized voice), and then cleans up all temporary files in the finally block. The finally block is critical: it ensures that temporary files are always deleted, even if an exception occurs partway through the processing.
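One detail the handlers above leave open is that process_text() and process_audio() are ordinary blocking calls made from inside async handlers, so a slow transcription or LLM request keeps the event loop busy for every user. A minimal sketch of one way around this, assuming the agent methods are plain synchronous functions as they are called above, is to push the call onto a worker thread with asyncio.to_thread (Python 3.9+). SlowAgent and run_blocking are hypothetical names used only for illustration.

```python
import asyncio
import time


class SlowAgent:
    """Hypothetical stand-in for VoiceAgent with a blocking method."""

    def process_text(self, text: str, return_audio: bool = False) -> dict:
        time.sleep(0.2)  # Simulates slow STT/LLM/TTS work.
        return {"text": f"echo: {text}", "audio_path": None}


async def run_blocking(func, *args, **kwargs):
    """Run a blocking callable in a worker thread and await its result."""
    return await asyncio.to_thread(func, *args, **kwargs)


async def demo() -> list:
    agent = SlowAgent()
    # The two calls overlap instead of running back to back, because the
    # event loop stays free while the worker threads are busy.
    return await asyncio.gather(
        run_blocking(agent.process_text, "hello"),
        run_blocking(agent.process_text, "world", return_audio=False),
    )


if __name__ == "__main__":
    started = time.perf_counter()
    results = asyncio.run(demo())
    print([r["text"] for r in results])
    print(f"elapsed: {time.perf_counter() - started:.2f}s")
```

Inside a handler the call would then read result = await run_blocking(agent.process_audio, ogg_path, return_audio=True). Once calls run in threads, two messages from the same user can reach the same agent concurrently, so a per-user lock may be worth adding if the agent is not thread-safe.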
Signal Integration
Signal is a more privacy-focused messaging platform than Telegram, but it does not have an official bot API. Integration requires using signal-cli, a command-line tool that can register a phone number with Signal and send/receive messages programmatically. The signalbot Python library builds on top of signal-cli to provide a higher-level API.
Setting up Signal integration is more involved than Telegram. You need a dedicated phone number (a SIM card or a VoIP number), and you need to run signal-cli or signal-cli-rest-api as a background service. The signalbot library then connects to this service.
pip install signalbot
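The signal-cli-rest-api project provides a Docker-based REST API wrapper around signal-cli, which is the most convenient way to run it. signalbot receives messages over a WebSocket connection, which signal-cli-rest-api offers in its json-rpc mode, so the container is started with MODE=json-rpc:
docker run -d \
  -p 8080:8080 \
  -v /path/to/signal-data:/home/.local/share/signal-cli \
  -e 'MODE=json-rpc' \
  bbernhard/signal-cli-rest-api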
After registering your number with signal-cli, you can use the signalbot library to build a bot with a similar structure to our Telegram bot. The key difference is that Signal sends voice messages as audio file attachments, not as a special "voice" message type. The bot needs to detect audio attachments and process them through the STT pipeline.
import asyncio
import logging
import os
import tempfile
import aiohttp
from signalbot import SignalBot, Command, Context
logger = logging.getLogger(__name__)
class SignalVoiceCommand(Command):
"""
A signalbot Command that handles both text and voice messages.
signalbot uses a Command pattern: you subclass Command and implement
the handle() method, which is called for each incoming message that
passes the matches() filter.
This command handles:
- Text messages: processed as text input to the VoiceAgent.
- Audio attachments: transcribed via STT, then processed as text.
- Special commands: !reset to clear history, !help for usage info.
"""
def __init__(self, agent_registry: dict, llm_client, stt_engine, tts_engine):
"""
Initialize the Signal voice command handler.
Args:
agent_registry: A dict mapping phone_number -> VoiceAgent.
This is shared across all command instances.
llm_client: The LLM client for generating responses.
stt_engine: The STT engine for transcribing audio.
tts_engine: The TTS engine for synthesizing responses.
"""
super().__init__()
self.agent_registry = agent_registry
self.llm_client = llm_client
self.stt_engine = stt_engine
self.tts_engine = tts_engine
def describe(self) -> str:
"""Return a description of this command for logging."""
return "Voice-capable AI agent command handler"
async def handle(self, context: Context) -> None:
"""
Handle an incoming Signal message.
This method is called by signalbot for each message that
passes the matches() filter. It dispatches to the appropriate
handler based on whether the message is text or audio.
Args:
context: The signalbot Context object containing the message
and methods for sending replies.
"""
sender = context.message.source
text = context.message.text or ""
attachments = context.message.attachments or []
# Get or create an agent for this sender.
agent = self._get_agent_for_sender(sender)
# Handle special commands.
if text.strip().lower() in ["!reset", "/reset"]:
agent.reset_history()
await context.send("Conversation history cleared! Fresh start.")
return
if text.strip().lower() in ["!help", "/help", "/start"]:
await context.send(
"I'm Aria, your AI voice assistant on Signal.\n"
"Send me a text message or a voice note and I'll respond.\n"
"Commands: !reset (clear history)"
)
return
# Check if there are audio attachments (voice messages).
audio_attachments = [
a for a in attachments
if a.get("contentType", "").startswith("audio/")
]
if audio_attachments:
await self._handle_audio_attachment(
context, agent, audio_attachments[0]
)
elif text:
await self._handle_text(context, agent, text)
async def _handle_text(
self,
context: Context,
agent: "VoiceAgent",
text: str
) -> None:
"""
Handle a text message: process it and reply with text.
Args:
context: The signalbot context for sending replies.
agent: The VoiceAgent for this sender.
text: The text message content.
"""
result = agent.process_text(text, return_audio=False)
await context.send(result["text"])
async def _handle_audio_attachment(
self,
context: Context,
agent: "VoiceAgent",
attachment: dict
) -> None:
"""
Handle an audio attachment (voice message) from Signal.
Downloads the audio file, processes it through the agent's
STT pipeline, generates a response, synthesizes it as audio,
and sends both the text and audio response back.
Signal sends voice messages as audio/ogg or audio/aac attachments.
The attachment dict contains a 'filename' or 'id' that can be
used to retrieve the file from signal-cli-rest-api.
Args:
context: The signalbot context for sending replies.
agent: The VoiceAgent for this sender.
attachment: The attachment metadata dict from Signal.
"""
audio_path = None
audio_response_path = None
try:
# Download the audio attachment.
# signal-cli-rest-api stores attachments locally.
# The path depends on your signal-cli-rest-api configuration.
attachment_id = attachment.get("id", "")
attachment_filename = attachment.get("filename", "voice.ogg")
content_type = attachment.get("contentType", "audio/ogg")
# Determine the file extension from the content type.
ext = ".ogg"
if "aac" in content_type:
ext = ".aac"
elif "mp4" in content_type or "m4a" in content_type:
ext = ".m4a"
with tempfile.NamedTemporaryFile(
suffix=ext, delete=False
) as tmp:
audio_path = tmp.name
# Download from signal-cli-rest-api's attachment endpoint.
# The base URL depends on your signal-cli-rest-api setup.
signal_api_url = os.environ.get(
"SIGNAL_API_URL", "http://localhost:8080"
)
download_url = (
f"{signal_api_url}/v1/attachments/{attachment_id}"
)
async with aiohttp.ClientSession() as session:
async with session.get(download_url) as response:
if response.status == 200:
content = await response.read()
with open(audio_path, "wb") as f:
f.write(content)
else:
logger.error(
f"[SignalBot] Failed to download attachment: "
f"HTTP {response.status}"
)
await context.send(
"I could not download your voice message. "
"Please try again."
)
return
# Process the audio through the agent.
result = agent.process_audio(audio_path, return_audio=True)
transcription = result.get("transcription", "")
response_text = result["text"]
audio_response_path = result.get("audio_path")
# Send the text response.
if transcription:
await context.send(
f'I heard: "{transcription}"\n\n{response_text}'
)
else:
await context.send(response_text)
# Send the audio response as an attachment.
# Note: Sending audio attachments via signalbot requires
# signal-cli-rest-api's send endpoint with attachments.
# This is a simplified example; actual implementation
# depends on your signal-cli-rest-api version.
if audio_response_path and os.path.exists(audio_response_path):
await context.send(
"",
attachments=[audio_response_path]
)
except Exception as e:
logger.error(
f"[SignalBot] Error processing audio attachment: {e}",
exc_info=True
)
await context.send(
"I had trouble processing your voice message. "
"Please try again."
)
finally:
for path in [audio_path, audio_response_path]:
if path and os.path.exists(path):
try:
os.remove(path)
except OSError:
pass
def _get_agent_for_sender(self, sender: str) -> "VoiceAgent":
"""
Get or create a VoiceAgent for a Signal sender.
Args:
sender: The sender's phone number (Signal identifier).
Returns:
The VoiceAgent for this sender.
"""
if sender not in self.agent_registry:
logger.info(
f"[SignalBot] Creating new agent for sender '{sender}'"
)
self.agent_registry[sender] = VoiceAgent(
llm_client=self.llm_client,
stt_engine=self.stt_engine,
tts_engine=self.tts_engine,
session_id=f"signal_{sender.replace('+', '')}",
max_turns=20
)
return self.agent_registry[sender]
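The Signal integration is structurally similar to the Telegram integration but has some important differences. Signal does not have a native "voice message" type in the same way Telegram does. Voice messages arrive as audio file attachments with a content type of audio/ogg or audio/aac. The bot must detect these attachments and route them through the audio processing pipeline. The attribute and parameter names used above for attachments (context.message.attachments and the attachments argument of context.send) should be checked against the signalbot version you install, since the library's message model differs between releases.
Another difference lies in the privacy model. Signal messages are end-to-end encrypted by default, so Signal's servers only relay ciphertext they cannot read, whereas messages exchanged with a Telegram bot are not end-to-end encrypted and are accessible to Telegram's servers. The signal-cli tool acts as a Signal client on your server, so decrypted messages exist only on the machine running your bot. This makes Signal integration more complex to set up but much more private than Telegram.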
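The attachment routing described above can be isolated into two small pure functions, which makes it easy to test without a running Signal service. This is a sketch under assumptions: attachment metadata is taken to be a dict with a contentType key, as in the handler above, and extension_for and pick_audio_attachment are hypothetical helper names, not part of signalbot.

```python
from typing import Optional

# MIME types mapped to the file suffix handed to the STT engine.
# The table is illustrative; extend it for the types your clients send.
_AUDIO_EXTENSIONS = {
    "audio/ogg": ".ogg",
    "audio/aac": ".aac",
    "audio/mp4": ".m4a",
    "audio/x-m4a": ".m4a",
    "audio/mpeg": ".mp3",
}


def extension_for(content_type: str, default: str = ".ogg") -> str:
    """Return a file suffix for an audio MIME type.

    Parameters after ';' (e.g. 'audio/ogg; codecs=opus') are ignored.
    """
    base_type = content_type.split(";", 1)[0].strip().lower()
    return _AUDIO_EXTENSIONS.get(base_type, default)


def pick_audio_attachment(attachments: Optional[list]) -> Optional[dict]:
    """Return the first attachment whose content type is audio/*, if any."""
    for attachment in attachments or []:
        if attachment.get("contentType", "").lower().startswith("audio/"):
            return attachment
    return None
```

A lookup table keeps the mapping in one place and avoids substring checks such as "mp4" in content_type, which can also match unrelated types.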
CHAPTER X: PUTTING IT ALL TOGETHER - THE COMPLETE RUNNING EXAMPLE
Now we assemble everything into a single, runnable main script. This script brings together all the components we have built and provides a clean entry point for both local terminal use and remote Telegram bot deployment.
The script is organized around a factory function that creates the agent components based on a configuration object. This makes it easy to switch between different STT engines, TTS engines, and LLM backends by changing a single configuration file.
First, let us look at the project structure:
voice_agent/
|-- main.py (entry point)
|-- config.py (configuration)
|-- llm_client.py (LLM abstraction)
|-- conversation.py (history management)
|-- stt_engines.py (STT abstractions)
|-- tts_engines.py (TTS abstractions)
|-- agent.py (VoiceAgent class)
|-- telegram_bot.py (Telegram integration)
|-- signal_bot.py (Signal integration)
|-- voicebox_client.py (voicebox.sh client)
|-- requirements.txt (dependencies)
|-- history/ (saved conversation histories)
|-- .env (environment variables, not committed to git)
The requirements.txt file lists all dependencies:
openai>=1.0.0
openai-whisper>=20231117
edge-tts>=6.1.9
gTTS>=2.5.0
pyttsx3>=2.90
python-telegram-bot>=20.0
pydub>=0.25.1
requests>=2.31.0
aiohttp>=3.9.0
python-dotenv>=1.0.0
signalbot>=0.1.0
Now, the complete main.py that ties everything together:
#!/usr/bin/env python3
"""
voice_agent/main.py
Entry point for the Voice-Capable Agentic AI system.
This script can run in three modes:
1. local - Interactive terminal session (typed input, spoken replies)
2. telegram - Telegram bot mode (requires TELEGRAM_BOT_TOKEN env var)
3. signal - Signal bot mode (requires SIGNAL_* env vars)
Usage:
python main.py --mode local
python main.py --mode telegram
python main.py --mode signal
Environment variables (can be set in .env file):
TELEGRAM_BOT_TOKEN - Your Telegram bot token from BotFather
OPENAI_API_KEY - Your OpenAI API key (if using OpenAI backend)
LLM_BACKEND - "ollama" or "openai" (default: "ollama")
LLM_MODEL - Model name (default: "llama3.2" for ollama)
STT_ENGINE - "whisper" or "voicebox" (default: "whisper")
TTS_ENGINE - "edge", "gtts", "pyttsx3" (default: "edge")
WHISPER_MODEL - Whisper model size (default: "base")
EDGE_TTS_VOICE - edge-tts voice name (default: "en-US-AriaNeural")
SIGNAL_API_URL - signal-cli-rest-api URL (default: localhost:8080)
SIGNAL_PHONE - Your Signal bot phone number
"""
import argparse
import asyncio
import logging
import os
import sys
import tempfile
from dotenv import load_dotenv
# Load environment variables from .env file if it exists.
# This must happen before importing any module that reads env vars.
load_dotenv()
# Configure logging for the entire application.
# In production, you might want to log to a file instead of stdout.
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s [%(name)s] %(levelname)s: %(message)s",
datefmt="%Y-%m-%d %H:%M:%S"
)
logger = logging.getLogger("voice_agent.main")
# ============================================================
# INLINE DEFINITIONS
# (In a real project, these would be in separate files.
# They are inlined here for a self-contained example.)
# ============================================================
# --- Paste the full content of llm_client.py here ---
# --- Paste the full content of conversation.py here ---
# --- Paste the full content of stt_engines.py here ---
# --- Paste the full content of tts_engines.py here ---
# --- Paste the full content of agent.py here ---
# --- Paste the full content of telegram_bot.py here ---
# --- Paste the full content of voicebox_client.py here ---
# --- Paste the full content of signal_bot.py here ---
def create_llm_client() -> "LLMClient":
"""
Create and return an LLMClient based on environment configuration.
Reads LLM_BACKEND and LLM_MODEL from environment variables.
Falls back to Ollama with llama3.2 if not configured.
Returns:
A configured LLMClient instance.
"""
backend = os.environ.get("LLM_BACKEND", "ollama").lower()
model = os.environ.get("LLM_MODEL", None)
logger.info(f"Creating LLM client: backend='{backend}', model='{model}'")
client = LLMClient(backend=backend, model=model)
# Perform a health check to ensure the backend is reachable.
logger.info("Checking LLM backend availability...")
if not client.is_available():
if backend == "ollama":
logger.error(
"Ollama is not running or the model is not available. "
"Please start Ollama with: ollama serve\n"
"And pull the model with: ollama pull llama3.2"
)
else:
logger.error(
f"LLM backend '{backend}' is not available. "
"Check your API key and internet connection."
)
sys.exit(1)
logger.info("LLM backend is available.")
return client
def create_stt_engine() -> "STTEngine":
"""
Create and return an STTEngine based on environment configuration.
Reads STT_ENGINE and WHISPER_MODEL from environment variables.
Falls back to WhisperSTT with "base" model if not configured.
Returns:
A configured STTEngine instance.
"""
engine_name = os.environ.get("STT_ENGINE", "whisper").lower()
whisper_model = os.environ.get("WHISPER_MODEL", "base")
logger.info(f"Creating STT engine: '{engine_name}'")
if engine_name == "voicebox":
client = VoiceboxClient()
if not client.is_running():
logger.warning(
"voicebox.sh is not running. Falling back to WhisperSTT."
)
return WhisperSTT(model_size=whisper_model)
return VoiceboxSTT(client)
else:
# Default to Whisper.
return WhisperSTT(model_size=whisper_model)
def create_tts_engine() -> "TTSEngine":
    """
    Create and return a TTSEngine based on environment configuration.

    Reads TTS_ENGINE and EDGE_TTS_VOICE from environment variables.
    Falls back to EdgeTTS with the default voice if not configured.

    Returns:
        A configured TTSEngine instance.
    """
    engine_name = os.environ.get("TTS_ENGINE", "edge").lower()
    edge_voice = os.environ.get("EDGE_TTS_VOICE", "en-US-AriaNeural")
    logger.info(f"Creating TTS engine: '{engine_name}'")
    if engine_name == "gtts":
        return GttsTTS(language="en")
    elif engine_name == "pyttsx3":
        return Pyttsx3TTS(rate=150)
    elif engine_name == "voicebox":
        # This example does not include a VoiceboxTTS adapter, so the
        # "voicebox" setting currently resolves to EdgeTTS either way.
        # In production, write a VoiceboxTTS class that wraps
        # VoiceboxClient and return it here when the service is running.
        client = VoiceboxClient()
        if not client.is_running():
            logger.warning(
                "voicebox.sh is not running. Falling back to EdgeTTS."
            )
        else:
            logger.warning(
                "VoiceboxTTS is not implemented in this example. "
                "Falling back to EdgeTTS."
            )
        return EdgeTTS(voice=edge_voice)
    else:
        # Default to edge-tts.
        return EdgeTTS(voice=edge_voice)
def run_local_mode(llm_client, stt_engine, tts_engine) -> None:
"""
Run the agent in interactive local terminal mode.
In this mode, the user types text messages and the agent responds
with both text and audio (played through the system speakers).
This mode does not use microphone input; it is text-in, voice-out.
For microphone input in local mode, you would need to add a
microphone recording step using pyaudio or sounddevice before
calling agent.process_audio().
Args:
llm_client: The configured LLM client.
stt_engine: The configured STT engine.
tts_engine: The configured TTS engine.
"""
import subprocess
import platform
agent = VoiceAgent(
llm_client=llm_client,
stt_engine=stt_engine,
tts_engine=tts_engine,
session_id="local_session"
)
print("\n" + "=" * 60)
print(" ARIA - Voice-Capable AI Agent (Local Mode)")
print("=" * 60)
print(" Type your message and press Enter.")
print(" Type 'quit' or 'exit' to stop.")
print(" Type 'reset' to clear conversation history.")
print(" Type 'save' to save conversation history.")
print("=" * 60 + "\n")
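    def play_audio(audio_path: str) -> None:
        """Play an audio file using a player available on this system."""
        system = platform.system()
        try:
            if system == "Darwin":  # macOS
                subprocess.run(["afplay", audio_path], check=True)
            elif system == "Linux":
                # aplay only understands WAV and raw PCM. edge-tts and gTTS
                # produce MP3 by default, so anything that is not WAV goes
                # to mpg123 (install it with your package manager).
                if audio_path.lower().endswith(".wav"):
                    subprocess.run(["aplay", audio_path], check=True)
                else:
                    subprocess.run(["mpg123", "-q", audio_path], check=True)
            elif system == "Windows":
                # os.startfile() returns immediately, so the caller may
                # delete the file before playback has finished.
                os.startfile(audio_path)
        except Exception as e:
            logger.warning(f"Could not play audio: {e}")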
while True:
try:
user_input = input("You: ").strip()
except (EOFError, KeyboardInterrupt):
print("\nGoodbye!")
break
if not user_input:
continue
if user_input.lower() in ["quit", "exit"]:
print("Goodbye!")
break
if user_input.lower() == "reset":
agent.reset_history()
print("[History cleared]\n")
continue
if user_input.lower() == "save":
path = agent.save_history()
print(f"[History saved to: {path}]\n")
continue
# Process the text input and get a response with audio.
result = agent.process_text(user_input, return_audio=True)
print(f"\nAria: {result['text']}\n")
# Play the audio response if available.
if result.get("audio_path"):
try:
play_audio(result["audio_path"])
finally:
# Clean up the temporary audio file.
if os.path.exists(result["audio_path"]):
os.remove(result["audio_path"])
def run_telegram_mode(llm_client, stt_engine, tts_engine) -> None:
"""
Run the agent as a Telegram bot.
Reads the TELEGRAM_BOT_TOKEN environment variable for the bot token.
The bot will run until interrupted with Ctrl+C.
Args:
llm_client: The configured LLM client.
stt_engine: The configured STT engine.
tts_engine: The configured TTS engine.
"""
token = os.environ.get("TELEGRAM_BOT_TOKEN", "")
if not token:
logger.error(
"TELEGRAM_BOT_TOKEN environment variable is not set. "
"Get a token from @BotFather on Telegram."
)
sys.exit(1)
logger.info("Starting Telegram bot...")
bot = TelegramVoiceBot(
token=token,
llm_client=llm_client,
stt_engine=stt_engine,
tts_engine=tts_engine
)
bot.run()
def run_signal_mode(llm_client, stt_engine, tts_engine) -> None:
"""
Run the agent as a Signal bot.
Requires signal-cli-rest-api to be running and configured.
Reads SIGNAL_API_URL and SIGNAL_PHONE from environment variables.
Args:
llm_client: The configured LLM client.
stt_engine: The configured STT engine.
tts_engine: The configured TTS engine.
"""
from signalbot import SignalBot
signal_api_url = os.environ.get("SIGNAL_API_URL", "http://localhost:8080")
signal_phone = os.environ.get("SIGNAL_PHONE", "")
if not signal_phone:
logger.error(
"SIGNAL_PHONE environment variable is not set. "
"Set it to your Signal bot's phone number."
)
sys.exit(1)
logger.info(f"Starting Signal bot for number '{signal_phone}'...")
# Shared registry of agents, one per Signal sender.
agent_registry = {}
bot = SignalBot({
"signal_service": signal_api_url,
"phone_number": signal_phone
})
# Register our voice command handler.
bot.register(
SignalVoiceCommand(
agent_registry=agent_registry,
llm_client=llm_client,
stt_engine=stt_engine,
tts_engine=tts_engine
)
)
logger.info("Signal bot is running. Press Ctrl+C to stop.")
bot.start()
def main() -> None:
"""
Main entry point for the voice agent application.
Parses command-line arguments, creates the agent components,
and starts the appropriate mode (local, telegram, or signal).
"""
parser = argparse.ArgumentParser(
description="Voice-Capable Agentic AI System",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=__doc__
)
parser.add_argument(
"--mode",
choices=["local", "telegram", "signal"],
default="local",
help="Run mode: local terminal, Telegram bot, or Signal bot."
)
args = parser.parse_args()
logger.info(f"Starting voice agent in '{args.mode}' mode...")
# Create the shared components.
# These are created once and shared across all agent instances.
llm_client = create_llm_client()
stt_engine = create_stt_engine()
tts_engine = create_tts_engine()
# Dispatch to the appropriate mode.
if args.mode == "local":
run_local_mode(llm_client, stt_engine, tts_engine)
elif args.mode == "telegram":
run_telegram_mode(llm_client, stt_engine, tts_engine)
elif args.mode == "signal":
run_signal_mode(llm_client, stt_engine, tts_engine)
if __name__ == "__main__":
main()
The main.py script is the glue that holds everything together. The factory functions create_llm_client(), create_stt_engine(), and create_tts_engine() read configuration from environment variables and create the appropriate component instances. This approach is clean and flexible: you can change the entire behavior of the system by changing environment variables, without touching the code.
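The project layout lists a config.py, while the factory functions read os.environ directly. One way to gather those reads in a single place is a small frozen dataclass. The following is a minimal sketch, not the article's config.py: AgentConfig and from_env are hypothetical names, while the variable names and defaults mirror the ones documented in main.py.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

_VALID_STT = {"whisper", "voicebox"}
_VALID_TTS = {"edge", "gtts", "pyttsx3", "voicebox"}


@dataclass(frozen=True)
class AgentConfig:
    """Hypothetical container for the settings main.py reads."""

    llm_backend: str = "ollama"
    llm_model: Optional[str] = None
    stt_engine: str = "whisper"
    whisper_model: str = "base"
    tts_engine: str = "edge"
    edge_tts_voice: str = "en-US-AriaNeural"

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "AgentConfig":
        """Build a config from a mapping (defaults to os.environ)."""
        env = os.environ if env is None else env
        config = cls(
            llm_backend=env.get("LLM_BACKEND", "ollama").lower(),
            llm_model=env.get("LLM_MODEL") or None,
            stt_engine=env.get("STT_ENGINE", "whisper").lower(),
            whisper_model=env.get("WHISPER_MODEL", "base"),
            tts_engine=env.get("TTS_ENGINE", "edge").lower(),
            edge_tts_voice=env.get("EDGE_TTS_VOICE", "en-US-AriaNeural"),
        )
        # Fail early on typos instead of silently using a default engine.
        if config.stt_engine not in _VALID_STT:
            raise ValueError(f"Unknown STT_ENGINE: {config.stt_engine!r}")
        if config.tts_engine not in _VALID_TTS:
            raise ValueError(f"Unknown TTS_ENGINE: {config.tts_engine!r}")
        return config
```

Unlike the factory functions, which quietly fall back to a default for an unrecognized engine name, this sketch raises an error. That is a design choice of the sketch; either behavior can be reasonable depending on how strict you want startup to be.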
The local mode is particularly interesting because it shows how the same agent can be used in a non-Telegram context. The run_local_mode() function creates a simple REPL (Read-Eval-Print Loop) that reads text from the terminal, processes it through the agent, prints the response, and plays the audio through the system speakers. This is a great way to test the agent before deploying it as a Telegram bot.
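The play_audio() function inside run_local_mode() handles cross-platform audio playback. On macOS, it uses the built-in afplay command. On Linux, it uses aplay, which handles only WAV and raw PCM audio; for the MP3 files that edge-tts and gTTS produce by default, a player such as mpg123 is needed instead. On Windows, os.startfile() opens the audio file with the default media player. Note that os.startfile() returns immediately rather than waiting for playback to end, so on Windows the temporary file should not be deleted until the player is done with it.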
The .env Configuration File
Create a .env file in the project root to configure the system without modifying code. Never commit this file to version control:
# .env - Voice Agent Configuration
# Copy this to .env and fill in your values.
# LLM Backend: "ollama" (local) or "openai" (remote)
LLM_BACKEND=ollama
LLM_MODEL=llama3.2
# For OpenAI backend:
# LLM_BACKEND=openai
# LLM_MODEL=gpt-4o-mini
# OPENAI_API_KEY=sk-your-key-here
# STT Engine: "whisper" or "voicebox"
STT_ENGINE=whisper
WHISPER_MODEL=base
# TTS Engine: "edge", "gtts", or "pyttsx3"
TTS_ENGINE=edge
EDGE_TTS_VOICE=en-US-AriaNeural
# Telegram (only needed for --mode telegram)
TELEGRAM_BOT_TOKEN=your-bot-token-here
# Signal (only needed for --mode signal)
SIGNAL_API_URL=http://localhost:8080
SIGNAL_PHONE=+1234567890
CHAPTER XI: CONVERSATION HISTORY MANAGEMENT IN DEPTH
We have touched on conversation history management throughout this article, but it deserves a deeper treatment because it is one of the most nuanced aspects of building a production-quality chatbot.
The fundamental challenge is the context window. Every LLM has a maximum number of tokens it can process in a single call. This includes the system prompt, the entire conversation history, and the new user message. If the conversation grows too long, the call either fails with a context length error or, with some backends, silently drops the oldest part of the prompt.
The naive solution is to simply truncate the history to the last N messages. This works but has a significant drawback: if the user asks about something they mentioned early in the conversation, the agent will not remember it. The user experience is jarring: "But I told you my name was Alex at the beginning of our conversation!"
A more sophisticated approach is token-aware trimming. Instead of counting messages, you count tokens and trim the history to fit within a budget. This requires a tokenizer, which adds complexity. For most applications, message-count trimming is sufficient and much simpler.
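For readers who do want token-aware trimming, the idea can be sketched without committing to a particular tokenizer. The version below uses a deliberately crude estimate of roughly four characters per token, which is only a rule of thumb for English text and an assumption of this sketch. estimate_tokens and trim_to_budget are hypothetical helpers; a real tokenizer for your model can be swapped in for estimate_tokens.

```python
from typing import Dict, List

Message = Dict[str, str]


def estimate_tokens(text: str) -> int:
    """Crude token estimate: about 4 characters per token (assumption)."""
    return max(1, len(text) // 4)


def trim_to_budget(messages: List[Message], budget: int) -> List[Message]:
    """Keep the most recent messages whose estimated tokens fit the budget.

    Walks the history from newest to oldest and stops at the first
    message that would exceed the budget. The newest message is always
    kept, even if it alone is over budget.
    """
    kept: List[Message] = []
    used = 0
    for message in reversed(messages):
        cost = estimate_tokens(message["content"])
        if kept and used + cost > budget:
            break
        kept.append(message)
        used += cost
    kept.reverse()
    return kept
```

The budget should leave room for the system prompt and for the model's reply, so it is usually set well below the model's full context length.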
Another approach is summarization. When the history grows too long, you ask the LLM to summarize the earlier parts of the conversation into a compact summary, and you replace those messages with the summary. This preserves the key information from the early conversation while dramatically reducing the token count. Here is a simple implementation:
class SummarizingConversationHistory(ConversationHistory):
    """
    A conversation history that summarizes old messages when it grows too long.

    This extends ConversationHistory with an automatic summarization feature.
    When the number of turns reaches the threshold, the oldest half of the
    messages is summarized by the LLM and replaced with a single summary
    message. This preserves key information while reducing token usage.

    The summarization is done lazily: it only happens when add_user_message()
    is called and the history has reached the threshold. This avoids
    unnecessary LLM calls.
    """

    def __init__(
        self,
        llm_client: "LLMClient",
        session_id: str = "default",
        max_turns: int = 20,
        summarize_threshold: int = 15
    ):
        """
        Initialize the summarizing history manager.

        Args:
            llm_client: The LLM client to use for summarization.
            session_id: Unique session identifier.
            max_turns: Accepted for signature compatibility with
                ConversationHistory. It is not forwarded: the base class
                is created with max_turns=None, because summarization
                rather than turn-count trimming bounds the history here.
            summarize_threshold: Number of turns at which to trigger
                summarization. Should be less than max_turns.
        """
        super().__init__(session_id=session_id, max_turns=None)
        self.llm = llm_client
        self.summarize_threshold = summarize_threshold
        self._summary: str = ""  # The accumulated summary of old messages.

    def add_user_message(self, content: str) -> None:
        """
        Add a user message, triggering summarization if needed.

        If the current history has reached the summarize_threshold, the
        oldest half of the messages is summarized and replaced.
        """
        # Check if we need to summarize before adding the new message.
        turn_count = len(self.messages) // 2
        if turn_count >= self.summarize_threshold:
            self._summarize_old_messages()
        self.messages.append({"role": "user", "content": content})

    def _summarize_old_messages(self) -> None:
        """
        Summarize the oldest half of the conversation history.

        Takes the first half of the current messages, asks the LLM to
        summarize them, and replaces them with a single system message
        containing the summary. The second half (more recent messages)
        is kept intact.
        """
        # Set aside the summary message left by an earlier pass, if any.
        # Its text is already stored in self._summary, so sending it to
        # the LLM again would duplicate it in the combined summary.
        has_summary = (
            bool(self._summary)
            and bool(self.messages)
            and self.messages[0]["role"] == "system"
        )
        summary_prefix = self.messages[:1] if has_summary else []
        dialogue = self.messages[1:] if has_summary else self.messages

        # Split the dialogue in half on an even index, so a user message
        # is never separated from the assistant reply that follows it.
        midpoint = len(dialogue) // 2
        midpoint -= midpoint % 2
        if midpoint == 0:
            return
        old_messages = dialogue[:midpoint]
        recent_messages = dialogue[midpoint:]

        # Format the old messages for the summarization prompt.
        conversation_text = "\n".join([
            f"{msg['role'].upper()}: {msg['content']}"
            for msg in old_messages
        ])

        # Build the summarization prompt.
        summary_prompt = [
            {
                "role": "user",
                "content": (
                    f"Please summarize the following conversation excerpt "
                    f"concisely, preserving all key facts, names, preferences, "
                    f"and decisions mentioned:\n\n{conversation_text}"
                )
            }
        ]
        try:
            new_summary = self.llm.chat(
                messages=summary_prompt,
                system_prompt=(
                    "You are a conversation summarizer. Produce a concise, "
                    "factual summary that preserves all important information."
                )
            )
            # Combine with any existing summary.
            if self._summary:
                self._summary = (
                    f"Earlier conversation summary: {self._summary}\n\n"
                    f"More recent summary: {new_summary}"
                )
            else:
                self._summary = new_summary

            # Replace the old messages with a summary system message.
            summary_message = {
                "role": "system",
                "content": (
                    f"[Conversation summary - earlier context]: {self._summary}"
                )
            }
            self.messages = [summary_message] + recent_messages
            logger.info(
                f"[SummarizingHistory] Summarized {midpoint} old messages."
            )
        except Exception as e:
            logger.error(
                f"[SummarizingHistory] Summarization failed: {e}. "
                f"Falling back to simple truncation."
            )
            # Fall back to simple truncation if summarization fails,
            # keeping any summary produced by an earlier pass.
            self.messages = summary_prefix + recent_messages
The SummarizingConversationHistory class is close to a drop-in replacement for ConversationHistory: the only difference for callers is the additional llm_client constructor argument. It inherits all the same methods but overrides add_user_message() to trigger summarization when needed. The summarization itself is done by calling the LLM with a special prompt that asks it to condense the old messages into a compact summary.
Notice the careful handling of the existing summary: if summarization has already happened before, the new summary is combined with the old one rather than replacing it. This creates a chain of summaries that captures the full arc of the conversation, even if it has been going on for hundreds of turns. The chained summary itself grows with every pass, so for very long conversations you may eventually want to summarize the summary as well.
The fallback to simple truncation in the except block is an important safety net. If the summarization LLM call fails for any reason (network error, rate limit, etc.), the history is simply truncated rather than crashing the entire agent.
CHAPTER XII: DEPLOYMENT, SECURITY, AND FINAL THOUGHTS
Deployment Considerations
Running a Telegram bot requires a server that is always on. Your laptop is not a good choice for production deployment because it goes to sleep, loses network connectivity, and gets restarted. A small virtual private server (VPS) from providers like Hetzner, DigitalOcean, or Linode is the right choice. A VPS with 2GB of RAM and 2 CPU cores is sufficient for running the Whisper base model and edge-tts when the LLM is a remote API. Running a local Ollama instance on the same machine needs more memory: even a small model like Phi-3 mini occupies roughly 2 to 3GB in its default quantized form, so plan for at least 4GB of RAM, and preferably 8GB.
If you want to use a larger Whisper model or a larger LLM, you need more RAM and ideally a GPU. For GPU-accelerated inference, consider a cloud instance with a GPU (AWS g4dn.xlarge, Google Cloud n1-standard-4 with T4 GPU) or a dedicated GPU server.
For running the bot as a system service that automatically restarts on failure, use systemd on Linux:
# /etc/systemd/system/voice-agent.service
[Unit]
Description=Voice-Capable AI Agent
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
User=voice_agent
WorkingDirectory=/opt/voice_agent
EnvironmentFile=/opt/voice_agent/.env
ExecStart=/opt/voice_agent/venv/bin/python main.py --mode telegram
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
Reload systemd so that it picks up the new unit file, then enable and start the service:
sudo systemctl daemon-reload
sudo systemctl enable voice-agent
sudo systemctl start voice-agent
sudo systemctl status voice-agent
Security Considerations
Never hardcode API keys, bot tokens, or other secrets in your code. Always use environment variables or a secrets management system. The .env file approach used in this article is appropriate for development but should be replaced with a proper secrets manager (like HashiCorp Vault, AWS Secrets Manager, or Kubernetes Secrets) in production.
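As a minimal sketch of the environment-variable approach, the helper below reads a required secret and stops with a clear error if it is missing. The names require_env and MissingSecretError are hypothetical and chosen for illustration; only os.environ comes from the standard library.

```python
import os


class MissingSecretError(RuntimeError):
    """Raised when a required secret is absent from the environment."""


def require_env(name: str) -> str:
    """Return the value of a required environment variable.

    Raises MissingSecretError if the variable is unset or empty, so the
    process stops at startup instead of failing on the first request.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        # Report only the variable name, never a secret value.
        raise MissingSecretError(
            f"Required environment variable {name} is not set."
        )
    return value
```

At startup you would call something like token = require_env("TELEGRAM_BOT_TOKEN"). Failing early keeps a misconfigured deployment from running half-working, and the error message never echoes the secret itself.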
Telegram bots are public by default: anyone who finds your bot can send it messages. If you want to restrict access to specific users, add a whitelist check in the message handlers. Here is a simple example:
# Add to your .env file:
# ALLOWED_TELEGRAM_USERS=123456789,987654321
import os

def is_user_allowed(user_id: int) -> bool:
    """
    Check if a Telegram user is allowed to use the bot.

    Reads the allowed user IDs from the ALLOWED_TELEGRAM_USERS
    environment variable, which should be a comma-separated list
    of Telegram user IDs.

    If the environment variable is not set, all users are allowed.
    This is the default behavior for open bots.

    Args:
        user_id: The Telegram user's numeric ID.

    Returns:
        True if the user is allowed, False otherwise.
    """
    allowed_str = os.environ.get("ALLOWED_TELEGRAM_USERS", "")
    if not allowed_str:
        return True  # No whitelist configured: allow everyone.
    # Parse the comma-separated list, silently skipping malformed entries.
    allowed_ids = {
        int(uid.strip())
        for uid in allowed_str.split(",")
        if uid.strip().isdigit()
    }
    return user_id in allowed_ids
Add a call to is_user_allowed() at the beginning of each message handler, and return early with an "Access denied" message if the user is not on the whitelist.
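One way to avoid repeating that check in every handler is a small decorator. The sketch below assumes the asynchronous handler signature of python-telegram-bot version 20 and later, that is, async def handler(update, context), and uses only update.effective_user and update.message.reply_text from that library. The decorator name restricted and the echo handler are hypothetical, and is_user_allowed is repeated in condensed form only so that the block runs on its own.

```python
import functools
import os
from typing import Any, Awaitable, Callable

Handler = Callable[[Any, Any], Awaitable[None]]


def is_user_allowed(user_id: int) -> bool:
    """Condensed copy of the whitelist check (empty list allows everyone)."""
    allowed_str = os.environ.get("ALLOWED_TELEGRAM_USERS", "")
    if not allowed_str:
        return True
    allowed_ids = {
        int(uid.strip())
        for uid in allowed_str.split(",")
        if uid.strip().isdigit()
    }
    return user_id in allowed_ids


def restricted(handler: Handler) -> Handler:
    """Hypothetical decorator: run the handler only for whitelisted users."""

    @functools.wraps(handler)
    async def wrapper(update: Any, context: Any) -> None:
        user = update.effective_user
        # Updates without a sender (e.g. channel posts) are rejected too.
        if user is None or not is_user_allowed(user.id):
            if update.message is not None:
                await update.message.reply_text("Access denied.")
            return  # Return early: the wrapped handler never runs.
        await handler(update, context)

    return wrapper


@restricted
async def echo(update: Any, context: Any) -> None:
    """Example handler: only reached when the sender is on the whitelist."""
    await update.message.reply_text(update.message.text)
```

Because the check lives in one place, a handler added later cannot forget it as long as it carries the decorator.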
Be careful about the audio files you process. Malicious users could potentially send crafted audio files designed to exploit vulnerabilities in ffmpeg or the Whisper model. Keep all your dependencies up to date, and consider running the audio processing in a sandboxed environment (e.g., a Docker container with limited permissions) for production deployments.
Rate Limiting
Without rate limiting, a single user could flood your bot with messages and exhaust your LLM API quota or overwhelm your local compute resources. Implement a simple rate limiter that tracks the number of requests per user per minute:
import time
from collections import defaultdict
from threading import Lock

class RateLimiter:
    """
    A simple sliding-window rate limiter for per-user request throttling.

    This prevents any single user from sending too many requests in a
    short period of time. It tracks the timestamps of recent requests
    and rejects new requests if too many have been made within the
    window.

    This implementation is thread-safe. The lock is held only briefly
    and never across an await, so it is also safe to call from async
    Telegram handlers.

    Args:
        max_requests: Maximum number of requests allowed per window.
        window_seconds: The size of the sliding window in seconds.
    """

    def __init__(self, max_requests: int = 5, window_seconds: int = 60):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # Maps each user ID to the timestamps of that user's recent requests.
        self._requests: dict = defaultdict(list)
        self._lock = Lock()

    def is_allowed(self, user_id: int) -> bool:
        """
        Check if a user is allowed to make a request right now.

        Args:
            user_id: The user's identifier (Telegram user ID, etc.)

        Returns:
            True if the request is allowed, False if rate-limited.
        """
        now = time.time()
        window_start = now - self.window_seconds
        with self._lock:
            # Remove timestamps outside the current window.
            self._requests[user_id] = [
                ts for ts in self._requests[user_id]
                if ts > window_start
            ]
            # Check if the user has exceeded the limit.
            if len(self._requests[user_id]) >= self.max_requests:
                return False
            # Record this request.
            self._requests[user_id].append(now)
            return True

    def time_until_allowed(self, user_id: int) -> float:
        """
        Return the number of seconds until the user can make another request.

        Args:
            user_id: The user's identifier.

        Returns:
            Seconds until the next request is allowed, or 0 if allowed now.
        """
        now = time.time()
        window_start = now - self.window_seconds
        with self._lock:
            valid_requests = [
                ts for ts in self._requests.get(user_id, [])
                if ts > window_start
            ]
            if len(valid_requests) < self.max_requests:
                return 0.0
            # The next slot opens when the oldest request leaves the window.
            oldest = min(valid_requests)
            return (oldest + self.window_seconds) - now
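To show how the limiter might be wired into a handler, here is a minimal sketch. The helper rate_limit_notice and the Limiter protocol are hypothetical names introduced for illustration; the helper relies only on the two methods defined above, is_allowed() and time_until_allowed(), so any object providing them will work.

```python
import math
from typing import Optional, Protocol


class Limiter(Protocol):
    """Structural type: anything with the two RateLimiter methods."""

    def is_allowed(self, user_id: int) -> bool: ...

    def time_until_allowed(self, user_id: int) -> float: ...


def rate_limit_notice(limiter: Limiter, user_id: int) -> Optional[str]:
    """Return None if the request may proceed, else a message for the user."""
    if limiter.is_allowed(user_id):
        return None
    wait = limiter.time_until_allowed(user_id)
    # Round up so the user is never told to retry too early.
    seconds = max(1, math.ceil(wait))
    return f"Too many requests. Please wait {seconds}s before trying again."


# Inside an async Telegram handler this could be used as follows
# (rate_limiter is a module-level RateLimiter instance):
#
#     notice = rate_limit_notice(rate_limiter, update.effective_user.id)
#     if notice is not None:
#         await update.message.reply_text(notice)
#         return
```

Run this check before the expensive work (transcription, the LLM call, speech synthesis), so that a rejected request costs almost nothing.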
Final Thoughts
We have covered a lot of ground in this article. We started with the motivation for voice-capable agents, worked through the architecture, explored every major STT and TTS option, introduced the fascinating voicebox.sh tool, built a complete LLM abstraction that works with both local Ollama models and remote OpenAI models, and assembled everything into a working Telegram bot with conversation history management.
The most important takeaway is architectural: by separating the STT layer, the TTS layer, the LLM layer, and the conversation management layer into independent abstractions, we have built a system that is easy to extend and easy to maintain. When ElevenLabs releases a new voice model, you add a new TTSEngine subclass. When a new local LLM comes out, you change one environment variable. When you want to add a new messaging platform, you write a new bot handler without touching any of the core agent logic.
The second important takeaway is practical: the technology for voice-capable agents is mature, open-source, and surprisingly accessible. You do not need a GPU cluster or a large budget to build a voice agent that sounds good and works reliably. A Raspberry Pi 4 with Whisper tiny, pyttsx3, and Ollama running Phi-3 mini can handle a personal voice assistant. A modest VPS with Whisper base, edge-tts, and Ollama running Llama 3.2 can handle a multi-user Telegram bot with excellent voice quality.
The third takeaway is about experimentation: tools like voicebox.sh make it easy to explore voice cloning, different TTS engines, and voice effects without writing any code. Use voicebox.sh to find the right voice for your agent, then integrate it via the REST API. Use Whisper tiny for fast local testing, then upgrade to Whisper small or medium for production. Use Ollama with a small model for development, then switch to GPT-4o-mini for production if you need higher quality responses.
Your agent has found its voice. Now give it something interesting to say.
APPENDIX: QUICK REFERENCE - INSTALLATION COMMANDS
Install all Python dependencies:
pip install openai openai-whisper edge-tts gTTS pyttsx3 \
    python-telegram-bot pydub requests aiohttp \
    python-dotenv signalbot elevenlabs
Install system dependencies:
# Ubuntu/Debian (the aplay command is provided by the alsa-utils package)
sudo apt update && sudo apt install ffmpeg alsa-utils
# macOS
brew install ffmpeg
# Windows (with Chocolatey)
choco install ffmpeg
Install and start Ollama:
# Download from https://ollama.ai and install, then:
ollama serve &
ollama pull llama3.2
ollama pull phi3
Run the agent in local mode:
python main.py --mode local
Run the agent as a Telegram bot:
export TELEGRAM_BOT_TOKEN="your-token-here"
python main.py --mode telegram
Run the agent as a Signal bot (requires signal-cli-rest-api):
docker run -d -p 8080:8080 bbernhard/signal-cli-rest-api
export SIGNAL_PHONE="+1234567890"
python main.py --mode signal