PRD-74: Voice Interface & Conversational AI
Version: 1.0 | Status: Draft | Priority: P1 — High | Author: Gar Kavanagh + Auto CTO | Created: 2026-03-09 | Updated: 2026-03-09

Dependencies: PRD-01 (Core Orchestration Engine), PRD-50 (Universal Orchestrator Router), PRD-55 (Autonomous Assistant / Heartbeats), PRD-73 (Observability & Monitoring Stack)

Repositories:
- automatos-voice — Voice service (STT/TTS engines, Pipecat pipeline, model management). New repo.
- automatos-ai — Integration wiring (API endpoints, voice client, frontend components, migrations)

Branches:
- automatos-voice: main (new repo, starts fresh)
- automatos-ai: feat/prd-74-voice-interface

Deployment: Railway (voice services deployed within the same Railway project as automatos-ai, communicating via private networking)
Executive Summary
Automatos agents are text-only. Users type messages, agents reply with text. This limits the platform to desktop-with-keyboard use cases and excludes hands-free operation, mobile-first usage, accessibility needs, and the natural conversational flow that makes voice assistants feel alive.
This PRD adds voice capabilities to the Automatos platform across three phases:
Phase 1 — Voice Messages: Record audio in chat, get spoken responses back. Push-to-talk, like voice notes in WhatsApp. Works in web app and mobile.
Phase 2 — High-Quality Voices: Upgrade TTS to near-human quality with voice cloning, emotion, and multilingual support. Auto gets a distinctive voice.
Phase 3 — Live Conversation: Real-time bidirectional voice. Talk to Auto like a phone call — interruptions, turn-taking, streaming. Full conversational AI.
What We're Building
Voice Service — Self-hosted STT + TTS behind an OpenAI-compatible REST API, deployed as a Railway service
Frontend Voice UI — Mic button in chat, audio playback for responses, push-to-talk and hands-free modes
Orchestrator Integration — Voice pipeline wired into the existing message flow (no new orchestration path — voice is a transport, not a routing change)
Real-Time Voice Pipeline — WebSocket/WebRTC streaming for live conversation with VAD, interruption handling, and turn-taking (Phase 3)
Observability — Voice-specific metrics and logs feeding into PRD-73's monitoring stack
What We're NOT Building
Phone/telephony integration (Twilio, SIP trunks) — future PRD if needed
Voice-based agent-to-agent communication — agents communicate via existing channels
Custom model training or fine-tuning — we use pre-trained open-source models
Speaker identification / voice biometrics for auth — future consideration
Offline/on-device voice processing — all processing is server-side
A replacement for text chat — voice is additive, text remains the primary interface
1. Architecture Overview
1.1 Phase 1+2: Voice Messages (REST API)
1.2 Phase 3: Real-Time Conversation (WebSocket/WebRTC)
1.3 Key Architecture Decisions
Voice is a transport, not a route
Voice messages enter the same Universal Router pipeline as text
No duplicate orchestration logic. Agents don't know or care if input was voice or text.
OpenAI-compatible API
Voice service exposes /v1/audio/transcriptions and /v1/audio/speech
Can swap between self-hosted and OpenAI with a URL change. Frontend/backend code stays identical.
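The swap works because the paths are identical on both hosts. A minimal sketch of the idea (helper name is illustrative, not existing code):

```python
# The voice service mirrors OpenAI's audio API paths, so self-hosted vs
# hosted inference differs only in the base URL.
OPENAI_COMPAT_PATHS = {
    "stt": "/v1/audio/transcriptions",
    "tts": "/v1/audio/speech",
}

def audio_endpoint(base_url: str, kind: str) -> str:
    """Build the request URL for an STT or TTS call."""
    return base_url.rstrip("/") + OPENAI_COMPAT_PATHS[kind]

# Self-hosted: audio_endpoint("http://voice-service.railway.internal:8300", "stt")
# Hosted:      audio_endpoint("https://api.openai.com", "stt")  # same path
```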
Separate repo + Railway service
automatos-voice is its own repo, deployed as independent Railway service(s)
Heavy ML deps (PyTorch, model weights) stay out of the orchestrator. Independent CI/CD, scaling, and release cycle.
Two-repo boundary
automatos-voice owns inference. automatos-ai owns integration. Contract is the OpenAI-compatible REST API.
Clean separation of concerns. Voice service is replaceable — swap to OpenAI's hosted API by changing one URL.
Phase 3 adds a pipeline layer, doesn't replace Phase 1
REST endpoints remain for voice messages; WebSocket added for real-time
Voice messages (async) and live conversation (streaming) are both valid UX patterns.
STT + TTS engines are swappable
Provider abstraction from day one
Start with Whisper + Kokoro, upgrade to Chatterbox/CosyVoice without changing integration code.
2. Phase 1: Voice Messages
Goal: Users can send voice messages in chat and hear spoken responses. Push-to-talk UX.
2.1 Voice Service Deployment
Primary option: Speaches (MIT, 3k stars, actively maintained)
Single Docker container with STT (Faster-Whisper) + TTS (Kokoro/Piper)
OpenAI-compatible API out of the box
Docker Compose configs for CPU and CUDA
Fallback option: Custom FastAPI service wrapping Faster-Whisper + Kokoro-FastAPI
Only if Speaches proves too opinionated or unstable
Same OpenAI-compatible API contract
Railway service config:
| Setting | Value |
| --- | --- |
| Service name | voice-service |
| Image | ghcr.io/speaches-ai/speaches:latest (or custom Dockerfile) |
| Port | 8300 |
| Internal DNS | voice-service.railway.internal:8300 |
| Public domain | None (backend proxies all requests) |
| Volume | 5GB (model cache — Whisper + TTS models auto-download on first boot) |
| Resources | 2 vCPU / 4GB RAM minimum (CPU inference); GPU recommended for production |
Environment variables:
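The variable list is not filled in here. A plausible starting point (variable names are assumptions, not verified against Speaches' configuration reference):

```
# Hypothetical env vars — confirm exact names against the chosen image's docs
VOICE_STT_MODEL=Systran/faster-whisper-small
VOICE_TTS_MODEL=kokoro
VOICE_TTS_DEFAULT_VOICE=af_heart
VOICE_MAX_AUDIO_SECONDS=120
VOICE_MODEL_CACHE_DIR=/data/models
```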
2.2 Backend Integration
New endpoint: POST /api/chat/voice
Response format:
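The schema is not specified in this draft; one plausible shape (field names are assumptions) returning both the transcript and the synthesized reply:

```json
{
  "message_id": "msg_123",
  "transcript": "What's on my calendar today?",
  "response_text": "You have two meetings this afternoon.",
  "response_audio_url": "/api/chat/voice/audio/aud_456",
  "voice_metadata": {
    "stt_duration_ms": 640,
    "tts_duration_ms": 980,
    "audio_format": "mp3"
  }
}
```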
Voice service client:
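A sketch of the client, assuming the OpenAI-compatible contract from Section 1.3. Method names are illustrative; the real OpenAI transcription endpoint takes multipart form data (simplified here to a raw body), and the production client would be async with retries per the Phase 1 task list.

```python
import json
import urllib.request
from dataclasses import dataclass

@dataclass
class VoiceServiceClient:
    base_url: str = "http://voice-service.railway.internal:8300"
    timeout: float = 30.0

    def _endpoint(self, path: str) -> str:
        return f"{self.base_url.rstrip('/')}/v1/audio/{path}"

    def transcribe(self, audio: bytes, content_type: str = "audio/webm") -> str:
        """Send audio to STT; return the transcript text."""
        req = urllib.request.Request(
            self._endpoint("transcriptions"), data=audio,
            headers={"Content-Type": content_type},
        )
        with urllib.request.urlopen(req, timeout=self.timeout) as resp:
            return json.loads(resp.read())["text"]

    def synthesize(self, text: str, voice: str = "default") -> bytes:
        """Send text to TTS; return encoded audio (MP3 by default)."""
        body = json.dumps({"input": text, "voice": voice, "response_format": "mp3"})
        req = urllib.request.Request(
            self._endpoint("speech"), data=body.encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req, timeout=self.timeout) as resp:
            return resp.read()
```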
Config additions (config.py):
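The constant list is elided; a sketch following the platform's config.py convention (names are assumptions, defaults taken from limits stated elsewhere in this PRD — 25MB/120s uploads, 30-day TTL):

```python
import os

# Illustrative VOICE_* additions — all voice config centralized in config.py
VOICE_ENABLED = os.getenv("VOICE_ENABLED", "true").lower() == "true"
VOICE_SERVICE_URL = os.getenv(
    "VOICE_SERVICE_URL", "http://voice-service.railway.internal:8300"
)
VOICE_PROVIDER = os.getenv("VOICE_PROVIDER", "self_hosted")  # or "openai"
VOICE_STT_TIMEOUT_S = float(os.getenv("VOICE_STT_TIMEOUT_S", "30"))
VOICE_TTS_TIMEOUT_S = float(os.getenv("VOICE_TTS_TIMEOUT_S", "30"))
VOICE_MAX_UPLOAD_MB = int(os.getenv("VOICE_MAX_UPLOAD_MB", "25"))
VOICE_MAX_DURATION_S = int(os.getenv("VOICE_MAX_DURATION_S", "120"))
VOICE_AUDIO_TTL_DAYS = int(os.getenv("VOICE_AUDIO_TTL_DAYS", "30"))
```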
2.3 Frontend Voice UI
Chat input changes:
Components to build:
VoiceMicButton — Toggle recording, visual feedback (pulsing, waveform)
VoiceRecorder — MediaRecorder API wrapper, audio blob capture, format conversion
VoicePlayer — Audio playback for TTS responses, with play/pause/speed controls
VoiceMessage — Chat bubble variant showing waveform + transcript + play button
VoiceSettings — User preferences: auto-play responses, voice selection, speed
Audio format:
| Direction | Format | Notes |
| --- | --- | --- |
| User → STT | WebM/Opus (browser native) or WAV | MediaRecorder default; Whisper handles both |
| TTS → User | MP3 (default) or OGG/Opus | Smallest size for playback; configurable |
Key UX decisions:
Mic button replaces send button while recording (no simultaneous text + voice)
Transcript is always shown alongside audio (accessibility + searchability)
Auto-play TTS response is opt-in (off by default — respects quiet environments)
Voice messages are stored as regular messages with type: "voice" and an audio URL
Long-press mic for hands-free recording (release to send) — tap for toggle mode
2.4 Audio Storage
Voice audio files are stored in S3, not in the database.
Retention: Same as workspace document retention policy. Voice audio is ephemeral by default — 30-day TTL unless the user pins the message.
Database schema addition:
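The schema detail is elided here; an illustrative sketch of the Phase 1 change (column names beyond voice_metadata are assumptions — the actual change ships as an Alembic migration):

```sql
-- Voice messages reuse the messages table; audio stays in S3.
ALTER TABLE messages
    ADD COLUMN voice_metadata JSONB;
-- Example payload stored per voice message (illustrative):
-- {"audio_key": "workspaces/{wid}/voice/{msg_id}.webm",
--  "transcript": "...", "stt_ms": 640, "tts_ms": 980, "format": "webm"}
```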
3. Phase 2: High-Quality Voices
Goal: Upgrade TTS quality to near-human level. Give Auto a distinctive, consistent voice. Support voice cloning and multilingual output.
3.1 TTS Engine Upgrade
Phase 1 ships with Kokoro (82M params, Apache 2.0) — good quality, tiny footprint. Phase 2 upgrades to a top-tier engine.
Candidates (ranked by fit):
| Engine | Stars | Quality | Cloning | License | Streaming | Serving | Verdict |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Chatterbox (Resemble AI) | 23k | Beats ElevenLabs in blind tests | Yes, zero-shot | MIT | Turbo variant | Community server | Primary choice |
| CosyVoice 2 (Alibaba) | 20k | Top tier, 150ms streaming | Yes, multilingual | Apache 2.0 | Yes, native | Yes (FastAPI+gRPC+Docker) | Secondary choice |
| Orpheus (Canopy AI) | 6k | Near top, LLM-native prosody | Yes | Apache 2.0 | 25-50ms latency | Community FastAPI | Alternative |
| Fish Speech | 25k | #1 TTS-Arena | Yes | CC-BY-NC (non-commercial) | Yes | Yes | Excluded — non-commercial license |
| F5-TTS | 14k | Excellent | Yes | CC-BY-NC (non-commercial) | Chunked | Yes | Excluded — non-commercial license |
Recommended path:
Start with Chatterbox — MIT license, beats ElevenLabs, emotion control, 23 languages
Keep CosyVoice 2 as fallback — best production deployment story (Docker + gRPC + TensorRT)
Both fit behind the same OpenAI-compatible API contract from Phase 1
Voice provider abstraction:
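The abstraction itself is elided in this draft. A sketch of the TTSProvider ABC the Phase 2 task list calls for (method names and signatures are assumptions; HTTP calls are stubbed out):

```python
from abc import ABC, abstractmethod

class TTSProvider(ABC):
    @abstractmethod
    def synthesize(self, text: str, voice: str, fmt: str = "mp3") -> bytes:
        """Return encoded audio for `text` in the requested voice and format."""

class SelfHostedTTS(TTSProvider):
    """Talks to voice-service's OpenAI-compatible /v1/audio/speech."""
    def __init__(self, base_url: str):
        self.base_url = base_url.rstrip("/")
    def synthesize(self, text: str, voice: str, fmt: str = "mp3") -> bytes:
        raise NotImplementedError("HTTP call elided in this sketch")

class OpenAITTS(TTSProvider):
    """Same contract against api.openai.com — swapping is a config change."""
    def __init__(self, api_key: str):
        self.api_key = api_key
    def synthesize(self, text: str, voice: str, fmt: str = "mp3") -> bytes:
        raise NotImplementedError("HTTP call elided in this sketch")

def get_tts_provider(name: str, **kwargs) -> TTSProvider:
    """Select the provider from config (e.g. VOICE_PROVIDER)."""
    providers = {"self_hosted": SelfHostedTTS, "openai": OpenAITTS}
    return providers[name](**kwargs)
```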
3.2 Auto's Voice Identity
Auto (the platform's autonomous assistant) gets a distinctive voice configured at the platform level.
Voice profile storage:
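The storage schema is elided; an illustrative sketch of the voice_profiles table named in the Phase 2 task list (column names are assumptions):

```sql
CREATE TABLE voice_profiles (
    id              UUID PRIMARY KEY,
    workspace_id    UUID,                 -- NULL = platform-level (Auto's voice)
    name            TEXT NOT NULL,
    provider        TEXT NOT NULL,        -- e.g. 'kokoro', 'chatterbox', 'openai'
    provider_voice  TEXT,                 -- built-in voice id, if any
    reference_audio TEXT,                 -- S3 key for cloned voices
    language        TEXT DEFAULT 'en',
    created_at      TIMESTAMPTZ DEFAULT now()
);
```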
3.3 Per-Agent Voice Assignment
Agents can have distinct voices. A customer-facing agent might sound different from a technical agent.
When TTS is invoked, the system checks:
1. Agent has a voice_profile_id → use that voice
2. No agent voice → use workspace default voice
3. No workspace default → use platform default (AUTO_VOICE_ID)
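The fallback chain above is simple enough to sketch directly (field and constant names are illustrative):

```python
# Resolve which voice to use for TTS, falling back agent → workspace → platform.
PLATFORM_DEFAULT_VOICE = "auto-default"  # stands in for AUTO_VOICE_ID

def resolve_voice(agent: dict, workspace: dict) -> str:
    if agent.get("voice_profile_id"):
        return agent["voice_profile_id"]
    if workspace.get("default_voice_id"):
        return workspace["default_voice_id"]
    return PLATFORM_DEFAULT_VOICE
```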
3.4 Voice Cloning Workflow
For workspaces that want custom agent voices:
1. User uploads 10-30 seconds of reference audio via Settings → Voices
2. Backend stores reference audio in S3: s3://automatos-ai/workspaces/{wid}/voices/{profile_id}/reference.wav
3. Voice profile created with reference_audio pointing to the S3 key
4. TTS provider uses the reference audio for zero-shot cloning at synthesis time
5. User can preview and assign to agents
Security considerations:
Voice cloning is workspace-admin only
Reference audio scanned for minimum quality (duration, sample rate, SNR)
Consent acknowledgement required before enabling cloning
Rate-limited to prevent abuse
4. Phase 3: Live Conversation
Goal: Real-time bidirectional voice. Talk to Auto like a phone call. Interruptions work. Streaming in both directions.
4.1 Framework Selection
| Framework | Stars | Transport | Self-hosted | Providers | Verdict |
| --- | --- | --- | --- | --- | --- |
| Pipecat (Daily.co) | 5k+ | WebSocket via Daily/LiveKit | Yes | 15+ STT, 20+ TTS providers | Best fit — Python, modular, vendor-agnostic |
| LiveKit Agents | 22k + 6k | WebRTC | Yes (Go server) | Plugin-based | Strong infra but heavier to deploy |
| EchoKit | New | WebSocket | Yes (Rust) | Config-driven | Lightweight but immature |
| TEN Framework | 5k+ | WebSocket | Yes | Graph-based | Over-complex for our needs |
Recommended: Pipecat
Python-native (fits our stack)
Pluggable STT/TTS providers (use our Phase 1/2 engines)
Handles VAD, interruption, turn-taking, streaming out of the box
BSD 2-Clause license
NVIDIA partnership, active development
Can run with or without Daily.co's WebRTC infra (local WebSocket transport available)
4.2 Voice Pipeline Architecture
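The architecture diagram is elided in this draft. A conceptual sketch of the dataflow (this is NOT Pipecat code — just the stage shape that Pipecat manages with real async streaming; function names are illustrative):

```python
from typing import Callable, List

Frame = dict  # e.g. {"audio": b"..."} or {"text": "..."}

def run_pipeline(stages: List[Callable[[Frame], Frame]], frame: Frame) -> Frame:
    """Push one frame through each stage in order."""
    for stage in stages:
        frame = stage(frame)
    return frame

# Stage order for live conversation:
#   mic audio → VAD → STT (partial transcripts) → orchestrator/LLM → TTS → speaker
def vad(frame: Frame) -> Frame:
    frame["speech"] = len(frame.get("audio", b"")) > 0  # stand-in for Silero VAD
    return frame

def stt(frame: Frame) -> Frame:
    frame["text"] = "<partial transcript>"  # stand-in for streaming Whisper
    return frame

result = run_pipeline([vad, stt], {"audio": b"\x00\x01"})
```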
4.3 Real-Time Features
Voice Activity Detection (VAD) — Silero VAD (included in Pipecat). Detects when the user starts/stops speaking. No manual push-to-talk needed.
Interruption handling — When the user speaks while TTS is playing, immediately stop TTS, discard buffered audio, process the new user input.
Turn-taking — VAD silence threshold (configurable, default 800ms) determines when the user is "done speaking."
Streaming STT — Partial transcripts stream to the LLM as the user speaks. The LLM can begin generating before the user finishes.
Streaming TTS — LLM response tokens stream to TTS. The first audio chunk plays before the full response is generated.
Backpressure — If TTS generation is slower than playback, buffer. If faster, queue. Never drop audio.
Graceful degradation — If the voice service is down, fall back to text-only with a "voice unavailable" indicator.
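The turn-taking rule above reduces to a trailing-silence check. A minimal sketch, assuming per-frame VAD output and the 800ms default from this section:

```python
SILENCE_THRESHOLD_MS = 800  # default turn-end threshold, per Section 4.3

def turn_ended(vad_frames: list, frame_ms: int = 20,
               threshold_ms: int = SILENCE_THRESHOLD_MS) -> bool:
    """vad_frames: per-frame VAD output, True = speech detected.
    The turn ends when the trailing run of silent frames spans threshold_ms."""
    silent_ms = 0
    for is_speech in reversed(vad_frames):
        if is_speech:
            break
        silent_ms += frame_ms
    return silent_ms >= threshold_ms
```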
4.4 Frontend Real-Time UI
Components:
VoiceCallPanel — Full-screen or docked panel for live voice mode
LiveTranscript — Real-time transcript display as the user speaks
VoiceActivityIndicator — Visual feedback for who is "speaking"
InterruptionHandler — Detects user speech during playback, cancels TTS
4.5 Railway Deployment (Phase 3)
The voice pipeline runs as a separate Railway service:
| Setting | Value |
| --- | --- |
| Service name | voice-pipeline |
| Image | Custom Dockerfile (Pipecat + providers) |
| Port | 8301 (WebSocket) |
| Internal DNS | voice-pipeline.railway.internal:8301 |
| Public domain | Yes — voice.automatos.app (WebSocket needs public access for browser connections) |
| Resources | 2 vCPU / 4GB RAM minimum |
Note: The voice pipeline service is in addition to the voice service from Phase 1. Phase 1's voice service provides the STT/TTS engines. Phase 3's voice pipeline orchestrates real-time streaming between them.
5. Observability & Monitoring Integration (PRD-73)
Voice adds new failure modes and latency-sensitive paths. Full integration with PRD-73's monitoring stack is required.
5.1 Prometheus Metrics
The voice service and voice pipeline expose /metrics endpoints scraped by Prometheus.
Voice Service metrics:
Voice Pipeline metrics (Phase 3):
5.2 Prometheus Scrape Config Addition
5.3 Alert Rules
5.4 Grafana Dashboard: Voice Performance
A new dashboard added to PRD-73's Grafana provisioning:
| Panel | Type | Query / Source |
| --- | --- | --- |
| Voice Service Health | Status indicator (up/down) | Prometheus |
| STT Requests/min | Time series | rate(voice_stt_requests_total[1m]) |
| TTS Requests/min | Time series | rate(voice_tts_requests_total[1m]) |
| STT Latency (p50/p95/p99) | Time series | histogram_quantile(voice_stt_duration_seconds) |
| TTS Latency (p50/p95/p99) | Time series | histogram_quantile(voice_tts_duration_seconds) |
| End-to-End Latency (Phase 3) | Time series | voice_end_to_end_latency_seconds |
| Error Rate | Time series | rate(voice_*_errors_total[5m]) |
| Inference Queue Depth | Gauge | voice_inference_queue_depth |
| Active Voice Sessions (Phase 3) | Gauge | voice_sessions_active |
| Model Load Status | Table | voice_model_loaded |
| Audio Processed (minutes) | Stat panel | sum(rate(voice_stt_audio_duration_seconds_sum[1h])) |
| Voice Logs | Log panel | Loki: {service="voice-service"} |
5.5 Loki Log Labels
Voice service logs use structured JSON with these labels:
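The label list is elided here; at minimum, the labels named in the Phase 1 task list (value sets are assumptions):

```
service      = "voice-service" or "voice-pipeline"
component    = "stt", "tts", or "pipeline"
workspace_id = workspace UUID
```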
5.6 Backend Logging
All voice operations logged through the backend's existing structured logging with voice-specific fields:
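A sketch of the logging helper (function and field names are illustrative; the fields mirror the Loki labels used elsewhere in the stack):

```python
import json
import logging

logger = logging.getLogger("voice")

def log_voice_event(component: str, event: str, workspace_id: str, **fields) -> dict:
    """Emit one JSON log line with voice-specific fields attached."""
    record = {
        "service": "automatos-ai",
        "component": component,        # "stt" | "tts" | "pipeline"
        "event": event,                # e.g. "stt_complete", "tts_error"
        "workspace_id": workspace_id,
        **fields,
    }
    logger.info(json.dumps(record))
    return record  # returned for testability

# Example:
# log_voice_event("stt", "stt_complete", "ws_123", duration_ms=640)
```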
6. Security Considerations
Audio contains sensitive data — Audio files live in S3 with workspace-scoped keys. Same access controls as documents. 30-day TTL default.
Voice cloning abuse — Cloning restricted to workspace admins. Consent acknowledgement required. Rate-limited.
Audio injection attacks — Validate audio format, duration, and size before processing. Reject files >25MB or >120s.
Voice service access — Internal only (no public domain). Backend proxies all requests. No direct frontend-to-voice-service communication in Phase 1/2.
WebSocket auth (Phase 3) — Voice pipeline WebSocket requires a JWT token in the connection handshake. Same auth as the chat WebSocket.
Model supply chain — Pin model versions. Verify HuggingFace checksums on download. Don't auto-update models in production.
No hardcoded credentials — All config via config.py → Railway env vars. Voice service API keys (if any) as Railway secrets.
Denial of service — Rate limit voice endpoints per workspace (e.g., 60 STT requests/min, 60 TTS requests/min). Queue with backpressure.
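The per-workspace limit can be sketched as a sliding window (an in-process illustration only; a production version would back this with Redis so the limit holds across backend replicas, and class/method names are assumptions):

```python
import time
from collections import defaultdict, deque
from typing import Optional

class WorkspaceRateLimiter:
    def __init__(self, max_per_minute: int = 60):
        self.max_per_minute = max_per_minute
        self._events = defaultdict(deque)  # workspace_id → request timestamps

    def allow(self, workspace_id: str, now: Optional[float] = None) -> bool:
        """True if this request fits within the trailing 60-second window."""
        now = time.monotonic() if now is None else now
        window = self._events[workspace_id]
        while window and now - window[0] >= 60.0:  # evict expired entries
            window.popleft()
        if len(window) >= self.max_per_minute:
            return False
        window.append(now)
        return True
```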
7. Mobile Considerations
Voice is especially valuable on mobile where typing is slower.
Mic permissions — Request on first mic button tap. Show a clear permission rationale.
Background audio — TTS playback continues when the app is backgrounded (mobile web audio API).
Bandwidth — Compress audio before upload (Opus codec, ~32kbps). Keep TTS responses in MP3/Opus.
Offline indicator — If the voice service is unreachable, the mic button shows an "unavailable" state. Text input remains functional.
Touch UX — Long-press mic for push-to-talk. Tap for toggle mode. Swipe to cancel recording.
Low-latency priority — Mobile users expect <2s response. Phase 1 target: STT <1s + TTS <1.5s for typical messages.
8. API Reference
8.1 Voice Chat Endpoint
8.2 Audio Retrieval
8.3 Voice Profiles (Phase 2)
8.4 Voice WebSocket (Phase 3)
9. Implementation Phases
Phase 1: Voice Messages (4-6 weeks)
automatos-voice repo (new):
| Task | Priority | Size | Notes |
| --- | --- | --- | --- |
| Create automatos-voice repo with README, .env.example | P0 | S | New GitHub repo under Automatos-AI-Platform org |
| Dockerfile for voice-service (Speaches-based or custom FastAPI) | P0 | M | Speaches image + custom config layer |
| /health endpoint | P0 | S | JSON health status for Prometheus |
| /metrics endpoint (Prometheus) | P1 | M | STT/TTS request counters, latency histograms, queue depth |
| Docker Compose for local dev | P0 | S | voice-service on port 8300 |
| Deploy voice-service as Railway service | P0 | M | Docker image, 5GB volume for models, env config |
| Railway service config (.toml) | P0 | S | voice-service.toml |
| Voice alert rules YAML (for automatos-monitoring) | P1 | S | VoiceServiceDown, latency, error rate |
| Local dev setup script (model download, dirs) | P2 | S | scripts/setup-local.sh |
automatos-ai repo (existing):
| Task | Priority | Size | Notes |
| --- | --- | --- | --- |
| Add VOICE_* config constants to config.py | P0 | S | All config centralized per platform rules |
| Build VoiceServiceClient (STT + TTS HTTP client) | P0 | M | OpenAI-compatible, async, retry + timeout |
| Build POST /api/chat/voice endpoint | P0 | M | Audio upload → STT → router → TTS → response |
| Build GET /api/chat/voice/audio/{id} endpoint | P0 | S | S3 presigned URL or proxy |
| Add voice_metadata column to messages | P0 | S | Alembic migration |
| S3 audio storage (upload/retrieve/TTL) | P0 | M | Reuse existing S3 client |
| Input validation (format, size, duration limits) | P0 | S | Security boundary |
| Rate limiting on voice endpoints | P1 | S | Per-workspace throttle |
| Loki structured logging for voice operations | P1 | S | Labels: service, component, workspace_id |
| Feature flag: VOICE_ENABLED | P1 | S | Kill switch |
| Frontend: VoiceMicButton component | P0 | M | MediaRecorder, visual feedback |
| Frontend: VoiceRecorder (audio capture + encoding) | P0 | M | WebM/Opus capture, size validation |
| Frontend: VoicePlayer (audio playback) | P0 | M | Play/pause/speed controls |
| Frontend: VoiceMessage chat bubble variant | P0 | S | Waveform + transcript + play button |
| Frontend: VoiceSettings (auto-play, speed) | P2 | S | User preferences |
| Graceful degradation (voice unavailable indicator) | P2 | S | Mic button disabled state |
automatos-monitoring repo (PRD-73):
| Task | Priority | Size | Notes |
| --- | --- | --- | --- |
| Add voice-service scrape target to Prometheus config | P1 | S | voice-service.railway.internal:8300 |
| Add voice alert rules to Prometheus rules directory | P1 | S | Copy from automatos-voice/monitoring/ |
| Build Grafana Voice Performance dashboard | P1 | M | New dashboard JSON |
Phase 2: High-Quality Voices (3-4 weeks)
automatos-voice repo:
| Task | Priority | Size | Notes |
| --- | --- | --- | --- |
| Add Chatterbox TTS to voice-service (Dockerfile + config) | P0 | M | Second TTS engine alongside Kokoro |
| Chatterbox Docker image or custom wrapper | P0 | M | GPU if available, CPU fallback |
| Voice cloning inference endpoint (reference audio → synthesis) | P1 | M | Zero-shot cloning via Chatterbox |
| CosyVoice 2 integration (backup TTS engine) | P2 | M | Alternative if Chatterbox disappoints |
automatos-ai repo:
| Task | Priority | Size | Notes |
| --- | --- | --- | --- |
| Build TTS provider abstraction (TTSProvider ABC) | P0 | M | Base class + self-hosted + OpenAI providers |
| Voice profiles DB table + migration | P0 | S | voice_profiles table |
| Voice profiles CRUD API | P0 | M | Admin endpoints |
| Per-agent voice assignment | P0 | S | voice_profile_id on agents table, migration |
| Auto's default voice configuration | P1 | S | Platform-level config |
| Voice cloning workflow (upload reference → create profile) | P1 | M | Admin only, with consent, S3 storage |
| Frontend: Voice selection in agent settings | P1 | M | Dropdown + preview playback |
| Frontend: Voice profiles management page | P1 | M | Settings → Voices |
Phase 3: Live Conversation (6-8 weeks)
automatos-voice repo:
| Task | Priority | Size | Notes |
| --- | --- | --- | --- |
| Evaluate Pipecat vs LiveKit Agents (spike) | P0 | M | 1-week spike, build POC with each |
| Build voice-pipeline service (Pipecat) | P0 | L | New service in repo, Dockerfile |
| Build AutomatosOrchestratorProcessor (pipeline ↔ orchestrator bridge) | P0 | L | HTTP client to backend for chat routing |
| VAD integration (Silero) | P0 | M | Voice activity detection |
| Streaming STT integration | P0 | M | Partial transcripts via voice-service |
| Streaming TTS integration | P0 | M | Audio chunk streaming via voice-service |
| Interruption handling | P0 | M | Cancel TTS on user speech |
| Voice pipeline /metrics endpoint | P1 | M | Session, latency, interruption metrics |
| Deploy voice-pipeline as Railway service | P0 | L | Public domain (WebSocket needs browser access) |
| Turn-taking tuning (silence threshold, VAD sensitivity) | P2 | M | Iterative UX refinement |
automatos-ai repo:
| Task | Priority | Size | Notes |
| --- | --- | --- | --- |
| Frontend: VoiceCallPanel | P0 | L | Full real-time voice UI |
| Frontend: LiveTranscript component | P0 | M | Real-time transcript display |
| Frontend: WebSocket audio streaming (useVoiceStream) | P0 | L | MediaStream → WS → pipeline |
| Frontend: InterruptionHandler | P1 | M | Client-side interrupt detection |
| Mobile-optimized voice UI | P1 | M | Touch UX, bandwidth optimization |
| Multi-language live conversation | P2 | M | Auto-detect language, switch TTS voice |
automatos-monitoring repo (PRD-73):
| Task | Priority | Size | Notes |
| --- | --- | --- | --- |
| Add voice-pipeline scrape target to Prometheus | P1 | S | voice-pipeline.railway.internal:8301 |
| Add voice-pipeline alert rules | P1 | S | E2E latency, session failures |
| Update Grafana Voice dashboard with Phase 3 panels | P1 | M | Sessions, interruptions, streaming latency |
10. Success Criteria
Phase 1
| Metric | Target |
| --- | --- |
| STT accuracy (WER) | <10% on English conversational speech |
| STT latency (p95) | <2s for 10s audio clip |
| TTS latency (p95) | <1.5s for 100-word response |
| Voice service uptime | >99% when platform is running |
| Audio upload success rate | >98% |
| Voice messages visible in chat history with transcript | 100% |
| Text chat unaffected when voice service is down | 100% |
Phase 2
| Metric | Target |
| --- | --- |
| TTS naturalness (subjective) | Indistinguishable from human in casual listening |
| Voice cloning quality | Recognizable as the reference speaker |
| Provider swap time | <1 hour to switch TTS engine (config change + restart) |
| Per-agent voice differentiation | Configurable per agent, audibly distinct |
Phase 3
| Metric | Target |
| --- | --- |
| End-to-end latency (user done speaking → first audio response) | <2s p95 |
| Interruption response time | <500ms (TTS stops within 500ms of user speaking) |
| Conversation naturalness | Turn-taking feels natural, no awkward pauses >2s |
| Concurrent voice sessions | >=10 simultaneous |
| Session drop rate | <5% |
11. File Structure
11.1 automatos-voice (New Repo — Voice Service)
Owns all ML inference, model management, and the real-time voice pipeline. Deployed as independent Railway service(s).
Key boundaries:
This repo has NO knowledge of the orchestrator, agents, workspaces, or database
It exposes only OpenAI-compatible HTTP endpoints (/v1/audio/transcriptions, /v1/audio/speech)
It exposes /health and /metrics for PRD-73 monitoring integration
Model weights are downloaded at first boot and cached in a Railway volume
11.2 automatos-ai (Existing Repo — Integration Wiring)
Owns the API endpoints, voice client, frontend components, and database schema. Talks to automatos-voice via HTTP over Railway private networking.
Key boundaries:
This repo has NO ML dependencies (no PyTorch, no model weights, no audio inference)
VoiceServiceClient makes HTTP calls to voice-service.railway.internal:8300 — same contract as OpenAI's API
Swapping to OpenAI's hosted voice API = change VOICE_SERVICE_URL in config, switch provider to openai.py
Provider abstraction lives here (not in automatos-voice) because the orchestrator decides which provider to use
11.3 Repo Boundary Contract
The ONLY interface between the two repos is the OpenAI-compatible REST API:
No shared database. No shared message queue. No shared code. If automatos-voice goes down, text chat is unaffected.
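A sketch of that surface (request/response shapes are assumptions where this PRD does not state them):

```
POST /v1/audio/transcriptions   audio upload              → {"text": "..."}
POST /v1/audio/speech           {"input","voice","format"} → audio bytes
GET  /health                    → service status JSON
GET  /metrics                   → Prometheus text exposition
```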
12. Relationship to Existing PRDs & Repos
12.1 Cross-Repo Dependencies
| Repo | Role | Depends on |
| --- | --- | --- |
| automatos-voice (new) | Voice inference service (STT/TTS engines, Pipecat pipeline) | Nothing — standalone service |
| automatos-ai (existing) | Integration wiring (API endpoints, voice client, frontend, migrations) | automatos-voice running and reachable via Railway private network |
| automatos-monitoring (PRD-73) | Prometheus scrape targets, alert rules, Grafana dashboard for voice | automatos-voice exposing /metrics and /health |
12.2 PRD Relationships
PRD-01 (Core Orchestration) — Voice is a transport layer. Messages enter the same orchestration pipeline regardless of input modality.
PRD-50 (Universal Orchestrator Router) — Voice transcripts are routed through the Universal Router exactly like typed messages. No routing changes needed.
PRD-55 (Autonomous Assistant / Heartbeats) — Auto's heartbeat responses can be spoken aloud if the user is in voice mode. Voice profile determines Auto's voice.
PRD-57 (Mobile-First Responsive) — Voice is mobile's killer feature. Voice UI components must be mobile-optimized from Phase 1.
PRD-71 (Unified Skills Architecture) — Voice is not a skill — it's a transport. Skills don't need voice awareness.
PRD-72 (Activity Command Centre) — Voice events (sessions started, errors) surface in the activity feed.
PRD-73 (Observability & Monitoring) — Voice service scrape targets, alert rules, Grafana dashboard, and Loki log labels integrate into the monitoring stack. See Section 5.
13. Open Questions
| # | Question | Impacts | Decide by |
| --- | --- | --- | --- |
| 1 | GPU or CPU for voice service on Railway? CPU is cheaper but slower. GPU reduces latency 3-5x. | Phase 1 latency targets | Before deployment |
| 2 | Should voice audio be transcribed and stored, or ephemeral? Current design stores both. | Storage costs, privacy | Phase 1 |
| 3 | Pipecat vs LiveKit Agents — need a 1-week spike to validate. | Phase 3 architecture | Before Phase 3 starts |
| 4 | Should Auto's voice be consistent across all agents, or should each agent have a distinct voice by default? | UX identity | Phase 2 |
| 5 | Rate limits per workspace — what's reasonable? 60 STT/min and 60 TTS/min proposed. | Cost, abuse prevention | Phase 1 |
| 6 | Voice cloning consent — legal review needed for different jurisdictions? | Compliance | Phase 2 |
14. Glossary
STT — Speech-to-Text: converting audio to a text transcript
TTS — Text-to-Speech: converting text to spoken audio
VAD — Voice Activity Detection: detecting when a user is speaking vs. silent
WebRTC — Web Real-Time Communication: browser standard for real-time audio/video
Whisper — OpenAI's open-source STT model family (tiny → large-v3)
Kokoro — 82M-parameter TTS model (Apache 2.0, StyleTTS2-based)
Chatterbox — Resemble AI's open-source TTS (MIT, beats ElevenLabs in blind tests)
CosyVoice 2 — Alibaba's open-source TTS (Apache 2.0, 150ms streaming, 9 languages)
Pipecat — Daily.co's open-source Python framework for real-time voice AI pipelines
Speaches — Open-source unified STT+TTS server (OpenAI-compatible API, Docker-native)
Opus — Audio codec optimized for speech: low bitrate, high quality
Zero-shot cloning — Cloning a voice from a short audio sample without fine-tuning