PRD-74: Voice Interface & Conversational AI

Version: 1.0
Status: Draft
Priority: P1 — High
Author: Gar Kavanagh + Auto CTO
Created: 2026-03-09
Updated: 2026-03-09
Dependencies: PRD-01 (Core Orchestration Engine), PRD-50 (Universal Orchestrator Router), PRD-55 (Autonomous Assistant / Heartbeats), PRD-73 (Observability & Monitoring Stack)
Repositories:

  • automatos-voice — Voice service (STT/TTS engines, Pipecat pipeline, model management). New repo.

  • automatos-ai — Integration wiring (API endpoints, voice client, frontend components, migrations)

Branches:

  • automatos-voice: main (new repo, starts fresh)

  • automatos-ai: feat/prd-74-voice-interface

Deployment: Railway (voice services deployed within the same Railway project as automatos-ai, communicating via private networking)


Executive Summary

Automatos agents are text-only. Users type messages, agents reply with text. This limits the platform to desktop-with-keyboard use cases and excludes hands-free operation, mobile-first usage, accessibility needs, and the natural conversational flow that makes voice assistants feel alive.

This PRD adds voice capabilities to the Automatos platform across three phases:

  1. Phase 1 — Voice Messages: Record audio in chat, get spoken responses back. Push-to-talk, like voice notes in WhatsApp. Works in web app and mobile.

  2. Phase 2 — High-Quality Voices: Upgrade TTS to near-human quality with voice cloning, emotion, and multilingual support. Auto gets a distinctive voice.

  3. Phase 3 — Live Conversation: Real-time bidirectional voice. Talk to Auto like a phone call — interruptions, turn-taking, streaming. Full conversational AI.

What We're Building

  1. Voice Service — Self-hosted STT + TTS behind an OpenAI-compatible REST API, deployed as a Railway service

  2. Frontend Voice UI — Mic button in chat, audio playback for responses, push-to-talk and hands-free modes

  3. Orchestrator Integration — Voice pipeline wired into the existing message flow (no new orchestration path — voice is a transport, not a routing change)

  4. Real-Time Voice Pipeline — WebSocket/WebRTC streaming for live conversation with VAD, interruption handling, and turn-taking (Phase 3)

  5. Observability — Voice-specific metrics and logs feeding into PRD-73's monitoring stack

What We're NOT Building

  • Phone/telephony integration (Twilio, SIP trunks) — future PRD if needed

  • Voice-based agent-to-agent communication — agents communicate via existing channels

  • Custom model training or fine-tuning — we use pre-trained open-source models

  • Speaker identification / voice biometrics for auth — future consideration

  • Offline/on-device voice processing — all processing is server-side

  • A replacement for text chat — voice is additive, text remains the primary interface


1. Architecture Overview

1.1 Phase 1+2: Voice Messages (REST API)

1.2 Phase 3: Real-Time Conversation (WebSocket/WebRTC)

1.3 Key Architecture Decisions

| Decision | Choice | Rationale |
|---|---|---|
| Voice is a transport, not a route | Voice messages enter the same Universal Router pipeline as text | No duplicate orchestration logic. Agents don't know or care if input was voice or text. |
| OpenAI-compatible API | Voice service exposes /v1/audio/transcriptions and /v1/audio/speech | Can swap between self-hosted and OpenAI with a URL change. Frontend/backend code stays identical. |
| Separate repo + Railway service | automatos-voice is its own repo, deployed as independent Railway service(s) | Heavy ML deps (PyTorch, model weights) stay out of the orchestrator. Independent CI/CD, scaling, and release cycle. |
| Two-repo boundary | automatos-voice owns inference; automatos-ai owns integration. Contract is the OpenAI-compatible REST API. | Clean separation of concerns. Voice service is replaceable — swap to OpenAI's hosted API by changing one URL. |
| Phase 3 adds a pipeline layer, doesn't replace Phase 1 | REST endpoints remain for voice messages; WebSocket added for real-time | Voice messages (async) and live conversation (streaming) are both valid UX patterns. |
| STT + TTS engines are swappable | Provider abstraction from day one | Start with Whisper + Kokoro, upgrade to Chatterbox/CosyVoice without changing integration code. |


2. Phase 1: Voice Messages

Goal: Users can send voice messages in chat and hear spoken responses. Push-to-talk UX.

2.1 Voice Service Deployment

Primary option: Speaches (MIT, 3k stars, actively maintained)

  • Single Docker container with STT (Faster-Whisper) + TTS (Kokoro/Piper)

  • OpenAI-compatible API out of the box

  • Docker Compose configs for CPU and CUDA

Fallback option: Custom FastAPI service wrapping Faster-Whisper + Kokoro-FastAPI

  • Only if Speaches proves too opinionated or unstable

  • Same OpenAI-compatible API contract

Railway service config:

| Attribute | Value |
|---|---|
| Service name | voice-service |
| Image | ghcr.io/speaches-ai/speaches:latest (or custom Dockerfile) |
| Port | 8300 |
| Internal DNS | voice-service.railway.internal:8300 |
| Public domain | None (backend proxies all requests) |
| Volume | 5GB (model cache — Whisper + TTS models auto-download on first boot) |
| Resources | 2 vCPU / 4GB RAM minimum (CPU inference); GPU recommended for production |

Environment variables:
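
A plausible starting set is sketched below. The variable names are illustrative assumptions, not the actual Speaches configuration surface — verify against the documentation of whichever image ships:

```
# Illustrative only; confirm names against the chosen image's docs.
STT_MODEL=Systran/faster-whisper-small   # Whisper variant to preload
TTS_MODEL=kokoro                         # default TTS engine
TTS_DEFAULT_VOICE=af_heart               # default Kokoro voice
MODEL_CACHE_DIR=/data/models             # Railway volume mount point
PORT=8300
LOG_LEVEL=info
```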

2.2 Backend Integration

New endpoint: POST /api/chat/voice

Response format:
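
The JSON shape below is an illustrative sketch of what the endpoint could return — field names are assumptions, not a final contract:

```json
{
  "message_id": "msg_123",
  "transcript": "What's the status of the deploy?",
  "response_text": "The deploy finished ten minutes ago.",
  "response_audio_url": "/api/chat/voice/audio/msg_123",
  "voice_metadata": { "stt_ms": 640, "tts_ms": 980, "audio_duration_s": 6.2 }
}
```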

Voice service client:
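
A minimal sketch of the client, assuming the OpenAI-compatible paths named in this PRD. The production version would be async (e.g. httpx) with retry and timeout per the task list; this stdlib-only version just illustrates the contract, and all names besides the endpoint paths are assumptions:

```python
# Sketch only: stdlib-based stand-in for the async VoiceServiceClient.
import json
import urllib.request


class VoiceServiceClient:
    def __init__(self, base_url: str = "http://voice-service.railway.internal:8300",
                 timeout: float = 30.0):
        self.base_url = base_url.rstrip("/")
        self.timeout = timeout

    @property
    def stt_url(self) -> str:
        # OpenAI-compatible transcription endpoint
        return f"{self.base_url}/v1/audio/transcriptions"

    @property
    def tts_url(self) -> str:
        # OpenAI-compatible speech-synthesis endpoint
        return f"{self.base_url}/v1/audio/speech"

    def synthesize(self, text: str, voice: str, fmt: str = "mp3") -> bytes:
        """POST text to the TTS endpoint; returns raw audio bytes."""
        body = json.dumps({"model": "tts-1", "input": text,
                           "voice": voice, "response_format": fmt}).encode()
        req = urllib.request.Request(
            self.tts_url, data=body,
            headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req, timeout=self.timeout) as resp:
            return resp.read()
```

Because the client only knows a base URL, swapping to OpenAI's hosted API is a constructor-argument change, as the architecture table promises.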

Config additions (config.py):
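
A hedged sketch of the VOICE_* constants. The names follow the convention mentioned in the Phase 1 task list; the exact set and defaults are assumptions to be settled during implementation. Limits mirror Sections 2.4 and 6:

```python
# Hypothetical additions to config.py (names and defaults illustrative).
import os

VOICE_ENABLED = os.getenv("VOICE_ENABLED", "false").lower() == "true"  # kill switch
VOICE_SERVICE_URL = os.getenv(
    "VOICE_SERVICE_URL", "http://voice-service.railway.internal:8300")
VOICE_STT_MODEL = os.getenv("VOICE_STT_MODEL", "Systran/faster-whisper-small")
VOICE_TTS_VOICE = os.getenv("VOICE_TTS_VOICE", "af_heart")  # default voice id
VOICE_MAX_UPLOAD_BYTES = 25 * 1024 * 1024   # reject files > 25MB (Section 6)
VOICE_MAX_DURATION_SECONDS = 120            # reject clips > 120s (Section 6)
VOICE_AUDIO_TTL_DAYS = 30                   # S3 retention default (Section 2.4)
VOICE_REQUEST_TIMEOUT_SECONDS = 30
```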

2.3 Frontend Voice UI

Chat input changes:

Components to build:

| Component | Purpose |
|---|---|
| VoiceMicButton | Toggle recording, visual feedback (pulsing, waveform) |
| VoiceRecorder | MediaRecorder API wrapper, audio blob capture, format conversion |
| VoicePlayer | Audio playback for TTS responses, with play/pause/speed controls |
| VoiceMessage | Chat bubble variant showing waveform + transcript + play button |
| VoiceSettings | User preferences: auto-play responses, voice selection, speed |

Audio format:

| Direction | Format | Reason |
|---|---|---|
| User → STT | WebM/Opus (browser native) or WAV | MediaRecorder default; Whisper handles both |
| TTS → User | MP3 (default) or OGG/Opus | Smallest size for playback; configurable |

Key UX decisions:

  • Mic button replaces send button while recording (no simultaneous text + voice)

  • Transcript is always shown alongside audio (accessibility + searchability)

  • Auto-play TTS response is opt-in (off by default — respects quiet environments)

  • Voice messages are stored as regular messages with type: "voice" and audio URL

  • Long-press mic for hands-free recording (release to send) — tap for toggle mode

2.4 Audio Storage

Voice audio files are stored in S3, not in the database.

Retention: Same as workspace document retention policy. Voice audio is ephemeral by default — 30-day TTL unless the user pins the message.

Database schema addition:
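
A sketch of the migration, assuming the voice_metadata column on messages named in the Phase 1 task list (the JSON payload shape is illustrative):

```sql
-- Illustrative Alembic-generated DDL; actual migration lives in automatos-ai.
ALTER TABLE messages ADD COLUMN voice_metadata JSONB;
-- e.g. {"audio_key": "workspaces/<wid>/voice/<msg_id>.webm",
--       "duration_s": 6.2, "transcript_source": "stt"}
```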


3. Phase 2: High-Quality Voices

Goal: Upgrade TTS quality to near-human level. Give Auto a distinctive, consistent voice. Support voice cloning and multilingual output.

3.1 TTS Engine Upgrade

Phase 1 ships with Kokoro (82M params, Apache 2.0) — good quality, tiny footprint. Phase 2 upgrades to a top-tier engine.

Candidates (ranked by fit):

| Engine | Stars | Quality | Voice Clone | License | Streaming | Docker | Decision |
|---|---|---|---|---|---|---|---|
| Chatterbox (Resemble AI) | 23k | Beats ElevenLabs in blind tests | Yes, zero-shot | MIT | Turbo variant | Community server | Primary choice |
| CosyVoice 2 (Alibaba) | 20k | Top tier, 150ms streaming | Yes, multilingual | Apache 2.0 | Yes, native | Yes (FastAPI+gRPC+Docker) | Secondary choice |
| Orpheus (Canopy AI) | 6k | Near top, LLM-native prosody | Yes | Apache 2.0 | 25-50ms latency | Community FastAPI | Alternative |
| Fish Speech | 25k | #1 TTS-Arena | Yes | CC-BY-NC (non-commercial) | Yes | Yes | Excluded — non-commercial license |
| F5-TTS | 14k | Excellent | Yes | CC-BY-NC (non-commercial) | Chunked | Yes | Excluded — non-commercial license |

Recommended path:

  1. Start with Chatterbox — MIT license, beats ElevenLabs, emotion control, 23 languages

  2. Keep CosyVoice 2 as fallback — best production deployment story (Docker + gRPC + TensorRT)

  3. Both fit behind the same OpenAI-compatible API contract from Phase 1

Voice provider abstraction:
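
A minimal sketch of the abstraction, assuming the TTSProvider ABC named in the Phase 2 task list; method and factory names are illustrative, not the shipped API:

```python
# Sketch: engines stay swappable behind one interface (names illustrative).
from abc import ABC, abstractmethod


class TTSProvider(ABC):
    @abstractmethod
    def synthesize(self, text: str, voice: str) -> bytes:
        """Return raw audio bytes for the given text and voice id."""


class SelfHostedTTSProvider(TTSProvider):
    """Talks to voice-service (Kokoro in Phase 1, Chatterbox in Phase 2)."""

    def __init__(self, base_url: str):
        self.base_url = base_url.rstrip("/")

    def synthesize(self, text: str, voice: str) -> bytes:
        raise NotImplementedError("HTTP call to voice-service goes here")


def get_tts_provider(name: str, **kwargs) -> TTSProvider:
    # Registry keeps the engine choice a config value, not a code change.
    providers = {"self_hosted": SelfHostedTTSProvider}
    return providers[name](**kwargs)
```

Swapping Kokoro for Chatterbox (or OpenAI's hosted API) then means registering another subclass, with no changes to callers.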

3.2 Auto's Voice Identity

Auto (the platform's autonomous assistant) gets a distinctive voice configured at the platform level.

Voice profile storage:

3.3 Per-Agent Voice Assignment

Agents can have distinct voices. A customer-facing agent might sound different from a technical agent.

When TTS is invoked, the system checks:

  1. Agent has a voice_profile_id → use that voice

  2. No agent voice → use workspace default voice

  3. No workspace default → use platform default (AUTO_VOICE_ID)
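
The three-step fallback above is simple enough to pin down as a pure function (field names illustrative):

```python
# Voice resolution order: agent -> workspace default -> platform default.
PLATFORM_DEFAULT_VOICE = "auto-default"  # stands in for AUTO_VOICE_ID


def resolve_voice(agent_voice_id=None, workspace_default_id=None) -> str:
    if agent_voice_id:
        return agent_voice_id
    if workspace_default_id:
        return workspace_default_id
    return PLATFORM_DEFAULT_VOICE
```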

3.4 Voice Cloning Workflow

For workspaces that want custom agent voices:

  1. User uploads 10-30 seconds of reference audio via Settings → Voices

  2. Backend stores reference audio in S3: s3://automatos-ai/workspaces/{wid}/voices/{profile_id}/reference.wav

  3. Voice profile created with reference_audio pointing to S3 key

  4. TTS provider uses reference audio for zero-shot cloning at synthesis time

  5. User can preview and assign to agents

Security considerations:

  • Voice cloning is workspace-admin only

  • Reference audio scanned for minimum quality (duration, sample rate, SNR)

  • Consent acknowledgement required before enabling cloning

  • Rate-limited to prevent abuse


4. Phase 3: Live Conversation

Goal: Real-time bidirectional voice. Talk to Auto like a phone call. Interruptions work. Streaming in both directions.

4.1 Framework Selection

| Framework | Stars | Protocol | Self-Hosted | Provider Coverage | Fit |
|---|---|---|---|---|---|
| Pipecat (Daily.co) | 5k+ | WebSocket via Daily/LiveKit | Yes | 15+ STT, 20+ TTS providers | Best fit — Python, modular, vendor-agnostic |
| LiveKit Agents | 22k + 6k | WebRTC | Yes (Go server) | Plugin-based | Strong infra but heavier to deploy |
| EchoKit | New | WebSocket | Yes (Rust) | Config-driven | Lightweight but immature |
| TEN Framework | 5k+ | WebSocket | Yes | Graph-based | Over-complex for our needs |

Recommended: Pipecat

  • Python-native (fits our stack)

  • Pluggable STT/TTS providers (use our Phase 1/2 engines)

  • Handles VAD, interruption, turn-taking, streaming out of the box

  • BSD 2-Clause license

  • NVIDIA partnership, active development

  • Can run with or without Daily.co's WebRTC infra (local WebSocket transport available)

4.2 Voice Pipeline Architecture

4.3 Real-Time Features

| Feature | Implementation |
|---|---|
| Voice Activity Detection (VAD) | Silero VAD (included in Pipecat). Detects when user starts/stops speaking. No manual push-to-talk needed. |
| Interruption handling | When user speaks while TTS is playing, immediately stop TTS, discard buffered audio, process new user input. |
| Turn-taking | VAD silence threshold (configurable, default 800ms) determines when user is "done speaking." |
| Streaming STT | Partial transcripts stream to LLM as user speaks. LLM can begin generating before user finishes. |
| Streaming TTS | LLM response tokens stream to TTS. First audio chunk plays before full response is generated. |
| Backpressure | If TTS generation is slower than playback, buffer. If faster, queue. Never drop audio. |
| Graceful degradation | If voice service is down, fall back to text-only with a "voice unavailable" indicator. |

4.4 Frontend Real-Time UI

Components:

| Component | Purpose |
|---|---|
| VoiceCallPanel | Full-screen or docked panel for live voice mode |
| LiveTranscript | Real-time transcript display as user speaks |
| VoiceActivityIndicator | Visual feedback for who is "speaking" |
| InterruptionHandler | Detects user speech during playback, cancels TTS |

4.5 Railway Deployment (Phase 3)

The voice pipeline runs as a separate Railway service:

| Attribute | Value |
|---|---|
| Service name | voice-pipeline |
| Image | Custom Dockerfile (Pipecat + providers) |
| Port | 8301 (WebSocket) |
| Internal DNS | voice-pipeline.railway.internal:8301 |
| Public domain | Yes — voice.automatos.app (WebSocket needs public access for browser connections) |
| Resources | 2 vCPU / 4GB RAM minimum |

Note: The voice pipeline service is in addition to the voice service from Phase 1. Phase 1's voice service provides the STT/TTS engines. Phase 3's voice pipeline orchestrates real-time streaming between them.


5. Observability & Monitoring Integration (PRD-73)

Voice adds new failure modes and latency-sensitive paths. Full integration with PRD-73's monitoring stack is required.

5.1 Prometheus Metrics

The voice service and voice pipeline expose /metrics endpoints scraped by Prometheus.

Voice Service metrics:
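
The metric names below reuse those queried by the Grafana dashboard in Section 5.4; the label sets are assumptions to be finalized during implementation:

```
voice_stt_requests_total{model,status}       # counter
voice_stt_errors_total{model,error_type}     # counter
voice_stt_duration_seconds{model}            # histogram
voice_stt_audio_duration_seconds{model}      # histogram (input audio length)
voice_tts_requests_total{voice,status}       # counter
voice_tts_errors_total{voice,error_type}     # counter
voice_tts_duration_seconds{voice}            # histogram
voice_inference_queue_depth                  # gauge
voice_model_loaded{model}                    # gauge (1 = loaded)
```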

Voice Pipeline metrics (Phase 3):
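
Again, the first two names come from the Section 5.4 dashboard; the others are plausible additions, not committed names:

```
voice_sessions_active                        # gauge
voice_end_to_end_latency_seconds             # histogram
voice_session_duration_seconds               # histogram (assumed)
voice_interruptions_total                    # counter (assumed)
```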

5.2 Prometheus Scrape Config Addition
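
A sketch of the scrape job to add to PRD-73's Prometheus config; job name and labels are illustrative:

```yaml
# Addition to prometheus.yml in automatos-monitoring (sketch).
scrape_configs:
  - job_name: voice-service
    metrics_path: /metrics
    static_configs:
      - targets: ["voice-service.railway.internal:8300"]
        labels:
          service: voice-service
```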

5.3 Alert Rules
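
Example rules covering the three alerts named in the Phase 1 task list (VoiceServiceDown, latency, error rate). Thresholds are starting points drawn from the Section 10 targets, not final values:

```yaml
# Sketch of voice alert rules for the Prometheus rules directory.
groups:
  - name: voice
    rules:
      - alert: VoiceServiceDown
        expr: up{job="voice-service"} == 0
        for: 2m
        labels: {severity: critical}
      - alert: VoiceSTTLatencyHigh
        expr: histogram_quantile(0.95, rate(voice_stt_duration_seconds_bucket[5m])) > 2
        for: 10m
        labels: {severity: warning}
      - alert: VoiceErrorRateHigh
        expr: rate(voice_stt_errors_total[5m]) + rate(voice_tts_errors_total[5m]) > 0.1
        for: 5m
        labels: {severity: warning}
```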

5.4 Grafana Dashboard: Voice Performance

A new dashboard added to PRD-73's Grafana provisioning:

| Panel | Visualization | Data Source |
|---|---|---|
| Voice Service Health | Status indicator (up/down) | Prometheus |
| STT Requests/min | Time series | rate(voice_stt_requests_total[1m]) |
| TTS Requests/min | Time series | rate(voice_tts_requests_total[1m]) |
| STT Latency (p50/p95/p99) | Time series | histogram_quantile(voice_stt_duration_seconds) |
| TTS Latency (p50/p95/p99) | Time series | histogram_quantile(voice_tts_duration_seconds) |
| End-to-End Latency (Phase 3) | Time series | voice_end_to_end_latency_seconds |
| Error Rate | Time series | rate(voice_*_errors_total[5m]) |
| Inference Queue Depth | Gauge | voice_inference_queue_depth |
| Active Voice Sessions (Phase 3) | Gauge | voice_sessions_active |
| Model Load Status | Table | voice_model_loaded |
| Audio Processed (minutes) | Stat panel | sum(rate(voice_stt_audio_duration_seconds_sum[1h])) |
| Voice Logs | Log panel | Loki: {service="voice-service"} |

5.5 Loki Log Labels

Voice service logs use structured JSON with these labels:
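
An illustrative log line; the service, component, and workspace_id labels come from the Phase 1 task list, while the remaining field names are assumptions:

```json
{
  "ts": "2026-03-09T12:00:00Z",
  "level": "info",
  "service": "voice-service",
  "component": "stt",
  "workspace_id": "ws_123",
  "request_id": "req_abc",
  "duration_ms": 640,
  "msg": "transcription complete"
}
```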

5.6 Backend Logging

All voice operations logged through the backend's existing structured logging with voice-specific fields:


6. Security Considerations

| Concern | Mitigation |
|---|---|
| Audio contains sensitive data | Audio files in S3 with workspace-scoped keys. Same access controls as documents. 30-day TTL default. |
| Voice cloning abuse | Cloning restricted to workspace admins. Consent acknowledgement required. Rate-limited. |
| Audio injection attacks | Validate audio format, duration, and size before processing. Reject files >25MB or >120s. |
| Voice service access | Internal only (no public domain). Backend proxies all requests. No direct frontend-to-voice-service communication in Phase 1/2. |
| WebSocket auth (Phase 3) | Voice pipeline WebSocket requires JWT token in connection handshake. Same auth as chat WebSocket. |
| Model supply chain | Pin model versions. Verify HuggingFace checksums on download. Don't auto-update models in production. |
| No hardcoded credentials | All config via config.py → Railway env vars. Voice service API keys (if any) as Railway secrets. |
| Denial of service | Rate limit voice endpoints per workspace (e.g., 60 STT requests/min, 60 TTS requests/min). Queue with backpressure. |


7. Mobile Considerations

Voice is especially valuable on mobile where typing is slower.

| Aspect | Implementation |
|---|---|
| Mic permissions | Request on first mic button tap. Show clear permission rationale. |
| Background audio | TTS playback continues when app is backgrounded (mobile web audio API). |
| Bandwidth | Compress audio before upload (Opus codec, ~32kbps). Keep TTS responses in MP3/Opus. |
| Offline indicator | If voice service unreachable, mic button shows "unavailable" state. Text input remains functional. |
| Touch UX | Long-press mic for push-to-talk. Tap for toggle mode. Swipe to cancel recording. |
| Low-latency priority | Mobile users expect <2s response. Phase 1 target: STT <1s + TTS <1.5s for typical messages. |


8. API Reference

8.1 Voice Chat Endpoint
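
A hedged sketch of the request/response exchange; only the route itself is fixed by this PRD, and the payload fields are illustrative:

```
POST /api/chat/voice          (multipart/form-data: audio blob + chat context)
  -> 200 application/json
     {
       "transcript": "...",
       "response_text": "...",
       "response_audio_url": "/api/chat/voice/audio/<message_id>",
       "voice_metadata": {"stt_ms": 640, "tts_ms": 980}
     }
```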

8.2 Audio Retrieval

8.3 Voice Profiles (Phase 2)

8.4 Voice WebSocket (Phase 3)


9. Implementation Phases

Phase 1: Voice Messages (4-6 weeks)

automatos-voice repo (new):

| Task | Priority | Effort | Notes |
|---|---|---|---|
| Create automatos-voice repo with README, .env.example | P0 | S | New GitHub repo under Automatos-AI-Platform org |
| Dockerfile for voice-service (Speaches-based or custom FastAPI) | P0 | M | Speaches image + custom config layer |
| /health endpoint | P0 | S | JSON health status for Prometheus |
| /metrics endpoint (Prometheus) | P1 | M | STT/TTS request counters, latency histograms, queue depth |
| Docker Compose for local dev | P0 | S | voice-service on port 8300 |
| Deploy voice-service as Railway service | P0 | M | Docker image, 5GB volume for models, env config |
| Railway service config (.toml) | P0 | S | voice-service.toml |
| Voice alert rules YAML (for automatos-monitoring) | P1 | S | VoiceServiceDown, latency, error rate |
| Local dev setup script (model download, dirs) | P2 | S | scripts/setup-local.sh |

automatos-ai repo (existing):

| Task | Priority | Effort | Notes |
|---|---|---|---|
| Add VOICE_* config constants to config.py | P0 | S | All config centralized per platform rules |
| Build VoiceServiceClient (STT + TTS HTTP client) | P0 | M | OpenAI-compatible, async, retry + timeout |
| Build POST /api/chat/voice endpoint | P0 | M | Audio upload → STT → router → TTS → response |
| Build GET /api/chat/voice/audio/{id} endpoint | P0 | S | S3 presigned URL or proxy |
| Add voice_metadata column to messages | P0 | S | Alembic migration |
| S3 audio storage (upload/retrieve/TTL) | P0 | M | Reuse existing S3 client |
| Input validation (format, size, duration limits) | P0 | S | Security boundary |
| Rate limiting on voice endpoints | P1 | S | Per-workspace throttle |
| Loki structured logging for voice operations | P1 | S | Labels: service, component, workspace_id |
| Feature flag: VOICE_ENABLED | P1 | S | Kill switch |
| Frontend: VoiceMicButton component | P0 | M | MediaRecorder, visual feedback |
| Frontend: VoiceRecorder (audio capture + encoding) | P0 | M | WebM/Opus capture, size validation |
| Frontend: VoicePlayer (audio playback) | P0 | M | Play/pause/speed controls |
| Frontend: VoiceMessage chat bubble variant | P0 | S | Waveform + transcript + play button |
| Frontend: VoiceSettings (auto-play, speed) | P2 | S | User preferences |
| Graceful degradation (voice unavailable indicator) | P2 | S | Mic button disabled state |

automatos-monitoring repo (PRD-73):

| Task | Priority | Effort | Notes |
|---|---|---|---|
| Add voice-service scrape target to Prometheus config | P1 | S | voice-service.railway.internal:8300 |
| Add voice alert rules to Prometheus rules directory | P1 | S | Copy from automatos-voice/monitoring/ |
| Build Grafana Voice Performance dashboard | P1 | M | New dashboard JSON |

Phase 2: High-Quality Voices (3-4 weeks)

automatos-voice repo:

| Task | Priority | Effort | Notes |
|---|---|---|---|
| Add Chatterbox TTS to voice-service (Dockerfile + config) | P0 | M | Second TTS engine alongside Kokoro |
| Chatterbox Docker image or custom wrapper | P0 | M | GPU if available, CPU fallback |
| Voice cloning inference endpoint (reference audio → synthesis) | P1 | M | Zero-shot cloning via Chatterbox |
| CosyVoice 2 integration (backup TTS engine) | P2 | M | Alternative if Chatterbox disappoints |

automatos-ai repo:

| Task | Priority | Effort | Notes |
|---|---|---|---|
| Build TTS provider abstraction (TTSProvider ABC) | P0 | M | Base class + self-hosted + OpenAI providers |
| Voice profiles DB table + migration | P0 | S | voice_profiles table |
| Voice profiles CRUD API | P0 | M | Admin endpoints |
| Per-agent voice assignment | P0 | S | voice_profile_id on agents table, migration |
| Auto's default voice configuration | P1 | S | Platform-level config |
| Voice cloning workflow (upload reference → create profile) | P1 | M | Admin only, with consent, S3 storage |
| Frontend: Voice selection in agent settings | P1 | M | Dropdown + preview playback |
| Frontend: Voice profiles management page | P1 | M | Settings → Voices |

Phase 3: Live Conversation (6-8 weeks)

automatos-voice repo:

| Task | Priority | Effort | Notes |
|---|---|---|---|
| Evaluate Pipecat vs LiveKit Agents (spike) | P0 | M | 1-week spike, build POC with each |
| Build voice-pipeline service (Pipecat) | P0 | L | New service in repo, Dockerfile |
| Build AutomatosOrchestratorProcessor (pipeline ↔ orchestrator bridge) | P0 | L | HTTP client to backend for chat routing |
| VAD integration (Silero) | P0 | M | Voice activity detection |
| Streaming STT integration | P0 | M | Partial transcripts via voice-service |
| Streaming TTS integration | P0 | M | Audio chunk streaming via voice-service |
| Interruption handling | P0 | M | Cancel TTS on user speech |
| Voice pipeline /metrics endpoint | P1 | M | Session, latency, interruption metrics |
| Deploy voice-pipeline as Railway service | P0 | L | Public domain (WebSocket needs browser access) |
| Turn-taking tuning (silence threshold, VAD sensitivity) | P2 | M | Iterative UX refinement |

automatos-ai repo:

| Task | Priority | Effort | Notes |
|---|---|---|---|
| Frontend: VoiceCallPanel | P0 | L | Full real-time voice UI |
| Frontend: LiveTranscript component | P0 | M | Real-time transcript display |
| Frontend: WebSocket audio streaming (useVoiceStream) | P0 | L | MediaStream → WS → pipeline |
| Frontend: InterruptionHandler | P1 | M | Client-side interrupt detection |
| Mobile-optimized voice UI | P1 | M | Touch UX, bandwidth optimization |
| Multi-language live conversation | P2 | M | Auto-detect language, switch TTS voice |

automatos-monitoring repo (PRD-73):

| Task | Priority | Effort | Notes |
|---|---|---|---|
| Add voice-pipeline scrape target to Prometheus | P1 | S | voice-pipeline.railway.internal:8301 |
| Add voice-pipeline alert rules | P1 | S | E2E latency, session failures |
| Update Grafana Voice dashboard with Phase 3 panels | P1 | M | Sessions, interruptions, streaming latency |


10. Success Criteria

Phase 1

| Metric | Target |
|---|---|
| STT accuracy (WER) | <10% on English conversational speech |
| STT latency (p95) | <2s for 10s audio clip |
| TTS latency (p95) | <1.5s for 100-word response |
| Voice service uptime | >99% when platform is running |
| Audio upload success rate | >98% |
| Voice messages visible in chat history with transcript | 100% |
| Text chat unaffected when voice service is down | 100% |

Phase 2

| Metric | Target |
|---|---|
| TTS naturalness (subjective) | Indistinguishable from human in casual listening |
| Voice cloning quality | Recognizable as the reference speaker |
| Provider swap time | <1 hour to switch TTS engine (config change + restart) |
| Per-agent voice differentiation | Configurable per agent, audibly distinct |

Phase 3

| Metric | Target |
|---|---|
| End-to-end latency (user done speaking → first audio response) | <2s p95 |
| Interruption response time | <500ms (TTS stops within 500ms of user speaking) |
| Conversation naturalness | Turn-taking feels natural, no awkward pauses >2s |
| Concurrent voice sessions | >=10 simultaneous |
| Session drop rate | <5% |


11. File Structure

11.1 automatos-voice (New Repo — Voice Service)

Owns all ML inference, model management, and real-time voice pipeline. Deployed as independent Railway service(s).

Key boundaries:

  • This repo has NO knowledge of the orchestrator, agents, workspaces, or database

  • It exposes only OpenAI-compatible HTTP endpoints (/v1/audio/transcriptions, /v1/audio/speech)

  • It exposes /health and /metrics for PRD-73 monitoring integration

  • Model weights are downloaded at first boot and cached in a Railway volume

11.2 automatos-ai (Existing Repo — Integration Wiring)

Owns the API endpoints, voice client, frontend components, and database schema. Talks to automatos-voice via HTTP over Railway private networking.

Key boundaries:

  • This repo has NO ML dependencies (no PyTorch, no model weights, no audio inference)

  • VoiceServiceClient makes HTTP calls to voice-service.railway.internal:8300 — same contract as OpenAI's API

  • Swapping to OpenAI's hosted voice API = change VOICE_SERVICE_URL in config, switch provider to openai.py

  • Provider abstraction lives here (not in automatos-voice) because the orchestrator decides which provider to use

11.3 Repo Boundary Contract

The ONLY interface between the two repos is the OpenAI-compatible REST API:
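
Concretely, the surface area is four routes (the two audio routes are named in Section 11.1; payload shapes follow the OpenAI audio API):

```
POST /v1/audio/transcriptions   # multipart audio in, {"text": "..."} out
POST /v1/audio/speech           # {"input", "voice", "response_format"} in, audio bytes out
GET  /health                    # liveness (Railway + Prometheus)
GET  /metrics                   # Prometheus exposition format
```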

No shared database. No shared message queue. No shared code. If automatos-voice goes down, text chat is unaffected.


12. Relationship to Existing PRDs & Repos

12.1 Cross-Repo Dependencies

| Repo | Role in PRD-74 | Depends On |
|---|---|---|
| automatos-voice (new) | Voice inference service (STT/TTS engines, Pipecat pipeline) | Nothing — standalone service |
| automatos-ai (existing) | Integration wiring (API endpoints, voice client, frontend, migrations) | automatos-voice running and reachable via Railway private network |
| automatos-monitoring (PRD-73) | Prometheus scrape targets, alert rules, Grafana dashboard for voice | automatos-voice exposing /metrics and /health |

12.2 PRD Relationships

| PRD | Relationship |
|---|---|
| PRD-01 (Core Orchestration) | Voice is a transport layer. Messages enter the same orchestration pipeline regardless of input modality. |
| PRD-50 (Universal Orchestrator Router) | Voice transcripts are routed through the Universal Router exactly like typed messages. No routing changes needed. |
| PRD-55 (Autonomous Assistant / Heartbeats) | Auto's heartbeat responses can be spoken aloud if the user is in voice mode. Voice profile determines Auto's voice. |
| PRD-57 (Mobile-First Responsive) | Voice is mobile's killer feature. Voice UI components must be mobile-optimized from Phase 1. |
| PRD-71 (Unified Skills Architecture) | Voice is not a skill — it's a transport. Skills don't need voice awareness. |
| PRD-72 (Activity Command Centre) | Voice events (sessions started, errors) surface in the activity feed. |
| PRD-73 (Observability & Monitoring) | Voice service scrape targets, alert rules, Grafana dashboard, and Loki log labels integrate into the monitoring stack. See Section 5. |


13. Open Questions

| # | Question | Impact | Decision Needed By |
|---|---|---|---|
| 1 | GPU or CPU for voice service on Railway? CPU is cheaper but slower. GPU reduces latency 3-5x. | Phase 1 latency targets | Before deployment |
| 2 | Should voice audio be transcribed and stored, or ephemeral? Current design stores both. | Storage costs, privacy | Phase 1 |
| 3 | Pipecat vs LiveKit Agents — need a 1-week spike to validate. | Phase 3 architecture | Before Phase 3 starts |
| 4 | Should Auto's voice be consistent across all agents, or should each agent have a distinct voice by default? | UX identity | Phase 2 |
| 5 | Rate limits per workspace — what's reasonable? 60 STT/min and 60 TTS/min proposed. | Cost, abuse prevention | Phase 1 |
| 6 | Voice cloning consent — legal review needed for different jurisdictions? | Compliance | Phase 2 |


14. Glossary

| Term | Definition |
|---|---|
| STT | Speech-to-Text — converting audio to text transcript |
| TTS | Text-to-Speech — converting text to spoken audio |
| VAD | Voice Activity Detection — detecting when a user is speaking vs. silent |
| WebRTC | Web Real-Time Communication — browser standard for real-time audio/video |
| Whisper | OpenAI's open-source STT model family (tiny → large-v3) |
| Kokoro | 82M-parameter TTS model (Apache 2.0, StyleTTS2-based) |
| Chatterbox | Resemble AI's open-source TTS (MIT, beats ElevenLabs in blind tests) |
| CosyVoice 2 | Alibaba's open-source TTS (Apache 2.0, 150ms streaming, 9 languages) |
| Pipecat | Daily.co's open-source Python framework for real-time voice AI pipelines |
| Speaches | Open-source unified STT+TTS server (OpenAI-compatible API, Docker-native) |
| Opus | Audio codec optimized for speech — low bitrate, high quality |
| Zero-shot cloning | Cloning a voice from a short audio sample without fine-tuning |
