Voice Server

The voice server accepts WebSocket connections from edge nodes (Raspberry Pi devices with microphones and speakers), handles speech-to-text, routes queries through the agent, and streams synthesized audio responses back.

Architecture

Edge Node (Pi + ReSpeaker)
    │  WebSocket (ws:// or wss://)
VoiceServer (missy/channels/voice/server.py)
    ├── DeviceRegistry    — node auth + pairing
    ├── PairingManager    — first-contact registration
    ├── PresenceStore     — room occupancy + sensor data
    ├── STTEngine         — faster-whisper transcription
    ├── TTSEngine         — Piper synthesis
    └── AgentRuntime      — same agent as CLI/Discord
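
The diagram can be read as one request path per utterance: audio in, transcript, agent reply, audio out. The sketch below illustrates that flow only. `handle_utterance` and the three callable signatures are hypothetical stand-ins, not the actual interfaces in missy/channels/voice/server.py.

```python
from typing import Callable, Iterable, Iterator

# Hypothetical signatures standing in for STTEngine, AgentRuntime, and TTSEngine.
Transcribe = Callable[[bytes], str]
Respond = Callable[[str], str]
Synthesize = Callable[[str], Iterable[bytes]]


def handle_utterance(
    audio: bytes,
    transcribe: Transcribe,
    respond: Respond,
    synthesize: Synthesize,
) -> Iterator[bytes]:
    """Run one utterance through STT -> agent -> TTS, yielding audio chunks."""
    text = transcribe(audio).strip()
    if not text:
        return  # nothing intelligible was said, so send nothing back
    reply = respond(text)
    # Yield chunks as they are produced so the caller can stream them to the node.
    yield from synthesize(reply)
```

Passing the engines in as callables keeps the sketch independent of any particular STT or TTS backend; a real server would also handle authentication and presence before reaching this step.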

Configuration

Add the voice: section to ~/.missy/config.yaml:

voice:
  host: "0.0.0.0"          # Listen on all interfaces
  port: 8765                # WebSocket port
  stt:
    engine: "faster-whisper"
    model: "base.en"        # Options: tiny.en, base.en, small.en, medium.en
  tts:
    engine: "piper"
    voice: "en_US-lessac-medium"

Binding to 0.0.0.0

Binding to 0.0.0.0 exposes the voice channel on all network interfaces. The server emits a voice.bind.warning audit event when this happens. For local-only use, bind to 127.0.0.1.
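
A deployment script could check the configured host before starting the server. This is a minimal sketch using only the standard library; `is_wildcard_bind` and `is_loopback_bind` are hypothetical helper names and are not part of Missy.

```python
import ipaddress


def is_wildcard_bind(host: str) -> bool:
    """Return True if host is an unspecified address such as 0.0.0.0 or ::."""
    try:
        return ipaddress.ip_address(host).is_unspecified
    except ValueError:
        return False  # a hostname such as "localhost" is not a wildcard address


def is_loopback_bind(host: str) -> bool:
    """Return True if host only accepts connections from the local machine."""
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return host == "localhost"  # assumption: localhost resolves to loopback
```

For example, `is_wildcard_bind("0.0.0.0")` is true and `is_loopback_bind("127.0.0.1")` is true, so a script could refuse to start on a wildcard address unless the operator opts in.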

STT engines

faster-whisper (default)

faster-whisper provides fast, accurate transcription. Install the voice extras:

pip install -e ".[voice]"

Model choice trades speed against accuracy:

Model      Parameters  Speed     Accuracy
tiny.en    39 M        Fastest   Good for simple commands
base.en    74 M        Fast      Good general purpose
small.en   244 M       Moderate  Better accuracy
medium.en  769 M       Slow      Best accuracy
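
For reference, this is roughly how one of these models is loaded and run with the faster-whisper library directly. It is a sketch, not Missy's STTEngine: `validate_model` and `transcribe_file` are hypothetical helpers, the allowed list simply mirrors the four options above, and the CPU/int8 settings are an assumed choice for a machine without a GPU.

```python
SUPPORTED_MODELS = ("tiny.en", "base.en", "small.en", "medium.en")


def validate_model(name: str) -> str:
    """Reject model names outside the four options documented above."""
    if name not in SUPPORTED_MODELS:
        raise ValueError(f"unsupported model {name!r}; choose from {SUPPORTED_MODELS}")
    return name


def transcribe_file(path: str, model_name: str = "base.en") -> str:
    """Transcribe an audio file and return the text as a single string."""
    # Imported lazily so the helper above works without the voice extras installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(validate_model(model_name), device="cpu", compute_type="int8")
    segments, _info = model.transcribe(path)
    # segments is a generator; transcription happens as it is consumed.
    return " ".join(segment.text.strip() for segment in segments)
```

The first call with a given model name downloads the model, so expect a delay on a fresh install.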

TTS engines

Piper

Piper is a fast, local TTS engine. It runs as a separate binary (not a pip package).

Install from the Piper releases page:

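# Download and extract Piper (the archive unpacks into a piper/ directory)
wget https://github.com/rhasspy/piper/releases/download/2023.11.14-2/piper_linux_x86_64.tar.gz
tar xzf piper_linux_x86_64.tar.gz
# Keep the binary next to its bundled libraries, then expose it on PATH
sudo mv piper /opt/piper
sudo ln -s /opt/piper/piper /usr/local/bin/piper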

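# Download a voice model and its config (Piper expects the .onnx.json next to the .onnx)
mkdir -p ~/.local/share/piper
wget -O ~/.local/share/piper/en_US-lessac-medium.onnx \
  https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/lessac/medium/en_US-lessac-medium.onnx
wget -O ~/.local/share/piper/en_US-lessac-medium.onnx.json \
  https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/lessac/medium/en_US-lessac-medium.onnx.json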
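
To confirm the install independently of Missy, Piper can be driven from a short script. The sketch below uses Piper's documented command-line form (text on stdin, `--model` and `--output_file` flags); `piper_command` and `synthesize` are hypothetical helper names, and the voice directory is assumed to be the one created above.

```python
import subprocess
from pathlib import Path

# Assumption: voices live where the download step above put them.
VOICE_DIR = Path.home() / ".local" / "share" / "piper"


def piper_command(voice: str, out_path: str, binary: str = "piper") -> list[str]:
    """Build the Piper command line for one voice and one output WAV file."""
    model = VOICE_DIR / f"{voice}.onnx"
    return [binary, "--model", str(model), "--output_file", out_path]


def synthesize(text: str, voice: str, out_path: str) -> None:
    """Pipe text to Piper on stdin; raises CalledProcessError if Piper fails."""
    subprocess.run(
        piper_command(voice, out_path),
        input=text.encode("utf-8"),
        check=True,
    )
```

Calling `synthesize("Hello from Missy", "en_US-lessac-medium", "hello.wav")` should leave a playable WAV file in the current directory if the binary and voice are in place.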

Server limits

The voice server enforces several safety limits:

Limit                       Value               Purpose
Max audio buffer            10 MB               Prevents memory exhaustion per connection
Max WebSocket frame         1 MB                Limits individual frame size
Max concurrent connections  50                  Connection flood protection
Auth timeout                10 seconds          Closes unauthenticated connections
Sample rate range           8,000 to 48,000 Hz  Rejects out-of-range values
Audio channels              1 or 2              Mono or stereo only
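
The audio-related limits can be pictured as simple checks applied to each connection. This is an illustrative sketch, not the server's code: the constant and class names are hypothetical, the range is assumed to be inclusive, and 1 MB is assumed to mean 1024 * 1024 bytes.

```python
MAX_AUDIO_BUFFER = 10 * 1024 * 1024  # assumption: 10 MB counted in binary units
MAX_FRAME = 1 * 1024 * 1024          # assumption: 1 MB counted in binary units
MIN_SAMPLE_RATE, MAX_SAMPLE_RATE = 8_000, 48_000
ALLOWED_CHANNELS = (1, 2)


def check_audio_params(sample_rate: int, channels: int) -> None:
    """Raise ValueError if the declared audio format is outside the limits."""
    if not MIN_SAMPLE_RATE <= sample_rate <= MAX_SAMPLE_RATE:
        raise ValueError(f"sample rate {sample_rate} Hz out of range")
    if channels not in ALLOWED_CHANNELS:
        raise ValueError(f"{channels} channels not supported (mono or stereo only)")


class AudioBuffer:
    """Per-connection buffer that refuses oversized frames and overflow."""

    def __init__(self) -> None:
        self._chunks: list[bytes] = []
        self._size = 0

    def append(self, frame: bytes) -> None:
        if len(frame) > MAX_FRAME:
            raise ValueError("frame exceeds the 1 MB limit")
        if self._size + len(frame) > MAX_AUDIO_BUFFER:
            raise ValueError("audio buffer would exceed the 10 MB limit")
        self._chunks.append(frame)
        self._size += len(frame)

    def getvalue(self) -> bytes:
        return b"".join(self._chunks)
```

Checking the running total before appending means a connection can never hold more than the cap, regardless of how the audio is split across frames.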

Starting the server

The voice server starts as part of the gateway:

missy gateway start --host 0.0.0.0 --port 8765

Check status:

missy gateway status
missy voice status
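
If the status commands are not available on the machine you are testing from, a plain TCP check at least shows whether something is listening on the port. The helper below is a hypothetical sketch; it does not perform a WebSocket handshake or authenticate, so a successful result only means the port accepts connections.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False  # refused, unreachable, or timed out
```

For example, `port_is_open("127.0.0.1", 8765)` run on the server host checks the port used in the configuration above.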

Testing

Test TTS synthesis for a specific edge node:

missy voice test NODE_ID --text "Hello from Missy"

Audit events

The voice server emits structured audit events for all operations:

Event                            Meaning
voice.bind.warning               Server bound to 0.0.0.0
voice.connection.auth_ok         Node authenticated successfully
voice.connection.auth_fail       Authentication failed
voice.connection.rejected_muted  Muted node tried to connect
voice.connection.closed          Node disconnected
voice.audio.received             Audio buffer received from node
voice.pair_request               New device requested pairing
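
Event names like these lend themselves to simple tallies, for example to spot repeated authentication failures. The sketch below assumes the audit log is available as JSON Lines with the event name under an "event" key; that storage format and the function name `count_voice_events` are assumptions, not something this page specifies.

```python
import json
from collections import Counter
from typing import Iterable


def count_voice_events(lines: Iterable[str]) -> Counter:
    """Count voice.* events in an iterable of JSON Lines records."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        # Assumed schema: each record is an object with an "event" string.
        name = record.get("event", "") if isinstance(record, dict) else ""
        if isinstance(name, str) and name.startswith("voice."):
            counts[name] += 1
    return counts
```

Because the function takes any iterable of strings, it can be fed an open file object directly.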