WebSocket Protocol¶

The voice channel uses a stateful WebSocket protocol. Messages are JSON text frames except for audio data, which is sent as binary frames. The first message from any client must be auth or pair_request.

Connection lifecycle¶

Client                                Server
  │                                      │
  ├── auth ─────────────────────────────►│
  │◄─────────────────────── auth_ok ─────┤
  │                                      │
  ├── audio_start ──────────────────────►│
  ├── [binary PCM chunks] ──────────────►│
  ├── audio_end ────────────────────────►│
  │                                      │  STT → Agent → TTS
  │◄──────────────────── response_text ──┤
  │◄──────────────────── audio_start ────┤
  │◄──────────────────── [binary WAV] ───┤
  │◄──────────────────── audio_end ──────┤
  │                                      │
  ├── heartbeat ────────────────────────►│
  │                                      │

Client-to-server messages¶

auth¶

Must be the first message. Authenticates using a node ID and token.

{
  "type": "auth",
  "node_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "token": "URL_SAFE_BASE64_TOKEN"
}

pair_request¶

Alternative first message for unregistered devices. The server assigns a node_id and closes the connection after responding with pair_pending.

{
  "type": "pair_request",
  "friendly_name": "Living Room Pi",
  "room": "Living Room",
  "hardware_profile": {
    "mic_type": "respeaker",
    "channels": 4,
    "speaker": true
  }
}

Server assigns the node_id

Do not include node_id in the pair request. The server generates it and returns it in the pair_pending response.

audio_start¶

Begins an audio streaming session. Must be followed by binary frames and an audio_end.

{
  "type": "audio_start",
  "sample_rate": 16000,
  "channels": 1,
  "format": "pcm_s16le"
}

Binary audio frames¶

Raw PCM audio bytes sent as WebSocket binary frames. Typically 80ms chunks (1,280 samples at 16 kHz). Binary frames outside an active audio session are ignored.

audio_end¶

Signals the end of the audio stream. Triggers STT transcription and agent processing.

{
  "type": "audio_end"
}

heartbeat¶

Periodic keepalive with optional sensor data.

{
  "type": "heartbeat",
  "node_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "occupancy": true,
  "noise_level": 0.03,
  "wake_word_fp": false
}

Field	Type	Description
`occupancy`	bool or null	Room occupied (from sensor data)
`noise_level`	float or null	Ambient noise RMS [0.0, 1.0]
`wake_word_fp`	bool	False positive wake word detection flag

Server-to-client messages¶

auth_ok¶

Sent after successful authentication.

{
  "type": "auth_ok",
  "node_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "room": "Living Room"
}

auth_fail¶

Sent when authentication fails. The connection is closed immediately after.

{
  "type": "auth_fail",
  "reason": "invalid credentials"
}

pair_pending¶

Sent after a successful pair request. The connection is closed after this message. The operator must approve the device via missy devices pair before the node can authenticate.

{
  "type": "pair_pending",
  "node_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890"
}

transcript¶

Sent when debug_transcripts is enabled on the server. Contains the STT result.

{
  "type": "transcript",
  "text": "What is the weather today?",
  "confidence": 0.95
}

response_text¶

The agent's text response. Always sent before any TTS audio.

{
  "type": "response_text",
  "text": "The current temperature is 72 degrees."
}

audio_start (server)¶

Begins a TTS audio stream back to the client.

{
  "type": "audio_start",
  "sample_rate": 22050,
  "format": "wav"
}

Binary WAV frames¶

WAV audio data sent as binary WebSocket frames in 4 KB chunks (configurable via audio_chunk_size).

audio_end (server)¶

Signals the end of TTS audio.

{
  "type": "audio_end"
}

error¶

Sent on processing failures. The connection may or may not be closed depending on severity.

{
  "type": "error",
  "message": "Speech recognition failed"
}

muted¶

Sent when a node with policy_mode: muted attempts to connect. The connection is closed immediately after.

{
  "type": "muted"
}

Protocol notes¶

Spec doc vs. actual implementation

The original spec document and the actual server implementation differ. Both missy-edge and the server use the implementation:

Spec doc	Actual server
`stream_start` / `stream_end`	`audio_start` / `audio_end`
`tts_audio` + `tts_end` (raw PCM)	`audio_start` + binary WAV + `audio_end`
Client sends `node_id` in pair_request	Server assigns `node_id`
Token via WebSocket `pair_ack`	Token shown in CLI `missy devices pair` output

Security¶

The first frame must be auth or pair_request. Any other message type causes immediate disconnection.
Binary frames before authentication are rejected.
Nodes with paired=False are rejected even if the token is valid.
Nodes with policy_mode=muted receive a muted frame and are disconnected.
Token verification uses PBKDF2-HMAC-SHA256 with constant-time comparison.