HTTP Chunked Streaming
The streaming endpoint (/api/v1/tts/speech/stream) uses HTTP chunked transfer encoding:
- Audio arrives incrementally, reducing time-to-first-audio
- Playback can start before generation completes
- Lower memory usage (no need to buffer the entire file)
- Better UX for real-time applications
Examples
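A minimal client sketch using only the standard library. The JSON body fields (text, voice_id, output_format) and Bearer authorization are assumptions not confirmed on this page; check the request schema in the API reference.

```python
import json
import urllib.request

API_URL = "https://dev.voice.ai/api/v1/tts/speech/stream"

def build_request(text, voice_id, api_key, output_format="mp3"):
    # Request-body field names here are assumptions -- check the API reference.
    body = json.dumps({"text": text, "voice_id": voice_id,
                       "output_format": output_format}).encode()
    return urllib.request.Request(
        API_URL, data=body, method="POST",
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"})

def stream_to_file(text, voice_id, api_key, path="speech.mp3"):
    req = build_request(text, voice_id, api_key)
    # urlopen reads the chunked response incrementally; each read() returns
    # as soon as data arrives, so playback could begin before generation ends.
    with urllib.request.urlopen(req, timeout=60) as resp, open(path, "wb") as out:
        while chunk := resp.read(4096):
            out.write(chunk)
```

Feeding each chunk to an audio player instead of a file gives the time-to-first-audio benefit described above.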
WebSocket Streaming
WebSocket streaming for real-time TTS. Two modes are available:

| Endpoint | Use Case | Connection Lifecycle |
|---|---|---|
| /stream | Single generation | Closes after audio completes |
| /multi-stream | Multiple generations, multi-speaker | Persistent until client closes |
Both endpoints authenticate via the Authorization header. See the Authentication guide for details.
Protocol:
- First message is an init message (sets voice, model, language)
- Text can be buffered over multiple messages before flush
- All messages are JSON text (binary input is rejected)
- Server responds with JSON messages containing base64-encoded audio
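Under the protocol above, the client-to-server traffic for one generation might look like this sketch. The init field names (voice_id, model, language) and the flush field are assumptions based on the description above; verify them against the API reference.

```python
import json

# Messages a client sends over the WebSocket, in order. Field names are
# assumptions -- the protocol only specifies that init sets voice, model,
# and language, and that text can be buffered before a flush.
init_msg  = {"voice_id": "my-voice", "model": "example-model", "language": "en"}
text_msgs = [{"text": "Hello, "}, {"text": "world."}]  # buffered server-side
flush_msg = {"flush": True}                            # triggers generation

# Everything is sent as JSON text frames -- binary frames are rejected.
frames = [json.dumps(m) for m in [init_msg, *text_msgs, flush_msg]]
```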
Single-Context WebSocket
Endpoint: wss://dev.voice.ai/api/v1/tts/stream
Single generation per WebSocket connection. Text is buffered until flush, audio streams back, then the server closes the connection (code 1000). For multiple generations, use Multi-Context WebSocket instead.
- Input: JSON text messages only
- Output: JSON messages with base64-encoded audio
Server messages:
- {"audio": "<base64-encoded-audio>"} - Audio chunks as base64 strings (streamed immediately)
- {"is_last": true} - Sent after all audio chunks are sent. Indicates generation is complete. Server closes connection (code 1000) immediately after this message.
Monitor the close event to detect errors. Close codes:
- 1000 - Normal closure (generation complete)
- 1003 - Invalid message type (binary not supported, or expected text message)
- 1007 - Invalid data (malformed JSON, validation errors)
- 1008 - Policy violation (authentication failed, text too long, insufficient credits, invalid parameters)
- 1011 - Internal server error (TTS generation failed, session preparation failed)
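The server messages above can be handled with a small dispatcher. This sketch works with any WebSocket client library that delivers text frames as strings:

```python
import base64
import json

def handle_server_message(raw, on_audio):
    """Dispatch one JSON text frame from /stream.

    Calls on_audio with the decoded bytes of each audio chunk and returns
    True once is_last arrives (the server then closes with code 1000).
    """
    msg = json.loads(raw)
    if "audio" in msg:
        on_audio(base64.b64decode(msg["audio"]))  # play or buffer immediately
    return bool(msg.get("is_last"))
```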
Multi-Context WebSocket
Endpoint: wss://dev.voice.ai/api/v1/tts/multi-stream
Multiple concurrent TTS streams over a single WebSocket connection. Each context has its own voice and settings.
- Input: JSON text messages only
- Output: JSON messages with base64-encoded audio and context_id
Server messages:
- {"audio": "<base64-encoded-audio>", "context_id": "ctx-1"} - Audio chunks with context ID (streamed immediately, no buffering delay)
- {"is_last": true, "context_id": "ctx-1"} - Completion message sent after all audio chunks for a flush (separate message, never included in audio messages). Note: is_last is sent after EACH flush completes, allowing the same context to be reused for multiple flushes.
- {"context_closed": true, "context_id": "ctx-1"} - Sent when a context is explicitly closed via close_context (separate from is_last)
- {"error": "message", "context_id": "ctx-1"} - Sent on error
Client messages:
- {"context_id": "ctx-1", "close_context": true} - Close a specific context. Server responds with context_closed to confirm.
- {"close_socket": true} - Close the entire WebSocket connection and all contexts. Can be included in any message or sent standalone.
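Because audio for several contexts can interleave on one connection, the client should demultiplex by context_id. A minimal sketch of that bookkeeping, using the message shapes documented above:

```python
import base64
import json
from collections import defaultdict

class MultiStreamDemux:
    """Route /multi-stream server messages to per-context audio buffers."""

    def __init__(self):
        self.audio = defaultdict(list)  # context_id -> list of audio byte chunks
        self.finished = set()           # contexts whose latest flush completed
        self.closed = set()             # contexts confirmed closed by the server

    def handle(self, raw):
        msg = json.loads(raw)
        ctx = msg.get("context_id")
        if "error" in msg:
            raise RuntimeError(f"{ctx}: {msg['error']}")
        if "audio" in msg:
            self.audio[ctx].append(base64.b64decode(msg["audio"]))
        if msg.get("is_last"):          # per-flush completion; context stays usable
            self.finished.add(ctx)
        if msg.get("context_closed"):   # confirmation of an explicit close_context
            self.closed.add(ctx)
```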
When to Use
Use HTTP Chunked Streaming for:
- Simple request/response patterns
- One-off audio generation
- Stateless operations

Use Single-Context WebSocket (/stream) for:
- Single audio generation with the WebSocket protocol
- Text buffering before generation (send text incrementally, flush once)
- When you need the WebSocket protocol but only one generation per connection
Use Multi-Context WebSocket (/multi-stream) for:
- Conversational AI applications with multiple turns
- Multiple sequential audio generations over a persistent connection
- Multiple concurrent voices in the same application
- Conversation simulations with multiple speakers
- When voice settings should persist across multiple requests
- Applications requiring voice switching

Use Non-Streaming Generation for:
- Batch processing
- Simpler code requirements
- Small text inputs
- When you don’t need real-time audio
Best Practices
HTTP Chunked Streaming
- Handle network errors gracefully
- Start playing audio chunks as soon as they arrive
- Implement timeout handling for long streams
- Prefer MP3 for efficiency; PCM for highest quality
WebSocket Streaming
- Always send an init message first (with voice, model, language)
- Handle is_last as a separate message to know when audio is complete
- The same context can be reused for multiple flushes; each flush generates its own is_last
- Handle the context_closed message for context closure confirmation
- When you send close_context, the server responds with context_closed to confirm the context is closed
- Decode base64 audio chunks properly
- Handle errors gracefully
- Single-context (/stream): Connection closes after is_last. Reconnect for each generation.
- Multi-context (/multi-stream): Connection stays open. Track audio by context_id to avoid mixing streams.
- Closing contexts: Send {"context_id": "ctx-1", "close_context": true} (can be sent standalone, no text/flush required). Server responds with context_closed to confirm closure.
- Closing connection: Send {"close_socket": true} to close the entire WebSocket connection and all contexts.
General
- Use appropriate audio format (MP3 for efficiency, WAV/PCM for quality)
- Basic formats (mp3, wav, pcm) are output at a 32kHz sample rate; use the format-specific variants below for other rates
- Implement reconnection logic for WebSocket connections
- Monitor connection health and handle disconnections
- Set appropriate timeouts for all streaming methods
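For the reconnection logic above, a common pattern is exponential backoff with jitter, so many clients do not reconnect in lockstep after an outage. A sketch of the delay schedule:

```python
import random

def backoff_delays(max_retries=5, base=0.5, cap=30.0):
    """Yield sleep intervals (seconds) for successive reconnect attempts.

    Doubles the delay each attempt, caps it, and applies random jitter.
    The parameter values here are illustrative, not API requirements.
    """
    for attempt in range(max_retries):
        yield min(cap, base * 2 ** attempt) * random.uniform(0.5, 1.0)
```

A client would sleep for each yielded delay between reconnect attempts and give up once the generator is exhausted.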
Audio Output Formats
The TTS API supports multiple audio formats with various sample rates and bitrates. Basic formats (mp3, wav, pcm) output at 32kHz sample rate. Format-specific options allow you to control sample rate and bitrate.
32kHz Formats
| Format | Description | Use Case |
|---|---|---|
mp3 | Compressed, smallest size | Web playback, bandwidth efficiency |
wav | Uncompressed with headers | Professional audio, editing |
pcm | Raw 16-bit signed little-endian | Real-time processing, custom decoders |
MP3 Formats (with sample rate and bitrate)
| Format | Sample Rate | Bitrate | Use Case |
|---|---|---|---|
mp3_22050_32 | 22.05kHz | 32kbps | Low bandwidth, voice-only |
mp3_24000_48 | 24kHz | 48kbps | Voice applications |
mp3_44100_32 | 44.1kHz | 32kbps | Music/voice, low bandwidth |
mp3_44100_64 | 44.1kHz | 64kbps | Music/voice, balanced |
mp3_44100_96 | 44.1kHz | 96kbps | Music/voice, good quality |
mp3_44100_128 | 44.1kHz | 128kbps | Music/voice, high quality |
mp3_44100_192 | 44.1kHz | 192kbps | Music/voice, highest quality |
Opus Formats (with sample rate and bitrate)
| Format | Sample Rate | Bitrate | Use Case |
|---|---|---|---|
opus_48000_32 | 48kHz | 32kbps | Low bandwidth, voice-only |
opus_48000_64 | 48kHz | 64kbps | Voice applications, balanced |
opus_48000_96 | 48kHz | 96kbps | Voice/music, good quality |
opus_48000_128 | 48kHz | 128kbps | Voice/music, high quality |
opus_48000_192 | 48kHz | 192kbps | Voice/music, highest quality |
PCM Formats (with sample rate)
| Format | Sample Rate | Use Case |
|---|---|---|
pcm_8000 | 8kHz | Telephony, low bandwidth |
pcm_16000 | 16kHz | Voice applications |
pcm_22050 | 22.05kHz | Voice/music, balanced |
pcm_24000 | 24kHz | Voice/music |
pcm_32000 | 32kHz | Voice/music, standard |
pcm_44100 | 44.1kHz | Music, CD quality |
pcm_48000 | 48kHz | Music, professional quality |
WAV Formats (with sample rate)
| Format | Sample Rate | Use Case |
|---|---|---|
wav_16000 | 16kHz | Voice applications |
wav_22050 | 22.05kHz | Voice/music, balanced |
wav_24000 | 24kHz | Voice/music |
Telephony Formats
| Format | Sample Rate | Use Case |
|---|---|---|
alaw_8000 | 8kHz | A-law telephony (G.711) |
ulaw_8000 | 8kHz | μ-law telephony (G.711) |
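Raw pcm_* output has no container, so players that expect WAV need a header. The 16-bit signed little-endian samples from the table above can be wrapped with the standard library; mono output is an assumption here, not stated on this page.

```python
import io
import wave

def pcm_to_wav(pcm_bytes, sample_rate):
    """Wrap raw PCM output (e.g. pcm_16000) in a WAV container.

    Assumes 16-bit signed little-endian samples (per the format table)
    and a single channel (assumption -- verify against actual output).
    """
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)        # mono (assumption)
        w.setsampwidth(2)        # 16-bit samples
        w.setframerate(sample_rate)
        w.writeframes(pcm_bytes)
    return buf.getvalue()
```

For example, pcm_to_wav(chunks, 16000) produces a file playable by standard audio tools from pcm_16000 output.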
Related Endpoints
- Generate Speech (Non-streaming) - Standard request/response TTS
- Generate Speech Stream - HTTP chunked streaming
- Single-Context WebSocket - WebSocket streaming for single voice
- Multi-Context WebSocket - WebSocket streaming for multiple concurrent voices