HTTP Chunked Streaming
The streaming endpoint (/api/v1/tts/speech/stream) uses HTTP chunked transfer encoding:
- Audio arrives incrementally, reducing time-to-first-audio
- Playback can start before generation completes
- Lower memory usage (no need to buffer the entire file)
- Better UX for real-time applications
Examples
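A minimal client sketch using only the standard library. The JSON body fields (text, voice_id, output_format) and Bearer authorization are assumptions not confirmed on this page; check the request schema in the API reference.

```python
import json
import urllib.request

API_URL = "https://dev.voice.ai/api/v1/tts/speech/stream"

def build_request(text, voice_id, api_key, output_format="mp3"):
    # Request-body field names here are assumptions -- check the API reference.
    body = json.dumps({"text": text, "voice_id": voice_id,
                       "output_format": output_format}).encode()
    return urllib.request.Request(
        API_URL, data=body, method="POST",
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"})

def stream_to_file(text, voice_id, api_key, path="speech.mp3"):
    req = build_request(text, voice_id, api_key)
    # urlopen reads the chunked response incrementally; each read() returns
    # as soon as data arrives, so playback could begin before generation ends.
    with urllib.request.urlopen(req, timeout=60) as resp, open(path, "wb") as out:
        while chunk := resp.read(4096):
            out.write(chunk)
```

Feeding each chunk to an audio player instead of a file gives the time-to-first-audio benefit described above.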
WebSocket Streaming
WebSocket streaming for real-time TTS. Two modes are available:

| Endpoint | Use Case | Connection Lifecycle |
|---|---|---|
| /stream | Single generation | Closes after audio completes |
| /multi-stream | Multiple generations, multi-speaker | Persistent until client closes |
Both endpoints authenticate via the Authorization header. See the Authentication guide for details.
Protocol:
- First message is an init message (sets voice, model, language)
- Text can be buffered over multiple messages before flush
- All messages are JSON text (binary input is rejected)
- Server responds with JSON messages containing base64-encoded audio
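Under the protocol above, the client-to-server traffic for one generation might look like this sketch. The init field names (voice_id, model, language) and the flush field are assumptions based on the description above; verify them against the API reference.

```python
import json

# Messages a client sends over the WebSocket, in order. Field names are
# assumptions -- the protocol only specifies that init sets voice, model,
# and language, and that text can be buffered before a flush.
init_msg  = {"voice_id": "my-voice", "model": "example-model", "language": "en"}
text_msgs = [{"text": "Hello, "}, {"text": "world."}]  # buffered server-side
flush_msg = {"flush": True}                            # triggers generation

# Everything is sent as JSON text frames -- binary frames are rejected.
frames = [json.dumps(m) for m in [init_msg, *text_msgs, flush_msg]]
```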
Single-Context WebSocket
Endpoint: wss://dev.voice.ai/api/v1/tts/stream
Single generation per WebSocket connection. Text is buffered until flush, audio streams back, then the server closes the connection (code 1000). For multiple generations, use Multi-Context WebSocket instead.
- Input: JSON text messages only
- Output: JSON messages with base64-encoded audio
Server messages:
- {"audio": "<base64-encoded-audio>"} - Audio chunks as base64 strings (streamed immediately)
- {"is_last": true} - Sent after all audio chunks are sent. Indicates generation is complete. Server closes connection (code 1000) immediately after this message.
Monitor the close event to detect errors. Close codes:
- 1000 - Normal closure (generation complete)
- 1003 - Invalid message type (binary not supported, or expected text message)
- 1007 - Invalid data (malformed JSON, validation errors)
- 1008 - Policy violation (authentication failed, text too long, insufficient credits, invalid parameters)
- 1011 - Internal server error (TTS generation failed, session preparation failed)
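The server messages above can be handled with a small dispatcher. This sketch works with any WebSocket client library that delivers text frames as strings:

```python
import base64
import json

def handle_server_message(raw, on_audio):
    """Dispatch one JSON text frame from /stream.

    Calls on_audio with the decoded bytes of each audio chunk and returns
    True once is_last arrives (the server then closes with code 1000).
    """
    msg = json.loads(raw)
    if "audio" in msg:
        on_audio(base64.b64decode(msg["audio"]))  # play or buffer immediately
    return bool(msg.get("is_last"))
```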
Multi-Context WebSocket
Endpoint: wss://dev.voice.ai/api/v1/tts/multi-stream
Multiple concurrent TTS streams over a single WebSocket connection. Each context has its own voice and settings.
- Input: JSON text messages only
- Output: JSON messages with base64-encoded audio and context_id
Server messages:
- {"audio": "<base64-encoded-audio>", "context_id": "ctx-1"} - Audio chunks with context ID (streamed immediately, no buffering delay)
- {"is_last": true, "context_id": "ctx-1"} - Completion message sent after all audio chunks for a flush (separate message, never included in audio messages). Note: is_last is sent after EACH flush completes, allowing the same context to be reused for multiple flushes.
- {"context_closed": true, "context_id": "ctx-1"} - Sent when a context is explicitly closed via close_context (separate from is_last)
- {"error": "message", "context_id": "ctx-1"} - Sent on error
Client messages:
- {"context_id": "ctx-1", "close_context": true} - Close a specific context. Server responds with context_closed to confirm.
- {"close_socket": true} - Close the entire WebSocket connection and all contexts. Can be included in any message or sent standalone.
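Because audio for several contexts can interleave on one connection, the client should demultiplex by context_id. A minimal sketch of that bookkeeping, using the message shapes documented above:

```python
import base64
import json
from collections import defaultdict

class MultiStreamDemux:
    """Route /multi-stream server messages to per-context audio buffers."""

    def __init__(self):
        self.audio = defaultdict(list)  # context_id -> list of audio byte chunks
        self.finished = set()           # contexts whose latest flush completed
        self.closed = set()             # contexts confirmed closed by the server

    def handle(self, raw):
        msg = json.loads(raw)
        ctx = msg.get("context_id")
        if "error" in msg:
            raise RuntimeError(f"{ctx}: {msg['error']}")
        if "audio" in msg:
            self.audio[ctx].append(base64.b64decode(msg["audio"]))
        if msg.get("is_last"):          # per-flush completion; context stays usable
            self.finished.add(ctx)
        if msg.get("context_closed"):   # confirmation of an explicit close_context
            self.closed.add(ctx)
```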
When to Use
Use HTTP Chunked Streaming for:
- Simple request/response patterns
- One-off audio generation
- Stateless operations

Use Single-Context WebSocket (/stream) for:
- Single audio generation with the WebSocket protocol
- Text buffering before generation (send text incrementally, flush once)
- When you need the WebSocket protocol but only one generation per connection
Use Multi-Context WebSocket (/multi-stream) for:
- Conversational AI applications with multiple turns
- Multiple sequential audio generations over a persistent connection
- Multiple concurrent voices in the same application
- Conversation simulations with multiple speakers
- When voice settings should persist across multiple requests
- Applications requiring voice switching

Use Non-Streaming Generation for:
- Batch processing
- Simpler code requirements
- Small text inputs
- When you don’t need real-time audio
Best Practices
HTTP Chunked Streaming
- Handle network errors gracefully
- Start playing audio chunks as soon as they arrive
- Implement timeout handling for long streams
- Prefer MP3 for efficiency; PCM for highest quality
WebSocket Streaming
- Always send an init message first (with voice, model, language)
- Handle is_last as a separate message to know when audio is complete
- The same context can be reused for multiple flushes; each flush generates its own is_last
- Handle the context_closed message for context closure confirmation
- When you send close_context, the server responds with context_closed to confirm the context is closed
- Decode base64 audio chunks properly
- Handle errors gracefully
- Single-context (/stream): Connection closes after is_last. Reconnect for each generation.
- Multi-context (/multi-stream): Connection stays open. Track audio by context_id to avoid mixing streams.
- Closing contexts: Send {"context_id": "ctx-1", "close_context": true} (can be sent standalone, no text/flush required). Server responds with context_closed to confirm closure.
- Closing connection: Send {"close_socket": true} to close the entire WebSocket connection and all contexts.
General
- Use appropriate audio format (MP3 for efficiency, WAV/PCM for quality)
- Basic formats (mp3, wav, pcm) are output at a 32kHz sample rate; use the format-specific variants below for other rates
- Implement reconnection logic for WebSocket connections
- Monitor connection health and handle disconnections
- Set appropriate timeouts for all streaming methods
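For the reconnection logic above, a common pattern is exponential backoff with jitter, so many clients do not reconnect in lockstep after an outage. A sketch of the delay schedule:

```python
import random

def backoff_delays(max_retries=5, base=0.5, cap=30.0):
    """Yield sleep intervals (seconds) for successive reconnect attempts.

    Doubles the delay each attempt, caps it, and applies random jitter.
    The parameter values here are illustrative, not API requirements.
    """
    for attempt in range(max_retries):
        yield min(cap, base * 2 ** attempt) * random.uniform(0.5, 1.0)
```

A client would sleep for each yielded delay between reconnect attempts and give up once the generator is exhausted.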
Audio Output Formats
The TTS API supports multiple audio formats with various sample rates and bitrates. Basic formats (mp3, wav, pcm) output at 32kHz sample rate. Format-specific options allow you to control sample rate and bitrate.
32kHz Formats
| Format | Description | Use Case |
|---|---|---|
mp3 | Compressed, smallest size | Web playback, bandwidth efficiency |
wav | Uncompressed with headers | Professional audio, editing |
pcm | Raw 16-bit signed little-endian | Real-time processing, custom decoders |
MP3 Formats (with sample rate and bitrate)
| Format | Sample Rate | Bitrate | Use Case |
|---|---|---|---|
mp3_22050_32 | 22.05kHz | 32kbps | Low bandwidth, voice-only |
mp3_24000_48 | 24kHz | 48kbps | Voice applications |
mp3_44100_32 | 44.1kHz | 32kbps | Music/voice, low bandwidth |
mp3_44100_64 | 44.1kHz | 64kbps | Music/voice, balanced |
mp3_44100_96 | 44.1kHz | 96kbps | Music/voice, good quality |
mp3_44100_128 | 44.1kHz | 128kbps | Music/voice, high quality |
mp3_44100_192 | 44.1kHz | 192kbps | Music/voice, highest quality |
Opus Formats (with sample rate and bitrate)
| Format | Sample Rate | Bitrate | Use Case |
|---|---|---|---|
opus_48000_32 | 48kHz | 32kbps | Low bandwidth, voice-only |
opus_48000_64 | 48kHz | 64kbps | Voice applications, balanced |
opus_48000_96 | 48kHz | 96kbps | Voice/music, good quality |
opus_48000_128 | 48kHz | 128kbps | Voice/music, high quality |
opus_48000_192 | 48kHz | 192kbps | Voice/music, highest quality |
PCM Formats (with sample rate)
| Format | Sample Rate | Use Case |
|---|---|---|
pcm_8000 | 8kHz | Telephony, low bandwidth |
pcm_16000 | 16kHz | Voice applications |
pcm_22050 | 22.05kHz | Voice/music, balanced |
pcm_24000 | 24kHz | Voice/music |
pcm_32000 | 32kHz | Voice/music, standard |
pcm_44100 | 44.1kHz | Music, CD quality |
pcm_48000 | 48kHz | Music, professional quality |
WAV Formats (with sample rate)
| Format | Sample Rate | Use Case |
|---|---|---|
wav_16000 | 16kHz | Voice applications |
wav_22050 | 22.05kHz | Voice/music, balanced |
wav_24000 | 24kHz | Voice/music |
Telephony Formats
| Format | Sample Rate | Use Case |
|---|---|---|
alaw_8000 | 8kHz | A-law telephony (G.711) |
ulaw_8000 | 8kHz | μ-law telephony (G.711) |
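Raw pcm_* output has no container, so players that expect WAV need a header. The 16-bit signed little-endian samples from the table above can be wrapped with the standard library; mono output is an assumption here, not stated on this page.

```python
import io
import wave

def pcm_to_wav(pcm_bytes, sample_rate):
    """Wrap raw PCM output (e.g. pcm_16000) in a WAV container.

    Assumes 16-bit signed little-endian samples (per the format table)
    and a single channel (assumption -- verify against actual output).
    """
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)        # mono (assumption)
        w.setsampwidth(2)        # 16-bit samples
        w.setframerate(sample_rate)
        w.writeframes(pcm_bytes)
    return buf.getvalue()
```

For example, pcm_to_wav(chunks, 16000) produces a file playable by standard audio tools from pcm_16000 output.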
Related Endpoints
- Generate Speech (Non-streaming) - Standard request/response TTS
- Generate Speech Stream - HTTP chunked streaming
- Single-Context WebSocket - WebSocket streaming for single voice
- Multi-Context WebSocket - WebSocket streaming for multiple concurrent voices