Stream audio in real-time using HTTP chunked transfer encoding or WebSocket connections. Both methods send audio chunks as they’re generated, reducing latency for conversational AI and real-time applications.

HTTP Chunked Streaming

The streaming endpoint (/api/v1/tts/speech/stream) uses HTTP chunked transfer encoding:
  • Audio arrives incrementally, reducing time-to-first-audio
  • Start playing audio before generation completes
  • Lower memory usage (no need to buffer entire file)
  • Better UX for real-time applications

Examples

import requests

# Using default voice (voice_id is optional)
response = requests.post(
    'https://dev.voice.ai/api/v1/tts/speech/stream',
    headers={'Authorization': 'Bearer YOUR_API_KEY', 'Content-Type': 'application/json'},
    json={
        'text': 'This is a test of streaming audio generation.',
        'model': 'voiceai-tts-v1-latest',  # Optional, defaults to voiceai-tts-v1-latest
        'language': 'en'  # Optional, defaults to 'en'
    },
    stream=True
)

# Or with a custom voice_id:
# json={'voice_id': 'your-voice-id-here', 'text': 'This is a test of streaming audio generation.', 'model': 'voiceai-tts-v1-latest', 'language': 'en'}

with open('output.mp3', 'wb') as f:
    # Default chunk_size is 1 byte; read in larger chunks for efficiency
    for chunk in response.iter_content(chunk_size=4096):
        if chunk:
            f.write(chunk)
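
Because chunks arrive incrementally, playback can begin before generation finishes. A minimal sketch that pipes the MP3 stream to FFmpeg's ffplay (assumes ffplay is installed; any player that reads MP3 from stdin works):

import subprocess
import requests

# ffplay reads MP3 from stdin ("-"); -nodisp hides the window, -autoexit quits at EOF
player = subprocess.Popen(
    ["ffplay", "-nodisp", "-autoexit", "-loglevel", "quiet", "-"],
    stdin=subprocess.PIPE
)

response = requests.post(
    'https://dev.voice.ai/api/v1/tts/speech/stream',
    headers={'Authorization': 'Bearer YOUR_API_KEY', 'Content-Type': 'application/json'},
    json={'text': 'Playback starts before generation completes.'},
    stream=True
)

# Feed chunks to the player as they arrive
for chunk in response.iter_content(chunk_size=4096):
    if chunk:
        player.stdin.write(chunk)

player.stdin.close()
player.wait()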

WebSocket Streaming

WebSocket streaming for real-time TTS. Two modes are available:
Endpoint          Use Case                             Connection Lifecycle
/stream           Single generation                    Closes after audio completes
/multi-stream     Multiple generations, multi-speaker  Persistent until client closes
Authentication: Include your API key in the Authorization header. See the Authentication guide for details.

Protocol:
  • First message is an init message (sets voice, model, language)
  • Text can be buffered over multiple messages before flush (see the sketch after this list)
  • All messages are JSON text (binary input is rejected)
  • Server responds with JSON messages containing base64-encoded audio
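
For example, text can arrive over several messages and generate in one pass. A minimal sketch of the buffering flow on the single-context endpoint (a fragment to run inside an open connection, as in the full example below; it assumes buffered text is concatenated in send order):

# Buffer text across several messages; nothing is generated yet
await ws.send(json.dumps({"text": "First sentence. ", "language": "en", "model": "voiceai-tts-v1-latest"}))
await ws.send(json.dumps({"text": "Second sentence. "}))
# flush triggers generation of all buffered text
await ws.send(json.dumps({"text": "Final sentence.", "flush": True}))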

Single-Context WebSocket

Endpoint: wss://dev.voice.ai/api/v1/tts/stream

Single generation per WebSocket connection. Text is buffered until flush, audio streams back, then the server closes the connection (code 1000). For multiple generations, use Multi-Context WebSocket instead.
import asyncio
import json
import base64
import websockets

async def test_tts_websocket():
    url = "wss://dev.voice.ai/api/v1/tts/stream"
    headers = {
        "Authorization": "Bearer YOUR_API_KEY"
    }
    
    try:
        async with websockets.connect(url, additional_headers=headers) as ws:
            # Init message with text and flush=true for immediate generation
            await ws.send(json.dumps({
                "voice_id": None,  # Optional: voice ID (None = default built-in voice)
                "text": "Hello, this is a test.",
                "audio_format": "mp3",
                "language": "en",
                "model": "voiceai-tts-v1-latest",
                "flush": True
            }))
            
            # Receive audio chunks
            audio_data = b""
            while True:
                msg = await ws.recv()
                data = json.loads(msg)
                if data.get("audio"):
                    chunk = base64.b64decode(data["audio"])
                    audio_data += chunk
                    print(f"Received {len(chunk)} bytes")
                elif data.get("is_last"):
                    print("Complete!")
                    break
            
            # Save audio
            with open("output.mp3", "wb") as f:
                f.write(audio_data)
                
    except websockets.ConnectionClosed as e:
        # Errors close connection with close code (no JSON error message)
        print(f"Connection closed: code={e.code} reason={e.reason}")

asyncio.run(test_tts_websocket())
Message Format:
  • Input: JSON text messages only
  • Output: JSON messages with base64-encoded audio
{
  "voice_id": "uuid",      // Optional: voice to use (defaults to model's built-in voice if omitted)
  "text": "Hello world",   // Text to buffer
  "language": "en",        // Optional: Language code (ISO 639-1, e.g., "en", "es", "fr"), defaults to "en"
  "flush": true,           // Trigger audio generation (connection closes after)
  "audio_format": "mp3",   // Optional: mp3, wav, pcm, or format-specific options (e.g., mp3_44100_128, opus_48000_64, pcm_16000)
  "temperature": 1.0,      // Optional: 0.0-2.0
  "top_p": 0.8,            // Optional: 0.0-1.0
  "model": "voiceai-tts-v1-latest"  // Optional: model
}
Server Responses:
  • {"audio": "<base64-encoded-audio>"} - Audio chunks as base64 strings (streamed immediately)
  • {"is_last": true} - Sent after all audio chunks are sent. Indicates generation is complete. Server closes connection (code 1000) immediately after this message.
Error Handling: Errors are sent via WebSocket close codes only (no JSON error messages). Handle the close event to detect errors:
  • 1000 - Normal closure (generation complete)
  • 1003 - Invalid message type (binary not supported, or expected text message)
  • 1007 - Invalid data (malformed JSON, validation errors)
  • 1008 - Policy violation (authentication failed, text too long, insufficient credits, invalid parameters)
  • 1011 - Internal server error (TTS generation failed, session preparation failed)

Multi-Context WebSocket

Endpoint: wss://dev.voice.ai/api/v1/tts/multi-stream

Multiple concurrent TTS streams over a single WebSocket connection. Each context has its own voice and settings.
import asyncio
import json
import base64
import websockets

async def test_multi_context():
    url = "wss://dev.voice.ai/api/v1/tts/multi-stream"
    headers = {
        "Authorization": "Bearer YOUR_API_KEY",
    }
    
    async with websockets.connect(url, additional_headers=headers) as ws:
        # Init context-1 (first message to ctx-1)
        await ws.send(json.dumps({
            "context_id": "ctx-1",
            "voice_id": "VOICE_ID_1",
            "text": "Hello from context one.",
            "language": "en",
            "model": "voiceai-tts-v1-latest",
            "flush": True
        }))
        
        # Init context-2 (first message to ctx-2, can use different voice)
        await ws.send(json.dumps({
            "context_id": "ctx-2", 
            "voice_id": "VOICE_ID_2",
            "text": "Hello from context two.",
            "language": "en",
            "model": "voiceai-tts-v1-latest",
            "flush": True
        }))
        
        # Receive responses (will be interleaved, all JSON with base64 audio)
        audio_by_context = {}  # {context_id: bytes}
        completed = set()
        while len(completed) < 2:
            msg = await ws.recv()
            data = json.loads(msg)
            ctx = data.get("context_id")
            if data.get("audio"):
                chunk = base64.b64decode(data["audio"])
                audio_by_context.setdefault(ctx, b"")
                audio_by_context[ctx] += chunk
                print(f"Audio chunk for {ctx}: {len(chunk)} bytes")
            elif data.get("is_last"):
                # Flush completion (sent as a separate message after all audio chunks)
                # Context remains active and can receive more flushes
                completed.add(ctx)
                print(f"{ctx} flush complete!")
            elif data.get("error"):
                # Per-context error message; mark the context done so the loop exits
                print(f"Error for {ctx}: {data['error']}")
                completed.add(ctx)
        
        # Send another message to existing context
        await ws.send(json.dumps({
            "context_id": "ctx-1",
            "text": "More text for context one.",
            "flush": True
        }))
        
        # Receive audio for the new message
        while True:
            msg = await ws.recv()
            data = json.loads(msg)
            if data.get("audio"):
                chunk = base64.b64decode(data["audio"])
                audio_by_context["ctx-1"] += chunk
            elif data.get("is_last"):
                break
        
        # Close entire socket
        await ws.send(json.dumps({"close_socket": True}))
    
    # Save audio to files
    for ctx_id, audio_data in audio_by_context.items():
        filename = f"output_{ctx_id}.mp3"
        with open(filename, "wb") as f:
            f.write(audio_data)
        print(f"Saved {ctx_id} audio to {filename} ({len(audio_data)} bytes)")

asyncio.run(test_multi_context())
Message Format:
  • Input: JSON text messages only
  • Output: JSON messages with base64-encoded audio and context_id
{
  "context_id": "ctx-1",   // Context identifier (auto-generated if omitted)
  "voice_id": "uuid",      // Optional: voice to use (defaults to model's built-in voice if omitted)
  "text": "Hello",         // Text to buffer
  "language": "en",        // Optional: Language code (ISO 639-1, e.g., "en", "es", "fr"), defaults to "en"
  "flush": true,           // Trigger audio generation
  "audio_format": "mp3",   // Optional: mp3, wav, pcm, or format-specific options (e.g., mp3_44100_128, opus_48000_64, pcm_16000)
  "temperature": 1.0,      // Optional: 0.0-2.0
  "top_p": 0.8,            // Optional: 0.0-1.0
  "model": "voiceai-tts-v1-latest",  // Optional: model
  "close_context": true,   // Close this context
  "close_socket": true     // Close entire connection
}
Server Responses:
  • {"audio": "<base64-encoded-audio>", "context_id": "ctx-1"} - Audio chunks with context ID (streamed immediately, no buffering delay)
  • {"is_last": true, "context_id": "ctx-1"} - Completion message sent after all audio chunks for a flush (separate message, never included in audio messages)
    • Note: is_last is sent after EACH flush completes, allowing the same context to be reused for multiple flushes
  • {"context_closed": true, "context_id": "ctx-1"} - Sent when a context is explicitly closed via close_context (separate from is_last)
  • {"error": "message", "context_id": "ctx-1"} on error
Closing Contexts and Connections:
  • {"context_id": "ctx-1", "close_context": true} - Close a specific context. Server responds with context_closed to confirm.
  • {"close_socket": true} - Close the entire WebSocket connection and all contexts. Can be included in any message or sent standalone.

When to Use

Use HTTP Chunked Streaming for:
  • Simple request/response patterns
  • One-off audio generation
  • Stateless operations
Use Single-Context WebSocket (/stream) for:
  • Single audio generation with WebSocket protocol
  • Text buffering before generation (send text incrementally, flush once)
  • When you need WebSocket protocol but only one generation per connection
Use Multi-Context WebSocket (/multi-stream) for:
  • Conversational AI applications with multiple turns
  • Multiple sequential audio generations over persistent connection
  • Multiple concurrent voices in the same application
  • Conversation simulations with multiple speakers
  • When voice settings should persist across multiple requests
  • Applications requiring voice switching
Use non-streaming for:
  • Batch processing
  • Simpler code requirements
  • Small text inputs
  • When you don’t need real-time audio

Best Practices

HTTP Chunked Streaming

  • Handle network errors gracefully
  • Start playing audio chunks as soon as they arrive
  • Implement timeout handling for long streams (see the sketch after this list)
  • Prefer MP3 for efficiency; PCM for highest quality
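
A sketch of timeout and error handling for the streaming endpoint (the timeout values are illustrative, not recommendations):

import requests

try:
    response = requests.post(
        'https://dev.voice.ai/api/v1/tts/speech/stream',
        headers={'Authorization': 'Bearer YOUR_API_KEY', 'Content-Type': 'application/json'},
        json={'text': 'Testing timeout handling.'},
        stream=True,
        timeout=(5, 30)  # connect timeout, then read timeout between chunks (seconds)
    )
    response.raise_for_status()  # surface HTTP errors before reading the stream
    with open('output.mp3', 'wb') as f:
        for chunk in response.iter_content(chunk_size=4096):
            if chunk:
                f.write(chunk)
except requests.Timeout:
    print("Stream timed out")
except requests.RequestException as e:
    print(f"Request failed: {e}")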

WebSocket Streaming

  • Always send an init message first (with voice, model, language)
  • Handle is_last as a separate message to know when audio is complete
    • The same context can be reused for multiple flushes - each flush generates its own is_last
  • Handle context_closed message for context closure confirmation
    • When you send close_context, the server responds with context_closed to confirm the context is closed
  • Decode base64 audio chunks properly
  • Handle errors gracefully
  • Single-context (/stream): Connection closes after is_last. Reconnect for each generation.
  • Multi-context (/multi-stream): Connection stays open. Track audio by context_id to avoid mixing streams.
    • Closing contexts: Send {"context_id": "ctx-1", "close_context": true} (can be sent standalone, no text/flush required). Server responds with context_closed to confirm closure.
    • Closing connection: Send {"close_socket": true} to close the entire WebSocket connection and all contexts.

General

  • Use appropriate audio format (MP3 for efficiency, WAV/PCM for quality)
  • Basic formats (mp3, wav, pcm) output at a 32kHz sample rate; format-specific options (e.g., mp3_44100_128) control sample rate and bitrate
  • Implement reconnection logic for WebSocket connections (see the sketch after this list)
  • Monitor connection health and handle disconnections
  • Set appropriate timeouts for all streaming methods
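
A reconnection sketch with exponential backoff (connect_with_retry is a hypothetical helper; the backoff parameters are illustrative):

import asyncio
import websockets

async def connect_with_retry(url, headers, max_attempts=5):
    """Open a WebSocket connection, retrying with exponential backoff."""
    delay = 1
    for attempt in range(1, max_attempts + 1):
        try:
            return await websockets.connect(url, additional_headers=headers)
        except (OSError, websockets.WebSocketException) as e:
            print(f"Attempt {attempt} failed: {e}; retrying in {delay}s")
            await asyncio.sleep(delay)
            delay = min(delay * 2, 30)  # cap the backoff
    raise ConnectionError(f"Could not connect after {max_attempts} attempts")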

Audio Output Formats

The TTS API supports multiple audio formats with various sample rates and bitrates. Basic formats (mp3, wav, pcm) output at 32kHz sample rate. Format-specific options allow you to control sample rate and bitrate.
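
For example, to request 48kHz Opus at 64kbps instead of a 32kHz basic format, set audio_format in the request (shown here as a WebSocket init message fragment):

# Request 48kHz Opus at 64kbps rather than the basic 32kHz "mp3"
await ws.send(json.dumps({
    "text": "Hello in Opus.",
    "audio_format": "opus_48000_64",  # pattern: codec_samplerate_bitrate
    "flush": True
}))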

32kHz Formats

Format   Description                        Use Case
mp3      Compressed, smallest size          Web playback, bandwidth efficiency
wav      Uncompressed with headers          Professional audio, editing
pcm      Raw 16-bit signed little-endian    Real-time processing, custom decoders

MP3 Formats (with sample rate and bitrate)

Format          Sample Rate   Bitrate   Use Case
mp3_22050_32    22.05kHz      32kbps    Low bandwidth, voice-only
mp3_24000_48    24kHz         48kbps    Voice applications
mp3_44100_32    44.1kHz       32kbps    Music/voice, low bandwidth
mp3_44100_64    44.1kHz       64kbps    Music/voice, balanced
mp3_44100_96    44.1kHz      96kbps    Music/voice, good quality
mp3_44100_128   44.1kHz       128kbps   Music/voice, high quality
mp3_44100_192   44.1kHz       192kbps   Music/voice, highest quality

Opus Formats (with sample rate and bitrate)

Format           Sample Rate   Bitrate   Use Case
opus_48000_32    48kHz         32kbps    Low bandwidth, voice-only
opus_48000_64    48kHz         64kbps    Voice applications, balanced
opus_48000_96    48kHz         96kbps    Voice/music, good quality
opus_48000_128   48kHz         128kbps   Voice/music, high quality
opus_48000_192   48kHz         192kbps   Voice/music, highest quality

PCM Formats (with sample rate)

Format      Sample Rate   Use Case
pcm_8000    8kHz          Telephony, low bandwidth
pcm_16000   16kHz         Voice applications
pcm_22050   22.05kHz      Voice/music, balanced
pcm_24000   24kHz         Voice/music
pcm_32000   32kHz         Voice/music, standard
pcm_44100   44.1kHz       Music, CD quality
pcm_48000   48kHz         Music, professional quality
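
Raw PCM carries no container headers, so wrap it yourself before saving or playback. A minimal sketch using Python's wave module, with 16-bit samples as documented and mono output assumed (channel count is an assumption; see the API Reference):

import wave

def save_pcm_as_wav(pcm_bytes, filename, sample_rate=32000, channels=1):
    """Wrap raw 16-bit signed little-endian PCM in a WAV container."""
    with wave.open(filename, "wb") as wf:
        wf.setnchannels(channels)      # assumption: mono output
        wf.setsampwidth(2)             # 16-bit samples = 2 bytes
        wf.setframerate(sample_rate)   # match the pcm_* format you requested
        wf.writeframes(pcm_bytes)

# e.g. for pcm_32000 output: save_pcm_as_wav(audio_data, "output.wav", sample_rate=32000)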

WAV Formats (with sample rate)

Format      Sample Rate   Use Case
wav_16000   16kHz         Voice applications
wav_22050   22.05kHz      Voice/music, balanced
wav_24000   24kHz         Voice/music

Telephony Formats

Format      Sample Rate   Use Case
alaw_8000   8kHz          A-law telephony (G.711)
ulaw_8000   8kHz          μ-law telephony (G.711)
See the API Reference for complete documentation.