Text-to-Speech API Quickstart

Generate speech from text using the Voice.ai TTS API.

Prerequisites: API key

Get a Voice ID (Optional)

Get a voice_id from your dashboard, or clone a voice to create one. Skip this step to use the default voice.

Generate Speech

Send a POST request containing the text to synthesize. Include voice_id if you have one, or omit it to use the default voice. See the Generate Speech endpoint for details.

import requests

# Using default voice (voice_id is optional)
response = requests.post(
    'https://dev.voice.ai/api/v1/tts/speech',
    headers={'Authorization': 'Bearer YOUR_API_KEY', 'Content-Type': 'application/json'},
    json={
        'text': 'Hello! This is a test of the Voice.ai TTS API.',
        'model': 'voiceai-tts-v1-latest',  # Optional, defaults to voiceai-tts-v1-latest
        'language': 'en'  # Optional, defaults to 'en'
    }
)

# Or with a custom voice_id:
# json={'voice_id': 'your-voice-id-here', 'text': 'Hello! This is a test of the Voice.ai TTS API.', 'model': 'voiceai-tts-v1-latest', 'language': 'en'}

# Fail fast on HTTP errors so an error body is never saved as audio
response.raise_for_status()

with open('output.mp3', 'wb') as f:
    f.write(response.content)
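
The optional fields in the request above can be assembled with a small helper. This is a minimal sketch: build_tts_payload is a hypothetical name, not part of the Voice.ai API, and it assumes only the field names used in the example above (text, voice_id, model, language).

```python
def build_tts_payload(text, voice_id=None, model=None, language=None):
    """Build the JSON body for a TTS request, leaving out unset optional fields.

    Hypothetical helper: only the field names come from the example above.
    """
    if not text:
        raise ValueError("text must be a non-empty string")
    payload = {"text": text}
    optional = {"voice_id": voice_id, "model": model, "language": language}
    # Fields left as None are omitted so the API applies its own defaults
    payload.update({key: value for key, value in optional.items() if value is not None})
    return payload


print(build_tts_payload("Hello!"))
print(build_tts_payload("Hola!", voice_id="your-voice-id-here", language="es"))
```

The returned dictionary can be passed as the json argument of requests.post in place of the literal shown above.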

Streaming

For the lowest latency, use the WebSocket endpoint. It is best suited to conversational AI and to multiple sequential requests.

HTTP Streaming (Simple)

For simple request/response streaming, use the HTTP streaming endpoint:

import requests

# Using default voice (voice_id is optional)
response = requests.post(
    'https://dev.voice.ai/api/v1/tts/speech/stream',
    headers={'Authorization': 'Bearer YOUR_API_KEY', 'Content-Type': 'application/json'},
    json={
        'text': 'This text will be streamed in chunks.',
        'model': 'voiceai-tts-v1-latest',  # Optional, defaults to voiceai-tts-v1-latest
        'language': 'en'  # Optional, defaults to 'en'
    },
    stream=True
)

# Or with a custom voice_id:
# json={'voice_id': 'your-voice-id-here', 'text': 'This text will be streamed in chunks.', 'model': 'voiceai-tts-v1-latest', 'language': 'en'}

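response.raise_for_status()

with open('output.mp3', 'wb') as f:
    # Without an explicit chunk_size, iter_content yields one byte at a time
    for chunk in response.iter_content(chunk_size=8192):
        if chunk:
            f.write(chunk)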

WebSocket Streaming (Optimal for Conversational AI)

For lowest latency in multi-turn conversations, use the Multi-Context WebSocket (/multi-stream):

import asyncio
import json
import base64
import websockets

async def tts_conversation():
    # Use /multi-stream for multiple generations over persistent connection
    url = "wss://dev.voice.ai/api/v1/tts/multi-stream"
    headers = {"Authorization": "Bearer YOUR_API_KEY"}
    
    try:
        # additional_headers requires websockets >= 14; older releases use extra_headers
        async with websockets.connect(url, additional_headers=headers) as ws:
            # First message (context auto-generated)
            await ws.send(json.dumps({
                "text": "Hello! How can I help you today?",
                "language": "en",
                "flush": True
            }))
            
            # Receive audio chunks
            while True:
                msg = await ws.recv()
                data = json.loads(msg)
                if data.get("error"):
                    print(f"Error: {data['error']}")
                    break
                if data.get("audio"):
                    audio_chunk = base64.b64decode(data["audio"])
                    # Process/play audio chunk...
                if data.get("is_last"):
                    break
            
            # Second generation (same connection)
            await ws.send(json.dumps({
                "text": "I can help you with that.",
                "flush": True
            }))
            
            # Receive audio...
            while True:
                msg = await ws.recv()
                data = json.loads(msg)
                if data.get("error"):
                    print(f"Error: {data['error']}")
                    break
                if data.get("audio"):
                    audio_chunk = base64.b64decode(data["audio"])
                    # Process/play audio chunk...
                if data.get("is_last"):
                    break
            
            # Close when done
            await ws.send(json.dumps({"close_socket": True}))
            
    except websockets.ConnectionClosed as e:
        # Handle connection errors (auth failures, invalid params, etc.)
        # Close codes: 1000=normal, 1003=invalid message, 1007=invalid data,
        #              1008=policy violation (auth/credits), 1011=server error
        print(f"Connection closed: code={e.code} reason={e.reason}")

asyncio.run(tts_conversation())
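
The two receive loops above repeat the same message handling, which can be factored into a helper that works on raw messages. The sketch below assumes only the message fields used in the example (audio, is_last, error); parse_tts_message and collect_audio are hypothetical names, and the messages in the demo are simulated locally rather than received from the server.

```python
import base64
import json


def parse_tts_message(raw):
    """Decode one raw JSON message into (audio_bytes, is_last, error)."""
    data = json.loads(raw)
    audio = base64.b64decode(data["audio"]) if data.get("audio") else b""
    return audio, bool(data.get("is_last")), data.get("error")


def collect_audio(messages):
    """Concatenate audio from raw messages until is_last is seen or an error arrives."""
    buffer = bytearray()
    for raw in messages:
        audio, is_last, error = parse_tts_message(raw)
        if error:
            raise RuntimeError(f"TTS error: {error}")
        buffer.extend(audio)
        if is_last:
            break
    return bytes(buffer)


# Simulated messages for illustration only; real ones come from ws.recv()
simulated = [
    json.dumps({"audio": base64.b64encode(b"\x01\x02").decode()}),
    json.dumps({"audio": base64.b64encode(b"\x03\x04").decode()}),
    json.dumps({"is_last": True}),
]
print(len(collect_audio(simulated)))
```

Inside the WebSocket loop, each value returned by ws.recv() could be passed to parse_tts_message instead of repeating the decoding logic per generation.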

WebSocket Close Codes: Errors are communicated via close codes: 1000 = normal, 1003 = invalid message, 1007 = invalid data (including validation errors such as extra fields on text-only messages), 1008 = auth/credits/policy issue, 1011 = server error. See the Streaming Guide for full error handling documentation.
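
A client can turn these close codes into readable messages and a retry decision. This is a minimal sketch: the code meanings are taken from this page and from the comments in the WebSocket example, while describe_close, should_retry, and the retry policy itself are assumptions, not documented Voice.ai behavior.

```python
CLOSE_CODES = {
    1000: "normal closure",
    1003: "invalid message",
    1007: "invalid data",
    1008: "policy violation (auth/credits)",
    1011: "server error",
}


def describe_close(code):
    """Return a short description for a WebSocket close code."""
    return CLOSE_CODES.get(code, "unknown close code")


def should_retry(code):
    """Assumed policy: retry only on server errors.

    Codes 1003, 1007, and 1008 indicate a problem with the request or
    the account, so resending the same request is unlikely to help.
    """
    return code == 1011


for code in (1000, 1008, 1011):
    print(code, describe_close(code), "retry" if should_retry(code) else "no retry")
```

In the example above, the except block could call describe_close(e.code) before deciding whether to reconnect.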

delivery_mode: Set "delivery_mode": "paced" to emit chunks at a paced rate on PCM-based outputs (pcm, pcm_*, ulaw_8000, alaw_8000); other formats automatically fall back to "raw". The default is "raw", which emits each chunk as soon as it is generated for the lowest latency. See the Streaming Guide for all WebSocket options.
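
The fallback rule can be mirrored on the client to predict which mode a request will actually use. The sketch below encodes only the rule stated above; effective_delivery_mode is a hypothetical helper, and output_format is an illustrative argument name, since this page does not state the request field that selects the format.

```python
PACED_EXACT = {"pcm", "ulaw_8000", "alaw_8000"}


def effective_delivery_mode(output_format, requested="raw"):
    """Predict the delivery mode in effect for a given format and requested mode."""
    if requested not in ("raw", "paced"):
        raise ValueError(f"unknown delivery_mode: {requested!r}")
    # PCM-based outputs: pcm, any pcm_* variant, and the two G.711 formats
    pcm_based = output_format in PACED_EXACT or output_format.startswith("pcm_")
    if requested == "paced" and pcm_based:
        return "paced"
    return "raw"


print(effective_delivery_mode("pcm_16000", "paced"))
print(effective_delivery_mode("mp3_44100_128", "paced"))
```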

Supported Languages

The TTS API supports multiple languages. Specify the language parameter using ISO 639-1 language codes. If not provided, the API defaults to English (en).

Language Code	Language	Model Type
`en`	English	Non-multilingual
`ca`	Catalan	Multilingual
`sv`	Swedish	Multilingual
`es`	Spanish	Multilingual
`fr`	French	Multilingual
`de`	German	Multilingual
`it`	Italian	Multilingual
`pt`	Portuguese	Multilingual
`pl`	Polish	Multilingual
`ru`	Russian	Multilingual
`nl`	Dutch	Multilingual

Model Selection: The API automatically selects the appropriate model based on the language. English uses voiceai-tts-v1-latest (non-multilingual), while all other languages use voiceai-tts-multilingual-v1-latest. You can override this by explicitly specifying the model parameter.
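
The selection rule can be reproduced on the client, for example to log which model a request will use. This is a sketch of the rule as described above, not the server's implementation; resolve_model is a hypothetical helper, and the language set is copied from the table.

```python
DEFAULT_MODEL_EN = "voiceai-tts-v1-latest"
DEFAULT_MODEL_MULTILINGUAL = "voiceai-tts-multilingual-v1-latest"
SUPPORTED_LANGUAGES = {"en", "ca", "sv", "es", "fr", "de", "it", "pt", "pl", "ru", "nl"}


def resolve_model(language="en", model=None):
    """Return the model expected for a request, honoring an explicit override."""
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language code: {language!r}")
    if model is not None:
        return model  # An explicit model parameter overrides automatic selection
    return DEFAULT_MODEL_EN if language == "en" else DEFAULT_MODEL_MULTILINGUAL


print(resolve_model())
print(resolve_model("fr"))
```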

Audio Output

The TTS API supports multiple audio formats with various sample rates and bitrates. Basic formats (mp3, wav, pcm) output at 32kHz sample rate. Format-specific options allow you to control sample rate and bitrate.

32kHz Formats

Format	Description	Use Case
`mp3`	Compressed, smallest file size	Web playback, storage efficiency
`wav`	Uncompressed with headers	Professional audio editing
`pcm`	Raw 16-bit signed little-endian, 32kHz mono	Real-time processing, custom decoders

MP3 Formats (with sample rate and bitrate)

Format	Sample Rate	Bitrate	Use Case
`mp3_22050_32`	22.05kHz	32kbps	Low bandwidth, voice-only
`mp3_24000_48`	24kHz	48kbps	Voice applications
`mp3_44100_32`	44.1kHz	32kbps	Music/voice, low bandwidth
`mp3_44100_64`	44.1kHz	64kbps	Music/voice, balanced
`mp3_44100_96`	44.1kHz	96kbps	Music/voice, good quality
`mp3_44100_128`	44.1kHz	128kbps	Music/voice, high quality
`mp3_44100_192`	44.1kHz	192kbps	Music/voice, highest quality

Opus Formats (with sample rate and bitrate)

Format	Sample Rate	Bitrate	Use Case
`opus_48000_32`	48kHz	32kbps	Low bandwidth, voice-only
`opus_48000_64`	48kHz	64kbps	Voice applications, balanced
`opus_48000_96`	48kHz	96kbps	Voice/music, good quality
`opus_48000_128`	48kHz	128kbps	Voice/music, high quality
`opus_48000_192`	48kHz	192kbps	Voice/music, highest quality

PCM Formats (with sample rate)

All pcm_* formats use 16-bit signed little-endian mono at the specified sample rate.

Format	Sample Rate	Use Case
`pcm_8000`	8kHz	Telephony, low bandwidth
`pcm_16000`	16kHz	Voice applications
`pcm_22050`	22.05kHz	Voice/music, balanced
`pcm_24000`	24kHz	Voice/music
`pcm_32000`	32kHz	Voice/music, standard
`pcm_44100`	44.1kHz	Music, CD quality
`pcm_48000`	48kHz	Music, professional quality
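
Because pcm_* output is headerless 16-bit mono audio, most players need it wrapped in a container before playback. The sketch below uses Python's standard wave module to add a WAV header and to compute the duration from the byte count; pcm_to_wav and pcm_duration_seconds are hypothetical helper names, and the sample rate passed in must match the one in the requested format.

```python
import io
import wave

BYTES_PER_SAMPLE = 2  # 16-bit signed samples
CHANNELS = 1          # mono


def pcm_to_wav(pcm_bytes, sample_rate):
    """Wrap raw 16-bit little-endian mono PCM in a WAV container."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(BYTES_PER_SAMPLE)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm_bytes)
    return buffer.getvalue()


def pcm_duration_seconds(pcm_bytes, sample_rate):
    """Duration of raw 16-bit mono PCM audio in seconds."""
    return len(pcm_bytes) / (sample_rate * BYTES_PER_SAMPLE * CHANNELS)


# One second of silence at 16 kHz stands in for audio returned as pcm_16000
silence = b"\x00\x00" * 16000
print(pcm_duration_seconds(silence, 16000))
print(len(pcm_to_wav(silence, 16000)))
```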

WAV Formats (with sample rate)

Format	Sample Rate	Use Case
`wav_16000`	16kHz	Voice applications
`wav_22050`	22.05kHz	Voice/music, balanced
`wav_24000`	24kHz	Voice/music

Telephony Formats

Format	Sample Rate	Use Case
`alaw_8000`	8kHz	A-law telephony (G.711)
`ulaw_8000`	8kHz	μ-law telephony (G.711)
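
The format identifiers in the tables above follow a regular pattern: a codec name, then a sample rate in Hz, then a bitrate in kbps for MP3 and Opus. The sketch below splits an identifier into those parts. AudioFormat and parse_output_format are hypothetical names, and the parser checks only the shape of the identifier, not whether that exact combination appears in the tables.

```python
from typing import NamedTuple, Optional


class AudioFormat(NamedTuple):
    codec: str
    sample_rate_hz: int
    bitrate_kbps: Optional[int]


BASIC = {"mp3", "wav", "pcm"}                # Bare names, output at 32 kHz
WITH_BITRATE = {"mp3", "opus"}               # codec_rate_bitrate
RATE_ONLY = {"pcm", "wav", "alaw", "ulaw"}   # codec_rate


def parse_output_format(name):
    """Split a format identifier such as 'mp3_44100_128' into its components."""
    codec, *rest = name.split("_")
    if not rest:
        if codec not in BASIC:
            raise ValueError(f"{name!r} needs a sample rate suffix")
        return AudioFormat(codec, 32000, None)
    if codec in WITH_BITRATE and len(rest) == 2:
        return AudioFormat(codec, int(rest[0]), int(rest[1]))
    if codec in RATE_ONLY and len(rest) == 1:
        return AudioFormat(codec, int(rest[0]), None)
    raise ValueError(f"unrecognized format identifier: {name!r}")


print(parse_output_format("mp3_44100_128"))
print(parse_output_format("ulaw_8000"))
```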

Voice Cloning: Create custom voices from audio samples

Streaming: HTTP & WebSocket streaming (WebSocket for lowest latency)

API Reference: Complete endpoint documentation
