Generate speech from text using the Voice.ai TTS API.
Prerequisites: API key
1. Get a Voice ID (Optional)

Optionally get a voice_id from your dashboard or clone a voice. Skip this to use the default voice.

2. Generate Speech

Generate speech from text. Include voice_id if you have one, or omit it to use the default voice. See the Generate Speech endpoint for details.
import requests

# Using default voice (voice_id is optional)
response = requests.post(
    'https://dev.voice.ai/api/v1/tts/speech',
    headers={'Authorization': 'Bearer YOUR_API_KEY', 'Content-Type': 'application/json'},
    json={
        'text': 'Hello! This is a test of the Voice.ai TTS API.',
        'model': 'voiceai-tts-v1-latest',  # Optional, defaults to voiceai-tts-v1-latest
        'language': 'en'  # Optional, defaults to 'en'
    }
)

# Or with a custom voice_id:
# json={'voice_id': 'your-voice-id-here', 'text': 'Hello! This is a test of the Voice.ai TTS API.', 'model': 'voiceai-tts-v1-latest', 'language': 'en'}

# Fail loudly on HTTP errors instead of saving an error body as audio
response.raise_for_status()

with open('output.mp3', 'wb') as f:
    f.write(response.content)
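
Because voice_id, model, and language are all optional, it can help to build the request body in one place and leave out whatever was not supplied. The helper below is a minimal sketch of that idea; the function name build_tts_payload is hypothetical and not part of the Voice.ai API, and the field names are taken from the example above.

```python
from typing import Optional


def build_tts_payload(
    text: str,
    voice_id: Optional[str] = None,
    model: Optional[str] = None,
    language: Optional[str] = None,
) -> dict:
    """Build the JSON body, omitting optional fields that were not supplied.

    Hypothetical helper: omitted fields fall back to the API defaults
    described in the example (default voice, default model, 'en').
    """
    if not text:
        raise ValueError("text must be a non-empty string")
    payload = {"text": text}
    optional = {"voice_id": voice_id, "model": model, "language": language}
    # Keep only the optional fields the caller actually set
    payload.update({key: value for key, value in optional.items() if value is not None})
    return payload
```

The result can be passed as the json= argument of requests.post, for example json=build_tts_payload('Hello!', voice_id='your-voice-id-here').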

Streaming

For the lowest latency, use the WebSocket endpoint. It is best suited to conversational AI and to multiple sequential requests.

HTTP Streaming (Simple)

For simple request/response streaming, use the HTTP streaming endpoint:
import requests

# Using default voice (voice_id is optional)
response = requests.post(
    'https://dev.voice.ai/api/v1/tts/speech/stream',
    headers={'Authorization': 'Bearer YOUR_API_KEY', 'Content-Type': 'application/json'},
    json={
        'text': 'This text will be streamed in chunks.',
        'model': 'voiceai-tts-v1-latest',  # Optional, defaults to voiceai-tts-v1-latest
        'language': 'en'  # Optional, defaults to 'en'
    },
    stream=True
)

# Or with a custom voice_id:
# json={'voice_id': 'your-voice-id-here', 'text': 'This text will be streamed in chunks.', 'model': 'voiceai-tts-v1-latest', 'language': 'en'}

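response.raise_for_status()

with open('output.mp3', 'wb') as f:
    # Read in 8 KiB chunks; iter_content() defaults to 1 byte per iteration
    for chunk in response.iter_content(chunk_size=8192):
        if chunk:
            f.write(chunk)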

WebSocket Streaming (Optimal for Conversational AI)

For lowest latency in multi-turn conversations, use the Multi-Context WebSocket (/multi-stream):
import asyncio
import json
import base64
import websockets

async def tts_conversation():
    # Use /multi-stream for multiple generations over persistent connection
    url = "wss://dev.voice.ai/api/v1/tts/multi-stream"
    headers = {"Authorization": "Bearer YOUR_API_KEY"}
    
    # additional_headers is the keyword in websockets 14+; older releases use extra_headers
    async with websockets.connect(url, additional_headers=headers) as ws:
        # First message (context auto-generated)
        await ws.send(json.dumps({
            "text": "Hello! How can I help you today?",
            "language": "en",
            "flush": True
        }))
        
        # Receive audio chunks
        while True:
            msg = await ws.recv()
            data = json.loads(msg)
            if data.get("audio"):
                audio_chunk = base64.b64decode(data["audio"])
                # Process/play audio chunk...
            # Checked separately so a final message that also carries audio still ends the loop
            if data.get("is_last"):
                break
        
        # Second generation (same connection)
        await ws.send(json.dumps({
            "text": "I can help you with that.",
            "flush": True
        }))
        
        # Receive audio...
        while True:
            msg = await ws.recv()
            data = json.loads(msg)
            if data.get("audio"):
                audio_chunk = base64.b64decode(data["audio"])
                # Process/play audio chunk...
            # Checked separately so a final message that also carries audio still ends the loop
            if data.get("is_last"):
                break
        
        # Close when done
        await ws.send(json.dumps({"close_socket": True}))

asyncio.run(tts_conversation())
See the Streaming Guide for complete WebSocket documentation.
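
The receive loop appears twice in the example, so it may be worth factoring the message handling into one function. The sketch below assumes the message shape used in the example (a base64-encoded "audio" field and an "is_last" flag); the function name collect_audio is hypothetical. It does not assume whether the final message carries audio, so both fields are checked independently.

```python
import base64
import json
from typing import Iterable, List


def collect_audio(messages: Iterable[str]) -> bytes:
    """Decode audio from JSON messages until one is flagged as last.

    Assumes each message is a JSON object that may contain a
    base64-encoded "audio" field and/or a truthy "is_last" flag.
    """
    chunks: List[bytes] = []
    for raw in messages:
        data = json.loads(raw)
        if data.get("audio"):
            chunks.append(base64.b64decode(data["audio"]))
        if data.get("is_last"):
            break  # Ignore anything after the end-of-generation marker
    return b"".join(chunks)
```

This version works on any iterable of already-received messages, which makes it easy to test offline. Inside the async example, the same two checks would sit in the body of the while True loop after each await ws.recv().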

Supported Languages

The TTS API supports multiple languages. Specify the language parameter using ISO 639-1 language codes. If not provided, the API defaults to English (en).
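| Language Code | Language | Model Type |
| --- | --- | --- |
| en | English | Non-multilingual |
| ca | Catalan | Multilingual |
| sv | Swedish | Multilingual |
| es | Spanish | Multilingual |
| fr | French | Multilingual |
| de | German | Multilingual |
| it | Italian | Multilingual |
| pt | Portuguese | Multilingual |
| pl | Polish | Multilingual |
| ru | Russian | Multilingual |
| nl | Dutch | Multilingual |
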
Model Selection: The API automatically selects the appropriate model based on the language. English uses voiceai-tts-v1-latest (non-multilingual), while all other languages use voiceai-tts-multilingual-v1-latest. You can override this by explicitly specifying the model parameter.
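
The selection rule described above can be mirrored on the client side, for instance to log which model a request is expected to use. This is only a sketch of the documented behavior, not the server's actual implementation; the function name resolve_model is hypothetical, and the language set is copied from the table of supported languages.

```python
from typing import Optional

ENGLISH_MODEL = "voiceai-tts-v1-latest"
MULTILINGUAL_MODEL = "voiceai-tts-multilingual-v1-latest"
# Languages listed as "Multilingual" in the supported-languages table
MULTILINGUAL_LANGUAGES = {"ca", "sv", "es", "fr", "de", "it", "pt", "pl", "ru", "nl"}


def resolve_model(language: str = "en", model: Optional[str] = None) -> str:
    """Return the model a request is expected to use.

    An explicit model always wins; otherwise English maps to the
    non-multilingual model and every other listed language maps to
    the multilingual one.
    """
    if model is not None:
        return model
    if language == "en":
        return ENGLISH_MODEL
    if language in MULTILINGUAL_LANGUAGES:
        return MULTILINGUAL_MODEL
    raise ValueError(f"language {language!r} is not in the supported list")
```

For example, resolve_model('fr') returns the multilingual model name, while resolve_model('fr', model='voiceai-tts-v1-latest') returns the explicit override unchanged.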

Audio Output

The TTS API supports multiple audio formats with various sample rates and bitrates. Basic formats (mp3, wav, pcm) output at 32kHz sample rate. Format-specific options allow you to control sample rate and bitrate.

32kHz Formats

| Format | Description | Use Case |
| --- | --- | --- |
| mp3 | Compressed, smallest file size | Web playback, storage efficiency |
| wav | Uncompressed with headers | Professional audio editing |
| pcm | Raw 16-bit signed little-endian samples | Real-time processing, custom decoders |
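
Raw pcm output has no header, so most players cannot open it directly. One way to make it playable is to wrap the samples in a WAV container using Python's standard wave module. The sketch below assumes mono audio, which this page does not state, so treat the channels default as an assumption; the 32 kHz default follows the basic pcm format described above, and the function name pcm_to_wav_bytes is hypothetical.

```python
import io
import wave


def pcm_to_wav_bytes(pcm: bytes, sample_rate: int = 32000, channels: int = 1) -> bytes:
    """Wrap raw 16-bit signed little-endian PCM samples in a WAV container.

    channels=1 (mono) is an assumption; adjust it if the API returns
    a different channel layout.
    """
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 16-bit samples are 2 bytes wide
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buffer.getvalue()
```

The returned bytes can be written straight to a .wav file. For the rate-specific variants such as pcm_16000, pass the matching sample_rate.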

MP3 Formats (with sample rate and bitrate)

| Format | Sample Rate | Bitrate | Use Case |
| --- | --- | --- | --- |
| mp3_22050_32 | 22.05kHz | 32kbps | Low bandwidth, voice-only |
| mp3_24000_48 | 24kHz | 48kbps | Voice applications |
| mp3_44100_32 | 44.1kHz | 32kbps | Music/voice, low bandwidth |
| mp3_44100_64 | 44.1kHz | 64kbps | Music/voice, balanced |
| mp3_44100_96 | 44.1kHz | 96kbps | Music/voice, good quality |
| mp3_44100_128 | 44.1kHz | 128kbps | Music/voice, high quality |
| mp3_44100_192 | 44.1kHz | 192kbps | Music/voice, highest quality |

Opus Formats (with sample rate and bitrate)

| Format | Sample Rate | Bitrate | Use Case |
| --- | --- | --- | --- |
| opus_48000_32 | 48kHz | 32kbps | Low bandwidth, voice-only |
| opus_48000_64 | 48kHz | 64kbps | Voice applications, balanced |
| opus_48000_96 | 48kHz | 96kbps | Voice/music, good quality |
| opus_48000_128 | 48kHz | 128kbps | Voice/music, high quality |
| opus_48000_192 | 48kHz | 192kbps | Voice/music, highest quality |

PCM Formats (with sample rate)

| Format | Sample Rate | Use Case |
| --- | --- | --- |
| pcm_8000 | 8kHz | Telephony, low bandwidth |
| pcm_16000 | 16kHz | Voice applications |
| pcm_22050 | 22.05kHz | Voice/music, balanced |
| pcm_24000 | 24kHz | Voice/music |
| pcm_32000 | 32kHz | Voice/music, standard |
| pcm_44100 | 44.1kHz | Music, CD quality |
| pcm_48000 | 48kHz | Music, professional quality |

WAV Formats (with sample rate)

| Format | Sample Rate | Use Case |
| --- | --- | --- |
| wav_16000 | 16kHz | Voice applications |
| wav_22050 | 22.05kHz | Voice/music, balanced |
| wav_24000 | 24kHz | Voice/music |
Telephony Formats

| Format | Sample Rate | Use Case |
| --- | --- | --- |
| alaw_8000 | 8kHz | A-law telephony (G.711) |
| ulaw_8000 | 8kHz | μ-law telephony (G.711) |
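
The format identifiers in the tables above appear to follow a codec_samplerate_bitrate naming pattern, with the bitrate present only for MP3 and Opus and the bare names (mp3, wav, pcm) implying 32 kHz. Assuming that pattern holds, a client can derive the sample rate and bitrate from the name alone, for example to configure a decoder. The helper below is a sketch: parse_format and AudioFormat are hypothetical names, and it does not check that a given combination is actually offered by the API.

```python
from typing import NamedTuple, Optional

BASIC_SAMPLE_RATE_HZ = 32000  # Stated rate for the bare mp3, wav, and pcm formats


class AudioFormat(NamedTuple):
    codec: str
    sample_rate_hz: int
    bitrate_kbps: Optional[int]  # None when the name carries no bitrate


def parse_format(name: str) -> AudioFormat:
    """Split a format identifier such as 'mp3_44100_128' into its parts.

    Only the naming pattern is parsed; availability is not validated.
    """
    parts = name.split("_")
    codec = parts[0]
    if len(parts) == 1:
        return AudioFormat(codec, BASIC_SAMPLE_RATE_HZ, None)
    if len(parts) == 2:
        return AudioFormat(codec, int(parts[1]), None)
    if len(parts) == 3:
        return AudioFormat(codec, int(parts[1]), int(parts[2]))
    raise ValueError(f"unrecognized format identifier: {name!r}")
```

For instance, parse_format('opus_48000_64') yields the codec 'opus', a 48000 Hz sample rate, and a 64 kbps bitrate, while parse_format('ulaw_8000') yields 8000 Hz with no bitrate.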