Open Source · Voice AI TTS Lite

Cloud-quality text-to-speech.
Without the cloud bill.

112M parameters. Runs on CPU and on-device on iOS. Streams with word-level timestamps. Zero-shot cloning with no reference transcript.

View Benchmarks
Top-rated among open TTS models
Try it. Right now.

This is the actual inference server hosted on m6a.large. Pick a sample or write your own prompt.

Hear it for yourself.

Zero-shot cloning. Paralinguistic tags. 32 kHz. All from a 112M-parameter model running on CPU.

Conversational
"And just like that — [laughs] — we shipped it. On a Friday. Into production."
Zero-shot clone · 32 kHz · CPU inference
Technical narration
"This model runs entirely on CPU. Two vCores. No GPU. Faster than realtime."
Zero-shot clone · 32 kHz · CPU inference
Expressive with tags
"Wait — [gasps] — you're telling me this whole thing runs on a laptop CPU?"
Zero-shot clone · [gasps] tag · CPU inference
Long-form
"The lighthouse keeper kept a journal for forty years. Only twice did he mention loneliness — once on a Tuesday after the storm, and once on the page where he stopped writing."
Zero-shot clone · Narrative · Varied prosody
100×

Cheaper than hosted TTS

Run it on a $42/mo m6a.large CPU instance. No per-token fees, no usage caps. Zero-shot cloning, streaming, and CPU inference in one model.
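
As a rough sanity check on the cost claim, the sketch below turns the $42/mo instance price and the 2.7–3.2× realtime CPU speed from the spec sheet into a cost per hour of generated audio. It assumes a single instance kept fully busy for a 730-hour month; the function names are illustrative and not part of TTS Lite.

```python
HOURS_PER_MONTH = 730  # assumption: average month, instance busy 100% of the time


def audio_hours_per_month(realtime_factor: float, hours: float = HOURS_PER_MONTH) -> float:
    """Hours of speech one instance can synthesize in a month at a given speed."""
    return realtime_factor * hours


def cost_per_audio_hour(monthly_cost_usd: float, realtime_factor: float) -> float:
    """Instance cost divided by the audio it can produce in a month."""
    return monthly_cost_usd / audio_hours_per_month(realtime_factor)


if __name__ == "__main__":
    for speed in (2.7, 3.2):  # realtime multiples from the published CPU figures
        hours = audio_hours_per_month(speed)
        usd = cost_per_audio_hour(42.0, speed)
        print(f"{speed}x realtime: {hours:.0f} audio hours/mo, ${usd:.4f} per audio hour")
```

Under these assumptions the result is roughly two cents per hour of audio. Real utilization will be lower than 100%, so treat that figure as a floor on cost rather than a forecast.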

Data

Over 1 million hours of speech

Trained on proprietary speech data from Voice.ai's production voice conversion platform, which serves millions of users.

Cadence

Better every week

We ship model updates on a fast cadence. The API stays the same: pull the latest checkpoint and your voices get better.

Enterprise
Need production-grade quality?
The full Voice.ai production model, with higher quality, lower latency, and priority infrastructure, is available via a commercial API.
Talk to us
What makes TTS Lite different.

Zero-shot cloning, streaming timestamps, and CPU deployment in one open model.

Zero-shot voice cloning

Drop in a reference audio clip. No transcript needed. TTS Lite uses the reference audio as the acoustic prompt for the generated speech.

Streaming + word highlighting

Word-level timestamps stream alongside audio in real time, so clients can drive highlighting, captions, and karaoke-style UI without a separate forced-alignment pass.
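
One way a client might drive highlighting from the streamed timestamps is sketched below. The event shape (`word`, `start_s`, `end_s`) and the `Highlighter` class are assumptions made for illustration; the wire format TTS Lite actually uses is not specified here.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class WordEvent:
    # Hypothetical event shape; the real stream format may differ.
    word: str
    start_s: float
    end_s: float


class Highlighter:
    """Collects word events as they stream in and answers 'which word is playing?'."""

    def __init__(self) -> None:
        self._events: List[WordEvent] = []
        self._starts: List[float] = []

    def add(self, event: WordEvent) -> None:
        # Events are assumed to arrive in playback order.
        self._events.append(event)
        self._starts.append(event.start_s)

    def active_word(self, playback_s: float) -> Optional[WordEvent]:
        # Find the last word whose start time is <= the playback position.
        i = bisect_right(self._starts, playback_s) - 1
        if i < 0:
            return None
        event = self._events[i]
        # Return nothing during the silence between words.
        return event if playback_s < event.end_s else None
```

A UI loop would call `add` for each incoming event and `active_word` with the audio player's current position on every animation frame.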

Paralinguistic tags

Write [laughs], [sighs], [coughs], [gasps], or [clears throat] inline. The model renders them as part of the generated speech.
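
Before sending a prompt, a client may want to check its inline tags for typos. The sketch below splits a prompt into text and tag segments and flags tags outside the five listed above. The helper names are hypothetical, and the model may accept tags beyond this set.

```python
import re
from typing import List, Tuple

# Tags named in the text above; the model may accept others.
KNOWN_TAGS = {"laughs", "sighs", "coughs", "gasps", "clears throat"}

_TAG_RE = re.compile(r"\[([^\[\]]+)\]")


def split_tags(text: str) -> List[Tuple[str, str]]:
    """Split a prompt into ("text", ...) and ("tag", ...) segments, in order."""
    segments: List[Tuple[str, str]] = []
    pos = 0
    for match in _TAG_RE.finditer(text):
        if match.start() > pos:
            segments.append(("text", text[pos:match.start()]))
        segments.append(("tag", match.group(1).strip().lower()))
        pos = match.end()
    if pos < len(text):
        segments.append(("text", text[pos:]))
    return segments


def unknown_tags(text: str) -> List[str]:
    """Return bracketed tags that are not in KNOWN_TAGS, e.g. to warn about typos."""
    return [value for kind, value in split_tags(text)
            if kind == "tag" and value not in KNOWN_TAGS]
```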

Alignment-gated EOS

A CTC alignment head tracks text progress at every decode step. EOS is suppressed until the head reports that all of the input text has been spoken, then forced immediately. This reduces early cutoffs, repetition, and trailing hallucination.
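
The gating rule can be sketched as a small logit mask. This illustrates the idea only and is not TTS Lite's actual implementation: how the CTC head's output becomes a text position is not described here, so `text_pos` is taken as a given input, and all names are hypothetical.

```python
import math
from typing import List


def gate_eos(logits: List[float], eos_id: int, text_pos: int, text_len: int) -> List[float]:
    """Return a copy of `logits` with EOS suppressed or forced.

    text_pos: how many text tokens the alignment head reports as spoken (assumed input).
    text_len: total number of text tokens in the prompt.
    """
    gated = list(logits)
    if text_pos < text_len:
        # Text not finished: make EOS impossible to sample.
        gated[eos_id] = -math.inf
    else:
        # Text finished: make EOS the only possible token.
        gated = [-math.inf] * len(logits)
        gated[eos_id] = 0.0
    return gated


def argmax(values: List[float]) -> int:
    """Index of the largest value (greedy pick)."""
    return max(range(len(values)), key=values.__getitem__)
```

With logits `[0.1, 2.0, 5.0]` and EOS at index 2, a greedy decoder picks token 1 while text remains, even though EOS has the highest raw score, and picks EOS as soon as `text_pos` reaches `text_len`.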

Open source

Open weights, inference code, Docker setup, and examples. Run it locally, inspect the stack, and deploy without a hosted-only dependency.

Runs on CPU, iOS, and GPU

CPU and iOS run through ONNX Runtime. GPU deployments use PyTorch or AOT .pt2 packages.

32 kHz
Clean output

Small models sound small.
TTS Lite doesn't.

We built VaiCodec so a 112M-parameter model can still produce full-frequency audio at 32 kHz, with clean output and low inference cost.

Quality that holds up.

In our benchmarks, TTS Lite posts the highest predicted MOS (3.34) and matches or exceeds every comparable external baseline on speaker similarity (0.80).

3.34
Predicted MOS

Highest predicted MOS among comparable external baselines.

0.80
Speaker similarity

Matches or exceeds every comparable external baseline at two-decimal precision.

Figure: Predicted MOS (y-axis) versus speaker similarity (x-axis). Voice AI TTS Lite appears at speaker similarity 0.80 and predicted MOS 3.34. External baselines appear lower or to the left: ElevenLabs Flash v2.5 at 0.78 and 3.30, PocketTTS at 0.80 and 3.29, and NeuTTS at 0.72 and 3.19.
See full benchmark details
The full sheet.
Parameters: ~112M
Architecture: Autoregressive transformer · 8 layers · 960 dims
Audio output: 32 kHz · 3-codebook RVQ (1024 / 4096 / 8192)
Codec framerate: 20 fps (50 ms per frame)
RTF (CPU): 0.31–0.37× on m6a.large, 2 vCPU (2.7–3.2× realtime)
First audio chunk: <200 ms (KV-cached reference)
Decode step: 13–18 ms mean
Zero-shot: Reference audio only, no transcript required
Word timestamps: Streaming · parallel to audio
Paralinguistic: [laughs] [sighs] [coughs] [gasps] [clears throat] and more
Deployment: CPU (ONNX) · iOS (ONNX) · GPU
Language: English
License: Apache 2.0
Codec: Coming soon
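
A few of the numbers above can be cross-checked with simple arithmetic. The sketch below derives the codec token bitrate from the codebook sizes and framerate, converts RTF to a realtime multiple, and estimates decode headroom per frame. The headroom figure assumes one decode step per 50 ms frame, which the sheet does not state; the function names are illustrative.

```python
CODEBOOK_SIZES = (1024, 4096, 8192)  # 3-codebook RVQ sizes from the spec sheet
FRAMES_PER_SECOND = 20               # codec framerate from the spec sheet


def bits_per_frame(codebook_sizes=CODEBOOK_SIZES) -> int:
    """Each codebook index needs ceil(log2(size)) bits."""
    return sum((size - 1).bit_length() for size in codebook_sizes)


def codec_bitrate_bps(codebook_sizes=CODEBOOK_SIZES, fps: int = FRAMES_PER_SECOND) -> int:
    """Raw token bitrate, before any entropy coding."""
    return bits_per_frame(codebook_sizes) * fps


def realtime_multiple(rtf: float) -> float:
    """RTF is generation time / audio duration, so speed is its reciprocal."""
    return 1.0 / rtf


def decode_headroom(step_ms: float, fps: int = FRAMES_PER_SECOND) -> float:
    """Frame duration over decode time, assuming one decode step per frame."""
    return (1000.0 / fps) / step_ms


if __name__ == "__main__":
    print(bits_per_frame(), "bits per frame")
    print(codec_bitrate_bps(), "bits per second")
    for rtf in (0.31, 0.37):
        print(f"RTF {rtf} -> {realtime_multiple(rtf):.1f}x realtime")
    for step_ms in (13, 18):
        print(f"{step_ms} ms step -> {decode_headroom(step_ms):.1f}x headroom")
```

The token stream works out to 35 bits per frame, or 700 bits per second, and the RTF range 0.31–0.37 maps to the quoted 2.7–3.2× realtime.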

Be first when it ships.

Leave your email and we'll send one note the day the repo goes public — nothing else.