The best zero-shot text-to-speech models today require GPUs. That makes them fast in a datacenter and useless everywhere else: on a phone, in a browser, at the edge, or anywhere you'd rather not rent H100-class GPU capacity at around $2–4 per GPU-hour.
Voice AI TTS Lite runs real-time inference on a small CPU instance, fits in a small memory footprint, and clones any voice from a short reference clip without requiring a transcript. In our benchmarks, the release model posts the highest predicted MOS and ties for the highest speaker similarity among comparable external baselines while keeping CPU-first deployment.
TTS Lite is an open-source 112M-parameter TTS model with zero-shot voice cloning, CPU inference, and streaming word-level timestamps. It runs faster than realtime on a $42/mo m6a.large CPU instance: 2 vCPUs, 8 GB RAM, no GPU.
Available now through the live demo. The GitHub release is coming soon; sign up to get notified.
TTS Lite is trained on over 1 million hours of proprietary speech data from Voice.ai's production voice conversion platform, which serves millions of users. This is the lite open-source checkpoint. The full production model is available through the Voice.ai API and delivers higher quality and lower latency for teams that need it. Contact us for enterprise access.
This is the first public checkpoint. We're shipping model updates on a fast cadence.
Quality snapshot
Headline metrics from the benchmark run, using shared evaluation samples across models.
- Predicted MOS (3.34): highest among comparable external baselines.
- Speaker similarity (0.80): competitive while preserving the CPU-first deployment target.
- Close to the strongest hosted baseline in this eval, with open local runtime control.
- WER 13.0% / CER 4.0%: measured by ASR transcript edit distance; lower is better.
Try it
Run a prompt against the live inference server, or review the local setup path in the GitHub release repo.
Interactive demo
Type anything. Hear it back in under 200 ms, streaming from an m6a.large CPU instance.
Local setup instructions and voice-agent examples live in the GitHub repo.
Why CPU matters
TTS Lite runs at a real-time factor (RTF, synthesis time divided by audio duration) of 0.31–0.37 on a single m6a.large CPU instance, equivalent to 2.7–3.2× realtime without GPU serving costs.
Streaming starts in <200 ms with a KV-cached reference, fast enough for responsive realtime voice-agent conversations.
| Model | Hardware | Compute price | RTF | Pricing note |
|---|---|---|---|---|
| Voice AI TTS Lite | m6a.large (2 vCPU) | ~$0.086/hr on demand; ~$0.057/hr reserved | 0.31–0.37× | About $42/mo on 1-year no-upfront reserved pricing |
| GPU model (typical) | A100 / H100 class cloud GPU | Usually dollars per hour | varies | GPU-dependent serving economics |
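To make the arithmetic behind these numbers explicit, here is a small sketch that converts RTF into a realtime multiple and into compute cost per hour of generated audio. The prices and RTF range are the ones quoted in this section; the helper names and the 730-hours-per-month billing approximation are assumptions for illustration, and the cost figure assumes a single instance kept fully busy.

```python
HOURS_PER_MONTH = 730  # assumption: common cloud-billing approximation


def realtime_multiple(rtf: float) -> float:
    """RTF is synthesis time / audio duration, so 1 / RTF is the speed-up over realtime."""
    return 1.0 / rtf


def cost_per_audio_hour(price_per_hour: float, rtf: float) -> float:
    """Compute cost of synthesizing one hour of audio on one fully busy instance."""
    return price_per_hour * rtf


RESERVED_PRICE = 0.057  # USD per hour, reserved price quoted in this section

for rtf in (0.31, 0.37):
    print(
        f"RTF {rtf}: {realtime_multiple(rtf):.1f}x realtime, "
        f"${cost_per_audio_hour(RESERVED_PRICE, rtf):.3f} per audio hour"
    )

print(f"Reserved monthly cost: about ${RESERVED_PRICE * HOURS_PER_MONTH:.0f}")
```

At the quoted reserved price this works out to roughly two cents of compute per hour of synthesized audio, before any overhead for idle time or concurrency.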
Benchmark results
We compared TTS Lite against a set of external baselines on shared evaluation samples. The chart plots predicted MOS against speaker similarity; the table keeps the full metric view.
| Model / provider | Predicted MOS | PESQ | SI-SDR | Speaker sim | WER | CER |
|---|---|---|---|---|---|---|
| Voice AI TTS Lite | 3.34 | 3.71 | 24.52 | 0.80 | 13.0% | 4.0% |
| ElevenLabs Flash v2.5 | 3.30 | 3.78 | 26.57 | 0.78 | 12.7% | 3.8% |
| PocketTTS | 3.29 | 3.27 | 22.01 | 0.80 | 11.0% | 3.8% |
| NeuTTS | 3.19 | 3.55 | 20.85 | 0.72 | 22.7% | 9.8% |
We measure generated speech on three axes:
- Speaker similarity: speaker embeddings are extracted from both the reference and generated audio and compared via cosine similarity. Captures clone fidelity.
- Predicted MOS: UTMOS, an automatic predictor trained on human listening data, scores each sample from 1 to 5. Captures perceived naturalness. These are predicted scores, not human MOS.
- WER / CER: generated audio is transcribed back to text using Whisper large-v3 and compared to the original input via edit distance (jiwer). Captures intelligibility.
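For readers who want to see what the similarity and error-rate metrics compute, the sketch below implements cosine similarity and edit-distance-based WER and CER in plain Python. It is a dependency-free stand-in, not the evaluation pipeline itself: the benchmark uses jiwer and Whisper large-v3, and this version applies no text normalization. All function names are illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance over reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


print(wer("the cat sat on the mat", "the cat sat on a mat"))
```

Here the reference is the input text and the hypothesis is the ASR transcript of the generated audio, so one substituted word in a six-word prompt gives a WER of 1/6.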
Architecture
TTS Lite is an autoregressive transformer that generates audio tokens from text. The reference clip is encoded with the audio codec and used as an acoustic prompt.
VaiCodec: short sequences, clean audio
TTS Lite uses VaiCodec, a neural audio codec with a 3-codebook residual vector quantizer (codebook sizes 1024 / 4096 / 8192) operating at 20 fps with a 32 kHz ISTFT decoder. Keeping the codec stream short reduces autoregressive decode work, which helps TTS Lite stay real-time on CPU.
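A quick back-of-the-envelope calculation shows how short that stream is. The frame rate, sample rate, and codebook sizes below come from the paragraph above; the derived quantities are simple arithmetic, and the constant names are illustrative.

```python
import math

FRAME_RATE_HZ = 20
SAMPLE_RATE_HZ = 32_000
CODEBOOK_SIZES = (1024, 4096, 8192)

# Each frame carries one code per codebook.
codes_per_second = FRAME_RATE_HZ * len(CODEBOOK_SIZES)

# Audio samples the decoder reconstructs from a single frame.
samples_per_frame = SAMPLE_RATE_HZ // FRAME_RATE_HZ

# A codebook of size N needs log2(N) bits per code.
bits_per_frame = sum(int(math.log2(size)) for size in CODEBOOK_SIZES)
bitrate_bps = bits_per_frame * FRAME_RATE_HZ

print(codes_per_second)   # 60
print(samples_per_frame)  # 1600
print(bits_per_frame)     # 35
print(bitrate_bps)        # 700
```

Whether 60 codes per second translates into 20 or 60 autoregressive steps depends on how the three codebooks are predicted within a frame, which this post does not specify.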
Alignment-gated generation
A CTC alignment head tracks progress through the input at every decode step, producing word-level timestamps in real time as audio streams out. This enables lip-sync, subtitle generation, and viseme extraction without a separate forced-alignment pass. Most TTS models either don't support word timestamps at all or require an expensive post-processing step.
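As an illustration of how frame-level alignment can become word timestamps, the sketch below assumes the alignment head yields, for each generated frame, the index of the word being spoken. That output format is a hypothetical simplification, not the model's documented interface; only the 20 fps frame rate comes from this post. The function is written in batch form for clarity, though the same logic can run incrementally as frames stream out.

```python
FRAME_RATE_HZ = 20  # codec frame rate stated in this post


def word_timestamps(words, frame_word_index, frame_rate_hz=FRAME_RATE_HZ):
    """Turn a per-frame word index into (word, start_s, end_s) spans.

    frame_word_index[i] is the index of the word active at frame i
    (hypothetical alignment output, assumed non-decreasing).
    """
    spans = []
    for frame, idx in enumerate(frame_word_index):
        start = frame / frame_rate_hz
        end = (frame + 1) / frame_rate_hz
        if spans and spans[-1][0] == idx:
            spans[-1][2] = end  # same word continues: extend its end time
        else:
            spans.append([idx, start, end])  # a new word begins at this frame
    return [(words[idx], start, end) for idx, start, end in spans]


for word, start, end in word_timestamps(["hello", "world"], [0, 0, 0, 1, 1]):
    print(f"{word}: {start:.2f}s to {end:.2f}s")
```

At 20 fps each frame spans 50 ms, which sets the timestamp resolution in this simplified view.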
The same alignment head also gates generation. While text remains unspoken, end-of-speech is suppressed entirely. Once all text is covered, EOS is forced. The practical effect: no early stopping, no hallucinated repetition, no trailing silence.
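The gating rule can be sketched as a mask applied to the logits before sampling. This is a minimal illustration of the behavior described above, not the model's actual code: the function name, the `words_covered` progress counter, and the way progress is derived from the CTC head are all assumptions.

```python
import math


def gate_eos(logits, eos_id, words_covered, total_words):
    """Return a copy of `logits` with end-of-speech gated by alignment progress."""
    if words_covered < total_words:
        gated = list(logits)
        gated[eos_id] = -math.inf  # text remains unspoken: EOS cannot be sampled
        return gated
    # All text is covered: EOS becomes the only token that can be sampled.
    gated = [-math.inf] * len(logits)
    gated[eos_id] = 0.0
    return gated


logits = [0.1, 2.0, 0.3]  # toy vocabulary where token 1 is EOS
print(gate_eos(logits, eos_id=1, words_covered=2, total_words=5))
print(gate_eos(logits, eos_id=1, words_covered=5, total_words=5))
```

Masking with negative infinity means the suppressed token gets zero probability after softmax, so the stopping decision follows the alignment rather than the sampler.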
What's next
- Expanded technical report: larger eval sets, human listening studies, and variable-length reference support.
- English only for now. Multilingual in development.
- Weekly model updates — subscribe to the repo for checkpoint drops.
Build with it
Everything you need to ship a production voice product.
Need production quality or enterprise support?
The full Voice.ai production model delivers higher quality and lower latency. Contact us for commercial access.