TTS Lite Research: CPU-First Open-Source TTS

The best zero-shot text-to-speech models today require GPUs. That makes them fast in a datacenter and useless everywhere else: on a phone, in a browser, at the edge, or anywhere you'd rather not rent H100-class GPU capacity at around $2–4 per GPU-hour.

Voice AI TTS Lite runs real-time inference on a small CPU instance, fits in a small memory footprint, and clones any voice from a short reference clip without requiring a transcript. In our benchmarks, the release model posts the highest predicted MOS among comparable external baselines and ties for the highest speaker similarity, all while keeping CPU-first deployment.

TL;DR

TTS Lite is an open-source 112M-parameter TTS model with zero-shot voice cloning, CPU inference, and streaming word-level timestamps. It runs at 2.7–3.2× realtime on an m6a.large CPU instance (2 vCPUs, 8 GB RAM, no GPU) that costs about $42/mo on 1-year reserved pricing.

Available now through the live demo. The GitHub release is coming soon.

TTS Lite is trained on over 1 million hours of proprietary speech data from Voice.ai's production voice conversion platform, which serves millions of users. This is the lite open-source checkpoint. The full production model is available through the Voice.ai API and delivers higher quality and lower latency for teams that need it. Contact us for enterprise access.

This is the first public checkpoint. We're shipping model updates on a fast cadence.

Quality snapshot

Headline metrics from the benchmark run, using shared evaluation samples across models.

Naturalness proxy

Predicted MOS 3.34

Highest predicted MOS among comparable external baselines.

Clone fidelity

SIM 0.80

Competitive speaker similarity while preserving the CPU-first deployment target.

Audio quality

PESQ 3.71

Close to the strongest hosted baseline in this eval, with open local runtime control.

Intelligibility

WER 13.0%

Measured by ASR transcript edit distance; lower is better.

Try it

Run a prompt against the live inference server, or review the local setup path in the GitHub release repo.

Interactive demo

Type anything. The first audio streams back in under 200 ms from an m6a.large CPU instance.


Local setup instructions and voice-agent examples live in the GitHub repo.

Why CPU matters

TTS Lite runs at 0.31–0.37× RTF on a single m6a.large CPU instance, equivalent to 2.7–3.2× realtime without GPU serving costs.
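
The conversion between RTF and "times realtime" is a reciprocal. As a minimal sketch (the function names are illustrative, not part of any TTS Lite API), the quoted range can be checked like this:

```python
def realtime_speedup(rtf: float) -> float:
    """Seconds of audio produced per second of compute (the reciprocal of RTF)."""
    if rtf <= 0:
        raise ValueError("RTF must be positive")
    return 1.0 / rtf


def synthesis_seconds(audio_seconds: float, rtf: float) -> float:
    """Compute time needed to synthesize a clip of the given duration."""
    return audio_seconds * rtf


if __name__ == "__main__":
    for rtf in (0.31, 0.37):
        print(
            f"RTF {rtf:.2f}: {realtime_speedup(rtf):.1f}x realtime, "
            f"{synthesis_seconds(10.0, rtf):.1f} s of compute for 10 s of audio"
        )
```

An RTF below 1.0 means audio is generated faster than it plays back, which is the condition for uninterrupted streaming.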

Streaming starts in <200 ms with a KV-cached reference, fast enough for responsive realtime voice-agent conversations.

Model	Hardware	Compute price	RTF	Pricing note
Voice AI TTS Lite	`m6a.large` (2 vCPU)	~$0.086/hr on demand; ~$0.057/hr reserved	0.31–0.37×	About $42/mo on 1-year no-upfront reserved pricing
GPU model (typical)	A100 / H100 class cloud GPU	Usually dollars per hour	varies	GPU-dependent serving economics
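
The monthly figure follows from the hourly price. The sketch below reproduces it and also estimates instance cost per hour of generated audio. It rests on two assumptions not stated in the table: a 730-hour month, and a single stream kept fully busy. The function names are illustrative.

```python
HOURS_PER_MONTH = 730  # assumption: 8,760 hours per year divided by 12


def monthly_cost(hourly_usd: float) -> float:
    """Cost of keeping one instance running for a month."""
    return hourly_usd * HOURS_PER_MONTH


def cost_per_audio_hour(hourly_usd: float, rtf: float) -> float:
    """Instance cost for one hour of generated audio.

    Assumes one stream at full utilization: an hour of audio takes
    `rtf` hours of compute.
    """
    return hourly_usd * rtf


if __name__ == "__main__":
    print(f"reserved:  ${monthly_cost(0.057):.2f}/mo")
    print(f"on demand: ${monthly_cost(0.086):.2f}/mo")
    print(f"worst-case RTF 0.37: ${cost_per_audio_hour(0.057, 0.37):.4f} per audio hour")
```

Real serving cost per audio hour will be higher whenever the instance sits idle, so treat the last figure as a lower bound.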

Benchmark results

We compared TTS Lite against a set of external baselines on shared evaluation samples. The chart plots predicted MOS against speaker similarity; the table keeps the full metric view.

[Chart: predicted MOS vs. speaker similarity. Voice AI TTS Lite: predicted MOS 3.34, speaker similarity 0.80.]

Model / provider	Predicted MOS	PESQ	SI-SDR	Speaker sim	WER	CER
Voice AI TTS Lite	3.34	3.71	24.52	0.80	13.0%	4.0%
ElevenLabs Flash v2.5	3.30	3.78	26.57	0.78	12.7%	3.8%
PocketTTS	3.29	3.27	22.01	0.80	11.0%	3.8%
NeuTTS	3.19	3.55	20.85	0.72	22.7%	9.8%

We score generated speech on three headline axes; the table above also reports PESQ, SI-SDR, and CER.

SIM

Speaker similarity

↑ higher is better

Speaker embeddings are extracted from both the reference and the generated audio and compared via cosine similarity. Captures clone fidelity.
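
A minimal sketch of the comparison step, assuming the two embeddings are already available as equal-length float vectors. The speaker-embedding model itself is not named here, so the vectors below are toy values.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    reference_embedding = [0.2, 0.7, 0.1]  # toy values, not real embeddings
    generated_embedding = [0.25, 0.6, 0.2]
    print(f"SIM = {cosine_similarity(reference_embedding, generated_embedding):.2f}")
```

Because the score depends only on direction, scaling either embedding leaves it unchanged.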

MOS

Mean opinion score

↑ higher is better

UTMOS, an automatic predictor trained on human listening data, scores each sample from 1 to 5. Captures perceived naturalness. These are predicted scores, not human MOS.

WER

Word error rate

↓ lower is better

Generated audio is transcribed back to text using Whisper large-v3 and compared to the original input via edit distance (jiwer). Captures intelligibility.
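
For readers unfamiliar with the metric, here is a hand-rolled sketch of word-level WER. It is not the jiwer implementation used in the eval, and the text normalization (lowercasing, whitespace splitting) is an assumption; the benchmark's normalization is not specified here.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance over tokens: substitutions, insertions, deletions."""
    prev = list(range(len(hyp) + 1))
    for i, ref_tok in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hyp, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance between word sequences, divided by the reference length."""
    ref_words = reference.lower().split()  # assumed normalization
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


if __name__ == "__main__":
    input_text = "the quick brown fox"
    asr_transcript = "the quick brown box"  # stand-in for an ASR transcript
    print(f"WER = {word_error_rate(input_text, asr_transcript):.0%}")
```

CER is the same computation over characters instead of words.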

Architecture

TTS Lite is an autoregressive transformer that generates audio tokens from text. The reference clip is encoded with the audio codec and used as an acoustic prompt.

Text prompt: text tokens
Reference audio: codec-encoded acoustic prompt
TTS Lite: autoregressive transformer
Generated tokens: 3 codebooks · 20 fps
VaiCodec decoder: RVQ · ISTFT
Audio: 32 kHz · streaming

VaiCodec: short sequences, clean audio

TTS Lite uses VaiCodec, a neural audio codec with a 3-codebook residual vector quantizer (codebook sizes 1024 / 4096 / 8192) operating at 20 fps with a 32 kHz ISTFT decoder. Keeping the codec stream short reduces autoregressive decode work, which helps TTS Lite stay real-time on CPU.
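
The codec figures above imply a few quantities worth spelling out. This is arithmetic on the stated numbers only. Whether the transformer emits all three codebooks in a single decode step is not stated, so the sketch counts tokens rather than decode steps.

```python
import math

SAMPLE_RATE_HZ = 32_000
FRAME_RATE_FPS = 20
CODEBOOK_SIZES = (1024, 4096, 8192)

# Audio samples reconstructed by the decoder for each codec frame.
samples_per_frame = SAMPLE_RATE_HZ // FRAME_RATE_FPS

# Codec tokens per second of audio, summed across codebooks.
tokens_per_second = FRAME_RATE_FPS * len(CODEBOOK_SIZES)

# Information carried per frame: log2 of each codebook size, summed.
bits_per_frame = sum(math.log2(size) for size in CODEBOOK_SIZES)
bitrate_bps = bits_per_frame * FRAME_RATE_FPS


def frames_for(seconds: float) -> int:
    """Number of codec frames needed to cover a clip of the given duration."""
    return math.ceil(seconds * FRAME_RATE_FPS)


if __name__ == "__main__":
    print(f"{samples_per_frame} samples per frame")
    print(f"{tokens_per_second} tokens per second")
    print(f"{bits_per_frame:.0f} bits per frame, {bitrate_bps:.0f} bit/s")
    print(f"{frames_for(10.0)} frames for a 10 s clip")
```

At 20 fps, a 10-second utterance is only 200 frames, which is what keeps the autoregressive loop short.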

Alignment-gated generation

A CTC alignment head tracks progress through the input at every decode step, producing word-level timestamps in real time as audio streams out. This enables lip-sync, subtitle generation, and viseme extraction without a separate forced-alignment pass. Most TTS models either don't support word timestamps at all or require an expensive post-processing step.
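
To illustrate how per-frame alignment turns into timestamps, the sketch below assumes the alignment head yields, for each generated frame, the index of the word being spoken. That output format and the function name are hypothetical; only the 20 fps frame rate comes from the codec description.

```python
from typing import Dict, List, Sequence, Tuple

FRAME_RATE_FPS = 20  # codec frame rate: one frame covers 50 ms


def word_timestamps(
    words: Sequence[str], frame_word_index: Sequence[int]
) -> List[Tuple[str, float, float]]:
    """Collapse per-frame word indices into (word, start_s, end_s) spans."""
    spans: Dict[int, Tuple[int, int]] = {}
    for frame, word_idx in enumerate(frame_word_index):
        start, _ = spans.get(word_idx, (frame, frame))
        spans[word_idx] = (start, frame + 1)  # end is exclusive
    return [
        (words[idx], start / FRAME_RATE_FPS, end / FRAME_RATE_FPS)
        for idx, (start, end) in sorted(spans.items())
    ]


if __name__ == "__main__":
    # Toy alignment: 4 frames on "hello", then 6 frames on "world".
    alignment = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
    for word, start, end in word_timestamps(["hello", "world"], alignment):
        print(f"{word}: {start:.2f}-{end:.2f} s")
```

In a streaming client the same bookkeeping can run incrementally, emitting a word's span as soon as the alignment moves on to the next word.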

The same alignment head also gates generation. While text remains unspoken, end-of-speech is suppressed entirely. Once all text is covered, EOS is forced. The practical effect: no early stopping, no hallucinated repetition, no trailing silence.
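
The gating rule can be sketched as a mask on the next-token logits. This is an illustration of the idea, not the model's implementation: the function name, the word-count progress signal, and the hard switch between suppressing and forcing EOS are all assumptions.

```python
import math
from typing import List


def gate_eos(
    logits: List[float], eos_id: int, words_covered: int, total_words: int
) -> List[float]:
    """Return logits with EOS suppressed or forced based on alignment progress."""
    if words_covered < total_words:
        # Text remains unspoken: EOS can never be selected.
        gated = list(logits)
        gated[eos_id] = -math.inf
        return gated
    # All text is covered: EOS is the only selectable token.
    gated = [-math.inf] * len(logits)
    gated[eos_id] = 0.0
    return gated


def argmax(values: List[float]) -> int:
    return max(range(len(values)), key=values.__getitem__)


if __name__ == "__main__":
    EOS = 1
    # Mid-utterance: the model "wants" EOS (logit 5.0) but is prevented.
    print(argmax(gate_eos([0.1, 5.0, 0.3], EOS, words_covered=2, total_words=5)))
    # Text finished: the model prefers token 0, but EOS is forced.
    print(argmax(gate_eos([9.0, -3.0, 0.3], EOS, words_covered=5, total_words=5)))
```

Masking before sampling means the guarantee holds for any decoding strategy, greedy or sampled, since a token with a logit of negative infinity has zero probability.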

Component	Specification
Transformer	8 layers · 960 dims
Parameters	~112M
Audio codec	VaiCodec · 3-codebook RVQ · 20 fps · 32 kHz ISTFT decoder
First audio chunk	<200 ms (KV-cached reference)
Alignment	CTC head · EOS gating · forced stop
Paralinguistic	[laughs] · [sighs] · [coughs] · [gasps] · [clears throat]
Deployment	GPU (eager / AOT) · CPU (ONNX) · iOS (ONNX)

What's next

Expanded technical report: larger eval sets, human listening studies, and variable-length reference support.
English only for now. Multilingual in development.
Weekly model updates — subscribe to the repo for checkpoint drops.

Build with it

Everything you need to ship a production voice product.

GitHub (coming soon)

Source, README, examples, and release notes.

Model release (coming soon)

Hugging Face model card and checkpoints.

Hosted API

Voice.ai TTS API quickstart and streaming docs.

Local deployment docs

Run TTS Lite locally with Docker, ONNX CPU, and downloaded model assets.

Platform examples

Voice.ai API examples for TTS, streaming, voice cloning, agents, and integrations.

Examples

Streaming, paralinguistic, and viseme example clients.

Need production quality or enterprise support?

The full Voice.ai production model delivers higher quality and lower latency. Contact us for commercial access.
