✦ Open Source · Voice AI TTS Lite

Cloud-quality text-to-speech.
Without the cloud bill.

112M parameters. Runs on CPU and on-device on iOS. Streams with word-level timestamps. Zero-shot cloning with no reference transcript.

Try the Demo · View on GitHub (coming soon)

View Benchmarks · Top-rated among open TTS models

Live Demo

Try it. Right now.

This is the actual inference server, hosted on an m6a.large instance. Pick a sample or write your own prompt.

Listen

Hear it for yourself.

Zero-shot cloning. 32 kHz. All from a 112M-parameter model running on CPU.

Conversational

"And just like that, we shipped it. On a Friday. Into production."

Zero-shot clone · 32 kHz · CPU inference

Technical narration

"This model runs entirely on CPU. Two vCores. No GPU. Faster than realtime."

Zero-shot clone · 32 kHz · CPU inference

Expressive with tags

"Wait, you're telling me this whole thing runs on a laptop CPU?"

Zero-shot clone · CPU inference

Long-form

"The lighthouse keeper kept a journal for forty years. Only twice did he mention loneliness — once on a Tuesday after the storm, and once on the page where he stopped writing."

Zero-shot clone · Narrative · Varied prosody

100×

Cheaper than hosted TTS

Run it on a $42/mo m6a.large CPU instance. No per-token fees, no usage caps. Zero-shot cloning, streaming, and CPU inference in one model.
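
As a rough back-of-the-envelope check on the instance cost, the sketch below converts the $42/mo price and the 2.7 to 3.2× realtime throughput from the spec sheet into a cost per hour of generated audio. The 730-hour month and the assumption that a single stream keeps the instance fully busy are illustrative, not published figures, and the sketch says nothing about what any hosted service charges.

```python
# Back-of-the-envelope cost per hour of generated audio on one CPU instance.
# Assumptions (not from the page): 730 wall-clock hours per month and a
# utilization factor supplied by the caller (1.0 = busy around the clock).
INSTANCE_USD_PER_MONTH = 42.0   # from the page
HOURS_PER_MONTH = 730.0         # assumption: average month length
REALTIME_SPEEDUP = (2.7, 3.2)   # from the spec sheet: audio seconds per wall second


def usd_per_audio_hour(speedup: float, utilization: float = 1.0) -> float:
    """Monthly instance cost divided by the hours of audio generated in that month."""
    audio_hours = HOURS_PER_MONTH * speedup * utilization
    return INSTANCE_USD_PER_MONTH / audio_hours


if __name__ == "__main__":
    for speedup in REALTIME_SPEEDUP:
        cost = usd_per_audio_hour(speedup)
        print(f"{speedup:.1f}x realtime: ${cost:.4f} per audio hour")
```

At full utilization this works out to roughly two cents per audio hour. Real workloads are bursty, so passing a lower `utilization` gives a more realistic figure; at 10% utilization the cost per audio hour is ten times higher.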

Data

Over 1 million hours of speech

Trained on proprietary speech data from Voice.ai's production voice conversion platform, which serves millions of users.

Cadence

Better every week

We ship model updates on a fast cadence. Same API - pull the latest checkpoint and your voices get better.

Enterprise

Need production-grade quality?

The full Voice.ai production model - higher quality, lower latency, priority infrastructure - is available via commercial API.

Talk to us

Features

What makes TTS Lite different.

Zero-shot cloning, streaming timestamps, and CPU deployment in one open model.

Zero-shot voice cloning

Drop in a reference audio clip. No transcript needed. TTS Lite uses the reference audio as the acoustic prompt for the generated speech.

Streaming + word highlighting

Word-level timestamps stream alongside audio in real time, so clients can drive highlighting, captions, and karaoke-style UI without a separate forced-alignment pass.
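
A minimal sketch of how a client might use those timestamps to pick the word to highlight at a given playback time. The `WordStamp` shape (word, start, end in seconds) and the sample values are hypothetical; the actual wire format of TTS Lite's timestamp stream is not described here.

```python
from bisect import bisect_right
from typing import List, NamedTuple, Optional


class WordStamp(NamedTuple):
    """Hypothetical timestamp record; field names are assumptions."""
    word: str
    start: float  # seconds from the start of the audio
    end: float    # seconds from the start of the audio


def active_word(stamps: List[WordStamp], t: float) -> Optional[int]:
    """Index of the word being spoken at playback time t, or None in a gap.

    Assumes stamps are sorted by start time, which holds if they are
    appended in the order they arrive from the stream.
    """
    starts = [s.start for s in stamps]
    i = bisect_right(starts, t) - 1  # last word whose start is <= t
    if i >= 0 and t < stamps[i].end:
        return i
    return None


if __name__ == "__main__":
    # Illustrative values only, not real model output.
    stamps = [
        WordStamp("Wait,", 0.00, 0.30),
        WordStamp("you're", 0.35, 0.60),
        WordStamp("telling", 0.60, 0.95),
    ]
    for t in (0.10, 0.32, 0.60, 1.00):
        print(t, active_word(stamps, t))
```

Because new words are only ever appended, the same lookup works while audio is still streaming; a production client would likely cache the `starts` list rather than rebuild it on every call.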

Built for voice agents

Sub-200 ms first-chunk latency and CPU inference make TTS Lite a fit for real-time conversational agents. Stream responses as they generate, run it next to your logic on the same box, and skip the per-token cloud bill that kills agent unit economics.

Alignment-gated EOS

A CTC alignment head tracks text progress at every decode step. EOS is suppressed until the text is spoken, then forced immediately, reducing cutoffs, repetition, and trailing hallucination.
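
The gating idea can be sketched as a logit mask applied before sampling at each decode step. This is an illustration of the general technique, not the TTS Lite implementation: the function name, the `text_pos` / `text_len` inputs, and the way the alignment head's output is reduced to a single position are all assumptions.

```python
import math
from typing import List


def gate_eos(logits: List[float], eos_id: int, text_pos: int, text_len: int) -> List[float]:
    """Return a copy of the logits with EOS suppressed or forced.

    text_pos is the number of text tokens the alignment head considers
    already spoken (hypothetical input); text_len is the total token count.
    """
    if text_pos < text_len:
        # Text remains: EOS cannot be sampled, other tokens are untouched.
        gated = list(logits)
        gated[eos_id] = -math.inf
    else:
        # Text finished: EOS is the only token that can be sampled.
        gated = [-math.inf] * len(logits)
        gated[eos_id] = 0.0
    return gated


if __name__ == "__main__":
    logits = [1.0, 2.0, 3.0]  # toy vocabulary; index 2 plays the role of EOS
    print(gate_eos(logits, eos_id=2, text_pos=3, text_len=5))
    print(gate_eos(logits, eos_id=2, text_pos=5, text_len=5))
```

Suppressing EOS early addresses cutoffs, while forcing it once the text is consumed addresses repetition and trailing audio, which matches the two failure modes named above.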

Open source

Open weights, inference code, Docker setup, and examples. Run it locally, inspect the stack, and deploy without a hosted-only dependency.

Runs on CPU, iOS, and GPU

CPU and iOS run through ONNX Runtime. GPU deployments use PyTorch or AOT .pt2 packages.

32 kHz

Clean output

Small models sound small.
TTS Lite doesn't.

We built VaiCodec so a 112M-parameter model can still produce full-frequency audio at 32 kHz, with clean output and low inference cost.

Benchmarks

Quality that holds up.

In our benchmarks, TTS Lite posts the highest predicted MOS and speaker similarity among comparable external baselines.

3.34

Predicted MOS

Highest predicted MOS among comparable external baselines.

0.80

Speaker similarity

Highest speaker similarity among comparable external baselines.

See full benchmark details

Specifications

The full sheet.

Parameters	~112M
Architecture	Autoregressive transformer · 8 layers · 960 dims
Audio output	32 kHz · 3-codebook RVQ (1024 / 4096 / 8192)
Codec framerate	20 fps (50 ms per frame)
RTF (CPU)	0.31–0.37× on m6a.large, 2 vCPU (2.7–3.2× realtime)
First audio chunk	<200 ms (KV-cached reference)
Decode step	13–18 ms mean
Zero-shot	Reference audio only - no transcript required
Word timestamps	Streaming · parallel to audio
Paralinguistic	[sighs] [coughs] [gasps] [clears throat] and more
Deployment	CPU (ONNX) · iOS (ONNX) · GPU
Language	English
License	Apache 2.0
Codec	Coming soon
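
A few of the figures in the sheet can be cross-checked with simple arithmetic. The sketch below assumes that 1024 / 4096 / 8192 are the number of entries in each RVQ codebook and that each frame carries one index per codebook; under that reading it derives the token bitrate, the frame duration, and the realtime multiple implied by the RTF range.

```python
# Arithmetic cross-check of the spec sheet. The interpretation of the
# codebook numbers (entries per codebook, one index each per frame) is
# an assumption, not something the sheet states.
CODEBOOK_SIZES = (1024, 4096, 8192)  # from the spec sheet
FRAMES_PER_SECOND = 20               # from the spec sheet
RTF_RANGE = (0.31, 0.37)             # from the spec sheet


def bits_for(size: int) -> int:
    """Bits needed to index a codebook with `size` entries."""
    return (size - 1).bit_length()


def realtime_speedup(rtf: float) -> float:
    """Seconds of audio produced per second of compute."""
    return 1.0 / rtf


bits_per_frame = sum(bits_for(n) for n in CODEBOOK_SIZES)  # 10 + 12 + 13
bitrate_bps = bits_per_frame * FRAMES_PER_SECOND
frame_ms = 1000 / FRAMES_PER_SECOND

if __name__ == "__main__":
    print(f"bits per frame: {bits_per_frame}")
    print(f"token bitrate:  {bitrate_bps} bit/s")
    print(f"frame duration: {frame_ms:.0f} ms")
    for rtf in RTF_RANGE:
        print(f"RTF {rtf}: {realtime_speedup(rtf):.1f}x realtime")
```

Under these assumptions the codec stream is 35 bits per frame, or 700 bit/s, and the RTF range of 0.31 to 0.37 corresponds to the 2.7 to 3.2× realtime figure quoted in the same row.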

Be first when it ships.

Leave your email and we'll send one note the day the repo goes public — nothing else.
