Cloud-quality text-to-speech.
Without the cloud bill.
112M parameters. Runs on CPU and on-device on iOS. Streams with word-level timestamps. Zero-shot cloning with no reference transcript.
This demo runs on the actual inference server, hosted on an m6a.large instance. Pick a sample or write your own prompt.
Zero-shot cloning. Paralinguistic tags. 32 kHz. All from a 112M-parameter model running on CPU.
Cheaper than hosted TTS
Run it on a $42/mo m6a.large CPU instance. No per-token fees, no usage caps. Zero-shot cloning, streaming, and CPU inference in one model.
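As a rough back-of-envelope sketch, here is what the $42/mo figure could imply per hour of generated audio. It assumes a 730-hour month, a single stream on the instance, and the 0.31–0.37 RTF range from the spec table; the `utilization` parameter and function name are our own illustration, not part of any shipped tooling.

```python
HOURS_PER_MONTH = 730       # assumption: average month length in hours
INSTANCE_COST_USD = 42.0    # monthly m6a.large price quoted above


def cost_per_audio_hour(rtf: float, utilization: float = 1.0) -> float:
    """USD per hour of generated audio at a given real-time factor.

    rtf: compute seconds needed per second of audio (lower is faster).
    utilization: fraction of the month the instance is actually synthesizing.
    """
    if rtf <= 0 or not 0 < utilization <= 1:
        raise ValueError("rtf must be > 0 and utilization in (0, 1]")
    audio_hours = HOURS_PER_MONTH * utilization / rtf
    return INSTANCE_COST_USD / audio_hours


best = cost_per_audio_hour(0.31)    # fastest quoted RTF
worst = cost_per_audio_hour(0.37)   # slowest quoted RTF
print(f"${best:.3f} to ${worst:.3f} per audio hour at full utilization")
```

Under these assumptions the cost lands at roughly two cents per audio hour when the instance is busy around the clock. At 10% utilization the same arithmetic gives ten times that.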
Over 1 million hours of speech
Trained on proprietary speech data from Voice.ai's production voice conversion platform, which serves millions of users.
Better every week
We ship model updates on a fast cadence. Same API - pull the latest checkpoint and your voices get better.
Zero-shot cloning, streaming timestamps, and CPU deployment in one open model.
Zero-shot voice cloning
Drop in a reference audio clip. No transcript needed. TTS Lite uses the reference audio as the acoustic prompt for the generated speech.
Streaming + word highlighting
Word-level timestamps stream alongside audio in real time, so clients can drive highlighting, captions, and karaoke-style UI without a separate forced-alignment pass.
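A minimal client-side sketch of how streamed timestamps could drive highlighting. The event shape (`word`, `start`, `end` in seconds) and the `Highlighter` class are hypothetical, chosen for illustration; the actual stream format may differ.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class WordStamp:
    word: str
    start: float  # seconds from the start of the audio
    end: float


class Highlighter:
    """Maps a playback position to the word being spoken."""

    def __init__(self) -> None:
        self._stamps: List[WordStamp] = []
        self._starts: List[float] = []

    def push(self, stamp: WordStamp) -> None:
        # Assumes stamps arrive in playback order, as a stream would deliver them.
        self._stamps.append(stamp)
        self._starts.append(stamp.start)

    def current_index(self, playback_s: float) -> Optional[int]:
        # Find the last word that started at or before the playback position.
        i = bisect_right(self._starts, playback_s) - 1
        if i < 0 or playback_s >= self._stamps[i].end:
            return None  # before the first word, in a pause, or past the end
        return i


hl = Highlighter()
hl.push(WordStamp("hello", 0.0, 0.4))
hl.push(WordStamp("world", 0.5, 0.9))
```

Calling `hl.current_index(t)` on every animation frame with the audio element's current time is enough to move a highlight. Binary search keeps the lookup cheap for long passages.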
Paralinguistic tags
Write [laughs], [sighs], [coughs], [gasps], or [clears throat] inline. The model renders them as part of the generated speech.
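A small pre-flight check can catch misspelled tags before a prompt is sent. This is a client-side sketch only: the tag set below is just the five tags named here (the full list is longer), and `split_tags` and `unknown_tags` are hypothetical helpers, not part of the model's API.

```python
import re
from typing import List, Tuple

# Only the tags named on this page; the model may accept more.
KNOWN_TAGS = {"laughs", "sighs", "coughs", "gasps", "clears throat"}
TAG_RE = re.compile(r"\[([a-z ]+)\]")


def split_tags(text: str) -> List[Tuple[str, str]]:
    """Split a prompt into ordered ("text", ...) and ("tag", ...) segments."""
    segments: List[Tuple[str, str]] = []
    pos = 0
    for match in TAG_RE.finditer(text):
        chunk = text[pos:match.start()].strip()
        if chunk:
            segments.append(("text", chunk))
        segments.append(("tag", match.group(1)))
        pos = match.end()
    tail = text[pos:].strip()
    if tail:
        segments.append(("text", tail))
    return segments


def unknown_tags(text: str) -> List[str]:
    """Return bracketed tags that are not in KNOWN_TAGS."""
    return [value for kind, value in split_tags(text)
            if kind == "tag" and value not in KNOWN_TAGS]
```

For example, `unknown_tags("Oh [yawns] fine [clears throat]")` flags `yawns` so the caller can decide whether to send it anyway.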
Alignment-gated EOS
A CTC alignment head tracks text progress at every decode step. The end-of-sequence (EOS) token is suppressed until all of the text has been spoken, then forced immediately, which reduces cutoffs, repetition, and trailing hallucination.
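The gating rule can be sketched in a few lines. This is a simplified illustration, not the shipped decoder: it uses a single flat vocabulary and greedy decoding, and it assumes a hypothetical `step_fn` that returns the next-token logits together with an integer text position reported by the alignment head.

```python
import math
from typing import Callable, List, Sequence, Tuple

StepFn = Callable[[List[int]], Tuple[Sequence[float], int]]


def gate_eos(logits: Sequence[float], eos_id: int,
             text_pos: int, text_len: int) -> List[float]:
    """Suppress EOS while text remains; force EOS once it is all spoken."""
    if text_pos < text_len:
        gated = list(logits)
        gated[eos_id] = -math.inf      # EOS cannot be chosen yet
        return gated
    gated = [-math.inf] * len(logits)  # text is done: only EOS is allowed
    gated[eos_id] = 0.0
    return gated


def greedy_decode(step_fn: StepFn, eos_id: int, text_len: int,
                  max_steps: int = 1000) -> List[int]:
    tokens: List[int] = []
    for _ in range(max_steps):
        logits, text_pos = step_fn(tokens)
        gated = gate_eos(logits, eos_id, text_pos, text_len)
        token = max(range(len(gated)), key=gated.__getitem__)
        if token == eos_id:
            break
        tokens.append(token)
    return tokens


# Toy model that always prefers EOS (id 2) and advances one text unit
# every two tokens. Without the gate it would stop immediately.
def eager_stop(tokens: List[int]) -> Tuple[List[float], int]:
    return [0.1, 0.2, 5.0], len(tokens) // 2


# Toy model that never wants to stop. Without the gate it would run on.
def never_stop(tokens: List[int]) -> Tuple[List[float], int]:
    return [1.0, 0.0, -5.0], len(tokens) // 2
```

Both toy models stop at exactly the same step once the reported text position reaches `text_len`, which covers the two failure modes the gate targets: early cutoff and trailing output.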
Open source
Open weights, inference code, Docker setup, and examples. Run it locally, inspect the stack, and deploy without a hosted-only dependency.
Runs on CPU, iOS, and GPU
CPU and iOS run through ONNX Runtime. GPU deployments use PyTorch or AOT .pt2 packages.
Small models sound small.
TTS Lite doesn't.
We built VaiCodec so a 112M-parameter model can still produce full-frequency audio at 32 kHz, with clean output and low inference cost.
In our benchmarks, TTS Lite posts the highest predicted MOS and speaker similarity among comparable external baselines.
Highest predicted MOS among comparable external baselines.
Highest speaker similarity among comparable external baselines.
| Spec | Value |
| --- | --- |
| Parameters | ~112M |
| Architecture | Autoregressive transformer · 8 layers · 960 dims |
| Audio output | 32 kHz · 3-codebook RVQ (1024 / 4096 / 8192) |
| Codec framerate | 20 fps (50 ms per frame) |
| RTF (CPU) | 0.31–0.37× on m6a.large, 2 vCPU (2.7–3.2× realtime) |
| First audio chunk | <200 ms (KV-cached reference) |
| Decode step | 13–18 ms mean |
| Zero-shot | Reference audio only - no transcript required |
| Word timestamps | Streaming · parallel to audio |
| Paralinguistic | [laughs] [sighs] [coughs] [gasps] [clears throat] and more |
| Deployment | CPU (ONNX) · iOS (ONNX) · GPU |
| Language | English |
| License | Apache 2.0 |
| Codec | Coming soon |
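The figures in the table can be cross-checked with a little arithmetic. The sketch below assumes one index per codebook per frame (the usual RVQ layout) and one decode step per codec frame; both are our assumptions rather than documented facts, and the derived bitrate is an implied figure, not a published one.

```python
import math

FRAME_RATE_HZ = 20                     # codec framerate from the table
CODEBOOK_SIZES = (1024, 4096, 8192)    # 3-codebook RVQ from the table

# 20 frames per second means each frame covers 50 ms of audio.
frame_ms = 1000 / FRAME_RATE_HZ

# Assumption: one index per codebook per frame, so 10 + 12 + 13 bits.
bits_per_frame = sum(math.log2(size) for size in CODEBOOK_SIZES)
bitrate_bps = bits_per_frame * FRAME_RATE_HZ


def realtime_multiple(rtf: float) -> float:
    """Seconds of audio produced per second of compute."""
    return 1.0 / rtf


def decode_share(decode_step_ms: float) -> float:
    """Fraction of a frame's duration spent in one decode step.

    Assumption: exactly one decode step per codec frame.
    """
    return decode_step_ms / frame_ms


print(f"{frame_ms:.0f} ms per frame, {bits_per_frame:.0f} bits per frame, "
      f"{bitrate_bps:.0f} bit/s")
print(f"{realtime_multiple(0.37):.1f}x to {realtime_multiple(0.31):.1f}x realtime")
```

Under these assumptions, a 13–18 ms decode step takes 26–36% of each 50 ms frame, which sits just under the quoted 0.31–0.37 RTF and suggests decoding accounts for most of the compute.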
Be first when it ships.
Leave your email and we'll send one note the day the repo goes public — nothing else.