Local, real-time text-to-speech: compare Supertonic 2's 10 voices
Supertonic 2 is a text-to-speech model that's surprisingly quick to run even on modest CPUs.
It comes with ten English voices: five female and five male. I couldn't find a side-by-side comparison of the voices, so here's one! My recommendations, in order, would be Female 2, Female 1 or Male 3.
| Female | Male |
|---|---|
| Female 1 (accurate, monotonous) | Male 1 (confident, a bit robotic) |
| Female 2 (my favourite - clear, upbeat, chipper) | Male 2 (neutral, monotonous) |
| Female 3 (calm, robotic) | Male 3 (my male favourite - clear, calm, warm) |
| Female 4 (raspy, confident) | Male 4 (robotic) |
| Female 5 (lower, less American) | Male 5 (relaxed, low energy) |
Each entry in the table says: "Good morning. Sophia is going for a 5 km jog in Regent's Park at 10:30 AM and has invited you to join - want me to say yes?"
If you want a faster assistant-style pace, Supertonic has a speed parameter, but at values far from 1.0 (e.g. 1.4) it occasionally drops or garbles words — especially on longer sentences or text with numbers and abbreviations. I'd recommend applying a pitch-preserving time-stretch to the rendered audio instead: same effective speaking rate without the artefacts. wyoming-supertonic does this by default.
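If you'd like to try the time-stretch approach on a WAV file you've already rendered, here is a minimal sketch. It assumes `ffmpeg` is installed and on your PATH, and it uses ffmpeg's `atempo` audio filter, which changes tempo without shifting pitch. This is only an illustration: the function names are made up for this example, and it isn't necessarily how wyoming-supertonic implements the stretch internally.

```python
import shutil
import subprocess


def atempo_command(src: str, dst: str, rate: float) -> list[str]:
    """Build an ffmpeg command that changes tempo while preserving pitch.

    Hypothetical helper for illustration; not part of Supertonic or
    wyoming-supertonic.
    """
    # Stay inside 0.5-2.0, the range a single atempo instance has
    # traditionally supported across ffmpeg versions.
    if not 0.5 <= rate <= 2.0:
        raise ValueError(f"rate must be between 0.5 and 2.0, got {rate}")
    # -y overwrites dst if it already exists.
    return ["ffmpeg", "-y", "-i", src, "-filter:a", f"atempo={rate}", dst]


def speed_up(src: str, dst: str, rate: float = 1.4) -> None:
    """Render src at `rate` times the original speaking pace into dst."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg not found on PATH")
    subprocess.run(atempo_command(src, dst, rate), check=True)
```

The idea is to synthesise at speed 1.0, where the model is at its most reliable, and then call something like `speed_up("reply.wav", "reply_fast.wav", 1.4)` on the result.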
For comparison: Piper
Piper is probably the most common self-hosted TTS in Home Assistant. Same phrase, four popular voices, all at the "medium" quality tier:
| Female | Male |
|---|---|
| en_US-amy | en_GB-alan |
| en_US-hfc_female | en_US-hfc_male |
Running this in Home Assistant
I started looking into this because I wanted a better voice than Piper for my Home Assistant voice pipeline. To get one, I wrote a Wyoming protocol server that wraps Supertonic 2. It's a drop-in TTS server that ships as more or less a single binary.
See the wyoming-supertonic README for details on setting it up.