Local, real-time text-to-speech: compare Supertonic 2's 10 voices

Adam Jones

Supertonic 2 is a text-to-speech model that's surprisingly quick to run even on modest CPUs.

It comes with ten English voices: five female and five male. I couldn't find a side-by-side comparison of the voices, so here's one! I'd recommend Female 2, Female 1 or Male 3.

Female                                          | Male
Female 1 (accurate, monotonous)                 | Male 1 (confident, bit robotic)
Female 2 (my favourite: clear, upbeat, chipper) | Male 2 (neutral, monotonous)
Female 3 (calm, robotic)                        | Male 3 (my male favourite: clear, calm, warm)
Female 4 (raspy, confident)                     | Male 4 (robotic)
Female 5 (lower, less American)                 | Male 5 (relaxed, low energy)

Each entry in the table says: "Good morning. Sophia is going for a 5 km jog in Regent's Park at 10:30 AM and has invited you to join - want me to say yes?"

If you want a faster assistant-style pace, Supertonic has a speed parameter, but at values far from 1.0 (e.g. 1.4) it occasionally drops or garbles words — especially on longer sentences or text with numbers and abbreviations. I'd recommend applying a pitch-preserving time-stretch to the rendered audio instead: same effective speaking rate without the artefacts. wyoming-supertonic does this by default.
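
To make the time-stretch idea concrete, here's a minimal sketch using plain overlap-add (OLA), the simplest pitch-preserving approach. It's an illustration under assumptions, not how wyoming-supertonic or Supertonic actually do it: `time_stretch_ola` and its parameters are hypothetical names chosen for this sketch, and the input is assumed to be mono samples as a list of floats. A real pipeline would more likely use WSOLA or a phase vocoder, which handle voiced speech with fewer artefacts than plain OLA.

```python
import math


def time_stretch_ola(samples, rate, frame=1024, hop_out=256):
    """Change duration by `rate` without resampling (rate > 1 shortens the audio).

    Plain overlap-add: windowed frames are read every `hop_out * rate`
    samples and written every `hop_out` samples. Each frame is copied
    as-is, so the waveform inside it (and hence the pitch) is unchanged.
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    if len(samples) < frame:
        return list(samples)  # too short to stretch; returned unchanged

    hop_in = max(1, round(hop_out * rate))
    # Periodic Hann window, used to cross-fade overlapping frames.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / frame) for i in range(frame)]

    n_frames = (len(samples) - frame) // hop_in + 1
    out_len = (n_frames - 1) * hop_out + frame
    out = [0.0] * out_len
    weight = [0.0] * out_len  # summed window values, for normalisation

    for k in range(n_frames):
        src, dst = k * hop_in, k * hop_out
        for i in range(frame):
            out[dst + i] += samples[src + i] * window[i]
            weight[dst + i] += window[i]

    # Divide by the summed window so overlapping regions keep their level.
    return [o / w if w > 1e-8 else 0.0 for o, w in zip(out, weight)]


# Example: one second of a 440 Hz tone at 16 kHz, sped up by 1.4x.
sample_rate = 16000
tone = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate)]
faster = time_stretch_ola(tone, rate=1.4)
print(len(tone) / sample_rate, len(faster) / sample_rate)  # durations in seconds
```

The idea would be to render the speech at speed 1.0, where the model is most reliable, and then stretch the resulting PCM samples to the pace you want before playing them.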

For comparison: Piper

Piper is probably the most common self-hosted TTS in Home Assistant. Same phrase, four popular voices, all at the "medium" quality tier:

en_US-amy
en_GB-alan
en_US-hfc_female
en_US-hfc_male

Running this in Home Assistant

I started looking into this because I wanted a better voice than Piper for my Home Assistant voice pipeline. To get one, I wrote a Wyoming protocol server that wraps Supertonic 2. It's a drop-in TTS server that ships as (roughly) a single binary.

See the wyoming-supertonic README for details on setting it up.