I’m working a lot with TTS (Text-to-Speech), and it’s also a total wild west - even worse than LLMs in some ways. The demos are always perfect, but once you generate hundreds of minutes of audio you start seeing volume drift, pacing changes, random artifacts, and occasional mispronunciations that never show up in the curated clips.

The big difference from LLMs is that we don’t really have production-grade, standardized benchmarks for long-form TTS. We need metrics like volume stability across segments, speech-rate consistency, and pronunciation accuracy on a deliberately hard corpus.
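
As a rough sketch of what two of those metrics could look like (not a standard, just one way to do it): the snippet below assumes each segment is already decoded to float samples in [-1, 1] and that you have a word count and duration per segment. The function names are made up for illustration, and plain RMS is only a stand-in for loudness; a real benchmark would probably want a perceptual measure such as LUFS.

```python
import math
import statistics


def rms_dbfs(samples):
    """RMS level of one segment in dB relative to full scale.

    Assumes samples are floats in [-1, 1]. The floor avoids log10(0) on silence.
    """
    mean_square = sum(s * s for s in samples) / len(samples)
    return 10 * math.log10(max(mean_square, 1e-12))


def volume_drift_db(segments):
    """Standard deviation of per-segment RMS level, in dB (hypothetical metric).

    0 means every segment has the same level; larger means more drift.
    """
    levels = [rms_dbfs(seg) for seg in segments]
    return statistics.pstdev(levels)


def speech_rate_cv(word_counts, durations_s):
    """Coefficient of variation of words per second across segments (hypothetical metric).

    0 means perfectly even pacing; larger means the pace wanders more.
    """
    rates = [w / d for w, d in zip(word_counts, durations_s)]
    return statistics.pstdev(rates) / statistics.mean(rates)
```

Run over a few hundred consecutive segments from the same voice, both numbers should stay close to zero for a stable system. What counts as an acceptable threshold is exactly the kind of thing a shared benchmark would have to pin down.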

I wrote up what this could look like here: https://lielvilla.com/blog/death-of-demo/