Good to know that aeneas works reasonably well even for sung speech. I've tried using aeneas for LibriVox audiobooks (10+ hours), but that failed: it loads the whole file into memory and then computes FFTs over all of it in one go, and I don't have the RAM for that. So right now I'm Rewriting in Rust™ using iterators to hopefully reduce memory usage and improve performance.
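To make the iterator idea concrete, here is a minimal sketch of streaming frame-by-frame processing. It is not aeneas code and not the actual rewrite: the names `FrameRms` and `frame_rms` are made up for illustration, and a simple per-frame RMS energy stands in for the FFT-based features a real aligner would compute. The point is only that memory use stays constant no matter how long the sample stream is.

```rust
/// Iterator adapter (hypothetical name) that consumes a stream of audio
/// samples and yields the RMS energy of each fixed-size frame.
/// Only a running sum is kept, never the whole recording.
struct FrameRms<I: Iterator<Item = f32>> {
    samples: I,
    frame_len: usize,
}

impl<I: Iterator<Item = f32>> Iterator for FrameRms<I> {
    type Item = f32;

    fn next(&mut self) -> Option<f32> {
        let mut sum_sq = 0.0f64;
        let mut n = 0usize;
        // Pull at most `frame_len` samples from the underlying stream.
        while n < self.frame_len {
            match self.samples.next() {
                Some(s) => {
                    sum_sq += (s as f64) * (s as f64);
                    n += 1;
                }
                None => break,
            }
        }
        if n == 0 {
            // Stream exhausted, no partial frame left.
            None
        } else {
            // A trailing partial frame is averaged over the samples it has.
            Some((sum_sq / n as f64).sqrt() as f32)
        }
    }
}

/// Wraps any sample iterator (e.g. one that decodes a file lazily).
fn frame_rms<I: Iterator<Item = f32>>(samples: I, frame_len: usize) -> FrameRms<I> {
    assert!(frame_len > 0, "frame_len must be positive");
    FrameRms { samples, frame_len }
}
```

Something like `frame_rms(decoder, 1024)` would then be driven lazily by whatever consumes it, assuming `decoder` is an iterator that yields `f32` samples as it reads the file.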

Saying that espeak-ng supports 108 languages is maybe a bit misleading. It has pronunciation definitions for many languages, but the actual level of support varies widely.

For Mandarin, espeak-ng 1.49.2 has a bug where it reads the tone numbers out loud instead of modifying the pitch contour. So e.g. the number 四 (four) is pronounced "si si" instead of "sì": it has the fourth tone, and the tone number gets spoken as a second "si". That's the version packaged for Ubuntu, so you may be using it for your API.

For Japanese, kanji aren't supported at all, so 四 is pronounced as "Chinese letter" (in English). For proper Japanese support, you'd need to switch to a different TTS engine like Open JTalk or preprocess the text to transform it into kana.

Also note that aeneas is licensed under the AGPL, which requires you to offer the source code if you let others interact with the program over a network (which is what your API does). So your attempt to keep the secret sauce private and only reveal it once someone guessed the algorithm was likely illegal. You should add proper copyright notices to your program and to audioai.online.