Certainly it's still far from being able to deceive a human into thinking synthesized speech of any speaker saying anything is real, but it has definitely captured a certain quality of each of those voices. Really cool project, and I'm sure it portends even more awesome work in the area.
For one thing, getting very good quality takes a lot of resources: studio-quality recordings lasting many hours, plus voice directors and voice experts who can sift through wav files and ensure phoneme boundaries are aligned, etc. And even with this, the quality may not be predictable - but they have gotten reasonably good. It is a hard task to do at scale. (The HMM-based HTS synthesis used in my app is scalable, but the quality is not that great and it sounds robotic.)
That means that you can't simply reverse engineer a voice from, say, a sample text read by a voice actor? I mean, down to the tiny bits of audio waveform? I mean, how hard could it be? :)
If you can have a speaker read through a specific list of items, a useful singing model can be constructed. That's how Vocaloid works.
What hasn't been done well yet is extracting a model from existing uncontrolled voice samples. That's what this is trying to do. Once this works well, software clones of dead singers will be popular. The RIAA is going to hate this.
That's exactly how unit selection systems like Festival work. The trouble starts when you hit a previously unencountered phoneme sequence and you have to interpolate. Sometimes good, sometimes bad.
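To make the "previously unencountered phoneme sequence" problem concrete, here is a toy sketch in Python. It is not Festival's actual algorithm (real unit selection also weighs target and join costs); the inventory, the unit names like `rec_012`, and the `select_units` helper are all made up for illustration. It only shows where a lookup succeeds and where the synthesizer would have to fall back to interpolation.

```python
from typing import Dict, List, Tuple

Diphone = Tuple[str, str]

# Hypothetical inventory: each phoneme pair maps to the name of a recorded unit.
INVENTORY: Dict[Diphone, str] = {
    ("h", "e"): "rec_012",
    ("e", "l"): "rec_047",
    ("l", "o"): "rec_101",
}


def select_units(phonemes: List[str]) -> List[Tuple[Diphone, str, bool]]:
    """Return (diphone, unit, found) for each adjacent phoneme pair."""
    plan = []
    for pair in zip(phonemes, phonemes[1:]):
        unit = INVENTORY.get(pair)
        if unit is not None:
            # A recorded unit exists for this pair, so it can be used directly.
            plan.append((pair, unit, True))
        else:
            # Unseen pair: a real system would have to interpolate here,
            # which is where the quality becomes hit or miss.
            plan.append((pair, "interpolated", False))
    return plan


if __name__ == "__main__":
    for pair, unit, found in select_units(["h", "e", "l", "p"]):
        print(pair, unit, "recorded" if found else "needs interpolation")
```

In this sketch "h e l o" is fully covered by the inventory, while "h e l p" hits the missing ("l", "p") pair and gets flagged for interpolation.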
Edit: Text to pronunciation is a whole other problem.
Aiming for lyrics is a much higher target than everyday text though, due to grammatical hints and the extra pitch and phrasing demands of lyrics. Your results might hit people harder on non-lyrical textual bodies.
Keep up the good work, I'd like to make something like this for musical instruments someday :)
Reminds me of this -- before Roger Ebert died, he tried to have his voice reconstructed by some company using audio from his TV show, etc., but alas, it was too difficult at the time, so he ended up using one of the Apple TTS voices instead.
After some searching, this is the best support I could find:
--snip
In early 2010, Ebert and Chaz announced on the “Oprah Winfrey Show” that they’d enlisted a Scottish company called CereProc to create a computerized voice that more closely resembled Ebert’s own by using snippets of his TV work, DVD commentaries and the like, but that never fully materialized. Alex stayed with him until the end.
--snip
Alex being the Apple TTS voice I mentioned earlier.
Thanks. Also a related effort in recent times is VocaliD. They are trying to create personalized voices from donor voices. The impact of such projects can be huge for those who need assistance.
How the effect works is actually pretty well known, and if you take a high-quality TTS, feed it high-quality input (i.e. phonemes and intonation commands etc. rather than just English text), apply the effect, then perhaps do some postprocessing (reverb) to make it sound more like the game, it turns out you get pretty close to the real thing.
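For the reverb postprocessing step, a minimal sketch in Python of the simplest possible version, a single feedback comb filter. This is only an assumption about what "reverb" could mean here; the function name `comb_reverb` and its `delay` and `decay` parameters are invented for the example, and a real reverb would combine several such filters with different delays.

```python
from typing import List


def comb_reverb(samples: List[float], delay: int, decay: float) -> List[float]:
    """Feedback comb filter: y[n] = x[n] + decay * y[n - delay]."""
    if delay <= 0:
        raise ValueError("delay must be a positive number of samples")
    out = list(samples)
    for i in range(delay, len(out)):
        # out[i - delay] has already been processed, so each echo
        # feeds back into later echoes and decays geometrically.
        out[i] += decay * out[i - delay]
    return out


if __name__ == "__main__":
    impulse = [1.0, 0.0, 0.0, 0.0, 0.0]
    print(comb_reverb(impulse, delay=2, decay=0.5))
```

Feeding in a single impulse shows the decaying echoes, spaced `delay` samples apart.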
One of my biggest peeves right now is that voices cost a ton of money, few are readily available otherwise, and a lot of the new stuff is cloud-dependent. (Which is a big turn-off to me.)
Right now, I'd like to see if there are any tweaks that can improve the quality (or even experiment with concatenative synthesis). Most likely, it will stay robotic. So a possible extension would be to research singing synthesis (the Vocaloid kind).
What I meant was, are you looking at making it available for others to use in things (open source, perhaps), or looking to make a product out of it? And if the latter, something cloud-based, or something I could run on my own machine?
There have been rumours of government capability to do this for some time. For example, sending false voice messages with radar instructions to enemy fighters, etc. Interesting to see it in the commercial space.
I don't know if your response is a review of my app or of the idea itself. If it's my app, it's an iterative process like many things. I didn't know what I would end up getting, so I made an attempt.