Show HN: VoiceClonr attempts to reconstruct human voices (voiceclonr.com)
56 points by voiceclonr on July 1, 2015 | 34 comments


Certainly it's still far from being able to deceive a human into thinking synthesized speech of any speaker saying anything is real, but it has clearly captured a certain quality of each of those voices. Really cool project, and I'm sure it portends even more awesome work in the area.


I am wondering: what exactly is holding back the technology? Why isn't it there yet?


For one thing, getting very good quality takes a lot of resources: studio-quality recordings lasting many hours, voice directors, and voice experts who can sift through wav files and make sure phoneme boundaries are aligned, etc. Even with all that, the quality may not be predictable, though results have gotten reasonably good. It is a hard task to do at scale. (The HMM-based HTS synthesis used in my app is scalable, but the quality is not that great and sounds robotic.)
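
For the curious: HTS-style mono label files are just "start end phone" lines, with times in 100-nanosecond units, so a basic boundary sanity check is easy to script. A minimal sketch of that kind of check (the file name and duration threshold below are made up, and this isn't the actual tooling):

    def check_labels(path, min_dur_ms=20.0):
        """Flag non-contiguous or suspiciously short phoneme segments."""
        problems = []
        prev_end = 0
        with open(path) as f:
            for lineno, line in enumerate(f, 1):
                if not line.strip():
                    continue
                start, end, phone = line.split()
                start, end = int(start), int(end)   # HTK/HTS times, 100 ns units
                dur_ms = (end - start) / 10000.0    # 100 ns -> milliseconds
                if start != prev_end:
                    problems.append((lineno, phone, "gap or overlap at boundary"))
                if dur_ms < min_dur_ms and phone not in ("sil", "pau"):
                    problems.append((lineno, phone, "only %.1f ms long" % dur_ms))
                prev_end = end
        return problems

    # e.g.: for p in check_labels("arctic_a0001.lab"): print(p)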


That means you can't simply reverse-engineer a voice from, say, a sample text read by a voice actor? I mean, down to the tiny bits of audio waveform? How hard could it be? :)


If you can have a speaker read through a specific list of items, a useful singing model can be constructed. That's how Vocaloid works.

What hasn't been done well yet is extracting a model from existing uncontrolled voice samples. That's what this is trying to do. Once this works well, software clones of dead singers will be popular. The RIAA is going to hate this.


That's exactly how unit-selection systems like Festival work. The trouble starts when you hit a previously unencountered phoneme sequence and have to interpolate; sometimes the result is good, sometimes bad. A rough sketch of the search is below.

Edit: Text to pronunciation is a whole other problem.
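
For anyone who hasn't seen it: unit selection picks one recorded unit per target phoneme so that the sum of target costs (how well a unit matches the wanted context) and join costs (how smoothly adjacent units splice together) is minimal, usually with a Viterbi-style search. A minimal sketch, with the unit representation and both cost functions left as placeholders you'd have to supply:

    def select_units(targets, candidates, target_cost, join_cost):
        """
        targets:    per-position target specs (phoneme plus desired prosody)
        candidates: per-position lists of candidate units from the recorded corpus
        Returns, for each position, the index of the chosen candidate.
        """
        n = len(targets)
        cost = [[0.0] * len(candidates[i]) for i in range(n)]
        back = [[0] * len(candidates[i]) for i in range(n)]

        for j, u in enumerate(candidates[0]):
            cost[0][j] = target_cost(targets[0], u)

        for i in range(1, n):
            for j, u in enumerate(candidates[i]):
                best_k, best = min(
                    ((k, cost[i - 1][k] + join_cost(candidates[i - 1][k], u))
                     for k in range(len(candidates[i - 1]))),
                    key=lambda kv: kv[1])
                cost[i][j] = best + target_cost(targets[i], u)
                back[i][j] = best_k

        # Trace back the cheapest path through the lattice.
        j = min(range(len(candidates[-1])), key=lambda k: cost[-1][k])
        path = [j]
        for i in range(n - 1, 0, -1):
            j = back[i][j]
            path.append(j)
        return path[::-1]

A Festival-style system builds those costs out of prosodic and spectral features of the candidate units; the search itself is just this dynamic program.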


Given a text and waveform, how do you know how to match them up exactly?


I love that you're shooting for a holy grail!

Aiming for lyrics is a much higher target than everyday text, though, given the weaker grammatical hints and the extra pitch and phrasing demands of lyrics. Your results might land better with people on non-lyrical text.

Keep up the good work. I'd like to make something like this for musical instruments someday :)

P.S. Have you ever heard of Douglas Hofstadter's Letter Spirit project, which synthesizes fonts from a subset of letters? http://www.cogsci.indiana.edu/farg/mcgrawg/fonts.gif


This is very cool.

Reminds me of this -- before Roger Ebert died, he tried to have his voice reconstructed by some company using audio from his TV show, etc., but alas, it was too difficult at the time, so he ended up using one of the Apple TTS voices instead.


I believe you're mistaken. Here is its debut, and it's pretty amazing:

https://www.youtube.com/watch?v=93jREDSWOYY#t=1m23s

Google "Cereproc" and "Roger Ebert" and you'll find that he was quite pleased with it.


After some searching, this is the best support I could find:

--snip

In early 2010, Ebert and Chaz announced on the “Oprah Winfrey Show” that they’d enlisted a Scottish company called CereProc to create a computerized voice that more closely resembled Ebert’s own by using snippets of his TV work, DVD commentaries and the like, but that never fully materialized. Alex stayed with him until the end.

--snip

Alex being the Apple TTS voice I mentioned earlier.

Source: http://voices.suntimes.com/arts-entertainment/the-daily-sizz...


Thanks. A related recent effort is VocaliD, which is trying to create personalized voices from donor voices. The impact of such projects can be huge for those who need assistance.


I would pay a fair amount for a TTS engine that can accurately mimic GLaDOS's voice.


How the effect works is actually pretty well known: take a high-quality TTS, feed it high-quality input (i.e. phonemes and intonation commands rather than just English text), apply the effect, then perhaps do some postprocessing (reverb) to make it sound more like the game, and it turns out you get pretty close to the real thing.

To illustrate: https://www.youtube.com/watch?v=v7-Gwg0rL0k#t=5m33s - this one's not particularly well done (and doesn't even do all those steps), but you see what I mean.
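
If you want to play with just the last step (the reverb), here's a rough numpy/scipy sketch; the voice effect itself and the TTS input preparation are assumed to have happened already, and the filenames are invented:

    import numpy as np
    from scipy.io import wavfile
    from scipy.signal import fftconvolve

    rate, dry = wavfile.read("tts_output.wav")        # assumed mono 16-bit wav
    dry = dry.astype(np.float64) / 32768.0

    # Exponentially decaying noise burst as a crude ~0.4 s impulse response.
    t = np.arange(int(0.4 * rate)) / rate
    ir = np.random.RandomState(0).standard_normal(t.size) * np.exp(-t / 0.08)

    wet = fftconvolve(dry, ir)[: dry.size]
    wet /= np.abs(wet).max() + 1e-9
    mix = np.clip(0.8 * dry + 0.2 * wet, -1.0, 1.0)

    wavfile.write("tts_output_reverb.wav", rate, (mix * 32767).astype(np.int16))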


Heh, this is the first and primary reason I'd love to get my hands on this.


This would be a triumph.


With a note of huge success. It's going to be hard to overstate my satisfaction.


I'm super excited about this.

One of my biggest peeves right now is that voices cost a ton of money, few are readily available otherwise, and a lot of the new stuff is cloud-dependent. (Which is a big turn-off to me.)

What are you looking to do with this?


Right now, I'd like to see if there are any tweaks that can improve the quality (or even experiment with concatenative synthesis). Most likely it will stay robotic. A further extension would be to look into singing synthesis (Vocaloid-style).


What I meant was, are you looking at making it available for others to use in things (open source, perhaps), or looking to make a product out of it? And if the latter, something cloud-based, or something I could run on my own machine?


Yes, I'm looking into a cloud-based API that anyone can call into.
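
To be concrete about what "call into" could look like, here's a purely hypothetical example; the endpoint, parameters, and response format are all invented, not an actual VoiceClonr API:

    import requests

    resp = requests.post(
        "https://api.example.com/v1/synthesize",   # hypothetical endpoint
        json={"voice": "my-cloned-voice", "text": "Hello from the cloud."},
    )
    resp.raise_for_status()
    with open("hello.wav", "wb") as f:
        f.write(resp.content)                      # assuming wav bytes come back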


There have been rumours of governments having this capability for some time, for example to send false voice messages posing as radar-controller instructions to enemy fighters. Interesting to see it in the commercial space.


Links or references?


It's been a while, but let me try to dig some up.


And I was hoping to find Morgan Freeman's voice there already.


I thought about it - but I would have been disappointed with the synthetic voice myself, so didn't even try :)


Request: Star Trek computer voice (Majel Barrett Roddenberry)

There are hundreds of episodes containing it, including remastered audio in the HD versions of TNG.


Very nice project, congratulations!

One suggestion: make the text-to-speech button bigger and centered (I missed it the first time).


Thanks! Will make the button changes tonight. Which device were you on when you missed it?


Could a deep neural network clone the voice of a person given previous sound recordings of that person speaking?



They all just sound robotic to me.

Why even attempt things like imitating specific people's voices when the speech isn't fluid or clearly pronounced to begin with?


I don't know if your response is a review of my app or of the idea itself. If it's my app, it's an iterative process like many things. I didn't know what I would end up getting, so I made an attempt.


Ridiculously negative. I think it's cool, and I'm sure whoever made it learned a lot in making it.



