
They trained the algorithm to watch the stick and play sounds from the database where the stick moved similarly.

But the title makes it seem like the algorithm is synthesizing the sounds from scratch!



Exactly, most of the work here is video processing - analyzing a portion of one video, then finding another video clip that has similar characteristics. Then they copy the sound from the second clip into the first clip. But they could be copying any metadata, not just sound. This isn't really about sound _at all_.
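The retrieval idea described above can be sketched in a few lines: give each database clip a motion-feature vector and an attached sound, then for a query clip copy the sound of the nearest neighbor in feature space. This is a hypothetical toy illustration, not the paper's actual features or matching method:

```python
import numpy as np

# Toy database: each clip has a motion-feature vector (made-up 2-D
# features here) and the sound recorded with it.
database = [
    {"features": np.array([0.9, 0.1]), "sound": "thud.wav"},    # stick hits dirt
    {"features": np.array([0.1, 0.8]), "sound": "splash.wav"},  # stick hits water
]

def transfer_sound(query_features):
    """Copy the sound of the database clip whose features are closest
    to the query clip's features (nearest neighbor by L2 distance)."""
    dists = [np.linalg.norm(query_features - c["features"]) for c in database]
    return database[int(np.argmin(dists))]["sound"]

# A query clip whose motion looks like the dirt hit gets the dirt sound.
print(transfer_sound(np.array([0.85, 0.2])))  # prints "thud.wav"
```

As the parent says, the sound is just metadata being copied along with the match; the same lookup could return any annotation attached to the retrieved clip.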


> But the title makes it seem like the algorithm is synthesizing the sounds from scratch!

It does. If you read the paper, they say that first they went with matching sounds from a database, but later moved on to full synthesis.

Reference: look for "parametric synthesis" in the paper https://arxiv.org/pdf/1512.08512v2.pdf


They do both; there's a parametric synthesis module later on in the video. It doesn't work all that well for water.


They can do pure parametric synthesis as well, but it's not nearly as convincing so most of the video is devoted to the more convincing match method. FWIW, constructing realistic sounds from first principles is much more difficult than you'd think.

> where the stick moved similarly

where the stick moved similarly and was hitting similar things, which is a non-trivial task.


Yes, it would have to learn to simulate the physics of the system to match the video, which would be cool.


And for what it is, it's really unimpressive. It's a cool idea and all, just with disappointing results.


They also train for the material the stick is hitting. But yeah, it's sound transfer, not synthesis from scratch.

But that's also the approach of current speech synthesis algorithms and works better than trying to create the waveform from scratch.


> But that's also the approach of current speech synthesis algorithms and works better than trying to create the waveform from scratch.

I don't think it's that simple. Speech synthesis by concatenation does produce more natural-sounding results, at least until you notice its quirks, so casual users tend to prefer it. But I know some heavy speech synthesis users, specifically blind programmers and power-users, and they tend to prefer parametric synthesis, because it's more intelligible at high speeds.


Yes, the title may give that impression, but the video clearly explains the source of the audio.



