Exactly, most of the work here is video processing: analyzing a portion of one video, finding another video clip with similar characteristics, then copying the sound from the second clip into the first. But they could be copying any metadata, not just sound. This isn't really about sound _at all_.
They can do pure parametric synthesis as well, but it's not nearly as convincing so most of the video is devoted to the more convincing match method. FWIW, constructing realistic sounds from first principles is much more difficult than you'd think.
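To make the match method concrete, here's a minimal sketch of the idea, assuming a toy motion feature and a small library of (video, audio) pairs; the feature and all names are illustrative stand-ins, not what the paper actually uses:

```python
# Hypothetical sketch of the match-and-copy approach: describe each clip
# by a motion-feature vector, find the nearest neighbour in a library of
# (frames, audio) pairs, and reuse that neighbour's audio.
import numpy as np

def motion_features(frames: np.ndarray) -> np.ndarray:
    """Crude stand-in for real video features: mean absolute
    frame-to-frame difference, one value per transition."""
    diffs = np.abs(np.diff(frames.astype(float), axis=0))
    return diffs.reshape(diffs.shape[0], -1).mean(axis=1)

def transfer_audio(query_frames, library):
    """library: list of (frames, audio) pairs of matching length.
    Returns the audio of the clip whose motion profile is closest."""
    q = motion_features(query_frames)
    best = min(library,
               key=lambda item: np.linalg.norm(motion_features(item[0]) - q))
    return best[1]  # copy the matched clip's sound onto the query clip

# Toy usage: three 5-frame "videos" of 4x4 pixels with dummy audio labels.
rng = np.random.default_rng(0)
lib = [(rng.random((5, 4, 4)) * scale, f"audio_{i}")
       for i, scale in enumerate([0.1, 1.0, 10.0])]
query = lib[1][0] + 0.01  # nearly identical motion to library clip 1
print(transfer_audio(query, lib))  # prints "audio_1"
```

The real system learns its features and matches at a much finer granularity, but the nearest-neighbour-then-copy structure is the same.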
> where the stick moved similarly
where the stick moved similarly and was hitting similar things, which is a non-trivial task.
> But that's also the approach of current speech synthesis algorithms and works better than trying to create the waveform from scratch.
I don't think it's that simple. Concatenative speech synthesis does produce more natural-sounding results, at least until you notice its quirks, so casual users tend to prefer it. But I know some heavy speech-synthesis users, specifically blind programmers and power users, and they tend to prefer parametric synthesis because it's more intelligible at high speeds.
But the title makes it seem like the algorithm is synthesizing the sounds from scratch!