Exactly, most of the work here is video processing: analyzing a portion of one video, finding another video clip with similar characteristics, then copying the sound from the second clip into the first. But they could be copying any metadata, not just sound. This isn't really about sound _at all_.
They can do pure parametric synthesis as well, but it's not nearly as convincing so most of the video is devoted to the more convincing match method. FWIW, constructing realistic sounds from first principles is much more difficult than you'd think.
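To make the match method concrete, here's a minimal sketch of the idea, assuming a toy motion feature and a small library of (video, audio) pairs; the feature and all names are illustrative stand-ins, not what the paper actually uses:

```python
# Hypothetical sketch of the match-and-copy approach: describe each clip
# by a motion-feature vector, find the nearest neighbour in a library of
# (frames, audio) pairs, and reuse that neighbour's audio.
import numpy as np

def motion_features(frames: np.ndarray) -> np.ndarray:
    """Crude stand-in for real video features: mean absolute
    frame-to-frame difference, one value per transition."""
    diffs = np.abs(np.diff(frames.astype(float), axis=0))
    return diffs.reshape(diffs.shape[0], -1).mean(axis=1)

def transfer_audio(query_frames, library):
    """library: list of (frames, audio) pairs of matching length.
    Returns the audio of the clip whose motion profile is closest."""
    q = motion_features(query_frames)
    best = min(library,
               key=lambda item: np.linalg.norm(motion_features(item[0]) - q))
    return best[1]  # copy the matched clip's sound onto the query clip

# Toy usage: three 5-frame "videos" of 4x4 pixels with dummy audio labels.
rng = np.random.default_rng(0)
lib = [(rng.random((5, 4, 4)) * scale, f"audio_{i}")
       for i, scale in enumerate([0.1, 1.0, 10.0])]
query = lib[1][0] + 0.01  # nearly identical motion to library clip 1
print(transfer_audio(query, lib))  # prints "audio_1"
```

The real system learns its features and matches at a much finer granularity, but the nearest-neighbour-then-copy structure is the same.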
> where the stick moved similarly
where the stick moved similarly and was hitting similar things, which is a non-trivial task.
> But that's also the approach of current speech synthesis algorithms and works better than trying to create the waveform from scratch.
I don't think it's that simple. Concatenative speech synthesis does produce more natural-sounding results, at least until you notice its quirks, so casual users tend to prefer it. But I know some heavy speech-synthesis users, specifically blind programmers and power users, and they tend to prefer parametric synthesis because it's more intelligible at high speeds.
But the title makes it seem like the algorithm is synthesizing the sounds from scratch!