I hesitate to even post this, but I listened to the audio examples and it seems like this project was not yet a success. I'm not trying to be a jerk or snarky, but the reconstructed audio sounded terrible.
I have to agree. There's certainly more high-frequency content, but it seems mostly like noise, with only a vague amplitude correlation to the existing audio.
I'd be curious to see if any better results could be obtained by applying a similar technique in the frequency domain.
Opus does something similar in the frequency domain, but much simpler. It just copies codewords (minus energy) from lower bands to higher bands. It also lets you signal the energy of the copied bands (something you wouldn't have access to if reconstructing blindly).
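This isn't the actual Opus/CELT code path, which folds normalized MDCT codewords per band and signals the real band energies in the bitstream, but a toy numpy sketch of the general idea (copy the low band upward and rescale) might look like this; the function name and gain value are made up for illustration:

```python
import numpy as np

def toy_band_fold(x, fs=16000, cutoff=2000, gain=0.3):
    """Toy 'band folding': fill the empty spectrum above `cutoff` by tiling
    the band just below it, scaled down by `gain`. Illustrative only; Opus
    works on normalized per-band MDCT codewords and signals true energies."""
    X = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    lo = (freqs > 0) & (freqs <= cutoff)   # source band
    hi = freqs > cutoff                    # empty band to fill
    src = X[lo]
    reps = int(np.ceil(hi.sum() / src.size))
    X[hi] = np.tile(src, reps)[: hi.sum()] * gain
    return np.fft.irfft(X, n=len(x))

# Example: a 2 kHz band-limited test signal gains synthetic high-band content.
t = np.arange(16000) / 16000.0
lowpassed = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 1500 * t)
widened = toy_band_fold(lowpassed)
```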
"applying a similar technique in the frequency domain", "Maybe training an image reconstructor on the short term spectrogram" - This is what I originally thought to do. However, this approach suffers from information loss whenever you transform from the frequency domain back to the time domain. Since the goal was super-resolution in the time domain, working in the time domain is more sensible.
Mathematically, the DFT is invertible, i.e. lossless, but in practice there will be a bit of loss due to the finite precision of floating-point numbers. Even though it isn't strictly lossless, the amount of loss should be minuscule compared to the 16 kHz -> 2 kHz loss you are trying to overcome.
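A quick numpy check of that claim (purely illustrative):

```python
import numpy as np

# Round-trip one second of 16 kHz "audio" through the DFT and back.
rng = np.random.default_rng(0)
x = rng.standard_normal(16000)

x_roundtrip = np.fft.irfft(np.fft.rfft(x), n=len(x))
print("max round-trip error:", np.max(np.abs(x_roundtrip - x)))
# Prints something on the order of 1e-13: hundreds of dB below the signal,
# i.e. negligible next to discarding everything above 2 kHz.
```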
It's not precision loss; it's that when you take a DFT you have to choose an interval. If you choose a short interval you are less certain about frequencies, while if you choose a long interval you are less certain about time-domain changes (i.e., changes in the signal over your time window).
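For a concrete sense of that trade-off, here is a small scipy sketch (illustrative only) comparing the frequency and time resolution of a short versus a long STFT window:

```python
import numpy as np
from scipy import signal

fs = 16000
x = np.random.default_rng(0).standard_normal(fs)  # 1 s of noise as a stand-in

for nperseg in (64, 4096):
    f, t, _ = signal.stft(x, fs=fs, nperseg=nperseg)
    print(f"window = {nperseg:4d} samples -> "
          f"frequency step = {f[1] - f[0]:6.1f} Hz, "
          f"time step = {(t[1] - t[0]) * 1000:6.1f} ms")
# 64-sample window:   coarse in frequency (250 Hz bins), fine in time (2 ms)
# 4096-sample window: fine in frequency (~4 Hz bins), coarse in time (128 ms)
```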
When using DL, perhaps you might try a downsampling scheme that suits your DL model?
I mean, yes, it would be awesome to use your network to upsample arbitrary audio, but that is apparently hard. What about upsampling something DL-friendly instead, and making the challenge reducing the size of that downsampled representation?
Since time-domain content is the reconstruction target, wouldn't LSTMs be a better choice than CNNs? I would think the spectral content is time-varying and depends on the sequential history.
My thinking is that this is a good GAN problem. An L2 loss will have these bad, trivial upscalings as local minima: since L2 in the time domain is the same as L2 in the frequency domain (Parseval's theorem), you can think of it in the frequency domain as having a big black area to infill from very little information. With some sort of perceptual similarity loss, on the other hand, there would be lots of adjacent improvements in quality that reduce the error and make training easier. I think this matches the results seen in image upscaling, too.
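The "L2 in the time domain is the same as L2 in the frequency domain" step is Parseval's theorem; a two-line numpy sanity check, just for illustration:

```python
import numpy as np

x = np.random.default_rng(0).standard_normal(2048)
X = np.fft.fft(x)

# Parseval: energy in the time domain equals energy in the frequency domain
# (up to the 1/N normalization convention used by numpy's FFT).
print(np.sum(x ** 2), np.sum(np.abs(X) ** 2) / len(x))  # the two values match
```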
In fact, when you listen to the downsampled example, there is actually a lot of information left in the extract. Way more than enough. That's because frequency should be viewed on a log scale to be more relevant to the human ear.
Here the frequency cutoff is 2 kHz, which is already a fairly high pitch.
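One way to put a rough number on that, using the common HTK-style mel formula as a stand-in for "log scale" (chosen here purely for illustration): the band below the 2 kHz cutoff already covers roughly half of the mel range up to the 8 kHz Nyquist frequency of 16 kHz audio.

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel formula (one of several conventions in common use).
    return 2595.0 * np.log10(1.0 + f / 700.0)

cutoff, nyquist = 2000.0, 8000.0
fraction = hz_to_mel(cutoff) / hz_to_mel(nyquist)
print(f"{fraction:.0%} of the mel range up to {nyquist:.0f} Hz "
      f"lies below the {cutoff:.0f} Hz cutoff")  # roughly 54%
```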
I completely agree. You could even tell from the picture of the spectrum above that the "reconstruction" was not a success. The spectrum looks like it was reconstructed by a flat extrapolation of the amplitude of the last known frequency.
This is exactly the kind of project where deep models should excel. Something's not working properly here.
"Something's not working properly here" - I disagree. The model will overtrain (i.e. perfectly reconstruct the original waveforms of a small training set), which indicates it's capable of learning the necessary transformation. The problem lies in the limited amount of training time I had. To reiterate from an earlier comment, I trained on only 10 epochs, while the paper this is base on claimed to train on 400. Much more training is required for this model to generalize well without degrading the signal-to-noise ratio.
Hey thanks for the comment. That makes sense, I'd be interested in hearing the results with more training. This could work well and have a good range of applications.
"the reconstructed audio sounded terrible" - I think this is referring to the amount of static noise in the reconstructed waveform. Indeed, the SNR clearly shows the reconstruction is slightly worse than the downsampled waveform. As mentioned in the post, I strongly believe this is due to the limited amount of training I performed. The number of epochs of training data in my case was only 10 while the paper this project is based on trained for 400 epochs. During training I noticed a strong dependence on training epochs and perceptual performance.
My ears think so too, but upsampling by just 2x is roughly the same difficulty as upsampling an image by 2x. As you probably know, you can't just CIA-like "ENHANCE" an image to double its resolution and expect its noise level to be better than, say, 10 decibels (of image brightness). Yet our ears can notice noise 40-50 decibels below the signal, so it would be nearly impossible to reconstruct higher frequencies such that the result has no noticeable noise.
In this research, the author is attempting to upsample by a bit more than 2.
That's the point of using deep learning here. Of course you can't make up the missing information, but by training the model with a lot of samples, it should eventually reach a point where it produces the most likely original information.
> you can't just CIA-like "ENHANCE" an image to double its resolution
I think what you're saying is that if the high-frequency information is gone, it's gone? But that shouldn't matter: we don't need it to be identical to the original. It just needs to sound identical to the original.
If you hit a snare drum 5 times in a row, the high frequency data of each hit will differ wildly, and yet a human won't be able to tell the difference.
I'm one of the authors of the paper that proposes the deep learning model implemented in the blog post, and I would recommend training on a different dataset, such as VCTK (freely available, and what we used in our paper).
Super-resolution methods are very sensitive to the choice of training data. They will overfit seemingly insignificant properties of the training set, such as the type of low-pass filter you are using, or the acoustic conditions under which the recordings were made (e.g. distance to the microphone when recording a speaker).
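As a concrete illustration of the low-pass-filter point (a toy sketch, not the preprocessing from the post or the paper): two equally reasonable ways of producing the low-resolution input yield measurably different signals, so a model trained on one sees a shifted input distribution if tested on the other.

```python
import numpy as np
from scipy import signal

fs, factor = 16000, 4
x = np.random.default_rng(0).standard_normal(fs)  # stand-in for 1 s of speech

# Pipeline (a): decimate with an FIR anti-aliasing filter, then resample up.
up_fir = signal.resample(signal.decimate(x, factor, ftype="fir"), len(x))

# Pipeline (b): decimate with scipy's default IIR (Chebyshev) filter instead.
up_iir = signal.resample(signal.decimate(x, factor, ftype="iir"), len(x))

# Same nominal bandwidth either way, but the inputs a model would see differ.
print("relative difference:",
      np.linalg.norm(up_fir - up_iir) / np.linalg.norm(up_fir))
```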
To capture all the variations present in the TED talks dataset, you would need a very large model and probably train it for >10 epochs. The VCTK dataset is better in this regard.
For comparison, here are our samples: kuleshov.github.io/audio-super-res/
I'm going to try to release the code over the weekend.
Indeed, the TED dataset has a lot of variability in terms of audio quality, etc., which, as you mentioned, is difficult to capture with just 10 epochs of training. I did try a larger network (up to 11 downsampling layers), but this proved even more time-consuming to train (as expected). So I split the difference and went with a network similar to yours, but one that was trainable over a four-day period (10 epochs).
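For context, a stack of "downsampling layers" in this setting is essentially strided 1D convolutions followed by upsampling layers. The toy PyTorch sketch below is not the architecture from the blog post or the paper; the layer count, kernel sizes, channel widths, and residual connection are illustrative assumptions only.

```python
import torch
import torch.nn as nn

class ToyAudioSR(nn.Module):
    """Toy 1D conv encoder/decoder for waveform super-resolution (a sketch,
    not the post's or paper's model). `n_down` is the number of strided
    "downsampling" conv layers, mirrored by transposed convs on the way up."""
    def __init__(self, n_down=4, base_channels=16):
        super().__init__()
        enc, dec = [], []
        ch_in = 1
        for i in range(n_down):
            ch_out = base_channels * 2 ** i
            enc += [nn.Conv1d(ch_in, ch_out, kernel_size=9, stride=2, padding=4),
                    nn.ReLU()]
            ch_in = ch_out
        for i in reversed(range(n_down)):
            ch_out = base_channels * 2 ** (i - 1) if i > 0 else 1
            dec.append(nn.ConvTranspose1d(ch_in, ch_out, kernel_size=9, stride=2,
                                          padding=4, output_padding=1))
            if i > 0:
                dec.append(nn.ReLU())
            ch_in = ch_out
        self.net = nn.Sequential(*enc, *dec)

    def forward(self, x):
        # Residual connection: predict a correction to the input waveform.
        return x + self.net(x)

model = ToyAudioSR(n_down=4)
waveform = torch.randn(1, 1, 8192)   # (batch, channels, samples)
print(model(waveform).shape)         # torch.Size([1, 1, 8192])
```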
I'm interested in seeing how computationally efficient this method turns out to be and how well it generalizes to other audio data, and perhaps to other signals as well. Going on a hunch from the model, I think there are some more efficient methods for doing bandwidth extension on audio samples with better-quality results, but it is great to see more deep learning people take an interest in this domain. I do believe that deep learning can have a tremendous impact in DSP and compression.
(Disclaimer: I developed a somewhat similar method earlier this year applied in audio compression, yet to be published)
While something like this is bound to fail for most music of any complexity (e.g. a singing voice), I've often wondered if this would be highly successful on, say, old solo piano recordings, where the possibilities of the instrument are extremely well-defined and limited.
Thanks for sharing. The possibilities for this kind of technology are endless. Maybe one day we'll start having crystal clear conversations over telephone :)
I am a little curious as to how this factors into fundamental information theory.
In my mind, you are simply taking a 0-2 kHz signal and combining it with an entirely different 0-8 kHz signal that is generated (arbitrarily, IMO) based on the band-limited original data. I can see the argument for having a library of samples as additional, shared information (think of many compression algorithms), but it is still going to be an approximation (lossy).
"The loss function used was the mean-squared error between the output waveform and the original, high-resolution waveform." - This confuses me as a performance metric when dealing with audio waveforms.
I think a good question might be: "What would be better criteria for evaluating the Q (quality) of this system?" Some candidates (see the sketch after this list):
THD between original and output averaged over the duration of the waveforms?
Subjective evaluations (w/ man in the middle training)?
etc...
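As a rough reference, two easy-to-compute objective metrics often reported for this kind of task are global SNR (which the post already mentions) and a frame-averaged log-spectral distance. The sketch below is illustrative; the exact definitions, frame sizes, and constants are assumptions, not necessarily what the post or paper used.

```python
import numpy as np

def snr_db(reference, estimate):
    """Global signal-to-noise ratio of `estimate` against `reference`, in dB."""
    noise = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2))

def log_spectral_distance(reference, estimate, n_fft=2048, hop=512):
    """Frame-averaged log-spectral distance (lower is better); one common variant."""
    def log_power_spectra(x):
        frames = np.lib.stride_tricks.sliding_window_view(x, n_fft)[::hop]
        spec = np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=-1)) ** 2
        return np.log10(spec + 1e-10)
    lp_ref, lp_est = log_power_spectra(reference), log_power_spectra(estimate)
    return np.mean(np.sqrt(np.mean((lp_ref - lp_est) ** 2, axis=-1)))

# Toy usage with random stand-ins for the original and reconstructed waveforms.
rng = np.random.default_rng(0)
original = rng.standard_normal(16000)
reconstructed = original + 0.1 * rng.standard_normal(16000)
print(snr_db(original, reconstructed),
      log_spectral_distance(original, reconstructed))
```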