I hesitate to even post this, but I listened to the audio examples and it seems like this project was not yet a success. I'm not trying to be a jerk or snarky, but the reconstructed audio sounded terrible.
I have to agree. There's certainly more high-frequency content, but it seems mostly like noise, with only a vague amplitude correlation to the existing audio.
I'd be curious to see if any better results could be obtained by applying a similar technique in the frequency domain.
Opus does something similar in the frequency domain, but much simpler. It just copies codewords (minus energy) from lower bands to higher bands. It also lets you signal the energy of the copied bands (something you wouldn't have access to if reconstructing blindly).
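This isn't the actual Opus/CELT code path, which folds normalized MDCT codewords per band and signals the real band energies in the bitstream, but a toy numpy sketch of the general idea (copy the low band upward and rescale) might look like this; the function name and gain value are made up for illustration:

```python
import numpy as np

def toy_band_fold(x, fs=16000, cutoff=2000, gain=0.3):
    """Toy 'band folding': fill the empty spectrum above `cutoff` by tiling
    the band just below it, scaled down by `gain`. Illustrative only; Opus
    works on normalized per-band MDCT codewords and signals true energies."""
    X = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    lo = (freqs > 0) & (freqs <= cutoff)   # source band
    hi = freqs > cutoff                    # empty band to fill
    src = X[lo]
    reps = int(np.ceil(hi.sum() / src.size))
    X[hi] = np.tile(src, reps)[: hi.sum()] * gain
    return np.fft.irfft(X, n=len(x))

# Example: a 2 kHz band-limited test signal gains synthetic high-band content.
t = np.arange(16000) / 16000.0
lowpassed = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 1500 * t)
widened = toy_band_fold(lowpassed)
```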
"applying a similar technique in the frequency domain", "Maybe training an image reconstructor on the short term spectrogram" - This is what I originally thought to do. However, this approach suffers from information loss whenever you transform from the frequency domain back to the time domain. Since the goal was super-resolution in the time domain, working in the time domain is more sensible.
Mathematically, the DFT is invertible, i.e. lossless, but in practice there will be a bit of loss due to the finite precision of floating-point numbers. Even though it isn't strictly lossless, the amount of loss should be minuscule compared to the 16 kHz -> 2 kHz loss you are trying to overcome.
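A quick numpy check of that claim (purely illustrative):

```python
import numpy as np

# Round-trip one second of 16 kHz "audio" through the DFT and back.
rng = np.random.default_rng(0)
x = rng.standard_normal(16000)

x_roundtrip = np.fft.irfft(np.fft.rfft(x), n=len(x))
print("max round-trip error:", np.max(np.abs(x_roundtrip - x)))
# Prints something on the order of 1e-13: hundreds of dB below the signal,
# i.e. negligible next to discarding everything above 2 kHz.
```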
It's not precision loss; it's that when you take a DFT you have to choose an interval. If you choose a short interval you are less certain about frequencies, while if you choose a long interval you are less certain about time-domain changes (i.e., changes in the signal over your time window).
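For a concrete sense of that trade-off, here is a small scipy sketch (illustrative only) comparing the frequency and time resolution of a short versus a long STFT window:

```python
import numpy as np
from scipy import signal

fs = 16000
x = np.random.default_rng(0).standard_normal(fs)  # 1 s of noise as a stand-in

for nperseg in (64, 4096):
    f, t, _ = signal.stft(x, fs=fs, nperseg=nperseg)
    print(f"window = {nperseg:4d} samples -> "
          f"frequency step = {f[1] - f[0]:6.1f} Hz, "
          f"time step = {(t[1] - t[0]) * 1000:6.1f} ms")
# 64-sample window:   coarse in frequency (250 Hz bins), fine in time (2 ms)
# 4096-sample window: fine in frequency (~4 Hz bins), coarse in time (128 ms)
```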
When using DL, perhaps you might try a downsampling scheme that suits your DL model?
I mean, yes, it would be awesome to use your network to upsample arbitrary audio, but that is apparently hard. What about upsampling something DL-friendly instead, and making the challenge reducing the size of that downsampled representation?
Since time-domain content is the reconstruction target, wouldn't LSTMs be a better choice than CNNs? I would think the spectral content is time-varying and depends on the sequential history.
My thinking is that this is a good GAN problem. An L2 loss will have these bad, trivial upscalings as local minima: since L2 in the time domain is the same as L2 in the frequency domain (Parseval's theorem), you can think of it in the frequency domain as having a big black area to infill from very little information. With some sort of perceptual similarity loss, on the other hand, there would be lots of adjacent improvements in quality that reduce the error and make training easier. I think this matches the results seen in image upscaling, too.
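The "L2 in the time domain is the same as L2 in the frequency domain" step is Parseval's theorem; a two-line numpy sanity check, just for illustration:

```python
import numpy as np

x = np.random.default_rng(0).standard_normal(2048)
X = np.fft.fft(x)

# Parseval: energy in the time domain equals energy in the frequency domain
# (up to the 1/N normalization convention used by numpy's FFT).
print(np.sum(x ** 2), np.sum(np.abs(X) ** 2) / len(x))  # the two values match
```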
In fact, when you listen to the downsampled example, there is actually a lot of information left in the extract. Way more than enough. That's because frequency should be viewed on a log scale to be more relevant to the human ear.
Here the frequency cutoff is 2 kHz, which is already a fairly high pitch.
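One way to put a rough number on that, using the common HTK-style mel formula as a stand-in for "log scale" (chosen here purely for illustration): the band below the 2 kHz cutoff already covers roughly half of the mel range up to the 8 kHz Nyquist frequency of 16 kHz audio.

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel formula (one of several conventions in common use).
    return 2595.0 * np.log10(1.0 + f / 700.0)

cutoff, nyquist = 2000.0, 8000.0
fraction = hz_to_mel(cutoff) / hz_to_mel(nyquist)
print(f"{fraction:.0%} of the mel range up to {nyquist:.0f} Hz "
      f"lies below the {cutoff:.0f} Hz cutoff")  # roughly 54%
```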
I completely agree. You could even tell from the picture of the spectrum above that the "reconstruction" was not a success. The spectrum looks like it was reconstructed by a flat extrapolation of the amplitude of the last known frequency.
This is exactly the kind of project where deep models should excel. Something's not working properly here.
"Something's not working properly here" - I disagree. The model will overtrain (i.e. perfectly reconstruct the original waveforms of a small training set), which indicates it's capable of learning the necessary transformation. The problem lies in the limited amount of training time I had. To reiterate from an earlier comment, I trained on only 10 epochs, while the paper this is base on claimed to train on 400. Much more training is required for this model to generalize well without degrading the signal-to-noise ratio.
Hey thanks for the comment. That makes sense, I'd be interested in hearing the results with more training. This could work well and have a good range of applications.
"the reconstructed audio sounded terrible" - I think this is referring to the amount of static noise in the reconstructed waveform. Indeed, the SNR clearly shows the reconstruction is slightly worse than the downsampled waveform. As mentioned in the post, I strongly believe this is due to the limited amount of training I performed. The number of epochs of training data in my case was only 10 while the paper this project is based on trained for 400 epochs. During training I noticed a strong dependence on training epochs and perceptual performance.
My ears think so too, but upsampling by just 2x is roughly the same difficulty as upsampling an image by 2x. As you probably know, you can't just CIA-like "ENHANCE" an image to double its resolution and expect its noise level to be better than, say, 10 decibels (of image brightness). Yet our ears can notice noise 40-50 decibels below the signal, so it would be nearly impossible to reconstruct higher frequencies such that the result has no noticeable noise.
In this research, the author is attempting to upsample by a bit more than 2.
That's the point of using deep learning here. Of course you can't make up the missing information, but by training the model with a lot of samples, it should eventually reach a point where it produces the most likely original information.
> you can't just CIA-like "ENHANCE" an image to double its resolution
I think what you're saying is that if the high-frequency information is gone, it's gone? But that shouldn't matter: we don't need it to be identical to the original. It just needs to sound identical to the original.
If you hit a snare drum 5 times in a row, the high frequency data of each hit will differ wildly, and yet a human won't be able to tell the difference.
I'm one of the authors of the paper that proposes the deep learning model implemented in the blog post, and I would recommend training on a different dataset, such as VCTK (freely available, and what we used in our paper).
Super-resolution methods are very sensitive to the choice of training data. They will overfit seemingly insignificant properties of the training set, such as the type of low-pass filter you are using, or the acoustic conditions under which the recordings were made (e.g. distance to the microphone when recording a speaker).
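As a concrete illustration of the low-pass-filter point (a toy sketch, not the preprocessing from the post or the paper): two equally reasonable ways of producing the low-resolution input yield measurably different signals, so a model trained on one sees a shifted input distribution if tested on the other.

```python
import numpy as np
from scipy import signal

fs, factor = 16000, 4
x = np.random.default_rng(0).standard_normal(fs)  # stand-in for 1 s of speech

# Pipeline (a): decimate with an FIR anti-aliasing filter, then resample up.
up_fir = signal.resample(signal.decimate(x, factor, ftype="fir"), len(x))

# Pipeline (b): decimate with scipy's default IIR (Chebyshev) filter instead.
up_iir = signal.resample(signal.decimate(x, factor, ftype="iir"), len(x))

# Same nominal bandwidth either way, but the inputs a model would see differ.
print("relative difference:",
      np.linalg.norm(up_fir - up_iir) / np.linalg.norm(up_fir))
```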
To capture all the variations present in the TED talks dataset, you would need a very large model and probably train it for >10 epochs. The VCTK dataset is better in this regard.
For comparison, here are our samples: kuleshov.github.io/audio-super-res/
I'm going to try to release the code over the weekend.
Indeed, the TED dataset has a lot of variability in terms of audio quality, etc., which, as you mentioned, is difficult to capture with just 10 epochs of training. I did try a larger network (up to 11 downsampling layers), but this proved even more time-consuming to train (as expected). So I split the difference and went with a network similar to yours, but one that was trainable over a four-day period (10 epochs).
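For context, a stack of "downsampling layers" in this setting is essentially strided 1D convolutions followed by upsampling layers. The toy PyTorch sketch below is not the architecture from the blog post or the paper; the layer count, kernel sizes, channel widths, and residual connection are illustrative assumptions only.

```python
import torch
import torch.nn as nn

class ToyAudioSR(nn.Module):
    """Toy 1D conv encoder/decoder for waveform super-resolution (a sketch,
    not the post's or paper's model). `n_down` is the number of strided
    "downsampling" conv layers, mirrored by transposed convs on the way up."""
    def __init__(self, n_down=4, base_channels=16):
        super().__init__()
        enc, dec = [], []
        ch_in = 1
        for i in range(n_down):
            ch_out = base_channels * 2 ** i
            enc += [nn.Conv1d(ch_in, ch_out, kernel_size=9, stride=2, padding=4),
                    nn.ReLU()]
            ch_in = ch_out
        for i in reversed(range(n_down)):
            ch_out = base_channels * 2 ** (i - 1) if i > 0 else 1
            dec.append(nn.ConvTranspose1d(ch_in, ch_out, kernel_size=9, stride=2,
                                          padding=4, output_padding=1))
            if i > 0:
                dec.append(nn.ReLU())
            ch_in = ch_out
        self.net = nn.Sequential(*enc, *dec)

    def forward(self, x):
        # Residual connection: predict a correction to the input waveform.
        return x + self.net(x)

model = ToyAudioSR(n_down=4)
waveform = torch.randn(1, 1, 8192)   # (batch, channels, samples)
print(model(waveform).shape)         # torch.Size([1, 1, 8192])
```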
I'm interested in seeing how computationally efficient this method turns out to be and how well it generalizes to other audio data, and perhaps to other signals as well. Going on a hunch from the model, I think there are some more efficient methods for doing bandwidth extension on audio samples with better-quality results, but it is great to see more deep learning people take an interest in this domain. I do believe that deep learning can have a tremendous impact in DSP and compression.
(Disclaimer: I developed a somewhat similar method earlier this year applied in audio compression, yet to be published)
While something like this is bound to fail for most music of any complexity (e.g. a singing voice), I've often wondered if this would be highly successful on, say, old solo piano recordings, where the possibilities of the instrument are extremely well-defined and limited.
Thanks for sharing. The possibilities for this kind of technology are endless. Maybe one day we'll start having crystal clear conversations over telephone :)
I am a little curious as to how this factors into fundamental information theory.
In my mind, you are simply taking a 0-2 kHz signal and combining it with an entirely different 0-8 kHz signal that is generated (arbitrarily, IMO) based on the band-limited original data. I can see the argument for having a library of samples as additional, shared information (think of many compression algorithms), but it is still going to be an approximation (lossy).
"The loss function used was the mean-squared error between the output waveform and the original, high-resolution waveform." - This confuses me as a performance metric when dealing with audio waveforms.
I think a good question might be: "What would be better criteria for evaluating the Q (quality) of this system?" Some candidates (see the sketch after this list):
THD between original and output averaged over the duration of the waveforms?
Subjective evaluations (w/ man in the middle training)?
etc...
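As a rough reference, two easy-to-compute objective metrics often reported for this kind of task are global SNR (which the post already mentions) and a frame-averaged log-spectral distance. The sketch below is illustrative; the exact definitions, frame sizes, and constants are assumptions, not necessarily what the post or paper used.

```python
import numpy as np

def snr_db(reference, estimate):
    """Global signal-to-noise ratio of `estimate` against `reference`, in dB."""
    noise = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2))

def log_spectral_distance(reference, estimate, n_fft=2048, hop=512):
    """Frame-averaged log-spectral distance (lower is better); one common variant."""
    def log_power_spectra(x):
        frames = np.lib.stride_tricks.sliding_window_view(x, n_fft)[::hop]
        spec = np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=-1)) ** 2
        return np.log10(spec + 1e-10)
    lp_ref, lp_est = log_power_spectra(reference), log_power_spectra(estimate)
    return np.mean(np.sqrt(np.mean((lp_ref - lp_est) ** 2, axis=-1)))

# Toy usage with random stand-ins for the original and reconstructed waveforms.
rng = np.random.default_rng(0)
original = rng.standard_normal(16000)
reconstructed = original + 0.1 * rng.standard_normal(16000)
print(snr_db(original, reconstructed),
      log_spectral_distance(original, reconstructed))
```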