Indeed, the TED dataset has a lot of variability in audio quality, etc., which, as you mentioned, is difficult to capture with just 10 epochs of training. I did try a larger network (up to 11 downsampling layers), but this proved even more time-consuming to train, as expected. So I split the difference and went with a network similar to yours, but one that was trainable over a four-day period (10 epochs).