Using sin(x) or the other input features like x^2 goes back to making it too easy, though. So far the best I can do is 7 layers of 7 neurons each, which gets a loss of 0.02. 3x7 almost cracks the Swiss Roll but can't quite finish it off and gets stuck at 0.05: https://imgur.com/Z3f2ECc ... Surprisingly, 2x8 can do it, as long as I have noise or regularization on, but 8/7 then seriously struggles. Is 16 total neurons a critical limit here?
I managed to get to 0.01 loss from only x1/x2, using 3 hidden layers, L1 regularization, a bit of added noise, and some patience: http://i.imgur.com/Y3zKpJF.png
Yes, noise & regularization seem to be key here. I've gotten a 2-layer with 7/8 neurons down to 0.06 and dropping but only with noise & l1: http://playground.tensorflow.org/#activation=relu®ulariza... Final loss of 0.051. Interestingly, increasing noise from 10 to 15 destroys performance, loss of 0.47.
Is it really "making it too easy" if you're applying your knowledge of the structure of the problem space to make it easier for the computer to solve? Certainly this isn't easy to do with every problem, but it seems like a better idea in general to start with input features you suspect are relevant.
In the "swiss cake roll" the circular nature of the classes suggests using a sin or cos function, and the fact that they spiral out suggests also inputting magnitude information. Sure, you can just add more neurons that will end up computing the same thing, but we might as well give the computer a head start when we can.