> We can divide by a number (scaling_factor) to scale down its magnitude to the right level
This argument bugs me a bit... since these numbers are represented using floating point, whose relative precision does not depend on their magnitude, what is the point of scaling them?
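To illustrate what I mean (a quick sketch using Octave's built-in eps, which returns the spacing between adjacent doubles near a given value): the relative spacing stays around 2.2e-16 regardless of magnitude.

eps(1)               % spacing near 1: 2.2204e-16
eps(1e100)           % spacing near 1e100: ~1.9427e+84
eps(1e100) / 1e100   % relative spacing: ~1.9e-16, same order as eps(1)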
Furthermore, I do not believe his first example. Is torch really that bad? In Octave:
x = randn(512, 1);   % random input vector
A = randn(512);      % random 512x512 matrix
y = A^100 * x;       % apply the same random linear map 100 times
mean(y), std(y)
this gives regular numbers (9.1118e+135 and 1.9190e+137). They are large, but far from overflowing. And this corresponds to a network 100 layers deep, which is not a realistic scenario anyway.
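For reference, a rough check of the remaining headroom (again in Octave, which uses double precision by default):

realmax              % largest finite double: 1.7977e+308
log10(realmax)       % ~308.25, so 1.9e+137 is ~171 orders of magnitude below overflow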