
> We can divide by a number (scaling_factor) to scale down its magnitude to the right level

This argument bugs me a bit... since these numbers are represented using floating point, whose relative precision does not depend on their magnitude, what is the point of scaling them?
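
For what it's worth, a quick numpy sketch (my addition, not from the article) of that point: the relative spacing between adjacent float64 values stays on the order of 1e-16 no matter the magnitude.

    import numpy as np

    # ULP relative to the value itself: stays around 1e-16 for float64,
    # whether the number is tiny or huge.
    for v in [1e-3, 1.0, 1e6, 1e100]:
        print(v, np.spacing(v) / v)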

Furthermore, I do not believe his first example. Is torch really that bad? In octave:

    x = randn(512, 1);
    A = randn(512);
    y = A^100 * x;
    mean(y), std(y)
gives regular numbers (9.1118e+135 and 1.9190e+137).

They are large, but far from overflowing. And this corresponds to a network of depth 100, which is not a realistic scenario.



That's because octave is using doubles. You can do the exact same thing in PyTorch by passing dtype=torch.float64 to torch.randn.
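
For example, a rough sketch (mine, assuming a recent PyTorch with torch.linalg.matrix_power; not the article's exact code) that mirrors the octave snippet in double precision:

    import torch

    # Same experiment as the octave snippet above, but in float64.
    x = torch.randn(512, 1, dtype=torch.float64)
    A = torch.randn(512, 512, dtype=torch.float64)
    y = torch.linalg.matrix_power(A, 100) @ x
    print(y.mean().item(), y.std().item())  # large but finite, as in octave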


> They are large, but far from overflowing.

Sure, but isn't "large" relative? You can make them overflow in octave as well, given enough layers (sketched at the end of this comment). Which brings us to the next point :-)

> And this corresponds to a network of depth 100, which is not a realistic scenario.

Actually, depth 100 is not unrealistic at all these days! https://arxiv.org/abs/1611.09326
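
To put a rough number on the overflow point above, here's a quick sketch (mine, not from the article) that keeps applying a random 512x512 matrix and reports the first depth at which the result stops being finite:

    import torch

    def overflow_depth(dtype, max_depth=1000):
        # Repeatedly apply a random 512x512 matrix until the result
        # contains inf/nan, then return that depth.
        torch.manual_seed(0)
        A = torch.randn(512, 512, dtype=dtype)
        h = torch.randn(512, 1, dtype=dtype)
        for depth in range(1, max_depth + 1):
            h = A @ h
            if not torch.isfinite(h).all():
                return depth
        return None

    print(overflow_depth(torch.float32))  # a few dozen layers
    print(overflow_depth(torch.float64))  # a few hundred layers, but it still blows up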


There are approaches to keep activations stable despite the depth (SELU, for example).
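
A minimal sketch (my own, assuming SELU with LeCun-normal-style init, not a full training setup): the activations stay close to zero mean and unit variance even after 100 layers, instead of exploding.

    import torch

    torch.manual_seed(0)
    h = torch.randn(1024, 512)
    for _ in range(100):
        # LeCun-normal style init (std = 1/sqrt(fan_in)), as recommended for SELU
        W = torch.randn(512, 512) / 512 ** 0.5
        h = torch.selu(h @ W)
    print(h.mean().item(), h.std().item())  # stays roughly 0 mean / unit std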



