As a practitioner specializing in extremely fast-training neural networks, seeing a paper in 2023 treat fp32 as the gold standard over pure non-mixed fp16/bf16 is a bit shocking to me and feels dated/distracting from the discussion. They make good points, but unless I am hopelessly misinformed, it's been pretty well established at this point in a number of circles that fp32 is overkill for the majority of uses for many modern-day practitioners. Loads of networks train directly in bfloat16 as the standard -- a lot of the modern LLMs among them. Mixed precision is very much no longer needed, not even with fp16, if you're willing to tolerate some range hacks. If you don't want the range hacks, just use bfloat16 directly. The complexity of mixed precision isn't worth it, it adds very little, and the dynamic loss scaler a lot of people use is just begging for more issues.
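For anyone who wants to see what that actually looks like in practice, here's a minimal sketch of the pure-bf16 pattern in PyTorch (toy model and made-up data standing in for a real pipeline, assumes a CUDA device) -- note there's no autocast and no GradScaler anywhere:

```python
import torch
import torch.nn as nn

# Toy model, cast entirely to bfloat16 (weights, activations, grads all bf16).
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))
model = model.to(device="cuda", dtype=torch.bfloat16)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

# Stand-in data loader: (float inputs, integer class labels).
loader = [(torch.randn(64, 512), torch.randint(0, 10, (64,))) for _ in range(10)]

for x, y in loader:
    x = x.to(device="cuda", dtype=torch.bfloat16)
    y = y.to(device="cuda")
    loss = loss_fn(model(x), y)        # forward pass entirely in bf16
    opt.zero_grad(set_to_none=True)
    loss.backward()                    # bf16 gradients; no loss scaling needed
    opt.step()
```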
Both of the main repos that I've published in terms of speed benchmarks train directly in pure fp16 and bf16 respectively, without any fp32 frippery; if you want to see an example of both paradigms working successfully, feel free to take a look (I'll note that bf16 is simpler on the whole for a few reasons -- generally seamless): https://github.com/tysam-code/hlb-CIFAR10 [for fp16] and https://github.com/tysam-code/hlb-gpt [for bf16]
Personally, from my experience, I think fp16/bf16 is honestly a bit more expressive than what we need; fp8 seems to do just fine and I think will be quite alright with some accommodations, just as with pure fp16. The what and the how of that is a story for a different day (and at this point, the max pooling operation is basically one of the slowest operations now).
You'll have to excuse my frustration a bit; it's just jarring to see a street sign from way in the past fly forward in the wind to hit you in the face before tumbling on its merry way. Additionally, the general discussion in the comment section doesn't seem to touch on what is a pretty clearly established consensus in certain research circles. It's not really much of a debate anymore: it works, and we're off to bigger and better problems that I think we should be talking about. I guess in one sense that does justify the paper's utility, but it's also a bit frustrating because it resets the conversation a few notches back from where I personally feel it actually is at the moment.
We've got to move out of the past; this fp32 business, to me personally, is like writing a ReLU-activated VGG network in Keras on TensorFlow. Phew.
And while we're at it, if I may throw my frumpy-grumpy hat right back into the ring: this is an information-theoretic problem! Not enough discussion of Shannon and co. Let's please fix that too. See my other rants for cross-references to that, should you be so inclined to punish yourself in that manner.
I'm not an expert at all on this stuff but it seems that there are a lot of opinions floating around here on a topic that should be pretty easy to analyze with statistics. Which is supposedly something AI researchers should be very good at.
Basically what you want to know is the range and distribution of values. And then come up with efficient ways to store and encode those.
If you can go from having billions of representable values (32 bit) to around 65 thousand (16 bit) without too much penalty, that suggests 32 bit is probably overkill. Also, why use floats at all? Integer multiplication is cheap. And are all values equal in importance? Is it an even distribution of values, or are some ranges of values more important than others?
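To make that concrete, here's a rough sketch of the kind of measurement I have in mind (PyTorch, with a made-up weight tensor standing in for a trained layer, so purely illustrative): look at the empirical range of the values and the Shannon entropy of their histogram to see how many bits they really "use".

```python
import torch

w = torch.randn(4096, 4096) * 0.02            # stand-in for a trained weight matrix
print("range:", w.min().item(), "to", w.max().item())

# Histogram with 2**16 bins (the most a 16-bit code could distinguish),
# then the entropy of that histogram in bits per value.
hist = torch.histc(w, bins=2**16, min=w.min().item(), max=w.max().item())
p = hist[hist > 0] / hist.sum()
entropy_bits = -(p * p.log2()).sum().item()
print(f"empirical entropy: {entropy_bits:.2f} bits per value")
```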
To me it seems that the topology of the neural networks would be a factor here. The reason for having more bits is having large numbers of incoming or outgoing connections. With only a few connections it probably matters less. But if you have thousands, noise/rounding errors might have a bigger impact. That's just my intuition for this. Again, not an expert.
My point here is that this seems a hotly debated topic but people aren't using a lot of the type of statistical arguments I would expect for that.
Back to the Shannon question at hand (slightly answered in my next answer).
> Also why use floats at all? Integer multiplication is cheap.
Gaussianity, and they cost about the same where we're using them in current GPGPUs/tensor cores (though if Horace He steps in and corrects me on some detail of this/etc., I'll gladly defer).
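As a purely illustrative sketch of the integer route (made-up numbers, not anyone's production scheme): integers need an explicit per-tensor scale, because the weights sit in a roughly Gaussian blob around zero, whereas floats carry that scale implicitly in their exponent.

```python
import torch

w = torch.randn(1024) * 0.05                 # roughly Gaussian weights around zero
scale = w.abs().max() / 127                  # symmetric int8 scale: max maps to 127
w_q = torch.clamp((w / scale).round(), -128, 127).to(torch.int8)
w_dq = w_q.float() * scale                   # dequantize to compare against the original
print("max abs rounding error:", (w - w_dq).abs().max().item())
```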
> are some ranges of values more important than others?
See above, also range is a good way to keep from NaNs without the overhead of NaN checking steps. Think of it as a savings account for a rainy day of capacity.
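A tiny concrete illustration of the range point (nothing clever, just the headroom bf16 buys you over fp16):

```python
import torch

x = torch.tensor(70000.0)      # a plausible activation/gradient spike
print(x.to(torch.float16))     # inf -- fp16 tops out around 65504
print(x.to(torch.bfloat16))    # finite (rounded) -- bf16 keeps fp32's exponent range
```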
> The reason for having more bits is having large numbers of incoming or outgoing connections.
This is good intuition, though the network survives on surprisingly little precision. I had a similar feeling until one magical moment with hlb-CIFAR10 where I had to keep kicking up the quantization regularization for it to do well (for one of the older versions, at least).
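To make the fan-in intuition a bit more concrete, here's a toy sketch (made-up sizes, deliberately using fp16 for drama): summing lots of small products in a low-precision accumulator drifts as the fan-in grows, while the same fp16 inputs summed into an fp32 accumulator stay close to the fp64 reference -- which is roughly why tensor cores accumulate in fp32 even with fp16/bf16 inputs.

```python
import torch

for n in (256, 4096, 65536):                        # pretend fan-in sizes
    a = torch.rand(n, dtype=torch.float64) * 0.1     # toy incoming activations
    b = torch.rand(n, dtype=torch.float64) * 0.1     # toy weights
    ref = (a * b).sum()                              # fp64 reference dot product

    a16, b16 = a.to(torch.float16), b.to(torch.float16)
    acc16 = (a16 * b16).sum(dtype=torch.float16)     # fp16 inputs, fp16 accumulator
    acc32 = (a16 * b16).sum(dtype=torch.float32)     # fp16 inputs, fp32 accumulator

    err16 = (acc16.double() - ref).abs() / ref
    err32 = (acc32.double() - ref).abs() / ref
    print(f"n={n:6d}  fp16-accum err: {err16.item():.1e}  fp32-accum err: {err32.item():.1e}")
```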
> My point here is that this seems a hotly debated topic but people aren't using a lot of the type of statistical arguments I would expect for that.
I agree to a degree, though in my modality of thought I would replace statistics with information theory, since that directly informs us of a few things we might be able to/should expect during network training -- as you noted in your second-to-last paragraph with noise/rounding errors/etc., which I think is good stuff.
However, the empirical numbers do show pretty clearly that it works well, so I'm not too sure where the need for hot debate is. RWKV is one example of a scaled model that uses it. You're sort of shooting yourself in the foot by not using it these days, with GPU memory being the way it is. A flat 2x memory savings (for model weights) is huge, even if I think it mainly shows up in memory transfers. Lots of networks are memory-bound these days, unfortunately.
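Back-of-the-envelope on that 2x, for the weights only (hypothetical parameter count; optimizer state and activations not included):

```python
params = 7e9                                            # e.g. a 7B-parameter model (assumed)
print(f"fp32 weights: {params * 4 / 2**30:.1f} GiB")    # ~26.1 GiB
print(f"bf16 weights: {params * 2 / 2**30:.1f} GiB")    # ~13.0 GiB
```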
I think you have good NN-related intuition. I feel like you would find it fun to play around with (if you haven't already). Many thanks for sharing, I greatly appreciated your response. It made me think a bit, and that especially is something I value. So thank you very much for that. <3 :) :thumbsup: :thumbsup:
> The reason for having more bits is having large numbers of incoming or outgoing connections.
I am having trouble getting my head around this statement -- could you please explain it more? The idea is not intuitive to me. Any example would be much appreciated.
My current thought process is this: how does having more dynamic range in a single weight/parameter help when there are more incoming and outgoing connections?
Maybe I am approaching this statement the wrong way.