As a practitioner specializing in extremely fast-training neural networks, seeing a paper in 2023 treat fp32 as the gold standard over pure non-mixed fp16/bf16 is a bit shocking to me and feels dated/distracting from the discussion. They make good points, but unless I am hopelessly misinformed, it's been pretty well established at this point in a number of circles that fp32 is overkill for the majority of uses for many modern-day practitioners. Loads of networks train directly in bfloat16 as the standard -- a lot of the modern LLMs among them. Mixed precision is very much no longer needed, not even with fp16, if you're willing to tolerate some range hacks. If you don't want the range hacks, just use bfloat16 directly. The complexity of mixed precision isn't worth it, it adds very little, and the dynamic loss scaler a lot of people use is just begging for more issues.
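For anyone who wants to see what that actually looks like in practice, here's a minimal sketch of the pure-bf16 pattern in PyTorch (toy model and made-up data standing in for a real pipeline, assumes a CUDA device) -- note there's no autocast and no GradScaler anywhere:

```python
import torch
import torch.nn as nn

# Toy model, cast entirely to bfloat16 (weights, activations, grads all bf16).
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))
model = model.to(device="cuda", dtype=torch.bfloat16)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

# Stand-in data loader: (float inputs, integer class labels).
loader = [(torch.randn(64, 512), torch.randint(0, 10, (64,))) for _ in range(10)]

for x, y in loader:
    x = x.to(device="cuda", dtype=torch.bfloat16)
    y = y.to(device="cuda")
    loss = loss_fn(model(x), y)        # forward pass entirely in bf16
    opt.zero_grad(set_to_none=True)
    loss.backward()                    # bf16 gradients; no loss scaling needed
    opt.step()
```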
Both of the main repos that I've published in terms of speed benchmarks train directly in pure fp16 and bf16 respectively, without any fp32 frippery; if you want to see an example of both paradigms working successfully, feel free to take a look (I'll note that bf16 is simpler on the whole for a few reasons -- generally seamless): https://github.com/tysam-code/hlb-CIFAR10 [for fp16] and https://github.com/tysam-code/hlb-gpt [for bf16]
Personally, from my experience, I think fp16/bf16 is honestly a bit more expressive than what we need; fp8 seems to do just fine and I think will be quite alright with some accommodations, just as with pure fp16. The what and the how of that is a story for a different day (and at this point, the max pooling operation is basically one of the slowest operations now).
You'll have to excuse my frustration a bit; it's just jarring to see a street sign from way in the past fly forward in the wind to hit you in the face before tumbling on its merry way. Additionally, the general discussion in the comment section doesn't seem to touch on what is a pretty clearly established consensus in certain research circles. It's not really much of a debate anymore: it works, and we're off to bigger and better problems that I think we should be talking about. I guess in one sense that does justify the paper's utility, but it's also a bit frustrating because it resets the conversation a few notches back from where I personally feel it actually is at the moment.
We've got to move out of the past; this fp32 business, to me personally, is like writing a ReLU-activated VGG network in Keras on TensorFlow. Phew.
And while we're at it, if I may throw my frumpy-grumpy hat right back into the ring: this is an information-theoretic problem! Not enough discussion of Shannon and co. Let's please fix that too. See my other rants for cross-references to that, should you be so inclined to punish yourself in that manner.
I'm not an expert at all on this stuff but it seems that there are a lot of opinions floating around here on a topic that should be pretty easy to analyze with statistics. Which is supposedly something AI researchers should be very good at.
Basically what you want to know is the range and distribution of values. And then come up with efficient ways to store and encode those.
If you can go from having billions of representable values (32 bit) to around 65 thousand (16 bit) without too much penalty, that suggests 32 bit is probably overkill. Also, why use floats at all? Integer multiplication is cheap. And are all values equal in importance? Is it an even distribution of values, or are some ranges of values more important than others?
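To make that concrete, here's a rough sketch of the kind of measurement I have in mind (PyTorch, with a made-up weight tensor standing in for a trained layer, so purely illustrative): look at the empirical range of the values and the Shannon entropy of their histogram to see how many bits they really "use".

```python
import torch

w = torch.randn(4096, 4096) * 0.02            # stand-in for a trained weight matrix
print("range:", w.min().item(), "to", w.max().item())

# Histogram with 2**16 bins (the most a 16-bit code could distinguish),
# then the entropy of that histogram in bits per value.
hist = torch.histc(w, bins=2**16, min=w.min().item(), max=w.max().item())
p = hist[hist > 0] / hist.sum()
entropy_bits = -(p * p.log2()).sum().item()
print(f"empirical entropy: {entropy_bits:.2f} bits per value")
```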
To me it seems that the topology of the neural networks would be a factor here. The reason for having more bits is having large numbers of incoming or outgoing connections. With only a few connections it probably matters less. But if you have thousands, noise/rounding errors might have a bigger impact. That's just my intuition for this. Again, not an expert.
My point here is that this seems a hotly debated topic but people aren't using a lot of the type of statistical arguments I would expect for that.
Back to the Shannon question at hand (slightly answered in my next answer).
> Also why use floats at all? Integer multiplication is cheap.
Gaussianity, and they cost about the same where we're using them in current GPGPUs/tensor cores (though if Horace He steps in and corrects me on some detail of this/etc., I'll gladly defer).
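As a purely illustrative sketch of the integer route (made-up numbers, not anyone's production scheme): integers need an explicit per-tensor scale, because the weights sit in a roughly Gaussian blob around zero, whereas floats carry that scale implicitly in their exponent.

```python
import torch

w = torch.randn(1024) * 0.05                 # roughly Gaussian weights around zero
scale = w.abs().max() / 127                  # symmetric int8 scale: max maps to 127
w_q = torch.clamp((w / scale).round(), -128, 127).to(torch.int8)
w_dq = w_q.float() * scale                   # dequantize to compare against the original
print("max abs rounding error:", (w - w_dq).abs().max().item())
```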
> are some ranges of values more important than others?
See above, also range is a good way to keep from NaNs without the overhead of NaN checking steps. Think of it as a savings account for a rainy day of capacity.
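A tiny concrete illustration of the range point (nothing clever, just the headroom bf16 buys you over fp16):

```python
import torch

x = torch.tensor(70000.0)      # a plausible activation/gradient spike
print(x.to(torch.float16))     # inf -- fp16 tops out around 65504
print(x.to(torch.bfloat16))    # finite (rounded) -- bf16 keeps fp32's exponent range
```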
> The reason for having more bits is having large numbers of incoming or outgoing connections.
This is good intuition, though the network survives on surprisingly little precision. I had a similar feeling until one magical moment with hlb-CIFAR10 where I had to keep kicking up the quantization regularization for it to do well (for one of the older versions, at least).
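To make the fan-in intuition a bit more concrete, here's a toy sketch (made-up sizes, deliberately using fp16 for drama): summing lots of small products in a low-precision accumulator drifts as the fan-in grows, while the same fp16 inputs summed into an fp32 accumulator stay close to the fp64 reference -- which is roughly why tensor cores accumulate in fp32 even with fp16/bf16 inputs.

```python
import torch

for n in (256, 4096, 65536):                        # pretend fan-in sizes
    a = torch.rand(n, dtype=torch.float64) * 0.1     # toy incoming activations
    b = torch.rand(n, dtype=torch.float64) * 0.1     # toy weights
    ref = (a * b).sum()                              # fp64 reference dot product

    a16, b16 = a.to(torch.float16), b.to(torch.float16)
    acc16 = (a16 * b16).sum(dtype=torch.float16)     # fp16 inputs, fp16 accumulator
    acc32 = (a16 * b16).sum(dtype=torch.float32)     # fp16 inputs, fp32 accumulator

    err16 = (acc16.double() - ref).abs() / ref
    err32 = (acc32.double() - ref).abs() / ref
    print(f"n={n:6d}  fp16-accum err: {err16.item():.1e}  fp32-accum err: {err32.item():.1e}")
```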
> My point here is that this seems a hotly debated topic but people aren't using a lot of the type of statistical arguments I would expect for that.
I agree to a degree, though in my modality of thought I would replace statistics with information theory, since that directly informs us of a few things we might be able to/should expect during network training -- as you noted in your second-to-last paragraph with noise/rounding errors/etc., which I think is good stuff.
However, the empirical numbers do show pretty clearly that it works well, so I'm not too sure where the need for hot debate is. RWKV is one example of a scaled model that uses it. You're sort of shooting yourself in the foot by not using it these days, with GPU memory being the way it is. A flat 2x memory savings (for model weights) is huge, even if I think it mainly shows up in memory transfers. Lots of networks are memory-bound these days, unfortunately.
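Back-of-the-envelope on that 2x, for the weights only (hypothetical parameter count; optimizer state and activations not included):

```python
params = 7e9                                            # e.g. a 7B-parameter model (assumed)
print(f"fp32 weights: {params * 4 / 2**30:.1f} GiB")    # ~26.1 GiB
print(f"bf16 weights: {params * 2 / 2**30:.1f} GiB")    # ~13.0 GiB
```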
I think you have good NN-related intuition. I feel like you would find it fun to play around with (if you haven't already). Many thanks for sharing, I greatly appreciated your response. It made me think a bit, and that especially is something I value. So thank you very much for that. <3 :) :thumbsup: :thumbsup:
> The reason for having more bits is having large numbers of incoming or outgoing connections.
I am having trouble getting my head around this statement -- could you please explain it more? The idea is not intuitive to me. Any example would be much appreciated.
My current thought process is this: how does having more dynamic range in a single weight/parameter help when there are more incoming and outgoing connections?
Maybe I am approaching this statement the wrong way.