
Hasn't it been known for quite a few years now that these CNNs (ResNet, VGG, etc.) train well in FP16? The problem is that the attention layer's softmax can have a dynamic range higher than FP16 can handle, hence you have to go to BF16. I'm lost as to what the novelty in this paper is.
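A quick PyTorch sketch of that range problem (illustrative only, not from the paper): FP16 tops out around 65504, so a naive softmax overflows once any logit exceeds ln(65504) ≈ 11, while BF16 keeps FP32's exponent range and survives.

    import torch

    logits = torch.tensor([1.0, 6.0, 12.0])  # exp(12) ~ 1.6e5 > FP16 max

    for dtype in (torch.float16, torch.bfloat16):
        x = logits.to(dtype)
        e = torch.exp(x)            # naive softmax, no max-subtraction trick
        print(dtype, e / e.sum())   # FP16 -> inf/nan, BF16 -> finite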


It looks like they're formalizing the behavior of pure 16-bit training, which is different from the mixed-precision pipelines I'm aware of.
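For contrast, a minimal sketch of the two regimes (the model and data here are placeholders, not the paper's setup, and this assumes a CUDA device): mixed precision keeps FP32 master weights plus a loss scaler, while pure 16-bit puts weights, activations, and gradients all in FP16.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    x = torch.randn(32, 128, device="cuda")
    y = torch.randint(0, 10, (32,), device="cuda")

    # Mixed precision: FP32 weights, FP16 compute, scaled gradients.
    model = nn.Linear(128, 10).cuda()
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)
    scaler = torch.cuda.amp.GradScaler()
    with torch.autocast("cuda", dtype=torch.float16):
        loss = F.cross_entropy(model(x), y)
    scaler.scale(loss).backward()
    scaler.step(opt)
    scaler.update()

    # Pure FP16: weights and gradients live in half precision,
    # with no FP32 master copy and no loss scaler.
    model16 = nn.Linear(128, 10).half().cuda()
    opt16 = torch.optim.SGD(model16.parameters(), lr=1e-3)
    F.cross_entropy(model16(x.half()), y).backward()
    opt16.step()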


You can control range with a temperature, works pretty well! https://github.com/tysam-code/hlb-CIFAR10/blob/3bb104ce16d16...
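The temperature idea, as a generic sketch (not the actual code from the linked repo): dividing the logits by t > 1 compresses their range before exp(), so the intermediates stay under FP16's ~65504 ceiling without falling back to BF16.

    import torch

    def softmax_t(logits: torch.Tensor, t: float = 10.0):
        e = torch.exp(logits / t)   # temperature compresses the range
        return e / e.sum(dim=-1, keepdim=True)

    scores = torch.randn(4, 8, dtype=torch.float16) * 30  # wide-range logits
    print(torch.exp(scores).isinf().any())  # plain exp overflows FP16 here
    print(softmax_t(scores))                # temperature keeps it finite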


Yes, it is well known in the industry that both FP16 and int16 have advantages. I don't see anything really new in the paper either.

It's like a lot of arXiv papers these days that only serve as an "Instagram for researchers".



