Hasn't it been known for quite a few years now that these CNNs (ResNet, VGG, etc.) train well in FP16? The problem is that attention layers with softmax can have a dynamic range higher than FP16 can handle, which is why you have to go to BF16. I'm lost as to what the novelty in this paper is.
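
To make the dynamic-range point concrete, here is a minimal PyTorch sketch (my own illustration, not from the paper): FP16 tops out around 65504, while BF16 keeps FP32's exponent range, so a value that is nothing special for an unnormalized exponential or an accumulated dot product already overflows to inf in FP16 but stays finite in BF16.

```python
import torch

# Representable maxima: FP16 has a 5-bit exponent, BF16 has FP32's 8-bit exponent.
print(torch.finfo(torch.float16).max)    # 65504.0
print(torch.finfo(torch.bfloat16).max)   # ~3.39e38, same exponent range as FP32

# exp(12) ~ 162754, which exceeds the FP16 range but not the BF16 range.
x = torch.tensor(12.0)
print(torch.exp(x.to(torch.float16)))    # inf  -> overflow
print(torch.exp(x.to(torch.bfloat16)))   # finite, just coarser precision
```

Where exactly that overflow bites in a real attention layer depends on scaling, loss scaling, and where reductions accumulate, but the range gap itself is the reason BF16 is the safer default.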