
Because you need a lot more information to perform back-propagation.


It's not "a lot more" information, it's holding derivative (single number) per parameter, right?


For automatic differentiation (backpropagation) you need to store the intermediate results of the forward pass for every layer. With checkpointing you can store only every nth layer's activations and recompute the rest as needed, trading extra compute for lower memory requirements.
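A minimal PyTorch sketch of that trade-off, using torch.utils.checkpoint (the layer count and sizes here are made up for illustration):

    import torch
    from torch import nn
    from torch.utils.checkpoint import checkpoint_sequential

    # an illustrative stack of 8 small blocks; any nn.Sequential works
    model = nn.Sequential(
        *[nn.Sequential(nn.Linear(512, 512), nn.ReLU()) for _ in range(8)]
    )
    x = torch.randn(32, 512)

    # plain forward: every layer's activations stay alive until backward
    loss = model(x).sum()
    loss.backward()

    # checkpointed forward: split into 4 segments, keep only the
    # segment-boundary activations, recompute the rest during backward
    loss = checkpoint_sequential(model, 4, x, use_reentrant=False).sum()
    loss.backward()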


What intermediate results do you need to store?

For backpropagation you take the difference between the actual and expected output and go backwards to calculate the derivatives, then apply them with the optimiser - that's 8 extra bytes per trainable parameter for single-precision floats.
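To illustrate the "one derivative per parameter" part, a tiny PyTorch sketch (toy shapes, purely illustrative):

    import torch
    from torch import nn

    model = nn.Linear(4, 2)
    pred = model(torch.randn(8, 4))
    loss = nn.functional.mse_loss(pred, torch.randn(8, 2))  # actual vs expected
    loss.backward()  # go backwards through the graph

    for name, p in model.named_parameters():
        # one derivative per parameter, same shape as the parameter itself
        print(name, tuple(p.shape), tuple(p.grad.shape))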

Why do you need 80?


You also need the optimizer's state (e.g. Adam's), which is usually double the parameter's size. So if using fp16, one parameter takes up 6 bytes in memory.
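You can see those two extra tensors directly in PyTorch (toy model, sizes made up):

    import torch
    from torch import nn

    model = nn.Linear(1000, 1000)        # ~1M weights
    opt = torch.optim.Adam(model.parameters())

    model(torch.randn(16, 1000)).sum().backward()
    opt.step()                           # Adam allocates its state lazily, on the first step

    for p in model.parameters():
        s = opt.state[p]
        # two extra tensors per parameter: first and second moment estimates
        print(s["exp_avg"].shape, s["exp_avg_sq"].shape)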


Yes, if you use Adam - but it doesn't add up to 80, does it?

Even for fp64 it adds only 16 bytes.

RMSProp and Adagrad have half of this overhead.

SGD has no optimizer overhead, of course.
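Rough per-parameter accounting in fp32, using the usual state-tensor counts for each optimizer (a back-of-envelope sketch, not a measurement):

    BYTES = 4  # fp32
    slots = {                 # extra state tensors per parameter
        "SGD (plain)":    0,
        "SGD + momentum": 1,  # velocity
        "RMSProp":        1,  # running avg of squared gradients
        "Adagrad":        1,  # accumulated squared gradients
        "Adam":           2,  # first + second moments
    }
    for name, n in slots.items():
        state = BYTES * n
        total = BYTES * (1 + 1 + n)  # weight + gradient + optimizer state
        print(f"{name}: {state} B state, {total} B/parameter total")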


It's not just per parameter; you also need to hold the activations for backprop to work.


You need activations for inference as well.

But all of that (trainable parameters, activations, optimizer state) comes to something like 12 bytes per trainable parameter, not 80.
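A back-of-envelope sketch of why activations usually don't dominate the per-parameter count (all sizes made up): activation memory scales with batch size and layer width, while parameter memory scales with width squared.

    # rough numbers for an MLP, fp32, keeping one activation tensor per layer
    batch, hidden, layers, b = 32, 4096, 24, 4
    activations = batch * hidden * layers * b   # bytes held for backprop
    params = hidden * hidden * layers           # weight matrices only
    print(f"activations: {activations / 2**20:.1f} MiB")
    print(f"per parameter: {activations / params:.3f} B")  # ~0.03 B here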


Not the GP, but I believe that they are talking about the size of the training data set in relation to the model size.


You don't need to load all the training data, and you can't really.

For LLMs you only need to load a single row of context size - that's a vector of e.g. 8k numbers, which is 32 kB for single-precision floats.
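The arithmetic, for what it's worth:

    context_len = 8192              # the 8k context from above
    bytes_per = 4                   # fp32
    print(context_len * bytes_per)  # 32768 bytes = 32 kB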



