
Because you need a lot more information to perform back-propagation.


It's not "a lot more" information, it's holding derivative (single number) per parameter, right?


For automatic differentiation (backpropagation) you need to store the intermediate results of the forward pass for every layer. With checkpointing you can store only every nth layer's activations and recompute the rest as needed, trading extra compute for lower memory requirements.
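A minimal PyTorch sketch of that trade-off, using torch.utils.checkpoint (the layer count and sizes here are made up for illustration):

    import torch
    from torch import nn
    from torch.utils.checkpoint import checkpoint_sequential

    # an illustrative stack of 8 small blocks; any nn.Sequential works
    model = nn.Sequential(
        *[nn.Sequential(nn.Linear(512, 512), nn.ReLU()) for _ in range(8)]
    )
    x = torch.randn(32, 512)

    # plain forward: every layer's activations stay alive until backward
    loss = model(x).sum()
    loss.backward()

    # checkpointed forward: split into 4 segments, keep only the
    # segment-boundary activations, recompute the rest during backward
    loss = checkpoint_sequential(model, 4, x, use_reentrant=False).sum()
    loss.backward()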


What intermediate results do you need to store?

For backpropagation you take the difference between the actual and expected output and go backwards to calculate the derivatives, then apply them with the optimiser - that's 8 extra bytes per trainable parameter for single-precision floats.
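To illustrate the "one derivative per parameter" part, a tiny PyTorch sketch (toy shapes, purely illustrative):

    import torch
    from torch import nn

    model = nn.Linear(4, 2)
    pred = model(torch.randn(8, 4))
    loss = nn.functional.mse_loss(pred, torch.randn(8, 2))  # actual vs expected
    loss.backward()  # go backwards through the graph

    for name, p in model.named_parameters():
        # one derivative per parameter, same shape as the parameter itself
        print(name, tuple(p.shape), tuple(p.grad.shape))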

Why do you need 80?


You also need the optimizer's state (e.g. Adam's), which is usually double the parameter's size. So if using fp16, one parameter takes up 6 bytes in memory.
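You can see those two extra tensors directly in PyTorch (toy model, sizes made up):

    import torch
    from torch import nn

    model = nn.Linear(1000, 1000)        # ~1M weights
    opt = torch.optim.Adam(model.parameters())

    model(torch.randn(16, 1000)).sum().backward()
    opt.step()                           # Adam allocates its state lazily, on the first step

    for p in model.parameters():
        s = opt.state[p]
        # two extra tensors per parameter: first and second moment estimates
        print(s["exp_avg"].shape, s["exp_avg_sq"].shape)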


Yes, if you use Adam - but it doesn't add up to 80, does it?

Even for fp64 it adds only 16 bytes.

RMSProp and Adagrad have half of this overhead.

SGD has no optimizer overhead, of course.
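Rough per-parameter accounting in fp32, using the usual state-tensor counts for each optimizer (a back-of-envelope sketch, not a measurement):

    BYTES = 4  # fp32
    slots = {                 # extra state tensors per parameter
        "SGD (plain)":    0,
        "SGD + momentum": 1,  # velocity
        "RMSProp":        1,  # running avg of squared gradients
        "Adagrad":        1,  # accumulated squared gradients
        "Adam":           2,  # first + second moments
    }
    for name, n in slots.items():
        state = BYTES * n
        total = BYTES * (1 + 1 + n)  # weight + gradient + optimizer state
        print(f"{name}: {state} B state, {total} B/parameter total")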


It's not just per parameter; you also need to hold the activations for backprop to work.


You need activations for inference as well.

But all of that (trainable parameters, activations, optimizer state) comes to something like 12 bytes per trainable parameter, not 80.
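A back-of-envelope sketch of why activations usually don't dominate the per-parameter count (all sizes made up): activation memory scales with batch size and layer width, while parameter memory scales with width squared.

    # rough numbers for an MLP, fp32, keeping one activation tensor per layer
    batch, hidden, layers, b = 32, 4096, 24, 4
    activations = batch * hidden * layers * b   # bytes held for backprop
    params = hidden * hidden * layers           # weight matrices only
    print(f"activations: {activations / 2**20:.1f} MiB")
    print(f"per parameter: {activations / params:.3f} B")  # ~0.03 B here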


Not the GP, but I believe that they are talking about the size of the training data set in relation to the model size.


You don't need to load all the training data, and you can't really.

For LLMs you only need to load a single row of context size - that's a vector of e.g. 8k numbers, which is 32 kB for single-precision floats.
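The arithmetic, for what it's worth:

    context_len = 8192              # the 8k context from above
    bytes_per = 4                   # fp32
    print(context_len * bytes_per)  # 32768 bytes = 32 kB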



