You can fine-tune a ~66M-parameter discriminative (not generative) language model (e.g. DistilBERT) and it's one or two orders of magnitude more efficient for classification tasks like sentiment analysis, and probably just as accurate, if not more so.
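A minimal sketch of that kind of fine-tune with Hugging Face transformers (SST-2, the epoch count, and the batch size here are illustrative choices, not something from the thread):

    # Fine-tune DistilBERT (~66M params) with a classification head on SST-2.
    from datasets import load_dataset
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "distilbert-base-uncased", num_labels=2)  # binary sentiment labels

    dataset = load_dataset("glue", "sst2")

    def tokenize(batch):
        # Pad to a fixed length so the default collator can batch the examples.
        return tokenizer(batch["sentence"], truncation=True,
                         padding="max_length", max_length=128)

    dataset = dataset.map(tokenize, batched=True)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="out", num_train_epochs=1,
                               per_device_train_batch_size=32),
        train_dataset=dataset["train"],
        eval_dataset=dataset["validation"],
    )
    trainer.train()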


Yup, I'm not saying TinyLlama is minimal, efficient, etc. (indeed, that just shows you can go even smaller). And a whole lot of what we throw LLMs at isn't the right tool for the job, but it's expedient and it surprisingly works.


It seems that BERT can be run on the llama.cpp platform: https://github.com/ggerganov/llama.cpp/pull/5423

So presumably those models could benefit from the speed-ups described in the OP article when running on CPU.


llama.cpp only supports BERT architectures for embeddings, not with classification heads, although there is a feature request to add that.
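Until that lands, one workaround is to take the pooled embeddings llama.cpp does produce and train the head separately. A sketch using the llama-cpp-python bindings plus scikit-learn (the GGUF path and the toy data are hypothetical, and this assumes a BERT model converted per the PR above):

    # Frozen BERT embeddings from llama.cpp + an external classification head.
    from llama_cpp import Llama
    from sklearn.linear_model import LogisticRegression

    # Hypothetical path to a BERT-family GGUF converted for llama.cpp.
    llm = Llama(model_path="bert-base-uncased.gguf", embedding=True)

    texts = ["great movie", "terrible plot", "loved it", "fell asleep"]  # toy data
    labels = [1, 0, 1, 0]                                                # 1 = positive

    X = [llm.embed(t) for t in texts]  # one pooled vector per input text

    # The missing classification head is just a linear layer over the pooled
    # embedding, so logistic regression on frozen embeddings plays that role.
    head = LogisticRegression().fit(X, labels)
    print(head.predict([llm.embed("surprisingly good")]))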


Ah I see, thanks, I didn't read the PR closely!



