
I agree distillation is the wild card. The question is whether distillation works for LLMs. I am not aware of any public report of a successful distillation of an LLM (I searched quite hard for this; if you know of one and can point me to it, I would be very grateful), and I interpret that to mean it doesn't work yet and that negative results go unpublished due to publication bias.


This was posted here on HN last week: https://news.ycombinator.com/item?id=35810663

Don't know if there are any public technical reports from the big AI companies about this, as it's pretty new.


No, distilling step-by-step (https://arxiv.org/abs/2305.02301) distills an LLM into a task-specific model. That works, and I know of multiple successes. But it doesn't bear on the choice between optimizing for training FLOPs versus training-plus-inference FLOPs, since the resulting distilled model is not an LLM.
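
For concreteness, here is a rough sketch of the kind of multi-task objective that style of distillation uses: the small student is trained to predict the task label and to reproduce the teacher LLM's rationale, with an assumed weighting factor. The hyperparameters, shapes, and function name below are illustrative, not the paper's exact setup.

    import torch
    import torch.nn.functional as F

    def step_by_step_loss(label_logits, label_targets,
                          rationale_logits, rationale_targets,
                          rationale_weight=1.0):
        # Cross-entropy on the task labels the student must predict.
        label_loss = F.cross_entropy(label_logits, label_targets)
        # Token-level cross-entropy on the teacher-generated rationale
        # (padding positions assumed to be marked with -100).
        rationale_loss = F.cross_entropy(
            rationale_logits.view(-1, rationale_logits.size(-1)),
            rationale_targets.view(-1),
            ignore_index=-100,
        )
        return label_loss + rationale_weight * rationale_loss

    # Toy shapes: batch of 4, 3 label classes, 16-token rationales, 32k vocab.
    label_logits = torch.randn(4, 3)
    label_targets = torch.randint(0, 3, (4,))
    rationale_logits = torch.randn(4, 16, 32000)
    rationale_targets = torch.randint(0, 32000, (4, 16))
    print(step_by_step_loss(label_logits, label_targets,
                            rationale_logits, rationale_targets))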


The name 3.5-turbo sounds to me like it implies distillation. The release notes at the time also hinted at it IIRC.


Well, that's why I said public. Personally, I don't think the release notes (https://help.openai.com/en/articles/6825453-chatgpt-release-...) hinted at any such thing, and I think quantization is more likely than distillation.
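
To make the distinction concrete: quantization keeps the same model and just stores and runs its weights at lower precision, with no retraining, whereas distillation trains a new, smaller model against the original's outputs. A minimal sketch of symmetric int8 weight quantization in PyTorch (purely illustrative; OpenAI's actual serving stack isn't public):

    import torch

    def quantize_int8(w):
        # Symmetric per-tensor quantization: int8 weights plus one float scale.
        scale = w.abs().max() / 127.0
        q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
        return q, scale

    def dequantize(q, scale):
        return q.float() * scale

    w = torch.randn(4096, 4096)      # one weight matrix of a hypothetical layer
    q, scale = quantize_int8(w)
    w_hat = dequantize(q, scale)
    print((w - w_hat).abs().max())   # quantization error, small relative to the weights
    # Memory drops ~4x (fp32 -> int8), which alone can explain a large cost reduction.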


Turbo uses a different vocabulary (the same one as gpt-4). That indicates it's not the same model as the original 3.5, so I would be very surprised if it wasn't distilled.
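
The vocabulary difference is easy to check with OpenAI's tiktoken library, going by the model-to-encoding mappings it shipped around that time:

    import tiktoken  # pip install tiktoken

    for model in ["text-davinci-003", "gpt-3.5-turbo", "gpt-4"]:
        enc = tiktoken.encoding_for_model(model)
        print(model, "->", enc.name, "vocab size:", enc.n_vocab)

    # gpt-3.5-turbo and gpt-4 both resolve to cl100k_base (~100k tokens),
    # while text-davinci-003 resolves to the older p50k_base (~50k tokens).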


Does the turbo API being 10 times cheaper than davinci imply anything? It implies more than just quantisation to me.


"davinci" is the original GPT-3 (175B) which had too many parameters per Chinchilla scaling law. And parameter count is strongly correlated with inference cost. GPT-3.5 is likely Chinchilla optimal and much smaller than davinci.

This theory does have a defect: GPT-4 is, I think, more expensive than GPT-3, yet as I recall it was considered unlikely that GPT-4 is larger than 175 billion parameters. Not sure.
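
Some rough back-of-the-envelope numbers, using the usual rules of thumb (training FLOPs ≈ 6·N·D, inference FLOPs ≈ 2·N per generated token, Chinchilla-optimal D ≈ 20·N). The 70B figure below is an assumption for illustration, not a known GPT-3.5 size:

    # Why a Chinchilla-style model is cheaper to serve than davinci.
    def train_flops(n_params, n_tokens):
        return 6 * n_params * n_tokens

    def inference_flops_per_token(n_params):
        return 2 * n_params

    davinci_params, davinci_tokens = 175e9, 300e9   # GPT-3: undertrained per Chinchilla
    small_params = 70e9                             # hypothetical Chinchilla-optimal model
    small_tokens = 20 * small_params                # ~1.4T tokens

    print(f"davinci training FLOPs:          {train_flops(davinci_params, davinci_tokens):.2e}")
    print(f"hypothetical 70B training FLOPs: {train_flops(small_params, small_tokens):.2e}")
    print(f"davinci inference FLOPs/token:   {inference_flops_per_token(davinci_params):.2e}")
    print(f"70B inference FLOPs/token:       {inference_flops_per_token(small_params):.2e}")
    # Training budgets are the same order of magnitude, but the smaller model needs
    # ~2.5x fewer inference FLOPs per token, before any quantization or distillation.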


Off the top of my head there's DistilBERT from a while back. I also recall distilled GPT-2 models from before the GPT-3 era.


Yes, DistilBERT (https://arxiv.org/abs/1910.01108) is in fact the closest case I know of. But it is too small (distilling from 110M to 66M parameters), and both BERT and DistilBERT are intended to be used (and benchmarked) with separate fine-tuning for specific tasks, so they are not general.
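
For reference, the core of DistilBERT-style distillation is a temperature-softened KL term between teacher and student logits mixed with the ordinary hard-label loss. A minimal sketch (the temperature and mixing weight are illustrative defaults, not the paper's exact settings):

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, hard_targets,
                          temperature=2.0, alpha=0.5):
        # Soft targets: KL between temperature-softened teacher and student
        # distributions, scaled by T^2 to keep gradient magnitudes comparable.
        soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
        log_student = F.log_softmax(student_logits / temperature, dim=-1)
        kd = F.kl_div(log_student, soft_teacher, reduction="batchmean") * temperature ** 2
        # Hard targets: ordinary cross-entropy on the true labels.
        ce = F.cross_entropy(student_logits, hard_targets)
        return alpha * kd + (1 - alpha) * ce

    # Toy example: batch of 8 over a 30k-token vocabulary.
    teacher_logits = torch.randn(8, 30000)
    student_logits = torch.randn(8, 30000, requires_grad=True)
    hard_targets = torch.randint(0, 30000, (8,))
    print(distillation_loss(student_logits, teacher_logits, hard_targets))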



