I agree distillation is the wild card. The question is whether distillation works for LLMs. I am not aware of any public report of a successful distillation of an LLM (I searched quite hard for this; if you know of any and can tell me, I would be very grateful), and I interpret that to mean it doesn't work yet, with negative results going unpublished due to publication bias.
No, distilling step-by-step (https://arxiv.org/abs/2305.02301) distills an LLM into a task-specific model. That works, and I know of multiple successes. But it doesn't bear on the choice of optimizing training FLOPs alone vs. training plus inference FLOPs, since the resulting distilled model is not an LLM.
Turbo uses a different vocabulary (the same one as GPT-4). That indicates it's not the same model as the original 3.5, so I would be very surprised if it weren't distilled.
"davinci" is the original GPT-3 (175B) which had too many parameters per Chinchilla scaling law. And parameter count is strongly correlated with inference cost. GPT-3.5 is likely Chinchilla optimal and much smaller than davinci.
This theory has the defect that GPT-4 is, I think, more expensive than GPT-3, yet as I recall it was considered unlikely that GPT-4 is larger than 175 billion parameters. Not sure.
Yes, DistilBERT (https://arxiv.org/abs/1910.01108) is in fact the closest case I know of. But it is too small (distilling from 110M to 66M parameters), and both BERT and DistilBERT are intended to be used (and benchmarked) with separate fine-tuning for specific tasks, so they are not general-purpose models.
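For readers unfamiliar with the technique being debated: DistilBERT uses Hinton-style knowledge distillation, where the student is trained to match the teacher's softened output distribution in addition to the hard labels. Below is a minimal NumPy sketch of that loss; the function names, temperature T, and blend weight alpha are illustrative choices, not DistilBERT's exact training recipe (which also adds a hidden-state cosine loss).

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled softmax; subtract the max for numerical stability.
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend of soft-target KL (teacher -> student) and hard-label cross-entropy.

    T > 1 softens both distributions so the student learns from the teacher's
    relative probabilities over wrong classes; alpha weights the two terms.
    """
    p_teacher = softmax(teacher_logits, T)
    log_p_student = np.log(softmax(student_logits, T))
    # KL(teacher || student), scaled by T^2 so gradients keep a comparable
    # magnitude as T varies (the convention from Hinton et al. 2015).
    soft = (p_teacher * (np.log(p_teacher) - log_p_student)).sum(axis=-1).mean() * T * T
    # Standard cross-entropy against the true labels at T = 1.
    hard = -np.log(softmax(student_logits)[np.arange(len(labels)), labels]).mean()
    return alpha * soft + (1 - alpha) * hard
```

If the student's logits exactly equal the teacher's, the soft term vanishes and only the hard-label cross-entropy remains, which is one quick sanity check on an implementation like this.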