I agree distillation is the wild card. The question is whether distillation works for LLMs. I am not aware of any public report of a successful distillation of an LLM (I searched quite hard for this; if you know of any and can tell me, I would be very grateful), and I interpret that to mean it doesn't work yet, with negative results going unpublished due to publication bias.
No, distilling step-by-step (https://arxiv.org/abs/2305.02301) distills an LLM into a task-specific model. That works, and I know of multiple successes. But it doesn't bear on the choice of optimizing training FLOPs alone vs. training plus inference FLOPs, since the resulting distilled model is not an LLM.
Turbo uses a different vocabulary (the same one as GPT-4). That indicates it's not the same model as the original 3.5, so I would be very surprised if it weren't distilled.
"davinci" is the original GPT-3 (175B) which had too many parameters per Chinchilla scaling law. And parameter count is strongly correlated with inference cost. GPT-3.5 is likely Chinchilla optimal and much smaller than davinci.
This theory has the defect that GPT-4 is, I think, more expensive than GPT-3, yet as I recall it was considered unlikely that GPT-4 is larger than 175 billion parameters. Not sure.
Yes, DistilBERT (https://arxiv.org/abs/1910.01108) is in fact the closest case I know of. But it is too small (distilling from 110M to 66M parameters), and both BERT and DistilBERT are intended to be used (and benchmarked) with separate fine-tuning for specific tasks, so they are not general-purpose models.
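For readers unfamiliar with the technique being debated: DistilBERT uses Hinton-style knowledge distillation, where the student is trained to match the teacher's softened output distribution in addition to the hard labels. Below is a minimal NumPy sketch of that loss; the function names, temperature T, and blend weight alpha are illustrative choices, not DistilBERT's exact training recipe (which also adds a hidden-state cosine loss).

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled softmax; subtract the max for numerical stability.
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend of soft-target KL (teacher -> student) and hard-label cross-entropy.

    T > 1 softens both distributions so the student learns from the teacher's
    relative probabilities over wrong classes; alpha weights the two terms.
    """
    p_teacher = softmax(teacher_logits, T)
    log_p_student = np.log(softmax(student_logits, T))
    # KL(teacher || student), scaled by T^2 so gradients keep a comparable
    # magnitude as T varies (the convention from Hinton et al. 2015).
    soft = (p_teacher * (np.log(p_teacher) - log_p_student)).sum(axis=-1).mean() * T * T
    # Standard cross-entropy against the true labels at T = 1.
    hard = -np.log(softmax(student_logits)[np.arange(len(labels)), labels]).mean()
    return alpha * soft + (1 - alpha) * hard
```

If the student's logits exactly equal the teacher's, the soft term vanishes and only the hard-label cross-entropy remains, which is one quick sanity check on an implementation like this.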