Evaluating is faster than generating, so having a small model produce a whole output and then having Goliath emit only a single "good/bad" verdict token would be faster than having Goliath produce everything. This would be an extreme, ad hoc, iterative version of speculative decoding, which already exists and would probably give the best compromise.
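A minimal sketch of that draft-then-verify loop, just to make the idea concrete. The two model calls here are hypothetical stubs, not real APIs, and the retry/fallback policy is my own assumption:

```python
import random

# Hypothetical stand-ins: in practice the draft would come from a small
# fast model (e.g. a 7B) and the verdict from a large one like Goliath 120B.
def draft_model_generate(prompt: str) -> str:
    """Small model drafts a full candidate response cheaply."""
    return prompt + " ... drafted continuation ..."

def verifier_accepts(prompt: str, candidate: str) -> bool:
    """Large model spends one forward pass emitting a single
    'good'/'bad' verdict token instead of generating everything."""
    return random.random() > 0.3  # placeholder for the big model's judgment

def draft_then_verify(prompt: str, max_tries: int = 4) -> str:
    for _ in range(max_tries):
        candidate = draft_model_generate(prompt)
        if verifier_accepts(prompt, candidate):
            return candidate
    # If no draft passes, fall back to generating with the big model.
    return "<generated directly by the large model>"

print(draft_then_verify("Explain speculative decoding in one line."))
```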
I think the relationship between model size and training cost isn't linear. So if you want a model twice as big, it'll take more resources to train than two of the original.
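A back-of-the-envelope illustration of why, assuming the common approximation that training FLOPs ≈ 6 × params × tokens, and Chinchilla-style scaling where the token count grows with model size. The numbers are illustrative, not Goliath's actual training budget:

```python
# training FLOPs ~= 6 * N (params) * D (tokens); purely illustrative numbers.
def train_flops(params: float, tokens: float) -> float:
    return 6 * params * tokens

N, D = 70e9, 2e12  # e.g. a 70B model trained on 2T tokens
base = train_flops(N, D)

# Doubling parameters at the same token count doubles compute...
double_same_data = train_flops(2 * N, D)
# ...but compute-optimal (Chinchilla-style) scaling also grows the data,
# roughly quadrupling compute: more than two original training runs.
double_scaled_data = train_flops(2 * N, 2 * D)

print(double_same_data / base)    # 2.0
print(double_scaled_data / base)  # 4.0
```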
Maybe. Goliath 120B took two different Llama variants and interwove their layers. Surprisingly, Goliath 120B quantized to 2-bit outperforms Llama 70B at 4-bit on many benchmarks.
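Roughly what "interweaving layers" means, as a toy sketch. The chunk size and indices here are made up; the actual Goliath recipe uses specific, partly overlapping slices of its two 70B parents:

```python
# Hypothetical frankenmerge sketch: alternate contiguous chunks of layers
# from two fine-tunes that share the same base architecture.
model_a_layers = [f"A.layer.{i}" for i in range(80)]  # parent A, 80 layers
model_b_layers = [f"B.layer.{i}" for i in range(80)]  # parent B, 80 layers

def interleave(a: list, b: list, chunk: int = 16) -> list:
    """Take alternating chunks of layers from each parent model."""
    merged, i = [], 0
    while i < len(a):
        merged.extend(a[i:i + chunk])
        merged.extend(b[i:i + chunk])
        i += chunk
    return merged

merged = interleave(model_a_layers, model_b_layers)
print(len(merged))  # 160: far more layers than either 80-layer parent
```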
I believe another factor is that the model sometimes responds to your prompt better than at other times. This way you get two dice rolls at your prompt hitting "the good path."
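That intuition is basically best-of-n sampling, sketched below. The generator and judge are placeholder stubs (a real setup might use a reward model or a human pick), not anything the merge itself does:

```python
import random

def generate(prompt: str, seed: int) -> str:
    """Stub generator: each seed is one 'dice roll' on the same prompt."""
    random.seed(seed)
    return f"response variant {random.randint(0, 999)}"

def score(response: str) -> float:
    """Placeholder judge; stands in for a reward model or the user."""
    return random.random()

def best_of_n(prompt: str, n: int = 2) -> str:
    candidates = [generate(prompt, seed) for seed in range(n)]
    return max(candidates, key=score)

print(best_of_n("Tell me about frankenmerges.", n=2))
```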
Would it be better to just double the size of one of the models rather than house both?
Genuine question