
Wouldn't this effectively be using a "model" twice the size?

Would it be better to just double the size of one of the models rather than house both?

Genuine question



Scoring is faster than generating: the big model can verify an existing sequence in one parallel forward pass, while generating it takes one pass per token. So having a small model produce the whole output and then having Goliath emit only a single good/bad verdict token would be faster than having Goliath produce everything. This would be the extreme, ad hoc, iterative version of speculative decoding, which is already a thing and would probably give the best compromise.
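The draft-then-verify loop of speculative decoding can be sketched in toy form. Everything here is a stand-in: `draft_propose` and `target_verify` are hypothetical placeholders for the small and large models, and the acceptance decision is simulated rather than computed from real logits.

```python
import random

def draft_propose(prefix, k=4):
    """Small model cheaply drafts k candidate tokens (toy stand-in)."""
    return [f"tok{len(prefix) + i}" for i in range(k)]

def target_verify(prefix, draft):
    """Large model scores the whole draft in ONE forward pass and
    accepts the longest prefix it agrees with (simulated here)."""
    accept = random.randint(0, len(draft))
    return draft[:accept]

def speculative_decode(n_tokens, k=4, seed=0):
    random.seed(seed)
    out = []
    while len(out) < n_tokens:
        draft = draft_propose(out, k)
        accepted = target_verify(out, draft)
        out.extend(accepted)
        if len(accepted) < len(draft):
            # Target disagreed mid-draft: it emits one corrected token
            # itself, so progress is guaranteed every round.
            out.append(f"tok{len(out)}")
    return out[:n_tokens]

print(len(speculative_decode(16)))  # 16
```

The key property is that each round costs the large model one forward pass but can commit several tokens, which is where the speedup comes from.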


I think the relationship between model size and training cost isn't linear. So if you want a model twice as large, it'll take more resources to train than two original-sized models.
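A rough way to see this uses the common C ≈ 6·N·D approximation for training FLOPs (parameters × tokens). The numbers below are illustrative, and the assumption that a compute-optimal recipe scales training tokens with model size is the hedge that makes the doubled model cost more than two originals:

```python
# Rough training-compute comparison using the common
# C ≈ 6 * N * D approximation (FLOPs ≈ 6 x params x tokens).

def train_flops(params, tokens):
    return 6 * params * tokens

N, D = 70e9, 1.4e12          # e.g. a 70B model on 1.4T tokens (illustrative)
two_models = 2 * train_flops(N, D)

# Doubling parameters at FIXED data merely matches the cost of two models...
same_data = train_flops(2 * N, D)
print(same_data / two_models)        # 1.0

# ...but a compute-optimal recipe also scales data with model size,
# so the doubled model costs ~2x what the two smaller ones did.
scaled_data = train_flops(2 * N, 2 * D)
print(scaled_data / two_models)      # 2.0
```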


Maybe. Goliath 120B took two different Llama variants and interwove their layers. Surprisingly, Goliath 120B quantized to 2-bit outperforms Llama 70B at 4-bit in many benchmarks.

https://www.reddit.com/r/LocalLLaMA/comments/17vcr9d/llm_com...


Do you happen to have a link to where that interwoven layers bit is described? As far as I can tell it's not clear on the model cards.


The model page is the only info I’ve found on it. As far as I can tell there’s no paper published on the technique.

In the “Merge Process” section they at least give the layer ranges.

https://huggingface.co/alpindale/goliath-120b
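A passthrough "frankenmerge" of the kind that model card describes can be sketched as stacking alternating layer slices from two donor models into one deeper model. The slice boundaries and layer counts below are illustrative, not Goliath's actual configuration:

```python
# Toy sketch of a passthrough layer merge: a new, deeper model is
# built by concatenating alternating layer slices from two donors.
# Boundaries here are made up for illustration.

def frankenmerge(model_a, model_b, slices):
    merged = []
    for donor_name, start, end in slices:   # end is exclusive
        donor = model_a if donor_name == "a" else model_b
        merged.extend(donor[start:end])
    return merged

# Pretend each donor is an 80-layer model; a "layer" is just a label here.
a = [("a", i) for i in range(80)]
b = [("b", i) for i in range(80)]

merged = frankenmerge(a, b, [
    ("a", 0, 20), ("b", 10, 30),
    ("a", 20, 40), ("b", 30, 50),
    ("a", 40, 60), ("b", 50, 70),
    ("a", 60, 80), ("b", 70, 80),
])
print(len(merged))  # 150
```

Note the overlapping ranges: some layer depths appear once from each donor, which is why the merged model ends up much deeper than either parent.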


Ah, actually reviewing that more closely I found a link to it in the acknowledgements.

https://github.com/cg123/mergekit


I believe another factor is that the model sometimes responds better to your prompt than at other times. This way you get two dice rolls at your prompt hitting "the good path."
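That "two dice rolls" intuition is essentially best-of-n sampling with n=2: draw two independent completions and keep whichever one a scorer prefers. In this sketch, `generate` and `score` are hypothetical stand-ins for a sampled model completion and a reward model or verifier:

```python
import random

def generate(prompt, rng):
    """Stand-in for one sampled model completion (a 'dice roll')."""
    return f"{prompt} -> completion#{rng.randint(0, 9)}"

def score(completion):
    """Stand-in for a reward model / verifier; here just a toy metric."""
    return int(completion[-1])

def best_of_two(prompt, seed=0):
    rng = random.Random(seed)
    candidates = [generate(prompt, rng) for _ in range(2)]
    # Keep whichever roll the scorer prefers.
    return max(candidates, key=score)

print(best_of_two("Q"))
```

Using the second (big) model only as the scorer keeps the expensive per-token generation on the cheap model, which is the same trade as in the speculative-decoding comment above.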




