Evaluating is faster than generating, so having a small model produce a whole output and then having Goliath emit only a single "good/bad" verdict token would be faster than having Goliath produce everything. This would be an extreme, ad hoc, iterative version of speculative decoding, which already exists and would probably give the best compromise.
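A minimal sketch of that draft-then-verify loop, just to make the idea concrete. The two model calls here are hypothetical stubs, not real APIs, and the retry/fallback policy is my own assumption:

```python
import random

# Hypothetical stand-ins: in practice the draft would come from a small
# fast model (e.g. a 7B) and the verdict from a large one like Goliath 120B.
def draft_model_generate(prompt: str) -> str:
    """Small model drafts a full candidate response cheaply."""
    return prompt + " ... drafted continuation ..."

def verifier_accepts(prompt: str, candidate: str) -> bool:
    """Large model spends one forward pass emitting a single
    'good'/'bad' verdict token instead of generating everything."""
    return random.random() > 0.3  # placeholder for the big model's judgment

def draft_then_verify(prompt: str, max_tries: int = 4) -> str:
    for _ in range(max_tries):
        candidate = draft_model_generate(prompt)
        if verifier_accepts(prompt, candidate):
            return candidate
    # If no draft passes, fall back to generating with the big model.
    return "<generated directly by the large model>"

print(draft_then_verify("Explain speculative decoding in one line."))
```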
I think the relationship between model size and training cost isn't linear. So if you want a model twice as big, it'll take more resources to train than two of the original.
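A back-of-the-envelope illustration of why, assuming the common approximation that training FLOPs ≈ 6 × params × tokens, and Chinchilla-style scaling where the token count grows with model size. The numbers are illustrative, not Goliath's actual training budget:

```python
# training FLOPs ~= 6 * N (params) * D (tokens); purely illustrative numbers.
def train_flops(params: float, tokens: float) -> float:
    return 6 * params * tokens

N, D = 70e9, 2e12  # e.g. a 70B model trained on 2T tokens
base = train_flops(N, D)

# Doubling parameters at the same token count doubles compute...
double_same_data = train_flops(2 * N, D)
# ...but compute-optimal (Chinchilla-style) scaling also grows the data,
# roughly quadrupling compute: more than two original training runs.
double_scaled_data = train_flops(2 * N, 2 * D)

print(double_same_data / base)    # 2.0
print(double_scaled_data / base)  # 4.0
```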
Maybe. Goliath 120B took two different Llama variants and interwove their layers. Surprisingly, Goliath 120B quantized to 2-bit outperforms Llama 70B at 4-bit on many benchmarks.
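Roughly what "interweaving layers" means, as a toy sketch. The chunk size and indices here are made up; the actual Goliath recipe uses specific, partly overlapping slices of its two 70B parents:

```python
# Hypothetical frankenmerge sketch: alternate contiguous chunks of layers
# from two fine-tunes that share the same base architecture.
model_a_layers = [f"A.layer.{i}" for i in range(80)]  # parent A, 80 layers
model_b_layers = [f"B.layer.{i}" for i in range(80)]  # parent B, 80 layers

def interleave(a: list, b: list, chunk: int = 16) -> list:
    """Take alternating chunks of layers from each parent model."""
    merged, i = [], 0
    while i < len(a):
        merged.extend(a[i:i + chunk])
        merged.extend(b[i:i + chunk])
        i += chunk
    return merged

merged = interleave(model_a_layers, model_b_layers)
print(len(merged))  # 160: far more layers than either 80-layer parent
```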
I believe another factor is that the model sometimes responds to your prompt better than at other times. This way you get two dice rolls at your prompt hitting "the good path."
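That intuition is basically best-of-n sampling, sketched below. The generator and judge are placeholder stubs (a real setup might use a reward model or a human pick), not anything the merge itself does:

```python
import random

def generate(prompt: str, seed: int) -> str:
    """Stub generator: each seed is one 'dice roll' on the same prompt."""
    random.seed(seed)
    return f"response variant {random.randint(0, 999)}"

def score(response: str) -> float:
    """Placeholder judge; stands in for a reward model or the user."""
    return random.random()

def best_of_n(prompt: str, n: int = 2) -> str:
    candidates = [generate(prompt, seed) for seed in range(n)]
    return max(candidates, key=score)

print(best_of_n("Tell me about frankenmerges.", n=2))
```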
Would it be better to just double the size of one of the models rather than house both?
Genuine question