I'm not sufficiently familiar with the details of ML to assess the proposition made in the article.
From my understanding, RL is a tuning approach for LLMs, so the outcome is still the same kind of beast, albeit with a different parameter set.
Empirically, I'd have thought that the leading companies would already be strongly focused on improving coding capabilities, since this is where LLMs are most effective and where they have huge cash flows from token consumption.
So either the motivation isn't there, or they're already doing something like that, or they know it's not as effective as the approaches they already have. I wonder which one it is.
Actually, I didn't. Correct me if I am wrong, but my understanding is that RL is still an LLM tuning approach, i.e. an optimization of its parameter set, whether it's done at scale or via human feedback (RLHF).
RL is a lot more general than that: it is basically a framework in which an agent learns to make good decisions from experience, by acting in an environment so as to maximize cumulative reward. So you can do all kinds of things with it other than finetuning LLMs, like controlling a robotic arm or mastering video games. AlphaGo, for example, was also trained with RL.
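To make that concrete, here's a minimal sketch of tabular Q-learning on a made-up toy problem (the corridor environment, the constants, and all names here are purely illustrative, not from the article). The point is that the RL loop is just state, action, reward, update, with no language model anywhere:

```python
import random

# Hypothetical toy environment: a corridor of 6 cells. The agent starts
# in cell 0 and receives reward 1 only upon reaching cell 5.
N_STATES = 6
ACTIONS = (-1, +1)                 # step left or right
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1

# Q-table: estimated discounted return for each (state, action) pair.
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(state, action):
    """Environment dynamics: move, clip to the corridor, reward at the goal."""
    nxt = max(0, min(N_STATES - 1, state + action))
    return nxt, (1.0 if nxt == N_STATES - 1 else 0.0), nxt == N_STATES - 1

def greedy(state):
    """Pick the best-valued action, breaking ties randomly."""
    best = max(Q[(state, a)] for a in ACTIONS)
    return random.choice([a for a in ACTIONS if Q[(state, a)] == best])

for _ in range(500):               # episodes
    state, done = 0, False
    for _ in range(100):           # step cap so an episode always terminates
        # Epsilon-greedy: mostly exploit current estimates, sometimes explore.
        action = random.choice(ACTIONS) if random.random() < EPSILON else greedy(state)
        nxt, reward, done = step(state, action)
        # Core Q-learning update: nudge Q toward reward + discounted future value.
        best_next = max(Q[(nxt, a)] for a in ACTIONS)
        Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
        state = nxt
        if done:
            break

# The learned greedy policy should move right (+1) from every non-goal cell.
print([greedy(s) for s in range(N_STATES - 1)])
```

RLHF applies the same loop to an LLM (the policy is the model, the reward comes from a learned preference model), which is why it ends up as a parameter update in that special case, but the framework itself isn't tied to LLMs at all.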