Last May, Karpathy mentioned that the next step in LLM performance would likely take inspiration from AlphaGo's Monte Carlo-style tree traversal to enable deeper and more self-correcting reasoning: https://youtu.be/bZQun8Y4L2A?si=RgD7NWfwDdh0bklK&t=1630
But if this approach holds up, it suggests that the most valuable part of the AlphaGo project that applies to LLM development is in fact reinforcement learning through self-play. Why not create a leaderboard of reward-generating language models that constantly "play" against each other, where selection frequency is weighted by Elo rating, and update the models on the results? What if the Open LLM leaderboard (https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderb...) evolved into a constantly improving pool of such models? This would also alleviate data-scaling issues by providing a diverse and continuously changing distribution of new training inputs.
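For concreteness, here is a minimal sketch of what the Elo-weighted "league" idea above might look like. Everything in it (the model names, the K-factor, the judge() placeholder) is an assumption for illustration, not an existing API.

```python
import random

# Hypothetical sketch of an Elo-weighted pool of competing models.
K = 16  # Elo update step size (assumed)

ratings = {"model_a": 1200.0, "model_b": 1200.0, "model_c": 1200.0}

def expected_score(r_a, r_b):
    """Standard Elo expected score for player A against player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def pick_pair(ratings):
    """Sample two distinct models, weighting selection by current rating."""
    names = list(ratings)
    weights = [ratings[n] for n in names]
    a = random.choices(names, weights=weights, k=1)[0]
    b = random.choice([n for n in names if n != a])
    return a, b

def judge(prompt, a, b):
    """Placeholder: 1.0 if a 'wins', 0.0 if b wins, 0.5 for a tie.
    In practice this would be a reward model, human votes, or task-specific checks."""
    return random.choice([0.0, 0.5, 1.0])

def play_round(prompt):
    a, b = pick_pair(ratings)
    score_a = judge(prompt, a, b)
    exp_a = expected_score(ratings[a], ratings[b])
    ratings[a] += K * (score_a - exp_a)
    ratings[b] += K * ((1.0 - score_a) - (1.0 - exp_a))

for _ in range(1000):
    play_round("write the best summary of this article")
print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```

In a real training loop the match results would also feed back into fine-tuning, which is exactly the part the rest of this thread argues is hard to ground.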
AlphaGo works because it's grounded in the rules of Go. The rules are unambiguous and programmed into the system manually.
There are no such unambiguous rules to ground language models. Iterative "improvement" could easily degenerate into nonsense without grounding in the real world. That's why some people think that true AGI will need to be grounded by the laws of physics, via experience interacting with the real physical world using robot bodies.
Any output of the system that interacts with humans can be used for this; it doesn't necessarily need to be robots and physical manipulation. For example, it could be "code a product", where the models all get the same spec and compete to provide the best implementation, or "create the funniest meme", where the models are judged by each other and then grounded in the real-world signal of Reddit upvotes.
Any grounding that comes from humans is scarce. For training, we really want automatic grounding from rules that can be applied trillions of times per second rather than labor-intensive human grounding at much lower rates.
Of course, grounding from robots is also scarce unless you build a whole lot of robots. Ultimately we need systems that generalize rules from a small amount of grounding data. I guess that describes world models, which large language models currently lack (explicitly, at least).
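As a concrete (and hedged) illustration of "automatic grounding from rules": for the "code a product" example a couple of comments up, the grounding could be a reward computed entirely from hand-written checks, with no human in the loop. The task, the checks, and the generate_solution() stub below are all made up.

```python
# Hypothetical sketch of automatic grounding: a reward signal computed
# purely from mechanically checkable rules (here, hard-coded tests).

def generate_solution(model, spec: str) -> str:
    """Placeholder for sampling a candidate implementation from a model."""
    return "def add(a, b):\n    return a + b\n"

def automatic_reward(candidate_source: str) -> float:
    """Score a candidate by how many hard-coded checks it passes."""
    namespace = {}
    try:
        exec(candidate_source, namespace)  # run the candidate code
    except Exception:
        return 0.0
    checks = [
        ("add", (1, 2), 3),
        ("add", (-1, 1), 0),
        ("add", (0, 0), 0),
    ]
    passed = 0
    for fn_name, args, expected in checks:
        fn = namespace.get(fn_name)
        try:
            if fn is not None and fn(*args) == expected:
                passed += 1
        except Exception:
            pass
    return passed / len(checks)

reward = automatic_reward(generate_solution(None, "implement add(a, b)"))
print(reward)  # 1.0 for the stub above
```

This kind of check is cheap and automatic, though nowhere near the "trillions of times per second" that board-game rules allow.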
Language is not necessarily best modelled by predicting the next token in a sentence, just as Go is not best modelled by predicting the next move.
It makes perfect sense that predicting the next token can be improved by going back every so often to reevaluate if the sequence of tokens makes any sense as a whole.
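A toy sketch of what "reevaluating the sequence as a whole" could mean in practice: sample several full continuations and keep the one that scores best as a complete sequence (best-of-n). The score_sequence() function below is a crude stand-in for a real sequence log-probability or reward model, and the candidate strings are made up.

```python
import random

# Instead of committing greedily token by token, sample several complete
# continuations and keep the one that scores best as a whole sequence.

def sample_continuation(prompt: str) -> str:
    """Placeholder for sampling one continuation from a language model."""
    return prompt + " " + random.choice(
        ["the cat sat on the mat", "mat the on sat cat the", "the cat sat sat sat"]
    )

def score_sequence(text: str) -> float:
    """Placeholder whole-sequence score: a crude repetition penalty."""
    words = text.split()
    repeats = len(words) - len(set(words))
    return -float(repeats)

def best_of_n(prompt: str, n: int = 8) -> str:
    candidates = [sample_continuation(prompt) for _ in range(n)]
    return max(candidates, key=score_sequence)

print(best_of_n("The story begins:"))
```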
If we're conjecturing anyway, I reckon the next major step won't come from changes that merely improve training or prediction, but from one that fundamentally makes the model capable of learning: removing the distinction between 'conversing' and 'learning'. This is similar to what you call 'interacting', but I reckon mere interaction won't be enough unless the distinction between letting the model predict and training the model disappears.
The rules of Go are not exactly "programmed in" to AlphaGo; they are only provided so the model knows what the valid moves and win conditions are as it self-trains. The latest versions of the model can be quickly adapted to play other games[0] and MuZero can "master games without knowing their rules"[1]. Also, some of DeepMind's earliest work was a deep reinforcement learning model that could play seven different Atari games with no adjustments needed to switch games[2].
The rules of Go very much are programmed into AlphaGo. The neural nets are generic, sure, but the game rules programmed manually into the training system are vital to the system working. Even with MuZero: while it uses a learned world model, during training there is still ultimately grounding in the manually programmed rules of whatever game it's playing. When it trains on Atari, the Atari system (manually programmed, obviously) provides the grounding rules.
What I am saying is there is no analogous source of manually programmed rules to ground language models during training.
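To make the grounding point concrete, here is a toy self-play loop where the learner is completely generic but the environment's rules (legal moves, win condition) are hand-written code, and those rules are what anchor training. This is a stand-in for the argument above, not anything resembling AlphaGo's actual training code.

```python
import random

# The learner below knows nothing about the game; the grounding comes
# entirely from the manually programmed rules in NimGame.

class NimGame:
    """Take 1-3 stones from a pile of 21; whoever takes the last stone wins."""

    def __init__(self, stones=21):
        self.stones = stones
        self.player = 0  # player 0 moves first

    def legal_moves(self):
        # Manually programmed rule: you may take 1, 2, or 3 stones.
        return [m for m in (1, 2, 3) if m <= self.stones]

    def play(self, move):
        # Manually programmed rule: taking the last stone wins the game.
        self.stones -= move
        winner = self.player if self.stones == 0 else None
        self.player = 1 - self.player
        return winner

def self_play_episode(policy):
    """Play one game in which both sides sample moves from the same policy table."""
    game, history = NimGame(), []
    while True:
        moves = game.legal_moves()
        weights = [policy.get((game.stones, m), 1.0) for m in moves]
        move = random.choices(moves, weights=weights, k=1)[0]
        history.append((game.player, game.stones, move))
        winner = game.play(move)
        if winner is not None:
            return winner, history

def train(episodes=20000):
    policy = {}  # (stones_remaining, move) -> sampling weight
    for _ in range(episodes):
        winner, history = self_play_episode(policy)
        for player, stones, move in history:
            # The only learning signal is who won, and "who won" is defined
            # entirely by the hand-coded rules above.
            delta = 0.1 if player == winner else -0.05
            key = (stones, move)
            policy[key] = max(0.01, policy.get(key, 1.0) + delta)
    return policy

policy = train()
best = max((1, 2, 3), key=lambda m: policy.get((3, m), 1.0))
print("preferred move with 3 stones left:", best)  # should learn to take all 3
```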
Seems like now that we have multimodal LLMs and a virtually infinite supply of digitized video, we could get a model to prompt itself about what might happen in a video and then tie rewards to what actually happens later in the clip.
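A rough sketch of that idea, with every function a placeholder rather than a real multimodal API: the model commits to a prediction halfway through a clip and is rewarded by how well it matches what the rest of the clip actually shows.

```python
# All functions here are stand-ins; a real system would use a multimodal
# model for prediction and captioning and an embedding model for similarity.

def describe(frames):
    """Placeholder: caption what is visible in a set of frames."""
    return " ".join(frames)  # pretend the frames are already short text captions

def predict_continuation(first_half):
    """Placeholder: the model prompts itself about what happens next."""
    return "the ball goes into the goal"

def similarity(prediction, outcome):
    """Placeholder: crude word overlap; in practice an embedding similarity."""
    a, b = set(prediction.split()), set(outcome.split())
    return len(a & b) / max(1, len(a | b))

def self_supervised_reward(video_frames):
    """Reward the model's own prediction against what the video actually shows."""
    midpoint = len(video_frames) // 2
    prediction = predict_continuation(video_frames[:midpoint])
    outcome = describe(video_frames[midpoint:])   # ground truth from the video itself
    return similarity(prediction, outcome)

clip = ["player lines up the shot", "the ball goes into the goal"]
print(self_supervised_reward(clip))
```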
The analogy starts falling apart at the "play": it wouldn't make much sense to have a weak model play a strong model, and we can't live-update a model based on the results. Generally, self-play is meant in the manner Meta did it, as part of training.
Harnessing the power of human-annotated data through Supervised Fine-Tuning (SFT) is pivotal for advancing Large Language Models (LLMs). In this paper, we delve into the prospect of growing a strong LLM out of a weak one without the need for acquiring additional human-annotated data. We propose a new fine-tuning method called Self-Play fIne-tuNing (SPIN), which starts from a supervised fine-tuned model. At the heart of SPIN lies a self-play mechanism, where the LLM refines its capability by playing against instances of itself. More specifically, the LLM generates its own training data from its previous iterations, refining its policy by discerning these self-generated responses from those obtained from human-annotated data. Our method progressively elevates the LLM from a nascent model to a formidable one, unlocking the full potential of human-annotated demonstration data for SFT. Theoretically, we prove that the global optimum to the training objective function of our method is achieved only when the LLM policy aligns with the target data distribution. Empirically, we evaluate our method on several benchmark datasets including the HuggingFace Open LLM Leaderboard, MT-Bench, and datasets from Big-Bench.

Our results show that SPIN can significantly improve the LLM’s performance across a variety of benchmarks and even outperform models trained through direct preference optimization (DPO) supplemented with extra GPT-4 preference data. This sheds light on the promise of self-play, enabling the achievement of human-level performance in LLMs without the need for expert opponents.
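As I read the abstract, the mechanism amounts to a DPO-style logistic loss in which the human-annotated response plays the "chosen" role and a response sampled from the previous iterate plays the "rejected" role. A hedged sketch of that reading (not code from the paper; the tensor names and beta value are assumptions):

```python
import torch
import torch.nn.functional as F

# Sketch of the self-play objective as described in the abstract: push the
# current model to prefer human responses over its own previous-iteration
# responses, measured relative to the frozen previous iterate.

def spin_style_loss(
    logp_human_current: torch.Tensor,   # log p_theta(y_human | x) under the model being trained
    logp_human_previous: torch.Tensor,  # log p_prev(y_human | x) under the frozen previous iterate
    logp_self_current: torch.Tensor,    # log p_theta(y_self | x), y_self sampled from the previous iterate
    logp_self_previous: torch.Tensor,   # log p_prev(y_self | x)
    beta: float = 0.1,                  # assumed scaling factor
) -> torch.Tensor:
    """Logistic loss on the margin between human and self-generated responses."""
    human_margin = logp_human_current - logp_human_previous
    self_margin = logp_self_current - logp_self_previous
    # logistic loss log(1 + exp(-t)) == softplus(-t)
    return F.softplus(-beta * (human_margin - self_margin)).mean()

# Toy usage with made-up per-example sequence log-probabilities:
loss = spin_style_loss(
    logp_human_current=torch.tensor([-12.0, -9.5]),
    logp_human_previous=torch.tensor([-13.0, -10.0]),
    logp_self_current=torch.tensor([-8.0, -7.0]),
    logp_self_previous=torch.tensor([-7.5, -6.8]),
)
print(loss)
```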
Define "play". With AlphaGo, "play" is functionally defined as generating additional data from the true distribution. There is no easy analogy for this in LLMs.