Last May, Karpathy mentioned that the next step in LLM performance would likely take inspiration from AlphaGo's Monte Carlo-style tree traversal to enable deeper and more self-correcting reasoning: https://youtu.be/bZQun8Y4L2A?si=RgD7NWfwDdh0bklK&t=1630
But if this approach holds up, it suggests that the most valuable part of the AlphaGo project that applies to LLM development is in fact reinforcement learning through self-play. Why not create a leaderboard of reward-generating language models that constantly "play" against each other, where selection frequency is weighted by Elo rating, and update the models on the results? What if the Open LLM leaderboard (https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderb...) evolved into a constantly improving pool of such models? This would also alleviate data-scaling issues by providing a diverse and continuously changing distribution of new training inputs.
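For concreteness, here is a minimal sketch of what the Elo-weighted "league" idea above might look like. Everything in it (the model names, the K-factor, the judge() placeholder) is an assumption for illustration, not an existing API.

```python
import random

# Hypothetical sketch of an Elo-weighted pool of competing models.
K = 16  # Elo update step size (assumed)

ratings = {"model_a": 1200.0, "model_b": 1200.0, "model_c": 1200.0}

def expected_score(r_a, r_b):
    """Standard Elo expected score for player A against player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def pick_pair(ratings):
    """Sample two distinct models, weighting selection by current rating."""
    names = list(ratings)
    weights = [ratings[n] for n in names]
    a = random.choices(names, weights=weights, k=1)[0]
    b = random.choice([n for n in names if n != a])
    return a, b

def judge(prompt, a, b):
    """Placeholder: 1.0 if a 'wins', 0.0 if b wins, 0.5 for a tie.
    In practice this would be a reward model, human votes, or task-specific checks."""
    return random.choice([0.0, 0.5, 1.0])

def play_round(prompt):
    a, b = pick_pair(ratings)
    score_a = judge(prompt, a, b)
    exp_a = expected_score(ratings[a], ratings[b])
    ratings[a] += K * (score_a - exp_a)
    ratings[b] += K * ((1.0 - score_a) - (1.0 - exp_a))

for _ in range(1000):
    play_round("write the best summary of this article")
print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```

In a real training loop the match results would also feed back into fine-tuning, which is exactly the part the rest of this thread argues is hard to ground.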
AlphaGo works because it's grounded in the rules of Go. The rules are unambiguous and programmed into the system manually.
There are no such unambiguous rules to ground language models. Iterative "improvement" could easily degenerate into nonsense without grounding in the real world. That's why some people think that true AGI will need to be grounded by the laws of physics, via experience interacting with the real physical world using robot bodies.
Any output of the system that interacts with humans can be used for this; it doesn't necessarily need to be robots and physical manipulation. For example, it could be "code a product", where the models all get the same spec and compete to provide the best implementation, or "create the funniest meme", where the models are judged by each other and then grounded in the real-world signal of Reddit upvotes.
Any grounding that comes from humans is scarce. For training, we really want automatic grounding from rules that can be applied trillions of times per second rather than labor-intensive human grounding at much lower rates.
Of course, grounding from robots is also scarce unless you build a whole lot of robots. Ultimately we need systems that generalize rules from a small amount of grounding data. I guess that describes world models, which large language models currently lack (explicitly, at least).
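As a concrete (and hedged) illustration of "automatic grounding from rules": for the "code a product" example a couple of comments up, the grounding could be a reward computed entirely from hand-written checks, with no human in the loop. The task, the checks, and the generate_solution() stub below are all made up.

```python
# Hypothetical sketch of automatic grounding: a reward signal computed
# purely from mechanically checkable rules (here, hard-coded tests).

def generate_solution(model, spec: str) -> str:
    """Placeholder for sampling a candidate implementation from a model."""
    return "def add(a, b):\n    return a + b\n"

def automatic_reward(candidate_source: str) -> float:
    """Score a candidate by how many hard-coded checks it passes."""
    namespace = {}
    try:
        exec(candidate_source, namespace)  # run the candidate code
    except Exception:
        return 0.0
    checks = [
        ("add", (1, 2), 3),
        ("add", (-1, 1), 0),
        ("add", (0, 0), 0),
    ]
    passed = 0
    for fn_name, args, expected in checks:
        fn = namespace.get(fn_name)
        try:
            if fn is not None and fn(*args) == expected:
                passed += 1
        except Exception:
            pass
    return passed / len(checks)

reward = automatic_reward(generate_solution(None, "implement add(a, b)"))
print(reward)  # 1.0 for the stub above
```

This kind of check is cheap and automatic, though nowhere near the "trillions of times per second" that board-game rules allow.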
Language is not necessarily best modelled by predicting the next token in a sentence, just as Go is not best modelled by predicting the next move.
It makes perfect sense that predicting the next token can be improved by going back every so often to reevaluate if the sequence of tokens makes any sense as a whole.
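A toy sketch of what "reevaluating the sequence as a whole" could mean in practice: sample several full continuations and keep the one that scores best as a complete sequence (best-of-n). The score_sequence() function below is a crude stand-in for a real sequence log-probability or reward model, and the candidate strings are made up.

```python
import random

# Instead of committing greedily token by token, sample several complete
# continuations and keep the one that scores best as a whole sequence.

def sample_continuation(prompt: str) -> str:
    """Placeholder for sampling one continuation from a language model."""
    return prompt + " " + random.choice(
        ["the cat sat on the mat", "mat the on sat cat the", "the cat sat sat sat"]
    )

def score_sequence(text: str) -> float:
    """Placeholder whole-sequence score: a crude repetition penalty."""
    words = text.split()
    repeats = len(words) - len(set(words))
    return -float(repeats)

def best_of_n(prompt: str, n: int = 8) -> str:
    candidates = [sample_continuation(prompt) for _ in range(n)]
    return max(candidates, key=score_sequence)

print(best_of_n("The story begins:"))
```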
If we're conjecturing anyway, I reckon the next major step won't come from changes that merely improve training or prediction, but from one that fundamentally makes the model capable of learning: removing the distinction between 'conversing' and 'learning'. This is similar to what you call 'interacting', but I reckon mere interaction won't be enough unless the distinction between letting the model predict and training the model disappears.
The rules of Go are not exactly "programmed in" to AlphaGo; they are only provided so the model knows what the valid moves and win conditions are as it self-trains. The latest versions of the model can be quickly adapted to play other games[0] and MuZero can "master games without knowing their rules"[1]. Also, some of DeepMind's earliest work was a deep reinforcement learning model that could play seven different Atari games with no adjustments needed to switch games[2].
The rules of Go very much are programmed into AlphaGo. The neural nets are generic, sure, but the game rules programmed manually into the training system are vital to the system working. Even with MuZero: while it uses a learned world model, during training there is still ultimately grounding in the manually programmed rules of whatever game it's playing. When it trains on Atari, the Atari system (manually programmed, obviously) provides the grounding rules.
What I am saying is there is no analogous source of manually programmed rules to ground language models during training.
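To make the grounding point concrete, here is a toy self-play loop where the learner is completely generic but the environment's rules (legal moves, win condition) are hand-written code, and those rules are what anchor training. This is a stand-in for the argument above, not anything resembling AlphaGo's actual training code.

```python
import random

# The learner below knows nothing about the game; the grounding comes
# entirely from the manually programmed rules in NimGame.

class NimGame:
    """Take 1-3 stones from a pile of 21; whoever takes the last stone wins."""

    def __init__(self, stones=21):
        self.stones = stones
        self.player = 0  # player 0 moves first

    def legal_moves(self):
        # Manually programmed rule: you may take 1, 2, or 3 stones.
        return [m for m in (1, 2, 3) if m <= self.stones]

    def play(self, move):
        # Manually programmed rule: taking the last stone wins the game.
        self.stones -= move
        winner = self.player if self.stones == 0 else None
        self.player = 1 - self.player
        return winner

def self_play_episode(policy):
    """Play one game in which both sides sample moves from the same policy table."""
    game, history = NimGame(), []
    while True:
        moves = game.legal_moves()
        weights = [policy.get((game.stones, m), 1.0) for m in moves]
        move = random.choices(moves, weights=weights, k=1)[0]
        history.append((game.player, game.stones, move))
        winner = game.play(move)
        if winner is not None:
            return winner, history

def train(episodes=20000):
    policy = {}  # (stones_remaining, move) -> sampling weight
    for _ in range(episodes):
        winner, history = self_play_episode(policy)
        for player, stones, move in history:
            # The only learning signal is who won, and "who won" is defined
            # entirely by the hand-coded rules above.
            delta = 0.1 if player == winner else -0.05
            key = (stones, move)
            policy[key] = max(0.01, policy.get(key, 1.0) + delta)
    return policy

policy = train()
best = max((1, 2, 3), key=lambda m: policy.get((3, m), 1.0))
print("preferred move with 3 stones left:", best)  # should learn to take all 3
```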
Seems like now that we have multimodal LLMs and a virtually infinite supply of digitized video, we could get a model to prompt itself about what might happen in a video and then tie rewards to what actually happens later in the clip.
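A rough sketch of that idea, with every function a placeholder rather than a real multimodal API: the model commits to a prediction halfway through a clip and is rewarded by how well it matches what the rest of the clip actually shows.

```python
# All functions here are stand-ins; a real system would use a multimodal
# model for prediction and captioning and an embedding model for similarity.

def describe(frames):
    """Placeholder: caption what is visible in a set of frames."""
    return " ".join(frames)  # pretend the frames are already short text captions

def predict_continuation(first_half):
    """Placeholder: the model prompts itself about what happens next."""
    return "the ball goes into the goal"

def similarity(prediction, outcome):
    """Placeholder: crude word overlap; in practice an embedding similarity."""
    a, b = set(prediction.split()), set(outcome.split())
    return len(a & b) / max(1, len(a | b))

def self_supervised_reward(video_frames):
    """Reward the model's own prediction against what the video actually shows."""
    midpoint = len(video_frames) // 2
    prediction = predict_continuation(video_frames[:midpoint])
    outcome = describe(video_frames[midpoint:])   # ground truth from the video itself
    return similarity(prediction, outcome)

clip = ["player lines up the shot", "the ball goes into the goal"]
print(self_supervised_reward(clip))
```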
The analogy starts falling apart at the "play": it wouldn't make much sense to have a weak model play a strong model, and we can't live-update a model based on the results. Generally, self-play is meant in the manner Meta did it, as part of training.
Harnessing the power of human-annotated data through Supervised Fine-Tuning (SFT) is pivotal for advancing Large Language Models (LLMs). In this paper, we delve into the prospect of growing a strong LLM out of a weak one without the need for acquiring additional human-annotated data. We propose a new fine-tuning method called Self-Play fIne-tuNing (SPIN), which starts from a supervised fine-tuned model. At the heart of SPIN lies a self-play mechanism, where the LLM refines its capability by playing against instances of itself. More specifically, the LLM generates its own training data from its previous iterations, refining its policy by discerning these self-generated responses from those obtained from human-annotated data. Our method progressively elevates the LLM from a nascent model to a formidable one, unlocking the full potential of human-annotated demonstration data for SFT. Theoretically, we prove that the global optimum to the training objective function of our method is achieved only when the LLM policy aligns with the target data distribution. Empirically, we evaluate our method on several benchmark datasets including the HuggingFace Open LLM Leaderboard, MT-Bench, and datasets from Big-Bench.

Our results show that SPIN can significantly improve the LLM’s performance across a variety of benchmarks and even outperform models trained through direct preference optimization (DPO) supplemented with extra GPT-4 preference data. This sheds light on the promise of self-play, enabling the achievement of human-level performance in LLMs without the need for expert opponents.
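As I read the abstract, the mechanism amounts to a DPO-style logistic loss in which the human-annotated response plays the "chosen" role and a response sampled from the previous iterate plays the "rejected" role. A hedged sketch of that reading (not code from the paper; the tensor names and beta value are assumptions):

```python
import torch
import torch.nn.functional as F

# Sketch of the self-play objective as described in the abstract: push the
# current model to prefer human responses over its own previous-iteration
# responses, measured relative to the frozen previous iterate.

def spin_style_loss(
    logp_human_current: torch.Tensor,   # log p_theta(y_human | x) under the model being trained
    logp_human_previous: torch.Tensor,  # log p_prev(y_human | x) under the frozen previous iterate
    logp_self_current: torch.Tensor,    # log p_theta(y_self | x), y_self sampled from the previous iterate
    logp_self_previous: torch.Tensor,   # log p_prev(y_self | x)
    beta: float = 0.1,                  # assumed scaling factor
) -> torch.Tensor:
    """Logistic loss on the margin between human and self-generated responses."""
    human_margin = logp_human_current - logp_human_previous
    self_margin = logp_self_current - logp_self_previous
    # logistic loss log(1 + exp(-t)) == softplus(-t)
    return F.softplus(-beta * (human_margin - self_margin)).mean()

# Toy usage with made-up per-example sequence log-probabilities:
loss = spin_style_loss(
    logp_human_current=torch.tensor([-12.0, -9.5]),
    logp_human_previous=torch.tensor([-13.0, -10.0]),
    logp_self_current=torch.tensor([-8.0, -7.0]),
    logp_self_previous=torch.tensor([-7.5, -6.8]),
)
print(loss)
```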
Define "play". With AlphaGo, "play" is functionally defined as generating additional data from the true distribution. There is no easy analogy for this in LLMs.