Self-Rewarding Language Models (arxiv.org)
95 points by jph00 on Jan 20, 2024 | 60 comments



I'm not an author on this paper, but I posted it here because it's in my area of research and I think it's a great example of recent excellent results coming from "self improvement" of LLMs. Most of the authors are very well known in the deep learning academic community and have written many high impact papers between them.

Many researchers believe that using synthetic data and automatic classification/scoring, i.e. using LLMs to improve LLMs, is likely to be one of the most successful lines of research. We've been seeing a lot of success from this in recent months, including OpenAI's DALL-E 3, which used (IIRC) 99% synthetic LLM data for captions.


> using synthetic data

I'm curious about both this and the emphasis on "high quality" data (e.g. Microsoft's Phi models) ...

1) What are the goals of using synthetic data: just a source of more data, or a source of "high quality" data?

2) What is the definition/measure of "high quality" data - is this about consistency, or coverage, or what?


And the really neat part is that this compounds.


If this is actually improving itself, I assume it has to be bounded by its ability to self-evaluate.

Starting with Llama 70B makes sense for Meta, but I can't help but wonder what the results would look like if applied to Mixtral. If it replicates and isn't overfitting, could we see a performant and open-source GPT-4 competitor?


The ability to self-evaluate seems to be improved by adding highly evaluated output to its training data. I'm not sure how far that technique can be pushed but it's very promising given the performance of this 70b model. I bet progress in the self-evaluation technique could let this trick go quite far.

It's hard to say how well this would work on a MoE model, but the best-case scenario is something decently better than GPT-4 that can run on 2x 3090/4090 or 1x 48GB 5090.


I hope Nvidia gives us a 48GB 5090, but I can just as easily see them wanting to keep the data center divide going by keeping the consumer cards low on VRAM.

Here's hoping AMD forces them to keep pushing the boundary.


That's not a problem; you can get used 24GB 3090 Tis for a decent price and run them together for WAY cheaper than a flagship 5090 will cost. It'll generate tokens slower, but probably still faster than you can read.
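
For rough context, a back-of-the-envelope estimate (my own numbers, not from the thread) of why a 70B model fits on two 24GB cards once quantized:

    # Rough VRAM estimate for a 70B model at 4-bit quantization (illustrative only).
    params = 70e9
    bytes_per_param = 0.5                        # 4-bit weights
    weights_gb = params * bytes_per_param / 1e9
    print(f"weights: ~{weights_gb:.0f} GB")      # ~35 GB
    # Leaves roughly 13 GB of a 2x24 GB setup for KV cache, activations, and overhead.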


Even if they did, 48GB would quickly become “the new 12GB” as models expanded.


Given that Intel seems to be making a stronger push, I think it's more likely that Intel delivers a high-RAM (if slower) card into the fray.


I guess for self-evaluation and generation, we might want to choose a model that's performant for the job. This means that if the 70B is fine-tuned, that is probably the judge + augmenter, versus a generic model. Also, I think the paper shows the win rate using Mistral Medium on some preliminary benchmark (Table 2).

But, I liked the idea that the reward model is not static, and if the user is provided with multiple options, then the extra score might help break the tie.


Even if it is overfitting, in some ways this is arguably fine-tuning on some subset of the LLM's capabilities. I can imagine this being a very powerful technique with Mixtral.


I agree to some extent, though I also wouldn't be surprised if a model's ability to evaluate is tied to its ability to predict. Better predictors evaluate better.

I also expect that evaluation ability doesn't grow linearly with prediction ability, so I doubt that a model will be able to fully optimise its evaluation potential on its own.

Could be wrong though; it will be interesting to see. If I'm right on the former and wrong on the latter, we could maybe see models self-evaluating their way to an "optimal" state, for whatever the model thinks optimal is, anyway.

Or this could be a somewhat useful case of overfitting that just refines a few bits.


I wouldn't characterize it as overfitting, since the evaluation function exists to filter some of the output. This is basically biasing the model towards a subset of the latent space that the evaluation function says is "good" in a very roundabout way.


So this is a good idea because it can create vastly more training data for a model to learn from. However, it seems likely that these models are going to hallucinate like crazy. As featured in the AlphaGo documentary, that program struggled with hallucination, and it performed self-play in a tiny world based on a perfectly rigid and exactly correct set of rules. LLMs already have tons of false corners and edges in their logic about the world, and this seems like it has the potential to spread those all around.

Hard to say — these things can be difficult to predict. I can see this working, but there'll probably be some ratio of training data to self-play that we have a hard time getting past, because it's a difficult-to-control form of extrapolation.


It's a very promising idea, though. Judging content is much easier than creating content, after all.


It is a promising idea, but it falls prey to the same tautology as in the game-playing agent days. In order for simulation to be useful, you need a really robust model of the world, but if you had a really robust model of the world, you wouldn't need the simulator. Simulation is really, really good for one thing: learning a policy that can find a particular corner of the search space as fast as possible (such as a winning state in a Go game). But simulation is not good at actually generating the space. It's going to extrapolate mistakes really badly.

Also, while humans are better at judging (look at that fake thing!) than generating (drawing a realistic photo), I think you may find that — despite our abilities — detection is actually quite a bit harder. As an example:

"John Doe is dead."

This was very easy for me to create, but it's quite difficult for you to judge whether or not it is true due to a variety of factors (which John Doe, am I being honest, when was the last time you saw John Doe, do you know anyone who knows John Doe and could check, perhaps John Doe had a twin who died, etc.).


Your problem is harder than it needs to be; we can convert it to a closed problem, by requiring citations. Then the question is whether the statement is properly supported by the citations, which is far easier to evaluate.


Interesting direction! Let's assume the citation is me. Does that make content generation harder than verification (judgement)?


Quickly scanned https://arxiv.org/abs/2401.10020. Quite interesting work. The paper's idea is to have a single language model doing both question answering (responding to prompts) and self-evaluating its own answers. Iterative DPO training is used to improve the model's dual capabilities.

The authors tried different LLM-as-a-judge promptings to generate a reward score for each answer. A particular additive 5-point scoring prompt was found to be the most effective. The two-step inference pipeline (answering questions + evaluating answers) also generates an extra dataset in the form of <question, winning answer, losing answer>.

This AI-generated dataset (called preference pairs) is fed back to the model in a training pipeline (using Direct Preference Optimization).

The inferencing and training pipelines are connected to form a closed-loop, iterative process. Each iteration generates better AI-feedback training data and subsequently a better model. The evaluation shows very promising results, outperforming Claude 2, Gemini Pro, and GPT-4 on selected benchmarks.
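
To make the loop concrete, here's a minimal sketch of how I read the iterative pipeline. The helpers generate, judge_score, and dpo_train are hypothetical stand-ins, and the rubric string is only a paraphrase of the paper's additive 5-point idea, not its actual prompt:

    # Minimal sketch of the self-rewarding loop (hypothetical helpers, not the paper's code).
    JUDGE_RUBRIC = ("Score the response from 0 to 5, adding one point for each criterion met: "
                    "relevance, coverage, usefulness, clarity, expert quality.")  # paraphrase only

    def self_reward_iteration(model, prompts, n_samples=4):
        pairs = []
        for prompt in prompts:
            # 1) Sample several candidate answers from the current model.
            candidates = [generate(model, prompt, temperature=0.7) for _ in range(n_samples)]
            # 2) The same model acts as judge and scores each candidate.
            scores = [judge_score(model, JUDGE_RUBRIC, prompt, c) for c in candidates]
            # 3) Highest- and lowest-scored answers become a preference pair.
            if max(scores) > min(scores):
                best = candidates[scores.index(max(scores))]
                worst = candidates[scores.index(min(scores))]
                pairs.append((prompt, best, worst))
        # 4) Train the next iteration with DPO on the self-generated pairs.
        return dpo_train(model, pairs)

    # M1 -> M2 -> M3: each pass should yield better feedback data and a better model.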

The paper has some room for improvement. 1) Figure 1 does not accurately reflect the entire workflow. A fixed model is used to generate prompts, for example, but it is not shown in the figure. The preference pairs should be a matrix instead of a vector in the diagram. Also, the bootstrapping workflow (using seed instruction-following and evaluation datasets) should be reflected.

2) The authors did not explain why a fixed model is used to generate prompts, instead of using the self-rewarding model directly.

3) The authors tried to use another form of AI feedback data (question, best-answer), coupled with supervised fine-tuning. However, it did not result in any performance improvement for the model. It would be better to explore why, or at least propose it as future work.

4) Fundamentally, the paper does not directly compare (or comment on) self-rewarding vs. independent rewarding. The iterative process can still apply to an independent rewarding model.


I'm no AI expert, but this seems to me like it's overfitting after the fact by basically learning from the specific examples and remembering them when it's re-run?


It sounds like they're giving it some temperature to allow answer variation, then having it score its own output and taking the best answers. This is akin to common bootstrap methods in statistics that resample the data to estimate a distribution over outcomes based on sample variance. The key, in my opinion, is the self-scoring procedure; how that is implemented would really control how well this process works. Since it works pretty well here - a 70B non-MoE model in striking distance of GPT-4 is huge - I'm inclined to say they've happened upon a self-scoring procedure that's good at reinforcing desired traits in the model.


It depends on the strength of the seed dataset in terms of task diversity and the initial power of the seed model, I'm guessing. If a wide enough array of diverse input is generated, then it may not be overfitting.


Can a global model overfit on all the data?


Might also be worth scanning: "Automatically Correcting Large Language Models: Surveying the landscape of diverse self-correction strategies"

https://arxiv.org/abs/2308.03188


Probably a silly question, but as a non-academic, what's the barrier to entry for publishing these papers? I have many adjacent ideas and projects that never see the light of day outside of my personal lab/play space. This one is very similar to another I have been toying with.


The barrier to publishing a paper on arXiv is much lower than in academic journals. That said, you still have to convince them that your idea is relevant. They may require some sort of credentials, but if the idea is solid and the paper is well-written, you have a good chance of being accepted anyway.

edit: typos


But of course, like on a regular website, being published there won't mean it'll also be read. That would be the main challenge.


How might this compare to all these legends of "Q*"? Isn't this in some sense a combination of reinforcement learning and LLMs?


Q* was some theory about OpenAI from the time Sam Altman was fired. This is Meta.


Sure, I understand that. I remember reading speculation that Q* was an attempt to combine LLM training with reinforcement learning. My naive reading is that this result is somewhat similar, and I'm wondering how it compares to the rumored/speculated Q* technique.


Has this been reviewed and/or published anywhere? Anyone able to vouch for the authors?


It's common for ML papers to be posted on arxiv in advance of official publication (hence, preprint server). I haven't heard of the other authors but Kyunghyun Cho is legit and well known in the field.


Last May, Karpathy mentioned that the next step in LLM performance would likely take inspiration from AlphaGo's Monte Carlo-style tree traversal to enable deeper and more self-correcting reasoning:

https://youtu.be/bZQun8Y4L2A?si=RgD7NWfwDdh0bklK&t=1630

But if this approach holds up, it suggests that the most valuable part of the AlphaGo project that applies to LLM development is in fact reinforcement learning through self-play. Why not create a leaderboard of reward-generating language models that constantly "play" against each other, where model selection frequency is based on Elo, and update on the results? What if the Open LLM leaderboard (https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderb...) evolved into a constantly improving pool of such models? This also alleviates data scaling issues by providing a diverse and continuously changing distribution of new training inputs.
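
The Elo bookkeeping itself is simple; here's the standard update rule (nothing LLM-specific, and the K-factor of 32 is just a common default):

    # Standard Elo update after one head-to-head comparison between two models.
    def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
        """score_a is 1.0 if model A's response won, 0.0 if it lost, 0.5 for a tie."""
        expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
        return r_a + k * (score_a - expected_a), r_b - k * (score_a - expected_a)

    # A 1500-rated model beating a 1600-rated one gains about 20 points:
    print(elo_update(1500, 1600, 1.0))  # (~1520.5, ~1579.5)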


AlphaGo works because it's grounded in the rules of Go. The rules are unambiguous and programmed into the system manually.

There are no such unambiguous rules to ground language models. Iterative "improvement" could easily degenerate into nonsense without grounding in the real world. That's why some people think that true AGI will need to be grounded by the laws of physics, via experience interacting with the real physical world using robot bodies.


Any output of the system that interacts with humans can be used for this, it doesn't necessarily need to be robots and physical manipulation. For example, it could be "code a product" where the models all get the same spec and compete to provide the best implementation, or "create the funniest meme" where the models are judged by each other and then grounded in the real-world rules of reddit upvotes.


Any grounding that comes from humans is scarce. For training we really want automatic grounding from rules that can be applied trillions of times per second rather than labor intensive human grounding at much lower rates.

Of course, grounding from robots is also scarce unless you build a whole lot of robots. Ultimately we need systems that generalize rules from a small amount of grounding data. I guess that describes world models, which large language models currently lack (explicitly, at least).


Grounding from programming languages (or similar sandboxes) wouldn't be scarce and feedback would be faster.

It only goes so far, but considering how far LLM's got already, it seems promising.
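
A hedged sketch of what that kind of grounding could look like (the pytest command and the crude pass/fail reward are illustrative choices, not from any particular paper, and a real setup would need proper sandboxing):

    # Illustrative reward signal from code execution: run the model's generated
    # solution against unit tests and use the outcome as the reward.
    import pathlib
    import subprocess
    import tempfile

    def execution_reward(generated_code: str, test_code: str, timeout: int = 10) -> float:
        with tempfile.TemporaryDirectory() as tmp:
            pathlib.Path(tmp, "solution.py").write_text(generated_code)
            pathlib.Path(tmp, "test_solution.py").write_text(test_code)
            result = subprocess.run(
                ["python", "-m", "pytest", "-q", "test_solution.py"],
                cwd=tmp, capture_output=True, text=True, timeout=timeout,
            )
        # Crude shaping: full credit if every test passes, none otherwise.
        return 1.0 if result.returncode == 0 else 0.0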


Language is not necessarily best modelled by predicting the next token in a sentence, just like Go is not best modelled by predicting the next move.

It makes perfect sense that predicting the next token can be improved by going back every so often to reevaluate if the sequence of tokens makes any sense as a whole.

If we're conjecturing anyway, I reckon the next major step won't come from changes that merely improve training or predicting, but from one that fundamentally makes the model capable of learning. Removing the distinction between 'conversing' and 'learning'. This is similar to what you call 'interacting', but I reckon that mere interaction won't be enough if the distinction between letting the model predict and training the model doesn't disappear.


The rules of Go are not exactly "programmed in" to AlphaGo; they are only provided so the model knows what the valid moves and win conditions are as it self-trains. The latest versions of the model can be quickly adapted to play other games[0] and MuZero can "master games without knowing their rules"[1]. Also, some of DeepMind's earliest work was a deep reinforcement learning model that could play seven different Atari games with no adjustments needed to switch games[2].

[0] https://en.wikipedia.org/wiki/AlphaGo_Zero

[1] https://en.wikipedia.org/wiki/MuZero

[2] https://arxiv.org/abs/1312.5602


The rules of Go very much are programmed in to AlphaGo. The neural nets are generic, sure. But the game rules programmed manually into the training system are vital to the system working. Even with MuZero: while it uses a world model which is learned, ultimately during training there is still grounding in the manually programmed rules of whatever game it's playing. When it trains on Atari, the Atari system (manually programmed, obviously) provides the grounding rules.

What I am saying is there is no analogous source of manually programmed rules to ground language models during training.


Seems like now that we have multimodal LLMs and a virtually infinite supply of digitized video, we could get a model to prompt itself about what might happen in a video and then tie rewards to what actually happens later in the video.


The analogy starts falling apart at the "play", i.e. it wouldn't make much sense to have a weak model play a strong model, and we can't live update a model based on the results. Generally, self-play is meant in the manner Meta did it, as part of training.


A recent paper goes into this, right? "Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models": https://arxiv.org/pdf/2401.01335.pdf (a rough sketch of its objective follows the abstract below)

    Harnessing the power of human-annotated data through Supervised Fine-Tuning (SFT) is pivotal for advancing Large Language Models (LLMs). In this paper, we delve into the prospect of growing a strong LLM out of a weak one without the need for acquiring additional human-annotated data. We propose a new fine-tuning method called Self-Play fIne-tuNing (SPIN), which starts from a supervised fine-tuned model. At the heart of SPIN lies a self-play mechanism, where the LLM refines its capability by playing against instances of itself. More specifically, the LLM generates its own training data from its previous iterations, refining its policy by discerning these self-generated responses from those obtained from human-annotated data. Our method progressively elevates the LLM from a nascent model to a formidable one, unlocking the full potential of human-annotated demonstration data for SFT. Theoretically, we prove that the global optimum to the training objective function of our method is achieved only when the LLM policy aligns with the target data distribution. Empirically, we evaluate our method on several benchmark datasets including the HuggingFace Open LLM Leaderboard, MT-Bench, and datasets from Big-Bench.

    Our results show that SPIN can significantly improve the LLM's performance across a variety of benchmarks and even outperform models trained through direct preference optimization (DPO) supplemented with extra GPT-4 preference data. This sheds light on the promise of self-play, enabling the achievement of human-level performance in LLMs without the need for expert opponents.
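
For reference, the core of SPIN is a DPO-style objective where the "chosen" response is the human-annotated one and the "rejected" response comes from the previous iteration of the model itself. A rough sketch of that loss in PyTorch, based on my reading of the paper (log-probs assumed precomputed elsewhere; beta plays the role of the paper's lambda):

    # Rough sketch of the SPIN objective in its logistic-loss form.
    import torch
    import torch.nn.functional as F

    def spin_loss(logp_new_human, logp_old_human, logp_new_synth, logp_old_synth, beta=0.1):
        # How much more the updated model prefers human data over its own previous
        # generations, measured relative to the frozen previous-iteration model.
        margin = (logp_new_human - logp_old_human) - (logp_new_synth - logp_old_synth)
        return -F.logsigmoid(beta * margin).mean()

    # Each round: sample y_synth from the previous model, then minimize spin_loss so the
    # new model shifts probability mass from its own old outputs toward the human responses.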


Yeah, that's what we're commenting on

(c.f. instances of itself, self-play)


Define "play". With AlphaGo, "play" is functionally defined as generating additional data from the true distribution. There is no easy analogy for this in LLMs.


Cannot wait for Meta to open source the LLM that makes Google obsolete. Some startup will put a very simple React UI, with a search box, on top of some LLM running on AWS, and Google as we know it will be replaced with $500k-3M in funding. The MVP backend will be a dead-simple chat API, written with a few hundred lines of code.

What an interesting turn of history.


Not sure what you use Google for, but I use it to find websites on the internet. It's not a problem an LLM is even trying to solve.

(I'll acknowledge that LLMs might eat a layer off the top of that where people are seeking knowledge... but it's never even going to come close to replacing the traditional core business)


The URL piece just requires training with the URLs in the data; results spit out URLs. Updating the model is tricky, but the hardest part about replacing Google is solved, and will be open source.


Google search will soon become obsolete by itself. The quality of the results seems to be getting worse by the day. In fact, I'd say Google search is almost unrecognisable compared to what it used to be. It used to be the second place I'd go when I couldn't get the answer on DuckDuckGo, but now it's like a waste of time even trying. Just full of SEO-tuned rubbish. I think the poor content people are writing with LLMs has left their whole search unable to tell the difference between quality content and just grammatically correct, churned-out crap.


A few problems with this theory.

1. Competing purely on result quality with Google is like competing with Coke based purely on having a subjectively better-tasting formula.

If it really was better at search than Google (and scaled, was economical, etc.) it would have to be VASTLY better to get people to switch. Google has the brand, the defaults, the ecosystems, decades of habit forming, etc. It would not be enough to just be "better". Traditional wisdom would say it would have to be 10x better (the idea of quantifying a subjective improvement like this is absurd ofc.)

2. AGI aspires to be human-like. I don't see a human-like intelligence as anywhere near Google's level for search results. A traditional vision of an AGI would be like asking your clever mate. Useful, but not for search.

3. Even if something came along which was truly better enough, Google themselves would have to not have access to it/something comparable. You would need to somehow have a lasting, huge & exclusive advantage.

4. Even if there was something that was a clear 10x improvement on Google, which had novel tech which Google could not themselves replicate, then Google would simply acquire them. Almost everything Google has done since the original site has been acquiring a tech, applying operational expertise, network effects, etc. & acting as essentially a distributor.


The point is that the cost to replicate Google search as a whole will drop like a rock. They have brand value, which you are arguing will maintain their monopoly. But I really don't think that's true.

Their business model is built on serving links, which as time goes on will become more and more obsolete. The links are where the ad money comes from.


Google still solves the "searching" part. An AI still has to search, or it comes up with facts it can't verify itself, which is probably a hallucination.


[flagged]


The ad model is no longer needed, or even desired. When the operational expense is so low due to a simple LLM and very few employees, other business models are possible.


Training LLMs and generating LLM responses is probably more expensive than... basically anything else we have now.


Things have quietly broken in favor of local LLMs over the past month. I say this as someone who was a huge skeptic until maybe 2 weeks ago. It's not GPT-4, but it doesn't need to be.

StableLM 3B can handle RAG inputs. This is huge. 7B models can't run on consumer mobile hardware: https://x.com/jpohhhh/status/1747451790969184682?s=20

llama.cpp got examples for iOS / Android in December. It can run on all platforms via one library: https://x.com/jpohhhh/status/1748852554920849579?s=20

retrieval/vector DB can run on all platforms via one library: https://github.com/Telosnex/fonnx


It's expensive right now. Compute costs always drop like a rock.


Needed or desired by whom? It's easy to make money with ads, therefore it will happen. Energy isn't free and people won't pay. The hypothetical AGI will necessarily require a lot of compute, so only big companies will have access.

Funny to see people rooting for meta. Did you forget how they make money?


> The ad model is no longer needed, or even desired.

I'll believe that when YouTube is dead and gone.


> No way to block because they’re subtly embedded into everything.

Which will be super illegal in several jurisdictions that mandate ads be clearly labelled.



