
The analogy starts to fall apart at the "play" part: it wouldn't make much sense to have a weak model play a strong model, and we can't live-update a model based on the results. Generally, self-play is meant in the way Meta did it, as part of training.


A recent paper goes into this, right? "Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models": https://arxiv.org/pdf/2401.01335.pdf

    Harnessing the power of human-annotated data through 
    Supervised Fine-Tuning (SFT) is pivotal for advancing 
    Large Language Models (LLMs). In this paper, we 
    delve into the prospect of growing a strong LLM out of a 
    weak one without the need for acquiring additional 
    human annotated data. We propose a new fine-tuning 
    method called Self-Play fIne-tuNing (SPIN),
    which starts from a supervised fine-tuned model. At the 
    heart of SPIN lies a self-play mechanism,
    where the LLM refines its capability by playing against 
    instances of itself. More specifically, the
    LLM generates its own training data from its previous 
    iterations, refining its policy by discerning
    these self-generated responses from those obtained from 
    human-annotated data. Our method
    progressively elevates the LLM from a nascent model to a 
    formidable one, unlocking the full
    potential of human-annotated demonstration data for SFT. 
    Theoretically, we prove that the global optimum to the 
    training objective function of our method is achieved 
    only when the LLM policy aligns with the target data 
    distribution. Empirically, we evaluate our method on
    several benchmark datasets including the HuggingFace 
    Open LLM Leaderboard, MT-Bench, and datasets from
    Big-Bench.

    Our results show that SPIN can significantly improve the 
    LLM’s performance across a variety of benchmarks and 
    even outperform models trained through direct
    preference optimization (DPO) supplemented with extra 
    GPT-4 preference data. This sheds light on the promise 
    of self-play, enabling the achievement of human-level 
    performance in LLMs without the need for expert 
    opponents.
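
For intuition, here's a rough sketch of the loop that abstract describes. To be clear, this is my own paraphrase and not the paper's code: the names (spin_loss, beta) and the exact logistic form are assumptions on my part; the idea is a DPO-like pairwise objective where the "rejected" response always comes from the model's previous iterate.

    import torch
    import torch.nn.functional as F

    def spin_loss(policy_logp_human, ref_logp_human,
                  policy_logp_synth, ref_logp_synth, beta=0.1):
        # Pairwise logistic loss: push the current policy toward the
        # human (SFT) response and away from the response generated by
        # the previous iterate of the same model (the "opponent").
        # Inputs are summed token log-probabilities for each response.
        margin = beta * ((policy_logp_human - ref_logp_human)
                         - (policy_logp_synth - ref_logp_synth))
        return -F.logsigmoid(margin).mean()

    # Toy call with made-up log-probs for a batch of two prompts:
    loss = spin_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-13.0, -10.0]),
                     torch.tensor([-8.0, -7.0]), torch.tensor([-8.5, -7.2]))

    # Outer loop (schematic): at round t, a frozen copy of the round t-1
    # model generates the synthetic responses and acts as the reference;
    # after training, it is replaced by the updated model, so the
    # "opponent" gets stronger each round.

The point versus vanilla DPO is that no preference labels or extra GPT-4 data are needed: the "loser" in every pair is always the model's own earlier output.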


Yeah, that's what we're commenting on.

(cf. "instances of itself", "self-play")



