
There are multiple ways to model reward. You can make it fine-grained, so that every token gets its own reward, but by far the most common approach is to feed in the whole sequence and produce a single reward at the end.
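To make that concrete, here is a minimal sketch (not from the comment above) of a sequence-level reward model in PyTorch: a transformer backbone encodes the whole sequence, and only the hidden state at the final non-padding token is projected to a single scalar. The gpt2 backbone and last-token pooling are illustrative assumptions, not the commenter's setup.

    import torch
    import torch.nn as nn
    from transformers import AutoModel, AutoTokenizer

    class SequenceRewardModel(nn.Module):
        def __init__(self, base_name="gpt2"):  # backbone choice is an assumption
            super().__init__()
            self.backbone = AutoModel.from_pretrained(base_name)
            self.reward_head = nn.Linear(self.backbone.config.hidden_size, 1)

        def forward(self, input_ids, attention_mask):
            hidden = self.backbone(input_ids=input_ids,
                                   attention_mask=attention_mask).last_hidden_state
            # Pool on the last non-padding token so the whole sequence
            # collapses to a single scalar reward.
            last = attention_mask.sum(dim=1) - 1
            pooled = hidden[torch.arange(hidden.size(0)), last]
            return self.reward_head(pooled).squeeze(-1)  # shape: (batch,)

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    tokenizer.pad_token = tokenizer.eos_token
    model = SequenceRewardModel()
    batch = tokenizer(["prompt text ... candidate response"],
                      return_tensors="pt", padding=True)
    print(model(batch["input_ids"], batch["attention_mask"]))  # one reward per sequence

Training such a model is then typically done with a pairwise loss on chosen/rejected completions rather than per-token labels.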


I guess I'm not sure how the "feed in the whole sequence" works, if there's a single reward at the end.


It depends on the model and the problem. As an example, BERT-based models have a special [CLS] token that is pre-trained to encode information about the whole sequence. A reward model based on BERT would take the output embedding of that token from the last layer and feed it through a classification head whose shape depends on your problem. You could then train that head on your alignment dataset as a standard classification problem.
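A hedged sketch of what that could look like with Hugging Face transformers; the bert-base-uncased checkpoint and the single-logit head are assumptions for illustration, not code from the thread or from TRL:

    import torch.nn as nn
    from transformers import BertModel, BertTokenizer

    class BertRewardModel(nn.Module):
        def __init__(self, name="bert-base-uncased"):
            super().__init__()
            self.bert = BertModel.from_pretrained(name)
            self.head = nn.Linear(self.bert.config.hidden_size, 1)  # scalar reward head

        def forward(self, input_ids, attention_mask):
            out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
            cls = out.last_hidden_state[:, 0]  # output embedding of the [CLS] token
            return self.head(cls).squeeze(-1)

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertRewardModel()
    batch = tokenizer(["a candidate completion"], return_tensors="pt", padding=True)
    print(model(batch["input_ids"], batch["attention_mask"]))

Swapping the output size of the head (1 logit vs. N classes) and the loss function is what changes between a scalar reward model and an ordinary classifier; the [CLS] pooling stays the same.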

You can check the examples from the TRL library for more information.


> You can check the examples from the TRL library for more information.

What library is that? Thanks!



