
There are multiple ways to model reward. You can make it fine-grained, so that every token gets its own reward, but by far the most common approach is to feed in the whole sequence and produce a single reward at the end.
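To make that concrete, here is a minimal sketch (not from the comment above) of a sequence-level reward model in PyTorch: a transformer backbone encodes the whole sequence, and only the hidden state at the final non-padding token is projected to a single scalar. The gpt2 backbone and last-token pooling are illustrative assumptions, not the commenter's setup.

    import torch
    import torch.nn as nn
    from transformers import AutoModel, AutoTokenizer

    class SequenceRewardModel(nn.Module):
        def __init__(self, base_name="gpt2"):  # backbone choice is an assumption
            super().__init__()
            self.backbone = AutoModel.from_pretrained(base_name)
            self.reward_head = nn.Linear(self.backbone.config.hidden_size, 1)

        def forward(self, input_ids, attention_mask):
            hidden = self.backbone(input_ids=input_ids,
                                   attention_mask=attention_mask).last_hidden_state
            # Pool on the last non-padding token so the whole sequence
            # collapses to a single scalar reward.
            last = attention_mask.sum(dim=1) - 1
            pooled = hidden[torch.arange(hidden.size(0)), last]
            return self.reward_head(pooled).squeeze(-1)  # shape: (batch,)

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    tokenizer.pad_token = tokenizer.eos_token
    model = SequenceRewardModel()
    batch = tokenizer(["prompt text ... candidate response"],
                      return_tensors="pt", padding=True)
    print(model(batch["input_ids"], batch["attention_mask"]))  # one reward per sequence

Training such a model is then typically done with a pairwise loss on chosen/rejected completions rather than per-token labels.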


I guess I'm not sure how the "feed in the whole sequence" works, if there's a single reward at the end.


It depends on the model and the problem. As an example, BERT-based models have a special [CLS] token that is pre-trained to encode information about the whole sequence. A reward model based on BERT would take the output embedding of that token from the last layer and feed it through a classification head whose shape depends on your problem. You could then train that head on your alignment dataset as a standard classification problem.
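A hedged sketch of what that could look like with Hugging Face transformers; the bert-base-uncased checkpoint and the single-logit head are assumptions for illustration, not code from the thread or from TRL:

    import torch.nn as nn
    from transformers import BertModel, BertTokenizer

    class BertRewardModel(nn.Module):
        def __init__(self, name="bert-base-uncased"):
            super().__init__()
            self.bert = BertModel.from_pretrained(name)
            self.head = nn.Linear(self.bert.config.hidden_size, 1)  # scalar reward head

        def forward(self, input_ids, attention_mask):
            out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
            cls = out.last_hidden_state[:, 0]  # output embedding of the [CLS] token
            return self.head(cls).squeeze(-1)

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertRewardModel()
    batch = tokenizer(["a candidate completion"], return_tensors="pt", padding=True)
    print(model(batch["input_ids"], batch["attention_mask"]))

Swapping the output size of the head (1 logit vs. N classes) and the loss function is what changes between a scalar reward model and an ordinary classifier; the [CLS] pooling stays the same.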

You can check the examples from the TRL library for more information.


> You can check the examples from the TRL library for more information.

What library is that? Thanks!



