anticensor, 7 months ago, on: OpenAI dropped the price of o3 by 80%

I think the idea is they just feed each candidate answer to the RLHF reward model that was used to train the model, and return the answer with the highest reward.
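The selection step described above is best-of-n sampling: draw several answers, score each one with a reward model, and keep the top-scoring one. Below is a minimal Python sketch. It assumes hypothetical `generate` and `reward` callables standing in for the policy model and the reward model. It illustrates the technique only and says nothing about what OpenAI actually runs.

```python
import random
from typing import Callable, List


def best_of_n(
    prompt: str,
    generate: Callable[[str], str],
    reward: Callable[[str, str], float],
    n: int = 4,
) -> str:
    """Sample n candidate answers and return the one the reward model scores highest."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Draw n independent candidates from the (hypothetical) policy model.
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    # Score each candidate with the (hypothetical) reward model and keep the argmax.
    # On ties, max() returns the first candidate with the highest score.
    return max(candidates, key=lambda answer: reward(prompt, answer))


if __name__ == "__main__":
    rng = random.Random(0)

    def toy_generate(prompt: str) -> str:
        # Hypothetical stand-in for sampling the model at nonzero temperature.
        return f"{prompt} -> draft {rng.randint(1, 100)}"

    def toy_reward(prompt: str, answer: str) -> float:
        # Hypothetical stand-in for a reward model: prefers larger draft numbers.
        return float(answer.rsplit(" ", 1)[-1])

    print(best_of_n("2+2?", toy_generate, toy_reward, n=8))
```

Every candidate has to be generated and scored, so the cost of this approach grows roughly linearly with `n`.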