
Step 1. Train a VLM to supervise the RL training.

Step 2. Train the RL network. In the meantime, drink coffee or work on your plan for world domination.
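
Roughly what that two-step setup could look like, as a minimal sketch: a frozen VLM scores each rollout from the policy being trained, and that score is used as the reward for an ordinary policy-gradient-style update. All names here (policy_sample, vlm_score, policy_update) are hypothetical stand-ins, not any particular library.

    from typing import Callable, List, Tuple

    def train_with_vlm_grader(
        policy_sample: Callable[[str], str],          # step 2: the RL network being trained
        vlm_score: Callable[[str, str], float],       # step 1: frozen VLM acting as grader
        policy_update: Callable[[List[Tuple[str, str, float]]], None],
        prompts: List[str],
        iterations: int = 1000,
    ) -> None:
        # Outer RL loop: sample behaviour, have the VLM judge it, update the policy.
        for _ in range(iterations):
            batch = []
            for prompt in prompts:
                rollout = policy_sample(prompt)       # generate behaviour to be graded
                reward = vlm_score(prompt, rollout)   # VLM supervises instead of a human
                batch.append((prompt, rollout, reward))
            policy_update(batch)                      # one policy update; coffee goes here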



My understanding is that this is essentially how RLHF works, and it doesn't scale. As you run RL for longer, the model learns to exploit the imperfections of the grader instead of getting better at the task at hand. Therefore, to scale RL you really need good graders, and determinism is king.
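
To make the distinction concrete, a hedged sketch of the two kinds of graders (both functions are hypothetical illustrations): a learned grader emits a smooth, imperfect score that the policy can push up without actually being correct, while a verifiable grader is a deterministic check with nothing to exploit.

    def learned_grader(prompt: str, answer: str) -> float:
        """Reward-model score: a smooth, imperfect estimate of quality.
        Over a long RL run the policy can find answers that score high
        here without being good -- reward hacking."""
        ...

    def verifiable_grader(answer: str, expected: str) -> float:
        """Deterministic check: the answer either matches or it doesn't,
        so there is no 'almost fooled the grader' direction to climb."""
        return 1.0 if answer.strip() == expected.strip() else 0.0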


Do you think constitutional approaches would help here? (Verifiable reward for the main score, but then asking the model to self-critique for security and quality.)
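
For what it's worth, one hedged sketch of what that could mean in practice (all names hypothetical): keep the verifiable score as the main reward and let the self-critique only subtract a small, bounded penalty, so the critique model never becomes the thing the policy learns to game.

    from typing import Callable

    def constitutional_reward(
        prompt: str,
        answer: str,
        verifiable_score: Callable[[str, str], float],  # deterministic main grader, e.g. tests
        self_critique: Callable[[str, str], float],     # model rates its own security/quality in [0, 1]
        critique_weight: float = 0.2,
    ) -> float:
        # Main signal stays verifiable; the critique can only shave off a bounded amount.
        main = verifiable_score(prompt, answer)
        penalty = critique_weight * (1.0 - self_critique(prompt, answer))
        return main - penalty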


You're talking about training an LLM. I'm talking about training robotic/motor skills and haptic feedback.



