
Why use RL for this instead of plain old supervised learning?


I am trying to understand this too.

In supervised learning you train on pairs (x, y), where x is your input (title/post text/metadata) and y is the output score.

Naively, it's a linear regression model, Y = b0 + b1x1 + b2x2 + b3x3, where b0 is the intercept ("a floor for score points") and b1, b2, and b3 are coefficients on the actual features of the post. You can solve this in closed form and find the b1/b2/b3 that minimize the error of fitting to Y.
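As a sketch of that closed-form fit (with made-up features and upvote counts purely for illustration; the column of ones gives you the intercept b0):

```python
import numpy as np

# Hypothetical feature matrix: a column of ones for the intercept b0,
# then three toy post features (e.g. title length, hour posted, karma).
X = np.array([
    [1.0, 42.0,  9.0,  150.0],
    [1.0, 17.0, 14.0, 3000.0],
    [1.0, 63.0, 21.0,   12.0],
    [1.0, 30.0, 11.0,  800.0],
])
y = np.array([5.0, 120.0, 2.0, 40.0])  # observed upvote counts

# Closed-form least squares: b = (X^T X)^{-1} X^T y.
# lstsq is the numerically stable way to compute this.
b, *_ = np.linalg.lstsq(X, y, rcond=None)
pred = X @ b  # fitted scores for each post
```

With more rows than columns this minimizes squared error rather than fitting exactly, which is the usual situation with real post data.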

How do these equations change with RL? I always assumed RL was a multi-step process where actions are taken to get to a reward. If there is only 1 step/decision, to produce a "random" score, it feels much like supervised learning.


The post is not doing RL. It's just regression as you thought.


This post is using regression to build a reward model. The reward model will then be used (in a future post) to build the overall RL system.

Here's the relevant text from the article:

>In this post we’ll discuss how to build a reward model that can predict the upvote count that a specific HN story will get. And in follow-up posts in this series, we’ll use that reward model along with reinforcement learning to create a model that can write high-value HN stories!


The title is misleading. The $4.80 is spent for supervised learning to find the best post.

The post is interesting and I'll be sure to check out the next parts too. It's just that people, as evidenced by this thread, clearly misunderstood what was done.


It is just plain old supervised learning: a regression from the post features to the vote count. The RL discussion in TFA is a bit confusing.

Such a model can be used as the "reward model" for the "reinforcement learning from human feedback" (RLHF) method.
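To make the connection concrete, here's a toy sketch of how a fitted regression becomes a "reward model": you just treat its predicted upvote count as the scalar reward for a candidate post. Everything here (the features, the weights, the titles) is invented for illustration, not from the article:

```python
def featurize(title: str) -> list[float]:
    # Toy features: intercept term, title length, word-gap count.
    return [1.0, float(len(title)), float(title.count(" "))]

def reward(weights: list[float], title: str) -> float:
    # Reward = predicted upvote count under the fitted regression.
    return sum(w * x for w, x in zip(weights, featurize(title)))

# Pretend these weights came out of the regression fit above.
w = [2.0, 0.1, 0.5]
candidates = ["Show HN: my weekend project", "Ask HN: is RL overkill?"]
best = max(candidates, key=lambda t: reward(w, t))
```

In RLHF proper, that scalar would then drive a policy-gradient update of the generating model; the reward model itself stays plain supervised regression.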



