
The same way LLMs are trained to predict the next token from previous tokens: longer context = better memory = object permanence.


But then you can’t just give the previous frame; by the LLM analogy you would have to give the last few thousand frames (that’s the context window, right?). If you only give the previous frame, that’s like an LLM that only gets the single previous token and has to predict the next one.
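The frame-window analogy above can be sketched as a sliding buffer: like an LLM's token context, the model would condition on the last N frames rather than just the previous one. This is a minimal illustration only; `predict_next` and `CONTEXT_FRAMES` are made-up placeholders, not any real model's API.

```python
from collections import deque

CONTEXT_FRAMES = 4  # illustrative; a real video model might use thousands

def predict_next(window):
    # placeholder for the model: a real network would infer dynamics
    # (velocity, occluded objects, etc.) from the whole window of frames
    return f"frame after {list(window)}"

# deque(maxlen=N) drops the oldest frame automatically, like a context
# window sliding forward one token at a time
window = deque(maxlen=CONTEXT_FRAMES)
for t in range(6):
    window.append(f"f{t}")

# only the most recent CONTEXT_FRAMES frames remain available as context
print(list(window))
```

The point of the sketch is just the conditioning set: with `maxlen=1` you get the degenerate "previous frame only" case the comment describes.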


Indeed. Although more recently they figured out a way to feed the hidden state back in as the new input, which basically lets the model "continue thinking" in vectors without round-tripping through words (or pixels).
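A toy sketch of that "keep thinking in vectors" idea, with all names invented for illustration: instead of decoding the hidden state to a token (or pixel frame) and re-embedding it, the hidden state itself is fed back in as the next step's input, and decoding happens only at the end.

```python
def step(hidden, inp):
    # stand-in for one network step: mix the hidden state with the input
    # (a shifted blend here, purely so the state evolves visibly)
    n = len(hidden)
    return [0.5 * hidden[i] + 0.5 * inp[(i + 1) % n] for i in range(n)]

def decode(hidden):
    # stand-in for projecting the hidden state down to an output symbol
    return round(sum(hidden), 3)

def latent_rollout(hidden, n_latent_steps):
    """Run several steps entirely in vector space, decoding only once."""
    for _ in range(n_latent_steps):
        hidden = step(hidden, hidden)  # the hidden state IS the next input
    return decode(hidden)

print(latent_rollout([1.0, 0.0], 3))
```

The usual loop would call `decode` after every step and re-embed the result, losing whatever the vector carried that the output vocabulary can't express; the latent loop skips that bottleneck.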

Presumably if you were to take that and build a large enough NN to accommodate all the necessary state it needs to carry and all the rules it needs to be able to execute, then after training it on enough game input you'd have a proper world simulation. Of course, as the article rightly notes, then you have just successfully reimplemented Minecraft in a way that is orders of magnitude more computationally expensive...


Perhaps the trick used by text-based LLMs could be used: when the context window starts filling up, the LLM is asked to summarize the existing data in the context, thus compressing it (lossily) into a smaller space.
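A rough sketch of that summarize-when-full trick, assuming a placeholder `summarize` (a real system would call the LLM itself here). The point is the control flow: once the context exceeds its budget, the oldest entries get collapsed into a single lossy summary entry.

```python
MAX_ITEMS = 8  # illustrative context budget; in practice, a token limit

def summarize(items):
    # placeholder for an LLM summarization call
    return f"<summary of {len(items)} earlier entries, ending at: {items[-1]}>"

def append_with_compression(context, new_item, max_items=MAX_ITEMS):
    context.append(new_item)
    if len(context) > max_items:
        # collapse the oldest half of the context into one summary entry
        head = context[: max_items // 2]
        context[: max_items // 2] = [summarize(head)]
    return context

ctx = []
for i in range(12):
    append_with_compression(ctx, f"frame {i}")
print(ctx)
```

Note that earlier summaries get re-summarized along with old entries on later passes, so detail decays with age, which is exactly the lossy compression the comment describes.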


More previous tokens in this case would mean more previous frames. But there's really no reason to just stick to rendered pixels as input (except for novelty's sake) because we could train directly on snapshots of full game state.
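Training on state snapshots rather than pixels would mean supervised pairs of (state_t, state_t+1) pulled straight from the simulator. A toy illustration, with a made-up one-object falling-block "game" standing in for real game state:

```python
def physics_step(state):
    # ground-truth simulator: (x, y, vertical velocity) with gravity
    x, y, vy = state
    vy -= 1                # gravity pulls down each tick
    y = max(0, y + vy)     # clamp at the floor
    if y == 0:
        vy = 0             # landed
    return (x, y, vy)

def make_dataset(initial, steps):
    """Collect (state_t, state_t+1) pairs a world model would be fit on."""
    data, s = [], initial
    for _ in range(steps):
        nxt = physics_step(s)
        data.append((s, nxt))
        s = nxt
    return data

dataset = make_dataset((0, 10, 0), 5)
```

The trade-off the thread raises applies directly: a model fit on these structured snapshots sidesteps rendering entirely, but it is tied to this game's state schema, whereas a pixel-trained model could in principle transfer to anything that renders to a screen.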


Yeah, but then it's not generalizable.


Doesn't that depend on how such game state is modeled?



