
The same way LLMs are trained to predict the next token from previous tokens: longer context = better memory = object permanence.


But then you can’t just give the previous frame; by the LLM analogy you would have to give the last few thousand frames (that’s the context window, right?). If you only give the previous frame, that’s like an LLM that only gets the single previous token and has to predict the next one.
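The frame-window analogy above can be sketched as a sliding buffer: like an LLM's token context, the model would condition on the last N frames rather than just the previous one. This is a minimal illustration only; `predict_next` and `CONTEXT_FRAMES` are made-up placeholders, not any real model's API.

```python
from collections import deque

CONTEXT_FRAMES = 4  # illustrative; a real video model might use thousands

def predict_next(window):
    # placeholder for the model: a real network would infer dynamics
    # (velocity, occluded objects, etc.) from the whole window of frames
    return f"frame after {list(window)}"

# deque(maxlen=N) drops the oldest frame automatically, like a context
# window sliding forward one token at a time
window = deque(maxlen=CONTEXT_FRAMES)
for t in range(6):
    window.append(f"f{t}")

# only the most recent CONTEXT_FRAMES frames remain available as context
print(list(window))
```

The point of the sketch is just the conditioning set: with `maxlen=1` you get the degenerate "previous frame only" case the comment describes.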


Indeed. Although more recently they figured out a way to feed the hidden state back in as the new input, which basically lets the model "continue thinking" in vectors without round-tripping through words (or pixels).
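A toy sketch of that "keep thinking in vectors" idea, with all names invented for illustration: instead of decoding the hidden state to a token (or pixel frame) and re-embedding it, the hidden state itself is fed back in as the next step's input, and decoding happens only at the end.

```python
def step(hidden, inp):
    # stand-in for one network step: mix the hidden state with the input
    # (a shifted blend here, purely so the state evolves visibly)
    n = len(hidden)
    return [0.5 * hidden[i] + 0.5 * inp[(i + 1) % n] for i in range(n)]

def decode(hidden):
    # stand-in for projecting the hidden state down to an output symbol
    return round(sum(hidden), 3)

def latent_rollout(hidden, n_latent_steps):
    """Run several steps entirely in vector space, decoding only once."""
    for _ in range(n_latent_steps):
        hidden = step(hidden, hidden)  # the hidden state IS the next input
    return decode(hidden)

print(latent_rollout([1.0, 0.0], 3))
```

The usual loop would call `decode` after every step and re-embed the result, losing whatever the vector carried that the output vocabulary can't express; the latent loop skips that bottleneck.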

Presumably if you were to take that and build a large enough NN to accommodate all the necessary state it needs to carry and all the rules it needs to be able to execute, then after training it on enough game input you'd have a proper world simulation. Of course, as the article rightly notes, then you have just successfully reimplemented Minecraft in a way that is orders of magnitude more computationally expensive...


Perhaps the trick used by text-based LLMs could be used: when the context window starts filling up, the LLM is asked to summarize the existing data in the context, thus compressing it (lossily) into a smaller space.
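A rough sketch of that summarize-when-full trick, assuming a placeholder `summarize` (a real system would call the LLM itself here). The point is the control flow: once the context exceeds its budget, the oldest entries get collapsed into a single lossy summary entry.

```python
MAX_ITEMS = 8  # illustrative context budget; in practice, a token limit

def summarize(items):
    # placeholder for an LLM summarization call
    return f"<summary of {len(items)} earlier entries, ending at: {items[-1]}>"

def append_with_compression(context, new_item, max_items=MAX_ITEMS):
    context.append(new_item)
    if len(context) > max_items:
        # collapse the oldest half of the context into one summary entry
        head = context[: max_items // 2]
        context[: max_items // 2] = [summarize(head)]
    return context

ctx = []
for i in range(12):
    append_with_compression(ctx, f"frame {i}")
print(ctx)
```

Note that earlier summaries get re-summarized along with old entries on later passes, so detail decays with age, which is exactly the lossy compression the comment describes.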


More previous tokens in this case would mean more previous frames. But there's really no reason to just stick to rendered pixels as input (except for novelty's sake) because we could train directly on snapshots of full game state.
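Training on state snapshots rather than pixels would mean supervised pairs of (state_t, state_t+1) pulled straight from the simulator. A toy illustration, with a made-up one-object falling-block "game" standing in for real game state:

```python
def physics_step(state):
    # ground-truth simulator: (x, y, vertical velocity) with gravity
    x, y, vy = state
    vy -= 1                # gravity pulls down each tick
    y = max(0, y + vy)     # clamp at the floor
    if y == 0:
        vy = 0             # landed
    return (x, y, vy)

def make_dataset(initial, steps):
    """Collect (state_t, state_t+1) pairs a world model would be fit on."""
    data, s = [], initial
    for _ in range(steps):
        nxt = physics_step(s)
        data.append((s, nxt))
        s = nxt
    return data

dataset = make_dataset((0, 10, 0), 5)
```

The trade-off the thread raises applies directly: a model fit on these structured snapshots sidesteps rendering entirely, but it is tied to this game's state schema, whereas a pixel-trained model could in principle transfer to anything that renders to a screen.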


Yeah, but then it's not generalizable.


Doesn't that depend on how such game state is modeled?



