Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

My understanding is that the image data used a decoder-only stage, i.e. mapping images to tokens, basically taking the image textual description instead of the actual pixels so it can't "see" but can understand the "narration of an image"


Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: