My understanding is that the image data used a decoder-only stage, i.e. mapping ...

		ddalex on June 1, 2023 \| parent \| context \| favorite \| on: Nvidia DGX GH200: 100 Terabyte GPU Memory System My understanding is that the image data used a decoder-only stage, i.e. mapping images to tokens, basically taking the image textual description instead of the actual pixels so it can't "see" but can understand the "narration of an image"