
There may not be enough text content on the internet, but there’s plenty of audio and video content, and there has already been some research on connecting that as an input to an LLM. So far we’ve seen that the more diverse the training data, the more versatile the model, so I suspect multimodal input training is inevitably where LLMs are going.
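For anyone curious what "connecting audio as an input to an LLM" looks like in practice, here is a minimal sketch of one common adapter-style approach: features from a pretrained audio encoder are projected into the LLM's token-embedding space and concatenated with the text embeddings, so the model attends over both as one sequence. All dimensions, module names, and the random stand-in tensors below are illustrative assumptions, not any specific model's implementation.

    import torch
    import torch.nn as nn

    AUDIO_DIM = 512   # assumed output size of a pretrained audio encoder
    LLM_DIM = 4096    # assumed hidden size of the language model

    class AudioToLLMProjector(nn.Module):
        """Maps audio-encoder frames into the LLM embedding space."""
        def __init__(self, audio_dim: int, llm_dim: int):
            super().__init__()
            self.proj = nn.Linear(audio_dim, llm_dim)

        def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
            # audio_feats: (batch, frames, audio_dim)
            return self.proj(audio_feats)  # (batch, frames, llm_dim)

    # Toy usage: random stand-ins for real encoder output and text embeddings.
    audio_feats = torch.randn(1, 50, AUDIO_DIM)   # 50 audio frames
    text_embeds = torch.randn(1, 20, LLM_DIM)     # 20 text-token embeddings

    projector = AudioToLLMProjector(AUDIO_DIM, LLM_DIM)
    audio_embeds = projector(audio_feats)

    # The LLM then attends over audio and text as a single sequence.
    inputs = torch.cat([audio_embeds, text_embeds], dim=1)
    print(inputs.shape)  # torch.Size([1, 70, 4096])

During training, the projector (and optionally the encoder or LLM) is tuned on paired audio–text data, which is where the "more diverse data, more versatile model" effect comes in.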

