One issue is that we don't have a lot of text, not even a megabyte of it (represented as Unicode characters). So you could get a language model, but how would you judge its output? Maybe it would be really good at generating more text like what we already have, but that text probably isn't very representative of the things we would want to be able to read.