Ah, well, there are actually two classes of replies here, and maybe I'm confusing one for the other.
My claim regarding architecture is really just a formal one: you can take any statistical model trained via gradient descent and rephrase it as a kNN. The only difference is how hard it is to produce such a model by fitting to data rather than by rephrasing.
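To make that rephrasing concrete, here's a toy sketch (scikit-learn on a 1-D regression problem; the particular model, data, and grid size are just things I picked for illustration, not the general construction): train a small net by gradient descent, then build a 1-NN over a dense sample of its own predictions, and the two compute essentially the same function.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

# A model trained via gradient descent on some noisy 1-D data.
X_train = rng.uniform(-3, 3, size=(200, 1))
y_train = (X_train.ravel() ** 2) / 3 + rng.normal(0, 0.1, size=200)
mlp = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=5000, random_state=0)
mlp.fit(X_train, y_train)

# "Rephrase" it as a 1-NN lookup over a dense sample of its own predictions:
# the kNN now computes (nearly) the same function, it was just produced by
# querying the trained model instead of by fitting to the original data.
X_dense = np.linspace(-3, 3, 10_000).reshape(-1, 1)
knn = KNeighborsRegressor(n_neighbors=1).fit(X_dense, mlp.predict(X_dense))

X_test = rng.uniform(-3, 3, size=(50, 1))
print(np.max(np.abs(mlp.predict(X_test) - knn.predict(X_test))))  # tiny, set by the grid spacing
```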
The idea that there's something special about architecture is, really, a hardware illusion. Any empirical function-approximation algorithm designed to find the same conditional probability structure will, in the limit t -> inf, approximate the same structure (i.e., the actual conditional joint distribution of the data).
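And the limit claim in the same toy setting (again an illustrative sketch, not a proof; the function, noise level, and sample sizes are made up): two very different function approximators, fit to the same data-generating process, both drift toward the same conditional expectation as the sample grows.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(1)
f = lambda x: np.tanh(2 * x)                  # the true conditional mean E[y|x]
X_eval = np.linspace(-2, 2, 500).reshape(-1, 1)

for n in (100, 1_000, 10_000):
    X = rng.uniform(-2, 2, size=(n, 1))
    y = f(X).ravel() + rng.normal(0, 0.3, size=n)

    knn = KNeighborsRegressor(n_neighbors=max(5, n // 100)).fit(X, y)
    mlp = MLPRegressor(hidden_layer_sizes=(64,), max_iter=3000, random_state=0).fit(X, y)

    err = lambda m: np.mean((m.predict(X_eval) - f(X_eval).ravel()) ** 2)
    # Both errors should shrink as n grows: same target structure, different "architecture".
    print(n, round(err(knn), 4), round(err(mlp), 4))
```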
> The idea that there's something special about architecture is, really, a hardware illusion. Any empirical function-approximation algorithm designed to find the same conditional probability structure will, in the limit t -> inf, approximate the same structure (i.e., the actual conditional joint distribution of the data).
But it's not just about hardware. Maybe it would be, if we had access to an infinite stream of perfectly noise-free training data for every conceivable ML task. But we also need to worry about actually getting useful information out of finite data, not just finite computing resources. That's the limit you should be thinking about: the information content of input data, not compute cycles.
And yes, when trying to learn something as tremendously complicated as a world-model of multiple languages and human reasoning, even a dataset as big as The Pile might not be big enough if our model is inefficient at extracting information from data. And even with the (relatively) data-efficient transformer architecture, a huge dataset has an upper limit of usefulness if it contains a lot of junk or generally has low information density.
I put together an example that should hopefully demonstrate what I mean: https://paste.sr.ht/~wintershadows/7fb412e1d05a600a0da5db2ba.... Obviously this case is very stylized, but the key point is that the right model architecture can make good use of finite and/or noisy data, and the wrong model architecture cannot, regardless of how much compute power you throw at the latter.
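The gist, in case the link dies (this is a stand-in sketch, not the exact code from the paste; the numbers and model choices are just illustrative): fit a few noisy samples with a model whose structure matches the generator and with a generic nearest-neighbour memorizer, then ask both what happens outside the training window.

```python
import numpy as np
from scipy.optimize import curve_fit
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(2)
true = lambda x: 2.0 * np.sin(3.0 * x + 0.5)   # the data-generating process

# 25 noisy samples from a narrow window.
X = rng.uniform(0, 2, size=25)
y = true(X) + rng.normal(0, 0.5, size=25)

# "Right" architecture: a sinusoid with unknown amplitude/frequency/phase.
sinusoid = lambda x, a, w, p: a * np.sin(w * x + p)
(a, w, p), _ = curve_fit(sinusoid, X, y, p0=[1.0, 2.5, 0.0])

# "Wrong" architecture: a nearest-neighbour memorizer over the same samples.
knn = KNeighborsRegressor(n_neighbors=3).fit(X.reshape(-1, 1), y)

# Ask both about a region the data never covered.
X_out = np.linspace(3, 5, 200)
mse = lambda pred: np.mean((pred - true(X_out)) ** 2)
print("sinusoid:", mse(sinusoid(X_out, a, w, p)))            # should be small
print("kNN:     ", mse(knn.predict(X_out.reshape(-1, 1))))   # should be much larger
```

More compute doesn't change that second number; only more (or better-placed) data would.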
It's Shannon, not Turing, who will get you in the end.
text is not a valid measure of the world, so there is no "informative model", i.e., a model of the data-generating process, to fit it to. there is no sine curve; indeed, there is no function from world -> text -- there is an infinite family of functions, none of which is uniquely sampled by what happens to be written down.
transformers, certainly, aren't "informative" in this sense: they start with no prior model of how text would be distributed given the structure of the world.
these arguments all make the radical assumption that we are in something like a physics experiment -- rather than scraping glyphs from books and replaying their patterns.