It isn't some kind of Markov chain situation. Attention cross-links the abstract meanings of words, the subtle implications of context, and so on.
So, "mat" follows "the cat sat on the" where we understand the entire worldview of the dataset used for training; not just the next-word probability based on one or more previous words ... it's based on all previous meaning probability, and those meaning probablility and so on.
So, "mat" follows "the cat sat on the" where we understand the entire worldview of the dataset used for training; not just the next-word probability based on one or more previous words ... it's based on all previous meaning probability, and those meaning probablility and so on.