This was a class assignment in college. We had a lot of fun with it, and one of my classmates, the brilliant Alyosha Efros, decided to apply the exact same technique to images instead of text. It turned into a paper that revitalized texture synthesis in the Siggraph community (https://people.eecs.berkeley.edu/~efros/research/EfrosLeung....). The most interesting part about it (in my opinion) is that he ran it on images of text, and it produced images of readable-looking nonsense text! With the right window size, perhaps there's a nonzero possibility of producing the same text either way. This always struck me as very meta, and it makes me wonder if there are ways we could go in reverse with image processing (or other types of data); if there's a latent underlying representation for the information contained in image content that is much smaller and more efficient to work with. Neural networks might or might not be providing evidence.
As many know and point out, this idea is now very old (40+ years?). The big problem is that it creates "word salad" [0]. In the generative art and procgen community, the term "10,000 bowls of oatmeal" problem has been used [1].
Taking a step back, this is a perennial problem, even with AI old and new. I've heard that the early (good) chess bots, after Deep Blue, had a problem of only being locally context sensitive and being easily fooled by longer range attack planning. We even see it, to a certain extent, in LLMs where they forget or contradict themselves with what was just said. I have no doubt that LLMs will just keep getting better but, as a snapshot of right now, you can see shades of this "word salad" problem, just a few steps removed to become a "logic salad" problem.
Also: it works much better in English than in other languages I know. English grammar is simple, and many of the function words carry a strong prediction for the next tokens.
Idk. First, the obvious candidates: those with small, low-quality text corpora. Second, I expect LLMs to perform worse on long-distance dependencies. English has a pretty strict word order, and the constituents are close to each other. There's also a lot of input available, so there are more examples of long-range gaps to learn from. But (some) Germanic languages have some weird rules which allow constituents to move far apart, so it's likely LLMs will make more errors there. Also, free word order languages might be a problem, although I've always suspected that those languages aren't that free in reality: speakers probably have strong preferences.
If I had to guess, from my time working with ASR (Automatic Speech Recognition) across languages and its difficulty with the Indian (the country) accent, I'd guess Indian languages.
I put Alexis de Tocqueville through this and the results were legible but hard to read.
Then I put a demagogue politician speech through this and the results were almost like a speech that politician would have given. It was hard to tell the difference between the original and the generated.
In other words, if the public admires word salad, and someone speaks in word salad, then word-salad output would not be considered problematic by a large number of people.
I did a Markov reworking of some of Trump's speeches and got:
"Hillary brought death and disaster to Iraq, Syria and Libya, she empowered Iran, and she unleashed ISIS. Now she wants to raise your taxes very substantially. Highest taxed nation in the world is a tenant of mine in Manhattan, so many great people. These are people that have been stolen, stolen by either very stupid politicians ask me the question, how are you going to get rid of all the emails?” “Yes, ma’am, they’re gonna stay in this country blind. My contract with the American voter begins with a plan to end government that will not protect its people is a government corruption at the State Department of Justice is trying as hard as they can to protect religious liberty"
It definitely gives you some whiplash, but most of the time it's just righteous ranting about ... whatever. The whole site was a trove of comedy, featuring quizzes like "Porn Star or My Little Pony?". I couldn't remember its name until I started typing this reply: lileks.com.
The output is "word salad" because the probabilities of the model are uniform, albeit implicitly: eyeballing the Python, it selects the next n-gram uniformly at random. A better n-gram model would calculate the probability of an n-gram following another n-gram according to their frequency in the training corpus. With a larger corpus and better training techniques (some smoothing for one thing) you'd get much more coherent output.
And of course, if you trained a gigantic model on the entire web, then you'd get a very smooth approximation of the text in the corpus and very fluent, grammatical output. So basically, an LLM.
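Concretely, I mean something like this (a rough sketch with made-up names; a real model would add smoothing and start/end tokens):

    import random
    from collections import defaultdict, Counter

    def train(words):
        # Count how often each word follows a given pair of words.
        counts = defaultdict(Counter)
        for w1, w2, w3 in zip(words, words[1:], words[2:]):
            counts[(w1, w2)][w3] += 1
        return counts

    def generate(counts, seed, length=50):
        w1, w2 = seed
        out = [w1, w2]
        for _ in range(length):
            followers = counts.get((w1, w2))
            if not followers:
                break
            # Sample the next word in proportion to its observed frequency.
            w3 = random.choices(list(followers), weights=followers.values())[0]
            out.append(w3)
            w1, w2 = w2, w3
        return " ".join(out)

That's the explicit-counts version of "probability according to frequency in the training corpus".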
>> As many know and point out, this idea is now very old (40+ years?).
More like 110 years. Markov Chains were proposed in 1913 (by Markov) and popularised by Shannon in 1948 (in "A Mathematical Theory of Communication", the work that introduced information theory):
The underlying mathematics of the n-gram was first proposed by Markov (1913), who used what are now called Markov chains (bigrams and trigrams) to predict whether an upcoming letter in Pushkin's Eugene Onegin would be a vowel or a consonant. Markov classified 20,000 letters as V or C and computed the bigram and trigram probability that a given letter would be a vowel given the previous one or two letters. Shannon (1948) applied n-grams to compute approximations to English word sequences. Based on Shannon's work, Markov models were commonly used in engineering, linguistic, and psychological work on modeling word sequences by the 1950s. In a series of extremely influential papers starting with Chomsky (1956) and including Chomsky (1957) and Miller and Chomsky (1963), Noam Chomsky argued that "finite-state Markov processes", while a possibly useful engineering heuristic, were incapable of being a complete cognitive model of human grammatical knowledge. These arguments led many linguists and computational linguists to ignore work in statistical modeling for decades.
(Note the usual shaking of angry fists at Chomsky. The NLP community continued on its path anyway, and built smooth, fluent generators of total bullshit and absolutely no progress towards a "cognitive model of human grammatical knowledge").
>> The output is "word salad" because the probabilities of the model are uniform, albeit implicitly: eyeballing the Python, it selects the next n-gram uniformly at random.
From a quick look, it doesn't seem to sample uniformly. w3 is added to a list for the context w1, w2, not a set. So, say word A occurs twice as often in a particular context as B; it will then be in the list twice as often. So, even though a uniform choice function is used, the probability of A getting sampled is twice as high.
You get a word salad because a trigram model has too little context to do anything else. This is a well-known issue with Markov and hidden Markov models.
(Fun fact: some hidden Markov model taggers switch to a different trigram distribution for V2 languages after seeing the finite verb, otherwise they often fail catastrophically in the verb cluster due to the limited context.)
>> From a quick look, it doesn't seem to sample uniformly. w3 is added to a list for the context w1, w2, not a set. So, say word A occurs twice as often in a particular context as B; it will then be in the list twice as often. So, even though a uniform choice function is used, the probability of A getting sampled is twice as high.
Yeah, you're right. I had to squint a bit but it's like you say, the code is sampling uniformly from a list with possible multiples. Don't make me squint man! I'll get wrinkles :P
Squinting a bit more, that's not the way I know how to build n-grams. If you gave me the string (the cat sat on the bat) I'd give you bi-grams ($s the), (the cat), (cat sat), (sat on), (on the), (the bat), (bat $e). That way, after the first bigram, the next word only depends on the second word in the last bigram, because every bigram (w1 w2) is only ever followed by a bigram (w2 w3). So you're sliding a window of length 2 over the corpus, guided by the probability of the next word.
>> You get a word salad because a trigram model has too little context to do anything else. This is a well-known issue with Markov and hidden Markov models.
Yes, it's the Markov property that makes for word salad, ultimately, but you get less salad-y output if you can calculate better probabilities, and if you do it in the way I describe above. And you can always build a string by selecting the next bigram that maximises the probability of the entire string. That's how I've always done it. I guess that's not Markovian any more, but it gives you reasonable output, especially for small-ish corpora without huge variance.
>> (Fun fact: some hidden Markov model taggers switch to a different trigram distribution for V2 languages after seeing the finite verb, otherwise they often fail catastrophically in the verb cluster due to the limited context.)
> A better n-gram model would calculate the probability of an n-gram following another n-gram according to their frequency in the training corpus. With a larger corpus and better training techniques (some smoothing for one thing) you'd get much more coherent output.
I don't understand what you mean here, isn't that exactly what an n-gram model is already designed to achieve where n > 1? Since this is a 2-gram model isn't it already doing this?
That's right, but a bi-gram model is just a big table of bi-grams and their probabilities (which are really their frequencies in a corpus, normalised over the number of all bigrams in the corpus). You can use those probabilities in two ways. One is to calculate the probability of a string, by taking the bigrams in the string and multiplying their probabilities together. The other is to start with some bigram (e.g. a search term entered by a user) and then select the next bigram according to its probability (i.e. its frequency in the corpus), or according to the probability of the entire string so far (which you get by multiplying the probabilities of the bigrams so far), and repeat until you reach some maximum string length.
So you're right that my comment is a bit confusing: I refer to the way you sample from a bi-gram model, so as to generate a string. Sorry!
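To make that concrete, here's a toy sketch of the table I mean and the two ways of using it (made-up names, no smoothing; the last function is the "pick the next bigram that maximises the probability" style of generation mentioned above):

    from collections import Counter

    def bigram_table(words):
        # Frequency of each bigram, normalised over all bigrams in the corpus.
        # (A chain-rule model would use conditional P(w2 | w1) instead.)
        counts = Counter(zip(words, words[1:]))
        total = sum(counts.values())
        return {bigram: c / total for bigram, c in counts.items()}

    def string_probability(table, words):
        # Probability of a string: multiply the probabilities of its bigrams.
        p = 1.0
        for bigram in zip(words, words[1:]):
            p *= table.get(bigram, 0.0)
        return p

    def generate(table, start, length=20):
        # Greedily extend the string with the most probable next bigram.
        out = [start]
        for _ in range(length):
            candidates = [(p, w2) for (w1, w2), p in table.items() if w1 == out[-1]]
            if not candidates:
                break
            out.append(max(candidates)[1])
        return " ".join(out)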
> absolutely no progress towards a "cognitive model of human grammatical knowledge"
The thing is, it's only if you buy fully into Chomskyan thinking that you think a "cognitive model of human grammatical knowledge" might even be useful. Or that it's a particularly special thing compared to any other cognitive model of human knowledge.
>> Making the prefix shorter tends to produce less coherent prose; making it longer tends to reproduce the input text verbatim. For English text, using two words to select a third is a good compromise; it seems to recreate the flavor of the input while adding its own whimsical touch.
I was doing a similar experiment recently to generate random names that sound like names from a specific language.
I was breaking the list of names apart into 3- and 2-letter parts, marking which fragments come from the start, middle and end.
To generate the words I started from a random one from the start fragments, then continued with a random one from the middle fragments that starts with the letter that the previous one ended with, and similarly ended it with one from the end fragments. (A rough sketch of this approach follows the examples below.)
Some examples:
Spanish Names:
Armusa
Vantara
Modria
German Names:
Ven Marwar
Ger Naroff
Vort Kraldent
Görn Henter
Urg Wicher
Wan Ehranus
Eck Hayazin
Wert Biewin
Rein Relberid
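In code, the idea is roughly this (a loose sketch, not my original program; fragment sizes are fixed at 3 here for simplicity and all names are made up):

    import random

    def split_fragments(names, size=3):
        # Break each name into overlapping start / middle / end fragments.
        starts, middles, ends = [], [], []
        for name in (n.lower() for n in names):
            starts.append(name[:size])
            middles.append(name[size - 1:-(size - 1)])
            ends.append(name[-size:])
        return starts, middles, ends

    def join(prev, frag):
        # The fragment starts with the letter the previous piece ended with,
        # so merge the two on that shared letter.
        return prev + frag[1:]

    def make_name(starts, middles, ends):
        name = random.choice(starts)
        mids = [m for m in middles if m and m[0] == name[-1]]
        if mids:
            name = join(name, random.choice(mids))
        fins = [e for e in ends if e and e[0] == name[-1]]
        if fins:
            name = join(name, random.choice(fins))
        return name.capitalize()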
You can learn! I know it's easy to say something is easy to a beginner, but figuring out Markov chains is truly something you can get the basics of over a weekend if you've ever written any software at all.
For a hack day project someone took our IRC logs to do something similar for employees who had left (one bot per engineer, small company so a handful of bots). This made for a great afternoon until the bots started talking to each other and then started triggering the simple command the ops bot would accept. It quickly got shut down.
I remember a friend of mine setting up an IRC bot (named Zeta) like that for his sheet music forum many years ago. She was involved in a lot of hilarity - probably my favorite antics were when she randomly decided to court-martial someone. Good times indeed! :)
When I read about Markov chains and the hallucinations of LLMs, the first thing that comes to my mind is Reggie Watts' [0] performances. It also reminds me that I used to do the same thing when I was bored in the classroom, but in my paper notebooks: just nonsense that seems plausible. It was kind of like entering a meditative state; it felt good. When I discovered Watts I thought to myself, 'What?! I could have done that for a living?'
I did this once with Perl and a book called Eu by the Brazilian poet Augusto dos Anjos. It spat out verses and I joined a few to create this little thing:
Andam monstros sombrios pela escuridão dos remorsos (roughly: "Somber monsters wander through the darkness of remorse")
I created a little program in C++/Qt for using Markov chains to re-write text. But it works on character sequences, rather than words. There are examples and binaries for Windows and Mac:
This is exactly the approach I used back in the late 80s; I published a Markov chain text generator on comp.sources.games that digested Usenet posts, built tables and used the previous two words to generate the next word. It was mildly amusing at the time. Someone resurrected it and put on GitHub, which I'm fine with.
No, it was called markov3. Mark V. Shaney was similar but from Bell labs. See https://github.com/cheako/markov3 (not coded very well but I got better).
This is old tech - but it had me thinking. Markov chains pick the next token from a random set, giving approximately all possible tokens (from the training set) an equal probability. What if it weighted the probability - say, using the inverse vector distance or cosine similarity of the neighbors as a proxy for probability, where the vector embedding came from word2vec... how close would the performance be to a transformer model or even something like LSTM RNNs? (I suppose I'm cheating a bit using word2vec here. I might as well just say I'm using the attention mechanism...)
That sounds interesting, but still fails to capture long-range dependencies.
If I understand correctly, what you're proposing is to replace co-occurrence frequency with word2vec cosine similarity.
I suppose it may help improve overall performance, but you're still just blindly predicting the next word based on the previous one, like a first-order Markov chain would.
For example, it won't ever fit "2 plus 2 equals 4," because right when we get to "equals", we discard all the previous words.
Perhaps if we could get the embedding model to consider the full sentence and then produce a set of probability-scored next token predictions it may work, but now we've just reinvented a transformer.
> "I suppose it may help improve overall performance, you're still just blindly predicting the next word based on the previous one like a first order Markov chain would."
Instead of only taking in the last "token" as context to the function that generates the next token, take the last 15 tokens (i.e. the last 2-3 sentences) and predict based on that. And that's your "attention" mechanism.
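Something like this, maybe (a hand-wavy sketch: `embed` is assumed to be a pre-trained word2vec-style lookup from word to vector, `candidates` would come from an ordinary Markov table for the current state, and none of this is real attention):

    import numpy as np

    def pick_next(context, candidates, embed, rng, window=15):
        # Average the embeddings of the last `window` context tokens into a
        # crude context vector.
        vecs = [embed[w] for w in context[-window:] if w in embed]
        if not vecs:
            return rng.choice(candidates)
        ctx = np.mean(vecs, axis=0)
        ctx /= np.linalg.norm(ctx)
        # Weight each candidate next word by cosine similarity to the context.
        sims = np.array([np.dot(ctx, embed[w] / np.linalg.norm(embed[w]))
                         for w in candidates])
        weights = np.exp(sims)
        weights /= weights.sum()
        return rng.choice(candidates, p=weights)

    # rng = np.random.default_rng()

Whether that gets anywhere near transformer performance is the open question, of course.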
As a group project, we did Markov chains with some custom algorithmic adjustments, scraping Reddit via the API and an academic torrent (that data required a lot of cleaning and was orders of magnitude bigger, so it ran separately), to simulate posts and comments. In the same project we also implemented a Variable Length Markov Chain with a tree-based custom context length, although a team member did a matrix-based implementation of custom context length as well, through some matrix magic.
Genuinely curious to hear what's wrong with what I thought was a cool, fun time implementing Markov chains. We learned a lot about statistics and linear algebra, matrix diagonalisation and exponentiation; I find it very relevant and can answer questions if you don't understand something.
I remember as a schoolboy I heard about Markov chains, and played a bit with them. Instead of building big n-gram tables, I found a simpler way: start with one n-gram, then scan the source text until you find the next occurrence of it, and take the next letter after it. Output it, append it to your n-gram, dropping its first letter, and repeat until you have a sufficient amount of nonsense text.
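In code it's something like this (a reconstructed sketch, not the original program; the window size and output length are arbitrary, and it only wanders away from the source wherever the current n-gram occurs more than once):

    import random

    def nonsense(text, n=6, length=300):
        # Start from a random n-letter window in the source text.
        i = random.randrange(len(text) - n)
        gram = text[i:i + n]
        out = gram
        while len(out) < length:
            # Scan forward for the next occurrence of the current n-gram,
            # wrapping around to the start if we run off the end.
            j = text.find(gram, i + 2)
            if j == -1:
                j = text.find(gram)
            if j == -1 or j + n >= len(text):
                break
            # Emit the letter that follows it, then slide the window by one.
            ch = text[j + n]
            out += ch
            gram = gram[1:] + ch
            i = j
        return out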
I wonder how this would behave if trained on 1 trillion words like the LLMs.
Also, is training a Markov chain cheaper than training neural nets? It would be a great way to cut AI costs if they could be made as effective as neural nets.
There is also an interesting post by Orange duck. [0]
The only way to make it sound less like a word salad is to increase the n-gram length (e.g. use the last 10 words to predict the next), but at that point it starts repeating the corpus.
I guess technically you can train on a huge corpus like those of NNs to mitigate that, but you’ll end up with more refined word salads then (edit: refined as in “locally coherent but globally still a word salad”)
Yes, the problem with such long context lengths is sparsity - you still don't have enough data points to make representative many-gram distributions. This was exactly the motivation for Bengio and others to propose neural language models in 2003:
I recommend everyone interested in neural language models and/or wondering why we can't scale up Markov models by just using longer ngrams to read this paper. It's pretty accessible and explains how we got to where we are now in 2023.
Yeah, a century old approach using frequency tables where the next word is generated based just on the previous two words is highly competitive with contemporary neural nets. To produce coherent sentences, do you ever need to remember more than the last two words you said? I doubt it.
One of my first personal projects was making 2 discord bots that would have an infinite “conversation” using Markov chains. Feeding them text from Reddit produced some hilarious results.
Never forget: a mate implemented a Markov chainer in Lambda MOO and named it "Cute Ickle Baby". He left it in the Living Room, listening to what people said and randomly saying things. Sometimes it said the funniest shit and had us all pissing ourselves laughing.
This works pretty well for languages that use a space as word separator. Now do this for those languages that don’t and have no clear and easy rules for word separation.