This was a class assignment in college. We had a lot of fun with it, and one of my classmates, the brilliant Alyosha Efros, decided to apply the exact same technique to images instead of text. It turned into a paper that revitalized texture synthesis in the Siggraph community (https://people.eecs.berkeley.edu/~efros/research/EfrosLeung....). The most interesting part about it (in my opinion) is that he ran it on images of text, and it produced images of readable-looking nonsense text! With the right window size, perhaps there's a nonzero possibility of producing the same text either way. This always struck me as very meta, and it makes me wonder if there are ways we could go in reverse with image processing (or other types of data); if there's a latent underlying representation for the information contained in image content that is much smaller and more efficient to work with. Neural networks might or might not be providing evidence.
As many know and point out, this idea is now very old (40+ years?). The big problem is that it creates "word salad" [0]. In the generative art and procgen community, the term "10,000 bowls of oatmeal" problem has been used [1].
Taking a step back, this is a perennial problem, even with AI old and new. I've heard that the early (good) chess bots, after Deep Blue, had a problem of only being locally context sensitive and being easily fooled by longer range attack planning. We even see it, to a certain extent, in LLMs where they forget or contradict themselves with what was just said. I have no doubt that LLMs will just keep getting better but, as a snapshot of right now, you can see shades of this "word salad" problem, just a few steps removed to become a "logic salad" problem.
Also: it works much better in English than in other languages I know. English grammar is simple, and many of the function words carry a strong prediction for the next tokens.
Idk. First, the obvious candidates: those with small, low-quality text corpora. Second, I expect LLMs to perform worse on long-distance dependencies. English has a pretty strict word order, and the constituents are close to each other. There's also a lot of input available, so there are more examples of long-range gaps to learn from. But (some) Germanic languages have some weird rules which allow constituents to move far apart, so it's likely LLMs will make more errors there. Also, free word order languages might be a problem, although I've always suspected that those languages aren't that free in reality: speakers probably have strong preferences.
If I had to guess, from my time working with ASR (Automatic Speech Recognition) across languages and its difficulty with the Indian (the country) accent, I'd guess Indian languages.
I put Alexis de Tocqueville through this and the results were legible but hard to read.
Then I put a demagogue politician speech through this and the results were almost like a speech that politician would have given. It was hard to tell the difference between the original and the generated.
In other words, if the public admires word salad, and someone speaks in word salad, then word-salad output would not be considered problematic by a large number of people.
I did a Markov reworking of some of Trump's speeches and got:
"Hillary brought death and disaster to Iraq, Syria and Libya, she empowered Iran, and she unleashed ISIS. Now she wants to raise your taxes very substantially. Highest taxed nation in the world is a tenant of mine in Manhattan, so many great people. These are people that have been stolen, stolen by either very stupid politicians ask me the question, how are you going to get rid of all the emails?” “Yes, ma’am, they’re gonna stay in this country blind. My contract with the American voter begins with a plan to end government that will not protect its people is a government corruption at the State Department of Justice is trying as hard as they can to protect religious liberty"
It definitely gives you some whiplash, but most of the time it's just righteous ranting about ... whatever. The whole site was a trove of comedy, featuring quizzes like "Porn Star or My Little Pony?". I couldn't remember its name until I started typing this reply: lileks.com.
The output is "word salad" because the probabilities of the model are uniform, albeit implicitly: eyeballing the Python, it selects the next n-gram uniformly at random. A better n-gram model would calculate the probability of an n-gram following another n-gram according to their frequency in the training corpus. With a larger corpus and better training techniques (some smoothing for one thing) you'd get much more coherent output.
And of course, if you trained a gigantic model on the entire web, then you'd get a very smooth approximation of the text in the corpus and very fluent, grammatical output. So basically, an LLM.
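Concretely, I mean something like this (a rough sketch with made-up names; a real model would add smoothing and start/end tokens):

    import random
    from collections import defaultdict, Counter

    def train(words):
        # Count how often each word follows a given pair of words.
        counts = defaultdict(Counter)
        for w1, w2, w3 in zip(words, words[1:], words[2:]):
            counts[(w1, w2)][w3] += 1
        return counts

    def generate(counts, seed, length=50):
        w1, w2 = seed
        out = [w1, w2]
        for _ in range(length):
            followers = counts.get((w1, w2))
            if not followers:
                break
            # Sample the next word in proportion to its observed frequency.
            w3 = random.choices(list(followers), weights=followers.values())[0]
            out.append(w3)
            w1, w2 = w2, w3
        return " ".join(out)

That's the explicit-counts version of "probability according to frequency in the training corpus".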
>> As many know and point out, this idea is now very old (40+ years?).
More like 110 years. Markov Chains were proposed in 1913 (by Markov) and popularised by Shannon in 1948 (in "A Mathematical Theory of Communication", the work that introduced information theory):
The underlying mathematics of the n-gram was first proposed by Markov (1913), who used what are now called Markov chains (bigrams and trigrams) to predict whether an upcoming letter in Pushkin's Eugene Onegin would be a vowel or a consonant. Markov classified 20,000 letters as V or C and computed the bigram and trigram probability that a given letter would be a vowel given the previous one or two letters. Shannon (1948) applied n-grams to compute approximations to English word sequences. Based on Shannon's work, Markov models were commonly used in engineering, linguistic, and psychological work on modeling word sequences by the 1950s. In a series of extremely influential papers starting with Chomsky (1956) and including Chomsky (1957) and Miller and Chomsky (1963), Noam Chomsky argued that "finite-state Markov processes", while a possibly useful engineering heuristic, were incapable of being a complete cognitive model of human grammatical knowledge. These arguments led many linguists and computational linguists to ignore work in statistical modeling for decades.
(Note the usual shaking of angry fists at Chomsky. The NLP community continued on its path anyway, and built smooth, fluent generators of total bullshit and absolutely no progress towards a "cognitive model of human grammatical knowledge").
>> The output is "word salad" because the probabilities of the model are uniform, albeit implicitly: eyeballing the Python, it selects the next n-gram uniformly at random.
From a quick look, it doesn't seem to sample uniformly. w3 is added to a list for the context w1, w2, not a set. So, say word A occurs twice as often in a particular context as B; it will then be in the list twice as often. So, even though a uniform choice function is used, the probability of A getting sampled is twice as high.
You get a word salad because a trigram model has too little context to do anything else. This is a well-known issue with Markov and hidden Markov models.
(Fun fact: some hidden Markov model taggers switch to a different trigram distribution for V2 languages after seeing the finite verb, otherwise they often fail catastrophically in the verb cluster due to the limited context.)
>> From a quick look, it doesn't seem to sample uniformly. w3 is added to a list for the context w1, w2, not a set. So, say word A occurs twice as often in a particular context as B; it will then be in the list twice as often. So, even though a uniform choice function is used, the probability of A getting sampled is twice as high.
Yeah, you're right. I had to squint a bit but it's like you say, the code is sampling uniformly from a list with possible multiples. Don't make me squint man! I'll get wrinkles :P
Squinting a bit more, that's not the way I know how to build n-grams. If you gave me the string (the cat sat on the bat) I'd give you bi-grams ($s the), (the cat), (cat sat), (sat on), (on the), (the bat), (bat $e). That way, after the first bigram, the next word only depends on the second word in the last bigram, because every bigram (w1 w2) is only ever followed by a bigram (w2 w3). So you're sliding a window of length 2 over the corpus, guided by the probability of the next word.
>> You get a word salad because a trigram model has too little context to do anything else. This is a well-known issue with Markov and hidden Markov models.
Yes, it's the Markov property that makes for word salad, ultimately, but you get less salad-y output if you can calculate better probabilities, and if you do it in the way I describe above. And you can always build a string by selecting the next bigram that maximises the probability of the entire string. That's how I've always done it. I guess that's not Markovian any more, but it gives you reasonable output, especially for small-ish corpora without huge variance.
>> (Fun fact: some hidden Markov model taggers switch to a different trigram distribution for V2 languages after seeing the finite verb, otherwise they often fail catastrophically in the verb cluster due to the limited context.)
> A better n-gram model would calculate the probability of an n-gram following another n-gram according to their frequency in the training corpus. With a larger corpus and better training techniques (some smoothing for one thing) you'd get much more coherent output.
I don't understand what you mean here, isn't that exactly what an n-gram model is already designed to achieve where n > 1? Since this is a 2-gram model isn't it already doing this?
That's right, but a bi-gram model is just a big table of bi-grams and their probabilities (which are really their frequencies in a corpus, normalised over the number of all bigrams in the corpus). You can use those probabilities in two ways. One is to calculate the probability of a string, by taking the bigrams in the string and multiplying their probabilities together. The other is to start with some bigram (e.g. a search term entered by a user) and then select the next bigram according to its probability (i.e. its frequency in the corpus), or according to the probability of the entire string so far (which you get by multiplying the probabilities of the bigrams so far), and repeat until you reach some maximum string length.
So you're right that my comment is a bit confusing: I refer to the way you sample from a bi-gram model, so as to generate a string. Sorry!
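To make that concrete, here's a toy sketch of the table I mean and the two ways of using it (made-up names, no smoothing; the last function is the "pick the next bigram that maximises the probability" style of generation mentioned above):

    from collections import Counter

    def bigram_table(words):
        # Frequency of each bigram, normalised over all bigrams in the corpus.
        # (A chain-rule model would use conditional P(w2 | w1) instead.)
        counts = Counter(zip(words, words[1:]))
        total = sum(counts.values())
        return {bigram: c / total for bigram, c in counts.items()}

    def string_probability(table, words):
        # Probability of a string: multiply the probabilities of its bigrams.
        p = 1.0
        for bigram in zip(words, words[1:]):
            p *= table.get(bigram, 0.0)
        return p

    def generate(table, start, length=20):
        # Greedily extend the string with the most probable next bigram.
        out = [start]
        for _ in range(length):
            candidates = [(p, w2) for (w1, w2), p in table.items() if w1 == out[-1]]
            if not candidates:
                break
            out.append(max(candidates)[1])
        return " ".join(out)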
> absolutely no progress towards a "cognitive model of human grammatical knowledge"
The thing is, it's only if you buy fully into Chomskyan thinking that you think a "cognitive model of human grammatical knowledge" might even be useful. Or that it's a particularly special thing compared to any other cognitive model of human knowledge.
>> Making the prefix shorter tends to produce less coherent prose; making it longer tends to reproduce the input text verbatim. For English text, using two words to select a third is a good compromise; it seems to recreate the flavor of the input while adding its own whimsical touch.
I was doing a similar experiment recently to generate random names that sound like names from a specific language.
I was breaking the list of names apart into 3- and 2-letter parts, marking which fragments come from the start, middle and end.
To generate the words I started from a random one from the start fragments, then continued with a random one from the middle fragments that starts with the letter that the previous one ended with, and similarly ended it with one from the end fragments. (A rough sketch of this approach follows the examples below.)
Some examples:
Spanish Names:
Armusa
Vantara
Modria
German Names:
Ven Marwar
Ger Naroff
Vort Kraldent
Görn Henter
Urg Wicher
Wan Ehranus
Eck Hayazin
Wert Biewin
Rein Relberid
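In code, the idea is roughly this (a loose sketch, not my original program; fragment sizes are fixed at 3 here for simplicity and all names are made up):

    import random

    def split_fragments(names, size=3):
        # Break each name into overlapping start / middle / end fragments.
        starts, middles, ends = [], [], []
        for name in (n.lower() for n in names):
            starts.append(name[:size])
            middles.append(name[size - 1:-(size - 1)])
            ends.append(name[-size:])
        return starts, middles, ends

    def join(prev, frag):
        # The fragment starts with the letter the previous piece ended with,
        # so merge the two on that shared letter.
        return prev + frag[1:]

    def make_name(starts, middles, ends):
        name = random.choice(starts)
        mids = [m for m in middles if m and m[0] == name[-1]]
        if mids:
            name = join(name, random.choice(mids))
        fins = [e for e in ends if e and e[0] == name[-1]]
        if fins:
            name = join(name, random.choice(fins))
        return name.capitalize()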
You can learn! I know it's easy to say something is easy to a beginner, but figuring out Markov chains is truly something you can get the basics of over a weekend if you've ever written any software at all.
For a hack day project someone took our IRC logs to do something similar for employees who had left (one bot per engineer, small company so a handful of bots). This made for a great afternoon until the bots started talking to each other and then started triggering the simple command the ops bot would accept. It quickly got shut down.
I remember a friend of mine setting up an IRC bot (named Zeta) like that for his sheet music forum many years ago. She was involved in a lot of hilarity - probably my favorite antics were when she randomly decided to court-martial someone. Good times indeed! :)
When I read about Markov chains and the hallucinations of LLMs, the first thing that comes to my mind is Reggie Watts' [0] performances. It also reminds me that I used to do the same thing when I was bored in the classroom, but in my paper notebooks: just nonsense that seems plausible. It was kind of like entering a meditative state; it felt good. When I discovered Watts I thought to myself, 'What?! I could have done that for a living?'
I did this once with Perl and a book called Eu by the Brazilian poet Augusto dos Anjos. It spat out verses and I joined a few to create this little thing:
Andam monstros sombrios pela escuridão dos remorsos (roughly: "Somber monsters wander through the darkness of remorse")
I created a little program in C++/Qt for using Markov chains to re-write text. But it works on character sequences, rather than words. There are examples and binaries for Windows and Mac:
This is exactly the approach I used back in the late 80s; I published a Markov chain text generator on comp.sources.games that digested Usenet posts, built tables and used the previous two words to generate the next word. It was mildly amusing at the time. Someone resurrected it and put on GitHub, which I'm fine with.
No, it was called markov3. Mark V. Shaney was similar but from Bell labs. See https://github.com/cheako/markov3 (not coded very well but I got better).
This is old tech - but it had me thinking. Markov chains pick the next token from a random set, giving approximately all possible tokens (from the training set) an equal probability. What if it weighted the probability - say, using the inverse vector distance or cosine similarity of the neighbors as a proxy for probability, where the vector embedding came from word2vec... how close would the performance be to a transformer model or even something like LSTM RNNs? (I suppose I'm cheating a bit using word2vec here. I might as well just say I'm using the attention mechanism...)
That sounds interesting, but still fails to capture long-range dependencies.
If I understand correctly, what you're proposing is to replace co-occurrence frequency with word2vec cosine similarity.
I suppose it may help improve overall performance, but you're still just blindly predicting the next word based on the previous one, like a first-order Markov chain would.
For example, it won't ever fit "2 plus 2 equals 4," because right when we get to "equals", we discard all the previous words.
Perhaps if we could get the embedding model to consider the full sentence and then produce a set of probability-scored next token predictions it may work, but now we've just reinvented a transformer.
> "I suppose it may help improve overall performance, you're still just blindly predicting the next word based on the previous one like a first order Markov chain would."
Instead of only taking in the last "token" as context to the function that generates the next token, take the last 15 tokens (i.e. the last 2-3 sentences) and predict based on that. And that's your "attention" mechanism.
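Something like this, maybe (a hand-wavy sketch: `embed` is assumed to be a pre-trained word2vec-style lookup from word to vector, `candidates` would come from an ordinary Markov table for the current state, and none of this is real attention):

    import numpy as np

    def pick_next(context, candidates, embed, rng, window=15):
        # Average the embeddings of the last `window` context tokens into a
        # crude context vector.
        vecs = [embed[w] for w in context[-window:] if w in embed]
        if not vecs:
            return rng.choice(candidates)
        ctx = np.mean(vecs, axis=0)
        ctx /= np.linalg.norm(ctx)
        # Weight each candidate next word by cosine similarity to the context.
        sims = np.array([np.dot(ctx, embed[w] / np.linalg.norm(embed[w]))
                         for w in candidates])
        weights = np.exp(sims)
        weights /= weights.sum()
        return rng.choice(candidates, p=weights)

    # rng = np.random.default_rng()

Whether that gets anywhere near transformer performance is the open question, of course.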
As a group project, we did Markov chains with some custom algorithmic adjustments, scraping Reddit via the API and an academic torrent (that data required a lot of cleaning and was orders of magnitude bigger, so it ran separately), to simulate posts and comments. In the same project we also implemented a Variable Length Markov Chain with a tree-based custom context length, although a team member did a matrix-based implementation of custom context length as well, through some matrix magic.
Genuinely curious to hear what's wrong with what I thought was a cool, fun time implementing Markov chains. We learned a lot about statistics and linear algebra, matrix diagonalisation and exponentiation; I find it very relevant and can answer questions if you don't understand something.
I remember as a schoolboy I heard about Markov chains, and played a bit with them. Instead of building big n-gram tables, I found a simpler way: start with one n-gram, then scan the source text until you find the next occurrence of it, and take the next letter after it. Output it, append it to your n-gram, dropping its first letter, and repeat until you have a sufficient amount of nonsense text.
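In code it's something like this (a reconstructed sketch, not the original program; the window size and output length are arbitrary, and it only wanders away from the source wherever the current n-gram occurs more than once):

    import random

    def nonsense(text, n=6, length=300):
        # Start from a random n-letter window in the source text.
        i = random.randrange(len(text) - n)
        gram = text[i:i + n]
        out = gram
        while len(out) < length:
            # Scan forward for the next occurrence of the current n-gram,
            # wrapping around to the start if we run off the end.
            j = text.find(gram, i + 2)
            if j == -1:
                j = text.find(gram)
            if j == -1 or j + n >= len(text):
                break
            # Emit the letter that follows it, then slide the window by one.
            ch = text[j + n]
            out += ch
            gram = gram[1:] + ch
            i = j
        return out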
I wonder how this would behave if trained on 1 trillion words like the LLMs.
Also, is training a Markov chain cheaper than training neural nets? It would be a great way to cut AI costs if they could be made as effective as neural nets.
There is also an interesting post by Orange duck. [0]
The only way to make it sound less like a word salad is to increase the n-gram length (e.g. use the last 10 words to predict the next), but at that point it starts repeating the corpus.
I guess technically you can train on a huge corpus like those of NNs to mitigate that, but you’ll end up with more refined word salads then (edit: refined as in “locally coherent but globally still a word salad”)
Yes, the problem with such long context lengths is sparsity - you still don't have enough data points to make representative many-gram distributions. This was exactly the motivation for Bengio and others to propose neural language models in 2003:
I recommend everyone interested in neural language models and/or wondering why we can't scale up Markov models by just using longer ngrams to read this paper. It's pretty accessible and explains how we got to where we are now in 2023.
Yeah, a century old approach using frequency tables where the next word is generated based just on the previous two words is highly competitive with contemporary neural nets. To produce coherent sentences, do you ever need to remember more than the last two words you said? I doubt it.
One of my first personal projects was making 2 discord bots that would have an infinite “conversation” using Markov chains. Feeding them text from Reddit produced some hilarious results.
Never forget: a mate implemented a Markov chainer in Lambda MOO and named it "Cute Ickle Baby". He left it in the Living Room, listening to what people said and randomly saying things. Sometimes it said the funniest shit and had us all pissing ourselves laughing.
This works pretty well for languages that use a space as word separator. Now do this for those languages that don’t and have no clear and easy rules for word separation.