The output is "word salad" because the model's probabilities are uniform, albeit implicitly: eyeballing the Python, it selects the next n-gram uniformly at random.
From a quick look, it doesn't seem to sample uniformly. w3 is added to a list for the context w1, w2, not a set. So, say word A occurs twice as often in a particular context as B, it will appear in the list twice as often. So, even though a uniform choice function is used, the probability of A getting sampled is twice as high.
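A minimal sketch of the behaviour described above (the variable and corpus names are made up, not taken from the code under discussion): each continuation w3 is appended to a list keyed by the context (w1, w2), so duplicates act as implicit counts and a uniform choice over the list becomes frequency-weighted.

```python
import random
from collections import defaultdict

# Collect trigram continuations as a list (not a set), so repeats survive.
trigrams = defaultdict(list)
corpus = ["the", "cat", "sat", "on", "the", "cat", "sat", "by", "the", "mat"]
for w1, w2, w3 in zip(corpus, corpus[1:], corpus[2:]):
    trigrams[(w1, w2)].append(w3)

# "sat" follows the context ("the", "cat") twice in the corpus, so it
# appears twice in the list and random.choice picks it proportionally
# more often than it would from a deduplicated set.
print(trigrams[("the", "cat")])  # ['sat', 'sat']
next_word = random.choice(trigrams[("the", "cat")])
```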
You get a word salad because a trigram model has too little context to do anything else. This is a well-known issue with Markov and hidden Markov models.
(Fun fact: some hidden Markov model taggers switch to a different trigram distribution for V2 languages after seeing the finite verb, otherwise they often fail catastrophically in the verb cluster due to the limited context.)
>> From a quick look, it doesn't seem to sample uniformly. w3 is added to a list for the context w1, w2, not a set. So, say word A occurs twice as often in a particular context as B, it will appear in the list twice as often. So, even though a uniform choice function is used, the probability of A getting sampled is twice as high.
Yeah, you're right. I had to squint a bit but it's like you say, the code is sampling uniformly from a list with possible multiples. Don't make me squint man! I'll get wrinkles :P
Squinting a bit more, that's not the way I know how to build n-grams. If you gave me the string (the cat sat on the bat) I'd give you the bigrams ($s the), (the cat), (cat sat), (sat on), (on the), (the bat), (bat $e). That way, after the first bigram, the next word only depends on the second word in the last bigram, because every bigram (w1 w2) is only ever followed by a bigram (w2 w3). So you're sliding a window of length 2 over the corpus, guided by the probability of the next word.
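The sliding-window construction above can be sketched like this (the `$s`/`$e` boundary markers follow the example in the comment; the function name is just for illustration):

```python
def bigrams(tokens):
    # Pad with start/end markers so the model sees sentence boundaries,
    # then slide a window of length 2 over the padded sequence.
    padded = ["$s"] + tokens + ["$e"]
    return list(zip(padded, padded[1:]))

print(bigrams("the cat sat on the bat".split()))
# [('$s', 'the'), ('the', 'cat'), ('cat', 'sat'), ('sat', 'on'),
#  ('on', 'the'), ('the', 'bat'), ('bat', '$e')]
```

Note that each bigram's second element is the next bigram's first element, which is exactly the overlap property the comment describes.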
>> You get a word salad because a trigram model has too little context to do anything else. This is a well-known issue with Markov and hidden Markov models.
Yes, it's the Markov property that makes for word salad, ultimately, but you get less salad-y output if you can calculate better probabilities, and if you do it in the way I say above. And you can always build a string by selecting the next bigram that maximises the probability of the entire string. That's how I've always done it. I guess that's not Markovian any more but it gives you reasonable output, especially for small-ish corpora without huge variance.
>> (Fun fact: some hidden Markov model taggers switch to a different trigram distribution for V2 languages after seeing the finite verb, otherwise they often fail catastrophically in the verb cluster due to the limited context.)