
From the example: "--temp 0 turns off the random number generator (we don't want improvisation for a spam filter)"

I've been thinking for a while about how many applications of LLMs need this adjustment and aren't getting it



I couldn't disagree more: turning temp to zero is like taking a Monte Carlo method and only using one sample, or a particle filter with only one particle. It takes the entire concept and throws it out of the window so you can have predictability.

LLMs need to explore the generation domain probabilistically to converge on a good result. There's a similar issue with people benchmarking models by having them output only a single token (e.g. yes or no) outright, which prevents any real computation from occurring, so the results are predictably poor.


Is that what it does, though?

I thought setting temperature to 0 would (extremely simple example) equate to a spam filter seeing:

- this is a spam email

But if the sender adapts and says

- th1s is a spam email

it wouldn't be flagged as spam.


The output of an autoregressive model is a probability for each token to appear next after the input sequence. Computing these is strictly deterministic from the prior context and the model's weights.

Based on that probability distribution, a variety of text generation strategies are possible. The simplest (greedy decoding) is picking the token with the highest probability. To allow creativity, a random number generator is used to choose among the possible outputs, biased by the probabilities of course.

Temperature rescales the logits before they are turned into probabilities. As temperature increases, the distribution flattens toward 1/vocabulary size for every token and the output becomes essentially random; as temperature approaches zero, text generation approaches greedy decoding.
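To make that concrete, here's a minimal sketch in plain NumPy (not llama.cpp's or any provider's actual sampler; the function name and toy logits are made up) of greedy decoding versus temperature-scaled sampling:

    import numpy as np

    def sample_next_token(logits, temperature=1.0, rng=np.random.default_rng()):
        # temperature -> 0  : approaches greedy decoding (argmax)
        # temperature -> inf: approaches a uniform draw over the vocabulary
        if temperature <= 0:
            # Greedy decoding: always pick the single most probable token.
            return int(np.argmax(logits))
        scaled = logits / temperature      # rescale logits
        scaled -= scaled.max()             # numerical stability
        probs = np.exp(scaled)
        probs /= probs.sum()               # softmax into a distribution
        return int(rng.choice(len(probs), p=probs))  # biased random draw

    logits = np.array([2.0, 1.0, 0.5, -1.0])         # toy 4-token vocabulary
    print(sample_next_token(logits, temperature=0))  # always token 0
    print(sample_next_token(logits, temperature=1))  # usually 0, sometimes not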

If all you want is a spam filter, you're better off replacing the output layer of an LLM with one that has just two outputs, and fine-tuning that on a public collection of spam mails plus some "ham" from your own inbox.
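For what that could look like in practice, here's a hedged sketch using the Hugging Face transformers and datasets libraries; the base model and the two-example toy dataset are placeholders, not anything from the article:

    from datasets import Dataset
    from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                              Trainer, TrainingArguments)

    base = "distilbert-base-uncased"   # any small encoder works here
    tokenizer = AutoTokenizer.from_pretrained(base)
    model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=2)

    # Replace this toy data with real spam (label 1) and ham (label 0).
    data = Dataset.from_dict({
        "text": ["WIN A FREE PRIZE NOW", "Meeting notes attached, see you Monday"],
        "label": [1, 0],
    })
    data = data.map(lambda b: tokenizer(b["text"], truncation=True,
                                        padding="max_length"), batched=True)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="spam-clf", num_train_epochs=1),
        train_dataset=data,
    )
    trainer.train()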


My understanding is that temperature applies to the output side and allows for some randomness in the next predicted token. Here Justine has constrained the machine to start with either "yes" or "no" and to predict only one token. This makes the issue stark: leaving a non-zero temperature here would just add a chance of flipping a boolean.
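A toy calculation (the logits are made up, purely to illustrate the point) of how the chance of flipping that boolean shrinks as temperature goes to 0:

    import numpy as np

    # Suppose the model's logits strongly favor "yes" over "no" for an email.
    logits = np.array([4.0, 1.0])   # [yes, no], made-up numbers

    def p_flip(temperature):
        z = logits / temperature
        z -= z.max()
        p = np.exp(z)
        p /= p.sum()
        return p[1]                 # probability the sampler emits "no" anyway

    for t in (0.2, 0.7, 1.0, 2.0):
        print(f"temp={t}: chance of flipping to 'no' = {p_flip(t):.4f}")
    # As temperature -> 0 this chance goes to 0, which is why --temp 0 fits here.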


It's more nuanced than that in practice: this is true for the shims you see from API providers (e.g. OpenAI, Anthropic, Mistral).

With llama.cpp, it's actually not a great idea to have temperature at exactly 0: in practice, especially with smaller models, this leads to pure repetition or nonsense.

I can't remember where I picked this up, but a few years back, without _some_ randomness, the most likely next token was always just the last token again.



