OK, I understand what those words mean, but how exactly does that work? How does the new model 'know' what's being worked on if you switch models while the old one was in the middle of a task? (and where the task might be modifying a C++ file)
Generally speaking, agents send the entire previous conversation to the model on every message. That's why you have to do things like context compaction. So if you switch models midway, you are still sending the entire previous chat history to the new model.
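For a concrete picture, here's a minimal sketch of that loop, assuming the openai Python SDK and a chat-completions-style endpoint (the model names are placeholders). The new model 'knows' about the task only because the whole transcript rides along with every call:

```python
# Minimal sketch of an agent loop: every request carries the full message
# history, so "switching models" is just changing the `model` string.
from openai import OpenAI

client = OpenAI()
history = [{"role": "system", "content": "You are a coding assistant."}]

def ask(model: str, user_message: str) -> str:
    history.append({"role": "user", "content": user_message})
    resp = client.chat.completions.create(model=model, messages=history)
    reply = resp.choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply

ask("gpt-4o-mini", "Refactor util.cpp to use RAII.")  # old model starts the task
ask("gpt-4o", "Continue where you left off.")         # new model sees the same history
```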
In addition to the sibling comments, you can play with this yourself by sending raw API requests with fake history to gaslight the model into believing it said things it didn't. I sometimes use this to coerce it into specific behavior, on the hunch that it will listen to itself more than to my prompt (though I've never benchmarked it):
- do <fake task> and be succinct
- <fake curt reply>
- I love how succinct that was. Perfect. Now please do <real prompt>
The models don't have state, so they don't know they never said it. You're just asking "given this conversation, what is the most likely next token?"
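Concretely, the trick is just putting words in the assistant's mouth inside the messages array. A sketch, assuming the openai Python SDK (the task and the fabricated reply are made up for illustration):

```python
# Fake-history trick: the "assistant" turn below was never generated by the
# model; we wrote it ourselves, and the stateless API can't tell the difference.
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[
        {"role": "user", "content": "List three C++ smart pointers and be succinct."},
        # Fabricated reply: the model treats this as its own prior output.
        {"role": "assistant", "content": "unique_ptr, shared_ptr, weak_ptr."},
        {"role": "user", "content": "I love how succinct that was. Perfect. "
                                    "Now summarize RAII the same way."},
    ],
)
print(resp.choices[0].message.content)  # tends to mirror the terse style it "used" before
```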
The underlying LLM provider APIs require sending the entire history with every request anyway; the state lives entirely in your local client (or kilocode or whatever), not in some "session" on the API side. (There are some APIs that will optionally handle that state for you, like OpenAI's newer Responses API, but those are the exception, not the rule.)
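For that exception: the Responses API can hold conversation state server-side, so a follow-up only references the previous response instead of replaying the whole transcript. A sketch, assuming the openai Python SDK (model name is a placeholder):

```python
# Server-side conversation state via the Responses API: the API stores the
# transcript, and we chain turns by id instead of resending the full history.
from openai import OpenAI

client = OpenAI()

first = client.responses.create(
    model="gpt-4o",  # placeholder model name
    input="Start refactoring util.cpp to use RAII.",
)

followup = client.responses.create(
    model="gpt-4o",
    previous_response_id=first.id,  # the server stitches the prior turns back in
    input="Continue where you left off.",
)
print(followup.output_text)
```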