OK, I understand what those words mean, but how exactly does that work? How does the new model 'know' what's being worked on if you switch models while the old one was in the middle of a task? (and where the task might be modifying a C++ file)
Generally speaking, agents send the entire previous conversation to the model on every message. That's why you have to do things like context compaction. So if you switch models midway, you are still sending the entire previous chat history to the new model.
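For a concrete picture, here's a minimal sketch of that loop, assuming the openai Python SDK and a chat-completions-style endpoint (the model names are placeholders). The new model 'knows' about the task only because the whole transcript rides along with every call:

```python
# Minimal sketch of an agent loop: every request carries the full message
# history, so "switching models" is just changing the `model` string.
from openai import OpenAI

client = OpenAI()
history = [{"role": "system", "content": "You are a coding assistant."}]

def ask(model: str, user_message: str) -> str:
    history.append({"role": "user", "content": user_message})
    resp = client.chat.completions.create(model=model, messages=history)
    reply = resp.choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply

ask("gpt-4o-mini", "Refactor util.cpp to use RAII.")  # old model starts the task
ask("gpt-4o", "Continue where you left off.")         # new model sees the same history
```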
In addition to the sibling comments, you can play with this yourself by sending raw API requests with fake history to gaslight the model into believing it said things it didn't. I sometimes use this to coerce it into specific behavior, on the hunch that it will listen to itself more than to my prompt (though I've never benchmarked it):
- do <fake task> and be succinct
- <fake curt reply>
- I love how succinct that was. Perfect. Now please do <real prompt>
The models don't have state, so they don't know they never said it. You're just asking "given this conversation, what is the most likely next token?"
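Concretely, the trick is just putting words in the assistant's mouth inside the messages array. A sketch, assuming the openai Python SDK (the task and the fabricated reply are made up for illustration):

```python
# Fake-history trick: the "assistant" turn below was never generated by the
# model; we wrote it ourselves, and the stateless API can't tell the difference.
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[
        {"role": "user", "content": "List three C++ smart pointers and be succinct."},
        # Fabricated reply: the model treats this as its own prior output.
        {"role": "assistant", "content": "unique_ptr, shared_ptr, weak_ptr."},
        {"role": "user", "content": "I love how succinct that was. Perfect. "
                                    "Now summarize RAII the same way."},
    ],
)
print(resp.choices[0].message.content)  # tends to mirror the terse style it "used" before
```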
The underlying LLM provider APIs require sending the entire history with every request anyway; the state lives entirely in your local client (or kilocode or whatever), not in some "session" on the API side. (There are some APIs that will optionally handle that state for you, like OpenAI's newer Responses API, but those are the exception, not the rule.)
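For that exception: the Responses API can hold conversation state server-side, so a follow-up only references the previous response instead of replaying the whole transcript. A sketch, assuming the openai Python SDK (model name is a placeholder):

```python
# Server-side conversation state via the Responses API: the API stores the
# transcript, and we chain turns by id instead of resending the full history.
from openai import OpenAI

client = OpenAI()

first = client.responses.create(
    model="gpt-4o",  # placeholder model name
    input="Start refactoring util.cpp to use RAII.",
)

followup = client.responses.create(
    model="gpt-4o",
    previous_response_id=first.id,  # the server stitches the prior turns back in
    input="Continue where you left off.",
)
print(followup.output_text)
```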