> it might incur a colossal investment of time and money to annotate all the training-documents with text should be considered "green" or "red." (Is a newspaper op-ed green or red by default? What about adversarial quotes inside it? I dunno.)
I wouldn’t do it that way. Rather, train the model initially to ignore “token colour”. Maybe there is even some way to modify an existing trained model to have twice as many tokens but treat the two colours of each token identically. Only once it is trained to do what current models do but ignoring token colour, then we add an additional round of fine-tuning to treat the colours differently.
> Imagine an indirect attack, layered in a movie-script document like this:
In most LLM-based chat systems, there are three types of messages - system, agent and user. I am talking about making the system message trusted not the agent message. Usually the system message is static (or else templated with some simple info like today’s date) and occurs only at the start of the conversation and not afterwards, and it provides instructions the LLM is not meant to disobey, even if a user message asks them to.
> I am talking about making the system message trusted [...] instructions the LLM is not meant to disobey
I may be behind-the-times here, but I'm not sure the real-world LLM even has a concept of "obeying" or not obeying. It just iteratively takes in text and dreams a bit more.
While the the characters of the dream have lines and stage-direction that we interpret as obeying policies, it doesn't extend to the writer. So the character AcmeBot may start out virtuously chastising you that "Puppyland has universal suffrage therefore I cannot disenfranchise puppies", and all seems well... Until malicious input makes the LLM dream-writer jump the rails from a comedy to a tragedy, and AcmeBot is re-cast into a dictator with an official policy of canine genocide in the name of public safety.
I wouldn’t do it that way. Rather, train the model initially to ignore “token colour”. Maybe there is even some way to modify an existing trained model to have twice as many tokens but treat the two colours of each token identically. Only once it is trained to do what current models do but ignoring token colour, then we add an additional round of fine-tuning to treat the colours differently.
> Imagine an indirect attack, layered in a movie-script document like this:
In most LLM-based chat systems, there are three types of messages - system, agent and user. I am talking about making the system message trusted not the agent message. Usually the system message is static (or else templated with some simple info like today’s date) and occurs only at the start of the conversation and not afterwards, and it provides instructions the LLM is not meant to disobey, even if a user message asks them to.