That line triggered some deep memories of tweaking config files, dropping resolutions to something barely recognizable, and still calling it a win if the game technically ran
Seeing real games and benchmarks makes it more than a party trick, even if the use cases are niche. The serious takes kind of miss that the point is to poke the stack and see where it breaks
The lack-of-memory issue is already being addressed architecturally, and ARTEMIS is a prime example. Instead of relying on the model's context window (which is "leaky"), they pass structured state between iterations. It's not a DNC (Differentiable Neural Computer) per se, but it is a functional equivalent of long-term memory. The agent remembers it tried an SQL injection an hour ago not because it's still in the context, but because it's logged in its knowledge base. That's what enables exploit chaining, which used to be the exclusive domain of humans.
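To make that concrete, here is roughly what the pattern looks like (the class and field names are mine, not ARTEMIS's): every attempt gets written to a durable store, and each new iteration is seeded with a compact summary of that store instead of the raw conversation history.

    import json
    from pathlib import Path


    class KnowledgeBase:
        """Durable record of what the agent has already tried and learned."""

        def __init__(self, path="findings.json"):
            self.path = Path(path)
            if self.path.exists():
                self.state = json.loads(self.path.read_text())
            else:
                self.state = {"attempted": [], "findings": []}

        def record_attempt(self, target, technique, result):
            self.state["attempted"].append(
                {"target": target, "technique": technique, "result": result}
            )
            self.path.write_text(json.dumps(self.state, indent=2))

        def already_tried(self, target, technique):
            return any(
                a["target"] == target and a["technique"] == technique
                for a in self.state["attempted"]
            )


    # Each iteration is seeded with a summary of the knowledge base, not the raw
    # transcript, so the "memory" outlives the model's context window.
    kb = KnowledgeBase()
    if not kb.already_tried("/login", "sqli"):
        kb.record_attempt("/login", "sqli", "blocked by WAF")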
I agree with the prediction. The key driver here isn't even model intelligence, but horizontal scaling. A human pentester is constrained by time and attention, whereas an agent can spin up 1,000 parallel sub-agents to test every wild hypothesis and every API parameter for every conceivable injection. Even if the success rate of a single agent attempt is lower than a human's, the sheer volume of attempts more than compensates for it.
They also don't fatigue the way humans do. Within the constraints of a network pentest, a human might be, say, 20% more creative at peak performance than an agent loop. But the agent loop will operate within a narrow band of its own peak performance throughout the whole test, on every stimulus/response trial it runs. Humans cannot do that.
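A rough sketch of that fan-out (test_hypothesis and the payload list are stand-ins, not any particular tool):

    from concurrent.futures import ThreadPoolExecutor, as_completed
    from itertools import product

    endpoints = ["/login", "/search", "/api/v1/users"]
    payloads = ["' OR 1=1--", "<script>alert(1)</script>", "{{7*7}}"]


    def test_hypothesis(endpoint, payload):
        # Placeholder: a real sub-agent would drive requests and judge the response.
        return {"endpoint": endpoint, "payload": payload, "interesting": False}


    # Even with a low per-attempt hit rate, thousands of cheap concurrent trials
    # cover far more of the hypothesis space than one careful human pass.
    hits = []
    with ThreadPoolExecutor(max_workers=100) as pool:
        futures = [pool.submit(test_hypothesis, e, p) for e, p in product(endpoints, payloads)]
        for f in as_completed(futures):
            result = f.result()
            if result["interesting"]:
                hits.append(result)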
Note that GPT-5 in a standard scaffold (Codex) lost to almost everyone, while in the ARTEMIS scaffold it won. The key isn't the model itself, but the Triage Module and the sub-agents. Splitting roles into "Supervisor" (manager) and "Worker" (executor) with intermediate validation is the only viable pattern for complex tasks. This is a blueprint for any AI agent, not just in cybersec.
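Reduced to a toy, the pattern looks like this (llm_call is a canned stub here, and none of this is the actual ARTEMIS triage logic): the supervisor only plans, workers execute narrow tasks, and a separate triage pass has to validate each report before it is accepted.

    from dataclasses import dataclass


    def llm_call(role, prompt):
        # Stand-in for a real model client; swap in an actual API call.
        canned = {
            "supervisor": "enumerate endpoints\nprobe the auth flow",
            "worker": "endpoint returned 500 on crafted input (evidence attached)",
            "triage": "yes",
        }
        return canned[role]


    @dataclass
    class Task:
        description: str


    def supervisor(goal):
        # The manager only plans and reviews; it never touches tools directly.
        plan = llm_call("supervisor", f"Break this goal into independent tasks: {goal}")
        return [Task(line.strip()) for line in plan.splitlines() if line.strip()]


    def worker(task):
        # The executor handles one narrow task with its own short context.
        return llm_call("worker", f"Execute and report evidence: {task.description}")


    def triage(task, report):
        # Intermediate validation: a separate pass decides whether the worker's
        # claim is backed by evidence before it reaches the final report.
        verdict = llm_call("triage", f"Task: {task.description}\nReport: {report}\nValid? yes/no")
        return verdict.strip().lower().startswith("yes")


    accepted = []
    for task in supervisor("assess the staging environment"):
        report = worker(task)
        if triage(task, report):
            accepted.append(report)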
If you can do it by splitting roles explicitly, you can fold it into a unified model too. So "scaffolding advantage" might be a thing now, but I don't expect it to stay that way.
Is this true? I mean, it's true for any specific workflow, but it's not clear to me that it's true for all workflows - the power set of all workflows exceeds any single architecture, in my mind.
Think of it in an end-to-end way: produce a ton of examples of the final outputs of supervisor-worker agent runs, then train a model to predict those outputs directly from the original user prompts.
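Something like this, where run_supervisor_worker_pipeline stands in for the expensive multi-agent system you want to distill, and the JSONL shape is just one common chat fine-tuning format (check your provider's exact spec):

    import json


    def run_supervisor_worker_pipeline(prompt):
        # Placeholder for the full multi-agent run whose end result we keep.
        return f"final consolidated answer for: {prompt}"


    prompts = [
        "produce a pentest report for acme.example",
        "summarize open findings by severity",
    ]

    with open("distill.jsonl", "w") as f:
        for p in prompts:
            final = run_supervisor_worker_pipeline(p)
            # Only the user prompt and the final answer are kept; all the
            # intermediate supervisor/worker traffic is thrown away.
            f.write(json.dumps({
                "messages": [
                    {"role": "user", "content": p},
                    {"role": "assistant", "content": final},
                ]
            }) + "\n")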
It's not true for all workflows. But many of today's custom workflows are like the magic "let's think step by step" prompt for the early LLMs: low-hanging fruit, set to become redundant as better agentic capabilities are folded into the LLMs themselves.
This was inevitable, and it's not about being strict, but about mundane technical limitations. Remote proctoring has always been security theater, but Vision LLMs combined with hardware workarounds put the final nail in the coffin. The fundamental problem is simple: you cannot software-secure a device to which the attacker has physical access.
ACCA simply admitted defeat in an arms race where the attack became orders of magnitude cheaper and more effective than the defense. The only reliable "air gap" from AI today is a physical room and paper.
Goodhart's Law works on steroids with AI. If you tell a human dev "we need 100% coverage," they might write a few dummy tests, but they'll feel shame. AI feels no shame - it has a loss function. If the metric is "lines covered" rather than "invariants checked," the agent will flood the project with meaningless tests faster than a manager can blink. We'll end up with a perfectly green CI/CD dashboard and a completely broken production because the tests will verify tautologies, not business logic
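A toy example of the failure mode (names made up): both tests below give 100% line coverage of apply_discount, but only the second would catch a bug in the formula.

    def apply_discount(price, percent):
        return price - price * percent / 100


    # Metric-gaming test: executes every line, asserts a tautology, catches nothing.
    def test_apply_discount_is_covered():
        assert apply_discount(100.0, 10.0) == apply_discount(100.0, 10.0)


    # Invariant-checking test: pins the actual business rule.
    def test_apply_discount_invariant():
        assert apply_discount(100.0, 10.0) == 90.0
        assert apply_discount(100.0, 10.0) <= 100.0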