Hacker News

There’s no evidence that this ever happened other than this guy’s word. And since the claim that he ran an agent with no human intervention for 3 months is so far outside of any capabilities demonstrated by anyone else, I’m going to need to see some serious evidence before I believe it.


> There’s no evidence that this ever happened other than this guy’s word.

There's a YouTube channel where the sessions were livestreamed; it's linked in their FAQ. I haven't felt the need to check them, but there are 10-12h sessions in there if you're that invested in proving that this is "so far outside of any capabilities"...

A brief look at the commit history should show you that it's 99.9% guaranteed to be written by an LLM :)

When's the last time you used one of these SotA coding agents? They've been getting better and better for a while now. I am not surprised at all that this worked.


>When's the last time you used one of these SotA coding agents?

This morning :)

>"so far outside of any capabilities"

Anthropic was bragging just last week about an agent being able to code without intervention for 30 hours before completely losing focus. They hailed it as a new benchmark. It completed a project of about 11k lines of code.

The max unsupervised run that GPT-5-Codex has been able to pull off is 7 hours.

That's what I mean by the current SOTA demonstrated capabilities.

https://x.com/rohanpaul_ai/status/1972754113491513481

And yet here you have a rando saying he was able to get an agent to run unsupervised for 100x longer than the model companies themselves have managed, and produce 10x the amount of code--months ago.

I'm 100% confident this is fake.

>There's a yt channel where the sessions were livestreamed.

There are a few videos that long, not 3 months' worth of videos. Also, I spot-checked the videos, and the framerate is so low that it would be trivial to cut out the human intervention.

>guaranteed to be written by an LLM

I don't doubt that it was 99.9% written by an LLM; the question is whether the agent ran unsupervised for 3 months or whether he spent 3 months guiding an LLM to write it.


I think you are conflating two things here. When the labs announce x-hour sessions, they mean one session (i.e. the agent manages its own context via trimming, memory files, etc.). What the project I linked did was run the agent in a bash loop, which basically resets the context every time the agent "finishes".

That means that every few hours the agent starts fresh, inspects the repo, makes a plan for that session, and so on. That would explain why it took ~3 months to do what a human + AI could probably do in a few weeks, and it's why this doesn't sound too ludicrous to me. If you look at the repo, there are a lot of things that aren't strictly needed for the initial prompt (make a programming language like go but with genz stuff, nocap).
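To make the distinction concrete, here's a minimal sketch of what "run in a bash loop" means, as opposed to one long self-managed session. The `agent` function is a stand-in for whatever coding-agent CLI was actually used (an assumption; the thread doesn't name one): each invocation gets a fresh context window, and only the repo on disk persists between "sessions".

```shell
#!/usr/bin/env bash
# Stub standing in for a real coding-agent CLI. In a real run this would
# invoke the agent, which re-inspects the repo and re-plans from scratch
# every time because its context does not survive between invocations.
agent() { echo "fresh-context session: $1"; }

PROMPT="make a programming language like go but with genz stuff, nocap"
runs=0
for i in 1 2 3; do          # a real loop would run for weeks, not 3 turns
  agent "$PROMPT" >/dev/null
  runs=$((runs+1))          # each iteration = one fresh context
done
echo "$runs"                # → 3
```

The only state shared between iterations is whatever lands in the repository (code, TODO files, notes), which is exactly why each session starts with the "inspect repo, make a plan" routine.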

Oh, and if you look at their Discord + repo, lots of things don't actually work: some examples run, some segfault. That's exactly what you'd expect from running an agent in a loop. I still think it's impressive nonetheless.

The fact that you are so incredulous (and I get why; scepticism is warranted in this space) is actually funny. We are on the right track.


There’s absolutely no difference between what he says he did and what Claude Code can do behind the scenes.

If Anthropic thought they could produce anything remotely useful by wiping the context and reprompting every few hours, they would be doing it. And they’d be saying “look at this we implemented hard context reset and we can now run our agent for 30 days and produce an entire language implementation!”

In 3 months or 300 years of operating like this, a current agent freshly reprompted every few hours would never produce anything that even remotely looked like a language implementation.

As soon as its context was poisoned with slightly off-topic TODO comments, it would spin out into writing a Game of Life implementation or whatever. You’d have millions of lines of nonsense code and nothing useful after 3 months of that.

The only way I see anything like this producing anything approaching “useful” is if the outer loop wipes the repo on every reset as well, and collects the results somewhere the agent can’t access. Then you essentially have 100 chances to one-shot the thing.

But at that point you just have a needlessly expensive and slow agent.
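For what it's worth, the repo-wiping variant described above is also a few lines of bash. This is a hypothetical sketch, not anything the project claims to have done: the working repo is destroyed on every reset, each attempt is archived to a directory the agent never sees, and the `echo` line is a stub where the real agent invocation would go.

```shell
#!/usr/bin/env bash
# Sketch of "100 chances to one-shot it": each iteration starts from an
# empty repo, and results are copied out of the agent's reach.
archive=$(mktemp -d)        # collection area the agent can't access
for i in 1 2 3; do          # 3 attempts here; "100 chances" in the text
  rm -rf repo && mkdir repo # hard reset: no state leaks between attempts
  echo "attempt $i" > repo/RESULT   # stub for a full agent run
  cp -r repo "$archive/attempt-$i"  # archive this one-shot attempt
done
```

Unlike the plain bash loop, nothing carries over between iterations here, which is why it degenerates into repeated one-shotting rather than incremental progress.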



