Agreed. Gemini 3 is still pretty bad at agentic coding.
Just yesterday, in Antigravity, while applying changes, it deleted 500 lines of code and replaced them with a `<rest of code goes here>` placeholder. Unacceptable behavior in 2025, lol.
It has very quickly become unfashionable to say you like the Codex CLI. I still enjoy working with it, and my only complaint is that its speed makes it less than ideal for pair coding.
On top of that, the Codex CLI team is responsive on GitHub, and it's clear that user complaints make their way to the team responsible for fine-tuning these models.
I run bake-offs between all three models, and GPT 5.2 generally has a higher success rate at implementing features, followed closely by Opus 4.5 and then Gemini 3, which struggles with agentic coding. I'm interested to see how 5.2-codex behaves; I haven't been a fan of the codex models in general.
When Claude screws up a task I use Codex and vice versa. It helps a lot when I'm working on libraries that I've never touched before, especially iOS related.
(Also, I can't imagine who is blessed with so much spare time that they would look down on an assistant that does decent work.)
> When Claude screws up a task I use Codex and vice versa
Yeah, it feels really strange sometimes. Bumping up against something that Codex seemingly can't work out, and you give it to Claude and suddenly it's easy. And you continue with Claude and eventually it gets stuck on something, and you try Codex which gets it immediately. My guess would be that the training data differs just enough for it to have an impact.
I think Claude is more practically minded. I find that OAI models in general default to the most technically correct, expensive (in terms of LoC implementation cost, possible future maintenance burden, etc) solution. Whereas Claude will take a look at the codebase and say "Looks like a webshit React app, why don't you just do XYZ which gets you 90% of the way there in 3 lines".
But if you want that last 10%, codex is vital.
Edit: Literally right after I typed this, it happened again. Codex 5.2 reports a P1 bug in a PR. I look closely; I'm not actually sure it's a "bug". I take it to Claude. Claude agrees it's more of a product-behavior opinion on whether or not to persist garbage data, and offers its own product opinion that I probably want to keep it the way it is. Codex 5.2, meanwhile, accepts the view that it's a product decision but stubbornly won't offer its own opinion!
Correct, this has been true for all GPT-5 series. They produce much more "enterprise" code by default, sticking to "best practices", so people who need such code will much prefer them. Claude models tend to adapt more to the existing level of the codebase, defaulting to more lightweight solutions. Gemini 3 hasn't been out long enough yet to gauge, but so far seems somewhere in between.
>> My guess would be that the training data differs just enough for it to have an impact.
It's because performance degrades over longer conversations, which decreases the chance that the same conversation will result in a solution, and increases the chance that a new one will. I suspect you would get the same result even if you didn't switch to a different model.
Not really; models certainly degrade to some degree on context retrieval. However, in Cursor you can change the model used for an exchange while it still has the same long context, and you'll see the different models' strengths and weaknesses contrasted.
They just have different strengths and weaknesses.
if claude is stuck on a thing but we’ve made progress (even if that progress is just process of elimination) and it’s 120k tokens deep, i’ll often have claude distill our learnings into a file and /clear to start again with said file, and i’ll get to success quicker
which is analogous to taking your problem to another model and ideally feeding it some sorta lesson
i guess this is a specific example but one i play out a lot. starting fresh with the same problem is unusual for me. it usually has a lesson i'm feeding it from the start
I care very little about fashion, whether in clothes or in computers. I've always liked Anthropic products a bit more but Codex is excellent, if that's your jam more power to you.
- Planning mode. Codex is extremely frustrating. You have to constantly tell it not to edit when you talk to it, and even then it will sometimes just start working.
- Better terminal rendering (Codex seems to go for a "clean" look at the cost of clearly distinguished output)
the faddish nature of these tools fits the narrative of the METR findings that the tools slow you down while making you feel faster.
since nobody (other than that paper) has been trying to measure output, everything is based on feelings and fashion, like you say.
I'm still raw dogging my code. I'll start using these tools when someone can measure the increase in output. Leadership at work is beginning to claim they can, so maybe the writing is on the wall for me. They haven't shown their methodology for what they are measuring; they're just telling everyone they "can tell".
But until then, I can spot too many psychological biases inherent in their use to trust my own judgement, especially when the only real study done so far on this subject shows that our intuition lies about this.
And in the meantime, I've already lost time investigating reasonable looking open source projects that turned out to be 1) vibe coded and 2) fully non functional even in the most trivial use. I'm so sick of it. I need a new career
what.cd was the world's greatest music discovery mechanism. You could always ask for recommendations in the forums or in the comment threads on album pages. The community always delivered. I miss that type of camaraderie. I also spent more on music as a member of that community than I have since it was disbanded.
What.cd was the Library of Alexandria for recorded music; the depth of what was collated and properly labelled there was far beyond anything that has ever existed on any other service, paid for or not. Every permutation of every release, endless live recordings, often multiple of the same event, absolutely incredible.
Private trackers, as I understand it, are still a thing in the mid-2020s. Did a replacement that matches (or surpasses) What.cd not pop up in the meantime?
I'm just wondering how a strong community like that was struck a deathblow. It's not like all of its content disappeared.
Orpheus and redacted (previously passtheheadphones) both appeared shortly after what.cd’s demise. I believe they both now have more total torrents than what.cd, however the depth is still not what what’s was 9 years on (I know this because some of my uploads from what are still missing, partially because I no longer have the source material). And, the “cultivation” (ensuring no duplicates, recommendations for releases, general community, etc) is nowhere near what’s.
I would say all other media (or at least, the media I care about - film, tv, books) has what.cd equivalents, sometimes multiple. I think Spotify and AM killed 95%+ of “true” private tracker interest for music, especially with lossless and surround releases being available. The diehard core are still there (names from 15 years ago are still active) but it’s really not the same.
Orpheus and Redacted existed but it's kind of hard to beat the convenience of streaming for the low price in 2025.
Granted, you can set up automated *arr systems with Plexamp to get a pretty seamless "personal Spotify", but IME getting true usefulness out of trackers of What's quality always required spending real money - to obtain rare records/CDs on marketplaces - or at least large amounts of time if you went the "rent CDs from the library" route. I personally haven't run into many RYM releases lacking on Apple Music, and what is lacking I can find on Bandcamp or YouTube.
OiNK before that, too. Once waffles and what disappeared, I was never 'able' to get onto one of the newer ones… the whole process is some real archaic thing. I used to have a great 'profile' on those others, but yeah.
While I agree that `--dangerously-skip-permissions` is (obviously) dangerous, it shouldn't be considered completely inaccessible to users. A few safeguards can sand off most of the rough edges.
What I've done is write a PreToolUse hook to block all `rm -rf` commands. I've also seen others use shell functions to intercept `rm` commands and have it either return a warning or remap it to `trash`, which allows you to recover the files.
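For anyone curious, here's a minimal sketch of that kind of hook. It assumes Claude Code's documented PreToolUse behavior (the pending tool call arrives as JSON on stdin, and exit code 2 blocks it, with stderr fed back to the model); the regex and messages are just illustrative, so check the hooks docs for the exact field names:

```python
#!/usr/bin/env python3
"""PreToolUse hook sketch: refuse recursive/forced rm commands.

Assumes Claude Code passes the pending tool call as JSON on stdin and
treats exit code 2 as "block this call" (stderr is shown to the model).
Verify the field names against your version's hooks documentation.
"""
import json
import re
import sys

event = json.load(sys.stdin)
command = event.get("tool_input", {}).get("command", "")

# Rough heuristic: any `rm` invocation carrying a recursive/force style flag.
if re.search(r"\brm\b.*(\s-{1,2}[A-Za-z]*[rf]\b|\s--(recursive|force)\b)", command):
    print("Blocked: recursive/forced rm is not allowed here; use `trash` instead.",
          file=sys.stderr)
    sys.exit(2)  # exit code 2 = block the tool call

sys.exit(0)  # anything else is allowed through
```

You'd register it under a `PreToolUse` matcher for the `Bash` tool in your settings so it runs before every shell command the agent wants to execute.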
> There’s an odd tendency in modern software development; we’ve collectively decided that naming things after random nouns, mythological creatures, or random favorite fictional characters is somehow acceptable professional practice. This would be career suicide in virtually any other technical field.
I worked in finance – we gave our models names that endeared them to us. My favorite is a coworker naming his model "beefhammer".
One of my favorite personal evals for an llm is testing its stability as a reviewer.
The basic gist of it is to give the llm some code to review and have it assign a grade multiple times. How much variance is there in the grade?
Then, prompt the same llm to be a "critical" reviewer with the same code multiple times. How much does that average critical grade change?
A low variance of grades across many generations and a low delta between "review this code" and "review this code with a critical eye" is a major positive signal for quality.
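Roughly, the harness looks like this (a sketch only: the file name, prompt wording, and model ID are placeholders, and I'm using the OpenAI Python client here just as an example):

```python
import re
import statistics

from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

CODE = open("snippet_under_review.py").read()  # placeholder file to grade


def grade(code: str, critical: bool, n: int = 10, model: str = "gpt-5.1") -> list[int]:
    """Ask the same model to grade the same code n times and collect the scores."""
    tone = "Review this code with a critical eye." if critical else "Review this code."
    prompt = f"{tone}\nGrade it from 1 to 10 and end your reply with 'SCORE: <n>'.\n\n{code}"
    scores = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model=model,  # substitute whichever model ID you're evaluating
            messages=[{"role": "user", "content": prompt}],
        )
        match = re.search(r"SCORE:\s*(\d+)", resp.choices[0].message.content or "")
        if match:
            scores.append(int(match.group(1)))
    return scores


neutral = grade(CODE, critical=False)
harsh = grade(CODE, critical=True)
print("neutral:  mean", statistics.mean(neutral), "stdev", statistics.pstdev(neutral))
print("critical: mean", statistics.mean(harsh), "stdev", statistics.pstdev(harsh))
print("critical delta:", statistics.mean(neutral) - statistics.mean(harsh))
```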
I've found that gpt-5.1 produces remarkably stable evaluations whereas Claude is all over the place. Furthermore, Claude will completely [and comically] change the tenor of its evaluation when asked to be critical whereas gpt-5.1 is directionally the same while tightening the screws.
You could also interpret these results to be a proxy for obsequiousness.
Edit: One major part of the eval i left out is "can an llm converge on an 'A'?" Let's say the llm gives the code a 6/10 (or B-). When you implement its suggestions and then provide the improved code in a new context, does the grade go up? Furthermore, can it eventually give itself an A, and consistently?
It's honestly impressive how good, stable, and convergent gpt-5.1 is. Claude is not great. I have yet to test it on Gemini 3.
I agree, I mostly use Claude for writing code, but I always get GPT5 to review it. Like you, I find it astonishingly consistent and useful, especially compared to Claude. I like to reset my context frequently, so I’ll often paste the problems from GPT into Claude, then get it to review those fixes (going around that loop a few times), then reset the context and get it to do a new full review. It’s very reassuring how consistent the results are.
You mean literally assign a grade, like B+? This is unlikely to work, based on how token prediction and temperature work. You're going to get a probability distribution in the end that reflects the model's runtime parameters, not the intelligence of the model.
gpt-5* reasoning models do not have an adjustable temperature parameter. It seems like we may have a different understanding of these models.
And, like the other commenter said, the temperature may change the distribution of the next token, but the reasoning tends to reel those things in, which is why reasoning models are notoriously poor at creative writing.
You are free to run these experiments for yourself. Perhaps, with your deeper understanding, you'll shed new light on this behavior.
It surely is different. If you set the temp to 0 and do the test with slightly different wording, there is no guarantee at all the scores would be consistent.
And if an LLM is consistent, even with a high temp, it could give the same PR the same grade while choosing different words to say.
The tokens are still chosen from the distribution, so a higher probability of the same grade will result in the same grade being chosen regardless of the temp set.
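As a toy illustration (made-up logits, not from any real model): temperature only rescales the logits before the softmax, so a grade token that already dominates keeps dominating:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature divides the logits before they're normalized into probabilities."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for the candidate grade tokens "6", "7", "8".
grade_logits = {"6": 1.0, "7": 4.0, "8": 1.5}
for temp in (0.2, 0.7, 1.0, 1.5):
    probs = softmax(list(grade_logits.values()), temperature=temp)
    print(temp, {g: round(p, 3) for g, p in zip(grade_logits, probs)})
# "7" stays the overwhelmingly likely sample at every temperature;
# raising the temperature only fattens the tails.
```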
I think you're restating (in a longer and more accurate way) what I understood the original criticism to be: that this grading test isn't testing what it's supposed to, partly because a grade is too few tokens.
The model could "assess" the code qualitatively the same and still give slightly different letter grades.
my experience reviewing prs is that sometimes it says a pr is perfect with some nitpicks, and other times it says the same pr is trash and needs a lot of work
Disney is the same company as it was 20 years ago. In fact, it's the same company as it was 100 years ago. They only care about profit. They do just enough brand management to preserve the profit motive.
To be fair to Walt Disney, he cared about a lot beyond profit and believed in advancing technology and society in a way modern corporate leaders absolutely do not. He was no saint, but he's a far cry from modern CEOs.
To be fair, Walt Disney partnered with his brother Roy Disney, and they co-founded and ran the Walt Disney Company (and the iterations before it). These iterations of the Disney Company were never just Walt Disney.
Yes, but if you watch any documentaries about early Disney and listen to those people talk, everything was about Walt's vision; even after his death they would ask "What would Walt want or do?" He was a figure whose influence and vision are on another level in American history (both good and bad), and early Disney was Walt no matter who was in charge on paper, or even whether Walt was still alive. That only started to change under Eisner. Roy was the one who kept Walt grounded, so ambition shrank, but they stayed the course Walt set.
Walt grew up in an era when there was still a sense that wealth and power brought with it strong moral obligations to serve the community and nation. We lost that somewhere along the way.
I think that, given the times, we might rate him a little bit above "no saint". Perhaps slightly below or at par with the norms of his time, which we could now look back on as the peak of some rather nasty tendencies in society.
He also normalized and romanticized American expansion and the displacement of Native Americans. He's a very complicated and flawed figure who irrevocably changed the course of this nation. Even Walt recognized that Walt Disney the man and Walt Disney the icon were two different people, and he was flawed as a man in ways that the icon who appeared weekly in everyone's living room was not.
Companies can have motives in addition to profit, and they’re more likely to when control is concentrated, just because individual people have multiple desires.
This was certainly the case with early Disney because Walt Disney was a megalomaniac utopian. I don’t think the original Epcot plans ever had a reasonable chance of being profitable, but Walt pushed them because he believed he was the saviour of urbanism in America.
Yes, perhaps if we deflated Disney’s moral rot by a diversified basket of other morally-rotted goods, I suppose we’d be able to conclude that Disney is the same company.
Outside that effort, I see a company once famous for its prudishness now unafraid of shame.
I firmly disagree and think this shallow take dishonors a pretty great man. While not perfect, Disney gave us the bedrock of American children's culture which has been a soft tool for the US for generations. Not to mention technology and other advancements. I'm not a Disney nut, but the man was one-of-a-kind and an impressive industrialist who instilled a great culture of innovation and a deep love of children and play. All things I value.
Yep, Disney was also a leading producer of racist tropes and content during Jim Crow. Historical clips of Mickey Mouse characters putting on minstrel shows with blackface alongside other racist stereotypes like crows can easily be found online[0]. Not to mention Song of the South[1], a film Disney produced based on Uncle Remus stories following slaves who happily live on a Georgia plantation. Disney has, of course, done their best to scrub these entries from history, but they played a major role in depicting racist tropes to kids for decades.
We all acknowledge that Walt Disney was a flawed person; I don't think anybody here disagrees. To me, what sets him apart from other corporate leaders isn't Walt's moral character, but rather his ambition to influence the direction of humanity's development, both culturally and technologically. He was about a lot more than just making number go up.
One could argue that the company reoriented itself so purely towards children's art and kitsch because they needed to get themselves into a market segment where they could completely sanitize their output of these kinds of embarrassments.
I read your comment as saying that we should blame the people who create the demand for Disney's products, and the voters who elect the politicians, instead of Disney and the politicians. Not so?
The context is messy, but my comment's in the context of rejecting blame on Disney alone for "losing their way" when they have had the same way (read: $$$) as before and they're delivering products people want.
Fwiw I think all US presidents since Clinton were elected on a non-interventionist/pacifist campaign. Blaming the voters when every one of them (less so with Biden) violated those promises is a bit unfair, if you still believe in democracy.
Almost every one of them was elected again, often by wider margins (the only exception losing to another one of them), after destroying any illusion in that direction you might argue was produced by their campaign positions, so I don't think you can absolve the American electorate here, even if one agrees that their campaigns before taking office met your description.
Bush sure wasn't anti-interventionist for his second term, after entering Iraq War 2.0. Even Obama campaigned on continuing the "necessary" Afghanistan war.
I don't recall George W. Bush ever actually promising to stay out of wars and interventions. It's been standard for the two parties to criticize each other on grounds of doing interventionism badly or going too far towards one extreme regarding foreign policy, but nobody has run as a real pacifist or isolationist because they would lose in a landslide. It especially doesn't help that pacifism and isolationism are associated with activist fringes in both parties who often lean into crank theories or make friends publicly with adversarial states.
Blaming the voters seems completely sensible when they reelected W in 2004. The man's Vice President was Dick freakin' Cheney. You can't seriously tell me the people voted for pacifism and got screwed over.
I don’t have an opinion on him, despite the suggestiveness of my comment. He’s more illustrative of a spirit that Disney at a time did not have an appetite for.
Stephen A Smith has done as much to harm ESPN's brand as any other figure. Please don't assume my biases from whom I failed to mention – I could have used SAS instead of Pat and my point would have been the same.
Perhaps I should have expected that the conversation would get pulled this way but it's not where I wanted it to go.
But this was also just a short-lived political environment as well, where companies pretended to care about the current thing because it was politically expedient. How long did it take for them to do a 180? I mean they didn't believe in any of that stuff even a little.
> As part of the agreement, Disney will make a $1 billion equity investment in OpenAI, and receive warrants to purchase additional equity.
I say this with no snark or disdain: Sam has mastered the art of the flywheel.
Re licensed AI videos: if anyone wants to see the perspective the C-suites are being sold on, check out this episode of Belloni's The Town, in which they discuss the vision for AI + IP:
https://overcast.fm/+AA4DU9JreIE
What are the terms? It is not at all clear from the announcement. "Part of this three-year licensing agreement" _could_ mean the license cost is $1 billion, which Disney in turn invests in OpenAI in return for equity, and they're calling it an "investment" (that's what's hypothesized above, but I don't think we know). Disney surely gets something for the license other than the privilege of buying $1 billion in OpenAI stock at their most recent valuation price.
Disney gets the opportunity to tell the board and investors that they are now partnered with a leading AI company. In effect, Disney is now an AI company as well. They haven't really done anything, but if anyone asks they can just say "of course we're at the forefront of the entertainment industry. We're already leveraging AI in our partnerships"
Isn't that the case for most ultra-rich CEOs? All of the CEOs of Microsoft apparently started off either building product or helping develop the business into something profitable. But at some point it doesn't really matter if you have the skills to be an individual contributor, a team leader, or even a vice president. The role of CEO mostly is to keep investors happy & secondarily to put the right people in the company together to make things happen.
Has he made billions? He's obviously done well but I'm not sure he has been able to capture any value from openai except for publicity, and what else does he have? A few $10m from loopt and ycombinator?
>Has he made billions? He's obviously done well but I'm not sure he has been able to capture any value from openai except for publicity, and what else does he have?
Americans mostly. Both abuse copyright law to profit off of works done by creators/inventors of the past. Neither contribute much to humanity these days.
What if who's wrong? Sam? If "right" means turning a profit, he's most certainly wrong. It would be a very stupid investment to give money to him, but it turns out there's no shortage of dumb money. Can't blame the guy for taking money people are literally throwing at him. If you know anything about the guy, he has no interest in business or profit. He just wants to create AGI, mostly out of curiosity it seems.
I can't really see how Altman is a sociopath? I think his current vision greatly exceeds the technical capabilities that OpenAI can ever build. OpenAI seems to have produced some genuinely interesting products on the other hand. But they aren't profitable at present and I don't see it happening.
Altman talks the talk of a CEO who is going to build a company that can change the world. It's what investors want to hear. He seems to make as many attempts as possible to actually execute on that. I think most of those plans are unlikely to be as successful as desired. But this isn't Theranos-level fraud, where what they were trying to build was obviously impossible.
Very neat, but recently I've tried my best to reduce my extension usage across all apps (browsers/ide).
I do something similar locally by manually specifying all the things I want scrubbed/replaced and having Keyboard Maestro run a script on my system clipboard whenever I do a paste operation, mapped to `hyperkey + v`. The plus side of this is that the paste is instant. The latency introduced by even the smallest amount of inference is enough friction to make you want to ditch the process entirely.
Another plus of the non-extension solution is that it's application agnostic.
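If anyone wants to replicate the non-extension route, here's a minimal sketch of the kind of script a Keyboard Maestro macro can run before it pastes (macOS `pbpaste`/`pbcopy`; the patterns are obviously placeholders for whatever you want scrubbed):

```python
#!/usr/bin/env python3
"""Scrub the clipboard in place before pasting.

Sketch of a script a Keyboard Maestro macro can run on `hyperkey + v`:
read the clipboard, apply hand-maintained replacements, write it back,
then let the macro send a normal paste. The patterns are placeholders.
"""
import re
import subprocess

RULES = [
    (r"\b[\w.+-]+@[\w-]+\.[\w.]+\b", "user@example.com"),  # email addresses
    (r"\b\d{1,3}(?:\.\d{1,3}){3}\b", "0.0.0.0"),           # IPv4 addresses
    (r"sk-[A-Za-z0-9]{16,}", "<redacted-api-key>"),        # API-key-looking strings
]

text = subprocess.run(["pbpaste"], capture_output=True, text=True).stdout
for pattern, replacement in RULES:
    text = re.sub(pattern, replacement, text)
subprocess.run(["pbcopy"], input=text, text=True)
```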
That's a great idea. My original excuse to not do that was because I copy so many things but, duh, I could just key the sanitizing copy to `hyperkey + c`.
Multiple things: 1) extensions are overly permissive, 2) so many of them are sold to shady entities without a peep from the developer, and 3) it's never been easier to generate my own tooling.
I just download the extension file, check it out, and install it locally. No worries about future updates until something breaks (doesn't tend to happen).
fair enough. I'll add that one fantastic use I've found for LLMs is quickly checking the source of a given addon (though obviously this is no replacement for a real audit or fine-grained permissions).
I'd be doing this type of thing a lot more if browsers didn't make it difficult to load unpacked addons (in which case I could be modifying things I didn't like on the fly).
Haleakalā is like this as well. Don't just drive up the crater - hike through the thing. It's a ~12 mile hike. It's a remarkable experience because the landscape changes so frequently and dramatically from desert to tropical forest.
The only comp to this is the transition in Mad Max from the desert to the oasis.
Tourists that drive to the crater, take pictures, and drive down have no idea what they're missing.
Highly recommend camping in the crater on a clear night around new moon. Some of the best stars you'll see. Seeing the sun rise in the crater gap (where you can sometimes see the big island) is stunning.
Park in the lower lot, hitchhike to the top (or get someone else to drive you), and then you can hike back up to your car the next day on the switchbacks.
Do not attempt to hike up the sliding sands trail you took down, it's *very rough*.
> Tourists that drive to the crater, take pictures, and drive down have no idea what they're missing.
And for some reason they blather on and on loudly up there while the most mind-blowing sunsets are happening. Can we not be silent for 15 minutes and look at the universe doing its thing?