Using a single custom benchmark as a metric seems pretty unreliable to me.
Even at the risk of teaching future AI the answer to your benchmark, I think you should share it here so we can evaluate it. It's entirely possible you are coming to a wrong conclusion.
After taking a walk for a bit, I decided you're right: I came to the wrong conclusion. Gemini 3 is incredibly powerful on some other stuff I've run.
This probably means my test is a little too niche. The fact that the model failed one of my tests doesn't speak to its broader intelligence per se.
While I still believe in the importance of a personalized suite of benchmarks, my Python one needs to be down-weighted or supplanted.
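Concretely, the down-weighting I have in mind is something like this; a rough sketch, assuming each model is wrapped behind a simple `ask(prompt) -> str` callable, and with the prompts, graders, and weights all made up for illustration:

```python
from typing import Callable

# Each entry: (prompt, pass/fail grader, weight). The niche Python test
# carries a low weight so one failure can't sink the whole score.
# All prompts, graders, and weights here are placeholders.
SUITE: list[tuple[str, Callable[[str], bool], float]] = [
    ("Write a Python function that ...", lambda out: "def " in out, 0.25),
    ("Summarize this design doc ...", lambda out: len(out) > 200, 1.0),
    ("Refactor this SQL query ...", lambda out: "select" in out.lower(), 1.0),
]

def score(ask: Callable[[str], str]) -> float:
    """Weighted pass rate across the suite."""
    total = sum(weight for _, _, weight in SUITE)
    passed = sum(weight for prompt, grade, weight in SUITE if grade(ask(prompt)))
    return passed / total
```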
My bad to the Google team for the cursory brush-off.
> This probably means my test is a little too niche.
> my Python one needs to be down-weighted or supplanted.
To me, this just proves your original statement: you can't know whether an AI can handle your specific task based on benchmarks. They're relatively meaningless for that. You just have to try.
I see AI fail spectacularly, and often, because I'm in a niche field. To me, in the context of AI, "niche" means "most of the code for this is proprietary/not in public repos, so statistically sparse".
I feel similarly. If you're working with relatively niche APIs on services the public doesn't see, the AI isn't one-shotting anything. But I still find it useful for generating some crap that I can then feel good about fixing.
I definitely agree on the importance of personalized benchmarks for really feeling when, where, and how much progress is occurring. The standard benchmarks matter, but it's hard to feel what a 5% improvement on some exam means beyond the hype. I have a few projects across domains that I've been working on since the original ChatGPT launched, and I quickly give them a try on each new model release. Contrary to popular opinion, I could tell a huge difference between GPT-4 and GPT-5, but it's nothing compared to the current delta between GPT-5.1 and Gemini 3 Pro…
TL;DR: I don't think personal benchmarks should replace the official ones, of course, but the former are invaluable for building your intuition about the rate of AI progress beyond the hype.
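For what it's worth, my "try them on each release" loop is basically just this; a toy sketch where the prompts and model wrappers are placeholders, not real client code:

```python
from typing import Callable

# Long-lived prompts from my own projects (placeholders here).
PROMPTS = [
    "Project A: port this parser to ...",
    "Project B: explain why this query is slow ...",
]

def snapshot(name: str, ask: Callable[[str], str]) -> None:
    """Dump each model's raw answers so release-over-release deltas stay concrete."""
    for prompt in PROMPTS:
        print(f"--- {name} | {prompt}\n{ask(prompt)}\n")

# e.g. snapshot("gpt-5.1", ask_gpt51) and snapshot("gemini-3-pro", ask_gemini3),
# where ask_* are your own hypothetical client wrappers.
```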