
The purpose is to benchmark both generality and intelligence. "Making up for" a poor score on one test with an excellent score on another would be the opposite of generality. There's a ceiling based on how consistent the performance is across all tasks.
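To make the "ceiling" idea concrete, here is a toy sketch (the min-style aggregate below is purely illustrative, not ARC-AGI's actual aggregation): an overall score dominated by the weakest per-task results can never be rescued by a standout on one task.

    # Toy sketch: aggregate per-task scores so a standout on one task
    # cannot compensate for a failure on another. The min-style aggregate
    # is hypothetical, chosen only to illustrate the "ceiling" idea.
    def mean_score(scores):
        return sum(scores) / len(scores)

    def consistency_capped_score(scores):
        # The overall score can never exceed the worst task performance.
        return min(scores)

    specialist = [1.00, 1.00, 0.05]   # brilliant on two tasks, fails one
    generalist = [0.60, 0.55, 0.50]   # unspectacular but consistent

    print(mean_score(specialist), mean_score(generalist))
    # ~0.68 vs 0.55: the plain mean rewards the specialist
    print(consistency_capped_score(specialist), consistency_capped_score(generalist))
    # 0.05 vs 0.50: the capped score rewards consistency, i.e. generality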


>"Making up for" a poor score on one test with an excellent score on another would be the opposite of generality.

Really? This happens plenty in human testing. Humans aren't general?

The score is convoluted and messy. If the same score can say materially different things about capability then that's a bad scoring methodology.

I can't believe I have to spell this out but it seems critical thinking goes out the window when we start talking about machine capabilities.


Just because humans are usually tested in a way that lets them make up for a lack of generality with outstanding performance in their specialization doesn't mean that's a good way to test generalization itself.

Apparently someone here doesn't know how outliers affect a mean. Or, for that matter, has any clue about the purpose of the ARC-AGI benchmark.
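To spell it out with numbers (made-up numbers, just to show the arithmetic): a block of easy tasks that any system solves will prop the mean up regardless of what happens on the hard ones.

    # A few trivially easy tasks (scored 1.0 by any system) pull the mean
    # up even when every hard task is failed outright.
    hard = [0.0] * 90          # 90 hard tasks, all failed
    easy = [1.0] * 10          # 10 easy tasks, all solved
    scores = hard + easy

    mean = sum(scores) / len(scores)
    print(mean)                # 0.10 -> a "10%" headline from the easy tasks alone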

For anyone who is interested in critical thinking, this paper describes the original motivation behind the ARC benchmarks:

https://arxiv.org/abs/1911.01547


>Apparently someone here doesn't know how outliers affect a mean.

If the concern is that easy questions distort the mean, then the obvious fix is to reduce the proportion of easy questions, not to invent a convoluted scoring method to compensate for them after the fact. Standardized testing has dealt with this issue for a long time, and there’s a reason most systems do not handle it the way ARC-AGI 3 does. Francois is not smarter than all those people, and certainly neither are you.

This shouldn't be hard to understand.


How do you define an "easy question" for a potential alien intelligence? The solution, in my opinion, is the same as for most outlier problems: minimize the impact of the outliers.
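For example (purely illustrative, not what ARC-AGI actually does), robust aggregates like a median or a trimmed mean blunt the influence of a single extreme task score:

    # Illustrative only: robust aggregates that limit the influence of
    # one extreme task score, compared with a plain mean.
    import statistics

    scores = [0.2, 0.25, 0.3, 0.22, 1.0]    # one outlier task

    print(statistics.mean(scores))           # 0.394  - pulled up by the outlier
    print(statistics.median(scores))         # 0.25   - barely moved
    # Trimmed mean: drop the single highest and lowest score first.
    trimmed = sorted(scores)[1:-1]
    print(statistics.mean(trimmed))          # ~0.257 - also barely moved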


I mean, presumably that's what the preview testing stage would handle, right? It should be clear if there is a class of obviously easy questions. And if that's not clear, then it makes the scoring even worse.

And in some sense, all of these benchmarks are tied to and biased toward human utility.

I don't think ARC would be designed and scored the way it is if accommodating an alien intelligence were a primary concern. And if it were, the entire benchmark would be flawed, since it is too concerned with human spatial priors.

There are many ways to deal with a problem. Not all of them are good. The scoring for 3 is just bad. It tries to do too much, packing too much into a single number.

A 5% score could mean the system answered only a fraction of the problems, or that it answered all of them but with more game steps than the best human score. These are wildly different outcomes with wildly different implications. A scoring methodology that allows for that kind of ambiguity is simply not a good one.
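To illustrate the ambiguity (the formula below is made up; it just stands in for any scoring that blends completion with step-efficiency against a human baseline), two very different agents can land on the same number:

    # Hypothetical scoring mixing task completion with step-efficiency
    # relative to a human baseline. Not the real ARC-AGI-3 formula; it only
    # shows how one number can hide two very different outcomes.
    def score(tasks):
        # tasks: list of (solved, agent_steps, human_steps)
        per_task = []
        for solved, agent_steps, human_steps in tasks:
            if not solved:
                per_task.append(0.0)
            else:
                per_task.append(min(1.0, human_steps / agent_steps))
        return sum(per_task) / len(per_task)

    # Outcome A: solves 1 of 20 tasks as efficiently as a human, fails the rest.
    a = [(True, 10, 10)] + [(False, 0, 0)] * 19
    # Outcome B: solves all 20 tasks, but takes 20x more steps than a human.
    b = [(True, 200, 10)] * 20

    print(score(a), score(b))   # both print 0.05 -> the same "5%"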



