
Also importantly, they do have a 'not attempted' or 'don't know' type of response, though how it is used isn't discussed in much detail in the article.

As has been the case for decades now, the 'NaN'/no-answer type of response in NLP is important, adds real capability, and is often glossed over.



It's a little glossed over, but they do point out that the most important improvement o1 has over gpt-4o is not its "correct" score improving from 38% to 42% but its "not attempted" rate going from 1% to 9%. The improvement is even more stark for o1-mini vs gpt-4o-mini: 1% to 28%.

They don't really describe what "success" would look like, but it seems to me the primary goal is to minimize "incorrect" rather than to maximize "correct". The mini models would get there by maximizing "not attempted", while the larger models would have much higher "correct". Then both model sizes could hopefully reach 90%+ "correct" when given access to external lookup tools.
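One way to see why minimizing "incorrect" matters is a scoring rule that penalizes wrong answers more heavily than abstentions, so a model that says "not attempted" beats one that guesses wrong. A minimal Python sketch; the penalty weight and the example counts are made up for illustration, not taken from the article:

    # Hypothetical scoring rule: reward correct answers, penalize wrong
    # ones harder than abstentions ("not attempted" is neutral).
    def penalized_score(correct, not_attempted, incorrect, wrong_penalty=2.0):
        total = correct + not_attempted + incorrect
        return (correct - wrong_penalty * incorrect) / total

    # Illustrative counts only (per 100 questions):
    # a model that guesses almost everything...
    print(penalized_score(correct=8, not_attempted=1, incorrect=91))   # -1.74
    # ...vs one that abstains far more often, with the same correct count
    print(penalized_score(correct=8, not_attempted=28, incorrect=64))  # -1.20

Under any rule like this, the abstaining model scores higher even with identical "correct" counts, which matches the idea that the headline metric should punish confident wrong answers rather than just count right ones.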



