
One classic problem in all of ML is ensuring that the benchmark is representative and that the algorithm isn't overfitting to the benchmark.

This remains an open problem for LLMs: we don't have true AGI benchmarks, and LLMs frequently learn the benchmark problems without necessarily getting much better in the real world. Gemini 3 has been hailed precisely because it delivered large gains across the board that don't appear to be benchmark overfitting.



This could be a solved problem: come up with problems that aren't online and compare models on them. Later, use the results to sort your problems and classify them from easy to difficult.
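Roughly what I have in mind, as a sketch. query_model is a hypothetical stand-in for whatever API wrapper you use, the file name is made up, and substring grading is deliberately naive:

    import json

    def load_private_problems(path="private_problems.jsonl"):
        # One problem per line: {"question": "...", "answer": "..."}
        # The file stays local and is never published.
        with open(path) as f:
            return [json.loads(line) for line in f]

    def evaluate(query_model, problems):
        # query_model: hypothetical callable, prompt string -> answer string
        results = []
        for p in problems:
            reply = query_model(p["question"])
            results.append({
                "question": p["question"],
                # naive grading: substring match against the reference answer
                "correct": p["answer"].strip().lower() in reply.lower(),
            })
        return results

    def bucket_by_pass_rate(runs):
        # Label difficulty from observed pass rates across several models/runs,
        # rather than asking a single LLM to guess how hard a problem is.
        rates = {}
        for run in runs:
            for r in run:
                rates.setdefault(r["question"], []).append(r["correct"])
        def label(votes):
            rate = sum(votes) / len(votes)
            return "easy" if rate > 0.8 else "hard" if rate < 0.2 else "medium"
        return {q: label(v) for q, v in rates.items()}

The grading and the difficulty thresholds are placeholders; the point is only that the questions never leave your machine except when you actually run an eval.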


Hard to do for an industry benchmark, since running the test this way requires sending the questions to the LLM provider, which effectively puts them into a future training set.

This has been tried multiple times by multiple people, and over time such benchmarks tend to lose their immunity to "cheating".


How do you imagine existing benchmarks were created?



