> We should make a collective git repo full of every kind of annoying bug we (expert developers) can think of. Then use that to benchmark LLMs.

I think any LLM user worth their salt has been doing this pretty much since we got API access to LLMs; otherwise there is no way to actually see whether they can solve the things you care about.

The only difference is that you must keep the actual benchmarks to yourself: don't share them with anyone, and definitely don't publish them. The second you do, you should probably stop using them as benchmarks, as newly trained LLMs will, intentionally or not, slurp them up, and suddenly they're no longer a good indicator.

I personally started keeping my own test cases for benchmarking around the GPT-3 launch, when it became clear the web would be effectively "poisoned" from that point on, and that anything on the public internet could be slurped up by the people feeding LLMs training data.

Once you have this up and running, you'll get a much more measured view of how well new LLMs work, and you'll quickly see that a lot of the fanfare doesn't actually hold up against your own private benchmarks. On a happier note, you'll also sometimes be surprised when a model suddenly does a lot better in a specific area that wasn't even mentioned at release, and then you can switch to it for specifically that task :)
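
For the curious, a minimal sketch of what such a private harness might look like. It assumes the official OpenAI Python client, a local `private_benchmarks.jsonl` file of prompt/expected pairs kept out of any public repo (the filename, field names, and substring-match scoring are all my own placeholders; use whatever scoring actually fits your cases):

```python
import json

# Assumes the official `openai` package (>= 1.0) and an OPENAI_API_KEY env var.
from openai import OpenAI

client = OpenAI()


def load_cases(path: str = "private_benchmarks.jsonl") -> list[dict]:
    """Each line is {"prompt": ..., "expected": ...}. Keep this file private."""
    with open(path) as f:
        return [json.loads(line) for line in f]


def run_benchmark(model: str, cases: list[dict]) -> float:
    """Return the fraction of cases where the expected string appears in the reply."""
    passed = 0
    for case in cases:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": case["prompt"]}],
            temperature=0,  # keep outputs as stable as possible across runs
        )
        answer = resp.choices[0].message.content or ""
        if case["expected"] in answer:
            passed += 1
    return passed / len(cases)


if __name__ == "__main__":
    cases = load_cases()
    # Swap in whichever newly released model you want to sanity-check.
    for model in ["gpt-4o", "gpt-4o-mini"]:
        print(f"{model}: {run_benchmark(model, cases):.0%} passed")
```

The point isn't the scoring logic, which is deliberately crude here, but that the test cases themselves never leave your machine.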
