
Look at ImageNet results. Test-set accuracy (top-1) on ImageNet is 85% with deep nets, versus 50% for the best of all other approaches after decades of CV work.


Let's remember, though, that ImageNet is not a good representation of reality.

See e.g. performance on ObjectNet, https://objectnet.dev/, when trained on ImageNet. For the same classes, we see _dramatic_ drops in accuracy.
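
A minimal sketch of that kind of evaluation, assuming a recent torchvision and ObjectNet's overlapping classes unpacked one folder per class. The path and the class-index mapping below are placeholders, not ObjectNet's real layout:

    import torch
    from torchvision import models, datasets

    weights = models.ResNet50_Weights.IMAGENET1K_V2
    model = models.resnet50(weights=weights).eval()

    # Placeholder mapping from folder names to ImageNet-1k class
    # indices; the full table for the overlapping classes is omitted.
    overlap_to_imagenet = {"backpack": 414, "banana": 954}

    dataset = datasets.ImageFolder("objectnet/overlap",
                                   transform=weights.transforms())
    to_imagenet = torch.tensor([overlap_to_imagenet[c]
                                for c in dataset.classes])

    loader = torch.utils.data.DataLoader(dataset, batch_size=64)
    correct = total = 0
    with torch.no_grad():
        for images, labels in loader:
            preds = model(images).argmax(dim=1)
            correct += (preds == to_imagenet[labels]).sum().item()
            total += labels.numel()
    print(f"top-1 on ObjectNet overlap: {correct / total:.1%}")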


Also see:

Do ImageNet Classifiers Generalize to ImageNet?

We build new test sets for the CIFAR-10 and ImageNet datasets. Both benchmarks have been the focus of intense research for almost a decade, raising the danger of overfitting to excessively re-used test sets. By closely following the original dataset creation processes, we test to what extent current classification models generalize to new data. We evaluate a broad range of models and find accuracy drops of 3% - 15% on CIFAR-10 and 11% - 14% on ImageNet. However, accuracy gains on the original test sets translate to larger gains on the new test sets. Our results suggest that the accuracy drops are not caused by adaptivity, but by the models' inability to generalize to slightly "harder" images than those found in the original test sets.

https://arxiv.org/abs/1902.10811
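
The "larger gains" claim is about the slope of a linear fit of new-test-set accuracy against original-test-set accuracy across models. A quick sketch of that computation, with placeholder accuracies rather than the paper's actual measurements:

    import numpy as np

    # Placeholder (original, new) top-1 accuracies for four models --
    # illustrative values only, not the paper's numbers.
    orig = np.array([0.63, 0.70, 0.76, 0.80])
    new = np.array([0.49, 0.58, 0.65, 0.71])

    slope, intercept = np.polyfit(orig, new, deg=1)
    print(f"fit: new = {slope:.2f} * orig {intercept:+.2f}")
    # slope > 1: a gain on the original test set corresponds to an
    # even larger gain on the new test set.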


Nice, glad to see this exists.


The passage I quote above speaks of "out-of-sample" generalisation, not "test-set" generalisation. These are not the same.

Unfortunately, such terminological confusion is common, but "out-of-sample" should really be reserved for data that was not available during development of a system, whether as a training, evaluation, or testing partition. That is because "out of sample" suggests the data was drawn from a distribution other than the, well, training sample (where the training sample is then subdivided into training, evaluation, and testing partitions), i.e. from the true distribution: the real world.

I guess the OP is instead using "out-of-sample" to mean "test set" (which is not uncommon), but in that case we don't need to look all the way to learning theory to figure it out: published results are well known to select for successful experiments, in machine learning as in many areas of research, unfortunately.
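
For what it's worth, the distinction is easy to state in code: every partition below is carved out of one collected sample, so even the held-out "test" split is in-sample in the sense above. A sketch with synthetic data (split sizes are arbitrary):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    # One collected sample; everything derived from it shares its biases.
    X, y = make_classification(n_samples=1000, random_state=0)

    # 60/20/20 train / evaluation / test, all from the same sample.
    X_dev, X_test, y_dev, y_test = train_test_split(
        X, y, test_size=0.20, random_state=0)
    X_train, X_eval, y_train, y_eval = train_test_split(
        X_dev, y_dev, test_size=0.25, random_state=0)

    # Truly out-of-sample data would be gathered separately, after
    # development -- e.g. ObjectNet for a model developed on ImageNet.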



