They seem to keep adding categories though, which makes me suspect that it is all about ML training. Recently it's chimneys and bridges (although bridges may be older).
The likely answer is a bit of both. They use the image tests because it's something that is still kind of hard for computers to do, and then use a small percentage of the boxes as unknown tests to improve some ML algorithm. Unfortunately, as computer vision has gotten better, they've had to make the challenges harder, to the point where they're quite low quality and sometimes count very small features as qualifying an image. My least favorite is labeling 'cars', because it can be hard to tell whether it wants you to count cars way off in the distance through the adversarial noise they add to the images.