If they actually want it to work as intelligently as possible, they'll begin taking these complaints into consideration and building in a wisdom curating feature where people can contribute.
This much is obvious, but they seem to be satisfied with theory over practicality.
Anyway I'm just ranting b/c they haven't paid me.
How about an off the wall algorithm to estimate how much each scraped input turns out to influence the bigger picture, as a way to work towards satisfying the copyright question.
If you train your model to prioritize real photos (as they're often more accurate representations than artistic ones), you might wind up with Denzel Washington as the archetype; https://en.wikipedia.org/wiki/The_Tragedy_of_Macbeth_(2021_f....
There's a vast gap between human understanding and what LLMs "understand".