
Reading the section on [synthetic data](https://huggingfacefw-blogpost-fineweb-v1.static.hf.space/di...) was eye-opening for me. The hockey-stick growth of words associated with typical ChatGPT output in the Common Crawl corpus over the past ~18 months is worrying.
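For anyone curious how a measurement like that might work in practice, here is a minimal sketch: it counts the fraction of documents containing a few phrases often associated with ChatGPT output, across dated corpus snapshots. The phrase list and the snapshot structure are my own illustrative assumptions, not the methodology from the post.

```python
# Minimal sketch of tracking ChatGPT-associated phrases across corpus
# snapshots. MARKER_PHRASES is an illustrative list of my own choosing,
# not the one used in the FineWeb analysis.
MARKER_PHRASES = [
    "as an ai language model",
    "it's important to note",
    "delve into",
    "in conclusion,",
]

def marker_rate(documents: list[str]) -> float:
    """Fraction of documents containing at least one marker phrase."""
    hits = sum(
        any(p in doc.lower() for p in MARKER_PHRASES) for doc in documents
    )
    return hits / len(documents) if documents else 0.0

# Hypothetical snapshots: {dump_date: list of document texts}.
snapshots = {
    "2022-10": ["plain old web text...", "a recipe blog post"],
    "2024-04": ["As an AI language model, I cannot...", "Let's delve into this"],
}

for date, docs in sorted(snapshots.items()):
    print(f"{date}: {marker_rate(docs):.1%} of documents contain a marker phrase")
```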


It might be worrying, but they also point out that the measured quality seems to go up. Perhaps people overestimate how good random web scrapes really are, and so expect ChatGPT output to worsen corpora on average rather than improve them...


I am by no means a data scientist, but if, as a large language model, ChatGPT was trained to optimize the same "quality" metrics used to evaluate the models trained on these random web scrapes, and ChatGPT output now makes up a larger proportion of those scrapes, wouldn't the measured "quality" increase as a result? It all seems intertwined.

In other words, are we just overfitting?

It's important to note that the tests they use appear to be publicly available, for example https://huggingface.co/datasets/lighteval/mmlu.
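If the benchmark questions are public, one crude way to probe this worry is to check whether they appear verbatim in the training corpus (a basic contamination check). A minimal sketch using the `datasets` library and exact substring matching; the config name "abstract_algebra" and the `question` field are assumptions about how that dataset repo is organized, and real contamination checks typically use n-gram or fuzzy matching rather than exact matches:

```python
from datasets import load_dataset

def contaminated_fraction(corpus: list[str], questions: list[str]) -> float:
    """Fraction of benchmark questions found verbatim in the corpus."""
    blob = "\n".join(corpus).lower()
    hits = sum(q.lower() in blob for q in questions)
    return hits / len(questions) if questions else 0.0

# Load one MMLU subject from the dataset linked above; the config name
# and field name are assumptions about the repo layout.
mmlu = load_dataset("lighteval/mmlu", "abstract_algebra", split="test")
questions = [row["question"] for row in mmlu]

# `corpus` stands in for the web-scrape documents being evaluated.
corpus = ["some scraped page text...", "another document"]
print(f"{contaminated_fraction(corpus, questions):.1%} of questions appear verbatim")
```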

Again, I could be totally ignorant of how these things work. (edited to add key words associated with ChatGPT output in order to increase the quality of my comment :))



