
I think the idea is that "calling predict and tuning on a model with the test set" is the "overfitting". It's not actual overfitting in the usual ML sense; it's as if the researcher is performing "descent" to find the best hyper-parameters. The problem is that if we use the test set to find those hyper-parameters, we have no idea how well the model does in the real world/in general. We'd need another held-out set to figure that out - and we're back where we started.
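
As a rough sketch of the distinction (my own toy example, using scikit-learn and made-up hyper-parameter values, not something from the paper): tune on a validation split, and touch the test split exactly once at the end.

    # Minimal sketch: validation set drives the "descent" over hyper-parameters,
    # test set gives one untuned estimate of real-world performance.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    X, y = make_classification(n_samples=1000, random_state=0)

    # Hold out a test set that is never used during tuning.
    X_trainval, X_test, y_trainval, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)
    # Split the rest into train and validation.
    X_train, X_val, y_train, y_val = train_test_split(
        X_trainval, y_trainval, test_size=0.25, random_state=0)

    best_C, best_val_acc = None, -1.0
    for C in [0.01, 0.1, 1.0, 10.0]:  # hypothetical hyper-parameter grid
        model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
        val_acc = accuracy_score(y_val, model.predict(X_val))
        if val_acc > best_val_acc:
            best_C, best_val_acc = C, val_acc

    # Only now evaluate on the test set, once.
    final_model = LogisticRegression(C=best_C, max_iter=1000).fit(X_trainval, y_trainval)
    print("test accuracy:", accuracy_score(y_test, final_model.predict(X_test)))

If the loop above had selected C by test-set accuracy instead, the final number would no longer tell you anything about generalization - which is exactly the point.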

