
Thanks for clarifying. Part of the reason why training a new model from scratch is so difficult is the amount of dataset-specific tweaking and fine-tuning that goes on and that is rarely discussed in publications. For example, papers will describe the architecture of a system but will rarely go into much detail about how a particular component was selected over another (why this cost function over that one, etc.). This fine-tuning is essential to achieving state-of-the-art results, and without it you can expect to produce a model that is simply not the same as the original, certainly not in terms of performance metrics.
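
To make that concrete, here's a minimal, self-contained sketch (PyTorch, synthetic data; not any particular paper's code) of the kind of component selection that usually goes unreported: train the same tiny model under a few candidate loss functions and keep whichever scores best on a held-out split.

    # Toy sketch only: synthetic data, tiny model. The point is the
    # selection loop itself, which papers rarely document.
    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    X = torch.randn(512, 20)
    y = (X[:, 0] > 0).long()          # synthetic binary labels
    Xtr, ytr, Xva, yva = X[:400], y[:400], X[400:], y[400:]

    def train_and_score(loss_fn):
        model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 2))
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)
        for _ in range(200):          # full-batch training, for brevity
            opt.zero_grad()
            loss_fn(model(Xtr), ytr).backward()
            opt.step()
        with torch.no_grad():         # validation accuracy
            return (model(Xva).argmax(1) == yva).float().mean().item()

    candidates = {
        "cross_entropy": nn.CrossEntropyLoss(),
        "label_smoothing": nn.CrossEntropyLoss(label_smoothing=0.1),
    }
    scores = {name: train_and_score(fn) for name, fn in candidates.items()}
    print(scores)  # the winning choice often appears in the paper with no rationale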

Basically, retraining the same system on a different dataset is a bit like stepping into the same river at two points in time. Yes, in theory it's "the same" system and only the dataset changes. In practice, it takes so much architectural tweaking and fine-tuning that it's a different system and you've done almost all the work from scratch. This is also part of why so much fanfare surrounds the publication of a new model: it's really, really hard to train a new model on a new dataset (assuming one wants to get state-of-the-art results).
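
As a toy illustration of that drift (all names and values hypothetical), here's how a "same architecture, new dataset" recipe tends to look once the dataset-specific overrides pile up:

    # Hypothetical config sketch: "the same" architecture, retargeted to a
    # new dataset, accumulates overrides until little of the original
    # recipe survives. All values are illustrative, not from any real model.
    BASE = dict(depth=12, width=768, dropout=0.1, lr=3e-4, warmup=10_000,
                loss="cross_entropy", augment=("flip", "crop"))

    DATASET_TWEAKS = {
        "original_corpus": {},
        "new_corpus": dict(depth=16, dropout=0.3, lr=1e-4, warmup=40_000,
                           loss="label_smoothing", augment=("flip",)),
    }

    def config_for(dataset):
        return {**BASE, **DATASET_TWEAKS[dataset]}

    changed = {k for k, v in config_for("new_corpus").items() if BASE[k] != v}
    print(changed)  # most of the "same" recipe has in fact changed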


