I strongly disagree with the idea that validation sets are central to machine learning. The whole point of machine learning (usually) is to predict things well. Validation sets are merely one technique among many to gauge how well your predictions are doing. Because they are so easy, they are very common. But just because they are common doesn't mean they are central to the field. There are many other techniques out there, like Bayesian model selection (as the author mentions at the end).
Good to see Bayesian model selection get a mention. Bayesian model averaging is pretty interesting, too, in that it comes, in a sense, with built-in protection against overfitting.
I still think there is something quite fundamental, though, about validation sets and other related resampling-based methods for estimating generalisation performance (cross-validation, bootstrap, jackknife and so on).
The built-in picture you get about predictive performance from Bayesian methods comes with strong caveats -- "IF you believe in your model and your priors over its parameters, THEN this is what you should expect". Adding extra layers of hyperparameters and doing model selection or averaging over them might sometimes make things less sensitive to your assumptions, but it doesn't make this problem go away; anything the method tells you is dependent on its strong assumptions about the generative mechanism.
Most sensible people don't believe their models are true ("all models are false, some models are useful"), and don't really fully trust a method, fancy Bayesian methods included, until they've seen how well it does on held-out data. So then it comes back to the fundamentals -- non-parametric methods for estimating generalisation performance which make as few assumptions as possible about the data and the model they're evaluating.
Cross-validation isn't the only one of these, and perhaps not the best, but it's certainly one of the simplest. One thing people do forget about it is that it does make at least one basic assumption about your data -- independence -- which is often not true and can be pretty disastrous if you're dealing with (e.g.) time-series data.
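To make the independence point concrete, here's a small sketch (with made-up fold counts, not from any particular library's API) contrasting shuffled k-fold splits, which assume i.i.d. data, with forward-chaining splits that only ever train on the past:

```python
import numpy as np

def kfold_indices(n, k, rng):
    """Standard shuffled k-fold: implicitly assumes i.i.d. samples."""
    idx = rng.permutation(n)
    return [(np.setdiff1d(idx, fold), fold) for fold in np.array_split(idx, k)]

def forward_chaining_indices(n, k):
    """Time-series-safe splits: always train on the past, test on the future."""
    folds = np.array_split(np.arange(n), k + 1)
    return [(np.concatenate(folds[:i]), folds[i]) for i in range(1, k + 1)]

rng = np.random.default_rng(0)
n = 100

# Forward chaining never leaks the future into the training set:
for train, test in forward_chaining_indices(n, 4):
    assert train.max() < test.min()

# Shuffled k-fold routinely does; fine for i.i.d. data, bad for time series:
leaky = any(train.max() > test.min() for train, test in kfold_indices(n, 4, rng))
print(leaky)  # True
```

With autocorrelated data, the leaky splits let the model score well simply by interpolating between neighbouring time points, so the CV estimate of generalisation error ends up badly optimistic.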
I agree. As a Bayesian hoping to understand my data, P(X|M1) is useful: it's the probability I have for X under M1's modelling assumptions. Of course M1 is an approximation, but that's how science is done. You get to understand how your model behaves, and you may say "Well, X is a bit higher than it should be, but that's because M1 assumes a linear response, and we know that's not quite true".
Bayesian model averaging entails P(X) = P(X|M1)P(M1) + P(X|M2)P(M2). It assumes that either M1 or M2 is true, so no scientific conclusions can be drawn from it. It might be useful from a purely predictive standpoint (maybe), but it has no place inside the scientific pipeline.
There is a related quantity, the posterior odds P(M1)/P(M2). That's how much we favour M1 over M2 after seeing the data, and it's a sensible quantity, because it doesn't rely on the abominable assumption P(M1) + P(M2) = 1.
Yeah good perspective -- I guess I was thinking about this more from the perspective of predictive modelling than science.
Model averaging can be quite useful when you're averaging over versions of the same model with different hyperparameters, e.g. the number of clusters in a mixture model.
You still need a good hyper-prior over the hyperparameters to avoid overfitting in these cases, though. As an example, IIRC Dirichlet process mixture models can often overfit the number of clusters.
Agreed that model averaging could be harder to justify as a scientist comparing models which are qualitatively quite different.
> Model averaging can be quite useful when you're averaging over versions of the same model with different hyperparameters, e.g. the number of clusters in a mixture model.
Yeah, but in this case, there's a crucial difference: within the assumptions of a mixture model M, N=1, 2, ... clusters do make an exhaustive partition of the space, whereas if I compute a distribution for models M1 and M2, there is always M3, M4, ... lurking unexpressed and unaccounted for. In other words,
P(N=1|M) + P(N=2|M) + ... = 1
but
P(M1) + P(M2) << 1
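A quick numerical sketch of that first case, with invented per-N evidences (in practice you'd get these from fitting the mixture for each N; and note the sketch truncates at N=5, which is itself an assumption):

```python
import numpy as np

# Hypothetical log marginal likelihoods p(X | N, M) for N = 1..5 clusters.
log_evidence = np.array([-120.0, -105.0, -101.0, -102.5, -104.0])
log_prior = np.log(np.full(5, 1 / 5))  # uniform prior over N

# Posterior over N within model M.
log_post = log_prior + log_evidence
log_post -= np.logaddexp.reduce(log_post)  # normalise in log space
posterior = np.exp(log_post)

# Because N = 1..5 exhausts the (truncated) space, this is a proper
# distribution -- unlike P(M1) + P(M2) over an open-ended model space.
print(posterior.sum())         # 1.0 (up to float error)
print(posterior.argmax() + 1)  # 3
```

Predictions averaged with these weights are then coherent within M's assumptions, which is exactly the "crucial difference" above.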
Is the number of clusters even a hyperparameter? Wikipedia says that hyperparameters are parameters of the prior distribution. What do you think?
Great explanation. I would add that held-out data is often used in Bayesian learning too - for example, when you intentionally over-specify the model (adding more parameters than might be needed) because you don't really know what the best model might be. Inference then continues for as long as the likelihood on held-out data keeps increasing. One example is gesture recognition in the Kinect. If anyone finds this useful, I also recommend the Coursera course on Probabilistic Graphical Models.
> Validation sets are merely one technique among many to gauge how well your predictions are doing
In Andrew Ng's "Machine Learning" offering on Coursera he talks about having three sets of data:
1. Training data. He uses this for fitting most model parameters.
2. A second set for "more general" analyses -- judging the effects of additional data, regularisation parameters, neural-network topology etc. Performance on this data is used to decide which model to use and how to use it.
3. A third set to estimate how good the choice of model is.
The theory is that the parameters in #1 are fitted to the training data, and the model choice is "fitted" to the data in #2. Even though we think (hope?) that the inferences made in those two steps will generalise reasonably well, we should still expect measures of fit from those analyses to be optimistic. We need a set that has not been used for calibration to reliably estimate how good our model will be on data in the field.
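The three-way split described above can be sketched in a few lines (the 60/20/20 fractions are just an illustrative choice, not anything Ng prescribes):

```python
import numpy as np

def three_way_split(n, rng, frac_train=0.6, frac_val=0.2):
    """Shuffle indices and split into train / validation / test.

    - train: fit model parameters
    - validation: compare models, tune regularisation, topology, etc.
    - test: held out until the very end, for an honest estimate
    """
    idx = rng.permutation(n)
    n_train = int(frac_train * n)
    n_val = int(frac_val * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

rng = np.random.default_rng(42)
train, val, test = three_way_split(1000, rng)
print(len(train), len(val), len(test))  # 600 200 200
```

The discipline that matters is behavioural, not code: every decision you make after looking at validation scores "spends" some of that set's independence, so the test set must stay untouched until all decisions are final.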
Validation is a method to control for over-fitting, but over-fitting isn't a danger to all projects. Suppose we know that our dataset is i.i.d. normally distributed with known sigma. Using all available data to estimate the mean doesn't put us in danger of overfitting. And if you would like a posterior distribution for the mean, there are ways to produce that.
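For that known-sigma case, the posterior for the mean has a standard closed form (this is the textbook Normal-Normal conjugate update; the data and prior values below are made up for illustration):

```python
import numpy as np

def posterior_mean_known_sigma(x, sigma, mu0, tau0):
    """Conjugate update for the mean of a Normal with known sigma.

    Prior: mu ~ Normal(mu0, tau0^2).  Likelihood: x_i ~ Normal(mu, sigma^2).
    Returns the posterior mean and standard deviation of mu.
    """
    n = len(x)
    prec = 1 / tau0**2 + n / sigma**2                    # posterior precision
    mu_n = (mu0 / tau0**2 + x.sum() / sigma**2) / prec   # precision-weighted mean
    return mu_n, np.sqrt(1 / prec)

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=500)  # sigma = 1 assumed known
mu_n, sd_n = posterior_mean_known_sigma(x, sigma=1.0, mu0=0.0, tau0=10.0)
```

No data is held out here, and nothing is overfit: with one unknown and 500 observations, the posterior simply concentrates around the sample mean.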
Generally we're in danger of overfitting when the number of data points is comparable to, or smaller than, the number of parameters (including meta-parameters like which model to select).
What I just described is a perspective derived from Bayesian model selection. But Bayesian model selection encompasses other types of model selection; it need not be considered a separate path.
You are correct that validation sets are only one technique. But the concept of validation, and the reasoning that justifies it, is an absolutely central idea. What is the point of a model that doesn't aim to generalize?