The main trick in machine learning (edinburghhacklab.com)
183 points by tlarkworthy on Dec 9, 2013 | hide | past | favorite | 71 comments


A professor of mine stated it very well. If you can imagine that there is a true model somewhere out in infinitely large model space then ML is just the search for that model.

In order to make it tractable, you pick a finite model space, train it on finite data, and use a finite algorithm to find the best choice inside of that space. That means you can fail in three ways---you can over-constrain your model space so that the true model cannot be found, you can underpower your search so that you have less ability to discern the best model in your chosen model space, and you can terminate your search early and fail to reach the best model entirely.

Almost all error in ML can be seen nicely in this framework. In particular here, those who forget to optimize validation accuracy often make their model space too large (overfitting), at the cost of having too little data to power the search within it.

Devroye, Gyorfi, and Lugosi (http://www.amazon.com/Probabilistic-Recognition-Stochastic-M...) have a really great picture of this in their book.
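The three failure modes above can be seen numerically in a few lines. A small sketch (assuming numpy is available; the sine target, noise level, and polynomial degrees are invented for illustration): degree 1 over-constrains the space, while higher degrees fit the training data ever better at the cost of search power.

```python
import numpy as np

rng = np.random.default_rng(0)

# The "true model" lives outside every polynomial space we search
x_train = rng.uniform(-1, 1, 30)
y_train = np.sin(np.pi * x_train) + rng.normal(0, 0.2, 30)
x_val = rng.uniform(-1, 1, 200)
y_val = np.sin(np.pi * x_val) + rng.normal(0, 0.2, 200)

def fit_and_score(degree):
    """Least-squares fit in the space of degree-d polynomials,
    scored on the training data and on held-out validation data."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    val_mse = np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)
    return train_mse, val_mse

for d in (1, 3, 9):
    print(d, fit_and_score(d))
```

Training error only ever goes down as the model space grows, which is exactly why it can't be the thing you optimize.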


> In order to make it tractable, you pick a finite model space, train it on finite data, and use a finite algorithm to find the best choice inside of that space. That means you can fail in three ways---you can over-constrain your model space so that the true model cannot be found, you can underpower your search so that you have less ability to discern the best model in your chosen model space, and you can terminate your search early and fail to reach the best model entirely.

It seems like you can "mis-power" your model also.

For example, the Ptolemaic system could approximate the movement of the planets to any degree if you added enough "wheels within wheels" but since these were "the wrong wheels", the necessary wheels grew without bounds to achieve reasonable approximation over time.


Also, to add, when DGL bring this kind of mental model up they do so to motivate a kind of semi-parametric modeling where the design space changes progressively to move closer to the true model without growing so quickly as to make inference unstable. The problem being, of course, that this causes your algorithm run time to blow out to something cubic, I think, and so you have a beautiful model that loses out on search error.


Totally agree, though I'd call that maybe "picking the wrong shape" for your model space. You can pay a whole lot but if you cannot admit a shape that gets close to the truth then you're spending your data in vain.


Funny enough, the wheels within wheels are equivalent to a Fourier transform.
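That equivalence can be demonstrated directly: a discrete Fourier series writes a closed path in the complex plane as a sum of rotating "wheels", one rotating vector per integer frequency. A minimal sketch (the naive O(N²) DFT and the circle test path are just for illustration):

```python
import cmath

def dft_coeffs(path):
    """Naive DFT of complex path samples: each coefficient gives the
    radius and phase of one epicycle ('wheel') at integer frequency k."""
    N = len(path)
    return [sum(path[n] * cmath.exp(-2j * cmath.pi * k * n / N)
                for n in range(N)) / N
            for k in range(N)]

def reconstruct(coeffs, n, K):
    """Position at time step n using only the K largest wheels."""
    N = len(coeffs)
    wheels = sorted(enumerate(coeffs), key=lambda kc: -abs(kc[1]))[:K]
    return sum(c * cmath.exp(2j * cmath.pi * k * n / N) for k, c in wheels)
```

A circular orbit needs exactly one wheel; adding wheels approximates any sampled path, which is precisely why epicycles could "fit" anything.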


> For example, the Ptolemaic system could approximate the movement of the planets to any degree if you added enough "wheels within wheels" but since these were "the wrong wheels", the necessary wheels grew without bounds to achieve reasonable approximation over time.

That would be an example of over-constraining your model (i.e. imposing the arbitrary constraint of a stationary Earth).


I don't think this is a useful way to phrase the situation.

A system of Ptolemaic circles can approximate the paths taken by any system. So the system really isn't absolutely constrained to follow or not follow any given path.

You could claim you have constrained your model not to be some other better model but that, again, seems like a poor way to phrase things since a more accurate model is also constrained not to be a poor model.

Specifically, the Newtonian/Keplerian system has the constraint of the sun being stationary as much as the Ptolemaic system has the constraint of the earth being stationary.

Edit: As Eru points out, the Ptolemaic system basically uses the Fourier transform to represent paths. Thus the approximation is actually completely unconstrained in the space of paths, that is, it can approximate anything. But by that token, the fact that it can approximate a given path explains nothing, and the choices that are simple in this system are not necessarily the best choices for the given case, estimating planetary motion.

See - http://en.wikipedia.org/wiki/Deferent_and_epicycle


That's a good point, but after re-reading tel's original comment, I think my statement is still correct. Notice that tel's statement was that "you can over-constrain your model space so that the true model cannot be found". This doesn't necessarily mean constraining your model so that the true model is excluded from your parameter space. If your constraints technically encompass the true solution but only admit an overly complex parametrization of the solution, then it will still reduce (perhaps drastically) your power to find the true model. In this case, "overly complex" means unnecessarily many nonzero (or not almost zero) coefficients in the Fourier series.


My argument is that there are two kind of situations:

* The model could encompass the behavior of the input in a smooth fashion if its basic parameters are relaxed.

* The model would tend to start finding models that are wildly different from the main model at the edges (space and time) if its parameters are relaxed, even if the model would eventually find the real model with enough input and training.

One has to handle these two conditions differently, right?


> (i.e. imposing the arbitrary constraint of a stationary Earth).

It's not really arbitrary--given the understanding at the time, there was no ability to measure the motion of the earth. In particular, stellar parallax, which was understood as a counter-indication, was still too small to measure. So a non-stationary Earth went rather strongly against what they knew at the time.

That said, relativity comes back and makes choosing a frame of reference arbitrary in the end, though some are easier to do physics in than others.


Yes, that's true, I'm intentionally ignoring history and calling it arbitrary from our modern perspective.


Awesome example!


This is a great description, I've tried to explain something like this to novices before -- but this is amazingly eloquent.


I wouldn't use the term "error" because some people might take that to mean there was a way to avoid these problems.

Over-constraining your model space means having too few parameters in your model. But for a fixed data size, the "power" of your search goes down when you increase the number of parameters.

So it is not so much an issue of avoiding errors, but of choosing the right number of parameters for your model.


I called them errors because that's usually the technical term for them, but the real point, as always, is tradeoffs. The less "finite" a ML setup you want to buy, the less "error".


I really like that explanation. I've never heard it in those terms before, but it makes a lot of sense. Thanks for sharing.


I applaud the author of this post. I've seen a lot of people suffer with machine learning because they don't understand this basic concept. Taking MOOC classes and reading textbooks is a great way to learn, but they tend to focus a lot on the mathematical principles and not the start-from-nothing practical considerations.

Machine learning is almost like learning chess in that there are certain obvious mistakes that noobs continue to make. And like chess there are multiple levels of thinking and understanding that are almost impossible to teach to someone that doesn't have lots of experience. Hopefully more blog posts like this will help people get past the novice level.

Regarding technical content:

N-fold cross-validation [1] can be a more effective approach than a single held-out validation set. You split your data into N groups, say N = 10. Then you use groups 2-10 as a training set to make predictions on group 1, then groups 1 and 3-10 to make predictions on group 2, etc. Recombine the prediction output files and use the measured error to tune and tweak your predictor. It's more work and can still lead to overfitting, but it's generally better to overfit the entire training set than it is to overfit one held-out sample.

[1] http://en.wikipedia.org/wiki/Cross-validation_%28statistics%...
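The procedure above can be sketched in a few lines of plain Python (the `fit`/`predict` names are mine, and the mean predictor in the usage note is just a stand-in for a real model):

```python
def k_fold_indices(n, k):
    """Partition indices 0..n-1 into k roughly equal, contiguous folds."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(data, k, fit, predict):
    """Out-of-fold predictions: each point is predicted by a model
    trained on the other k-1 folds, then all predictions are recombined."""
    preds = [None] * len(data)
    for fold in k_fold_indices(len(data), k):
        held = set(fold)
        train = [data[i] for i in range(len(data)) if i not in held]
        model = fit(train)
        for i in fold:
            preds[i] = predict(model, data[i])
    return preds
```

For instance, with `fit=lambda train: sum(train) / len(train)` (predict the training mean), every point gets a prediction from a model that never saw it.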


The Coursera ML class has a week of lectures specifically regarding practical considerations. The prof discusses how to solve for underfitting/overfitting, and spends a lot of time on this idea of a cross-validation set. To whoever reads this: it's a good course!


Thanks. You have understood the exact level I am pitching at. The chess analogy is great. I have seen too many learners move too fast and focus on inconsequential optimizations instead of getting the basics right.

Of course the basic topic of validation gets pretty deep fairly quickly too. Out-of-bag scores, anyone?


I work at a company that sells applied machine learning services, so I'd like to add a few more tricks to machine learning:

1) Have lots of data

2) Accept the possibility that your problem domain cannot be generalized.

I always find, whether in academic literature or on message boards, a desire to fit every round peg into a square hole. The reality of real-world data is that sometimes, it's just a 50/50 coin toss. This might be because the features that really indicate some sort of pattern can't be defined, or they can be but the data can't be reliably retrieved, or the humans running things have a poor understanding of the problem domain to start with.

TL;DR: There's no magic


My experience with real world (but still academic) data has been that there is lots of magic---feature selection to be specific.

(I'm not disagreeing, just referring to a different kind of "magic")

Everything else matters, but when your ML doesn't work it's 100% a feature selection problem. Which usually means it's 99% a problem of getting lots of domain expertise jammed up against a lot of ML experience and mathematical understanding. It's also a bear.


The way 80% of real real world (non academic) data mining problems are solved:

1. Feature selection.

2. Intelligent data massage. Real-world data usually has noise that humans can easily identify as irrelevant or erroneous.

3. Logistic regression.

Starting with simple, well understood algorithms first should be the second lesson after knowing about validation sets. In those cases where they are not enough, they set the baseline for comparison against other algorithms.
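Step 3 needs nothing fancy; for reference, a from-scratch sketch of logistic regression on a single feature (the learning rate and step count here are arbitrary choices, not a recommendation):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.5, steps=2000):
    """Logistic regression p(y=1|x) = sigmoid(w*x + b),
    fitted by batch gradient descent on the log-loss."""
    w = b = 0.0
    n = len(xs)
    for _ in range(steps):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y  # gradient of log-loss wrt the logit
            gw += err * x
            gb += err
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b

def classify(params, x):
    w, b = params
    return 1 if sigmoid(w * x + b) >= 0.5 else 0
```

This kind of transparent baseline is exactly what you want before reaching for anything more exotic.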


I would add 4. Ensemble methods. Having a few models helps to generalize the data fairly well.


That's a good point.


Yeah I agree both with you and the ancestor. To get good results you basically have to provide the solution to the algorithm through great features. Then the choice of algorithm is not so important, as all the hard work is done in representation. So the algorithms are not magic, you have to drop the solution in front of them and all they do is find the optimal parameters for some trivial low dimensional separation manifold.


Agreed. I too found feature extraction and selection to be one of the vital components that help improve scores in a lot of the ML competitions hosted on Kaggle.


It's quite sad that this post is even necessary. That said, having a proper training/cross-validation/validation setup is sometimes not that obvious, as you have to stop and think about possible sources of contamination -- some sampling biases, for instance, can be quite tricky to detect, or your algorithm design might be flawed in some subtle way.

Personally, I wish people emphasized more the importance of a general understanding of econometrics when doing machine learning. In most of the introductory courses I've seen, the link between both fields is never made explicit, despite the obvious analogies (coincidentally, there was an article by Hal Varian on the front page two days ago that discussed how both fields could benefit from sharing insights [1]). Understanding the idea behind minimizing generalization error is one thing, but I find that thinking in terms of internal/external validity and experiment design often gives people a more intuitive understanding of validation procedures, both regarding why and how we should do it. The same goes for understanding effect size, confidence intervals, causality (and causality inference), and so on.

[1] https://news.ycombinator.com/item?id=6870387


> stop and think about possible sources of contamination

One great one from my Machine Learning professor was an assignment where we were required to normalize our data to [0,1]. After doing this and then going through the typical cross-validation cycle, he had us try and figure out where we contaminated our validation sets. As it turns out, we all normalized our data before splitting it up, which meant information leaked between the training and testing data.

It's a simple fix, but if you've done that and gone to run a large convolutional neural network for a week only to find that you made a stupid error like that, it can be pretty painful. (Especially since the bad generalization error might not be obvious until you use the model in production.)
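A toy version of that bug (the numbers are invented, and min-max scaling stands in for whatever normalization you use):

```python
def minmax_params(xs):
    """Fit step of a [0, 1] scaler: just the min and max."""
    return min(xs), max(xs)

def scale(xs, lo, hi):
    return [(x - lo) / (hi - lo) for x in xs]

data = [5.0, 1.0, 9.0, 3.0, 12.0, 0.5]
train, test = data[:4], data[4:]

# Wrong: statistics computed on the full dataset let the test
# points influence how the training data is scaled.
lo_bad, hi_bad = minmax_params(data)

# Right: fit the scaler on the training split only, then apply
# the frozen parameters to both splits.
lo, hi = minmax_params(train)
train_scaled = scale(train, lo, hi)
test_scaled = scale(test, lo, hi)
```

With the leaky version, test extremes get squeezed neatly into [0, 1]; with the correct version they can land outside it, which is exactly the kind of surprise the model has to cope with in production.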


Maybe one could benefit from a sort of blinding procedure, where the person designing the learner is never allowed to even look at the validation data.


If both your training and testing datasets are representative of actual data, wouldn't the normalization function be nearly equivalent in both datasets?


I'm a bit late to the conversation but I agree with you and just wanted to add my quick two cents.

I used to work in algorithmic trading (the kind which aims to build consistent viable portfolios, not the HFT arms race).

This of course relies heavily on building your model, which can be anything from simple linear regressions to more advanced techniques commonly associated with the buzzword machine learning; what follows applies to all predictive methods. You begin by searching the training data to find optimal model parameters and then verifying performance on the validation set. The number ONE mistake I saw was that when people got bad results on the CV set, they went back to step 1.5 instead of just throwing the whole model out: take the same core idea, tweak it slightly, add/remove a few parameters, and restart the process. Unfortunately, do this enough times and your CV set starts to become the training set. That leaves your true validation set to be the day you turn it on live in production with real money.

It's never a good feeling to see your positively skewed returns in your training, testing and "CV" set morph into essentially a mean zero random distribution in production. This was quite an important lesson to learn for me.


I strongly disagree with the idea that validation sets are central to machine learning. The whole point of machine learning (usually) is to predict things well. Validation sets are merely one technique among many to gauge how well your predictions are doing. Because they are so easy, they are very common. But just because they are common doesn't mean they are central to the field. There are many other techniques out there, like Bayesian model selection (as the author mentions at the end).


Good to see Bayesian model selection get a mention. Bayesian model averaging is pretty interesting, too, in that it comes, in a sense, with built-in protection against overfitting.

I still think there is something quite fundamental, though, about validation sets and other related resampling-based methods for estimating generalisation performance (cross-validation, bootstrap, jackknife and so on).

The built-in picture you get about predictive performance from Bayesian methods comes with strong caveats -- "IF you believe in your model and your priors over its parameters, THEN this is what you should expect". Adding extra layers of hyperparameters and doing model selection or averaging over them might sometimes make things less sensitive to your assumptions, but it doesn't make this problem go away; anything the method tells you is dependent on its strong assumptions about the generative mechanism.

Most sensible people don't believe their models are true ("all models are false, some models are useful"), and don't really fully trust a method, fancy Bayesian methods included, until they've seen how well it does on held-out data. So then it comes back to the fundamentals -- non-parametric methods for estimating generalisation performance which make as few assumptions as possible about the data and the model they're evaluating.

Cross-validation isn't the only one of these, and perhaps not the best, but it's certainly one of the simplest. One thing people do forget about it is that it does make at least one basic assumption about your data -- independence -- which is often not true and can be pretty disastrous if you're dealing with (e.g.) time-series data.


I agree. As a Bayesian hoping to understand my data, P(X|M1) is useful: it's the probability I have for X under M1's modelling assumptions. Of course M1 is an approximation, but that's how science is done. You get to understand how your model behaves, and you may say "Well, X is a bit higher than it should be, but that's because M1 assumes a linear response, and we know that's not quite true".

Bayesian model averaging entails P(X) = P(X|M1)P(M1) + P(X|M2)P(M2). It assumes that either M1 or M2 is true. No conclusions can be derived from that. It might be useful from a purely predictive standpoint (maybe), but it has no place inside the scientific pipeline.

There is a related quantity which is P(M1)/P(M2). That's how much the data favours M1 over M2, and it's a sensible formula, because it doesn't rely on the abominable P(M1) + P(M2) = 1.
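The averaging formula is easy to make concrete. A toy example with two coin models (the bias values, the data, and the equal prior weights are invented for illustration):

```python
from math import comb

def marginal_likelihood(p, heads, n):
    """P(data | M) for a model that fixes the coin's bias at p."""
    return comb(n, heads) * p ** heads * (1 - p) ** (n - heads)

heads, n = 8, 10
l_fair = marginal_likelihood(0.5, heads, n)  # M1: fair coin
l_bias = marginal_likelihood(0.8, heads, n)  # M2: coin with p = 0.8

# Posterior model probabilities, assuming P(M1) + P(M2) = 1
post_fair = l_fair / (l_fair + l_bias)
post_bias = l_bias / (l_fair + l_bias)

# Model-averaged prediction: P(next heads) = sum_i P(heads|Mi) P(Mi|data)
p_next = post_fair * 0.5 + post_bias * 0.8
```

Note that the normalization step bakes in exactly the assumption objected to above: that either M1 or M2 is true.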


Yeah good perspective -- I guess I was thinking about this more from the perspective of predictive modelling than science.

Model averaging can be quite useful when you're averaging over versions of the same model with different hyperparameters, e.g. the number of clusters in a mixture model.

You still need a good hyper-prior over the hyperparameters to avoid overfitting in these cases, though; for example, IIRC Dirichlet process mixture models can often overfit the number of clusters.

Agreed that model averaging could be harder to justify as a scientist comparing models which are qualitatively quite different.


> Model averaging can be quite useful when you're averaging over versions of the same model with different hyperparameters, e.g. the number of clusters in a mixture model.

Yeah, but in this case, there's a crucial difference: within the assumptions of a mixture model M, N=1, 2, ... clusters do make an exhaustive partition of the space, whereas if I compute a distribution for models M1 and M2, there is always M3, M4, ... lurking unexpressed and unaccounted for. In other words,

P(N=1|M) + P(N=2|M) + ... = 1

but

P(M1) + P(M2) << 1

Is the number of clusters even a hyperparameter? Wiki says that hyperparameters are parameters of the prior distribution. What do you think?


Great explanation. I would like to add that held-out data is often used in Bayesian learning too--for example, in cases when you intentionally over-specify the model (adding more parameters than might be needed) because you don't really know what the best model might be. The inference runs for as long as the likelihood on held-out data keeps increasing. An example is gesture recognition in Kinect. If someone finds this info useful, I also recommend the Coursera course on Probabilistic Graphical Models.


What are some good resources to understand Bayesian model averaging?


These slides have a bit on this (although quite dense material): http://www.gatsby.ucl.ac.uk/teaching/courses/ml1-2011/lect5b... as part of http://www.gatsby.ucl.ac.uk/teaching/courses/ml1-2011.html

I quite like "Bayesian reasoning and machine learning" too: http://web4.cs.ucl.ac.uk/staff/D.Barber/textbook/090310.pdf


> Validation sets are merely one technique among many to gauge how well your predictions are doing

In Andrew Ng's "Machine Learning" offering on Coursera he talks about having three sets of data:

1. Training data. He uses this for fitting most model parameters.

2. A second set for "more general" analyses -- judging the effects of additional data, regularisation parameters, neural-network topology etc. Performance on this data is used to decide which model to use and how to use it.

3. A third set to estimate how good the choice of model is.

The theory is that the parameters in #1 are fitted to the training data, and the model choice is "fitted" to the data in #2. Even though we think (hope?) that the inferences made in those two steps will generalise reasonably well, we should still expect measures of fit from those analyses to be optimistic. We need a set that has not been used for calibration to reliably estimate how good our model will be on data in the field.
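The three sets above are usually produced with a single shuffled split; a sketch (the 60/20/20 fractions are a common default, not a rule):

```python
import random

def three_way_split(data, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle once, then carve off the test and validation sets;
    the remainder is the training set."""
    items = list(data)
    random.Random(seed).shuffle(items)  # seeded for reproducibility
    n_test = int(len(items) * test_frac)
    n_val = int(len(items) * val_frac)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test
```

Keeping the split fixed (via the seed) matters: reshuffling between experiments quietly lets every set see every point.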


Validation is a method to control for over-fitting, but over-fitting isn't a danger to all projects. Suppose we know that our dataset is iid normally distributed with known sigma. Using all available data to find the mean doesn't put us in danger of overfitting. And if you would like a posterior on the true disposition of the mean, there are ways to produce that.

Generally we're in danger of overfitting when the cardinality of our data is comparable to or less than the cardinality of our parameters (including meta-parameters like which model to select).

What I just described is a perspective derived from Bayesian model selection. But Bayesian model selection encompasses other types of model selection; it need not be considered a separate path.


You should start at the basics, though. And validation sets were the start of the literature.


You are correct that validation sets are only one technique. But, the concept of validation, and the reasoning/justification to do it is an absolutely central idea. What is the point of a model that doesn't aim to generalize?


In more formal terms, you are trying to minimize the expected risk (generalization error).

The expected risk is bounded by the sum of the empirical risk (training-set error) and a structural risk term (model complexity).

In many instances, having low empirical risk comes at the cost of having high structural risk, which is overfitting.


I was just browsing through the classic "Mining of Massive Datasets" book (which is free!) when I noticed this apt passage in its introduction that explains the difference between data mining and machine learning:

http://infolab.stanford.edu/~ullman/mmds.html

> There are some who regard data mining as synonymous with machine learning. There is no question that some data mining appropriately uses algorithms from machine learning. Machine-learning practitioners use the data as a training set, to train an algorithm of one of the many types used by machine-learning practitioners, such as Bayes nets, support-vector machines, decision trees, hidden Markov models, and many others.

> There are situations where using data in this way makes sense. The typical case where machine learning is a good approach is when we have little idea of what we are looking for in the data. For example, it is rather unclear what it is about movies that makes certain movie-goers like or dislike it. Thus, in answering the "Netflix challenge" to devise an algorithm that predicts the ratings of movies by users, based on a sample of their responses, machine-learning algorithms have proved quite successful. We shall discuss a simple form of this type of algorithm in Section 9.4.

> On the other hand, machine learning has not proved successful in situations where we can describe the goals of the mining more directly. An interesting case in point is the attempt by WhizBang! Labs to use machine learning to locate people's resumes on the Web. It was not able to do better than algorithms designed by hand to look for some of the obvious words and phrases that appear in the typical resume. Since everyone who has looked at or written a resume has a pretty good idea of what resumes contain, there was no mystery about what makes a Web page a resume. Thus, there was no advantage to machine-learning over the direct design of an algorithm to discover resumes.



Will you need to change that definition if I show you a machine learning algorithm capable of significantly outperforming the best human algorithms on the resume classification problem?


I would say, to be succinct, that the main trick in ML is Occam's Razor (http://en.wikipedia.org/wiki/Occam%27s_razor).

It has been found that, for most problems, a simple model which represents previous experience well should be preferred to a more complex one with marginally better fit. I would claim that the reason this generally works is an empirical discovery, as opposed to a mathematical result, but probably has philosophical implications in its success.


Check out Bayesian Model Selection. It's the mathematical expression of Occam's Razor.


My point is that it shows up everywhere, just in different forms. Sparse coding has a penalty for large bases. Gaussian process regression tunes the density of its representation using Bayesian model selection. SVMs have a slack parameter which dictates how many errors you'll tolerate in exchange for a wider margin.


I apologize, my reply was aimed too low.


This is often represented explicitly within the model math by adding a penalty factor that marks more complex models as worse and thus also optimizes simplicity in addition to accuracy.
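In its simplest form, that penalty is just an extra term in the loss. A one-dimensional ridge regression makes the tradeoff visible (closed form, no intercept; the numbers in the test of it are invented):

```python
def ridge_1d(xs, ys, lam):
    """Minimize sum((y - w*x)^2) + lam * w^2 over the single weight w.
    Setting the derivative to zero gives w = sum(x*y) / (sum(x^2) + lam)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)
```

With `lam = 0` this is ordinary least squares; raising `lam` shrinks the weight toward zero, trading a little training accuracy for a simpler model.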


Andrew Ng emphasized this quite clearly in his Machine Learning course on Coursera.


It's not directly related, but I always liked this little "koan":

A man is looking around at the ground under a street lamp. You ask him what he is looking for, and he says "I'm looking for my keys. I dropped them somewhere in that parking lot over there." "Then why are you looking under this street lamp?" you ask. He answers: "Because this is the only place I can see!"


Seems like HN is causing them problems. I saved the articles text at: https://www.evernote.com/shard/s360/sh/4e19f93c-8425-440c-b9...


I think the situation is more complex than the author states.

For example, if I have a linear model, Y = a + b * X, I will choose a and b to minimize in-sample error. Choosing a and b to maximize out-of-sample fit goes against all theory.

However, if I want to choose which parameters go into my model, maximizing out-of-sample fit would be a good approach.

So at the end of the day, there is not a huge philosophical difference between using in-sample and out-of-sample fit, only different approaches to the same problem. In both cases, the assumption is (usually) that the data is i.i.d., and in both cases you are choosing some coefficients/parameters/hyperparameters with the intent of maximizing out-of-sample fit, just using different methods.


Are you coming from a theoretical math point-of-view or background? It's hard for me to say exactly why, but I feel your response is evidence of just that "huge philosophical difference" between traditional stats and machine learning.

To me, even the statement "if I have a linear model" makes very little sense from the perspective of ML. Contrast with "if I think I'm dealing with a situation where a linear model might offer a good fit".

Regarding "maximizing out of sample fit would be a good approach", I think ML is always and just-about-only concerned with maximizing out-of-sample fit, for if it wasn't, the solution would be a lookup table.

I'm not trying to imply that you're wrong, rather that I think the 'gulf' is real. Or maybe I'm misunderstanding your point. For example, I feel that mjw's comment in this thread captures my view, which I think is more ML centered: https://news.ycombinator.com/item?id=6878336

Is that comment also in accord with your view, and it's me that's on the wrong side of that gulf?


Great post. It's like maths, you have to check your answers. Validation sets are a way of doing that.

I've been getting into ML lately for my startup, it's a personal finance system that will learn your habits and use that to predict things in the future. It's been overwhelming attempting to move into this domain of software engineering (so much so that I am currently just hard coding certain important patterns and using basic statistical modelling instead) but it is absolutely fascinating!


Funny that this idea is so foreign to ML. As a macromolecular crystallographer, Rfree [0] is something drilled into every student's brain from day one!

TL;DR: Randomly, a certain percent (5-10%) of data is 'hidden' and never used for building/refining your model, but is only used to evaluate how well your model fits (or explains) that unseen data. This is absolutely, fundamentally essential to prevent over-fitting your data!!

EDIT: Think of it as solving a huge jigsaw puzzle made of thousands of jello pieces. You randomly hide 100 or so pieces and try to solve the puzzle. Having used all the pieces (except the hidden 100), you think the puzzle forms a Treasure Map. Now, you take the previously hidden pieces and try to fit those into the puzzle, and if after using the hidden pieces your puzzle still looks like a Treasure Map, you may have found a (mostly) correct solution. But if you are unable to fit those hidden pieces in a way that still keeps the Treasure Map intact, you must question whether you did in fact find the correct solution, or whether there is another, slightly different solution that may be more correct because it accounts for the hidden pieces a little better.

[0] http://reference.iucr.org/dictionary/Free_R_factor


Don't kid yourself, the idea must have been foreign to the author of the post, but you won't find a single published paper that doesn't test its results using cross validation or at least on some standard test set.


Honestly, calling validation a trick isn't helping.

Understanding the motivation behind validation is an absolutely fundamental concept, and lack of coherence on the topic shows an inherent lack of understanding of the goal of building the model in the first place: GENERALIZATION.

This is like checking in code that works locally without testing it in the stack or a production environment.

I work and hire in this space and it's actually a bit shocking how widespread this lack of understanding is. Asking a candidate how to evaluate a model, even at a basic level, is this field's version of FizzBuzz. Just like FizzBuzz, a lot of candidates I've encountered who are "trained" in machine learning or statistics fail miserably, and my peers seem to have similar experiences.

These issues are expected, given how popular data science is these days. We all win when more people are getting their hands dirty with data, but it's extraordinarily easy to misuse the techniques and reach misleading conclusions. This can potentially lead to people pointing fingers at the field and its decline. The only thing we can do is correct the wrongs and do our best to limit incompetence that only serves to tarnish the field.


Count me among those who thought validation was a thing you just had to do when training ML algorithms. After all, the most beautiful theoretical model in the world is of no use if the predictions it delivers are terrible.

The real trick (for most algorithms) is to select the correct features to train against. This really is more of a black art than an exact science, so I think labeling it a trick is justified.


Link appears to be /.'ed (HN'd). CoralCache/NYUD.net doesn't seem to have it in cache. Anyone got a cached page/mirror?


I think you need to view whatever process generated the answers as part of your model. In some cases, and in all textbook examples, we have a ground truth that is correct. But in real-world applications, such as a segmentation problem in medical imaging, we have a gold standard which represents our best estimate, but is not necessarily correct.

Validation is not a magic bullet; we need to be critical of any part of the model that is given as truth, otherwise we might end up fitting a solution to the wrong problem.

More generally I think that textbooks should emphasize the need for the scientific method and stress that any model (or theory) is only as good as its ability to explain the entire problem domain.


How does the brain generalize to data it hasn't seen before? Any theories?


According to Piaget's theory of development, as we grow up we have different experiences from which we acquire new information. If we start out naive, with no experiences or memories at all (the "tabula rasa" stage), we begin learning this new information and grouping it into correlated structures of knowledge known as schemas. For example, different types of dogs can form one schema, since they share characteristics and are correlated knowledge. As we learn, we not only create these schemas but also adapt them when new, unknown information arrives. For example, if I have only ever experienced dogs, then when I see a cat I know it is likely to be an animal that shares characteristics with dogs, and it will most likely belong to the same or a similar schema. That's how I personally believe we learn and interpret new information.

Of course there are many different theories, but that's my favourite.


In the context of machine learning, I think the brain's ability to model the real world evolved because a better model of the world confers a survival advantage. I don't know much about how the brain actually models reality (and I don't know if anyone does), but the theory of machine learning still applies in the sense that each individual animal's brain is a model, and a model that is too complex will generalize poorly, so the owner of that brain is likely to do poorly in the real world.

It's very interesting in the sense that the totality of brains over time is essentially a sort of supervised learning with huge amounts of input data.


What are you asking here?

The brain contains/is the model. It is trained by a range of inputs and by definition it generalizes outside those inputs.

If you're asking how the brain minimizes out-of-sample error: it does so by virtue of its model not being too complex for the training set, just like what you do in machine learning. If the brain's model were too complex, it would overfit and generalize poorly, just as a machine learning model with too complex a model would...


And a follow-up question: what would be an example of a brain overfitting something it learned?


When we were in school, at one point our teacher switched our limit calculation questions from the almost standard notation (x,y,z for variables; a,b,c for constants) to the opposite.

Some people would have trouble handling something that had \lim_{a \to x} (some complicated f(a,x,y)) where y is a constant even though they could handle it with standard notation.

For another possible example, take something you've written recently, replace all the variable names with things like Integer, Double, and the function names with For, While (within the syntax of the language) and then try reading it.

Besides this, there's the jesus-in-toast, man-in-the-moon, face-on-mars business. The brain overfits everything, but it never stops training. It's in constant reinforcement learning.


Validation isn't "a trick", or shouldn't be. It's just being responsible. I'm sure there are people getting funded who don't know about it, but they're charlatans if they don't understand the dangers of overfitting (and underfitting).


See the early history of machine learning: validation was a discovery, one that is actually counterintuitive and a common trap for beginners. (Don't minimize the training error.)

I have seen new PhDs read about it "in theory" but not internalise it in practice, and then go off and do Bayesian structure learning without a validation set. This DOES happen.

This post exists to hammer into the brain of any beginner thinking about machine learning that understanding the validation set's purpose is the most important thing to internalise first.
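The "don't minimize the training error" trap can be shown in a few lines. This is an illustrative sketch (the quadratic data and the choice of degrees are invented, not from the article): training error can only go down as model complexity grows, so it will always vote for the most complex model; only the held-out set can tell you whether that complexity actually helped.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 30)
y = x ** 2 + rng.normal(0.0, 0.2, 30)        # true model is quadratic + noise
x_val = np.linspace(-1, 1, 100)
y_val = x_val ** 2 + rng.normal(0.0, 0.2, 100)

def errors(deg):
    """Train a degree-`deg` polynomial; return (train MSE, validation MSE)."""
    c = np.polyfit(x, y, deg)
    train = float(np.mean((np.polyval(c, x) - y) ** 2))
    val = float(np.mean((np.polyval(c, x_val) - y_val) ** 2))
    return train, val

results = {d: errors(d) for d in (1, 2, 12)}
# Training MSE is non-increasing in degree (degree 12 "wins" on the training
# set), but the validation MSE is what reveals the true complexity.
```

If you selected the degree by training error alone you would always pick 12; the validation column is what stops you.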

e.g. Machine learning is easier than it looks: https://news.ycombinator.com/item?id=6770785



