In particular, I think the dichotomy stems not from statisticians neglecting what works, or having a narrow mindset, or whatever.
It seems to me that it stems from different goals:
* business seeks to predict and classify
* science seeks to test hypotheses
And statisticians used to focus on the latter, for which you need classical statistics ("data modeling", or "generative modeling", as Donoho calls it), don't you?
And for prediction and classification, sure, there are the classical techniques (regression, time series (ARCH, GARCH, ...), Fisher's linear discriminant), there are Bayesian methods, newer statistical stuff such as SVM, and ML techniques such as random forests.
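To make that concrete, here's a minimal sketch (my own toy example using scikit-learn and a synthetic dataset, not anything from the paper) of how both cultures can be pointed at the exact same prediction task: a classical logistic regression, whose coefficients have a direct log-odds reading, next to a random forest tuned purely for predictive accuracy.

```python
# Toy comparison of a "data modeling" tool and an "algorithmic modeling" tool
# on the same classification problem (synthetic data, illustrative only).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=2000, n_features=20, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Classical, interpretable model: coefficients are directly readable.
logit = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Algorithmic model: a black box aimed at predictive accuracy.
forest = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_train, y_train)

print("logistic regression:", accuracy_score(y_test, logit.predict(X_test)))
print("random forest:      ", accuracy_score(y_test, forest.predict(X_test)))
```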
However, it's just driven by different objectives. As the discussants put it:
* Efron: 'Prediction is certainly an interesting subject but Leo [Breiman]’s paper overstates both its role and our profession’s lack of interest in it.'
* Cox: 'Professor Breiman takes data as his starting point. I would prefer to start with an issue, a question or a scientific hypothesis [...]'
* Parzen: 'The two goals in analyzing data which Leo calls prediction and information I prefer to describe as “management” and “science.” Management seeks profit, practical answers (predictions) useful for decision making in the short run. Science seeks truth, fundamental knowledge about nature which provides understanding and control in the long run.'
So, different objectives call for different methods. And, certainly 20 years ago, statisticians were mostly focusing on one rather than the other. Ok. So?
“Ok. So?” Well, so Computer Science has been focusing mainly on the predictive side (ML/AI), and you had a lot of intellectual whining that they’re “re-inventing” statistics just with different terminology. I’m not sure if this is just an attempt to downplay their results or if it’s more academic jealousy because the funding goes to the “cool stuff” like AI/ML in the CS dept. and the Stats dept. is seen as old and boring. That’s what it feels like. You’ll even see this type of commentary in the preface of books like All of Statistics.
No matter what comes from empirical research in Computer Science re: prediction/classification methods, you’ll hear the Stats camp crying that it’s “just Statistics” at the end of the day. Fair enough, but computational statistics was then neglected for long enough that computer scientists had to create more powerful techniques independently, and they can claim priority on that front. Theory lags practice in this area.
> that they’re “re-inventing” statistics just with different terminology.
To be very clear about it, this is referencing a very common problem with subfields of applied statistics in general; it's not limited to AI/ML! Econometrics, epidemiology, business stats (decision support), and so on all tend to come up with their own bespoke terminologies and reinventions of basic statistical principles. It would seem entirely appropriate to point that out.
> I’m not sure if this is just an attempt to downplay their results or if it’s more academic jealousy because the funding goes to the “cool stuff” like AI/ML in the CS dept. and the Stats dept. is seen as old and boring.
No one is trying to downplay the legitimately impressive results of AI/ML. Deep learning, convolutional neural networks and GANs have had incredible success in fields like computer vision and image/speech recognition. But outside of those areas, the "results" for the current fads in AI/ML have been grossly overstated. You have academic computer scientists like Judea Pearl who decry the "backward thinking" of statistics and champion a "causal revolution", despite not actually doing anything revolutionary. You have modern machine learning touted ad nauseam as a panacea for any predictive problem, only for systematic reviews to show it doesn't actually outperform traditional statistical methods [1]. And you have industry giants like IBM and countless consulting companies promising AI solutions to every business problem that turn out to be more style than substance, and "machine learning" algorithms that are just regression.
There's a reason why AI research has gone through multiple winters, and why another is looming. Those in AI/ML seem to be more prone and/or willing to overpromise and underdeliver.
When I first read this paper I thought it was thought-provoking and captured the tension being referenced pretty well.
Over time, I've come to see it as pretty dated and misleading.
The problem is that the methods of both "cultures" are pretty black box, and it's a matter of which black you want to dress your box in. Actually, it's all black boxes anyway, all the way down, epistemological matryoshki.
The real tension is between relatively more parametric approaches and relatively nonparametric approaches, and how much you want to assume about your data. That, in turn, reduces to a bias-variance tradeoff. Some approaches are more parametric and produce less variance but more bias; others are less parametric and produce more variance but less bias. In some problem areas, the particulars of the problem push things in one direction or the other; e.g., in some fields you know a lot a priori, so just slapping a huge predictive net on x and y makes no sense, but in other fields you know nothing, so it makes a lot of sense (a toy illustration of the tradeoff follows below).
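Here's a rough simulation of that tradeoff (a toy setup of my own, using numpy and scikit-learn; the signal, noise level, and sample sizes are arbitrary): a rigid parametric fit versus a very flexible nonparametric one on the same noisy nonlinear signal, with squared bias and variance estimated over repeated samples.

```python
# Toy bias-variance simulation: more parametric (linear regression) vs.
# less parametric (1-nearest-neighbor regression) on a nonlinear truth.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
x_test = np.linspace(0, 1, 50).reshape(-1, 1)


def f(x):
    # True regression function: clearly nonlinear in x.
    return np.sin(4 * np.pi * x).ravel()


def simulate(model, n_reps=200, n=100, noise=0.3):
    # Refit the model on many fresh samples, then measure average squared
    # bias and average variance of its predictions at the test points.
    preds = np.empty((n_reps, len(x_test)))
    for r in range(n_reps):
        x = rng.uniform(0, 1, size=(n, 1))
        y = f(x) + rng.normal(0, noise, size=n)
        preds[r] = model.fit(x, y).predict(x_test)
    bias2 = ((preds.mean(axis=0) - f(x_test)) ** 2).mean()
    var = preds.var(axis=0).mean()
    return bias2, var


# More parametric: low variance, but badly biased for a nonlinear truth.
print("linear regression (bias^2, variance):", simulate(LinearRegression()))
# Less parametric: nearly unbiased, but much higher variance.
print("1-NN regression   (bias^2, variance):", simulate(KNeighborsRegressor(n_neighbors=1)))
```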
Another tension being conflated a bit is between prediction and measurement (supervised and unsupervised classification, forward and inverse inference, etc.). Much of what is being hyped now is essentially prediction, but a huge class of problems exists that doesn't really fall neatly into this category.
I disagree that computational statistics was being neglected in statistics. What I have seen is that one class of methods (neural network approaches) got new life breathed into it and became extraordinarily successful in a very specific but important class of scenarios. Subsequently, the "AI/ML" label got expanded to include just about any relatively nonparametric, computational statistical method. Maybe computational multivariate predictive discrimination was neglected?
A lot of what AI/ML is starting to bump up against are problems that statistics and other quantitative fields have wrestled with for decades. How generalizable are the conclusions based on these giant datasets to other data? What do you do when you have a massive model fit to an idiosyncratic set of inputs? How do you determine whether your model is fitting to meaningful features? What is the meaning of those features? Why this model and not another one? There are really strong answers to many of these types of questions, and they're often found in traditional areas of statistics.
Anyway, I see this paper as setting up a sort of artificial dichotomy with regard to issues that have existed for a long, long time, and I see that artificial dichotomy as masking more fundamental issues that face anyone fitting any quantitative model to data. It's a misleading and maybe even harmful paper in my opinion.
I'm also not impressed. He asserts blithely that "the goal is not interpretability, but accurate information", that is, that only predictive ability matters. Maybe this is true in some domains, but in my experience scientists usually give a shit about what the hidden function is and are trying to learn how it behaves. No one in science wants to be left with a black box they can't move past; they want a deeper understanding of the underlying phenomenon.
I don’t think that’s what Breiman is saying. He means useful and reliable information. The point is not just prediction but also actionable information about variables. Look at the 3 examples.
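For what it's worth, here's a small, hypothetical sketch (my own, using scikit-learn's built-in breast cancer data, not one of Breiman's three examples) of the kind of "actionable information about variables" an algorithmic model can still give you: permutation importance on a fitted random forest, which is close in spirit to the variable-importance measures discussed in the paper.

```python
# Illustrative only: variable importance from a black-box model.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_train, y_train)

# Which variables does the prediction actually rely on?
result = permutation_importance(forest, X_test, y_test, n_repeats=10, random_state=0)
ranking = sorted(zip(X.columns, result.importances_mean), key=lambda t: -t[1])
for name, score in ranking[:5]:
    print(f"{name}: {score:.3f}")
```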