Compression is a necessary but not sufficient condition for 'comprehension'. The whole 'compression' idea here is a bit of a misdirection; it will be a side effect of any success.
The data fed into these systems is just measurements of target systems (eg., of light for photographs). This data is radically incomplete, so no compression of it will be an accurate (eg., 3d) model.
To reconstruct the world you need to measure the measuring device (ie., the body) as it interacts with the target system. In most cases you need, also, the hypothesis behind the action to resolve ambiguity in the measurement data.
Eg., you need to know you moved your hand to touch the fireplace to properly interpret what 'finger pain' means.
It is for reasons of this kind that 'comprehension' is an engineering problem, not a programme for a universal Turing machine (which has no device boundaries).
It is engineering in the sense that 'compression' has to occur under the right measurement procedures, with hypothesis-laden action, etc.
Focusing on mathematical abstracta completely misses the problem.
All deep systems do is compress their training data to form archetypes and compare novel input to compressed archetypes. Since the data itself is necessarily profoundly ambiguous, there is only a trivial sense of 'generalisation' achieved.
My comment here wasn't aimed at the problem of understanding and improving DNNs, but rather at generalisation and 'comprehension' proper.
3d image reconstruction is trivially possible, with the right data and with the right assumptions encoded into the right algorithm.
My target of attack here is the equivocation that 'compression is comprehension'.
In my view comprehension isn't about 'generalization via compressed archetypes' as in ordinary NNs.
And it isn't about 3d reconstruction given the relevant modification to data and approach.
Rather, generalization is abduction: the ability, from a single instance, to form law-like, provisional, universalising models which explain your environment. This process will lead to a 'compressed' 'representation', but nothing like the sense in which compression is used here.
It is this that is naively assumed of these systems. They do not, and cannot, abduct. Abduction isn't a statistical process; it involves, at the very least, counterfactual reasoning and hypothetical action.
This is the problem that such equivocations (ie., that compression = comprehension) miss.
And in my view this is an engineering challenge; not something to be specified for a universal computer.
The relevant missing capacities aren't better means of compression.
Sorry but I have to chime in here. His thoughts seem insightful and relevant to discussion. You are calling his contributions 'meaningless' and condescendingly telling him twice to 'be specific'. That's not how things are meant to be discussed around here and not how you get an answer out of someone.
I’m comfortable, having read his comment history, that a blunt request to explain how the comments are relevant to the OP (and the failure to articulate that when bluntly asked to do so) indicates that what seems “insightful” is, in fact, not falsifiable and therefore nothing more than armchair philosophy.
I’m not into that, so if it terminates the discussion then so be it.
It might be the case, like with quantum mechanics, that NN theory is just fundamentally weird (e.g. because the space is so multidimensional), and it's hard for us to understand it no matter how long we study it.
Though, one trivial way to do it, with NNs in any case, is just to project forward from a range of observer models and guess the observer parameters from them.
This is still the wrong sense of generalisation. What can't be guessed is why a person took consecutive pictures at given angles, etc.
Such information is necessary to resolve deep ambiguities in cases where your observer model will fail.
Eg., yesterday I looked out my window and thought I saw two people; it was actually one person with a shadow and a bag.
I moved my eyes/head/body in such a way as to fit a variety of models, and I was able to 'read the scene' in the end.
And there's no reason we couldn't have a deep learning system where the input data (images) included time-stamps and movement vectors, and it could be good both at easy image classification, and at choosing particular "head movements" like those you performed, to help resolve ambiguous cases.
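For what it's worth, here is a rough sketch of the input side of such a system (PyTorch; every class and parameter name is made up for illustration, and choosing the movements is left out — this only consumes a movement vector alongside each frame):

```python
# Hypothetical sketch: a classifier whose input is an image frame plus the
# head-movement vector that produced it. Assumes PyTorch; names are invented.
import torch
import torch.nn as nn

class ActiveViewClassifier(nn.Module):
    def __init__(self, num_classes=10, motion_dim=4):
        super().__init__()
        # Small CNN encoder for the image frame.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Motion vector, e.g. (dx, dy, dz, dt): how the "head" moved since the last frame.
        self.motion = nn.Sequential(nn.Linear(motion_dim, 32), nn.ReLU())
        self.head = nn.Linear(32 + 32, num_classes)

    def forward(self, image, motion):
        z = torch.cat([self.encoder(image), self.motion(motion)], dim=1)
        return self.head(z)

model = ActiveViewClassifier()
logits = model(torch.randn(8, 3, 64, 64), torch.randn(8, 4))  # batch of 8 frames + motions
```

Whether end-to-end training would actually teach such a network to exploit the motion signal the way we do is, of course, the open question.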
Further food for thought: these ambiguous cases seem (do you agree?) to be very rare.
Ambiguity is the norm, it isn't rare. Almost all visual input, ie., light, is ambiguous. We (animals) use the history of our prior geometrical-light experiences (ie., walking around) to use environmental cues to resolve ambiguity.
That billions (or trillions) of images are needed to approximate what we can do from a single instance is, I think, a good guide to the magnitude of the problem.
Google the "amnes room" -- that "illusion" is how we are always seeing.
>Compression is a necessary but not sufficient condition for 'comprehension'.
Actually, algorithmic information theory shows that maximum compression necessarily entails maximum comprehension, because the only way to maximally compress something is to exactly understand the process producing it.
The parent commenter is not disagreeing with information theory (and what you're saying is shown in the article anyway).
They're making a practical distinction that you generally don't have access to the actual thing in an empirical format for which compression will achieve true learning. Instead you have access to training data which represents, let's say, a projection of the actual thing in a smaller space with fewer dimensions.
Like trying to learn from images instead of the 3d world. Humans learn to distinguish between objects in a 3-dimensional space using sight and interaction. This learning generalizably transfers to recognition in 2 dimensions. We don't generally equip models with robotic interfaces to train in 3d before benchmarking them on ImageNet.
> We don't generally equip models with robotic interfaces to train in 3d before benchmarking them on ImageNet.
Don't they train models using 3D rendering and simulations? We have relatively realistic simulations for various scenarios - having a learned model that could make inferences based on those complex simulations sounds like a win.
>Like trying to learn from images instead of the 3d world. Humans learn to distinguish between objects in a 3-dimensional space using sight and interaction. This learning generalizably transfers to recognition in 2 dimensions.
If we use human "comprehension" as a reference point, then the relevant point of comparison should be the understanding a human can develop given the same inputs.
Sure, but how do you measure that? How do we figure out how much understanding a human can develop from only ever seeing 2d pictures, without any movement or interaction with a 3d world?
Most ML problems are things humans are quite good at and have a lot of context to draw from.
Sure. But again, practically speaking, that isn't the reality of how we learn. The commenter wasn't refuting Kolmogorov complexity. They're just saying it's an extremely limited way of viewing the problem. Useful sure, but insufficient.
Relative to a device to perform the understanding.
A string X maximally compresses a dataset Y iff X is a 'comprehending' of Y.
'OK'... but what produces and evaluates X? ie., comprehension.
This is the problem with defining these terms mathematically; you state the problem in basically useless ways.
Yes, you can specify what equation produces the mass of the Higgs boson. That's basically no guide to building the LHC.
The production of such understanding is not abstract. Comprehension isn't a relation between two binary strings; it is an action taken in an environment with a goal.
>Relative to a device to perform the understanding.
It uses Kolmogorov complexity, which is defined as the length of a shortest computer program in a predetermined programming language that produces the object as output. Note that this measure is relative to a programming language, not a program, and the exact choice of language doesn't matter too much. Compression means creating a smaller program that produces the same output, and to produce the same output with less code necessarily requires more understanding.
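In symbols, this is the standard definition, with U the chosen reference machine/language, |p| the length of program p in bits, and V any other choice:

```latex
K_U(x) = \min\{\, |p| : U(p) = x \,\}, \qquad |K_U(x) - K_V(x)| \le c_{U,V} \ \text{for all } x
```

The second part (the invariance theorem) is why the exact choice of language doesn't matter too much: switching languages shifts the complexity by at most a constant.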
As a concrete example, imagine the output is [1, 2, fizz, 4, buzz, fizz, 7, 8, fizz, buzz, 11, fizz, 13, 14, fizzbuzz, ... up to 1000]. The longest program to output this would just hard-code it in the source code (much as a very inexperienced programmer might solve the problem, or a large neural net). Someone with a better understanding would write a program using iteration and the modulus operator, which would be shorter.
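Spelled out as (hypothetical) Python, the contrast looks like this:

```python
# The "no understanding" program: hard-code the sequence (imagine all ~1000
# entries written out; only the first ten are shown here).
hard_coded = [1, 2, "fizz", 4, "buzz", "fizz", 7, 8, "fizz", "buzz"]  # ... up to 1000

# The "understanding" program: the rule itself, far shorter than the data.
def fizzbuzz(n):
    out = []
    for i in range(1, n + 1):
        s = ("fizz" if i % 3 == 0 else "") + ("buzz" if i % 5 == 0 else "")
        out.append(s or i)
    return out

assert fizzbuzz(10) == hard_coded  # same output, much shorter description
```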
Yep, it gets into a philosophical debate about what comprehension is.
One could argue it's not about compression in bits but compression to primitives that make sense to the human mind. But then the definition becomes too fuzzy, because it naturally invites the question "Whose mind?"
Compression is applied to data collected in the past while comprehension or intelligence also require adapting well to the future. Thus a pure compressor would optimize for a static distribution while an intelligent agent will probably be sub-optimal because it also needs to learn and adapt to the future.
But what if we drop the word comprehension, and we just go with “functional approximation X -> Y, computed from a finite dataset, which minimizes a predictive risk”?
It’s unclear why compression is necessary there, except as a practical benefit.
I will nitpick and say that models don't compress data but extract useful information. Compression, lossy or lossless, is for data reconstruction. Machine learning models retain information that generalises.
Is there a meaningful / well defined difference? One could say that lossy compression is extraction of useful information. You need to identify unnecessary information to know what to discard safely.
One is a subset of the other. Compression being the larger category.
A mean is a compression of a dataset and useful information.
My issue in my comment is that 'compression' corresponds to a massive class of techniques, and there isn't a lot of content in the observation that useful information is compressive.
However there are some hypey people out there who think this observation has legs -- precisely the people who think intelligence is a mathematical problem, and not an engineering one, which is my view.
Ie., that a body isn't incidental to intelligence, but the heart of it.
I tend to think about it the same way as you. Having an algorithm for multiplying two numbers is qualitatively different than having a lossy compression of a huge dataset of multiplication tables. The latter is what GPT3 has and it just doesn't scale.
Consider a generative adversarial network for faces. The photos have, let's say, some unique scars in them.
A successful model could create faces with scars, but not the exact scar, face, and background it trained on, without additional information. What you are looking for is mutual information between the images, not compression.
Yes, you could use a very well trained GAN for face compression. But the GAN model itself would not be able to reconstruct its training input without being shown the images again.
According to Wikipedia [data] compression "is the process of encoding information using fewer bits than the original representation" [1] while lossy compression is "the class of data encoding methods that uses inexact approximations and partial data discarding to represent the content" [2]
There is a difference between feature extraction and compression. The former selects the most unique and important elements from a set of data, while the latter attempts, in the case of lossy compression, to find a smaller approximation of the original data set.
For example imagine a data set of paintings. Feature extraction might simply identify what color paints were used in each painting. That alone might allow classification of paintings with respect to the painter, style or period. But this would not be an approximation of the original picture, unless you consider resizing a Jackson Pollock into an 8x1 pixel image to be compression.
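As a toy version of that distinction (plain numpy; the function name and bin count are arbitrary): a coarse colour histogram is a feature vector you could classify painters or styles on, but nothing resembling the painting can be rebuilt from it.

```python
import numpy as np

def color_features(image, bins=8):
    """Coarse joint RGB histogram of an (H, W, 3) uint8 image, normalized to sum to 1."""
    pixels = image.reshape(-1, 3)
    hist, _ = np.histogramdd(pixels, bins=(bins,) * 3, range=[(0, 256)] * 3)
    return (hist / hist.sum()).ravel()  # 512 numbers, regardless of image size

painting = np.random.randint(0, 256, size=(480, 640, 3), dtype=np.uint8)  # stand-in image
features = color_features(painting)  # useful for classification, useless for reconstruction
```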
"The former selects the most unique and important elements from a set of data, while the latter attempts, in the case of lossy compression, to find a smaller approximation of the original data set."
The compressor finds a smaller approximation of the data by finding the most redundant data, and as a byproduct it also finds the most unique. At least from the description you gave there is no difference between the two.
"For example imagine a data set of paintings. Feature extraction might simply identify what color paints were used in each painting. That alone might allow classification of paintings with respect to the painter, style or period. But this would not be an approximation of the original picture, unless you consider resizing a Jackson Pollock into an 8x1 pixel image to be compression."
You can and we do use compressors to do exactly that. Take a painting (or a set) as reference and use it to compress other paintings and the greater the compression the more similar the styles are.
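That trick has a standard form, the normalized compression distance. A crude sketch with zlib (byte strings stand in for image files; the helper names are mine):

```python
import zlib

def c(data: bytes) -> int:
    """Compressed size as a rough stand-in for Kolmogorov complexity."""
    return len(zlib.compress(data, level=9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: smaller means more shared structure."""
    cx, cy, cxy = c(x), c(y), c(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

a = b"abcabcabc" * 100
b = b"abcabcabc" * 90 + b"xyz" * 30
z = bytes(range(256)) * 10

print(ncd(a, b))  # relatively small: the two share a lot of structure
print(ncd(a, z))  # larger: almost nothing transfers between them
```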
No, because lossy compression has the goal of preserving elements for replay. At the extreme you could turn a movie into subtitles, but not into a costume from the movie alone. The costume might contain more information from the movie, but you can't meaningfully recreate the movie from a single costume.
The laws alone don’t have enough information to replay their motion without each planet's location, a focus, the orbital period, and an associated timestamp or equivalent information.
So, include that data and it’s lossy compression, exclude it and it’s not.
> Focusing on mathematical abstracta completely misses the problem.
That rather depends on what you think the problem is or ought to be. I'm not sure that the author or many readers agree with you. It seemed fairly clear to me that they are interested in how well they can generate outputs corresponding to novel inputs that do well on some utility function. And for that, the "mathematical abstracta" is rather important and useful. "Comprehension" might also be important and useful, but is itself neither necessary nor sufficient.
A self-driving car will hopefully try to avoid running over a two-headed person crossing a road, without needing to worry about whether a two-headed person is mythical vs a miscomprehension of a woman carrying a child.
Besides, the layer at which comprehension occurs is not fixed. Our measuring devices themselves take in ambiguous inputs and process them. Do you stop at the surface of the skin? The nerve impulses? Various layers within the brain and nervous system?
Your complaint is reasonable, yet is not really that different from saying that computers are pointless without any I/O capabilities -- it doesn't matter what they compute if it is unobservable and has no effect on the world. That is true, yet doesn't mean that the whole of computer science is "a bit of a misdirection".
I get so pissed every time the double descent paper gets brought up, maybe because of the hubris in its abstract. No "questions about the mathematical foundation of machine learning" were raised by deep learning and a few silly experiments are far from showing "limits of classical analyses".
All the paper shows is (1) stupid ways of counting model complexity and (2) that gradient descent is flawed. Nobody in their right mind believes that increasing the number of hidden neurons can result in a network that is worse on the test set. Since the bigger network contains the smaller network, it is perfectly capable of achieving the same performance, so the only reason why this does not happen is that SGD cannot find it. But of course "SGD cannot always find good solutions" is surprising to nobody, so let's just shit on decades of serious work to get our little paper out.
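To make the containment point concrete, here is a tiny sanity check (a PyTorch sketch, not anything from the paper): embed a narrow layer inside a wider one by zero-padding, and the function is unchanged, so the wider net can always represent whatever the narrow one learned.

```python
import torch
import torch.nn as nn

small = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 2))
big   = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 2))

with torch.no_grad():
    for p in big.parameters():
        p.zero_()
    big[0].weight[:16] = small[0].weight     # first 16 hidden units copy the small net
    big[0].bias[:16]   = small[0].bias
    big[2].weight[:, :16] = small[2].weight  # output layer reads only those 16 units
    big[2].bias.copy_(small[2].bias)

x = torch.randn(5, 10)
assert torch.allclose(small(x), big(x), atol=1e-6)  # identical behaviour
```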
I really do not understand your post. One doesn't train on the test set, so of course it's reasonable to think that increasing the number of parameters will cause more overfitting.
And yet that is not what is observed in practice. See figures 1 and 2 of that paper [1].
What I am complaining about is that the authors are confusing "what can be represented" by a neural network with "what can be easily learned" via SGD. I argued that the peak in test loss in those figures is observed because SGD struggles to find a solution that generalizes well, not because those networks are intrinsically less powerful (as the paper seems to imply).
> And yet that is not what is observed in practice. See figures 1 and 2 of that paper
Yeah... that's why the paper is a good contribution contrary to what you're saying. Not sure why you're repeating this information.
> What I am complaining about is that the authors are confusing "what can be represented" by a neural network with "what can be easily learned" via SGD. I argued that the peak in test loss in those figures is observed because SGD struggles to find a solution that generalizes well, not because those networks are intrinsically less powerful (as the paper seems to imply).
I mean... those models are by definition less powerful, as they have fewer parameters. The (to me) main point of the paper is to point out a symptom which is interesting. Their explanation for the symptom being (maybe) wrong doesn't detract from the important work of showing the symptom exists.
A legitimate criticism would be that there have been earlier papers showing the same thing.
Sidenote: It seems having a larger model does make it easier for SGD to find good solutions[1]
The fact that over parametrized models can generalize better than under parameterized models, and that both are better than models that can just barely interpolate is a genuinely new insight that was not predicted by any prior theory.
Let's for a second assume that we bound the weights within [-1,1]. Considering that we are using floats to represent them, which are a subset of the real numbers, by increasing the number of hidden neurons you increase the class of models that you can select from; any single bit change in the weights means a different model. By increasing the class of models in your search space, you increase the number of models that perform well on training but worse on test, i.e. underestimate the true loss.
Exactly because you select a single model and you have more models that underestimate the true (read test) error, you *may* be more likely to get models with worse performance on the test set compared to using a smaller network exactly because in the smaller network you are searching within a smaller distribution.
I'm not clear why you are blaming SGD. Maybe I missed the point. In principle SGD might well find the global optimum. The problem is that this optimum is achieved only for the training data. It could certainly perform worse on the test set. Maybe you are referring to the entire training process? The general idea is from the days of SVMs, where the optimization method was convex.
Though personally I do find a lot of this modern "experimental" research quite hokey. I don't think this is something academics should be getting research funding to pursue. This is engineers building intuition about how to tune their product.
You don't know. The point was even if you did get there it could still be an overfit model you don't want, since it's based on a training data set, not the true statistics of the distribution the samples come from.
> Since the bigger network contains the smaller network, it is perfectly capable of achieving the same performance, so the only reason why this does not happen is that SGD cannot find it.
This is maybe true in the limit of infinite data, but not true in any practical sense, and I don't think it has anything to do with SGD. E.g. polynomial basis functions also have this property, but you can't use an arbitrarily large polynomial order or you'll eventually overfit. You can get a closed-form solution for polynomial regression problems, so no SGD involved.
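A quick illustration of that polynomial point (numpy; the sizes and noise level are arbitrary): the fit is exact least squares, no SGD anywhere, and the near-interpolating model still does worse on held-out points.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(-1, 1, 15))
y_train = np.sin(3 * x_train) + 0.1 * rng.normal(size=15)
x_test = np.linspace(-1, 1, 200)
y_test = np.sin(3 * x_test)

for degree in (3, 14):
    # Closed-form least-squares fit; numpy may warn that degree 14 is poorly
    # conditioned, which is part of the point.
    coeffs = np.polyfit(x_train, y_train, degree)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(degree, round(float(test_mse), 4))
# Degree 14 interpolates the 15 training points but typically has much larger
# test error than degree 3.
```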
I found this to be a great survey post on a question I've been wondering more and more as I hear about all of the machine learning going on. I'm no academic nor mathematician, and I fear that diving into actual papers would quickly blow out the few remaining neurons I have left, but this article was mostly understandable and brought up a lot of points that have been floating around in my head. Kudos to the author.
The one substantive comment I want to make is that I really wonder about the difficult to reproduce findings. What's going on when attempts to reproduce them fail? It's a general question, but I do wonder how much of that is because other effects are swamping the signal, and how much is that the finding only applies to limited situations. (Not that those are entirely different.) If I don't see the double-u curve, is it because my problem space has an atypical shape, or because the researchers' did?
I haven't read this paper yet so I can't speak to its quality, but it appears to be addressing the same questions as this post. Bengio is a coauthor, so maybe that's a good sign. Here's the abstract.
This paper provides theoretical insights into why and how deep learning can generalize well, despite its large capacity, complexity, possible algorithmic instability, nonrobustness, and sharp minima, responding to an open question in the literature. We also discuss approaches to provide non-vacuous generalization guarantees for deep learning. Based on theoretical observations, we propose new open problems and discuss the limitations of our results.
I recently came across of a similar flaw in the EEG classification experiments. I think most results should be taken with a grain of salt until comprehensively and irrefutably confirmed by independent teams.
This contamination of test data from the training data reminds me of "Overly Optimistic Prediction Results on Imbalanced Data: a Case Study of Flaws and Benefits when Applying Over-sampling" [1] where almost 50% of the 24 peer-reviewed studies that use machine learning based on a particular publicly-available dataset, were claiming near-perfect accuracy at predicting the risk of pre-term birth for a patient, but were actually testing (accidentally) on training data.
Oversampling, then applying a train-test split? Jesus, that's like machine learning 101. But then again, I see a lot of questionable practices in the application of ML in biology.
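For anyone who hasn't seen the failure mode, a minimal sketch of the leak (scikit-learn; the dataset, model, and numbers are made up for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.95], flip_y=0.2, random_state=0)

def oversample(X, y, seed=0):
    """Duplicate minority-class rows until the classes are balanced."""
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == 1)
    extra = rng.choice(minority, size=len(y) - 2 * len(minority), replace=True)
    idx = np.concatenate([np.arange(len(y)), extra])
    return X[idx], y[idx]

# Wrong: oversample first, then split -- duplicates of test rows end up in training.
Xo, yo = oversample(X, y)
Xtr, Xte, ytr, yte = train_test_split(Xo, yo, random_state=0)
print("leaky :", RandomForestClassifier(random_state=0).fit(Xtr, ytr).score(Xte, yte))

# Right: split first, oversample only the training portion.
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
Xtr, ytr = oversample(Xtr, ytr)
print("honest:", RandomForestClassifier(random_state=0).fit(Xtr, ytr).score(Xte, yte))
```

The first number comes out noticeably higher only because the model has memorized copies of the test rows, which is exactly the inflated accuracy the paper describes.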
What I don’t see clearly addressed here is whether the test data that these networks are validated against are part of the larger data set that was used for the initial training. I’m guessing the validation data is usually from the same data set, in which case it’s not really a surprise that a massively overfitted network would work pretty well against. Whereas some alternative data set produced by different people under even slightly different conditions will introduce many new unexpected variables that the network won’t be equipped to handle, and that’s when the “overfitting” to the original data set would be more obvious. But I’m going to guess that in practice, useful datasets vary so much that it’s impractical to do this sort of cross checking (and in reality it wouldn’t happen because you don’t want to publish a negative result).
I'm not in the field (nor an academic, nor particularly smart), but my impression is that this is implicit in nearly everything people are doing. Almost all of the papers will be talking about interpolation, not extrapolation. More specifically, training data and test data are assumed to be partitioning an existing data set into test and training portions. "Generalization" is measured only by success at fitting the test data.
Of course, most actual applications immediately break out of that model by running live against previously unobserved data coming from different populations, different times (just think: post-2020 vs pre-2020!), often different purposes. And probably much of the error because you're now extrapolating gets regarded as an engineering problem?
The intrinsic dimension idea seemed interesting, but I didn't follow this bit:
>By searching through the value of d=1,2,…,D, the corresponding d when the solution emerges is defined as the intrinsic dimension.
Is the idea to increase the "intrinsic dimension" until the network is able to learn? I'm not sure I buy the part of the argument where because the network is able to learn when the "intrinsic dimension" is kept low, it therefore follows that the "intrinsic dimension" is also low even when we aren't forcing it to be low. It seems a bit like saying "because we can obtain 90% accuracy on MNIST with a 10-parameter model, it therefore follows that a 1000-parameter model for MNIST only has 10 'intrinsic' parameters". Seems like a dubious / handwavey argument to me.
Thanks. The implication seems to be that if you restrict the # of parameters so it's equal to intrinsic dimension, learning isn't possible, but it is possible with this random projections method. Wonder why. It seems like with both methods, the number of possibilities being explored is the same, but the higher-parameter model space is richer with solutions for some reason.
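For reference, here are the mechanics of that random-projections method as I understand it (a rough sketch assuming PyTorch >= 2.0 for torch.func.functional_call; not the paper's code): freeze the full D-dimensional parameter vector at its initialization, train only a d-dimensional vector that a fixed random matrix maps into parameter space, and sweep d upward until training succeeds.

```python
import torch
import torch.nn as nn
from torch.func import functional_call

base = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
specs = [(name, p.shape, p.numel()) for name, p in base.named_parameters()]
theta0 = torch.cat([p.detach().flatten() for p in base.parameters()])
D = theta0.numel()

def run_in_subspace(x, z, P):
    """Evaluate the base net at theta0 + P @ z without touching its stored weights."""
    theta = theta0 + P @ z
    params, i = {}, 0
    for name, shape, numel in specs:
        params[name] = theta[i:i + numel].view(shape)
        i += numel
    return functional_call(base, params, (x,))

d = 50                                  # candidate intrinsic dimension
P = torch.randn(D, d) / d ** 0.5        # fixed random projection into weight space
z = torch.zeros(d, requires_grad=True)  # the only trainable parameters
opt = torch.optim.Adam([z], lr=1e-2)

x = torch.randn(256, 20)
y = (x[:, 0] > 0).long()                # toy, learnable labels

for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(run_in_subspace(x, z, P), y)
    loss.backward()
    opt.step()
# Repeat for d = 1, 2, ...; the smallest d at which this training reaches the
# full model's loss is reported as the intrinsic dimension.
```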
The intrinsic dimension paper indeed doesn't really show that big networks also search in small subspaces (and neither does it claim to), but this has already been shown in related papers like https://arxiv.org/abs/1812.04754
I think it is related to the lottery ticket idea. Essentially, you try to find the lowest d such that you can find a 'winning ticket' network of size d.
Well, it is not the first time the idea of dimensionality reduction has been used in the field. It is, for example, the idea behind latent semantic methods (LSI and LSA).
The tool identifies weight matrices that display atypical behavior, where the correlation is concentrated about unusually large matrix elements.
The idea comes from statistical mechanics of generalization, where it is known that neural networks that are over-fit are atypical and are in the spin glass phase of the learning phase space.
>> If you are like me, entering into the field of deep learning with experience in traditional machine learning, you may often ponder over this question: Since a typical deep neural network has so many parameters and training error can easily be perfect, it should surely suffer from substantial overfitting. How could it be ever generalized to out-of-sample data points?
Shouldn't this article start by presenting the evidence that deep neural nets _can_ generalise to out-of-sample data? It is a bit frustrating having to read a discussion of the why without a discussion of the what.
A small introductory paragraph would suffice. For example, quote so-and-so on studies showing that deep neural nets can generalise to out-of-sample data. Ideally, quote studies that show how well deep neural nets generalise to out-of-sample data.
As the article is now, it seems to be giving many explanations for a phenomenon a reader won't even know exists in the first place.
Look at ImageNet results. Test set accuracy (top-1) for ImageNet is 85% with deep nets, and 50% with the best of all other approaches and decades of CV work.
We build new test sets for the CIFAR-10 and ImageNet datasets. Both benchmarks have been the focus of intense research for almost a decade, raising the danger of overfitting to excessively re-used test sets. By closely following the original dataset creation processes, we test to what extent current classification models generalize to new data. We evaluate a broad range of models and find accuracy drops of 3% - 15% on CIFAR-10 and 11% - 14% on ImageNet. However, accuracy gains on the original test sets translate to larger gains on the new test sets. Our results suggest that the accuracy drops are not caused by adaptivity, but by the models' inability to generalize to slightly "harder" images than those found in the original test sets.
The passage I quote above speaks of "out-of-sample" generalisation, not "test-set" generalisation. These are not the same.
Unfortunately such terminological confusion is common, but "out-of-sample" should really be reserved for data that was not available during development of a system, either as a training, evaluation or testing partition. That is because "out of sample" suggests the data was drawn from a different distribution than the, well, training sample (where the training sample is then subdivided into training, evaluation and testing partitions); that is, drawn from the true distribution, the real world.
I guess the OP is instead using "out-of-sample" to mean "test set" (which is not uncommon), but in that case we don't need to look all the way to learning theory to figure it out: published results are well known to select for successful experiments, in machine learning as in many areas of research, unfortunately.
So, the thing that I think is most important about the article (which was wonderful, btw) is the double descent loss curve. Originally demonstrated for boosting in 1989, it seems to have made a comeback since 2016.
That being said, this kind of stuff is gonna mostly be in the papers, so I suggest following the interesting references from the article, and repeating until you feel you understand :)
This is pretty common in many fields. Researchers don't have time to read, they say. I see people who complain when they have to read one paper a week... This is disastrous in some labs, where the only work is to rediscover things and add make-up so it looks new once they realize it is not.
Have you got a reference for the 1989 discovery of double descent? A brief Googling didn't unearth it. I'm not "citation needed"-ing you. I think a lot of decision-tree and boosting work is being rediscovered by the DL community. I would love to see the original discovery.
Rather like the 1957 development of cross-validation (which I remember reading, but definitely don't have a cite for), the double descent thing is part of the paper that is used on the way to something else.
Do you mean the boosting graphs in Fig. 1, 4 of that paper?
It looks, though, as if they have a double descent on the train set too, so it might not be the same phenomenon.
Nevertheless, good to know, thanks for sharing! I knew both papers but never thought of paying much attention to such details of the figures in the 1998 one. Is the connection between the papers well known, i.e. something people talk about?