Compression is a necessary but not sufficient condition for 'comprehension'. The whole 'compression' idea here is a bit of a misdirection; it will be a side effect of any success.
The data fed into these systems is just measurements of target systems (eg., of light for photographs). This data is radically incomplete, so no compression of it will be an accurate (eg., 3d) model.
To reconstruct the world you need to measure the measuring device (ie., the body) as it interacts with the target system. In most cases you need, also, the hypothesis behind the action to resolve ambiguity in the measurement data.
Eg., you need to know you moved your hand to touch the fireplace to properly interpret what 'finger pain' means.
It is for reasons of this kind that 'comprehension' is an engineering problem, not a programme for a universal Turing machine (which has no device boundaries).
It is engineering in the sense that 'compression' has to occur under the right measurement procedures, with hypothesis-laden action, etc.
Focusing on mathematical abstracta completely misses the problem.
All deep systems do is compress their training data to form archetypes and compare novel input to compressed archetypes. Since the data itself is necessarily profoundly ambiguous, there is only a trivial sense of 'generalisation' achieved.
My comment here wasn't aimed at the problem of understanding and improving DNNs, but rather at generalisation and 'comprehension' proper.
3d image reconstruction is trivially possible, with the right data and with the right assumptions encoded into the right algorithm.
My target of attack here is the equivocation that 'compression is comprehension'.
In my view comprehension isn't about 'generalization via compressed archetypes' as in ordinary NNs.
And it isn't about 3d reconstruction given the relevant modification to data and approach.
Rather, generalization is abduction: the ability, from a single instance, to form law-like, provisional, universalising models which explain your environment. This process will lead to a 'compressed' 'representation', but nothing like the sense in which compression is used here.
It is this that is naively assumed of these systems. They do not, and cannot, abduct. Abduction isn't a statistical process; it involves, at the very least, counterfactual reasoning and hypothetical action.
This is the problem that such equivocations (ie., that compression = comprehension) miss.
And in my view this is an engineering challenge; not something to be specified for a universal computer.
The relevant missing capacities aren't better means of compression.
Sorry but I have to chime in here. His thoughts seem insightful and relevant to discussion. You are calling his contributions 'meaningless' and condescendingly telling him twice to 'be specific'. That's not how things are meant to be discussed around here and not how you get an answer out of someone.
I’m comfortable, having read his comment history, that a blunt request to explain how the comments are relevant to the OP (and the failure to articulate that when bluntly asked to do so) indicates that what seems “insightful” is, in fact, not falsifiable and therefore nothing more than armchair philosophy.
I’m not into that, so if it terminates the discussion then so be it.
It might be the case, like with quantum mechanics, that NN theory is just fundamentally weird (e.g. because the space is so multidimensional), and it's hard for us to understand it no matter how long we study it.
Though, one trivial way to do it, with NNs in any case, is just to project forward from a range of observer models and guess the observer parameters from them.
This is still the wrong sense of generalisation. What can't be guessed is why a person took consecutive pictures at given angles, etc.
Such information is necessary to resolve deep ambiguities in cases where your observer model will fail.
Eg., yesterday I looked out my window and thought I saw two people; it was actually one person with a shadow and a bag.
I moved my eyes/head/body in such a way as to fit a variety of models, and I was able to 'read the scene' in the end.
And there's no reason we couldn't have a deep learning system where the input data (images) included time-stamps and movement vectors, and it could be good both at easy image classification, and at choosing particular "head movements" like those you performed, to help resolve ambiguous cases.
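For what it's worth, here is a rough sketch of the input side of such a system (PyTorch; every class and parameter name is made up for illustration, and choosing the movements is left out — this only consumes a movement vector alongside each frame):

```python
# Hypothetical sketch: a classifier whose input is an image frame plus the
# head-movement vector that produced it. Assumes PyTorch; names are invented.
import torch
import torch.nn as nn

class ActiveViewClassifier(nn.Module):
    def __init__(self, num_classes=10, motion_dim=4):
        super().__init__()
        # Small CNN encoder for the image frame.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Motion vector, e.g. (dx, dy, dz, dt): how the "head" moved since the last frame.
        self.motion = nn.Sequential(nn.Linear(motion_dim, 32), nn.ReLU())
        self.head = nn.Linear(32 + 32, num_classes)

    def forward(self, image, motion):
        z = torch.cat([self.encoder(image), self.motion(motion)], dim=1)
        return self.head(z)

model = ActiveViewClassifier()
logits = model(torch.randn(8, 3, 64, 64), torch.randn(8, 4))  # batch of 8 frames + motions
```

Whether end-to-end training would actually teach such a network to exploit the motion signal the way we do is, of course, the open question.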
Further food for thought: these ambiguous cases seem (do you agree?) to be very rare.
Ambiguity is the norm, it isn't rare. Almost all visual input, ie., light, is ambiguous. We (animals) use the history of our prior geometrical-light experiences (ie., walking around) to use environmental cues to resolve ambiguity.
That billions (or trillions) of images are needed to approximate what we can do from a single instance is, I think, a good guide to the magnitude of the problem.
Google the "amnes room" -- that "illusion" is how we are always seeing.
>Compression is a necessary but not sufficient condition for 'comprehension'.
Actually, algorithmic information theory shows that maximum compression necessarily entails maximum comprehension, because the only way to maximally compress something is to exactly understand the process producing it.
The parent commenter is not disagreeing with information theory (and what you're saying is shown in the article anyway).
They're making a practical distinction that you generally don't have access to the actual thing in an empirical format for which compression will achieve true learning. Instead you have access to training data which represents, let's say, a projection of the actual thing in a smaller space with fewer dimensions.
Like trying to learn from images instead of the 3d world. Humans learn to distinguish between objects in a 3-dimensional space using sight and interaction. This learning generalizably transfers to recognition in 2 dimensions. We don't generally equip models with robotic interfaces to train in 3d before benchmarking them on ImageNet.
> We don't generally equip models with robotic interfaces to train in 3d before benchmarking them on ImageNet.
Don't they train models using 3D rendering and simulations? We have relatively realistic simulations for various scenarios - having a learned model that could make inferences based on those complex simulations sounds like a win.
>Like trying to learn from images instead of the 3d world. Humans learn to distinguish between objects in a 3-dimensional space using sight and interaction. This learning generalizably transfers to recognition in 2 dimensions.
If we use human "comprehension" as a reference point, then the relevant point of comparison should be the understanding a human can develop given the same inputs.
Sure, but how do you measure that? How do we figure out how much understanding a human can develop from only ever seeing 2d pictures, without any movement or interaction with a 3d world?
Most ML problems are things humans are quite good at and have a lot of context to draw from.
Sure. But again, practically speaking, that isn't the reality of how we learn. The commenter wasn't refuting Kolmogorov complexity. They're just saying it's an extremely limited way of viewing the problem. Useful sure, but insufficient.
Relative to a device to perform the understanding.
A string X maximally compresses a dataset Y iff X is a 'comprehending' of Y.
'OK'... but what produces and evaluates X? ie., comprehension.
This is the problem with defining these terms mathematically; you state the problem in basically useless ways.
Yes, you can specify what equation produces the mass of the Higgs boson. That's basically no guide to building the LHC.
The production of such understanding is not abstract. Comprehension isn't a relation between two binary strings; it is an action taken in an environment with a goal.
>Relative to a device to perform the understanding.
It uses Kolmogorov complexity, which is defined as the length of a shortest computer program in a predetermined programming language that produces the object as output. Note that this measure is relative to a programming language, not a program, and the exact choice of language doesn't matter too much. Compression means creating a smaller program that produces the same output, and to produce the same output with less code necessarily requires more understanding.
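In symbols, this is the standard definition, with U the chosen reference machine/language, |p| the length of program p in bits, and V any other choice:

```latex
K_U(x) = \min\{\, |p| : U(p) = x \,\}, \qquad |K_U(x) - K_V(x)| \le c_{U,V} \ \text{for all } x
```

The second part (the invariance theorem) is why the exact choice of language doesn't matter too much: switching languages shifts the complexity by at most a constant.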
As a concrete example, imagine the output is [1, 2, fizz, 4, buzz, fizz, 7, 8, fizz, buzz, 11, fizz, 13, 14, fizzbuzz, ... up to 1000]. The longest program to output this would just hard-code it in the source code (much as a very inexperienced programmer might solve the problem, or a large neural net). Someone with a better understanding would write a program using iteration and the modulus operator, which would be shorter.
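Spelled out as (hypothetical) Python, the contrast looks like this:

```python
# The "no understanding" program: hard-code the sequence (imagine all ~1000
# entries written out; only the first ten are shown here).
hard_coded = [1, 2, "fizz", 4, "buzz", "fizz", 7, 8, "fizz", "buzz"]  # ... up to 1000

# The "understanding" program: the rule itself, far shorter than the data.
def fizzbuzz(n):
    out = []
    for i in range(1, n + 1):
        s = ("fizz" if i % 3 == 0 else "") + ("buzz" if i % 5 == 0 else "")
        out.append(s or i)
    return out

assert fizzbuzz(10) == hard_coded  # same output, much shorter description
```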
Yep, it gets into a philosophical debate about what comprehension is.
One could argue it's not about compression in bits but compression to primitives that make sense to the human mind. But then the definition becomes too fuzzy, because it naturally invites the question "Whose mind?"
Compression is applied to data collected in the past while comprehension or intelligence also require adapting well to the future. Thus a pure compressor would optimize for a static distribution while an intelligent agent will probably be sub-optimal because it also needs to learn and adapt to the future.
But what if we drop the word comprehension, and we just go with “functional approximation X -> Y, computed from a finite dataset, which minimizes a predictive risk”?
It’s unclear why compression is necessary there, except as a practical benefit.
I will nitpick and say that models don't compress data but extract useful information. Compression, lossy or lossless, is for data reconstruction. Machine learning models retain information that generalises.
Is there a meaningful / well defined difference? One could say that lossy compression is extraction of useful information. You need to identify unnecessary information to know what to discard safely.
One is a subset of the other. Compression being the larger category.
A mean is a compression of a dataset and useful information.
My issue in my comment is that 'compression' corresponds to a massive class of techniques, and there isn't a lot of content in the observation that useful information is compressive.
However there are some hypey people out there who think this observation has legs -- precisely the people who think intelligence is a mathematical problem, and not an engineering one, which is my view.
Ie., that a body isn't incidental to intelligence, but the heart of it.
I tend to think about it the same way as you. Having an algorithm for multiplying two numbers is qualitatively different than having a lossy compression of a huge dataset of multiplication tables. The latter is what GPT3 has and it just doesn't scale.
Consider a generative adversarial network for faces. The photos have, let's say, some unique scars in them.
A successful model could create faces with scars, but not the exact scar, face, and background it trained on, without additional information. What you are looking for is mutual information between the images, not compression.
Yes, you could use a very well trained GAN for face compression. But the GAN model itself would not be able to reconstruct its training input without being shown the images again.
According to Wikipedia [data] compression "is the process of encoding information using fewer bits than the original representation" [1] while lossy compression is "the class of data encoding methods that uses inexact approximations and partial data discarding to represent the content" [2]
There is a difference between feature extraction and compression. The former selects the most unique and important elements from a set of data, while the latter attempts, in the case of lossy compression, to find a smaller approximation of the original data set.
For example imagine a data set of paintings. Feature extraction might simply identify what color paints were used in each painting. That alone might allow classification of paintings with respect to the painter, style or period. But this would not be an approximation of the original picture, unless you consider resizing a Jackson Pollock into an 8x1 pixel image to be compression.
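As a toy version of that distinction (plain numpy; the function name and bin count are arbitrary): a coarse colour histogram is a feature vector you could classify painters or styles on, but nothing resembling the painting can be rebuilt from it.

```python
import numpy as np

def color_features(image, bins=8):
    """Coarse joint RGB histogram of an (H, W, 3) uint8 image, normalized to sum to 1."""
    pixels = image.reshape(-1, 3)
    hist, _ = np.histogramdd(pixels, bins=(bins,) * 3, range=[(0, 256)] * 3)
    return (hist / hist.sum()).ravel()  # 512 numbers, regardless of image size

painting = np.random.randint(0, 256, size=(480, 640, 3), dtype=np.uint8)  # stand-in image
features = color_features(painting)  # useful for classification, useless for reconstruction
```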
"The former selects the most unique and important elements from a set of data, while the latter attempts, in the case of lossy compression, to find a smaller approximation of the original data set."
The compressor finds a smaller approximation of the data by finding the most redundant data, and as a byproduct it also finds the most unique. At least from the description you gave there is no difference between the two.
"For example imagine a data set of paintings. Feature extraction might simply identify what color paints were used in each painting. That alone might allow classification of paintings with respect to the painter, style or period. But this would not be an approximation of the original picture, unless you consider resizing a Jackson Pollock into an 8x1 pixel image to be compression."
You can and we do use compressors to do exactly that. Take a painting (or a set) as reference and use it to compress other paintings and the greater the compression the more similar the styles are.
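That trick has a standard form, the normalized compression distance. A crude sketch with zlib (byte strings stand in for image files; the helper names are mine):

```python
import zlib

def c(data: bytes) -> int:
    """Compressed size as a rough stand-in for Kolmogorov complexity."""
    return len(zlib.compress(data, level=9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: smaller means more shared structure."""
    cx, cy, cxy = c(x), c(y), c(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

a = b"abcabcabc" * 100
b = b"abcabcabc" * 90 + b"xyz" * 30
z = bytes(range(256)) * 10

print(ncd(a, b))  # relatively small: the two share a lot of structure
print(ncd(a, z))  # larger: almost nothing transfers between them
```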
No, because lossy compression has the goal of preserving elements for replay. At the extreme you could turn a movie into subtitles, but not into a costume from the movie alone. The costume might contain more information from the movie, but you can't meaningfully recreate the movie from a single costume.
The laws alone don’t have enough information to replay their motion without each planet's location, a focus, the orbital period, and an associated timestamp or equivalent information.
So, include that data and it’s lossy compression, exclude it and it’s not.
> Focusing on mathematical abstracta completely misses the problem.
That rather depends on what you think the problem is or ought to be. I'm not sure that the author or many readers agree with you. It seemed fairly clear to me that they are interested in how well they can generate outputs corresponding to novel inputs that do well on some utility function. And for that, the "mathematical abstracta" is rather important and useful. "Comprehension" might also be important and useful, but is itself neither necessary nor sufficient.
A self-driving car will hopefully try to avoid running over a two-headed person crossing a road, without needing to worry about whether a two-headed person is mythical vs a miscomprehension of a woman carrying a child.
Besides, the layer at which comprehension occurs is not fixed. Our measuring devices themselves take in ambiguous inputs and process them. Do you stop at the surface of the skin? The nerve impulses? Various layers within the brain and nervous system?
Your complaint is reasonable, yet is not really that different from saying that computers are pointless without any I/O capabilities -- it doesn't matter what they compute if it is unobservable and has no effect on the world. That is true, yet doesn't mean that the whole of computer science is "a bit of a misdirection".
I get so pissed every time the double descent paper gets brought up, maybe because of the hubris in its abstract. No "questions about the mathematical foundation of machine learning" were raised by deep learning and a few silly experiments are far from showing "limits of classical analyses".
All the paper shows is (1) stupid ways of counting model complexity and (2) that gradient descent is flawed. Nobody in their right mind believes that increasing the number of hidden neurons can result in a network that is worse on the test set. Since the bigger network contains the smaller network, it is perfectly capable of achieving the same performance, so the only reason why this does not happen is that SGD cannot find it. But of course "SGD cannot always find good solutions" is surprising to nobody, so let's just shit on decades of serious work to get our little paper out.
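To make the containment point concrete, here is a tiny sanity check (a PyTorch sketch, not anything from the paper): embed a narrow layer inside a wider one by zero-padding, and the function is unchanged, so the wider net can always represent whatever the narrow one learned.

```python
import torch
import torch.nn as nn

small = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 2))
big   = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 2))

with torch.no_grad():
    for p in big.parameters():
        p.zero_()
    big[0].weight[:16] = small[0].weight     # first 16 hidden units copy the small net
    big[0].bias[:16]   = small[0].bias
    big[2].weight[:, :16] = small[2].weight  # output layer reads only those 16 units
    big[2].bias.copy_(small[2].bias)

x = torch.randn(5, 10)
assert torch.allclose(small(x), big(x), atol=1e-6)  # identical behaviour
```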
I really do not understand your post. One doesn't train on the test set, so of course it's reasonable to think that increasing the number of parameters will cause more overfitting.
And yet that is not what is observed in practice. See figures 1 and 2 of that paper [1].
What I am complaining about is that the authors are confusing "what can be represented" by a neural network with "what can be easily learned" via SGD. I argued that the peak in test loss in those figures is observed because SGD struggles to find a solution that generalizes well, not because those networks are intrinsically less powerful (as the paper seems to imply).
> And yet that is not what is observed in practice. See figures 1 and 2 of that paper
Yeah... that's why the paper is a good contribution contrary to what you're saying. Not sure why you're repeating this information.
> What I am complaining about is that the authors are confusing "what can be represented" by a neural network with "what can be easily learned" via SGD. I argued that the peak in test loss in those figures is observed because SGD struggles to find a solution that generalizes well, not because those networks are intrinsically less powerful (as the paper seems to imply).
I mean... those models are by definition less powerful, as they have fewer parameters. The (to me) main point of the paper is to point out a symptom which is interesting. Their explanation for the symptom being (maybe) wrong doesn't detract from the important work of showing the symptom exists.
A legitimate criticism would be that there have been earlier papers showing the same thing.
Sidenote: It seems having a larger model does make it easier for SGD to find good solutions[1]
The fact that over parametrized models can generalize better than under parameterized models, and that both are better than models that can just barely interpolate is a genuinely new insight that was not predicted by any prior theory.
Let's for a second assume that we bound the weights within [-1,1]. Considering that we are using floats to represent them, which are a subset of the real numbers, by increasing the number of hidden neurons you increase the class of models that you can select from; any single bit change in the weights means a different model. By increasing the class of models in your search space, you increase the number of models that perform well on training but worse on test, i.e. underestimate the true loss.
Exactly because you select a single model and you have more models that underestimate the true (read test) error, you *may* be more likely to get models with worse performance on the test set compared to using a smaller network exactly because in the smaller network you are searching within a smaller distribution.
I'm not clear why you are blaming SGD. Maybe I missed the point. In principle SGD might well find the global optimum. The problem is that this optimum is achieved only for the training data. It could certainly perform worse on the test set. Maybe you are referring to the entire training process? The general idea is from the days of SVMs, where the optimization method was convex.
Though personally I do find a lot of this modern "experimental" research quite hokey. I don't think this is something academics should be getting research funding to pursue. This is engineers building intuition about how to tune their product.
You don't know. The point was even if you did get there it could still be an overfit model you don't want, since it's based on a training data set, not the true statistics of the distribution the samples come from.
> Since the bigger network contains the smaller network, it is perfectly capable of achieving the same performance, so the only reason why this does not happen is that SGD cannot find it.
This is maybe true in the limit of infinite data, but not true in any practical sense, and I don't think it has anything to do with SGD. E.g. polynomial basis functions also have this property, but you can't use an arbitrarily large polynomial order or you'll eventually overfit. You can get a closed-form solution for polynomial regression problems, so no SGD involved.
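A quick illustration of that polynomial point (numpy; the sizes and noise level are arbitrary): the fit is exact least squares, no SGD anywhere, and the near-interpolating model still does worse on held-out points.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(-1, 1, 15))
y_train = np.sin(3 * x_train) + 0.1 * rng.normal(size=15)
x_test = np.linspace(-1, 1, 200)
y_test = np.sin(3 * x_test)

for degree in (3, 14):
    # Closed-form least-squares fit; numpy may warn that degree 14 is poorly
    # conditioned, which is part of the point.
    coeffs = np.polyfit(x_train, y_train, degree)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(degree, round(float(test_mse), 4))
# Degree 14 interpolates the 15 training points but typically has much larger
# test error than degree 3.
```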
I found this to be a great survey post on a question I've been wondering more and more as I hear about all of the machine learning going on. I'm no academic nor mathematician, and I fear that diving into actual papers would quickly blow out the few remaining neurons I have left, but this article was mostly understandable and brought up a lot of points that have been floating around in my head. Kudos to the author.
The one substantive comment I want to make is that I really wonder about the difficult to reproduce findings. What's going on when attempts to reproduce them fail? It's a general question, but I do wonder how much of that is because other effects are swamping the signal, and how much is that the finding only applies to limited situations. (Not that those are entirely different.) If I don't see the double-u curve, is it because my problem space has an atypical shape, or because the researchers' did?
I haven't read this paper yet so I can't speak to its quality, but it appears to be addressing the same questions as this post. Bengio is a coauthor, so maybe that's a good sign. Here's the abstract.
This paper provides theoretical insights into why and how deep learning can generalize well, despite its large capacity, complexity, possible algorithmic instability, nonrobustness, and sharp minima, responding to an open question in the literature. We also discuss approaches to provide non-vacuous generalization guarantees for deep learning. Based on theoretical observations, we propose new open problems and discuss the limitations of our results.
I recently came across of a similar flaw in the EEG classification experiments. I think most results should be taken with a grain of salt until comprehensively and irrefutably confirmed by independent teams.
This contamination of test data from the training data reminds me of "Overly Optimistic Prediction Results on Imbalanced Data: a Case Study of Flaws and Benefits when Applying Over-sampling" [1] where almost 50% of the 24 peer-reviewed studies that use machine learning based on a particular publicly-available dataset, were claiming near-perfect accuracy at predicting the risk of pre-term birth for a patient, but were actually testing (accidentally) on training data.
Oversampling, then applying a train-test split? Jesus, that's like machine learning 101. But then again, I see a lot of questionable practices in the application of ML in biology.
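For anyone who hasn't seen the failure mode, a minimal sketch of the leak (scikit-learn; the dataset, model, and numbers are made up for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.95], flip_y=0.2, random_state=0)

def oversample(X, y, seed=0):
    """Duplicate minority-class rows until the classes are balanced."""
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == 1)
    extra = rng.choice(minority, size=len(y) - 2 * len(minority), replace=True)
    idx = np.concatenate([np.arange(len(y)), extra])
    return X[idx], y[idx]

# Wrong: oversample first, then split -- duplicates of test rows end up in training.
Xo, yo = oversample(X, y)
Xtr, Xte, ytr, yte = train_test_split(Xo, yo, random_state=0)
print("leaky :", RandomForestClassifier(random_state=0).fit(Xtr, ytr).score(Xte, yte))

# Right: split first, oversample only the training portion.
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
Xtr, ytr = oversample(Xtr, ytr)
print("honest:", RandomForestClassifier(random_state=0).fit(Xtr, ytr).score(Xte, yte))
```

The first number comes out noticeably higher only because the model has memorized copies of the test rows, which is exactly the inflated accuracy the paper describes.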
What I don’t see clearly addressed here is whether the test data that these networks are validated against are part of the larger data set that was used for the initial training. I’m guessing the validation data is usually from the same data set, in which case it’s not really a surprise that a massively overfitted network would work pretty well against. Whereas some alternative data set produced by different people under even slightly different conditions will introduce many new unexpected variables that the network won’t be equipped to handle, and that’s when the “overfitting” to the original data set would be more obvious. But I’m going to guess that in practice, useful datasets vary so much that it’s impractical to do this sort of cross checking (and in reality it wouldn’t happen because you don’t want to publish a negative result).
I'm not in the field (nor an academic, nor particularly smart), but my impression is that this is implicit in nearly everything people are doing. Almost all of the papers will be talking about interpolation, not extrapolation. More specifically, training data and test data are assumed to be partitioning an existing data set into test and training portions. "Generalization" is measured only by success at fitting the test data.
Of course, most actual applications immediately break out of that model by running live against previously unobserved data coming from different populations, different times (just think: post-2020 vs pre-2020!), often different purposes. And probably much of the error because you're now extrapolating gets regarded as an engineering problem?
The intrinsic dimension idea seemed interesting, but I didn't follow this bit:
>By searching through the value of d=1,2,…,D, the corresponding d when the solution emerges is defined as the intrinsic dimension.
Is the idea to increase the "intrinsic dimension" until the network is able to learn? I'm not sure I buy the part of the argument where because the network is able to learn when the "intrinsic dimension" is kept low, it therefore follows that the "intrinsic dimension" is also low even when we aren't forcing it to be low. It seems a bit like saying "because we can obtain 90% accuracy on MNIST with a 10-parameter model, it therefore follows that a 1000-parameter model for MNIST only has 10 'intrinsic' parameters". Seems like a dubious / handwavey argument to me.
Thanks. The implication seems to be that if you restrict the # of parameters so it's equal to intrinsic dimension, learning isn't possible, but it is possible with this random projections method. Wonder why. It seems like with both methods, the number of possibilities being explored is the same, but the higher-parameter model space is richer with solutions for some reason.
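For reference, here are the mechanics of that random-projections method as I understand it (a rough sketch assuming PyTorch >= 2.0 for torch.func.functional_call; not the paper's code): freeze the full D-dimensional parameter vector at its initialization, train only a d-dimensional vector that a fixed random matrix maps into parameter space, and sweep d upward until training succeeds.

```python
import torch
import torch.nn as nn
from torch.func import functional_call

base = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
specs = [(name, p.shape, p.numel()) for name, p in base.named_parameters()]
theta0 = torch.cat([p.detach().flatten() for p in base.parameters()])
D = theta0.numel()

def run_in_subspace(x, z, P):
    """Evaluate the base net at theta0 + P @ z without touching its stored weights."""
    theta = theta0 + P @ z
    params, i = {}, 0
    for name, shape, numel in specs:
        params[name] = theta[i:i + numel].view(shape)
        i += numel
    return functional_call(base, params, (x,))

d = 50                                  # candidate intrinsic dimension
P = torch.randn(D, d) / d ** 0.5        # fixed random projection into weight space
z = torch.zeros(d, requires_grad=True)  # the only trainable parameters
opt = torch.optim.Adam([z], lr=1e-2)

x = torch.randn(256, 20)
y = (x[:, 0] > 0).long()                # toy, learnable labels

for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(run_in_subspace(x, z, P), y)
    loss.backward()
    opt.step()
# Repeat for d = 1, 2, ...; the smallest d at which this training reaches the
# full model's loss is reported as the intrinsic dimension.
```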
The intrinsic dimension paper indeed doesn't really show that big networks also search in small subspaces (and neither does it claim to), but this has already been shown in related papers like https://arxiv.org/abs/1812.04754
I think it is related to the lottery ticket idea. Essentially, you try to find the lowest d such that you can find a 'winning ticket' network of size d.
Well, it is not the first time the idea of dimensionality reduction has been used in the field. It is, for example, the idea behind latent semantic methods (LSI and LSA).
The tool identifies weight matrices that display atypical behavior, where the correlation is concentrated about unusually large matrix elements.
The idea comes from statistical mechanics of generalization, where it is known that neural networks that are over-fit are atypical and are in the spin glass phase of the learning phase space.
>> If you are like me, entering into the field of deep learning with experience in traditional machine learning, you may often ponder over this question: Since a typical deep neural network has so many parameters and training error can easily be perfect, it should surely suffer from substantial overfitting. How could it be ever generalized to out-of-sample data points?
Shouldn't this article start by presenting the evidence that deep neural nets _can_ generalise to out-of-sample data? It is a bit frustrating having to read a discussion of the why without a discussion of the what.
A small introductory paragraph would suffice. For example, quote so-and-so on studies showing that deep neural nets can generalise to out-of-sample data. Ideally, quote studies that show how well deep neural nets generalise to out-of-sample data.
As the article is now, it seems to be giving many explanations for a phenomenon a reader won't even know exists in the first place.
Look at ImageNet results. Test set accuracy (top-1) for ImageNet is 85% with deep nets, and 50% with the best of all other approaches and decades of CV work.
We build new test sets for the CIFAR-10 and ImageNet datasets. Both benchmarks have been the focus of intense research for almost a decade, raising the danger of overfitting to excessively re-used test sets. By closely following the original dataset creation processes, we test to what extent current classification models generalize to new data. We evaluate a broad range of models and find accuracy drops of 3% - 15% on CIFAR-10 and 11% - 14% on ImageNet. However, accuracy gains on the original test sets translate to larger gains on the new test sets. Our results suggest that the accuracy drops are not caused by adaptivity, but by the models' inability to generalize to slightly "harder" images than those found in the original test sets.
The passage I quote above speaks of "out-of-sample" generalisation, not "test-set" generalisation. These are not the same.
Unfortunately such terminological confusion is common, but "out-of-sample" should really be reserved for data that was not available during development of a system, either as a training, evaluation or testing partition. That is because "out of sample" suggests the data was drawn from a different distribution than the, well, training sample (where the training sample is then subdivided into training, evaluation and testing partitions); that is, drawn from the true distribution, the real world.
I guess the OP is instead using "out-of-sample" to mean "test set" (which is not uncommon), but in that case we don't need to look all the way to learning theory to figure it out: published results are well known to select for successful experiments, in machine learning as in many areas of research, unfortunately.
So, the thing that I think is most important about the article (which was wonderful, btw) is the double descent loss curve. Originally demonstrated for boosting in 1989, it seems to have made a comeback since 2016.
That being said, this kind of stuff is gonna mostly be in the papers, so I suggest following the interesting references from the article, and repeating until you feel you understand :)
This is pretty common in many fields. Researchers don't have time to read, they say. I see people who complain when they have to read one paper a week... This is disastrous in some labs, where the only work is to rediscover things and add make-up so it looks new once they realize it is not.
Have you got a reference for the 1989 discovery of double descent? A brief Googling didn't unearth it. I'm not "citation needed"-ing you. I think a lot of decision-tree and boosting work is being rediscovered by the DL community. I would love to see the original discovery.
Rather like the 1957 development of cross-validation (which I remember reading, but definitely don't have a cite for), the double descent thing is part of the paper that is used on the way to something else.
Do you mean the boosting graphs in Fig. 1, 4 of that paper?
It looks, though, as if they have a double descent on the train set too, so it might not be the same phenomenon.
Nevertheless, good to know, thanks for sharing! I knew both papers but never thought of paying much attention to such details of the figures in the 1998 one. Is the connection between the papers well known, i.e. something people talk about?