> I feel like future research will focus on finding the line that divides the "tractable" problems from the "intractable" ones.
My gut feeling is there is no "tractable" line. All subsets of problems are tractable.
Via analogy, think of auto-encoders or dimensionality reduction. You may have 500^2 dimensions, one per pixel, in an image recognition task. Representing all of that data in a lower-dimensional space is impossible via pigeonhole. But if you only feed your net the subset of images of cats (or dogs, or only images commonly taken on cameras), it's actually a much smaller space and it can be represented efficiently in lower dimensions. And you can take any subset and it will have structure, and because it has structure it can be represented more efficiently and solved efficiently. The fact that we're only interested in a subset of the problem space is what makes it efficient for us to solve.
Even if we generated "random" pixels, but used a method to only generate a subset of that noise (i.e., gave it structure), then we'd be able to reduce dimensions on it.
There is no line or fixed portion of the space that's tractable or not; it's the fact that we're always working with a structured portion of the space that makes it tractable.
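To make the analogy concrete, here's a rough sketch (assuming NumPy and scikit-learn are available) that uses PCA as a crude stand-in for an autoencoder; the dataset sizes, the 10-pattern generator, and the 95% variance cutoff are all arbitrary choices for illustration:

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    n_samples, side = 500, 32            # 32x32 "images", 1024 raw dimensions
    dim = side * side

    # "Structured" images: mixtures of the same 10 basis patterns, so the data
    # really lives on a ~10-dimensional subset of the 1024-dimensional space.
    basis = rng.normal(size=(10, dim))
    structured = (rng.normal(size=(n_samples, 10)) @ basis
                  + 0.01 * rng.normal(size=(n_samples, dim)))

    # "Unstructured" images: independent random pixels, no shared structure.
    unstructured = rng.normal(size=(n_samples, dim))

    def components_for_95pct_variance(x):
        cumulative = np.cumsum(PCA().fit(x).explained_variance_ratio_)
        return int(np.searchsorted(cumulative, 0.95) + 1)

    print("structured:  ", components_for_95pct_variance(structured))    # ~10
    print("unstructured:", components_for_95pct_variance(unstructured))  # hundreds

The structured subset collapses into a handful of dimensions because it was generated with structure; the unstructured one needs essentially all of them.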
I think it will be phenomenal when it's finally commonplace, but I think the comment they're making is: "It's amazing that there's this thing called compression and it works, and works well, on just about anything we want to compress. I understand theoretically it should make some things larger, but just about everything I try to compress it makes smaller!" Of course it does, because the only things we want to compress are things that have enough inherent structure that we want to save and share them. If all we were interested in were just the output of a random number generator (which wouldn't compress), then we'd never want to compress it anyway.
It's phenomenal, actually. The only problems it works on are the problems we want to solve, because the problems we want to solve have structure in them; they follow some rules (which is why we care about them to begin with).
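A tiny sketch of that compression point, with zlib standing in for any reasonable compressor:

    import os
    import zlib

    structured = b"the cat sat on the mat. " * 1000   # repetitive, structured
    random_bytes = os.urandom(len(structured))        # raw output of an RNG

    for name, data in [("structured", structured), ("random", random_bytes)]:
        print(name, len(data), "->", len(zlib.compress(data, 9)), "bytes")

    # The structured input shrinks dramatically; the random input comes back
    # essentially the same size (slightly larger, due to framing overhead).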
However, that still leaves an open problem. To the best of my knowledge there is no good way to measure "tractability," i.e., how much structure a problem has in comparison to other problems.
For example, I don't think there is a good way to measure how much more tractable -- i.e., how much more structure there is in -- the problem of classifying dog/cat photos versus the problem of classifying dog/cat/wolf photos versus the problem of classifying fruit/vegetable photos.
The various measures of "structure" that we have today (Shannon entropy, algorithmic information content, etc.) fall short for measuring the kind of hierarchical structure (multiple levels of function composition) that is prevalent in high-dimensional problems of interest.
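As a made-up illustration of where those measures stop helping: a per-symbol Shannon entropy estimate can't even distinguish a blobby "image" from a pixel-shuffled copy of itself, because both have exactly the same histogram:

    import numpy as np

    rng = np.random.default_rng(0)
    side = 64
    yy, xx = np.mgrid[0:side, 0:side]
    # A crude "structured image": a bright disc on a dark background.
    image = (((xx - 32) ** 2 + (yy - 32) ** 2) < 15 ** 2).astype(np.uint8) * 200 + 20

    shuffled = image.flatten().copy()
    rng.shuffle(shuffled)              # same pixel values, all structure destroyed

    def empirical_entropy(values):
        _, counts = np.unique(values, return_counts=True)
        p = counts / counts.sum()
        return -(p * np.log2(p)).sum()

    print(empirical_entropy(image.flatten()))   # identical to...
    print(empirical_entropy(shuffled))          # ...this, despite no structure left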
Any research that might lead to a good way of measuring "hierarchical structure" (for lack of a better term) seems important to me.
> Any research that might lead to a good way of measuring "hierarchical structure" (for lack of a better term) seems important to me.
Which is an interesting question. Why is hierarchical structure so important for ML, but not for compression?
One view would be that our ML approach is "wrong" and could be improved. In almost all cases, running compressed data through another compression scheme yields little to no improvement. Yet in ML we're focused on multiple transforms requiring a concept of hierarchy. Maybe it's because ML is the equivalent of a multi-layered, modular compression scheme, while in practice all compression is fairly monolithic. If instead we compressed using first a layer doing key-based lookup of long identical strings, and then did run-length encoding, we'd have something that looks more like our ML approach. But this doesn't intuitively feel right to me.
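The "compressed data doesn't re-compress" observation is easy to check directly (again with zlib as a stand-in for any reasonable compressor):

    import zlib

    data = b"if the problem has structure, it compresses; " * 2000
    once = zlib.compress(data, 9)
    twice = zlib.compress(once, 9)
    print(len(data), len(once), len(twice))
    # The first pass shrinks the data enormously; the second pass barely changes
    # it, and can even make it slightly larger.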
Another view is that ML works on the physical world, and the physical world has inherent hierarchical structure, whereas we don't exploit that in the digital world of compression. But a counterpoint is that when we program, we exploit hierarchical structure in our code and data via functions, layers, and abstraction.
Which leads me to think it's the third answer: compression is lossless, and ML is lossy. Being lossy is what allows hierarchy to work. If you are lossless, you can't ever leave the lowest level of abstraction, or else you're by definition losing/changing information. The command "Send an email with content X to my sister" is absolutely a lossy summary of what is happening, or will happen, below the surface. When I give my computer that command, it will translate to different low-level actions each time I do it. Similarly, if someone asks what command I just ran and I give that description, it isn't enough information for them to recreate an exact CPU- and network-packet-level reproduction of what I did. It is lossy. Machine learning of "is this a cat?" is exactly like this. By contrast, in compression you have to recover every bit.
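A toy way to see the asymmetry (the trace format and the summary rule here are invented purely for illustration): many distinct low-level traces map to the same high-level description, so the description can't be inverted back to a trace, while lossless compression has to round-trip every bit:

    import random
    import zlib

    def low_level_trace():
        # Stand-in for the packets/syscalls behind one particular "send email".
        return [("packet", random.randint(500, 1500))
                for _ in range(random.randint(3, 8))]

    def lossy_summary(trace):
        return "sent an email to my sister"   # drops every low-level detail

    a, b = low_level_trace(), low_level_trace()
    print(lossy_summary(a) == lossy_summary(b))   # True: same high-level summary
    print(a == b)                                 # almost surely False

    # Contrast: lossless compression must give back the exact original bits.
    blob = b"every bit must survive the round trip " * 100
    assert zlib.decompress(zlib.compress(blob)) == blob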
So I think the answer, back to your point, is that we may not have an equivalent methodology for discussing information theory/content in a lossy domain. As a result, we don't have a way of discussing information content in a hierarchical structure, as the two may be one and the same.
Yes, that makes sense to me: we don't have a formal, precise way of discussing (measuring) information content when faced with the kind of "hierarchical structure that is robust to loss" that is prevalent in natural high dimensional data.
(FWIW, I don't like the term "hierarchical structure" that much, but cannot think of a more appropriate term. It's as if there's no good terminology for discussing these issues, at least not yet.)
> The various measures of "structure" that we have today (Shannon entropy, algorithmic information content, etc.) fall short for measuring the kind of hierarchical structure (multiple levels of function composition) that is prevalent in high-dimensional problems of interest.
You don't think Kolmogorov complexity captures hierarchical structure? (Ignoring the fact that it isn't computable).
As you correctly point out, it's not computable, so we can't use it for measuring anything in practice. But let's ignore that.
The issue I have with it is that it theoretically measures the lossless compressibility of some sequence of bits, instead of the kind of "hierarchical decomposition that is robust to data loss" that I'm talking about here.
Think about it this way: we can randomly zero out half the pixels in a cat photo, and a human being can still see it's the same cat, because it's composed of parts that "approximately look like" cat parts (pointy ears, whiskers, etc.), and those parts look like cat parts because they are in turn composed of things that "approximately look like" cat part parts, all the way down to pixels.
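Here's a rough sketch of what I mean by "robust to loss", with average pooling over patches as a crude stand-in for "looking at parts" (not a claim about how vision actually works): drop half the pixels and the part-level summary barely moves, even though the pixel-level data is wrecked:

    import numpy as np

    rng = np.random.default_rng(0)
    side, patch = 64, 8
    yy, xx = np.mgrid[0:side, 0:side]
    # Crude stand-in for a cat photo: a filled disc on a dark background.
    image = (((xx - 32) ** 2 + (yy - 32) ** 2) < 20 ** 2).astype(float)

    mask = (rng.random(image.shape) < 0.5).astype(float)   # keep ~half the pixels
    damaged = image * mask

    def patch_means(img):
        # Average over non-overlapping 8x8 patches: the "parts" summary.
        return img.reshape(side // patch, patch, side // patch, patch).mean(axis=(1, 3))

    def surviving_patch_means(img, keep):
        # Same summary, estimated only from the pixels that survived the damage.
        img4 = (img * keep).reshape(side // patch, patch, side // patch, patch)
        keep4 = keep.reshape(side // patch, patch, side // patch, patch)
        return img4.sum(axis=(1, 3)) / keep4.sum(axis=(1, 3))

    print("pixel-level damage:", np.abs(image - damaged).mean())   # large
    print("part-level damage: ",
          np.abs(patch_means(image) - surviving_patch_means(image, mask)).mean())  # small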
AFAIK, there's no way of measuring whether cat photos have more or less of this kind of robust-to-loss hierarchical structure than, say, dog photos, or fruit photos.