The title is misleading. The core technique still uses the 60,000 MNIST images, but 'distills' them into 10 images that contain the information from the original 60,000. The 10 'distilled' images look nothing like digits. Learning a complex model from 10 (later reduced to 2) 'distilled' number arrays is an interesting research idea, but it has little to do with reducing the size of the input dataset. Arguably, the heavy lifting of the learning process has moved from training the model to generating the distilled dataset. There is also some unconvincing discussion around synthetic datasets, though it remains entirely unclear what these synthetic datasets have to do with real-world scenarios.
> In a previous paper, MIT researchers had introduced a technique to “distill” giant data sets into tiny ones, and as a proof of concept, they had compressed MNIST down to only 10 images.
It may mean we finally have a method for reliably updating big neural networks instead of having to do continuous on-the-fly retraining. (Imagine future neural networks in your smart car having "upgrade packs" or country-specific data that can be used to fine-tune the main network in a matter of minutes.) A high-level form of patch-and-diff for networks. There is probably an ML Ops startup opportunity somewhere in this.
"Upgrade packs" might be already possible using transformer adapters, i.e. tiny networks trained on customized data plugged into a large fixed pretrained transformer, providing whatever custom functionality you require.
I don't think this changes anything. Deployed networks typically use only inference from pretrained weights, and those weights are what get transferred for model "updates". You can have all your devices using weight array W0 for a neural net architecture, spend a million compute hours training that net on cutting edge systems to produce a much better weight array W1, then upgrade all the deployed devices by sending them W1 which will be the same size as W0.
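A rough sketch of that update path, with a toy architecture and made-up file names standing in for a real deployment:

```python
import torch

# Rough sketch of the "same-size weight swap" described above; the tiny
# architecture and file names are placeholders, not anything from the article.
model = torch.nn.Linear(784, 10)                 # fixed architecture on-device
torch.save(model.state_dict(), "weights_W0.pt")  # W0: currently deployed

# W1 would come from heavy retraining elsewhere on the *same* architecture;
# random values stand in for it here.
w1 = {k: torch.randn_like(v) for k, v in model.state_dict().items()}
torch.save(w1, "weights_W1.pt")

# The deployed device only has to load the new array; the shapes are unchanged.
model.load_state_dict(torch.load("weights_W1.pt"))
```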
So this is more like data storage/some sort of extreme compression than anything? Would it be accurate to interpret this as basically saving the trained neural network into 10 images and reading it back to retrain a model? Or what does this really do/accomplish?
Mostly it seems to tell you something sort of weird about how neural networks train. It's not obvious that this should work, and that it can be made to work is interesting.
The approach appears similar to model distillation, performed on the input data instead of on the weights. Model distillation scales well in classification but less well in generation.
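For comparison, a toy sketch of classic model (knowledge) distillation on the outputs; the models, data, and temperature below are placeholders, not anything from the article:

```python
import torch
import torch.nn.functional as F

# Toy sketch of model (knowledge) distillation: the student is trained to
# match the teacher's softened output distribution. Everything here is a
# placeholder (random data, linear models, arbitrary temperature).
teacher = torch.nn.Linear(784, 10)
student = torch.nn.Linear(784, 10)
x = torch.randn(32, 784)
T = 2.0                                            # softmax temperature

with torch.no_grad():
    soft_targets = F.softmax(teacher(x) / T, dim=1)

kd_loss = F.kl_div(F.log_softmax(student(x) / T, dim=1),
                   soft_targets, reduction="batchmean") * T * T
kd_loss.backward()                                 # gradients flow only to the student
```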
To me, this seems similar to training a neural network on 10,000 images and then publishing the weights themselves as a distilled "image" (of course, it's not the weights themselves but a "distillation"). I feel like you could get something similar by taking the trained network and then applying deep-dream techniques for each classification label on a blank canvas.
It's interesting for localized learning... You have some local data that you want the model to incorporate, but you don't want it to forget the main dataset. So you can create one of these distilled datasets and include it in a batch with the local data when you update the model.
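A toy sketch of that batching idea, with random tensors standing in for the distilled set and the local data:

```python
import torch

# Illustrative sketch of the "rehearsal" idea above: mix the distilled
# examples into every batch of local data so the model adapts without
# forgetting. All tensors here are random placeholders.
distilled_x, distilled_y = torch.randn(10, 784), torch.arange(10)      # 10 distilled "images"
local_x, local_y = torch.randn(32, 784), torch.randint(0, 10, (32,))   # new local data

model = torch.nn.Linear(784, 10)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

for _ in range(5):                                  # a few update steps
    x = torch.cat([local_x, distilled_x])           # local batch + distilled set
    y = torch.cat([local_y, distilled_y])
    opt.zero_grad()
    torch.nn.functional.cross_entropy(model(x), y).backward()
    opt.step()
```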
Interesting paper, although the headline is of course sensational. The crux of the paper is that by using "soft labels" (for example, a probability distribution rather than a one-hot vector), it's possible to create a decision boundary that encodes more classes than you have examples. In fact, only two examples can be used to encode any finite number of classes.
This is interesting because it means that, in theory, ML models should be able to learn decision spaces that are far more complex than the input data has traditionally been thought to encode. Maybe one day we can create complex, generalizable models using a small amount of data.
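A toy illustration of how two examples with soft labels can carve out three decision regions, assuming a distance-weighted soft-label kNN classifier (the positions and label distributions below are made up, not the paper's construction):

```python
import numpy as np

# Two 1-D prototypes with soft labels over THREE classes; the numbers are
# invented for illustration only.
protos = np.array([0.0, 1.0])
soft_labels = np.array([[0.6, 0.4, 0.0],   # prototype at x = 0
                        [0.0, 0.4, 0.6]])  # prototype at x = 1

def predict(x, eps=1e-9):
    """Distance-weighted soft-label kNN using both prototypes."""
    w = 1.0 / (np.abs(protos - x) + eps)           # inverse-distance weights
    probs = (w[:, None] * soft_labels).sum(axis=0)
    return probs.argmax()

print([predict(x) for x in (0.1, 0.5, 0.9)])       # -> [0, 1, 2]: three classes from two points
```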
As written, this paper does not provide much actionable information. The problem is a toy problem, and is far from being useful in "modern" AI techniques (especially things like deep learning or boosted trees). The paper also is not practical in the sense that in real life you don't know what your decision boundary should look like (that's what you learn after all), and there's no obvious way to know which data to collect to get a decision boundary you want.
In other words, this paper has said "this representation is mathematically possible" and is hoping that future work can actually make it useful in practice.
The title is click-bait. This has been known for several years[1], the technique has little practical value, and the assertion that you can learn from no data is completely false and misleading. The training data was compressed to a few examples. To the journalist: it's OK not to maximize for click-bait when you write an article.
[1]: https://www.ttic.edu/dl/dark14.pdf
Why not just make one image where the pixel values are the trained network weights? Then you can say you can distill any training set into just one image.
"carefully engineered their soft labels" is the same thing as training the network. Just because you encode information outside of the weights doesn't mean you're not encoding information from training data.
It's like saying here's the ideal partitioning scheme, memorize this.
I've started to view Technology Review as a PR puff piece for MIT. They often overstate claims or leave out critical details.
As an example, the Media Lab is still citing innovation with deepfakes, claiming entirely novel results that people are shocked to see. They hype their own researchers even though there are kids on YouTube who have been making similar content up to a year prior to Technology Review's publication.
I suspect they do the same with fields I'm less familiar with.
In this case it's clear, since this is an MIT publication and of course they hype up their own research.
But even when you're not reading a university publication/PR piece, you still see this effect. One big player like a famous lab at Stanford or MIT publishes an incremental paper that is a follow-up on a well-known existing research direction, where several groups are working in parallel on very similar things, and then it's presented as if it were some breakthrough and the whole subfield had just been invented by them.
It's very, very hard for outsiders to recognize this and to really understand what the actual incremental step in a particular paper is. Necessarily, when explaining to laypeople, you can only scratch the surface and present the rough idea of a whole big research field, and it gets really murky what is part of the established, pre-existing research field and what is the novel contribution.
I'm sure I fall victim to this when reading outside my expertise as well, i.e. when reading about genetics stuff or quantum computing.
> I'm sure I fall victim to this when reading outside my expertise as well, i.e. when reading about genetics stuff or quantum computing.
I try to avoid this by focusing on the content, not who the researchers are. If they can do some cool new thing with mosquito genetics, now I know that. I don't need to know whether the specific paper being hyped is novel or whether 95% of the content was proved by someone else: that's a task for the Nobel committee and I wouldn't recognise the researchers' names again anyway.
Am I correct that they don't use a train-test split for generating these distilled images? Until you test on new images outside of what is fed to the distiller, it seems to be a way to just overfit specific images, probably by combining unique elements of each into a single composite image. There are plenty of classical signal-processing ways to do this (including just building a composite patchwork quilt).
I like to think of this as adversarial training data. Adversarial inputs in general trick a NN into producing a specific output; adversarial training data tricks the NN into learning specific weights.
Note that the distilled data is not even from the same "domain" of input data any more. They're basically adversarial inputs.
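To make the analogy concrete, here's a tiny targeted FGSM-style sketch of an adversarial *input*; the linear "model" and random input are placeholders, not a real classifier:

```python
import torch
import torch.nn.functional as F

# Sketch of a targeted adversarial input (FGSM-style). The linear "model"
# and random input stand in for a real trained network and image.
model = torch.nn.Linear(784, 10)
x = torch.randn(1, 784)
y_target = torch.tensor([3])            # the class we want to trick the model into

x_adv = x.clone().requires_grad_(True)
loss = F.cross_entropy(model(x_adv), y_target)
loss.backward()
# Step *against* the gradient of the targeted loss to push the prediction
# toward y_target while changing the input only slightly.
x_adv = (x_adv - 0.1 * x_adv.grad.sign()).detach()
```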
If I understand correctly, the key benefit would be that models could be trained on smaller datasets, thereby reducing the time spent computing the models?
I am not convinced that this time saving is greater than the time spent engineering the combined and synthesised data.
> ...very different from human learning. A child often needs to see just a few examples of an object, or even only one, before being able to recognize it for life.
I see this a lot. It's completely wrong. I'm not trying to pick on the author here, I think 95%+ of people share this misunderstanding of deep learning.
If you see "only one" horse, say for even a second, you really are seeing a huge number of horses, from various angles, with various shades of lighting. The motions of the horse; the motions of your head (even if slight); the undulations of the light; are generating a much larger number of basically augmented training data. If you look at a horse for a minute it could be the equivalent of training on 1 million images of a horse. I'm not sure the exact OOM, but it's certainly orders of magnitude more than "one" horse.
(Relatedly: Some people say there is an experiment you can conduct at home to see the actual images your brain is training on).
>> If you look at a horse for a minute it could be the equivalent of training on 1 million images of a horse.
If you trained a neural net with 1 million images of the same horse, it would learn to recognise that horse... but no other horse. Neural net datasets try to include as many variants of the target concept as possible in order to capture as many of the common features of instances of that concept as possible. A single horse would not suffice, e.g. 1 million images of a white horse would teach a neural net that horses are only white, etc.
Also, a child can learn to recognise horses from a caricature of a horse, which is a single image of a horse, not 1 million images. Whatever human minds do when they learn to recognise objects, that's not training with big data in real time.
Transfer learning is a thing though. If you train your network with thousands of pictures of cats, dogs, chickens, birds, goats but no horses, it's possible you can teach it to recognise a horse with a single additional image.
Not sure exactly how good the state of the art is compared to human children on that task, but it can't be too far off.
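A hedged sketch of that transfer-learning idea: reuse a pretrained feature extractor and recognise "horse" from a single example by matching in feature space. The random tensors stand in for real images, and the 0.7 threshold is an arbitrary illustrative choice:

```python
import torch
import torch.nn.functional as F
import torchvision

# One-shot recognition via a pretrained backbone: compare a query image to
# the single labelled example in feature space. Random tensors stand in for
# real 3x224x224 images here.
one_horse_image = torch.randn(3, 224, 224)   # the single labelled horse photo
query_image = torch.randn(3, 224, 224)       # a new image to classify

backbone = torchvision.models.resnet18(weights="IMAGENET1K_V1")
backbone.fc = torch.nn.Identity()            # keep the features, drop the classifier
backbone.eval()

with torch.no_grad():
    horse_proto = backbone(one_horse_image.unsqueeze(0))   # 1 x 512 "prototype"
    query_feat = backbone(query_image.unsqueeze(0))

similarity = F.cosine_similarity(horse_proto, query_feat)
print("looks like the horse?", similarity.item() > 0.7)    # arbitrary threshold
```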
But how exactly do you "train" a 2-year old on 5 billion pictuers of non-horses? And at what point does the child learn to recognise images of anything?
I mean, how do you know it takes 5 billion images of non-horses before the child can learn to recognise a horse? Even if the child sees 5 billion images of non-horses, who says that they haven't learned to recognise the non-horses and the horses from the first of those 5 billion images? After all, I don't know that children stay transfixed for hours staring at horses and non-horses while their brains process all those billions of images until they finally get it.
I think you are making an effort to explain away something that is not very well understood by modern science, by analogy with machine learning, even though there is no good reason to suppose that the two are connected in any way.
24 hours a day × 3,600 seconds per hour × 60 Hz × roughly 1,000 days ≈ 5 billion.
Just a ballpark figure for how many “pictures” they have seen. Probably off by a few OOM, but the point holds.
I think there is a very good reason to think machine learning and human learning are connected. First, I'd say it's largely true that the people who have made the most impact in the former have all studied the way the mind works (as far as science allows) and tried to emulate that in machinery. Second, experimentally, the phenomena observed are increasingly similar (deep dream, for example).
I like to quote Yann LeCun on whether neural nets work like our minds:
IEEE Spectrum: We read about Deep Learning in the news a lot these days. What’s your least favorite definition of the term that you see in these stories?
Yann LeCun: My least favorite description is, “It works just like the brain.” I don’t like people saying this because, while Deep Learning gets an inspiration from biology, it’s very, very far from what the brain actually does. And describing it like the brain gives a bit of the aura of magic to it, which is dangerous. It leads to hype; people claim things that are not true. AI has gone through a number of AI winters because people claimed things they couldn’t deliver.
Also, like I say above, it doesn't matter how many images a human sees; what matters is how many she needs to see before learning to recognise a thing. The example of an unchanging, two-dimensional caricature of a horse is evidence enough that, even if we do see billions of "images" as you say (I'm not sure it makes sense to speak of "images" in the sense you use it), we don't need to see all those billions of them before we learn what things look like.
> Also, a child can learn to recognise horses from a caricature of a horse
The chain of connections from the eye through the brain is like a tree, and when a child "sees" a horse, a number of those pathways are activated, some of which are more general and some of which are specific.
> which is a single image of a horse,
Although it may be a "single" image in the sense that is a single file on disk or a single printed image, a child is not seeing a single image. They are seeing a rushing river of horses, even if from just that 1 static image. Think of the 60hz refresh rate of your monitor. A child is seeing at least 60 "images per second", and likely many, many times more.
But those 60 images per second are the same static, unchanging caricature. It makes more sense to say that the child sees the same image multiple times than to say she sees multiple images.
So how does the child learn to recognise a fully 3-dimensional real-life horse from a single caricature of a horse seen any number of times? Can you explain?
Also, if you were to train a neural net with a single instance of a caricature of a horse, even if you copied it a million times to create an example set of a million copies, the neural net would still only be able to learn how to recognise that particular caricature, if that; it would certainly not be able to extract any useful features for recognising real-life horses from that single caricature.
No, they can’t. Flash an image of an animal for a microsecond in front of a child. They won’t “see” anything. One needs to “see” many, many versions of something before one “sees” anything at all, never mind learns something.
That isn’t seeing many versions of the animal. It is merely processing the one training sample over several seconds, because it takes a few hundred milliseconds for the human brain and visual system to process an image, and several more seconds for the brain to analyze and memorize key features of the sample.
Standard machine learning systems don’t work like this. No matter how much time they’re allowed to process a single sample image, they are unable to learn to classify it. There are of course machine learning systems that work more like humans (facial recognition is a notable example), but they’re the exception not the norm.
All machine learning models that I've trained have involved a data augmentation step. These are artificial augmentations though, so not as effective as more real data. Looking at the same horse live from 100 different perspectives might be better for learning what a horse is than training on 100 static images of different horses.
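For reference, a typical image-augmentation pipeline of the kind described above; the specific transforms and parameters are just illustrative choices, not anyone's actual setup:

```python
import torchvision.transforms as T

# A common image-augmentation pipeline: each epoch sees a different random
# variant of the same underlying photo, which is the "artificial augmentation"
# contrasted with genuinely new views of the object.
augment = T.Compose([
    T.RandomResizedCrop(224, scale=(0.8, 1.0)),   # random crop/zoom
    T.RandomHorizontalFlip(),                     # mirror left/right
    T.ColorJitter(brightness=0.2, contrast=0.2),  # lighting variation
    T.RandomRotation(degrees=10),                 # slight tilt
    T.ToTensor(),
])
```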
Yes, data augmentation is standard. It isn't new data though, just different variations of the same data. You also tend to do multiple training passes over the data, even after augmentation.
None of that is the point though: a child can look at a single static image (not 100 different perspectives, just one perspective) of a horse, and learn to recognize a horse. A standard machine learning model cannot, no matter how much you augment the image.
You need to define "look". How many nanoseconds? How much is the lighting changing during that time? How still is the photo? How still is the person's head? In DL a training example in a batch is perfectly defined in bits. Once you try to define a training example for a human you see that a "single static image" means totally different things. A human seeing a static image is the equivalent of training a model on at least thousands, if not millions+ of training images.