As far as I understood from the description, in LeNet the convolutional layer lets you avoid training parameters that will effectively be doing the same thing as a convolution. Adjacent pixels are highly correlated, so convolutions can capture most of the information in groups of adjacent pixels without having to train a fully-connected layer of neurons. Effectively, you're kinda downsampling the image without losing information.
So, if you're using 1x1 convolutions, I think you're basically having a neuron per pixel, so you're forcing your fully-connected layers to learn the spatial correlations of pixels instead of capturing that information in a convolutional layer. In other words, you're wasting training on capturing spatial correlations of adjacent pixels instead of other correlations.
> So, if you're using 1x1 convolutions, I think you're basically having a neuron per pixel, so you're forcing your fully-connected layers to learn the spatial correlations of pixels instead of capturing that information in a convolutional layer.
Saying "a neuron per pixel" doesn't really mean anything; that way of thinking isn't helpful unless you're looking at small multi-layer perceptrons. The right way to think about it is that you have tensors, and layers that compute new tensors from old tensors.
A 1x1 convolution only 'sees' the feature channels of a pixel, and does the same thing to each pixel. So a 1x1 convolution on a grayscale input (e.g. a 1x28x28 tensor in the case of MNIST) does nothing, basically, other than scale and bias every pixel by the same linear function. It doesn't "force the network to learn" anything, it's just totally pointless.
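A quick numpy sketch of that point (hypothetical weights, MNIST-sized input): a 1x1 convolution with one input and one output channel is just the same scale-and-bias applied independently at every pixel.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 28, 28))  # 1x28x28 grayscale input (MNIST-sized)

# A 1x1 conv, 1 channel in / 1 channel out, is a single weight w and bias b,
# applied at every pixel. Whole-tensor form:
w, b = 0.5, 0.1
conv_out = w * x + b

# Same thing computed pixel by pixel, to make the "per pixel" view explicit:
manual = np.empty_like(x)
for i in range(28):
    for j in range(28):
        manual[0, i, j] = w * x[0, i, j] + b

assert np.allclose(conv_out, manual)
```

So on a single-channel input there is nothing for the layer to learn beyond a global scale and bias, which is why it's pointless there.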
One of the uses of 1x1 convolutions is to collapse the feature dimension when you're deeper in the network (e.g. 100 channels to 10 channels), to reduce the number of parameters subsequent layers need to operate on. It's a "channelwise fully connected layer".
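To illustrate the "channelwise fully connected layer" view (numpy sketch, made-up shapes): a 1x1 conv collapsing C_in channels to C_out is just a C_out x C_in matrix applied to the channel vector at every spatial location.

```python
import numpy as np

rng = np.random.default_rng(1)
c_in, c_out, h, w = 100, 10, 8, 8  # small spatial size for the demo
x = rng.standard_normal((c_in, h, w))

# 1x1 conv weights: one (c_out x c_in) matrix, shared across all pixels.
weight = rng.standard_normal((c_out, c_in))

# Apply the matrix to the channel vector at each pixel (i, j):
y = np.einsum('oc,chw->ohw', weight, x)

# Equivalent: flatten the spatial dims and do one dense layer over channels.
y_fc = (weight @ x.reshape(c_in, h * w)).reshape(c_out, h, w)

assert y.shape == (c_out, h, w)
assert np.allclose(y, y_fc)
```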
I think what you're thinking of (and perhaps what the author was thinking of) is the practice, prior to convnets, of collapsing the image into a vector and then applying a fully connected layer to it. That indeed doesn't exploit the translation invariance of natural images, requires the net to learn the same features at every spatial position at great expense, and so on. But that has nothing to do with 1x1 convolutions.
Ah yes, you're right, I was thinking of it that way. Thanks a bunch for your clear and thorough explanation, it makes a lot of sense! So if I understand what you're saying, a 1x1 convolutional layer for collapsing 100 channels to 10 channels would take a 100x512x512 tensor and collapse it to a 10x512x512 tensor?
[Also, sorry for attempting to answer your question incorrectly. I was thinking of putting a disclaimer saying I hadn't worked with CNNs and so might be misunderstanding what the convolutions are doing; probably should have, haha]
Maybe when the author was saying 'one can think the 1x1 convolutions are against the original principles of LeNet', he was anticipating my kind of confusion? :)
> So if I understand what you're saying, a 1x1 convolutional layer for collapsing 100 channels to 10 channels would take a 100x512x512 tensor and collapse it to a 10x512x512 tensor?
Correct. As I understand it, this would be applying a 1x1 convolution with 10 filters to a 100x512x512 tensor.
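To check those shapes concretely (numpy sketch; in a framework this would be a 2D conv layer with kernel size 1 and 10 output filters):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_normal((100, 512, 512)).astype(np.float32)

# 10 filters, each 1x1 over 100 input channels -> the weight is 10x100.
filters = rng.standard_normal((10, 100)).astype(np.float32)

# The 1x1 convolution: the 10x100 matrix applied at every pixel.
y = np.einsum('oc,chw->ohw', filters, x)

print(y.shape)  # (10, 512, 512)
```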