I meant the network's input, the very first input. Just flip the first input to the net; the flip will then automatically cascade through all layers. The M matrices are initialized from a distribution that is symmetric around zero. They don't need to be explicitly flipped, so it's not a "correction" in that sense. As long as you only have this type of module and normal non-residual ReLU layers, it will be equivalent. The net itself is not equivalent, but if you include the random init in your analysis, you can see there is a bijection.
I don't think I can explain this well over comments. Think about what happens if you initialize your net from scratch. "Self-contained thought experiment: Does it make a difference if you multiply all weights by -1 directly after init?" Once that is clear in itself, think of what happens if you init and flip all your very first inputs, the inputs that you feed to the start of the network.
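Here is a minimal ghci sketch of that thought experiment (the toy weights, the layer and net helpers, and the absence of biases are illustrative assumptions of mine, not anything from the actual setup): negating the input together with the first weight matrix leaves every first-layer pre-activation unchanged, so the whole net computes exactly the same thing, and a symmetric init makes the negated matrix just as likely a draw as the original.
Prelude> let relu x = (x + abs x)/2
Prelude> let layer w x = map (relu . sum . zipWith (*) x) w   -- one dense ReLU layer; rows of w are neurons, no bias
Prelude> let net w1 w2 = layer w2 . layer w1                  -- two-layer toy net
Prelude> let w1 = [[0.5,-0.25],[-1,0.5]]                      -- stand-in for one draw from a symmetric init
Prelude> let w2 = [[0.25,0.5],[-0.5,1]]
Prelude> let x = [2,-4]
Prelude> net w1 w2 x == net (map (map negate) w1) w2 (map negate x)
True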
So I made a simple experiment in ghci (the "$" operator is function application: f $ g $ x === f(g(x))):
Prelude> let relu x = (x + abs x)/2
Prelude> let res x = x - relu x
Prelude> let f x = res $ res x
Prelude> map f [-2..2]
[-2.0,-1.0,0.0,0.0,0.0]
Prelude> let res x = x + relu (negate x)
Prelude> let f x = res $ res x
Prelude> map f [-2..2]
[0.0,0.0,0.0,1.0,2.0]
Prelude> let f x = res $ res $ res x
Prelude> map f [-2..2]
[0.0,0.0,0.0,1.0,2.0]
Prelude> let res x = x - relu x
Prelude> let f x = res $ res $ res x
Prelude> map f [-2..2]
[-2.0,-1.0,0.0,0.0,0.0]
The networks pass different inputs to the subsequent layers, actually.
I have to add that you cannot have zero-mean outputs of a residual layer with the definition res x = x + relu (Ax + b). In that case the outputs will have a non-zero mean and subsequent layers will have to correct for that.
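For what it's worth, here is a quick scalar illustration of the shift being claimed (taking A = 1, b = 0 and a symmetric zero-mean sample purely for concreteness; these simplifications are mine, not part of the original claim): since relu is nonnegative, the additive module can only push the mean of such a sample upwards.
Prelude> let relu x = (x + abs x)/2
Prelude> let mean xs = sum xs / fromIntegral (length xs)
Prelude> let res x = x + relu x   -- scalar stand-in for x + relu (Ax + b) with A = 1, b = 0
Prelude> mean [-2..2]
0.0
Prelude> mean (map res [-2..2])
0.6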
I never said relu(-anything). Please read the equations I wrote above with extreme care to where the minus signs are; they are in precise locations. I also emphasize that my explanation only works if you take into account the fact that the net is randomly initialized. The quoted sentence in my previous comment was carefully formulated to build up the right intuition. I don't think I can explain it better.
But I'll try: the sequence of actions (random init, train your residual net, test your net) is indistinguishable in effect from (random init, train a usual net on the negated starting input, test on the negated input). The second version will be indistinguishable from running a repeated experiment of the first type with a new random init.
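A scalar ghci sketch of the mechanism behind that equivalence (a and b below are arbitrary stand-ins for one random draw of A and b): the subtractive module with weights (a, b) applied to x is exactly the negation of the additive module with weights (-a, b) applied to -x. Under a symmetric init, -a is just as likely a draw as a, so the sign flip simply propagates into the next module, which is the bijection described above.
Prelude> let relu x = (x + abs x)/2
Prelude> let resMinus a b x = x - relu (a*x + b)   -- scalar stand-in for x - relu (Ax + b)
Prelude> let resPlus a b x = x + relu (a*x + b)    -- scalar stand-in for x + relu (Ax + b)
Prelude> let a = 0.75
Prelude> let b = 0.5
Prelude> map (negate . resMinus a b) [-2..2] == map (resPlus (negate a) b . negate) [-2..2]
True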
Yes, I read that. The problem is that you need to train the bias to get comparable results.
Also, please note that you cannot have zero-mean outputs of the residual in the res x = x + relu (Ax + b) case, and you obviously can have zero-mean outputs in the subtraction case (res x = x - relu (Ax + b)).
The SELU paper demonstrates, in my opinion, the benefits of having zero-mean outputs.
(I have to say that in my experiments SELU itself was not all that beneficial, but other techniques that produce zero means were.)
I think that a residual neural network is capable of routing around the case of having to learn non-zero means in its inputs. So you are right in stating that these two cases will be indistinguishable. I just have to say that having subtraction instead of addition helps the neural net train faster and reach better accuracy simply because the training process has fewer things to learn.
I'm fairly sure I was right in the above analysis, which would imply the minus version cannot have any substantial advantages.
> please note that you cannot have zero-mean outputs of the residual in the res x = x + relu (Ax + b) case, and you obviously can have zero-mean outputs in the subtraction case (res x = x - relu (Ax + b))
This is incorrect. You seem to assume x is positive and that therefore adding something relu'd onto it will take it further from zero, while your subtractive one can pull it towards and beyond zero. The problem with this reasoning is that in your version you sequentially subtract the residuals, so you get the exact symmetric effect, getting further away from zero on the other side.
It's like a left hand and a right hand. Not the same, but they have the same effect.
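The two definitions from the ghci experiment above make this concrete: negate the input, run it through the additive version, negate the result, and you reproduce the subtractive version value for value (the relu (negate x) already contains the flipped weight).
Prelude> let relu x = (x + abs x)/2
Prelude> let res1 x = x - relu x            -- the subtraction version from the experiment
Prelude> let res2 x = x + relu (negate x)   -- the addition version from the experiment
Prelude> map res1 [-2..2] == map (negate . res2 . negate) [-2..2]
True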
I've said all I can at this point. If you still have your experiments set up, just try negating your input to the network and initializing randomly. The accuracies observed will be indistinguishable from using your variant. The network is not the same, but the training procedure yields a sample from the same distribution.