I meant the network's input, the very first input. Just flip the first input to the net; the flip will then automatically cascade through all layers. The M matrices are initialized from a distribution that is symmetric around zero. They don't need to be explicitly flipped, so it's not a "correction" in that sense. As long as you only have this type of module and normal non-residual ReLU layers, it will be equivalent. The net itself is not equivalent, but if you include the random init in your analysis, you can see there is a bijection.
I don't think I can explain this well over comments. Think about what happens if you initialize your net from scratch. "Self-contained thought experiment: Does it make a difference if you multiply all weights by -1 directly after init?" Once that is clear in itself, think of what happens if you init and flip all your very first inputs, the inputs that you feed to the start of the network.
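Here is a minimal ghci sketch of that thought experiment (the toy weights, the layer and net helpers, and the absence of biases are illustrative assumptions of mine, not anything from the actual setup): negating the input together with the first weight matrix leaves every first-layer pre-activation unchanged, so the whole net computes exactly the same thing, and a symmetric init makes the negated matrix just as likely a draw as the original.
Prelude> let relu x = (x + abs x)/2
Prelude> let layer w x = map (relu . sum . zipWith (*) x) w   -- one dense ReLU layer; rows of w are neurons, no bias
Prelude> let net w1 w2 = layer w2 . layer w1                  -- two-layer toy net
Prelude> let w1 = [[0.5,-0.25],[-1,0.5]]                      -- stand-in for one draw from a symmetric init
Prelude> let w2 = [[0.25,0.5],[-0.5,1]]
Prelude> let x = [2,-4]
Prelude> net w1 w2 x == net (map (map negate) w1) w2 (map negate x)
True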
So I made a simple experiment in ghci (the "$" operator is function application: f $ g $ x === f(g(x))):
Prelude> let relu x = (x + abs x)/2
Prelude> let res x = x - relu x
Prelude> let f x = res $ res x
Prelude> map f [-2..2]
[-2.0,-1.0,0.0,0.0,0.0]
Prelude> let res x = x + relu (negate x)
Prelude> let f x = res $ res x
Prelude> map f [-2..2]
[0.0,0.0,0.0,1.0,2.0]
Prelude> let f x = res $ res $ res x
Prelude> map f [-2..2]
[0.0,0.0,0.0,1.0,2.0]
Prelude> let res x = x - relu x
Prelude> let f x = res $ res $ res x
Prelude> map f [-2..2]
[-2.0,-1.0,0.0,0.0,0.0]
The networks pass different inputs to the subsequent layers, actually.
I have to add that you cannot have zero-mean outputs of a residual layer with the definition res x = x + relu (Ax + b). In that case the outputs will have a non-zero mean and subsequent layers will have to correct for that.
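For what it's worth, here is a quick scalar illustration of the shift being claimed (taking A = 1, b = 0 and a symmetric zero-mean sample purely for concreteness; these simplifications are mine, not part of the original claim): since relu is nonnegative, the additive module can only push the mean of such a sample upwards.
Prelude> let relu x = (x + abs x)/2
Prelude> let mean xs = sum xs / fromIntegral (length xs)
Prelude> let res x = x + relu x   -- scalar stand-in for x + relu (Ax + b) with A = 1, b = 0
Prelude> mean [-2..2]
0.0
Prelude> mean (map res [-2..2])
0.6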
I never said relu(-anything). Please read the equations I wrote above with extreme care to where the minus signs are; they are in precise locations. I also emphasize that my explanation only works if you take into account the fact that the net is randomly initialized. The quoted sentence in my previous comment was carefully formulated to build up the right intuition. I don't think I can explain it better.
But I'll try: the sequence of actions (random init, train your residual net, test your net) is indistinguishable in effect from (random init, train a usual net on the negated starting input, test on the negated input). The second version will be indistinguishable from running a repeated experiment of the first type with a new random init.
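A scalar ghci sketch of the mechanism behind that equivalence (a and b below are arbitrary stand-ins for one random draw of A and b): the subtractive module with weights (a, b) applied to x is exactly the negation of the additive module with weights (-a, b) applied to -x. Under a symmetric init, -a is just as likely a draw as a, so the sign flip simply propagates into the next module, which is the bijection described above.
Prelude> let relu x = (x + abs x)/2
Prelude> let resMinus a b x = x - relu (a*x + b)   -- scalar stand-in for x - relu (Ax + b)
Prelude> let resPlus a b x = x + relu (a*x + b)    -- scalar stand-in for x + relu (Ax + b)
Prelude> let a = 0.75
Prelude> let b = 0.5
Prelude> map (negate . resMinus a b) [-2..2] == map (resPlus (negate a) b . negate) [-2..2]
True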
Yes, I read that. The problem is that you need to train the bias to get comparable results.
Also, please note that you cannot have zero-mean outputs of the residual in the res x = x + relu (Ax + b) case, and you obviously can have zero-mean outputs in the subtraction case (res x = x - relu (Ax + b)).
The SELU paper demonstrates, in my opinion, the benefits of having zero-mean outputs.
(I have to say that in my experiments SELU itself was not all that beneficial, but other techniques that produce zero means were.)
I think that a residual neural network is capable of routing around the case of having to learn non-zero means in its inputs. So you are right in stating that these two cases will be indistinguishable. I just have to say that having subtraction instead of addition helps the neural net train faster and reach better accuracy simply because the training process has fewer things to learn.
I'm fairly sure I was right in the above analysis, which would imply the minus version cannot have any substantial advantages.
> please note that you cannot have zero-mean outputs of the residual in the res x = x + relu (Ax + b) case, and you obviously can have zero-mean outputs in the subtraction case (res x = x - relu (Ax + b))
This is incorrect. You seem to assume x is positive and that therefore adding something relu'd onto it will take it further from zero, while your subtractive one can pull it towards and beyond zero. The problem with this reasoning is that in your version you sequentially subtract the residuals, so you get the exact symmetric effect, getting further away from zero on the other side.
It's like a left hand and a right hand. Not the same, but they have the same effect.
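The two definitions from the ghci experiment above make this concrete: negate the input, run it through the additive version, negate the result, and you reproduce the subtractive version value for value (the relu (negate x) already contains the flipped weight).
Prelude> let relu x = (x + abs x)/2
Prelude> let res1 x = x - relu x            -- the subtraction version from the experiment
Prelude> let res2 x = x + relu (negate x)   -- the addition version from the experiment
Prelude> map res1 [-2..2] == map (negate . res2 . negate) [-2..2]
True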
I've said all I can at this point. If you still have your experiments set up, just try negating your input to the network and initializing randomly. The accuracies observed will be indistinguishable from using your variant. The network is not the same, but the training procedure yields a sample from the same distribution.