
It's also fun to consider that most neural network architectures are mostly piecewise linear: relu(Mx + b) for standard feedforward or convolutional layers... So one would expect a lot of knowledge and folklore from (piecewise) linear optimization to carry over nicely.

IIRC, ResNet was a good example of that: it's more efficient to learn relu(x + Mx + b) than relu(Mx + b).
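
For concreteness, here's a toy Haskell sketch of those two shapes (hypothetical `dense`/`residual` names, naive list-based linear algebra, nothing optimized); note that, as the replies below point out, the canonical residual block is x + relu(Mx + b) rather than relu(x + Mx + b):

  relu :: Double -> Double
  relu = max 0

  -- Naive matrix-vector product, purely for illustration.
  matVec :: [[Double]] -> [Double] -> [Double]
  matVec m x = [sum (zipWith (*) row x) | row <- m]

  -- Standard feedforward layer: relu(Mx + b).
  dense :: [[Double]] -> [Double] -> [Double] -> [Double]
  dense m b x = map relu (zipWith (+) (matVec m x) b)

  -- Residual-style layer as written above: relu(x + Mx + b).
  residual :: [[Double]] -> [Double] -> [Double] -> [Double]
  residual m b x = map relu (zipWith (+) x (zipWith (+) (matVec m x) b))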

One difficulty, though, is that proofs generally /don't/ carry over, and IME a lot of 'classical' methods constrain themselves to provable operations... So at times it can be hard to tell the difference between 'useful folklore' and 'tweak that makes the proofs work.'

I've been delving into the sparse coding literature from about ten years ago, and there's a lot of this kind of difficulty... Interestingly, the best sparse coding architectures ended up being very, very similar to shallow neural networks. There's a nice 'deep ksvd denoising' paper from a few months back which improves handily on the older sparse coding architectures by bringing in a smattering of ideas from the DNN age; the rant at the end makes the case for building these 'blended' architectures to get the best of both worlds. https://arxiv.org/abs/1909.13164

I tend to think that the DNN architectures beat the provable-domain architectures for Good Reasons, but the USEFUL response is to build awesome blended architectures that maybe go a long way towards bringing down model sizes and complexity, while increasing explainability.



If I remember correctly, the residual layer has the form x + nonlin(Mx + b), not the one you provided.

I also found (in my "learn DL" experiments) that for ReLU the x - relu(Mx + b) version works better (trains faster and achieves better accuracy).


Plus or minus shouldn't make any difference as long as your input distribution is symmetric around zero.

A simple transformation shows you can get the same effect by just flipping the input, the output, and the weights. The weights are initialized from a symmetric distribution, so the only difference may come from the input being unevenly distributed around zero.

y = x - relu(Mx + b)

-y = -x + relu((-M)(-x) + b)

y' = x' + relu(M'x' + b)
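
A quick sanity check of that identity in GHCi, with a scalar M and b for brevity (the 2.0 / 0.5 values are arbitrary toy choices; the same algebra goes through with matrices):

  Prelude> let relu z = max 0 z
  Prelude> let m = 2.0
  Prelude> let b = 0.5
  Prelude> let y x = x - relu (m * x + b)
  Prelude> let y' x' = x' + relu ((negate m) * x' + b)
  Prelude> map y [-1.5, -0.5, 1.25]
  [-1.5,-0.5,-1.75]
  Prelude> map (negate . y' . negate) [-1.5, -0.5, 1.25]
  [-1.5,-0.5,-1.75]

Identical outputs: the subtractive block on x behaves exactly like the standard block on the flipped input with flipped weights.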

Have you tried normalizing your input to have zero mean?


ReLU(-x) is not equal to ReLU(x).

The output after ReLU is all non-negative, so the subtraction actually applies a correction to the input.

Yes, I tried normalization of the input values, as well as normalization and cross-correlation reduction of the outputs of the affine transformations. They each have separate positive effects on training speed and final accuracy.


I know relu(-x) is not relu(x). I think the equations I wrote are correct. My point is that your version is not fundamentally different, because we can get your effect by using the standard version and flipping the input to the network and the M matrix; the output will then come out with the sign flipped as well. If you put a sign-flipped fully connected layer and a softmax on top of this (for example), the result will be the same.

The only reason that yours might be better is an asymmetric distribution of x around 0. If you flip the sign on x, you should get the same benefits.

To summarize: your net is not exactly the same as the usual one, but if you train your version on x and I train the usual version with -x, our results will be indistinguishable.


No, you can't "flip" the input, because it is most probably the output of ReLUs, which are all non-negative. In that case you have to learn the flipping and all that correction in the Mx + b part, which involves, let's say, tens of thousands of parameters.


I meant the input input. The first input. Just flip the first input to the net. Then the flip will automatically cascade through all layers. The M matrices are initialized from a distribution that is symmetric around zero, so they don't need to be explicitly flipped; it's not a "correction" in that sense. As long as you only have this type of module and normal non-residual ReLU layers, it will be equivalent. The net is not equivalent, but if you include the random init in your analysis, then you can see there is a bijection.

I don't think I can explain this well in comments; think about what happens if you initialize your net from scratch. Self-contained thought experiment: "Does it make a difference if you multiply all weights by -1 directly after init?" Once that is clear in itself, think of what happens if you init and flip all your very first inputs, the inputs that you feed to the start of the network.
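
A toy version of that thought experiment, with scalar "weights" standing in for the matrices (the 0.7 / -1.2 / 0.1 / 0.3 values are arbitrary stand-ins for a random init):

  relu :: Double -> Double
  relu = max 0

  -- Subtractive residual module and the standard additive one.
  subModule, stdModule :: Double -> Double -> Double -> Double
  subModule m b x = x - relu (m * x + b)
  stdModule m b x = x + relu (m * x + b)

  -- Two stacked subtractive modules, versus the same stack of standard
  -- modules with every weight sign-flipped.
  subStack, stdStackFlipped :: Double -> Double
  subStack        = subModule 0.7 0.1 . subModule (-1.2) 0.3
  stdStackFlipped = stdModule (-0.7) 0.1 . stdModule 1.2 0.3

  -- For every input x the flip cascades through the stack:
  --   subStack x == negate (stdStackFlipped (negate x))

Since -M is exactly as likely as M under a zero-symmetric init, the flipped stack is just another draw from the same initialization distribution.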


So I made a simple experiment in GHCi (the "$" operator is function application: f $ g $ x === f(g(x))):

  Prelude> let relu x = (x + abs x)/2
  Prelude> let res x = x - relu x
  Prelude> let f x = res $ res x
  Prelude> map f [-2..2]
  [-2.0,-1.0,0.0,0.0,0.0]
  Prelude> let res x = x + relu (negate x)
  Prelude> let f x = res $ res x
  Prelude> map f [-2..2]
  [0.0,0.0,0.0,1.0,2.0]
  Prelude> let f x = res $ res $ res x
  Prelude> map f [-2..2]
  [0.0,0.0,0.0,1.0,2.0]
  Prelude> let res x = x - relu x
  Prelude> let f x = res $ res $ res x
  Prelude> map f [-2..2]
  [-2.0,-1.0,0.0,0.0,0.0]
The two networks pass different inputs through, actually.

I have to add that you cannot have zero-mean outputs from a residual layer with the definition res x = x + relu (Ax + b). In that case the outputs will have a non-zero mean, and subsequent layers will have to correct for that.


I never said relu(-anything). Please read the equations I wrote above with extreme care as to where the minus signs are. They are at precise locations. I also emphasize that my explanation only works if you take into account the fact that the net is randomly initialized. I wrote a quoted sentence in my previous comment which was carefully formulated to build up the right intuition. I don't think I can explain it better.

But I'll try: the sequence of actions (random init, train your residual net, test your net) is indistinguishable in effect from (random init, train the usual net on negated starting input, test on negated input). The second version will be indistinguishable from running a repeated experiment of the first type with a new random init.


Yes, I read that. The problem is that you need to train the bias to get comparable results.

Also, please note that you cannot have zero-mean outputs from the residual in the res x = x + relu (Ax + b) case, while you obviously can have zero-mean outputs in the subtraction case (res x = x - relu (Ax + b)).

The very fact that in one case you have zero-mean outputs and in the other case you don't brings me to the necessity of pointing you to SELU: https://towardsdatascience.com/selu-make-fnns-great-again-sn...

This SELU paper demonstrates, in my opinion, the benefits of having zero mean outputs.
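
For reference, SELU is just a scaled exponential linear unit; here's a sketch of the definition from Klambauer et al.'s "Self-Normalizing Neural Networks" paper:

  -- SELU as defined in "Self-Normalizing Neural Networks"
  -- (Klambauer et al., 2017); the constants are chosen so that
  -- activations are driven towards zero mean and unit variance.
  selu :: Double -> Double
  selu x
    | x > 0     = lambda * x
    | otherwise = lambda * alpha * (exp x - 1)
    where
      alpha  = 1.6732632423543772
      lambda = 1.0507009873554805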

(I have to say that in my experiments SELU was not all that beneficial, but other means of bringing about zero means were.)

I think that a residual neural network is capable of routing around having to learn non-zero means in the inputs. So you are right in stating that these two cases will be indistinguishable. I just have to say that having subtraction instead of addition helps the neural net train faster and reach better accuracy simply because the training process has fewer things to learn.


I'm fairly sure I was right in the above analysis, which would imply the minus version cannot have any substantial advantages.

> please note that you cannot have zero-mean outputs from the residual in the res x = x + relu (Ax + b) case, while you obviously can have zero-mean outputs in the subtraction case (res x = x - relu (Ax + b))

This is incorrect. You seem to assume x is positive and therefore adding something relu'd onto it will take it further from zero, while your subtractive one can pull it towards and beyond zero. The problem with this reasoning is that in your version you sequentially subtract the residuals, so you get the exact symmetric effect, getting further away from zero.

It's like a left hand and a right hand: not the same, but they have the same effect.
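
A quick GHCi illustration of that symmetry, dropping the Mx + b part for brevity: on a zero-mean input, the additive and subtractive variants shift the mean by the same amount, just in opposite directions.

  Prelude> let relu z = max 0 z
  Prelude> let mean xs = sum xs / fromIntegral (length xs)
  Prelude> mean (map (\x -> x + relu x) [-2..2])
  0.6
  Prelude> mean (map (\x -> x - relu x) [-2..2])
  -0.6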

I've said all I can at this point. If you still have your experiments set up, just try negating your input to the network and initializing randomly. The accuracies observed will be indistinguishable from using your variant. The network is not the same, but the training procedure yields a sample from the same distribution.


Sorry. I consulted the code and it turned out I used relu(Ax + b) - x as a modified residual layer.

I hope it clears things up.


Correct; the idea is to take Identity + Tweak, so the ReLU should hit the 'tweak.'


Hey sdenton, did you work on or publish that harmonic analysis of correlations that you mentioned a long time ago? I am still quite curious about that. If polishing it for ICML etc. is too much of a downer, you could upload it to arXiv. I would love to read it.


Hey, send me an email (should be in my profile, or easy to guess: it's a gmail account) and I'll see what I can dig up. :)

I've admittedly been pretty terrible about actually publishing... But definitely have notes and slide decks that could probably be put together into Something.




