
It's also fun to consider that most neural network architectures are mostly piecewise linear: relu(Mx + b) for standard feedforward or convolutional layers... So one would expect a lot of knowledge and folklore from (piecewise) linear optimization to carry over nicely.

IIRC, ResNet was a good example of that: it's more efficient to learn relu(x + Mx + b) than relu(Mx + b).
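
For concreteness, here's a toy Haskell sketch of those two shapes (hypothetical `dense`/`residual` names, naive list-based linear algebra, nothing optimized); note that, as the replies below point out, the canonical residual block is x + relu(Mx + b) rather than relu(x + Mx + b):

  relu :: Double -> Double
  relu = max 0

  -- Naive matrix-vector product, purely for illustration.
  matVec :: [[Double]] -> [Double] -> [Double]
  matVec m x = [sum (zipWith (*) row x) | row <- m]

  -- Standard feedforward layer: relu(Mx + b).
  dense :: [[Double]] -> [Double] -> [Double] -> [Double]
  dense m b x = map relu (zipWith (+) (matVec m x) b)

  -- Residual-style layer as written above: relu(x + Mx + b).
  residual :: [[Double]] -> [Double] -> [Double] -> [Double]
  residual m b x = map relu (zipWith (+) x (zipWith (+) (matVec m x) b))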

One difficulty, though, is that proofs generally /don't/ carry over, and IME a lot of 'classical' methods constrain themselves to provable operations... So at times it can be hard to tell the difference between 'useful folklore' and 'tweak that makes the proofs work.'

I've been delving into the sparse coding literature from about ten years ago, and there's a lot of this kind of difficulty... Interestingly, the best sparse coding architectures ended up being very, very similar to shallow neural networks. There's a nice 'deep ksvd denoising' paper from a few months back which improves handily on the older sparse coding architectures by bringing in a smattering of ideas from the DNN age; the rant at the end makes the case for building these 'blended' architectures to get the best of both worlds. https://arxiv.org/abs/1909.13164

I tend to think that the DNN architectures beat the provable-domain architectures for Good Reasons, but the USEFUL response is to build awesome blended architectures that maybe go a long way towards bringing down model sizes and complexity, while increasing explainability.



If I remember correctly, the residual layer has the form x + nonlin(Mx + b), not the one you provided.

I also found (in my "learn DL" experiments) that for ReLU the x - relu(Mx + b) version works better (trains faster and achieves better accuracy).


Plus or minus shouldn't make any difference as long as your input distribution is symmetric around zero.

A simple transformation shows you can get the same effect by just flipping the input, the output, and the weights. The weights are initialized from a symmetric distribution, so the only difference may come from the input being unevenly distributed around zero.

y = x - relu(Mx + b)

-y = -x + relu((-M)(-x) + b)

y' = x' + relu(M'x' + b)
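
A quick sanity check of that identity in GHCi, with a scalar M and b for brevity (the 2.0 / 0.5 values are arbitrary toy choices; the same algebra goes through with matrices):

  Prelude> let relu z = max 0 z
  Prelude> let m = 2.0
  Prelude> let b = 0.5
  Prelude> let y x = x - relu (m * x + b)
  Prelude> let y' x' = x' + relu ((negate m) * x' + b)
  Prelude> map y [-1.5, -0.5, 1.25]
  [-1.5,-0.5,-1.75]
  Prelude> map (negate . y' . negate) [-1.5, -0.5, 1.25]
  [-1.5,-0.5,-1.75]

Identical outputs: the subtractive block on x behaves exactly like the standard block on the flipped input with flipped weights.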

Have you tried normalizing your input to have zero mean?


ReLU(-x) is not equal to ReLU(x).

The output after ReLU is all non-negative, so the subtraction actually applies a correction to the input.

Yes, I tried normalization of the input values, as well as normalization and cross-correlation reduction of the outputs of the affine transformations. They each have separate positive effects on training speed and final accuracy.


I know relu(-x) is not relu(x). I think the equations I wrote are correct. My point is that your version is not fundamentally different, because we can get your effect by using the standard version and flipping the input to the network and the M matrix; the output will then come out with the sign flipped as well. If you put a sign-flipped fully connected layer and a softmax on top of this (for example), the result will be the same.

The only reason that yours might be better is an asymmetric distribution of x around 0. If you flip the sign on x, you should get the same benefits.

To summarize: your net is not exactly the same as the usual one, but if you train your version on x and I train the usual version with -x, our results will be indistinguishable.


No, you can't "flip" the input, because it is most probably the output of ReLUs, which are all non-negative. In that case you have to learn the flipping and all that correction in the Mx + b part, which involves, let's say, tens of thousands of parameters.


I meant the input input. The first input. Just flip the first input to the net. Then the flip will automatically cascade through all layers. The M matrices are initialized from a distribution that is symmetric around zero, so they don't need to be explicitly flipped; it's not a "correction" in that sense. As long as you only have this type of module and normal non-residual ReLU layers, it will be equivalent. The net is not equivalent, but if you include the random init in your analysis, then you can see there is a bijection.

I don't think I can explain this well in comments; think about what happens if you initialize your net from scratch. Self-contained thought experiment: "Does it make a difference if you multiply all weights by -1 directly after init?" Once that is clear in itself, think of what happens if you init and flip all your very first inputs, the inputs that you feed to the start of the network.
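
A toy version of that thought experiment, with scalar "weights" standing in for the matrices (the 0.7 / -1.2 / 0.1 / 0.3 values are arbitrary stand-ins for a random init):

  relu :: Double -> Double
  relu = max 0

  -- Subtractive residual module and the standard additive one.
  subModule, stdModule :: Double -> Double -> Double -> Double
  subModule m b x = x - relu (m * x + b)
  stdModule m b x = x + relu (m * x + b)

  -- Two stacked subtractive modules, versus the same stack of standard
  -- modules with every weight sign-flipped.
  subStack, stdStackFlipped :: Double -> Double
  subStack        = subModule 0.7 0.1 . subModule (-1.2) 0.3
  stdStackFlipped = stdModule (-0.7) 0.1 . stdModule 1.2 0.3

  -- For every input x the flip cascades through the stack:
  --   subStack x == negate (stdStackFlipped (negate x))

Since -M is exactly as likely as M under a zero-symmetric init, the flipped stack is just another draw from the same initialization distribution.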


So I made a simple experiment in GHCi (the "$" operator is function application: f $ g $ x === f(g(x))):

  Prelude> let relu x = (x + abs x)/2
  Prelude> let res x = x - relu x
  Prelude> let f x = res $ res x
  Prelude> map f [-2..2]
  [-2.0,-1.0,0.0,0.0,0.0]
  Prelude> let res x = x + relu (negate x)
  Prelude> let f x = res $ res x
  Prelude> map f [-2..2]
  [0.0,0.0,0.0,1.0,2.0]
  Prelude> let f x = res $ res $ res x
  Prelude> map f [-2..2]
  [0.0,0.0,0.0,1.0,2.0]
  Prelude> let res x = x - relu x
  Prelude> let f x = res $ res $ res x
  Prelude> map f [-2..2]
  [-2.0,-1.0,0.0,0.0,0.0]
The two networks pass different inputs through, actually.

I have to add that you cannot have zero-mean outputs from a residual layer with the definition res x = x + relu (Ax + b). In that case the outputs will have a non-zero mean, and subsequent layers will have to correct for that.


I never said relu(-anything). Please read the equations I wrote above with extreme care as to where the minus signs are. They are at precise locations. I also emphasize that my explanation only works if you take into account the fact that the net is randomly initialized. I wrote a quoted sentence in my previous comment which was carefully formulated to build up the right intuition. I don't think I can explain it better.

But I'll try: the sequence of actions (random init, train your residual net, test your net) is indistinguishable in effect from (random init, train the usual net on negated starting input, test on negated input). The second version will be indistinguishable from running a repeated experiment of the first type with a new random init.


Yes, I read that. The problem is that you need to train the bias to get comparable results.

Also, please note that you cannot have zero-mean outputs from the residual in the res x = x + relu (Ax + b) case, while you obviously can have zero-mean outputs in the subtraction case (res x = x - relu (Ax + b)).

The very fact that in one case you have zero-mean outputs and in the other case you don't brings me to the necessity of pointing you to SELU: https://towardsdatascience.com/selu-make-fnns-great-again-sn...

This SELU paper demonstrates, in my opinion, the benefits of having zero mean outputs.
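
For reference, SELU is just a scaled exponential linear unit; here's a sketch of the definition from Klambauer et al.'s "Self-Normalizing Neural Networks" paper:

  -- SELU as defined in "Self-Normalizing Neural Networks"
  -- (Klambauer et al., 2017); the constants are chosen so that
  -- activations are driven towards zero mean and unit variance.
  selu :: Double -> Double
  selu x
    | x > 0     = lambda * x
    | otherwise = lambda * alpha * (exp x - 1)
    where
      alpha  = 1.6732632423543772
      lambda = 1.0507009873554805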

(I have to say that in my experiments SELU was not all that beneficial, but other means of bringing about zero means were.)

I think that a residual neural network is capable of routing around having to learn non-zero means in the inputs. So you are right in stating that these two cases will be indistinguishable. I just have to say that having subtraction instead of addition helps the neural net train faster and reach better accuracy simply because the training process has fewer things to learn.


I'm fairly sure I was right in the above analysis, which would imply the minus version cannot have any substantial advantages.

> please note that you cannot have zero-mean outputs from the residual in the res x = x + relu (Ax + b) case, while you obviously can have zero-mean outputs in the subtraction case (res x = x - relu (Ax + b))

This is incorrect. You seem to assume x is positive and therefore adding something relu'd onto it will take it further from zero, while your subtractive one can pull it towards and beyond zero. The problem with this reasoning is that in your version you sequentially subtract the residuals, so you get the exact symmetric effect, getting further away from zero.

It's like a left hand and a right hand: not the same, but they have the same effect.
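
A quick GHCi illustration of that symmetry, dropping the Mx + b part for brevity: on a zero-mean input, the additive and subtractive variants shift the mean by the same amount, just in opposite directions.

  Prelude> let relu z = max 0 z
  Prelude> let mean xs = sum xs / fromIntegral (length xs)
  Prelude> mean (map (\x -> x + relu x) [-2..2])
  0.6
  Prelude> mean (map (\x -> x - relu x) [-2..2])
  -0.6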

I've said all I can at this point. If you still have your experiments set up, just try negating your input to the network and initializing randomly. The accuracies observed will be indistinguishable from using your variant. The network is not the same, but the training procedure yields a sample from the same distribution.


Sorry. I consulted the code and it turned out I used relu(Ax + b) - x as a modified residual layer.

I hope it clears things up.


Correct; the idea is to take Identity + Tweak, so the ReLU should hit the 'tweak.'


Hey sdenton, did you work on or publish that harmonic analysis of correlations that you mentioned a long time ago? I am still quite curious about that. If polishing it for ICML etc. is too much of a downer, you could upload it to arXiv. I would love to read it.


Hey, send me an email (should be in my profile, or easy to guess: it's a gmail account) and I'll see what I can dig up. :)

I've admittedly been pretty terrible about actually publishing... But definitely have notes and slide decks that could probably be put together into Something.




