> Cortical neurons are well approximated by a deep neural network (DNN) with 5–8 layers
I wonder how many cortical neurons it takes to approximate a ReLU or tanh well. I suspect this number is larger than 1. If so, the paper only shows an upper bound. Think about how many neurons it takes to add two 10-digit numbers. It is perfectly feasible that some (possibly large) part of these 5-8 layers is just "emulation overhead".
Does anyone know of studies of this emulation overhead, even outside biology?
Even between ARM and x86 there is an emulation overhead due to different memory models, even though both are register machines.
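As a back-of-the-envelope illustration of that kind of overhead (a toy sketch, not from the paper): even building a plain binary adder out of step-threshold "neurons" costs a couple of layers and several units per bit. The gate encodings below are just one conventional choice.

```python
# Toy sketch: how many step-threshold "neurons" does a binary adder cost?
# Purely illustrative; the unit count depends entirely on the chosen gate encoding.

def unit(weights, bias, inputs):
    """One threshold unit: fires 1 if the weighted sum plus bias is positive."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) + bias > 0 else 0

def AND(a, b):  return unit([1, 1], -1.5, [a, b])   # 1 unit
def OR(a, b):   return unit([1, 1], -0.5, [a, b])   # 1 unit

def XOR(a, b):                                      # 2 units (1 hidden AND + 1 output)
    and_ab = AND(a, b)
    return unit([1, 1, -2], -0.5, [a, b, and_ab])

def full_adder(a, b, carry_in):
    """Standard full adder from the gates above: ~7 threshold units per bit."""
    s1 = XOR(a, b)
    total = XOR(s1, carry_in)
    carry_out = OR(AND(a, b), AND(s1, carry_in))
    return total, carry_out

def add(x, y, bits=34):  # 10 decimal digits fit in ~34 bits
    carry, out = 0, 0
    for i in range(bits):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        out |= s << i
    return out

print(add(1234567890, 9876543210))  # 11111111100
# Rough count: ~7 threshold units per bit * 34 bits ~ 240 units, arranged in a
# deep carry-ripple chain -- "emulation overhead" in miniature.
```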
That’s true: the brain uses a VM to run maths or science; only the best scientists succeed at understanding some of the concepts natively.
Same for music: a student runs the music sheet in a VM, and progressively a JIT makes the movements native, which allows much faster execution, and which allows building on top of the base layer.
Maybe we’re doing it all wrong writing programs in assembler. We should give them to a VM, the VM should see the similarity between various pieces of the programs, inline them, and we could teach the machine faster.
> Maybe we’re doing it all wrong writing programs in assembler. We should give them to a VM,
This is what compilers do. Their input is a program in a more abstract language, either bytecode, an intermediate representation, or a source language.
The problem is that damn undecidability, which is like a minefield of rakes. It's undecidable for a compiler to tell if a program will do anything (e.g. halt). It's undecidable for a compiler to tell if two programs are equivalent. It's undecidable for a compiler to tell if a program is minimal.
So compilers have to, well, be dumber. They approximate a lot.
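The classic diagonal argument, as a sketch (the `halts` oracle here is hypothetical by construction; no correct implementation of it can exist):

```python
# Sketch of why a general halting decider can't exist.

def halts(program, argument) -> bool:
    """Pretend oracle: returns True iff program(argument) eventually halts."""
    raise NotImplementedError("no total, correct version of this function can exist")

def diagonal(program):
    # Do the opposite of whatever the oracle predicts about program(program).
    if halts(program, program):
        while True:        # loop forever if the oracle says "halts"
            pass
    return "done"          # halt if the oracle says "loops"

# Feeding diagonal to itself: halts(diagonal, diagonal) can be neither True nor
# False without contradicting diagonal's own behaviour -- so compilers must
# settle for conservative approximations.
```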
> The problem is that damn undecidability, which is like a minefield of rakes. It's undecidable for a compiler to tell if a program will do anything (e.g. halt). It's undecidable for a compiler to tell if two programs are equivalent. It's undecidable for a compiler to tell if a program is minimal.
Only for Turing complete languages, to be clear. Now, of course, most interesting problems cannot really be solved in sub-Turing languages, but it's still a fundamental point to consider.
In fact, finding the for loops to do tensor contractions (think matrix multiply, but with many more dimensions) is alone something in the NP range. Converting for loops to assembly, as is done by https://polly.llvm.org/, is equivalent to Mixed-Integer Linear Programming, which is equivalent to MaxSAT, which is equivalent to SAT in a loop. In these domains there is a definition of minimal, and the problems are still hard.
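To get a feel for why the search blows up (a rough counting sketch, nothing to do with Polly's actual algorithm): even naively enumerating loop orders and a few hypothetical tile sizes for a single contraction is already combinatorial.

```python
# Rough sketch of the schedule space for ONE tensor contraction,
# e.g. C[i,j,k] = sum_l A[i,j,l] * B[l,k] with loop indices i, j, k, l.
# This is not Polly's algorithm; it just counts candidate schedules.
from itertools import permutations

indices = ["i", "j", "k", "l"]
tile_sizes = [1, 8, 32, 128]          # hypothetical tiling choices per loop

loop_orders = list(permutations(indices))
schedules = len(loop_orders) * len(tile_sizes) ** len(indices)

print(len(loop_orders))   # 24 loop orders for 4 indices
print(schedules)          # 24 * 4**4 = 6144 candidate schedules
# Add more indices, fusion decisions and vectorization choices and the space
# grows factorially -- which is how it lands in MILP / SAT territory.
```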
There is no need to approximate a ReLU or tanh well. Machine learning is statistical. The accuracy of these functions is not that important.
ReLU is a buggy, "incorrect" activation function for deep learning because it's not differentiable everywhere. In practice, it rarely matters. It's chosen only because it's faster to compute the buggy function than to use something proper.
The exact shape of tanh is not important either. It's enough for it to be monotone, roughly s-shaped, and easy to differentiate. Tanh is implemented in hardware, so it's used.
Basically anything monotone and approximately differentiable works.
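For what it's worth, that's easy to check on a toy problem. A minimal numpy sketch (my own, not from any of the papers discussed) where the activation is a plug-in and both a smooth and a piecewise-linear choice fit the same target; exact numbers will vary with the seed and hyperparameters:

```python
# Minimal sketch: a 1-hidden-layer net fit to sin(x) with two different
# "monotone, roughly s-shaped" activations.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200).reshape(-1, 1)
y = np.sin(x)

def hard_tanh(z):       return np.clip(z, -1.0, 1.0)                 # piecewise linear
def hard_tanh_grad(z):  return ((z > -1.0) & (z < 1.0)).astype(float)
def tanh_grad(z):       return 1.0 - np.tanh(z) ** 2

def fit(act, act_grad, hidden=32, lr=0.05, steps=3000):
    W1 = rng.normal(0.0, 1.0, (1, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.1, (hidden, 1)); b2 = np.zeros(1)
    for _ in range(steps):
        z = x @ W1 + b1                     # hidden pre-activations
        h = act(z)
        pred = h @ W2 + b2
        err = pred - y                      # gradient of 0.5*MSE w.r.t. pred
        gW2 = h.T @ err / len(x); gb2 = err.mean(0)
        dh = (err @ W2.T) * act_grad(z)     # backprop through the activation
        gW1 = x.T @ dh / len(x); gb1 = dh.mean(0)
        W1 -= lr * gW1; b1 -= lr * gb1
        W2 -= lr * gW2; b2 -= lr * gb2
    pred = act(x @ W1 + b1) @ W2 + b2
    return float(np.mean((pred - y) ** 2))

print("tanh      MSE:", fit(np.tanh, tanh_grad))
print("hard tanh MSE:", fit(hard_tanh, hard_tanh_grad))
# Both end up with a small error -- the precise shape of the squashing function
# matters far less than it being monotone and (approximately) differentiable.
```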
> There is no need to approximate a ReLU or tanh well
Similarly, there might not be a need to emulate neurons well to get the circuits in the brain to work. However, when someone argues that one biological neuron is equivalent to x artificial neurons, it is necessary to choose a bound for the comparison (e.g. L2 error of the activation) between the emulations being compared.
Also, the nonlinearity only needs to be differentiable because ANNs are trained with gradient descent. With other, more biologically plausible learning mechanisms, this might matter even less (or have other constraints / requirements).
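As a toy version of "pick an error bound, get a unit count" (a sketch in the reverse direction, since a ReLU network computes a piecewise-linear function): count how many linear pieces it takes to approximate tanh to a chosen tolerance.

```python
# Sketch: how many linear pieces (roughly, hidden ReLU units) does it take to
# approximate tanh on [-4, 4] to a given max error? Purely illustrative.
import numpy as np

x = np.linspace(-4, 4, 4001)
target = np.tanh(x)

def pieces_needed(tol):
    for k in range(2, 200):                              # k knots -> k-1 linear pieces
        knots = np.linspace(-4, 4, k)
        approx = np.interp(x, knots, np.tanh(knots))     # piecewise-linear interpolant
        if np.max(np.abs(approx - target)) < tol:
            return k - 1
    return None

for tol in (1e-1, 1e-2, 1e-3):
    print(f"max error < {tol:g}: {pieces_needed(tol)} linear pieces")
# The unit count is entirely a function of the error bound you decide to demand,
# which is the point: "x artificial neurons per biological neuron" is only
# meaningful relative to a chosen comparison bound.
```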
Meanwhile, if we actually understood brains, I bet we would find endless examples of 'improper' behavior. Evolution picks up what seems to work, and sloooowly improves the parts that break, leaving good enough alone. (After all, if it doesn't affect reproductive probabilities, it doesn't matter.)
Activation functions will almost certainly not be the crux move for solving AGI.
Tanh is _not_ generally implemented in hardware, and it’s one of the fussier functions in math.h to implement well. Its only real virtues are that implementations are available everywhere, its derivative is relatively simple, and it has the right symmetries.
You're right that neural networks don't care too much about the shape of most activation functions. I assume that splicing together two decaying exponential functions at the origin would work just as well in practice.
However, tanh is a bit more special than just having the right symmetries. Sigmoid is the correct function to turn an additive (log-odds) value into a probability (range 0 to 1). Tanh is a rescaled sigmoid which fulfills the same purpose for the -1 to +1 interval.
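Concretely, the relation is tanh(x) = 2·sigmoid(2x) − 1; a short numerical check:

```python
# Check that tanh is just a rescaled, recentered sigmoid: tanh(x) = 2*sigmoid(2x) - 1.
import math

sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
for x in (-3.0, -0.5, 0.0, 1.2, 4.0):
    assert abs(math.tanh(x) - (2 * sigmoid(2 * x) - 1)) < 1e-12
print("tanh(x) == 2*sigmoid(2x) - 1 holds numerically")
```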
I sometimes wonder if clamped linear or exponential functions would work better than tanh/sigmoid in places where they're currently used (like LSTM/GRU gates).
Note that tanh saturates to ±1 faster than most alternatives except erf, when all are normalized to have slope 1 at the origin (its expansion at +infinity is 1 - 2e^{-2x} + O(e^{-4x}), while many of the other options have polynomially decaying tails, so they don't approach 1 nearly as fast).
I suspect some applications would in theory rather use erf, but erf is even worse to compute than tanh (on the other hand, erf's derivative is really nice, so who knows?)
By splicing together I mean a piecewise function which is `exp(x) - 1` on the left and `1 - exp(-x)` on the right, which should be similar enough to tanh for most purposes.
Sure, it even has a continuous first derivative and the right slope at the origin. It just doesn’t saturate to +/-1 as fast, which probably doesn’t matter.
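To put numbers on the saturation point (a quick sketch; every function below is normalized to slope 1 at the origin):

```python
# How fast do various slope-1-at-origin squashers approach 1?
# tanh has an exponential tail (~2e^{-2x}), the spliced exponential only ~e^{-x},
# and algebraic options like x/sqrt(1+x^2) approach 1 only polynomially.
import math

def spliced(x):         # exp(x)-1 on the left, 1-exp(-x) on the right
    return math.exp(x) - 1 if x < 0 else 1 - math.exp(-x)

def algebraic(x):       # x / sqrt(1 + x^2): slope 1 at 0, polynomial tail
    return x / math.sqrt(1 + x * x)

def erf_norm(x):        # erf rescaled to slope 1 at the origin
    return math.erf(math.sqrt(math.pi) * x / 2)

for x in (1.0, 2.0, 4.0):
    print(f"x={x}:  1-tanh={1 - math.tanh(x):.2e}  1-spliced={1 - spliced(x):.2e}  "
          f"1-algebraic={1 - algebraic(x):.2e}  1-erf={1 - erf_norm(x):.2e}")
# Near the origin all four roughly track x, but in the tails tanh and erf
# saturate far faster than the spliced exponential or x/sqrt(1+x^2).
```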
I guess it depends on how accurately you're thinking about those functions being approximated. Neurons have a natural nonlinearity to their input-output (transfer) function, most obvious of which is the action potential threshold. Biological neurons have a saturating nonlinearity because there is an upper limit on their firing rate, but in certain regimes the nonlinearity of a single neuron could easily look qualitatively similar to relu or a (non-negative) tanh.
On the other hand, a single cell much simpler than a neuron (any bacterium) is able to perform significantly more complex computations than any ANN we've tried so far (successfully interacting with an environment to move and find food).
Comparing these kinds of disparate tasks for "computational power levels" between vastly different architectures, one of which we're not even close to understanding, is generally pretty futile.