> Cortical neurons are well approximated by a deep neural network (DNN) with 5–8 layers
I wonder how many cortical neurons it takes to approximate a ReLU or tanh well. I suspect this number is larger than 1. If so, the paper only shows an upper bound. Think about how many neurons it takes to add two 10-digit numbers. It is perfectly feasible that some (possibly large) part of these 5-8 layers is just "emulation overhead".
Does anyone know of studies of this emulation overhead, even outside biology?
Even between ARM and x86 there is an emulation overhead due to different memory models, even though both are register machines.
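As a back-of-the-envelope illustration of that kind of overhead (a toy sketch, not from the paper): even building a plain binary adder out of step-threshold "neurons" costs a couple of layers and several units per bit. The gate encodings below are just one conventional choice.

```python
# Toy sketch: how many step-threshold "neurons" does a binary adder cost?
# Purely illustrative; the unit count depends entirely on the chosen gate encoding.

def unit(weights, bias, inputs):
    """One threshold unit: fires 1 if the weighted sum plus bias is positive."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) + bias > 0 else 0

def AND(a, b):  return unit([1, 1], -1.5, [a, b])   # 1 unit
def OR(a, b):   return unit([1, 1], -0.5, [a, b])   # 1 unit

def XOR(a, b):                                      # 2 units (1 hidden AND + 1 output)
    and_ab = AND(a, b)
    return unit([1, 1, -2], -0.5, [a, b, and_ab])

def full_adder(a, b, carry_in):
    """Standard full adder from the gates above: ~7 threshold units per bit."""
    s1 = XOR(a, b)
    total = XOR(s1, carry_in)
    carry_out = OR(AND(a, b), AND(s1, carry_in))
    return total, carry_out

def add(x, y, bits=34):  # 10 decimal digits fit in ~34 bits
    carry, out = 0, 0
    for i in range(bits):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        out |= s << i
    return out

print(add(1234567890, 9876543210))  # 11111111100
# Rough count: ~7 threshold units per bit * 34 bits ~ 240 units, arranged in a
# deep carry-ripple chain -- "emulation overhead" in miniature.
```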
That’s true: the brain uses a VM to run maths or science; only the best scientists succeed at understanding some of the concepts natively.
Same for music: a student runs the music sheet in a VM, and progressively a JIT makes the movements native, which allows much faster execution, and which allows building on top of the base layer.
Maybe we’re doing it all wrong writing programs in assembler. We should give them to a VM, the VM should see the similarity between various pieces of the programs, inline them, and we could teach the machine faster.
> Maybe we’re doing it all wrong writing programs in assembler. We should give them to a VM,
This is what compilers do. Their input is a program in a more abstract language, either bytecode, an intermediate representation, or a source language.
The problem is that damn undecidability, which is like a minefield of rakes. It's undecidable for a compiler to tell if a program will do anything (e.g. halt). It's undecidable for a compiler to tell if two programs are equivalent. It's undecidable for a compiler to tell if a program is minimal.
So compilers have to, well, be dumber. They approximate a lot.
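The classic diagonal argument, as a sketch (the `halts` oracle here is hypothetical by construction; no correct implementation of it can exist):

```python
# Sketch of why a general halting decider can't exist.

def halts(program, argument) -> bool:
    """Pretend oracle: returns True iff program(argument) eventually halts."""
    raise NotImplementedError("no total, correct version of this function can exist")

def diagonal(program):
    # Do the opposite of whatever the oracle predicts about program(program).
    if halts(program, program):
        while True:        # loop forever if the oracle says "halts"
            pass
    return "done"          # halt if the oracle says "loops"

# Feeding diagonal to itself: halts(diagonal, diagonal) can be neither True nor
# False without contradicting diagonal's own behaviour -- so compilers must
# settle for conservative approximations.
```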
> The problem is that damn undecidability, which is like a minefield of rakes. It's undecidable for a compiler to tell if a program will do anything (e.g. halt). It's undecidable for a compiler to tell if two programs are equivalent. It's undecidable for a compiler to tell if a program is minimal.
Only for Turing complete languages, to be clear. Now, of course, most interesting problems cannot really be solved in sub-Turing languages, but it's still a fundamental point to consider.
In fact, finding the for loops to do tensor contractions (think matrix multiply, but with many more dimensions) is alone something in the NP range. Converting for loops to assembly, as is done by https://polly.llvm.org/, is equivalent to Mixed-Integer Linear Programming, which is equivalent to MaxSAT, which is equivalent to SAT in a loop. In these domains there is a definition of minimal, and the problems are still hard.
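To get a feel for why the search blows up (a rough counting sketch, nothing to do with Polly's actual algorithm): even naively enumerating loop orders and a few hypothetical tile sizes for a single contraction is already combinatorial.

```python
# Rough sketch of the schedule space for ONE tensor contraction,
# e.g. C[i,j,k] = sum_l A[i,j,l] * B[l,k] with loop indices i, j, k, l.
# This is not Polly's algorithm; it just counts candidate schedules.
from itertools import permutations

indices = ["i", "j", "k", "l"]
tile_sizes = [1, 8, 32, 128]          # hypothetical tiling choices per loop

loop_orders = list(permutations(indices))
schedules = len(loop_orders) * len(tile_sizes) ** len(indices)

print(len(loop_orders))   # 24 loop orders for 4 indices
print(schedules)          # 24 * 4**4 = 6144 candidate schedules
# Add more indices, fusion decisions and vectorization choices and the space
# grows factorially -- which is how it lands in MILP / SAT territory.
```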
There is no need to approximate a ReLU or tanh well. Machine learning is statistical. The accuracy of these functions is not that important.
ReLU is a buggy, "incorrect" activation function for deep learning because it's not differentiable everywhere. In practice, it rarely matters. It's chosen only because it's faster to compute the buggy function than to use something proper.
The exact shape of tanh is not important either. It's enough for it to be monotone, roughly s-shaped, and easy to differentiate. Tanh is implemented in hardware, so it's used.
Basically anything monotone and approximately differentiable works.
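For what it's worth, that's easy to check on a toy problem. A minimal numpy sketch (my own, not from any of the papers discussed) where the activation is a plug-in and both a smooth and a piecewise-linear choice fit the same target; exact numbers will vary with the seed and hyperparameters:

```python
# Minimal sketch: a 1-hidden-layer net fit to sin(x) with two different
# "monotone, roughly s-shaped" activations.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200).reshape(-1, 1)
y = np.sin(x)

def hard_tanh(z):       return np.clip(z, -1.0, 1.0)                 # piecewise linear
def hard_tanh_grad(z):  return ((z > -1.0) & (z < 1.0)).astype(float)
def tanh_grad(z):       return 1.0 - np.tanh(z) ** 2

def fit(act, act_grad, hidden=32, lr=0.05, steps=3000):
    W1 = rng.normal(0.0, 1.0, (1, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.1, (hidden, 1)); b2 = np.zeros(1)
    for _ in range(steps):
        z = x @ W1 + b1                     # hidden pre-activations
        h = act(z)
        pred = h @ W2 + b2
        err = pred - y                      # gradient of 0.5*MSE w.r.t. pred
        gW2 = h.T @ err / len(x); gb2 = err.mean(0)
        dh = (err @ W2.T) * act_grad(z)     # backprop through the activation
        gW1 = x.T @ dh / len(x); gb1 = dh.mean(0)
        W1 -= lr * gW1; b1 -= lr * gb1
        W2 -= lr * gW2; b2 -= lr * gb2
    pred = act(x @ W1 + b1) @ W2 + b2
    return float(np.mean((pred - y) ** 2))

print("tanh      MSE:", fit(np.tanh, tanh_grad))
print("hard tanh MSE:", fit(hard_tanh, hard_tanh_grad))
# Both end up with a small error -- the precise shape of the squashing function
# matters far less than it being monotone and (approximately) differentiable.
```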
> There is no need to approximate a ReLU or tanh well
Similarly, there might not be a need to emulate neurons well to get the circuits in the brain to work. However, when someone argues that one biological neuron is equivalent to x artificial neurons, it is necessary to choose a bound for the comparison (e.g. L2 error of the activation) between the emulations being compared.
Also, the nonlinearity only needs to be differentiable because ANNs are trained with gradient descent. With other, more biologically plausible learning mechanisms, this might matter even less (or have other constraints / requirements).
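As a toy version of "pick an error bound, get a unit count" (a sketch in the reverse direction, since a ReLU network computes a piecewise-linear function): count how many linear pieces it takes to approximate tanh to a chosen tolerance.

```python
# Sketch: how many linear pieces (roughly, hidden ReLU units) does it take to
# approximate tanh on [-4, 4] to a given max error? Purely illustrative.
import numpy as np

x = np.linspace(-4, 4, 4001)
target = np.tanh(x)

def pieces_needed(tol):
    for k in range(2, 200):                              # k knots -> k-1 linear pieces
        knots = np.linspace(-4, 4, k)
        approx = np.interp(x, knots, np.tanh(knots))     # piecewise-linear interpolant
        if np.max(np.abs(approx - target)) < tol:
            return k - 1
    return None

for tol in (1e-1, 1e-2, 1e-3):
    print(f"max error < {tol:g}: {pieces_needed(tol)} linear pieces")
# The unit count is entirely a function of the error bound you decide to demand,
# which is the point: "x artificial neurons per biological neuron" is only
# meaningful relative to a chosen comparison bound.
```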
Meanwhile, if we actually understood brains, I bet we would find endless examples of 'improper' behavior. Evolution picks up what seems to work, and sloooowly improves the parts that break, leaving good enough alone. (After all, if it doesn't affect reproductive probabilities, it doesn't matter.)
Activation functions will almost certainly not be the crux move for solving AGI.
Tanh is _not_ generally implemented in hardware, and it’s one of the fussier functions in math.h to implement well. Its only real virtues are that implementations are available everywhere, its derivative is relatively simple, and it has the right symmetries.
You're right that neural networks don't care too much about the shape of most activation functions. I assume that splicing together two decaying exponential functions at the origin would work just as well in practice.
However, tanh is a bit more special than just having the right symmetries. Sigmoid is the correct function to turn an additive (log-odds) value into a probability (range 0 to 1). Tanh is a rescaled sigmoid which fulfills the same purpose for the -1 to +1 interval.
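Concretely, the relation is tanh(x) = 2·sigmoid(2x) − 1; a short numerical check:

```python
# Check that tanh is just a rescaled, recentered sigmoid: tanh(x) = 2*sigmoid(2x) - 1.
import math

sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
for x in (-3.0, -0.5, 0.0, 1.2, 4.0):
    assert abs(math.tanh(x) - (2 * sigmoid(2 * x) - 1)) < 1e-12
print("tanh(x) == 2*sigmoid(2x) - 1 holds numerically")
```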
I sometimes wonder if clamped linear or exponential functions would work better than tanh/sigmoid in places where they're currently used (like LSTM/GRU gates).
Note that tanh saturates to ±1 faster than most alternatives except erf, when all are normalized to have slope 1 at the origin (its expansion at +infinity is 1 - 2e^{-2x} + O(e^{-4x}), while many of the other options have polynomially decaying tails, so they don't approach 1 nearly as fast).
I suspect some applications would in theory rather use erf, but erf is even worse to compute than tanh (on the other hand, erf's derivative is really nice, so who knows?)
By splicing together I mean a piecewise function which is `exp(x) - 1` on the left and `1 - exp(-x)` on the right, which should be similar enough to tanh for most purposes.
Sure, it even has a continuous first derivative and the right slope at the origin. It just doesn’t saturate to +/-1 as fast, which probably doesn’t matter.
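To put numbers on the saturation point (a quick sketch; every function below is normalized to slope 1 at the origin):

```python
# How fast do various slope-1-at-origin squashers approach 1?
# tanh has an exponential tail (~2e^{-2x}), the spliced exponential only ~e^{-x},
# and algebraic options like x/sqrt(1+x^2) approach 1 only polynomially.
import math

def spliced(x):         # exp(x)-1 on the left, 1-exp(-x) on the right
    return math.exp(x) - 1 if x < 0 else 1 - math.exp(-x)

def algebraic(x):       # x / sqrt(1 + x^2): slope 1 at 0, polynomial tail
    return x / math.sqrt(1 + x * x)

def erf_norm(x):        # erf rescaled to slope 1 at the origin
    return math.erf(math.sqrt(math.pi) * x / 2)

for x in (1.0, 2.0, 4.0):
    print(f"x={x}:  1-tanh={1 - math.tanh(x):.2e}  1-spliced={1 - spliced(x):.2e}  "
          f"1-algebraic={1 - algebraic(x):.2e}  1-erf={1 - erf_norm(x):.2e}")
# Near the origin all four roughly track x, but in the tails tanh and erf
# saturate far faster than the spliced exponential or x/sqrt(1+x^2).
```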
I guess it depends on how accurately you're thinking about those functions being approximated. Neurons have a natural nonlinearity to their input-output (transfer) function, most obvious of which is the action potential threshold. Biological neurons have a saturating nonlinearity because there is an upper limit on their firing rate, but in certain regimes the nonlinearity of a single neuron could easily look qualitatively similar to relu or a (non-negative) tanh.
On the other hand, a single cell much simpler than a neuron (any bacterium) is able to perform significantly more complex computations than any ANN we've tried so far (successfully interacting with an environment to move and find food).
Comparing these kinds of disparate tasks for "computational power levels" between vastly different architectures, one of which we're not even close to understanding, is generally pretty futile.