I had wondered this myself -- it seems reasonable to see limiting the precision of activations as a form of regularization, as the author alludes to.
For me, the place we'll eventually end up is obviously custom deep learning / evaluation chips that perform analogue operations using transistors in their linear regime (like how op-amps work). These chips would be programmed merely to express the tensor operation graph, essentially analogue tensor FPGAs.
This should bring multiple order-of-magnitude reductions in power consumption and increases in evaluation speed. And when you don't have a clock there might also be interesting ways of dealing with time in which you don't discretize and unroll, like one currently does with GRUs or LSTMs.
I agree that this kind of naive analog computing sounds very attractive for those simple linear operations (linear networks have been exhaustively studied, and as you noted you essentially need only resistors and amplifiers). But it's not entirely obvious to me that they ought to beat digital electronics at comparable precision (considering their noise) and power consumption. I think you may get into trouble in the small-current regime due to quantum mechanics: while you can do digital electronics with only a few electrons, you may need a large number of them to maintain good linearity. And then there's the fact that you can represent exponentially larger numbers with roughly linearly (or polynomially) increasing memory, while with analog circuits you pay a quadratic cost on the exponential, so ~n^k vs ~exp(2n) power consumption doesn't look good from this point of view. But who knows: as the article points out, the nonlinearities of the network may miraculously make it work even with very poor linearity and poor precision. It remains to be tested.
you can deal with exponentially larger numbers with roughly linearly (or polynomial) increasing memory, while if you use analog circuits you have to pay a quadratic cost on the exponential
This does not make sense to me. Can you explain?
I think there might be a misunderstanding of how analog computing is used to build a neural network. First, a weight is stored as some analog physical property, typically as charge on a floating gate, or on a capacitor in a DRAM-type cell. Second, the multiplication is performed by modulating the analog input signal going through the floating-gate transistor with the charge on the floating gate (the weight). Third, the summation is done by simply summing the currents. Finally, the activation function is performed by an op-amp.
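To make that concrete, here is a minimal numerical sketch (NumPy) of the scheme just described: weights held as stored analog quantities, multiplication by modulating the input with each stored weight, summation as currents adding on a shared wire, and an op-amp stage for the activation. The noise level and the tanh activation are illustrative assumptions, not specifics of any particular chip.

    import numpy as np

    # Toy numerical model of one analog crossbar layer:
    #   - each weight is a stored analog quantity (e.g. charge on a floating gate)
    #   - multiplication = the input signal modulated by that stored weight
    #   - summation = currents adding on a shared output wire
    #   - activation = an op-amp style nonlinearity (tanh here, as an assumption)
    rng = np.random.default_rng(0)

    def analog_layer(x, W, noise_std=0.01):
        products = W * x[np.newaxis, :]                           # per-transistor modulation
        products += rng.normal(0.0, noise_std, products.shape)    # analog noise (illustrative level)
        summed = products.sum(axis=1)                             # current summation on the wire
        return np.tanh(summed)                                    # op-amp activation stage

    x = rng.normal(size=16)                                       # input "voltages"
    W = 0.2 * rng.normal(size=(8, 16))                            # stored "weights"
    print(analog_layer(x, W))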
Regarding power consumption:
1. A digital computer needs a thousand transistors to perform a multiplication; an analog circuit can do it with a single one.
2. Analog NN stores parameters (weights) locally, right where they are needed to perform computation. Digital NN will need lots of memory transfers to bring weights from RAM to ALU, and to store intermediate results.
That's why a properly implemented analog NN will always consume much less power.
> 1. A digital computer needs a thousand transistors to perform a multiplication; an analog circuit can do it with a single one.
That's interesting. What would the circuit be?
> Digital NN will need lots of memory transfers to bring weights from RAM to ALU, and to store intermediate results.
That's not necessarily the case. Cellular neural networks were proposed long ago, for example, and they're digital -- how multiplication happens is independent from the data flow architecture.
> That's why a properly implemented analog NN will always consume much less power.
How do you know that the I^2 cost of operating in the linear regime isn't excessive? I'm totally ignorant on the matter -- I'd love to see a ballpark calculation to understand why it isn't important.
As I described above: "the multiplication operation is performed by modulating the analog input signal going through the floating gate transistor by the charge on the floating gate (weight)." The circuit is the single transistor in this case.
Cellular neural networks were proposed long ago, for example, and they're digital
What is so inherently digital about cellular networks? Can you provide a link to an implementation of a cellular net in digital hardware? How are the weights stored? Where does the multiplication happen?
> This does not make sense to me. Can you explain?
I understood the reasoning to be that to increase the range of accurately representable values in a circuit, you either need to increase the voltage or current used in an analog circuit (to achieve a certain accuracy versus a noise baseline), or devote more bits in a digital circuit. The first gives a linear dependence (or quadratic for I^2 losses) of power on range, the second logarithmic.
Ah I see. Well, remember, with analog circuits, we are talking about subthreshold currents. This current is orders of magnitude less than the current in a digital circuit (nA vs uA). Correspondingly, the power consumption will be negligible in comparison, even if you expand the current range. And that is only a fraction of the total power consumption. Adding more bits in a digital circuit linearly increases total power, dominated by interconnect capacitance.
That was an important observation. Fighting noise is one of the primary reasons the first digital computers were invented.
To give a bit of a dramatic illustration: if your circuit has on the order of 1 nV of thermal noise and you wanted to do the linear analog equivalent of 64-bit integer arithmetic, you would need a signal on the order of 10,000,000,000 V to have enough precision. In terms of power consumption it's even worse: if the 1 nV signal consumes something like 1 pW, you would need something like the total power output of the Sun (on the order of 10^26 W) -- a bit of an expensive multiplication, no :) ? That's how crazy it is!
Again, if you can get away with less than 8 bits of precision and imperfect linearity the picture changes, but I wouldn't declare it superior a priori without looking at the numbers.
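For what it's worth, a quick back-of-the-envelope check of those numbers (a sketch; the 1 nV noise floor, the 1 pW reference power, and power scaling with the square of the signal amplitude are exactly the assumptions stated above):

    # Back-of-the-envelope check of the 64-bit analog arithmetic example above.
    # Assumptions as stated: ~1 nV noise floor, ~1 pW at the 1 nV level, and
    # power scaling with the square of the signal amplitude.
    levels = 2 ** 64                      # distinguishable levels for 64-bit integers
    noise_floor_v = 1e-9                  # 1 nV
    signal_v = levels * noise_floor_v     # ~1.8e10 V, i.e. ~10,000,000,000 V

    ref_power_w = 1e-12                   # 1 pW at the 1 nV level
    power_w = ref_power_w * (signal_v / noise_floor_v) ** 2

    print(f"required signal: {signal_v:.1e} V")
    print(f"required power:  {power_w:.1e} W  (the Sun's output is ~4e26 W)")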
Or, you could split your 64 bit computation into 8 bit computations, which could be done with analog circuits, and still save a lot of power! :-)
But yes, I understand your point. Both analog and digital implementations have their strengths and weaknesses. If you value power over precision, go with analog. If the opposite - go with digital.
Right, but note you can't even split it if you are thinking of linear circuits. Precision is necessarily a statement about how your signal compares to the thermal noise floor, and it's possible to show you can't compose 8-bit-precision linear units to get a >8-bit-precision value. What actually happens is the opposite: if the noise of the units is uncorrelated, it propagates and grows roughly as sqrt(number of operations). Avoiding error propagation is another advantage of digital operations.
The reason NNs don't exhibit strong error propagation is because of the non-linearities between linear layers that perform operations analogous to threshold/majority voting or the like, which have error correction properties.
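A tiny simulation of the error-propagation point (a sketch; the per-operation noise level is arbitrary): with uncorrelated noise, the accumulated error of a chain of linear operations grows roughly as sqrt(N).

    import numpy as np

    # Error propagation through a chain of noisy linear operations: with
    # uncorrelated per-stage noise, the accumulated error grows ~ sqrt(N).
    rng = np.random.default_rng(0)
    noise_std = 1.0                          # arbitrary per-operation noise level

    for n_ops in (1, 4, 16, 64, 256):
        # Total error = sum of n_ops independent noise terms, over many trials.
        errors = rng.normal(0.0, noise_std, size=(100_000, n_ops)).sum(axis=1)
        print(f"{n_ops:4d} ops: error std ~ {errors.std():5.2f}   sqrt(N) = {n_ops ** 0.5:5.2f}")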
Interesting, but then how do you explain that rectified linear operations between layers work better than sigmoids?
According to your logic, ReLU should have worse error-propagation properties than squashing functions?
I'm going to reply to your question below here since HN is preventing a reply (anti-flaming/long threads I guess).
Be careful with jumping to conclusions: I never even cited ReLUs or Sigmoids in my post! I don't have any opinion on which non-linearity is better, I only know both are dramatic non-linearities. My claims were about linear circuits. You should use whatever nonlinear element works best in your Neural Network, of course (and I've heard ReLUs have good advantages).
> And then there's the fact that you can represent exponentially larger numbers with roughly linearly (or polynomially) increasing memory, while with analog circuits you pay a quadratic cost on the exponential, so ~n^k vs ~exp(2n) power consumption doesn't look good from this point of view.
That's true, I feel stupid for not having thought of that!
I'm not an electrical engineer, but with the FETs that modern Intel chips are using, what fraction of their power consumption comes from parasitic gate capacitance, versus other losses?
And if you operated in the linear region, what's the ballpark steady-state I_SD current you'd need on one FET to drive the gate of the next FET?
I think that's what this comes down to: if gate capacitance dominates other losses even in the linear regime, you still win by not having a clock and lots of digital transitions.
You could even imagine exploiting that: apply 'slow' augmentations of the input data that get you the equivalent of a bunch of iterations on a single example batch, while incurring a much smaller fraction of that initial cost because activations aren't going to change nearly as much as switching to a whole new example batch.
"parasitic gate capacitance" - not sure if you want to call it "parasitic", after all, a gate capacitance is what makes everything work!
Power is mainly lost via leakage (the smaller the transistor, the more it leaks), and via interconnect capacitance, which dominates all other capacitances in modern circuits.
> "parasitic gate capacitance" - not sure if you want to call it "parasitic", after all, a gate capacitance is what makes everything work!
Of course, but the 'ideal' FET has zero gate capacitance, despite that being the way they work.
> Power is mainly lost via leakage (the smaller the transistor, the more it leaks), and via interconnect capacitance, which dominates all other capacitances in modern circuits.
Interconnect meaning things like the buses? There's no reason to want a von Neumann architecture for an analog chip. If that leaves leakage, I suppose an analog chip would be the beneficiary of needing a lot fewer transistors per op.
'ideal' FET has zero gate capacitance, despite that being the way they work.
I don't understand this statement. What do you mean? A FET is a capacitor (gate to channel). If a gate has no capacitance, you have no transistor.
Interconnect means wire. This has nothing to do with von Neumann architecture. If you have wires in your circuit, then you have wire capacitance. As transistors get smaller, that capacitance starts to dominate internal transistor capacitances.
Let the laws of physics do the recurrent math for you. Analog RNN computers would be very interesting, but they would require first setting in stone the basics of the algorithms we use. We are still only beginning to explore the algorithm space, and that requires a flexibility that analog computers (or even ASICs or FPGAs) don't provide.
It's still not clear whether the future of AI will even involve neural networks at all. Intuitively, they seem so inefficient.
That's not true. NNs don't like noise; there was a lot of research done in the 90s on the effect of noise on NNs. Random noise over a certain threshold will progressively degrade the performance of a NN, and below the threshold it will have no effect.
Dropout is not the same as random noise. By using dropout you eliminate some neurons from making contribution. As a result, you effectively train many smaller nets, each one adjusting its available weights to perform the same task. During testing, there's no noise - all neurons are back in business and contributing.
> Dropout is not the same as random noise. By using dropout you eliminate some neurons from making contribution. As a result, you effectively train many smaller nets, each one adjusting its available weights to perform the same task. During testing, there's no noise - all neurons are back in business and contributing.
I was speaking loosely -- dropout is multiplicative Bernoulli noise on the hidden layers.
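Concretely, that's essentially all dropout does at training time -- a minimal NumPy sketch (the 0.5 keep probability and the inverted-dropout scaling are just the usual conventions, not anything specific from this thread):

    import numpy as np

    rng = np.random.default_rng(0)

    def dropout(h, keep_prob=0.5, train=True):
        """Dropout as multiplicative Bernoulli noise on the hidden activations."""
        if not train:
            return h                        # at test time all units contribute
        mask = rng.binomial(1, keep_prob, size=h.shape)
        return h * mask / keep_prob         # rescale so the expected value is unchanged

    h = rng.normal(size=8)                  # some hidden-layer activations
    print(dropout(h))                       # training: units randomly zeroed
    print(dropout(h, train=False))          # testing: unchanged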
> That's not true. NNs don't like noise; there was a lot of research done in the 90s on the effect of noise on NNs. Random noise over a certain threshold will progressively degrade the performance of a NN, and below the threshold it will have no effect.
I'd argue that dropout (and its predecessor in denoising autoencoders) are perfectly valid to see as noise, albeit multiplicative.
You are missing my point - with dropout, you don't have any noise during the operation of the net. The noise we are talking about (circuit noise) is always present.
No, we are talking about a random electrical circuit noise in the analog NN hardware. Of course, if the noise is known and fixed, the net could learn to compensate (to a certain extent).
The noise we are talking about is like when you put your finger on the chip, and raise its temperature by 10 degrees, the whole thing needs to be retrained.
> The noise we are talking about is like when you put your finger on the chip, and raise its temperature by 10 degrees, the whole thing needs to be retrained.
What would change with temperature that would require retraining? Are you saying the output of an op could depend sensitively on temperature, or that higher temperatures would increase things like thermal or shot noise? Why would the latter require retraining?
I did my thesis on this topic (at the time we were searching for the minimum ALU precision needed to run them on zero-power devices).
It's interesting: NNs degrade at about 6 bits, and that's mostly because the transfer function becomes stable and the training gets stuck in local minima more often.
We built a two-step training methodology: first you train them at 16-bit precision, finding the absolute minimum, then you retrain them at 6-bit precision, and the NN basically learns to cope with the precision loss on its own.
The funny part is, the fewer bits you have, the more robust the network becomes, because error correction becomes a normal part of its transfer function.
We couldn't make the network converge at 4 bits, however. We tried different transfer functions, but ran out of time before getting meaningful results (each function needs its own backpropagation adjustment, and things like that take time -- I'm not a mathematician :D).
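For readers wondering what that two-step procedure might look like, here is a rough NumPy sketch. The uniform quantization grid, the [-4, 4] weight range, and the idea of quantizing only the forward pass are my assumptions for illustration, not details from the thesis:

    import numpy as np

    def quantize(w, bits=6, w_max=4.0):
        """Uniformly quantize values to a signed `bits`-wide grid over [-w_max, w_max]."""
        levels = 2 ** (bits - 1) - 1
        return np.clip(np.round(w / w_max * levels), -levels, levels) * w_max / levels

    # Stage 1: train normally at 16-bit (or float) precision to find a good minimum.
    # Stage 2: retrain, but run the forward pass through quantize() so the network
    # sees -- and learns to compensate for -- the 6-bit weights, while the updates
    # are applied to the full-precision copy (a straight-through style scheme; my
    # assumption about how one would implement the retraining step today).
    rng = np.random.default_rng(0)
    W = rng.normal(size=(4, 4)) * 0.5        # pretend this came out of stage 1
    x = rng.normal(size=4)

    print(W @ x)                             # full-precision output
    print(quantize(W, bits=6) @ x)           # what stage 2 trains against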
I had similar empirical results on one of my PhD projects for medical image classification. With small data sets, we got better results on 8-bit data sets compared to 16-bit. We viewed it as a form of regularization that was extremely effective on smaller data sets with a lot of noise (x-rays in this case).
When using 8-bit weights, what kind of mapping do you do? Do you map the 8-bit range into -10 to 10? Do you have more precision near zero or is it a linear mapping?
Don't know about him, but I was working with [-8, 8] for inputs and [-4, 4] for weights; using the atan function for the transfer function maps quite well, and there is no need to oversaturate the next layer.
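To illustrate the difference the question is getting at, here is a small sketch comparing a plain linear 8-bit mapping onto [-4, 4] with a mu-law-style mapping that gives more precision near zero (the mu-law choice is just one hypothetical example of a non-linear mapping, not what either commenter used):

    import numpy as np

    W_MAX = 4.0                                   # map 8-bit codes onto [-4, 4], as above

    def linear_decode(q, w_max=W_MAX):
        """Plain linear mapping: uniform step size everywhere."""
        return q / 127.0 * w_max

    def mulaw_decode(q, w_max=W_MAX, mu=255.0):
        """Mu-law style mapping: finer steps near zero, coarser near +/- w_max."""
        y = q / 127.0
        return np.sign(y) * ((1.0 + mu) ** np.abs(y) - 1.0) / mu * w_max

    codes = np.array([1, 2, 64, 127])
    print(linear_decode(codes))   # evenly spaced values (~0.031 per code)
    print(mulaw_decode(codes))    # tiny values near zero, large steps near the edges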
The problem with using digital calculations is that they are deterministic. If a result is really small, it is just rounded down to zero. So if you add a bunch of small numbers, you get zero. Even if the result should be large.
Stochastic rounding can fix this. You round each step with a probability chosen so that its expected value is unchanged. Usually it will round down to 0, but sometimes it will round up to 1.
Relevant paper, using stochastic rounding. Without it the results get worse and worse before you even get to 8 bits. With stochastic rounding, there is no performance degradation. You could probably reduce the bits even further. I think it may even be possible to get it down to 1 or 2 bits: http://arxiv.org/abs/1502.02551
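A minimal sketch of the rounding rule itself (not the fixed-point training setup from the paper):

    import numpy as np

    rng = np.random.default_rng(0)

    def stochastic_round(x):
        """Round up with probability equal to the fractional part, down otherwise,
        so the expected value of the rounded number equals x itself."""
        floor = np.floor(x)
        frac = x - floor
        return floor + (rng.random(np.shape(x)) < frac)

    # Adding many numbers smaller than one step: deterministic rounding gives 0,
    # stochastic rounding is right on average.
    small = np.full(1000, 0.001)
    print(np.round(small).sum())              # 0.0  -- everything rounds down
    print(stochastic_round(small).sum())      # ~1.0 on average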
Point of interest: if you do the fundamental physics on neuronal membranes, the number of levels that are actually distinguishable given the noise in the system is only about 1000. So even in a biological system there are only about 4x the number of discrete levels. I realize this isn't a good match to what is mentioned in the article, but it does put some constraints on the maximum dynamic range that biological sensors have to work within.
These networks ought to be robust to minor changes in W. It's the topology that matters, and frankly most of the W_ij != 0 are spurious connections -- meaning perturbation analysis will show that they play no causal role in the computation. I wrote a paper on this which has >100 citations (Survival of The Sparsest: Robust Gene Networks are Parsimonious). I used gene networks, but those are just a special case of neural networks. In fact, a bunch of papers on gene regulatory networks have shown that topology is the main driver of function -- not surprising: if you show the circuit diagram of an 8-bit adder to an EE, they'll know exactly what it does. Logically it has to be so. You can even model the gene network of the Drosophila segmentation pattern with Boolean (1-bit) networks.

The problem with ANN research is that few take the time to understand why things function as they do. We should be reverse engineering these from biology. Every time a major advancement is made in ANNs, neurobiologists say "yes, we could have told you that ten years ago" -- deep learning is just the latest example. It will hit its asymptote soon, then people will say that AI failed to live up to its expectations, then someone will make a new discovery. It's very frustrating to sit on the sidelines and watch this happen again and again.
>On the general CPU side, modern SIMD instruction sets are often geared towards float, and so eight bit calculations don’t offer a massive computational advantage on recent x86 or ARM chips.
This isn't true: modern SIMD instruction sets have tons of operations for smaller fixed-point numbers, as used heavily in video codecs. Unless the author meant some sort of weird 8-bit float?
Agreed. I'm working on an 8-bit floating-point format that is optimized for learning algos, easy to soft-emulate, and very efficient in hardware. One of the cool things about this float is that transfer functions (like the logistic) basically become a lookup table, for really good performance.
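The exact 8-bit format isn't specified here, but the lookup-table idea is easy to illustrate: with only 256 possible input codes, the entire logistic function can be precomputed. A sketch using plain int8 codes over an assumed [-8, 8] input range:

    import numpy as np

    # With 8-bit activations there are only 256 possible inputs, so the logistic
    # (or any transfer function) can be precomputed once as a 256-entry table.
    IN_MAX = 8.0                                     # assumed input range [-8, 8]
    codes = np.arange(-128, 128)                     # every possible int8 code
    LOGISTIC_LUT = 1.0 / (1.0 + np.exp(-codes / 127.0 * IN_MAX))

    def logistic_8bit(q):
        """Evaluate the logistic for int8-coded inputs with a single table lookup."""
        return LOGISTIC_LUT[np.asarray(q, dtype=np.int64) + 128]

    print(logistic_8bit(np.array([-128, -64, 0, 64, 127], dtype=np.int8)))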
That's fascinating, especially since very slow training, where the weights don't change much per cycle, is in fashion. One would think that would result in changes rounding down to zero and nothing happening, but apparently it doesn't.
Wouldn't that mean that 8-bit cores would be enough to simulate neural networks? That might significantly reduce the number of transistors, thus increasing the number of cores and the parallelism.
I guess this is why rowing crews that regularly practice on choppy water end up doing better in an average competition (no citation, just something my coach once told me). Training in adverse conditions results in better built in corrections.
It's not just eight bits. It's 8 bits * # of nodes.
The bits per node just determine the 'resolution' of your individual nodes; while the network as a whole determines how many states can be represented.