Kangaroos and Training Neural Networks (1994) (googleusercontent.com)
10 points by optimalsolver on May 18, 2021 | 7 comments


> In standard backprop, the most common NN training method, the kangaroo is blind and has to feel around on the ground to make a guess about which way is up. A major problem with standard backprop is that the distance the kangaroo hops is related to the steepness of the terrain. If the kangaroo starts on a gently sloping plain instead of a mountain side, she will take very small hops and make very slow progress. When she finally starts to ascend a mountain, her hops get longer and more dangerous, and she may hop off the mountain altogether. If the kangaroo ever gets near the peak, she may jump back and forth across the peak without ever landing on it.
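
The hop-length part is a fair description of what a fixed learning rate does: the step is literally the learning rate times the local slope. A toy sketch of my own (not from the article), in Python:

    import numpy as np

    # Toy 1-D "terrain": a gentle plain that rises into a sharp peak near x = 10.
    # The slope is ~0.01 out on the plain and large on the mountainside.
    def slope(x):
        return 0.01 + 50 * np.exp(-0.5 * (x - 10) ** 2)

    lr = 1.0  # fixed learning rate
    for x in [0.0, 5.0, 9.0, 9.9]:
        hop = lr * slope(x)  # gradient-ascent step: hop length proportional to steepness
        print(f"at x = {x:4.1f}: slope = {slope(x):7.3f}, hop length = {hop:7.3f}")

    # On the plain the hops are ~0.01 long (very slow progress); at x = 9.9 the
    # hop is ~50 units, easily carrying the kangaroo clear over the peak.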

The first part of this is bad. In backprop we know the gradient just fine. And no analogy has been offered for the aspect that makes backprop distinct from every other algorithm mentioned.


The gradient we want is the gradient of the expected loss under the process which generated the dataset. The gradient we get is an estimate based on only a handful of samples from that process at a time. The analogy holds up fine.
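
A rough sketch of the point (a toy example of mine, not from the article): the "true" gradient averages over the whole dataset, and each small batch only gives a noisy estimate of it.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy data-generating process: y = 3x + noise; we fit y ~ w * x.
    x = rng.normal(size=10_000)
    y = 3.0 * x + rng.normal(scale=0.5, size=10_000)

    def grad(w, xs, ys):
        # gradient of the mean squared error 0.5 * mean((w*x - y)^2) w.r.t. w
        return np.mean((w * xs - ys) * xs)

    w = 0.0
    full = grad(w, x, y)  # gradient over the whole dataset
    batches = [grad(w, x[i:i+32], y[i:i+32]) for i in range(0, 320, 32)]

    print("full-data gradient: ", full)
    print("32-sample estimates:", np.round(batches, 2))
    # Each minibatch gradient points roughly the same way but is noisy:
    # the "blind" kangaroo only ever feels a noisy version of the slope.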


If we were trying to describe stochastic gradient descent, this would be relevant, but we're talking about backprop (which is often run on minibatches, but that's not inherent to it). And there is nothing about backprop that makes the kangaroo more blind than in any other form of gradient-based optimisation.


I would say it's more a commentary on the fact that the gradient is effectively the steepest-descent direction with respect to L2 distance in parameter space, and so can be a bad/inefficient direction to move in even if you have access to the full gradient. Hence the motivation for momentum and second-order optimization.
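
A quick illustrative sketch (mine, not anyone's library code): even with the exact gradient of an ill-conditioned quadratic, plain gradient descent crawls along the shallow direction, while momentum gets much closer to the optimum in the same number of steps.

    import numpy as np

    # Ill-conditioned quadratic loss: very steep in x, shallow in y.
    # The exact gradient is available, yet plain gradient descent struggles.
    def grad(p):
        x, y = p
        return np.array([100.0 * x, 1.0 * y])

    def run(momentum, steps=200, lr=0.015):
        p, v = np.array([1.0, 1.0]), np.zeros(2)
        for _ in range(steps):
            v = momentum * v - lr * grad(p)
            p = p + v
        return p  # optimum is at (0, 0)

    print("plain GD:     ", run(momentum=0.0))
    print("with momentum:", run(momentum=0.9))
    # The learning rate must be small enough for the steep direction, so plain
    # GD barely moves along the shallow one; momentum accumulates velocity
    # there and converges far faster.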


I understand the fragment you quoted to be accurate, and the problem described is mostly ameliorated by optimizers. But I don't understand your criticism of the quoted text. Would you mind elaborating?


The idea of the kangaroo being blind and "feeling around" might be a good metaphor for an optimiser which uses numerical estimation of the gradient (finite differences), but backprop does not do that. At a stretch, one could argue that being blind is a good metaphor for metaheuristic optimisation, but "feeling around" on the ground is not.
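
That "feeling around" picture maps pretty directly onto finite differences; roughly this (my own sketch, not anything from the article):

    def loss(w):
        # toy scalar loss of two parameters
        return (w[0] - 3.0) ** 2 + (w[1] + 1.0) ** 2

    def finite_difference_grad(f, w, eps=1e-5):
        # "feel around": nudge each parameter and see how the loss changes
        grad = []
        for i in range(len(w)):
            bumped = list(w)
            bumped[i] += eps
            grad.append((f(bumped) - f(w)) / eps)
        return grad

    print(finite_difference_grad(loss, [0.0, 0.0]))  # roughly [-6.0, 2.0]
    # Backprop never probes like this: it gets the same numbers exactly,
    # in one backward pass, via the chain rule.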

If we are trying to find a metaphor for backprop, I think we are in trouble, because the central idea in backprop is that we have to calculate the gradient for some variables (eg the north-south variable) before we can calculate the gradient for some others (eg the east-west variable).
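
To make the ordering point concrete, here's a hand-rolled backward pass for a tiny two-layer net (my own sketch, not the article's): the gradient at the output must exist before the gradient for the earlier layer's weights can be formed.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=4)
    W1, W2 = rng.normal(size=(4, 4)), rng.normal(size=(1, 4))

    # forward pass: h depends on W1; the output depends on h and W2
    h = np.tanh(W1 @ x)
    out = W2 @ h
    loss = 0.5 * (out - 1.0) ** 2

    # backward pass: the order is forced by the chain rule
    d_out = out - 1.0            # 1. gradient at the output
    d_W2 = np.outer(d_out, h)    # 2. gradient for the "later" weights
    d_h = W2.T @ d_out           # 3. needs d_out before it can exist
    d_pre = d_h * (1 - h ** 2)   # 4. back through the tanh
    d_W1 = np.outer(d_pre, x)    # 5. gradient for the "earlier" weights, last
    print(d_W1.shape, d_W2.shape)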


I vaguely remember an old story about a flight simulation program being used to train helicopter pilots for flying in the Australian outback.

One of the training requirements was not to try and buzz a troop of kangaroos.

Thing was, the software had been adapted from a military flight simulator, so during early testing, when a trainee got too close to the kangaroos, they would bounce behind a ridge and then launch a ground-to-air missile at the chopper.

Despite the "bug" in the software, it was effective, as the pilots always steered clear of the roos.



