Gradient Vanishing / Gradient Explosion#

What happens when gradients vanish or explode?#

Remember that a neural network is trained with backpropagation, which applies the chain rule: earlier layers’ gradients are computed by scaling later layers’ gradients. For a composition \(y = f(u(x))\), the chain rule gives:

\[ \frac{dy}{dx} = \frac{df}{du} \frac{du}{dx} \]

If later layers’ gradients are extremely large, they scale the earlier gradients by a huge factor. Because of how computers represent floating-point numbers, the result may overflow and become INFINITY. If later layers’ gradients are extremely small, they scale the earlier gradients down drastically, and the earlier gradients may underflow to exactly 0.
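
As a toy illustration (plain NumPy, single precision, not tied to any particular framework), this is what it looks like when a scaled gradient runs past the limits of what float32 can represent:

```python
import numpy as np

big_grad = np.float32(1e38)      # a gradient that is already very large
small_grad = np.float32(1e-38)   # a gradient that is already very small

# Scaling past float32's range (NumPy may print an overflow warning here):
print(big_grad * np.float32(10.0))    # inf -- overflow: no longer a usable number
print(small_grad * np.float32(1e-8))  # 0.0 -- underflow: the gradient silently disappears
```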

In either case, the computed gradients are no longer meaningful, and the network may become untrainable. Do you really want to update your parameters by INFINITY?

When do gradients vanish or explode?#

In very deep networks, vanishing or exploding gradients are much more likely. For example, if every layer scales the gradient norm by 10 (not that big, considering how many parameters a layer has), then after 300 layers (not uncommon in current neural networks) the gradient magnitude is on the order of \(10^{300}\): far past what single-precision floats can represent and at the very edge of double precision, so the computed gradients effectively become INFINITY. The same argument, with each layer scaling by 0.1 instead, gives vanishing gradients. In deep networks, both happen quite often.
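
Here is a minimal sketch of that accumulation (plain NumPy, single precision; the exact layer count at which things break depends on the floating-point format):

```python
import numpy as np

grad_up = np.float32(1.0)
grad_down = np.float32(1.0)

for layer in range(300):
    grad_up *= np.float32(10.0)   # every layer scales the gradient up by 10
    grad_down *= np.float32(0.1)  # every layer scales the gradient down by 10

print(grad_up)    # inf -- float32 overflows after only ~38 of these layers
print(grad_down)  # 0.0 -- and underflows to exactly zero not long after that
```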

How to deal with gradient vanishing or explosion?#

There are several ways to deal with vanishing or exploding gradients.

Normalization#

Normalization layers (such as batch normalization or layer normalization) keep the activations at a consistent scale as they pass through the network, so the factors that multiply the gradient during backpropagation are much less likely to get out of hand.
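
A minimal sketch of the idea, assuming a simplified layer normalization without the usual learnable scale and shift parameters:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Rescale each row of activations to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

# Pretend these activations have blown up after a few layers:
x = np.random.randn(4, 8) * 1000.0
print(layer_norm(x).std(axis=-1))  # back to ~1.0 for every sample
```

Because the activations entering each layer stay at a fixed scale, the gradients flowing back through those layers stay well-behaved too.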

Residual Networks#

Residual networks add identity shortcuts around groups of layers. Gradients can flow back through these shortcuts directly, so the effective path from the loss to the earlier layers behaves like a much shallower network, and shallower paths are less prone to gradient explosion/vanishing.
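
A toy scalar illustration of why the shortcut helps (hypothetical numbers; each “layer” just multiplies its input by a small weight w):

```python
w, depth = 0.1, 50

# Plain network:    x_{k+1} = w * x_k        =>  d x_N / d x_0 = w ** depth
# Residual network: x_{k+1} = x_k + w * x_k  =>  d x_N / d x_0 = (1 + w) ** depth
plain_grad = w ** depth           # ~1e-50: the gradient has vanished
residual_grad = (1 + w) ** depth  # ~117: the identity shortcut keeps a direct path open

print(plain_grad, residual_grad)
```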

Don’t use certain activation functions#

Activation functions like Sigmoid, Softmax, and Tanh saturate: their derivatives are at most 1 (at most 0.25 for Sigmoid), so applying them layer after layer multiplies the gradient by small factors and makes it extremely small. This is one reason ReLU is popular: its derivative is exactly 1 for positive inputs, so it doesn’t shrink gradients as they pass back through it.
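
A quick numeric check of how fast a saturating activation shrinks gradients compared to ReLU (the input value 1.0 is just an illustrative pre-activation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = 1.0                                       # an illustrative pre-activation value
sigmoid_grad = sigmoid(x) * (1 - sigmoid(x))  # ~0.20; never exceeds 0.25
relu_grad = 1.0 if x > 0 else 0.0             # exactly 1 for positive inputs

depth = 30
print(sigmoid_grad ** depth)  # ~1e-21: effectively vanished after 30 layers
print(relu_grad ** depth)     # 1.0: the gradient's scale is preserved
```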