Vanishing Gradient Problem
The Vanishing Gradient Problem is a common issue encountered during the training of deep neural networks. It occurs when the gradients used to update the weights become extremely small as they are propagated backward through the layers, effectively preventing the earlier layers from learning.
What is a Gradient?
In neural networks, gradients are values calculated during backpropagation. They tell the model how much, and in which direction, each weight should change to reduce the loss (error). The gradient is the derivative of the loss with respect to each weight.
If this value is small, the weights update slowly. If it is extremely small (close to zero), learning effectively stops: this is the vanishing gradient problem.
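To make this concrete, here is a minimal sketch of a gradient computation using PyTorch's autograd (PyTorch is not part of the discussion above; it is assumed here purely for illustration):

```python
import torch

# A toy "model" with a single weight and a squared-error loss
w = torch.tensor(2.0, requires_grad=True)   # the weight we want to learn
x = torch.tensor(3.0)                       # one input
y_true = torch.tensor(10.0)                 # the target value

y_pred = w * x                              # forward pass
loss = (y_pred - y_true) ** 2               # loss (error)

loss.backward()                             # backpropagation: computes d(loss)/d(w)
print(w.grad)                               # tensor(-24.) -> how much (and which way) to move w
```

Here the gradient is large, so the weight can move quickly toward a better value. The vanishing gradient problem is what happens when this number becomes vanishingly small for the early layers of a deep network.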
When Does It Happen?
The problem usually arises in:
- Very deep neural networks with many layers
- Networks that use activation functions like Sigmoid or Tanh, which squash their inputs into a narrow output range and have small derivatives over most of that range
Example
Let's say we use the Sigmoid function:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

Its derivative is:

$$\sigma'(x) = \sigma(x)\big(1 - \sigma(x)\big)$$

The maximum derivative value is 0.25, reached at $x = 0$. So, if we keep multiplying by numbers no larger than 0.25 through many layers:

$$\underbrace{0.25 \times 0.25 \times \cdots \times 0.25}_{n \text{ layers}} = 0.25^{\,n} \longrightarrow 0 \quad \text{as } n \text{ grows}$$
This means early layers get almost no signal, and they learn nothing.
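A quick numerical check of this claim, in plain Python: each Sigmoid layer can multiply the backpropagated gradient by at most 0.25, so an upper bound on the gradient after n layers is 0.25^n.

```python
# Upper bound on the gradient signal after n stacked Sigmoid layers:
# each layer multiplies the gradient by at most 0.25.
for n in (1, 5, 10, 20):
    print(f"{n:2d} layers -> at most {0.25 ** n:.2e}")

#  1 layers -> at most 2.50e-01
#  5 layers -> at most 9.77e-04
# 10 layers -> at most 9.54e-07
# 20 layers -> at most 9.09e-13
```

By 20 layers the bound is around 10^-13, which is effectively zero for learning purposes.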
Effects of Vanishing Gradients
- Early layers learn very slowly or not at all
- Training becomes inefficient or fails completely
- Model accuracy suffers, especially in deep networks
Solutions
Several techniques help reduce or fix the vanishing gradient problem:
1. Use ReLU Activation
ReLU (Rectified Linear Unit) does not squash its positive inputs:

$$f(x) = \max(0, x), \qquad f'(x) = 1 \ \text{ for } x > 0$$

Because the derivative is 1 for positive inputs, the gradient is not shrunk as it passes back through the layer.
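A small sketch (again assuming PyTorch for illustration) that contrasts the two derivatives numerically:

```python
import torch

x = torch.linspace(-4, 4, 9, requires_grad=True)

# Sigmoid derivative: sigma(x) * (1 - sigma(x)), peaks at 0.25
s = torch.sigmoid(x)
print((s * (1 - s)).max().item())   # ~0.25

# ReLU derivative: 1 for x > 0, 0 otherwise
torch.relu(x).sum().backward()
print(x.grad)                        # tensor([0., 0., 0., 0., 0., 1., 1., 1., 1.])
```

Multiplying gradients by 1 (ReLU, positive inputs) leaves them intact; multiplying by at most 0.25 (Sigmoid) shrinks them at every layer.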
2. Batch Normalization
BatchNorm normalizes the inputs to each layer (roughly zero mean and unit variance per mini-batch), which keeps activations out of the saturated regions of the activation function and helps maintain healthy gradient flow.
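In PyTorch (assumed here for illustration), a BatchNorm layer is typically placed between a linear layer and its activation. This is a minimal sketch, not a tuned architecture:

```python
import torch
import torch.nn as nn

# A small MLP with BatchNorm after each hidden linear layer
model = nn.Sequential(
    nn.Linear(64, 128),
    nn.BatchNorm1d(128),   # re-centers and re-scales activations per mini-batch
    nn.ReLU(),
    nn.Linear(128, 128),
    nn.BatchNorm1d(128),
    nn.ReLU(),
    nn.Linear(128, 10),
)

x = torch.randn(32, 64)    # a mini-batch of 32 examples
print(model(x).shape)      # torch.Size([32, 10])
```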
3. Residual Connections (ResNet)
ResNet uses skip connections that add a layer's input directly to its output, letting gradients bypass the intermediate transformations. This keeps a strong gradient signal flowing to the early layers even in very deep networks.
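Below is a minimal, fully-connected sketch of the idea (the original ResNet blocks use convolutions and BatchNorm; this simplified version only illustrates the skip connection):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A toy residual block: output = F(x) + x, where "+ x" is the skip connection."""
    def __init__(self, dim):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x):
        # The "+ x" term gives gradients a direct path around self.body,
        # so the signal reaching earlier layers is not forced through
        # every intermediate derivative.
        return self.body(x) + x

block = ResidualBlock(32)
x = torch.randn(4, 32)
print(block(x).shape)      # torch.Size([4, 32])
```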
4. Proper Weight Initialization
Techniques like Xavier (Glorot) or He initialization scale the initial weights according to each layer's size, reducing the chance that gradients shrink or explode as they pass through many layers.
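A short sketch of both schemes using PyTorch's built-in initializers (shown on two separate layers purely for illustration):

```python
import torch.nn as nn

# He (Kaiming) initialization: variance scaled for ReLU activations
relu_layer = nn.Linear(256, 256)
nn.init.kaiming_normal_(relu_layer.weight, nonlinearity='relu')
nn.init.zeros_(relu_layer.bias)

# Xavier (Glorot) initialization: variance scaled for Sigmoid/Tanh activations
tanh_layer = nn.Linear(256, 256)
nn.init.xavier_uniform_(tanh_layer.weight, gain=nn.init.calculate_gain('tanh'))
nn.init.zeros_(tanh_layer.bias)
```

Both schemes choose the scale of the initial weights from the layer's fan-in/fan-out so that activations and gradients keep roughly constant variance from layer to layer.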
Summary
| Problem | Cause | Result | Solution |
|---|---|---|---|
| Vanishing Gradient | Small derivatives (e.g., from Sigmoid/Tanh) | Early layers stop learning | Use ReLU, BatchNorm, ResNet, proper initialization |
The vanishing gradient problem was a major obstacle in training deep neural networks. With modern techniques like ResNet and ReLU, it is now more manageable.