Vanishing Gradient Problem
The Vanishing Gradient Problem is a common issue encountered during the training of deep neural networks. It occurs when the gradients used to update the weights become extremely small as they are propagated backward through the layers, effectively preventing the earlier layers from learning.
What is a Gradient?
In neural networks, gradients are values calculated during backpropagation. They tell the model how much, and in which direction, each weight should change to reduce the loss (error). The gradient is the derivative of the loss with respect to each weight.
If this value is small, the weights update slowly. If it is extremely small (close to zero), learning effectively stops: this is the vanishing gradient problem.
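To make this concrete, here is a minimal sketch of a gradient computation using PyTorch's autograd (PyTorch is not part of the discussion above; it is assumed here purely for illustration):

```python
import torch

# A toy "model" with a single weight and a squared-error loss
w = torch.tensor(2.0, requires_grad=True)   # the weight we want to learn
x = torch.tensor(3.0)                       # one input
y_true = torch.tensor(10.0)                 # the target value

y_pred = w * x                              # forward pass
loss = (y_pred - y_true) ** 2               # loss (error)

loss.backward()                             # backpropagation: computes d(loss)/d(w)
print(w.grad)                               # tensor(-24.) -> how much (and which way) to move w
```

Here the gradient is large, so the weight can move quickly toward a better value. The vanishing gradient problem is what happens when this number becomes vanishingly small for the early layers of a deep network.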
When Does It Happen?
The problem usually arises in:
- Very deep neural networks with many layers
- Networks that use activation functions like Sigmoid or Tanh, which squash their inputs into a narrow output range and have small derivatives over most of that range
Example
Let's say we use the Sigmoid function:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

Its derivative is:

$$\sigma'(x) = \sigma(x)\big(1 - \sigma(x)\big)$$

The maximum derivative value is 0.25, reached at $x = 0$. So, if we keep multiplying by numbers no larger than 0.25 through many layers:

$$\underbrace{0.25 \times 0.25 \times \cdots \times 0.25}_{n \text{ layers}} = 0.25^{\,n} \longrightarrow 0 \quad \text{as } n \text{ grows}$$
This means early layers get almost no signal, and they learn nothing.
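A quick numerical check of this claim, in plain Python: each Sigmoid layer can multiply the backpropagated gradient by at most 0.25, so an upper bound on the gradient after n layers is 0.25^n.

```python
# Upper bound on the gradient signal after n stacked Sigmoid layers:
# each layer multiplies the gradient by at most 0.25.
for n in (1, 5, 10, 20):
    print(f"{n:2d} layers -> at most {0.25 ** n:.2e}")

#  1 layers -> at most 2.50e-01
#  5 layers -> at most 9.77e-04
# 10 layers -> at most 9.54e-07
# 20 layers -> at most 9.09e-13
```

By 20 layers the bound is around 10^-13, which is effectively zero for learning purposes.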
Effects of Vanishing Gradients
- Early layers learn very slowly or not at all
- Training becomes inefficient or fails completely
- Model accuracy suffers, especially in deep networks
Solutions
Several techniques help reduce or fix the vanishing gradient problem:
1. Use ReLU Activation
ReLU (Rectified Linear Unit) does not squash its positive inputs:

$$f(x) = \max(0, x), \qquad f'(x) = 1 \ \text{ for } x > 0$$

Because the derivative is 1 for positive inputs, the gradient is not shrunk as it passes back through the layer.
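A small sketch (again assuming PyTorch for illustration) that contrasts the two derivatives numerically:

```python
import torch

x = torch.linspace(-4, 4, 9, requires_grad=True)

# Sigmoid derivative: sigma(x) * (1 - sigma(x)), peaks at 0.25
s = torch.sigmoid(x)
print((s * (1 - s)).max().item())   # ~0.25

# ReLU derivative: 1 for x > 0, 0 otherwise
torch.relu(x).sum().backward()
print(x.grad)                        # tensor([0., 0., 0., 0., 0., 1., 1., 1., 1.])
```

Multiplying gradients by 1 (ReLU, positive inputs) leaves them intact; multiplying by at most 0.25 (Sigmoid) shrinks them at every layer.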
2. Batch Normalization
BatchNorm normalizes the inputs to each layer (roughly zero mean and unit variance per mini-batch), which keeps activations out of the saturated regions of the activation function and helps maintain healthy gradient flow.
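In PyTorch (assumed here for illustration), a BatchNorm layer is typically placed between a linear layer and its activation. This is a minimal sketch, not a tuned architecture:

```python
import torch
import torch.nn as nn

# A small MLP with BatchNorm after each hidden linear layer
model = nn.Sequential(
    nn.Linear(64, 128),
    nn.BatchNorm1d(128),   # re-centers and re-scales activations per mini-batch
    nn.ReLU(),
    nn.Linear(128, 128),
    nn.BatchNorm1d(128),
    nn.ReLU(),
    nn.Linear(128, 10),
)

x = torch.randn(32, 64)    # a mini-batch of 32 examples
print(model(x).shape)      # torch.Size([32, 10])
```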
3. Residual Connections (ResNet)
ResNet uses skip connections that add a layer's input directly to its output, letting gradients bypass the intermediate transformations. This keeps a strong gradient signal flowing to the early layers even in very deep networks.
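Below is a minimal, fully-connected sketch of the idea (the original ResNet blocks use convolutions and BatchNorm; this simplified version only illustrates the skip connection):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A toy residual block: output = F(x) + x, where "+ x" is the skip connection."""
    def __init__(self, dim):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x):
        # The "+ x" term gives gradients a direct path around self.body,
        # so the signal reaching earlier layers is not forced through
        # every intermediate derivative.
        return self.body(x) + x

block = ResidualBlock(32)
x = torch.randn(4, 32)
print(block(x).shape)      # torch.Size([4, 32])
```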
4. Proper Weight Initialization
Techniques like Xavier (Glorot) or He initialization scale the initial weights according to each layer's size, reducing the chance that gradients shrink or explode as they pass through many layers.
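A short sketch of both schemes using PyTorch's built-in initializers (shown on two separate layers purely for illustration):

```python
import torch.nn as nn

# He (Kaiming) initialization: variance scaled for ReLU activations
relu_layer = nn.Linear(256, 256)
nn.init.kaiming_normal_(relu_layer.weight, nonlinearity='relu')
nn.init.zeros_(relu_layer.bias)

# Xavier (Glorot) initialization: variance scaled for Sigmoid/Tanh activations
tanh_layer = nn.Linear(256, 256)
nn.init.xavier_uniform_(tanh_layer.weight, gain=nn.init.calculate_gain('tanh'))
nn.init.zeros_(tanh_layer.bias)
```

Both schemes choose the scale of the initial weights from the layer's fan-in/fan-out so that activations and gradients keep roughly constant variance from layer to layer.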
Summary
| Problem | Cause | Result | Solution |
|---|---|---|---|
| Vanishing Gradient | Small derivatives (e.g., from Sigmoid/Tanh) | Early layers stop learning | Use ReLU, BatchNorm, ResNet, proper initialization |
The vanishing gradient problem was a major obstacle in training deep neural networks. With modern techniques like ResNet and ReLU, it is now more manageable.