Thakshashila: Created page with "== Vanishing Gradient Problem == The '''Vanishing Gradient Problem''' is a common issue encountered during the training of deep neural networks. It occurs when the gradients (used to update weights) become extremely small, effectively preventing the network from learning. === 🧠 What is a Gradient? === In neural networks, gradients are values calculated during '''backpropagation'''. They show how much the model's weights should change to reduce the loss (error). The..."

2025-06-11T10:06:54Z

Created page with "== Vanishing Gradient Problem == The '''Vanishing Gradient Problem''' is a common issue encountered during the training of deep neural networks. It occurs when the gradients (used to update weights) become extremely small, effectively preventing the network from learning. === 🧠 What is a Gradient? === In neural networks, gradients are values calculated during '''backpropagation'''. They show how much the model's weights should change to reduce the loss (error). The..."

New page

== Vanishing Gradient Problem ==

The '''Vanishing Gradient Problem''' is a common issue encountered during the training of deep neural networks. It occurs when the gradients (used to update weights) become extremely small, effectively preventing the network from learning.

=== 🧠 What is a Gradient? ===

In neural networks, gradients are values calculated during '''backpropagation'''. They show how much the model's weights should change to reduce the loss (error). The gradient is computed using the derivative of the loss with respect to each weight.

:<math> \text{Gradient} = \frac{\partial \text{Loss}}{\partial \text{Weights}} </math>

If this value is small, the model updates the weights slowly. If it is too small (close to zero), learning stops — this is the vanishing gradient problem.

=== ⚠️ When Does It Happen? ===

The problem usually arises in:
* Very deep neural networks with many layers
* Networks that use activation functions like '''Sigmoid''' or '''Tanh''', which squash outputs to small ranges

=== 🧪 Example ===

Let's say we use the Sigmoid function:

:<math> \sigma(x) = \frac{1}{1 + e^{-x}} </math>

Its derivative is:

:<math> \sigma'(x) = \sigma(x)(1 - \sigma(x)) </math>

The maximum derivative value is 0.25. So, if we keep multiplying by small numbers through many layers:

:<math> 0.25 \times 0.25 \times \dots \times 0.25 = \text{Very small number} </math>

This means early layers get almost no signal, and they learn nothing.

=== 🔍 Effects of Vanishing Gradients ===

* 🧠 Early layers learn very slowly or not at all
* 📉 Training becomes inefficient or completely fails
* 🚫 Model accuracy suffers, especially in deep networks

=== 🔧 Solutions ===

Several techniques help reduce or fix the vanishing gradient problem:

==== 1. Use ReLU Activation ====

ReLU (Rectified Linear Unit) avoids squashing outputs too much:

:<math> f(x) = \max(0, x) </math>

This function keeps large gradients alive for positive values.

==== 2. Batch Normalization ====

BatchNorm helps by normalizing inputs at each layer, maintaining healthy gradient flow.

==== 3. Residual Connections (ResNet) ====

ResNet uses skip connections that let gradients bypass some layers, helping the model stay "awake" even in deep networks.

==== 4. Proper Weight Initialization ====

Techniques like Xavier or He initialization reduce the chance of gradients shrinking or exploding.

=== 📚 Summary ===

{| class="wikitable"
! Problem
! Cause
! Result
! Solution
|-
| Vanishing Gradient
| Small derivatives (e.g., from Sigmoid/Tanh)
| Early layers stop learning
| Use ReLU, BatchNorm, ResNet, proper initialization
|}

The vanishing gradient problem was a major obstacle in training deep neural networks. With modern techniques like ResNet and ReLU, it is now more manageable.

=== 📎 See Also ===
* [[Exploding Gradient Problem]]
* [[Backpropagation]]
* [[Activation Functions]]
* [[ReLU]]
* [[ResNet]]

Vanishing gradient problem - Revision history