Latest revision as of 10:09, 11 June 2025

Exploding Gradient Problem

The Exploding Gradient Problem is a common issue in training deep neural networks where the gradients grow too large during backpropagation. This leads to very large weight updates, making the model unstable or completely unusable.

πŸ“ˆ What Are Gradients?

Gradients are computed during the backpropagation step of training. They help the model understand how to change its weights to reduce error.

Gradient = βˆ‚Loss / βˆ‚Weight

If gradients become very large, the weight updates become huge, which can cause the model to diverge (never reach a good solution).
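As an illustrative sketch (the depth, width, and weight values below are all made up), backpropagation multiplies the incoming gradient by each layer's weight matrix; when those matrices scale vectors up, the gradient norm grows exponentially with depth:

```python
import numpy as np

# Illustrative only: a 4-unit "network" 50 layers deep whose every layer
# scales signals by 1.5. Backpropagation multiplies the gradient by each
# layer's (transposed) weight matrix, so its norm grows like 1.5**depth.
depth = 50
grad = np.ones(4)            # gradient arriving from the loss
W = 1.5 * np.eye(4)          # one layer's weight matrix (scales norms up)

for _ in range(depth):
    grad = W.T @ grad        # one chain-rule step per layer

print(np.linalg.norm(grad))  # about 2 * 1.5**50, i.e. over a billion
```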

⚠️ When Does It Happen?

It usually happens in:

  • Very deep networks with many layers
  • Recurrent Neural Networks (RNNs), especially for long sequences
  • When using poor weight initialization

πŸ§ͺ Example

Suppose a layer's weight matrix receives a large gradient. The weight update is computed as:

ΔW = η × Gradient

If the gradient is large (e.g., 10,000), even a small learning rate η leads to massive weight updates.
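Plugging those numbers in (a sketch, with both values taken from the sentence above):

```python
# Hypothetical values: a small learning rate and the large gradient above.
learning_rate = 0.01                 # eta
gradient = 10_000.0                  # unusually large gradient
delta_w = learning_rate * gradient   # size of the resulting weight update
print(delta_w)                       # 100.0, a huge jump for a single step
```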

This can result in:

  • Loss becoming NaN (Not a Number) πŸ’₯
  • Weights exploding to infinity ➑️ ∞
  • Model failing to train 😒

πŸ” Symptoms of Exploding Gradients

  • ❌ Loss value jumps or becomes NaN
  • πŸ“ˆ Weights become excessively large
  • πŸ” Training fails to converge
  • πŸ’₯ Network outputs explode to very high values

πŸ”§ Solutions

Several techniques are commonly used to fix or prevent this issue:

1. Gradient Clipping

Limit (or "clip") the gradients to a maximum value during backpropagation:

If β€–gβ€– > threshold, then g := (threshold / β€–gβ€–) Γ— g

This keeps gradients from becoming too large.
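A minimal NumPy sketch of norm-based clipping (the function name is illustrative; frameworks ship built-in equivalents, e.g. PyTorch's torch.nn.utils.clip_grad_norm_):

```python
import numpy as np

def clip_by_norm(grad, threshold):
    """Rescale grad so its L2 norm never exceeds threshold."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = (threshold / norm) * grad   # shrink, keeping the direction
    return grad

g = np.array([3000.0, 4000.0])             # norm = 5000, far too large
clipped = clip_by_norm(g, threshold=1.0)
print(clipped)                             # [0.6 0.8], norm exactly 1
```

Gradients already below the threshold pass through unchanged, so training is unaffected until a spike occurs.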

2. Better Weight Initialization

Use techniques like:

  • Xavier initialization for Tanh/Sigmoid
  • He initialization for ReLU

These help control the scale of activations and gradients.
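A sketch of both schemes using normally distributed weights (the helper names are illustrative; deep-learning frameworks provide their own initializers):

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_init(fan_in, fan_out):
    # Xavier/Glorot: variance 2 / (fan_in + fan_out), suited to tanh/sigmoid
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))

def he_init(fan_in, fan_out):
    # He: variance 2 / fan_in, suited to ReLU
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

W = he_init(512, 256)
print(W.std())   # close to sqrt(2 / 512) = 0.0625
```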

3. Use Normalization Layers

  • Batch Normalization helps to keep the network outputs within a stable range.
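A minimal batch-normalization sketch (training-time statistics only; running averages and learned updates to gamma/beta are omitted):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each feature over the batch, then scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)   # zero mean, unit variance
    return gamma * x_hat + beta

x = np.array([[1.0, 200.0],
              [3.0, 400.0],
              [5.0, 600.0]])    # two features on very different scales
y = batch_norm(x)
print(y.mean(axis=0))           # approximately [0, 0]
print(y.std(axis=0))            # approximately [1, 1]
```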

4. Choose Better Activation Functions

ReLU and its variants (Leaky ReLU, ELU) tend to work better in deep networks.
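The activations named above, written out elementwise as a sketch (frameworks provide optimized built-ins):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)              # zero out negatives

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)   # small slope for negatives

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))  # smooth negatives

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))        # [0. 0. 3.]
print(leaky_relu(x))  # [-0.02  0.    3.  ]
```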

πŸ“š Summary Table

Problem            | Cause                              | Effect                               | Solution
Exploding Gradient | Deep networks, poor initialization | Huge weight updates, loss divergence | Gradient clipping, normalization, better activation functions

🧠 Difference from Vanishing Gradient

Problem            | Gradient Size   | Effect
Vanishing Gradient | Near zero       | Training stops (no learning)
Exploding Gradient | Extremely large | Training blows up (unstable learning)

πŸ“Ž See Also

  • Vanishing Gradient Problem
  • Backpropagation
  • Gradient Clipping
  • Weight Initialization
  • ReLU