<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://qbase.texpertssolutions.com/index.php?action=history&amp;feed=atom&amp;title=Vanishing_gradient_problem</id>
	<title>Vanishing gradient problem - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://qbase.texpertssolutions.com/index.php?action=history&amp;feed=atom&amp;title=Vanishing_gradient_problem"/>
	<link rel="alternate" type="text/html" href="https://qbase.texpertssolutions.com/index.php?title=Vanishing_gradient_problem&amp;action=history"/>
	<updated>2026-06-15T08:52:59Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.43.1</generator>
	<entry>
		<id>https://qbase.texpertssolutions.com/index.php?title=Vanishing_gradient_problem&amp;diff=256&amp;oldid=prev</id>
		<title>Thakshashila: Created page with &quot;== Vanishing Gradient Problem ==  The &#039;&#039;&#039;Vanishing Gradient Problem&#039;&#039;&#039; is a common issue encountered during the training of deep neural networks. It occurs when the gradients (used to update weights) become extremely small, effectively preventing the network from learning.  === 🧠 What is a Gradient? ===  In neural networks, gradients are values calculated during &#039;&#039;&#039;backpropagation&#039;&#039;&#039;. They show how much the model&#039;s weights should change to reduce the loss (error). The...&quot;</title>
		<link rel="alternate" type="text/html" href="https://qbase.texpertssolutions.com/index.php?title=Vanishing_gradient_problem&amp;diff=256&amp;oldid=prev"/>
		<updated>2025-06-11T10:06:54Z</updated>

		<summary type="html">&lt;p&gt;Created page with &amp;quot;== Vanishing Gradient Problem ==  The &amp;#039;&amp;#039;&amp;#039;Vanishing Gradient Problem&amp;#039;&amp;#039;&amp;#039; is a common issue encountered during the training of deep neural networks. It occurs when the gradients (used to update weights) become extremely small, effectively preventing the network from learning.  === 🧠 What is a Gradient? ===  In neural networks, gradients are values calculated during &amp;#039;&amp;#039;&amp;#039;backpropagation&amp;#039;&amp;#039;&amp;#039;. They show how much the model&amp;#039;s weights should change to reduce the loss (error). The...&amp;quot;&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;== Vanishing Gradient Problem ==&lt;br /&gt;
&lt;br /&gt;
The &amp;#039;&amp;#039;&amp;#039;Vanishing Gradient Problem&amp;#039;&amp;#039;&amp;#039; is a common issue encountered during the training of deep neural networks. It occurs when the gradients (used to update weights) become extremely small, effectively preventing the network from learning.&lt;br /&gt;
&lt;br /&gt;
=== 🧠 What is a Gradient? ===&lt;br /&gt;
&lt;br /&gt;
In neural networks, gradients are values calculated during &amp;#039;&amp;#039;&amp;#039;backpropagation&amp;#039;&amp;#039;&amp;#039;. They show how much the model&amp;#039;s weights should change to reduce the loss (error). The gradient is computed using the derivative of the loss with respect to each weight.&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt; \text{Gradient} = \frac{\partial \text{Loss}}{\partial \text{Weights}} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If this value is small, the model updates the weights slowly. If it is close to zero, the weights barely change and learning effectively stops; this is the vanishing gradient problem.&lt;br /&gt;
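&lt;br /&gt;
As a minimal sketch of why the size of the gradient matters, assume plain gradient descent with a learning rate &amp;lt;math&amp;gt;\eta&amp;lt;/math&amp;gt; (the optimizer and the symbol are assumptions made here for illustration; this page does not specify them). Each weight &amp;lt;math&amp;gt;w&amp;lt;/math&amp;gt; is then updated as:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt; w \leftarrow w - \eta \, \frac{\partial \text{Loss}}{\partial w} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The step taken is proportional to the gradient, whatever the learning rate.&lt;br /&gt;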
&lt;br /&gt;
=== ⚠️ When Does It Happen? ===&lt;br /&gt;
&lt;br /&gt;
The problem usually arises in:&lt;br /&gt;
* Very deep neural networks with many layers&lt;br /&gt;
* Networks that use saturating activation functions like &amp;#039;&amp;#039;&amp;#039;Sigmoid&amp;#039;&amp;#039;&amp;#039; or &amp;#039;&amp;#039;&amp;#039;Tanh&amp;#039;&amp;#039;&amp;#039;, which squash outputs to small ranges and whose derivatives approach zero for large positive or negative inputs&lt;br /&gt;
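&lt;br /&gt;
For reference, the derivative of Tanh can be written in terms of the function itself:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt; \tanh&amp;#039;(x) = 1 - \tanh^2(x) &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
It equals 1 only at x = 0 and falls toward 0 as the input moves away from zero in either direction, so a long chain of such factors can still shrink the gradient.&lt;br /&gt;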
&lt;br /&gt;
=== 🧪 Example ===&lt;br /&gt;
&lt;br /&gt;
Let&amp;#039;s say we use the Sigmoid function:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt; \sigma(x) = \frac{1}{1 + e^{-x}} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Its derivative is:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt; \sigma&amp;#039;(x) = \sigma(x)(1 - \sigma(x)) &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The maximum derivative value is 0.25, reached at x = 0. During backpropagation, the chain rule multiplies one such derivative per layer, so through many layers we keep multiplying by small numbers:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt; 0.25 \times 0.25 \times \dots \times 0.25 = \text{Very small number} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This means early layers get almost no gradient signal, so they learn very slowly or not at all.&lt;br /&gt;
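&lt;br /&gt;
To put rough numbers on this, consider only the activation derivatives and ignore the weight factors that also appear in the chain rule (a simplifying assumption). With &amp;lt;math&amp;gt;n&amp;lt;/math&amp;gt; Sigmoid layers and pre-activations &amp;lt;math&amp;gt;z_k&amp;lt;/math&amp;gt; (both symbols are introduced here for illustration), the product of derivatives is bounded by:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt; \prod_{k=1}^{n} \sigma&amp;#039;(z_k) \le 0.25^{n} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt; 0.25^{10} \approx 9.5 \times 10^{-7}, \qquad 0.25^{20} \approx 9.1 \times 10^{-13} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
So even in the best case for Sigmoid, ten layers can scale the gradient down by about a factor of a million.&lt;br /&gt;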
&lt;br /&gt;
=== 🔍 Effects of Vanishing Gradients ===&lt;br /&gt;
&lt;br /&gt;
* 🧠 Early layers learn very slowly or not at all&lt;br /&gt;
* 📉 Training becomes inefficient or completely fails&lt;br /&gt;
* 🚫 Model accuracy suffers, especially in deep networks&lt;br /&gt;
&lt;br /&gt;
=== 🔧 Solutions ===&lt;br /&gt;
&lt;br /&gt;
Several techniques help reduce or fix the vanishing gradient problem:&lt;br /&gt;
&lt;br /&gt;
==== 1. Use ReLU Activation ====&lt;br /&gt;
&lt;br /&gt;
ReLU (Rectified Linear Unit) does not squash positive inputs into a bounded range:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt; f(x) = \max(0, x) &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Its derivative is 1 for every positive input, so the activation does not shrink the gradient there. For negative inputs the derivative is 0.&lt;br /&gt;
&lt;br /&gt;
==== 2. Batch Normalization ====&lt;br /&gt;
&lt;br /&gt;
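Batch normalization (BatchNorm) normalizes the inputs to each layer over the mini-batch. This tends to keep pre-activation values away from the saturated regions of the activation function, which helps maintain healthy gradient flow.&lt;br /&gt;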
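&lt;br /&gt;
As a sketch of the standard formulation, BatchNorm transforms a value &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; using the mini-batch mean &amp;lt;math&amp;gt;\mu_B&amp;lt;/math&amp;gt; and variance &amp;lt;math&amp;gt;\sigma_B^2&amp;lt;/math&amp;gt;, a small constant &amp;lt;math&amp;gt;\epsilon&amp;lt;/math&amp;gt; for numerical stability, and learned parameters &amp;lt;math&amp;gt;\gamma&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;\beta&amp;lt;/math&amp;gt;. All of these symbols are introduced here for illustration, and &amp;lt;math&amp;gt;\sigma_B^2&amp;lt;/math&amp;gt; denotes a variance, not the Sigmoid function:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt; \hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y = \gamma \hat{x} + \beta &amp;lt;/math&amp;gt;&lt;br /&gt;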
&lt;br /&gt;
==== 3. Residual Connections (ResNet) ====&lt;br /&gt;
&lt;br /&gt;
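ResNet uses skip connections that add a block&amp;#039;s input directly to its output. Gradients can flow through this shortcut and bypass the layers inside the block, so early layers still receive a usable signal even in very deep networks.&lt;br /&gt;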
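&lt;br /&gt;
As a simplified sketch for a single residual block with a scalar input &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;F&amp;lt;/math&amp;gt; stands for the layers inside the block (both symbols are introduced here for illustration), the block output and the gradient reaching its input are:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt; y = x + F(x) &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt; \frac{\partial \text{Loss}}{\partial x} = \frac{\partial \text{Loss}}{\partial y} \left( 1 + F&amp;#039;(x) \right) &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The constant term 1 comes from the skip connection, so part of the gradient passes through unchanged even when &amp;lt;math&amp;gt;F&amp;#039;(x)&amp;lt;/math&amp;gt; is close to zero.&lt;br /&gt;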
&lt;br /&gt;
==== 4. Proper Weight Initialization ====&lt;br /&gt;
&lt;br /&gt;
Techniques like Xavier (Glorot) or He initialization scale the variance of the initial weights to the size of each layer, which reduces the chance of gradients shrinking or exploding.&lt;br /&gt;
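&lt;br /&gt;
In their commonly cited forms, both schemes draw weights with zero mean and a variance set by the layer size, where &amp;lt;math&amp;gt;n_{\text{in}}&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;n_{\text{out}}&amp;lt;/math&amp;gt; denote the number of inputs and outputs of the layer (notation introduced here for illustration):&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt; \text{Xavier: } \operatorname{Var}(W) = \frac{2}{n_{\text{in}} + n_{\text{out}}}, \qquad \text{He: } \operatorname{Var}(W) = \frac{2}{n_{\text{in}}} &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Xavier initialization is usually paired with Sigmoid or Tanh, and He initialization with ReLU.&lt;br /&gt;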
&lt;br /&gt;
=== 📚 Summary ===&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Problem&lt;br /&gt;
! Cause&lt;br /&gt;
! Result&lt;br /&gt;
! Solution&lt;br /&gt;
|-&lt;br /&gt;
| Vanishing Gradient&lt;br /&gt;
| Repeated multiplication of small derivatives (e.g., from Sigmoid/Tanh) across many layers&lt;br /&gt;
| Early layers stop learning&lt;br /&gt;
| Use ReLU, BatchNorm, ResNet, proper initialization&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The vanishing gradient problem was a major obstacle in training deep neural networks. With modern techniques like ResNet and ReLU, it is now more manageable.&lt;br /&gt;
&lt;br /&gt;
=== 📎 See Also ===&lt;br /&gt;
* [[Exploding Gradient Problem]]&lt;br /&gt;
* [[Backpropagation]]&lt;br /&gt;
* [[Activation Functions]]&lt;br /&gt;
* [[ReLU]]&lt;br /&gt;
* [[ResNet]]&lt;/div&gt;</summary>
		<author><name>Thakshashila</name></author>
	</entry>
</feed>