Thakshashila: Created page with "= Gradient Descent = '''Gradient Descent''' is an optimization algorithm used in machine learning and deep learning to minimize the cost (loss) function by iteratively updating model parameters in the direction of steepest descent, i.e., the negative gradient. == What is Gradient Descent? == Gradient Descent helps find the best-fit parameters (like weights in a neural network or coefficients in regression) that minimize the error between predicted and actual values. I..."

2025-06-10T06:35:26Z

= Gradient Descent =

'''Gradient Descent''' is an iterative optimization algorithm used in machine learning and deep learning to minimize a cost (loss) function. It repeatedly updates the model parameters in the direction of steepest descent, i.e., along the negative gradient.

== What is Gradient Descent? ==

Gradient Descent helps find the best-fit parameters (like weights in a neural network or coefficients in regression) that minimize the error between predicted and actual values. Starting from an initial guess, it adjusts the parameters in small steps, each one chosen to reduce the loss.

== The Basic Formula ==

:<math>
\theta := \theta - \alpha \cdot \frac{\partial J(\theta)}{\partial \theta}
</math>

Where:
* <math>\theta</math> = model parameters (weights)
* <math>\alpha</math> = learning rate (step size)
* <math>J(\theta)</math> = cost/loss function
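* <math>\frac{\partial J(\theta)}{\partial \theta}</math> = gradient (slope) of the loss with respect to the parameters; when <math>\theta</math> holds several parameters, this is the vector of partial derivatives, one per parameter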
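
The update rule above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the cost <math>J(\theta) = (\theta - 3)^2</math>, the starting point, the learning rate, and the names <code>gradient_descent</code> and <code>grad_J</code> are all assumptions chosen for the example.

```python
def gradient_descent(grad, theta, alpha, steps):
    """Repeatedly apply theta := theta - alpha * grad(theta)."""
    for _ in range(steps):
        theta = theta - alpha * grad(theta)
    return theta


# Hypothetical cost J(theta) = (theta - 3)^2; its derivative is 2 * (theta - 3).
def grad_J(theta):
    return 2.0 * (theta - 3.0)


theta_min = gradient_descent(grad_J, theta=0.0, alpha=0.1, steps=100)
print(theta_min)  # approaches 3.0, the minimizer of this particular J
```

With this cost, each step shrinks the distance to the minimum by a constant factor of <math>1 - 2\alpha</math>, which is why the iterate settles near 3.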

== Types of Gradient Descent ==

=== 1. Batch Gradient Descent ===

* Computes the gradient over the entire training dataset for every update.
* Updates are stable, but each one requires a full pass over the data, which is slow on large datasets.

=== 2. Stochastic Gradient Descent (SGD) ===

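* Updates the weights after each individual training example, using the gradient of the loss on that example alone.
* Each update is much cheaper than a full-batch update, but the gradient estimates are noisy, so the loss can fluctuate from step to step.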

=== 3. Mini-Batch Gradient Descent ===

* Uses a small subset (mini-batch) of the training data to compute each update.
* Balances the stability of batch gradient descent against the cheap, frequent updates of SGD.
* Commonly used in deep learning.
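
The three variants differ only in how many examples feed each gradient estimate, so one loop with a batch-size setting can express all of them. The sketch below assumes a one-parameter model <math>\hat{y} = w x</math> with a mean squared error loss and hypothetical noise-free data; the names <code>grad_mse</code> and <code>fit</code> and all hyperparameter values are illustrative choices.

```python
import random


def grad_mse(w, batch):
    """Gradient of the mean squared error for y_hat = w * x on one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def fit(data, batch_size, alpha=0.01, epochs=200, seed=0):
    """Gradient descent where batch_size selects the variant.

    batch_size == len(data): batch gradient descent
    batch_size == 1:         stochastic gradient descent
    anything in between:     mini-batch gradient descent
    """
    rng = random.Random(seed)
    data = list(data)  # copy so the caller's list is not reordered
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # visit the examples in a new order each epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            w -= alpha * grad_mse(w, batch)
    return w


# Hypothetical noise-free data generated from y = 2x.
data = [(x, 2.0 * x) for x in [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]]

w_batch = fit(data, batch_size=len(data))  # one update per epoch
w_sgd = fit(data, batch_size=1)            # eight updates per epoch
w_mini = fit(data, batch_size=4)           # two updates per epoch
print(w_batch, w_sgd, w_mini)
```

Because this toy data has no noise, all three runs settle on the same weight. On real data the stochastic and mini-batch runs would typically fluctuate around the minimum rather than land on it exactly.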

== Learning Rate (α) ==

The learning rate controls how big the step is during each update.
* If <math>\alpha</math> is too small: slow convergence.
* If <math>\alpha</math> is too large: updates may overshoot the minimum, causing the loss to oscillate or diverge.
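
Both failure modes can be seen on a toy problem. The sketch below assumes the cost <math>J(\theta) = \theta^2</math>, whose minimum is at 0; the function name <code>run</code> and the three learning rates are hypothetical values picked to show each regime.

```python
def run(alpha, steps=50, theta=1.0):
    """Run gradient descent on J(theta) = theta^2 and return the final theta."""
    for _ in range(steps):
        theta -= alpha * 2.0 * theta  # dJ/dtheta = 2 * theta
    return theta


slow = run(alpha=0.001)    # tiny steps: still far from the minimum at 0
good = run(alpha=0.1)      # steady convergence toward 0
diverged = run(alpha=1.1)  # every step overshoots, so |theta| keeps growing
print(slow, good, diverged)
```

For this cost each update multiplies <math>\theta</math> by <math>1 - 2\alpha</math>, so the iterates shrink only when <math>0 < \alpha < 1</math>. The safe range depends on the cost function and is rarely known in advance for real models.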

== Example ==

Suppose we are minimizing the Mean Squared Error (MSE) in linear regression. At each step, gradient descent nudges the weights (the slope and intercept of the line) in the direction that lowers the MSE, so the predicted line fits the data points better over time.
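
A minimal sketch of this example follows, assuming the model <math>\hat{y} = w x + b</math>, batch updates, and a small hypothetical dataset lying exactly on the line <math>y = 2x + 1</math>. The function name <code>fit_line</code> and the hyperparameter values are illustrative.

```python
def fit_line(xs, ys, alpha=0.05, steps=2000):
    """Fit y_hat = w * x + b by batch gradient descent on the MSE."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Partial derivatives of MSE = (1/n) * sum(error^2) w.r.t. w and b.
        grad_w = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2.0 / n) * sum(errors)
        w -= alpha * grad_w
        b -= alpha * grad_b
    return w, b


# Hypothetical data lying exactly on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = fit_line(xs, ys)
print(w, b)  # approaches slope 2 and intercept 1 for this data
```

Both parameters are updated from gradients computed at the same point, which matches the simultaneous update implied by the basic formula.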

== Visualization ==

Imagine a ball rolling down a curved surface to reach the lowest point (minimum). Gradient descent mimics this: at each step it calculates the slope at the current position and moves a short distance downhill.

== Applications of Gradient Descent ==

* Training machine learning models (e.g., linear/logistic regression)
* Optimizing deep learning models (e.g., neural networks)
* Training models for natural language processing (NLP), computer vision, recommendation systems, etc.

== Related Concepts ==

* [[Learning Rate]]
* [[Loss Function]]
* [[Optimization Algorithms]]
* [[Backpropagation]]
* [[Stochastic Gradient Descent]]
* [[Neural Networks]]
