ML | Optimizer | SGD, Momentum, AdaGrad, RMSProp, Adam Optimizer

Optimizer

In machine learning, an Optimizer is the "engine" that minimizes the loss function. It updates the weights and biases of a model based on the gradients calculated during backpropagation.

The evolution from basic Stochastic Gradient Descent (SGD) to Adam represents a journey of solving specific mathematical hurdles like "pathological curvature" (oscillations) and "vanishing learning rates".

Stochastic Gradient Descent (SGD)

Standard Gradient Descent computes the gradient for the entire dataset before making one update. SGD instead updates parameters using only one sample (or a small "mini-batch") at a time.

\[\theta_{t+1} = \theta_t - \eta \cdot \nabla_{\theta} J(\theta_t; x^{(i)}, y^{(i)})\]

where \(\eta\) is the learning rate and \(J\) is the loss.

  • Intuition: It is like a drunk person walking down a hill; they take frequent, noisy steps. This noise actually helps the model "jump" out of shallow local minima.
  • Problem: In "ravines" (where the surface curves much more steeply in one dimension than another), SGD oscillates wildly across the slopes, making very slow progress toward the actual minimum.
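
The update rule above maps directly to code. Below is a minimal NumPy sketch of a single SGD step; the `grad_fn(theta, x, y)` callable (returning \(\nabla_\theta J\) for one sample) and the learning rate of 0.01 are illustrative assumptions, not something defined in the original post.

```python
import numpy as np

def sgd_step(theta, grad_fn, x, y, lr=0.01):
    """One SGD update: theta <- theta - lr * gradient of the loss on one sample."""
    grad = grad_fn(theta, x, y)   # noisy gradient estimate from a single (x, y) pair
    return theta - lr * grad

# Usage: shuffle the dataset, then apply sgd_step repeatedly over samples or mini-batches.
```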

Momentum

Momentum was designed to solve the "oscillation" problem of SGD. It adds a fraction of the previous update vector to the current one:

\[v_{t+1} = \mu v_t + \eta \cdot \nabla_{\theta} J(\theta_t)\]

\[\theta_{t+1} = \theta_t - v_{t+1}\]

where \(\mu\) (usually 0.9) is the "momentum" coefficient, which acts like friction.

  • Intuition: Imagine a heavy ball rolling down a hill. It gains speed in the direction of the consistent downward slope and ignores minor bumps or side-to-side noise.
  • Why use it? It speeds up convergence significantly and "dampens" the zig-zagging seen in plain SGD.
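
A minimal sketch of the two-line update above, carrying the velocity `v` from step to step; `grad_fn` and the hyperparameter defaults are illustrative assumptions.

```python
import numpy as np

def momentum_step(theta, v, grad_fn, lr=0.01, mu=0.9):
    """Momentum update: accumulate a velocity, then move the parameters against it."""
    grad = grad_fn(theta)          # gradient of the loss at the current parameters
    v = mu * v + lr * grad         # fraction of the previous update plus the current gradient
    theta = theta - v
    return theta, v

# Usage: initialize v = np.zeros_like(theta) and carry (theta, v) between steps.
```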

AdaGrad

AdaGrad's name comes from Adaptive Gradient. Intuitively, it adapts the learning rate for each feature based on the estimated geometry of the problem; in particular, it tends to assign higher learning rates to infrequent features, so that parameter updates rely less on frequency and more on relevance.

AdaGrad was introduced by Duchi et al. in a highly cited paper published in the Journal of Machine Learning Research in 2011. It is arguably one of the most popular algorithms for machine learning (particularly for training deep neural networks) and it influenced the development of the Adam algorithm.
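
Concretely, AdaGrad keeps a per-parameter running sum of squared gradients \(G_t\) and divides each step by \(\sqrt{G_t} + \epsilon\), so rarely updated parameters keep a larger effective learning rate. A minimal sketch, with `grad_fn` and the default values as illustrative assumptions:

```python
import numpy as np

def adagrad_step(theta, G, grad_fn, lr=0.01, eps=1e-8):
    """AdaGrad update: scale each step by the root of the accumulated squared gradients."""
    grad = grad_fn(theta)
    G = G + grad ** 2                               # per-parameter accumulator (only ever grows)
    theta = theta - lr * grad / (np.sqrt(G) + eps)
    return theta, G

# Because G grows monotonically, the effective learning rate keeps shrinking over time;
# this is the "vanishing learning rate" issue that RMSProp and Adam later address.
```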

RMSProp

RMSProp (Root Mean Square Propagation) is an adaptive learning rate optimization algorithm for training deep neural networks. It adjusts the learning rate for each parameter **individually** by dividing each step by a decaying average of that parameter's squared gradients, \(E[g^2]_t = \rho E[g^2]_{t-1} + (1-\rho) g_t^2\). This keeps updates well scaled even when gradient magnitudes vary wildly, making training faster and more stable in complex models, and, unlike AdaGrad's ever-growing sum, the decaying average keeps the effective learning rate from shrinking toward zero.
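
A minimal sketch of that decaying average; `grad_fn` and the defaults (decay 0.9, learning rate 0.001) follow common practice and are illustrative assumptions rather than values from the original post.

```python
import numpy as np

def rmsprop_step(theta, avg_sq, grad_fn, lr=0.001, rho=0.9, eps=1e-8):
    """RMSProp update: normalize each step by a decaying average of squared gradients."""
    grad = grad_fn(theta)
    avg_sq = rho * avg_sq + (1 - rho) * grad ** 2        # E[g^2]_t, per parameter
    theta = theta - lr * grad / (np.sqrt(avg_sq) + eps)
    return theta, avg_sq

# Unlike AdaGrad's accumulator, avg_sq can shrink again when recent gradients are small,
# so the effective learning rate does not decay to zero over long runs.
```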

Adam

Adam (Adaptive Moment Estimation) is currently the "gold standard." It essentially combines Momentum (storing the first moment: the mean of the gradients) and RMSProp (storing the second moment: the uncentered variance of the gradients); the full update is sketched in code after the list below.

  • Update Mean (\(m_t\)): \(m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t\) (Like Momentum)
  • Update Variance (\(v_t\)): \(v_t = \beta_2 v_{t-1} + (1-\beta_2)g_t^2\) (Like RMSProp)
  • Bias Correction: \(\hat{m}_t = m_t / (1 - \beta_1^t)\) and \(\hat{v}_t = v_t / (1 - \beta_2^t)\), which compensates for \(m_t\) and \(v_t\) being initialized at zero.
  • Update: \(\theta_{t+1} = \theta_t - \frac{\eta \cdot \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}\)
  • Why use it? It is computationally efficient, requires little memory, and usually works well with "default" hyperparameters (\(\beta_1 = 0.9, \beta_2 = 0.999\)).
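
A minimal sketch stringing the four steps above together; `grad_fn` and the step-counter convention (`t` starting at 1) are illustrative assumptions, while the defaults match the standard Adam hyperparameters.

```python
import numpy as np

def adam_step(theta, m, v, t, grad_fn, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: Momentum-style mean + RMSProp-style variance + bias correction."""
    grad = grad_fn(theta)
    m = beta1 * m + (1 - beta1) * grad            # first moment: mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2       # second moment: uncentered variance
    m_hat = m / (1 - beta1 ** t)                  # bias correction for zero initialization
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Usage: initialize m = v = np.zeros_like(theta) and call with t = 1, 2, 3, ... each step.
```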

Second-order optimization

