ML | Optimizer | SGD, Momentum, AdaGrad, RMSProp, Adam Optimizer

Optimizer

In machine learning, an Optimizer is the "engine" that minimizes the loss function. It updates the weights and biases of a model based on the gradients calculated during backpropagation.

The evolution from basic Stochastic Gradient Descent (SGD) to Adam represents a journey of solving specific mathematical hurdles like "pathological curvature" (oscillations) and "vanishing learning rates".

Stochastic Gradient Descent (SGD)

Standard Gradient Descent computes the gradient for the entire dataset before making one update. SGD instead updates parameters using only one sample (or a small "mini-batch") at a time.

\[\theta_{t+1} = \theta_t - \eta \cdot \nabla_{\theta} J(\theta_t; x^{(i)}, y^{(i)})\]

where \(\eta\) is the learning rate and \(J\) is the loss.

  • Intuition: It is like a drunk person walking down a hill; they take frequent, noisy steps. This noise actually helps the model "jump" out of shallow local minima.
  • Problem: In "ravines" (where the surface curves much more steeply in one dimension than another), SGD oscillates wildly across the slopes, making very slow progress toward the actual minimum.
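
The update rule above maps directly to code. Below is a minimal NumPy sketch of a single SGD step; the `grad_fn(theta, x, y)` callable (returning \(\nabla_\theta J\) for one sample) and the learning rate of 0.01 are illustrative assumptions, not something defined in the original post.

```python
import numpy as np

def sgd_step(theta, grad_fn, x, y, lr=0.01):
    """One SGD update: theta <- theta - lr * gradient of the loss on one sample."""
    grad = grad_fn(theta, x, y)   # noisy gradient estimate from a single (x, y) pair
    return theta - lr * grad

# Usage: shuffle the dataset, then apply sgd_step repeatedly over samples or mini-batches.
```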

Momentum

Momentum was designed to solve the "oscillation" problem of SGD. It adds a fraction of the previous update vector to the current one:

\[v_{t+1} = \mu v_t + \eta \cdot \nabla_{\theta} J(\theta_t)\]

\[\theta_{t+1} = \theta_t - v_{t+1}\]

where \(\mu\) (usually 0.9) is the "momentum" coefficient, which acts like friction.

  • Intuition: Imagine a heavy ball rolling down a hill. It gains speed in the direction of the consistent downward slope and ignores minor bumps or side-to-side noise.
  • Why use it? It speeds up convergence significantly and "dampens" the zig-zagging seen in plain SGD.
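
A minimal sketch of the two-line update above, carrying the velocity `v` from step to step; `grad_fn` and the hyperparameter defaults are illustrative assumptions.

```python
import numpy as np

def momentum_step(theta, v, grad_fn, lr=0.01, mu=0.9):
    """Momentum update: accumulate a velocity, then move the parameters against it."""
    grad = grad_fn(theta)          # gradient of the loss at the current parameters
    v = mu * v + lr * grad         # fraction of the previous update plus the current gradient
    theta = theta - v
    return theta, v

# Usage: initialize v = np.zeros_like(theta) and carry (theta, v) between steps.
```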

AdaGrad

AdaGrad's name comes from Adaptive Gradient. Intuitively, it adapts the learning rate for each feature based on the estimated geometry of the problem; in particular, it tends to assign higher learning rates to infrequent features, so that parameter updates rely less on frequency and more on relevance.

AdaGrad was introduced by Duchi et al. in a highly cited paper published in the Journal of Machine Learning Research in 2011. It is arguably one of the most popular algorithms for machine learning (particularly for training deep neural networks) and it influenced the development of the Adam algorithm.
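
Concretely, AdaGrad keeps a per-parameter running sum of squared gradients \(G_t\) and divides each step by \(\sqrt{G_t} + \epsilon\), so rarely updated parameters keep a larger effective learning rate. A minimal sketch, with `grad_fn` and the default values as illustrative assumptions:

```python
import numpy as np

def adagrad_step(theta, G, grad_fn, lr=0.01, eps=1e-8):
    """AdaGrad update: scale each step by the root of the accumulated squared gradients."""
    grad = grad_fn(theta)
    G = G + grad ** 2                               # per-parameter accumulator (only ever grows)
    theta = theta - lr * grad / (np.sqrt(G) + eps)
    return theta, G

# Because G grows monotonically, the effective learning rate keeps shrinking over time;
# this is the "vanishing learning rate" issue that RMSProp and Adam later address.
```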

RMSProp

RMSProp (Root Mean Square Propagation) is an adaptive learning rate optimization algorithm for training deep neural networks. It adjusts the learning rate for each parameter **individually** by dividing each step by a decaying average of that parameter's squared gradients, \(E[g^2]_t = \rho E[g^2]_{t-1} + (1-\rho) g_t^2\). This keeps updates well scaled even when gradient magnitudes vary wildly, making training faster and more stable in complex models, and, unlike AdaGrad's ever-growing sum, the decaying average keeps the effective learning rate from shrinking toward zero.
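
A minimal sketch of that decaying average; `grad_fn` and the defaults (decay 0.9, learning rate 0.001) follow common practice and are illustrative assumptions rather than values from the original post.

```python
import numpy as np

def rmsprop_step(theta, avg_sq, grad_fn, lr=0.001, rho=0.9, eps=1e-8):
    """RMSProp update: normalize each step by a decaying average of squared gradients."""
    grad = grad_fn(theta)
    avg_sq = rho * avg_sq + (1 - rho) * grad ** 2        # E[g^2]_t, per parameter
    theta = theta - lr * grad / (np.sqrt(avg_sq) + eps)
    return theta, avg_sq

# Unlike AdaGrad's accumulator, avg_sq can shrink again when recent gradients are small,
# so the effective learning rate does not decay to zero over long runs.
```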

Adam

Adam (Adaptive Moment Estimation) is currently the "gold standard." It essentially combines Momentum (storing the first moment: the mean of the gradients) and RMSProp (storing the second moment: the uncentered variance of the gradients); the full update is sketched in code after the list below.

  • Update Mean (\(m_t\)): \(m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t\) (Like Momentum)
  • Update Variance (\(v_t\)): \(v_t = \beta_2 v_{t-1} + (1-\beta_2)g_t^2\) (Like RMSProp)
  • Bias Correction: \(\hat{m}_t = m_t / (1 - \beta_1^t)\) and \(\hat{v}_t = v_t / (1 - \beta_2^t)\), which compensates for \(m_t\) and \(v_t\) being initialized at zero.
  • Update: \(\theta_{t+1} = \theta_t - \frac{\eta \cdot \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}\)
  • Why use it? It is computationally efficient, requires little memory, and usually works well with "default" hyperparameters (\(\beta_1 = 0.9, \beta_2 = 0.999\)).
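
A minimal sketch stringing the four steps above together; `grad_fn` and the step-counter convention (`t` starting at 1) are illustrative assumptions, while the defaults match the standard Adam hyperparameters.

```python
import numpy as np

def adam_step(theta, m, v, t, grad_fn, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: Momentum-style mean + RMSProp-style variance + bias correction."""
    grad = grad_fn(theta)
    m = beta1 * m + (1 - beta1) * grad            # first moment: mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2       # second moment: uncentered variance
    m_hat = m / (1 - beta1 ** t)                  # bias correction for zero initialization
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Usage: initialize m = v = np.zeros_like(theta) and call with t = 1, 2, 3, ... each step.
```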

Second-order optimization

