Optimizers For Neural Networks
Various Gradient Descent optimizers.

Gradient descent is a first-order iterative optimization algorithm for finding the minimum of a function. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or approximate gradient) of the function at the current point.
Source: Gradient descent - Wikipedia, https://en.wikipedia.org/wiki/Gradient_descent
- (Standard/Full-Batch) Gradient Descent: Compute the gradient w.r.t. ALL training examples, then perform a single update with that one gradient. (Minimal NumPy sketches of these update rules follow the list.)
- Mini-Batch Gradient Descent / SGD with Mini-Batches: Compute a gradient w.r.t. each mini-batch of training examples, drawn without replacement. Update the parameters as soon as each mini-batch gradient is computed.
- Stochastic Gradient Descent (SGD): Compute a gradient w.r.t. EACH training example. Update parameters as soon as each gradient is computed.
- SGD w/ Momentum: Keep a velocity vector: the current gradient step plus a fraction of the previous velocity (the “momentum”), essentially a decaying average of past gradient steps. Update the parameters with this velocity. (Sketch below.)
- Nesterov Accelerated Gradient (NAG): Apply the momentum step first, THEN compute the gradient at that look-ahead point and update with it.
- Averaged SGD: Run plain SGD, but keep track of the values the parameter vector takes along the way. After training, replace the final vector with the average of these historical values. (Sketch below.)
- “LRV - Learning Rate as Vectors”: Replaces the General Learning Rate with a vector. (i.e. Each parameter has its own learning rate now.)
- Adaptive Learning Rates: (with LRV) At each step, set each LR in the LR Vector to the General LR divided by a per-parameter quantity built from the gradient history; the methods below differ in what that quantity is.
- AdaGrad (Adaptive Gradient): Divided by the square root of the sum of the squares of all previous gradient values. (Sketch below.)
- RMSProp: Divided by the root of a moving (exponentially decaying) average of recent squared gradient values. (Sketch below, together with AdaDelta.) http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
- Idea 1: The same fix as RMSProp, applied to AdaGrad: replace the all-history sum with a decaying average of squared gradients (written as the root mean square (RMS) of the gradient).
- Idea 2 (THE “AdaDelta”): Idea 1 + replace the General LR with a decaying average (RMS) of the previous parameter updates (deltas), which corrects the units (“dimensions”) of the update.
- AdaM (Adaptive Moment Estimation): RMSProp + replacing the raw gradient with a decaying average of past gradients (the “momentum”, or first-moment estimate), combining adaptive learning rates with momentum. (\hat indicates bias-correction.) (Sketch below.)
- AdaMax: AdaM + replacing the denominator (the root of the moving average of squared gradients) with the larger of the decayed previous denominator and the magnitude of the current gradient, for better numerical stability. (Sketch below, together with Nadam and AMSGrad.)
- Nadam (Nesterov-accelerated Adaptive Moment Estimation): Like AdaM, but momentum works in a Nesterov (“momentum-then-gradient”) manner.
- AMSGrad: Like AdaM, but the denominator uses the maximum over all previous second-moment estimates instead of the (exponentially) decaying average, for better convergence behavior.
- AdamW: AdaM with weight decay decoupled from the gradient-based update: the decay is applied directly to the parameters rather than folded into the gradient as L2 regularization. (Sketch below.)
- AdamWR: AdamW + warm restarts, i.e. a learning-rate schedule (cosine annealing) that is periodically restarted.
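
Below are minimal NumPy sketches of the update rules above; they are illustrative, not reference implementations. Throughout, w is assumed to be a float parameter array and grad_fn (or grad_fn(w, X, y)) a stand-in for a function returning the gradient at w; hyperparameter values are just common defaults. First, the three batching variants:

```python
import numpy as np

def full_batch_gd(w, X, y, grad_fn, lr=0.1, epochs=100):
    """Full-batch GD: one update per pass, using the gradient over ALL examples."""
    for _ in range(epochs):
        w = w - lr * grad_fn(w, X, y)
    return w

def minibatch_sgd(w, X, y, grad_fn, lr=0.1, batch_size=32, epochs=100, seed=0):
    """Mini-batch SGD: shuffle, split into batches without replacement,
    update after every batch. batch_size=1 recovers per-example SGD."""
    rng = np.random.default_rng(seed)
    n = len(X)
    for _ in range(epochs):
        idx = rng.permutation(n)
        for start in range(0, n, batch_size):
            b = idx[start:start + batch_size]
            w = w - lr * grad_fn(w, X[b], y[b])
    return w

# Example gradient (illustrative): mean squared error of a linear model y ~ X @ w
grad_mse = lambda w, X, y: 2.0 * X.T @ (X @ w - y) / len(X)
```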
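
Classical momentum and the Nesterov (“momentum-then-gradient”) variant, under the same assumptions:

```python
import numpy as np

def sgd_momentum(w, grad_fn, lr=0.01, mu=0.9, steps=1000):
    """Classical momentum: the velocity is a decaying accumulation of past
    gradient steps; the parameters move along the velocity."""
    v = np.zeros_like(w)
    for _ in range(steps):
        v = mu * v + lr * grad_fn(w)    # keep a fraction mu of the old velocity
        w = w - v
    return w

def nesterov_sgd(w, grad_fn, lr=0.01, mu=0.9, steps=1000):
    """Nesterov: take the momentum step first, then evaluate the gradient
    at that look-ahead point."""
    v = np.zeros_like(w)
    for _ in range(steps):
        lookahead = w - mu * v          # where the momentum alone would land
        v = mu * v + lr * grad_fn(lookahead)
        w = w - v
    return w
```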
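
Averaged SGD can be sketched with a running mean, so the full history never needs to be stored:

```python
import numpy as np

def averaged_sgd(w, grad_fn, lr=0.01, steps=1000):
    """Plain SGD steps, but the returned value is the running average of all
    visited iterates rather than the last one."""
    w_avg = np.array(w, dtype=float)
    for t in range(1, steps + 1):
        w = w - lr * grad_fn(w)
        w_avg += (w - w_avg) / t        # incremental mean of the iterates
    return w_avg
```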
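
AdaGrad, where the per-parameter accumulator G produces the per-parameter effective learning rates:

```python
import numpy as np

def adagrad(w, grad_fn, lr=0.01, eps=1e-8, steps=1000):
    """Each parameter's effective learning rate is the general lr divided by
    the root of the sum of that parameter's past squared gradients."""
    G = np.zeros_like(w)                # one accumulator per parameter
    for _ in range(steps):
        g = grad_fn(w)
        G = G + g ** 2
        w = w - lr * g / (np.sqrt(G) + eps)
    return w
```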
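
RMSProp and AdaDelta; the rho and eps values are commonly quoted defaults, not values prescribed by this note:

```python
import numpy as np

def rmsprop(w, grad_fn, lr=0.001, rho=0.9, eps=1e-8, steps=1000):
    """Divide the gradient by the root of a decaying average of its squares."""
    Eg2 = np.zeros_like(w)
    for _ in range(steps):
        g = grad_fn(w)
        Eg2 = rho * Eg2 + (1 - rho) * g ** 2
        w = w - lr * g / (np.sqrt(Eg2) + eps)
    return w

def adadelta(w, grad_fn, rho=0.95, eps=1e-6, steps=1000):
    """Same denominator idea, but the general learning rate is replaced by the
    RMS of the previous parameter updates (unit correction)."""
    Eg2 = np.zeros_like(w)              # decaying average of squared gradients
    Edx2 = np.zeros_like(w)             # decaying average of squared updates
    for _ in range(steps):
        g = grad_fn(w)
        Eg2 = rho * Eg2 + (1 - rho) * g ** 2
        dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * g
        Edx2 = rho * Edx2 + (1 - rho) * dx ** 2
        w = w + dx
    return w
```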
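
Adam with explicit bias correction (the “hat” values):

```python
import numpy as np

def adam(w, grad_fn, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Decaying averages of the gradient (m, the momentum) and of its square (v),
    both bias-corrected, combined in an RMSProp-style update."""
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)    # correct the bias from zero initialization
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w
```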
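
The three Adam variants; the Nadam update follows one common formulation (sources differ slightly), and AMSGrad is shown without bias correction, as in its original description:

```python
import numpy as np

def adamax(w, grad_fn, lr=0.002, beta1=0.9, beta2=0.999, steps=1000):
    """The denominator becomes max(beta2 * previous denominator, |g|)."""
    m = np.zeros_like(w)
    u = np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        u = np.maximum(beta2 * u, np.abs(g))    # needs no bias correction
        w = w - (lr / (1 - beta1 ** t)) * m / (u + 1e-8)
    return w

def nadam(w, grad_fn, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Adam with the first moment applied in a Nesterov,
    'momentum-then-gradient' manner."""
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        m_nes = beta1 * m_hat + (1 - beta1) * g / (1 - beta1 ** t)  # look-ahead momentum
        w = w - lr * m_nes / (np.sqrt(v_hat) + eps)
    return w

def amsgrad(w, grad_fn, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Use the running MAXIMUM of the second-moment estimate so the effective
    learning rate never grows back."""
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    v_max = np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        v_max = np.maximum(v_max, v)            # keep the largest denominator seen
        w = w - lr * m / (np.sqrt(v_max) + eps)
    return w
```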
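
AdamW with decoupled weight decay; the schedule(t) multiplier is a hypothetical hook where a cosine-annealing-with-restarts schedule would plug in to give AdamWR-style runs:

```python
import numpy as np

def adamw(w, grad_fn, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8,
          weight_decay=0.01, steps=1000, schedule=lambda t: 1.0):
    """Adam step plus weight decay applied directly to the parameters,
    decoupled from the gradient-based update."""
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        eta = schedule(t)               # 1.0 = constant; warm restarts -> AdamWR
        w = w - eta * (lr * m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w)
    return w
```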