# Optimizers For Neural Networks

Various Gradient Descent optimizers. (Figure source: hackernoon.com)

**Gradient descent** is a first-order iterative optimization algorithm for finding the minimum of a function. To find a local minimum of a function using **gradient descent**, one takes steps proportional to the negative of the **gradient** (or approximate **gradient**) of the function at the current point. (Gradient descent - Wikipedia: https://en.wikipedia.org/wiki/Gradient_descent)

- **(Standard/Full-Batch) Gradient Descent**: Compute the gradient w.r.t. ALL training examples, then make a single update with that gradient.
- **Mini-Batch Gradient Descent / SGD with Mini-Batch**: Compute a gradient w.r.t. EVERY batch of so-many training examples, drawn without replacement. Update the parameters as soon as each gradient is computed.
- **Stochastic Gradient Descent (SGD)**: Compute a gradient w.r.t. EACH training example. Update the parameters as soon as each gradient is computed. (All three are sketched in code below.)
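
A minimal NumPy sketch of the three schedules above, just to make the batching difference concrete. The `grad_fn(w, X, y)` interface (a function returning the gradient for the given examples), the array inputs, and all hyper-parameter values are assumptions for illustration, not part of the original notes.

```python
import numpy as np

def full_batch_gd(w, X, y, grad_fn, lr=0.01, epochs=100):
    """Full-batch GD: one update per epoch, using the gradient over ALL examples."""
    for _ in range(epochs):
        w = w - lr * grad_fn(w, X, y)               # gradient over the whole training set
    return w

def minibatch_sgd(w, X, y, grad_fn, lr=0.01, batch_size=32, epochs=100):
    """Mini-batch SGD: update after every batch, drawn without replacement each epoch."""
    n = len(X)
    for _ in range(epochs):
        order = np.random.permutation(n)            # shuffle, then walk through without replacement
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            w = w - lr * grad_fn(w, X[batch], y[batch])
    return w

def sgd(w, X, y, grad_fn, lr=0.01, epochs=100):
    """Plain SGD: update after EACH individual example (mini-batch of size 1)."""
    return minibatch_sgd(w, X, y, grad_fn, lr=lr, batch_size=1, epochs=epochs)
```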

- **SGD w/ Momentum**: Update the vector with the gradient, then add a fraction of the previous update vector (the "momentum") to the current one. (Essentially a decaying average of past updates.)
- **Nesterov Accelerated Gradient**: Apply the momentum step first, THEN compute and apply the gradient update. (Both variants are sketched below.)
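
A sketch of the two momentum variants, assuming a `grad_fn(w)` that returns the gradient at `w` (a NumPy array); the defaults for `mu` and `lr` are illustrative only.

```python
import numpy as np

def sgd_momentum(w, grad_fn, lr=0.01, mu=0.9, steps=1000):
    """Classical momentum: new update = fraction of previous update + current gradient step."""
    v = np.zeros_like(w)
    for _ in range(steps):
        v = mu * v + lr * grad_fn(w)        # decaying accumulation of past steps
        w = w - v
    return w

def nesterov_momentum(w, grad_fn, lr=0.01, mu=0.9, steps=1000):
    """Nesterov: apply the momentum step first, then take the gradient at the look-ahead point."""
    v = np.zeros_like(w)
    for _ in range(steps):
        lookahead = w - mu * v              # "momentum-then-gradient"
        v = mu * v + lr * grad_fn(lookahead)
        w = w - v
    return w
```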

- **Averaged SGD**: Keep track of all values the parameter vector takes during training. After training, replace the vector with the average of these historical values. (Sketched below.)
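
A sketch of Averaged SGD under the same assumed `grad_fn(w)` interface; here the history is kept as a running sum rather than storing every iterate.

```python
import numpy as np

def averaged_sgd(w, grad_fn, lr=0.01, steps=1000):
    """ASGD: run plain SGD, but return the average of all parameter values visited."""
    w_sum = np.zeros_like(w)
    for _ in range(steps):
        w = w - lr * grad_fn(w)
        w_sum += w                          # accumulate the historical values of the vector
    return w_sum / steps                    # after training, replace w with their average
```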

- “
**LRV - Learning Rate as Vectors**”: Replaces the General Learning Rate with a vector. (i.e. Each parameter has its own learning rate now.) **Adaptive Learning Rates**: (with LRV) At each step, for each LR in the LR Vector, update it to be the General LR divided by something.**AdaGrad (Adaptive Gradient**): Divided by the root of the sum of the square of all previous gradients’ values.**RMSProp**: Divided by the root of a moving average of the square of a fixed window (of size*t*) of previous gradients’ values. http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf*Idea 1*: Similar to RMSProp, but using a decaying average instead (shown as root mean squared (RMS) error criterion of the gradient).*Idea 2*(THE “**AdaDelta**”): Idea 1 + replace the GLR with also a decaying average of the last*t-1*deltas in parameters’ historical values, for dimension (“unit”) correction.
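
A sketch of three of the adaptive-learning-rate schemes above (AdaGrad, RMSProp in its decaying-average form as in Idea 1, and AdaDelta). The `grad_fn(w)` interface, `eps`, and the decay rates are illustrative assumptions, not from the original notes.

```python
import numpy as np

def adagrad(w, grad_fn, lr=0.01, eps=1e-8, steps=1000):
    """AdaGrad: per-parameter LR = general LR / root of the sum of ALL past squared gradients."""
    g_sq_sum = np.zeros_like(w)
    for _ in range(steps):
        g = grad_fn(w)
        g_sq_sum += g ** 2
        w = w - lr * g / (np.sqrt(g_sq_sum) + eps)
    return w

def rmsprop(w, grad_fn, lr=0.001, rho=0.9, eps=1e-8, steps=1000):
    """RMSProp: per-parameter LR = general LR / root of a decaying average of squared gradients."""
    g_sq_avg = np.zeros_like(w)
    for _ in range(steps):
        g = grad_fn(w)
        g_sq_avg = rho * g_sq_avg + (1 - rho) * g ** 2
        w = w - lr * g / (np.sqrt(g_sq_avg) + eps)
    return w

def adadelta(w, grad_fn, rho=0.95, eps=1e-6, steps=1000):
    """AdaDelta: no general LR; scale by RMS of past parameter deltas over RMS of past gradients."""
    g_sq_avg = np.zeros_like(w)
    dw_sq_avg = np.zeros_like(w)
    for _ in range(steps):
        g = grad_fn(w)
        g_sq_avg = rho * g_sq_avg + (1 - rho) * g ** 2
        dw = -np.sqrt(dw_sq_avg + eps) / np.sqrt(g_sq_avg + eps) * g   # unit-corrected step
        dw_sq_avg = rho * dw_sq_avg + (1 - rho) * dw ** 2
        w = w + dw
    return w
```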

- **AdaM (Adaptive Moment Estimation)**: RMSProp + replacing the gradient with a decaying average of the gradient itself ("momentum"), for an AdaGrad-like enhancement. (A \hat indicates bias correction.) (Sketched in code below.)
- **AdaMax**: AdaM + replacing the denominator (i.e. the root of the moving average of squared gradients) with the higher of the (biased) last moving average and the magnitude of the current gradient, for better stability.
- **Nadam (Nesterov-accelerated Adaptive Moment Estimation)**: Like AdaM, but the momentum works in a Nesterov ("momentum-then-gradient") manner.
- **AMSGrad**: Like AdaM, but uses the maximum of the previous values instead of the (exponentially) decaying average, for better convergence.
- **AdamW**: AdaM with weight decay decoupled from the gradient-based update.
- **AdamWR**: AdamW with warm restarts.
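
A sketch of AdaM with the AMSGrad variant of its denominator (AdaMax, Nadam, AdamW, and AdamWR are omitted for brevity). The `grad_fn(w)` interface and the default hyper-parameters are illustrative assumptions.

```python
import numpy as np

def adam(w, grad_fn, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000, amsgrad=False):
    """AdaM: decaying average of gradients (m, the "momentum") and of squared gradients (v,
    RMSProp-like), both bias-corrected; with amsgrad=True, v is replaced by the max of its past values."""
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    v_max = np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g            # 1st moment: decaying average of gradients
        v = beta2 * v + (1 - beta2) * g ** 2       # 2nd moment: decaying average of squared gradients
        m_hat = m / (1 - beta1 ** t)               # \hat: bias correction
        if amsgrad:
            v_max = np.maximum(v_max, v)           # AMSGrad: non-decreasing denominator
            denom = np.sqrt(v_max) + eps
        else:
            v_hat = v / (1 - beta2 ** t)
            denom = np.sqrt(v_hat) + eps
        w = w - lr * m_hat / denom
    return w
```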