
Optimizers For Neural Networks

Various Gradient Descent optimizers.


Gradient Descent: Take steps proportional to the negative of the gradient of the loss function at the current point, i.e. walk downhill on the error surface. (See the sketch below.)

"Gradient descent is a first-order iterative optimization algorithm for finding the minimum of a function. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or approximate gradient) of the function at the current point." (Gradient descent - Wikipedia, https://en.wikipedia.org/wiki/Gradient_descent)
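A minimal sketch of one vanilla gradient-descent step, assuming the gradient of the loss is available as a Python callable (loss_grad and lr are illustrative names, not from any particular library):

```python
import numpy as np

def gradient_descent_step(theta, loss_grad, lr=0.1):
    """One vanilla gradient-descent update: step against the gradient."""
    return theta - lr * loss_grad(theta)

# Example: minimize f(theta) = ||theta||^2, whose gradient is 2 * theta.
theta = np.array([3.0, -2.0])
for _ in range(100):
    theta = gradient_descent_step(theta, lambda t: 2 * t, lr=0.1)
print(theta)  # approaches [0, 0], the minimizer
```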

tricks by varying how many training examples each gradient is computed on

  • (Standard/Full-Batch) Gradient Descent: Compute the gradient over ALL training examples, then perform a single update with that one gradient.
  • Mini-Batch Gradient Descent / SGD with Mini-Batches: Compute a gradient over EVERY batch of B training examples (drawn without replacement), updating the parameters as soon as each batch gradient is computed.
  • Stochastic Gradient Descent (SGD): Compute a gradient for EACH single training example, updating the parameters as soon as each gradient is computed. (See the sketch after this list.)
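A minimal sketch contrasting the three variants, assuming X (features) and y (targets) are NumPy arrays and grad(theta, Xb, yb) returns the gradient of the loss on the given subset of examples; all names here are illustrative:

```python
import numpy as np

def full_batch_epoch(theta, X, y, grad, lr=0.01):
    # One update per epoch, using the gradient over ALL examples.
    return theta - lr * grad(theta, X, y)

def mini_batch_epoch(theta, X, y, grad, lr=0.01, batch_size=32):
    # One update per mini-batch; batches drawn without replacement.
    idx = np.random.permutation(len(X))
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]
        theta = theta - lr * grad(theta, X[b], y[b])
    return theta

def sgd_epoch(theta, X, y, grad, lr=0.01):
    # One update per single example (a mini-batch of size 1).
    return mini_batch_epoch(theta, X, y, grad, lr, batch_size=1)
```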

tricks by reusing previous values of the update or parameter vector

  • SGD w/ Momentum: Keep a velocity vector: at each step, scale the previous velocity by a factor (the “momentum”) and add the current gradient step, then update the parameters with this velocity. (The velocity is essentially a decaying average of past gradients.)
    • Nesterov Accelerated Gradient (NAG): Apply the momentum step first, THEN compute the gradient at that look-ahead point and update with it.
  • Averaged SGD: Run plain SGD, but keep a running average of the parameter values visited; after training, replace the parameters with this average. (See the sketch after this list.)
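A minimal sketch of these three variants as single-step update rules; grad is an assumed callable returning the gradient at the given parameters, and the hyperparameter values (lr, gamma) are illustrative:

```python
import numpy as np

def momentum_step(theta, v, grad, lr=0.01, gamma=0.9):
    # Velocity = decaying average of past gradient steps.
    v = gamma * v + lr * grad(theta)
    return theta - v, v

def nesterov_step(theta, v, grad, lr=0.01, gamma=0.9):
    # Look ahead with the momentum first, then evaluate the gradient there.
    lookahead = theta - gamma * v
    v = gamma * v + lr * grad(lookahead)
    return theta - v, v

def averaged_sgd(theta, grad, lr=0.01, steps=1000):
    # Plain SGD, but return the average of all parameter values visited.
    avg = np.zeros_like(theta)
    for t in range(1, steps + 1):
        theta = theta - lr * grad(theta)
        avg += (theta - avg) / t   # running mean of the iterates
    return avg
```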

tricks by modifying the learning rate

  • LRV (“Learning Rate as a Vector”): Replace the global learning rate (GLR) with a vector, i.e. each parameter now has its own learning rate.
  • Adaptive Learning Rates: (with LRV) At each step, set each per-parameter LR to the GLR divided by some accumulated statistic of that parameter’s past gradients:
    • AdaGrad (Adaptive Gradient): Divide by the square root of the sum of the squares of ALL previous gradients (plus a small epsilon for numerical stability).
    • RMSProp: Divide by the square root of an exponentially decaying (moving) average of the squared gradients, so only recent gradients effectively count, unlike AdaGrad’s ever-growing sum. http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
      • Idea 1 (from the AdaDelta paper): Restrict AdaGrad’s accumulation to a window of recent gradients, implemented as a decaying average; the denominator becomes the root mean squared (RMS) of recent gradients, essentially the same accumulator RMSProp uses.
      • Idea 2 (THE “AdaDelta”): Idea 1 + replace the GLR with a decaying RMS of the previous parameter updates (deltas), which corrects the units (“dimensions”) of the update and removes the need for a hand-tuned global learning rate. (See the sketch after this list.)
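A minimal sketch of the three per-parameter schemes as single-step updates; grad is again an assumed callable, and the glr, rho, and eps values are common defaults used only for illustration:

```python
import numpy as np

def adagrad_step(theta, g2_sum, grad, glr=0.01, eps=1e-8):
    g = grad(theta)
    g2_sum = g2_sum + g ** 2               # sum of ALL squared gradients so far
    return theta - glr * g / (np.sqrt(g2_sum) + eps), g2_sum

def rmsprop_step(theta, g2_avg, grad, glr=0.001, rho=0.9, eps=1e-8):
    g = grad(theta)
    g2_avg = rho * g2_avg + (1 - rho) * g ** 2   # decaying average instead of a sum
    return theta - glr * g / (np.sqrt(g2_avg) + eps), g2_avg

def adadelta_step(theta, g2_avg, d2_avg, grad, rho=0.95, eps=1e-6):
    g = grad(theta)
    g2_avg = rho * g2_avg + (1 - rho) * g ** 2
    # Numerator: RMS of previous parameter updates replaces the global LR.
    delta = -np.sqrt(d2_avg + eps) / np.sqrt(g2_avg + eps) * g
    d2_avg = rho * d2_avg + (1 - rho) * delta ** 2
    return theta + delta, g2_avg, d2_avg
```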

tricks by modifying the gradient itself

  • AdaM (Adaptive Moment Estimation): RMSProp + replacing the raw gradient in the update with a decaying average of past gradients (the first “moment”, i.e. momentum). Both moment estimates are bias-corrected (denoted with a hat, e.g. \hat{m}) because they start at zero. (See the sketches after this list.)
  • AdaMax: AdaM + replacing the denominator (the root of the decaying average of squared gradients) with the maximum of the decayed previous denominator and the magnitude of the current gradient (an exponentially weighted infinity norm, which needs no bias correction), for better stability.
  • Nadam (Nesterov-accelerated Adaptive Moment Estimation): Like AdaM, but the momentum term works in a Nesterov (“momentum-then-gradient”) manner.
  • AMSGrad: Like AdaM, but the denominator keeps the maximum of all second-moment estimates seen so far instead of the (exponentially) decaying average, for better convergence guarantees.
  • AdamW: AdaM + decoupled weight decay: the decay is applied directly to the parameters instead of being added to the gradient, so it is not rescaled by the adaptive denominator. (See the second sketch below.)
  • AdamWR: AdamW + warm restarts (a cosine-annealed learning rate that is periodically restarted).
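A minimal sketch of one AdaM step with bias correction, using the commonly cited default hyperparameters (beta1=0.9, beta2=0.999); grad is again an assumed callable and t is the step count starting at 1:

```python
import numpy as np

def adam_step(theta, m, v, t, grad, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g          # decaying average of gradients ("momentum")
    v = beta2 * v + (1 - beta2) * g ** 2     # decaying average of squared gradients (RMSProp part)
    m_hat = m / (1 - beta1 ** t)             # bias correction: both averages start at zero
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```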
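And a sketch of the AdamW difference only: the weight-decay term is applied directly to the parameters rather than folded into the gradient, so it bypasses the adaptive denominator (the wd value and the omission of the schedule multiplier are simplifications for illustration):

```python
import numpy as np

def adamw_step(theta, m, v, t, grad, lr=0.001, beta1=0.9, beta2=0.999,
               eps=1e-8, wd=0.01):
    g = grad(theta)                          # note: no "+ wd * theta" here (decay is decoupled)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps) - lr * wd * theta
    return theta, m, v
```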