# Optimizers For Neural Networks

### Gradient Descent: Take steps proportional to the negative of the gradient of the loss function (i.e. error surface) at the current point.

Gradient descent is a first-order iterative optimization algorithm for finding the minimum of a function. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or approximate gradient) of the function at the current point. (Source: Gradient descent, Wikipedia, https://en.wikipedia.org/wiki/Gradient_descent)
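The update rule described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function name `gradient_descent`, the learning rate `lr`, the step count, and the example loss f(x) = (x - 3)^2 are all assumptions chosen for the demo.

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Repeatedly step against the gradient, starting from x0."""
    x = x0
    for _ in range(steps):
        # Step proportional to the negative of the gradient at the current point.
        x = x - lr * grad(x)
    return x

# Hypothetical loss f(x) = (x - 3)^2, whose gradient is f'(x) = 2 * (x - 3).
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```

With these settings `x_min` should land very close to 3, the minimizer of the example loss. A larger `lr` would take bigger steps and can overshoot or diverge.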

#### Tricks that change which training examples each update uses

• (Standard/Full-Batch) Gradient Descent: Compute the gradient w.r.t. ALL training examples, then update the parameters once with that single gradient.
• Mini-Batch Gradient Descent / SGD with Mini-Batches: Split the training examples into batches of a fixed size, sampled w/o replacement. Compute a gradient w.r.t. EACH batch and update the parameters as soon as that gradient is computed.
• Stochastic Gradient Descent (SGD): Compute a gradient w.r.t. EACH single training example. Update the parameters as soon as each gradient is computed.
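Seen as code, the three variants above differ only in how many examples feed each gradient. The sketch below assumes a toy one-parameter model y = w * x with a mean-squared-error loss. The names `grad_mse` and `train`, the data, and the hyperparameters are hypothetical choices for illustration.

```python
import random

def grad_mse(w, batch):
    """Gradient of the mean squared error of the model y = w * x over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def train(data, batch_size, lr=0.05, epochs=50, seed=0):
    rng = random.Random(seed)
    w = 0.0
    for _ in range(epochs):
        shuffled = data[:]
        rng.shuffle(shuffled)  # shuffle, then slice: batches are drawn w/o replacement
        for i in range(0, len(shuffled), batch_size):
            batch = shuffled[i:i + batch_size]
            w -= lr * grad_mse(w, batch)  # update as soon as this gradient is computed
    return w

# Noiseless toy data generated from y = 2x (an assumption for the demo).
data = [(x, 2.0 * x) for x in [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]]

w_full = train(data, batch_size=len(data))  # full-batch: one update per epoch
w_mini = train(data, batch_size=4)          # mini-batch: one update per 4 examples
w_sgd = train(data, batch_size=1)           # SGD: one update per example
```

On this noiseless data all three runs should recover a weight close to 2. On real data the smaller batch sizes give noisier but more frequent updates.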

#### Tricks that use previous values of the parameter vector

• SGD w/ Momentum: Update the vector with the gradient step plus a fraction of the previous update (the “momentum”). (The update is essentially a decaying average of past gradients.)
• Nesterov's Accelerated Gradient: Move the vector by the momentum first, THEN compute the gradient at that look-ahead point and update with it.
• Averaged SGD: Keep track of all values taken by the vector during training. After training, replace the vector with the average of these historical values.
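The three history-based tricks above can be compared side by side on a one-dimensional problem. This is a rough sketch under stated assumptions: the loss f(x) = x^2, the momentum coefficient `beta`, the learning rate, and the Gaussian noise added in `averaged_sgd` (to mimic noisy per-example gradients) are all illustrative choices, not part of the original notes.

```python
import random

def grad(x):
    """Gradient of the hypothetical loss f(x) = x**2."""
    return 2.0 * x

def sgd_momentum(x, lr=0.1, beta=0.9, steps=200):
    v = 0.0  # the "momentum": a decaying sum of past gradient steps
    for _ in range(steps):
        v = beta * v - lr * grad(x)  # keep a fraction of the previous update
        x = x + v
    return x

def nesterov(x, lr=0.1, beta=0.9, steps=200):
    v = 0.0
    for _ in range(steps):
        lookahead = x + beta * v             # apply the momentum step first
        v = beta * v - lr * grad(lookahead)  # then use the gradient at that point
        x = x + v
    return x

def averaged_sgd(x, lr=0.1, steps=200, noise=1.0, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(steps):
        noisy_grad = grad(x) + rng.gauss(0.0, noise)  # simulated gradient noise
        x = x - lr * noisy_grad
        total += x  # running sum, so the iterates need not all be stored
    return total / steps  # after training, use the average of all iterates

x_mom = sgd_momentum(5.0)
x_nag = nesterov(5.0)
x_avg = averaged_sgd(5.0)
```

All three results should end up near the minimizer at 0. The averaged value carries some bias from the early iterates, which is why practical variants often start averaging only after a warm-up period.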

#### Tricks that modify the learning rate

• LRV (“Learning Rate as Vectors”): Replace the general learning rate with a vector, i.e. each parameter now has its own learning rate.
• Adaptive Learning Rates (with LRV): At each step, update each LR in the LR vector to be the general LR divided by a quantity computed from that parameter's gradient history. (AdaGrad, for example, divides by the square root of the sum of that parameter's past squared gradients.)
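One concrete way to fill in "the general LR divided by something" is the AdaGrad rule, used here as an assumed example rather than the only option. In this sketch the divisor is the square root of each parameter's accumulated squared gradients. The function name `adagrad`, the `eps` constant, and the two-parameter loss are hypothetical.

```python
import math

def adagrad(grad, params, base_lr=0.5, steps=100, eps=1e-8):
    """AdaGrad-style update: every parameter gets its own effective learning rate."""
    params = list(params)
    accum = [0.0] * len(params)  # running sum of squared gradients, one per parameter
    for _ in range(steps):
        g = grad(params)
        for i in range(len(params)):
            accum[i] += g[i] ** 2
            lr_i = base_lr / (math.sqrt(accum[i]) + eps)  # general LR divided by history
            params[i] -= lr_i * g[i]
    return params

# Hypothetical loss f(a, b) = a**2 + 100 * b**2: the gradient w.r.t. b is 100x larger.
result = adagrad(lambda p: [2 * p[0], 200 * p[1]], [1.0, 1.0])
```

Because each learning rate is scaled by that parameter's own gradient history, both parameters should converge toward 0 at nearly the same pace even though their gradient magnitudes differ by a factor of 100.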