Learning Rate#

Note

We will abbreviate learning rate as LR.

Note

Learning rate is also called step size.

What is an LR?#

In vanilla gradient descent, with LR \( \eta \), parameter \( x \), and loss function \( f \), the update rule is:

\[ x' = x - \eta \, \frac{df}{dx} \]

Other optimizers update parameters in a similar fashion. The takeaway is that the LR controls how far each gradient update moves the parameters: with a huge LR, every update changes the parameters a lot; with a tiny LR, each update barely changes them.
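To make the update rule concrete, here is a minimal sketch of vanilla gradient descent in plain Python on the toy loss \( f(x) = x^2 \); the loss, starting point, and LR value are illustrative choices, not taken from any particular library:

```python
# Vanilla gradient descent on f(x) = x^2, whose gradient is df/dx = 2x.
def grad(x):
    return 2 * x

eta = 0.1   # learning rate
x = 5.0     # initial parameter value

for step in range(50):
    x = x - eta * grad(x)   # x' = x - eta * df/dx

print(x)  # close to 0, the minimizer of f
```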

How should I choose my LR?#

Initially, we want the LR to be as big as possible, because bigger steps make progress faster. However, with a big LR it is difficult to settle into a precise minimum, since every update takes a large step and tends to overshoot. As training progresses, we therefore want to gradually reduce the LR so that we converge to a more finely tuned solution. Some optimizers do part of this implicitly (the effective step size shrinks as the gradients shrink), but we can always use a learning rate scheduler if we want an explicit LR schedule, as in the sketch below.
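As one possible illustration, here is a hedged sketch of an explicit LR schedule using PyTorch's `StepLR`; the model, LR values, and epoch counts are arbitrary placeholders, and PyTorch itself is an assumption rather than something the text prescribes:

```python
import torch

model = torch.nn.Linear(10, 1)                                  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)         # start with a big LR
# StepLR multiplies the LR by `gamma` every `step_size` epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(90):
    optimizer.zero_grad()
    loss = model(torch.randn(4, 10)).pow(2).mean()  # dummy loss for illustration
    loss.backward()
    optimizer.step()
    scheduler.step()  # LR: 0.1 for epochs 0-29, 0.01 for 30-59, 0.001 for 60-89
```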

Can different learning rates be used on different parameters at the same time?#

Yes, different parameters can have different learning rates. Algorithms like Adam and Adagrad explicitly re-scale the per-parameter step size (based on each parameter's gradient history) to achieve faster training.
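For illustration, here is a sketch, assuming a PyTorch setup, of assigning different base LRs to different parameter groups; the layer sizes and LR values are hypothetical, and Adam then additionally adapts each parameter's effective step from its gradient statistics:

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(10, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 1),
)

# Parameter groups let each group carry its own base learning rate.
optimizer = torch.optim.Adam([
    {"params": model[0].parameters(), "lr": 1e-3},  # first layer: larger base LR
    {"params": model[2].parameters(), "lr": 1e-4},  # last layer: smaller base LR
])
```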