Learning Rate#
Note
We will abbreviate learning rate as LR. Learning rate is also called step size.
What is the LR?#
In vanilla gradient descent, with LR \( \eta \), parameter \( x \), and loss function \( f \), the update rule is as follows:

\[ x \leftarrow x - \eta \nabla f(x) \]
Other optimizers update in a similar fashion. The takeaway: the LR controls how far you step along the gradient in each update. With a large LR you change the parameters a lot per step; with a small LR you barely change them.
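As a concrete illustration, here is a minimal sketch of vanilla gradient descent on a toy quadratic loss. The function, starting point, LR value, and iteration count are arbitrary choices for this example, not taken from the text above.

```python
import numpy as np

def f(x):
    # Toy quadratic loss, minimized at x = 0.
    return float((x ** 2).sum())

def grad_f(x):
    # Gradient of the quadratic above.
    return 2 * x

eta = 0.1                    # the learning rate
x = np.array([3.0, -2.0])    # arbitrary starting point

for _ in range(50):
    x = x - eta * grad_f(x)  # vanilla GD update: x <- x - eta * grad f(x)

print(x, f(x))               # x ends up very close to the minimizer at 0
```

Try changing `eta`: a much larger value makes the iterates bounce around (or diverge), while a much smaller one converges very slowly.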
How should I choose my LR?#
Initially, we want the LR to be as large as possible, because larger updates make faster progress. However, with a large LR it is hard to settle into a precise minimum, since the optimizer only takes big steps. As training progresses, we therefore want to gradually reduce the LR so that we converge to a more fine-tuned solution. Some optimizers do this to a degree internally (they shrink the effective step size by rescaling the gradients), but we can always use a learning rate scheduler if we want an explicit LR schedule.
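As a sketch of an explicit LR schedule, here is one way to do it assuming a PyTorch training loop; the model, loss, LR value, and decay settings are illustrative only.

```python
import torch

# Illustrative model and optimizer; none of these names come from the text above.
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # start relatively large

# StepLR multiplies the LR by `gamma` every `step_size` calls to scheduler.step(),
# giving an explicit, predictable decay.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

for epoch in range(30):
    # Stand-in for a real training epoch: forward, backward, optimizer step.
    optimizer.zero_grad()
    loss = model(torch.randn(32, 10)).pow(2).mean()
    loss.backward()
    optimizer.step()

    scheduler.step()                       # decay the LR once per epoch
    print(epoch, scheduler.get_last_lr())  # e.g. [0.1] for the first 10 epochs, then [0.05], ...
```

Other schedules (cosine, exponential, warmup) follow the same pattern: the scheduler wraps the optimizer and adjusts its LR over time.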
Can different learning rates be used on different parameters at the same time?#
Yes. Different parameters can have different learning rates. Adaptive algorithms like Adam and Adagrad explicitly rescale the effective learning rate per parameter to achieve faster training.
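You can also assign learning rates per group of parameters by hand. Below is a minimal sketch assuming PyTorch parameter groups; the `backbone`/`head` split and the LR values are made up for illustration.

```python
import torch

# A toy two-part model; `backbone` and `head` are made-up names for this example.
class TwoPartModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = torch.nn.Linear(10, 10)
        self.head = torch.nn.Linear(10, 1)

    def forward(self, x):
        return self.head(self.backbone(x))

model = TwoPartModel()

# Each parameter group can carry its own LR; groups without one use the default below.
optimizer = torch.optim.SGD(
    [
        {"params": model.backbone.parameters()},          # uses the default lr=1e-3
        {"params": model.head.parameters(), "lr": 1e-2},  # larger LR for the head only
    ],
    lr=1e-3,
)

for group in optimizer.param_groups:
    print(group["lr"])  # 0.001, 0.01
```

This is a common pattern when fine-tuning: keep a small LR on pretrained layers and a larger LR on a freshly initialized head.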