RAdam Warmup
- class pytorch_warmup.radam.RAdamWarmup(optimizer, last_step=-1)[source]
RAdam warmup schedule.
This warmup scheme is described in On the adequacy of untuned warmup for adaptive optimization.
The RAdam algorithm uses the warmup factor
\[\omega_{t}^{\rm RAdam} = \sqrt{ \frac{ ( \rho_{t} - 4 ) ( \rho_{t} - 2 ) \rho_{\infty} }{ ( \rho_{\infty} - 4 ) ( \rho_{\infty} - 2 ) \rho_{t} } } \] at each iteration \(t\) for \(\rho_{t} > 4\), where
\[\rho_{\infty} = \frac{ 2 }{ 1 - \beta_{2} } - 1 \] and
\[\rho_{t} = \rho_{\infty} - \frac{ 2 t \cdot \beta_{2}^{t} }{ 1 - \beta_{2}^{t} } \] in which \(\beta_{2}\) is the second discount factor of Adam. In the RAdam warmup schedule, the minimal offset \(\delta\) is chosen such that \(\rho_{\delta} > 4\), and then \(\omega_{t+\delta-1}^{\rm RAdam}\) is employed as the warmup factor at each iteration \(t\). For all practically relevant values of \(\beta_{2}\) (\(0.8 < \beta_{2} \le 1\)), \(\delta \le 5\), as deduced from Fact 3.1 of the paper.
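To make these formulas concrete, here is a minimal, self-contained Python sketch that evaluates \(\rho_{\infty}\), \(\rho_{t}\), the minimal offset \(\delta\), and the resulting warmup factor. It is an illustrative reimplementation of the equations above, not the library's own code:

import math

def radam_warmup_factor_sketch(t, beta2):
    """Recompute omega_{t+delta-1}^RAdam from the formulas above (illustration only)."""
    rho_inf = 2.0 / (1.0 - beta2) - 1.0

    def rho(i):
        return rho_inf - 2.0 * i * beta2 ** i / (1.0 - beta2 ** i)

    # Minimal offset delta such that rho_delta > 4 (delta <= 5 for 0.8 < beta2 <= 1).
    delta = 1
    while rho(delta) <= 4:
        delta += 1

    r = rho(t + delta - 1)
    return math.sqrt((r - 4) * (r - 2) * rho_inf / ((rho_inf - 4) * (rho_inf - 2) * r))

# The factor starts well below 1 and slowly approaches 1 as t grows.
print([round(radam_warmup_factor_sketch(t, beta2=0.999), 4) for t in (1, 10, 100, 1000)])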
- Parameters:
optimizer (Optimizer) – Adam optimizer or its variant: Adam, AdamW, SparseAdam, or NAdam. RAdam is not suitable because of the warmup redundancy. This warmup schedule makes no sense for Adamax and, in principle, the AMSGrad variant of Adam and AdamW, as discussed in the Note below. In practice, however, this warmup schedule improves the performance of the AMSGrad variant just as it does for the vanilla Adam. See the sketch after this parameter list for suitable and redundant pairings.
last_step (int) – The index of the last step. Default: -1.
Note
This warmup schedule employs the same warmup factor for all variants of Adam. However, according to the RAdam theory, Adamax and the AMSGrad variant of Adam and AdamW should have a different warmup factor because their \(\psi(\cdot)\) function differs from that of the vanilla Adam, where \(\psi(\cdot)\) specifies how the adaptive learning rate at \(t\) is calculated. The RAdam theory derives the warmup factor \(\omega_{t}\) from \(\psi(\cdot)\): for gradients \(\left\{ g_{i} \right\}\) viewed as i.i.d. normal random variables, \(\omega_{t} = \sqrt{ C_{\rm var} / {\rm Var}\left[ \psi(g_{1}, \dots, g_{t}) \right] }\), where \(C_{\rm var} = \inf_{t} {\rm Var}\left[ \psi(g_{1}, \dots, g_{t}) \right]\). (For details, please refer to On the Variance of the Adaptive Learning Rate and Beyond.) A rough Monte Carlo sketch of this variance argument appears at the end of this note.
The variance hypothesis of the RAdam theory has become questionable since Ma and Yarats’ paper pointed out that the adaptive learning rate may not be the best medium of analysis for understanding the role of warmup in Adam.
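As a rough illustration of this variance argument, the sketch below estimates \({\rm Var}\left[ \psi(g_{1}, \dots, g_{t}) \right]\) by Monte Carlo for the vanilla Adam, assuming \(\psi(\cdot)\) is the bias-corrected exponential-moving-average adaptive rate analyzed in the RAdam paper; it is not part of the library, and the constants are arbitrary:

import math
import numpy as np

rng = np.random.default_rng(0)
beta2 = 0.9
num_samples = 5000

def psi(grads, beta2):
    # Adam's bias-corrected adaptive rate for a gradient sequence g_1..g_t
    # (the form analyzed in the RAdam paper; an assumption of this sketch).
    t = len(grads)
    v = (1.0 - beta2) * sum(beta2 ** (t - i) * g * g for i, g in enumerate(grads, start=1))
    return math.sqrt((1.0 - beta2 ** t) / v)

# Var[psi] is unbounded for very small t (which is why RAdam rectifies only when
# rho_t > 4), so the estimate starts a few steps in.
variances = []
for t in range(4, 21):
    samples = [psi(rng.standard_normal(t), beta2) for _ in range(num_samples)]
    variances.append(float(np.var(samples)))

c_var = min(variances)  # stands in for inf_t Var[psi] over the steps considered
omegas = [math.sqrt(c_var / v) for v in variances]
print([round(w, 3) for w in omegas])  # roughly increases toward 1, up to Monte Carlo noise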
Example
>>> optimizer = AdamW(...)
>>> lr_scheduler = CosineAnnealingLR(optimizer, ...)
>>> warmup_scheduler = RAdamWarmup(optimizer)
>>> for batch in dataloader:
>>>     optimizer.zero_grad()
>>>     loss = ...
>>>     loss.backward()
>>>     optimizer.step()
>>>     with warmup_scheduler.dampening():
>>>         lr_scheduler.step()
Warning
The warmup schedule must not be initialized before the initialization of the learning rate schedule.
- warmup_factor(step, beta2, rho_inf, offset)[source]
Returns the warmup factor \(\omega_{t+\delta-1}^{\rm RAdam}\) at an iteration \(t\).
- Parameters:
step (int) – The index of current step.
beta2 (float) – The second discount factor of Adam, \(\beta_{2}\).
rho_inf (float) – The constant of the RAdam algorithm, \(\rho_{\infty}\).
offset (int) – The minimal offset \(\delta\).
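As a hedged illustration of this method's documented signature, the sketch below constructs a scheduler and queries the factor for the first few steps; the exact values depend on the library's internal step indexing, so treat the output as indicative only:

import torch
from pytorch_warmup.radam import RAdamWarmup, get_offset

param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.AdamW([param], lr=1e-3, betas=(0.9, 0.999))
scheduler = RAdamWarmup(optimizer)

beta2 = 0.999
rho_inf = 2.0 / (1.0 - beta2) - 1.0  # from the formula above
offset = get_offset(beta2, rho_inf)  # documented below

for step in range(1, 6):
    # The factor is small at first and approaches 1 as the step index grows.
    print(step, scheduler.warmup_factor(step, beta2, rho_inf, offset))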
- pytorch_warmup.radam.get_offset(beta2, rho_inf)[source]
Returns the minimal offset \(\delta\).
- Parameters:
beta2 (float) – The second discount factor of Adam, \(\beta_{2}\).
rho_inf (float) – The constant of the RAdam algorithm, \(\rho_{\infty}\).
- pytorch_warmup.radam.rho_fn(t, beta2, rho_inf)[source]
Returns the value of the RAdam algorithm's function \(\rho_{t}\) at an iteration \(t\).
- Parameters:
t (int) – The iteration \(t\).
beta2 (float) – The second discount factor of Adam, \(\beta_{2}\).
rho_inf (float) – The constant of the RAdam algorithm, \(\rho_{\infty}\).
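For example, the two helpers above can be combined to inspect the offset and the first few \(\rho_{t}\) values (a small sketch; the printed numbers are indicative only):

from pytorch_warmup.radam import get_offset, rho_fn

beta2 = 0.999
rho_inf = 2.0 / (1.0 - beta2) - 1.0  # rho_inf from its definition above
offset = get_offset(beta2, rho_inf)  # minimal delta with rho_delta > 4
print(offset)                        # at most 5 for 0.8 < beta2 <= 1

# rho_t stays at or below 4 for t < offset and first exceeds 4 at t = offset.
print([round(rho_fn(t, beta2, rho_inf), 3) for t in range(1, offset + 1)])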