Untuned Warmup

class pytorch_warmup.untuned.UntunedExponentialWarmup(optimizer, last_step=-1)[source]

Untuned exponential warmup schedule for Adam.

This warmup scheme is described in On the adequacy of untuned warmup for adaptive optimization.

The untuned exponential warmup schedule uses the warmup factor

\[\omega_{t}^{\rm expo, untuned} = 1 - \exp \left( - (1 - \beta_{2}) \cdot t \right) \]

at each iteration \(t\), where \(\beta_{2}\) is the second discount factor of Adam. In practice, \(\omega_{t}^{\rm expo, untuned}\) is calculated as \(\omega_{t}^{\rm expo, \tau}\) with \(\tau = \frac{1}{1 - \beta_{2}}\).
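
For concreteness, the warmup factor can be reproduced in a few lines of plain Python. The helper function below is only an illustrative sketch and is not part of the pytorch_warmup API; the class computes this factor internally.

>>> import math
>>> def untuned_expo_factor(step, beta2):
...     """Illustrative only: omega_t = 1 - exp(-(1 - beta2) * t)."""
...     tau = 1.0 / (1.0 - beta2)  # the constant tau described above
...     return 1.0 - math.exp(-step / tau)
>>> round(untuned_expo_factor(1000, beta2=0.999), 4)  # here tau = 1000
0.6321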

Note

The constant \(\tau\) is derived from the intuition that the warmup factor should be roughly equivalent to Adam’s second moment bias correction term, \(1 - \beta_{2}^{t}\).
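
As a quick numeric check of this intuition: for \(\beta_{2} = 0.999\) and \(t = 100\), the bias correction term is \(1 - 0.999^{100} \approx 0.0952\), while the warmup factor is \(1 - \exp(-0.001 \cdot 100) \approx 0.0952\); the two nearly coincide because \(\ln \beta_{2} \approx -(1 - \beta_{2})\) for \(\beta_{2}\) near 1.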

Note

The effective warmup period is defined as

\({\cal T}(\omega) = \sum_{t = 1}^{\infty} \left( 1 - \omega_{t} \right)\)

for a warmup schedule \(\omega = \left\{ \omega_{t} \right\}_{t=1}^{\infty}\). The constant \(\tau\) of the untuned exponential warmup schedule is roughly equivalent to its effective warmup period:

\({\cal T}(\omega^{\rm expo, untuned}) = 1 / \left( \exp( 1 - \beta_{2}) - 1 \right) \approx \tau\)

for \(\beta_{2}\) near 1. The rough equivalence is also achieved for an exponential warmup schedule if its \(\tau\) is large enough, for example, \(\tau \ge 1\).
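
Spelling out this computation (a short derivation using the geometric series):

\[{\cal T}(\omega^{\rm expo, untuned}) = \sum_{t = 1}^{\infty} \exp \left( - (1 - \beta_{2}) \cdot t \right) = \frac{\exp \left( - (1 - \beta_{2}) \right)}{1 - \exp \left( - (1 - \beta_{2}) \right)} = \frac{1}{\exp( 1 - \beta_{2}) - 1} \approx \frac{1}{1 - \beta_{2}} = \tau,\]

where the final approximation uses \(\exp(1 - \beta_{2}) \approx 1 + (1 - \beta_{2})\) for \(\beta_{2}\) near 1.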

Parameters:
  • optimizer (Optimizer) – Adam optimizer or one of its variants: Adam, AdamW, SparseAdam, or NAdam. RAdam is not suitable because applying a warmup to it would be redundant. This warmup schedule makes no sense for Adamax, as discussed in the Note below.

  • last_step (int) – The index of the last step (see the resumption sketch after this list). Default: -1.
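
The following is only a rough sketch, assuming last_step is used like last_epoch in PyTorch learning rate schedulers to set the step counter when a run restarts; the checkpoint dictionary and its 'step' key are hypothetical:

>>> # Hypothetical resume: pass the index of the last completed step (illustrative only).
>>> warmup_scheduler = UntunedExponentialWarmup(optimizer, last_step=checkpoint['step'])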

Note

This warmup schedule employs the same constant \(\tau\) for all variants of Adam. In principle, however, Adamax should need no warmup: Adamax is derived by employing an \(L^{p}\)-norm update rule and letting \(p \rightarrow \infty\), and the second moment bias correction term of that update rule is \(1 - \beta_{2}^{pt}\), which the warmup factor is meant to roughly match in the derivation of this warmup schedule. In practice, an exponential warmup may still slightly improve Adamax's performance because its initial update step is the same as that of the Adam optimizer.
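
Spelling out the limit used above: for any fixed \(t \ge 1\) and \(0 \le \beta_{2} < 1\),

\[\lim_{p \rightarrow \infty} \left( 1 - \beta_{2}^{pt} \right) = 1,\]

so a warmup factor matching this bias correction term would be identically 1, i.e. no warmup at all.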

Example

>>> from torch.optim import AdamW
>>> from torch.optim.lr_scheduler import CosineAnnealingLR
>>> from pytorch_warmup.untuned import UntunedExponentialWarmup
>>> optimizer = AdamW(...)
>>> lr_scheduler = CosineAnnealingLR(optimizer, ...)
>>> warmup_scheduler = UntunedExponentialWarmup(optimizer)
>>> for batch in dataloader:
>>>     optimizer.zero_grad()
>>>     loss = ...
>>>     loss.backward()
>>>     optimizer.step()
>>>     with warmup_scheduler.dampening():
>>>         lr_scheduler.step()

Warning

The warmup schedule must not be initialized before the initialization of the learning rate schedule.

class pytorch_warmup.untuned.UntunedLinearWarmup(optimizer, last_step=-1)[source]

Untuned linear warmup schedule for Adam.

This warmup scheme is described in On the adequacy of untuned warmup for adaptive optimization.

The untuned linear warmup schedule uses the warmup factor

\[\omega_{t}^{\rm linear, untuned} = \min \left\{ 1, \frac{1 - \beta_{2}}{2} \cdot t \right\} \]

at each iteration \(t\), where \(\beta_{2}\) is the second discount factor of Adam. In practice, \(\omega_{t}^{\rm linear, untuned}\) is calculated as \(\omega_{t}^{\rm linear, \tau}\) with \(\tau = \frac{2}{1 - \beta_{2}}\).
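
For example, with \(\beta_{2} = 0.999\) the schedule is \(\omega_{t}^{\rm linear, untuned} = \min \left\{ 1, 0.0005 \cdot t \right\}\), a linear ramp that reaches 1 at \(t = \tau = 2000\) steps.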

Note

The effective warmup period is defined as

\({\cal T}(\omega) = \sum_{t = 1}^{\infty} \left( 1 - \omega_{t} \right)\)

for a warmup schedule \(\omega = \left\{ \omega_{t} \right\}_{t=1}^{\infty}\). The warmup period \(\tau\) is deduced by approximately solving the rough equivalence:

\({\cal T}(\omega^{\rm expo, untuned}) \approx {\cal T}(\omega^{{\rm linear}, \tau}) \approx \frac{\tau}{2}\).
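
Spelling this out (a short derivation, assuming \(\tau\) is an integer): since \(\omega_{t}^{{\rm linear}, \tau} = \min \left\{ 1, \frac{t}{\tau} \right\}\),

\[{\cal T}(\omega^{{\rm linear}, \tau}) = \sum_{t = 1}^{\tau} \left( 1 - \frac{t}{\tau} \right) = \tau - \frac{\tau + 1}{2} = \frac{\tau - 1}{2} \approx \frac{\tau}{2},\]

and equating this to \({\cal T}(\omega^{\rm expo, untuned}) \approx \frac{1}{1 - \beta_{2}}\) gives \(\tau = \frac{2}{1 - \beta_{2}}\).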

Parameters:
  • optimizer (Optimizer) – Adam optimizer or one of its variants: Adam, AdamW, SparseAdam, or NAdam. RAdam is not suitable because applying a warmup to it would be redundant. This warmup schedule makes no sense for Adamax, as discussed in the Note below.

  • last_step (int) – The index of the last step. Default: -1.

Note

This warmup schedule employs the same warmup period \(\tau\) for all variants of Adam. In principle, however, Adamax should need no linear warmup because it needs no exponential warmup; for further details, please refer to the Note in the documentation of UntunedExponentialWarmup. In practice, a linear warmup may still slightly improve Adamax's performance because its initial update step is the same as that of the Adam optimizer.

Example

>>> from torch.optim import AdamW
>>> from torch.optim.lr_scheduler import CosineAnnealingLR
>>> from pytorch_warmup.untuned import UntunedLinearWarmup
>>> optimizer = AdamW(...)
>>> lr_scheduler = CosineAnnealingLR(optimizer, ...)
>>> warmup_scheduler = UntunedLinearWarmup(optimizer)
>>> for batch in dataloader:
>>>     optimizer.zero_grad()
>>>     loss = ...
>>>     loss.backward()
>>>     optimizer.step()
>>>     with warmup_scheduler.dampening():
>>>         lr_scheduler.step()

Warning

The warmup schedule must not be initialized before the initialization of the learning rate schedule.