PyTorch’s Magic with Automatic Mixed Precision
The PyTorch library is one of the go-to frameworks these days for implementing neural networks and deep learning models. These models are built from computational layers that are sometimes specific to a task or modality (conv1d, conv2d, rnn, lstm, gru, transformer, etc.) and sometimes generic (batch-norm, dropout, linear, etc.). All of them, however, operate on PyTorch tensors, which by default use 32-bit floating point precision.
Training and evaluating neural networks at this precision level gives accurate results, but it has been observed that lowering the precision barely affects the outcome: even half this precision gives nearly the same results. It is therefore worthwhile to perform the heavy mathematical computations with less overhead by using 16-bit floating point tensors.
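As a quick check of the storage argument, here is a minimal sketch (the tensor size is arbitrary) showing that PyTorch tensors default to float32 and that half precision halves the per-element storage:
import torch

# PyTorch creates 32-bit floating point tensors by default.
x = torch.rand(1024, 1024)
print(x.dtype)            # torch.float32
print(x.element_size())   # 4 bytes per element

# The same tensor in half precision needs half the storage.
x_half = x.half()
print(x_half.dtype)           # torch.float16
print(x_half.element_size())  # 2 bytes per element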
There are three factors that affect deep learning model performance[4]:-
- Arithmetic bandwidth
- Memory bandwidth
- Latency
Lowering the precision used reduces the impact of two of these factors.
- The pressure on memory bandwidth will be reduced because we need fewer bits to store the same parameters.
- The calculation time will also be reduced, because lower-precision arithmetic is cheaper to execute, resulting in higher throughput (see the timing sketch below).
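The sketch below is a rough illustration of that throughput claim; the matrix size, iteration count, and the time_matmul helper are arbitrary placeholders, and the actual speed-up depends heavily on the GPU (Tensor Core GPUs benefit the most):
import time
import torch

def time_matmul(dtype, size=4096, iters=50):
    # Time repeated matrix multiplications at a given precision on the GPU.
    a = torch.rand(size, size, device='cuda', dtype=dtype)
    b = torch.rand(size, size, device='cuda', dtype=dtype)
    a @ b                          # warm-up
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    return time.time() - start

print("float32:", time_matmul(torch.float32))
print("float16:", time_matmul(torch.float16))  # usually noticeably faster on Tensor Core GPUs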
Computation in lower precision can be significantly faster on modern GPUs. It also has the extra benefit of using less memory, enabling training of larger models with larger batch sizes, which can boost performance further. In 2017, NVIDIA researchers developed a methodology for mixed-precision training, which combines the single-precision (FP32) and half-precision (e.g. FP16) formats when training a network, and achieved the same accuracy as FP32 training with the same hyper-parameters, with additional performance benefits on NVIDIA GPUs:
- Shorter training time
- Lower memory requirements, enabling larger batch sizes, larger models, or larger inputs
Mixed-precision training performs some expensive operations (like convolutions and matrix multiplications) in 16-bit by casting down their inputs, while keeping numerically sensitive operations like accumulations in 32-bit floating point format. For this, PyTorch provides two APIs, autocast and GradScaler, which we explore below.
Autocast
Autocast serves as a context manager or decorator that allows regions of your script to run in mixed precision. In these regions, ops run in an op-specific dtype chosen by autocast to improve performance while maintaining accuracy.
When entering an autocast-enabled region, tensors may be of any type; you should not call half() or bfloat16() on your model(s) or inputs when using autocasting. Autocast should wrap only the forward pass(es) of your network, including the loss computation(s). Backward passes under autocast are not recommended; backward ops run in the same type that autocast used for the corresponding forward ops.
In PyTorch code, autocast is used in the training loop as follows:-
import torch
from torch import optim
from torch.cuda.amp import autocast

# Creates model and optimizer in default precision
model = Net().cuda()
optimizer = optim.SGD(model.parameters(), ...)

for input, target in data:
    optimizer.zero_grad()

    # Enables autocasting for the forward pass (model + loss)
    with autocast():
        output = model(input)
        loss = loss_fn(output, target)

    # Exits the context manager before backward()
    loss.backward()
    optimizer.step()
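If one part of the forward pass needs full precision, autocast can also be disabled locally inside an enabled region. Here is a small sketch of that pattern with placeholder tensors; note that inputs produced under autocast may be float16 and need an explicit cast back:
import torch

a = torch.rand(32, 32, device='cuda')
b = torch.rand(32, 32, device='cuda')

with torch.cuda.amp.autocast():
    c = a @ b                                   # runs in float16
    with torch.cuda.amp.autocast(enabled=False):
        # Locally disable autocast for a numerically sensitive region.
        d = c.float() @ b                       # runs in float32
    e = d @ a                                   # float16 again after re-entering

print(c.dtype, d.dtype, e.dtype)                # torch.float16 torch.float32 torch.float16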
Let's see the effect of using autocast on addition and matrix-multiplication operations:-
import torch

x = torch.rand([32, 32]).cuda()
y = torch.rand([32, 32]).cuda()

with torch.cuda.amp.autocast():
    a = x + y
    b = x @ y

print(a.dtype)  # prints torch.float32 <= addition operation in 32-bit
print(b.dtype)  # prints torch.float16 <= matrix multiplication in 16-bit
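Numerically sensitive ops, on the other hand, are autocast to float32 even when their inputs are float16. Extending the example above (softmax is on autocast's float32 op list):
with torch.cuda.amp.autocast():
    b = x @ y                      # float16, as before
    c = torch.softmax(b, dim=-1)   # numerically sensitive op runs in float32

print(b.dtype)  # torch.float16
print(c.dtype)  # torch.float32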
GradScaler
If the forward pass for a particular op has float16 inputs, the backward pass for that op will produce float16 gradients. Gradient values with small magnitudes may not be representable in float16. These values will flush to zero (“underflow”), so the update for the corresponding parameters will be lost.
To prevent underflow, “gradient scaling” multiplies the network’s loss(es) by a scale factor and invokes a backward pass on the scaled loss(es). Gradients flowing backward through the network are then scaled by the same factor. In other words, gradient values have a larger magnitude, so they don’t flush to zero.
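Here is a tiny sketch of the underflow problem and why scaling helps; the value and the 2**16 scale factor are illustrative:
import torch

# A gradient-sized value below float16's smallest subnormal (~6e-8) flushes to zero.
g = torch.tensor(1e-8)
print(g.half())                            # tensor(0., dtype=torch.float16) <= underflow

# Scaling by 2**16 keeps the value representable in float16.
scale = 2.0 ** 16
print((g * scale).half())                  # ~6.55e-4, representable in float16
print((g * scale).half().float() / scale)  # ~1e-8 once unscaled in float32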
Each parameter’s gradient (.grad attribute) should be unscaled before the optimizer updates the parameters, so the scale factor does not interfere with the learning rate.
GradScaler is used in the training loop as follows:-
import torch
from torch import optim, autocast
from torch.cuda.amp import GradScaler

# Creates model and optimizer in default precision
model = Net().cuda()
optimizer = optim.SGD(model.parameters(), ...)

# Creates a GradScaler once at the beginning of training.
scaler = GradScaler()

for epoch in epochs:
    for input, target in data:
        optimizer.zero_grad()

        # Runs the forward pass with autocasting.
        with autocast(device_type='cuda', dtype=torch.float16):
            output = model(input)
            loss = loss_fn(output, target)

        # Scales loss. Calls backward() on scaled loss to create scaled gradients.
        # Backward passes under autocast are not recommended.
        # Backward ops run in the same dtype autocast chose for corresponding forward ops.
        scaler.scale(loss).backward()

        # scaler.step() first unscales the gradients of the optimizer's assigned params.
        # If these gradients do not contain infs or NaNs, optimizer.step() is then called,
        # otherwise, optimizer.step() is skipped.
        scaler.step(optimizer)

        # Updates the scale for next iteration.
        scaler.update()
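If you need to work with the gradients between backward() and step() (for example, for gradient clipping), unscale them explicitly first. Below is a sketch of that pattern, reusing the names from the loop above; the max_norm value is arbitrary:
scaler.scale(loss).backward()

# Unscale the optimizer's gradients in place so clipping sees the true magnitudes.
scaler.unscale_(optimizer)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# scaler.step() knows unscale_() was already called and will not unscale a second time.
scaler.step(optimizer)
scaler.update()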
Using mixed precision training requires three steps:
- Converting the model to use the float16 data type where possible.
- Keeping float32 master weights to accumulate per-iteration weight updates.
- Using loss scaling to preserve small gradient values.
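For intuition, here is a minimal, self-contained sketch of those three steps done by hand; the tiny linear model, random data, learning rate, and scale factor are all illustrative placeholders, and torch.cuda.amp automates this bookkeeping for you:
import torch

torch.manual_seed(0)
lr, loss_scale = 0.01, 2.0 ** 16

model = torch.nn.Linear(16, 1).cuda().half()                        # 1. float16 where possible
master = [p.detach().clone().float() for p in model.parameters()]   # 2. float32 master weights

x = torch.rand(8, 16, device='cuda', dtype=torch.float16)
target = torch.rand(8, 1, device='cuda')

out = model(x)                                    # forward pass runs in float16
loss = torch.nn.functional.mse_loss(out.float(), target)
(loss * loss_scale).backward()                    # 3. scale the loss before backward

with torch.no_grad():
    for p, m in zip(model.parameters(), master):
        m -= lr * (p.grad.float() / loss_scale)   # unscale and update the fp32 masters
        p.copy_(m)                                # refresh the fp16 working copy
        p.grad = None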
NVIDIA Recommendations
In practice, NVIDIA's recommendations for mixed-precision training are:
- Choose mini-batch to be a multiple of 8
- Choose linear layer dimensions to be a multiple of 8
- Choose convolution layer channel counts to be a multiple of 8
- For classification problems, pad vocabulary to be a multiple of 8
- For sequence problems, pad the sequence length to be a multiple of 8
- Make sure you are running the model on a Volta or Turing (or newer) GPU architecture to benefit from Tensor Core acceleration
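As a small illustration of the padding recommendations, here is a hypothetical helper that rounds a layer dimension or vocabulary size up to the next multiple of 8:
def pad_to_multiple_of_8(n):
    # Round up so cuBLAS/cuDNN can pick Tensor Core friendly kernel shapes.
    return ((n + 7) // 8) * 8

print(pad_to_multiple_of_8(10001))  # 10008, e.g. a padded vocabulary size
print(pad_to_multiple_of_8(512))    # 512, already aligned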
To learn more techniques for reducing compute time and training faster, speeding up inference, or both, check this blog on low-effort, high-impact ways of training deep learning models.
References
[1] Mixed precision training paper: Micikevicius, Paulius, et al. “Mixed precision training.” arXiv preprint arXiv:1710.03740 (2017).
[2] Autocast and GradScaler: https://effectivemachinelearning.com/PyTorch/8._Faster_training_with_mixed_precision
[3] PyTorch documentation: https://pytorch.org/docs/stable/amp.html
[4] AI Research Center: https://fcuai.tw/2020/08/14/mixed-precision/#top
Some of the content in this post is taken from the documentation and articles listed in the references section, for which I take no credit. This blog post is simply a collection of noteworthy points. There may have been other sources in my reading as well.
Keep learning, Keep Hustling