Straightforward yet productive tricks to boost deep learning model training

Nikhil Verma
6 min read · Jan 20, 2023


Hi fellow Deep Learning researchers,

We train deep learning models day in and day out, paying attention to every minute detail while designing neural network architectures, preparing input data, writing training loops and choosing evaluation strategies to test our proposed methodologies. This spans everything from managing CPU-GPU usage and memory constraints to navigating the optimisation-regularisation trade-off during model training.

In this blogpost, I will talk about various straightforward yet productive tricks that one can use while training deep learning models. These are simple but effective in several ways, such as saving compute time, making optimal use of the resources at hand and taking full advantage of the options our deep learning framework of choice offers. In particular, I will give examples using the PyTorch framework.

Learning Rate Schedulers

The learning rate of an optimization technique like SGD, Adam or Adagrad plays an important role in the convergence of the learning algorithm. There are multiple schedulers for adjusting the learning rate over the course of training; a few examples include StepLR, ExponentialLR, ReduceLROnPlateau, OneCycleLR etc. LR scheduling makes the learning rate dynamic, typically adjusting it after every epoch, but the user has to take care of the additional hyperparameters these scheduling schemes require. An illustration of using an LR scheduler:

import torch
from torch import nn
from torch.optim import SGD
from torch.optim.lr_scheduler import ExponentialLR

model = nn.Linear(2, 2)
optimizer = SGD(model.parameters(), lr=0.1)
# Defining a scheduler that decays the LR by a factor of gamma every epoch
scheduler = ExponentialLR(optimizer, gamma=0.9)

for epoch in range(20):
    for input, target in dataset:
        optimizer.zero_grad()
        output = model(input)
        loss = loss_fn(output, target)
        loss.backward()
        optimizer.step()
    # Updating the learning rate once per epoch
    scheduler.step()

Number of Workers and pin_memory

While loading data with a DataLoader, PyTorch by default uses a single process. Within a Python process, the Global Interpreter Lock (GIL) prevents truly parallelizing Python code across threads. To avoid blocking the computation code with data loading, PyTorch provides an easy switch to multi-process data loading: simply set the num_workers argument to a positive integer.

Also, host-to-GPU copies are much faster when they originate from pinned (page-locked) memory. Passing pin_memory=True to a DataLoader will automatically put the fetched data tensors in pinned memory, enabling faster data transfer to CUDA-enabled GPUs.
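
A minimal sketch of both options together (train_dataset and the batch size are placeholders, not from the original example):

from torch.utils.data import DataLoader

train_loader = DataLoader(
    train_dataset,        # any torch Dataset (placeholder here)
    batch_size=64,
    shuffle=True,
    num_workers=4,        # spawn 4 worker processes for data loading
    pin_memory=True,      # page-locked host memory for faster host-to-GPU copies
)

for input, target in train_loader:
    # non_blocking=True lets the copy overlap with computation when memory is pinned
    input = input.cuda(non_blocking=True)
    target = target.cuda(non_blocking=True)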

Batch Size

Try to use as large a batch size as fits in your GPU memory. A larger batch gives a lower-variance, more accurate estimate of the gradient used to update the model parameters.

Automatic Mixed Precision

Neural networks are usually trained and evaluated with 32-bit floating point precision, but it has been observed that lowering the precision does not affect the results much; half precision gives nearly the same results. It is therefore fruitful to run the heavy mathematical computations with less overhead by using 16-bit floating point tensors where possible. In PyTorch this is usually achieved with the autocast and GradScaler functionalities, as illustrated below:

import torch
from torch import optim
from torch.amp import autocast
from torch.cuda.amp import GradScaler

# Creates model and optimizer in default precision
model = Net().cuda()
optimizer = optim.SGD(model.parameters(), ...)

# Creates a GradScaler once at the beginning of training.
scaler = GradScaler()

for epoch in range(epochs):
    for input, target in data:
        optimizer.zero_grad()

        # Runs the forward pass with autocasting.
        with autocast(device_type='cuda', dtype=torch.float16):
            output = model(input)
            loss = loss_fn(output, target)

        # Scales loss. Calls backward() on scaled loss to create scaled gradients.
        # Backward passes under autocast are not recommended.
        # Backward ops run in the same dtype autocast chose for corresponding forward ops.
        scaler.scale(loss).backward()

        # scaler.step() first unscales the gradients of the optimizer's assigned params.
        # If these gradients do not contain infs or NaNs, optimizer.step() is then called,
        # otherwise, optimizer.step() is skipped.
        scaler.step(optimizer)

        # Updates the scale for next iteration.
        scaler.update()

To know more about AMP, see the PyTorch documentation on automatic mixed precision [1].

Various Optimizers

There is a plethora of optimizers proposed in the literature for training deep neural networks, ranging from batched gradient descent, SGD, momentum-based GD, Nesterov GD and adaptive GD to RMSprop, Adam, AdamW etc. It is worth trying a range of optimizers instead of just the simple ones you already know, as sketched below.
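
Swapping the optimizer is usually a one-line change; in this sketch the model and the learning rates are placeholders:

from torch import optim

# Any of these can be dropped into the same training loop
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9, nesterov=True)
optimizer = optim.RMSprop(model.parameters(), lr=0.01)
optimizer = optim.Adam(model.parameters(), lr=1e-3)
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)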

Gradient Checkpointing

Checkpointing works by trading compute for memory. Rather than storing all intermediate activations of the entire computation graph for the backward pass, the checkpointed part does not save its intermediate activations and instead recomputes them during the backward pass. It can be applied to any part of a model.

Specifically, in the forward pass, the checkpointed function runs in torch.no_grad() manner, i.e., without storing intermediate activations. Instead, the forward pass saves the input tuple and the function parameter. In the backward pass, the saved inputs and function are retrieved, the forward pass is computed on the function again, now tracking the intermediate activations, and the gradients are then calculated using these activation values.

While this might slightly increase your run time for a given batch size, it significantly reduces your memory footprint. This in turn allows you to further increase the batch size, enabling better GPU utilization.
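
A minimal sketch of checkpointing one sub-module with torch.utils.checkpoint (the layer sizes and block structure are made up for illustration):

import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.block1 = nn.Sequential(nn.Linear(128, 256), nn.ReLU())
        self.block2 = nn.Sequential(nn.Linear(256, 256), nn.ReLU())
        self.head = nn.Linear(256, 10)

    def forward(self, x):
        x = self.block1(x)
        # block2's intermediate activations are not stored; they are
        # recomputed during the backward pass to save memory
        x = checkpoint(self.block2, x, use_reentrant=False)
        return self.head(x)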

Gradient Accumulation

An approach to increasing the effective batch size is to accumulate gradients across multiple backward passes before calling optimizer.step(). This method was developed mainly to circumvent GPU memory limitations. An example adapted from [5]:

model.zero_grad()                                   # Reset gradients tensors
for i, (inputs, labels) in enumerate(training_set):
    predictions = model(inputs)                     # Forward pass
    loss = loss_function(predictions, labels)       # Compute loss function
    loss = loss / accumulation_steps                # Normalize our loss (if averaged)
    loss.backward()                                 # Backward pass
    if (i+1) % accumulation_steps == 0:             # Wait for several backward steps
        optimizer.step()                            # Now we can do an optimizer step
        model.zero_grad()                           # Reset gradients tensors
        if (i+1) % evaluation_steps == 0:           # Evaluate the model when we...
            evaluate_model()                        # ...have no gradients accumulated

Distributed training

PyTorch provides several options for data-parallel training. The torch.nn.parallel.DistributedDataParallel container parallelizes the application of a given module by splitting the input across the specified devices, chunking along the batch dimension. The module is replicated on each machine and each device, and each replica handles a portion of the input. During the backward pass, gradients from each node are averaged. The local batch size should be larger than the number of GPUs used locally.

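A minimal sketch of wrapping a model in DistributedDataParallel, assuming the script is launched with torchrun (which sets LOCAL_RANK and the rendezvous environment variables); the toy model is a placeholder:

import os
import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")          # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = nn.Linear(10, 10).cuda(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])  # gradients are averaged across ranks
    # ... build the optimizer and a DataLoader with a DistributedSampler,
    # then train ddp_model exactly like a regular module

    dist.destroy_process_group()

if __name__ == "__main__":
    main()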

Debugging

PyTorch provides a number of debugging tools, but make sure to use them only when required, since they will otherwise slow down the training of your model.
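
One such tool is autograd anomaly detection, which pinpoints the forward operation that produced a NaN or inf gradient at the cost of slowing every backward pass; the model, input, target and loss_fn below are placeholders from earlier examples:

import torch

# Enable only while hunting a bad gradient; keep it off in normal runs
with torch.autograd.set_detect_anomaly(True):
    output = model(input)
    loss = loss_fn(output, target)
    loss.backward()   # raises an error pointing at the op that produced NaN/inf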

Gradient Clipping

Both exploding and vanishing gradients are concerns while training deep learning models. For the exploding-gradient issue, gradient clipping is a good remedy that helps the learning algorithm converge. In PyTorch you can do this with torch.nn.utils.clip_grad_norm_, as sketched below.
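
A minimal sketch inside the training loop (the max_norm of 1.0 is an arbitrary illustrative choice):

import torch

loss.backward()
# Rescale gradients so that their global norm does not exceed max_norm
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()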

Normalisation

An old but very effective data-processing strategy is to normalise the data before feeding it to the model. Not only the input data: hidden layers may also receive inputs with varying distributions, commonly known as 'internal covariate shift', for which batch normalisation is a proposed remedy. Make sure you are using these techniques in your implementation of a deep learning model.
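
A small sketch of both ideas, with made-up feature dimensions: standardising the inputs and inserting a BatchNorm layer between hidden layers:

import torch
from torch import nn

# Standardise the inputs (per-feature mean 0, std 1)
x = torch.randn(32, 128)
x = (x - x.mean(dim=0)) / (x.std(dim=0) + 1e-8)

# Batch normalisation applied to hidden activations
model = nn.Sequential(
    nn.Linear(128, 64),
    nn.BatchNorm1d(64),   # normalises activations across the batch
    nn.ReLU(),
    nn.Linear(64, 10),
)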

Pruning and Quantization

Pruning and quantization are techniques to reduce a model's size and compute cost; they primarily speed up inference, and a smaller model can also ease memory pressure during training. PyTorch has built-in support for both, as sketched below.
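
A minimal sketch of each (the layer sizes are placeholders; dynamic quantization here targets inference):

import torch
from torch import nn
from torch.nn.utils import prune

# Pruning: zero out the 30% of weights with the smallest L1 magnitude
layer = nn.Linear(128, 64)
prune.l1_unstructured(layer, name="weight", amount=0.3)

# Dynamic quantization: convert Linear weights to int8 for inference
float_model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
quantized_model = torch.quantization.quantize_dynamic(
    float_model, {nn.Linear}, dtype=torch.qint8
)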

Profiling

Profiling the training process can help you identify bottlenecks and optimize training speed. PyTorch provides built-in support for this: the torch.profiler API includes profile(), which lets you inspect the cost of different operators inside your model, both on the CPU and GPU, as in the sketch below.
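
A short sketch of profiling one forward pass (model and input are assumed to exist already):

import torch
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    output = model(input)

# Print the operators sorted by total CUDA time
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))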

Finally, make sure to set the training and evaluation flags correctly in your PyTorch script using the model.train() and model.eval() methods. You can also turn off gradient computation during validation or testing with torch.no_grad(), as sketched below.
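
A short sketch of an evaluation loop (val_loader is a placeholder for your validation DataLoader):

import torch

model.eval()                 # eval-mode behaviour for dropout / batch norm
with torch.no_grad():        # no autograd graph is built, saving memory and time
    for input, target in val_loader:
        output = model(input)
        # ... accumulate metrics here

model.train()                # switch back before the next training epoch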

References

[1] PyTorch documentation on automatic mixed precision: https://pytorch.org/docs/stable/notes/amp_examples.html#amp-examples

[2] Efficient DL training Guide: https://pytorch.org/docs/stable/notes/amp_examples.html#amp-examples

[3] Thanks to my colleagues and research community around me who taught me or showed me various tricks possible while writing code.

[4] Tutorial for gradient checkpointing: https://github.com/prigoyal/pytorch_memonger/blob/master/tutorial/Checkpointing_for_PyTorch_models.ipynb

[5] Gradient Accumulation strategy: https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255

Much of the content in this post is drawn from the documentation and articles mentioned in the references section, for which I claim no credit; this blogpost is just an accumulation of noteworthy points and tricks. There might be some other sources from my reading as well.

Keep Learning, Keep Hustling


Nikhil Verma

Knowledge shared is knowledge squared | My Portfolio https://lihkinverma.github.io/portfolio/ | My blogs are living document, updated as I receive comments