Deep Learning Optimizers

Nikhil Verma
2 min read · Feb 17, 2023

The choice of optimization algorithm depends on several factors, including the specific problem you’re trying to solve, the size and complexity of your dataset, and the resources (such as computational power and memory) you have available.

  • Batch SGD with Momentum is a good choice for problems that have a lot of data and can be parallelized easily. It is often more efficient than standard stochastic gradient descent because it accumulates previous updates into a velocity term, so each step benefits from the direction of the steps before it (see the sketch after this list).
  • Nesterov Accelerated Gradient (NAG) is a variant of momentum optimization that looks ahead to where the parameters are about to move, evaluates the gradient at that future point, and adjusts the update before it overshoots. It’s a good choice for problems that are sensitive to the learning rate hyperparameter.
  • AdaGrad is good for problems where the data has a lot of noise and the cost function is ill-conditioned, i.e. the different dimensions of the cost function are not on the same scale. It adapts the learning rate to the parameters, adjusting it for each parameter individually based on the accumulated history of squared gradients, which helps the model converge more quickly.
  • Adadelta is a variant of AdaGrad that uses a decaying moving average of the squared gradients instead of their ever-growing sum; this helps to reduce the aggressive, monotonic decay of the learning rate that AdaGrad suffers from.
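For concreteness, here is a minimal NumPy sketch of the four update rules discussed above. The function names, the small `state` dict used to carry running statistics, and the hyperparameter defaults (`lr`, `beta`, `rho`, `eps`) are illustrative assumptions for this post, not any library’s API.

```python
import numpy as np

def sgd_momentum(w, grad, state, lr=0.01, beta=0.9):
    # Velocity accumulates an exponentially weighted sum of past gradients.
    v = state.get("v", np.zeros_like(w))
    v = beta * v - lr * grad
    state["v"] = v
    return w + v

def nesterov(w, grad_fn, state, lr=0.01, beta=0.9):
    # NAG evaluates the gradient at the look-ahead point w + beta * v.
    v = state.get("v", np.zeros_like(w))
    v = beta * v - lr * grad_fn(w + beta * v)
    state["v"] = v
    return w + v

def adagrad(w, grad, state, lr=0.01, eps=1e-8):
    # Accumulate the sum of squared gradients per parameter;
    # parameters with a large gradient history take smaller steps.
    g2 = state.get("g2", np.zeros_like(w)) + grad ** 2
    state["g2"] = g2
    return w - lr * grad / (np.sqrt(g2) + eps)

def adadelta(w, grad, state, rho=0.95, eps=1e-6):
    # Replace AdaGrad's ever-growing sum with decaying moving averages
    # of squared gradients and squared updates, so steps never vanish.
    eg2 = state.get("eg2", np.zeros_like(w))
    edx2 = state.get("edx2", np.zeros_like(w))
    eg2 = rho * eg2 + (1 - rho) * grad ** 2
    dx = -np.sqrt(edx2 + eps) / np.sqrt(eg2 + eps) * grad
    edx2 = rho * edx2 + (1 - rho) * dx ** 2
    state["eg2"], state["edx2"] = eg2, edx2
    return w + dx

# Toy usage: a few AdaGrad steps on f(w) = 0.5 * ||w||^2, whose gradient is w.
w, state = np.array([1.0, -2.0]), {}
for _ in range(100):
    w = adagrad(w, w, state, lr=0.5)
```

Each function consumes the current parameters and gradient and returns the updated parameters, keeping its running statistics (velocity or squared-gradient averages) in the shared `state` dict between calls.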

Nikhil Verma

Knowledge shared is knowledge squared | My Portfolio https://lihkinverma.github.io/portfolio/ | My blogs are living documents, updated as I receive comments