

What is Gradient Descent? What modifications exist?

Gradient Descent is an iterative optimization method that tries to minimize a loss function $f(\theta)$. The update rule is:

$$\theta^{\prime} = \theta - \lambda \cdot \nabla_{\theta} f(\theta)$$

The algorithm works in the following steps:

  • Start at some (random) point
  • Calculate the gradient of the target function w.r.t. the parameters $\theta$
  • Move down the slope (using the gradient) with a step size of $\lambda$
  • Continue until the algorithm reaches an optimum
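The steps above can be sketched in a few lines of Python. The quadratic loss, starting point, and step count are illustrative choices, not part of the original algorithm:

```python
# Minimal sketch of Gradient Descent on f(theta) = (theta - 3)^2,
# whose gradient is 2 * (theta - 3). Function and hyperparameters
# here are illustrative assumptions.

def gradient_descent(grad, theta, lr=0.1, steps=100):
    for _ in range(steps):
        theta = theta - lr * grad(theta)  # theta' = theta - lambda * grad f(theta)
    return theta

grad = lambda theta: 2 * (theta - 3)

theta_opt = gradient_descent(grad, theta=0.0)
print(round(theta_opt, 4))  # converges near the minimum at theta = 3
```

Each iteration moves `theta` a small step against the gradient, so the loss shrinks until the updates become negligible near the optimum.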

Learning rate

The choice of learning rate is crucial when using optimization algorithms. Let's look at an example:

Effect of learning rate on Gradient Descent

  • A low learning rate slows down training; you might never get close to the optimal value.
  • A good learning rate allows for fast training and finds optimal values.
  • A high learning rate allows for fast training but overshoots the optimal values.
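You can see all three regimes numerically with the same update rule on a hypothetical one-parameter loss; the specific rates below are illustrative choices:

```python
# Hypothetical demo: the same Gradient Descent update with three
# learning rates on f(theta) = theta^2 (gradient: 2 * theta).

def run(lr, theta=1.0, steps=20):
    for _ in range(steps):
        theta = theta - lr * 2 * theta
    return abs(theta)  # distance from the optimum at theta = 0

print(f"low  lr=0.01: {run(0.01):.4f}")  # still far from 0 (slow)
print(f"good lr=0.40: {run(0.40):.4f}")  # essentially converged
print(f"high lr=1.05: {run(1.05):.4f}")  # overshoots and diverges
```

The low rate barely moves in 20 steps, the good rate reaches the optimum almost immediately, and the high rate bounces past the minimum with growing amplitude.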

Stochastic Gradient Descent

Gradient Descent computes the gradient over the whole dataset. In practice, you can't hold a large dataset in memory (unless you have compute to spare).

Stochastic Gradient Descent (SGD) solves this problem by choosing a small sample of the data (a minibatch) and applying the Gradient Descent update to it.

Unfortunately, more tricks are needed to get good convergence. A wide array of optimizers is available that adapt the learning rate during training and help escape saddle points.
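A minimal sketch of minibatch SGD, fitting a one-parameter linear model on synthetic data. The data, batch size, and learning rate are illustrative assumptions:

```python
import random

# Minibatch SGD sketch for a 1-D linear model y_hat = w * x, fit on
# synthetic data generated with a true weight of 2.0.

random.seed(0)
data = [(x / 100, 2.0 * x / 100) for x in range(100)]

w, lr, batch_size = 0.0, 0.1, 8
for epoch in range(50):
    random.shuffle(data)  # draw minibatches in a fresh order each epoch
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        # gradient of the mean squared error w.r.t. w on this minibatch only
        grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
        w = w - lr * grad  # same Gradient Descent update, one batch at a time

print(round(w, 3))  # converges near the true weight 2.0
```

Each update touches only `batch_size` examples, so memory use stays constant regardless of dataset size; the noise from sampling is the price paid for that.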


Copyright © 2021 MLExpert by Venelin Valkov. All rights reserved.