A.I, Data and Software Engineering

Optimization: Gradient descent with python


When training your models, there are quite a few optimisers to use. Since many researchers adopt Adam optimiser, there are also reports the instability of the optimiser in some cases. Nevertheless, it is worth to note that Adam is a variant of the basic stochastic gradient descent (SGD), which turn, is a variant of the famous gradient descent optimization algorithm.

Gradient descent 

GD is a first-order iterative optimization algorithm for finding the minimum of a function. To find a local minimum of a function using GD, one takes steps proportional to the negative of the gradient.  (or approximate gradient of the function at the current point).

gradient descent on a series of level sets
Illustration of gradient descent on a series of level sets. (Src: Wikipedia)

The algorithm

Gradient descent is based on the observation that if the multi-variable function F(x) is defined and differentiable in a neighbourhood of a point a, then  F(x) decreases fastest if one goes from a in the opposite direction of the gradient of F at a, i.e. negative gradient: -\nabla F(\mathbf{a}).

It follows that if a_{n+1} = a_{n} - \gamma\nabla F(a_n) (for \gamma \in R_+ small enough), then F(a_n) \ge F(a_{n+1}).

Example of a single variable function with python: f(x) = x^4 -3x^3 + 2, with the gradient f'(x) = 4x^3 - 9x^2, the local minimum x = \frac{9}{4} = 2.25

With GD, we can approximate the value.

next_x = 6  # We start the search at x=6
gamma = 0.01  # Step size multiplier
precision = 0.00001  # Desired precision of result
max_iters = 10000  # Maximum number of iterations
# Derivative function
def df(x):
    return 4 * x**3 - 9 * x**2
for _i in range(max_iters):
    current_x = next_x
    next_x = current_x - gamma * df(current_x) #update with negative gradient
    step = next_x - current_x
    if abs(step) <= precision:
print("Minimum at ", next_x)
# The output for the above will be something like
# "Minimum at 2.2499646074278457"

Gradient descent with jupyter

I demonstrate the use of GD algorithm while optimizing a loss function while training the classification problem.


Important general paradigm when

  1. continuously parameterized hypothesis
  2. the error can be differentiated with respect to the hypothesis parameters

The key practical problems are:

  1. converging to a local minimum can be quite slow
  2. if there are multiple local minima, then there is no guarantee that the procedure will find the global minimum. (Notice: The gradient descent algorithm can work with other error definitions and will not have a global minimum. If we use the sum of squares error, this is not a problem.)

Add comment


A.I, Data and Software Engineering

PetaMinds focuses on developing the coolest topics in data science, A.I, and programming, and make them so digestible for everyone to learn and create amazing applications in a short time.