# Optimization: Gradient descent with Python


When training your models, there are quite a few optimisers to choose from. While many researchers adopt the Adam optimiser, there are also reports of its instability in some cases. Nevertheless, it is worth noting that Adam is a variant of basic stochastic gradient descent (SGD), which, in turn, is a variant of the famous gradient descent (GD) optimization algorithm.
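To make that relationship concrete, here is a minimal sketch (the `grad_fn` callback and the parameter names are my own illustrative choices, not a specific library API) contrasting a full-batch GD update with an SGD update that estimates the gradient from a random mini-batch:

```python
import numpy as np

def gd_step(w, X, y, grad_fn, lr=0.01):
    # Full-batch gradient descent: use the gradient computed over the whole dataset.
    return w - lr * grad_fn(w, X, y)

def sgd_step(w, X, y, grad_fn, lr=0.01, batch_size=32):
    # Stochastic gradient descent: estimate the gradient from a random mini-batch.
    idx = np.random.choice(len(X), size=min(batch_size, len(X)), replace=False)
    return w - lr * grad_fn(w, X[idx], y[idx])
```

Adam builds on the same negative-gradient update, adding per-parameter adaptive step sizes and momentum terms.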

GD is a first-order iterative optimization algorithm for finding the minimum of a function. To find a local minimum of a function using GD, one takes steps proportional to the negative of the gradient (or approximate gradient) of the function at the current point.

### The algorithm

Gradient descent is based on the observation that if the multi-variable function $F(\mathbf{x})$ is defined and differentiable in a neighbourhood of a point $\mathbf{a}$, then $F(\mathbf{x})$ decreases fastest if one goes from $\mathbf{a}$ in the direction of the negative gradient of $F$ at $\mathbf{a}$, i.e. $-\nabla F(\mathbf{a})$.

It follows that if $\mathbf{a}_{n+1} = \mathbf{a}_n - \gamma \nabla F(\mathbf{a}_n)$ (for $\gamma$ small enough), then $F(\mathbf{a}_n) \geq F(\mathbf{a}_{n+1})$.

Example of a single-variable function in Python: $f(x) = x^4 - 3x^3$, with the gradient $f'(x) = 4x^3 - 9x^2$ and a local minimum at $x = 9/4 = 2.25$. With GD, we can approximate this value:

```python
next_x = 6  # We start the search at x = 6
gamma = 0.01  # Step size multiplier
precision = 0.00001  # Desired precision of result
max_iters = 10000  # Maximum number of iterations

# Derivative of f(x) = x**4 - 3*x**3
def df(x):
    return 4 * x**3 - 9 * x**2

for _i in range(max_iters):
    current_x = next_x
    next_x = current_x - gamma * df(current_x)  # update in the direction of the negative gradient
    step = next_x - current_x
    if abs(step) <= precision:
        break

print("Minimum at", next_x)

# The output for the above will be something like
# "Minimum at 2.2499646074278457"
```

Below, I demonstrate the use of the GD algorithm to optimize a loss function while training a classifier.
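As a rough sketch of what that can look like (the two-blob synthetic dataset and the logistic-regression cross-entropy loss below are assumptions made for illustration, not a specific benchmark):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary classification data: two Gaussian blobs in 2D
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b = np.zeros(2), 0.0
gamma = 0.1
for _ in range(2000):
    p = sigmoid(X @ w + b)           # predicted probabilities
    grad_w = X.T @ (p - y) / len(y)  # gradient of the cross-entropy loss w.r.t. w
    grad_b = np.mean(p - y)          # gradient w.r.t. the bias
    w -= gamma * grad_w              # negative-gradient update, as above
    b -= gamma * grad_b

acc = np.mean((sigmoid(X @ w + b) > 0.5) == y)
print("Training accuracy:", acc)
```

Each iteration moves the weights against the gradient of the loss, exactly as in the single-variable example above, only now the "function" being minimised is the training error.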

### Conclusion

Gradient descent can be applied whenever:

1. the hypothesis is continuously parameterized
2. the error can be differentiated with respect to the hypothesis parameters

The key practical problems are:

1. converging to a local minimum can be quite slow
2. if there are multiple local minima, there is no guarantee that the procedure will find the global minimum. (Note: the gradient descent algorithm can also work with other error definitions, which may not have a single global minimum; if we use the sum-of-squares error, this is not a problem.)
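To illustrate the second problem, here is a small sketch (the function $f(x) = x^4 - 4x^2 + x$ is my own example, chosen because it has two distinct local minima): the same GD loop converges to different minima depending on the starting point.

```python
def df(x):
    # Derivative of f(x) = x**4 - 4*x**2 + x, which has two local minima
    return 4 * x**3 - 8 * x + 1

def gd(x, gamma=0.01, max_iters=10000, precision=1e-6):
    for _ in range(max_iters):
        step = gamma * df(x)
        x = x - step
        if abs(step) <= precision:
            break
    return x

print(gd(2.0))   # converges to the shallower minimum near x ≈ 1.35
print(gd(-2.0))  # converges to the deeper (global) minimum near x ≈ -1.47
```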