When training your models, there are quite a few optimisers to choose from. Many researchers adopt the Adam optimiser, yet there are also reports of its instability in some cases. Nevertheless, it is worth noting that Adam is a variant of basic stochastic gradient descent (SGD), which, in turn, is a variant of the famous gradient descent (GD) optimization algorithm.
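To make that lineage concrete, here is a minimal sketch (my own toy illustration, not code from any particular framework) comparing a plain SGD update with an Adam-style update on a single parameter, minimising the simple function $f(\theta) = \theta^2$:

```python
import math

def sgd_step(theta, grad, lr=0.1):
    # Vanilla (stochastic) gradient descent: step against the gradient
    return theta - lr * grad

def adam_step(theta, grad, state, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    # Adam keeps exponential moving averages of the gradient (m)
    # and of its square (v), then rescales the step by sqrt(v)
    m, v, t = state
    t += 1
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)  # bias-corrected estimates
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, (m, v, t)

# Minimise f(theta) = theta**2, whose gradient is 2*theta
theta_sgd, theta_adam, state = 3.0, 3.0, (0.0, 0.0, 0)
for _ in range(100):
    theta_sgd = sgd_step(theta_sgd, 2 * theta_sgd)
    theta_adam, state = adam_step(theta_adam, 2 * theta_adam, state)

print(theta_sgd, theta_adam)  # both approach the minimum at 0
```

The core move is the same in both: step against the gradient. Adam only changes how the step size is adapted per parameter.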
GD is a first-order iterative optimization algorithm for finding the minimum of a function. To find a local minimum of a function using GD, one takes steps proportional to the negative of the gradient (or approximate gradient) of the function at the current point.
Gradient descent is based on the observation that if the multi-variable function $F(x)$ is defined and differentiable in a neighbourhood of a point $a$, then $F(x)$ decreases fastest if one goes from $a$ in the opposite direction of the gradient of $F$ at $a$, i.e. the negative gradient $-\nabla F(a)$.
It follows that if $a_{n+1} = a_n - \gamma \nabla F(a_n)$ for $\gamma$ small enough, then $F(a_n) \geq F(a_{n+1})$.
Example with a single-variable function in Python: $f(x) = x^4 - 3x^3$, with the gradient $f'(x) = 4x^3 - 9x^2$ and a local minimum at $x = 9/4$.
With GD, we can approximate this value numerically.
```python
next_x = 6            # We start the search at x=6
gamma = 0.01          # Step size multiplier
precision = 0.00001   # Desired precision of result
max_iters = 10000     # Maximum number of iterations

# Derivative function
def df(x):
    return 4 * x**3 - 9 * x**2

for _i in range(max_iters):
    current_x = next_x
    next_x = current_x - gamma * df(current_x)  # update with negative gradient
    step = next_x - current_x
    if abs(step) <= precision:
        break

print("Minimum at", next_x)
# The output for the above will be something like
# "Minimum at 2.2499646074278457"
```
Gradient descent with Jupyter
I demonstrate the use of the GD algorithm by optimizing a loss function while training a model on a classification problem.
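As a stand-alone sketch of that idea (a hypothetical toy dataset and a hand-rolled logistic model, not the notebook itself), here is full-batch gradient descent minimising the cross-entropy loss of a linear classifier:

```python
import math
import random

# Toy binary classification: label is 1 when the point lies above the line y = x
random.seed(0)
data = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(200)]
labels = [1.0 if y > x else 0.0 for x, y in data]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w1, w2, b = 0.0, 0.0, 0.0  # model parameters
lr = 0.5                   # learning rate
n = len(data)

for epoch in range(200):
    g1 = g2 = gb = 0.0
    for (x, y), t in zip(data, labels):
        p = sigmoid(w1 * x + w2 * y + b)
        # Gradient of the cross-entropy loss w.r.t. each parameter
        g1 += (p - t) * x
        g2 += (p - t) * y
        gb += (p - t)
    # One full-batch gradient descent step (negative gradient direction)
    w1 -= lr * g1 / n
    w2 -= lr * g2 / n
    b -= lr * gb / n

correct = sum(
    (sigmoid(w1 * x + w2 * y + b) > 0.5) == (t == 1.0)
    for (x, y), t in zip(data, labels)
)
print("accuracy:", correct / n)
```

The update is exactly the $a_{n+1} = a_n - \gamma \nabla F(a_n)$ rule from above, applied to the three parameters of the model.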
Gradient descent is an important general paradigm when:
- the hypothesis is continuously parameterized
- the error can be differentiated with respect to the hypothesis parameters
The key practical problems are:
- converging to a local minimum can be quite slow
- if there are multiple local minima, there is no guarantee that the procedure will find the global minimum. (Note: gradient descent also works with other error definitions whose surfaces have multiple local minima; with the sum-of-squares error of a linear model, the error surface has a single global minimum, so this is not a problem.)
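The local-minimum caveat is easy to demonstrate. Here is a small sketch (my own example function, chosen because it has two minima) showing that the same GD loop converges to different minima depending on where the search starts:

```python
def df(x):
    # Derivative of f(x) = x**4 - 2*x**2, which has minima at x = -1 and x = 1
    return 4 * x**3 - 4 * x

def gd(start, gamma=0.01, precision=1e-6, max_iters=10000):
    x = start
    for _ in range(max_iters):
        step = gamma * df(x)
        x -= step  # negative gradient update
        if abs(step) <= precision:
            break
    return x

print(gd(2.0))   # starting right of the origin, GD converges near  1
print(gd(-2.0))  # starting left of the origin, GD converges near -1
```

Neither run is "wrong": each finds the local minimum in whose basin of attraction it started, which is all plain gradient descent can guarantee.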