Chapter 6: First order methods

6.4 Gradient descent

In the previous Section we examined some fundamental characteristics of the tangent line / hyperplane defined by a function's first order Taylor series approximation. In particular we saw how the negative gradient at a point provides a valid descent direction for the function itself, i.e., a direction that (at least locally) decreases the value of the function. With this fact in hand it is then quite natural to ask: can we construct a local optimization method using the negative gradient at each step as our descent direction? The answer is a resounding "yes", and the resulting scheme is precisely the gradient descent algorithm we aim to describe in full in this Chapter, and in this Section in particular.


6.4.1 The gradient descent algorithm

As we introduced in the previous Chapter, a local optimization method is one where we aim to find minima of a given function by beginning at some point $\mathbf{w}^0$ and taking a number of steps $\mathbf{w}^1, \mathbf{w}^2, \mathbf{w}^3,...,\mathbf{w}^{K}$ of the generic form

\begin{equation} \mathbf{w}^{\,k} = \mathbf{w}^{\,k-1} + \alpha \mathbf{d}^{\,k}. \end{equation}

where $\mathbf{d}^{\,k}$ are direction vectors (which ideally are descent directions that lead us to lower and lower parts of a function) and $\alpha$ is called the steplength parameter. In Chapter 5 we saw one simple example of a local method called random local search, which was fatally flawed by the way it determines each descent direction $\mathbf{d}^{\,k}$ - i.e., via random search - an approach that grows exponentially more inefficient as the dimension of a function's input increases (see Section 5.3 for a complete introduction to this concept).

In our current setting - where we have just reviewed (in the previous Section) how the negative gradient $-\nabla g\left(\mathbf{w}\right)$ of a function $g\left(\mathbf{w}\right)$ computed at a particular point always defines a valid descent direction there - we are perfectly poised to think about another local method employing first order descent directions. We could very naturally ask what a local method employing the negative gradient direction at each step might look like, and how it might behave - i.e., setting $\mathbf{d}^{k} = -\nabla g\left(\mathbf{w}^{k-1}\right)$ in the above formula. Such a sequence of steps would then take the form

\begin{equation} \mathbf{w}^{\,k} = \mathbf{w}^{\,k-1} - \alpha \nabla g\left(\mathbf{w}^{k-1}\right) \end{equation}

and it seems intuitive - at least at the outset - that because each and every direction here is guaranteed to be one of descent, taking such steps (provided we set $\alpha$ appropriately, as we must do whenever using a local optimization method) could lead us to a point near a local minimum of any function.
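To make this update formula concrete before developing the full algorithm below, here is a minimal sketch of the step itself applied to a simple quadratic whose gradient we code by hand - the function, initial point, steplength, and number of steps are illustrative choices only, not part of the implementation we build later in this Section.

import numpy as np

# a simple quadratic g(w) = w_1^2 + w_2^2 and its hand-computed gradient
g = lambda w: np.sum(w**2)
grad = lambda w: 2*w

# illustrative choices of initial point, steplength, and number of steps
w = np.array([3.0,-2.0])
alpha = 0.1
for k in range(10):
    w = w - alpha*grad(w)    # the gradient descent step

print(w, g(w))    # w has moved much closer to the minimum at the origin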

Indeed this is precisely the gradient descent algorithm. It is called gradient descent because - in employing the (negative) gradient as our descent direction - we are repeatedly descending in the (negative) gradient direction at each step.

The gradient descent algorithm is a local optimization method where - at each step - we employ the negative gradient as our descent direction.

Appreciate the power of this descent direction - which is almost literally given to us - over the zero order methods detailed in the previous Chapter. There we had to search to find a descent direction; here calculus provides us not only with a descent direction (without any searching), but an excellent one to boot.

The path taken by gradient descent is illustrated figuratively below for a general single-input function. At each step of this local optimization method we can think about drawing the first order Taylor series approximation to the function, and taking the descent direction of this tangent hyperplane (the negative gradient of the function at this point) as our descent direction for the algorithm. Beginning at the point $w^0$ the point at which we make our first approximation is drawn below as a red dot, with the first order Taylor series approximation drawn in green. Moving in the negative gradient descent direction provided by this approximation we arrive at a point $w^1 = w^0 - \alpha \frac{\partial}{\partial w}g\left(w^0\right)$ (remember - for a single input function the gradient is simply a single derivative), having taken our first gradient descent step. We then repeat this process at $w^1$, moving in the negative gradient direction there, to $w^2 = w^1 - \alpha \frac{\partial}{\partial w}g\left(w^1\right)$, and so forth.

Figure 1: A figurative drawing of the gradient descent algorithm. The first order Taylor series approximation - and the *negative gradient* of the function in particular - provides an excellent and easily computed descent direction at each step of this local optimization method (here a number of Taylor series approximations are shown in green, and evaluations by the function and linear approximations are shown as red dots and crosses respectively). Employing these directions at each step the *gradient descent algorithm* can be used to properly minimize generic functions. Moreover, unlike the random local search algorithm, gradient descent scales very well with input dimension since the descent direction of a hyperplane is much more easily computed in high dimensions.

As we will see here and throughout many of our future Chapters, the gradient descent algorithm is a far better local optimization algorithm than the zero order approaches discussed in the previous Chapter (indeed it is probably the most popular optimization algorithm used in machine learning / deep learning today). This is entirely due to the fact that the descent direction here - provided by calculus via the gradient - is universally easier to compute (particularly as the dimension of the input increases) than e.g., seeking out a descent direction at random. To this point (as discussed in Chapter 3) we can very easily compute the gradient of almost any generic function of interest using a gradient calculator known as an Automatic Differentiator. So in practice we need not even worry about computing the form of a function's gradient 'by hand'.

Below we provide the generic pseudo-code and Python implementation of the gradient descent algorithm which will be used in a variety of examples that follow in this Section. We describe the most basic yet perhaps the most common implementation, but there are a number of practical variations one can use in practice - like e.g., different halting conditions other than a maximum number of steps, or what values are returned by the method - which we touch on below as well.

The gradient descent algorithm


1:   input: function $g$, steplength $\alpha$, maximum number of steps $K$, and initial point $\mathbf{w}^0$
2:   for $\,\,k = 1...K$
3:           $\mathbf{w}^k = \mathbf{w}^{k-1} - \alpha \nabla g\left(\mathbf{w}^{k-1}\right)$
4:  output: history of weights $\left\{\mathbf{w}^{k}\right\}_{k=0}^K$ and corresponding function evaluations $\left\{g\left(\mathbf{w}^{k}\right)\right\}_{k=0}^K$


Note in particular the return values we have chosen to output in our pseudo-code above: both the entire sequence of gradient descent steps $\left\{\mathbf{w}^{k}\right\}_{k=0}^K$ and the corresponding cost function evaluations $\left\{g\left(\mathbf{w}^{k}\right)\right\}_{k=0}^K$ computed while running the algorithm are returned. As we explore via examples below, the entire weight history is typically returned so that we - as good engineers - can visualize the progress of our algorithm for a given choice of the steplength parameter $\alpha$. The corresponding sequence of cost function evaluations is also returned for the same reason. It is especially important to recognize that when employing an Automatic Differentiator to compute each gradient evaluation - as we do here (we use autograd throughout this and future Chapters, but one can construct their own tool following the discussion in Chapter 3) - we get each of these evaluations 'for free' when computing the corresponding gradient evaluation (this is the very nature of Automatic Differentiation computation - as described in Chapter 3). So - in terms of evaluating the function itself - we only need to do this ourselves at the final step (since we do not compute the gradient there).
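Before turning to the implementation in the next Python cell, the short sketch below illustrates this 'for free' behavior: a single call to the function produced by autograd's value_and_grad returns both the cost and gradient evaluations at a point (the particular test function used here is just an illustrative choice).

import autograd.numpy as np
from autograd import value_and_grad

# an illustrative test function
g = lambda w: np.sum(w**2)

# a single call returns both g(w) and its gradient evaluated at w
gradient = value_and_grad(g)
cost_eval,grad_eval = gradient(np.array([1.0,2.0]))
print(cost_eval)    # prints 5.0
print(grad_eval)    # prints [2. 4.]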

In the next Python cell we implement gradient descent as described above. It involves just a few requisite initializations, the computation of the gradient function via e.g., an Automatic Differentiator, and the very simple for loop. The output is a history of the weights and corresponding cost function values at each step of the gradient descent algorithm.

In [3]:
# using an automatic differentiator - like the one imported via the statement below - makes coding up gradient descent a breeze
from autograd import value_and_grad 

# gradient descent function - inputs: g (input function), alpha (steplength parameter), max_its (maximum number of iterations), w (initialization)
def gradient_descent(g,alpha_choice,max_its,w):
    # compute the gradient function of our input function - note this is a function too
    # that - when evaluated - returns both the gradient and function evaluations (remember
    # as discussed in Chapter 3 we always get the function evaluation 'for free' when we use
    # an Automatic Differentiator to evaluate the gradient)
    gradient = value_and_grad(g)

    # run the gradient descent loop
    weight_history = []      # container for weight history
    cost_history = []        # container for corresponding cost function history
    alpha = 0
    for k in range(1,max_its+1):
        # check if diminishing steplength rule used
        if alpha_choice == 'diminishing':
            alpha = 1/float(k)
        else:
            alpha = alpha_choice
        
        # evaluate the gradient, store current weights and cost function value
        cost_eval,grad_eval = gradient(w)
        weight_history.append(w)
        cost_history.append(cost_eval)

        # take gradient descent step
        w = w - alpha*grad_eval
            
    # collect final weights
    weight_history.append(w)
    # compute final cost function value via g itself (since we aren't computing 
    # the gradient at the final step we don't get the final cost function value 
    # via the Automatic Differentiator)
    cost_history.append(g(w))  
    return weight_history,cost_history

Notice that in terms of steplength parameter choices we have included in this implementation - as with any local method - the ability to use either a fixed steplength value or the particular diminishing steplength rule $\alpha = \frac{1}{k}$.
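As a usage sketch - where the test function, initialization, steplength value, and iteration count are all illustrative choices - the implementation above can be called with either a fixed numerical steplength or the string 'diminishing' as follows.

import autograd.numpy as np

# an illustrative test function
g = lambda w: np.dot(w,w)

# run once with a fixed steplength and once with the diminishing rule alpha = 1/k
w0 = np.array([3.0,4.0])
weights_fixed,costs_fixed = gradient_descent(g,alpha_choice = 0.1,max_its = 50,w = w0)
weights_dim,costs_dim = gradient_descent(g,alpha_choice = 'diminishing',max_its = 50,w = w0)

print(costs_fixed[-1],costs_dim[-1])    # both final costs lie near the minimum value 0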

6.4.2 General behavior of gradient descent

In this Subsection we walk through a number of simple examples using the gradient descent code above, many of which exhibit more general behaviors of the algorithm. This includes

  • Gradient descent converges to stationary points of a function: because we employ the (negative) gradient as our descent direction, the algorithm converges at or near stationary points
  • Gradient descent makes great progress far from stationary points, but crawls to a halt near them: in addition to the steplength parameter $\alpha$, the magnitude of the gradient controls how far we travel at each step of gradient descent, and this has both positive and negative consequences for performance
  • Non-convexity is not an inherent problem for gradient descent, but saddle points are: due to this same convergence behavior, saddle points of non-convex functions can halt the progress of gradient descent
  • Gradient descent scales well with input dimension: because the gradient can be computed extremely efficiently, gradient descent scales very well, computationally speaking, with the size of the input dimension
  • Just like all local methods, one needs to carefully choose the steplength parameter $\alpha$: like all local optimization methods, the steplength parameter $\alpha$ must be set properly in order to provoke timely convergence, with the most popular choices in machine learning / deep learning being the fixed and diminishing steplength rules introduced in Section 5.3

Let's dig deeper into the details and basic behavior of the gradient descent algorithm by looking at a few simple examples.

Example 1. Gradient descent converges to stationary points of a function

Here we use gradient descent to minimize the polynomial function

\begin{equation} g(w) = \frac{1}{50}\left(w^4 + w^2 + 10w\right). \end{equation}

In Section 6.1 on the first order optimality condition we saw that the global minimum of this function - which is not easy to compute by hand - is given explicitly as

\begin{equation} w = \frac{\sqrt[3]{\sqrt{2031} - 45}}{6^{2/3}} - \frac{1}{\sqrt[3]{6\left(\sqrt{2031}-45\right)}}. \end{equation}
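As a quick, purely illustrative sanity check, we can evaluate this closed form expression numerically and confirm that it is a stationary point of $g$ by plugging it into the derivative $\frac{\partial}{\partial w}g\left(w\right) = \frac{1}{50}\left(4w^3 + 2w + 10\right)$.

# evaluate the closed form minimum and plug it into the derivative of g
w_star = (2031**0.5 - 45)**(1.0/3)/6**(2.0/3) - 1/(6*(2031**0.5 - 45))**(1.0/3)
g_prime = lambda w: (1/float(50))*(4*w**3 + 2*w + 10)
print(w_star)             # approximately -1.23
print(g_prime(w_star))    # essentially zero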

With gradient descent we can easily determine a point that is arbitrarily close to this one. This illustrates the general principle that gradient descent - and local search in general - can compute minima of functions that we cannot compute by hand using calculus alone.

Below we animate the process of gradient descent applied to minimize this function. We initialize the algorithm at $w^0 = 2.5$, set the steplength constant for all steps as $\alpha = 1$, and run for 25 iterations. As you move the slider left to right the descent process as described above is shown visually - including the illustration of the tangent line. We mark the evaluation of each step of the process on both the function and the tangent line for visualization purposes. Moving the slider all the way to the right we can see that the algorithm begins to slow down considerably near the global minimum of the function.
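For those who wish to reproduce the end result of this run numerically, the brief sketch below - a non-animated stand-in that uses the gradient_descent implementation given earlier, with the same choices of $w^0 = 2.5$, $\alpha = 1$, and 25 steps - simply runs the algorithm and prints the final point reached, which lies close to the closed form minimum shown above.

# the polynomial from this Example
g = lambda w: (1/float(50))*(w**4 + w**2 + 10*w)

# run gradient descent with the settings described above and inspect the final step
weight_history,cost_history = gradient_descent(g,alpha_choice = 1,max_its = 25,w = 2.5)
print(weight_history[-1])    # close to the global minimum near w = -1.23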
