In the previous Section on general lines and hyperplanes we described how the slope(s) of a line / hyperplane define its direction of steepest (or greatest) ascent, and how the negative of the slope(s) defines its direction of steepest descent. These are the directions one must travel in (in the input space) to increase / decrease the value of the line / hyperplane as quickly as possible. Here we carry over this idea from the previous Section, studying its consequences for the first order Taylor series approximation (where our hyperplane is defined by a function's derivative(s)). This geometric anatomy defines the very essence of the gradient descent algorithm.
The derivative of a single-input function defines a tangent line at each point of its input domain - this is called its first order Taylor series approximation (see Chapter 3 for complete details). For a generic differentiable function $g(w)$ we can define this tangent line at each point $w^0$ as
\begin{equation} h(w) = g(w^0) + \frac{\mathrm{d}}{\mathrm{d}w}g(w^0)(w - w^0) \end{equation}In the context of the previous Section this linear approximation has easily computable directions of steepest ascent and descent - i.e., the directions we can travel in to increase / decrease the value of the line as quickly as possible. The steepest ascent direction - according to the previous Section - is given explicitly by the slope of this line, which is the derivative itself
\begin{equation} \text{steepest ascent direction of tangent line} = \frac{\mathrm{d}}{\mathrm{d}w}g(w^0). \end{equation}Likewise, the steepest descent direction is given by the negative slope of this line - which is the negative derivative
\begin{equation} \text{steepest descent direction of tangent line} = -\frac{\mathrm{d}}{\mathrm{d}w}g(w^0). \end{equation}Now keep in mind the context we are in here: this particular (tangent) line is built explicitly to closely approximate its underlying function near $w^0$. Because of this, its steepest ascent and descent directions tell us not just the directions to travel in order to increase / decrease the value of the line itself, but also the directions to travel (at least locally around the input point $w^0$ defining the tangent line) in order to increase / decrease the value of the underlying function. In the next Subsection we will see that this same idea holds for multi-input functions as well.
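Here is a minimal numerical sketch of this tangent line and how closely it hugs its function near $w^0$; the function $g$, the point $w^0$, and the probe point used below are illustrative choices, not taken from the text.

```python
import numpy as np

# First order Taylor series approximation (tangent line) of a single-input
# function g at a point w0:  h(w) = g(w0) + g'(w0) * (w - w0).
g  = lambda w: np.exp(-0.5 * w) + w**2            # an illustrative differentiable function
dg = lambda w: -0.5 * np.exp(-0.5 * w) + 2 * w    # its derivative, computed by hand

w0 = 1.0                                          # point at which the tangent line is built
h  = lambda w: g(w0) + dg(w0) * (w - w0)          # the tangent line h(w)

# near w0 the tangent line hugs the function, so their values nearly agree
print(abs(h(1.01) - g(1.01)))                     # small (on the order of 1e-4)
```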
Let's explore this idea through a few examples.
Below we visualize the ascent and descent directions, derived from the first order Taylor series approximation, over a fine set of points for the single-input quadratic $g(w) = 0.5w^2 + 1$. The derivative / steepest ascent direction is plotted as a vector in black along the horizontal axis, while the negative derivative / steepest descent direction is similarly shown in red. The slider mechanism below the plot allows you to adjust the point $w^0$ at which the approximation is being made.
Moving the slider back and forth across the input region, we can see that the steepest ascent direction given by the first order approximation always provides a direction of travel that increases not only the value of the line, but also that of the underlying function. Likewise, the steepest descent direction of the tangent line always provides a direction in which the function itself decreases.
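The same check can be made numerically. Below is a small sketch (using an arbitrarily chosen point $w^0$ and step length) confirming that, for the quadratic above, stepping along the derivative increases the function and stepping along its negative decreases it.

```python
# Numerical check for g(w) = 0.5*w**2 + 1: the derivative at w0 gives an ascent
# direction and its negative a descent direction for the function itself.
g  = lambda w: 0.5 * w**2 + 1      # the quadratic from the example above
dg = lambda w: w                   # its derivative d/dw g(w)

w0    = -1.5                       # illustrative point at which the tangent line is built
alpha = 0.1                        # illustrative (small) step length

print(g(w0 + alpha * dg(w0)) > g(w0))   # True: the ascent direction increases g
print(g(w0 - alpha * dg(w0)) < g(w0))   # True: the descent direction decreases g
```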
Of course a quadratic is a fairly simple creature, but nonetheless this idea holds for any differentiable function.
Below we produce the same sort of animation as shown in the previous example, only for the function
\begin{equation} g(w) = \text{sin}(3w) + 0.1w^2 + 1.5. \end{equation}The change in the ascent / descent direction of each tangent line is more interesting in this case due to the curvy nature of the function, but nonetheless you can see that at each point these directions provide ascent / descent directions for the function locally as well. In other words, if we follow the downward direction of any of the first order approximations drawn here it always leads us to smaller values of the function itself. Likewise for the ascent direction.
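To reinforce the point, here is a short sketch that sweeps over a fine grid of input points (the grid, step length, and comparison used here are illustrative) and checks that at every one of them the derivative gives a local ascent direction for this function, and its negative a local descent direction.

```python
import numpy as np

# Check across a fine grid of points that the derivative of
# g(w) = sin(3w) + 0.1*w**2 + 1.5 always gives a local ascent direction
# and its negative a local descent direction.
g  = lambda w: np.sin(3 * w) + 0.1 * w**2 + 1.5
dg = lambda w: 3 * np.cos(3 * w) + 0.2 * w        # derivative computed by hand

alpha = 1e-4                                      # small step so we stay local
grid  = np.linspace(-3, 3, 200)
ok = all(g(w + alpha * dg(w)) >= g(w) and         # ascent direction increases g ...
         g(w - alpha * dg(w)) <= g(w)             # ... descent direction decreases g
         for w in grid)
print(ok)                                         # True
```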
The same idea we have just explored for differentiable single-input functions holds for multi-input functions as well. Now with an $N$ dimensional input function $g\left(\mathbf{w}\right)$, instead of one derivative we have $N$ partial derivatives, one in each input direction, stacked into a vector called the gradient
\begin{equation} \nabla g\left(\mathbf{w}\right) = \begin{bmatrix} \frac{\partial}{\partial w_1}g\left(\mathbf{w}\right) \\ \frac{\partial}{\partial w_2}g\left(\mathbf{w}\right) \\ \vdots \\ \frac{\partial}{\partial w_N}g\left(\mathbf{w}\right) \end{bmatrix}. \end{equation}Likewise the first order Taylor series is now a tangent hyperplane, which at a point $\mathbf{w}^0$ has the formula (analogous to the single-input case)
\begin{equation} h(\mathbf{w}) = g(\mathbf{w}^0) + \nabla g(\mathbf{w}^0)^T(\mathbf{w} - \mathbf{w}^0). \end{equation}For a complete description of this set of ideas see Chapter 3.
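As a concrete sketch of these two objects, the snippet below approximates the gradient by finite differences (a hypothetical helper written here, not a library routine) and builds the tangent hyperplane $h(\mathbf{w})$ at a point; the example function and point are illustrative.

```python
import numpy as np

def numerical_gradient(g, w, eps=1e-6):
    """Approximate the gradient of g at w by central finite differences."""
    w    = np.asarray(w, dtype=float)
    grad = np.zeros_like(w)
    for n in range(w.size):
        e       = np.zeros_like(w)
        e[n]    = eps
        grad[n] = (g(w + e) - g(w - e)) / (2 * eps)   # partial derivative along axis n
    return grad

def tangent_hyperplane(g, w0):
    """Return the first order Taylor approximation h(w) = g(w0) + grad(w0)^T (w - w0)."""
    w0   = np.asarray(w0, dtype=float)
    grad = numerical_gradient(g, w0)
    return lambda w: g(w0) + grad @ (np.asarray(w, dtype=float) - w0)

# example usage with the quadratic studied below
g = lambda w: w[0]**2 + w[1]**2 + 6
h = tangent_hyperplane(g, [-1.0, 1.0])
print(numerical_gradient(g, [-1.0, 1.0]))   # approximately [-2.  2.]
print(h([-0.9, 1.1]))                       # close to g([-0.9, 1.1])
```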
In complete analogy to the single-input case, this linear approximation also has easily computable directions of steepest ascent and descent. Each partial derivative provides us with the steepest ascent direction along its respective coordinate axis
\begin{equation} \text{steepest ascent direction along $n^{th}$ axis} = \frac{\partial}{\partial w_n} g(\mathbf{w}^0). \end{equation}Likewise each negative partial derivative provides us with the steepest descent direction along its respective input axis.
\begin{equation} \text{steepest descent direction along $n^{th}$ axis} = - \frac{\partial}{\partial w_n} g(\mathbf{w}^0). \end{equation}The steepest ascent direction itself - with respect to the entire $N$ dimensional input space - is then given by the entire gradient
\begin{equation} \text{ascent direction of tangent hyperplane} = \nabla g(\mathbf{w}^0). \end{equation}Likewise, the descent direction is given by the negative vector of slopes of this tangent hyperplane - which is the negative gradient
\begin{equation} \text{descent direction of tangent hyperplane} = -\nabla g(\mathbf{w}^0). \end{equation}This holds regardless of the function we are examining - so long as it is differentiable. Thus we can say in general - since the gradient is the natural generalization of the derivative of a single-input function to the multi-input case - that the gradient always defines an ascent / descent direction of a general function.
The steepest ascent / descent directions of the first order Taylor series approximation tell us the directions we must travel in (at least locally, around where it most closely resembles its underlying function) in order to increase / decrease both the linear approximation and the underlying function. These directions are defined explicitly by the gradient of the function.
Let's examine a few examples to firm up this idea.
Below we illustrate this fact for the quadratic function
\begin{equation} g(w_1,w_2) = w_1^2 + w_2^2 + 6 \end{equation}showing the steepest ascent and descent directions given by its gradient at $\mathbf{w}^0 = \begin{bmatrix} -1\\ 1 \end{bmatrix}$. We plot the coordinate-wise steepest ascent directions given by the two partial derivatives in blue, with the gradient / ascent direction shown in black and the negative gradient / descent direction in red. The first order Taylor series approximation is shown in green. Here we can see that both the ascent and descent directions indeed provide directions of ascent / descent for the underlying function as well.
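The numbers behind this picture are easy to reproduce; the short sketch below (with an illustrative step length) evaluates the gradient at $\mathbf{w}^0 = \begin{bmatrix} -1\\ 1 \end{bmatrix}$ and checks that stepping along it / its negative raises / lowers the function.

```python
import numpy as np

# Gradient of g(w1, w2) = w1**2 + w2**2 + 6, with partial derivatives
# (2*w1, 2*w2) computed by hand, evaluated at w0 = [-1, 1].
g    = lambda w: w[0]**2 + w[1]**2 + 6
grad = lambda w: np.array([2 * w[0], 2 * w[1]])

w0 = np.array([-1.0, 1.0])
print(grad(w0))    # ascent direction:  [-2.  2.]
print(-grad(w0))   # descent direction: [ 2. -2.]

# a small step along each direction increases / decreases g itself
alpha = 0.1
print(g(w0 + alpha * grad(w0)) > g(w0) > g(w0 - alpha * grad(w0)))   # True
```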
Below we plot the same function and its first order Taylor series approximation at $\mathbf{w}^0 = \begin{bmatrix} -1\\ -1 \end{bmatrix}$, along with the corresponding direction vectors (with everything colored as above). Again we can see that the steepest ascent / descent directions of the hyperplane also provide ascent / descent directions for the underlying function.
Here we illustrate the ascent / descent directions - precisely as in the previous example - provided by the first order Taylor series approximation for the wavy sinusoidal function
\begin{equation} g(w_1,w_2) = 5 + \text{sin}(1.5w_1) - 2w_2. \end{equation}First we plot everything at $\mathbf{w}^0 = \begin{bmatrix} 0\\ 0 \end{bmatrix}$, then at a second input point. In both instances we can see how the ascent / descent directions of the tangent hyperplane provide steepest ascent / descent directions for the underlying function as well.
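For this function the gradient is just as easy to write down by hand; the sketch below evaluates it at the origin and checks (with an illustrative step length) that the negative gradient is indeed a descent direction.

```python
import numpy as np

# Gradient of g(w1, w2) = 5 + sin(1.5*w1) - 2*w2, with hand-computed partials
# (1.5*cos(1.5*w1), -2), evaluated at the origin.
g    = lambda w: 5 + np.sin(1.5 * w[0]) - 2 * w[1]
grad = lambda w: np.array([1.5 * np.cos(1.5 * w[0]), -2.0])

w0 = np.array([0.0, 0.0])
print(grad(w0))                            # ascent direction: [ 1.5 -2. ]
print(g(w0 - 0.1 * grad(w0)) < g(w0))      # True: the negative gradient descends
```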
We have now seen how the gradient of a function defines a proper ascent / descent direction for any input. It is important to remember that these directions live in the input space of the function, as drawn in the examples of the previous Subsection. Removing the vertical dimension from our three-dimensional plots we can still draw such a function via a contour plot, which for a given function $g\left(\mathbf{w}\right)$ shows constant slices $g\left(\mathbf{w}\right) = c$ of the function as lines or curves projected onto the input space. We have seen these sorts of contour plots in previous Sections as well.
Because both the gradient-defined ascent / descent directions and a function's contour plot are illustrated in the input space, we can visualize both together in one two-dimensional plot. When we do this an interesting characteristic of the gradient-defined directions presents itself, as we show in the examples below. What we will see is that the gradient ascent / descent directions at an input $\mathbf{w}^{\star}$ are always perpendicular to the contour $g\left(\mathbf{w}\right) = g\left(\mathbf{w}^{\star}\right)$ passing through that point. This statement can be rigorously (mathematically) proven to be true as well, which we do after the examples.
Below we show the contour plot of a quadratic function
\begin{equation} g\left(\mathbf{w}\right) = w_0^2 + w_1^2 + 2 \end{equation}along with the gradient descent directions defined at three random points. The contour plot is colored blue - with darker regions indicating where the function takes on larger values, and lighter regions where it takes on smaller values. Each of the points we choose is highlighted in a unique color, with the contour on which it sits colored in the same manner. The descent direction defined by the gradient at each point is drawn as an arrow, and the tangent line to the contour at each input is also drawn (in both instances colored the same as their respective point).
In each instance we can see how the gradient descent direction is perpendicular to the contour on which its point sits - in particular, it is perpendicular to the tangent line to the contour at that point. Since the gradient direction at each point is the normal vector of this tangent line, each tangent line can be written as
\begin{equation} \nabla g\left(\mathbf{v}\right)^T \left(\mathbf{w} - \mathbf{v}\right) = 0 \end{equation}where $\mathbf{v}$ denotes the point in question. Because the gradient ascent directions simply point in the opposite direction to the descent directions shown here, they too are perpendicular to the contours.
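This perpendicularity is easy to confirm numerically for the quadratic above, whose contours are circles centered at the origin. The sketch below (with an arbitrarily chosen contour radius, angle, and spacing) checks that the gradient at a contour point is essentially perpendicular to a short chord drawn to a nearby point on the same contour.

```python
import numpy as np

# For g(w) = w0**2 + w1**2 + 2 the contour of value r**2 + 2 is the circle of
# radius r.  Pick a point v on it, a nearby point b on the same circle, and
# check that the gradient at v is essentially perpendicular to the chord b - v.
grad = lambda w: 2 * np.asarray(w)        # gradient of the quadratic, by hand

r, theta = 1.7, 0.8                       # illustrative contour radius and angle
v = r * np.array([np.cos(theta),        np.sin(theta)])          # point on the contour
b = r * np.array([np.cos(theta + 1e-3), np.sin(theta + 1e-3)])   # nearby contour point

chord = (b - v) / np.linalg.norm(b - v)   # unit vector pointing along the contour
print(abs(grad(v) @ chord))               # essentially zero (order 1e-3 or smaller)
```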
Here we show the contour plot and gradient descent directions in the same manner as the previous example, only for the wavy function
\begin{equation} g\left(\mathbf{w}\right) = w_0^2 + w_1^2 + 2\text{sin}\left(w_0 + w_1\right)^2 + 2. \end{equation}Here we have taken the quadratic in the previous example and added a three dimensional sinusoid to it, creating a highly non-convex function. However - evaluating the gradient at the same three random points - we still see that each gradient descent direction is perpendicular to its corresponding contour.
Finally we show the same sort of plot as in the previous example using the function
\begin{equation} g\left(\mathbf{w}\right) = \left(w_0^2 + w_1 - 11 \right)^2 + \left( w_0 + w_1^2 - 6 \right)^2 \end{equation}which is commonly used as a simple test function for optimization algorithms. Here we evaluate the gradient at three random points and illustrate the corresponding descent direction in each instance. Once again, we see that the gradient directions are perpendicular to their respective contours.
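The same perpendicularity can be probed numerically for this function as well. In the sketch below (the random points, step size, and seed are illustrative choices) we take a tiny step along the gradient and a tiny step perpendicular to it: along the gradient the function value changes noticeably, while along the perpendicular direction - that is, along the contour - it barely changes at all.

```python
import numpy as np

# g(w) = (w0**2 + w1 - 11)**2 + (w0 + w1**2 - 6)**2 with its gradient computed by hand.
g    = lambda w: (w[0]**2 + w[1] - 11)**2 + (w[0] + w[1]**2 - 6)**2
grad = lambda w: np.array([
    4 * w[0] * (w[0]**2 + w[1] - 11) + 2 * (w[0] + w[1]**2 - 6),
    2 * (w[0]**2 + w[1] - 11) + 4 * w[1] * (w[0] + w[1]**2 - 6),
])

rng = np.random.default_rng(0)
eps = 1e-4
for w in rng.uniform(-3, 3, size=(3, 2)):        # three random input points
    d    = grad(w) / np.linalg.norm(grad(w))     # unit ascent direction
    perp = np.array([-d[1], d[0]])               # unit direction along the contour
    print(abs(g(w + eps * d) - g(w)),            # noticeable change along the gradient ...
          abs(g(w + eps * perp) - g(w)))         # ... essentially no change along the contour
```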
Having seen a number of examples of this phenomenon, we can be quite confident that in general the gradient-defined ascent / descent directions are always perpendicular to a function's contour. We can quickly confirm this intuition via rigorous mathematical reasoning as follows.
Suppose $g\left(\mathbf{w}\right)$ is a differentiable function and $\mathbf{a}$ is some input point. Then $\mathbf{a}$ lies on the contour defined by all those points where $g\left(\mathbf{w}\right) = g\left(\mathbf{a}\right) = c$ for some constant $c$. If we take another point $\mathbf{b}$ from this contour very close to $\mathbf{a}$, then by the first order Taylor series approximation of $g$ at $\mathbf{a}$ the vector $\mathbf{b} - \mathbf{a}$ is essentially perpendicular to the gradient $\nabla g\left(\mathbf{a}\right)$, since $\nabla g\left(\mathbf{a}\right)^T\left(\mathbf{w} - \mathbf{a}\right) = 0$ defines the line in the input space whose normal vector is precisely $\nabla g\left(\mathbf{a}\right)$. So indeed both the ascent and descent directions defined by the gradient of $g$ at $\mathbf{a}$ are perpendicular to the contour there. And since $\mathbf{a}$ was an arbitrary input of $g$, the same argument holds at each of its inputs.
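Written out explicitly, the key step is just the first order Taylor series approximation applied to two nearby points on the same contour
\begin{equation} g\left(\mathbf{b}\right) \approx g\left(\mathbf{a}\right) + \nabla g\left(\mathbf{a}\right)^T\left(\mathbf{b} - \mathbf{a}\right) \quad \text{and} \quad g\left(\mathbf{b}\right) = g\left(\mathbf{a}\right) \implies \nabla g\left(\mathbf{a}\right)^T\left(\mathbf{b} - \mathbf{a}\right) \approx 0. \end{equation}As $\mathbf{b}$ is taken ever closer to $\mathbf{a}$ along the contour, the direction of $\mathbf{b} - \mathbf{a}$ approaches the tangent direction of the contour, which gives the perpendicularity claim.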