Chapter 6: First order methods

6.1 The first order optimality condition

In this Section we discuss the foundational first order concept on which many practical optimization algorithms are built: the first order optimality condition. The first order analog of the zero order condition discussed in the previous Chapter, the first order condition codifies the consistent behavior of any differentiable function's first derivative(s) at its minima. While there are a few important instances when this tool can be used to directly determine the minima of a function (which we review here), the first order optimality condition is more broadly useful because its formal characterization of minimum points helps motivate the construction of both first and second order optimization algorithms.


6.1.1 The first order condition

In the next Python cell we plot a quadratic function in two and three dimensions, and mark the global minimum point on each with a green point. In each panel we also draw the first order Taylor series approximation - a tangent line/hyperplane - generated by the first derivative(s) at the function's minimum (see Chapter 3 for further details on Taylor series approximations). In both instances we see that the tangent line/hyperplane is perfectly flat, indicating that the first derivative(s) are exactly zero at the function's minimum.

In [10]:
# plot a single input quadratic in both two and three dimensions
func1 = lambda w: w**2  + 3
func2 = lambda w: w[0]**2 + w[1]**2  + 3

# use custom plotter to show both functions
calib.derivative_visualizer.compare_2d3d(func1 = func1,func2 = func2)

This sort of first order behavior is universal regardless of the (differentiable) function one examines and - moreover - it holds regardless of the dimension of a function's input: first order derivatives are always zero at the minima of a function. This is because minimum values of a function are naturally located at 'valley floors', where the line or hyperplane tangent to the function is perfectly flat and thus has zero-valued slope(s).

Because the derivative/gradient at a point gives precisely this slope information, the values of a function's first order derivatives provide a convenient way of characterizing the minima of a function $g$. When $N=1$, any point $v$ where

\begin{equation} \frac{\mathrm{d}}{\mathrm{d}w}g\left(v\right)=0 \end{equation}

is a potential minimum. Analogously with general $N$ dimensional input, any $N$ dimensional point $\mathbf{v}$ where every partial derivative of $g$ is zero, that is

\begin{equation} \begin{array}{c} \frac{\partial}{\partial w_{1}}g(\mathbf{v})=0\\ \frac{\partial}{\partial w_{2}}g(\mathbf{v})=0\\ \vdots \\ \frac{\partial}{\partial w_{N}}g(\mathbf{v})=0 \end{array} \end{equation}

is a potential minimum. This system of $N$ equations is naturally referred to as the first order system of equations. We can write the first order system more compactly using gradient notation as

\begin{equation} \nabla g\left(\mathbf{v}\right)=\mathbf{0}_{N\times1}. \end{equation}

This is a very useful characterization of minimum points because - at least in principle - it gives us a concrete alternative to seeking out a function's minimum points directly via some zero-order approach: by solving a function's first order system of equations.

The first order optimality condition translates the problem of identifying a function's minimum points into the task of solving a system of $N$ first order equations.
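As a quick numerical sanity check of this characterization, the next cell is a minimal sketch (using only numpy, independent of the custom plotters employed elsewhere in this Section) that estimates the gradient of the two-input quadratic plotted above, $g(w_1,w_2) = w_1^2 + w_2^2 + 3$, via finite differences and confirms that it is (approximately) the zero vector at the minimum $\mathbf{w} = \mathbf{0}$.

In [ ]:
import numpy as np

# the two-input quadratic plotted at the start of this Section: g(w) = w_1^2 + w_2^2 + 3
g = lambda w: w[0]**2 + w[1]**2 + 3

# estimate each partial derivative of g at w using centered finite differences
def numerical_gradient(g,w,h = 1e-6):
    grad = np.zeros(np.shape(w))
    for n in range(np.size(w)):
        e = np.zeros(np.shape(w)); e[n] = h
        grad[n] = (g(w + e) - g(w - e))/(2*h)
    return grad

# at the minimum w = (0,0) the gradient should be (numerically) the zero vector
print(numerical_gradient(g,np.array([0.0,0.0])))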

There are however two problems with the first order characterization of minima. First off, with few exceptions (including some important examples we detail in the next Subsection), it is virtually impossible to solve a general function's first order system of equations 'by hand' - that is, to solve it algebraically for a 'closed form' solution one can write out on paper. This is where the local first and second order optimization methods we discuss in the current and subsequent Chapters come in: they are iterative methods for solving such a system.

The other problem is that the first order optimality condition does not only define minima of a function, but other points as well. The first order condition also equally characterizes maxima and saddle points of a function - as we see in a few simple examples below.

Example 1: Finding points of zero derivative for single-input functions graphically

In the next Python cell we plot the three functions

\begin{equation} \begin{array}{c} g(w) = \text{sin}\left(2w\right) \\ g(w) = w^3 \\ g(w) = \text{sin}\left(3w\right) + 0.1w^2 \end{array} \end{equation}

in the top row of the figure. For each we mark all the zero derivative points in green and draw the first order Taylor series approximations/tangent lines there in green as well. Below each function we plot its first derivative, highlighting the points where it takes on the value zero as well (the horizontal axis in each case is drawn as a horizontal dashed black line).

In [6]:
# define three single-input functions
func1 = lambda w: np.sin(2*w)
func2 = lambda w: w**3
func3 = lambda w: np.sin(3*w) + 0.1*w**2

# use custom plotter to show all three functions along with their derivatives
calib.derivative_visualizer.show_stationary(func1 = func1,func2 = func2,func3 = func3)

Examining these plots we can see that it is not only global minima that have zero derivatives, but a variety of other points as well. These consist of

  • local minima or points that are the smallest with respect to their immediate neighbors, like the one around the input value $w=2$ in the right panel
  • local and global maxima or points that are the largest with respect to their immediate neighbors, like the one around the input value $w=-2$ in the right panel
  • saddle points like the one shown in the middle panel, that are neither maximal nor minimal with respect to their immediate neighbors

The previous example illustrates the full swath of points that can have zero-valued derivative(s) - and the same holds for multi-input functions regardless of input dimension. Taken together, all such points are referred to as stationary points or critical points.

Together local/global minima and maxima, as well as saddle points are referred to as stationary points. These are points at which a function's derivative(s) take on zero value, i.e., $\nabla g(\mathbf{w}) = \mathbf{0}_{N\times 1}$.

So - while imperfect in terms of the points it characterizes - the first order condition for optimality does characterize a set of points that includes a function's global minima. Moreover the first order condition allows us to translate the problem of finding global minima into the problem of solving a system of (typically nonlinear) equations, for which many algorithmic schemes have been designed.

The first order condition for optimality: Stationary points of a function $g$ (including minima, maxima, and saddle points) satisfy the first order condition $\nabla g\left(\mathbf{v}\right)=\mathbf{0}_{N\times1}$. This allows us to translate the problem of finding global minima to the problem of solving a system of (typically nonlinear) equations, for which many algorithmic schemes have been designed.

Note: if a function is convex - as with the quadratic functions shown at the beginning of this Subsection - then any point satisfying the first order condition must be a global minimum, as a convex function has no maxima or saddle points.

6.1.2 Special cases where the first order system can be solved 'by hand'

In principle the benefit of using the first order condition is that it allows us to transform the task of seeking out global minima to that of solving a system of equations, for which a wide range of algorithmic methods have been designed. The emphasis here on algorithmic schemes is key, as solving a system of equations by hand is generally speaking very difficult (if not impossible). Again we emphasize, the vast majority of (nonlinear) systems of equations cannot reasonably be solved by hand.

However there are a handful of relatively simple but important examples where one can compute the solution to a first order system by hand, or at least one can show algebraically that they reduce to a linear system of equations which can be easily solved numerically. By far the most important of these are the multi-input quadratic function and the highly related Rayleigh quotient, which we discuss below. We will see the former later on when discussing linear regression, and the latter in a number of instances where we use it as a tool for studying the properties of certain machine learning cost functions.

Example 2: Calculating stationary points of some single-input functions algebraically

In this Example we use the first order condition for optimality to compute stationary points of the functions

\begin{equation} \begin{array}{c} g\left(w\right)=w^{3} \\ g\left(w\right)=e^{w} \\ g\left(w\right)=\textrm{sin}\left(w\right)\\ g\left(w\right)=a + bw + cw^{2}, \,\,\,c>0 \end{array} \end{equation}

and distinguish the kind of stationary point found in each instance (the final case is also checked symbolically in the cell after the list below).

  • $g\left(w\right)=w^{3}$, plotted in the middle panel of the second figure above, the first order condition gives $g'\left(v\right)=3v^{2}=0$ which we can visually identify as a saddle point at $v=0$.
  • $g\left(w\right)=e^{w}$, the first order condition gives $g'\left(v\right)=e^{v}=0$ which is not satisfied by any finite $v$ and is only approached as $v$ goes to $-\infty$ - the function has no stationary points, and its infimum is never attained.
  • $g\left(w\right)=\mbox{sin}\left(w\right)$ the first order condition gives stationary points wherever $g'\left(v\right)=\mbox{cos}\left(v\right)=0$ which occurs at odd integer multiples of $\frac{\pi}{2}$, i.e., maxima at $v=\frac{\left(4k+1\right)\pi}{2}$ and minima at $v=\frac{\left(4k+3\right)\pi}{2}$ where $k$ is any integer.
  • $g\left(w\right)=a + bw + cw^{2}$ with $c>0$, the first order condition gives $g'\left(v\right)=2cv + b =0$ with a minimum at $v=-\frac{b}{2c}$.
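As a brief aside, the final calculation above can be checked symbolically. The cell below is a small sketch using the sympy library (an assumption - it is not used elsewhere in these notes) to solve the first order equation for the general single-input quadratic.

In [ ]:
import sympy

# symbols: the input w and the coefficients a, b, c of the quadratic
w,a,b,c = sympy.symbols('w a b c',real = True)

# the quadratic g(w) = a + b*w + c*w**2 from the final bullet above
g = a + b*w + c*w**2

# solve the first order condition g'(v) = 0 for the stationary point
print(sympy.solve(sympy.diff(g,w),w))   # prints [-b/(2*c)]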

Example 3: A simple looking function whose global minimum is difficult to compute algebraically

As mentioned previously, for the vast majority of functions the first order system of equations they generate cannot be solved by hand algebraically. To get a sense of this challenge, here we show an example of a simple-enough looking function whose global minimum is very cumbersome to compute by hand.

Take the simple degree four polynomial

\begin{equation} g(w) = \frac{1}{50}\left(w^4 + w^2 + 10w\right) \end{equation}

which is plotted over a short range of inputs containing its global minimum in the next Python cell.

In [7]:
# specify range of input for our function
w = np.linspace(-5,5,50)

# define the degree four polynomial
g = lambda w: 1/50*(w**4 + w**2 + 10*w)

# make a table of values for our function
func_table = np.stack((w,g(w)), axis=1)

# use custom plotter to display function
baslib.basics_plotter.single_plot(table = func_table,xlabel = '$w$',ylabel = '$g(w)$',rotate_ylabel = 0)

The first order system here can be easily computed as

\begin{equation} \frac{\mathrm{d}}{\mathrm{d} w}g(w) = \frac{1}{50}\left(4w^3 + 2w + 10\right) = 0 \end{equation}

which simplifies to

\begin{equation} 2w^3 + w + 5 = 0 \end{equation}

This cubic has only one real solution (its other two roots are complex); this real root provides the minimum of the function $g(w)$ and is given by

\begin{equation} w = \frac{\sqrt[3]{\sqrt{2031} - 45}}{6^{\frac{2}{3}}} - \frac{1}{\sqrt[3]{6\left(\sqrt{2031}-45\right)}} \end{equation}

which can be computed - after much toil - using centuries-old tricks developed for just such problems.
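We can at least verify this expression numerically: the next cell (a minimal sketch) evaluates the closed form solution above and checks that it satisfies the simplified first order equation $2w^3 + w + 5 = 0$.

In [ ]:
import numpy as np

# evaluate the closed form stationary point derived above
w_star = (np.sqrt(2031) - 45)**(1/3)/6**(2/3) - 1/(6*(np.sqrt(2031) - 45))**(1/3)
print(w_star)                      # roughly -1.2347

# plug it back into the simplified first order equation - the result should be (numerically) zero
print(2*w_star**3 + w_star + 5)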

Example 4: Stationary points of a general multi-input quadratic function

Take the general multi-input quadratic function

\begin{equation} g\left(\mathbf{w}\right)=a + \mathbf{b}^{T}\mathbf{w} + \mathbf{w}^{T}\mathbf{C}\mathbf{w} \end{equation}

where $\mathbf{C}$ is an $N\times N$ symmetric matrix, $\mathbf{b}$ is an $N\times 1$ vector, and $a$ is a scalar.

Computing the first derivative (gradient) we have

\begin{equation} \nabla g\left(\mathbf{w}\right)=2\mathbf{C}\mathbf{w}+\mathbf{b} \end{equation}

Setting this equal to zero gives a symmetric and linear system of equations of the form

\begin{equation} \mathbf{C}\mathbf{w}=-\frac{1}{2}\mathbf{b} \end{equation}

whose solutions are stationary points of the original function. Note here we have not explicitly solved for these stationary points, but have merely shown that the first order system of equations in this particular case is in fact one of the easiest to solve numerically.
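To see this concretely, the next cell is a small sketch that solves the first order system for one particular choice of $\mathbf{C}$ and $\mathbf{b}$ (the specific values used here are arbitrary, chosen purely for illustration) using a standard linear solver.

In [ ]:
import numpy as np

# an arbitrary symmetric matrix C and vector b, chosen purely for illustration
C = np.array([[2.0,0.5],
              [0.5,1.0]])
b = np.array([1.0,-2.0])

# solve the first order system C w = -(1/2) b for the stationary point
w_star = np.linalg.solve(C,-0.5*b)
print(w_star)

# check: the gradient 2 C w + b should be (numerically) zero at w_star
print(2*np.dot(C,w_star) + b)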

Example 5: Stationary points of a simple but common quadratic function

A number of applications will find us employing a simple multi-input quadratic

\begin{equation} g\left(\mathbf{w}\right)=a + \mathbf{b}^{T}\mathbf{w} + \mathbf{w}^{T}\mathbf{C}\mathbf{w} \end{equation}

where the matrix $\mathbf{C} = \frac{1}{\beta} \mathbf{I}$. Here $\mathbf{I}$ is the $N\times N$ identity matrix, and $\beta > 0$ is a positive scalar. With this choice of matrix notice how the quadratic term can be simplified considerably

\begin{equation} \mathbf{w}^T \mathbf{C} \mathbf{w} = \mathbf{w}^T \frac{1}{\beta} \mathbf{I} \mathbf{w} = \frac{1}{\beta}\mathbf{w}^T \mathbf{w} = \frac{1}{\beta} \Vert \mathbf{w} \Vert^2_2 \end{equation}

The gradient of $g$ is then given as

\begin{equation} \nabla g(\mathbf{w}) = \frac{2}{\beta} \mathbf{w} + \mathbf{b} \end{equation}

and setting this equal to zero we can solve explicitly for the single stationary point - which is a global minimum - of this function

\begin{equation} \mathbf{w} = -\frac{\beta}{2}\mathbf{b} \end{equation}
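As with the previous example, a quick numerical check - sketched below with arbitrary choices of $\beta$ and $\mathbf{b}$ - confirms that the gradient vanishes at this point.

In [ ]:
import numpy as np

# arbitrary choices of beta and b, purely for illustration
beta = 4.0
b = np.array([1.0,-3.0,2.0])

# the stationary point derived above
w_star = -beta/2*b

# the gradient (2/beta) w + b should be (numerically) zero here
print(2/beta*w_star + b)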

Example 6: The Rayleigh Quotient

The Rayleigh Quotient of an $N\times N$ symmetric matrix $\mathbf{C}$ is defined as the normalized quadratic function

\begin{equation} g(\mathbf{w}) = \frac{\mathbf{w}^T\mathbf{C}\mathbf{w}}{\mathbf{w}^T \mathbf{w}} \end{equation}

Computing the gradient of this function gives

\begin{equation} \nabla g(\mathbf{w}) = \frac{2}{\mathbf{w}^T \mathbf{w}}\left(\mathbf{C}\mathbf{w} - \frac{\mathbf{w}^T\mathbf{C}\mathbf{w}}{\mathbf{w}^T \mathbf{w}}\mathbf{w}\right) \end{equation}

Denote by $\mathbf{v}$ and $\lambda$ any eigenvector/eigenvalue pair of $\mathbf{C}$. Then plugging in $\mathbf{v}$ into the gradient we have, since by definition $\mathbf{C}\mathbf{v} = \lambda \mathbf{v}$, that

\begin{equation} \nabla g(\mathbf{v}) = \frac{2}{\mathbf{v}^T\mathbf{v}}\left(\lambda\mathbf{v} - \lambda\frac{ \mathbf{v}^T\mathbf{v}}{\mathbf{v}^T\mathbf{v}}\mathbf{v}\right) =\frac{2}{\mathbf{v}^T\mathbf{v}}\left(\lambda\mathbf{v} - \lambda \mathbf{v}\right) = 0 \end{equation}

In other words, the eigenvectors of $\mathbf{C}$ are stationary points of the Rayleigh Quotient. By plugging the stationary point/eigenvector $\mathbf{v}$ into the original function we can see that it takes on the value of the corresponding eigenvalue $\lambda$

\begin{equation} g(\mathbf{v}) = \frac{\mathbf{v}^T\mathbf{C}\mathbf{v}}{\mathbf{v}^T \mathbf{v}} = \lambda\frac{\mathbf{v}^T\ \mathbf{v}}{\mathbf{v}^T \mathbf{v}} = \lambda \end{equation}

In particular this shows that the maximum value taken by the Rayleigh Quotient is given by the maximum eigenvalue of $\mathbf{C}$, and likewise at its minimum the function takes on the value of the minimum eigenvalue of $\mathbf{C}$.
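The next cell sketches this numerically for one arbitrary symmetric matrix (chosen purely for illustration): evaluating the Rayleigh Quotient at each eigenvector of $\mathbf{C}$ recovers the corresponding eigenvalue, with the largest and smallest eigenvalues giving the maximum and minimum values of the quotient.

In [ ]:
import numpy as np

# the Rayleigh Quotient of a matrix C evaluated at w
rayleigh = lambda C,w: np.dot(w,np.dot(C,w))/np.dot(w,w)

# an arbitrary symmetric matrix, chosen purely for illustration
C = np.array([[3.0,1.0,0.0],
              [1.0,2.0,0.5],
              [0.0,0.5,1.0]])

# compute eigenvalues/eigenvectors (columns of eigvecs) of the symmetric matrix C
eigvals,eigvecs = np.linalg.eigh(C)

# evaluating the quotient at each eigenvector recovers its eigenvalue
for lam,v in zip(eigvals,eigvecs.T):
    print(lam,rayleigh(C,v))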