Chapter 7: Second order methods

7.6 Quasi-Newton methods

While Newton's method is a powerful technique that exhibits rapid convergence, especially for a convex function $g\left(\mathbf{w}\right)$, it is naturally constrained by $N$, the dimension of the input $\mathbf{w}$. In particular it is the quadratic-generating Hessian $\nabla^2g\left(\mathbf{w}\right)$, an $N\times N$ matrix, that limits Newton's method's use to cases where $N$ is (roughly speaking) in the thousands, since it is difficult to even store such a matrix when $N$ is larger (let alone compute with it). In this Section we discuss a set of variations on Newton's method designed to ameliorate this issue, collectively referred to as quasi-Newton methods. The main thrust of any such method is to replace the Hessian - in the Newton step - with a close approximation that does not suffer from this scaling problem.


7.6.1 Newton's method from the perspective of the gradient function

We have just seen how Newton's method works to find stationary points of a cost function - points where the gradient of the cost function is zero - effectively minimizing the function. However the classic application of Newton's method is actually towards finding the zeros of (polynomial) functions, that is, where a function crosses the input plane. At first glance it may not seem like Newton's method - at least the way we have viewed it thus far - is applicable to such problems. That is, until we think about what Newton's method is doing in the space where the gradient of our cost function lives. If Newton's method does indeed take steps towards finding a stationary point of a cost function, then we can view these steps evaluated not on the cost function, but on the derivative (or, more generally, the gradient) itself. Any sequence of steps moving towards a stationary point of the cost function is naturally - when evaluated by the gradient - a sequence of steps moving towards a zero of the gradient function.

Because we will frequently be manipulating first and second derivatives here, in this Section we make use of the common (more compact) 'prime' notation to denote the first and second derivative of a single input function $g(w)$

\begin{equation} g^{\prime}(w) = \frac{\mathrm{d}}{\mathrm{d}w}g\left(w\right) \,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,g^{\prime\prime}(w) = \frac{\mathrm{d}^2}{\mathrm{d}w^2}g\left(w\right). \end{equation}

Now, remember that when $N = 1$ the $k^{th}$ Newton step follows directly from the second order Taylor series approximation centered at the point ${w}^{k-1}$, which (in our 'prime' derivative notation) is

\begin{equation} h\left(w\right)=g\left(w^{k-1}\right)+g^{\prime}\left(w^{k-1}\right)\left(w-w^{k-1}\right)+ \frac{1}{2}g^{\prime\prime}\left(w^{k-1}\right)\left(w-w^{k-1}\right)^{2}. \end{equation}

We want to determine a stationary point of this quadratic (the global minimum of the quadratic if it is convex) by setting its derivative to zero and solving. This gives our Newton step

\begin{equation} w^k = w^{k-1} - \frac{g^{\prime}(w^{k-1})}{g^{\prime\prime}(w^{k-1})}. \end{equation}

By following this sequence of steps we are - when beginning in a convex region of a function - led to a stationary point which is also a local minimum of $g$. Now, since any stationary point of $g$ satisfies $g^{\prime}(w) = 0$, we can think of this simultaneously as a procedure for finding a zero of our derivative function. Indeed we can easily derive this same Newton's method update equation beginning with the derivative itself, and the desire to find one of its zero points. Let us go about doing this.
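To make this concrete, here is a minimal sketch of the single input Newton step in Python, applied to the function $g(w) = \frac{1}{50}\left(w^4 + w^2 + 10w\right)$ used in the Examples below; the derivatives are written out by hand, and the starting point and iteration count are arbitrary illustrative choices.

```python
# the example function used in the Examples below, with its first and
# second derivatives written out by hand
g        = lambda w: (w**4 + w**2 + 10*w) / 50
g_prime  = lambda w: (4*w**3 + 2*w + 10) / 50
g_dprime = lambda w: (12*w**2 + 2) / 50

def newtons_method(w, max_its=10):
    # repeatedly apply the Newton step  w <-- w - g'(w) / g''(w)
    history = [w]
    for _ in range(max_its):
        w = w - g_prime(w) / g_dprime(w)
        history.append(w)
    return history

# starting at w = 2.5 (an arbitrary choice) the iterates quickly settle
# on the stationary point of this convex function
print(newtons_method(2.5))
```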

So, suppose we start from scratch and do not know about Newton's method; we are simply looking for an iterative way to find a zero of our cost function's derivative, i.e., a point where $g^{\prime}(w) = 0$. What could we do? Finding zeros of an arbitrary function is not a trivial affair. As with our discussion of the first order condition - where we discussed just this issue - we would in general look to develop an iterative scheme for approximating zero points of an arbitrary function $g^{\prime}$. If we followed precisely the intuition that led us to gradient descent we would likewise conclude that while it is difficult to compute the zero of an arbitrary function, it is trivial to compute where a line or hyperplane crosses the input axis. So why not repeatedly seek out the zero point(s) of the first order Taylor series approximation (a tangent line or hyperplane) to the function $g^{\prime}$?

Example 1. Illustrating the zero-finding algorithm concept

Below we animate the iterative scheme for finding a zero of an arbitrary function $g^{\prime}$ - here a sinusoid - based on repeatedly forming the tangent line (our first order approximation) to the function and finding its zero. As you move the slider from left to right the scheme progresses, with the point of tangency shown as a colored circle and its corresponding input shown as an 'x' symbol. The tangent point / input point, along with the tangent line, are all colored green when the method begins and transition to red as the method halts. The input where the tangent line crosses the input axis is marked with a white 'x' symbol at each step, and the corresponding function evaluation (the point at which the next tangent line will be drawn) is shown as a white circle.
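A minimal sketch of this tangent-line zero-finding scheme is given below, assuming the target function is the sinusoid $\sin(w)$ and an arbitrary starting point (the exact sinusoid and initialization used in the animation are not reproduced here).

```python
import numpy as np

# the function whose zero we seek, and its derivative (the tangent slope);
# a simple sinusoid is assumed here purely for illustration
f       = lambda w: np.sin(w)
f_prime = lambda w: np.cos(w)

def tangent_zero_finder(w, max_its=5):
    # at each step form the tangent line to f at w and jump to the input
    # where that line crosses the horizontal axis:  w <-- w - f(w) / f'(w)
    history = [w]
    for _ in range(max_its):
        w = w - f(w) / f_prime(w)
        history.append(w)
    return history

# starting at w = 1 the iterates rapidly approach the zero of the sinusoid at w = 0
print(tangent_zero_finder(1.0))
```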

To write out the first step of this scheme, remember first and foremost that we are thinking of it as an iterative method applied to the derivative function $g^{\prime}(w)$ - this is the function whose zero points we want to find, i.e., inputs where $g^{\prime}(w) = 0$. This means - beginning at a point $w^0$ - that our linear approximation to the derivative function naturally involves the second derivative of the function $g$. We can write the first order Taylor series linear approximation to the derivative function at $w^0$ as

\begin{equation} h\left(w\right) = g^{\prime}\left(w^0\right) + g^{\prime \prime}\left(w^0\right)\left(w - w^0\right). \end{equation}

Here the second derivative $g^{\prime \prime}\left(w^0 \right)$ is the slope of the tangent line to $g^{\prime}$ at the point $w^0$. We can easily compute where this line crosses the input axis by setting the equation above to zero

\begin{equation} g^{\prime}\left(w^0\right) + g^{\prime \prime}\left(w^0\right)\left(w - w^0\right) = 0 \end{equation}

and solving for $w$. Doing this, and calling the solution $w^1$, we have

\begin{equation} w^1 = w^0 - \frac{g^{\prime}\left(w^0\right)}{g^{\prime \prime}\left(w^0\right) }. \end{equation}

If we then repeat this procedure - finding where the tangent line at $w^1$ crosses the input axis and solving for the next update, etc. - then the $k^{th}$ update step takes the form

\begin{equation} w^k = w^{k-1} - \frac{g^{\prime}\left(w^{k-1}\right)}{g^{\prime \prime}\left(w^{k-1}\right) }. \end{equation}

Notice - these are Newton steps! So - in other words - we have re-derived Newton's method as a zero-finding algorithm for directly solving the first order optimality condition for single input functions.

Precisely the same reasoning, applied to multi-input functions, leads to the multi-input Newton step as well (sketched in code below). So - indeed - we can view Newton's method simultaneously as

  • a method for determining a stationary point (preferably a minimum) of a cost function $g(\mathbf{w})$
  • a method for finding a zero of the derivative function, i.e., where $\nabla g(\mathbf{w}) = \mathbf{0}_{N\times 1}$.
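For completeness, here is a minimal sketch of the multi-input Newton step, which at each iteration solves the linear system $\nabla^2 g\left(\mathbf{w}^{k-1}\right)\mathbf{d} = \nabla g\left(\mathbf{w}^{k-1}\right)$ and sets $\mathbf{w}^{k} = \mathbf{w}^{k-1} - \mathbf{d}$. The two-input convex test function $g(w_1, w_2) = w_1^2 + w_2^2 + e^{w_1 + w_2}$ is a hypothetical choice, made only so that its gradient and Hessian are easy to write out by hand.

```python
import numpy as np

# gradient and Hessian of the illustrative function g(w1, w2) = w1^2 + w2^2 + e^(w1 + w2)
def grad(w):
    e = np.exp(w[0] + w[1])
    return np.array([2*w[0] + e, 2*w[1] + e])

def hess(w):
    e = np.exp(w[0] + w[1])
    return np.array([[2 + e, e], [e, 2 + e]])

def newtons_method_multi(w, max_its=10):
    # multi-input Newton step: solve  hess(w) d = grad(w)  and set  w <-- w - d
    history = [w]
    for _ in range(max_its):
        w = w - np.linalg.solve(hess(w), grad(w))
        history.append(w)
    return history

# the iterates simultaneously minimize g and drive its gradient to zero
w_final = newtons_method_multi(np.array([1.0, 1.0]))[-1]
print(w_final, grad(w_final))
```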

Example 2. Thinking of Newton's method in two ways simultaneously

Below we show an animation that simultaneously draws Newton's method from a) the perspective of minimizing a function (left panel) and b) finding the zero of its derivative(s) (right panel). In particular we use the following function

\begin{equation} g(w) = \frac{1}{50}\left(w^4 + w^2 + 10w\right). \end{equation}

As you move the slider from left to right Newton's method proceeds in both panels, with the final step shown when the slider is moved all the way to the right. At each step of the algorithm the input and point of tangency are drawn as a white circle and 'x' symbol respectively, while the next-step input / tangency point are colored from green to red as the method proceeds. The tangent quadratic / line is also drawn in each panel and is likewise colored from green to red as the method proceeds.

7.6.2 Approximating the second derivative and the secant method

Remember (as discussed in Chapter 3) that the derivative of a single input function defines the slope of the tangent line to a function at a point. Moreover the derivative/slope of a tangent line can be roughly approximated as the slope of a nearby secant line, that is a line that passes through the same point as well as another point nearby on the function (see Section 3.2).

Using this fact in our current scenario of Newton's method as a zero-finding algorithm, we can then say that the slope of the tangent line to our derivative function $g^{\prime}(w)$ - this slope being the second derivative itself $g^{\prime \prime}(w)$ - can itself be approximated by the slope of a nearby secant line as

\begin{equation} g^{\prime\prime}(w) \approx \frac{g^{\prime}(w) - g^{\prime}(v)}{w - v} \end{equation}

if $v$ is taken close enough to the evaluation point $w$. Using this rule somewhat loosely, if we then go back to our Newton step and swap out the second derivative evaluation for this secant approximation, we have a generic step that looks like

\begin{equation} w^k = w^{k-1} - \frac{g^{\prime}\left(w^{k-1}\right)}{ \frac{g^{\prime}(w^{k-1}) - g^{\prime}(w^{k-2})}{w^{k-1} - w^{k-2}} }. \end{equation}

We can write this less cumbersomely as

\begin{equation} w^k = w^{k-1} - \frac{g^{\prime}\left(w^{k-1}\right)}{s^k} \end{equation}

where $s^k$ has been used to replace the second derivative, and is defined as

\begin{equation} s^k = \frac{g^{\prime}(w^{k-1}) - g^{\prime}(w^{k-2})}{w^{k-1} - w^{k-2}}. \end{equation}

This approach - called the Secant method since it employs a secant rather than a tangent line - can still determine zeros of the derivative function $g^{\prime}(w)$ / stationary points of the cost $g(w)$, but without the explicit need for the second derivative. This fact - while fairly inconsequential for an $N = 1$ dimensional input function - gains significantly more value when the Secant method idea is generalized to multi-input functions (since it is the very size of the Hessian that prohibits Newton's method's use for functions with larger values of $N$).
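Here is a minimal sketch of the Secant method applied to the same example function used in the Examples here; note that only the first derivative is required, and that two starting points (both arbitrary illustrative choices) are needed to form the first secant slope.

```python
# same example function as above, together with its first derivative only --
# the Secant method never requires the second derivative
g       = lambda w: (w**4 + w**2 + 10*w) / 50
g_prime = lambda w: (4*w**3 + 2*w + 10) / 50

def secant_method(w_prev, w_curr, max_its=20, tol=1e-10):
    history = [w_prev, w_curr]
    for _ in range(max_its):
        # secant slope s^k standing in for the second derivative
        s = (g_prime(w_curr) - g_prime(w_prev)) / (w_curr - w_prev)
        w_prev, w_curr = w_curr, w_curr - g_prime(w_curr) / s
        history.append(w_curr)
        if abs(w_curr - w_prev) < tol:   # stop once the steps stall
            break
    return history

# two nearby starting points are needed to form the first secant slope
print(secant_method(2.6, 2.5))
```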

Example 3. The Secant method from two perspectives

Below we show an animation that simultaneously draws the Secant method from a) the perspective of minimizing a function (left panel) and b) finding the zero of its derivative(s) (right panel). In particular we use the following function

\begin{equation} g(w) = \frac{1}{50}\left(w^4 + w^2 + 10w\right). \end{equation}

As you move the slider from left to right the Secant method proceeds in both panels, with the final step shown when the slider is moved all the way to the right. At each step of the algorithm the input and point of tangency are drawn as a white circle and 'x' symbol respectively, while the next-step input / tangency point are colored from green to red as the method proceeds. The approximating quadratic / secant line is also drawn in each panel and is likewise colored from green to red as the method proceeds.

7.6.3 The general $N$ dimensional input Secant method

Everything we have discussed for the generic single input $N = 1$ dimensional case carries over to the higher dimensional setting as well, with one important difference. Looking back at the Subsection above, we replaced the second derivative with the slope of the corresponding secant line $s^k = \frac{g^{\prime}(w^{k-1}) - g^{\prime}(w^{k-2})}{w^{k-1} - w^{k-2}}$ as

\begin{equation} g^{\prime \prime}(w^{k-1}) \longleftarrow s^k \end{equation}

What is the analogous replacement when $N > 1$? Abusing notation, we could express it (this is incorrect mathematically speaking, but the algebra mirrors the one dimensional case) by literally upgrading each piece of notation on the right hand side of the equality above to its vector analog, giving

\begin{equation} \mathbf{S}^k = \frac{\nabla g\left(\mathbf{w}^{k-1}\right) - \nabla g\left(\mathbf{w}^{k-2}\right)}{\mathbf{w}^{k-1} - \mathbf{w}^{k-2}}. \end{equation}

In this case, since we are replacing the Hessian, the variable $\mathbf{S}^k$ must also be an $N\times N$ matrix. However we cannot - strictly speaking - express the right hand side of the equality above, since this would mean we are (algebraically) dividing an $N$ length vector (the numerator) by an $N$ length vector (the denominator). The proper $N$ dimensional analog of the $N=1$ dimensional replacement is in fact a solution to the system

\begin{equation} \mathbf{S}^{k}\left(\mathbf{w}^{k-1} - \mathbf{w}^{k-2}\right) = \nabla g\left(\mathbf{w}^{k-1}\right) - \nabla g\left(\mathbf{w}^{k-2}\right). \end{equation}

Note that this is a system of $N$ equations in the $N^2$ unknown entries of $\mathbf{S}^k$, since $\mathbf{S}^k$ itself is the variable that must be solved for. This is often called the secant condition, since it is indeed the $N$ dimensional analog of the one dimensional secant slope that approximates the second derivative. So - if this system can be solved - we make the $N$ dimensional analog of the replacement $g^{\prime \prime}(w^{k-1}) \longleftarrow s^k$, swapping out the full Hessian matrix

\begin{equation} \nabla^2 g\left(\mathbf{w}^{k-1}\right) \longleftarrow \mathbf{S}^k. \end{equation}

However here comes the rub: unlike the $N = 1$ dimensional case - where each step has a closed form - the secant condition we must solve for our replacement $\mathbf{S}^k$ always has infinitely many solutions. There are only $N$ equations, but $N^2$ entries of the matrix $\mathbf{S}^k$ to solve for.

However, because we want $\mathbf{S}^k$ to be as similar as possible to the Hessian, we can introduce a number of reasonable constraints to limit the range of solutions to this system. Here are a few natural characteristics we want $\mathbf{S}^{k}$ to have (in order for it to be similar to the Hessian):

  • $\mathbf{S}^k$ should be a solution to $\mathbf{S}^{k}\left(\mathbf{w}^{k-1} - \mathbf{w}^{k-2}\right) = \nabla g\left(\mathbf{w}^{k-1}\right) - \nabla g\left(\mathbf{w}^{k-2}\right)$: this is merely restating the secant condition
  • $\mathbf{S}^k$ should be symmetric: the Hessian $\nabla^{2}g\left(\mathbf{w}^{k}\right)$ is always symmetric, so $\mathbf{S}^k$ should be too
  • $\mathbf{S}^{k}$ should not differ substantially from $\mathbf{S}^{k-1}$: the Hessian $\nabla^{2}g\left(\mathbf{w}\right)$ is continuous, so our approximations $\mathbf{S}^{k}$ and $\mathbf{S}^{k-1}$ should not differ too wildly
  • $\mathbf{S}^k$ should generate a convex quadratic approximation (i.e., it should be positive semi-definite): this way each step moves us in a descent direction on the function

With these restrictions there are a number of possible solutions for $\mathbf{S}^k$, which we explore by example.
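As a small preview of the update formulas explored in the Examples that follow, below is a sketch of a standard symmetric rank-one secant update together with a numerical check of the secant condition; the previous approximation $\mathbf{S}^{k-1}$ and the difference vectors used in the check are made-up illustrative quantities, not values produced by this notebook.

```python
import numpy as np

def rank_one_secant_update(S_prev, delta_w, delta_grad):
    # add a rank-one correction to the previous approximation S_prev so that
    # the new matrix satisfies the secant condition  S @ delta_w = delta_grad
    # (the denominator below is assumed to be nonzero)
    r = delta_grad - S_prev @ delta_w
    return S_prev + np.outer(r, r) / (r @ delta_w)

# numerical check with made-up quantities
np.random.seed(0)
N = 4
S_prev     = np.eye(N)               # a previous symmetric approximation
delta_w    = np.random.randn(N)      # stands in for  w^{k-1} - w^{k-2}
delta_grad = np.random.randn(N)      # stands in for  grad g(w^{k-1}) - grad g(w^{k-2})
S_new = rank_one_secant_update(S_prev, delta_w, delta_grad)

print(np.allclose(S_new @ delta_w, delta_grad))   # the secant condition holds
print(np.allclose(S_new, S_new.T))                # symmetry is preserved
```

Such a rank-one correction satisfies the secant condition and preserves symmetry, but does not by itself guarantee positive semi-definiteness.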

TODO

Example 4. Rank 1 update formula

Example 5. BFGS update formula