

► Chapter 7: Second order methods

7.5 The first order derivation of Newton's method

We have just seen how Newton's method finds stationary points of a cost function - points where the gradient of the cost function is zero - thereby (ideally) minimizing the function. However the classic application of Newton's method is actually finding the zeros of (polynomial) functions - that is, where a function crosses the input plane - which is precisely the goal of solving the first order system of equations introduced in Section 6.1.

At first glance it may not seem like Newton's method - at least the way we have viewed it thus far - is applicable to such problems. That is, until we think about what Newton's method is doing in the space where the gradient of our cost function lives. If Newton's method does indeed take steps towards a stationary point of a cost function, then we can view these same steps evaluated not on the cost function, but on the derivative (or, more generally, the gradient) itself. Any sequence of steps moving towards a stationary point of the cost function is - when evaluated by the gradient - a sequence of steps moving towards a zero of the gradient.


7.5.1 Newton's method from the perspective of the gradient function

Because we will need to write out first and second derivatives a number of times in this Subsection, we will briefly use the common (more compact) 'prime' notation to denote the first and second derivatives of $g(w)$, a function taking in a single dimensional input $w$

\begin{equation} g^{\prime}(w) = \frac{\mathrm{d}}{\mathrm{d}w}g\left(w\right) \,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,g^{\prime\prime}(w) = \frac{\mathrm{d}^2}{\mathrm{d}w^2}g\left(w\right). \end{equation}

Now, remember that when $N = 1$ the $k^{th}$ Newton's step follows directly from the second order Taylor series approximation centered at the point ${w}^{k-1}$, which (in our 'prime' derivative notation) is

\begin{equation} h\left(w\right)=g\left(w^{k-1}\right)+g^{\prime}\left(w^{k-1}\right)\left(w-w^{k-1}\right)+ \frac{1}{2}g^{\prime\prime}\left(w^{k-1}\right)\left(w-w^{k-1}\right)^{2}. \end{equation}

We want to determine a stationary point of this quadratic (the global minimum of the quadratic if it is convex) by setting the derivative of this quadratic approximation to zero and solving for $w$. Doing so gives our Newton step

\begin{equation} w^k = w^{k-1} - \frac{g^{\prime}(w^{k-1})}{g^{\prime\prime}(w^{k-1})} \end{equation}

By following this sequence of steps we are - when beginning in a convex region of the function - led to a stationary point which is also a local minimum of $g$. Now, since any stationary point of $g$ satisfies $g^{\prime}(w) = 0$, we can think of this simultaneously as a procedure for finding a zero of the derivative function. Indeed we can easily derive this same Newton's method update equation beginning with the derivative itself, and with the desire to find one of its zero points. Let us go about doing this.

So, suppose we are starting from scratch and do not know about the Newton's method update step, and we are looking for an iterative way to find a zero of our cost function's derivative, i.e., a point where $g^{\prime}(w) = 0$. What could we do? It is not simple to determine where an arbitrary function crosses zero - we know this because we have already studied this question in our discussion of the first order condition of calculus. Thus we should quickly give up on the idea of being able to calculate - algebraically, by hand - the zero(s) of an arbitrary (derivative) function. Instead we should look to develop an iterative scheme for approximating these zero points.

As usual, when we do not know how to compute something with a generic function - here the derivative of our cost function $g^{\prime}(w)$ - we turn instead to computing it with a Taylor series approximation of the function. Here - since we are looking for places where $g^{\prime}$ crosses the input axis - it is reasonable to use just the first order Taylor series, the best linear approximation to $g^{\prime}(w)$ centered at a point. We can then easily compute where this first order approximation (just a line) crosses the input axis. (Why use the first and not the second order Taylor series approximation? Because while most lines cross the input axis exactly once, a general quadratic may not cross the input axis at all, or may cross it twice.)

An iterative scheme based on using the tangent line (our first order approximation) to the function and finding its zero would look like the animation below. As you move the slider from left to right the scheme progresses, and eventually converges when the zero of the tangent line approximates that of the derivative function itself.

[Animation: tangent line (first order Taylor series) zero-finding steps applied to the derivative function, progressing as the slider moves from left to right.]

To write out the first step of this scheme, remember first and foremost that we are applying it to the derivative function $g^{\prime}(w)$ (we could of course apply it to the cost function $g(w)$ itself to find its zeros, but our goal is to find its stationary points, i.e., the zeros of $g^{\prime}(w)$!). This means that - beginning at a point $w^0$ - our linear approximation to the derivative function naturally involves the second derivative, since the slope of the tangent line is always given by the derivative of the function we are constructing it for. We can then write the linear approximation to the derivative at $w^0$ as

\begin{equation} h\left(w\right) = g^{\prime}\left(w^0\right) + g^{\prime \prime}\left(w^0\right)\left(w - w^0\right). \end{equation}

Then if we simply wish to see where the tangent line crosses the input axis, we set the above to zero

\begin{equation} g^{\prime}\left(w^0\right) + g^{\prime \prime}\left(w^0\right)\left(w - w^0\right) = 0 \end{equation}

and solve for $w$. Doing this, and calling the solution $w^1$, we have

\begin{equation} w^1 = w^0 - \frac{g^{\prime}\left(w^0\right)}{g^{\prime \prime}\left(w^0\right) }. \end{equation}

If we repeat these steps, the $k^{th}$ one takes the analogous form

\begin{equation} w^k = w^{k-1} - \frac{g^{\prime}\left(w^{k-1}\right)}{g^{\prime \prime}\left(w^{k-1}\right) }. \end{equation}

This is the formula for the $k^{th}$ Newton step!

So - indeed - we can view Newton's method simultaneously as

  • a method for determining a stationary point (preferably a minimum) of a cost function $g(w)$
  • a method for finding a zero of the derivative function, i.e., where $g^{\prime}(w) = 0$
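
To make this dual view concrete, here is a minimal Python sketch of the update. The test function and its hand-computed derivatives are our own illustrative choice (not taken from the text above); printing both $g(w^k)$ and $g^{\prime}(w^k)$ shows the cost settling at its minimum value while the derivative is simultaneously driven towards zero.

```python
# a minimal sketch of the Newton step w^k = w^{k-1} - g'(w^{k-1})/g''(w^{k-1})
# applied to a hand-chosen convex test function (our own example)
g        = lambda w: 0.25*w**4 + w**2 + 2.0*w      # cost function g(w)
g_prime  = lambda w: w**3 + 2.0*w + 2.0            # first derivative g'(w)
g_dprime = lambda w: 3.0*w**2 + 2.0                # second derivative g''(w) > 0

w = 2.0                                            # initial point w^0
for k in range(1, 6):
    w = w - g_prime(w) / g_dprime(w)               # Newton step
    # g(w) decreases towards its minimum while g'(w) is driven towards zero
    print(k, round(w, 6), round(g(w), 6), round(g_prime(w), 6))
```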

Below we show an animation that - for a real input cost function - draws out the Newton steps on both the cost function and its derivative simultaneously. As you move the slider from left to right the algorithm proceeds forward.

[Animation: Newton's method steps drawn simultaneously on the cost function and on its derivative, progressing as the slider moves from left to right.]


7.5.2 Approximating the second derivative and the secant method

Remember from our discussions regarding the basic notion of the derivative that the derivative defines the slope of the tangent line to a function at a point, and that this slope can be roughly approximated by the slope of a nearby secant line. Using this fact in our current scenario, the slope of the tangent line to our derivative function $g^{\prime}(w)$ - this slope being the second derivative $g^{\prime \prime}(w)$ itself - can be approximated by the slope of a nearby secant

\begin{equation} g^{\prime\prime}(w^{\,}) \approx \frac{g^{\prime}(w^{\,}) - g^{\prime}(v^{\,})}{w^{\,} - v^{\,}} \end{equation}

if $v$ is taken close enough to the evaluation point $w$. Using this rule somewhat loosely, if we go back to our Newton step and swap out the second derivative evaluation with this approximation, we have a generic step of the form

\begin{equation} w^k = w^{k-1} - \frac{g^{\prime}\left(w^{k-1}\right)}{ \frac{g^{\prime}(w^{k-1}) - g^{\prime}(w^{k-2})}{w^{k-1} - w^{k-2}} }. \end{equation}

This approach - called the Secant Method since it employs a secant instead of a tangent line - still determines zeros of the derivative function $g^{\prime}(w)$, i.e., stationary points of the cost $g(w)$. While typically slower to converge than the pure Newton's method, it has the massive benefit of not requiring second derivatives. This may not seem particularly helpful when dealing with $N = 1$ dimensional input cost functions, but when dealing with inputs of arbitrary dimension it is indeed a life saver.
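
As a rough illustration, below is a minimal Python sketch of this secant step applied to the same hand-chosen test function used in the earlier sketch (again our own example, not a function from the text). Note that two initial points are required to form the first secant slope.

```python
# a minimal sketch of the secant step: the second derivative in the Newton step
# is replaced by the slope of a secant line through two recent derivative values
g_prime = lambda w: w**3 + 2.0*w + 2.0                    # derivative of the test cost

w_old, w = 2.0, 1.5                                       # two initial points w^0, w^1
for k in range(2, 10):
    slope = (g_prime(w) - g_prime(w_old)) / (w - w_old)   # secant slope approximating g''(w)
    w_old, w = w, w - g_prime(w) / slope                  # secant update
    print(k, round(w, 6), round(g_prime(w), 6))           # g'(w) shrinks towards zero
```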


7.5.3 The general $N$ dimensional input Secant Method

Everything we have discussed for the $N = 1$ dimensional case carries over to the higher dimensional setting as well, with one important difference. Suppose we take our higher dimensional Newton's method step (supposing for the moment that we can invert the Hessian matrix)

\begin{equation} \mathbf{w}^{k} = \mathbf{w}^{k-1} - \nabla^2 g\left(\mathbf{w}^{k-1}\right)^{-1} \nabla g\left(\mathbf{w}^{k-1}\right) \end{equation}

and swap out the Hessian with the approximation analogous to the one dimensional case where we made the substitution

\begin{equation} g^{\prime \prime}(w^{k-1}) \longleftarrow s^k = \frac{g^{\prime}(w^{k-1}) - g^{\prime}(w^{k-2})}{w^{k-1} - w^{k-2}} \end{equation}

What is this approximation? Abusing notation we could express it (this is incorrect mathematically speaking, but the algebra mirrors the one dimensional case) as

\begin{equation} \nabla^2 g\left(\mathbf{w}^{k-1}\right) \longleftarrow \mathbf{S}^k = \frac{\nabla g\left(\mathbf{w}^{k-1}\right) - \nabla g\left(\mathbf{w}^{k-2}\right)}{\mathbf{w}^{k-1} - \mathbf{w}^{k-2}}. \end{equation}

Again - while the way we have written the line above is nice in that it mirrors the one dimensional case - it does not actually make sense. Every quantity on the right hand side is an $N\times 1$ vector, so we cannot compute the quotient on the right hand side. The proper $N$ dimensional analog of the one dimensional substitution above is the linear system

\begin{equation} \nabla^2 g\left(\mathbf{w}^{k-1}\right) \longleftarrow \mathbf{S}^{k}\left(\mathbf{w}^{k-1} - \mathbf{w}^{k-2}\right) = \nabla g\left(\mathbf{w}^{k-1}\right) - \nabla g\left(\mathbf{w}^{k-2}\right). \end{equation}

Here $\mathbf{S}^{k}$ is an $N\times N$ matrix we must solve for. We can write this more compactly as the system of equations

\begin{equation} \mathbf{S}\mathbf{a}^k = \mathbf{b}^k \end{equation}

where $\mathbf{a}^k = \mathbf{w}^{k-1} - \mathbf{w}^{k-2}$ and $\mathbf{b}^k = \nabla g\left(\mathbf{w}^{k-1}\right) - \nabla g\left(\mathbf{w}^{k-2}\right)$, and where the solution to this system is our $\mathbf{S}^k$. Note, however, that unlike the $N = 1$ dimensional case - where each update had a closed form - in higher dimensions we have a linear system to solve. More specifically, it is the $N\times N$ matrix $\mathbf{S}^k$ itself we are solving for here - only $N$ equations constraining $N^2$ unknowns - so this system has infinitely many solutions.

But we do know that we would like our solution to behave as much like a Hessian as possible - that was our motivation to begin with (to replace the costly Hessian with a cheap substitute) - so, thinking through some of the characteristic properties of $\nabla^2 g\left(\mathbf{w}^k\right)$, we can set up a number of conditions we would like our approximation to satisfy. Here are a few obvious characteristics we want $\mathbf{S}^{k}$ to have (that make it similar to the Hessian)

  • The Hessian $\nabla^{2}g\left(\mathbf{w}^{k}\right)$ is always symmetric, so our approximation $\mathbf{S}^k$ should be too
  • The Hessian $\nabla^{2}g\left(\mathbf{w}\right)$ is continuous, so our approximations $\mathbf{S}^{k}$ and $\mathbf{S}^{k-1}$ should not differ too wildly
  • We would like the approximation to be positive semi-definite, so we are going downhill

Taken together and translated into the language of mathematics, these conditions give rise to a constrained optimization problem for recovering a solution matrix $\mathbf{S}^k$ with these properties. This constrained problem takes the form

\begin{equation} \begin{array}{cc} \underset{\mathbf{S}}{\text{minimize}} & \left\Vert \mathbf{S}^{\,}-\mathbf{S}^{k-1}\right\Vert _{F}^{2}\\ \text{subject to} & \mathbf{S}\mathbf{a}^{k}=\mathbf{b}^{k}\\ & \mathbf{S} \,\,\, \text{symmetric} \end{array} \end{equation}

where as before $\mathbf{a}^k = \mathbf{w}^{k-1} - \mathbf{w}^{k-2}$ and $\mathbf{b}^k = \nabla g(\mathbf{w}^{k-1}) - \nabla g(\mathbf{w}^{k-2})$.
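
As a small numerical sanity check - using a hand-chosen quadratic cost, our own example rather than anything from the text - the sketch below forms $\mathbf{a}^k$ and $\mathbf{b}^k$ from two points and confirms that the true Hessian satisfies the secant condition $\mathbf{S}\mathbf{a}^{k}=\mathbf{b}^{k}$ exactly in this case (for a non-quadratic cost it holds only approximately), which is what makes the condition a sensible constraint to place on a cheap Hessian substitute.

```python
import numpy as np

# a hand-built N = 3 dimensional convex quadratic g(w) = 0.5 w^T Q w + r^T w
# (the particular Q and r are our own choice of illustration)
np.random.seed(0)
A = np.random.randn(3, 3)
Q = A @ A.T + np.eye(3)               # symmetric positive definite Hessian of g
r = np.random.randn(3)
grad = lambda w: Q @ w + r            # gradient of the quadratic

# two consecutive points, standing in for w^{k-2} and w^{k-1}
w_prev = np.random.randn(3)
w_curr = np.random.randn(3)

a = w_curr - w_prev                   # a^k = w^{k-1} - w^{k-2}
b = grad(w_curr) - grad(w_prev)       # b^k = grad g(w^{k-1}) - grad g(w^{k-2})

# the true Hessian Q satisfies the secant condition S a^k = b^k exactly here;
# the condition alone gives only N equations for the N x N unknown S, hence
# the additional symmetry / closeness conditions in the constrained problem above
print(np.allclose(Q @ a, b))          # True
```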