

Chapter 3: Derivatives and Automatic Differentiation

3.6 Taylor series

In this Section we use everything we have just learned about higher-order derivatives to define the Taylor Series of a function, a fundamental tool for mathematical optimization.

3.6.1 Linear approximation is only the beginning

We began our discussion of derivatives in Chapter 3 by defining the derivative at a point as the slope of the tangent line to a given input function. For a function $g(w)$ we then formally described the tangent line at a point $w^0$ as

\begin{equation} h(w) = g(w^0) + \frac{\mathrm{d}}{\mathrm{d}w}g(w^0)(w - w^0) \end{equation}

with the slope here given by the derivative $\frac{\mathrm{d}}{\mathrm{d}w}g(w^0)$. The justification for examining the tangent line / the derivative in the first place is fairly straightforward: locally (close to the point $w^0$) the tangent line looks awfully similar to the function, so if we want to better understand $g$ near $w^0$ we can just as well look at the tangent line. This makes our lives a lot easier because a line is a fairly simple object - especially when compared to an arbitrary function $g$ - and so understanding the tangent line is always a simple affair.

Below we plot an example function along with the tangent line defined by the derivative at a range of points over the interval $[-3,3]$. You can use the slider below the figure to move the point $w^0$ around in this range and view the corresponding first order Taylor series approximation there (shown in green). Here the derivative is being computed using an Automatic Differentiator.

In [4]:
g = lambda w: np.sin(3*w)
test = calclib.taylor2d_viz.visualizer(g = g)
test.draw_it(num_frames = 100,first_order = True)
Out[4]:

[Interactive figure: first order Taylor series approximation (green) at a movable point $w^0$]

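Since calclib and its visualizer are not reproduced here, the first order approximation itself is easy to sketch directly. The following is a minimal sketch using only numpy, with the derivative approximated by a centered finite difference (the figure above uses an Automatic Differentiator instead, which computes it exactly); the choice of $g$ and the test points are just illustrative.

```python
import numpy as np

# example function from the text
g = lambda w: np.sin(3 * w)

def tangent_line(g, w0, w, eps=1e-6):
    # approximate g'(w0) with a centered finite difference
    # (stand-in for the Automatic Differentiator used in the figure)
    dg = (g(w0 + eps) - g(w0 - eps)) / (2 * eps)
    # first order Taylor series: h(w) = g(w0) + g'(w0)(w - w0)
    return g(w0) + dg * (w - w0)

w0 = 1.0
# at w0 the tangent line matches g exactly; nearby it remains very close
print(tangent_line(g, w0, w0) - g(w0))            # 0.0
print(abs(tangent_line(g, w0, 1.01) - g(1.01)))   # small
```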
If we study the form of our tangent line $h(w)$ closely, we can define in precise mathematical terms how it matches the function $g$. Notice first of all that the tangent line takes on the same value as the function $g$ at the point $w^0$. Plugging in $w^0$ into $h$ we can see that

\begin{equation} h(w^0) = g(w^0) + \frac{\mathrm{d}}{\mathrm{d}w}g(w^0)(w^0 - w^0) = g(w^0) \end{equation}

Next notice that the first derivative values of these two functions match as well. That is, if we take the first derivative of $h$ with respect to $w$ - applying the derivative rules for elementary functions/operations we have seen in the last several Sections - we can see that

\begin{equation} \frac{\mathrm{d}}{\mathrm{d}w}h(w^0) = \frac{\mathrm{d}}{\mathrm{d}w}\left (g(w^0) + \frac{\mathrm{d}}{\mathrm{d}w}g(w^0)(w - w^0)\right) = \frac{\mathrm{d}}{\mathrm{d}w}\left ( \frac{\mathrm{d}}{\mathrm{d}w}g(w^0)(w - w^0)\right) = \frac{\mathrm{d}}{\mathrm{d}w}g(w^0) \end{equation}

In short, the tangent line $h$ matches $g$ at $w^0$ in the sense that both the function value and derivative value are equal there.

\begin{array} \ 1. \,\,\, h(w^0) = g(w^0) \\ 2. \,\,\, \frac{\mathrm{d}}{\mathrm{d}w}h(w^0) = \frac{\mathrm{d}}{\mathrm{d}w}g(w^0) \\ \end{array}

What if we turned this around - say we wanted to find a line that satisfies these two properties? We start with a general line

\begin{equation} h(w) = a_0 + a_1(w - w^0) \end{equation}

with unknown coefficients $a_0$ and $a_1$ - and we want to determine the right values for these coefficients so that the line satisfies these two properties. What would we do? Well, since the two criteria above constitute a system of equations we can compute the left hand side of each and solve for the correct values of $a_0$ and $a_1$. Computing the left hand side of each - where $h$ is our general line - we end up with a trivial system of equations that we can solve for both unknowns simultaneously

\begin{array} \ h(w^0) = a_0 = g(w^0) \\ \frac{\mathrm{d}}{\mathrm{d}w}h(w^0) = a_1 = \frac{\mathrm{d}}{\mathrm{d}w}g(w^0) \end{array}

and behold, the coefficients are precisely those of the tangent line!

In other words, if we start with a general line and determine parameter values that satisfy the two criteria above we could have derived the tangent line from first principles.
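This first-principles derivation can be checked numerically. Below is a small sketch (not from the text's calclib library) using the example function $g(w) = \sin(3w)$ with its derivative written out by hand: the two criteria are posed as a linear system in $a_0$ and $a_1$, and its solution reproduces the tangent line coefficients exactly.

```python
import numpy as np

g  = lambda w: np.sin(3 * w)      # example function
dg = lambda w: 3 * np.cos(3 * w)  # its derivative, computed by hand

w0 = 1.0

# the two criteria written as a (trivial) linear system in a0, a1:
#   h(w0)  = 1*a0 + 0*a1 = g(w0)
#   h'(w0) = 0*a0 + 1*a1 = g'(w0)
A = np.array([[1.0, 0.0],
              [0.0, 1.0]])
b = np.array([g(w0), dg(w0)])
a0, a1 = np.linalg.solve(A, b)

# and behold: precisely the tangent line coefficients
print(np.isclose(a0, g(w0)), np.isclose(a1, dg(w0)))  # True True
```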

3.6.2 From tangent line to tangent quadratic

Given that the function and derivative values of the tangent line match those of its underlying function, can we do better? Can we find a simple function that matches the function value, first derivative value, and second derivative value of $g$ at the point $w^0$? In other words, is it possible to determine a simple function $h$ that satisfies

\begin{array} \ 1. \,\,\, h(w^0) = g(w^0) \\ 2. \,\,\, \frac{\mathrm{d}}{\mathrm{d}w}h(w^0) = \frac{\mathrm{d}}{\mathrm{d}w}g(w^0) \\ 3. \,\,\, \frac{\mathrm{d}^2}{\mathrm{d}w^2}h(w^0) = \frac{\mathrm{d}^2}{\mathrm{d}w^2}g(w^0) \\ \end{array}

Notice how a (tangent) line $h$ can only satisfy the first two of these properties and never the third since, being a degree 1 polynomial, it has $\frac{\mathrm{d}^2}{\mathrm{d}w^2}h(w) = 0$ for all $w$. This fact implies that we need at least a degree 2 polynomial to satisfy all three criteria, since the second derivative of a degree 2 polynomial need not be equal to zero.

What sort of degree 2 polynomial could satisfy these three criteria? Starting with a general degree 2 polynomial

\begin{equation} h(w) = a_0 + a_1(w - w^0) + a_2(w - w^0)^2 \end{equation}

with unknown coefficients $a_0$, $a_1$, and $a_2$, we can evaluate the left hand side of each criterion forming a system of 3 equations and solve for these coefficients.

\begin{array} \ h(w^0) = a_0 = g(w^0) \\ \frac{\mathrm{d}}{\mathrm{d}w}h(w^0) = a_1 = \frac{\mathrm{d}}{\mathrm{d}w}g(w^0) \\ \frac{\mathrm{d}^2}{\mathrm{d}w^2}h(w^0) = 2a_2 = \frac{\mathrm{d}^2}{\mathrm{d}w^2}g(w^0)\\ \end{array}

With all of our coefficients solved we have a degree 2 polynomial that satisfies the three desired criteria

\begin{equation} h(w) = g(w^0) + \frac{\mathrm{d}}{\mathrm{d}w}g(w^0)(w - w^0) + \frac{1}{2}\frac{\mathrm{d}^2}{\mathrm{d}w^2}g(w^0)(w - w^0)^2 \end{equation}

This is one step beyond the tangent line - a tangent quadratic function - and note that its first two terms are precisely the tangent line itself.
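As a quick numerical sketch of this formula (again outside the text's calclib library), we can write out the first and second derivatives of the example function $\sin(3w)$ by hand and compare the tangent line and tangent quadratic at an illustrative nearby point; the quadratic, carrying more derivative information, gives the smaller error here.

```python
import numpy as np

g   = lambda w: np.sin(3 * w)       # example function
dg  = lambda w: 3 * np.cos(3 * w)   # first derivative, by hand
d2g = lambda w: -9 * np.sin(3 * w)  # second derivative, by hand

def tangent_quadratic(w, w0):
    # h(w) = g(w0) + g'(w0)(w - w0) + (1/2) g''(w0)(w - w0)^2
    return g(w0) + dg(w0) * (w - w0) + 0.5 * d2g(w0) * (w - w0) ** 2

w0, w = 0.5, 0.6
line = g(w0) + dg(w0) * (w - w0)   # first order approximation
quad = tangent_quadratic(w, w0)    # second order approximation
print(abs(line - g(w)), abs(quad - g(w)))  # the quadratic error is far smaller
```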

Below we plot an example function with its first and second order Taylor series approximations over a range of points on the interval $[-3,3]$. You can use the slider below the figure to move the point $w^0$ around in this range and view the corresponding first / second order Taylor series approximations there (shown in green and blue respectively). Here the derivatives are being computed using an Automatic Differentiator.

In [3]:
g = lambda w: np.sin(3*w)
test = calclib.taylor2d_viz.visualizer(g = g)
test.draw_it(num_frames = 100,first_order = True,second_order = True)
Out[3]:

[Interactive figure: first and second order Taylor series approximations (green and blue) at a movable point $w^0$]

3.6.3 Building better and better local approximations

Having derived this quadratic based on our reflection on the tangent line, one could think of going one step further - that is, finding a simple function $h$ that satisfies even one more condition than the quadratic

\begin{array} \ 1. \,\,\, h(w^0) = g(w^0) \\ 2. \,\,\, \frac{\mathrm{d}}{\mathrm{d}w}h(w^0) = \frac{\mathrm{d}}{\mathrm{d}w}g(w^0) \\ 3. \,\,\, \frac{\mathrm{d}^2}{\mathrm{d}w^2}h(w^0) = \frac{\mathrm{d}^2}{\mathrm{d}w^2}g(w^0) \\ 4. \,\,\, \frac{\mathrm{d}^3}{\mathrm{d}w^3}h(w^0) = \frac{\mathrm{d}^3}{\mathrm{d}w^3}g(w^0) \\ \end{array}

Noting that no degree 2 polynomial could satisfy this last condition, since its third derivative is always equal to zero, we could seek out a degree 3 polynomial. Using the same analysis as above - setting up the corresponding system of equations based on a generic degree 3 polynomial - leads to the conclusion that the following does indeed satisfy all of the criteria above

\begin{equation} h(w) = g(w^0) + \frac{\mathrm{d}}{\mathrm{d}w}g(w^0)(w - w^0) + \frac{1}{2}\frac{\mathrm{d}^2}{\mathrm{d}w^2}g(w^0)(w - w^0)^2 + \frac{1}{3\times2}\frac{\mathrm{d}^3}{\mathrm{d}w^3}g(w^0)(w - w^0)^3 \end{equation}

This is an even better approximation of $g$ near the point $w^0$, since it contains more of the function's derivative information there.

And of course we could keep going. Setting up the corresponding set of $N+1$ criteria - the first demanding that $h(w^0) = g(w^0)$ and the remaining $N$ demanding that the first $N$ derivatives of $h$ match those of $g$ at $w^0$ - leads to the construction of the degree $N$ polynomial

\begin{equation} h(w) = g(w^0) + \sum_{n=1}^{N} \frac{1}{n!}\frac{\mathrm{d}^n}{\mathrm{d}w^n}g(w^0)(w - w^0)^n \end{equation}

Notice how setting $N=1$ recovers the tangent line, $N=2$ the tangent quadratic, and so on.

This general degree $N$ polynomial is called the Taylor Series of $g$ at the point $w^0$. It is the degree $N$ polynomial that matches $g$ as well as its first $N$ derivatives at the point $w^0$, and therefore approximates $g$ near this point better and better as we increase $N$.

The degree $N$ polynomial $h(w) = g(w^0) + \sum_{n=1}^{N} \frac{1}{n!}\frac{\mathrm{d}^n}{\mathrm{d}w^n}g(w^0)(w - w^0)^n$ is called the Taylor Series of $g$ at the point $w^0$.
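As a sketch of this general formula in code (not the calclib visualizer), note that the $n$th derivative of the example function $\sin(3w)$ has the hand-derived closed form $3^n \sin(3w + n\pi/2)$, which lets us build the degree $N$ Taylor Series directly and watch the error at a nearby point shrink as $N$ grows; the expansion point and test point are illustrative choices.

```python
import numpy as np
from math import factorial

g = lambda w: np.sin(3 * w)  # example function from the text

def dng(w, n):
    # nth derivative of sin(3w): 3^n * sin(3w + n*pi/2)
    return 3 ** n * np.sin(3 * w + n * np.pi / 2)

def taylor(w, w0, N):
    # degree N Taylor Series of g at w0
    h = g(w0)
    for n in range(1, N + 1):
        h += dng(w0, n) / factorial(n) * (w - w0) ** n
    return h

# odd N only: at w0 = 0 the even derivatives of sin(3w) vanish
w0, w = 0.0, 0.5
for N in [1, 3, 5, 7]:
    print(N, abs(taylor(w, w0, N) - g(w)))  # error shrinks as N grows
```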

We illustrate the first four Taylor Series polynomials for a user-defined input function below, animated over a range of points in the function's input domain. You can use the slider to shift the point at which each approximation is made back and forth across the input range.

In [6]:
# what function should we play with?  Defined in the next line.
g = lambda w: np.sin(3*w)

# create an instance of the visualizer with this function 
taylor_viz = calclib.taylor_series_simultaneous_approximations.visualizer(g = g)

# run the visualizer for our chosen input function
taylor_viz.draw_it(num_frames = 200)
Out[6]:

[Interactive figure: the first four Taylor Series approximations at a movable point]

Examining this figure we can clearly see that the approximation becomes better and better as we increase $N$. This makes sense, as each higher degree polynomial matches more of the underlying function's derivative information at the point. However we can never expect it to match the entire function everywhere: we build each polynomial to match $g$ at only a single point, so regardless of degree we can only expect it to match the underlying function near the point $w^0$.
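To see this locality concretely, here is a small sketch (again with the example $\sin(3w)$, and the degree 3 Taylor Series at $w^0 = 0$ written out by hand) showing the approximation error growing rapidly as we move away from the expansion point:

```python
import numpy as np

g = lambda w: np.sin(3 * w)  # example function from the text

# degree 3 Taylor Series of sin(3w) at w0 = 0, written out by hand:
# h(w) = 3w - (27/6) w^3
h = lambda w: 3 * w - 4.5 * w ** 3

# the error grows as we move away from w0 = 0
for w in [0.1, 0.5, 1.0, 2.0]:
    print(w, abs(h(w) - g(w)))
```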

3.6.4 The Taylor Series as a local function approximator

This idea of using a simple function in order to better understand another (potentially more complex) function is the germ of a centuries-old area of mathematics called function approximation. In the parlance of function approximation, what we have done here with the Taylor Series is approximate a function locally at a point. Throughout our study of machine learning / deep learning we will see the notion of function approximation pop up in a number of contexts.