

Chapter 3: Derivatives and Automatic Differentiation

3.3 Derivative equations and hand computations

In the previous Section we discussed the derivative of a function of a single input, evaluated at a single point. We then saw how to quickly build a way to compute numerical derivatives across entire input spaces. This approach - called numerical differentiation - while simple to build, is unfortunately unstable and potentially inaccurate depending on the input function.

In this Section we explore how we can derive formulae for the derivatives of generic functions constructed from elementary functions/operations, which has far-reaching and very positive consequences for the effective automated computation of derivatives.

In [1]:
# imports used throughout this Section; calclib and baslib are the custom
# plotting libraries used in this series (module names taken from the cells
# below - submodule imports may be needed depending on how they are packaged)
import numpy as np
import calclib
import baslib

3.3.1 The derivative as a table of values

Every mathematical function - whether or not it begins with an equation - can be expressed as a table of input/output values. For example, the function $g(w) = w^2$ can be expressed as an infinitely long table like this


| input: $w$ | output: $g(w)=w^2$ |
| --- | --- |
| $0$ | $0$ |
| $-0.4$ | $0.16$ |
| $3$ | $9$ |
| $-1$ | $1$ |
| $\vdots$ | $\vdots$ |

Here the input/output pairs - of which there are infinitely many - are not ordered in any particular way. Note that the reverse direction is harder: as we discuss in our material about the basics of mathematical functions, it is often not the case that a function which begins as a table of values can be expressed as an equation.

The derivative $\frac{\mathrm{d}}{\mathrm{d}w}g(w)$ of a given function - like $g(w) = w^2$ - when evaluated at every point is certainly an example of a table of values. For example we could write it like this


| input: $w$ | output: $\frac{\mathrm{d}}{\mathrm{d}w}g(w)$ |
| --- | --- |
| $0$ | $0$ |
| $-0.4$ | $-0.8$ |
| $3$ | $6$ |
| $-1$ | $-2$ |
| $\vdots$ | $\vdots$ |

where once again the input/output pairs - of which there are infinitely many - are not written in any particular order. So - in general - the derivative is certainly a mathematical function which can be expressed as a table of values. But can we express the derivative as an equation?
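Before answering, note that the derivative table above can itself be generated numerically using the centered-difference approximation from the previous Section (a minimal sketch; the helper name and step size are our own choices):

```python
import numpy as np

def numerical_derivative(g, w, h=1e-5):
    # centered-difference approximation from the previous Section
    return (g(w + h) - g(w - h)) / (2 * h)

g = lambda w: w**2
for w in [0, -0.4, 3, -1]:
    # approximately matches the table: 0, -0.8, 6, -2 (up to floating point error)
    print(w, numerical_derivative(g, w))
```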

Well, we could first try plotting the table of values for some examples to see if we can visually identify an equation associated with each table - what does a derivative function look like when we plot it?

Example 1. Plotting the derivative of $g(w) = w^2$

In the next Python cell we take a function of one input - here

\begin{equation} g(w) = w^2 \end{equation}

and compute its derivative over a coarse sample of points on the interval $[-2.5, 2.5]$, using a slider mechanism to animate this computation. In the left panel we show the original function along with a point shown in red and its corresponding tangent line in green, whose slope is given by the derivative. As you move the slider from left to right the point at which the derivative is taken moves smoothly across the input interval. In the right panel we simultaneously plot the value of the derivative in green, along with every previous derivative value computed up to that point.

In [2]:
# what function should we play with?  Defined in the next line, along with our fixed point where we show tangency.
g = lambda w: w**2

# create an instance of the visualizer with this function
st = calclib.function_derivative_joint_visualizer.visualizer(g = g)

# run the visualizer for our chosen input function and initial point
st.draw_it(num_frames = 250)
Out[2]:



Here the plotted derivative looks like a line! In fact - pushing the slider all the way to the right - from the picture we can be more specific: the line crosses the origin, so the vertical intercept must be zero, and at $w = 2.5$ (the end of the plot) the derivative's value looks to be around $5$, so the slope of this line appears to be around $\frac{5 - 0}{2.5 - 0} = 2$. So - at least over the region of input plotted - the equation of the derivative appears to be

$$ \frac{\mathrm{d}}{\mathrm{d}w}g(w) = 2w$$

Example 2. Plotting the derivative of $g(w) = \text{sin}(w)$

Let's try the same experiment with another function - the sinusoid

\begin{equation} g(w) = \text{sin}(w) \end{equation}
In [3]:
# what function should we play with?  Defined in the next line, along with our fixed point where we show tangency.
g = lambda w: np.sin(w)

# create an instance of the visualizer with this function
st = calclib.function_derivative_joint_visualizer.visualizer(g = g)

# run the visualizer for our chosen input function and initial point
st.draw_it(num_frames = 250)
Out[3]:



All right - push the slider all the way to the right and let's examine the derivative. First off, the derivative looks quite wavy - like a sine or cosine function of the same frequency as the original function itself. Since its value at $w = 0$ is maximal, some sort of cosine equation might work. Notice too that the magnitude of the derivative - its vertical range - has not changed.

Putting together these pieces we could loosely guess that the equation for the derivative over this range of input is the cosine function

\begin{equation} \frac{\mathrm{d}}{\mathrm{d}w}g(w) = \text{cos}(w) \end{equation}

Example 3. Plotting the derivative of $g(w) = \text{sin}(w^3)$

OK - one more. Let's try the experiment with a wilder-looking function

\begin{equation} g(w) = \text{sin}(w^3) \end{equation}
In [4]:
# what function should we play with?  Defined in the next line, along with our fixed point where we show tangency.
g = lambda w: np.sin(w**3)

# create an instance of the visualizer with this function
st = calclib.function_derivative_joint_visualizer.visualizer(g = g)

# run the visualizer for our chosen input function and initial point
st.draw_it(num_frames = 400)
Out[4]:



All right - push the slider all the way to the right. What sort of equation can we conjure up that might represent the derivative in this case? It's much harder to tell with this example - does one even exist?

This example is indicative of just how challenging it can be in general to try to glean the equation of a derivative by eye-balling a plot of its table of values.

Wrapping up

While the equation of the derivative was not at all easy to eyeball in the final example, we need not let this crush the hope that - in general - an equation might exist for the derivative of a generic function, since in the first two cases a reasonable equation was indeed derived by examining the derivative plot. In many cases - certainly for machine learning applications - derivatives do in fact have equations associated with them. In order to come up with a consistent and wide-ranging rule for finding them - however - we need to go beyond visual examination and use a bit of mathematics.

3.3.2. Derivative equations for elementary functions and operations

Our experiments above suggest that we may be able to write down an equation for the derivative of at least some elementary functions. In fact we can find such equations for every elementary function, and for virtually every combination of elementary functions. This is a profound fact: every elementary function with a known equation - including lines, polynomials, trigonometric functions (sine, cosine, tanh, etc.), transcendental functions (log, exponentials, etc.), and even functions that are not differentiable everywhere (e.g., the ReLU function) - has a closed form formula for its derivative, as does virtually every combination of these functions one can imagine.

In Tables 1 and 2 below we organize the derivative formulae for popular elementary functions and elementary operations first discussed in our series on basic functions. In this Subsection we go through a few of these formulae, verifying visually that they are indeed true. For those interested we also provide rigorous mathematical proofs of many of these formulae in the final Subsection here (many more proofs can be found in any standard calculus resource). We actually verified the sinusoid rule and one instance of the power rule from Table 1 visually in the examples above. One can use the slider toys presented there to visually identify other rules on this list as well.


Table 1: Derivative formulae for elementary functions

| elementary function | equation | derivative |
| --- | --- | --- |
| constant | $c$ | $0$ |
| monomial (degree $p\neq 0$) | $w^p$ | $pw^{p-1}$ |
| sine | $\text{sin}(w)$ | $\text{cos}(w)$ |
| cosine | $\text{cos}(w)$ | $-\text{sin}(w)$ |
| exponential | $e^w$ | $e^w$ |
| logarithm | $\text{log}(w)$ | $\frac{1}{w}$ |
| hyperbolic tangent | $\text{tanh}(w)$ | $1 - \text{tanh}^2(w)$ |
| rectified linear unit (ReLU) | $\text{max}\left(0,w\right)$ | $\begin{cases}0 & w\leq0\\1 & w>0\end{cases}$ |
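The entries of Table 1 can be spot-checked against the numerical differentiation approach of the previous Section (a minimal sketch; the test point, helper name, and tolerance are our own choices):

```python
import numpy as np

def centered_diff(g, w, h=1e-6):
    # numerical derivative from the previous Section
    return (g(w + h) - g(w - h)) / (2 * h)

w = 0.7   # an arbitrary positive test point (so that log is defined)
table_1 = [
    ('monomial (p = 3)', lambda w: w**3, lambda w: 3 * w**2),
    ('sine',             np.sin,         np.cos),
    ('cosine',           np.cos,         lambda w: -np.sin(w)),
    ('exponential',      np.exp,         np.exp),
    ('logarithm',        np.log,         lambda w: 1 / w),
    ('tanh',             np.tanh,        lambda w: 1 - np.tanh(w)**2),
]
for name, g, dg in table_1:
    # each comparison prints True
    print(name, np.isclose(centered_diff(g, w), dg(w)))
```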


Table 2: Derivative formulae for elementary operations

| operation | equation | derivative rule |
| --- | --- | --- |
| addition of a constant $c$ | $g(w) + c$ | $\frac{\mathrm{d}}{\mathrm{d}w}\left(g(w) + c\right)= \frac{\mathrm{d}}{\mathrm{d}w}g(w)$ |
| multiplication by a constant $c$ | $cg(w)$ | $\frac{\mathrm{d}}{\mathrm{d}w}\left(cg(w)\right)= c\frac{\mathrm{d}}{\mathrm{d}w}g(w)$ |
| addition of functions (often called the summation rule) | $f(w) + g(w)$ | $\frac{\mathrm{d}}{\mathrm{d}w}(f(w) + g(w))= \frac{\mathrm{d}}{\mathrm{d}w}f(w) + \frac{\mathrm{d}}{\mathrm{d}w}g(w)$ |
| multiplication of functions (often called the product rule) | $f(w)g(w)$ | $\frac{\mathrm{d}}{\mathrm{d}w}(f(w)\cdot g(w))= \left(\frac{\mathrm{d}}{\mathrm{d}w}f(w)\right)\cdot g(w) + f(w)\cdot \left(\frac{\mathrm{d}}{\mathrm{d}w}g(w)\right)$ |
| composition of functions (often called the chain rule) | $f(g(w))$ | $\frac{\mathrm{d}}{\mathrm{d}w}(f(g(w)))= \frac{\mathrm{d}}{\mathrm{d}g}f(g) \cdot \frac{\mathrm{d}}{\mathrm{d}w}g(w)$ |
| maximum of two functions | $\text{max}(f(w),\,g(w))$ | $\frac{\mathrm{d}}{\mathrm{d}w}(\text{max}(f(w),\,g(w))) = \begin{cases}\frac{\mathrm{d}}{\mathrm{d}w}f\left(w\right) & \text{if}\,\,\,f\left(w\right)\geq g\left(w\right)\\\frac{\mathrm{d}}{\mathrm{d}w}g\left(w\right) & \text{otherwise}\end{cases}$ |

Note again how the operation rules in Table 2 are stated for generic functions. There are not two distinct rules telling one how to compute the derivative of the sum of two polynomials versus the sum of two sinusoids: there is just one rule for addition. Likewise there are not separate rules for the product of an exponential and a polynomial versus the product of a sinusoid and a logarithm: there is only one rule for taking the derivative of a product of functions.
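These generic rules can themselves be spot-checked numerically for any particular pair of functions (a minimal sketch; the test point, the choice of sine and exponential, and the helper name are our own choices):

```python
import numpy as np

def centered_diff(g, w, h=1e-6):
    # numerical derivative from the previous Section
    return (g(w + h) - g(w - h)) / (2 * h)

w = 1.3                    # an arbitrary test point
f, df = np.sin, np.cos     # f and its known derivative
g, dg = np.exp, np.exp     # g and its known derivative

# product rule: (f g)' = f' g + f g'  -  prints True
product_rule = df(w) * g(w) + f(w) * dg(w)
print(np.isclose(centered_diff(lambda w: f(w) * g(w), w), product_rule))

# chain rule: f(g(w))' = f'(g(w)) g'(w)  -  prints True
chain_rule = df(g(w)) * dg(w)
print(np.isclose(centered_diff(lambda w: f(g(w)), w), chain_rule))
```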

Every elementary function with a known equation - including lines, polynomials, trigonometric functions (sine, cosine, tanh, etc.), transcendental functions (log, exponentials, etc.), and even functions that are not differentiable everywhere (e.g., the ReLU function) - has a closed form formula for its derivative, as does virtually every combination of these functions one can imagine.

Using a handful of simple rules that tell us how to deal with the combination of two or more elementary functions, along with perhaps a dozen specific elementary function derivative formulae like the ones shown in Table 1, one can write down a formula for the derivative of virtually any function one can dream up. This is because virtually every function - especially those used in machine learning / deep learning - can be broken down into elementary functions/operations using a computation graph (as first discussed in our Sections on basic functions). Because of this, any function built out of elementary functions and operations has a derivative formula that is also constructed from elementary parts.

Any function built out of elementary functions and operations has a derivative formula that is also constructed from elementary parts.

Moreover, because any such function can be broken down into a sequence of basic operations on one or two elementary functions, if we are patient and break down a given function into its computation graph we only ever need to apply the derivative rules in Table 2 to elementary functions.

Computing the derivative of a generic function constructed from elementary functions and operations reduces to computing the derivatives of these elementary functions using the specific derivative formulae in Table 1, while only needing to apply the derivative rules in Table 2 to these elementary functions.

This is - practically speaking - extremely important: it says that the computation of a general function's derivative is in fact equivalent to computing the derivatives of the elementary functions it is built from, and combining them according to the operations listed in Table 2. We will see several examples of how this is done in the next Subsection.

3.3.3 Employing derivative rules in combination

In this subsection we see how we can use the computation graph decomposition of a function along with the derivative formulae / rules in Tables 1 and 2 to compute the derivative equation of arbitrarily complicated functions. We approach these derivative calculations algorithmically, following the same general steps with each example:

  1. Decompose the example into its computation graph, as first discussed in our basic function series.
  2. Begin at the input variable $w$ and compute moving forward through the graph until we reach the final parent node.
  3. At each node in the graph we compute the derivative with respect to $w$; moving from left to right across the graph, these calculations are naturally incorporated into the derivative calculations of parent nodes.

Example 4. Derivative calculation of $g(w) = \text{sin}(w^3)$ going forwards through the computation graph

We plot the simple computation graph decomposing this equation

We want to compute $\frac{\mathrm{d}}{\mathrm{d}w}g(w)$, and given the computation graph decomposition of the function we can write this equivalently as $\frac{\mathrm{d}}{\mathrm{d}w}g(w) = \frac{\mathrm{d}}{\mathrm{d}w}b(a)$. It is this latter form we will compute using the graph structure.

We can begin by computing the derivative of the first node $a(w)$ with respect to $w$.

Since $a(w) = w^3$ this derivative is directly given by the power rule in Table 1 as

\begin{equation} \frac{\mathrm{d}}{\mathrm{d}w}a(w) = 3\times w^2 \end{equation}

Moving forward through the graph we now compute the derivative of the next parent node $b(a)$ with respect to $w$.

Here we must use the chain rule from Table 2, writing the derivative formally as

\begin{equation} \frac{\mathrm{d}}{\mathrm{d}w}b(a) = \frac{\mathrm{d}}{\mathrm{d}a}b(a)\frac{\mathrm{d}}{\mathrm{d}w}a(w) \end{equation}

We have already computed the latter derivative $\frac{\mathrm{d}}{\mathrm{d}w}a(w)$ above, so all we need is $ \frac{\mathrm{d}}{\mathrm{d}a}b(a)$. Since $b(a) = \text{sin}(a)$ this derivative is found on the first Table, where we see that $ \frac{\mathrm{d}}{\mathrm{d}a}b(a) = \text{cos}(a)$. So all together we have the desired derivative calculation

\begin{equation} \frac{\mathrm{d}}{\mathrm{d}w}b(a) = \frac{\mathrm{d}}{\mathrm{d}a}b(a)\frac{\mathrm{d}}{\mathrm{d}w}a(w) = \text{cos}(a)\times 3\times w^2 = \text{cos}(w^3) \times 3 \times w^2 \end{equation}

where the final equality follows from substituting $a = w^3$ into cosine. So we have $\frac{\mathrm{d}}{\mathrm{d}w}g(w) =\text{cos}(w^3) \times 3 \times w^2 $. We show the computation graph of this derivative below.

In the next Python cell we plot both the original function and this newly derived derivative function.

In [5]:
# specify range of input for our functions
w = np.linspace(-3,3,2000)    

# generate original function
g = np.sin(w**3)
function_table = np.stack((w,g), axis=1) 

# generate derivative function
dgdw = np.cos(w**3)*3*w**2
derivative_table = np.stack((w,dgdw), axis=1) 

# use custom plotter to show both functions
baslib.basics_plotter.double_plot(table1 = function_table, table2 = derivative_table,plot_type = 'continuous',xlabel = '$w$',ylabel_1 = '$g(w)$',ylabel_2 = r'$\frac{\mathrm{d}}{\mathrm{d}w}g(w)$',fontsize = 18)

In the previous Section we plotted the raw derivative values for this function and asked what the formula of the derivative - plotted in the right panel above - might be. Now we have it.


Let's take a minute to appreciate what we have done here. We started off wanting to compute the derivative of $g(w) = \text{sin}(w^3)$ - which is not an elementary function with a known formula in Table 1. But the computation graph / parent-child decomposition of this function allows us to express its derivative entirely in terms of the derivatives of elementary functions, whose forms are known and shown in Table 1. And - as we will see through further examples - this approach is completely general.

The computation graph deconstruction of a function - breaking the function up into a nested composition of simple functions / operations - allows us to express the derivatives of complex functions in terms of the derivatives of elementary functions and operations.

We also see how the derivative - when applied to a function composed of elementary components - takes in one mathematical function / computation graph and produces another. In other words, the derivative operation can be thought of as a function that transforms one computation graph (an input equation) into another (its derivative equation).

The derivative operation can be thought of as a function that transforms one computation graph (an input equation) into another (its derivative equation).

We illustrate this relationship for the particular instance studied in this example below, but it is true in general.

Example 5. Computing the derivative of $g(w) = \text{tanh}(w)\text{cos}(w) + \text{log}(w)$ going forwards through the computation graph

First we build the computation graph for this function, as shown below

Our goal is to compute $\frac{\mathrm{d}}{\mathrm{d}w}g(w)$ which - given the structure of the graph above - can be written equivalently as $\frac{\mathrm{d}}{\mathrm{d}w}g(w) = \frac{\mathrm{d}}{\mathrm{d}w}e(d,c)$. And it is this latter calculation we will perform using the computation graph.

Examining the computation graph we see that $w$ has three parents: $a$, $b$, and $c$.

Computing each parent-child derivative we have

\begin{array} \ \frac{\mathrm{d}}{\mathrm{d}w}a(w) = (1 - \text{tanh}^2(w)) \\ \frac{\mathrm{d}}{\mathrm{d}w}b(w) = -\text{sin}(w)\\ \frac{\mathrm{d}}{\mathrm{d}w}c(w) = \frac{1}{w} \\ \end{array}

With these derivatives with respect to $w$ computed, we move forward to the next parents in the graph, seeking the derivatives of these nodes with respect to $w$ as well. Examining the graph we can see that the parent of $a$ and $b$ is $d$, and the parent of $c$ is $e$.

Let's start with $d$: we want to compute $\frac{\mathrm{d}}{\mathrm{d}w}d(a,b)$.

Because $d$ is a product of $a$ and $b$ we must use the product rule, and this derivative is given as

\begin{equation} \frac{\mathrm{d}}{\mathrm{d}w}d(a,b) = \left(\frac{\mathrm{d}}{\mathrm{d}w}a(w)\right)\times b(w) + a(w) \times \left(\frac{\mathrm{d}}{\mathrm{d}w}b(w)\right) \end{equation}

Now, because we are moving forward through the graph, we have already computed the derivatives of $a$ and $b$ with respect to $w$, and need only compute the parent-child derivatives, which are easily given as

\begin{array} \ \frac{\mathrm{d}}{\mathrm{d}a}d(a,b) = b = \text{cos}(w)\\ \frac{\mathrm{d}}{\mathrm{d}b}d(a,b) = a = \text{tanh}(w) \\ \end{array}

Thus the entire derivative can be computed explicitly at this point as

\begin{equation} \frac{\mathrm{d}}{\mathrm{d}w}d(a,b) = (1 - \text{tanh}^2(w))\times \text{cos}(w) + \text{tanh}(w) \times (-\text{sin}(w)) \end{equation}

Now that we have resolved the derivative at $d$ we can work on the final parent node $e$, which is a parent of both $d$ and $c$.

Since $e$ is defined as the sum of $d$ and $c$ its derivative with respect to $w$ is written as

\begin{equation} \frac{\mathrm{d}}{\mathrm{d}w}e(d,c) = \frac{\mathrm{d}}{\mathrm{d}w}d(a,b) + \frac{\mathrm{d}}{\mathrm{d}w}c(w) \end{equation}

We have already computed both derivatives on the right side above, so plugging both in we have our desired derivative

\begin{equation} \frac{\mathrm{d}}{\mathrm{d}w}g(w) = \frac{\mathrm{d}}{\mathrm{d}w}e(d,c) = (1 - \text{tanh}^2(w))\times \text{cos}(w) + \text{tanh}(w) \times (-\text{sin}(w)) + \frac{1}{w} \end{equation}

The computation graph of this equation is shown below

We plot both the original function and our newly found derivative equation in the Python cell below.

In [6]:
# specify range of input for our functions
w = np.linspace(0.01,3,2000)    

# generate original function
g = np.tanh(w)*np.cos(w) + np.log(w)
function_table = np.stack((w,g), axis=1) 

# generate derivative function
dgdw = (1 - np.tanh(w)**2)*np.cos(w) + np.tanh(w)*(-np.sin(w)) + 1/w
derivative_table = np.stack((w,dgdw), axis=1) 

# use custom plotter to show both functions
baslib.basics_plotter.double_plot(table1 = function_table, table2 = derivative_table,plot_type = 'continuous',xlabel = '$w$',ylabel_1 = '$g(w)$',ylabel_2 = r'$\frac{\mathrm{d}}{\mathrm{d}w}g(w)$',fontsize = 18)
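As a quick sanity check, we can also compare our hand-derived formula against the numerical differentiation approach of the previous Section (a minimal sketch; the test points and helper name are our own choices):

```python
import numpy as np

def centered_diff(g, w, h=1e-6):
    # numerical derivative from the previous Section
    return (g(w + h) - g(w - h)) / (2 * h)

g    = lambda w: np.tanh(w) * np.cos(w) + np.log(w)
dgdw = lambda w: (1 - np.tanh(w)**2) * np.cos(w) + np.tanh(w) * (-np.sin(w)) + 1 / w

# the two derivative computations agree at every test point - prints True
w_test = np.array([0.5, 1.0, 2.5])
print(np.allclose(centered_diff(g, w_test), dgdw(w_test)))
```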

Let's reflect on this example for a moment.

Remember that every node on a graph represents an elementary operation - with every kind of function evaluation representing composition. This means that at each node on the graph we apply a rule from Table 2 - and in particular at each function evaluation node we apply the chain rule. This is universally true.

Every node on a computation graph represents an elementary operation, and hence a rule from Table 2 is used at each node on the graph. In particular all function evaluations represent composition, thus at each such node we apply the chain rule.

Note also how the derivative is built recursively in precisely the same order as the function itself - node-by-node from left to right, progressively constructed moving forward through the computation graph. A function and its derivative are built simultaneously moving forward through the computation graph together, node-by-node.

A function and its derivative are built simultaneously moving forward through the computation graph together, recursively node-by-node.
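This observation is precisely the idea behind forward-mode automatic differentiation, the subject of this Chapter. As a minimal illustrative sketch (the `Dual` class below is our own construction, not part of `calclib`), we can pair every node's value with its derivative and push both forward through the graph of Example 4:

```python
import numpy as np

class Dual:
    # a "dual number": a value paired with its derivative with respect to w
    def __init__(self, val, der=0.0):
        self.val, self.der = val, der

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # summation rule from Table 2
        return Dual(self.val + other.val, self.der + other.der)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # product rule from Table 2
        return Dual(self.val * other.val,
                    self.der * other.val + self.val * other.der)
    __rmul__ = __mul__

    def __pow__(self, p):
        # power rule from Table 1 combined with the chain rule
        return Dual(self.val ** p, p * self.val ** (p - 1) * self.der)

def sin(x):
    # sine rule from Table 1 combined with the chain rule
    return Dual(np.sin(x.val), np.cos(x.val) * x.der)

w = Dual(1.7, 1.0)     # seed derivative dw/dw = 1
out = sin(w ** 3)      # g(w) = sin(w^3), built node-by-node
# out.val = sin(1.7^3) and out.der = cos(1.7^3) * 3 * 1.7^2
print(out.val, out.der)
```

Evaluating `out.val` and `out.der` reproduces $g(1.7)$ and $\frac{\mathrm{d}}{\mathrm{d}w}g(1.7)$ in a single forward pass, mirroring how the function and its derivative are built simultaneously through the graph.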

Example 6. Computing the derivative of $g(w) = \frac{\text{cos}(20w)}{w^2 + 1}$ going forwards through the computation graph

We first decompose the function into its computation graph below.

Our goal is to compute $\frac{\mathrm{d}}{\mathrm{d}w}g(w)$, and examining the graph we can see that $\frac{\mathrm{d}}{\mathrm{d}w}g(w) = \frac{\mathrm{d}}{\mathrm{d}w}f(c,e)$, the right hand side we will be calculating using the computation graph.

First we compute the derivatives of each parent node of $w$: $a(w) = 20\times w$ and $b(w) = w^2$. These are given by the multiplication-by-a-constant rule from Table 2 and the power rule from Table 1, respectively.

\begin{array} \ \frac{\mathrm{d}}{\mathrm{d}w}a(w) = 20 \\ \frac{\mathrm{d}}{\mathrm{d}w}b(w) = 2\times w \\ \end{array}

Moving forward to the next parents $c(a) = \text{cos}(a)$ and $d(b) = b + 1$ we compute the derivative of each with respect to $w$.

In both instances we must use the chain rule, with these derivatives then written formally as

\begin{array} \ \frac{\mathrm{d}}{\mathrm{d}w}c(a) = \frac{\mathrm{d}}{\mathrm{d}a}c(a)\frac{\mathrm{d}}{\mathrm{d}w}a(w) \\ \frac{\mathrm{d}}{\mathrm{d}w}d(b) = \frac{\mathrm{d}}{\mathrm{d}b}d(b)\frac{\mathrm{d}}{\mathrm{d}w}b(w) \\ \end{array}

Given the forms of $c$ and $d$ we can easily compute $\frac{\mathrm{d}}{\mathrm{d}a}c(a) = -\text{sin}(a)$ and $\frac{\mathrm{d}}{\mathrm{d}b}d(b) = 1$, and since we already have computed the derivatives of each child with respect to $w$ we can quickly compute the desired derivatives above

\begin{array} \ \frac{\mathrm{d}}{\mathrm{d}w}c(a) = \frac{\mathrm{d}}{\mathrm{d}a}c(a)\frac{\mathrm{d}}{\mathrm{d}w}a(w) = -\text{sin}(a)\times20=-20\text{sin}(20w)\\ \frac{\mathrm{d}}{\mathrm{d}w}d(b) = \frac{\mathrm{d}}{\mathrm{d}b}d(b)\frac{\mathrm{d}}{\mathrm{d}w}b(w) = 1\times 2\times w = 2w\\ \end{array}

Next, looking back at the graph, we see the next parent node is $e(d) = \frac{1}{d}$.

Writing out its derivative with respect to $w$, one application of the chain rule gives

\begin{equation} \frac{\mathrm{d}}{\mathrm{d}w}e(d) = \frac{\mathrm{d}}{\mathrm{d}d}e(d)\frac{\mathrm{d}}{\mathrm{d}w}d(b) \end{equation}

We have already computed the rightmost derivative, and given the form of $e$ we can easily compute, using the power rule from Table 1, $\frac{\mathrm{d}}{\mathrm{d}d}e(d) = -\frac{1}{d^2}$. So all together we have

\begin{equation} \frac{\mathrm{d}}{\mathrm{d}w}e(d) = \frac{\mathrm{d}}{\mathrm{d}d}e(d)\frac{\mathrm{d}}{\mathrm{d}w}d(b) = -\frac{1}{d^2}\times 2w = -\frac{2w}{(b + 1)^2} = -\frac{2w}{(w^2 + 1)^2} \end{equation}

Finally, computing the derivative of the last parent node $f(c,e) = c\times e$ with respect to $w$,

we can write using the product rule

\begin{equation} \frac{\mathrm{d}}{\mathrm{d}w}f(c,e) = \left(\frac{\mathrm{d}}{\mathrm{d}w}c(a)\right) \times e(d) + c(a) \times \left(\frac{\mathrm{d}}{\mathrm{d}w}e(d)\right) \end{equation}

We have actually already computed the derivatives on the right hand side above, so all that need be done is combine them and backsubstitute in order to express the final calculation in terms of $w$

\begin{equation} \frac{\mathrm{d}}{\mathrm{d}w}f(c,e) = \left(\frac{\mathrm{d}}{\mathrm{d}w}c(a)\right) \times e(d) + c(a) \times \left(\frac{\mathrm{d}}{\mathrm{d}w}e(d)\right) = \left(-20\text{sin}(20w)\right) \times \frac{1}{w^2 + 1} + \text{cos}(20w)\times \left( -\frac{2w}{(w^2 + 1)^2}\right) \end{equation}

which is our desired derivative.

We plot both the original function and our newly found derivative equation in the Python cell below.

In [7]:
# specify range of input for our functions
w = np.linspace(-3,3,2000)    

# generate original function
g = np.cos(20*w)/(w**2 + 1)
function_table = np.stack((w,g), axis=1) 

# generate derivative function
dgdw = -20*np.sin(20*w)*(1/(w**2 + 1)) + np.cos(20*w)*(-2*w)/(w**2 + 1)**2
derivative_table = np.stack((w,dgdw), axis=1) 

# use custom plotter to show both functions
baslib.basics_plotter.double_plot(table1 = function_table, table2 = derivative_table,plot_type = 'continuous',xlabel = '$w$',ylabel_1 = '$g(w)$',ylabel_2 = r'$\frac{\mathrm{d}}{\mathrm{d}w}g(w)$',fontsize = 18)
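As before, we can sanity-check the result numerically (a minimal sketch; the test points and helper name are our own choices). Note the $(w^2+1)^2$ denominator in the second term, which the power rule produces when differentiating $e(d) = \frac{1}{d}$:

```python
import numpy as np

def centered_diff(g, w, h=1e-6):
    # numerical derivative from the previous Section
    return (g(w + h) - g(w - h)) / (2 * h)

g = lambda w: np.cos(20 * w) / (w**2 + 1)
# the squared denominator comes from d/dd (1/d) = -1/d^2 with d = w^2 + 1
dgdw = lambda w: (-20 * np.sin(20 * w) / (w**2 + 1)
                  + np.cos(20 * w) * (-2 * w) / (w**2 + 1)**2)

# the two derivative computations agree at every test point - prints True
w_test = np.array([-2.0, -0.5, 0.5, 2.0])
print(np.allclose(centered_diff(g, w_test), dgdw(w_test)))
```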

3.3.5 Section summary

We have now seen how - using a small set of universally applicable rules, and a diligent eye - we can compute the equation of a derivative for not only elementary functions, but virtually any combination of elementary functions one can imagine. Leveraging those rules we computed several examples by hand, describing an effective algorithm involving a) decomposing a general function into its elementary functions and operations, and b) computing individual derivatives at each node in terms of $w$ moving forward through the computation graph while c) applying the formulae / rules from Tables 1 and 2 to the individual components. In the process we saw how computing the derivative of a general function required only the knowledge of how to compute the derivatives of elementary functions (some of which are shown in Table 1), and how to combine these equations using the operations listed in Table 2. Furthermore, because of this, any function constructed out of elementary functions and operations has a derivative that must also be constructed from elementary functions/operations.

1) We can find the derivative equation for virtually any function by first decomposing it into its computation graph and then repeatedly - moving forward through the graph - applying the formulae / rules from Tables 1 and 2 to the individual components.

2) The derivative can be thought of as a function that transforms one computation graph (an input equation) into another computation graph (its derivative equation).

3) Every node on the graph represents an elementary operation - with function evaluations representing composition. At each node we apply the corresponding operation rule from Table 2.

4) Importantly these computations only require the knowledge of how to compute the derivatives of elementary functions (some of which are shown in Table 1), as well as how to combine these equations using the operations listed in Table 2.

5) Computation of these derivatives naturally flows forward through the computation graph.

6) Moreover both a function and its derivative are built simultaneously moving forward through the computation graph together, recursively node-by-node.

7) Furthermore, because of this, any function constructed out of elementary functions and operations has a derivative that must also be constructed from elementary functions/operations.

However, even though these computations are repeatable, we also saw how - practically speaking - even moderately complicated formulae can produce large amounts of repeatable but messy computation [1]. While we will certainly see instances in the future where performing these hand calculations will be worth the work, typically to shed light on the behavior of an important function or optimization algorithm, in general we should hand over this work to a computer.

3.3.6. Appendix of calculations*

In this appendix we discuss how the simple definition of the derivative discussed in the previous Section can be used to derive formulae for elementary functions, as well as the simple combination rules mentioned above that allow one to compute the derivative of an infinite variety of combinations of elementary functions. Calculations verifying these formulae can be found in virtually any calculus reference - we provide what we think are simpler and more transparent versions of such derivations for the interested reader.

Derivative formulae for derivatives of elementary functions

Recall that the definition of the derivative at a point $w_0$ - developed in the previous Section of this series - says that the slopes of two secant lines - both of which share $w_0$ as one point, and a point either to the left or to the right of $w_0$ as the other - should converge to the same value. We ended by writing this compactly as

\begin{equation} \frac{g(w_0 + \epsilon) - g(w_0)}{\epsilon} \end{equation}

which - if convergent for small magnitude $\epsilon$ - implies that $g$ has a derivative at $w_0$, which we wrote as $\frac{\mathrm{d}}{\mathrm{d}w}g(w_0)$. Here we will be using this definition to show that certain elementary functions have derivatives at every point $w$, and that the derivative can be expressed by an equation.

In general, the arguments showing that the derivative of a particular elementary function has a specific formula follow the same logic. Here is the flavor of how such arguments are made. First: compute the difference quotient above at a general point $w$ for a small $\epsilon$ value, possibly applying a property or trick from algebra, trigonometry, or the study of transcendental formulae depending on the function. Second: shrink the magnitude of $\epsilon$ to zero and out comes the general formula of the derivative.
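This two-step recipe can be mimicked numerically: compute the quotient at a fixed point, then shrink $\epsilon$ (a minimal sketch; the sine function and the point $w = 0.9$ are arbitrary choices):

```python
import numpy as np

# difference quotients of sin(w) at w = 0.9 for shrinking epsilon
w = 0.9
for eps in [1e-1, 1e-3, 1e-5]:
    q = (np.sin(w + eps) - np.sin(w)) / eps
    print(eps, q)

# the quotients approach the value predicted by Table 1
print('cos(0.9) =', np.cos(w))
```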

Example 7. Derivative of the constant function

Say our function is $g(w) = c$ where $c$ is some constant (e.g., $1$, $-33.2$, etc.). Let $\epsilon$ be some small magnitude number; then for a general point $w$

\begin{equation} \frac{g(w + \epsilon) - g(w)}{\epsilon} = \frac{c - c}{\epsilon} = \frac{0}{\epsilon} = 0 \end{equation}

Neither the particular point $w$ nor the sign and magnitude of $\epsilon$ mattered above: the value is always equal to zero. So here we have

\begin{equation} \frac{\mathrm{d}}{\mathrm{d}w}g(w) = 0 \end{equation}

Example 8. Derivative of a simple line

Take the simple line $g(w) = w$, then we have for general $w$ and small magnitude $\epsilon$

$$ \frac{g(w + \epsilon) - g(w)}{\epsilon} = \frac{w + \epsilon - w}{\epsilon} = \frac{\epsilon}{\epsilon} = 1$$

Here too - regardless of the magnitude and sign of $\epsilon$ - the value remains the same. Hence the general derivative is given as

$$ \frac{\mathrm{d}}{\mathrm{d}w}g(w)= 1 $$

Example 9. Derivative of general monomial terms

Let's start with the degree two monomial $g(w) = w^2$, then we have for general $w$ and small magnitude $\epsilon$

$$ \frac{g(w + \epsilon) - g(w)}{\epsilon} = \frac{(w + \epsilon)^2 - w^2}{\epsilon} = \frac{w^2 + 2w\epsilon + \epsilon^2 - w^2 }{\epsilon} = \frac{2w\epsilon + \epsilon^2}{\epsilon} = 2w + \epsilon$$

With $\epsilon$ being infinitely small it vanishes from the line above, and we then have that

$$ \frac{\mathrm{d}}{\mathrm{d}w}g(w) = 2w $$
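Formulae like this can be sanity-checked numerically by comparing secant slopes against the claimed derivative. Below is a minimal sketch - the helper `secant_slope` is our own construction, not part of any library - checking that the secant slopes of $g(w)=w^2$ land very close to $2w$:

```python
def secant_slope(g, w, epsilon=1e-6):
    # slope of the secant line through (w, g(w)) and (w + epsilon, g(w + epsilon))
    return (g(w + epsilon) - g(w)) / epsilon

g = lambda w: w**2
for w in [-1.0, 0.0, 0.5, 3.0]:
    # the secant slope should be very close to the formula 2w
    assert abs(secant_slope(g, w) - 2 * w) < 1e-4
```

Of course a few spot checks are not a proof - the algebra above is what guarantees the formula holds at every point.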

Next, let's look at the degree three monomial $g(w) = w^3$

$$ \frac{g(w + \epsilon) - g(w)}{\epsilon} = \frac{(w + \epsilon)^3 - w^3}{\epsilon} = \frac{w^3 + 3w^2\epsilon + 3w\epsilon^2 + \epsilon^3 - w^3}{\epsilon} = 3w^2 + 3w\epsilon + \epsilon^2$$

For infinitely small magnitude $\epsilon$ only the $3w^2$ term remains. Thus the derivative is

$$\frac{\mathrm{d}}{\mathrm{d}w}g(w) = 3w^2$$

Now let's examine an arbitrary degree monomial term $g(w) = w^p$. This argument works in a completely similar manner to the degree two case above: we just need to expand the term $(w + \epsilon)^p$ and re-arrange terms the right way. Denoting the general binomial coefficient (choose notation)

\begin{equation} {{p}\choose{j}} = \frac{p!}{j!(p-j)!} \end{equation}

the Binomial Theorem gives us this expansion as

\begin{equation} (w + \epsilon)^{\,p} = \sum_{j=0}^p {{p}\choose{j}}w^{\,j}{\epsilon}^{\,p-j} \end{equation}

Note that we can write this equivalently - pulling out the two highest degree terms in $w$ to make them explicit, leaving every other term in the sum multiplied by $\epsilon^2$

\begin{equation} = w^{\,p} + \epsilon \, pw^{\,p-1} + \epsilon^2\sum_{j=0}^{p-2} {{p}\choose{j}}w^{\,j}{\epsilon}^{\,p-j - 2} \end{equation}

Using this final form, and plugging it into the definition of the derivative we then have

$$ \frac{\left(w + \epsilon \right)^{\,p} - w^{\,p}}{\epsilon} = \frac{ \left(w^{\,p} + \epsilon \, pw^{\,p-1} + \epsilon^2\sum_{j=0}^{p-2} {{p}\choose{j}}w^{\,j}{\epsilon}^{\,p-j - 2} \right) - w^{\,p}}{\epsilon} $$\begin{equation} = pw^{\,p-1} + \epsilon\sum_{j=0}^{p-2} {{p}\choose{j}}w^{\,j}{\epsilon}^{\,p-j - 2} \end{equation}

and then as $\epsilon \longrightarrow 0$ only the $pw^{\,p-1}$ term remains.
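The general rule $\frac{\mathrm{d}}{\mathrm{d}w}w^p = pw^{\,p-1}$ can likewise be spot-checked against secant slopes for a handful of degrees $p$ - again a quick numerical sketch, not a proof:

```python
epsilon = 1e-6
for p in [2, 3, 5, 8]:
    for w in [0.5, 1.0, 2.0]:
        secant = ((w + epsilon)**p - w**p) / epsilon  # secant slope of w^p
        exact = p * w**(p - 1)                        # claimed derivative formula
        # compare, with a tolerance scaled to the size of the derivative
        assert abs(secant - exact) < 1e-3 * max(1.0, abs(exact))
```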

Example 10. Derivative of the exponential function

With $g(w) = e^{w}$ we have from the definition of the derivative

$$ \frac{g(w + \epsilon) - g(w)}{\epsilon} = \frac{ e^{w + \epsilon} - e^{w}}{\epsilon} = e^{w}\frac{e^{\epsilon} - 1}{\epsilon} = e^{w}$$

where the final equality holds as $\epsilon$ shrinks to zero: denoting $f(\epsilon) = \frac{e^{\epsilon} - 1}{\epsilon}$, we have $f(\epsilon) \longrightarrow 1$ as $\epsilon \longrightarrow 0$, as can be seen in the plot of this function in the printout of the next Python cell.

In [8]:
# create function
epsilon = np.linspace(-2,2)
f = (np.exp(epsilon) - 1)/epsilon

# reshape and package for plotter
epsilon.shape = (len(epsilon),1)
f.shape = (len(f),1)
f_table = np.concatenate((epsilon,f),axis=1)

# use custom plotter to plot function table of values
baslib.basics_plotter.single_plot(table = f_table,xlabel = r'$\epsilon$',ylabel = r'$f(\epsilon)$',rotate_ylabel=0)
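We can also confirm this limit numerically without a plot - a small sketch showing $f(\epsilon)$ approaching $1$ as $\epsilon$ shrinks:

```python
import numpy as np

# (e^eps - 1)/eps should approach 1 as eps shrinks toward zero
for eps in [1e-2, 1e-4, 1e-6]:
    f_val = (np.exp(eps) - 1) / eps
    # the gap to 1 shrinks proportionally with eps (it is roughly eps/2)
    assert abs(f_val - 1) < 10 * eps
```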

Example 11. Derivative of the natural logarithm

For this example we must use several properties of the (natural) logarithm - that is, log base Euler's number $e = 2.71828...$ - which include the following for positive constants $a$ and $b$

$$ \text{log}(a)-\text{log}(b) = \text{log}\left(\frac{a}{b}\right)$$

$$ a\cdot \text{log}(b) = \text{log}(b^a)$$

In addition we need to use one particular definition of Euler's constant

$$ \underset{\epsilon\rightarrow0}{\text{lim}} \,\,\left(1 + \epsilon \right)^{\frac{1}{\epsilon}} = e $$

(which can be verified by examining the plot of $f(\epsilon ) =\left(1 + \epsilon \right)^{\frac{1}{\epsilon}}$ in the printout of the next Python cell), from which it follows more generally that

$$ \left(1 + \epsilon w \right)^{\frac{1}{\epsilon}} = e^w $$

as $\epsilon$ becomes infinitely small.

Finally we need the fact that the natural log and the exponential are inverses of one another, i.e., that $\text{log}(e^w) = w$.

In [9]:
# create function
epsilon = np.linspace(-0.1,0.1)
f = (1 + epsilon)**(1/epsilon)

# reshape and package for plotter
epsilon.shape = (len(epsilon),1)
f.shape = (len(f),1)
f_table = np.concatenate((epsilon,f),axis=1)

# use custom plotter to plot function table of values
baslib.basics_plotter.single_plot(table = f_table,xlabel = r'$\epsilon$',ylabel = r'$f(\epsilon)$',rotate_ylabel = 0)
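The more general limit $\left(1 + \epsilon w \right)^{\frac{1}{\epsilon}} \longrightarrow e^w$ can be spot-checked the same way, say at $w = 2$ (a quick numerical sketch):

```python
import numpy as np

w = 2.0
for eps in [1e-3, 1e-5, 1e-7]:
    f_val = (1 + eps * w)**(1 / eps)
    # (1 + eps*w)^(1/eps) approaches e^w as eps shrinks toward zero
    assert abs(f_val - np.exp(w)) < 100 * eps * np.exp(w)
```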

With these facts in mind and $g(w) = \text{log}(w)$, the definition of the derivative gives

$$\frac{\text{log}(w + \epsilon) - \text{log}(w)}{\epsilon} = \frac{\text{log}(\frac{w+\epsilon}{w})}{\epsilon} = \frac{\text{log}(1 + \frac{\epsilon}{w})}{\epsilon}= \text{log}((1 + \frac{\epsilon}{w})^\frac{1}{\epsilon})$$

As $\epsilon$ becomes infinitely small, using the definition of $e$ (with $w$ replaced by $\frac{1}{w}$), we have that the above

$$ =\text{log}(e^\frac{1}{w}) = \frac{1}{w} = \frac{\mathrm{d}}{\mathrm{d}w}g(w)$$
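A quick numerical check of $\frac{\mathrm{d}}{\mathrm{d}w}\text{log}(w) = \frac{1}{w}$ via secant slopes (again just a sketch):

```python
import numpy as np

epsilon = 1e-7
for w in [0.5, 1.0, 3.0, 10.0]:
    secant = (np.log(w + epsilon) - np.log(w)) / epsilon
    assert abs(secant - 1 / w) < 1e-5  # matches the formula 1/w
```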

Example 12. Derivative of sine

Using the angle addition formula $\text{sin}(a + b) = \text{sin}(a)\text{cos}(b) + \text{cos}(a)\text{sin}(b)$ we can write

$$ \frac{\text{sin}(w + \epsilon) - \text{sin}(w)}{\epsilon} = \frac{\text{sin}(w)\text{cos}(\epsilon) + \text{cos}(w)\text{sin}(\epsilon) - \text{sin}(w)}{\epsilon}$$

and re-arranging this we have that

$$= \text{sin}(w)\frac{(\text{cos}(\epsilon) - 1)}{\epsilon} + \text{cos}(w)\frac{\text{sin}(\epsilon)}{\epsilon} = \text{cos}(w) $$

where the final equality follows from the fact that, denoting $f_1(\epsilon) = \frac{\text{cos}(\epsilon) - 1}{\epsilon}$ and $f_2(\epsilon) = \frac{\text{sin}(\epsilon)}{\epsilon}$, we have $f_1(\epsilon) \longrightarrow 0$ and $f_2(\epsilon) \longrightarrow 1$ as $\epsilon \longrightarrow 0$, as can be verified by examining the plots of these functions in the printout of the next Python cell.

In [10]:
# create function
epsilon = np.linspace(-2,2)
f_1 = (np.cos(epsilon) - 1)/epsilon
f_2 = (np.sin(epsilon))/epsilon

# reshape and package for plotter
epsilon.shape = (len(epsilon),1)
f_1.shape = (len(f_1),1)
f_2.shape = (len(f_2),1)
f_1_table = np.concatenate((epsilon,f_1),axis=1)
f_2_table = np.concatenate((epsilon,f_2),axis=1)

# use custom plotter to plot function table of values
baslib.basics_plotter.double_plot(table1 = f_1_table,table2 = f_2_table,xlabel = r'$\epsilon$',ylabel_1 = r'$f_1(\epsilon)$',ylabel_2 = r'$f_2(\epsilon)$')
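The same secant-slope check confirms that the derivative of sine matches cosine at a few sample points (a sketch, not a proof):

```python
import numpy as np

epsilon = 1e-7
for w in [-2.0, 0.0, 1.0, 2.5]:
    secant = (np.sin(w + epsilon) - np.sin(w)) / epsilon
    assert abs(secant - np.cos(w)) < 1e-5  # matches cos(w)
```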

Derivative rules

Using the same general definition of the derivative at a point we can derive all of the rules listed in Table 2.

Scalar multiplication rule

If $c$ is just a constant, then applying the definition of the derivative to $c \cdot g(w)$ lets us simply factor the constant out, giving

\begin{equation} \frac{c\cdot g(w + \epsilon) - c\cdot g(w)}{\epsilon} = c \cdot\frac{g(w + \epsilon) - g(w)}{\epsilon} \end{equation}

and as $\epsilon$ vanishes the right hand side is precisely $c\cdot\frac{\mathrm{d}}{\mathrm{d} w} g(w)$

Addition rule

To show that the derivative of a sum is the sum of the individual derivatives, we start by computing the derivative $\frac{\mathrm{d}}{\mathrm{d}w}\left(f(w) + g(w) \right)$ for small $\epsilon$, and simply rearrange terms as

$$ \frac{\left(f(w + \epsilon) + g(w + \epsilon)\right) - \left(f(w) + g(w)\right)}{\epsilon} $$\begin{equation} = \frac{f(w + \epsilon) - f(w)}{\epsilon} + \frac{g(w + \epsilon) - g(w)}{\epsilon} \end{equation}

with the latter being precisely $\frac{\mathrm{d}}{\mathrm{d}w}f(w) + \frac{\mathrm{d}}{\mathrm{d}w}g(w)$

Product rule

With two functions $f(w)$ and $g(w)$ the definition of the derivative for small magnitude $\epsilon$ gives

$$\frac{f(w + \epsilon)g(w + \epsilon) - f(w)g(w)}{\epsilon} $$

Adding and subtracting $f(w+\epsilon)g(w)$ in the numerator (which adds zero overall) gives

$$=\frac{f(w + \epsilon)g(w + \epsilon) - f(w + \epsilon)g(w) + f(w+\epsilon)g(w) - f(w)g(w)}{\epsilon} $$$$=\frac{f(w + \epsilon) - f(w)}{\epsilon}\,g(w) \, \,+ \,\, f(w + \epsilon)\,\frac{g(w + \epsilon) - g(w)}{\epsilon}$$

Then as $\epsilon \longrightarrow 0$ we have that

$\begin{align} f(w + \epsilon) \longrightarrow f(w)~~~~~~\\ \frac{f(w + \epsilon) - f(w)}{\epsilon}\longrightarrow \frac{\mathrm{d}}{\mathrm{d}w}f(w) \\ \frac{g(w + \epsilon) - g(w)}{\epsilon}\longrightarrow \frac{\mathrm{d}}{\mathrm{d}w}g(w) \\ \end{align}$

And so all together then the above gives $\frac{\mathrm{d}}{\mathrm{d}w}f(w)\cdot g(w) + f(w)\cdot\frac{\mathrm{d}}{\mathrm{d}w}g(w)$
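The product rule can be spot-checked numerically with any two differentiable functions; here we use $f = \text{sin}$ and $g(w) = e^w$, whose individual derivatives we derived above (a quick sketch):

```python
import numpy as np

epsilon = 1e-7
for w in [-1.0, 0.0, 2.0]:
    # secant slope of the product sin(w) * e^w
    secant = (np.sin(w + epsilon) * np.exp(w + epsilon)
              - np.sin(w) * np.exp(w)) / epsilon
    # product rule prediction: f'g + f g'
    exact = np.cos(w) * np.exp(w) + np.sin(w) * np.exp(w)
    assert abs(secant - exact) < 1e-4
```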

Chain rule

With two functions $f(w)$ and $g(w)$ let's look at the composition $f(g(w))$ and compute its derivative $\frac{\mathrm{d}}{\mathrm{d}w}(f(g(w)))$ at a general input $w$. For small magnitude $\epsilon$ the definition of the derivative gives

$$ \frac{f(g(w + \epsilon)) - f(g(w))}{\epsilon} = \frac{f(g(w + \epsilon)) - f(g(w))}{g(w + \epsilon) - g(w)} \cdot \frac{g(w + \epsilon) - g(w)}{\epsilon} $$

As $\vert{\epsilon}\vert \longrightarrow 0$ the left hand side above becomes $\frac{\mathrm{d}}{\mathrm{d}w}(f(g(w)))$, while $g(w + \epsilon) - g(w) \longrightarrow 0$ as well, so on the right side we have that

$$ \frac{f(g(w + \epsilon)) - f(g(w))}{g(w + \epsilon) - g(w)} \longrightarrow \frac{\mathrm{d}}{\mathrm{d}g}f(g) $$

$$ \frac{g(w + \epsilon) - g(w)}{\epsilon} \longrightarrow \frac{\mathrm{d}}{\mathrm{d}w}g(w) $$

Therefore all together we have that

$$\frac{\mathrm{d}}{\mathrm{d}w}(f(g(w)))= \frac{\mathrm{d}}{\mathrm{d}g}f(g) \cdot \frac{\mathrm{d}}{\mathrm{d}w}g(w)$$
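As a sanity check of the chain rule, take $f(g) = e^g$ and $g(w) = \text{sin}(w)$, so that the rule predicts $\frac{\mathrm{d}}{\mathrm{d}w}e^{\text{sin}(w)} = e^{\text{sin}(w)}\text{cos}(w)$ (a numerical sketch):

```python
import numpy as np

epsilon = 1e-7
for w in [-1.5, 0.5, 2.0]:
    # secant slope of the composition exp(sin(w))
    secant = (np.exp(np.sin(w + epsilon)) - np.exp(np.sin(w))) / epsilon
    exact = np.exp(np.sin(w)) * np.cos(w)  # chain rule prediction
    assert abs(secant - exact) < 1e-4
```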

Calculating derivatives backwards through the computation graph

In this short section we provide examples of derivative calculations that move progressively backwards through the computation graph.

We will see that the backwards calculations follow a particular recursive pattern of derivative taking, where we repeat the steps below over and over again until there are no more derivatives left to take. This pattern holds in general when one calculates derivatives going backwards through a computation graph.

  1. Moving backwards through the graph, each parent operation is expressed via an application of the chain rule with respect to its children.
  2. All derivatives of a parent with respect to its children are derivatives of elementary functions, and are computed in closed form using Table 1.
  3. When it comes to taking the derivative of a child with respect to $w$ we treat the child as a parent node, to which we must apply the chain rule.
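For a purely compositional chain of elementary operations this recursive pattern can be sketched in a few lines of code. Here `elementary` and `chain_derivative` are hypothetical names of our own invention - a minimal illustration of the backwards pass under the assumption that each parent has a single child, not a general computation-graph implementation:

```python
import numpy as np

# table of elementary functions paired with their derivative formulae (Table 1)
elementary = {
    'cube': (lambda w: w**3, lambda w: 3 * w**2),
    'sin':  (np.sin, np.cos),
}

def chain_derivative(names, w):
    # forward sweep: evaluate the composition, recording every child's value
    values = [w]
    for name in names:
        values.append(elementary[name][0](values[-1]))
    # backward sweep: multiply parent-child derivatives via the chain rule,
    # starting from the final parent node and moving toward the input w
    grad = 1.0
    for name, child in zip(reversed(names), reversed(values[:-1])):
        grad *= elementary[name][1](child)
    return grad

# derivative of sin(w^3) at w = 1 should equal cos(1) * 3
print(chain_derivative(['cube', 'sin'], 1.0))
```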

Example 13. Derivative calculation of $g(w) = \text{sin}(w^3)$ going backwards through the computation graph

We plot the simple computation graph decomposing this equation

Calculating the derivative here by moving backwards through the graph means that we start our calculations at the parent node, and resolve particular derivatives moving backwards. Using the chain rule we can break apart $\frac{\mathrm{d}}{\mathrm{d}w}g(w) = \frac{\mathrm{d}}{\mathrm{d}w}b(a(w))$ into simple stages. Because the relation between parent and child here is compositional we must employ the chain rule from Table 2 at each stage

$$ \frac{\mathrm{d}}{\mathrm{d}w}b(a(w)) = \frac{\mathrm{d}}{\mathrm{d}a}b(a)\frac{\mathrm{d}}{\mathrm{d}w}a(w)\frac{\mathrm{d}}{\mathrm{d}w}w $$

Each derivative on the right hand side above is a derivative of an elementary function by definition - and we can find the equation for each in the formulae of Table 1. In particular we have

\begin{array} \ \frac{\mathrm{d}}{\mathrm{d}a}b(a) = \frac{\mathrm{d}}{\mathrm{d}a}\text{sin}(a) = \text{cos}(a) \\ \frac{\mathrm{d}}{\mathrm{d}w}a(w) = \frac{\mathrm{d}}{\mathrm{d}w}w^3 = 3w^2 \\ \frac{\mathrm{d}}{\mathrm{d}w}w = 1 \end{array}

Note that since $\frac{\mathrm{d}}{\mathrm{d}w}w = 1$ it is redundant, and can be removed. Since $a = w^3$ by definition we have all together using these two derivatives of elementary functions that

$$ \frac{\mathrm{d}}{\mathrm{d}w}b(a(w)) = \frac{\mathrm{d}}{\mathrm{d}a}b(a)\frac{\mathrm{d}}{\mathrm{d}w}a(w)= \text{cos}(a)\times3w^2 = \text{cos}(w^3)\times3w^2 $$
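We can check this closed form against secant slopes at a few points (a quick numerical sketch):

```python
import numpy as np

epsilon = 1e-7
for w in [-1.0, 0.5, 1.5]:
    secant = (np.sin((w + epsilon)**3) - np.sin(w**3)) / epsilon
    exact = np.cos(w**3) * 3 * w**2  # result of the backwards calculation
    assert abs(secant - exact) < 1e-4
```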

Example 14. Computing the derivative of $g(w) = \text{tanh}(w)\text{cos}(w) + \text{log}(w)$ going backwards through the computation graph

We first represent the parent-child relationships in this function by drawing its computation graph.

In order to write out the full expression for the derivative here we must recursively apply the chain rule to each parent node in the graph. We begin by applying the chain rule to the last parent node $e$ relative to its children $c$ and $d$, and work backwards through the graph.

$$ \frac{\mathrm{d}}{\mathrm{d}w}e(c,d) = \frac{\mathrm{d}}{\mathrm{d}c}e(c,d)\frac{\mathrm{d}}{\mathrm{d}w}c(w) + \frac{\mathrm{d}}{\mathrm{d}d}e(c,d)\frac{\mathrm{d}}{\mathrm{d}w}d(a,b) $$

Note that because we know the form of $e$ we can immediately compute the derivatives of $e$ with respect to its children - i.e., $\frac{\mathrm{d}}{\mathrm{d}d}e(c,d) = 1$ and $\frac{\mathrm{d}}{\mathrm{d}c}e(c,d) = 1$. What about the derivatives of each child with respect to $w$? We have to apply the chain rule again on each. For the derivative of $c$ we can compute

$$ \frac{\mathrm{d}}{\mathrm{d}w}c(w) = \frac{1}{w} $$

and for $d$

$$ \frac{\mathrm{d}}{\mathrm{d}w}d(a,b) = \left(\frac{\mathrm{d}}{\mathrm{d}w}a(w)\right)\times b(w) + a(w) \times \left(\frac{\mathrm{d}}{\mathrm{d}w}b(w)\right) $$

which necessitated the use of the product rule, given that $d(a,b) = a \times b$. And since both $a$ and $b$ are elementary functions whose derivatives are in Table 1, we can quickly compute this derivative

$$ \frac{\mathrm{d}}{\mathrm{d}w}d(a,b) = (1 - \text{tanh}^2(w))\times \text{cos}(w) + \text{tanh}(w) \times (-\text{sin}(w)) $$

Note that here, when computing $\frac{\mathrm{d}}{\mathrm{d}w}a(w) = \frac{\mathrm{d}}{\mathrm{d}w}\text{tanh}(w)\frac{\mathrm{d}}{\mathrm{d}w}w$ according to the chain rule, we can ignore the factor $\frac{\mathrm{d}}{\mathrm{d}w}w = 1$ since it is of no significance. Likewise for $b$, and we will follow this convention from now on.

With all derivatives of parent nodes accounted for we can put all of our calculations together - giving the desired derivative

$$ \frac{\mathrm{d}}{\mathrm{d}w}g(w) = \frac{\mathrm{d}}{\mathrm{d}w}e(c,d) = (1 - \text{tanh}^2(w))\times \text{cos}(w) + \text{tanh}(w) \times (-\text{sin}(w)) + \frac{1}{w} $$
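As before, a quick numerical check of the full expression against secant slopes (a sketch):

```python
import numpy as np

g = lambda w: np.tanh(w) * np.cos(w) + np.log(w)
# derivative assembled from the backwards calculation above
dg = lambda w: (1 - np.tanh(w)**2) * np.cos(w) - np.tanh(w) * np.sin(w) + 1 / w

epsilon = 1e-7
for w in [0.5, 1.0, 2.0]:
    secant = (g(w + epsilon) - g(w)) / epsilon
    assert abs(secant - dg(w)) < 1e-4
```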

Endnotes

[1] If you took a calculus course in school you may have - depending on your luck - spent much of your time there being forced to apply these simple rules over and over again, computing derivative equations of increasingly complicated combinations of elementary functions like this one. No need to re-live those terrors if we can avoid them. The point here is to recognize that - if performed procedurally by first decomposing the function, then applying the derivative rules and formulae, etc. - computing derivatives is a very automatable task.