Chapter 3: Derivatives and Automatic Differentiation

3.9 The Hessian and higher order derivatives

In this brief Section we describe higher order derivatives of multi-input functions. These, like the first order derivatives, are expressed as a set of partial derivative functions. Importantly, as we will see, the number of higher order partial derivatives of a function of $N$ inputs grows exponentially in the order of differentiation: there are $N^D$ partial derivatives of order $D$.


3.9.1 Second order derivatives

In the previous Section we described the collection of partial derivatives of a function $g$ taking in $N$ inputs

\begin{equation} \mathbf{w} = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_N \end{bmatrix} \end{equation}

known as the gradient. Denoted as

\begin{equation} \nabla g(w_1, w_2, \ldots, w_N) = \begin{bmatrix} \frac{\partial}{\partial w_1}g(w_1, w_2, \ldots, w_N) \\ \frac{\partial}{\partial w_2}g(w_1, w_2, \ldots, w_N) \\ \vdots \\ \frac{\partial}{\partial w_N}g(w_1, w_2, \ldots, w_N) \end{bmatrix} \end{equation}

the gradient contains the $n^{th}$ partial derivative $\frac{\partial}{\partial w_n}g(w_1, w_2, \ldots, w_N)$ as its $n^{th}$ entry. This partial derivative - like the original function itself - is a mathematical function taking in the $N$ inputs abbreviated $\mathbf{w}$. Because of this we can differentiate the $n^{th}$ partial derivative along each input axis. For instance, we can take the $m^{th}$ partial derivative of $\frac{\partial}{\partial w_n}g(w_1, w_2, \ldots, w_N)$ as

\begin{equation} \frac{\partial}{\partial w_m} \frac{\partial}{\partial w_n}g(w_1, w_2, \ldots, w_N) \end{equation}

This is a second order derivative. How many of these does a function $g(\mathbf{w})$ taking in $N$ inputs have? Well, every one of $g$'s $N$ first order derivatives - each itself being a function of $N$ inputs - has $N$ partial derivatives. This means that $g(\mathbf{w})$ has a total of $N^2$ second order derivatives.

A function $g(\mathbf{w})$ taking in $N$ inputs has $N^2$ second order derivatives.
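To make this concrete we can compute such second order partial derivatives numerically. The short sketch below is a minimal illustration assuming the autograd package (an assumption - this Section's own helper library is not required here) and a hypothetical example function: it builds the $(m,n)^{th}$ second order derivative by differentiating the $n^{th}$ first order partial derivative along the $m^{th}$ input axis, then collects all $N^2$ of them.

import autograd.numpy as np
from autograd import grad

# a hypothetical two-input example function
g = lambda w: np.sin(w[0])*w[1]**2

# the n-th first order partial derivative of f - itself a function of w
def partial(f, n):
    return lambda w: grad(f)(w)[n]

# the (m,n)-th second order partial derivative: differentiate the
# n-th partial derivative along the m-th input axis
def second_partial(f, m, n):
    return partial(partial(f, n), m)

# evaluate all N**2 second order derivatives at a point
w = np.array([1.0, 2.0])
N = len(w)
print(np.array([[second_partial(g, m, n)(w) for n in range(N)] for m in range(N)]))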

As with the gradient, this large set of second order derivatives is typically organized in a very particular way so that it can be more easily communicated and computed with. The Hessian - written notationally as $\nabla^2 g(\mathbf{w})$ - is the $N\times N$ matrix of second order derivatives whose $(m,n)^{th}$ element is $\frac{\partial}{\partial w_m} \frac{\partial}{\partial w_n}g(\mathbf{w})$, or $\frac{\partial}{\partial w_m} \frac{\partial}{\partial w_n}g$ for short. The full Hessian matrix is written as

\begin{equation} \nabla^2 g(\mathbf{w}) = \begin{bmatrix} \frac{\partial}{\partial w_1} \frac{\partial}{\partial w_1}g& \frac{\partial}{\partial w_1} \frac{\partial}{\partial w_2}g & \cdots & \frac{\partial}{\partial w_1} \frac{\partial}{\partial w_N}g\\ \frac{\partial}{\partial w_2} \frac{\partial}{\partial w_1}g& \frac{\partial}{\partial w_2} \frac{\partial}{\partial w_2}g & \cdots & \frac{\partial}{\partial w_2} \frac{\partial}{\partial w_N}g\\ \vdots & \vdots & \ddots & \vdots\\ \frac{\partial}{\partial w_N} \frac{\partial}{\partial w_1}g& \frac{\partial}{\partial w_N} \frac{\partial}{\partial w_2}g & \cdots & \frac{\partial}{\partial w_N} \frac{\partial}{\partial w_N}g\\ \end{bmatrix} \end{equation}

Moreover, since it is virtually always the case that $\frac{\partial}{\partial w_m} \frac{\partial}{\partial w_n}g = \frac{\partial}{\partial w_n} \frac{\partial}{\partial w_m}g$ - this equality of mixed partial derivatives holds whenever they are continuous (a result known as Schwarz's theorem), which is true of essentially all functions used in machine learning - the Hessian is a symmetric matrix.

The Hessian is a symmetric $N\times N$ matrix of second order derivatives.
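The autograd package used in the sketch above also provides a hessian convenience wrapper that assembles this entire matrix in one call. The snippet below - again a minimal sketch under the same assumptions - evaluates it for the hypothetical example function and confirms the symmetry just described.

import autograd.numpy as np
from autograd import hessian

g = lambda w: np.sin(w[0])*w[1]**2
w = np.array([1.0, 2.0])

H = hessian(g)(w)           # the full N x N Hessian evaluated at w
print(H)
print(np.allclose(H, H.T))  # True - the Hessian is symmetric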

Example 1. The Hessian of $g(w_1,w_2) = w^2_1 + w^2_2$

In the Python cell below we plot the function $g(w_1,w_2) = w^2_1 + w^2_2$, along with its first and second derivatives in a binary tree whose structure reflects the fact that the number of derivatives grows exponentially with order. Notice that the second order derivatives $ \frac{\partial}{\partial w_1} \frac{\partial}{\partial w_2}g(w_1,w_2)$ and $ \frac{\partial}{\partial w_2} \frac{\partial}{\partial w_1}g(w_1,w_2)$ have been plotted in the same panel, since they are the same function.

In [7]:
# import the plotting library used in this Section (assumed available alongside the notebook)
import callib

# define the function and viewing angle
func = lambda w: w[0]**2 + w[1]**2
view = [10, 150]

# visualize the function along with its first and second order derivatives
# note: the renderer associated with %matplotlib notebook does not work well with
# this widget (it tends to enlarge and center 3d images); use
# %config InlineBackend.figure_format = 'retina' instead
callib.derivative_tree.draw_it(func=func, view=view)

One can verify that these derivatives take the algebraic form

\begin{equation} \nabla g(w_1,w_2) = \begin{bmatrix} \frac{\partial}{\partial w_1}g(w_1,w_2) \\ \frac{\partial}{\partial w_2}g(w_1,w_2) \\ \end{bmatrix} = \begin{bmatrix} 2w_1 \\ 2w_2 \\ \end{bmatrix} \end{equation}

and

\begin{equation} \nabla^2 g(w_1,w_2) = \begin{bmatrix} \frac{\partial}{\partial w_1} \frac{\partial}{\partial w_1}g & \frac{\partial}{\partial w_1} \frac{\partial}{\partial w_2}g \\ \frac{\partial}{\partial w_2} \frac{\partial}{\partial w_1}g & \frac{\partial}{\partial w_2} \frac{\partial}{\partial w_2}g \end{bmatrix} = \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix} \end{equation}
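These closed forms are easy to check numerically. The sketch below - assuming the autograd package as before - evaluates the gradient and Hessian of $g(w_1,w_2) = w_1^2 + w_2^2$ at an arbitrary point and matches them against the expressions above.

import autograd.numpy as np
from autograd import grad, hessian

g = lambda w: w[0]**2 + w[1]**2
w = np.array([3.0, -1.5])

print(grad(g)(w))     # [ 6. -3.] - matches [2*w_1, 2*w_2]
print(hessian(g)(w))  # [[2. 0.] [0. 2.]] - matches the constant Hessian above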

3.9.2 Higher order derivatives

In general, the number of partial derivatives of a multi-input function grows exponentially with the order of differentiation. We have just seen that a function taking in $N$ inputs has $N^2$ second order derivatives. In general such a function has $N^D$ partial derivatives of order $D$ (note that this count recovers the $N$ first order derivatives when $D=1$).

A function of $N$ inputs has $N^D$ partial derivatives of order $D$.
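The nesting idea from the earlier sketch extends directly to any order. Below is a minimal illustration - again assuming the autograd package and a hypothetical example function - that builds a single order-$D$ partial derivative by composing $D$ single-axis differentiations, producing one of the $N^D$ such derivatives.

import autograd.numpy as np
from autograd import grad

g = lambda w: np.sin(w[0])*w[1]**2

# differentiate f once along each axis listed in idx, in order
def higher_partial(f, idx):
    for n in idx:
        f = (lambda h, n: lambda w: grad(h)(w)[n])(f, n)
    return f

# one of the N**D = 2**3 = 8 third order derivatives of g:
# differentiate along w_1, then w_1 again, then w_2
w = np.array([1.0, 2.0])
print(higher_partial(g, [0, 0, 1])(w))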