Chapter 3: Derivatives and Automatic Differentiation

3.7 Derivatives of multi-input functions

In this Section we describe how derivatives are defined in higher dimensions, i.e., for multi-input functions. For visualization purposes we explore these ideas first with $N=2$ inputs and then generalize; the concepts carry over directly from the single-input case we have been examining thus far.


3.7.1 From tangent line to tangent hyperplane

Whereas the derivative of a single-input function represents the slope of a tangent line, the derivative of a multi-input function represents the set of slopes that define a tangent hyperplane.

Example 1. Tangent hyperplane

This is illustrated in the next Python cell using the following two closely related functions

\begin{equation} \begin{array}{l} g(w) = 2 + \text{sin}(w)\\ g(w_1,w_2) = 2 + \text{sin}(w_1 + w_2) \end{array} \end{equation}

In particular we draw each function over a small portion of its input around the origin, with the single-input function on the left and multi-input function on the right. We also draw the tangent line / hyperplane - generated by the derivative there - on top of each function at the origin.

In [2]:
# plot a single-input sinusoid and its two-input counterpart, along with the tangent line / hyperplane at the origin
import numpy as np

func1 = lambda w: 2 + np.sin(w)
func2 = lambda w: 2 + np.sin(w[0] + w[1])

# use custom plotter to show both functions side by side
callib.derivative_ascent_visualizer.compare_2d3d(func1 = func1,func2 = func2)

Here we can see that the derivative for the multi-input function on the right naturally describes not just a line, but a tangent hyperplane. This is true in general. How do we define the derivative of a multi-input function / the tangent hyperplane it generates?

3.7.2 Derivatives: from secants to tangents

In Section 3.2 we saw how the derivative of a single-input function $g(w)$ at a point $w^0$ is approximately the slope

\begin{equation} \frac{\mathrm{d}}{\mathrm{d}w}g(w^0) \approx \frac{g(w^0 + \epsilon) - g(w^0)}{\epsilon} \end{equation}

of the secant line passing through the point $(w^0,\,\,g(w^0))$ and a neighboring point $(w^0 + \epsilon, \,\, g(w^0 + \epsilon))$. Letting $|\epsilon|$ shrink to zero this approximation becomes an equality, and the derivative is precisely the slope of the tangent line at $w^0$.
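To see this numerically - a minimal sketch using NumPy, not part of the original notebook - we can evaluate the secant slope of $g(w) = \text{sin}(w)$ at $w^0 = 0$ for a sequence of shrinking $\epsilon$ values; the slopes approach the true derivative value $\text{cos}(0) = 1$.

import numpy as np

# the function and the fixed point
g = lambda w: np.sin(w)
w0 = 0

# secant slope approximations for a sequence of shrinking epsilon values
for eps in [1, 0.1, 0.01, 0.001]:
    slope = (g(w0 + eps) - g(w0)) / eps
    print(eps, slope)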

Example 2. Single input secant experiment

In the next Python cell we repeat an experiment illustrating this point from an earlier Section. Here we plot the function $g(w) = \text{sin}(w)$ over a short window of its input. We then fix $w^0 = 0$, take a point nearby that can be controlled via the slider mechanism, and connect the two via a secant line. When the neighborhood point is close enough to $0$ the secant line becomes tangent, and turns from red to green.

In [3]:
# what function should we play with?  Defined in the next line; the fixed point of tangency is set via w_init below.
g = lambda w: np.sin(w)

# create an instance of the visualizer with this function
st = callib.secant_to_tangent.visualizer(g = g)

# run the visualizer for our chosen input function and initial point
st.draw_it(w_init = 0, num_frames = 200)
Out[3]: [animation output: the secant line through $(w^0, g(w^0))$ and the neighboring point, turning from red to green as the neighboring point approaches $w^0 = 0$]

With $N$ inputs we have precisely the same situation - only now we can compute a derivative along each input axis, and we can do so at every point of the input space.

For example, if we fix a point $(w_1,w_2) = (w^0_1,w^0_2)$ then we can examine the derivative along the first input axis $w_1$ using the same one-dimensional secant slope formula

\begin{equation} \frac{\mathrm{d}}{\mathrm{d}w_1}g(w^0_1,w^0_2) \approx \frac{g(w^0_1 + \epsilon,w^0_2) - g(w^0_1,w^0_2)}{\epsilon} \end{equation}

and again as $|\epsilon|$ shrinks to zero this approximation becomes an equality. Since we are in two dimensions the secant with this slope is actually a hyperplane passing through the points $(w^0_1,w^0_2,g(w^0_1,w^0_2))$ and $(w^0_1 + \epsilon,w^0_2,g(w^0_1 + \epsilon,w^0_2))$. Likewise, to compute the derivative along the second input axis $w_2$ we compute the slope value

\begin{equation} \frac{\mathrm{d}}{\mathrm{d}w_2}g(w^0_1,w^0_2) \approx \frac{g(w^0_1 ,w^0_2 + \epsilon) - g(w^0_1,w^0_2)}{\epsilon} \end{equation}

Because each of the derivatives $\frac{\mathrm{d}}{\mathrm{d}w_1}g(w^0_1,w^0_2)$ and $\frac{\mathrm{d}}{\mathrm{d}w_2}g(w^0_1,w^0_2)$ is taken with respect to a single input, they are referred to as partial derivatives of the function $g(w_1,w_2)$.
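As a quick numerical sanity check - a minimal sketch using NumPy, not part of the original notebook - we can evaluate both secant formulas above for the function $g(w_1,w_2) = 2 + \text{sin}(w_1 + w_2)$ from Example 1 at the point $(w^0_1,w^0_2) = (0,0)$; as $\epsilon$ shrinks both slopes approach the true partial derivative value $\text{cos}(0 + 0) = 1$.

import numpy as np

# the multi-input function from Example 1 and the fixed point
g = lambda w1, w2: 2 + np.sin(w1 + w2)
w1_0, w2_0 = 0, 0

# secant slope approximations to each partial derivative for shrinking epsilon
for eps in [1, 0.1, 0.01, 0.001]:
    slope_w1 = (g(w1_0 + eps, w2_0) - g(w1_0, w2_0)) / eps   # along the w_1 axis
    slope_w2 = (g(w1_0, w2_0 + eps) - g(w1_0, w2_0)) / eps   # along the w_2 axis
    print(eps, slope_w1, slope_w2)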

More commonly one uses a different notation to distinguish them from single-input derivatives - replacing the $\mathrm{d}$ symbol with $\partial$. With this notation the derivatives above are written equivalently as $\frac{\partial}{\partial w_1}g(w^0_1,w^0_2)$ and $\frac{\partial}{\partial w_2}g(w^0_1,w^0_2)$. Regardless of the notation, partial derivatives are computed - as we will discuss in the next Sections - in virtually the same manner as single-input derivatives (i.e., via repeated use of the derivative rules for elementary functions and operations).

This nomenclature and notation are used more generally as well to refer to any derivative of a multi-input function with respect to a single input dimension.

The term partial derivative is used to describe any derivative of a multi-input function with respect to a single input dimension.

Example 3. Multi-input secant experiment

In the next Python cell we repeat the secant experiment - shown previously for a single-input function - for the following multi-input function

\begin{equation} g(w_1,w_2) = 5 + (w_1 + 0.5)^2 + (w_2 + 0.5)^2 \end{equation}

We fix the point $(w^0_1,w^0_2) = (0,0)$ and take a point along each axis whose proximity to the origin can be controlled via the slider mechanism. At each instance and in each input dimension we form a secant line (which is a hyperplane in three dimensions) connecting the evaluation of this point to that of the origin. The secant hyperplanes whose slopes are given by the partial derivative approximations $\frac{\partial}{\partial w_1}g(w^0_1,w^0_2)$ and $\frac{\partial}{\partial w_2}g(w^0_1,w^0_2)$ are then illustrated in the left and right panels of the output, respectively.

When the neighborhood point is close enough to the origin the secant becomes tangent in each input dimension, and the corresponding hyperplane changes color from red to green.

In [3]:
# what function should we play with?  Defined in the next line; the fixed point of tangency is the origin.
func = lambda w: 5 + (w[0] + 0.5)**2 + (w[1] + 0.5)**2
view = [20,150]   # viewing angle for the 3d plot

# run the visualizer for our chosen input function and initial point
callib.secant_to_tangent_3d.animate_it(func = func,num_frames=50,view = view)
Out[3]: [animation output: the two secant hyperplanes - one per input axis - turning from red to green as the neighboring points approach the origin]

A hyperplane at a point $(w^0_1,w^0_2)$ that is tangent along a single input dimension only - like those shown in the figure above - has a slope defined by the corresponding partial derivative. Each such hyperplane is rather simple, in the sense that it has non-trivial slope along only a single input axis (we discussed this more generally in the previous Section), and so has a single-input form of equation. For example, the tangent hyperplane along the $w_1$ axis has the equation

\begin{equation} h(w_1,w_2) = g(w^0_1,w^0_2) + \frac{\partial}{\partial w_1}g(w^0_1,w^0_2)(w^{\,}_1 - w^0_1) \end{equation}

and likewise the tangent hyperplane along the $w_2$ axis has the equation

\begin{equation} h(w_1,w_2) = g(w^0_1,w^0_2) + \frac{\partial}{\partial w_2}g(w^0_1,w^0_2)(w^{\,}_2 - w^0_2) \end{equation}

However neither simple hyperplane represents the full tangency at the point $(w^0_1,w^0_2)$, which must be a function of both inputs $w_1$ and $w_2$. To get this we must sum up the slope contributions from both input axes, which gives the full tangent hyperplane

\begin{equation} h(w_1,w_2) = g(w^0_1,w^0_2) + \frac{\partial}{\partial w_1}g(w^0_1,w^0_2)(w^{\,}_1 - w^0_1) + \frac{\partial }{\partial w_2}g(w^0_1,w^0_2)(w^{\,}_2 - w^0_2) \end{equation}

As was the case with the tangent line of a single-input function, this is also the first order Taylor Series approximation to $g$ at the point $(w^0_1,w^0_2)$.
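To make this concrete - a minimal sketch, not part of the original notebook - the following code builds the full tangent hyperplane $h$ for the function $g(w_1,w_2) = 5 + (w_1 + 0.5)^2 + (w_2 + 0.5)^2$ from Example 3 at the point $(w^0_1,w^0_2) = (0,0)$, using its analytic partial derivatives $\frac{\partial}{\partial w_1}g(w_1,w_2) = 2(w_1 + 0.5)$ and $\frac{\partial}{\partial w_2}g(w_1,w_2) = 2(w_2 + 0.5)$, and compares $h$ to $g$ at a point near the origin.

# the function from Example 3 and the point of tangency
g = lambda w1, w2: 5 + (w1 + 0.5)**2 + (w2 + 0.5)**2
w1_0, w2_0 = 0, 0

# analytic partial derivatives evaluated at the point
dg_dw1 = 2*(w1_0 + 0.5)
dg_dw2 = 2*(w2_0 + 0.5)

# full tangent hyperplane / first order Taylor Series approximation
h = lambda w1, w2: g(w1_0, w2_0) + dg_dw1*(w1 - w1_0) + dg_dw2*(w2 - w2_0)

# near the point of tangency h closely matches g
print(g(0.1, -0.05))   # 5.5625
print(h(0.1, -0.05))   # 5.55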

Example 4. Arbitrary tangent hyperplane

In the next Python cell we illustrate each single-input tangency (with respect to $w_1$ and $w_2$ in the left and middle panels, respectively), along with the full tangent hyperplane given by the first order Taylor Series approximation (right panel), for the example function shown in the previous animation.

In [4]:
# what function should we play with?  Defined in the next line; the fixed point of tangency is the origin.
func = lambda w: 5 + (w[0] + 0.5)**2 + (w[1] + 0.5)**2
view = [10,150]   # viewing angle for the 3d plot

# run the visualizer for our chosen input function and initial point
callib.secant_to_tangent_3d.draw_it(func = func,num_frames=50,view = view)

So, in short, a multi-input function with $N=2$ inputs has $N=2$ partial derivatives, one for each input. Taken together at a single point, these partial derivatives - like the sole derivative of a single-input function - define the slopes of the tangent hyperplane at this point (also called the first order Taylor Series approximation).

3.7.3 The gradient

For notational convenience these partial derivatives are typically collected into a vector-valued function called the gradient, denoted $\nabla g(w_1,w_2)$, where the partial derivatives are stacked column-wise as

\begin{equation} \nabla g(w_1,w_2) = \begin{bmatrix} \frac{\partial}{\partial w_1}g(w_1,w_2) \\ \frac{\partial}{\partial w_2}g(w_1,w_2) \end{bmatrix} \end{equation}

Note that because this is a stack of two derivatives, the gradient in this case takes two inputs and returns two outputs. When a function has only a single input the gradient reduces to a single derivative, which is why the derivative of a function (regardless of its number of inputs) is typically just referred to as its gradient.

When a function takes in a general number $N$ of inputs, the form of the gradient, as well as the tangent hyperplane, mirrors precisely what we have seen above. A function taking in $N$ inputs $g(w_1,w_2,...,w_N)$ has a gradient consisting of its $N$ partial derivatives stacked into a column vector

\begin{equation} \nabla g(w_1,w_2,\ldots,w_N) = \begin{bmatrix} \frac{\partial}{\partial w_1}g(w_1,w_2,\ldots,w_N) \\ \frac{\partial}{\partial w_2}g(w_1,w_2,\ldots,w_N) \\ \vdots \\ \frac{\partial}{\partial w_N}g(w_1,w_2,\ldots,w_N) \end{bmatrix} \end{equation}
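For instance, for the simple quadratic $g(w_1,w_2,\ldots,w_N) = w_1^2 + w_2^2 + \cdots + w_N^2$ each partial derivative is $\frac{\partial}{\partial w_n}g = 2w_n$, and so its gradient is

\begin{equation} \nabla g(w_1,w_2,\ldots,w_N) = \begin{bmatrix} 2w_1 \\ 2w_2 \\ \vdots \\ 2w_N \end{bmatrix} \end{equation}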

To see why this is a convenient way to express the partial derivatives of $g$, note that using vector notation for the input, e.g., $\mathbf{w} = (w_1,w_2)$ and $\mathbf{w}^0 = (w^0_1,w^0_2)$, the first order Taylor Series approximation can be written more compactly as

\begin{equation} h(\mathbf{w}) = g(\mathbf{w}^0) + \nabla g(\mathbf{w}^0)^T(\mathbf{w} - \mathbf{w}^0) \end{equation}

which more closely resembles the way we express the first order Taylor Series approximation for a single-input function, regardless of the value of $N$.
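The vector form also makes the approximation easy to compute in code. Below is a minimal sketch - assuming the autograd package is available for automatic differentiation (it is not required by the text above) - that evaluates the gradient of the Example 3 function and forms the compact first order Taylor Series approximation at a point.

import autograd.numpy as np
from autograd import grad

# an example multi-input function written for vector input w
g = lambda w: 5 + (w[0] + 0.5)**2 + (w[1] + 0.5)**2

# the gradient function of g, produced via automatic differentiation
nabla_g = grad(g)

# point of expansion and the compact first order Taylor Series approximation
w0 = np.array([0.0, 0.0])
h = lambda w: g(w0) + np.dot(nabla_g(w0), w - w0)

# near w0 the approximation h closely matches g
w = np.array([0.1, -0.05])
print(g(w), h(w))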