Chapter 13: Multi-layer Perceptrons

13.2 Batch normalization

Previously, when dealing with linear supervised learning (see e.g., Sections 8.4, 9.4, 10.3, and 11.3), we saw how normalizing each input feature of a dataset significantly aids parameter tuning by improving the shape of a cost function's contours (making them more 'circular'). Another way of saying this is that we normalized every distribution that touches a parameter of the system - which in the linear case consists of the distribution of each input feature. The intuition that normalizing parameter-touching distributions helps carries over completely from the linear learning scenario to our current situation, where we are conducting nonlinear learning via multilayer perceptrons. The difference is that we now have many more parameters (in comparison to the linear case), and many of them are internal to the network as opposed to weights in a linear combination. Nonetheless each parameter - as we detail here - touches a distribution that, when normalized, tends to improve optimization speed.

Specifically - as we will investigate here in the context of the multilayer perceptron - to completely carry over the idea of input normalization to our current scenario we will need to normalize the output of each and every network activation. Moreover, since these activation distributions naturally change during parameter tuning - e.g., whenever a gradient descent step is made - we must normalize these internal distributions every time we make a parameter update. This leads to the incorporation of a normalization step grafted directly onto the architecture of the multilayer perceptron itself, which is invoked every time the weights are changed. This natural extension of input normalization is popularly referred to as batch normalization.


13.2.1 The stable weight-touching distributions of a linear model

When discussing linear model based learning in Chapters 8 - 11 we employed the generic linear model

\begin{equation} \text{model}\left(\mathbf{x},\mathbf{w}\right) = w_0 + x_1w_1 + \cdots + x_Nw_N \end{equation}

for both regression and classification. When tuning these weights via the minimization of any cost function over a dataset of $P$ points $\left\{\mathbf{x}_p,y_p\right\}_{p=1}^P$ we can see how the $n^{th}$ dimension of each input point $x_{p,n}$ touches the $n^{th}$ weight $w_n$. Because of this, as we discussed previously, the input distribution along the $n^{th}$ dimension $\left\{x_{p,n} \right\}_{p=1}^P$ touching $w_n$ strongly affects the contours of any cost function along the $w_n$ direction. If these input distributions differ, as is typically the case in practice, the contours of a cost function stretch out in long directions making optimization via gradient descent quite challenging irregardless of the choice of steplength. However by normalizing each input dimensions via the standard normalization scheme - i.e., by mean centering and re-scaling by its standard deviation (see Sections 8.4, 9.4, and 10.3) - we can subtantially temper the contours of a cost funtion making it much easier to optimize.

Performing standard normalization along the $n^{th}$ input feature means making the replacement

\begin{equation} x_{p,n} \longleftarrow \frac{x_{p,n} - \mu_{n}}{\sigma_{n}} \end{equation}

where $\mu_n$ and $\sigma_n$ are the mean and standard deviation along the $n^{th}$ feature of the input, respectively. Performing standard normalization on the input of a linear model helps temper any cost function along each parameter direction $w_1,\,w_2,\,...,w_N$. Of course once normalized these input distributions $\left\{x_{p,n} \right\}_{p=1}^P$ for each $n$ never change again - they remain stable regardless of how we set the parameters of our model during training.

Below we repeat a simple implementation of this standard normalizer for convenience; we will use it repeatedly in what follows.

In [1]:
import numpy as np

# standard normalization function - input data, output standard normalized version
def standard_normalize(data):
    # compute the mean and standard deviation of the input
    data_means = np.mean(data,axis = 1)[:,np.newaxis]
    data_stds = np.std(data,axis = 1)[:,np.newaxis]   

    # check to make sure that data_stds > small threshold; for those that are not,
    # divide by 1 instead of the original standard deviation
    ind = np.argwhere(data_stds < 10**(-2))
    if len(ind) > 0:
        ind = [v[0] for v in ind]
        adjust = np.zeros((data_stds.shape))
        adjust[ind] = 1.0
        data_stds += adjust

    # return standard normalized data 
    return (data - data_means)/data_stds
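As a quick illustration of this normalizer - on a small, hypothetical input array with one column per data point - note how a (near) zero variance feature is simply mean centered without re-scaling:

# toy illustration (hypothetical values): two features, five points,
# where the second feature has zero variance
x = np.array([[1., 2., 3., 4., 5.],
              [10., 10., 10., 10., 10.]])

x_normed = standard_normalize(x)
print(np.mean(x_normed, axis=1))   # each feature is mean centered (approximately zero)
print(np.std(x_normed, axis=1))    # first feature has unit deviation, second stays at zero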

13.2.2 Batch normalized single layer perceptron units

Now let us think about our current context, where we employ a linear combination of $B$ multilayer perceptron feature transformations in a nonlinear model of the general form

\begin{equation} \text{model}\left(\mathbf{x},\mathbf{w}\right) = w_0 + f_1\left(\mathbf{x}\right)w_1 + \cdots + f_B\left(\mathbf{x}\right)w_B \end{equation}

where the variable $\mathbf{w}$ on the left hand side denotes all parameters of the model: the weights of the linear combination $w_0$,...,$w_B$ as well as those internal to each nonlinear feature transformation $f_1$,...,$f_B$. As we can see by studying simple examples, standard normalization of the input here tempers the contours of a cost function only along the weights internal to the first layer of a multilayer perceptron, as these are the weights touched by the distribution of each input dimension.

For example let us look at the simplest case (of a multilayer perceptron) and suppose each feature transformation of our model is a single hidden layer perceptron, the $b^{th}$ of which takes the general form (as introduced in the previous Section)

\begin{equation} f^{(1)}_b\left(\mathbf{x}\right)=a\left(w^{\left(1\right)}_{0,\,b}+\underset{n=1}{\overset{N}{\sum}}{w^{\left(1\right)}_{n,\,b}\,x_n}\right) \end{equation}

where $a\left(\cdot\right)$ is an activation function. Hence this means that our model takes the form

\begin{equation} \text{model}\left(\mathbf{x},\mathbf{w}\right) = w_0 + f^{(1)}_1\left(\mathbf{x}\right)w_1 + \cdots + f^{(1)}_B\left(\mathbf{x}\right)w_B. \end{equation}

In this instance we can see that the data input does not touch the weights of the linear combination $w_1,\,w_2,...,w_B$ (as was the case with our linear model), but instead touches the internal weights of each perceptron, i.e., the $n^{th}$ input feature dimension touches the internal weights $w^{\left(1\right)}_{n,\,b}$ for $b=1,...,B$. In other words, by performing standard normalization on the input here we temper the contours of a cost function along the internal weights of a model employing single hidden layer elements.

But what about the contours of the cost function along the weights of the linear combination $w_1,\,w_2,...,w_B$? Which data distributions touch them? A glance at our model above shows that it is the distribution of each perceptron / activation output over the input data. In other words, the distribution $\left\{f^{(1)}_b\left(\mathbf{x}_p\right) \right\}_{p=1}^P$ touches the weight $w_b$ of the linear combination. Since we know that the distribution touching a weight of a model affects any cost function along this dimension then, in analogy to how we have normalized the input data, perhaps if we (standard) normalize each unit / activation output distribution we can similarly make our model easier to optimize.

To do this we would standard normalize each unit itself by mean centering and re-scaling it according to the mean / standard deviation of the distribution over our input as

\begin{equation} f_b^{(1)} \left(\mathbf{x} \right) \longleftarrow \frac{f_b^{(1)} \left(\mathbf{x}\right) - \mu_{f_b^{(1)}}}{\sigma_{f_b^{(1)}}} \end{equation}

where

\begin{equation} \begin{aligned} \mu_{f_b^{(1)}} &= \frac{1}{P}\sum_{p=1}^{P}f_b^{(1)}\left(\mathbf{x}_p \right) \\ \sigma_{f_b^{(1)}} &= \sqrt{\frac{1}{P}\sum_{p=1}^{P}\left(f_b^{(1)}\left(\mathbf{x}_p \right) - \mu_{f_b^{(1)}} \right)^2}. \end{aligned} \end{equation}

Would standard normalizing - also referred to as batch normalizing in the context of single / multilayer perceptrons - each unit of our single layer perceptron model help speed up optimization (much in the same way standard normalizing the input does)? It seems promising given our experience with normalizing the input, but the only way to know for sure is to experiment with the idea. As we show via several experiments below, it does indeed help with optimization as we intuited it would.

Note how, for an arbitrary single layer unit $f^{(1)}\left(\mathbf{x}\right)$, we can easily adjust the generic recipe for building single layer functions introduced in the previous Section to include this standard normalization step. To construct our batch normalized single layer perceptron function we simply follow the adjusted perceptron recipe below.

Recursive recipe for batch normalized single layer perceptron units


1:   input: Activation function $a\left(\cdot\right)$ and input data $\left\{\mathbf{x}_p\right\}_{p=1}^P$
2:   Compute linear combination: $v = w_{0}^{(1)}+\sum_{n=1}^{N}w_{n}^{(1)}\,x_n$
3:   Pass result through activation: $f^{(1)}\left(\mathbf{x}\right) = a\left(v\right)$
4:   Compute mean $\mu_{f^{(1)}}$ and standard deviation $\sigma_{f^{(1)}}$ of $\left\{f^{(1)}\left(\mathbf{x}_p\right) \right\}_{p=1}^P$
5:   Standard normalize: $f^{(1)} \left(\mathbf{x} \right) \longleftarrow \frac{f^{(1)} \left(\mathbf{x}\right) - \mu_{f^{(1)}}}{\sigma_{f^{(1)}}}$
6:   output: Batch normalized single layer unit $f^{(1)} \left(\mathbf{x} \right)$


However it is important to note that - unlike the distribution of the input data - the distribution of each of our single layer units changes every time the internal parameters of our system are changed. In other words, since each of the single layer units $f^{(1)}_b$ has internal parameters, the distribution $\left\{f^{(1)}_b\left(\mathbf{x}_p\right) \right\}_{p=1}^P$ varies depending on the setting of these internal weights (as we will illustrate via examples below). In the jargon of deep learning this shifting of a network model's internal distributions is often referred to as internal covariate shift, or just covariate shift for short.

Since the weights of our model will change during optimization - e.g., from one step of gradient descent to the next - in order to keep the unit distributions normalized we have to normalize them at every step of parameter tuning (e.g., via gradient descent). To do this we can simply build a standard normalization step directly into the perceptron architecture itself, as we show in the Python implementation next.

The distribution of single layer units / activation outputs $\left\{f^{(1)}_b\left(\mathbf{x}_p\right) \right\}_{p=1}^P$ changes depending on the setting of the units' internal parameters, an issue referred to as internal covariate shift in the jargon of deep learning. This means that - if we want to temper these distributions at each step of a run of e.g., gradient descent - we must normalize them every time weights are changed. This can be easily done by inserting a normalization step into the recursive algorithm for constructing perceptron units.

13.2.3 A Python implementation of batch normalization for single layer units

Below we illustrate how one can implement batch normalization for single layer perceptron units. First we provide the same Python implementation of our original single layer perceptron feature_transforms described in the previous Section. We also define a relu activation function for testing.

In [2]:
# relu activation
def activation(t):
    return np.maximum(0,t)
In [17]:
# a feature_transforms function for computing
# U_1 single layer perceptron units efficiently
def feature_transforms(x,W_1):    
    #  pad with ones (to compactly take care of bias) for first layer computation        
    o = np.ones((1,np.shape(x)[1]))
    x = np.vstack((o,x))

    # compute linear combination of current layer units
    v = np.dot(x.T, W_1).T

    # pass through activation
    a = activation(v)
    return a

To standard normalize each unit we simply need to add a normalization step to the end of this function, as shown below. To do this we employ the standard normalization function standard_normalize given above. To differentiate this batch normalized version from the one above we call it feature_transforms_batch_normalized.

In [4]:
# a feature_transforms function for computing
# U_1 single layer batch normalized 
# perceptron units efficiently
def feature_transforms_batch_normalized(x, W_1):    
    #  pad with ones (to compactly take care of bias) for first layer computation        
    o = np.ones((1,np.shape(x)[1]))
    x = np.vstack((o,x))

    # compute linear combination of current layer units
    v = np.dot(x.T, W_1).T

    # pass through activation
    a = activation(v)
        
    # NEW - perform standard normalization on 
    # first layer unit distributions 
    a = standard_normalize(a)
    return a
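As a quick sanity check - on hypothetical random input and first layer weights - each unit's output distribution now has (approximately) zero mean and unit standard deviation over the data:

# sanity check with hypothetical random input and weights: after batch
# normalization each unit's distribution over the P points is standardized
np.random.seed(0)
N, P, U_1 = 2, 50, 3                          # input dimension, number of points, number of units
x = np.random.randn(N, P)                     # toy input, one column per point
W_1 = np.random.randn(N + 1, U_1)             # first layer weights (bias row on top)

f = feature_transforms_batch_normalized(x, W_1)
print(np.mean(f, axis=1))                     # approximately zero for each unit
print(np.std(f, axis=1))                      # approximately one for each (non-degenerate) unit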

Now we can run a few experiments to verify that our batch normalized version is indeed easier to optimize than the standard single layer perceptron model.

Example 1. The shifting distributions / internal covariate shift of a single layer perceptron

In this example we illustrate the covariate shift of a single layer perceptron with two relu units $f^{(1)}_1$ and $ f^{(1)}_2$

\begin{equation} \text{model}\left(\mathbf{x},\mathbf{w}\right) = w_0 + f^{(1)}_1\left(\mathbf{x}\right)w_1 + f^{(1)}_2\left(\mathbf{x}\right)w_2 \end{equation}

employing the two-class classification dataset shown below.


We now run $5,000$ steps of gradient descent to minimize the softmax cost using this single layer network, where we standard normalize the input data. We use a set of random weights for the network loaded in from memory.
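For reference, below is a minimal sketch - with hypothetical weight packaging, where w[0] holds the internal perceptron weights and w[1] the weights of the final linear combination - of the kind of single layer model and two-class softmax cost minimized in this run; it reuses the feature_transforms function defined above.

# a minimal sketch (hypothetical weight packaging) of the single layer model
# and two-class softmax cost minimized in this experiment
def model(x, w):
    # compute single layer perceptron units using the internal weights w[0]
    f = feature_transforms(x, w[0])

    # pad with ones (to take care of the bias w_0) and form the final linear combination,
    # where w[1] has shape (U_1 + 1, 1)
    o = np.ones((1, np.shape(f)[1]))
    f = np.vstack((o, f))
    return np.dot(f.T, w[1]).T

# two-class softmax / log-loss cost averaged over the dataset, with labels y in {-1, +1}
def softmax_cost(w, x, y):
    return np.sum(np.log(1 + np.exp(-y * model(x, w)))) / float(y.size)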


Below we show an animation of this gradient descent run, plotting the distribution of unit outputs $\left\{\left(f^{(1)}_1\left(\mathbf{x}_p\right),\,f^{(1)}_2\left(\mathbf{x}_p\right)\right) \right\}_{p=1}^P$ at a subset of the steps taken during the run. In the left panel we show this covariate shift - that is, the activation output distribution - at the $k^{th}$ step of the optimization, while the right panel shows the complete cost function history curve, where the current step of the animated optimization is marked on the curve with a red dot. Moving the slider from left to right progresses the run from start to finish.

As you can see by moving the slider around, the distributions of activation outputs - i.e., the distributions touching the weights $w_1$ and $w_2$ of our model's linear combination - change dramatically as the gradient descent algorithm progresses. We can intuit (from our previous discussions on input normalization) that this sort of shifting distribution negatively affects the speed at which gradient descent can properly minimize our cost function.

Now we repeat the above experiment using the batch normalized single layer perceptron, making a run of $10,000$ gradient descent steps from the same initialization. We then animate the covariate shift / distribution of activation outputs using the same animation tool as above. Moving the slider below from left to right - progressing the algorithm - we can see here that the distribution of activation outputs stays considerably more stable.

Example 2. Comparing original and batch normalized single layer models on a subset of MNIST

In this example we illustrate the benefit of batch normalization in terms of speeding up optimization via gradient descent on a dataset of $10,000$ handwritten digits from the MNIST dataset. Each image in this dataset has been contrast normalized, a common preprocessing step for image datasets that we discuss later in the context of convolutional networks. Here we show $100$ steps of gradient descent - with the largest steplength of the form $10^{-\gamma}$, for integer $\gamma$, that we found produced adequate convergence - comparing the standard and batch normalized versions of a network with relu activation and a single layer architecture with 100 units. Here we can see that, both in terms of cost function value and number of misclassifications, the batch normalized version of the perceptron allows for much more rapid minimization via gradient descent than the original version.


13.2.4 Batch normalized multilayer perceptron units

If we employ a linear combination of $L$ layer multilayer perceptron units in our model

\begin{equation} \text{model}\left(\mathbf{x},\mathbf{w}\right) = w_0 + f^{(L)}_1\left(\mathbf{x}\right)w_1 + \cdots + f^{(L)}_B\left(\mathbf{x}\right)w_B \end{equation}

and we pick apart how each one is constructed recursively (as detailed in the previous Section), we can see that the distribution of units from every layer of the network touches a model weight. By the same intuitive leap we made in developing our normalization procedures - that it is always beneficial to normalize any distribution touching a model weight - along with our discussion of the single layer case above, it then makes sense to try normalizing the distribution of units in every layer of a multilayer perceptron.

For example, as discussed in the previous Section an $L=2$ layer network element can be written as

\begin{equation} f^{\left(2\right)}\left(\mathbf{x}\right)=a\left(w^{\left(2\right)}_{0}+\underset{i=1}{\overset{U_1}{\sum}}{w^{\left(2\right)}_{i}}\,f^{(1)}_i\left(\mathbf{x}\right) \right) \end{equation}

where each single unit element is defined

\begin{equation} f^{(1)}_i\left(\mathbf{x}\right) = a\left(w^{\left(1\right)}_{0,i}+\underset{n=1}{\overset{N}{\sum}}{w^{\left(1\right)}_{n,i}\,x_n}\right). \end{equation}

In analogy to what we have seen so far - what quantities could be normalized here to help condition the contours of a cost function along particular weight directions? The answer: any distribution touching a weight. Here this includes

  • the input distribution of the data $x_n$ along the $n^{th}$ dimension, since each such distribution touches the first layer weight $w^{\left(1\right)}_{n,i}$
  • the first layer units' distributions $\left\{f^{(1)}_i\left(\mathbf{x}_p\right) \right\}_{p=1}^P$ for each fixed value of $i$, since each such distribution touches the second layer weight $w^{\left(2\right)}_{i,b}$
  • the second layer units' distributions $\left\{f^{(2)}\left(\mathbf{x}_p\right) \right\}_{p=1}^P$, since each such distribution touches a weight $w_b$ of the linear combination

This same pattern / logic holds for deeper networks as well. A general $L$ layer network unit can be written (as shown in the previous Section) as

\begin{equation} f^{\left(L\right)}\left(\mathbf{x}\right)=a\left(w^{\left(L\right)}_{0}+\underset{i=1}{\overset{U_{L-1}}{\sum}}{w^{\left(L\right)}_{i}}\,f^{(L-1)}_i\left(\mathbf{x}\right) \right). \end{equation}

And the same analysis leads to the conclusion that we can try standard normalizing each unit in its first $L-1$ layers, as well as $f^{\left(L\right)}$ itself, with the aim of improving optimization speed. This leads to the notion that - in a multilayer perceptron - we should (standard) normalize every unit in every layer.

So - in short - to perform the same kind of normalization to an $L$ layer unit we should first standard normalize each unit in its earlier layers, and finally standard normalize it as

\begin{equation} f^{(L)} \left(\mathbf{x} \right) \longleftarrow \frac{f^{(L)} \left(\mathbf{x}\right) - \mu_{f^{(L)}}}{\sigma_{f^{(L)}}} \end{equation}

where

\begin{equation} \begin{aligned} \mu_{f^{(L)}} &= \frac{1}{P}\sum_{p=1}^{P}f^{(L)}\left(\mathbf{x}_p \right) \\ \sigma_{f^{(L)}} &= \sqrt{\frac{1}{P}\sum_{p=1}^{P}\left(f^{(L)}\left(\mathbf{x}_p \right) - \mu_{f^{(L)}} \right)^2}. \end{aligned} \end{equation}

Doing this, using the standard normalization procedure, is often referred to as batch normalizing a multilayer perceptron. With such a simple adjustment we can still construct each batch normalized perceptron unit recursively, since all we must do is insert a standard normalization step at the end of each layer, as summarized below.

Note: in practice one often parameterizes the standard normalization of each unit, e.g., for an $L$ layer unit, as

\begin{equation} f^{(L)} \left(\mathbf{x} \right) \longleftarrow \alpha\frac{f^{(L)} \left(\mathbf{x}\right) - \mu_{f^{(L)}}}{\sigma_{f^{(L)}}} + \beta \end{equation}

and it is this parameterized standard normalization scheme which is often referred to as batch normalization. Including the parameters $\alpha$ and $\beta$ provides more flexibility than the unparameterized form we discuss here. However, because the heart of batch normalization lies in the standard normalization procedure itself, we have introduced it unparameterized. Moreover, as we have seen above and will see below via experiment, even without parameters the standard normalization of each network unit improves optimization substantially.
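To make the parameterized version concrete, below is a minimal sketch - with hypothetical names, treating alpha and beta as per-unit values that would be tuned along with the rest of the network's weights - of how the normalization step could be parameterized.

# a minimal sketch (hypothetical names) of the parameterized normalization step;
# alpha and beta are additional parameters - one value per unit - tuned along
# with the rest of the network's weights
def parameterized_normalize(a, alpha, beta):
    # compute mean and standard deviation of each unit's distribution over the data
    a_means = np.mean(a, axis=1)[:, np.newaxis]
    a_stds = np.std(a, axis=1)[:, np.newaxis] + 10**(-5)   # small constant guards against division by zero

    # standard normalize, then re-scale by alpha and re-shift by beta
    return alpha * (a - a_means) / a_stds + beta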

Recursive recipe for batch normalized $L$ layer perceptron units


1:   input: Activation function $a\left(\cdot\right)$ and number of $\left(L-1\right)$ layer units $U_{L-1}$
2:   Construct $\left(L-1\right)$ layer batch normalized units $f^{(L-1)}_i\left(\mathbf{x}\right)$ for $i=1,\,...,U_{L-1}$
3:   Compute linear combination: $v = w_{0}^{(L)}+\sum_{i=1}^{U_{L-1}}w_{i}^{(L)}\,f^{(L-1)}_i\left(\mathbf{x}\right)$
4:   Pass result through activation: $f^{(L)}\left(\mathbf{x}\right) = a\left(v\right)$
5:   Compute mean $\mu_{f^{(L)}}$ and standard deviation $\sigma_{f^{(L)}}$ of $\left\{f^{(L)}\left(\mathbf{x}_p\right) \right\}_{p=1}^P$
6:   Standard normalize: $f^{(L)} \left(\mathbf{x} \right) \longleftarrow \frac{f^{(L)} \left(\mathbf{x}\right) - \mu_{f^{(L)}}}{\sigma_{f^{(L)}}}$
7:   output: Batch normalized $L$ layer unit $f^{(L)} \left(\mathbf{x} \right)$


13.2.5 A Python implementation of multilayer batch normalization

Below we illustrate how one can implement the activation output normalization idea unraveled above. First we provide the same Python implementation of our original multilayer perceptron feature_transforms described in the previous Section.

In [306]:
# a feature_transforms function for computing
# U_L L layer perceptron units efficiently
def feature_transforms(a, w):    
    # loop through each layer matrix
    for W in w:
        # pad with ones (to compactly take care of bias) for 
        # current layer computation        
        o = np.ones((1,np.shape(a)[1]))
        a = np.vstack((o,a))
        
        # compute linear combination of current layer units
        v = np.dot(a.T, W).T
    
        # pass through activation
        a = activation(v)
    return a

To standard normalize the distribution of each layer's units we simply add a normalization step to the end of the for loop above. We call this batch normalized version feature_transforms_batch_normalized.

In [308]:
# a feature_transforms function for computing
# U_L L layer batch normalized perceptron units efficiently
def feature_transforms_batch_normalized(a, w):    
    # loop through each layer matrix
    for W in w:
        # pad with ones (to compactly take care of bias) for 
        # current layer computation        
        o = np.ones((1,np.shape(a)[1]))
        a = np.vstack((o,a))
        
        # compute linear combination of current layer units
        v = np.dot(a.T, W).T
    
        # pass through activation
        a = activation(v)
        
        # NEW - perform standard normalization on 
        # each layer units' distributions 
        a = standard_normalize(a)
    return a
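As a quick usage sketch - with hypothetical sizes - note that the weight input w here is a list of layer weight matrices, each carrying a bias row on top:

# example usage with hypothetical sizes: a 3 layer architecture with
# N = 2 dimensional input and 4 units per layer
N = 2; U = 4; P = 100
w = [np.random.randn(N + 1, U),        # first layer weights (bias row on top)
     np.random.randn(U + 1, U),        # second layer weights
     np.random.randn(U + 1, U)]        # third layer weights

x = np.random.randn(N, P)              # toy input, one column per point
f = feature_transforms_batch_normalized(x, w)

# each of the U final layer units is normalized over the P points
print(np.shape(f))                     # (4, 100)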

Now we can perform a few simple experiments that illustrate both the problem of internal covariate shift, that is the shifting distributions of units at each layer of a multilayer perceptron, as well as the general benefit (in terms of optimization speed-up) of batch normalization.

Example 3. The shifting distributions / covariate shift of a multilayer perceptron

In this example we illustrate the covariate shift of a standard $4$ layer multilayer perceptron with two units per layer, using the relu activation and the same dataset employed in the previous example. We then compare this to the covariate shift present in the batch normalized version of the network. Since each layer has just two units we can plot the distribution of activation outputs of each layer, visualizing the covariate shift. As in the previous example we make a run of $10,000$ gradient descent steps and animate the covariate shift of all $4$ layers of the network via a slider mechanism.

Below we animate the run of gradient descent. As with the animation in the previous example, the covariate shift of this network - shown in the left panel below - is considerable. As you move the slider from left to right you can track which step of gradient descent is being illustrated by the red point shown on the cost function history in the right panel (as you pull the slider from left to right the run is animated from start to finish).

Each layer's units' distribution is shown in this panel: the output of the first layer $\left(f_1^{(1)},f_2^{(1)}\right)$ is colored cyan, the second layer $\left(f_1^{(2)},f_2^{(2)}\right)$ magenta, the third layer $\left(f_1^{(3)},f_2^{(3)}\right)$ lime green, and the fourth layer $\left(f_1^{(4)},f_2^{(4)}\right)$ orange. In analogy to the animation shown above for a single layer network, here the horizontal and vertical coordinates of each point represent the activation outputs of the first and second unit of each layer, respectively.

Performing batch normalization on each layer of this network helps considerably in taming this covariate shift. Below we run the same experiment - with the same initialization, activation, and dataset - using the batch normalized version of the network. Afterwards we again animate the covariate shift for a subset of the steps of gradient descent.

Moving the slider from left to right below progresses the animation from the start to finish of the run. Scanning over the entire range of steps we can see in the left panel that the distribution of each layer's activation outputs remains much more stable than previously.

Example 4. Comparing original and batch normalized multilayer models on a subset of MNIST

In this example we illustrate the benefit of batch normalization in terms of speeding up optimization via gradient descent on a dataset of $10,000$ handwritten digits from the MNIST dataset. Each image in this dataset has been contrast normalized, a common preprocessing step for image datasets that we discuss later in the context of convolutional networks. Here we show $100$ steps of gradient descent - with the largest steplength of the form $10^{-\gamma}$, for integer $\gamma$, that we found produced adequate convergence - comparing the standard and batch normalized versions of a network with relu activation and a three layer architecture with 10 units per layer. Here we can see that, both in terms of cost function value and number of misclassifications, the batch normalized version of the perceptron allows for much more rapid minimization via gradient descent than the original version.


13.2.6 Evaluating test points using a batch normalized network

An important point to remember when employing a batch normalized network - which we encountered earlier in e.g., Sections 8.4 and 9.4 when introducing standard normalization of input data - is that we must treat test data precisely as we treat training data. Here this means that every normalization computed on the training data - the various means and standard deviations of the input as well as of each layer's activation outputs - must be used in the evaluation of new test points as well. In other words, all normalization constants in a batch normalized network should be fixed to the values computed on the training data (at the best step of gradient descent) when evaluating new test points.

In order to properly evaluate test points with our normalized architecture they must be normalized with respect to the same network statistics (i.e., the same input and activation output distribution normalizations) used for the training data.
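One simple way to organize this in code - a sketch with hypothetical names, assuming we adopt a 'fit on training data, then apply everywhere' pattern for each normalization - is to have the normalizer return a function with the training statistics baked in:

# a minimal sketch (hypothetical helper) of fixing normalization statistics:
# compute means / standard deviations on the training data once, and return a
# function that applies these same fixed statistics to any data - training or test
def make_standard_normalizer(train_data):
    data_means = np.mean(train_data, axis=1)[:, np.newaxis]
    data_stds = np.std(train_data, axis=1)[:, np.newaxis]
    data_stds[data_stds < 10**(-2)] = 1.0   # avoid dividing by (near) zero

    # the returned function always uses the training statistics, regardless of its input
    return lambda data: (data - data_means) / data_stds

# usage sketch: fit on training inputs / activations, apply to test points
# normalizer = make_standard_normalizer(a_train)
# a_test_normalized = normalizer(a_test)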