We have now seen multiple times how feature scaling via the standard normalization scheme - that is, subtracting the mean and dividing off the standard deviation of each input feature - substantially improves our ability to properly tune linear regression and two-class classification cost functions when employing gradient descent and coordinate descent methods (see Sections 8.4 and 9.4). Unsurprisingly, standard normalization provides the same benefits in the case of multiclass classification - whether we employ the One-versus-All framework or the multiclass perceptron / softmax cost - which we explore here.
As we saw in Section 9.1, One-versus-All (OvA) multiclass classification applied to a dataset with $C$ classes consists of applying $C$ two-class classifiers to the dataset - each assigned to learn a linear separator between a single class and all the other data - and combining the results. Because we have already seen the benefit of using standard normalization in the context of two-class classification (in Section 9.4), clearly employing the same normalization scheme in the context of OvA will provide the same benefits. Here we illustrate this fact using a simple single-input multiclass dataset, loaded in and shown below.
Below we train an OvA multiclass classifier to represent / separate the $C = 4$ classes of this dataset. For each of the $C$ two-class subproblems we use the softmax cost function, a fixed steplength of $\alpha = 10^{-1}$ (the largest steplength of the form $10^{-\gamma}$, with $\gamma$ an integer, that we could find producing consistent convergence for this example), and $100$ iterations. Each subproblem is initialized at the point $\mathbf{w} = \begin{bmatrix} 1 \\ 4 \end{bmatrix}$.
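To make this setup concrete, below is a minimal sketch of this kind of OvA run, assuming `x` is a length $P$ array of scalar inputs and `y` holds integer class labels $0,\ldots,C-1$. The function names (`softmax_cost`, `gradient_descent`, `train_ova`) and the use of the autograd library here are illustrative choices, not the text's backend implementation.

```python
import autograd.numpy as np
from autograd import grad

def softmax_cost(w, x, y):
    # two-class softmax (log-loss) cost for a single-input dataset
    # x: length-P array of scalar inputs, y: length-P array of +/-1 labels
    model = w[0] + w[1]*x
    return np.sum(np.log(1 + np.exp(-y*model))) / y.size

def gradient_descent(cost, w, alpha, max_its):
    # fixed-steplength gradient descent, with the gradient supplied by autograd
    cost_grad = grad(cost)
    weight_history = [w]
    for _ in range(max_its):
        w = w - alpha*cost_grad(w)
        weight_history.append(w)
    return weight_history

def train_ova(x, y, C, alpha=0.1, max_its=100):
    # solve C two-class subproblems: class c versus all remaining classes
    weights = []
    for c in range(C):
        y_c = np.where(y == c, 1.0, -1.0)
        history = gradient_descent(lambda w: softmax_cost(w, x, y_c),
                                   np.array([1.0, 4.0]), alpha, max_its)
        weights.append(history[-1])
    return np.array(weights)   # one row of weights (bias, slope) per class
```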
Below we plot each gradient descent run on its corresponding cost function contour, labeling the axes of each contour plot with its respective weights. Here we can see how the contours of every subplot are highly elliptical, making it extremely difficult for gradient descent to make progress towards a good solution.
Plotting the resulting fit to the data - which we do below - we can see just how poorly we have been able to learn a reasonable representation of this dataset due to the highly skewed contours of its cost functions. Notice that the fit shown is produced by evaluating the fusion rule described in the previous Sections of this Chapter over a fine range of points across the input space of the dataset.
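As a rough sketch of that evaluation - continuing the hypothetical names from the snippet above - the fusion rule simply assigns each input to the class whose two-class linear model evaluates largest on it.

```python
import numpy as np

def fuse(x_new, ova_weights):
    # OvA fusion rule: evaluate each class's linear model on every input
    # and predict the class with the largest evaluation
    # x_new: array of scalar inputs, ova_weights: (C, 2) array from train_ova
    evals = ova_weights[:, 0][:, np.newaxis] + np.outer(ova_weights[:, 1], x_new)
    return np.argmax(evals, axis=0)
```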
Let's now compare this to a run of the same general format, but having employed standard normalization first. That is - since this is a single-input dataset - we replace each input by subtracting off the mean of the entire set of inputs and dividing off its standard deviation as

\begin{equation}
x_p \longleftarrow \frac{x_p - \mu}{\sigma}
\end{equation}
where the sample mean of the inputs $\mu$ is defined as
\begin{equation}
\mu = \frac{1}{P}\sum_{p=1}^{P}x_p
\end{equation}

and the sample standard deviation of the inputs $\sigma$ is defined as
\begin{equation}
\sigma = \sqrt{\frac{1}{P}\sum_{p=1}^{P}\left(x_p - \mu \right)^2}.
\end{equation}

We produced Python functionality for performing this standard normalization in Section 8.4, which we load in from a backend file called normalizers and apply to our current dataset. Afterwards we re-run OvA training for just $10$ steps of gradient descent initialized at the same point as above, using a fixed steplength of $\alpha = 10$ in each instance (again, the largest steplength value of this form that produced convergence). Notice - as is always the case after normalizing the input of a dataset - that here we can use far fewer steps and a much larger steplength value.
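The functionality we load in is roughly of the following form - a sketch of a standard normalizer for a single-input dataset, with `x` denoting the array of training inputs; the actual normalizers backend may differ in its details.

```python
import numpy as np

def standard_normalizer(x):
    # compute the mean and standard deviation of the training inputs
    x_mean = np.mean(x)
    x_std = np.std(x)

    # normalizer maps inputs to their zero-mean, unit-deviation versions,
    # inverse_normalizer undoes that mapping
    normalizer = lambda data: (data - x_mean) / x_std
    inverse_normalizer = lambda data: data*x_std + x_mean
    return normalizer, inverse_normalizer

# normalize the training inputs, then re-run OvA training on the result
# (here with alpha = 10 and just 10 gradient descent steps per subproblem)
normalizer, inverse_normalizer = standard_normalizer(x)
x_normalized = normalizer(x)
```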
As we can see by analyzing the contour plots below, performing standard normalization unsurprisingly greatly improves the contours of each subproblem's cost function, making them much easier to minimize.
Examining the corresponding fit - plotted below - we see significantly better performance. Note - as is always the case - that in order to evaluate new test points (e.g., to produce the plotted fit shown below) we must normalize them precisely as we did the training data.
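In code this amounts to passing every new input through the same normalizer built from the training statistics before evaluating the learned model, e.g. as in the brief sketch below (continuing the hypothetical names used above, with `ova_weights_normalized` standing in for the weights learned on the normalized input).

```python
# evaluate the fit over a fine range of raw inputs: each point is normalized
# with the *training* mean and standard deviation before being fed to the model
x_fine = np.linspace(x.min(), x.max(), 500)
predicted_labels = fuse(normalizer(x_fine), ova_weights_normalized)
```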
Standard normalization also substantially improves training with the simultaneous multiclass frameworks - the multiclass perceptron and multiclass softmax cost functions. As a simple example illustrating this fact, we first examine how well the multiclass perceptron fits the same toy dataset used in the previous Subsection - before and after normalizing its input.
Now we run $100$ steps of gradient descent with a fixed steplength of $10^{-2}$, the largest steplength of the form $10^{-\gamma}$ with $\gamma$ an integer we found to produce convergence. This will not result in significant learning, since we saw in Example 2 of the previous Section that with this dataset we required close to $10,000$ iterations to produce adequate learning.
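For reference, a minimal sketch of the multiclass perceptron cost for this single-input case is given below, assuming a $2 \times C$ weight matrix `W` with one column of bias / slope weights per class. It can be minimized with the same kind of gradient descent loop sketched earlier, e.g. via `lambda W: multiclass_perceptron(W, x, y)`.

```python
import autograd.numpy as np

def multiclass_perceptron(W, x, y):
    # W: (2, C) weight matrix, x: length-P scalar inputs, y: integer labels 0..C-1
    all_evals = W[0, :] + x[:, np.newaxis]*W[1, :]        # shape (P, C)
    # penalize the gap between the largest class evaluation
    # and the evaluation of each point's true class
    return np.sum(np.max(all_evals, axis=1)
                  - all_evals[np.arange(y.size), y]) / y.size
```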
Plotting the learned fusion rule model employing the best set of weights from this run, we can indeed see that our model performs quite poorly.
Now we normalize the input using standard normalization - as described in Sections 8.4, 9.4, and the subsection above.
With the input normalized we now make a run of gradient descent - much shorter and with a much larger steplength than used previously (as is typically possible when normalizing the input to a cost function). This results in considerably better performance after just a few steps, as shown by the fit produced from the best set of weights found, which we plot below as well.
If we compare the cost and misclassification count per iteration of both gradient descent runs above we can see the startling difference between the two runs. In particular, in the time it takes for the run on standard-normalized input to converge to a perfect solution, the unnormalized run fails to make any real progress at all.
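As a sketch of how such a comparison can be computed - assuming a weight history like the one produced by the hypothetical `gradient_descent` routine above, with each entry a $2 \times C$ matrix, and `x`, `y` the training inputs and labels - the misclassification count at each step is simply the number of points whose fused prediction disagrees with their true label.

```python
import numpy as np

def misclassification_history(weight_history, x, y):
    # count the number of misclassified points at each step of gradient descent
    counts = []
    for W in weight_history:
        all_evals = W[0, :] + x[:, np.newaxis]*W[1, :]     # shape (P, C)
        predictions = np.argmax(all_evals, axis=1)         # fused class labels
        counts.append(int(np.sum(predictions != y)))
    return counts
```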