In practice many classification problems have more than two classes we wish to distinguish, e.g., face recognition, hand gesture recognition, general object detection, speech recognition, and more. However, because it has only two sides, a single linear separator is fundamentally insufficient as a mechanism for differentiating between more than two classes of data. Nonetheless we can use our understanding of two-class classification to overcome this shortcoming when dealing with $C>2$ classes by learning $C$ linear classifiers (one per class), each distinguishing one class from the rest of the data. The crux of the matter is how we should combine these individual classifiers to create a reasonable multiclass decision boundary. In this Section we develop this scheme - called One-versus-All multiclass classification - step-by-step by studying how such an idea should unfold on a toy dataset. With due diligence and a little common sense we can intuitively derive universal ideas regarding multiclass classification that are the basis for most popular multiclass classification schemes, including One-versus-All (OvA) classification.
A multiclass dataset $\left\{ \left(\mathbf{x}_{p},\,y_{p}\right)\right\} _{p=1}^{P}$ consists of $C$ distinct classes of data. As with the two-class case we can in theory use any $C$ distinct labels to distinguish between the classes, but for future derivations it is convenient to use label values $y_{p}\in\left\{ 0,1,...,C-1\right\}$. The next Python cell loads in and plots the $C = 3$ class dataset we will use in our derivation of OvA. Afterwards we will apply what we have developed to other datasets as well.
A note on notation: In this Section we will purposefully use the following semi-compact notation where we explicitly expose the bias $w_0$ by using $\mathbf{w}$ to denote
\begin{equation} \mathbf{w}=\begin{bmatrix} w_{1}\\ w_{2}\\ \vdots\\ w_{N} \end{bmatrix}. \end{equation}We can then write a linear model involving an input point $\mathbf{x}_p$ as
\begin{equation} w_0 + \mathbf{x}_p^T\mathbf{w}. \end{equation}We do this for the same reason this semi-compact notation was introduced in Subsection 9.1.5 - because we will naturally need to regularize the normal vectors $\mathbf{w}$, and this is easier to express when the bias is exposed separately.
The first step of OvA classification is simple - we reduce the new problem of multi-class classification into a sequence of smaller problems that we are already familiar with. If we reflect for a moment on what we want, one way of phrasing the goal of multi-class classification is that we want to learn how to distinguish each class of our data from the other $C-1$ classes. From this perspective it certainly seems that a good first step towards accomplishing our goal would be to learn $C$ two-class classifiers on the entire dataset, with the $c^{th}$ classifier trained to distinguish the $c^{th}$ class from the remainder of the data. For the $c^{th}$ two-class subproblem we simply assign temporary labels $\tilde y_p$ to the entire dataset, giving $+1$ labels to the $c^{th}$ class and $-1$ labels to the remainder of the dataset
\begin{equation} \tilde y_p = \begin{cases} +1 \,\,\,\,\,\,\text{if}\,\, y_p = c \\ -1 \,\,\,\,\,\,\text{if}\,\, y_p \neq c \end{cases} \end{equation}where again $y_p$ is the original label for the $p^{th}$ point, and run the two-class classification scheme of our choice.
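As a quick sketch of this relabeling step in Python - assuming the original labels are stored in a length-$P$ numpy array named y (a hypothetical name used only for illustration) - the temporary labels for the $c^{th}$ subproblem can be formed in a single line.

import numpy as np

# form temporary labels for the c-th two-class subproblem:
# +1 for points whose original label equals c, -1 for all others
def temporary_labels(y,c):
    return np.where(y == c, +1, -1)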
Doing this for our $C = 3$ dataset in this case we end up learning 3 linear classifiers - here we use logistic regression for each subproblem, solving each using Newton's method.
With our classifiers trained we can now illustrate our learned decision boundaries - each learned to distinguish a single class from the remainder of the data. Below we plot two rows of images - in the top row our original dataset is plotted three times, with each instance showing just one of the three two-class classifiers learned. The single class being distinguished is colored with its original color - with the corresponding learned decision boundary colored similarly - and all other data is colored gray. In the bottom row the dataset is shown along with all three learned decision boundaries at once.
With OvA we learn $C$ two-class classifiers - with the bias and normal vector weights denoted as $\left(w_0^{(0)},\,\mathbf{w}_{\mathstrut}^{(0)} \right),\,\left(w_0^{(1)},\,\mathbf{w}_{\mathstrut}^{(1)}\right),\,\ldots,\,\left(w_0^{(C-1)},\,\mathbf{w}_{\mathstrut}^{(C-1)}\right)$ - the decision boundary of the $c^{th}$ classifier can then be written as
\begin{equation} w_0^{(c)} + \mathbf{x}_{\mathstrut}^T\mathbf{w}_{\mathstrut}^{(c)} = 0 \end{equation}Note in the figure above how in each case - because each subproblem is perfectly linearly separable and because of our choice of temporary labels - the class to be distinguished from the rest lies on the positive side of its respective classifier, with the remainder of the points lying on the negative side. This of course means that for the $c^{th}$ classifier we have for the $p^{th}$ point $\mathbf{x}_p$ that
\begin{equation} w_0^{(c)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(c)} \begin{cases} > 0 \,\,\,\,\,\,\text{if}\,\,\, y_p = c \\ < 0 \,\,\,\,\,\, \text{if} \,\,\, y_p \neq c \end{cases} \end{equation}This implies that - when a point is evaluated by each two-class classifier individually - the classifier associated with the point's true label always provides the largest evaluation (indeed, the only positive one), i.e.,
\begin{equation} w_0^{(c)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(c)} = \underset{j \,=\, 0,...,C-1}{\text{max}} \,\,\,w_0^{(\,j)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(\,j)} \end{equation}So we know how to classify the points we already have - but what about those we do not? How do we classify arbitrary points in the space of our example? Let's figure this out step-by-step.
First, those points that lie on the positive side of the $c^{th}$ classifier only - like the points we already have - should clearly belong to the $c^{th}$ class. Such a point $\mathbf{x}$ lies close to those we already have, also clearly satisfying the condition that
\begin{equation} w_0^{(c)} + \mathbf{x}^T\mathbf{w}_{\mathstrut}^{(c)} = \underset{j \,=\, 0,...,C-1}{\text{max}} \,\,\,w_0^{(\,j)} + \mathbf{x}_{\mathstrut}^T\mathbf{w}_{\mathstrut}^{(\,j)} \end{equation}Therefore to get the associated label $y$ we want the maximum argument of the right hand side - since this gives the index of the classifier providing the largest evaluation. Formally then the predicted label $y$ for input point $\mathbf{x}$ here can be written as
\begin{equation} y = \underset{j \,=\, 0,...,C-1}{\text{argmax}} \,\,\,w_0^{(\,j)} + \mathbf{x}_{\mathstrut}^T\mathbf{w}_{\mathstrut}^{(\,j)} \end{equation}We color in the points representing these areas of the space of our dataset in the Python cell below.
Notice there are regions left uncolored above that do not fall into this first category (of being on the positive side of a single classifier). These include regions where points are on the positive side of more than one classifier - these are the three triangular white regions bordered by two decision boundaries. For example points in the white region at the top - bordered by the red and blue decision boundaries - are on the positive side of both the red and blue classifiers. The un-colored regions also include the one in the very middle of the plot whose points are on the negative side of all three classifiers.
Let us first determine how to appropriately classify points in those un-colored regions where more than one classifier is positive. We deal with the case where a point ends up on the negative side of all classifiers in the next Subsection. For simplicity, here we only consider the top region that is on the positive side of both the red and blue classifiers (the other regions should behave similarly).
In the Python cell below we show two example points in this region. In each case we plot the original dataset and individual classifiers along with the new point in question in black. We also highlight the distance from the new point to both decision boundaries in dashed black, with the projection of the point onto each decision boundary shown as an 'X' in the same color as its respective hyperplane.
Beginning with the point shown on the left, which class should we assign this point to?
Recall, when we discussed logistic regression, that we think of a classifier as being 'more confident' of the class identity of a given point the farther that point lies from the classifier's decision boundary. This is a simple geometric/probabilistic concept: the bigger a point's distance to the boundary, the deeper it lies in one of the classifier's half-spaces, and thus we can be much more confident in its class identity than for a point closer to the boundary. Another way to think about it: imagine we slightly perturbed the decision boundary. Those points originally close to the boundary might end up on the other side of the perturbed hyperplane, changing classes, whereas those points farther from the boundary are less likely to be so affected (hence we can be more confident in their class identities to begin with).
With this in mind, which classifier can we say is more confident about this point? The answer is the red one since our input point lies a greater distance from it.
What about the point shown in the right panel? By the same logic it is best assigned to the blue class, being at a greater distance from the blue classifier.
If we repeat this logic for every point in the region - as well as those points in the other two regions where two or more classifiers are positive - and color each point the color of its respective class, we will end up shading each such region as shown by the Python cell below.
And how do those points equidistant to two or more decision boundaries get assigned to a class? In the same way points lying on a two-class decision boundary do: we assign a class label at random. These points form the multi-class decision boundary.
How do we formalize this rule? In these regions we have assigned each point to the class whose boundary is at the largest distance from it. The signed distance of an arbitrary point $\mathbf{x}$ to the $c^{th}$ linear decision boundary (this was first discussed in Section 9.1 on logistic regression) is given as
\begin{equation} \text{signed distance of $\mathbf{x}$ to $c^{th}$ boundary} = \frac{w_0^{(c)} + \mathbf{x}_{\mathstrut}^T \mathbf{w}_{\mathstrut}^{(c)}}{\left\Vert \mathbf{w}_{\mathstrut}^{(c)} \right\Vert_2} \end{equation}If we normalize the weights of each linear classifier by the length of the normal vector
\begin{equation} w_0^{(c)} \longleftarrow \frac{w_0^{(c)}}{\left\Vert \mathbf{w}_{\mathstrut}^{(c)} \right\Vert_2} \,\,\,\,\,\,\,\,\,\,\,\,\,\,\, \mathbf{w}_{\mathstrut}^{(c)} \longleftarrow \frac{\mathbf{w}^{(c)}}{\left\Vert \mathbf{w}_{\mathstrut}^{(c)} \right\Vert_2} \end{equation}then this distance is simply written as
\begin{equation} \text{signed distance of $\mathbf{x}$ to $c^{th}$ boundary} = w_0^{(c)} + \mathbf{x}_{\mathstrut}^T \mathbf{w}_{\mathstrut}^{(c)} \end{equation}To assign a point in one of our current regions we seek out the classifier which maximizes this quantity. That is, we have found that - after weight-normalization - we have precisely the same prediction rule we found originally for regions of the space where only a single classifier is positive
\begin{equation} y = \underset{j \,=\, 0,...,C-1}{\text{argmax}} \,\,\,w_0^{(\,j)} + \mathbf{x}_{\mathstrut}^T\mathbf{w}_{\mathstrut}^{(\,j)} \end{equation}So the same rule above can be used in any region of the space where at least one classifier is positive.
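As a minimal sketch, this weight normalization might be carried out in numpy as follows - here w0 denotes the scalar bias and w the normal vector of a single classifier (hypothetical names used only for illustration).

import numpy as np

# normalize a classifier's weights by the length of its normal vector so that
# w0 + x^T w directly measures the signed distance to its decision boundary
def normalize_classifier(w0,w):
    norm = np.linalg.norm(w)
    return w0/norm, w/norm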
Finally, what about the middle blank region of our dataset - where all of our classifiers are negative? How shall we assign class labels to these points? Once again let us examine two points in this region and reason out what should be done.
Let's start with the point on the left. This point, like all those in the middle region, is on the negative side of all our classifiers. In other words, when trained, each of our $C$ two-class classifiers designated these points as not in its respective class. Thus we cannot argue - as we did for points on the positive side of multiple classifiers - that one classifier is more 'confident' in the class identity of such points. So what about the opposite question: which classifier is the least 'unsure' about the class identity of this point? To which class is the point least dissimilar, spatially speaking, given our decision boundaries? The deciding boundary here is not the one the point is farthest from, as was the case previously, but the one it is closest to. Our example point on the left lies closest to the red decision boundary - it is closest to being a red class point - hence we assign it to the red class.
By the same reasoning our point on the right - which lies closest to the green decision boundary - is assigned to the green class.
If we repeat this logic for every point in the region and color each point the color of its respective class, we will end up shading this region as shown by the Python cell below.
As in the previous case, those points equidistant to two or more decision boundaries form the multi-class boundary and are assigned to a relevant class at random.
We can formalize this rule by noting that - once again - our reasoning has led us to assign a point to the class whose boundary is at the largest signed distance from it. Every point in the region lies on the negative side of our classifiers and all signed distances are negative. Hence the shortest distance in magnitude is the largest signed distance, being the smallest (in magnitude) negative number. Conversely the largest distance in magnitude is the largest (in magnitude) negative number, and hence the smallest signed distance.
Thus once again (assuming the weights of our classifiers have been normalized) for an input point $\mathbf{x}$ we assign the label $y$ providing the maximum signed distance, i.e.,
\begin{equation} y = \underset{j \,=\, 0,...,C-1}{\text{argmax}} \,\,\,w_0^{(j)} + \mathbf{x}^T\mathbf{w}^{(j)} \end{equation}We have now deduced that the single rule for assigning a label $y$ to a point $\mathbf{x}$
\begin{equation} y = \underset{j \,=\, 0,...,C-1}{\text{argmax}} \,\,\,w_0^{(\,j)} + \mathbf{x}_{\mathstrut}^T\mathbf{w}_{\mathstrut}^{(\,j)} \end{equation}which assigns the label based on the maximum signed distance of this point to each classifier, applies to the entire space of our problem (and - in general - to any multi-class problem). In fact this rule not only applies to the toy $C = 3$ dataset we have used in deriving the rule here, but to all multi-class datasets. Indeed we could have used any dataset (even those that are not completely separable) in our derivations here, as any multi-class problem can be broken down into the three fundamental regions we have seen in these derivations: regions where points are on the positive side of a single classifier, more than one classifier, or are on the negative side of all classifiers.
We call this the fusion rule - since it tells us precisely how to fuse our $C$ individual classifiers together to make a unified and consistent classification across the entire space of any dataset. It is the core of the One-versus-All algorithm, which we now state formally.
1:  Input: multi-class dataset $\left\{ \left(\mathbf{x}_{p},\,y_{p}\right)\right\} _{p=1}^{P}$ where $y_{p}\in\left\{ 0,...,C-1\right\}$, two-class classification scheme and optimizer
2:  for $\,\,c \, = \, 0,...,C-1$
3:          form temporary labels $\tilde y_p = \begin{cases} +1 \,\,\,\,\,\,\text{if}\,\, y_p = c \\ -1 \,\,\,\,\,\,\text{if}\,\, y_p \neq c \end{cases}$
4:          solve two-class subproblem on $\left\{ \left(\mathbf{x}_{p},\,\tilde y_{p}\right)\right\} _{p=1}^{P}$ to find weights $w_0^{(c)}$ and $\mathbf{w}_{\mathstrut}^{(c)}$
5:          normalize classifier weights as $w_0^{(c)} \longleftarrow \frac{w_0^{(c)}}{\left\Vert \mathbf{w}_{\mathstrut}^{(c)} \right\Vert_2}$ and $\mathbf{w}_{\mathstrut}^{(c)} \longleftarrow \frac{\mathbf{w}_{\mathstrut}^{(c)}}{\left\Vert \mathbf{w}_{\mathstrut}^{(c)} \right\Vert_2}$
6:  end for
7:  To assign label $y$ to a point $\mathbf{x}$, apply the fusion rule: $y = \underset{j\,=\, 0,...,C-1}{\text{argmax}} \,\,\,w_0^{(j)} + \mathbf{x}_{\mathstrut}^T\mathbf{w}_{\mathstrut}^{(j)}$
In practice it is common to see implementations that skip the normalization in step 5. In theory this can lead to poor classification, since differently sized normal vectors create out-of-scale distance-to-classifier measurements. However, because each classifier is often trained using the same optimization framework - e.g., the same sized initialization and the same optimization algorithm - the trained normal vectors frequently end up being of roughly the same size (hence reducing the need for normalization).
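Putting steps 2 through 6 together, a minimal sketch of the OvA training loop might look as follows. Here two_class_solver is a hypothetical stand-in for whichever two-class scheme and optimizer one chooses (e.g., logistic regression minimized via Newton's method), assumed to return a bias and normal vector for a given relabeled dataset.

import numpy as np

# One-versus-All training sketch: x is an N x P array of inputs, y a length-P
# array of labels in {0,...,C-1}, and two_class_solver a (hypothetical) routine
# returning (w0,w) for a two-class problem with labels in {-1,+1}
def train_ova(x,y,C,two_class_solver):
    weights = []
    for c in range(C):
        # form temporary labels for the c-th subproblem
        y_tilde = np.where(y == c, +1, -1)

        # solve the two-class subproblem
        w0,w = two_class_solver(x,y_tilde)

        # normalize by the length of the normal vector
        norm = np.linalg.norm(w)
        weights.append((w0/norm, w/norm))
    return weights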
Having addressed every region in the space of our exemplar dataset, one might wonder what the entire space looks like after learning the decision boundary. We plot this in the next Python cell. In both the left and right panels below we correctly classify and color the entire space as described above. In the left panel we show the three original two-class linear decision boundaries, and on the right we show the multi-class decision boundary (in black) created by fusing these individual boundaries using the fusion rule. This black multi-class boundary arises at points where the fusion rule does not provide a unique solution - i.e., at points where two or more classifiers provide the maximum evaluation.
The boundary resulting from the fusion rule is always piecewise-linear as with our example shown here. While the fusion rule explicitly defines this boundary it does not provide us with a closed form formula for it as with e.g., logistic regression (although one may work out a somewhat convoluted formula describing the boundary in general). In fact the piecewise-linear boundaries shown in the figures of this Section were drawn not by determining their formula but by labeling (and appropriately coloring) every point using the fusion rule.
Python implementation

Since we have already discussed how to implement several two-class classification schemes in the previous Chapter, here we will only describe an efficient way of implementing the fusion rule in Python. To take advantage of the numpy library's fast array operations we stack the trained weights from our $C$ classifiers together into a single $\left(N + 1\right) \times C$ array of the form
\begin{equation} \mathbf{W}=\begin{bmatrix} w_0^{(0)} & w_0^{(1)} & \cdots & w_0^{(C-1)} \\ \mathbf{w}_{\mathstrut}^{(0)} & \mathbf{w}_{\mathstrut}^{(1)} & \cdots & \mathbf{w}_{\mathstrut}^{(C-1)} \end{bmatrix}. \end{equation}Here the bias and normal vector of the $c^{th}$ classifier have been stacked on top of one another and make up the $c^{th}$ column of the array. Now let us extend our model notation to also denote the evaluation of our $C$ individual linear models as
\begin{equation} \text{model}\left(\mathbf{x},\mathbf{W}\right) = \begin{bmatrix} w_0^{(0)} + \mathbf{x}^T\mathbf{w}_{\mathstrut}^{(0)} & w_0^{(1)} + \mathbf{x}^T\mathbf{w}_{\mathstrut}^{(1)} & \cdots & w_0^{(C-1)} + \mathbf{x}^T\mathbf{w}_{\mathstrut}^{(C-1)} \end{bmatrix}. \end{equation}Note the above is a $1\times C$ array containing the evaluation of each of our $C$ linear models, which can be computed for any input $\mathbf{x}$ and weight array $\mathbf{W}$. This sort of abuse of notation carries over completely in Python: precisely the same model implementation we have seen in the previous two Chapters can be used here as well. We repeat this code below - notice that the input w to this Python function is indeed our array of weights $\mathbf{W}$.
import numpy as np

# compute C linear combinations of input point, one per classifier
def model(x,w):
    # tack a 1 onto the top of each input point all at once
    o = np.ones((1,np.shape(x)[1]))
    x = np.vstack((o,x))

    # compute linear combination and return
    a = np.dot(x.T,w)
    return a
Notice that we can now write the fusion rule equivalently in this extended model notation as
\begin{equation} y = \underset{j \,=\, 0,...,C-1}{\text{argmax}} \,\,\, \left[ \text{model}\left(\mathbf{x},\mathbf{W}\right) \right]_{j} \end{equation}Hence - based on the Python function for our model above - we can implement the fusion rule quite simply as shown below.
# the fusion rule
def fusion_rule(x,w):
return np.argmax(model(x,w))
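As written this function classifies one input point at a time. A quick usage sketch - with W a hypothetical $\left(N+1\right) \times C$ array of trained weights and a two-dimensional input point - might look as follows.

# classify a single two-dimensional input point, stored as an N x 1 array
# (W is a hypothetical (N+1) x C array of trained weights)
x_sample = np.array([[0.5],[-0.25]])
predicted_label = fusion_rule(x_sample,W)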
In this example we quickly apply the OvA algorithm derived above to a toy dataset with $C=4$ classes shown below.
Note with this dataset that each class is not linearly separable from the remainder of the data. This is no matter - the OvA framework still produces an appropriate multi-class boundary. We solve our $C$ two-class subproblems.
With our subproblems solved we can use the fusion rule to classify our entire input space. The Python cell below shows this, printing out two panels with all points in the space colored according to the fusion rule. The left panel also shows the individual learned two-class classifiers, while the right panel shows the final multi-class decision boundary, which does a fine job of distinguishing the four classes.
In this example we illustrate how the fusion rule driving OvA - and multiclass classification in general - creates a multilevel step model (or regressor) to represent a given dataset. We begin with a single-input example with $C = 4$ classes, as shown below.
With each subproblem solved we combine these learners (employing the final trained weights in each instance) using the fusion rule, which provides a good-fitting multilevel step function (or regressor) for this dataset. Here - with such a low-dimensional dataset - we can easily visualize the multilevel step regressor provided by the fusion rule by evaluating a finely sampled set of points over the input range of the dataset. This is done below, with the fusion rule model colored blue.
The fact that - geometrically speaking from the perspective of regression - the fusion rule produces a multilevel step function should not be too surprising given that the underlying model for two-class classification is a two-step function (as detailed in Section 9.1). With multiple classes we need a step function with more than two steps!
Notice here that there is no way for us to achieve a perfect classification result with OvA, because the subproblems for classes $2$ and $3$ result in two-class problems that cannot be perfectly represented using a linear two-step function. Below we visualize each subproblem dataset, followed by the same datasets with the corresponding two-class step function fit to each. In the case of classes $2$ and $3$ we can see that - since there are many more points in the 'not $c$' class - each learned step function simply represents this majority class alone. Thus when combined via the fusion rule we do not achieve perfect classification.
We can also visualize the multilevel step function defined by the fusion rule using the two-input dataset employed in detailing the OvA algorithm above. Here we have $C=3$ classes and - as we have been visualizing such data throughout - each class is colored a distinct color.
Again we will train each individual two-class classifier using a few steps of Newton's method.
By evaluating the fusion rule over a fine set of points spanning the input range of this dataset we can visualize the corresponding multilevel step function, which we do in the left panel below. Notice here that the piecewise-linear boundaries we have been plotting - when viewing the dataset and trained framework 'from above' as shown in the right panel below - are precisely where this multilevel step function changes values. In other words, the multiclass boundary always occurs where this multilevel step function changes levels.
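A sketch of how such a fine evaluation might be produced is shown below - here W is a hypothetical array of trained weights, and an input range of $[-1,1]$ along each dimension is assumed purely for illustration.

import numpy as np

# evaluate the fusion rule over a fine grid spanning the (assumed) input range
s = np.linspace(-1,1,200)
x1,x2 = np.meshgrid(s,s)
grid = np.vstack((x1.ravel(),x2.ravel()))    # 2 x 40000 array of grid points

# classify each grid point individually via the fusion rule; reshaping the
# resulting labels recovers the multilevel step function over the grid
labels = np.array([fusion_rule(grid[:,p:p+1],W) for p in range(grid.shape[1])])
labels = labels.reshape(x1.shape)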
Once trained we can compute predicted labels for our training set by simply evaluating each input via the fusion rule. Taking the input of the $p^{th}$ point $\left(\mathbf{x}_p,\,y_p\right)$ we use the fusion rule to produce a predicted output $\hat y_p$ as
\begin{equation} \hat y_p = \underset{j\,=\, 0,...,C-1}{\text{argmax}} \,\,\,w_0^{(\,j)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(\,j)} \end{equation}and then simply compare the actual and predicted labels to understand if our prediction is correct. One way to write this comparison is via $\left | \text{sign}\left(\hat y_p - \overset{\mathstrut}{y_p} \right) \right |$, which is $0$ if the prediction matches the true label and $+1$ otherwise. We can then write the total number of misclassifications on our training set as
\begin{equation} \text{number of misclassifications on training set } = \sum_{p = 1}^{P} \left | \text{sign}\left(\hat y_p - \overset{\mathstrut}{y_p}\right) \right | \end{equation}with the accuracy being then calculated as
\begin{equation} \text{accuracy of learned classifier} = 1 - \frac{1}{P} \sum_{p = 1}^{P} \left | \text{sign}\left(\hat y_p - \overset{\mathstrut}{y_p}\right) \right | \end{equation}
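A short numpy sketch of these two quantities - assuming y holds the true labels and y_hat the fusion-rule predictions for all $P$ training points (both hypothetical names) - is given below.

import numpy as np

# count misclassifications and the corresponding accuracy on the training set
miscount = np.sum(np.abs(np.sign(y_hat - y)))
accuracy = 1 - miscount/y.size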