Linear Supervised Learning Series

Part 6: Linear multi-class classification

Here we discuss a popular alternative to OvA multi-class classification in which we again learn $C$ two-class classifiers (and again employ the fusion rule), but train them simultaneously rather than independently as with OvA.


1. Multi-class perceptron and logistic regression

1.1 The multi-class perceptron

As with OvA, we deal with an arbitrary multi-class dataset $\left\{ \left(\mathbf{x}_{p},\,y_{p}\right)\right\} _{p=1}^{P}$ consisting of $C$ distinct classes of data with label values $y_{p}\in\left\{ 1,2,...,C\right\}$.

Recall: the fusion rule gives us predicted labels for our dataset, the $p^{th}$ of which, $\hat y_p$, is given as

\begin{equation} \hat y_p = \underset{j=1,...,C}{\text{argmax}} \,\,\,w_0^{(\,j)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(\,j)} \end{equation}

Also recall: the normal vectors here are assumed to all have unit length, i.e., $\left \Vert \mathbf{w}_{\mathstrut}^{(\,j)} \right \Vert_2^2 = 1$ for all $j$.

For the $p^{th}$ point, ideally we want this prediction to match its true label, i.e., $\hat y_p = y_p$, so that we have

\begin{equation} w_0^{(y_p)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(y_p)} = \underset{j=1,...,C}{\text{max}} \,\,\,w_0^{(\,j)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(\,j)} \end{equation}

Remember: geometrically this simply says that the signed distance from the point $\mathbf{x}_p$ to its own class's decision boundary is at least as large as its signed distance to every other class's decision boundary.

Subtracting $ w_0^{(y_p)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(y_p)}$ from both sides we have

\begin{equation} \left(\underset{j=1,...,C}{\text{max}} \,\,\,w_0^{(\,j)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(\,j)}\right) - \left(w_0^{(y_p)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(y_p)}\right) = 0 \end{equation}

Ideally, we want this to be true for every $\mathbf{x}_p$. But regardless, we always have that

\begin{equation} \left(\underset{j=1,...,C}{\text{max}} \,\,\,w_0^{(\,j)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(\,j)}\right) - \left(w_0^{(y_p)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(y_p)}\right) \geq 0 \end{equation}

AHA: summing these expressions over all points gives a proper cost function whose minimization yields optimal classifier weights.

\begin{equation} g\left(w_0^{(1)},\,\mathbf{w}_{\mathstrut}^{(1)},...,w_0^{(C)},\,\mathbf{w}_{\mathstrut}^{(C)} \right) = \sum_{p = 1}^P \left[\,\left(\underset{j=1,...,C}{\text{max}} \,\,\,w_0^{(\,j)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(\,j)}\right) - \left(w_0^{(y_p)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(y_p)}\right) \right] \end{equation}
  • This is called the multi-class perceptron cost.
  • This cost is a direct generalization of the two-class perceptron, and reduces to it when $C=2$ (a brief sketch of this reduction is given just below).
  • Like the two-class perceptron, this cost is also convex.
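
To see why the reduction to the two-class perceptron holds (the shorthand below is introduced just for this sketch), write $s_j = w_0^{(\,j)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(\,j)}$. When $C = 2$ each summand becomes

\begin{equation} \underset{j=1,2}{\text{max}} \,\,\,s_j - s_{y_p} = \begin{cases} \text{max}\left(0,\,s_2 - s_1\right) & y_p = 1\\ \text{max}\left(0,\,s_1 - s_2\right) & y_p = 2 \end{cases} \end{equation}

so that, defining a single boundary via $b_0 = w_0^{(2)} - w_0^{(1)}$ and $\mathbf{b} = \mathbf{w}^{(2)} - \mathbf{w}^{(1)}$ and re-labeling the two classes as $-1$ and $+1$, each summand takes the familiar two-class perceptron form $\text{max}\left(0,\,-y_p\left(b_0 + \mathbf{x}_p^T\mathbf{b}\right)\right)$.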

Note: in minimizing the perceptron cost we should - at least formally - subject it to the constraints that all normal vectors are unit length

\begin{equation} \begin{aligned} \underset{w_0^{(1)},\,\mathbf{w}_{\mathstrut}^{(1)},\,...,\,w_0^{(C)},\,\mathbf{w}_{\mathstrut}^{(C)}}{\,\,\,\,\,\,\,\,\,\,\,\,\mbox{minimize}\,\,\,} & \,\,\,\, \sum_{p = 1}^P \left[\,\left(\underset{j=1,...,C} {\text{max}} \,\,\,w_0^{(\,j)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(\,j)} \right) - \left(w_0^{(y_p)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(y_p)}\right) \right]\\ \mbox{subject to}\,\,\, & \,\,\,\,\, \left \Vert \mathbf{w}_{\mathstrut}^{(\,j)} \right \Vert_2^2 = 1, \,\,\,\,\,\, j=1,...,C \end{aligned} \end{equation}
  • This constrained optimization problem can be solved directly via, e.g., projected gradient descent.
  • However it is more commonplace to see this problem approximately solved by relaxing the constraints.

The unconstrained regularized form of the multi-class perceptron

\begin{equation} \underset{w_0^{(1)},\,\mathbf{w}_{\mathstrut}^{(1)},...,w_0^{(C)},\,\mathbf{w}_{\mathstrut}^{(C)}}{\text{minimize}} \,\, \sum_{p = 1}^P \left[\,\left(\underset{j=1,...,C} {\text{max}} \,\,\,w_0^{(\,j)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(\,j)} \right) - \left(w_0^{(y_p)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(y_p)}\right) \right] + \lambda \sum_{j = 1}^{C} \left \Vert \mathbf{w}_{\mathstrut}^{(\,j)} \right \Vert_2^2 \end{equation}
  • The unconstrained form is convex.
  • It does not quite match the original constrained formulation as regularizing all normal vectors together will not necessarily guarantee that $\left \Vert \mathbf{w}_{\mathstrut}^{(\,j)} \right \Vert_2^2 = 1$ for all $j$.
  • Nonetheless it will generally force the lengths of all normal vectors to behave well, e.g., disallowing one normal vector to grow arbitrarily large while another shrinks to almost nothing.
  • $\lambda$ is typically set to a small value, e.g., $10^{-3}$ or smaller.

Example 1: Multi-class perceptron

In this example we use unnormalized gradient descent to minimize the regularized multi-class perceptron cost defined above over a toy dataset with $C=3$ classes, used previously in deriving OvA.

In [18]:
# note: numpy (as np) and the mlrefined 'superlearn' demo library are assumed to have
# been imported in an earlier cell of the full notebook

# load in dataset
data = np.loadtxt('../../mlrefined_datasets/superlearn_datasets/3class_data.csv',delimiter = ',')

# create an instance of the multi-class demo
demo = superlearn.multiclass_illustrator.Visualizer(data)

# visualize dataset
demo.show_dataset()

One is free to implement the cost function here in a number of ways. Note however that in the particular implementation shown here the weights from all $C$ classifiers are input as an $N + 1$ by $C$ array of the form

\begin{equation} \mathbf{W}=\left[\begin{array}{cccc} w_{0}^{(1)} & w_{0}^{(2)} & \cdots & w_{0}^{(C)}\\ \mathbf{w}^{(1)} & \mathbf{w}^{(2)} & \cdots & \mathbf{w}^{(C)} \end{array}\right] \end{equation}

where the bias and normal vector of the $c^{th}$ classifier have been stacked on top of one another and made the array's $c^{th}$ column. Also note that the quantity $w_0^{(\,j)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(\,j)}$ is pre-computed for all points prior to the summation and stored in a single array.
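
For concreteness, here is a small shape check of this pre-computation (a sketch with arbitrary dimensions and randomly drawn values, assuming the input is stored as an $N \times P$ array whose columns are the individual points, as in the implementation below).

In [ ]:
# shape check with illustrative dimensions (the variable names are just for this sketch)
N, P, C = 2, 5, 3
x_toy = np.random.randn(N, P)        # each column is one input point
W_toy = np.random.randn(N + 1, C)    # column c stacks the bias of classifier c on top of its normal vector

# pre-compute w_0^(j) + x_p^T w^(j) for every point p and classifier j at once
all_evals = W_toy[0,:] + np.dot(x_toy.T, W_toy[1:,:])
print(all_evals.shape)               # (P, C): one row per point, one column per classifier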

In [20]:
# multiclass perceptron regularized by the summed length of all normal vectors
lam = 10**-4  # our regularization parameter
def multiclass_perceptron(W):        
    # pre-compute evaluations of all classifiers on all points
    all_evals = W[0,:] + np.dot(x.T,W[1:,:])

    # compute perceptron cost
    cost = 0
    for p in range(len(y)):
        # pluck out current true label
        y_p = int(y[p][0]) - 1    # subtract one due to python's zero-based indexing

        # update cost summand
        cost += np.max(all_evals[p,:]) - all_evals[p,y_p]

    # return cost with regularizer added
    return cost + lam*np.linalg.norm(W[1:,:],'fro')**2
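
The plotting cells in this and the following examples rely on a weight history w_hist produced by an optimization run that is not shown in this excerpt. Below is a minimal sketch of how such a history could be produced, assuming the autograd package for automatic differentiation; the initialization, steplength, and iteration count here are illustrative rather than the notebook's original settings.

In [ ]:
# note: for autograd to differentiate the cost functions in this notebook, numpy should
# be imported as 'import autograd.numpy as np' wherever those costs are defined
import autograd.numpy as np
from autograd import grad

def gradient_descent(cost, w_init, alpha, max_its):
    # fixed-steplength gradient descent that records the full weight history
    gradient = grad(cost)
    w = w_init
    w_hist = [w]
    for k in range(max_its):
        w = w - alpha*gradient(w)
        w_hist.append(w)
    return w_hist

# illustrative run: minimize the regularized multiclass perceptron cost above
w_init = 0.1*np.random.randn(x.shape[0] + 1, 3)    # N+1 by C array of initial weights
w_hist = gradient_descent(multiclass_perceptron, w_init, alpha = 10**-2, max_its = 500)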
In [32]:
# plot classification of space: individual learned classifiers (left panel), joint decision boundary (middle panel), and cost-function history (right panel)
demo.show_complete_coloring(w_hist,show_cost = True, cost = multiclass_perceptron)

1.2 The smooth softmax approximation to the multi-class perceptron cost

Recall that the softmax function

\begin{equation} \text{soft}\left(s_{1},...,s_{C}\right)=\text{log}\left(\sum_{j = 1}^{C} e^{s_{j}}\right) \end{equation}

is a close and smooth approximation to the maximum of $C$ scalar numbers $s_{1},...,s_{C}$, i.e.,

\begin{equation} \text{max}\left(s_{1},...,s_{C}\right) \approx \text{soft}\left(s_{1},...,s_{C}\right) \end{equation}
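
A quick numerical illustration of this approximation (the values of $s_1,...,s_C$ below are arbitrary):

In [ ]:
s = np.array([1., 4., 4.5])               # arbitrary illustrative values
print(np.max(s))                          # 4.5
print(np.log(np.sum(np.exp(s))))          # approximately 4.99 -- close to, and never less than, the max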

Let's replace the max function with its softmax approximation in each summand of the multi-class perceptron cost

\begin{equation} g\left(w_0^{(1)},\,\mathbf{w}_{\mathstrut}^{(1)},...,w_0^{(C)},\,\mathbf{w}_{\mathstrut}^{(C)} \right) = \sum_{p = 1}^P \left[\,\left(\underset{j=1,...,C}{\text{max}} \,\,\,w_0^{(\,j)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(\,j)}\right) - \left(w_0^{(y_p)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(y_p)}\right) \right] \end{equation}

This is what we get:

\begin{equation} g\left(w_0^{(1)},\,\mathbf{w}_{\mathstrut}^{(1)},...,w_0^{(C)},\,\mathbf{w}_{\mathstrut}^{(C)} \right) = \sum_{p = 1}^P \left[\text{log}\left( \sum_{j = 1}^{C} e^{ w_0^{(j)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(j)}} \right) - \left( w_0^{(y_p)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(y_p)}\right)\right] \end{equation}
  • This is called the multi-class softmax cost
  • It also reduces to the two-class version when $C = 2$
  • It is not only convex but - unlike the multi-class perceptron - it has infinitely many smooth derivatives, hence we can use Newton's method (in addition to gradient descent) to minimize it.

The multi-class softmax cost goes by many names, e.g., softmax regression and multi-class logistic regression.

Sometimes it is more convenient to write it equivalently - using basic properties of the log function - as

\begin{equation} g\left(w_0^{(1)},\,\mathbf{w}_{\mathstrut}^{(1)},...,w_0^{(C)},\,\mathbf{w}_{\mathstrut}^{(C)} \right) = -\sum_{p = 1}^P \text{log}\left(\frac{ e^{ w_0^{(y_p)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(y_p)}} }{ \sum_{j = 1}^{C} e^{ w_0^{(j)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(j)}} }\right) \end{equation}

Implementation warning

  • Care must be taken when implementing any cost function - or mathematical expression in general - involving the exponential function $e^{\left(\cdot\right)}$ in Python.
  • By nature the exponential function grows large very rapidly causing undesired 'overflow' issues even with moderate-sized exponents, e.g., $e^{1000}$. Large numbers like this cannot be stored explicitly on the computer and so are represented symbolically as $\infty$.
  • Additionally, the division of two such overflowed numbers - which can occur when evaluating the summands of the multi-class softmax cost written in the ratio form above - is computed and stored as a NaN (not a number), causing severe numerical stability issues (a short demonstration follows this list).
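
A short demonstration of the problem (an illustrative snippet; numpy prints overflow and invalid-value warnings when it is run):

In [ ]:
a = np.exp(1000.)     # overflows: stored as inf
b = np.exp(1001.)     # also inf
print(b/a)            # nan -- the ratio of two overflowed values is 'not a number'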

Ways to fix it

  • Normalize the data to fall within a relatively small range, and regularize the weights, e.g., via their $\ell_2$ norm, to punish/prevent large weight values.
  • Even then, in practice we can still run into overflow issues immediately after initialization, especially when the input is high-dimensional.
  • Implement your own version of the exponential function that caps the maximum value it can take, as in the cell below.
In [5]:
# capped exponential: exponents larger than G are clipped before exponentiating to avoid
# overflow; np.minimum makes this work elementwise on arrays as well as on scalars
def my_exp(x, G = 500):
    return np.exp(np.minimum(x, G))

Example 2: Multi-class softmax on a toy dataset with $C=3$ classes

In this example we run the multi-class softmax classifier on the same dataset used in the previous example, first using unnormalized gradient descent and then Newton's method.

In [9]:
# multiclass softmax regularized by the summed length of all normal vectors
lam = 10**-3  # our regularization parameter
def multiclass_softmax(W):        
    # pre-compute evaluations of all classifiers on all points
    all_evals = W[0,:] + np.dot(x.T,W[1:,:])

    # compute softmax cost
    cost = 0
    for p in range(len(y)):
        # pluck out current true label
        y_p = int(y[p][0]) - 1    # subtract one due to python's zero-based indexing

        # update cost summand
        cost += np.log(np.sum(np.exp(all_evals[p,:]))) - all_evals[p,y_p]

    # return cost with regularizer added
    return cost + lam*np.linalg.norm(W[1:,:],'fro')**2
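
Note that this implementation calls np.exp directly and so is exposed to the overflow issue described above. One possible guard - a sketch, not necessarily how the original notebook handles it - is to substitute the capped exponential my_exp defined earlier inside the summand (x, y, and lam are the same globals used above):

In [ ]:
# variant of the multiclass softmax cost using the capped exponential to guard against overflow
def multiclass_softmax_capped(W):        
    all_evals = W[0,:] + np.dot(x.T,W[1:,:])
    cost = 0
    for p in range(len(y)):
        y_p = int(y[p][0]) - 1
        # my_exp clips exponents larger than G before exponentiating
        cost += np.log(np.sum(my_exp(all_evals[p,:]))) - all_evals[p,y_p]
    return cost + lam*np.linalg.norm(W[1:,:],'fro')**2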

Now we minimize this cost function using gradient descent, run for $500$ iterations with a fixed steplength value $\alpha = 10^{-2}$.

Gradient descent

In [35]:
# plot classification of space: individual learned classifiers (left panel), joint decision boundary (middle panel), and cost-function history (right panel)
demo.show_complete_coloring(w_hist,show_cost = True, cost = multiclass_softmax)

Example 3: Multi-class classification from a regression perspective

In [16]:
# plot the trained multi-class classifier from a regression perspective (surface view of the fit)
demo.show_surface_fit(w_hist,view = [15,115])

Example 4: Multi-class softmax on a toy dataset with $C = 4$ classes

In [10]:
# load in dataset
data = np.loadtxt('../../mlrefined_datasets/superlearn_datasets/4class_data.csv',delimiter = ',')

# create an instance of the multi-class demo
demo = superlearn.multiclass_illustrator.Visualizer(data)

# visualize dataset
demo.show_dataset()
In [17]:
# plot classification of space: individual learned classifiers (left panel), joint decision boundary (middle panel), and cost-function history (right panel)
demo.show_complete_coloring(w_hist,show_cost = True, cost = multiclass_softmax)

Example 5: Comparing cost function and counting cost values

Below we compare the number of misclassifications against the regularized multi-class softmax cost value (with $\lambda = 10^{-3}$) over three runs of unnormalized gradient descent, each using a steplength parameter $\alpha = 10^{-2}$.

In [18]:
# load in dataset
data = np.loadtxt('../../mlrefined_datasets/superlearn_datasets/3class_data.csv',delimiter = ',')

# create an instance of the multi-class demo
demo = superlearn.multiclass_illustrator.Visualizer(data)

# run demo
demo.compare_to_counting(num_runs = 3)

1.4 Counting misclassifications and the accuracy of a multi-class classifier

Similar to OvA! Once trained we can compute predicted labels for our training set by simply evaluating each input via the fusion rule.

\begin{equation} \hat y_p = \underset{j=1,...,C}{\text{argmax}} \,\,\,w_0^{(j)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(j)} \end{equation}
\begin{equation} \text{number of misclassifications on training set } = \sum_{p = 1}^{P} \left | \text{sign}\left(\hat y_p - \overset{\mathstrut}{y_p}\right) \right | \end{equation}
\begin{equation} \text{accuracy of learned classifier} = 1 - \frac{1}{P} \sum_{p = 1}^{P} \left | \text{sign}\left(\hat y_p - \overset{\mathstrut}{y_p}\right) \right | \end{equation}
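
A small sketch of these computations, assuming W holds learned weights laid out as in the cost implementations above (e.g., W = w_hist[-1]) and x, y are the same globals used there:

In [ ]:
# fusion-rule predictions: evaluate every classifier on every point and take the argmax
all_evals = W[0,:] + np.dot(x.T, W[1:,:])
y_hat = np.argmax(all_evals, axis = 1) + 1      # add one to map back to labels 1,...,C

# count misclassifications and compute accuracy
misclassifications = np.sum(y_hat != y.flatten())
accuracy = 1 - misclassifications/float(y.size)
print(misclassifications, accuracy)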

1.5 Multi-class softmax from a probabilistic perspective

Similarly to the two-class case, the data independence assumption allows us to simplify the joint likelihood as

\begin{equation} {\cal L}=\prod_{p=1}^{P}{\cal P}\left(y=y_{p}\,|\,\mathbf{x}_{p},\mathbf{W}\right) \end{equation}

where

\begin{equation} \mathbf{W}=\left[\begin{array}{cccc} w_{0}^{(1)} & w_{0}^{(2)} & \cdots & w_{0}^{(C)}\\ \mathbf{w}^{(1)} & \mathbf{w}^{(2)} & \cdots & \mathbf{w}^{(C)} \end{array}\right] \end{equation}

We again connect the probability of a point $\mathbf{x}_p$ belonging to a certain class to the signed distance from $\mathbf{x}_p$ to that class' decision boundary.

Assuming all classifiers' normal vectors are normalized to have unit length

\begin{equation} {\cal P}\left(y=1\,|\,\mathbf{x}_{p},\mathbf{W}\right)\propto w_0^{(1)}+\mathbf{x}_p^T\mathbf{w}^{(1)}\\ {\cal P}\left(y=2\,|\,\mathbf{x}_{p},\mathbf{W}\right)\propto w_0^{(2)}+\mathbf{x}_p^T\mathbf{w}^{(2)}\\ \vdots\\ {\cal P}\left(y=C\,|\,\mathbf{x}_{p},\mathbf{W}\right)\propto w_0^{(C)}+\mathbf{x}_p^T\mathbf{w}^{(C)} \end{equation}

These signed distances however can be negative and hence cannot be used immediately as class probabilities.

We can resolve this issue by passing them through an always-positive and monotonically-increasing function such as $e^{\left(\cdot \right)}$ to get

\begin{equation} {\cal P}\left(y=1\,|\,\mathbf{x}_{p},\mathbf{W}\right)\propto e^{w_0^{(1)}+\mathbf{x}_p^T\mathbf{w}^{(1)}}\\ {\cal P}\left(y=2\,|\,\mathbf{x}_{p},\mathbf{W}\right)\propto e^{w_0^{(2)}+\mathbf{x}_p^T\mathbf{w}^{(2)}}\\ \vdots\\ {\cal P}\left(y=C\,|\,\mathbf{x}_{p},\mathbf{W}\right)\propto e^{w_0^{(C)}+\mathbf{x}_p^T\mathbf{w}^{(C)}} \end{equation}

One issue still remains: $e^{w_0^{(c)}+\mathbf{x}_p^T\mathbf{w}^{(c)}}$ can potentially be larger than one, and hence is not a valid probability on its own.

Luckily there is a simple fix for this issue as well: divide all values by $\sum_{j=1}^{C} e^{ w_0^{(j)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(j)}}$

\begin{equation} {\cal P}\left(y=1\,|\,\mathbf{x}_{p},\mathbf{W}\right) = \frac{e^{w_0^{(1)}+\mathbf{x}_p^T\mathbf{w}^{(1)}}}{\sum_{j=1}^{C} e^{ w_0^{(j)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(j)}}}\\ {\cal P}\left(y=2\,|\,\mathbf{x}_{p},\mathbf{W}\right) = \frac{e^{w_0^{(2)}+\mathbf{x}_p^T\mathbf{w}^{(2)}}}{\sum_{j=1}^{C} e^{ w_0^{(j)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(j)}}}\\ \vdots\\ {\cal P}\left(y=C\,|\,\mathbf{x}_{p},\mathbf{W}\right) = \frac{e^{w_0^{(C)}+\mathbf{x}_p^T\mathbf{w}^{(C)}}}{\sum_{j=1}^{C} e^{ w_0^{(j)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(j)}}} \end{equation}

Notice that all right hand side values now lie between $0$ and $1$, and moreover, they all add up to $1$ making them mathematically valid values to represent the class probabilities.
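
As a concrete sketch (the function name is illustrative; W is a learned weight array laid out as above and x_p is a single input stored as a length-$N$ vector):

In [ ]:
def class_probabilities(W, x_p):
    # evaluate w_0^(j) + x_p^T w^(j) for every class j
    evals = W[0,:] + np.dot(x_p, W[1:,:])
    # exponentiate and normalize; my_exp (defined above) could be substituted to guard against overflow
    exps = np.exp(evals)
    return exps / np.sum(exps)

Since the exponential is monotonically increasing, the class with the largest probability here is exactly the class selected by the fusion rule.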

Plugging

\begin{equation} {\cal P}\left(y=y_p\,|\,\mathbf{x}_{p},\mathbf{W}\right) = \frac{e^{w_0^{(y_p)}+\mathbf{x}_p^T\mathbf{w}^{(y_p)}}}{\sum_{j=1}^{C} e^{ w_0^{(j)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(j)}}} \end{equation}

into

\begin{equation} {\cal L}=\prod_{p=1}^{P}{\cal P}\left(y=y_{p}\,|\,\mathbf{x}_{p},\mathbf{W}\right) \end{equation}

and taking the negative logarithm of the resulting function gives the multi-class negative log-likelihood loss

\begin{equation} g\left(\mathbf{W} \right) = -\sum_{p = 1}^P \text{log}\left(\frac{ e^{ w_0^{(y_p)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(y_p)}} }{ \sum_{j = 1}^{C} e^{ w_0^{(j)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(j)}} }\right) \end{equation}

This is precisely the multi-class softmax cost we derived previously!

1.6 Which multi-class classification scheme works best?

  • One-versus-all (OvA) and multi-class softmax classifiers perform similarly well in practice, having both been built using the same fusion rule.
  • The OvA classifier is naturally parallelizable, as each linear separator can be learned independently of the rest.
  • The multi-class softmax scheme provides a more commonly used framework for performing nonlinear multi-class classification using neural networks.