Here we discuss a popular alternative to OvA multi-class classification in which we again learn $C$ two-class classifiers and employ the fusion rule, but train the classifiers simultaneously rather than independently as with OvA.
As with OvA, we deal with an arbitrary multi-class dataset $\left\{ \left(\mathbf{x}_{p},\,y_{p}\right)\right\} _{p=1}^{P}$ consisting of $C$ distinct classes of data with label values $y_{p}\in\left\{ 1,2,...,C\right\}$.
Recall: the fusion rule gives us predicted labels for our dataset, the $p^{th}$ of which, $\hat y_p$, is given as
\begin{equation} \hat y_p = \underset{j=1,...,C}{\text{argmax}} \,\,\,w_0^{(\,j)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(\,j)} \end{equation}Also recall: the normal vectors here are assumed to all have unit length, i.e., $\left \Vert \mathbf{w}_{\mathstrut}^{(\,j)} \right \Vert_2^2 = 1$ for all $j$.
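As a quick illustration of the fusion rule in code, here is a minimal sketch assuming the inputs are stored one per column in an $N \times P$ array x, and the biases and normal vectors of the $C$ classifiers are stacked as columns of an $\left(N+1\right) \times C$ array W (the layout used in the implementation further below).

# fusion rule: label each point by its maximum classifier evaluation (a sketch)
import numpy as np

def fusion_rule(x, W):
    all_evals = W[0,:] + np.dot(x.T, W[1:,:])    # P x C array of classifier evaluations
    return np.argmax(all_evals, axis = 1) + 1    # add 1 so labels lie in {1,...,C}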
For the $p^{th}$ point, ideally we want this prediction to match its true label, i.e., $\hat y_p = y_p$, so that we have
\begin{equation} w_0^{(y_p)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(y_p)} = \underset{j=1,...,C}{\text{max}} \,\,\,w_0^{(\,j)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(\,j)} \end{equation}Remember: geometrically this simply says that the (signed) distance from the point $\mathbf{x}_p$ to its own class's decision boundary is at least as great as its distance from every other class's.
Subtracting $ w_0^{(y_p)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(y_p)}$ from both sides we have
\begin{equation} \left(\underset{j=1,...,C}{\text{max}} \,\,\,w_0^{(\,j)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(\,j)}\right) - \left(w_0^{(y_p)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(y_p)}\right) = 0 \end{equation}Ideally, we want this to be true for every $\mathbf{x}_p$. But regardless, we always have that
\begin{equation} \left(\underset{j=1,...,C}{\text{max}} \,\,\,w_0^{(\,j)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(\,j)}\right) - \left(w_0^{(y_p)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(y_p)}\right) \geq 0 \end{equation}AHA: summing this expression over all points gives a proper cost function to minimize in order to find optimal classifier weights.
Note: in minimizing this perceptron cost we should - at least formally - subject it to the constraint that all normal vectors have unit length.
The unconstrained, regularized form of the multi-class perceptron cost is
\begin{equation} \underset{w_0^{(1)},\,\mathbf{w}_{\mathstrut}^{(1)},...,w_0^{(C)},\,\mathbf{w}_{\mathstrut}^{(C)}}{\text{minimize}} \,\, \sum_{p = 1}^P \left[\,\left(\underset{j=1,...,C} {\text{max}} \,\,\,w_0^{(\,j)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(\,j)} \right) - \left(w_0^{(y_p)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(y_p)}\right) \right] + \lambda \sum_{j = 1}^{C} \left \Vert \mathbf{w}_{\mathstrut}^{(\,j)} \right \Vert_2^2 \end{equation}In this example we employ unnormalized gradient descent to minimize the regularized multi-class perceptron cost defined above over a toy dataset with $C=3$ classes, used previously in deriving OvA.
# load in dataset
import numpy as np
data = np.loadtxt('../../mlrefined_datasets/superlearn_datasets/3class_data.csv',delimiter = ',')

# create an instance of the multi-class visualizer
# ('superlearn' refers to the mlrefined demo library imported elsewhere in the notebook)
demo = superlearn.multiclass_illustrator.Visualizer(data)

# visualize dataset
demo.show_dataset()
One is free to implement the cost function here in a number of ways. Note however that in the particular implementation shown here the weights from all $C$ classifiers are input as an $N + 1$ by $C$ array of the form
\begin{equation} \mathbf{W}=\left[\begin{array}{cccc} w_{0}^{(1)} & w_{0}^{(2)} & \cdots & w_{0}^{(C)}\\ \mathbf{w}^{(1)} & \mathbf{w}^{(2)} & \cdots & \mathbf{w}^{(C)} \end{array}\right] \end{equation}where the bias and normal vector of the $c^{th}$ classifier have been stacked on top of one another and made the array's $c^{th}$ column. Also note that the quantity $w_0^{(\,j)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(\,j)}$ is pre-computed for all points prior to the summation and stored in a single array.
# multiclass perceptron cost regularized by the summed length of all normal vectors
lam = 10**-4  # our regularization parameter

def multiclass_perceptron(W):
    # pre-compute evaluations of all C classifiers on all points
    all_evals = W[0,:] + np.dot(x.T,W[1:,:])

    # compute perceptron cost, summing over all points
    cost = 0
    for p in range(len(y)):
        # pluck out current true label
        y_p = int(y[p][0]) - 1    # subtract one due to Python's zero-based indexing

        # update cost summand: maximum evaluation minus true-class evaluation
        cost += np.max(all_evals[p,:]) - all_evals[p,y_p]

    # return cost with regularizer added
    return cost + lam*np.linalg.norm(W[1:,:],'fro')**2
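The weight history w_hist visualized next is produced by an unnormalized gradient descent run on this cost. A minimal sketch of such a run is given below; it assumes the autograd library for automatic differentiation, a hypothetical gradient_descent helper of our own, that the global arrays x (inputs, one point per column) and y (labels) used by the cost above have been extracted from data, and that the toy data has $N = 2$ input dimensions - the steplength and iteration count are illustrative.

# a minimal sketch of unnormalized gradient descent producing w_hist (illustrative settings)
import autograd.numpy as np   # re-bind np to autograd's wrapped numpy so the cost above is automatically differentiable
from autograd import grad

def gradient_descent(g, w, alpha, max_its):
    gradient = grad(g)                 # autograd-computed gradient of the cost g
    w_hist = [w]                       # container for the weight history
    for k in range(max_its):
        w = w - alpha*gradient(w)      # standard (unnormalized) gradient descent step
        w_hist.append(w)
    return w_hist

# initialize weights and run
w_init = 0.1*np.random.randn(3,3)      # (N+1) x C array for N = 2 inputs, C = 3 classes
w_hist = gradient_descent(multiclass_perceptron, w_init, alpha = 10**-2, max_its = 500)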
# plot classification of space, individual learned classifiers (left panel) and joint boundary (middle panel), and cost-function panel in the right panel
demo.show_complete_coloring(w_hist,show_cost = True, cost = multiclass_perceptron)
Recall that the softmax function
\begin{equation} \text{soft}\left(s_{1},...,s_{C}\right)=\text{log}\left(\sum_{j = 1}^{C} e^{s_{j}}\right) \end{equation}is a close and smooth approximation to the maximum of $C$ scalar numbers $s_{1},...,s_{C}$, i.e.,
\begin{equation} \text{max}\left(s_{1},...,s_{C}\right) \approx \text{soft}\left(s_{1},...,s_{C}\right) \end{equation}
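A quick numerical check of this approximation (the values below are chosen arbitrarily):

# the softmax (log-sum-exp) of a few numbers closely approximates their maximum
s = np.array([1.0, 2.0, 5.0])
print(np.max(s))                    # 5.0
print(np.log(np.sum(np.exp(s))))    # roughly 5.07 - close to the true maximum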
Let's replace the max function with the softmax in each summand of the multi-class perceptron cost
\begin{equation} g\left(w_0^{(1)},\,\mathbf{w}_{\mathstrut}^{(1)},...,w_0^{(C)},\,\mathbf{w}_{\mathstrut}^{(C)} \right) = \sum_{p = 1}^P \left[\,\left(\underset{j=1,...,C}{\text{max}} \,\,\,w_0^{(\,j)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(\,j)}\right) - \left(w_0^{(y_p)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(y_p)}\right) \right] \end{equation}This is what we get:
\begin{equation} g\left(w_0^{(1)},\,\mathbf{w}_{\mathstrut}^{(1)},...,w_0^{(C)},\,\mathbf{w}_{\mathstrut}^{(C)} \right) = \sum_{p = 1}^P \left[\text{log}\left( \sum_{j = 1}^{C} e^{ w_0^{(j)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(j)}} \right) - \left( w_0^{(y_p)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(y_p)}\right)\right] \end{equation}The multi-class softmax cost goes by many names, e.g., softmax regression and multi-class logistic regression.
Sometimes it is more convenient to write it equivalently - using basic properties of the log function - as
\begin{equation} g\left(w_0^{(1)},\,\mathbf{w}_{\mathstrut}^{(1)},...,w_0^{(C)},\,\mathbf{w}_{\mathstrut}^{(C)} \right) = -\sum_{p = 1}^P \text{log}\left(\frac{ e^{ w_0^{(y_p)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(y_p)}} }{ \sum_{j = 1}^{C} e^{ w_0^{(j)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(j)}} }\right) \end{equation}In practice the exponential function can overflow numerically when its argument is large; one simple guard is to cap the argument at some large value $G$, as in the snippet below.

# cap the argument of the exponential to avoid numerical overflow
G = 500
def my_exp(x, G):
    return np.exp(x) if x < G else np.exp(G)
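A more common remedy - not part of the snippet above, just a standard alternative worth noting - is the log-sum-exp trick: subtract the largest evaluation before exponentiating, which leaves the result mathematically unchanged but keeps every exponential bounded.

# standard log-sum-exp trick: shift by the maximum before exponentiating (a sketch)
def log_sum_exp(s):
    s_max = np.max(s)
    return s_max + np.log(np.sum(np.exp(s - s_max)))    # equals log(sum(exp(s))), computed safely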
In this example we run the multi-class softmax classifier on the same dataset used in the previous example, first using unnormalized gradient descent and then Newton's method.
# multiclass softmax cost regularized by the summed length of all normal vectors
lam = 10**-3  # our regularization parameter

def multiclass_softmax(W):
    # pre-compute evaluations of all C classifiers on all points
    all_evals = W[0,:] + np.dot(x.T,W[1:,:])

    # compute softmax cost, summing over all points
    cost = 0
    for p in range(len(y)):
        # pluck out current true label
        y_p = int(y[p][0]) - 1    # subtract one due to Python's zero-based indexing

        # update cost summand: log-sum-exp of evaluations minus true-class evaluation
        cost += np.log(np.sum(np.exp(all_evals[p,:]))) - all_evals[p,y_p]

    # return cost with regularizer added
    return cost + lam*np.linalg.norm(W[1:,:],'fro')**2
Now we minimize this cost function using gradient descent - for $500$ iterations using a fixed steplength value $\alpha = 10^{-2}$.
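Re-using the hypothetical gradient_descent helper sketched earlier, such a run could look like the following (the initialization is illustrative).

# run gradient descent on the multi-class softmax cost (illustrative initialization)
w_init = 0.1*np.random.randn(3,3)   # (N+1) x C array for N = 2 inputs, C = 3 classes
w_hist = gradient_descent(multiclass_softmax, w_init, alpha = 10**-2, max_its = 500)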
# plot classification of space, individual learned classifiers (left panel) and joint boundary (middle panel), and cost-function panel in the right panel
demo.show_complete_coloring(w_hist,show_cost = True, cost = multiclass_softmax)
# show a surface view of the learned fit (view angles set manually)
demo.show_surface_fit(w_hist,view = [15,115])
# load in dataset
data = np.loadtxt('../../mlrefined_datasets/superlearn_datasets/4class_data.csv',delimiter = ',')
# create an instance of the multi-class visualizer
demo = superlearn.multiclass_illustrator.Visualizer(data)
# visualize dataset
demo.show_dataset()
# plot classification of space, individual learned classifiers (left panel) and joint boundary (middle panel), and cost-function panel in the right panel
demo.show_complete_coloring(w_hist,show_cost = True, cost = multiclass_softmax)
Below we compare the number of misclassifications to the regularized multi-class softmax cost over three runs of unnormalized gradient descent, each using $\lambda = 10^{-3}$ and a steplength parameter $\alpha = 10^{-2}$.
# load in dataset
data = np.loadtxt('../../mlrefined_datasets/superlearn_datasets/3class_data.csv',delimiter = ',')
# create an instance of the multi-class visualizer
demo = superlearn.multiclass_illustrator.Visualizer(data)
# run demo
demo.compare_to_counting(num_runs = 3)
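The misclassification count used in this comparison can be computed directly from the fusion rule; a minimal sketch, assuming the same global x and y arrays used by the cost functions:

# count misclassifications: fuse the classifiers via argmax and compare to the true labels (a sketch)
def counting_cost(W):
    all_evals = W[0,:] + np.dot(x.T,W[1:,:])           # P x C classifier evaluations
    y_predict = np.argmax(all_evals,axis = 1) + 1      # fused predictions in {1,...,C}
    y_true = y.flatten().astype(int)                   # true labels
    return np.sum(y_predict != y_true)                 # number of misclassified points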
Similar to OvA! Once trained we can compute predicted labels for our training set by simply evaluating each input via the fusion rule.
\begin{equation} \hat y_p = \underset{j=1,...,C}{\text{argmax}} \,\,\,w_0^{(j)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(j)} \end{equation}The multi-class softmax cost can also be derived from a probabilistic, maximum likelihood perspective. Similarly to the two-class case, the assumption that the data are independent allows us to write the joint likelihood as
\begin{equation} {\cal L}=\prod_{p=1}^{P}{\cal P}\left(y=y_{p}\,|\,\mathbf{x}_{p},\mathbf{W}\right) \end{equation}where
\begin{equation} \mathbf{W}=\left[\begin{array}{cccc} w_{0}^{(1)} & w_{0}^{(2)} & \cdots & w_{0}^{(C)}\\ \mathbf{w}^{(1)} & \mathbf{w}^{(2)} & \cdots & \mathbf{w}^{(C)} \end{array}\right] \end{equation}We again connect the probability of a point $\mathbf{x}_p$ belonging to a certain class to the signed distance from $\mathbf{x}_p$ to that class' decision boundary.
Assuming all classifiers' normal vectors are normalized to have unit length
\begin{equation} {\cal P}\left(y=1\,|\,\mathbf{x}_{p},\mathbf{W}\right)\propto w_0^{(1)}+\mathbf{x}_p^T\mathbf{w}^{(1)}\\ {\cal P}\left(y=2\,|\,\mathbf{x}_{p},\mathbf{W}\right)\propto w_0^{(2)}+\mathbf{x}_p^T\mathbf{w}^{(2)}\\ \vdots\\ {\cal P}\left(y=C\,|\,\mathbf{x}_{p},\mathbf{W}\right)\propto w_0^{(C)}+\mathbf{x}_p^T\mathbf{w}^{(C)} \end{equation}These signed distances however can be negative and hence cannot be used immediately as class probabilities.
We can resolve this issue by passing them through an always-positive and monotonically-increasing function such as $e^{\left(\cdot \right)}$ to get
\begin{equation} {\cal P}\left(y=1\,|\,\mathbf{x}_{p},\mathbf{W}\right)\propto e^{w_0^{(1)}+\mathbf{x}_p^T\mathbf{w}^{(1)}}\\ {\cal P}\left(y=2\,|\,\mathbf{x}_{p},\mathbf{W}\right)\propto e^{w_0^{(2)}+\mathbf{x}_p^T\mathbf{w}^{(2)}}\\ \vdots\\ {\cal P}\left(y=C\,|\,\mathbf{x}_{p},\mathbf{W}\right)\propto e^{w_0^{(C)}+\mathbf{x}_p^T\mathbf{w}^{(C)}} \end{equation}One issue still remains: $e^{w_0^{(c)}+\mathbf{x}_p^T\mathbf{w}^{(c)}}$ can be larger than one and hence is not yet a valid probability.
Luckily there is a simple fix for this issue as well: divide all values by $\sum_{j=1}^{C} e^{ w_0^{(j)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(j)}}$
\begin{equation} {\cal P}\left(y=1\,|\,\mathbf{x}_{p},\mathbf{W}\right) = \frac{e^{w_0^{(1)}+\mathbf{x}_p^T\mathbf{w}^{(1)}}}{\sum_{j=1}^{C} e^{ w_0^{(j)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(j)}}}\\ {\cal P}\left(y=2\,|\,\mathbf{x}_{p},\mathbf{W}\right) = \frac{e^{w_0^{(2)}+\mathbf{x}_p^T\mathbf{w}^{(2)}}}{\sum_{j=1}^{C} e^{ w_0^{(j)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(j)}}}\\ \vdots\\ {\cal P}\left(y=C\,|\,\mathbf{x}_{p},\mathbf{W}\right) = \frac{e^{w_0^{(C)}+\mathbf{x}_p^T\mathbf{w}^{(C)}}}{\sum_{j=1}^{C} e^{ w_0^{(j)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(j)}}} \end{equation}Notice that all right-hand side values now lie between $0$ and $1$, and moreover they sum to $1$, making them valid class probabilities.
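These normalized class probabilities are easy to compute; below is a minimal sketch for a single point, where the shift by the maximum evaluation is a numerical-stability detail (not part of the derivation) that leaves the result unchanged.

# class probabilities for a single N-dimensional point x_p given the weight array W (a sketch)
def class_probabilities(x_p, W):
    evals = W[0,:] + np.dot(x_p, W[1:,:])    # the C classifier evaluations of the point
    evals = evals - np.max(evals)            # shift for numerical stability (result unchanged)
    exp_evals = np.exp(evals)
    return exp_evals/np.sum(exp_evals)       # nonnegative values summing to 1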
Plugging
\begin{equation} {\cal P}\left(y=y_p\,|\,\mathbf{x}_{p},\mathbf{W}\right) = \frac{e^{w_0^{(y_p)}+\mathbf{x}_p^T\mathbf{w}^{(y_p)}}}{\sum_{j=1}^{C} e^{ w_0^{(j)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(j)}}} \end{equation}into
\begin{equation} {\cal L}=\prod_{p=1}^{P}{\cal P}\left(y=y_{p}\,|\,\mathbf{x}_{p},\mathbf{W}\right) \end{equation}and taking the negative logarithm of the resulting function gives the multi-class negative log-likelihood loss
\begin{equation} g\left(\mathbf{W} \right) = -\sum_{p = 1}^P \text{log}\left(\frac{ e^{ w_0^{(y_p)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(y_p)}} }{ \sum_{j = 1}^{C} e^{ w_0^{(j)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(j)}} }\right) \end{equation}This is precisely the multi-class softmax cost we derived previously!
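As a sanity check, this negative log-likelihood can be computed directly and compared with the multiclass_softmax implementation above - the two differ only by the small regularization term added there. A minimal sketch, again assuming the same global x and y arrays:

# negative log-likelihood form of the multi-class softmax cost (a sketch)
def negative_log_likelihood(W):
    all_evals = W[0,:] + np.dot(x.T,W[1:,:])    # P x C classifier evaluations
    cost = 0
    for p in range(len(y)):
        y_p = int(y[p][0]) - 1                  # zero-based true label
        probs = np.exp(all_evals[p,:])/np.sum(np.exp(all_evals[p,:]))
        cost -= np.log(probs[y_p])              # negative log of the true-class probability
    return cost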