

Chapter 11: Linear unsupervised learning

11.6 General Matrix Factorization techniques

We have now seen three standard linear unsupervised learning models: PCA, Recommender Systems, and K-Means clustering. Along the way we have also seen how the shape of the most fundamental of the three - PCA - shows up in the latter two, both in how those problems are framed and in how they are optimized. In this Section we complete the picture, fully drawing out the connections these unsupervised learning methods have to PCA, by describing them all through the single lens of matrix factorization. This further unifies our view of linear unsupervised methods and opens the door to additional common methods that are similar variations on the theme of PCA.


11.6.1 Re-casting PCA, Recommender Systems, and K-Means as matrix factorization problems

To express PCA as a matrix factorization problem all we must do is re-cast the PCA Least Squares cost function described in equation (2) of Section 11.2 more compactly in terms of matrices. Denoting by $\mathbf{X}$ and $\mathbf{W}$ the $N\times P$ and $K \times P$ matrices of the data and weights stacked column-wise (respectively), and by $\mathbf{C}$ the $N\times K$ matrix whose columns are the learned spanning vectors, the cost function in equation (2) can be written equivalently in matrix notation as

\begin{equation} g\left(\mathbf{W},\mathbf{C}\right) = \frac{1}{P}\left \Vert \mathbf{C}\mathbf{W} - \mathbf{X} \right\Vert_F^2. \end{equation}

Here $\left \Vert \mathbf{A} \right \Vert_F^2 = \sum_{n=1}^N \sum_{p=1}^P A_{n,\,p}^2$ is the squared 'Frobenius' norm - the matrix analog of the squared $\ell_2$ norm for vectors. If we then roll back one step and express the approximation we are aiming for in minimizing this cost function, we can see that it is

\begin{equation} \mathbf{C}\mathbf{W} \approx \mathbf{X} \end{equation}

that is, we are trying to express or factorize the matrix $\mathbf{X}$ into a product of two matrices $\mathbf{C}$ and $\mathbf{W}$. This is the matrix analog of factorizing a single number into two 'simpler' ones, e.g., as $5\times 2 = 10$.
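To make this concrete, here is a minimal numpy sketch (our own illustration, not code from the text) that evaluates the matrix form of the PCA Least Squares cost above; the function name `frobenius_cost` and the random toy data are purely illustrative.

```python
import numpy as np

# Minimal sketch (illustrative, not code from the text): evaluate the PCA
# matrix factorization cost g(W, C) = (1/P) * ||C W - X||_F^2, with X of
# size N x P, C of size N x K, and W of size K x P, as in the text.

def frobenius_cost(C, W, X):
    P = X.shape[1]
    residual = C @ W - X           # N x P matrix of reconstruction errors
    return np.sum(residual ** 2) / P

# tiny random example (sizes are arbitrary)
N, P, K = 5, 100, 2
X = np.random.randn(N, P)
C = np.random.randn(N, K)
W = np.random.randn(K, P)
print(frobenius_cost(C, W, X))
```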

Recommender Systems, as we saw in Section 11.3, result in a cost function that mirrors PCA's, only restricted to those elements of $\mathbf{X}$ which are known. There we used an index set per column of the (ratings) matrix, $\Omega_p = \left \{\,\,j\,\,\rvert \,\,x_{j,\,p} \,\,\text{exists} \right \}$; combining these into a single index set $\Omega = \left \{\,\,\left(\,j,\,p\right)\,\,\rvert \,\,x_{j,\,p} \,\,\text{exists} \right \}$ we can likewise write the Recommender System cost function in matrix notation as

\begin{equation} g\left(\mathbf{W},\mathbf{C}\right) = \frac{1}{P}\left \Vert \left.\left\{ \mathbf{C}\mathbf{W} - \mathbf{X} \right\}\right\vert_{\Omega} \right\Vert_F^2 \end{equation}

where the symbol $\left. \left\{\cdot \right\}\right\vert_{\Omega}$ denotes that we only care about entries in the index set $\Omega$. Again rolling back to the desired approximation, with Recommender Systems we have the same matrix factorization goal as with PCA, only restricted to the index set $\Omega$

\begin{equation} \left.\left\{ \mathbf{C}\mathbf{W} \approx \mathbf{X} \right\}\right\vert_{\Omega}. \end{equation}
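Below is a similarly minimal numpy sketch of the masked cost above, assuming the index set $\Omega$ is encoded as a boolean array `known` of the same shape as $\mathbf{X}$; all names and toy values here are our own.

```python
import numpy as np

# Minimal sketch of the masked cost above, assuming the index set Omega is
# encoded as a boolean array `known` with the same shape as X. All names and
# toy values here are illustrative.

def masked_frobenius_cost(C, W, X, known):
    P = X.shape[1]
    residual = (C @ W - X) * known      # zero out entries outside Omega
    return np.sum(residual ** 2) / P

# toy ratings matrix; unobserved entries are marked False in `known`
X = np.array([[5., 0., 3.],
              [4., 2., 0.]])
known = np.array([[True, False, True],
                  [True, True, False]])
C = np.random.randn(2, 1)
W = np.random.randn(1, 3)
print(masked_frobenius_cost(C, W, X, known))
```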

Finally, to see that K-Means falls into the same matrix factorization category, let us start from the initial desire and quickly re-derive the method using the same matrix notation as above. First, our desire that points in the $k^{th}$ cluster lie close to its centroid may be written mathematically as

\begin{equation} \mathbf{c}_{k}\approx\mathbf{x}_{p}\quad\textrm{for all }p\in\mathcal{S}_{k} \,\,\,\,\,\,\,\,\, k=1,...,K \end{equation}

where $\mathbf{c}_{k}$ is the centroid of the $k^{th}$ cluster and $\mathcal{S}_{k}$ the set of indices of the subset of those $P$ data points belonging to this cluster. These desired relations can be written more conveniently in matrix notation for the centroids - denoting by $\mathbf{e}_{k}$ the $k^{th}$ standard basis vector (that is, a $K\times1$ vector with a $1$ in the $k^{th}$ slot and zeros elsewhere) - as

\begin{equation} \mathbf{C} \, \mathbf{e}_{k}\approx\mathbf{x}_{p}\quad\textrm{for all }p\in\mathcal{S}_{k} \,\,\,\,\,\,\,\,\, k=1,...,K. \end{equation}

Then, introducing matrix notation for the weights (here constrained to be standard basis vectors) and the data we can likewise write the above relations as

\begin{equation} \mathbf{C}\mathbf{W} \approx\mathbf{X} \end{equation}

where

\begin{equation} \mathbf{w}_{p}\in\left\{ \mathbf{e}_{k}\right\} _{k=1}^{K},\,\,\, p=1,\ldots,P. \end{equation}

The figure below pictorially illustrates the compactly written desired K-means relationship above for the prototypical dataset shown while devising K-Means in the previous Section. Note that the location of the only nonzero entry in each column of the assignment matrix $\mathbf{W}$ determines the cluster membership of its corresponding data point in $\mathbf{X}$. So - in other words - K-Means too is a matrix factorization problem (with a very particular set of constraints on the matrix $\mathbf{W}$).

Figure 1: K-means clustering relations described in a compact matrix form. Cluster centroids in $C$ lie close to their corresponding cluster points in $X$. The $p^{th}$ column of the assignment matrix $W$ contains the standard basis vector corresponding to the data point's cluster centroid.
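As a small concrete companion to the figure, the following numpy snippet (with made-up cluster labels and centroids) builds an assignment matrix $\mathbf{W}$ of this form; the $p^{th}$ column of $\mathbf{C}\mathbf{W}$ then equals the centroid of the cluster containing $\mathbf{x}_p$.

```python
import numpy as np

# Small illustration (made-up labels and centroids): build the K-Means
# assignment matrix W whose p-th column is the standard basis vector e_k
# of the cluster that point x_p belongs to.

K, P = 3, 6
labels = np.array([0, 0, 1, 2, 2, 1])   # cluster index of each of the P points
W = np.zeros((K, P))
W[labels, np.arange(P)] = 1.0           # exactly one nonzero entry per column

C = np.random.randn(2, K)               # K toy centroids in the plane
X_hat = C @ W                           # p-th column of C W is x_p's cluster centroid
print(W)
```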

Having framed the desired outcome - when parameters are set optimally - the associated cost function for K-Means can then likewise be written as

\begin{equation} g\left(\mathbf{W},\mathbf{C}\right) = \frac{1}{P}\left \Vert \mathbf{C}\mathbf{W} - \mathbf{X} \right\Vert_F^2 \end{equation}

subject to the constraint that $\mathbf{w}_{p}\in\left\{ \mathbf{e}_{k}\right\} _{k=1}^{K},\,\,\, p=1,\ldots,P$. In other words, to tune K-Means we must solve the constrained optimization problem

\begin{equation} \begin{aligned}\underset{\mathbf{C},\mathbf{W}}{\mbox{minimize}} & \,\,\left\Vert \mathbf{CW}-\mathbf{X}\right\Vert _{F}^{2}\\ \mbox{subject to} & \,\,\,\,\,\mathbf{w}_{p}\in\left\{ \mathbf{e}_{k}\right\} _{k=1}^{K},\,\,\, p=1,\ldots,P. \end{aligned} \end{equation}

One can easily show that the K-Means algorithm we derived in the previous Section is precisely the set of updates resulting from applying the block-coordinate descent method to the K-Means optimization problem above. This perspective is particularly helpful because, in its natural derivation in the previous Section, K-Means appears as a somewhat heuristic algorithm (i.e., unlike every other method we discuss, it is not tied to the minimization of a cost function). One practical consequence was that we previously had no framework in which to judge how a single run of the algorithm was progressing. Now we do: we can treat the K-Means algorithm precisely as we do every other optimization method we discuss - as a way of minimizing a particular cost function - and use that cost function to monitor how the algorithm is progressing.
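To illustrate this point, here is a short numpy sketch of K-Means written as block-coordinate descent on the cost above, recording the value of $\frac{1}{P}\left\Vert \mathbf{C}\mathbf{W} - \mathbf{X}\right\Vert_F^2$ after each sweep so a run's progress can be monitored; the initialization scheme, iteration count, and all names are illustrative choices of ours.

```python
import numpy as np

# Sketch of K-Means written as block-coordinate descent on the constrained
# matrix factorization cost above, recording (1/P)*||C W - X||_F^2 after each
# sweep so a run's progress can be monitored. Initialization scheme, iteration
# count, and all names are illustrative choices.

def kmeans_matrix_factorization(X, K, max_iters=50, seed=0):
    N, P = X.shape
    rng = np.random.default_rng(seed)
    C = X[:, rng.choice(P, size=K, replace=False)].astype(float)  # centroids at random points
    history = []
    for _ in range(max_iters):
        # W-update: assign each point to its nearest centroid (the constrained step)
        dists = np.linalg.norm(X[:, :, None] - C[:, None, :], axis=0)  # P x K distances
        labels = np.argmin(dists, axis=1)
        W = np.zeros((K, P))
        W[labels, np.arange(P)] = 1.0
        # C-update: move each centroid to the mean of its assigned points
        for k in range(K):
            if np.any(labels == k):
                C[:, k] = X[:, labels == k].mean(axis=1)
        history.append(np.sum((C @ W - X) ** 2) / P)  # cost value for this sweep
    return C, W, history

# the recorded cost values never increase from one sweep to the next
X = np.random.randn(2, 200)
C, W, history = kmeans_matrix_factorization(X, K=3)
print(history[:5])
```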


11.6.2 Further variations on the theme of PCA / matrix factorization

We saw in equation (10) how K-means can be re-cast as a constrained matrix factorization problem, one where each column $\mathbf{w}_p$ of the assignment matrix $\mathbf{W}$ is constrained to be a standard basis vector. This is done to guarantee every data point $\mathbf{x}_p$ ends up assigned to one (and only one) cluster centroid $\mathbf{c}_k$. A natural generalization of this idea arises when it makes sense to assign a single data point to multiple clusters - say, at most $S$ of them. Mathematically, this generalized version of K-means can be written, by adjusting (10), as

\begin{equation} \begin{aligned}\underset{\mathbf{C},\mathbf{W}}{\mbox{minimize}} & \,\,\left\Vert \mathbf{CW}-\mathbf{X}\right\Vert _{F}^{2}\\ \mbox{subject to} & \,\,\,\,\,\left\Vert \mathbf{w}_{p}\right\Vert_{0} \leq S,\,\,\, p=1,\ldots,P. \end{aligned} \end{equation}

where the K-means constraints are replaced with constraints of the form $\left\Vert \mathbf{w}_{p}\right\Vert_{0} \leq S$, making it possible for each $\mathbf{x}_p$ to be assigned to at most $S$ clusters simultaneously. Recall, $\left\Vert \mathbf{w}_{p}\right\Vert_{0}$ indicates the number of nonzero entries in the vector $\mathbf{w}_{p}$.

The generalized K-means formulation in (11) is commonly referred to as sparse dictionary learning - a well-studied problem outside of machine learning, in signal processing and related fields. In those contexts, by allowing each data point (signal) to be represented by multiple centroids (dictionary elements) we can significantly reduce the error between the signal itself and its approximation as a sparse linear combination of dictionary elements. The ability to learn a dictionary over which signals of interest are sparsely representable is exploited in solving a host of problems including image (and more generally signal) compression, denoising, inpainting, and more.
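To make the sparsity constraint concrete, below is a crude numpy heuristic of our own (not a proper sparse coding algorithm such as orthogonal matching pursuit): for a fixed dictionary $\mathbf{C}$ it computes unconstrained least squares codes and then keeps only the $S$ largest-magnitude entries of each column of $\mathbf{W}$.

```python
import numpy as np

# Crude illustrative heuristic for the constraint ||w_p||_0 <= S (not a proper
# sparse coding algorithm such as orthogonal matching pursuit): for a fixed
# dictionary C, solve unconstrained least squares for each code w_p, then keep
# only its S largest-magnitude entries.

def sparse_codes(C, X, S):
    W, *_ = np.linalg.lstsq(C, X, rcond=None)      # unconstrained codes, K x P
    for p in range(W.shape[1]):
        keep = np.argsort(np.abs(W[:, p]))[-S:]    # indices of S largest entries
        mask = np.zeros(W.shape[0], dtype=bool)
        mask[keep] = True
        W[~mask, p] = 0.0                          # enforce ||w_p||_0 <= S
    return W

X = np.random.randn(8, 20)                         # toy signals, one per column
C = np.random.randn(8, 5)                          # toy dictionary
W = sparse_codes(C, X, S=2)
print(np.count_nonzero(W, axis=0))                 # at most 2 nonzeros per column
```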

Besides sparsity, nonnegativity is another constraint sometimes placed on the matrices $\mathbf{C}$ and $\mathbf{W}$ - appropriate when we seek a nonnegative factorization of the input matrix $\mathbf{X}$ - giving the so-called nonnegative matrix factorization problem

\begin{equation} \begin{aligned}\underset{\mathbf{C},\mathbf{W}}{\mbox{minimize}} & \,\,\left\Vert \mathbf{CW}-\mathbf{X}\right\Vert _{F}^{2}\\ \mbox{subject to} & \,\,\,\,\,\mathbf{C},\mathbf{W} \geq 0. \end{aligned} \end{equation}
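To give a sense of how this constrained problem can be (approximately) solved in practice, below is a minimal numpy sketch using the classic multiplicative update rules for the Frobenius cost; the initialization, iteration count, and the small constant `eps` are arbitrary choices of ours.

```python
import numpy as np

# Minimal sketch of nonnegative matrix factorization via the classic
# multiplicative update rules for the Frobenius cost ||C W - X||_F^2.
# Initialization, iteration count, and the small constant eps are
# arbitrary choices for this illustration.

def nmf(X, K, iters=200, eps=1e-10, seed=0):
    rng = np.random.default_rng(seed)
    N, P = X.shape
    C = rng.random((N, K)) + eps                  # nonnegative initializations
    W = rng.random((K, P)) + eps
    for _ in range(iters):
        W *= (C.T @ X) / (C.T @ C @ W + eps)      # update W; stays nonnegative
        C *= (X @ W.T) / (C @ W @ W.T + eps)      # update C; stays nonnegative
    return C, W

# example on a small nonnegative data matrix
X = np.abs(np.random.randn(6, 10))
C, W = nmf(X, K=3)
print(np.linalg.norm(C @ W - X, 'fro'))
```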

Nonnegative matrix factorization is used predominantly in situations where the data is naturally nonnegative (e.g., Bag-of-Words representations of text data, pixel intensity representations of image data, etc.) and where the presence of negative entries would hinder the interpretability of learned solutions. Table 1 lists common matrix factorization problems along with the corresponding constraints on $\mathbf{C}$ and $\mathbf{W}$.


Table 1: Common matrix factorization problems $\mathbf{C}\mathbf{W} \approx \mathbf{X}$ subject to possible constraints on $\mathbf{C}$ and $\mathbf{W}$

| Matrix factorization problem | Constraints on $\mathbf{C}$ and $\mathbf{W}$ |
| --- | --- |
| Principal component analysis (PCA) | Columns of $\mathbf{C}$ are orthogonal |
| Recommender systems | No constraint on $\mathbf{C}$ and $\mathbf{W}$, but $\mathbf{X}$ is only partially known |
| K-means clustering | Each column of $\mathbf{W}$ is a standard basis vector |
| Sparse dictionary learning | Each column of $\mathbf{W}$ is sparse, i.e., has at most $S$ nonzero entries |
| Nonnegative matrix factorization | Both $\mathbf{C}$ and $\mathbf{W}$ are nonnegative |