from jupyterthemes import get_themes
from jupyterthemes.stylefx import set_nb_theme

themes = get_themes()
set_nb_theme(themes[1])
# 1. magic for inline plot
# 2. magic to print version
# 3. magic so that the notebook will reload external python modules
# 4. magic to enable retina (high resolution) plots
# https://gist.github.com/minrk/3301035
%matplotlib inline
%load_ext watermark
%load_ext autoreload
%autoreload 2
%config InlineBackend.figure_format = 'retina'

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.linalg import svd as scipy_svd
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.preprocessing import normalize, StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer

%watermark -a 'Ethen' -d -t -v -p numpy,scipy,pandas,sklearn,matplotlib
Ethen 2017-11-18 21:13:40

CPython 3.5.2
IPython 6.2.1

numpy 1.13.3
scipy 0.19.1
pandas 0.20.3
sklearn 0.19.1
matplotlib 2.1.0

Singular Value Decomposition (SVD)

When conducting a data analysis project, it's very common to encounter a dataset that contains useful information for the task at hand, but also low-quality information that doesn't contribute much to the end goal. When facing this issue, there are numerous ways to isolate the signal from the noise. e.g. We can employ regularization methods such as Lasso regression to perform constrained optimization, automatically dropping uninformative features from the model, or use tree-based methods to identify the features that were most often used for constructing the tree.

Or we can use a variance-maximization method such as Principal Component Analysis (PCA), which aims to transform the data into a new set of orthogonal components, ensuring that the first component aligns with the direction of maximum variance in the dataset, the second component aligns with the largest remaining variance, and so on. In other words, it makes a dataset more compact while preserving information. The conventional method for calculating PCA requires us to compute the full covariance matrix, which makes it memory-intensive and can be numerically unstable. It turns out that SVD is a method that can be used to compute PCA and obtain the principal components to transform our raw dataset.

Singular Value Decomposition (SVD) is a particular decomposition method that decomposes an arbitrary matrix $A$ with $m$ rows and $n$ columns (assuming this matrix also has rank $r$, i.e. $r$ columns of the matrix $A$ are linearly independent) into a set of related matrices:

\begin{align}
A = U \Sigma V^{T}
\end{align}

where:

  • $\Sigma$ (Sigma) is an $r \times r$ non-negative, decreasing-order diagonal matrix. All elements not on the main diagonal are 0, and the elements of $\Sigma$ are called the singular values. Another common notation for this matrix is $S$, thus in the following post we'll use these two symbols interchangeably.

  • $U$ is an $m \times r$ orthonormal matrix and $V$ is an $n \times r$ orthonormal matrix.

    • An orthogonal matrix is a square matrix whose columns are mutually perpendicular, i.e. the dot product between any two distinct columns is zero. Given an orthogonal matrix $Q$, $Q^T Q = Q Q^T = I$ and $Q^T = Q^{-1}$ (a quick numeric check of these properties appears after the diagram below).

    • An orthonormal matrix is an orthogonal matrix whose columns are unit vectors.

A classic pictorial representation of SVD.
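Before moving on, here's the quick numeric check of these orthogonality properties mentioned above — a minimal sketch using only numpy and the scipy_svd imported earlier, applied to an arbitrary random matrix:

# minimal sketch: verify the orthonormality properties on a random matrix
rng = np.random.RandomState(0)
M = rng.rand(6, 4)
U_chk, S_chk, Vt_chk = scipy_svd(M, full_matrices = False)

# U^T U = I and V^T V = I (the columns are orthonormal)
print(np.allclose(U_chk.T @ U_chk, np.eye(U_chk.shape[1])))
print(np.allclose(Vt_chk @ Vt_chk.T, np.eye(Vt_chk.shape[0])))

# the singular values come out in decreasing order
print(np.all(np.diff(S_chk) <= 0))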

Interpretation of SVD

Geometric Interpretation

We'll use a 2-dimensional dataset for the geometric interpretation, for ease of visualization. The transformation of a matrix by $U \Sigma V^T$ can be visualized as a rotation and reflection, a scaling, and another rotation and reflection. We'll see this as a step-by-step visualization.

Given a matrix $x = \begin{bmatrix} -10 & -10 & 20 & 20 \\ -10 & 20 & 20 & -10 \end{bmatrix}$ and a transformation matrix $A = \begin{bmatrix} 1 & 0.3 \\ 0.45 & 1.2 \end{bmatrix}$.

  • $V^T x$: We can see that multiplying by $V^T$ rotates and reflects the input matrix $x$. Notice the swap of colors red-blue and green-yellow, indicating a reflection along the x-axis.

  • $S V^T x$: Since $S$ only contains values on the diagonal, it scales the matrix. $V$ rotates the matrix to a position where the singular values now represent the scaling factor along the V-basis. In the picture below, $V^T x$ is dashed and $S V^T x$ is solid.

  • $U S V^T x$: Finally, $U$ rotates and reflects the matrix back to the standard basis.

Putting all three steps into one picture below, the dashed square shows $x$ as the corners and the transformed matrix $Ax$ as the solid shape.

The most useful property of the SVD is that the axes in the new space, which represent new latent attributes, are orthogonal. Hence original attributes are expressed in terms of new attributes that are independent of each other.

# we can confirm this with code; note that scipy_svd
# returns V transpose (often called Vh) as its third output
A = np.array([
    [1, 0.3],
    [0.45, 1.2]])
U, S, V = scipy_svd(A)
print('singular values:', S)

# the toy 2d matrix, with each row representing a point
x = np.array([
    [-10, -10, 20, 20],
    [-10, 20, 20, -10]]).T
x
singular values: [ 1.49065822 0.71444949]
array([[-10, -10], [-10, 20], [ 20, 20], [ 20, -10]])
# change default font size
plt.rcParams['font.size'] = 12

# the plot is not as pretty as the diagram above,
# but hopefully it gets the point across;
# since each row of x is a point, applying a matrix M to
# every point corresponds to right-multiplying by M.T
fig, ax = plt.subplots(1, 4, figsize = (20, 4))
ax[0].scatter(x[:, 0], x[:, 1])
ax[0].set_title('Original matrix')

temp = x @ V.T  # apply V^T to every point (recall V here stores V^T)
ax[1].scatter(temp[:, 0], temp[:, 1])
ax[1].set_title('$V^Tx$')

temp = temp @ np.diag(S)  # scale along the new axes
ax[2].scatter(temp[:, 0], temp[:, 1])
ax[2].set_title('$SV^Tx$')

temp = temp @ U.T  # apply U to rotate/reflect back to the standard basis
ax[3].scatter(temp[:, 0], temp[:, 1])
ax[3].set_title('$USV^Tx$')

plt.tight_layout()
plt.show()
Image in a Jupyter notebook

Factor Interpretation

Here we are given a rank-3 matrix $A$, representing ratings of movies by users.

Name    Matrix  Alien  Star Wars  Casablanca  Titanic
Joe     1       1      1          0           0
Jim     3       3      3          0           0
John    4       4      4          0           0
Jack    5       5      5          0           0
Jill    0       2      0          4           4
Jenny   0       0      0          5           5
Jane    0       1      0          2           2

Applying SVD to this matrix will give us the following decomposition:

The key to understanding what SVD offers is viewing the $r$ columns of $U$, $\Sigma$, and $V$ as representing concepts that are hidden in the original matrix. In our contrived example, we can imagine there are two concepts underlying the movies: science fiction and romance.

To be explicit:

  • The matrix $U$ connects people to concepts. For example, looking at Joe (the first row in the original matrix), the value 0.14 in the first row and first column of $U$ is smaller than some of the other entries in that column. The rationale for this is that while Joe watches only science fiction, he doesn't rate those movies highly.

  • The matrix $V$ relates movies to concepts. The value of approximately 0.58 in each of the first three columns of the first row of $V^T$ indicates that the first three movies – The Matrix, Alien and Star Wars – are each of the science-fiction genre.

  • The matrix $\Sigma$ gives the strength of each of the concepts. In our example, the strength of the science-fiction concept is 12.4, while the strength of the romance concept is 9.5. Intuitively, the science-fiction concept is stronger because the data provides more information about the movies of that genre and the people who like them.

  • The third concept is a bit harder to interpret, but it doesn't matter that much, because its weight, given by the third nonzero diagonal entry in $\Sigma$, is relatively low compared to the first two concepts.

  • Note that the matrix decomposition doesn't know the meaning of any column in the dataset; it discovers the underlying concepts, and it is up to us to interpret these latent factors. The sketch below prints the factors so we can inspect them directly.
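To make the bullets above concrete, here's a minimal sketch (using the scipy_svd imported earlier) that builds the rating matrix from the table and prints the rounded factors; the entries should line up with the values quoted above, up to possible sign flips:

# minimal sketch: factor the rating matrix and inspect the "concepts"
# (the printed numbers should match the values quoted in the bullets,
#  possibly up to a sign flip due to the non-uniqueness of SVD)
ratings = np.array([
    [1, 1, 1, 0, 0],
    [3, 3, 3, 0, 0],
    [4, 4, 4, 0, 0],
    [5, 5, 5, 0, 0],
    [0, 2, 0, 4, 4],
    [0, 0, 0, 5, 5],
    [0, 1, 0, 2, 2]])
U_r, S_r, Vt_r = scipy_svd(ratings, full_matrices = False)
print('user-to-concept (U):\n', np.round(U_r[:, :3], 2))
print('concept strengths (Sigma):', np.round(S_r[:3], 2))
print('concept-to-movie (V^T):\n', np.round(Vt_r[:3], 2))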

Worked Example: Full SVD

Let's step through a worked example of "full" SVD. In practice the full version is computationally expensive, since we must calculate the full matrices $U_{mr}$, $S_{rr}$, and $V_{nr}^{T}$. The "truncated" versions of SVD are usually preferred, where we preselect the top $k < r$ dimensions of interest and calculate only $U_{mk}$, $S_{kk}$ and $V_{nk}^{T}$. The truncated version is a topic for another day, though a quick sketch of the idea follows the verification cell below.

# matrix form of the table above
rank = 3
A = np.array([
    [1, 1, 1, 0, 0],
    [3, 3, 3, 0, 0],
    [4, 4, 4, 0, 0],
    [5, 5, 5, 0, 0],
    [0, 2, 0, 4, 4],
    [0, 0, 0, 5, 5],
    [0, 1, 0, 2, 2]])

# we'll use a library to perform the svd
# so we can confirm our result with it
# (again, the V returned by scipy_svd is in fact V^T)
U, S, V = scipy_svd(A, full_matrices = False)

# we'll just print out S, a.k.a Sigma, to show the numbers
# are identical to the results shown earlier
print(S)
[ 1.24810147e+01 9.50861406e+00 1.34555971e+00 3.04642685e-16 0.00000000e+00]
# the following cell verifies some properties of SVD

# verify the calculation of A = U S V^T
# (V as returned by scipy_svd already holds V^T)
print(np.allclose(A, U @ np.diag(S) @ V))

# orthonormal, columns are unit vectors (length = 1)
print(np.allclose(np.round(np.sum(U * U, axis = 0)), np.ones(S.size)))

# orthogonal, the dot product with itself is equivalent to
# the identity matrix, U^T U = I
print(np.allclose(U.T @ U, np.eye(S.size)))
True True True
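Since we now have the full factors in hand, here is the quick sketch of the truncated idea promised above — keep only the top k singular values/vectors and form a rank-k approximation of A. This is only an illustration, not a full treatment of truncated SVD:

# a quick sketch of the truncated idea: keep only the top k
# singular values/vectors and form a rank-k approximation of A
k = 2
A_k = U[:, :k] @ np.diag(S[:k]) @ V[:k]  # recall scipy's V is actually V^T
print(np.round(A_k, 2))

# the approximation error is governed by the discarded singular values
print('frobenius norm of the error:', np.linalg.norm(A - A_k))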

The SVD of a matrix $A$ is strongly connected to the eigenvalues of the symmetric matrices $A^{T}A$ and $AA^{T}$. We'll start with the expression for SVD: $A = U \Sigma V^T$.

\begin{align}
A^T &= (U \Sigma V^T)^T \\
&= (V^T)^T \Sigma^T U^T \\
&= V \Sigma^T U^T \\
&= V \Sigma U^T
\end{align}

In the second step we use the matrix property that $(BA)^T = A^T B^T$, and in the final step $\Sigma$ is a diagonal matrix, thus $\Sigma^T = \Sigma$. Next:

\begin{align}
A^T A &= (V \Sigma U^T)(U \Sigma V^T) \\
&= V \Sigma I \Sigma V^T \\
&= V \Sigma \Sigma V^T
\end{align}

In the second step, we use the fact that $U$ is an orthonormal matrix, so $U^T U$ is an identity matrix of the appropriate size.

We now multiply both sides of this equation on the right by $V$ to get:

\begin{align}
A^T A V &= V \Sigma^2 V^T V \\
&= V \Sigma^2 I \\
&= V \Sigma^2
\end{align}

Here we use the fact that $V$ is also an orthonormal matrix, so $V^T V$ is an identity matrix of the appropriate size. Looking at the equation $A^T A V = V \Sigma^2$, we now see that the columns of $V$ are the eigenvectors of the matrix $A^T A$ and $\Sigma^2$ is the diagonal matrix whose entries are the corresponding eigenvalues, i.e. $V = eig(A^T A)$.

AtA = A.T @ A
eigen_values, V1 = np.linalg.eig(AtA)

# np.linalg.eig does not guarantee any particular ordering, so sort the
# eigenvectors by descending eigenvalue to match the singular value ordering
order = np.argsort(eigen_values)[::-1]
V1 = V1[:, order]

# Note that the possible non-uniqueness of the decomposition means
# that an axis can be flipped without changing anything fundamental,
# thus we compare whether the absolute values are relatively close
# instead of the raw values (and recall V from scipy_svd holds V^T)
print(np.allclose(np.abs(V1[:, :rank]), np.abs(V.T[:, :rank])))
True

Only $U$ remains to be computed, and it can be found in the same way we found $V$, except this time we'll be starting with $A A^T$:

\begin{align}
A A^T U &= (U \Sigma V^T)(V \Sigma U^T) U \\
&= U \Sigma I \Sigma U^T U \\
&= U \Sigma \Sigma U^T U \\
&= U \Sigma \Sigma I \\
&= U \Sigma^2
\end{align}

In other words: $U = eig(A A^T)$.

AAt = A @ A.T
_, U1 = np.linalg.eig(AAt)

# again, compare absolute values since the signs may be flipped
np.allclose(np.abs(U1[:, :rank]), np.abs(U[:, :rank]))
True
# notice that since this is a rank 3 matrix,
# only the first 3 values of S contain non-zero values
np.round(S, 0)
array([ 12., 10., 1., 0., 0.])

To sum this section up:

  • $U$ is an $m \times r$ orthonormal matrix of "left-singular" (eigen)vectors of $AA^{T}$.

  • $V$ is an $n \times r$ orthonormal matrix of "right-singular" (eigen)vectors of $A^{T}A$.

  • $\Sigma$ is an $r \times r$ non-negative, decreasing-order diagonal matrix. All elements not on the main diagonal are 0, and the elements of $\Sigma$ are called the singular values, which are the square roots of the nonzero eigenvalues (a one-line check of this follows).
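Here is the one-line check referenced above, reusing A, S, and rank from the worked example; np.linalg.eigvalsh is used since $A^T A$ is symmetric and it returns eigenvalues in ascending order:

# singular values squared should equal the (sorted) eigenvalues of A^T A
eigvals_AtA = np.linalg.eigvalsh(A.T @ A)[::-1]  # reverse to descending order
print(np.allclose(S[:rank] ** 2, eigvals_AtA[:rank]))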

For those interested, the following link contains a detailed walkthrough of the computation by hand. Notes: Singular Value Decomposition Tutorial

Relationships with PCA

This usage of SVD is very similar to Principal Components Analysis (PCA), and in fact several numerical software libraries use SVD under the hood for their PCA routines, for example sklearn.decomposition.PCA within scikit-learn. This is because SVD is more numerically stable, and it's also possible to perform a truncated SVD, which only requires us to calculate $U \Sigma V^T$ for the first $k < n$ features; this makes it far quicker to compute than the full covariance matrix as computed within PCA.

In the following section, we'll take a look at the relationship between these two methods, PCA and SVD. Recall from the documentation on PCA that, given the input matrix $\mathbf X$, the math behind the algorithm is to solve the eigendecomposition of the correlation matrix (assuming we standardized all features) $\mathbf C = \mathbf X^T \mathbf X / (n - 1)$. It turns out we can represent $\mathbf C$ as a product of its eigenvectors $\mathbf W$ and diagonalized eigenvalues $\mathbf L$:

\begin{align}
\mathbf C &= \mathbf W \mathbf L \mathbf W^T
\end{align}
# use some toy dataset
iris = load_iris()
X = iris['data']

# construct the pipeline
standardize = StandardScaler()
pca = PCA()
pipeline = Pipeline([
    ('standardize', standardize),
    ('pca', pca)
])
X_pca = pipeline.fit_transform(X)
standardize = pipeline.named_steps['standardize']
X_std = standardize.transform(X)

# confirm C = W L W^T using the covariance matrix of the standardized data
X_cov = np.cov(X_std.T)
eigen_values, eigen_vecs = np.linalg.eig(X_cov)
reconstructed_X = eigen_vecs @ np.diag(eigen_values) @ np.linalg.inv(eigen_vecs)
print(np.allclose(X_cov, reconstructed_X))
True

After obtaining the eigenvectors, i.e. the principal directions, we can project our raw data onto the principal axes; the projections are called principal component scores and are obtained via the operation $\mathbf{XW}$.
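As a quick verification of this projection, here's a sketch reusing X_std, eigen_values, eigen_vecs and X_pca from the cells above; the eigenvectors are sorted by descending eigenvalue because np.linalg.eig makes no ordering guarantee, and absolute values are compared since signs may flip:

# project the standardized data onto the principal directions, XW,
# and compare with the scores returned by sklearn's PCA
order = np.argsort(eigen_values)[::-1]
W = eigen_vecs[:, order]
scores = X_std @ W
print(np.allclose(np.abs(scores), np.abs(X_pca)))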

As for singular value decomposition, $\mathbf X = \mathbf U \mathbf \Sigma \mathbf V^T$. We can write out the correlation matrix using this form:

\begin{align}
\mathbf C &= \mathbf V \mathbf \Sigma \mathbf U^T \mathbf U \mathbf \Sigma \mathbf V^T / (n - 1) \\
&= \mathbf V \frac{\mathbf \Sigma^2}{n - 1}\mathbf V^T
\end{align}

Meaning the right singular vectors $\mathbf V$ are the principal directions, and the singular values are related to the eigenvalues of the correlation matrix via $\mathbf L = \mathbf \Sigma^2 / (n - 1)$. And the principal component scores can be computed by $\mathbf X \mathbf V = \mathbf U \mathbf \Sigma \mathbf V^T \mathbf V = \mathbf U \mathbf \Sigma$.

# here we'll print out the eigenvectors
# learned from PCA and the V learned from svd
pca.components_
array([[ 0.52237162, -0.26335492, 0.58125401, 0.56561105], [ 0.37231836, 0.92555649, 0.02109478, 0.06541577], [-0.72101681, 0.24203288, 0.14089226, 0.6338014 ], [-0.26199559, 0.12413481, 0.80115427, -0.52354627]])
# we can do X @ V to obtain the principal component scores;
# keep in mind scipy_svd returns V^T, so the rows printed below
# correspond to the rows of pca.components_ above
U, S, V = scipy_svd(X_std)
V
array([[ 0.52237162, -0.26335492, 0.58125401, 0.56561105], [-0.37231836, -0.92555649, -0.02109478, -0.06541577], [ 0.72101681, -0.24203288, -0.14089226, -0.6338014 ], [ 0.26199559, -0.12413481, -0.80115427, 0.52354627]])
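To also verify the eigenvalue and score relationships stated a few cells above, here's a small sketch reusing U, S, X_std, X_pca and eigen_values (note scipy returned the full U here, so only its first few columns are needed):

n_samples = X_std.shape[0]

# eigenvalues of the correlation matrix, L = Sigma^2 / (n - 1)
print(np.allclose(np.sort(eigen_values)[::-1], S ** 2 / (n_samples - 1)))

# principal component scores, X V = U Sigma (compare absolute values,
# since individual components may have flipped signs)
scores_svd = U[:, :S.size] @ np.diag(S)
print(np.allclose(np.abs(scores_svd), np.abs(X_pca)))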

Notice that, in the two printed matrices above, some of the signs are flipped; this is normal due to the previously stated non-uniqueness of the decomposition. We'll now wrap up this section with a diagram of PCA versus SVD:

Applications

This is by no means an exhaustive list of SVD's applications.

Dimensionality Reduction

Due to its relationship with PCA, we can imagine that a very frequent use of SVD is feature reduction. By selecting only the top $k$ singular values, we have in effect compressed the original information and represented it using fewer features. Note that because SVD is also a numerical algorithm, it is important to standardize the features to ensure the magnitudes of the entries are of similar range.
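As a minimal sketch of this workflow, we can reuse the iris features X loaded earlier: standardize them, then keep the top k components with TruncatedSVD (imported at the top). The choice of k = 2 here is arbitrary, purely for illustration:

# a minimal sketch: standardize the features, then keep the top k components
k = 2
X_scaled = StandardScaler().fit_transform(X)
svd_reduce = TruncatedSVD(n_components = k)
X_reduced = svd_reduce.fit_transform(X_scaled)
print('reduced shape:', X_reduced.shape)
print('variance explained by the top %d components:' % k,
      np.sum(svd_reduce.explained_variance_ratio_))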

Information Retrieval

SVD has also been used extensively in information retrieval; in this particular application, it is also known as Latent Semantic Analysis (LSA) or Latent Semantic Indexing (LSI). As we'll soon see, the idea is very similar to topic modeling. The fundamental problem in information retrieval is: given some search terms, retrieve all of the documents that contain those search terms or, perhaps more usefully, return documents whose content is semantically related to the search terms. For example, if one of the search terms was "automobile", it might be appropriate to also return documents that contain the search term "car".

One approach to this problem is: given an information repository, convert the raw text to a document-term matrix with one row per document and one column per word, then represent the search terms as a vector in the same space and retrieve document vectors that are close to the search vector. There are several problems with this vector-based retrieval.

  • First, the space is very high dimensional. For example, a typical collection of documents can easily mention more than 100,000 words even if stemming is used (i.e., "skip", "skipping", "skipped" are all treated as the same word). This creates problems for distance measurement due to the curse of dimensionality.

  • Second, it treats each word as independent, whereas in languages like English, the same word can mean two different things ("bear" a burden versus "bear" in the woods), and two different words can mean the same thing ("car" and "automobile").

By applying SVD, we can reduce the dimension to speed up the search, and words with similar meanings will get mapped to a similar truncated space. We'll take a look at this application in the following quick example:

example = [
    'Machine learning is super fun',
    'Python is super, super cool',
    'Statistics is cool, too',
    'Data science is fun',
    'Python is great for machine learning',
    'I like football',
    'Football is great to watch']

# a two-staged model pipeline,
# first convert raw words to a tfidf document-term matrix
# and apply svd decomposition after that
tfidf = TfidfVectorizer(stop_words = 'english')
svd = TruncatedSVD(n_components = 2)
pipeline = Pipeline([
    ('tfidf', tfidf),
    ('svd', svd)
])
X_lsa = pipeline.fit_transform(example)
X_lsa
array([[ 0.82714832, -0.20216821], [ 0.64317518, -0.27989764], [ 0.19952711, -0.19724375], [ 0.24907097, -0.13828783], [ 0.7392593 , 0.14892526], [ 0.1162772 , 0.73645697], [ 0.28427388, 0.79260792]])
# mapping of words to latent factors/concepts,
# i.e. each concept is a linear combination of words
tfidf = pipeline.named_steps['tfidf']
vocab = tfidf.get_feature_names()
pd.DataFrame(svd.components_, index = ['concept1', 'concept2'], columns = vocab)
svd = pipeline.named_steps['svd']
print('total variance explained:', np.sum(svd.explained_variance_))

# mapping of documents to latent factors/concepts,
# i.e. each document is a linear combination of the concepts
pd.DataFrame(X_lsa, index = example, columns = ['concept1', 'concept2'])
total variance explained: 0.252606886963

After applying LSA, we can use the compressed features to see which documents are more similar to a particular document. The following code chunk shows the pairwise cosine similarity of all the documents.

X_normed = normalize(X_lsa, axis = 1)
similarity = X_normed @ X_normed.T
pd.DataFrame(similarity, index = example, columns = example)
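As a sketch of how retrieval would then work, we can push a new query through the same fitted pipeline (tfidf followed by svd) and rank documents by cosine similarity in the compressed space; the query string below is just an illustrative example:

# map a new query into the same latent space and rank documents by similarity
query = ['python for machine learning']
query_lsa = pipeline.transform(query)
query_normed = normalize(query_lsa, axis = 1)
query_similarity = (X_normed @ query_normed.T).ravel()

# documents sorted from most to least similar to the query
ranking = np.argsort(query_similarity)[::-1]
for idx in ranking:
    print(round(query_similarity[idx], 3), example[idx])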

Collaborative Filtering

This post is getting quite long, thus we'll simply note that, similar to the movie rating matrix from a couple of sections back, SVD can be applied to building recommendation systems, namely collaborative filtering (a minimal sketch of the idea follows below). I personally haven't checked it yet, but the following post seems to contain a walkthrough of SVD applied to collaborative filtering for people who are interested in diving deeper. Blog: Matrix Factorization for Movie Recommendations in Python
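The sketch below illustrates only the core idea (real collaborative filtering systems handle missing ratings, user/item biases and regularization much more carefully): form a low-rank reconstruction of the rating matrix A from the factor-interpretation section and read off the reconstructed entries as rough scores for unobserved user/movie pairs.

# minimal sketch: a rank-2 reconstruction of the rating matrix A serves as
# a crude predicted-rating matrix (treating zeros as "unobserved" is itself
# a simplification made purely for illustration)
U_cf, S_cf, Vt_cf = scipy_svd(A, full_matrices = False)
k = 2
A_pred = U_cf[:, :k] @ np.diag(S_cf[:k]) @ Vt_cf[:k]

# e.g. Jill (row index 4) has no rating for The Matrix (column index 0),
# yet the reconstruction assigns it a score we could use for ranking
print(np.round(A_pred, 2))
print('predicted score for Jill on The Matrix:', round(A_pred[4, 0], 2))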

Reference