GitHub Repository: rasbt/machine-learning-book
Path: blob/main/ch05/ch05.ipynb
Kernel: Python 3 (ipykernel)

Machine Learning with PyTorch and Scikit-Learn

-- Code Examples

Package version checks

Add the parent folder to the path in order to load the check_packages.py script:

import sys
sys.path.insert(0, '..')

Check recommended package versions:

from python_environment_check import check_packages

d = {
    'numpy': '1.21.2',
    'matplotlib': '3.4.3',
    'sklearn': '1.0',
    'pandas': '1.3.2'
}
check_packages(d)
[OK] Your Python version is 3.9.6 | packaged by conda-forge | (default, Jul 11 2021, 03:35:11) [Clang 11.1.0 ]
[OK] numpy 1.21.5
[OK] matplotlib 3.5.1
[OK] sklearn 1.0.1
[OK] pandas 1.3.5

Chapter 5 - Compressing Data via Dimensionality Reduction



Overview



from IPython.display import Image
%matplotlib inline

Unsupervised dimensionality reduction via principal component analysis

The main steps behind principal component analysis

Image(filename='figures/05_01.png', width=400)
Image in a Jupyter notebook

Extracting the principal components step-by-step

import pandas as pd

df_wine = pd.read_csv('https://archive.ics.uci.edu/ml/'
                      'machine-learning-databases/wine/wine.data',
                      header=None)

# if the Wine dataset is temporarily unavailable from the
# UCI machine learning repository, un-comment the following line
# of code to load the dataset from a local path:

# df_wine = pd.read_csv('wine.data', header=None)

df_wine.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash',
                   'Alcalinity of ash', 'Magnesium', 'Total phenols',
                   'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins',
                   'Color intensity', 'Hue',
                   'OD280/OD315 of diluted wines', 'Proline']

df_wine.head()

Splitting the data into 70% training and 30% test subsets.

from sklearn.model_selection import train_test_split

X, y = df_wine.iloc[:, 1:].values, df_wine.iloc[:, 0].values

X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.3,
                     stratify=y,
                     random_state=0)

Standardizing the data.

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)

Note

Accidentally, I wrote X_test_std = sc.fit_transform(X_test) instead of X_test_std = sc.transform(X_test). In this case, it wouldn't make a big difference since the mean and standard deviation of the test set should be (quite) similar to the training set. However, as you remember from Chapter 3, the correct way is to re-use parameters from the training set if we are doing any kind of transformation -- the test set should basically stand for "new, unseen" data.

My initial typo reflects a common mistake: some people do not re-use these parameters from model training/building and instead standardize the new data "from scratch." Here is a simple example to explain why this is a problem.

Let's assume we have a simple training set consisting of 3 examples with 1 feature (let's call this feature "length"):

  • train_1: 10 cm -> class_2

  • train_2: 20 cm -> class_2

  • train_3: 30 cm -> class_1

mean: 20, std.: 8.2

After standardization, the transformed feature values are

  • train_std_1: -1.21 -> class_2

  • train_std_2: 0 -> class_2

  • train_std_3: 1.21 -> class_1

Next, let's assume our model has learned to classify examples with a standardized length value < 0.6 as class_2 (class_1 otherwise). So far so good. Now, let's say we have 3 unlabeled data points that we want to classify:

  • new_4: 5 cm -> class ?

  • new_5: 6 cm -> class ?

  • new_6: 7 cm -> class ?

If we look at the unstandardized "length" values in our training dataset, it is intuitive to say that all of these examples likely belong to class_2. However, if we standardize them by re-computing the standard deviation and mean from the new data, we get values similar to those in the training set, and the classifier would assign class_2 only to examples 4 and 5, while example 6 would (probably incorrectly) be classified as class_1:

  • new_std_4: -1.21 -> class_2

  • new_std_5: 0 -> class_2

  • new_std_6: 1.21 -> class_1

However, if we use the parameters from our training-set standardization (mean 20, std. 8.2), we'd get the values:

  • new_std_4: -1.84 -> class_2

  • new_std_5: -1.71 -> class_2

  • new_std_6: -1.59 -> class_2

The values 5 cm, 6 cm, and 7 cm are much lower than anything we have seen in the training set previously. Thus, it only makes sense that the standardized features of the "new examples" are much lower than every standardized feature in the training set.
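The following small sketch is not part of the book's code; it simply reproduces the toy numbers above with scikit-learn's StandardScaler (the variable names are made up for illustration):

import numpy as np
from sklearn.preprocessing import StandardScaler

# toy training set: a single "length" feature (values in cm)
X_toy_train = np.array([[10.], [20.], [30.]])
X_toy_new = np.array([[5.], [6.], [7.]])

sc_toy = StandardScaler().fit(X_toy_train)   # learns mean=20, std ~8.2

# correct: re-use the training parameters for the new data
print(sc_toy.transform(X_toy_new).ravel())   # roughly [-1.84, -1.71, -1.59]

# incorrect: standardizing the new data "from scratch"
print(StandardScaler().fit_transform(X_toy_new).ravel())  # roughly [-1.2, 0.0, 1.2]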


Eigendecomposition of the covariance matrix.

import numpy as np

cov_mat = np.cov(X_train_std.T)
eigen_vals, eigen_vecs = np.linalg.eig(cov_mat)

print('\nEigenvalues \n', eigen_vals)
Eigenvalues [4.84274532 2.41602459 1.54845825 0.96120438 0.84166161 0.6620634 0.51828472 0.34650377 0.3131368 0.10754642 0.21357215 0.15362835 0.1808613 ]

Note:

Above, I used the numpy.linalg.eig function to decompose the symmetric covariance matrix into its eigenvalues and eigenvectors:

>>> eigen_vals, eigen_vecs = np.linalg.eig(cov_mat)

This is not really a "mistake," but probably suboptimal. It would be better to use numpy.linalg.eigh in such cases, which is designed for Hermitian matrices and always returns real eigenvalues. The numerically less stable np.linalg.eig can also decompose nonsymmetric square matrices; however, you may find that it returns complex eigenvalues in certain cases. (S.R.)
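As a quick illustration (this cell is not in the book), both routines should agree on the symmetric covariance matrix; note that numpy.linalg.eigh returns its eigenvalues in ascending order:

eigen_vals_h, eigen_vecs_h = np.linalg.eigh(cov_mat)

# eigh returns eigenvalues in ascending order; eig in no particular order
print(np.allclose(np.sort(eigen_vals), eigen_vals_h))  # True, up to floating-point precision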



Total and explained variance

tot = sum(eigen_vals)
var_exp = [(i / tot) for i in sorted(eigen_vals, reverse=True)]
cum_var_exp = np.cumsum(var_exp)
import matplotlib.pyplot as plt

plt.bar(range(1, 14), var_exp, align='center',
        label='Individual explained variance')
plt.step(range(1, 14), cum_var_exp, where='mid',
         label='Cumulative explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal component index')
plt.legend(loc='best')
plt.tight_layout()
# plt.savefig('figures/05_02.png', dpi=300)
plt.show()
Image in a Jupyter notebook
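As a small sanity check (not in the book): the explained variance ratios should sum to 1, and the eigenvalue total equals the trace of the covariance matrix:

# var_exp and tot were computed two cells above
print(np.isclose(sum(var_exp), 1.0))       # the ratios sum to 1
print(np.isclose(tot, np.trace(cov_mat)))  # total variance equals the trace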


Feature transformation

# Make a list of (eigenvalue, eigenvector) tuples
eigen_pairs = [(np.abs(eigen_vals[i]), eigen_vecs[:, i])
               for i in range(len(eigen_vals))]

# Sort the (eigenvalue, eigenvector) tuples from high to low
eigen_pairs.sort(key=lambda k: k[0], reverse=True)
w = np.hstack((eigen_pairs[0][1][:, np.newaxis],
               eigen_pairs[1][1][:, np.newaxis]))
print('Matrix W:\n', w)
Matrix W: [[-0.13724218 0.50303478] [ 0.24724326 0.16487119] [-0.02545159 0.24456476] [ 0.20694508 -0.11352904] [-0.15436582 0.28974518] [-0.39376952 0.05080104] [-0.41735106 -0.02287338] [ 0.30572896 0.09048885] [-0.30668347 0.00835233] [ 0.07554066 0.54977581] [-0.32613263 -0.20716433] [-0.36861022 -0.24902536] [-0.29669651 0.38022942]]
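NumPy returns eigenvectors normalized to unit length, and eigenvectors of a symmetric matrix belonging to distinct eigenvalues are orthogonal, so the columns of W should form an orthonormal basis. A small check, not part of the book (the loose tolerance simply guards against round-off):

# W^T W should be numerically close to the 2x2 identity matrix
print(np.allclose(w.T.dot(w), np.eye(2), atol=1e-6))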

Note

Depending on which version of NumPy and LAPACK you are using, you may obtain the matrix W with its signs flipped. Please note that this is not an issue: if $v$ is an eigenvector of a matrix $\Sigma$, we have

$\Sigma v = \lambda v,$

where $\lambda$ is our eigenvalue, then $-v$ is also an eigenvector that has the same eigenvalue, since

$\Sigma \cdot (-v) = -\Sigma v = -\lambda v = \lambda \cdot (-v).$
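A minimal numerical check of this sign-flip property (not in the book), using the first eigenpair computed above:

lam, v = eigen_pairs[0]  # largest eigenvalue and its eigenvector

# both v and -v satisfy the eigenvector equation for the same eigenvalue
print(np.allclose(cov_mat.dot(v), lam * v))
print(np.allclose(cov_mat.dot(-v), lam * -v))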

X_train_std[0].dot(w)
array([2.38299011, 0.45458499])
X_train_pca = X_train_std.dot(w)

colors = ['r', 'b', 'g']
markers = ['o', 's', '^']

for l, c, m in zip(np.unique(y_train), colors, markers):
    plt.scatter(X_train_pca[y_train == l, 0],
                X_train_pca[y_train == l, 1],
                c=c, label=f'Class {l}', marker=m)

plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.legend(loc='lower left')
plt.tight_layout()
# plt.savefig('figures/05_03.png', dpi=300)
plt.show()
Image in a Jupyter notebook


Principal component analysis in scikit-learn

NOTE

The following four code cells have been added in addition to the content of the book to illustrate how to replicate the results of our own PCA implementation in scikit-learn:

from sklearn.decomposition import PCA

pca = PCA()
X_train_pca = pca.fit_transform(X_train_std)
pca.explained_variance_ratio_
array([0.36951469, 0.18434927, 0.11815159, 0.07334252, 0.06422108, 0.05051724, 0.03954654, 0.02643918, 0.02389319, 0.01629614, 0.01380021, 0.01172226, 0.00820609])
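The signs of individual components may differ between the manual implementation and scikit-learn (see the note on sign flips above), so a comparison that ignores signs is a reasonable sanity check; this cell is an illustration and not from the book:

# the first two scikit-learn components should match the manual projection up to sign
X_train_pca2 = pca.transform(X_train_std)[:, :2]
print(np.allclose(np.abs(X_train_pca2), np.abs(X_train_std.dot(w))))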
plt.bar(range(1, 14), pca.explained_variance_ratio_, align='center')
plt.step(range(1, 14), np.cumsum(pca.explained_variance_ratio_), where='mid')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal components')
plt.show()
Image in a Jupyter notebook
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_std)
X_test_pca = pca.transform(X_test_std)
plt.scatter(X_train_pca[:, 0], X_train_pca[:, 1])
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.show()
Image in a Jupyter notebook
from matplotlib.colors import ListedColormap


def plot_decision_regions(X, y, classifier, test_idx=None, resolution=0.02):

    # setup marker generator and color map
    markers = ('o', 's', '^', 'v', '<')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])

    # plot the decision surface
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                           np.arange(x2_min, x2_max, resolution))
    lab = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    lab = lab.reshape(xx1.shape)
    plt.contourf(xx1, xx2, lab, alpha=0.3, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())

    # plot class examples
    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y == cl, 0],
                    y=X[y == cl, 1],
                    alpha=0.8,
                    c=colors[idx],
                    marker=markers[idx],
                    label=f'Class {cl}',
                    edgecolor='black')

Training a logistic regression classifier using the first 2 principal components.

from sklearn.linear_model import LogisticRegression

pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_std)
X_test_pca = pca.transform(X_test_std)

lr = LogisticRegression(multi_class='ovr', random_state=1, solver='lbfgs')
lr = lr.fit(X_train_pca, y_train)
plot_decision_regions(X_train_pca, y_train, classifier=lr)
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.legend(loc='lower left')
plt.tight_layout()
# plt.savefig('figures/05_04.png', dpi=300)
plt.show()
Image in a Jupyter notebook
plot_decision_regions(X_test_pca, y_test, classifier=lr)
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.legend(loc='lower left')
plt.tight_layout()
# plt.savefig('figures/05_05.png', dpi=300)
plt.show()
Image in a Jupyter notebook
pca = PCA(n_components=None)
X_train_pca = pca.fit_transform(X_train_std)
pca.explained_variance_ratio_
array([0.36951469, 0.18434927, 0.11815159, 0.07334252, 0.06422108, 0.05051724, 0.03954654, 0.02643918, 0.02389319, 0.01629614, 0.01380021, 0.01172226, 0.00820609])

Assessing feature contributions

loadings = eigen_vecs * np.sqrt(eigen_vals)

fig, ax = plt.subplots()

ax.bar(range(13), loadings[:, 0], align='center')
ax.set_ylabel('Loadings for PC 1')
ax.set_xticks(range(13))
ax.set_xticklabels(df_wine.columns[1:], rotation=90)

plt.ylim([-1, 1])
plt.tight_layout()
plt.savefig('figures/05_05_02.png', dpi=300)
plt.show()
Image in a Jupyter notebook
loadings[:, 0]
array([-0.3020184 , 0.54408942, -0.05600938, 0.45540829, -0.33970111, -0.8665386 , -0.9184327 , 0.67279444, -0.67489496, 0.16623657, -0.71769524, -0.81117245, -0.65291742])
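Because the features were standardized, the loadings can also be read as (approximate) correlations between the original features and the principal component. A small check, not from the book; the tiny discrepancy comes from np.cov dividing by n-1 while StandardScaler divides by n:

# correlation between the first feature (Alcohol) and the first principal component
corr_alcohol_pc1 = np.corrcoef(X_train_std[:, 0], X_train_std.dot(w)[:, 0])[0, 1]

print(corr_alcohol_pc1)   # close to loadings[0, 0], up to a small ddof effect
print(loadings[0, 0])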
sklearn_loadings = pca.components_.T * np.sqrt(pca.explained_variance_)

fig, ax = plt.subplots()

ax.bar(range(13), sklearn_loadings[:, 0], align='center')
ax.set_ylabel('Loadings for PC 1')
ax.set_xticks(range(13))
ax.set_xticklabels(df_wine.columns[1:], rotation=90)

plt.ylim([-1, 1])
plt.tight_layout()
plt.savefig('figures/05_05_03.png', dpi=300)
plt.show()
Image in a Jupyter notebook


Supervised data compression via linear discriminant analysis

Principal component analysis versus linear discriminant analysis

Image(filename='figures/05_06.png', width=400)
Image in a Jupyter notebook

The inner workings of linear discriminant analysis



Computing the scatter matrices

Calculate the mean vectors for each class:

np.set_printoptions(precision=4)

mean_vecs = []
for label in range(1, 4):
    mean_vecs.append(np.mean(X_train_std[y_train == label], axis=0))
    print(f'MV {label}: {mean_vecs[label - 1]}\n')
MV 1: [ 0.9066 -0.3497 0.3201 -0.7189 0.5056 0.8807 0.9589 -0.5516 0.5416 0.2338 0.5897 0.6563 1.2075] MV 2: [-0.8749 -0.2848 -0.3735 0.3157 -0.3848 -0.0433 0.0635 -0.0946 0.0703 -0.8286 0.3144 0.3608 -0.7253] MV 3: [ 0.1992 0.866 0.1682 0.4148 -0.0451 -1.0286 -1.2876 0.8287 -0.7795 0.9649 -1.209 -1.3622 -0.4013]

Compute the within-class scatter matrix:

d = 13  # number of features

S_W = np.zeros((d, d))
for label, mv in zip(range(1, 4), mean_vecs):
    class_scatter = np.zeros((d, d))  # scatter matrix for each class
    for row in X_train_std[y_train == label]:
        row, mv = row.reshape(d, 1), mv.reshape(d, 1)  # make column vectors
        class_scatter += (row - mv).dot((row - mv).T)
    S_W += class_scatter  # sum class scatter matrices

print('Within-class scatter matrix: '
      f'{S_W.shape[0]}x{S_W.shape[1]}')
Within-class scatter matrix: 13x13

Better: use the covariance matrix instead, since the classes are not equally distributed (the covariance matrix is the scatter matrix normalized by the number of class examples):

print('Class label distribution:', np.bincount(y_train)[1:])
Class label distribution: [41 50 33]
d = 13  # number of features

S_W = np.zeros((d, d))
for label, mv in zip(range(1, 4), mean_vecs):
    class_scatter = np.cov(X_train_std[y_train == label].T)
    S_W += class_scatter

print('Scaled within-class scatter matrix: '
      f'{S_W.shape[0]}x{S_W.shape[1]}')
Scaled within-class scatter matrix: 13x13
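The relationship between the two versions is a simple rescaling: each class scatter matrix equals the class covariance matrix times (n_i - 1), where n_i is the number of training examples in class i (np.cov divides by n_i - 1 by default). A quick check for class 1, not part of the book:

n_1 = X_train_std[y_train == 1].shape[0]   # number of examples in class 1

# unnormalized scatter matrix for class 1, computed from scratch
scatter_1 = np.zeros((13, 13))
mv_1 = mean_vecs[0].reshape(13, 1)
for row in X_train_std[y_train == 1]:
    row = row.reshape(13, 1)
    scatter_1 += (row - mv_1).dot((row - mv_1).T)

# scatter matrix == (n_1 - 1) * covariance matrix
print(np.allclose(scatter_1, (n_1 - 1) * np.cov(X_train_std[y_train == 1].T)))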

Compute the between-class scatter matrix:

mean_overall = np.mean(X_train_std, axis=0)
mean_overall = mean_overall.reshape(d, 1)  # make column vector

d = 13  # number of features
S_B = np.zeros((d, d))

for i, mean_vec in enumerate(mean_vecs):
    n = X_train_std[y_train == i + 1, :].shape[0]
    mean_vec = mean_vec.reshape(d, 1)  # make column vector
    S_B += n * (mean_vec - mean_overall).dot((mean_vec - mean_overall).T)

print('Between-class scatter matrix: '
      f'{S_B.shape[0]}x{S_B.shape[1]}')
Between-class scatter matrix: 13x13


Selecting linear discriminants for the new feature subspace

Solve the generalized eigenvalue problem for the matrix $S_W^{-1} S_B$:

eigen_vals, eigen_vecs = np.linalg.eig(np.linalg.inv(S_W).dot(S_B))

Note:

Above, I again used the numpy.linalg.eig function to obtain the eigenvalues and eigenvectors. Unlike the symmetric covariance matrix in the PCA section, the matrix $S_W^{-1} S_B$ is generally not symmetric, so the Hermitian-only numpy.linalg.eigh routine does not apply directly here. As a consequence, np.linalg.eig may return eigenvalues and eigenvectors with numerically tiny imaginary components, which is why only the real parts are used (via .real) in the cells below.
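One common alternative, not used in the book, is to solve the equivalent generalized eigenvalue problem $S_B w = \lambda S_W w$ with scipy.linalg.eigh, which avoids forming the explicit inverse and returns real eigenvalues (this sketch assumes SciPy is installed):

from scipy.linalg import eigh as scipy_eigh

# generalized symmetric-definite eigenproblem: S_B w = lambda * S_W w
# (S_B is symmetric; the scaled S_W is symmetric and positive definite here)
gen_eigen_vals, gen_eigen_vecs = scipy_eigh(S_B, S_W)

# eigenvalues are returned in ascending order; the two largest should match
# the two non-zero eigenvalues computed below
print(gen_eigen_vals[-2:])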

Sort eigenvectors in descending order of the eigenvalues:

# Make a list of (eigenvalue, eigenvector) tuples
eigen_pairs = [(np.abs(eigen_vals[i]), eigen_vecs[:, i])
               for i in range(len(eigen_vals))]

# Sort the (eigenvalue, eigenvector) tuples from high to low
eigen_pairs = sorted(eigen_pairs, key=lambda k: k[0], reverse=True)

# Visually confirm that the list is correctly sorted by decreasing eigenvalues
print('Eigenvalues in descending order:\n')
for eigen_val in eigen_pairs:
    print(eigen_val[0])
Eigenvalues in descending order: 349.6178089059941 172.76152218979385 2.842170943040401e-14 2.564925776858485e-14 2.4338020883285312e-14 2.4338020883285312e-14 2.0175057173967873e-14 1.4666248938581886e-14 1.4666248938581886e-14 6.749911050291042e-15 6.749911050291042e-15 4.473806353827043e-15 1.1725209460862484e-15
tot = sum(eigen_vals.real)

discr = [(i / tot) for i in sorted(eigen_vals.real, reverse=True)]
cum_discr = np.cumsum(discr)

plt.bar(range(1, 14), discr, align='center',
        label='Individual discriminability')
plt.step(range(1, 14), cum_discr, where='mid',
         label='Cumulative discriminability')
plt.ylabel('Discriminability ratio')
plt.xlabel('Linear discriminants')
plt.ylim([-0.1, 1.1])
plt.legend(loc='best')
plt.tight_layout()
# plt.savefig('figures/05_07.png', dpi=300)
plt.show()
Image in a Jupyter notebook
w = np.hstack((eigen_pairs[0][1][:, np.newaxis].real,
               eigen_pairs[1][1][:, np.newaxis].real))
print('Matrix W:\n', w)
Matrix W: [[-0.1481 -0.4092] [ 0.0908 -0.1577] [-0.0168 -0.3537] [ 0.1484 0.3223] [-0.0163 -0.0817] [ 0.1913 0.0842] [-0.7338 0.2823] [-0.075 -0.0102] [ 0.0018 0.0907] [ 0.294 -0.2152] [-0.0328 0.2747] [-0.3547 -0.0124] [-0.3915 -0.5958]]


Projecting examples onto the new feature space

X_train_lda = X_train_std.dot(w)

colors = ['r', 'b', 'g']
markers = ['o', 's', '^']

for l, c, m in zip(np.unique(y_train), colors, markers):
    plt.scatter(X_train_lda[y_train == l, 0],
                X_train_lda[y_train == l, 1] * (-1),
                c=c, label=f'Class {l}', marker=m)

plt.xlabel('LD 1')
plt.ylabel('LD 2')
plt.legend(loc='lower right')
plt.tight_layout()
plt.savefig('figures/05_08.png', dpi=300)
plt.show()
Image in a Jupyter notebook


LDA via scikit-learn

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

lda = LDA(n_components=2)
X_train_lda = lda.fit_transform(X_train_std, y_train)
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(multi_class='ovr', random_state=1, solver='lbfgs')
lr = lr.fit(X_train_lda, y_train)

plot_decision_regions(X_train_lda, y_train, classifier=lr)
plt.xlabel('LD 1')
plt.ylabel('LD 2')
plt.legend(loc='lower left')
plt.tight_layout()
# plt.savefig('figures/05_09.png', dpi=300)
plt.show()
Image in a Jupyter notebook
X_test_lda = lda.transform(X_test_std)

plot_decision_regions(X_test_lda, y_test, classifier=lr)
plt.xlabel('LD 1')
plt.ylabel('LD 2')
plt.legend(loc='lower left')
plt.tight_layout()
# plt.savefig('figures/05_10.png', dpi=300)
plt.show()
Image in a Jupyter notebook
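As a quick cross-check (not in the book), scikit-learn's LDA also exposes the fraction of between-class variance captured by each discriminant, which should mirror the "discriminability" ratios computed manually above:

# fraction of between-class variance explained by each linear discriminant
print(lda.explained_variance_ratio_)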


Nonlinear dimensionality reduction techniques

Image(filename='figures/05_11.png', width=500)
Image in a Jupyter notebook

Visualizing data via t-distributed stochastic neighbor embedding

from sklearn.datasets import load_digits

digits = load_digits()

fig, ax = plt.subplots(1, 4)

for i in range(4):
    ax[i].imshow(digits.images[i], cmap='Greys')

# plt.savefig('figures/05_12.png', dpi=300)
plt.show()
Image in a Jupyter notebook
digits.data.shape
(1797, 64)
y_digits = digits.target
X_digits = digits.data
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, init='pca', random_state=123)
X_digits_tsne = tsne.fit_transform(X_digits)
/Users/sebastian/miniforge3/lib/python3.9/site-packages/sklearn/manifold/_t_sne.py:790: FutureWarning: The default learning rate in TSNE will change from 200.0 to 'auto' in 1.2. warnings.warn( /Users/sebastian/miniforge3/lib/python3.9/site-packages/sklearn/manifold/_t_sne.py:982: FutureWarning: The PCA initialization in TSNE will change to have the standard deviation of PC1 equal to 1e-4 in 1.2. This will ensure better convergence. warnings.warn(
import matplotlib.patheffects as PathEffects


def plot_projection(x, colors):

    f = plt.figure(figsize=(8, 8))
    ax = plt.subplot(aspect='equal')

    for i in range(10):
        plt.scatter(x[colors == i, 0],
                    x[colors == i, 1])

    for i in range(10):
        xtext, ytext = np.median(x[colors == i, :], axis=0)
        txt = ax.text(xtext, ytext, str(i), fontsize=24)
        txt.set_path_effects([
            PathEffects.Stroke(linewidth=5, foreground="w"),
            PathEffects.Normal()])


plot_projection(X_digits_tsne, y_digits)
# plt.savefig('figures/05_13.png', dpi=300)
plt.show()
Image in a Jupyter notebook
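For comparison (this cell is not part of the book's code), a plain linear PCA projection of the same digits data can be visualized with the same helper function; the digit clusters are typically far less separated than in the t-SNE embedding:

# linear 2D PCA projection of the digits for comparison with t-SNE
pca_digits = PCA(n_components=2)
X_digits_pca = pca_digits.fit_transform(X_digits)

plot_projection(X_digits_pca, y_digits)
plt.show()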


Summary

...


Readers may ignore the next cell.

! python ../.convert_notebook_to_script.py --input ch05.ipynb --output ch05.py
[NbConvertApp] WARNING | Config option `kernel_spec_manager_class` not recognized by `NbConvertApp`. [NbConvertApp] Converting notebook ch05.ipynb to script [NbConvertApp] Writing 20329 bytes to ch05.py