Machine Learning with PyTorch and Scikit-Learn
-- Code Examples
Package version checks
Add folder to path in order to load from the check_packages.py script:
Check recommended package versions:
Chapter 5 - Compressing Data via Dimensionality Reduction
Overview
Unsupervised dimensionality reduction via principal component analysis
The main steps behind principal component analysis
Extracting the principal components step-by-step
Splitting the data into 70% training and 30% test subsets.
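A minimal sketch of this step, assuming the UCI Wine dataset used throughout this chapter; the variable names (df_wine, X_train, y_train, ...) are illustrative:

import pandas as pd
from sklearn.model_selection import train_test_split

# Wine data: column 0 holds the class label, columns 1-13 the features
df_wine = pd.read_csv('https://archive.ics.uci.edu/ml/'
                      'machine-learning-databases/wine/wine.data',
                      header=None)
X, y = df_wine.iloc[:, 1:].values, df_wine.iloc[:, 0].values

# 70/30 split; stratify so both subsets preserve the class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)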
Standardizing the data.
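For example (a sketch continuing from the split above):

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
# only transform (not fit_transform) the test data -- see the note below
X_test_std = sc.transform(X_test)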
Note
Accidentally, I wrote X_test_std = sc.fit_transform(X_test) instead of X_test_std = sc.transform(X_test). In this case, it wouldn't make a big difference since the mean and standard deviation of the test set should be (quite) similar to those of the training set. However, as you remember from Chapter 3, the correct way is to re-use the parameters from the training set if we are doing any kind of transformation -- the test set should basically stand for "new, unseen" data.
My initial typo reflects a common mistake: some people do not re-use these parameters from model training/building and instead standardize the new data "from scratch." Here is a simple example to explain why this is a problem.
Let's assume we have a simple training set consisting of 3 examples with 1 feature (let's call this feature "length"):
train_1: 10 cm -> class_2
train_2: 20 cm -> class_2
train_3: 30 cm -> class_1
mean: 20, std.: 8.2
After standardization, the transformed feature values are
train_std_1: -1.21 -> class_2
train_std_2: 0 -> class_2
train_std_3: 1.21 -> class_1
Next, let's assume our model has learned to classify examples with a standardized length value < 0.6 as class_2 (class_1 otherwise). So far so good. Now, let's say we have 3 unlabeled data points that we want to classify:
new_4: 5 cm -> class ?
new_5: 6 cm -> class ?
new_6: 7 cm -> class ?
If we look at the unstandardized "length" values in our training dataset, it is intuitive to say that all of these examples likely belong to class_2. However, if we standardize them by re-computing the standard deviation and mean from the new data, we would get similar values as before in the training set, and our classifier would assign examples 4 and 5 to class_2 but would (incorrectly) assign example 6 to class_1.
new_std_4: -1.21 -> class_2
new_std_5: 0 -> class_2
new_std_6: 1.21 -> class_1
However, if we use the parameters from our earlier "training set standardization" (mean 20, std. 8.2), we get the values:
new_std_4: -1.84 -> class_2
new_std_5: -1.71 -> class_2
new_std_6: -1.59 -> class_2
The values 5 cm, 6 cm, and 7 cm are much lower than anything we have seen in the training set previously. Thus, it only makes sense that the standardized features of the "new examples" are much lower than every standardized feature in the training set.
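The following small NumPy sketch reproduces the toy numbers above and contrasts the two standardization strategies:

import numpy as np

train = np.array([10., 20., 30.])   # training lengths in cm
new = np.array([5., 6., 7.])        # new, unlabeled lengths in cm

mu, sigma = train.mean(), train.std()        # 20.0 and approx. 8.2

print((train - mu) / sigma)                  # approx. [-1.22, 0., 1.22]

# wrong: re-estimating mean and std from the new data "from scratch"
print((new - new.mean()) / new.std())        # approx. [-1.22, 0., 1.22]

# correct: re-using the training-set parameters
print((new - mu) / sigma)                    # approx. [-1.84, -1.71, -1.59]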
Eigendecomposition of the covariance matrix.
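A sketch of this step, using the standardized Wine training data from above:

import numpy as np

# covariance matrix of the standardized training features (13x13 for Wine)
cov_mat = np.cov(X_train_std.T)
eigen_vals, eigen_vecs = np.linalg.eig(cov_mat)
print('Eigenvalues \n', eigen_vals)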
Note:
Above, I used the numpy.linalg.eig function to decompose the symmetric covariance matrix into its eigenvalues and eigenvectors:
>>> eigen_vals, eigen_vecs = np.linalg.eig(cov_mat)
This is not really a "mistake," but probably suboptimal. It would be better to use numpy.linalg.eigh in such cases, which has been designed for Hermitian matrices. The latter always returns real eigenvalues, whereas the numerically less stable np.linalg.eig is designed to decompose nonsymmetric square matrices; you may find that it returns complex eigenvalues in certain cases. (S.R.)
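For reference, the more stable variant suggested in the note would look like this (note that numpy.linalg.eigh returns the eigenvalues in ascending order):

# numerically more stable for symmetric/Hermitian matrices
eigen_vals, eigen_vecs = np.linalg.eigh(cov_mat)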
Total and explained variance
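A sketch of how the explained variance ratios can be computed from the eigenvalues and visualized (the range 1..13 assumes the 13 Wine features):

import matplotlib.pyplot as plt

tot = sum(eigen_vals)
var_exp = [(i / tot) for i in sorted(eigen_vals, reverse=True)]
cum_var_exp = np.cumsum(var_exp)

plt.bar(range(1, 14), var_exp, align='center',
        label='Individual explained variance')
plt.step(range(1, 14), cum_var_exp, where='mid',
         label='Cumulative explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal component index')
plt.legend(loc='best')
plt.tight_layout()
plt.show()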
Feature transformation
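A sketch of the transformation: sort the eigenpairs, stack the two leading eigenvectors into a projection matrix W, and project the data:

# sort the (eigenvalue, eigenvector) pairs from highest to lowest eigenvalue
eigen_pairs = [(np.abs(eigen_vals[i]), eigen_vecs[:, i])
               for i in range(len(eigen_vals))]
eigen_pairs.sort(key=lambda k: k[0], reverse=True)

# projection matrix W from the two leading eigenvectors (13x2 for Wine)
w = np.hstack((eigen_pairs[0][1][:, np.newaxis],
               eigen_pairs[1][1][:, np.newaxis]))

# map the standardized training data onto the 2D principal subspace
X_train_pca = X_train_std.dot(w)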
Note
Depending on which version of NumPy and LAPACK you are using, you may obtain the matrix W with its signs flipped. Please note that this is not an issue: if $v$ is an eigenvector of a matrix $\Sigma$, we have $\Sigma v = \lambda v$, where $\lambda$ is our eigenvalue; then $-v$ is also an eigenvector with the same eigenvalue, since $\Sigma(-v) = -\Sigma v = -\lambda v = \lambda(-v)$.
Principal component analysis in scikit-learn
NOTE
The following four code cells have been added in addition to the content of the book, to illustrate how to replicate the results from our own PCA implementation in scikit-learn:
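A condensed sketch of that comparison (not the original four cells) might look as follows:

from sklearn.decomposition import PCA

pca = PCA()                                   # keep all components
X_train_pca = pca.fit_transform(X_train_std)
# should match the explained-variance ratios computed manually above
print(pca.explained_variance_ratio_)

pca = PCA(n_components=2)                     # keep the first two components
X_train_pca = pca.fit_transform(X_train_std)
X_test_pca = pca.transform(X_test_std)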
Training logistic regression classifier using the first 2 principal components.
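For instance (a sketch reusing the 2-component X_train_pca / X_test_pca from above; the hyperparameters are illustrative):

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(random_state=1)
lr.fit(X_train_pca, y_train)

print('Training accuracy:', lr.score(X_train_pca, y_train))
print('Test accuracy:', lr.score(X_test_pca, y_test))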
Assessing feature contributions
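One way to assess the contributions is via the factor loadings, i.e., the eigenvectors scaled by the square roots of the eigenvalues (a sketch; pca refers to the fitted scikit-learn object from above):

# loadings: how strongly each original feature correlates with a component
loadings = eigen_vecs * np.sqrt(eigen_vals)

# the equivalent quantity from the fitted scikit-learn PCA object
sklearn_loadings = pca.components_.T * np.sqrt(pca.explained_variance_)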
Supervised data compression via linear discriminant analysis
Principal component analysis versus linear discriminant analysis
The inner workings of linear discriminant analysis
Computing the scatter matrices
Calculate the mean vectors for each class:
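A sketch, assuming the standardized Wine training data and its class labels 1-3:

np.set_printoptions(precision=4)

mean_vecs = []
for label in range(1, 4):
    mean_vecs.append(np.mean(X_train_std[y_train == label], axis=0))
    print(f'MV {label}: {mean_vecs[label - 1]}\n')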
Compute the within-class scatter matrix:
Better: use the covariance matrix (the class scatter scaled by the class sample counts), since the classes are not equally distributed:
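A sketch covering both variants (d = 13 features for the Wine data):

d = 13  # number of features

# naive within-class scatter matrix: sum of the per-class scatter matrices
S_W = np.zeros((d, d))
for label, mv in zip(range(1, 4), mean_vecs):
    class_scatter = np.zeros((d, d))
    for row in X_train_std[y_train == label]:
        row, mv = row.reshape(d, 1), mv.reshape(d, 1)
        class_scatter += (row - mv).dot((row - mv).T)
    S_W += class_scatter

# scaled version: summing the per-class covariance matrices accounts for
# the unequal class sizes
S_W = np.zeros((d, d))
for label in range(1, 4):
    S_W += np.cov(X_train_std[y_train == label].T)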
Compute the between-class scatter matrix:
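A sketch (mean_vecs and d as defined above; n is the number of examples in each class):

mean_overall = np.mean(X_train_std, axis=0).reshape(d, 1)

S_B = np.zeros((d, d))
for i, mean_vec in enumerate(mean_vecs):
    n = X_train_std[y_train == i + 1, :].shape[0]
    mean_vec = mean_vec.reshape(d, 1)
    S_B += n * (mean_vec - mean_overall).dot((mean_vec - mean_overall).T)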
Selecting linear discriminants for the new feature subspace
Solve the generalized eigenvalue problem for the matrix $S_W^{-1} S_B$:
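This amounts to a single eigendecomposition, for example:

# eigendecomposition of S_W^{-1} S_B
eigen_vals, eigen_vecs = np.linalg.eig(np.linalg.inv(S_W).dot(S_B))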
Note:
Above, I used the numpy.linalg.eig function to decompose the symmetric covariance matrix into its eigenvalues and eigenvectors:
>>> eigen_vals, eigen_vecs = np.linalg.eig(cov_mat)
This is not really a "mistake," but probably suboptimal. It would be better to use numpy.linalg.eigh in such cases, which has been designed for Hermitian matrices. The latter always returns real eigenvalues, whereas the numerically less stable np.linalg.eig is designed to decompose nonsymmetric square matrices; you may find that it returns complex eigenvalues in certain cases. (S.R.)
Sort eigenvectors in descending order of the eigenvalues:
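For example:

# sort (eigenvalue, eigenvector) tuples by decreasing eigenvalue
eigen_pairs = [(np.abs(eigen_vals[i]), eigen_vecs[:, i])
               for i in range(len(eigen_vals))]
eigen_pairs = sorted(eigen_pairs, key=lambda k: k[0], reverse=True)

print('Eigenvalues in descending order:\n')
for eigen_val in eigen_pairs:
    print(eigen_val[0])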
Projecting examples onto the new feature space
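A sketch: stack the two most discriminative eigenvectors into a transformation matrix and project the standardized training data onto the 2D LDA subspace (taking the real part, since np.linalg.eig may return complex values):

w = np.hstack((eigen_pairs[0][1][:, np.newaxis].real,
               eigen_pairs[1][1][:, np.newaxis].real))

X_train_lda = X_train_std.dot(w)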
LDA via scikit-learn
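A minimal sketch of the scikit-learn equivalent, combined with a logistic regression classifier on the 2D LDA features (hyperparameters are illustrative):

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.linear_model import LogisticRegression

lda = LDA(n_components=2)
X_train_lda = lda.fit_transform(X_train_std, y_train)
X_test_lda = lda.transform(X_test_std)

lr = LogisticRegression(random_state=1)
lr.fit(X_train_lda, y_train)
print('Test accuracy:', lr.score(X_test_lda, y_test))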
Nonlinear dimensionality reduction techniques
Visualizing data via t-distributed stochastic neighbor embedding
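A minimal sketch, assuming scikit-learn's built-in digits dataset as the example to embed (the t-SNE settings here are illustrative):

import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

digits = load_digits()
X_digits, y_digits = digits.data, digits.target

# t-SNE is intended for visualization only; it does not learn a reusable
# mapping that could be applied to new data points
tsne = TSNE(n_components=2, init='pca', random_state=123)
X_digits_tsne = tsne.fit_transform(X_digits)

plt.scatter(X_digits_tsne[:, 0], X_digits_tsne[:, 1],
            c=y_digits, cmap='tab10', s=10)
plt.xlabel('t-SNE dimension 1')
plt.ylabel('t-SNE dimension 2')
plt.tight_layout()
plt.show()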
Summary
...
Readers may ignore the next cell.