Path: blob/master/Machine Learning Unsupervised Methods/Day 4 Principal Component Analysis (PCA).ipynb
Dimension reduction
A technique used in machine learning and statistics to reduce the number of input variables in a dataset while retaining as much information as possible. The primary goals are to simplify models, reduce computational cost, and help in visualizing data.
High Dimension:
A dataset is high-dimensional when the number of features is very large, especially relative to the number of samples.
e.g. images, where every pixel is a feature
e.g. a dataset with 90+ features but only 100 rows
It can be classified into two main types:
Feature Selection: Selecting a subset of the original variables.
Feature Extraction: Transforming the data from a high-dimensional space to a low-dimensional space.
Purpose: To simplify models, reduce computational cost, and improve model performance.
Techniques:
Feature Selection: Methods like Forward Selection, Backward Elimination, and Recursive Feature Elimination (RFE).
Feature Extraction: Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), t-Distributed Stochastic Neighbor Embedding (t-SNE), and Autoencoders.
Benefits: Removes multicollinearity, improves visualization, reduces overfitting, and speeds up training.
Applications: Used in fields like machine learning, pattern recognition, and signal processing.
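The feature-selection side can be sketched with Recursive Feature Elimination (RFE) from scikit-learn; the synthetic dataset and the choice of logistic regression as the estimator below are illustrative, not prescribed by the notes.

```python
# A minimal sketch of feature selection with Recursive Feature
# Elimination (RFE); dataset and estimator are illustrative.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# 100 rows but 20 features: many more columns than informative signals
X, y = make_classification(n_samples=100, n_features=20,
                           n_informative=5, random_state=0)

# RFE repeatedly fits the estimator and drops the weakest feature
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5)
selector.fit(X, y)

X_reduced = selector.transform(X)  # keep only the 5 surviving columns
print(X_reduced.shape)             # (100, 5)
```

Unlike feature extraction, the surviving columns are original features, so the reduced dataset stays interpretable.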
PCA
Principal Component Analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of possibly correlated variables into a set of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables.
Mathematical Formulation of PCA
Standardize the Data: Center the data by subtracting the mean of each feature and scaling by the standard deviation.
Standardization ensures that each feature contributes equally to the analysis.
Covariance Matrix captures the variance and covariance of the data.
Eigenvalues and Eigenvectors determine the principal components, where eigenvalues indicate the amount of variance captured by each principal component.
Transformation reduces the dimensionality of the data by projecting it onto the selected principal components, which capture the most variance.
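The four steps above can be sketched from scratch with NumPy alone; the toy data and variable names here are illustrative.

```python
# From-scratch PCA following the steps above: standardize,
# covariance matrix, eigen-decomposition, projection.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # toy data: 100 samples, 3 features

# 1. Standardize: subtract the mean, divide by the standard deviation
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the standardized features (3 x 3)
cov = np.cov(X_std, rowvar=False)

# 3. Eigenvalues/eigenvectors; eigh is used because cov is symmetric
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]      # sort by variance, largest first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Project onto the top k eigenvectors (the principal components)
k = 2
X_pca = X_std @ eigvecs[:, :k]

print(X_pca.shape)                     # (100, 2)
```

Each eigenvalue is the variance captured along its component, so keeping the top k eigenvectors keeps the directions of greatest variance.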
PCA is widely used for data compression, visualization, and noise reduction. It helps in understanding the structure of the data and finding patterns that are not apparent in high-dimensional spaces.
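In practice the same pipeline is a few lines with scikit-learn's PCA; the Iris dataset below is just a convenient example of reducing data to two components for visualization.

```python
# PCA via scikit-learn: standardize, then project onto 2 components.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data                       # 150 samples, 4 features
X_std = StandardScaler().fit_transform(X)  # standardize first

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_std)            # visualization-ready 2-D data

print(X_2d.shape)                          # (150, 2)
print(pca.explained_variance_ratio_)       # variance captured by each PC
```

`explained_variance_ratio_` reports the fraction of total variance each component captures, which is the usual guide for choosing how many components to keep.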