Path: blob/master/Machine Learning Unsupervised Methods/Day 4 Principal Component Analysis (PCA).ipynb
Dimension reduction
A technique used in machine learning and statistics to reduce the number of input variables in a dataset while retaining as much information as possible. The primary goals are to simplify models, reduce computational cost, and help in visualizing data.
High Dimension:
A dataset is high-dimensional when the number of features is very large, especially relative to the number of samples.
e.g. images, where every pixel is a feature
e.g. a dataset with 90+ features but only 100 rows
It can be classified into two main types:
Feature Selection: Selecting a subset of the original variables.
Feature Extraction: Transforming the data from a high-dimensional space to a low-dimensional space.
Purpose: To simplify models, reduce computational cost, and improve model performance.
Techniques:
Feature Selection: Methods like Forward Selection, Backward Elimination, and Recursive Feature Elimination (RFE).
Feature Extraction: Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), t-Distributed Stochastic Neighbor Embedding (t-SNE), and Autoencoders.
Benefits: Removes multicollinearity, improves visualization, reduces overfitting, and speeds up training.
Applications: Used in fields like machine learning, pattern recognition, and signal processing.
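The feature-selection side can be sketched with Recursive Feature Elimination (RFE) from scikit-learn; the synthetic dataset and the choice of logistic regression as the estimator below are illustrative, not prescribed by the notes.

```python
# A minimal sketch of feature selection with Recursive Feature
# Elimination (RFE); dataset and estimator are illustrative.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# 100 rows but 20 features: many more columns than informative signals
X, y = make_classification(n_samples=100, n_features=20,
                           n_informative=5, random_state=0)

# RFE repeatedly fits the estimator and drops the weakest feature
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5)
selector.fit(X, y)

X_reduced = selector.transform(X)  # keep only the 5 surviving columns
print(X_reduced.shape)             # (100, 5)
```

Unlike feature extraction, the surviving columns are original features, so the reduced dataset stays interpretable.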
PCA
Principal Component Analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of possibly correlated variables into a set of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables.
Mathematical Formulation of PCA
Standardize the Data: Center the data by subtracting the mean of each feature and scaling by the standard deviation.
Standardization ensures that each feature contributes equally to the analysis.
Covariance Matrix captures the variance and covariance of the data.
Eigenvalues and Eigenvectors determine the principal components, where eigenvalues indicate the amount of variance captured by each principal component.
Transformation reduces the dimensionality of the data by projecting it onto the selected principal components, which capture the most variance.
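The four steps above can be sketched from scratch with NumPy alone; the toy data and variable names here are illustrative.

```python
# From-scratch PCA following the steps above: standardize,
# covariance matrix, eigen-decomposition, projection.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # toy data: 100 samples, 3 features

# 1. Standardize: subtract the mean, divide by the standard deviation
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the standardized features (3 x 3)
cov = np.cov(X_std, rowvar=False)

# 3. Eigenvalues/eigenvectors; eigh is used because cov is symmetric
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]      # sort by variance, largest first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Project onto the top k eigenvectors (the principal components)
k = 2
X_pca = X_std @ eigvecs[:, :k]

print(X_pca.shape)                     # (100, 2)
```

Each eigenvalue is the variance captured along its component, so keeping the top k eigenvectors keeps the directions of greatest variance.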
PCA is widely used for data compression, visualization, and noise reduction. It helps in understanding the structure of the data and finding patterns that are not apparent in high-dimensional spaces.
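In practice the same pipeline is a few lines with scikit-learn's PCA; the Iris dataset below is just a convenient example of reducing data to two components for visualization.

```python
# PCA via scikit-learn: standardize, then project onto 2 components.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data                       # 150 samples, 4 features
X_std = StandardScaler().fit_transform(X)  # standardize first

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_std)            # visualization-ready 2-D data

print(X_2d.shape)                          # (150, 2)
print(pca.explained_variance_ratio_)       # variance captured by each PC
```

`explained_variance_ratio_` reports the fraction of total variance each component captures, which is the usual guide for choosing how many components to keep.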