Spark PCA
This is simply an API walkthrough; for more details on PCA itself, consider referring to the following documentation.
First, since Spark ML models expect their input features in a single vector column, we use VectorAssembler to combine a given list of columns into one vector column.
Next, we standardize the features; note that we only need to specify the assembled vector column as the input.
After the preprocessing step, we fit the PCA model.
Notice that unlike scikit-learn, for all Spark ML estimators we first call .fit with the dataframe, then call .transform on the fitted model with the dataframe at hand. This returns a new dataframe with the result in an extra column, whose name is specified by the outputCol argument of the estimator.
We can convert the result back to a NumPy array by calling collect to bring the entire dataset back to a single machine and extracting the pcaFeatures column from each row.