Spark PCA
This is simply an API walkthrough; for more details on PCA itself, consider referring to the following documentation.
Next, since training ML models in Spark requires a single vector-valued input column, we'll use the VectorAssembler
to combine a given list of columns into one vector column.
Next, we standardize the features; notice that here we only need to specify the assembled column as the input feature.
After the preprocessing step, we fit the PCA model.
Notice that unlike scikit-learn, after fitting any Spark ML model (calling .fit
on the dataframe), we apply it by calling .transform
on the dataframe at hand. This returns a new dataframe in which the result is appended as a new column, whose name is specified by the outputCol
argument of the model's class.
We can convert the result back to a numpy array by extracting the pcaFeatures
column from each row and using collect
to bring the entire dataset back to a single machine.