Book a Demo!
CoCalc Logo Icon
StoreFeaturesDocsShareSupportNewsAboutPoliciesSign UpSign In
YStrano
GitHub Repository: YStrano/DataScience_GA
Path: blob/master/lessons/lesson_15/PCA - (done).ipynb
1904 views
Kernel: Python 3

Load in some Economic Data

Note that this data has been scaled and normalized so that everything has a mean of 0 and a standard deviation of 1 (z-score).

%matplotlib inline import pandas as pd from sklearn import preprocessing import matplotlib.pyplot as plt import seaborn as sns df = pd.read_csv('financial_indicators3.csv') df.set_index('DATE',inplace=True) df.dropna(inplace=True) ## Normalize df = df.apply(lambda x: (x-x.mean())/ x.std(), axis=1) # ALWAYS DO THIS TO NORMALIZE FOR PCA / this is a z-score df.head()
df.shape
(349, 8)

Lets look at it

df.plot()
<matplotlib.axes._subplots.AxesSubplot at 0xb316d68>
Image in a Jupyter notebook

This is a mess, so lets smooth it out by doing a 12 month rolling average

df.rolling(12).mean().plot(legend=False) #rolls 1-12, then 2-13, 3-14 etc. there is overlap
<matplotlib.axes._subplots.AxesSubplot at 0x1a1c67f390>
Image in a Jupyter notebook

This is better, but you can still see a lot of very correlated variables and two particularly volatile ones that seem to be negatively correlated

df.corr()
sns.heatmap(df.corr())
<matplotlib.axes._subplots.AxesSubplot at 0x11171dda0>
Image in a Jupyter notebook

PCA can help!

The below code shows that with two "principal components" you can capture > 97% of the variation!

from sklearn.decomposition import PCA pca = PCA(n_components = 2) pca.fit(df.drop('inflation', axis=1)) pca.explained_variance_ratio_
array([0.77973177, 0.19289192])

Lets look at these things

pcs = pca.transform(df.drop('inflation',axis=1)) pcs = pd.DataFrame(pcs, columns = ['PC1','PC2']) pcs.head()

On their own, they are uninterpretable

pcs.columns = ['thing_1','thing_2']

Lets see how they vary over time

pcs.index = pd.to_datetime(df.index) pcs['thing_1'].rolling(12).mean().plot() #done in z scores, -3 -> 3, 3 stds
<matplotlib.axes._subplots.AxesSubplot at 0x1a200ca1d0>
Image in a Jupyter notebook

Now I can use this single feature in a regression without all of that noise clogging my outputs

df.head()
pcs.head()
df['PC_1'] = pcs['thing_1'] df.head()
?LinearRegression().fit()
from sklearn.linear_model import LinearRegression regression = LinearRegression() regression.fit(df['PC_1'].values.reshape(-1,1), df['inflation'])
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
regression.coef_
array([-0.13271296])
regression.score(df['PC_1'].values.reshape(-1,1), df['inflation'])
0.7790902187113105
df[['PC_1','inflation']].rolling(12).mean().plot()
<matplotlib.axes._subplots.AxesSubplot at 0x1a2103fdd8>
Image in a Jupyter notebook
#df[['PC_1','inflation']].rolling(12).mean().to_csv('pc_inflation.csv')

see csv example that shows graph of inflation negatively correlated to pcs 1 - this is what was driving it all along... replicate that graph in python with matplotlib

PCA for Plotting

del(df) df = pd.read_csv('capitol_words.csv', sep='|', encoding='utf-8') df['speaker_name'].value_counts()
Bernie Sanders 2241 Joseph Biden 1854 Rick Santorum 1613 Mike Pence 1238 Lindsey Graham 1158 Hillary Clinton 830 Rand Paul 455 Barack Obama 411 Jim Webb 381 Ted Cruz 365 Marco Rubio 359 John Kasich 316 Lincoln Chafee 154 Joe Biden 1 Name: speaker_name, dtype: int64
speaker_1 = df.loc[df['speaker_name'] == 'Rick Santorum', ['speaker_name', 'text']] speaker_2 = df.loc[df['speaker_name'] == 'Joseph Biden', ['speaker_name', 'text']] df_new = pd.concat([speaker_1, speaker_2]) df_new.reset_index(inplace=True) df_new.head()
df_new.shape
(3467, 3)
df_new.tail()
from sklearn.feature_extraction.text import TfidfVectorizer vec = TfidfVectorizer(lowercase=True, stop_words='english') speeches = vec.fit_transform(df_new['text']) speeches
<3467x29674 sparse matrix of type '<class 'numpy.float64'>' with 523660 stored elements in Compressed Sparse Row format>
from sklearn.decomposition import PCA pca = PCA(n_components = 2) speeches_pca = pca.fit_transform(speeches.todense())
speeches_pca.shape
(3467, 2)
vis_df = pd.DataFrame(speeches_pca, columns = ['PC1','PC2']) vis_df.head()
df = pd.concat([df_new, vis_df], axis=1)
df.head()
sns.lmplot(x='PC1',y='PC2', hue='speaker_name', fit_reg = False, data = df)
<seaborn.axisgrid.FacetGrid at 0x1a21766748>
Image in a Jupyter notebook