GitHub Repository: YStrano/DataScience_GA
Path: blob/master/lessons/lesson_15/PCA - (done).ipynb
¹⁹⁰⁴ views

Kernel: Python 3

Load in some Economic Data

Note that this data has been scaled and normalized so that everything has a mean of 0 and a standard deviation of 1 (z-score).

In [1]:

%matplotlib inline
import pandas as pd
from sklearn import preprocessing
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('financial_indicators3.csv')
df.set_index('DATE',inplace=True)
df.dropna(inplace=True)

## Normalize
df = df.apply(lambda x: (x-x.mean())/ x.std(), axis=1) # ALWAYS DO THIS TO NORMALIZE FOR PCA / this is a z-score
df.head()

Out[1]:

In [2]:

df.shape

Out[2]:

(349, 8)

Lets look at it

In [3]:

df.plot()

Out[3]:

<matplotlib.axes._subplots.AxesSubplot at 0xb316d68>

This is a mess, so lets smooth it out by doing a 12 month rolling average

In [4]:

df.rolling(12).mean().plot(legend=False) #rolls 1-12, then 2-13, 3-14 etc. there is overlap

Out[4]:

<matplotlib.axes._subplots.AxesSubplot at 0x1a1c67f390>

This is better, but you can still see a lot of very correlated variables and two particularly volatile ones that seem to be negatively correlated

In [5]:

df.corr()

Out[5]:

In [6]:

sns.heatmap(df.corr())

Out[6]:

<matplotlib.axes._subplots.AxesSubplot at 0x11171dda0>

PCA can help!

The below code shows that with two "principal components" you can capture > 97% of the variation!

In [7]:

from sklearn.decomposition import PCA
pca = PCA(n_components = 2)
pca.fit(df.drop('inflation', axis=1))
pca.explained_variance_ratio_

Out[7]:

array([0.77973177, 0.19289192])

Lets look at these things

In [8]:

pcs = pca.transform(df.drop('inflation',axis=1))
pcs = pd.DataFrame(pcs, columns = ['PC1','PC2'])
pcs.head()

Out[8]:

On their own, they are uninterpretable

In [9]:

pcs.columns = ['thing_1','thing_2']

Lets see how they vary over time

In [10]:

pcs.index = pd.to_datetime(df.index)
pcs['thing_1'].rolling(12).mean().plot()

#done in z scores, -3 -> 3, 3 stds

Out[10]:

<matplotlib.axes._subplots.AxesSubplot at 0x1a200ca1d0>

Now I can use this single feature in a regression without all of that noise clogging my outputs

In [11]:

df.head()

Out[11]:

In [12]:

pcs.head()

Out[12]:

In [13]:

df['PC_1'] = pcs['thing_1']
df.head()

Out[13]:

In [19]:

?LinearRegression().fit()

In [23]:

from sklearn.linear_model import LinearRegression
regression = LinearRegression()
regression.fit(df['PC_1'].values.reshape(-1,1), df['inflation'])

Out[23]:

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [24]:

regression.coef_

Out[24]:

array([-0.13271296])

In [27]:

regression.score(df['PC_1'].values.reshape(-1,1), df['inflation'])

Out[27]:

0.7790902187113105

In [33]:

df[['PC_1','inflation']].rolling(12).mean().plot()

Out[33]:

<matplotlib.axes._subplots.AxesSubplot at 0x1a2103fdd8>

In [ ]:

#df[['PC_1','inflation']].rolling(12).mean().to_csv('pc_inflation.csv')

see csv example that shows graph of inflation negatively correlated to pcs 1 - this is what was driving it all along... replicate that graph in python with matplotlib

In [ ]:

In [ ]:

In [ ]:

In [ ]:

In [ ]:

In [ ]:

In [ ]:

PCA for Plotting

In [34]:

del(df)
df = pd.read_csv('capitol_words.csv', sep='|', encoding='utf-8')
df['speaker_name'].value_counts()

Out[34]:

Bernie Sanders     2241
Joseph Biden       1854
Rick Santorum      1613
Mike Pence         1238
Lindsey Graham     1158
Hillary Clinton     830
Rand Paul           455
Barack Obama        411
Jim Webb            381
Ted Cruz            365
Marco Rubio         359
John Kasich         316
Lincoln Chafee      154
Joe Biden             1
Name: speaker_name, dtype: int64

In [35]:

speaker_1 = df.loc[df['speaker_name'] == 'Rick Santorum', ['speaker_name', 'text']]
speaker_2 = df.loc[df['speaker_name'] == 'Joseph Biden', ['speaker_name', 'text']]

df_new = pd.concat([speaker_1, speaker_2])
df_new.reset_index(inplace=True)
df_new.head()

Out[35]:

In [13]:

df_new.shape

Out[13]:

(3467, 3)

In [14]:

df_new.tail()

Out[14]:

In [15]:

from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer(lowercase=True, stop_words='english')
speeches = vec.fit_transform(df_new['text'])
speeches

Out[15]:

<3467x29674 sparse matrix of type '<class 'numpy.float64'>'
	with 523660 stored elements in Compressed Sparse Row format>

In [16]:

from sklearn.decomposition import PCA
pca = PCA(n_components = 2)
speeches_pca = pca.fit_transform(speeches.todense())

In [17]:

speeches_pca.shape

Out[17]:

(3467, 2)

In [18]:

vis_df = pd.DataFrame(speeches_pca, columns = ['PC1','PC2'])
vis_df.head()

Out[18]:

In [19]:

df = pd.concat([df_new, vis_df], axis=1)

In [20]:

df.head()

Out[20]:

In [21]:

sns.lmplot(x='PC1',y='PC2', hue='speaker_name', fit_reg = False, data = df)

Out[21]:

<seaborn.axisgrid.FacetGrid at 0x1a21766748>

In [ ]: