GitHub Repository: YStrano/DataScience_GA
Path: blob/master/lessons/lesson_05/code/starter-code/demo-lesson-06-starter - (done).ipynb
Kernel: Python 3

## Lesson 05 Demo

```python
%matplotlib inline
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
sns.set_style("darkgrid")

# this is the standard import if you're using "formula notation" (similar to R)
import statsmodels.formula.api as smf
```
```python
# read data into a DataFrame
data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv', index_col=0)
data.head()
```

# Checks for Linear Regression

It works best when:

1. The data is normally distributed (strictly speaking this applies to the residuals, and even then it doesn't have to hold exactly)
2. The X's are independent of each other (low multicollinearity)
3. The X's significantly explain y (have low p-values)

### Check 1: Distribution

Last time, we plotted our data like this:

```python
# visualize the relationship between the features and the response using scatterplots
fig, axs = plt.subplots(1, 3, sharey=True)
data.plot(kind='scatter', x='TV', y='sales', ax=axs[0], figsize=(16, 8))
data.plot(kind='scatter', x='radio', y='sales', ax=axs[1])
data.plot(kind='scatter', x='newspaper', y='sales', ax=axs[2])
```
<matplotlib.axes._subplots.AxesSubplot at 0x1139a98d0>
Image in a Jupyter notebook
- TV vs. sales: non-constant variance (heteroscedasticity), diminishing returns

```python
sns.lmplot(x='TV', y='sales', data=data)
```
<seaborn.axisgrid.FacetGrid at 0x1138d9a20>
Image in a Jupyter notebook
```python
sns.lmplot(x='radio', y='sales', data=data)
sns.lmplot(x='newspaper', y='sales', data=data)
```
<seaborn.axisgrid.FacetGrid at 0x117be7cc0>
Image in a Jupyter notebook
Image in a Jupyter notebook

### Check 2: Low Multicollinearity

```python
cmap = sns.diverging_palette(220, 10, as_cmap=True)
correlations = data[['TV', 'radio', 'newspaper']].corr()
print(correlations)
sns.heatmap(correlations, cmap=cmap)
```
```
                 TV     radio  newspaper
TV         1.000000  0.054809   0.056648
radio      0.054809  1.000000   0.354104
newspaper  0.056648  0.354104   1.000000
```
Image in a Jupyter notebook
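The correlation matrix only measures pairwise relationships; a common complementary check is the variance inflation factor (VIF), which catches a predictor that is well explained by a *combination* of the others. statsmodels ships `variance_inflation_factor`, but here is a self-contained numpy sketch on synthetic data (the variables and thresholds are illustrative, not from the Advertising data):

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of a 2-D array X.

    VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
    column j on all the other columns (with an intercept).
    Values near 1 mean low multicollinearity; above ~5-10 is a red flag.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])   # add intercept column
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return out

# synthetic demo: x2 is nearly a copy of x1, x3 is independent
rng = np.random.default_rng(42)
x1 = rng.normal(size=500)
x2 = x1 + 0.1 * rng.normal(size=500)   # highly collinear with x1
x3 = rng.normal(size=500)              # independent
v = vif(np.column_stack([x1, x2, x3]))
print(v)
```

The collinear pair gets very large VIFs while the independent column stays near 1; running the same function on `data[['TV', 'radio', 'newspaper']].values` would quantify the heatmap above.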

Student question:

  1. Do these variables have collinearity?

Answer:

### Check 3: X's significantly explain y (have low p-values)

Let's take another look at the crude model.

```python
# fit the regression model: sales is what you are trying to predict
# (the dependent variable) and TV is your predictor (the independent variable);
# the coefficient is the number you multiply your predictor by to get your prediction
lm = smf.ols(formula='sales ~ TV', data=data).fit()

# print the full summary
lm.summary()
```
- Intercept: the value you would expect for the response (here, sales) when the predictors are zero, in this case TV = 0.

- TV (the predictor's coefficient): the amount the response (sales) goes up when the predictor is raised by 1 unit. Here, sales would go up by 0.0475 for every unit increase in TV.

- std err: taking a coefficient and adding/subtracting 2 × std err gives the approximate lower and upper bounds of its 95% confidence interval. It works like the standard deviation: going out two bands covers roughly 95% of a normal distribution.

- R squared: the proportion of the variability in the response (around its mean) that the model explains. An R² of 100% would indicate that the model explains all of that variability.
- An R² of 100% in this case would suggest that TV is the sole driver of sales.
- Here, our model suggests that TV explains 61% of the variance in sales, leaving 39% to other variables.

- Note that if you were to invest 100 dollars in TV, sales would NOT increase by 61%; they would only increase by 100 × 0.0475 = $4.75.

Important things to look out for when reading the results of a linear regression model:

- R²: make sure you have a high value
- P value: make sure it is below .05
- Confidence interval (the [0.025, 0.975] columns): if the range contains 0, the coefficient is not statistically significant — you will see this whenever the p-value is large.
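The ±2 × std err rule above can be verified with a little arithmetic. This sketch fits a one-variable OLS by hand with numpy on synthetic data (illustrative numbers, not the Advertising data) and builds the approximate 95% interval for the slope:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 100, size=200)
y = 7.0 + 0.05 * x + rng.normal(scale=2.0, size=200)   # true slope is 0.05

# OLS fit via least squares
A = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, slope = coef
resid = y - A @ coef

# standard error of the slope
n, p = A.shape
sigma2 = resid @ resid / (n - p)                    # residual variance
se_slope = np.sqrt(sigma2 / ((x - x.mean()) ** 2).sum())

# approximate 95% confidence interval: coefficient +/- 2 * std err
lo, hi = slope - 2 * se_slope, slope + 2 * se_slope
print(f"slope = {slope:.4f}, 95% CI ≈ [{lo:.4f}, {hi:.4f}]")
```

With the real model you get the same bounds directly from `lm.conf_int()` or the summary table.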

### Student Model

Now fit a full model with TV, radio, and newspaper.

syntax can be found here: http://statsmodels.sourceforge.net/devel/example_formulas.html

```python
# fit the full model with all three media buys
lm = smf.ols(formula='sales ~ TV + radio + newspaper', data=data).fit()
# print the full summary
lm.summary()
```
```python
# fit a reduced model with TV and radio only
lm = smf.ols(formula='sales ~ TV + radio', data=data).fit()
# print the full summary
lm.summary()
```
```python
X_variables = data[['TV', 'radio', 'newspaper']]
Y = data['sales']
```
```python
from itertools import combinations
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# fit a model for every pair of predictors and record its R^2
r_squared = []
model = LinearRegression()
for i in combinations(X_variables.columns, 2):
    model.fit(data[list(i)], Y)
    preds = model.predict(data[list(i)])
    r_2 = r2_score(Y, preds)
    r_squared.append([i, r_2])
r_squared
```
```
[[('TV', 'radio'), 0.8971942610828956],
 [('TV', 'newspaper'), 0.6458354938293271],
 [('radio', 'newspaper'), 0.33270518395032256]]
```
```python
X_variables = data[['TV', 'radio', 'newspaper']]
```
```python
from itertools import combinations
for i in combinations(X_variables.columns, 2):
    print(i)
```
```
('TV', 'radio')
('TV', 'newspaper')
('radio', 'newspaper')
```
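When comparing models with different numbers of predictors, plain R² never decreases as you add terms, so it quietly favors the bigger model; adjusted R² adds a penalty for each extra predictor. A small sketch of the formula, using the ('TV', 'radio') R² from the pairwise results above and n = 200 (the number of rows in the Advertising dataset):

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1),
    where n = number of observations and p = number of predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# the same raw R^2 scores slightly lower once a third predictor is charged for
a2 = adjusted_r2(0.897, n=200, p=2)   # 2-predictor model
a3 = adjusted_r2(0.897, n=200, p=3)   # hypothetical 3-predictor model, same R^2
print(a2, a3)
```

statsmodels reports the same quantity as `rsquared_adj` on a fitted OLS result.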

1. Which of the media buys were significantly associated with sales?

Answer:

2. Controlling for all the other media buys, which media type had the largest association with sales?

Answer:

3. Given that one of the variables above was not significant, do we drop it from our model? Why or why not?

Answer: