GitHub Repository: YStrano/DataScience_GA
Path: blob/master/lessons/lesson_05/code/starter-code/demo-lesson-06-starter - (done).ipynb
Kernel: Python 3

## Lesson 05 Demo

```python
%matplotlib inline
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
sns.set_style("darkgrid")

# this is the standard import if you're using "formula notation" (similar to R)
import statsmodels.formula.api as smf
```
```python
# read data into a DataFrame
data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv', index_col=0)
data.head()
```

# Checks for Linear Regression

It works best when:

1. The data is normally distributed (strictly speaking this applies to the residuals, and even then it doesn't have to hold exactly)
2. The X's are independent of each other (low multicollinearity)
3. The X's significantly explain y (have low p-values)

### Check 1: Distribution

Last time, we plotted our data like this:

```python
# visualize the relationship between the features and the response using scatterplots
fig, axs = plt.subplots(1, 3, sharey=True)
data.plot(kind='scatter', x='TV', y='sales', ax=axs[0], figsize=(16, 8))
data.plot(kind='scatter', x='radio', y='sales', ax=axs[1])
data.plot(kind='scatter', x='newspaper', y='sales', ax=axs[2])
```
<matplotlib.axes._subplots.AxesSubplot at 0x1139a98d0>
Image in a Jupyter notebook
- TV vs. sales: non-constant variance (heteroscedasticity), diminishing returns

```python
sns.lmplot(x='TV', y='sales', data=data)
```
<seaborn.axisgrid.FacetGrid at 0x1138d9a20>
Image in a Jupyter notebook
```python
sns.lmplot(x='radio', y='sales', data=data)
sns.lmplot(x='newspaper', y='sales', data=data)
```
<seaborn.axisgrid.FacetGrid at 0x117be7cc0>
Image in a Jupyter notebook
Image in a Jupyter notebook

### Check 2: Low Multicollinearity

```python
cmap = sns.diverging_palette(220, 10, as_cmap=True)
correlations = data[['TV', 'radio', 'newspaper']].corr()
print(correlations)
sns.heatmap(correlations, cmap=cmap)
```
```
                 TV     radio  newspaper
TV         1.000000  0.054809   0.056648
radio      0.054809  1.000000   0.354104
newspaper  0.056648  0.354104   1.000000
```
Image in a Jupyter notebook
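The correlation matrix only measures pairwise relationships; a common complementary check is the variance inflation factor (VIF), which catches a predictor that is well explained by a *combination* of the others. statsmodels ships `variance_inflation_factor`, but here is a self-contained numpy sketch on synthetic data (the variables and thresholds are illustrative, not from the Advertising data):

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of a 2-D array X.

    VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
    column j on all the other columns (with an intercept).
    Values near 1 mean low multicollinearity; above ~5-10 is a red flag.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])   # add intercept column
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return out

# synthetic demo: x2 is nearly a copy of x1, x3 is independent
rng = np.random.default_rng(42)
x1 = rng.normal(size=500)
x2 = x1 + 0.1 * rng.normal(size=500)   # highly collinear with x1
x3 = rng.normal(size=500)              # independent
v = vif(np.column_stack([x1, x2, x3]))
print(v)
```

The collinear pair gets very large VIFs while the independent column stays near 1; running the same function on `data[['TV', 'radio', 'newspaper']].values` would quantify the heatmap above.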

Student question:

  1. Do these variables have collinearity?

Answer:

### Check 3: X's significantly explain y (have low p-values)

Let's take another look at the crude model.

```python
# fit the regression model: sales is what you are trying to predict
# (the dependent variable) and TV is your predictor (the independent variable);
# the coefficient is the number you multiply your predictor by to get your prediction
lm = smf.ols(formula='sales ~ TV', data=data).fit()

# print the full summary
lm.summary()
```
- Intercept: the value you would expect for the response (here, sales) when the predictors are zero, in this case TV = 0.

- TV (the predictor's coefficient): the amount the response (sales) goes up when the predictor is raised by 1 unit. Here, sales would go up by 0.0475 for every unit increase in TV.

- std err: taking a coefficient and adding/subtracting 2 × std err gives the approximate lower and upper bounds of its 95% confidence interval. It works like the standard deviation: going out two bands covers roughly 95% of a normal distribution.

- R squared: the proportion of the variability in the response (around its mean) that the model explains. An R² of 100% would indicate that the model explains all of that variability.
- An R² of 100% in this case would suggest that TV is the sole driver of sales.
- Here, our model suggests that TV explains 61% of the variance in sales, leaving 39% to other variables.

- Note that if you were to invest 100 dollars in TV, sales would NOT increase by 61%; they would only increase by 100 × 0.0475 = $4.75.

Important things to look out for when reading the results of a linear regression model:

- R²: make sure you have a high value
- P value: make sure it is below .05
- Confidence interval (the [0.025, 0.975] columns): if the range contains 0, the coefficient is not statistically significant — you will see this whenever the p-value is large.
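The ±2 × std err rule above can be verified with a little arithmetic. This sketch fits a one-variable OLS by hand with numpy on synthetic data (illustrative numbers, not the Advertising data) and builds the approximate 95% interval for the slope:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 100, size=200)
y = 7.0 + 0.05 * x + rng.normal(scale=2.0, size=200)   # true slope is 0.05

# OLS fit via least squares
A = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, slope = coef
resid = y - A @ coef

# standard error of the slope
n, p = A.shape
sigma2 = resid @ resid / (n - p)                    # residual variance
se_slope = np.sqrt(sigma2 / ((x - x.mean()) ** 2).sum())

# approximate 95% confidence interval: coefficient +/- 2 * std err
lo, hi = slope - 2 * se_slope, slope + 2 * se_slope
print(f"slope = {slope:.4f}, 95% CI ≈ [{lo:.4f}, {hi:.4f}]")
```

With the real model you get the same bounds directly from `lm.conf_int()` or the summary table.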

### Student Model

Now fit a full model with TV, radio, and newspaper.

syntax can be found here: http://statsmodels.sourceforge.net/devel/example_formulas.html

```python
# fit the full model with all three media buys
lm = smf.ols(formula='sales ~ TV + radio + newspaper', data=data).fit()
# print the full summary
lm.summary()
```
```python
# fit a reduced model with TV and radio only
lm = smf.ols(formula='sales ~ TV + radio', data=data).fit()
# print the full summary
lm.summary()
```
```python
X_variables = data[['TV', 'radio', 'newspaper']]
Y = data['sales']
```
```python
from itertools import combinations
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# fit a model for every pair of predictors and record its R^2
r_squared = []
model = LinearRegression()
for i in combinations(X_variables.columns, 2):
    model.fit(data[list(i)], Y)
    preds = model.predict(data[list(i)])
    r_2 = r2_score(Y, preds)
    r_squared.append([i, r_2])
r_squared
```
```
[[('TV', 'radio'), 0.8971942610828956],
 [('TV', 'newspaper'), 0.6458354938293271],
 [('radio', 'newspaper'), 0.33270518395032256]]
```
```python
X_variables = data[['TV', 'radio', 'newspaper']]
```
```python
from itertools import combinations
for i in combinations(X_variables.columns, 2):
    print(i)
```
```
('TV', 'radio')
('TV', 'newspaper')
('radio', 'newspaper')
```
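When comparing models with different numbers of predictors, plain R² never decreases as you add terms, so it quietly favors the bigger model; adjusted R² adds a penalty for each extra predictor. A small sketch of the formula, using the ('TV', 'radio') R² from the pairwise results above and n = 200 (the number of rows in the Advertising dataset):

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1),
    where n = number of observations and p = number of predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# the same raw R^2 scores slightly lower once a third predictor is charged for
a2 = adjusted_r2(0.897, n=200, p=2)   # 2-predictor model
a3 = adjusted_r2(0.897, n=200, p=3)   # hypothetical 3-predictor model, same R^2
print(a2, a3)
```

statsmodels reports the same quantity as `rsquared_adj` on a fitted OLS result.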

1. Which of the media buys were significantly associated with sales?

Answer:

2. Controlling for all the other media buys, which media type had the largest association with sales?

Answer:

3. Given that one of the variables above was not significant, do we drop it from our model? Why or why not?

Answer: