## Lesson 05 Demo
# Checks for Linear Regression. It works best when:
The data is normally distributed (but doesn’t have to be)
X’s are independent of each other (low multicollinearity)
X’s significantly explain y (have low p-values)
Check 1. Distribution
Last time we plotted our data like this
TV vs. sales: non-constant variance, diminishing returns
Seaborn plotting library
https://stanford.edu/~mwaskom/software/seaborn/index.html
Today we use lmplot https://stanford.edu/~mwaskom/software/seaborn/generated/seaborn.lmplot.html
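As a rough illustration of what an `lmplot` call looks like, here is a minimal sketch. The data are synthetic stand-ins, not the lesson's dataset: the column names `TV` and `Sales` and the generating formula are assumptions, chosen only to mimic the diminishing returns and growing variance noted above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display

import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the lesson's data (hypothetical values):
# sales rise with TV spend but level off, and the noise grows with spend.
rng = np.random.default_rng(0)
tv = rng.uniform(1, 300, size=200)
sales = 2.0 * np.sqrt(tv) + rng.normal(0, 0.01 * tv + 0.5)
df = pd.DataFrame({"TV": tv, "Sales": sales})

# lmplot draws a scatter plot plus a fitted regression line with a confidence band.
grid = sns.lmplot(x="TV", y="Sales", data=df)
```

In a notebook the figure displays inline. A straight line fitted to curved, fanning-out points like these is the visual cue that the distribution check deserves a closer look.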
Check 2. Low Multicollinearity
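A common first look at multicollinearity is the pairwise correlation between the predictors. The sketch below uses synthetic data rather than the lesson's dataset. The column names and the deliberate link between `Radio` and `Newspaper` are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic predictors (hypothetical values): Newspaper is built to track Radio,
# while TV is generated independently of both.
rng = np.random.default_rng(1)
n = 200
tv = rng.uniform(0, 300, size=n)
radio = rng.uniform(0, 50, size=n)
newspaper = 0.8 * radio + rng.normal(0, 5, size=n)
X = pd.DataFrame({"TV": tv, "Radio": radio, "Newspaper": newspaper})

# Pairwise Pearson correlations; off-diagonal values near +1 or -1 flag collinear pairs.
corr = X.corr()
print(corr.round(2))
```

Off-diagonal entries close to zero suggest the predictors carry separate information. Entries close to 1 in absolute value suggest two predictors are partly redundant.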
Student question:
Do these variables have collinearity?
Answer:
Check 3: X’s significantly explain y (have low p-values)
Let's take another look at the crude model
intercept: the value you would expect for the response (in this case sales) when all predictors are zero, in this case TV = 0.
TV (the predictor): the amount by which the response, in this case sales, is expected to change when the predictor is raised by 1 unit. Here sales would go up by 0.0475 for every unit increase in TV.
std err: the standard error of each coefficient estimate. If you take a coefficient and add/subtract roughly 2 x std err, you get the lower and upper bounds of its 95% confidence interval (the 0.025 and 0.975 columns). It is like going out two standard deviations on either side of a normal curve, which covers about 95% of it.
R squared: the proportion of the variability in the response that the model explains. It measures fit, not statistical significance. An R^2 of 100% indicates that the model explains all the variability of the response data around its mean.
An R^2 of 100% in this case would mean that TV alone accounts for all of the variation in sales.
In this case, our model suggests that TV explains 61% of the variation in sales, leaving 39% to other variables and noise.
Note that if you were to invest 100 dollars in TV, your sales would NOT increase by 61%. The coefficient gives the expected increase: 100 * 0.0475 = 4.75, in whatever units sales is measured in.
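To make the quantities above concrete, here is a minimal sketch that computes them by hand with NumPy instead of reading them off a summary table. The data are synthetic and the generating values (intercept 7, slope 0.05, noise level 3) are assumptions, so the numbers will not match the lesson's model.

```python
import numpy as np

# Synthetic data (hypothetical, not the lesson's dataset): sales = 7 + 0.05 * TV + noise.
rng = np.random.default_rng(2)
n = 200
tv = rng.uniform(0, 300, size=n)
sales = 7.0 + 0.05 * tv + rng.normal(0, 3, size=n)

# Design matrix with a column of ones for the intercept.
X = np.column_stack([np.ones(n), tv])
beta, *_ = np.linalg.lstsq(X, sales, rcond=None)
intercept, slope = beta

# Residual variance and the standard error of each coefficient.
resid = sales - X @ beta
sigma2 = resid @ resid / (n - 2)
std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

# Rough 95% interval for the slope: coefficient +/- 2 standard errors.
slope_low = slope - 2 * std_err[1]
slope_high = slope + 2 * std_err[1]

# R^2: share of the variance in sales explained by the fitted line.
r_squared = 1 - (resid @ resid) / np.sum((sales - sales.mean()) ** 2)

# Expected change in sales for a 100-unit increase in TV.
change_for_100 = 100 * slope

print(f"intercept={intercept:.3f}, slope={slope:.4f}, std err={std_err[1]:.4f}")
print(f"approx. 95% interval for slope: [{slope_low:.4f}, {slope_high:.4f}]")
print(f"R^2={r_squared:.3f}, change for +100 TV={change_for_100:.2f}")
```

The slope drives the size of the effect, while R^2 only describes how much of the scatter the line accounts for.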
Important things to look out for when reading the results of a linear regression model:
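R^2: a higher value means the model explains more of the variation in y; how high is high enough depends on the problem
P value: make sure it is below 0.05
Confidence interval (aka 0.025 - 0.975): make sure it does not contain 0 in its range. A p-value below 0.05 and a 95% interval that excludes 0 go together; if the p-value is large, the interval will contain 0.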
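These three values can also be pulled from a fitted statsmodels result programmatically. The sketch below is a minimal example on synthetic data; the column names `TV` and `Sales` and the generating values are assumptions, not the lesson's dataset.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (hypothetical): Sales = 7 + 0.05 * TV + noise.
rng = np.random.default_rng(3)
n = 200
df = pd.DataFrame({"TV": rng.uniform(0, 300, size=n)})
df["Sales"] = 7.0 + 0.05 * df["TV"] + rng.normal(0, 3, size=n)

model = smf.ols("Sales ~ TV", data=df).fit()

r_squared = model.rsquared                     # proportion of variance explained
p_value_tv = model.pvalues["TV"]               # p-value for the TV coefficient
ci_low, ci_high = model.conf_int().loc["TV"]   # 2.5% and 97.5% bounds

print(f"R^2 = {r_squared:.3f}")
print(f"p-value (TV) = {p_value_tv:.3g}")
print(f"95% interval (TV) = [{ci_low:.4f}, {ci_high:.4f}]")
```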
Student Model
Now do a full model with TV, Radio and Newspaper
syntax can be found here: http://statsmodels.sourceforge.net/devel/example_formulas.html
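For reference, a multi-predictor formula joins the predictors with `+`. The following is only a sketch of the syntax on synthetic data. The column names and the generating process (in which `Newspaper` has no effect) are assumptions, so the output says nothing about the real dataset or the questions below.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (hypothetical values, not the lesson's dataset).
rng = np.random.default_rng(4)
n = 200
df = pd.DataFrame({
    "TV": rng.uniform(0, 300, size=n),
    "Radio": rng.uniform(0, 50, size=n),
    "Newspaper": rng.uniform(0, 100, size=n),
})
# Assumed generating process: Newspaper does not enter the equation.
df["Sales"] = 3.0 + 0.05 * df["TV"] + 0.2 * df["Radio"] + rng.normal(0, 2, size=n)

# Response on the left of "~", predictors joined with "+" on the right.
full_model = smf.ols("Sales ~ TV + Radio + Newspaper", data=df).fit()
print(full_model.summary())
```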
1. Which of the media buys were significantly associated with the sales?
Answer:
2. Controlling for all the other media buys, which media type had the largest association with sales?
Answer:
3. Given that one of the variables above was not significant, do we drop it from our model? Why or why not?
Answer: