CoCalc -- demo-lesson-06-starter-code.ipynb

GitHub Repository: YStrano/DataScience_GA
Path: blob/master/april_18/lessons/lesson-06-alt/code/starter-code/demo-lesson-06-starter-code.ipynb

Kernel: Python 2

## Lesson 06 Demo

In [9]:

%matplotlib inline
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
sns.set_style("darkgrid")

# this is the standard import if you're using "formula notation" (similar to R)
import statsmodels.formula.api as smf

In [2]:

# read data into a DataFrame
data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv', index_col=0)
data.head()

Out[2]:

# Checks for Linear Regression

Linear regression works best when:

- The data is roughly normally distributed (not a strict requirement; the formal normality assumption concerns the residuals)
- X’s are independent of each other (low multicollinearity)
- X’s significantly explain y (have low p-values)

Check 1. Distribution

Last time we plotted our data like this:

In [3]:

# visualize the relationship between the features and the response using scatterplots
fig, axs = plt.subplots(1, 3, sharey=True)
data.plot(kind='scatter', x='TV', y='Sales', ax=axs[0], figsize=(16, 8))
data.plot(kind='scatter', x='Radio', y='Sales', ax=axs[1])
data.plot(kind='scatter', x='Newspaper', y='Sales', ax=axs[2])

Out[3]:

<matplotlib.axes._subplots.AxesSubplot at 0x10a293a90>

//anaconda/lib/python2.7/site-packages/matplotlib/collections.py:590: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  if self._edgecolors == str('face'):

Seaborn plotting library

https://stanford.edu/~mwaskom/software/seaborn/index.html

Today we use lmplot: https://stanford.edu/~mwaskom/software/seaborn/generated/seaborn.lmplot.html

In [7]:

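# scatterplot of Sales against TV with a fitted regression line
# (keyword arguments work in both old and recent seaborn releases)
sns.lmplot(x='TV', y='Sales', data=data)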

Out[7]:

<seaborn.axisgrid.FacetGrid at 0x10aa7a190>

In [6]:

# same plot for the other two media types
sns.lmplot(x='Radio', y='Sales', data=data)
sns.lmplot(x='Newspaper', y='Sales', data=data)

Out[6]:

<seaborn.axisgrid.FacetGrid at 0x10aac25d0>

Check 2. Low Multicollinearity

In [18]:

cmap = sns.diverging_palette(220, 10, as_cmap=True)

correlations = data[['TV', 'Radio', 'Newspaper']].corr()
# print() with a single argument runs under both Python 2 and Python 3
print(correlations)
print(sns.heatmap(correlations, cmap=cmap))

Out[18]:

                 TV     Radio  Newspaper
TV         1.000000  0.054809   0.056648
Radio      0.054809  1.000000   0.354104
Newspaper  0.056648  0.354104   1.000000
Axes(0.125,0.125;0.62x0.775)

Student question:

Do these variables have collinearity?

Answer:

Check 3: X’s significantly explain y (have low p-values)

Let's take another look at the crude model

In [11]:

lm = smf.ols(formula='Sales ~ TV', data=data).fit()

#print the full summary
lm.summary()

Out[11]:
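The summary table is convenient to read, but the individual numbers can also be pulled off the fitted results object. The following is a minimal sketch on synthetic data (not the advertising data); `toy` and `toy_lm` are hypothetical names, and the true slope of 0.05 is chosen arbitrarily for the illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data with a known linear relationship plus noise.
rng = np.random.RandomState(1)
n = 200
toy = pd.DataFrame({'TV': rng.uniform(0, 300, n)})
toy['Sales'] = 7.0 + 0.05 * toy['TV'] + rng.normal(0, 3, n)

toy_lm = smf.ols(formula='Sales ~ TV', data=toy).fit()

print(toy_lm.params)      # estimated intercept and slope
print(toy_lm.pvalues)     # p-value for each coefficient
print(toy_lm.conf_int())  # 95% confidence intervals by default
print(toy_lm.rsquared)    # share of variance in Sales explained by the model
```

A low p-value for `TV` here says the slope is unlikely to be zero, which is what Check 3 asks about.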

Student Model

Now do a full model with TV, Radio and Newspaper

Syntax can be found here: http://statsmodels.sourceforge.net/devel/example_formulas.html
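As a quick illustration of the formula notation with several predictors, terms on the right-hand side are joined with `+`. The sketch below uses made-up columns `a`, `b`, `c` and `y` on synthetic data, so it shows the syntax without fitting the advertising model itself.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: y depends on a and b, but not on c.
rng = np.random.RandomState(2)
n = 300
toy = pd.DataFrame({
    'a': rng.normal(size=n),
    'b': rng.normal(size=n),
    'c': rng.normal(size=n),
})
toy['y'] = 1.0 + 2.0 * toy['a'] - 1.5 * toy['b'] + rng.normal(size=n)

# "response ~ predictor1 + predictor2 + ..." ; the intercept is added automatically.
multi = smf.ols(formula='y ~ a + b + c', data=toy).fit()

# One p-value per term, including the intercept.
print(multi.pvalues)
```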

In [1]:

#fit model


#print summary

1. Which of the media buys were significantly associated with the sales?

Answer:

2. Controlling for all the other media buys, which media type had the largest association with sales?

Answer:

3. Given that one of the variables above was not significant, do we drop it from our model? Why or why not?

Answer:
