GitHub Repository: YStrano/DataScience_GA
Path: blob/master/april_18/lessons/lesson-04/code/starter-code/demo-starter-code-4.ipynb
Kernel: Python 2

Part 1. Hypothesis Testing

Libraries

For today's demo, we'll be using Statsmodels for teaching purposes, since it provides convenient summaries for linear models (coefficients, confidence intervals, and p-values).

We will be demonstrating hypothesis testing as it relates to linear modeling. We'll dive into how to build linear regression models in later classes.

# imports
import pandas as pd
import matplotlib.pyplot as plt

# this allows plots to appear directly in the notebook
%matplotlib inline

Example: Advertising Data

Let's take a look at some data, ask some questions about that data, and then use linear regression to answer those questions!

# read data into a DataFrame
data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv', index_col=0)
data.head()

Student Question: What are the features?

Answer:

Student Question: What is the response?

Answer:

# print the shape of the DataFrame
data.shape
(200, 4)

There are 200 observations, and thus 200 markets in the dataset.
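If you want to confirm the column names programmatically, here is a quick sketch (the names in the comment are simply the ones used throughout this notebook):

# sketch: list the columns -- three features (TV, Radio, Newspaper) and the response (Sales)
print(list(data.columns))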

# visualize the relationship between the features and the response using scatterplots
fig, axs = plt.subplots(1, 3, sharey=True)
data.plot(kind='scatter', x='TV', y='Sales', ax=axs[0], figsize=(16, 8))
data.plot(kind='scatter', x='Radio', y='Sales', ax=axs[1])
data.plot(kind='scatter', x='Newspaper', y='Sales', ax=axs[2])
[Image: scatterplots of Sales versus TV, Radio, and Newspaper advertising spend]

Questions About the Advertising Data

Let's pretend you work for the company that manufactures and markets the product in this dataset. The company might ask you the following: on the basis of this data, how should we spend our advertising money in the future?

  1. Is there a relationship between ads and sales?

Student Question: Is this a causal relationship?

Answer:

Student Question: What other questions might we want to ask about this data?

Answer:

Let's use Statsmodels to estimate the association between advertising efforts and sales.

# this is the standard import if you're using "formula notation" (similar to R)
import statsmodels.formula.api as smf

# create a fitted model in one line
# formula notation is the equivalent of writing our model as 'outcome = predictor',
# with the following syntax: formula = 'outcome ~ predictor1 + predictor2 + ... + predictorN'
lm = smf.ols(formula='Sales ~ TV', data=data).fit()

# print the full summary
lm.summary()
# print the coefficients
lm.params
Intercept    7.032594
TV           0.047537
dtype: float64

Interpreting Model Coefficients

How do we interpret the TV coefficient ($\beta_1$)?

  • A "unit" increase in TV ad spending is associated with a 0.047537 "unit" increase in Sales.

  • Or more clearly: An additional $1,000 spent on TV ads is associated with an increase in sales of 47.537 widgets.

Note that if an increase in TV ad spending was associated with a decrease in sales, $\beta_1$ would be negative.
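As a quick sanity check on that interpretation, here is a small sketch; the factor of 1,000 is just the unit conversion described above, since Sales is recorded in thousands of widgets:

# sketch: convert the TV coefficient into "extra widgets per additional $1,000 of TV spend"
# (Sales is in thousands of widgets, so multiply the coefficient by 1,000)
print(lm.params['TV'] * 1000)   # roughly 47.5 widgets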

Using the Model for Prediction

Let's say that there was a new market where the TV advertising spend was $50,000. Since TV spending in this dataset is measured in thousands of dollars, that corresponds to TV = 50. What would we predict for Sales in that market?

$y = \beta_0 + \beta_1 x$

$y = 7.032594 + 0.047537 \times 50$
# manually calculate the prediction (TV = 50, i.e. $50,000 of TV spend)
7.032594 + 0.047537*50
9.409444

Thus, we would predict Sales of roughly 9,409 widgets in that market.

Of course, we can also use Statsmodels to make the prediction:

# you have to create a DataFrame since the Statsmodels formula interface expects it
# (TV = 50 corresponds to the $50,000 spend, since TV is measured in thousands of dollars)
X_new = pd.DataFrame({'TV': [50]})
X_new.head()
# use the model to make predictions on a new value
lm.predict(X_new)
array([ 9.40942557])

Part 2. Confidence in our Model

Question: Is linear regression a high bias/low variance model, or a low bias/high variance model?

Answer:


A closely related concept is confidence intervals. Statsmodels calculates 95% confidence intervals for our model coefficients, which are interpreted as follows: If the population from which this sample was drawn was sampled 100 times, approximately 95 of those confidence intervals would contain the "true" coefficient.

# print the confidence intervals for the model coefficients
lm.conf_int()

Keep in mind that we only have a single sample of data, and not the entire population of data. The "true" coefficient is either within this interval or it isn't, but there's no way to actually know. We estimate the coefficient with the data we do have, and we show uncertainty about that estimate by giving a range that the coefficient is probably within.

Note that using 95% confidence intervals is just a convention. You can create 90% confidence intervals (which will be narrower), 99% confidence intervals (which will be wider), or whatever intervals you like.
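For example, here is a sketch of how you could ask Statsmodels for different interval widths; the alpha argument is the significance level, so alpha=0.10 yields a 90% interval and alpha=0.01 a 99% interval:

# 90% confidence intervals (narrower than the 95% intervals above)
lm.conf_int(alpha=0.10)

# 99% confidence intervals (wider than the 95% intervals above)
lm.conf_int(alpha=0.01)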

Hypothesis Testing and p-values

Closely related to confidence intervals is hypothesis testing. Generally speaking, you start with a null hypothesis and an alternative hypothesis - a hypothesis that is the opposite of the null. Then, you check whether the data supports rejecting the null hypothesis or failing to reject the null hypothesis.

Note that "failing to reject" the null is not the same as "accepting" the null hypothesis. Your alternative hypothesis may indeed be true, but you don't necessarily have enough data to show that yet.

As it relates to model coefficients, here is the conventional hypothesis test:

  • null hypothesis: There is no relationship between TV ads and Sales (and thus $\beta_1$ equals zero)

  • alternative hypothesis: There is a relationship between TV ads and Sales (and thus $\beta_1$ is not equal to zero)

How do we test this hypothesis? We reject the null (and thus believe the alternative) if the 95% confidence interval does not include zero.
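Here is a sketch of that decision rule in code, using the fitted Sales ~ TV model from above:

# sketch: does the 95% confidence interval for the TV coefficient contain zero?
ci = lm.conf_int()
lower = ci.loc['TV'][0]
upper = ci.loc['TV'][1]
print(lower <= 0 <= upper)   # False -> zero is outside the interval, so we reject the null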

A closely related quantity is the p-value: for each coefficient, it is the probability of seeing an estimate at least as extreme as the one we observed if the true coefficient were actually zero:

# print the p-values for the model coefficients
lm.pvalues

If the 95% confidence interval includes zero, the p-value for that coefficient will be greater than 0.05.

If the 95% confidence interval does not include zero, the p-value will be less than 0.05. Thus, a p-value less than 0.05 is one way to decide whether there is likely a relationship between the feature and the response. Using 0.05 as the cutoff is a standard convention.
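Here is the same cutoff applied in code (a sketch using the model fit above):

# sketch: compare the TV coefficient's p-value against the conventional 0.05 cutoff
print(lm.pvalues['TV'])          # a very small number
print(lm.pvalues['TV'] < 0.05)   # True -> consistent with a relationship between TV ads and Sales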

In this case, the p-value for TV is far less than 0.05, and so we believe that there is a relationship between TV ads and Sales.

Note that we generally ignore the p-value for the intercept.
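The takeaways below refer to a model that includes all three advertising channels at once. That model isn't fit in this starter notebook, but as a sketch it would use the same formula interface (the name lm_all is just illustrative):

# sketch (not in the original notebook): fit a multiple linear regression
# on all three ad channels, then inspect the coefficients and p-values
lm_all = smf.ols(formula='Sales ~ TV + Radio + Newspaper', data=data).fit()
print(lm_all.params)
print(lm_all.pvalues)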

What are a few key things we learn from this output?

  • TV and Radio have significant p-values, whereas Newspaper does not. Thus we reject the null hypothesis for TV and Radio (that there is no association between those features and Sales), and fail to reject the null hypothesis for Newspaper.

  • TV and Radio ad spending are both positively associated with Sales, whereas Newspaper ad spending is slightly negatively associated with Sales. However, we shouldn't read much into the sign of the Newspaper coefficient, since we failed to reject the null hypothesis for Newspaper.