GitHub Repository: YStrano/DataScience_GA
Path: blob/master/lessons/lesson_05/code/Linear Regression with Statsmodels and Scikit-Learn - (done w' the exception of exponential regression).ipynb
Kernel: Python 3

Linear Regression with Statsmodels and Scikit-Learn

There are many ways to fit a linear regression, and in Python I find myself commonly using both scikit-learn and statsmodels. This notebook demos some common tasks with these libraries:

  • Linear regressions in both

  • Using dummy variables

  • Multilinear regression

  • Quadratic and polynomial regressions

  • Exponential regressions

You can also create polynomial fits with numpy and more general curve fits with scipy.
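For example, here is a quick sketch of those alternatives (not a cell from the original notebook, and the data below is made up purely for illustration): a straight-line fit with numpy.polyfit and the same fit with scipy.optimize.curve_fit.

import numpy as np
from scipy.optimize import curve_fit

x = np.linspace(0, 10, 50)
y = 2.0 + 0.5 * x + np.random.normal(0, 0.2, size=x.size)

# numpy: least-squares polynomial fit (degree 1 is a straight line)
coeffs = np.polyfit(x, y, deg=1)  # returns [slope, intercept]

# scipy: fit any parametric function you can write down
def line(x, a, b):
    return a + b * x

params, cov = curve_fit(line, x, y)  # best-fit (a, b) and their covariance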

To get started let's load in the libraries that we'll need.

%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
import seaborn as sns
import statsmodels.formula.api as smf
from sklearn import linear_model as lm

For the first few examples we'll use the famous Iris dataset. Seaborn provides a few data sets including this one. Let's load the data and take a quick look using Seaborn's pairplot.

# Load the data into a pandas dataframe
iris = sns.load_dataset("iris")  # this dataset is in seaborn already, so you don't need to get a file, just load it in
iris.head()
# Quick plot of the data using seaborn
sns.pairplot(iris, hue="species")
<seaborn.axisgrid.PairGrid at 0x1a1ce4dc50>
Image in a Jupyter notebook

You can see a pretty strong linear relationship between petal_length and petal_width. Let's fit a linear regression. Seaborn can plot the data with a regression line so let's do that first (but it doesn't give us much in the way of statistics).

sns.lmplot(x="petal_length", y="petal_width", data=iris) # petal width = B0 + b1*petal_length + e
<seaborn.axisgrid.FacetGrid at 0x1a1e682828>
Image in a Jupyter notebook

Now let's use scikit-learn to find the best fit line.

# from sklearn import linear_model
X = iris[["petal_length"]]  # independent variable / predictor
y = iris["petal_width"]     # dependent variable / predicted

# Fit the linear model
model = lm.LinearRegression()  # this creates a linear regression object
results = model.fit(X, y)      # this fits the model and runs the regression
# model = linear_model.LinearRegression().fit(X, y)  # does the two steps above in one

# Print the coefficients
print(results.intercept_, results.coef_)
-0.3630755213190291 [0.41575542]

This means that our best fit line is y = a + bx, where a = -0.363075521319 and b = 0.41575542.

  • here a is the intercept and b is the coefficient (slope) of the best fit line

  • the best fit line is the straight line that best represents the data on a scatter plot

  • it is also the least-squares line: the sum of the squared vertical distances from the points to the line is as small as possible (a quick check of the fitted line is sketched below)
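As a quick sanity check (this cell is not in the original notebook; it assumes the model and results objects fitted above are still in scope), the scikit-learn predictions are exactly intercept_ + coef_ * x:

preds = results.predict(X)  # X is iris[["petal_length"]] from above
manual = results.intercept_ + results.coef_[0] * iris["petal_length"]
print(np.allclose(preds, manual))  # True: the fitted line reproduces every prediction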

Next let's use statsmodels.

# import statsmodels.api as sm
model = smf.ols(formula='petal_width ~ petal_length', data=iris)  # note the swap of X and y in the formula
results = model.fit()
# statsmodels gives R-like statistical output
results.summary()

Interpretation: for every one-unit increase in petal_length, petal_width goes up by beta 1, which is 0.4158.
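If you would rather pull those numbers out of the results object than read them off the summary table, statsmodels exposes them directly (the index names below are the ones the formula interface produces):

print(results.params["petal_length"])  # slope (beta 1), about 0.4158
print(results.params["Intercept"])     # intercept (beta 0)
print(results.rsquared)                # R-squared of the fit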

Note that the coefficients are almost identical to what we saw before with scikit-learn, and the fit is pretty good (R^2 = 0.927). Let's see if adding in the species helps. Since that feature is categorical, we need to use dummy variables.

dummies = pd.get_dummies(iris["species"], drop_first=True)

# Add to the original dataframe
iris = pd.concat([iris, dummies], axis=1)  # axis=1 concats by column, axis=0 concats by row
# Shift+Tab (twice) in a cell is very useful if in doubt about the arguments
iris.head()

Now we perform a multilinear regression with the dummy variables added.

model = smf.ols(formula='petal_width ~ petal_length + versicolor + virginica', data=iris)
results = model.fit()
results.summary()

In this case it looks like we got a slight improvement from including the dummy variables. The dummy variables have a bigger impact on a fit between petal_length and sepal_length (see the sketch below).
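Here is a sketch of that comparison (not a cell from the original notebook): fit petal_length on sepal_length with and without the species dummies and compare the R-squared values.

base = smf.ols("petal_length ~ sepal_length", data=iris).fit()
with_dummies = smf.ols("petal_length ~ sepal_length + versicolor + virginica", data=iris).fit()
print(base.rsquared, with_dummies.rsquared)  # the dummies should raise R-squared noticeably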

# Here is another notation from statsmodels that you can also use - I highly recommend the formula syntax above instead
# don't use this syntax
import statsmodels.api as sm

X = iris[["petal_length"]]
X = sm.add_constant(X)
y = iris["petal_width"]

model = sm.OLS(y, X)
results = model.fit()
results.summary()
# A continuation of the above - again, don't use this syntax; prefer scikit-learn or the formula syntax from the prior example
X = iris[["petal_length", "versicolor", "virginica"]]
X = sm.add_constant(X)
y = iris["petal_width"]

model = sm.OLS(y, X)
results = model.fit()
results.summary()

Quadratic Fit

Next we look at a data set that needs a quadratic fit. Let's do both a linear and quadratic fit and compare.

func = lambda x: 2 + 0.5 * x + 3 * x**2 + 5 * stats.norm.rvs(0, 10)
# you can also write this as a long-form function:
# def func(x):
#     return 2 + 0.5 * x + 3 * x**2 + 5 * stats.norm.rvs(0, 10)

df = pd.DataFrame()
df["x"] = list(range(0, 30))
df["y"] = list(map(func, df["x"]))  # map applies the function to each value of x
df.plot(x='x', y='y', kind='scatter')
<matplotlib.axes._subplots.AxesSubplot at 0x1c1f27fd30>
Image in a Jupyter notebook
# Linear fit
model = smf.ols('y ~ x', data=df)
results = model.fit()
results.summary()
sns.lmplot(x="x", y="y", data=df) #this is the regression line that you get (prior to raising x to the power of 2 - see below) #this is a bad fit bc you basically miss most the data #to be a good fit, you'd want data on both sides of the lines evenly throughout
<seaborn.axisgrid.FacetGrid at 0x1c20c43ba8>
Image in a Jupyter notebook
df['y_pred1'] = results.predict()
sns.lmplot(x='y_pred1', y="y", data=df)
# this is not a regression line and is not the line you would get with your equation
# it plots the y predictions against the y actuals, and you can see the fit is off
<seaborn.axisgrid.FacetGrid at 0x1c204ae7b8>
Image in a Jupyter notebook
# Quadratic fit
df['x2'] = df['x']**2  # to account for the quadratic curve, we square x
model2 = smf.ols('y ~ x + x2', data=df)
results2 = model2.fit()
results2.summary()
df['y_pred'] = results2.predict()  # -5.5088 + 2.0698*x + 2.9554*x2
df['y_pred']
0       -5.508812
1       -0.483615
2       10.452464
3       27.299425
4       50.057269
         ...
27    2204.891236
28    2369.510259
29    2540.040164
Name: y_pred, dtype: float64
sns.lmplot(x='y_pred', y="y", data=df)
# again, this is not a regression line; it plots the y predictions against the y actuals
# now the points are on the mark, which shows why it helps to square x when dealing with a quadratic relationship
<seaborn.axisgrid.FacetGrid at 0x1c2000a668>
Image in a Jupyter notebook
fig = plt.figure()
ax = plt.axes()
x = df['x']
y = results2.predict()
plt.scatter(x, df['y'], color='k')
ax.plot(x, y)  # this is the actual quadratic regression curve you get after raising x to the power of 2
ax
<matplotlib.axes._subplots.AxesSubplot at 0x1c20c27be0>
Image in a Jupyter notebook

We see that the quadratic fit is better. We can plot the predictions against the true values for both models to see how far off each one is (a residual plot proper is sketched a little further below).

# Plot true values versus the predictions
plt.scatter(df['y'], results.predict(), color="g", label="Linear")
plt.scatter(df['y'], results2.predict(), color="b", label="Quadratic")
plt.xlabel("True Values")
plt.ylabel("Predicted Values")
plt.legend()  # without this the color labels never show up
plt.show()
Image in a Jupyter notebook
  • note to self, circle back on residuals

Although it's a little hard to tell from the plot (since both fits are good), the blue (quadratic) points lie closer to the line y = x, indicating closer agreement between the true values and the model's predictions.
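To follow up on the residuals note above, here is one way to plot the residuals themselves (a sketch, not a cell from the original notebook; it reuses the results and results2 fits from the cells above):

# residual = true y minus predicted y, plotted against x
plt.scatter(df["x"], df["y"] - results.predict(), color="g", label="Linear residuals")
plt.scatter(df["x"], df["y"] - results2.predict(), color="b", label="Quadratic residuals")
plt.axhline(0, color="k", linewidth=1)
plt.xlabel("x")
plt.ylabel("Residual")
plt.legend()
plt.show()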

Higher-order polynomial regressions are just as easy: add more powers of x, for example by increasing the number of columns (the N argument) in numpy.vander.
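For instance, a cubic fit of the same df could be built by handing numpy.vander's output straight to statsmodels (a sketch, not from the original notebook):

import statsmodels.api as sm

X_poly = np.vander(df["x"], N=4, increasing=True)  # columns: 1, x, x**2, x**3
cubic_results = sm.OLS(df["y"], X_poly).fit()
print(cubic_results.params)  # one coefficient per power of x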

Exponential functions

We can also transform our data before applying linear regression. This allows us to fit functions such as exponentials of the form y = A e^(kx) using linear regression. Here's some data generated from an exponential curve.

func = lambda x: 2.5 * np.exp(0.5 * x) + stats.norm.rvs(0, 0.3)

df = pd.DataFrame()
df["x"] = np.arange(-1, 4, 0.1)
df["y"] = list(map(func, df["x"]))
df.plot.scatter(x='x', y='y')
<matplotlib.axes._subplots.AxesSubplot at 0x1c20bbf908>
Image in a Jupyter notebook

If we take the log of the y-variable we get something more linear.

df["log-y"] = np.log(df["y"]) df.plot.scatter(x='x', y='log-y')
<matplotlib.axes._subplots.AxesSubplot at 0x1c21028860>
Image in a Jupyter notebook

We can then use linear regression to determine k and log A, since taking the log of both sides of y = A e^(kx) gives us log y = log A + k x.

X = df["x"] X = sm.add_constant(X) y = df["log-y"] model = sm.OLS(y, X) results = model.fit() results.summary()

As you can see, the fit is very good.
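To close the loop (a small sketch using the results object above, not a cell from the original notebook), we can back out A and k from the fitted line and compare them with the values used to generate the data, A = 2.5 and k = 0.5:

log_A, k = results.params["const"], results.params["x"]  # intercept is log A, slope is k
print(np.exp(log_A), k)  # should land close to A = 2.5 and k = 0.5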