GitHub Repository: YStrano/DataScience_GA
Path: blob/master/lessons/lesson_05/code/Linear Regression with Statsmodels and Scikit-Learn - (done w' the exception of exponential regression).ipynb
Kernel: Python 3

Linear Regression with Statsmodels and Scikit-Learn

There are many ways to fit a linear regression, and in Python I find myself commonly using both scikit-learn and statsmodels. This notebook demos some common tasks with these libraries:

  • Linear regressions in both

  • Using dummy variables

  • Multilinear regression

  • Quadratic and polynomial regressions

  • Exponential regressions

You can also create polynomial fits with numpy and more general curve fits with scipy.
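For example, here is a quick sketch of those alternatives (not a cell from the original notebook, and the data below is made up purely for illustration): a straight-line fit with numpy.polyfit and the same fit with scipy.optimize.curve_fit.

import numpy as np
from scipy.optimize import curve_fit

x = np.linspace(0, 10, 50)
y = 2.0 + 0.5 * x + np.random.normal(0, 0.2, size=x.size)

# numpy: least-squares polynomial fit (degree 1 is a straight line)
coeffs = np.polyfit(x, y, deg=1)  # returns [slope, intercept]

# scipy: fit any parametric function you can write down
def line(x, a, b):
    return a + b * x

params, cov = curve_fit(line, x, y)  # best-fit (a, b) and their covariance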

To get started let's load in the libraries that we'll need.

%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
import seaborn as sns
import statsmodels.formula.api as smf
from sklearn import linear_model as lm

For the first few examples we'll use the famous Iris dataset. Seaborn provides a few data sets including this one. Let's load the data and take a quick look using Seaborn's pairplot.

# Load the data into a pandas dataframe
iris = sns.load_dataset("iris")  # this dataset is in seaborn already, so you don't need to get a file, just load it in
iris.head()
# Quick plot of the data using seaborn
sns.pairplot(iris, hue="species")
<seaborn.axisgrid.PairGrid at 0x1a1ce4dc50>
Image in a Jupyter notebook

You can see a pretty strong linear relationship between petal_length and petal_width. Let's fit a linear regression. Seaborn can plot the data with a regression line so let's do that first (but it doesn't give us much in the way of statistics).

sns.lmplot(x="petal_length", y="petal_width", data=iris) # petal width = B0 + b1*petal_length + e
<seaborn.axisgrid.FacetGrid at 0x1a1e682828>
Image in a Jupyter notebook

Now let's use scikit-learn to find the best fit line.

# from sklearn import linear_model
X = iris[["petal_length"]]  # independent variable / predictor
y = iris["petal_width"]     # dependent variable / predicted

# Fit the linear model
model = lm.LinearRegression()  # this creates a linear regression object
results = model.fit(X, y)      # this fits the model and runs the regression
# model = linear_model.LinearRegression().fit(X, y)  # does the two steps above in one

# Print the coefficients
print(results.intercept_, results.coef_)
-0.3630755213190291 [0.41575542]

This means that our best fit line is y = a + bx, where a = -0.363075521319 and b = 0.41575542.

  • here a is the intercept and b is the coefficient (slope) of the best fit line

  • the best fit line is the straight line that best represents the data on a scatter plot

  • it is also the least-squares line: the sum of the squared vertical distances from the points to the line is as small as possible (a quick check of the fitted line is sketched below)
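As a quick sanity check (this cell is not in the original notebook; it assumes the model and results objects fitted above are still in scope), the scikit-learn predictions are exactly intercept_ + coef_ * x:

preds = results.predict(X)  # X is iris[["petal_length"]] from above
manual = results.intercept_ + results.coef_[0] * iris["petal_length"]
print(np.allclose(preds, manual))  # True: the fitted line reproduces every prediction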

Next let's use statsmodels.

# import statsmodels.api as sm
model = smf.ols(formula='petal_width ~ petal_length', data=iris)  # note the swap of X and y in the formula
results = model.fit()
# statsmodels gives R-like statistical output
results.summary()

Interpretation: for every one-unit increase in petal_length, petal_width goes up by beta 1, which is 0.4158.
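If you would rather pull those numbers out of the results object than read them off the summary table, statsmodels exposes them directly (the index names below are the ones the formula interface produces):

print(results.params["petal_length"])  # slope (beta 1), about 0.4158
print(results.params["Intercept"])     # intercept (beta 0)
print(results.rsquared)                # R-squared of the fit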

Note that the coefficients are almost identical to what we saw before with scikit-learn, and the fit is pretty good (R^2 = 0.927). Let's see if adding in the species helps. Since that feature is categorical, we need to use dummy variables.

dummies = pd.get_dummies(iris["species"], drop_first=True)

# Add to the original dataframe
iris = pd.concat([iris, dummies], axis=1)  # axis=1 concats by column, axis=0 concats by row
# Shift+Tab (twice) in a cell is very useful if in doubt about the arguments
iris.head()

Now we perform a multilinear regression with the dummy variables added.

model = smf.ols(formula='petal_width ~ petal_length + versicolor + virginica', data=iris)
results = model.fit()
results.summary()

In this case it looks like we got a slight improvement from including the dummy variables. The dummy variables have a bigger impact on a fit between petal_length and sepal_length (see the sketch below).
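Here is a sketch of that comparison (not a cell from the original notebook): fit petal_length on sepal_length with and without the species dummies and compare the R-squared values.

base = smf.ols("petal_length ~ sepal_length", data=iris).fit()
with_dummies = smf.ols("petal_length ~ sepal_length + versicolor + virginica", data=iris).fit()
print(base.rsquared, with_dummies.rsquared)  # the dummies should raise R-squared noticeably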

# Here is another notation from statsmodels that you can also use - I highly recommend the formula syntax above instead
# don't use this syntax
import statsmodels.api as sm

X = iris[["petal_length"]]
X = sm.add_constant(X)
y = iris["petal_width"]

model = sm.OLS(y, X)
results = model.fit()
results.summary()
# A continuation of the above - again, don't use this syntax; prefer scikit-learn or the formula syntax from the prior example
X = iris[["petal_length", "versicolor", "virginica"]]
X = sm.add_constant(X)
y = iris["petal_width"]

model = sm.OLS(y, X)
results = model.fit()
results.summary()

Quadratic Fit

Next we look at a data set that needs a quadratic fit. Let's do both a linear and quadratic fit and compare.

func = lambda x: 2 + 0.5 * x + 3 * x**2 + 5 * stats.norm.rvs(0, 10)
# you can also write this as a long-form function:
# def func(x):
#     return 2 + 0.5 * x + 3 * x**2 + 5 * stats.norm.rvs(0, 10)

df = pd.DataFrame()
df["x"] = list(range(0, 30))
df["y"] = list(map(func, df["x"]))  # map applies the function to each value of x
df.plot(x='x', y='y', kind='scatter')
<matplotlib.axes._subplots.AxesSubplot at 0x1c1f27fd30>
Image in a Jupyter notebook
# Linear fit
model = smf.ols('y ~ x', data=df)
results = model.fit()
results.summary()
sns.lmplot(x="x", y="y", data=df) #this is the regression line that you get (prior to raising x to the power of 2 - see below) #this is a bad fit bc you basically miss most the data #to be a good fit, you'd want data on both sides of the lines evenly throughout
<seaborn.axisgrid.FacetGrid at 0x1c20c43ba8>
Image in a Jupyter notebook
df['y_pred1'] = results.predict()
sns.lmplot(x='y_pred1', y="y", data=df)
# this is not a regression line and is not the line you would get with your equation
# it plots the y predictions against the y actuals, and you can see the fit is off
<seaborn.axisgrid.FacetGrid at 0x1c204ae7b8>
Image in a Jupyter notebook
# Quadratic fit
df['x2'] = df['x']**2  # to account for the quadratic curve, we square x
model2 = smf.ols('y ~ x + x2', data=df)
results2 = model2.fit()
results2.summary()
df['y_pred'] = results2.predict()  # -5.5088 + 2.0698*x + 2.9554*x2
df['y_pred']
0       -5.508812
1       -0.483615
2       10.452464
3       27.299425
4       50.057269
         ...
27    2204.891236
28    2369.510259
29    2540.040164
Name: y_pred, dtype: float64
sns.lmplot(x='y_pred', y="y", data=df)
# again, this is not a regression line; it plots the y predictions against the y actuals
# now the points are on the mark, which shows why it helps to square x when dealing with a quadratic relationship
<seaborn.axisgrid.FacetGrid at 0x1c2000a668>
Image in a Jupyter notebook
fig = plt.figure()
ax = plt.axes()
x = df['x']
y = results2.predict()
plt.scatter(x, df['y'], color='k')
ax.plot(x, y)  # this is the actual quadratic regression curve you get after raising x to the power of 2
ax
<matplotlib.axes._subplots.AxesSubplot at 0x1c20c27be0>
Image in a Jupyter notebook

We see that the quadratic fit is better. We can plot the predictions against the true values for both models to see how far off each one is (a residual plot proper is sketched a little further below).

# Plot true values versus the predictions
plt.scatter(df['y'], results.predict(), color="g", label="Linear")
plt.scatter(df['y'], results2.predict(), color="b", label="Quadratic")
plt.xlabel("True Values")
plt.ylabel("Predicted Values")
plt.legend()  # without this the color labels never show up
plt.show()
Image in a Jupyter notebook
  • note to self, circle back on residuals

Although it's a little hard to tell from the plot (since both fits are good), the blue (quadratic) points lie closer to the line y = x, indicating closer agreement between the true values and the model's predictions.
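To follow up on the residuals note above, here is one way to plot the residuals themselves (a sketch, not a cell from the original notebook; it reuses the results and results2 fits from the cells above):

# residual = true y minus predicted y, plotted against x
plt.scatter(df["x"], df["y"] - results.predict(), color="g", label="Linear residuals")
plt.scatter(df["x"], df["y"] - results2.predict(), color="b", label="Quadratic residuals")
plt.axhline(0, color="k", linewidth=1)
plt.xlabel("x")
plt.ylabel("Residual")
plt.legend()
plt.show()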

Higher-order polynomial regressions are just as easy: add more powers of x, for example by increasing the number of columns (the N argument) in numpy.vander.
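For instance, a cubic fit of the same df could be built by handing numpy.vander's output straight to statsmodels (a sketch, not from the original notebook):

import statsmodels.api as sm

X_poly = np.vander(df["x"], N=4, increasing=True)  # columns: 1, x, x**2, x**3
cubic_results = sm.OLS(df["y"], X_poly).fit()
print(cubic_results.params)  # one coefficient per power of x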

Exponential functions

We can also transform our data before applying linear regression. This allows us to fit functions such as exponentials of the form y = A e^(kx) using linear regression. Here's some data generated from an exponential curve.

func = lambda x: 2.5 * np.exp(0.5 * x) + stats.norm.rvs(0, 0.3)

df = pd.DataFrame()
df["x"] = np.arange(-1, 4, 0.1)
df["y"] = list(map(func, df["x"]))
df.plot.scatter(x='x', y='y')
<matplotlib.axes._subplots.AxesSubplot at 0x1c20bbf908>
Image in a Jupyter notebook

If we take the log of the y-variable we get something more linear.

df["log-y"] = np.log(df["y"]) df.plot.scatter(x='x', y='log-y')
<matplotlib.axes._subplots.AxesSubplot at 0x1c21028860>
Image in a Jupyter notebook

We can then use linear regression to determine k and log A, since taking the log of both sides of y = A e^(kx) gives us log y = log A + k x.

X = df["x"] X = sm.add_constant(X) y = df["log-y"] model = sm.OLS(y, X) results = model.fit() results.summary()

As you can see, the fit is very good.
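To close the loop (a small sketch using the results object above, not a cell from the original notebook), we can back out A and k from the fitted line and compare them with the values used to generate the data, A = 2.5 and k = 0.5:

log_A, k = results.params["const"], results.params["x"]  # intercept is log A, slope is k
print(np.exp(log_A), k)  # should land close to A = 2.5 and k = 0.5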