Kernel: Python 2

Linear Regression with Statsmodels and Scikit-Learn

There are many ways to fit a linear regression, and in Python I find myself commonly using both scikit-learn and statsmodels. This notebook demos some common tasks using these libraries:

  • Linear regressions in both

  • Using dummy variables

  • Multilinear regression

  • Quadratic and polynomial regressions

  • Exponential regressions

You can also create polynomial fits with numpy and more general curve fits with scipy.
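As a quick aside (not part of the original notebook), here is roughly what those two approaches look like on some made-up data; the model function and coefficients below are purely illustrative:

import numpy as np
from scipy.optimize import curve_fit

# Illustrative data: a noisy quadratic
x = np.linspace(0, 10, 50)
y = 3 * x ** 2 + 2 + np.random.normal(0, 5, size=x.shape)

# Polynomial fit with numpy: returns the coefficients, highest power first
print(np.polyfit(x, y, 2))

# General curve fit with scipy: fits an arbitrary model function
def quadratic(x, a, b):
    return a * x ** 2 + b

params, cov = curve_fit(quadratic, x, y)
print(params)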

To get started let's load in the libraries that we'll need.

%matplotlib inline

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
import seaborn as sns
import statsmodels.api as sm
from sklearn import linear_model

For the first few examples we'll use the famous Iris dataset. Seaborn provides a few data sets including this one. Let's load the data and take a quick look using Seaborn's pairplot.

# Load the data into a pandas dataframe
iris = sns.load_dataset("iris")
iris.head()
# Quick plot of the data using seaborn
sns.pairplot(iris, hue="species")
plt.show()
[Image: seaborn pairplot of the iris data, colored by species]

You can see a pretty strong linear relationship between petal_length and petal_width. Let's fit a linear regression. Seaborn can plot the data with a regression line, so let's do that first (though it doesn't give us much in the way of statistics).

sns.lmplot(x="petal_length", y="petal_width", data=iris)
plt.show()
[Image: lmplot of petal_width against petal_length with a fitted regression line]

Now let's use scikit-learn to find the best fit line.

# from sklearn import linear_model
X = iris[["petal_length"]]
y = iris["petal_width"]

# Fit the linear model
model = linear_model.LinearRegression()
results = model.fit(X, y)

# Print the coefficients
print results.intercept_, results.coef_
-0.363075521319 [ 0.41575542]

This means that our best fit line is $y = a + bx$, where $a = -0.363075521319$ and $b = 0.41575542$.
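As a quick cross-check (not in the original notebook), scikit-learn can also report the $R^2$ of this fit directly via the estimator's score method; it should come out around 0.93, matching the statsmodels fit with an intercept shown later.

# R^2 of the scikit-learn fit on the training data
print(model.score(X, y))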

Next let's use statsmodels.

# import statsmodels.api as sm

# Note the swap of X and y
model = sm.OLS(y, X)
results = model.fit()

# Statsmodels gives R-like statistical output
print(results.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:            petal_width   R-squared:                       0.967
Model:                            OLS   Adj. R-squared:                  0.967
Method:                 Least Squares   F-statistic:                     4417.
Date:                Tue, 14 Jun 2016   Prob (F-statistic):          1.22e-112
Time:                        22:19:00   Log-Likelihood:                -8.7179
No. Observations:                 150   AIC:                             19.44
Df Residuals:                     149   BIC:                             22.45
Df Model:                           1
Covariance Type:            nonrobust
================================================================================
                   coef    std err          t      P>|t|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------
petal_length     0.3365      0.005     66.463      0.000         0.327     0.347
==============================================================================
Omnibus:                       19.720   Durbin-Watson:                   0.857
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               23.498
Skew:                           0.957   Prob(JB):                     7.90e-06
Kurtosis:                       3.311   Cond. No.                         1.00
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

If you look closely you'll note that this model doesn't include an intercept by default like scikit-learn does. There's an easy way to add one using numpy's Vandermonde matrix function, numpy.vander.

X = [1, 2, 3]
np.vander(X, 3)
array([[1, 1, 1],
       [4, 2, 1],
       [9, 3, 1]])

As you can see, np.vander gives us columns of powers of the input array. We can use it to add a constant column for the intercept (the zeroth power) or to do polynomial fits (higher powers). First let's redo the statsmodels fit with an intercept.

X = iris["petal_length"]
X = np.vander(X, 2)  # add a constant column for the intercept
y = iris["petal_width"]

model = sm.OLS(y, X)
results = model.fit()
print(results.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:            petal_width   R-squared:                       0.927
Model:                            OLS   Adj. R-squared:                  0.927
Method:                 Least Squares   F-statistic:                     1882.
Date:                Tue, 14 Jun 2016   Prob (F-statistic):           4.68e-86
Time:                        22:23:17   Log-Likelihood:                 24.796
No. Observations:                 150   AIC:                            -45.59
Df Residuals:                     148   BIC:                            -39.57
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1             0.4158      0.010     43.387      0.000         0.397     0.435
const         -0.3631      0.040     -9.131      0.000        -0.442    -0.285
==============================================================================
Omnibus:                        5.765   Durbin-Watson:                   1.455
Prob(Omnibus):                  0.056   Jarque-Bera (JB):                5.555
Skew:                           0.359   Prob(JB):                       0.0622
Kurtosis:                       3.611   Cond. No.                         10.3
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Note that the coefficients match what we saw before with scikit-learn, and the fit is pretty good ($R^2 = 0.927$). Let's see if adding in the species helps. Since that feature is categorical, we need to use dummy variables.

dummies = pd.get_dummies(iris["species"])

# Add to the original dataframe
iris = pd.concat([iris, dummies], axis=1)
iris.head()

Now we perform a multilinear regression with the dummy variables added.

X = iris[["petal_length", "setosa", "versicolor", "virginica"]]
X = sm.add_constant(X)  # another way to add a constant column for an intercept
y = iris["petal_width"]

model = sm.OLS(y, X)
results = model.fit()
print(results.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:            petal_width   R-squared:                       0.946
Model:                            OLS   Adj. R-squared:                  0.944
Method:                 Least Squares   F-statistic:                     845.5
Date:                Tue, 14 Jun 2016   Prob (F-statistic):           4.88e-92
Time:                        22:46:01   Log-Likelihood:                 46.704
No. Observations:                 150   AIC:                            -85.41
Df Residuals:                     146   BIC:                            -73.37
Df Model:                           3
Covariance Type:            nonrobust
================================================================================
                   coef    std err          t      P>|t|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------
const            0.2501      0.098      2.561      0.011         0.057     0.443
petal_length     0.2304      0.034      6.691      0.000         0.162     0.298
setosa          -0.3410      0.051     -6.655      0.000        -0.442    -0.240
versicolor       0.0944      0.054      1.751      0.082        -0.012     0.201
virginica        0.4967      0.096      5.150      0.000         0.306     0.687
==============================================================================
Omnibus:                        6.210   Durbin-Watson:                   1.736
Prob(Omnibus):                  0.045   Jarque-Bera (JB):                9.578
Skew:                          -0.110   Prob(JB):                      0.00832
Kurtosis:                       4.218   Cond. No.                     1.69e+16
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 9.68e-30. This might indicate that there are strong multicollinearity problems or that the design matrix is singular.

In this case it looks like we got a slight improvement from including the dummy variables ($R^2$ rises from 0.927 to 0.946). The dummy variables have a bigger impact when we fit sepal_length against petal_length, as the next two fits show ($R^2$ goes from 0.760 to 0.837).

X = iris[["petal_length"]]
X = sm.add_constant(X)
y = iris["sepal_length"]

model = sm.OLS(y, X)
results = model.fit()
print(results.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:           sepal_length   R-squared:                       0.760
Model:                            OLS   Adj. R-squared:                  0.758
Method:                 Least Squares   F-statistic:                     468.6
Date:                Tue, 14 Jun 2016   Prob (F-statistic):           1.04e-47
Time:                        22:49:10   Log-Likelihood:                -77.020
No. Observations:                 150   AIC:                             158.0
Df Residuals:                     148   BIC:                             164.1
Df Model:                           1
Covariance Type:            nonrobust
================================================================================
                   coef    std err          t      P>|t|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------
const            4.3066      0.078     54.939      0.000         4.152     4.462
petal_length     0.4089      0.019     21.646      0.000         0.372     0.446
==============================================================================
Omnibus:                        0.207   Durbin-Watson:                   1.867
Prob(Omnibus):                  0.902   Jarque-Bera (JB):                0.346
Skew:                           0.069   Prob(JB):                        0.841
Kurtosis:                       2.809   Cond. No.                         10.3
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
X = iris[["petal_length", "setosa", "versicolor", "virginica"]]
y = iris["sepal_length"]

model = sm.OLS(y, X)
results = model.fit()
print(results.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:           sepal_length   R-squared:                       0.837
Model:                            OLS   Adj. R-squared:                  0.833
Method:                 Least Squares   F-statistic:                     249.4
Date:                Tue, 14 Jun 2016   Prob (F-statistic):           3.10e-57
Time:                        22:49:23   Log-Likelihood:                -48.116
No. Observations:                 150   AIC:                             104.2
Df Residuals:                     146   BIC:                             116.3
Df Model:                           3
Covariance Type:            nonrobust
================================================================================
                   coef    std err          t      P>|t|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------
petal_length     0.9046      0.065     13.962      0.000         0.777     1.033
setosa           3.6835      0.106     34.719      0.000         3.474     3.893
versicolor       2.0826      0.280      7.435      0.000         1.529     2.636
virginica        1.5659      0.363      4.315      0.000         0.849     2.283
==============================================================================
Omnibus:                        0.578   Durbin-Watson:                   1.802
Prob(Omnibus):                  0.749   Jarque-Bera (JB):                0.672
Skew:                           0.140   Prob(JB):                        0.715
Kurtosis:                       2.830   Cond. No.                         71.3
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
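One thing worth noting before moving on: the multicollinearity warning on the earlier petal_width fit appeared because that design matrix contained a constant plus all three dummy columns, which sum to one (the classic dummy variable trap). The sepal_length fit above avoided it by dropping the constant. Another common fix, sketched here rather than taken from the notebook, is to drop one dummy level and let the constant act as the baseline:

# Refit petal_width with setosa as the (implicit) baseline level
X = iris[["petal_length", "versicolor", "virginica"]]
X = sm.add_constant(X)
y = iris["petal_width"]
print(sm.OLS(y, X).fit().summary())

Recent versions of pandas can also build the reduced dummy set directly with pd.get_dummies(iris["species"], drop_first=True).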

Quadratic Fit

Next we look at a data set that needs a quadratic fit. Let's do both a linear and quadratic fit and compare.

func = lambda x: 2 + 0.5 * x + 3 * x ** 2 + 5 * stats.norm.rvs(0, 10)

df = pd.DataFrame()
df["x"] = list(range(0, 30))
df["y"] = map(func, df["x"])

df.plot.scatter(x='x', y='y')
<matplotlib.axes._subplots.AxesSubplot at 0x7fb3d1003050>
[Image: scatter plot of the noisy quadratic data]
# Linear Fit
X = df["x"]
X = np.vander(X, 2)  # add a constant column for the intercept
y = df["y"]

model = sm.OLS(y, X)
results = model.fit()
print(results.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.926
Model:                            OLS   Adj. R-squared:                  0.923
Method:                 Least Squares   F-statistic:                     351.0
Date:                Tue, 14 Jun 2016   Prob (F-statistic):           2.23e-17
Time:                        23:03:30   Log-Likelihood:                -203.34
No. Observations:                  30   AIC:                             410.7
Df Residuals:                      28   BIC:                             413.5
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1            86.9246      4.640     18.736      0.000        77.421    96.428
const       -394.8666     78.347     -5.040      0.000      -555.353  -234.380
==============================================================================
Omnibus:                        3.221   Durbin-Watson:                   0.175
Prob(Omnibus):                  0.200   Jarque-Bera (JB):                2.444
Skew:                           0.553   Prob(JB):                        0.295
Kurtosis:                       2.145   Cond. No.                         33.0
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
# Quadratic Fit
X2 = df['x']
X2 = np.vander(X2, 3)  # add a constant and quadratic term
y = df["y"]

model2 = sm.OLS(y, X2)
results2 = model2.fit()
print(results2.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.997
Model:                            OLS   Adj. R-squared:                  0.996
Method:                 Least Squares   F-statistic:                     4003.
Date:                Tue, 14 Jun 2016   Prob (F-statistic):           4.06e-34
Time:                        23:03:42   Log-Likelihood:                -156.99
No. Observations:                  30   AIC:                             320.0
Df Residuals:                      27   BIC:                             324.2
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1             3.1033      0.130     23.798      0.000         2.836     3.371
x2            -3.0725      3.914     -0.785      0.439       -11.103     4.958
const         25.1198     24.517      1.025      0.315       -25.186    75.425
==============================================================================
Omnibus:                        0.146   Durbin-Watson:                   2.670
Prob(Omnibus):                  0.929   Jarque-Bera (JB):                0.021
Skew:                           0.039   Prob(JB):                        0.990
Kurtosis:                       2.897   Cond. No.                     1.10e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.1e+03. This might indicate that there are strong multicollinearity or other numerical problems.

We see that the quadratic fit is better. We can plot each model's predictions against the true values to see how far off it is.

# Plot true values versus the predictions
plt.scatter(df['y'], results.predict(), color="g", label="Linear")
plt.scatter(df['y'], results2.predict(), color="b", label="Quadratic")
plt.xlabel("True Values")
plt.ylabel("Predicted Values")
plt.show()
[Image: predicted versus true values for the linear and quadratic fits]

Although it's a little hard to tell from the plot (since both fits are good), the blue (quadratic) fit lies closer to the line $y = x$, indicating closer agreement between the true values and the model's predictions.
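To put a number on the comparison (a quick check, not in the original notebook), we can look at each fit's sum of squared residuals, which statsmodels exposes as the ssr attribute:

# Sum of squared residuals: smaller is better
print(results.ssr)   # linear fit
print(results2.ssr)  # quadratic fit, substantially smaller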

Higher-order polynomial regressions are as easy as increasing the second argument of numpy.vander: a degree-d fit needs d + 1 columns.
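For example, a cubic fit of the same data is just a wider Vandermonde matrix; this is a sketch in the spirit of the code above, not a cell from the original notebook:

# Cubic fit: np.vander(x, 4) gives columns x**3, x**2, x, 1
X3 = np.vander(df["x"], 4)
results3 = sm.OLS(df["y"], X3).fit()
print(results3.params)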

Exponential functions

We can also transform our data before applying linear regression. This allows us to fit functions such as exponentials of the form $y = A e^{kx}$ using linear regression. Here's some data generated from an exponential function plus noise.

func = lambda x: 2.5 * np.exp(0.5 * x) + stats.norm.rvs(0, 0.3)

df = pd.DataFrame()
df["x"] = np.arange(-1, 4, 0.1)
df["y"] = map(func, df["x"])

df.plot.scatter(x='x', y='y')
<matplotlib.axes._subplots.AxesSubplot at 0x7fb3d9d10250>
[Image: scatter plot of the exponential data]

If we take the log of the y-variable we get something more linear.

df["log-y"] = np.log(df["y"])
df.plot.scatter(x='x', y='log-y')
<matplotlib.axes._subplots.AxesSubplot at 0x7fb3dd31a9d0>
[Image: scatter plot of x versus log-y, showing a roughly linear relationship]

We can then use linear regression to determine $k$ and $\log A$, since taking the $\log$ of both sides of $y = A e^{kx}$ gives us $\log y = \log A + kx$.

X = df["x"]
X = sm.add_constant(X)
y = df["log-y"]

model = sm.OLS(y, X)
results = model.fit()
print(results.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                  log-y   R-squared:                       0.980
Model:                            OLS   Adj. R-squared:                  0.980
Method:                 Least Squares   F-statistic:                     2403.
Date:                Tue, 14 Jun 2016   Prob (F-statistic):           1.17e-42
Time:                        23:11:11   Log-Likelihood:                 39.452
No. Observations:                  50   AIC:                            -74.90
Df Residuals:                      48   BIC:                            -71.08
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          0.8160      0.022     36.278      0.000         0.771     0.861
x              0.5390      0.011     49.021      0.000         0.517     0.561
==============================================================================
Omnibus:                        6.788   Durbin-Watson:                   2.358
Prob(Omnibus):                  0.034   Jarque-Bera (JB):               11.226
Skew:                           0.105   Prob(JB):                      0.00365
Kurtosis:                       5.312   Cond. No.                         3.29
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

As you can see the fit is very good: the estimated slope, 0.539, is close to the true value $k = 0.5$ used to generate the data.
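As a final check (a sketch, not in the original notebook), we can back out the parameters of the original exponential from the fitted coefficients and compare them to the values used to generate the data:

# log(y) = log(A) + k*x, so A = exp(const) and k is the slope on x
log_A = results.params["const"]
k = results.params["x"]
print(np.exp(log_A))  # estimate of A; the true value was 2.5
print(k)              # estimate of k; the true value was 0.5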