Linear Regression with Statsmodels and Scikit-Learn
There are many ways to fit a linear regression, and in Python I find myself commonly using both scikit-learn and statsmodels. This notebook demos some common tasks using these libraries:
Linear regressions in both
Using dummy variables
Multilinear regression
Quadratic and polynomial regressions
Exponential regressions
You can also create polynomial fits with numpy and more general curve fits with scipy.
To get started, let's load the libraries we'll need.
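The original import cell isn't reproduced here; an import cell along these lines covers everything used below (assuming numpy, pandas, matplotlib, seaborn, statsmodels, and scikit-learn are installed):

```python
# Core scientific stack
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Modeling libraries
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

%matplotlib inline
```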
For the first few examples we'll use the famous Iris dataset. Seaborn provides a few data sets including this one. Let's load the data and take a quick look using Seaborn's pairplot.
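Loading and plotting might look roughly like this (coloring by species via the hue argument is just one option):

```python
# Load the Iris data set bundled with seaborn
iris = sns.load_dataset("iris")
print(iris.head())

# Pairwise scatter plots of the four measurements, colored by species
sns.pairplot(iris, hue="species")
plt.show()
```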
You can see a pretty strong linear relationship between petal_length and petal_width. Let's fit a linear regression. Seaborn can plot the data with a regression line, so let's do that first (but it doesn't give us much in the way of statistics).
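A sketch of the Seaborn regression plot (column names as in the Iris data set):

```python
# lmplot overlays a least-squares regression line (with a confidence band) on the scatter plot
sns.lmplot(x="petal_length", y="petal_width", data=iris)
plt.show()
```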
Now let's use scikit-learn to find the best fit line.
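A sketch using scikit-learn's LinearRegression (variable names here are illustrative):

```python
# scikit-learn wants a 2-D feature array, so keep petal_length as a one-column matrix
X = iris[["petal_length"]].values   # shape (150, 1)
y = iris["petal_width"].values

lr = LinearRegression()             # fits an intercept by default
lr.fit(X, y)

print("slope:    ", lr.coef_[0])
print("intercept:", lr.intercept_)
```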
This means that our best fit line is $y = mx + b$, where $m$ is the slope and $b$ is the intercept printed above.
Next, let's use statsmodels.
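Roughly, with sm.OLS on the raw petal_length column:

```python
# statsmodels uses the design matrix exactly as given, so this fit has no intercept term
x = iris["petal_length"].values

results = sm.OLS(y, x).fit()
print(results.summary())
```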
If you look closely you'll note that this model doesn't include an intercept by default like scikit-learn does. There's an easy way to add one using numpy's Vandermonde matrix function numpy.vander.
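For example:

```python
# Columns of the Vandermonde matrix are x**(N-1), ..., x**1, x**0
np.vander([1, 2, 3, 4], 3)
```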
As you can see, np.vander gives us the powers of the input array. We can use it simply to add a constant column for the intercept (zeroth power) or to do polynomial fits (larger exponents). First let's redo the statsmodels fit with an intercept.
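Something like:

```python
# np.vander(x, 2) has columns [x, 1]; the column of ones gives the intercept
X_with_const = np.vander(x, 2)

results = sm.OLS(y, X_with_const).fit()
print(results.summary())
```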
Note that the coefficients are almost identical to what we saw before with scikit-learn, and the fit is pretty good (as measured by $R^2$ in the summary). Let's see if adding in the species helps. Since that feature is categorical, we need to use dummy variables.
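With pandas this is one call (a sketch):

```python
# One 0/1 indicator column per species
dummies = pd.get_dummies(iris["species"])
print(dummies.head())
```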
Now we perform a multilinear regression with the dummy variables added.
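One way to assemble the design matrix (a sketch; the original cell may differ):

```python
# Stack petal_length with the species indicators; the three 0/1 columns act as
# species-specific intercepts, so no separate constant column is needed
X_dummies = np.column_stack([x, dummies.values]).astype(float)

results = sm.OLS(y, X_dummies).fit()
print(results.summary())
```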
In this case it looks like we got a slight improvement from including the dummy variables. The dummy variables have a bigger impact on a fit between petal_length and sepal_length.
Quadratic Fit
Next we look at a data set that needs a quadratic fit. Let's do both a linear and quadratic fit and compare.
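The original data set isn't shown here, so here's a synthetic stand-in with a quadratic trend, fit both ways with np.vander:

```python
# Synthetic stand-in: a quadratic trend plus noise
rng = np.random.default_rng(0)
xq = np.linspace(-3, 3, 100)
yq = 2 * xq**2 - xq + 1 + rng.normal(scale=2, size=xq.size)

# Linear fit uses the design matrix [x, 1]; the quadratic fit adds an x**2 column
lin_res = sm.OLS(yq, np.vander(xq, 2)).fit()
quad_res = sm.OLS(yq, np.vander(xq, 3)).fit()
print("linear    R^2:", lin_res.rsquared)
print("quadratic R^2:", quad_res.rsquared)

plt.scatter(xq, yq, s=10, color="gray", label="data")
plt.plot(xq, lin_res.predict(np.vander(xq, 2)), color="orange", label="linear fit")
plt.plot(xq, quad_res.predict(np.vander(xq, 3)), color="blue", label="quadratic fit")
plt.legend()
plt.show()
```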
We see that the quadratic fit is better. We can plot each model's predictions against the true values to see how far off the models are.
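One way to make the comparison, interpreting the plot as predictions versus true values (which is what the $y = x$ reference below suggests):

```python
# Predicted vs. true values; a perfect model would sit on the line y = x
plt.scatter(yq, lin_res.predict(np.vander(xq, 2)), s=10, color="orange", label="linear")
plt.scatter(yq, quad_res.predict(np.vander(xq, 3)), s=10, color="blue", label="quadratic")
lims = [yq.min(), yq.max()]
plt.plot(lims, lims, "k--", label="y = x")
plt.xlabel("true value")
plt.ylabel("predicted value")
plt.legend()
plt.show()
```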
Although it's a little hard to tell from the plot (since both fits are good), the blue (quadratic) fit is closer to $y = x$, indicating closer agreement between the model's predictions and the true values.
Higher order polynomial regressions are as easy as adding columns to the Vandermonde matrix, i.e. increasing the second argument to numpy.vander.
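For example, a cubic fit:

```python
# A cubic fit just means one more column: np.vander(xq, 4) has columns [x**3, x**2, x, 1]
cubic_res = sm.OLS(yq, np.vander(xq, 4)).fit()
print("cubic R^2:", cubic_res.rsquared)
```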
Exponential functions
We can also transform our data before applying linear regression. This allows us to fit functions such as exponentials of the form $y = A e^{kx}$ using linear regression. Here's some data that follows an exponential trend.
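Again the original data isn't shown, so here's a synthetic stand-in (with multiplicative noise so the log transform stays clean):

```python
# Synthetic stand-in: y = A * exp(k * x), with multiplicative noise
rng = np.random.default_rng(1)
A_true, k_true = 2.0, 0.5
xe = np.linspace(0, 5, 80)
ye = A_true * np.exp(k_true * xe) * np.exp(rng.normal(scale=0.1, size=xe.size))

plt.scatter(xe, ye, s=10)
plt.xlabel("x")
plt.ylabel("y")
plt.show()
```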
If we take the log of the y-variable, we get something more linear.
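For example:

```python
# The same data with log(y) on the vertical axis is close to a straight line
plt.scatter(xe, np.log(ye), s=10)
plt.xlabel("x")
plt.ylabel("log(y)")
plt.show()
```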
We can then use linear regression to determine $A$ and $k$, since taking the $\log$ of both sides of $y = A e^{kx}$ gives us $\log y = \log A + kx$.
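A sketch of the fit on the transformed data and the recovered parameters:

```python
# Regress log(y) on [x, 1]: the slope estimates k and the intercept estimates log(A)
log_res = sm.OLS(np.log(ye), np.vander(xe, 2)).fit()
k_hat, logA_hat = log_res.params
A_hat = np.exp(logA_hat)
print("estimated k:", k_hat)
print("estimated A:", A_hat)

# Overlay the recovered exponential on the original data
plt.scatter(xe, ye, s=10, label="data")
plt.plot(xe, A_hat * np.exp(k_hat * xe), color="red", label="fitted curve")
plt.legend()
plt.show()
```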
As you can see, the fit is very good.