Linear Regression with Statsmodels and Scikit-Learn
There are many ways to fit a linear regression, and in Python I find myself commonly using both scikit-learn and statsmodels. This notebook demos some common tasks using these libraries:
Linear regressions in both
Using dummy variables
Multilinear regression
Quadratic and polynomial regressions
Exponential regressions
You can also create polynomial fits with numpy and more general curve fits with scipy.
To get started let's load in the libraries that we'll need.
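A minimal set of imports for the examples below might look like this (a sketch, assuming numpy, pandas, matplotlib, seaborn, scikit-learn, and statsmodels are all installed):

```python
# Core numerics and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Modeling libraries
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
import statsmodels.formula.api as smf
```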
For the first few examples we'll use the famous Iris dataset. Seaborn provides a few data sets including this one. Let's load the data and take a quick look using Seaborn's pairplot.
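A sketch of loading the data and producing the pairplot (Seaborn ships the Iris data under the name `'iris'`):

```python
# Load the iris data set bundled with seaborn
iris = sns.load_dataset('iris')
iris.head()

# Pairwise scatter plots, colored by species
sns.pairplot(iris, hue='species')
plt.show()
```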
You can see a pretty strong linear relationship between petal_length and petal_width. Let's fit a linear regression. Seaborn can plot the data with a regression line, so let's do that first (but it doesn't give us much in the way of statistics).
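One way to get the scatter plot with a regression line in Seaborn is `lmplot` (`regplot` would also work); a sketch:

```python
# Scatter plot of petal_width vs petal_length with a fitted regression line
sns.lmplot(x='petal_length', y='petal_width', data=iris)
plt.show()
```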
Now let's use scikit-learn to find the best fit line.
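A minimal scikit-learn fit might look like the following (variable names here are my own):

```python
# scikit-learn expects a 2-D array of features and a 1-D target
X = iris[['petal_length']].values
y = iris['petal_width'].values

lr = LinearRegression()
lr.fit(X, y)

print('intercept a =', lr.intercept_)
print('slope     b =', lr.coef_[0])
```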
This means that our best fit line is $\hat{y} = a + b x$, where $a$ is the fitted intercept and $b$ is the fitted coefficient (slope). The best fit line is the straight line that best represents the data on a scatter plot: it is the least-squares line, meaning it minimizes the sum of the squared vertical distances between the points and the line.
Next let's use statsmodels.
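Using the formula interface of statsmodels, a sketch of the same regression:

```python
# Ordinary least squares with an R-style formula
model = smf.ols('petal_width ~ petal_length', data=iris).fit()
print(model.summary())
```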
Interpretation: for every one-unit increase in petal_length, petal_width goes up by $\beta_1 = 0.4158$.
Note that the coefficients are almost identical to what we saw before with scikit-learn, and the fit is pretty good (as measured by $R^2$). Let's see if adding in the species helps. Since that feature is categorical, we need to use dummy variables.
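One common way to create the dummy variables is pandas' `get_dummies`; a sketch (the column names are whatever pandas generates from the species labels):

```python
# One-hot encode the categorical species column
dummies = pd.get_dummies(iris['species'], prefix='species')
iris_dummies = pd.concat([iris, dummies], axis=1)
iris_dummies.head()
```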
Now we perform a multilinear regression with the dummy variables added.
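With the formula interface, statsmodels will expand a categorical column into dummy variables for us, so one way to write the multilinear fit is:

```python
# species is categorical, so statsmodels expands it into dummy variables
# (one level is dropped automatically to avoid collinearity)
multi_model = smf.ols('petal_width ~ petal_length + species', data=iris).fit()
print(multi_model.summary())
```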
In this case it looks like we got a slight improvement from including the dummy variables. The dummy variables have a bigger impact on a fit between petal_length and sepal_length.
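For comparison, a sketch of that fit for sepal_length predicted from petal_length, with and without the species dummies:

```python
# Without species
plain = smf.ols('sepal_length ~ petal_length', data=iris).fit()
# With species dummies
with_species = smf.ols('sepal_length ~ petal_length + species', data=iris).fit()

print('R^2 without species:', plain.rsquared)
print('R^2 with species:   ', with_species.rsquared)
```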
Quadratic Fit
Next we look at a data set that needs a quadratic fit. Let's do both a linear and quadratic fit and compare.
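The notebook's actual data set isn't reproduced here, so this sketch generates some noisy quadratic data and compares a linear and a quadratic fit using numpy's Vandermonde matrix with statsmodels (one of several ways to do this):

```python
# Fake some quadratic data with noise for illustration
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 100)
y = 2 + 0.5 * x + 1.5 * x**2 + rng.normal(scale=1.0, size=x.shape)

# Design matrices: np.vander(x, N) has columns x**(N-1), ..., x, 1
X_lin = np.vander(x, 2)    # columns: x, 1      -> linear fit
X_quad = np.vander(x, 3)   # columns: x**2, x, 1 -> quadratic fit

fit_lin = sm.OLS(y, X_lin).fit()
fit_quad = sm.OLS(y, X_quad).fit()

print('linear    R^2:', fit_lin.rsquared)
print('quadratic R^2:', fit_quad.rsquared)
```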
We see that the quadratic fit is better. We can plot the residuals in both cases to see how far off the models are in each case.
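A sketch of plotting the residuals for both fits (continuing with the fit objects from the cell above):

```python
# Residuals = observed minus fitted values
plt.scatter(x, fit_lin.resid, label='linear fit residuals')
plt.scatter(x, fit_quad.resid, label='quadratic fit residuals')
plt.axhline(0, color='black', linewidth=1)
plt.xlabel('x')
plt.ylabel('residual')
plt.legend()
plt.show()
```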
Although it's a little hard to tell from the plot (since both fits are good), the blue (quadratic) fit is closer to "y = x", indicating closer agreement between the true values and the model's predictions.
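The plot being described appears to be predicted values against true values; a sketch of that comparison:

```python
# Predicted vs. actual: points on the line y = x would be perfect predictions
plt.scatter(y, fit_lin.fittedvalues, label='linear fit', alpha=0.6)
plt.scatter(y, fit_quad.fittedvalues, label='quadratic fit', alpha=0.6)
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', label='y = x')
plt.xlabel('actual')
plt.ylabel('predicted')
plt.legend()
plt.show()
```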
Higher-order polynomial regressions are as easy as increasing the number of columns (the N argument) in numpy.vander.
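For example, a cubic fit only changes the design matrix (a sketch, reusing the synthetic x and y from above):

```python
# np.vander(x, 4) gives columns x**3, x**2, x, 1 -> cubic fit
X_cubic = np.vander(x, 4)
fit_cubic = sm.OLS(y, X_cubic).fit()
print('cubic R^2:', fit_cubic.rsquared)
```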
Exponential functions
We can also transform our data before applying linear regression. This allows us to fit functions such as exponentials of the form $y = A e^{b x}$ using linear regression. Here's some exponentially distributed data.
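The notebook's actual data isn't shown here, so this sketch generates some noisy exponential data for illustration:

```python
# Synthetic data following y = A * exp(b * x), with multiplicative noise
rng = np.random.default_rng(1)
A_true, b_true = 2.0, 0.7
x = np.linspace(0, 5, 80)
y = A_true * np.exp(b_true * x) * rng.lognormal(sigma=0.1, size=x.shape)

plt.scatter(x, y)
plt.xlabel('x')
plt.ylabel('y')
plt.show()
```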
If we take the log of the y-variable we get something more linear.
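A quick sketch of that transformation:

```python
# Plot log(y) against x; an exponential trend becomes a straight line
plt.scatter(x, np.log(y))
plt.xlabel('x')
plt.ylabel('log(y)')
plt.show()
```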
We can then use linear regression to determine $A$ and $b$, since taking the $\log$ of both sides of $y = A e^{b x}$ gives us $\log y = \log A + b x$.
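One way to carry that out (a sketch, reusing the synthetic data above): regress log(y) on x, then recover $A$ by exponentiating the intercept.

```python
# Linear regression on the log-transformed data
log_fit = sm.OLS(np.log(y), sm.add_constant(x)).fit()
log_A, b = log_fit.params          # intercept is log(A), slope is b
A = np.exp(log_A)
print('A =', A, 'b =', b)

# Overlay the fitted exponential on the original data
plt.scatter(x, y, label='data')
plt.plot(x, A * np.exp(b * x), color='red', label='fitted exponential')
plt.legend()
plt.show()
```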
As you can see the fit is very good.