NDL Training #5: Linear Regression
To motivate our example of Linear Regression, we will look at the problem of predicting housing prices based on a set of given features. This notebook will give you a flavor of what it's like to perform predictive analytics on a given dataset. By the end of this, you should have a good understanding of the process behind machine learning and be able to leverage scikit-learn to perform regression analysis.
Let's start off by importing some of the required libraries we will need for today...
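A minimal import cell might look something like this; the exact set of libraries is an assumption based on what we use later in the notebook:

```python
# Data handling, plotting, and the scikit-learn pieces we will need later.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
```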
Importing the Data
First we will have to import the data. We will use pandas again to import our CSV file.
Let's take a look at the first few records to see what we are working with...
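A sketch of the loading step, assuming the data lives in a local CSV file (the filename below is a placeholder):

```python
# Load the housing data; "housing.csv" is a placeholder for the real file.
df = pd.read_csv("housing.csv")

# Peek at the first few records.
df.head()
```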
Exploratory Data Analysis
Before we dive into the dataset and build our models, let's take a look at the data and explore some of its statistical properties. In the process, we will also consider which columns we might want to use as inputs to our regression model.
Notice that in the first few rows of the data there are many NaN or empty values. We will have to mitigate this issue with some preprocessing.
Last Sold Price
Let's start by looking at the last sold price, our target values for our regression problem.
It would probably be best if we removed any NaN values in this column, since every sample needs a listed price; otherwise we would have feature values with no corresponding target to learn from. So let's start by selecting only the rows where this column is not null (or NaN in this case)...
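One way to do this, assuming the target column is named `'Last Sold Price'` (the column name is an assumption about this dataset's schema):

```python
# Keep only the rows where the target (Last Sold Price) is present.
df = df[df["Last Sold Price"].notnull()]
```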
Now for some descriptive stats regarding the Last Sold Price...
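A quick way to get these, under the same column-name assumption as above:

```python
# Summary statistics (count, mean, std, min, quartiles, max) for the target.
df["Last Sold Price"].describe()
```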
Bed, Bath and Beyond!
Now let's check out some of the input features we have in the dataset, primarily to see how many NaNs they might have and if we can somehow augment the dataset a bit to make it easier for us to work with.
There are certainly many different ways we can approach this, but we will only try a few techniques for handling missing data.
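As a sketch, here are two common options; the column names (`'Bed Count'`, `'Bath Count'`, `'Days on Market'`) are assumptions about this dataset's schema:

```python
# Option 1: drop rows that are missing a given feature entirely.
df = df.dropna(subset=["Bed Count", "Bath Count"])

# Option 2: fill missing values with a simple statistic such as the median.
df["Days on Market"] = df["Days on Market"].fillna(df["Days on Market"].median())
```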
Now let's look at square foot and lot size. For those that are not really familiar with housing terminology:
Sq Foot refers to how much floor area your house takes up.
Lot Size refers to the size of the land or property that you own.
Data Set Final Form
Here is the dataset in its final form. For our first regression analysis, we will use Lot Size as the input to the model.
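A minimal way to assemble that final frame, under the same column-name assumptions as before:

```python
# Keep just the columns we plan to use for the first model.
data = df[["Lot Size", "Last Sold Price"]].dropna()
data.head()
```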
As for other interesting types of exploratory data analysis, consider some of the following analyses as exercises:
House Numbers
Counts of Street Names ending with St, Ave, Ter, Rd, Dr, etc.
Analysis on the street names themselves
Longitude and Latitude
Simple Linear Regression
Now that we have understood the scope of our data, analyzed some of the features and preprocessed some of the NaN values, we can now start building our models.
Recall that the linear regression model has the form $y = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n$, and we want to find the parameters $\beta$ of the corresponding target function. To keep it simple, we will first start with simple linear regression, a mapping from one value to another: $y = \beta_0 + \beta_1 x$. For this example, we will use the lot size to predict the last sold price of the home.
First we must set up the variables to feed into the model: we need our input features and target values, and we also need to reshape the data so the dimensions are what the model expects. Let's also print the shapes just to make sure everything is consistent.
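A sketch of this setup, assuming the columns from the previous step:

```python
# Single input feature (Lot Size) and the target (Last Sold Price).
X = data["Lot Size"].values.reshape(-1, 1)  # shape: (n_samples, 1)
y = data["Last Sold Price"].values          # shape: (n_samples,)

print(X.shape, y.shape)
```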
Now, before we train our model, we must split the dataset into training and validation sets, since we need held-out data to evaluate how well the model predicts housing prices.
For that, we use a helper function provided by scikit-learn to partition the dataset. This function also shuffles the data beforehand so that each split is a random sample.
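That helper is `train_test_split`; the split ratio and random seed below are arbitrary choices:

```python
# Hold out 20% of the samples for validation; the seed keeps the shuffle reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```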
Now that we have our training and validation sets (which we will call "testing" in our code), we can start building our Linear Regression model.
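A minimal sketch of fitting scikit-learn's `LinearRegression` on the training split:

```python
# Fit an ordinary least squares model on the training data.
model = LinearRegression()
model.fit(X_train, y_train)

# The learned parameters of the fitted line.
print("Intercept:", model.intercept_)
print("Coefficient:", model.coef_)
```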
Now that we have trained our first model, let's see how well it did and test it against our validation set.
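A sketch of the evaluation step; here the "variance score" is computed with `r2_score`, though `explained_variance_score` would be a close alternative:

```python
# Predict on the held-out set and score the model.
y_pred = model.predict(X_test)

print("MSE:", mean_squared_error(y_test, y_pred))
print("Variance score:", r2_score(y_test, y_pred))
```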
Congratulations! You trained your first machine learning model!!
But unfortunately it's probably not the best (whew, that MSE...), because of the features we chose for our model. As you can see, we scored a variance score of around 0.43, which is pretty low... 😦
By the way, according to scikit-learn's documentation, the variance score measures the proportion of the variance in the target that the model explains.
The higher the variance score (i.e., the closer it is to 1.0), the better our model is.
But worry not, we can do better than this!
One way to improve our model is to use better or more features. Another way would be to find a dataset with a larger sample size. (But since we can't get any more data, we will have to rely on what we are given, for now...)
Let's try to use some more features from our dataset, such as:
Bed Count
Bath Count
Days on Market
And let's split our data set and train the model!
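A sketch of the multi-feature version, reusing the assumed column names from earlier:

```python
# The extra feature columns are assumptions about the dataset's schema.
features = ["Lot Size", "Bed Count", "Bath Count", "Days on Market"]
data_multi = df[features + ["Last Sold Price"]].dropna()

X = data_multi[features].values
y = data_multi["Last Sold Price"].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print("Variance score:", r2_score(y_test, y_pred))
```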
Wow! Just by adding three more features to our training data, we got a huge boost in our variance score! As you can see, the features you feed into the model have a great influence on how well it performs.
Remember, it's not always about the algorithms... it's all in the data.
Now to wrap this up, let's try predicting how much I should sell my house for according to the model we just made...
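A sketch of that final prediction; the feature values below are purely hypothetical and must follow the same column order used to train the multi-feature model:

```python
# Hypothetical values for one house, in the order: Lot Size, Bed Count,
# Bath Count, Days on Market.
my_house = np.array([[5000, 3, 2, 30]])

predicted_price = model.predict(my_house)
print("Predicted sale price:", predicted_price[0])
```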