Problem Statement
We use the Boston Housing dataset, which contains information about different houses in Boston. The data was originally part of the UCI Machine Learning Repository (it has since been removed there) and can also be accessed from the scikit-learn library. There are 506 samples and 13 feature variables in this dataset.
The objective is to predict house prices using the given features. The dataset object exposes the following attributes:
data: contains the feature values for the various houses
target: the prices of the houses
feature_names: the names of the features
DESCR: a description of the dataset
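As a minimal sketch of inspecting these attributes (assuming an older scikit-learn version, since load_boston was removed in scikit-learn 1.2):

```python
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2

boston = load_boston()
print(boston.keys())        # data, target, feature_names, DESCR, ...
print(boston.data.shape)    # (506, 13)
print(boston.feature_names)
```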
The house price, indicated by the variable MEDV, is our target variable; the remaining columns are the feature variables from which we will predict the value of a house.
Loading the data into a DataFrame
We can see that the target variable MEDV is missing from boston.data, so we create a new column of target values and add it to the DataFrame.
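A minimal loading sketch, using the bunch object from the previous step:

```python
import pandas as pd

# build the DataFrame from the feature matrix and feature names
df = pd.DataFrame(boston.data, columns=boston.feature_names)
df['MEDV'] = boston.target  # append the target as a new column
df.head()
```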
Data preprocessing
We count the number of missing values for each feature using isnull().
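A one-line sketch, assuming the DataFrame from the loading step is named df:

```python
df.isnull().sum()  # missing-value count per column; all zeros for this dataset
```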
Exploratory Data Analysis
Exploratory Data Analysis is a very important step before training the model.
Let’s first plot the distribution of the target variable MEDV using the distplot function from the seaborn library.
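A plotting sketch; note that distplot is deprecated in newer seaborn releases, where histplot is the replacement:

```python
import seaborn as sns
import matplotlib.pyplot as plt

sns.distplot(df['MEDV'], bins=30)
# In seaborn >= 0.11, prefer: sns.histplot(df['MEDV'], bins=30, kde=True)
plt.show()
```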
We see that the values of MEDV are distributed roughly normally, with a few outliers.
The correlation matrix can be computed using the corr method of a pandas DataFrame. We will use the heatmap function from the seaborn library to plot it.
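A heatmap sketch (annot=True prints the coefficient inside each cell; rounding keeps the cells readable):

```python
correlation_matrix = df.corr().round(2)
plt.figure(figsize=(12, 9))
sns.heatmap(data=correlation_matrix, annot=True)
plt.show()
```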
Insights:
To fit a linear regression model, we want features that correlate strongly with our target variable MEDV. From the heatmap, RM has a strong positive correlation with MEDV, whereas LSTAT has a strong negative correlation. Based on these observations we select RM and LSTAT as our features. Using scatter plots (sketched below), let’s see how these features vary with MEDV.
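A scatter-plot sketch for the two chosen features against MEDV:

```python
plt.figure(figsize=(20, 5))
features = ['LSTAT', 'RM']
for i, col in enumerate(features):
    plt.subplot(1, len(features), i + 1)
    plt.scatter(df[col], df['MEDV'], marker='o')
    plt.xlabel(col)
    plt.ylabel('MEDV')
plt.show()
```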
RFE (Recursive Feature Elimination)
The Recursive Feature Elimination (RFE) method works by recursively removing attributes and building a model on the attributes that remain. It uses model accuracy to rank the features according to their importance. The RFE method takes the model to be used and the number of required features as input. It then returns the ranking of all variables, with 1 being the most important, along with a support mask in which True marks a relevant feature and False an irrelevant one.
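A minimal RFE sketch with LinearRegression as the estimator, keeping the top 2 of the 13 features (keeping 2 is a free choice here, matching the two features selected above):

```python
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

X_all = df.drop('MEDV', axis=1)
y_all = df['MEDV']

selector = RFE(LinearRegression(), n_features_to_select=2)
selector.fit(X_all, y_all)
print(selector.support_)   # True for the features kept as relevant
print(selector.ranking_)   # 1 = most important
```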
Insights
Prices increase linearly as the value of RM increases. There are a few outliers, and the data appears to be capped at 50.
Prices tend to decrease as LSTAT increases, although the relationship does not look exactly linear.
Preparing the data for training the model
We concatenate the LSTAT and RM columns using np.c_ provided by the numpy library.
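A sketch of building the feature matrix and target vector from the DataFrame above:

```python
import numpy as np

# stack the two selected feature columns side by side
X = pd.DataFrame(np.c_[df['LSTAT'], df['RM']], columns=['LSTAT', 'RM'])
y = df['MEDV']
```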
Splitting the data into training and testing sets
Next, we split the data into training and testing sets. We train the model with 80% of the samples and test with the remaining 20%. We do this to assess the model’s performance on unseen data.
To split the data we use the train_test_split function provided by the scikit-learn library. Finally, we print the sizes of our training and test sets to verify that the split has occurred properly.
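A splitting sketch (the random_state value is an arbitrary seed chosen for reproducibility):

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=5)
print(X_train.shape, X_test.shape)  # expect (404, 2) and (102, 2)
```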
Training and testing the model
We use scikit-learn’s LinearRegression to fit the model on the training set and then generate predictions for both the training and test sets.
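A training sketch, fitting on the training split only:

```python
from sklearn.linear_model import LinearRegression

lin_model = LinearRegression()
lin_model.fit(X_train, y_train)              # fit on the training data only
y_train_predict = lin_model.predict(X_train)
y_test_predict = lin_model.predict(X_test)
```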
Model evaluation
We will evaluate our model using RMSE (root mean squared error) and the R2 score, which measures the fraction of variance in the target explained by the model.
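An evaluation sketch on the test split, using the predictions from the previous step:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

rmse = np.sqrt(mean_squared_error(y_test, y_test_predict))  # RMSE = sqrt(MSE)
r2 = r2_score(y_test, y_test_predict)
print(f"Test RMSE: {rmse:.2f}")
print(f"Test R2:   {r2:.2f}")
```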
Comparing actual and predicted values
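A quick side-by-side comparison sketch (the column names here are illustrative):

```python
comparison = pd.DataFrame({'Actual': y_test, 'Predicted': y_test_predict})
comparison.head(10)
```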
Conclusion
The model achieves an R2 score of about 0.63 on the test set, i.e., it explains roughly 63% of the variance in house prices.
We can use cross-validation to improve the model further.
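As a sketch of that improvement path, 5-fold cross-validation with cross_val_score (the fold count is a free choice):

```python
from sklearn.model_selection import cross_val_score

# score the model on 5 different train/validation splits
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring='r2')
print(scores)
print(f"Mean R2 across folds: {scores.mean():.2f}")
```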