GitHub Repository: suyashi29/python-su
Path: blob/master/ML/Notebook/Use Case Linear Regression.ipynb
Kernel: Python 3

Problem Statement

  • The Boston Housing dataset contains information about different houses in Boston. The data was originally part of the UCI Machine Learning Repository (it has since been removed there) and can also be accessed through the scikit-learn library. There are 506 samples and 13 feature variables in this dataset.

The objective is to predict the price of a house using the given features.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
from sklearn.datasets import load_boston

boston_dataset = load_boston()
print(boston_dataset.keys())
dict_keys(['data', 'target', 'feature_names', 'DESCR'])
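
Note: load_boston has been deprecated and removed in recent scikit-learn releases (1.2+) because of ethical concerns around the B feature. If your scikit-learn version no longer provides it, the same arrays can be rebuilt from the original StatLib file — a sketch based on the loader's own deprecation notice, assuming the CMU StatLib copy of the dataset is still reachable:

data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
# even rows hold the first 11 feature columns; odd rows hold the last
# two features plus the target (MEDV)
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]
feature_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS',
                 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
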
  • data: contains the information for various houses

  • target: prices of the house

  • feature_names: names of the features

  • DESCR: describes the dataset

boston_dataset.DESCR
".. _boston_dataset:\n\nBoston house prices dataset\n---------------------------\n\n**Data Set Characteristics:** \n\n :Number of Instances: 506 \n\n :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.\n\n :Attribute Information (in order):\n - CRIM per capita crime rate by town\n - ZN proportion of residential land zoned for lots over 25,000 sq.ft.\n - INDUS proportion of non-retail business acres per town\n - CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)\n - NOX nitric oxides concentration (parts per 10 million)\n - RM average number of rooms per dwelling\n - AGE proportion of owner-occupied units built prior to 1940\n - DIS weighted distances to five Boston employment centres\n - RAD index of accessibility to radial highways\n - TAX full-value property-tax rate per $10,000\n - PTRATIO pupil-teacher ratio by town\n - B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town\n - LSTAT % lower status of the population\n - MEDV Median value of owner-occupied homes in $1000's\n\n :Missing Attribute Values: None\n\n :Creator: Harrison, D. and Rubinfeld, D.L.\n\nThis is a copy of UCI ML housing dataset.\nhttps://archive.ics.uci.edu/ml/machine-learning-databases/housing/\n\n\nThis dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.\n\nThe Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic\nprices and the demand for clean air', J. Environ. Economics & Management,\nvol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics\n...', Wiley, 1980. N.B. Various transformations are used in the table on\npages 244-261 of the latter.\n\nThe Boston house-price data has been used in many machine learning papers that address regression\nproblems. \n \n.. topic:: References\n\n - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.\n - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.\n"
  • The house price, indicated by the variable MEDV, is our target variable; the remaining columns are the feature variables from which we will predict the value of a house.

Loading data into a DataFrame

boston = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
boston.head(3)
  • We can see that the target value MEDV is missing from the data. We create a new column of target values and add it to the dataframe.

boston['MEDV'] = boston_dataset.target
boston["MEDV"].describe()
count    506.000000
mean      22.532806
std        9.197104
min        5.000000
25%       17.025000
50%       21.200000
75%       25.000000
max       50.000000
Name: MEDV, dtype: float64

Data preprocessing

  • We count the number of missing values for each feature using isnull()

boston.isnull().sum()
CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
B          0
LSTAT      0
MEDV       0
dtype: int64

Exploratory Data Analysis

Exploratory Data Analysis is a very important step before training the model.

  • Let’s first plot the distribution of the target variable MEDV. We will use the distplot function from the seaborn library

sns.set(rc={'figure.figsize': (11.7, 8.27)})
sns.distplot(boston['MEDV'], bins=30)
plt.show()
Image in a Jupyter notebook
  • We see that the values of MEDV are distributed roughly normally, with a few outliers
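
Note: distplot has been deprecated in recent seaborn versions. If it is unavailable, a roughly equivalent plot (our substitution, not part of the original notebook) can be drawn with histplot:

sns.histplot(boston['MEDV'], bins=30, kde=True)
plt.show()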

The correlation matrix can be computed with the corr method of the pandas DataFrame. We will use the heatmap function from the seaborn library to plot the correlation matrix.

import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

sns.set()
plt.figure(figsize=(20, 20))
cor = boston.corr().round(2)
sns.heatmap(cor, annot=True, cmap=plt.cm.Reds)
plt.show()
Image in a Jupyter notebook
correlation_matrix = boston.corr().round(2)
# annot = True to print the values inside the squares
#plt.figure(figsize=(30,30))
sns.heatmap(data=correlation_matrix, annot=True)

Insights:

  • To fit a linear regression model, we select features that have a high correlation with our target variable MEDV. Looking at the correlation matrix, RM has a strong positive correlation with MEDV (0.70), whereas LSTAT has a strong negative correlation with MEDV (-0.74).

  • An important point when selecting features for a linear regression model is to check for multicollinearity. The features RAD and TAX have a correlation of 0.91, so they are strongly correlated with each other and we should not select both of them for training the model. The same goes for DIS and AGE, which have a correlation of -0.75. One common check is sketched below.
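
A minimal sketch of such a multicollinearity check, using the variance inflation factor (VIF) from statsmodels (statsmodels is an extra dependency not used elsewhere in this notebook; the rule of thumb that VIF values above roughly 5 signal problematic collinearity is a common convention, not something the notebook states):

from statsmodels.stats.outliers_influence import variance_inflation_factor

# VIF of each feature against all the others; large values mean the feature
# is largely a linear combination of the remaining features
features = boston.drop(columns=['MEDV'])
vif = pd.Series(
    [variance_inflation_factor(features.values, i) for i in range(features.shape[1])],
    index=features.columns,
)
print(vif.sort_values(ascending=False))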

Based on the above observations we will use RM and LSTAT as our features. Using scatter plots, let's see how these features (together with PTRATIO, which also has a correlation above 0.5 with MEDV) vary with MEDV.

# Correlation with the output variable
cor_target = abs(cor['MEDV'])
# Selecting highly correlated features
relevant_features = cor_target[cor_target > 0.5]
relevant_features
RM         0.70
PTRATIO    0.51
LSTAT      0.74
MEDV       1.00
Name: MEDV, dtype: float64
plt.figure(figsize=(20, 5))

features = ['LSTAT', 'RM', 'PTRATIO']
target = boston['MEDV']

for i, col in enumerate(features):
    plt.subplot(1, len(features), i + 1)
    x = boston[col]
    y = target
    plt.scatter(x, y, marker='o')
    plt.title(col)
    plt.xlabel(col)
    plt.ylabel('MEDV')
Text(0, 0.5, 'MEDV')
Image in a Jupyter notebook

Insights

  • Prices increase roughly linearly as the value of RM increases. There are a few outliers, and the data appears to be capped at 50.

  • Prices tend to decrease as LSTAT increases, although the relationship does not look exactly linear.

Preparing the data for training the model

  • We concatenate the LSTAT and RM columns using np.c_ provided by the numpy library.

X = pd.DataFrame(np.c_[boston['LSTAT'], boston['RM']], columns=['LSTAT', 'RM'])
Y = boston['MEDV']
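
An equivalent way to build X without np.c_ (just a stylistic alternative, not what the notebook uses) is to select the columns directly from the dataframe:

X = boston[['LSTAT', 'RM']]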

Splitting the data into training and testing sets

  • Next, we split the data into training and testing sets. We train the model with 80% of the samples and test with the remaining 20%. We do this to assess the model’s performance on unseen data.

  • To split the data we use the train_test_split function provided by the scikit-learn library. Finally, we print the sizes of our training and test sets to verify that the split has occurred properly.

from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=5)
print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)
(404, 2)
(102, 2)
(404,)
(102,)

Training and testing the model

  • We use scikit-learn's LinearRegression to train the model on the training set and then evaluate it on both the training and test sets.

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

lin_model = LinearRegression()
lin_model.fit(X_train, Y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
print(lin_model.intercept_)
2.7362403426066173
print(lin_model.coef_)
[-0.71722954 4.58938833]
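
Putting the intercept and coefficients together (the coefficient order follows the columns of X, i.e. LSTAT then RM), the fitted model is approximately:

MEDV ≈ 2.74 - 0.72 * LSTAT + 4.59 * RM

So, holding LSTAT fixed, each additional room (RM) adds roughly 4.59 to the predicted median price, i.e. about $4,590, since MEDV is expressed in $1000's.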

Model evaluation

We will evaluate our model using RMSE and R2-score
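
For reference (standard definitions, not notebook output): with y_i the actual values, y_hat_i the predictions, y_bar the mean of the actual values and n the number of samples,

RMSE = sqrt( (1/n) * sum_i (y_i - y_hat_i)^2 )
R2 = 1 - sum_i (y_i - y_hat_i)^2 / sum_i (y_i - y_bar)^2

Lower RMSE is better (it is in the same units as MEDV, i.e. $1000's), and an R2 closer to 1 means more of the variance in MEDV is explained by the model.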

# model evaluation for training set
from sklearn.metrics import mean_squared_error, r2_score

y_train_predict = lin_model.predict(X_train)
rmse = np.sqrt(mean_squared_error(Y_train, y_train_predict))
r2 = r2_score(Y_train, y_train_predict)

print("The model performance for training set")
print("--------------------------------------")
print('RMSE is {}'.format(rmse))
print('R2 score is {}'.format(r2))
print("\n")

# model evaluation for testing set
y_test_predict = lin_model.predict(X_test)
rmse = np.sqrt(mean_squared_error(Y_test, y_test_predict))
r2 = r2_score(Y_test, y_test_predict)

print("The model performance for testing set")
print("--------------------------------------")
print('RMSE is {}'.format(rmse))
print('R2 score is {}'.format(r2))
The model performance for training set
--------------------------------------
RMSE is 5.6371293350711955
R2 score is 0.6300745149331701

The model performance for testing set
--------------------------------------
RMSE is 5.137400784702911
R2 score is 0.6628996975186953

Comparing Actual and Predicted value

r = pd.DataFrame({'Actual': Y_test, 'Predicted': y_test_predict})
r
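
A scatter plot of actual against predicted values is often a quicker sanity check than scanning the table (a sketch using the variables defined above; the dashed diagonal marking perfect predictions is our addition):

plt.figure(figsize=(6, 6))
plt.scatter(Y_test, y_test_predict, marker='o')
# points on this diagonal would be perfectly predicted
plt.plot([Y_test.min(), Y_test.max()], [Y_test.min(), Y_test.max()], linestyle='--')
plt.xlabel('Actual MEDV')
plt.ylabel('Predicted MEDV')
plt.show()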

Conclusion

  • The R2 scores indicate that the model explains roughly 63% of the variance in MEDV on the training data and about 66% on the test data.

  • We can use cross-validation to obtain a more reliable estimate of model performance and to guide model improvement; a sketch is shown below.
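
A minimal cross-validation sketch with scikit-learn's cross_val_score (the choice of 5 folds and R2 scoring are illustrative assumptions, not from the original notebook):

from sklearn.model_selection import cross_val_score

# 5-fold cross-validation of the same two-feature linear model,
# reporting the R2 score on each held-out fold
scores = cross_val_score(LinearRegression(), X, Y, cv=5, scoring='r2')
print(scores)
print("Mean R2 across folds: {:.3f}".format(scores.mean()))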