GitHub Repository: suyashi29/python-su
Path: blob/master/ML/Notebook/Use Case Linear Regression.ipynb
Kernel: Python 3

Problem Statement

  • The Boston Housing dataset contains information about different houses in Boston. The data was originally part of the UCI Machine Learning Repository (it has since been removed there) and can also be accessed through the scikit-learn library. There are 506 samples and 13 feature variables in this dataset.

The objective is to predict the price of a house using the given features.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
from sklearn.datasets import load_boston

boston_dataset = load_boston()
print(boston_dataset.keys())
dict_keys(['data', 'target', 'feature_names', 'DESCR'])
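
Note: load_boston has been deprecated and removed in recent scikit-learn releases (1.2+) because of ethical concerns around the B feature. If your scikit-learn version no longer provides it, the same arrays can be rebuilt from the original StatLib file — a sketch based on the loader's own deprecation notice, assuming the CMU StatLib copy of the dataset is still reachable:

data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
# even rows hold the first 11 feature columns; odd rows hold the last
# two features plus the target (MEDV)
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]
feature_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS',
                 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
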
  • data: contains the information for various houses

  • target: prices of the house

  • feature_names: names of the features

  • DESCR: describes the dataset

boston_dataset.DESCR
".. _boston_dataset:\n\nBoston house prices dataset\n---------------------------\n\n**Data Set Characteristics:** \n\n :Number of Instances: 506 \n\n :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.\n\n :Attribute Information (in order):\n - CRIM per capita crime rate by town\n - ZN proportion of residential land zoned for lots over 25,000 sq.ft.\n - INDUS proportion of non-retail business acres per town\n - CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)\n - NOX nitric oxides concentration (parts per 10 million)\n - RM average number of rooms per dwelling\n - AGE proportion of owner-occupied units built prior to 1940\n - DIS weighted distances to five Boston employment centres\n - RAD index of accessibility to radial highways\n - TAX full-value property-tax rate per $10,000\n - PTRATIO pupil-teacher ratio by town\n - B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town\n - LSTAT % lower status of the population\n - MEDV Median value of owner-occupied homes in $1000's\n\n :Missing Attribute Values: None\n\n :Creator: Harrison, D. and Rubinfeld, D.L.\n\nThis is a copy of UCI ML housing dataset.\nhttps://archive.ics.uci.edu/ml/machine-learning-databases/housing/\n\n\nThis dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.\n\nThe Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic\nprices and the demand for clean air', J. Environ. Economics & Management,\nvol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics\n...', Wiley, 1980. N.B. Various transformations are used in the table on\npages 244-261 of the latter.\n\nThe Boston house-price data has been used in many machine learning papers that address regression\nproblems. \n \n.. topic:: References\n\n - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.\n - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.\n"
  • The house price, indicated by the variable MEDV, is our target variable; the remaining columns are the feature variables from which we will predict the value of a house.

Loading data into a DataFrame

boston = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
boston.head(3)
  • We can see that the target value MEDV is missing from the data. We create a new column of target values and add it to the dataframe.

boston['MEDV'] = boston_dataset.target
boston["MEDV"].describe()
count    506.000000
mean      22.532806
std        9.197104
min        5.000000
25%       17.025000
50%       21.200000
75%       25.000000
max       50.000000
Name: MEDV, dtype: float64

Data preprocessing

  • We count the number of missing values for each feature using isnull()

boston.isnull().sum()
CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
B          0
LSTAT      0
MEDV       0
dtype: int64

Exploratory Data Analysis

Exploratory Data Analysis is a very important step before training the model.

  • Let’s first plot the distribution of the target variable MEDV. We will use the distplot function from the seaborn library

sns.set(rc={'figure.figsize': (11.7, 8.27)})
sns.distplot(boston['MEDV'], bins=30)
plt.show()
Image in a Jupyter notebook
  • We see that the values of MEDV are distributed roughly normally, with a few outliers
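
Note: distplot has been deprecated in recent seaborn versions. If it is unavailable, a roughly equivalent plot (our substitution, not part of the original notebook) can be drawn with histplot:

sns.histplot(boston['MEDV'], bins=30, kde=True)
plt.show()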

The correlation matrix can be computed with the corr method of the pandas DataFrame. We will use the heatmap function from the seaborn library to plot the correlation matrix.

import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

sns.set()
plt.figure(figsize=(20, 20))
cor = boston.corr().round(2)
sns.heatmap(cor, annot=True, cmap=plt.cm.Reds)
plt.show()
Image in a Jupyter notebook
correlation_matrix = boston.corr().round(2)
# annot = True to print the values inside the squares
#plt.figure(figsize=(30,30))
sns.heatmap(data=correlation_matrix, annot=True)

Insights:

  • To fit a linear regression model, we select features that have a high correlation with our target variable MEDV. Looking at the correlation matrix, RM has a strong positive correlation with MEDV (0.70), whereas LSTAT has a strong negative correlation with MEDV (-0.74).

  • An important point when selecting features for a linear regression model is to check for multicollinearity. The features RAD and TAX have a correlation of 0.91, so they are strongly correlated with each other and we should not select both of them for training the model. The same goes for DIS and AGE, which have a correlation of -0.75. One common check is sketched below.
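
A minimal sketch of such a multicollinearity check, using the variance inflation factor (VIF) from statsmodels (statsmodels is an extra dependency not used elsewhere in this notebook; the rule of thumb that VIF values above roughly 5 signal problematic collinearity is a common convention, not something the notebook states):

from statsmodels.stats.outliers_influence import variance_inflation_factor

# VIF of each feature against all the others; large values mean the feature
# is largely a linear combination of the remaining features
features = boston.drop(columns=['MEDV'])
vif = pd.Series(
    [variance_inflation_factor(features.values, i) for i in range(features.shape[1])],
    index=features.columns,
)
print(vif.sort_values(ascending=False))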

Based on the above observations we will use RM and LSTAT as our features. Using scatter plots, let's see how these features (together with PTRATIO, which also has a correlation above 0.5 with MEDV) vary with MEDV.

# Correlation with the output variable
cor_target = abs(cor['MEDV'])
# Selecting highly correlated features
relevant_features = cor_target[cor_target > 0.5]
relevant_features
RM         0.70
PTRATIO    0.51
LSTAT      0.74
MEDV       1.00
Name: MEDV, dtype: float64
plt.figure(figsize=(20, 5))

features = ['LSTAT', 'RM', 'PTRATIO']
target = boston['MEDV']

for i, col in enumerate(features):
    plt.subplot(1, len(features), i + 1)
    x = boston[col]
    y = target
    plt.scatter(x, y, marker='o')
    plt.title(col)
    plt.xlabel(col)
    plt.ylabel('MEDV')
Text(0, 0.5, 'MEDV')
Image in a Jupyter notebook

Insights

  • Prices increase roughly linearly as the value of RM increases. There are a few outliers, and the data appears to be capped at 50.

  • Prices tend to decrease as LSTAT increases, although the relationship does not look exactly linear.

Preparing the data for training the model

  • We concatenate the LSTAT and RM columns using np.c_ provided by the numpy library.

X = pd.DataFrame(np.c_[boston['LSTAT'], boston['RM']], columns=['LSTAT', 'RM'])
Y = boston['MEDV']
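
An equivalent way to build X without np.c_ (just a stylistic alternative, not what the notebook uses) is to select the columns directly from the dataframe:

X = boston[['LSTAT', 'RM']]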

Splitting the data into training and testing sets

  • Next, we split the data into training and testing sets. We train the model with 80% of the samples and test with the remaining 20%. We do this to assess the model’s performance on unseen data.

  • To split the data we use the train_test_split function provided by the scikit-learn library. Finally, we print the sizes of our training and test sets to verify that the split has occurred properly.

from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=5)
print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)
(404, 2)
(102, 2)
(404,)
(102,)

Training and testing the model

  • We use scikit-learn's LinearRegression to train the model on the training set and then evaluate it on both the training and test sets.

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

lin_model = LinearRegression()
lin_model.fit(X_train, Y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
print(lin_model.intercept_)
2.7362403426066173
print(lin_model.coef_)
[-0.71722954 4.58938833]
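
Putting the intercept and coefficients together (the coefficient order follows the columns of X, i.e. LSTAT then RM), the fitted model is approximately:

MEDV ≈ 2.74 - 0.72 * LSTAT + 4.59 * RM

So, holding LSTAT fixed, each additional room (RM) adds roughly 4.59 to the predicted median price, i.e. about $4,590, since MEDV is expressed in $1000's.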

Model evaluation

We will evaluate our model using RMSE and R2-score
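
For reference (standard definitions, not notebook output): with y_i the actual values, y_hat_i the predictions, y_bar the mean of the actual values and n the number of samples,

RMSE = sqrt( (1/n) * sum_i (y_i - y_hat_i)^2 )
R2 = 1 - sum_i (y_i - y_hat_i)^2 / sum_i (y_i - y_bar)^2

Lower RMSE is better (it is in the same units as MEDV, i.e. $1000's), and an R2 closer to 1 means more of the variance in MEDV is explained by the model.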

# model evaluation for training set
from sklearn.metrics import mean_squared_error, r2_score

y_train_predict = lin_model.predict(X_train)
rmse = np.sqrt(mean_squared_error(Y_train, y_train_predict))
r2 = r2_score(Y_train, y_train_predict)

print("The model performance for training set")
print("--------------------------------------")
print('RMSE is {}'.format(rmse))
print('R2 score is {}'.format(r2))
print("\n")

# model evaluation for testing set
y_test_predict = lin_model.predict(X_test)
rmse = np.sqrt(mean_squared_error(Y_test, y_test_predict))
r2 = r2_score(Y_test, y_test_predict)

print("The model performance for testing set")
print("--------------------------------------")
print('RMSE is {}'.format(rmse))
print('R2 score is {}'.format(r2))
The model performance for training set
--------------------------------------
RMSE is 5.6371293350711955
R2 score is 0.6300745149331701

The model performance for testing set
--------------------------------------
RMSE is 5.137400784702911
R2 score is 0.6628996975186953

Comparing Actual and Predicted value

r = pd.DataFrame({'Actual': Y_test, 'Predicted': y_test_predict})
r
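
A scatter plot of actual against predicted values is often a quicker sanity check than scanning the table (a sketch using the variables defined above; the dashed diagonal marking perfect predictions is our addition):

plt.figure(figsize=(6, 6))
plt.scatter(Y_test, y_test_predict, marker='o')
# points on this diagonal would be perfectly predicted
plt.plot([Y_test.min(), Y_test.max()], [Y_test.min(), Y_test.max()], linestyle='--')
plt.xlabel('Actual MEDV')
plt.ylabel('Predicted MEDV')
plt.show()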

Conclusion

  • The R2 scores indicate that the model explains roughly 63% of the variance in MEDV on the training data and about 66% on the test data.

  • We can use cross-validation to obtain a more reliable estimate of model performance and to guide model improvement; a sketch is shown below.
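
A minimal cross-validation sketch with scikit-learn's cross_val_score (the choice of 5 folds and R2 scoring are illustrative assumptions, not from the original notebook):

from sklearn.model_selection import cross_val_score

# 5-fold cross-validation of the same two-feature linear model,
# reporting the R2 score on each held-out fold
scores = cross_val_score(LinearRegression(), X, Y, cv=5, scoring='r2')
print(scores)
print("Mean R2 across folds: {:.3f}".format(scores.mean()))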