suyashi29
GitHub Repository: suyashi29/python-su
Path: blob/master/Machine Learning Supervised Methods/Day 2 Use Case Linear Regression.ipynb
Kernel: Python 3 (ipykernel)

Problem Statement

  • The Boston Housing dataset contains information about houses in Boston. It was originally part of the UCI Machine Learning Repository, from which it has since been removed. It can also be loaded from the scikit-learn library in versions before 1.2, which removed load_boston. There are 506 samples and 13 feature variables in this dataset.

The objective is to predict house prices from the given features.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline

from sklearn.datasets import load_boston
boston_dataset = load_boston()
print(boston_dataset.keys())
  • data: contains the information for various houses

  • target: prices of the house

  • feature_names: names of the features

  • DESCR: describes the dataset

boston_dataset.DESCR
  • The house price, given by the variable MEDV, is our target variable. The remaining variables are the features we use to predict the value of a house.

Loading data into Data frame

boston = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
boston.head(3)
  • We can see that the target value MEDV is missing from the data. We create a new column of target values and add it to the dataframe.

boston['MEDV'] = boston_dataset.target
boston["MEDV"].describe()

Data preprocessing

  • We count the number of missing values for each feature using isnull()

boston.isnull().sum()
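Options for handling missing values:

  • Drop the affected column (or rows) with dropna.

  • Fill the gaps with fillna, for example when about 10% of a numerical column is missing. Use an approximate value: the mean if the column is roughly normally distributed, the median if it contains outliers.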
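The two options above can be sketched on a small toy frame. The column names and values below are hypothetical and are not taken from the Boston data; the threshold of 3 non-missing values is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not from the Boston dataset)
df = pd.DataFrame({
    "rooms": [6.0, 5.5, np.nan, 6.5, 7.0],
    "tax": [300.0, 310.0, 305.0, np.nan, 2000.0],   # 2000.0 acts as an outlier
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0, np.nan],
})

# Option 1: drop columns that have fewer than 3 non-missing values
df = df.dropna(axis="columns", thresh=3)

# Option 2: fill the remaining gaps
df["rooms"] = df["rooms"].fillna(df["rooms"].mean())   # no outliers: use the mean
df["tax"] = df["tax"].fillna(df["tax"].median())       # outlier present: use the median

print(df.isnull().sum())
```

The median is preferred for the tax column because a single extreme value would pull the mean far away from the typical values.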

Exploratory Data Analysis

Exploratory Data Analysis is a very important step before training the model.

  • Let’s first plot the distribution of the target variable MEDV. We will use the distplot function from the seaborn library

sns.set(rc={'figure.figsize': (11.7, 8.27)})
sns.distplot(boston['MEDV'], bins=30)
plt.show()
  • We see that the values of MEDV are approximately normally distributed, with a few outliers.

The correlation matrix can be computed with the corr method of a pandas DataFrame. We will use the heatmap function from the seaborn library to plot it.

import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
sns.set()
plt.figure(figsize=(20, 20))
cor = boston.corr().round(2)
sns.heatmap(cor, annot=True, cmap=plt.cm.Reds)
plt.show()

correlation_matrix = boston.corr().round(2)
# annot=True prints the values inside the squares
# plt.figure(figsize=(30, 30))
sns.heatmap(data=correlation_matrix, annot=True)

Insights:

  • To fit a linear regression model, we select features that have a high correlation with our target variable MEDV. The correlation matrix shows that RM has a strong positive correlation with MEDV (0.7), whereas LSTAT has a strong negative correlation with MEDV (-0.74).

  • An important point in selecting features for a linear regression model is to check for multicollinearity. The features RAD and TAX have a correlation of 0.91, so they are strongly correlated with each other and we should not select both for training the model. The same goes for the features DIS and AGE, which have a correlation of -0.75.
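Reading strongly correlated pairs off a large heatmap is error-prone, so the check can also be done programmatically. The sketch below uses synthetic columns a, b, c (hypothetical names) and a cutoff of 0.7, which is an assumption for illustration rather than a fixed rule.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
a = rng.normal(size=n)
df = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=n),   # nearly a copy of "a"
    "c": rng.normal(size=n),                  # independent of both
})

corr = df.corr().abs()
# Keep only the upper triangle (above the diagonal) so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
high = pairs[pairs > 0.7]   # candidate multicollinear pairs
print(high)
```

For the housing data, the same steps applied to boston.corr() would list pairs such as RAD and TAX, of which only one should be kept as a feature.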

Based on the above observations, we will use RM and LSTAT as our features. Using scatter plots, let's see how these features vary with MEDV.

# Correlation with output variable
cor_target = abs(cor['MEDV'])
# Selecting highly correlated features
relevant_features = cor_target[cor_target > 0.5]
relevant_features

plt.figure(figsize=(20, 5))
features = ['LSTAT', 'RM', 'PTRATIO']
target = boston['MEDV']
for i, col in enumerate(features):
    plt.subplot(1, len(features), i + 1)
    x = boston[col]
    y = target
    plt.scatter(x, y, marker='o')
    plt.title(col)
    plt.xlabel(col)
    plt.ylabel('MEDV')

RFE (Recursive Feature Elimination)

The Recursive Feature Elimination (RFE) method works by recursively removing attributes and building a model on the attributes that remain. It ranks the features by the importance the fitted model assigns to them (in scikit-learn, the estimator's coef_ or feature_importances_). The RFE method takes the model to be used and the number of required features as input. It then gives the ranking of all the variables, 1 being most important. It also gives its support: True marks a selected feature and False an eliminated one.
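To make support_ and ranking_ concrete, here is a minimal sketch on synthetic data generated with make_regression. The sizes (6 features, 2 informative) and the names X_demo and y_demo are assumptions chosen for illustration only.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data: 6 features, only 2 of which influence the target
X_demo, y_demo = make_regression(
    n_samples=200, n_features=6, n_informative=2, noise=1.0, random_state=0
)

# Keep 2 features; by default one feature is eliminated per iteration
rfe = RFE(LinearRegression(), n_features_to_select=2)
rfe.fit(X_demo, y_demo)

print(rfe.support_)   # True for the 2 selected features
print(rfe.ranking_)   # 1 for selected features, larger numbers were eliminated earlier
```

Because a linear model's coefficients depend on feature scale, rankings on real data can change if the features are standardized first.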

# Use all columns except the target as features
X = boston.drop(columns=['MEDV'])
Y = boston['MEDV']
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=5)

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
lin_model = LinearRegression()
lin_model.fit(X_train, Y_train)
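# Feature-only frame; kept under a separate name so that boston retains the MEDV column used later
boston_features = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
boston_features.head(1)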
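from sklearn.feature_selection import RFE

# Features and target for RFE (the target MEDV must not be among the features)
X = boston[boston_dataset.feature_names]
y = boston_dataset.target

model = LinearRegression()
# Initializing the RFE model to keep the 4 top-ranked features
rfe = RFE(model, n_features_to_select=4)
# Fitting RFE and reducing X to the selected features
X_rfe = rfe.fit_transform(X, y)
# Fitting the model on the selected features
model.fit(X_rfe, y)
print(rfe.support_)
print(rfe.ranking_)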
# Candidate numbers of features
nof_list = np.arange(1, 5)
high_score = 0
# Variable to store the optimum number of features
nof = 0
score_list = []
# The split uses a fixed random_state, so it is identical for every candidate
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
for n in nof_list:
    model = LinearRegression()
    rfe = RFE(model, n_features_to_select=int(n))
    X_train_rfe = rfe.fit_transform(X_train, y_train)
    X_test_rfe = rfe.transform(X_test)
    model.fit(X_train_rfe, y_train)
    score = model.score(X_test_rfe, y_test)
    score_list.append(score)
    if score > high_score:
        high_score = score
        nof = n
print("Optimum number of features: %d" % nof)
print("Score with %d features: %f" % (nof, high_score))

Insights

  • Prices increase roughly linearly as the value of RM increases. There are a few outliers, and the data appear to be capped at 50.

  • Prices tend to decrease as LSTAT increases, although the relationship does not look exactly linear.

Preparing the data for training the model

  • We concatenate the LSTAT and RM columns using np.c_ provided by the numpy library.

X = pd.DataFrame(np.c_[boston['LSTAT'], boston['RM']], columns=['LSTAT', 'RM'])
Y = boston['MEDV']
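As a quick illustration of what np.c_ does, the sketch below stacks two 1-D arrays as the columns of a 2-D array. The numbers are illustrative values only.

```python
import numpy as np

lstat = np.array([4.98, 9.14, 4.03])      # illustrative values
rm = np.array([6.575, 6.421, 7.185])      # illustrative values

# np.c_ places each 1-D array in its own column
stacked = np.c_[lstat, rm]
print(stacked.shape)
print(stacked[0])
```

With a DataFrame, selecting the columns directly with boston[['LSTAT', 'RM']] gives an equivalent feature table without going through NumPy.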

Splitting the data into training and testing sets

  • Next, we split the data into training and testing sets. We train the model with 80% of the samples and test with the remaining 20%. We do this to assess the model’s performance on unseen data.

  • To split the data we use train_test_split function provided by scikit-learn library. We finally print the sizes of our training and test set to verify if the splitting has occurred properly.

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=5)
print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)

Training and testing the model

  • We use scikit-learn's LinearRegression to train our model on the training set. The test set is held back and used only for evaluation.

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
lin_model = LinearRegression()
lin_model.fit(X_train, Y_train)
print(lin_model.intercept_)
print(lin_model.coef_)
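The intercept and coefficients printed above are the ordinary least squares solution. As a rough sketch of what is computed, the same quantities can be obtained with NumPy alone. The data here are synthetic, with a known intercept of 3.0 and slopes of 2.0 and -1.5 chosen for illustration; X_demo and y_demo are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = 3.0 + 2.0 * X_demo[:, 0] - 1.5 * X_demo[:, 1] + rng.normal(scale=0.1, size=100)

# Prepend a column of ones so that the first coefficient acts as the intercept
A = np.c_[np.ones(len(X_demo)), X_demo]

# Least squares solution minimizing ||A @ beta - y_demo||^2
beta, *_ = np.linalg.lstsq(A, y_demo, rcond=None)
intercept, coef = beta[0], beta[1:]
print(intercept, coef)
```

The recovered values should be close to the true ones because the added noise is small relative to the signal.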

Model evaluation

We will evaluate our model using RMSE and R2-score
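Both metrics are simple functions of the residuals: RMSE is the square root of the mean squared residual, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares around the mean. A minimal worked example with four made-up values:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])   # made-up targets
y_pred = np.array([2.0, 5.0, 8.0, 9.0])   # made-up predictions

residuals = y_true - y_pred                        # [1, 0, -1, 0]
rmse = np.sqrt(np.mean(residuals ** 2))            # sqrt(2 / 4)

ss_res = np.sum(residuals ** 2)                    # 2
ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # 9 + 1 + 1 + 9 = 20
r2 = 1 - ss_res / ss_tot                           # 1 - 2 / 20

print(rmse, r2)
```

RMSE is in the same units as the target, while R2 is unitless and measures the share of the target's variance that the predictions account for.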

# model evaluation for training set
from sklearn.metrics import mean_squared_error, r2_score
y_train_predict = lin_model.predict(X_train)
rmse = np.sqrt(mean_squared_error(Y_train, y_train_predict))
r2 = r2_score(Y_train, y_train_predict)
print("The model performance for training set")
print("--------------------------------------")
print('RMSE is {}'.format(rmse))
print('R2 score is {}'.format(r2))
print("\n")

# model evaluation for testing set
y_test_predict = lin_model.predict(X_test)
rmse = np.sqrt(mean_squared_error(Y_test, y_test_predict))
r2 = r2_score(Y_test, y_test_predict)
print("The model performance for testing set")
print("--------------------------------------")
print('RMSE is {}'.format(rmse))
print('R2 score is {}'.format(r2))

Comparing Actual and Predicted value

r = pd.DataFrame({'Actual': Y_test, 'Predicted': y_test_predict})
r

Conclusion

  • An R2 score of about 0.63 means the model explains roughly 63% of the variance in MEDV. It does not mean that 63% of the predictions are correct.

  • We can use cross validation for Model improvement

Performance Improvement by Cross validation

from sklearn.model_selection import train_test_split
train, validation = train_test_split(boston, test_size=0.50, random_state=5)

# Hold out half of the data for validation (the target variable is named Y above)
X_train, X_v, y_train, y_v = train_test_split(X, Y, test_size=0.5, random_state=5)
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(X_train, y_train)
print(reg.intercept_,reg.coef_)
y_pred = reg.predict(X_v) #y_pred
from sklearn.metrics import mean_squared_error, r2_score
r2 = r2_score(y_v, y_pred)
print(r2)

from sklearn import metrics
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_v, y_pred)))

Model Correction


from sklearn.model_selection import cross_val_score
lm = LinearRegression()
scores = cross_val_score(lm, X_train, y_train, scoring='r2', cv=7)
scores

# can use other metrics, such as MSE
scores = cross_val_score(lm, X_train, y_train, scoring='neg_mean_squared_error', cv=5)
scores
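To show what cross_val_score does internally, the sketch below runs k-fold cross validation by hand and compares the result with the library call. It uses synthetic data; X_demo, y_demo, and the choice of 5 folds are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 2))
y_demo = 1.0 + 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.5, size=120)

kf = KFold(n_splits=5)   # 5 consecutive folds, no shuffling
manual_scores = []
for train_idx, val_idx in kf.split(X_demo):
    # Train on 4 folds, score on the fold that was left out
    model = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(r2_score(y_demo[val_idx], model.predict(X_demo[val_idx])))

# The library call with the same splitter yields the same per-fold scores
library_scores = cross_val_score(LinearRegression(), X_demo, y_demo, scoring="r2", cv=kf)
print(np.mean(manual_scores), library_scores.mean())
```

Every sample is used for validation exactly once, so the mean score is a more stable estimate than a single train/validation split. With scoring='neg_mean_squared_error' the scores are negated MSE values, so an RMSE per fold is np.sqrt(-scores).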