
Logistic Regression

About Dataset

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

# Loading dataset
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# load dataset
dia = pd.read_csv("https://raw.githubusercontent.com/suyashi29/python-su/master/ML/diabetes.csv")
dia.head(2)
dia.shape
(768, 9)
dia.describe()
dia.notnull().sum()  # count non-missing values per column (.count() would just return the row count)
Pregnancies                 768
Glucose                     768
BloodPressure               768
SkinThickness               768
Insulin                     768
BMI                         768
DiabetesPedigreeFunction    768
Age                         768
Outcome                     768
dtype: int64
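Note that a null check only catches NaN entries. In this dataset, missing measurements are commonly encoded as 0, so a stricter check counts physiologically implausible zeros (a minimal sketch using the dia DataFrame loaded above):

# zeros in these columns are implausible and likely mark missing values
cols = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
print((dia[cols] == 0).sum())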
dia['Age'].plot.hist(orientation='vertical', cumulative=True)
[Figure: cumulative histogram of Age]
dia.plot.hist(color='g', alpha=1, bins=50)  # alpha changes the transparency
[Figure: overlaid histograms of all columns, 50 bins]
dia.plot(kind="bar", x='Age', y='BMI')
[Figure: bar plot of BMI against Age]
dia.plot.area()
[Figure: stacked area plot of all columns]

Selecting Features

  • Divide the columns into two types of variables: the dependent (target) variable and the independent (feature) variables.

# split dataset into features and target variable
X = dia.iloc[:, :-1]  # all columns except the last one = features
y = dia.Outcome       # target variable
# split X and y into training and testing sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

Here the dataset is split in a 75:25 ratio: 75% of the data is used for model training and 25% for model testing.
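A quick sanity check on the resulting shapes (a minimal sketch using the arrays created above; the expected sizes follow from 768 rows split 75:25):

# 768 rows split 75:25 -> 576 training rows and 192 test rows
print(X_train.shape, X_test.shape)  # (576, 8) (192, 8)
print(y_train.shape, y_test.shape)  # (576,) (192,)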

Model Development and Prediction

  • First, import the Logistic Regression module and create a logistic regression classifier object with the LogisticRegression() constructor.

  • Then, fit your model on the train set using fit() and perform prediction on the test set using predict().

# import the class
from sklearn.linear_model import LogisticRegression

# instantiate the model; liblinear was the default solver in older
# scikit-learn versions, so setting it explicitly keeps the results
# below reproducible and silences the FutureWarning
logreg = LogisticRegression(solver='liblinear')

# fit the model with data
logreg.fit(X_train, y_train)

# predict on the test set (y_pred is used for evaluation below)
y_pred = logreg.predict(X_test)

Model Evaluation using Confusion Matrix

A confusion matrix is a table used to evaluate the performance of a classification model, and it can also be used to visualize how an algorithm performs. Its fundamental idea is that the numbers of correct and incorrect predictions are summed up class-wise.

# import the metrics class
from sklearn import metrics

cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
cnf_matrix
array([[119,  11],
       [ 26,  36]], dtype=int64)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
print("Precision:", metrics.precision_score(y_test, y_pred))
print("Recall:", metrics.recall_score(y_test, y_pred))
Accuracy: 0.8072916666666666
Precision: 0.7659574468085106
Recall: 0.5806451612903226
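As a sanity check, the same numbers can be recomputed by hand from the confusion-matrix entries (a minimal sketch assuming the cnf_matrix computed above):

# unpack the 2x2 matrix: rows = actual class, columns = predicted class
tn, fp, fn, tp = cnf_matrix.ravel()  # 119, 11, 26, 36

accuracy = (tp + tn) / (tp + tn + fp + fn)  # (36 + 119) / 192 ~ 0.807
precision = tp / (tp + fp)                  # 36 / 47 ~ 0.766
recall = tp / (tp + fn)                     # 36 / 62 ~ 0.581
print(accuracy, precision, recall)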

Insights

  • Here you can see the confusion matrix in the form of an array object. The matrix is 2x2 because this is a binary classification model with two classes, 0 and 1.

  • Diagonal values represent correct predictions, while off-diagonal elements are incorrect predictions. In this output, 119 (true negatives) and 36 (true positives) are correct predictions, while 26 (false negatives) and 11 (false positives) are incorrect.

  • The classification rate is about 81%, which is considered good accuracy.

  • Precision: when the model predicts that a patient will suffer from diabetes, that prediction is correct about 76% of the time.

  • Recall: of the patients in the test set who actually have diabetes, the model identifies about 58% of them.
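scikit-learn can also print these per-class metrics in one call (a one-line sketch assuming y_test and y_pred from above):

print(metrics.classification_report(y_test, y_pred))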

Visualizing Confusion Matrix using Heatmap

Let's visualize the results of the model in the form of a confusion matrix using matplotlib and seaborn.

import seaborn as sns
import numpy as np
%matplotlib inline

class_names = [0, 1]  # name of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)

# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu", fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
[Figure: confusion matrix heatmap]

ROC Curve

A Receiver Operating Characteristic (ROC) curve is a plot of the true positive rate against the false positive rate. It shows the tradeoff between sensitivity and specificity.

y_pred_proba = logreg.predict_proba(X_test)[:, 1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr, tpr, label="data 1, auc=" + str(auc))
plt.legend(loc=4)
plt.show()
[Figure: ROC curve with AUC in the legend]
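The ROC curve is traced out by sweeping the decision threshold. A minimal sketch (assuming y_pred_proba and metrics from the cells above) that shows how a few thresholds trade off TPR against FPR:

# classify as positive whenever the predicted probability exceeds the threshold
for thresh in [0.3, 0.5, 0.7]:
    y_at_thresh = (y_pred_proba >= thresh).astype(int)
    tn, fp, fn, tp = metrics.confusion_matrix(y_test, y_at_thresh).ravel()
    print(f"threshold={thresh}: TPR={tp/(tp+fn):.2f}, FPR={fp/(fp+tn):.2f}")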

Advantages

Logistic regression is efficient and straightforward: it does not require high computational power, it is easy to implement and to interpret, and it is widely used by data analysts and scientists.

Disadvantages

Logistic regression cannot handle a large number of categorical features/variables. It is vulnerable to overfitting. It also cannot solve non-linear problems directly, which is why non-linear features require transformation. Finally, logistic regression will not perform well with independent variables that are uncorrelated with the target variable or that are highly correlated with each other.
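One quick way to check the last point on this dataset is a correlation matrix (a minimal sketch using the dia DataFrame loaded above):

# correlation of every column with the target, and pairwise feature correlations
corr = dia.corr()
print(corr['Outcome'].sort_values(ascending=False))  # correlation with the target
print(corr.abs().round(2))                           # look for highly correlated feature pairs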