GitHub Repository: suyashi29/python-su
Path: blob/master/Machine Learning Supervised Methods/Modelling Binary Logistic Regression Using Python.ipynb

Logistic regression

About Dataset

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.


%matplotlib inline sets the Matplotlib backend to the 'inline' backend. With this backend, the output of plotting commands is displayed inline within frontends such as the Jupyter notebook, directly below the code cell that produced it.

# Loading dataset
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.pyplot import pie, axis, show
import seaborn as sns
sns.set()
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

# load dataset
dia = pd.read_csv("diabetes.csv")
dia = pd.read_csv("diabetes.csv") a_data=pd.ExcelFile(r"C:\Users\suyashi144893\Documents\data Sets\admission.xlsx").parse("Sheet2") a_data=pd.read_csv(r"C:\Users\suyashi144893\Documents\data Sets\diabetic.csv")
dia.head(2)
dia.shape
(768, 9)


### Plot Glucose vs. Outcome and add the logistic fit
sns.regplot(x="Glucose", y="Outcome", y_jitter=0.03, data=dia, logistic=True, ci=None)

# Display the plot
plt.show()

# jitter: add uniform random noise of this size to the x or y variable.
# The noise is added to a copy of the data after fitting the regression and only
# influences the look of the scatterplot. This is helpful when plotting variables
# that take discrete values.
Image in a Jupyter notebook
dia["Outcome"].unique()
array([1, 0], dtype=int64)
dia.groupby("Outcome").count()

Insights

  • Age, DiabetesPedigreeFunction (DPF), SkinThickness (ST), BMI, and Glucose level affect the Outcome

dia.describe()
## Treat zeros as missing: replace 0 with NaN, drop rows that are entirely NaN,
## then restore the remaining NaNs to 0 (equivalent to data.loc[~(data==0).all(axis=1)])
import numpy as np
df = dia.replace(0, np.nan)
df = df.dropna(how='all', axis=0)
df = df.replace(np.nan, 0)
dia.isnull().sum()
dia(if dia["BMI"]==0: dia["BMI"]==dia["BMI"].mean() else : pass
#First you can find the nonzero mean : nonzero_mean = dia[ dia.BMI != 0 ].mean() #Then replace the zero values with this mean : dia.loc[ dia.BMI == 0, "BMI" ] = nonzero_mean
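The same idea can be extended to the other measurement columns where a value of 0 is physiologically impossible. A minimal sketch (an addition, not part of the original notebook), working on a copy so the dia DataFrame above is left unchanged:

# Sketch: replace zeros with the nonzero column mean for every column
# where 0 cannot be a real measurement (assumed list of columns).
cols_with_zero_as_missing = ["Glucose", "BloodPressure", "SkinThickness", "BMI"]
dia_clean = dia.copy()
for col in cols_with_zero_as_missing:
    col_nonzero_mean = dia_clean.loc[dia_clean[col] != 0, col].mean()
    dia_clean.loc[dia_clean[col] == 0, col] = col_nonzero_mean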
dia['Age'].plot.hist(orientation='vertical')
<AxesSubplot:ylabel='Frequency'>
Image in a Jupyter notebook
dia.plot.hist(alpha=1, bins=20, stacked=True)  # alpha controls transparency

Selecting Features

  • Divide the given columns into two types of variables: the dependent (target) variable and the independent (feature) variables.

import seaborn as sns
sns.set()
sns.heatmap(dia.corr(), annot=True)
<AxesSubplot:>
Image in a Jupyter notebook

Data Modelling

dia= dia.drop(["Insulin"],axis=1)
# split dataset into features and target variable
X = dia.iloc[:, :-1]  # all columns except the last one = features
y = dia.Outcome       # target variable
# split X and y into training and testing sets
import sklearn
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

Here, the dataset is split into two parts in a ratio of 75:25: 75% of the data is used for model training and 25% for model testing.

Model Development and Prediction

  • First, import the Logistic Regression module and create a Logistic Regression classifier object using the LogisticRegression() function.

  • Then, fit your model on the train set using fit() and perform prediction on the test set using predict().

# import the class
from sklearn.linear_model import LogisticRegression

# instantiate the model (using the default parameters)
logreg = LogisticRegression()

# fit the model with data
logreg.fit(X_train, y_train)
y_pred=logreg.predict(X_test)
logreg.intercept_
array([-8.5028549])
# Find the optimum number of features with RFE (recursive feature elimination)
import numpy as np
from sklearn.feature_selection import RFE

nof_list = np.arange(1, 9)
high_score = 0   # variable to store the best score
nof = 0          # optimum number of features
score_list = []
for n in range(len(nof_list)):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
    model = LogisticRegression()
    rfe = RFE(model, n_features_to_select=nof_list[n])
    X_train_rfe = rfe.fit_transform(X_train, y_train)
    X_test_rfe = rfe.transform(X_test)
    model.fit(X_train_rfe, y_train)
    score = model.score(X_test_rfe, y_test)
    score_list.append(score)
    if score > high_score:
        high_score = score
        nof = nof_list[n]
print("Optimum number of features: %d" % nof)
print("Score with %d features: %f" % (nof, high_score))
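To see which columns are actually kept at that optimum, and not just how many, the fitted selector can be refit and inspected. A minimal sketch (an addition, not part of the original notebook), reusing the nof found above:

# Sketch: refit RFE with the optimum number of features and list the kept columns
rfe_best = RFE(LogisticRegression(), n_features_to_select=nof)
rfe_best.fit(X_train, y_train)
print("Selected features:", list(X.columns[rfe_best.support_]))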

Model Evaluation using Confusion Matrix

A confusion matrix is a table used to evaluate the performance of a classification model, and it can also be visualized. Its fundamental idea is that the numbers of correct and incorrect predictions are summed up class-wise.

dia.head()
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

v = pd.DataFrame()
v["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
v["features"] = X.columns
v.round(1)
import statsmodels.api as sm

logit_model = sm.Logit(y, X)
result = logit_model.fit()
print(result.summary2())
## p > 0.05: coefficient not statistically significant; p < 0.05: statistically significant
Optimization terminated successfully.
         Current function value: 0.608498
         Iterations 5
                              Results: Logit
==========================================================================
Model:               Logit             Pseudo R-squared:  0.059
Dependent Variable:  Outcome           AIC:               950.6528
Date:                2023-02-20 13:23  BIC:               987.8031
No. Observations:    768               Log-Likelihood:    -467.33
Df Model:            7                 LL-Null:           -496.74
Df Residuals:        760               LLR p-value:       2.5825e-10
Converged:           1.0000            Scale:             1.0000
No. Iterations:      5.0000
--------------------------------------------------------------------------
                           Coef.  Std.Err.    z     P>|z|   [0.025  0.975]
--------------------------------------------------------------------------
Pregnancies               0.1284   0.0286   4.4843  0.0000  0.0723  0.1845
Glucose                   0.0129   0.0027   4.7568  0.0000  0.0076  0.0183
BloodPressure            -0.0303   0.0047  -6.4806  0.0000 -0.0395 -0.0212
SkinThickness             0.0002   0.0061   0.0323  0.9742 -0.0117  0.0121
Insulin                   0.0007   0.0008   0.9420  0.3462 -0.0008  0.0023
BMI                      -0.0048   0.0107  -0.4494  0.6531 -0.0258  0.0162
DiabetesPedigreeFunction  0.3203   0.2399   1.3351  0.1818 -0.1499  0.7905
Age                      -0.0156   0.0084  -1.8517  0.0641 -0.0322  0.0009
==========================================================================
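The same information can be read off programmatically: the fitted statsmodels result exposes the coefficient p-values as a Series. A minimal sketch (an addition, not part of the original notebook) listing the predictors significant at the 5% level:

# Sketch: list predictors whose coefficients are significant at p < 0.05
pvalues = result.pvalues
print("Significant at p < 0.05:")
print(pvalues[pvalues < 0.05])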

Confusion Matrix

  • A confusion matrix is a summary of prediction results on a classification problem.

  • The number of correct and incorrect predictions are summarized with count values and broken down by each class. This is the key to the confusion matrix.

  • The confusion matrix shows the ways in which your classification model is confused when it makes predictions. For a binary problem it has the layout:

                  Predicted: 0   Predicted: 1
    Actual: 0         TN             FP
    Actual: 1         FN             TP

(115+37)/(115+15+25+37)
0.7916666666666666
# import the metrics class
from sklearn import metrics
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
cnf_matrix
array([[115, 15], [ 25, 37]], dtype=int64)
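The four cells can be read off directly from this array. A minimal sketch (an addition, not part of the original notebook) using the cnf_matrix computed above:

# Sketch: unpack the 2x2 confusion matrix (sklearn orders it [[TN, FP], [FN, TP]])
tn, fp, fn, tp = cnf_matrix.ravel()
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)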

Classification Rate/Accuracy:

Classification Rate or Accuracy is given by the relation:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Recall

  • Recall is the ratio of the number of correctly classified positive examples to the total number of actual positive examples. High recall indicates the class is correctly recognized (a small number of FN): Recall = TP / (TP + FN)

To get the value of precision, we divide the total number of correctly classified positive examples by the total number of predicted positive examples. High precision indicates that an example labeled as positive is indeed positive (a small number of FP): Precision = TP / (TP + FP). (A worked check against this notebook's confusion matrix appears after this list.)

  • High recall, low precision: This means that most of the positive examples are correctly recognized (low FN) but there are a lot of false positives.

  • Low recall, high precision: This shows that we miss a lot of positive examples (high FN) but those we predict as positive are indeed positive (low FP).

  • F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
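As promised above, a worked check of these formulas against this notebook's confusion matrix (a sketch, not part of the original notebook):

# Sketch: compute accuracy, precision, recall and F1 directly from the matrix cells
tn, fp, fn, tp = cnf_matrix.ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print("Accuracy:", accuracy, "Precision:", precision, "Recall:", recall, "F1:", f1)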

import numpy as np print("Test Accuracy:",metrics.accuracy_score(y_test, y_pred)) print("Train accuracy:", np.round(metrics.accuracy_score(y_train,logreg.predict(X_train)))) print("F1 Score:",metrics.accuracy_score(y_test,y_pred)) print("Precision:",metrics.precision_score(y_test, y_pred)) print("Recall:",metrics.recall_score(y_test, y_pred)) ## Train>Test (Overfit)
Test Accuracy: 0.7916666666666666
Train accuracy: 1.0
F1 Score: 0.7916666666666666
Precision: 0.7115384615384616
Recall: 0.5967741935483871

Insights

  • Here, you can see the confusion matrix in the form of an array object. The dimension of this matrix is 2*2 because this is a binary classification model with two classes, 0 and 1.

  • Diagonal values represent accurate predictions, while off-diagonal elements are inaccurate predictions. In the output, 115 and 37 are correct predictions, and 15 and 25 are incorrect predictions.

  • The classification rate is about 79%, which is considered good accuracy.

  • Precision: when the Logistic Regression model predicts that a patient will suffer from diabetes, the patient actually does about 71% of the time.

  • Recall: of the patients in the test set who have diabetes, the Logistic Regression model identifies about 60% of them.

Visualizing Confusion Matrix using Heatmap

Let's visualize the results of the model in the form of a confusion matrix using matplotlib and seaborn.

import seaborn as sns
%matplotlib inline
import numpy as np

class_names = [0, 1]  # name of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)

# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu", fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
Text(0.5, 257.44, 'Predicted label')
Image in a Jupyter notebook
r = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
r.sum()
Actual       62
Predicted    52
dtype: int64

ROC Curve

The Receiver Operating Characteristic (ROC) curve is a plot of the true positive rate against the false positive rate. It shows the tradeoff between sensitivity and specificity.


y_pred_proba = logreg.predict_proba(X_test)[:, 1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr, tpr, label="data 1, auc=" + str(auc))
plt.legend(loc=4)
plt.show()
Image in a Jupyter notebook
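The roc_curve call above also returns the candidate thresholds, so the sensitivity/specificity tradeoff can be explored numerically. A small sketch (an addition, not part of the original notebook) picking the threshold that maximizes TPR minus FPR (Youden's J), which motivates the threshold tuning in the next section:

# Sketch: choose the threshold that maximizes TPR - FPR (Youden's J statistic)
import numpy as np
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_proba)
best = np.argmax(tpr - fpr)
print("Best threshold by Youden's J:", thresholds[best])
print("TPR:", tpr[best], "FPR:", fpr[best])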

Model Evaluation and Threshold Adjustment

print ("the first 10 predicted probabilities for class 1") logreg.predict_proba(X_test)[0:10, 1]
the first 10 predicted probabilities for class 1
array([0.97430957, 0.16022445, 0.07678281, 0.68049997, 0.12195515, 0.03382118, 0.80422912, 0.89676956, 0.50992597, 0.36495393])
## store the predicted probabilities for class 1
y_pred_prob = logreg.predict_proba(X_test)[:, 1]
y_pred_prob
array([0.97430957, 0.16022445, 0.07678281, 0.68049997, 0.12195515, 0.03382118, 0.80422912, 0.89676956, 0.50992597, 0.36495393, 0.64201291, 0.96975574, 0.31701509, 0.25424458, 0.13279596, 0.17806406, 0.89820938, 0.02213942, 0.46769295, 0.2519952 , 0.68355155, 0.38914714, 0.21704831, 0.05840366, 0.0504395 , 0.3574953 , 0.04689485, 0.94254127, 0.10883221, 0.12914643, 0.52659378, 0.24569865, 0.10114872, 0.50018212, 0.08990043, 0.67492384, 0.4971289 , 0.08540017, 0.32088031, 0.73550373, 0.28404179, 0.18463535, 0.16624777, 0.82727428, 0.76343607, 0.01188445, 0.07935747, 0.19982664, 0.41692831, 0.29029418, 0.39056442, 0.17188968, 0.90118886, 0.54005448, 0.1219156 , 0.00203653, 0.0602086 , 0.46453713, 0.26627091, 0.07660383, 0.74979662, 0.46648078, 0.09061968, 0.69938126, 0.67208696, 0.9363622 , 0.72136369, 0.13259295, 0.33719638, 0.10871875, 0.09683759, 0.35019193, 0.08553882, 0.96747777, 0.82969343, 0.28822922, 0.11453031, 0.70090926, 0.06886681, 0.17235988, 0.28396975, 0.40448049, 0.22648705, 0.03090837, 0.18863598, 0.16644236, 0.26760448, 0.35378474, 0.89329979, 0.15913864, 0.1593517 , 0.15960134, 0.22600358, 0.04150666, 0.67710905, 0.18972856, 0.46825894, 0.49951642, 0.7248281 , 0.23024712, 0.21617517, 0.09435724, 0.1798586 , 0.03565466, 0.69887225, 0.39278551, 0.14443899, 0.28648514, 0.04121514, 0.78666692, 0.11663182, 0.3121767 , 0.57295891, 0.52773838, 0.52900606, 0.60150401, 0.1223423 , 0.6804319 , 0.08827328, 0.79471765, 0.36964159, 0.38566715, 0.26601112, 0.48248907, 0.20807182, 0.03726402, 0.32154473, 0.39474133, 0.48948797, 0.36438066, 0.39238393, 0.03840617, 0.05200753, 0.78371482, 0.32993714, 0.46034628, 0.11718399, 0.3746378 , 0.64419607, 0.18196806, 0.07130916, 0.55096353, 0.055268 , 0.06825165, 0.31079958, 0.08285671, 0.07297724, 0.09444651, 0.16356192, 0.22339213, 0.0753922 , 0.63815881, 0.09008165, 0.20123731, 0.80601384, 0.1232886 , 0.67041315, 0.15171418, 0.47025799, 0.95205734, 0.6895525 , 0.77339795, 0.01900237, 0.2662362 , 0.84898209, 0.24090555, 0.17480166, 0.10330463, 0.20578689, 0.08714751, 0.10819279, 0.24643851, 0.16472526, 0.20411091, 0.5238622 , 0.12641341, 0.37711871, 0.10086938, 0.10475796, 0.05844572, 0.13166587, 0.82887664, 0.14112278, 0.97281125, 0.55781951, 0.02388552, 0.57823457, 0.23773787, 0.42795066, 0.13740184, 0.18610465, 0.09351092])
## allow plots to appear in the notebook
%matplotlib inline
import matplotlib.pyplot as plt

# adjust the font size
plt.rcParams['font.size'] = 10
## histogram of predicted probabilities with 8 bins
plt.hist(y_pred_prob, bins=8)

# x-axis limit from 0 to 1
plt.xlim(0, 1)
plt.title('Histogram of predicted probabilities')
plt.xlabel('Predicted probability of diabetes')
plt.ylabel('Frequency')
Text(0, 0.5, 'Frequency')
Image in a Jupyter notebook

We can see from the histogram (these fractions can be computed directly with the sketch after this list):

  • About 45% of observations have probability from 0.2 to 0.3

  • Small number of observations with probability > 0.5

  • This is below the threshold of 0.5

  • Most would be predicted "no diabetes" in this case
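A quick way to check these fractions directly from the y_pred_prob array above (a sketch, not part of the original notebook):

# Sketch: compute the fraction of observations in the ranges discussed above
print("Fraction with 0.2 <= p < 0.3:", ((y_pred_prob >= 0.2) & (y_pred_prob < 0.3)).mean())
print("Fraction with p > 0.5:", (y_pred_prob > 0.5).mean())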

Solution

  • Decrease the threshold for predicting diabetes

  • Increase the sensitivity of the classifier

  • This would increase the number of TP

  • More sensitive to positive instances

# predict diabetes if the predicted probability is greater than 0.3
# (returns 1 for all values above 0.3 and 0 otherwise)
y_pred_class = (y_pred_prob > 0.3).astype(int)
# print the first 10 predicted probabilities y_pred_prob[0:10]
array([0.97430957, 0.16022445, 0.07678281, 0.68049997, 0.12195515, 0.03382118, 0.80422912, 0.89676956, 0.50992597, 0.36495393])
# print the first 10 predicted classes with the lower threshold y_pred_class[0:10]
array([1, 0, 0, 1, 0, 0, 1, 1, 1, 1])
## previous confusion matrix (default threshold of 0.5)
from sklearn import metrics
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
cnf_matrix
array([[115, 15], [ 25, 37]], dtype=int64)
print(metrics.confusion_matrix(y_test, y_pred_class))
[[94 36]
 [12 50]]
print("Test Accuracy:",metrics.accuracy_score(y_test, y_pred_class)) print("Precision:",metrics.precision_score(y_test, y_pred_class)) print("Recall:",metrics.recall_score(y_test, y_pred_class))
Test Accuracy: 0.75
Precision: 0.5813953488372093
Recall: 0.8064516129032258

Dimensionality reduction

Dimensionality reduction is a way of converting a higher-dimensional dataset into a lower-dimensional one while ensuring that it conveys similar information. These techniques are widely used in machine learning to obtain a better-fitting predictive model when solving classification and regression problems.

Multicollinearity occurs when features (input variables) are highly correlated with one or more of the other features in the dataset. It affects the performance of regression and classification models. PCA (Principal Component Analysis) takes advantage of multicollinearity and combines the highly correlated variables into a set of uncorrelated variables, while retaining as much of the variation in the original dataset as possible.

  • If the variables are not measured on a similar scale, we need to do feature scaling before applying PCA for our data. This is because PCA directions are highly sensitive to the scale of the data.

  • The most important part in PCA is selecting the best number of components for the given dataset

Principal Component Analysis (PCA)

  • PCA is a linear dimensionality reduction method that transforms a set of correlated variables (p) into a smaller k (k<p) number of uncorrelated variables called principal components while retaining as much of the variation in the original dataset as possible.

  • In the context of Machine Learning (ML), PCA is an unsupervised machine learning algorithm that is used for dimensionality reduction.

Feature Extraction using PCA

To extract features from the dataset using the PCA technique, firstly we need to find the percentage of variance explained as dimensionality decreases.


First, we apply PCA keeping the number of components equal to the original number of dimensions.

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_scaled = sc.fit_transform(X)

# Apply PCA
from sklearn.decomposition import PCA
pca = PCA(n_components=None)
pca.fit(X_scaled)

# Get the eigenvalues
print("Eigenvalues:")
print(pca.explained_variance_)
print()

# Get explained variances
print("Variances (Percentage):")
print(pca.explained_variance_ratio_ * 100)
print()

# Make the scree plot
plt.plot(np.cumsum(pca.explained_variance_ratio_ * 100))
plt.xlabel("Number of components (Dimensions)")
plt.ylabel("Explained variance (%)")

Insight:

  • It is observed that with 5 components the percentage of variance explained is about 85%. This means we preserve roughly 85% of the variance by projecting the higher-dimensional space (7 features) into a lower-dimensional space (5 components). (The sketch below shows how to let scikit-learn pick the number of components from a variance target.)
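Instead of reading the number of components off the scree plot, a fractional n_components can be passed to PCA so scikit-learn keeps just enough components to reach a target explained variance. A minimal sketch (an addition, not part of the original notebook), reusing X_scaled from above:

# Sketch: let PCA choose the number of components explaining >= 85% of the variance
from sklearn.decomposition import PCA
pca_85 = PCA(n_components=0.85)
X_pca_85 = pca_85.fit_transform(X_scaled)
print("Components chosen:", pca_85.n_components_)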

# Do feature scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_scaled = sc.fit_transform(X)

# Apply PCA
from sklearn.decomposition import PCA
pca = PCA(n_components=5)
X_pca = pca.fit_transform(X_scaled)

# Get the transformed dataset
X_pca = pd.DataFrame(X_pca)
print(X_pca.head())
print("\nSize: ")
print(X_pca.shape)
fig = plt.figure(figsize=(10, 8))
sns.heatmap(X_pca.corr(), annot=True)

We cannot see any correlation between components. This is because PCA has transformed the set of correlated variables in the original dataset into a set of uncorrelated variables.

# Make train and test sets
from sklearn.model_selection import train_test_split
X_train_pca, X_test_pca, y_train, y_test = train_test_split(X_pca, y, test_size=0.20, shuffle=True, random_state=2)

# Initialize the logistic regression model
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(max_iter=2500)

# Train the model
clf.fit(X_train_pca, y_train)

# Make predictions
y_pred = clf.predict(X_test_pca)  # predictions
y_true = y_test                   # true values

# Measure accuracy
from sklearn.metrics import accuracy_score
import numpy as np
print("Train accuracy:", np.round(accuracy_score(y_train, clf.predict(X_train_pca)), 2))
print("Test accuracy:", np.round(accuracy_score(y_true, y_pred), 2))

# Make the confusion matrix
from sklearn.metrics import confusion_matrix
cf_matrix = confusion_matrix(y_true, y_pred)
print("\nTest confusion_matrix")
sns.heatmap(cf_matrix, annot=True, cmap='Blues')
plt.xlabel('Predicted', fontsize=12)
plt.ylabel('True', fontsize=12)

Advantages of Logistic Regression

Logistic regression is efficient and straightforward: it does not require high computational power, is easy to implement, is easily interpretable, and is widely used by data analysts and scientists.

Disadvantages

Logistic regression is not able to handle a large number of categorical features/variables, and it is vulnerable to overfitting. It also cannot solve non-linear problems on its own, which is why non-linear features require a transformation (see the sketch below). Logistic regression will not perform well with independent variables that are not correlated with the target variable or that are very similar or highly correlated with each other.
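As an illustration of the point about transforming non-linear features, one common approach is to add polynomial and interaction terms before the logistic regression. A minimal sketch (an addition, not part of the original notebook), using its own train/test split of the X and y defined earlier:

# Sketch: capture non-linear effects by expanding the features in a pipeline
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

nonlinear_logreg = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),  # squared and interaction terms
    StandardScaler(),
    LogisticRegression(max_iter=2500),
)

Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.25, random_state=0)
nonlinear_logreg.fit(Xtr, ytr)
print("Test accuracy with polynomial features:", nonlinear_logreg.score(Xte, yte))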