GitHub Repository: YStrano/DataScience_GA
Path: blob/master/lessons/lesson_08/extra-materials/titanic_confusion.ipynb
Kernel: Python 3

Logistic regression exercise with Titanic data

Introduction

  • Data from Kaggle's Titanic competition (the competition page provides the data and a data dictionary)

  • Goal: Predict survival based on passenger characteristics

  • titanic.csv is already in our repo, so there is no need to download the data from the Kaggle website

Step 1: Read the data into Pandas

import pandas as pd

url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/titanic.csv'
titanic = pd.read_csv(url, index_col='PassengerId')
titanic.head()

Step 2: Create X and y

Define Pclass and Parch as the features, and Survived as the response.

feature_cols = ['Pclass', 'Parch']
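
A minimal sketch to finish this step, assuming the titanic DataFrame from Step 1 is still in memory:

# select the feature columns as X and the response as y
X = titanic[feature_cols]
y = titanic.Survived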

Step 3: Split the data into training and testing sets
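
One possible sketch for this step, using scikit-learn's train_test_split; the random_state value is an arbitrary choice for reproducibility, and the exact counts shown later in the notebook depend on the particular split:

# split X and y into training and testing sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)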

Step 4: Fit a logistic regression model and examine the coefficients

Confirm that the coefficients make intuitive sense.
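
One way to complete this step; the name logreg is kept because later cells call logreg.predict_proba:

# fit a logistic regression model on the training set
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
logreg.fit(X_train, y_train)

# pair each feature name with its fitted coefficient
print(dict(zip(feature_cols, logreg.coef_[0])))

For example, a negative coefficient on Pclass would match the intuition that passengers in higher-numbered (lower) classes were less likely to survive.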

Step 5: Make predictions on the testing set and calculate the accuracy

# class predictions (not predicted probabilities)
# calculate classification accuracy
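
A sketch that fills in the two commented steps above; it defines y_pred_class and imports metrics, both of which the confusion matrix cells below rely on:

# class predictions (not predicted probabilities)
y_pred_class = logreg.predict(X_test)

# calculate classification accuracy
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred_class))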

Step 6: Compare your testing accuracy to the null accuracy

# this works regardless of the number of classes
# this only works for binary classification problems coded as 0/1
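
Two possible ways to compute the null accuracy (the accuracy obtained by always predicting the most frequent class), matching the two comments above:

# this works regardless of the number of classes
print(y_test.value_counts().head(1) / len(y_test))

# this only works for binary classification problems coded as 0/1
print(max(y_test.mean(), 1 - y_test.mean()))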

Confusion matrix of Titanic predictions

# print confusion matrix
print(metrics.confusion_matrix(y_test, y_pred_class))

# save confusion matrix and slice into four pieces
confusion = metrics.confusion_matrix(y_test, y_pred_class)
TP = confusion[1][1]
TN = confusion[0][0]
FP = confusion[0][1]
FN = confusion[1][0]

print('True Positives:', TP)
print('True Negatives:', TN)
print('False Positives:', FP)
print('False Negatives:', FN)

# calculate the sensitivity
print(TP / float(TP + FN))
print(44 / float(44 + 51))

# calculate the specificity
print(TN / float(TN + FP))
print(105 / float(105 + 23))

# store the predicted probabilities
y_pred_prob = logreg.predict_proba(X_test)[:, 1]

# histogram of predicted probabilities
%matplotlib inline
import matplotlib.pyplot as plt

plt.hist(y_pred_prob)
plt.xlim(0, 1)
plt.xlabel('Predicted probability of survival')
plt.ylabel('Frequency')

# increase sensitivity by lowering the threshold for predicting survival
import numpy as np
y_pred_class = np.where(y_pred_prob > 0.3, 1, 0)

# old confusion matrix
print(confusion)

# new confusion matrix
print(metrics.confusion_matrix(y_test, y_pred_class))

# new sensitivity (higher than before)
print(63 / float(63 + 32))

# new specificity (lower than before)
print(72 / float(72 + 56))