GitHub Repository: YStrano/DataScience_GA
Path: blob/master/lessons/lesson_08/extra-materials/titanic_confusion.ipynb
Kernel: Python 3

Logistic regression exercise with Titanic data

Introduction

  • Data from Kaggle's Titanic competition (the competition page provides the data and a data dictionary)

  • Goal: Predict survival based on passenger characteristics

  • titanic.csv is already in our repo, so there is no need to download the data from the Kaggle website

Step 1: Read the data into Pandas

import pandas as pd

url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/titanic.csv'
titanic = pd.read_csv(url, index_col='PassengerId')
titanic.head()

Step 2: Create X and y

Define Pclass and Parch as the features, and Survived as the response.

feature_cols = ['Pclass', 'Parch']
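
A minimal sketch to finish this step, assuming the titanic DataFrame from Step 1 is still in memory:

# select the feature columns as X and the response as y
X = titanic[feature_cols]
y = titanic.Survived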

Step 3: Split the data into training and testing sets
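
One possible sketch for this step, using scikit-learn's train_test_split; the random_state value is an arbitrary choice for reproducibility, and the exact counts shown later in the notebook depend on the particular split:

# split X and y into training and testing sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)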

Step 4: Fit a logistic regression model and examine the coefficients

Confirm that the coefficients make intuitive sense.
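
One way to complete this step; the name logreg is kept because later cells call logreg.predict_proba:

# fit a logistic regression model on the training set
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
logreg.fit(X_train, y_train)

# pair each feature name with its fitted coefficient
print(dict(zip(feature_cols, logreg.coef_[0])))

For example, a negative coefficient on Pclass would match the intuition that passengers in higher-numbered (lower) classes were less likely to survive.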

Step 5: Make predictions on the testing set and calculate the accuracy

# class predictions (not predicted probabilities)
# calculate classification accuracy
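
A sketch that fills in the two commented steps above; it defines y_pred_class and imports metrics, both of which the confusion matrix cells below rely on:

# class predictions (not predicted probabilities)
y_pred_class = logreg.predict(X_test)

# calculate classification accuracy
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred_class))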

Step 6: Compare your testing accuracy to the null accuracy

# this works regardless of the number of classes
# this only works for binary classification problems coded as 0/1
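
Two possible ways to compute the null accuracy (the accuracy obtained by always predicting the most frequent class), matching the two comments above:

# this works regardless of the number of classes
print(y_test.value_counts().head(1) / len(y_test))

# this only works for binary classification problems coded as 0/1
print(max(y_test.mean(), 1 - y_test.mean()))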

Confusion matrix of Titanic predictions

# print confusion matrix
print(metrics.confusion_matrix(y_test, y_pred_class))

# save confusion matrix and slice into four pieces
confusion = metrics.confusion_matrix(y_test, y_pred_class)
TP = confusion[1][1]
TN = confusion[0][0]
FP = confusion[0][1]
FN = confusion[1][0]

print('True Positives:', TP)
print('True Negatives:', TN)
print('False Positives:', FP)
print('False Negatives:', FN)

# calculate the sensitivity
print(TP / float(TP + FN))
print(44 / float(44 + 51))

# calculate the specificity
print(TN / float(TN + FP))
print(105 / float(105 + 23))

# store the predicted probabilities
y_pred_prob = logreg.predict_proba(X_test)[:, 1]

# histogram of predicted probabilities
%matplotlib inline
import matplotlib.pyplot as plt

plt.hist(y_pred_prob)
plt.xlim(0, 1)
plt.xlabel('Predicted probability of survival')
plt.ylabel('Frequency')

# increase sensitivity by lowering the threshold for predicting survival
import numpy as np
y_pred_class = np.where(y_pred_prob > 0.3, 1, 0)

# old confusion matrix
print(confusion)

# new confusion matrix
print(metrics.confusion_matrix(y_test, y_pred_class))

# new sensitivity (higher than before)
print(63 / float(63 + 32))

# new specificity (lower than before)
print(72 / float(72 + 56))