GitHub Repository: DataScienceUWL/DS775
Path: blob/main/Lessons/Lesson 08 - Hyperparameter Optimization (Project)/extras/Confusion_Matrix_and_Report.ipynb

Confusion Matrix and Classification Report

Predicting Opioid Abuse from Perception of Risk

The data for this project comes from the 2016 National Survey on Drug Use and Health and is used to attempt to predict opioid abuse risk from responses to a small number of survey questions about the perceived risk of alcohol, tobacco, and substance use. The intent was to create a screening tool for participants in Division of Extension education programs that could flag individuals who might be at higher risk, so that additional targeted interventions could be provided.

Extensive data cleaning was performed in R, resulting in a dataset with 40,241 adults with no history of opioid abuse and 2,381 adults with a history of opioid abuse.

Let's read in the data and one-hot encode the categorical variables for sklearn.

We'll also make a much smaller dataset for demonstration purposes; otherwise, this code runs extremely slowly. For more accurate results, the entire dataset should be used.

Loading the data

# imports
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split

# read in the data
X = pd.read_csv('../data/opioid_data.csv')

# grab the y column (1 = opioid user, 0 = not a user)
y = np.array(X['isUser'])

# drop the y column
X = X.drop(columns=['isUser'])

# one-hot encode the categories
onehot_encoder = OneHotEncoder(sparse=False, categories='auto')
X = onehot_encoder.fit_transform(X)

# split into test and training data
# for testing, split twice to get a much smaller dataset - just 5000
# comment out this line to run with the entire data set
x_train_toss, X, y_train_toss, y = train_test_split(X, y, test_size=5000, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# just to confirm how many records we're dealing with...
print('Final Training Size', len(X_train))
print('Final Testing Size', len(X_test))
Final Training Size 4000
Final Testing Size 1000

For every opioid user in our dataset, we have approximately 17 non-opioid users. Given that our sample is so imbalanced, we'll need some mechanism to even the scales. Luckily, sklearn has ways of handling that. For instance, in LogisticRegression, we can pass the class_weight parameter to have the classes treated as "balanced".
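If you want to verify the imbalance for yourself, here's a minimal sketch (assuming the y array from the loading cell above, so the counts reflect the 5,000-record subsample rather than the full dataset):

# quick check of the class imbalance in y (0 = not a user, 1 = user)
counts = np.bincount(y)
print('non-users:', counts[0], 'users:', counts[1])
print('approximate ratio of non-users to users:', counts[0] / counts[1])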

An example classifier

Let's do a simple logistic regression. We'll compare the accuracy score for a model that does not account for our imbalanced data with the score for one that does.

Note that all we need to do to make it balanced is to use the class_weight parameter with the value of balanced. We found the needed parameter by consulting the documentation for sklearn LogisticRegression.

The documentation states that "balanced" mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data. In other words, it more strongly weights the minority class, so that the classifier does a better job of finding those needles.
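To make that concrete, here's a small sketch of the weight calculation described in the documentation (weight for a class = number of samples / (number of classes * count of that class)), assuming the y_train array from the split above:

# approximate weights that class_weight='balanced' would assign
counts = np.bincount(y_train)           # counts for class 0 (not user) and class 1 (user)
weights = len(y_train) / (2 * counts)   # two classes
print('weight for non-users (class 0):', weights[0])
print('weight for users (class 1):', weights[1])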

from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# we do need to go higher than the default number of iterations for the solver to converge,
# and the explicit declaration of the solver avoids a warning message; otherwise
# the parameters are defaults.

# without balancing
logreg_model_imbalanced = LogisticRegression(solver='lbfgs', max_iter=1000)

# fit
logreg_model_imbalanced.fit(X_train, y_train)

# use the score method to get the accuracy of the model
score_imbalanced = logreg_model_imbalanced.score(X_test, y_test)  # this is accuracy
print('Score (Accuracy) - Imbalanced:', score_imbalanced)

# with balancing
logreg_model = LogisticRegression(solver='lbfgs', max_iter=1000, class_weight='balanced')

# fit
logreg_model.fit(X_train, y_train)

# use the score method to get the accuracy of the balanced model
score = logreg_model.score(X_test, y_test)  # this is accuracy
print('Score (Accuracy) - Balanced:', score)
Score (Accuracy) - Imbalanced: 0.946
Score (Accuracy) - Balanced: 0.701

Our imbalanced score sure looks good, doesn't it? Hm... Let's look at another metric.

Accuracy vs. Area Under the Curve

Accuracy is the proportion of predicted values that match the actual values. Area Under the Curve (AUC), the area under the ROC curve, is a different measure for scoring classifiers. An AUC of 0.5 would indicate random guessing, or the inability of your classifier to separate the two groups, whereas an AUC of 1 would indicate a perfect classifier.
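To see why accuracy alone can mislead on imbalanced data, here's a small toy sketch (the numbers are chosen only to mimic our test set's class balance) of a "classifier" that predicts "not a user" for everyone:

# a do-nothing classifier on imbalanced data: high accuracy, useless AUC
import numpy as np
from sklearn import metrics

y_true = np.array([0] * 946 + [1] * 54)   # roughly our test set's class balance
y_pred = np.zeros_like(y_true)            # predict "not a user" for everyone

print('Accuracy:', metrics.accuracy_score(y_true, y_pred))   # 0.946 - looks great
fpr, tpr, thresholds = metrics.roc_curve(y_true, y_pred, pos_label=1)
print('AUC:', metrics.auc(fpr, tpr))                          # 0.5 - random guessing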

We'll also track AUC for our classifiers.

# get AUC for the imbalanced model
y_pred = logreg_model_imbalanced.predict(X_test)
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred, pos_label=1)
auc = metrics.auc(fpr, tpr)
print('Area Under the Curve (imbalanced):', auc)

# get AUC for the balanced model
y_pred = logreg_model.predict(X_test)
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred, pos_label=1)
auc = metrics.auc(fpr, tpr)
print('Area Under the Curve (balanced):', auc)
Area Under the Curve (imbalanced): 0.5
Area Under the Curve (balanced): 0.6760825307336935

Even though our accuracy was really high for the model that didn't take the imbalanced nature of the data into account, when we look at area under the curve, we can see that the model actually did no better than random guessing.
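As a side note, AUC is often computed from predicted probabilities rather than hard 0/1 predictions, since the probabilities let the ROC curve sweep across all thresholds instead of just one. A minimal sketch, assuming logreg_model and the test data from above:

# AUC from predicted probabilities for the balanced model
y_prob = logreg_model.predict_proba(X_test)[:, 1]   # predicted probability of being a user
print('AUC from probabilities (balanced):', metrics.roc_auc_score(y_test, y_prob))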

Confusion Matrix and Statistics

A confusion matrix is a quick way to look at how well your classifier did, and from it we can derive some more statistics. Specifically, we'll be looking at sensitivity (true positive rate), specificity (true negative rate), and precision (positive predictive value).
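All three of these statistics come straight from the four cells of a 2x2 confusion matrix. Here's a minimal sketch of computing them by hand for the balanced model (with sklearn's default label order [0, 1], ravel() returns tn, fp, fn, tp):

from sklearn.metrics import confusion_matrix

y_pred_balanced = logreg_model.predict(X_test)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_balanced).ravel()
sensitivity = tp / (tp + fn)    # true positive rate
specificity = tn / (tn + fp)    # true negative rate
precision   = tp / (tp + fp)    # positive predictive value
print('sensitivity:', sensitivity, 'specificity:', specificity, 'precision:', precision)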

Sklearn provides a quick and easy way to get the statistics via the classification_report function.

# obtaining the confusion matrix and making it look nice
from sklearn.metrics import confusion_matrix, classification_report
import pandas as pd

# get predictions from the imbalanced model
y_pred = logreg_model_imbalanced.predict(X_test)

# must put true values before predictions in the confusion matrix function
cmtx = pd.DataFrame(
    confusion_matrix(y_test, y_pred, labels=[1, 0]),
    index=['true:user', 'true:not user'],
    columns=['pred:user', 'pred:not user']
)
print('Imbalanced Confusion Matrix:')
display(cmtx)

# we can also get the classification report directly from sklearn
cr = classification_report(y_test, y_pred, output_dict=True)
print('Imbalanced Statistics:')
display(cr)

# get predictions from the balanced model
y_pred = logreg_model.predict(X_test)

# must put true values before predictions in the confusion matrix function
cmtx = pd.DataFrame(
    confusion_matrix(y_test, y_pred, labels=[1, 0]),
    index=['true:user', 'true:not user'],
    columns=['pred:user', 'pred:not user']
)
print('Balanced Confusion Matrix:')
display(cmtx)

# we can also get the classification report directly from sklearn
cr = classification_report(y_test, y_pred, output_dict=True)
print('Balanced Statistics:')
display(cr)
Imbalanced Confusion Matrix:
Imbalanced Statistics:
{'0': {'precision': 0.946, 'recall': 1.0, 'f1-score': 0.9722507708119219, 'support': 946},
 '1': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 54},
 'accuracy': 0.946,
 'macro avg': {'precision': 0.473, 'recall': 0.5, 'f1-score': 0.48612538540596095, 'support': 1000},
 'weighted avg': {'precision': 0.8949159999999999, 'recall': 0.946, 'f1-score': 0.9197492291880782, 'support': 1000}}
Balanced Confusion Matrix:
Balanced Statistics:
{'0': {'precision': 0.9722627737226277, 'recall': 0.7040169133192389, 'f1-score': 0.8166768853464133, 'support': 946},
 '1': {'precision': 0.1111111111111111, 'recall': 0.6481481481481481, 'f1-score': 0.18970189701897017, 'support': 54},
 'accuracy': 0.701,
 'macro avg': {'precision': 0.5416869424168694, 'recall': 0.6760825307336935, 'f1-score': 0.5031893911826917, 'support': 1000},
 'weighted avg': {'precision': 0.9257605839416058, 'recall': 0.701, 'f1-score': 0.7828202359767313, 'support': 1000}}

When we look at our confusion matrix and statistics, we can see why the area under the curve was so bad for the imbalanced model: it simply predicted that everyone was not an opioid user, which is exactly the behavior we expected. The model that used class weights to balance the data did a much better job. It overpredicted the number of users, but it also correctly identified most of the actual users in the test set.
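As a final side note, if you leave off output_dict=True, classification_report returns a preformatted text table, which can be easier to read than the raw dictionary. A small sketch, reusing the balanced model's predictions (y_pred) from the cell above:

# print the classification report as a readable text table
print(classification_report(y_test, y_pred, target_names=['not user', 'user']))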