Path: blob/main/Lessons/Lesson 08 - Hyperparameter Optimization (Project)/extras/Confusion_Matrix_and_Report.ipynb
Confusion Matrix and Classification Report
Predicting Opioid Abuse from Perception of Risk
This project uses data from the 2016 National Survey on Drug Use and Health to attempt to predict opioid abuse risk from responses to a small number of survey questions about the perceived risk of alcohol, tobacco, and substance use. The intent was to create a screening tool for participants in Division of Extension education programs that could flag individuals who might be more at risk, so that additional targeted interventions could be provided.
Extensive data cleaning was performed in R, resulting in a dataset with 40,241 adults with no history of opioid abuse and 2,381 adults with a history of opioid abuse.
Let's read in the data and one-hot-encode the categorical variables for sklearn.
We'll also make a much smaller dataset for demonstration purposes; otherwise, this code runs extremely slowly. For more accurate results, use the entire dataset.
Loading the data
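Here is a minimal sketch of what that loading step might look like. The file name nsduh_opioid.csv, the target column opioid_abuse, and the subsample size are placeholders, not names from the original notebook; substitute whatever the cleaned export from R actually uses.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder file name -- use the actual cleaned NSDUH export.
df = pd.read_csv("nsduh_opioid.csv")

# One-hot-encode the categorical survey responses so sklearn can use them.
# "opioid_abuse" is a placeholder name for the 0/1 target column.
X = pd.get_dummies(df.drop(columns=["opioid_abuse"]))
y = df["opioid_abuse"]

# Keep a small stratified subsample so the demo runs quickly; use the full
# dataset for more accurate results.
X_small, _, y_small, _ = train_test_split(
    X, y, train_size=5000, stratify=y, random_state=42
)

# Hold out a test set for evaluating the classifiers below.
X_train, X_test, y_train, y_test = train_test_split(
    X_small, y_small, test_size=0.25, stratify=y_small, random_state=42
)
```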
For every 1 opioid user in our dataset, we have approximately 17 non-opioid users. Given that our sample is so imbalanced, we'll need some mechanism to even the scales. Luckily, sklearn has ways of handling that. For instance, in LogisticRegression, we can pass the class_weight parameter to have the classifier compensate for the imbalance.
An example classifier
Let's do a simple logistic regression. We'll compare our accuracy score for a model that does not account for our imbalanced data with one that does account for it.
Note that all we need to do to balance it is to set the class_weight parameter to the value "balanced". We found this parameter by consulting the documentation for sklearn's LogisticRegression.
The documentation states that "balanced" mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data. In other words, it weights the minority class more heavily, so that the classifier does a better job of finding those needles.
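A sketch of the comparison, reusing the X_train/X_test split from the loading sketch above (the variable names are assumptions carried over from that sketch):

```python
from sklearn.linear_model import LogisticRegression

# Model that ignores the class imbalance.
unweighted = LogisticRegression(max_iter=1000)
unweighted.fit(X_train, y_train)

# Model that reweights the classes inversely to their frequencies.
weighted = LogisticRegression(max_iter=1000, class_weight="balanced")
weighted.fit(X_train, y_train)

# .score() reports accuracy for classifiers.
print("Unweighted accuracy:", unweighted.score(X_test, y_test))
print("Balanced accuracy:  ", weighted.score(X_test, y_test))
```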
The accuracy for the unweighted model sure looks good, doesn't it? Hm... Let's look at another metric.
Accuracy vs. Area Under the Curve
Accuracy is the fraction of predicted values that match the actual values. Area Under the Curve (AUC) is a different measure for scoring classifiers. An AUC of 0.5 indicates random guessing, or the inability of your classifier to separate the two groups, whereas an AUC of 1 indicates a perfect classifier.
We'll also track AUC for our classifiers.
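One way to compute it, continuing the sketch above. Whether the original notebook scored the hard 0/1 predictions or the predicted probabilities is an assumption here; scoring the hard predictions is what makes an all-negative classifier come out at exactly 0.5, which matches the "no better than random guessing" result described next.

```python
from sklearn.metrics import roc_auc_score

# Score the hard 0/1 predictions; an all-negative model lands at AUC = 0.5.
# Scoring predict_proba(X_test)[:, 1] instead would usually give a
# different (higher) number.
auc_unweighted = roc_auc_score(y_test, unweighted.predict(X_test))
auc_weighted = roc_auc_score(y_test, weighted.predict(X_test))

print("Unweighted AUC:", auc_unweighted)
print("Balanced AUC:  ", auc_weighted)
```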
Even though our accuracy was really high for the model that didn't take the imbalanced nature of the data into account, when we look at area under the curve, we can see that the model actually did no better than random guessing.
Confusion Matrix and Statistics
A confusion matrix is a quick way to look at how well your classifier did, and from it we can derive some more statistics. Specifically, we'll be looking at sensitivity (true positive rate), specificity (true negative rate), and precision (positive predictive value).
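As a sketch of how those statistics fall out of the confusion matrix (again reusing the models and test split assumed above):

```python
from sklearn.metrics import confusion_matrix

# For binary 0/1 labels, confusion_matrix returns [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_test, weighted.predict(X_test)).ravel()

sensitivity = tp / (tp + fn)   # true positive rate (recall)
specificity = tn / (tn + fp)   # true negative rate
precision = tp / (tp + fp)     # positive predictive value

print(f"Sensitivity: {sensitivity:.3f}")
print(f"Specificity: {specificity:.3f}")
print(f"Precision:   {precision:.3f}")
```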
Sklearn provides a quick and easy way to get these statistics via the classification_report function (in a binary report, recall for the negative class is the specificity).
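For example, continuing the sketch:

```python
from sklearn.metrics import classification_report

# Per-class precision, recall (sensitivity), and F1 for both models.
print("Unweighted model:")
print(classification_report(y_test, unweighted.predict(X_test)))

print("Balanced model:")
print(classification_report(y_test, weighted.predict(X_test)))
```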
When we look at our confusion matrix and statistics, we can see why the area under the curve was so bad for the model that ignored the imbalance: it simply predicted that everyone was not an opioid user. This is the behavior we expected. But you can see that the model that used class weights to balance the data did a much better job. It overpredicted the number of users, but it also correctly identified most of the users in the test set.