Path: blob/master/model_selection/auc/auc.ipynb
ROC/AUC for Binary Classification
For this documentation, we'll be working with a human resource dataset. Our goal is to identify the employees that are likely to leave in the future and act upon our findings, i.e. retain them before they choose to leave. This dataset contains 12000 observations and 7 variables, each representing:
| Variable | Description |
|----------|-------------|
| S | The satisfaction level on a scale of 0 to 1 |
| LPE | Last project evaluation by a client on a scale of 0 to 1 |
| NP | The number of projects the employee worked on in the last 12 months |
| ANH | Average number of hours worked in the last 12 months for that employee |
| TIC | Amount of time the employee spent in the company, measured in years |
| Newborn | Takes the value 1 if the employee had a newborn within the last 12 months and 0 otherwise |
| left | 1 if the employee left the company, 0 if they're still working here. This is our response variable |
To train and evaluate the model, we'll perform a simple train/test split. 80 percent of the dataset will be used to actually train the model, while the rest will be used to evaluate the model's accuracy, i.e. its out-of-sample error. Note that the best practice is a three-way train/validation/test split.
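As a rough sketch of this step (the DataFrame built here is a small synthetic stand-in for the HR data, not the actual dataset), the split can be done with scikit-learn's `train_test_split`:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# a tiny synthetic stand-in for the HR DataFrame described above (hypothetical values)
rng = np.random.RandomState(1234)
df = pd.DataFrame({
    'S': rng.uniform(0, 1, size=200),
    'LPE': rng.uniform(0, 1, size=200),
    'NP': rng.randint(2, 8, size=200),
    'ANH': rng.randint(100, 300, size=200),
    'TIC': rng.randint(1, 10, size=200),
    'Newborn': rng.randint(0, 2, size=200),
    'left': rng.binomial(1, 0.16, size=200),
})

label_col = 'left'
X = df.drop(columns=label_col)
y = df[label_col]

# 80/20 split; stratify keeps the positive rate similar in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1234, stratify=y)
```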
This probability table tells us that around 16 percent of the employees who joined the company have left! If those employees are all the ones that were performing well, then this is probably not a good sign. We'll leave the rest of the exploratory analysis to you ...
Sklearn Transformer
We then perform some generic data preprocessing: standardizing the numeric columns, one-hot encoding the categorical columns (the "Newborn" variable is treated as a categorical variable) and converting everything into the numpy array that sklearn expects. This generic preprocessing step is written as a custom sklearn Transformer. You don't have to follow this structure if you prefer your own way of doing it.
To roll out our own Transformer that adheres to the sklearn API, we need to:
Ensure that all arguments to the `__init__` method are explicit, i.e. `*args` or `**kwargs` should be avoided, as they will not be correctly handled within cross-validation routines.
Subclass/Inherit `BaseEstimator` to get some free stuff. It gives us class representations that are more informative when printing the class object, and provides the `get_params` and `set_params` methods. These functionalities are used in sklearn's methods such as GridSearch and RandomSearch.
Subclass/Inherit an appropriate class for the task (one of `ClassifierMixin`, `RegressorMixin`, `ClusterMixin`, `TransformerMixin`). In our case, we will be implementing a Transformer, thus we'll be subclassing `TransformerMixin`. For a transformer, we need to implement a `.fit` method that learns whatever it needs from the training data and a `.transform` method that can perform the transformation on both the training and test data. Note that we don't need to subclass `TransformerMixin` for this to work, but it does give the end-user the idea that this is a Transformer, and we get the `.fit_transform` method, which does the fitting and transforming on the training data in one shot, for free.
In the fit implementation, you'll notice that results learned during the `.fit` method are stored with a trailing underscore (e.g. `self.colnames_`). This is a convention used in sklearn so that we can quickly scan the members of an estimator and distinguish which members were fit during training.
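To make this concrete, here is a minimal sketch of what such a Transformer could look like; the column names and the exact column handling are assumptions based on the table above, not the notebook's actual implementation:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler, OneHotEncoder


class Preprocess(BaseEstimator, TransformerMixin):
    """Standardize numeric columns and one-hot encode categorical columns."""

    def __init__(self, num_cols=None, cat_cols=None):
        # explicit keyword arguments only; no *args / **kwargs
        self.num_cols = num_cols
        self.cat_cols = cat_cols

    def fit(self, data, y=None):
        # everything learned during fit is stored with a trailing underscore
        self.scaler_ = StandardScaler().fit(data[self.num_cols])
        self.encoder_ = OneHotEncoder().fit(data[self.cat_cols])
        # get_feature_names_out requires sklearn >= 1.0
        self.colnames_ = (list(self.num_cols) +
                          list(self.encoder_.get_feature_names_out(self.cat_cols)))
        return self

    def transform(self, data):
        scaled = self.scaler_.transform(data[self.num_cols])
        encoded = self.encoder_.transform(data[self.cat_cols]).toarray()
        return np.hstack([scaled, encoded])


# usage: assuming the X_train / X_test split from the earlier sketch
num_cols = ['S', 'LPE', 'NP', 'ANH', 'TIC']
cat_cols = ['Newborn']
preprocess = Preprocess(num_cols=num_cols, cat_cols=cat_cols)
X_train_prep = preprocess.fit_transform(X_train)
X_test_prep = preprocess.transform(X_test)
```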
If you would like to read more on this topic, the following two links might be of interest to you.
After training our model, we need to evaluate whether it's any good or not, and the most straightforward and intuitive metric for a supervised classifier's performance is accuracy. Unfortunately, there are circumstances where simple accuracy does not work well. For example, with a disease that only affects 1 in a million people, a completely bogus screening test that always reports "negative" will be 99.9999% accurate. Unlike accuracy, ROC curves are less sensitive to class imbalance; the bogus screening test would have an AUC of 0.5, which is like not having a test at all.
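To see this numerically, here is a small illustration (with a hypothetical 1-in-1000 prevalence rather than 1 in a million, so that positives actually appear in a modest sample):

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.RandomState(0)

# hypothetical rare-disease labels: roughly 1 in 1000 positives
y_true = rng.binomial(1, 0.001, size=100_000)

# a bogus screening test: always predicts "negative", with an uninformative random score
y_pred = np.zeros_like(y_true)
y_score = rng.uniform(size=y_true.shape[0])

print('accuracy:', accuracy_score(y_true, y_pred))   # ~0.999, looks great
print('roc auc :', roc_auc_score(y_true, y_score))   # ~0.5, i.e. no better than chance
```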
ROC curves
ROC curve (Receiver Operating Characteristic) is a commonly used way to visualize the performance of a binary classifier, and AUC (Area Under the ROC Curve) is used to summarize its performance in a single number. Most machine learning algorithms have the ability to produce probability scores that tell us how strongly the model thinks a given observation is positive. Turning these probability scores into yes or no predictions requires setting a threshold; cases with scores above the threshold are classified as positive, and vice versa. Different threshold values can lead to different results:
A higher threshold is more conservative about labeling a case as positive; this makes it less likely to produce false positives (an observation that has a negative label but gets classified as positive by the model) but more likely to miss cases that are in fact positive (lower true positive rate)
A lower threshold produces positive labels more liberally, so it creates more false positives but also generates more true positives, as the short sketch below illustrates
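A short sketch of this trade-off on synthetic data (an assumption on our part, not the notebook's HR model): sweep the threshold and watch the confusion-matrix counts shift.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# a small synthetic binary problem with roughly 16% positives, standing in for the HR data
X, y = make_classification(n_samples=2000, weights=[0.84, 0.16], random_state=0)
model = LogisticRegression().fit(X, y)
y_score = model.predict_proba(X)[:, 1]  # probability of the positive class

for threshold in (0.3, 0.5, 0.7):
    y_pred = (y_score >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y, y_pred).ravel()
    print(f'threshold={threshold:.1f}  TP={tp:4d}  FP={fp:4d}  FN={fn:4d}  TN={tn:4d}')
```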
A quick refresher on terminology:
$$\text{true positive rate} = \frac{[\text{# positive data points with positive predictions}]}{[\text{# all positive data points}]} = \frac{TP}{TP + FN}$$

true positive rate is also known as recall or sensitivity

$$\text{false positive rate} = \frac{[\text{# negative data points with positive predictions}]}{[\text{# all negative data points}]} = \frac{FP}{FP + TN}$$

The ROC curve is created by plotting the true positive rate (when it's actually a yes, how often does it predict yes?) on the y-axis against the false positive rate (when it's actually a no, how often does it predict yes?) on the x-axis at various cutoff settings, giving us a picture of the whole spectrum of the trade-off we're making between the two measures.
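As a quick sanity check of these formulas with some made-up counts (hypothetical numbers, not from our dataset): suppose TP = 30, FN = 10, FP = 20 and TN = 140, then

$$\text{true positive rate} = \frac{30}{30 + 10} = 0.75 \qquad \text{false positive rate} = \frac{20}{20 + 140} = 0.125$$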
If all this true/false positive terminology is confusing to you, consider reading the material at the following link. Blog: Simple guide to confusion matrix terminology
Implementation
There are packages that plot ROC curves and compute metrics from them, but it can still be worthwhile to work through how these curves are calculated from scratch to better understand exactly what they are showing us.
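The notebook defines its own function for this; the following is just a minimal sketch of the idea (the name `binary_clf_counts` is hypothetical), counting cumulative true/false positives at every unique score treated as a threshold:

```python
import numpy as np


def binary_clf_counts(y_true, y_score):
    """
    For every unique predicted score treated as a threshold, return the
    cumulative true positive and false positive counts (a hypothetical helper;
    the notebook's own function may be structured differently).
    """
    # sort scores from highest to lowest and reorder the labels accordingly
    order = np.argsort(y_score)[::-1]
    y_score = y_score[order]
    y_true = y_true[order]

    # indices where the score value changes; each marks a distinct threshold
    distinct_idx = np.where(np.diff(y_score))[0]
    threshold_idx = np.r_[distinct_idx, y_true.size - 1]

    # cumulative count of positives above each threshold; the rest are false positives
    tps = np.cumsum(y_true)[threshold_idx]
    fps = (1 + threshold_idx) - tps
    thresholds = y_score[threshold_idx]
    return tps, fps, thresholds


# tiny hand-checkable example
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])
print(binary_clf_counts(y_true, y_score))
```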
From the result above, we can see that the function computes the true/false positive counts for every unique threshold in the predicted score `y_score`. We can validate the result by hand to confirm that the calculation is in fact correct.
Recall that the ROC curve plots the true positive rate on the y-axis and the false positive rate on the x-axis. Thus all we need to do is convert the counts into rates and we have our ROC curve.
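Continuing the sketch above (still using the hypothetical `binary_clf_counts` helper and the tiny example arrays), the conversion is a single division by the total number of positives and negatives:

```python
# the lowest threshold predicts everything positive, so the last entries
# hold the total number of positives and negatives
tps, fps, thresholds = binary_clf_counts(y_true, y_score)
tpr = tps / tps[-1]
fpr = fps / fps[-1]
print(tpr)  # [0.5 0.5 1.  1. ]
print(fpr)  # [0.  0.5 0.5 1. ]
```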
Now, to calculate the AUC (Area Under the Curve) for the ROC curve, we need to sum up the rectangular areas and the triangular areas under the curve, depicted by the visualization below:
For the rectangular area (the plot on the left illustrates one of them), the heights are the TPR (true positive rate) values and the widths are the differences in the FPR (false positive rate), so the total area of all the rectangles is the dot product of the TPR values and the FPR differences
For the triangular area (the plot on the right illustrates one of them), the heights are the differences in TPR (true positive rate) and the widths are the differences in the FPR (false positive rate), so the total area of all the enclosing rectangles is the dot product of the TPR differences and the FPR differences. But only half the area of each such rectangle lies below its segment of the ROC curve, thus we divide it by 2 to obtain the triangular area (see the sketch below)
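A sketch of this area calculation, assuming the `tpr` and `fpr` arrays from the previous step (prepended with a (0, 0) point); note this is equivalent to the trapezoidal rule:

```python
import numpy as np


def auc_from_rates(tpr, fpr):
    """Area under the ROC curve as rectangles plus triangles (trapezoidal rule)."""
    tpr = np.r_[0.0, tpr]
    fpr = np.r_[0.0, fpr]
    d_fpr = np.diff(fpr)

    # rectangles: height = previous TPR value, width = change in FPR
    rect_area = np.sum(tpr[:-1] * d_fpr)
    # triangles: half of (change in TPR) * (change in FPR)
    tri_area = np.sum(np.diff(tpr) * d_fpr) / 2.0
    return rect_area + tri_area


# for the tiny example above this gives 0.75, matching sklearn.metrics.roc_auc_score
print(auc_from_rates(np.array([0.5, 0.5, 1.0, 1.0]), np.array([0.0, 0.5, 0.5, 1.0])))
```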
After working through the implementation of the ROC curve and AUC score from scratch, we now pull back and visualize (a condensed plotting sketch follows below):
The ROC curve of our original model
Dotted lines represent the ROC curves of a purely random classifier and a perfect classifier
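The figure itself comes from the notebook's plotting code; a condensed sketch of the same picture using scikit-learn and matplotlib (on the synthetic data from the earlier sketches, scored on the training data purely for illustration) could look like:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score

X, y = make_classification(n_samples=2000, weights=[0.84, 0.16], random_state=0)
y_score = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]

fpr, tpr, _ = roc_curve(y, y_score)
auc = roc_auc_score(y, y_score)

plt.plot(fpr, tpr, label=f'model (AUC = {auc:.3f})')
plt.plot([0, 1], [0, 1], linestyle='--', label='random classifier')
plt.plot([0, 0, 1], [0, 1, 1], linestyle=':', label='perfect classifier')
plt.xlabel('false positive rate')
plt.ylabel('true positive rate')
plt.legend()
plt.show()
```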
The goal of visualizing the ROC curve is to let us know how well our classifier can be expected to perform in general, across a variety of different baseline probabilities (percentage of the majority class).
The diagonal line depicts a completely random classifier, and ideally our model's ROC curve should bend toward the top-left corner and stay as far away from the diagonal line as possible.
Side note: Apart from comparing the model's ROC curve against the ROC curve of a classifier that does random guessing, it's also useful to plot the ROC curve of different classifiers to compare performance against each other.
AUC probabilistic interpretation
The probabilistic interpretation of the AUC metric is that if we randomly choose a positive case and a negative case, the AUC is the probability that the positive case outranks the negative case according to the classifier's prediction. Hopefully, this is evident from the ROC curve figure, where the plot enumerates all possible combinations of positive and negative cases, and the fraction under the curve comprises the area where the positive case outranks the negative one. I personally find this interpretation extremely useful when conveying what AUC is measuring to a non-technical audience.
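A quick way to convince ourselves of this interpretation is a small simulation (again on the synthetic setup used in the earlier sketches, which is an assumption on our part): sample random positive/negative pairs, count how often the positive case gets the higher score, and compare against `roc_auc_score`:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(0)
X, y = make_classification(n_samples=2000, weights=[0.84, 0.16], random_state=0)
y_score = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]

pos_scores = y_score[y == 1]
neg_scores = y_score[y == 0]

# sample random (positive, negative) pairs and check how often the positive ranks higher
pos_sample = rng.choice(pos_scores, size=100_000)
neg_sample = rng.choice(neg_scores, size=100_000)
print('pairwise win rate:', np.mean(pos_sample > neg_sample))
print('roc auc          :', roc_auc_score(y, y_score))
```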
Precision Recall Curve
Apart from the ROC curve, there is also the precision recall curve. Instead of plotting the true positive rate (a.k.a. recall) versus the false positive rate, we now plot precision versus recall.
$$\text{precision} = \frac{[\text{# positive data points with positive predictions}]}{[\text{# all data points with positive predictions}]} = \frac{TP}{TP + FP}$$

A classifier with high recall but low precision flags many positive results, but most of its predicted labels are incorrect when compared to the corresponding ground-truth labels. On the other hand, a classifier with high precision but low recall is just the opposite, returning very few results, but most of its predicted labels are correct when compared to the training labels. An ideal system with high precision and high recall will return many results, with all results labeled correctly.
The precision recall curve answers a fundamentally different question than the ROC curve. By definition, precision directly answers the question, "What is the probability that this is a real hit given that my classifier says it is?" Thus it is useful in practice for needle-in-a-haystack type problems, or problems where the "positive" class is more interesting than the negative class.
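For completeness, a short sketch of computing the precision recall curve with scikit-learn (synthetic data again; `average_precision_score` is one common single-number summary of this curve):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, average_precision_score

X, y = make_classification(n_samples=2000, weights=[0.84, 0.16], random_state=0)
y_score = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]

precision, recall, thresholds = precision_recall_curve(y, y_score)
ap = average_precision_score(y, y_score)

plt.plot(recall, precision, label=f'model (average precision = {ap:.3f})')
plt.xlabel('recall')
plt.ylabel('precision')
plt.legend()
plt.show()
```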
You can also think about it in the following way. ROC AUC looks at TPR and FPR, i.e. the entire confusion matrix, for all thresholds. On the other hand, precision recall AUC looks at precision and recall (TPR); it doesn't look at the true negative rate (TNR). Because of that, PR AUC can be a better choice when you care only about the "positive" class, while ROC AUC cares about both the "positive" and "negative" classes. Since PR AUC doesn't use TNR directly, it can also be better for highly imbalanced problems. You may want to take a look at this Blog: F1 Score vs ROC AUC vs Accuracy vs PR AUC: Which Evaluation Metric Should You Choose?.
Although the ROC curve is presumably the more popular choice when evaluating binary classifiers, it is highly recommended to use the precision recall curve as a supplement to ROC curves to get a full picture when evaluating and comparing classifiers. For more discussion on this topic, consider taking a look at another documentation. Notebook: Evaluating Imbalanced Datasets
Thresholding via Cost
Lots of real world binary classification problems need a threshold to convert the model's score into a business decision, i.e. all cases with a model score above the threshold get some sort of special treatment. For example:
Fraud Prevention: We're a social network company and we'd like to delete fake accounts. We build a classifier that assigns a "fraud score" between 0 and 1 to each account and, after some research, decide that all accounts whose score is above 0.9 should be sent to our fraud team, which will review each case and delete the accounts that are actually fake
Response/Propensity Modeling: We're at an enterprise software company and we'd like to improve our outbound sales program. We buy a large database of potential customers and build a classifier to predict which ones are likely to buy our product if contacted by our sales team. We decide that all customers with a "response score" above a threshold of 0.7 should get a call
Shopping Cart Abandonment: We're at an e-commerce company, and we'd like to email a 10% off coupon to users who abandoned their shopping carts and won't return organically. (We don't want to give a 10% off coupon to users who are going to return anyway, since then we'd be losing 10% of the sale price.) We build a classifier to predict which users will never return to their carts, and decide that all users with an "abandonment score" above 0.85 should get the 10% off coupon
Etc.
Thresholding is popular because of its simplicity and ease of implementation: We translate a continuous score to a binary yes/no decision, and act on it in a predetermined way. The biggest question for the thresholding pattern is: Where should I set the threshold point?
Up until this point, we've been using AUC to give us a single-number summary of classifier performance. This might be suitable in some circumstances, but for binary classifiers, evaluation metrics that take into account the actual costs of false positive and false negative errors may be much more appropriate than AUC. If we know these costs, we can use them not only to tie the evaluation metric more directly to the business value but also choose an appropriate final cutoff threshold for the classifier.
In real world applications, the costs that come along with making these two mistakes (false positive and false negative) are usually very different. Take our case for example: a false negative (FN) means an employee left our company but our model failed to detect that, while a false positive (FP) means an employee is still currently working at our company and our model told us that they would be leaving. The former mistake is a tragedy, since, well, the employee left and we didn't do anything about it! As for the latter mistake, we might be wasting roughly 20 minutes of an HR manager's time by arranging a face-to-face interview with an employee, asking how the company can do better to retain them, while they are perfectly fine with the current situation.
In the code cell below, we assign a cost of 100 to a false positive (FP) and 1000 to a false negative (FN). Given the cost associated with the two mistakes, we multiply them by the false negative and false positive rates at each threshold to figure out the best cutoff value. Note that the cost associated with each mistake can just be a back-of-the-envelope number, as long as we're sure about which one is more "expensive" than the other.
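A condensed sketch of this idea (on the synthetic data from the earlier sketches, with the costs above and a hypothetical `total_cost` helper; the notebook's own cell may differ): sweep candidate thresholds, weight the false negative and false positive rates by their costs, and pick the cheapest cutoff.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

FN_COST = 1000  # an employee leaves and we failed to act on it
FP_COST = 100   # an unnecessary retention conversation

X, y = make_classification(n_samples=2000, weights=[0.84, 0.16], random_state=0)
y_score = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]


def total_cost(y_true, y_score, threshold):
    """Cost-weighted false negative and false positive rates at a given cutoff."""
    y_pred = (y_score >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    fnr = fn / (fn + tp)  # false negative rate
    fpr = fp / (fp + tn)  # false positive rate
    return FN_COST * fnr + FP_COST * fpr


thresholds = np.linspace(0.01, 0.99, 99)
costs = [total_cost(y, y_score, t) for t in thresholds]
best_threshold = thresholds[int(np.argmin(costs))]
print(f'best cutoff: {best_threshold:.2f}, cost: {min(costs):.2f}')
```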
Just to hit the notion home: when executing on a project, if we are able to compute the expected cost of each mistake, consider optimizing for that directly (i.e. minimizing the total expected cost) instead of AUC or other general-purpose metrics.