GitHub Repository: YStrano/DataScience_GA
Path: blob/master/lessons/lesson_10-sub-Jacob_Koehler/01-Classification-Review - done.ipynb
Kernel: Python 3
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from sklearn.linear_model import LogisticRegression
from sklearn.dummy import DummyClassifier
from sklearn.metrics import classification_report
from sklearn.datasets import load_digits
digits = load_digits()
X, y = digits.data, digits.target  # sklearn datasets bundle .data, .target, and a description
plt.imshow(X[10].reshape(8, 8))
<matplotlib.image.AxesImage at 0x1164949b0>
[Image output: the 8×8 digit image at index 10]

Accuracy with Imbalanced Classes

y
array([0, 1, 2, ..., 8, 9, 8])
print(digits.target_names, "\n", np.bincount(y))
[0 1 2 3 4 5 6 7 8 9] [178 182 177 183 181 182 181 179 174 180]
y_imbalanced = []
for num in y:
    if num == 0:
        y_imbalanced.append(0)
    else:
        y_imbalanced.append(1)
np.bincount(y_imbalanced)
array([ 178, 1619])

Dummy Classifier

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y_imbalanced)
dummy_maj = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
dummy_maj.score(X_test, y_test)
0.9
dummy_maj.predict(X_test)
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
np.bincount(dummy_maj.predict(X_test))
array([ 0, 450])
  • stratified: based on training distribution

  • uniform: uniformly random predictions

  • constant: always predicts constant label
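As a rough sketch, these strategies can be compared on the same split; 'stratified' and 'uniform' make random predictions, so their scores vary from run to run, and 'constant' additionally needs a constant= label:

for strategy in ['most_frequent', 'stratified', 'uniform']:
    clf = DummyClassifier(strategy=strategy).fit(X_train, y_train)
    print(strategy, clf.score(X_test, y_test))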

Comparing our Logistic Model

lgr = LogisticRegression()  # instantiate logreg model
# lgr_imbalanced = LogisticRegression(class_weight='balanced')
lgr.fit(X_train, y_train)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1, penalty='l2', random_state=None, solver='liblinear', tol=0.0001, verbose=0, warm_start=False)
#lgr_imbalanced.fit(X_train, y_train)
lgr.score(X_test, y_test)  # accuracy on the held-out test set (a perfect score reflects how easily digit 0 separates from the rest, not overfitting)
1.0
#lgr_imbalanced.score(X_test, y_test)

We built a model! Now what?

Imagine user responses to some of the following statements:

  1. The predictive model I built has an accuracy of 80%.

  2. The logistic regression was optimized with L2 regularization, so you know it's good.

  3. Gender was more important than age in the predictive model because it had a larger coefficient.

  4. Here's the AUC chart that shows how well the model did.

How might your stakeholders respond? How would you respond back?

In a business setting, you are often the only person who can interpret what you've built. While some people may be familiar with basic data visualizations, by and large you will need to do a lot of "hand holding," especially if your team has never worked with data scientists before.

We'll focus this discussion around "simpler" problems (e.g. binary classification), but these tips apply to any type of model you might be working with.

First, let's review some of the knowledge we've developed about classification metrics, add some more, and then talk about how you can communicate your results.

Review: Back to the Confusion Matrix

Let's review the confusion matrix:

Confusion matrices, for a binary classification problem, let us interpret the correct and incorrect predictions for each class label. Remember, the confusion matrix is the starting point for the majority of classification metrics, and it gives our predictions deeper meaning than an accuracy score alone.
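As a quick reference for the calculations below, the standard 2×2 layout puts the actual labels on the rows and the predicted labels on the columns:

|                 | predicted_positive  | predicted_negative  |
|-----------------|---------------------|---------------------|
| is_positive     | true positive (TP)  | false negative (FN) |
| is_negative     | false positive (FP) | true negative (TN)  |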

Recall: How do we calculate the following metrics?

  1. Accuracy

  2. True Positive Rate

  3. False Positive Rate
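In terms of the TP, FP, FN, and TN counts in the table above:

  • Accuracy = (TP + TN) / (TP + TN + FP + FN)

  • True Positive Rate (TPR) = TP / (TP + FN)

  • False Positive Rate (FPR) = FP / (FP + TN)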

Intro: Precision and Recall

The metrics we have used so far were designed primarily for more balanced problems: we could be interested in either outcome, so it was important to generalize our approach.

Precision and Recall are additional metrics built off the confusion matrix, focusing on information retrieval, particularly when one class label is more interesting than another.

With precision, we're interested in how much of what the model returns is relevant rather than irrelevant. With recall, we're interested in seeing how well a model returns specific data (literally, checking whether the model can recall what a class label looked like).

Recall (pun not intended): If the goal of the "recall" metric is to identify specific values of a class correctly, what other metric performs a similar calculation?

Answer: TPR is the same calculation!

Breaking It Down With Math

In fact, True Positive Rate and Recall are one and the same: true positives divided by the count of all actual positives. Another term used for this on labeled AUC figures is sensitivity. These terms all share the same calculation: the count of correctly predicted positives over the total count of that class label.

Imagine predicting whether each marble is green or red, with 10 marbles of each color. If the model identifies 8 of the green marbles as green, the recall, or sensitivity, for green is 0.8. However, this says nothing about how many red marbles were also identified as green.

Precision, or the positive predictive value, is calculated as the count of true positives over the count of all values predicted to be positive. Precision focuses on relevancy.

Using the same example: if the model predicts 8 of the green marbles as green, then the precision for green would be 1, because every marble predicted as green was in fact green. The precision for red (assuming all red marbles were predicted correctly, and the 2 missed green marbles were predicted as red) would be roughly 0.833: 10 / (10 + 2).
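Written in terms of the confusion-matrix counts: recall for green = TP / (TP + FN) = 8 / (8 + 2) = 0.8, and precision for green = TP / (TP + FP) = 8 / (8 + 0) = 1.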

Check: What would the precision and recall be for the following confusion matrix (with "green" being "true")?

|              | predicted_green | predicted_not_green |
|--------------|-----------------|---------------------|
| is_green     | 13              | 7                   |
| is_not_green | 8               | 12                  |
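One way to check the arithmetic is a quick calculation like the sketch below, treating green as the positive class (the variable names are just illustrative):

TP, FN = 13, 7    # is_green row: predicted_green, predicted_not_green
FP, TN = 8, 12    # is_not_green row: predicted_green, predicted_not_green

precision = TP / (TP + FP)   # 13 / 21, about 0.62
recall = TP / (TP + FN)      # 13 / 20, exactly 0.65
print(precision, recall)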

The key difference between the two is which kind of error is more costly: should our model be pickier about avoiding false positives (precision), or pickier about avoiding false negatives (recall)?

The answer should be determined by the problem you're trying to solve.

Comparing Accuracy, Precision, and Recall

from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, dummy_maj.predict(X_test))
array([[ 0, 45], [ 0, 405]])
confusion_matrix(y_test, lgr.predict(X_test))
array([[ 45, 0], [ 0, 405]])
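To put precision and recall numbers next to these matrices, one option is sklearn's precision_score and recall_score (a quick sketch; exact values depend on the random train/test split, and by default pos_label=1, the majority class here, so pass pos_label=0 to score the minority class instead):

from sklearn.metrics import precision_score, recall_score

for name, model in [("dummy (most_frequent)", dummy_maj), ("logistic regression", lgr)]:
    preds = model.predict(X_test)
    print(name, "precision:", precision_score(y_test, preds), "recall:", recall_score(y_test, preds))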
import seaborn as sns
plt.figure(figsize=(14, 6))
plt.subplot(121)
sns.heatmap(confusion_matrix(y_test, lgr.predict(X_test)), cmap="viridis")
plt.title("Logistic Regressor")
plt.subplot(122)
sns.heatmap(confusion_matrix(y_test, dummy_maj.predict(X_test)), cmap="viridis")
plt.title("Dummy Classifier");
[Image output: side-by-side confusion-matrix heatmaps titled "Logistic Regressor" and "Dummy Classifier"]
dummy_stratified = DummyClassifier(strategy='stratified')  # there are other strategies you can use
dummy_stratified.fit(X_train, y_train)
DummyClassifier(constant=None, random_state=None, strategy='stratified')
confusion_matrix(y_test, dummy_stratified.predict(X_test))
array([[ 2, 43], [ 32, 373]])
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(max_depth=2).fit(X_train, y_train)
confusion_matrix(y_test, dt.predict(X_test))
array([[ 42, 3], [ 3, 402]])
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
confusion_matrix(y_test, knn.predict(X_test))
array([[ 45, 0], [ 0, 405]])
from sklearn.metrics import classification_report
print(classification_report(y_test, dt.predict(X_test)))
             precision    recall  f1-score   support

          0       0.93      0.93      0.93        45
          1       0.99      0.99      0.99       405

avg / total       0.99      0.99      0.99       450
# The f1-score is high because both precision and recall are high.
# Support is the number of true instances of each class in the test set.
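For reference, the f1-score column is the harmonic mean of the precision and recall columns, f1 = 2 * (precision * recall) / (precision + recall); for class 0 above that works out to 2 * (0.93 * 0.93) / (0.93 + 0.93) = 0.93.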