GitHub Repository: YStrano/DataScience_GA
Path: blob/master/april_18/lessons/lesson-10/code/solution-code/solution-code-10.ipynb
Kernel: Python 2

Cost Benefit Questions

  1. How would you rephrase the business problem if your model was optimizing toward precision? i.e., How might the model behave differently, and what effect would it have?

  2. How would you rephrase the business problem if your model was optimizing toward recall?

  3. What would the ideal model look like in this case?

Answers:

  1. If this model were optimized toward precision, we'd be minimizing the number of false positives: users who are targeted in the campaign but are not retained.

  2. If this model were optimized toward recall, we'd be minimizing the number of false negatives, making sure that users who could have been retained actually were.

  3. At this point the model should be optimized toward recall, since that is where the largest business gain is.
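
The trade-off in the answers above can be sketched with a few lines of plain Python. This is only an illustration: the labels, the scores, and the helper names `precision_recall` and `predict` are made up for the example and do not come from the lesson's dataset. A strict threshold favors precision (few false positives), while a loose threshold favors recall (few false negatives).

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels coded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    precision = tp / float(tp + fp) if (tp + fp) else 0.0
    recall = tp / float(tp + fn) if (tp + fn) else 0.0
    return precision, recall


def predict(scores, threshold):
    """Turn predicted probabilities into 0/1 predictions at a cutoff."""
    return [1 if s >= threshold else 0 for s in scores]


# Hypothetical labels and predicted probabilities (not real data).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.2, 0.1, 0.1]

# A strict cutoff flags few cases: no false positives, but misses positives.
strict = precision_recall(y_true, predict(scores, 0.75))

# A loose cutoff flags many cases: catches every positive, with false alarms.
loose = precision_recall(y_true, predict(scores, 0.25))

print(strict)
print(loose)
```

With these made-up numbers, the strict cutoff gives perfect precision but only half the recall, and the loose cutoff gives perfect recall at a lower precision.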

Visualizing models over variables

%matplotlib inline
import pandas as pd
import sklearn.linear_model as lm
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('../../assets/dataset/flight_delays.csv')
# Keep only rows where the delay label is known.
df = df.loc[df.DEP_DEL15.notnull()].copy()
df.head()
df = df[df.DEP_DEL15.notnull()]
# One-hot encode carrier and day of week.
df = df.join(pd.get_dummies(df['CARRIER'], prefix='carrier'))
df = df.join(pd.get_dummies(df['DAY_OF_WEEK'], prefix='dow'))
model = lm.LogisticRegression()
features = [i for i in df.columns if 'dow_' in i]
df.shape
(458311, 37)
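features += ['CRS_DEP_TIME']
# features[1:] skips the first dow_ dummy, which then acts as the reference category.
model.fit(df[features[1:]], df['DEP_DEL15'])
# Column 1 of predict_proba is the probability of the positive class (delayed).
df['probability'] = model.predict_proba(df[features[1:]]).T[1]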
ax = plt.subplot(111)
colors = ['blue', 'green', 'red', 'purple', 'orange', 'brown']
# One scatter layer per color, i.e. for the first six columns in features.
for e, c in enumerate(colors):
    df[df[features[e]] == 1].plot(x='CRS_DEP_TIME', y='probability', kind='scatter', color=c, ax=ax)

ax.set(title='Probability of Delay\n Based on Day of Week and Time of Day')
Figure: scatter of predicted delay probability vs. CRS_DEP_TIME, colored by day of week.

Other Answers: visualizing Airline performance over time; visualizing the inverse

features = [i for i in df.columns if 'carrier_' in i]
features += ['CRS_DEP_TIME']
model = lm.LogisticRegression()
model.fit(df[features[1:]], df['DEP_DEL15'])
df['probability'] = model.predict_proba(df[features[1:]]).T[1]

ax = plt.subplot(111)
colors = ['blue', 'green', 'red', 'purple']
for e, c in enumerate(colors):
    df[df[features[e]] == 1].plot(x='CRS_DEP_TIME', y='probability', kind='scatter', color=c, ax=ax)

ax.set(title='Probability of Delay\n Based on Carrier and Time of Day')
Figure: scatter of predicted delay probability vs. CRS_DEP_TIME, colored by carrier.
features = [i for i in df.columns if 'carrier_' in i]
features += ['CRS_DEP_TIME']
model = lm.LogisticRegression()
model.fit(df[features[1:]], df['DEP_DEL15'])
# .T[0] is the probability of class 0 (not delayed): the inverse of the delay probability.
df['probability'] = model.predict_proba(df[features[1:]]).T[0]

ax = plt.subplot(111)
colors = ['blue', 'green', 'red', 'purple']
for e, c in enumerate(colors):
    df[df[features[e]] == 1].plot(x='CRS_DEP_TIME', y='probability', kind='scatter', color=c, ax=ax)

ax.set(title='Probability of No Delay\n Based on Carrier and Time of Day')
Figure: scatter of predicted on-time probability (class 0) vs. CRS_DEP_TIME, colored by carrier.

Visualizing Performance Against Baseline

Visualizing AUC and comparing Models

from sklearn import dummy, metrics
# Baseline: a dummy classifier that ignores the features.
model0 = dummy.DummyClassifier()
model0.fit(df[features[1:]], df['DEP_DEL15'])
df['probability_0'] = model0.predict_proba(df[features[1:]]).T[1]

# Candidate: logistic regression on the same features.
model1 = lm.LogisticRegression()
model1.fit(df[features[1:]], df['DEP_DEL15'])
df['probability_1'] = model1.predict_proba(df[features[1:]]).T[1]
df.shape
ax = plt.subplot(111)
# roc_curve returns (false positive rate, true positive rate, thresholds).
vals = metrics.roc_curve(df.DEP_DEL15, df.probability_0)
ax.plot(vals[0], vals[1])
vals = metrics.roc_curve(df.DEP_DEL15, df.probability_1)
ax.plot(vals[0], vals[1])

ax.set(title='ROC Curve for prediction delayed=1', ylabel='True Positive Rate', xlabel='False Positive Rate', xlim=(0, 1), ylim=(0, 1))
Figure: ROC curves for the dummy baseline (probability_0) and the logistic regression (probability_1).
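
The cell above plots the ROC curves but never reduces them to a single number. As a rough sketch of what the area under the curve measures, the snippet below computes AUC in plain Python from its ranking interpretation: the probability that a randomly chosen positive gets a higher score than a randomly chosen negative, with ties counting as half. The helper `roc_auc` and the toy labels and scores are hypothetical and are not taken from the flight data. In scikit-learn, `metrics.roc_auc_score` computes this quantity.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct ranking. This pairwise loop is O(n^2),
    so it is only meant for small illustrative inputs.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels and scores (not real data).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
model_scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.2, 0.1, 0.1]

# A constant score stands in for a baseline that ignores the features.
baseline_scores = [0.4] * len(y_true)

model_auc = roc_auc(y_true, model_scores)
baseline_auc = roc_auc(y_true, baseline_scores)
print(model_auc, baseline_auc)
```

A baseline that carries no ranking information lands at 0.5, so a model is only useful to the extent that its AUC sits above that.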

Visualizing Precision / Recall (with cleaner code)

model0 = dummy.DummyClassifier()
model0.fit(df[features[1:]], df.DEP_DEL15)
df['probability_0'] = model0.predict_proba(df[features[1:]]).T[1]

model = lm.LogisticRegression()
model.fit(df[features[1:]], df.DEP_DEL15)
df['probability_1'] = model.predict_proba(df[features[1:]]).T[1]
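ax = plt.subplot(111)
# precision_recall_curve returns (precision, recall, thresholds),
# so recall goes on the x-axis and precision on the y-axis.
for i in range(2):
    vals = metrics.precision_recall_curve(df.DEP_DEL15, df['probability_' + str(i)])
    ax.plot(vals[1], vals[0])

ax.set(title='Precision Recall Curve for prediction delayed=1', ylabel='Precision', xlabel='Recall', xlim=(0, 1), ylim=(0, 1))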
Figure: precision-recall curves for the dummy baseline (probability_0) and the logistic regression (probability_1).