CoCalc -- solution-code-10.ipynb

GitHub Repository: YStrano/DataScience_GA
Path: blob/master/april_18/lessons/lesson-10/code/solution-code/solution-code-10.ipynb

Kernel: Python 2

Cost Benefit Questions

How would you rephrase the business problem if your model were optimized toward precision? That is, how might the model behave differently, and what effect would that have?
How would you rephrase the business problem if your model were optimized toward recall?
What would the most ideal model look like in this case?

Answers:

If this model were optimized toward precision, we'd be minimizing the number of false positives: users who are targeted in the campaign but are not retained.
If this model were optimized toward recall, we'd be minimizing the number of false negatives, that is, making sure that users who could have been retained actually were.
The ideal model, at this point, would be optimized toward recall, since that is where the largest business gain lies.
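To make the two definitions concrete, here is a minimal sketch that computes both metrics from confusion-matrix counts. The function name `precision_recall` and the campaign counts are hypothetical, chosen only for illustration; they do not come from the lesson's data.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from confusion-matrix counts."""
    # Precision: of the users we targeted, what share was retained?
    precision = tp / float(tp + fp) if (tp + fp) else 0.0
    # Recall: of the users who could be retained, what share did we target?
    recall = tp / float(tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical campaign: 50 users targeted, 40 of them retained (tp),
# 10 targeted but lost (fp), and 60 retainable users never targeted (fn).
precision, recall = precision_recall(tp=40, fp=10, fn=60)
print("precision=%.2f recall=%.2f" % (precision, recall))
```

In this made-up case the campaign is precise (few wasted offers) but has low recall (many retainable users missed), which is the situation the third answer argues against.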

Visualizing models over variables

In [2]:

%matplotlib inline

In [3]:

import pandas as pd
import sklearn.linear_model as lm
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('../../assets/dataset/flight_delays.csv')
df = df.loc[df.DEP_DEL15.notnull()].copy()

In [4]:

df.head()

Out[4]:

In [5]:

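# Rows with a missing DEP_DEL15 were already dropped when the data was loaded.
# One-hot encode the carrier and the day of week.
df = df.join(pd.get_dummies(df['CARRIER'], prefix='carrier'))
df = df.join(pd.get_dummies(df['DAY_OF_WEEK'], prefix='dow'))
model = lm.LogisticRegression()
# Start with the day-of-week indicator columns only.
features = [i for i in df.columns if 'dow_' in i]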

In [6]:

df.shape

Out[6]:

(458311, 37)

In [10]:

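features += ['CRS_DEP_TIME']
# features[1:] skips the first day-of-week dummy, which makes that day the reference level.
model.fit(df[features[1:]], df['DEP_DEL15'])

# Column 1 of predict_proba is the probability of class 1 (delayed).
df['probability'] = model.predict_proba(df[features[1:]]).T[1]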
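The cell above fits on `features[1:]`, which leaves out the first dummy column. A common reason for doing this is that a full set of indicator columns always sums to 1, so one of them is redundant once the model has an intercept. The sketch below shows the idea without pandas; the helper `one_hot` and its `drop_first` argument are hypothetical names for illustration, not part of the notebook.

```python
def one_hot(values, prefix, drop_first=True):
    """Encode a list of categories as rows of 0/1 indicator columns."""
    levels = sorted(set(values))
    if drop_first:
        # The dropped level becomes the baseline: it is encoded as all zeros.
        levels = levels[1:]
    columns = ['%s_%s' % (prefix, level) for level in levels]
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return columns, rows

# Hypothetical day-of-week values for four flights.
columns, rows = one_hot([1, 2, 3, 1], prefix='dow')
print(columns)
print(rows)
```

With the first level dropped, each remaining coefficient is read relative to the baseline day rather than on its own.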

In [11]:

ax = plt.subplot(111)
colors = ['blue', 'green', 'red', 'purple', 'orange', 'brown']
for e, c in enumerate(colors):
    df[df[features[e]] == 1].plot(x='CRS_DEP_TIME', y='probability', kind='scatter', color = c, ax=ax)

ax.set(title='Probability of Delay\n Based on Day of Week and Time of Day')

Out[11]:

[<matplotlib.text.Text at 0x108e4ee90>]

Other Answers: visualizing Airline performance over time; visualizing the inverse

In [12]:

features = [i for i in df.columns if 'carrier_' in i]
features += ['CRS_DEP_TIME']
model = lm.LogisticRegression()
model.fit(df[features[1:]], df['DEP_DEL15'])

df['probability'] = model.predict_proba(df[features[1:]]).T[1]

ax = plt.subplot(111)
colors = ['blue', 'green', 'red', 'purple']
for e, c in enumerate(colors):
    df[df[features[e]] == 1].plot(x='CRS_DEP_TIME', y='probability', kind='scatter', color = c, ax=ax)

    
ax.set(title='Probability of Delay\n Based on Carrier and Time of Day')

Out[12]:

[<matplotlib.text.Text at 0x109f0ca90>]

In [15]:

features = [i for i in df.columns if 'carrier_' in i]
features += ['CRS_DEP_TIME']
model = lm.LogisticRegression()
model.fit(df[features[1:]], df['DEP_DEL15'])

# Column 0 is the probability of class 0 (not delayed), the inverse of the previous plot.
df['probability'] = model.predict_proba(df[features[1:]]).T[0]

ax = plt.subplot(111)
colors = ['blue', 'green', 'red', 'purple']
for e, c in enumerate(colors):
    df[df[features[e]] == 1].plot(x='CRS_DEP_TIME', y='probability', kind='scatter', color = c, ax=ax)

    
ax.set(title='Probability of No Delay\n Based on Carrier and Time of Day')

Out[15]:

[<matplotlib.text.Text at 0x10a68bcd0>]
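The only difference between the last two cells is the column taken from `predict_proba`: scikit-learn orders those columns by `model.classes_`, so for a 0/1 target column 0 is the probability of no delay and column 1 the probability of delay. For a binary logistic model the two always sum to 1. A minimal sketch, using made-up coefficients (`intercept` and `coef` are hypothetical values, not the fitted ones from this notebook):

```python
import math

def sigmoid(z):
    """Logistic function: maps a linear score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients for a one-feature model on CRS_DEP_TIME.
intercept, coef = -2.0, 0.001
dep_time = 1800  # a scheduled 18:00 departure

p_delay = sigmoid(intercept + coef * dep_time)  # like predict_proba(...).T[1]
p_no_delay = 1.0 - p_delay                      # like predict_proba(...).T[0]
print("P(delay)=%.3f P(no delay)=%.3f" % (p_delay, p_no_delay))
```

Plotting the inverse therefore carries no new information; it mirrors the delay plot around 0.5.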

Visualizing Performance Against Baseline

Visualizing AUC and comparing Models

In [16]:

from sklearn import dummy, metrics

In [20]:

model0 = dummy.DummyClassifier()
model0.fit(df[features[1:]], df['DEP_DEL15'])
df['probability_0'] = model0.predict_proba(df[features[1:]]).T[1]

model1 = lm.LogisticRegression()
model1.fit(df[features[1:]], df['DEP_DEL15'])
df['probability_1'] = model1.predict_proba(df[features[1:]]).T[1]

In [ ]:

df.shape

In [21]:

ax = plt.subplot(111)
vals = metrics.roc_curve(df.DEP_DEL15, df.probability_0)
ax.plot(vals[0], vals[1])
vals = metrics.roc_curve(df.DEP_DEL15, df.probability_1)
ax.plot(vals[0], vals[1])

ax.set(title='ROC Curve for prediction delayed=1', ylabel='True Positive Rate', xlabel='False Positive Rate', xlim=(0, 1), ylim=(0, 1))

Out[21]:

[<matplotlib.text.Text at 0x10b8d5650>,
 <matplotlib.text.Text at 0x108ea2250>,
 (0, 1),
 (0, 1),
 <matplotlib.text.Text at 0x109140250>]
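The plot compares the two curves by eye; the area under each curve puts a single number on the comparison. One way to read AUC is as the share of (delayed, not delayed) pairs in which the delayed flight gets the higher score. The sketch below computes it that way in plain Python on toy data. The function `pairwise_auc` and the scores are hypothetical, and the quadratic loop is only meant for small examples, not for the full flight dataset.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
baseline_scores = [0.5, 0.5, 0.5, 0.5]   # an uninformative baseline
model_scores = [0.1, 0.4, 0.35, 0.8]     # hypothetical model output

print("baseline AUC=%.2f" % pairwise_auc(labels, baseline_scores))
print("model AUC=%.2f" % pairwise_auc(labels, model_scores))
```

A baseline that cannot separate the classes lands at 0.5, so a model is only useful to the extent its AUC rises above that.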

Visualizing Precision / Recall (with cleaner code)

In [22]:


model0 = dummy.DummyClassifier()
model0.fit(df[features[1:]], df.DEP_DEL15)
df['probability_0'] = model0.predict_proba(df[features[1:]]).T[1]


model = lm.LogisticRegression()
model.fit(df[features[1:]], df.DEP_DEL15)
df['probability_1'] = model.predict_proba(df[features[1:]]).T[1]

In [23]:

ax = plt.subplot(111)
for i in range(2):
    vals = metrics.precision_recall_curve(df.DEP_DEL15, df['probability_' + str(i)])
    ax.plot(vals[1], vals[0])

ax.set(title='Precision Recall Curve for prediction delayed=1', ylabel='Precision', xlabel='Recall', xlim=(0, 1), ylim=(0, 1))

Out[23]:

[<matplotlib.text.Text at 0x10a306e90>,
 <matplotlib.text.Text at 0x108ef1650>,
 (0, 1),
 (0, 1),
 <matplotlib.text.Text at 0x10b26d690>]
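Each point on a precision-recall curve corresponds to one decision threshold applied to the predicted probabilities. The sketch below shows, on toy data, how moving the threshold trades one metric for the other. The function `precision_recall_at` and the scores are hypothetical; treating precision as 1.0 when nothing is predicted positive is a convention assumed here.

```python
def precision_recall_at(labels, scores, threshold):
    """Precision and recall when scores >= threshold are predicted as delayed."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, predicted) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predicted) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predicted) if y == 1 and p == 0)
    # Assumed convention: with no positive predictions, precision is 1.0.
    precision = tp / float(tp + fp) if (tp + fp) else 1.0
    recall = tp / float(tp + fn) if (tp + fn) else 0.0
    return precision, recall

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]  # hypothetical predicted probabilities

for threshold in (0.3, 0.5):
    p, r = precision_recall_at(labels, scores, threshold)
    print("threshold=%.1f precision=%.2f recall=%.2f" % (threshold, p, r))
```

Lowering the threshold catches every delayed flight at the cost of a false alarm; raising it removes the false alarm but misses a delay.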
