GitHub Repository: YStrano/DataScience_GA
Path: blob/master/april_18/lessons/lesson-10/code/starter-code/starter-code-10.ipynb
Kernel: Python 2

Cost Benefit Questions

  1. How would you rephrase the business problem if your model were optimizing toward precision? That is, how might the model behave differently, and what effect would that have?

  2. How would you rephrase the business problem if your model were optimizing toward recall?

  3. What would the ideal model look like in this case?
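
To make questions 1 and 2 concrete, here is a minimal sketch of how the same predicted probabilities give different precision and recall depending on where the decision threshold sits. The eight flights, their probabilities, and the helper `precision_recall` are hypothetical and are not taken from the flight delays dataset.

```python
# Hypothetical predicted delay probabilities and true outcomes (1 = delayed).
# These eight flights are made up for illustration; they are not from the dataset.
probabilities = [0.95, 0.80, 0.70, 0.55, 0.45, 0.30, 0.20, 0.10]
actual = [1, 1, 0, 1, 0, 1, 0, 0]


def precision_recall(probabilities, actual, threshold):
    """Return (precision, recall) when flights at or above threshold are flagged."""
    predicted = [1 if p >= threshold else 0 for p in probabilities]
    tp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 1)
    fp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 0)
    fn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 1)
    precision = float(tp) / (tp + fp) if (tp + fp) else 0.0
    recall = float(tp) / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# A strict threshold flags few flights: every flag is correct, but half the delays are missed.
strict = precision_recall(probabilities, actual, 0.75)   # (1.0, 0.5)
# A lenient threshold flags many flights: every delay is caught, with more false alarms.
lenient = precision_recall(probabilities, actual, 0.25)  # (0.666..., 1.0)
print(strict, lenient)
```

In this toy setup, a model tuned toward precision only raises an alarm when it is nearly certain, while a model tuned toward recall accepts false alarms in order to miss fewer delays.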

Visualizing models over variables

%matplotlib inline
import pandas as pd
import sklearn.linear_model as lm
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('../../assets/dataset/flight_delays.csv')
df = df.loc[df.DEP_DEL15.notnull()].copy()
df.head()
df = df[df.DEP_DEL15.notnull()]
# one-hot encode the carrier and the day of week
df = df.join(pd.get_dummies(df['CARRIER'], prefix='carrier'))
df = df.join(pd.get_dummies(df['DAY_OF_WEEK'], prefix='dow'))

model = lm.LogisticRegression()
# start with the day-of-week dummy columns
features = [i for i in df.columns if 'dow_' in i]
df.shape
(458311, 37)
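features += ['CRS_DEP_TIME']
# features[1:] leaves out the first dummy column, which becomes the reference category
model.fit(df[features[1:]], df['DEP_DEL15'])
# predict_proba returns one column per class; .T[1] selects P(DEP_DEL15 = 1)
df['probability'] = model.predict_proba(df[features[1:]]).T[1]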
ax = plt.subplot(111)
colors = ['blue', 'green', 'red', 'purple', 'orange', 'brown']
for e, c in enumerate(colors):
    df[df[features[e]] == 1].plot(x='CRS_DEP_TIME', y='probability', kind='scatter', color=c, ax=ax)

ax.set(title='Probability of Delay\n Based on Day of Week and Time of Day')
[Figure: scatter plot, "Probability of Delay Based on Day of Week and Time of Day"]

Other Answers: visualizing Airline or the inverse

features = [i for i in df.columns if 'carrier_' in i]
features += ['CRS_DEP_TIME']
#...

Visualizing Performance Against Baseline

Visualizing AUC and comparing Models

from sklearn import dummy, metrics
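# baseline: a dummy classifier that ignores the input features
model0 = dummy.DummyClassifier()
model0.fit(df[features[1:]], df['DEP_DEL15'])
df['probability_0'] = model0.predict_proba(df[features[1:]]).T[1]

# model to compare against the baseline (fit model1 itself, not the earlier `model`)
model1 = lm.LogisticRegression()
model1.fit(df[features[1:]], df['DEP_DEL15'])
df['probability_1'] = model1.predict_proba(df[features[1:]]).T[1]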
df.shape
ax = plt.subplot(111)
# roc_curve returns (false positive rate, true positive rate, thresholds)
vals = metrics.roc_curve(df.DEP_DEL15, df.probability_0)
ax.plot(vals[0], vals[1])
vals = metrics.roc_curve(df.DEP_DEL15, df.probability_1)
ax.plot(vals[0], vals[1])

ax.set(title='Area Under the Curve for prediction delayed=1', ylabel='', xlabel='', xlim=(0, 1), ylim=(0, 1))
[Figure: ROC curves for the baseline and logistic regression models, "Area Under the Curve for prediction delayed=1"]
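
The plot compares the two curves by eye; the area under each curve turns that comparison into a single number. The sketch below is not the notebook's method. It uses a pairwise formulation instead of `roc_curve`: the area equals the chance that a randomly chosen delayed flight gets a higher score than a randomly chosen on-time flight, with ties counting half. The helper `auc`, the eight flights, and the constant-score baseline are hypothetical. On the real data, `metrics.roc_auc_score(df.DEP_DEL15, df.probability_1)` should give the same quantity.

```python
def auc(actual, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for a, s in zip(actual, scores) if a == 1]
    negatives = [s for a, s in zip(actual, scores) if a == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # the delayed flight is ranked above the on-time flight
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical outcomes (1 = delayed) and scores, made up for illustration.
actual = [1, 1, 0, 1, 0, 1, 0, 0]
model_scores = [0.95, 0.80, 0.70, 0.55, 0.45, 0.30, 0.20, 0.10]
# Assumed baseline: every flight gets the same score, so it cannot rank flights at all.
baseline_scores = [0.5] * len(actual)

model_auc = auc(actual, model_scores)        # 0.8125
baseline_auc = auc(actual, baseline_scores)  # 0.5
print(model_auc, baseline_auc)
```

A baseline that cannot rank flights lands at 0.5, which corresponds to the diagonal of the ROC plot, so a useful model should sit clearly above that.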

Visualizing Precision / Recall