CoCalc -- starter-code-8.ipynb

GitHub Repository: YStrano/DataScience_GA
Path: blob/master/lessons/lesson_08/code/starter-code/starter-code-8.ipynb

Kernel: Python 3

Guided Practice: Logit Function and Odds

In [1]:

def logit_func(odds):
    # takes a float (odds) and returns the log odds (logit)
    return None

def sigmoid_func(logit):
    # takes a float (logit) and returns the probability
    return None

odds_set = [
    5./1,
    20./1,
    1.1/1,
    1.8/1,
    1.6/1
]
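
One possible way to fill in the two stubs, offered as a minimal sketch rather than the lesson's official solution: the logit is the natural log of the odds, and the sigmoid inverts it to recover a probability. The loop reuses the values from odds_set.

```python
import math

def logit_func(odds):
    # log odds (logit) of a positive odds value
    return math.log(odds)

def sigmoid_func(logit):
    # probability corresponding to a logit: 1 / (1 + e^-logit)
    return 1.0 / (1.0 + math.exp(-logit))

odds_set = [5./1, 20./1, 1.1/1, 1.8/1, 1.6/1]

for odds in odds_set:
    logit = logit_func(odds)
    prob = sigmoid_func(logit)
    # sigmoid(log(odds)) should equal odds / (1 + odds)
    print(odds, logit, prob)
```

Odds of 1 (an even chance) map to a logit of 0 and a probability of 0.5, which is a quick sanity check for any implementation.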

In [2]:

import pandas as pd
from sklearn.linear_model import LogisticRegression

In [5]:

lm = LogisticRegression()

df = pd.read_csv('../../assets/dataset/collegeadmissions.csv')

In [6]:

df.head()

Out[6]:

In [9]:

df = df.join(pd.get_dummies(df['rank']))

In [10]:

df.head()

Out[10]:

In [11]:

lm.fit(df[['gre', 'gpa', 1, 2, 3]], df['admit'])

Out[11]:

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [12]:

import numpy as np

In [14]:

print(lm.coef_)
print(lm.intercept_)
print(df.admit.mean())

Out[14]:

[[ 1.63913356e-03  4.33354702e-04  1.15220976e+00  5.14395667e-01
  -3.62326169e-02]]
[-2.09315184]
0.3175
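
To connect these numbers back to odds: exponentiating a logistic regression coefficient gives the multiplicative change in the odds for a one-unit increase in that feature. The sketch below uses the values printed above, copied by hand, and assumes the coefficient order matches the columns passed to fit (gre, gpa, 1, 2, 3). The labels 'rank 1' to 'rank 3' are names chosen here for readability.

```python
import math

# Values copied from the output above; order assumed to match
# the columns passed to lm.fit: gre, gpa, 1, 2, 3.
features = ['gre', 'gpa', 'rank 1', 'rank 2', 'rank 3']
coefs = [1.63913356e-03, 4.33354702e-04, 1.15220976e+00,
         5.14395667e-01, -3.62326169e-02]
intercept = -2.09315184

# exp(coefficient) = odds ratio for a one-unit increase in the feature
odds_ratios = {name: math.exp(c) for name, c in zip(features, coefs)}

# Probability implied by the intercept alone (all features set to 0)
baseline_prob = 1.0 / (1.0 + math.exp(-intercept))

for name, ratio in odds_ratios.items():
    print(name, round(ratio, 3))
print('baseline', round(baseline_prob, 3))
```

An odds ratio above 1 means the feature raises the odds of admission, and one below 1 lowers them. The intercept-only probability corresponds to gre = 0 and gpa = 0, which is unlikely to occur in the data, so treat it as a reference point only.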

Below is some code to walk through ROC curves and AUC, which are built from the confusion matrix rates (TPR and FPR). It'll be useful for working through the Titanic problem.

In [16]:

%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

The first chart below is an ROC curve computed across a range of thresholds. Where the false positive rate (x-axis) is close to 0, the true positive rate (y-axis) is also close to 0; likewise, at the top right of the figure both rates approach 1.

The second chart, which does not vary the threshold, shows the single TPR/FPR point produced by the model's class predictions, joined to (0, 0) and (1, 1).

The first chart will be more useful as you compare models and decide where the decision threshold should sit for the data. The second simplifies the first in case the idea of thresholds is confusing.
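
The difference between the two charts can be sketched without any plotting. The example below uses made-up labels and scores (not the admissions data) and a hypothetical helper, fpr_tpr, to compute one (FPR, TPR) point per threshold, and then the single point you get from a fixed 0.5 cutoff.

```python
def fpr_tpr(y_true, scores, threshold):
    """Return (FPR, TPR) when predicting positive for scores >= threshold."""
    tp = fp = fn = tn = 0
    for y, s in zip(y_true, scores):
        predicted = 1 if s >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 1:
            fn += 1
        else:
            tn += 1
    return fp / (fp + tn), tp / (tp + fn)

# Made-up labels and predicted probabilities, for illustration only.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]

# First chart: one (FPR, TPR) point per threshold, from strict to lenient.
roc_points = [fpr_tpr(y_true, scores, t) for t in (1.1, 0.75, 0.5, 0.25, 0.0)]

# Second chart: a single point at a 0.5 cutoff, joined to the two corners.
single_point_curve = [(0.0, 0.0), fpr_tpr(y_true, scores, 0.5), (1.0, 1.0)]

print(roc_points)
print(single_point_curve)
```

A threshold above every score predicts no positives, giving the (0, 0) corner; a threshold of 0 predicts everything positive, giving (1, 1).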

In [17]:

feature_set = df[['gre', 'gpa', 1, 2, 3]]  # same columns used to fit lm
actuals = lm.predict(feature_set)  # predicted class labels, despite the name
probas = lm.predict_proba(feature_set)
fpr, tpr, thresholds = roc_curve(df['admit'], probas[:, 1])
plt.plot(fpr, tpr)

In [71]:

fpr, tpr, _ = roc_curve(df['admit'], actuals)
plt.plot(fpr, tpr)

Out[71]:

[<matplotlib.lines.Line2D at 0x10c1fe7d0>]

Finally, you can use the roc_auc_score function to calculate the area under these curves (AUC).

In [72]:

roc_auc_score(df['admit'], lm.predict(feature_set))

Out[72]:

0.55914164575581893
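
For intuition about what an AUC measures, the area under a piecewise-linear ROC curve can be computed with the trapezoid rule. The sketch below uses a hypothetical helper, trapezoid_auc, and a made-up single-point curve (not the admissions results). For a curve with one interior point, the area works out to (TPR + 1 - FPR) / 2.

```python
def trapezoid_auc(points):
    """Area under a piecewise-linear curve given as (x, y) points."""
    pts = sorted(points)  # order by x (FPR)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        # area of the trapezoid between two neighbouring points
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Made-up curve: a single (FPR, TPR) point joined to (0, 0) and (1, 1).
fpr, tpr = 0.25, 0.75
single_point_curve = [(0.0, 0.0), (fpr, tpr), (1.0, 1.0)]

auc = trapezoid_auc(single_point_curve)
shortcut = (tpr + 1.0 - fpr) / 2.0

print(auc, shortcut)
```

The diagonal from (0, 0) to (1, 1) has an area of 0.5, which is why 0.5 is the usual baseline for a model that does no better than chance.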

Titanic Problem

Goals

Spend a few minutes determining which data would be most important to use in the prediction problem. You may need to create new features based on the data available. Consider using a feature selection aid in sklearn. At a minimum, identify one or two strong features that would be useful to include in the model.
Spend 1-2 minutes considering which metric makes the most sense to optimize. Accuracy? FPR or TPR? AUC? Given the business problem (understanding survival rate aboard the Titanic), why should you use this metric?
Build a tuned logistic model. Be prepared to explain your design (including regularization), metric, and feature set in predicting survival, using the tools necessary (such as a fit chart).
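
One way to approach the tuning goal is a cross-validated grid search over the regularization strength C. This is a sketch under stated assumptions, not the expected solution: it uses synthetic data from make_classification as a stand-in for the Titanic features, and both the grid of C values and the choice of roc_auc as the metric are arbitrary starting points.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; replace X and y with your Titanic
# feature columns and survival labels.
X, y = make_classification(n_samples=300, n_features=5,
                           n_informative=3, random_state=0)

# Smaller C means stronger regularization; this grid is an arbitrary start.
param_grid = {'C': [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(
    LogisticRegression(solver='liblinear'),
    param_grid,
    scoring='roc_auc',  # swap for the metric you decided to optimize
    cv=5,
)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

The fitted search exposes the winning setting in best_params_ and its mean cross-validated score in best_score_, which gives you something concrete to point to when explaining the design.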

In [ ]:
