GitHub Repository: YStrano/DataScience_GA
Path: blob/master/lessons/lesson_08/code/solution-code/solution-code-8.ipynb
Kernel: Python [Root]

Guided Practice: Logit Function and Odds

import pandas as pd
import numpy as np
def logit_func(odds):
    # takes a float (odds) and returns the log odds (logit)
    return np.log(odds)

def sigmoid_func(logit):
    # takes a float (logit) and returns the probability
    return 1. / (1 + np.exp(-logit))

odds_set = [5./1, 20./1, 1.1/1, 1.8/1, 1.6/1]
for odds in odds_set:
    print sigmoid_func(logit_func(odds))
0.833333333333
0.952380952381
0.52380952381
0.642857142857
0.615384615385
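
Composing the two functions recovers the identity sigmoid(logit(odds)) = odds / (1 + odds), so odds of 5:1 correspond to a probability of 5/6 ≈ 0.833, matching the first value above. A quick check (not part of the original notebook):

# Composing the two functions gives odds / (1 + odds) directly
for odds in odds_set:
    assert abs(sigmoid_func(logit_func(odds)) - odds / (1. + odds)) < 1e-12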
import pandas as pd

# Statsmodels logistic regression is sm.Logit
import statsmodels.api as sm
# Read in the data
df = pd.read_csv('../../assets/dataset/collegeadmissions.csv')
df.head()
# One-hot encode school rank (1-4) into indicator columns named 1, 2, 3, 4
df = df.join(pd.get_dummies(df['rank']))
df.head()
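
The model below keeps only the rank-1, rank-2, and rank-3 indicator columns, so rank 4 serves as the baseline; including all four alongside the constant would make the design matrix perfectly collinear. An equivalent, slightly more explicit construction (a sketch, not from the original notebook, and assuming a pandas version recent enough to support drop_first) would be:

# Hypothetical alternative: let pandas drop one level for us
rank_dummies = pd.get_dummies(df['rank'], prefix='rank', drop_first=True)
# rank_dummies has columns rank_2, rank_3, rank_4, so rank 1 becomes the baseline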
X = df[['gre', 'gpa', 1, 2, 3]]
X = sm.add_constant(X)
y = df['admit']
lm = sm.Logit(y, X)
result = lm.fit()
result.summary()
Optimization terminated successfully.
         Current function value: 0.573147
         Iterations 6
print df.admit.mean()
0.3175
# Convert the log-odds coefficients into odds ratios using numpy.exp()
print np.exp(result.params)
const    0.003921
gre      1.002267
gpa      2.234545
1        4.718371
2        2.401325
3        1.235233
dtype: float64

The exponentiated coefficients are odds ratios: a one-unit increase in GPA multiplies the odds of admission by roughly 2.23, and, relative to the omitted rank-4 baseline, attending a rank-1 school multiplies the odds by roughly 4.7. In other words, as a school's rank moves toward 4, the odds of admission fall. The accuracy of the model with all of these features (one rank dummy removed) is ~71%, computed below.
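
To make the odds ratios concrete, the fitted model can score a single applicant. The values below are made up purely for illustration and are not from the original notebook:

# Hypothetical applicant: GRE 700, GPA 3.7, from a rank-2 school
applicant = pd.DataFrame({'const': [1.0], 'gre': [700], 'gpa': [3.7],
                          1: [0], 2: [1], 3: [0]})
applicant = applicant[X.columns]    # match the column order of X
print result.predict(applicant)    # predicted probability of admission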

predicted = result.predict(X)
threshold = 0.5
predicted_classes = (predicted > threshold).astype(int)

from sklearn.metrics import accuracy_score
accuracy_score(y, predicted_classes)
0.70999999999999996
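
For context, df.admit.mean() above shows that only about 31.75% of applicants were admitted, so always predicting "not admitted" is already about 68% accurate; the model's ~71% is a modest improvement over that baseline. A quick check (not in the original notebook):

# Majority-class baseline: predict "not admitted" for everyone
baseline = np.zeros(len(y), dtype=int)
print accuracy_score(y, baseline)   # roughly 0.68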

Below is some code to walk through ROC curves, which summarize the confusion matrix (true and false positives and negatives) across thresholds. It'll be useful for working through the Titanic problem.

%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score
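
The cells below plot ROC curves rather than a confusion matrix itself, so here is a minimal confusion-matrix sketch for the 0.5-threshold predictions above (not part of the original notebook); the true/false positive and negative counts it reports are what the TPR and FPR are computed from:

from sklearn.metrics import confusion_matrix

# Rows are true classes (0 = rejected, 1 = admitted), columns are predictions
print confusion_matrix(y, predicted_classes)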

The first ROC curve below is traced out by varying the threshold: at a very high threshold the false positive rate (x-axis) is ~0 but so is the true positive rate (y-axis), and at a very low threshold both approach 1 (the top right-hand corner of the figure).

The second chart is built from the hard class predictions rather than probabilities, so there is nothing to threshold: it shows the single (FPR, TPR) point for the 0.5 cutoff, joined to (0, 0) and (1, 1).

The first chart is more useful when comparing models and deciding where the decision threshold should sit for the data. The second simplifies the first, in case the idea of thresholds is still confusing.
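
To see how the threshold moves a model along the first curve, the (FPR, TPR) point can be computed by hand for a few cutoffs; this is a small illustration, not part of the original notebook:

y_arr = np.asarray(y)
for t in [0.2, 0.5, 0.8]:
    pred_t = (np.asarray(predicted) > t).astype(int)
    tpr = ((pred_t == 1) & (y_arr == 1)).sum() / float((y_arr == 1).sum())
    fpr = ((pred_t == 1) & (y_arr == 0)).sum() / float((y_arr == 0).sum())
    print t, fpr, tpr   # higher thresholds push both rates toward 0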

fpr, tpr, thresholds = roc_curve(df['admit'], predicted)
plt.plot(fpr, tpr)
[<matplotlib.lines.Line2D at 0x7f42080b3fd0>]
[Figure: ROC curve built from the predicted probabilities]
fpr, tpr, thresholds = roc_curve(df['admit'], predicted_classes)
plt.plot(fpr, tpr)
[<matplotlib.lines.Line2D at 0x7f4203895d10>]
[Figure: single-point ROC built from the hard class predictions, joined to (0,0) and (1,1)]

Finally, you can use the roc_auc_score function to calculate the area under these curves (AUC).

roc_auc_score(df['admit'], predicted_classes)
0.58331170142193767
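
Note that this 0.58 is the area under the second, single-point curve, because hard class labels were passed in. Passing the predicted probabilities instead scores the first curve and will generally be noticeably higher (not shown in the original notebook):

# AUC of the probability-based ROC curve from the first chart
print roc_auc_score(df['admit'], predicted)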

Note: sklearn also has logistic regression:

from sklearn.linear_model import LogisticRegression
lm = LogisticRegression()
lm.fit(X, y)
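
One caveat (an aside, not from the original notebook): sklearn's LogisticRegression applies L2 regularization by default and fits its own intercept, so its coefficients will not exactly match the unregularized statsmodels fit above. Probabilities and coefficients come out the usual way:

probs = lm.predict_proba(X)[:, 1]   # P(admit = 1) for each applicant
print lm.intercept_, lm.coef_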

Titanic Problem

**Goals**

  1. Spend a few minutes determining which data would be most important to use in the prediction problem. You may need to create new features based on the data available. Consider using a feature selection aid in sklearn (see the sketch after this list). At a minimum, identify one or two strong features that would be useful to include in the model.

  2. Spend 1-2 minutes considering which metric makes the most sense to optimize. Accuracy? FPR or TPR? AUC? Given the business problem (understanding survival rate aboard the Titanic), why should you use this metric?

  3. Build a tuned logistic regression model. Be prepared to explain your design (including regularization), your metric, and your feature set for predicting survival, using whatever tools are helpful (such as a fit chart).
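
As one way into goal 1, here is a hedged feature-selection sketch; `features` and `target` are hypothetical stand-ins for whatever numeric feature matrix and Survived column you build, and none of this is from the original notebook:

from sklearn.feature_selection import SelectKBest, f_classif

# Score each candidate feature against survival and keep the top 3
selector = SelectKBest(score_func=f_classif, k=3)
selector.fit(features, target)
print zip(features.columns, selector.scores_)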

Teaching Notes

Note this is just one approach optimized for Area Under the Curve.

Age will need some work (it is missing for a significant portion of passengers), and some other cleanup simplifies the data problem a little.

titanic = pd.read_csv('../../assets/dataset/titanic.csv')
titanic.head()
titanic.set_index('PassengerId', inplace=True)

# One-hot encode passenger class and binarize sex
titanic = titanic.join(pd.get_dummies(titanic.Pclass))
titanic['is_male'] = titanic.Sex.apply(lambda x: 1 if x == 'male' else 0)
%matplotlib inline
titanic.groupby('Survived').Age.hist()
Survived
0    Axes(0.125,0.125;0.775x0.775)
1    Axes(0.125,0.125;0.775x0.775)
Name: Age, dtype: object
[Figure: overlaid Age histograms grouped by Survived]
titanic.tail()
# Fill missing ages with the mean age for the passenger's sex and class
titanic['Age'] = titanic.groupby(['Sex', 'Pclass']).Age.transform(lambda x: x.fillna(x.mean()))

# Simple binary flags for traveling with parents/children or siblings/spouses
titanic['had_parents'] = titanic.Parch.apply(lambda x: 1 if x > 0 else 0)
titanic['had_siblings'] = titanic.SibSp.apply(lambda x: 1 if x > 0 else 0)
from sklearn import grid_search, cross_validation
from sklearn.linear_model import LogisticRegression

# Keep classes 1 and 2 as dummies (class 3 is the baseline)
feature_set = titanic[['is_male', 1, 2, 'Fare', 'Age', 'had_parents', 'had_siblings']]

gs = grid_search.GridSearchCV(
    estimator=LogisticRegression(),
    # C from 10^-4 up to 10^5, with and without class weighting
    param_grid={'C': [10**-i for i in range(-5, 5)], 'class_weight': [None, 'balanced']},
    cv=cross_validation.KFold(n=len(titanic), n_folds=10),
    scoring='roc_auc'
)
gs.fit(feature_set, titanic.Survived)
gs.grid_scores_
#print gs.best_estimator_
[mean: 0.83905, std: 0.02899, params: {'C': 100000, 'class_weight': None},
 mean: 0.83905, std: 0.02934, params: {'C': 100000, 'class_weight': 'balanced'},
 mean: 0.83900, std: 0.02900, params: {'C': 10000, 'class_weight': None},
 mean: 0.83905, std: 0.02934, params: {'C': 10000, 'class_weight': 'balanced'},
 mean: 0.83900, std: 0.02900, params: {'C': 1000, 'class_weight': None},
 mean: 0.83905, std: 0.02934, params: {'C': 1000, 'class_weight': 'balanced'},
 mean: 0.83894, std: 0.02869, params: {'C': 100, 'class_weight': None},
 mean: 0.83910, std: 0.02936, params: {'C': 100, 'class_weight': 'balanced'},
 mean: 0.83909, std: 0.02895, params: {'C': 10, 'class_weight': None},
 mean: 0.83906, std: 0.02946, params: {'C': 10, 'class_weight': 'balanced'},
 mean: 0.84019, std: 0.02962, params: {'C': 1, 'class_weight': None},
 mean: 0.83890, std: 0.02989, params: {'C': 1, 'class_weight': 'balanced'},
 mean: 0.83737, std: 0.03051, params: {'C': 0.1, 'class_weight': None},
 mean: 0.83560, std: 0.03130, params: {'C': 0.1, 'class_weight': 'balanced'},
 mean: 0.80650, std: 0.04865, params: {'C': 0.01, 'class_weight': None},
 mean: 0.80081, std: 0.05313, params: {'C': 0.01, 'class_weight': 'balanced'},
 mean: 0.70905, std: 0.05468, params: {'C': 0.001, 'class_weight': None},
 mean: 0.73558, std: 0.06662, params: {'C': 0.001, 'class_weight': 'balanced'},
 mean: 0.67996, std: 0.05839, params: {'C': 0.0001, 'class_weight': None},
 mean: 0.70236, std: 0.07329, params: {'C': 0.0001, 'class_weight': 'balanced'}]
print gs.best_estimator_
LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1, penalty='l2', random_state=None, solver='liblinear', tol=0.0001, verbose=0, warm_start=False)
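
The grid search settled on C=1 with L2 regularization and no class weighting. A natural follow-up (not in the original notebook) is to reuse the refit best estimator for predictions and to line its coefficients up with the feature names:

best = gs.best_estimator_
survival_probs = best.predict_proba(feature_set)[:, 1]   # P(survived) per passenger
print zip(feature_set.columns, best.coef_[0])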