GitHub Repository: YStrano/DataScience_GA
Path: blob/master/projects/project_3/starter-code/Project 3 - Yair Strano.ipynb
Kernel: Python 3

Project 3

In this project, you will perform a logistic regression on admissions data.

%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm
import pylab as pl
import numpy as np
df = pd.read_csv("../assets/admissions.csv")
df = df.dropna()
df.head()
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 397 entries, 0 to 399
Data columns (total 4 columns):
admit       397 non-null int64
gre         397 non-null float64
gpa         397 non-null float64
prestige    397 non-null float64
dtypes: float64(3), int64(1)
memory usage: 15.5 KB

Part 1. Frequency Tables

1. Let's create a frequency table of our variables. Look at the documentation for pd.crosstab

comb = pd.crosstab(index=df['admit'], columns=df['prestige'], margins=True)
comb.columns = ['prestige 1', 'prestige 2', 'prestige 3', 'prestige 4', 'rowtotal']
comb.index = ['admit 0', 'admit 1', 'coltotal']
comb

Part 2. Return of dummy variables

2.1 Create class or dummy variables for prestige

prestige_dummies = pd.get_dummies(df['prestige'])
prestige_dummies.head()
prestige_dummies.rename(columns={1.0: 'Prestige1', 2.0: 'Prestige2',
                                 3.0: 'Prestige3', 4.0: 'Prestige4'}, inplace=True)
prestige_dummies.head()

2.2 When modeling our class variables, how many do we need?

Answer:

Three dummy variables are needed.

When presented with a categorical variable for which every row takes exactly one value, you should drop one of the dummy columns to avoid redundancy among your exogenous variables (e.g. for a coin flip, you need either a heads column or a tails column, not both). However, if you have a categorical variable for which a row could take multiple values or none, then you keep all of the columns.
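A shortcut worth knowing (a minimal sketch; drop_first is a standard pd.get_dummies parameter, and the Prestige prefix is just a label chosen here):

# get_dummies can drop the baseline level for you, leaving k-1 columns
# for a k-level categorical variable (prestige 1 becomes the baseline)
dummies_k_minus_1 = pd.get_dummies(df['prestige'], prefix='Prestige', drop_first=True)
dummies_k_minus_1.head()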

Part 3. Hand calculating odds ratios

Develop your intuition about expected outcomes by hand calculating odds ratios.

cols_to_keep = ['admit', 'gre', 'gpa']
handcalc = df[cols_to_keep].join(prestige_dummies)
handcalc.head()
# discovery calcs:
len(handcalc['admit'])
len(handcalc[handcalc['admit'] == 0])
handcalc['admit'].sum()
len(handcalc[handcalc['Prestige1'] == 0])
handcalc['Prestige1'].sum()
handcalc['Prestige1'].value_counts()
0    336
1     61
Name: Prestige1, dtype: int64
# The commented-out code below labels the columns in the wrong order: the column
# renamed 'prestige 4' actually holds the prestige 1 counts. Passing several
# column keys makes crosstab build a MultiIndex of the observed 0/1 combinations
# sorted in ascending order, so the combination with Prestige4 == 1 sorts first
# and the one with Prestige1 == 1 sorts last; the flat renaming reverses the labels.
#comb = pd.crosstab(index=handcalc['admit'],
#                   columns=[handcalc['Prestige1'], handcalc['Prestige2'],
#                            handcalc['Prestige3'], handcalc['Prestige4']],
#                   margins=True)
#comb.columns = ['prestige 1', 'prestige 2', 'prestige 3', 'prestige 4', 'rowtotal']
#comb.index = ['admit 0', 'admit 1', 'coltotal']
#comb
pd.crosstab(df['admit'], df['prestige'], rownames=['admit'])
comb = pd.crosstab(index=df['admit'], columns=df['prestige'])
comb.columns = ['prestige 1', 'prestige 2', 'prestige 3', 'prestige 4']
comb.index = ['admit 0', 'admit 1']
comb
prestige_1 = pd.crosstab(index=handcalc['Prestige1'], columns='count')
prestige_1
admit = pd.crosstab(index=handcalc['admit'], columns='count')
admit
# crosstab of 'Prestige1' vs. admission, indexed by 'admit'
# frequency table cutting prestige and whether or not someone was admitted
comb1 = pd.crosstab(index=handcalc['admit'], columns=handcalc['Prestige1'])
# crosstab sorts the column values in ascending order, so 0 ('not prestige 1')
# comes before 1 ('prestige 1')
comb1.columns = ['not prestige 1', 'prestige 1']
comb1.index = ['admit 0', 'admit 1']
comb1
# crosstab of 'Prestige1' vs. admission, indexed by 'Prestige1'
# frequency table cutting prestige and whether or not someone was admitted
comb2 = pd.crosstab(index=handcalc['Prestige1'], columns=handcalc['admit'])
comb2.columns = ['admit 0', 'admit 1']  # admit values sorted ascending: 0 then 1
comb2.index = ['not prestige 1', 'prestige 1']
comb2
comb3 = pd.crosstab(handcalc['admit'], handcalc['Prestige1'],
                    rownames=['admit'], colnames=['Prestige1'])
comb3
comb4 = pd.crosstab(handcalc['Prestige1'], handcalc['admit'],
                    rownames=['Prestige1'], colnames=['admit'])
comb4

3.1 Use the cross tab above to calculate the odds of being admitted to grad school if you attended a #1 ranked college

# 33 admitted prestige 1 students over 28 rejected prestige 1 students
odds1 = comb4.iloc[1][1] / (comb4.iloc[1].sum() - comb4.iloc[1][1])
odds1
1.1785714285714286

odds of admission for prestige 1 attendees: 33:28 ≈ 1.18

prob1 = 33 / (33 + 28)
prob1
0.5409836065573771

3.2 Now calculate the odds of admission if you did not attend a #1 ranked college

# 93 admitted non-prestige-1 students over 243 rejected non-prestige-1 students
odds_n1 = comb4.iloc[0][1] / (comb4.iloc[0].sum() - comb4.iloc[0][1])
odds_n1
0.38271604938271603
prob_n1 = 93 / (93 + 243)
prob_n1
0.2767857142857143

3.3 Calculate the odds ratio

odds of admission if you did not attend a prestige 1 school: 93:243 ≈ 0.38; odds ratio: (33/28) / (93/243) ≈ 3.08
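Computed directly from the two odds above:

odds_ratio1 = odds1 / odds_n1  # (33/28) / (93/243), about 3.08
odds_ratio1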

3.4 Write this finding in a sentence:

Answer:

We see that prestige plays a big role in admission to grad school. If you did not attend a prestige 1 school, your odds of admission (93:243 ≈ 0.38) are severely hindered: the odds ratio of about 3.08 means prestige 1 attendees have roughly three times the odds of admission. Non-prestige-1 attendees stand about a 28% chance of admission versus a 54% chance if you did attend a prestige 1 school.

3.5 Print the cross tab for prestige_4

comb5 = pd.crosstab(handcalc['Prestige4'], handcalc['admit'],
                    rownames=['Prestige4'], colnames=['admit'])
comb5

3.6 Calculate the Odds Ratio

odds of admission for prestige 4 attendees: 12:55 ≈ 0.22

odds4 = 12 / (67 - 12)  # 12 admitted out of 67 prestige 4 attendees
odds4
0.21818181818181817
prob4 = 12 / (12 + 55)
prob4
0.1791044776119403
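The odds ratio of prestige 4 against the prestige 1 group, computed from the odds above:

odds_ratio4 = odds4 / odds1  # (12/55) / (33/28), about 0.19
odds_ratio4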

3.7 Write this finding in a sentence

Answer:

We see that if you attended a prestige 4 school, your odds of admission (12:55 ≈ 0.22) are even more bleak; the odds ratio against prestige 1 is only about 0.19. Prestige 4 attendees stand an 18% chance of admission versus a 54% chance if you did attend a prestige 1 school.

Part 4. Analysis

# create a clean data frame for the regression
cols_to_keep = ['admit', 'gre', 'gpa']
# dropping one of the dummy columns (Prestige1 becomes the baseline)
data = df[cols_to_keep].join(prestige_dummies.iloc[:, 1:])
# in .iloc[ , ] the first section is rows and the second section is columns
data.head()

4.1 Create the X and Y variables

feature_cols = ['gre', 'gpa', 'Prestige2', 'Prestige3', 'Prestige4']
X = data[feature_cols]  # create X (passing a list of column names already returns a DataFrame, so no double [[]] needed)
y = data['admit']       # create y

4.2 Fit the model

  • Load sklearn's logistic regression

  • Create the regression object

  • Fit the model

# fit a logistic regression model and store the class predictions
from sklearn.linear_model import LogisticRegression  # load sklearn's logistic regression
logreg = LogisticRegression()  # create the regression object
logreg.fit(X, y)               # fit the model
pred = logreg.predict(X)       # predict on the training data
logreg.score(X, y)             # this returns the accuracy
0.7052896725440806

4.3 Print the coefficients

print(logreg.coef_)
print(logreg.intercept_)  # the fitted intercept that gets fed into the logistic function
print(df.admit.mean())
[[ 0.00178497  0.23229458 -0.60347467 -1.17214957 -1.37729795]]
[-1.81701706]
0.31738035264483627
admit_perc = 126 / (271 + 126)
admit_perc
0.31738035264483627
  • if you predict 0 for every applicant, you would be right 68% of the time

  • if you predict 1 for every applicant, you would be right 32% of the time

  • so the model's 70.5% accuracy is only a modest improvement over always guessing "not admitted"; it is not a very good model (a quick numeric check follows)
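One way to verify that baseline (a minimal check using the y defined in 4.1):

# accuracy of always predicting the majority class (not admitted)
baseline_acc = max(y.mean(), 1 - y.mean())
baseline_acc  # about 0.68, versus about 0.705 for the fitted model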

print(pred)  # these are the class predictions
[0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
from sklearn.metrics import precision_score as ps
from sklearn.metrics import recall_score as rc
from sklearn.metrics import confusion_matrix as cm
ps(y, pred)
# precision: true positives / (true positives + false positives)
# the group you identified correctly divided by the total group you identified
0.6216216216216216
rc(y, pred)
# recall: true positives / (true positives + false negatives)
# the group you identified correctly divided by the group total
0.18253968253968253
cm(y, pred) #this gives the confusion matrix, which is easier to read with labels
array([[257,  14],
       [103,  23]], dtype=int64)
23 / (14 + 23)  # precision
# 23 true positives, predicted admitted and actually admitted
# 14 false positives, predicted as admitted but not actually admitted
0.6216216216216216
23 / (23 + 103)  # recall
# 23 true positives, predicted admitted and actually admitted
# 103 false negatives, predicted as not admitted but actually admitted
0.18253968253968253

4.4 Calculate the odds ratios of the coefficients

hint 1: np.exp(X)

  • odds = probability / (1 - probability), i.e. one specific outcome / all the other outcomes

  • probability = odds / (1 + odds), i.e. one specific outcome / all outcomes

  • logistic regression squashes the linear combination through the sigmoid so the output lies between 0 and 1

  • np.exp() reverses the log transform, turning each coefficient into an odds ratio (a quick numeric check of the two formulas follows)
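As a concrete check, the prestige 1 numbers from Part 3 round-trip through these formulas (odds_check and prob_check are throwaway names used only here):

odds_check = 33 / 28                        # odds of admission, about 1.18
prob_check = odds_check / (1 + odds_check)  # about 0.54, matching prob1 above
prob_check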

logreg.coef_  # a 2-D array (a list of lists), which is why you need to index into it
array([[ 0.00178497, 0.23229458, -0.60347467, -1.17214957, -1.37729795]])
#logodds = logreg.intercept_ + logreg.coef_[0] * ???
#logodds
# this gives the odds ratios
params = logreg.coef_[0]
np.exp(params)
array([1.00178657, 1.26149128, 0.546908 , 0.3097005 , 0.25225925])
# convert log odds to odds
odds = np.exp(params)
odds
array([1.00178657, 1.26149128, 0.546908 , 0.3097005 , 0.25225925])
# convert odds to probability
prob = odds / (1 + odds)
prob
array([0.50044624, 0.5578139 , 0.35354915, 0.23646666, 0.20144331])

4.5 Interpret the OR of Prestige_2

Answer:

  • students who went to a prestige 2 school have about 0.55 times the odds of admission of prestige 1 students, i.e. roughly 45% lower odds (see the check below)

  • this is because prestige 1 is the dropped baseline dummy variable, so every prestige odds ratio is measured relative to it
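A quick check of that reading, using the odds array computed in 4.4 (index 2 is the Prestige2 entry of feature_cols):

1 - odds[2]  # about 0.45: prestige 2 students have roughly 45% lower odds than prestige 1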

4.6 Interpret the OR of GPA

Answer:

  • each one-unit increase in GPA multiplies the odds of admission by about 1.26 (i.e. 26% higher odds), holding the other variables constant

Bonus

Plot the probability of being admitted into graduate school, stratified by GPA and GRE score.
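A minimal sketch of one way to do this with the fitted sklearn model above; the grid bounds, the four GPA strata, and holding prestige at the baseline (all dummies 0, i.e. prestige 1) are choices made here, not part of the assignment:

# grid of GRE values, a few GPA strata, prestige held at the baseline (prestige 1)
gres = np.linspace(df['gre'].min(), df['gre'].max(), 50)
gpas = np.linspace(df['gpa'].min(), df['gpa'].max(), 4)

for gpa in gpas:
    grid = pd.DataFrame({'gre': gres, 'gpa': gpa,
                         'Prestige2': 0, 'Prestige3': 0, 'Prestige4': 0})
    probs = logreg.predict_proba(grid[feature_cols])[:, 1]  # P(admit = 1)
    plt.plot(gres, probs, label='gpa = %.2f' % gpa)

plt.xlabel('GRE score')
plt.ylabel('probability of admission')
plt.legend(loc='best')
plt.show()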