GitHub Repository: YStrano/DataScience_GA
Path: blob/master/projects/project_3/starter-code/Project 3 - Yair Strano.ipynb
Kernel: Python 3

Project 3

In this project, you will perform a logistic regression on admissions data.

%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm
import pylab as pl
import numpy as np
df = pd.read_csv("../assets/admissions.csv")
df = df.dropna()
df.head()
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 397 entries, 0 to 399
Data columns (total 4 columns):
admit       397 non-null int64
gre         397 non-null float64
gpa         397 non-null float64
prestige    397 non-null float64
dtypes: float64(3), int64(1)
memory usage: 15.5 KB

Part 1. Frequency Tables

1. Let's create a frequency table of our variables. Look at the documentation for pd.crosstab

comb = pd.crosstab(index=df['admit'], columns=df['prestige'], margins=True)
comb.columns = ['prestige 1', 'prestige 2', 'prestige 3', 'prestige 4', 'rowtotal']
comb.index = ['admit 0', 'admit 1', 'coltotal']
comb

Part 2. Return of dummy variables

2.1 Create class or dummy variables for prestige

prestige_dummies = pd.get_dummies(df['prestige'])
prestige_dummies.head()
prestige_dummies.rename(columns={1.0: 'Prestige1', 2.0: 'Prestige2',
                                 3.0: 'Prestige3', 4.0: 'Prestige4'}, inplace=True)
prestige_dummies.head()

2.2 When modeling our class variables, how many do we need?

Answer:

Three dummy variables are needed.

When presented with a categorical variable for which every row takes exactly one value, you should drop one of the dummy columns to avoid redundancy among your exogenous variables (e.g. for a coin flip, you need either a heads column or a tails column, not both). However, if you have a categorical variable for which a row could take multiple values or none, then you keep all of the columns.
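A shortcut worth knowing (a minimal sketch; drop_first is a standard pd.get_dummies parameter, and the Prestige prefix is just a label chosen here):

# get_dummies can drop the baseline level for you, leaving k-1 columns
# for a k-level categorical variable (prestige 1 becomes the baseline)
dummies_k_minus_1 = pd.get_dummies(df['prestige'], prefix='Prestige', drop_first=True)
dummies_k_minus_1.head()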

Part 3. Hand calculating odds ratios

Develop your intuition about expected outcomes by hand calculating odds ratios.

cols_to_keep = ['admit', 'gre', 'gpa']
handcalc = df[cols_to_keep].join(prestige_dummies)
handcalc.head()
# discovery calcs:
len(handcalc['admit'])
len(handcalc[handcalc['admit'] == 0])
handcalc['admit'].sum()
len(handcalc[handcalc['Prestige1'] == 0])
handcalc['Prestige1'].sum()
handcalc['Prestige1'].value_counts()
0    336
1     61
Name: Prestige1, dtype: int64
# The commented-out code below labels the columns in the wrong order: the column
# renamed 'prestige 4' actually holds the prestige 1 counts. Passing several
# column keys makes crosstab build a MultiIndex of the observed 0/1 combinations
# sorted in ascending order, so the combination with Prestige4 == 1 sorts first
# and the one with Prestige1 == 1 sorts last; the flat renaming reverses the labels.
#comb = pd.crosstab(index=handcalc['admit'],
#                   columns=[handcalc['Prestige1'], handcalc['Prestige2'],
#                            handcalc['Prestige3'], handcalc['Prestige4']],
#                   margins=True)
#comb.columns = ['prestige 1', 'prestige 2', 'prestige 3', 'prestige 4', 'rowtotal']
#comb.index = ['admit 0', 'admit 1', 'coltotal']
#comb
pd.crosstab(df['admit'], df['prestige'], rownames=['admit'])
comb = pd.crosstab(index=df['admit'], columns=df['prestige'])
comb.columns = ['prestige 1', 'prestige 2', 'prestige 3', 'prestige 4']
comb.index = ['admit 0', 'admit 1']
comb
prestige_1 = pd.crosstab(index=handcalc['Prestige1'], columns='count')
prestige_1
admit = pd.crosstab(index=handcalc['admit'], columns='count')
admit
# crosstab of 'Prestige1' vs. admission, indexed by 'admit'
# frequency table cutting prestige and whether or not someone was admitted
comb1 = pd.crosstab(index=handcalc['admit'], columns=handcalc['Prestige1'])
# crosstab sorts the column values in ascending order, so 0 ('not prestige 1')
# comes before 1 ('prestige 1')
comb1.columns = ['not prestige 1', 'prestige 1']
comb1.index = ['admit 0', 'admit 1']
comb1
# crosstab of 'Prestige1' vs. admission, indexed by 'Prestige1'
# frequency table cutting prestige and whether or not someone was admitted
comb2 = pd.crosstab(index=handcalc['Prestige1'], columns=handcalc['admit'])
comb2.columns = ['admit 0', 'admit 1']  # admit values sorted ascending: 0 then 1
comb2.index = ['not prestige 1', 'prestige 1']
comb2
comb3 = pd.crosstab(handcalc['admit'], handcalc['Prestige1'],
                    rownames=['admit'], colnames=['Prestige1'])
comb3
comb4 = pd.crosstab(handcalc['Prestige1'], handcalc['admit'],
                    rownames=['Prestige1'], colnames=['admit'])
comb4

3.1 Use the cross tab above to calculate the odds of being admitted to grad school if you attended a #1 ranked college

# 33 admitted prestige 1 students over 28 rejected prestige 1 students
odds1 = comb4.iloc[1][1] / (comb4.iloc[1].sum() - comb4.iloc[1][1])
odds1
1.1785714285714286

odds of admission for prestige 1 attendees: 33:28 ≈ 1.18

prob1 = 33 / (33 + 28)
prob1
0.5409836065573771

3.2 Now calculate the odds of admission if you did not attend a #1 ranked college

# 93 admitted non-prestige-1 students over 243 rejected non-prestige-1 students
odds_n1 = comb4.iloc[0][1] / (comb4.iloc[0].sum() - comb4.iloc[0][1])
odds_n1
0.38271604938271603
prob_n1 = 93 / (93 + 243)
prob_n1
0.2767857142857143

3.3 Calculate the odds ratio

odds of admission if you did not attend a prestige 1 school: 93:243 ≈ 0.38; odds ratio: (33/28) / (93/243) ≈ 3.08
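Computed directly from the two odds above:

odds_ratio1 = odds1 / odds_n1  # (33/28) / (93/243), about 3.08
odds_ratio1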

3.4 Write this finding in a sentence:

Answer:

We see that prestige plays a big role in admission to grad school. If you did not attend a prestige 1 school, your odds of admission (93:243 ≈ 0.38) are severely hindered: the odds ratio of about 3.08 means prestige 1 attendees have roughly three times the odds of admission. Non-prestige-1 attendees stand about a 28% chance of admission versus a 54% chance if you did attend a prestige 1 school.

3.5 Print the cross tab for prestige_4

comb5 = pd.crosstab(handcalc['Prestige4'], handcalc['admit'],
                    rownames=['Prestige4'], colnames=['admit'])
comb5

3.6 Calculate the Odds Ratio

odds of admission for prestige 4 attendees: 12:55 ≈ 0.22

odds4 = 12 / (67 - 12)  # 12 admitted out of 67 prestige 4 attendees
odds4
0.21818181818181817
prob4 = 12 / (12 + 55)
prob4
0.1791044776119403
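The odds ratio of prestige 4 against the prestige 1 group, computed from the odds above:

odds_ratio4 = odds4 / odds1  # (12/55) / (33/28), about 0.19
odds_ratio4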

3.7 Write this finding in a sentence

Answer:

We see that if you attended a prestige 4 school, your odds of admission (12:55 ≈ 0.22) are even more bleak; the odds ratio against prestige 1 is only about 0.19. Prestige 4 attendees stand an 18% chance of admission versus a 54% chance if you did attend a prestige 1 school.

Part 4. Analysis

# create a clean data frame for the regression
cols_to_keep = ['admit', 'gre', 'gpa']
# dropping one of the dummy columns (Prestige1 becomes the baseline)
data = df[cols_to_keep].join(prestige_dummies.iloc[:, 1:])
# in .iloc[ , ] the first section is rows and the second section is columns
data.head()

4.1 Create the X and Y variables

feature_cols = ['gre', 'gpa', 'Prestige2', 'Prestige3', 'Prestige4']
X = data[feature_cols]  # create X (passing a list of column names already returns a DataFrame, so no double [[]] needed)
y = data['admit']       # create y

4.2 Fit the model

  • Load sklearn's logistic regression

  • Create the regression object

  • Fit the model

# fit a logistic regression model and store the class predictions
from sklearn.linear_model import LogisticRegression  # load sklearn's logistic regression
logreg = LogisticRegression()  # create the regression object
logreg.fit(X, y)               # fit the model
pred = logreg.predict(X)       # predict on the training data
logreg.score(X, y)             # this returns the accuracy
0.7052896725440806

4.3 Print the coefficients

print(logreg.coef_)
print(logreg.intercept_)  # the fitted intercept that gets fed into the logistic function
print(df.admit.mean())
[[ 0.00178497  0.23229458 -0.60347467 -1.17214957 -1.37729795]]
[-1.81701706]
0.31738035264483627
admit_perc = 126 / (271 + 126)
admit_perc
0.31738035264483627
  • if you predict 0 for every applicant, you would be right 68% of the time

  • if you predict 1 for every applicant, you would be right 32% of the time

  • so the model's 70.5% accuracy is only a modest improvement over always guessing "not admitted"; it is not a very good model (a quick numeric check follows)
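One way to verify that baseline (a minimal check using the y defined in 4.1):

# accuracy of always predicting the majority class (not admitted)
baseline_acc = max(y.mean(), 1 - y.mean())
baseline_acc  # about 0.68, versus about 0.705 for the fitted model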

print(pred)  # these are the class predictions
[0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
from sklearn.metrics import precision_score as ps
from sklearn.metrics import recall_score as rc
from sklearn.metrics import confusion_matrix as cm
ps(y, pred)
# precision: true positives / (true positives + false positives)
# the group you identified correctly divided by the total group you identified
0.6216216216216216
rc(y, pred)
# recall: true positives / (true positives + false negatives)
# the group you identified correctly divided by the group total
0.18253968253968253
cm(y, pred) #this gives the confusion matrix, which is easier to read with labels
array([[257,  14],
       [103,  23]], dtype=int64)
23 / (14 + 23)  # precision
# 23 true positives, predicted admitted and actually admitted
# 14 false positives, predicted as admitted but not actually admitted
0.6216216216216216
23 / (23 + 103)  # recall
# 23 true positives, predicted admitted and actually admitted
# 103 false negatives, predicted as not admitted but actually admitted
0.18253968253968253

4.4 Calculate the odds ratios of the coefficients

hint 1: np.exp(X)

  • odds = probability / (1 - probability), i.e. one specific outcome / all the other outcomes

  • probability = odds / (1 + odds), i.e. one specific outcome / all outcomes

  • logistic regression squashes the linear combination through the sigmoid so the output lies between 0 and 1

  • np.exp() reverses the log transform, turning each coefficient into an odds ratio (a quick numeric check of the two formulas follows)
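As a concrete check, the prestige 1 numbers from Part 3 round-trip through these formulas (odds_check and prob_check are throwaway names used only here):

odds_check = 33 / 28                        # odds of admission, about 1.18
prob_check = odds_check / (1 + odds_check)  # about 0.54, matching prob1 above
prob_check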

logreg.coef_  # a 2-D array (a list of lists), which is why you need to index into it
array([[ 0.00178497, 0.23229458, -0.60347467, -1.17214957, -1.37729795]])
#logodds = logreg.intercept_ + logreg.coef_[0] * ???
#logodds
# this gives the odds ratios
params = logreg.coef_[0]
np.exp(params)
array([1.00178657, 1.26149128, 0.546908 , 0.3097005 , 0.25225925])
# convert log odds to odds
odds = np.exp(params)
odds
array([1.00178657, 1.26149128, 0.546908 , 0.3097005 , 0.25225925])
# convert odds to probability
prob = odds / (1 + odds)
prob
array([0.50044624, 0.5578139 , 0.35354915, 0.23646666, 0.20144331])

4.5 Interpret the OR of Prestige_2

Answer:

  • students who went to a prestige 2 school have about 0.55 times the odds of admission of prestige 1 students, i.e. roughly 45% lower odds (see the check below)

  • this is because prestige 1 is the dropped baseline dummy variable, so every prestige odds ratio is measured relative to it
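A quick check of that reading, using the odds array computed in 4.4 (index 2 is the Prestige2 entry of feature_cols):

1 - odds[2]  # about 0.45: prestige 2 students have roughly 45% lower odds than prestige 1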

4.6 Interpret the OR of GPA

Answer:

  • each one-unit increase in GPA multiplies the odds of admission by about 1.26 (i.e. 26% higher odds), holding the other variables constant

Bonus

Plot the probability of being admitted into graduate school, stratified by GPA and GRE score.
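A minimal sketch of one way to do this with the fitted sklearn model above; the grid bounds, the four GPA strata, and holding prestige at the baseline (all dummies 0, i.e. prestige 1) are choices made here, not part of the assignment:

# grid of GRE values, a few GPA strata, prestige held at the baseline (prestige 1)
gres = np.linspace(df['gre'].min(), df['gre'].max(), 50)
gpas = np.linspace(df['gpa'].min(), df['gpa'].max(), 4)

for gpa in gpas:
    grid = pd.DataFrame({'gre': gres, 'gpa': gpa,
                         'Prestige2': 0, 'Prestige3': 0, 'Prestige4': 0})
    probs = logreg.predict_proba(grid[feature_cols])[:, 1]  # P(admit = 1)
    plt.plot(gres, probs, label='gpa = %.2f' % gpa)

plt.xlabel('GRE score')
plt.ylabel('probability of admission')
plt.legend(loc='best')
plt.show()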