Project 3
In this project, you will perform a logistic regression on admissions data
We are missing less than 1% of the data. It could be useful to fill in the missing values, or to analyze why they are missing, but for now I'm going to drop the NAs.
Part 1. Frequency Tables
1. Let's create a frequency table of our variables. Look at the documentation for pd.crosstab
The below is a for loop that does the above in one step, but as written the join fails, so we need to concat instead.
The default for concat is axis=0, which appends rows; use axis=1 to concatenate columns side by side.
The below uses crosstab; I'm not 100% sure on some of the syntax technicalities.
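The loop-then-concat pattern described above can be sketched as follows. The miniature DataFrame here is a hypothetical stand-in for the admissions data (the real notebook loads it from a CSV), but the crosstab/concat mechanics are the same:

```python
import pandas as pd

# Hypothetical miniature of the admissions data; column names follow
# the project's conventions (admit = 0/1, prestige = 1..4).
df = pd.DataFrame({
    "admit":    [0, 1, 1, 0, 1, 0],
    "prestige": [1, 1, 2, 2, 3, 4],
})

# Build one frequency table per predictor inside a loop...
tables = [pd.crosstab(df["admit"], df[col]) for col in ["prestige"]]

# ...then place them side by side. concat defaults to axis=0
# (stacking rows); axis=1 concatenates the tables as columns.
freq = pd.concat(tables, axis=1)
print(freq)
```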
Part 2. Return of dummy variables
the below two cells are just notes for reference from a lesson on dummy variables...
# NOTE: the original cell raised NameError: name 'data' is not defined,
# because the snippet was pasted from a lesson that defined `data`
# earlier. Define it first (here, the cleaned admissions DataFrame):
# data = df

# create a Series of booleans in which roughly half are True
nums = np.random.rand(len(data))
mask_large = nums > 0.5
2.1 Create class or dummy variables for prestige
2.2 When modeling our class variables, how many do we need?
Answer:
3 dummies are needed.
When presented with a categorical variable for which every row must take exactly one value, you should drop one of the dummy columns to avoid redundancy among your exogenous variables (e.g. for a coin flip, you need either a heads column or a tails column, not both). If, however, a row can take multiple values or none, keep all the columns.
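The drop-one-dummy rule above is what pandas' `drop_first` flag implements. A minimal sketch, using a hypothetical prestige column in place of the real admissions data:

```python
import pandas as pd

# Hypothetical prestige column (1 = most prestigious, 4 = least);
# the real notebook builds this from the admissions CSV.
prestige = pd.Series([1, 2, 3, 4, 2], name="prestige")

# Four levels -> four dummy columns, one of which is redundant.
all_dummies = pd.get_dummies(prestige, prefix="prestige")
print(all_dummies.columns.tolist())

# Dropping the first level (prestige_1) makes it the reference
# category, leaving the 3 dummies the answer above calls for.
dummies = pd.get_dummies(prestige, prefix="prestige", drop_first=True)
print(dummies.columns.tolist())
```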
Part 3. Hand calculating odds ratios
Develop your intuition about expected outcomes by hand calculating odds ratios.
3.1 Use the cross tab above to calculate the odds of being admitted to grad school if you attended a #1 ranked college
odds: 33:28
3.2 Now calculate the odds of admission if you did not attend a #1 ranked college
3.3 Calculate the odds ratio
odds: 93:243; odds ratio: (33/28) / (93/243) ≈ 3.08
3.4 Write this finding in a sentence:
Answer:
We see that prestige plays a big role in admission to grad school. If you did not attend a prestige 1 school, your odds of admission (93:243) are severely hindered: non-prestige-1 attendance stands about a 28% chance of admission versus about a 54% chance if you did attend a prestige 1 school.
3.5 Print the cross tab for prestige_4
3.6 Calculate the Odds Ratio
odds: 12:55; relative to prestige 1 (odds 33:28), the odds ratio is (12/55) / (33/28) ≈ 0.19
3.7 Write this finding in a sentence
Answer:
We see that if you attended a prestige 4 school, your odds of admission (12:55) are even bleaker: prestige 4 attendance stands about an 18% chance of admission versus about a 54% chance if you attended a prestige 1 school.
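The hand calculations above follow directly from the crosstab counts. A quick check, using the counts quoted in the answers (33 admitted vs. 28 not for prestige 1; 93 vs. 243 for everyone else; 12 vs. 55 for prestige 4):

```python
# Counts taken from the cross tabs above.
odds_p1 = 33 / 28            # odds of admission, prestige 1
odds_rest = 93 / 243         # odds of admission, everyone else
odds_p4 = 12 / 55            # odds of admission, prestige 4

# Odds ratio: how many times larger the prestige-1 odds are.
odds_ratio = odds_p1 / odds_rest

# probability = odds / (1 + odds), equivalently admitted / total.
prob_p1 = 33 / (33 + 28)     # ~0.54
prob_rest = 93 / (93 + 243)  # ~0.28
prob_p4 = 12 / (12 + 55)     # ~0.18

print(round(odds_ratio, 2))
```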
Part 4. Analysis
If using statsmodels:
We will add a constant term for our logistic regression.
The statsmodels API requires that intercepts/constants be specified explicitly.
make sure to come back to this with Abe.
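Adding the constant can be done by hand with an extra column of ones (statsmodels' `sm.add_constant` does the same thing). A sketch with a hypothetical two-feature design matrix standing in for the admissions features:

```python
import pandas as pd

# Hypothetical design matrix; the real notebook uses gre, gpa, and
# the prestige dummies. statsmodels does not add an intercept
# automatically, so we prepend a column of ones ourselves.
X = pd.DataFrame({"gre": [380.0, 660.0, 800.0],
                  "gpa": [3.61, 3.67, 4.00]})
X.insert(0, "intercept", 1.0)
print(X.columns.tolist())
```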
4.1 Create the X and Y variables
4.2 Fit the model -
Load sklearn's logistic regression
Create the regression object
Fit the model
4.3 Print the coefficients
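The three steps above (load, create, fit) plus the coefficient printout can be sketched as follows. The data here is synthetic, standing in for the admissions design matrix (gre, gpa, and three prestige dummies):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the admissions design matrix:
# 5 columns for gre, gpa, prestige_2, prestige_3, prestige_4.
rng = np.random.RandomState(0)
X = rng.rand(40, 5)
y = rng.randint(0, 2, size=40)

# Create the regression object, then fit the model.
model = LogisticRegression()
model.fit(X, y)

# One coefficient per column, plus the fitted intercept.
print(model.coef_, model.intercept_)
```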
If you predict 0 for all the y preds, you would be right 68% of the time.
If you predict 1 for all the y preds, you would be right 32% of the time.
Neither constant guess is a good model; 68% accuracy is the baseline a real classifier has to beat.
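The baseline accuracies above come straight from the class balance. A sketch with a hypothetical label vector matching the stated 68/32 split:

```python
import numpy as np

# Hypothetical admit labels with the project's class balance
# (roughly 32% admitted, 68% rejected).
y = np.array([0] * 68 + [1] * 32)

# Predicting all 0s is right whenever the true label is 0,
# i.e. on a fraction (y == 0).mean() of the rows; all 1s is
# right on the complementary fraction.
acc_all_zero = (y == 0).mean()
acc_all_one = (y == 1).mean()
print(acc_all_zero, acc_all_one)
```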
# NOTE: the original cell raised NameError: name 'TN' is not defined,
# because the confusion-matrix entries were never assigned. They can
# be unpacked from sklearn's confusion matrix first:
# from sklearn.metrics import confusion_matrix
# TN, FP, FN, TP = confusion_matrix(y, y_pred).ravel()
[[TN, FP],
 [FN, TP]]
4.4 Calculate the odds ratios of the coefficients
hint 1: np.exp(X)
(from original project)
hint 2: conf['OR'] = params
odds = probability / (1 - probability), i.e. one specific outcome vs. all the other outcomes
probability = odds / (1 + odds), i.e. one specific outcome vs. all outcomes
Logistic regression squashes a linear model through the logistic function so that predictions fall between 0 and 1; equivalently, it fits a linear model to the log-odds, log(p / (1 - p)).
np.exp() undoes that log: since the model is linear in log(odds), exponentiating a coefficient gives the multiplicative change in the odds per one-unit change in that variable, i.e. an odds ratio.
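A minimal worked version of that conversion, using a hypothetical fitted coefficient on the log-odds scale (the value is chosen only to match the GPA odds ratio quoted below):

```python
import numpy as np

# Hypothetical fitted coefficient for gpa on the log-odds scale.
coef_gpa = 0.2323

# The model fits log(p / (1 - p)) = b0 + b1*gpa + ... ;
# exponentiating b1 converts it from a log-odds change to an
# odds ratio: each 1-unit increase in gpa multiplies the odds
# of admission by exp(b1).
odds_ratio = np.exp(coef_gpa)
print(round(float(odds_ratio), 2))
```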
4.5 Interpret the OR of Prestige_2
Answer:
People who went to a prestige 2 school have odds of admission equal to the printed OR times the odds for prestige 1 students, because prestige 1 is the base (dropped) dummy variable. Note that the OR is a multiplicative change in odds, not in probability, and an OR below 1 (e.g. 0.54) means the prestige 2 odds are lower than the reference group's, not "54% more likely."
Need to dig into this a bit more and understand the inner workings better.
4.6 Interpret the OR of GPA
Answer:
For a one unit increase in GPA, the odds of admission are multiplied by about 1.26 (an increase of roughly 26% in the odds, not in the probability).
Bonus
Plot the probability of being admitted into graduate school, stratified by GPA and GRE score.
(from original project - not part of current project)
Part 5: Predicted probabilities
As a way of evaluating our classifier, we're going to recreate the dataset with every logical combination of input values. This will allow us to see how the predicted probability of admission increases/decreases across different variables. First we're going to generate the combinations using a helper function called cartesian (above).
We're going to use np.linspace to create a range of values for "gre" and "gpa". This creates a range of linearly spaced values from a specified minimum to maximum value (in our case, just the min/max observed values).
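The combination step can be sketched as follows. The min/max values here are hypothetical placeholders for the observed ranges, and `itertools.product` stands in for the notebook's `cartesian` helper, which does the same job:

```python
import numpy as np
from itertools import product

# Hypothetical observed ranges; the notebook takes the actual
# min/max from the admissions data instead.
gres = np.linspace(220, 800, 10)   # 10 evenly spaced gre values
gpas = np.linspace(2.26, 4.0, 10)  # 10 evenly spaced gpa values
prestiges = [1, 2, 3, 4]

# Every logical combination of input values; the notebook's
# `cartesian` helper is equivalent to itertools.product here.
combos = np.array(list(product(gres, gpas, prestiges)))
print(combos.shape)  # one row per (gre, gpa, prestige) combination
```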