YStrano
GitHub Repository: YStrano/DataScience_GA
Path: blob/master/april_18/projects/unit-projects/project-3/starter-code/project3-starter.ipynb
Kernel: Python 2

Project 3

In this project, you will perform a logistic regression on the admissions data we've been working with in projects 1 and 2.

%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm
import pylab as pl
import numpy as np
df_raw = pd.read_csv("../assets/admissions.csv")
df = df_raw.dropna()
print df.head()
   admit  gre   gpa  prestige
0      0  380  3.61         3
1      1  660  3.67         3
2      1  800  4.00         1
3      1  640  3.19         4
4      0  520  2.93         4

Part 1. Frequency Tables

1. Let's create a frequency table of our variables

# frequency table for prestige and whether or not someone was admitted
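One way to build the frequency table is pd.crosstab. This is a minimal sketch on a hypothetical toy frame (the values below are placeholders, not the admissions data, though the column names match):

```python
import pandas as pd

# toy stand-in for the admissions data (hypothetical values)
df = pd.DataFrame({
    'admit':    [0, 1, 1, 1, 0, 0, 1, 0],
    'prestige': [3, 3, 1, 4, 4, 2, 1, 2],
})

# rows: prestige tier; columns: whether or not someone was admitted
freq = pd.crosstab(df['prestige'], df['admit'],
                   rownames=['prestige'], colnames=['admit'])
print(freq)
```

Each cell is a count of applicants in that prestige/admit combination, which is exactly the shape needed for the odds calculations in Part 3.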

Part 2. Return of dummy variables

2.1 Create class or dummy variables for prestige
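A sketch of the dummy-variable step using pd.get_dummies, again on hypothetical prestige values; the `prestige_` prefix matches the column names used later in this notebook:

```python
import pandas as pd

# toy prestige column (hypothetical values)
df = pd.DataFrame({'prestige': [3, 3, 1, 4, 4, 2]})

# one indicator column per prestige tier: prestige_1 .. prestige_4
dummy_ranks = pd.get_dummies(df['prestige'], prefix='prestige')
print(dummy_ranks.head())
```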

2.2 When modeling our class variables, how many do we need?

Answer:

Part 3. Hand calculating odds ratios

Develop your intuition about expected outcomes by hand calculating odds ratios.

cols_to_keep = ['admit', 'gre', 'gpa']
handCalc = df[cols_to_keep].join(dummy_ranks.loc[:, 'prestige_1':])
print handCalc.head()
# crosstab prestige 1 admission
# frequency table cutting prestige and whether or not someone was admitted

3.1 Use the cross tab above to calculate the odds of being admitted to grad school if you attended a #1 ranked college

3.2 Now calculate the odds of admission if you did not attend a #1 ranked college

3.3 Calculate the odds ratio
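The arithmetic behind 3.1–3.3 can be sketched as follows. The counts here are hypothetical placeholders; substitute the numbers from your own crosstab. The odds of an event are p/(1-p), which from a count table reduces to admitted/rejected within each group, and the odds ratio is one group's odds over the other's:

```python
# hypothetical counts from a prestige_1 crosstab (placeholders)
admitted_p1, rejected_p1 = 33.0, 28.0         # attended a #1-ranked college
admitted_other, rejected_other = 94.0, 245.0  # did not

# odds = admitted / rejected within each group
odds_p1 = admitted_p1 / rejected_p1
odds_other = admitted_other / rejected_other

# odds ratio: prestige-1 odds relative to everyone else
odds_ratio = odds_p1 / odds_other
print(round(odds_ratio, 2))
```

With these placeholder counts the ratio comes out near 3, i.e. roughly three-to-one odds in favor of the prestige-1 group.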

3.4 Write this finding in a sentence:

Answer:

3.5 Print the cross tab for prestige_4

3.6 Calculate the OR

3.7 Write this finding in a sentence

Answer:

Part 4. Analysis

# create a clean data frame for the regression
cols_to_keep = ['admit', 'gre', 'gpa']
data = df[cols_to_keep].join(dummy_ranks.loc[:, 'prestige_2':])
print data.head()

We're going to add a constant term for our Logistic Regression. The statsmodels function we're going to be using requires that intercepts/constants are specified explicitly.

# manually add the intercept
data['intercept'] = 1.0

4.1 Set the covariates to a variable called train_cols

4.2 Fit the model

4.3 Print the summary results

4.4 Calculate the odds ratios of the coefficients and their 95% confidence intervals

hint 1: np.exp(X)

hint 2:

conf['OR'] = params
conf.columns = ['2.5%', '97.5%', 'OR']

4.5 Interpret the OR of Prestige_2

Answer:

4.6 Interpret the OR of GPA

Answer:

Part 5: Predicted probabilities

As a way of evaluating our classifier, we're going to recreate the dataset with every logical combination of input values. This will allow us to see how the predicted probability of admission increases/decreases across different variables. First we're going to generate the combinations using a helper function called cartesian (defined below).

We're going to use np.linspace to create ranges of values for "gre" and "gpa". np.linspace creates linearly spaced values between a specified minimum and maximum; in our case, the min/max observed values.

def cartesian(arrays, out=None):
    """
    Generate a cartesian product of input arrays.

    Parameters
    ----------
    arrays : list of array-like
        1-D arrays to form the cartesian product of.
    out : ndarray
        Array to place the cartesian product in.

    Returns
    -------
    out : ndarray
        2-D array of shape (M, len(arrays)) containing cartesian products
        formed of input arrays.

    Examples
    --------
    >>> cartesian(([1, 2, 3], [4, 5], [6, 7]))
    array([[1, 4, 6],
           [1, 4, 7],
           [1, 5, 6],
           [1, 5, 7],
           [2, 4, 6],
           [2, 4, 7],
           [2, 5, 6],
           [2, 5, 7],
           [3, 4, 6],
           [3, 4, 7],
           [3, 5, 6],
           [3, 5, 7]])
    """
    arrays = [np.asarray(x) for x in arrays]
    dtype = arrays[0].dtype
    n = np.prod([x.size for x in arrays])
    if out is None:
        out = np.zeros([n, len(arrays)], dtype=dtype)
    m = n / arrays[0].size
    out[:, 0] = np.repeat(arrays[0], m)
    if arrays[1:]:
        cartesian(arrays[1:], out=out[0:m, 1:])
        for j in xrange(1, arrays[0].size):
            out[j*m:(j+1)*m, 1:] = out[0:m, 1:]
    return out
# instead of generating all possible values of GRE and GPA, we're going
# to use an evenly spaced range of 10 values from the min to the max
gres = np.linspace(data['gre'].min(), data['gre'].max(), 10)
print gres
# array([ 220.        ,  284.44444444,  348.88888889,  413.33333333,
#         477.77777778,  542.22222222,  606.66666667,  671.11111111,
#         735.55555556,  800.        ])
gpas = np.linspace(data['gpa'].min(), data['gpa'].max(), 10)
print gpas
# array([ 2.26      ,  2.45333333,  2.64666667,  2.84      ,  3.03333333,
#         3.22666667,  3.42      ,  3.61333333,  3.80666667,  4.        ])

# enumerate all possibilities
combos = pd.DataFrame(cartesian([gres, gpas, [1, 2, 3, 4], [1.]]))

5.1 Recreate the dummy variables

# recreate the dummy variables
# keep only what we need for making predictions
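A sketch of 5.1 on a hypothetical miniature `combos` frame (four rows instead of the full 400). As in Part 4, prestige_1 is the baseline, so only prestige_2 onward is kept for prediction:

```python
import pandas as pd

# toy 'combos' frame: gre, gpa, prestige, intercept (hypothetical values)
combos = pd.DataFrame({
    'gre':       [220.0, 220.0, 800.0, 800.0],
    'gpa':       [2.26, 4.0, 2.26, 4.0],
    'prestige':  [1, 2, 3, 4],
    'intercept': [1.0, 1.0, 1.0, 1.0],
})

# recreate the dummy variables
dummy_ranks = pd.get_dummies(combos['prestige'], prefix='prestige')

# keep only what we need for making predictions (prestige_1 is the baseline)
combos = combos[['gre', 'gpa', 'intercept']].join(dummy_ranks.loc[:, 'prestige_2':])
print(combos.head())
```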

5.2 Make predictions on the enumerated dataset

5.3 Interpret findings for the last 4 observations

Answer:

Bonus

Plot the probability of being admitted into graduate school, stratified by GPA and GRE score.
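One way to approach the bonus plot: one line per GPA level, predicted probability on the y-axis, GRE on the x-axis. The probabilities below come from a made-up logistic formula so the sketch is self-contained; in the notebook you would group the `combos` frame with its real `admit_pred` column instead:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # render without a display
import matplotlib.pyplot as plt

# hypothetical predicted probabilities over a gre x gpa grid
gres = np.linspace(220, 800, 10)
gpas = np.linspace(2.26, 4.0, 10)
grid = pd.DataFrame([(g, p) for g in gres for p in gpas],
                    columns=['gre', 'gpa'])
grid['admit_pred'] = 1.0 / (1.0 + np.exp(-(-6.0 + 0.005 * grid['gre']
                                           + 0.9 * grid['gpa'])))

# one curve per gpa level, stratifying probability by GRE
fig, ax = plt.subplots()
for gpa, sub in grid.groupby('gpa'):
    ax.plot(sub['gre'], sub['admit_pred'], label='gpa %.2f' % gpa)
ax.set_xlabel('GRE')
ax.set_ylabel('P(admit)')
ax.legend(fontsize=6)
fig.savefig('admit_prob.png')
```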